Introduction to substitution matrices: PAM and BLOSUM matrices

Substitution matrices are used to score alignments between two protein or nucleotide sequences.
They provide a quantitative basis for determining matched and mismatched alignments in sequence comparisons.
Substitution matrices assign a score for aligning any possible pair of residues/nucleotides.
Higher scores are given to conserved residue/nucleotide pairs and lower scores for non-conserved pairs.
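In simple match/mismatch schemes, matches score positive and mismatches negative; in protein matrices, substitutions between chemically similar residues can also score positive.
Gaps in alignments are penalized as well, but gap penalties are set separately and are not part of the substitution matrix.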
Different matrices have been developed for protein vs DNA/RNA alignments.

Scoring with a matrix

An optimal alignment (global, local, end-gap free, etc.) can be found using dynamic programming.
Scoring matrices can be used for any kind of sequence (DNA or amino acid).
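As a minimal sketch of how dynamic programming combines a substitution score with a separate gap penalty, the following computes a global (Needleman-Wunsch) alignment score. The function names, the +1/-1 toy DNA scoring and the linear gap penalty of -2 are illustrative assumptions, not values taken from any published matrix.

```python
def global_alignment_score(a, b, score, gap=-2):
    """Best global alignment score of a and b (Needleman-Wunsch).

    score(x, y) plays the role of the substitution matrix lookup;
    gap is a linear per-position gap penalty (an assumed value).
    """
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        dp[i][0] = i * gap
    for j in range(1, cols):
        dp[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # align two residues
                dp[i - 1][j] + gap,                            # gap in b
                dp[i][j - 1] + gap,                            # gap in a
            )
    return dp[-1][-1]


def toy_dna_score(x, y):
    # Hypothetical match/mismatch scheme: +1 for a match, -1 for a mismatch.
    return 1 if x == y else -1
```

Swapping toy_dna_score for a lookup into a PAM or BLOSUM table would turn the same routine into a protein aligner; only the scoring function changes.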

PAM, Gonnet, JTT, and DNA matrices are based on an explicit evolutionary model. BLOSUM matrices are based on an implicit model.

PAM (Point Accepted Mutation, also expanded as Percent Accepted Mutation): unit of evolutionary change for protein sequences [Dayhoff78].
A PAM unit is the amount of evolution that will on average change 1% of the amino acids within a protein sequence:
- Based on mutational model of evolution
- Assume changes occur independently
- Changes scored in sequences at least 85% identical
- Changes are a prediction of the first changes that occur as proteins diverge from a common ancestor
- Matrices for more distantly related protein sequences are extrapolated from these short-term changes
All amino acid positions in related sequences were scored.
For nucleotide sequence searching, a simpler approach is used: a PAM40 matrix is converted into match/mismatch values that take into account that a purine is more likely to be replaced by a purine, and a pyrimidine by a pyrimidine.
The PAM1 matrix represents 1 accepted mutation per 100 residues; PAM250 represents 250 accepted mutations per 100 residues. Because the same site can mutate more than once, sequences separated by 250 PAM units still share about 20% identity.
Higher PAM numbers allow for more substitutions and are useful for detecting distant homologies.
PAM matrices are based on global alignments of closely related protein sequences; low-numbered matrices suit closely related sequences.

PAM	0	30	80	110	200	250
% identity	100	75	50	40	25	20

PAM matrices are based on a simple evolutionary model:

Only substitutions are allowed (no insertions or deletions)
Sites evolve independently
Evolution at each site occurs according to a simple (“first-order”) Markov process: the next mutation depends only on the current state and is independent of previous mutations
Mutation probabilities are given by a substitution matrix M = [m_XY], where m_XY = Prob(X -> Y mutation) = Prob(Y|X)
Mutation probabilities depend on evolutionary distance
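Under this Markov model, the matrix for n PAM units is the one-unit matrix multiplied by itself n times. The sketch below illustrates that extrapolation on a made-up three-letter alphabet; the toy_pam1 values are invented for illustration (each row sums to 1 and about 99% of residues stay unchanged per step) and are not Dayhoff's numbers.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Return m multiplied by itself n times (the identity matrix for n = 0)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical one-step matrix: toy_pam1[i][j] = Prob(letter i -> letter j).
toy_pam1 = [
    [0.990, 0.006, 0.004],
    [0.006, 0.990, 0.004],
    [0.004, 0.004, 0.992],
]

# 250 steps of the same process, analogous to going from PAM1 to PAM250.
toy_pam250 = mat_power(toy_pam1, 250)
```

The diagonal of toy_pam250 is much smaller than that of toy_pam1 but does not drop to zero, which mirrors why sequences at large PAM distances still retain some identity.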

Find amino acid substitution statistics by comparing evolutionarily close sequences that are highly similar:

This is easier than for distant sequences, since only a few insertions and deletions have taken place.

Computing PAM 1 (Dayhoff’s approach):

Start with highly similar aligned sequences, with known evolutionary trees (71 trees total).
Collect substitution statistics (1572 exchanges total).
Let m_ij = observed frequency (= estimated probability) of amino acid A_i mutating into amino acid A_j during one PAM unit. Result: a 20 × 20 real matrix. With this indexing each row adds up to 1; Dayhoff's original layout is the transpose, so there the columns add up to 1.
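The normalization step can be sketched roughly as follows. This is a simplified version of the procedure, run on a made-up three-letter alphabet: the exchange counts, the occurrence counts and the function name pam1_from_counts are all hypothetical. Off-diagonal entries are proportional to the observed exchanges per occurrence, and a single scale factor is chosen so that, on average, 1% of residues change.

```python
def pam1_from_counts(counts, occurrences, target_change=0.01):
    """Build a row-stochastic one-PAM matrix from exchange counts.

    counts[i][j]   -- symmetric number of observed i <-> j exchanges (toy data)
    occurrences[i] -- how often letter i occurs in the aligned sequences
    Returns (matrix, freqs) with matrix[i][j] = Prob(i -> j).
    """
    n = len(counts)
    total = sum(occurrences)
    freqs = [o / total for o in occurrences]
    # Relative mutability: exchanges involving letter i per occurrence of i.
    mutability = [
        sum(counts[i][j] for j in range(n) if j != i) / occurrences[i]
        for i in range(n)
    ]
    # Scale so that the frequency-weighted chance of change equals target_change.
    scale = target_change / sum(f * m for f, m in zip(freqs, mutability))
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                matrix[i][j] = scale * counts[i][j] / occurrences[i]
        matrix[i][i] = 1.0 - scale * mutability[i]
    return matrix, freqs


# Invented example data for a three-letter alphabet.
toy_counts = [[0, 30, 10], [30, 0, 20], [10, 20, 0]]
toy_occurrences = [400, 350, 250]
toy_matrix, toy_freqs = pam1_from_counts(toy_counts, toy_occurrences)
```

Here the matrix is indexed so that rows sum to 1, matching the definition m_ij = Prob(A_i -> A_j) used above.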

BLOSUM: Blocks Substitution Matrix. Scores for each position are obtained from frequencies of substitutions in blocks of local alignments of protein sequences [Henikoff & Henikoff92].
For example, BLOSUM62 is derived from sequence alignments with no more than 62% identity.

BLOck SUbstitution Matrix
Based on comparisons of blocks of sequences derived from the Blocks database
The Blocks database contains multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins (local alignment versus global alignment)
BLOSUM matrices are derived from blocks in which sequences are clustered at the percent identity given by the BLOSUM matrix number

To avoid bias in favor of a certain protein, first eliminate sequences that are more than r% identical. The elimination is done by either:

removing sequences from the block, or
finding a cluster of similar sequences and replacing it by a new sequence that represents the cluster.
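The clustering option can be sketched as follows, assuming ungapped sequences of equal length within a block. The function names and the toy block are hypothetical, and treating the threshold as inclusive (identity >= r) is an assumption of this sketch.

```python
def percent_identity(a, b):
    """Percentage of positions at which two equal-length sequences agree."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def cluster_block(block, r):
    """Group sequences so that each one is at least r% identical
    to some other member of its cluster (single-linkage clustering)."""
    clusters = []
    for seq in block:
        merged = [seq]
        keep = []
        for cluster in clusters:
            if any(percent_identity(seq, member) >= r for member in cluster):
                merged.extend(cluster)  # seq links this cluster to the new one
            else:
                keep.append(cluster)
        clusters = keep + [merged]
    return clusters


# Invented block of four ungapped segments.
toy_block = ["ACDEF", "ACDEG", "ACDQG", "WYWYW"]
toy_clusters = cluster_block(toy_block, 70)
```

With a threshold of 70, the first three toy sequences fall into one cluster and the last stays on its own; each cluster would then be counted as a single sequence when tallying substitutions.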

BLOSUM r is the matrix built from blocks with no more than r% similarity

E.g., BLOSUM62 is the matrix built using sequences with no more than 62% similarity.
BLOSUM 62 is the default matrix for protein BLAST

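Feature	PAM Matrices	BLOSUM Matrices
Methodology	Derived from an explicit evolutionary model: substitutions counted in global alignments of closely related proteins, then extrapolated to larger distances	Derived directly from substitutions observed in ungapped local alignments (blocks) of conserved regions, without extrapolation
Scoring	Log-odds scores	Log-odds scores
Applicability	Low numbers (e.g., PAM30) suit closely related sequences; high numbers (e.g., PAM250) suit distantly related sequences	High numbers (e.g., BLOSUM80) suit closely related sequences; low numbers (e.g., BLOSUM45) suit distantly related sequences
Matrix Series	PAM1 to PAM500; the number is the evolutionary distance in PAM units	BLOSUM30 to BLOSUM90; the number is the percent identity threshold used for clustering
Updating	Static; built from the data set available to Dayhoff in 1978	Static; built from the Blocks database available in 1992
Residue Frequencies	Background frequencies estimated from the sequences in the Dayhoff data set	Background frequencies estimated from the residues observed in the blocks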

Differences between PAM and BLOSUM matrices

Equivalent PAM and BLOSUM matrices

The Gonnet matrix was developed by Gonnet, Cohen and Benner (1992).
It uses a different method to measure differences among amino acids, based on exhaustive pairwise alignments of the protein databases as they existed at that time.
They used classical distance measures to estimate an alignment of the proteins. They then used this data to estimate a new distance matrix.

Substitution matrices are constructed based on log-odds ratios of observed substitutions.
The odds ratio evaluates the probability of two amino acids/nucleotides substituting for one another versus random chance.
Log-odds scores >0 indicate conserved pairs, while scores <0 indicate unlikely substitutions.
Higher log-odds ratios translate to higher scores for aligned residues in the matrix.
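The log-odds calculation can be sketched on a toy block as follows. The sketch counts residue pairs column by column, estimates background frequencies from those pairs, and scores each pair as scale * log2(observed / expected). The two-letter block, the function name and the scale factor of 2 (half-bit units) are assumptions for illustration, and the sequence weighting applied to clusters is left out.

```python
import math
from collections import Counter
from itertools import combinations


def log_odds_matrix(block, scale=2.0):
    """Rounded log-odds scores for every residue pair observed in a block."""
    # Count unordered residue pairs within each alignment column.
    pair_counts = Counter()
    for column in zip(*block):
        for x, y in combinations(column, 2):
            pair_counts[tuple(sorted((x, y)))] += 1
    total = sum(pair_counts.values())
    q = {pair: count / total for pair, count in pair_counts.items()}

    # Background frequency of each residue, derived from the pair frequencies.
    p = Counter()
    for (x, y), freq in q.items():
        if x == y:
            p[x] += freq
        else:
            p[x] += freq / 2
            p[y] += freq / 2

    scores = {}
    for (x, y), freq in q.items():
        # Chance expectation: p_x * p_x for identical pairs, 2 * p_x * p_y otherwise.
        expected = p[x] * p[y] if x == y else 2 * p[x] * p[y]
        scores[(x, y)] = round(scale * math.log2(freq / expected))
    return scores


# Invented block of four aligned segments over a two-letter alphabet.
toy_scores = log_odds_matrix(["ACA", "ACA", "ACC", "CCC"])
```

Pairs that never occur in the block are simply absent from the result, since their log-odds value would be undefined; a real matrix needs enough data to cover every pair.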