- Substitution matrices are used to score alignments between two protein or nucleotide sequences.
- They provide a quantitative basis for determining matched and mismatched alignments in sequence comparisons.
- Substitution matrices assign a score for aligning any possible pair of residues/nucleotides.
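- Higher scores are given to conserved residue/nucleotide pairs and lower scores to non-conserved pairs.
- Identical pairs receive positive scores and unlikely substitutions receive negative scores; in protein matrices, chemically similar substitutions (e.g., Ile/Leu) can also score positively.
- Gaps in alignments also receive negative scores, but gap penalties are set separately from the substitution matrix.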
- Different matrices have been developed for protein vs DNA/RNA alignments.
Scoring with a matrix
- Optimum alignment (global, local, end-gap free, etc.) can be found using dynamic programming.
- Scoring matrices can be used for any kind of sequence (DNA or amino acid).
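As a rough illustration of how a matrix feeds into dynamic programming, the sketch below computes a global (Needleman-Wunsch) alignment score. The function names, the toy DNA scores (+1 match, -1 transition, -2 transversion) and the linear gap penalty of -2 are illustrative assumptions, not values taken from any published matrix.

```python
def dna_score(x, y):
    """Toy DNA scoring function (illustrative values, not a published matrix)."""
    if x == y:
        return 1
    purines = {"A", "G"}
    # Purine<->purine or pyrimidine<->pyrimidine is a transition.
    is_transition = (x in purines) == (y in purines)
    return -1 if is_transition else -2


def global_alignment_score(a, b, score, gap=-2):
    """Best global alignment score of a and b with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        dp[i][0] = i * gap
    for j in range(1, cols):
        dp[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # align the pair
                dp[i - 1][j] + gap,                            # gap in b
                dp[i][j - 1] + gap,                            # gap in a
            )
    return dp[-1][-1]


print(global_alignment_score("GATTACA", "GACTACA", dna_score))
```

Note that the gap penalty is a separate parameter: the scoring function only covers residue pairs.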
Types of Matrices
- PAM
- BLOSUM
- Gonnet
- JTT
- DNA matrices
PAM, Gonnet, JTT, and DNA matrices are based on an explicit evolutionary model. BLOSUM matrices are based on an implicit model.
PAM Matrices
- Point Accepted Mutation (also expanded as Percent Accepted Mutation): unit of evolutionary change for protein sequences [Dayhoff78].
- A PAM unit is the amount of evolution that will, on average, change 1% of the amino acids within a protein sequence.
- Based on mutational model of evolution
- Assume changes occur independently
- Changes were scored in sequences at least 85% identical
- Changes are a prediction of the first changes that occur as proteins diverge from a common ancestor
- Matrices for more distantly related protein sequences are extrapolated from these short-term changes
- All amino acid positions in related sequences were scored.
- For nucleotide sequence searching, a simpler approach is used: a PAM matrix (e.g., PAM40) is converted into match/mismatch values, which can take into account that a purine is more likely to be replaced by a purine and a pyrimidine by a pyrimidine.
- The PAM1 matrix represents 1 accepted mutation per 100 residues; PAM250 represents 250 accepted mutations per 100 residues. Because a site can mutate more than once, sequences at PAM250 still share about 20% identity.
- Higher PAM numbers allow for more substitutions and are useful for detecting distant homologies.
- PAM matrices are derived from global alignments of closely related protein sequences; low-numbered matrices suit closely related sequences.
PAM | 0 | 30 | 80 | 110 | 200 | 250 |
---|---|---|---|---|---|---|
% identity | 100 | 75 | 50 | 40 | 25 | 20 |
PAM matrices Assumptions
PAM matrices are based on a simple evolutionary model:
- Only mutations are allowed
- Sites evolve independently
- Evolution at each site occurs according to a simple (“first-order”) Markov process. Next mutation depends only on current state and is independent of previous mutations
- Mutation probabilities are given by a substitution matrix M = [m_XY], where m_XY = Prob(X -> Y mutation) = Prob(Y|X)
- Mutation probabilities depend on evolutionary distance
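Under the first-order Markov assumption, the matrix for n PAM units is the one-step matrix raised to the n-th power. The sketch below shows this on a made-up 3-letter alphabet; `M1`, `freqs` and the helper names are hypothetical, and the numbers are chosen only so that each residue has a 1% chance of changing per step.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Raise a square matrix to a non-negative integer power by squaring."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = m
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


def expected_identity(m, f):
    """Probability that a site shows the same residue it started with."""
    return sum(fi * m[i][i] for i, fi in enumerate(f))


# Toy one-step matrix (assumed values): rows sum to 1, 1% change per step.
M1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]
freqs = [1 / 3, 1 / 3, 1 / 3]  # assumed uniform background frequencies

for n in (1, 30, 250):
    print(n, round(expected_identity(mat_pow(M1, n), freqs), 3))
```

The expected identity falls more slowly than 1% per step because repeated and reverse changes at the same site are counted as steps but do not always produce a new difference.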
Generating PAM matrices
Find amino acids substitution statistics by comparing evolutionarily close sequences that are highly similar:
- Easier than for distant sequences, since only a few insertions and deletions have taken place.
Computing PAM 1 (Dayhoff’s approach):
- Start with highly similar aligned sequences, with known evolutionary trees (71 trees total).
- Collect substitution statistics (1572 exchanges total).
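- Let m_ij = observed frequency (= estimated probability) of amino acid A_i mutating into amino acid A_j during one PAM unit.
- Result: a 20 × 20 real matrix in which each row adds up to 1 (with m_ij defined as above, row i lists the probabilities of every possible outcome for A_i).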
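The normalization step can be sketched as follows. This is a simplified version of the idea, not Dayhoff's exact procedure: off-diagonal probabilities are taken proportional to the exchange count divided by the frequency of the starting residue, and a single scale factor is chosen so that the expected change is 1%. The function name and the toy counts and frequencies are invented for illustration.

```python
def pam1_from_counts(counts, freqs, target_change=0.01):
    """Build a row-stochastic one-PAM matrix from symmetric exchange counts.

    counts[i][j] is the number of observed i<->j exchanges (symmetric),
    freqs[i] is the background frequency of residue i.
    """
    n = len(freqs)
    total_off_diag = sum(counts[i][j]
                         for i in range(n) for j in range(n) if i != j)
    # One global scale so that sum_i freqs[i] * P(i changes) == target_change.
    scale = target_change / total_off_diag
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                m[i][j] = scale * counts[i][j] / freqs[i]
        m[i][i] = 1.0 - sum(m[i])  # the rest of the row is "no change"
    return m


# Hypothetical 3-residue data set (not Dayhoff's 1572 exchanges).
toy_counts = [
    [0, 30, 10],
    [30, 0, 20],
    [10, 20, 0],
]
toy_freqs = [0.5, 0.3, 0.2]

pam1 = pam1_from_counts(toy_counts, toy_freqs)
for row in pam1:
    print([round(x, 4) for x in row])
```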
BLOSUM matrices
- Blocks Substitution Matrix. Scores for each position are obtained from the frequencies of substitutions in blocks of local alignments of protein sequences [Henikoff & Henikoff92].
- For example, BLOSUM62 is derived from sequence alignments with no more than 62% identity.
BLOSUM Scoring Matrices
- BLOcks SUbstitution Matrix
- Based on comparisons of blocks of sequences derived from the Blocks database
- The Blocks database contains multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins (local alignment versus global alignment)
- The number in a BLOSUM matrix name is the percent identity threshold applied to the sequences in the blocks used to derive it
Constructing BLOSUM r
To avoid bias in favor of a certain protein, first eliminate sequences that are more than r% identical. The elimination is done by either:
- removing sequences from the block, or
- finding a cluster of similar sequences and replacing it by a new sequence that represents the cluster.
BLOSUM r is the matrix built from blocks with no more than r% identity.
- E.g., BLOSUM62 is the matrix built using sequences with no more than 62% identity.
- BLOSUM62 is the default matrix for protein BLAST.
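Once the blocks have been filtered, scores come from pair counts within each column. The sketch below follows the usual log-odds recipe (observed pair frequency over the frequency expected from background composition, in half-bit units) on a tiny invented block. It skips the r% clustering step entirely, and the function name and toy sequences are assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def blosum_like_scores(block):
    """Half-bit log-odds scores for residue pairs observed in an ungapped block."""
    pair_counts = Counter()
    for column in zip(*block):
        # Count every unordered pair of residues within the column.
        for x, y in combinations(column, 2):
            pair_counts[tuple(sorted((x, y)))] += 1
    total = sum(pair_counts.values())
    q = {pair: c / total for pair, c in pair_counts.items()}

    # Background frequency of each residue, derived from the pair frequencies.
    p = Counter()
    for (x, y), qxy in q.items():
        if x == y:
            p[x] += qxy
        else:
            p[x] += qxy / 2
            p[y] += qxy / 2

    scores = {}
    for (x, y), qxy in q.items():
        expected = p[x] ** 2 if x == y else 2 * p[x] * p[y]
        scores[(x, y)] = round(2 * math.log2(qxy / expected))
    return scores


toy_block = ["ASAC", "ASAC", "ASAC", "ASSC"]  # hypothetical 4-sequence block
print(blosum_like_scores(toy_block))
```

Pairs that never occur in the block get no score here; a real matrix is built from enough blocks that every pair is observed.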
Differences between PAM and BLOSUM matrices
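Feature | PAM Matrices | BLOSUM Matrices |
---|---|---|
Methodology | Derived from an explicit evolutionary model fitted to global alignments of closely related proteins; higher-numbered matrices are extrapolated from PAM1 | Derived directly from substitutions observed in ungapped blocks of local alignments; no extrapolation |
Scoring | Log-odds scores | Log-odds scores |
Applicability | Low numbers (e.g., PAM30) suit closely related sequences; high numbers (e.g., PAM250) suit distantly related sequences | High numbers (e.g., BLOSUM90) suit closely related sequences; low numbers (e.g., BLOSUM45) suit distantly related sequences |
Numbering | Number is the evolutionary distance in PAM units (e.g., PAM30 to PAM500) | Number is the percent identity threshold used when building the matrix (e.g., BLOSUM30 to BLOSUM90) |
Source data | Aligned families of closely related proteins with known evolutionary trees [Dayhoff78] | Blocks database of conserved protein regions [Henikoff & Henikoff92] |
Residue Frequencies | Background frequencies estimated from the aligned protein families | Based on observed frequencies within the aligned blocks |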
Roughly equivalent PAM and BLOSUM matrices
PAM Matrix | BLOSUM Matrix |
---|---|
PAM100 | BLOSUM90 |
PAM120 | BLOSUM80 |
PAM160 | BLOSUM60 |
PAM200 | BLOSUM52 |
PAM250 | BLOSUM45 |
Gonnet Matrix
- Developed by Gonnet, Cohen and Benner (1992)
- Used a different method to measure differences among amino acids: exhaustive pairwise alignments of the protein databases as they existed at that time.
- They used classical distance measures to estimate an alignment of the proteins, then used these alignments to estimate a new distance matrix.
The concept of log-odds ratio
- Substitution matrices are constructed based on log-odds ratios of observed substitutions.
- The odds ratio evaluates the probability of two amino acids/nucleotides substituting for one another versus random chance.
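- Log-odds scores > 0 indicate pairs observed more often than expected by chance (conserved or favorable substitutions), while scores < 0 indicate pairs observed less often than expected (unlikely substitutions).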
- Higher log-odds ratios translate to higher scores for aligned residues in the matrix.
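The points above can be written as a single formula. The symbols are assumptions introduced here, not notation from these notes: q_ab is the frequency with which residues a and b are observed aligned (as an ordered pair), p_a and p_b are their background frequencies, and λ is a scaling constant that sets the units. The second line is a worked example with made-up numbers, scored in half-bit units.

```latex
s(a,b) = \frac{1}{\lambda}\,\log\frac{q_{ab}}{p_a\,p_b}

% Worked example (illustrative numbers): q_{ab} = 0.01, p_a = p_b = 0.05
\frac{q_{ab}}{p_a\,p_b} = \frac{0.01}{0.05 \times 0.05} = 4,
\qquad s(a,b) = 2\log_2 4 = 4
```

A pair seen four times more often than chance predicts therefore scores +4 in half-bits; a pair seen exactly as often as chance predicts scores 0.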