Amino Acid Substitution Matrices Overview

Computational Genomics and Molecular Biology, Fall 2013 1

Amino Acid Substitution Matrices Tuesday, October 1st Dannie Durand

Overview

In the last lecture, we introduced a Markov model of substitution in nucleotide sequences and used that model to estimate the number of substitutions, taking multiple substitutions into account. In this lecture, we focus on Markov models of amino acid replacement and their use in deriving amino acid substitution matrices.

An amino acid substitution matrix assigns a score to a pair of aligned amino acids, j and k. A good substitution matrix should have the following properties:

• Biophysical properties of residues: Amino acids differ in size and charge. Some are acidic, some are basic, some have aromatic side chains. Generally, replacement of an amino acid with another amino acid with similar properties is less likely to break the protein or cause a dramatic change in function than replacement with an amino acid with different properties. A substitution matrix should reflect this.

• Evolutionary divergence: The observation of identical or functionally conservative amino acids at the same site is more surprising in highly diverged protein families than in families characterized by little sequence divergence. The best results are obtained using a substitution matrix based on the statistics of amino acid replacements typical of the degree of evolutionary divergence of the proteins under consideration. Therefore, a family of matrices that is parameterized by sequence divergence is desired.

• Multiple substitutions: The score associated with an amino acid pair, j and k, should reflect the probability of observing j aligned with k, taking into account the possibility of multiple replacements at the same site.

There are two commonly used families of amino acid substitution matrices that have these properties, the PAM matrices (Dayhoff et al., 1978) and the BLOSUM matrices (Henikoff and Henikoff, 1992). Both substitution matrix families are parameterized by sequence divergence. The PAM matrices are based on a formal Markov model of sequence evolution. The BLOSUM matrices use an ad hoc approach. Both families were derived according to the following general approach, although the details of each step differ between the two methods.

1. Use a set of “trusted” multiple sequence alignments (ungapped) to infer model parameters.

2. Count observed amino acid pairs in the trusted alignments, correcting for sample bias.

3. Estimate substitution frequencies from amino acid pair counts.

4. Construct a log odds scoring matrix from substitution frequencies.

PAM matrices

The PAM matrices were developed by Margaret Dayhoff and her colleagues in 1978. A PAM is a unit of evolutionary distance. The term “PAM” means “percent accepted mutation.” We say the divergence between two sequences is n PAMs, if, on average, n amino acid replacements per 100 residues (including multiple substitutions) occurred since their separation.

The Dayhoff matrices are parameterized by PAM distance. Dayhoff used the following strategy to obtain amino acid substitution matrices that are parameterized by evolutionary distance:

• Construct a Markov chain to model amino acid substitution at a single site i. This chain has twenty states, one for each possible amino acid at that site. If the chain is in state j at time t, we say that we see amino acid j at site i at time t. Note that this model assumes site independence.

• For this Markov chain, we derive the PAM-1 transition probability, $P^{(1)}_{jk}$, from closely related alignments, assumed to contain no multiple substitutions. $P^{(1)}_{jk}$ is the probability of observing amino acid k at site i at time t + 1, given that we observed amino acid j at site i at time t; in other words, the probability that amino acid j will be replaced by amino acid k in sequences separated by 1 PAM of evolutionary distance.

• The PAM-n transition probability, $P^{(n)}_{jk}$, is obtained by extrapolating from the PAM-1 transition probability. This is the probability that j will be replaced with k in n time steps. We can also think of $P^{(n)}_{jk}$ as the probability of observing amino acid j aligned with amino acid k in sequences that are n PAM units apart.

Dayhoff's implementation of the general approach given above is as follows:

1. As training data, Dayhoff et al. used a set of ungapped, global multiple sequence alignments of 71 groups of closely related sequences. Within each group, the sequence identity was 85% or greater.

2. Observed amino acid pair frequencies were tabulated from the 71 multiple alignments. Sample bias was corrected by counting the minimum number of changes required to fit the data to a tree, according to a parsimony model. The counts were averaged over all most parsimonious trees. For each tree, $T$, we calculate $A^T_{jk}$ by counting the number of edges connecting j and k, for $j \neq k$. Note that $A^T_{jk} = A^T_{kj}$. We define $A^T_{jj}$ to be twice the number of edges connecting j and j. This is because the edges connecting two dissimilar residues are also counted twice, once in the jk direction and once in the kj direction. The overall counts are obtained by averaging over all trees:

$$A_{jk} = \frac{1}{n_T} \sum_{T} A^T_{jk},$$

where $n_T$ is the number of trees with an optimal parsimony score.

3. The transition matrix $P^{(1)}_{jk}$ is derived from the counts, $A_{jk}$, obtained in step 2 as follows:

$$P^{(1)}_{jk} = m_j \frac{A_{jk}}{\sum_{h \neq j} A_{jh}}, \quad j \neq k$$

$$P^{(1)}_{jj} = 1 - m_j$$

Here, $m_j$ is the mutability of amino acid j and is defined to be

$$m_j = \frac{1}{n p_j z} \sum_{l \neq j} A_{jl}, \tag{1}$$

where $p_j$ is the background frequency of j and n is the length of the alignment. We select the normalization factor, z, so that

$$\sum_{j=1}^{20} p_j m_j = \frac{1}{100} \tag{2}$$

in order to guarantee that we obtain a transition matrix corresponding to exactly 1 PAM. We obtain an expression for the normalization factor, z, by substituting the right hand side of equation (1) for $m_j$ in equation (2) and solving for z. This yields

$$z = \frac{100}{n} \sum_{j=1}^{20} \sum_{l \neq j} A_{jl}. \tag{3}$$

We now replace the z in equation (1) with the right hand side of equation (3) to obtain the mutability of j:

$$m_j = 0.01 \cdot \frac{1}{p_j} \cdot \frac{\sum_{l \neq j} A_{jl}}{\sum_{h} \sum_{l \neq h} A_{hl}}.$$

Note that $P^{(1)}_{jk}$ is consistent with the definition of a Markov chain. The rows sum to 1 and it is history independent. This Markov chain is finite, aperiodic and irreducible. Therefore, it has a stationary distribution.

We now consider the PAM-2 transition matrix. Note that the residue at site i can change from a j to a k in two time steps via several state paths: j → j → k, j → k → k, or j → l → k, where l is a third amino acid, not equal to j or k. The probability of changing from a j to a k in two time steps is

$$P^{(2)}_{jk} = \sum_{l} P^{(1)}_{jl} P^{(1)}_{lk}.$$

$P^{(2)}$ can also be derived by squaring the matrix $P^{(1)}$ by matrix multiplication.

Similarly, we can use matrix multiplication to derive the PAM-n transition matrix for any n ≥ 2 as follows:

$$P^{(n)} = \left(P^{(1)}\right)^{n}.$$
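The construction in step 3 can be sketched in a few lines of Python. This is a minimal illustration, not Dayhoff's actual procedure: it assumes a hypothetical three-letter alphabet, made-up symmetric counts `A`, and made-up background frequencies `p`. The function names `pam1_from_counts`, `mat_mul`, and `pam_n` are invented for this example. The mutability is computed from the final expression for $m_j$ above, so the alignment length n is not needed.

```python
def pam1_from_counts(A, p):
    """Build a PAM-1 transition matrix from symmetric pair counts A and
    background frequencies p, using m_j = 0.01 * S_j / (p_j * D)."""
    size = len(p)
    # S[j]: sum of A[j][l] over l != j. Diagonal entries of A are not used.
    S = [sum(A[j][l] for l in range(size) if l != j) for j in range(size)]
    D = sum(S)  # double sum over all h and all l != h
    m = [0.01 * S[j] / (p[j] * D) for j in range(size)]
    P = [[0.0] * size for _ in range(size)]
    for j in range(size):
        for k in range(size):
            if j != k:
                P[j][k] = m[j] * A[j][k] / S[j]
        P[j][j] = 1.0 - m[j]
    return P, m


def mat_mul(X, Y):
    """Plain matrix product of two square matrices stored as lists of rows."""
    size = len(X)
    return [[sum(X[j][l] * Y[l][k] for l in range(size)) for k in range(size)]
            for j in range(size)]


def pam_n(P1, n):
    """Return the n-th power of P1 by repeated multiplication."""
    size = len(P1)
    result = [[1.0 if j == k else 0.0 for k in range(size)] for j in range(size)]
    for _ in range(n):
        result = mat_mul(result, P1)
    return result


# Hypothetical toy data: three residue types instead of twenty.
A = [[0, 30, 10],
     [30, 0, 20],
     [10, 20, 0]]
p = [0.5, 0.3, 0.2]

P1, m = pam1_from_counts(A, p)
P250 = pam_n(P1, 250)

# Each row of P1 sums to 1, and sum_j p_j * m_j equals 0.01 (1 PAM).
print([round(sum(row), 12) for row in P1])
print(round(sum(pj * mj for pj, mj in zip(p, m)), 12))
```

Repeated multiplication is used here only for clarity. For large n, exponentiation by squaring would need far fewer matrix products.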

4. We obtain a log odds scoring matrix from the transition probability matrix as follows. Let $q^{(n)}_{jk} = p_j P^{(n)}_{jk}$ be the probability that we see amino acid j aligned with amino acid k at a given position in an alignment of sequences with n PAMs of divergence; i.e., that amino acid j has been replaced by amino acid k after n PAMs of mutational change. Then, we define the PAM-n scoring matrix to be

$$\begin{aligned} S_n[j,k] &= \lambda \log \frac{q^{(n)}_{jk}}{p_j p_k} \qquad (4) \\ &= \lambda \log \frac{P^{(n)}_{jk}}{p_k} \qquad (5) \end{aligned}$$

where λ is a constant. Note that equation (5) is a log odds ratio, where $q^{(n)}_{jk}$ is the probability of seeing j and k aligned under the alternate hypothesis that j and k share common ancestry and $p_j p_k$ is the probability that j and k are aligned by chance. Typically λ = 10 and the entries of $S_n$ are rounded to the nearest integer.

It is easy to verify that the PAM-n transition matrix is not symmetric; that is, $P^{(n)}_{jk} \neq P^{(n)}_{kj}$. This makes sense since replacing amino acid j with amino acid k may have different consequences than replacing k with j.

In contrast, the substitution matrix is symmetric; that is, $S_n[j,k] = S_n[k,j]$. This is because in an alignment, we cannot determine direction of evolution, so we assign the same score to j aligned with k and to k aligned with j.
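The contrast between the asymmetric transition matrix and the symmetric scoring matrix can be checked numerically. The sketch below is only an illustration under stated assumptions: a hypothetical two-letter alphabet with background frequencies `p`, a hand-built PAM-1 matrix whose stationary distribution is `p`, and a base-10 logarithm (the notes give λ = 10 but do not state the base). The names `mat_pow` and `log_odds` are invented for this example.

```python
import math


def mat_mul(X, Y):
    """Plain matrix product of two square matrices stored as lists of rows."""
    size = len(X)
    return [[sum(X[j][l] * Y[l][k] for l in range(size)) for k in range(size)]
            for j in range(size)]


def mat_pow(P, n):
    """Return the n-th power of P by repeated multiplication."""
    size = len(P)
    result = [[1.0 if j == k else 0.0 for k in range(size)] for j in range(size)]
    for _ in range(n):
        result = mat_mul(result, P)
    return result


def log_odds(Pn, p, lam=10.0):
    """Unrounded scores lam * log10(P^(n)_jk / p_k), as in equation (5).
    The base-10 logarithm is an assumption of this sketch."""
    size = len(p)
    return [[lam * math.log10(Pn[j][k] / p[k]) for k in range(size)]
            for j in range(size)]


# Hypothetical two-residue model. The off-diagonal entries are chosen so that
# p_0 * a == p_1 * b and p_0 * a + p_1 * b == 0.01, i.e. exactly 1 PAM.
p = [0.6, 0.4]
a = 0.005 / 0.6  # probability that residue 0 is replaced by residue 1
b = 0.005 / 0.4  # probability that residue 1 is replaced by residue 0
P1 = [[1 - a, a], [b, 1 - b]]

P50 = mat_pow(P1, 50)
raw = log_odds(P50, p)
S50 = [[round(v) for v in row] for row in raw]

print(P50[0][1], P50[1][0])  # transition probabilities differ
print(S50)                   # rounded scores are symmetric
```

In this toy model the symmetry follows from $p_j P^{(n)}_{jk} = p_k P^{(n)}_{kj}$, which holds because the counts used to build the PAM-1 matrix are symmetric.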

BLOSUM Matrices

The BLOSUM (BLOck SUbstitution Matrices) matrices were derived by Henikoff and Henikoff in 1992.

They were based on a much larger data set than the PAM matrices, and used conserved local alignments or “blocks”, rather than global alignments of very closely related sequences. In order to account for different degrees of sequence divergence, the Henikoffs used clustering rather than an explicit evolutionary model. The clustering procedure also addressed the issue of sample bias.

See Ewens and Grant, 6.5.2, for a detailed discussion of how the BLOSUM matrices are computed. Note that their notation is slightly different.

1. The “trusted” alignments used to construct the BLOSUM matrices consisted of roughly 2000 blocks of conserved regions representing 500+ groups of proteins. In contrast to the PAM alignments, which were full length alignments of very closely related sequences, the BLOSUM matrices are based on locally conserved regions (ungapped blocks) in multiple alignments of sequences that were not highly conserved overall.

2. Amino acid pair counts: In the BLOSUM matrix construction process, amino acid pair counts are obtained directly from columns in the conserved blocks (no trees). In order to construct a BLOSUMn matrix, the sequences in each block were first grouped into clusters of sequences that are at least n% identical. For every pair of clusters, amino acid pairs consisting of one amino acid from each cluster were tabulated to obtain amino acid pair counts. Pairs of amino acids within the same cluster were not tabulated. Since some clusters are bigger than others, the counts were normalized by the number of sequences in the clusters. Conceptually, this could be viewed as treating a cluster as an “average sequence”.

Clustering with different values of n, ranging from 45% to 90%, produces a parameterized set of matrices representing different degrees of sequence divergence. In the BLOSUM method, counting amino acid pairs and estimating substitution frequencies are most easily treated as a unified process, so we describe the details of the counting process in the next step.
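The clustering step can be sketched as follows. This is an illustrative sketch, not the Henikoffs' code: it assumes the single-linkage rule stated in step 3 (each sequence is at least n% identical to at least one other sequence in its cluster) and uses a made-up block of four ungapped sequences. The names `percent_identity` and `cluster_block` are invented for this example.

```python
def percent_identity(s, t):
    """Percent of columns at which two equal-length ungapped sequences agree."""
    matches = sum(a == b for a, b in zip(s, t))
    return 100.0 * matches / len(s)


def cluster_block(seqs, threshold):
    """Group sequences so that each one is at least `threshold` percent
    identical to at least one other member of its cluster (single linkage)."""
    parent = list(range(len(seqs)))

    def find(i):
        # Follow parent links to the cluster representative, halving paths.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if percent_identity(seqs[i], seqs[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i, seq in enumerate(seqs):
        clusters.setdefault(find(i), []).append(seq)
    return list(clusters.values())


# Hypothetical block of four ungapped sequences of length 10.
block = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDEYGHLKV", "MCDWFGNIRL"]

for threshold in (60, 80, 90):
    print(threshold, cluster_block(block, threshold))
```

With a threshold of 80, the first and third sequences fall into the same cluster even though they are only 70% identical to each other, because each is at least 80% identical to the second sequence. Lower thresholds merge more sequences into fewer clusters, which is how a smaller n comes to represent greater divergence.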

3. Estimating substitution frequencies:

• Input: B blocks of sequences. Each block b contains $k_b$ sequences of length $n_b$ (no gaps).

• Cluster sequences such that within each cluster, each sequence is at least n% identical to at least one other sequence in the cluster.

• Let $C_b$ be the number of clusters in block b following the clustering step, where the ith cluster $C_{bi}$ has $k_{bi}$ sequences ($k_b = \sum_i k_{bi}$). The observed frequency of x aligned with y is calculated as follows. For a given block, for each pair of clusters, we sum the number of x, y pairs, where x and y are in the same column, but in different clusters. This quantity is normalized by the number of possible pairs between cluster i and cluster j.

$$a^b_{xy} = \sum_{i=1}^{C_b} \sum_{j>i} \frac{\sum_{l=1}^{n_b} \left[ (\#\ x\text{'s at site } l \text{ in } C_{bi}) \cdot (\#\ y\text{'s at site } l \text{ in } C_{bj}) + (\#\ y\text{'s at site } l \text{ in } C_{bi}) \cdot (\#\ x\text{'s at site } l \text{ in } C_{bj}) \right]}{k_{bi} \cdot k_{bj}}$$

The counts for each block are then summed, normalizing for the number of columns in the block and the number of possible pairs of clusters:

$$A_{xy} = \frac{\sum_{b=1}^{B} a^b_{xy}}{\sum_{b=1}^{B} n_b \cdot \binom{C_b}{2}}.$$

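The expected frequency of x aligned with y is calculated as follows:

$$p^b_x = \sum_{i=1}^{C_b} \sum_{l=1}^{n_b} \frac{(\#\ x\text{'s at site } l \text{ in } C_{bi})}{k_{bi}}$$

$$p_x = \frac{\sum_{b=1}^{B} p^b_x}{\sum_{b=1}^{B} n_b \cdot C_b}$$

$$E_{xy} = p_x p_y + p_y p_x$$

$$E_{xx} = p_x^2$$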
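The counting in step 3 can be sketched as below, assuming the sequences have already been clustered. The input layout (a list of blocks, each a list of clusters, each a list of equal-length strings) and the names `observed_and_background` and `expected` are assumptions of this sketch. One further assumption is labeled in the code: a pair of identical residues (x = y) is counted once per cluster pair, so that the observed frequencies over unordered pairs sum to 1.

```python
from collections import Counter
from itertools import combinations


def observed_and_background(blocks):
    """Return (A, p): observed pair frequencies keyed by sorted (x, y) tuples
    and background residue frequencies keyed by residue.
    Every block must contain at least two clusters."""
    pair_sum = Counter()  # running sum of a^b_xy over blocks
    res_sum = Counter()   # running sum of p^b_x over blocks
    pair_norm = 0.0       # sum over blocks of n_b * (C_b choose 2)
    res_norm = 0.0        # sum over blocks of n_b * C_b
    for clusters in blocks:
        n_b = len(clusters[0][0])
        # cols[i][l]: residue counts at site l within cluster i
        cols = [[Counter(seq[l] for seq in c) for l in range(n_b)]
                for c in clusters]
        for i, c in enumerate(clusters):
            for l in range(n_b):
                for x, count in cols[i][l].items():
                    res_sum[x] += count / len(c)
        for i, j in combinations(range(len(clusters)), 2):
            weight = len(clusters[i]) * len(clusters[j])
            for l in range(n_b):
                for x, cx in cols[i][l].items():
                    for y, cy in cols[j][l].items():
                        # Assumption: x == y contributes once, not twice.
                        pair_sum[tuple(sorted((x, y)))] += cx * cy / weight
        pair_norm += n_b * len(clusters) * (len(clusters) - 1) / 2
        res_norm += n_b * len(clusters)
    A = {xy: v / pair_norm for xy, v in pair_sum.items()}
    p = {x: v / res_norm for x, v in res_sum.items()}
    return A, p


def expected(p, x, y):
    """Expected frequency of the unordered pair (x, y) under independence."""
    return p[x] ** 2 if x == y else 2 * p[x] * p[y]


# Hypothetical input: one block of length 2 with two clusters.
blocks = [[["AC", "AC"], ["AD"]]]
A, p = observed_and_background(blocks)
print(A)
print(p)
print(expected(p, "C", "D"))
```

Dividing each contribution by the product of the two cluster sizes is what makes a large cluster count as a single "average sequence" rather than as many separate observations.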

4. Calculate the log odds scoring matrix from the observed and expected frequencies:

$$S[x,y] = 2 \log_2 \frac{A_{xy}}{E_{xy}}$$
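As a small worked example of step 4, the function below evaluates the score for a few made-up observed and expected frequencies. The name `blosum_score` and the input values are hypothetical; the formula is the one given above. The factor of 2 with a base-2 logarithm means the scores are in units of half a bit.

```python
import math


def blosum_score(observed, expected):
    """Unrounded log odds score 2 * log2(observed / expected)."""
    return 2 * math.log2(observed / expected)


# Hypothetical frequencies, chosen as powers of two so the ratios are exact.
print(blosum_score(0.5, 0.25))    # pair seen twice as often as expected
print(blosum_score(0.25, 0.25))   # pair seen exactly as often as expected
print(blosum_score(0.125, 0.5))   # pair seen a quarter as often as expected
```

A pair observed more often than expected gets a positive score, a pair observed less often gets a negative score, and a pair observed exactly as often as expected scores zero.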

Comparing PAM and BLOSUM Matrices

| | PAM | BLOSUM |
| --- | --- | --- |
| Evolutionary model | Explicit evolutionary model | None |
| Data | Full length MSAs of closely related sequences. | Conserved blocks in protein |
| Bias correction | Trees | Clustering |
| Evolutionary distance | From Markov model of sequence evolution. | From clustering of sequences. |
| Matrices | Transition and log odds scoring matrices | Log odds scoring matrix only. |
| Parameter n | Distance increases with n | Distance decreases with n |
| Biophysical properties | Derived indirectly from data | Derived indirectly from data |

Neither family was built from biophysical properties directly: the PAM matrices were constructed from an evolutionary model, and the BLOSUM matrices from conserved blocks where amino acids are under selective constraints. Nevertheless, both favor replacement of amino acids that share biochemical properties. Inspection of the BLOSUM 62 matrix shows that alignments of residues in the same biochemical group tend to have positive log odds scores. These residues are more likely to be observed together in related sequences than by chance. Residues from different groups tend to have negative scores. These residues are less likely to be observed together in related sequences than in chance alignments. A score of zero means that this pair of residues is equally likely in related and chance alignments.

| % Identity | PAM | BLOSUM |
| --- | --- | --- |
| 20 | 250 | 45 |
| 30 | 160 | 62 |
| 40 | 120 | 80 |
| 50 | 80 | - |
| 60 | 60 | - |