COMPUTATIONAL BIOLOGY PRIMER

Where did the BLOSUM62 alignment score matrix come from?

Sean R Eddy

Many programs use the BLOSUM62 score matrix to score pairs of aligned residues. Where did BLOSUM62 come from?

Sean R. Eddy is at Howard Hughes Medical Institute & Department of Genetics, Washington University School of Medicine, 4444 Forest Park Blvd., Box 8510, Saint Louis, Missouri 63108, USA. e-mail: [email protected]

Back in the good old days, so many things were easier to understand. I once disassembled the engine of my 1972 MG just to see how it worked, but now I won't touch the squirrel's nest of technology that's inside my modern Honda Civic. Likewise, in the early days of sequence comparison, alignment scores were straightforward stuff that anybody could tweak. The first sequence comparisons just assigned –1 per mismatch and –1 per insertion/deletion, and if you didn't like that, you could make up whatever scores you thought gave you better-looking alignments. Those days are gone. Look inside a modern amino acid score matrix, and you'll see a squirrel's nest of 400 numbers. These highly tuned matrices, which go by industrialized acronyms like BLOSUM62 and PAM250, no longer seem to have any user serviceable parts inside. Blame probability theory.

Alignment scores are log-odds scores

What we want to know is whether two sequences are homologous (evolutionarily related) or not, so we want an alignment score that reflects that. Theory says that if you want to compare two hypotheses, a good score is a log-odds score: the logarithm of the ratio of the likelihoods of your two hypotheses. If we assume that each aligned residue pair is statistically independent of the others (biologically dubious, but mathematically convenient), the alignment score is the sum of individual log-odds scores for each aligned residue pair. Those individual scores make up a 20 × 20 score matrix. The equation for calculating a score $s(a,b)$ for aligning two residues $a$ and $b$ is:

$$ s(a,b) = \frac{1}{\lambda} \log \frac{p_{ab}}{f_a f_b} $$

The numerator ($p_{ab}$) is the likelihood of the hypothesis we want to test: that these two residues are correlated because they're homologous. Thus, $p_{ab}$ are the target frequencies: the probability that we expect to observe residues $a$ and $b$ aligned in homologous sequence alignments. The denominator ($f_a f_b$) is the likelihood of a null hypothesis: that these two residues are uncorrelated and unrelated, occurring independently. Thus, $f_a$ and $f_b$ are background frequencies: the probabilities that we expect to observe amino acids $a$ and $b$ on average in any sequence. λ is a scaling factor. It is usually set to something that lets us round off all the terms in the score matrix to sensible integers.

If we expect to find $a$ and $b$ aligned together in homologous sequences more often than we expect them to occur by chance ($p_{ab} > f_a f_b$), then the odds ratio is greater than one and the score is positive. Operationally, we say that positive scores mean conservative substitutions, and negative scores indicate nonconservative substitutions. This definition of 'conservative substitution' in a score matrix is purely statistical. It has nothing directly to do with amino acid structure or biochemistry.

This explains some details in BLOSUM62 that may seem counterintuitive at first glance. For instance, tryptophan (W/W) pairs score +11, while leucine (L/L) pairs only score +4; why shouldn't all identities get the same score? The rarer the amino acid is, the more surprising it would be to see two of them align together by chance. In the homologous alignment data that BLOSUM62 was trained on, leucine/leucine (L/L) pairs were in fact more common than tryptophan/tryptophan (W/W) pairs ($p_{LL}$ = 0.0371, $p_{WW}$ = 0.0065), but tryptophan is a much rarer amino acid ($f_L$ = 0.099, $f_W$ = 0.013). Run those numbers (with BLOSUM62's original λ = 0.347) and you get +3.8 for L/L and +10.5 for W/W, which were rounded to +4 and +11.

Another example is that BLOSUM62 awards a +1 to an apparently nonconservative alignment of a positively charged lysine to a negatively charged glutamic acid, but a seemingly more innocuous alignment of an alanine to a leucine gets penalized –1. A/L pairs are indeed slightly more frequent in homologous alignments than K/E pairs ($p_{AL}$ = 0.0044, $p_{KE}$ = 0.0041 in the BLOSUM62 training data), but A and L are more common amino acids ($f_A$ = 0.074, $f_L$ = 0.099, $f_K$ = 0.058, $f_E$ = 0.054). With λ = 0.347, this gives a score of –1.47 for A/L (rounded to –1) and 0.76 for K/E (rounded to +1).
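The worked BLOSUM62 examples can be checked with a few lines of Python. This is a minimal sketch, not code from the article: the function name `log_odds_score` and the dictionaries `f` and `p` are illustrative choices, and it assumes the logarithm is the natural logarithm, which is what reproduces the quoted numbers. The inputs are the rounded frequencies given in the text, so the unrounded scores can differ slightly in the last digit from the values quoted there.

```python
import math

LAMBDA = 0.347  # BLOSUM62's original scaling factor, as quoted in the text


def log_odds_score(p_ab, f_a, f_b, lam=LAMBDA):
    """Return the unrounded score (1/lambda) * ln(p_ab / (f_a * f_b))."""
    return math.log(p_ab / (f_a * f_b)) / lam


# Background frequencies quoted in the text (rounded values).
f = {"L": 0.099, "W": 0.013, "A": 0.074, "K": 0.058, "E": 0.054}

# Target frequencies quoted in the text for four residue pairs.
p = {
    ("L", "L"): 0.0371,
    ("W", "W"): 0.0065,
    ("A", "L"): 0.0044,
    ("K", "E"): 0.0041,
}

# Unrounded log-odds score for each pair.
scores = {
    (a, b): log_odds_score(p_ab, f[a], f[b]) for (a, b), p_ab in p.items()
}

for (a, b), s in scores.items():
    print(f"{a}/{b}: raw {s:+.2f}, rounded {round(s):+d}")
```

Rounding the raw values to the nearest integer gives the matrix entries discussed above: a rare residue such as W earns a large identity score because its chance co-occurrence `f_W * f_W` is so small, not because W/W pairs are common.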

NATURE BIOTECHNOLOGY VOLUME 22 NUMBER 8 AUGUST 2004, pp. 1035–1036

Where did those numbers come from?

So much for the scores. But we've just pushed the question to a different level. Where did we get the target frequencies $p_{ab}$?

The target frequencies are the probabilities with which we expect to see $a$ and $b$ aligned in homologous alignments. Thus, the basic idea is to take a lot of known, trusted pairwise alignments similar to what we expect our next alignment to look like, and count the frequency at which each residue pair occurs.

The more information we have about the two sequences we're aligning, the better we'll be able to estimate what their target frequencies should be. For example, if we know that we're aligning the sequences of two integral membrane proteins, our target frequencies would be biased toward hydrophobicity. There are endless ways of slicing sequence alignment databases and estimating new score matrices specialized for certain organisms or certain types of sequences. A cottage industry of bioinformatics toils in this happy realm. For a general purpose matrix like BLOSUM62, though, we can't really use sequence- or species-specific sources of information.

One source of information remains crucial: evolutionary distance. The target frequencies depend very strongly on the evolutionary distance between the two sequences. If the two sequences diverged recently, the target frequencies should be peaked on identical residues. The more divergent the relationship we're looking for, the flatter the target frequencies need to be. All modern amino acid score matrices are therefore estimated from frequencies observed in trusted alignment data, using some procedure to make a series of related matrices that are appropriate for different expected divergences.

The procedure that Steve and Jorja Henikoff used to estimate the BLOSUM matrices was straightforward (ref. 1). The Henikoffs took a big database of trusted alignments (their BLOCKS database), and (in effect) only counted pairwise sequence alignments related by less than some threshold percentage identity. A threshold of 62% identity or less resulted in the target frequencies for the BLOSUM62 matrix. An 80% threshold gave the more highly conserved target frequencies of the BLOSUM80 matrix, and a 45% threshold gave the more divergent BLOSUM45 matrix. Empirically, the BLOSUM matrices have performed very well. BLOSUM62 has become a de facto standard for many protein alignment programs.

DIY score matrices

We can even make up the $p_{ab}$ values if we state some assumptions, which is especially practical for smaller, simpler 4 × 4 DNA score matrices. Say we want to make a DNA scoring matrix optimized for finding 88% identity alignments. Let's assume that all mismatches are equiprobable, and the composition of both alignments and background sequences is uniform at 25% for each nucleotide. Then, our $p_{ab}$ values are 0.22 for the four identities and 0.01 for each of the 12 types of mismatch, and our background frequencies $f_a$, $f_b$ = 0.25 for all $a,b$. Plug those into the log-odds equation, and we get (if λ = 1) +1.26 for a match and –1.83 for a mismatch. Scale up a bit with λ = 0.25 and round off, and voilà, we have a new scoring system of +5/–7.

What's the difference between making up our target frequencies and calculating scores, versus just making up scores? When we make up our $p_{ab}$ values, we're directly describing what we expect homologous alignments to look like (here, simply 88% identity), and the resulting score matrix is optimal for detecting alignments that match our target frequencies. If instead, we make up an arbitrary score matrix, we're blindly looking for a scheme that works well.

Even arbitrary scores imply target alignment frequencies

Remarkably, even if we do make up arbitrary scores, they still imply target frequencies. It's useful to know what these implicit target frequencies are, so we know what sort of alignments the score matrix will optimally detect. The proof that arbitrary scores still imply optimal target frequencies is subtle (an important statistical result from Sam Karlin and Steve Altschul; refs. 2,3), but the arithmetic is straightforward. Rearrangement of the log-odds equation gives us $p_{ab} = f_a f_b e^{\lambda s_{ab}}$; the problem is the unknown λ. The sum of all the $p_{ab}$ values must be 1, by definition, because they're probabilities. So, set

$$ \sum_{a,b} f_a f_b e^{\lambda s_{ab}} = 1 $$

and solve for a nonzero λ. Such a λ exists so long as the score matrix has two key properties: it must have at least one positive score, and the expected score for random sequence alignments must be negative. Most score matrices have these properties because the same properties are necessary to make local sequence alignment algorithms like BLAST and Smith/Waterman work (ref. 3). (Both conditions are met by definition for matrices derived as log-odds scores, except for the useless case of $p_{ab} = f_a f_b$ for all $a,b$.)

For instance, both FASTA and WU-BLASTN use an arbitrary +5/–4 scoring system for matches/mismatches in DNA alignments, whereas NCBI BLASTN uses a +1/–2 scoring system. Is there a big difference? Probably hard to tell just from looking at those scores. If you run the calculation, you find that these two scoring systems are almost polar opposites. NCBI BLASTN's +1/–2 system is optimal for detecting homologous DNA alignments that are 95% identical, which is to say almost perfect matches. FASTA and WU-BLASTN's +5/–4 system is optimal for detecting homologous DNA alignments that are only 65% identical, at the edge of the 'twilight zone' for gapped alignment methods' ability to recognize homologous DNA alignments.

Note: Supplementary information is available on the Nature Biotechnology website.

1. Henikoff, S. & Henikoff, J.G. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89, 10915–10919 (1992).
2. Karlin, S. & Altschul, S.F. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA 87, 2264–2268 (1990).
3. Altschul, S.F. Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. 219, 555–565 (1991).

Further study

You can download an ANSI C program for calculating the implicit target frequencies $p_{ab}$ of a score matrix (see Supplementary Notes). The BLOSUM62 score matrix and its background frequencies are included as an example. The code also contains two basic methods of solving for roots of equations like the one for λ: the bisection method, and the Newton/Raphson method.
