Where Did the BLOSUM62 Alignment Score Matrix Come From?
Computational Biology Primer

Sean R Eddy

Many sequence alignment programs use the BLOSUM62 score matrix to score pairs of aligned residues. Where did BLOSUM62 come from?

Sean R. Eddy is at Howard Hughes Medical Institute & Department of Genetics, Washington University School of Medicine, 4444 Forest Park Blvd., Box 8510, Saint Louis, Missouri 63108, USA. e-mail: [email protected]

Nature Biotechnology, Volume 22, Number 8, August 2004, p. 1035. © 2004 Nature Publishing Group. http://www.nature.com/naturebiotechnology

Back in the good old days, so many things were easier to understand. I once disassembled the engine of my 1972 MG just to see how it worked, but now I won't touch the squirrel's nest of technology that's inside my modern Honda Civic. Likewise, in the early days of sequence comparison, alignment scores were straightforward stuff that anybody could tweak. The first sequence comparisons just assigned –1 per mismatch and –1 per insertion/deletion, and if you didn't like that, you could make up whatever scores you thought gave you better-looking alignments. Those days are gone. Look inside a modern amino acid score matrix, and you'll see a squirrel's nest of 400 numbers. These highly tuned matrices, which go by industrialized acronyms like BLOSUM62 and PAM250, no longer seem to have any user serviceable parts inside. Blame probability theory.

Alignment scores are log-odds scores

What we want to know is whether two sequences are homologous (evolutionarily related) or not, so we want an alignment score that reflects that. Theory says that if you want to compare two hypotheses, a good score is a log-odds score: the logarithm of the ratio of the likelihoods of your two hypotheses. If we assume that each aligned residue pair is statistically independent of the others (biologically dubious, but mathematically convenient), the alignment score is the sum of individual log-odds scores for each aligned residue pair. Those individual scores make up a 20 × 20 score matrix. The equation for calculating a score $s(a,b)$ for aligning two residues $a$ and $b$ is:

$$ s(a,b) = \frac{1}{\lambda} \log \frac{p_{ab}}{f_a f_b} $$

The numerator ($p_{ab}$) is the likelihood of the hypothesis we want to test: that these two residues are correlated because they're homologous. Thus, $p_{ab}$ are the target frequencies: the probability that we expect to observe residues $a$ and $b$ aligned in homologous sequence alignments. The denominator ($f_a f_b$) is the likelihood of a null hypothesis: that these two residues are uncorrelated and unrelated, occurring independently. Thus, $f_a$ and $f_b$ are background frequencies: the probabilities that we expect to observe amino acids $a$ and $b$ on average in any protein sequence. $\lambda$ is a scaling factor. It is usually set to something that lets us round off all the terms in the score matrix to sensible integers.

If we expect to find $a$ and $b$ aligned together in homologous sequences more often than we expect them to occur by chance ($p_{ab} > f_a f_b$), then the odds ratio is greater than one and the score is positive. Operationally, we say that positive scores mean conservative substitutions, and negative scores indicate nonconservative substitutions. This definition of 'conservative substitution' in a score matrix is purely statistical. It has nothing directly to do with amino acid structure or biochemistry.

This explains some details in BLOSUM62 that may seem counterintuitive at first glance. For instance, tryptophan (W/W) pairs score +11, while leucine (L/L) pairs only score +4; why shouldn't all identities get the same score? The rarer the amino acid is, the more surprising it would be to see two of them align together by chance. In the homologous alignment data that BLOSUM62 was trained on, leucine/leucine (L/L) pairs were in fact more common than tryptophan/tryptophan (W/W) pairs ($p_{LL} = 0.0371$, $p_{WW} = 0.0065$), but tryptophan is a much rarer amino acid ($f_L = 0.099$, $f_W = 0.013$). Run those numbers (with BLOSUM62's original $\lambda = 0.347$) and you get +3.8 for L/L and +10.5 for W/W, which were rounded to +4 and +11.

Another example is that BLOSUM62 awards a +1 to an apparently nonconservative alignment of a positively charged lysine (K) to a negatively charged glutamic acid (E), but a seemingly more innocuous alignment of an alanine to a leucine gets penalized –1. A/L pairs are indeed slightly more frequent in homologous alignments than K/E pairs ($p_{AL} = 0.0044$, $p_{KE} = 0.0041$ in the BLOSUM62 training data), but A and L are more common amino acids ($f_A = 0.074$, $f_L = 0.099$, $f_K = 0.058$, $f_E = 0.054$). With $\lambda = 0.347$, this gives a score of –1.47 for A/L (rounded to –1) and 0.76 for K/E (rounded to +1).

Where did those numbers come from?

So much for the scores. But we've just pushed the question to a different level. Where did we get the target frequencies $p_{ab}$?

The target frequencies are the probability we expect to see $a,b$ aligned in homologous alignments. Thus, the basic idea is to take a lot of known, trusted pairwise alignments similar to what we expect our next alignment to look like, and count the frequency at which each residue pair occurs.

The more information we have about the two sequences we're aligning, the better we'll be able to estimate what their target frequencies should be. For example, if we know that we're aligning the sequences of two integral membrane proteins, our target frequencies would be biased toward hydrophobicity. There are endless ways of slicing sequence alignment databases and estimating new score matrices specialized for certain organisms or certain types of sequences. A cottage industry of bioinformatics toils in this happy realm. For a general purpose matrix like BLOSUM62, though, we can't really use sequence- or species-specific sources of information.

One source of information remains crucial: evolutionary distance. The target frequencies depend very strongly on the evolutionary distance between the two sequences. If the two sequences diverged recently, the target frequencies should be peaked on identical residues. The more divergent the relationship we're looking for, the flatter the target frequencies need to be. All modern amino acid score matrices are therefore estimated from frequencies …

DIY score matrices

We can even make up the $p_{ab}$ values if we state some assumptions, which is especially practical for smaller, simpler 4 × 4 DNA score matrices. Say we want to make a DNA scoring matrix optimized for finding 88% identity alignments. Let's assume that all mismatches are equiprobable, and the composition of both alignments and background sequences is uniform at 25% for each nucleotide. Then, our $p_{ab}$ values are 0.22 for the four identities and 0.01 for each of the 12 types of mismatch, and our background frequencies are $f_a = f_b = 0.25$ for all $a,b$. Plug those into the log-odds equation, and we get (if $\lambda = 1$) +1.26 for a match and –1.83 for a mismatch. Scale up a bit with $\lambda = 0.25$ and round off, and voilà, we have a new scoring system of +5/–7.

What's the difference between making up our target frequencies and calculating scores, versus just making up scores? When we make up our $p_{ab}$ values, we're directly describing what we expect homologous alignments to look like (here, simply 88% identity), and the resulting score matrix is optimal for detecting alignments that match our target frequencies. If instead, we make up an arbitrary score … and solve for a nonzero $\lambda$. Such a $\lambda$ exists so long as the score matrix has two key properties: it must have at least one positive score, and the expected score for random sequence alignments must be negative. Most score matrices have these properties because the same properties are necessary to make local sequence alignment algorithms like BLAST and Smith/Waterman work³. (Both conditions are met by definition for matrices derived as log-odds scores, except for the useless case of $p_{ab} = f_a f_b$ for all $a,b$.)

For instance, both FASTA and WU-BLASTN use an arbitrary +5/–4 scoring system for matches/mismatches in DNA alignments, whereas NCBI BLASTN uses a +1/–2 scoring system. Is there a big difference? Probably hard to tell just from looking at those scores. If you run the calculation, you find that these two scoring systems are almost polar opposites. NCBI BLASTN's +1/–2 system is optimal for detecting homologous DNA alignments that are 95% identical, which is to say almost perfect matches. FASTA and WU-BLASTN's +5/–4 system is optimal for detecting homologous DNA alignments that are only 65% identical, at the edge of the 'twilight zone' for gapped alignment methods' ability to recognize homologous DNA alignments.

Note: Supplementary information is available on the Nature Biotechnology website.

1. Henikoff, S. & Henikoff, J.G. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89, 10915–10919 (1992).
2. Karlin, S. & Altschul, S.F. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad.