On Alignments

1.b.iv Scores to estimate protein similarities

1.b.iv.1 Introduction

In the previous chapters we used a substitution matrix (BLOSUM) to measure similarities between amino acids of two aligned sequences. We briefly considered the way in which the substitution matrix was derived. However, considerable room for more detailed discussion remains. In this chapter we fill this gap and discuss different approaches to calculating similarity scores between sequences, i.e. substitution matrices. One example of such a set of similarity scores is of course the BLOSUM matrix. We also consider fitness criteria of sequences with structures. A fitness criterion measures how well an amino acid matches a structural or functional property of a site in a protein. Examples of structural features are the secondary structure (helix, beta sheet, loop, etc.), the water-exposed surface area of an amino acid, and the number of other amino acids in close proximity (contacts).

Techniques to compute these scores can be divided into two groups. The first group uses physical or chemical knowledge about amino acids to determine the energetic cost of a substitution and to compute a similarity score. For example, a charged residue (e.g. Lysine), with its favorable interactions with a high dielectric medium (water), is unlikely to be replaced by an apolar residue such as Leucine. Apolar residues (with relatively weak electrostatic interactions with other groups) prefer to remain buried inside the protein matrix. A score in the chemical physics sense is a free energy difference, a concept that we will consider later in chapter X. The second group of methods employs computational statistics and machine learning approaches that learn from (experimental) substitution data of amino acids in proteins. These methods learn from the changes in amino acids observed when homologous proteins are compared, and use this data to extrapolate to substitutions in newly available proteins.
The first approach is based on principles from the natural sciences; principles that connect with fundamental ideas from chemical physics. This is a major advantage. In the second approach, which extracts substitution probabilities from empirical observations of these changes, we interpolate to new systems but do not acquire a basic understanding of these events. Perhaps surprisingly, the major disadvantage of the chemical physics approach is its accuracy. Overall, the accuracy of physically and chemically based methods in scoring the fitness of an amino acid to a structural site (or the substitution of an amino acid by another residue) is significantly lower than the accuracy of the second approach. A plausible explanation of this observation is the bottom-up manner in which these scores are computed. The parameters are determined from experimental and computational studies of small molecules [x] and used (with perhaps some adjustments) in the much larger protein molecules. Small errors that we undoubtedly make at the level of an isolated amino acid (in the bottom-up approach) accumulate to significant inaccuracies when proteins with tens to thousands of amino acids are considered. After all, proteins are only marginally stable under normal physiological conditions, and prediction of this small stability energy is demanding. This observation is to be contrasted with the statistical or machine learning approaches, which learn directly from data on the large molecules (proteins) and are therefore less sensitive to inaccuracies at the amino acid level. Of course it is unlikely that the machine learning approach will compete with the natural science approach when thermodynamic information is desired. The design of novel proteins, not found in nature, is also likely to benefit from chemical and physical knowledge. Because of the clear practical advantage of the machine learning and statistical approaches for learning score functions, we focus in this book on the latter.
Even after restricting the discussion, the field remains too broad to be discussed in full in a single chapter. We therefore limit the discussion to two methods that have had significant impact on the field and have the potential to make additional important contributions. The two techniques differ appreciably in many of their computational aspects: (i) statistical analysis of correct alignments and (ii) mathematical programming and learning from negative examples. (A widely used approach that we do not cover is the Hidden Markov Model, which is discussed extensively elsewhere [x].)

1.b.iv.1 Computational statistics of sequence blocks

If we are to learn a substitution matrix by statistical analysis of known alignments, we must have at hand accurate (known) alignments to begin with. These accurate alignments will serve as a base for the statistical analysis of mutations and for the computation of substitution matrices as discussed below. However, since we do not have a substitution matrix at the beginning of this process, it is not obvious how to generate the initial alignments. Therefore we must resort to one of the following two options: (i) restrict the initial set of data to alignments that are easy to produce by hand and do not require a substitution matrix, or (ii) generate alignments according to a similarity criterion which is sequence independent (and therefore does not require a sequence substitution table to start). Historically, approach (i) was used to generate the sequence-to-sequence substitution matrices that are most widely used today (like BLOSUM). To generate the required accurate alignments, only alignments with a high percentage of identity are considered. No gaps are used for the initial statistical analysis.
A plausible block of this type is sketched below:

ACCR
ACLR
ACLK
ACCK
VCCR
AICR
ACCR

Note that the sequence fragments are short (fragments of whole sequences) to maintain a high degree of sequence identity, and 100 percent sequence identity is also possible. Such alignments “by hand” can be interpreted as using the identity as the substitution matrix. The identity assigns the value of one if the two amino acids are identical and zero otherwise.

In principle our task is now clear. Consider the joint probability $p(a_i, b_i)$ that an amino acid type $a_i$ is aligned against amino acid type $b_i$ ($i$ is the index of the column). Clearly we anticipate that the larger the probability, the higher the score used in the alignment, which suggests that the score is monotonic in the probability. Nevertheless, there are two manipulations we must apply to the pair probability to obtain a score.

1. The first manipulation is normalization with respect to a null hypothesis. The null hypothesis is the probability that amino acid $a_i$ is aligned against $b_i$ by chance, that is $p(a_i)\,p(b_i)$. For example, it is possible that we will observe many pairs $(a_i, b_i)$ just because the two amino acids appear frequently in the sequences, and not because of a likely substitution of $a_i$ by $b_i$ given an amino acid $a_i$. We consider the ratio of the probability of interest and the null hypothesis, $p(a_i, b_i) / \left[ p(a_i)\,p(b_i) \right]$. If the result is one, there is no preference for the pair to form (in the null hypothesis the two amino acids are independent). If the ratio is larger than one, the observation is that the two amino acids are likely to substitute for each other, and if it is smaller than one they are not.

2. In the second manipulation we transform the probability to a score which is additive; the score of a whole alignment is a sum of scores of individual pairs. The probability ratio of a whole alignment (assuming independence of the pair probabilities) is a ratio of products, $\prod_i p(a_i, b_i) / \prod_i p(a_i)\,p(b_i)$.

A way to make the product into the (desired) sum is by taking the logarithm of the product to give

$$\lambda \log\left[ \frac{\prod_i p(a_i, b_i)}{\prod_i p(a_i)\,p(b_i)} \right] = \lambda \sum_i \log\left[ \frac{p(a_i, b_i)}{p(a_i)\,p(b_i)} \right] = \sum_i s(a_i, b_i),$$

where in the last expression we identify the score of matching an individual pair, $s(a_i, b_i) = \lambda \log\left[ p(a_i, b_i) / \left( p(a_i)\,p(b_i) \right) \right]$. The multiplication by $\lambda$ is helpful in translating the scores to energy or other more convenient units. This constant does not affect the most common application of scoring matrices, in which we compare one protein (target) to a group of other proteins (templates) and rank the similarities of the target to the templates. It can however be important in establishing the statistical significance of an alignment. Individual entries of the BLOSUM matrix indeed have the form of the log of a probability ratio.

The discussion so far is straightforward. If we could have ended it here, it would have been a nice clean conclusion. There is however a non-trivial problem in the procedure described so far. We selected blocks of alignments with a high percentage of identical amino acids (these were the alignments we could generate with confidence, without a substitution matrix in the beginning). However, these types of blocks clearly bias the statistics towards more diagonal substitution matrices and alignments of highly similar sequences. More diagonal matrices are less suitable for detecting remote homology between sequences, which is a prime reason for generating the similarity scores to begin with. We wish to determine less than obvious similarities. To overcome this problem a cutoff is defined within the blocks, and sequences that are more similar than the cutoff are aggregated into a single sub-block. The newly formed sub-block is assigned a statistical weight of one.
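Before turning to an example of the cutoff, the log-odds score $s(a_i, b_i)$ defined above can be illustrated with a minimal sketch. It assumes that the block sketched earlier reads ACCR, ACLR, ACLK, ACCK, VCCR, AICR, ACCR, that every pair of sequences in a column contributes one observation, and that both orderings of a pair are counted so that $p(a, b) = p(b, a)$. The function name `log_odds_scores` and the parameter `lam` (standing for $\lambda$) are hypothetical, and the sketch omits the clustering weights, so it is not the actual BLOSUM procedure.

```python
import math
from collections import Counter
from itertools import combinations


def log_odds_scores(block, lam=1.0):
    """Return {(a, b): lam * log(p(a, b) / (p(a) * p(b)))} for observed pairs."""
    pair_counts = Counter()
    for column in zip(*block):                # one gap-free alignment column at a time
        for a, b in combinations(column, 2):  # every pair of sequences in the column
            pair_counts[(a, b)] += 1          # count both orderings so that
            pair_counts[(b, a)] += 1          # p(a, b) == p(b, a)
    total = sum(pair_counts.values())
    p_pair = {pair: n / total for pair, n in pair_counts.items()}

    # Null hypothesis: marginal probability p(a) = sum over b of p(a, b).
    p_single = Counter()
    for (a, _), p in p_pair.items():
        p_single[a] += p

    # Pairs never observed in the block have p(a, b) = 0 and are left out,
    # since their logarithm is undefined.
    return {
        (a, b): lam * math.log(p / (p_single[a] * p_single[b]))
        for (a, b), p in p_pair.items()
    }


# Assumed reading of the block sketched in the text.
block = ["ACCR", "ACLR", "ACLK", "ACCK", "VCCR", "AICR", "ACCR"]
scores = log_odds_scores(block)
for pair in [("A", "A"), ("R", "K"), ("C", "L")]:
    print(pair, round(scores[pair], 3))
```

In such a small block every observed pair receives a positive score and most pairs are never observed at all, which is one reason real matrices are estimated from a large number of blocks.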
Consider for example the block below:

YFRRAC
YFRKAC
YFRGAC
YWRRVC

If we set the identity cutoff at 80 percent, then the first three sequences are above threshold (any two of them agree at five of the six positions) and their combined statistical weight is set to one (each sequence is given a weight of $1/3$). The fourth sequence agrees with the others at no more than four of the six positions and keeps a weight of one. For example, the statistical weight in the block of alanine (A) is one. The statistical weight of tryptophan (W) is also one, and the statistical weight of arginine (R) is $10/3$.
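The weights quoted above can be checked with a short sketch. It assumes single-linkage clustering (a sequence joins a sub-block if its identity with any member exceeds the cutoff) and a strict comparison against the cutoff; both are illustrative choices, and the function names are hypothetical. Exact fractions are used so that the weights $1/3$ and $10/3$ appear without rounding.

```python
from collections import Counter
from fractions import Fraction


def identity(s, t):
    """Fraction of aligned positions at which two equal-length sequences agree."""
    return Fraction(sum(a == b for a, b in zip(s, t)), len(s))


def cluster_by_identity(block, cutoff):
    """Single-linkage clustering: join sequences whose identity exceeds the cutoff."""
    clusters = []
    for seq in block:
        linked, rest = [], []
        for cluster in clusters:
            is_linked = any(identity(seq, other) > cutoff for other in cluster)
            (linked if is_linked else rest).append(cluster)
        # Merge the new sequence with every sub-block it is linked to.
        merged = [seq] + [s for cluster in linked for s in cluster]
        clusters = rest + [merged]
    return clusters


def residue_weights(block, cutoff):
    """Total statistical weight of each residue type; each sub-block has weight one."""
    weights = Counter()
    for cluster in cluster_by_identity(block, cutoff):
        w = Fraction(1, len(cluster))  # weight shared equally within the sub-block
        for seq in cluster:
            for residue in seq:
                weights[residue] += w
    return weights


block = ["YFRRAC", "YFRKAC", "YFRGAC", "YWRRVC"]
weights = residue_weights(block, cutoff=Fraction(80, 100))
print(weights["A"], weights["W"], weights["R"])  # prints: 1 1 10/3
```

The arginine weight comes from the third column (three sequences of weight $1/3$ plus one of weight one) and the fourth column (one sequence of weight $1/3$ plus one of weight one), giving $2 + 4/3 = 10/3$.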
