
1.b.iv Scores to estimate protein similarities

1.b.iv.1 Introduction

In the previous chapters we used a substitution matrix (BLOSUM) to measure similarities between amino acids of two aligned sequences. We briefly considered the way in which the substitution matrix was derived. However, considerable room for more detailed discussion remains. In this chapter we fill this gap and discuss different approaches to calculating similarity scores between sequences, i.e. substitution matrices. One example of such a set of similarity scores is of course the BLOSUM matrix. We also consider fitness criteria of sequences with structures. A fitness criterion measures the match between an amino acid and a structural or functional property of a site in a protein. Examples of structural features are the secondary structure (helix, beta sheet, loop, etc.), the water-exposed surface area of an amino acid, and the number of other amino acids in close proximity (contacts).

Techniques to compute these scores can be divided into two groups. The first group uses physical or chemical knowledge about amino acids to determine the energetic cost of a substitution and to compute a similarity score. For example, a charged residue (e.g. lysine), with its favorable interactions with a high dielectric medium (water), is unlikely to be replaced by an apolar residue. Apolar residues (with relatively weak electrostatic interactions with other groups) prefer to remain buried inside the protein matrix. A score in the chemical physics sense is a free energy difference, a concept that we will consider later in chapter X. The second group of methods employs computational and machine learning approaches that learn from (experimental) substitution data of amino acids in proteins. These methods learn from observed changes in amino acids when homologous proteins are compared, and use these data to extrapolate to substitutions in newly available proteins.

The first approach is based on principles from the natural sciences, principles that connect with fundamental ideas from chemical physics. This is a major advantage. In the second approach, which extracts substitution probabilities from empirical observations of these changes, we interpolate to new systems but do not acquire a basic understanding of these events. Perhaps surprisingly, the major disadvantage of the chemical physics approach is its accuracy. Overall, the accuracy of physically and chemically based methods in scoring the fitness of an amino acid to a structural site (or a substitution of an amino acid with another residue) is significantly lower than the accuracy of the second approach. A plausible explanation of this observation is the bottom-up manner in which these scores are computed. The parameters are determined from experimental and computational studies of small molecules [x] and used (with perhaps some adjustments) in the much larger protein molecules. Small errors that we undoubtedly make at the level of an isolated amino acid (in the bottom-up approach) accumulate to significant inaccuracies when proteins with tens to thousands of amino acids are considered. After all, proteins are only marginally stable under normal physiological conditions, and prediction of this small stability energy is demanding. This observation is to be contrasted with the statistical or machine learning approaches, which learn directly from data on the large molecules (proteins) and are therefore less sensitive to inaccuracies at the amino acid level. Of course, it is unlikely that the machine learning approach will compete with the natural science approach when thermodynamic information is desired. The design of novel proteins, not found in nature, is also likely to benefit from chemical and physical knowledge.

Because of the clear practical advantage of the machine learning and statistical approaches for learning score functions, we focus in this book on the latter. Even after restricting the discussion, the field remains too broad to be discussed in full in a single chapter. We therefore limit the discussion to two methods that have had a significant impact on the field and have the potential to make additional important contributions. The two techniques differ appreciably in many of their computational aspects: (i) statistical analysis of correct alignments and (ii) mathematical programming and learning from negative examples. (A widely used approach that we do not cover is the Hidden Markov Model, which is discussed extensively elsewhere [x].)

1.b.iv.1 Computational statistics of sequence blocks

If we are to learn a substitution matrix by statistical analysis of known alignments, we must have at hand accurate (known) alignments to begin with. These accurate alignments will serve as a base for statistical analysis of mutations and for the computation of substitution matrices as discussed below. However, since we do not have a substitution matrix at the beginning of this process, it is not obvious how to generate the initial alignments. Therefore we must resort to one of the following two options: (i) restrict the initial set of data to alignments that are easy to produce by hand and do not require a substitution matrix, or (ii) generate alignments according to a similarity criterion which is sequence independent (and therefore does not require a sequence substitution table to start).

Historically, approach (i) was used to generate the sequence-to-sequence substitution matrices that are most widely used today (like BLOSUM). To generate the required accurate alignments, only alignments with a high percentage of identity are considered. No gaps are used for the initial statistical analysis. A plausible block of this type is sketched below:

ACCR
ACLR
ACLK
ACCK
VCCR
AICR
ACCR

Note that the sequence fragments are short (fragments of whole sequences) to maintain a high degree of sequence identity, and 100 percent sequence identity is also possible. Such alignments “by hand” can be interpreted as the use of the identity matrix as a substitution matrix. The identity matrix assigns the value of one if the two amino acids are identical and zero otherwise.

In principle our task is now clear. Consider the joint probability $p(a_i, b_i)$ that an amino acid of type $a_i$ is aligned against an amino acid of type $b_i$ ($i$ is the index of the column).

Clearly we anticipate that the larger the probability, the higher the score used in the alignment, which suggests that the score is a monotonic function of the probability. Nevertheless, there are two manipulations we must apply to the pair probability to obtain a score.

1. The first manipulation is normalization with respect to a null hypothesis. The null hypothesis is the probability that amino acid $a_i$ is aligned against $b_i$ by chance, that is $p(a_i)\,p(b_i)$. For example, it is possible that we will find many pairs $(a_i, b_i)$ just because the two amino acids appear frequently in the sequences, and not because of a likely substitution of $a_i$ by $b_i$. We consider the ratio of the probability of interest and the null hypothesis, $p(a_i, b_i) / \left[ p(a_i)\,p(b_i) \right]$. If the result is one, there is no preference for the pair to form (in the null hypothesis the two amino acids are independent). If the ratio is larger than one, the observation is that the two amino acids are likely to substitute for each other, and if it is smaller than one (so that its logarithm is negative) they are not.

2. In the second manipulation we transform the probability to a score which is additive; the score of a whole alignment is a sum of scores of individual pairs. The probability ratio of a whole alignment (assuming independence of the pair probabilities) is a product, $\prod_i p(a_i, b_i) / \prod_i p(a_i)\,p(b_i)$. A way to turn the product into the (desired) sum is to take the logarithm of the product, which gives

$$\lambda \log \left[ \frac{\prod_i p(a_i, b_i)}{\prod_i p(a_i)\,p(b_i)} \right] = \lambda \sum_i \log \left[ \frac{p(a_i, b_i)}{p(a_i)\,p(b_i)} \right] = \sum_i s(a_i, b_i),$$

where in the last expression we identify the score of matching an individual pair. The multiplication by $\lambda$ is helpful in translating the scores to energy or other more convenient units. This constant cannot affect the most common application of scoring matrices, in which we compare one protein (target) to a group of other proteins (templates) and rank the similarities of the target to the templates. It can however be important in establishing the statistical significance of an alignment. Individual entries of the BLOSUM matrix indeed have the form of the log of a probability ratio.

The discussion so far is straightforward. If we could have ended it here, it would have been a nice clean conclusion. There is however a non-trivial problem in the procedure we described so far. We selected blocks of alignments with a high percentage of identical amino acids (these were the alignments we could generate with confidence, without a substitution matrix in the beginning). However, these types of blocks clearly bias the statistics towards more diagonal substitution matrices and alignments of highly similar sequences. More diagonal matrices are less suitable for detecting remote homology between sequences, which is a prime reason for generating the similarity scores to begin with. We wish to determine less than obvious similarities. To overcome this problem a cutoff is defined within the blocks, and sequences that are more similar than the cutoff are aggregated into a single sub-block. The newly formed sub-block is assigned a statistical weight of one. Consider for example the block below:

YFRRAC
YFRKAC
YFRGAC
YWRRVC

If we set the identity cutoff at 80 percent, then the first three sequences are above the threshold and their combined statistical weight is set to one (each sequence is given a weight of 1/3). For example, the statistical weight in the block of alanine (A) is one. The statistical weight of tryptophan (W) is also one, and the statistical weight of arginine (R) is 10/3. These weights are used in computational estimates of the distribution functions of the pairs and of the single amino acids. Obviously, depending on the identity cutoff, different substitution matrices will emerge. We expect that the higher the cutoff, the more diagonal the substitution matrix. For substitution matrices with a high recognition capacity for remote sequences, lower cutoffs are desirable.

There is a “conflict” between the need for accuracy, satisfied by highly accurate and (almost) identical alignments, and the need to learn more diverse substitutions that are more interesting from a biological perspective. This “competition” has led to a number of BLOSUM matrices, with performance tuned to different levels of sequence divergence. For example, the names BLOSUM50, BLOSUM62, and BLOSUM80 correspond to cutoffs of 50, 62 and 80 percent sequence identity of the initial blocks, respectively.
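The weighting of the four-sequence block above can be reproduced with a short sketch. It assumes single-linkage clustering (a sequence joins a cluster if it reaches the cutoff against any member) and treats "above the cutoff" as "at or above"; both are assumptions of this illustration, and the function names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction


def identity(s1, s2):
    """Fraction of aligned (ungapped) positions that hold identical residues."""
    same = sum(a == b for a, b in zip(s1, s2))
    return Fraction(same, len(s1))


def cluster(block, cutoff):
    """Single-linkage clustering: pairs at or above the cutoff share a label."""
    labels = list(range(len(block)))
    for i in range(len(block)):
        for j in range(i + 1, len(block)):
            if identity(block[i], block[j]) >= cutoff:
                old, new = labels[j], labels[i]
                labels = [new if lab == old else lab for lab in labels]
    return labels


def residue_weights(block, cutoff):
    """Each cluster has total weight one, shared equally by its sequences."""
    labels = cluster(block, cutoff)
    weights = defaultdict(Fraction)
    for seq, lab in zip(block, labels):
        w = Fraction(1, labels.count(lab))
        for aa in seq:
            weights[aa] += w
    return dict(weights)


block = ["YFRRAC", "YFRKAC", "YFRGAC", "YWRRVC"]
weights = residue_weights(block, Fraction(80, 100))
print(weights["A"], weights["W"], weights["R"])  # 1 1 10/3
```

Exact fractions are used so that the weights 1, 1 and 10/3 quoted in the text come out without rounding.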

While the BLOSUM matrices are a clear success story and are widely used in the field, the ambiguity described above has encouraged other, (arguably) more direct, solutions to the problem of generating initial alignments to learn from. These approaches use only a single set of alignments with no need for blocks and are the topic of the next section.
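The log-odds construction of this section, a score equal to $\lambda$ times the log of $p(a,b)/[p(a)p(b)]$, can be sketched as follows. The pair counts are invented toy numbers, the counts are symmetrized so that $p(a,b) = p(b,a)$, and the function name and the natural logarithm are choices of this illustration rather than the procedure used to build any published matrix.

```python
import math
from collections import defaultdict


def log_odds_scores(pair_counts, lam=1.0):
    """Return s(a, b) = lam * log(p(a, b) / (p(a) p(b))) from aligned-pair counts.

    pair_counts maps (a, b) to a (possibly fractional, weighted) count.
    """
    total = float(sum(pair_counts.values()))
    joint = defaultdict(float)
    for (a, b), count in pair_counts.items():
        # Symmetrize: an aligned pair (a, b) is also an aligned pair (b, a).
        joint[(a, b)] += 0.5 * count / total
        joint[(b, a)] += 0.5 * count / total
    marginal = defaultdict(float)
    for (a, _b), p_ab in joint.items():
        marginal[a] += p_ab          # p(a) = sum over b of p(a, b)
    return {(a, b): lam * math.log(p_ab / (marginal[a] * marginal[b]))
            for (a, b), p_ab in joint.items()}


# Toy counts (hypothetical): identical pairs dominate, so A/R is disfavored.
scores = log_odds_scores({("A", "A"): 6, ("R", "R"): 2, ("A", "R"): 2})

# Toy counts with no preference at all: every score is log(1) = 0.
neutral = log_odds_scores({("A", "A"): 1, ("R", "R"): 1, ("A", "R"): 2})
print(scores[("A", "A")], scores[("A", "R")], neutral[("A", "R")])
```

A positive entry means the pair is observed more often than chance predicts, and a negative entry means it is observed less often, in line with the ratio test described above.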

1.b.iv.1 Measuring similarity by comparison of structures

Proteins are more than one-dimensional sequences of characters. They also have a well-defined three-dimensional shape. The compact shape, essential to protein function, consists of a linear polymer that “collapses” into the biologically active form. It is convenient to represent the folded (native) structure by the linked positions of the sequential amino acids (backbone trace). In one of the most common representations, the alpha carbons (Cα) of the amino acids are used for the trace. Since amino acids differ only in their side chains and all have an alpha carbon as part of their backbone structure, this representation does not carry an explicit identification of the type of the amino acid. For example, both alanine and tryptophan residues will be represented by a single Cα each, with a similar distance from the previous and next amino acids (their corresponding alpha carbons).

*** PLACE FIGURE X HERE ***

The only amino acid conformation that can be identified from the backbone alone (without the corresponding side chain) is cis proline. Proline has two conformations of the amide group, cis and trans; the cis conformation has a shorter Cα-Cα distance compared to all other sequential pairs of amino acids. All other amide planes in proteins are found exclusively in the trans configuration, in which the distance between sequential Cα-s is 3.8 Å.

*** PLACE FIGURE X HERE ***

An intriguing possibility is to use the structures of proteins (which are available experimentally) to compare and align two proteins with no reference to the identity of the amino acids (based on the backbone atoms, or the Cα atoms only). The alignment requires a measure of similarity based on the two (or more) sets of coordinates at hand. A natural and suggestive measure is the distance between the two shapes, $A$ and $B$, which we call $d_{AB} \equiv \| R_A - R_B \|_2$. How to compute a structural alignment is still a mystery (which will be solved in section X), but we can already discuss its implications. With the appropriate algorithm, a structural alignment between protein shapes can be computed and used to estimate the probability $p(a_i, b_i)$ of having amino acid types $a$ and $b$ at the aligned (structural) site $i$. Even if the identity of the amino acids was not used in the structural alignment, we can refer to the ordered set of coordinates to identify the corresponding amino acids and to construct the above distribution. The probabilities of different alignment sites are accumulated to obtain $p(a, b)$ anywhere along the structural alignments. A substitution table is then computed following the same formula as in the purely sequence-based case:

$$\text{score} = \lambda \log \left[ \frac{p(a, b)}{p(a)\,p(b)} \right]$$

Table X: A sample of structurally based substitution matrix is [***]: *** INSERT SUBSTITUTION MATRIX HERE ***

Compared to the BLOSUM table, and from the perspective of amino acid similarity and identity, the matrix above is significantly less diagonal and more permissive. This is of course not surprising considering the different ways in which the matrices were constructed. However, there is more to the observed differences than superficially changing the way we compute the initial alignments. A particularly intriguing empirical observation about structural comparisons of proteins concerns conservation. Structures seem to be conserved a lot better than sequences during evolutionary processes. This observation further supports structural alignment as a base for learning sequence substitution matrices and for detecting (remote) family relationships. An example of a family relationship which is obvious at the structural level but not so obvious at the sequence level is given in the figure below (Figure x), in which the backbones of sperm whale myoglobin and lupine leghemoglobin are overlapped and the alignment of the two sequences is given.

** INSERT FIGURE X HERE **

The percentage of sequence identity (XX percent according to the structural alignment) is significantly lower than what is considered the safe level for identification. The remote evolutionary similarity makes the detection of the close relationship between myoglobin and leghemoglobin a challenge if we have only the sequences at hand. On the other hand, the structural similarity is obvious by visual inspection. It is not necessary to use sophisticated tools of statistical significance to assess the accuracy of the structural matches. Both structures are rich in helices (7 to 8 helices) and are similarly packed in space. In both proteins the helices create a box that packs the chemically active group, the heme. The structural similarity strongly supports what we know from biochemical studies, that both are oxygen binding proteins. The globin family is a nice example of the success of a structural alignment in measuring evolutionary and functional diversity. However, some difficulties are unique to the procedure of structural alignment (which so far has not been described in detail) when compared to sequence alignment.

First, the fact that two proteins are structurally similar does not necessarily imply similarity in function. It turns out that nature uses the same fold again and again for multiple purposes and diverse functions. Care is therefore required when correlating structural similarity with functional similarity. Some protein families will allow extrapolation from structure to function based on shape similarities, and some will not. The TIM barrel family is a good example [x] of a structural family with diverse functions. Some fold (structural) families are however quite restricted in their functional diversity. The globin family (with the prime function of oxygen storage and transport) is a good example of a structural family with narrow functionality. It is possible to assign function from structure if the structure is from a fold family with a reasonably unique function. On the positive side, if our prime goal is to build a structural model for the protein, prediction of its function may come later. We can use a structural model to examine putative active sites, or other biochemical data not included in the building of the model, to assist the annotation of the protein.

Second, the number of protein structures that we have is significantly smaller than the number of sequences. There are a couple of million protein sequences that were determined and deposited in the usual databases (e.g. the non-redundant (nr) database for protein sequences [x]). In contrast, there are only tens of thousands of protein structures that were determined experimentally and deposited in the protein databank [x]. Even with the observation that many sequences share roughly the same fold, there remain many protein sequences that do not belong to a particular structural family.

Hence, our statistics in determining sequence substitution may be biased by the availability of structural information for a particular subset of protein families.

Third, efficient protocols for structural alignment are not rigorous. For sequence alignment we have a widely accepted definition of the optimal score, and an alignment algorithm (dynamic programming) that finds the optimal alignment in a number of operations that is quadratic in the sequence length. There is a well-established score for structural overlaps, the root mean square distance between overlapped structures (1.b.iv.2 Structural overlaps). However, there is no known and widely accepted score for structural alignment, though it makes sense that the score for a structural alignment will be similar to the score of a structural overlap. It is expected to satisfy the usual properties of a distance (like the triangle inequality), which helps in the geometrical interpretation of protein space. The difficulties in finding optimal algorithms led many investigators to use heuristic measures of structural similarity with corresponding “dynamic-programming-like” alignments. Nevertheless, at least from a pedagogical viewpoint, we start the discussion below with structural overlaps (and alignments) that keep the notion of the usual Euclidean distance.

Regardless of the above reservations, structural alignments still serve important goals. The two most interesting ones are (i) to learn substitution probabilities from alignments constructed without sequence information, and (ii) to hint at remote evolutionary connections. Therefore the sections below are very useful in the field of bioinformatics in general, and are essential in the field of structural bioinformatics [x].

1.b.iv.2 Structural overlaps

In the discussion below we restrict ourselves to a Euclidean measure (norm 2) of the distance between two structures. The discussion follows the original papers by Kabsch [x].

We consider two proteins A and B with the same number of amino acids n . At present we ignore the possibility that the proteins may have different lengths or the possibility of insertions and deletions. In other words, we assume that the alignment is already given and ignore the potential complications introduced by an alignment procedure. We clearly need this calculation at the least to measure the similarity between structures for a given trial alignment.

For the moment every coordinate (the position of the Cα atom of an amino acid) in structure A has a corresponding coordinate in B. We shall deal with the problem of structural alignment in the follow-up section 1.b.iv.3 Structural alignments.

The coordinate vectors of proteins $A$ and $B$ are denoted by $X_A$ and $X_B$ respectively. Each of these vectors is of length $3n$ and includes the $(x, y, z)$ (Cartesian) positions of the Cα-s of the amino acids. The vector of rank 3 of amino acid $i$ in structure $A$ is denoted by $r_i^A$. The distance $D$ between the two structures is defined (and written explicitly) as

$$D^2 = \sum_{i=1}^{n} \left( r_i^A - r_i^B \right)^2$$

Note that the ordering of the points does not matter, as long as an alignment is given. Therefore, the algorithm below can be used to overlap arbitrary collections of points and is not restricted to polymers like proteins. Note that in sequence alignment too, if only the scores of amino acid pairs are considered, then the ordering of the aligned pairs does not matter (the ordering of gaps may). We have used this property to estimate the number of non-degenerate alignments in chapter X.
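The squared distance between two structures with a given one-to-one correspondence is simple to compute. The sketch below uses made-up Cα coordinates in Å, and also reports the root mean square distance, $\sqrt{D^2/n}$; the function name is hypothetical.

```python
import math


def squared_distance(coords_a, coords_b):
    """D^2 = sum_i |r_i^A - r_i^B|^2 for two equally long lists of (x, y, z)."""
    if len(coords_a) != len(coords_b):
        raise ValueError("the two structures must have the same number of points")
    return sum((xa - xb) ** 2
               for ra, rb in zip(coords_a, coords_b)
               for xa, xb in zip(ra, rb))


# Toy traces (hypothetical coordinates): B is A shifted by 1 Angstrom along y.
A = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
B = [(0.0, 1.0, 0.0), (3.8, 1.0, 0.0), (7.6, 1.0, 0.0)]

d2 = squared_distance(A, B)
rmsd = math.sqrt(d2 / len(A))

# Applying the same permutation to both lists leaves D^2 unchanged.
order = [2, 0, 1]
d2_permuted = squared_distance([A[k] for k in order], [B[k] for k in order])
print(d2, rmsd, d2_permuted)  # 3.0 1.0 3.0
```

The last lines illustrate the remark about ordering: only the pairing of points matters, not the order in which the pairs are listed.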

It is (of course) possible to translate or rotate one of the structures with respect to the other without changing any of the internal distances between the points that belong to the same object, the protein; that is, while maintaining its rigid shape (and ignoring for the moment the operations of mirroring or inversion, which keep the distances the same but change the protein structure). We anticipate that such actions should not change the distance between the two proteins. However, a translation or a rotation of one of the objects in Cartesian space will impact the distance as expressed in the formula above. We therefore seek a specific translation vector and a rotation matrix that minimize the distance between the two structures (and do not change the individual shapes). The minimal distance requirement defines the difference between the two structures (almost) uniquely (the result is not exactly unique since a mirror image satisfies the same set of distances as the original object but is a different entity in biology). Without loss of generality, we will always move structure A and keep structure B fixed.

The translations and the rotations are considered separately. A translation is defined by adding a single constant vector $t$ to each of the $r_i^A$ vectors. A rotation is defined by multiplying a coordinate vector of rank 3 by a $3 \times 3$ matrix $U$ (e.g. $U r_i^A$). The matrix $U$ satisfies $U^t U = I$ (a unitary condition; $I$ is the identity matrix) and $\det(U) = 1$ (avoiding a mirror image or inversion), which are the usual conditions on a rotation matrix.

The condition $U^t U = I$ ensures that all the internal distances of the protein remain the same, i.e. that the transformation keeps the rigid structure of the object under consideration. Consider an arbitrary pair of Cα atoms in amino acids $i$ and $j$ with coordinate vectors $r_i$ and $r_j$ respectively, and the corresponding square of the distance, $d_{ij}^2 = (r_i - r_j)^t (r_i - r_j)$. If we rotate each of the position vectors with the same rotation matrix $U$ we have

$$d_{ij}^2 = (U r_i - U r_j)^t (U r_i - U r_j) = \left( U (r_i - r_j) \right)^t \left( U (r_i - r_j) \right)$$

$$d_{ij}^2 = (r_i - r_j)^t U^t U (r_i - r_j) = (r_i - r_j)^t (r_i - r_j)$$

In the last line of the above equation we use the (unitary) condition on the rotation matrix. Indeed the transformation so defined preserves the distances between all amino acids.
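A quick numerical check of this invariance is sketched below, using a rotation about the z axis as the example matrix. The angle and the two position vectors are arbitrary illustrative values, and the helper names are hypothetical.

```python
import math


def rotation_about_z(phi):
    """A proper rotation by the angle phi (radians) about the z axis."""
    c, s = math.cos(phi), math.sin(phi)
    return [[c, s, 0.0], [-s, c, 0.0], [0.0, 0.0, 1.0]]


def apply(u, r):
    """Matrix-vector product U r for a 3x3 matrix and a 3-vector."""
    return [sum(u[i][k] * r[k] for k in range(3)) for i in range(3)]


def dist2(ri, rj):
    """Squared distance (r_i - r_j)^t (r_i - r_j)."""
    return sum((x - y) ** 2 for x, y in zip(ri, rj))


U = rotation_about_z(0.9)

# Unitary condition: (U^t U)_ij = sum_k u_ki u_kj should equal delta_ij.
utu = [[sum(U[k][i] * U[k][j] for k in range(3)) for j in range(3)]
       for i in range(3)]

r_i, r_j = [1.0, 2.0, 3.0], [-0.5, 4.0, 0.25]
before = dist2(r_i, r_j)
after = dist2(apply(U, r_i), apply(U, r_j))
print(before, after)  # equal up to floating-point rounding
```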

After the long introduction let us start with the simpler problem, the problem of optimizing the translation. We wish to determine a translation vector $t$ that will be added to the position of each of the atoms in protein $A$ so that $D^2$ is minimal. This task is straightforward: the solution follows from direct differentiation of the square of the distance with respect to the components of the translation vector ($N$ is the number of aligned amino acids and $\eta$ labels the Cartesian components):

$$D^2 = \sum_{n=1}^{N} \left( r_n^A + t - r_n^B \right)^2 = \text{minimum}$$

$$\frac{dD^2}{dt_\eta} = 2 \sum_{n=1}^{N} \left( r_{n\eta}^A + t_\eta - r_{n\eta}^B \right) = 0$$

$$t_\eta = \frac{1}{N} \sum_{n=1}^{N} \left( r_{n\eta}^B - r_{n\eta}^A \right) = \frac{1}{N} \sum_{n=1}^{N} r_{n\eta}^B - \frac{1}{N} \sum_{n=1}^{N} r_{n\eta}^A = r_\eta^B(gc) - r_\eta^A(gc), \qquad \eta = x, y, z$$

Hence, all we need to do is to correct the position of $r_i^A$ by the difference of the geometric centers of the two proteins, $r_\eta^A(gc)$ and $r_\eta^B(gc)$. After the correction we will be ready to consider the more interesting problem of the relative orientation of the two structures, the problem of rotation.

In fact, to make sure that the next item on the agenda is indeed a pure rotation, we apply two translations that set the geometric centers of both proteins to zero:

$$r_n^A \leftarrow r_n^A - t_A, \qquad t_A = \frac{1}{N} \sum_n r_n^A$$

$$r_n^B \leftarrow r_n^B - t_B, \qquad t_B = \frac{1}{N} \sum_n r_n^B$$

In the following derivation we assume that the above adjustments were already made. We will keep the same notation, $r_i^A$ and $r_i^B$, for the vectors with the adjusted translation.
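The translation step can be sketched in a few lines. The coordinates are arbitrary toy values and the function names are hypothetical; the sketch computes the optimal translation as the difference of geometric centers and also moves each structure so that its geometric center is at the origin.

```python
def geometric_center(coords):
    """Average position of a list of (x, y, z) points."""
    n = len(coords)
    return tuple(sum(r[k] for r in coords) / n for k in range(3))


def optimal_translation(coords_a, coords_b):
    """t = r^B(gc) - r^A(gc), the shift of A that minimizes D^2."""
    ca, cb = geometric_center(coords_a), geometric_center(coords_b)
    return tuple(cb[k] - ca[k] for k in range(3))


def centered(coords):
    """Translate a structure so that its geometric center is at the origin."""
    c = geometric_center(coords)
    return [tuple(r[k] - c[k] for k in range(3)) for r in coords]


def squared_distance_after_shift(coords_a, coords_b, t):
    """D^2 between A translated by t and B."""
    return sum((ra[k] + t[k] - rb[k]) ** 2
               for ra, rb in zip(coords_a, coords_b) for k in range(3))


A = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
B = [(1.0, 2.0, 3.0), (4.5, 2.5, 3.0), (5.0, 6.0, 2.0)]

t = optimal_translation(A, B)
d2_best = squared_distance_after_shift(A, B, t)
d2_off = squared_distance_after_shift(A, B, (t[0] + 0.1, t[1], t[2]))
print(t, d2_best, d2_off)
```

Any displacement away from `t` increases the distance, which is what the vanishing derivative above expresses.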

We now consider another optimization problem for the rotation. The distance between the two structures is a function of the rotation matrix U , which is the unknown we seek. The rotation matrix is required to make the distance as small as possible (minimal). As we shall see, this problem has a unique solution that forms the algorithmic core for the way in which we perform overlap of the structures and for many structural alignment algorithms.

Of course the rotation matrix $U$ cannot be any matrix. It must keep the overall shape of the protein unchanged (the proteins must be rotated with respect to each other as rigid bodies). We therefore insist on the unitary constraint $U^t U - I = 0$. We also must avoid reflection ($\det(U) = -1$), since reflection changes the so-called “chirality” of proteins and their chemical identity. We shall deal with distance conservation first and focus on solving the minimum distance problem with the application of one constraint only ($U^t U = I$), and only later return to the reflection problem ($\det(U) = 1$).

After the lengthy introduction, here is the optimization task that we are facing: minimize $D^2$ as a function of the rotation matrix $U$,

$$D^2 = \sum_{n=1}^{N} \left( U r_n^A - r_n^B \right)^2 = \text{minimum}$$

subject to the constraint

$$U^t U = I \qquad \text{or} \qquad \sum_{k=1}^{3} u_{ki} u_{kj} - \delta_{ij} = 0$$

We have used the notation $u_{ki} = (U)_{ki}$ and

$$\delta_{ij} = \begin{cases} 1 & i = j \\ 0 & i \neq j \end{cases}$$

To determine a unique rotation in three dimensions we must specify three parameters. These parameters include the direction of a rotation axis (two angles) and the rotation around this axis (one angle). It may look strange that we are after a $3 \times 3$ matrix with a total of nine parameters when a rotation is determined by three. The secret is in the constraint. If we write the rotation matrix as a stack of three column vectors, $u_1$, $u_2$ and $u_3$, we realize that the equations of the constraint are of the form

$$\begin{pmatrix} u_1^t \\ u_2^t \\ u_3^t \end{pmatrix} \begin{pmatrix} u_1 & u_2 & u_3 \end{pmatrix} = \begin{pmatrix} u_1^t u_1 & u_1^t u_2 & u_1^t u_3 \\ u_2^t u_1 & u_2^t u_2 & u_2^t u_3 \\ u_3^t u_1 & u_3^t u_2 & u_3^t u_3 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$

Since the vector-matrix expression for the constraints yields nine equations, it may seem that there are no free parameters for rotations! However, each pair of off-diagonal elements, $u_i^t u_j = u_j^t u_i = 0$, provides two dependent equations of which only one carries new information. As a result only six independent constraint equations exist. Summarizing: since the number of elements in the matrix is nine, and the number of independent constraint equations is six, we have $9 - 6 = 3$ independent variables. This is exactly the number required to obtain a unique rotation via the use of (for example) the Euler angles. The Euler angles are, however, complex non-linear descriptions; working directly with the rotation matrix and the additional constraints is more convenient computationally, as we shall also demonstrate below.

Using the language of Lagrange’s multipliers, we add the constraints to the target function that we wish to optimize.

$$F = D^2 + \sum_{i,j} \Lambda_{ij} \left( \sum_k u_{ki} u_{kj} - \delta_{ij} \right)$$

The $3 \times 3$ matrix $\Lambda$ contains the nine Lagrange multipliers. The indices $i, j, k$ run over the three Cartesian axes. The Lagrange multiplier matrix ($\Lambda$) is symmetric since the matrix that multiplies it is symmetric as well. The degeneracy of the constraints discussed above (9 constraint equations of which only 6 are independent) implies the symmetry of $\Lambda$ as well. To find the minimum of $D^2$ subject to the constraint on $U$, we differentiate $F$ with respect to the matrix element $u_{ij}$. Up to an overall factor of 2, which does not affect the condition, we have

$$\frac{\partial F}{\partial u_{ij}} = \sum_k u_{ik} \left( \sum_n r_{nk}^A r_{nj}^A + \Lambda_{kj} \right) - \sum_n r_{ni}^B r_{nj}^A = 0$$

We now define two matrices

$$R_{ij} = \sum_n r_{ni}^B r_{nj}^A \qquad S_{ij} = \sum_n r_{ni}^A r_{nj}^A$$

Note that $S_{ij}$ is a symmetric matrix while $R_{ij}$ is not. With the help of the above definitions we can write $\partial F / \partial u_{ij} = 0$ in a more compact form

$$U (S + \Lambda) - R = 0$$

We have one matrix equation with two unknown matrices, $U$ and $\Lambda$. Of course, things are not so bad since we still have the constraint equation $U^t U = I$. Note that $(S + \Lambda)$ is a symmetric matrix. On the other hand $R$ is not symmetric, which makes our problem a little more interesting. The following trick will eliminate some of our problems. Write the last equation as $U (S + \Lambda) = R$ and multiply it from the left by its transpose:

$$(S + \Lambda)\, U^t U\, (S + \Lambda) = R^t R$$

Using $U^t U = I$ (our favorite constraint) eliminates $U$ from the equation. This does not seem like a positive step, since the rotation matrix $U$ is what we are after. Nevertheless, some insight into the problem can be gained from the equation below

$$(S + \Lambda)(S + \Lambda) = R^t R$$

The eigenvectors of $(S + \Lambda)$, denoted $a_k$, are the same as the eigenvectors of $R^t R$ (assuming no eigenvalue degeneracy for the symmetric matrix $(S + \Lambda)$). The eigenvalues of $R^t R$ are $\mu_k^2$. The corresponding eigenvalues of $(S + \Lambda)$ are therefore given by

$$(S + \Lambda)\, a_k = \pm \mu_k\, a_k$$

(the eigenvalues of the symmetric matrix must be real, but since we have only the eigenvalues of the square of the matrix, the eigenvalues themselves are determined only up to a sign).

We now use the eigenvectors $a_k$ to reconsider the matrix equation after multiplying it from the right by $a_k$:

$$U (S + \Lambda)\, a_k = R\, a_k$$

Since $a_k$ is an eigenvector of $(S + \Lambda)$ we can also write

$$U a_k (\pm \mu_k) = R\, a_k$$

We have three orthogonal eigenvectors $a_k$, $k = 1, 2, 3$. The rotation matrix $U$ transforms these three vectors to another set that we call $b_k$, $k = 1, 2, 3$. Note that the $b_k$ set is also a set of orthogonal vectors. This is easy to appreciate as follows:

$$(b_i)^t (b_j) = (U a_i)^t (U a_j) = a_i^t U^t U a_j = a_i^t a_j = \delta_{ij}$$

Using the “b” notation we can also write

$$b_k = \pm \frac{1}{\mu_k} R\, a_k \qquad (\mu_k \neq 0)$$

The right hand side includes only entities that are known by now, so we can use $R$, $a_k$ and $\mu_k$ to compute the $b_k$-s. Since we also know that

$$U a_k = b_k$$

we can immediately reconstruct the rotation matrix, relying on the orthogonality of the $a_k$ set (and of the $b_k$ set, $b_k^t U = a_k^t$):

$$U = \sum_{k=1}^{3} b_k a_k^t$$

This “optimal” $U$ can be plugged into the initial equation for the distance to compute the “optimal” distance. There are however a few more subtle points that are discussed below, and we postpone for the moment the calculation of the distance.

Using an argument similar to the one we put forward for the matrix $U$, we derive a related expression for $R$ in terms of the two sets of vectors:

$$R = \sum_{k=1}^{3} b_k (\pm \mu_k)\, a_k^t$$

In other words $R$ is a “weighted” (by the $\mu_k$) $U$. The right and the left eigenvectors, and the absolute values of the eigenvalues, can be obtained directly from the Singular Value Decomposition (SVD) of the asymmetric matrix $R$. Using SVD (without going into the details of exactly what it means) is the simplest approach to our problem using the facilities and the resources of MATLAB.
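The construction of the rotation matrix from the eigenvectors of $R^t R$ can be sketched without any linear algebra library. The sketch below is an illustration under stated assumptions, not the SVD route mentioned above: it diagonalizes $R^t R$ with cyclic Jacobi rotations (a technique chosen here for self-containment), takes the "+" sign for every $\mu_k$, and assumes centered coordinates with all $\mu_k > 0$. All function names and the toy coordinates are hypothetical.

```python
import math


def matmul(x, y):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose(x):
    return [[x[j][i] for j in range(3)] for i in range(3)]


def jacobi_eigen(m, sweeps=30):
    """Eigenvalues and eigenvectors (columns of v) of a symmetric 3x3 matrix."""
    a = [row[:] for row in m]
    v = [[float(i == j) for j in range(3)] for i in range(3)]
    for _ in range(sweeps):
        for p, q in ((0, 1), (0, 2), (1, 2)):
            if a[p][q] == 0.0:
                continue
            # Angle that zeroes a[p][q]: tan(2 theta) = 2 a_pq / (a_qq - a_pp).
            y, x = 2.0 * a[p][q], a[q][q] - a[p][p]
            if x < 0.0:
                y, x = -y, -x
            theta = 0.5 * math.atan2(y, x)
            g = [[float(i == j) for j in range(3)] for i in range(3)]
            g[p][p] = g[q][q] = math.cos(theta)
            g[p][q], g[q][p] = math.sin(theta), -math.sin(theta)
            a = matmul(transpose(g), matmul(a, g))
            v = matmul(v, g)
    return [a[k][k] for k in range(3)], v


def optimal_rotation(coords_a, coords_b):
    """U = sum_k b_k a_k^t with b_k = R a_k / mu_k ('+' sign for every mu_k)."""
    r = [[sum(rb[i] * ra[j] for ra, rb in zip(coords_a, coords_b))
          for j in range(3)] for i in range(3)]       # R_ij = sum_n r^B_ni r^A_nj
    mu2, v = jacobi_eigen(matmul(transpose(r), r))     # eigen-pairs of R^t R
    u = [[0.0] * 3 for _ in range(3)]
    for k in range(3):
        mu = math.sqrt(mu2[k])                         # assumes mu_k > 0
        a_k = [v[i][k] for i in range(3)]
        b_k = [sum(r[i][j] * a_k[j] for j in range(3)) / mu for i in range(3)]
        for i in range(3):
            for j in range(3):
                u[i][j] += b_k[i] * a_k[j]
    return u


def rotate(u, coords):
    return [tuple(sum(u[i][k] * r[k] for k in range(3)) for i in range(3))
            for r in coords]


# Toy data: four centered, non-coplanar points and a copy rotated by a known matrix.
A = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0), (-1.0, -2.0, -3.0)]
c, s = math.cos(0.7), math.sin(0.7)
U_TRUE = matmul([[c, s, 0.0], [-s, c, 0.0], [0.0, 0.0, 1.0]],
                [[1.0, 0.0, 0.0], [0.0, c, s], [0.0, -s, c]])
B = rotate(U_TRUE, A)
U = optimal_rotation(A, B)
max_error = max(abs(U[i][j] - U_TRUE[i][j]) for i in range(3) for j in range(3))
print(max_error)
```

In this toy case B is an exact rotated copy of A, so the recovered matrix should agree with the known rotation up to rounding. For real structures the sign choice and the determinant still have to be examined.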

Finally our optimal distance can be computed more directly without explicitly determining U (of course to make plots of overlapping structures like in figure X requires the rotation matrix). Consider the direct calculation of distance outlines below in light of what we already know

$$
\begin{aligned}
D^2 &= \sum_n \left( U r_n^A - r_n^B \right)^2 = \sum_n \left[ (r_n^A)^2 + (r_n^B)^2 \right] - 2 \sum_n (r_n^B)^t U r_n^A \\
&= \sum_n \left[ (r_n^A)^2 + (r_n^B)^2 \right] - 2 \sum_n (r_n^B)^t \left( \sum_k b_k a_k^t \right) r_n^A \\
&= \sum_n \left[ (r_n^A)^2 + (r_n^B)^2 \right] - 2 \sum_n \sum_k \left( b_k^t r_n^B \right) \left( (r_n^A)^t a_k \right) \\
&= \sum_n \left[ (r_n^A)^2 + (r_n^B)^2 \right] - 2 \sum_k b_k^t R a_k \\
&= \sum_n \left[ (r_n^A)^2 + (r_n^B)^2 \right] - 2 \sum_k (\pm\mu_k)
\end{aligned}
$$

where we have used the known forms of $U$ and $R$ in terms of the vectors $a_k$ and $b_k$ to arrive at the final (simple) expression in the last line. Since the sum $\sum_n \left[ (r_n^A)^2 + (r_n^B)^2 \right]$ is a constant (independent of the rotation matrix), the only term that can make a difference is $-2 \sum_k (\pm\mu_k)$. If we wish (and we do!) to make the distance as small as possible we should choose only positive values for the eigenvalues $\mu_k$. Hence to compute the minimal distance we need to compute only the eigenvalues of $R^t R$, which are $\mu_k^2$, and not the eigenvectors.
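As a minimal sketch of this shortcut, the function below evaluates the last line of the derivation using only the eigenvalues of $R^t R$. It is written in Python with NumPy for illustration and assumes centered, paired coordinates and $R = \sum_n r_n^B (r_n^A)^t$. All signs are taken as positive, so the condition on the determinant of $U$ is not enforced. The function name and the coordinates are hypothetical.

```python
import numpy as np


def minimal_squared_distance(coords_a, coords_b):
    """Squared distance minimized over U, from the eigenvalues of R^t R only.

    Assumption: centered, paired (N, 3) coordinates, R = sum_n r_n^B (r_n^A)^t.
    Every mu_k enters with a + sign; det(U) = +1 is NOT checked here.
    """
    A = np.asarray(coords_a, dtype=float)
    B = np.asarray(coords_b, dtype=float)
    R = B.T @ A
    mu_squared = np.linalg.eigvalsh(R.T @ R)       # R^t R is symmetric
    mu = np.sqrt(np.clip(mu_squared, 0.0, None))   # guard against tiny negative round-off
    return float((A ** 2).sum() + (B ** 2).sum() - 2.0 * mu.sum())


# Made-up, centered coordinates for illustration.
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0],
              [0.0, 0.0, 3.0],
              [-1.0, -2.0, -3.0]])
phi = 0.7
rot = np.array([[np.cos(phi), np.sin(phi), 0.0],
                [-np.sin(phi), np.cos(phi), 0.0],
                [0.0, 0.0, 1.0]])
d2_rotated = minimal_squared_distance(A, A @ rot.T)   # rotated copy of A
d2_scaled = minimal_squared_distance(A, 2.0 * A)      # copy of A scaled by 2
```

For the rotated copy the result should vanish up to round-off. For the scaled copy no rotation can improve on the identity, and the result should equal $\sum_n (r_n^A)^2$, which is 28 for these coordinates.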

There is, however, one caveat. So far we made sure that the constraint $U U^t = I$ is satisfied. However, this is not enough. Rotations in physical systems are expected to have the following form: a scalar (angular) displacement around a fixed (special) rotation axis (figure x).

*** PLACE FIGURE X HERE ***

Mathematically, this means a transformation to a coordinate system in which the rotation is done in a new XY plane while the axis of rotation is the new Z axis. Hence the rotation matrix takes the form (in the new coordinate frame)

$$U = \begin{pmatrix} \cos(\phi) & \sin(\phi) & 0 \\ -\sin(\phi) & \cos(\phi) & 0 \\ 0 & 0 & 1 \end{pmatrix}$$

*** PLACE FIGURE X HERE ***

Hence, any rotation in ordinary physical space can be determined by a single rotation axis (defined by a unit vector $e$) and a scalar rotation angle $\phi$ (two angles determine this special axis, and together with the rotation angle we obtain the three parameters that are required to define a rotation uniquely). The rotation matrices we construct are real and satisfy $U U^t = I$; however, we did not take care of the second condition for a proper rotation, $\det(U) = 1$. It is possible that the rotation matrix defined by $U = \sum_{k=1}^{3} b_k a_k^t$ has a determinant of –1 and therefore describes a reflection or an inversion. This can be tested for by explicit construction of the rotation matrix and calculation of the determinant. Suppose the determinant is negative, then what next?
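The two conditions can be checked directly. The sketch below, in plain Python, tests $U U^t = I$ and $\det(U) = 1$ for a 3×3 matrix stored as nested lists. The helper names (`det3`, `is_orthogonal`, `is_proper_rotation`), the tolerance, and the two example matrices are hypothetical choices made for illustration.

```python
import math


def det3(M):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))


def is_orthogonal(M, tol=1e-9):
    """Check the first condition, U U^t = I, element by element."""
    for i in range(3):
        for j in range(3):
            dot = sum(M[i][k] * M[j][k] for k in range(3))
            expected = 1.0 if i == j else 0.0
            if abs(dot - expected) > tol:
                return False
    return True


def is_proper_rotation(M, tol=1e-9):
    """Both conditions: U U^t = I and det(U) = +1."""
    return is_orthogonal(M, tol) and abs(det3(M) - 1.0) <= tol


phi = 0.3
rotation_z = [[math.cos(phi), math.sin(phi), 0.0],
              [-math.sin(phi), math.cos(phi), 0.0],
              [0.0, 0.0, 1.0]]
reflection = [[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, -1.0]]   # mirror through the XY plane
```

The matrix `rotation_z` has the form given above and passes both tests. The matrix `reflection` satisfies the orthogonality constraint but has determinant –1, which is exactly the case the constraint alone does not exclude.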

We clearly need to modify the rotation matrix to have a determinant of +1. This is the place where we can go back and re-investigate the ± sign we have placed before the eigenvalues. The smallest possible distance between the structures will be obtained for all positive eigenvalues (choosing only the + sign). However, if the rotation is not proper (the determinant is equal to –1), we need to compromise. We can change the sign of the determinant by changing the sign of one vector $b_k$ to $-b_k$. A negative eigenvalue $-\mu_k$ means that we also change the sign of the “secondary” eigenvector $b_k$ to $-b_k$. (Note that $R a_k = \mu_k b_k$; if we change the sign of the vector $b_k$ we must change the sign of the eigenvalue to maintain the equality, i.e. $R a_k = (-\mu_k) \cdot (-b_k)$.) Since the eigenvalues of $U$ are all of norm 1, changing the sign of one of the left eigenvectors ($b_k$) changes the sign of the corresponding eigenvalue and the sign of the determinant (as desired). Because flipping the sign of $\mu_k$ increases the squared distance by $4\mu_k$, the sign to flip is that of the smallest $\mu_k$. The distance between the two proteins will not be the shortest possible after the adjustment, but this is the price we have to pay in order to obtain a proper rotation. The distance between the two proteins defined in this way is also called the RMS (root mean square) distance.
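The adjustment can be summarized in a compact form. The expressions below are a sketch under two assumptions that are not stated in the text: the $\mu_k$ are ordered so that $\mu_1 \ge \mu_2 \ge \mu_3 > 0$, and $a_k$, $b_k$ are the vectors obtained with all positive signs. The symbol $\sigma$ is introduced here only for illustration.

```latex
\sigma = \operatorname{sign}\big(\det(R)\big) = \pm 1

U = b_1 a_1^t + b_2 a_2^t + \sigma\, b_3 a_3^t ,
\qquad \det(U) = +1

D^2_{\min} = \sum_n \left[ (r_n^A)^2 + (r_n^B)^2 \right]
             - 2 \left( \mu_1 + \mu_2 + \sigma \mu_3 \right)
```

When $\sigma = +1$ nothing changes. When $\sigma = -1$ only the smallest $\mu_k$ enters with a negative sign, so the squared distance grows by $4\mu_3$ relative to the unconstrained minimum.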

1.b.iv.3 Why is structural alignment difficult?

So far we have considered the calculation of the distance between two protein structures (say $A$ and $B$) for which the alignment is known. In the next step we consider the procedure of structural alignment. It would be nice if we could use the same successful approach (dynamic programming) that we adopted in sequence alignment. Unfortunately, the analogy is not simple. To appreciate the difficulties let us reconsider the line of thought that led to dynamic programming in the comparison of two sequences and attempt to adjust it for the comparison of structures. Consider an alignment of total length $l$ of two structural fragments, of lengths $n$ and $m$ amino acids respectively. The alignment includes $l - k$ gaps, and therefore the number of amino acid-to-amino acid comparisons is $k$. For the moment we will ignore the score of the gaps and consider only aligned pairs of structural sites in the calculation of the similarity. For an alignment of length $l$ we expect $k \le l$ sites without gaps. We wish to score this alignment based on the distance measure we developed in the present section, and the experience we gathered from sequence alignment. For example, we expect to score the $l - k$ gaps by the addition of constant gap penalties, $(l - k) \cdot g$; the value of $g$ is to be determined empirically. For simplicity we do not consider the option of gap opening and extension. The next step is to decide on scoring the actual comparison of structures (alignment of amino acids against other amino acids). By now we know that the optimal distance between the two structures (after overlap) depends on the rotation only through the singular values of the $R$ matrix (the square roots of the eigenvalues of the matrix $R^t R$), and decreases monotonically as they grow. Here the $R$ matrix is computed for $k$ pairs of three dimensional vectors pointing to the Cα-s of aligned amino acids.
The larger the sum of the (absolute) singular values, $T = \sum_{i=1,2,3} \mu_i$, the more similar are the two structures, and the smaller is the distance between them as defined in the section about structural overlaps. It makes sense to use the above $T(k)$ (of $k$ structural sites) as a similarity score. Following the definition of the score in sequence alignment we define the total score as a sum, $T_{total}(k) = T(k) + (l - k) g$. Assume that the current alignment is optimal and we seek an extension to the present alignment that is optimal as well. In sequence alignment we asked how the score changes with each of the three options for an extension of the alignment, and we attempt a similar approach here. The options are: (i) extension by a gap against a structural site from structure $A$, (ii) extension by a gap against a structural site from structure $B$, or (iii) an extension of the alignment by a pair of structural sites. The gap penalty can be handled similarly to sequence alignment: we add a gap penalty $g$ (determined empirically) to the optimal score (so far) $T$. The difficulty arises when we add two position vectors of the next amino acid (e.g. the position of the Cα atom), one from structure $A$ and one from structure $B$. It is not clear at all that $T_{total}(k+1) = T_{total}(k) + c(l+1)$, where $c(l+1)$ is a scalar that depends only on the properties of the last pair of amino acids that extend the alignment from $l$ to $l+1$; for example, $c(l+1)$ may depend on the distance $|r_{l+1}^A - r_{l+1}^B|$. The problem is that the addition of two vectors can change the rotation matrix $U$, which in turn will change the similarity score $T_{total}(k)$. Therefore it is not possible to determine exact partial alignments. The orientation and the rotation matrix always depend on the previous alignment, so extending it changes the score of the (sub)structure of $k$ structural sites that was already aligned.
As of the writing of this book there is no efficient algorithm (of the dynamic programming type) that is able to optimize the most widely used measure of protein distance (and similarity), namely the RMS distance. An assumption used in a number of algorithms is to keep the same rotation matrix, or to assume that the new rotation matrix is not very different and (or) does not impact the similarity score. In that case we could continue with business as usual with a dynamic programming algorithm. Below we describe how a heuristic of this type can be used in the context of the URMS. Other, fundamentally different algorithms are described in the next section.
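A minimal sketch of the frozen-rotation heuristic follows, in plain Python. It assumes that the coordinates of structure $A$ have already been rotated by a fixed $U$, so that the score of a new pair depends only on that pair and ordinary dynamic programming applies. The pair score `cutoff**2 - |r_i^A - r_j^B|**2`, the negative gap value, the function name, and the coordinates are hypothetical choices for illustration and are not taken from the text.

```python
def align_with_fixed_rotation(coords_a, coords_b, gap=-1.0, cutoff=2.0):
    """Global alignment of two coordinate lists under a frozen superposition.

    Assumption: coords_a is already rotated by the fixed U. The pair score
    cutoff**2 - |r_i^A - r_j^B|**2 is a made-up choice for this sketch.
    Returns (best score, list of (i, j) pairs; None marks a gap).
    """
    n, m = len(coords_a), len(coords_b)

    def pair_score(i, j):
        d2 = sum((x - y) ** 2 for x, y in zip(coords_a[i], coords_b[j]))
        return cutoff ** 2 - d2

    # T[i][j]: best score aligning the first i sites of A with the first j of B.
    T = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        T[i][0] = T[i - 1][0] + gap
    for j in range(1, m + 1):
        T[0][j] = T[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            T[i][j] = max(
                T[i - 1][j - 1] + pair_score(i - 1, j - 1),  # pair of sites
                T[i - 1][j] + gap,                           # site of A against a gap
                T[i][j - 1] + gap,                           # site of B against a gap
            )

    # Trace back from the corner to recover one optimal alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and T[i][j] == T[i - 1][j - 1] + pair_score(i - 1, j - 1):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or T[i][j] == T[i - 1][j] + gap):
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    return T[n][m], pairs[::-1]


# Made-up C-alpha positions: structure B lacks the middle site of structure A.
chain_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
chain_b = [(0.0, 0.0, 0.0), (7.6, 0.0, 0.0)]
score, pairs = align_with_fixed_rotation(chain_a, chain_b)
```

The recursion is exact only because the rotation is frozen. If $U$ were recomputed after each extension, the scores stored in `T` would no longer be valid for the extended alignment, which is the difficulty described above.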

IV.2.3 The URMS and approximate structural alignments

The URMS is an alternative distance measure introduced in a paper by Chew et al [x]. The measure is closely related to the RMS distance but also has important differences. In particular, the URMS builds on specific properties of protein structures, namely that proteins are one-dimensional chains of monomers and that distances between sequential Cα atoms are to a good approximation fixed at a constant value of 3.8 Å. A protein chain with $N$ amino acids can be represented (at the level of backbone description only) as a sequence of $N - 1$ vectors connected head-to-tail between the positions of the Cα atoms of the $N$ amino acids. A three dimensional picture of such a representation of protein structure is shown below, together with a more detailed picture of the protein shape. We clearly see that

IV.2.3 Approximate procedures for structural alignment *** MISSING a COMPONENT

IV.3 More complex models

We have demonstrated an interplay between sequences and structures that allows for better alignment and identification schemes. It is clear that sequences have a special role in bioinformatics (they are more readily available experimentally, and are coded genetically through DNA sequences). Nevertheless, we should use additional information if it is available to us. Rather than thinking of a substitution matrix as a function of two parameters, the two amino acids of the proteins at the site under consideration, we can think of a vector of descriptors describing the sites of the proteins to the best of our abilities. The vector of descriptors need not be complete to the same extent on both sides. One of the proteins under consideration is an unknown, for which we know only the amino acid sequence; the other may be a well characterized protein for which we know the function, the secondary and tertiary structure, the location in a metabolic pathway, etc. A single site of the protein may include the following descriptors:

descriptor vector = (seq, secondary str, contacts, exposed surface area, evolutionary profile)
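One possible way to hold such a descriptor vector is sketched below in plain Python. It assumes that unknown entries are simply left empty, which reflects the remark that the vector need not be complete to the same extent on both sides. The class name, the field types, and the example values are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class SiteDescriptor:
    """Descriptors of a single protein site; unknown entries stay None."""
    seq: str                                                  # identity of the amino acid
    secondary_str: Optional[str] = None                       # e.g. "helix", "strand", "coil"
    contacts: Optional[int] = None                            # residues in close proximity
    exposed_surface_area: Optional[float] = None              # units not fixed in this sketch
    evolutionary_profile: Optional[Sequence[float]] = None    # representation is an assumption

    def known_fields(self):
        """Names of the descriptors that are actually available for this site."""
        return [name for name, value in vars(self).items() if value is not None]


# An unknown protein: only the sequence is available (made-up values).
query_site = SiteDescriptor(seq="K")

# A well characterized protein: several descriptors are known (made-up values).
template_site = SiteDescriptor(seq="R", secondary_str="helix",
                               contacts=4, exposed_surface_area=85.0)
```

A scoring function for a pair of sites could then restrict itself to the descriptors that are known on both sides, for example the intersection of the two `known_fields()` lists.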


Let us examine what the different typical terms mean. The first entry (seq) is the sequence, or the identity of the amino acid at the site. This is what we have dealt with up to now. The second entry (secondary str) is the secondary structure of that particular site, usually classified into 3 to 7 different types. The simplest classification scheme includes a helix, a beta strand and a coil (or anything else) as the three available secondary structure states.

... is helping us in better understanding the relationship between different proteins ... inefficient algorithm that finds an optimal alignment for structural alignment, and even the definition of a similarity (or a distance) measure is not uniformly accepted by different researchers. We shall describe in the next section an algorithm for structural alignment without gaps. However, we emphasize that the number of structural alignment algorithms is probably equal to the number of research groups in the field. While we can provide a passionate argument in favor of a specific type of algorithm (and we will), we hardly believe that we will significantly reduce the number of algorithms that are out there. The most popular algorithms at present are the ones that have been around the longest (e.g. Dali) and the ones with convenient and efficient web interfaces (such as CE). Whatever algorithm we finally use, the current status of the field is not optimal. The alignment will be either approximate for a (more or less) precise definition of a distance or score function, or exact using an approximate definition of the score that (for example) does not satisfy the usual properties of distance in Euclidean space.

However, there is one major caveat. The number of experimental structures that we have at hand is significantly smaller than the number of sequences. While many of the sequences can be assigned

... matrices obtained from ... we consider using dynamic programming subject to a guessed substitution matrix to begin with. Whatever the initial guess for a substitution matrix is, the initial alignments must be simple, so that accurate alignments will be obtained even for very crude substitution matrices. It is therefore necessary to start with a very easy alignment task for which we could write (or almost write) the alignment by inspection, regardless of the quality of the guess for a substitution matrix. Consider an alignment of a sequence $A$ against a sequence $B$. The alignment is given and is done “by hand” using biological or chemical intuition and without gaps. For example, we can use highly similar proteins for which the degree of identity (and similarity) between the sequences is very high (e.g. BLOSUM 50 implies identity of 50 percent of the sequences). Consider the probability of having an amino acid $a$ substituted by another amino acid $b$. The substitution is detected after aligning the sequence $A$ against the sequence $B$. Here we exploit to a maximum the log-odds ratio of probabilities as a means of scoring similarity between distributions. If $a$ and $b$ are amino acids sampled from a sequence $A$ and $B$ respectively