Matching Observed Alpha Helix Lengths to Predicted Secondary Structure∗

Brian Cloteaux†
National Institute of Standards and Technology
Gaithersburg, Maryland, USA

Nadezhda Serova
University of Maryland, Baltimore County
Baltimore, Maryland, USA

∗Official contribution of the National Institute of Standards and Technology; not subject to copyright in the United States.
†Corresponding author.

Abstract

Because of the complexity in determining the 3D structure of a protein, the use of partial information determined from experimental techniques can greatly reduce the overall computational expense. We investigate the problem of matching experimentally observed lengths of helices to the predicted secondary structure of a protein. We give a simple and fast algorithm for producing a library of possible solutions. We then test our algorithm with a series of computational experiments that predict the alpha helix placement for proteins whose order is already known. These tests suggest that our method, if given a good prediction of the protein's secondary structure, can generate high quality lists of potential placements of the helix lengths onto the protein sequences.

1 Introduction

Understanding how specific proteins fold, or arrange themselves in three dimensional space (3D) based on environmental and internal chemical constraints, is necessary to determine how these proteins function. But even with the amino acid sequence of the proteins (1D structure) known, the prediction of their corresponding 3D structure is an extremely challenging problem.

This challenge is both experimental and computational. Proteins require precise environments to fold properly. Because of the numerous complications in measuring these proteins under the correct environment, experimental methods for ascertaining the 3D arrangements are expensive, time consuming, and of limited accuracy. X-ray crystallography, for example, is a powerful technique for determining structures; however, it is ineffective for proteins that are not easily crystallized, such as membrane proteins. Many other methods provide only partial information about the protein's 3D structure.

At the same time, the computational problem of determining the 3D structure of these proteins is, in general, intractable. To reduce the difficulty of this problem, a recent approach has been to use computational methods to match observed secondary structure to the possible placements on the 1D structure. This paper extends an original investigation by He, Lu, and Pontelli [4] of the problem of matching observed lengths of alpha helices from the electron cryomicroscopy technique to the predicted areas of secondary structure.

Electron cryomicroscopy can be used to produce a density map of some proteins. Although with current technology the resolutions of such maps are relatively low, certain secondary structures (such as alpha helices) can still be identified. These alpha helices can be identified as lengths of amino acids, although the location of these helix lengths on the protein sequence is not clear. He, Lu, and Pontelli suggested placing these observed lengths according to the predicted probability of the individual amino acids in the sequence being in a helix. Prediction servers have been created that use information about the protein's sequence to predict the placement of alpha helices on the 1D sequence. Still, these methods have limited accuracy, and so the best result that can be computed is a set of possible arrangements of the observed lengths that a researcher can use as a starting point in determining 3D structure.

This paper offers two contributions to the matching of observed lengths to their placement on the 1D protein structure. The first is to examine the complexity and necessity of computing the optimal length placement. We give evidence that computing optimal solutions may not be worth the computational expense.

The second contribution is to introduce a new approach to computing possible arrangements. He, Lu, and Pontelli introduced a method to produce a library of likely mappings to serve as starting points for a researcher. Our approach is similar to the He, Lu, and Pontelli method in the sense that it does not generate all of the possible mappings, nor does it try to find an optimal solution. Instead, we give a simple heuristic algorithm that gives a good approximation of the placement of the lengths. Using this approximation as a starting point, we then randomly modify it to look for other possible solutions. We collect the best arrangements to use as a library of possible mappings.

To test our approach, we compared the predicted length placement produced by our algorithm to the actual ordering on several known proteins. These tests show that our method is a fast and simple approach for producing high quality possible placements of observed alpha helix lengths onto a protein's sequence.

2 The Maximal Cover Sum Problem

Before we examine the problem of mapping the observed alpha helices to the predicted secondary structure, we first consider a closely related problem that we call the maximal cover sum problem. This problem consists of a set $C$ of covers with an associated function $\omega : C \to \mathbb{N}$ that gives the length of each cover. There is also an $n$-length string $P$ of positive real numbers, i.e. $P \in \mathbb{R}_+^n$. The expression $P_i$ is used to denote the $i$th value in the string $P$.

We define a placement of the covers on the string using an index function $I : C \to \{1..|P|\}$. The value $I(c)$ gives the index in $P$ of the first location to place the cover $c$. Any index function has the following three restrictions.

1. $\forall c_1, c_2 \in C$, $I(c_1) = I(c_2)$ if and only if $c_1 = c_2$

2. $\forall c_1, c_2 \in C$, if $I(c_1) < I(c_2)$ then $I(c_1) + \omega(c_1) < I(c_2)$

3. $\forall c \in C$, $I(c) + \omega(c) \le |P|$

The first two restrictions say that covers are not allowed to overlap as they cover the string $P$. The last restriction prevents covers from extending beyond the length of the string $P$. These restrictions trivially imply that if for a problem instance the condition

\[ |P| \ge \sum_{c \in C} \omega(c) \]

does not hold, then no index function can exist for that instance.

The maximal cover sum problem is then to find an index function that maximizes the expression

\[ \sum_{c \in C} \sum_{j=0}^{\omega(c)-1} P_{j+I(c)} \]

In other words, find an arrangement of the covers in $C$ that covers the largest total of values on $P$.

We are interested in this problem since we can view matching the observed lengths of helices to the predicted secondary structure of a protein as a maximal cover sum problem. When examining possible arrangements of the helices onto the protein string, we would expect that the arrangements that cover the maximal value for the predicted probabilities of helices would be the most likely to occur in the actual protein. Thus we would be most interested in examining those arrangements first while trying to determine the 3D structure of the protein.

In considering how to obtain an optimal arrangement, we first notice that for any given ordering of the covers, we can compute an optimal covering using that ordering in polynomial time. To show this, we define $\pi$ as an ordering of the cover set $C = \{c_1, c_2, \ldots, c_n\}$, i.e. the value of $\pi(i)$ is the position of the element $c_i \in C$ in the order $\pi$. The inverse function $\pi^{-1}$ then takes a position $i$ in the ordering and returns the cover in that position. Using a given order $\pi$, we can define the following recurrence equation that determines the size of an optimal covering using $\pi$.

\[
m(a,b) =
\begin{cases}
0 & \text{if } a = 0 \text{ or } b = 0, \\
\max\left\{ m(a, b-1),\; m\bigl(a-1,\, b-\omega(\pi^{-1}(a))\bigr) + \sum_{k=b-\omega(\pi^{-1}(a))+1}^{b} P_k \right\} & \text{otherwise}
\end{cases}
\]

In this definition, $m(a,b)$ is the value of the maximal covering of the first $a$ covers in the order $\pi$ onto the first $b$ positions of the string $P$. The recursion is based on determining whether or not an optimal covering covers position $b$ of $P$ with $\pi^{-1}(a)$. By using dynamic programming, this recurrence and its associated index function can be computed in $O(|C| \cdot |P|)$ time.

Thus the complexity in the maximal cover sum problem stems from finding an order that produces an optimal covering. In general, finding an optimal ordering is superpolynomial in the number of covers unless P = NP. This is a consequence of the fact that we can reduce the set partition problem, which is NP-complete [3], to the maximal cover sum problem. To see this, consider an instance of the set partition problem with a multiset of values $S$. Using the multiset $S$, we can create an instance of the maximal cover sum problem by making the set of covers $C$ where $|C| = |S|$ and the lengths of the covers are the values in $S$. We then create a string $P$ where

\[ P = \underbrace{1,1,1,\ldots,1,1}_{\ell},\, 0,\, \underbrace{1,1,1,\ldots,1,1}_{\ell} \]

and $\ell = \frac{1}{2} \sum_{c \in C} \omega(c)$. The point of this construction is that the maximal cover sum is equal to $\sum_{c \in C} \omega(c) = 2\ell$ if and only if the multiset $S$ can be equally partitioned.

Although the set partition problem is NP-complete, our reduction does not necessarily prove that the maximal cover sum problem is NP-hard. This is because the number of bits needed in our constructed string $P$ can potentially be exponential in the number of bits in $S$ and so the size of the instance

Protein id   Number of   Minimum       Average       Minimum    Average
             optimal     Kendall-tau   Kendall-tau   Hamming    Hamming
             solutions   distance      distance      distance   distance
1CC5         1           0.667         0.667         0.750      0.750
6TMN E       2           0.238         0.310         0.286      0.286
3TIM A       1           0.100         0.100         0.400      0.400
2TSC A       5           0.095         0.305         0.429      0.600
1ECA         6           0.190         0.294         0.429      0.714
1GD1 O       2           0.067         0.100         0.333      0.500
1L58         8           0.393         0.429         0.625      0.688
2PHH         1           0.476         0.476         1.000      1.000

Table 1.

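The construction used in the reduction from set partition can also be sketched in a few lines of Python. The helper names `build_instance` and `cover_sum` are hypothetical and chosen only for illustration. `build_instance` creates the cover lengths and the string P (ℓ ones, a single 0, ℓ ones) from a multiset S, and `cover_sum` evaluates the objective for a given index function with 1-based start positions. As an assumption of this sketch, covers that touch without sharing a position count as non-overlapping, which the construction needs in order to fill each block of ones completely.

```python
def build_instance(S):
    """Turn a set-partition multiset S into (cover lengths, string P)."""
    total = sum(S)
    if total % 2:
        raise ValueError("an odd total can never be split into equal halves")
    half = total // 2  # this is the value called l in the text
    return list(S), [1] * half + [0] + [1] * half


def cover_sum(lengths, starts, P):
    """Sum of the values of P covered by the placement (1-based starts)."""
    occupied = set()
    total = 0
    for w, i in zip(lengths, starts):
        cells = range(i, i + w)  # positions i .. i + w - 1
        if i < 1 or i + w - 1 > len(P) or occupied.intersection(cells):
            raise ValueError("cover leaves the string or overlaps another cover")
        occupied.update(cells)
        total += sum(P[k - 1] for k in cells)
    return total


# S = {3, 1, 1, 2, 2, 1} splits into {3, 2} and {1, 1, 2, 1}, both summing to 5.
lengths, P = build_instance([3, 1, 1, 2, 2, 1])
packed = cover_sum(lengths, [1, 7, 8, 4, 9, 11], P)      # each half filled exactly
straddling = cover_sum(lengths, [1, 4, 5, 6, 8, 10], P)  # one cover sits on the 0
```

In this example the placement that follows the equal partition reaches 2ℓ = 10, while the placement in which a cover lies across the central 0 loses one unit. This is the behaviour the reduction relies on: the full value 2ℓ is attainable only when no cover touches the 0, that is, only when the lengths split into two groups of total ℓ each.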