Signature of Author
Total Page:16
File Type:pdf, Size:1020Kb
An Analysis of Representations for Protein Structure Prediction by Barbara K. Moore Bryant S.B., S.M., Massachusetts Institute of Technology (1986) Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY September 1994 ( Massachusetts Institute of Technology 1994 SignatureofAuthor ............. ,...... ........-.. .. Department of Electrical Engineering and Computer Science September 2, 1994 Certified by... ..... -- Patrick. H. Winston Professor, Electrical Engineering and Computer Science This Supervisor %-"ULfrtifioAvL.Lv--k &y ho, . Tomas o) no-Perez and Computer Science L! Thp.i. .,nPrvi.qnr Accepted by .. ... , a. ,- · F' drY................... / Frederic~) R. Morgenthaler Departmental /Committee on Graduate Students Iv:r tfF:r _ ;: ! --.. t.8r'(itr i-7etFr An Analysis of Representations for Protein Structure Prediction by Barbara K. Moore Bryant Submitted to the Department of Electrical Engineering and Computer Science on September 2, 1994, in partial fulfillment of the requirements for the degree of Doctor of Philosophy Abstract Protein structure prediction is a grand challenge in' the fields of biology and computer science. Being able to quickly determine the structure of a protein from its amino acid sequence would be extremely useful to biologists interested in elucidating the mechanisms of life and finding ways to cure disease. In spite of a wealth of knowl- edge about proteins and their structure, the structure prediction problem has gone unsolved in the nearly forty years since the first determination of a protein structure by X-ray crystallography. In this thesis, I discuss issues in the representation of protein structure and se- quence for algorithms which perform structure prediction. There is a tradeoff between the complexity of the representation and the accuracy to which we can determine the empirical parameters of the prediction algorithms. I am concerned here with method- ologies to help determine how to make these tradeoffs. In the course of my exploration of several particular representation schemes, I find that there is a very strong correlation between amino acid type and the degree to which residues are exposed to the solvent that surrounds the protein. In addition to confirming current models of protein folding, this results suggests that solvent exposure should be an element of protein structure representation. Thesis Supervisor: Patrick H. Winston Title: Professor, Electrical Engineering and Computer Science Thesis Supervisor: Tomas Lozano-Perez Title: Professor, Electrical Engineering and Computer Science 2 Acknowledgments F'or their support, I thank my thesis committee, the members of the Artificial Intelli- gence Laboratory, Jonathan King's group, the biology reading group, Temple Smith's core modeling group at Boston University, NIH, my friends and family. The following people are some of those who have provided support and encour- agement in many ways. To those who I've forgotten to list, my apologies, and thanks. Bill Bryant, Ljubomir Buturovic, John Canny, Peggy Carney, Kevin Cunningham, Sudeshna Das, Cass Downing-Bryant, Rich Ferrante, David Gifford, Lydia Gregoret, Eric Grimson, Nomi Harris, Roger Kautz, Jonathan King, Ross King, Tom Knight, Kimberle Koile, Rick Lathrop, Tomas Lozano-P6rez, Eric Martin, Mark Matthews, Michael de la Maza, Bettina McGimsey, Scott Minneman, Doug Moore, Jan Moore, Stan Moore, Ilya Muchnik, Raman Nambudripad, Sundar Narasimhan, Carl Pabo, Marilyn Pierce, Geoff Pingree, Tomaso Poggio, Ken Rice, Karen Sarachik, Richard Shapiro, Laurel Simmons, Temple Smith, D. Srikar Rao, Susan Symington, James Tate, Bruce Tidor, Bruce Tidor, Lisa Tucker-Kellogg, Paul Viola, Bruce Walton, Teresa Webster, Jeremy Wertheimer, Jim White, Patrick Winston, and Tau-Mu Yi. 3 Contents 1 Overview 15 1.1 Why predict protein structures? .............. .. 15 1.2 Background ......................... 20 1.2.1 Secondary structure prediction . 23 1.2.2 Tertiary structure prediction . 25 1.3 My work . 30 1.3.1 Hydrophobic collapse vs. structure nucleation and propagation 31 1.3.2 Modeling hydrophobic collapse. 32 1.3.3 Pairwise interactions: great expectations . .. 33 1.3.4 Amphipathicity ................... 34 1.4 Outline of thesis . 34 2 Background 35 2.1 Knowledge representation. 35 2.1.1 Single residue attributes ................... 36 2.1.2 Attributes of residue pairs .................. 39 2.2 Contingency table analysis ..... .............. 39 2.2.1 Contingency tables. 40 2.2.2 Questions asked in contingency table analysis ....... 40 2.2.3 Models of data ........................ 41 2.2.4 A simple contingency table example ............. 42 2.2.5 Loglinear models ....................... 47 2.2.6 Margin counts . 48 2.2.7 Computing model parameters. 49 2.2.8 Examining conditional independence with loglinear models 49 2.2.9 Comparing association strengths of variables ........ 49 2.2.10 Examples of questions .................... 52 3 Single-Residue Statistics 54 3.1 Introduction. 54 3.2 Method .............. ............. 54 3.2.1 Data . ............. 54 3.2.2 Residue attributes . ............. 55 3.2.3 Contingency table analysis ............. 55 3.3 Results and Discussion ...... ............. 56 4 3.3.1 Loglinear models. ..... 56 3.3.2 Model hierarchies. ..... 59 3.3.3 Nonspecific vs. specific amino acid representation . ..... 65 3.4 Conclusions ......................... .... 71 3.4.1 Solvent exposure is of primary importance .... ..... 71 3.4.2 Grouping residues by hydrophobicity class .... ..... 71 3.4.3 Attribute preferences .............. ...... 72 4 Paired-Residue Statistics 73 4.1 Introduction. 73 4.2 Methods. 74 4.2.1 Data. 74 4.2.2 Representational attributes .............. 74 4.3 Results and Discussion ..................... 76 4.3.1 Non-contacting pairs. 76 4.3.2 Contacting pairs; residues grouped into three classes . 81 4.3.3 Contacting residues: all twenty amino acids ...... 85 4.4 Conclusions ........................... 87 5 Distinguishing Parallel from Antiparallel 89 5.1 Summary . 89 5.2 Methods ............................. 91 5.2.1 Protein data ....................... 91 5.2.2 Definition of secondary structure, and topological relationships. 91 5.2.3 Counting and statistics ..................... 92 5.2.4 Contingency table analysis .................... 94 5.3 Results and Discussion .......................... 94 5.3.1 Amino acid compositions of parallel and antiparallel beta struc- ture. 94 5.3.2 Solvent exposure ........ .101 5.3.3 Position in sheet ........ 105 5.3.4 Grouping residues into classes . 106 5.3.5 Contingency Table Analysis . 106 5.4 Implications ............... ... ... 111 5.4.1 Protein folding and structure . ...... ... .... 111 5.4.2 Secondary structure prediction . ... ... 111 5.4.3 Tertiary structure prediction . 112 6 Pairwise Interactions in Beta Sheets 113 6.1 Introduction. 113 6.2 Method. 114 6.2.1 Significance testing ....... 115 6.3 Results and Discussion ......... 115 6.3.1 Counts and preferences ..... 115 6.3.2 Significant specific "recognition" of beta pairs . .. 128 6.3.3 Significant nonspecific recognition ................ 128 6.3.4 Solvent exposure ........... 129 6.3.5 Five-dimensional contingency table . 131 6.3.6 Nonrandom association of exposure ............... 131 6.3.7 Unexpected correlation in buried (i,j + 1) pairs. 135 6.4 Conclusions.......... 139 7 Secondary Structure Prediction 141 7.1 Introduction. 141 7.2 Related Work ............................... 142 7.2.1 Hydrophobicity patterns in other secondary structure predic- tion methods .................. 142 7.2.2 Neural Nets for Predicting Secondary Structure 144 7.2.3 Periodic features in sequences ......... 146 7.3 A representation for hydrophobicity patterns ..... 147 7.3.1 I, and I: maximum hydrophobic moments 148 7.3.2 Characterization of I, and I, ......... 150 7.3.3 De,cision rule based on I, and Io ....... 152 7.4 Method ............ 153 . :. :. 7.4.1 Neural network .... 157 7.4.2 Data .......... 159 7.4.3 Output representation 159 7.4.4 Input representation . 161 7.4.5 Network performance 162 7.4.6 Significancetests . 165 7.5 Results. ............ ............. 166 7.5.1 Performance. ............. 166 7.5.2 Amphipathicity .... ............. 166 7.5.3 Amino acid encoding ............. 168 7.5.4 Learning Curves .... ............. 169 7.5.5 ROC curves ...... ............. 169 7.5.6 Weights ........ ............. 173 7.6 Conclusion. .......... ............. 180 8 Threading 181 8.1 Introduction. 182 8.1.1 Threading algorithm. 182 8.1.2 Pseudopotentials for threading ...... 188 8.1.3 Sample size problems ........... 189 8.1.4 Structure representations. 192 8.1.5 Incorporating local sequence information 193 8.2 Method ...................... 194 8.2.1 Data. 194 8.2.2 Threading code .............. 194 8.2.3 Structure representations ......... 194 6 S.2.4 Amino acid counts ........................ 196 $.2.5 Scores for threading; padding .................. 196 8.2.6 Incorporating local sequence information. 197 8.3 Results and Discussion ........................ 197 $.3.1 Counts. 197 S.3.2 Likelihood ratios ........... 197 8.3.3 Comparison of singleton pseudopotential components 197 8.3.4 Incorporating sequence window information .......... 205 S.3.5 Splitting the beta structure representation into parallel and an- tiparallel. 207 8.4 Conclusions. 210 9 Work in Progress 212 9.1 (Computer representations of proteins. 212 9.1.1 Automation. 212