Methods for Global Organization of All Known Protein Sequences
Thesis submitted for the degree "Doctor of Philosophy"

Golan Yona

Submitted to the Senate of the Hebrew University in the year 1999

This work was carried out under the supervision of Prof. Nathan Linial, Prof. Naftali Tishby and Dr. Michal Linial.

Acknowledgments

First, I would like to thank my advisors: Nati Linial and Tali Tishby of the computer science department and Michal Linial of the life science department at the Hebrew University. Within the last five years we have come a long way together, and I'm grateful for their guidance, support, warmth and generous help all these years.

I would like to thank my colleagues and friends in the machine learning lab, with whom I spent most of the last five years: Itay Gat, Lidror Troyansky, Elad Schneidman, Shlomo Dubnov, Shai Fine, Noam Slonim, Ofer Neiman and Ran El-Yaniv, for all their help and for being great company. A special thanks to Ran El-Yaniv, a dear friend who encouraged me to keep believing in myself. I very much enjoyed working with Gill Bejerano, with whom I am still working on fascinating new directions.

I would next like to thank all of my colleagues and friends in the molecular genetics and biotechnology lab headed by Hanah Margalit: Ora Schueler, Eyal Nadir, Iddo Friedberg, Yael Mandel and Yael Altuvia, for the great time I had every Thursday in our journal club group meetings. Besides the many papers I had in my bag after each such meeting, it was also great fun. A very special thanks to Yael Altuvia for her continuous and wonderful help and support, and to Hanah Margalit, who was like an advisor for me, always happy to answer questions and give advice, as well as give me encouragement in hard times.

I thank Amnon Barak for making the MOSIX parallel system available, and the people in Compugen Ltd. who gave me access to their Bioccelerator and to their software.
It was a pleasure to work with Alex Kremer, Avi Kavas, Yoav Etsion and Daniel Avrahami, the great team of students with whom I created the ProtoMap web site.

I thank Michael Levitt, Steven Brenner, Patrice Koehl, Boaz Shaanan, Yoram Gdalyahu and Ran El-Yaniv for critically reading parts of this manuscript and for making many helpful comments, and Avner Magen for valuable suggestions. A special thanks to Nati Linial, my advisor, who with much patience read most of this thesis, commenting, correcting my English, and improving my writing style.

To my family, especially my mother and father for their tremendous love and encouragement, and for continually reminding me that they will be happy with whatever I choose to do (always leaving me the option of becoming a carpenter). And last, to my best friends, Yoram Gdalyahu and Rami Doron, for their enormous help and for their invaluable friendship during these intensive years.

Contents

1 Introduction
  1.1 What are proteins?
  1.2 Functional analysis of protein sequences
  1.3 The explosion of biological information
  1.4 Towards global organization
2 Comparison of protein sequences
  2.1 Alignment of sequences
    2.1.1 Global similarity of protein sequences
    2.1.2 Penalties for gaps
    2.1.3 Global distance alignment
    2.1.4 Position dependent scores
    2.1.5 Local alignments
  2.2 Other algorithms for sequence comparison
    2.2.1 BLAST (Basic Local Alignment Search Tool)
    2.2.2 FASTA
  2.3 Probability and statistics of sequence alignments
    2.3.1 Basic random model
    2.3.2 Statistics of global alignment
    2.3.3 Statistics of local alignment
  2.4 Scoring matrices and gap penalties
    2.4.1 The PAM family of scoring matrices
    2.4.2 The BLOSUM family of scoring matrices
    2.4.3 Information content of scoring matrix
    2.4.4 Gap penalties
3 Large scale analyses of protein sequences
  3.1 Motif and domain based analyses
    3.1.1 The PROSITE dictionary [Bairoch 1991]
    3.1.2 The BLOCKS database [Henikoff & Henikoff 1991]
    3.1.3 The ProDom database [Sonnhammer & Kahn 1994]
    3.1.4 The Pfam database [Sonnhammer et al. 1997]
    3.1.5 The DOMO database [Gracy & Argos 1998]
  3.2 Protein based analysis
    3.2.1 The study by [Gonnet et al. 1992]
    3.2.2 The study by [Harris et al. 1992]
    3.2.3 The study by [Watanabe & Otsuka 1995]
    3.2.4 The PIR classification [Barker et al. 1996]
    3.2.5 The COGS database [Tatusov et al. 1997]
    3.2.6 The SYSTERS database [Krause & Vingron 1998]
  3.3 Alternative representations of protein sequences
    3.3.1 The study by [van Heel et al. 1991]
    3.3.2 The study by [Ferran et al. 1994]
    3.3.3 The study by [Han & Baker 1995]
    3.3.4 The study by [Casari et al. 1995]
  3.4 Structural classifications of proteins
  3.5 Summary
4 Part I - The Euclidean embedding approach
  4.1 Constructing a metric space on the protein sequence space
  4.2 Why embed?
  4.3 Embedding strategies
    4.3.1 Classical approach
    4.3.2 Current approach
  4.4 Application of embedding techniques for protein sequences
5 Data clustering
  5.1 Introduction
    5.1.1 Basic definitions
    5.1.2 Clustering methods
    5.1.3 Partitional clustering algorithms
    5.1.4 The basic clustering procedure
    5.1.5 Hierarchical clustering algorithms
    5.1.6 Cluster validation
  5.2 Our clustering algorithm
    5.2.1 Algorithm outline
    5.2.2 Cross validation
    5.2.3 Algorithm sketch
    5.2.4 Conditions for agreement
6 Global organization of protein segments
  6.1 Results
    6.1.1 Clusters of protein sequences
    6.1.2 "Fingerprints" of biological families based on cluster membership
    6.1.3 Higher-level measures of similarity between sequences
  6.2 Discussion
7 Part II - The graph-based approach
  7.1 Preface
  7.2 Pairwise clustering algorithms
    7.2.1 The single linkage algorithm
    7.2.2 The complete linkage algorithm
    7.2.3 The average linkage algorithm
    7.2.4 Other pairwise clustering algorithms
8 The graph based approach - representation and algorithms
  8.1 Introduction
    8.1.1 Related works
  8.2 Methods
    8.2.1 Defining the graph
    8.2.2 Placing all methods on a common numerical scale
    8.2.3 Neighbors' lists
    8.2.4 Exploring the connectivity
9 ProtoMap: Automatic classification of protein sequences, a hierarchy of protein families, and local maps of the protein space.
  9.1 General information on clusters
  9.2 Performance evaluation
    9.2.1 The evaluation methodology
    9.2.2 The reference databases
    9.2.3 Evaluation results
    9.2.4 Critique of the evaluation methodology
    9.2.5 New clusters
  9.3 ProtoMap as a tool for analysis
    9.3.1 Tracing the formation of clusters