Advanced Stochastic Protein Sequence Analysis

Thomas Plötz

Dipl.-Inform. Thomas Plötz
Applied Computer Science Group (AG Angewandte Informatik)
Faculty of Technology (Technische Fakultät)
Universität Bielefeld
email: [email protected]

Print of the approved dissertation for the academic degree of Doktor der Ingenieurwissenschaften (Dr.-Ing.). Submitted to the Faculty of Technology of Universität Bielefeld by Thomas Plötz on 18.04.2005, defended and approved on 13.06.2005.

Reviewers:
PD Dr.-Ing. Gernot A. Fink, Universität Bielefeld
Dr. rer. nat. Karsten Quast, Boehringer Ingelheim Pharma GmbH und Co. KG
Prof. Dr. Jens Stoye, Universität Bielefeld

Examination committee:
Prof. Dr. Robert Giegerich, Universität Bielefeld
PD Dr.-Ing. Gernot A. Fink, Universität Bielefeld
Dr. rer. nat. Karsten Quast, Boehringer Ingelheim Pharma GmbH und Co. KG
Prof. Dr. Jens Stoye, Universität Bielefeld
Dr.-Ing. Frank G. Zöllner, Universität Bielefeld

Printed on ageing-resistant paper in accordance with ISO 9706.

Dissertation for the academic degree of Doktor der Ingenieurwissenschaften (Dr.-Ing.), submitted to the Faculty of Technology of Universität Bielefeld by Thomas Plötz. Bielefeld, April 2005.

Acknowledgements

I have often compared the process of writing my PhD thesis to a bicycle ride across an alpine pass: it is a long and hard ascent that requires a lot of endurance. Sometimes it hurts, but most of the time it is very exciting to climb the mountain and to reach different points of view, which is the prerequisite for new thoughts and ideas. As with a ride across an alpine pass, writing my PhD thesis would not have been possible without the support of a strong team, which I would like to thank here.

First of all, I am very much obliged to my supervisor PD Dr.-Ing. Gernot A. Fink.
Over the years, a very close collaboration with him has been established, including countless fruitful discussions on all kinds of pattern recognition problems. His fresh and honest way always gives me the motivation to follow new ideas but also to at least double-check them before arguing. It is difficult to imagine how my work would have developed without the influence of his great experience and his numerous ideas. Thank you very much, Gernot, you actually made me stand here at the top of the alpine pass.

The research performed for this thesis was embedded in a cooperation with Boehringer Ingelheim and the Boehringer Ingelheim Pharma GmbH und Co. KG Genomics Group. I would like to thank the project partners, especially Dr. Andreas Weith, Dr. Karsten Quast, Dr. Andreas Köhler, and Ogsen Gabrielyan, for their enthusiastic support. I am very grateful to Dr. Quast, who agreed to review this thesis.

Over the years Birgit Möller and I have become real friends, sharing similar ideas, supporting each other in our individual goals and having fantastic discussions. Although Birgit wrote her PhD thesis at the very same time as I did, including her own hard ascent to her alpine pass, she always had an open mind for my problems. Her very exact proofreading, including productive criticism, substantially helped me to improve the quality of this thesis.

My second proofreader was Erich Wehmeyer, who checked the thesis for any language-related traps and failures. I am very grateful for his native speaker expertise and his willingness to correct the work.

Furthermore, I would like to thank Prof. Dr.-Ing. Gerhard Sagerer, the leader of the Applied Computer Science group at Bielefeld University, who always encouraged me to cut my own path even when it was much rockier than expected. The productive atmosphere within the working group, including discussions, collaborations, chocolate and tea, provided the background for successfully finishing this thesis.
Finally, my wife Alexandra Schubert played an important role in the successful completion of this thesis. I would like to thank her for her assistance, affection, and emotional support. Alex, without you, everything would count for nothing.

Contents

1 Introduction
2 Principles of Modern Protein Analysis
  2.1 The Central Dogma of Molecular Biology
  2.2 Proteins: The Fundamentals of Life
    2.2.1 Biochemical Composition
    2.2.2 Biological Relevancy
  2.3 Protein Relationships
    2.3.1 Protein Families
    2.3.2 Exemplary Hierarchical Classification
  2.4 Protein Analysis
    2.4.1 The Drug Discovery Process
    2.4.2 Protein Sequence Analysis
  2.5 Summary
3 Computational Protein Sequence Analysis
  3.1 Pairwise Sequence Alignment
    3.1.1 Principles of Sequence Alignment
    3.1.2 Heuristic Approximations
  3.2 Analysis of Sequence Families
    3.2.1 Profile Analysis
    3.2.2 Profile Hidden Markov Models
    3.2.3 Further Probabilistic Modeling Approaches
  3.3 Signal Processing based Sequence Comparison
    3.3.1 Alternative Representations of Protein Sequences
    3.3.2 Signal Processing Methods for Classification
  3.4 Summary
4 Concepts for Improved HMM Based Sequence Analysis
  4.1 Assessment of Current Methodologies’ Capabilities
    4.1.1 Task: Homology Detection at the Superfamily Level
    4.1.2 Capabilities of State-of-the-Art Approaches
  4.2 Improving the Quality of HMM Based Sequence Analysis
    4.2.1 Semi-Continuous Feature Based Modeling
    4.2.2 Model Architectures with Reduced Complexity
    4.2.3 Accelerating the Model Evaluation
  4.3 Summary
5 Advanced Probabilistic Models for Protein Families
  5.1 Feature Extraction from Protein Sequences
    5.1.1 Rich Signal-Like Protein Sequence Representation
    5.1.2 Feature Extraction by Abstraction
  5.2 Robust Feature Based Profile HMMs and Remote Homology Detection
    5.2.1 Feature Space Representation
    5.2.2 General Semi-Continuous Profile HMMs
    5.2.3 Specialization by Adaptation
    5.2.4 Explicit Background Model
  5.3 Protein Family HMMs with Reduced Complexity
    5.3.1 Beyond Profile HMMs
    5.3.2 Protein Family Modeling using Sub-Protein Units (SPUs)
  5.4 Accelerating the Model Evaluation by Pruning Techniques
    5.4.1 State-Space Pruning
    5.4.2 Combined Model Evaluation
    5.4.3 Optimization of Mixture Density Evaluation
  5.5 Summary
6 Evaluation
  6.1 Methodology and Datasets
  6.2 Effectiveness of Semi-Continuous Feature Based Profile HMMs
  6.3 Advanced Stochastic Protein Family Models for Small Training Sets
    6.3.1 Effectiveness of Sub-Protein Unit based Models
    6.3.2 Effectiveness of Bounded Left-Right Models
  6.4 Acceleration of Protein Family HMM Evaluation
    6.4.1 Effectiveness of State-Space Pruning
    6.4.2 Effectiveness of Accelerated Mixture Density Evaluation
  6.5 Combined Evaluation of Advanced Stochastic Modeling Techniques
  6.6 Summary
7 Conclusion
A Wavelets
  A.1 Fourier Analysis
  A.2 Continuous Wavelet Transformation
  A.3 Discrete Wavelet Transformation
B Principal Components Analysis (PCA)
C Amino Acid Indices
D Detailed Evaluation Results
Bibliography

1 Introduction

Millennia ago, the ancient Egyptians used selected micro-organisms to produce cheese, wine and bread. Apparently, they were very experienced in food-making, because one of the cradles of civilization could develop in Egypt, and the high-quality catering certainly had a positive influence on this process. However, strictly speaking, they had no idea why their food became so tasty, giving them the power to build giant pyramids and to establish science and culture. It took ages until the reasons for it, the foundations of molecular biology, could be explained.

In fact, it was not until 1866 that the Augustinian monk Gregor Mendel developed the first general theory of heredity through the analysis of garden peas, which forms the basis for all further molecular biology research. Later, in 1953, James Watson and Francis Crick discovered the double-helical structure of DNA, which marked the major breakthrough on the way to understanding the microbiological foundations of life [Wat53]. Almost a hundred years lay between these principal breakthroughs, and they were achieved thousands of years after the Egyptians baked their tasty bread.

Due to the development of several revolutionary methodologies in recent decades, the speed of knowledge gain could be increased dramatically. Not before Fred Sanger and Walter Gilbert independently invented powerful sequencing methods in 1977, nowadays’ large-scale sequencing