Predicting Protein Folding Pathways Using Ensemble Modeling and Sequence Information

Total Page:16

File Type:pdf, Size:1020Kb

Predicting Protein Folding Pathways Using Ensemble Modeling and Sequence Information Predicting Protein Folding Pathways Using Ensemble Modeling and Sequence Information by David Becerra McGill University School of Computer Science Montr´eal, Qu´ebec, Canada A thesis submitted to McGill University in partial fulfillment of the requirements of the degree of Doctor of Philosophy c David Becerra 2017 Dedicated to my GRANDMOM. Thanks for everything. You are such a strong woman i Acknowledgements My doctoral program has been an amazing adventure with ups and downs. The most valuable fact is that I am a different person with respect to the one who started the program five years ago. I am not better or worst, I am just different. I would like to express my gratitude to all the people who helped me during my time at McGill. This thesis contains only my name as author, but it should have the name of all the people who played an important role in the completion of this thesis (and they are a lot of people). I am certain I would have not reached the end of my thesis without the help, advises and support of all people around me. First at all, I want to thank my country Colombia. Being abroad, I learnt to value my culture, my roots and my south-american positiveness, happiness and resourcefulness. I am really proud of being Colombian and of showing to the world a bit of what our wonderful land has given to us. I would also like to thank all Colombians, because with your taxes (via Colciencia’s funding) I was able to complete this dream. I also like to thank Canada for accepting us in this beautiful land and for embracing our lives and dreams. I would like to express my deepest gratitude to my wife Luisa and to my son Samuel. You are a constant (and fundamental) source of motivation and encouragement. Your are the real reason why this dream become a reality. I will be always in depth debt with you for walking this journey with me. You have inspired me to be better everyday and to do not give up during the hard times. You know that this thesis really belongs to you two. I would like to thank my parents for supporting my dreams. They have always encouraged us (me and my brother) to take risks and to work for our dreams. I deeply appreciate their help, support and advices. I still have a lot to learn from you. I want to extend my appreciation to my brother Andres. He is an excellent professional and he has been a long personal and career model to follow. I only hope that all our efforts would inspire Samuel and Maria Paula to pursuit their dreams and happiness. I would like to express my deepest gratitude to my supervisor Prof. J´erˆome Waldisp¨uhl for giving me the opportunity to carry out the research presented in this thesis. His guid- ance, motivation and patience were fundamental for succeeding on this project. Working with Professor Waldisp¨uhl has represented for me a once-in-a-life-time opportunity to ii iii learn from a very motivate, innovative and supportive person. Beside my supervisor, I would like to thank Professor Mathieu Blanchette for their generous time and good will through all the (non-appointed) meetings that we had. I feel very honored to have such a distinguished researcher and person as my advisor. I am also very grateful to all the current and former members of the Waldisp¨uh and Blanchette research groups. I would particularly like to thank Alex and Chris, who helped me in countless ways, from giving me advices, sharing complain lunches, fixing my code and for reading and editing this manuscript. Your friendships have always been invaluable. I also really appreciate Mathieu Lavall´e’s tips about the Canadian life. Our shared hobbies of watching cycling and soccer sports made my integration to the lab (and country) much easier. Many thanks to Ayrin and Rola for caring so much about Samuel and my family. Your support was fundamental during all this process. I would like to show my gratitude to Mohamed. You are the right model to follow to balance parenthood and graduate studies. I owe sincere and earnest thankfulness to the students who I mentored (Zheng, Shu and Ettienne). I learn so much from you guys. I owe sincere and earnest thankfulness to all the members of the labs with who I shared invaluable moments during the daily life in the group Olivier, Vladimir, Mohammed I and II, Olivia, Antoine, Dilmi, Chrisostomos, Carlos, Roman, James, Jimmy, Faizy, Javier, Navin, Pouriya, Maia. There are many people to thank beyond the scope of the research environment. Thanks a lot to Juan, Ismenia, Paulina, Felipe, Alejandro, Silvana, Patricia, Pablo, Jessica here in Montreal. Back in Colombia, I would like to thank a los compas Barrantes, Fernando, Goyes, Gomez, Juan Carlos, Lisbeth. I would like to show my deepest gratitude to my friend Sergio in Netherlands. Even in the long distance, Sergio has been always supporting me and giving me the motivation needed to succeed in this doctoral process. I really value our friendship and I really hope to keep you as my friend for long time. I am also want to thank Norman for his valuable advices. I will be always in depth debt with all the Computer Science staff. I thank Sheryl, Ann, Diti, Tricia and Professor Kemme. I apologize for all the persons I have not mentioned. There could be many names that escape my mind right now, but you know that you play an important role in my life. Thanks a lot for all. Abstract - Abr´eg´e The protein folding problem aims to predict the complete physical and dynamical pro- cess that transforms an unfolded sequence into a functional 3D protein structure. This problem consists of two (open) sub-problems: i) the protein structure prediction prob- lem and ii) the protein pathway prediction problem. Computational techniques to face these two sub-problems have been based on the theory of evolution and laws of physics. To-date, classical approaches to obtaining detailed information about protein folding rely on time-consuming methods that are primarily limited to relatively small proteins (i.e., ≤ 50 amino acids). The overall objective of this thesis is to explore algorithms that conciliate: i) the prediction of protein structures and pathways, ii) physical-based pre- dictions (i.e., low free-energy models) & evolutionary based predictions (i.e., sequence variation methods), and iii) computational costs and granularity level requirements of protein folding simulations. We propose an algorithmic framework for predicting protein folding that offers a better trade-off between resolution and efficiency. This framework computes accurate coarse-grained representations of the conformational landscape for large proteins through the combination of ensemble modeling techniques and evolution- ary based sequence information. The resulting conformational energy landscape is then used to predict dominant folding pathways. Given that the proposed framework in this thesis makes use of sequence information, we also explore a crowdsourcing and multi- objective evolutionary strategy to investigate the accuracy of evolutionary information encoded by multiple sequence alignments. Finally, to present our results to the wider bi- ology and computer science communities, we develop an easy-to-use interactive molecular visualizer. Abstract - Abr´eg´e Le probl`eme du repliement des prot´eines a pour objectif de pr´edire le processus physique et dynamique complet de conversion de la s´equence d’une prot´eine non-repli´ee en une structure 3D. Ce probl`eme peut ˆetre d´ecompos´e en deux sous-probl`emes (ouverts): i)le probl`eme de la pr´ediction de la structure des prot´eines et ii)leprobl`eme de la pr´ediction de le processus de repliement des prot´eines. La th´eoriedel’´evolution et les lois de la physique sont les principes sur lesquels les techniques computationnelles se sont bas´ees afin de r´esoudre ces deux sous-probl`emes. A` ce jour, les approches classiques qui sont utilis´ees afin d’obtenir des informations sur le repliement des prot´eines sont reli´ees `ades m´ethodes chronophages, en principe limit´ees `adesprot´eines de petite taille. L’objectif g´en´eral de cette th`ese est d’explorer des algorithmes qui concilient : i)lapr´ediction des structures et celle du processus de repliement des prot`eines, ii)lespr´edictions bas´ees sur les lois physiques (i.e., des mod`elesa´ ` energie libre minimale) et celles bas´ees sur la th´eoriedel’´evolution (i.e., des m´ethodes bas´ees sur les variations de s´equences) et iii) les coˆuts computationnels et les niveaux de granularit´e requis par les simulations. Dans ce but, nous proposons une structure d’algorithme pour le repliement des prot´eines qui offre un meilleur compromis entre la r´esolution et l’efficience. Cet algorithm calcule des repr´esentations pr´ecises `a gros grain du paysage conformationnel des prot´eines de grande taille, en combinant des techniques de mod´elisation d’ensemble avec l’informations ´evolutive des s´equences. Ce paysage ´energ´etique conformationnel est ensuite utilis´eafin de pr´edire les voies de repliement les plus probables. Puisque la structure propos´ee dans le cadre de cette th´ese utilise des informations de s´equences, nous ´etudions aussi une strat´egie participative, et une autre, multi-objectif et ´evolutive pour examiner la pr´ecision de l’information encod´ee par les alignements multiples de s´equences. Finalement, de mani`ere `apr´esenter nos r´esultatsa ` la communaut´e de biologistes et de chercheurs en in- formatique, nous avons d´evelopp´e un visualiseur mol´eculaire interactif facile d’utilisation.
Recommended publications
  • 2009 Riboclub Program
    2019 RiboClub Program RNA: 20 Years of Discoveries In Association with the Gairdner Foundation September 22-27 - Hotel Chéribourg, Orford, Québec, Canada Organizing Committee: Allan Jacobson, Robert Singer, Adrian Krainer, Françoise Stutz, Yukihide Tomari, Sandra Wolin, Nehalkumar Thakor (ARRTI co-organizer for western Canada) and the RiboClub. Sunday, September 22, 2019 (Day1) 15:00 – 20:00 Registration / Coffee and cake 17:00 – 17:10 Welcome notes and Announcements Sherif Abou Elela RiboClub organizer and co-founder, Member of the Université de Sherbrooke RNA Group 17:15 – 17:45 Opening Lecture I: RNA Biology: Origins and Reflections Joan Steitz Sterling Professor of Molecular Biophysics and Biochemistry, Yale University First speakers of the RiboClub 1998 17:45 – 18:15 Opening Lecture II: RNA Biology: History of Discoveries Witold Filipowicz Professor of Biochemistry, Friedrich Miescher Institute for Biomedical Research 18:15 – 18:45 Special presentation by the recipient of the RiboClub 2009 life achievement award: Initiating the study of Initiation: Initiating translation initiation: a short history and perspective Nahum Sonenberg, Member of the RiboClub, Gilman Cheney Chair in Biochemistry, McGill University 18:45 – 19:45 Cocktail and social time 19:45 – 19:50 Introduction of members of the community and dignitaries Raymund Wellinger Master of ceremonies, Co-founder of the Université de Sherbrooke RNA group, 19:50 – 19:50 Welcome note Pierre Cossette President, Université de Sherbrooke 19:50 – 20:45 Networking Dinner 20:45 – 20:55
    [Show full text]
  • The Principled Design of Large-Scale Recursive Neural Network Architectures–DAG-Rnns and the Protein Structure Prediction Problem
    Journal of Machine Learning Research 4 (2003) 575-602 Submitted 2/02; Revised 4/03; Published 9/03 The Principled Design of Large-Scale Recursive Neural Network Architectures–DAG-RNNs and the Protein Structure Prediction Problem Pierre Baldi [email protected] Gianluca Pollastri [email protected] School of Information and Computer Science Institute for Genomics and Bioinformatics University of California, Irvine Irvine, CA 92697-3425, USA Editor: Michael I. Jordan Abstract We describe a general methodology for the design of large-scale recursive neural network architec- tures (DAG-RNNs) which comprises three fundamental steps: (1) representation of a given domain using suitable directed acyclic graphs (DAGs) to connect visible and hidden node variables; (2) parameterization of the relationship between each variable and its parent variables by feedforward neural networks; and (3) application of weight-sharing within appropriate subsets of DAG connec- tions to capture stationarity and control model complexity. Here we use these principles to derive several specific classes of DAG-RNN architectures based on lattices, trees, and other structured graphs. These architectures can process a wide range of data structures with variable sizes and dimensions. While the overall resulting models remain probabilistic, the internal deterministic dy- namics allows efficient propagation of information, as well as training by gradient descent, in order to tackle large-scale problems. These methods are used here to derive state-of-the-art predictors for protein structural features such as secondary structure (1D) and both fine- and coarse-grained contact maps (2D). Extensions, relationships to graphical models, and implications for the design of neural architectures are briefly discussed.
    [Show full text]
  • Methodology for Predicting Semantic Annotations of Protein Sequences by Feature Extraction Derived of Statistical Contact Potentials and Continuous Wavelet Transform
    Universidad Nacional de Colombia Sede Manizales Master’s Thesis Methodology for predicting semantic annotations of protein sequences by feature extraction derived of statistical contact potentials and continuous wavelet transform Author: Supervisor: Gustavo Alonso Arango Dr. Cesar German Argoty Castellanos Dominguez A thesis submitted in fulfillment of the requirements for the degree of Master’s on Engineering - Industrial Automation in the Department of Electronic, Electric Engineering and Computation Signal Processing and Recognition Group June 2014 Universidad Nacional de Colombia Sede Manizales Tesis de Maestr´ıa Metodolog´ıapara predecir la anotaci´on sem´antica de prote´ınaspor medio de extracci´on de caracter´ısticas derivadas de potenciales de contacto y transformada wavelet continua Autor: Tutor: Gustavo Alonso Arango Dr. Cesar German Argoty Castellanos Dominguez Tesis presentada en cumplimiento a los requerimientos necesarios para obtener el grado de Maestr´ıaen Ingenier´ıaen Automatizaci´onIndustrial en el Departamento de Ingenier´ıaEl´ectrica,Electr´onicay Computaci´on Grupo de Procesamiento Digital de Senales Enero 2014 UNIVERSIDAD NACIONAL DE COLOMBIA Abstract Faculty of Engineering and Architecture Department of Electronic, Electric Engineering and Computation Master’s on Engineering - Industrial Automation Methodology for predicting semantic annotations of protein sequences by feature extraction derived of statistical contact potentials and continuous wavelet transform by Gustavo Alonso Arango Argoty In this thesis, a method to predict semantic annotations of the proteins from its primary structure is proposed. The main contribution of this thesis lies in the implementation of a novel protein feature representation, which makes use of the pairwise statistical contact potentials describing the protein interactions and geometry at the atomic level.
    [Show full text]
  • Deep Learning in Chemoinformatics Using Tensor Flow
    UC Irvine UC Irvine Electronic Theses and Dissertations Title Deep Learning in Chemoinformatics using Tensor Flow Permalink https://escholarship.org/uc/item/963505w5 Author Jain, Akshay Publication Date 2017 Peer reviewed|Thesis/dissertation eScholarship.org Powered by the California Digital Library University of California UNIVERSITY OF CALIFORNIA, IRVINE Deep Learning in Chemoinformatics using Tensor Flow THESIS submitted in partial satisfaction of the requirements for the degree of MASTER OF SCIENCE in Computer Science by Akshay Jain Thesis Committee: Professor Pierre Baldi, Chair Professor Cristina Videira Lopes Professor Eric Mjolsness 2017 c 2017 Akshay Jain DEDICATION To my family and friends. ii TABLE OF CONTENTS Page LIST OF FIGURES v LIST OF TABLES vi ACKNOWLEDGMENTS vii ABSTRACT OF THE THESIS viii 1 Introduction 1 1.1 QSAR Prediction Methods . .2 1.2 Deep Learning . .4 2 Artificial Neural Networks(ANN) 5 2.1 Artificial Neuron . .5 2.2 Activation Function . .7 2.3 Loss function . .8 2.4 Optimization . .8 3 Deep Recursive Architectures 10 3.1 Recurrent Neural Networks (RNN) . 10 3.2 Recursive Neural Networks . 11 3.3 Directed Acyclic Graph Recursive Neural Networks (DAG-RNN) . 11 4 UG-RNN for small molecules 14 4.1 DAG Generation . 16 4.2 Local Information Vector . 16 4.3 Contextual Vectors . 17 4.4 Activity Prediction . 17 4.5 UG-RNN With Contracted Rings (UG-RNN-CR) . 18 4.6 Example: UG-RNN Model of Propionic Acid . 20 5 Implementation 24 6 Data & Results 26 6.1 Aqueous Solubility Prediction . 26 6.2 Melting Point Prediction . 28 iii 7 Conclusions 30 Bibliography 32 A Source Code 37 A.1 UGRNN .
    [Show full text]
  • Watching Dynamics and Assembly of Spliceosomal
    WATCHING DYNAMICS AND ASSEMBLY OF SPLICEOSOMAL COMPLEXES AT SINGLE MOLECULE RESOLUTION A thesis submitted for the degree of DOCTOR OF PHILOSOPHY 2015 CHANDANI MANOJA WARNASOORIYA Section of Virology, Department of Medicine Imperial College London 1 Copyright Declaration The copyright of this thesis rests with the author and is made available under a Creative Commons Attribution Non-Commercial No Derivatives licence. Researchers are free to copy, distribute or transmit the thesis on the condition that they attribute it, that they do not use it for commercial purposes and that they do not alter, transform or build upon it. For any reuse or redistribution, researchers must make clear to others the licence terms of this work. 2 Declaration of Origin I hereby declare that this project was entirely my own work (except the gel images in figures 4.2-4.5 in the chapter 4) and that any additional sources of information have been properly cited. I hereby declare that any internet sources, published and unpublished works from which I have quoted or drawn reference have been referenced fully in the text and in the reference list. I understand that failure to do so will result in failure of this project due to plagiarism. Chapter 3 was in collaboration with Prof. Samuel Butcher and Prof. David Brow at University of Wisconsin-Madison, Wisconsin, USA. The Prp24 full length protein and the truncated protein (234C) were provided by them. Chapter 4 was in collaboration with Prof. Kiyoshi Nagai and his lab members Dr. John Hardin and Yasushi Kondo at LMB Cambridge University, Cambridge, UK.
    [Show full text]
  • Course Outline
    Department of Computer Science and Software Engineering COMP 6811 Bioinformatics Algorithms (Reading Course) Fall 2019 Section AA Instructor: Gregory Butler Curriculum Description COMP 6811 Bioinformatics Algorithms (4 credits) The principal objectives of the course are to cover the major algorithms used in bioinformatics; sequence alignment, multiple sequence alignment, phylogeny; classi- fying patterns in sequences; secondary structure prediction; 3D structure prediction; analysis of gene expression data. This includes dynamic programming, machine learning, simulated annealing, and clustering algorithms. Algorithmic principles will be emphasized. A project is required. Outline of Topics The course will focus on algorithms for protein sequence analysis. It will not cover genome assembly, genome mapping, or gene recognition. • Background in Biology and Genomics • Sequence Alignment: Pairwise and Multiple • Representation of Protein Amino Acid Composition • Profile Hidden Markov Models • Specificity Determining Sites • Curation, Annotation, and Ontologies • Machine Learning: Secondary Structure, Signals, Subcellular Location • Protein Families, Phylogenomics, and Orthologous Groups • Profile-Based Alignments • Algorithms Based on k-mers 1 Texts | in Library D. Higgins and W. Taylor (editors). Bioinformatics: Sequence, Structure and Databanks, Oxford University Press, 2000. A. D. Baxevanis and B. F. F. Ouelette. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, Wiley, 1998. Richard Durbin, Sean R. Eddy, Anders Krogh,
    [Show full text]
  • Structural and Functional Characterization of the N-Terminal Acetyltransferase Natc
    Structural and functional characterization of the N-terminal acetyltransferase NatC Inaugural-Dissertation to obtain the academic degree Doctor rerum naturalium (Dr. rer. nat.) submitted to the Department of Biology, Chemistry, Pharmacy of Freie Universität Berlin by Stephan Grunwald Berlin October 01, 2019 Die vorliegende Arbeit wurde von April 2014 bis Oktober 2019 am Max-Delbrück-Centrum für Molekulare Medizin unter der Anleitung von PROF. DR. OLIVER DAUMKE angefertigt. Erster Gutachter: PROF. DR. OLIVER DAUMKE Zweite Gutachterin: PROF. DR. ANNETTE SCHÜRMANN Disputation am 26. November 2019 iii iv Erklärung Ich versichere, dass ich die von mir vorgelegte Dissertation selbstständig angefertigt, die benutzten Quellen und Hilfsmittel vollständig angegeben und die Stellen der Arbeit – einschließlich Tabellen, Karten und Abbildungen – die anderen Werken im Wortlaut oder dem Sinn nach entnommen sind, in jedem Einzelfall als Entlehnung kenntlich gemacht habe; und dass diese Dissertation keiner anderen Fakultät oder Universität zur Prüfung vorgelegen hat. Berlin, 9. November 2020 Stephan Grunwald v vi Acknowledgement I would like to thank Prof. Oliver Daumke for giving me the opportunity to do the research for this project in his laboratory and for the supervision of this thesis. I would also like to thank Prof. Dr. Annette Schürmann from the German Institute of Human Nutrition (DifE) in Potsdam-Rehbruecke for being my second supervisor. From the Daumke laboratory I would like to especially thank Dr. Manuel Hessenberger, Dr. Stephen Marino and Dr. Tobias Bock-Bierbaum, who gave me helpful advice. I also like to thank the rest of the lab members for helpful discussions. I deeply thank my wife Theresa Grunwald, who always had some helpful suggestions and kept me alive while writing this thesis.
    [Show full text]
  • Assigning Folds to the Proteins Encoded by the Genome of Mycoplasma Genitalium (Protein Fold Recognition͞computer Analysis of Genome Sequences)
    Proc. Natl. Acad. Sci. USA Vol. 94, pp. 11929–11934, October 1997 Biophysics Assigning folds to the proteins encoded by the genome of Mycoplasma genitalium (protein fold recognitionycomputer analysis of genome sequences) DANIEL FISCHER* AND DAVID EISENBERG University of California, Los Angeles–Department of Energy Laboratory of Structural Biology and Molecular Medicine, Molecular Biology Institute, University of California, Los Angeles, Box 951570, Los Angeles, CA 90095-1570 Contributed by David Eisenberg, August 8, 1997 ABSTRACT A crucial step in exploiting the information genitalium (MG) (10), as a test of the capabilities of our inherent in genome sequences is to assign to each protein automatic fold recognition server and as a case study to sequence its three-dimensional fold and biological function. identify the difficulties facing automated fold assignment. Here we describe fold assignment for the proteins encoded by the small genome of Mycoplasma genitalium. The assignment MATERIALS AND METHODS was carried out by our computer server (http:yywww.doe- mbi.ucla.eduypeopleyfrsvryfrsvr.html), which assigns folds to The MG Sequences. The 468 MG sequences were obtained amino acid sequences by comparing sequence-derived predic- from The Institute for Genome Research (TIGR) through its tions with known structures. Of the total of 468 protein ORFs, Web address: http:yywww.tigr.orgytdbymdbymgdbymgd- 103 (22%) can be assigned a known protein fold with high b.html. Three types of annotation (based on searches in the confidence, as cross-validated with tests on known structures. sequence database) accompany each TIGR sequence (10): (i) Of these sequences, 75 (16%) show enough sequence similarity functional assignment—a clear sequence similarity with a to proteins of known structure that they can also be detected protein of known function from another organism was found by traditional sequence–sequence comparison methods.
    [Show full text]
  • Practical Structure-Sequence Alignment of Pseudoknotted Rnas Wei Wang
    Practical structure-sequence alignment of pseudoknotted RNAs Wei Wang To cite this version: Wei Wang. Practical structure-sequence alignment of pseudoknotted RNAs. Bioinformatics [q- bio.QM]. Université Paris Saclay (COmUE), 2017. English. NNT : 2017SACLS563. tel-01697889 HAL Id: tel-01697889 https://tel.archives-ouvertes.fr/tel-01697889 Submitted on 31 Jan 2018 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. 1 NNT : 2017SACLS563 Thèse de doctorat de l’Université Paris-Saclay préparée à L’Université Paris-Sud Ecole doctorale n◦580 (STIC) Sciences et Technologies de l’Information et de la Communication Spécialité de doctorat : Informatique par M. Wei WANG Alignement pratique de structure-séquence d’ARN avec pseudonœuds Thèse présentée et soutenue à Orsay, le 18 Décembre 2017. Composition du Jury : Mme Hélène TOUZET Directrice de Recherche (Présidente) CNRS, Université Lille 1 M. Guillaume FERTIN Professeur (Rapporteur) Université de Nantes M. Jan GORODKIN Professeur (Rapporteur) University of Copenhagen Mme Johanne COHEN Directrice de Recherche (Examinatrice)
    [Show full text]
  • Deep Learning in Science Pierre Baldi Frontmatter More Information
    Cambridge University Press 978-1-108-84535-9 — Deep Learning in Science Pierre Baldi Frontmatter More Information Deep Learning in Science This is the first rigorous, self-contained treatment of the theory of deep learning. Starting with the foundations of the theory and building it up, this is essential reading for any scientists, instructors, and students interested in artificial intelligence and deep learning. It provides guidance on how to think about scientific questions, and leads readers through the history of the field and its fundamental connections to neuro- science. The author discusses many applications to beautiful problems in the natural sciences, in physics, chemistry, and biomedicine. Examples include the search for exotic particles and dark matter in experimental physics, the prediction of molecular properties and reaction outcomes in chemistry, and the prediction of protein structures and the diagnostic analysis of biomedical images in the natural sciences. The text is accompanied by a full set of exercises at different difficulty levels and encourages out-of-the-box thinking. Pierre Baldi is Distinguished Professor of Computer Science at University of California, Irvine. His main research interest is understanding intelligence in brains and machines. He has made seminal contributions to the theory of deep learning and its applications to the natural sciences, and has written four other books. © in this web service Cambridge University Press www.cambridge.org Cambridge University Press 978-1-108-84535-9 — Deep Learning in
    [Show full text]
  • Bioinformatic Methods in Applied Genomic Research
    Alma mater studiorum - Università di Bologna Dottorato in Biotecnologie Cellulari e Molecolari: XXIII ciclo Settore scientifico disciplinare di afferenza: BIO11 BIOINFORMATIC METHODS IN APPLIED GENOMIC RESEARCH Presentato da Raffaele Fronza Coordinatore Dottorato: Supervisore: Prof. Santi Mario Spampinato Prof. Rita Casadio Esame finale anno 2011 ii Abstract Here I will focus on three main topics that best address and include the projects I have been working in during my three year PhD period that I have spent in different research laboratories addressing both computationally and practically important problems all related to modern molecular genomics. The first topic is the use of livestock species (pigs) as a model of obesity, a complex hu- man dysfunction. My efforts here concern the detection and annotation of Single Nucleotide Polymorphisms. I developed a pipeline for mining human and porcine sequences. Starting from a set of human genes related with obesity the platform returns a list of annotated porcine SNPs extracted from a new set of potential obesity-genes. 565 of these SNPs were analyzed on an Illumina chip to test the involvement in obesity on a population composed by more than 500 pigs. Results will be discussed. All the computational analysis and experiments were done in collaboration with the Biocomputing group and Dr.Luca Fontanesi, respectively, under the direction of prof. Rita Casadio at the Bologna University, Italy. The second topic concerns developing a methodology, based on Factor Analysis, to simul- taneously mine information from different levels of biological organization. With specific test cases we develop models of the complexity of the mRNA-miRNA molecular interaction in brain tumors measured indirectly by microarray and quantitative PCR.
    [Show full text]
  • UC San Diego UC San Diego Electronic Theses and Dissertations
    UC San Diego UC San Diego Electronic Theses and Dissertations Title Analysis and applications of conserved sequence patterns in proteins Permalink https://escholarship.org/uc/item/5kh9z3fr Author Ie, Tze Way Eugene Publication Date 2007 Peer reviewed|Thesis/dissertation eScholarship.org Powered by the California Digital Library University of California UNIVERSITY OF CALIFORNIA, SAN DIEGO Analysis and Applications of Conserved Sequence Patterns in Proteins A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Computer Science by Tze Way Eugene Ie Committee in charge: Professor Yoav Freund, Chair Professor Sanjoy Dasgupta Professor Charles Elkan Professor Terry Gaasterland Professor Philip Papadopoulos Professor Pavel Pevzner 2007 Copyright Tze Way Eugene Ie, 2007 All rights reserved. The dissertation of Tze Way Eugene Ie is ap- proved, and it is acceptable in quality and form for publication on microfilm: Chair University of California, San Diego 2007 iii DEDICATION This dissertation is dedicated in memory of my beloved father, Ie It Sim (1951–1997). iv TABLE OF CONTENTS Signature Page . iii Dedication . iv Table of Contents . v List of Figures . viii List of Tables . x Acknowledgements . xi Vita, Publications, and Fields of Study . xiii Abstract of the Dissertation . xv 1 Introduction . 1 1.1 Protein Homology Search . 1 1.2 Sequence Comparison Methods . 2 1.3 Statistical Analysis of Protein Motifs . 4 1.4 Motif Finding using Random Projections . 5 1.5 Microbial Gene Finding without Assembly . 7 2 Multi-Class Protein Classification using Adaptive Codes . 9 2.1 Profile-based Protein Classifiers . 14 2.2 Embedding Base Classifiers in Code Space .
    [Show full text]