Predicting Protein Folding Pathways Using Ensemble Modeling and Sequence Information

Predicting Protein Folding Pathways Using Ensemble Modeling and Sequence Information by David Becerra McGill University School of Computer Science Montréal, Québec, Canada A thesis submitted to McGill University in partial fulfillment of the requirements of the degree of Doctor of Philosophy c David Becerra 2017 Dedicated to my GRANDMOM. Thanks for everything. You are such a strong woman i Acknowledgements My doctoral program has been an amazing adventure with ups and downs. The most valuable fact is that I am a different person with respect to the one who started the program five years ago. I am not better or worst, I am just different. I would like to express my gratitude to all the people who helped me during my time at McGill. This thesis contains only my name as author, but it should have the name of all the people who played an important role in the completion of this thesis (and they are a lot of people). I am certain I would have not reached the end of my thesis without the help, advises and support of all people around me. First at all, I want to thank my country Colombia. Being abroad, I learnt to value my culture, my roots and my south-american positiveness, happiness and resourcefulness. I am really proud of being Colombian and of showing to the world a bit of what our wonderful land has given to us. I would also like to thank all Colombians, because with your taxes (via Colciencia’s funding) I was able to complete this dream. I also like to thank Canada for accepting us in this beautiful land and for embracing our lives and dreams. I would like to express my deepest gratitude to my wife Luisa and to my son Samuel. You are a constant (and fundamental) source of motivation and encouragement. Your are the real reason why this dream become a reality. I will be always in depth debt with you for walking this journey with me. You have inspired me to be better everyday and to do not give up during the hard times. You know that this thesis really belongs to you two. I would like to thank my parents for supporting my dreams. They have always encouraged us (me and my brother) to take risks and to work for our dreams. I deeply appreciate their help, support and advices. I still have a lot to learn from you. I want to extend my appreciation to my brother Andres. He is an excellent professional and he has been a long personal and career model to follow. I only hope that all our efforts would inspire Samuel and Maria Paula to pursuit their dreams and happiness. I would like to express my deepest gratitude to my supervisor Prof. Jérôme Waldispühl for giving me the opportunity to carry out the research presented in this thesis. His guid- ance, motivation and patience were fundamental for succeeding on this project. Working with Professor Waldispühl has represented for me a once-in-a-life-time opportunity to ii iii learn from a very motivate, innovative and supportive person. Beside my supervisor, I would like to thank Professor Mathieu Blanchette for their generous time and good will through all the (non-appointed) meetings that we had. I feel very honored to have such a distinguished researcher and person as my advisor. I am also very grateful to all the current and former members of the Waldispüh and Blanchette research groups. I would particularly like to thank Alex and Chris, who helped me in countless ways, from giving me advices, sharing complain lunches, fixing my code and for reading and editing this manuscript. Your friendships have always been invaluable. I also really appreciate Mathieu Lavallé’s tips about the Canadian life. Our shared hobbies of watching cycling and soccer sports made my integration to the lab (and country) much easier. Many thanks to Ayrin and Rola for caring so much about Samuel and my family. Your support was fundamental during all this process. I would like to show my gratitude to Mohamed. You are the right model to follow to balance parenthood and graduate studies. I owe sincere and earnest thankfulness to the students who I mentored (Zheng, Shu and Ettienne). I learn so much from you guys. I owe sincere and earnest thankfulness to all the members of the labs with who I shared invaluable moments during the daily life in the group Olivier, Vladimir, Mohammed I and II, Olivia, Antoine, Dilmi, Chrisostomos, Carlos, Roman, James, Jimmy, Faizy, Javier, Navin, Pouriya, Maia. There are many people to thank beyond the scope of the research environment. Thanks a lot to Juan, Ismenia, Paulina, Felipe, Alejandro, Silvana, Patricia, Pablo, Jessica here in Montreal. Back in Colombia, I would like to thank a los compas Barrantes, Fernando, Goyes, Gomez, Juan Carlos, Lisbeth. I would like to show my deepest gratitude to my friend Sergio in Netherlands. Even in the long distance, Sergio has been always supporting me and giving me the motivation needed to succeed in this doctoral process. I really value our friendship and I really hope to keep you as my friend for long time. I am also want to thank Norman for his valuable advices. I will be always in depth debt with all the Computer Science staff. I thank Sheryl, Ann, Diti, Tricia and Professor Kemme. I apologize for all the persons I have not mentioned. There could be many names that escape my mind right now, but you know that you play an important role in my life. Thanks a lot for all. Abstract - Abrégé The protein folding problem aims to predict the complete physical and dynamical process that transforms an unfolded sequence into a functional 3D protein structure. This problem consists of two (open) sub-problems: i) the protein structure prediction problem and ii) the protein pathway prediction problem. Computational techniques to face these two sub-problems have been based on the theory of evolution and laws of physics. To-date, classical approaches to obtaining detailed information about protein folding rely on time-consuming methods that are primarily limited to relatively small proteins (i.e., ≤ 50 amino acids). The overall objective of this thesis is to explore algorithms that conciliate: i) the prediction of protein structures and pathways, ii) physical-based predictions (i.e., low free-energy models) & evolutionary based predictions (i.e., sequence variation methods), and iii) computational costs and granularity level requirements of protein folding simulations. We propose an algorithmic framework for predicting protein folding that offers a better trade-off between resolution and efficiency. This framework computes accurate coarse-grained representations of the conformational landscape for large proteins through the combination of ensemble modeling techniques and evolutionary based sequence information. The resulting conformational energy landscape is then used to predict dominant folding pathways. Given that the proposed framework in this thesis makes use of sequence information, we also explore a crowdsourcing and multi- objective evolutionary strategy to investigate the accuracy of evolutionary information encoded by multiple sequence alignments. Finally, to present our results to the wider bi- ology and computer science communities, we develop an easy-to-use interactive molecular visualizer. Abstract - Abrégé Le problème du repliement des protéines a pour objectif de prédire le processus physique et dynamique complet de conversion de la séquence d’une protéine non-repliée en une structure 3D. Ce problème peut être décomposé en deux sous-problèmes (ouverts): i)le problème de la prédiction de la structure des protéines et ii)leproblème de la prédiction de le processus de repliement des protéines. La théoriedel’évolution et les lois de la physique sont les principes sur lesquels les techniques computationnelles se sont basées afin de résoudre ces deux sous-problèmes. A` ce jour, les approches classiques qui sont utilisées afin d’obtenir des informations sur le repliement des protéines sont reliées àdes méthodes chronophages, en principe limitées àdesprotéines de petite taille. L’objectif général de cette thèse est d’explorer des algorithmes qui concilient : i)laprédiction des structures et celle du processus de repliement des protèines, ii)lesprédictions basées sur les lois physiques (i.e., des modèlesa´ ` energie libre minimale) et celles basées sur la théoriedel’évolution (i.e., des méthodes basées sur les variations de séquences) et iii) les coûts computationnels et les niveaux de granularité requis par les simulations. Dans ce but, nous proposons une structure d’algorithme pour le repliement des protéines qui offre un meilleur compromis entre la résolution et l’efficience. Cet algorithm calcule des représentations précises à gros grain du paysage conformationnel des protéines de grande taille, en combinant des techniques de modélisation d’ensemble avec l’informations évolutive des séquences. Ce paysage énergétique conformationnel est ensuite utiliséafin de prédire les voies de repliement les plus probables. Puisque la structure proposée dans le cadre de cette thése utilise des informations de séquences, nous étudions aussi une stratégie participative, et une autre, multi-objectif et évolutive pour examiner la précision de l’information encodée par les alignements multiples de séquences. Finalement, de manière àprésenter nos résultatsa ` la communauté de biologistes et de chercheurs en in- formatique, nous avons développé un visualiseur moléculaire interactif facile d’utilisation.

Predicting Protein Folding Pathways Using Ensemble Modeling and Sequence Information

2009 Riboclub Program

The Principled Design of Large-Scale Recursive Neural Network Architectures–DAG-Rnns and the Protein Structure Prediction Problem

Methodology for Predicting Semantic Annotations of Protein Sequences by Feature Extraction Derived of Statistical Contact Potentials and Continuous Wavelet Transform

Deep Learning in Chemoinformatics Using Tensor Flow

Watching Dynamics and Assembly of Spliceosomal

Course Outline

Structural and Functional Characterization of the N-Terminal Acetyltransferase Natc

Assigning Folds to the Proteins Encoded by the Genome of Mycoplasma Genitalium (Protein Fold Recognition͞computer Analysis of Genome Sequences)

Practical Structure-Sequence Alignment of Pseudoknotted Rnas Wei Wang

Deep Learning in Science Pierre Baldi Frontmatter More Information

Bioinformatic Methods in Applied Genomic Research

UC San Diego UC San Diego Electronic Theses and Dissertations