Methodology for Predicting Semantic Annotations of Protein Sequences by Feature Extraction Derived of Statistical Contact Potentials and Continuous Wavelet Transform

Universidad Nacional de Colombia Sede Manizales Master’s Thesis Methodology for predicting semantic annotations of protein sequences by feature extraction derived of statistical contact potentials and continuous wavelet transform Author: Supervisor: Gustavo Alonso Arango Dr. Cesar German Argoty Castellanos Dominguez A thesis submitted in fulfillment of the requirements for the degree of Master’s on Engineering - Industrial Automation in the Department of Electronic, Electric Engineering and Computation Signal Processing and Recognition Group June 2014 Universidad Nacional de Colombia Sede Manizales Tesis de Maestr´ıa Metodolog´ıapara predecir la anotación semántica de prote´ınaspor medio de extracción de caracter´ısticas derivadas de potenciales de contacto y transformada wavelet continua Autor: Tutor: Gustavo Alonso Arango Dr. Cesar German Argoty Castellanos Dominguez Tesis presentada en cumplimiento a los requerimientos necesarios para obtener el grado de Maestr´ıaen Ingenier´ıaen AutomatizaciónIndustrial en el Departamento de Ingenier´ıaEléctrica,Electrónicay Computación Grupo de Procesamiento Digital de Senales Enero 2014 UNIVERSIDAD NACIONAL DE COLOMBIA Abstract Faculty of Engineering and Architecture Department of Electronic, Electric Engineering and Computation Master’s on Engineering - Industrial Automation Methodology for predicting semantic annotations of protein sequences by feature extraction derived of statistical contact potentials and continuous wavelet transform by Gustavo Alonso Arango Argoty In this thesis, a method to predict semantic annotations of the proteins from its primary structure is proposed. The main contribution of this thesis lies in the implementation of a novel protein feature representation, which makes use of the pairwise statistical contact potentials describing the protein interactions and geometry at the atomic level. Initially, a protein sequence is decomposed into a numerical series by a contact potential. From the interactions between adjacent amino acids, the wavelet transform can easily detect and characterize subsequences at specific position along the protein sequence. Then, all subsequences are grouped into clusters and a Hidden Markov Model (HMM) profile is built for each one of the groups. Finally, the modeled profiles HMM are used as features in order to build a feature space with the aim to train and evaluate a support vector machine classifier. Evaluations of the proposed methodology are driven against three different views 1) known protein features 2) motif-domain based features (PFam terms) and 3) performance evaluation over several methods for protein annotation prediction. As result, The method have acquired the highest performance prediction in most of the study cases. Thus, this efficiency suggest our approach as an alternative method for the characterization of protein sequences. Although, the research in this thesis focuses on the classification problem, the scientific community can make use of the methodology in two different ways: 1) as a protein predictor and 2) as a motif finding tool. Finally, the source code of the method is free available for download at SourceForge http://sourceforge.net/projects/wamofi/?source=navbar. UNIVERSIDAD NACIONAL DE COLOMBIA Resumen Facultad de Ingenier´ıay Arquitectura Departamento de Ingenier´ıaEléctrica, Electrónicay Computación Maestr´ıaen Ingenieria AutomatizaciónIndustrial Metodolog´ıa para predecir la anotaciónsemántica de prote´ınaspor medio de extracciónde caracter´ısticas derivadas de potenciales de contacto y transformada wavelet continua by Gustavo Alonso Arango Argoty En esta tesis se propone un método para la predicciónde anotaciones de prote´ınas a partir de la estimaciónde caracter´ısticas en secuencias biológicas. Dicha estimaciónemplea informaciónsobre la estructura de las prote´ınas a partir de las estadisticas de contactos potenciales entre pares de amino ácidos. Inicialmente, una prote´ınaes transformada a una serie numerica por medio de estos contactos potenciales. Debido a las interac- ciones entre amino ácidoscercanos, la transformada wavelet puede fácilmente detectar las subsecuencias pertenecientes a posiciones espec´ıficas a lo largo de la prote´ına. As´ı, todas las subsecuencias son agrupadas de acuerdo a su distribucióny éstos grupos son modelados empleando perfiles de Modelos Ocultos de Markov. Finalmente, los perfiles son usados como caracter´ısticas donde prote´ınas de análsisson mapeadas generando as´ıun espacio de representaciónque es usado para entrenar un clasificador basado en vectores de soporte. La metodolg´ıaha sido rigurosamente evaluada y comparada con tres diferentes criterios de caracterización: 1) caracter´ısticas globales comunmente us- adas para representar prote´ınas, 2) caracter´ısticas espec´ıficascomo motivos y domin- ios, y por último 3) evaluaciónde el renimiento de varios programas construidos para la predicciónde anotaciónde prote´ınas. Como resultado el método propuesto ha lo- grado los mas altos puntajes de predicciónen la mayor´ıade los casos de estudio. De manera que éstas predicciones sugieren a nuestro método como una alternativa a los comunmente usados algoritmos de caracterización. Por otra parte, a pesar de que el enfoque de la metodolog´ıa esta diseñada para resolver problemas de clasificación,la comunidad cient´ıfica puede hacer uso de ella en dos diferentes enfoques: 1) como un predictor de anotaciones en prote´ınasy 2) como una herramienta para encontrar motivos. Por último, el códigofuente del método se encuentra para libre descarga en: http://sourceforge.net/projects/wamofi/?source=navbar. iii keywords continuous wavelet transform, statistical contact potentials, protein prediction, support vector machine, sequence alignment. palabras-clave transformada wavelet continua, potenciales de contacto estadisticos, prediccion de proteins, maquinas de vectores de soporte, alineamiento de secuencias. Acknowledgements To my siblings Manuel, Miguel and Martha Arango who emotionaly helped me in the achievement of this goal. To my advisers German Castellanos and Jorge Jaramillo who guide me on the process of the development of this thesis and I specially want to dedicate this work to my parents (passed away) Gladys Argoty and Miguel Arango who are my main inspiration for the development of my academic and personal life. iv Contents Acknowledgements iv List of Figures vii List of Tables ix 1 Introduction 1 2 Objectives 3 2.1 Main objective ................................. 3 2.2 Specific objectives ............................... 3 3 Background 4 3.1 Proteins ..................................... 4 3.1.1 Protein synthesis . ........................... 4 3.1.2 Protein structure . ........................... 5 3.1.3 Protein folding ............................. 7 3.1.4 Protein domains and motifs . ..................... 7 3.2 Protein annotation prediction . ........................ 8 3.2.1 Protein Similarity . .......................... 8 3.2.2 Machine learning and feature-based classification . 10 3.2.2.1 Protein classification . .................... 11 3.2.2.2 Feature extraction . ..................... 12 4 Methodology 15 4.1 Profile descriptor ................................ 16 4.1.1 Statistical contact potentials . .................... 17 4.1.2 Continuous wavelet transform . .................... 18 4.1.2.1 Resolution of the continuous wavelet transform . 21 4.1.3 Feature detection by wavelet transform . 22 4.1.3.1 Numerical representation . 22 4.1.3.2 Motif detection . ....................... 23 4.1.3.3 Projection . ......................... 24 4.1.4 Multiple sequence alignment . .................... 25 4.1.4.1 The distance matrix . .................... 26 4.1.4.2 The guide tree . ....................... 27 v Contents vi 4.1.4.3 Progressive alignment . ................... 27 4.1.5 Hidden Markov Models . ....................... 27 4.1.5.1 Profile HMMs . ....................... 30 4.2 Classification framework . ........................... 32 4.2.1 Support Vector Machines (SVMs) . 34 4.2.2 Selecting the best contact potential . 36 5 Experimental framework 37 5.1 Protein subcellular localization on Gram-Negative bacterial proteins . 37 5.2 Protein control-evaluation data sets for Gram negative bacteria . 38 6 Results and discussion 40 6.1 Comparison with cutting edge predictors ................... 42 6.2 Comparison with global featuring methods . 43 6.3 Comparison with a profile based method . 44 7 Conclusions 46 8 Appendix A: Wavelet Transform 48 8.1 Properties of the Continuous Wavelet Transform . 48 8.2 Mother Wavelets Commonly Used . ..................... 50 9 Appendix B: List of Pairwise Contact Potentials 52 List of Figures 3.1 The basis of cell and molecular biology: a) DNA replication one double- stranded DNA molecule produces two identical copies of the DNA, b) RNA transcription a segment of DNA is copied into RNA by the en- zyme RNA polymerase, if the gene transcribed encodes a protein, the result of transcription is the messenger RNA (mRNA). c) Protein trans- lation mRNA is decoded by the ribosome to produce a specific amino acid chain, which, later will fold into an active protein. ........... 5 3.2 General formula of the amino acid showing the positions of amino (−NH2) group, carboxyl group (−COOH), α-carbon atom and the side chain R that can be any of the 20 different chains . ................. 5 3.3 Levels of protein structure: A) Primary structure is an arrangement of amino acids. B) Secondary structure is a linear folding of the polypep-

Methodology for Predicting Semantic Annotations of Protein Sequences by Feature Extraction Derived of Statistical Contact Potentials and Continuous Wavelet Transform

The Principled Design of Large-Scale Recursive Neural Network Architectures–DAG-Rnns and the Protein Structure Prediction Problem

120421-24Recombschedule FINAL.Xlsx

Deep Learning in Chemoinformatics Using Tensor Flow

Course Outline

EMBO Encounters Issue43.Pdf

Deep Learning in Science Pierre Baldi Frontmatter More Information

Bioinformatic Methods in Applied Genomic Research

UC San Diego UC San Diego Electronic Theses and Dissertations

The 4Th Bologna Winter School: Hot Topics in Structural Genomics†

Bringing Folding Pathways Into Strand Pairing Prediction

Refining Neural Network Predictions for Helical Transmembraneproteins by Dynamic Programming

A Day in the Life of Dr K. Or How I Learned to Stop Worrying and Love Lysozyme: a Tragedy in Six Acts Gunnar Von Heijne