Methodology for Predicting Semantic Annotations of Protein Sequences by Feature Extraction Derived of Statistical Contact Potentials and Continuous Wavelet Transform
Total Page:16
File Type:pdf, Size:1020Kb
Universidad Nacional de Colombia Sede Manizales Master’s Thesis Methodology for predicting semantic annotations of protein sequences by feature extraction derived of statistical contact potentials and continuous wavelet transform Author: Supervisor: Gustavo Alonso Arango Dr. Cesar German Argoty Castellanos Dominguez A thesis submitted in fulfillment of the requirements for the degree of Master’s on Engineering - Industrial Automation in the Department of Electronic, Electric Engineering and Computation Signal Processing and Recognition Group June 2014 Universidad Nacional de Colombia Sede Manizales Tesis de Maestr´ıa Metodolog´ıapara predecir la anotaci´on sem´antica de prote´ınaspor medio de extracci´on de caracter´ısticas derivadas de potenciales de contacto y transformada wavelet continua Autor: Tutor: Gustavo Alonso Arango Dr. Cesar German Argoty Castellanos Dominguez Tesis presentada en cumplimiento a los requerimientos necesarios para obtener el grado de Maestr´ıaen Ingenier´ıaen Automatizaci´onIndustrial en el Departamento de Ingenier´ıaEl´ectrica,Electr´onicay Computaci´on Grupo de Procesamiento Digital de Senales Enero 2014 UNIVERSIDAD NACIONAL DE COLOMBIA Abstract Faculty of Engineering and Architecture Department of Electronic, Electric Engineering and Computation Master’s on Engineering - Industrial Automation Methodology for predicting semantic annotations of protein sequences by feature extraction derived of statistical contact potentials and continuous wavelet transform by Gustavo Alonso Arango Argoty In this thesis, a method to predict semantic annotations of the proteins from its primary structure is proposed. The main contribution of this thesis lies in the implementation of a novel protein feature representation, which makes use of the pairwise statistical contact potentials describing the protein interactions and geometry at the atomic level. Initially, a protein sequence is decomposed into a numerical series by a contact potential. From the interactions between adjacent amino acids, the wavelet transform can easily detect and characterize subsequences at specific position along the protein sequence. Then, all subsequences are grouped into clusters and a Hidden Markov Model (HMM) profile is built for each one of the groups. Finally, the modeled profiles HMM are used as features in order to build a feature space with the aim to train and evaluate a support vector machine classifier. Evaluations of the proposed methodology are driven against three different views 1) known protein features 2) motif-domain based features (PFam terms) and 3) performance evaluation over several methods for protein annotation prediction. As result, The method have acquired the highest performance prediction in most of the study cases. Thus, this efficiency suggest our approach as an alternative method for the characterization of protein sequences. Although, the research in this thesis focuses on the classification problem, the scientific community can make use of the methodology in two different ways: 1) as a protein predictor and 2) as a motif finding tool. Finally, the source code of the method is free available for download at SourceForge http://sourceforge.net/projects/wamofi/?source=navbar. UNIVERSIDAD NACIONAL DE COLOMBIA Resumen Facultad de Ingenier´ıay Arquitectura Departamento de Ingenier´ıaEl´ectrica, Electr´onicay Computaci´on Maestr´ıaen Ingenieria Automatizaci´onIndustrial Metodolog´ıa para predecir la anotaci´onsem´antica de prote´ınaspor medio de extracci´onde caracter´ısticas derivadas de potenciales de contacto y transformada wavelet continua by Gustavo Alonso Arango Argoty En esta tesis se propone un m´etodo para la predicci´onde anotaciones de prote´ınas a par- tir de la estimaci´onde caracter´ısticas en secuencias biol´ogicas. Dicha estimaci´onemplea informaci´onsobre la estructura de las prote´ınas a partir de las estadisticas de contactos potenciales entre pares de amino ´acidos. Inicialmente, una prote´ınaes transformada a una serie numerica por medio de estos contactos potenciales. Debido a las interac- ciones entre amino ´acidoscercanos, la transformada wavelet puede f´acilmente detectar las subsecuencias pertenecientes a posiciones espec´ıficas a lo largo de la prote´ına. As´ı, todas las subsecuencias son agrupadas de acuerdo a su distribuci´ony ´estos grupos son modelados empleando perfiles de Modelos Ocultos de Markov. Finalmente, los perfiles son usados como caracter´ısticas donde prote´ınas de an´alsisson mapeadas generando as´ıun espacio de representaci´onque es usado para entrenar un clasificador basado en vectores de soporte. La metodolg´ıaha sido rigurosamente evaluada y comparada con tres diferentes criterios de caracterizaci´on: 1) caracter´ısticas globales comunmente us- adas para representar prote´ınas, 2) caracter´ısticas espec´ıficascomo motivos y domin- ios, y por ´ultimo 3) evaluaci´onde el renimiento de varios programas construidos para la predicci´onde anotaci´onde prote´ınas. Como resultado el m´etodo propuesto ha lo- grado los mas altos puntajes de predicci´onen la mayor´ıade los casos de estudio. De manera que ´estas predicciones sugieren a nuestro m´etodo como una alternativa a los comunmente usados algoritmos de caracterizaci´on. Por otra parte, a pesar de que el enfoque de la metodolog´ıa esta dise˜nada para resolver problemas de clasificaci´on,la comunidad cient´ıfica puede hacer uso de ella en dos diferentes enfoques: 1) como un predictor de anotaciones en prote´ınasy 2) como una herramienta para encontrar mo- tivos. Por ´ultimo, el c´odigofuente del m´etodo se encuentra para libre descarga en: http://sourceforge.net/projects/wamofi/?source=navbar. iii keywords continuous wavelet transform, statistical contact potentials, protein predic- tion, support vector machine, sequence alignment. palabras-clave transformada wavelet continua, potenciales de contacto estadisticos, prediccion de proteins, maquinas de vectores de soporte, alineamiento de secuencias. Acknowledgements To my siblings Manuel, Miguel and Martha Arango who emotionaly helped me in the achievement of this goal. To my advisers German Castellanos and Jorge Jaramillo who guide me on the process of the development of this thesis and I specially want to dedicate this work to my parents (passed away) Gladys Argoty and Miguel Arango who are my main inspiration for the development of my academic and personal life. iv Contents Acknowledgements iv List of Figures vii List of Tables ix 1 Introduction 1 2 Objectives 3 2.1 Main objective ................................. 3 2.2 Specific objectives ............................... 3 3 Background 4 3.1 Proteins ..................................... 4 3.1.1 Protein synthesis . ........................... 4 3.1.2 Protein structure . ........................... 5 3.1.3 Protein folding ............................. 7 3.1.4 Protein domains and motifs . ..................... 7 3.2 Protein annotation prediction . ........................ 8 3.2.1 Protein Similarity . .......................... 8 3.2.2 Machine learning and feature-based classification . 10 3.2.2.1 Protein classification . .................... 11 3.2.2.2 Feature extraction . ..................... 12 4 Methodology 15 4.1 Profile descriptor ................................ 16 4.1.1 Statistical contact potentials . .................... 17 4.1.2 Continuous wavelet transform . .................... 18 4.1.2.1 Resolution of the continuous wavelet transform . 21 4.1.3 Feature detection by wavelet transform . 22 4.1.3.1 Numerical representation . 22 4.1.3.2 Motif detection . ....................... 23 4.1.3.3 Projection . ......................... 24 4.1.4 Multiple sequence alignment . .................... 25 4.1.4.1 The distance matrix . .................... 26 4.1.4.2 The guide tree . ....................... 27 v Contents vi 4.1.4.3 Progressive alignment . ................... 27 4.1.5 Hidden Markov Models . ....................... 27 4.1.5.1 Profile HMMs . ....................... 30 4.2 Classification framework . ........................... 32 4.2.1 Support Vector Machines (SVMs) . 34 4.2.2 Selecting the best contact potential . 36 5 Experimental framework 37 5.1 Protein subcellular localization on Gram-Negative bacterial proteins . 37 5.2 Protein control-evaluation data sets for Gram negative bacteria . 38 6 Results and discussion 40 6.1 Comparison with cutting edge predictors ................... 42 6.2 Comparison with global featuring methods . 43 6.3 Comparison with a profile based method . 44 7 Conclusions 46 8 Appendix A: Wavelet Transform 48 8.1 Properties of the Continuous Wavelet Transform . 48 8.2 Mother Wavelets Commonly Used . ..................... 50 9 Appendix B: List of Pairwise Contact Potentials 52 List of Figures 3.1 The basis of cell and molecular biology: a) DNA replication one double- stranded DNA molecule produces two identical copies of the DNA, b) RNA transcription a segment of DNA is copied into RNA by the en- zyme RNA polymerase, if the gene transcribed encodes a protein, the re- sult of transcription is the messenger RNA (mRNA). c) Protein trans- lation mRNA is decoded by the ribosome to produce a specific amino acid chain, which, later will fold into an active protein. ........... 5 3.2 General formula of the amino acid showing the positions of amino (−NH2) group, carboxyl group (−COOH), α-carbon atom and the side chain R that can be any of the 20 different chains . ................. 5 3.3 Levels of protein structure: A) Primary structure is an arrangement of amino acids. B) Secondary structure is a linear folding of the polypep-