Protein Structure Comparison Via Contact Map Alignment

PROTEIN STRUCTURE COMPARISON VIA CONTACT MAP ALIGNMENT FELIPE LEAL VALENTIM 2010 FELIPE LEAL VALENTIM PROTEIN STRUCTURE COMPARISON VIA CONTACT MAP ALIGNMENT Dissertação apresentada à Universidade Federal de Lavras, como parte das exigências do Pro- grama de Pós-Graduação em Biotecnologia Veg- etal, área de Concentração em Biotecnologia Vegetal, para obtenção do título de “Mestre”. Orientador Prof. Ricardo Martins de Abreu Silva LAVRAS MINAS GERAIS -BRASIL 2010 Ficha Catalográfica Preparada pela Divisão de Processos Técnicos da Biblioteca Central da UFLA Valentim, Felipe Leal. Protein Structure Comparison via Contact Map Alignment / Felipe Leal Valentim. – Lavras : UFLA, 2010. 77 p. : il. Dissertação (mestrado) – Universidade Federal de Lavras, 2010. Orientador: Ricardo Martins de Abreu Silva. Bibliografia. 1. Proteínas. 2. Alinhamento estrutural. 3. Mapas de contato. 4. MAX-CMO. I. Universidade Federal de Lavras. II. Título. CDD – 574.0285 FELIPE LEAL VALENTIM PROTEIN STRUCTURE COMPARISON VIA CONTACT MAP ALIGNMENT Dissertação apresentada à Universidade Federal de Lavras, como parte das exigências do Pro- grama de Pós-Graduação em Biotecnologia Veg- etal, area de Concentração em Biotecnologia Vegetal, para obtenção do título de “Mestre”. APROVADA em 05 de Fevereiro de 2010 Prof. Luciano Vilela Paiva - UFLA Prof. Wagner Meira Junior - UFMG Profa. Gloria Regina Franco - UFMG Prof. Ricardo Martins de Abreu Silva UFLA (Orientador) LAVRAS MINAS GERAIS - BRASIL SUMÁRIO ABSTRACT . .i RESUMO . ii CHAPTER 1: General Introduction . .1 1 Protein structures and Contact Maps . .2 1.1 Availability of structural data of proteins . .3 1.2 Protein Contact Maps . .3 2 Protein Structure Comparisons . .5 2.1 Contact Map Alignment . .6 3 Objectives and structure of the thesis . .7 4 References . .8 CHAPTER 2: GRASP with Path-Relinking for the MAX-CMO problem . 10 1 Abstract . 11 2 Resumo . 12 3 Introduction . 13 4 GRASP with Path-relinking for MAX-CMO . 20 4.1 Greedy randomized construction . 23 4.2 Approximate local search . 24 4.3 Path-relinking . 35 5 Computational experiments and Results . 36 5.1 Test environment and Datasets . 37 5.2 Comparison of the GRASP-PR heuristic with other algorithms . 37 5.3 Time-to-target plots for GRASP-PR against other heuristics . 44 5.4 GRASP-PR and the Skolnick Clustering test set . 47 5.5 GRASP-PR as a structural alignment tool . 49 6 Concluding remarks . 56 7 References . 57 CHAPTER 3: GOIC-Biocomp: a web-based tool for protein alignment . 61 1 Abstract . 62 2 Resumo . 63 3 GRASP with Path-relinking for MAX-CMO . 64 4 GOIC-Biocomp server . 67 5 References . 69 CHAPTER 4: Summary, Conclusions and Future Work . 71 1 Abstract . 72 2 Resumo . 73 3 Summary, General Conclusions and Future Work . 74 4 References . 77 ABSTRACT VALENTIM, Felipe Leal. Protein Structure Comparison via Contact Map Alignment. 2010. 77 p. Master thesis (Master in Plant Biotechnology) - Universidade Federal de Lavras, Lavras.* Proteins are primary components in almost all biological processes in living organisms. It is known that the variety of protein functions is a result of the differ- ences in protein structures. Therefore, understanding and comparing the structure of proteins is a major challenge in modern molecular biology. The structural alignment and comparison of proteins became an essential task, whose solution is in- strumental in aiding other problems such as drug design, protein structure/function prediction, and protein clustering. One promising class of approaches for measur- ing protein similarity relies on the alignment of the protein contact maps. The most common mathematical statement of the contact map comparison problem is called the Maximum Contact Map Overlap (MAX-CMO). In this context, in this Master’s thesis it has been proposed the hybrid heuristic Greedy Random Adap- tive Search Procedure with Path-relinking for the Maximum Contact Map overlap problem which have been revealed able to find improved solutions. Another pro- posal which has been presented in this work is the implementation of a computational tool that allows the structural alignment of proteins through the proposed heuristic. The chapters 2 and 3 of this dissertation represent the manuscripts des- cribing these two proposals and a final chapter that contains the conclusion and outlines the possibilities for future work. Keywords: Proteins, Structural alignment, Contact Maps, MAX-CMO. *Advisor: Ricardo Martins de Abreu Silva - UFLA i RESUMO VALENTIM, Felipe Leal. Protein Structure Comparison via Contact Map Alignment. 2010. 77 p. Dissertação (Mestrado em Biotecnologia Vegetal) - Universidade Federal de Lavras, Lavras.* As proteínas são componentes primários em quase todos os processos bi- ológicos nos organismos vivos. Sabe-se que a variedade de funções de proteínas é um resultado das diferenças nas estruturas de proteínas. Portanto, compreender e comparar a estrutura das proteínas é um desafio importante na biologia molecular moderna. O alinhamento estrutural e comparação das proteínas tornaram-se tarefas essenciais, cuja solução é fundamental para auxiliar outros problemas, tais como a desenho racional de novos fármacos, predição de função/estrutura de proteínas, e clusterização de proteínas. Uma classe promissora de abordagens para medir a similaridade da proteína depende do alinhamento dos mapas de contato da pro- teína. A fomalização mais comum para o problema matemático de alinhamento de mapas de contato é chamado o Maximum Contact Map Overlap problem (MAX- CMO). Neste contexto, esta dissertação de mestrado propõe a heurística híbrida Greedy Random Adaptive Search Procedure com Path-relinking para o Maximum Contact Map Overlap problem, que tem se revelado capaz de encontrar soluções promissoras. Outra proposta apresentada neste trabalho é a implementação de uma ferramenta computacional que permite o alinhamento estrutural de proteínas através da heurística proposta. Os capítulos 2 e 3 desta dissertação representam os artigos que descrevem estas duas propostas. Um capítulo final descreve experi- mentos adicionais realizados com a heurística e a ferramenta computacional. Palavras-chave: Proteínas, Alinhamento estrutural, mapas de contato, MAX- CMO. *Orientador: Ricardo Martins de Abreu Silva - UFLA ii CHAPTER 1 GENERAL INTRODUCTION 1 1 PROTEIN STRUCTURES AND CONTACT MAPS Proteins are primary components in almost all biological processes in living organisms. It is known that the variety of protein functions is a result of the differ- ences in protein structures. Therefore, understanding and comparing the structure of proteins is a major challenge in modern molecular biology (Eidhammer et al., 2004). Despite the large amount of diversity in their functions, all proteins are made of the same components, amino acids. And all amino acids share the same basic structure. Each amino acid consists of a central carbon atom (Cα), an amino group (NH3), at one end, a carboxyl group (COOH) at the other end, and a side-chain (R) that characterizes the amino acids. This side-chain is usually referred to as an amino acid residue, or simply a residue (Eidhammer et al., 2004). In order to form a protein molecule, the carboxyl group of one amino acid forms a peptide bond with the amino group of another amino acid and an (H2O) molecule is revealed. The sequence of peptide bonds forms the protein backbone. There are 20 different side-chains specified by genetic code each of which is ad- dressed by a letter of the alphabet. Since each protein is a sequence of amino acids, it can be described by a string over this set of 20 letters (Eidhammer et al., 2004). The structure of proteins is organized in four structural levels: primary, secondary, tertiary, and quaternary structures. The linear sequence of amino acids that contribute to the formation of a protein molecule is called its primary structure. Many proteins contain roughly 100-1000 amino acids (Eidhammer et al., 2004) (some even more than 4000). Local arrangement of a few or a few dozen amino acid residues (Hunter, 1993) is seen in particular patterns repeatedly in many different proteins. These patterns are formed because of the interactions mediated by hydrogen bonds mainly within the backbone. The three-dimensional fold of the 2 protein molecule - which is a result of connecting secondary structures together - is called tertiary structure of the protein (Eidhammer et al., 2004). There are many proteins in nature which form from combinations of two or more protein chains. The spatial arrangement of these proteins is called quaternary structure. In this work, we are particularly interested in the tertiary structure of proteins, and when we refer to the protein structure in what follows, we will be referring to their tertiary ones. 1.1 Availability of structural data of proteins The Protein Data Bank (PDB) (Berman et al., 2000) is the standard reposi- tory for collecting information on determined three-dimensional structures of proteins and other large biological molecules which are found in all of the living organisms. The coordinates of the structures in the PDB are determined by some experimental methods such as X-ray crystallography and Nuclear Magnetic Res- onance (NMR) spectroscopy∗. A rapid increase in the number of protein structures deposited in the PDB has been observed in recent years, and because of this growth, protein structure comparison has become a key problem in bioinformat- ics. This rapid growth has been related to the recent emergence of large scale protein structure determination projects (Nair et al., 2009), called structural genomics (Westbrook et al., 2003). 1.2 Protein Contact Maps Three-dimensional structure of proteins can be represented by their distance maps. A distance map is an N N matrix, where N is the number of amino × acids in the sequence of a protein. Each element di,j in the matrix D represents the distance between the ith and the jth amino acids, usually in Angestrom (Å). The distance between two residues can be defined in different ways, such as the ∗More information about the PDB can be found in http://www.wwpdb.org/documentation/ 3 distance between Cα -Cα (Vullo & Frasconi, 2003).

Protein Structure Comparison Via Contact Map Alignment

Arxiv:1911.09811V1 [Physics.Bio-Ph] 22 Nov 2019 Help to Associate a Folded Structure to a Protein Sequence

Protein Contact Map Prediction Using Multiple Sequence Alignment

Computational Protein Structure Prediction Using Deep Learning

Leveraging Novel Information Sources for Protein Structure Prediction

Ensembling Multiple Raw Coevolutionary Features with Deep Residual Neural Networks for Contact‐Map Prediction in CASP13

Direct Information Reweighted by Contact Templates: Improved RNA Contact Prediction by Combining Structural Features

Machine Learning Methods for Protein Structure Prediction Jianlin Cheng, Allison N

Protein Structures

Deep Learning-Based Advances in Protein Structure Prediction

Challenges in the Computational Modeling of the Protein Structure—Activity Relationship

Protein Contact Map Denoising Using Generative Adversarial Networks

Homology Modeling in the Time of Collective and Artificial Intelligence