Tracing Protein Evolutionary Trajectory Homology Inference with Specific Molecular Constraints

Tracing protein evolutionary trajectory Homology inference with specific molecular constraints Jarosław Surkont Dissertation presented to obtain the Ph.D degree in Bioinformatics Instituto de Tecnologia Química e Biológica António Xavier | Universidade Nova de Lisboa Research work coordinated by: Oeiras April, 2016 Cover: A cartoon representation of the Rab11b and cGMP-dependent protein kinase II leucine zipper complex (A. S. Reger et al. (2014). Crystal Structure of the cGMP-dependent Protein Kinase II Leucine Zipper and Rab11b Protein Complex Reveals Molecular Details of G-kinase-specific Interactions. Journal of Biological Chemistry 289.37, 25393–25403). Tracing protein evolutionary trajectory Homology inference with specific molecular constraints Jarosław Surkont, Computational Genomics Laboratory, Instituto Gulbenkian de Ciência Declaration: This dissertation is a result of my own research carried out between June 2012 and March 2016 in the laboratory of Dr. José B. Pereira Leal, Ins- tituto Gulbenkian de Ciência in Oeiras, Portugal, within the PhD Programme in Integrative Biomedical Sciences (edition 2011-2012). Declaração: Esta dissertação é o resultado do meu próprio trabalho desenvolvido entre Julho 2012 e Março 2016 no laboratório do Doutor José B. Pereira Leal, Instituto Gulbenkian de Ciência em Oeiras, Portugal, no âmbito do Programa de Doutoramento em Integrative Biomedical Sciences (edição 2011-2012). Financial support: This dissertation had the financial support from Fundação para a Ciência e a Tecnologia and European Social Found, through grant SFRH/ BD/51880/2012 awarded to Jarosław Surkont, and Fundação Calouste Gulben- kian. Apoio financeiro: Esta dissertação teve o apoio financeiro da Fundação para a Ciência e a Tecnologia e do Fundo Social Europeu no âmbito do Quadro Comu- nitário de apoio através da bolsa de doutoramento SFRH/BD/51880/2012 e da Fundação Calouste Gulbenkian. Acknowledgements: I would like to thank my supervisor José Pereira Leal for his guidance and support throughout my PhD. Special thanks to all of the past and present members of the Computational Genomics Lab for their support, enthusi- asm, and criticism that helped to develop the ideas presented in this thesis. I also wanted to thank Thiago Carvalho for providing me with an opportunity to join the PhD programme and the IGC community. Summary volutionary processes can be considered at multiple levels of biological or- E ganization. The work developed in this thesis focuses on protein molecular evolution. Although proteins are linear polymers composed from a basic set of 20 amino acids, they generate an enormous variety of form and function. Pro- teins that have arisen by a common descent are classified into families; they often share common properties including similarities in sequence, structure, and function. Multiple methods have been developed to infer evolutionary relationships between proteins and classify them into families. Yet, those generic methods are often inaccurate, especially when specific protein properties limit their applica- tions. In this thesis, we analyse two protein classes that are often difficult for the evolutionary analysis: the coiled-coils – repetitive protein domains defined by a simple widespread peptide motif (chapters 2 and 3) and Rab small GTPases – a large family of closely related proteins (chapters 4 and 5). In both cases, we analyse the specific properties that determine protein structure and function and use them to improve their evolutionary inference. Coiled-coils are ubiquitous rod-like domains present in all living organisms that comprise up to 10% of proteins encoded in a genome. They are involved in multiple cellular processes, where they function as spaces separating func- tional domains, or (and) interaction sites, often providing scaffolds for protein complexes. Resolving evolutionary histories of coiled-coil proteins should shed some light on the evolution of many intracellular processes and systems. Yet, coiled-coils are considered difficult for evolutionary inference due to their low- complexity: The domain consists of multiple repetitions of a simple peptide pattern of seven amino acids. As a result, non-homologous coiled-coil domains can converge to a similar sequence and structure; coiled-coils have independently arisen multiple times in evolution. In chapter 2 we analysed the evolutionary properties of coiled-coil sequences. We showed that, despite the underlying simple pattern of hydrophobic and polar residues, coiled-coil sequences are conserved and contain evolutionary relevant information (similar to the globular domains). Yet, the patterns of amino acid substitutions differ from those of globular domains. We developed a coiled-coil i specific model (CC) that reflects this substitution patterns. In the context of phylogenetic reconstruction of coiled-coil proteins, it outperforms general models, often leading to different tree topologies. For multidomain proteins, consisting of both coiled-coil and globular domains, model partitioning involving the CC model and a general model yields more likely tree estimates than a single model. Lastly, the new model improves search sensitivity of (sequence similarity-based) homology detection methods for coiled-coil proteins. Proteins evolve by altering the composition of their primary sequence but also by changing their length. Hence, to complement the analysis of the substitution patterns in coiled-coil domains we also analysed their length evolution. The number of sequence repeats (in repetitive proteins), and as a consequence protein length, varies across homologues. Yet, given the structural role in the spacer and scaffold formation, we hypothesized, in chapter 3, that the length of coiled-coil domains is largely conserved in evolution. Indeed, we observed high conservation of coiled-coil regions length throughout the tree of life, even when the remaining parts of the protein, including globular domains, change. This length conservation is independent of the conservation of the amino acid composition. It reflects the conservation of the physical length of the domain; contrary to the globular domains, the size of the coiled-coil domains changes proportionally to the change in the sequence length. Length conservation is functionally specific, suggesting that the domain size is constrained by its function. The remaining chapters focus on the Rab GTPases, a large protein family of closely related proteins that regulates membrane trafficking by providing speci- ficity to the system. Rabs are short single-domain proteins with a complex evolutionary history: They have been subject to multiple general and taxon-specific duplications (and losses). The family structure, its size, and a high level of sequence similarity pose a challenge for evolutionary inference methods. In chapter 4 we present Rabifier2, a new version of a bioinformatic pipeline for Rab GTPase identification and classification. The new Rabifier outperforms the ini- tial version in both the annotation accuracy and speed. It is available as a web service (RabDB, which also includes pre-computed annotations for sequenced eukaryotic genomes) and a stand-alone package. Rabifier is distributed as an open source software, which should foster its further community-driven develop- ii ment. Mounting evidence suggests that the Last Eukaryotic Common Ancestor (LECA) was a complex organism with many features characteristic of extant Eukaryotes, including a large array of Rab GTPases. Yet, it remains unknown how such elaborate repertoire had emerged. In chapter 5 we analysed the ori- gin of eukaryotic Rabs. Using the new Rabifier pipeline, we predicted putative Rab-like GTPases in Archaea and specifically in Lokiarchaeaon – an archaeal species that contains several eukaryotic signatures, including an expanded repertoire of small GTPases. The phylogenetic analysis was inconclusive about the position of the archaeal Rab-like proteins within the small GTPase family. Yet, a detailed sequence- and structure-based analysis revealed a strict conservation of Rab-specific motifs that mediate interactions with Rab regulators. A sensitive search revealed that, indeed, putative Rab-binding proteins exist in Archaea, sup- porting the hypothesis that some components of the eukaryotic endomembrane system evolved before LECA. This thesis focuses on analysing protein properties in the evolutionary context. The incorporation of such protein class-specific information can often result in a better inference of the protein evolutionary trajectory. These improvements are, however, not restricted to protein evolution. The presence of characteristic marker proteins in an organism is a strong indicator for the presence of specific cellular processes and structures; conversely, the absence of a protein suggests a missing functionality. Hence, the improved tools for inferring evolutionary relationships between proteins should ultimately help to uncover the evolution of cellular components and functions. iii Sumário s processos evolutivos podem ser observados em vários níveis de organiza- O ção biológica. Neste contexto, o trabalho desenvolvido nesta tese centra-se na evolução molecular de proteínas. Apesar de as proteínas serem polímeros li- neares compostos por um conjunto básico de 20 aminoácidos, podem gerar uma enorme variedade de formas e funções. Proteínas que surgiram de um descen- dente comum são classificadas por famílias, que têm geralmente propriedades comuns, incluindo similaridades na sequência, estrutura e função.

Tracing Protein Evolutionary Trajectory Homology Inference with Specific Molecular Constraints

A Computational Approach for Defining a Signature of Β-Cell Golgi Stress in Diabetes Mellitus

Genome Sequence and Description of Haloferax Massiliense Sp. Nov., a New Halophilic Archaeon Isolated from the Human Gut

Molecular Characterization of Acquired Tolerance of Tumor Cells to Picropodophyllin (PPP)

Deletion of 11Q in Neuroblastomas Drives Sensitivity to PARP Inhibition Elena Sanmartín1,2, Lisandra Munoz~ 1,2, Marta Piqueras3, J

Haloferax Massiliensis Sp. Nov., the First Human-Associated Halophilic

The Role of Stress Proteins in Haloarchaea and Their Adaptive Response to Environmental Shifts

Research Article Identification of Key Genes and Pathways in Triple-Negative Breast Cancer by Integrated Bioinformatics Analysis

Genome-Wide Association Meta-Analysis for Early Age-Related Macular Degeneration Highlights Novel Loci and Insights for Advanced Disease Thomas W

Variations in the Two Last Steps of the Purine Biosynthetic Pathway in Prokaryotes

Transcriptomic Causal Networks Identified Patterns of Differential Gene Regulation in Human Brain from Schizophrenia Cases Versus Controls

Differentiation Involved in Early Th1 and Th2 Cell Genome-Wide

A Differential Metabarcoding Approach to Describe Taxonomy Profiles Of