Computational Identification of Genes: Ab Initio

Computational identification of genes: ab initio and comparative approaches Genís Parra Farré PhD thesis Barcelona, December 2004 The Figure in the cover shows a representation of geneid predictions (Figure 1, Parra et al. (2000)). Computational identification of genes: ab initio and comparative approaches Genís Parra Farré Memòria presentada per optar al grau de Doctor en Biologia per la Universitat Pompeu Fabra. Aquesta Tesi Doctoral ha estat realitzada sota la direcció del Dr. Roderic Guigó Serra al Departament de Ciències Experimentals i de la Salut de la Universitat Pompeu Fabra Roderic Guigó Serra Genís Parra Farré Barcelona, December 2004 The research in this thesis has been carried out at the Genome Bioinformatics Lab (GBL) within the Grup de Recerca en Informàtica Biomèdica (GRIB) at the Parc de Recerca Biomèdica de Barcelona (PRBB), a consortium of the Institut Municipal d’Investigació Mèdica (IMIM), Universitat Pompeu Fabra (UPF) and Centre de Regulació Genòmica (CRG). The research carried out in this thesis has been supported by grants from Ministerio de Ciencia y Tecnología to R. Guigó. To my parents To my brother and sisters Dipòsit legal: B.30805-2005 ISBN: 84-689-2750-3 Motivation It is clear that we are living a really important period in the development and the knowl- edge of life sciences. Fifty years after the description of the structure of the double helix, we have moved from the analysis of a single gene to the systematic mass sequencing of entire genomes. At the time of writing this dissertation, whole genome sequencing projects for hundreds of organisms (bacteria, archea and eukaryota, as well as many viruses and organelles) are either complete or underway. All the information we are gath- ering today will probably modify the way we will understand life, science and medicine. But, before the best use can be made of this data, the identification and the precise location of the functional regions of the genomic sequences must be determined. The most important things to realize about the “book of life”, is that we know almost nothing about the language in which it is written, and that raw genomic sequences are mainly useless for the scientific community. The challenge ahead is to extract relevant information encoded within the billions of nucleotides stored in our databases. In a very simplistic description, the first step in the functional annotation of a genome would be to find the collection of genes encoded in the nucleic acid sequences. The next step would be to assign a function to each protein, where the three dimensional structure of the proteins will play a key role. Then, using microarrays technology, it will be feasible to obtain the spatial and temporal expression pattern of each gene at any developmental stage or specific condition. Finally, the last step would be to establish the network of interactions and regulations among all the proteins of a complete genome. This thesis focuses on the first step of any genome analysis: to find where genes are. The motivation of this thesis, thus, is to give a little insight in how genes are encoded and recognized by the cell machinery and to use this information to find genes in unannotated genomic sequences. The complexity of gene prediction differs substantially in prokaryotic and eukaryotic genomes. While prokaryotic genes are encoded in single continuous open reading frames, usually adjacent to each other, eukaryotic genes are separated by long stretches of intergenic regions, and their coding sequences can be interrupted by large non coding sequences. One of the objectives of this thesis is the development of tools to identify eukaryotic genes through the modeling and recognition of their intrinsic signals and properties. This thesis addresses another significant open problem of this field: how the sequence of related genomes can contribute to the identification of genes. The value of comparative genomics is illustrated by the sequencing of the mouse genome for the purpose of anno- tating the human genome. The availability of closely related genomes makes it possible to carry out genome-wide comparisons and analysis of syntenic regions. Recently, compar- i ii Motivation ative gene predictions programs exploit this data under the assumption that conserved regions between related species correspond to functional regions (coding genes among them). Thus, the second part of this thesis describes a gene prediction program that com- bines ab initio gene prediction with comparative information between two genomes to improve the accuracy of the predictions. Nowadays computational analysis is a major, integral part of genomics. It would not be an exaggeration to claim that genomics analysis can only be made with computational tools. Only by using computational methods and statistical models we can try to find out how genes are encoded and try to accurately predict their location in complete genomic sequences. Thus, the work described in this dissertation is essentially interdisciplinary; this means that while the basic subject of matter is biological and the obtained results are of biological interest, techniques from other fields have been extensively used. Statistical approaches have been used to create models of genomic features to be able to recognize sequence mo- tifs and reproduce the underneath biological process, while computational programming has been applied to include these models into efficient bioinformatic tools. Genís Parra Barcelona, December 2004 Acknowledgments It is not just to follow convention that I first acknowledge my PhD advisor, Roderic Guigó. Quite simply, if not for him, my academic career would have finished with my bachelor degree. It was he who saw past my sub-optimal scores to someone able to work on research. I am indebted to him for letting me start what I hope will be a long career in research. Other people thath I would like to thanl are: Pankaj Agarwall, for the stage in the GlaxoSmithKline, the Dyctiostelium annotation group: Karol Szafranski, Gernot Glökner and Mathias Platzer, the people from the University of Geneva: Manolis Dermitzakis, Alexandre Reymond and Stylianos Antonarakis and Michael Brent and all the people of his lab for the scientific collaboration and the invitation to Saint Louis. I would also like to mention the people who may not had a direct impact on this thesis, but have influenced it indirectly by molding me into the person I am today. All the people who have given of their time, talents, and expertise to help me on this project have enriched my life. For their special friendships and assistance, I am most grateful: To Mercè for being there in the darkest years of my PhD (and life). For encouraging me when I was giving everything up. For your trust in friendship. For the years we lived together and for all the amazing things you taught me. For all the incredible journeys we did. For all the love you gave me. To Sergi for those Sunday afternoons in la filmoteca. For your way of living life and science. For la Passió d’Esparraguera, Eric Sardinas and for your comics. For those nights in New York. Special thanks for your cocktails, for listening to me and for your wise advises. To Cristina for your enthusiasm and your energy. For your tiramisu, for your pesto and for your profiteroles. For all your tenderness and comprehension. For Patrizio, Gur- dieff and Dilan Dog. For going with me to fill my bottle of water every day. For the Ravenna mosaics. For bringing happiness and joy in our every day life. To Robert for your thesis template, for your whiskey and Risk sessions. For the par- ties on your roof. For being our volleyball captain. For your pictures of California (that decided my future). For your wise statements. To Pep for sharing a fraction of all your infinite wisdom with me. I learned (or at least I tried) from you to try to do the things the best one can. To Enrique for programing geneid. I learned a lot while working with you. For your patience with my geneid problems. For all the course we teached and the moments we shared. iii iv Acknowledgments To Fabien and Isabel for those roller hockey nights. For your penguin. For your strength and courage. For your friendship. To Xavi for being our mentor in the early days. I will say nothing about your home directory. For being a destroyer. Quin payo !! To Bet for massaging my breast. For gifts you give. For your sympathy and friendship. For your complicity. To Rut for the swimming mornings, for the theater, for your smile. Remember me when you become a famous actress. To Noura for being just like you are. For your voice and for that night in the karaoke. To Ramon for all those amazing gadgets you have. For your outdoor activities. For your true friendship, for your music, for your stories and for el Pilar. To Peppolino for your hugs, for running with me, for the musica pertarda, and for the Arena sessions. To Citlali for all the pushes and shoves playing hockey. For Valencia and Blanes. For how I feel being with you. For your guacamole. To Moisès for installing Slackware on my first hard disk and for your long discussions, for your spontaneity and freshness. To Charles for incite, for your comics and your sense of humor. To Oscar for all the conversations we had while other people were dancing. For the amazing physical properties of liquids falling inside glasses. To my students. Specially to my first group: Jimena, Encarni, Bet and Jordi, who show me how difficult is to be a teacher. Just kidding !! You were the best group I ever had. To Gus. To Jan-Jaap for all the effort you did in the correction of this thesis, it was really a lot of work and I really appreciate it !!! To Queviures Murgadella for all the food I shared with my friends in the lab.

Computational Identification of Genes: Ab Initio

Large-Scale Opening of Utrophints Tandem Calponin Homology (CH

Functional Aspects and Genomic Analysis

Sequence Alignment/Map) Is a Text Format for Storing Sequence Alignment Data in a Series of Tab Delimited ASCII Columns

The Roles of Actin-Binding Domains 1 and 2 in the Calcium-Dependent Regulation of Actin Filament Bundling by Human Plastins

Gene Prediction Using Deep Learning

The Actin Binding Protein Plastin-3 Is Involved in the Pathogenesis of Acute Myeloid Leukemia

Complete Genome Sequence of the Hyperthermophilic Bacteria- Thermotoga Sp

Alternate-Locus Aware Variant Calling in Whole Genome Sequencing Marten Jäger1,2, Max Schubach1, Tomasz Zemojtel1,Knutreinert3, Deanna M

The Role of Actin Binding Proteins in Cell Motility Elizabeth Ojukwu University of Connecticut - Storrs, [email protected]

Galaxy Platform for NGS Data Analyses

Cep-2020-00633.Pdf

Bioinformatics: a Practical Guide to the Analysis of Genes and Proteins, Second Edition Andreas D