Evolution and Function of Drososphila Melanogaster Cis-Regulatory Sequences

Total Page:16

File Type:pdf, Size:1020Kb

Evolution and Function of Drososphila Melanogaster Cis-Regulatory Sequences Evolution and Function of Drososphila melanogaster cis-regulatory Sequences By Aaron Hardin A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Molecular and Cell Biology in the Graduate Division of the University of California, Berkeley Committee in charge: Professor Michael Eisen, Chair Professor Doris Bachtrog Professor Gary Karpen Professor Lior Pachter Fall 2013 Evolution and Function of Drososphila melanogaster cis-regulatory Sequences This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License 2013 by Aaron Hardin 1 Abstract Evolution and Function of Drososphila melanogaster cis-regulatory Sequences by Aaron Hardin Doctor of Philosophy in Molecular and Cell Biology University of California, Berkeley Professor Michael Eisen, Chair In this work, I describe my doctoral work studying the regulation of transcription with both computational and experimental methods on the natural genetic variation in a population. This works integrates an investigation of the consequences of polymorphisms at three stages of gene regulation in the developing fly embryo: the diversity at cis-regulatory modules, the integration of transcription factor binding into changes in chromatin state and the effects of these inputs on the final phenotype of embryonic gene expression. i I dedicate this dissertation to Mela Hardin who has been here for me at all times, even when we were apart. ii Contents List of Figures iv List of Tables vi Acknowledgments vii 1 Introduction1 2 Within Species Diversity in cis-Regulatory Modules6 2.1 Introduction....................................6 2.2 Results.......................................8 2.2.1 Genome wide diversity in transcription factor binding sites......8 2.2.2 Genome wide purifying selection on cis-regulatory modules......9 2.3 Discussion.....................................9 2.4 Methods for finding polymorphisms....................... 11 2.5 Methods for inferring selection on binding sites................. 11 2.6 Figures....................................... 12 3 Early Embryo Chromatin Landscape 17 3.1 Introduction.................................... 17 3.2 Results....................................... 19 3.3 Discussion..................................... 22 3.4 Methods of Embryo Collection and Treatment................. 23 3.5 Methods of Comparing DNase-seq........................ 26 3.6 Figures....................................... 27 3.7 Tables....................................... 35 4 Interaction of Chromatin and Expression in Early Embryogenesis 36 4.1 Introduction.................................... 36 4.2 Results....................................... 38 4.3 Discussion..................................... 38 4.4 Methods...................................... 40 4.5 Figures....................................... 40 Contents iii Bibliography 44 iv List of Figures 2.1 Histogram of allele frequencies from two populations of N. American strains. The alleles found from the Raleigh population measured with short reads correlate highly with alleles in the Winters population from Sanger sequencing...... 12 2.2 Distribution of Tajima’s D values for classes of D. melanogaster DNA. Bar are given with whiskers indicating two standard errors. Transcription factors regions for each factor were appended together........................ 13 2.3 Strength of AP transcription factors binding in relationship to population motif diversity in bound regions. Strength of binding is measured by the maximum fragment depth of the region and motifs were considered as binary absent or present if the motif passed a p-value threshold................... 14 2.4 Binding of AP transcription factors in relationship of excess low frequency motifs D as calculated in equation 2.1........................... 15 2.5 Transcription factor binding sites are under similar selective pressures as the entire cis-regulatory module. Each motif was shuffled by column followed by row and scored against the bound regions using the same thresholds as the canonical motif. The score distributions were compared with Wilcoxon rank-sum test and no significant differences were found......................... 16 3.1 DGRP strains used for mixture combinations. Each strain was mixed with all others at least once and biological replicates for DGRP705 mixed with DGRP380 and DGRP437 were collected and sequenced..................... 28 3.2 DNase landscape at eve. A subset of the strain mixtures shows the range of accessibilities and general consistency between strains............... 29 3.3 DNase signal at TF peaks from stage 5 embryos.................. 30 3.4 DNase I cleavage (per nucleotide per 1M reads) at ChIP-seq identified ZLD bind- ing sites. Red line is the average cuts on the positive strand and green represents DNase cuts mapping to the negative strand..................... 31 3.5 Kmer effects on chromatin accessibility....................... 32 3.6 Fold change in accessibility at kmers with largest mean effects on chromatin state. 32 3.7 Change in accessibility around differentially accessibility SNPs. The difference in expected accessibility corrected for mixture ratios around SNPs which deviated significantly from expected ratios........................... 33 List of Figures v 3.8 Regions with conserved and variable chromatin accessibility are both under puri- fying selection. The mean pairwise diversity around the peak of chromatin acces- sible regions were smoothed with LOWESS for each class of chromatin accessible regions, shared within all strains in red and polymorphic in blue......... 34 4.1 RNA fragment distribution between stain mixtures, including biological replicates. 41 4.2 Differential Expression in mixture of strains with alleles classified into matching strains 324 and 303. Red transcripts are expressed higher in 324, green transcripts are expressed higher in 303 with an FDR of 0.05 calculated with EBSeq..... 42 4.3 Expression of AP and DV target genes. Of the 7064 expressed genes, 4% show differential expression................................ 42 4.4 Differential chromatin accessibility within 2Kb of TSS and expression per tran- script. The colors of each transcript match Figure 4.2. All correlations of expres- sion and DNase accessibility per transcript partitioned into increased expression, decreased expression, no change, and all transcripts were non-significant, p = 0.06 to 0.90...................................... 43 vi List of Tables 3.1 Kmer effects on chromatin accessibility....................... 35 vii Acknowledgments I would like to acknowledge the following individuals for their invaluable contributions to this work and my personal growth in and out of the lab: Matt Davis, Xiao-Yong Li, Tommy Kaplan, Jacqueline Villalta, Mike Eisen, Devin Scannell, Doris Bachtrog, John Pool, Ryan Shultzaberger, Colin Brown, Susan Lott, Rich Lusk, Kelly Schiabor, Peter Combs, Dan Richter, Matt Taliaferro, Oh-Kyu Yoon, Tera Levin, Heather Bruce, Adam Session, Lior Pachter, Adam Roberts, Christopher Villalta, Eric Buhlis, Gary Karpen, Craig Miller, Robert Fowler, and many other members of the Eisen Lab, the Department of Molecular and Cell Biology, and the community of the University of California. 1 Chapter 1 Introduction The power of Selection, whether exercised by man or brought into play under na- ture through the struggle for existence and the consequent survival of the fittest, absolutely depends on the variability of organic beings. Without variability, noth- ing can be effected; slight individual differences, however, suffice for the work, and are probably the chief or sole means in the production of new species. (Charles Darwin 1868[1]) When Charles Darwin described the mechanisms of evolution[2], he had no knowledge of the genetic mechanisms of inheritance. He proposed gemmules[1], small packages carrying information for each part of the organism to the germ cells where they were collected and integrated forming the next generation. Darwin’s erroneous solution was motivated by the puzzle as to how quantitative and discrete heritable variation could function. We now know that both result from the interaction of a multitude of genes and their variants [3]. These networks of interactions have explained several of Darwin’s original observations of variability in and between species. The heritable variance that lead to differences in finch beaks[4] and dog varieties[5] have been traced down to the level of individual nucleotides through the study of the molecule of inheritance we now know to be DNA[6]. DNA contains the information for producing enzymatic and structural proteins and ul- timately all metabolic processes and products are encoded in DNA. DNA is heritable and is passed down from parents to offspring over generations, accumulating changes in both the regions which encode proteins and the regulatory regions that control when, where, how much, and what form of protein is produced. DNA can be modified in the short term through the addition of chemical modifications such as methylation[7] but these effects do not persist over evolutionary time scales[8], only the changes of the sequence itself. Understanding how the primary sequence of DNA is interpreted and translated into the complex form of life has been one of the major scientific efforts since the 1950s. As time and resources were put into determining the genomes of species, we collected genomes on larger scales starting with small viruses with only 4 genes[9], bacteria[10], and finally model eukaryotes such
Recommended publications
  • Enhanced Representation of Natural Product Metabolism in Uniprotkb
    H OH metabolites OH Article Diverse Taxonomies for Diverse Chemistries: Enhanced Representation of Natural Product Metabolism in UniProtKB Marc Feuermann 1,* , Emmanuel Boutet 1,* , Anne Morgat 1 , Kristian B. Axelsen 1, Parit Bansal 1, Jerven Bolleman 1 , Edouard de Castro 1, Elisabeth Coudert 1, Elisabeth Gasteiger 1,Sébastien Géhant 1, Damien Lieberherr 1, Thierry Lombardot 1,†, Teresa B. Neto 1, Ivo Pedruzzi 1, Sylvain Poux 1, Monica Pozzato 1, Nicole Redaschi 1 , Alan Bridge 1 and on behalf of the UniProt Consortium 1,2,3,4,‡ 1 Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, CMU, 1 Michel-Servet, CH-1211 Geneva 4, Switzerland; [email protected] (A.M.); [email protected] (K.B.A.); [email protected] (P.B.); [email protected] (J.B.); [email protected] (E.d.C.); [email protected] (E.C.); [email protected] (E.G.); [email protected] (S.G.); [email protected] (D.L.); [email protected] (T.L.); [email protected] (T.B.N.); [email protected] (I.P.); [email protected] (S.P.); [email protected] (M.P.); [email protected] (N.R.); [email protected] (A.B.); [email protected] (U.C.) 2 European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK 3 Protein Information Resource, University of Delaware, 15 Innovation Way, Suite 205, Newark, DE 19711, USA 4 Protein Information Resource, Georgetown University Medical Center, 3300 Whitehaven Street NorthWest, Suite 1200, Washington, DC 20007, USA * Correspondence: [email protected] (M.F.); [email protected] (E.B.); Tel.: +41-22-379-58-75 (M.F.); +41-22-379-49-10 (E.B.) † Current address: Centre Informatique, Division Calcul et Soutien à la Recherche, University of Lausanne, CH-1015 Lausanne, Switzerland.
    [Show full text]
  • The EMBL-European Bioinformatics Institute the Hub for Bioinformatics in Europe
    The EMBL-European Bioinformatics Institute The hub for bioinformatics in Europe Blaise T.F. Alako, PhD [email protected] www.ebi.ac.uk What is EMBL-EBI? • Part of the European Molecular Biology Laboratory • International, non-profit research institute • Europe’s hub for biological data, services and research The European Molecular Biology Laboratory Heidelberg Hamburg Hinxton, Cambridge Basic research Structural biology Bioinformatics Administration Grenoble Monterotondo, Rome EMBO EMBL staff: 1500 people Structural biology Mouse biology >60 nationalities EMBL member states Austria, Belgium, Croatia, Denmark, Finland, France, Germany, Greece, Iceland, Ireland, Israel, Italy, Luxembourg, the Netherlands, Norway, Portugal, Spain, Sweden, Switzerland and the United Kingdom Associate member state: Australia Who we are ~500 members of staff ~400 work in services & support >53 nationalities ~120 focus on basic research EMBL-EBI’s mission • Provide freely available data and bioinformatics services to all facets of the scientific community in ways that promote scientific progress • Contribute to the advancement of biology through basic investigator-driven research in bioinformatics • Provide advanced bioinformatics training to scientists at all levels, from PhD students to independent investigators • Help disseminate cutting-edge technologies to industry • Coordinate biological data provision throughout Europe Services Data and tools for molecular life science www.ebi.ac.uk/services Browse our services 9 What services do we provide? Labs around the
    [Show full text]
  • A Multiscale Tool to Explore Genomic Conservation
    SynVisio: A Multiscale Tool to Explore Genomic Conservation A Thesis Submitted to the College of Graduate and Postdoctoral Studies in Partial Fulfillment of the Requirements for the degree of Master of Science in the Department of Computer Science University of Saskatchewan Saskatoon By Venkat Kiran Bandi ©Venkat Kiran Bandi, May/2020. All rights reserved. Permission to Use In presenting this thesis in partial fulfilment of the requirements for a Postgraduate degree from the University of Saskatchewan, I agree that the Libraries of this University may make it freely available for inspection. I further agree that permission for copying of this thesis in any manner, in whole or in part, for scholarly purposes may be granted by the professor or professors who supervised my thesis work or, in their absence, by the Head of the Department or the Dean of the College in which my thesis work was done. It is understood that any copying or publication or use of this thesis or parts thereof for financial gain shall not be allowed without my written permission. It is also understood that due recognition shall be given to me and to the University of Saskatchewan in any scholarly use which may be made of any material in my thesis. Requests for permission to copy or to make other use of material in this thesis in whole or part should be addressed to: Head of the Department of Computer Science 176 Thorvaldson Building 110 Science Place University of Saskatchewan Saskatoon, Saskatchewan Canada S7N 5C9 Or Dean College of Graduate and Postdoctoral Studies University of Saskatchewan 116 Thorvaldson Building, 110 Science Place Saskatoon, Saskatchewan S7N 5C9 Canada i Abstract Comparative analysis of genomes is an important area in biological research that can shed light on an organism's internal functions and evolutionary history.
    [Show full text]
  • A Zebrafish Reporter Line Reveals Immune and Neuronal Expression of Endogenous Retrovirus
    bioRxiv preprint doi: https://doi.org/10.1101/2021.01.21.427598; this version posted January 21, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. A zebrafish reporter line reveals immune and neuronal expression of endogenous retrovirus. Noémie Hamilton1,2*, Amy Clarke1, Hannah Isles1, Euan Carson1, Jean-Pierre Levraud3, Stephen A Renshaw1 1. The Bateson Centre, Department of Infection, Immunity and Cardiovascular Disease, University of Sheffield, Sheffield, UK 2. The Institute of Neuroscience, University of Sheffield, Sheffield, UK 3. Macrophages et Développement de l’Immunité, Institut Pasteur, CNRS UMR3738, 25 rue du docteur Roux, 75015 Paris *Corresponding author: [email protected] Abstract Endogenous retroviruses (ERVs) are fossils left in our genome from retrovirus infections of the past. Their sequences are part of every vertebrate genome and their random integrations are thought to have contributed to evolution. Although ERVs are mainly kept silenced by the host genome, they are found activated in multiple disease states such as auto-inflammatory disorders and neurological diseases. What makes defining their role in health and diseases challenging is the numerous copies in mammalian genomes and the lack of tools to study them. In this study, we identified 8 copies of the zebrafish endogenous retrovirus (zferv). We created and characterised the first in vivo ERV reporter line in any species. Using a combination of live imaging, flow cytometry and single cell RNA sequencing, we mapped zferv expression to early T cells and neurons.
    [Show full text]
  • Tunca Doğan , Alex Bateman , Maria J. Martin Your Choice
    (—THIS SIDEBAR DOES NOT PRINT—) UniProt Domain Architecture Alignment: A New Approach for Protein Similarity QUICK START (cont.) DESIGN GUIDE Search using InterPro Domain Annotation How to change the template color theme This PowerPoint 2007 template produces a 44”x44” You can easily change the color theme of your poster by going to presentation poster. You can use it to create your research 1 1 1 the DESIGN menu, click on COLORS, and choose the color theme of poster and save valuable time placing titles, subtitles, text, Tunca Doğan , Alex Bateman , Maria J. Martin your choice. You can also create your own color theme. and graphics. European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), We provide a series of online tutorials that will guide you Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK through the poster design process and answer your poster Correspondence: [email protected] production questions. To view our template tutorials, go online to PosterPresentations.com and click on HELP DESK. ABSTRACT METHODOLOGY RESULTS & DISCUSSION When you are ready to print your poster, go online to InterPro Domains, DAs and DA Alignment PosterPresentations.com Motivation: Similarity based methods have been widely used in order to Generation of the Domain Architectures: You can also manually change the color of your background by going to VIEW > SLIDE MASTER. After you finish working on the master be infer the properties of genes and gene products containing little or no 1) Collect the hits for each protein from InterPro. Domain annotation coverage Overlap domain hits problem in Need assistance? Call us at 1.510.649.3001 difference b/w domain databases: the InterPro database: sure to go to VIEW > NORMAL to continue working on your poster.
    [Show full text]
  • A Burst of Protein Sequence Evolution and a Prolonged Period of Asymmetric Evolution Follow Gene Duplication in Yeast
    Downloaded from genome.cshlp.org on October 3, 2021 - Published by Cold Spring Harbor Laboratory Press Letter A burst of protein sequence evolution and a prolonged period of asymmetric evolution follow gene duplication in yeast Devin R. Scannell1,2 and Kenneth H. Wolfe Smurfit Institute of Genetics, Trinity College Dublin, Dublin 2, Ireland It is widely accepted that newly arisen duplicate gene pairs experience an altered selective regime that is often manifested as an increase in the rate of protein sequence evolution. Many details about the nature of the rate acceleration remain unknown, however, including its typical magnitude and duration, and whether it applies to both gene copies or just one. We provide initial answers to these questions by comparing the rate of protein sequence evolution among eight yeast species, between a large set of duplicate gene pairs that were created by a whole-genome duplication (WGD) and a set of genes that were returned to single-copy after this event. Importantly, we use a new method that takes into account the tendency for slowly evolving genes to be retained preferentially in duplicate. We show that, on average, proteins encoded by duplicate gene pairs evolved at least three times faster immediately after the WGD than single-copy genes to which they behave identically in non-WGD lineages. Although the high rate in duplicated genes subsequently declined rapidly, it has not yet returned to the typical rate for single-copy genes. In addition, we show that although duplicate gene pairs often have highly asymmetric rates of evolution, even the slower members of pairs show evidence of a burst of protein sequence evolution immediately after duplication.
    [Show full text]
  • A Phd Position Is Available in the Research Group of Aoife Mclysaght
    A PhD position is available in the research group of Aoife McLysaght in Trinity College Dublin to work on comparative genomics of vertebrates with a focus on understanding dosage sensitive genes in terms of evolution and disease. The execution of the project will involve bioinformatics and computer programming as well as statistical analysis of data. For examples of the kinds of things we work on, look up our papers on PubMed: https://www.ncbi.nlm.nih.gov/pubmed?cmd=search&term=mclysaght+a[au]&dispmax=50 Applicants should be in the final year of, or hold, a bachelor's degree in molecular biology with an interest and aptitude for programming and bioinformatics; or vice versa (computer science bachelor's with an interest in molecular biology). Interest and enthusiasm for molecular evolution is essential. Students will be expected to be self-motivated and creative and to work closely with other members of the team. This is a four-year position and comes with a tax-free stipend of 18000euro and covers fees up to EU level (non-EU students may apply, but fees are only covered up to EU rates). TCD is Ireland’s top ranked University and a member of the League of European Research Universities (LERU). The McLysaght lab is funded by a European Research Council (ERC) Starting Grant. Lab: http://www.gen.tcd.ie/molevol/ Applications including a cover letter, CV, summary of scientific interests and reasons for being interested in our research group, and contact details for two referees should be made to [email protected].
    [Show full text]
  • Molecular Genetics & Genomics
    page 46 Lab Times 5-2010 Ranking Illustration: Christina Ullman Publication Analysis 1997-2008 Molecular Genetics & Genomics Under the premise of a “narrow” definition of the field, Germany and England co-dominated European molecular genetics/genomics. The most frequently citated sub-fields were bioinformatical genomics, epigenetics, RNA biology and DNA repair. irst of all, a little science history (you’ll soon see why). As and expression. That’s where so-called computational biology is well known, in the 1950s genetics went molecular – and and systems biology enter research into basic genetic problems. Fdid not just become molecular genetics but rather molec- Given that development, it is not easy to answer the question ular bio logy. In 1963, however, Sydney Brenner wrote in his fa- what “molecular genetics & genomics” today actually is – and, mous letter to Max Perutz: “[...] I have long felt that the future of in particular, what is it in the context of our publication analy- molecular biology lies in the extension of research to other fields sis of the field? It is obvious that, as for example science historian of biology, notably development and the nervous system.” He Robert Olby put it, a “wide” definition can be distinguished from appeared not to be alone with this view and, as a consequence, a “narrow” definition of the field. The wide definition includes along with Brenner many of the leading molecular biologists all fields, into which molecular biology has entered as an exper- from the classical period redirected their research agendas, utilis- imental and theoretical paradigm. The “narrow” definition, on ing the newly developed molecular techniques to investigate un- the other hand, still tries to maintain the status as an explicit bio- solved problems in other fields.
    [Show full text]
  • Pfam: the Protein Families Database in 2021 Jaina Mistry 1,*, Sara Chuguransky 1, Lowri Williams 1, Matloob Qureshi 1, Gustavo A
    D412–D419 Nucleic Acids Research, 2021, Vol. 49, Database issue Published online 30 October 2020 doi: 10.1093/nar/gkaa913 Pfam: The protein families database in 2021 Jaina Mistry 1,*, Sara Chuguransky 1, Lowri Williams 1, Matloob Qureshi 1, Gustavo A. Salazar1, Erik L.L. Sonnhammer2, Silvio C.E. Tosatto 3, Lisanna Paladin 3, Shriya Raj 1, Lorna J. Richardson 1, Robert D. Finn 1 and Alex Bateman 1 1European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK, 2Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Box 1031, 17121 Solna, Sweden and 3Department of Biomedical Sciences, University of Padua, 35131 Padova, Italy Received September 11, 2020; Revised October 01, 2020; Editorial Decision October 02, 2020; Accepted October 06, 2020 ABSTRACT per-family gathering thresholds. Pfam entries are manually annotated with functional information from the literature The Pfam database is a widely used resource for clas- where available. sifying protein sequences into families and domains. Since Pfam release 29.0, pfamseq is based on UniPro- Since Pfam was last described in this journal, over tKB reference proteomes, whilst prior to that, it was based 350 new families have been added in Pfam 33.1 and on the whole of UniProtKB (3,4). Although the underly- numerous improvements have been made to existing ing sequence database is based on reference proteomes, all entries. To facilitate research on COVID-19, we have of the profile HMMs are also searched against UniProtKB revised the Pfam entries that cover the SARS-CoV-2 and the resulting matches are made available on the Pfam proteome, and built new entries for regions that were website and in a flatfile format.
    [Show full text]
  • Rapid Identification of Novel Protein Families Using Similarity Searches [Version 1; Peer Review: 2 Approved]
    F1000Research 2018, 7:1975 Last updated: 26 JUL 2021 METHOD ARTICLE Rapid identification of novel protein families using similarity searches [version 1; peer review: 2 approved] Matt Jeffryes, Alex Bateman European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, CB10 1SD, UK v1 First published: 24 Dec 2018, 7:1975 Open Peer Review https://doi.org/10.12688/f1000research.17315.1 Latest published: 24 Dec 2018, 7:1975 https://doi.org/10.12688/f1000research.17315.1 Reviewer Status Invited Reviewers Abstract Protein family databases are an important tool for biologists trying to 1 2 dissect the function of proteins. Comparing potential new families to the thousands of existing entries is an important task when operating version 1 a protein family database. This comparison helps to understand 24 Dec 2018 report report whether a collection of protein regions forms a novel family or has overlaps with existing families of proteins. In this paper, we describe a 1. Daniel J. Rigden , University of Liverpool, method for performing this analysis with an adjustable level of accuracy, depending on the desired speed, enabling interactive Liverpool, UK comparisons. This method is based upon the MinHash algorithm, 2. Desmond G Higgins, Conway Institute, which we have further extended to calculate the Jaccard containment rather than the Jaccard index of the original MinHash technique. University College Dublin, Dublin, Ireland Testing this method with the Pfam protein family database, we are able to compare potential new families to the over 17,000 existing Any reports and responses or comments on the families in Pfam in less than a second, with little loss in accuracy.
    [Show full text]
  • Structure-Based Realignment of Non-Coding Rnas in Multiple Whole Genome Alignments
    Structure-based Realignment of Non-coding RNAs in Multiple Whole Genome Alignments. by Michael Ku Yu Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of ARCHIVES Masters of Engineering in Computer Science and Engineering MASSACHUE N U TE at the OF TECH IOLOY MASSACHUSETTS INSTITUTE OF TECHNOLOGY JUN 2 1 2011 June 2011 LIBRARI ES @ Massachusetts Institute of Technology 2011. All rights reserved. '$7 A uthor ............ .. .. ... ............. Department of Electrical Wgineering and Computer Science May 20, 2011 Certified by..................................... ...... Bonnie Berger Professor of Applied Mathematics and Computer Science Thesis Supervisor Accepted by.... ....................................... Christopher J. Terman Chairman, Department Committee on Graduate Theses 2 Structure-based Realignment of Non-coding RNAs in Multiple Whole Genome Alignments by Michael Ku Yu Submitted to the Department of Electrical Engineering and Computer Science on May 20, 2011, in partial fulfillment of the requirements for the degree of Masters of Engineering in Computer Science and Engineering Abstract Whole genome alignments have become a central tool in biological sequence analy- sis. A major application is the de novo prediction of non-coding RNAs (ncRNAs) from structural conservation visible in the alignment. However, current methods for constructing genome alignments do so by explicitly optimizing for sequence simi- larity but not structural similarity. Therefore, de novo prediction of ncRNAs with high structural but low sequence conservation is intrinsically challenging in a genome alignment because the conservation signal is typically hidden. This study addresses this problem with a method for genome-wide realignment of potential ncRNAs ac- cording to structural similarity.
    [Show full text]
  • Generating Functional Protein Variants with Variational Autoencoders
    bioRxiv preprint doi: https://doi.org/10.1101/2020.04.07.029264; this version posted April 7, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. 1 Generating functional protein variants with variational 2 autoencoders 1 1 1 3 Alex Hawkins-Hooker , Florence Depardieu , Sebastien Baur , Guillaume 1 1 1 4 Couairon , Arthur Chen , and David Bikard 1 5 Synthetic Biology Group, Microbiology Department, Institut Pasteur, Paris, France 6 Abstract 7 The design of novel proteins with specified function and controllable biochemical properties 8 is a longstanding goal in bio-engineering with potential applications across medicine and nan- 9 otechnology. The vast expansion of protein sequence databases over the last decades provides 10 an opportunity for new approaches which seek to learn the sequence-function relationship 11 directly from natural sequence variation. Advances in deep generative models have led to 12 the successful modelling of diverse kinds of high-dimensional data, from images to molecules, 13 allowing the generation of novel, realistic samples. While deep models trained on protein 14 sequence data have been shown to learn biologically meaningful representations helpful for 15 a variety of downstream tasks, their potential for direct use in protein engineering remains 16 largely unexplored. Here we show that variational autoencoders trained on a dataset of almost 17 70000 luciferase-like oxidoreductases can be used to generate novel, functional variants of the 18 luxA bacterial luciferase.
    [Show full text]