Pseudoalignment for Metagenomic and Metatranscriptomic Read Assignment

By Lorian Victoria Schaeffer

A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Molecular and Cell Biology in the Graduate Division of the University of California, Berkeley

Committee in Charge:
Professor Lior Pachter, Chair
Professor Gregory Barton
Professor Nicholas Ingolia
Professor Kimmen Sjolander

Fall 2016

Abstract

The first step in many metagenomic and metatranscriptomic analysis workflows is assigning high-throughput sequencing reads to specific strains or transcripts, providing the basis for identification and later quantification. However, the high degree of similarity between the sequences of many strains and genes makes it difficult to assign reads at the lowest level of taxonomy, and reads are typically assigned to more general taxonomic levels where they are unambiguous.

Recent developments in RNA-Seq analysis have found direct-match k-mer based methods to be extremely accurate and fast when comparing sequenced RNA-Seq reads to transcriptomes. While similar methods have been used in metagenomics before now, none are highly accurate at distinguishing similar strains, and none have been applied to metatranscriptomic data.

We explore connections between metagenomic and metatranscriptomic read assignment and the quantification of transcripts from RNA-Seq data to develop novel methods for rapid and accurate quantification of microbiome strains and transcripts. We find that the recent idea of pseudoalignment introduced in the RNA-Seq context is highly applicable in the metagenomics and metatranscriptomics settings as well. When coupled with the Expectation-Maximization (EM) algorithm, reads can be assigned far more accurately and quickly than is currently possible with state of the art software, making it possible and practical for the first time to analyze abundances of individual genomes in metagenomics and metatranscriptomics projects.

Table of Contents

List of Figures
List of Tables
Acknowledgements
Chapter 1: History of abundance estimation
    1.1 Metagenomics
    1.2 k-mer based estimation
    1.3 Metatranscriptomics
Chapter 2: Pseudoalignment for metagenomic read assignment
    2.1 Introduction
    2.2 Results
    2.3 Methods
    2.4 Conclusions
Chapter 3: K-mer based metatranscriptome analysis
    3.1 Introduction
    3.2 Results
    3.3 Methods
    3.4 Conclusions
Chapter 4: Concluding remarks on low-memory k-mer indexing improvements
References
Appendix A: Notes on collecting microbiome samples from D. melanogaster guts
    A.1 Gut dissection
    A.2 DNA/RNA extraction
    A.3 Microbial mRNA enrichment

List of Figures

Figure 1.1. A phylogenetic tree of the V4 region of the 16S rRNA gene
Figure 1.2. Drop in sequencing cost per megabase since 2001
Figure 1.3. MEGAN’s use of lowest common ancestor during read assignment
Figure 2.1. Results of kallisto on simulated reads from the Ensembl dataset at the exact genome level
Figure 2.2. Comparison of species-level abundance estimation between metagenomic programs
Figure 2.3. Results of kallisto on bacterial reads in human saliva samples at all taxonomic levels
Figure 3.1. Estimated counts at strain level aligned against present transcriptomes
Figure 3.2. Estimated counts of transcripts in simulated data
Figure 3.3. Estimated counts at species level aligned against representative transcriptomes
Figure 3.4. Estimated counts at species level aligned against representative genomes
Figure 3.5. Estimated counts at strain level aligned against pre-filtered transcriptomes
Figure 3.6. Estimated counts of human gut metatranscriptome at species level aligned against representative genomes
Figure 3.7. Estimated counts of human gut metatranscriptome at genus level aligned against pre-filtered transcriptomes
Figure 3.8. Estimated counts of human gut metagenome at genus level aligned against pre-filtered transcriptomes
Figure 3.9. Estimated relative abundance of top genera in human gut metagenomes
Figure 3.10. Percentage of kallisto-estimated human gut microbiome transcripts assigned to listed KEGG functional pathways
Figure 3.11. Estimated percentage of genes present in KEGG functional pathways, as estimated by COMAN
Figure A.1 Dissected third instar larva gut
Figure A.2 Bioanalyzer trace of RNA sample extracted from gut
Figure A.3 Bioanalyzer trace of DNA sample extracted from gut

List of Tables

Table 1.1. Performance of selected metagenomic read assignment tools
Table 2.1. Normalized count-based classification accuracy at four taxonomic ranks

Acknowledgements

Thanks to my friends and family, whose support has made the completion of this dissertation possible. Special thanks to my advisor, Lior Pachter, for his encouragement and boundless enthusiasm throughout my research, whether I had results or not. While everyone in the Pachter lab has been a delight, I am especially grateful for Shannon Hateley, who gave me considerable support during my wetlab trials, and Isaac Joseph, who helped keep me moving forward when nothing was working. I also enormously appreciate the help of Harold Pimentel, Nicolas Bray, and Páll Melsted, who were ever helpful in getting kallisto to do what I needed it to do. Outside the Pachter lab, thanks go to Carolyn Elya, whose experience with Drosophila gut microbes was invaluable to my own investigations. Thanks also to Andrew Toseland, without whose simulated metatranscriptome data I would have missed a lot of interesting results. Finally, thanks to Diya Das for her significant help with editing everything, ever.
Chapter 1: History of abundance estimation in metagenomics and metatranscriptomics

1.1 Metagenomics

Microbial ecology,