Beyond Four Bases: Epigenetic Modifications Prove Critical to Understanding E

Total Page:16

File Type:pdf, Size:1020Kb

Beyond Four Bases: Epigenetic Modifications Prove Critical to Understanding E Base Modification Detection Case Study | November 2012 BEYOND FOUR BASES: EPIGENETIC MODIFICATIONS PROVE CRITICAL TO UNDERSTANDING E. COLI OUTBREAK Studies of the E. coli outbreak in Germany demonstrate the fundamental need for long-read sequencing and DNA base modification data. Using SMRT® sequencing technology, scientists were for the first time able to reveal some of the complex mechanisms underlying gene regulation processes in the organism. A newly published paper adds to insights generated in a 2011 study investigating last year’s deadly E. coli outbreak in Germany; together, they offer a fascinating new view of the mechanisms of gene regulation in a microbe. Dr. Eric Schadt, Chair of the Department of Genetics and Genomics Sciences and Director of the Institute for Genomics and Multiscale Biology at Mount Sinai School of Medicine, helped lead these efforts to elucidate the strain of E. coli responsible for the outbreak. “All these bugs living in us and around us are affecting us on a much deeper level than we’ve appreciated,” he says. “Even in a microorganism there are complex networks at play, and being able to study these on a genome-wide scale for the first time is really causing a revolution within the microbiology community.” The PacBio RS sequencing platform “has the throughput to completely finish these small microbial genomes in under a day,” Schadt says. Dr. Eric Schadt, Director of the Institute for Genomics and Multiscale Biology, Mount Sinai School of Medicine The papers are remarkable for different reasons. “Origins — and then pulling those together into a multiscale view of the E. coli Strain Causing an Outbreak of Hemolytic– of the organism’s biology. “Living systems are composed Uremic Syndrome in Germany,” published in July 2011 in of lots of pieces interacting in very complex ways,” Schadt the New England Journal of Medicine, provided a sharper says. “The way to understand such systems is to take into understanding of the particular microbial strain involved account more of the information on a global level, not just in the outbreak and cleared up lingering questions about a single protein level. That is what drives systems biology.” the strain’s phylogeny. “Genome-wide map of methylated adenine residues using single-molecule real-time The Outbreak sequencing in pathogenic Escherichia coli,” published in In May and June of 2011, a highly virulent strain of E. November 2012 in Nature Biotechnology, offers a whole- coli broke out in northern Germany, affecting thousands genome interrogation of base modifications, elucidating of people. Fifty died, and nearly 1,000 suffered kidney that strain’s “methylome.” That information gives valuable failure from hemolytic-uremic syndrome. The severity of clues about the gene regulation processes of the E. coli this particular microbe launched investigations in labs strain in the German outbreak. around the world to try to understand why this bug, which The research projects are also a glimpse into the relatively produced Shiga toxin, was so deadly. The outbreak strain nascent field of generating full-genome data sets for serotype was O104:H4, which previously had not been multiple dimensions — such as DNA, RNA, and proteins associated with such pathogenicity. 1 Base Modification Detection Case Study | November 2012 Scientists from a number of it was an enterohemorrhagic strain answering that critical question,” organizations sequenced outbreak which had acquired enteroaggregative Schadt says. samples on next-generation DNA properties through horizontal gene Analyzing these epigenetic marks sequencers and contributed their transfer — was not quite right. would not have been possible on any data to public repositories. But with “What our work showed is that it other sequencer. Other sequencing short-read sequencing platforms, was exactly the opposite — it was platforms use DNA amplification the genome assemblies that became definitely an enteroaggregative which strips away base modifications publicly available were fragmented strain, and the way we could tell that prior to the sequencing process. But into more than 300 pieces, known as was that we sequenced nine other a unique aspect of single molecule, contigs, making it no trivial matter to strains of the same serotype. They real-time (SMRT) sequencing is that identify the phylogenetic basis of the matched perfectly to the outbreak it collects base modification data microbe. Add to that a starting data strain,” Schadt says. “It was a pretty simultaneously as it collects DNA bias — several sequenced genomes fundamental difference, but these sequence data. Scientists at Pacific were available for one serotype of E. bugs are tricky.” coli, but just a few from the O104:H4 Biosciences have demonstrated The team reported that the serotype — and it’s no wonder that that the polymerase used in their outbreak strain had acquired early attempts to determine the sequencing technology responds enterohemorrhagic traits, including origins of the outbreak strain wound differently to modified bases than it the relatively recent insertion of a up being incorrect. does to unmodified bases, leaving a Shiga-toxin–encoding lambda-like kinetic signature that can be detected Eric Schadt, then Chief Scientific prophage element. The strain had in the data. Indeed, the polymerase Officer at Pacific Biosciences, and also developed a devious form of is sensitive enough that it pauses his colleagues saw an opportunity to antibiotic resistance: the standard at measurably different intervals help the community make inroads treatment of ciprofloxacin actually for different types of modifications, into understanding the outbreak boosted the expression of the bug’s leading to distinct kinetic patterns for strain by adding long-read sequence Shiga toxin gene, leading to much data to the short-read data already more than a dozen different types of higher levels of the toxin, which base modification. available. Teaming up with scientists was partly responsible for the onset at medical schools and the World of hemolytic-uremic syndrome in Health Organization, they launched patients. In a real-time setting, their investigation, using the this information could be critical to “Some of the base ® PacBio RS platform to sequence selecting the best possible treatment modifications that we’re not just the German strain but also for a particular microbial strain. several other E. coli strains from detecting appear to be the underrepresented serotype to Studying Adenine Methylation associated with virulence, pinpoint the outbreak strain’s origins. Their NEJM paper reports 13 genome Despite the remarkable findings from but that’s virtually unknown sequences, including an assembly of the team’s E. coli sequencing effort, because people haven’t been the outbreak strain that used the long the DNA sequence alone did not fully able to see it before.” PacBio reads to get down to just 33 explain the unusually high virulence contigs. Combining PacBio data with seen in the outbreak. Some months existing short-read data, the team after the initial sequencing work was able to represent the full E. coli was done, Schadt and his colleagues That array of modifications genome in a single contig. had an opportunity to reanalyze the represents a far larger epigenetic data — this time looking at chemical The PacBio RS sequencing platform universe than could be assessed with modifications to DNA bases. “has the throughput to completely earlier tools, Schadt says. “Some finish these small microbial genomes This layer of information was of of the modifications that we’re in under a day,” Schadt says, noting particular interest to Schadt and detecting appear to be associated that this project would not have been his research team because of the with virulence, but that’s virtually possible on any other sequencer growing recognition during the last unknown because people haven’t currently available. That speed was ten years that methylation can play a been able to see it before.” critical to the team’s efforts and significant role in microbial virulence The computational ability to analyze points to real-time utility of the and other biological processes. this data became possible after sequencer for outbreaks and other “If you want to understand why this Schadt and his colleagues had situations where information is bug is so much more virulent than completed their first examination needed urgently. these other bugs that have the of the E. coli outbreak strain, but A key finding from the team’s same serotype, if you don’t have all because the base modification data investigation was that the widely of the axes of variation that can be was an inherent part of the sequence reported origin of the strain — that happening, then you might miss data, they were able to go back after 2 Base Modification Detection Case Study | November 2012 the fact and reanalyze the same age-old biology techniques to examine that may indicate involvement data set to detect base modifications the functional consequences of base with processes associated with genome-wide. modifications, including knocking pathogenicity and virulence. While some of the software to out the gene making the changes For Schadt, these findings perform that detection was just and then sequencing the RNA of emphasized the importance of adding being crafted at the time of Schadt’s the knockout alongside the RNA of more layers of data to the picture study of E. coli, the tools have since the wild type. In other work, they of any organism’s biology. He notes been incorporated into the standard compared the K12 strain with added that the microbiology community in software offered with the PacBio methylations to the original K12 to particular is seeing a revolution with instrument. Any user can detect study the effect of those methylations. this kind of information and “turning base modifications on new bacteria Data was sifted to show expression on whole areas of investigation that being sequenced, or reanalyze older signatures, pathway enrichment, weren’t considered in the past.” Being sequences gathered by the PacBio RS.
Recommended publications
  • Title Authors Abstract Introduction
    Title MOSAIK: A hash-based algorithm for accurate next-generation sequencing read mapping Authors Wan-Ping Lee1, Michael Stromberg1,2, Alistair Ward1, Chip Stewart1,3, Erik Garrison1, Gabor T. Marth1 1 Department of Biology, Boston College, Chestnut Hill, MA 2 Illumina, Inc., San Diego, CA 3 Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA Abstract MOSAIK is a stable, sensitive and open-source program for mapping second and third-generation sequencing reads to a reference genome. Uniquely among current mapping tools, MOSAIK can align reads generated by all the major sequencing technologies, including Illumina, Applied Biosystems SOLiD, Roche 454, Ion Torrent and Pacific BioSciences SMRT. Indeed, MOSAIK was the only aligner to provide consistent mappings for all the generated data (sequencing technologies, low-coverage and exome) in the 1000 Genomes Project. To provide highly accurate alignments, MOSAIK employs a hash clustering strategy coupled with the Smith-Waterman algorithm. This method is well-suited to capture mismatches as well as short insertions and deletions. To support the growing interest in larger structural variant (SV) discovery, MOSAIK provides explicit support for handling known-sequence SVs, e.g. mobile element insertions (MEIs) as well as generating outputs tailored to aid in SV discovery. All variant discovery benefits from an accurate description of the read placement confidence. To this end, MOSAIK uses a neural-net based training scheme to provide well-calibrated mapping quality scores, demonstrated by a correlation coefficient between MOSAIK assigned and actual mapping qualities greater than 0.98. In order to ensure that studies of any genome are supported, a training pipeline is provided to ensure optimal mapping quality scores for the genome under investigation.
    [Show full text]
  • Pacific Biosciences' 2020 Annual Report
    2020 Annual Report $SULO )HOORZ6WRFNKROGHUV ZDVDQXQIRUJHWWDEOH\HDURQPDQ\DFFRXQWV7KH&29,'SDQGHPLFFKDQJHGRXUOLYHVDQGWKHZD\ZHGREXVLQHVV2XUHPSOR\HHVZRUNHG WLUHOHVVO\WRVXSSRUWRXUFXVWRPHUVDQG,FDQQRWWKDQNWKHPHQRXJK'HVSLWHWKHFKDOOHQJHVRIWKHSDQGHPLFZHFRQWLQXHGWRGULYHRXUEXVLQHVV IRUZDUGJURZLQJRXU6HTXHO,,6\VWHPLQVWDOOHGEDVHE\QHDUO\ 3DQGHPLFDVLGHZDVDOVRDGHILQLQJDQGWUDQVIRUPDWLRQDO\HDUIRU3DFLILF%LRVFLHQFHV$IWHUPRQWKVRIPHUJHUSODQQLQJZLWK,OOXPLQDHQGHG LQWHUPLQDWLRQRIWKHPHUJHUDJUHHPHQWZHVHWRXWWRLQYHVWDQGSODQIRUVXFFHVVDVDVWDQGDORQHFRPSDQ\,Q6HSWHPEHURIODVW\HDUDIWHUVHUYLQJ RQWKH%RDUGVLQFH,MRLQHGIXOOWLPHDV&(2WROHDGWKHFRPSDQ\LQWRDQHZSKDVHRIFRPPHUFLDOL]DWLRQDQGJURZWK,MRLQHG3DF%LREHFDXVH, EHOLHYHWKDWRXUWHFKQRORJ\SURGXFWURDGPDSDQGPRVWLPSRUWDQWO\RXUSHRSOHDUHSRVLWLRQHGWRWUXO\HQDEOHWKHSURPLVHRIJHQRPLFVWREHWWHUKXPDQ KHDOWK,QRUGHUWRVHWWKHFRPSDQ\RQDSDWKWRDFKLHYHWKLVPLVVLRQ,VHWIRUWKWKUHHLQLWLDOVWUDWHJLHVWKDWZLOOGULYHRXUH[HFXWLRQDQGVXVWDLQJURZWK RYHUWKHQH[WVHYHUDO\HDUV7KHVHVWUDWHJLFSLOODUVZLOOKHOSXVEULQJWKHZRUOG¶VPRVWDGYDQFHGVLQJOHPROHFXOHORQJUHDGVHTXHQFLQJWHFKQRORJLHV WRUHVHDUFKHUVDQGFOLQLFLDQVZRUOGZLGHDQGZLOOHQDEOHXVWRSDUWLFLSDWHLQDPDUNHWRSSRUWXQLW\RIPRUHWKDQELOOLRQ ([SDQGLQJRXU&RPPHUFLDO5HDFK 7KHJHQRPLFVODQGVFDSHKDVJURZQFRQVLGHUDEO\RYHUWKHSDVWGHFDGHDQGWKHUHDUHPRUHODEVWKDQHYHUSHUIRUPLQJVHTXHQFLQJ$IWHU\HDUVRI LQYHVWPHQW DQG SURGXFW GHYHORSPHQW LQGXVWU\ H[SHUWV UHFRJQL]H 3DF%LR DV WKH OHDGHU LQ DFFXUDWH DQG FRPSOHWH ORQJUHDG VHTXHQFLQJ 2XU WHFKQRORJLFDOGLIIHUHQWLDWLRQPHDQVWKDWPRUHFXVWRPHUVWKDQHYHUFDQQRZEHQHILWIURPLQFRUSRUDWLQJ3DF%LRVHTXHQFLQJLQWRWKHLUODEV7RFDSWXUH
    [Show full text]
  • Next Generation Genomics the DNA Revolution
    June 2020 Next Generation Genomics The DNA Revolution Leonard C. Mitchell, CFA Table of Contents Introduction 3 Precedents for Hyperbolic Change 4 The Promise of Genetic Engineering to Reduce Human Suffering 5 To Understand Genomics, Understand Proteins 8 Big Pharma and The Era of Personalized Medicine 12 The Human Microbiome 16 Epigenetics and Disease 18 Genomics and Cancer 20 Viruses, Vaccines and Cures 24 Reading and Writing the Genetic Code 28 The Blueprint for Life 29 The Race to Map the Human Genome 31 Then There Was CRISPR 33 How and Why CRISPR? 34 The Need for Faster Sequencing 35 Hacking the Human Genome 36 The Life Science Industry - Mergers and Acquisitions 38 Artificial Intelligence and Big Data 39 Bioengineering Agricultural Engineering 40 Industrial Scale Bio-Manufacturing 41 Conclusion and Risks 44 About the Author 46-47 This is the third in a series on developing technologies that will drive our economy and soon change our lives. Two technologies in particular, Artificial Intelligence, covered earlier, and Genomics, will be the most impactful in shaping life in the 21st century. Leonard C. Mitchell, CFA “We are running on the last ounces of gas in the philosophical gas tank, we are facing philosophical bankruptcy, the new challenges, especially for the new technologies. Climate change and nuclear war are kind of easy challenges, because we know what to do about them, we need to prevent them, it’s very easy. Not all agree on what to do, but nobody says in principal, we should have more of these. But with AI and genomic engineering there are no agreed goals.
    [Show full text]
  • Prodoma: Improve Protein Domain Classification for Third-Generation
    PRODOMA: IMPROVE PROTEIN DOMAIN CLASSIFICATION FOR THIRD-GENERATION SEQUENCING READS USING DEEP LEARNING Nan Du Jiayu Shang Dept. of Computer Science and Engineering Dept. of Electrical Engineering Michigan State University City University of Hong Kong East Lansing, MI 48824, United States Kowloon, Hong Kong SAR, China [email protected] [email protected] Yanni Sun Dept. of Electrical Engineering City University of Hong Kong Kowloon, Hong Kong SAR, China [email protected] September 29, 2020 ABSTRACT Motivation: With the development of third-generation sequencing technologies, people are able to obtain DNA sequences with lengths from 10s to 100s of kb. These long reads allow protein domain annotation without assembly, thus can produce important insights into the biological functions of the underlying data. However, the high error rate in third-generation sequencing data raises a new challenge to established domain analysis pipelines. The state-of-the-art methods are not optimized for noisy reads and have shown unsatisfactory accuracy of domain classification in third-generation sequencing data. New computational methods are still needed to improve the performance of domain prediction in long noisy reads. Results: In this work, we introduce ProDOMA, a deep learning model that conducts domain classification for third-generation sequencing reads. It uses deep neural networks with 3-frame translation encoding to learn conserved features from partially correct translations. In addition, we formulate our problem as an open-set problem and thus our model can reject unrelated DNA reads such as those from noncoding regions. In the experiments on simulated reads of protein coding sequences and real reads from the human genome, our model outperforms HMMER and DeepFam on protein domain classification.
    [Show full text]
  • No Evidence for Extensive Horizontal Gene Transfer in The
    No evidence for extensive horizontal gene transfer in SEE COMMENTARY the genome of the tardigrade Hypsibius dujardini Georgios Koutsovoulosa, Sujai Kumara, Dominik R. Laetscha,b, Lewis Stevensa, Jennifer Dauba, Claire Conlona, Habib Maroona, Fran Thomasa, Aziz A. Aboobakerc, and Mark Blaxtera,1 aInstitute of Evolutionary Biology, University of Edinburgh, Edinburgh EH9 3FL, United Kingdom; bThe James Hutton Institute, Dundee DD2 5DA, United Kingdom; and cDepartment of Zoology, University of Oxford, Oxford OX1 3PS, United Kingdom Edited by W. Ford Doolittle, Dalhousie University, Halifax, Canada, and approved March 1, 2016 (received for review January 8, 2016) Tardigrades are meiofaunal ecdysozoans that are key to under- cryptobiotic (24), but serves as a useful comparator for good cryp- standing the origins of Arthropoda. Many species of Tardigrada tobiotic species (9). can survive extreme conditions through cryptobiosis. In a re- Animal genomes can accrete horizontally transferred DNA, cent paper [Boothby TC, et al. (2015) Proc Natl Acad Sci USA especially from germ line-transmitted symbionts (25), but the 112(52):15976–15981], the authors concluded that the tardigrade majority of transfers are nonfunctional and subsequently evolve Hypsibius dujardini had an unprecedented proportion (17%) of neutrally and can be characterized as dead-on-arrival horizontal genes originating through functional horizontal gene transfer (fHGT) gene transfer (doaHGT) (25–27). Functional horizontal gene and speculated that fHGT was likely formative in the evolution of transfer (fHGT) can bring to a recipient genome new biochemical cryptobiosis. We independently sequenced the genome of H. dujardini. capacities and contrasts with gradualist evolution of endogenous As expected from whole-organism DNA sampling, our raw data con- genes to new function.
    [Show full text]
  • Pacific Biosciences Contributes Whole Genome Sequence Data for German E. Coli Outbreak Strain and 11 Related Strains for Comparative Analysis
    Pacific Biosciences Contributes Whole Genome Sequence Data for German E. Coli Outbreak Strain and 11 Related Strains for Comparative Analysis Improved Chemistry and Software Provides Higher Accuracy Single Molecule Reads and Longer Readlengths to Yield PacBio-only De Novo Assembly MENLO PARK, Calif.--(BUSINESS WIRE)-- Pacific Biosciences of California, Inc. (NASDAQ: PACB) announced that it has completed a de novo sequence assembly of the Escherichia coli O104:H4 strain responsible for the recent outbreak in Germany using its Single Molecule Real Time (SMRT™) technology, and sequenced 11 related bacterial strains (including six previously unsequenced strains of the same serotype) for comparative analyses. An international team of scientific experts on E. coli collaborated on the rapid sequencing project to provide more comprehensive information about the origins of the strain that gave rise to the deadly outbreak. The data were generated using an early version of chemistry and software in development at Pacific Biosciences for the next major PacBio RS product upgrade, planned for the fourth quarter of 2011. The data provided to the public domain includes a complete assembly of the German outbreak strain, alignment to assemblies from other outbreak isolates, and sequences for 11 related Enteroaggregative E. coli strains. The project demonstrates the ability to produce a PacBio-only de novo assembly for a complex microbial pathogen, and the power of rapid sequencing of multiple genomes with the PacBio RS to elucidate the evolutionary history of a pathogenic microbe. A summary of the project appears on the company's website at http://blog.pacificbiosciences.com. The Pacific Biosciences scientific team, led by Chief Scientific Officer Eric Schadt, Ph.D., is collaborating with some of the world's leading experts on E.
    [Show full text]
  • The Genome of the Parthenogenetic Springtail Folsomia Candida
    Faddeeva-Vakhrusheva et al. BMC Genomics (2017) 18:493 DOI 10.1186/s12864-017-3852-x RESEARCH ARTICLE Open Access Coping with living in the soil: the genome of the parthenogenetic springtail Folsomia candida Anna Faddeeva-Vakhrusheva1, Ken Kraaijeveld1, Martijn F. L. Derks2, Seyed Yahya Anvar3,4, Valeria Agamennone1, Wouter Suring1, Andries A. Kampfraath1, Jacintha Ellers1, Giang Le Ngoc1,6, Cornelis A. M. van Gestel1, Janine Mariën1, Sandra Smit5, Nico M. van Straalen1 and Dick Roelofs1* Abstract Background: Folsomia candida is a model in soil biology, belonging to the family of Isotomidae, subclass Collembola. It reproduces parthenogenetically in the presence of Wolbachia, and exhibits remarkable physiological adaptations to stress. To better understand these features and adaptations to life in the soil, we studied its genome in the context of its parthenogenetic lifestyle. Results: We applied Pacific Bioscience sequencing and assembly to generate a reference genome for F. candida of 221.7 Mbp, comprising only 162 scaffolds. The complete genome of its endosymbiont Wolbachia, was also assembled and turned out to be the largest strain identified so far. Substantial gene family expansions and lineage-specific gene clusters were linked to stress response. A large number of genes (809) were acquired by horizontal gene transfer. A substantial fraction of these genes are involved in lignocellulose degradation. Also, the presence of genes involved in antibiotic biosynthesis was confirmed. Intra-genomic rearrangements of collinear gene clusters were observed, of which 11 were organized as palindromes. The Hox gene cluster of F. candida showed major rearrangements compared to arthropod consensus cluster, resulting in a disorganized cluster.
    [Show full text]
  • Illumina Pacbio: Provisional Findings Report
    Anticipated acquisition by Illumina, Inc. of Pacific Biosciences of California, Inc. Provisional findings report 24 October 2019 © Crown copyright 2019 You may reuse this information (not including logos) free of charge in any format or medium, under the terms of the Open Government Licence. To view this licence, visit www.nationalarchives.gov.uk/doc/open-government- licence/ or write to the Information Policy Team, The National Archives, Kew, London TW9 4DU, or email: [email protected]. The Competition and Markets Authority has excluded from this published version of the provisional findings report information which the inquiry group considers should be excluded having regard to the three considerations set out in section 244 of the Enterprise Act 2002 (specified information: considerations relevant to disclosure). The omissions are indicated by []. Some numbers have been replaced by a range. These are shown in square brackets. Contents Page Provisional Findings ................................................................................................... 4 1. The reference ....................................................................................................... 4 2. The industry .......................................................................................................... 4 Introduction to DNA sequencing ........................................................................... 4 Applications of DNA sequencing ........................................................................... 6
    [Show full text]