Base Modification Detection Case Study | November 2012

Beyond Four Bases: Epigenetic Modifications Prove Critical to Understanding E. coli Outbreak Studies of the E. coli outbreak in Germany demonstrate the fundamental need for long-read sequencing and DNA base modification data. Using SMRT® sequencing technology, scientists were for the first time able to reveal some of the complex mechanisms underlying gene regulation processes in the organism.

A newly published paper adds to insights generated in a 2011 study investigating last year’s deadly E. coli outbreak in Germany; together, they offer a fascinating new view of the mechanisms of gene regulation in a microbe. Dr. Eric Schadt, Chair of the Department of Genetics and Genomics Sciences and Director of the Institute for Genomics and Multiscale Biology at Mount Sinai School of Medicine, helped lead these efforts to elucidate the strain of E. coli responsible for the outbreak. “All these bugs living in us and around us are affecting us on a much deeper level than we’ve appreciated,” he says. “Even in a microorganism there are complex networks at play, and being able to study these on a genome-wide scale for the first time is really causing a revolution within the microbiology community.”

The PacBio RS sequencing platform “has the throughput to completely finish these small microbial genomes in under a day,” Schadt says. Dr. Eric Schadt, Director of the Institute for Genomics and Multiscale Biology, Mount Sinai School of Medicine The papers are remarkable for different reasons. “Origins — and then pulling those together into a multiscale view of the E. coli Strain Causing an Outbreak of Hemolytic– of the organism’s biology. “Living systems are composed Uremic Syndrome in Germany,” published in July 2011 in of lots of pieces interacting in very complex ways,” Schadt the New England Journal of Medicine, provided a sharper says. “The way to understand such systems is to take into understanding of the particular microbial strain involved account more of the information on a global level, not just in the outbreak and cleared up lingering questions about a single protein level. That is what drives systems biology.” the strain’s phylogeny. “Genome-wide map of methylated adenine residues using single-molecule real-time The Outbreak sequencing in pathogenic ,” published in In May and June of 2011, a highly virulent strain of E. November 2012 in Nature , offers a whole- coli broke out in northern Germany, affecting thousands genome interrogation of base modifications, elucidating of people. Fifty died, and nearly 1,000 suffered kidney that strain’s “methylome.” That information gives valuable failure from hemolytic-uremic syndrome. The severity of clues about the gene regulation processes of the E. coli this particular microbe launched investigations in labs strain in the German outbreak. around the world to try to understand why this bug, which The research projects are also a glimpse into the relatively produced Shiga toxin, was so deadly. The outbreak strain nascent field of generating full-genome data sets for serotype was O104:H4, which previously had not been multiple dimensions — such as DNA, RNA, and proteins associated with such pathogenicity. 1 Base Modification Detection Case Study | November 2012

Scientists from a number of it was an enterohemorrhagic strain answering that critical question,” organizations sequenced outbreak which had acquired enteroaggregative Schadt says. samples on next-generation DNA properties through horizontal gene Analyzing these epigenetic marks sequencers and contributed their transfer — was not quite right. would not have been possible on any data to public repositories. But with “What our work showed is that it other sequencer. Other sequencing short-read sequencing platforms, was exactly the opposite — it was platforms use DNA amplification the genome assemblies that became definitely an enteroaggregative which strips away base modifications publicly available were fragmented strain, and the way we could tell that prior to the sequencing process. But into more than 300 pieces, known as was that we sequenced nine other a unique aspect of single molecule, contigs, making it no trivial matter to strains of the same serotype. They real-time (SMRT) sequencing is that identify the phylogenetic basis of the matched perfectly to the outbreak it collects base modification data microbe. Add to that a starting data strain,” Schadt says. “It was a pretty simultaneously as it collects DNA bias — several sequenced genomes fundamental difference, but these sequence data. Scientists at Pacific were available for one serotype of E. bugs are tricky.” coli, but just a few from the O104:H4 Biosciences have demonstrated The team reported that the serotype — and it’s no wonder that that the polymerase used in their outbreak strain had acquired early attempts to determine the sequencing technology responds enterohemorrhagic traits, including origins of the outbreak strain wound differently to modified bases than it the relatively recent insertion of a up being incorrect. does to unmodified bases, leaving a Shiga-toxin–encoding lambda-like kinetic signature that can be detected Eric Schadt, then Chief Scientific prophage element. The strain had in the data. Indeed, the polymerase Officer at Pacific Biosciences, and also developed a devious form of is sensitive enough that it pauses his colleagues saw an opportunity to antibiotic resistance: the standard at measurably different intervals help the community make inroads treatment of ciprofloxacin actually for different types of modifications, into understanding the outbreak boosted the expression of the bug’s leading to distinct kinetic patterns for strain by adding long-read sequence Shiga toxin gene, leading to much data to the short-read data already more than a dozen different types of higher levels of the toxin, which base modification. available. Teaming up with scientists was partly responsible for the onset at medical schools and the World of hemolytic-uremic syndrome in Health Organization, they launched patients. In a real-time setting, their investigation, using the this information could be critical to “Some of the base ® PacBio RS platform to sequence selecting the best possible treatment modifications that we’re not just the German strain but also for a particular microbial strain. several other E. coli strains from detecting appear to be the underrepresented serotype to Studying Adenine Methylation associated with virulence, pinpoint the outbreak strain’s origins. Their NEJM paper reports 13 genome Despite the remarkable findings from but that’s virtually unknown sequences, including an assembly of the team’s E. coli sequencing effort, because people haven’t been the outbreak strain that used the long the DNA sequence alone did not fully able to see it before.” PacBio reads to get down to just 33 explain the unusually high virulence contigs. Combining PacBio data with seen in the outbreak. Some months existing short-read data, the team after the initial sequencing work was able to represent the full E. coli was done, Schadt and his colleagues That array of modifications genome in a single contig. had an opportunity to reanalyze the represents a far larger epigenetic data — this time looking at chemical The PacBio RS sequencing platform universe than could be assessed with modifications to DNA bases. “has the throughput to completely earlier tools, Schadt says. “Some finish these small microbial genomes This layer of information was of of the modifications that we’re in under a day,” Schadt says, noting particular interest to Schadt and detecting appear to be associated that this project would not have been his research team because of the with virulence, but that’s virtually possible on any other sequencer growing recognition during the last unknown because people haven’t currently available. That speed was ten years that methylation can play a been able to see it before.” critical to the team’s efforts and significant role in microbial virulence The computational ability to analyze points to real-time utility of the and other biological processes. this data became possible after sequencer for outbreaks and other “If you want to understand why this Schadt and his colleagues had situations where information is bug is so much more virulent than completed their first examination needed urgently. these other bugs that have the of the E. coli outbreak strain, but A key finding from the team’s same serotype, if you don’t have all because the base modification data investigation was that the widely of the axes of variation that can be was an inherent part of the sequence reported origin of the strain — that happening, then you might miss data, they were able to go back after 2 Base Modification Detection Case Study | November 2012 the fact and reanalyze the same age-old biology techniques to examine that may indicate involvement data set to detect base modifications the functional consequences of base with processes associated with genome-wide. modifications, including knocking pathogenicity and virulence. While some of the software to out the gene making the changes For Schadt, these findings perform that detection was just and then sequencing the RNA of emphasized the importance of adding being crafted at the time of Schadt’s the knockout alongside the RNA of more layers of data to the picture study of E. coli, the tools have since the wild type. In other work, they of any organism’s biology. He notes been incorporated into the standard compared the K12 strain with added that the microbiology community in software offered with the PacBio methylations to the original K12 to particular is seeing a revolution with instrument. Any user can detect study the effect of those methylations. this kind of information and “turning base modifications on new bacteria Data was sifted to show expression on whole areas of investigation that being sequenced, or reanalyze older signatures, pathway enrichment, weren’t considered in the past.” Being sequences gathered by the PacBio RS. and more. able to generate DNA sequence and methylomes for a microbe on The Epigenetic View a genome-wide scale in less than a day is a major step toward filling in By analyzing the outbreak strain “Without base the picture of how these organisms sequence for base modifications, in modification information, function. this case N6-methyladenine residues, Schadt and the team discovered a you simply don’t have “The A’s, G’s, C’s, and T’s that you series of methylase enzymes that a complete picture of get from sequencing are a big piece of the puzzle of putting a appeared to target specific sequence all the variation and the motifs throughout the genome as genome together — that’s why it they made their chemical changes. phenotypic variability that is so important to have long reads For example, Dam methyltransferase we see.” for a complete genome assembly,” targeted the A residue in DNA with Schadt says. “But beyond that, those the sequence motif GATC, while a bases can be chemically modified, methyltransferase found in the Shiga “The dogma in the field for these changing how proteins interact with toxin region acts on the CTGCAG restriction modification systems is that particular sequence and having motif. “We found a whole array of that they’re protecting the DNA of significant functional consequences. Without base modification methylase-like enzymes that were the bug from degradation when they information, you simply don’t have a making modifications by targeting release the endonuclease to degrade complete picture of all the variation different motifs,” Schadt says. “It was other DNA,” Schadt says. “But what and the phenotypic variability that almost like a language.” we found was that these modifications we see.” Many of those enzymes had not been were having a very significant impact previously characterized, Schadt on the transcription of genes, and notes. But there seems to be little that the genes being affected were question about the importance of the enriched in a number of different motif-targeting patterns the team pathways.” Notably, they found found: “They’re highly non-random, marked enrichment for pathways and the targeting had an effect on the linked to in transcription of genes,” he adds. the outbreak strain. Throughout the The team followed up the base organism’s genome, many pathways modification study with RNA-seq were up- or down-regulated by one to determine how these marks of the methylases found in a mobile were affecting the transcriptome. element next to the Shiga toxin gene, “Connecting the RNA sequence which is known to have an impact to the epigenetic changes at this on virulence. The removal of this point isn’t quite seamless,” Schadt methylase in a knockout experiment acknowledges. His team relied on led to structural genomic changes

www.pacb.com/basemod

For Research Use Only. Not for use in diagnostic procedures. © Copyright 2012, Pacific Biosciences of California, Inc. All rights reserved. Information in this document is subject to change without notice. Pacific Biosciences assumes no responsibility for any errors or omissions in this document. Certain notices, terms, conditions and/or use restrictions may pertain to your use of Pacific Biosciences products and/or third party products. Please refer to the applicable Pacific Biosciences Terms and Conditions of Sale and to the applicable license terms at http://www.pacificbiosciences.com/licenses.html. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT and SMRTbell are trademarks of Pacific Biosciences in the United States and/or certain other countries. All other trademarks are the sole property of their respective owners. 3