ESCMID Online Lecture Library © by Author

Overview of Software & Databases for Rapid High-throughput Sequence Analysis in Clinical & Public Health Microbiology © by author ESCMID OnlineDag HarmsenLecture Library University of Münster, Germany Short CV – Dag Harmsen • 1983–1991, Medical Schools University Antwerpen, Belgium and Univ. Würzburg, Germany • 1991, MD and doctoral thesis (‘Nucleo-capsid gene of Coronavirus HCV-229E’, V. ter Meulen) • 1992 – 2000, Institute of Hygiene & Microbiology (J. Heesemannn, Candida spp. & Aspergillus spp. & M. Frosch, Mycobacterium spp.) / Internal Medicine (K. Kochsiek) / Institute of Virology (V. ter Meulen), Univ. Würzburg, Germany • 1998, Inclusion in the German Medical Microbiology and Infectious Epidemiology specialist register • 2000-2001, Head of R&D CREATOGEN diagnostics, CREATOGEN AG, Augsburg, Germany • 2002-2004, Institute of Hygiene (H. Karch, MRSA), University of Münster, Germany • 2003, co-founder and shareholder of a bioinformatics company (Ridom GmbH, Münster, Germany) • 2004– , Full Professorship Department of Periodontology, Univ. Münster, Germany • July 2005 – July 2008, Temporary Head of the Department of Periodontology, Univ. Münster • August 2008 – , Head of Research Department of Periodontology • July 2005 – , Member of the Executive© Board by of the author International Committee on Systematics of Prokaryotes (ICSP) of the ‘International Union of Microbiological Societies’ (IUMS) • June 2007 – July 2012, Member of the ASM Professional Development Committee • OctoberESCMID 2012 - , ASM Ambassador Online in Germany Lecture Library • ECDC & EFSA technical advisor for genotyping and NGS/WGS • Scientific interests: molecular diagnostic, epidemiology, and phylogeny of microorganisms; applied bioinformatics in microbiology Commercial Disclosure Dag Harmsen is co-founder and partial owner of a bioinformatics company (Ridom GmbH, Münster, Germany) that develops software for DNA sequence analysis. Recently Ridom and Ion Torrent/Thermo Fisher (Waltham, MA) partnered and released SeqSphere+ software to speed and simplify whole genome ©based by bacterial author typing. ESCMID Online Lecture Library Next Generation Sequencing (NGS) AS = Amplicon Sequencing DNA WGS = Whole Genome Sequencing MPSS = Massive Parallel Signature Sequencing AS WGS Single molecule PCR Cloning 3. Generation Immobilized Arrayed Nanopore in vitro polymerase in vivo fragments readers 2. Generation (PacBio; (Helicos) (Oxford, Genia) VisiGen) Braslavsky et al., Levene et al., Kasianowicz et al., © by author2003 [PubMed] 2003 [PubMed] 1996 [PubMed] Cloning Sequencing Hybridization Sanger method in vivo by synthesis and ligation ESCMIDSanger et al., Online Lecture Library 1977 [PubMed] Polony / 454 Roche / Illumina MPSS Ion Torrent (Solexa) ABI SOLiD Modified from: Hall (2007). J. Marguiles et al., Bennett et al., Shendure et al., Brenner et al., Exp. Biol. 210: 1518 [PubMed] 2005 [PubMed] 2005 [PubMed] 2005 [PubMed] 2000 [PubMed] It’s the Consensus Genome-wide Gene by Gene de novo Consensus Accuracy Venn diagram of de novo consensus Details accuracy for PGM, MiSeq and GSJ . Consensus errors were analyzed for 4,632 coding NCBI Sakai reference genes retrieved from MIRA de novo assemblies using SeqSphere+ for all 3 platforms . Number of variants confirmed by bidirectional Sanger sequencing indicated in parentheses . Validation of the 8 substitution and 15 indel variants identified using all 3 NGS platforms, suggested that either the Sakai © by authorstrain experienced micro-evolutionary changes or the genome sequence deposited in 2001 contains sequencing PGM, Ion Torrent Personal Genome Machine 300bp; MiSeq, Illumina MiSeq 2x 250bp PE; errors ESCMIDGSJ, 454 GS Junior with GSJ Titanium Online chemistry; Lecture Library bp, base pairs Jünemann et al. (2013). Nature Biotechnology 31: 294 [PubMed]. Real Cost of Sequencing Cost per Mb of DNA Sequence Moore's law $10,000.00 $1,000.00 $100.00 $10.00 © by author $1.00 ESCMID Online Lecture Library $0.10 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 Sboner et al. (2011). Genome Biol. 12: 125 [PubMed]. Real Cost of Sequencing Sample collection and Data reduction Sequencing Downstream experimental design Data management analyses 100% 80 60 40 Data management Data © by author 20 ESCMID Online Lecture Library 0% Pre-NGS Now Future (Approximately 2000) (Approximately 2010) (Approximately 2020) Sboner et al. (2011). Genome Biol. 12: 125 [PubMed]. Microbial Genomics Data, Workflow & Bioinformatics Tasks 16S rDNA community profiling Other ‚omics‘ applications ... pathogenicity, pathogenicity amplicon resistance profiling … … community profiling & meta-genome functional analysis microbial NGS reads pathogen diseased discovery human tissue Proteomics Shendure et al. (2012). Nature Biotechnol. 30: 1084 [PubMed]. strain attribution shot-gun © by authorsurveillance, plain language phylogeny & report resistom pure culture / de novo automated assembled annotated ESCMID Onlinesingle cell WGS Lecture Library reference assisted semi-automated assembly annotation PCR signatures synteny WGS; whole genome sequencing. General Hardware Requirements for NGS Software © by author RAM ≥ 32GB ESCMID Online Lecture Library HD ≥ 1TB (Raid) Metagenomic Analysis / Strain Attribution Composition or pattern matching approach • uses predetermined patterns in the data (e.g., taxonomic clade markers [MetaPhlAn], k- mer frequency) often together with sophisticated classification algorithms • require intensive preprocessing of the genomic database before application and classification results can change drastically depending on size and composition of genomic database Taxonomic mapping approach • typically rely on a ‘lowest common ancestor’ • usually only highly accurate for higher taxonomic levels • e.g., MEGAN , MG-RAST , CARMA3 Whole-genome assembly approach • often lead to most accurate strain identification Meyer et al. (2008). BMC Bioinformatics 19: 386 [PubMed]. • computational very intensive (e.g., PathSeq) Pathoscope © by author • capitalizes on a Bayesian statistical framework • takes NGS reads as input; considers sequencing errors and sequence & mappingESCMID quality; provides Online posterior matches Lecture to a known Library database • accurately discriminates for presence of multiple species or closely related strains of the same species • considers cases when the sample species/strain is not in reference database • fast because no database preprocessing or sequence assembly FrancisFrancis et al. et (2013).al. (2013). Genome Genome Res. Res. 23: 23: Epub 1721 ahead [PubMed of print]. [PubMed]. De novo vs. Reference Assisted Assembly De novo assembly slower (produces e.g. ACE/AMOS files); ‘hypothesis free’ assembly (resistance) frequently used for panmictic/non-clonal bacteria (e.g., N. meningitidis) e.g., MIRA, Newbler, or Velvet © by author Reference-assistedESCMID assembly Online Lecture Library fast (produces SAM/BAM files); detects only what is present in reference frequently used for monomorphic/ clonal bacteria (e.g., M. tuberculosis) or lineage-only for non- clonal bacteria e.g., bwa GABenchToB: A Genome Assembly Benchmark Tuned on Bacteria and Benchtop Sequencers Objecves • Focus on practical questions • benchtop platforms (MiSeq, PGM, GS Junior) • single library (single or paired end reads) • real data (no synthetic data) • assembler specific default procedures and parameters • parallelization on single dedicated host • including commercial assemblers • benchmark run time and memory usage • Bacterial genomes only Supermicro A+ Server • Escherichia coli O157:H7 SAKAI • Staphylococcus aureus© COL by author 4 CPUs / 48 Cores • Mycobacterium tuberculosis H37Rv 128GB RAM • all with complete reference 500GB local storage • whole GC-range: 66% (H37), 51% (SAKAI), 33% (COL) ~6.000 Euro (mid/ 2011) ESCMID• coverage down-sampled Online Lecture Library • 75-fold: MiSeq 2x150bp, MiSeq 2x250bp, PGM400bp • 40-fold: PGM 200bp, GSJ Jünemann et al. (2014). unpublished. Effect of Coverage on N50 Contig Size for Benchtop NGS Platforms Less coverage means higher through-put (cheaper NGS) and faster de novo assembly (OLC). For PGM © by authorEffect of coverage400bp optimalon N50 contig size coverageand memory lies requirements in aroundan E. coli ESCMID Online Lecturede Librarynovo assembly50- 60x(simulated! MiSeq 2x 250bp PE (blue), PGM 300bp (red) and GSJ (green) MIRAIllumina de 75 bp paired-end data sets novo assemblies at different average coverage obtained by randomwith sub- a 200 bp insert size assembled sampling of original E. coli datasets. N50, a statistic for describingusing the Velvet 0.7.31). distribution of contig lengths in an assembly; kb, kilo bases. Illumina Technical Note (2010). De Novo Jünemann et al. (2013).Assembly Nature Using Biotechnology Illumina Reads. 31: 294 [PubMed]. De novo Assembler Assembler selecon criteria • processing single and paired-end reads • not relying on mate-pair libraries • no requirement on multiple libraries • only de Bruijn and OLC approaches Assemblers included in other studies but not considered here • ARACHNE: long Sanger reads and mate-pair libraries • MaSuRCA: relies on paired-end data • SGA: relies on paired-end© bydata author • ALLPATH-LG: needs at least one paired-end and one mate-pair library • Phusion: needs mate-pair data ESCMID Online Lecture Library 14 De novo Assembler Benchmarks Main metrics • Quast1 used to assess contiguity and accuracy (also used by GAGE-B) • NGA50 • # misassemblies: misassemblies (inversion, relocation, translocation) & local misassemblies (misassembly

ESCMID Online Lecture Library © by Author

Proquest Dissertations

QUANTIFYING TRENDS in BACTERIAL VIRULENCE and PATHOGEN-ASSOCIATED GENES THROUGH LARGE SCALE BIOINFORMATICS ANALYSIS By

Comparative Genome Analyses of Vibrio Anguillarum Strains Reveal a Link with Pathogenicity Traits

Whole Genome Sequencing Revealed Host Adaptation-Focused Genomic

MOCAT2: a Metagenomic Assembly, Annotation and Profiling Framework Jens Roat Kultima1, Luis Pedro Coelho1, Kristoffer Forslund1, Jaime Huerta-Cepas1, Simone S

Detailed Analysis of Metagenome Datasets Obtained from Biogas

Whole Genome Sequencing for Foodborne Disease Surveillance Landscape Paper

Leveraging Public Information About Pathogens for Disease Outbreak

Using Genomics to Track Global Antimicrobial Resistance

Metagenomic Insights Into Chlorination Effects on Microbial Antibiotic Resistance in Drinking Water

Learning Virulent Proteins from Integrated Query Networks Eithon Cadag1*, Peter Tarczy-Hornoch2 and Peter J Myler2,3

Excess Labile Carbon Promotes the Expression of Virulence Factors in Coral Reef Bacterioplankton