ESCMID Online Lecture Library © by Author
Overview of Software & Databases for Rapid High-throughput Sequence Analysis in Clinical & Public Health Microbiology
Dag Harmsen
University of Münster, Germany

Short CV – Dag Harmsen
• 1983–1991, Medical Schools University Antwerpen, Belgium and Univ. Würzburg, Germany
• 1991, MD and doctoral thesis (‘Nucleo-capsid gene of Coronavirus HCV-229E’, V. ter Meulen)
• 1992–2000, Institute of Hygiene & Microbiology (J. Heesemann, Candida spp. & Aspergillus spp. & M. Frosch, Mycobacterium spp.) / Internal Medicine (K. Kochsiek) / Institute of Virology (V. ter Meulen), Univ. Würzburg, Germany
• 1998, Inclusion in the German Medical Microbiology and Infectious Epidemiology specialist register
• 2000–2001, Head of R&D CREATOGEN diagnostics, CREATOGEN AG, Augsburg, Germany
• 2002–2004, Institute of Hygiene (H. Karch, MRSA), University of Münster, Germany
• 2003, co-founder and shareholder of a bioinformatics company (Ridom GmbH, Münster, Germany)
• 2004– , Full Professorship Department of Periodontology, Univ. Münster, Germany
• July 2005 – July 2008, Temporary Head of the Department of Periodontology, Univ. Münster
• August 2008 – , Head of Research Department of Periodontology
• July 2005 – , Member of the Executive Board of the International Committee on Systematics of Prokaryotes (ICSP) of the ‘International Union of Microbiological Societies’ (IUMS)
• June 2007 – July 2012, Member of the ASM Professional Development Committee
• October 2012 – , ASM Ambassador in Germany
• ECDC & EFSA technical advisor for genotyping and NGS/WGS
• Scientific interests: molecular diagnostics, epidemiology, and phylogeny of microorganisms; applied bioinformatics in microbiology

Commercial Disclosure
Dag Harmsen is co-founder and partial owner of a bioinformatics company (Ridom GmbH, Münster, Germany) that develops software for DNA sequence analysis.
Recently Ridom and Ion Torrent/Thermo Fisher (Waltham, MA) partnered and released SeqSphere+ software to speed and simplify whole genome based bacterial typing.

Next Generation Sequencing (NGS)
AS = Amplicon Sequencing; WGS = Whole Genome Sequencing; MPSS = Massive Parallel Signature Sequencing
[Figure: sequencing technologies grouped by template preparation. Modified from: Hall (2007). J. Exp. Biol. 210: 1518 [PubMed].]
• Cloning in vivo: Sanger method (Sanger et al., 1977 [PubMed])
• 2. Generation, PCR in vitro (sequencing by synthesis; hybridization and ligation): 454 Roche / Ion Torrent (Margulies et al., 2005 [PubMed]); Illumina (Solexa) (Bennett et al., 2005 [PubMed]); Polony / ABI SOLiD (Shendure et al., 2005 [PubMed]); MPSS (Brenner et al., 2000 [PubMed])
• 3. Generation, single molecule: immobilized fragments (Helicos; Braslavsky et al., 2003 [PubMed]); arrayed polymerase (PacBio; VisiGen; Levene et al., 2003 [PubMed]); nanopore readers (Oxford, Genia; Kasianowicz et al., 1996 [PubMed])

It’s the Consensus
Genome-wide Gene by Gene de novo Consensus Accuracy
[Figure: Venn diagram of de novo consensus accuracy for PGM, MiSeq and GSJ.]
Consensus errors were analyzed for 4,632 coding NCBI Sakai reference genes retrieved from MIRA de novo assemblies using SeqSphere+ for all 3 platforms. The number of variants confirmed by bidirectional Sanger sequencing is indicated in parentheses. Validation of the 8 substitution and 15 indel variants identified using all 3 NGS platforms suggested that either the Sakai strain experienced micro-evolutionary changes or the genome sequence deposited in 2001 contains sequencing errors.
PGM, Ion Torrent Personal Genome Machine 300bp; MiSeq, Illumina MiSeq 2x 250bp PE; GSJ, 454 GS Junior with GSJ Titanium chemistry; bp, base pairs.
Jünemann et al. (2013). Nature Biotechnology 31: 294 [PubMed].
Real Cost of Sequencing
[Figure: cost per Mb of DNA sequence compared with Moore's law, 2001 to 2011; cost axis from $0.10 to $10,000.00.]
Sboner et al. (2011). Genome Biol. 12: 125 [PubMed].

Real Cost of Sequencing
[Figure: relative contributions (0 to 100%) of sample collection and experimental design, sequencing, data reduction and data management, and downstream analyses at three time points: Pre-NGS (approximately 2000), Now (approximately 2010), Future (approximately 2020).]
Sboner et al. (2011). Genome Biol. 12: 125 [PubMed].

Microbial Genomics Data, Workflow & Bioinformatics Tasks
[Figure: workflow diagram starting from microbial NGS reads. Labels recoverable from the figure:]
• amplicon: 16S rDNA community profiling; pathogenicity, resistance profiling …
• meta-genome (shot-gun): community profiling & functional analysis; pathogen discovery in diseased human tissue; strain attribution
• pure culture / single cell WGS: de novo or reference assisted assembly (assembled); automated or semi-automated annotation (annotated); surveillance, phylogeny & resistome; PCR signatures; synteny; plain language report
• other ‘omics’ applications ... (e.g., proteomics)
Shendure et al. (2012). Nature Biotechnol. 30: 1084 [PubMed].
WGS, whole genome sequencing.
General Hardware Requirements for NGS Software
RAM ≥ 32GB
HD ≥ 1TB (Raid)

Metagenomic Analysis / Strain Attribution
Composition or pattern matching approach
• uses predetermined patterns in the data (e.g., taxonomic clade markers [MetaPhlAn], k-mer frequency), often together with sophisticated classification algorithms
• requires intensive preprocessing of the genomic database before application; classification results can change drastically depending on the size and composition of the genomic database
Taxonomic mapping approach
• typically relies on a ‘lowest common ancestor’
• usually only highly accurate for higher taxonomic levels
• e.g., MEGAN, MG-RAST, CARMA3
Whole-genome assembly approach
• often leads to the most accurate strain identification
• computationally very intensive (e.g., PathSeq)
Meyer et al. (2008). BMC Bioinformatics 9: 386 [PubMed].

Pathoscope
• capitalizes on a Bayesian statistical framework
• takes NGS reads as input; considers sequencing errors and sequence & mapping quality; provides posterior matches to a known database
• accurately discriminates for presence of multiple species or closely related strains of the same species
• considers cases when the sample species/strain is not in the reference database
• fast because no database preprocessing or sequence assembly
Francis et al. (2013). Genome Res. 23: 1721 [PubMed].

De novo vs. Reference Assisted Assembly
De novo assembly
slower (produces e.g. ACE/AMOS files); ‘hypothesis free’ assembly (resistance)
frequently used for panmictic/non-clonal bacteria (e.g., N. meningitidis)
e.g., MIRA, Newbler, or Velvet
Reference-assisted assembly
fast (produces SAM/BAM files); detects only what is present in the reference
frequently used for monomorphic/clonal bacteria (e.g., M. tuberculosis) or lineage-only for non-clonal bacteria
e.g., bwa

GABenchToB: A Genome Assembly Benchmark Tuned on Bacteria and Benchtop Sequencers
Objectives
• Focus on practical questions
  • benchtop platforms (MiSeq, PGM, GS Junior)
  • single library (single or paired end reads)
  • real data (no synthetic data)
  • assembler specific default procedures and parameters
  • parallelization on single dedicated host
  • including commercial assemblers
  • benchmark run time and memory usage
• Bacterial genomes only
  • Escherichia coli O157:H7 SAKAI
  • Staphylococcus aureus COL
  • Mycobacterium tuberculosis H37Rv
  • all with complete reference
  • whole GC-range: 66% (H37), 51% (SAKAI), 33% (COL)
• coverage down-sampled
  • 75-fold: MiSeq 2x150bp, MiSeq 2x250bp, PGM 400bp
  • 40-fold: PGM 200bp, GSJ
Supermicro A+ Server: 4 CPUs / 48 Cores, 128GB RAM, 500GB local storage, ~6.000 Euro (mid 2011)
Jünemann et al. (2014). unpublished.

Effect of Coverage on N50 Contig Size for Benchtop NGS Platforms
Less coverage means higher through-put (cheaper NGS) and faster de novo assembly (OLC). For PGM 400bp optimal coverage lies around 50-60x.
[Figure: effect of coverage on N50 contig size and memory requirements in an E. coli de novo assembly (simulated! Illumina 75 bp paired-end data sets with a 200 bp insert size assembled using Velvet 0.7.31).] Illumina Technical Note (2010). De Novo Assembly Using Illumina Reads.
[Figure: MiSeq 2x 250bp PE (blue), PGM 300bp (red) and GSJ (green) MIRA de novo assemblies at different average coverage obtained by random sub-sampling of original E. coli datasets.] N50, a statistic for describing the distribution of contig lengths in an assembly; kb, kilo bases. Jünemann et al. (2013). Nature Biotechnology 31: 294 [PubMed].
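
The N50 statistic and the coverage down-sampling mentioned on the benchmark slides can be made concrete with a short Python sketch. This is only an illustration under stated assumptions: the function names (`n50`, `mean_coverage`, `subsample_fraction`) and all numbers are hypothetical and are not taken from the cited studies or from any assembler. N50 is computed with the common definition, the contig length at which the contigs sorted from longest to shortest first cover at least half of the assembled bases.

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    together contain at least half of all assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly


def mean_coverage(total_read_bases, genome_size):
    """Average coverage = sequenced bases / genome length."""
    return total_read_bases / genome_size


def subsample_fraction(current_coverage, target_coverage):
    """Fraction of reads to keep at random to reach the target coverage."""
    if target_coverage >= current_coverage:
        return 1.0  # nothing to discard
    return target_coverage / current_coverage


# Hypothetical contig lengths in kb (total 290, half 145).
contigs_kb = [80, 70, 50, 40, 30, 20]
print(n50(contigs_kb))             # 70

# Hypothetical run: 2 million reads of 250 bp on a 5 Mb genome.
cov = mean_coverage(2_000_000 * 250, 5_000_000)
print(cov)                         # 100.0
print(subsample_fraction(cov, 75)) # 0.75
```

In this toy run, keeping a random 75% of the reads would bring a 100-fold data set down to the 75-fold level used for the longer-read data sets in the benchmark; a real pipeline would subsample read pairs together.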
De novo Assembler
Assembler selection criteria
• processing single and paired-end reads
• not relying on mate-pair libraries
• no requirement on multiple libraries
• only de Bruijn and OLC approaches
Assemblers included in other studies but not considered here
• ARACHNE: long Sanger reads and mate-pair libraries
• MaSuRCA: relies on paired-end data
• SGA: relies on paired-end data
• ALLPATHS-LG: needs at least one paired-end and one mate-pair library
• Phusion: needs mate-pair data

De novo Assembler Benchmarks
Main metrics
• Quast¹ used to assess contiguity and accuracy (also used by GAGE-B)
• NGA50
• # misassemblies: misassemblies (inversion, relocation, translocation) & local misassemblies (misassembly