Overview of Software & Databases for Rapid High-throughput Sequence Analysis in Clinical & Public Health Microbiology

© by author

ESCMID OnlineDag HarmsenLecture Library University of Münster, Germany Short CV – Dag Harmsen

• 1983–1991, Medical Schools University Antwerpen, Belgium and Univ. Würzburg, Germany • 1991, MD and doctoral thesis (‘Nucleo-capsid gene of Coronavirus HCV-229E’, V. ter Meulen) • 1992 – 2000, Institute of Hygiene & Microbiology (J. Heesemannn, Candida spp. & Aspergillus spp. & M. Frosch, Mycobacterium spp.) / Internal Medicine (K. Kochsiek) / Institute of Virology (V. ter Meulen), Univ. Würzburg, Germany • 1998, Inclusion in the German Medical Microbiology and Infectious Epidemiology specialist register • 2000-2001, Head of R&D CREATOGEN diagnostics, CREATOGEN AG, Augsburg, Germany • 2002-2004, Institute of Hygiene (H. Karch, MRSA), University of Münster, Germany • 2003, co-founder and shareholder of a bioinformatics company (Ridom GmbH, Münster, Germany) • 2004– , Full Professorship Department of Periodontology, Univ. Münster, Germany • July 2005 – July 2008, Temporary Head of the Department of Periodontology, Univ. Münster • August 2008 – , Head of Research Department of Periodontology • July 2005 – , Member of the Executive© Board by of the author International Committee on Systematics of Prokaryotes (ICSP) of the ‘International Union of Microbiological Societies’ (IUMS) • June 2007 – July 2012, Member of the ASM Professional Development Committee • OctoberESCMID 2012 - , ASM Ambassador Online in Germany Lecture Library • ECDC & EFSA technical advisor for genotyping and NGS/WGS

• Scientific interests: molecular diagnostic, epidemiology, and phylogeny of microorganisms; applied bioinformatics in microbiology Commercial Disclosure

Dag Harmsen is co-founder and partial owner of a bioinformatics company (Ridom GmbH, Münster, Germany) that develops software for DNA sequence analysis. Recently Ridom and Ion Torrent/Thermo Fisher (Waltham, MA) partnered and released SeqSphere+ software to speed and simplify whole genome ©based by bacterial author typing. ESCMID Online Lecture Library Next Generation Sequencing (NGS) AS = Amplicon Sequencing DNA WGS = Whole Genome Sequencing MPSS = Massive Parallel Signature Sequencing

AS WGS

Single molecule PCR Cloning 3. Generation

Immobilized Arrayed Nanopore in vitro polymerase in vivo fragments readers 2. Generation (PacBio; (Helicos) (Oxford, Genia) VisiGen) Braslavsky et al., Levene et al., Kasianowicz et al., © by author2003 [PubMed] 2003 [PubMed] 1996 [PubMed]

Cloning Sequencing Hybridization Sanger method in vivo by synthesis and ligation

ESCMIDSanger et al., Online Lecture Library 1977 [PubMed]

Polony / 454 Roche / Illumina MPSS Ion Torrent (Solexa) ABI SOLiD Modified from: Hall (2007). J. Marguiles et al., Bennett et al., Shendure et al., Brenner et al., Exp. Biol. 210: 1518 [PubMed] 2005 [PubMed] 2005 [PubMed] 2005 [PubMed] 2000 [PubMed] It’s the Consensus Genome-wide Gene by Gene de novo Consensus Accuracy

Venn diagram of de novo consensus Details accuracy for PGM, MiSeq and GSJ . Consensus errors were analyzed for 4,632 coding NCBI Sakai reference genes retrieved from MIRA de novo assemblies using SeqSphere+ for all 3 platforms . Number of variants confirmed by bidirectional Sanger sequencing indicated in parentheses . Validation of the 8 substitution and 15 indel variants identified using all 3 NGS platforms, suggested that either the Sakai © by authorstrain experienced micro-evolutionary changes or the genome sequence deposited in 2001 contains sequencing PGM, Ion Torrent Personal Genome Machine 300bp; MiSeq, Illumina MiSeq 2x 250bp PE; errors ESCMIDGSJ, 454 GS Junior with GSJ Titanium Online chemistry; Lecture Library bp, base pairs

Jünemann et al. (2013). Nature Biotechnology 31: 294 [PubMed]. Real Cost of Sequencing

Cost per Mb of DNA Sequence Moore's law

$10,000.00

$1,000.00

$100.00

$10.00 © by author $1.00 ESCMID Online Lecture Library $0.10

2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011

Sboner et al. (2011). Genome Biol. 12: 125 [PubMed]. Real Cost of Sequencing

Sample collection and Data reduction Sequencing Downstream experimental design Data management analyses

100% 80 60 40

Data management Data © by author 20 ESCMID Online Lecture Library 0% Pre-NGS Now Future (Approximately 2000) (Approximately 2010) (Approximately 2020)

Sboner et al. (2011). Genome Biol. 12: 125 [PubMed]. Microbial Genomics Data, Workflow & Bioinformatics Tasks

16S rDNA community profiling

Other ‚omics‘ applications ... pathogenicity, pathogenicity amplicon resistance profiling …

community profiling & meta-genome functional analysis microbial NGS reads pathogen diseased discovery human tissue Proteomics Shendure et al. (2012). Nature Biotechnol. 30: 1084 [PubMed]. strain attribution

shot-gun © by authorsurveillance, plain language phylogeny & report resistom

pure culture / de novo automated assembled annotated ESCMID Onlinesingle cell WGS Lecture Library reference assisted semi-automated assembly annotation

PCR signatures synteny WGS; whole genome sequencing. General Hardware Requirements for NGS Software

© by author RAM ≥ 32GB ESCMID Online Lecture Library

HD ≥ 1TB (Raid) Metagenomic Analysis / Strain Attribution Composition or pattern matching approach • uses predetermined patterns in the data (e.g., taxonomic clade markers [MetaPhlAn], k- mer frequency) often together with sophisticated classification algorithms • require intensive preprocessing of the genomic database before application and classification results can change drastically depending on size and composition of genomic database

Taxonomic mapping approach • typically rely on a ‘lowest common ancestor’ • usually only highly accurate for higher taxonomic levels • e.g., MEGAN , MG-RAST , CARMA3

Whole-genome assembly approach • often lead to most accurate strain identification Meyer et al. (2008). BMC Bioinformatics 19: 386 [PubMed]. • computational very intensive (e.g., PathSeq) Pathoscope © by author • capitalizes on a Bayesian statistical framework • takes NGS reads as input; considers sequencing errors and sequence & mappingESCMID quality; provides Online posterior matches Lecture to a known Library database • accurately discriminates for presence of multiple species or closely related strains of the same species • considers cases when the sample species/strain is not in reference database • fast because no database preprocessing or sequence assembly FrancisFrancis et al. et (2013).al. (2013). Genome Genome Res. Res. 23: 23: Epub 1721 ahead [PubMed of print]. [PubMed]. De novo vs. Reference Assisted Assembly

De novo assembly

slower (produces e.g. ACE/AMOS files); ‘hypothesis free’ assembly (resistance) frequently used for panmictic/non-clonal (e.g., N. meningitidis) e.g., MIRA, Newbler, or Velvet

© by author

Reference-assistedESCMID assembly Online Lecture Library

fast (produces SAM/BAM files); detects only what is present in reference frequently used for monomorphic/ clonal bacteria (e.g., M. tuberculosis) or lineage-only for non- clonal bacteria e.g., bwa GABenchToB: A Genome Assembly Benchmark Tuned on Bacteria and Benchtop Sequencers Objecves • Focus on practical questions • benchtop platforms (MiSeq, PGM, GS Junior) • single library (single or paired end reads) • real data (no synthetic data) • assembler specific default procedures and parameters • parallelization on single dedicated host • including commercial assemblers • benchmark run time and memory usage

• Bacterial genomes only Supermicro A+ Server • Escherichia coli O157:H7 SAKAI • Staphylococcus aureus© COL by author 4 CPUs / 48 Cores • Mycobacterium tuberculosis H37Rv 128GB RAM • all with complete reference 500GB local storage • whole GC-range: 66% (H37), 51% (SAKAI), 33% (COL) ~6.000 Euro (mid/ 2011) ESCMID• coverage down-sampled Online Lecture Library • 75-fold: MiSeq 2x150bp, MiSeq 2x250bp, PGM400bp • 40-fold: PGM 200bp, GSJ

Jünemann et al. (2014). unpublished. Effect of Coverage on N50 Contig Size for Benchtop NGS Platforms

Less coverage means higher through-put (cheaper NGS) and faster de novo assembly (OLC).

For PGM © by authorEffect of coverage400bp optimal on N50 contig size coverage and memory lies requirements in around an E. coli ESCMID Online Lecturede Library novo assembly50- 60x(simulated! MiSeq 2x 250bp PE (blue), PGM 300bp (red) and GSJ (green) MIRAIllumina de 75 bp paired-end data sets novo assemblies at different average coverage obtained by randomwith sub- a 200 bp insert size assembled sampling of original E. coli datasets. N50, a statistic for describingusing the Velvet 0.7.31). distribution of contig lengths in an assembly; kb, kilo bases. Illumina Technical Note (2010). De Novo Jünemann et al. (2013).Assembly Nature Using Biotechnology Illumina Reads. 31: 294 [PubMed]. De novo Assembler

Assembler selecon criteria • processing single and paired-end reads • not relying on mate-pair libraries • no requirement on multiple libraries • only de Bruijn and OLC approaches

Assemblers included in other studies but not considered here • ARACHNE: long Sanger reads and mate-pair libraries • MaSuRCA: relies on paired-end data • SGA: relies on paired-end© bydata author • ALLPATH-LG: needs at least one paired-end and one mate-pair library • Phusion: needs mate-pair data ESCMID Online Lecture Library

14 De novo Assembler Benchmarks

Main metrics • Quast1 used to assess contiguity and accuracy (also used by GAGE-B) • NGA50 • # misassemblies: misassemblies (inversion, relocation, translocation) & local misassemblies (misassembly which reference locations are <1000bp apart or which overlap by < 1000bp) • errors (substitutions, insertions© by and author deletions) and ambiguities • wall clock time ESCMID• average system Onlineload (CPU time Lecture + system time / wallLibrary clock time / number of CPUs) • peak memory usage

1Gurevich et al. (2013), Bioinformatics. 29 (8): 1072-1075 [PubMed]. De novo Assembler Assemblers used for comparison

Assembler Acronym Version Type Scaffolds Availability / License OS ABySS ABYSS 1.3.5 DB$ x Free (BCCA SLA)

CELERA CELERA 7.0 OLC& x Free (GNU GPL v2)

CLC CLC 4.0.10 DB x Commercial Assembly Cell

MIRA MIRA 3.9.9 OLC§ - Free (GNU GPL v2)

GS De Novo NEWBLER 2.8 OLC x Proprietary (free usage on Assembler request*)

SeqMan NGen SEQMAN 11.0.0.17 OLC - Commercial 2 © by author SOAPdenovo SOAP 2.04 DB x Free (GNU GPL v3)

SPAdes SPADES 2.5.0 DB x Free (GNU GPL v2) Velvet ESCMID VELVET 1.2.08 Online DB Lecture x Free (GNU Library GPL v2)

$de Bruijn assembler, &Overlap Layout Consensus assembler, §MIRA is not a pure OLC assembler, uses also greedy assembler techniques

*Newbler available from Roche on request: http://454.com/contact-us/software-request.asp 16 De novo Assembler NGA50

© by author

ESCMID Scaffolds: MiSeq ABYSS, CLC, SOAP2, VELVET, SPADES, CELERA, NEWBLER Online Lecture Library Congs: MiSeq MIRA and SEQMAN; all PGM and GSJ data

 MiSeq more robust to assembler  no overall winner

 PGM / GSJunior pairs with OLC assemblers  superior: MIRA, NEWBLER, SPADES De novo Assembler Misassemblies

© by author ESCMID Online Lecture Library

SOAP, SEQMAN failed on some PGM and GSJ data  superior: NEWBLER, CLC, ABYSS high NGA50 not necessarily few misassemblies  inferior: MIRA, SEQMAN De novo Assembler Gene Coverage

© by author ESCMID Online Lecture Library

19 De novo Assembler Run Time

© by author ESCMID Online Lecture Library

de Bruijn: very fast OLC: very slow ABYSS, SOAP2, VELVET: single k-mer runtime, optimization needed superior: NEWBLER, SEQMAN SPADES, CLC: entire run time, already optimized inferior: MIRA, CELERA GAPE-B Summary • no single assembly evaluation metric  discrepancy between contiguity and accuracy  consider misassemblies and errors next to N50/NG50 or NGA50

• ≤ 100-fold coverage sufficient for recent chemistries  reasonable assembly results  reasonable sequencing cost and computing effort

• k-mer optimization necessary for most de Bruijn assemblers  great impact on assembly result and run time

• OLC assemblers applicable on Ion Torrent & GS Junior data  relatively slow and high memory footprint • de Bruijn assembler better suited© generally by author for Illumina data  fast and small memory footprint • no assemblerESCMID best by all metricsOnline Lecture Library • in general good performing assemblers  OLC: Newbler (also good and fast for Illumina data) and MIRA  De Bruijn: CLC, SPAdes (also for PGM & GSJ data) and ABySS See also GAGE-B: Jünemann et al. (2014). unpublished. Magoc et al. (2013). Bioinformatics 29: 1718 [PubMed]. Reference Assisted Assembly & SNP Calling Comparative analysis of algorithms for next-generation

sequencing read alignment. Ruffalo et al. (2011). Bioinformatics 27: 2790 [PubMed].

Mapathon: Benchmark mapping and variant calling tools using artificiallyRuntimeHuman genome:measurements: created accuracy (a )with Shows variation varying indexing error time rate.in versus simulated(a) Shows genome mapping size, bacterial quality threshold NGS 0, (b ) data.shows (thresholdb) shows alignment10 and (c) timeshows versus the proportion read count© of on readsby a 500 that Mbauthor have genome. mapping quality of at least 10. Conclusion: BWA in tandem with GATK in diploid mode detectedESCMID most SNPs Online and INDELs Lecture, with no false Library positive values and took least time to run. R. Tewolde, A. Al-Shahib, A. Underwood (2013). Dept. Bioinformatics, Public Health England, London; IMMEM-10.

See also: A survey of tools for variant analysis of next-generation genome sequencing data. Pabinger et al. (2013). Brief Bioinform., doi: 10.1093/bib/bbs086 [PubMed]. Tablet - Next Generation Sequence Assembly Visualization

Input file formats • ACE, AFG • SAM, BAM (needs index file) • FASTA, FASTQ • GFF3 • NOT GenBank © by author ESCMID Online Lecture Library

Milne et al. (2013). Brief Bioinform. 14: 193 [PubMed]. Synteny, Genome Location & Major Unique Genomic Areas

1800000 Pair-wise alignment of two genomes 1600000 1400000

• MUMmer Kurtz et al. (2004). Genome Biol. 5: R12 [PubMed]. 1200000

1000000 • version for nucleotide comparison NUCmer (NUCleotide MUMmer) 800000 • ACT for visualization of comparisons between complete 600000 400000

genome sequences and associated annotations 200000

0 1000000 1500000 2000000 2500000 3000000 Multiple alignments of whole genomes (MUM/MEM ‘anchored’ whole genome alignments) • Mugsy Angiuoli et al. (2011). Bioinformatics 27: 334 [PubMed]. • progressiveMauve Darling et al. (2010). PLoS ONE 5: e11147 • input formats: FASTA, Multi-FASTA, GenBank [PubMed]. • increasing RAM requirement with increasing number of genomes (limit ≈ 30-50 genomes); works with draft genomes • additive expandable [≈ O(n)], no nomenclature possible Fragmented all vs. all genome alignment • Gegenees Agren et al. (2012). PLoS ONE 7: e34498 [PubMed]. • input formats: FASTA, Multi-FASTA,© by GenBank author • fast, phylogenomics (ASC = Average Similarity of the Core genome when done with threshold) and biomarker search • no limit by RAM for number of genomes •ESCMID no synteny, rearrangement Online and genome Lecture location Library analysis BUT major and minor unique genomic areas • in contrast to cgMLST/MLST+ not only based on (shared) core genome but on the whole genome; works with genome fragments instead of genes • additive expandable [≈ O(n)], no nomenclature possible High-throughput Unique Signatures vs. rationally guided design of discriminatory primer sets; multi-target PCR assays only usually only applicable for cultured isolates Whole genome multiple alignment • computational extrem intensive • heavy scaling penalty when working with more than a small number of genomes • e.g., KPATH Whole genome pair-wise alignment • computational intensive • primer design restricted to sequence that are already present in the server’s database • Insignia

Whole genome progressive alignment and subtraction of regions of similarity • TOPSI (difficult to install on cluster system; comes up with primer selection) • PanSeq (Novel Region Finder based on NUCmer, iterative) Laing et al. (2010). BMC Bioinform 11: 461 [PubMed]. • ssGeneFinder (O104:H4; very simple© algorithm; by genome author annotation of target genome[s] not required) ( ) Ho et al. (2011). JCM 49: 3714 [PubMed]. • Gegenees fragmented all vs. all genome alignment; fast, only unique signatures no primer selection, works best with at least one completed & annotated target genome WholeESCMID genome alignment-free Online (0104:H4 Lecture) Agren et al.Library (2012). PLoS ONE 7: e34498 [PubMed]. • Find_differential_Primers tool very slow, very sensitive, includes primer selection; genome annotation of target genomes NOT required; simultaneous design of primers that discriminate between several subclasses of target genomes Pritchard et al. (2012). PLoS ONE 7: e34498 [PubMed]. ssGeneFinder

Iteration with decreasing BLAST word sizes Input format FASTA (.fas)

3 targets selected after further similarity search NCBI BLASTn search (nr/nt, wgs and human G+T databases; exclude strain sequences; targets with E- value < 1.0 rejected) © by author NCBI Primer-BLAST search Download and unzip ssGeneFinder. in silico in vitro ESCMID Online Lecture Library

N.B.: for Win 64bit systems download and install also BLAST+ Win 64bit. Ho et al. (2011). JCM 49: 3714 [PubMed]. PanSeq

© by author ESCMID Online Lecture Library

http://lfz.corefacility.ca/panseq/analyses/ Novel Region Finder: accepts FASTA or multi-FASTA uploaded files. More than one genome may be in a single file, but for all genomes consisting of more than one contig, a distinct identifier must be present in the fasta header of each contig belonging to the same genome. Gegenees Input formats FASTA (.fna) GenBank (.gbk)

© by author

BatchPrimer3 / Primer-BLAST ESCMIDSplitsTree* Online Lecture Library Pangenome MEGA** RAxML*** *Huson et al. (2006). Mol. Biol. 23: 254 [PubMed]; **Tamura et al. (2013). Mol. Biol. Evol. 30: 2725 [PubMed]; ***Stamatakis et al. (2014). Bioinformatics [PubMed]. Agren et al. (2012). PLoS ONE 7: e34498 [PubMed]. Primer Design Primer3 • designs primers on a variety of parameters (including position restricted primers) • does not check target specificity • FASTA input format • Many pipelines and web services use Primer3 (incl. BatchPrimer3 & Primer-BLAST) • command-line interface (requires Perl installed) Untergasser et al. (2013). NAR 40: e115 [PubMed] BatchPrimer3 • DNA sequences can be batch read (copy/paste or Multi-FASTA file upload) • Tab-delimited or Excel-formatted primer output facilitates subsequent primer ordering process • use to control by Sanger sequencing cgMLST/MLST+ allele differences from pure culture isolates or if target specificity has been controlled before You et al. (2008). BMC Bioinformatics 9: 253 [PubMed] Primer-BLAST © by author • module for generating candidate primers with Primer3 options and module for checking target specificity (chose NCBI databases and/or use a regular entrez query to limit the database search for primer specificity) • FASTAESCMID or NCBI accession Online input format (performs Lecture fast check onLibrary any FASTA input to determine if it is an exact match to a RefSeq sequence) • no batch read capability • use if target specificity testing required (use RefSeq accession or GI if possible) Ye et al. (2012). BMC Bioinformatics 13: 134 [PubMed] Artemis – DNA Sequence Viewer and Annotation Tool

Input file formats • EMBL • GenBank • GFF3 • BAM • FASTA • output blastall • NOT ACE

Comes with • ACT: the Artemis Comparison Tool

© by author

• DNAPlotter: circular and linear ESCMID Online Lecture interactiveLibrary genome visualization

Carver et al. (2012). Bioinformatics 28: 464 [PubMed]. BRIG – BLAST Ring Image Generator

• displays BLAST identity between a reference genome in the center against other query sequences as a set of concentric rings • matches are calculated from the perspective from the reference genome. Hence, regions that are absent from the reference but present in one or more of the query sequences will not be shown • adequately utilize draft query © by author genome sequence data • inputs formats: GenBank, EMBL, FASTA, ACE and ESCMID Online Lecture LibrarySAM

Alikhan et al. (2011). BMC Genomics 12: 402 [PubMed]. Tools for Surveillance & Phylogeny

Nomenclature

No No No

Yes © by author No Yes No

Wyres et al. (2014). WGS analysis and interpretation in clinical and public health microbiology laboratories: what are the RequirementsESCMID and how do existing Online tools compare? Pathogens Lecture 3: 437 [doi:10.3390/pathogens3020437 Library]. ‘n+1’ problem

© by author ESCMID Online Lecture Library

n, number of isolates in database Surveillance & Phylogeny ‘Molecular Typing Esperanto’ by Standardized Genome Comparison

- Difficult to interpret with draft genomes Multiple Genome Alignment - Computational intensive (≧ O(n2), limit ≈ 30-50 genomes) (e.g., progressive Mauve) - Not additive expandable, no nomenclature possible

k-mer + Works on read, draft & complete genome level, quickly identifies without alignment closest matching genome - Whole genome reduced to a single number of similarity ANI with alignment - Additively expandable [≈ O(n)], but poor mapping to (Average Nucleotide Identity) nomenclature possible

Genome-wide + Works well for monomorphic and ‘ad hoc’ analysis mapping & - Problematic© bywith rearrangement author / recombination events SNP calling - Not additive expandable (at least if not always mapped to same reference) for ESCMID+ Scalable,Online working onLecture single gene to whole Library genome levels Genome-wide Outbreak Investigation gene by gene + Both recombination & point mutation accommodated a single event & allele typing + Additively expandable [≈ O(1)] & nomenclature possible Global Nomenclature (cgMLST or MLST+) SNP, single nucleotide polymorphism; cgMLST, core genome multi locus sequence typing; n, number of isolates in database. The Multi‘Next Locus Generation Sequence MLST Typing’: MLST+

N. meningitidis MLST N. meningitidis MLST+ 7 genes N. meningitidis FAM18 1241 genes 2,194,961 bp 0.1 % 54.5 % of FAM18 genome of FAM18 genome

© by author

Maiden et al. (1998). PNAS 95: 3140 [PubMed] ESCMIDMLST Online Lecture MLST+/cgMLSTLibrary . 5-7 housekeeping genes . hundreds/thousands of ‘core genome’ genes . Sequence type (ST) and Clonal complex (CC) . scalable, portable and understandable . public nomenclature . public, additive expandable [≈ O(1)] nomenclature Used mainly for population genetics & Higher discrimination power also suited for outbreak evolutionary studies investigation One Disruptive Technology Fits it All - Genomic Surveillance

Pure bacterial culture

LIMS

DNA

Rapid NGS

Phenotype and De novo or reference epidemiologic assisted assembly information (incorporated in EBI ENA/ SeqSphere+) NCBI SRA

MLST+/ cgMLST © by author nomen- clature SeqSphere+ / BIGSdb Standardized + hierarchical microbial MLST/rMLST cgMLST/MLST Accessory SNP Antibiotic Toxins & ESCMID Online targetsLecture resistance Library targets pathogenicity targets typing Surveillance Evolutionary Resistome / & outbreak analysis Toxome investigation Standardized Hierarchical Microbial Typing

Outbreak / Lineage specific

SNPs/ e.g., Köser et al (2012). NEJM 366: 2267 [PubMed] Alleles accessory targets Standardized Species specific STEC: Mellmann et al. (2011). PLoS One. 6: e22751 [PubMed]

N. meng.: Vogel et al. (2012). JCM 50: 1889 [PubMed] + MLST N. meng.: Jolley et al. (2012). JCM. 50: 3046 [PubMed] core genome MLST (cgMLST) C. jejuni: Cody et al. (2013). JCM. 51: 2526 [PubMed]

Listeria: CIM 2014 [PubMed]; S. aureus: JCM 2014 [PubMed]

Pan-bacterial specific (also suited for speciation) Discriminatory Power © by author rMLST Jolley et al. (2012). Microbiology 158: 1005 [PubMed]

Species specific SNPs e.g., Van Ert et al. (2007). JCM 45: 47 [PubMed] ESCMIDconfirmatory/canonical Online Lecture Library Maiden et al. (1998). PNAS 95: 3140 [PubMed] MLST also needed for backward compatibility Hierarchical microbial typing approach. From bottom to top with increasing discriminatory power. MLST, multi locus sequence typing; rMLST, ribosomal MLST; SNP, single nucleotide polymorphism; cgMLST, core genome MLST.

For hierarchical microbial typing see also: Maiden et al. (2013). Nature Rev. Microbiol. 11: 728 [PubMed]. Automated Bacterial Genome Annotation Web submission systems RAST - Rapid Annotation using Subsystem Technology BaSYS - Bacterial Annotation System xBASE Bacterial Genome Annotation Service IGS Annotation Engine JGI/DOE IMG annotation service MicroScope GenoScope annotation service PGAAP - NCBI Prokaryotic Genome Automatic Annotation Pipeline (during submission process) MAKER Web Annotation Service

Standalone systems BG7 bacterial genome annotation system AGeS - Annotation/Analysis of Genome Sequences MAKER Prokka - prokaryotic annotation PAGIT - Post Assembly Genome© Improvement by author Toolkit

Curation systems Manatee (web interface + database backend) ArtemisESCMID (local app, can be combined Online with Chado SQL Lecture backend) Library WebApollo (web interface + database backend)

Modified from: Seemann (30 March 2013). The Genome Factory blog; Review: Richardson et al. (2013). Brief. Bioinform 14: 1 [Publisher]. Commercial All-in-one Bioinformatics Solutions

http://www.dnastar.com/ • all support NGS assembly (mapping & de novo) and SNP calling • many more analysis modules ….

• surveillance and typing database with expanding nomenclature, pathogen discovery & PCR signatures NOT the software http://www.clcbio.com/ © by authorfocuses ESCMID Online Lecture Library

http://www.geneious.com/ Evaluation of Typing Techniques

© by author

O’SullivanESCMID et al. (2013). AuSeTTS . BMCOnline Bioinformatics 14: 148Lecture [PubMed]. Library

Carrico et al. (2006). JCM 44: 2524 [PubMed]. http://darwin.phyloviz.net/ComparingPartitions/index.php?link=Tool Outlook

From genotype Standardize to phenotype WGS Typing Early warning Plain language (resistome, system & GIS report with pathogenome© by & author toxome analysis) MLST+ ESCMID Online Lecture Library

http://patho-ngen-trace.eu/ Genomic Tools for Species Identification

(i) SpeciesFinder, which is based on the complete 16S rRNA gene; (ii) Reads2Type that searches for species-specific 50-mers in either the 16S rRNA gene or the gyrB gene (for the Enterobac-teraceae family); (iii) the ribosomal multilocus sequence typing (rMLST) method that samples up © by author to 53 ribosomal genes; (iv) TaxonomyFinder, which is based on species-specific functional protein domain profiles; and finally (v) KmerFinder, which ESCMID Online Lecture Libraryexamines the number of occuring k-mers (substrings of k nucleotides in DNA sequence data).

Larsen et al (2014). JCM 52: 1182 [PubMed] Antibiotic Resistance Genes Databases

ARDB: http://ardb.cbcb.umd.edu/ (multi-FASTA [≤5 kb] BLAST; AR ontology) Liu et al (2008). NAR 37: D443 [PubMed] ß-Lactamase Classification and amino acid sequences for TEM, SHV and OXA extended-spectrum and Inhibitor resistant enzyme, http://www.lahey.org/Studies/; Bush et al (2010). AAC 54: 969 [PubMed] Nomenclature for Tetracycline genes, http://faculty.washington.edu/marilynr/; Roberts et al (1999). AAC 43: 2823 [PubMed]

© by author ESCMID Online Lecture Library

ResFinder: http://cge.cbs.dtu.dk/services/ResFinder/ (raw reads, multi-FASTA BLAST; does not detect multi-copy AR genes) Zankari et al (2012). J Antimicrob Chemother 67: 2640 [PubMed] Antibiotic Resistance Genes Databases

CARD: http://arpcard.mcmaster.ca (Antibiotic Resistance Ontology, multi-FASTA BLAST & HMM) McArthur et al (2013). AAC 57: 3348 [PubMed] © by author ESCMID Online Lecture Library

ARG-ANNOT: local BLAST program in Bio-Edit software (multi-FASTA) Gupta et al (2014). AAC 58: 212 [PubMed] From WGS Geno- to Phenotype - MRSA !"#$%&"' G2"C/A+C4 %(() =5C4F#*&+C4

>?@$%ABC4 *2", '/*=

EABC+$-:A+0 520! 5206 -&4F/A*'$%ABC4 *'+, G'%:C+C55C4 '/*, *-/, %-% *-/1 ##+789#":;<

'%# 3#4, D/&%:/A*&+C4 0#%, +./ DB.A5C#%C3'$%ABC4$, =#%#5#-'© by author ?#4+A*&+C4

@C4'JA5CF ESCMID OnlineH'4%#*C+C4 Lecture Library EAI/#*&+C4 Species identification, spa type, antibiotic susceptibility profile and presence of toxins can be rapidly determined by query of the WGS data. Colored squares represent genes potentially present on the chromosome and/or plasmids. The presence of genes in our cluster isolates are indicated by color: antibiotic resistance genes are shown in red, green for the toxin gene, blue for the catalase-encoding katA, yellow for the spa gene and gray indicates genes that were queried but not found. Leopold et al. (2014). JCM 52: ahead of print [PubMed]. Advanced Antibiotic Resistance NGS Demonstrations

© by author ESCMID Online Lecture Library

Gordon et al (2014). JCM 52: 1182 [PubMed] Virulence Genes Databases

VFDB: http://www.mgc.ac.cn/VFs/ (BLAST) Chen et al (2005). NAR 33: D325 [PubMed]

© by author MvirDB: http://mvirdb.llnl.gov (BLAST) Zhou et al (2007). NAR 35: D391 [PubMed] ESCMID Online Lecture Library

MP3: http://metagenomics.iiserb.ac.in/mp3/ (SVM & HMM) Gupta et al (2014). PLoS ONE 9: e93907 [PubMed] Conclusions • NGS/WGS is going to be the method of choice for genotyping as it marks the endpoint in discriminatory power. However, global standardization is needed. NGS are the current bottleneck. • NGS data are suited for reliable species identification. However, that seems to be not (yet) economical feasible. • NGS data are perfect for strain attribution and development of strain specific screening tests. • NGS/WGS is in principal suited for antimicrobial susceptibility testing. For certain species like S. aureus and M. tuberculosis ©it will by be author used soon; for other species the interpreting middleware is currently missing. • NGS/WGSESCMID is suitable Online for virulence Lecture analysis Library. However, as mechanisms of pathogenicity are still poorly understood its diagnostic value is currently limited. However, GWAS might be partial remedy for this limitation. Overview of Software and Databases for Rapid High-throughput Sequence Analysis in Clinical & Public Health Microbiology

© by author

ESCMID OnlineDag HarmsenLecture Library University of Münster, Germany