Whole Genome Sequencing
Introduction to the Interpretation of Whole Genome Sequence Data in Food Safety

Introduction

Whole genome sequencing (WGS) for bacterial foodborne pathogen characterization is here to stay. Developments in WGS platforms have made it possible to sequence the entire genome of a bacterium at prices comparable to common molecular subtyping methods. These data can provide near-perfect discrimination of bacterial isolates. While the common molecular subtyping methods interrogate only small parts of the genome (e.g., restriction sites for pulsed-field gel electrophoresis [PFGE], or the sequence of ~7 loci of ~500 base pairs [bp] each for multilocus sequence typing [MLST]), WGS approaches make it possible to interrogate more than 99% of the genome, which translates to approximately 2.8 million and 4.8 million base pairs in Listeria monocytogenes and Salmonella enterica, respectively.

What’s more, as sequencing technologies and data analytics continue to mature, WGS will provide results at lower cost and in less time than traditional subtyping. U.S. governmental agencies (CDC, FDA, and USDA-FSIS) are beginning to build large, shared databases (e.g., GenomeTrakr) to store WGS data generated from foodborne pathogen isolates collected from routine surveillance or human disease cases, to compare records between isolates, and to use these comparisons to inform regulatory action.

While the field is beginning to coalesce around common WGS sequencing platforms and basic data analysis approaches (1), the application of those platforms and approaches to bacterial food safety has not yet matured. The purpose of this article is to:
1. Introduce whole genome sequencing platforms and analytical approaches to practitioners who have not yet encountered these data in their work;
2. Describe dimensions of WGS analysis for which there is still significant ambiguity in the scientific approaches; and
3. Summarize some outstanding challenges to the application of these methods to bacterial foodborne pathogen subtyping.

We will introduce whole genome sequencing technologies and common analyses, then discuss the application of these methods to regulatory action and different foodborne pathogens.

Sequencing technologies

Sequencing technologies used for WGS can be subdivided into two categories: (i) short-read technologies, which produce sequence reads up to 500 bp (e.g., Illumina, IonTorrent), and (ii) long-read technologies, which produce reads longer than 1000 bp, often with lengths over 70,000 bp (e.g., Pacific Biosciences, Oxford Nanopore). At the time of writing this document (May 2016), the two sequencing platforms most commonly used in WGS are Illumina and Pacific Biosciences. Illumina sequencers (e.g., MiSeq, NextSeq, HiSeq) are popular because of their speed, throughput, and the high accuracy of the data they produce, which allow bacterial genomes to be sequenced at low cost (between $50 and $100 per bacterial genome). The per-genome cost of Pacific Biosciences sequencing is considerably higher (>$800), making it cost prohibitive for WGS-based typing.

Short-read technologies are best suited for high-throughput applications due to their high accuracy and low cost per base sequenced. They are the main technologies used for routine WGS by government agencies and are used for whole genome analogs to nucleotide-based subtyping schemes, such as whole-genome Single Nucleotide Polymorphism (SNP) or Multi Locus Sequence Typing (MLST) analysis. Gaps in genome sequencing (see ‘data analysis – genome assembly’) prevent interrogation of genome-scale events, such as genome rearrangements or differences in PFGE patterns.

Long-read sequencing technologies can complement short-read technologies, at a higher cost and lower throughput. Their main advantage is that longer reads can often be assembled de novo into a complete genome, either alone or in combination with short-read data. In principle, long-read data could be used to directly calculate PFGE pattern profiles for comparison to existing databases.

Data analysis

The field of computer science called bioinformatics is used to analyze WGS data. This involves algorithm, pipeline, and software development, as well as the analysis, transfer, and storage/database development of genomics data.

A typical WGS workflow contains the following steps: (i) quality control and data grooming, (ii) genome assembly and/or variant calling, and (iii) post-assembly analysis. Current academic reviews, such as (1), give more detail on these steps than what follows below.

Data quality control and data grooming

Quality control of WGS data involves multiple aspects; some of the most important are read quality (e.g., how many sites of a 300 bp read fall below a specific quality threshold), fold coverage or sequence depth, and putative contaminants. Read quality is usually dealt with by data grooming, i.e., removal of low-quality regions of the individual reads with specialized bioinformatics tools. The second aspect involves fold coverage. WGS data typically consist of hundreds of thousands of short sequence reads representing fragments of the genome. Fold coverage or sequence depth refers to the median or average number of reads that cover each nucleotide in a genome. Coverage that is too low will affect the accuracy of downstream analyses, as will coverage that is too high. A commonly overlooked aspect of data quality is contamination with a non-target organism, which can be a laboratory-introduced contaminant or an organism that is co-isolated. This may not pose problems for some downstream analyses, such as reference-based assembly for outbreak detection, but may be problematic for gene detection-based applications such as WGS-based screening of antibiotic-resistance genes.

Genome assembly and/or variant calling

The main objective of WGS analysis is the identification of genomic differences between bacterial strains. Since the raw data of WGS technology are bacterial genome sequence fragments of various sizes (from hundreds of bp to more than 10,000 bp), a fundamental question is how to use those fragments to determine genomic differences, referred to as genomic variants. Bioinformatics pipelines are tools that identify these variants, and are generally referred to as ‘variant callers’. Genomic variants include (i) single nucleotide polymorphisms (SNPs) or single nucleotide variants (SNVs), which indicate a single nucleotide substitution difference between genomes, (ii) insertions and deletions of one or more nucleotides (commonly referred to as indels), and (iii) genomic rearrangements.

One approach to detect variants is to first assemble genomes de novo and then use whole genome alignment-based methods to compare two or more strains. De novo assembly of short-read sequences generally yields so-called draft genome sequences, i.e., genome sequences that still contain gaps. These gaps are generally caused by the presence of repetitive sequences (e.g., rRNA sequence clusters) in the genome. Recent bioinformatic advances in the assembly strategies for long reads from Pacific Biosciences sequencing technologies now make it possible to produce de novo closed genome sequences (i.e., sequences without gaps).

A second approach is the reference mapping approach. In this approach, reads are aligned (mapped) against a (preferably closed) reference genome. After mapping, variants are called from the consensus of the mapped reads. Reference mapping-based approaches are very popular because they are computationally inexpensive and fast compared to de novo assembly. A limitation of this method is the reliance on a reference genome. Especially if a closely related reference genome is absent, mapping against a distantly related genome may lead to problems with variant calling, and unique regions not found in the reference sequence will not be included in downstream analysis.

In addition to de novo assembly and reference mapping-based methods, reference-free de novo variant calling methods exist. These methods do not require reference sequences and are faster than de novo assembly-based methods.

Post-assembly analysis

One underappreciated subtlety in WGS analysis is how to interpret the genetic variants between strains as biologically relevant measures of strain difference. This problem has two facets: (i) determining which differences matter and how to count them, and (ii) visualizing the differences in a way that can guide action.

Most WGS analyses use SNPs as the primary measure of genetic distance, although other methods include whole-genome multilocus sequence typing (MLST) [...]

[...] inferred using statistical methods such as the bootstrap for parsimony, maximum likelihood, and distance methods, and posterior probabilities for Bayesian methods.

The next challenge is to visualize the differences in a way that can guide action, such as identifying a plausible outbreak or source of contamination. When distances are counted, a reasonable visualization approach is to plot, or report, the differences between groups of strains. For example, one can plot the number of pairwise SNP differences between bacterial isolates within or between outbreaks (2), or individual food establishments (3). Another common choice is to build a phylogenetic tree that displays the calculated evolutionary model of the isolates as a series of splits from a root state. In these, clusters of isolates near the ‘tips’ of the tree are more closely related to each other than isolates elsewhere in the tree. As a hybrid approach, one can combine SNP counting and phylogenetic tree production to find clusters of isolates and then report the differences