14th Annual Sequencing, Finishing, and Analysis in the Future Meeting 2019

May 21-23, 2019
La Fonda on the Plaza, Santa Fe, New Mexico

Table of Contents

2019 Agenda Overview
2019 SFAF Sponsors
May 21st Agenda
May 21st Speaker Presentations
Panel
Vendor Tech Talks
Poster Presentations with Meet & Greet Party
May 22nd Agenda
May 22nd Speaker Presentations
Genome Center Updates
Vendor Tech Talks
Happy Hours at Cowgirl Café (with map)
May 23rd Agenda
May 23rd Speaker Presentations
Happy Hour Close Out
Attendees
Map of Santa Fe, NM

THE 2019 “SEQUENCING, FINISHING, AND ANALYSIS IN THE FUTURE” ORGANIZING COMMITTEE

* Chris Detter, Ph.D., Chief Scientist and R&D Manager, MRIGlobal
* Johar Ali, Ph.D., Director Research, AA Ontario, Canada
* Patrick Chain, Bioinformatics Team Leader, LANL
* Michael FitzGerald, Microbial Special Projects Manager, Broad Institute
* Bob Fulton, M.S., Director of Development, Washington University
* Darren Grafham, Manager, Battelle UK Limited
* Alla Lapidus, Ph.D., Director, Centre for Algorithmic Biotechnology, SPbU, Russia
* Donna Muzny, M.Sc., Director of Operations, BCM
* Tootie Tatum, Black Hawk Genomics
* Kenny Yeh, Senior Science Advisor, MRIGlobal
* Shannon Johnson, Ph.D., Senior Research Scientist, LANL

Agenda Overview

Tuesday, May 21st

7:30 AM – 8:30 AM Breakfast (Sponsored by New England Biolabs)
8:30 AM – 8:45 AM Opening Remarks (Dr. Patrick Fitch, Associate Laboratory Director, LANL)
8:45 AM – 9:30 AM Keynote Day 1: Address by Dr. Rita Colwell
9:30 AM – 10:50 AM Oral Session - Part 1 (Chairs: Chris Detter & Tootie Tatum)
10:50 AM – 11:20 AM Break (Sponsored by DNAnexus)
11:20 AM – 12:20 PM Panel (Sponsored by MRIGlobal) (Chair: Kenny Yeh)
12:20 PM – 1:40 PM Lunch (Sponsored by Qiagen)
1:40 PM – 3:40 PM Oral Session - Part 2 (Chairs: Mike Fitzgerald & Alla Lapidus)
3:40 PM – 4:10 PM Break (Sponsored by BioID Genomics)
4:10 PM – 4:50 PM Oral Session - Part 3 (Chairs: Donna Muzny & Teressa Torres)
4:50 PM – 6:40 PM Tech Talks - Part 1 (Chairs: Chris Detter & Bob Fulton)
7:00 PM – 10:00 PM Meet & Greet Poster Session with Food & Drinks (Sponsored by Promega)
• 7:30 PM – 8:30 PM Poster session 1: even #s (PS-1)
• 8:30 PM – 9:30 PM Poster session 2: odd #s (PS-2)

Wednesday, May 22nd

7:30 AM – 8:30 AM Breakfast (Sponsored by Perkin Elmer & Swift)
8:30 AM – 8:45 AM Welcome Back & Opening Remarks
8:45 AM – 9:30 AM Keynote Day 2: Address by Dr. Deborah Hung
9:30 AM – 10:30 AM Oral Session - Part 4 (Chairs: Mike Fitzgerald & Tootie Tatum)
10:30 AM – 11:00 AM Break (Sponsored by PacBio)
11:00 AM – 12:20 PM Oral Session - Part 5 (Chairs: Bob Fulton & Donna Muzny)
12:20 PM – 1:40 PM Lunch (Sponsored by iGenomx)
1:40 PM – 3:00 PM Oral Session - Part 6 (Chairs: Kenny Yeh & Tootie Tatum)
3:00 PM – 3:40 PM Genome Center Updates
3:40 PM – 4:10 PM Break (Sponsored by Phase Genomics)
4:10 PM – 5:40 PM Tech Talks - Part 2 (Chairs: Alla Lapidus & Chris Detter)
5:40 PM – 6:00 PM Walk to Cowgirl Café
6:00 PM – 8:30 PM Cowgirl Café - Happy Hours Social - Food & Drinks Served (Sponsored by Illumina)

Thursday, May 23rd

7:30 AM – 8:30 AM Breakfast (Sponsored by Twist & Covaris)
8:30 AM – 8:45 AM Welcome & Opening Remarks
8:45 AM – 9:30 AM Keynote Day 3: Address by Dr. Chiu
9:30 AM – 10:30 AM Oral Session - Part 7 (Chairs: Kenny Yeh & Mike Fitzgerald)
10:30 AM – 11:00 AM Break (Sponsored by SeqWell)
11:00 AM – 12:20 PM Oral Session - Part 8 (Chairs: Donna Muzny & Tootie Tatum)
12:20 PM – 1:40 PM Lunch (Sponsored by JumpCode Genomics & Dovetail)
1:40 PM – 3:30 PM Oral Session - Part 9 (Chairs: Chris Detter & Bob Fulton)
3:30 PM – 3:40 PM Wrap-up
3:40 PM – 5:00 PM Closing Happy Hour – Break & Drinks (Sponsored by IDT)

The 2019 SFAF Organizing Committee

THANKS TO ALL OF OUR SPONSORS!


Tuesday, May 21st

7:30 AM – 8:30 AM Breakfast (Sponsored by New England Biolabs)
8:30 AM – 8:45 AM Welcome & Opening Remarks
8:45 AM – 9:30 AM Keynote 1: Dr. Rita Colwell, #13
• Transformative microbiome insights – A journey from basic research to routine applications
9:30 AM – 10:50 AM Oral Session Part 1: (Chairs: Chris Detter & Tootie Tatum)
• Microbiome exploration and discovery (Mouncey, #54)
• PanGIA - Development and optimization of an unbiased, metagenomics-based pathogen detection workflow for clinical and environmental matrices (Parker, #59)
• PanGIA Bioinformatics – A metagenomics analytical framework for routine biosurveillance (Russell, #66)
• Exploration of microbial diversity in ticks from Gabon by high throughput sequencing of 16S rRNA (Nkili, #03)
10:50 AM – 11:20 AM Break (Sponsored by DNAnexus)
11:20 AM – 12:20 PM Panel (Sponsored by MRIGlobal; Chair: Kenny Yeh)
• Climate impacts on ecosystems and infectious diseases: The Important Role of Sequencing (Yeh, #86)
12:20 PM – 1:40 PM Lunch (Sponsored by Qiagen)
1:40 PM – 3:40 PM Oral Session Part 2: (Chairs: Mike Fitzgerald & Alla Lapidus)
• Genomics, mimicry, speciation and hybridization (Grishin, #32)
• Gypsy moth genome provides insights into flight capability and virus-host interactions (Zhang, #89)
• Pangenome-wide association studies (PWAS) with frequented regions (Mumey, #55)
• Genomics of ancient specimens (Cong, #14)
• Human environmental exposome revealed by extensive longitudinal personal monitoring (Jiang, #39)
• Immuno-biotechnology and bioinformatics in Community Colleges (Smith, #75)
3:40 PM – 4:10 PM Break (Sponsored by BioID Genomics / Fry Lab)
4:10 PM – 4:50 PM Oral Session Part 3: (Chairs: Donna Muzny & Teressa Torres)
• Utility of the Ion S5™ and MiSeq FGx™ sequencing platforms to characterize challenging human remains (Elwick, #21)
• Introducing Morphoseq: high accuracy long reads from short read platforms (Darling, #15)
4:50 PM – 6:40 PM Tech Talks Part 1: (Chairs: Chris Detter & Bob Fulton)
• Scalable library prep for genotype imputation via low-pass sequencing (Mellor; SeqWell, #92)
• From squirrels to clinical genomes: The importance of recapturing long-range genomic information (Blanchette; Dovetail, #97)
• Sample to NGS library: Adaptive Focused Acoustics® (AFA®)-enhanced lysis and lysate homogenization for high-throughput nucleic acid extraction (Thomann; Covaris, #78)
• Twist - TBD
• plexWell: Highly multiplexed library prep (Sullivan; SeqWell, #93)
• Swift - TBD
• Tech Talk Question Panel
7:00 PM – 10:00 PM Meet & Greet Poster Session with Food & Drinks (Sponsored by Promega)
• 7:30 PM – 8:30 PM Poster session 1: even #s (PS-1)
• 8:30 PM – 9:30 PM Poster session 2: odd #s (PS-2)

Breakfast

Tuesday, May 21st, 7:30 AM – 8:30 AM, La Fonda Ballroom

Sponsored by New England Biolabs

Transformative Microbiome Insights – A Journey From Basic Research To Routine Applications

Keynote Speaker - Abstract ID: 13

Rita R. Colwell

Chairman and Founder, CosmosID

The emergence of NGS is revolutionizing the microbiological sciences and transforming medicine. Deep sequencing has revealed that virtually all environments, including the human body, are teeming with diverse microbial communities. While microbes have predominantly been studied in the context of pathogenicity, it is now evident that the human microbiota contributes biological functions essential to health. Conversely, disrupting microbiota-host homeostasis in healthy individuals can lead to dysbiosis and is associated with many diseases and pathologies.

As a consequence, research into the human microbiome has started to transform the healthcare landscape, providing novel approaches for diagnostics and therapeutics. Other areas benefitting from metagenomics include the biological safety of food and water. In agriculture, changes in soil and in microbiomes can now be evaluated in the context of crop yield, livestock health, and reduced dependency on herbicides, pesticides, and antibiotics.

However, translating microbiome research into routine applications depends on robust methods for data generation and interpretation. Metagenomic sequencing is uniquely vulnerable to the introduction of bias and contamination, creating a need for standardized workflows and controls.

In this presentation, the transformative application of microbiome analysis in diagnostics, drug development, public health, agriculture, and water safety will be discussed. Potential error modes and strategies to control the sequencing and analysis workflow will be surveyed.

Microbiome Exploration and Discovery

Oral - Abstract ID: 54

Nigel J. Mouncey, Emiley Eloe-Fadrosh, Nikos Kyrpides, Simon Roux, Natalia Ivanova

US Department of Energy Joint Genome Institute, 2800 Mitchell Drive, Walnut Creek CA 94598 USA

The cross-cutting nature of microbiome research in environmental sciences, health, agriculture, energy, and natural and built environments requires the development of new solutions and community coordination to tackle grand challenges that will accelerate basic discovery and lead to transformative advances. The exponential growth of microbiome data over the past few decades has ushered in a new era of biology, shifting the focus from descriptive observations and small-scale experimental paradigms to data-driven exploration and hypothesis generation, enabled by and relying on a rapidly growing suite of transformative data science strategies. The DOE JGI is a leader in microbiome data science through its user-centric scientific programs that have expanded the microbial tree-of-life, discovered numerous new viruses and host associations, explored microbe-plant interactions, and discovered new biochemical pathways that play important roles in nutrient cycling and microbial survival. I will discuss some of our recent scientific studies, as well as provide an overview of the new National Microbiome Data Collaborative.

PanGIA - Development and optimization of an unbiased, metagenomics-based pathogen detection workflow for clinical and environmental matrices

Oral - Abstract ID: 59

Kyle Parker

MRIGlobal

Rapid, specific, and sensitive identification of microbial pathogens is a key component of infectious disease diagnosis and biosurveillance. Classical culture methods allow for the identification of most bacteria and fungi, but have long turnaround times. Additionally, viruses require specialized culture methods. Molecular methods, such as quantitative polymerase chain reaction (qPCR), are extremely sensitive and specific, but require a separate assay design for each individual pathogen. Thus, qPCR assays have been validated for only a small set of all possible pathogens. Furthermore, novel strains with altered genome sequences may emerge that cause assays to fail. Metagenomic shotgun next-generation sequencing (NGS) holds the promise of specific identification and characterization of any pathogen (viruses, bacteria, fungi, and protozoa) in an unbiased fashion. Despite its great potential, NGS has not been widely adopted by clinical microbiology laboratories. This may be attributed to long labor-intensive protocols and the absence of standardized workflows. Here, we describe PanGIA (Pan-Genomics for Infectious Agents), a sample-to-answer workflow that includes simplified and standardized methods for sample processing, library preparation, shotgun sequencing and data analysis. For each component of the workflow, we performed a comprehensive survey and assessment of the current, commercially available technologies. The final PanGIA workflow includes total nucleic acid extraction from clinical (human whole blood) and environmental (forensic swabs) samples, host depletion, dual DNA and RNA library preparation, shotgun sequencing on an Illumina MiSeq, and sequencing data analysis via pangia, a user-friendly bioinformatics tool with a graphical user interface. The entire workflow from sample to answer can be completed within 24 hours and is compatible with bacteria and viruses. 
Here, we present data from the development and application of both the clinical and environmental S2S workflows for the detection of a variety of human pathogens. The workflow solutions presented here can be applied for a variety of environmental and clinical applications enabling the sensitive and specific detection of pathogens without the need for targeted assay development.

DISTRIBUTION STATEMENT A: APPROVED FOR PUBLIC RELEASE, DISTRIBUTION IS UNLIMITED.

PanGIA Bioinformatics – A Metagenomics Analytical Framework for Routine Biosurveillance

Oral - Abstract ID: 66

Joseph A. Russell1, Paul Li2, David Yarmosh1, Alan Shteyman1, Kyle Parker3, Hillary Wood3, Karen Davenport2, Patrick Chain2

1. MRIGlobal – 65 West Watkins Mill Road, Gaithersburg, MD, USA 20878; 2. Los Alamos National Laboratory, Los Alamos, NM, USA 87545; 3. MRIGlobal – 1470 Treeland Blvd. S.E., Palm Bay, FL, USA 32909

Metagenomics is emerging as an important tool in biosurveillance, public health, and clinical applications. However, ease-of-use for execution and data analysis remains a primary barrier-of-entry to the full adoption of metagenomics in applied health and forensics settings. Additionally, these venues often have more stringent requirements for portability, reporting, accuracy, and precision than more ecological research roles of the technology. Here, we present PanGIA (Pan-Genomics for Infectious Agents), a novel framework for hosting, processing, analyzing, and reporting read-mapping data from metagenomics samples. PanGIA was developed to address existing gaps that may preclude clinicians, medical technicians, forensics personnel, or other non-expert end-users from routinely leveraging metagenomics data for their needs, while adding analytical features that advance the information content beyond other tools used by microbial ecologists and expert bioinformaticians. PanGIA is primarily meant for the detection and discovery of pathogenic microorganisms from clinical and environmental metagenomics data; however, it can agnostically characterize entire microbial communities, depending on the comprehensiveness of user-defined reference databases. PanGIA provides two forms of confidence scoring: the first pairs coverage data with ‘uniqueness’ information mapped across each reference genome for a stand-alone determination of confidence for each query sequence at each taxonomy level, and the second compares a known ‘negative control’ profile with the profile of an unknown sample to determine significance in presence ‘above background’. Data can be quickly summarized within the graphical user interface by confidence score, normalized read abundance, reference genome linear coverage, depth-of-coverage, RPKM, and other metrics to detect specific organisms-of-interest.
A strain-level genome-coverage plot feature allows rapid visualization of ‘masked’ regions that were found in ‘background’ / negative control samples, as well as embedded information on taxa-specific marker genes and genomic loci related to various phenotypes of interest (e.g., pathogenicity, virulence, toxins, etc.). Taken together, these various outputs can be leveraged for associating an unprecedented level of confidence with any particular species-level call in a metagenomics analysis. Comparison testing of PanGIA against current leading k-mer, read-mapping, and marker-gene based taxonomy classifiers across various spiked targets in real-world sequencing datasets shows superior mean F1-score, sensitivity, and specificity. PanGIA can process a 5 million paired-end read dataset in under 1 hour on commodity computational hardware. PanGIA is available for download as a docker container with all necessary dependencies, and analysis is run locally without any need for an internet connection.
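Several of the summary metrics named in this abstract (RPKM, linear coverage, F1-score) have standard textbook definitions. The sketch below illustrates those definitions only; it is not PanGIA's code, the function names are hypothetical, and PanGIA's confidence scores are not reproduced here because the abstract does not give their formulas.

```python
def rpkm(mapped_reads, reference_length_bp, total_mapped_reads):
    """Reads per kilobase of reference per million mapped reads."""
    return mapped_reads * 1e9 / (reference_length_bp * total_mapped_reads)


def linear_coverage(covered_positions, reference_length_bp):
    """Fraction of reference positions covered by at least one read."""
    return covered_positions / reference_length_bp


def f1_score(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall for a set of taxon calls."""
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical numbers for illustration only (not results from the abstract):
# 500 reads hit a 5 Mb reference out of 5 million mapped reads.
example_rpkm = rpkm(500, 5_000_000, 5_000_000)
example_f1 = f1_score(true_pos=8, false_pos=2, false_neg=2)
```

Normalizing by both reference length and sequencing depth is what lets RPKM be compared across organisms with different genome sizes and across runs of different yield.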

Exploration of Microbial Diversity in Ticks from Gabon by High Throughput Sequencing of 16S rRNA

Oral and Poster - Abstract ID: 03

Lejarre Quentin1, Nkili Meyong Andrinaina Andy1, Rahola Nil2, Paupy Christophe2 and Ayala Diégo2

1. Centre International de Recherches Médicales de Franceville (CIRMF), 2. Institut de Recherche pour le Développement (IRD)

Background: Ticks are considered second only to mosquitoes worldwide in transmitting diseases to humans, primarily because they i) are present on almost every continent, although often geographically confined to wooded areas, ii) can infest a large variety of hosts such as mammals, birds and even reptiles and iii) can, for some species, infect their hosts with more than one pathogen. In Western countries, tick-borne illnesses such as Lyme disease, ehrlichiosis, anaplasmosis, rickettsioses and tick-borne encephalitis have been on the rise in the past 10 years. Moreover, in Sub-Saharan Africa, where forest density remains high, the lack of available data on ticks and the pathogens they transmit raises the question of possible new emerging pathogens, especially in Gabon, where 85% of the territory is covered by woodland and where a novel species of Rickettsia was discovered in 2014. This context calls for more studies on the microbiome of ticks and the potential pathogens they may carry.

Methods: 370 samples were collected from ticks living in 4 different Gabonese national parks and 2 private reserves. After milling the ticks and extracting DNA, PCR of the 16S rRNA gene targeted the hypervariable region V4. Amplicon sequencing was performed on an Illumina MiSeq platform with the Reagent Kit v3, 600 cycles (250 bp paired-end reads). Preliminary analyses were conducted on EDGE Bioinformatics v2.2 using the embedded QIIME module to compare the microbiomes of ticks by species and location, and also to detect possible pathogens at the family and genus level. Sequences constituting OTUs of interest were then aligned (USEARCH) against a custom database built from reference sequences of bacteria known to be pathogenic in order to determine the presence of pathogens at the species level.

Results: Overall, the beta-diversity analysis (using the abundance-weighted Jaccard distance) showed that samples from the same species of ticks tend to carry the same microbiome, predominantly the genera 1) Coxiella (60~80%), 2) Rickettsia (6.1~9%), 3) Mycobacterium (1~12.7%) and 4) Burkholderia (1.1~4%). Sex and sample location were considered as parameters of similarity but did not account for any pattern.

Pathogens identified included Francisella, Coxiella, Rickettsia and Mycobacterium. The pathogen search revealed that the major species present were Coxiella burnetii (78%), associated with A. tholloni for the first time, and Rickettsia (20%), detected almost exclusively in R. longus. Moreover, regardless of tick species, the percentage identity for all pathogens identified ranged from ~85% to 100%, indicating the potential presence of new species.

Perspectives: The next step will aim to confirm those preliminary results by performing PCR on samples where known pathogen species have been detected and WGS on samples with potential new pathogen species in order to obtain full genomes for a better characterization.
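The beta-diversity comparison in the Results relies on an abundance-weighted Jaccard distance between samples. As a minimal sketch, one common formulation (the Ruzicka form, one minus the sum of per-taxon minima over the sum of per-taxon maxima) is shown below. This is an assumption about the definition: the exact metric computed by the QIIME module used in the study may differ, and the sample counts here are invented for illustration.

```python
def weighted_jaccard_distance(sample_a, sample_b):
    """Abundance-weighted Jaccard (Ruzicka) distance between two samples.

    Each sample is a dict mapping taxon name -> abundance (count or fraction).
    Returns 0.0 for identical profiles and 1.0 for profiles sharing no taxa.
    """
    taxa = set(sample_a) | set(sample_b)
    shared = sum(min(sample_a.get(t, 0), sample_b.get(t, 0)) for t in taxa)
    total = sum(max(sample_a.get(t, 0), sample_b.get(t, 0)) for t in taxa)
    if total == 0:
        return 0.0  # two empty samples are treated as identical
    return 1.0 - shared / total


# Hypothetical genus-level read counts for two tick samples.
tick_1 = {"Coxiella": 70, "Rickettsia": 8}
tick_2 = {"Coxiella": 60, "Rickettsia": 9, "Burkholderia": 3}
distance = weighted_jaccard_distance(tick_1, tick_2)
```

Computing this distance for every pair of samples gives the matrix on which clustering by tick species, sex, or location can then be assessed.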

Break

Tuesday, May 21st, 10:50 AM – 11:20 AM, La Fonda Ballroom

Sponsored by DNAnexus

Panel Session (Sponsored by MRIGlobal)

Tuesday, May 21st, 11:20 AM – 12:20 PM, La Fonda Ballroom

Climate impacts on ecosystems and infectious diseases: The Important Role of Sequencing

Panel - Abstract ID: 86

Kenneth Yeh1, Jeanne Fair2, Corina Monagin3, Richard Winegar1, Jacqueline Fletcher4.

1. MRIGlobal, Gaithersburg, MD, 2. Los Alamos National Laboratory, Los Alamos NM, 3. One Health Institute, University of California, Davis, CA, 4. National Institute for Microbial Forensics & Food and Agricultural Biosecurity, Oklahoma State University

Changes in our climate continue to impact the planet’s ecosystems, including the interface of infectious disease agents and their hosts. It is generally understood that changes in the environment due to natural disasters and man-made accidents raise risk factors that indirectly facilitate infectious disease outbreaks. Changes in habitat, displaced populations, and environmental stresses affect the survival of species, and these stresses are amplified over time as the climate changes. The recurrence and spread of vector-borne disease (mosquito, tick) to new geographic locations is also influenced by climatic changes. Plant systems, which provide food for animals and humans, are also impacted by these changes, especially where populations rely primarily on plant-based diets.

Recent evidence shows that climate change is already affecting populations and infectious disease dynamics of wildlife, plants, and other living things, altering host and disease vector life cycles, shifting vector and host ranges and migratory patterns, and changing disease spread and its impacts. Molecular detection assays for viruses and parasites, as well as metagenomic sequencing to understand the full complement of microbes found within the microbiomes of hosts and vectors can uncover the mechanistic relationship between climate variability and pathogen transmission. Sequencing can now be applied to detect signatures of infectious pathogens’ movement northward into more temperate regions.

While the opportunity to detect and inventory sequences across species and strains is valuable, it should also be balanced with practical active surveillance needs and investments to better understand environmental background. The availability and broad application of genomic and next-generation sequencing coupled with an effective multi-sectoral One Health engagement is an important approach to address these biosecurity issues.

For this discussion, our moderator will lead a panel including:

• Our objectives are to raise awareness for the impact of climate change and discuss examples of how sequencing technology can help detect changes at the genomic level that may be relevant to climate change.

• Review and evaluate the challenges for implementing sequencing technology.

• Summarize the discussion, including audience questions and answers, and compose a perspectives article for peer-reviewed publication.

References:

• Han, Barbara A., John Paul Schmidt, Sarah E. Bowden, and John M. Drake. 2015. “Rodent Reservoirs of Future Zoonotic Diseases.” Proceedings of the National Academy of Sciences 112 (22): 7039–44. https://doi.org/10.1073/pnas.1501598112.

• Kouadio IK, Syed Junid SMA-J, Kamigaki T, Hammad K, Oshitani H. Infectious diseases following natural disasters: Prevention and control measures. Expert Review of Anti-Infective Therapy. 2012 Jan;10(1):95-104. https://doi.org/10.1586/eri.11.155

• Fletcher J, Franz D, LeClerc JE. Healthy plants: necessary for a balanced ‘One Health’ concept. Veterinaria Italiana 2009; 45: 79-95.

• Mansfield KL, Jizhou L, Phipps LP and Johnson N (2017) Emerging Tick-Borne Viruses in the Twenty-First Century. Front. Cell. Infect. Microbiol. 7:298. doi: 10.3389/fcimb.2017.00298

Lunch

Tuesday, May 21st, 12:20 PM – 1:40 PM, La Fonda Ballroom

Sponsored by Qiagen

Genomics, mimicry, speciation and hybridization

Oral - Abstract ID: 32

Wenlin Li, Qian Cong, Jing Zhang, Jinhui Shen, and Nick V. Grishin

University of Texas Southwestern Medical Center and Howard Hughes Medical Institute

For centuries, biologists have used phenotypes to infer evolution. For decades, a handful of gene markers have given us a glimpse of the genotype to combine with phenotypic traits. Today, we can sequence entire genomes from hundreds of species and gain yet closer scrutiny. To illustrate the power of genomics, we overview the genomic landscape of an entire family of animals using skipper butterflies (Hesperiidae) as an example. Genomic analyses of more than 1000 representative species of skippers reveal rampant inconsistencies between their current classification and a genome-based phylogeny. We use a dated genomic tree to overhaul their taxonomy, and to detect convergence in wing patterns that fooled researchers for decades (and birds for millennia). We find that many skippers with similar appearance are distantly related and several skippers with distinct morphology are close relatives. Their wing patterns are frequently convergent. This likely mimetic convergence is diversified, resulting in 5 distinct parallel patterns. Each of the 5 patterns occurs within at least two genera as well as in more distant relatives that diverged more than 20 Mya. These conclusions are strongly supported by different genomic regions and are consistent with some morphological traits. Taking the next step, we sequence an additional 1000 butterfly species from other families and investigate patterns of speciation and hybridization. We find that some groups experience rapid and recent speciation, while bursts of speciation are not found in the past. Within each group, the constituent species occasionally hybridize, exchanging genes. Based on these observations, we propose a general evolutionary model in which, in order to survive, a phylogenetic lineage sprawling over the landscape needs to undergo speciation. The resulting species hybridize, exchanging adaptive traits. The species that gains the largest number of such traits takes over, while the others die out.
A similar scenario is likely in the genus Homo, in which humans hybridized with Neanderthals and Denisovans, absorbing advantages that poised humans for success.

Gypsy moth genome provides insights into flight capability and virus-host interactions

Oral - Abstract ID: 89

Jing Zhang, Qian Cong, Emily A. Rex, Winnie Hallwachs, Daniel H. Janzen, Nick V. Grishin, and Don B. Gammon

University of Texas Southwestern Medical Center and Howard Hughes Medical Institute

Since its accidental introduction to Massachusetts in the late 1800s, the European gypsy moth (EGM; Lymantria dispar dispar) has become a major defoliator in North American forests. However, in part because females are flightless, the spread of the EGM across the United States and Canada has been relatively slow over the past 150 years. In contrast, females of the Asian gypsy moth (AGM; Lymantria dispar asiatica) subspecies have fully developed wings and can fly, thereby posing a serious economic threat if populations are established in North America. To explore the genetic determinants of these phenotypic differences, we sequenced and annotated a draft genome of L. dispar and used it to identify genetic variation between EGM and AGM populations. The 865-Mb gypsy moth genome is the largest Lepidoptera genome sequenced to date and encodes ∼13,300 proteins. In addition to mate pair sequencing, we used Hi-C to improve the assembly. Analyses of EGM and AGM samples revealed divergence between these populations in genes enriched for several gene ontology categories related to muscle adaptation, chemosensory communication, detoxification of food plant foliage, and immunity. These genetic differences likely contribute to variations in flight ability, chemical sensing, and pathogen interactions among EGM and AGM populations. Finally, we use our new genomic and transcriptomic tools to provide insights into genome-wide gene-expression changes of the gypsy moth after viral infection. Characterizing the immunological response of gypsy moths to virus infection may aid in the improvement of virus-based bioinsecticides currently used to control larval populations.
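The abstract does not say how gene ontology enrichment among the divergent genes was tested. A commonly used choice is a one-sided hypergeometric test, sketched below as an assumption rather than as the authors' method. The function name and all counts other than the approximately 13,300 proteins stated in the abstract are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, hits):
    """One-sided enrichment p-value P(X >= hits).

    population: total number of genes considered
    annotated:  genes in the population carrying the GO term
    sample:     number of divergent genes drawn from the population
    hits:       divergent genes that carry the GO term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, x) * comb(population - annotated, sample - x)
        for x in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical example: of ~13,300 genes, 150 carry a muscle-related term;
# 400 genes diverge between populations and 15 of those carry the term.
p_value = hypergeom_enrichment_p(13_300, 150, 400, 15)
```

In practice such a test is run once per GO category, so the resulting p-values would also need a multiple-testing correction before any category is called enriched.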

Pangenome-wide association studies (PWAS) with frequented regions

Oral - Abstract ID: 55

Buwani Manuweera (MSU), Indika Kahanda (MSU), Thiruvarangan Ramaraj (NCGR), Joann Mudge (NCGR), Alan Cleary (NCGR), and Brendan Mumey (MSU)

A genome-wide association study (GWAS) is the study of genetic variation in different individuals to determine if a specific variation is associated with a phenotypic trait. It is important in many disciplines, including identifying genetic risk factors for common, complex diseases, and in agriculture to identify genes underlying important traits. GWAS is limited, though, in that the types of variations typically studied are single nucleotide polymorphisms (SNPs) identified relative to a single reference genome. These limitations lead to bias and preclude GWAS from studies across related species. The advent of next-generation sequencing has brought an exponential growth in DNA sequence data. This has led to the more comprehensive pangenomics approach, where the entire sequence content and variation of a population are succinctly represented independent of a reference. In prior work, we developed a method for identifying genomic regions that characterize complex variations within pangenomic data and showed that these regions provide a more general way to study genetic variation than existing approaches. This work describes our initial results in developing a new branch of genomic analysis based on these regions, called pangenome-wide association studies (PWAS), that generalizes GWAS to pangenomic datasets both within and across species, making use of machine learning techniques. We make use of recently developed algorithms for fast compressed de Bruijn graph construction and for identifying frequented regions in these graphs that can be used as machine-learning features to identify pangenomic regions, overlaid with gene annotations, that relate to complex phenotypic traits. Initial results with a 100-genome yeast pangenomic data set indicate that frequented region features provide better machine-learning regression models than SNPs for predicting phenotypic traits.
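To make the idea of reference-free graph features concrete, the toy sketch below builds an uncompressed k-mer de Bruijn graph over several genomes, records which genomes traverse each node, and turns the nodes present in some but not all genomes into a presence/absence feature matrix. It is a simplified stand-in under stated assumptions: it is not the authors' compressed de Bruijn graph construction or their frequented-region algorithm, and every name and sequence in it is hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_pangenome_graph(genomes, k):
    """Return (edges, support) for a k-mer de Bruijn graph.

    edges:   k-mer -> set of k-mers that directly follow it in some genome
    support: k-mer -> set of genome names whose path visits that k-mer
    """
    edges = defaultdict(set)
    support = defaultdict(set)
    for name, seq in genomes.items():
        path = kmers(seq, k)
        for node in path:
            support[node].add(name)
        for a, b in zip(path, path[1:]):
            edges[a].add(b)
    return edges, support


def variable_node_features(genomes, support, min_support=2):
    """Presence/absence features for nodes seen in >= min_support genomes
    but not in all of them (nodes shared by every genome carry no signal)."""
    nodes = sorted(
        node for node, names in support.items()
        if min_support <= len(names) < len(genomes)
    )
    matrix = {
        name: [1 if name in support[node] else 0 for node in nodes]
        for name in genomes
    }
    return nodes, matrix


# Toy genomes: g2 differs from g1 and g3 by a single substitution.
toy = {"g1": "ACGTAC", "g2": "ACGTTC", "g3": "ACGTAC"}
edges, support = build_pangenome_graph(toy, k=3)
nodes, matrix = variable_node_features(toy, support)
```

Each row of `matrix` could then be passed, together with a measured phenotype, to any regression model; the abstract's frequented regions would replace the single-node features used here.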

Genomics of ancient specimens

Oral - Abstract ID: 14

Qian Cong, Jing Zhang, Jinhui Shen, John V. Calhoun, Andrew D. Warren and Nick V. Grishin

University of Washington, Seattle and Howard Hughes Medical Institute

Zoological museums across the world are full of treasures waiting to be unlocked. Not preserved with DNA studies in mind, pinned and dry specimens are challenging for genomic studies. However, many of these specimens are unusual or unique. Some represent extinct species or populations, while others are mutants with rare phenotypes that can shed light on genotypic determinants of phenotypic traits. Working with these century-old specimens is plagued with problems at every step: not much DNA remains in them, and what remains is highly degraded and fragmented into short 30-50 bp segments. Phylogenetic and population analyses are not straightforward because completeness of data is low. We demonstrate how to overcome these difficulties working on a project that solves several long-standing mysteries about butterflies. (1) Can we determine the locality where a specimen was collected from its genome? Type specimens are the name bearers for species. Most of these specimens are old and localities for some were not accurately recorded, but knowing the locality is important. We place the type specimen of Hesperia colorado collected in 1871 on the map with precision better than 10 miles. Detailed genomic comparison of specimens from present-day populations allows population assignment of ancient specimens of non-migratory, local butterflies. (2) Can we find adaptations to high elevation? H. colorado thrives at the bottom of hills and on top of the highest mountains in Colorado above 10,000 feet. Genomic comparison reveals that high elevation populations diverged significantly in oxygen-sensing and oxygen-carrying proteins, making them adapted to low oxygen conditions. (3) Can we find species boundaries with genomics? Hesperia comma was thought to occur throughout the Northern Hemisphere, from Europe through Siberia and Alaska to Colorado and Labrador. Population genomics reveals a deep split between the Old and the New World, suggesting that the American species is H. colorado, not H. comma. Interestingly, we find the Eurasian species H. comma in Alaska. In summary, we succeed with museomics approaches and further biology by tapping into genomes of ancient museum specimens.

Human environmental exposome revealed by extensive longitudinal personal monitoring

Oral - Abstract ID: 39

Chao Jiang, Xin Wang, Xiyan Li, Jinnga Inlora, Ting Wang, Qing Liu, Michael Snyder

Stanford University

Human health is dependent upon environmental exposures, yet the diversity and variation in exposures are poorly understood. We developed a sensitive method to monitor personal airborne biological and chemical exposures and followed the personal exposomes of 15 individuals for up to 890 days and over 66 distinct geographical locations. We found that individuals are potentially exposed to thousands of pan-domain species and chemical compounds, including insecticides and carcinogens. Personal biological and chemical exposomes are highly dynamic and vary spatiotemporally, even for individuals located in the same general geographical region. Integrated analysis of biological and chemical exposomes revealed strong location-dependent relationships. Finally, construction of an exposome interaction network demonstrated the presence of distinct yet interconnected human- and environment-centric clouds, composed of interacting ecosystems such as human, flora, pets, and arthropods. Overall, we demonstrate that human exposomes are diverse, dynamic, spatiotemporally-driven interaction networks with the potential to impact human health.

Immuno-biotechnology and bioinformatics in Community Colleges

Oral and Poster - Abstract ID: 75

Todd M. Smith1, Sandra G. Porter1,2, Dina Kovarik2

1. Digital World Biology, Seattle WA; 2. Shoreline Community College, Shoreline WA

Immuno-biotechnology is one of the fastest growing areas in the field of biotechnology. Digital World Biology’s Biotech-Careers.org database of biotechnology employers (>6800) has nearly 700 organizations that are involved with immunology in some way. With the advent of next generation DNA sequencing, and other technologies, immuno-biotechnology has significantly increased the use of computing technologies to decipher the meaning of large datasets and predict interactions between immune receptors (antibodies / T-Cell receptors / MHC) and their targets.

New technologies like immune profiling (where large numbers of immune receptors are sequenced en masse) and targeted cancer therapies (where researchers create, engineer, and grow modified T cells to attack tumors) are leading to job growth and demands for new skills and knowledge in biomanufacturing, quality systems, immuno-bioinformatics, and cancer biology. In response to these new demands, Shoreline Community College (Shoreline, WA) has begun developing an immuno-biotechnology certificate. Part of this certificate includes a five-week course (30 hours hands-on computer lab) on immuno-bioinformatics.

The immuno-bioinformatics course includes exercises in immune profiling, vaccine development, and operating bioinformatics programs using a command line interface. In immune profiling, students explore T-cell receptor datasets from early stage breast cancer samples using Adaptive Biotechnologies’ (Seattle, WA) immunoSEQ Analyzer public server to learn how T-cells differ between normal tissue, blood, and tumors. Next, they use the IEDB (Immune Epitope Database) in conjunction with Molecule World (Digital World Biology) to predict antigens from sequences and verify the results to learn the differences between continuous and discontinuous epitopes that are recognized by T-cell receptors and antibodies. Finally, to get hands-on experience with bioinformatics programs, students will use cloud computing (CyVerse) and igBLAST (NCBI) to explore data from an immune profiling experiment.

Break

Tuesday, May 21st, 3:40 PM – 4:10 PM, La Fonda Ballroom

Sponsored by BioID Genomics

Utility of the Ion S5™ and MiSeq FGx™ Sequencing Platforms to Characterize Challenging Human Remains

Oral - Abstract ID: 21

Kyleen E. Elwick, PhD1,2, Magdalena M. Bus, PhD3, Jonathan L. King, MS3, Bruce Budowle, PhD3, Sheree Hughes-Stamm, PhD2,4

1. Visiting Scientist Program, Counterterrorism and Forensic Science Research Unit, Federal Bureau of Investigation, Quantico, Virginia, USA; 2. Department of Forensic Science, Sam Houston State University, Huntsville, Texas, USA; 3. Center for Human Identification, University of North Texas, Fort Worth, Texas, USA; 4. School of Biomedical Sciences, University of Queensland, QLD, Australia

Missing persons cases, unidentified human remains, and mass disasters are problems encountered worldwide. Migrants and refugees have died or gone missing while attempting to cross borders or seas, or as a result of human trafficking. Skeletal remains (bone, teeth) are often the only samples available for DNA analysis in human identification. However, these samples are more challenging to process due to their biological composition and environmental exposure (humidity, temperature, UV light, and microorganisms) causing DNA damage and degradation, the presence of inhibitors, and the possibility of contamination or commingled remains.

This study compares the performance of the two most common forensic sequencing chemistries and platforms used within the community for identifying extremely challenging biological samples. Bone and teeth samples (N = 24) were obtained from 14 cadavers that had been subjected to a range of environmental insults (cremation, embalming, decomposition, and fire). Samples were extracted in triplicate using a total demineralization protocol. DNA was amplified and sequenced using a custom AmpliSeq™ STR and iiSNP panel for degraded remains with Precision ID chemistry on the Ion S5™ system and the ForenSeq™ DNA Signature Prep Kit (using Primer Mix A) on the MiSeq FGx™. Performance differences between the two MPS systems were determined by comparing read depth, heterozygote balance, and the total number or percentage of alleles. The percentage or number of alleles and the performance of the CODIS loci were used to compare the two MPS systems with traditional CE genotyping.

Overall, the results of this study demonstrate that both platforms were able to successfully sequence a variety of challenging samples. MPS provided more genetic data in 22 samples compared with the CE-based kit. Based only on the 20 CODIS loci, CE generated a greater number of alleles for the decomposed remains. However, the greater number of loci included in MPS multiplexes allowed for more genetic information to be obtained from most other samples. Furthermore, a greater number of alleles translates into greater power of discrimination. Results suggest that MPS may recover more probative information from most samples, but CE-based methods were more robust for identifying severely challenged skeletal samples such as the decomposed remains. CE chemistry has been substantially developed over the past 25 years, while MPS kits for forensic applications have been available for fewer than five years. However, improvement in MPS panel design and chemistries could enhance performance.

Introducing Morphoseq: high accuracy long reads from short read platforms

Oral - Abstract ID: 15

Leigh G. Monahan1, Michael Imelfort1, Catherine M. Burke1, Kevin Ying1, Aaron Statham2, Joyce To1, Ian G. Charles2,3, Nicholas J. McCooke2, Aaron E. Darling1,2

1. University of Technology Sydney, The ithree institute, Sydney Australia; 2. Longas Technologies Pty. Ltd.; 3. Quadram Institute Bioscience, Norwich Research Park, Norwich, UK

We introduce Morphoseq, a library prep chemistry and associated software technology that supports long read sequencing on short read platforms. We demonstrate Morphoseq on a set of 60 multiplexed bacterial isolates from the Human Microbiome Project with GC contents ranging from 25% to 69%. The resulting long reads have lengths up to 15 kbp (mean 7.8 kbp, sd +/- 1.2 kbp), with modal accuracy 100% and 92% of reads > Q40 when measured against independent reference genomes. Genomic coverage is shown to be highly uniform (mean 37.8, sd +/- 18.4, mode 25 on S. aureus ATCC25923). The data yield finished-quality closed circle assemblies for bacterial genomes across the entire GC content range (25-69%).
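For readers less familiar with Phred-scaled accuracy, the Q40 threshold quoted above corresponds to an expected error rate of 1 in 10,000 bases. The following is a minimal sketch of the standard Phred conversion in Python; the function names are illustrative and are not part of any Morphoseq software.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score Q to an error probability, p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability p (0 < p <= 1) to a Phred quality score."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


def expected_errors(read_length_bp: int, q: float) -> float:
    """Expected number of erroneous bases in a read of the given length at quality Q."""
    return read_length_bp * phred_to_error_prob(q)
```

Under this conversion, a 7.8 kbp read at exactly Q40 would be expected to carry fewer than one erroneous base (`expected_errors(7800, 40)` is about 0.78).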

The Morphoseq chemistry works by uniquely barcoding individual long DNA template molecules by highly uniform random mutagenesis and amplification. Short read sequences can then be associated with the long template from which they originated. Using a graph representation of the unmutated genome, the Morphoseq data processing algorithms reconstruct the original unmutated long template molecules.

Morphoseq works with lower quality and quantity of DNA than competing platforms (10-50ng DNA) and has the potential to reduce cost per finished genome by an order of magnitude or more. This has allowed us to generate the first reference-quality genomes for several members of the HMP isolate collection, an outcome that would have been difficult to achieve with alternative long read platforms due to their stringent input requirements. Morphoseq is a straightforward in-solution and robotics-compatible protocol that does not require bespoke hardware. The protocol has flexibility to scale both to large genomes and large sample batches (>1000 samples) through early sample barcoding and subsequent processing in a single tube. Morphoseq’s high accuracy long reads promise to transform a range of public health applications including genomic epidemiology and antimicrobial resistance surveillance.

Tuesday, May 21st, 4:50 PM – 6:40 PM, La Fonda Ballroom

Scalable Library Prep for Genotype Imputation via Low-Pass Sequencing

Tech Talk - Abstract ID: 92

Joe Mellor

SeqWell

Low-pass whole-genome sequencing (WGS) represents a cost-effective and flexible alternative to microarray-based genotyping. While accurate genotypes can be imputed from raw WGS coverage depths as low as 0.4x, the economics of this approach depend strongly on the ability to efficiently and equally prepare and pool large numbers of samples in a single sequencing run. We describe the use of a high-throughput library preparation technology (plexWell™) for application to low-pass whole human genome sequencing. plexWell library preparation kits create normalized multiplexed libraries from batches of 96 samples without the need for time-consuming measurement or adjustment of input DNA concentrations, significantly simplifying the complex task of high-level multiplexing. Our results show the utility of plexWell for routine low-pass WGS applications, where we characterize multiplexing uniformity and genotype imputation accuracy on a collection of reference samples.
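To give a sense of scale for the 0.4x figure above, mean coverage can be estimated as (number of reads × read length) / genome size. The sketch below is illustrative only: the 3.1 Gbp human genome size and the 150 bp read length are assumptions chosen for the example and are not taken from the abstract.

```python
import math


def mean_coverage(n_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Mean depth of coverage: total sequenced bases divided by genome size."""
    return n_reads * read_length_bp / genome_size_bp


def reads_for_coverage(target_coverage: float, read_length_bp: int,
                       genome_size_bp: int) -> int:
    """Smallest number of reads whose total bases reach the target mean coverage."""
    return math.ceil(target_coverage * genome_size_bp / read_length_bp)


# Assumed values for illustration: ~3.1 Gbp human genome, 150 bp reads.
HUMAN_GENOME_BP = 3_100_000_000
reads_per_sample = reads_for_coverage(0.4, 150, HUMAN_GENOME_BP)
```

With these assumed values, each sample needs on the order of 8 million reads, which is why even pooling across many samples matters for the economics of a run.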

From Squirrels to Clinical Genomes: The Importance of Recapturing Long-Range Genomic

Tech Talk - Abstract ID: 97

Marco Blanchette, PhD

Senior Director of R&D, Dovetail Genomics

Long-range genomic information is an integral component of assembling genomes for non-model organisms as well as investigating the genomic structures associated with human diseases.

Dovetail™ Hi-C generates this valuable long-range information using short read Next Generation Sequencing technology and is therefore able to complement other long-read technologies like ONT, PacBio or 10X.

In the plant and animal world, a contiguous and accurate genome assembly is a crucial first step towards fully understanding the biology of any organism. A high-quality assembly will make any downstream analyses, including gene annotation, synteny analysis, comparative genomics and population genetics far easier and more reliable.

Within the clinical arena there are significant challenges in using standard NGS technologies alone to construct a comprehensive and detailed picture of cancer and other human disease. These challenges are compounded when attempting to glean this information from degraded archival samples such as FFPE. New tools and methods are needed to discover and reliably detect critical structural rearrangements, such as fusions and translocations, particularly in challenging areas of the genome where short-read approaches stumble. Careful analysis of topologically associated domains (TADs) and their role in cancer and other human disease is also of increasing interest yet requires long-range genomic information to adequately capture. Proximity ligation techniques offer a powerful data type for building more complete genomes, unraveling genome structure, and unveiling new insights into genome biology and human disease.

Sample to NGS Library: Adaptive Focused Acoustics® (AFA®)-Enhanced Lysis and Lysate Homogenization for High-Throughput Nucleic Acid Extraction

Tech Talk - Abstract ID: 78

Patrick McCarty, Lauren Jansen, Martina Werner, Vanessa Process, Eugenio Daviso, Jim Laugharn and Ulrich Thomann

Covaris, Inc. 14 Gill Street, Woburn, MA 01801

Recently, Covaris has introduced the AFA-TUBE™, a precision engineered polymer shearing vessel, which allows the workflow from DNA fragmentation to NGS-ready library to be performed in a single vessel. Additionally, with the introduction of the R230 and LE220Rsc next generation Focused-ultrasonicators, Covaris facilitates the full integration of AFA onto automation platforms, further utilizing AFA for mixing during enzymatic incubations (adapter ligation, etc.) as well as during bead clean-up.

Sample preparation, upstream of any NGS library workflow, is still lacking in that nucleic acid extraction and purification are not synced with the downstream NGS process, which relies on the availability of substrates of high quality and quantity. Using the 96 AFA-TUBE TPX Plate in combination with R230 or LE220Rsc Focused-ultrasonicators, novel workflows were developed to extract and purify nucleic acids from a variety of specimens, including whole blood, bacterial, and fungal cells (liquid culture and colonies) in a single 96 AFA-TUBE TPX Plate. These fully automated hands-off workflows produce high quality, high molecular weight DNA (>30kb) and RNA (RIN >8) in quantities that allow direct use in the NGS library prep workflow. Moreover, since harsh chemicals and denaturants are replaced by precise AFA-energetics™, nucleic acids obtained in the lysate can be directly used for downstream processing in enzymatic fragmentation and tagmentation assays.

Twist TechTalk - TBD

plexWell: Highly Multiplexed Library Prep

Tech Talk - Abstract ID: 93

Kate Sullivan

SeqWell

plexWell™ is a streamlined NGS library preparation technology that enables efficient preparation of libraries from thousands of distinct samples for sequencing in a single run. Principal features of plexWell technology include removal of the need for time-consuming measurement or normalization of input DNA concentrations, and significant simplification of high-level multiplexing. The methodology and results using this technology will be presented for several applications including microbial whole genome sequencing, single cell RNA-Seq and low pass WGS.

Swift TechTalk - TBD

Panel Questions

Poster Presentations & Meet & Greet Party

Tuesday, May 21st, 7:00 PM – 10:00 PM, Mezzanine & New Mexico Room

Sponsored by Promega

Drinks and Hors d’oeuvres provided… Enjoy!

Drink tickets provided at check in!

(Use your blue tickets)

** Food and drinks served from 7:00 PM – 9:30PM

We will have two different poster sessions

(please stand by your poster during this time):

Even # Poster Abstracts = 7:30pm – 8:30pm

Odd # Poster Abstracts = 8:30pm – 9:30pm

*** Note: you may leave your poster up until Thursday afternoon

Finding plasmids with deep learning

Poster - Abstract ID: 02

Bill Andreopoulos1, Jan Balewski1

1. Joint Genome Institute, LBNL

Here we present DeLPlasmid, which is designed to meet the need of identifying plasmids in assembled bacterial genomes. DeLPlasmid is based on a deep learning LSTM algorithm that takes as input a combination of assembled sequences and extracted features to identify bacterial plasmids.

The deep learning model was trained on high-quality plasmid sequences from the ACLAME database and the NCBI Refseq.microbial dataset. The tool achieved an AUC-ROC of 91% in 5-fold cross-validation. DeLPlasmid can recover plasmids from assembled bacterial genomes with a low false positive rate and without any prior taxonomic knowledge.

Exploration of Microbial Diversity in Ticks from Gabon by High-Throughput Sequencing of 16S rRNA

Poster and Oral - Abstract ID: 03

Lejarre Quentin1, Nkili Meyong Andrinaina Andy1, Rahola Nil2, Paupy Christophe2 and Ayala Diégo2

1. Centre International de Recherches Médicales de Franceville (CIRMF), 2. Institut de Recherche pour le Développement (IRD)

Background: Ticks are considered second only to mosquitoes worldwide for transmitting diseases to humans, primarily because they i) are present on almost every continent, although often geographically confined to wooded areas, ii) are able to infest a large variety of hosts such as mammals, birds and even reptiles and iii) are able, for some species, to infect their hosts with more than one pathogen. In Western countries, tick-borne illnesses such as Lyme disease, ehrlichiosis, anaplasmosis, rickettsioses or tick-borne encephalitis have been on the rise in the past 10 years. Moreover, in Sub-Saharan Africa, where forest density remains high, the lack of available data on these arthropods and the pathogens they transmit raises the question of possible new emerging pathogens, especially in Gabon, where 85% of the territory is covered by woodland and where a novel species of Rickettsia was discovered in 2014. This context calls for more studies on the microbiome of ticks and the potential pathogens they may carry.

Methods: 370 samples were harvested from ticks living in 4 different Gabonese national parks and 2 private reserves. After milling the ticks and extracting DNA, 16S rRNA PCR targeted the hypervariable region V4. Amplicon sequencing was performed on an Illumina MiSeq platform with the Reagent Kit v3, 600 cycles (250 bp paired-end reads). Preliminary analyses were conducted on EDGE bioinformatics v2.2 using the embedded QIIME module to compare the microbiomes of ticks by species and location and to detect possible pathogens at the family and genus level. Sequences constituting OTUs of interest were then aligned (USEARCH) against a custom database built from reference sequences of bacteria known to be pathogenic, in order to determine the presence of pathogens at the species level.

Results: Overall, the beta-diversity analysis (using the abundance-weighted Jaccard distance) showed that samples from the same species of ticks tend to carry the same microbiome, predominantly the genera 1) Coxiella (60~80%), 2) Rickettsia (6.1~9%), 3) Mycobacterium (1~12.7%) and 4) Burkholderia (1.1~4%). Sex and sample location were considered as parameters of similarity but did not account for any pattern.
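As an illustration of what an abundance-weighted Jaccard comparison between two samples involves, the sketch below uses one common definition (the weighted Jaccard, or Ruzicka, form: one minus the sum of per-taxon minima over the sum of per-taxon maxima). This is an assumption made for illustration; the metric implemented in the QIIME module used in the study may be defined differently, and the genus counts shown are hypothetical.

```python
from typing import Mapping


def weighted_jaccard_distance(a: Mapping[str, float], b: Mapping[str, float]) -> float:
    """Weighted Jaccard (Ruzicka) distance between two abundance profiles.

    Each profile maps a taxon name to a non-negative abundance. Taxa missing
    from a profile are treated as having abundance 0.
    """
    taxa = set(a) | set(b)
    shared = sum(min(a.get(t, 0.0), b.get(t, 0.0)) for t in taxa)
    total = sum(max(a.get(t, 0.0), b.get(t, 0.0)) for t in taxa)
    if total == 0:
        return 0.0  # two empty profiles are considered identical
    return 1.0 - shared / total


# Hypothetical genus-level read counts for two tick samples.
sample_1 = {"Coxiella": 70, "Rickettsia": 30}
sample_2 = {"Coxiella": 60, "Rickettsia": 10, "Mycobacterium": 30}
distance = weighted_jaccard_distance(sample_1, sample_2)
```

A distance of 0 means identical profiles and 1 means no shared taxa; computing it for every pair of samples yields the distance matrix on which clustering or ordination is typically run.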

Pathogens identified included Francisella, Coxiella, Rickettsia and Mycobacterium. The pathogen search revealed that the major species present were Coxiella burnetii (78%), associated with A. tholloni for the first time, and Rickettsia (20%), detected almost exclusively in R. longus. Moreover, regardless of the tick species, the percentage of identity for all pathogens identified ranged from ~85% to 100%, indicating the potential presence of new species.

Perspectives: The next step will aim to confirm those preliminary results by performing PCR on samples where known pathogen species have been detected and WGS on samples with potential new pathogen species in order to obtain full genomes for a better characterization.

Isolation and Whole-Genome Sequencing of Commensal Escherichia coli in Healthy Human Stool

Poster - Abstract ID: 05

B. Aspinwall1,2, J. Hensley1,2, K. Dillon1,2, S. Lucking1,2, H. Carleton1, A. Huang1, A. Williams-Newkirk1

1. Enteric Diseases Laboratory Branch, US Centers for Disease Control and Prevention, Atlanta, GA, USA, 2. Oak Ridge Institute for Science and Education, US Department of Energy, Oak Ridge, TN, USA

Background: Pathogenic, diarrheagenic Escherichia coli are a well-studied, common cause of diarrheal illness with thousands of complete genomes in GenBank. They are closely related to the commensal E. coli also found in the human gut and stool microbiomes. However, the under-representation of commensals in GenBank contributes to difficulties differentiating commensal from pathogenic E. coli in molecular and metagenomic studies. Additionally, well-characterized commensal isolates are useful as negative controls during the testing of pathogenic E. coli assays. Our goal is to narrow this gap in knowledge by isolating and sequencing commensal E. coli from 200 healthy human stools. Here we describe our pilot data.

Methods: Nine stool samples were collected from healthy donors, homogenized, inoculated into Cary-Blair Transport Medium, and held at room temperature for three days. We then plated each sample on CHROMagar™ E. coli plating media, incubated at 37°C for 24 hours, and isolated ≤10 suspect colonies from each plate. Sanger sequencing of the rpoB gene confirmed isolates to be E. coli. A subset of 26 isolates were sequenced on the Illumina MiSeq using 2x250 bp chemistry. We used BioNumerics v7.6 for assembly, serotyping, virulence and antibiotic resistance gene identification, and core genome (cg) MLST analysis (EnteroBase E. coli cgMLST scheme, https://enterobase.warwick.ac.uk/).

Results: For six of the stool samples, isolates from a single stool sample differed by only 0-2 alleles and had the same serotypes and AR profiles. For the remaining three samples, the isolates within each stool sample differed by >1600 alleles and had different serotypes and AR profiles. Isolates from different stool samples differed by 0-2380 alleles. Seven of the 26 isolates sequenced were found to have resistance to one or more antibiotic classes. Isolates that contained AR were resistant to an average of 3.6 antibiotic classes, and two of these isolates had resistance to six different antibiotic classes.
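The allele differences reported above come from comparing cgMLST profiles locus by locus. The following is a minimal sketch of such a pairwise comparison, not the BioNumerics implementation: the profile representation (a mapping from locus name to allele number), the decision to ignore loci missing from either isolate, and the example profiles are all assumptions made for illustration.

```python
from typing import Mapping, Optional

Profile = Mapping[str, Optional[int]]  # locus name -> allele number (None if not called)


def allele_differences(p: Profile, q: Profile) -> int:
    """Count loci at which two cgMLST profiles carry different alleles.

    Loci that are absent or uncalled (None) in either profile are skipped,
    which is one common convention and an assumption here.
    """
    differences = 0
    for locus, allele_p in p.items():
        allele_q = q.get(locus)
        if allele_p is None or allele_q is None:
            continue
        if allele_p != allele_q:
            differences += 1
    return differences


# Hypothetical profiles for two isolates from the same stool sample.
isolate_a = {"locus_1": 4, "locus_2": 17, "locus_3": 2, "locus_4": None}
isolate_b = {"locus_1": 4, "locus_2": 18, "locus_3": 2, "locus_4": 9}
n_diff = allele_differences(isolate_a, isolate_b)
```

In this toy case the isolates differ at one locus, comparable to the 0-2 allele range reported for isolates from the same stool.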

Conclusions: For two-thirds of the stool samples, isolates coming from the same stool sample were nearly identical, while one-third of our stool samples contained isolates that were different from one another. Within this pilot set, we found more inter-stool than intra-stool diversity. However, there is notable diversity in commensal E. coli isolates detected within one stool sample, despite limited sampling. Isolation and sequencing of additional isolates is ongoing as donor stools become available.

Target Amplicon Push-button Primer Retriever (TAPPR)

Poster - Abstract ID: 06

Phil Davis, Dave Yarmosh, Alan Shteyman, John Bagnoli and Joe Russell

MRIGlobal

As demand for quick and actionable diagnostic information has increased, so has the demand for molecular assays. Targeted molecular amplification-based experiments, such as amplicon sequencing or real-time qPCR, allow for answering specific questions with very high sensitivity. For this purpose, we introduce Target Amplicon Push-button Primer Retriever (TAPPR). It is an automated assay design pipeline that uses MRIGlobal’s TMArC (Targeted Metagenomic Analysis through marker Creation) tool to identify unique molecular markers, although it accepts targeted regions identified by any means, to produce candidate primer pairs for the organisms of interest, given a range of design criteria.

First, a database of genomic assemblies for the taxonomies of interest is acquired, and regions of conservation are identified using canonical k-mer counting. The k-mers are remapped and marker regions are resolved through BLAST against a user-selected reference genome to generate candidate primer regions. Candidate amplicons are generated by filtering marker regions for those flanked by conserved primer regions within user-specified design criteria. Finally, optimal primer sequences are extracted and their efficacy is analyzed using an in silico PCR based on BLAST.
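The first step above, finding conserved sequence by canonical k-mer counting, can be sketched as follows. This is an illustrative outline only and not the TAPPR or TMArC code: the function names are hypothetical, and "conserved" is taken here to mean present in every input genome, which is an assumption.

```python
from collections import Counter
from typing import Iterable, Iterator, Set

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmers(seq: str, k: int) -> Iterator[str]:
    """Yield the canonical form of each k-mer in seq.

    The canonical form is the lexicographically smaller of a k-mer and its
    reverse complement, so both strands map to the same key. K-mers that
    contain characters other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            yield min(kmer, reverse_complement(kmer))


def conserved_kmers(genomes: Iterable[str], k: int) -> Set[str]:
    """Return canonical k-mers that occur in every genome at least once."""
    genomes = list(genomes)
    presence = Counter()
    for seq in genomes:
        presence.update(set(canonical_kmers(seq, k)))  # count each genome once
    return {kmer for kmer, n in presence.items() if n == len(genomes)}
```

Because canonical forms are strand-independent, a genome and its reverse complement share all of their k-mers; for example, `conserved_kmers(["AAACG", "CGTTT"], 3)` returns all three canonical 3-mers.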

Analysis of 31 Francisella tularensis genomes from Georgia

Poster - Abstract ID: 09

G. Chanturia1, E. Khmaladze1, N. Kotaria1, M. Pantsulaia1, G. Brachveli1, A. Papkiauri1, N. Berishvili1, G. Tomashvili1, K. Davenport2, M. Shakya2, E. Hanschen2, T. Erkkila2, P. Imnadze1

1. National Center for Disease Control and Public Health of Georgia, Richard G. Lugar Center for Public Health Research, Georgia, 2. Applied Genomics Team, Los Alamos National Laboratory, NM, USA

Georgia is a small country whose geographic location makes it a natural focus for several selected bacterial and viral pathogens. The National Center for Disease Control and Public Health of Georgia (NCDC) is responsible for the regular surveillance of these pathogens throughout the country and maintains a collection of bacterial strains. The Lugar Center for Public Health Research is the NCDC laboratory responsible for the safe maintenance and further research of these pathogens. Under a DTRA-funded project for characterization of the NCDC strain repository by Next Generation Sequencing (NGS), in collaboration with Los Alamos National Laboratory (LANL), up to one hundred selected pathogen strains were sequenced on the Illumina MiSeq platform and more than 30 were sent to LANL for PacBio sequencing. Among these hundred strains, 31 belonged to Francisella tularensis. Sequences of all F. tularensis strains were processed and assembled using EDGE and CLC-Bio tools. Illumina sequences of ten strains were combined with PacBio assemblies and some of them were finished. The PHAME tool was used for phylogenetic analysis. F. tularensis Live Vaccine Strain (LVS) and other close relatives of the Georgian clade genomes were added from the NCBI database for phylogenetic analysis. Comparative genomics was applied to analyze the strains in correlation with available proteomic and phenotypic data. Taxonomic analysis of the reads and contigs was also performed. The taxonomic results were shared with a DTRA-implemented platform, the Biosurveillance Ecosystem (BSVE), which provides a centralized resource for monitoring biosurveillance data and provides information to analysts. The genomic data obtained for F. tularensis will be made available to the global scientific community and will contribute to a better understanding of the diversity of circulating pathogens in the region.

Exome capture of RNA-seq libraries from low input samples

Poster - Abstract ID: 16

Ryan Demeter, Anastasia Potts, Kevin Lai and Kristina Giorda

Research & Development, Integrated DNA Technologies, Redwood City, CA

Next generation sequencing (NGS) is a valuable tool for identifying single nucleotide variants (SNVs), indels, and copy number variations, but DNA alone cannot provide a full picture of disease-associated molecular signatures [1]. RNA sequencing (RNA-seq) can identify gene expression changes, RNA structural variations, and SNVs to inform treatment and prognosis [2]. However, RNA-seq analyses can be difficult due to signal loss from poorly expressed genes and low-frequency structural variants [3]. Also, tissue-derived RNA samples often require extensive amplification due to degradation and limited yields. Here, we show that target capture of RNA-seq libraries tagged with unique molecular identifiers (UMIs) can help overcome those quantity and quality obstacles.

References

1. Sung J, Wang Y, Chandrasekaran S, Witten DM, Price ND. Molecular signatures from omics data: from chaos to consensus. Biotechnol J. 2012;7(8):946-57.

2. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10(1):57-63.

3. Agarwala V, Flannick J, Sunyaev S, Go TDC, Altshuler D. Evaluating empirical bounds on complex disease genetic architecture. Nat Genet. 2013;45:1418–27.

Sequencing of N gene of rabies virus isolated from different hosts in Georgia

Poster - Abstract ID: 19

Elbakidze T1, Li Yu2, Wilkins K2, Kapanadze A1, Kokhreidze M1, Vepkhvadze N1, Donduashvili M1

1. Laboratory of the Ministry of Agriculture of Georgia (LMA); Tbilisi, Georgia; 2. Centers for Disease Control and Prevention; Poxvirus and Rabies Branch; Atlanta, GA

Introduction: Rabies, caused by rabies virus (RABV), is a highly lethal zoonotic disease characterized by encephalomyelitis in all warm-blooded animals, including humans. Rabies virus is transmitted through exposure to infected animals. Rabies remains a public health threat with more than 60,000 deaths annually across the globe (Hampson et al., 2015). Georgia is among the countries with the highest post-exposure vaccination rates (1,311.3 vaccinated per 1,000,000), with 96 suspected rabies cases during 1993-2002 and 1,400 laboratory-confirmed rabies cases during 2002-2016 (Epidemiology Bulletin 2017/Vol.21 N5; NCDC).

Objective: The main goal of our research was to study the genetic diversity of rabies viruses spread in Georgia through sequencing of nucleoprotein gene (N-gene) of rabies virus strains circulating in the region.

Methods: Samples were screened and confirmed by direct fluorescence assay (DFA) and RT- PCR. Sequencing reactions were cleaned up by following the recommended protocol. Genetic Analyzer 3500 (ABI3500) was used for further sequencing analysis. Sequences were analyzed using CodonCode Aligner and BioEdit software. The phylogenetic analysis was done by using the BEAST v.1.8.3 software package.

Results: In the survey, we used 76 rabies virus spread across different animals in 8 regions of Georgia. 66 samples of the complete N gene were sequenced at the rabies department of the CDC United states and 10 fragments of N gene were sequenced at the LMA Georgia. The phylogenetic tree of Georgian strains isolated in 2014-2016 determined two different groups. Group one was circumscribed within the Kakheti region and the group two was widespread in 8 regions while strains isolated in 2018 composed an independent group.

Conclusion: The phylogenetic analysis suggest that group one is mainly circumscribed within the Kakheti region and had a common ancestor with RABV found in Europe (Estonia, Yugoslavia, Serbia) and Asia (Tajikistan, Kazakhstan, China, Russia) including Azerbaijan. Group two is distributed in all 8 regions of Georgia and has a common ancestor with RABV found in Russia, Iran and Turkey. All currently circulating rabies groups in Georgia are related to the dog variant. Three Sequences isolated in 2018 in Kakheti region formed independent clade and was not grouped with any other sequences but with the strains of rabies isolated from Georgian historic territories and now from the villages of Azerbaijan that may be related to the nomadic lifestyle of the domestic cattle when there is a high rate of biting those cattle with predatory animals carrying rabies virus. In addition, research reveals that most of the rabies virus spread in Georgia is a viral strain of cosmopolitan rabies associated with the dog. Relatedness of Georgian rabies strains with the strains spread in other countries of the Caucasus Sequencing, Finishing, and Analysis in the Future Meeting 2019 region suggest that there are several different strains circulating in the country which is not unexpected due to geographical location of Georgia. Sequencing, Finishing, and Analysis in the Future Meeting 2019

Intra-epidemic genome variation in highly pathogenic African swine fever virus (ASFV) from the country of Georgia

Poster - Abstract ID: 22

Jason Farlow3, Marina Donduashvili1, Maka Kokhreidze1, Adam Kotorashvili2, Nino Vepkhvadze1, Nino Kotaria2, Ana Gulbani1

1. Laboratory of the Ministry of Agriculture, Tbilisi, Georgia; 2. Richard G. Lugar Center for Public Health Research at the National Center for Disease Control (NCDC), Tbilisi, Georgia; 3. Farlow Scientific Consulting Company, LLC, Utah, USA

African swine fever virus (ASFV) causes an acute hemorrhagic infection in suids with a mortality rate of up to 100%. No vaccine is available, and the potential for catastrophic disease in Europe remains elevated due to the ongoing ASF epidemic in Russia and the Baltic countries. To date, intra-epidemic whole-genome variation for ASFV has not been reported. To provide a more comprehensive baseline for genetic variation early in the ASF outbreak, we sequenced two Georgian ASFV samples, G-2008/1 and G-2008/2, derived from domestic porcine blood collected in 2008. The G-2008/1 and G-2008/2 genomes were distinguished from each other by coding changes in seven genes (MGF 110-1L, X69R, MGF 505-10R, EP364R, H233R, E199L, and MGF 360-21R) in addition to eight homopolymer tract variations. The G-2008/2 genome possessed a novel allele state at a previously undescribed intergenic repeat locus between genes C315R and C147L. The C315R/C147L locus represents the earliest observed variable repeat sequence polymorphism reported among isolates from this epidemic. No sequence variation was observed in conventional ASFV subtyping markers. The two genomes exhibited complete collinearity and identical gene content with the Georgia 2007/1 reference genome. Approximately 56 homopolymer A/T-tract variations were identified that were unique to the Georgia 2007/1 genome. In both 2008 genomes, within-sample sequence read heterogeneity was evident at six homopolymeric G/C-tracts confined to the known hypervariable ~7 kb region in the left terminal region of the genome. This is the first intra-epidemic comparative genomic analysis reported for ASFV and provides insight into the intra-epidemic microevolution of ASFV. The genomes reported here, in addition to the G-2007/1 genome, provide an early baseline for future genome-level comparisons and epidemiological tracing efforts.

Framework for Bioinformatics App Deployment

Poster - Abstract ID: 23

Mark Flynn, Justin Zhang, Forest Altherr, Migun Shakya, Chien-Chi Lo, Karen Davenport, Patrick Chain

Los Alamos National Laboratory

Bioinformaticists devote enormous amounts of time to developing new tools for analysing genomics data. Traditionally, these tools are downloaded from a code or Docker image repository, installed on the user’s machine or computing cluster, and invoked from the command line. This method of deployment has limitations, since it requires significant expertise and work on the part of the user to install and maintain the software and hardware. We have developed a framework for bioinformatics tool deployment on the web using a containerized microservices approach. Once the developer deploys their app to a server, users can go to a website, select the appropriate options, upload required files, execute the workflow and view the results without any effort or expertise in software/hardware operations. Our framework creates Docker containers for a web server, a GUI, a database, the bioinformatics application and message/task queues. The web server container handles connections to the internet, the GUI enables the user to configure the workflow for their analyses, and the database container can be used for user authentication and authorization as well as for storing run parameters. The task queue manager launches the user-specified workflow in the bioinformatics application container, and the message queue alerts the user of progress along the pipeline. This framework has been successfully used to deploy the Phylogenetic Analysis and Molecular Evolution (PhaME) application and is being used to deploy SeqSim, a synthetic metagenomics data simulator. Deploying bioinformatics applications using this framework removes the technical barriers of software and hardware management for users, which can greatly increase usage and expand the application’s impact on the bioinformatics community.
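The division of labor between the task queue and the message queue can be illustrated with a minimal, single-process Python sketch. It uses only the standard library (`queue`, `threading`) in place of real containers and message brokers, so it is an analogy rather than the framework's implementation. The names `run_workflow`, `worker`, `submit_and_collect`, and the step names are hypothetical.

```python
import queue
import threading

def run_workflow(job, messages):
    """Stand-in for the application container: run each step and report progress."""
    for step in job["steps"]:
        # A real deployment would invoke the containerized tool here.
        messages.put((job["id"], f"finished {step}"))
    messages.put((job["id"], "done"))

def worker(tasks, messages):
    """Task-queue consumer: pull jobs until a None sentinel arrives."""
    while True:
        job = tasks.get()
        if job is None:
            break
        run_workflow(job, messages)

def submit_and_collect(jobs):
    """Queue the jobs, run one worker thread, and return progress messages in order."""
    tasks, messages = queue.Queue(), queue.Queue()
    for job in jobs:
        tasks.put(job)
    tasks.put(None)  # sentinel: no more work
    thread = threading.Thread(target=worker, args=(tasks, messages))
    thread.start()
    thread.join()
    collected = []
    while not messages.empty():
        collected.append(messages.get())
    return collected

# Hypothetical job description; step names are illustrative only.
progress = submit_and_collect([{"id": "run-1", "steps": ["align", "build_tree"]}])
```

Keeping job submission and progress reporting on separate queues mirrors the separation described above: the GUI only needs to write to one queue and read from the other, and never has to know how the workflow is executed.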

Optimization of a commercial DNA extraction kit for long read sequencing of Clostridium botulinum

Poster - Abstract ID: 24

Victoria Foltz1,2, Jessica Halpin1, Carolina Lúquez1

1. Centers for Disease Control and Prevention; 2. Oak Ridge Institute for Science Education

Clostridium botulinum is an anaerobic spore former that is naturally found throughout the environment. C. botulinum produces botulinum neurotoxin (BoNT), the causative agent of botulism. There are seven well-characterized serotypes of BoNT (A-G), and four are responsible for disease in humans (A, B, E, and F). BoNT is encoded by the botulinum toxin gene bont, for which over 40 subtypes have been identified. As whole genome sequencing technologies evolve and become more accessible to laboratories, so does the understanding of the diversity of the bont gene and C. botulinum.

The genes encoding the A, B, E, or F toxins can be found on the chromosome, on plasmids, or on both. Whole genome sequencing is used as a surveillance tool to type strains and help determine relatedness of outbreak isolates. Currently, there is a limited number of publicly available bont sequences, closed genomes, and C. botulinum reference sequences. The aim of this study was to establish a genomic DNA extraction protocol for sequencing with the Oxford Nanopore MinION instrument. MinION produces extremely long reads, which can improve the contiguity of assemblies in downstream analyses.

The MinION instrument and libraries require high molecular weight genomic DNA for the best results. We compared our current DNA extraction procedure, which follows a modified Epicentre (Madison, WI) MasterPure Complete DNA and RNA Purification Kit protocol followed by filtration of the DNA through a 0.2 µm filter, with the MP Bio (Santa Ana, CA) FastDNA SPIN Kit for Soil. We tested each with five isolates: serotypes A, B, Bf, E, and F. Quantity of genomic DNA (gDNA) was measured using the Qubit 4 fluorimeter high sensitivity assay, and gDNA was visualized on an Invitrogen 0.8% NGS e-gel to assess fragmentation resulting from the procedure. Quantity of gDNA extracted using the MP Bio kit was consistently >50 ng/µL, similar to the modified Epicentre protocol. Test MinION sequencing runs of the isolates extracted using the modified Epicentre kit, both filtered and unfiltered, failed because the DNA fragments were too small for sequencing.

Of the five strains tested, one was chosen for MinION sequencing: C. botulinum ATCC 43758 BoNT/Bf. A gDNA library was made following Oxford Nanopore’s 1D Genomic DNA by Ligation (SQK-LSK109) protocol. Basecalling was completed using Guppy (v. 3.1.2) with homopolymer correction, and reads were assembled using Canu (v. 1.7.1). The resulting assembly consisted of 7 contigs with a total length of 4.34 Mb. The subtype of ATCC 43758 was determined using the CLC Genomics Workbench v10.1.1 “Map Reads to Reference” tool. This represents the first completed MinION sequence for C. botulinum within the CDC National Botulism and Enteric Toxins Team.
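Contig count and total length, the two figures reported for the assembly, are commonly read alongside N50 as a measure of contiguity. The sketch below shows one way to compute all three from a list of contig lengths. N50 is not reported in this abstract, the function name is hypothetical, and the example lengths are toy values unrelated to the ATCC 43758 assembly.

```python
def assembly_summary(contig_lengths):
    """Return (number of contigs, total length, N50) for a list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    n50 = 0
    for length in lengths:
        running += length
        # N50 is the length of the contig at which the running sum
        # first reaches half of the total assembly length.
        if running * 2 >= total:
            n50 = length
            break
    return len(lengths), total, n50

# Toy contig lengths in bp, for illustration only.
n_contigs, total_bp, n50 = assembly_summary([500, 300, 200, 100])
```

For the toy input the running sum reaches half of the 1,100 bp total at the second-longest contig, so the N50 is 300 bp.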

A System for Statewide Surveillance and Reporting of Invasive Group A Strep in Arizona

Poster - Abstract ID: 25

Chris French, Dulce Jimenez, Michael Valentine, Darrin Lemmer, Dave Engelthaler

TGen North

Invasive Group A Streptococcus (GAS or iGAS) is a bacterium responsible for many life-threatening infections such as necrotizing fasciitis, bacteremia, and toxic shock syndrome. About 10%-15% of people with invasive GAS infection die as a result. This number can increase to about 25% in people with necrotizing fasciitis, and to over 30% in people with toxic shock syndrome. The bacterium is spread through contact with infected sores or wounds on the skin, as well as contact with throat or nasal mucus from an infected person. The chance of severe infection is increased by factors including old age, decreased immune system function, skin lesions, and chronic illnesses like diabetes and cancer. Clusters of invasive GAS can develop in long-term care and other health-related facilities, and poor surveillance may leave such outbreaks undetected.

We have created a web application for early GAS outbreak detection in Arizona. The web app hosts metadata for all GAS samples that are sequenced and analyzed at TGen North. It provides a variety of visualization tools: a Leaflet-powered interactive Arizona map for plotting sample collection locations, county- and facility-level resolution of GAS emm-type distribution, phylogenies, and a date range slider for tracking both geographic and temporal sample data. We are additionally developing an automated pipeline for the tracking and analysis of submitted samples. The pipeline assembles read data, calls SNPs using the Northern Arizona SNP Pipeline (NASP), creates phylogenies, identifies emm-types with our own Amplicon Sequencing Analysis Pipeline (ASAP), and flags samples of interest for further analyses. Once analyzed, all of the collected metadata is pushed to the web app, and users then have the option to select samples in order to generate higher resolution phylogenies.

Invasive GAS infections in Arizona have risen significantly since 2015. Uncommon emm-types of GAS have contributed to this increase. The web application we have created will greatly assist in the ongoing effort to monitor GAS activity, classify GAS emm-types, and identify possible outbreaks. If the web app proves useful for GAS, it may be expanded for surveillance of antimicrobial resistance (AMR) associated pathogens, such as the ESKAPE group: Enterococcus faecium, Staphylococcus aureus, Klebsiella pneumoniae, Acinetobacter baumannii, Pseudomonas aeruginosa, and Enterobacter species.

The Study of the Midwife and Patient Relationship in the Context of Improving Health Outcomes in Northern Uganda

Poster - Abstract ID: 26

Lydia Fulton and Dr. Kathleen Vongsathorn

Washington University School of Medicine

Midwives have played a valuable role in the advancement of Ugandan health care. This research analyzed the exact impact they had on the patients they interacted with and thus the impact they had on the developing health care system and health care education. This poster is based on a larger research project and focuses on a selection of interviews conducted with Ugandan midwives from Kalongo Hospital in the northern region of Uganda. Midwives who trained between 1960 and 1980 were asked to answer a number of questions about their experiences in the health care system and the education that led them there. Responses were collected and analyzed for patterns to draw out themes that indicated how people interacted with the hospital healthcare system in Uganda at the time. One of the main patterns which emerged in the responses was the importance of the patient relationship with the midwives and its significance in the context of patient relationships with hospitals. Analyzing the relationships they formed with patients and hospitals can give great insight into the timeline of the popularization of health care in Uganda and the ultimate effect that had on health outcomes in the country. Specifically, this research looks at the care provided by each midwife, the health education they distributed, their problem-solving skills used in educating patients in the community, and the education and training they received in order to positively interact with patients.

Targeted enrichment of enteric pathogen DNA from stool samples for metagenomic subtyping analysis

Poster - Abstract ID: 28

Yang Gao1,2, Brooke Aspinwall1,3, Mohit Thakur1,3, Katie Dillon1,3, Eija Trees1, John Besser1, Heather Carleton1, Jo Williams-Newkirk1, Andrew Huang1

1. Enteric Diseases Laboratory Branch, US Centers for Disease Control and Prevention, Atlanta, GA, USA; 2. IHRC, Inc., Atlanta, GA, USA; 3. Oak Ridge Institute for Science and Education, US Department of Energy, Oak Ridge, TN, USA

Culture-independent diagnostic tests provide fast, cost-effective diagnostics for clinical laboratories and will likely replace culture-based methods to identify foodborne pathogens. However, they do not yield the bacterial isolates currently sequenced for outbreak detection and investigation. Shotgun metagenomic sequencing allows pathogen subtyping directly from stool. However, the practicality of this approach for routine surveillance activities is limited by the costs of deep sequencing and by the complex analytical pipelines required to resolve a low pathogen signal against high stool background noise. For pathogens such as Shiga toxin-producing Escherichia coli (STEC), shotgun metagenomics is further limited by challenges with phasing, in this case the ability to differentiate between pathogenic and commensal E. coli sequences.

To address these issues, we investigated the application of a target capture method to enrich for pathogen DNA. We tested the Roche HyperCap system, which uses biotinylated DNA oligonucleotides as capture baits. We selected two STEC O157:H7 strains, F7353 and K4623, to design our baits, and created extra tiling baits to enhance the capture of stx genes. The baits were tested using DNA samples that simulated disease-state stool with moderate (10^5 CFU/mL) infection. Samples were prepared by mixing 2% (by mass) of either F7353 or K4623 isolate DNA with stool DNA extracted from six healthy anonymous donors. To distinguish STEC DNA from background, stool and isolate DNA libraries were generated (using the KAPA HyperPrep kit) and indexed separately prior to mixing. We tested F7353-spiked stool, K4623-spiked stool, stool only, and a water control using the HyperCap workflow, and compared them against shotgun sequencing results of the same samples in Mauve. Each set-up was tested in triplicate. Samples were sequenced on a MiSeq, and reads were aligned against the closed genomes of F7353 and K4623 using Bowtie2, following trimming with Trimmomatic. Coverage was assessed against gene content using IGV.

Shotgun sequencing of STEC-spiked stool samples yielded a fraction of 0.02 of reads indexed as STEC, consistent with the input spike-in level. In the same samples after target capture, the fractions of F7353 and K4623 reads increased to 13.8 and 11.7, respectively. In these samples, 97% and 98% of the F7353 and K4623 reads, respectively, mapped to their respective genomes, indicating high probe specificity. Approximately 7% of the captured sequences belonged to stool background DNA, and about 70% of them mapped to either STEC genome.
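The reported values are easier to compare if they are read as ratios of STEC-indexed reads to stool-indexed reads. This reading is an assumption, not something the abstract states, but it is consistent with the figures: a ratio of 0.02 corresponds to roughly 2% target reads, and ratios of 13.8 and 11.7 leave roughly 7-8% background. The helper name below is hypothetical.

```python
def target_fraction(ratio):
    """Convert a target:background read ratio into the target share of all reads."""
    return ratio / (1.0 + ratio)

# Ratios taken from the text, interpreted (as an assumption) as STEC:stool reads.
shotgun_share = target_fraction(0.02)      # about 0.02 of all reads
f7353_share = target_fraction(13.8)        # about 0.93 of all reads
k4623_share = target_fraction(11.7)        # about 0.92 of all reads

# Remaining share attributed to stool background after capture.
f7353_background = 1.0 - f7353_share       # about 0.07
k4623_background = 1.0 - k4623_share       # about 0.08
```

Under this interpretation the capture step moves the target share from about 2% to over 90% of reads, which matches the statement that approximately 7% of captured sequences belonged to stool background.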

Overall, the Roche HyperCap system showed promising results in enriching target bacterial DNA from a complex stool microbiome background. Bait-captured sequences that belong to stool DNA indicate the level of commensal E. coli being enriched, and future experiments are required to address the phasing issues. Ongoing analyses also include determining the coverage of the enriched sequences to inform whether the enrichment is sufficient for strain-level subtyping, as well as examining the captured reads that do not match the target genomes. Future studies include testing lower pathogen levels to determine the HyperCap system detection limit, and testing actual disease-state samples with unknown STEC pathogen targets and infection levels.

Whole Genome Sequence Analysis of Serratia marcescens Isolated in Georgia

Poster - Abstract ID: 30

Giorgi Gogoladze1, Tea Tevdoradze1, Nato Kotaria1, Pikria Shavreshiani1, Darejan Gamkrelidze2, Lia Tevzadze1, Paata Imnadze1

1. National Center for Disease Control and Public Health; 2. Tbilisi Children’s Infectious Diseases Hospital

Serratia marcescens is a rod-shaped Gram-negative bacterium in the family Enterobacteriaceae and has been recognized as a cause of hospital-acquired infection for the last five decades. S. marcescens is found in the respiratory and urinary tracts of adults and in the gastrointestinal system in children. In humans, S. marcescens can cause an opportunistic infection of the urinary and respiratory tracts, wounds and the eye.

The genetic information of S. marcescens is stored in the chromosome and plasmid(s). Sequencing of housekeeping genes and of the whole genome is used for species typing. Here, we characterized a Georgian strain of S. marcescens isolated from a stool sample from a patient with acute diarrhea caused by Shigella spp.

Library preparation was conducted using NEBNext Ultra DNA Library Prep kit for Illumina (New England BioLabs) at NCDC.

Quality of reads was monitored by FastQC v0.11.4. Average quality scores of read bases were more than 32. De novo assembly was done using CLC Genomics Workbench 8.0.2. Genome Compositional Analyses (GCA), Average Nucleotide Identity (ANI), Single Nucleotide Polymorphism (SNP) methods and housekeeping genes (gyrB, recA) were selected for Georgian S. marcescens characterization. Whole Genome Sequences (WGS) of reference S. marcescens were obtained from the National Center for Biotechnology Information (NCBI).

GCA was performed with the GET_HOMOLOGUES software. The Environmental Microbial Genomics Laboratory online tools were used for ANI calculations. Gene annotation of the Georgian S. marcescens was done with the RAST server. The WGS SNP sites were found with the kSNP3 package. Appropriate k-mer sizes were found with Kchooser, which is part of kSNP3.

MrBayes was used for phylogenetic reconstruction. Models of nucleotide substitution were selected with the SMS (Smart Model Selection in PhyML) server. FigTree v1.4.3 was used for visualization of tree files.

Phylogenetic trees reconstructed from gyrB and recA coding sequences show that the Georgian S. marcescens is closely related to the strains UMH9, UMH3, AR_0099, AR_0123, AR_0121, AR_0131, AR_0122, SmUNAM836, AR_0091, AR_0124, SMB2099, SM39, AR_0027 and FDAARGOS_65.

Results obtained from WGS SNPs are in good agreement with the results obtained by gyrB and recA gene analysis. In addition, the SNP method infers the existence of six distinct groups of S. marcescens.

Core, cloud and shell genes were identified by GCA. 70% of genes are common to all S. marcescens strains. Genes present only in the Georgian isolate were identified as well.

Whole-genome pairwise comparison of the reference strains and the Georgian isolate was carried out with the ANI method. An ANI value of 96% was used as the boundary for species delineation. Some pairwise ANI values are less than 96%, indicating the complexity of S. marcescens.
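Applying the 96% boundary to a table of pairwise ANI values amounts to a simple filter, sketched below. The strain labels and ANI values are hypothetical placeholders, not results from this study, and the function name is an assumption made for illustration.

```python
ANI_SPECIES_THRESHOLD = 96.0  # percent, the boundary used in the text

def pairs_below_threshold(ani, threshold=ANI_SPECIES_THRESHOLD):
    """Return the sorted genome pairs whose ANI falls below the species boundary."""
    return sorted(pair for pair, value in ani.items() if value < threshold)

# Hypothetical strain labels and ANI values, for illustration only.
example_ani = {
    ("isolate_A", "ref_1"): 98.7,
    ("isolate_A", "ref_2"): 95.1,
    ("ref_1", "ref_2"): 94.8,
}
flagged = pairs_below_threshold(example_ani)
```

Pairs returned by the filter are the ones that would prompt a closer look at whether all genomes labeled S. marcescens belong to a single species.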

The present study is the first attempt at genetic characterization of an S. marcescens strain isolated in Georgia.

miRNA Profiles in Healthy Non-Human Primates

Poster - Abstract ID: 31

Amanda S. Graham1,2, Christopher P Stefan1,2, Janice Duy3, Timothy D. Minogue1

1. Diagnostic Systems Division, U.S. Army Medical Research Institute of Infectious Diseases, Fort Detrick, MD, USA; 2. Cherokee Nation Assurance, Arlington, VA, USA; 3. LMI, Fort Detrick, MD, USA

Circulating miRNAs in biofluids have demonstrated diagnostic potential in detecting numerous diseases, including cancers, neurological disorders, diabetes, and infectious diseases. miRNAs are short (17-22 bp) non-protein-coding sequences responsible for post-transcriptional regulation of mRNA sequences. Circulating miRNAs are extremely stable and are secreted from cells in exosomes, attached to proteins, or released during cell death. Knowledge of baseline miRNA profiles in various biofluids provides valuable insights into changes that may occur during disease progression. Advances in next-generation sequencing have made it easier and more cost-effective to perform deep RNA sequencing, allowing for the discovery of novel miRNA sequences as well as the establishment of miRNA profiles in healthy tissues. The goal of this study was to establish baseline miRNA levels in blood-derived tissue types such as whole blood, plasma, serum, and PBMCs. We tested approximately 30 non-human primates from three different species: African green monkeys (AGMs), cynomolgus macaques, and rhesus macaques. These animals are commonly used as model organisms for rare infectious diseases, both for the development of therapeutic agents and for understanding disease progression and diagnostic windows. We characterized miRNA profiles across species, gender, and tissue types of healthy non-human primates using next-generation sequencing, providing a baseline for future work identifying novel miRNA signatures of infection.

Detection of West African Diseases with a Pathogen Identification and Characterization Panel using Illumina and Oxford Nanopore Platforms

Poster - Abstract ID: 33

Adrienne T. Hall, Christopher P. Stefan, Amanda S. Graham, Tucker Maxon, Jeff W. Koehler, Timothy D. Minogue*

U.S. Army Medical Research Institute of Infectious Diseases, Diagnostic Systems Division, Fort Detrick, MD 21702, USA

Next-generation sequencing (NGS) is rapidly maturing as an alternative diagnostic capability. The ability to sequence all nucleic acids in any clinical or environmental sample can support definitive diagnostic calls from limited source material without prior knowledge of the infectious agent. However, this agnostic approach results in low target reads compared to host or environmental contamination. Here, we describe a targeted panel, the Pathogen Identification and Characterization Panel (PICP), which uses molecular inversion probes (MIPs) to identify 18 viral and parasitic pathogens endemic to West Africa. We compare PICP performance in various matrices on the Illumina MiSeq and Oxford Nanopore MinION platforms, testing identification, limits of detection, repeatability, and diagnostic performance. In silico analysis demonstrates that these probes bind greater than 90% of GenBank sequences with 100% identity, and the panel detected three replicates for each organism in mock clinical samples of plasma and whole blood at concentrations as low as 10^4 PFU/mL. Analysis of sequencing platforms showed higher read counts on the Illumina platform but a higher percentage of correctly mapped reads on the MinION platform. These differences had limited effect on limits of detection (LODs); Chikungunya virus (CHIKV) spiked plasma and whole blood demonstrated 10^3 PFU/mL LODs on both platforms. Similarly, Ebola virus (EBOV) spiked plasma and whole blood had LODs of 10^3 PFU/mL and 10^4 PFU/mL, respectively, on both platforms. Twenty replicates for both CHIKV and EBOV in plasma and whole blood were tested at the preliminary LOD concentration on both platforms and demonstrated similar standard deviations and coefficients of variation. PICP positively identified pathogens in de-identified human clinical EBOV and CHIKV samples on both sequencing platforms compared to qRT-PCR methods. The Illumina platform had positive and negative predictive values of 1.00 and 0.84, respectively, while the Oxford Nanopore platform had values of 0.90 and 0.83. Overall, targeted NGS inches closer to being a viable option for reliable pathogen identification independent of the platform used, making it a valuable tool in biothreat surveillance.
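Positive and negative predictive values are derived from the counts of true and false calls relative to the reference assay (here, qRT-PCR). The sketch below shows the standard calculation. The counts are hypothetical, since the abstract does not report the underlying numbers, and the function name is an assumption.

```python
def predictive_values(tp, fp, tn, fn):
    """Return (PPV, NPV) from confusion-matrix counts against the reference assay."""
    ppv = tp / (tp + fp)  # share of positive calls that are true positives
    npv = tn / (tn + fn)  # share of negative calls that are true negatives
    return ppv, npv

# Hypothetical counts, chosen only to illustrate the arithmetic.
ppv, npv = predictive_values(tp=40, fp=10, tn=45, fn=5)
```

Both values depend on how common the pathogen is in the tested sample set, so predictive values from a panel of selected clinical samples may not transfer directly to a surveillance setting.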

Comparison of non-toxigenic and toxigenic Clostridium botulinum strains isolated from clinical specimens

Poster - Abstract ID: 34

Jessica L. Halpin1, Victoria Foltz1,2, Janet K. Dykes1, Carolina Lúquez1

1. Centers for Disease Control and Prevention, National Botulism and Enteric Toxins Team; 2. Oak Ridge Institute for Science Education

Clostridium botulinum produces botulinum neurotoxin (BoNT), which is the causative agent of botulism, a rare but serious neuroparalytic disease that can result in death if not treated. There are seven well-characterized serotypes of C. botulinum (A-G) identified by the ability of specific antitoxins to neutralize BoNTs. The molecular era has allowed researchers to narrow these into subtypes based on nucleic acid sequences of the botulinum toxin gene (bont). Over 40 subtypes have been identified across seven serotypes. For strains producing BoNT serotype B, the operon containing the bont gene, its accessory genes, and its regulator is often contained within a large plasmid.

During botulism laboratory investigations, we have noted that some specimens that are positive for BoNT/B may yield a relatively large number of nontoxigenic single colonies in addition to toxigenic colonies. We hypothesize that the nontoxigenic isolates are genomically similar to the toxigenic isolates but have lost the plasmid containing the bont operon. Historically, C. botulinum has been named based on the phenotypic characteristic of toxin production; strains that are identical in other phenotypic characteristics but do not produce BoNT are called C. sporogenes. Here, we used whole genome sequencing to determine whether these isolates are more closely related to C. sporogenes or C. botulinum.

Nine isolates (three toxigenic) from two specimens were subjected to genomic DNA extraction followed by 400 bp fragment library preparation using the Kapa Biosystems kit for Ion Torrent. Libraries were diluted and pooled in equimolar concentrations and loaded into the Ion Chef instrument at a concentration of 200 pM for templating and enrichment. Libraries were sequenced using the Ion Torrent S5 instrument. Resulting reads were assessed with FastQC and assembled using SPAdes v. 3.12.0, and assemblies were assessed using Quast v. 5.0. To determine the presence of the plasmid, reads were mapped to a reference plasmid sequence gi1243952054 from CDC67071. We also used Mashtree v.0.29 to create a neighbor-joining tree of both toxigenic and nontoxigenic isolates along with C. sporogenes and other C. botulinum type B.
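One common way to turn read mapping against a reference plasmid into a presence or absence call is breadth of coverage: the fraction of reference positions covered by at least one read. The sketch below illustrates that idea. It is not necessarily the criterion used in this study, and the function names, the 0.9 cutoff, and the depth values are all hypothetical.

```python
def breadth_of_coverage(depths, min_depth=1):
    """Fraction of reference positions covered by at least min_depth reads."""
    if not depths:
        return 0.0
    covered = sum(1 for depth in depths if depth >= min_depth)
    return covered / len(depths)

def plasmid_present(depths, min_breadth=0.9):
    """Call the plasmid present when most of the reference is covered (hypothetical cutoff)."""
    return breadth_of_coverage(depths) >= min_breadth

# Hypothetical per-position read depths along a very short toy reference.
toxigenic_like = [30, 28, 35, 31, 12, 29, 33, 30, 27, 32]
nontoxigenic_like = [0, 0, 2, 0, 0, 0, 0, 1, 0, 0]

toxigenic_call = plasmid_present(toxigenic_like)
nontoxigenic_call = plasmid_present(nontoxigenic_like)
```

In practice the per-position depths would come from the read alignment, and a small amount of coverage in a plasmid-negative isolate can be expected where the plasmid shares sequence with the chromosome.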

Toxigenic isolates had evidence of a plasmid while non-toxigenic picks did not. The bont operon was contained within the plasmid for toxigenic strains and absent from the nontoxigenic strains. The neighbor-joining tree showed that nontoxigenic isolates clustered with toxigenic strains isolated from the same specimen, and differed from other C. botulinum type B and C. sporogenes.

These results suggest that nontoxigenic colonies isolated along with C. botulinum type B might represent isolates that have lost the plasmid containing the bont gene.

Characterization of mating in green alga Scenedesmus obliquus

Poster - Abstract ID: 35

Erik R. Hanschen1, James Umen2, Juergen Polle3, Shawn R. Starkenburg1

1. Los Alamos National Laboratory; 2. Donald Danforth Plant Science Center; 3. Brooklyn College of the City University of New York

Selective breeding of agricultural plants is ancient and universal, resulting in increased crop yield, resistance to predators and pests, and tolerance of environmental pressures. However, despite the growing importance of algae as a food and bio-based fuel source, selective breeding is not possible with industrially relevant algal species. Thus, selective breeding of algae is an untapped resource to improve bio-based fuel production. Most species of algae are facultatively sexual, and sexual reproduction is controlled by environmental or biochemical cues. When induced, sex increases genetic variation through chromosomal recombination, and subsequently increases variation among the offspring’s traits. We report on the characterization of mating and breeding in the green alga Scenedesmus obliquus, evaluating temperature change and nitrogen deprivation as likely environmental cues. Understanding and characterizing the sexual cycle of Scenedesmus will enable breeding practices that increase genetic variation and selection of desired traits. The ability to breed these algae has the potential to dramatically increase the economic feasibility of algae-based biofuels by improving yield and predator and pest resistance, and by mitigating stress responses to temperature and salinity perturbation(s).

Screening Fungal Genome Sequencing Data and Culture Collections to Better Understand Bacterial:Fungal Interaction

Poster - Abstract ID: 36

Geoffrey L. House1*, Andrea Lohberger2, Fabio Palmieri2, La Verne Gallegos-Graves1, Julia M. Kelliher1, Demosthenes P. Morales1, Armand E. K. Dichosa1, Debora F. Rodrigues3, Hang N. Nguyen3, Saskia Bindschedler2, Jean F. Challacombe4, Jamey D. Young5, Pilar Junier2, and Patrick S. G. Chain1

1. Bioscience Division, Los Alamos National Laboratory, Los Alamos, New Mexico, 2. Institute of Biology, University of Neuchâtel, Neuchâtel, Switzerland, 3. Civil and Environmental Engineering, University of Houston, Houston, Texas, 4. College of Agricultural Sciences, Colorado State University, Fort Collins, Colorado, 5. School of Engineering, Vanderbilt University, Nashville, Tennessee

Interactions between bacteria and fungi are critical to the functioning of terrestrial ecosystems, yet little is known about these interactions or how they function. Here we outline computational methods we are using to better understand the diversity of these bacterial:fungal interactions represented in publicly available DNA sequence data as well as in fungal culture collections. There is now a large amount of publicly available fungal genome sequencing data, due in large part to the 1000 Fungal Genomes Project through the Joint Genome Institute (JGI). Because many fungi have bacteria and viruses associated with them, these DNA sequence datasets from fungi may also begin to provide information about the associated microbiome of these fungal isolates. Furthermore, because the 1000 Fungal Genomes Initiative deliberately seeks to span the full range of known fungal diversity, this presents a unique opportunity to use these DNA sequences to start understanding the diversity of bacteria that may form associations with a wide range of fungi. To this end, we have developed a bioinformatics pipeline that consists of commonly used tools and custom scripts to identify signals of bacteria that co-occur with hundreds of different fungal isolates. We begin the analysis with raw DNA sequencing reads from fungal genome projects and then remove all identifiable fungal DNA sequencing reads in order to reduce spurious similarities to bacteria. Next, we assemble the remaining reads into longer contigs that contain more information that can aid their taxonomic classification. For this analysis, we discard reads that do not assemble into contigs. We then use multiple taxonomy classifiers to identify contigs that signal the presence of specific bacteria. Using this bioinformatics pipeline, we have been able to identify specific fungal isolates that have strong signals of likely associated bacteria (e.g. Rhizobium sp. and others), while other fungal isolates have strong signals of bacterial contamination (e.g. Escherichia coli). However, for other, ambiguous cases, determining whether the identified bacterial signals likely represent contaminants or whether they represent bacteria that may form associations with the fungi remains a challenge. To help address the problem of differentiating signals of true bacterial associates from likely contaminants, we have also independently used 16S amplicon screens of more than 200 fungal isolates from multiple fungal culture collections for associated bacteria to look for concordance between the results of these screens and the results of the bioinformatics analysis. By combining DNA sequence data from both fungal collections and publicly available fungal sequencing data, we have identified strong signals of associated bacteria in some fungal isolates. We will continue tuning this analysis pipeline and will also apply it to a wider variety of metagenome samples, including complex soil metagenomes, to identify potential bacterial associates of fungi from a range of environments. This information can then be used to target specific fungi for isolation from environmental samples and to determine how a range of these bacterial:fungal interactions affect both fungal and bacterial phenotypes.
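The cross-check described in this abstract (agreement among several taxonomy classifiers, then concordance with an independent 16S amplicon screen) can be illustrated with a small sketch. This is not the authors' pipeline: the function names, the two-vote threshold, and the classifier labels are all assumptions made for illustration, and the abstract does not say how the classifiers were combined.

```python
from typing import Dict, Set


def consensus_taxa(calls_by_classifier: Dict[str, Set[str]], min_votes: int = 2) -> Set[str]:
    """Return taxa reported by at least `min_votes` classifiers (threshold is hypothetical)."""
    votes: Dict[str, int] = {}
    for taxa in calls_by_classifier.values():
        for taxon in taxa:
            votes[taxon] = votes.get(taxon, 0) + 1
    return {taxon for taxon, count in votes.items() if count >= min_votes}


def concordance(pipeline_taxa: Set[str], amplicon_taxa: Set[str]) -> Dict[str, Set[str]]:
    """Split taxa into those seen by both approaches and those seen by only one."""
    return {
        "both": pipeline_taxa & amplicon_taxa,
        "pipeline_only": pipeline_taxa - amplicon_taxa,
        "amplicon_only": amplicon_taxa - pipeline_taxa,
    }


# Hypothetical calls for one fungal isolate.
example_calls = {
    "classifier_a": {"Rhizobium", "Escherichia"},
    "classifier_b": {"Rhizobium"},
    "classifier_c": {"Rhizobium", "Bacillus"},
}
example_consensus = consensus_taxa(example_calls)
example_report = concordance(example_consensus, {"Rhizobium", "Pseudomonas"})
```

Taxa that land in "both" would be the stronger candidates for true associates under these assumptions; taxa seen by only one approach are the ambiguous cases the abstract describes.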

Development and Verification of a Multiplexed, Custom AmpliSeq Panel for Antimicrobial Resistance Characterization of Isolates for the NIH Mycobacterium tuberculosis Quality Assurance Program

Poster - Abstract ID: 37

Ryan Howard, Brittany Knight, David Yarmosh, John Bagnoli, and Erin Tacheny

MRIGlobal, Global Health Surveillance & Diagnostics

The World Health Organization (WHO) estimates that in 2016, 490,000 people developed multidrug-resistant tuberculosis (MDR-TB). However, only an estimated 1 in 5 MDR-TB patients receive the correct antibiotic regimen, and better, more cost-efficient diagnostic tools are essential for detecting and treating MDR-TB patients. Here, MRIGlobal, in collaboration with Johns Hopkins University (JHU) and Illumina, Inc., presents a multiplexed amplicon-based sequencing panel to determine antimicrobial resistance (AMR) for clinical Mycobacterium tuberculosis (Mtb) isolates for the National Institutes of Health (NIH-DAIDS) Mycobacterium tuberculosis Quality Assessment Program (TBQA), contract number HHSN272201700001C. The panel leverages Illumina’s AmpliSeq technology and the iSeq 100 and MiSeq platforms in conjunction with established single nucleotide polymorphisms (SNPs), insertions, and deletions in the Mtb genome that have previously been shown to correlate with phenotypic AMR detection using standardized culture-based detection systems. We present this panel in an effort to provide better information to help diagnose and treat the unique AMR profile of each patient’s disease.

Characterization of Wesselsbron virus isolated from mosquito pools collected in Kabale district after confirmation of Rift Valley Fever in one human case, Uganda, 2016.

Poster - Abstract ID: 41

John T Kayiwa1, Mayanja M1, Senfuka F1, Nakayiki T1, Jeffrey W Koehler2, John M Dye2, Julius J Lutwama1

1. Arbovirology Department, Uganda Virus Research Institute, Entebbe, Uganda; 2. US Army Medical Research Institute of Infectious Diseases, Fort Detrick, Maryland, USA

Background: In 2016, the Uganda Virus Research Institute (UVRI) under the Ministry of Health (MoH) confirmed one case of Rift Valley Fever (RVF) in a truck driver in Kabale. This prompted the entomological investigation of mosquitoes collected from different villages in the Kabale district.

Objectives:

i. Identify mosquito species collected during the Rift Valley fever outbreak

ii. Characterize viruses isolated from dead whole mosquitoes.

Methods and Materials: CDC light traps baited with dry ice (carbon dioxide) were used to capture adult mosquitoes. The mosquitoes were identified to species at UVRI and pooled at twenty-five mosquitoes per vial. After trituration, the mosquito supernatant was inoculated on monolayers of Vero cells and incubated at 37°C in 5% CO2. Plaques were harvested and inoculated in T25 flasks for virus amplification. Total nucleic acid was extracted from 100 µl culture supernatant using 300 µl TRIzol LS (Thermo Fisher Scientific, Waltham, MA) and the EZ1 XL Advanced with the EZ1 Virus Mini kit 2.0 (Qiagen, Valencia, CA). Each sample was eluted in 60 µl elution buffer and amplified using the SeqPlex WTA kit (Sigma-Aldrich, St. Louis, MO). Libraries generated using the Apollo 324 System (WaferGen Biosystems Inc., Fremont, CA) with TruSeq HT adapters (Illumina, San Diego, CA) and the PrepX ILM 32i DNA Library Kit (WaferGen Biosystems Inc.) were sequenced on the MiSeq platform (Illumina, San Diego, CA) using the MiSeq Reagent Kit v2 (300-cycle). All sequence analysis was conducted using CLC Genomics Workbench v. 10.1.2 (Qiagen).

Results: Two viral RNA samples (5215 and 5236) had de novo contigs that were BLAST-identified as Wesselsbron virus (WSLV), with the closest sequence identity being to an isolate collected from South Africa in 1997. A pairwise comparison of the two WSLV genomes from Uganda showed 97.5% identity between the two isolates and 94.4-96.47% identity with the closest known sequence in GenBank (AV259/ZAF/1997).

Conclusion: Given the relatively few complete genome sequences in GenBank for WSLV (n = 6), this additional genetic information will help guide assay development efforts for this virus.
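The pairwise identity values reported in this abstract can be read against a simple definition of percent identity over an existing alignment. The sketch below is only one common convention (gap columns are skipped); the abstract does not state how CLC Genomics Workbench computed its figures, so the function name and the gap handling are assumptions.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of matching positions between two pre-aligned sequences.

    Columns where either sequence has a gap ('-') are ignored. Other
    conventions count gaps as mismatches and give lower values.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for base_a, base_b in zip(aligned_a.upper(), aligned_b.upper()):
        if base_a == "-" or base_b == "-":
            continue  # skip gap columns under this convention
        compared += 1
        if base_a == base_b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy sequences, not WSLV data: 7 of 8 positions match.
toy_identity = percent_identity("ACGTACGT", "ACGTACGA")
```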

Validation of a Single, Rapid, Multilocus Amplicon Sequencing Assay Using Limit of Detection Experiments for the Detection of Multiple Biothreat Agents

Poster - Abstract ID: 42

Alexandra H. Keene1, Adina L. Doyle1, James M. Schupp1, Jason W. Sahl2, Carina M. Hall2, Charles Williamson2, Adam J. Vazquez2, David M. Engelthaler1, David M. Wagner2, Paul Keim1,2

1. Translational Genomics Research Institute, Pathogen and Microbiome Division, Flagstaff, AZ, 86005; 2. Northern Arizona University, Pathogen and Microbiome Institute, Flagstaff, Arizona, 86011

In a bioterror event, accurate detection of the agent is necessary for the safety of our nation. Currently, single-locus polymerase chain reaction (PCR) is used for early detection of biothreat agents, providing only presence/absence information for a single organism. These PCR assays, if designed around poorly validated signatures, have led to false positives due to the similarity between organisms and their near-neighbor (NN) species. Here we describe a multiplex assay targeting Bacillus anthracis, Francisella tularensis, Yersinia pestis, Burkholderia pseudomallei and Burkholderia mallei. The assay includes 79 targets that detect the presence/absence of the organism as well as important plasmids, virulence and antimicrobial resistance factors, and sequence variant loci for NN species differentiation. The sensitivity/specificity of the assay was validated using 10-14 target agents and 11-48 NN strains. Limit of detection (LOD) experiments were conducted using dilutions (100 genome copies/µl - 0.01 genome copies/µl) of 2-3 strains of each agent against the multiplex assay. Sensitivity/specificity values for each agent are as follows: B. anthracis (81.01-100%, 100%), Y. pestis (79.87-100%, 90-100%), F. tularensis (98.02-100%, 87.5-100%), B. pseudomallei and B. mallei (79.87-100%, 90-100%). The B. anthracis LOD is 1-10 GE with 100K-400K reads, the F. tularensis LOD is 10-100 GE with 200K-300K reads, and the Y. pestis LOD is 1-10 GE with 200K-350K reads. The LOD for both Burkholderia agents is 0.1-1 GE with 200K-500K reads. Detection of low-level biothreat agents using a sensitive and specific amplicon indexing scheme is a superior alternative to the current single-locus PCR system, enabling us to sequence multiple agents in one sequencing run.
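For readers less familiar with the sensitivity/specificity pairs quoted in this abstract, the standard definitions are sketched below. The counts in the example are hypothetical and chosen only to show the arithmetic; they are not the validation counts from this study.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Fraction of target-agent samples that the assay correctly calls positive."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("no positive samples were tested")
    return true_positives / total


def specificity(true_negatives: int, false_positives: int) -> float:
    """Fraction of near-neighbor (non-target) samples correctly called negative."""
    total = true_negatives + false_positives
    if total == 0:
        raise ValueError("no negative samples were tested")
    return true_negatives / total


# Hypothetical panel: 10 target strains (9 detected) and 48 near-neighbor
# strains (45 correctly negative).
example_sensitivity = sensitivity(true_positives=9, false_negatives=1)
example_specificity = specificity(true_negatives=45, false_positives=3)
```

With these made-up counts the assay would have 90% sensitivity and 93.75% specificity.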

A Robust RNaseH-Based rRNA Depletion Method Enabling Rich Microbial Transcriptome Analysis

Poster - Abstract ID: 43

Drew Kennedy, Jeff Koble, Scott Kuersten, Fred Hyde, Asako Tan, Lisa Watson, Megha Ghildiyal, Amanda Young

Illumina, Inc.

We have developed a new universal ribosomal RNA (rRNA)-depletion workflow enabling efficient RNA-Seq, applied here to microbial transcriptomic analyses. Abundant rRNA is removed from total RNA by targeted hybridization to DNA probes and subsequent RNaseH-mediated cleavage. This new methodology facilitates rich transcriptome and metatranscriptomic analysis of microbial isolates and communities, respectively. In all samples tested, rRNA sequencing reads are significantly reduced, lowering microbial transcriptomics sequencing costs. Our method is uniquely compatible with low inputs (≥100ng) for microbial RNA samples, facilitating metatranscriptomic analysis of low biomass samples.

Low Genetic Diversity of Recent Foot-and-Mouth Disease Virus Serotypes O and A from Districts Located Along Uganda and Tanzania Border

Poster - Abstract ID: 44

Kerfua, Susan D1, 2*, Arinaitwe, Eugene3, Esau, Martin3, Shirima, Gabriel1, 4, Kusiluka, Lughano, Ayebazibwe, Christosome3, Cleaveland, Sarah5 and Haydon, Daniel T5

1. Nelson Mandela African Institution of Science and Technology, P.O. box 447, Arusha; 2. National Livestock Resources Research Institute, P.O.Box 5704, Tororo, Uganda; 3. National Animal Disease Diagnostics and Epidemiology Centre, Uganda; 4. Mzumbe University P.O.Box 1, Morogoro, Tanzania; 5. University of Glasgow, G128QQ, United Kingdom

Border areas are critical in disease epidemiology, and the Uganda-Tanzania border lies within an area that was identified as an FMD epidemiological cluster. To determine the relationship between foot-and-mouth disease viruses circulating in districts along the Uganda-Tanzania border between 2016 and 2017, phylogenetic analysis was carried out on virus sequences obtained from samples collected from the districts of Isingiro, Rakai, Missenyi and Kyerwe. Positive samples were subjected to conventional PCR using serotype-specific primers, and amplicons were sequenced. Samples from both countries were positive for serotypes O and A. Maximum likelihood phylogenetic trees were drawn using MEGA 7. Results showed limited divergence among serotype O viruses, with 4.9% nucleotide differences across the VP1 coding sequences, and all the generated serotype O viral sequences belonged to topotype East Africa-2. Similarly, serotype A sequence analysis showed that all the virus sequences obtained belonged to a single topotype, Africa-G1, with a sequence divergence of 7.4%. The obtained sequences were further compared with the current vaccines used in both countries, and results showed that the circulating serotype O strains belonged to a different topotype from the vaccine strains currently used in both countries. Although serotype A viruses isolated from Uganda belonged to the same lineage as the vaccine strain, significant variation was observed between the amino acid sequences in the regions upstream and downstream of the RGD motif. The resurgence of serotype A in both countries calls for more vigilant FMD surveillance and for more collaborative efforts towards strengthening control of FMD along the border.

SPAdes Family of Tools for Genome Assembly and Analysis: Current Status

Poster - Abstract ID: 46

Dmitry Antipov, Elena Bushmanova, Tatiana Dvorkina, Olga Kunyavskaya, Alla Lapidus, Dmitry Meleshko, Sergey Nurk, Andrey Prjibelski, Anton Korobeynikov

Saint Petersburg State University

Despite its central role in genomics, accurate de novo genome assembly remains challenging. Moreover, the proliferation of new sequencing and sample preparation technologies introduces additional levels of complications. Recently the SPAdes genome assembler, which was originally conceived as a scalable and easy-to-modify platform, was gradually extended into a family of SPAdes tools aimed at various sequencing technologies and applications.

In addition to the constantly updated SPAdes assembler itself, the toolset now includes:

- metaSPAdes assembler for metagenomics data;

- rnaSPAdes: de novo RNA‐seq data assembler;

- plasmidSPAdes: assembly of plasmids from the whole genome sequencing data;

- exSPAnder module for repeat resolution that enables efficient utilization of mate‐pair libraries and even mate‐pairs only assemblies with certain types of mate-pair libraries;

- hybridSPAdes module for hybrid assembly of accurate short reads with long error‐prone reads, such as Pacific Biosciences and Oxford Nanopore reads.

Recent additions to the SPAdes toolbox are:

- hpcSPAdes: a version of SPAdes that can utilize the resources of computation clusters;

- PathRacer: a standalone tool to align nucleotide and amino acid HMMs to assembly graphs;

- SPAligner: a standalone tool for accurate alignment of long noisy reads to assembly graphs;

- HybridMetaSPAdes: an adaptation of metaSPAdes that utilizes third generation sequencing data for metagenomic hybrid assembly and strain level deconvolution;

- bgcSPAdes: a tool aimed at accurate reconstruction of biosynthetic gene clusters using their domain structure (such genes often have high biomedical importance since they can be a source of active natural products including antibiotics);

- libSPAdes: a reusable library with methods and algorithms for genome assembly and analysis.

This work was supported by Russian Science Foundation (grant 19-16-00049).

Whole Genome Sequence based characterization of bacteriophages isolated from urban sewage – Georgia Ukraine regional collaboration

Poster - Abstract ID: 48

Adam Kotorashvili¹, Alla Kharina2, Natalia Kornienko2, Irena Budzanivska2, Nato Kotaria1

1. National Center for Disease Control and Public Health, Tbilisi, Georgia; 2. Taras Shevchenko National University of Kyiv, Kyiv, Ukraine

Introduction: Bacteriophages were discovered about 100 years ago. Phages are defined as viruses that infect bacteria. They are ubiquitous and are the most abundant organisms found in the biosphere. Until recent years, insufficient attention was paid to the role of bacteriophages in the natural environment or to the biodiversity of these viruses.

Because multi-drug resistance has become a major problem in the treatment of pathogenic bacterial infections, the use of bacteriophages has become an attractive approach to overcoming drug resistance. In this regard, our relatively poor knowledge of bacteriophage biodiversity is surprising. Although more than 2500 complete bacteriophage genome sequences have been deposited in the NCBI database, this number is still quite low.

In this work, particular focus has been put on whole genome sequence based characterization of a group of bacteriophages isolated from urban sewage, which was assumed to be a rich source of these viruses. This kind of study provides an example of the level of biodiversity of bacteriophages isolated from a single place.

Methods/Materials: Isolation of the phage cocktail and DNA extraction were performed at Taras Shevchenko National University of Kyiv (Kyiv, Ukraine), and DNA specimens were shipped to the Lugar Center of the National Center for Disease Control and Public Health of Georgia (Tbilisi, Georgia).

To isolate bacteriophages from urban sewage, a multiphage sewage sample was obtained through several steps of filtration of a raw sewage sample, amplification on a bacterial culture of Serratia marcescens, precipitation of virus-like particles and their further purification. Genomic DNA was isolated using AmpliSens® DNA-sorb-AM.

A library preparation was conducted using NEBNext Ultra DNA Library Prep kit for Illumina (New England BioLabs); DNA was amplified (10 PCR cycles) using indexed primers, and then purified using an appropriate volume of Ampure XP beads (Beckman Coulter). Libraries were quantified (Qubit HS DNA assay kit; Invitrogen) and assessed for fragment sizes (Bioanalyzer 2100, High Sensitivity kit; Agilent). DNA sequencing libraries were sequenced using 500 cycle sequencing kit on the Illumina MiSeq sequencing system with v2 chemistry.

Results: The sequence data were filtered at a Phred score threshold of > 30, and low-quality bases were trimmed. De novo assembly and reference-based analysis were performed using CLC Bio version 8.0.2 (https://www.qiagenbioinformatics.com), resulting in 274 contigs with an N50 of 128 kb and a maximum contig length of 286 kb. Contig analysis was performed using the BLAST tool of the National Center for Biotechnology Information (NCBI).

The analysis yielded draft genomes of the following phages: a Pseudomonas phage with a genome length of 286 057 bp (the genome size of reference Pseudomonas phage OBP is 284 757 bp); a Serratia phage of 275 798 bp (reference Serratia phage 2050HW: 276 025 bp); and a Xylella phage of 42 886 bp (reference Xylella phage: 43 869 bp). Interestingly, for contig 3, with a length of 176 682 bp, no significant similarity was found in the database.

Conclusion: The regional collaborative effort between Georgia and Ukraine will be continued. In the future, phage therapy might become an alternative way to treat multi-drug resistant (MDR) bacterial infections, and in this regard the detection and purification of new phages will be critically important.
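The Phred threshold of 30 used in this abstract corresponds to an estimated base-calling error probability of 1 in 1000. The sketch below shows that conversion and one possible read-level filter. The abstract does not say whether the threshold was applied per base or per read, so the mean-quality rule here is an assumption, as are the function names; the Phred+33 offset is the standard encoding for current Illumina FASTQ files.

```python
from typing import List


def phred_to_error_probability(quality: float) -> float:
    """Convert a Phred quality score Q to an error probability, P = 10^(-Q/10)."""
    return 10 ** (-quality / 10)


def decode_phred33(quality_string: str) -> List[int]:
    """Decode a FASTQ quality string that uses the Phred+33 ASCII offset."""
    return [ord(character) - 33 for character in quality_string]


def passes_mean_quality(quality_string: str, threshold: float = 30.0) -> bool:
    """Keep a read only if its mean Phred score is strictly above the threshold.

    Assumption: a mean-quality rule; the study may have filtered differently.
    """
    scores = decode_phred33(quality_string)
    if not scores:
        return False
    return sum(scores) / len(scores) > threshold


q30_error = phred_to_error_probability(30)
```

For example, the quality character "I" decodes to Q40 and "?" decodes to Q30, so a read made up only of "?" characters would not pass a strict "> 30" filter.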

Characterization of African Swine Fever Virus in Ukraine by Genome Sequencing and Phylogenetic Analysis

Poster - Abstract ID: 49

Ganna Kovalenko1, Xiao Bai2, Matthew R. Redlinger2, Anne Lise Ducluzeau3, Mykola Sushko4, Maryna Sapachova4, Andrii A. Mezhinskiy4, Devin M. Drown3, Eric Bortz1,2

1. Institute for Veterinary Medicine, Kyiv Ukraine, 2. University of Alaska Anchorage, 3. University of Alaska Fairbanks, 4. State Institute for Veterinary Laboratory Diagnostics (SSRILDVSE), Kyiv Ukraine

African swine fever (ASF) is a lethal hemorrhagic swine disease and has become a significant threat to the global pig industry from Europe to China. In Ukraine, there were 466 ASF outbreaks (2012-2019) in domestic pigs and wild boars. ASFV is a large enveloped dsDNA virus, with a 189 kbp genome and at least 150 genes. Most gene products have not been functionally characterized. Thus, circulating ASFV strains with different virulence are of great interest. We characterized the genomes of Ukrainian ASFV isolates using nanopore sequencing, reference-based assembly, and phylogenetic analysis. Nucleotide sequences of the B646L (p72) and B602L (CVR) genes illustrated that these ASFV belong to the virulent p72-genotype II, with 99% identity to ASFV isolates from Eastern Europe, Russia and China. Remarkably, two ASFV genomes from Ukraine harbored only 5 and 8 genotypic variations from the ASFV/Georgia/2007 reference strain. Three SNPs were in common: an 18 amino acid C-terminal truncation in MGF 110-1L; N414S in the NP419L DNA ligase OB-fold; and K323E in MGF 505-9R, an interferon evasion protein. Phylogenetic analyses of ASFV multigene family (MGF) members, proteins thought to influence cell tropism and host range of ASFV, were conducted to understand ASFV genome architecture and annotate novel virus strains. The sequences of MGF110 and MGF360 members were compared with those of 16 ASFV complete genome isolates belonging to four different genotypes (I, II, IX and X). The Ukrainian ASFV genomes contain 12 of 14 known members of MGF110 and 15 members of MGF360. This suggests a conserved architecture punctuated by discrete gene duplication events. Thus, long-read nanopore sequencing of the ASFV from outbreaks in Ukraine has provided new insight into the evolution and emergence of this pathogen.

Genomic Characterization of Viruses Identified in Upper Respiratory Samples in Dromedary Camels from United Arab Emirates (UAE)

Poster - Abstract ID: 51

Yan Li1, Abdelmalik Ibrahim Khalafalla2, Clinton R. Paden1, Mohammed F. Yusof2, Yassir M. Eltahir2, Zulaikha M. Al Hammadi2, Farida Al Hosani3, Aron J. Hall1, Susan I. Gerber1, Salama Al Muhairi2, Suxiang Tong1

1. Division of Viral Diseases, National Center for Immunization and Respiratory Diseases, Centers for Disease Control and Prevention, Atlanta GA, USA, 2. Animal Wealth Sector, Abu Dhabi Food Control Authority, Abu Dhabi, United Arab Emirates, 3. Health Authority Abu Dhabi, Abu Dhabi, United Arab Emirates

Because of the role played by camels in the emergence and transmission of MERS-CoV, it has become increasingly clear that camels can potentially serve as a source for zoonotic transmission of viruses to humans. Thus, it is important to understand the viruses that camels harbor and their zoonotic potential. Our recent metagenomic study of nasopharyngeal swab samples from dromedary camels in Abu Dhabi, UAE identified partial genomic sequences of five potentially novel virus species or strains using next generation sequencing (NGS). In this study we performed full genome sequencing to further characterize these viruses.

To fill the gaps in the NGS-derived sequences and obtain full genome sequences, PCR primers were designed based on the sequences obtained from NGS or on conserved regions among closely related, previously known viruses. The amplicons obtained from RT-PCR or PCR were sequenced by Sanger sequencing. Genome annotation and phylogenetic analysis were done to further characterize the virus genomes.

We obtained full or close-to-full genome sequences of five recently discovered camel viruses: camel polyomavirus (PyV) Abu Dhabi, camel Crimean-Congo hemorrhagic fever virus (CCHFV) Abu Dhabi, camel parainfluenza virus 3 (PIV3) Abu Dhabi, camel parainfluenza virus 4 (PIV4) Abu Dhabi and camel bocavirus 3 (BoV3) Abu Dhabi. The nucleotide identities of camel PyV, camel CCHFV, camel PIV3, camel PIV4 and camel BoV3 to the closest relatives are about 63% (to sheep polyomavirus), 87%, 76%, 90% (to the best matched CCHFV L, M, S segment, respectively), 85% (to bovine PIV3 isolate TVMDL20), 80% (to human PIV4b strain 04-13) and 85% (to canine BoV3 isolate UCD), respectively. Annotation and characterization of the genomes predict ORFs and genomic features similar to those of the other viruses in the same taxonomical group.

These genomic sequences provide data for a more accurate taxonomical classification of the novel camel viruses. We propose that the camel PyV be classified as a novel virus species, and the other four camel viruses as novel strains. Because we observe viruses in camels which are related to those that are known to cause disease in humans, we expect that camels may pose a risk as an intermediate species for transmitting viruses to humans.

A High-Quality Genome Assembly of the North American Song Sparrow, Melospiza melodia

Poster - Abstract ID: 52

Swarnali Louha1, David A. Ray3, Kevin Winker4, Travis Glenn1,2

1. Institute of Bioinformatics, University of Georgia, Athens, GA, USA; 2. Department of Environmental Health Science, College of Public Health, University of Georgia, Athens, GA, USA; 3. Department of Biological Science, Texas Tech University, Lubbock, TX, USA; 4. Department of Biology, University of Alaska, Fairbanks, AK, USA

The Song sparrow, Melospiza melodia, is widespread across North America and exhibits pronounced morphological diversity over its wide range (1). M. melodia has been used as a model organism in a broad range of behavioral and ecological studies (2). The specialized vocal learning capabilities of the Song sparrow have been widely used by neuroscientists for studying processes underlying memory and learning in humans (3,4). Although several other songbirds have been sequenced and studied (5,6), none show the degree of variation in behavior, morphology and demographics exhibited by the Song sparrow, which makes it a favorable candidate in several areas of biomedical research (7-13). To facilitate genetic studies that would advance these fields, we have generated a high-quality de novo genome assembly of M. melodia, using Chicago libraries and HiRise assembly software.

Our M. melodia genome assembly was 978.3 Mb in size and had high contiguity (scaffold N50 of 5.58 Mb) and completeness, with 87.5% full-length genes present out of a set of 4915 universal single-copy orthologs in avian genomes. We annotated our genome assembly and constructed 15,298 gene models, a majority of which had high homology to the related birds Taeniopygia guttata and Junco hyemalis. In total, 82.6% of the annotated genes were assigned putative functions. Furthermore, 5.93% of the genome was found to be repetitive, with transposable elements (TEs) constituting the majority of repeats (4.72%). Among all repeats identified, the M. melodia genome was found to be rich in LTRs and 2-mer microsatellite classes. Other non-genic features of interest, such as non-coding RNA, were also identified in our genome.

The high-quality M. melodia genome assembly and annotations we report will serve as a valuable resource for facilitating studies on genome structure and evolution that can contribute to biomedical research, as well as a reference in population genomic and comparative genomic studies of closely related species.

References 1) Pruett CL, Arcese P, Chan YL, Wilson AG, Patten MA, Keller LF, Winker K (2008). Concordant and Discordant Signals Between Genetic Data and Described Subspecies of Pacific Coast Song Sparrows, The Condor, 110 (2), 359-364. 2) Arcese P, Sogge MK, Marr AB, Patten MA (2002). Song sparrow (Melospiza melodia) In: Poole A., Gill F., editors. The Birds of North America. Philadelphia, PA: The Birds of North America, Inc., No. 704. 3) Doupe AJ, Kuhl PK (1999). Birdsong and Human Speech: Common Themes and Mechanisms, Annual Review of Neuroscience, 22 (1), 567-631. 4) White SA (2010). Genes and vocal learning, Brain and Language, 115 (1), 21-28. 5) Jarvis ED (2014). Whole-genome analyses resolve early branches in the tree of life of modern birds, Science, 346, 1320-1331. 6) Warren WC (2010). The genome of a songbird, Nature, 464, 757-762. 7) Hawkins RD, Bashiardes S, Helms CA, Hu L, Saccone NL, Warchol ME, Lovett M (2003). Gene expression differences in quiescent versus regenerating hair cells of avian sensory epithelia: implications for human hearing and balance disorders, Human Molecular Genetics, 12, 1261-1272. 8) Hawkins RD, Lovett M (2004). The developmental genetics of auditory hair cells, Human Molecular Genetics, 13, R289-R296. 9) Gosler AG (1996). Environmental and social determinants of winter fat storage in the great tit Parus major, J. Animal Ecology, 65, 1-17. 10) Schubert KA, Mennill DJ, Ramsay SM, Otter KA, Boag PT, Ratcliffe LM (2007). Variation in social rank acquisition influences lifetime reproductive success in black-capped chickadees, Biol. J. Linn. Soc, 90, 85-95. 11) Brugmann SA, Powder KE, Young NM, Goodnough LH, Hahn SM, James AW, Helms JA, Lovett M (2010). Comparative gene expression analysis of avian embryonic facial structures reveals new candidates for human craniofacial disorders, Human Molecular Genetics, 19, 920-930. 12) Sutter NB (2007). A single IGF1 allele is a major determinant of small size in dogs, Science, 316, 112-115. 13) Allen HL (2010). Hundreds of variants clustered in genomic loci and biological pathways affect human height, Nature, 467, 832-838.
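The contiguity figure quoted in this abstract (scaffold N50) has a simple definition: the length L such that scaffolds of length L or longer contain at least half of the total assembled bases. A minimal sketch of that calculation follows; the function name and the toy lengths are illustrative and are not taken from the assembly itself.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one length is required")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first length where the running sum reaches half the total.
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


# Toy example: total is 32, and the two longest pieces (8 + 8) reach half of it.
toy_n50 = n50([2, 2, 2, 3, 3, 4, 8, 8])
```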

From Short Reads to ‘Super Genomes’ in Large Cohorts

Poster - Abstract ID: 56

C. Nessner1, H. Doddapaneni1, F.J. Sedlazeck1, S. Medhat1, J. Jhangiani1, Y. Han1, Q. Meng1, H. Santibanez1, S. D. Kalra1, K. Walker1, V. Vee1, S. Lee1, M.L. Grove2, D. R. Murdock1, S.N. Richards1, G. Metcalf1, W.J. Salerno1, E. Boerwinkle1,2, D. Muzny1, R.A. Gibbs1

1. Baylor College of Medicine, Human Genome Sequencing Center, Houston, TX 77030; 2. Human Genetics Center, University of Texas Health Science Center at Houston, Houston, TX 77225

Short-read sequencing is the default choice for all current larger sequencing projects (>10k to >100k samples) focusing on population genomics and disease cohort studies. However, as shown by several recent studies, this approach alone is insufficient for comprehensive evaluation and discovery of genomic variants. Therefore, a framework for improving the quality of existing short-read genomic data by integrating long/linked reads sequencing is needed. We pursued a pilot project with the ultimate aim to generate ‘super genomes’ i.e. whole genome sequencing (WGS) data that fully represented the genome in its entirety – and all allelic variants. With the specific goals of extending contiguity, gathering phasing information and gaining resolution into complex alleles WGS data at 11x-25x coverage on multiple platforms (PacBio, 10x Genomics, Oxford Nanopore) and RNA-Seq data were generated for three human reference samples. A novel bioinformatics pipeline called Princes for handling these heterogeneous datasets was developed. This automated pipeline provided both haploid resolved single nucleotide variants (SNV) and structural variation (SV) call sets. Additionally, for the reference samples, a best global phasing of up to 67 Mb N50 and information on eQTLs in regions with genomic variations was obtained. A multitude of factors such as platform stability, cost per base, DNA quality, and DNA quantity requirements were also evaluated. This framework was applied to the Cardio Vascular Disease (CVD) cohort subset of 4,425 samples from the 22,600 samples having Illumina short- read WGS data. Using SVCollector (Sedlazeck et al., bioRxiv, 2018) on this cohort, the top 18 samples were selected automatically based on their ranking in descending order of SV representation with an allelic frequency of >0.001. These 18 samples collectively represent ~36.5% of the unique SVs called in the 22,600 samples. 
On average, 30.8 Gb of long-read data was generated on the PacBio Sequel system for each sample and applied for cohort improvement. On average, 4,000 SV calls in each of the 18 samples overlapped with SV calls made for the same sample from Illumina short-read data. Additionally, 6.4k to 17.2k SVs were identified in these 18 samples only in the long-read data. The majority of these SVs were insertions of >30 bp. Availability of the long-read data will help confirm the SVs identified in the short-read data and, in due course, minimize the SV false discovery rate in the short reads. By utilizing short reads and combining emergent technologies, ‘Super Genomes’ provides the ability to explore genomic data and gain new insights from it.
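The sample-selection step described above can be illustrated with a greedy set-cover heuristic: repeatedly pick the sample that contributes the most SVs not yet represented. This is only a minimal sketch. Whether it matches the SVCollector mode actually used in the study is an assumption, and the function name `greedy_select` and the toy SV identifiers are hypothetical.

```python
def greedy_select(sample_svs, k):
    """Pick up to k samples, each time taking the one that adds the most unseen SVs.

    sample_svs maps a sample name to the set of SV identifiers called in it.
    This is an illustrative heuristic, not SVCollector's implementation.
    """
    covered = set()
    chosen = []
    remaining = dict(sample_svs)
    for _ in range(k):
        if not remaining:
            break
        # Iterate names in sorted order so ties are broken deterministically.
        best = max(sorted(remaining), key=lambda s: len(remaining[s] - covered))
        if not remaining[best] - covered:
            break  # no remaining sample adds anything new
        covered |= remaining.pop(best)
        chosen.append(best)
    return chosen, covered


# Toy data (hypothetical): four samples and five distinct SVs.
samples = {
    "S1": {"sv1", "sv2", "sv3"},
    "S2": {"sv3", "sv4"},
    "S3": {"sv1", "sv2"},
    "S4": {"sv5"},
}
chosen, covered = greedy_select(samples, 2)
all_svs = set().union(*samples.values())
fraction = len(covered) / len(all_svs)
print(chosen, f"{fraction:.0%} of unique SVs represented")
```

The fraction printed at the end is the analogue of the ~36.5% figure reported for the 18 selected samples.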

NovaSeq: Performance Optimization for Sequencing Pipeline

Poster - Abstract ID: 57

N. Osuji1, Y. Han1, H. Dinh1, J. Santibanez1, K. Walker1, Z. Momin1, C. Nessner1, H. Doddapaneni1, E. Boerwinkle1,2, R. Gibbs1, D.M. Muzny1

1. Baylor College of Medicine HGSC, Houston, TX; 2. Human Genetics Center, University of Texas Health Science Center at Houston, Houston, TX.

The latest Illumina high-throughput sequencing has revolutionized genomics research and reshaped applications in clinical research. The NovaSeq platform further expands these opportunities with unprecedented capacity. Since the introduction of the platform in 2017, the Human Genome Sequencing Center (HGSC) has acquired 5 NovaSeq systems. Here we present our experience in optimizing the NovaSeq production pipeline in applications ranging from basic research whole genome and whole exome sequencing to clinical-quality NovaSeq sequencing.

To date, we have completed 300 S4 NovaSeq flow cells. These studies have included more than 4800 whole genome sequence (WGS) samples belonging to various NHGRI-funded projects that undertake variant discovery, seek to understand rare variants, and establish a genome database (TOPMed, CCDG). PCR-free library methods were evaluated and implemented for WGS to optimize coverage in GC-rich regions. The S4 flow cell and XP workflow are used as standard for cost and flexibility in sequencing production.

The sequencing libraries were pooled first based on qPCR quantification and re-pooled based on the sequencing results from the first “calibration” run. In this way, we reduced the coefficient of variation (CV) for an 81-plex pool from 17% to 6.4%. On average, 3.4 Tbases were generated per S4 flow cell with 27 WGS samples to yield a 37x average coverage, which exceeds Illumina’s specifications. We have implemented standard metrics including % pass filter, % aligned bases, % error rate, % unique reads and % Q30 bases to achieve > 90 Gb of unique aligned bases per lane. Genome coverage metrics are also tracked for 90% of the genome covered at 20x and 95% at 10x, with a minimum of 86 × 10^9 mapped, aligned bases with Q20 or higher.
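The arithmetic behind these pooling and coverage figures can be sketched as follows. This is an illustrative calculation only: the genome size of 3.1 Gb, the function names, and the toy read counts are assumptions, and the raw-yield coverage it computes is an upper bound that will sit above the reported 37x once filtering and duplicate removal are applied.

```python
from statistics import mean, pstdev

HUMAN_GENOME_BP = 3.1e9  # assumed approximate haploid human genome size


def mean_coverage(total_bases, n_samples, genome_bp=HUMAN_GENOME_BP):
    """Raw average depth per sample if the yield were split evenly."""
    return total_bases / n_samples / genome_bp


def coefficient_of_variation(values):
    """Population standard deviation as a percentage of the mean."""
    return pstdev(values) / mean(values) * 100


def rebalance_volumes(read_counts, volumes):
    """Scale each library's pooling volume inversely to its calibration-run yield."""
    target = mean(read_counts)
    return [v * target / r for v, r in zip(volumes, read_counts)]


# 3.4 Tbases over 27 samples, before any filtering.
raw_depth = mean_coverage(3.4e12, 27)

# Toy calibration run (hypothetical counts, arbitrary units).
calibration_reads = [80, 100, 120]
cv_before = coefficient_of_variation(calibration_reads)
new_volumes = rebalance_volumes(calibration_reads, [1.0, 1.0, 1.0])

print(f"raw depth ~{raw_depth:.1f}x, CV before re-pooling {cv_before:.1f}%")
print("adjusted pooling volumes:", [round(v, 3) for v in new_volumes])
```

Under-represented libraries get proportionally more volume in the second pool, which is the intuition behind the drop in CV after the calibration run.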

HGSC-BCM has dedicated most of its sequencing effort on the NovaSeq pipeline to WGS, but exomes are still a widely used option due to cost and time considerations. We have also successfully migrated the whole exome capture sequencing (WES) application to the NovaSeq pipeline. Given the sequencing capacity of the S4 flow cell, we increased the level of multiplexing in library pooling: 70-plex WES library pools are routinely run on one lane of an S4 flow cell. We are testing the smaller flow cell formats on NovaSeq (S2, S1 and SP) to evaluate the balance between sample size management and run cost. We have completed 1600 WES samples and, on average, 12.9 Gb were generated for each sample with 97.8% of targeted bases covered at ≥20x, surpassing the performance of the research exome pipeline and reaching clinical exome performance.

In summary, the sequencing capacity of NovaSeq could open new horizons for more highly powered experiments; even higher sample numbers will further reduce costs, allowing applications of sequencing technology in all areas of research. Following performance optimization, the HGSC has honed the pipeline and determined the NovaSeq’s suitability for the All of Us (AoU) Program.

Evaluation of a library enrichment panel for improved detection of viruses

Poster - Abstract ID: 58

B. Knight, K. Parker, J Stone, R. Winegar

MRIGlobal, 1470 Treeland Blvd, S.E., Palm Bay, FL 32909

The accuracy of infectious disease diagnosis is dependent on the tools available to the clinician. Traditionally, microbial and molecular methods such as culture and quantitative polymerase chain reaction (qPCR) have been used in clinical settings for identification of bacteria and viruses. Increasingly, metagenomic next generation sequencing (NGS) is being utilized for a non-targeted detection approach. One of the challenges with performing metagenomic NGS on clinical samples is the amount of host genetic material in the sample. The abundance of human DNA may preclude NGS detection of pathogens at clinically relevant titers. To address this shortfall, we have investigated the use of library enrichment techniques to enhance the detection of viral targets in clinical samples. Twist Bioscience offers a pan-viral probe panel for the enrichment of a broad range of viruses. We evaluated this viral panel with two library preparation kits: TruSight RNA Pan-Cancer kit and Nextera Flex for Enrichment. The results of this initial assessment demonstrate a significant reduction in host reads and increase in viral reads for enriched libraries compared to untreated libraries. While further method optimization is required, these findings suggest that application of library enrichment technologies to current sample preparation methods could greatly improve pathogen detection sensitivity.

The benefits of reliable pre-plated DNA-seq reagents for high throughput NGS library prep

Poster - Abstract ID: 60

Shannon Piehl

Unknown

The applications of next-generation sequencing (NGS) continue to grow in clinical genomics testing research as the cost decreases. Currently, library preparation can be a major bottleneck in a lab’s NGS workflow when performed manually at the bench, as it tends to be labor-intensive, time-consuming, error-prone, and operator-dependent. As throughput demands grow, automation of library preparation helps reduce some of these issues, though not entirely, since reagent preparation and plate setup steps are still typically performed manually. Here we highlight the benefit of a commercially available pre-plated NGS library prep kit versus the standard, tube-based version of the same kit that requires upstream preparation prior to automation. To demonstrate these benefits, library prep was performed using gDNA and FFPE samples on the PerkinElmer® Sciclone® NGSx workstation using the NEXTFLEX® Rapid XP DNA-Seq reagents. Setup times of (1) manual mixing and plating of reagents from tubes and (2) pre-plated reagents were compared, as well as the reproducibility and quality of both methods. We find a 5- to 10-fold decrease in robot setup time, depending on the technician. Additionally, libraries generated using the pre-plated reagents were more reproducible and showed no failures or reagent plating errors. These results highlight the ability of pre-plated library prep reagents to save time and minimize cost, all while providing reliable results from run to run, which are beneficial features for any laboratory needing a robust, high-throughput DNA-seq solution for their unique lab applications.

Automatic Gene Annotation for Biofilms

Poster - Abstract ID: 61

Britney Gibbs1, David Millman1, Brendan Mumey1, Thiruvarangan Ramaraj2, Lucia Williams1

1. Gianforte School of Computing, Montana State University, Bozeman MT 59715, USA; 2. National Center for Genome Resources, Santa Fe, New Mexico 87505, USA

Biofilms are microbial communities consisting of multiple organisms which can be found in diverse settings, such as the human body, rivers and lakes, and oil pipelines. As such, understanding biofilms is important to advance medical knowledge and practice, manage natural resources, and improve industrial productivity, and scientific research on biofilms in particular has increased in recent years. However, there is no centralized database for storing, querying, and analyzing biofilm data. Biofilm data can include images, videos, nucleotide sequences, experimental conditions, and many other forms of information. We are building the Biofilm Resource and Information Database (BRaID) to fill this gap, thereby increasing biofilm research productivity.

In addition to storing biofilm data, BRaID will include tools to aid biofilm researchers in asking and answering questions of their data, or other data stored in the database. Because biofilms are present in numerous settings, many biofilms are understudied and poorly understood. When a researcher performs a genetic analysis on a biofilm, they may find that a significant number of genes of interest are not well annotated by any single resource. Thus, they must search across multiple databases for an understanding of what is known about their gene, and what function it may have in the process they are attempting to understand and describe. In this project, we develop a tool for automatically annotating a set of genes with a description and information from the Gene Ontology, the Conserved Domain Database from NCBI, and InterPro. By compiling this disparate information automatically rather than through a series of clicks performed by a human, we reduce the time needed for a researcher to annotate a set of thousands of genes from weeks or months to hours. Additionally, data generated by our tool is reproducible.

Differences in thermotolerance between ecotypes of Neurospora discreta are due primarily to only two genomic regions

Poster - Abstract ID: 62

Aaron J. Robinson1, Miriam I. Hutchinson1, Igor V. Grigoriev2,3, John W. Taylor3 and Donald O. Natvig1

1. Department of Biology, University of New Mexico, Albuquerque, NM, USA; 2. DOE Joint Genome Institute, 2800 Mitchell Dr, Walnut Creek, CA, USA; 3. Department of Plant & Microbial Biology, University of California, Berkeley, CA, USA

Differences in maximal growth temperature among Neurospora discreta isolates from the western United States correlate with differences in mean annual environmental temperature. Isolates from New Mexico and Alaska exhibit comparable growth rates below 35°C, but isolates from New Mexico grow much better near and above 40°C. Individual progeny from crosses between isolates from New Mexico and Alaska either possess one of the two parental temperature phenotypes or have an intermediate phenotype. The range of progeny phenotypes suggests the involvement of multiple gene regions. With support from the DOE Joint Genome Institute (JGI) Community Science Program (CSP), we obtained complete genome sequences for 82 progeny from crosses with parents from NM and AK. Progeny were selected to exhibit either the New Mexico parental temperature phenotype or the Alaska parental phenotype (39 NM-like and 43 AK-like progeny). High-quality genome assemblies of the parental strains were obtained utilizing sequence data from both Illumina (JGI) and Oxford Nanopore MinION platforms. Bulked-segregant analysis was also performed using the MinION platform and combined genomic DNA from 22 New Mexico-like progeny and 29 Alaska-like progeny. Comparative analyses of genomes from these two progeny pools demonstrated two regions associated with thermotolerance above 40°C. Differences between the parental isolates in a region of linkage group III demonstrated a strong link between genotype and phenotype and indicated amino-acid modifications that could result in more thermotolerant proteins. It is striking that this region in N. discreta overlaps a region previously identified in N. crassa that may be under selection in adaptation to cold. A region on linkage group I plays a secondary role in determining the thermotolerant phenotype.

Mercury Lab – Bring the Laboratory to the Sample

Poster - Abstract ID: 65

Joseph A. Russell

MRIGlobal – 65 West Watkins Mill Road, Gaithersburg, MD, USA 20878

The ability to extract biological information from a given environment has undergone substantial change in just the past 3 years. Some of the latest hardware designed for quantitative PCR, antibody and protein detection, and genomic sequencing can easily fit in your coat pocket. This points to a future where advanced molecular diagnostics, biosurveillance, and forensic testing no longer require transporting a sample back to a central reference laboratory. Samples can be processed on site, at the point-of-need, alleviating processing bottlenecks and dramatically reducing the time to acquire an actionable result. However, this future has not yet been realized. Although the footprint of the molecular hardware has become remarkably small, the operational footprint of the work has not. Ancillary equipment including (but not limited to) a stable power supply, cold-chain storage, reagent/consumables storage and transport, computational capacity, a stable workbench, biohazard waste disposal strategies, biosafety equipment, and other logistics, all necessary for the effective use of the full complement of modern, hand-held genomics hardware, can grow the operational footprint of these devices to sizes that are not reasonable to deploy at the point-of-need. MRIGlobal has developed a product to address this problem. This product is a purpose-built platform that provides all the necessary operational equipment in a human-centered laboratory-workbench design, making rapid, reproducible deployment of advanced genomic technologies to field-forward locations feasible. This mobile laboratory, Mercury Lab, is a first-of-its-kind product that was built to lower the barrier-to-entry of modern molecular hardware where it is needed most.

Isolation and Separation of DNA and RNA from a Single Sample

Poster - Abstract ID: 68

Lauren Saunders, Han Wei, Brittany Niccum, Antonia Hur, and Asmita Patel

Beckman Coulter Life Sciences

Extraction of intact, high-quality gDNA or RNA is one of the most common procedures in molecular biology labs. With the advance of NGS, scientists can analyze both genomic and transcriptomic information from the same sample. Techniques that offer the ability to isolate both DNA and RNA from a single biological sample are crucial for this application. There is still a great need for improved protocols in this area, especially for precious and low-yield samples. Currently, most researchers use TRIzol® to isolate RNA and DNA from the same tube by extracting DNA from the organic phase and RNA from the aqueous phase. Although this method can extract high-quality RNA, the DNA yield is often low. With many biological samples in short supply, it is crucial to maximize yields. In this study, a novel buffer solution that selectively binds DNA was tested with mouse tissue and mammalian cell culture samples. This buffer binds DNA to magnetic beads, leaving the RNA in the supernatant. The supernatant is then removed, and the RNA is bound to a second aliquot of magnetic beads. This preferential binding buffer removes the need for lysate splitting, which is common in applications that isolate RNA and DNA, and should lead to higher overall yields.

High Quality Draft Genome Assemblies for Stored Product Insects from 10X Chromium Libraries

Poster - Abstract ID: 69

Erin D. Scully1, Scott M. Geib2, Sheina Sim2, Nathan Palmer3, Scott E. Sattler3, and Gautam Sarath3

1. USDA-ARS Center for Grain and Animal Health Research, Stored Product Insect and Engineering Research Unit, Manhattan, KS 66502; 2. USDA-ARS Daniel K. Inouye U.S. Pacific Basin Agricultural Research Center, Hilo, HI 96720; 3. USDA-ARS Wheat, Sorghum, and Forage Research Unit, Lincoln, NE 68583

Invasive and emerging insect pests often pose imminent threats to ecosystems and cause significant agricultural losses in both the pre- and post-harvest stages. Historically, genome and transcriptome sequencing of insect species have provided tremendous insights into their metabolic and physiological potentials, facilitated the development of molecular barcodes for taxonomic classification, and led to the identification of mutations associated with pesticide and fumigant resistance and genetic factors that allow insects to exploit new ecological niches. Despite the utility of genome sequences in understanding the biology of emerging and invasive insects and facilitating management decisions, insect genome assemblies have been hampered by a number of challenges. 10X Chromium libraries coupled with HiSeqX sequencing largely overcome these challenges and have led to the assembly of high quality draft genomes for aphids and stored product insects, including several emerging and invasive species. Recovery of conserved single copy orthologs (BUSCOs) exceeded 92% and >80% of the total assembly length was present in >1000 scaffolds in the majority of assemblies. Overall, these assemblies exceeded the contiguity of several previously published insect genomes, suggesting that 10X Chromium libraries represent a viable approach for obtaining fast and reliable assemblies for insect genomes.

Subtracting Metagenomic Backgrounds: Towards the Ability to Remove Laboratory-specific-Noise from Metagenomic Analyses

Poster - Abstract ID: 71

Alan Shteyman, Joseph Russell

MRIGlobal

Metagenomic sequencing is increasingly used to determine the microbial community of environmental and clinical samples. These methods are attractive in a biosurveillance, and potentially diagnostic, framework. Specific organisms-of-interest can be identified while also obtaining contextual genotype- and community-level data. However, there are challenges in leveraging metagenomic sequencing data in this way. A primary challenge, addressed here, is that of discerning the signal of an organism-of-interest in a queried sample from the level of signal of that same organism in the ‘background’ environment. Laboratory and/or operator contaminants, and inter-sequencing-run cross-contamination, are common confounding variables when analyzing shotgun sequencing data. While there are several traditional methods to control for background signal in metagenomic analyses, a new tool called PanGIA (Pan-Genomics for Infectious Agents) has been developed that allows intuitive control over robust background subtraction through a simple graphical-user-interface (GUI). PanGIA does this by first classifying shotgun sequencing reads from a user-defined ‘background sample’ (i.e., negative control, or ‘blank’). The microbial community profile (including number of reads, depth of coverage, linear coverage, etc.) for the background sample is stored in a JSON file and remains accessible for reference in subsequent analysis of ‘target’ samples. When a target sample is run, and ‘background scoring’ is selected, each organism detected in the target sample is compared to its own profile in the background sample (if it was present in the background). Any regions of the organism’s reference genome that are covered by reads in the background sample are ‘masked’, and the remaining coverage of the organism in the target sample is used to calculate a ‘background confidence score’.
In short, this is an “intersection-over-the-union” metric (intersection / union) that can be calculated a number of different ways, described herein. Simulated in-silico and “real-world” wet lab studies were done to validate this method and robust performance was documented. However, this approach may just be the beginning of how PanGIA can do background subtraction. Here, we describe ways that machine learning may be used to greatly reduce the number of background control samples needed to generate background confidence scores in a given laboratory, through the training of a machine learning model of the common background signals seen in that laboratory. In this way, over time, PanGIA can “learn” the particular biases of the laboratory or environment it is analyzing and error-correct signal output accordingly. This could lead to more directly comparable metagenomic data across different laboratories. Additionally, the methods described here can be implemented to mitigate systematic noise associated with wet-lab procedural shifts such as reagents from new vendors, new technical staff, or integrating new workflows, as well as potentially allowing for brand new ways to leverage shotgun sequencing data across environmental biosurveillance and clinical diagnostics missions.
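The masking idea described in this abstract can be illustrated with a small interval calculation. The sketch below is not PanGIA's implementation: the function names, the half-open interval representation, and the two example scores are assumptions chosen only to show how background-covered regions could be masked and how an intersection-over-union ratio is formed.

```python
def covered_positions(intervals):
    """Expand half-open [start, end) intervals into a set of covered positions."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end))
    return positions


def background_masked_fraction(target_intervals, background_intervals):
    """Fraction of target-covered positions that survive background masking."""
    target = covered_positions(target_intervals)
    if not target:
        return 0.0
    background = covered_positions(background_intervals)
    return len(target - background) / len(target)


def intersection_over_union(a_intervals, b_intervals):
    """Shared covered positions divided by positions covered in either sample."""
    a = covered_positions(a_intervals)
    b = covered_positions(b_intervals)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy reference coordinates (hypothetical): target covers 0-99, background 50-149.
target_cov = [(0, 100)]
background_cov = [(50, 150)]

masked_score = background_masked_fraction(target_cov, background_cov)
iou = intersection_over_union(target_cov, background_cov)
print(f"unmasked fraction {masked_score:.2f}, IoU {iou:.3f}")
```

Expanding intervals into position sets keeps the sketch short; for real genomes an interval-arithmetic approach would be needed to stay within memory.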

RipTide™ Ultra High-Throughput Rapid DNA Library Preparation for Next Generation Sequencing

Poster - Abstract ID: 72

Azeem Siddique,1,3 Gaia Suckow,1,3 Nils Homer,2 Jorge Bahena,1,3 Phillip Ordoukhanian,1,3 Steve Head,1,3 Keith Brown3

1. The Scripps Research Institute, La Jolla, CA; 2. Fulcrum Genomics, Somerville, MA; 3. iGenomX, Carlsbad, CA

Whole Genome Shotgun Sequencing has become the tool of choice for microbial genome analysis. Rapidly declining costs of sequencing, data analysis, data storage and database access will continue to drive adoption. Library construction has not kept pace with these advancements, with costs of preparing a next generation sequencing (NGS) library often exceeding the cost of sequencing. Popular methods of library construction for NGS include fragmentation, end-repair and adapter ligation, and transposase-mediated adapter insertion. The RipTide High Throughput Rapid DNA Library Prep is distinctly different in its approach because it relies on polymerase-mediated primer extension for library preparation. The initial step of the prep, involving primer extension with barcoded random primers, is performed in a 96-well plate. Each well of the plate contains primers with a unique barcode; consequently, the library generated from each well is uniquely identifiable and can be bioinformatically traced back to the original sample after sequencing. Following this step, the primer extension products are combined into one pool and all subsequent steps, including second strand synthesis and PCR, are performed with the single pool. The library prep is fast, easily automatable and can be tuned to genomes of high and low GC content. With automation, 960 samples can be processed in a single day. The technology will aid genetic research by helping to increase sample throughput and by reducing processing steps and operating costs. Presented here is RipTide High Throughput Rapid DNA Library Prep sequencing data generated from multiple microbial genomes.
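The step of tracing pooled reads back to their well of origin is a demultiplexing problem. The sketch below shows the general idea only; it assumes, purely for illustration, that the well barcode sits at the start of each read and is matched exactly, and the barcode sequences, barcode length, and function name are hypothetical rather than taken from the RipTide kit.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_well, barcode_len):
    """Group reads by well using an exact match on a leading barcode.

    Reads with an unrecognized barcode are kept, untrimmed, under "unassigned".
    """
    by_well = defaultdict(list)
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        well = barcode_to_well.get(barcode)
        if well is None:
            by_well["unassigned"].append(read)
        else:
            by_well[well].append(insert)
    return dict(by_well)


# Hypothetical 8-base barcodes for two wells of a 96-well plate.
barcodes = {"ACGTACGT": "A1", "TGCATGCA": "A2"}
pooled_reads = ["ACGTACGTGGGG", "TGCATGCACCCC", "NNNNNNNNAAAA"]

wells = demultiplex(pooled_reads, barcodes, barcode_len=8)
print(wells)
```

Production demultiplexers typically also tolerate a small number of barcode mismatches, which this sketch omits.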

RipTide™ High Throughput NGS Library Prep for Genotyping in Populations

Poster - Abstract ID: 73

Azeem Siddique1,3, Gaia Suckow1,3, Nils Homer2, Phillip Ordoukhanian1,3, Steve Head1,3, Keith Brown3, Lior Glick5, Kobi Baruch5, Paul Doran3, Alvaro Hernandez4

1. The Scripps Research Institute, La Jolla, CA; 2. Fulcrum Genomics, Somerville, MA; 3. iGenomX, Carlsbad, CA; 4. University of Illinois at Urbana-Champaign, Urbana, IL; 5. NRGene

High throughput genotyping technologies are required for large-scale population genetics. Evolutionary biology studies, human disease research and large-scale agricultural breeding programs all lend themselves to technologies that are able to provide more information at lower cost. Over the past decade, genotyping technology has transitioned from PCR-based SNP assays to microarrays, and is now shifting toward high-throughput genotyping by sequencing (GBS). The RipTide High Throughput Rapid DNA Library Prep allows for the preparation of NGS libraries from up to 960 individually barcoded samples in a few hours with automation. When combined with low coverage sequencing and imputation-based genotype analysis, the result is an order of magnitude greater information at a significantly reduced cost. Here we present data on 96 Zea mays (maize) samples consisting of 4 parent populations and 92 recombinant inbred lines (RILs). For each sample, hundreds of thousands to millions of haplotype markers, including SNVs and structural variants, are accurately detected. A minimum of 95% complete coverage of direct and imputed markers is obtained for each RIL. The approach can be applied to any species, regardless of genome size or GC content. In this study, a median of >1 million markers were genotyped by sequencing on an Illumina HiSeq 4000 instrument for an estimated cost of library construction and sequencing of < $25 per sample.

Genotyping by sequencing of Canis familiaris using iGenomX RipTide™ DNA library preparation

Poster and Tech Talk - Abstract ID: 74

Azeem Siddique1,3, Gaia Suckow1,3, Nils Homer2, Phillip Ordoukhanian1,3, Steve Head1,3, Keith Brown3, Paul Doran3, Matt Huentelman6, Joseph Pickrell5, Alvaro Hernandez4

1. The Scripps Research Institute, La Jolla, CA; 2. Fulcrum Genomics, Somerville, MA; 3. iGenomX, Carlsbad, CA; 4. University of Illinois at Urbana-Champaign, Urbana, IL; 5. Gencove, New York, NY; 6. Tgen, Phoenix, AZ

Dogs have been living with humans for approximately 15,000 years. Selective breeding has created a multitude of dog breeds with distinct characteristics. Great interest exists in understanding how selection has affected the modern dog genome and what variants are linked to specific canine breed characteristics. Dogs are also susceptible to a number of diseases that have counterparts in humans. Their unique population structure, relatively limited heterogeneity within breeds, greater genome sequence identity to humans than that of mice, and their sharing of a common environment with humans make them an excellent model organism for certain human diseases.

The iGenomX RipTide library prep is a high throughput DNA library prep for next generation sequencing that has been used to prepare libraries for a variety of applications where large numbers of samples require library preparation at low cost. One such application is genotyping by sequencing. Here we show the use of the RipTide library prep in a case-control GWAS, generating over 30 million biallelic SNPs per sample on a cohort of West Highland White Terriers. After filtering, more than 5.2 million SNPs were identified with a minor allele frequency of >5%. Principal component analysis (PCA) showed that the variants permitted the accurate identification of breeds. The data also showed a novel genetic association with Westie lung disease, the canine equivalent of chronic obstructive pulmonary disease in humans.

The iGenomX RipTide library prep combined with Illumina sequencing generated more variants in less time and at lower cost than the standard microarray-based genotyping experiment.

Immuno-biotechnology and bioinformatics in Community Colleges

Poster and Oral - Abstract ID: 75

Todd M. Smith1, Sandra G. Porter1,2, Dina Kovarik2

1. Digital World Biology, Seattle WA; 2. Shoreline Community College, Shoreline WA

Immuno-biotechnology is one of the fastest growing areas in the field of biotechnology. Digital World Biology’s Biotech-Careers.org database of biotechnology employers (>6800) has nearly 700 organizations that are involved with immunology in some way. With the advent of next generation DNA sequencing, and other technologies, immuno-biotechnology has significantly increased the use of computing technologies to decipher the meaning of large datasets and predict interactions between immune receptors (antibodies / T-Cell receptors / MHC) and their targets.

New technologies like immune profiling (where large numbers of immune receptors are sequenced en masse) and targeted cancer therapies (where researchers create, engineer, and grow modified T cells to attack tumors) are leading to job growth and demands for new skills and knowledge in biomanufacturing, quality systems, immuno-bioinformatics, and cancer biology. In response to these new demands, Shoreline Community College (Shoreline, WA) has begun developing an immuno-biotechnology certificate. Part of this certificate includes a five-week course (30 hours hands-on computer lab) on immuno-bioinformatics.

The immuno-bioinformatics course includes exercises in immune profiling, vaccine development, and operating bioinformatics programs using a command line interface. In immune profiling, students explore T-cell receptor datasets from early stage breast cancer samples using Adaptive Biotechnologies’ (Seattle, WA) immunoSEQ Analyzer public server to learn how T-cells differ between normal tissue, blood, and tumors. Next, they use the IEDB (Immune Epitope Database) in conjunction with Molecule World (Digital World Biology) to predict antigens from sequences and verify the results to learn the differences between continuous and discontinuous epitopes that are recognized by T-cell receptors and antibodies. Finally, to get hands-on experience with bioinformatics programs, students will use cloud computing (CyVerse) and igBLAST (NCBI) to explore data from an immune profiling experiment.

A Comparison of Sequencing Library Preparation Using Illumina Nextera XT, Illumina DNA Flex, and New England Biolabs NEBNext Ultra II Kits

Poster - Abstract ID: 79

Jenny Truong1, Angela Poates2, Patti Lafon2, and Eija Trees3

1. Oak Ridge Institute for Science and Education (ORISE), Oak Ridge, TN, USA; 2. IHRC, Inc., 2 Ravinia Drive, Suite 1200 Atlanta, GA, 30346; 3. Enteric Diseases Laboratory Branch, US CDC, Atlanta, GA, USA

Centers for Disease Control and Prevention, Atlanta, GA, 30329, United States

PulseNet Central at the Centers for Disease Control and Prevention (CDC) is the coordinator for PulseNet, the national public health surveillance network for foodborne disease. The central lab is continually testing new whole genome sequencing (WGS) platforms and library preparation kits for potential use within the network. PulseNet’s current standard operating procedures (SOPs) for WGS methods are based on Illumina library preparation kits, Nextera XT and DNA Flex (https://www.cdc.gov/pulsenet/pathogens/wgs.html). The objective of this study is to evaluate the New England Biolabs NEBNext Ultra II FS DNA Library Prep kit as a cost-effective alternative to the Illumina kits.

Libraries for sixteen Campylobacter species, sixteen Escherichia coli, and seven Salmonella enterica isolates were prepared using the NEBNext Ultra II kit at 5-minute fragmentation time and per the manufacturer’s stated SOP. The DNA libraries were sequenced on the Illumina MiSeq using v2 chemistry (500c) and loaded at 80 MBs, and the resulting data was compared to sequencing data generated from the same isolates prepped using the Illumina Nextera XT and DNA Flex kits and the same sequencing chemistry on the MiSeq. Average read lengths, coverage, and quality scores were compared between datasets from all three kits using the CG-pipeline (github.com/lskatz/CG-Pipeline). The sequences were assembled using CLC Genomics Workbench version 11 using a minimum distance of 50 and maximum distance of 700 for paired reads. In de novo assembly, the minimum contig length was 500 bp with scaffolding and auto-detected paired distances. In mapping back to contigs, the parameters were: Mismatch cost = 2, Insertion cost = 3, Deletion cost = 3, Length fraction = 0.5, Similarity Fraction = 0.8.

The libraries from NEBNext Ultra II had an average read length of 240.2bp (range: 221 to 247.2bp) compared to Nextera XT’s average 214.4bp (range: 147.8 to 245.3bp) and DNA Flex’s average 228.9 bp (range: 188.5 to 241bp). NEBNext sequences had an average quality score of 35.8, versus XT with average quality score of 34.5 and Flex’s average of 35. The coverage required by PulseNet SOPs was obtained with all kits. Across all species, NEBNext Ultra II preps produced average N50 of 334,258 versus Nextera XT’s 192,368 and DNA Flex’s 330,892 including scaffolding. NEBNext Ultra II had an average contig count of 67.4 compared to Nextera XT’s 103.1 and DNA Flex’s 67.7 including scaffolding.
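The N50 values compared above follow the usual definition: the contig length at which the longest contigs, added in descending order, first cover half of the total assembly. A minimal sketch of that calculation is shown here; the function name and the example contig lengths are hypothetical and are not data from this study.

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly size."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly


# Hypothetical contig lengths in bp (illustration only).
example_contigs = [400_000, 300_000, 150_000, 100_000, 50_000]
print(n50(example_contigs))
```

Because N50 depends on which contigs are counted, comparisons between kits are only meaningful when the same minimum contig length (500 bp in this study) is applied to every assembly.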

NEBNext libraries appear to have higher and more consistent average read lengths, which may be attributed to the U loop adaptors used during preparation that may be more efficient at targeted ligation. These results show that the NEBNext and DNA Flex kits exhibit comparable performance in their ability to generate libraries for sequencing on the MiSeq and outperform the older Nextera XT kit. Further validation studies will include additional organisms, capacity testing, and wgMLST analysis.

Development of an Amplicon Sequencing Multiplex Capable of Escherichia coli Strain-Level Resolution, Mixture Deconvolution, and Antimicrobial Resistance Mechanism Identification Directly from Urine Samples

Poster - Abstract ID: 80

Adam J Vazquez, Kristen J. Kyger, Alden Miller, Jason W. Sahl

Northern Arizona University; Pathogen and Microbiome Institute

Escherichia coli causes more than 85% of the estimated 8 million urinary tract infections (UTIs) that occur in the United States annually, causing over 13,000 deaths. Antimicrobial resistance (AMR) mechanisms are emerging in E. coli strains and spreading globally, threatening the ability to treat common infectious diseases, such as UTIs. Currently, clinical laboratories diagnose E. coli infections directly by passaging urine samples on selective media. However, this strategy overlooks bacterial diversity within urine samples. While it is sufficient for treatment and diagnosis, it complicates efforts to understand the transmission and diversity of this important human pathogen. In this project, the known diversity of E. coli was used to design an amplicon sequencing (AmpSeq) panel targeting an E. coli species marker, a single locus informative marker (SLIM), 8 highly diverse genomic loci, and 11 AMR targets, that can rapidly catalog the diversity of E. coli directly from urine in a more cost-effective workflow. Our approach identifies and deconvolutes mixtures, provides strain-level resolution for source attribution and contact tracing, and provides resistance information for targeted patient therapy. Our research group received approximately 300 UTI-associated urines from the Flagstaff Medical Center (FMC) and processed DNA extracted from these clinical specimens using our AmpSeq approach. Flagstaff is an ideal location for these types of studies due to its geographic isolation and single hospital system that processes all clinical UTI samples. Genotypes from urine samples were placed within a phylogeny along with a large database of E. coli genomes isolated from Flagstaff UTIs, as well as from commercially available meat products. The results demonstrate that many of our samples fall into the ST131 and ST195 clades, both of which are known to cause UTIs and include multidrug resistant strains.
Once strains were associated with types of samples, we performed whole genome sequencing (WGS), which provides high-resolution relationships between E. coli circulating in the Flagstaff community and UTIs entering the Flagstaff medical system. Additionally, this approach confidently identifies transmission events. Results from the WGS phylogeny also confirmed the strain distribution observed using the AmpSeq approach, further confirming our rapid, cost-effective design. Combined with the assays designed to identify antimicrobial resistance mechanisms, this AmpSeq assay can improve our understanding of the source of UTIs and help focus preventative measures and effective treatment therapies. We acknowledge Northern Arizona Healthcare (NAH) and FMC for providing the residual clinical urine samples.

Detecting genomic contamination using 7-gene MLST with ColorID

Poster - Abstract ID: 82

Eshaw Vidyaprakash1, Lee S. Katz1,2, Taylor Griswold1, Henk den Bakker2, and Heather A. Carleton1

1. Enteric Diseases Laboratory Branch, Centers for Disease Control and Prevention, Atlanta, Georgia; 2. Center for Food Safety, University of Georgia, Griffin, Georgia

The Enteric Diseases Laboratory Branch of CDC receives over 1 TB of genomic data per month submitted by PulseNet, the foodborne molecular surveillance network. Some of these genomes are contaminated by other species (inter-species) or by the same species (intra-species). Detecting contamination by eye is intractable at this scale, so we have employed bioinformatics methods, such as Kraken and MIDAS, to detect inter-species sequence contamination. To address intra-species contamination, we tested heterozygous 7-gene MLST calls as an identifiable marker for intra-species contamination.

To detect intra-species contamination, allele calls were performed on multiple in silico raw read datasets using ColorID (https://github.com/hcdenbakker/colorid), an open source software tool that analyzes raw read input sequences against an MLST database and calls alleles using a k-mer matching method based on BIGSI, which can be viewed as a probabilistic colored de Bruijn graph. To test intra-species contamination detection, we artificially contaminated 5 Salmonella enterica genomes in silico; in each instance, one of the five genomes acted as the contaminant sequence for each of the other four genome sequences. The MLST profiles of these genomes shared a minimum of two identical alleles. Different levels of contamination were generated for all our tests. Three of the five genomes served as the main contaminants; for each, we used contamination levels of 0%, 5%, 7.5%, 10%, and 50%. Each of these contamination levels was tested with different k-mer sizes (21 to 75).

As the k-mer length increased, the number of reads that matched the ColorID database decreased, and when the k-mer length decreased, the specificity of the allelic hits decreased. This specificity decrease was most notable with the loci aroC and hisD. If a locus matched the database more than once, we were able to infer contamination. As a result, we found that k-mer sizes 27, 35, 37, and 39 yield the best sensitivity and specificity when testing against an allelic database with ColorID. We also sought to determine the lowest detectable contamination. We ran all contaminated genomes with k-mer size 27 at lower contamination levels of 5.5%, 6%, 6.5%, and 7.0% to determine where we observe allele calls from the contaminant. Most contaminants could be detected at 6%, except one notable genome at 10%. This discrepancy could be attributed to very similar MLST profiles between the base and contaminant genomes, where five out of seven alleles were identical and only hisD and purE were different.
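The multi-allele signal described above can be illustrated with a minimal sketch. This is not ColorID's implementation (which uses a BIGSI index); it is a toy exact k-mer set match, and the function names, the short allele sequences, and the reads are all hypothetical.

```python
def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def call_alleles(reads, allele_db, k):
    """Call every allele whose k-mers are all present in the reads.

    allele_db maps locus -> {allele_name: sequence}.
    Returns locus -> sorted list of called allele names.
    """
    read_kmers = set()
    for read in reads:
        read_kmers |= kmers(read, k)
    calls = {}
    for locus, alleles in allele_db.items():
        called = []
        for name, seq in alleles.items():
            allele_kmers = kmers(seq, k)
            # Require a non-empty k-mer set fully contained in the reads.
            if allele_kmers and allele_kmers <= read_kmers:
                called.append(name)
        calls[locus] = sorted(called)
    return calls


def flag_contamination(calls):
    """Loci with more than one allele call suggest a mixed sample."""
    return sorted(locus for locus, names in calls.items() if len(names) > 1)


# Hypothetical toy data: two loci, two alleles each.
toy_db = {
    "hisD": {"hisD_1": "ACGTACGTAA", "hisD_2": "ACGTACCTAA"},
    "aroC": {"aroC_1": "TTGACCGGTA", "aroC_2": "TTGACCGATA"},
}
# Reads from a base genome (hisD_1, aroC_1) plus one contaminant read (hisD_2).
toy_reads = ["GGACGTACGTAAGG", "CCTTGACCGGTACC", "GGACGTACCTAAGG"]

toy_calls = call_alleles(toy_reads, toy_db, k=5)
print(toy_calls)
print(flag_contamination(toy_calls))
```

In this toy setting the trade-off reported above is visible in the choice of k: a shorter k makes accidental matches to near-identical alleles more likely, while a longer k requires longer exact matches and so fewer reads contribute.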

We demonstrated that our pipeline was able to detect contamination at levels as low as 6% in most cases. Our pipeline is available here: https://github.com/lskatz/SneakerNet/blob/master/SneakerNet.plugins/sn_detectContamination-mlst.pl. We plan to continue this research by determining contamination at lower levels and by trying to detect intra-species contamination with more tools.

Investigation of a virulent lineage of Clostridioides difficile (ST42) present in clinical and environmental samples in Flagstaff, Arizona using whole genome sequencing and comparative genomics

Poster - Abstract ID: 83

Charles Williamson, Nathan Stone, Amalee Nunnally, Heidie Hornstra, David Wagner, Paul Keim and Jason Sahl

Pathogen and Microbiome Institute, Northern Arizona University

Clostridioides difficile is a diarrheagenic pathogen that can cause symptoms ranging from mild disease to toxic megacolon and death. Several lineages of C. difficile have been frequently associated with C. difficile infection (CDI) in humans. One of these lineages is sequence type 42 (ST42) in the multilocus sequence typing scheme for C. difficile. Recently, ST42 has been identified as one of the most prevalent C. difficile sequence types among adults in the United States. In this study, we investigated ST42 isolates (n=79) identified from human clinical samples, environmental samples and companion animal samples in Flagstaff, Arizona, USA. Our goals were to place these ST42 isolates into a global context, to gain insight into potential sources of human infections, and to understand antimicrobial resistance within the lineage. A phylogeny built from core genome single nucleotide polymorphisms (SNPs) indicated that ST42 isolates are closely related to ST28 isolates, which have also been isolated in Flagstaff. ST42 isolates from human clinical samples, environmental samples and companion animal samples are distributed throughout the phylogeny, and in some cases isolates from human clinical samples are closely related to isolates from the environment or companion animals, which suggests that these non-healthcare-related reservoirs could be potential sources of human infections. In silico screening of ST42 genomes for antimicrobial resistance markers indicates that ST42 isolates from Flagstaff generally lack many of the antimicrobial resistance markers described in other lineages of C. difficile, suggesting that currently described antimicrobial resistance may not be a major factor in CDI cases associated with ST42 in northern Arizona. The presence of the ST42 lineage in multiple sample types in northern Arizona suggests that this lineage is an ecological generalist capable of survival in and transmission between multiple reservoirs. 
This work focuses on ST42 but is part of a larger effort to characterize the diversity of C. difficile present in northern Arizona.

Barcoding method for Typing of Culex pipiens

Poster - Abstract ID: 88

Mariam Zakalashvili, Nato Dolidze, Lamzira Tskhvaradze, Magda Dgebuadzedze, Mari Gavashelidze, Irakli Sikharulidze, David Putkaradze, Paata Imnadze

National Center for Disease Control and Public Health of Georgia, Lugar Center for Public Health Research

Culex pipiens is a potential vector of West Nile virus (WNV) worldwide and is the most widespread mosquito in Georgia. It colonizes different habitats in urban and rural areas, thriving in both polluted and clean breeding water. This species includes two biological forms, Culex pipiens pipiens and Culex pipiens molestus, which are morphologically similar but differ in their behavior and biology.

To determine the molecular profiles of Culex pipiens populations from six different regions of Adjara, 456 adult Culicinae (Culex) mosquitoes were collected within the routine surveillance program of the National Center for Disease Control and Public Health (NCDC). The objective of this study was to find Culex pipiens molestus, a known vector of WNV.

DNA was extracted from mosquito leg samples with corresponding morphological vouchers only, using the QIAgen® Blood and Tissue Kit, and PCR was performed. mtDNA COI barcoding sequences were generated for 192 specimens; sufficient sequence data were obtained from 150 samples and used to construct a phylogenetic tree with MEGA 7. Final data analysis is pending.

Identification of vectors associated with human illnesses is crucial for rapid detection and disease recognition in the region.

Ultra-High-throughput PCR-Free WGS library workflow

Poster - Abstract ID: 90

Michael Mueller, Zeineen Momin, Robert Glenn, Kimberly Walker, Glen Savery, Yimiti Meiheerguli, Dilrukshi Bandaranaike, Harsha Doddapaneni, Richard Gibbs, Donna Muzny

Human Genome Sequencing Center - Baylor College of Medicine, Houston, TX

With the launch of the NovaSeq platform, it is now possible to sequence 3-4x as many Whole Genome Sequencing (WGS) samples as on the HiSeq X platform, and NovaSeq can also sequence exome and RNA-Seq samples, up to 280 at a time on the same flow cell. While this supports very large scale sequencing projects, the current practice of preparing libraries in batches of 96 creates a bottleneck for NovaSeq throughput. To address this problem, HGSC set out to scale up the library preparation workflow by optimizing the robotic methods on 384-well liquid handlers. Processing libraries in batches of 384 also has significant cost advantages due to reaction miniaturization and 4x labor savings. The key to successful implementation of library reactions in 384-well plates is using liquid handling platforms that have speed, efficiency, repeatability, and accuracy. However, there is currently no commercial system that supports, right out of the box, preparation of libraries, especially PCR-Free libraries, in batches of 384. Five liquid handlers (Labcyte Echo, Formulatrix Mantis/Tempest, Biomek iSeries, Tecan Fluent, Hamilton Star) were evaluated for pipetting accuracy and consistency. These platforms were also assessed for their ability to mix reagents, solvents, and beads as required at various steps during library preparation. Based on this evaluation, the Hamilton platform configured with a 384 multi-channel head was used for robotic script development and testing. Several customizations, such as an in-house modified thermocycler, a special magnet, and a unique 384-well deep plate, were made to complete this task. Illumina test libraries prepared on 384-well plates were successfully sequenced to generate 110 Gb (37x coverage) of unique data, and 97% of the bases in the genome were covered to 20X depth. These metrics are comparable to WGS libraries currently being prepared on the Beckman FxP robot in a 96-well format.
Further, comparison of high quality variant calls generated for this library showed a high concordance rate of 98.9% with NIST reference calls. With this configuration, a single 384 liquid handler can generate 9,168 libraries per month (~110k/yr), enough to support ~11 NovaSeq 6000 instruments sequencing 27 libraries/flow cell to 30x WGS coverage on an S4 flow cell.
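The capacity figures quoted above can be checked with simple arithmetic. The sketch below only restates the numbers given in the abstract; the derived flow cells per instrument per year is an inference that assumes libraries are spread evenly across the 11 instruments, and is not a figure reported by the authors.

```python
# Figures taken from the abstract.
LIBRARIES_PER_MONTH = 9_168
LIBRARIES_PER_FLOW_CELL = 27
INSTRUMENTS = 11

libraries_per_year = LIBRARIES_PER_MONTH * 12
# Derived values (assumption: even distribution across instruments).
flow_cells_per_year = libraries_per_year / LIBRARIES_PER_FLOW_CELL
flow_cells_per_instrument = flow_cells_per_year / INSTRUMENTS

print(libraries_per_year)                # 110016, i.e. ~110k/yr
print(round(flow_cells_per_instrument))  # 370
```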

Implementing a scalable sequencing analysis pipeline for clinical whole-genome samples sequenced on the Illumina NovaSeq platform

Poster - Abstract ID: 91

Jesse Farek, William Salerno, Ziad Khan, Eric Venner, Donna Muzny, Richard Gibbs

Unknown

Decreasing costs in next-generation sequencing (NGS) and the increasingly critical importance of non-coding regions in clinical interpretation have prompted a shift from whole-exome sequencing to whole-genome sequencing (WGS) in clinical sequencing projects. The Baylor College of Medicine Human Genome Sequencing Center (BCM-HGSC) sequences and analyzes over 2,000 human whole-genome samples per month from several projects and collaborators, with a growing number of these samples as part of clinical studies. Adapting WGS analysis to clinical projects introduces additional analysis requirements, such as shorter turn-around times and the reliable detection of variants with known clinical significance. We present CLERIC, a sequencing analysis pipeline for analyzing whole-genome clinical samples sequenced on the Illumina NovaSeq platform, from FASTQ sequences through alignment, variant calling, and variant annotation. CLERIC is modeled on NHLBI TOPMed analysis specifications, using the GRCh37 human reference genome, with optimizations for computational scalability. CLERIC utilizes software with low computational resource requirements in various pipeline steps, including xAtlas, a lightweight SNV and small indel variant caller developed at BCM-HGSC. Further improvements in scalability are achieved through the optional use of Sentieon NGS data processing software or Illumina's DRAGEN platform. Sentieon’s highly optimized BAM and CRAM processing produces alignments that are nearly identical to those processed by more conventional sequencing analysis software, while providing a two- to eight-fold decrease in processing time for BAM and CRAM processing pipeline steps. CLERIC is implemented as a Snakemake workflow, which allows flexible control of pipeline and execution parameters. Clinically significant variants, including those from ACMG and OMIM panels, were confirmed to be detectable by CLERIC analysis in validation samples.

A High-Quality de novo Genome Assembly from a Single Mosquito using PacBio Sequencing

Poster - Abstract ID: 98

Nick Sisneros1, Primo Baybayan1, Haynes Heaton2, Juliana Cudini2, Nancy Holroyd2, Alan Tracey2, Christine C. Lambert1, Sarah Kingan1, Brendan Galvin1, Jonas Korlach1, Matthew Berriman2, and Mara Lawniczak2

1. Pacific Biosciences, Menlo Park, CA, USA; 2. Wellcome Sanger Institute, Hinxton, Cambridgeshire, UK

A high-quality reference genome is an essential tool for studies of plant and animal genomics. PacBio Single Molecule, Real-Time (SMRT) Sequencing generates long reads with uniform coverage and high consensus accuracy, making it a powerful technology for de novo genome assembly. While PacBio is the core technology for many large genome initiatives, relatively high DNA input requirements (3 µg for standard library protocol) have placed PacBio out of reach for many projects on small, noninbred organisms that may have lower DNA content.

Here we present high-quality de novo genome assemblies from single invertebrate individuals for two different species: the Anopheles coluzzii mosquito and the Schistosoma mansoni parasitic flatworm. A modified SMRTbell library construction protocol without DNA shearing and size selection was used to generate a SMRTbell library from just 150 ng of starting genomic DNA. The libraries were run on the Sequel System with chemistry v3.0 and software v6.0, generating a range of 21-32 Gb of sequence per SMRT Cell with 20-hour movies (10-12 Gb for 10-hour movies), and followed by diploid de novo genome assembly with FALCON-Unzip. The resulting assemblies had high contiguity (contig N50s over 3 Mb for both species) and completeness (as determined by conserved BUSCO gene analysis). We were also able to resolve maternal and paternal haplotypes for 1/3 of the genome in both cases.

By sequencing and assembling material from a single diploid individual, only two haplotypes are present, simplifying the assembly process compared to samples from multiple pooled individuals. This new low-input approach puts PacBio-based assemblies in reach for small, highly heterozygous organisms that comprise much of the diversity of life. The method presented here can be applied to samples with starting DNA amounts around 150 ng per 250 Mb – 600 Mb genome size.

Wednesday, May 22nd
7:30 AM – 8:30 AM Breakfast (Sponsored by Perkin Elmer & Swift)
8:30 AM – 8:45 AM Welcome & Opening Remarks
8:45 AM – 9:30 AM Keynote 2: Dr. Deborah Hung, #38
• Diagnosis and treatment of infectious disease
9:30 AM – 10:30 AM Oral Session Part 4: (Chairs: Mike Fitzgerald & Tootie Tatum)
• Linking the resistome to the microbiome: A culture-free method links plasmid, virus, and antimicrobial resistance genes to their hosts in complex microbial populations (Eacker, #18)
• ROCker for improved antimicrobial resistance determinant detection in stool and isolate samples (Rowell, #64)
• Whole genome and targeted sequencing of drug-resistant Mycobacterium tuberculosis on the iSeq100 and MiSeq (Colman, #12)
10:30 AM – 11:00 AM Break (Sponsored by PacBio)
11:00 AM – 12:20 PM Oral Session Part 5: (Chairs: Bob Fulton & Donna Muzny)
• Dissemination of OXA-23 producing Acinetobacter baumannii during an outbreak in a long-term care facility (Young, #87)
• Genome wide association studies (GWAS) and transcriptomics identifies cryptic antimicrobial resistance mechanisms in Acinetobacter baumannii (Roe, #63)
• PiReT: Pipeline for Reference-based Transcriptomics (Shayka, #70)
• GeneTable: A tool for comparative genomic analyses of microbial genomes (Kittichoirat, #45)
12:20 PM – 1:40 PM Lunch (Sponsored by iGenomx)
1:40 PM – 3:00 PM Oral Session Part 6: (Chairs: Kenny Yeh & Tootie Tatum)
• Amplicon prediction pipeline for an extended MLST approach to culture-independent pathogen subtyping (Lucking, #53)
• T-MArC: Targeted Metagenomic Analysis through marker Creation (Yarmosh, #85)
• The utility of high throughput amplicon sequencing in the characterization of bacterial pathogens in complex backgrounds (Sahl, #67)
• BioLaboro: An end to end application for detecting molecular assay signature erosion and design of new assays in response to emerging new biothreats (Sozhamannan, #76)
3:00 PM – 3:40 PM Genome Center updates (Baylor, WashU, JGI)
3:40 PM – 4:10 PM Break (Sponsored by Phase Genomics)
4:10 PM – 5:40 PM Tech Talks Part 2: (Chairs: Alla Lapidus & Chris Detter)
• Nextera DNA Flex library preparation for soil shotgun metagenomics analysis (Koble; Illumina, #94)
• Lesson Learned from 100 plants & animals de novo genome assembly using long read data (Fungtammasan; DNAnexus, #27)
• Highs and lows of low coverage, high quality genotyping (Hill; Perkin Elmer, #95)
• Genotyping by sequencing of Canis familiaris using iGenomX RipTide™ DNA library preparation (Siddique; iGenomX, #74)
• The Sequel II system – The next evolution of SMRT sequencing (Michelle; PacBio, #96)
• TBD (Jeremy Preston; Illumina, #99)
5:40 PM – 6:00 PM Walk to Cowgirl Café
6:00 PM – 8:30 PM Cowgirl Café Happy Hours Social - Food & Drinks Served (Sponsored by Illumina)

Breakfast

Wednesday, May 22nd, 7:30 AM – 8:30 AM, La Fonda Ballroom

Sponsored by Perkin Elmer & Swift

Diagnosis and Treatment of Infectious Disease

Keynote Speaker - Abstract ID: 38

Deborah T. Hung

1. The Broad Institute of Harvard and MIT; 2. Massachusetts General Hospital Department of Molecular Biology and Center for Integrative and Computational Biology; 3. Harvard Medical School Department of Genetics

Infectious disease continues to be a prominent global health threat. This issue is exacerbated by emergent pathogens and the rise of drug resistance that are significant challenges to our healthcare system. The accelerated innovation cycle inherent in genome science offers the ability to transform the diagnosis and treatment of infectious disease by enabling new or previously untenable approaches. Dr. Hung will present new genomic approaches to understand the biology and medicine of infectious disease, employing novel technological innovation in areas from genomic surveillance to drug discovery to conduct investigations for a range of infectious organisms.

Linking the Resistome to the Microbiome: A Culture-Free Method Links Plasmid, Virus, and Antimicrobial Resistance Genes to their Hosts in Complex Microbial Populations

Oral - Abstract ID: 18

Stephen Eacker1, Maximilian Press1, Shawn Sullivan1, Thibault Stalder2, Derek M. Bickhart3, Sergey Koren4, Eva M. Top2, Adam M. Phillippy4, Timothy P.L. Smith3, Ivan Liachko1*

1. Phase Genomics, Seattle, WA, USA; 2. University of Idaho, Moscow, ID, USA; 3. USDA-ARS, Madison, WI, USA; 4. NHGRI, Bethesda, MD, USA

Background: The rapid spread of antibiotic resistance is a global health threat. A range of environments have been identified as reservoirs of the antibiotic resistance genes (ARGs) found in pathogens, but we lack understanding of the origins of these ARGs and their spread from environment to clinic. This is partly due to an inability to identify the bacterial hosts of ARGs and the mobile genetic elements that mediate horizontal gene transfer due to the loss of intra-cellular contiguity upon DNA extraction.

Methods: In two recent studies we describe the application of proximity-ligation methods for the determination of the in situ host range of numerous ARGs, viruses, plasmids, and integrons within complex microbiome samples. This method forms physical junctions between sequences present within the same cell prior to DNA extraction. Subsequent sequencing generates a dataset that robustly connects mobile elements to their hosts and can assemble de novo genomes from mixed communities.

Results and Conclusions: Our application of this technology to complex wastewater and rumen samples yielded hundreds of novel ARG-, virus-, and plasmid-host interactions, as well as over a thousand new microbial genomes. These studies highlight the power of the proximity-ligation approach to deconvolving microbiome samples and foreshadow the development of rapid culture-free strategies for tracking and managing the spread of antimicrobial resistance.

Figure: Crosslinks are created in vivo, capturing intra-cellular DNA content prior to cell lysis. Workflow steps: Crosslink, Fragment, Proximity Ligation, Sequence Junctions.

References:

Linking the Resistome and Plasmidome to the Microbiome; Thibault Stalder et al., Dec. 2018, bioRxiv

Assignment of virus and antimicrobial resistance genes to microbial hosts in a complex microbial community by combined long-read assembly and proximity ligation; Derek Bickhart et al., Dec. 2018, bioRxiv

ROCker for improved antimicrobial resistance determinant detection in stool and isolate samples

Oral - Abstract ID: 64

J. L. Rowell1,2, S. Zhang3, L. M. Rodriguez-R3, B. Suttner4, V. M. Caban Figueroa4, K. T. Kostantinidis3,4 , J. R. Hensley1,5, K. C. Dillon1,5, B. A. Aspinwall1,5, Y. Gao1,6, N. Kanwar7, K. L. Weltmer8,9, C. J. Harrison9,10, R. Selvarangan7,9, J. Besser1, H. A. Carleton1, E. Trees1, A. D. Huang1, A. J. Williams-Newkirk1

1. Enteric Diseases Laboratory Branch, US Centers for Disease Control and Prevention, Atlanta, GA, USA; 2. Weems Design Studio, Suwanee, GA, USA; 3. School of Civil and Environmental Engineering, Georgia Institute of Technology, Atlanta, GA, USA; 4. Department of Biological Sciences, Georgia Institute of Technology, Atlanta, GA, USA; 5. Oak Ridge Institute for Science and Education, US Department of Energy, Oak Ridge, TN, USA; 6. IHRC, Inc., Atlanta, GA, USA; 7. Department of Pathology and Laboratory Medicine, Children’s Mercy Hospitals – KC, Kansas City, MO, USA; 8. Division of General Academic Pediatrics, Children’s Mercy Hospitals – KC, Kansas City, MO, USA; 9. University of Missouri – Kansas City, School of Medicine, Kansas City, MO, USA; 10. Division of Infectious Diseases, Children’s Mercy Hospitals – KC, Kansas City, MO, USA

Characterization of the resistome of complex metagenomic samples has broad applications for medicine, public health, and food safety. Existing methods using phenotypic or genotypic testing of bacterial isolates are costly, laborious, and biased. We previously presented a proof-of-concept system using highly multiplexed amplicon sequencing (HMAS) panels and our YAAMR v1 pipeline for data analysis for faster, cheaper, and more sensitive detection of multiple antimicrobial resistance determinants (ARDs) directly from complex samples such as stool. However, YAAMR v1 used manual filtering of HMAS data mapped using BWA to a curated reference ARD database to identify ARDs based on arbitrary depth and coverage cutoffs. This method was suboptimal because it used an arbitrary cutoff for depth and coverage, was limited to ARDs present in the database used, was unable to differentiate between closely-related ARD variants, and was difficult to reproduce between users. To address these limitations, we developed YAAMR v2, which replaces read mapping and manual filtering with ROCker profiles to allow automated detection of target ARDs in short read datasets and typing down to the family or variant level.

ROCker uses the Receiver Operating Characteristic (ROC) to identify bitscore thresholds of maximum sensitivity and specificity along sequences of a family of target proteins. This set of thresholds forms the ROCker model (or filter), which is then used to detect the target proteins encoded by short reads. Each family of ARDs requires at least one different ROCker model, but each model only needs to be generated once to be used in the pipeline for ARD detection. Closely-related protein families that encode functionally distinct enzymes can be used as negative references in ROCker to avoid false positive matches of reads encoding the non-target proteins.
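The idea of choosing a bitscore threshold from a ROC analysis can be sketched for a single position window. This is an illustrative simplification, not ROCker's code: the function name, the use of Youden's J (sensitivity + specificity - 1) as the selection criterion, and the bitscores below are all assumptions made for the example.

```python
def best_bitscore_threshold(target_scores, nontarget_scores):
    """Pick the threshold t maximizing sensitivity + specificity - 1,
    where a read is called 'target' when its bitscore >= t.

    Ties are resolved in favor of the lowest threshold, which favors
    sensitivity.
    """
    candidates = sorted(set(target_scores) | set(nontarget_scores))
    best_t, best_j = None, -1.0
    for t in candidates:
        sensitivity = sum(s >= t for s in target_scores) / len(target_scores)
        specificity = sum(s < t for s in nontarget_scores) / len(nontarget_scores)
        j = sensitivity + specificity - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# Hypothetical bitscores for simulated reads in one window.
toy_target = [62.0, 58.5, 55.0, 49.0]      # reads from target proteins
toy_nontarget = [51.0, 44.0, 40.5, 38.0]   # reads from negative references

print(best_bitscore_threshold(toy_target, toy_nontarget))
```

Repeating such a selection window by window along the reference sequences would give a set of position-specific thresholds, which is the role the abstract assigns to the ROCker model.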

We demonstrate a new application of ROCker to ARD detection using the blaTEM family of beta-lactamases, first on mock datasets of known ARD composition and subsequently on short read HMAS data from 32 healthy stool samples collected by the New Vaccine Surveillance Network. The test of the blaTEM filter on the mock datasets identified no false negatives (reads encoding target genes that the ROCker filter missed) and almost no false positives (reads encoding non-target genes identified by ROCker as reads encoding target genes).

The YAAMR pipeline streamlines the process of ARD detection in complex samples using HMAS, which is valuable in public health surveillance settings where bioinformatic resources are limited. YAAMR does not allow host attribution, but it is a low-cost approach to triage samples for more labor-intensive and costly investigation. Using the blaTEM resistance family, we demonstrated that the inclusion of ROCker in YAAMR v2 improved the sensitivity and specificity of detection, reduced hands-on analysis time, improved reproducibility, and may also allow discovery of novel variants. Future work will include adapting the pipeline for deployment on multiple systems and continued development of ROCker models for additional ARDs.

Whole genome and targeted sequencing of drug-resistant Mycobacterium tuberculosis on the iSeq100 and MiSeq

Oral - Abstract ID: 12

Rebecca E. Colman1,2, Aurélien Mace1, Marva Seifert2, Jonathan Hetzel3, Haifa Mshaiel2, Anita Suresh1, Darrin Lemmer4, David M. Engelthaler4, Claudia M. Denkinger1 and Timothy C. Rodwell1,2

1. Foundation for Innovative New Diagnostics, Campus Biotech, Geneva, Switzerland; 2. Department of Medicine, University of California, San Diego, CA, USA; 3. Illumina Inc., San Diego, CA, USA; 4. Translational Genomics Research Institute, Flagstaff, AZ, USA

Tuberculosis is the leading cause of death from a single infectious agent worldwide, and the transmission of drug-resistant Mycobacterium tuberculosis jeopardizes the World Health Organization’s goal to end the tuberculosis epidemic by 2035. Accurate detection and characterization of drug resistance by phenotype is hindered by the extremely slow growth of the organism, which provides an opportunity for rapid next generation sequencing (NGS) solutions for diagnosis of drug-resistant TB. NGS, either targeted (amplicon-based) or whole-genome, has been proposed as an approach to capture comprehensive antibiotic resistance information for M. tuberculosis, and can both identify and characterize drug resistance in the bacterial population. However, the possibility of mixed populations of alleles at drug resistance loci complicates the setting of thresholds for SNP calls. The intrinsic error rate of the sequencing process affects how accurately resistance-associated mutations can be detected, making it difficult to discern a true subpopulation from background sequencing error.

We assessed the technical performance of the recently released Illumina iSeq100 instrument in comparison to the MiSeq for both targeted NGS and WGS for drug-resistant M. tuberculosis characterization. We produced sequencing libraries from clinical isolates and sequenced the same libraries on the iSeq100 and the MiSeq platforms. Contrived mixtures of pan-susceptible and drug-resistant strains (50, 10, 5, and 2% drug-resistant strain) were prepared, and targeted NGS was performed to examine the iSeq100 platform's performance for detection of drug-resistant subpopulations in comparison to the MiSeq. We examined the error profiles produced by each sequencing platform for both targeted NGS and WGS to set thresholds for mixture detection. Understanding the error profile of each platform makes it possible to examine the limit at which a true subpopulation can be discerned from error. For the WGS, we observed equivalent uniform genome coverage, consistent depth of coverage for the batching scheme used, and 94.0% (CI 93.1%–94.8%) agreement for all variants found using the cloud-based ReSeqTB bioinformatics pipeline. For the targeted NGS, we found 99.6% (CI 98.0%–99.9%) agreement for drug resistance associated SNPs between the iSeq100 and MiSeq data sets and correctly identified the 50, 10, and 5% resistant allele mixtures above background error across all loci involved. Taking into account the error profiles for the two sequencing platforms, the 2% mixture was also detected above the background error, but only after algorithm optimization, illustrating the need for a deep understanding of the error profiles of different sequencing instruments for mixture detection.
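The idea of calling a resistant subpopulation only when it rises above background sequencing error can be illustrated with a small sketch. It is not the authors' algorithm: it assumes a simple binomial model with a single per-base error rate, and the function names, the error rate of 0.5%, the depth of 1,000 reads, and the significance cutoff are all hypothetical values chosen for illustration.

```python
from math import exp, lgamma, log


def binomial_tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), computed in log space for stability."""
    def log_pmf(i: int) -> float:
        return (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                + i * log(p) + (n - i) * log(1 - p))
    return sum(exp(log_pmf(i)) for i in range(k, n + 1))


def min_alt_reads(depth: int, error_rate: float, alpha: float = 1e-6) -> int:
    """Smallest alternate-allele read count that error alone would
    produce with probability below alpha (assumed binomial error model)."""
    for k in range(depth + 1):
        if binomial_tail(depth, k, error_rate) < alpha:
            return k
    return depth + 1


def is_detectable(depth: int, fraction: float, error_rate: float,
                  alpha: float = 1e-6) -> bool:
    """True if a subpopulation at the given allele fraction is expected to
    yield enough reads to clear the error-derived threshold."""
    expected_alt = round(depth * fraction)
    return expected_alt >= min_alt_reads(depth, error_rate, alpha)


if __name__ == "__main__":
    # Hypothetical locus: 1,000x depth, 0.5% per-base error rate.
    for frac in (0.50, 0.10, 0.05, 0.02):
        print(frac, is_detectable(1000, frac, 0.005))
```

Under this model the threshold depends on both depth and the platform's error rate, which is why a low-fraction mixture may be callable on one instrument and not on another.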

Break

Wednesday, May 22nd, 10:30 AM – 11:00 AM, La Fonda Ballroom

Sponsored by

Dissemination of OXA-23 producing Acinetobacter baumannii During an Outbreak in a Long-term Care Facility

Oral - Abstract ID: 87

Erin L. Young1, Kelly Oakeson1, Alessandro Rossi1, and Robyn Atkinson-Dunn1

Utah Public Health Laboratory, Salt Lake City, Utah

Background: Acinetobacter baumannii is a highly adaptive organism associated with hospital and long-term care facility infections. A. baumannii readily achieves antibiotic resistance, either through intrinsic mechanisms or through resistance acquired by transformation. A. baumannii can colonize almost any surface and survive commonly used disinfectants, making environmental eradication difficult. Immunocompromised and other vulnerable patients, such as those in long-term care facilities or nursing homes, are a target for A. baumannii infections.

Methods: Whole genome sequencing (WGS) was performed on all positive patient and environment samples submitted to the Utah Public Health Laboratory (UPHL) from an Acinetobacter baumannii outbreak at a long-term care facility in Utah. WGS data were analyzed using the reference-free analysis pipeline developed at UPHL (Oakeson et al.) to identify all shared homologous protein coding genes in the isolates and build a phylogenetic tree. The phylogenetic tree was then used to determine the relatedness of the isolates. Additionally, the WGS data for each isolate were searched for the presence of known antimicrobial resistance genes.

Results: Phylogenetic analysis revealed that all the isolates are closely related and form two monophyletic clades, indicating multiple transmission events after a single contamination event. All isolates sequenced contained a single copy of OXA-23, a known carbapenemase that confers resistance to ampicillin and cephalothin antibiotics.

Conclusions: WGS of hospital-acquired infections can provide invaluable information that can confirm the relatedness of isolates in outbreak situations, as well as antimicrobial resistance predictions that can be used to inform treatment. Additionally, WGS can be used for surveillance and early detection of clusters before outbreaks can spread.

Genome wide association studies (GWAS) and transcriptomics identifies cryptic antimicrobial resistance mechanisms in Acinetobacter baumannii

Oral - Abstract ID: 63

Chandler C. Roe1, Charles H.D. Williamson1, Adam Vazquez1, Kristen Kyger1, Mike Valentine2, Jolene Bowers2, Paul Phillips1, Veronica Harrison2, Dave Engelthaler2, Jason W. Sahl1

1. Pathogen and Microbiome Institute, Northern Arizona University, Flagstaff, AZ; 2. Translational Genomics Research Institute North, Flagstaff, AZ

Antimicrobial resistance (AMR) in Acinetobacter baumannii is becoming a serious nosocomial public health threat. Understanding mechanisms of resistance is critical for identifying and tracking emerging resistance as well as implementing effective treatment regimens. Performing genotype/phenotype association studies with regard to AMR is complicated by the plastic nature of the A. baumannii pan-genome. In this study, we compared the antibiograms of 12 antimicrobials, associated with multiple drug families, of 84 confirmed A. baumannii genomes, many isolated in Arizona, USA. In silico screening of these genomes for known AMR mechanisms from public databases failed to identify clear genotype/phenotype correlations. In an effort to identify novel mechanisms associated with AMR, we performed a genome wide association study (GWAS) looking for associations across all possible 21-mers; however, this approach also failed to identify mechanisms that explained the resistance phenotype. In order to decrease the genomic noise associated with population stratification in GWAS, we compared 4 phylogenetically related pairs of isolates with differing susceptibility profiles to at least one beta-lactam. RNA-Sequencing (RNA-Seq) was performed on these paired isolates and differentially expressed genomic regions associated with antibiotic resistance were identified. To verify these regions, amplicon sequencing on complementary DNA was performed and confirmed variable expression of these regions, despite their similar genomic conservation. These results identify novel potential resistance mechanisms in an emerging global health threat and suggest that a diagnostic platform based on gene expression rather than genomics alone may be beneficial for patient treatment. The implementation of such advanced diagnostics coupled with increased AMR surveillance will potentially improve A. baumannii infection treatment and patient outcome.
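A k-mer based association scan of the kind mentioned above can be sketched as follows. The abstract does not say which statistical test or software was used, so this sketch assumes a plain presence/absence test per k-mer with a two-sided Fisher exact test; the function names and the toy data are hypothetical. A real analysis would also need to correct for population structure and multiple testing, which this sketch omits.

```python
from math import comb


def kmers(seq: str, k: int = 21) -> set[str]:
    """All distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


def kmer_associations(genomes: dict[str, str], resistant: dict[str, bool],
                      k: int = 21) -> dict[str, float]:
    """Map each k-mer to the p-value of its association with resistance."""
    kmer_sets = {name: kmers(seq, k) for name, seq in genomes.items()}
    n_res = sum(resistant[name] for name in genomes)
    n_sus = len(genomes) - n_res
    results = {}
    for kmer in set().union(*kmer_sets.values()):
        a = sum(1 for n, s in kmer_sets.items() if resistant[n] and kmer in s)
        c = sum(1 for n, s in kmer_sets.items() if not resistant[n] and kmer in s)
        results[kmer] = fisher_exact_two_sided(a, n_res - a, c, n_sus - c)
    return results


if __name__ == "__main__":
    toy = {"R1": "AAACCC", "R2": "TTTCCC", "S1": "AAAGGG", "S2": "TTTGGG"}
    labels = {"R1": True, "R2": True, "S1": False, "S2": False}
    for kmer, p in sorted(kmer_associations(toy, labels, k=3).items(), key=lambda kv: kv[1]):
        print(kmer, round(p, 3))
```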

PiReT: Pipeline for Reference-based Transcriptomics

Oral - Abstract ID: 70

Migun Shakya, Chien-chi Lo, Bin Hu, Patrick S. G Chain

Bioscience Division, Los Alamos National Laboratory Los Alamos, NM 87544

Transcriptomics is a powerful technique that has contributed to many biological discoveries. Traditionally, transcriptomics enables finding genes and pathways that are differentially expressed in one condition over another, discovering non-coding RNAs, annotating transcribed genes, and characterizing alternative splicing. With the rapid advancement in sequencing technologies providing unprecedented throughput at an acceptable cost, many research laboratories have shown interest in applying transcriptomics to identify genes that are differentially expressed in distinct cell populations, or in response to different treatments. However, most of these laboratories have found themselves continuously challenged by the lack of bioinformatics and statistical expertise needed to design, implement, and maintain computational workflows capable of analyzing large amounts of sequencing data.

A typical transcriptomics workflow requires implementing an array of bioinformatics tools, each of which addresses a particular step in the analysis, e.g. quality control, alignment, fragment counting, statistical hypothesis testing, etc. It is also important to maintain an open and modular architecture, so that new tools can be added to the existing workflow for enabling new functionality and improving existing ones. Moreover, the workflow also needs to be optimized for high throughput and precision as well.

Here, we present PiReT, a one-of-a-kind reference-based transcriptomics workflow solution that adopts an open architecture and enables biologists with little or no computational knowledge to analyze their data. PiReT effectively weaves together open source bioinformatics tools and is integrated into EDGE Bioinformatics, extending its analytical capabilities. PiReT users can upload their raw data (fastq), customize steps of analysis, and produce biologist-friendly results (e.g. RPKM/FPKM/TPM, read counts, regulated genes and pathways, etc.) and data visualizations within the web interface. Here, we demonstrate some of the capabilities of PiReT using examples from a study to understand the interaction between Y. pestis and human host cells.
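For readers unfamiliar with the expression units listed above, the following sketch shows the standard RPKM and TPM formulas applied to raw read counts. It is not taken from PiReT; the function names and the toy counts are hypothetical.

```python
def rpkm(counts: dict[str, int], lengths_bp: dict[str, int]) -> dict[str, float]:
    """Reads per kilobase of transcript per million mapped reads."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no mapped reads")
    return {g: counts[g] * 1e9 / (total * lengths_bp[g]) for g in counts}


def tpm(counts: dict[str, int], lengths_bp: dict[str, int]) -> dict[str, float]:
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rate = {g: counts[g] / lengths_bp[g] for g in counts}
    scale = sum(rate.values())
    if scale == 0:
        raise ValueError("no mapped reads")
    return {g: r * 1e6 / scale for g, r in rate.items()}


if __name__ == "__main__":
    counts = {"geneA": 100, "geneB": 300}
    lengths = {"geneA": 1000, "geneB": 3000}
    print(rpkm(counts, lengths))
    print(tpm(counts, lengths))
```

TPM values always sum to one million within a sample, which makes them easier to compare across samples than RPKM/FPKM.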

LA-UR-19-22693

GeneTable: A Tool for Comparative Genomic Analyses of Microbial Genomes

Oral - Abstract ID: 45

Weerayuth Kittichotirat

Systems Biology and Bioinformatics Research Group, Pilot Plant Development and Training Institute, King Mongkut’s University of Technology Thonburi, Bangkhuntien, Bangkok, Thailand

A microbial species can be described by a pan-genome consisting of a core gene pool shared by all strains, plus a variable gene pool consisting of partially shared and strain-specific genes. The elucidation of the pan-genome for a species from sequence analysis of related strains can help in understanding how genetic variability drives pathogenesis within a microbial species. In addition, core gene information can potentially aid in genome-wide screens for vaccine candidates or antimicrobial targets. Here we present GeneTable, a pipeline that facilitates comparative analysis of gene content across multiple closely related microbial genomes. The pipeline is built around a homologous gene grouping method that automatically groups genes found in genomes of interest into homologous gene clusters. Our homologous gene grouping approach is designed to be robust to possible artifacts in annotation results that may have been introduced by sequence error or incompleteness. A web-based viewer that allows flexible visualization of gene content in a table format and provides user-friendly data mining functionality has also been developed to facilitate data interpretation and sharing. Some of the applications of the GeneTable pipeline include characterization of genomic variation at both structural and gene content levels, identification of potential biologically important and pathogenesis-related genes, and investigation of phylogenomic relationships among compared genomes.
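The pan-genome partition described above (core, partially shared, and strain-specific genes) can be sketched from a presence table of homologous gene clusters. This is an illustration only, not GeneTable's implementation; the function name, strain names, and cluster IDs are hypothetical.

```python
def partition_pan_genome(presence: dict[str, set[str]]) -> dict[str, set[str]]:
    """Split gene clusters into core, partially shared, and strain-specific.

    `presence` maps each strain to the set of gene cluster IDs it carries.
    """
    if len(presence) < 2:
        raise ValueError("need at least two strains to compare")
    gene_sets = list(presence.values())
    pan = set().union(*gene_sets)
    core = set.intersection(*gene_sets)
    # Count in how many strains each cluster occurs.
    occurrences = {c: sum(c in genes for genes in gene_sets) for c in pan}
    strain_specific = {c for c, n in occurrences.items() if n == 1}
    return {
        "core": core,
        "partially_shared": pan - core - strain_specific,
        "strain_specific": strain_specific,
    }


if __name__ == "__main__":
    table = {
        "strain1": {"g1", "g2", "g3"},
        "strain2": {"g1", "g2", "g4"},
        "strain3": {"g1", "g5"},
    }
    for category, clusters in partition_pan_genome(table).items():
        print(category, sorted(clusters))
```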

Lunch

Wednesday, May 22nd, 12:20 PM – 1:40 PM, La Fonda Ballroom

Sponsored by

Amplicon Prediction Pipeline for an Extended MLST Approach to Culture Independent Pathogen Subtyping

Oral - Abstract ID: 53

S. Lucking1, 2, E. Trees1, J. Besser1, H. Carleton1, A. Jo Williams-Newkirk1

1. Centers for Disease Control and Prevention; 2. Oak Ridge Institute for Science and Education

Background: Isolate whole genome sequencing (WGS) is a powerful tool in enteric disease surveillance. The declining availability of isolates due to the adoption of culture-independent diagnostic tests threatens culture-dependent surveillance systems, making the development of direct-from-specimen subtyping methods critically important. Highly multiplexed amplicon sequencing (HMAS) is a potentially cost-effective and scalable method to achieve a resolution similar to that of isolate WGS, but software for the selection of informative and amplifiable loci for HMAS panels is lacking. To address this gap, we developed a pipeline to design extended multilocus sequence typing (eMLST) schemes for enteric pathogen subtyping via HMAS.

Methods: Our amplicon prediction pipeline utilizes freely available open source software and takes as input annotated genomes in GenBank format. Core orthologous genes are identified using Orthofinder v.2.1.2, and primer pairs are designed with primer3 v.2.3.4 to generate 180-250 bp amplicons within the orthologous site. Amplicon size was chosen to work with the Fluidigm Juno, but can be adjusted for compatibility with any HMAS platform. Primer pairs are filtered on the basis of SNP presence between the left and right binding sites, and also for primer specificity to the target pathogen by using 10 metagenomes representing two separate outbreaks. We used EMBOSS v.6.4.0 to perform in silico PCR on the input genomes to generate candidate amplicons to further test and validate the eMLST scheme.

Results: As a proof of concept, 266 epidemiologically unrelated Salmonella bongori and enterica genomes consisting of 68 serotypes and representing the diversity of non-typhoidal Salmonella were analyzed using our pipeline. Of the original 2,125 core orthologous genes identified by Orthofinder, 1,497 were identified as containing SNPs and specific to the Salmonella genomes submitted, resulting in a total of 3,400 candidate primer pairs. The candidate primer pairs were used for in silico PCR, and the resulting amplicons were compared in a pairwise fashion to create a distance matrix. This distance matrix was used to compare our eMLST to that of BioNumerics v.7.6 core genome MLST scheme for Salmonella. Further, our candidate primer pairs were used for in silico PCR on ~14.6K Salmonella genomes and will be used to determine a subset of primer pairs that is most discriminatory and epidemiologically concordant for surveillance and identification of outbreaks.
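The pairwise comparison of amplicons into a distance matrix can be sketched as below. The abstract does not state the exact distance measure, so this sketch assumes a simple allele-difference count over loci shared by both genomes, as is common for MLST-style schemes; the function names and toy profiles are hypothetical.

```python
from itertools import combinations


def allele_distance(a: dict[str, str], b: dict[str, str]) -> int:
    """Number of loci present in both profiles whose amplicon sequences differ."""
    shared = a.keys() & b.keys()
    return sum(a[locus] != b[locus] for locus in shared)


def distance_matrix(profiles: dict[str, dict[str, str]]) -> dict[tuple[str, str], int]:
    """Pairwise allele distances, keyed by alphabetically ordered isolate pairs."""
    return {
        (x, y): allele_distance(profiles[x], profiles[y])
        for x, y in combinations(sorted(profiles), 2)
    }


if __name__ == "__main__":
    profiles = {
        "iso1": {"L1": "ACGT", "L2": "GGGA"},
        "iso2": {"L1": "ACGT", "L2": "GGGT"},
        "iso3": {"L1": "ACCT", "L2": "GGGT", "L3": "TT"},
    }
    for pair, dist in distance_matrix(profiles).items():
        print(pair, dist)
```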

Conclusions: This proof of concept demonstrates that our pipeline successfully generates candidate amplicons for use in direct-from-specimen enteric pathogen HMAS subtyping panels, with strain-level resolution similar to already established schemes such as the Enterobase-based BioNumerics cgMLST scheme. The final panel of target loci will be chosen using the set of ~14.6K Salmonella genomes to find the optimal subset of primer pairs for identification of outbreak-associated samples. Primer pairs for the final targets will be validated in vitro prior to deployment to public health partners for use in outbreak surveillance.

T-MArC: Targeted Metagenomic Analysis through marker Creation

Oral - Abstract ID: 85

David Yarmosh, Joseph Russell, John Bagnoli

MRIGlobal

Metagenomic sample identification is being increasingly applied toward pathogen detection using the common shotgun sequencing approach, which attempts to collect as much nucleic acid data as possible, but does so unselectively. Unless an organism is present in high abundance, this technique is often unable to collect an entire genome sequence and tends to bias toward specific regions of genomes. Metagenomics classification programs attempt to deconvolute which reads belong to which organism and do so by comparison to well-characterized reference genomes. However, pathogens typically share the majority of their genomic sequence with nonpathogenic strains or species. These conditions make unambiguous pathogen detection difficult to produce with certainty. We have developed a method that identifies which regions of a genome are most characteristic to the target organism/clade. Analysis is based on the premise that reads which align to a region of a pathogen’s genome that is not shared with nonpathogens are much more valuable for pathogen detection and risk assessment.

The Huttenhower lab produced a program, ShortBRED, that was designed to find traits common to different sets of genes, specifically antimicrobial resistance genes. However, if the input data set is not a series of gene sequences, but instead entire genomes comprising every strain of a given organism, ShortBRED will attempt to identify regions that are unique to that organism. These sequences are developed into “characteristic markers.” The markers generated by ShortBRED are further refined through exclusivity analysis, involving all closely related sequences that are available through RefSeq. Closely related is defined, by default, as within the same genus, leading to species-level granularity. Any marker that hits any off-target sequence is discarded. The markers that pass this filter are considered highly unique and termed “true markers.” These are stored as a database to be used for highly specific target identification. This pipeline was designed to be used in conjunction with a metagenomics profiler, which generates the initial identification. If any organism that is related to a target pathogen is detected, this read analysis step is initiated and the appropriate fastq files are run against the marker database of the detected organism, resulting in an identification and the degree to which “true markers” were detected. The process of running real metagenomic samples against the curated marker database is rapid, adding minimal increases to runtimes on our modest HPC.
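The exclusivity filter described above can be approximated with a short sketch. The actual pipeline relies on ShortBRED and searches against RefSeq; here, as a stand-in, a candidate marker is discarded if it shares any exact k-mer with an off-target sequence. The function names, the value of k, and the toy sequences are hypothetical.

```python
def kmer_set(seq: str, k: int) -> set[str]:
    """All distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def true_markers(candidates: dict[str, str], off_targets: list[str],
                 k: int = 12) -> dict[str, str]:
    """Keep only candidate markers that share no k-mer with any off-target
    sequence. Markers shorter than k cannot be assessed and are dropped."""
    off_kmers: set[str] = set()
    for seq in off_targets:
        off_kmers |= kmer_set(seq, k)
    return {
        name: seq
        for name, seq in candidates.items()
        if len(seq) >= k and kmer_set(seq, k).isdisjoint(off_kmers)
    }


if __name__ == "__main__":
    candidates = {"m1": "AAAACCCC", "m2": "GGGGTTTT"}
    near_neighbors = ["TTGGGGTA"]  # hypothetical same-genus sequence
    print(true_markers(candidates, near_neighbors, k=4))
```

An alignment-based search would also catch near-identical hits with mismatches, which exact k-mer sharing can miss at larger values of k.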

This presents a significant step toward improving the specificity of metagenomic species identification through the automated identification of functionally conserved regions of genomes. Our initial mission space was confined to Biological Select Agents and Toxins, but this process is readily generalizable to any target of interest. There are many potential downstream applications of such a capability including rapid and unbiased probe design.

The utility of high throughput amplicon sequencing in the characterization of bacterial pathogens in complex backgrounds

Oral - Abstract ID: 67

Jason W. Sahl, James M. Schupp, Viacheslav Fofanov, David M. Wagner, Paul Keim

Northern Arizona University

Whole genome sequencing (WGS) has allowed us an unprecedented view into the composition and evolution of bacterial organisms. However, the time and costs associated with preparation, isolation, and sequencing prevent the high throughput analysis of WGS data from bacterial isolates in some laboratories and for some pathogens. For metagenomic samples where isolation of the desired organism may be complicated or not possible, analysis requires deep sequencing in order to obtain sufficient signal. Here we present an amplicon sequencing (AmpSeq) approach to address these limitations, where tens to hundreds of targets can be combined into a single polymerase chain reaction (PCR) for focused pathogen detection and characterization. Combined with a Universal Tail indexing method, hundreds of samples can then be combined on a single Illumina MiSeq instrument, providing deep coverage for each target across each sample. In this work, we demonstrate the utility of the AmpSeq approach to: 1) redundantly detect biothreat pathogens from complex clinical matrices, such as sputum, detecting the pathogen at a single genome equivalent; 2) classify Clostridioides difficile at the strain level direct from stool, with results comparable to the detail provided by WGS analysis; and 3) characterize the antimicrobial resistance potential of Escherichia coli direct from urine, identifying the circulation of multi-drug resistant clones. Guided by comparative genomics analyses of WGS data, AmpSeq holds the promise to rapidly, dynamically, and inexpensively provide actionable data from complex specimens without the need for isolation or culture, deep sequencing, or complex bioinformatics. Coupled with emergent, real time sequencing technologies (e.g. Oxford Nanopore), AmpSeq represents a rapid platform for timely pathogen identification and characterization.

BioLaboro: An end to end application for detecting molecular assay signature erosion and design of new assays in response to emerging new biothreats

Oral - Abstract ID: 76

Mitch Holland1, Daniel Negrón1, Shane Mitchell1, Nate Dellinger1, Mychal Ivancich1, David Peters1, Larry Wang1, Walter Berger1, Bruce Goodwin2, and Shanmuga Sozhamannan2,3

1. Noblis, Reston, VA 20191; 2. Defense Biological Product Assurance Office, Frederick, MD 21702; 3. Logistics Management Institute, Tysons, VA 22102

There is a constant need to respond rapidly to disease outbreaks, especially if an outbreak involves a newly discovered species. This need is further exacerbated if previously available detection assays and medical countermeasures are ineffective and no new countermeasures are in place. Annual Ebola outbreaks on the African continent exemplify the assay signature erosion problem caused by constantly evolving pathogen genomes. In addition, a new species of Ebolavirus named Bombali virus, carried by bats and likely capable of infecting humans, was recently discovered. If a human outbreak of this new species were to occur, it would be imperative to determine whether current assays can effectively detect it, and if not, to quickly design new assays that can. BioLaboro is an end-to-end application designed for finding signature regions of select organisms, designing primers using those signatures, and then testing those primers in silico to determine their sensitivity and specificity. The entire system is driven by a simple graphical user interface that allows users to point and click to create analysis pipelines. The system ingests an array of NCBI data sources to perform its analyses and can be customized by a variety of parameters, adjustable on the fly. Using BioLaboro we were able to quickly identify which currently fielded Ebolavirus assays could potentially detect Bombali ebolavirus today and were able to design a set of potential new assays with perfect in silico detection accuracy for the future.

Genome Center Updates

Break

Wednesday, May 22nd, 3:40 PM – 4:10 PM, La Fonda Ballroom

Sponsored by

Wednesday, May 22nd, 4:10 PM – 5:40 PM, La Fonda Ballroom

Nextera DNA Flex Library Preparation for Soil Shotgun Metagenomics Analysis

Tech Talk - Abstract ID: 94

Jeffrey Koble, Associate Scientist

Genomic Applications Dept, Illumina, Inc.

Lesson learned from 100 plants and animals de novo genome assembly using long read data

Tech Talk - Abstract ID: 27

Chai Fungtammasan, Nicholas Hill, Brett Hannigan

DNAnexus

De novo genome assembly enables us to create a reference genome and discover complex structural variants, both of which are highly beneficial for genomic studies. However, the process is complex and computationally intensive, and it is non-trivial for even a well-trained bioinformatician to gain experience by working with different datasets. Based on our experience in long-read de novo assembly of 100 plant and animal genomes, we devised the following evidence-based best-practice guidelines. First, use a clean, homogeneous, low-heterozygosity sample; if a low-heterozygosity individual is not available, consider a trio approach to segregate the data into haplotypes. Second, use a subset of the data to assess both the experimental and computational protocols and the resources required before proceeding with all samples. Third, generate reads that are as long as possible, with sufficient coverage for the heterozygosity level. Finally, evaluate the assembled genome with several orthogonal metrics rather than N50 alone. We will walk through these recommendations using case studies of contaminated samples and 27 maize cultivars.
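As a reminder of what N50 measures, and why it should not be used alone, the following sketch computes N50 alongside a few basic summary values. It uses the standard N50 definition; the function names and contig lengths are hypothetical, and the other evaluation metrics the speakers may have in mind (for example gene completeness or base accuracy) are not covered here.

```python
def n50(lengths: list[int]) -> int:
    """Length of the contig at which the cumulative sum of contigs, sorted
    from longest to shortest, first reaches half of the total assembly size."""
    if not lengths:
        raise ValueError("no contigs")
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    raise AssertionError("unreachable")


def assembly_summary(lengths: list[int]) -> dict[str, int]:
    """A few size-based statistics to report together with N50."""
    return {
        "contigs": len(lengths),
        "total_bp": sum(lengths),
        "largest_bp": max(lengths),
        "n50_bp": n50(lengths),
    }


if __name__ == "__main__":
    print(assembly_summary([80, 70, 50, 40, 30, 20]))
```

Because N50 depends only on contig lengths, a misassembly that wrongly joins two contigs raises N50, which is one reason to pair it with independent measures of correctness.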

Highs and Lows of Low Coverage, High Quality Genotyping

Tech Talk - Abstract ID: 95

Joshua Hill

PerkinElmer

Current status and future directions for massively high-throughput genotyping for plant and animal improvement and research.

A major drawback to sequencing-based agriculture studies has been the cost. Arrays and reduced representation sequencing methods are common alternatives for genotyping, but each of these methods has significant limitations associated with it.

AgSeq is a novel agriculture-focused genotyping pipeline that uses optimized laboratory processing, massive sample multiplexing, and machine learning to obtain highly accurate genotype information from low-coverage sequencing data. The reduced cost of whole-genome sequencing afforded by AgSeq allows for a substantial increase in the number of individuals genotyped per study. By reducing the amount of data needed to genotype a sample from 30X to 0.5X coverage, 60 times the number of individuals can be genotyped for the same sequencing cost. This represents a massive opportunity to significantly increase the usage of genotyping-via-sequencing. AgSeq is powered by optimized library prep, automation, and high-throughput sequencing coupled with a reduction in the amount of data needed per individual. Data from individual samples is used to accurately impute gaps resulting from reduced coverage, allowing for accurate genotyping of large populations for plant and animal studies.


Genotyping by sequencing of Canis familiaris using iGenomX RipTide™ DNA library preparation

Tech Talk and Poster - Abstract ID: 74

Azeem Siddique1,3, Gaia Suckow1,3, Nils Homer2, Phillip Ordoukhanian1,3, Steve Head1,3, Keith Brown3, Paul Doran3, Matt Huentelman6, Joseph Pickrell5, Alvaro Hernandez4

1. The Scripps Research Institute, La Jolla, CA; 2. Fulcrum Genomics, Somerville, MA; 3. iGenomX, Carlsbad, CA; 4. University of Illinois at Urbana-Champaign, Urbana, IL; 5. Gencove, New York, NY; 6. Tgen, Phoenix, AZ

Dogs have been living with humans for approximately 15,000 years. Selective breeding has created a multitude of dog breeds with distinct characteristics. Great interest exists in understanding how selection has affected the modern dog genome and what variants are linked to specific canine breed characteristics. Dogs are also susceptible to a number of diseases that have counterparts in humans. Their unique population structure, relatively limited heterogeneity within breeds, greater genome sequence identity to humans than mice have, and their sharing of a common environment with humans make them an excellent model organism for certain human diseases.

The iGenomX RipTide library prep is a high throughput DNA library prep for next generation sequencing that has been used to prepare libraries for a variety of applications where large numbers of samples require library preparation at low cost. One such application is genotyping by sequencing. Here we show the use of the RipTide library prep in a case control GWAS study, generating over 30 million biallelic SNPs per sample on a cohort of West Highland White Terriers. After filtering, more than 5.2 million SNPs were identified with a minor allele frequency of >5%. PCA analysis showed that the variants permitted the accurate identification of breeds. The data also showed a novel genetic association with Westie lung disease, the canine equivalent of chronic obstructive pulmonary disease in humans.

The iGenomX RipTide library prep combined with Illumina sequencing generated more variants in less time and at lower cost than the standard microarray-based genotyping experiment.

The Sequel II System – The Next Evolution of SMRT Sequencing

Tech Talk - Abstract ID: 96

Michelle Vierra

Pacific Biosciences, Menlo Park, CA, USA

Single Molecule, Real-Time (SMRT) Sequencing has been the gold standard for high-quality genome assembly and has enabled research into many different biological applications such as transcriptomics, epigenetics, and microbial community characterization. This is because of the benefits conferred by SMRT Sequencing, including long average read lengths, high consensus accuracy, uniform coverage, simultaneous epigenetic characterization, and single-molecule resolution.

Here I will share a look at the latest results from circular consensus sequencing (CCS) mode for highly accurate reads from the new Sequel II System and relevant applications, such as pharmacogenomic gene analysis and resolving metagenomic communities. I will also provide an update on the Iso-Seq method, which can now segregate transcripts into haplotype-specific alleles using a new tool called IsoPhase, and a new workflow for low DNA input to assemble high-quality genomes from individual small-bodied organisms.

Transforming the Future of Genomics, Together

Tech Talk - Abstract ID: 99

Jeremy Preston, Vice President

Specialty Sales and Marketing, Americas, Illumina, Inc.

Panel Questions

Happy Hour at Cowgirl Café

Wednesday, May 22nd, 6:00 PM – 8:30 PM, Cowgirl Café

Sponsored by

Enjoy!! Drink tickets provided at check in. (Use your red tickets)

Walking Map to Cowgirl Café 319 S. Guadalupe St Santa Fe, NM 505.982.2565

The Legend... Many years ago, when the cattle roamed free and Cowpokes and Cowgirls rode the range, a sassy young Cowgirl figured out that she could have as much fun smokin’ meats and baking fine confections as she could bustin’ broncs and rounding up outlaws. So she pulled into the fine bustling city of Santa Fe and noticed that nobody in town was making Barbeque the way she learned out on the range. She built herself a Texas-style barbecue pit and soon enough the sweet and pungent scent of mesquite smoke was wafting down Guadalupe Street and within no time at all folks from far and near were lining up for heaping portions of tender mesquite-smoked brisket, ribs and chicken. Never one to sit on her laurels, our intrepid Cowgirl figured out that all those folks chowing down on her now-famous BBQ needed something to wash it all down with. Remembering a long-forgotten recipe from the fabled beaches of Mexico, she began making the now-legendary Frozen Margarita and the rest, as we say, is History. Before you could say “Tequila!” the musicians were out playing on the Cowgirl Patio and the party was in full swing.

Thursday, May 23rd
7:30 AM – 8:30 AM Breakfast (Sponsored by Twist & Covaris)
8:30 AM – 8:45 AM Welcome & Opening Remarks
8:45 AM – 9:30 AM Keynote 3: Dr. Charles Chiu, #11
• Clinical Metagenomic Sequencing for Diagnosis of Infectious Diseases
9:30 AM – 10:30 AM Oral Session Part 7: (Chairs: Kenny Yeh & Mike Fitzgerald)
• Identification and elimination of adoption hurdles in NGS for Microbiology. (Ellis, #20)
• Validating metagenomic analyses through simulated direct and indirect healthcare-related pathogen transmission events (Ternus, #77)
• Clinical evaluation of bioinformatics analysis and data management solutions for amplicon panel (Gaskell, #29)
10:30 AM – 11:00 AM Break (Sponsored by SeqWell)
11:00 AM – 12:20 PM Oral Session Part 8: (Chairs: Donna Muzny & Tootie Tatum)
• Nanopore sequencing for biosurveillance in the New York City Subway (Arevalo, #04)
• Nanopore sequencing for biosurveillance in South Korea (Bernhards, #07)
• Offline next generation metagenomics sequence analysis (Deshpande, #17)
• KEATarHTs: Known Etiological Agents Targeted High Throughput Sequencing: An application for detection of Biodefense pathogens using MinION (Verratti, #81)
12:20 PM – 1:40 PM Lunch (Sponsored by JumpCode Genomics & Dovetail Genomics)
1:40 PM – 3:30 PM Oral Session Part 9: (Chairs: Chris Detter & Bob Fulton)
• Towards clean genomic databases: integrating sequence patterns evolved within and between genomes into automatic algorithms of genome annotation (Borodovsky, #08)
• Chromosome-scale assemblies through manual curation (Wood, #84)
• Assembling a human genome within two hour (Chin, #10)
• Tools for assembly graph analysis via SPAdes toolbox and more (Korobeynikov, #47)
• Repeats-aware approach for distance-based genome assembly (Andonov, #01)
3:30 PM – 3:40 PM Wrap-up
3:40 PM – 5:00 PM Happy Hour- Break (Sponsored by IDT)

Breakfast

Thursday, May 23rd, 7:30 AM – 8:30 AM, La Fonda Ballroom

Sponsored by Twist & Covaris

Clinical Metagenomic Sequencing for Diagnosis of Infectious Diseases

Keynote Speaker - Abstract ID: 11

Charles Chiu, M.D./Ph.D.

Division of Infectious Diseases at University of California, San Francisco

Metagenomic next-generation sequencing (mNGS) is a potentially game-changing technology for infectious disease diagnosis, as it enables detection of nearly all pathogens – viruses, bacteria, fungi, and parasites – in a single assay. This approach has been made feasible by the rapid advances in sequencing technology, bioinformatics analysis, and reference databases over the past several years. I will discuss how we overcame challenges in development and validation of an mNGS-based assay in a CLIA (Clinical Laboratory Improvement Amendments) laboratory regulatory environment, which has now received a first-in-class breakthrough device designation from the FDA. I will also review the results of the PDAID (Precision Diagnosis of Acute Infectious Diseases) study, a 1-year, multi-hospital prospective study evaluating the clinical utility and cost-effectiveness of a clinically validated mNGS assay, comparing its performance head-to-head against all conventional microbiological testing for direct diagnosis of meningitis and encephalitis from cerebrospinal fluid (CSF). I will discuss efforts to expand clinical mNGS validation and testing to plasma (for sepsis), respiratory fluid (for pneumonia), and eventually to all body fluids, as well as new transformative technologies on the horizon such as nanopore sequencing for real-time diagnosis of infections, CRISPR-Cas and spiked primer enrichment strategies, and machine learning-based identification of host biomarkers using RNA-Seq to discriminate between infectious (bacterial, viral) and non-infectious (autoimmune) conditions.

http://nextgendiagnostics.ucsf.edu http://chiulab.ucsf.edu

Identification and Elimination of Adoption Hurdles in NGS for Microbiology

Oral - Abstract ID: 20

Jeremy E. Ellis, Ph.D.

BioID Genomics

Microbial culture, a keystone diagnostic technology, has existed in a form recognizable to modern microbiologists since the late 1800s and early 1900s, with the advent and standardization of selective media. Limitations of microbial culture have become a growing concern as our knowledge of polymicrobial infections, population dysbiosis, deficits due to time-to-results, and fastidious or emerging organisms has improved. Molecular technologies have been proposed as potential solutions to meet these challenges. With the use of multiplex and quantitative polymerase chain reactions (PCR), syndromic infections and common pathogens can be detected rapidly and with little training investment in modern diagnostic laboratories; however, unlike microbial culture, these methods require a priori target selection. Another more recent molecular technology, Next Generation DNA Sequencing (NGS), enables new unbiased microbial detection assays while leveraging the advantages of PCR-based methods. The currently available NGS platforms and tools rely heavily on specialized technicians, bioinformatics experts, and analysis pipelines that are not compatible with rapid laboratory adoption. Simply put, the investment remains large despite the promise. Appropriate design of a sequencing kit working seamlessly with cloud-enabled analytics software can address the technical challenges posed by NGS. Our standardized kit-based method has placed NGS for microbiology within reach of molecular-competent laboratories. The promise of NGS-based methods can be broadly realized with the use of a kit-based method.

Validating Metagenomic Analyses through Simulated Direct and Indirect Healthcare-Related Pathogen Transmission Events

Oral - Abstract ID: 77

Krista Ternus1, Katharina Weber1, Nicolette Albright1, Gene Godbold2, Veena Palsikar1, Danielle LeSassier1, Nicole Westfall1, Kathleen Schulte1, Curt Hewitt1

1. Signature Science, 8329 North Mopac Expressway, Austin TX 78759; 2. Signature Science, 1670 Discovery Drive, Charlottesville VA 22911

The prevalence of healthcare-acquired infections (HAI) places a significant economic burden on modern healthcare systems. Cultures are typically used to diagnose HAI; however, culture-dependent methods provide only limited presence/absence information and are not applicable to all pathogens. Next generation sequencing (NGS) has the potential to detect a wide variety of pathogens, viability signals, virulence elements, and antimicrobial resistance signatures in healthcare settings without the need for culturing. This study explored how metagenomics compared to traditional culturing methods to detect human pathogen transmission events under a variety of simulated HAI scenarios. Simulations were performed within a microbiology laboratory to reflect the transmission of ESKAPE pathogens (i.e., Enterococcus faecium, Staphylococcus aureus, Klebsiella pneumoniae, Acinetobacter baumannii, Pseudomonas aeruginosa, and Enterobacter species) and Clostridioides difficile between patients, healthcare personnel, and contaminated objects under remediated and non-remediated scenarios. The results of this study provided insights into the factors that influence transmission rates, the relationship between colony-forming units and metagenomics data, the value and limitations of metatranscriptomics, and the impact of custom databases for functional annotations.

Funding Source:

This work was supported by the Centers for Disease Control and Prevention’s investments to combat antibiotic resistance under award number (200-2018-75D30118C02922).

Clinical evaluation of bioinformatics analysis and data management solutions for amplicon panel

Oral - Abstract ID: 29

Rebecca Bernard, Nicholas Miltgen, Alisa Gaskell

Children’s Hospital Colorado, Aurora, CO

As NGS testing becomes established in clinical settings, more third-party software solutions are available to ease the burden of launching these complex molecular offerings. The software packages can be broadly categorized as ready-to-use closed source solutions or comprehensive algorithm tool kit packages. It is also important to distinguish whether a software solution presents an end-to-end solution for the analytical process or addresses only a certain number of steps within the process. Here we present our newly developed data processing workflow, highlighting our experiences both with a closed source solution and with third-party tool kits. Data management was an important requirement of the new data workflow; thus we emphasize the ease of data management with both closed source and modular tool kit solutions. The validation study had two phases: the goal of phase one was to establish the specific pipeline tools and parameters to meet our predetermined quality and performance metrics, whilst phase two was seen as the actual validation of the new analytical process. Phase one underlined the need for additional complementary tools that were easily integrated into the modular tool kit solution but were not that easily identifiable when using the closed source solution. Once all the tools were assessed and adjusted, they were integrated into one continuous process. Importantly, we identified the tool-specific artifacts and established the signal-to-noise ratio for each type of observed change and for each type of genomic location. As validation of the final pipeline (phase two), a 92-sample cohort was analyzed, confirming that with both the global and local filters in place we achieved a positive predictive value of greater than 0.92, false positive frequency of 0.97 and false negative frequency greater than 0.99. Taken together, the performance of the new analytical pipeline, composed of both open source and third-party tools, matched or surpassed the performance and functionality of the previous closed source pipeline.
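
For readers less familiar with the validation metrics named in this abstract, the following is a minimal sketch of how such rates are conventionally derived from call counts. The function name, the counts, and the textbook definitions used here are assumptions for illustration only; the abstract does not state how its "frequencies" were defined or report the underlying counts.

```python
def call_metrics(tp, fp, fn, tn):
    """Textbook confusion-matrix rates for a set of variant calls.

    tp/fp/fn/tn are counts of true positives, false positives,
    false negatives and true negatives (hypothetical inputs).
    """
    return {
        "ppv": tp / (tp + fp),                  # positive predictive value
        "false_positive_rate": fp / (fp + tn),  # share of true negatives called positive
        "false_negative_rate": fn / (fn + tp),  # share of true positives missed
    }


# Hypothetical counts, chosen only to illustrate the arithmetic.
metrics = call_metrics(tp=92, fp=8, fn=4, tn=896)
print(metrics)
```

With these made-up counts the positive predictive value is 92 / (92 + 8) = 0.92; the numbers are not taken from the study.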

Break

Thursday, May 23rd, 10:30 AM – 11:00 AM, La Fonda Ballroom

Sponsored by SeqWell

Nanopore Sequencing for Biosurveillance in the New York City Subway

Oral - Abstract ID: 04

M. T. Arévalo1,2, M.A. Karavis2, S.V. Deshpande2, R.C. Bernhards1,2

1. Defense Threat Reduction Agency, Fort Belvoir, VA, USA; 2. U.S. Army Combat Capabilities Development Command (CCDC) Chemical Biological Center, Aberdeen Proving Ground, MD, USA

There is a critical need for rapid and accurate identification of unknown, emerging, and genetically modified biothreats in the field. Next-generation sequencing (NGS) technologies allow for the analyses of whole genomes and are thus superior to other molecular-based approaches. Whole genome analyses not only allow for unbiased, conclusive identification of pathogens, but can also help detect and distinguish novel and synthetically modified threats. Timely and accurate biothreat identification is needed to protect civilian populations. To this end, the Department of Homeland Security (DHS) is installing an Underground Transport BioDetection Test Bed in the New York City Subway system. The test bed will serve as a testing, evaluation, and validation site for new chemical and biological detection technologies that could provide warning in the event of a chemical or biological attack on subway systems. We are conducting a feasibility study for the potential incorporation of the MinION nanopore sequencing device into the test bed. Environmental air filters from the NYC subway were provided by the DHS BioWatch program and are being used to: 1) determine limits of detection in these dirty filters, 2) evaluate the effect of environmental contaminants (e.g., iron oxide) on identification of pathogens by MinION sequencing, and 3) characterize the biological background collected by these filters. To establish the workflow from processing of filters and nucleic acid extraction to MinION sequencing, we spiked dirty filters with known amounts of bacteria, and showed that we could definitively identify the known organism. The MinION sequencer could eventually be used to identify any unknown biological agent within subway systems, including those that are genetically modified.

Nanopore Sequencing for Biosurveillance in South Korea

Oral - Abstract ID: 07

M.A. Karavis1, M.T. Arevalo2,1, B.A. Rivers1, T.M. Reed3, S.V. Deshpande1, K.M. Broadway1, C.J. Anderson1, L. Wallace2,1, R.C. Bernhards2,1

1. U.S. Army Combat Capabilities Development Command (CCDC) Chemical Biological Center, Aberdeen Proving Ground, MD, USA; 2. Defense Threat Reduction Agency, Fort Belvoir, VA, USA; 3. CBRNE Analytical & Remediation Activity, 20th CBRNE Command, U.S. Army, Aberdeen Proving Ground, MD, USA

Rapid and accurate biological identification technologies are critically needed in the field, especially for unknown, emerging, and genetically modified biothreats. Next-generation sequencing (NGS) technologies are superior to other molecular-based identification methods because entire genomes can be analyzed. This allows for unbiased, conclusive identification of biological agents, including novel and synthetically modified threats. CCDC Chemical Biological Center is developing a workflow for use in CENTAUR sample analysis facilities in South Korea for the identification of biothreats in environmental air filter samples using the MinION nanopore sequencing system. The approach will be used to detect any unknown biological organism present, including bacteria, viruses, and genetically/synthetically modified organisms. Offline bioinformatics software is being developed for rapid and automated metagenomic analysis. Limits of detection are first being established for biothreat surrogates including a Gram-positive spore-forming bacterium, a Gram-negative bacterium, an RNA virus, and a genetically modified microorganism. Furthermore, incorporation of VolTRAX automated sample/library preparation, the MinIT miniature processor/base caller, and the Flongle flow cell adapter is being explored for ease of sample preparation, increased portability, and improved cost-efficiency, respectively. Performance evaluations will be conducted using blinded metagenomic samples and reviewed by an independent party. The goal is to deliver a compact, easy-to-use system for rapid biological identification within eight hours from sample-to-answer.

Offline next-generation metagenomics sequence analysis

Oral - Abstract ID: 17

Samir V. Deshpande1, Timothy Reed2, Keith Beigel2, Mary M. Wade3

1. Science & Technology Corp, 111 C Bata Blvd., Belcamp, MD 21017; 2. CBRNE Analytical & Remediation Activity, 20th CBRNE Command, US Army, Aberdeen Proving Ground, MD 21010; 3. US Army CCDC Chemical Biological Center, Aberdeen Proving Ground, MD 21010

Detection of dangerous pathogens and emerging infectious diseases, in either an outbreak or a biological attack scenario, requires rapid identification of the threat agent. This allows for proper medical treatment and countermeasures to react and stop the dangerous pathogen from spreading. In resource-limited areas, detection presents a challenge, especially if there is limited access to electricity and an internet connection. The ability to rapidly deploy real-time detection and surveillance systems that can properly identify the threat agent is critical to stop these catastrophic events. With the development of Oxford Nanopore’s MinION, users now have the capability to collect metagenomics sequencing data in the field. However, most downstream analysis is conducted through an online connection to Metrichor or other open source software that requires internet access to run the bioinformatics software. This software requirement restricts data analysis in remote areas where internet is not available.

We have created an offline, real-time characterization software tool using open source software capable of performing classification and metagenomic visualization, and tested it on a Dell Precision 7720 laptop with 64 GB RAM. Utilization of MinIT gives us the ability to basecall barcoded samples, thus increasing our throughput in the field. The fast search operations performed by Centrifuge allow for rapid identification of metagenomics sequences. The pipeline was tested on a mock bacterial sample from ATCC (MSA-2002) with a sample-to-answer time under four hours. We were able to correctly identify all 20 organisms at the genus level and 19 out of 20 at the species level.

Using our bioinformatics pipeline along with Oxford Nanopore’s Rapid Kit (SQK-RAD004), we were able to correctly identify all organisms from a mock bacterial community in under four hours. Our bioinformatic pipeline allows for a cost-effective solution and real-time analysis of multiple samples in a field-forward environment without depending on a reliable internet connection.

KEATarHTS: Known Etiological Agents Targeted High Throughput Sequencing: An Application for Detection of Biodefense pathogens Using MinION

Oral - Abstract ID: 81

Kathleen Verratti1, Andrea Staab2, Robert Player1, Christopher Bradburne1, Bruce G Goodwin3 and Shanmuga Sozhamannan3,4

1. Johns Hopkins University-Applied Physics Laboratory, Laurel, MD; 2. Naval Surface Warfare Center, Dahlgren, VA; 3. Defense Biological Product Assurance Office, Frederick, MD; 4. Logistics Management Institute, Tysons Corner, VA

Nucleic acid-based assays such as real time polymerase chain reactions (PCRs) are the mainstay of clinical diagnostics and biosurveillance. A number of PCR assays targeting a battery of Category A select agent pathogens have been developed and are in extensive use within the DoD and other government agencies engaged in biosurveillance. All these are singleplex assays, and there is an ever-increasing fiscal pressure to develop an ultra-high-throughput multiplex assay for the following reasons: to reduce the time needed to screen for multiple agents in multiple samples and make an actionable call, to maximize the utility of small volumes of test samples and, most importantly, to reduce the cost to programs in a financially restrictive environment. However, multiplex real time PCR has severe limitations in terms of the number of targets that can be tested simultaneously (i.e., the number of fluorescent channels in the PCR instruments used for monitoring real time PCR is limiting), thus imposing huge costs in reagents and operator time for testing in singleplex formats. Hence, there is an increased desire to perform ultra-high-throughput multiplex PCR and process hundreds of samples in parallel.

Through recent advances in sequencing technologies and the availability of hand-held sequencers such as the MinION from Oxford Nanopore, which do not require huge infrastructure/hardware capital investment, this desire has now become a reality. Here, we have tested a 14-plex multiplex PCR assay followed by sequencing of the amplicons using a MinION device. We demonstrated the feasibility of sequencing multiple samples simultaneously using bar codes, analyzing and obtaining sequence data in a very short time frame (~10 minutes). We have compared the limits of detection of PCR to MinION sequencing and also determined the time from sample to sequence. Data from testing operationally relevant matrices spiked with pathogens will be presented and discussed.

Lunch

Thursday, May 23rd, 12:20 PM – 1:40 PM, La Fonda Ballroom

Sponsored by JumpCode Genomics & Dovetail Genomics

Towards clean genomic databases: integrating sequence patterns evolved within and between genomes into automatic algorithms of genome annotation

Oral - Abstract ID: 08

Alexandre Lomsadze1, Katharina J. Hoff2, Karl Gemayel1, Tomas Bruna1, Mario Stanke2 and Mark Borodovsky1

1. Joint Georgia Tech and Emory Wallace H Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, USA; 2. Institute for Mathematics and Computer Science, University of Greifswald, Greifswald, Germany.

Errors in integration of layers of genomic, transcriptomic and cross-species protein information have become a primary source of erroneous annotations that propagate in genomic databases. Noisy splicing, horizontal gene transfer, and uneven rates of evolution within the same genome hamper applications of genome annotation methods that would work well in ideal conditions. I will describe principles and performance assessment of new versions of genome annotation pipelines: the eukaryotic BRAKER2, which chains GeneMark and AUGUSTUS to use transcript and orthologous protein footprints for model training and gene prediction; and the prokaryotic GeneMarkS2+, a core element of the integrated genome annotation pipeline implemented at NCBI. Notably, GeneMarkS2 can run on single cell genomic sequences as well as on metagenomes. All the pipelines are available for cloud computing.

Chromosome-scale assemblies through manual curation

Oral - Abstract ID: 84

Jonathan Wood1, Kerstin Howe1, William Chow1 and Vertebrate Genomes Project2

1. Wellcome Sanger Institute, Hinxton, Cambs, UK; 2. https://vertebrategenomesproject.org/

The Vertebrate Genomes Project (VGP) produces high quality reference assemblies for all vertebrate species using four technologies: PacBio long reads, 10X Chromium data, Bionano genomic maps and HiC linked reads. These data are processed through a semi-automated pipeline of assembly, scaffolding and polishing with the aim of achieving genome assemblies that are near gapless, error free and provide chromosome-scale scaffolds.

These software processes are, however, imperfect, and in order to achieve the project goals, intervention in the form of automated assessment of assembly quality with subsequent manual curation is required. The automated evaluation is done by producing a gEVAL database containing alignments of all available raw data with the assembly, plus alignments to other assemblies of the same or related species, available gene sets and any other relevant data. gEVAL allows visualisation of data discordances, guiding curators to assembly issues and supporting the identification and manual application of issue resolutions. Done in tandem with manual re-scaffolding guided by Bionano map alignments and 2D HiC heatmaps, the manual process is able to raise the quality significantly and ensure near complete chromosome-scale sequence assignation. Furthermore, the manual curation process feeds back into the assembly strategy, supporting the development of improved algorithms and adjustment of pipelined processes, raising the quality of the automated assembly production.

We outline the evaluation and curation process and its benefits and show how this enables the aims of forthcoming projects to sequence all life on earth.

Assembling A Human Genome within Two Hours

Oral - Abstract ID: 10

Jason Chin

Independent Research

De novo genome assembly is the most unbiased way to acquire comprehensive genomic information and to gain insight into new DNA sequences that may not exist in reference genomes. Many de novo human genomes have been published in the last couple of years, leveraging cheaper short-read and single-molecule long-read technologies. Along with the scale of sequencing work, the computational burden of generating assemblies persists. The most common long-read assembly framework, using the overlap-layout-consensus paradigm, requires all-to-all read comparisons. The computational complexity of this comparison step scales quadratically with the number of reads. Most methods still require hundreds to thousands of CPU hours, although various techniques have been developed to reject non-overlapped pairs quickly or to reduce the extra computation for repeats. The high computational requirement persists even for more accurate long reads (accuracy ~99% and length ~11 to 15k), which are achievable with current sequencing technologies.

We introduce the de novo assembler Peregrine, which uses a novel minimizer-based read index schema. This allows the removal of the all-to-all read comparisons. Instead, read pairs with high overlapping probability are gathered in one step and compared by utilizing the index. In our initial implementation, we can assemble 28x to 32x human PacBio CCS read datasets in less than 20 CPU hours and two wall-clock hours to high contiguity (N50 > 20Mb). The continued advance of sequencing technologies in terms of read length and base accuracy, together with Peregrine, will enable routine generation of human de novo assemblies. This leads to a more comprehensive representation of genomic variation at population scale beyond SNPs and small indels. We further applied Peregrine successfully to non-mammalian genomes such as plants. Future implementations will enable the usage of less accurate long reads such as Oxford Nanopore and longer PacBio reads.
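
The abstract does not spell out Peregrine's index schema, so the following is only a generic sketch of the idea behind minimizer indexing, not Peregrine's implementation. Each read is reduced to a small set of window minimizers, reads are bucketed by shared minimizers, and only reads that land in common buckets become overlap candidates, which avoids comparing all n(n-1)/2 read pairs. The names `minimizers` and `candidate_pairs`, the parameters `k`, `w` and `min_shared`, and the toy reads are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def minimizers(seq, k=5, w=4):
    """Return the (k-mer, position) minimizers of seq.

    For every window of w consecutive k-mers, keep the lexicographically
    smallest k-mer (leftmost on ties).
    """
    kmers = [(seq[i:i + k], i) for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        picked.add(min(kmers[start:start + w]))
    return picked


def candidate_pairs(reads, k=5, w=4, min_shared=2):
    """Return read-id pairs sharing at least min_shared distinct minimizer k-mers."""
    index = defaultdict(set)  # minimizer k-mer -> ids of reads containing it
    for read_id, seq in reads.items():
        for kmer, _pos in minimizers(seq, k, w):
            index[kmer].add(read_id)
    shared = defaultdict(int)
    for read_ids in index.values():
        for pair in combinations(sorted(read_ids), 2):
            shared[pair] += 1
    return {pair for pair, n in shared.items() if n >= min_shared}


# Toy reads: "b" starts with the last 12 bases of "a"; "c" is unrelated.
reads = {
    "a": "ACGTACGGACCTAGGCATCG",
    "b": "ACCTAGGCATCGGGTTAACC",
    "c": "T" * 20,
}
pairs = candidate_pairs(reads)
print(pairs)  # {('a', 'b')}
```

In this toy case reads "a" and "b" share the minimizers ACCTA and AGGCA, so they are the only pair passed on for a detailed comparison. A real assembler would also use minimizer positions and strand to estimate the overlap, which this sketch omits.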

Tools for assembly graph analysis via SPAdes toolbox and more

Oral - Abstract ID: 47

Alex Shlemov, Tatiana Dvorkina, Sergey Nurk, Dmitry Antipov, Anton Korobeynikov

Saint Petersburg State University

Recently a number of tools have emerged that use the SPAdes genome assembler platform for various analysis applications. Here we present newer additions to the SPAdes family of tools for genome assembly and analysis that rely on the assembly graph to guide the analysis and therefore improve the results as compared to contig-centric applications. Graph representation of genome assemblies has recently been used in different applications, from gene finding to haplotype separation. While many of these applications are based on aligning DNA and protein sequences to assembly graphs, existing software tools for finding such alignments have important limitations. We present a novel SPAligner (Saint Petersburg Aligner) tool for aligning long reads to assembly graphs and demonstrate that it generates accurate alignments. Recently, large databases containing profile Hidden Markov Models (pHMMs) have emerged. These pHMMs may represent the sequences of antibiotic resistance genes, or allelic variations amongst highly conserved housekeeping genes used for strain typing, etc. The typical application of such a database includes the alignment of contigs to a pHMM in the hope that the sequence of the gene of interest is located within a single contig. Such a condition is often violated for metagenomes, preventing the effective use of such databases. We present PathRacer, a novel standalone tool that aligns a profile HMM directly to the assembly graph (performing the codon translation on the fly for amino acid pHMMs). The tool provides the set of most probable paths traversed by an HMM through the whole assembly graph, regardless of whether the sequence of interest is encoded on a single contig or scattered across a set of edges, therefore significantly improving the recovery of sequences of interest even from fragmented metagenome assemblies.

This work was supported by Russian Science Foundation (grant 19-14-00172).

Repeats-Aware Approach for Distance-Based Genome Assembly

Oral - Abstract ID: 01

Rumen Andonov1*, Hristo Djidjev2, Sebastien Francois1, Dominique Lavenier1

1. Univ Rennes, Inria, CNRS, IRISA, F-35000 Rennes, France, 2. Los Alamos National Laboratory, Los Alamos, NM 87545, USA

It is commonly accepted that the most important challenge towards a truthful genome assembly is the presence of repeated identical subsequences in the genome (repeats). Current assembly tools are not capable of distinguishing different copies of the same repeated DNA region and usually merge them together, thus yielding multiple errors in the proposed result. Here we describe a strategy for overcoming this challenge, which uses a genome assembly representation defined by a simple path in a graph we construct; this path satisfies as many as possible of the distance constraints encoding the insert-size information. We formulate the latter problem as a mixed-integer linear program and apply an optimization solver to find the exact solutions on a benchmark of chloroplasts. Chloroplasts possess circular and relatively small genomes. The particularity of these genomes is the presence of numerous repetitions, which pose significant computational challenges for modern genome assembly techniques. We show that these repeats are the main reason for the existence of multiple equivalent solutions that are associated with alternative subpaths. Our contribution here is to formulate two sufficient conditions for identifying such subpaths. Furthermore, we design efficient (linear) algorithms for detecting these conditions. Once these subpaths are detected, our strategy consists of performing cuts at their endpoints. The remaining subpaths are considered as unambiguous portions of the genome (contigs). We call these contigs distance-based contigs (db-contigs) since our cuts are safe (do not decrease the number of satisfied distances). We tested this strategy on a set of 33 chloroplast genomes and compared the results with three recent assemblers (SPAdes, SSPACE and BESST). Using the QUality ASsessment Tool (QUAST) for quality assessment, we demonstrate that our approach produces assemblies of higher quality than the above heuristics. These results fully justify the efforts for designing exact approaches for genome assembly. Extending the method to much bigger genomes is very challenging and a topic of ongoing research.

keywords: de novo genome assembly, unitig, contig, scaffolding, gap-filling, weighted simple path problem, linear integer programming.
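
To make the objective described above concrete, here is a deliberately simplified sketch of "a path satisfying as many distance constraints as possible". It is not the authors' formulation: it assumes each node appears exactly once (so repeats are ignored), places nodes end to end without overlaps, and replaces the mixed-integer linear program with exhaustive search over orderings, which is only feasible for a handful of nodes. The names `satisfied` and `best_path`, the tolerance `tol`, and the toy lengths and distances are hypothetical.

```python
from itertools import permutations


def satisfied(path, lengths, constraints, tol=0):
    """Count distance constraints met when the nodes of path are laid end to end.

    Each constraint (u, v, d) asks that v starts d bases after u starts.
    """
    start, pos = {}, 0
    for node in path:
        start[node] = pos
        pos += lengths[node]
    return sum(
        1 for u, v, d in constraints
        if abs((start[v] - start[u]) - d) <= tol
    )


def best_path(lengths, constraints, tol=0):
    """Exhaustive search over node orderings (a stand-in for the MILP solver)."""
    return max(
        permutations(sorted(lengths)),
        key=lambda p: satisfied(p, lengths, constraints, tol),
    )


# Toy instance: three sequences and three insert-size-style distances.
lengths = {"A": 100, "B": 50, "C": 70}
constraints = [("A", "B", 100), ("B", "C", 50), ("A", "C", 150)]
path = best_path(lengths, constraints)
print(path, satisfied(path, lengths, constraints))  # ('A', 'B', 'C') 3
```

Here only the ordering A, B, C meets all three distances. When repeats are present, several orderings can tie for the best score, and those ties correspond to the equivalent solutions the abstract handles by cutting at the ambiguous subpaths.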

Happy Hour - Closing Break

Thursday, May 23rd, 3:40 PM – 5:00 PM, La Fonda Ballroom

Sponsored by IDT

Please grab a drink (or two) and enjoy our end of meeting Break!!

The bar will stay open until 5:00pm so you can grab another should you choose

Use your yellow tickets

2020 SFAF is planned for May 19-21, 2020 in Santa Fe, NM

2019 SFAF Attendee List

Last Name First Name Email Company Abdo Zaid [email protected] Colorado State University Abichu Getachew [email protected] NAHDIC Aghokeng Avelin [email protected] CIRMF Al-Awar Omayma [email protected] Illumina Alal George [email protected] University of Eldoret Alemayehu Dawit [email protected] Armauer Hansen Research Institute Alemu Redeat [email protected] NAHDIC Alterio John [email protected] PacBio Anderson Kevin [email protected] DHS Science & Technology Andonov Rumen [email protected] IRISA and University of Rennes 1 Andreopoulos Bill [email protected] Joint Genome Institute Appel Maryke [email protected] Private Arevalo Maria [email protected] DoD/DTRA Aspinwall Brooke [email protected] Centers for Disease Control and Prevention Avila-Herrera Aram [email protected] LLNL B R Ansil [email protected] NCBS, Bangalore Badenhorst Daleen [email protected] Roche Sequencing Solutions Bagnoli John [email protected] MRIGlobal Bailey Donovan [email protected] New Mexico State University-Biology Kazakh Scientific Center for Quarantine Bakhtybekkyzy Sholpan [email protected] and Zoonotic Diseases (KSCQZD) Balaji Uthra [email protected] Baylor Research Institute Bartlow Andrew [email protected] Los Alamos National Laboratory Basi Kelly [email protected] CCDC Chemical Biological Center Bauernfeind Selina [email protected] University of New Mexico Bazinet Adam [email protected] BNBI/NBACC BELL TISZA [email protected] Los Alamos National Laboratory National Center for Disease Control and Berishvili Nino [email protected] Public Health (NCDC) of Georgia Bernhards Cory [email protected] Defense Threat Reduction Agency Bhangoo Jasbir [email protected] Driscoll's Inc. 
Bista iliana [email protected] Sanger Institute Blackburn Jason [email protected] University of Florida Boellmann Frank [email protected] Illumina Borland Erin [email protected] Colorado State University National Center for Disease Control and Brachveli Gvantsa [email protected] Public Health (NCDC) of Georgia braulio wilson [email protected] Illumina Braunstein Gavin [email protected] CIV DTRA COOP THRT REDUCT Breaker Erin [email protected] Centers for Disease Control and Prevention Brooks Kelli [email protected] RTLGenomics Brown Keith [email protected] iGenomX Brownstein Buddy [email protected] Washington University St. Louis National Center for Disease Control and Chanturia Gvantsa [email protected] Public Health (NCDC) of Georgia Chaudhary Suman [email protected] Kansas State University Chen Feng [email protected] Illumina Chin Jason [email protected] Self Chokas Ann [email protected] Illumina Faculty of Veterinary Medicine, Universiti Choong Siew [email protected] Malaysia Kelantan Christensen Mikael [email protected] QIAGEN Cobaugh Kelly [email protected] Driscoll's Inc. 
Colman Rebecca [email protected] FIND/UCSD Cong Qian [email protected] University of Washington Cook Christopher [email protected] Swift Biosciences Cui Helen [email protected] Los Alamos National Laboratory Cunningham Heather [email protected] Peraton Daligault Hajnalka [email protected] Los Alamos National Laboratory Damaso Natalie [email protected] FBI Laboratory Daniels Jonathan [email protected] Centers for Disease Control and Prevention The ithree institute, University of Darling Aaron [email protected] Technology Sydney Davenport Karen [email protected] Los Alamos National Laboratory Davenport Karen [email protected] Los Alamos National Laboratory Davis Steve [email protected] FDA Defenbaugh Dawn [email protected] CTR DTRA J3-7 Deng Xiangyu [email protected] University of Georgia Deshpande Alina [email protected] Los Alamos National Laboratory Deshpande Samir [email protected] Science and Technology Corporation Detter Chris [email protected] MRIGlobal Di Han [email protected] Centers for Disease Control and Prevention Dichosa Armand [email protected] Los Alamos National Laboratory Diepold Sheila [email protected] Tetracore, Inc. 
Dinwiddie Darrell [email protected] University of New Mexico Health Sciences Center
Domman Daryl [email protected] Los Alamos National Laboratory
Doose Jonathan [email protected] Agilent Technologies
Dovih Dailu Pilot [email protected] National Centre for Biological Sciences, TIFR
Doyle Adina [email protected] TGen North
Dvorak Ashley [email protected] IDT
Ehrlich Garth [email protected] Drexel University
Elbakidze Tinatin [email protected] Laboratory of the Ministry of Agriculture of Georgia (LMA)
Elwick Kyleen [email protected] ORISE/FBI Laboratory
Erkkila Tracy [email protected] Los Alamos National Laboratory
Ettenhuber Patrick [email protected] QIAGEN
Fagre Anna [email protected] Colorado State University
Farlow Jason [email protected] Farlow Scientific Consulting
Farmer Andrew [email protected] National Center for Genome Resources (NCGR)
Fink Logan [email protected] Colorado Department of Public Health and Environment
Fiske Haley [email protected] Swift Biosciences
FitzGerald Michael [email protected] Broad Institute
Flynn Mark [email protected] Los Alamos National Laboratory
Folkerts Megan [email protected] TGen North
Foltz Victoria [email protected] Centers for Disease Control and Prevention
French Chris [email protected] TGen North
Fry Stephen [email protected] Fry Laboratories, LLC
Fulton Lydia [email protected] Student - Southern Illinois University - Edwardsville
Fungtammasan Arkarachai [email protected] DNAnexus
Gale James [email protected] Tricore Reference labs
Gao Yang [email protected] IHRC Inc.
Garcia Alfredo [email protected] Government Scientific Source
Gebresilase Tewodros [email protected] Armauer Hansen Research Institute
Geib Scott [email protected] USDA-ARS
Gillece John [email protected] TGen North
Gleasner Cheryl [email protected] Los Alamos National Laboratory
Gogoladze Giorgi [email protected] National Center for Disease Control and Public Health (NCDC) of Georgia
Gomez Rosalie [email protected] Government Scientific Source
Graham Amanda [email protected] USAMRIID
Griego Anastacia [email protected] NM Dept. of Health, Scientific Lab Division
Grishin Nick [email protected] UT Southwestern and HHMI
Gu Jinghua [email protected] Baylor Research Institute
Guertin Stephanie [email protected] Signature Science, LLC
Guo Yirui [email protected] Ligo Analytics
Hall Adrienne [email protected] USAMRIID
Halpin Jessica [email protected] Centers for Disease Control and Prevention
Hanschen Erik [email protected] Los Alamos National Laboratory
Hartman Daniel [email protected] Colorado State University
Heang Vireak [email protected] U.S Naval Medical Research Unit.2, Cambodia
Hemachudha Pasin [email protected] Thai Red Cross Emerging Infectious Disease Health Sciences
Herdon Keith [email protected] University of Florida
Hill Josh [email protected] Texas A&M
Hill Nicholas [email protected] DNAnexus
Hirutu Yonas [email protected] Armauer Hansen Research Institute
Hokin Sam [email protected] National Center for Genome Resources (NCGR)
Holtz Jory [email protected] MicroGenDX
Hong Charles [email protected] DTRA
Hoon Kelly [email protected] Illumina
Hopkins Christopher [email protected] Illumina
House Geoffrey [email protected] Los Alamos National Laboratory
Hovde Blake [email protected] Los Alamos National Laboratory
Howard Ryan [email protected] MRIGlobal
Hu Bin [email protected] Los Alamos National Laboratory
Huang Andrew [email protected] Centers for Disease Control and Prevention
Hubbard Kyle [email protected] Peraton
Illich Mombo [email protected] International Centre for Medical Research in Franceville (CIRMF)
Jacobs Jonathan [email protected] QIAGEN
Jalan Neha [email protected] QIAGEN
Jarvis Courtney [email protected] MicroGenDX
Jiang Chao [email protected] Stanford University
Johnson Shannon [email protected] Los Alamos National Laboratory
Johnson Francis Mayega [email protected] Makerere University
Joseph Lavin [email protected] Centers for Disease Control and Prevention
Joshi Vinay [email protected] LUVAS
Kading Rebekah [email protected] Colorado State University
Kahanda Indika [email protected] Montana State University
Kant Shashi [email protected] Baylor Research Institute
Kapsak Curtis [email protected] Colorado Department of Public Health and Environment
Kayiwa John [email protected] Uganda Virus Research Institute
Keddache Mehdi [email protected] Illumina
Keene Alexandra [email protected] TGen North
Kelliher Julia [email protected] Los Alamos National Laboratory
Kennedy Drew [email protected] Illumina
Kennedy Josh [email protected] University of Arkansas for Medical Sciences
Kerfua Susan [email protected] National Livestock Resources Research Institute (NaLIRRI)
Kittichotirat Weerayuth [email protected] King Mongkut's University of Technology Thonburi
Koehler Jeffrey [email protected] USAMRIID
Koenig Lars [email protected] RTLGenomics
Kolakowski Frank [email protected] Tetracore, Inc.
Korobeynikov Anton [email protected] St. Petersburg State University
Kotorashvili Adam [email protected] National Centre for Disease Control and Public Health
Kovalenko Ganna [email protected] Institute of Veterinary Medicine NAAS
Kumar Anand [email protected] Los Alamos National Laboratory
Kumm Jochen [email protected] 1versal
Ladner Jason [email protected] Northern Arizona University
Lakin Steven [email protected] Colorado State University
Lapidus Alla [email protected] St. Petersburg State University
Last Name Lijing [email protected] Center for Evolutionary and Theoretical Immunology, Department of Biology, Unive
Lavenier Dominique [email protected] CNRS
Lawlor Amanda [email protected] Stowers Institute for Medical Research
LeBrun Erick [email protected] Pebble Labs USA
Lee Eun Mi [email protected] NIAID
Lemmer Darrin [email protected] TGen North
Li Yan [email protected] Centers for Disease Control and Prevention
Liachko Ivan [email protected] Phase Genomics
Liulchuk Mariia [email protected] Public Health Center of the MOH of Ukraine
Lo Chienchi [email protected] Los Alamos National Laboratory
Locklear Chad [email protected] Integrated DNA Technologies
Louha Swarnali [email protected] University of Georgia
Lucas Julie [email protected] MRIGlobal
Lucking Sean [email protected] Centers for Disease Control and Prevention
Madden Joseph [email protected] Georgia State University
Maltseva Elina [email protected] Almaty Branch of National Center for Biotechnology
Mangum Sarah [email protected] RTLGenomics
Mann David [email protected] University of Georgia Center for Food Safety
Manuweera Buwani [email protected] Montana State University
Martinez Torres Teresa [email protected] MRIGlobal
Maughan Jeff [email protected] Brigham Young University
Mauro Lynn [email protected] Illumina
McAllister Gillian [email protected] Centers for Disease Control and Prevention
McBride Amber [email protected] Oak Ridge National Laboratory
Mckay Kim [email protected] Covaris Inc.
McMurry Kim [email protected] Los Alamos National Laboratory
Mekonnen Meseret [email protected] Armauer Hansen Research Institute
Mettler Jacquelyn [email protected] Los Alamos National Laboratory
Miller Daniela [email protected] FDA
Minihan Virginia [email protected] DOD
Mojica Wilfrido [email protected] University at Buffalo
Moll Karen [email protected] National Center for Genome Resources (NCGR)
Morales Demosthenes [email protected] Los Alamos National Laboratory
Moss Robert [email protected] Illumina
Mouncey Nigel [email protected] Joint Genome Institute
Mudge Joann [email protected] National Center for Genome Resources
Mueller Kathryn [email protected] Phase Genomics
Mukashev Nurzhan [email protected] Kazakh Scientific Center for Quarantine and Zoonotic Diseases (KSCQZD)
Mulakken Nisha [email protected] LLNL
Mumey Brendan [email protected] Montana State University
Navakulsirinat Pimpajee [email protected] Private
Navarro Garcia Rocio [email protected] RTLGenomics
Newman Carl [email protected] CIV DTRA J3-7
Nkili-Meyong Andy [email protected] CIRMF
Oakeson Kelly [email protected] Utah Department of Health / Utah Public Health Laboratory
Ogawa Take [email protected] iGenomX
Ohan Juliette [email protected] Los Alamos National Laboratory
Olsen Christian [email protected] Stansberry Research
Olson Karen [email protected] DFSC
Onuska Jaya [email protected] MilliporeSigma
Otwinowski Zbyszek [email protected] UT Southwestern
Pakala Suman [email protected] Vanderbilt University Medical Center
Palmer Nathan [email protected] USDA-ARS
Pantsulaia Meri [email protected] National Center for Disease Control and Public Health (NCDC) of Georgia
Papkiauri Ana [email protected] National Center for Disease Control and Public Health (NCDC) of Georgia
Park Subin [email protected] APHL-CDC
Parker Kyle [email protected] MRIGlobal
Payne Justin [email protected] FDA Center for Food Safety and Applied Nutrition
Pedersen Connor [email protected] Los Alamos National Laboratory
Perry Allison [email protected] Centers for Disease Control and Prevention
Phelps Celina [email protected] NM Dept. of Health, Scientific Lab Division
Phillips Paul [email protected] Northern Arizona University
Piehl Shannon [email protected] PerkinElmer
Porter Sandra [email protected] Digital World Biology
Preston Jeremy [email protected] Illumina
Puiu Daniela [email protected] Johns Hopkins University
Raj Prithvi [email protected] UT Southwestern Medical Center
Rajendren Sujey Kumar [email protected] University of Malaysia, Kelantan
Ramaraj Thiruvarangan [email protected] National Center for Genome Resources (NCGR)
Rambo Mueller Teri [email protected] Roche Sequencing Solutions
Ramsey Kristy [email protected] Lexogen
Ravishankar Chintu [email protected] Kerala Veterinary and Animal Sciences University, Pookode
Ray Linda [email protected] Illumina
Reining Lauren [email protected] TGen North
Rhodes Michael [email protected] Nanostring Technologies
Richmond Todd [email protected] Roche Sequencing Solutions
Robertson James [email protected] FBI Laboratory
Robinson Aaron [email protected] Los Alamos National Laboratory
Rodpan Apaporn [email protected] Thai Red Cross Emerging Infectious Disease Health Sciences
Roe Chandler [email protected] Northern Arizona University
Rosovitz MJ [email protected] NBACC/NBFAC
Rowell Jessica [email protected] WDS/CDC
Russell Joseph [email protected] MRIGlobal
Sahl Jason [email protected] Northern Arizona University
Sanchez Melissa [email protected] University of New Mexico
Sanders Claire [email protected] Los Alamos National Laboratory
Saunders Lauren [email protected] Beckman Coulter Life Sciences
Schilkey Faye [email protected] National Center for Genome Resources (NCGR)
Schilling Kelly [email protected] National Center for Genome Resources (NCGR)
Scully Erin [email protected] USDA-ARS
Sena Johnny [email protected] National Center for Genome Resources (NCGR)
Senutovitch Nina [email protected] Durham Editors
Sevinsky Joel [email protected] CDPHE
Shakya Migun [email protected] Los Alamos National Laboratory
Shaner Kendra [email protected] 10x Genomics
Shapiro Nicole [email protected] Joint Genome Institute
Shoemaker Michael [email protected] U.S. Army
Shteyman Alan [email protected] MRIGlobal
Sikora Per [email protected] Genomics Medicine Sweden
Sim Sheina [email protected] USDA-ARS
Simon Jayne [email protected] Covaris Inc.
Simpson Gary [email protected] Placitas Consulting Group
Siniard Ashley [email protected] Illumina
Sisneros Nick [email protected] PacBio
Skiba Yuriy [email protected] Almaty Branch of National Center for Biotechnology
Smith Todd [email protected] Digital World Biology
Sozhamannan Shanmuga [email protected] DBPAO
Spetzger Rachel [email protected] Illumina
Starkenburg Shawn [email protected] Los Alamos National Laboratory
Stefan Christopher [email protected] USAMRIID
Stewart Gregory [email protected] Tetracore, Inc.
Sullivan Raymond [email protected] JPEO-CBRND
Sundararajan Anitha [email protected] National Center for Genome Resources (NCGR)
Tacheny Erin [email protected] MRIGlobal
Talundzic Eldin [email protected] Centers for Disease Control and Prevention
Tang Yitao [email protected] MilliporeSigma
Tatum Tootie [email protected] Blackhawk Genomics
Ternus Krista [email protected] Signature Science, LLC
Tezera Yodit Alemnew [email protected] Ethiopian Public Health Institute
Thomann Ulrich [email protected] Covaris Inc.
Tiedt Fritz [email protected] Private
Todd Jonathon [email protected] TGen North
Tomashvili Giorgi [email protected] National Center for Disease Control and Public Health (NCDC) of Georgia
Travis Jason [email protected] TGen North
Truong Jenny [email protected] ORISE at CDC
Turner Stephen [email protected] Signature Science, LLC
Udall Joshua [email protected] Iowa State University
Unoarumhi Yvette [email protected] Centers for Disease Control and Prevention
Vazquez Adam [email protected] The Pathogen and Microbiome Institute
Verratti Kathleen [email protected] Johns Hopkins Applied Physics Lab
Vidyaprakash Eshaw [email protected] Centers for Disease Control and Prevention
Vierra Michelle [email protected] PacBio
Vuyisich Grace [email protected] Los Alamos National Laboratory
Vydayko Natallia [email protected] Primary Health Care, Uganda
Wakeland Edward [email protected] UT Southwestern Medical Center
Wallace Lalena [email protected] Defense Threat Reduction Agency
Walters Ron [email protected] Pacific NW Nat'l. Lab (retired)
Wang Yu [email protected] FDA
Ward Judson [email protected] DRAM
Wiggins Kristin [email protected] TGen North
Williams-Newkirk Jo [email protected] Centers for Disease Control and Prevention
Williamson Chase [email protected] Pathogen and Microbiome Institute, NAU
Wood Jonathan [email protected] Wellcome Sanger Institute
Wright Kimberly [email protected] Los Alamos National Laboratory
Xu Zhaohui [email protected] Baylor Research Institute
Yarmosh David [email protected] MRIGlobal
Yeh Kenny [email protected] MRIGlobal
Yenovkian Jonathon [email protected] USFDA
Yi Huiguang [email protected] South university of science and technology
Yin Guohua (Karen) [email protected] New Mexico Consortium
Yin Karen [email protected] New Mexico Consortium
Young Brandon [email protected] Murrieta Genomics
Young Erin [email protected] UPHL
Young Kayla [email protected] Phase Genomics
Zakalashvili Mariam [email protected] National Center for Disease Control and Public Health (NCDC) of Georgia
Zhang Jing [email protected] UTSouthwestern Medical Center @ Dallas
Zhang Xiangli [email protected] Los Alamos National Laboratory
Zharova Mira [email protected] National Veterinary Reference Center
Zhunussova Aigul [email protected] Kazakh Scientific Center for Quarantine and Zoonotic Diseases (KSCQZD)
Zincke Dainty [email protected] University of Florida

The 2019 SFAF Organizing Committee THANKS Again TO ALL OF OUR SPONSORS!