Rapid, Raw-Read Reference and Identification (R4ids): a Flexible Platform for Rapid Generic Species ID Using Long-Read Sequencing Technology
Total Page:16
File Type:pdf, Size:1020Kb
bioRxiv preprint doi: https://doi.org/10.1101/281048; this version posted March 13, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. Rapid, raw-read reference and identification (R4IDs): A flexible platform for rapid generic species ID using long-read sequencing technology. Joe Parker1*, Andrew Helmstetter2, James Crowe1, John Iacona1, Dion Devey1 & Alexander S. T. Papadopulos3* 1The Jodrell Laboratory, Royal Botanic Gardens, Kew, TW9 3AA UK 2CNRS Montpellier, France 3Molecular Ecology and Fisheries Genetics Laboratory, Environment Centre Wales, School of Biological Sciences, University of Bangor, Wales. *Corresponding authors: [email protected] and [email protected] bioRxiv preprint doi: https://doi.org/10.1101/281048; this version posted March 13, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. Abstract The versatility of the current DNA sequencing platforms and the development of portable, nanopore sequencers means that it has never been easier to collect genetic data for unknown sample ID. DNA barcoding and meta- barcoding have become increasingly popular and barcode databases continue to grow at an impressive rate. However, the number of canonical genome assemblies (reference or draft) that are publically available is relatively tiny, hindering the more widespread use of genome scale DNA sequencing technology for accurate species identification and discovery. Here, we show that rapid raw-read reference datasets, or R4IDs for short, generated in a matter of hours on the Oxford Nanopore MinION, can bridge this gap and accelerate the generation of useable reference sequence data. By exploiting the long read length of this technology, shotgun genomic sequencing of a small portion of an organism’s genome can act as a suitable reference database despite the low sequencing coverage. These R4IDs can then be used for accurate species identification with minimal amounts of re- sequencing effort (1000s of reads). We demonstrated the capabilities of this approach with six vascular plant species for which we created R4IDs in the laboratory and then re-sequenced, live at the Kew Science Festival 2016. We further validated our method using simulations to determine the broader applicability of the approach. Our data analysis pipeline has been made available as a Dockerised workflow for simple, scalable deployment for a range of uses. bioRxiv preprint doi: https://doi.org/10.1101/281048; this version posted March 13, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. Introduction DNA-based taxonomic identification has become widespread since the characterisation of “universal”, easily amplifiable and sufficiently variable DNA barcoding genes, such as COI, rbcL and matK (Hebert et al. 2003, 2016a; CBOL Plant Working Group et al. 2009; Coissac et al. 2016). In many circumstances the level of taxonomic resolution that short DNA barcodes can provide is suitable for the chosen purpose, but it has become clear that barcoding genes cannot always provide truly species-level discrimination and identification(Hebert et al. 2003; CBOL Plant Working Group et al. 2009; Little 2011; Hawlitschek et al. 2016). Barcoding approaches have certainly been effective for generic level identification, but resolution at the species level is less consistent (CBOL Plant Working Group et al. 2009; Little 2011; van Velzen et al. 2012; Collins & Cruickshank 2013; Mallo & Posada 2016); Pryron, 2015); and it is evident that assuming that unique barcodes or bins are synonymous with distinct species is potentially misleading (Tang et al. 2012; van Velzen et al. 2012; Collins & Cruickshank 2013; Schmidt et al. 2015; Hawlitschek et al. 2016). Hebert et al. (2016b) conducted extensive barcoding of the particularly well studied insect fauna of Canada (33 572 species) sequencing of more than 1 million samples (14% of samples produced no barcode) assigning barcodes to family or genus level. Assuming that each unique barcode represented a different species, species richness was estimated at three times the taxonomically described number. This figure may be accurate, but the discrepancy highlights that there number of reasons to believe that barcoding approaches are susceptible to errors introduced by low inter- and high intra-specific genetic variation in different groups (Tang et al. 2012; van Velzen et al. 2012; Mallo & Posada 2016; Liu et al. 2017). Broader, genome-scale identification methods hold a great deal of potential to bridge this gap. These approaches remove ambiguity from species-level identifications and produce information rich data that may be more applicable to species delimitation (Coissac et al. 2016; Hollingsworth et al. 2016; Parker et al. 2017). Indeed, we have previously shown that field-based, genome scale sequencing is able to provide accurate species identification in closely related Arabidopsis species (Parker et al. 2017). This study used carefully curated whole genome sequences of A. thaliana and A. lyrata as a reference database (against which newly generated long-read data generated in the field was matched) to explore the potential of this type of data for species identification. All types of accurate taxonomic identification from DNA sequence data are reliant on reference databases of DNA sequences obtained from specimens identified morphologically by experts. This is exemplified by the rigor with which the most high quality DNA barcode databases are constructed (IBOL)(Hebert et al. 2016a) and highlights the important role that taxonomist, taxon-specific expertise and collections based institutions (such as natural history museums and botanic gardens) have played and will continue to play in expanding the utility of DNA-based identification. Increasingly high throughput methods of producing reference DNA barcodes bioRxiv preprint doi: https://doi.org/10.1101/281048; this version posted March 13, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. are being developed, but the rate at which reference genomic data is published is substantially slower. Currently, whole genome sequences are available in public repositories (INSDC) for only 2733 eukaryotic species and these are distributed unevenly across the tree of life – 874 bilatarians, 274 plants and 1237 fungi. This means that there are not enough sequences to represent each family of organism, let alone genus and species. As a result, the use of genome scale data for accurate species identification as in Parker (2017) is apparently a very long way away from having the same applicability as DNA barcodes. Portable single molecule, real-time DNA sequencers have now become a commercial reality. Their potential ubiquity, coupled with cloud computation, mean that bioscience is set to undergo a transformation: millions of researchers, clinicians, conservation professionals and citizen-scientists will have the potential to sequence and analyse genomic material anytime, anywhere (Erlich, 2015). Portable sequencers such as the MinION allow DNA-sequencing to happen anywhere, in real-time, with important applications in human, animal and plant health; border control; resource management; and environmental monitoring. Demonstrated applications have included epidemic monitoring in Guinea (Quick et al., 2016), extremophile sequencing in Antarctica (Michael et al., 2017), assembly of complete plant genomes (Johnson et al., 2017), species ID and phylogenomics in the field (Parker et al., 2017). Sequencing reads from these platforms are orders-of- magnitude longer than those generated by more established, sequencing-by- synthesis technologies. Few tools or techniques are currently optimised for this technology and these developments present opportunities to revisit existing orthodoxies, analyses and uses of DNA data. In this era of mobile genomic sequencing, can genome-scale reference datasets be produced at a rate that species-level ID from genome resequencing will become a practicable and widely applicable reality? High quality reference genomes are still challenging and time consuming to produce, despite the incredible pace at which new data can be generated. However, the results of our earlier work (Parker et al., 2017) suggest that fully annotated and carefully analysed whole genome sequences may not be necessary for species ID from shotgun long read resequencing data. Simulations in which the Arabidopsis genomes were fragmented in silico demonstrated that even quite poor quality genome assemblies would be sufficiently information rich for confident identification. This raises the possibility that lab or field generated, genome-shotgun long read data itself may be able to act as a sufficiently detailed sketch of the genome for future identification, which we term rapid-raw-read-reference for ID, or ‘R4IDs’. In this