An Open-Sourced Bioinformatic Pipeline for the Processing of Next-Generation Sequencing Derived Nucleotide Reads

bioRxiv preprint doi: https://doi.org/10.1101/2020.04.20.050369; this version posted May 28, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Collin et al. | bioRxiv | May 28, 2020 | 1–12

Identification and authentication of ancient metagenomic DNA

Thomas C. Collin1,*, Konstantina Drosou2,3, Jeremiah Daniel O’Riordan4, Tengiz Meshveliani5, Ron Pinhasi6, and Robin N. M. Feeney1

1School of Medicine, University College Dublin, Ireland
2Division of Cell Matrix Biology Regenerative Medicine, University of Manchester, United Kingdom
3Manchester Institute of Biotechnology, School of Earth and Environmental Sciences, University of Manchester, United Kingdom
[email protected]
5Institute of Paleobiology and Paleoanthropology, National Museum of Georgia, Tbilisi, Georgia
6Department of Evolutionary Anthropology, University of Vienna, Austria
*Corresponding Author

Abstract

Bioinformatic pipelines optimised for the processing and assessment of metagenomic ancient DNA (aDNA) are needed for studies that do not make use of high-yielding DNA capture techniques. These bioinformatic pipelines are traditionally optimised for broad aDNA purposes, are contingent on selection biases and are associated with high costs. Here we present a bioinformatic pipeline optimised for the identification and assessment of ancient metagenomic DNA without the use of expensive DNA capture techniques. Our pipeline actively conserves aDNA reads, allowing the application of a bioinformatic approach by identifying the shortest reads possible for analysis (22-28bp). The time required for processing is drastically reduced through the use of a 10% segmented non-redundant sequence file (229 hours to 53). Processing speed is improved through the optimisation of BLAST parameters (53 hours to 48). Additionally, the use of multi-alignment authentication in the identification of taxa increases overall confidence of metagenomic results. DNA yields are further increased through the use of an optimal MAPQ setting (MAPQ 25) and the optimisation of the duplicate removal process using multiple sequence identifiers (a 4.35-6.88% better retention). Moreover, characteristic aDNA damage patterns are used to bioinformatically assess ancient vs. modern DNA origin throughout pipeline development. Of additional value, this pipeline uses open-source technologies, which increases its accessibility to the scientific community.

Open Source | Metagenomic | Bioinformatics Pipeline | Ancient DNA | Genomics

Correspondence: [email protected]

Introduction

Optimised bioinformatic pipelines are of particular importance in the broad study of metagenomics, which consists of large datasets of multi-origins, and the associated complexities of large-scale computational processes such as comparative sequence alignment and multiple taxa identifications. The emerging field of ancient metagenomics adds to these processing complexities with the need for additional steps in the separation and authentication of ancient sequences from modern sequences. Currently, there are few pipelines available for the analysis of ancient metagenomic DNA (aDNA)1–4. The limited number of bioinformatic pipelines for aDNA metagenomics can be attributed to a low yield of DNA compared to that achieved in modern metagenomic DNA extraction. This results in the use of high-cost DNA capture techniques to improve aDNA yields to levels suitable for bioinformatic assessment3–7. In parallel, the lack of bioinformatic pipelines optimised for the processing of lower aDNA yields (i.e. those studies which do not use high-cost DNA capture techniques) deters researchers from exploring and developing alternative methods of aDNA extraction for metagenomic purposes.

Those metagenomic studies performed using DNA capture techniques, necessary for existing bioinformatic pipelines, allow for the implementation of “quick” bioinformatic comparisons using RefSeq or Blastn megablast options due to higher yields of DNA5–7. Due to higher yields, however, these pipelines, by their nature, often compromise between read conservation and processing time, thus losing sequences throughout individual computational steps. In doing so these pipelines do not account for the nature of metagenomic aDNA which, having originated from multiple sources (e.g. bone8, soil3,9), can vary in aDNA yields and in the degree of damage over time, leading to a potential loss and underrepresentation of aDNA sequences.

The damage pattern of aDNA serves as a useful tool in the distinction between ancient and modern DNA sequences10,11. Characteristically, ancient sequences should consist of heavily fragmented DNA strands6,12,13, depurination, depyrimidination and deamination events14,15 and miscoding lesions16,17. These characteristics therefore form the central damage pattern for aDNA authentication in the development of this bioinformatic pipeline. In addition to the high cost associated with DNA capture techniques, a further issue arises from the direct selection of target taxa for probe generation prior to sequencing. This action inevitably introduces a selection bias into a study and could prevent the metagenomic analysis of all aDNA present within a sample, with targeted DNA yields overshadowing untargeted yields7,18,19.

The additional cost associated with computation and long processing times for comparative analysis acts as a barrier to more widespread use of aDNA metagenomics in fields such as archaeology, bioanthropology and paleoenvironmental sciences. Bioinformatic costs associated with computation to achieve faster processing speeds or the use of third-party interface platforms are usually unavoidable due to the large demands on processing time and the steep learning curve needed to gain proficiency in the open-source alternatives, which are often less user-friendly and lack the benefit of customer support and easily accessible manuals.

Using a method developed for the extraction of metagenomic aDNA from anthropogenic sediment (sediment that has come into contact with past human activity) without the use of DNA capture techniques9,20,21, we present a bioinformatic pipeline optimised for the identification and authentication of metagenomic aDNA that can be applied to studies producing comparatively lower yields of aDNA. The development of the bioinformatic pipeline involves four fundamental underpinnings, in which it must:

• Cater to low yielding metagenomic aDNA and conserve reads wherever possible
• Be able to process data within a reasonable timeframe and allow for multiple taxa identifications
• Be developed using open-source technologies and software wherever possible to facilitate universal access and to bypass financial barriers
• Be accompanied by a step-by-step user-friendly manual, to facilitate its use without coding and programming expertise (supplementary; SI)

… of genomic assignments. Mammalia and Plantae assignments at the taxonomic genus level were assigned a unique identification number. Using the random number generator specified in the SI (step 9), unique identification numbers were selected, and the corresponding assignments were used for further in-depth alignment to their associated reference genomes using Burrows-Wheeler Aligner (BWA)29. Alignments were converted to SAM format30 and imported into DNA visualisation software. This visualisation allowed for the manual identification of (C>T, G>A) damage patterns at the terminal ends of aligned sequences against a reference genome. Fragments without these characteristics were deemed modern in origin.

This procedure resulted in the identification of ancient sequences ranging from 22-28bp in length using the extraction method outlined by Collin9,20. For comparison, using Dabney’s method, which was developed for aDNA extraction from bone31, Slon3 found that the lowest extractable reads were 35bp when using a ladder that mimicked aDNA. This suggests that the method used in this study is capable of extracting shorter aDNA fragments and as such may achieve a wider range of representative sequences. This is particularly important when considering the nature of DNA damage over time, which not only results in increased fragmentation through oxidative strand breakages10,11, but also the potentially disproportionate lesion damage to genomes with high cytosine content16,17,32, which could otherwise be overlooked. It should be noted, however, that the shorter the DNA sequence, the more prone it is to misalignment errors33. For this reason, a cut-off of 28bp was selected as a conservative threshold for the purpose of this study.

Creation of a non-redundant, 10% representative sample file for comparative analysis

One of the most common issues with the bioinformatic assessment of DNA sequences is the processing power required (corresponding to speed) and time it takes to process data. This is especially true for metagenomic data, where multiple taxa identifications
