Estimating the distribution of viral taxa in next-generation sequencing data using artificial neural networks

Moritz Kohls*, Magdalena Kircher, Jessica Krepel, Pamela Liebig and Klaus Jung

Affiliations: Moritz Kohls, Magdalena Kircher, Jessica Krepel and Pamela Liebig: University of Veterinary Medicine Hannover (Tierärztliche Hochschule Hannover). Klaus Jung ( [email protected] ): Tierärztliche Hochschule Hannover, Institut für Tierzucht und Vererbungsforschung, https://orcid.org/0000-0001-9955-748X
*Correspondence: [email protected], Institute for Animal Breeding and Genetics, University of Veterinary Medicine Hannover, Bünteweg 17p, 30559 Hannover, GER. Full list of author information is available at the end of the article.

Research article
Keywords: Artificial neural network, Virus classification, Supervised learning, Next-Generation Sequencing, Viral metagenomics
Posted Date: December 17th, 2020
DOI: https://doi.org/10.21203/rs.3.rs-127809/v1
License: This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

Background: Estimating the taxonomic composition of viral sequences in a biological sample processed by next-generation sequencing is an important step for comparative metagenomics. For that purpose, sequencing reads are usually classified by mapping them against a database of known viral reference genomes. This fails, however, to classify reads from novel viruses and quasispecies whose reference sequences are not yet available in public databases.
Methods: In order to circumvent the problem of a mapping approach with unknown viruses, the feasibility and performance of neural networks to classify sequencing reads to taxonomic classes is studied. For that purpose, taxonomy and genome data from the NCBI database are used to sample artificial reads from known viruses with known taxonomic attribution. Based on these training data, artificial neural networks are fitted and applied to classify single viral read sequences to different taxa. Model building includes different input features derived from artificial read sequences as possible predictors which are chosen by a feature selection method. Training, validation and test data are computed from these input features. To summarise classification results, a generalised confusion matrix is proposed which lists all possible misclassification combination frequencies. Two new formulas to statistically estimate taxa frequencies are introduced for studying the overall viral composition.
Results: We found that the best taxonomic level supported by the NCBI database is that of viral orders. Prediction accuracy of the fitted models is evaluated on test data and classification results are summarised in a confusion matrix, from which diagnostic measures such as sensitivity and specificity as well as positive and negative predictive values are calculated. The prediction accuracy of the artificial neural net is considerably higher than for random classification and posterior estimation of taxa frequencies is closer to the true distribution in the training data than simple classification or mapping results.
Conclusions: Neural networks are helpful to classify sequencing reads into viral orders and can be used to complement the results of mapping approaches. The machine learning approach is not limited to already known viruses.
In addition, statistical estimations of taxa frequencies can be used for subsequent comparative metagenomics.

Background

Next-generation sequencing (NGS) is now regularly used to identify viral sequences in the biological sample of an infected host in order to relate the presence of a virus to symptoms of the host [1–3]. Furthermore, NGS data are regularly used to determine the taxonomic composition of pathogens for the purpose of comparative metagenomics [4] or reconstruction of networks between microbial taxa [5]. Most computational virus detection pipelines or pipelines for determining the taxonomic composition map sequencing reads or assembled contigs against viral reference sequences available in databases [6–10]. These mapping approaches have proven successful in a large number of examples; however, they mostly fail to classify reads from newly emerging viruses whose sequences are not yet deposited in a database. Besides such mapping approaches, machine learning approaches have been used to classify reads to taxonomic ranks [11] such as genus, family, order, class and phylum. Among the machine learning approaches listed by Peabody et al., only one approach studies the performance of deep learning models for taxonomic classification (Rasheed et al. [12]). Rasheed et al. use artificial neural networks (ANN) to classify reads to different taxonomic ranks and evaluate their models on datasets representing about 500 microbial species. Some approaches have also been undertaken to use machine learning models on NGS data to identify or classify reads that belong to newly emerged viruses. Zhang et al. [13] as well as Bartoszewicz et al.
[14] employ diverse machine learning models to identify viruses that can infect humans. Whilst Zhang et al. applied the k-nearest neighbour (k-NN) model using k-mer frequencies as input features for their classification, Bartoszewicz et al. employed a convolutional neural network (CNN) on simulated reads and contigs.

In contrast to the approaches by Zhang et al. as well as Bartoszewicz et al., in this work we study the ability of an ANN to classify sequencing reads on the taxonomic rank of nine biological orders. While the ANN studied by Rasheed et al. was fitted on sequence reads from a rather small number of microbes, we train our model on the basis of several thousand viral sequences available at the NCBI database [15]. In comparison to support vector machines or linear discriminant analysis, ANNs are reported to achieve the highest accuracy rates on test data, though it depends on the specific classification problem [16, 17].

Before fitting ANNs, we link available viral reference sequences with their taxonomic description and study the distribution of different taxonomic ranks (phyla, subphyla, classes, subclasses, orders, etc.) to identify on which of these ranks a solid basis for building a training dataset would be reasonable. Next, we extract k-mer distributions and inter-nucleotide distances as input features from the available reference sequences. In total, we generated 120 input features.

In the methods section, we describe the building of the training data in more detail, as well as the ANN models used. Furthermore, we present the new formulas for estimating the frequency distribution of viral orders in a sample. In the results section, we present a simulation study in which we evaluate the performance of the ANNs and demonstrate how to estimate the frequency distributions.
We also compare the precision of the estimations with frequency distributions derived from a mapping approach. Finally, we discuss benefits and limitations of our approach.

Methods

Taxonomy and genome data

A taxonomy database containing the data needed for taxonomic classification can be downloaded from the ftp-server ftp://ftp.ncbi.nih.gov/refseq/release/viral of the National Center for Biotechnology Information [15, 18]. We downloaded these data in March 2020.

A section of the taxonomy database file fullnamelineage.dmp from this ftp-server is depicted in Figure 1. In this file, taxonomic information about species is listed in rows. The file begins with archaea, bacteria and fungi; viral data are included from line 2,007,414 onwards. Row entries contain species names and taxonomy ranks, which are unfortunately neither labelled nor complete.

We assign these taxa to taxonomy ranks based on the International Committee on Taxonomy of Viruses [19]. To enable correct assignments, we use the taxa endings summarised in Table 1. The ending "...viridae" indicates to which family the virus belongs. For example, Bovine adenovirus 4 on line 2,007,418 in Figure 1 is a member of the family Adenoviridae.

As the taxonomy database does not provide any unique NC identifier, the species names of the reference genome file are searched in the taxonomy file to link genome with taxonomy data.

We obtain viral reference genome FASTA files [20] again from the NCBI database and concatenate the three files into a single one. The first lines of the FASTA file (Figure 2) show its composition. Line one consists of a unique identifier together with its species name. The genome sequence, consisting of the single-letter codes A, C, G and T that represent the four nucleobases, follows from line two onwards.
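
The FASTA layout described here (a header line holding an identifier and a species name, followed by sequence lines) can be read with a few lines of Python. The following is a minimal sketch and not the authors' code: the function name read_fasta and the two example records are hypothetical, and it assumes that the identifier is the first whitespace-delimited token after the ">" character.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, description, sequence) for each FASTA record."""
    identifier, description, chunks = None, "", []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if identifier is not None:
                yield identifier, description, "".join(chunks)
            # Assumption: identifier = first token, the rest = species name.
            identifier, _, description = line[1:].partition(" ")
            chunks = []
        else:
            chunks.append(line.upper())  # sequence may span several lines
    if identifier is not None:
        yield identifier, description, "".join(chunks)


# Hypothetical records for illustration only (not taken from the NCBI files).
example = io.StringIO(
    ">X1 Example virus A, complete genome\nACGT\nacgt\n"
    ">X2 Example virus B\nTTGA\n"
)
for ident, desc, seq in read_fasta(example):
    print(ident, "|", desc, "|", len(seq), "bases")
```

Because the function consumes any iterable of lines, concatenating several FASTA files (as done here for the three reference files) amounts to chaining the open file handles before passing them in.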
Combining genome and taxonomy data, we create a structured taxonomy table that includes the species from the FASTA file and replaces the unstructured original taxonomy file.
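
The two steps of this subsection, assigning unlabelled taxa to ranks by their endings and linking FASTA species names to taxonomy rows, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation. The suffix list follows general ICTV naming conventions and may differ from Table 1; the field layout of fullnamelineage.dmp (tax id, name, semicolon-separated lineage, separated by "|") is assumed; the example row is simplified and not copied from the database; and the longest-prefix matching rule in link_species is a hypothetical choice, since the text only states that species names are searched in the taxonomy file.

```python
from typing import Dict, Optional, Tuple

# Assumed rank endings (ICTV-style); Table 1 of the paper may list others.
SUFFIX_TO_RANK = [
    ("viricotina", "subphylum"),
    ("viricota", "phylum"),
    ("viricetes", "class"),
    ("virales", "order"),
    ("virineae", "suborder"),
    ("viridae", "family"),
    ("virinae", "subfamily"),
]


def rank_of(taxon: str) -> Optional[str]:
    """Return the rank implied by the taxon's ending, or None if unknown."""
    name = taxon.strip().lower()
    for suffix, rank in SUFFIX_TO_RANK:
        if name.endswith(suffix):
            return rank
    return None


def parse_lineage_line(line: str) -> Tuple[str, Dict[str, str]]:
    """Split one row into (species name, {rank: taxon}).

    Assumed layout: tax_id | name | taxon1; taxon2; ... |
    """
    fields = [field.strip() for field in line.rstrip("\n").split("|")]
    name, lineage = fields[1], fields[2]
    ranks = {}
    for taxon in lineage.split(";"):
        taxon = taxon.strip()
        rank = rank_of(taxon)
        if rank is not None:
            ranks[rank] = taxon
    return name, ranks


def link_species(description: str, taxonomy: Dict[str, dict]) -> Optional[str]:
    """Hypothetical rule: longest taxonomy name that prefixes the header."""
    matches = [name for name in taxonomy if description.startswith(name)]
    return max(matches, key=len) if matches else None


# Simplified, illustrative row (tax id and lineage are placeholders).
row = "0\t|\tBovine adenovirus 4\t|\tViruses; Adenoviridae; \t|\n"
species, ranks = parse_lineage_line(row)
table = {species: ranks}
hit = link_species("Bovine adenovirus 4, complete genome", table)
print(hit, table[hit])
```

Matching on names rather than identifiers is fragile when one name is a prefix of another, which is why the sketch prefers the longest match; an exact-match rule would be stricter but would miss headers that carry suffixes such as "complete genome".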