Specific Primer Design for Accurate Detection of SARS-CoV-2 Using Deep Learning

Alejandro Lopez-Rincon1, Alberto Tonda2, Lucero Mendoza-Maldonado3, Daphne G.J.C. Mulders4, Richard Molenkamp4, Eric Claassen5, Johan Garssen1,6, Aletta D. Kraneveld1

1. Division of Pharmacology, Utrecht Institute for Pharmaceutical Sciences, Faculty of Science, Utrecht University, Universiteitsweg 99, 3584 CG Utrecht, the Netherlands
2. UMR 518 MIA-Paris, INRAE, c/o 113 rue Nationale, 75103, Paris, France
3. Hospital Civil de Guadalajara "Dr. Juan I. Menchaca". Salvador Quevedo y Zubieta 750, Independencia Oriente, C.P. 44340 Guadalajara, Jalisco, Mexico
4. Department of Viroscience, Erasmus Medical Center, Rotterdam, the Netherlands
5. Athena Institute, Vrije Universiteit, De Boelelaan 1085, 1081 HV Amsterdam, the Netherlands
6. Department Immunology, Danone Nutricia research, Uppsalalaan 12, 3584 CT Utrecht, the Netherlands

(Submitted: 23 April 2020 – Published online: 27 April 2020)

DISCLAIMER
This paper was submitted to the Bulletin of the World Health Organization and was posted to the COVID-19 open site, according to the protocol for public health emergencies for international concern as described in Vasee Moorthy et al. (http://dx.doi.org/10.2471/BLT.20.251561). The information herein is available for unrestricted use, distribution and reproduction in any medium, provided that the original work is properly cited as indicated by the Creative Commons Attribution 3.0 Intergovernmental Organizations licence (CC BY IGO 3.0).

RECOMMENDED CITATION
Lopez-Rincon A, Tonda A, Mendoza-Maldonado L, Mulders D.G.J.C., Molenkamp R, Claassen E, et al. Specific Primer Design for Accurate Detection of SARS-CoV-2 Using Deep Learning. [Preprint]. Bull World Health Organ. E-pub: 27 April 2020.
doi: http://dx.doi.org/10.2471/BLT.20.261842

April 23, 2020

Abstract

Objective
Deep learning techniques can deliver remarkable results in biology, being able to flawlessly perform complex tasks such as differentiating virus strains of the same family. Nevertheless, most of the models created by these algorithms are black boxes, and their decision process is impervious to human interpretation. In this paper, we apply techniques of explainable AI to the task of discovering representative genomic sequences in SARS-CoV-2, to finally generate specific primers.

Methods
Starting from a convolutional neural network trained on 553 sequences from the 2019 Novel Coronavirus Resource database, we distinguish the genome of virus strains from the Coronavirus family with considerable accuracy (> 98%). Next, we analyze the network's behavior on every sample, to discover sequences used by the model to classify SARS-CoV-2. Then, using feature selection algorithms, we find sequences exclusive to SARS-CoV-2. Finally, using the identified sequences, we develop a SARS-CoV-2 specific primer set and test it using a conventional PCR.

Findings
A first validation, performed on 583 samples from the NGDC repository and 20,604 from the NCBI repository, shows that we can identify SARS-CoV-2 from more than 900 other viruses with a high classification accuracy (> 99%) using only 12 sequences of 21 base pairs. Then, we compute the frequency of appearance of these 12 sequences in 9,294 samples from the GISAID repository, observing frequencies ranging from 95.24% to 99.73%. Finally, we use one of the sequences as a forward primer, generating a primer set. Testing the primer set using a conventional PCR delivers a sensitivity similar to routine diagnostic methods, and 100% specificity when compared with other coronaviruses and when differentiating between SARS-CoV-2 positive patients (n=5) and controls (n=3).

Conclusion
Our methodology, combining deep learning, viromics and primer design to develop accurate detection of SARS-CoV-2 by means of qPCR, proved to be effective. This approach of detection using cDNA or DNA sequences can be applied to a range of different problems, such as mutations in cancer and autoimmune diseases. Finally, considering the possibility of future pandemics, this technology will be suitable for quickly and accurately creating diagnostic methods to combat the spread.

Introduction

The Coronaviridae family presents a positive-sense, single-stranded RNA genome. These viruses have been identified in avian and mammal hosts, including humans. Coronaviruses have genomes from 26.4 kilo base-pairs (kbps) to 31.7 kbps, with G + C contents varying from 32% to 43%; human-infecting coronaviruses belonging to this family include SARS-CoV, MERS-CoV, HCoV-OC43, HCoV-229E, HCoV-NL63 and HCoV-HKU1 [1]. In December 2019, SARS-CoV-2, a novel, human-infecting Coronavirus, was identified in Wuhan, China, using Next Generation Sequencing (NGS) [2].

As is typical for an RNA virus, new mutations appear in every replication cycle of a Coronavirus, and its average evolutionary rate is roughly 10⁻⁴ nucleotide substitutions per site each year [2]. In the specific case of SARS-CoV-2, RT-qPCR testing using primers in the ORF1ab and N genes has been used to identify the infection in humans [3]. However, this method presents a high false negative rate (FNR), with a detection rate of 30-50% [4]. This low detection rate can be explained by the variation of viral RNA sequences within virus species, and the viral load in different anatomic sites [5]. The population mutation frequency of site 8,872, located in the ORF1ab gene, and site 28,144, located in the ORF8 gene, gradually increased from 0 to 29% as the epidemic progressed [6].

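To give a rough sense of scale for the genome sizes, G + C content and evolutionary rate quoted above, the following Python sketch computes the G + C fraction of a sequence and the expected number of substitutions per genome per year. It is only an illustration: the helper names and the toy sequence are hypothetical, and the calculation assumes the quoted rate of roughly 10⁻⁴ substitutions per site per year applies uniformly across the genome.

```python
def gc_content(sequence: str) -> float:
    """Fraction of G and C among the A/C/G/T/U bases of a sequence."""
    seq = sequence.upper()
    counted = sum(seq.count(base) for base in "ACGTU")
    if counted == 0:
        raise ValueError("sequence contains no A/C/G/T/U bases")
    return (seq.count("G") + seq.count("C")) / counted


def expected_substitutions_per_year(genome_length: int,
                                    rate_per_site: float = 1e-4) -> float:
    """Expected substitutions per genome per year.

    Assumes a uniform per-site rate (a simplification); the default is the
    rough figure quoted in the text.
    """
    return genome_length * rate_per_site


# Toy sequence, made up for illustration (not a real viral sequence).
toy_sequence = "ATGCGCTA"
toy_gc = gc_content(toy_sequence)

# Genome length bounds quoted in the text: 26.4 kbps and 31.7 kbps.
low = expected_substitutions_per_year(26_400)
high = expected_substitutions_per_year(31_700)

print(f"G + C fraction of toy sequence: {toy_gc:.2f}")
print(f"Expected substitutions per year: {low:.2f} to {high:.2f}")
```

Under these assumptions, a coronavirus genome would accumulate on the order of three substitutions per year, which helps explain why primers targeting a fixed region can gradually lose sensitivity.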
As of March 27th, 2020, the new SARS-CoV-2 has 462,684 confirmed cases across almost all countries, with 250,287 cases in the European region [7]. In addition, SARS-CoV-2 has an estimated mortality rate of 3-4%, and it is spreading faster than SARS-CoV and MERS-CoV [8]. SARS-CoV-2 assays can yield false positives if they are not targeted specifically to SARS-CoV-2, as the virus is closely related to other Coronavirus organisms. In addition, SARS-CoV-2 may present with other respiratory infections, which makes it even more difficult to identify [9, 10].

Thus, it is fundamental to improve existing diagnostic tools to contain the spread. For example, diagnostic tools combining computed tomography (CT) scans with deep learning have been proposed, achieving an improved detection accuracy of 82.9% [11]. Another solution for identifying SARS-CoV-2 is additional sequencing of the viral complementary DNA (cDNA). We can use sequencing data from cDNA resulting from the PCR of the original viral RNA, e.g., Real-Time PCR amplicons (Fig. 1), to identify SARS-CoV-2 [12].

Fig 1. PCR amplicons sequencing procedure.

Classification using viral sequencing techniques is mainly based on alignment methods such as FASTA [13] and BLAST [14]. These methods rely on the assumption that DNA sequences share common features, and that their order prevails among different sequences [15, 16]. However, these methods require base sequences for the detection [17]. It is nevertheless necessary to develop innovative, improved diagnostic tools that target the genome to improve the identification of pathogenic variants, as sometimes several tests are needed for an accurate diagnosis. As an alternative, deep learning methods have been suggested for the classification of DNA sequences, as these methods do not need pre-selected features to identify or classify DNA sequences.
Deep Learning has been efficiently used for the classification of DNA sequences, using one-hot label encoding and Convolutional Neural Networks (CNN) [18, 19], although the examples in the literature feature DNA sequences of up to 500 bps only.

In particular, for the case of viruses, NGS genomic samples might not be identified by BLAST, as there are no reference sequences valid for all genomes, because viruses have a high mutation frequency [20]. Alternative solutions based on deep learning have been proposed to classify viruses by dividing sequences into pieces of fixed length, ranging from 300 bps [20] to 3,000 bps [21]. However, this approach has the negative effect of potentially ignoring part of the information contained in the input sequence, which is disregarded if it cannot completely fill a piece of fixed size.

Given the impact of the worldwide outbreak, international efforts have been made to simplify access to viral genomic data and metadata through international repositories, such as the 2019 Novel Coronavirus Resource (2019nCoVR) repository [6], the National Center for Biotechnology Information (NCBI) repository [22] and the Global Initiative on Sharing All Influenza Data (GISAID) repository [23], expecting that the ease of acquiring information would make it possible to develop medical countermeasures to control the disease worldwide, as happened in similar cases earlier [24–26]. Thus, taking advantage of the information available from international resources without any political and/or economic borders, we propose an innovative system based on viral gene sequencing.

Starting from a CNN trained to separate Coronavirus samples belonging to different strains [27], including SARS-CoV-2, we apply techniques inspired by explainable AI in computer vision to discover representative cDNA sequences that the network uses to classify SARS-CoV-2.
We then validate the discovered sequences on datasets not used during the training of the CNN, and show how to exploit them to create a novel, highly informative feature space. Experimental results show that the new feature space enables traditional, simple classifiers to correctly identify SARS-CoV-2 with remarkable accuracy (> 99%).
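The idea of turning discovered sequences into a feature space can be illustrated with a minimal Python sketch. It assumes, for illustration only, that each feature records whether a candidate sequence occurs as an exact substring of a sample; the function names and the short toy sequences are hypothetical and are not taken from the pipeline described in this paper. The same presence table also gives the frequency of appearance of each candidate across a set of samples.

```python
from typing import Dict, List


def presence_features(genome: str, candidates: List[str]) -> List[int]:
    """One 0/1 feature per candidate: 1 if it occurs exactly in the genome."""
    genome = genome.upper()
    return [1 if cand.upper() in genome else 0 for cand in candidates]


def appearance_frequency(genomes: List[str],
                         candidates: List[str]) -> Dict[str, float]:
    """Fraction of genomes in which each candidate sequence appears."""
    if not genomes:
        return {cand: 0.0 for cand in candidates}
    rows = [presence_features(g, candidates) for g in genomes]
    return {
        cand: sum(row[i] for row in rows) / len(genomes)
        for i, cand in enumerate(candidates)
    }


# Toy data, made up for illustration (not real viral sequences; the
# sequences discussed in the text are 21 base pairs long).
CANDIDATES = ["ACGTTGCA", "GGATCCAA"]
GENOMES = [
    "TTACGTTGCAGG",      # contains the first candidate only
    "ACGTTGCAGGATCCAA",  # contains both candidates
    "CCCCGGGGTTTT",      # contains neither candidate
]

# Each row is the feature vector a simple classifier would receive.
features = [presence_features(g, CANDIDATES) for g in GENOMES]
frequencies = appearance_frequency(GENOMES, CANDIDATES)

for genome, row in zip(GENOMES, features):
    print(genome, row)
for cand, freq in frequencies.items():
    print(f"{cand}: present in {freq:.0%} of samples")
```

Exact substring matching is a deliberate simplification here: it ignores ambiguous bases and sequencing errors, which a real analysis would likely need to handle.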