Using Machine Learning for Antimicrobial Resistant DNA Identification Jason I
Total Page:16
File Type:pdf, Size:1020Kb
View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Southern Methodist University SMU Data Science Review Volume 2 | Number 2 Article 12 2019 Using Machine Learning for Antimicrobial Resistant DNA Identification Jason I. Lingle SMU, [email protected] John Santerre SMU, [email protected] Follow this and additional works at: https://scholar.smu.edu/datasciencereview Part of the Computational Biology Commons Recommended Citation Lingle, Jason I. and Santerre, John (2019) "Using Machine Learning for Antimicrobial Resistant DNA Identification," SMU Data Science Review: Vol. 2 : No. 2 , Article 12. Available at: https://scholar.smu.edu/datasciencereview/vol2/iss2/12 This Article is brought to you for free and open access by SMU Scholar. It has been accepted for inclusion in SMU Data Science Review by an authorized administrator of SMU Scholar. For more information, please visit http://digitalrepository.smu.edu. Lingle and Santerre: Using Machine Learning for Antimicrobial Resistant DNA Using Machine Learning for Antimicrobial Resistant DNA Identification Jason Lingle, John Santerre Southern Methodist University Dallas, Texas USA fjlingle, [email protected] Abstract. In this paper, we present a machine learning-based method- ology for identifying bacteria DNA sub-sequences that are associated with antimicrobial resistance. The dramatic rise in cases of antibiotic re- sistant bacteria being an increasing threat across the globe with existing treatments rendered ineffective in many cases due to mutations of their DNA. Neisseria Gonorrhea is one such bacteria that has shown increas- ing antimicrobial resistance (AMR) and has been identified as an urgent threat by the Centers for Disease Control (CDC). To help address AMR, researchers need tools to help them identify resistance quickly and accu- rately. Fortunately, by the use of machine learning, we are able to make a prediction of an isolate being resistant or susceptible to antibiotics given a unique, overlapping DNA substring or k-mer of length 10 with accu- racy above 90%. Using this approach, researchers without resources may be able to identify the resistant isolates to determine proper course of action and prevent wide infections. 1 Introduction Antibiotics are drugs that are used to treat and prevent bacterial infections. They have been an important tool for doctors in dealing with infections since the introduction of penicillin in 1928. While other antibiotics have been devel- oped since then, antibiotic resistant bacteria have rapidly emerged as a threat. Almost all antibiotics have developed some level of resistance and have lead to increasing health and economic burdens to health care [1]. In the United States alone, at least 2 million people are reported each year as having an antibiotic resistant infection which leads to at least 23,000 fatalities. In 2013, the Centers for Disease Control and Prevention (CDC) published a report on the top 18 antibiotic threats in the US which highlighted the increasing threat of antimi- crobial resistance to public safety. Furthermore, some projections suggest that deaths due to bacterial infections that are resistant to multiple antibiotics are projected to reach 10 million annually by 2050 which is a higher mortality rate than cancer [2]. In September 2018, the United Nations further sounded the alarm related to this threat by issuing "The AMR (Antimicrobial Resistance) Challenge". The purpose of the challenge was to seek alignment from parties Published by SMU Scholar, 2019 1 SMU Data Science Review, Vol. 2 [2019], No. 2, Art. 12 such as pharmaceutical companies, medical professionals, and governments in areas such as development of drugs and testing as well as antibiotic use. Treatment of patients is usually time critical, but timing is even more essen- tial when antimicrobial resistant organisms are involved. Rapid identification of antibiotic resistant bacteria greatly improves the clinical and financial outcomes for patients[3]. Additionally, proper identification of antibiotic resistant infec- tions is one of the most effective ways to reduce antibiotic misuse. Rapid and accurate diagnosis have been cited in helping reduce the empiric use of antibiotics where selection of the proper antibiotic is based on the provider's experience or educated guess. This is important as in some cases, patients may be prescribed multiple antibiotics which are harmful to their antibiotic resistance in the long term[4]. With 78 million cases globally, Gonorrhea is among the most prevalent sexu- ally transmitted infections (STI) even as diagnosis rates continue to increase[6]. The bacterium that causes Gonorrhea, Neisseria Gonorrhea, has been identi- fied as a global threat and priority as it is becoming increasingly resistant to existing treatments[5]. Extended-spectrum cephalosporins (ESCs) represent the "last-line" treatment, and there are no new options available for treatment and few in development as the current. As the first-line antibiotics treatments using ciproflaxin, cefixime, and azithromycin for Neisseria Gonorrhea have experienced high levels of resistance, there is increasing concern that Neisseria Gonorrhea may become untreatable. This is highlighted by a case in 2018 where an isolate from a heterosexual male being treated for Neisseria was resistant to treatment from combined high-level azithromycin resistance and ceftriaxone. This increasing resistance to existing treatments poses a threat to segments that are at high risk of exposure such as men who have sex with men (MSM), individuals who have sex with multiple partners, and recreational drug users [6]. The threat of this STI is increased by the lack of signs to those infected by Gonorrhea as symptoms typically develop after two weeks but might be ascribed to something else. Furthermore, Gonorrhea is spread easily with the transmission rate from a single sexual encounter is 50% to 60% from an infected man to an uninfected woman and 20% from an infected woman to an uninfected man [7]. The techniques used for identifying AMR bacteria has evolved over time from culture-based testing. Antibiotic resistance information was traditionally gathered by public health officials who analyzed bacteria isolates after they were grown in a culture [8]. Conventional approaches are known to have limitations as they are heavily targeted to testing human pathogens. Additionally, only a few bacterial microbes can be cultured. Sequencing of pure culture isolates is a newer approach to detection and identification of AMR. By using genome sequencing, it is possible to identify the AMR phenotype with high accuracy [8]. However, genome sequencing may not be the best solution as most models interpret the genomic code rather than predicting a phenotype from the genome sequence because of the complex relationship between a genotype and phenotype [9, 10]. Moreover, DNA sequencing is a complex task where there is a high risk of misinterpretation without the appropriate expertise [11]. https://scholar.smu.edu/datasciencereview/vol2/iss2/12 2 Lingle and Santerre: Using Machine Learning for Antimicrobial Resistant DNA To improve the process in identifying the genes that have mutated to make Neisseria Gonorrhea resistant to antibiotics, we identify the responsible DNA regions which can then be mapped back to genes using machine learning. We take each contig and divide them into smaller sub-sections called k-mers which can be used as a feature. We can then evaluate those features using feature importance to identify the k-mers that are most important for antimicrobial resistance. While we are comparing the DNA of susceptible and resistant isolates to identify mutations, this is by no means a simple string search. We will create a unique, overlapping matrix with sub-strings of a fixed length k. The matrix is typically sparse with millions of columns before it is compressed to half a million. The benefit of this approach is that we have successfully identified the sequencing responsible for genetic mutation that are responsible for AMR in a way that would otherwise be computationally difficult to execute given the complexity of DNA. We train the models using Naive Bayes, Random Forest, and Support Vector Machines (SVM) classifiers for k-mers of length 5 through 10. We find that Random Forest performs consistently better than Naive Bayes and SVM until lengths of k = 10 where SVM produces an accuracy of 93.6% compared to 92% for Random Forest. In our work, we find results that support prior evidence that this approach for using k-mers of increasing lengths can be used to identify gene regions most likely responsible for AMR. Further, we use a different dataset than used in prior research to provide further evidence of the approach. However, we also find that using k-mers with a length of 10 can still provide high accuracy results which allows for further research without optimized hardware. In this paper, we review similar works performed by others in predicting AMR using various machine learning methods. We review the scope of their work in terms of data as well as the algorithms used. We next provide a short tutorial on some of the topics that are discussed to provide additional background. This background covers items related to technical and subject matter. We also provide an overview of the data set used in our analysis including the source and how the data was gathered. Next, we present our results and analysis as well as ethical considerations related to this topic as well as our solution. Finally, we present the conclusions that we have reached from this research as well as future works that we would consider for this topic. 2 Related Work In this section we discuss related works and research trends for the use of various machine learning methods in identifying antimicrobial resistant microbes. Davis et. al [13] provide insights into the infrastructure built at PATRIC with different sets of genomes that are binned by their AMR phenotype. Various machine learning classifiers are made available through PATRIC to detect AMR phenotype.