Machine Learning with Digital Signal Processing for Rapid and Accurate Alignment-Free Genome Analysis: from Methodological Design to a Covid-19 Case Study

Western University
Scholarship@Western
Electronic Thesis and Dissertation Repository

6-1-2020 10:00 AM

Gurjit Singh Randhawa, The University of Western Ontario

Supervisor: Kari, Lila, The University of Waterloo

A thesis submitted in partial fulfillment of the requirements for the Doctor of Philosophy degree in Computer Science

© Gurjit Singh Randhawa 2020

Part of the Artificial Intelligence and Robotics Commons, Bioinformatics Commons, Computational Biology Commons, and the Genomics Commons

Recommended Citation
Randhawa, Gurjit Singh, "Machine Learning with Digital Signal Processing for Rapid and Accurate Alignment-Free Genome Analysis: From Methodological Design to a Covid-19 Case Study" (2020). Electronic Thesis and Dissertation Repository. 7007. https://ir.lib.uwo.ca/etd/7007

This Dissertation/Thesis is brought to you for free and open access by Scholarship@Western. It has been accepted for inclusion in Electronic Thesis and Dissertation Repository by an authorized administrator of Scholarship@Western.

Abstract

In the field of bioinformatics, taxonomic classification is the scientific practice of identifying, naming, and grouping organisms based on their similarities and differences. The problem of taxonomic classification is of immense importance considering that nearly 86% of existing species on Earth and 91% of marine species remain unclassified. Given the magnitude of the datasets, there is a need for an approach and software tool that is scalable enough to handle large datasets and can be used for rapid sequence comparison and analysis. We propose ML-DSP, a stand-alone alignment-free software tool that uses Machine Learning and Digital Signal Processing to classify genomic sequences.
ML-DSP uses numerical representations to map genomic sequences to discrete numerical series (genomic signals), Discrete Fourier Transform (DFT) to obtain magnitude spectra from the genomic signals, Pearson Correlation Coefficient (PCC) as a dissimilarity measure to compute pairwise distances between magnitude spectra of any two genomic signals, and supervised machine learning for the classification and prediction of the labels of new sequences. We first test ML-DSP by classifying 7396 full mitochondrial genomes at various taxonomic levels, from kingdom to genus, with an average classification accuracy of > 97%. We also provide preliminary experiments indicating the potential of ML-DSP to be used for other datasets, by classifying 4271 complete dengue virus genomes into subtypes with 100% accuracy, and 4710 bacterial genomes into phyla with 95.5% accuracy.

Second, we propose another tool, MLDSP-GUI, where additional features include: a user-friendly Graphical User Interface, Chaos Game Representation (CGR) to numerically represent DNA sequences, Euclidean and Manhattan distances as additional distance measures, phylogenetic tree output, oligomer frequency information to study the under- and over-representation of any particular sub-sequence in a selected sequence, and inter-cluster distances analysis, among others. We test MLDSP-GUI by classifying 7881 complete genomes of Flavivirus genus into species with 100% classification accuracy.

Third, we provide a proof of principle that MLDSP-GUI is able to classify newly discovered organisms by classifying the novel COVID-19 virus.

Keywords: Machine Learning, Digital Signal Processing, Discrete Fourier Transform, taxonomic classification, whole genome analysis, genomic signature, Chaos Game Representation, alignment-free sequence analysis

Summary

Sequence classification is the scientific practice of identifying, naming, and grouping organisms based on their differences and similarities.
Considering that most of the existing species (nearly 86% of species on Earth and 91% of marine species) remain unclassified, the problem of sequence classification is of immense importance. Due to the magnitude of the datasets, the problem of sequence comparison and analysis for the purpose of classification remains challenging. Sequence (dis)similarity analysis has multiple possible applications, including taxonomic classification (classifying organisms on the basis of shared characteristics), virus-subtype classification (assigning viral sequences to their subtypes), disease classification (classifying human genomic sequences on the basis of disease type), and human haplogroup classification (assigning human mitochondrial genomes to haplogroups on the basis of maternal lineage). The need exists for an approach and software tool that is scalable enough to handle large datasets and is able to provide accurate classifications within a short time period. We propose a machine learning-based methodology, ML-DSP, that is effective in the classification of newly discovered organisms, in distinguishing genomic signatures and identifying their mechanistic determinants, and in evaluating genome integrity. We also propose MLDSP-GUI, an extension of ML-DSP with multiple additional valuable features. Lastly, we show the applicability of our approach to taxonomic classification and virus-subtype classification, and provide a proof of principle that our approach is able to classify newly discovered organisms by classifying the previously unclassified novel coronavirus (COVID-19 virus) sequences.

Co-Authorship Statement

This thesis consists of three published articles. The article in Chapter 3 was published in the journal BMC Genomics, the article in Chapter 4 in the journal Bioinformatics, and the article in Chapter 5 in the journal PLOS ONE. The papers in Chapters 3, 4 and 5 have two senior authors (K.A.H., L.K.). The major individual contributions are listed below.
Chapter 3 contains the article “ML-DSP: Machine Learning with Digital Signal Processing for ultrafast, accurate, and scalable genome classification at all taxonomic levels” by Gurjit S. Randhawa, Kathleen A. Hill, and Lila Kari. The individual contributions are as follows. G.S.R. and L.K. conceived the study and wrote the manuscript. G.S.R. designed and tested the software. G.S.R., L.K. and K.A.H. conducted the data analysis and edited the manuscript, with K.A.H. contributing biological expertise.

Chapter 4 contains the article “MLDSP-GUI: an alignment-free standalone tool with an interactive graphical user interface for DNA sequence comparison and analysis” by Gurjit S. Randhawa, Kathleen A. Hill, and Lila Kari. The individual contributions are as follows. G.S.R. and L.K. conceived the study and wrote the manuscript. G.S.R. designed and tested the software. G.S.R., L.K. and K.A.H. conducted the data analysis and edited the manuscript, with K.A.H. contributing biological expertise.

Chapter 5 contains the article “Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study” by Gurjit S. Randhawa, Maximillian P.M. Soltysiak, Hadi El Roz, Camila P.E. de Souza, Kathleen A. Hill, and Lila Kari. The individual contributions are as follows: conceptualization, G.S.R.; methodology, G.S.R. and L.K.; software, G.S.R. and L.K.; validation, G.S.R.; formal analysis, G.S.R., M.P.M.S., H.E., C.P.E.d.S. and K.A.H.; investigation, G.S.R., M.P.M.S. and K.A.H.; resources, G.S.R.; data curation, G.S.R.; writing (original draft preparation), G.S.R. and M.P.M.S.; writing (review and editing), G.S.R., M.P.M.S., H.E., C.P.E.d.S., K.A.H. and L.K.; visualization, G.S.R.; supervision, K.A.H. and L.K.; project administration, K.A.H. and L.K.; funding acquisition, K.A.H. and L.K.

Acknowledgements

First and foremost, I would like to express my gratitude to my supervisor, Dr. Lila Kari, for her guidance and encouragement.
Her enthusiasm, independent thought, meticulousness, and dedication are an inspiration I hope to learn from and emulate in my career. I am very thankful for her tremendous support in all aspects of my professional development.

I would also like to thank Dr. Kathleen A. Hill for being an extraordinary mentor. She introduced me to, and ignited my interest in, knowledge areas I was less familiar with than I care to admit. She has been a great collaborator, has helped me a lot in expanding the outreach of my research, and introduced me to the immensely valuable research community.

I offer special thanks to my friend and colleague Maximillian P.M. Soltysiak. We had to set challenging deadlines and work tirelessly to honor them for the time-sensitive COVID-19 case study. I must also thank my current and former team-mates: Zihao Wang, Pablo Millan Arias, Fatemeh Alipour, Dr. Rallis Karamichalis, and Stephen Solis-Reyes for fostering a constructive research environment. I also want to acknowledge the research group working out of Dr. Hill’s lab, with whom I have had the pleasure of collaborating on multiple projects.

Lastly, I would like to thank my family and friends, who have been a bedrock of support all through my life. I am what I am because of how my parents have influenced me. Most of all, I would like to thank my wife Taran for her love, support, and understanding.

Contents

Abstract
Summary
Co-Authorship Statement
Acknowledgements
List of Figures
List of Tables
List of Appendices
1 Introduction
2 Literature review
  2.1 Biological background
  2.2 Genomic sequence analysis methods
    2.2.1 Alignment-based methods
    2.2.2 Alignment-free methods
  2.3 Our approach
    2.3.1 DNA numerical representations
    2.3.2 Discrete Fourier Transform
    2.3.3 Distance measures
    2.3.4 Multi-dimensional scaling
    2.3.5 Supervised learning classification models
3 ML-DSP: Machine Learning with Digital Signal Processing
  3.1 Background
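
The core steps named in the abstract and in Section 2.3 of the outline (numerical representation, DFT magnitude spectrum, Pearson-correlation distance, then a supervised classifier) can be illustrated with a minimal sketch. It is not the ML-DSP implementation. The purine/pyrimidine-style mapping `PP_MAP`, the zero-padding used to equalize lengths, the rescaling `(1 - r) / 2`, and the nearest-neighbour rule standing in for the supervised classifiers are all assumptions made for this example, as are the function names.

```python
import numpy as np

# Assumed illustrative mapping: purines (A, G) -> -1, pyrimidines (C, T) -> +1.
# Any other symbol (e.g. N) maps to 0. The thesis evaluates several
# representations; this one is chosen here only for brevity.
PP_MAP = {"A": -1.0, "G": -1.0, "C": 1.0, "T": 1.0}


def to_signal(seq, length):
    """Map a DNA string to a numeric signal of exactly `length` samples.

    Longer sequences are truncated and shorter ones zero-padded
    (a simplification assumed for this sketch).
    """
    values = [PP_MAP.get(base, 0.0) for base in seq.upper()[:length]]
    signal = np.zeros(length)
    signal[:len(values)] = values
    return signal


def magnitude_spectrum(signal):
    """Return the magnitudes of the signal's Discrete Fourier Transform."""
    return np.abs(np.fft.fft(signal))


def distance_matrix(seqs):
    """Pairwise dissimilarity between the magnitude spectra of `seqs`.

    The Pearson correlation r is rescaled to (1 - r) / 2 so that
    values fall in [0, 1]; this rescaling is an assumption.
    """
    length = max(len(s) for s in seqs)
    spectra = np.array([magnitude_spectrum(to_signal(s, length)) for s in seqs])
    corr = np.atleast_2d(np.corrcoef(spectra))  # rows are treated as variables
    return (1.0 - corr) / 2.0


def nearest_label(query, train_seqs, train_labels):
    """Label `query` with the label of its closest training sequence.

    A 1-nearest-neighbour rule is used only as a stand-in for the
    supervised classifiers mentioned in the text.
    """
    dist = distance_matrix([query] + list(train_seqs))
    return train_labels[int(np.argmin(dist[0, 1:]))]
```

A call such as `nearest_label(new_seq, train_seqs, train_labels)` returns the label of the training genome whose magnitude spectrum correlates most strongly with that of `new_seq`. Because every sequence is first brought to a common length, no alignment step is involved, which is the property the abstract emphasizes.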
