Machine Learning with Digital Signal Processing for Rapid and Accurate Alignment-Free Genome Analysis: from Methodological Design to a Covid-19 Case Study

Machine Learning with Digital Signal Processing for Rapid and Accurate Alignment-Free Genome Analysis: from Methodological Design to a Covid-19 Case Study

Western University Scholarship@Western Electronic Thesis and Dissertation Repository 6-1-2020 10:00 AM Machine Learning with Digital Signal Processing for Rapid and Accurate Alignment-Free Genome Analysis: From Methodological Design to a Covid-19 Case Study Gurjit Singh Randhawa, The University of Western Ontario Supervisor: Kari, Lila, The University of Waterloo A thesis submitted in partial fulfillment of the equirr ements for the Doctor of Philosophy degree in Computer Science © Gurjit Singh Randhawa 2020 Follow this and additional works at: https://ir.lib.uwo.ca/etd Part of the Artificial Intelligence and Robotics Commons, Bioinformatics Commons, Computational Biology Commons, and the Genomics Commons Recommended Citation Randhawa, Gurjit Singh, "Machine Learning with Digital Signal Processing for Rapid and Accurate Alignment-Free Genome Analysis: From Methodological Design to a Covid-19 Case Study" (2020). Electronic Thesis and Dissertation Repository. 7007. https://ir.lib.uwo.ca/etd/7007 This Dissertation/Thesis is brought to you for free and open access by Scholarship@Western. It has been accepted for inclusion in Electronic Thesis and Dissertation Repository by an authorized administrator of Scholarship@Western. For more information, please contact [email protected]. Abstract In the field of bioinformatics, taxonomic classification is the scientific practice of identifying, naming, and grouping of organisms based on their similarities and differences. The problem of taxonomic classification is of immense importance considering that nearly 86% of existing species on Earth and 91% of marine species remain unclassified. Due to the magnitude of the datasets, the need exists for an approach and software tool that is scalable enough to han- dle large datasets and can be used for rapid sequence comparison and analysis. We propose ML-DSP, a stand-alone alignment-free software tool that uses Machine Learning and Digital Signal Processing to classify genomic sequences. ML-DSP uses numerical representations to map genomic sequences to discrete numerical series (genomic signals), Discrete Fourier Transform (DFT) to obtain magnitude spectra from the genomic signals, Pearson Correlation Coefficient (PCC) as a dissimilarity measure to compute pairwise distances between magni- tude spectra of any two genomic signals, and supervised machine learning for the classification and prediction of the labels of new sequences. We first test ML-DSP by classifying 7396 full mitochondrial genomes at various taxonomic levels, from kingdom to genus, with an aver- age classification accuracy of > 97%. We also provide preliminary experiments indicating the potential of ML-DSP to be used for other datasets, by classifying 4271 complete dengue virus genomes into subtypes with 100% accuracy, and 4710 bacterial genomes into phyla with 95:5% accuracy. Second, we propose another tool, MLDSP-GUI, where additional features include: a user-friendly Graphical User Interface, Chaos Game Representation (CGR) to nu- merically represent DNA sequences, Euclidean and Manhattan distances as additional distance measures, phylogenetic tree output, oligomer frequency information to study the under- and over-representation of any particular sub-sequence in a selected sequence, and inter-cluster dis- tances analysis, among others. We test MLDSP-GUI by classifying 7881 complete genomes of Flavivirus genus into species with 100% classification accuracy. Third, we provide a proof of principle that MLDSP-GUI is able to classify newly discovered organisms by classifying the novel COVID-19 virus. ii Keywords: Machine Learning, Digital Signal Processing, Discrete Fourier Transform, tax- onomic classification, whole genome analysis, genomic signature, Chaos Game Representa- tion, alignment-free sequence analysis iii Summary Sequence classification is the scientific practice of identifying, naming, and grouping organ- isms based on their differences and similarities. Considering that most of the existing species (nearly 86% of species on Earth and 91% of marine species) remain unclassified, the problem of sequence classification is of immense importance. Due to the magnitude of the datasets, the problem of sequence comparison and analysis for the purpose of classification remains chal- lenging. Sequence (dis)similarity analysis has multiple possible applications including taxo- nomic classification (classify organisms on the basis of shared characteristics), virus-subtype classification (assign viral sequences to their subtypes), disease classification (classify human genomic sequences on the basis of disease type), human haplogroup classification (assign hu- man mitochondrial on the basis of maternal lineage), etc. The need exists for an approach and software tool that is scalable enough to handle large datasets and is able to provide accurate classifications within a short time period. We propose a machine learning-based methodology, ML-DSP, that is effective in the classification of newly discovered organisms, in distinguishing genomic signatures and identifying their mechanistic determinants, and in evaluating genome integrity. We also propose MLDSP-GUI, an extension of ML-DSP with multiple additional valuable features. Lastly, we show the applicability of our approach to taxonomy classifica- tion, virus-subtype classification and provide a proof of principle that our approach is able to classify newly discovered organisms by classifying the previously unclassified novel coron- avirus (COVID-19 virus) sequences. iv Co-Authorship Statement This thesis consists of three published articles. The article in Chapter 3 was published in the journal BMC Genomics, while the article in Chapter 4 was published in the journal Bioinfor- matics, and the article in Chapter 5 in the journal PLOS ONE. The papers in Chapters 3, 4 and 5 have two senior authors (K.A.H, L.K.). The major individual contributions are listed below. Chapter 3 contains the article “ML-DSP: Machine Learning with Digital Signal Processing for ultrafast, accurate, and scalable genome classification at all taxonomic levels” by Gurjit S. Randhawa, Kathleen A. Hill, and Lila Kari. The individual contributions are as follows. G.S.R. and L.K. conceived the study and wrote the manuscript. G.S.R. designed and tested the software. G.S.R., L.K. and K.A.H. conducted the data analysis and edited the manuscript, with K.A.H. contributing biological expertize. Chapter 4 contains the article “MLDSP-GUI: an alignment-free standalone tool with an interactive graphical user interface for DNA sequence comparison and analysis” by Gurjit S. Randhawa, Kathleen A. Hill, and Lila Kari. The individual contributions are as follows. G.S.R. and L.K. conceived the study and wrote the manuscript. G.S.R. designed and tested the soft- ware. G.S.R., L.K. and K.A.H. conducted the data analysis and edited the manuscript, with K.A.H. contributing biological expertize. Chapter 5 contains the article “Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study” by Gurjit S. Randhawa, Max- imillian P.M. Soltysiak, Hadi El Roz, Camila P.E. de Souza, Kathleen A. Hill, and Lila Kari. The individual contributions are as follows: Conceptualization, G.S.R.; methodology, G.S.R. and L.K.; software, G.S.R. and L.K.; validation, G.S.R.; formal analysis, G.S.R., M.P.M.S, H.E., C.P.E.d.S. and K.A.H.; investigation, G.S.R., M.P.M.S and K.A.H.; resources, G.S.R.; data curation, G.S.R.; writing—original draft preparation, G.S.R and M.P.M.S writing—review and editing, G.S.R., M.P.M.S., H.E., C.P.E.d.S., K.A.H. and L.K.; visualization, G.S.R.; super- vision, K.A.H and L.K.; project administration, K.A.H. and L.K.; funding acquisition, K.A.H. and L.K. v Acknowlegements First and foremost, I would like to express my gratitude to my supervisor, Dr. Lila Kari for her guidance and encouragement. Her enthusiasm, independent thought, meticulousness, and dedication are an inspiration I hope to learn from and emulate in my career. I am very thankful for all her tremendous support in all aspects of my professional development. I would also like to thank Dr. Kathleen A. Hill for being an extraordinary mentor. She introduced me, and ignited my interest in knowledge areas I was less familiar with than I care to admit. She has been a great collaborator and has helped me a lot in expanding the outreach of my research, and introduced me to the immensely valuable research community. I offer special thanks to my friend and colleague Maximillian P.M. Soltysiak. We had to set challenging deadlines and tirelessly work to honor them for the time-sensitive COVID-19 case study. I must also thank my current and former team-mates: Zihao Wang, Pablo Millan Arias, Fatemeh Alipour, Dr. Rallis Karamichalis, and Stephen Solis-Reyes for fostering a constructive research environment. I also want to acknowledge the research group working out of Dr. Hill’s lab, with whom I have had the pleasure of collaborating with on multiple projects. Lastly, I would like to thank my family and friends who have been a bedrock of support, all through my life. I am what I am because of how my parents have influenced me. Most of all, I would like to thank my wife Taran for her love, support, and understanding. vi Contents Abstract ii Summary iv Co-Authorship Statement v Acknowlegements vi List of Figures xi List of Tables xviii List of Appendices xx 1 Introduction 1 2 Literature review 6 2.1 Biological background . 6 2.2 Genomic sequence analysis methods . 8 2.2.1 Alignment-based methods . 8 2.2.2 Alignment-free methods . 10 2.3 Our approach . 14 2.3.1 DNA numerical representations . 14 2.3.2 Discrete Fourier Transform . 19 2.3.3 Distance measures . 19 vii 2.3.4 Multi-dimensional scaling . 20 2.3.5 Supervised learning classification models . 22 3 ML-DSP: Machine Learning with Digital Signal Processing 38 3.1 Background .

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    200 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us