Classification of Genomic Sequences by Latent Semantic Analysis
Total Page:16
File Type:pdf, Size:1020Kb
University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln Theses, Dissertations, and Student Research Electrical & Computer Engineering, Department from Electrical & Computer Engineering of Summer 8-2012 Classification of Genomic Sequences yb Latent Semantic Analysis Samuel F. Way University of Nebraska-Lincoln, [email protected] Follow this and additional works at: https://digitalcommons.unl.edu/elecengtheses Part of the Bioinformatics Commons, Computational Biology Commons, and the Other Electrical and Computer Engineering Commons Way, Samuel F., "Classification of Genomic Sequences yb Latent Semantic Analysis" (2012). Theses, Dissertations, and Student Research from Electrical & Computer Engineering. 42. https://digitalcommons.unl.edu/elecengtheses/42 This Article is brought to you for free and open access by the Electrical & Computer Engineering, Department of at DigitalCommons@University of Nebraska - Lincoln. It has been accepted for inclusion in Theses, Dissertations, and Student Research from Electrical & Computer Engineering by an authorized administrator of DigitalCommons@University of Nebraska - Lincoln. CLASSIFICATION OF GENOMIC SEQUENCES BY LATENT SEMANTIC ANALYSIS by Samuel F. Way A THESIS Presented to the Faculty of The Graduate College at the University of Nebraska In Partial Fulfilment of Requirements For the Degree of Master of Science Major: Electrical Engineering Under the Supervision of Professor Khalid Sayood and Professor Ozkan Ufuk Nalbantoglu Lincoln, Nebraska August, 2012 CLASSIFICATION OF GENOMIC SEQUENCES BY LATENT SEMANTIC ANALYSIS Samuel F. Way, M. S. University of Nebraska, 2012 Adviser: Khalid Sayood and Ozkan Ufuk Nalbantoglu Evolutionary distance measures provide a means of identifying and organizing related organisms by comparing their genomic sequences. As such, techniques that quantify the level of similarity between DNA sequences are essential in our efforts to decipher the genetic code in which they are written. Traditional methods for estimating the evolutionary distance separating two ge- nomic sequences often require that the sequences first be aligned before they are compared. Unfortunately, this preliminary step imposes great computational burden, making this class of techniques impractical for applications involving a large number of sequences. Instead, we desire new methods for differentiating genomic sequences that eliminate the need for sequence alignment. Here, we present a novel approach for identifying evolutionarily similar DNA se- quences using a theory and collection of techniques called latent semantic analysis (LSA). In the field of information retrieval, LSA techniques are used to identify text documents with related underlying concepts, or \latent semantics". These techniques capture the inherent structure within a collection of documents, and in much the same way, we extend these approaches to infer the biological structure through which a collection of organisms are related. In doing so, we develop a computationally ef- ficient means of identifying evolutionarily similar DNA sequences that is especially well-suited for partitioning large sets of biological data. iii ACKNOWLEDGMENTS First and foremost, I would like to thank my committee and academic advisors for their help throughout my graduate studies. In particular, no words could ever express my appreciation for Dr. Khalid Sayood and the impact that he has had on my life, both academically and personally. I feel very fortunate to have found an advisor who is as patient and caring as he is, and I can't thank him enough for his support and guidance. I would also like to thank Dr. Michael Hoffman for his invaluable insight and help in preparing this research. He and Dr. Sayood have shown me how absolutely fun research can be, and I thank them both for making my time at UNL incredibly enjoyable. Next, I would like to thank Dr. Ufuk Nalbantoglu for being a great friend and constant source of ideas throughout my graduate career. Beyond my committee, I want thank Dr. David Russell, who has been an amazing role model, teacher, and friend over the past few years. His high expectations have motivated me to always strive to do quality work, and with him, I also thank my other co-workers at Red Cocoa, especially Mark Nispel, David McCreight, Chris Hruby, and Andy Lenhart. I also need to thank Dr. Mark Bauer for his incredible support and guidance, as well as our collaborators at Names for Life and the University of Nebraska Medical Center. I owe a special thanks and many boxes of cookies to Cheryl Wemhoff and Terri Eastin for helping me get everything lined up to graduate this semester. They helped me to overcome a number of obstacles this summer, and I thank them for all of their patience. I also need to thank Jay Carlson for his help with Adobe Illustrator and for letting me vent my frustrations. Finally, my thanks also go to my family and friends for their amazing support throughout my academic journey. Since day one, my parents have provided me with iv the means and inspiration to accomplish great things, and they've always been there to help me in any way they can. Words can't express how grateful I am for their many sacrifices and undying support. Lastly, I would like to thank my beautiful girlfriend, Laura Norris, whose love and encouragement has carried me through the last nearly eight years of my life. She inspires me to become a better person every day, and this work would not have been possible without her support. v Contents Contents v 1 Introduction 1 2 Biology Background 3 2.1 Introduction to DNA Structure . 3 2.2 Function of DNA . 6 3 Comparing DNA Sequences 8 3.1 Alignment-based Distance Measures . 8 3.1.1 Jukes-Cantor . 9 3.1.2 Kimura . 13 3.2 Alignment-free Distance Measures . 15 3.2.1 Relative Complexity Metric . 16 3.2.2 K-mer Based Genomic Signatures . 19 3.2.2.1 The Average Mutual Information Profile . 20 3.2.2.2 The Relative Abundance Index . 22 4 Latent Semantic Analysis 25 4.1 Introduction to LSA . 27 vi 4.2 Dimensionality Reduction Techniques . 31 4.2.1 Singular Value Decomposition . 32 4.2.1.1 Mathematical Derivation . 33 4.2.1.2 Dimensionality Reduction via SVD . 35 4.2.2 Non-negative Matrix Factorization . 36 4.2.2.1 Mathematical Derivation . 38 4.2.2.2 Dimensionality Reduction via NMF . 40 4.3 Adaptation to Sequence Data . 41 5 Visualizing High-Dimensional Datasets 43 5.1 Method . 45 5.2 Implementation . 47 5.3 Results and Discussion . 48 5.3.1 Validation of Taxonomy Data . 49 5.3.2 Map Creation From Proximity Data . 50 5.3.3 Discussion . 51 6 Clustering of 16s Ribosomal Genes 54 6.1 Background Information . 54 6.2 Method . 55 6.3 Results and Discussion . 56 7 Identification and Removal of Host DNA Fragments from Metage- nomics Datasets 67 7.1 Construction of a synthetic dataset . 68 7.2 Method . 70 7.3 Results . 73 vii 7.4 Discussion . 77 8 Using LSA-NMF to Design Microarray Probes 79 8.1 Background Information . 79 8.2 Method . 81 8.3 Results and Discussion . 82 9 Conclusion and Recommendations for Future Work 89 9.1 Recommendations for Future Work . 90 Bibliography 94 1 Chapter 1 Introduction Since the days of the Human Genome Project, the world has witnessed an ongoing revolution in the field of biology. In fact, the changes in this area are so profound that many believe we are currently experiencing the advent of a \New Biology", one that relies on concepts and approaches from engineering, computational sciences, mathematics, and many other fields[1]. The integration of these disciplines signi- fies the emergence of biology as an informational science and is necessitated by the introduction of many new technologies for producing biological data. Some of the most important examples of these new technologies are platforms for massively parallel, next-generation DNA sequencing. Next-generation sequencing (NGS) replaces the traditional Sanger sequencing method with automated, high- throughput technologies that produce unprecedented amounts of sequence data at ever-increasing speeds. What's more, the cost to purchase and operate these platforms has declined rapidly over the years, and as a result, the number of research centers using these technologies has skyrocketed. Given the ability to quickly sequence genetic material, scientists are constantly discovering new applications for next-generation platforms. However, with so many 2 scientists generating such massive amounts of sequence data, it is becoming increas- ingly difficult to make sense of all this new information. As such, it is imperative that we devise efficient techniques for classifying and organizing genomic sequences such that they may be quickly identified and retrieved. Traditional methods for sequence comparison are often computationally inefficient and therefore should not be used in large-scale implementations. Instead, in this re- search, we present a new approach for differentiating genomic sequences based on latent semantic analysis, a theory and collection of techniques born out of the fields of natural language processing and information retrieval. Using this new means com- parison, we explore several experiments that will demonstrate the approach as an attractive alternative to traditional distance measures that is especially well-suited for large-scale applications. 3 Chapter 2 Biology Background Before we go about introducing methods for comparing DNA sequences, we should first provide some biological background to establish what these sequences are and why they are important. It is commonly understood that DNA is a sort of genetic code or blueprint that is used to construct living things. For the purposes of our discussion, we will require only a slightly more detailed explanation, which we present here. In addition, this section will be used to introduce most of the terminology and conventions that will be used throughout subsequent chapters. 2.1 Introduction to DNA Structure DNA stands for deoxyribonucleic acid and is a double-stranded macromolecule com- prised of four basic structural units called nucleotides.