Fast and Sensitive Genome-Hashing Software and Its Application in Using NGS As a Detection Agent for Bacterial Presence in Oral
Total Page:16
File Type:pdf, Size:1020Kb
University of Missouri, St. Louis IRL @ UMSL Dissertations UMSL Graduate Works 6-4-2015 Fast and Sensitive Genome-Hashing Software and its Application in Using NGS as a Detection Agent for Bacterial Presence in Oral Metagenomic Samples Paul Michael Gontarz University of Missouri-St. Louis, [email protected] Follow this and additional works at: https://irl.umsl.edu/dissertation Part of the Chemistry Commons Recommended Citation Gontarz, Paul Michael, "Fast and Sensitive Genome-Hashing Software and its Application in Using NGS as a Detection Agent for Bacterial Presence in Oral Metagenomic Samples" (2015). Dissertations. 5. https://irl.umsl.edu/dissertation/5 This Dissertation is brought to you for free and open access by the UMSL Graduate Works at IRL @ UMSL. It has been accepted for inclusion in Dissertations by an authorized administrator of IRL @ UMSL. For more information, please contact [email protected]. Fast and Sensitive Genome-Hashing Software and its Application in Using NGS as a Detection Agent for Bacterial Presence in Oral Metagenomic Samples A Dissertation By Paul Gontarz M.S. Chemistry, University of Missouri - Saint Louis, 2012 B.S. Chemistry with Emphasis in Biochemistry, Lindenwood University, 2010 B.S. Mathematics, Lindenwood University, 2010 A Thesis Submitted to the Graduate School at the University of Missouri-St. Louis in partial fulfillment of the requirements for the degree Doctor of Philosophy in Chemistry June 2015 Advisory Committee Chung Wong, Ph.D. Chairperson James Bashkin, Ph.D. Cynthia Dupureur, Ph.D. Michael Nichols, Ph.D. i Abstract Fast and Sensitive Genome-Hashing Software and its Application in Using NGS as a Detection Agent for Bacterial Presence in Oral Metagenomic Samples Paul Gontarz, M.S., University of Missouri, St. Louis, MO, USA Chair of Committee: Dr. Chung Wong Next generation sequencing has increased the throughput of sequenced DNA into the range of billions of nucleotides sequenced per day. With the increased speed of DNA sequencing and the short length of reads produced by next generation sequencers, a significant challenge has been created in quickly and accurately assembling the hundreds of millions of short reads created by modern sequencing instruments into their full genomic sequences. With the increase in throughput in next generation sequencing and the decrease in time and cost to perform DNA sequencing, novel applications for DNA sequencing are being considered. Among them is a methodology by which DNA sequencing can be used as a diagnostic or detection tool for bacterial infection or presence. Here, the implementation, characteristics, and deployment of a novel, genome- hashing alignment algorithm for quickly performing reference-based alignment is described. This algorithm, SRmapper, is shown to be between two-fold to eight-fold faster than a current and popular alignment algorithm, BWA, while retaining a similar fraction of reads aligned to human reference genome. SRmapper demonstrates a capability to align approximately 150 billion nucleotides per processor day on an Intel ii Xeon 2.8GHz processor to the human genome while using approximately 2.5GB of RAM. SRmapper is demonstrated to be able to perform both single-end and pair-end alignment and tolerates a higher number of discrepancies between reads and the reference sequence than BWA. Using SRmapper as an alignment tool, a method to detect Mycobacterium tuberculosis (TB) in metagenomic samples containing many different bacteria is described. This method utilizes the construction of a novel uniqueness genome for TB containing only the regions of the TB genome not similar to any other bacterial species in the oral metagenome. Alignment of simulated and real metagenomic samples demonstrate the effectiveness of the uniqueness genome in the detection of TB and discover TB contamination in samples from the 1000 genomes project. Finally, the uniqueness genomes methodology is expanded to all genomes within the oral metagenome, and preliminary evidence is provided demonstrating that next generation sequencing can detect the presence of multiple simultaneously via alignment using SRmapper. iii Dedication To my wife, Elena; my daughter Anya; my brother, Peter, and my parents, Ken and Vivian: All my love and thanks. iv Acknowledgements I would first like to acknowledge the guidance, direction, work, and patience of my advisor, Dr. Chung Wong in guiding me through my time in the graduate program at the University of Missouri-St. Louis. I would like to thank Dr. Wong for both giving me the projects I worked on and for his ideas and direction as well as the numerous suggestions he gave to ensure that these projects moved in the correct direction. Furthermore, I would like to thank Dr. Wong for his willingness to give me the freedom to pursue my ideas, even when they proved to be incorrect, and his aid in developing the ones that had potential. Being allowed to struggle at times while having someone to fall back on when I could not solve problems on my own has been very beneficial in my development. Next, I would like to thank those serving on my dissertation committee: Dr. James Bashkin, Dr. Cynthia Dupureur, and Dr. Michael Nichols. To Drs. Dupureur and Nichols - thank you for suggestions, advice, and constructive critique during the defense of my dissertation proposal. To Dr. Bashkin, thank you for your suggestions and encouraging words during the oral presentation at my dissertation proposal. I would also like to acknowledge and thank my lab mates Elizabeth Hood and Andrew Lutes for their input and questions on my projects. I would also like to thank the numerous undergraduate and high school students I have had the pleasure and privilege of working with, advising, and supervising. I especially would like to thank Jennifer Berger for her valuable input and suggestions in the implementation and optimization of SRmapper. I would also like to thank Aaron Wilkerson for all the parallel work he did on strain specific features in TB. Specifically, I would also like to thank Barry Hykes, Kelsey Delph, and Peter Gontarz. v I would like to thank UMSL and the UM department of Chemistry and Biochemistry, the Graduate Dissertation Fellowship, the UM research board, and the UM research award for providing the funding necessary to support me during my graduate work. Without this support, none of this work could have happened. I would also like to express my deepest gratitude to the University of Missouri Bioinformatics Consortium for providing the computational resources necessary to make the project feasible. Outside of UMSL, I would first and especially like to acknowledge my wife, Elena Vasilieva, and all the time she has spent, the encouragement she has given, and the love she has shown to me. I would also like to thank my daughter, Anya; and I hope that when you are old enough to be able to read this, you see this and know that just seeing you means the world to me. Dad loves you and always will. Peter, watching you may be my greatest motivation. Your work ethic is an inspiration to me, and I am sure that you will achieve great things. It is not possible for me write here all the ways you inspire me. Finally, to my parents Ken and Vivian: thank you so much for always pushing me and encouraging me and reminding me to “never waste the gifts God has given you.” I do my best. vi TABLE OF CONTENTS Page ABSTRACT ii-iii DEDICATION iv ACKNOWLEDGMENTS v-vi TABLE OF CONTENTS vii-ix LIST OF FIGURES x-xii LIST OF TABLES xiii-xiv LIST OF PSEUDOCODES xv CHAPTER 1 AN INTRODUCTION TO DNA, DNA SEQUENCING, AND THE POTENTIAL APPLICATIONS OF DNA SEQUENCING 1 1.1 Dexoyribonucleic Acid, Its Properties, and Early Sequencing Means 2 1.2 Next Generation Sequencing 6 1.2.1 Next Generation Sequencing Platforms and Methodology 6 1.2.2 Comparison of Next Generation Sequencing Instrument Output 10 1.3 Analysis of NGS Data 13 1.4 Future Applications of DNA Sequencing Detection and Diagonstics of TB and Other Bacteria 19 1.5 Thesis Scope and Overview 22 CHAPTER II SRMAPPER: A FAST AND SENSITIVE GENOME-HASHING ALIGNMENT ALGORITHM FOR NGS ANALYSIS 24 2.1 Background 25 2.2 Construction of Reference Sequence Indexes by SRmapper Buildindex Algorithm 26 2.2.1 SRmapper Buildindex Input Format and Files 26 2.2.2 Hashing Terminology 28 2.2.3 Prehashing Routine to Determine Key Length 28 2.2.4 SRmapper Hashing Function 30 2.2.5 SRmapper Indexing Function and Formation of the Hash Table 33 2.2.5.1 Initial Hash Table Formation Algorithm 34 2.2.5.2 Modified Hash Table Formation Algorithm 35 vii 2.2.6 Output and Storage of the SRmapper Indexing Algorithm 39 2.3 Alignment of Short Reads from NGS to Reference Sequences Using the SRmapper Align Algorithm And Probabilistic Model 39 2.3.1 SRmapper Align Input Format and Files 41 2.3.2 SRmapper Align Prealignment Routines 44 2.3.2.1 Determination of Alignment Probabilities 44 2.3.2.2 Creation of the Probability Table for Alignment 49 2.3.2.3 Loading of the Index into Memory 51 2.3.3 Sequence Alignment by the SRmapper Align Algorithm 51 2.3.4 Alignment Output and Storage 55 2.3.5 Miscellaneous Implementations to Increase Alignment Speed 62 2.4 Results 63 2.4.1 Indexing Reference Sequences 63 2.4.2 Comparison Between SRmapper and BWA Using Real and Simulated Sequencing Datasets 64 2.4.2.1 Real Datasets and Software 64 2.4.2.2 Alignment Conditions and Measures of Aligner Speed and Reads Aligned 65 2.4.2.3 Results of Comparing SRmapper to BWA on Real Datasets 67 2.4.2.4 Creation of Simulated Reads and Determination Of Aligner