Abstract Algorithms and Data Structures For
Total Page:16
File Type:pdf, Size:1020Kb
ABSTRACT Title of Dissertation: ALGORITHMS AND DATA STRUCTURES FOR INDEXING, QUERYING, AND ANALYZING LARGE COLLECTIONS OF SEQUENCING DATA IN THE PRESENCE OR ABSENCE OF A REFERENCE Fatemeh Almodaresi Doctor of Philosophy, Dissertation Directed by: Professor Rob Patro Department of Computer Science High-throughput sequencing has helped to transform our study of biological organisms and processes. For example, RNA-seq is one popular sequencing assay that allows measuring dynamic transcriptomes and enables the discovery (via assem- bly) of novel transcripts. Likewise, metagenomic sequencing lets us probe natural environments to profile organismal diversity and to discover new strains and species that may be integral to the environment or process being studied. The vast amount of available sequencing data, and its growth rate over the past decade also brings with it some immense computational challenges. One of these is how do we design memory-efficient structures for indexing and querying this data. This challenge is not limited to only raw sequencing data (i.e. reads) but also to the growing collection of reference sequences (genomes, and genes) that are assembled from this raw data. We have developed new data structures (both reference-based and reference-free) to index raw sequencing data and assembled reference sequences. Specifically, we describe three separate indices, “Pufferfish”, an index over a set of genomes or tran- scriptomes, and “Rainbowfish” and “Mantis” which are both indices for indexing a set of raw sequencing data sets. All of these indices are designed with consideration of support for high query performance and memory efficient construction and query. The Pufferfish data structure is based on constructing a compacted, colored, reference de Bruijn graph (ccdbg), and then indexing this structure in an efficient manner. We have developed both sparse and dense indexing schemes which allow trading index space for query speed (though queries always remain asymptotically optimal). Pufferfish provides a full reference index that can return the set of refer- ences, positions and orientations of any k-mer (substring of fixed length “k” ) in the input genomes. We have built an alignment tool, Puffaligner, around this index for aligning sequencing reads to reference sequences. We demonstrate that Puffaligner is able to produce highly-sensitive alignments, similar to those of Bowtie, but much more quickly and exhibits speed similar to the ultrafast STAR aligner while requiring considerably less memory to construct its index and align reads. The Rainbowfish and Mantis data structures, on the other hand, are based on reference-free colored de Bruijn graphs (cdbg) constructed over raw sequencing data. Rainbowfish introduces a new efficient representation of the color information which is then adopted and refined by Mantis. Mantis supports graph traversal and other topological analyses, but is also particularly well-suited for large-scale sequence-level search over thousands of samples. We develop multiple and successively-refined versions of the Mantis index, culminating in an index that adopts a minimizer- partitioned representation of the underlying k-mer set and a referential encoding of the color information that exploits fast near-neighbor search and efficient encoding via a minimum spanning tree. We describe, further, how this index can be made incrementally updatable by developing an efficient merge algorithm and storing the overall index in a multi-level log-structured merge (LSM) tree. We demonstrate the utility of this index by building a searchable Mantis, via recursive merging, over , raw sequencing samples, which we then scale to over , samples via incremental update. This index can be queried, on a commodity server, to discover the samples likely containing thousands of reference sequences in only a few minutes. ALGORITHMS AND DATA STRUCTURES FOR INDEXING, QUERYING, AND ANALYZING LARGE COLLECTIONS OF SEQUENCING DATA IN THE PRESENCE OR ABSENCE OF A REFERENCE by Fatemeh Almodaresi Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy Advisory Committee: Professor Rob Patro, Chair/Advisor Professor Stephen Mount, University Chair Representative Professor David Mount Professor Hector Corrada Bravo Professor Mihai Pop © Copyright by Fatemeh Almodaresi Dedication I dedicate this work to Khanoom, my grandma, that if it wasn’t for her I couldn’t image myself be the person I am now or better said, even be! ii Acknowledgments “And whatever of blessings and good things you have, it is from Allah.” Quran (:) There are many people to whom I owe my gratitude for helping me during my PhD years and making this thesis a piece of work that I will be thrilled about forever. First and foremost, I would like to express my deepest appreciation to my advisor, Professor Patro. I started my work with him as someone who “believed” is incapable of any achievements and if it wasn’t for his profound belief in my work and my abilities and for his patience and relentless support I would not be where I am now. He has not only guided me to become a critical thinker and a professional scientist but also a better and stronger person. Taking his class was the pivotal moment of my life that changed it entirely. I now see myself as a proficient researcher who is now able and passionate to work on novel problems and he is my role model for all of that. I have watched him and learned a lot from his extensive knowledge in his area of expertise, persistence in accomplishing the task, and his constructive advice and ingenious contributions. The work in this thesis was impossible without him. I had great pleasure of collaborating with many researchers and scientists in the field during my PhD and I thank them all for the precious experience they offered. iii I would like to thank all the members of Mantis project, specially Michael Ferdman and Robert Johnson for fruitful conversations and insightful suggestions during the three years of collaboration and Prashant Pandey for his invaluable contribution. I would like to thank my thesis committee members for all of their guidance through this process; your discussion, ideas, and feedback have been absolutely invaluable. Many thanks to my colleagues in our lab and CBCB that had made my PhD journey full of fun and exciting. Special thanks to Laraib Iqbal Malik who was like a sister to me and I miss her deeply as well as Mohsen Zakeri for being the brother that was always there for me. I would also like to acknowledge help and support from some of the staff members, specifically Kathy Germana, Tom Hurst, and Barbara Lewis for their kindness, patience and guidance for all the administrative tasks with a constant smile always on their faces. I cannot begin to express my thanks to my husband, Hamid, for his unwavering support, encouragement and accommodation throughout the past years. He gave up the life back in our country and his career for me to pursue my dream which has now become a reality. He has my deepest appreciation for never giving up on me. I would also like to extend my deepest gratitude to my parents for the love, support, and constant encouragement I have gotten over the years. I want to specifically thank my mother who has gone through a lot of struggle and pain but has never surrender nurturing and supporting me. In the end, many thanks to my little siblings, specially Gazelle for her relentless support. Finally, I am extremely thankful to all my friends and family for their per- iv manent and unfailing support. Specifically, Leila Khatami, who is always my fuel and motivator. Her help cannot be overestimated as she was always there for me throughout the hard time when I was alone and far from family. Also, my deepest appreciation goes to Zohreh Sadeghi and Shohreh Tabesh whom never wavered in their support. I would like to specifically thank our Iranian and religion community in NY, who were the most welcoming people one could imagine. Our friends at Stony Brook always make us feel at home and our friends in NY enriched us with a lot of knowledge and love that could not have been achieved otherwise. I was the luckiest person to have them in my life and will always owe a great part of my mindset, and perspective to them. Funding: I gratefully acknowledge support from NSF grants BBSRC-NSF/BIO- , IIS-, IIS-, CNS- , CCF- , CCF-, CCF-, and from Sandia National Laboratories. The following grants specif- ically funded the Mantis line of work discussed in chapters and : National Science Foundation grants CSR- , CCF- , CCF-, CCF-, CNS- , CNS- National Institutes of Health grants RHG , and RGM . The experiments were conducted with equipment purchased through NSF CISE Research Infrastructure Grant Number . v Table of Contents Dedication ii Acknowledgements iii Table of Contents vi List of Tables ix List of Figuresx List of Abbreviations xii Introduction Sequence Indexing............................ . Reference-based Indices..................... . Reference-free Indices....................... De Bruijn Graph............................. . Compacted de Bruijn graph................... . Colored de Bruijn graph..................... Sequence Search.............................. Overview of this document and contribution.............. Pufferfish Introduction................................ Preliminaries............................... Method.................................. . The dense