ABSTRACT GENOME ASSEMBLY and VARIANT DETECTION USING EMERGING SEQUENCING TECHNOLOGIES and GRAPH BASED METHODS Jay Ghurye Doctor
Total Page:16
File Type:pdf, Size:1020Kb
ABSTRACT Title of dissertation: GENOME ASSEMBLY AND VARIANT DETECTION USING EMERGING SEQUENCING TECHNOLOGIES AND GRAPH BASED METHODS Jay Ghurye Doctor of Philosophy, 2018 Dissertation directed by: Professor Mihai Pop Department of Computer Science The increased availability of genomic data and the increased ease and lower costs of DNA sequencing have revolutionized biomedical research. One of the critical steps in most bioinformatics analyses is the assembly of the genome sequence of an organ- ism using the data generated from the sequencing machines. Despite the long length of sequences generated by third-generation sequencing technologies (tens of thousands of basepairs), the automated reconstruction of entire genomes continues to be a formidable computational task. Although long read technologies help in resolving highly repetitive regions, the contigs generated from long read assembly do not always span a complete chromosome or even an arm of the chromosome. Recently, new genomic technologies have been developed that can “bridge” across repeats or other genomic regions that are difficult to sequence or assemble and improve genome assemblies by “scaffolding” to- gether large segments of the genome. The problem of scaffolding is vital in the context of both single genome assembly of large eukaryotic genomes and in metagenomics where the goal is to assemble multiple bacterial genomes in a sample simultaneously. First, we describe SALSA2, a method we developed to use interaction frequency between any two loci in the genome obtained using Hi-C technology to scaffold frag- mented eukaryotic genome assemblies into chromosomes. SALSA2 can be used with either short or long read assembly to generate highly contiguous and accurate chromo- some level assemblies. Hi-C data are known to introduce small inversion errors in the assembly, so we included the assembly graph in the scaffolding process and used the sequence overlap information to correct the orientation errors. Next, we present our contributions to metagenomics. We developed a scaffolding and variant detection method “MetaCarvel” for metagenomic datasets. Several factors such as the presence of inter-genomic repeats, coverage ambiguities, and polymorphic regions in the genomes complicate the task of scaffolding metagenomes. Variant detec- tion is also tricky in metagenomes because the different genomes within these complex samples are not known beforehand. We showed that MetaCarvel was able to generate accurate scaffolds and find genome-wide variations de novo in metagenomic datasets. Finally, we present EDIT, a tool for clustering millions of DNA sequence fragments originating from the highly conserved 16s rRNA gene in bacteria. We extended classi- cal Four Russians’ speed up to banded sequence alignment and showed that our method clusters highly similar sequences efficiently. This method can also be used to remove duplicates or near duplicate sequences from a dataset. With the increasing data being generated in different genomic and metagenomic studies using emerging sequencing technologies, our software tools and algorithms are well timed with the need of the community. GENOME ASSEMBLY AND VARIANT DETECTION USING EMERGING SEQUENCING TECHNOLOGIES AND GRAPH BASED METHODS by Jay Ghurye Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2018 Advisory Committee: Professor Mihai Pop, Chair/Advisor Dr. Adam M. Phillippy, Professor Hector´ Corrada Bravo, Professor Aravind Srinivasan, Professor Max Leiserson, Professor Michael Cummings, Dean’s Representative c Copyright by Jay Ghurye 2018 Preface The algorithms, software, and results in this dissertation have either been published in peer-reviewed journals and conferences or are currently under preparation for submis- sion. At the time of this writing, Chapters 2, 4, 5, and 7 have already been published or submitted for publication and are reformatted here. Chapter 3 and 6 are under preparation for submission. I am indebted to my co-authors on these projects - their dedication and knowledge in the areas of computer science, statistics, and biology have resulted in much stronger scientific papers. • Chapter 1 - Jay Ghurye, Victoria Cepeda-Espinoza, and Mihai Pop. “Focus: Microbiome: Metagenomic Assembly: Overview, Challenges and Applications.” The Yale jour- nal of biology and medicine 89.3 (2016): 353. My contributions to these works include (1) surveying the literature for recent and seminal results, and (2) writing the manuscripts. • Chapter 2 - Jay Ghurye and Mihai Pop. “Algorithms and technologies for uncovering genome structure,” Under review. My contributions to these works include (1) surveying the literature for recent and seminal results, and (2) writing the manuscripts. • Chapter 3 - Jay Ghurye, Mihai Pop, Sergey Koren, Derek Bickhart, and Chen-Shan Chin. ii “Scaffolding of long read assemblies using long range contact information.” BMC genomics 18, no. 1 (2017): 527. - Jay Ghurye, Arang Rhie, Brian P. Walenz, Anthony Schmitt, Siddarth Selvaraj, Mihai Pop, Adam M. Phillippy, and Sergey Koren. “Integrating Hi-C links with assembly graphs for chromosome-scale assembly.” bioRxiv (2018): 261149. My contributions to these works include (1) design and implementation of the al- gorithm, (2) doing software evaluation, and (3) writing the manuscript. • Chapter 4 - Jay Ghurye, Sergey Koren, Arang Rhie, Brian Walenz, Siddarth Selvaraj, An- thony Schmitt and Adam Phillippy “Effect of different Hi-C library preparation on eukaryotic genome scaffolding”, Under preparation My contributions to these works include (1) designing and running experiments, (2) writing the manuscript. • Chapter 5 - Jay Ghurye, and Mihai Pop. “Better Identification of Repeats in Metagenomic Scaffolding.” In International Workshop on Algorithms in Bioinformatics, pp. 174- 184. Springer, Cham, 2016. My contributions to these works include (1) design and implementation of the al- gorithm, (2) doing software evaluation, (3) writing the manuscript. • Chapter 6 - Jay Ghurye, Todd Treangen, Sergey Koren, Marcus Fedarko, W. Judson Hervey iii IV and Mihai Pop “MetaCarvel: linking assembly graph motifs to biological vari- ants”, Under preparation. My contributions to these works include (1) design and implementation of the al- gorithm, (2) doing software evaluation, and (3) writing the manuscript. • Chapter 7 - Brian Brubach∗, Jay Ghurye∗, Mihai Pop and Aravind Srinivasan, 2017. “Bet- ter Greedy Sequence Clustering with Fast Banded Alignment”. In LIPIcs-Leibniz International Proceedings in Informatics (Vol. 88). Schloss Dagstuhl-Leibniz- Zentrum fuer Informatik. (∗ denotes equal contribution) My contributions to these works include (1) design and implementation of the al- gorithm, (2) doing software evaluation, and (3) writing the manuscript. Here is the list of other publications I have been involved, either as a first author or a contributing author. • Jay Ghurye, Gautier Krings and Vanessa Frias-Martinez, 2016, June. “A Frame- work to Model Human Behavior at Large Scale during Natural Disasters”. In Mo- bile Data Management (MDM), 2016 17th IEEE International Conference on (Vol. 1, pp. 18-27). IEEE. • Brian Brubach∗, Jay Ghurye∗, 2018. “A Succinct Four Russian’s Speedup for Edit Distance Computation and One-against-many Banded Alignment”. In LIPIcs- Leibniz International Proceedings in Informatics (Vol. 105). Schloss Dagstuhl- Leibniz-Zentrum fuer Informatik (∗ denotes equal contribution). iv • Jay Ghurye, Sergey Koren, Scott Small, Adam Phillippy, and Nora Besansky. “De novo assembly of Anopheles funestus genome”. (Under preparation) • - Marcus Fedarko, Jay Ghurye, Todd Treagen, and Mihai Pop. “MetagenomeScope: Web-Based Hierarchical Visualization of Metagenome Assembly Graphs.” (2017): 630-632. (Under submission) • Nathan D. Olson, Todd J. Treangen, Christopher M. Hill, Victoria Cepeda-Espinoza, Jay Ghurye, Sergey Koren, and Mihai Pop. “Metagenomic assembly through the lens of validation: recent advances in assessing and improving the quality of genomes assembled from metagenomes.” Briefings in bioinformatics (2017). • Seth Commichaux, Nidhi Shah, Alexander Stoppel, Jay Ghurye, Michael Cum- mings and Mihai Pop. “A Critical Analysis of the Integrated Gene Catalog,” (Under preparation) v Dedication To my parents for their unconditional love and support vi Acknowledgments I owe my gratitude to all the people who have made this thesis possible and because of whom my graduate experience has been the one that I will cherish forever. First, I would like to thank my advisor, Professor Mihai Pop for giving me an in- valuable opportunity to work on challenging and exciting projects. It has been a pleasure to work with and learn from such an extraordinary individual. His advice has always helped me at various stages in the graduate program and helped me grow as a researcher and better person. I would also like to thank all of my other committee members - Michael Cum- mings, Aravind Srinivasan, Hector´ Corrada Bravo, Max Leiserson, and Adam Phillippy for being interested in my research. Your feedback has been instrumental in shaping this dissertation. I would like to thank all my collaborators who contributed to the work described in this dissertation. Primarily, I would want to thank Adam Phillippy for giving me an opportunity to work on exciting projects with his group at NIH. A special thank you to Sergey Koren and Arang Rhie for continuous