Techniques for Storing and Processing Next-Generation DNA Sequencing Data
THESIS

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By
Terry Camerlengo
Graduate Program in Biophysics

The Ohio State University
2014

Master's Examination Committee:
Professor Kun Huang, PhD
Professor Raghu Machiraju, PhD
Professor Carlos Alvarez, PhD

Copyright by
Terry Camerlengo
2014

ABSTRACT

Genomics is undergoing an unprecedented transformation due to rapid improvements in genetic sequencing technology, which have lowered the cost of sequencing experiments while increasing the amount of data generated in a typical experiment (McKinsey Global Institute, May 2013, pp. 86-94). The increase in data has shifted the burden from analysis and research to expertise in IT hardware and network support for distributed and efficient processing. Bioinformaticians, in response to a data-rich environment, are challenged to develop better and faster algorithms to solve problems in genomics and molecular biology research.

This thesis examines the storage and data processing issues inherent in next-generation DNA sequencing (NGS). It details the design and implementation of a software prototype that exemplifies current approaches to the efficient storage of NGS data. The software library is utilized within the context of a previous software project that accompanies the publication related to the HT_SOSA assay. The software for HT_SOSA, called NGSPositionCounter, demonstrates a workflow that is common in a molecular biology research lab. In an effort to scale beyond the research institute, the software library's architecture takes into account scalability considerations for data storage and processing demands that are more likely to be encountered in a clinical or commercial enterprise.

DEDICATION

This Master's thesis is dedicated to my beautiful wife Ellen Nixon.

ACKNOWLEDGMENTS

I would like to thank Dr. Joel Saltz and Dr. Tahsin Kurc for giving me the rare opportunity of working at The Ohio State University Comprehensive Cancer Center's Biomedical Informatics Shared Resource. Without them I never would have been exposed to the fascinating and exciting areas of bioinformatics and scientific computing. I would also like to thank them for fully supporting my decision to pursue graduate work in computational biology while being employed full-time at the shared resource.

I would also like to thank Dr. Kun Huang for his mentorship over the years as both my graduate advisor and my supervisor. I have found Dr. Huang to be not only a brilliant individual with whom I was most fortunate to work, but also a kind and caring teacher whose door was always open when it came to navigating the difficulties of achieving balance between career, academic studies, and personal life. Thank you, Dr. Huang, for all you have done.

A special thanks to Dr. Raghu Machiraju and Dr. Carlos Alvarez for their assistance both as committee members and as co-PIs on various grants that I had the opportunity to work on. I was deeply enriched by their depth of knowledge and guidance. Hopefully our collaborations will continue in the coming years.

I would also like to acknowledge the Department of Defense Congressionally Directed Medical Research Programs grant awards W81XWH-11-2-0224, -0225, and -0226, which were instrumental in the development of many of the ideas in this thesis.

VITA

March 1988 .......... Steubenville High School
1994 .......... B.A. Philosophy, The Ohio State University
1997 .......... B.A. Computer Science, The Ohio State University
1997 to 2004 .......... Software Programmer (various places)
2004 to 2013 .......... Research Specialist, Department of Biomedical Informatics, The Ohio State University and Comprehensive Cancer Center
2013 to present .......... Principal Research Scientist, Battelle Memorial Institute

PUBLICATIONS

Co-author, SCJD Exam with J2SE 5, 2nd Edition, Apress Books, ISBN 1-59059-516-5, December 2005

Co-author, The Sun Certified Java Developer Exam with J2SE 1.4, Apress Books, ISBN 1590590309, August 2002

Terry Camerlengo, C. Johnson, "Make the Java-Oracle9i Connection", JavaWorld Magazine, http://www.javaworld.com/javaworld/jw-06-2003/jw-0613-oracle9i.html, June 2003

Kurc T, Janies D, Johnson A, Langella S, Oster S, Hastings S, Habib F, Camerlengo T, Ervin D, Catalyurek U, Saltz J, "An XML-based System for Synthesis of Data from Disparate Databases", J Am Med Inform Assoc, 2006, in press.

Hatice Gulcin, Doruk Bozdağ, Terry Camerlengo, Jiejun Wu, Yi-Wen Huang, Tim Hartley, Jeffrey D. Parvin, Tim Huang, Umit V. Catalyurek, Kun Huang, "A Comprehensive Analysis Workflow for Genome-Wide Screening from ChIP-Sequencing Experiments", SpringerLink, http://www.springerlink.com/content/c882314242m17018, April 2009

Terry Camerlengo, Gulcin Ozer, Guojuan Zhang, Tarek Joobeur, Tea Meulia, Joanne Trgovcich, Kun Huang, "Computational Challenges and Solutions to the Analysis of Micro RNA Profiles in Virally-Infected Cells Derived by Massively Parallel Sequencing", OCCBIO, pp. 32-36, 2009 Ohio Collaborative Conference on Bioinformatics, http://www.computer.org/portal/web/csdl/doi/10.1109/OCCBIO.2009.24, 2009

Hatice Gulcin Ozer, Terry Camerlengo, Tim Huang, Kun Huang, "A New Method for Mapping Short DNA Sequencing Reads by Using Quality Scores", OCCBIO, pp. 21-25, 2009 Ohio Collaborative Conference on Bioinformatics, http://www.computer.org/portal/web/csdl/doi/10.1109/OCCBIO.2009.35, 2009

Terry Camerlengo, Hatice Gulcin Ozer, Mingxiang Teng, Francisco Perez, Pearlly Yan, Lang Li, Jeffrey Parvin, Tim Huang, Tahsin Kurc, Yunlong Liu, and Kun Huang, "Enabling Data Analysis on High-throughput Data in Large Data Depository Using Web-based Analysis Platform – A Case Study on Integrating QUEST with GenePattern in Epigenetics Research", 2009 IEEE International Conference on Bioinformatics and Biomedicine, Nov. 2009

(Terry Camerlengo H. G.-S., 2012)

Kumar PS, Brooker MR, Dowd SE, Camerlengo T (2011), "Target Region Selection Is a Critical Determinant of Community Fingerprints Generated by 16S Pyrosequencing", PLoS ONE 6(6): e20956. doi:10.1371/journal.pone.0020956

(Taggart, et al., 2013)

FIELDS OF STUDY

Major Field: Biophysics

Table of Contents

Abstract
Dedication
Acknowledgments
Vita
Publications
Fields of Study
List of Tables
List of Figures
Chapter 1: The NGS Challenge
Chapter 2: NGS Workflows And Institutional Decision Support
Chapter 3: An Automated Pipeline for Processing Next Generation Sequencing
Chapter 4: Outlines of a Software Library for NGS Storage
Chapter 5: Efficient Storage Of NGS Data
Chapter 6: Conclusion
Bibliography
Appendix A – Comparison of 4-bit and 3-base/Byte Encoding Schemes
Appendix B – QUEST System Overview

LIST OF TABLES

Table 1. 4-bit encoding of DNA bases
Table 2. API for reference genome lookups
Table 3. PersistenceMgr CRUD types
Table 4. Comparison of various storage techniques

LIST OF FIGURES

Figure 1. The workflow for ChIP-seq data processing and analysis (Ozer, et al., 2009)
Figure 2. NGS data processing and automation pipeline
Figure 3. Execution of the Configuration file
Figure 4. Main Page for viewing Studies
Figure 5. GAII