Virginia Commonwealth University VCU Scholars Compass

Theses and Dissertations Graduate School

2016

Evaluating and Improving the Efficiency of Software and Algorithms for Sequence Data Analysis

Hugh L. Eaves Virginia Commonwealth University

Follow this and additional works at: https://scholarscompass.vcu.edu/etd

Part of the Software Engineering Commons, Systems Architecture Commons, and the Theory and Algorithms Commons

© The Author

Downloaded from https://scholarscompass.vcu.edu/etd/4295

This Dissertation is brought to you for free and open access by the Graduate School at VCU Scholars Compass. It has been accepted for inclusion in Theses and Dissertations by an authorized administrator of VCU Scholars Compass. For more information, please contact [email protected].

Copyright © 2016, Hugh L. Eaves All rights reserved

Evaluating and Improving the Efficiency of Software and Algorithms for Sequence Data Analysis

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy at Virginia Commonwealth University

by

Hugh L. Eaves
Bachelor of Science (Computer Science), Virginia Commonwealth University (1997)

Director: Maria C. Rivera, Associate Professor, Department of Biology

Co-Director: Bonnie L. Brown, Professor, Department of Biology

Virginia Commonwealth University
Richmond, VA
May 2016

Acknowledgement

“If I have seen further it is by standing on the shoulders of giants.” ―Letter from Isaac Newton to Robert Hooke, 5 February 1676

No Ph.D. research occurs in a vacuum, and I acknowledge some of the people who have made this possible. First and foremost, I thank my advisors and committee chairs, Dr. Maria Rivera and Dr. Bonnie Brown, for their support and the invaluable insight they provided during this entire process. I also thank my former advisor, Dr. Yuan Gao, for providing the inspiration and guidance for my first project, and my committee members Dr. Kier, Dr. Bonchev, and Dr. Arodz, for their time, dedication, and encouragement. I also acknowledge the VCU Graduate School for the funding I received as I transitioned from full-time work to being a full-time student, and the Integrative Life Sciences program administrators, Dr. Robert Tombes, Dr. William Eggleston, and Dr. Brian Verrelli, for their help negotiating the complexities of the process. I thank my lab “mates,” Dr. Christopher Friedline, Alex Waldrop, and Derek Gygax, for making the lab a fun place to work and also providing many practical suggestions on how to proceed. Lastly, I thank my students, whose enthusiasm for the material made teaching a joy, and my teachers, whose passion for the subject encouraged me to journey down this path in the first place.


Dedication

This dissertation is dedicated to my wife Tatiana, my son Matthew, my parents, my brother Thomas, and my sister Helen who have stood by me and supported me during the many days and nights I was missing in action while working on this Ph.D. Without their support, patience, and understanding, this work would not have been possible.


Table of Contents

1 Introduction - Computational Challenges in High Throughput Sequencing
1.1 Software Needs in Sequence Data Analysis
1.2 Software Libraries
1.3 Performance and Efficiency
1.4 Programming Language Comparison
1.5 Short Read Mapping
1.6 Structure of this Dissertation
2 MOM: Maximum Oligonucleotide Mapping
3 Comparative Analysis of Bioinformatic Library Performance
3.1 Introduction
3.2 Methods
3.3 Results and Discussion
3.4 Conclusion
3.5 Funding
3.6 Supplementary Data
4 BioMojo: A Framework for High Performance Sequence Analysis
4.1 Introduction
4.2 Methods
4.3 Results and Discussion
4.4 Conclusion
4.5 Funding
4.6 Supplementary Information
5 Conclusion
6 References
7 Vita


List of Tables

Table 1.1 - Feature comparison of bioinformatics software libraries
Table 1.2 - Ranking of computing languages by popularity
Table 1.3 - CPU and memory usage of programming languages
Table 3.1 - Memory usage of sequence data structures
Table 3.2 - Memory usage of sequence data structures, with quality scores
Table 3.3 - FASTA read benchmark results
Table 3.4 - FASTQ read benchmark results
Table 3.5 - Short read trimming benchmark results
Table 3.6 - Sequence translation benchmark results
Table 3.7 - Kmer counting benchmark results
Table 3.8 - benchmark results
Table 4.1 - CPU and elapsed time benchmark results
Table 4.2 - Memory usage benchmark results


List of Figures

Figure 3.1 - Comparison of BioJava memory usage with different measurement methods
Figure 3.2 - Comparison of SeqAn with Python memory usage for varying sizes of FASTA data
Figure 3.3 - R linear regression diagnostic plot for JEBL memory usage with FASTA data
Figure 3.4 - R linear regression diagnostic plot for HTSJDK memory usage with FASTA data
Figure 4.1 - File parsing strategy
Figure 4.2 - BioMojo code example


Abstract

With the ever-growing size of sequence data sets, data processing and analysis are an increasingly large portion of the time and money spent on nucleic acid sequencing projects. Correspondingly, the performance of the software and algorithms used to perform that analysis has a direct effect on the time and expense involved. Although the analytical methods are widely varied, certain types of software and algorithms are applicable to a number of areas. Targeting improvements to these common elements has the potential for wide reaching rewards. This dissertation research consisted of several projects to characterize and improve upon the efficiency of several common elements of sequence data analysis software and algorithms. The first project sought to improve the efficiency of the short read mapping process, as mapping is the most time consuming step in many data analysis pipelines. The result was a new short read mapping algorithm and software, demonstrated to be more computationally efficient than existing software and enabling more of the raw data to be utilized. While developing this software, it was discovered that a widely used bioinformatics software library introduced a great deal of inefficiency into the application. Given the potential impact of similar libraries on other applications, and because little research had been done to evaluate library efficiency, the second project evaluated the efficiency of seven of the most popular bioinformatics software libraries, written in C++, Java, Python, and Perl. This evaluation showed that two of the libraries written in the most popular language, Java, were an order of magnitude slower and used more memory than expected based on the language in which they were implemented. The third and final project, therefore, was the development of a new general-purpose bioinformatics software library for Java. This library, known as BioMojo, incorporated a new design approach resulting in vastly improved efficiency. Assessing the performance of this new library using the benchmark methods developed for the second project showed that BioMojo outperformed all of the other libraries across all benchmark tasks, being up to 30 times more CPU efficient than existing Java libraries.


1 Introduction - Computational Challenges in High Throughput Sequencing

1.1 Software Needs in Sequence Data Analysis

Since high throughput sequencing (HTS) methods were commercialized a decade ago,1 repeated technological refinements and economies of scale have resulted in a continuous reduction in sequencing cost. According to data from the National Human Genome Research Institute,2 in April 2008, the cost of generating one megabase of raw HTS data was approximately $15.05. As of July 2015, the cost had fallen to $0.015 per megabase, representing a 1000x decrease in cost in roughly seven years. This ongoing cost reduction has enabled sequencing on a scale that would have been cost-prohibitive only a short time ago.

Large projects such as the 1000 Genomes Project,3 ENCODE,4 and the Human Microbiome Project5 have produced datasets containing multiple terabases of raw nucleotide sequence data. Overall, the amount of sequence data produced has doubled every seven months during the last decade.6

Storing, processing, and analyzing these ever-larger volumes of sequence data has been facilitated by ongoing improvements to computer hardware. Advances in hardware have included vast improvements to raw CPU speed, higher data density and capacity storage devices (RAM, disk drives, network storage devices, etc.), and faster computer interconnects (networks / data buses), both over short distances (internal to a computer, or within a computing cluster) and over great distances (i.e. the Internet).7 The oft-misquoted “Moore’s law,”8 which predicts the doubling of the most cost-effective number of components per integrated circuit every 24 months,9,10 is often used as a proxy for this trend in overall hardware improvement.

Despite this impressive growth in performance, the rate of improvement has been insufficient to keep up with the even faster rate of sequencing cost reduction. The gap between the volume of data produced and the ability to process the data continues to widen. As a result, the cost of computing, not sequencing, has now become the larger portion of the cost of sequencing-based research.11,12

In addition to the growing gap between sequencing and computing capacity, rapid changes in the underlying sequencing technology are providing further challenges to data processing and analysis. After only a few years working with second generation sequencing (pyro, synthesis, ligation, and semiconductor), we already are seeing the introduction of third generation sequencing platforms (waveguide and nanopore). This new generation brings with it much longer sequence reads, but at the cost of higher error rates.1 In theory, these longer reads will improve methods such as assembly scaffolding and also will render a number of current approaches obsolete. New software tools and algorithms will be required to most effectively utilize these data.13 In essence, sequencing technology is evolving faster than the analytical software required to quickly and efficiently analyze the data.

The combination of growth in sequence data volume, changes to sequencing technology, and inadequate rates of hardware improvement is causing a more rapid obsolescence of analysis techniques than otherwise might be expected. This rapid obsolescence continues to drive high demand for new methods that can handle increasingly large volumes of sequence data from existing platforms, and the types of sequence data associated with new sequencing technology. Correspondingly, much effort is spent on the development or application of algorithms, data structures, and software to realize these improvements.

Certain types of algorithms or software are applicable to a large number of sequence analysis methods. Targeting these elements for improvement has the potential to reap rewards across a large number of areas. The goal of the research described in this dissertation is to make performance and functional improvements to several of these core elements of sequence data analysis. Specifically, I have focused on three areas: short read mapping algorithms, general purpose bioinformatics software libraries, and the development of software to process long read data from third generation sequencing systems.

1.2 Software Libraries

1.2.1 Software Libraries and APIs

When writing a program, it is useful to conceptualize the overall structure as a hierarchy of operations. At the top level of the hierarchy lies the main operation or task the program is designed to perform. In the case of bioinformatics, examples include discovery of a sequence motif, filtering reads from a sequencing platform, or performing an alignment of multiple RNA sequences. Typically, this main operation is composed of several lower level operations, all of which complete one part of the work necessary to accomplish the high level task. For example, to perform quality score based filtering of short read data, it is necessary to read the sequence data from a file into memory, apply a filtering algorithm to each read, and write the resulting reads to a new file. For alignment, it would be necessary to read the sequence data from a file, apply an alignment algorithm to the sequences, and write the results to another file.
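To make this decomposition concrete, the following minimal sketch (my own illustration, written against only the Java standard library rather than a bioinformatics library) implements the quality-score filtering example above. It assumes well-formed, four-line FASTQ records with Sanger (+33) quality encoding, and the cutoff of 20 is an arbitrary choice for the example.

import java.io.*;

public class QualityFilter {
    public static void main(String[] args) throws IOException {
        double minMeanQuality = 20.0;                       // arbitrary example cutoff
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]));
             PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(args[1])))) {
            String header;
            while ((header = in.readLine()) != null) {      // step 1: read each record
                String seq = in.readLine();                 // sequence line
                String plus = in.readLine();                // "+" separator line
                String qual = in.readLine();                // quality line
                if (meanQuality(qual) >= minMeanQuality) {  // step 2: apply the filter
                    out.println(header);                    // step 3: write passing reads
                    out.println(seq);
                    out.println(plus);
                    out.println(qual);
                }
            }
        }
    }

    // Mean Phred score, assuming Sanger encoding (ASCII value minus 33).
    static double meanQuality(String qual) {
        long sum = 0;
        for (int i = 0; i < qual.length(); i++) sum += qual.charAt(i) - 33;
        return qual.isEmpty() ? 0.0 : (double) sum / qual.length();
    }
}

Each of the three lower level operations here (reading records, applying the filter, writing results) is exactly the kind of building block that, as described below, is normally obtained from a software library rather than rewritten for every program.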

Regardless of the high-level operations a program was designed to perform, many of the lower level operations used are the same as in many other programs. In the aforementioned examples, all the programs would need to read and write sequence data files. Some of the other operations, such as sequence alignment, are likely to be used in other programs as well.

In bioinformatics, some of the most commonly used operations include reading and writing sequence data from files (e.g., File I/O), storing data in relational databases, transmitting sequence data in real time over a network, and accessing some of the widely used bioinformatics web services such as those of the National Center for Biotechnology Information (NCBI) and ENSEMBL.14

Rather than writing code to perform these operations in every new program, programmers rely on pre-written code packaged in what are known as software libraries. A software library provides a set of coherent and related operations applicable to programming problems of a particular type or discipline. For example, a library may provide a set of general purpose sorting operations, matrix algebra operations, or a set of widely used bioinformatics operations. The operations provided by one or more libraries are used by the programmer as building blocks to construct applications, or even other libraries. Libraries vary in the scope of operations implemented, with some libraries providing only a small set of core operations, and other libraries providing a larger number of operations, including those less often used. The set of operations provided by a particular library, along with the inputs and outputs of those operations, comprise what is known as the library’s application programming interface (API).

The API defines a contract between the program and the library, specifying how the program can and should access the functionality provided by the library. The specific method by which a program accesses individual operations varies based on the libraries and/or programming languages used. Regardless of the specifics, executing an operation provided by a library is typically known as “calling the API.”

It is important to understand that although the API defines how the program and library interact (the interface), it does not typically define details of how the functionality is carried out within the library (the implementation). The advantage of this arrangement is that as long as the API does not change in an incompatible way, the program and the library can be modified and improved independently of one another. Sometimes, however, efficiency improvements or new functional requirements do require an incompatible API change. As incompatible API changes necessitate compensatory changes in any programs that use the API, this sort of change potentially impacts a large number of programs.
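As a minimal illustration of this separation between interface and implementation, consider the following sketch; the names SequenceStore and InMemorySequenceStore are hypothetical and do not come from any real library.

// The API: the contract that programs compile against.
public interface SequenceStore {
    void put(String id, String sequence);
    String get(String id);
}

// One possible implementation; programs never reference this class directly.
class InMemorySequenceStore implements SequenceStore {
    private final java.util.Map<String, String> data = new java.util.HashMap<>();
    public void put(String id, String sequence) { data.put(id, sequence); }
    public String get(String id) { return data.get(id); }
}

A program written against SequenceStore is unaffected if the library later swaps the HashMap-backed class for a compressed or disk-backed one; only an incompatible change to the interface itself would force the program to change.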

1.2.2 API Design Considerations

As a change to an API may impact a large number of programs, there is much incentive to design the best possible API for a library, so that changes will not be necessary once the API is in use. In practice, however, this is often very difficult. Computer languages have inherent limitations, such that designing an API often involves making compromises between performance, flexibility, usability, or other factors. In-depth knowledge of the operations performed in previously written programs can help determine how best to balance these factors. However, even if a library is designed to handle the majority of use cases, there are usually still some cases where the design will be less than optimal. Also, as most libraries are intended to be used for programs that have yet to be written, optimum library design requires making educated guesses about the ways the library will be used in the future.15

If the assessment of current requirements was flawed, or the predicted future usage of the library was incorrect, the library can end up being mismatched to the task at hand. In fact, as this dissertation will illustrate, in the case of several bioinformatics software libraries, there is evidence that some of the widely used APIs and libraries are not optimal for working with current and evolving sequence data sets. Many of these libraries were designed before the need to process large volumes of sequence data arose, and in some cases, they prioritize flexibility over performance. For example, BioJava16 has an elegant and flexible API which incorporates a high level of abstraction, where individual sequence elements are modelled as full Java objects. However, this high degree of abstraction, although attractive from a design and flexibility point of view, negatively impacts performance. Indeed, to efficiently handle raw sequence data, BioJava provides a separate API specifically for this type of data. As this API is incompatible with the majority of other BioJava features, raw sequence data must be converted, either internally or externally, to use with the rest of the BioJava APIs. In addition, the somewhat artificial distinction made in the API between raw reads (i.e. short reads) and other, longer sequences makes less sense in the context of third generation sequencing platforms where raw reads are typically much longer. As the limitations of the existing libraries are architectural in nature, meaning there is no obvious way to fix the existing libraries without rewriting both the library and the applications, there is an opportunity to develop a new sequence data analysis library with a performance focused architecture. A portion of this dissertation involves the design of a new Java library, BioMojo, that attempts to address the performance limitations of the existing libraries, while still implementing sufficient abstraction to be suitable for use as a general-purpose bioinformatics library.

1.2.3 Features of Bioinformatics Software Libraries

Libraries supporting bioinformatics software development are available for every popular scientific programming language, including C++ (SeqAn17), Java (BioJava16, HTSJDK18), Python (BioPython19), and Perl (BioPerl20). Most of these libraries are the result of the development efforts of a number of authors over the course of many years. Some of them have been sponsored by large institutions or non-profit organizations; e.g., BioJava, BioPerl, and BioPython are sponsored by the Open Bioinformatics Foundation, and HTSJDK is sponsored by MIT and Harvard’s Broad Institute. Because bioinformatics libraries vary widely in the operations they implement, a detailed comparison at the level of individual operations is not particularly useful.

However, it is possible to identify important groups of operations or other important design characteristics of each library. In this discussion, these important operation groups and characteristics will be called the “features” of the library. Although the exact operations or implementation provided by a feature varies from library to library, the presence or absence of features provides a useful way to compare the overall functionality. The following features are important in general-purpose bioinformatics libraries:


File I/O - There are a large number of file formats used to store biological data. The most commonly used formats are FASTA and FASTQ, which store unannotated sequence data. Other formats, such as GENBANK, SAM, FAST5, etc., store sequence data along with other biologically important metadata, such as gene annotations or alignments, or even data related to the sequencing process. Functions to read and write files (i.e., File I/O) in some or all of these formats are provided by most libraries.
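As an illustration of the minimum that the File I/O feature entails, the following sketch reads an entire FASTA file using only the Java standard library. It is illustrative code that assumes a well-formed file and loads everything into memory; library parsers add validation, error handling, streaming, and support for the many other formats listed above.

import java.io.*;
import java.util.*;

public class FastaReader {
    public static Map<String, String> read(File file) throws IOException {
        Map<String, String> sequences = new LinkedHashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader(file))) {
            String id = null;
            StringBuilder seq = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                if (line.startsWith(">")) {             // header line starts a new record
                    if (id != null) sequences.put(id, seq.toString());
                    id = line.substring(1).trim();
                    seq.setLength(0);
                } else {
                    seq.append(line.trim());            // sequence data may wrap across lines
                }
            }
            if (id != null) sequences.put(id, seq.toString());   // flush the final record
        }
        return sequences;
    }
}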

Unified API - Some libraries (e.g., BioJava, discussed above) provide different and incompatible APIs to manipulate raw (short) read data versus other types of sequence data (e.g., reference sequences or long reads). Other libraries provide a “unified API,” where all types of sequences, regardless of source, can be manipulated using the same API functions. With the non-unified APIs, additional code is required to convert among data structures, depending on which API operations are needed to manipulate the data. This obviously introduces unnecessary complexity and may also impact performance. With the blurring of the lines between long and short sequences resulting from third generation sequencing, the distinction made in some APIs between the two types of data reduces the generality of solutions implemented using the library.

Validation - Sequence data files often contain a variety of elements other than the canonical nucleotide and amino acid symbols. For example, if positions in a sequence are ambiguous, then symbols representing that ambiguity can be part of the sequence data. When developing a program for wide usage, it is important to make sure that the input data (or sequences) contain only the elements on which the program can operate. Surprisingly, given the importance of data validation, not all the libraries have the capability to enforce (or validate) the contents of input sequences.
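In sketch form, such validation amounts to checking every symbol against a declared alphabet, as below (hypothetical code, not any particular library's API; here the alphabet is unambiguous DNA plus N).

import java.util.BitSet;

public class AlphabetValidator {
    private final BitSet allowed = new BitSet(128);

    public AlphabetValidator(String alphabet) {
        for (char c : alphabet.toCharArray()) {
            allowed.set(Character.toUpperCase(c));
            allowed.set(Character.toLowerCase(c));
        }
    }

    public void validate(CharSequence seq) {
        for (int i = 0; i < seq.length(); i++) {
            char c = seq.charAt(i);
            if (c >= 128 || !allowed.get(c)) {
                throw new IllegalArgumentException("Invalid symbol '" + c + "' at position " + i);
            }
        }
    }
}

// Usage: new AlphabetValidator("ACGTN").validate(sequence);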

Translation / Amino Acid Sequences - Some “general purpose” libraries provide support for amino acid sequences and sequence translation. Libraries focused only on processing raw HTS data (HTSeq, HTSJDK) do not support this feature.

Alignment - Some libraries provide support for sequence alignment, providing either a built-in set of algorithms, or support for executing external alignment tools. For example, BioPerl does not support this feature internally but instead relies on calling external tools to perform the alignment. This is a feature found mostly in general purpose libraries, and libraries targeted solely towards HTS data (HTSeq, HTSJDK) do not support this feature.

Compression - For sequences where each position in the sequence is constrained to be one of a few symbols (e.g., one of four nucleotides, AGTC), there are numerous methods to store the sequence data in a much more compact format. For example, the four nucleotides can be represented using only two binary digits per symbol, resulting in a 4x or 8x reduction in the amount of memory required to store a sequence when compared to storing a symbol in a character type. This feature, although often useful when working with large in-memory datasets, is not provided by the majority of libraries.
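The following sketch shows the idea for the 2-bit case (illustrative code only; a real library must also handle ambiguity symbols, which do not fit in two bits, and would integrate packing with its sequence data structures). It packs four nucleotides into each byte, giving the reduction described above relative to one- or two-byte characters.

public class TwoBitEncoder {
    public static byte[] encode(String seq) {
        byte[] packed = new byte[(seq.length() + 3) / 4];
        for (int i = 0; i < seq.length(); i++) {
            int code = switch (seq.charAt(i)) {
                case 'A' -> 0; case 'C' -> 1; case 'G' -> 2; case 'T' -> 3;
                default -> throw new IllegalArgumentException("not a 2-bit symbol");
            };
            packed[i / 4] |= code << ((i % 4) * 2);   // four 2-bit codes per byte
        }
        return packed;
    }

    public static char decode(byte[] packed, int i) {
        int code = (packed[i / 4] >> ((i % 4) * 2)) & 0x3;
        return "ACGT".charAt(code);
    }
}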

Web services - Some libraries have built-in support for utilizing various Internet accessible bioinformatics services and databases such as the NCBI database.


Database support - Some libraries support reading and writing sequence data to relational databases. This feature includes a database schema definition that mirrors the in-memory data structures and various I/O routines to either store or retrieve data.

HPC support - Bioinformatics applications designed to solve computationally expensive problems often are written to take advantage of the distributed processing capability of High Performance Computing (HPC) clusters. To provide maximum performance in these sorts of environments, the library itself needs to implement “hooks” that can be used to efficiently distribute data and operations on those data throughout the HPC cluster. Libraries that do not provide this feature are less suitable for the development of HPC applications.

These general-purpose features are available to varying degrees in the seven most popular bioinformatics libraries (Table 1.1). Generally, the libraries can be divided into two groups based on the features they provide: libraries with a focus only on processing short reads, and more general purpose libraries that support processing short reads, but also can be used to develop a larger variety of more complex applications. It is interesting that despite the suitability of Java for HPC applications development, none of the Java libraries have native support for HPC frameworks.

In the development of applications to analyze the longer reads of third generation technologies, libraries that focus solely on short read data are becoming less applicable, and as this dissertation will illustrate, there are performance problems that limit the suitability of many of the other libraries for analysis of large data sets. The result is that developers may increasingly need to integrate multiple incompatible bioinformatics libraries into their applications, or develop basic functionality themselves. Given the obvious shortcomings of this approach, there is a need for a new library that is both efficient enough to handle large datasets and provides a uniform interface to the features appropriate for working with long read data.

Table 1.1 - Feature comparison of bioinformatics software libraries. This table shows which features are available in the specified version of each bioinformatic software library. Language indicates the language(s) in which the library is written. Remaining columns indicate the presence (Y) or absence (N) of the features described above. “Partial” indicates that the library requires external tools or end-user code to implement the feature.

Library    Version   Language          FASTA  FASTQ  Unified API  Validation  Translation  Alignment  Compression  Web services  Relational DB  HPC Support
BioJava    4.1.0     Java              Y      Y      N            Y           Y            Y          Partial      Y             N              N
BioPython  1.64      Python            Y      Y      Y            Y           Y            Y          N            Y             Y              N
BioPerl    1.6.924   Perl              Y      Y      Y            Y           Y            Partial    N            Y             Y              N
SeqAn      2.0.1     C++               Y      Y      Y            Y           Y            Y          Y            N             N              N
HTSeq      0.6.1     Python / Cython   Y      Y      N            Y           N            N          N            N             N              N
HTSJDK     1.130     Java              Y      Y      N            N           N            N          N            N             N              N
JEBL       SVN 1202  Java              Y      N      N/A          Y           Y            Y          N            N             N              N

1.3 Performance and Efficiency

1.3.1 Defining Performance and Efficiency

It is important at this point to briefly define what is meant by the terms “performance” and “efficiency” when referring to an algorithm or its implementation in software. As used in this dissertation, performance is defined as the amount of computer system resources (e.g. memory, CPU time, disk space, etc.) required to solve a problem for a particular set of input data. In the context of sequence data analysis, these computational problems include processes such as alignment, trimming, translation, mapping, etc. Efficiency is defined as relative performance (e.g., more efficient, 10% more efficient, improved efficiency, etc.). In some cases, the phrase “relative efficiency” is used to emphasize that efficiency is a relative measure.

1.3.2 Measuring Performance and Comparing Efficiency

Performance is affected by numerous interrelated factors, such as the implementation language, operating system, data structures, and the hardware on which the system is implemented. Measuring performance is the process of determining the amount of resources required by a program for a particular set of input data. In some cases, it is possible to determine this by direct analysis of the code or algorithms involved. The advantage of an analytical solution is that it often produces a model that can be used to predict resource usage based on the characteristics of the input data. The ability to predict program resource requirements for arbitrary data sets is very important in designing efficient analytical software.21 In practice, as there are many factors that can affect performance, exact analytical solutions are difficult to derive in all but the simplest cases. In many cases, it is more practical to actually run a program and use the resource tracking capabilities of the operating system to measure the resource usage (e.g., to measure memory usage in bytes, measure CPU usage in seconds, disk I/O in bytes per second, etc.). This process of executing a program and measuring resource use is known as “benchmarking.”22
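In outline, benchmarking an individual operation on the JVM might look like the sketch below, which uses the standard java.lang.management interfaces to sample CPU time and heap usage around a task. This is illustrative only; a rigorous benchmark adds warm-up runs, repetitions, and statistical treatment of the results, as discussed in Chapter 3.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class MiniBenchmark {
    public static void measure(Runnable operation) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long cpuStart = threads.getCurrentThreadCpuTime();   // CPU time in nanoseconds
        long wallStart = System.nanoTime();

        operation.run();                                     // the operation being measured

        long wallMs = (System.nanoTime() - wallStart) / 1_000_000;
        long cpuMs = (threads.getCurrentThreadCpuTime() - cpuStart) / 1_000_000;
        long heapKb = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed() / 1024;
        System.out.printf("wall = %d ms, cpu = %d ms, heap = %d kb%n", wallMs, cpuMs, heapKb);
    }
}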

When comparing resource usage of different programs or algorithms, the measurements of individual resource usage are often (statistically) orthogonal. In this case, efficiency can only be meaningfully compared after assigning weights to each resource, with the efficiency of each program calculated based on the weights assigned. Indeed, different solutions to the same problem often involve differing tradeoffs among individual resource requirements. For example, an algorithm requiring less CPU time to complete may require more memory.21
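As a simple illustration (one possible formulation, not a standard taken from the literature cited here), measured CPU time $t_p$ and peak memory $m_p$ for each program $p$ can be normalized and combined under chosen weights:

$$C_p = w_{\mathrm{cpu}} \frac{t_p}{t_{\min}} + w_{\mathrm{mem}} \frac{m_p}{m_{\min}}, \qquad w_{\mathrm{cpu}} + w_{\mathrm{mem}} = 1$$

where $t_{\min}$ and $m_{\min}$ are the smallest values observed across the programs being compared. The program with the lowest $C_p$ is the most efficient under those weights, and a different weighting can reverse the ranking.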

In many cases, complete programs are benchmarked, with the goal of determining the relative efficiency of different programs that perform similar functions. The programs are run on a number of datasets, with the resource usage measured and compared. In instances where programs produce different results for the same input data, as in the case of many bioinformatics programs, the comparison also will need to incorporate a measurement of the completeness or accuracy of the output. This sort of functional benchmarking often is used to characterize the degree of improvement provided by new bioinformatics programs or algorithms. In the next chapter of this dissertation, benchmarking of a new short read mapper, MOM, against existing software that performs the same function will show that this mapping approach produced more sensitive results while using fewer resources (Chapter 2).

In practice, due to the difficulty and time required to design, execute, and analyze benchmark experiments, much of the published work benchmarking bioinformatics software includes only a few test datasets and benchmarks.23–25 Although this provides a measure of performance in a few illustrative cases, there are usually insufficient data to predict performance with data sets substantially different from those used in the test. Furthermore, as the resources are only measured by executing entire programs, it is not possible to gain insights into how individual operations contributed to overall resource usage. To obtain more fine-grained information on resource usage, it is necessary to write specialized benchmarking programs that execute only one (or a few) operations. If precautions are taken to ensure the total resource usage of the program results only from those operations, and the programs are run in a well-controlled execution environment, it is possible to obtain accurate and repeatable measurements of resource usage for the given operations. In theory then, by running the benchmarks on datasets with varying sizes and characteristics, statistical analysis (e.g., regression) can be used to derive predictive models of the relationships between resource usage of individual operations and the characteristics of the input data.26 It is possible for a properly designed set of benchmarking experiments to yield accurate models of “real world” resource usage with predictive power similar or superior to models obtained through direct analytical methods.
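In sketch form, the regression step can be as simple as an ordinary least-squares fit of a resource measurement against input size; the code below is illustrative, with made-up numbers, and the actual analysis in Chapter 3 uses R and examines the regression diagnostics.

public class LinearFit {
    // Ordinary least squares for y = a + b*x; returns {a, b}.
    public static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i]; sy += y[i];
            sxx += x[i] * x[i]; sxy += x[i] * y[i];
        }
        double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        double intercept = (sy - slope * sx) / n;
        return new double[] { intercept, slope };
    }

    public static void main(String[] args) {
        double[] megabases = { 1, 2, 4, 8 };            // input sizes (hypothetical)
        double[] memoryKb = { 150, 260, 490, 940 };     // measured usage (hypothetical)
        double[] model = fit(megabases, memoryKb);
        // Predict memory for a 16-megabase input from the fitted model:
        System.out.printf("predicted = %.0f kb%n", model[0] + model[1] * 16);
    }
}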

Part of this dissertation examines factors that have a large impact on the overall performance of bioinformatics software applications. As many software applications rely on bioinformatics software libraries to perform core operations, important operations provided by several widely-used bioinformatics libraries are benchmarked. A significant contribution of this work is that it not only measures the efficiency of a large number of bioinformatics libraries across a variety of conditions, but also establishes the ability to derive predictive models of resource usage of these libraries based on statistical analysis of benchmark results.

1.3.3 Improving Performance

The process of improving software performance is known as “software optimization.” Many beginning software engineers are taught that software optimization should be performed only after the majority of a program has been written (the mantra “optimize last”). This is based on the belief that an adequate understanding of performance can only be obtained by observing the behavior of the completed system, and that performing optimization without that understanding may result in misdirected efforts and wasted time, or in the worst case, a final system that is actually de-optimized in some way. However, if the poor performance is the result of a widely used API, correcting the problem may require substantial changes to large parts of the software system (see 1.2 for a discussion of this issue). So, in such cases, it is important to consider performance from the start, as not doing so can potentially lead to redesigning and re-implementing large portions of the system.27

There are a number of approaches to improving the performance of a program. Generally, these approaches fall on a continuum from the development of completely new algorithms, to finding ways to more efficiently implement existing algorithms. In bioinformatics, exact solutions are rarely required, so new algorithms are often heuristic solutions that make reasonable tradeoffs between performance and the accuracy or completeness of the result. Ideally, a new algorithm will generate results at least as good as existing algorithms, but with less resource usage. Bioinformatics algorithms may also leverage the non-uniformity of biological data sets to improve performance or the accuracy of results. The well-known BLAST28 program is an excellent example of a sequence alignment program that both leverages non-uniform distributions and uses a heuristic search method to provide usable results in a much faster timeframe than more exact methods such as the Smith-Waterman method.29


In contrast to algorithmic improvement, implementation improvement implies that the algorithm is mostly the same, but implemented in a way that makes more efficient use of the underlying hardware, software, and operating system. These types of improvements are often quite difficult. Not only must the developer have a broad understanding of the problem domain (e.g., sequence analysis), but an in-depth knowledge of computer architecture and optimization techniques is also required. Having a solid grasp of the internals of the implementation language, the performance characteristics of CPUs, memory, networks, and I/O subsystems, and the interactions among them also is important.

This dissertation demonstrates that both approaches can yield significant improvements. In Chapter 2, a new heuristic mapping algorithm was developed that can outperform existing algorithms. In Chapter 3, benchmark methods were developed for bioinformatics software libraries, with the results highlighting areas where significant performance improvements could be made. In Chapter 4, a new library was developed for Java to address the performance issues identified in Chapter 3.

1.4 Programming Language Comparison

1.4.1 Popularity

The decision of which language to use for a programming project involves weighing the strengths and weaknesses of languages in relation to factors such as programmer expertise, runtime performance, ease of learning, availability of development tools, expressiveness of syntax, type safety, deployment complexity, long term maintainability, and the availability of compatible software libraries. Existing expertise with a particular language can be the deciding factor, but otherwise library availability and language performance are considered most important.30 Of the hundreds of programming languages and platforms available, only a small number are used for the majority of software development projects, with the overall distribution following a power law.30 The reasons for this are complex, but generally speaking, a set of commonly occurring project requirements combined with other extrinsic factors all contribute to the ongoing popularity of certain languages. Relative rankings of language popularity are available from a variety of sources (Table 1.2).31–33 The rankings vary somewhat based on the methodology and data used, but of the 15 languages common to these sources, Java is consistently found to be among the most widely used, along with C, C++, and Python.

Table 1.2 - Ranking of computing languages by popularity according to three published indices.31–33 Only languages ranked by all three sources were included. The aggregate rank was computed using the RankAggreg package in R.

Language      Tiobe rank   IEEE rank   PYPL rank   Aggregate rank
Java               1            1           1            1
C                  2            2           6            2
C++                3            3           5            3
Python             5            4           2            4
C#                 4            5           4            5
PHP                6            7           3            6
JavaScript         7            8           7            7
Ruby               9            9          12            8
Objective C       11           14           8            9
MATLAB            13           10          10           10
Perl               8           11          14           11
Swift             12           13           9           12
R                 14            6          11           13
VisualBasic       10           12          13           14
Lua               15           15          15           15

It is important to note that the drivers of language popularity for the software industry as a whole are not necessarily the same as those for scientific programming. With the exception of a few large-scale software development efforts, much scientific software is written on an occasional basis by researchers without extensive formal training in software development.34,35 In this case, the most important requirement is often to complete a small programming task as quickly as possible, with less need to consider the long-term maintainability, portability, or sometimes even the correctness of the software.36,37 In other words, for much scientific programming, completing the task with minimal effort is important, with the code playing only a minor role once the results have been produced. For this reason, characteristics such as ease of learning, expressiveness, and library availability tend to be even more important in science than in other software development fields. Not surprisingly, this has meant that scripting languages, such as Python, Perl, or R38 are popular choices for scientific programming due to their relative strength in these areas.39 However, as discussed below, there are still many scientific programming tasks, especially the development of end user applications or applications where efficiency is important, where scripting languages may not be the best choice. In these cases, many of the same reasons that make Java the most popular language overall make it a strong choice for scientific applications as well.

1.4.2 Performance and Efficiency

Languages vary widely in performance and efficiency, and several studies have characterized this variability. In 2008, Fourment and Gillings40 examined the differences in language efficiency for a number of representative bioinformatics tasks. They wrote benchmark programs to perform global alignment, neighbor joining, and BLAST parsing in the most widely used languages: Java, C, C++, Perl, Python, and C#. They executed the programs on test data sets, recording the CPU time and memory required for each run. They found C to be the fastest language overall, and depending on the task, either Java or C++ was found to be the second fastest language (Table 1.3). The most popular language, Java, was roughly 1.2 times slower than C for global alignment, 1.8 times slower than C for BLAST parsing, and 64 times slower than C for neighbor joining. Python and Perl were slower across all tasks than C and Java. Differences in memory usage (Table 1.3) were less pronounced, with the exception of the Java neighbor-joining program, which used 17 times more memory than any of the other programs. The relative memory usage analysis illustrated that C, C++, C#, and Java were more than 5x as efficient as Perl and Python.

Table 1.3 - CPU and memory usage of programming languages for several bioinformatic tasks as estimated by Fourment and Gillings.40 These are the results for the Linux version of the benchmark available online.41

Language   BLAST Parsing    Global Alignment            Neighbor Joining
           CPU (min)        CPU (sec)     Mem (kb)      CPU (sec)     Mem (kb)
C          3.10             0.38          40892         0.04          1.00
C++        3.21             0.53          41376         0.12          1.46
C#         33.44            0.62          45296         0.41          6.27
Java       5.58             0.44          52948         2.58          17.45
Perl       7.28             43.58         256296        11.87         2.41
Python     38.42            23.18         207728        8.94          3.47

Another widely used source of information on the relative performance of programming languages is “The Computer Language Benchmarks Game.”42 This project solicits and receives ongoing submissions of benchmark programs to perform specific tasks, written in any one of a large number of supported languages. Periodically, all of the submitted programs are executed, with CPU and memory utilization recorded for each run. The programs are then ranked by their efficiency, with the most efficient programs considered the current “winners” of the “game”.

Comparison of these results to those of Fourment and Gillings40 indicates that C, C++, Java, and C# have comparable CPU performance. However, in contrast to the results of the aforementioned study, Java and C# tend to use much more memory than the other languages (Figure 1.1). These contrasting results for memory usage probably reflect the difficulties in accurately measuring memory usage in garbage collected languages such as Java (see 3.2.8.1 and 3.3.1.1). What is important from these benchmark surveys is that CPU usage of C/C++/Java is always very similar, with Perl and Python being 25 to 50 times slower. This provides a “null hypothesis” for subsequent benchmarking of bioinformatic libraries (Chapter 3): that the relative performance of libraries will match the relative performance of the languages in which they are implemented.


Figure 1.1 - CPU and memory usage of programming languages for several bioinformatic tasks as estimated by The Computer Language Benchmarks Game.42


1.4.3 Cross Platform Development and Deployment

Life sciences research increasingly relies on the development of custom data analysis software. This software can be as simple as a few small R scripts to perform some basic statistical analysis, or as large as a complete web-based toolset implementing novel analytical techniques. Some of this software may be designed from the outset for use by other scientists, whereas in other cases, the software may be written to support a single study with less focus on usability by other scientists or in other studies. In the former case, it is obviously important that the software be easily installable (deployable) without a great deal of expertise and effort on the part of the end user. However, this can be very important in the latter case as well, so that other scientists can understand and reproduce the methods used, or to facilitate the application of the same methods to different studies.

The effort involved in developing software that is easily deployable varies widely, depending on the programming languages used and environments in which the software will be deployed. More heterogeneous environments require more effort, due to the increased amount of platform specific code, testing, installation procedures, and documentation required. This is especially problematic in research and academic environments where, unlike large commercial environments, there is a more diverse mix of operating systems, hardware, and network configurations. Further exacerbating this issue is that researchers often don’t have the extra time or expertise necessary to engineer and document applications for this cross-platform deployment.


Language choice impacts the cross-platform development effort because languages differ in the degree to which they expose platform specific considerations to the developer. Lower level languages such as C typically expose more details of the underlying software and hardware environment than higher-level languages such as Python or Perl. Although this reduced level of abstraction sometimes facilitates higher performance, the developer is burdened with writing code that accommodates the exposed differences in the environments. Also, despite the fact that the core Python and Perl languages provide a high degree of platform independence, many applications written in those languages incorporate software libraries written in C to achieve reasonable levels of performance. These “native code” libraries utilize a variety of non-standard distribution and installation mechanisms, which differ widely between operating systems and even between individual libraries. Installing libraries often requires some level of expertise with system administration procedures for a particular operating system, and for shared systems, may even require assistance from a system administrator. Often a specific version of a library is required to support an application, but some libraries must be installed “system-wide,” meaning only one version may be installed on a system at a time. This can cause conflicts between applications installed on the same system that require different library versions. Also, libraries often have transitive dependencies on other libraries, and even a single library may require a number of other libraries to be installed, with similar installation issues. Overall, the management of dependencies can be a complex and difficult problem to solve, and is the reason that the term “dependency Hell” was coined many years ago to refer to this issue.


Languages based on virtual machine (VM) architectures such as Java or C# employ a different approach to these cross platform challenges. Instead of relying on low level code or specific libraries to perform core functions, VM based languages rely on a single piece of platform dependent software, known as the “virtual machine.” In a VM architecture, the high level code is compiled into a greatly simplified intermediate representation known as “bytecode.” Bytecode is essentially a processor independent machine code. To execute the program, the virtual machine reads the bytecode, translating the bytecode operations into native machine operations, either by interpretation, or in some cases, by translating the bytecode directly into native code at runtime (a feature known as “Just-In-Time” compilation). Although developing efficient virtual machine implementations is very difficult, this architecture allows the cross-platform engineering effort to be focused on a single piece of software (the VM), with underlying details sufficiently abstracted that platform specific details are not exposed to the application developer. As discussed in section 1.4.2, despite the platform abstraction, these virtual machines can execute code efficiently enough that most operations have a less than 2x performance penalty when compared to direct execution of native code.
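As a small illustration, compiling the trivial method below and disassembling the resulting class file with the JDK's javap -c tool shows the stack-oriented, processor-independent instructions that the VM later interprets or JIT-compiles.

class Adder {
    int add(int a, int b) {
        return a + b;
    }
}

// Output of "javap -c Adder" for the add method:
//
//   int add(int, int);
//     Code:
//        0: iload_1     // push the first int argument onto the operand stack
//        1: iload_2     // push the second int argument
//        2: iadd        // pop both values, push their sum
//        3: ireturn     // return the value on top of the stack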

1.4.4 Java - An Important Language for High Performance Bioinformatics

Java’s high performance, platform independence, and library availability have made it one of the most popular general-purpose languages in the computer industry. In addition, Java’s platform independent support for features critical to high performance distributed computing (e.g. network I/O, concurrency, multi-threading, file, and process management) has facilitated the creation of a number of powerful distributed computing frameworks. These frameworks, including Hadoop,43 GridGain,44 Hazelcast,45 and Cassandra46 are written in Java and integrate readily with Java applications.

As a result of this combination of strengths and features, hundreds of widely used open and closed source bioinformatics applications have been written in Java. Some well-known examples include Geneious,47 CLC Genomics Workbench,48 GATK49, and Picard Tools.50 Given the wide use and suitability of Java for high performance bioinformatics computing, much of the work in this dissertation focuses on assessing and improving Java performance.

1.5 Short Read Mapping

High throughput sequencing platforms generate vast quantities of short read (oligonucleotide) data. For many experiments, the first step in analyzing the data is locating the reads within a longer reference sequence, a process known as read mapping. Errors in sequencing or actual differences between sequences require the mapping process to accommodate inexact matches, locating reads within the reference sequence even if some bases in the reads do not match.

This requirement for finding inexact matches greatly complicates search algorithms; thus, the process of short read mapping is often the most time consuming step of sequence analysis pipelines. Given a short read and a reference sequence, local alignment29 or other “exact” algorithms could be used to locate the read within a reference sequence. In practice, however, local alignment is far too computationally expensive to align millions of reads in a reasonable timeframe. Short read mapping tools, therefore, rely on heuristics to achieve reasonable performance, but these heuristic algorithms are also less able to handle inexact matches. One particular problem with these algorithms is that they often fail to find matches if an otherwise well-matched region is flanked by multiple mismatches. As these types of error patterns occur frequently in short read data, finding the matching location for these sequences increases the amount of usable data in a sequencing run. There also may be important scientific information in these mismatched regions.
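To make the cost of inexact matching concrete, the brute-force sketch below (an illustration only, not the MOM algorithm described in Chapter 2) tries a read at every reference offset and keeps gapless placements with at most k mismatches. For an r-base reference and m-base reads, this scan is O(r * m) per read, which is why practical mappers must resort to indexing and heuristics.

import java.util.*;

public class NaiveMapper {
    public static List<Integer> map(String reference, String read, int k) {
        List<Integer> hits = new ArrayList<>();
        for (int pos = 0; pos + read.length() <= reference.length(); pos++) {
            int mismatches = 0;
            for (int i = 0; i < read.length() && mismatches <= k; i++) {
                if (reference.charAt(pos + i) != read.charAt(i)) mismatches++;
            }
            if (mismatches <= k) hits.add(pos);   // gapless placement within k errors
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(map("ACGTACGTTAGC", "ACGTTAGA", 1));   // prints [4]
    }
}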

1.6 Structure of this Dissertation

The overall focus of the research described in this dissertation is to assess the performance (via a process known as benchmarking) of a number of commonly used bioinformatic tools for processing sequence data, and the development and application of new software and algorithms to improve sequence data analysis performance. This dissertation begins with a description of a more efficient and sensitive algorithm for short read mapping, known as MOM.51 MOM was implemented in the Java programming language and improved upon existing algorithms by locating the longest match, regardless of the positions and number of errors in the flanking regions. Essentially, this was a heuristic solution to the “longest common substring with k mismatches”52 problem, and could also be considered a type of gapless alignment algorithm. The software was shown to find more matches in less time than existing mapping tools and the work was published as “MOM: maximum oligonucleotide mapping” in Oxford University Press Bioinformatics.51

During the process of implementing the short read mapping program, it was assumed that existing bioinformatics software libraries could be leveraged to provide the data structures for sequence storage, and the compression and I/O functions for the new application.

However, it was discovered that the performance of the existing libraries was inadequate, both in terms of memory and CPU resource usage. This was an important finding because bioinformatics libraries are widely used as the basis for a number of bioinformatic software applications. Despite the prevalence of their use, little published research exists regarding the performance characteristics of these libraries (e.g., alignment speed, parsing speed, and memory usage). Without this information, researchers may assume that the performance of their chosen libraries is optimal, when in fact there are significant inefficiencies. Also, without comparative benchmarking, even library developers may be unaware of performance shortcomings. Hence, a study of cross-language and cross-library benchmarks was undertaken to investigate the relative performance characteristics of the most widely used bioinformatics software libraries. That study illustrated that some of the basic underpinning functions of bioinformatics libraries, particularly in Java, can provide suboptimal performance. Ultimately, the benchmarks applied gave direction on which aspects of sequence data analysis libraries have room for improvement. Given the poor performance of existing libraries, and their lack of suitability for rapid application development in Java, the need for a more efficient bioinformatics library in Java was recognized. This precipitated development of a new Java bioinformatics library, now known as BioMojo, with the goal of producing the most memory and CPU-efficient library currently available. This dissertation shows the degree of performance improvement achieved in BioMojo, and describes how this was accomplished without sacrificing the usability of the library or ease of development.

A final undertaking for this dissertation illustrates the applicability of these and other recent software developments through incorporation of features in the new library that allow processing of long-read data created by a new third generation sequencing platform known as MinION. The incorporation of a unified sequence data model in BioMojo greatly simplifies the development of applications that work exclusively with MinION data, or that integrate MinION data with data from second generation platforms. Together these accomplishments illustrate the importance of methodology for assessing software efficiency, and illustrate that the performance of core bioinformatics software libraries can be improved to facilitate more efficient analysis of large data sets.


2 MOM: Maximum Oligonucleotide Mapping

This chapter contains a reprint of the paper “MOM: Maximum Oligonucleotide Mapping”, which was published in 2009 in Oxford University Press Bioinformatics.51




3 Comparative Analysis of Bioinformatic Library Performance

Eaves, Hugh L.* Integrative Life Sciences, Virginia Commonwealth University, Richmond, Virginia, 23284, USA [email protected]

Waldrop, Alex Department of Biology, Virginia Commonwealth University, Richmond, Virginia, 23284, USA [email protected]

Rivera, Maria C. Department of Biology, Virginia Commonwealth University, Richmond, Virginia, 23284, USA [email protected]

Brown, Bonnie L. Department of Biology, Virginia Commonwealth University, Richmond, Virginia, 23284, USA [email protected]

Running head: Bioinformatic library benchmarks

*Corresponding author: [email protected]


Abstract

General-purpose bioinformatic software libraries are widely used to facilitate the development of bioinformatics applications. These libraries provide many of the fundamental data structures and functions for these applications, and therefore the performance of these functions can have a significant impact on overall application performance. Ideally, library performance would be constrained only by the implementation language. However, library designers often prioritize flexibility, usability, or other non-functional requirements over raw performance. Despite this potential for performance variability, little research has been done to characterize the performance of general-purpose bioinformatic libraries, relative to the performance of the implementation languages. In this study, seven widely used bioinformatics software libraries (HTSeq, SeqAn, BioJava, BioPerl, BioPython, HTSJDK, and JEBL) were benchmarked across six different measures of performance. It was observed that the performance of the C++, Python, and Perl libraries largely aligned with language performance, but that two of the three Java libraries performed much more slowly and used more memory than expected. Memory usage of all of the libraries was observed to be greater than expected based on analysis of the data structures used. This study provides information that can be used to select the most efficient libraries for a given application and identifies potential areas for improvement of library performance.


3.1 Introduction

Since the advent of next generation sequencing over ten years ago, the per-base sequencing cost has declined rapidly, with over a thousand-fold decrease in the last seven years alone.2 During that same time period, the cost of computing capacity also has declined, but at a much slower rate.2 The net result is that the costs of the computer resources needed for data analysis and storage are becoming an increasingly large portion of overall project costs.12 As this trend is unlikely to reverse, there is a need to find ways to more efficiently utilize available hardware.

Research that seeks to characterize the performance and efficiency of existing bioinformatics software is an important contribution in this area, as it not only provides data that can guide the selection of the most efficient software, but can provide insight that can be leveraged to develop new, more efficient software. For example, efforts by Fourment and Gillings40, the “Computer Language Benchmarks Game”42 and other studies53, have established the relative efficiency of a number of popular programming languages. These studies relied on expert implementation of the same algorithm across a variety of languages, with the intent that the results reflect the best possible performance for each language. Although the exact performance varies based on the algorithm implemented, the results of these studies show that Java, C, and C++ perform similarly, with Java using only 1.5-2x more CPU time than C/C++ for most tasks. Python and Perl have also been found to perform similarly to each other, but require 10 to 50 times more CPU time than the same program implemented in C or Java.

Although characterizing language performance is valuable, these standalone benchmark programs do not necessarily reflect the programming practices used by most software developers. In the aforementioned studies, all of the high and low level functionality required by the benchmark was developed and tuned by the programmer. With the exception of the language default software libraries (i.e., standard libraries), no other libraries were used. This contrasts with normal development practices, where most developers seek out and utilize software libraries to reduce the implementation, testing, and maintenance effort required. This is also true in bioinformatics, where developers utilize software libraries providing commonly used functions such as file I/O, sequence alignment, database integration, or access to web based services.

Even when a programmer needs to develop functionality not provided directly by a library, the new functionality is often implemented using lower level data structures and functions provided by a library. As a result, software libraries influence many aspects of program design, and potentially have a large impact on the performance of an entire program.

As software libraries are designed and written by experts in a discipline, it is reasonable to assume they will perform optimally, given the performance of the language in which they are implemented. However, as this paper will show, this assumption is not always valid, and some bioinformatics libraries perform much more poorly than would be expected, given the language in which they are implemented.

Despite the ubiquity of bioinformatics software libraries and their potential impact on application performance, there is little published work that has attempted to characterize library performance. Of the seven libraries benchmarked in this paper, five have published descriptions of the library structure and functionality, but only one of these publications (SeqAn17) includes any performance measurements. The publication of FastaValidator,54 a utility to validate the contents of FASTA files, includes measurements of the performance of their utility against the same function implemented using three libraries (BioPerl,20 BioJava,16 and BioPython19). Ryu55 also evaluates the performance of BioPerl, BioJava, and BioPython across a number of functions. However, even though the paper was published in 2009, the versions of the libraries benchmarked were released in 2003. Also, although Ryu provides memory measurements, all of the tested tasks are memory independent, as they read or transform input data one record at a time. Finally, none of the prior work incorporates the statistically rigorous benchmarking techniques recommended by Coffin and Saltzman,26 and Georges, et al.22

Because only a handful of functions have been evaluated across only a few libraries, because statistical techniques that provide a measure of confidence in the results were not used, and because of substantial architectural changes in most of the libraries since previous results were published, there is an obvious need for a more comprehensive, rigorous, and up-to-date study.

This paper uses statistical techniques to examine the relative performance of seven widely used bioinformatic libraries across six different tasks, and whether the libraries perform as expected, given the languages in which they are implemented. Like Fourment and Gillings, a variety of I/O and/or CPU focused tasks are benchmarked. However, instead of implementing tasks from scratch, we implement them using the functions and data structures provided by a number of widely used bioinformatics software libraries. To investigate whether all of the functions provided by a library were equally efficient, we focus on benchmarking individual functions instead of entire applications. Specifically, we implemented six benchmarks that determine 1) the amount of memory required to store sequence data in memory, and the CPU efficiency of 2) the FASTA56,57 and FASTQ58 parsers, 3) a simple short read trimming algorithm, 4) nucleotide to amino acid sequence translation, 5) kmer generation and counting, and 6) sequence alignment.

3.2 Methods

3.2.1 Library Selection

Given that individual programs must be carefully designed, developed, and tested for each benchmarked task and library combination, it is obviously not feasible to benchmark every available library. Instead, a subset of the available libraries was chosen based on a number of criteria. First, libraries were required to support functions applicable to general-purpose sequence analysis, either as the primary purpose of the library, or as a core function in a larger suite of bioinformatics tools. This excludes, for example, bioinformatics libraries where the primary purpose is structural modeling, even if they have some secondary support for sequence analysis. Second, the libraries were required to have a clearly defined open source license and a currently maintained public source code repository. Based on those criteria, the following seven libraries were selected for this analysis: BioJava16 v4.1.0, BioPython19 v1.64, BioPerl20 v1.6.924, SeqAn17 v2.0.1, HTSeq59 v0.6.1, HTSJDK18 v1.130, and JEBL60 rev 1202.

3.2.2 Statistical Methods

Prior work has emphasized the importance of using the appropriate statistical techniques to analyze benchmark results to ensure the rigor of the benchmarking process.22 This work has also characterized Java as "far from being trivial to benchmark" due to the complexity of interactions between the input data, virtual machine mechanics, and other test parameters.

Factors such as Just-in-time compilation, on-the-fly optimization heuristics, differences in thread scheduling, etc. cause some degree of non-determinism in almost all benchmarks, leading to vastly incorrect results in some cases.22

Although much benchmarking work seems to have largely ignored this advice, in this research, considerable time was spent both designing a benchmark framework (see 3.2.3) and selecting and using appropriate statistical techniques to analyze benchmark results. For the memory usage benchmark, a factorial design61 and linear regression were used both to elicit the relationship between total memory usage and the input file characteristics, and to provide a measure of confidence in the results. Further details of the linear model and other details of this benchmark are provided in section 3.2.8.1 below. For the CPU efficiency benchmarks, each benchmark was executed multiple times, with the coefficient of variation used to characterize the degree of between-run variability. The R language and environment for statistical computing38 was used to perform the analysis on the benchmark results and generate some of the plots. Regression analysis assumptions were checked by plotting residuals and the other diagnostic plots provided by the "plot.lm" function in R. See section 3.6 for examples of these plots.

3.2.3 Benchmarking Framework

As the statistical methods used required generation of a large number of test data sets, and executing thousands of individual benchmark runs, a benchmarking framework was created to automate the process. For the memory benchmark, a single execution of the framework created FASTA or FASTQ files containing simulated data for each combination of factor levels (e.g., record length, number of records) and executed each of the benchmark programs on each of the files. For the CPU benchmarks, the framework executed each benchmark program ten times on pre-existing data files. In all cases, the framework randomized execution order among runs to reduce bias introduced by operating system caching or other execution order dependent effects.

The framework measured resource usage by wrapping each benchmark program execution with the GNU/Linux /usr/bin/time utility, along with obtaining real-time resource usage samples every 200ms via the /proc/stat Linux filesystem. The resident set size, number of major and minor page faults, elapsed (wall clock) time, user and system CPU time, and percent CPU utilization were recorded for each run and 200ms sample. All data were stored in a relational database for further analysis.
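The following sketch illustrates this style of measurement: a benchmark command is wrapped with GNU time, and the maximum resident set size is parsed from its verbose output. The class and method names are illustrative, not the framework's actual code; the output format assumed is that of GNU time's "-v" flag on Linux.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    /** Illustrative sketch: wrap a benchmark run with GNU time and parse max RSS. */
    public class TimedRun {
        public static long maxRssKb(List<String> benchmarkCmd)
                throws IOException, InterruptedException {
            List<String> cmd = new ArrayList<>(Arrays.asList("/usr/bin/time", "-v"));
            cmd.addAll(benchmarkCmd);
            Process p = new ProcessBuilder(cmd).redirectErrorStream(true).start();
            long maxRssKb = -1;
            try (BufferedReader r =
                     new BufferedReader(new InputStreamReader(p.getInputStream()))) {
                String line;
                while ((line = r.readLine()) != null) {
                    // GNU time -v prints "Maximum resident set size (kbytes): <N>"
                    if (line.contains("Maximum resident set size")) {
                        maxRssKb = Long.parseLong(line.replaceAll("\\D+", ""));
                    }
                }
            }
            p.waitFor();
            return maxRssKb;
        }
    }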

3.2.4 Performance Measures

For the memory benchmark, the maximum resident set size was used as the main measure of program memory usage. The operating system reported this value in single page units (4096 bytes), the resolution of which was more than adequate. The number of major page faults was used to ensure that the entire data set stayed resident during execution, and that no portions were swapped to disk.

For the CPU benchmarks, user and system CPU time were combined into a total CPU time measurement, and compared with the elapsed time to determine the percent CPU utilization. Relative measures of performance were based on the total CPU time, not elapsed time.

3.2.5 Execution Environment

Two computer systems were used for benchmark execution. The memory usage benchmarks were executed on a computer system with dual Intel Xeon E5-2660v2 CPUs and 256GB of DDR1600 RAM. The CPU benchmarks were executed on a system with an Intel i5-3550 processor and 32GB of DDR1600 RAM. All benchmark data files were read from and written to SSDs (solid state disks). Both systems were running 64-bit Ubuntu Linux 14.04. With the exception of the Java benchmarks, all language benchmarks were compiled and/or executed using the relevant binaries included with Ubuntu 14.04: the SeqAn library and benchmarks were compiled using GCC version 4.8.2 with the "-O3" optimization parameter; the BioPython and HTSeq benchmarks were executed using CPython 2.7.6; and the BioPerl benchmarks were executed using Perl 5.18.2. The Java benchmarks were compiled and executed using Oracle's 64 bit JVM, version 1.8.0, update 66.

3.2.6 Java Memory Management Settings

The Java Virtual Machine (JVM) used to execute these benchmarks (Oracle Java 1.8.0_66) included four garbage collectors: three that have been available since Java 1.3 (the serial, parallel, and concurrent mark and sweep collectors), and a newer collector, the "G1" collector. Each of these collectors has a number of options that can be set or changed to alter the default behavior and heuristics used.62 In Java 8, the parallel GC is the default collector, although in Java 9, the G1 collector has been proposed as the default.63 As the serial garbage collector is largely no longer used, and the concurrent GC was largely designed for interactive applications requiring low pause times, all of the benchmarks were run using the parallel garbage collector. A subset of the benchmarks was also run with the G1 collector to investigate the impact of this new collector on the results.

For the Java CPU benchmarks, the minimum and maximum heap size were set to the same value using the "-Xmx" and "-Xms" options. With the exception of the alignment benchmark, the Java heap size was fixed at 1GB using these options. For the alignment benchmark, an 8GB heap size was used for JEBL, and a 12GB heap size was used for BioJava (because BioJava failed to run with an 8GB heap size). The Java memory benchmarks used a variable heap size as described in section 3.2.8.1 below.

3.2.7 Data File Preparation

To simplify developing programs to benchmark the 2 bit storage method, all of the input data files were pre-processed to remove any IUPAC ambiguity codes64,65 present in the sequences (e.g., 'N' for any base, 'D' for not-C, etc.). Any sequence positions with ambiguity codes were changed to 'A' (Adenine), so the resulting sequences contained only the four canonical nucleotide symbols (A, G, T, C). In addition, all FASTA files were reformatted to 60 symbols per line.

3.2.8 Benchmark Programs

Where possible, individual benchmark programs incorporated recommendations and example code from documentation included with the libraries. The intent was to compare library performance using the libraries in the ways they would typically be used. If a library implemented sub-byte encoding / compression of sequences, benchmark programs were written using both the compressed storage method and the normal uncompressed method. As the goal was to benchmark individual library features, not full applications, memory and CPU efficiency were examined separately. For each library, one memory benchmark program and five CPU benchmark programs were written, except where libraries did not support the benchmarked features. These programs are described below.

3.2.8.1 Memory Usage Benchmark

Library memory usage is important, as many bioinformatics algorithms (e.g., assembly and mapping) are only practical if most (or all) of a dataset is accessible in main memory at the same time. Although some aspects of memory usage vary among algorithms due to the use of algorithm-specific structures such as indices, in many cases, the largest portion of the memory is taken up by the sequence data structures. There is considerable variation in the approach to in-memory sequence storage among libraries, ranging from an “object per element” approach, to packed bit arrays capable of storing multiple elements in a single byte. Some libraries also utilize different in-memory data structures for long sequences (i.e. reference sequences) versus short read sequences. With this degree of variation in design, benchmarking is an effective way to measure the efficiency of the implementations, while also taking into account any allocation overhead that may not be readily apparent or quantifiable from the code.

The purpose of the memory usage benchmark is to determine the amount of memory required to hold sequence data in memory, using the data structures provided by each library.

The benchmark utilizes a simple approach: all of the sequences are loaded from a FASTA or FASTQ file into a collection of in-memory sequence records, using the most appropriate sequence data structures and/or collection structures provided by each library. As all records are loaded into memory at the same time, the maximum resident set size of the benchmark program is used as the measure of the total amount of memory required by each library to store the set of sequence data. To model memory usage in terms of record counts and sequence lengths, the benchmark programs were executed multiple times using FASTA and FASTQ files with varying characteristics. The test files were created using all combinations of record counts from 1000 to 30,000 records in 2000 record increments, and with each sequence containing from 1000 to 30,000 elements in 2000 element increments. The sequence data were randomly generated nucleotide sequences, and the records contained fixed length (7 character) headers with numeric sequence IDs.

With the exception of variable length compression schemes, sequence data are typically stored using a fixed amount of memory per sequence element, so it was hypothesized that total memory usage would increase linearly based on the number of sequence elements.

However, due to the overhead involved in managing dynamically allocated memory, heap fragmentation, memory pool block sizes, etc., the amount of memory used by each sequence object will almost always be greater than the amount of memory required to store just the sequence data.66 In this paper, this additional amount of memory used per sequence is referred to as the "sequence overhead." Also, programs always use some fixed amount of memory for the program code, and any static configuration or other data needed by that code, independent of the size of the dataset being processed. This fixed, program-specific memory usage will be referred to as the "program overhead." To determine the amount of memory used by each of these factors (per sequence element, sequence overhead, and program overhead), multiple linear regression was used to fit a memory model incorporating terms for each factor. Equation 3.1 shows the formula for this regression model. Note that in the case of sequence data with quality scores, this model treats the "per-element amount" as the amount of memory required to store one element from the sequence and the corresponding quality score for that element.

Equation 3.1 - Regression Model for Program Overhead, Per Sequence, and Per Element Memory Usage. The variables are: Y - total memory usage (resident set size); X1 - number of sequences in the file; X2 - length of each sequence in the file; α - program overhead (constant estimated by linear regression); β1 - sequence overhead (coefficient estimated by linear regression); β2 - per sequence element memory usage (coefficient estimated by linear regression).

$$Y = \alpha + \beta_1 X_1 + \beta_2 X_1 X_2$$

Java requires special consideration, as the resident set size of a program does not necessarily reflect the amount of memory actually required to run a program. As the garbage collector is triggered by complex heuristics and runs asynchronously in relation to the main program threads, garbage collector activity does not always keep up with the rate of object allocation. This potentially results in heap growth as unused (temporary) objects accumulate on the heap. Prior work has highlighted how this, and other characteristics of Java, make it difficult to benchmark.22 With this in mind, great care was taken to examine the impact of this issue on the memory benchmark results, and to develop a method of memory measurement in Java that would enable creation of a predictive model of memory usage. A number of methods for quantifying Java memory use were developed and subsequently evaluated by two criteria: variability in the measurements, and whether the resulting model accurately predicted the memory required to run the program.

Baseline memory usage for each of the Java benchmarks was established by running the programs with all settings except the minimum and maximum heap size set to the defaults. For these runs, the minimum and maximum heap sizes were set to half of the physical memory (128GB) available on the benchmark machine using the "-Xms" and "-Xmx" command line options. The results of these runs showed considerable variability and non-linearity, and based on knowledge of the library implementations, were clearly overestimating the memory requirements of the programs (see 3.3.1 for a discussion of these results). To address this issue, GC settings were altered to attempt to increase garbage collector activity to more accurately measure the memory requirement. For the parallel collector, the settings "-XX:GCTimeRatio=1 -XX:MaxGCPauseMillis=1" were found to greatly reduce memory usage by enabling a larger portion of the process CPU time (up to 50%) to be used for garbage collection, and encouraging more frequent collection to avoid long pause times. For the G1 collector, only the "-XX:MaxGCPauseMillis=1" option was used, as the GCTimeRatio setting is not applicable to this collector. Runs with these options showed considerable reduction in overall memory usage and measurement variability.

To investigate whether memory measurement could be further improved, the Java benchmark programs were modified to make periodic calls to the System.gc() API method during benchmark execution, to repeatedly force full garbage collection. The addition of this code, both with and without the above GC tuning, further reduced the memory usage and measurement variability. However, re-running the benchmark programs with default garbage collector settings and the heap size predicted by the model caused the programs to fail due to lack of memory, showing that this method underestimated the amount of memory required for the program in more typical usage scenarios. Also, as the code was modified to include System.gc() calls, this method was less generalizable to Java applications where code modification was not possible.

As a result, neither of the previous methods was entirely satisfactory, due to the use of atypical settings for the garbage collector, potential underestimation of memory, and lack of generality. Therefore, a new approach was developed to determine the memory requirement of the Java benchmark programs by finding the smallest heap size setting that would allow the program to run to completion. The benchmark framework was modified to execute each program with near default garbage collector settings, but with differing maximum heap size settings. A binary search, based on whether the program ran to completion or failed due to lack of memory, was used to find the minimum heap size that would support program execution.

The only modifications from the default GC settings were two options to fix parameters that normally vary automatically based on the heap size ("-Xmx") option. For the G1 collector, the "-XX:G1HeapRegionSize=8m" option was used to fix the region size to 8MB. For both collectors, the "-XX:+UseCompressedOops" option was used to prevent the transition from 32 bit (compressed) to 64 bit (uncompressed) object references that occurs as the heap grows past 32GB. Without this setting, there would be a discontinuity in memory usage above and below 32GB, which would cause problems with the binary search method.
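A minimal sketch of this search appears below. It assumes the benchmark is launched as a child JVM and that a zero exit code means the run completed; the class and helper names are illustrative, not the framework's actual code.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    /** Sketch of the minimum-heap-size binary search described above. */
    public class MinHeapSearch {
        // Launches "java -Xmx<mb>m -XX:+UseCompressedOops <benchmark args...>"
        // and treats a zero exit code as "ran to completion" (an assumption).
        static boolean runsToCompletion(long heapMb, List<String> benchmarkArgs)
                throws Exception {
            List<String> cmd = new ArrayList<>(
                Arrays.asList("java", "-Xmx" + heapMb + "m", "-XX:+UseCompressedOops"));
            cmd.addAll(benchmarkArgs);
            return new ProcessBuilder(cmd).inheritIO().start().waitFor() == 0;
        }

        static long minHeapMb(long loMb, long hiMb, List<String> benchmarkArgs)
                throws Exception {
            while (loMb < hiMb) {
                long mid = (loMb + hiMb) / 2;
                if (runsToCompletion(mid, benchmarkArgs)) {
                    hiMb = mid;      // ran to completion: try a smaller heap
                } else {
                    loMb = mid + 1;  // failed (e.g., OutOfMemoryError): need more
                }
            }
            return loMb; // smallest heap size that allowed completion
        }
    }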


Arguably, this method provides the most useful measure of the real world memory requirements of the Java libraries, as it relies on fairly typical garbage collector settings, and finds the "threshold of failure" for the heap size. Unless otherwise noted, the results discussed below are based on this method.

3.2.8.2 File Read Benchmark

The speed with which sequence data can be read from disk and parsed into in-memory data structures can be a limiting factor in performance when working with large NGS data sets. This benchmark measures the CPU usage of each of the libraries when parsing the commonly used FASTA and FASTQ file formats into in-memory sequence structures. FASTA and FASTQ are typical of many bioinformatics file formats, in that they are unindexed and text based, so this benchmark also provides insights that may be used to optimize parsing of similar text based formats. Like the memory benchmark, this benchmark reads records from a FASTA or FASTQ file into memory, but instead of accumulating the sequence records in a collection, the program checksums the record lengths, and then discards the memory used by the record.
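For illustration, the overall shape of the FASTQ variant of this benchmark, written here against HTSJDK's FASTQ reader, is roughly as follows. This is a sketch rather than the exact benchmark source; method names are as found in recent HTSJDK releases.

    import htsjdk.samtools.fastq.FastqReader;
    import htsjdk.samtools.fastq.FastqRecord;
    import java.io.File;

    /** Sketch: parse each record, fold its length into a checksum, discard it. */
    public class FastqReadBench {
        public static void main(String[] args) {
            long checksum = 0;
            // FastqReader implements Closeable and Iterator<FastqRecord>.
            try (FastqReader reader = new FastqReader(new File(args[0]))) {
                while (reader.hasNext()) {
                    FastqRecord rec = reader.next();
                    checksum += rec.getReadString().length(); // keep only the length
                }
            }
            System.out.println("checksum=" + checksum);
        }
    }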

The FASTA read benchmark used a large draft assembly of White Spruce (Picea glauca) genotype WS77111, accession number GCA_000966675.1. This file contained 3,353,683 sequences, with a median length of 1793 bases. The total number of bases in the file was 26,936,232,745, and the total file size was 27,665,172,102 bytes including headers. The FASTQ read benchmark used a large Illumina MiSeq paired read dataset from the same White Spruce sequencing project, accession number SRR1259622. The first read of each pair was used, with the resulting file containing 17,829,466 300 base-pair sequences. The total number of bases in the file was 5,348,839,800, and the total file size was 11,477,783,364 bytes including headers and quality scores.

3.2.8.3 Read Trimming Benchmark

In many analysis pipelines, the first step is to filter and trim raw reads based on the quality scores provided by the sequencing platform. This benchmark measures the performance of a simple trimming function, implemented using the data structures and functions provided by each of the libraries. The overall performance of this function depends on a combination of I/O performance, the performance of access to individual data elements (when determining the trimming location), and the performance of creating newly trimmed sequences from existing sequences. Although numerous trimming methods exist, this benchmark used the straightforward “tail trimming” method, which removes bases at the end of a read having quality scores lower than a specified cutoff value. As this “tail trimming” is also implemented in the widely used Trimmomatic program,67 using this algorithm provides the opportunity to compare the performance of general purpose libraries to a tool optimized for this specific task.
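The core of the benchmarked function reduces to a loop like the following. This is a sketch over a plain quality array with hypothetical names; each library's benchmark program used that library's own sequence and quality types.

    /** Sketch of "tail trimming": drop trailing bases whose quality is below a cutoff. */
    public final class TailTrim {
        /** Returns the number of leading bases to keep; the rest are trimmed. */
        public static int keepLength(byte[] phredQuals, int cutoff) {
            int end = phredQuals.length;
            // Walk back from the 3' end while the quality is below the cutoff.
            while (end > 0 && phredQuals[end - 1] < cutoff) {
                end--;
            }
            return end;
        }
    }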

The FASTQ benchmarks were executed on the same FASTQ file as the file read benchmark, containing the first read of the read pairs from NCBI accession number SRR1259622.

3.2.8.4 Translation Benchmark

Translating nucleotide to amino acid sequences is an important operation when building or searching sequence databases, and is also fundamental to many other analyses such as gene prediction. This benchmark measures the efficiency of the nucleotide sequence translation engines of each of the libraries. The benchmark consisted of reading nucleotide sequences from a FASTA file, translating them through the first reading frame, and writing the translated sequences to a new FASTA file. The benchmark programs were run on a file containing all sequences from the NCBI fungi nucleotide refseq database (version 75, March 2016 release). The file contained 1,744,297 sequences with a median sequence length of 1232 bases. The total number of bases in the file was 7,810,061,863, and the total file size was 8,127,251,491 bytes including headers. The HTSeq and HTSJDK libraries were not included in this benchmark, as they do not offer translation functions.
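As an illustration of what such a translation engine does internally, the sketch below translates frame 1 by mapping each codon to a 64-entry table indexed with 2-bit base codes. The names are illustrative (not any library's API), the table string encodes the standard genetic code with A=0, C=1, G=2, T=3, and the input is assumed to contain only uppercase A/C/G/T, as in the pre-processed benchmark data.

    /** Sketch of frame-1 translation using direct array indexing. */
    public class Translate {
        private static final int[] BASE_INDEX = new int[128];
        static {
            java.util.Arrays.fill(BASE_INDEX, -1);
            BASE_INDEX['A'] = 0; BASE_INDEX['C'] = 1;
            BASE_INDEX['G'] = 2; BASE_INDEX['T'] = 3;
        }
        // 64 amino acids / stops ('*'), indexed by (b0 << 4) | (b1 << 2) | b2.
        private static final String AA =
            "KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV*Y*YSSSS*CWCLFLF";

        static String frame1(String dna) {
            StringBuilder out = new StringBuilder(dna.length() / 3);
            for (int i = 0; i + 3 <= dna.length(); i += 3) {
                int idx = (BASE_INDEX[dna.charAt(i)] << 4)
                        | (BASE_INDEX[dna.charAt(i + 1)] << 2)
                        |  BASE_INDEX[dna.charAt(i + 2)];
                out.append(AA.charAt(idx)); // assumes A/C/G/T input only
            }
            return out.toString();
        }
    }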

3.2.8.5 Kmer Counting Benchmark

Generating overlapping kmers from sequences is a widely used operation in many applications, such as mapping and alignment-free sequence comparison. This benchmark read nucleotide sequences from an input file, generating overlapping 8 base kmers with the start of each kmer offset by one base from the start of the previous kmer. The number of instances of each kmer was counted across all input sequences. Only the libraries that included kmer generation functions were included in this benchmark: BioJava and SeqAn. The same FASTA file from the file read benchmark, accession number GCA_000966675.1, was used as the input data for this benchmark.
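Because an 8-mer over {A, C, G, T} fits in 16 bits, one common counting approach, shown here as a sketch of a reasonable implementation rather than what BioJava or SeqAn do internally, is to slide a 2-bit-packed window along the sequence and count into a flat array:

    /** Sketch of 8-mer counting with 2-bit packed kmers and a flat count array. */
    public class KmerCount {
        static long[] countOctamers(byte[] seq) {
            long[] counts = new long[1 << 16]; // 4^8 possible 8-mers
            int kmer = 0, valid = 0;
            for (byte b : seq) {
                int code;
                switch (b) {
                    case 'A': code = 0; break;
                    case 'C': code = 1; break;
                    case 'G': code = 2; break;
                    case 'T': code = 3; break;
                    default:  kmer = 0; valid = 0; continue; // reset on ambiguity
                }
                kmer = ((kmer << 2) | code) & 0xFFFF; // keep the last 8 bases
                if (++valid >= 8) counts[kmer]++;     // count once window is full
            }
            return counts;
        }
    }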

3.2.8.6 Sequence Alignment Benchmark

Sequence alignment is fundamental to many types of analyses. In practice, a large number of algorithms are used, but regardless of the algorithm, all alignment algorithms rely on repeated comparison of the individual elements of a sequence, and calculating composite scores based on similarities or differences. General-purpose bioinformatics libraries typically provide a number of alignment algorithms as part of their core functionality. This benchmark measures the efficiency of each library's implementation of the Smith-Waterman29 alignment algorithm. Not only is this benchmark useful for measuring the performance of this algorithm, but the CPU intensive nature of alignment, requiring repeated access to and comparison of sequence elements, is illustrative of the performance of other CPU intensive search algorithms. For SeqAn and JEBL, the linear gap penalty variant of Smith-Waterman was used. For BioJava and BioPython, the affine gap penalty variant of Smith-Waterman was used. BioPython only supports this variant, and although a linear gap variant was provided by BioJava, a bug in the current version caused this alignment function to crash.

The benchmark consisted of aligning a subset of reads from a MinION sequencing run (accession number SRR3473966) against the last 3,555 bases of Enterobacteria phage lambda (accession number NC_001416.1), to detect reads that are commonly used for quality assurance with this platform.68 The reads file contained 9,737 sequences with a median sequence length of 531 bases. The total number of bases in the file was 21,423,748.
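For reference, the linear gap penalty variant reduces to the following dynamic programming recurrence. This is a generic score-only sketch with illustrative scoring parameters, not any particular library's implementation.

    import java.util.Arrays;

    /** Score-only Smith-Waterman with linear gap penalties (illustrative scores). */
    public class SmithWaterman {
        static int align(byte[] a, byte[] b) {
            final int MATCH = 2, MISMATCH = -1, GAP = -2;
            int[] prev = new int[b.length + 1]; // row i-1 of the DP matrix
            int[] curr = new int[b.length + 1]; // row i of the DP matrix
            int best = 0;
            for (int i = 1; i <= a.length; i++) {
                for (int j = 1; j <= b.length; j++) {
                    int sub = prev[j - 1] + (a[i - 1] == b[j - 1] ? MATCH : MISMATCH);
                    int gapA = prev[j] + GAP;     // gap in sequence a
                    int gapB = curr[j - 1] + GAP; // gap in sequence b
                    int score = Math.max(0, Math.max(sub, Math.max(gapA, gapB)));
                    curr[j] = score;
                    best = Math.max(best, score);
                }
                int[] tmp = prev; prev = curr; curr = tmp;
                Arrays.fill(curr, 0); // reuse the old row for the next iteration
            }
            return best; // best local alignment score anywhere in the matrix
        }
    }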

3.3 Results and Discussion

3.3.1 Memory Usage Benchmark

3.3.1.1 Impact of Asynchronous Garbage Collection

As expected, measuring memory using only default garbage collection resulted in a large increase in total memory usage measurements, and an increase in the variability of memory measurement (see Figure 3.1). We found that the other methods, incorporating tuning settings for the JVM garbage collector along with explicit garbage collection, reduced the variability and overestimation of usage. However, the best results were ultimately produced using the binary search method described in section 3.2.8.1 above, which used multiple executions to find the smallest heap size that would allow the program to run to completion. For example, for the BioJava FASTA memory benchmark, loading the largest test dataset using default garbage collection resulted in a memory usage measurement of 43.44GB, compared to 8.16GB with the minimum heap size method. The R² value with default garbage collection was 0.911, compared to 1.000 with minimum heap size (see Figure 3.1). As the goal of this benchmark was to determine memory usage for the sequence data, not artifacts of the garbage collection system, all further memory measurements presented in this paper are based on the use of this binary search method. To the best of our knowledge, this approach has not been used in previous attempts to benchmark Java memory usage.

Figure 3.1 - Comparison of BioJava memory usage with different measurement methods. A) BioJava with the heap shrink method. This method produced a linear relationship between dataset size and memory usage, with the total memory usage closely matching the expected usage based on analysis of the code. R² for this dataset was 1.000. B) BioJava with default garbage collection. The relationship between dataset size and memory usage is less linear as compared to the heap shrink method. R² for this dataset was 0.911. Also, the total amount of memory used is approximately 5x greater than with heap shrink.


3.3.1.2 Memory Usage Results

With the exception of SeqAn, all coefficients of the regression model (Equation 3.1) exhibited highly significant fit across all libraries (P < 0.001), for both sequence data alone and sequence data with quality scores. R² values varied from a high of 1.000 for the HTSeq, BioPerl, BioPython, and Java libraries to a low of 0.938 for SeqAn.

The difference in R² values results from the differing memory allocation and heap management schemes used by each language or library. Perl and Python manage the heap using object reference counting, which results in predictable and immediate deallocation of any unreferenced temporary objects. The libraries based on these languages (BioPerl, BioPython, HTSeq) all had the highest R² values, and correspondingly showed the least apparent "noise" in the benchmark results. As mentioned in section 3.3.1.1, the default garbage collection process used by Java also resulted in much lower R² values, but the results reported here used the heap shrink method (see section 3.2.8.1), resulting in a fit similar to Python and Perl. The memory management scheme used by SeqAn results in significant discontinuities in memory usage versus dataset size, which explains the relatively low R² values for SeqAn. These discontinuities are apparent in the memory usage plots for SeqAn, when compared to other libraries such as BioPython (Figure 3.2).


Figure 3.2 - Comparison of SeqAn and BioPython memory usage for varying sizes of FASTA data. BioPython (A) exhibits an almost exact fit to the linear model (R² = 1.000) based on the input size, whereas SeqAn (B) and SeqAn with 2 bit encoding (C) exhibit discontinuities in the relationship between input data size and memory usage. Although the overall trend for SeqAn is still linear, the fit is less precise, with an R² value of 0.989 for SeqAn, and 0.938 for SeqAn with 2 bit encoding.

Across all libraries, test datasets, and benchmarks, the amount of program overhead was less than 60MB, an insignificant amount when working with multi-gigabyte datasets on modern computer systems. The program overhead for each library depended more on the implementation language than on the library itself. The Java libraries all required the most overhead, roughly 40-50MB per program instance. This is not unexpected, due to the complexities of Java's virtual machine implementation. The Perl and Python libraries tended to use between 10-20MB per program instance. The C++ SeqAn library appeared to use the least baseline memory, reiterating one of the known advantages of C++, its lean use of memory. However, this parameter estimate was not significant in the regression model.

For large datasets, the per-sequence overhead is much more important than program overhead, as differences in this amount greatly impact memory usage for datasets with many sequences. Across all libraries, we found that per sequence overhead varied from a low of 341 bytes for HTSJDK with FASTA data, to a high of 5,016 bytes for BioPython with FASTQ data, a difference of almost 15x. For all libraries except BioJava and HTSJDK, the overhead for sequences with quality score data was higher than for sequence data alone. This is because, in most cases, the quality score storage roughly doubles the number of objects required to represent the data. BioJava, however, uses a much simpler structure for sequence data with quality scores than for sequence data alone. This structure is implemented as a fairly literal storage of FASTQ data in Java Strings, with no support for metadata, whereas the sequence structure includes large numbers of optional fields used to hold metadata. As a result, the sequence overhead of BioJava's FASTQ structure is comparable to the FASTQ structures in other libraries, but the FASTA structure uses the most memory of all the Java libraries. In the case of HTSJDK, both FASTA and FASTQ data are stored in similarly simple structures.

All of these per sequence overhead estimates are larger than would be expected based on code analysis alone. For example, based on counting object references in the BioJava DNASequence class, we would estimate a sequence overhead of 200 bytes. However, the measured overhead was 1,161 bytes per sequence, almost 6x higher. Similarly, for BioPython, an empty SeqRecord object uses 64 bytes. However, our measurements show BioPython using 2,757 bytes per sequence, more than 40x higher than expected. This difference results from the inherent inefficiency of the memory allocation engines used by these platforms.

The amount of storage used by individual sequence elements also varied widely between libraries. The regression model showed that for FASTA sequences, the per sequence element usage differed by a factor of more than 16 between the smallest (0.50 bytes per element for SeqAn using 2 bit encoding) and largest (8.9 bytes per element for BioJava) values.

For FASTA data, the other libraries (and SeqAn without encoding) used one byte per sequence element as a result of storing the IUPAC code for each element in a single byte value. The large per element value for BioJava results from the design of BioJava's sequence classes, which represent sequences as lists of objects. Although the sequence elements are (thankfully) singleton objects, the references to those singletons require eight bytes per object. For smaller heap sizes (less than 32GB) we would expect this value to be lower, as Java only requires four bytes per object reference in smaller heaps. However, this amount would still be much higher than any of the other libraries.

For FASTQ sequences, per-element memory usage was more variable. SeqAn with 2 bit encoding used the least, with approximately 1.5 bytes per element (0.5 bytes for the nucleotide data, and 1 byte for the quality score). SeqAn without 2 bit encoding used approximately 2 bytes per element (1 byte for the nucleotide, and 1 byte for the quality score). Both BioJava and HTSJDK used four bytes per element, as both libraries store both the nucleotides and quality scores in Java String objects. Java Strings encode data in UTF-16 format, requiring two bytes per character, or four bytes in total for a nucleotide and quality score combination. Both BioPython and HTSeq used approximately 9-10 bytes per element, as a result of converting the byte based quality scores into full integer values. BioPerl used the most memory, at roughly 65 bytes per element. We have not investigated the reason for this high memory usage, but suspect that BioPerl is wrapping individual quality scores with a Perl object.

Again, for per element memory usage, we found differences between estimated memory usage based on code analysis, and measured memory usage. For FASTA data, SeqAn with 2 bit encoding should have required roughly 0.25 bytes per sequence element, but the measured usage was 0.50 bytes per element. This means that the memory requirement of SeqAn's encoded storage is more than double that expected.


Table 3.1 - Memory usage of sequence data structures. This table shows the coefficients of the regression model (see Equation 3.1), and the maximum memory usage for a dataset containing 30,000 sequences with 30,000 nucleotides in each sequence. All coefficients are significant (P < 0.001), with the exception of the SeqAn coefficients indicated by their p values. SE is standard error of the estimate. α (program overhead), β1 (sequence overhead), and β2 (element size) refer to the constant and coefficients in the regression model.

Library        Max Mem (MB)   α (MB)   SE      p       β1 (bytes)   SE       p       β2 (bytes)   SE      p       R²
SeqAn (2 bit)       483.0      -0.08    3.74    0.983      220.0     273.67   0.422     0.486      0.011   0.000   0.938
SeqAn               925.4       1.19    3.42    0.729     2507.0     250.63   0.000     1.037      0.010   0.000   0.989
HTSeq               927.9      22.27    0.38    0.000      361.8      27.84   0.000     0.994      0.001   0.000   1.000
BioPerl             953.7      20.22    0.01    0.000     1102.7       0.38   0.000     1.000      0.000   0.000   1.000
BioPython           993.5      13.11    0.07    0.000     2757.3       5.16   0.000     0.997      0.000   0.000   1.000
HTSJDK             1069.2      43.02    0.38    0.000      428.2      27.50   0.000     1.122      0.001   0.000   1.000
JEBL               1165.5      47.82    1.49    0.000      624.2     109.39   0.000     1.222      0.005   0.000   0.998
BioJava            8161.5      52.95    0.53    0.000     1161.7      38.52   0.000     8.973      0.002   0.000   1.000
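As a consistency check of the fitted models (an illustration computed from Table 3.1 above, not an additional measurement), the BioPython row reproduces its tabulated maximum. Treating 1 MB as $10^6$ bytes:

$$13.11\,\mathrm{MB} + 30{,}000 \times 2757.3\,\mathrm{B} + 30{,}000 \times 30{,}000 \times 0.997\,\mathrm{B} \approx 13.1 + 82.7 + 897.3 = 993.1\,\mathrm{MB},$$

which agrees closely with the 993.5 MB shown in the Max Mem column.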

Table 3.2 - Memory usage of sequence data structures, with quality scores. This table shows the coefficients of the regression model (see Equation 3.1), and the maximum memory usage for a dataset containing 30,000 sequences with 30,000 nucleotides and per-base quality scores in each sequence. All coefficients are significant (P < 0.001), with the exception of the SeqAn coefficients indicated by their p values. SE is standard error of the estimate. α (program overhead), β1 (sequence overhead), and β2 (element size) refer to the constant and coefficients in the regression model.

Library        Max Mem (MB)   α (MB)   SE      p       β1 (bytes)   SE       p       β2 (bytes)   SE      p       R²
SeqAn (2 bit)      1417.0       2.22    5.23    0.672     2294.6     382.88   0.000     1.47       0.016   0.000   0.987
SeqAn              1949.1       1.30    4.30    0.764     5186.9     315.05   0.000     2.08       0.013   0.000   0.996
HTSJDK             4122.2      42.40    0.18    0.000      341.1      13.48   0.000     4.52       0.001   0.000   1.000
BioJava            4838.6      54.47    0.57    0.000      606.3      41.65   0.000     5.30       0.002   0.000   1.000
BioPython          8300.3      12.77    3.27    0.000     5016.7     239.25   0.000     9.05       0.010   0.000   1.000
HTSeq              9076.7      16.70    2.27    0.000     1711.3     166.33   0.000     9.98       0.007   0.000   1.000
BioPerl           58946.7      23.24    3.45    0.000     3402.1     252.93   0.000    65.33       0.010   0.000   1.000


3.3.2 File Read Benchmark

Benchmarking the file read process revealed a more than 18x difference in FASTA parsing performance, and a 300x difference in FASTQ parsing performance (Table 3.3 and Table 3.4). Because of the relative performance of C++ and Java in the language benchmarks, we expected the Java libraries to perform similarly to SeqAn, and the Perl and Python libraries to perform similarly to one another. However, for FASTA files, BioJava and JEBL performed unexpectedly poorly, and Perl showed extremely poor performance with FASTQ files. Given BioPerl's memory requirements for quality-annotated sequences, this library appears to be best suited for small FASTQ datasets.

Table 3.3 - FASTA read benchmark results. This table shows the average total CPU time (system + user time), the elapsed (i.e., wallclock) time, the percent CPU (total / elapsed time), and relative CPU usage for each library for the FASTA read benchmark. Relative CPU is the ratio of a library's CPU time to the smallest CPU time of all libraries for this benchmark program.

Library         Total CPU (sec)       Elapsed Time (sec)    % CPU   Relative CPU
                mean      cv          mean      cv
SeqAn             69.6    0.006         81.2    0.021        85.8     1.00
HTSJDK            77.7    0.040        108.5    0.048        71.6     1.12
HTSeq            182.1    0.008        190.0    0.012        95.9     2.62
BioPython        246.8    0.041        253.2    0.038        97.5     3.54
BioPerl          389.8    0.003        401.2    0.004        97.2     5.60
SeqAn (2 bit)    679.0    0.000        685.5    0.001        99.0     9.75
BioJava         1011.5    0.020        828.7    0.018       122.1    14.53
JEBL            1271.5    0.014       1275.5    0.014        99.7    18.26

Table 3.4 - FASTQ read benchmark results. This table shows the average total CPU time (system + user time), the elapsed (i.e., wallclock) time, the percent CPU (total / elapsed time), and relative CPU efficiency for each library for the FASTQ read benchmark. Relative CPU is the ratio of a library's CPU time to the smallest CPU time of all libraries for this benchmark program.

Library         Total CPU (sec)       Elapsed Time (sec)    % CPU   Relative CPU
                mean      cv          mean      cv
HTSJDK            20.1    0.057         20.8    0.204        96.9     1.00
SeqAn             25.3    0.000         25.3    0.001        99.8     1.26
BioJava           68.4    0.008         67.9    0.008       100.7     3.40
HTSeq             91.3    0.016         91.5    0.016        99.7     4.54
SeqAn (2 bit)    151.7    0.005        151.8    0.005        99.9     7.54
BioPython        549.4    0.007        550.1    0.007        99.9    27.32
BioPerl         6716.0    0.024       6724.0    0.024        99.9   333.98


3.3.3 Read Trimming Benchmark

In the short read trimming benchmark, SeqAn was the fastest overall, but only without sequence data encoding. The performance of SeqAn dropped by over a factor of four when encoding was used (Table 3.5). Although in sequence trimming there is no reason to use encoded sequences, this benchmark result helps illustrate the performance impact of the encoding / decoding process. Of the Java implementations (HTSJDK, BioJava, Trimmomatic), BioJava was the slowest, although all three performed reasonably well, only a factor of two away from SeqAn. As in many other benchmarks, BioPerl exhibited the worst performance, largely due to its high overhead implementation of sequence qualities, and its slow FASTQ parser.

Table 3.5 - Short read trimming benchmark results. This table shows the average total CPU time (system + user time), the elapsed (i.e., wallclock) time, the percent CPU (total / elapsed time), and relative CPU efficiency for each library for the short read trimming benchmark. Relative CPU is the ratio of a library's CPU time to the smallest CPU time of all libraries for this benchmark program.

Library         Total CPU (sec)       Elapsed Time (sec)    % CPU   Relative CPU
                mean      cv          mean      cv
SeqAn             45.7    0.026         87.6    0.156        52.1     1.00
Trimmomatic       72.6    0.036        105.8    0.023        68.6     1.59
HTSJDK            81.7    0.139        111.8    0.228        73.0     1.79
BioJava           87.7    0.039        113.0    0.159        77.6     1.92
SeqAn (2 bit)    201.7    0.025        227.7    0.088        88.6     4.41
HTSeq            301.2    0.036        323.9    0.087        93.0     6.59
BioPython       1402.0    0.025       1418.9    0.032        98.8    30.69
BioPerl        10307.8    0.022      10325.2    0.023        99.8   225.66

3.3.4 Sequence Translation Benchmark

Sequence translation is not implemented in the high throughput specific libraries HTSJDK and HTSeq. SeqAn, without sequence encoding, was the fastest library; however, the performance impact of encoding was similar to the trimming benchmark: roughly a 4x decrease in performance (Table 3.6). The Java libraries (JEBL, BioJava) performed unexpectedly poorly on this benchmark, being only 1.5-2 times faster than BioPython.

Table 3.6 - Sequence translation benchmark results. This table shows the average total CPU time (system + user time), the elapsed (i.e., wallclock) time, the percent CPU (total / elapsed time), and relative CPU use for each library for the sequence translation benchmark. Relative CPU is the ratio of a library's CPU time to the smallest CPU time of all libraries for this benchmark program.

Library         Total CPU (sec)       Elapsed Time (sec)    % CPU   Relative CPU
                mean      cv          mean      cv
SeqAn             61.2    0.001         69.5    0.041        88.1     1.00
SeqAn (2 bit)    285.6    0.003        290.0    0.011        98.5     4.66
JEBL             626.0    0.013        602.1    0.013       104.0    10.22
BioJava         1197.6    0.026       1008.2    0.023       118.8    19.55
BioPython       1478.6    0.011       1481.8    0.011        99.8    24.14
BioPerl         1892.4    0.032       1897.2    0.032        99.7    30.90

3.3.5 Kmer Counting Benchmark

Both BioJava and SeqAn offered similar performance on the kmer counting benchmark. However, as the test file contained 26,936,232,745 bases, the fastest library, SeqAn, only managed a throughput of roughly 13 megabases per second (Table 3.7). Based on the results of the other benchmarks, we would expect a reasonably efficient kmer counting algorithm to perform at least an order of magnitude faster, so this highlights an area for significant improvement.

Table 3.7 - Kmer counting benchmark results. This table shows the average total CPU time (system + user time), the elapsed (i.e., wallclock) time, the percent CPU (total / elapsed time), and relative CPU use for each library for the kmer counting benchmark. Relative CPU is the ratio of a library's CPU time to the smallest CPU time of all libraries for this benchmark program.

Library         Total CPU (sec)       Elapsed Time (sec)    % CPU   Relative CPU
                mean      cv          mean      cv
SeqAn           1975.4    0.022       1979.7    0.022        99.8     1.00
BioJava         4139.0    0.043       3761.9    0.042       110.0     2.10

3.3.6 Sequence Alignment Benchmark

Sequence alignment functionality is not implemented in HTSeq or current versions of BioPerl. BioPerl supports integration with external alignment tools, but benchmarking these would measure the performance of the tool, not the actual library. As a result, we only tested four libraries: SeqAn, BioJava, JEBL, and BioPython. SeqAn performed well, and this benchmark was less impacted by the use of 2 bit encoded sequences than the other benchmarks. It is likely that SeqAn incorporates an optimization, such as decoding the entire sequence before alignment, as decoding sequence elements on the fly would have resulted in a much worse result. JEBL and BioJava produced similar results, substantially slower than SeqAn's C++ implementation. We found that BioPython's alignment implementation performed much more slowly than expected, considering the performance of the Python language. Whether due to a bug or design issue, we were unable to complete the alignment benchmarks using the entire dataset, and the CPU time listed below for BioPython is an estimate based on extrapolating results from running the benchmark on a much smaller subset of the data (Table 3.8).

Table 3.8 - Sequence alignment benchmark results. This table shows the average total CPU time (system + user time), the elapsed (i.e., wallclock) time, the percent CPU (total / elapsed time), and relative CPU efficiency for each library for the alignment benchmark. Relative CPU is the ratio of a library's CPU time to the smallest CPU time of all libraries for this benchmark program. **Due to performance issues with the BioPython alignment benchmark, the BioPython result is an estimate based on running the benchmark on a much smaller dataset.

Library         Total CPU (sec)       Elapsed Time (sec)    % CPU   Relative CPU
                mean        cv        mean      cv
SeqAn              954.5    0.001      955.5    0.001        99.9      1.00
SeqAn (2 bit)     1394.5    0.001     1396.0    0.001        99.9      1.46
JEBL              3604.0    0.011     1803.0    0.017       199.9      3.78
BioJava           8158.6    0.006     5935.1    0.007       137.5      8.55
BioPython **   1278425.6    N/A          N/A    N/A           N/A   1339.37

3.4 Conclusion

This benchmarking study demonstrates that among seven widely used bioinformatics libraries, the two general purpose libraries written in Java (BioJava and JEBL) perform much more slowly than would be expected based on the raw performance of the language. BioJava also was found to use much more memory than the other bioinformatic libraries to hold sequence data. Although the performance of HTSJDK was excellent, it focuses solely on short read data processing, and lacks many important features found in more general-purpose libraries. We also found that the SeqAn sequence data encoding used almost twice as much memory as expected, and resulted in a significant impact on the performance of applications. In summary, in terms of memory and CPU utilization, there is much room for improvement of the current "state of the art" general-purpose bioinformatics software libraries. Ultimately, the benchmarks applied in this study give direction on which aspects of bioinformatic libraries to target for improvement.

3.5 Funding

This work was made possible by the Virginia Commonwealth University Graduate School and Integrative Life Sciences Program through support to HLE.


3.6 Supplementary Data

Figure 3.3 - R linear regression diagnostic plot for JEBL memory usage with FASTA data. Plot shows even distribution of residuals, with only a small amount of heteroskedasticity.

Figure 3.4 - R linear regression diagnostic plot for HTSJDK memory usage with FASTA data. Plot shows even distribution of residuals, with only a small amount of heteroskedasticity.


4 BioMojo: A Java Framework for High Performance Sequence Analysis

Eaves, Hugh L.* [email protected] Integrative Life Sciences, Virginia Commonwealth University, Richmond, Virginia, 23284, USA

Waldrop, Alex Department of Biology, Virginia Commonwealth University, Richmond, Virginia, 23284, USA [email protected]

Rivera, Maria C. Department of Biology, Virginia Commonwealth University, Richmond, Virginia, 23284, USA [email protected]

Brown, Bonnie L. Department of Biology, Virginia Commonwealth University, Richmond, Virginia, 23284, USA [email protected]

Running head: BioMojo: Java bioinformatic library

* To whom correspondence should be addressed


Abstract

Summary: The combination of growth in sequence data volumes, changes to sequencing technology, and inadequate rates of hardware improvement is causing rapid obsolescence of analysis techniques, which in turn is driving high demand for new methods that can handle analysis of increasingly large volumes of sequences and metadata associated with new sequencing technologies. Existing software libraries, including the most popular Java library, BioJava, were largely designed before the advent of current high throughput sequencing, and although they provide elegant and flexible APIs, the high degree of abstraction in these APIs negatively impacts performance. Targeting for improvement the general purpose programming elements that are applicable to a large number of sequence analysis applications has great potential to reap rewards across a number of areas. Through introspection of existing libraries, and using general software design and optimization strategies, we developed solutions to those problems in the form of an alternative Java library, "BioMojo," that addresses the shortcomings of existing Java libraries while maintaining compatibility with the most popular library, BioJava. BioMojo was then benchmarked using seven benchmark tasks, and achieved the best performance across all benchmarks.

Software and Test Data Availability: www.biomojo.org
Contact: [email protected], [email protected], [email protected], [email protected]


4.1 Introduction

Development of MOM mapping software51, benchmarking of commonly utilized bioinformatic libraries54 [Eaves et al., in preparation], and the growing size of sequence datasets12 have highlighted the need for a performance-oriented general purpose bioinformatics library for Java. The most popular Java library, BioJava16, provides an extensive feature set with an elegant and flexible API, but was largely designed before the advent of next generation sequencing. As a result, its API design prioritizes flexibility over performance, which negatively impacts performance when dealing with large data sets, or computationally intensive tasks. Other Java libraries, such as JEBL60 or HTSJDK18, are available but are targeted toward more specialized applications (evolutionary biology, in the case of JEBL, and processing high throughput sequencing (HTS) short read and alignment data, in the case of HTSJDK). Neither one provides a performance-focused API targeted towards general purpose use.

Because of this, we undertook the design of a new API and library that was flexible enough for general purpose use, but incorporated more pragmatic design decisions to overcome some of the performance limitations of BioJava. This paper presents the design of our new API and library and provides benchmark results that show significant performance improvement over the aforementioned Java libraries for many common data handling tasks. In fact, this new library achieves the best performance across all benchmarked tasks, both in terms of CPU usage and memory utilization. The hope is that this API and library will be immediately useful to bioinformatics developers desiring a high performance library for Java.


4.2 Methods

4.2.1 Software Design

A number of goals were established at the outset to guide the design of the library. First and foremost, we needed an API that would allow us to extract the maximum possible performance from the Java platform without sacrificing flexibility or ease of use. Next, we wanted to implement a core set of features to support general purpose bioinformatics application development: support for reading and writing commonly used file formats, translation, alignment, relational database, and web services support. Third, the library needed to support modern Java development paradigms and frameworks: namely JPA69 for relational database support, and JavaEE70 and Spring71 for dependency injection, component and transaction management, etc. Other considerations were that data structures needed to minimize the in-memory footprint of data, all objects should be HPC ready (serializable for network computing), and there must be support for compile-time type checking and runtime enforcement of sequence data structure contents to help avoid common errors. Lastly, the API should utilize a unified data model for sequences, regardless of the source (e.g., HTS reads vs a reference sequence).

We applied the well-known Pareto principle72 to prioritize optimization efforts, where the most frequently used operations and data elements were the highest priority. In the case of sequence analysis, this meant focusing first on the storage of individual sequence elements and the operations used to manipulate them, then on sequence metadata, and finally on multi-sequence collections. This strategy, combined with our design goals, promoted development of numerous candidate designs for the core data structures and functions that were microbenchmarked to compare and optimize different approaches. The heuristics used by the JVM to guide runtime optimizations such as method-inlining or JIT compilation can sometimes be sensitive to program-specific execution patterns and code context. When microbenchmarking, this can lead to false positives where small code changes result in faster execution, not due to generalizable performance improvement, but due to changes in JVM optimization patterns. To identify these false positives, we examined the byte code generated by the Java compiler and the machine code generated by the JVM HotSpot compiler to ensure that optimization patterns were the same before and after changes were made to the code.

4.2.2 Benchmarking

With the exception of the FAST5 read benchmark, the benchmarks presented in this paper followed the methods developed by Eaves et al. (in preparation, see Chapter 3) that assess memory and CPU efficiency of general-purpose bioinformatics libraries. The FAST5 benchmark measures the CPU time required to load all sequences and metadata from a set of FAST5 files into BioMojo's native sequence data structures. The FAST5 dataset used for this consisted of 11,740 FAST5 files containing raw sequence data, 9,737 of which were successfully basecalled sequences. The total size of the combined FAST5 files was 8.8GB. This benchmark is unique to BioMojo, as BioMojo is currently the only Java library to support this file format.

4.3 Results and Discussion

4.3.1 Software Architecture

The design process resulted in the development of a number of architectural features unique to this library. The following section provides a brief discussion of how the features of this library differ from the approaches used by other Java bioinformatics libraries.


In Java, due to the split type system, developers often must make a decision between using primitive types and objects to model their data. Primitives are substantially more efficient than objects in many cases, due to the additional overhead of boxing / unboxing, null checks, type comparisons, virtual function calls, etc. required by objects. However, primitives lack the polymorphic behavior of objects, and therefore are less flexible than objects in some situations.

Our analysis of existing bioinformatic libraries and our microbenchmarking confirmed that modelling sequence elements solely as objects would incur a significant performance penalty. Therefore, we designed our library to support the interchangeable use of both objects and primitive types to model sequence elements. This was implemented by defining subtypes of all core interfaces that expose sequence elements as primitive types as well as object types. This design pattern has been used successfully in a variety of high performance Java collections libraries, such as Koloboke73 and FastUtil,74 but to the best of our knowledge, this is the first time the concept has been applied to a bioinformatic sequence analysis library. Other than the use of primitive types for sequence elements, the sequence interface supports the typical operations associated with sequence data: metadata attachment, addition or deletion of elements, and generation of subsequences. Our sequence interface extends the Java List interface, so that where possible, most of these operations are defined as methods of List. (A simplified sketch of this dual object/primitive pattern follows.)
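The following is a greatly simplified sketch of the dual object/primitive interface approach; the actual Seq and ByteSeq interfaces, including their Alphabet type parameters, are reproduced in Section 4.6.1, and this sketch omits everything except the element accessors.

import java.util.List;

// Object-typed view: flexible, but get(index) returns a boxed value.
interface Seq<T> extends List<T> {
    // metadata, subsequence, and alphabet operations omitted in this sketch
}

// Primitive specialization: exposes the same elements as raw bytes,
// avoiding boxing/unboxing on performance-critical paths.
interface ByteSeq extends Seq<Byte> {
    byte getByte(int index);

    // The object-typed accessor is bridged to the primitive one, so
    // both views of the sequence stay consistent.
    @Override
    default Byte get(final int index) {
        return getByte(index);
    }
}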

Of course, some programmers may recoil in horror at the idea of using primitive types in a fundamental library API, and we acknowledge that this approach has several potential disadvantages. First, compile time checking of element type compatibility is compromised. In other words, the compiler cannot differentiate between instances of, for example, nucleic and amino acids, which could lead to bugs (e.g., 'A' = 'A', but Adenine ≠ Alanine). To compensate for this, we implemented compile time type checking, with optional runtime element validation, at the Sequence level. "Alphabet" classes were created for each commonly used set of sequence elements (e.g., DNA, RNA, amino acid, etc.), with the Alphabet implementations responsible for enumerating and validating the subset of primitive values allowed in each Alphabet. The Alphabet classes were incorporated into the Sequence API as a generic type parameter, to support compile time checks, and attached to each Sequence as an attribute, to support runtime validation. The second disadvantage is that, in Java, primitive types cannot be used as generic type parameters. If an algorithm is applicable to sequences of both primitives and objects, and requires maximum performance, it is necessary to provide multiple implementations of the algorithm. In practice, we found this to be an advantage, because instead of being limited to a single slow object based implementation, the developer can choose to provide much faster implementations using primitive types. The last disadvantage is that the sequence elements cannot provide metadata about themselves (e.g., molecular weight, name, etc.). However, this issue is easily resolved using metadata "lookup" functions keyed on the Alphabet and element value.
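To make the compile time checking concrete, here is a minimal sketch of how an Alphabet type parameter tags a sequence type; the names are simplified relative to the actual interfaces in Section 4.6.1, and the translate() signature is hypothetical.

// Simplified alphabet hierarchy: an alphabet acts both as a runtime
// validator and as a compile-time "tag" on the sequence type.
interface ByteAlphabet { boolean isValid(byte symbol); }
interface DnaAlphabet extends ByteAlphabet { }
interface AminoAcidAlphabet extends ByteAlphabet { }

// A sequence is parameterized by its alphabet.
interface ByteSeq<A extends ByteAlphabet> {
    byte getByte(long index);
    long sizeL();
}

class Translator {
    // The signature documents, and the compiler enforces, that only DNA
    // sequences may be translated, even though both kinds of sequence
    // store plain bytes internally.
    static ByteSeq<AminoAcidAlphabet> translate(final ByteSeq<DnaAlphabet> dna) {
        throw new UnsupportedOperationException("sketch only");
    }
}

// Passing a ByteSeq<AminoAcidAlphabet> to translate() is now rejected
// at compile time ('A' = 'A', but Adenine is not Alanine).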

The performance advantage of utilizing primitive types in the sequence API facilitated several other highly beneficial features. First, we were able to implement Codec classes to perform on-the-fly encoding and decoding of sequence elements into more compact "at rest" in-memory representations. For example, the four nucleic acids are often represented using only two bits per residue, resulting in a 4x to 8x reduction in memory usage compared to byte or UTF-16 character values. The Codec classes are attached to the Sequence classes at runtime, an implementation of the well-known strategy design pattern.75 Second, for many operations where using objects would necessitate numerous hash table lookups (e.g., sequence validation, translation, substitution matrices, etc.), it was possible to use the primitive type values to directly index small arrays instead. Although both hash table lookups and array indexing are O(1) operations, our microbenchmarks found that Java hash table lookups are up to 50 times slower than array indexing.
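The following minimal sketch illustrates both ideas: two-bit packing of nucleotides and replacing hash lookups with direct array indexing. It is not BioMojo's actual codec (the real Codec interface appears in Section 4.6.1.3); the class name, encoding table, and method names are illustrative assumptions, and ambiguity codes are ignored.

public final class TwoBitCodecSketch {

    // Direct array indexing by the primitive byte value replaces a hash
    // table lookup: ENCODE['A'] == 0, ENCODE['C'] == 1, and so on.
    private static final byte[] ENCODE = new byte[128];
    private static final byte[] DECODE = { 'A', 'C', 'G', 'T' };

    static {
        ENCODE['A'] = 0;
        ENCODE['C'] = 1;
        ENCODE['G'] = 2;
        ENCODE['T'] = 3;
    }

    /** Packs four bases per byte: 4x smaller than one byte per base. */
    public static byte[] encode(final byte[] bases) {
        final byte[] packed = new byte[(bases.length + 3) / 4];
        for (int i = 0; i < bases.length; ++i) {
            packed[i / 4] |= ENCODE[bases[i]] << ((i % 4) * 2);
        }
        return packed;
    }

    /** Decodes a single base without unpacking the whole sequence. */
    public static byte decode(final byte[] packed, final int pos) {
        return DECODE[(packed[pos / 4] >> ((pos % 4) * 2)) & 0x3];
    }
}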

After focusing on the efficiency of sequence element representations, we turned to sequence metadata storage. The Sequence interface was intended to represent everything from short reads with only a single unique identifier, to fully annotated sequences with multiple accession numbers, relationships, and other metadata. Defining methods for each metadata type would have resulted in a very wide and difficult to maintain interface. Rather than implement numerous Sequence subtypes for each set of attributes, we elected to use a key-value approach, where each metadata value is attached to a sequence with a string identifier as the key. To provide maximum performance, we implemented a custom HashMap class optimized for storing small numbers of attributes, but which degrades gracefully when large numbers of values are added. (A sketch of this small-map strategy follows.)
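The sketch below illustrates the "optimized for few entries, degrades gracefully" idea: below a small threshold, keys and values live in parallel arrays scanned linearly, which is cheap and cache-friendly for the one to three attributes typical of a short read; past the threshold, entries move into a regular HashMap. The threshold, class name, and fields are illustrative assumptions, not BioMojo's actual implementation.

import java.util.HashMap;
import java.util.Map;

public final class SmallAttributeMap {

    private static final int THRESHOLD = 8;

    private final String[] keys = new String[THRESHOLD];
    private final Object[] values = new Object[THRESHOLD];
    private int size = 0;
    private Map<String, Object> overflow; // lazily created on degrade

    public void put(final String key, final Object value) {
        if (overflow != null) {
            overflow.put(key, value);
            return;
        }
        for (int i = 0; i < size; ++i) {
            if (keys[i].equals(key)) {
                values[i] = value; // replace existing entry
                return;
            }
        }
        if (size < THRESHOLD) {
            keys[size] = key;
            values[size] = value;
            ++size;
        } else {
            // Degrade gracefully: move everything into a real HashMap.
            overflow = new HashMap<>();
            for (int i = 0; i < size; ++i) {
                overflow.put(keys[i], values[i]);
            }
            overflow.put(key, value);
        }
    }

    public Object get(final String key) {
        if (overflow != null) {
            return overflow.get(key);
        }
        for (int i = 0; i < size; ++i) {
            if (keys[i].equals(key)) {
                return values[i];
            }
        }
        return null;
    }
}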

Next, we developed an efficient file parsing strategy. Many bioinformatic file formats are text based and unindexed, which presents a considerable challenge to efficient parsing, as the final size of the data is not known until late in the parsing process. Two strategies are generally used to deal with this. In the first, two complete parsing passes are made: the first pass establishes the size of the data, and the second pass copies the data into structures allocated to exactly fit it. The second strategy is to use dynamically resizable structures to receive the parsed data. The disadvantage of the first strategy is that the full parser is executed twice, essentially doubling the amount of work needed to parse the data. The second strategy, although requiring only one pass, often results in extensive buffer allocation, deallocation, and data copying as the target data structures grow to their final size. For BioMojo, we developed a "hybrid" strategy requiring only a single parsing pass, but allowing exact sizing of the final data structures. This strategy is described in Figure 4.1.

Figure 4.1 - File parsing strategy. Steps depicted: (1) Sequence data are stored on disk in a variable length, unindexed record format (e.g., FASTA). (2) File data are read "on demand" into a dynamically allocated pool of fixed length buffers. (3) The parser reads data from the buffers, identifying the coordinates within the buffers containing sequence data, delimiters, headers, etc. (4) The parser pushes the coordinates of relevant data onto a queue. (5) When the parser identifies that an entire record has been read, the final data storage areas are allocated, and the data are copied from the source buffers into their final destination using high performance buffer copy operations.
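A compact sketch of this coordinate-queue idea follows, simplified to a stream containing only sequence lines (header and multi-record handling omitted, buffer pooling replaced by fresh allocations, and Java 16+ assumed for the record type). It illustrates the strategy of Figure 4.1, not BioMojo's actual parser.

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayDeque;
import java.util.Deque;

public final class HybridParserSketch {

    /** Step 4 of Figure 4.1: recorded coordinates of data within one buffer. */
    private record Span(byte[] buffer, int start, int length) { }

    public static byte[] readSequenceData(final InputStream in) throws IOException {
        final Deque<Span> spans = new ArrayDeque<>();
        int total = 0;
        byte[] buf = new byte[8192]; // stands in for a pooled buffer (step 2)
        int len;
        while ((len = in.read(buf)) > 0) {
            // Step 3: scan the buffer, recording runs of sequence data and
            // skipping line terminators, without copying or resizing anything.
            int runStart = -1;
            for (int i = 0; i < len; ++i) {
                final boolean isSeqData = buf[i] != '\n' && buf[i] != '\r';
                if (isSeqData && runStart < 0) {
                    runStart = i;
                } else if (!isSeqData && runStart >= 0) {
                    spans.add(new Span(buf, runStart, i - runStart));
                    total += i - runStart;
                    runStart = -1;
                }
            }
            if (runStart >= 0) { // run extends to the end of the buffer
                spans.add(new Span(buf, runStart, len - runStart));
                total += len - runStart;
            }
            buf = new byte[8192]; // "next buffer from the pool"
        }
        // Step 5: one exact-size allocation, then fast bulk copies per span.
        final byte[] seq = new byte[total];
        int pos = 0;
        for (final Span s : spans) {
            System.arraycopy(s.buffer(), s.start(), seq, pos, s.length());
            pos += s.length();
        }
        return seq;
    }
}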


The next major architectural feature was the development of high performance object factories for creating the mandatory components (Alphabets, Codecs, etc.) of high cardinality objects such as Sequences. Each implementation of the core interfaces (such as Alphabet or Codec) was assigned a numeric unique identifier (UID). These UIDs were used as the keys to object factories, which return the appropriate object type for a given UID. This approach contrasts with the common string-based lookup approach used by many heavyweight factories (such as the Spring Framework), which, while more flexible, was found to be an order of magnitude slower for object creation and lookup. Lastly, we implemented first class support for modern Java frameworks and APIs such as JPA, Spring, and JavaEE. This substantially reduces the verbosity of simple operations such as reading and writing data to and from databases, and means that BioMojo can readily integrate with widely used application frameworks.
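A minimal sketch of the UID-keyed lookup follows; the class name, fixed table size, and registration scheme are assumptions for illustration, not BioMojo's actual factory code. The point is that lookup on the object-creation hot path is a bounds-checked array index rather than a string-keyed hash lookup.

public final class UidFactory<T> {

    private final Object[] instances = new Object[1024];

    /** Registers a (typically singleton) instance under its numeric UID. */
    public void register(final int uid, final T instance) {
        instances[uid] = instance;
    }

    /** O(1) array-indexed lookup on the object-creation hot path. */
    @SuppressWarnings("unchecked")
    public T get(final int uid) {
        final T instance = (T) instances[uid];
        if (instance == null) {
            throw new IllegalArgumentException("No instance for UID " + uid);
        }
        return instance;
    }
}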

4.3.2 Framework Features

BioMojo implements a standard set of core features for a general purpose bioinformatic library. These features include: 1) File I/O - readers and writers are provided for FASTA,56,57 FASTQ,58 FAST5 (a variant of HDF576 used by Oxford Nanopore Technologies), BLAST28 XML, and NCBI taxonomy files. 2) Relational database - BioMojo provides a database schema definition (see supplemental data) that supports persistence of the entire library data model via the Java Persistence API. 3) Alignment - BioMojo provides implementations of a variety of local and global sequence alignment algorithms. 4) Translation - support is provided for translating RNA and DNA sequences into amino acid sequences, along with definitions of a number of alternative codon tables. 5) Web services - BioMojo provides services to integrate NCBI Entrez services with other features provided by the library. 6) Interoperability - BioMojo provides a set of adapter classes that allow objects in the BioMojo data model to be used with BioJava.

These features are integrated into an application framework designed to facilitate the rapid development of both command line and web based bioinformatics applications. By leveraging mature Java frameworks such as Hibernate and Spring, and by providing default configurations, bootstrap code, documentation, and example applications, BioMojo enables the developer to build applications with a minimum of boilerplate code.

An example of the brevity afforded by BioMojo can be seen in the following code, which implements one operation of a multi-operation command line utility (Figure 4.2). Each operation is implemented in a separate class, instances of which are passed to the BioMojo.init() method in the main() method of the Application class. The DBLoad class implements the behavior of the "dbload" operation, which loads data from a series of FAST5 files into a relational database.


public class Application {
    public static void main(final String[] args) {
        BioMojo.init(args, new DBLoad(), new DBUnload(), new RunBlast());
    }
}

@Named
@Parameters(commandNames = "dbload", commandDescription = "Load FAST5 data into database")
public class DBLoad extends AbstractSpringCommand {
    @Parameter(names = { "-i", "--in" }, required = true, description = "Input file name or directory name")
    private String inputFile; // value set by command line parser

    @Parameter(names = { "-n", "--name" }, required = true, description = "Dataset name")
    private String name; // value set by command line parser

    @PersistenceContext
    private EntityManager entityManager; // auto-injected by framework

    @Override
    @Transactional
    public void runWithThrow() throws IOException {
        final SeqList<ByteSeq<ByteAlphabet>> seqList = new SeqArrayList<>(name);
        try (SeqInput<ByteSeq<ByteAlphabet>> seqInput = new Fast5Input<>(inputFile)) {
            seqInput.forEach(seq -> seqList.add(seq));
        }
        entityManager.persist(seqList);
    }
}

Figure 4.2 - BioMojo code example. As an example of the streamlined framework, this is the only code necessary to implement the "dbload" operation in a multi-operation command line utility. "dbload" reads records from a FAST5 file and stores them in a named sequence list in a relational database. The Application class calls the main BioMojo bootstrap function, init(), passing in instances of the classes that implement the available operations. When the program is executed with "dbload" as a command line option, the runWithThrow() method of the DBLoad class is executed. (Note: Java import statements have been omitted.)

4.3.3 Benchmark Results

The benchmark results show that BioMojo's performance-focused architecture results in much greater CPU and memory efficiency compared to the other Java libraries: BioJava, HTSJDK, and JEBL (Table 4.1). The difference is less pronounced on the short read focused benchmarks (FASTQ read, trimming, etc.), showing that these functions are already quite well optimized across all libraries. However, it is worth noting that the second fastest library, HTSJDK, although similar in performance to BioMojo on the FASTQ read benchmarks, does not perform any validation of input data, or offer provisions for parsing headers or encoding data. Notably, the third fastest library, BioJava, performs validation but relies on a very FASTQ-specific API, and the resulting data structures are incompatible with the rest of the BioJava library. As a consequence, utilizing the FASTQ sequences with the rest of the BioJava API requires an expensive data type conversion. The advantage of BioMojo is its ability to read, validate, and encode the sequence data into a compressed format in less time than HTSJDK or BioJava, while utilizing a single API for all types of sequence data.

Although the performance of HTSJDK on the FASTA read benchmark is similar to that of BioMojo, it lacks encoding and validation capabilities. Also, as the resulting data structures represent the sequences only as raw byte arrays, with no provisions for storing metadata, they are too limited to serve as the core model for a general purpose bioinformatic library. The performance of the BioJava and JEBL parsers on the FASTA read benchmark is much worse than BioMojo's, due to the penalty for translating the ASCII-encoded sequence data into lists of objects, and to parsing algorithms that require frequent copying of small data buffers. Overall, BioMojo's more "bare metal" storage approach, combined with a more efficient parser design, yields a significant improvement.

The streamlined design of BioMojo allows it to outperform the existing libraries by a factor of 20x to 30x on the more CPU-intensive functions, such as translation, alignment, and k-mer generation. Much of this improvement results from the greatly reduced overhead of using primitive types to represent sequence data elements. Across all benchmarks, BioMojo's on-the-fly sequence data encoding and decoding has a relatively small impact on CPU efficiency (a 1% - 30% increase in CPU utilization), while providing up to a 4x improvement in memory efficiency.


Table 4.1 - CPU and elapsed time benchmark results. Data processing times for various benchmark trials using different formats of sequence data. Library: not all libraries can perform the required benchmark task. CPU time: time the processor was active carrying out the task. Elapsed time: wall-clock time that passed during execution of the task. % CPU: CPU utilization (CPU/Elapsed). MB: amount of file data read per unit time. Relative CPU values were calculated separately for each trial CPU value and anchored to the best performing library.

                      CPU time (sec)     Elapsed time (sec)
Library               mean      cv       mean      cv        % CPU   Relative CPU
Read data: FASTA
  BioMojo             42.7      0.105    65.6      0.054     65.0    1.00
  BioMojo (2 bit)     57.1      0.037    73.3      0.046     78.0    1.34
  HTSJDK              77.7      0.040    108.5     0.048     71.6    1.82
  BioJava             1011.5    0.020    828.7     0.018     122.1   23.70
  JEBL                1271.5    0.014    1275.5    0.014     99.7    29.80
Read data: FASTQ
  BioMojo             13.6      0.006    13.1      0.010     103.7   1.00
  BioMojo (2 bit)     18.0      0.008    17.5      0.011     103.4   1.33
  HTSJDK              20.1      0.057    20.8      0.204     96.9    1.48
  BioJava             68.4      0.008    67.9      0.008     100.7   5.03
Read data: FAST5
  BioMojo             41.0      0.002    14.8      0.030     277.0   1.00
  BioMojo (2 bit)     42.0      0.017    14.9      0.031     281.9   1.02
K-mers
  BioMojo             88.3      0.028    96.9      0.018     91.2    1.00
  BioJava             4139.0    0.043    3761.9    0.042     110.0   46.86
Short read trimming
  BioMojo             32.7      0.068    85.6      0.141     38.2    1.00
  BioMojo (2 bit)     50.2      0.040    88.2      0.150     56.9    1.53
  HTSJDK              81.7      0.139    111.8     0.228     73.0    2.50
  BioJava             87.7      0.039    113.0     0.159     77.6    2.68
Sequence translation
  BioMojo             22.0      0.035    32.0      0.186     68.5    1.00
  BioMojo (2 bit)     35.3      0.006    39.0      0.091     90.4    1.61
  JEBL                626.0     0.013    602.1     0.013     104.0   28.51
  BioJava             1197.6    0.026    1008.2    0.023     118.8   54.53
Sequence alignment
  BioMojo             415.4     0.002    416.5     0.001     99.7    1.00
  BioMojo (2 bit)     422.7     0.016    426.7     0.028     99.1    1.02
  JEBL                3604.0    0.011    1803.0    0.017     199.9   8.68
  BioJava             8158.6    0.006    5935.1    0.007     137.5   19.64


Table 4.2 - Memory usage benchmark results. Memory usage for in-memory sequence data (FASTA) and sequence data with quality scores (FASTQ). Max Mem is the amount of memory used to store 30,000 sequences of 30,000 elements. The other columns are the results of fitting the linear regression model Y = α + β1·X1 + β2·X1·X2 to the memory benchmark results, where Y is the total memory used by the benchmark program, X1 is the number of sequences, and X2 is the number of elements per sequence. α is the intercept, corresponding to the base amount of memory used by the given library regardless of dataset size; β1 (the per sequence overhead) is the amount of memory used for a sequence data structure, regardless of the length of the sequence; and β2 is the amount of memory used per element in the sequence. All linear regression coefficients are significant (P < 0.001). SE is the standard error of the estimate.

                      Max Mem   Program overhead,   Sequence overhead,    Element size,
                      (MB)      α (MB)              β1 (bytes)            β2 (bytes)
Library                         Constant   SE       Coefficient   SE      Coefficient   SE       R2
FASTA:
  BioMojo             313.2     43.79      0.30     560.0         21.69   0.281         0.001    0.999
  HTSJDK              1069.2    43.02      0.38     428.2         27.50   1.122         0.001    1.000
  BioMojo (2 bit)     1069.7    43.59      0.27     394.3         19.53   1.124         0.001    1.000
  JEBL                1165.5    47.82      1.49     624.2         109.39  1.222         0.005    0.998
  BioJava             8161.5    52.95      0.53     1161.7        38.52   8.973         0.002    1.000
FASTQ:
  BioMojo             1546.8    45.72      0.55     1047.0        40.45   1.64          0.002    1.000
  BioMojo (2 bit)     2091.3    44.81      0.44     867.4         31.94   2.25          0.001    1.000
  HTSJDK              4122.2    42.40      0.18     341.1         13.48   4.52          0.001    1.000
  BioJava             4838.6    54.47      0.57     606.3         41.65   5.30          0.002    1.000
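As a check on the fitted model, consider BioMojo on the FASTA benchmark: the predicted memory usage is α + β1·X1 + β2·X1·X2 = 43.79 MB + (30,000 × 560 bytes) + (30,000 × 30,000 × 0.281 bytes) ≈ 43.8 MB + 16.8 MB + 252.9 MB ≈ 313.5 MB, in close agreement with the measured Max Mem of 313.2 MB.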

4.4 Conclusion

We have designed and constructed a new general purpose bioinformatic library for Java, BioMojo, that outperforms all of the other benchmarked libraries across all benchmarked tasks. We believe that this new library provides a solid foundation for the development of high performance bioinformatic applications in the Java language. Although its feature set is not as extensive as BioJava's, BioMojo's ability to interoperate with many of BioJava's features allows the user to integrate both libraries when developing applications. Also, BioMojo provides Java features important to the development of large scale HPC applications, including JavaEE, JPA, and network marshalling/unmarshalling support. We believe that BioMojo provides value to the bioinformatic community in its current form, but we also plan to continue expanding BioMojo's feature set to broaden its appeal.

4.5 Funding

This work was made possible by the Virginia Commonwealth University Graduate School and

Integrative Life Sciences Program through support to HLE.

4.6 Supplementary Information

4.6.1 Core Java Interfaces in BioMojo library

The following sections provide the Java source code of some of the core interfaces defined in the BioMojo library.

4.6.1.1 Seq

/** * The {@code Seq} interface represents a sequence of values of type T from the * {@link org.biomojo.alphabet.Alphabet} A. * * @author Hugh Eaves * * @param <T> * the type of values in the sequence * @param <A> * the Alphabet for this sequence */ public interface Seq<T, A extends Alphabet<T>> extends LongList<T>, Propertied, Described, Identified {

/** * Sets the id for this sequence. * * @param id * the new id */ void setId(long id);

/** * Gets the alphabet for this sequence. * * @return the alphabet */ public A getAlphabet();

/** * Modifies this sequence by replacing each of the elements with the * equivalent symbol in the given alphabet, and then set the alphabet of * this sequence to the given alphabet. * * @param alphabet * @throws InvalidSymbolException


* if the current sequence contains elements that can not be * represented in the new alphabet */ public void setAlphabet(A alphabet) throws InvalidSymbolException;

/** * Modifies this sequence by replacing each of the elements with its * canonical form, and then changes the alphabet of this sequence to the * canonical version. */ public void canonicalize();

/** * Reverse the order of elements in this sequence. */ default void reverse() { final long lastPos = sizeL() - 1; // iterate to the midpoint, swapping mirrored elements for (long i = 0; i < lastPos - i; ++i) { final T val = get(i); // swap values set(i, get(lastPos - i)); set(lastPos - i, val); } } }

4.6.1.2 ByteSeq

/** * Represents a sequence where the underlying type is a Java "byte" primitive. * This interface avoids the boxing/unboxing that would be required with Byte * objects. * * @author Hugh Eaves * * @param <A> * the Alphabet for this sequence */ public interface ByteSeq<A extends ByteAlphabet> extends Seq<Byte, A>, CharSequence {

/** * Returns this sequence as a byte array. Note that the data is defensively * copied, so modifications to the array will not alter the data in the * original sequence. * * @return the byte[] */ byte[] toByteArray();

/** * Returns a portion of this sequence as a byte array. The data returned * will start at "startPos" inclusive, and end at "endPos" exclusive. Note * that the data is defensively copied, so modifications to the array will * not alter the data in the original sequence. * * @param startPos * @param endPos * @return */ byte[] toByteArray(long startPos, long endPos);

/** * Replaces the contents of this sequence with the symbols contained in the * given byte array. The array contents are validated against the current * alphabet for this sequence, resulting in an "InvalidSymbolException" * being thrown if the array contains invalid symbols. * * @param sequence * the new sequence data


* @throws InvalidSymbolException * the invalid symbol exception */ void setAll(byte[] sequence) throws InvalidSymbolException;

/** * Replaces the contents of this sequence with the symbols contained in the * given byte array. If validate is true, the array contents are validated * against the current alphabet for this sequence, resulting in an * "InvalidSymbolException" being thrown if the array contains invalid * symbols. Setting validate to false overrides the validation check, and * assumes that the byte array data being presented has already been * validated. As validation is so fast, skipping it typically saves only about * 20% of the execution time of this method, so this variant should not * normally be used. * * @param sequence * the sequence * @param validate * the validate * @throws InvalidSymbolException * the invalid symbol exception */ void setAll(byte[] sequence, boolean validate) throws InvalidSymbolException;

/** * Gets the byte at the given index. * * @param index * the index * @return the byte */ default byte getByte(final int index) { return getByte((long) index); }

/** * Gets the byte at the given index. * * @param index * the index * @return the byte */ byte getByte(long index);

/** * Sets the byte at the given index. * * @param index * @param symbol * @throws InvalidSymbolException */ default void set(final int index, final byte symbol) throws InvalidSymbolException { set((long) index, symbol); }

/** * Sets the byte at the given index. * * @param index * @param symbol * @throws InvalidSymbolException */ void set(long index, byte symbol) throws InvalidSymbolException;

/** * Appends a new symbol to the end of the sequence. * * @param symbol * @throws InvalidSymbolException */


void add(final byte symbol) throws InvalidSymbolException;

/** * @see java.lang.CharSequence#charAt(int) */ @Override default char charAt(final int index) { return (char) getByte(index); }

/** * @see java.lang.CharSequence#subSequence(int, int) */ @Override default CharSequence subSequence(final int fromIndex, final int toIndex) { return subList(fromIndex, toIndex); }

/** * @see java.lang.CharSequence#length() */ @Override default int length() { return size(); }

/** * @see org.java0.collection.LongList#get(int) */ @Override default Byte get(final int index) { return getByte(index); }

/** * @see org.java0.collection.LongList#get(long) */ @Override default Byte get(final long index) { return getByte(index); }

/** * @see org.java0.collection.LongList#set(int, java.lang.Object) */ @Override default Byte set(final int index, final Byte symbol) throws InvalidSymbolException { final Byte oldVal = get(index); set(index, symbol.byteValue()); return oldVal; }

/** * @see org.java0.collection.LongList#set(long, java.lang.Object) */ @Override default Byte set(final long index, final Byte symbol) { final Byte oldVal = get(index); set(index, symbol.byteValue()); return oldVal; }

/** * @see org.java0.collection.LongList#add(long, java.lang.Object) */ @Override default void add(final long index, final Byte value) { throw new UnsupportedOperationException(); }

/**


* @see org.java0.collection.LongList#subList(long, long) */ @Override ByteSeq subList(long fromIndex, long toIndex);

/** * @see org.java0.collection.LongList#subList(int, int) */ @Override default ByteSeq subList(final int fromIndex, final int toIndex) { return subList((long) fromIndex, (long) toIndex); }

/** * @see org.java0.collection.DefaultList#iterator() */ @Override default ByteSeqIterator iterator() { return listIterator(0L); }

/** * @see org.java0.collection.DefaultList#listIterator() */ @Override default ByteSeqIterator listIterator() { return listIterator(0L); }

/** * @see org.java0.collection.LongList#listIterator(int) */ @Override default ByteSeqIterator listIterator(final int index) { return listIterator((long) index); }

/** * @see org.java0.collection.LongList#listIterator(long) */ @Override ByteSeqIterator listIterator(long index);

}

4.6.1.3 Codec

/** * A Codec provides a method to encode / decode data from one format to another. * For example, a codec could convert sequence data between a single byte * representation and a two-bit representation, or just compress / decompress * data using a variable length compression algorithm. * * @author Hugh Eaves * * @param <D> * the type of decoded data * @param <E> * the type of encoded data */ public interface Codec<D, E> extends IdBasedFactoryObject {

/** * Checks to see if this Codec supports the given Alphabet. * * @param alphabet * the alphabet * @return true, if the given alphabet is supported by this codec */ boolean supportsAlphabet(Alphabet alphabet);


/** * Decodes all of the encoded data. * * @param alphabet * the Alphabet * @param encodedData * the encoded data * @param decodedLength * the expected length of the decoded data * @return */ public List<D> decodeAll(Alphabet<D> alphabet, List<E> encodedData, int decodedLength);

/** * Decodes the symbol at the specified position in the unencoded data. * * @param alphabet * the alphabet * @param encodedData * the encoded data * @param decodedLength * the expected length of the decoded data * @param pos * the position of the element that should be decoded (i.e., the * position in the decoded data, not the encoded data) * @return */ public D decode(Alphabet<D> alphabet, List<E> encodedData, int decodedLength, int pos);

/** * @param alphabet * @param decodedData * @return */ public List<E> encodeAll(Alphabet<D> alphabet, List<D> decodedData);

/** * Replaces the symbol at the specified position with the new symbol. * * @param alphabet * @param encodedData * @param symbol * @param pos */ public void encode(Alphabet<D> alphabet, List<E> encodedData, D symbol, int pos);

/** * Decodes the specified block number in the encoded data. * * @param alphabet * @param encodedData * @param decodedBlock * @param blockNum * @return */ public List<D> decodeBlock(Alphabet<D> alphabet, List<E> encodedData, List<D> decodedBlock, int blockNum);

/** * Returns the size of the given block number * * @param blockNum * @return */ public int blockSize(int blockNum);

}

4.6.1.4 ByteByteCodec

/** * Decodes / encodes byte values into an array of * byte values. * * @author Hugh Eaves */ public interface ByteByteCodec extends ObjectByteCodec {

/** * Decode all the data in the sequence. * * @param alphabet * the alphabet * @param encodedData * the encoded data * @param decodedLength * the expected length of the decoded data * @return the decoded byte array */ public byte[] decodeAll(ByteAlphabet alphabet, byte[] encodedData, int decodedLength);

/** * Decode the value at a single position in the sequence. * * @param alphabet * the alphabet * @param encodedData * the encoded data * @param decodedLength * the expected length of the decoded data * @param pos * the position of the element to decode * @return the byte */ public byte decode(ByteAlphabet alphabet, byte[] encodedData, int decodedLength, int pos);

/** * Decodes the given block number in the encoded data * * @param alphabet * @param encodedData * @param decodedBlock * @param blockNum * @return */ public byte[] decodeBlock(ByteAlphabet alphabet, byte[] encodedData, byte[] decodedBlock, int blockNum);

/** * Encode all the data into the sequence, replacing any existing data. * * @param alphabet * the alphabet * @param encodedData * the encoded data * @param decodedLength * the length * @param decodedData * the decoded data * @return the byte[] */ public byte[] encode(ByteAlphabet alphabet, byte[] encodedData, int decodedLength, byte[] decodedData);

/** * Encode a single value, replacing the value at the given position. * * @param alphabet * the alphabet * @param encodedData * the encoded data * @param decodedLength * the length of the decoded data * @param symbol * the symbol * @param index * the index */ public void encode(ByteAlphabet alphabet, byte[] encodedData, int decodedLength, byte symbol, int index); }

4.6.1.5 Alphabet

/** * An {@code Alphabet} represents a specific subset of all the possible values * of a particular Java type. * * @author Hugh Eaves * * @param <T> * the type of values in the alphabet */ public interface Alphabet<T> extends IdBasedFactoryObject {

/** * Gets the symbol type. * * @return the symbol type */ public Class getSymbolType();

/** * Get the number of symbols in this alphabet. * * @return the int */ public int numSymbols();

/** * Get the order of this symbol in the alphabet. * * @param value * the value * @return the ordinal for symbol */ public int getOrdinalForSymbol(T value);

/** * Get the symbol for a given ordinal. * * @param ordinal * the ordinal * @return the symbol for ordinal */ public T getSymbolForOrdinal(int ordinal);

/** * Determine if a symbol is a member of this alphabet. * * @param symbol * symbol to check for validity * @return true if the symbol value is a member of this alphabet. */ public boolean isValid(T symbol);

/** * Determine if the symbols in the given array are all members of this * alphabet. * * @param symbols * symbols to check for validity * @return true if all the symbols are members of this alphabet. */ public default boolean isValid(final T[] symbols) {


return (isValid(symbols, 0, symbols.length)); }

/** * Determine if the symbols in the specified portion of the given array are * all members of this alphabet. * * @param symbols * the symbols * @param start * the start * @param end * the end * @return true if all the symbols are members of this alphabet. */ public default boolean isValid(final T[] symbols, final int start, final int end) { for (int i = start; i < end; ++i) { if (!isValid(symbols[i])) { return false; } } return true; }

/** * Check validity. * * @param symbol * the symbol * @return the invalid symbol info */ public default InvalidSymbolInfo checkValidity(final T symbol) { if (!isValid(symbol)) { return new InvalidSymbolInfo(symbol); } return null; }

/** * Checks whether the symbols in the given array are members of this alphabet. * Returns an InvalidSymbolInfo structure for the first symbol that is not a * member of this alphabet, or null, if all symbols are members of this * alphabet. * * @param symbols * the symbols * @return the invalid symbol info */ public default InvalidSymbolInfo checkValidity(final T[] symbols) { return checkValidity(symbols, 0, symbols.length); }

/** * Checks whether the symbols in the given array (between start and end) are * members of this alphabet. Returns an InvalidSymbolInfo structure for the * first symbol that is not a member of this alphabet, or null, if all * symbols are members of this alphabet. * * @param symbols * the symbols * @param start * the start * @param end * the end * @return the invalid symbol info */ public default InvalidSymbolInfo checkValidity(final T[] symbols, final int start, final int end) { for (int i = start; i < end; ++i) { if (!isValid(symbols[i])) {


return new InvalidSymbolInfo(symbols[i], i); } } return null; }

/** * Validates the given symbol against this alphabet. * * @param symbol * the symbol * @throws InvalidSymbolException * thrown if the symbol is not a member of this alphabet */ public default void validate(final T symbol) throws InvalidSymbolException { final InvalidSymbolInfo info = checkValidity(symbol); if (info != null) { throw new InvalidSymbolException(info); } }

/** * Validates the given symbols. Throws an InvalidSymbolException if any of * the symbols in the array are invalid. * * @param symbols * the symbols * @throws InvalidSymbolException * the invalid symbol exception */ public default void validate(final T[] symbols) throws InvalidSymbolException { final InvalidSymbolInfo info = checkValidity(symbols); if (info != null) { throw new InvalidSymbolException(info); } }

/** * Validates a portion of the given array of symbols. Throws an * InvalidSymbolException if any of the symbols in the specified portion * of the array are invalid. * * @param symbols * the symbols * @param start * the start * @param end * the end * @throws InvalidSymbolException * the invalid symbol exception */ public default void validate(final T[] symbols, final int start, final int end) throws InvalidSymbolException { final InvalidSymbolInfo info = checkValidity(symbols, start, end); if (info != null) { throw new InvalidSymbolException(info); } }

/** * Determine if all symbols in the list are members of this alphabet. * * @param symbols * the symbols * @return true if all symbols in the list are members of this alphabet. */ public default boolean isValid(final List<T> symbols) { for (final T symbol : symbols) { if (!isValid(symbol)) { return false; } } return true; } }

4.6.1.6 ByteAlphabet

/** * Represents an alphabet where the underlying type is a Java "byte" primitive. * This interface avoids the boxing/unboxing that would be required with Byte * objects. * * @author Hugh Eaves */ public interface ByteAlphabet extends Alphabet<Byte> {

/** * Gets the ordinal for symbol. * * @param value * the value * @return the ordinal for symbol * @see org.biomojo.alphabet.Alphabet#getOrdinalForSymbol(java.lang.Object) */ @Override default int getOrdinalForSymbol(final Byte value) { return getOrdinalForSymbol(value.byteValue()); }

/** * Gets the symbol for ordinal. * * @param ordinal * the ordinal * @return the symbol for ordinal * @see org.biomojo.alphabet.Alphabet#getSymbolForOrdinal(int) */ @Override default Byte getSymbolForOrdinal(final int ordinal) { return getByteSymbolForOrdinal(ordinal); }

/** * Checks if is valid. * * @param symbol * the symbol * @return true, if is valid * @see org.biomojo.alphabet.Alphabet#isValid(java.lang.Object) */ @Override default boolean isValid(final Byte symbol) { return isValid(symbol.byteValue()); }

/** * Get the order of this symbol in the alphabet. * * @param value * the value * @return the ordinal for symbol */ int getOrdinalForSymbol(byte value);

/** * Get the symbol for a given ordinal. * * @param ordinal * the ordinal * @return the symbol for ordinal */


byte getByteSymbolForOrdinal(int ordinal);

/** * Determine if a symbol is a member of this alphabet. * * @param symbol * symbol to check for validity * @return true if the symbol value is a member of this alphabet. */ boolean isValid(byte symbol);

/** * Determine if the symbols in the given array are all members of this * alphabet. * * @param symbols * symbols to check for validity * @return true if all the symbols are members of this alphabet. */ default boolean isValid(final byte[] symbols) { return isValid(symbols, 0, symbols.length); }

/** * Determine if the symbols in the specified portion of the given array are * all members of this alphabet. * * @param symbols * the symbols * @param start * the start * @param end * the end * @return true if all the symbols are members of this alphabet. */ default boolean isValid(final byte[] symbols, final int start, final int end) { for (int i = start; i < end; ++i) { if (!isValid(symbols[i])) { return false; } } return true; }

/** * Checks if the given symbol is a member of this alphabet. Returns an * InvalidSymbolInfo structure if the symbol is not a member of this * alphabet, or null if the symbol is a member. * * @param symbol * the symbol * @return the invalid symbol info */ default InvalidSymbolInfo checkValidity(final byte symbol) { if (!isValid(symbol)) { return new InvalidSymbolInfo(symbol); } return null; }

/** * Checks whether the symbols in the given array are members of this alphabet. * Returns an InvalidSymbolInfo structure for the first symbol that is not a * member of this alphabet, or null, if all symbols are members of this * alphabet. * * @param symbols * the symbols * @return the invalid symbol info */ default InvalidSymbolInfo checkValidity(final byte[] symbols) {


return checkValidity(symbols, 0, symbols.length); }

/** * Checks whether the symbols in the given array (between start and end) are * members of this alphabet. Returns an InvalidSymbolInfo structure for the * first symbol that is not a member of this alphabet, or null, if all * symbols are members of this alphabet. * * @param symbols * the symbols * @param start * the start * @param end * the end * @return the invalid symbol info */ default InvalidSymbolInfo checkValidity(final byte[] symbols, final int start, final int end) { for (int i = start; i < end; ++i) { if (!isValid(symbols[i])) { return new InvalidSymbolInfo(symbols[i], i); } } return null; }

/** * Validates the given symbol against this alphabet. * * @param symbol * the symbol * @throws InvalidSymbolException * thrown if the symbol is not a member of this alphabet */ default void validate(final byte symbol) throws InvalidSymbolException { final InvalidSymbolInfo info = checkValidity(symbol); if (info != null) { throw new InvalidSymbolException(info); } }

/** * Validates the given symbols. Throws an InvalidSymbolException if any of * the symbols in the array are invalid. * * @param symbols * the symbols * @throws InvalidSymbolException * thrown if the symbol is not a member of this alphabet */ default void validate(final byte[] symbols) throws InvalidSymbolException { final InvalidSymbolInfo info = checkValidity(symbols); if (info != null) { throw new InvalidSymbolException(info); } }

/** * Validates a portion of the given array of symbols. Throws an * InvalidSymbolException if any of the symbols in the specified portion * array are invalid. * * @param symbols * the symbols * @param start * the start * @param end * the end * @throws InvalidSymbolException * thrown if the symbol is not a member of this alphabet


*/ default void validate(final byte[] symbols, final int start, final int end) throws InvalidSymbolException { final InvalidSymbolInfo info = checkValidity(symbols, start, end); if (info != null) { throw new InvalidSymbolException(info); } }

/** * @see org.biomojo.alphabet.Alphabet#getSymbolType() */ @Override default Class getSymbolType() { return Byte.class; } }


4.6.2 UML Diagrams for core packages

See next page.


4.6.2.1 Sequence package


4.6.2.2 Alphabet package


4.6.2.3 Codec package


4.6.2.4 All packages


4.6.3 BioMojo Database Schema - Entity-Relationship Diagram


5 Conclusion

A theme woven throughout this dissertation is that this work enables processing of the ever-larger volumes of sequence data being generated by contemporary next generation sequencing platforms. This body of work develops algorithms, data structures, and software that significantly improve the state of the art in bioinformatic software performance. If the rate of hardware improvement continues to follow Moore's law, with performance roughly doubling every two years, the 30x speedup demonstrated on some types of tasks is equivalent to having access to computers from 10 years in the future (2^(10/2) = 32x) in today's world.

Several trends in scientific computing make this type of work even more pertinent.

Scientific computing, like the rest of the computing industry, is moving toward metered, cloud based computing. Although institutions will continue to provide in-house computing resources for some time to come, the push to cloud solutions will increase awareness of the costs involved in maintaining in-house computing capacity. If these costs are eventually passed down to individual research groups, the current approach of combating software and algorithmic inefficiency with hardware scaling will no longer be the first choice to achieve timely results.

Another trend that makes this work relevant is the move to web-based applications. More and more bioinformatic applications are provided as hosted web applications, instead of being deployed locally. This reduces the complexity of managing software updates and deployments for both developers and users, but also shifts the cost of computing primarily to whomever hosts the application. BioMojo is especially well-suited to reducing the costs of developing and hosting web-based scientific applications, due to its lightweight, efficient architecture and its integration with modern web development technologies. Overall, I am hopeful that this work helps to explore and expand the limits of what is achievable on modern computing platforms for bioinformatic data analysis.


6 References

1. Liu, L. et al. Comparison of next-generation sequencing systems. J. Biomed. Biotechnol. 2012, 251364 (2012).
2. Wetterstrand, K. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP). at
3. The 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012).
4. Ecker, J. R. et al. Genomics: ENCODE explained. Nature 489, 52–55 (2012).
5. Turnbaugh, P. J. et al. The human microbiome project: exploring the microbial part of ourselves in a changing world. Nature 449, 804–810 (2007).
6. Stephens, Z. D. et al. Big data: Astronomical or genomical? PLoS Biol. 13, 1–11 (2015).
7. Hilbert, M. & López, P. The world's technological capacity to store, communicate, and compute information. Science 332, 60–65 (2011).
8. Berger, B., Peng, J. & Singh, M. Computational solutions for omics data. Nat. Rev. Genet. 14, 333–346 (2013).
9. Moore, G. E. Cramming more components onto integrated circuits. Electronics 38, (1965).
10. Moore, G. E. Progress in digital integrated electronics. 1975 Int. Electron Devices Meet. 21, 11–13 (1975).
11. Stein, L. D. The case for cloud computing in genome informatics. Genome Biol. 11, 207 (2010).
12. Schatz, M. C. & Langmead, B. The DNA data deluge. IEEE Spectr. 50, 28–33 (2013).
13. Schadt, E. E., Turner, S. & Kasarskis, A. A window into third-generation sequencing. Hum. Mol. Genet. 19, R227–40 (2010).
14. Flicek, P. et al. Ensembl 2014. Nucleic Acids Res. 42, D749–D755 (2014).
15. Reddy, M. API Design for C++. (Elsevier, 2011).
16. Prlić, A. et al. BioJava: an open-source framework for bioinformatics in 2012. Bioinformatics 28, 2693–5 (2012).
17. Döring, A., Weese, D., Rausch, T. & Reinert, K. SeqAn - an efficient, generic C++ library for sequence analysis. BMC Bioinformatics 9, 11 (2008).
18. HTSJDK by the Broad Institute. at
19. Cock, P. J. A. et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422–3 (2009).
20. Stajich, J. E. et al. The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 12, 1611–8 (2002).
21. Sedgewick, R. & Flajolet, P. An Introduction to the Analysis of Algorithms. (Addison-Wesley, 2013).
22. Georges, A., Buytaert, D. & Eeckhout, L. Statistically rigorous Java performance evaluation. ACM SIGPLAN Not. 42, 57 (2007).
23. Sambo, F., Di Camillo, B., Toffolo, G. & Cobelli, C. Compression and fast retrieval of SNP data. Bioinformatics 30, 3078–3085 (2014).
24. Paquola, A. C. M., Machado, A. A., Reis, E. M., da Silva, A. M. & Verjovski-Almeida, S. Zerg: a very fast BLAST parser library. Bioinformatics 19, 1035–1036 (2003).
25. Bald, T. et al. pymzML - Python module for high-throughput bioinformatics on mass spectrometry data. Bioinformatics 28, 1052–1053 (2012).
26. Coffin, M. & Saltzman, M. J. Statistical analysis of computational tests of algorithms and heuristics. INFORMS J. Comput. 12, 24–44 (2000).
27. Braude, E. J. Software Design: From Programming to Architecture. (J. Wiley, 2004).
28. Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–402 (1997).
29. Smith, T. F. & Waterman, M. S. Identification of common molecular subsequences. J. Mol. Biol. 147, 195–7 (1981).
30. Meyerovich, L. A. & Rabkin, A. Empirical analysis of programming language adoption. OOPSLA '13 - Proceedings of the 2013 ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages & Applications 1–18 (2013). doi:10.1145/2509136.2509515
31. The 2015 Top Ten Programming Languages - IEEE Spectrum. at
32. Carbonnelle, P. PYPL PopularitY of Programming Language Index. at
33. TIOBE Software: TIOBE Index. at
34. Prabhu, P. et al. A survey of the practice of computational science. State Pract. Reports - SC '11 1 (2011). doi:10.1145/2063348.2063374
35. Hannay, J. E., Macleod, C. & Singer, J. How do scientists develop and use scientific software? Am. Sci. 1–8 (2009).
36. Merali, Z. Computational science: ...Error. Nature 467, 775–777 (2010).
37. Wilson, G. et al. Best practices for scientific computing. PLoS Biol. 12, e1001745 (2014).
38. R Core Team. R: A Language and Environment for Statistical Computing. (2015). at
39. Perkel, J. M. Pick up Python. Nature 518, 125–126 (2015).
40. Fourment, M. & Gillings, M. R. A comparison of common programming languages used in bioinformatics. BMC Bioinformatics 9, 82 (2008).
41. Fourment, M. & Gillings, M. R. Language Benchmark Results. at
42. Bagley, D., Fulgham, B. & Gouy, I. Computer Language Benchmarks Game. at
43. Apache Hadoop. at
44. GridGain In-Memory Data Fabric. at
45. HazelCast In-Memory Data Grid. at
46. Apache Cassandra. at
47. Geneious - Home page. at
48. CLC Bio. at
49. McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–303 (2010).
50. Picard Tools - By Broad Institute. at
51. Eaves, H. L. & Gao, Y. MOM: maximum oligonucleotide mapping. Bioinformatics 25, 969–70 (2009).
52. Flouri, T., Giaquinta, E., Kobert, K. & Ukkonen, E. Longest common substrings with k mismatches. Inf. Process. Lett. 115, 643–647 (2015).
53. Shafi, A., Carpenter, B., Baker, M. & Hussain, A. A comparative study of Java and C performance in two large-scale parallel applications. Concurr. Comput. Pract. Exp. 21, 1882–1906 (2009).
54. Waldmann, J., Gerken, J., Hankeln, W., Schweer, T. & Glöckner, F. O. FastaValidator: an open-source Java library to parse and validate FASTA formatted sequences. BMC Res. Notes 7, 365 (2014).
55. Ryu, T.-W. Benchmarking of BioPerl, Perl, BioJava, Java, BioPython, and Python for primitive bioinformatics tasks and choosing a suitable language. Int. J. Contents 5, 6–15 (2009).
56. Lipman, D. J. & Pearson, W. R. Rapid and sensitive protein similarity searches. Science 227, 1435–1441 (1985).
57. Pearson, W. R. & Lipman, D. J. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. U. S. A. 85, 2444–8 (1988).
58. Cock, P. J. A., Fields, C. J., Goto, N., Heuer, M. L. & Rice, P. M. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res. 38, 1767–71 (2010).
59. Anders, S., Pyl, P. T. & Huber, W. HTSeq - a Python framework to work with high-throughput sequencing data. (2014). doi:10.1101/002824
60. Drummond, A. J. Java Evolutionary Biology Library. at
61. Fisher, R. A. The Design of Experiments. (Oliver and Boyd, 1960).
62. Java Platform, Standard Edition HotSpot Virtual Machine Garbage Collection Tuning Guide. at
63. Johansson, S. JEP 248: Make G1 the Default Garbage Collector. at
64. Cornish-Bowden, A. Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984. Nucleic Acids Res. 13, 3021–30 (1985).
65. Nomenclature and symbolism for amino acids and peptides. Recommendations 1983. Eur. J. Biochem. 138, 9–37 (1984).
66. Hirschberg, D. S. A class of dynamic memory allocation algorithms. 35, 615–618 (1973).
67. Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120 (2014).
68. Ip, C. L. C. et al. MinION Analysis and Reference Consortium: Phase 1 data release and analysis. F1000Research (2015). doi:10.12688/f1000research.7201.1
69. Java Persistence 2.1. at
70. JavaEE. at
71. Spring. at
72. Pareto, V. Manuale di economia politica. 13, (Societa Editrice, 1906).
73. Leventov, R. Koloboke Collections. at
74. Vigna, S. fastutil: Fast & compact type-specific collections for Java. at
75. Gamma, E., Helm, R., Johnson, R. & Vlissides, J. Design Patterns: Elements of Reusable Object-oriented Software. (Addison-Wesley Longman Publishing Co., Inc., 1995).
76. The HDF Group. Hierarchical Data Format, version 5, 1997-2016.


7 Vita

Hugh Eaves was born in Birmingham, England on May 29, 1971. He left high school in 1988 to attend college at the Clarkson School at Clarkson University in Potsdam, New York. After completing his freshman year at Clarkson University, he spent several years working in the computing industry, before returning to college at Virginia Commonwealth University, where he completed his Bachelor of Science in Computer Science in 1997. After a successful post-graduation career in computing, Hugh returned to school at VCU to pursue a Ph.D. in Integrative Life Sciences, and was awarded his Ph.D. in 2016.
