Parallelization Methods for the Distribution of High-Throughput Algorithms

by

Eric James Rees, B.S.

A Dissertation

In

COMPUTER SCIENCE

Submitted to the Graduate Faculty of Texas Tech University in Partial Fulfillment of the Requirements for the Degree of

DOCTOR OF PHILOSOPHY

Submitted to:

Dr. Eunseog Youn

Dr. Scot Dowd

Dr. Michael San Francisco

Dr. Peggy Gordon Miller, Dean of the Graduate School

May, 2011

Copyright 2011, Eric Rees


Acknowledgements

Dr. Eunseog Youn and Dr. Scot Dowd have both been excellent advisors. Dr. Youn has been with me since the beginning of my time as a Bioinformaticist and has given me the guidance needed to ensure I was able to solve the problems I encountered.

Dr. Dowd has always been the one to keep me focused and on task while helping me see solutions to problems by pointing out new ways to view the problem. Both have forced me to think in ways I had never been forced to think before and without their guidance I may have never reached this point.

My family, for everything they have done for me throughout the years. My parents, Bill and Linda, have always supported my decisions and have always pushed me to excel at everything I try. I could not have done it without the great support of my family.

My friends from Texas Tech, with whom I have worked in some regard or another for 5 years: Eric Garcia, Brad Nemanich, and Viktoria Gonthcharova. Each of them has been there for me when I needed a break from work and research or when I needed help solving an equation.

My friends outside of Texas Tech who helped me relax when things became stressful, including Morgan Cadena, Joanna Burk, Shawna Miller, and Clint Miller.

Lastly, I would like to thank my amazing girlfriend Teresa, whose loving support and ability to know exactly what to do to keep me going is what made this achievement possible.


Table of Contents

Acknowledgements ...... ii

Table of Contents ...... iii

Abstract ...... vi

List of Tables ...... viii

List of Figures ...... ix

Chapter 1 Introduction ...... 1

1.1 Motivation ...... 1

1.2 Problem Statement ...... 4

1.3 Overview of the Dissertation ...... 8

Chapter 2 Related Work...... 9

2.1 Bioinformatics...... 9

2.2 Distributed Computing...... 11

2.3 BLAST ...... 17

2.3.1 BLASTN and MegaBLAST ...... 24

2.3.2 BLASTP ...... 25

2.3.3 BLASTX ...... 25

2.3.4 TBLASTN...... 26

2.3.5 TBLASTX...... 27

2.4 Distributed BLAST ...... 27

2.4.1 Query Set Segmentation ...... 28

2.4.2 Sequence Database Segmentation...... 30

2.4.3 E-Value Calculation ...... 32

2.5 Existing Distributed BLAST Applications ...... 36


Chapter 3 Approach ...... 43

3.1 Creating a Distributed System from Existing Nodes ...... 43

3.1.1 Algorithm for creating a Distributed System from Existing Nodes ...... 44

3.1.2 Methods for creating a Distributed Application Framework ...... 46

3.1.3 Meeting the definition of a distributed system ...... 53

3.1.4 Meeting the challenges of a distributed system ...... 55

3.2 Algorithm behind the Distributed BLAST Master Node ...... 62

3.2.1 Node Discovery and Connection Establishment...... 63

3.2.2 User Interface ...... 64

3.2.3 Query Segmentation...... 65

3.2.4 Database Segmentation ...... 66

3.2.5 Database and Query Transfer...... 67

3.2.6 Results Compilation ...... 69

3.3 Algorithm behind distributed BLAST Worker Nodes ...... 70

3.3.1 Master Discovery and Connection Establishment ...... 71

3.3.2 Database and Query Transfer...... 72

3.3.3 BLAST ...... 73

3.3.4 Result Correction ...... 74

3.3.5 Result Transfer ...... 78

Chapter 4 Results ...... 79

4.1 Experiment Setup and Environment ...... 79

4.2 Distributed BLAST versus local BLAST on a small database ...... 82

4.2.1 Comparing Local and Distributed BLAST on a small database using BLASTN ...... 84

4.2.2 Comparing Local and Distributed BLAST on a small database using TBLASTX...... 87

4.3 Distributed BLAST versus local BLAST on a large database ...... 92

Chapter 5 Conclusions ...... 97

5.1 Results and Conclusions ...... 97

5.2 Future Work ...... 100

References ...... 101


Abstract

The development of high-throughput bioinformatics technologies has caused a massive influx of biological data over the course of the past decade. During this same span of time, computational hardware has also been rapidly increasing in speed while decreasing in price, multi-core processors have become standard in home and office environments, and distributed and cloud based computing has become affordable and readily available to researchers with implementations such as Amazon’s S3, Microsoft’s Azure, Google’s App Engine, and the 3Tera Cloud.

Bioinformatics software tools such as BLAST, a tool for finding local alignments between a set of unknown genetic sequences and a set of known genetic sequences, often have simple interfaces and few installation requirements so that biologists can use them easily in the laboratory without needing an in-depth knowledge of how computer systems work. This, however, is rarely the case for distributed implementations of bioinformatics tools, which often require the user to first set up and configure the underlying program that will handle the distribution, such as the Message Passing Interface (MPI) or Remote Procedure Calls (RPC). Once the underlying distribution algorithm is chosen, many of the software tools require the user to then configure the program to work with their chosen method and, in some cases, write the necessary source code to link the program with the underlying service. These are difficult steps for most computer scientists and are near impossible for the average biologist.


By constructing a modularized set of methods that can connect to, broadcast to, and read from a multicast created by the methods, future bioinformatics software developers will be able to construct the underlying message passing system without requiring the end-user, often a biologist, to set up and configure one of their own.

Using these multicast methods will allow any program the ability to seek out and track any nodes on the network that will be used in the distributed system. This communication method allows the program to easily scale up and down depending on available nodes without direct user intervention to alter the size of the system.

This system is then tested by creating a program that connects NCBI's Basic Local Alignment Search Tool to the multicast system to allow the BLAST algorithm to be distributed across multiple nodes. This new system will demonstrate how future programs could connect stand-alone tools, such as BLAST, to the multicast system to create programs that will execute on a distributed system and automatically scale depending on the network size without altering the tool's source code.


List of Tables

Table 1: Distributed BLAST Implementations - Fault tolerance comparisons ...... 3
Table 2: Distributed BLAST Implementations - Speed up comparisons ...... 4
Table 3: Brief description regarding the data used and returned by each BLAST program ...... 24
Table 4: Brief Description of Machine Classes ...... 81
Table 5: Results from Experiment #1 ...... 85
Table 6: Results from Experiment #2 ...... 88
Table 7: Results from Experiment #3 ...... 94


List of Figures

Figure 1: A simplified view of the BLAST algorithm ...... 23
Figure 2: Diagram of Query Set Segmentation ...... 29
Figure 3: Diagram of Database Segmentation ...... 31
Figure 4: Distributed System Layout and Interactions ...... 62
Figure 5: Overview of the BLAST Output Layout ...... 76
Figure 6: Experiment #1 Execution Times - Line Graph ...... 86
Figure 7: Experiment #1 Execution Times - Bar Graph ...... 87
Figure 8: Experiment #2 Execution Times - Line Graph ...... 91
Figure 9: Experiment #2 Execution Times - Bar Graph ...... 92
Figure 10: Experiment #3 Execution Times - Line Graph ...... 95
Figure 11: Experiment #3 Execution Times - Bar Graph ...... 96


Chapter 1 Introduction

1.1 Motivation

Bioinformatics is a rapidly expanding interdisciplinary field that applies computer science to answer the questions of biology (Nair 2007). Due to the development of numerous high-throughput technologies, the amount of biological data is increasing rapidly (Troyanskaya, et al. 2003). To meet this massive influx of data, computational hardware has also been rapidly increasing in speed while decreasing in price. Multi-core processors are now becoming the standard in both home and office environments and, concurrently, distributed computers and cloud-based computing have also become readily available to researchers with implementations such as Amazon's S3 (Amazon 2010) (Palankar, et al. 2008), Microsoft's Azure (Microsoft 2010), Google's App Engine (Google 2010), and the 3Tera Cloud (3tera 2010). However, despite these recent trends towards distributed computing and multiprocessor parallelization, few bioinformatics algorithms have been implemented to make use of this additional computational power.

The tools used in bioinformatics are developed by computer scientists for use by biologists. Tools such as BLAST, a tool for finding local alignments between a set of unknown genetic sequences and a set of known genetic sequences, have a simple command line interface that does not require users to have a deep understanding of computers or computer science. These tools are often executed by simply downloading the program and running it with little to no set up. Few, if any, of the major bioinformatics tools have requirements beyond which operating systems they will and will not run on, thus keeping the entire process simple for users. However, this is rarely the case for distributed implementations of bioinformatics tools. Distributed implementations of these tools, including BLAST, often require the user to first set up the underlying program that will handle the distribution, such as the Message Passing Interface (MPI). This additional step can be rather complex even for a computer specialist, much less the average biologist.

Once the underlying distribution program has been set up, many distributed implementations will still require that the users then set up the program to work with the options chosen during the creation of the MPI or RPC environment, adding yet another step that can be quite complex to biologists and other non-technical users.

Distributed implementations of bioinformatics tools often suffer from a major issue that could easily be corrected, but is often completely ignored by programmers. The issue is that bioinformatics algorithms developed to run in distributed environments should be fault tolerant, because a core principle of distributed systems is that they must be able to tolerate system failures and faults. Thus, distributed bioinformatics algorithms must account for faults that may occur within the distributed system and handle these faults accordingly. The problem, however, is that most distributed algorithms, including many implementations of the BLAST algorithm, are unable to handle many common faults that occur during their execution, such as: the master node or worker nodes losing connection to the network, the system encountering a race condition on a worker node, the system encountering a race condition on the master node, the system encountering a deadlock on a worker node, or the system encountering a deadlock on the master node.

The NCBI bioinformatics tool BLAST, the Basic Local Alignment Search Tool, is one of the most widely used bioinformatics tools in the field. For this reason, this dissertation shall focus on this tool's algorithm in order to provide examples of how the distributed algorithm would interact with a bioinformatics tool as well as to show an example of our distributed algorithm being used in a real-world tool. A survey of existing distributed BLAST implementations reveals that most of these applications do not have much, if any, ability to perform fault tolerance or recovery.

The results of this survey can be found below in Table 1.

Distributed Implementation    Fault Tolerance and Recovery    Reference
BeoBLAST                      Medium                          (Grant, et al. 2002)
Condor BLAST                  Medium                          (Condor Team 2004)
mpiBLAST                      Low                             (Darling, Carey and Feng 2003)
Soap-HT-BLAST                 Medium                          (Wang and Mu 2003)
Squid                         Medium                          (Carvalho, et al. 2005)
W.ND-BLAST                    Medium                          (Dowd, et al. 2005)

Table 1: Distributed BLAST Implementations - Fault tolerance comparisons

Distributed BLAST implementations should be able to achieve linear speed up in most cases and super linear speed up in BLAST runs involving databases that are too large to fit in memory. While achieving linear speed up is not an incredibly difficult task, achieving super linear speed up requires some additional work that most distributed BLAST applications have failed to implement. A survey of existing distributed BLAST implementations reveals that most of these applications have not yet implemented the code required to achieve super linear speed up. The results of this survey can be found below in Table 2.

Distributed Implementation    Achieves super linear speed up    Reference
BeoBLAST                      No                                (Grant, et al. 2002)
Condor BLAST                  No                                (Condor Team 2004)
mpiBLAST                      Yes                               (Darling, Carey and Feng 2003)
Soap-HT-BLAST                 No                                (Wang and Mu 2003)
Squid                         No                                (Carvalho, et al. 2005)
W.ND-BLAST                    No                                (Dowd, et al. 2005)

Table 2: Distributed BLAST Implementations - Speed up comparisons

Currently there exists no distributed BLAST implementation that has a high, or even a medium, amount of fault tolerance and recovery while also achieving super linear speed up. However the distributed implementation of the BLAST algorithm constructed during this research has the ability to detect, tolerate, and recover from faults while still achieving super linear speed up on large databases.

1.2 Problem Statement

Distributed bioinformatics applications must conform to the standards set forth for all distributed algorithms, discussed in depth below in section 2.2, Distributed Computing. These standards require that distributed algorithms be able to run on heterogeneous machines, run securely, have the ability to scale, use proper fault tolerance and recovery, be concurrent, and allow for transparency. However, a standard that is implied but not officially stated is that a distributed algorithm must actually be capable of performing the same task as a non-distributed algorithm at faster speeds. If a distributed algorithm cannot accomplish this, then the algorithm should not be distributed at all. Expanding on this idea, distributed algorithms should attempt to minimize processing time without introducing errors into the final output.

As such, our distributed algorithm and our distributed BLAST application must go beyond the work of their predecessors by meeting all of the requirements stated above. To accomplish this, the following steps must be taken:

1. Devise a communication method capable of automatically scaling the size of a distributed system.

2. Develop an application framework that will allow various bioinformatics tools to securely execute in a distributed heterogeneous environment. This framework will provide fault tolerance within the heterogeneous environment while also making use of the aforementioned communication method. This application framework should be capable of:

   a. treating bioinformatics tools as extensions of the framework without requiring any alterations be made to the tool's source code,

   b. tolerating and effectively recovering from faults,

   c. allowing for concurrent operations to be performed, and

   d. manipulating data in a secure manner.

3. Create a distributed BLAST application using the application framework. This application should be capable of:

   a. acting as an add-on to the BLAST executable by running the BLAST executable supplied by NCBI without any alterations made to the BLAST source code, and

   b. maximizing the speed up gained to near linear time in most cases and super linear time in all cases where such time could be achieved.

In order to accomplish the first goal, I will construct a modularized set of methods that can connect to, broadcast to, and read from a multicast created by the methods.

This multicast will provide any program using the methods the ability to seek out and track any nodes on the network that will be used in the distributed system. This communication method will allow the program to easily scale up and down depending on available nodes without direct user intervention to alter the size of the system. The details of this communication method are explained further in section 3.1 Creating a Distributed System from Existing Nodes.
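
To make this idea concrete, the sketch below shows one way such a multicast discovery mechanism could look. It is a minimal illustration only; the group address, port, and message format are assumptions made for the example and are not taken from the framework described in Chapter 3.

```python
import socket
import struct

# Hypothetical multicast group and port, chosen for illustration only.
MCAST_GROUP = "239.255.42.99"
MCAST_PORT = 50000

def announce_presence(role: str) -> None:
    """Broadcast a short 'I am here' message to the multicast group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    message = f"NODE {role} {socket.gethostname()}".encode("utf-8")
    sock.sendto(message, (MCAST_GROUP, MCAST_PORT))
    sock.close()

def listen_for_nodes(timeout: float = 5.0) -> set:
    """Listen on the multicast group and collect the addresses of announcing nodes."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", MCAST_PORT))
    # Join the multicast group on all interfaces.
    mreq = struct.pack("4sl", socket.inet_aton(MCAST_GROUP), socket.INADDR_ANY)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    sock.settimeout(timeout)
    nodes = set()
    try:
        while True:
            data, addr = sock.recvfrom(1024)
            if data.startswith(b"NODE"):
                nodes.add(addr[0])   # track the announcing node by its IP address
    except socket.timeout:
        pass
    finally:
        sock.close()
    return nodes
```

Because any machine that joins the group is discovered the next time it announces itself, the set of known nodes grows and shrinks with the network, which is the scaling behavior described above.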

The second goal can be accomplished by constructing a pair of algorithms capable of performing distributed computations using the communication method described above. The first algorithm will establish a master node capable of communicating with remote worker nodes, collecting information from remote nodes, monitoring the status of remote nodes, establishing file transfers to and from the remote node, establishing data transfer between the master and remote worker nodes, spawning processes on remote nodes, stopping processes on remote nodes, deleting application specific files on remote nodes, taking input from the user, and generating output for the user in the form of log files and status updates. The second algorithm will establish the worker nodes capable of communicating with the master node, gathering system information for the master node, establishing file transfers to and from the master node, establishing data transfer between itself and the master node, running applications at the request of the master node, ending applications at the request of the master node, and deleting tool specific files at the request of the master node. These algorithms are described in section 3.1 Creating a Distributed System from Existing Nodes.
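
To illustrate this division of labor, the following sketch shows how a worker node might dispatch the kinds of requests listed above. The command names, payload fields, and handlers are hypothetical and stand in for the actual protocol specified in section 3.1.

```python
import os
import subprocess
from pathlib import Path

class WorkerNode:
    """Minimal illustration of a worker reacting to master-node commands."""

    def __init__(self, work_dir: str = "worker_files"):
        self.work_dir = Path(work_dir)
        self.work_dir.mkdir(exist_ok=True)
        self.running = {}  # maps a job id to a running external process

    def handle(self, command: str, payload: dict) -> dict:
        """Dispatch a single command received from the master node."""
        if command == "SYSTEM_INFO":
            return {"cpus": os.cpu_count()}
        if command == "RUN_TOOL":
            # Start an external tool (e.g. a BLAST binary) without touching its source.
            proc = subprocess.Popen(payload["argv"], cwd=self.work_dir)
            self.running[payload["job_id"]] = proc
            return {"status": "started"}
        if command == "STOP_TOOL":
            proc = self.running.pop(payload["job_id"], None)
            if proc is not None:
                proc.terminate()
            return {"status": "stopped"}
        if command == "DELETE_FILES":
            for name in payload["files"]:
                (self.work_dir / name).unlink(missing_ok=True)
            return {"status": "deleted"}
        return {"status": "unknown command"}
```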

The third goal, creating a distributed BLAST application, can be accomplished by creating two applications, a master application and a worker application. The master application will use the master node half of the communication algorithm to establish contact with and pass commands to the various worker nodes while the worker application will use the worker node half of the communication algorithm to respond to the commands of the various master nodes. The file and data transfer methods within the communication algorithm will be used to pass database files, input files, and BLAST command line instructions to the worker node as well as provide a method by which the worker node can pass completed BLAST output files back to the master. The master and worker node application algorithms are described in section 3.2 Algorithm behind the Distributed BLAST Master Node and section 3.3 Algorithm behind distributed BLAST Worker Nodes, respectively.
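
Because BLAST is treated as a black box, a worker only ever has to build a command line and run the unmodified NCBI executable. A minimal sketch of that step follows; the flags shown follow the legacy blastall-style interface and are illustrative only, since the exact options depend on the BLAST release shipped to the worker nodes.

```python
import subprocess

def run_blast_fragment(program: str, query_file: str, db_path: str, out_file: str,
                       blast_binary: str = "blastall") -> int:
    """Run an unmodified NCBI BLAST binary on one query/database pair.

    Flag names follow the legacy blastall interface and are shown only as an
    example; they must match the BLAST release actually deployed.
    """
    argv = [
        blast_binary,
        "-p", program,      # e.g. "blastn" or "tblastx"
        "-i", query_file,   # query segment assigned to this worker
        "-d", db_path,      # database (or database fragment) on this worker
        "-o", out_file,     # output file later returned to the master
    ]
    completed = subprocess.run(argv, check=False)
    return completed.returncode
```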

1.3 Overview of the Dissertation

This dissertation consists of five chapters. Chapter 2 details related work in the fields of bioinformatics and distributed computation with an emphasis on work relating to other distributed BLAST implementations and methods. Chapter 3 describes the multicast communication method used to create and maintain a distributed heterogeneous environment as well as the distributed application framework developed to allow high throughput bioinformatics programs to run across this distributed system. Chapter 3 also discusses the distributed algorithm implemented to execute distributed BLAST executions across the distributed system created by the communication method. Chapter 4 describes the results attained by executing the algorithms detailed in Chapter 3 while Chapter 5 discusses the results attained in Chapter 4 and gives ideas and plans for future work in the area.


Chapter 2 Related Work

2.1 Bioinformatics

In the paper "Computational Biology and Bioinformatics: A Gentle Overview" by Nair Achuthsankar, the author defines bioinformatics as follows: "Bioinformatics is the application of computer sciences and allied technologies to answer the questions of Biologists, about the mysteries of life" (Nair 2007). This short sentence accurately describes not only the application of computer science to biology, but also lays out the scope and magnitude of this research by applying it to solve the mysteries of life. The author continues through much of the paper laying out exactly what data we as bioinformaticians handle and how we go about working with the vast amount of biological data. Bioinformatics deals primarily with biological data in either text files containing large quantities of DNA, RNA, or protein sequences or in images containing microarray data. This data can then be analyzed using a multitude of bioinformatics algorithms, such as using Hidden Markov Models (HMMs) to find genes in DNA sequences, using the Nussinov folding algorithm to predict the secondary structure of RNA sequences, or using subcellular localization algorithms to predict a protein's location within a cell based on protein sequences.

Bioinformatics has given biologists new ways to tackle lingering problems in their field as well as more efficient methods to complete the problems that they have already solved. Moreover, bioinformatics has allowed biologists to improve the precision of their results while decreasing their time spent in the lab. For example, in the paper "Assessing the precision of high-throughput computational and laboratory approaches for the genome-wide identification of protein subcellular localization in bacteria" by Sebastien Rey et al., the authors compared computational subcellular localization methods with laboratory proteomics approaches in an attempt to determine the most effective approach for genome-wide localization characterization and annotation (Rey, Gardy and Brinkman 2005). Their final results showed that the computational methods for subcellular localization had a higher level of precision than the high-throughput laboratory approaches.

Bioinformatics has also created the need for large sequence repositories, typically referred to as sequence databases. These databases store an immense amount of sequence data that has been collected by various projects worldwide. The most notable of these databases include the European Molecular Biology Laboratory DNA database (EMBL), the DNA Data Bank of Japan (DDBJ), the National Center for Biotechnology Information's genetic database (GenBank), the Swiss Institute of Bioinformatics' Protein Sequence Database (SWISS-PROT), and the Protein 3D structure database (PDB). These databases are growing at an immense rate, with the authors Dennis A. Benson et al. stating in their paper "GenBank" that the GenBank database is growing exponentially, doubling in size every 15 to 18 months (Benson, et al. 2007). Thus, GenBank alone is growing faster than Moore's Law, which states that the number of transistors that can be placed inexpensively on an integrated circuit has doubled approximately every two years. This poses a number of challenges not only to the process of mining data from these databases but also to using this data in popular programs like BLAST, a program that uses local alignment to identify unknown sequences, which shifts portions of these large databases into and out of memory in an attempt to identify unknown sequences. In order to overcome this obstacle it has been proposed that bioinformatics algorithms be updated, or developed, so that they can begin making use of multiple core processors and distributed computers.

2.2 Distributed Computing

Distributed computing is a field of computer science that researches distributed systems. In the book Distributed Systems: Concepts and Design, 4th Edition by George Coulouris et al. and the book Distributed Computing: Implementation and Management Strategies by Raman Khanna, a distributed system is defined as a system in which components located at networked computers communicate and coordinate their actions only by passing messages (Coulouris, Dollimore and Kindberg 2005) (Khanna 1994). This definition of distributed systems, however, requires that for a system to be considered distributed it must meet the following three characteristics:

1) The system must be concurrent, allowing each machine in the distributed system to work in parallel or independently of the other machines in the system.

2) The system must lack a global clock. Each component of the distributed network may only communicate via some form of message passing. However, there are limitations on how accurately the components can synchronize their clocks, forcing each component to keep track of its own local clock instead of referencing some global clock.

3) The system must allow for and be tolerant of independent component and system failures. The distributed system should allow independent components to fail without disrupting the capabilities of the system as a whole.

The authors George Coulouris et al. continue on to discuss challenges that face the construction of distributed systems as well as the distributed algorithms that make use of these systems. These challenges have all been identified and met in previous research and all distributed algorithms should be able to handle these challenges.

The challenges discussed by the authors are as follows:

1) The distributed algorithm should be able to tolerate a heterogeneous distributed system. Heterogeneous distributed systems are made up of components that may have different hardware, make use of different operating systems, and use different methods to connect to the distributed system's network. Overcoming this challenge is typically done by implementing the algorithm using some form of middleware. The middleware acts as an abstraction layer, allowing the developer to handle various system calls in a similar manner across heterogeneous systems.

2) Keeping the distributed system secure is of considerable importance. In order for the entire system to stay secure, each individual system and each program running on these systems must adhere to the protection of the system's confidentiality, integrity, and availability. A system's confidentiality is based entirely upon its ability to protect against the disclosure of data or resources to unauthorized users. Upholding system integrity requires the system to protect against the alteration and corruption of its data. Lastly, the system's availability is its ability to protect itself against interference with its access to various local and remote resources. Overcoming this challenge requires developers to stay vigilant in their efforts to ensure that their programs do not violate or allow for the violation of the system's confidentiality, integrity, and availability. This means ensuring that the contents of all messages sent across the distributed system are secure, ensuring the systems sending and receiving messages are authorized to do so, and either backing up important data or finding ways to ensure corrupted data can be recovered or recreated safely and efficiently.

3) Distributed systems must also allow for scalability, the ability to remain effective even when a significant number of users and resources are added to or removed from the system. Distributed algorithms should be able to handle the increase, as well as the decrease, in resources and users and make an attempt to balance resource usage across the system. Using algorithms such as divide and conquer to split a task into equal pieces, such that each piece of the larger task is completed by a separate component, allows algorithms to easily scale both up and down in a dynamic distributed system.

4) Just as would occur in a non-distributed algorithm, a distributed algorithm must be able to handle failures. Unlike in a non-distributed system, failures in a distributed system should follow the rules discussed before, in that the distributed system should tolerate the loss of components without it causing failures throughout the entire system. As such, distributed algorithms should be able to handle these failures such that they too can tolerate them. There are a number of ways to handle faults. For instance, the algorithm can attempt to detect when faults have occurred, known as fault detection, and when such an event occurs the algorithm will take measures to minimize and contain the fault and either recover what it can or restart the lost process elsewhere. Redundancy is another way of dealing with fault tolerance, by having multiple systems in the distributed system mirror the actions of another system. Thus if a task is being completed on three systems and one system goes down, the other two will still manage to complete the task and report their results. Redundancy can also be as simple as having multiple network connections between a system and the distributed system, thus allowing for hardware failures on one, but possibly not both, of the networks running the distributed system.

5) Distributed systems are required to be concurrent, and thus distributed algorithms running across these systems must also be concurrent. Distributed algorithms provide resources that can be shared by multiple users within the distributed system. Thus it becomes highly probable that, at various times, multiple users may attempt to share a resource at the same time. Whether this resource is a file, a system resource, a database, or an application, the distributed algorithm must not only allow the sharing to occur, but must remain stable and keep data correctly synced during these moments of shared usage. Resources in the system are considered safe only when operations performed on them are synchronized in such a way that their data remains consistent despite shared usage. Overcoming this problem requires following algorithm development standards similar to those found in operating systems and other parallel software, such as the use of semaphores and mutex locks.

6) Lastly, distributed systems should keep the separation of components within the system concealed from the user and the application programmer. This process is known as transparency and is used to ensure the distributed computer is perceived as a single system instead of a collection of independently working components. George Coulouris et al. continue on to discuss the eight major forms of transparency first discussed in the ANSA Reference Manual (ANSA Project 1987) and the International Organization for Standardization's Reference Model for Open Distributed Processing (RM-ODP) (ISO/IEC 1996). A brief summary of the forms of transparency listed in the RM-ODP is as follows:

   a. Access transparency: the ability to hide from a user the details of the access mechanisms for any given server object. Access transparency hides the difference between local and remote provisions of the service.

   b. Concurrency transparency: the ability to hide from the client the existence of concurrent access being made to various resources. This hides the effects of concurrent operations performed by any given user on a service used by multiple users.

   c. Location transparency: the ability to conceal the location of the resource currently being accessed by a user.

   d. Replication transparency: the ability to hide the presence of multiple copies of services, and to maintain the consistency of multiple copies of data, from the users of the services.

   e. Resource transparency: the ability to hide from a user the mechanisms which manage allocation of resources by activating and deactivating resources as demand for these resources varies.

   f. Failure transparency: the ability to mask certain failures, and possible recovery efforts, of resources from the user. This provides fault tolerance for the distributed system.

   g. Federation transparency: the ability to hide from users the effects of operations crossing multiple administrative boundaries, allowing users and resources to interwork between multiple administrative and technological domains.


2.3 BLAST

Basic Local Alignment Search Tool (BLAST) was developed in the late 1980s and early 1990s and is the most widely used bioinformatics program in the world. The program is an approximation of the Smith-Waterman algorithm, developed by Temple F. Smith and Michael S. Waterman in 1981, for performing local sequence alignment. The Smith-Waterman algorithm compares two sequences against one another in order to detect similarities between the sequences, known as alignments (Smith and Waterman 1981) (Durbin, et al. 2007). The algorithm takes two sequences, A and B, such that A = a_1 a_2 … a_n and B = b_1 b_2 … b_m. We shall define the similarity given between two sequence elements a and b as s(a,b), and gaps in the sequence of length k shall be given weight W_k. According to the Smith-Waterman algorithm, in order to detect pairs of segments between the sequences that contain a high similarity, we first need to establish a matrix M such that

M_{k,0} = M_{0,l} = 0 for 0 ≤ k ≤ n and 0 ≤ l ≤ m

This creates an (n+1) × (m+1) matrix with the first column and first row filled with zeros and all other entries left empty. These entries will then be filled in such that M_{i,j} is the maximum similarity of two segments ending in a_i and b_j, respectively. Because the Smith-Waterman algorithm searches for local instead of global alignments, the matrix will contain 0 in positions where M_{i,j} would have contained a negative number. These 0's will be used to determine where new alignments begin, allowing multiple alignments to begin and end within the sequence pair (Rosenberg 2009). As such, M_{i,j} is calculated for all i and j where 1 ≤ i ≤ n and 1 ≤ j ≤ m using the equation

M_{i,j} = max{ 0,  M_{i-1,j-1} + s(a_i, b_j),  max_{k≥1}( M_{i-k,j} − W_k ),  max_{l≥1}( M_{i,j-l} − W_l ) }

Once this matrix has been constructed, the alignment with the maximum similarity between sequences A and B can be found by first locating the maximum element in M. From this element a traceback algorithm is used to build the alignment in reverse, starting from the maximum element in M until a 0 element is encountered. At each step of the traceback process we move back from the current cell M_{i,j} to the cell from which the value in M_{i,j} was derived, located either at M_{i-1,j-1}, M_{i-1,j}, or M_{i,j-1}. At the same time we also build A' and B', the alignment sequence pair between A and B respectively, by adding either a letter or a gap to the front of each sequence in the pair. If the value for element M_{i,j} was derived from M_{i-1,j-1}, then A' adds the sequence element a_i to the front and B' adds the sequence element b_j to the front. If the value for element M_{i,j} was derived from M_{i-1,j}, then A' adds the sequence element a_i to the front and B' adds the gap character '-' to the front. If the value for element M_{i,j} was derived from M_{i,j-1}, then A' adds the gap character '-' to the front and B' adds the sequence element b_j to the front. This algorithm is repeated until the element M_{i,j} = 0 (Durbin, et al. 2007) (Jones and Pevzner 2004) (Orengo, Jones and Thornton 2003).
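
The recurrence and traceback described above translate almost directly into code. The following is a compact illustrative implementation that uses a simple match/mismatch score for s(a,b) and a linear gap weight; the scoring values are arbitrary choices for the example, not the ones used by BLAST.

```python
def smith_waterman(A: str, B: str, match=2, mismatch=-1, gap=-2):
    """Local alignment by the Smith-Waterman algorithm (linear gap penalty)."""
    n, m = len(A), len(B)
    # Matrix with the first row and first column set to zero.
    M = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if A[i - 1] == B[j - 1] else mismatch
            M[i][j] = max(0,
                          M[i - 1][j - 1] + s,   # extend a match/mismatch
                          M[i - 1][j] + gap,     # gap in B
                          M[i][j - 1] + gap)     # gap in A
            if M[i][j] > best:
                best, best_pos = M[i][j], (i, j)

    # Traceback from the maximum element until a zero element is reached.
    A_aln, B_aln = "", ""
    i, j = best_pos
    while i > 0 and j > 0 and M[i][j] > 0:
        s = match if A[i - 1] == B[j - 1] else mismatch
        if M[i][j] == M[i - 1][j - 1] + s:
            A_aln, B_aln = A[i - 1] + A_aln, B[j - 1] + B_aln
            i, j = i - 1, j - 1
        elif M[i][j] == M[i - 1][j] + gap:
            A_aln, B_aln = A[i - 1] + A_aln, "-" + B_aln
            i = i - 1
        else:
            A_aln, B_aln = "-" + A_aln, B[j - 1] + B_aln
            j = j - 1
    return best, A_aln, B_aln

# Example: smith_waterman("PGQQFPGQEP", "GQQFP") returns the best local score
# and the corresponding aligned sequence pair A' and B'.
```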

As discussed in the book "Introduction to Computational Genomics" (Cristianini and Hahn 2007) by Nello Cristianini and Matthew Hahn, the increase of DNA sequences deposited into the public genomic databases in the late 1980s caused searches of the three main genomic databases to start taking immense amounts of time. The Smith-Waterman algorithm takes on the order of O(nm), often referred to as O(n^2), time and space to calculate results. Despite the rather low cost of computation, the costs were simply too high for the large-scale database searching applications it was being used for (Cristianini and Hahn 2007).

The BLAST algorithm was developed in 1990 to attain similar results to the Smith-Waterman algorithm but at a fraction of the computational cost. BLAST is able to achieve this goal using two shortcuts: 1) BLAST does not bother to find the optimal alignment and 2) it does not search the entire search space, instead attempting to quickly locate regions of high similarity, regardless of whether it checks every possible local alignment (Cristianini and Hahn 2007). The BLAST algorithm can be simplified into the steps described in the Basic Local Alignment Search Tool paper (S. F. Altschul, et al. 1990) by Stephen F. Altschul et al. and expanded upon in the book "Bioinformatics: Sequence and Genome Analysis" (Mount 2004) by David Mount and "BLAST: An Essential Guide to the Basic Local Alignment Search Tool" by Ian Korf et al. (Korf, Yandell and Bedell 2003). A brief overview of the algorithm can be found below in Figure 1. Also, an in-depth description of the BLAST algorithm is provided for curious readers as follows:

1) The first stage of the algorithm involves removing sequence repeats and regions of low complexity, regions of biased composition containing simple sequence repeats (Orlov and Potapov 2004), from the query sequence (Mount 2004).

2) BLAST will then create a list of k-mers from the query sequence. This requires creating a unique list by cutting the query sequence into words such that each word is of length k. For example, if k = 3 and we have the query sequence PGQQFPGQEP, then we would have a list containing PGQ, GQQ, QQF, QFP, GQE, and QEP. Keep in mind that since we are storing a unique list we will only add the 3-mer PGQ once (Mount 2004) (Zomaya 2006).

3) Create a list of high-scoring match words, each of length k. For each member of the list generated in step 2 above, all the possible matching words are generated and then scored against the original element using a scoring matrix. If k = 3 and the program is handling amino acids, then a total of 20^3 possible match words and scores would be generated. For instance, the sequence PGQ would generate the matching words PEG and PQA, with BLOSUM62 matching scores of 15 and 12 respectively (Mount 2004) (Zomaya 2006).

4) A match cutoff score known as the neighborhood word score threshold, T, is selected in order to cull the list of match scores. By traversing the list of match scores and removing any match score that does not exceed the value of T, we are able to generate a match list that contains only the highest scoring match words (Mount 2004) (Zomaya 2006).

5) Steps 3 and 4 are repeated for each k-letter word in the list of k-mers created in step 2 (Zomaya 2006).

6) The list of remaining highest scoring match words is then reorganized into an efficient search tree. This allows the BLAST program to compare the matching words to elements within database sequences quickly and efficiently (Mount 2004).

7) Each sequence in the database is scanned by the BLAST program in order to find k-mers in the database sequence that match k-mers from the list of highest scoring match words (Mount 2004) (Zomaya 2006).

8) Each matching region found per database sequence will then be compared against one another to determine their distance. Each match that is within A letters of another match will be joined with that match as a longer match, with all the letters between them being incorporated into the new match. At this point the match will be extended in each direction until the accumulated total score of the HSP, High Scoring Pair, begins to decrease (Mount 2004) (Zomaya 2006).

9) A cutoff score known as the segment score threshold, S, is used in order to remove HSPs that do not meet or exceed S. By examining each HSP and removing any HSP whose score falls below S, we will be left with a list of HSPs whose values are large enough to accurately calculate significance (Mount 2004) (Korf, Yandell and Bedell 2003) (Zomaya 2006).

10) Next BLAST will begin assessing each HSP in order to determine the significance of that HSP's score. Statistical significance is determined by the expect value, E, which is the number of times that an unrelated database sequence would obtain some score s that is greater than x by chance (Mount 2004). The equation used to calculate the expect value, or E-value, states that the number of alignments expected by chance (E) during a sequence database search is a function of the size of the effective search space (m'n'), the normalized score (λS), and a minor constant (k) (Korf, Yandell and Bedell 2003). In order to solve for E the following equation is used:

E = km'n'e^(−λS)

where λ and k are Karlin-Altschul statistical parameters (Karlin and Altschul 1990), S is the raw score for the HSP, and m' and n' are the effective search spaces for the input sequence and database sequences, respectively (Korf, Yandell and Bedell 2003).

11) Using the expect value threshold parameter passed to the BLAST algorithm by the user, BLAST will remove any match whose expect value is higher than the expect value threshold (Mount 2004).

12) The program will report data for every match still remaining in the HSP list, which now only contains the HSPs that have been determined to be statistically significant (Mount 2004).

For more information regarding the BLAST algorithm or to see an example of the algorithm, readers are encouraged to read the book "Bioinformatics: Sequence and Genome Analysis" (Mount 2004) by David W. Mount.
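
To illustrate steps 2 through 4 of the algorithm, the sketch below builds the unique k-mer list for a query and then culls neighborhood words against a threshold T. A trivial identity-style scoring function stands in for a real substitution matrix such as BLOSUM62, so the scores shown are illustrative only.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def query_kmers(query: str, k: int = 3) -> set:
    """Step 2: the unique list of k-letter words contained in the query."""
    return {query[i:i + k] for i in range(len(query) - k + 1)}

def word_score(word_a: str, word_b: str) -> int:
    """Stand-in for a substitution-matrix score such as BLOSUM62 (illustrative)."""
    return sum(5 if a == b else -1 for a, b in zip(word_a, word_b))

def neighborhood_words(query: str, k: int = 3, T: int = 11) -> dict:
    """Steps 3-4: every possible k-letter word scoring at least T against a query word."""
    keep = {}
    for qword in query_kmers(query, k):
        for candidate in product(AMINO_ACIDS, repeat=k):   # all 20^k possible words
            cand = "".join(candidate)
            if word_score(qword, cand) >= T:               # cull words below threshold T
                keep.setdefault(cand, []).append(qword)
    return keep

# Example: query_kmers("PGQQFPGQEP") yields the unique 3-letter words from the
# text's example query, and neighborhood_words(...) keeps only high-scoring matches.
```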


Figure 1: A simplified view of the BLAST algorithm. Step A: Construct a list of high-scoring k-mers (words of length k) from the query sequence. Step B: Search a database sequence for exact matches between members of the word list generated in Step A and the database sequence. Step C: Extend exact matches to the left and/or right until the score for the High Scoring Pair (HSP) ceases to increase. Then report any HSPs that are significant and have a score greater than the segment score threshold, S. (Haubold and Wiehe 2006)


As briefly mentioned in the detailed BLAST algorithm above, BLAST is able to handle both nucleotide and amino acid sequence data. However, depending on the needs of the user, different BLAST programs may need to be utilized to attain the results the user expects. Currently BLAST is broken into six different programs based upon the type of sequence data they accept as input and the type of sequence data they provide as output. These six programs and their usages are briefly defined below and summarized in Table 3:

Program      Query Sequence Type    Database Sequence Type    Alignment Sequence Type
blastn       Nucleotide             Nucleotide                Nucleotide
blastp       Protein                Protein                   Protein
blastx       Nucleotide             Protein                   Protein
tblastn      Protein                Nucleotide                Protein
tblastx      Nucleotide             Nucleotide                Protein
megablast    Nucleotide             Nucleotide                Nucleotide

Table 3: Brief description regarding the data used and returned by each BLAST program. Lists each NCBI BLAST implementation along with the type of query and database sequences it accepts as input and the type of alignment that it returns as output (Markel and León 2003).

2.3.1 BLASTN and MegaBLAST

BLASTN is used to compare nucleotide query sequences against a nucleotide database in order to generate nucleotide alignments. As such, the BLASTN algorithm follows the algorithm detailed above in section 2.3 BLAST. BLASTN is primarily used to map short nucleotide sequences, known as oligonucleotides, to known genomes in order to identify the taxonomic information of these unknown nucleotide sequences (Korf, Yandell and Bedell 2003).


MegaBLAST is a modified version of BLASTN that sacrifices sensitivity in order to greatly decrease the amount of time required to find nucleotide alignments. As such, MegaBLAST behaves like the BLASTN program and is used for many of the same purposes as BLASTN (Korf, Yandell and Bedell 2003).

2.3.2 BLASTP

BLASTP is used to compare protein query sequences against a protein database in order to generate protein alignments. As such, the BLASTP algorithm follows the algorithm detailed above in section 2.3 BLAST. BLASTP is primarily used to determine functional information regarding the query proteins by comparing them to proteins whose functions are known. This allows researchers to infer gene function by determining which proteins the query protein matches against (Korf, Yandell and Bedell 2003).

2.3.3 BLASTX

BLASTX is used to compare nucleotide query sequences against a protein database in order to generate protein alignments. As such, the BLASTX algorithm requires that a step be added to the algorithm detailed above in section 2.3 BLAST. In order to correctly identify sequence alignments, the query sequence must first be converted into three protein sequences: the first is generated by converting each set of three nucleotides into its corresponding amino acid, the second is generated by performing the same task but skipping over the first letter of the nucleotide query sequence, and the third is generated by performing the same task as the first sequence but skipping the first two letters of the nucleotide sequence. Once converted, each of these sequences is then passed through the entire BLAST algorithm. BLASTX is primarily used to find protein-coding genes in genomic DNA or to identify proteins that are encoded by transcripts (Korf, Yandell and Bedell 2003).
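
A sketch of that conversion step follows: it translates a nucleotide sequence in the three forward reading frames described above using the standard genetic code. The encoding of the codon table as a 64-character string is simply a compact way to write it out for the example.

```python
# Standard genetic code, indexed by codon (bases ordered T, C, A, G).
BASES = "TCAG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
GENETIC_CODE = dict(zip(CODONS,
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"))

def translate(nucleotides: str) -> str:
    """Translate a nucleotide sequence codon by codon ('*' marks a stop codon)."""
    seq = nucleotides.upper().replace("U", "T")
    return "".join(GENETIC_CODE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def three_forward_frames(nucleotides: str) -> list:
    """The three translations produced by skipping 0, 1 and 2 leading bases."""
    return [translate(nucleotides[offset:]) for offset in (0, 1, 2)]

# Example: three_forward_frames("ATGGCCATTGTA") -> ['MAIV', 'WPL', 'GHC']
```

The same routine, applied to the database sequences instead of the query, corresponds to the extra step that TBLASTN and TBLASTX add, as described in the next two sections.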

2.3.4 TBLASTN

TBLASTN is used to compare protein query sequences against a nucleotide database in order to generate protein alignments. As such, the TBLASTN algorithm requires that a step be added to the algorithm detailed above in section 2.3 BLAST. In order to correctly identify sequence alignments, each database sequence must first be converted into three protein sequences: the first is generated by converting each set of three nucleotides into its corresponding amino acid, the second is generated by performing the same task but skipping over the first letter of the nucleotide sequence, and the third is generated by performing the same task as the first sequence but skipping the first two letters of the nucleotide sequence. Once converted, each of these sequences is then passed through the rest of the BLAST algorithm. TBLASTN is primarily used to map proteins to a genome or search EST, expressed sequence tag, databases for related proteins not yet added to the protein databases (Korf, Yandell and Bedell 2003).


2.3.5 TBLASTX

TBLASTX is used to compare nucleotide query sequences against a nucleotide database in order to generate protein alignments. As such, the TBLASTX algorithm requires that two steps be added to the algorithm detailed above in section 2.3 BLAST. In order to correctly identify sequence alignments, each query sequence must first be converted into three protein sequences as described above in section 2.3.3 BLASTX. Once the query sequences have been converted, the database sequences must also be converted by adding the steps described above in section 2.3.4 TBLASTN. Once these conversion steps have taken place, the newly generated protein sequences will be used in the remaining steps of the BLAST algorithm. TBLASTX is primarily used to identify transcripts of unknown function. By employing TBLASTX one would be able to determine if the transcript corresponds to any known proteins (Korf, Yandell and Bedell 2003).

2.4 Distributed BLAST

In the paper "Database Allocation Strategies for Parallel BLAST Evaluation on Clusters" by Rogério Luís De Carvalho Costa et al., the authors discuss a variety of techniques for dealing with the problem of distributing BLAST. According to the authors, the BLAST algorithm contains numerous calculations that are easy to parallelize. Parallelization of BLAST tends to center on the idea of divide and conquer (Costa and Lifschitz 2003). Divide and conquer allows parallelization that is able to achieve linear, or in some cases super-linear, speed up while allowing us to use NCBI's BLAST program and stay current with any updates made to that algorithm. The divide and conquer optimization is achieved using one of the following algorithms: query set segmentation and sequence database segmentation.

2.4.1 Query Set Segmentation

Query set segmentation is the process of breaking the query sequence file, containing anywhere from 1 to k sequences, into smaller files of equal size. This is done by some master node which takes the query sequences as input, cuts the sequences into N files of equal size, and passes these files down to the worker nodes (Talbi and Zomaya 2008). The worker nodes then run the BLAST algorithm on their sequence segment against the entire sequence database and then return the results back to the master node. The master node then takes the results from each worker node and concatenates the results into a single master output which is returned to the user. This method is shown below in Figure 2.


Figure 2: Diagram of Query Set Segmentation
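
A minimal sketch of this segmentation step is shown below. It parses a multi-FASTA query file and deals the sequences out into N smaller files; splitting by sequence count is the simplest reading of "files of equal size" and is an assumption made only for this example.

```python
def read_fasta(path: str) -> list:
    """Return a list of (header, sequence) pairs from a FASTA file."""
    records, header, seq = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    records.append((header, "".join(seq)))
                header, seq = line, []
            elif line:
                seq.append(line)
    if header is not None:
        records.append((header, "".join(seq)))
    return records

def segment_queries(query_file: str, n_workers: int) -> list:
    """Write N query segments, one per worker, and return their file names."""
    records = read_fasta(query_file)
    segment_files = []
    for worker in range(n_workers):
        name = f"{query_file}.part{worker}"
        with open(name, "w") as out:
            # Round-robin assignment keeps the segments close to equal in count.
            for header, seq in records[worker::n_workers]:
                out.write(f"{header}\n{seq}\n")
        segment_files.append(name)
    return segment_files
```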

The pros to this optimization include low communication overhead and linear speed up. This optimization requires minimal communication overhead to operate. For a query set containing k sequences, there are a maximum of k messages that must be sent to the slaves. This optimization also grants a near linear speed up as each node is performing 1/n of the workload and communication is minimal (Zomaya 2006).

The cons of this optimization include not solving the database size problem and making load balancing hard to accomplish. This optimization fails to address the fact that databases are growing at a rate faster than Moore's Law allows, meaning that this method will not allow the entire database to fit into memory, causing a slowdown due to the constant swapping of database sequences into and out of memory. This method also makes load balancing hard to accomplish, as doing so would require the entire query set to be analyzed beforehand since load balancing under this method is directly connected to the composition of the query set (Lazakidou 2010).

2.4.2 Sequence Database Segmentation

Sequence database segmentation is the process of breaking the database, containing 1 to k sequences, into fragments with each database fragment being approximately the same size, but not necessarily containing an equal number of sequences. This process is done by having some master node break down the database file into smaller fragments by computing various offset numbers that state where in the database file each worker should pull their fragment from (Zomaya 2006). Using these offsets, the master node passes the entire query file to each worker node as well as that worker node's offset values. Each worker node then pulls out its fragment of the database, runs the entire query sequence file against its database fragment using BLAST, and then returns the results back to the master node. The master node then takes the results from each worker node and merges these results, after making a few corrections to each sequence, into a single master output which is returned to the user. This method is shown below in Figure 3.


Figure 3: Diagram of Database Segmentation
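
The sketch below illustrates the offset computation described above: it scans the FASTA database once, notes the byte offset of every record header, and groups records into N fragments of roughly equal byte size. The exact bookkeeping used by the framework is not specified here, so this is only one plausible interpretation.

```python
def fragment_offsets(db_file: str, n_workers: int) -> list:
    """Return (start_offset, end_offset) byte ranges, one per worker.

    Fragments are cut only at record boundaries ('>' headers) and are kept
    approximately equal in size, although they may hold different numbers
    of sequences.
    """
    header_offsets = []          # byte offset of every FASTA header
    offset = 0
    with open(db_file, "rb") as handle:
        for line in handle:
            if line.startswith(b">"):
                header_offsets.append(offset)
            offset += len(line)
    file_size = offset

    target = file_size / n_workers          # ideal fragment size in bytes
    boundaries = [0]
    for h in header_offsets:
        if len(boundaries) < n_workers and h >= target * len(boundaries):
            boundaries.append(h)            # start a new fragment at this record
    boundaries.append(file_size)
    return list(zip(boundaries[:-1], boundaries[1:]))

# A worker can then seek() to its start offset and read (end - start) bytes
# to obtain its database fragment.
```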

The pros to this optimization include dealing with the database size problem and the ability to achieve super linear speed up. Database segmentation allows each worker node to perform the BLAST algorithm on a database fragment that will fit in memory, granting immense speed ups and allowing developers a method to avoid the problem of rapidly expanding databases. Linear speed up is achieved when each worker node can complete its task in 1/(number of nodes) of the serial time, meaning the task as a whole also completes in 1/(number of nodes) of the serial time. However, when dealing with larger databases, too large to fit into memory alone, this optimization allows each worker node to complete its task in less than 1/(number of nodes) of the serial time, allowing for super linear speed up.

The cons of this optimization are that a partitioning strategy must be implemented and that the result merging phase becomes more complex. As each worker completes the BLAST of the query sequences against a database fragment, the worker node will be required to pass back over the resulting output file and correct the E-values for each sequence match, as discussed in the next section. Once complete, the worker node can transmit the corrected result file back to the master node, where the result merging phase will begin. However, unlike in the query segmentation method, the result merging phase will not be a simple concatenation of files. Instead the process is complex and requires that each query sequence and the results given by each worker node are examined and the true highest scoring pairs are pulled out (Lazakidou 2010).

2.4.3 E-Value Calculation

As discussed in section 2.3 BLAST, the BLAST algorithm uses local sequence alignment in order to determine the high scoring pairs between a query set of unknown sequences and the known sequences found in a sequence database. These high scoring pairs each have a score identifying exactly how good a match was discovered. This score, known as the raw score, is the summation of all the scores given per residue match as well as the penalties applied for mismatches and gaps in the alignment. While this raw score is important in identifying the best matches between sequences, it does not tell us how significant the match was. Significance is used to determine if a match more likely occurred due to chance or biology. Millions of years of evolution have caused biological sequences to accumulate large numbers of substitutions, which has made the task of determining if two sequences share a common ancestor quite daunting. The difficulty of this task increases when you factor in that unrelated sequences may often display some degree of similarity purely by chance (Cristianini and Hahn 2007). As such, we need to determine the significance of each alignment so that any conclusions we wish to draw from our data are not weakened. This significance can be determined by calculating the number of matches with score s or higher that would be found in a database containing N sequences and n letters, either nucleotide or protein. Knowing the approximate number of matches we would expect to find allows us to determine whether a particular alignment is statistically significant or not. With this determination, some matches that are no longer considered significant can then be thrown out, leaving us with only the highest scoring pairs that are statistically significant.

The book “BLAST” (Korf, Yandell and Bedell 2003) by Ian Korf, Mark Yandell, and Joseph Bedell, as well as the paper “Sequence Comparison: Theory and Methods” (Chao and Zhang 2009) by Kun-Mao Chao and Louxin Zhang, gives us a comprehensive look at the way statistical significance between local alignments is determined by the BLAST programs. As stated previously in section 2.3 BLAST, the expect value, E, is the number of times that an unrelated database sequence would obtain a score S greater than x purely by chance. The equation used to determine an alignment's E-value states that the number of alignments expected by chance (E) during a sequence database search is a function of the size of the effective search space (m'n'), the normalized score (λS), and a minor constant (k) (Korf, Yandell and Bedell 2003). In order to solve for E, the following equation is used:

E = k m' n' e^{-\lambda S}

where λ and k are Karlin-Altschul statistical parameters (Karlin and Altschul 1990), S is the raw score for the HSP, and m' and n' are the effective search lengths for the input sequence and the database sequences, respectively (Korf, Yandell and Bedell 2003) (Pevsner 2003).

The effective search space is calculated by multiplying the effective search length of the input sequence, m', by the effective search length of the database, n'. However, the effective search lengths are not simply equal to the actual lengths of the input sequence and database. Instead, the effective search lengths are calculated by taking the actual lengths of the query sequence and database sequences minus the estimated average length of an alignment between two random sequences of equal length (Mount 2004) (Chao and Zhang 2009). Calculating the effective length of the input sequence, m', is completed using the following equation:

m' = m - l

where m is the actual length of the input query and l is the length adjustment for the current HSP. Calculating the effective length of the database, n', is completed using the following equations:

n' = n - (N_D \times l)

n = \sum_{T \in D} L_T

such that T is a database sequence in the sequence database D, L_T is the actual length of sequence T, and N_D is the total number of sequences found in database D (Chao and Zhang 2009). Thus the effective length of the database is equal to the sum of all the database sequence lengths minus the product of the length adjustment, l, and the number of sequences in the database. Lastly, calculating the length adjustment is done using the following equation:

l = \beta + \frac{\alpha}{\lambda}\left(\ln k + \ln\big((m - l)(n - N l)\big)\right)

such that

(m - l)(n - N l) \geq \max(m, n)

where k, α, β, and λ are the Karlin-Altschul statistical parameters kappa, alpha, beta,

and lambda respectively (Karlin and Altschul 1990), m is the actual length of the

input query, n is the total length of the database as calculated above, and N is the

total number of sequences in some database D (Lagnel, Tsigenopoulos and

Iliopoulos 2009) (Chao and Zhang 2009).
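To make these relationships concrete, the following sketch computes an E-value from the equations above. It is an illustrative example only: the Karlin-Altschul parameter values are placeholders, and the simple fixed-point loop used for the length adjustment is an assumption standing in for the exact routine used by NCBI BLAST.

    using System;

    // Illustrative sketch of the E-value calculation described above. The
    // Karlin-Altschul parameter values are placeholders; real values depend on
    // the scoring matrix and gap penalties in use.
    class EValueSketch
    {
        const double Lambda = 0.267;   // lambda (assumed example value)
        const double K      = 0.041;   // k      (assumed example value)
        const double Alpha  = 1.9;     // alpha  (assumed example value)
        const double Beta   = -30.0;   // beta   (assumed example value)

        // Length adjustment l, found by iterating
        // l = beta + (alpha / lambda) * (ln k + ln((m - l)(n - N*l))).
        static double LengthAdjustment(double m, double n, double N)
        {
            double l = 0;
            for (int i = 0; i < 20; i++)
            {
                double space = (m - l) * (n - N * l);
                if (space < Math.Max(m, n)) break;   // the constraint from the text
                l = Beta + (Alpha / Lambda) * (Math.Log(K) + Math.Log(space));
            }
            return Math.Max(l, 0.0);
        }

        // E = k * m' * n' * e^(-lambda * S), with m' and n' the effective lengths.
        static double Expect(double rawScore, double m, double n, double N)
        {
            double l = LengthAdjustment(m, n, N);
            double mPrime = m - l;        // effective query length
            double nPrime = n - N * l;    // effective database length
            return K * mPrime * nPrime * Math.Exp(-Lambda * rawScore);
        }

        static void Main()
        {
            // A 500-residue query against a database of 1,000,000 letters in
            // 2,000 sequences, with a raw HSP score of 60 (all example numbers).
            Console.WriteLine(Expect(60, 500, 1000000, 2000));
        }
    }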

2.5 Existing Distributed BLAST Applications

In the paper “BeoBLAST: distributed BLAST and PSI-BLAST on a Beowulf Cluster,” J.D. Grant et al. discuss a distributed BLAST application named BeoBLAST. BeoBLAST is a distributed BLAST application built to run across Beowulf clusters (Grant, et al. 2002). Beowulf clusters are scalable clusters built from commodity hardware, connected by a private system network, and running an open source operating system such as a UNIX or Linux based OS. Beowulf clusters are different from clusters of workstations in that Beowulf clusters behave more like a single machine, instead of behaving like a network of independent workstations.

BeoBLAST will only run on Linux based operating systems but will work on homogeneous and heterogeneous Beowulf clusters. BeoBLAST is also only capable of performing distributed BLASTs using the query segmentation method. The BeoBLAST algorithm uses a web interface to connect a user to the master node. Through the interface the user sends jobs to the master node, which in turn spawns jobs to worker nodes within the cluster. These nodes require that some form of parallel file system be in place and use that file system to read the files they will need to complete their assigned task. Once the task is complete, the nodes write their output back to the parallel file system. The master node then reads the data in and begins compiling the results back into a single output file. The authors note that their implementation allowed for a linear speed-up, in that 10 sequences against the NR database took approximately 2 minutes and 40 sequences took about 12 minutes. The authors state that the program takes approximately 1 minute for every 3 to 4 sequences added to a BLAST job.

In the paper, “Soap-HT-BLAST: high throughput BLAST based on Web services”

(Wang and Mu 2003) by Jiren Wang and Qing Mu, the authors discuss a distributed

BLAST application named Soap-HT-BLAST. As the name seems to imply, Soap-HT-

BLAST is a high-throughput distributed BLAST application built on web services.

This application is primarily built using SOAP, the Simple Object Access Protocol, to exchange information between the various nodes. Soap-HT-BLAST executes using

Perl and thus Perl, Apache Web Server, and the Perl module SOAP::Lite must be installed on all machines that will run it. According to the authors, web services consist of three components: a communications component to receive incoming messages and send outgoing messages, a proxy component to take messages and translate them into their appropriate actions, and an application component to perform the actions described by the proxy component (Wang and Mu 2003). Soap-

HT-BLAST uses SOAP::Lite to act as the proxy component, Apache web server to act

as the communications component, and their own BLASTing application, capable of performing one of three actions, as the application component. The three actions this component can perform are to check the current CPU load, check current connectivity to the network, and execute BLAST on a single sequence. Using web services provides Soap-HT-BLAST a number of advantages, such as being able to use multiple systems around the world as worker nodes. However, Soap-HT-BLAST has a number of severe disadvantages that harm its ability to achieve even linear speed-up. Unlike other distributed BLAST implementations, this application's use of large scale networks means that communication times between nodes are far higher due to increased latency and, because data must be sent across the internet, there is a higher risk of corrupt packets, which forces the use of TCP over UDP. The authors also point out that the system makes no attempt to load balance or to ensure that a job submitted to the system will be honored if the system becomes taxed with multiple sequences. As such, if four BLASTs are already running simultaneously on the nodes and CPU usage is at levels higher than 85%, the system will simply reject new jobs from users instead of queuing up incoming jobs. Soap-HT-BLAST brings a few interesting ideas to the table, and the idea of web services as a method to extend the reach of a distributed BLAST application is a novel approach to branching out of purely local distributed systems. However, Soap-HT-BLAST also exposes the weaknesses inherent in distributed systems that branch too far from the LAN, as these systems begin to function more slowly than their purely local counterparts and are more susceptible to the latency and communication issues often encountered when dealing with web services.


In the paper, “Windows .NET Network Distributed Basic Local Alignment Search

Toolkit (W.ND-BLAST)” by Scot E. Dowd et al. the authors discuss their distributed

BLAST implementation named W.ND-BLAST. W.ND-BLAST is a Microsoft Windows only distributed BLAST program that uses the Microsoft .NET framework to control the communication and fault tolerance of the distributed system. Unlike the previous implementations mentioned above, W.ND-BLAST uses a master node to build a distributed system by seeking out all other workstations in the local network and clustering together any of them running the server half of the W.ND-BLAST program (Dowd, et al. 2005). This cluster of workstations is completely heterogeneous with the exception that all of the machines must be running

Microsoft Windows and be compatible with the Windows .NET framework. Once this distributed system has been created, the master node will accept the query sequences, database, and other input from the user and begin the process of passing this information to the various worker nodes. Similarly to the previous implementations, W.ND-BLAST has only implemented query sequence segmentation. Thus the master node will break the query sequences into multiple segments and pass each segment on to the worker nodes for BLASTing. W.ND-

BLAST provides users with the benefit of building distributed systems out of standard workstations, without requiring these workstations to become permanent members of a cluster. This provides users of this implementation with the ability to easily scale their cluster up and down while the system is running. The drawback to such a system is that providing for this scalability requires increased

communication overhead, which is made worse by using the .NET framework, known for having high communication overhead, to build this system. Using a cluster-of-workstations approach is a tradeoff: it allows users to cheaply build non-dedicated clusters to handle massive amounts of biological data, but the communication overhead and the use of workstations that are simultaneously performing non-BLAST related tasks for their primary users cause this approach to be slower than others in most cases. It should be noted, however, that this approach could be used on machines dedicated to work in the cluster, which would leave only the communication overhead as a drawback to the method. The authors note that

W.ND-BLAST is capable of achieving near linear speed-up. This is shown by running BLAST normally on a 50-sequence set against a large (1.59 gigabyte) database on a machine lacking the RAM to store the entire database in memory. This BLAST takes just shy of 64 minutes to complete. The same BLAST run using W.ND-BLAST across 7 nodes takes just over 12 minutes to complete, 3 minutes over what would have been a true linear speed-up. In another, larger test, a job that took 11 hours using standard BLAST took 45 minutes using W.ND-BLAST on 17 nodes, where true linear speed-up would have taken 39 minutes. Thus it appears W.ND-BLAST does achieve near linear speed-up, but communication overhead tacks on approximately 20 to 30 seconds per executing worker node.

In the paper, “The Design, Implementation, and Evaluation of mpiBLAST” the authors Aaron E. Darling et al. discuss their distributed BLAST implementation

known as mpiBLAST. mpiBLAST is currently the most popular implementation of distributed BLAST and has been the most popular since the year it was released. mpiBLAST varies from its predecessors by being one of the first to efficiently implement distributed BLAST using both the query segmentation and the database fragmentation methods (Darling, Carey and Feng 2003). Like most others, mpiBLAST uses NCBI's BLAST program as the core BLASTing application and simply builds a framework around this implementation to handle the preparation, distribution, and compilation stages. mpiBLAST, as the name suggests, is built to make use of MPI, the Message Passing Interface, common in many distributed systems. Much like BeoBLAST, mpiBLAST only works on distributed systems that have already been created and are dedicated to act as a distributed system. While mpiBLAST will work on heterogeneous clusters, it does require that the NCBI BLAST toolkit be installed and that each machine be running an implementation of MPI, such as MPICH2 or LAM/MPI. mpiBLAST is primarily built to run on Linux, Unix, and Mac OS and does not tend to work on Windows based machines; thus the heterogeneous cluster must make use of one of these operating systems. mpiBLAST uses the same master-worker configuration as BeoBLAST with the exception that it uses MPI as the primary method for communication between the nodes. mpiBLAST is able to achieve super linear speed-up on large databases by using database fragmentation. This allows databases to be forced into the constraints of physical memory and allows the BLAST program to avoid constantly writing to disk, granting major boosts in speed. mpiBLAST achieves database fragmentation using one of two methods, depending on the cluster configuration. For configurations

that allow for parallelized input and output (PIO) to the distributed system's shared memory, mpiBLAST relies on a method discussed in the paper “Efficient Data Access for Parallel BLAST,” in which the authors Heshan Lin et al. discuss how using parallelized input and output operations could allow database fragmentation to occur on-the-fly instead of requiring a constant reparsing of the database using NCBI's formatdb tool (Lin, et al. 2005). This method requires that the database be processed only once instead of being reformatted multiple times depending on the size of the cluster. In non-PIO configurations, mpiBLAST will simply rely on the formatdb tool to cut databases into numerous fragments and then pass these fragments out to the various workers. This process is slower and requires a major increase in the preprocessing stage of execution, but it is still a viable method for achieving super linear speed-up.


Chapter 3 Approach

This chapter includes an overview of the algorithms designed as part of the contribution of this dissertation. The first methods developed were the multicast communication methods, which automatically create a distributed system out of existing machines across a local area network. This distributed system can then be utilized to solve problems in parallel across the various members of the distributed system with the aid of the distributed application framework. This framework grants developers the ability to quickly and easily create distributed applications that run across distributed systems built by the multicast communication methods.

Once the distributed application framework was completed, an example distributed application was constructed to perform distributed BLAST processing. This distributed BLAST application makes use of the distributed application framework and expands on it to allow the distributed system to correctly and efficiently perform large scale BLAST operations in near linear time or, in some cases, super linear time. These algorithms are then implemented in order to show they are in fact solutions to the problems they were created to solve.

3.1 Creating a Distributed System from Existing Nodes

Achieving significant speed increases in bioinformatics tools and applications requires utilizing distributed resources that meet all the criteria of a distributed system and are capable of meeting the six challenges for distributed systems as

discussed in section 2.2 Distributed Computing. This section will first discuss the algorithm used to create the distributed system and then discuss how this system meets both the definition and the challenges of a distributed system. Figure 4 is shown at the end of this section in order to demonstrate the layout of and interactions between the master and worker nodes in the distributed application.

3.1.1 Algorithm for creating a Distributed System from Existing Nodes

The master and worker nodes will each contain the necessary code to establish and join an IP multicast group on the local area network. An IP multicast, as described by

Stephen E. Deering and David R. Cheriton in their paper “Multicast routing in datagram internetworks and extended LANs,” is the transmission of data to a subset of hosts in the network. This method of data passing is more efficient and requires less overhead than broadcasting the data to all hosts in the network or unicasting to each host individually (Deering and Cheriton 1990). As each master and worker node loads the distributed application framework, it will join the multicast group or, if one is not yet established, it will establish the multicast group.

The master and worker nodes will only be capable of broadcasting the following three messages into the multicast group:

1) The worker node discovery message informs all connected worker nodes to

report their presence and status to the broadcasting master node.


2) The new worker node activated message informs any currently connected

master nodes that a new worker node has joined the multicast.

3) The worker node response message informs the broadcasting master node of

the worker node’s presence and status in the multicast.

The worker node discovery message is the only message the master node is capable of sending into the multicast group. This message allows the master node to discover which worker nodes are currently in the multicast group and what each connected worker node’s current status is. Before issuing this message into the multicast, the master node will mark all of the worker nodes it has made contact with as being offline. Once the message has been broadcast only worker nodes that respond within a reasonable amount of time will be marked as online. Thus the master node will always assume worker nodes have gone offline unless a worker node sends a message stating otherwise.

The worker nodes have two messages they are capable of sending into the multicast group. The first message is the new worker node activated message which is broadcasted directly following its connection to the multicast group. This message is shorter than the worker node response message because no status information is included with the broadcast. Master nodes that receive this message will add the new worker node to their list of known worker nodes if it was previously undiscovered, otherwise the master node will do nothing. The second message sent by the worker node is the worker node response message. This message is

broadcast immediately upon receiving the worker node discovery message from a master node. This message contains the message code assigned to the worker node response message so that the master node knows what message has been sent; it also contains two short status codes that inform the master node how much RAM the node has and whether or not the node is currently undertaking any work for the master node.

Using a multicast to transmit messages to worker nodes allows the master node to quickly and easily determine the current status of each worker node in the distributed system without requiring the user to manually inform the master node of the existence of these nodes. By storing the host name, IP address, memory amount, and known status of each responding worker, the master node is able to establish connections to worker nodes, transmit data to and from worker nodes, and stay up to date on the status and size of the evolving distributed system.
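A minimal sketch of this discovery exchange is shown below. The multicast group address, the UDP port, and the single-character message codes are assumptions made for illustration; the dissertation does not fix these values.

    using System;
    using System.Net;
    using System.Net.Sockets;
    using System.Text;

    // Sketch of the master node's side of the multicast discovery exchange.
    class DiscoverySketch
    {
        static readonly IPAddress Group = IPAddress.Parse("239.0.0.50"); // assumed group address
        const int Port = 15021;                                          // assumed multicast port

        static void Main()
        {
            UdpClient udp = new UdpClient(Port);
            try
            {
                udp.JoinMulticastGroup(Group);
                udp.Client.ReceiveTimeout = 5000;   // bounded response window

                // Broadcast the worker node discovery message ("D" is an assumed code).
                byte[] discover = Encoding.ASCII.GetBytes("D");
                udp.Send(discover, discover.Length, new IPEndPoint(Group, Port));

                // Collect worker node response messages until the window expires.
                while (true)
                {
                    IPEndPoint remote = new IPEndPoint(IPAddress.Any, 0);
                    string msg = Encoding.ASCII.GetString(udp.Receive(ref remote));
                    // A response ("R", assumed) carries RAM and busy/idle status codes;
                    // any worker that never responds stays marked as offline.
                    if (msg.StartsWith("R"))
                        Console.WriteLine("Worker {0} is online: {1}", remote.Address, msg);
                }
            }
            catch (SocketException)
            {
                // Receive timed out: this discovery round is over.
            }
            finally
            {
                udp.Close();
            }
        }
    }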

3.1.2 Methods for creating a Distributed Application Framework

The communication methods described in section 3.1.1 Algorithm for creating a

Distributed System from Existing Nodes provide the needed communication framework to build a distributed system, but they lack the methods required to support running distributed computations across the newly created system. To accomplish this, a distributed application framework must be constructed to supply

developers with access to the methods required to perform distributed computations. The distributed application framework designed for this project contained methods to perform the following:

1) store information regarding remote nodes,

2) transfer string and byte data between master and worker nodes,

3) transfer files between master and worker nodes, and

4) begin execution of a process on remote nodes.

3.1.2.1 Storing Information Regarding Remote Nodes

Created as an extension to the multicast communication framework, these methods and their class objects provide the developer a way to store information regarding each remote node that the master node makes contact with. By default the system will store the remote node’s IP Address, hostname, total amount of RAM in megabytes, and the availability of the remote node. The system, however, is built to be easily configurable in order to incorporate storing any other data regarding remote nodes that the developer deems necessary.
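A sketch of the kind of class object these methods maintain for each discovered node might look like the following; the property names are illustrative assumptions rather than the framework's actual identifiers.

    using System;
    using System.Net;

    // Illustrative record of what the master node stores for each remote node.
    class RemoteNodeInfo
    {
        public IPAddress Address { get; set; }    // IP address reported over the multicast
        public string HostName { get; set; }      // host name of the worker
        public int RamMegabytes { get; set; }     // total amount of RAM in megabytes
        public bool IsOnline { get; set; }        // cleared before each discovery broadcast
        public bool IsBusy { get; set; }          // currently performing work for this master
        public DateTime LastSeen { get; set; }    // time of the last response message
    }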

While storing information regarding remote nodes should generate few errors, the class objects and methods contain the necessary error checking to ensure they are able to withstand faults regarding network communication failures and corrupt network data that would lead to unidentified IP addresses or hostnames. In the

event any of these errors are encountered, the methods will attempt to correct the problem by requesting the information be sent again, if possible. If the error cannot be corrected, then the remote node is simply removed from the list of nodes in order to prevent the node from creating other problems, and an error code is returned so that the master node may continue performing other tasks until another attempt is made to contact the failing remote node.

3.1.2.2 Transfer String and Byte Data between Master and Worker Nodes

As discussed in section 2.2 Distributed Computing, a distributed system must rely on some form of message passing in order to sync operations up across the system.

By providing methods that allow the developer to easily transfer short byte or string messages, a message passing system for syncing up these operations is established.

These methods are primarily used by other parts of the distributed application framework as part of the message passing system; however they are able to be extended and called directly by the developer so that they too can sync up operations across the system.

This message passing system requires knowledge of the remote node's IP address and the port number that the remote node will be listening on. By default each remote node will have a dedicated port on which to listen for messages from other nodes. When messages are received on these ports, the node will respond to the command issued, such as opening a file transfer connection and

awaiting a file, before sending out an acknowledgement message back to the sender.

The sender will wait for a pre-determined amount of time (two minutes by default) for a remote node to send its acknowledgement before the remote node is classified as offline and all contact with the node is broken until it begins responding to the sender again via the multicast communication. If an acknowledgement is received, then the system will continue forward with the operation it had requested.

All requests, acknowledgements, and other crucial communications between the nodes are run through these methods due to their high fault tolerance. Failed communications are attempted again if the system deems that the message is vital to system stability or if the receiver requests that an operation be attempted again, such as after a file transfer error. All network communications are marked to time out if the connection fails to show any activity for a specified amount of time. These time outs ensure that the system continues to respond and prevent the system from deadlocking if a remote node fails to sync properly. Between each attempt at network communication the methods will also ensure the connection is still active; an inactive connection signifies that the TCP connection between the nodes has been severed. When an inactive connection has been discovered, the method will close the connection and mark the remote node as offline in order to ensure the system does not continue attempting to make contact with a failing or offline remote node.

The remote nodes, which use the same framework, will also be able to detect when the master node has ceased communication and will return to a listening state if

such an event occurs. This ensures the remote node will not get deadlocked when it has failed to sync with a master node.
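The send-and-acknowledge pattern described above can be sketched as follows, assuming a single acknowledgement byte and modelling the two-minute wait with a read timeout; the acknowledgement value and message format are illustrative.

    using System;
    using System.Net.Sockets;
    using System.Text;

    // Sketch of sending a short command string and waiting for an acknowledgement.
    static class MessageSketch
    {
        public static bool SendWithAck(string host, int port, string message)
        {
            try
            {
                using (TcpClient client = new TcpClient(host, port))
                using (NetworkStream stream = client.GetStream())
                {
                    stream.WriteTimeout = 30 * 1000;      // abort connections that show no activity
                    stream.ReadTimeout = 2 * 60 * 1000;   // two-minute acknowledgement window

                    byte[] payload = Encoding.ASCII.GetBytes(message);
                    stream.Write(payload, 0, payload.Length);

                    int ack = stream.ReadByte();          // blocks until the ack arrives or times out
                    return ack == 0x06;                   // assumed acknowledgement byte
                }
            }
            catch (Exception)
            {
                // Timeout or severed connection: the caller marks the node offline
                // and waits for it to reappear via the multicast group.
                return false;
            }
        }
    }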

3.1.2.3 Transfer Files between Master and Worker Nodes

Transferring files between the master nodes and the worker nodes requires that a sync operation, as described in section 3.1.2.2 Transfer String and Byte Data between Master and Worker Nodes, occur. The sender will make use of the syncing methods in order to inform the receiver that a file needs to be transferred. Upon receipt of this message, the receiver will open a new TCP connection and begin listening on the port. The port number and a byte code acknowledging the request are then returned to the sender, which in turn will open a TCP connection to the receiving node and begin transferring the file. Upon completion, another request for acknowledgement is sent across the network along with information regarding the file so that the receiver may verify the file's contents. If the file has transferred successfully, an acknowledgement is sent and the connection is severed on both ends; however, if the file failed to transfer without corruption, then a failure message is sent to the sender along with a short byte code stating to either resend the file or give up. If a file has failed multiple times, the receiver will request that the sender cease sending the file for a short time and try again later; otherwise the file transfer will begin again. In the event a sync operation fails or a file transfer fails due to a network error or a network time out, the master node will mark the worker node as offline and wait for the worker node to resume contact via multicast communication

while the worker node will simply close all file transfer connections with the master node and resume listening for instructions from the master node. This allows each node to resume its previous operations without being deadlocked by a failed connection, ensuring that each node can continue to function even when other nodes begin to fail.
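As a sketch of the verification step, the fragment below compares an MD5 digest of the received file against a digest supplied by the sender; MD5 is an assumed choice, since the text only states that the receiver verifies the file's contents.

    using System.IO;
    using System.Security.Cryptography;

    // Sketch of verifying a transferred file against a digest sent with it.
    static class FileTransferSketch
    {
        public static bool VerifyFile(string path, byte[] expectedDigest)
        {
            using (MD5 md5 = MD5.Create())
            using (FileStream fs = File.OpenRead(path))
            {
                byte[] actual = md5.ComputeHash(fs);
                if (actual.Length != expectedDigest.Length)
                    return false;                          // corrupt transfer: request a resend
                for (int i = 0; i < actual.Length; i++)
                    if (actual[i] != expectedDigest[i])
                        return false;                      // corrupt transfer: request a resend
                return true;                               // intact: send the acknowledgement
            }
        }
    }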

3.1.2.4 Begin Execution of a Process on Remote Nodes

Starting a process on a remote node requires that the master node first transfer the required data up to the remote node. This data transfer requires that a sync operation, as described in section 3.1.2.2 Transfer String and Byte Data between

Master and Worker Nodes occur. Data sent to the remote server may be small enough to send via the same string or byte message passing methods used to establish the sync, but more often than not a file will need to be transferred to the remote node using the methods discussed in 3.1.2.3 Transfer Files between Master and Worker Nodes. Once the data has been successfully transferred, the developer may begin execution of a process on that data by calling the run process remotely method. This method takes the command line parameters that will be used on the remote machine with a few small modifications and passes that string to the remote node using the string message passing methods as well as a byte code to inform the remote machine that a process will need to be started using this data. The parameter string that is passed to the remote node may contain special strings that

the remote node will replace before execution will begin. By default the methods are able to handle the following special strings:

• #PROGRAMPATH# will be replaced on the remote node with the correct path

to the folder containing the program executable,

• #INPUTFILEPATH# will be replaced on the remote node with the correct

path to the folder where file transfers are stored, and

• #OUTPUTFILEPATH# will be replaced on the remote node with the correct

path to the folder where the output files are stored before being transferred

back to the master node.

Once the command to begin work remotely has been issued, the method will enter a waiting loop where it will wait for the remote node to signal it has completed the work. Within this wait loop three different aspects of the run are being monitored in order to prevent the master node from waiting indefinitely for a remote node that may have failed to complete the task assigned to it. The wait loop will continuously check the established TCP connection between it and the remote node in order to detect if the remote node has severed a connection before the remote node signals it has completed the work assigned to it. The wait loop will also check the remote node’s information object, which is being updated by the multicast methods, to ensure the remote node is continuing to stay in communication with the master node and, more importantly, is continuing to state that it is performing work for the master node. If the remote node informs the master that it is no longer processing work and the master node has not been signaled that processing is complete, then

the distributed system will sever the connection with the remote node and return an error code stating the work for this node was not completed. The distributed system is unable to predict the amount of time a remote node will need to complete any given task, thus monitoring these aspects of the run is the only way to ensure the work is being completed. Much like the other methods, upon encountering a processing error, this method will simply break the connection with the remote node, mark the remote node as offline, and return an error code so the program may attempt passing the work to another remote node.
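A sketch of how a worker node might expand the special strings and launch the requested program is shown below; the folder locations are placeholder assumptions, and the logic is a simplified stand-in for the run process remotely method described above.

    using System.Diagnostics;

    // Sketch of expanding the special path strings and starting the remote process.
    static class RemoteExecutionSketch
    {
        public static Process RunRemoteCommand(string executable, string parameters)
        {
            string expanded = parameters
                .Replace("#PROGRAMPATH#", @"C:\DistributedApp\bin")         // assumed program folder
                .Replace("#INPUTFILEPATH#", @"C:\DistributedApp\input")     // assumed transfer folder
                .Replace("#OUTPUTFILEPATH#", @"C:\DistributedApp\output");  // assumed output folder

            ProcessStartInfo info = new ProcessStartInfo(executable, expanded)
            {
                UseShellExecute = false,
                CreateNoWindow = true
            };

            // The wait loop described above monitors this process (and the node's
            // multicast status) until the worker signals that the work is complete.
            return Process.Start(info);
        }
    }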

3.1.3 Meeting the definition of a distributed system

As discussed in 2.2 Distributed Computing, in order for a system to be classified as a distributed system, it must meet the following criteria:

1) The system must be concurrent, allowing each machine in the distributed

system to work in parallel or independently of the other machines in the

system.

2) The system must lack a global clock. Each component of the distributed

network may only communicate via some form of message passing. However

there are limitations on how accurately the components can synchronize

their clocks, thus forcing each component to keep track of their own local

clock, instead of referencing some global clock.


3) The system must allow for and be tolerant of independent component and

system failures. The distributed system should allow for independent

components to fail without it disrupting the capabilities of the system as a

whole.

The multicast method for determining the status of each worker node in the distributed system allows each machine to run concurrently. The file and data transfer methods also allow each remote node to receive data separately of one another and the remote processing methods allow each machine to be given separate execution commands, regardless of the work being performed on other remote nodes. While the master node is keeping track of each remote node it encounters, the remote nodes are only able to store information about the current master node they are performing work for. If the remote node is not currently processing any data for a master node, then the worker node will not have any information stored regarding any of the master nodes. This allows each worker node to work independently of any other node in the system.

Because the machines remain completely independent of one another, they are unable to share a global clock. Each machine runs completely separately from the other nodes, regardless of their current clocks. Synchronization of work is performed strictly through passing messages between master and worker nodes.


As stated previously, the multicast method in effect will work under the assumption that any node that does not respond in a timely manner is no longer online. The master node is unaware as to why a worker node has gone offline and will simply ignore the worker node when it comes time to pass out work to the worker nodes.

In the event of a network failure that disrupts all communication, the master node will simply fall idle until the problem is resolved and worker nodes begin responding to the master node’s attempts to contact them. Once the network failure has been resolved the process will continue as if the failure had never occurred.

Lastly, in the event the master node experiences a system failure, the entire distributed system will halt. Without a master node to direct the worker nodes, the worker nodes will simply complete any work they had been assigned and then fall idle until a master node calls upon them. If the system failure is recoverable, the system will resume operation when the failure is corrected. In the event the system failure requires the user to intervene, then the distributed system will resume operation starting at the point of failure when the user has restarted the master node. As such the system currently will tolerate a master node failure; however, the system may cease completing work until the master node is restored or a new master node takes its place.

3.1.4 Meeting the challenges of a distributed system

As discussed in 2.2 Distributed Computing, a distributed system must be capable of overcoming the following challenges:


1) The distributed algorithm should be able to tolerate a heterogeneous

distributed system.

2) The distributed system must be secure. In order for the entire system to stay

secure, each individual system and each program running on these systems

must adhere to the protection of the system’s confidentiality, integrity, and

availability.

3) Distributed systems must also allow for scalability, the ability to remain

effective even when a significant number of users and resources are added to

or removed from the system.

4) A distributed algorithm must be able to handle failures. Unlike in a non-

distributed system, failures in a distributed system should follow the rules

discussed before, in that the distributed system should tolerate the loss of

components without it causing failures throughout the entire system

5) Distributed systems are required to be concurrent and thus distributed

algorithms running across these systems must also be concurrent

6) Lastly, distributed systems should keep the separation of components within

the system concealed from the user and the application programmer. This

process is known as transparency and is used to ensure the distributed

computer is perceived as a single system instead of a collection of

independently working components. A brief summary of the eight forms of

transparency as listed in the RM-ODP (ISO/IEC 1996) are as follows:

a. Access transparency

b. Concurrency transparency


c. Location transparency

d. Replication transparency

e. Resource transparency

f. Failure transparency

g. Federation transparency

The distributed application framework is built to run on any hardware configuration and will tolerate heterogeneous hardware between the nodes. The framework does, however, require that the master and worker nodes be installed on machines that can support binaries compiled from C# code and have the .NET 3.5 framework installed.

The distributed application framework adheres to the protection of the system’s confidentiality and integrity simply because the message passing system does not allow for master nodes to request any information or alter any data. Each worker node also protects the system’s security by only communicating with one master node at a time, thus preventing other master nodes from corrupting a worker’s data.

This single master communication method also adheres to the protection of a system’s availability by ensuring that while other master nodes can receive information regarding the worker node’s status, the system does not allow two master nodes to secure it as a resource, thus ensuring the system cannot be deadlocked by multiple master nodes.


As explained in section 3.1.1 Algorithm for creating a Distributed System from

Existing Nodes, the multicast method allows for scalability of both master nodes and worker nodes. As worker nodes join the distributed system, they are incorporated into the lists contained by all active master nodes. As master nodes join the system, they will begin separately keeping track of all the worker nodes. As such the system is fully scalable and can sustain either a rapid increase or a sudden decrease in master or worker nodes. Because the distributed application framework is based upon the multicast method, it will inherit this quality.

Section 3.1.3 Meeting the definition of a distributed system states that the multicast method allows the system to meet the definition of a distributed system. The challenge of making the algorithm capable of tolerating errors is covered by the third criterion of that definition. The distributed application framework will inherit this property because it is based upon the algorithm that established the multicast system. As such the algorithm and the system will be capable of tolerating both software and hardware failures.

The first criterion of a distributed system covered in section 3.1.3 Meeting the definition of a distributed system states that the distributed system must be concurrent. The multicast method meets this requirement by keeping worker nodes isolated from one another in the sense that each worker node is unaware of the existence of other worker nodes. With each worker node only aware of a single master node at a time, these nodes can work entirely separated from one another

without any form of conflict arising. This is ensured since the distributed application framework will force worker nodes to only handle one master node at a time.

Of the eight forms of transparency listed in the RM-ODP (ISO/IEC 1996), the distributed application framework meets six of them. The eight forms of transparency and the reasoning behind the distributed application framework’s ability to meet, or failure to meet, the transparency form is listed below:

1) Access transparency: The distributed application framework does not

show the mechanisms used to access any local or remote object.

2) Concurrency transparency: The distributed application framework does

allow for the developer to state which worker nodes are currently being

accessed and used to perform the distributed processing. However the

concurrent connections between these worker nodes and the concurrent

operations being performed locally on the master node are concealed

from the user. As such the algorithm does meet concurrency

transparency.

3) Location transparency: The distributed application framework does allow

the developer to state which worker nodes are currently being accessed

and used to perform the distributed work. Since the multicast forces the

algorithm to stay within the bounds of a local area network and the user

may be shown the names of each host that is currently performing work,

location transparency is not being upheld.


4) Replication transparency: The distributed application framework hides

the fact that each remote worker node is being run on a separate thread,

thus concealing from the user the replication of various files and services

used to both connect to the worker node and tolerate any failures that

may arise within the thread or remote worker node.

5) Resource transparency: When worker nodes or other remote or local

resources fail, the distributed application framework will conceal these

failures from the user. However as new resources become available,

worker nodes go offline, or previously unused worker nodes are used to

process data, the user may be made aware of the changes. While the

developer may choose to obfuscate this data it is normally best if they do

not simply so the user can monitor the status of the distributed system

and can manually request that the distributed application ignore poorly

performing resources. As such the algorithm is not forcing resource

transparency.

6) Failure transparency: Failures within the distributed system may be

concealed from the user as long as user intervention is not required.

However developers may wish to inform the user of failures that may

require the user to intervene, such as when a worker node is constantly

failing or the master node’s network connection has been severed. The

system, however, will tolerate the failures and all automatic recovery

methods will be concealed from the user.


7) Federation transparency: The distributed application framework will be

required to cross administrative boundaries on both master and worker

nodes. The program cannot rely on the user running the distributed

process on the master node to have an account with administrative

privileges on the machine running the worker node. As such the

algorithm itself will need to operate across multiple administrative

boundaries. The user has no need to be informed of these elevations or

de-elevations in privileges, and thus that data shall be concealed from the

user. As such the algorithm will perform federation transparency.


Figure 4: Distributed System Layout and Interactions. The master node performs node discovery, receives the user input and query file, performs query segmentation and query transfer, and compiles the returned results into an output file for the user; each of the worker nodes (Node 1 through Node N) performs master discovery, establishes its connection(s), receives its query, runs the program (i.e. BLAST), and returns its results to the master node.

3.2 Algorithm behind the Distributed BLAST Master Node

In order to fully evaluate the distributed application framework, a distributed

BLAST application was constructed using the framework along with some additional features required to run BLAST operations. To better explain how the distributed BLAST application executes, this section will discuss the distributed BLAST algorithm one portion at a time. Each portion of the algorithm shall be described by the tasks it carries out, the means by which it carries out those tasks, and the interactions it has with the other portions at various stages of the program's execution. The portions this section will cover for the master node include: node discovery and connection establishment, user interface, query segmentation, database segmentation, database and query transfer, and results compilation.

3.2.1 Node Discovery and Connection Establishment

When the distributed BLAST program begins execution on the master node, the program will first need to perform node discovery. Node discovery allows the master node to determine which worker nodes are currently connected and able to perform BLASTs. In order to perform node discovery, the master node must first connect to the multicast group as explained in section 3.1.1 Algorithm for creating a

Distributed System from Existing Nodes. The master node will then broadcast the node discover message as defined in the same section. Once the message has been broadcast, the master node will accept messages from the worker nodes for a maximum of five seconds. Any node that does not respond to the master within five seconds will be classified as offline. After the initial node discovery, the master node will perform the step again approximately every 30 seconds. Before broadcasting the node discovery message, the master node will first classify each remote node as offline. As explained in section 3.1.1 Algorithm for creating a Distributed System

from Existing Nodes, this allows the program to work under the assumption that any worker that is unable to respond to a node discovery request within five seconds should be treated as being offline.

3.2.2 User Interface

The master node will contain a graphical user interface to make interacting with the program easier. This user interface will primarily be used by the user to feed input into the distributed system regarding the BLAST to be completed. However, the interface will also allow the user to monitor progress and the current status of the

BLAST. The user interface would contain all the necessary options for executing a

BLAST, including the input query file, the database to query against, the BLAST program to run (e.g. blastall, megablast), as well as most of the optional parameters accepted by the various BLAST programs. The interface will also allow the user to select some options regarding how the distributed system will handle their BLAST. These options would include whether or not to restart a previously started yet unfinished

BLAST and whether or not the BLAST should be run on the local machine instead of being distributed. The interface will also provide the user with information regarding the current state of their BLAST such as how much of their BLAST has completed and which worker nodes are currently performing which tasks. The user can then act upon this data to block, or unblock, worker nodes on the fly in order to increase the speed or reliability of the system.


The user interface will also contain a method for the user to add or remove sequence databases from the distributed BLAST engine. When a new sequence database is added, an XML file will be generated to describe the database. This file will be used by distributed BLAST to determine which files belong to a particular database and will allow distributed BLAST to determine if a worker node has the correct database. This XML information would contain the following: the database's name, the database's type (e.g. nucleotide or protein), the number of sequences contained in the database, the size of the database, the number of fragments the database contains, and the number of sequences in and size of each fragment if there is more than one.
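A hypothetical example of such an XML information file is shown below. The element names and the values are illustrative assumptions; the dissertation specifies only which fields the file records.

    <!-- Hypothetical database descriptor; element names and values are assumptions. -->
    <Database>
      <Name>ExampleNucleotideDB</Name>
      <Type>nucleotide</Type>
      <SequenceCount>200000</SequenceCount>
      <SizeInBytes>4000000000</SizeInBytes>
      <FragmentCount>2</FragmentCount>
      <Fragment Index="1" Sequences="100000" SizeInBytes="2000000000" />
      <Fragment Index="2" Sequences="100000" SizeInBytes="2000000000" />
    </Database>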

3.2.3 Query Segmentation

If the user selects to begin the distributed BLAST engine using the query segmentation method, then the master node will need to determine how many segments to generate by selecting how many sequences should be added to each segment. The user interface will allow the user to select this value if they so choose, or they can allow the system to select this value. Value selection will require the system to see how much RAM and processing power each worker node has and then attempt to estimate how long it would take to complete X sequences, where X is the current estimated value. Next the program will determine how many segments would be generated if each segment had X sequences. Assuming it takes an estimated T' time to BLAST a segment on any given worker node and an estimated T'' time to write the result file to the final result file, the program will attempt to select an X such that T' - T'' = 0. Doing so ensures that we are attempting to spend the same amount of time BLASTing the file as we do writing the final file.

Upon determining the best estimated value of X, the master node will pull X * (N * 2) sequences into memory, such that N is the number of worker nodes available to perform work. At any given time X * N sequences are being processed by the distributed BLAST implementation given that each node (N) is processing X sequences. The distributed BLAST implementation will try to keep worker nodes busy as often as possible and as such will ensure it has enough sequences read in from the input file at any given time to immediately serve sequences to each worker node. As such the program must keep the X * N sequences currently being processed stored in memory as well as X * N additional sequences to ensure each worker node has sequences ready to be sent to them. In the event that all N nodes complete

BLASTing at the same time, then each of the nodes can be passed out X sequences for processing without waiting for the master node to read them from the input file.

These sequences are then sent out to the worker nodes as discussed in section 3.2.5

Database and Query Transfer.
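The double-buffering described above can be sketched as follows. The queue-based approach and the method names are illustrative assumptions; X and N are whatever values the estimation step and node discovery produced.

    using System.Collections.Generic;

    // Sketch of keeping X * (N * 2) sequences in memory: X * N out for processing
    // and another X * N ready to hand out the moment any worker finishes.
    class QueryBufferSketch
    {
        private readonly Queue<string> readySequences = new Queue<string>();

        public void Refill(IEnumerator<string> inputFile, int x, int n)
        {
            // Top the buffer back up to two full rounds of work (X sequences per node).
            while (readySequences.Count < x * n * 2 && inputFile.MoveNext())
                readySequences.Enqueue(inputFile.Current);
        }

        public List<string> NextSegment(int x)
        {
            // Hand a worker its next X sequences without waiting on the input file.
            List<string> segment = new List<string>(x);
            while (segment.Count < x && readySequences.Count > 0)
                segment.Add(readySequences.Dequeue());
            return segment;
        }
    }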

3.2.4 Database Segmentation

If the user selects to begin the distributed BLAST engine using the database segmentation method, then the master node will need to ensure the database

already exists in segments. Unlike mpiBLAST (Darling, Carey and Feng 2003), which will split a database into segments on the fly, our distributed BLAST implementation will instead rely on formatdb to segment databases that are larger than one half of any given worker node's RAM. Because executing formatdb is a slow operation, we will only run it when the database is added to distributed BLAST.

Thus the database may not have been fragmented if it was not larger than half of any of the worker node’s RAM at the time. If the database was large enough to be fragmented, then the database will be split into N fragments such that each fragment is approximately one half the smallest worker node’s RAM in size.
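As a concrete restatement of this sizing rule, the sketch below computes how many fragments formatdb would be asked to produce so that each fragment is roughly half of the smallest worker node's RAM; it is an illustration of the rule, not the program's actual code.

    using System;

    // Sketch of the fragment-count calculation used for database segmentation.
    static class FragmentSizingSketch
    {
        public static int FragmentCount(long databaseBytes, long smallestWorkerRamBytes)
        {
            long targetFragmentSize = smallestWorkerRamBytes / 2;  // half the smallest node's RAM
            if (databaseBytes <= targetFragmentSize)
                return 1;                                          // small databases are left whole
            return (int)Math.Ceiling((double)databaseBytes / targetFragmentSize);
        }
    }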

3.2.5 Database and Query Transfer

Once the distributed BLAST engine has been started, the master must confirm that each worker node has all the files necessary to complete the BLAST. As discussed in section 3.2.2 User Interface, when a new database is added to distributed BLAST an information file containing all the information needed by distributed BLAST is generated. When it comes time to determine if a worker node needs a copy of the database, this information file is transferred over to the worker node via tcp port

15020. The worker will then check if it has a copy of the same information file and the same files described by the information file and will report back whether or not it has the exact database files being described. If the information file did not already exist on the worker node, the worker node was missing any of the files listed in the new copy of the information file, or the information files contained different data,

then a new copy of the database is transferred to the worker node using the same port.

Once the worker node has received the database files required for BLASTing, the master node will begin handing out query sequences. Query transfer is handled by having the master node scan through the list of worker nodes to determine which workers meet the following requirements: they are currently connected and responding to the master node, they have successfully received the database files, and they are not already doing work for the master. Any worker node found to meet these requirements will have a separate handler thread spawned to monitor and communicate with it. This handler thread will then begin communicating with the worker node to establish a file transfer connection so that it may transfer a query file to BLAST. In the case of query segmentation, the worker node will receive a query segment containing some previously determined number of sequences, which it will then BLAST against the entire sequence database. In the case of database segmentation, the worker node will receive the entire set of query sequences and instructions to BLAST against one fragment of the database.

The worker node will then BLAST the files it received until it has either completed the BLAST or the NCBI BLAST program fails. In either event the worker node will notify the handler thread on the master node of the BLAST’s status. Successful completion of a BLAST will have another file transfer connection established and the resulting BLAST output file will be transferred back to the master node. In the event

of a failure, the master node will mark the run as a failure so that another worker node may take up the job. The handler thread then shuts down the connection between the worker and master node so that a new handler may establish a connection when the worker node is needed again.

3.2.6 Results Compilation

If the user selected to run distributed BLAST using the query segmentation method, then the result files that return from the worker nodes will not require any correction before they can be concatenated together to form the complete result file.

These files will not require any correction because the worker node would have already corrected them as detailed in section 3.3.4 Result Correction. As such, once each result file has been received from a worker node, it will be added to the final result file in the order it was segmented. This ensures that the final result file would contain the same information, in the same order, as the result file attained from running BLAST locally.

If the user selected to run distributed BLAST using the database fragmentation method, then the result files that return from worker nodes will require processing using a selection algorithm. This selection algorithm is as follows:

1. Once each result file returns from the worker nodes, the master node will

begin reading the resulting BLAST output data from each worker node. The

master node will then read in and print out all the XML header information


all the way down to the first iteration element, see section 3.3.4 Result

Correction for more information regarding what data is in each element.

Because this data should be the same for each file, the header information

from one of the files will be immediately copied into the final BLAST output

file.

2. From each file, all the hit and HSP elements will be read in from the next

iteration element. By comparing the E-values for each HSP from each hit,

the master node will be able to sort the hits from the most significant to the

least significant. Once sorted the top X, where X is the number of hits the

user selected to receive, will be taken and written into the final BLAST

output file.

3. Step 2 will then repeat for each iteration in the BLAST output files.

4. Once the final iteration has been handled, the footer information from the

BLAST output files will be copied into the final BLAST output file and all the

files will be closed.

5. The segment BLAST output files are then deleted as they are no longer

needed.
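A simplified sketch of steps 2 and 3 of this selection algorithm is shown below. Real BLAST output is XML with iteration, hit, and HSP elements; here the parsed hits are represented by a minimal placeholder type, and only the pooling, sorting, and top-X selection are illustrated.

    using System.Collections.Generic;
    using System.Linq;

    // Minimal stand-in for a parsed BLAST hit (the real data comes from the
    // iteration/hit/HSP elements of the XML output).
    class BlastHit
    {
        public string HitId;     // subject sequence identifier
        public double EValue;    // most significant HSP E-value for this hit
    }

    static class ResultMergeSketch
    {
        // Pool the hits reported by every database fragment for one query
        // (iteration), sort from most to least significant, and keep the top X.
        public static List<BlastHit> MergeIteration(IEnumerable<List<BlastHit>> hitsPerFragment, int topX)
        {
            return hitsPerFragment
                .SelectMany(hits => hits)
                .OrderBy(hit => hit.EValue)   // smaller E-value means more significant
                .Take(topX)
                .ToList();
        }
    }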

3.3 Algorithm behind distributed BLAST Worker Nodes

The distributed BLAST’s worker node algorithm is broken into numerous parts spread across a small number of stages in the program’s execution. Each portion of the algorithm shall be described by the tasks it carries out, the means by which it

will carry out those tasks, and the interactions it has with the other portions at various stages of the program's execution. The portions this section will cover for the worker nodes include: master discovery and connection establishment, query transfer, BLAST, result correction, and result transfer.

3.3.1 Master Discovery and Connection Establishment

When the distributed BLAST program begins execution on a worker node, the program will first need to inform the multicast group of its presence. By informing the multicast group of its presence, the worker node can immediately inform all master nodes that the worker node has joined the distributed system instead of waiting for each master node to perform another node discovery. Performing this step requires that the worker node first connect to the multicast group as explained in section 3.1.1 Algorithm for creating a Distributed System from Existing Nodes.

The worker node will then broadcast the new worker node activated message as defined in the same section. Once the message has been broadcast, the worker node will begin monitoring port 15020 for commands from existing master nodes.

Simultaneously the worker node will also monitor the multicast group for messages regarding node discovery. Once a node discovery message is received from a master node, the worker node will reply with the worker node response message defined in section 3.1.1 Algorithm for creating a Distributed System from Existing Nodes. The worker node will not store information on other worker nodes or any master nodes that contact it through the multicast group. Only master nodes that contact the worker node via port 15020 will have their information stored, and this information will be deleted as soon as the work for that master node has been completed.
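
A bare-bones sketch of this start-up announcement is shown below. The multicast group address and port are placeholders (only TCP port 15020 is named in the text), and the message string stands in for the new worker node activated message; this is an illustration under those assumptions, not the application's actual implementation.

    import socket
    import struct

    MULTICAST_GROUP = "239.0.0.1"   # placeholder group address
    MULTICAST_PORT = 15000          # placeholder multicast port

    def announce_worker():
        """Broadcast the worker-activated message and join the multicast group
        so that node discovery messages from master nodes can be heard."""
        sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
        sender.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
        sender.sendto(b"WORKER_NODE_ACTIVATED", (MULTICAST_GROUP, MULTICAST_PORT))

        # Join the group so discovery messages from master nodes are received.
        listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
        listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        listener.bind(("", MULTICAST_PORT))
        membership = struct.pack("4sl", socket.inet_aton(MULTICAST_GROUP), socket.INADDR_ANY)
        listener.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership)
        return listener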

3.3.2 Database and Query Transfer

Once the worker node has discovered a master node it will begin monitoring TCP port 15020, waiting for the master node to assign it a task. Before the master node can assign a worker node a task to complete, it must first ensure the worker node has the database files needed to complete the work. The master node will signal the worker node on port 15020 in order to alert the worker node that a database check needs to occur. Once the worker node has replied, the information file discussed in section 3.2.2 User Interface is transferred to the worker node for comparison. The worker will first check whether it has a file with the same name, whether the contents of both files are an exact match, and whether the database files described in the information file exist. The worker node will then report back whether or not it has the database files being described. If the information file did not already exist on the worker node, if the worker node was missing any of the files listed in the new copy of the information file, or if the information files contained different data, then a new copy of the entire database is transferred to the worker node using the same port. Because the worker node cannot guarantee it will be given a task by the master node, it will simply signal to the master node that the database transfer completed successfully and then resume monitoring port 15020.
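
A sketch of the worker-side database check is given below. The information file layout is an assumption (a first line holding N and n, remaining lines listing database file names); the three tests themselves, a matching file, identical contents, and all listed database files present, follow the description above.

    import os

    def database_is_current(local_info_path, received_info_text, database_dir):
        """Return True if the worker already holds the database described by
        the master's information file, so no transfer is required."""
        # The local information file must exist.
        if not os.path.exists(local_info_path):
            return False
        # Its contents must match the master's copy exactly.
        with open(local_info_path) as local_info:
            if local_info.read() != received_info_text:
                return False
        # Every database file listed in the information file must be present.
        for name in received_info_text.splitlines()[1:]:
            if name and not os.path.exists(os.path.join(database_dir, name)):
                return False
        return True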


A worker node that has been selected to perform a BLAST will receive a signal on port 15020 instructing it to prepare to BLAST. If the worker node is not currently performing a BLAST it will respond to the master node that it is prepared to BLAST. The query sequences that the master wishes the worker to BLAST will then be transferred across the same port and stored into the worker node’s query directory. The master will then pass the BLAST options to the worker node so that it may perform the BLAST.

3.3.3 BLAST

With the sequence database files, query file, and BLAST options transferred from the master node, the worker node can now begin execution of NCBI BLAST. This is done by calling the program specified in the BLAST options (e.g. blastall, megablast) and passing it the file paths for the sequence database, query file, and the BLAST options.

NCBI BLAST will now execute normally and once complete the worker node will read the exit code given by the program. If the exit code shows the program completed successfully, then the worker node will continue on to correcting the resulting blastout file. If the exit code shows NCBI BLAST encountered an error of some form, then the master node will be notified of the error and the worker node will return to an idle state awaiting commands from a master node.
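
As a rough illustration, the worker's call into NCBI BLAST could look like the sketch below. The option strings are those of the legacy blastall command line; the file paths are placeholders, and -m 7 is used here because the correction step works on XML output. This is a sketch under those assumptions, not the application's exact invocation.

    import subprocess

    def run_blast(program, database_path, query_path, output_path):
        """Invoke the legacy NCBI blastall tool and report whether it succeeded."""
        command = [
            "blastall",
            "-p", program,        # e.g. "blastn" or "tblastx"
            "-d", database_path,
            "-i", query_path,
            "-o", output_path,
            "-m", "7",            # request XML-formatted blastout output
        ]
        exit_code = subprocess.call(command)
        # A zero exit code means the worker can move on to result correction;
        # anything else is reported back to the master node as a failure.
        return exit_code == 0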


3.3.4 Result Correction

Once a BLAST completes, the worker node must make corrections to the resulting file. When a file was run through BLAST using the query segmentation method, the resulting blastout file must have certain parts removed so that the master node can easily reassemble the pieces. The resulting blastout file contains all the BLAST results for each sequence; however, it also contains some introductory text and parameters at the beginning of the file, and some closing text and information at the end of the file. When the worker node received the file it also received a number stating its position in the query segments. During result correction the worker node will check this value to determine what parts, if any, need to be removed. If the file is the first query segment, then the file keeps the introductory text and parameter information at the top of the file, but the data at the end is removed unless this segment is also the final segment. If the segment is the final segment, then the text at the end of the file is left in place but the introductory text and parameters at the beginning of the file are removed, unless the segment is also the first segment. In all other cases, both the starting text and the end text are removed from the file. Thus, when the master node receives the pieces it can simply place them all together without risk of the introductory and end text repeating multiple times throughout the middle of the file.
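
A compact sketch of this trimming logic follows. It treats the blastout file as header, body, and footer regions and assumes the boundaries can be located at the first and last iteration elements of the XML output; the exact markers are an assumption for illustration rather than the application's actual parsing rules.

    def trim_segment_output(text, is_first, is_last):
        """Trim a query-segment blastout file so the segments can simply be
        concatenated by the master node.

        The header is kept only for the first segment and the footer only for
        the last segment; boundaries are assumed to fall at the first and last
        <Iteration> elements of the XML output."""
        header_end = text.index("<Iteration>")
        footer_start = text.rindex("</Iteration>") + len("</Iteration>")

        trimmed = text[header_end:footer_start]
        if is_first:
            trimmed = text[:header_end] + trimmed
        if is_last:
            trimmed = trimmed + text[footer_start:]
        return trimmed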

When a file was run through BLAST using the database fragmentation method, the resulting blastout file must have all of its E-values corrected. In order to calculate the E-value reported by NCBI BLAST, the BLAST program must know the total number of sequences in the database, N, as well as the total size of the database, n. However, when the worker node ran the query sequences over a fragment of the database, BLAST reported the E-value scores for only that single database fragment. As such, the N and n values used to calculate the scores were incorrect, as they covered only a subset of the entire database. As discussed previously in section 3.3.2 Database and Query Transfer, the worker node will receive a database information file whenever it receives a copy of the database that will be used by BLAST. This information file, generated on the master node, contains the values of N and n, the number of sequences in the database and the size of the database respectively, for the entire unsegmented database. Thus, upon completing a BLAST, the worker node can pull the values of N and n from this file for use in calculating the correct E-value.


Figure 5: Overview of the BLAST Output Layout (header XML data; per-iteration query sequence length m, raw HSP score S, incorrect E-value, kappa k, and lambda λ; footer XML data)

In order to calculate the correct E-value, the worker node must open the blastout file that resulted from running BLAST so that it may read in the BLAST output data and write the corrected data to a new blastout file. The blastout file is written in XML format and contains numerous lines of header information describing which BLAST program and version was used to generate the file, what parameters were passed to the program, and the name of the database used. Following the header data are one or more iteration XML elements, each providing the BLAST information regarding a sequence from the input file. Each iteration element contains the length of the query sequence, m, an iteration statistics element, and one or more hit XML elements, each representing a single match between a query sequence and a database sequence. Within each hit XML element the blastout file contains one or more HSP XML elements, each representing a single HSP match between the query sequence and the database sequence. By storing the information within each of these elements, the program can determine all the information needed to recalculate the E-value for each HSP. Each HSP element contains the raw score, S, for the HSP as well as the E-value that will be recalculated. From the iteration statistics element included in the iteration element, the program will retrieve the values for kappa, k, and lambda, λ, that the BLAST program used to find the E-values. A brief overview of the BLAST output file’s XML layout, showing only the information the worker node is concerned with, is shown above in Figure 5. The worker node will continue collecting data from each iteration element and, once it has collected all the information it requires, will recalculate the E-value for each HSP element. Recalculation of the E-value is done using the equations detailed in section 2.4.3 E-Value Calculation, which are shown below for reference:

S′ = (λS − ln k) / ln 2

m′ = m − ℓ

n′ = n − N × ℓ

E = m′ × n′ × 2^(−S′)

and ℓ = β + (α / λ) × [ln k + ln((m − ℓ) × (n − N × ℓ))], subject to (m − ℓ) × (n − N × ℓ) ≥ max(m, n)

where ℓ is the length adjustment and α and β are the scoring-system parameters used by NCBI BLAST.
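
As a minimal sketch of the recalculation, the fragment below recomputes an HSP's expect value from its raw score using the Karlin-Altschul relation E = k × m′ × n′ × e^(−λS). It assumes the effective lengths m′ and n′ have already been derived from the whole-database values of N and n in the information file, and it omits the length-adjustment step; the function and argument names are illustrative only.

    import math

    def corrected_evalue(raw_score, kappa, lam, m_eff, n_eff):
        """Recompute an HSP's expect value against the full, unsegmented database.

        raw_score      raw HSP score S read from the blastout file
        kappa, lam     the k and lambda values from the iteration statistics element
        m_eff, n_eff   effective query and database lengths m' and n', computed
                       using the whole-database values of N and n"""
        return kappa * m_eff * n_eff * math.exp(-lam * raw_score)

    # Illustrative call with made-up numbers:
    # corrected = corrected_evalue(120, 0.041, 0.267, 450, 3.5e10)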

3.3.5 Result Transfer

Once the resulting BLAST output file has been corrected, the worker node will notify the master node that the BLAST completed successfully. Once notified, the master node will accept a file transfer connection from the worker node and the BLAST output file will be transferred to the handler thread on the master node. At this point the worker node will close the connection with the master node and enter into an idle state while it awaits commands from another master node.
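
A bare-bones sketch of the worker-side transfer appears below. The status message text, the chunk size, and the use of a fresh TCP connection back to the handler thread are assumptions made for illustration, not the application's actual protocol.

    import socket

    def send_result_file(master_address, master_port, result_path):
        """Notify the master node of success and stream the corrected blastout
        file back to its handler thread over a TCP connection."""
        with socket.create_connection((master_address, master_port)) as connection:
            connection.sendall(b"BLAST_COMPLETE\n")   # placeholder status message
            with open(result_path, "rb") as result_file:
                while True:
                    chunk = result_file.read(64 * 1024)
                    if not chunk:
                        break
                    connection.sendall(chunk)
        # After the transfer the worker closes the connection and returns to idle.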


Chapter 4 Results

Chapter 4 shall discuss how the experiments were set up and how data was chosen and used to demonstrate the execution capabilities of the distributed BLAST engine versus standard local BLAST. This chapter shall then continue with a discussion regarding how each experiment was conducted as well as the results of these experiments.

4.1 Experiment Setup and Environment

The goal of these experiments was to see how the distributed BLAST system would handle executing BLAST runs in parallel using a distributed BLAST application that was developed as per the discussions in chapter 3. Testing the capabilities of the distributed BLAST system was done using three separate experiments: two to test the query set segmentation method on a small database using BLASTN and TBLASTX, and a final experiment to test the database segmentation method against a larger database using BLASTN. The first experiment tested a set of 200,000 nucleotide sequences against a database containing the same 200,000 nucleotide sequences using the BLASTN application, see section 2.3.1 BLASTN and MegaBLAST, from NCBI. The second experiment tested the same set of 200,000 nucleotide sequences against the same database using the TBLASTX application, see section 2.3.5 TBLASTX, from NCBI. The third experiment tested the same set of 200,000 nucleotide sequences against a database created from the larger set of sequences that these 200,000 were randomly selected from. This experiment ran the query sequences against the database using the BLASTN application from NCBI. These experiments are explained in greater detail and their results are discussed in the following sections.

The testing environment consisted of 11 computers which have been designated as Class A, Class B, and Class C machines; see Table 4 below. Class A consisted of four Windows Server 2003 machines with two four-core processors running at 2.66 GHz and 16 GB of RAM each. Class B consisted of six Windows Server 2008 machines with two four-core processors running at 2.67 GHz and 6 GB of RAM each. Class C consisted of one Windows Vista x64 machine with two four-core processors running at 2.16 GHz and 6 GB of RAM. The first two experiments, detailed in the following sections, were run three times under each of the following named configurations:

• Local 1: Running BLAST locally on one Class A machine.

• Local 2: Running BLAST locally on one Class B machine.

• Configuration 1: Running distributed BLAST across two Class A machines, with the Class C machine running as the Master Node.

• Configuration 2: Running distributed BLAST across four Class A machines, with the Class C machine running as the Master Node.

• Configuration 3: Running distributed BLAST across four Class A machines and two Class B machines, with the Class C machine running as the Master Node.

• Configuration 4: Running distributed BLAST across four Class A machines and four Class B machines, with the Class C machine running as the Master Node.

• Configuration 5: Running distributed BLAST across four Class A machines and six Class B machines, with the Class C machine running as the Master Node.

Class  OS                       Members  RAM    Processor(s)   CPU Speed  Cores
A      Windows Server 2008 x64  4        16 GB  Intel Xeon x2  2.66 GHz   4 each / 8 total
B      Windows Server 2003 x64  6        6 GB   Intel Xeon x2  2.67 GHz   4 each / 8 total
C      Windows Vista x64        1        6 GB   Intel Xeon x2  2.13 GHz   4 each / 8 total

Table 4: Brief Description of Machine Classes

The final experiment, detailed in section 4.3 Distributed BLAST versus local BLAST on a large database, was run three times under each of the following configurations:

• Local 1 B: Running BLAST locally on one Class B machine.

• Configuration 1 B: Running distributed BLAST across five Class B machines, with the Class C machine running as the Master Node.


The sequence files used in the first two experiments were derived by downloading the nucleotide coding regions for all known bacterial genomes from the NCBI website (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/). The all.ffn.tar.gz file contains all of these coding regions and, once downloaded, the separate coding regions were merged together into one large sequence file containing 4,113,631 sequences, as of February 2011. A Python script was then written to randomly choose 200,000 unique numbers between 1 and 4,113,631 and then pull those sequences out of the file. This created a FASTA formatted sequence file that contained 200,000 randomly selected nucleotide sequences. This file was then used as the input query sequence file for all three experiments as well as the database file, formatted using formatdb, for the first two experiments. The third experiment uses the NT database downloaded from the NCBI website (ftp://ftp.ncbi.nih.gov/blast/db/). The NT database contains all the nucleotide sequences recorded in the GenBank, EMBL, DDBJ, and PDB databases and contains 36,185,985,577 nucleotide letters spread across 15,639,775 sequences. From this database a Python script was used to randomly select 2,000 sequences and create a FASTA formatted query sequence file. This 2,000 sequence file was then executed using BLASTN against the entire NT database.
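
The selection script itself is not reproduced in the dissertation; the sketch below shows one way such a script could work, sampling record numbers without replacement and copying the chosen FASTA records to a new file. The file names and function name are placeholders.

    import random

    def sample_fasta(input_path, output_path, sample_size, total_records):
        """Copy a random subset of FASTA records from input_path to output_path."""
        chosen = set(random.sample(range(1, total_records + 1), sample_size))
        record_number = 0
        keep = False
        with open(input_path) as source, open(output_path, "w") as target:
            for line in source:
                if line.startswith(">"):       # a header line starts a new record
                    record_number += 1
                    keep = record_number in chosen
                if keep:
                    target.write(line)

    # e.g. sample_fasta("all_bacteria.ffn", "query_200k.fasta", 200000, 4113631)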

4.2 Distributed BLAST versus local BLAST on a small database

The first two experiments were designed to test the distributed BLAST application’s ability to perform distributed BLAST operations using the query set segmentation method. This method is designed to execute a query set against a database small enough to fit into memory on each machine. These experiments tested the distributed BLAST application’s ability to balance the workload across a variable number of systems by running each experiment against multiple configurations of machines, with each test being run three times. The first experiment tests the application’s ability to scale when running the 200,000 sequence query set against the 200,000 sequence database using the BLASTN tool. The BLASTN tool was chosen to ensure the distributed BLAST application can achieve linear speed up when dealing with a query set and a database that will not require any translation before execution. The second experiment tests the application’s ability to scale when running the 200,000 sequence query set against the 200,000 sequence database using the TBLASTX tool. The TBLASTX tool was chosen to ensure the distributed BLAST application can achieve linear speed up when dealing with a query set and a database that must both first be translated from nucleotide to protein sequences. The translation process will cause the number of sequences in the query set that need to be compared against the database to triple to 600,000 protein sequences. The translation process will have a similar effect on the database set, causing it to also triple in size from 200,000 sequences to 600,000 sequences. The layout and results of each of these experiments are documented below in the following two subsections.


4.2.1 Comparing Local and Distributed BLAST on a small database using BLASTN

Experiment one was conducted in order to verify the distributed BLAST application’s ability to reach near-linear performance increases when executing the 200,000 sequence query set against the 200,000 sequence database when using BLASTN. This experiment required that the BLAST first be run three times locally on a Class A machine and three times locally on a Class B machine. These values give us the baseline on which we base our estimates of what constitutes a linear increase in speed. The BLAST execution run on the Class A machine returned a result after 1 hour and 36 minutes on all three executions. The same BLAST execution run on the Class B machine returned a result after 2 hours on all three executions. Next the distributed BLAST application was used to execute the BLASTN application under Configuration 1, which consists of two Class A machines. This test returned a completed BLAST result in 52 minutes. Configuration 2, consisting of four Class A machines, returned a completed BLAST result in 27 minutes. Configuration 3, which consists of four Class A machines and two Class B machines, completed the BLAST execution in 18 minutes, while Configuration 4, which consists of four Class A machines and four Class B machines, completed the BLAST execution in 14 minutes. The final test of the distributed BLAST application used Configuration 5, consisting of four Class A machines and six Class B machines, and completed the BLAST execution in 12 minutes. Table 5 below shows the average run times for each configuration used in experiment 1.


Configuration    Class A Machines  Class B Machines  Average Time in Hours
Local 1          1                 0                 1.61
Local 2          0                 1                 2.007
Configuration 1  2                 0                 0.867
Configuration 2  4                 0                 0.469
Configuration 3  4                 2                 0.307
Configuration 4  4                 4                 0.241
Configuration 5  4                 6                 0.205

Table 5: Results from Experiment #1

In Figure 6 and Figure 7, shown below, the amount of time taken to complete a BLASTN execution under the varying number of nodes is depicted in a line graph and a bar graph respectively. The line/bar labeled ‘Linear Time for Class A’ shows the number of hours the Class A machines should have taken if their local execution time scaled linearly. The line/bar labeled ‘Linear Time for Class B’ shows the number of hours the Class B machines should have taken if their local execution scaled linearly. These figures show that Configurations 1 and 2, using two and four Class A machines respectively, achieved near-linear time but did spend some additional time performing the communication and correction stages, causing them to run slightly slower than linear. Configurations 3, 4 and 5 show similar results, as these configurations use four Class A machines with two, four, and six Class B machines respectively. Because the Class B machines are slower and contain less memory, we would expect them to have a negative impact on the time taken to complete a BLAST execution. The addition of the Class B machines causes the time to skew further away from the ‘Linear Time for Class A’ line and move closer to the ‘Linear Time for Class B’ line as more Class B machines were used to perform the BLAST execution.

[Line graph: time taken to complete BLAST (hours) versus the number of nodes performing BLASTN, plotting Distributed BLAST Time, Linear Time for Class A, and Linear Time for Class B.]

Figure 6: Experiment #1 Execution Times - Line Graph


[Bar graph: time taken to complete BLAST (hours) for the local run and for 2, 4, 6, 8, and 10 nodes performing BLASTN, comparing Distributed BLAST Time, Linear Time for Class A, and Linear Time for Class B.]

Figure 7: Experiment #1 Execution Times - Bar Graph

4.2.2 Comparing Local and Distributed BLAST on a small database using TBLASTX

The second experiment was conducted in order to verify the distributed BLAST application’s ability to reach near-linear performance increases when executing the 200,000 sequence query set against the 200,000 sequence database when using TBLASTX. This experiment required that the BLAST first be run three times locally on a Class A machine and three times locally on a Class B machine. These values give us the baseline on which we base our estimates of what constitutes a linear increase in speed. The BLAST execution run on the Class A machine returned a result after 69 hours and 15 minutes on average across the three executions. The same BLAST execution run on the Class B machine returned a result after 91 hours and 13 minutes on average across the three executions. Next the distributed BLAST application was used to execute the TBLASTX application under Configuration 1, which consists of two Class A machines. This test returned a completed BLAST result in 27 hours and 43 minutes. Configuration 2, consisting of four Class A machines, returned a completed BLAST result in 12 hours and 58 minutes. Configuration 3, which consists of four Class A machines and two Class B machines, completed the BLAST execution in 10 hours and 12 minutes, while Configuration 4, which consists of four Class A machines and four Class B machines, completed the BLAST execution in 7 hours and 30 minutes. The final test of the distributed BLAST application used Configuration 5, consisting of four Class A machines and six Class B machines, and completed the BLAST execution in 5 hours and 21 minutes. Table 6 below shows the average run times for each configuration used in experiment 2.

Configuration    Class A Machines  Class B Machines  Average Time in Hours
Local 1          1                 0                 69.25
Local 2          0                 1                 91.21
Configuration 1  2                 0                 27.717
Configuration 2  4                 0                 12.967
Configuration 3  4                 2                 10.2
Configuration 4  4                 4                 7.5
Configuration 5  4                 6                 5.35

Table 6: Results from Experiment #2


In Figure 8 and Figure 9, shown below, the amount of time each TBLASTX execution took under the varying number of nodes is depicted in a line graph and a bar graph respectively. The line/bar labeled ‘Linear Time for Class A’ shows the number of hours the Class A machines should have taken if their local execution time scaled linearly. The line/bar labeled ‘Linear Time for Class B’ shows the number of hours the Class B machines should have taken if their local execution scaled linearly. These figures show that Configurations 1 and 2, using two and four Class A machines respectively, achieved a super-linear increase in speed, which allowed them to complete the task faster than the linear projection for the Class A machines. Configurations 3, 4 and 5 show similar results, as these configurations use four Class A machines with two, four, and six Class B machines respectively. Because the Class B machines are slower and contain less memory, we would expect them to have a negative impact on the time taken to complete a BLAST execution. The addition of the Class B machines causes the execution time to slide closer to the time represented by the ‘Linear Time for Class A’ line; however, the increase in speed using this method for TBLASTX still allows it to maintain a super-linear increase in speed, even when compared against the ‘Linear Time for Class A’ line. It appears, though, that the addition of two or four more Class B machines would have caused the execution time to fall behind the linear time for the Class A machines, while still achieving a super-linear increase overall.

Achieving super-linear increases in speed occurs because of the way the NCBI BLAST algorithm handles TBLASTX executions. TBLASTX requires that both the query sequences and the database sequences be translated from one nucleotide sequence into three protein sequences each. This causes the number of query sequences to increase three fold and the number of database sequences to also increase three fold. The TBLASTX algorithm requires that the database sequences be translated for every query sequence, requiring three translations for every nucleotide sequence. This causes the algorithm execution time to increase at a greater than linear rate for every query sequence, whereas in the BLASTN algorithm changes to the number of query sequences cause a linear increase or decrease in execution time. By reducing the number of sequences run on each node, we are able to achieve a super-linear increase in speed because we reduce the time each node must spend translating the database from nucleotide sequences to protein sequences.


[Line graph: time taken to complete BLAST (hours) versus the number of nodes performing TBLASTX, plotting Distributed BLAST Time, Linear Time for Class A, and Linear Time for Class B.]

Figure 8: Experiment #2 Execution Times - Line Graph


[Bar graph: time taken to complete BLAST (hours) for the local run and for 2, 4, 6, 8, and 10 nodes performing TBLASTX, comparing Distributed BLAST Time, Linear Time for Class A, and Linear Time for Class B.]

Figure 9: Experiment #2 Execution Times - Bar Graph

4.3 Distributed BLAST versus local BLAST on a large database

The final experiment was designed to examine the distributed BLAST application’s ability to perform distributed BLAST operations using the database fragmentation method. This method is designed to execute a query set against a database that is too large to fit into memory. Because the BLAST algorithm looks at one sequence at a time and compares it to each member of the database, the system will only run smoothly if all the database sequences fit into memory so that the computer can perform all of the operations in memory. If the database does not fit, the computer must constantly waste time moving sequences back and forth between memory and the secondary storage device, i.e. the hard drive. This added time is what causes large database executions to take many times longer to complete than executions on small databases, causing the time taken to increase much faster than on a linear scale.

By splitting the database into fragments that are small enough to fit into the memory of a machine, the time spent searching the database is reduced at more than a linear rate. For example, if a database that was too large to fit into memory was cut into two equal pieces, each piece capable of fitting in memory, then the time it would take to run a BLAST across each piece individually would be less than half the time it would have taken to run the BLAST against the single database. As such, if we send one fragment of the database to one machine and the other fragment to a second machine, we would retrieve the results from each in less than half the time it would have taken to run the BLAST locally. In other words, the time, TF, it takes to run the BLAST against a single fragment is less than the time, TL, it took to run the BLAST locally divided by the number of fragments, f, as shown in the equation below:

TF < TL / f

The final experiment was run locally three times on Class B machines and was not run on Class A machines. Because the Class A machines have 16 GB of RAM, they are capable of holding the entirety of the NT database in RAM without issue. Because this test is to show that super-linearity can be attained using the distributed BLAST application, I chose to remove the Class A machines from the experiment as they would never attain super-linear speed ups and would, instead, only be capable of attaining linear or near-linear increases in speed. When run locally, the Class B machines completed the BLASTN execution in 15 hours and 36 minutes on average.

As discussed in section 2.4.2 Sequence Database Segmentation, the database fragmentation method splits the database into smaller fragments and then executes the entire sequence query file against each fragment. The NT database was downloaded already split into ten roughly equal-size partitions. The third experiment was run using the distributed BLAST application by having each Class B node perform the BLASTN execution on 2 segments of the NT database at the same time. This caused each machine to only deal with one-fifth of the NT database at a time, allowing each node’s share of the database to fit into memory. As such, the third experiment was run using the distributed BLAST application only three times, each time using five Class B nodes to perform the work. These Class B machines were able to achieve super-linear speed ups by returning a corrected BLASTN result in only 24 minutes, or approximately 1/39 of the time it took to run the BLAST locally. The results of this experiment are shown below in Table 7.

Configuration      Class B Machines  Average Time in Hours
Local 1 B          1                 15.6
Configuration 1 B  5                 0.4

Table 7: Results from Experiment #3


In Figure 10 and Figure 11, the amount of time each BLASTN execution took using the database fragmentation method is depicted in a line graph and a bar graph respectively. The line/bar labeled ‘Linear Time for Class B’ shows the number of hours the Class B machines should have taken if their local execution scaled linearly. Because the experiment was only run under two configurations, the local BLASTN execution and the distributed BLASTN execution across five Class B nodes, the figures only show these two data points. As was expected, performing the BLASTN execution across five nodes, each holding one-fifth of the NT database, allowed the program to achieve a super-linear speed up, which is depicted in both figures.

[Line graph: time taken to complete BLAST (hours) versus the number of nodes performing BLASTN, plotting Distributed BLAST Time and Linear Time for Class B.]

Figure 10: Experiment #3 Execution Times - Line Graph


[Bar graph: time taken to complete BLAST (hours) for the local run and for 5 nodes performing BLASTN, comparing Distributed BLAST Time and Linear Time for Class B.]

Figure 11: Experiment #3 Execution Times - Bar Graph


Chapter 5 Conclusions

This chapter will discuss the conclusions of the study as well as suggest future research and work that could be done in the field of distributed application development and statistical correction.

5.1 Results and Conclusions

The purpose of this study was to develop a framework capable of automatically building a distributed system that conforms to the standards set forth for all distributed algorithms and the distributed systems they are executed upon. As discussed in section 1.2 Problem Statement the developed system needed to be capable of running on heterogeneous machines, running securely, scaling up or down as needed, using proper fault tolerance/recovery to keep the system stable, running processes concurrently, and allowing for transparency.

During the course of this study we devised a communication method containing only three messages that allowed for the automatic connection and scaling of a distributed system. This message passing system then formed the core of our distributed system by allowing the heterogeneous nodes to communicate with one another using only messages. By implementing only three messages the system stays simple enough to easily secure and protect against faults while staying robust enough to allow multiple nodes to synchronize their operations using only message passing.

Once the communication method was developed, an application framework was constructed to allow for secure execution of distributed applications in a distributed heterogeneous environment that is created and maintained using the communication methods. By implementing the distributed application framework as small, self-contained methods, the framework grants future developers access to the methods required to perform concurrent operations on the remote nodes and to execute bioinformatics tools across the framework without altering their source code. Each of these methods was developed to allow only secure interactions between nodes as well as to protect the system from faults by tolerating minor faults and recovering from larger faults.

Lastly, a distributed BLAST application was constructed using the distributed communication framework to demonstrate the capability of the communication framework. Construction of this distributed BLAST application was accomplished by developing two applications, a master node application and a worker node application. These applications work in tandem to perform BLAST executions in parallel across the distributed environment, either by executing fragments of the query file on each node and then combining the final results together or by executing the entire query file on each node against a fragment of the database and then merging all the result files together. In the process of developing this second method, a method for quickly determining the correct statistical expect value for each final sequence was developed that allows the final result file to contain correct expect values.

The distributed BLAST application was expected to achieve linear speed up when the query files were segmented and run across the distributed system. In the case of BLASTN this did occur, with the system executing the BLASTN application across the distributed system and achieving a near-linear speed up on each test. However, TBLASTX was able to achieve super-linear speed up due to the way the BLAST algorithm handles the translation of query and database files. Because the BLAST algorithm translates each query sequence and then runs it against each translated member of the database, the number of execution passes needed to scan each translated sequence increases at a rate that is greater than linear. This allows the TBLASTX algorithm to achieve a super-linear speed up in certain cases when run in a distributed environment, as was noted in experiment two of this study.

The distributed BLAST application was expected to achieve a super-linear speed up when the database files were fragmented and the query sequences were run across these smaller databases. The database fragmentation method can only achieve super-linear increases in speed when the database is too large to fit into memory. As such, the more powerful machines that the experiment would have run on would have been capable of holding the entire database in memory without any issue and as such would have only achieved a near-linear increase in speed. However, the weaker machines that the experiment was run on, marked as Class B machines, did not contain enough memory to store the entire database. As such, the local execution of BLASTN against the NT database required over 15 hours to complete while the distributed BLASTN execution using only five Class B machines accomplished the task in under 25 minutes.

5.2 Future Work

Currently the distributed application framework is only capable of providing future developers with the methods required to construct distributed bioinformatics tools such as the distributed BLAST application developed in this study. Future work in the field of distributed bioinformatics applications should expand upon this framework to construct a general distributed bioinformatics application. This application should take as input a set of configuration files that describe how the master and worker nodes should behave and what application is being distributed. The application would then pass a copy of the executable to each remote node, along with the data to be processed, and perform the distributed executions until a result can be returned. A system of this nature would greatly benefit the field of bioinformatics as well as the field of distributed computing.


References

3tera. Cloud Computing For The Enterprise. 3tera. 2010. http://www.3tera.com/Cloud-computing/ (accessed April 9, 2010).

Altschul, Stephen F, and Warren Gish. "Local Alignment Statistics." Methods Enzymol. 266 (1996): 460-480.

Altschul, Stephen F., Warren Gish, Miller Webb, Eugene W. Myers, and David J. Lipman. "Basic local alignment search tool." Journal of Molecular Biology 215, no. 3 (October 1990): 403-410.

Amazon. Amazon Simple Storage Service (Amazon S3). Amazon. 2010. http://aws.amazon.com/s3/ (accessed April 9, 2010).

ANSA Project. ANSA Reference Manual. Cambridge: ANSA Project, 1987.

Benson, Dennis A., Ilene Karsh-Mizrachi, David J. Lipman, James Ostell, and David L. Wheeler. "GenBank." Nucleic Acids Research, 2007: D21-D25.

Carvalho, Paulo C., Rafael V. Glória, Antonio B. de Miranda, and Wim M. Degrave. "Squid - a simple bioinformatics grid." BMC Bioinformatics, 2005.

Chao, Kun-Mao, and Louxin Zhang. Sequence Comparison: Theory and Methods. London: Springer-Verlag London Limited, 2009.

Condor Team. Condor BLAST. April 15, 2004. http://www.cs.wisc.edu/condor/tools/BLAST/ (accessed May 19, 2010).

Costa, Rogério Luís de Carvalho, and Sérgio Lifschitz. "Database Allocation Strategies for Parallel BLAST Evaluation on Clusters." Distributed and Parallel Databases, 2003: 99-127.

Coulouris, George, Jean Dollimore, and Tim Kindberg. Distributed Systems: Concepts and Design (4th Edition). Pearson Education Limited, 2005.

Cristianini, Nello, and Matthew W. Hahn. Introduction to Computational Genomics: A case studies approach. Cambridge: Cambridge University Press, 2007.

Darling, Aaron E., Lucas Carey, and Wu-chun Feng. "The Design, Implementation, and Evaluation of mpiBLAST." 4th International Conference on Linux Clusters: The HPC Revolution 2003 in conjunction with ClusterWorld Conference & Expo. 2003.

Deering, Stephen E., and David R. Cheriton. "Multicast routing in datagram internetworks and extended LANs." ACM Transactions on Computer Systems (TOCS) 8, no. 2 (May 1990): 85-110.

Dowd, Scot E., Joaquin Zaragoza, Javier R. Rodriguez, Melvin J. Oliver, and Paxton R. Payton. "Windows .NET Network Distributed Basic Local Alignment Search Toolkit (W.ND-BLAST)." BMC Bioinformatics, 2005.

Durbin, Richard, Sean Eddy, Anders Krogh, and Graeme Mitchison. Biological sequence analysis. 12. Cambridge: Cambridge University Press, 2007.

Google. What is Google App Engine? Google. 2010. http://code.google.com/appengine/docs/whatisgoogleappengine.html (accessed April 9, 2010).

Grant, J. D., R. L. Dunbrack, F. J. Manion, and M. F. Ochs. "BeoBLAST: distributed BLAST and PSI-BLAST on a Beowulf cluster." Bioinformatics Applications Notice, 2002: 765-766.

Haubold, Bernhard, and Thomas Wiehe. Introduction to Computational Biology: An Evolutionary Approach. Basel: Birkhäuser Verlag, 2006.

ISO/IEC. "Information Technology - Open Distributed Processing - Reference Model: Architecture." Geneva: ISO/IEC, September 15, 1996.

Jones, Neil C., and Pavel A. Pevzner. An Introduction to Bioinformatics Algorithms. London: MIT Press, 2004.

Karlin, Samuel, and Stephen F Altschul. "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes." Proceedings of the National Academy of Sciences 87 (March 1990): 2264-2268.

Khanna, Raman, ed. Distributed Computing: Implementation and Management Strategies. Englewood Cliffs, NJ: Prentice Hall PTR, 1994.

Korf, Ian, Ian Yandell, and Joseph Bedell. BLAST: An Essential Guide to the Basic Local Alignment Search Tool. Sebastopol, CA: O'Reilly & Associates, 2003.

Lagnel, Jacques, Costas S. Tsigenopoulos, and Ioannis Iliopoulos. "NOBLAST and JAMBLAST: New Options for BLAST and a Java Application Manager for BLAST results." Bioinformatics 25, no. 6 (2009): 824-826.

Lazakidou, Athina. Biocomputational and Biomedical Informatics. Hershey: Medical Information Science Reference, 2010.

Lin, Hershan, Xiaosong Ma, Praveen Chandramohan, Al Geist, and Nagiza Samatova. "Efficient Data Access for Parallel BLAST." Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01. IEEE Computer Society, 2005. 72.2.

Markel, Scott, and Darryl León. Sequence Analysis In a Nutshell: A Guide to Common Tools and Databases. Sebastopol, CA: O'Reilly & Associates, 2003.

Microsoft. Windows Azure Platform. Microsoft. 2010. http://www.microsoft.com/windowsazure/windowsazure/ (accessed April 9, 2010).

Mount, David W. Bioinformatics: Sequence and Genome Analysis. 2nd. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press, 2004.

Nair, Achuthsankar S. "Computational Biology & Bioinformatics: A Gentle Overview." Communications of the Computer Society of India, January 2007.

Orengo, C. A., D. T. Jones, and J. M. Thornton. Bioinformatics: Genes, Proteins & Computers. Oxford: BIOS Scientific Publishers Limited, 2003.

Orlov, Y. L., and V. N. Potapov. "Complexity: an internet resource for analysis of DNA sequence complexity." Nucleic Acids Research, 2004: W628-W633.

Palankar, Mayur R., Adriana Iamnitchi, Matei Ripeanu, and Simson Garfinkel. "Amazon S3 for science grids: a viable solution?" Proceedings of the 2008 International Workshop on Data-Aware Distributed Computing, 2008: 55-64.

Pevsner, Jonathan. Bioinformatics and Functional Genomics. Hoboken, NJ: John Wiley & Sons, 2003.

Rey, Sébastien, Jennifer L Gardy, and Fiona SL Brinkman. "Assessing the precision of high-throughput computational and laboratory approaches for the genome-wide identification of protein subcellular localization in bacteria." BMC Genomics, 2005.

Rosenberg, Michael S., ed. Sequence Alignment: Methods, Models, Concepts, and Strategies. Berkeley, CA: University of California Press, 2009.

Smith, Temple F, and Michael S Waterman. "Identification of Common Molecular Subsequences." Journal of Molecular Biology, no. 147 (1981): 195-197.

Talbi, El-Ghazali, and Albert Y. Zomaya, eds. Grid Computing for Bioinformatics and Computational Biology. Hoboken, NJ: John Wiley & Sons, 2008.

Troyanskaya, Olga G., Kara Dolinksi, Art B. Owen, Russ B. Altman, and David Botstein. "A Bayesian Framework for combining heterogenous data sources for gene function prediction (in Saccharomyces cerevisiae)." Proceedings of the National Academy of Sciences 100, no. 14 (July 2003): 8348-8353.

Wang, Jiren, and Qing Mu. "Soap-HT-BLAST: high throughput BLAST based on Web services." Bioinformatics Applications Note, 2003: 1863-1864.

Zomaya, Albert Y, ed. Parallel Computing for Bioinformatics and Computational Biology. Hoboken, NJ: John Wiley & Sons, 2006.