Parallelization Methods for the Distribution of High-Throughput Algorithms

by

Eric James Rees, B.S.

A Dissertation

In

COMPUTER SCIENCE

Submitted to the Graduate Faculty of Texas Tech University in Partial Fulfillment of the Requirements for the Degree of

DOCTOR OF PHILOSOPHY

Submitted to:

Dr. Eunseog Youn

Dr. Scot Dowd

Dr. Michael San Francisco

Dr. Peggy Gordon Miller, Dean of the Graduate School

May, 2011

Copyright 2011, Eric Rees


Acknowledgements

Dr. Eunseog Youn and Dr. Scot Dowd have both been excellent advisors. Dr. Youn has been with me since the beginning of my time as a Bioinformaticist and has given me the guidance needed to ensure I was able to solve the problems I encountered.

Dr. Dowd has always been the one to keep me focused and on task while helping me see solutions to problems by pointing out new ways to view the problem. Both have forced me to think in ways I had never been forced to think before and without their guidance I may have never reached this point.

My family, for everything they have done for me throughout the years. My parents, Bill and Linda, have always supported my decisions and have always pushed me to excel at everything I try. I could not have done it without the great support of my family.

My friends from Texas Tech, with whom I have worked in some regard or another for 5 years: Eric Garcia, Brad Nemanich, and Viktoria Gonthcharova. Each of them has been there for me when I needed a break from work and research or when I needed help solving an equation.

My friends outside of Texas Tech who helped me relax when things became stressful, including Morgan Cadena, Joanna Burk, Shawna Miller, and Clint Miller.

Lastly, I would like to thank my amazing girlfriend Teresa, whose loving support and ability to know exactly what to do to keep me going is what made this achievement possible.


Table of Contents

Acknowledgements ...... ii

Table of Contents ...... iii

Abstract ...... vi

List of Tables ...... viii

List of Figures ...... ix

Chapter 1 Introduction ...... 1

1.1 Motivation ...... 1

1.2 Problem Statement ...... 4

1.3 Overview of the Dissertation ...... 8

Chapter 2 Related Work...... 9

2.1 Bioinformatics...... 9

2.2 Distributed Computing...... 11

2.3 BLAST ...... 17

2.3.1 BLASTN and MegaBLAST ...... 24

2.3.2 BLASTP ...... 25

2.3.3 BLASTX ...... 25

2.3.4 TBLASTN...... 26

2.3.5 TBLASTX...... 27

2.4 Distributed BLAST ...... 27

2.4.1 Query Set Segmentation ...... 28

2.4.2 Sequence Database Segmentation...... 30

2.4.3 E-Value Calculation ...... 32

2.5 Existing Distributed BLAST Applications ...... 36


Chapter 3 Approach ...... 43

3.1 Creating a Distributed System from Existing Nodes ...... 43

3.1.1 Algorithm for creating a Distributed System from Existing Nodes ...... 44

3.1.2 Methods for creating a Distributed Application Framework ...... 46

3.1.3 Meeting the definition of a distributed system ...... 53

3.1.4 Meeting the challenges of a distributed system ...... 55

3.2 Algorithm behind the Distributed BLAST Master Node ...... 62

3.2.1 Node Discovery and Connection Establishment...... 63

3.2.2 User Interface ...... 64

3.2.3 Query Segmentation...... 65

3.2.4 Database Segmentation ...... 66

3.2.5 Database and Query Transfer...... 67

3.2.6 Results Compilation ...... 69

3.3 Algorithm behind distributed BLAST Worker Nodes ...... 70

3.3.1 Master Discovery and Connection Establishment ...... 71

3.3.2 Database and Query Transfer...... 72

3.3.3 BLAST ...... 73

3.3.4 Result Correction ...... 74

3.3.5 Result Transfer ...... 78

Chapter 4 Results ...... 79

4.1 Experiment Setup and Environment ...... 79

4.2 Distributed BLAST versus local BLAST on a small database ...... 82

4.2.1 Comparing Local and Distributed BLAST on a small database using BLASTN ...... 84

4.2.2 Comparing Local and Distributed BLAST on a small database using TBLASTX...... 87

4.3 Distributed BLAST versus local BLAST on a large database ...... 92

Chapter 5 Conclusions ...... 97

5.1 Results and Conclusions ...... 97

5.2 Future Work ...... 100

References ...... 101


Abstract

The development of high-throughput bioinformatics technologies has caused a massive influx of biological data over the course of the past decade. During this same span of time, computational hardware has also been rapidly increasing in speed while decreasing in price, multi-core processors have become standard in home and office environments, and distributed and cloud based computing has become affordable and readily available to researchers with implementations such as Amazon’s S3, Microsoft’s Azure, Google’s App Engine, and the 3Tera Cloud.

Bioinformatics software tools such as BLAST, a tool for finding local alignments between a set of unknown genetic sequences and a set of known genetic sequences, often have simple interfaces and few installation requirements so that biologists can use them easily in the laboratory without needing an in-depth knowledge of how computer systems work. This, however, is rarely the case for distributed implementations of bioinformatics tools, which often require the user to first set up and configure the underlying program that will handle the distribution, such as the Message Passing Interface (MPI) or Remote Procedure Calls (RPC). Once the underlying distribution algorithm is chosen, many of the software tools require the user to then configure the program to work with their chosen method and, in some cases, write the necessary source code to link the program with the underlying service. These are difficult steps for most computer scientists and are near impossible for the average biologist.


By constructing a modularized set of methods that can connect to, broadcast to, and read from a multicast created by the methods, future bioinformatics software developers will be able to construct the underlying message passing system without requiring the end-user, often a biologist, to set up and configure one of their own.

Using these multicast methods will allow any program the ability to seek out and track any nodes on the network that will be used in the distributed system. This communication method allows the program to easily scale up and down depending on available nodes without direct user intervention to alter the size of the system.

This system is then tested by creating a program that connects NCBI's Basic Local Alignment Search Tool to the multicast system to allow the BLAST algorithm to be distributed across multiple nodes. This new system will demonstrate how future programs could connect stand-alone tools, such as BLAST, to the multicast system to create programs that will execute on a distributed system and automatically scale depending on the network size without altering the tool's source code.


List of Tables

Table 1: Distributed BLAST Implementations - Fault tolerance comparisons ...... 3
Table 2: Distributed BLAST Implementations - Speed up comparisons ...... 4
Table 3: Brief description regarding the data used and returned by each BLAST program ...... 24
Table 4: Brief Description of Machine Classes ...... 81
Table 5: Results from Experiment #1 ...... 85
Table 6: Results from Experiment #2 ...... 88
Table 7: Results from Experiment #3 ...... 94


List of Figures

Figure 1: A simplified view of the BLAST algorithm ...... 23
Figure 2: Diagram of Query Set Segmentation ...... 29
Figure 3: Diagram of Database Segmentation ...... 31
Figure 4: Distributed System Layout and Interactions ...... 62
Figure 5: Overview of the BLAST Output Layout ...... 76
Figure 6: Experiment #1 Execution Times - Line Graph ...... 86
Figure 7: Experiment #1 Execution Times - Bar Graph ...... 87
Figure 8: Experiment #2 Execution Times - Line Graph ...... 91
Figure 9: Experiment #2 Execution Times - Bar Graph ...... 92
Figure 10: Experiment #3 Execution Times - Line Graph ...... 95
Figure 11: Experiment #3 Execution Times - Bar Graph ...... 96


Chapter 1 Introduction

1.1 Motivation

Bioinformatics is a rapidly expanding interdisciplinary field that applies computer science to answer the questions of biology (Nair 2007). Due to the development of numerous high-throughput technologies, the amount of biological data is increasing rapidly (Troyanskaya, et al. 2003). To meet this massive influx of data, computational hardware has also been rapidly increasing in speed while decreasing in price. Multi-core processors are now becoming the standard in both home and office environments and, concurrently, distributed computers and cloud-based computing have also become readily available to researchers with implementations such as Amazon's S3 (Amazon 2010) (Palankar, et al. 2008), Microsoft's Azure (Microsoft 2010), Google's App Engine (Google 2010), and the 3Tera Cloud (3tera 2010). However, despite these recent trends towards distributed computing and multiprocessor parallelization, few bioinformatics algorithms have been implemented to make use of this additional computational power.

The tools used in bioinformatics are developed by computer scientists for use by biologists. Tools such as BLAST, a tool for finding local alignments between a set of unknown genetic sequences and a set of known genetic sequences, have a simple command line interface that does not require users to have a deep understanding of computers or computer science. These tools are often executed by simply downloading the program and running it with little to no set up. Few, if any, of the major bioinformatics tools have requirements beyond which operating systems they will and will not run on, thus keeping the entire process simple for users. However, this is rarely the case for distributed implementations of bioinformatics tools. Distributed implementations of these tools, including BLAST, often require the user to first set up the underlying program that will handle the distribution, such as the Message Passing Interface (MPI). This additional step can be rather complex even for a computer specialist, much less the average biologist.

Once the underlying distribution program has been set up, many distributed implementations will still require that the users then set up the program to work with the options chosen during the creation of the MPI or RPC environment, adding yet another step that can be quite complex to biologists and other non-technical users.

Distributed implementations of bioinformatics tools often suffer from a major issue that could easily be corrected, but is often completely ignored by programmers. The issue is that bioinformatics algorithms developed to run in distributed environments should be fault tolerant, because a core principle of distributed systems is that they must be able to tolerate system failures and faults. Thus, distributed bioinformatics algorithms must account for faults that may occur within the distributed system and handle these faults accordingly. The problem, however, is that most distributed algorithms, including many implementations of the BLAST algorithm, are unable to handle many common faults that occur during their execution, such as: the master node or worker nodes losing connection to the network, the system encountering a race condition on a worker node, the system encountering a race condition on the master node, the system encountering a deadlock on a worker node, or the system encountering a deadlock on the master node.

The NCBI bioinformatics tool BLAST, the Basic Local Alignment Search Tool, is one of the most widely used bioinformatics tools in the field. For this reason, this dissertation shall focus on this tool's algorithm in order to provide examples of how the distributed algorithm would interact with a bioinformatics tool as well as to show an example of our distributed algorithm being used in a real-world tool. A survey of existing distributed BLAST implementations reveals that most of these applications do not have much, if any, ability to perform fault tolerance or recovery.

The results of this survey can be found below in Table 1.

Distributed Implementation    Fault Tolerance and Recovery    Reference
BeoBLAST                      Medium                          (Grant, et al. 2002)
Condor BLAST                  Medium                          (Condor Team 2004)
mpiBLAST                      Low                             (Darling, Carey and Feng 2003)
Soap-HT-BLAST                 Medium                          (Wang and Mu 2003)
Squid                         Medium                          (Carvalho, et al. 2005)
W.ND-BLAST                    Medium                          (Dowd, et al. 2005)

Table 1: Distributed BLAST Implementations - Fault tolerance comparisons

Distributed BLAST implementations should be able to achieve linear speed up in most cases and super linear speed up in BLAST runs involving databases that are too large to fit in memory. While achieving linear speed up is not an incredibly difficult task, achieving super linear speed up requires some additional work that most distributed BLAST applications have failed to implement. A survey of existing distributed BLAST implementations reveals that most of these applications have not yet implemented the code required to achieve super linear speed up. The results of this survey can be found below in Table 2.

Distributed Implementation    Achieves super linear speed up    Reference
BeoBLAST                      No                                (Grant, et al. 2002)
Condor BLAST                  No                                (Condor Team 2004)
mpiBLAST                      Yes                               (Darling, Carey and Feng 2003)
Soap-HT-BLAST                 No                                (Wang and Mu 2003)
Squid                         No                                (Carvalho, et al. 2005)
W.ND-BLAST                    No                                (Dowd, et al. 2005)

Table 2: Distributed BLAST Implementations - Speed up comparisons

Currently there exists no distributed BLAST implementation that has a high, or even a medium, amount of fault tolerance and recovery while also achieving super linear speed up. However the distributed implementation of the BLAST algorithm constructed during this research has the ability to detect, tolerate, and recover from faults while still achieving super linear speed up on large databases.

1.2 Problem Statement

Distributed bioinformatics applications must conform to the standards set forth for all distributed algorithms, discussed in depth below in section 2.2, Distributed Computing. These standards require that distributed algorithms be able to run on heterogeneous machines, run securely, have the ability to scale, use proper fault tolerance and recovery, be concurrent, and allow for transparency. However, a standard that is implied but not officially stated is that a distributed algorithm must actually be capable of performing the same task as a non-distributed algorithm at faster speeds. If a distributed algorithm cannot accomplish this, then the algorithm should not be distributed at all. Expanding on this idea, distributed algorithms should attempt to minimize processing time without introducing errors into the final output.

As such, our distributed algorithm and our distributed BLAST application must go beyond the work of their predecessors by meeting all of the requirements stated above. To accomplish this, the following steps must be taken:

1. Devise a communication method capable of automatically scaling the size of a distributed system.

2. Develop an application framework that will allow various bioinformatics tools to securely execute in a distributed heterogeneous environment. This framework will provide fault tolerance within the heterogeneous environment while also making use of the aforementioned communication method. This application framework should be capable of:

   a. treating bioinformatics tools as extensions of the framework without requiring any alterations be made to the tool's source code,

   b. tolerating and effectively recovering from faults,

   c. allowing for concurrent operations to be performed, and

   d. manipulating data in a secure manner.

3. Create a distributed BLAST application using the application framework. This application should be capable of:

   a. acting as an add-on to the BLAST executable by running the BLAST executable supplied by NCBI without any alterations made to the BLAST source code, and

   b. maximizing the speed up gained to near linear time in most cases and super linear time in all cases where such time could be achieved.

In order to accomplish the first goal, I will construct a modularized set of methods that can connect to, broadcast to, and read from a multicast created by the methods.

This multicast will provide any program using the methods the ability to seek out and track any nodes on the network that will be used in the distributed system. This communication method will allow the program to easily scale up and down depending on available nodes without direct user intervention to alter the size of the system. The details of this communication method are explained further in section 3.1 Creating a Distributed System from Existing Nodes.
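
To make this idea concrete, the sketch below shows one way such a multicast discovery mechanism could look. It is a minimal illustration only; the group address, port, and message format are assumptions made for the example and are not taken from the framework described in Chapter 3.

```python
import socket
import struct

# Hypothetical multicast group and port, chosen for illustration only.
MCAST_GROUP = "239.255.42.99"
MCAST_PORT = 50000

def announce_presence(role: str) -> None:
    """Broadcast a short 'I am here' message to the multicast group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    message = f"NODE {role} {socket.gethostname()}".encode("utf-8")
    sock.sendto(message, (MCAST_GROUP, MCAST_PORT))
    sock.close()

def listen_for_nodes(timeout: float = 5.0) -> set:
    """Listen on the multicast group and collect the addresses of announcing nodes."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", MCAST_PORT))
    # Join the multicast group on all interfaces.
    mreq = struct.pack("4sl", socket.inet_aton(MCAST_GROUP), socket.INADDR_ANY)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    sock.settimeout(timeout)
    nodes = set()
    try:
        while True:
            data, addr = sock.recvfrom(1024)
            if data.startswith(b"NODE"):
                nodes.add(addr[0])   # track the announcing node by its IP address
    except socket.timeout:
        pass
    finally:
        sock.close()
    return nodes
```

Because any machine that joins the group is discovered the next time it announces itself, the set of known nodes grows and shrinks with the network, which is the scaling behavior described above.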

The second goal can be accomplished by constructing a pair of algorithms capable of performing distributed computations using the communication method described above. The first algorithm will establish a master node capable of communicating with remote worker nodes, collecting information from remote nodes, monitoring the status of remote nodes, establishing file transfers to and from the remote node, establishing data transfer between the master and remote worker nodes, spawning processes on remote nodes, stopping processes on remote nodes, deleting application specific files on remote nodes, taking input from the user, and generating output for the user in the form of log files and status updates. The second algorithm will establish the worker nodes capable of communicating with the master node, gathering system information for the master node, establishing file transfers to and from the master node, establishing data transfer between itself and the master node, running applications at the request of the master node, ending applications at the request of the master node, and deleting tool specific files at the request of the master node. These algorithms are described in section 3.1 Creating a Distributed System from Existing Nodes.
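
To illustrate this division of labor, the following sketch shows how a worker node might dispatch the kinds of requests listed above. The command names, payload fields, and handlers are hypothetical and stand in for the actual protocol specified in section 3.1.

```python
import os
import subprocess
from pathlib import Path

class WorkerNode:
    """Minimal illustration of a worker reacting to master-node commands."""

    def __init__(self, work_dir: str = "worker_files"):
        self.work_dir = Path(work_dir)
        self.work_dir.mkdir(exist_ok=True)
        self.running = {}  # maps a job id to a running external process

    def handle(self, command: str, payload: dict) -> dict:
        """Dispatch a single command received from the master node."""
        if command == "SYSTEM_INFO":
            return {"cpus": os.cpu_count()}
        if command == "RUN_TOOL":
            # Start an external tool (e.g. a BLAST binary) without touching its source.
            proc = subprocess.Popen(payload["argv"], cwd=self.work_dir)
            self.running[payload["job_id"]] = proc
            return {"status": "started"}
        if command == "STOP_TOOL":
            proc = self.running.pop(payload["job_id"], None)
            if proc is not None:
                proc.terminate()
            return {"status": "stopped"}
        if command == "DELETE_FILES":
            for name in payload["files"]:
                (self.work_dir / name).unlink(missing_ok=True)
            return {"status": "deleted"}
        return {"status": "unknown command"}
```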

The third goal, creating a distributed BLAST application, can be accomplished by creating two applications, a master application and a worker application. The master application will use the master node half of the communication algorithm to establish contact with and pass commands to the various worker nodes while the worker application will use the worker node half of the communication algorithm to respond to the commands of the various master nodes. The file and data transfer methods within the communication algorithm will be used to pass database files, input files, and BLAST command line instructions to the worker node as well as provide a method by which the worker node can pass completed BLAST output files back to the master. The master and worker node application algorithms are described in section 3.2 Algorithm behind the Distributed BLAST Master Node and section 3.3 Algorithm behind distributed BLAST Worker Nodes, respectively.
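
Because BLAST is treated as a black box, a worker only ever has to build a command line and run the unmodified NCBI executable. A minimal sketch of that step follows; the flags shown follow the legacy blastall-style interface and are illustrative only, since the exact options depend on the BLAST release shipped to the worker nodes.

```python
import subprocess

def run_blast_fragment(program: str, query_file: str, db_path: str, out_file: str,
                       blast_binary: str = "blastall") -> int:
    """Run an unmodified NCBI BLAST binary on one query/database pair.

    Flag names follow the legacy blastall interface and are shown only as an
    example; they must match the BLAST release actually deployed.
    """
    argv = [
        blast_binary,
        "-p", program,      # e.g. "blastn" or "tblastx"
        "-i", query_file,   # query segment assigned to this worker
        "-d", db_path,      # database (or database fragment) on this worker
        "-o", out_file,     # output file later returned to the master
    ]
    completed = subprocess.run(argv, check=False)
    return completed.returncode
```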

1.3 Overview of the Dissertation

This dissertation consists of five chapters. Chapter 2 details related work in the fields of bioinformatics and distributed computation with an emphasis on work relating to other distributed BLAST implementations and methods. Chapter 3 describes the multicast communication method used to create and maintain a distributed heterogeneous environment as well as the distributed application framework developed to allow high throughput bioinformatics programs to run across this distributed system. Chapter 3 also discusses the distributed algorithm implemented to execute distributed BLAST executions across the distributed system created by the communication method. Chapter 4 describes the results attained by executing the algorithms detailed in Chapter 3 while Chapter 5 discusses the results attained in Chapter 4 and gives ideas and plans for future work in the area.


Chapter 2 Related Work

2.1 Bioinformatics

In the paper "Computational Biology and Bioinformatics: A Gentle Overview" by Nair Achuthsankar, the author defines bioinformatics as follows: "Bioinformatics is the application of computer sciences and allied technologies to answer the questions of Biologists, about the mysteries of life" (Nair 2007). This short sentence accurately describes not only the application of computer science to biology, but also lays out the scope and magnitude of this research by applying it to solve the mysteries of life. The author continues through much of the paper laying out exactly what data we as bioinformaticians handle and how we go about working with the vast amount of biological data. Bioinformatics deals primarily with biological data in either text files containing large quantities of DNA, RNA, or protein sequences or in images containing microarray data. This data can then be analyzed using a multitude of bioinformatics algorithms, such as using Hidden Markov Models (HMMs) to find genes in DNA sequences, using the Nussinov folding algorithm to predict the secondary structure of RNA sequences, or using subcellular localization algorithms to predict a protein's location within a cell based on protein sequences.

Bioinformatics has given biologists new ways to tackle lingering problems in their field as well as more efficient methods to complete the problems that they have already solved. Moreover, bioinformatics has allowed biologists to improve the precision of their results while decreasing their time spent in the lab. For example, in the paper "Assessing the precision of high-throughput computational and laboratory approaches for the genome-wide identification of protein subcellular localization in bacteria" by Sebastien Rey et al., the authors compared computational subcellular localization methods with laboratory proteomics approaches in an attempt to determine the most effective approach for genome-wide localization characterization and annotation (Rey, Gardy and Brinkman 2005). Their final results showed that the computational methods for subcellular localization had a higher level of precision than the high-throughput laboratory approaches.

Bioinformatics has also created the need for large sequence repositories, typically referred to as sequence databases. These databases store an immense amount of sequence data that has been collected by various projects worldwide. The most notable of these databases include the European Molecular Biology Laboratory DNA database (EMBL), the DNA Data Bank of Japan (DDBJ), the National Center for Biotechnology Information's genetic database (GenBank), the Swiss Institute of Bioinformatics' Protein Sequence Database (SWISS-PROT), and the Protein 3D structure database (PDB). These databases are growing at an immense rate, with the authors Dennis A. Benson et al. stating in their paper "GenBank" that the GenBank database is growing exponentially, doubling in size every 15 to 18 months (Benson, et al. 2007). Thus, GenBank alone is growing faster than Moore's Law, which states that the number of transistors that can be placed inexpensively on an integrated circuit has doubled approximately every two years. This poses a number of challenges not only to the process of mining data from these databases but also to using this data in popular programs like BLAST, a program that uses local alignment to identify unknown sequences, which shifts portions of these large databases into and out of memory in an attempt to identify unknown sequences. In order to overcome this obstacle it has been proposed that bioinformatics algorithms be updated, or developed, so that they can begin making use of multiple core processors and distributed computers.

2.2 Distributed Computing

Distributed computing is a field of computer science that researches distributed systems. In the book Distributed Systems: Concepts and Design, 4th Edition by George Coulouris et al. and the book Distributed Computing: Implementation and Management Strategies by Raman Khanna, a distributed system is defined as a system in which components located at networked computers communicate and coordinate their actions only by passing messages (Coulouris, Dollimore and Kindberg 2005) (Khanna 1994). This definition of distributed systems, however, requires that for a system to be considered distributed it must meet the following three characteristics:

1) The system must be concurrent, allowing each machine in the distributed system to work in parallel or independently of the other machines in the system.

2) The system must lack a global clock. Each component of the distributed network may only communicate via some form of message passing. However, there are limitations on how accurately the components can synchronize their clocks, forcing each component to keep track of its own local clock instead of referencing some global clock.

3) The system must allow for and be tolerant of independent component and system failures. The distributed system should allow independent components to fail without disrupting the capabilities of the system as a whole.

The authors George Coulouris et al. continue on to discuss challenges that face the construction of distributed systems as well as the distributed algorithms that make use of these systems. These challenges have all been identified and met in previous research and all distributed algorithms should be able to handle these challenges.

The challenges discussed by the authors are as follows:

1) The distributed algorithm should be able to tolerate a heterogeneous distributed system. Heterogeneous distributed systems are made up of components that may have different hardware, make use of different operating systems, and use different methods to connect to the distributed system's network. Overcoming this challenge is typically done by implementing the algorithm using some form of middleware. The middleware acts as an abstraction layer, allowing the developer to handle various system calls in a similar manner across heterogeneous systems.

2) Keeping the distributed system secure is of considerable importance. In order for the entire system to stay secure, each individual system and each program running on these systems must adhere to the protection of the system's confidentiality, integrity, and availability. A system's confidentiality is based entirely upon its ability to protect against the disclosure of data or resources to unauthorized users. Upholding system integrity requires the system to protect against the alteration and corruption of its data. Lastly, the system's availability is its ability to protect itself against interference with its access to various local and remote resources. Overcoming this challenge requires developers to stay vigilant in their efforts to ensure that their programs do not violate or allow for the violation of the system's confidentiality, integrity, and availability. This means ensuring that the contents of all messages sent across the distributed system are secure, ensuring the systems sending and receiving messages are authorized to do so, and either backing up important data or finding ways to ensure corrupted data can be recovered or recreated safely and efficiently.

3) Distributed systems must also allow for scalability, the ability to remain effective even when a significant number of users and resources are added to or removed from the system. Distributed algorithms should be able to handle the increase, as well as the decrease, in resources and users and make an attempt to balance resource usage across the system. Using algorithms such as divide and conquer to split a task into equal pieces, such that each piece of the larger task is completed by a separate component, allows algorithms to easily scale both up and down in a dynamic distributed system.

4) Just as would occur in a non-distributed algorithm, a distributed algorithm must be able to handle failures. Unlike in a non-distributed system, failures in a distributed system should follow the rules discussed before, in that the distributed system should tolerate the loss of components without it causing failures throughout the entire system. As such, distributed algorithms should be able to handle these failures such that they too can tolerate them. There are a number of ways to handle faults. For instance, the algorithm can attempt to detect when faults have occurred, known as fault detection, and when such an event occurs the algorithm will take measures to minimize and contain the fault and either recover what it can or restart the lost process elsewhere. Redundancy is another way of dealing with fault tolerance, by having multiple systems in the distributed system mirror the actions of another system. Thus if a task is being completed on three systems and one system goes down, the other two will still manage to complete the task and report their results. Redundancy can also be as simple as having multiple network connections between a system and the distributed system, thus allowing for hardware failures on one, but possibly not both, of the networks running the distributed system.

5) Distributed systems are required to be concurrent, and thus distributed algorithms running across these systems must also be concurrent. Distributed algorithms provide resources that can be shared by multiple users within the distributed system. Thus it becomes highly probable that, at various times, multiple users may attempt to share a resource at the same time. Whether this resource is a file, a system resource, a database, or an application, the distributed algorithm must not only allow the sharing to occur, but must remain stable and keep data correctly synced during these moments of shared usage. Resources in the system are considered safe only when operations performed on them are synchronized in such a way that their data remains consistent despite shared usage. Overcoming this problem requires following algorithm development standards similar to those found in operating systems and other parallel software, such as the use of semaphores and mutex locks.

6) Lastly, distributed systems should keep the separation of components within the system concealed from the user and the application programmer. This process is known as transparency and is used to ensure the distributed computer is perceived as a single system instead of a collection of independently working components. George Coulouris et al. continue on to discuss the eight major forms of transparency first discussed in the ANSA Reference Manual (ANSA Project 1987) and the International Organization for Standardization's Reference Model for Open Distributed Processing (RM-ODP) (ISO/IEC 1996). A brief summary of the forms of transparency listed in the RM-ODP is as follows:

   a. Access transparency: the ability to hide from a user the details of the access mechanisms for any given server object. Access transparency hides the difference between local and remote provisions of the service.

   b. Concurrency transparency: the ability to hide from the client the existence of concurrent access being made to various resources. This hides the effects of concurrent operations performed by any given user on a service used by multiple users.

   c. Location transparency: the ability to conceal the location of the resource currently being accessed by a user.

   d. Replication transparency: the ability to hide the presence of multiple copies of services, and to maintain the consistency of multiple copies of data, from the users of the services.

   e. Resource transparency: the ability to hide from a user the mechanisms which manage allocation of resources by activating and deactivating resources as demand for these resources varies.

   f. Failure transparency: the ability to mask certain failures, and possible recovery efforts, of resources from the user. This provides fault tolerance for the distributed system.

   g. Federation transparency: the ability to hide from users the effects of operations crossing multiple administrative boundaries, allowing users and resources to interwork between multiple administrative and technological domains.


2.3 BLAST

Basic Local Alignment Search Tool (BLAST) was developed in the late 1980s and early 1990s and is the most widely used bioinformatics program in the world. The program is an approximation of the Smith-Waterman algorithm, developed by Temple F. Smith and Michael S. Waterman in 1981, for performing local sequence alignment. The Smith-Waterman algorithm compares two sequences against one another in order to detect similarities between the sequences, known as alignments (Smith and Waterman 1981) (Durbin, et al. 2007). The algorithm takes two sequences, A and B, such that A = a_1 a_2 … a_n and B = b_1 b_2 … b_m. We shall define the similarity given between two sequence elements a and b as s(a,b), and gaps in the sequence of length k shall be given weight W_k. According to the Smith-Waterman algorithm, in order to detect pairs of segments between the sequences that contain a high similarity, we first need to establish a matrix M such that

M_{k,0} = M_{0,l} = 0 for 0 ≤ k ≤ n and 0 ≤ l ≤ m

This creates an (n+1) × (m+1) matrix with the first column and first row filled with zeros and all other entries left empty. These entries will then be filled in such that M_{i,j} is the maximum similarity of two segments ending in a_i and b_j, respectively. Because the Smith-Waterman algorithm searches for local instead of global alignments, the matrix will contain 0 in positions where M_{i,j} would have contained a negative number. These 0's will be used to determine where new alignments begin, allowing multiple alignments to begin and end within the sequence pair (Rosenberg 2009). As such, M_{i,j} is calculated for all i and j where 1 ≤ i ≤ n and 1 ≤ j ≤ m using the equation

M_{i,j} = max{ 0,  M_{i-1,j-1} + s(a_i, b_j),  max_{k≥1}( M_{i-k,j} − W_k ),  max_{l≥1}( M_{i,j-l} − W_l ) }

Once this matrix has been constructed, the alignment with the maximum similarity between sequences A and B can be found by first locating the maximum element in M. From this element a traceback algorithm is used to build the alignment in reverse, starting from the maximum element in M until a 0 element is encountered. At each step of the traceback process we move back from the current cell M_{i,j} to the cell from which the value in M_{i,j} was derived, located either at M_{i-1,j-1}, M_{i-1,j}, or M_{i,j-1}. At the same time we also build A' and B', the alignment sequence pair between A and B respectively, by adding either a letter or a gap to the front of each sequence in the pair. If the value for element M_{i,j} was derived from M_{i-1,j-1}, then A' adds the sequence element a_i to the front and B' adds the sequence element b_j to the front. If the value for element M_{i,j} was derived from M_{i-1,j}, then A' adds the sequence element a_i to the front and B' adds the gap character '-' to the front. If the value for element M_{i,j} was derived from M_{i,j-1}, then A' adds the gap character '-' to the front and B' adds the sequence element b_j to the front. This algorithm is repeated until the element M_{i,j} = 0 (Durbin, et al. 2007) (Jones and Pevzner 2004) (Orengo, Jones and Thornton 2003).
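
The recurrence and traceback described above translate almost directly into code. The following is a compact illustrative implementation that uses a simple match/mismatch score for s(a,b) and a linear gap weight; the scoring values are arbitrary choices for the example, not the ones used by BLAST.

```python
def smith_waterman(A: str, B: str, match=2, mismatch=-1, gap=-2):
    """Local alignment by the Smith-Waterman algorithm (linear gap penalty)."""
    n, m = len(A), len(B)
    # Matrix with the first row and first column set to zero.
    M = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if A[i - 1] == B[j - 1] else mismatch
            M[i][j] = max(0,
                          M[i - 1][j - 1] + s,   # extend a match/mismatch
                          M[i - 1][j] + gap,     # gap in B
                          M[i][j - 1] + gap)     # gap in A
            if M[i][j] > best:
                best, best_pos = M[i][j], (i, j)

    # Traceback from the maximum element until a zero element is reached.
    A_aln, B_aln = "", ""
    i, j = best_pos
    while i > 0 and j > 0 and M[i][j] > 0:
        s = match if A[i - 1] == B[j - 1] else mismatch
        if M[i][j] == M[i - 1][j - 1] + s:
            A_aln, B_aln = A[i - 1] + A_aln, B[j - 1] + B_aln
            i, j = i - 1, j - 1
        elif M[i][j] == M[i - 1][j] + gap:
            A_aln, B_aln = A[i - 1] + A_aln, "-" + B_aln
            i = i - 1
        else:
            A_aln, B_aln = "-" + A_aln, B[j - 1] + B_aln
            j = j - 1
    return best, A_aln, B_aln

# Example: smith_waterman("PGQQFPGQEP", "GQQFP") returns the best local score
# and the corresponding aligned sequence pair A' and B'.
```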

As discussed in the book "Introduction to Computational Genomics" (Cristianini and Hahn 2007) by Nello Cristianini and Matthew Hahn, the increase of DNA sequences deposited into the public genomic databases in the late 1980s caused searches of the three main genomic databases to start taking immense amounts of time. The Smith-Waterman algorithm takes on the order of O(nm), often referred to as O(n^2), time and space to calculate results. Despite the rather low cost of computation, the costs were simply too high for the large-scale database searching applications it was being used for (Cristianini and Hahn 2007).

The BLAST algorithm was developed in 1990 to attain similar results to the Smith-Waterman algorithm but at a fraction of the computational cost. BLAST is able to achieve this goal using two shortcuts: 1) BLAST does not bother to find the optimal alignment and 2) it does not search the entire search space, instead attempting to quickly locate regions of high similarity, regardless of whether it checks every possible local alignment (Cristianini and Hahn 2007). The BLAST algorithm can be simplified into the steps described in the Basic Local Alignment Search Tool paper (S. F. Altschul, et al. 1990) by Stephen F. Altschul et al. and expanded upon in the book "Bioinformatics: Sequence and Genome Analysis" (Mount 2004) by David Mount and "BLAST: An Essential Guide to the Basic Local Alignment Search Tool" by Ian Korf et al. (Korf, Yandell and Bedell 2003). A brief overview of the algorithm can be found below in Figure 1. Also, an in-depth description of the BLAST algorithm is provided for curious readers as follows:

1) The first stage of the algorithm involves removing sequence repeats and regions of low complexity, regions of biased composition containing simple sequence repeats (Orlov and Potapov 2004), from the query sequence (Mount 2004).

2) BLAST will then create a list of k-mers from the query sequence. This requires creating a unique list by cutting the query sequence into words such that each word is of length k. For example, if k = 3 and we have the query sequence PGQQFPGQEP, then we would have a list containing PGQ, GQQ, QQF, QFP, GQE, and QEP. Keep in mind that since we are storing a unique list we will only add the 3-mer PGQ once (Mount 2004) (Zomaya 2006).

3) Create a list of high-scoring match words, each of length k. For each member of the list generated in step 2 above, all the possible matching words are generated and then scored against the original element using a scoring matrix. If k = 3 and the program is handling amino acids, then a total of 20^3 possible match words and scores would be generated. For instance, the sequence PGQ would generate the matching words PEG and PQA, with BLOSUM62 matching scores of 15 and 12 respectively (Mount 2004) (Zomaya 2006).

4) A match cutoff score known as the neighborhood word score threshold, T, is selected in order to cull the list of match scores. By traversing the list of match scores and removing any match score that does not exceed the value of T, we are able to generate a match list that contains only the highest scoring match words (Mount 2004) (Zomaya 2006).

5) Steps 3 and 4 are repeated for each k-letter word in the list of k-mers created in step 2 (Zomaya 2006).

6) The list of remaining highest scoring match words is then reorganized into an efficient search tree. This allows the BLAST program to compare the matching words to elements within database sequences quickly and efficiently (Mount 2004).

7) Each sequence in the database is scanned by the BLAST program in order to find k-mers in the database sequence that match k-mers from the list of highest scoring match words (Mount 2004) (Zomaya 2006).

8) Each matching region found per database sequence will then be compared against one another to determine their distance. Each match that is within A letters of another match will be joined with that match as a longer match, with all the letters between them being incorporated into the new match. At this point the match will be extended in each direction until the accumulated total score of the HSP, High Scoring Pair, begins to decrease (Mount 2004) (Zomaya 2006).

9) A cutoff score known as the segment score threshold, S, is used in order to remove HSPs that do not meet or exceed S. By examining each HSP and removing any HSP whose score falls below S, we will be left with a list of HSPs whose values are large enough to accurately calculate significance (Mount 2004) (Korf, Yandell and Bedell 2003) (Zomaya 2006).

10) Next BLAST will begin assessing each HSP in order to determine the significance of that HSP's score. Statistical significance is determined by the expect value, E, which is the number of times that an unrelated database sequence would obtain some score s that is greater than x by chance (Mount 2004). The equation used to calculate the expect value, or E-value, states that the number of alignments expected by chance (E) during a sequence database search is a function of the size of the effective search space (m'n'), the normalized score (λS), and a minor constant (k) (Korf, Yandell and Bedell 2003). In order to solve for E the following equation is used:

E = km'n'e^(−λS)

where λ and k are Karlin-Altschul statistical parameters (Karlin and Altschul 1990), S is the raw score for the HSP, and m' and n' are the effective search spaces for the input sequence and database sequences, respectively (Korf, Yandell and Bedell 2003).

11) Using the expect value threshold parameter passed to the BLAST algorithm by the user, BLAST will remove any match whose expect value is higher than the expect value threshold (Mount 2004).

12) The program will report data for every match still remaining in the HSP list, which now only contains the HSPs that have been determined to be statistically significant (Mount 2004).

For more information regarding the BLAST algorithm or to see an example of the algorithm, readers are encouraged to read the book "Bioinformatics: Sequence and Genome Analysis" (Mount 2004) by David W. Mount.
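
To illustrate steps 2 through 4 of the algorithm, the sketch below builds the unique k-mer list for a query and then culls neighborhood words against a threshold T. A trivial identity-style scoring function stands in for a real substitution matrix such as BLOSUM62, so the scores shown are illustrative only.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def query_kmers(query: str, k: int = 3) -> set:
    """Step 2: the unique list of k-letter words contained in the query."""
    return {query[i:i + k] for i in range(len(query) - k + 1)}

def word_score(word_a: str, word_b: str) -> int:
    """Stand-in for a substitution-matrix score such as BLOSUM62 (illustrative)."""
    return sum(5 if a == b else -1 for a, b in zip(word_a, word_b))

def neighborhood_words(query: str, k: int = 3, T: int = 11) -> dict:
    """Steps 3-4: every possible k-letter word scoring at least T against a query word."""
    keep = {}
    for qword in query_kmers(query, k):
        for candidate in product(AMINO_ACIDS, repeat=k):   # all 20^k possible words
            cand = "".join(candidate)
            if word_score(qword, cand) >= T:               # cull words below threshold T
                keep.setdefault(cand, []).append(qword)
    return keep

# Example: query_kmers("PGQQFPGQEP") yields the unique 3-letter words from the
# text's example query, and neighborhood_words(...) keeps only high-scoring matches.
```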


Figure 1: A simplified view of the BLAST algorithm. Step A: Construct a list of high-scoring k-mers (words of length k) from the query sequence. Step B: Search a database sequence for exact matches between members of the word list generated in Step A and the database sequence. Step C: Extend exact matches to the left and/or right until the score for the High Scoring Pair (HSP) ceases to increase. Then report any HSPs that are significant and have a score greater than the segment score threshold, S. (Haubold and Wiehe 2006)


As briefly mentioned in the detailed BLAST algorithm above, BLAST is able to handle both nucleotide and amino acid sequence data. However, depending on the needs of the user, different BLAST programs may need to be utilized to attain the results the user expects. Currently BLAST is broken into six different programs based upon the type of sequence data they accept as input and the type of sequence data they provide as output. These six programs and their usages are briefly defined below and summarized in Table 3:

Program      Query Sequence Type    Database Sequence Type    Alignment Sequence Type
blastn       Nucleotide             Nucleotide                Nucleotide
blastp       Protein                Protein                   Protein
blastx       Nucleotide             Protein                   Protein
tblastn      Protein                Nucleotide                Protein
tblastx      Nucleotide             Nucleotide                Protein
megablast    Nucleotide             Nucleotide                Nucleotide

Table 3: Brief description regarding the data used and returned by each BLAST program. Lists each NCBI BLAST implementation along with the type of query and database sequences it accepts as input and the type of alignment that it returns as output (Markel and León 2003).

2.3.1 BLASTN and MegaBLAST

BLASTN is used to compare nucleotide query sequences against a nucleotide database in order to generate nucleotide alignments. As such, the BLASTN algorithm follows the algorithm detailed above in section 2.3 BLAST. BLASTN is primarily used to map short nucleotide sequences, known as oligonucleotides, to known genomes in order to identify the taxonomic information of these unknown nucleotide sequences (Korf, Yandell and Bedell 2003).


MegaBLAST is a modified version of BLASTN that sacrifices sensitivity in order to greatly decrease the amount of time required to find nucleotide alignments. As such, MegaBLAST behaves like the BLASTN program and is used for many of the same purposes as BLASTN (Korf, Yandell and Bedell 2003).

2.3.2 BLASTP

BLASTP is used to compare protein query sequences against a protein database in order to generate protein alignments. As such, the BLASTP algorithm follows the algorithm detailed above in section 2.3 BLAST. BLASTP is primarily used to determine functional information regarding the query proteins by comparing them to proteins whose functions are known. This allows researchers to infer gene function by determining which proteins the query protein matches against (Korf, Yandell and Bedell 2003).

2.3.3 BLASTX

BLASTX is used to compare nucleotide query sequences against a protein database in order to generate protein alignments. As such, the BLASTX algorithm requires that a step be added to the algorithm detailed above in section 2.3 BLAST. In order to correctly identify sequence alignments, the query sequence must first be converted into three protein sequences: the first is generated by converting each set of three nucleotides into its corresponding amino acid, the second is generated by performing the same task but skipping over the first letter of the nucleotide query sequence, and the third is generated by performing the same task as the first sequence but skipping the first two letters of the nucleotide sequence. Once converted, each of these sequences is then passed through the entire BLAST algorithm. BLASTX is primarily used to find protein-coding genes in genomic DNA or to identify proteins that are encoded by transcripts (Korf, Yandell and Bedell 2003).
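
A sketch of that conversion step follows: it translates a nucleotide sequence in the three forward reading frames described above using the standard genetic code. The encoding of the codon table as a 64-character string is simply a compact way to write it out for the example.

```python
# Standard genetic code, indexed by codon (bases ordered T, C, A, G).
BASES = "TCAG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
GENETIC_CODE = dict(zip(CODONS,
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"))

def translate(nucleotides: str) -> str:
    """Translate a nucleotide sequence codon by codon ('*' marks a stop codon)."""
    seq = nucleotides.upper().replace("U", "T")
    return "".join(GENETIC_CODE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def three_forward_frames(nucleotides: str) -> list:
    """The three translations produced by skipping 0, 1 and 2 leading bases."""
    return [translate(nucleotides[offset:]) for offset in (0, 1, 2)]

# Example: three_forward_frames("ATGGCCATTGTA") -> ['MAIV', 'WPL', 'GHC']
```

The same routine, applied to the database sequences instead of the query, corresponds to the extra step that TBLASTN and TBLASTX add, as described in the next two sections.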

2.3.4 TBLASTN

TBLASTN is used to compare protein query sequences against a nucleotide database in order to generate protein alignments. As such, the TBLASTN algorithm requires that a step be added to the algorithm detailed above in section 2.3 BLAST. In order to correctly identify sequence alignments, each database sequence must first be converted into three protein sequences: the first is generated by converting each set of three nucleotides into its corresponding amino acid, the second is generated by performing the same task but skipping over the first letter of the nucleotide sequence, and the third is generated by performing the same task as the first sequence but skipping the first two letters of the nucleotide sequence. Once converted, each of these sequences is then passed through the rest of the BLAST algorithm. TBLASTN is primarily used to map proteins to a genome or search EST, expressed sequence tag, databases for related proteins not yet added to the protein databases (Korf, Yandell and Bedell 2003).


2.3.5 TBLASTX

TBLASTX is used to compare nucleotide query sequences against a nucleotide database in order to generate protein alignments. As such, the TBLASTX algorithm requires that two steps be added to the algorithm detailed above in section 2.3 BLAST. In order to correctly identify sequence alignments, each query sequence must first be converted into three protein sequences as described above in section 2.3.3 BLASTX. Once the query sequences have been converted, the database sequences must also be converted by adding the steps described above in section 2.3.4 TBLASTN. Once these conversion steps have taken place, the newly generated protein sequences will be used in the remaining steps of the BLAST algorithm. TBLASTX is primarily used to identify transcripts of unknown function. By employing TBLASTX one would be able to determine if the transcript corresponds to any known proteins (Korf, Yandell and Bedell 2003).

2.4 Distributed BLAST

In the paper "Database Allocation Strategies for Parallel BLAST Evaluation on Clusters" by Rogério Luís De Carvalho Costa et al., the authors discuss a variety of techniques for dealing with the problem of distributing BLAST. According to the authors, the BLAST algorithm contains numerous calculations that are easy to parallelize. Parallelization of BLAST tends to center on the idea of divide and conquer (Costa and Lifschitz 2003). Divide and conquer allows parallelization that is able to achieve linear, or in some cases super-linear, speed up while allowing us to use NCBI's BLAST program and stay current with any updates made to that algorithm. The divide and conquer optimization is achieved using one of the following algorithms: query set segmentation and sequence database segmentation.

2.4.1 Query Set Segmentation

Query set segmentation is the process of breaking the query sequence file, containing anywhere from 1 to k sequences, into smaller files of equal size. This is done by some master node which takes the query sequences as input, cuts the sequences into N files of equal size, and passes these files down to the worker nodes (Talbi and Zomaya 2008). The worker nodes then run the BLAST algorithm on their sequence segment against the entire sequence database and then return the results back to the master node. The master node then takes the results from each worker node and concatenates the results into a single master output which is returned to the user. This method is shown below in Figure 2.


Figure 2: Diagram of Query Set Segmentation
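
A minimal sketch of this segmentation step is shown below. It parses a multi-FASTA query file and deals the sequences out into N smaller files; splitting by sequence count is the simplest reading of "files of equal size" and is an assumption made only for this example.

```python
def read_fasta(path: str) -> list:
    """Return a list of (header, sequence) pairs from a FASTA file."""
    records, header, seq = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    records.append((header, "".join(seq)))
                header, seq = line, []
            elif line:
                seq.append(line)
    if header is not None:
        records.append((header, "".join(seq)))
    return records

def segment_queries(query_file: str, n_workers: int) -> list:
    """Write N query segments, one per worker, and return their file names."""
    records = read_fasta(query_file)
    segment_files = []
    for worker in range(n_workers):
        name = f"{query_file}.part{worker}"
        with open(name, "w") as out:
            # Round-robin assignment keeps the segments close to equal in count.
            for header, seq in records[worker::n_workers]:
                out.write(f"{header}\n{seq}\n")
        segment_files.append(name)
    return segment_files
```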

The pros to this optimization include low communication overhead and linear speed up. This optimization requires minimal communication overhead to operate. For a query set containing k sequences, there are a maximum of k messages that must be sent to the slaves. This optimization also grants a near linear speed up as each node is performing 1/n of the workload and communication is minimal (Zomaya 2006).

The cons of this optimization include not solving the database size problem and making load balancing hard to accomplish. This optimization fails to address the fact that databases are growing at a rate faster than Moore's Law allows, meaning that this method will not allow the entire database to fit into memory, causing a slowdown due to the constant swapping of database sequences into and out of memory. This method also makes load balancing hard to accomplish, as doing so would require the entire query set to be analyzed beforehand since load balancing under this method is directly connected to the composition of the query set (Lazakidou 2010).

2.4.2 Sequence Database Segmentation

Sequence database segmentation is the process of breaking the database, containing 1 to k sequences, into fragments with each database fragment being approximately the same size, but not necessarily containing an equal number of sequences. This process is done by having some master node break down the database file into smaller fragments by computing various offset numbers that state where in the database file each worker should pull their fragment from (Zomaya 2006). Using these offsets, the master node passes the entire query file to each worker node as well as that worker node's offset values. Each worker node then pulls out its fragment of the database, runs the entire query sequence file against its database fragment using BLAST, and then returns the results back to the master node. The master node then takes the results from each worker node and merges these results, after making a few corrections to each sequence, into a single master output which is returned to the user. This method is shown below in Figure 3.


Figure 3: Diagram of Database Segmentation
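
The sketch below illustrates the offset computation described above: it scans the FASTA database once, notes the byte offset of every record header, and groups records into N fragments of roughly equal byte size. The exact bookkeeping used by the framework is not specified here, so this is only one plausible interpretation.

```python
def fragment_offsets(db_file: str, n_workers: int) -> list:
    """Return (start_offset, end_offset) byte ranges, one per worker.

    Fragments are cut only at record boundaries ('>' headers) and are kept
    approximately equal in size, although they may hold different numbers
    of sequences.
    """
    header_offsets = []          # byte offset of every FASTA header
    offset = 0
    with open(db_file, "rb") as handle:
        for line in handle:
            if line.startswith(b">"):
                header_offsets.append(offset)
            offset += len(line)
    file_size = offset

    target = file_size / n_workers          # ideal fragment size in bytes
    boundaries = [0]
    for h in header_offsets:
        if len(boundaries) < n_workers and h >= target * len(boundaries):
            boundaries.append(h)            # start a new fragment at this record
    boundaries.append(file_size)
    return list(zip(boundaries[:-1], boundaries[1:]))

# A worker can then seek() to its start offset and read (end - start) bytes
# to obtain its database fragment.
```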

The pros to this optimization include dealing with the database size problem and the ability to achieve super linear speed up. Database segmentation allows each worker node to perform the BLAST algorithm on a database fragment that will fit in memory, granting immense speed ups and allowing developers a method to avoid the problem of rapidly expanding databases. Linear speed up is achieved when each worker node can complete its task in 1/(number of nodes) of the serial time, meaning the task as a whole also completes in 1/(number of nodes) of the serial time. However, when dealing with larger databases, too large to fit into memory alone, this optimization allows each worker node to complete its task in less than 1/(number of nodes) of the serial time, allowing for super linear speed up.

The cons of this optimization are that a partitioning strategy must be implemented and that the result merging phase becomes more complex. As each worker completes the BLAST of the query sequences against a database fragment, the worker node will be required to pass back over the resulting output file and correct the E-values for each sequence match, as discussed in the next section. Once complete, the worker node can transmit the corrected result file back to the master node, where the result merging phase will begin. However, unlike in the query segmentation method, the result merging phase will not be a simple concatenation of files. Instead the process is complex and requires that each query sequence and the results given by each worker node are examined and the true highest scoring pairs are pulled out (Lazakidou 2010).

2.4.3 E-Value Calculation

As discussed in section 2.3 BLAST, the BLAST algorithm uses local sequence alignment in order to determine the high scoring pairs between a query set of unknown sequences and the known sequences found in a sequence database. These high scoring pairs each have a score identifying exactly how good a match was discovered. This score, known as the raw score, is the summation of all the scores given per residue match as well as the penalties applied for mismatches and gaps in the alignment. While this raw score is important in identifying the best matches between sequences, it does not tell us how significant the match was. Significance is used to determine if a match more likely occurred due to chance or biology. Millions of years of evolution have caused biological sequences to accumulate large numbers of substitutions, which has made the task of determining if two sequences share a common ancestor quite daunting. The difficulty of this task increases when you factor in that unrelated sequences may often display some degree of similarity purely by chance (Cristianini and Hahn 2007). As such, we need to determine the significance of each alignment so that any conclusions we wish to draw from our data are not weakened. This significance can be determined by calculating the number of matches with score s or higher that would be found in a database containing N sequences and n letters, either nucleotide or protein. Knowing the approximate number of matches we would expect to find allows us to determine whether a particular alignment is statistically significant or not. With this determination, some matches that are no longer considered significant can then be thrown out, leaving us with only the highest scoring pairs that are statistically significant.

The book “BLAST” (Korf, Yandell and Bedell 2003) by Ian Korf, Mark Yandell, and Joseph Bedell, as well as the paper “Sequence Comparison: Theory and Methods” (Chao and Zhang 2009) by Kun-Mao Chao and Louxin Zhang, gives us a comprehensive look at the way statistical significance between local alignments is determined by the BLAST programs. As stated previously in section 2.3 BLAST, the expect value, E, is the number of times that an unrelated database sequence would obtain a score S greater than x purely by chance. The equation used to determine an alignment's E-value states that the number of alignments expected by chance (E) during a sequence database search is a function of the size of the effective search space (m'n'), the normalized score (λS), and a minor constant (k) (Korf, Yandell and Bedell 2003). In order to solve for E, the following equation is used:

E = k m' n' e^{-\lambda S}

where λ and k are Karlin-Altschul statistical parameters (Karlin and Altschul 1990), S is the raw score for the HSP, and m' and n' are the effective search lengths for the input sequence and the database sequences, respectively (Korf, Yandell and Bedell 2003) (Pevsner 2003).

The effective search space is calculated by multiplying the effective search length of the input sequence, m', by the effective search length of the database, n'. However, the effective search lengths are not simply equal to the actual lengths of the input sequence and database. Instead, the effective search lengths are calculated by taking the actual lengths of the query sequence and database sequences minus the estimated average length of an alignment between two random sequences of equal length (Mount 2004) (Chao and Zhang 2009). Calculating the effective length of the input sequence, m', is completed using the following equation:

m' = m - l

where m is the actual length of the input query and l is the length adjustment for the current HSP. Calculating the effective length of the database, n', is completed using the following equations:

n' = n - (N_D \times l)

n = \sum_{T \in D} L_T

such that T is a database sequence in the sequence database D, L_T is the actual length of sequence T, and N_D is the total number of sequences found in database D (Chao and Zhang 2009). Thus the effective length of the database is equal to the sum of all the database sequence lengths minus the product of the length adjustment, l, and the number of sequences in the database. Lastly, calculating the length adjustment is done using the following equation:

l = \beta + \frac{\alpha}{\lambda}\left(\ln k + \ln\big((m - l)(n - N l)\big)\right)

such that

(m - l)(n - N l) \geq \max(m, n)

where k, α, β, and λ are the Karlin-Altschul statistical parameters kappa, alpha, beta,

and lambda respectively (Karlin and Altschul 1990), m is the actual length of the

input query, n is the total length of the database as calculated above, and N is the

total number of sequences in some database D (Lagnel, Tsigenopoulos and

Iliopoulos 2009) (Chao and Zhang 2009).
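To make these relationships concrete, the following sketch computes an E-value from the equations above. It is an illustrative example only: the Karlin-Altschul parameter values are placeholders, and the simple fixed-point loop used for the length adjustment is an assumption standing in for the exact routine used by NCBI BLAST.

    using System;

    // Illustrative sketch of the E-value calculation described above. The
    // Karlin-Altschul parameter values are placeholders; real values depend on
    // the scoring matrix and gap penalties in use.
    class EValueSketch
    {
        const double Lambda = 0.267;   // lambda (assumed example value)
        const double K      = 0.041;   // k      (assumed example value)
        const double Alpha  = 1.9;     // alpha  (assumed example value)
        const double Beta   = -30.0;   // beta   (assumed example value)

        // Length adjustment l, found by iterating
        // l = beta + (alpha / lambda) * (ln k + ln((m - l)(n - N*l))).
        static double LengthAdjustment(double m, double n, double N)
        {
            double l = 0;
            for (int i = 0; i < 20; i++)
            {
                double space = (m - l) * (n - N * l);
                if (space < Math.Max(m, n)) break;   // the constraint from the text
                l = Beta + (Alpha / Lambda) * (Math.Log(K) + Math.Log(space));
            }
            return Math.Max(l, 0.0);
        }

        // E = k * m' * n' * e^(-lambda * S), with m' and n' the effective lengths.
        static double Expect(double rawScore, double m, double n, double N)
        {
            double l = LengthAdjustment(m, n, N);
            double mPrime = m - l;        // effective query length
            double nPrime = n - N * l;    // effective database length
            return K * mPrime * nPrime * Math.Exp(-Lambda * rawScore);
        }

        static void Main()
        {
            // A 500-residue query against a database of 1,000,000 letters in
            // 2,000 sequences, with a raw HSP score of 60 (all example numbers).
            Console.WriteLine(Expect(60, 500, 1000000, 2000));
        }
    }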

2.5 Existing Distributed BLAST Applications

In the paper “BeoBLAST: distributed BLAST and PSI-BLAST on a Beowulf Cluster,” J.D. Grant et al. discuss a distributed BLAST application named BeoBLAST. BeoBLAST is a distributed BLAST application built to run across Beowulf clusters (Grant, et al. 2002). Beowulf clusters are scalable clusters built from commodity hardware, connected by a private system network, and running an open source operating system such as a UNIX or Linux based OS. Beowulf clusters are different from clusters of workstations in that Beowulf clusters behave more like a single machine, instead of behaving like a network of independent workstations.

BeoBLAST will only run on Linux based operating systems but will work on homogeneous and heterogeneous Beowulf clusters. BeoBLAST is also only capable of performing distributed BLASTs using the query segmentation method. The BeoBLAST algorithm uses a web interface to connect a user to the master node. Through the interface the user sends jobs to the master node, which in turn spawns jobs to worker nodes within the cluster. These nodes require that some form of parallel file system be in place and use that file system to read the files they will need to complete their assigned task. Once the task is complete, the nodes write their output back to the parallel file system. The master node then reads the data in and begins compiling the results back into a single output file. The authors note that their implementation allowed for a linear speed-up, in that 10 sequences against the NR database took approximately 2 minutes and 40 sequences took about 12 minutes. The authors state that the program takes approximately 1 minute for every 3 to 4 sequences added to a BLAST job.

In the paper, “Soap-HT-BLAST: high throughput BLAST based on Web services”

(Wang and Mu 2003) by Jiren Wang and Qing Mu, the authors discuss a distributed

BLAST application named Soap-HT-BLAST. As the name seems to imply, Soap-HT-

BLAST is a high-throughput distributed BLAST application built on web services.

This application is primarily built using SOAP, the Simple Object Access Protocol, to exchange information between the various nodes. Soap-HT-BLAST executes using

Perl and thus Perl, Apache Web Server, and the Perl module SOAP::Lite must be installed on all machines that will run it. According to the authors, web services consist of three components: a communications component to receive incoming messages and send outgoing messages, a proxy component to take messages and translate them into their appropriate actions, and an application component to perform the actions described by the proxy component (Wang and Mu 2003). Soap-

HT-BLAST uses SOAP::Lite to act as the proxy component, Apache web server to act

as the communications component, and their own BLASTing application, capable of performing one of three actions, as the application component. The three actions this component can perform are to check the current CPU load, check current connectivity to the network, and execute BLAST on a single sequence. Using web services provides Soap-HT-BLAST a number of advantages, such as being able to use multiple systems around the world as worker nodes. However, Soap-HT-BLAST has a number of severe disadvantages that harm its ability to achieve even linear speed-up. Unlike other distributed BLAST implementations, this application's use of large scale networks means that communication times between nodes are far higher due to increased latency and, because data must be sent across the internet, there is a higher risk of corrupt packets, which forces the use of TCP over UDP. The authors also point out that the system makes no attempt to load balance or to ensure that a job submitted to the system will be honored if the system becomes taxed with multiple sequences. As such, if four BLASTs are already running simultaneously on the nodes and CPU usage is at levels higher than 85%, the system will simply reject new jobs from users instead of queuing up incoming jobs. Soap-HT-BLAST brings a few interesting ideas to the table, and the idea of web services as a method to extend the reach of a distributed BLAST application is a novel approach to branching out of purely local distributed systems. However, Soap-HT-BLAST also exposes the weaknesses inherent in distributed systems that branch too far from the LAN, as these systems begin to function more slowly than their purely local counterparts and are more susceptible to the latency and communication issues often encountered when dealing with web services.


In the paper, “Windows .NET Network Distributed Basic Local Alignment Search

Toolkit (W.ND-BLAST)” by Scot E. Dowd et al. the authors discuss their distributed

BLAST implementation named W.ND-BLAST. W.ND-BLAST is a Microsoft Windows only distributed BLAST program that uses the Microsoft .NET framework to control the communication and fault tolerance of the distributed system. Unlike the previous implementations mentioned above, W.ND-BLAST uses a master node to build a distributed system by seeking out all other workstations in the local network and clustering together any of them running the server half of the W.ND-BLAST program (Dowd, et al. 2005). This cluster of workstations is completely heterogeneous with the exception that all of the machines must be running

Microsoft Windows and be compatible with the Windows .NET framework. Once this distributed system has been created, the master node will accept the query sequences, database, and other input from the user and begin the process of passing this information to the various worker nodes. Similarly to the previous implementations, W.ND-BLAST has only implemented query sequence segmentation. Thus the master node will break the query sequences into multiple segments and pass each segment on to the worker nodes for BLASTing. W.ND-

BLAST provides users with the benefit of building distributed systems out of standard workstations, without requiring these workstations to become permanent members of a cluster. This provides users of this implementation with the ability to easily scale their cluster up and down while the system is running. The drawback to such a system is that providing for this scalability requires increased

communication overhead, which is made worse by using the .NET framework, known for having high communication overhead, to build this system. Using a cluster-of-workstations approach is a tradeoff: it allows users to cheaply build non-dedicated clusters to handle massive amounts of biological data, but the communication overhead and the use of workstations that are simultaneously performing non-BLAST related tasks for their primary users cause this approach to be slower than others in most cases. It should be noted, however, that this approach could be used on machines dedicated to work in the cluster, which would leave only the communication overhead as a drawback to the method. The authors note that

W.ND-BLAST is capable of achieving near linear speed-up. This is shown by running BLAST normally on a 50-sequence set against a large (1.59 gigabyte) database on a machine lacking the RAM to store the entire database in memory. This BLAST takes just shy of 64 minutes to complete. The same BLAST run using W.ND-BLAST across 7 nodes takes just over 12 minutes to complete, 3 minutes over what would have been a true linear speed-up. In another, larger test, a job that took 11 hours using standard BLAST took 45 minutes using W.ND-BLAST on 17 nodes, where true linear speed-up would have taken 39 minutes. Thus it appears W.ND-BLAST does achieve near linear speed-up, but communication overhead tacks on approximately 20 to 30 seconds per executing worker node.

In the paper, “The Design, Implementation, and Evaluation of mpiBLAST” the authors Aaron E. Darling et al. discuss their distributed BLAST implementation

known as mpiBLAST. mpiBLAST is currently the most popular implementation of distributed BLAST and has been the most popular since the year it was released. mpiBLAST varies from its predecessors by being one of the first to efficiently implement distributed BLAST using both the query segmentation and the database fragmentation methods (Darling, Carey and Feng 2003). Like most others, mpiBLAST uses NCBI's BLAST program as the core BLASTing application and simply builds a framework around this implementation to handle the preparation, distribution, and compilation stages. mpiBLAST, as the name suggests, is built to make use of MPI, the Message Passing Interface, common in many distributed systems. Much like BeoBLAST, mpiBLAST only works on distributed systems that have already been created and are dedicated to act as a distributed system. While mpiBLAST will work on heterogeneous clusters, it does require that the NCBI BLAST toolkit be installed and that each machine be running an implementation of MPI, such as MPICH2 or LAM/MPI. mpiBLAST is primarily built to run on Linux, Unix, and Mac OS and does not tend to work on Windows based machines; thus the heterogeneous cluster must make use of one of these operating systems. mpiBLAST uses the same master-worker configuration as BeoBLAST with the exception that it uses MPI as the primary method for communication between the nodes. mpiBLAST is able to achieve super linear speed-up on large databases by using database fragmentation. This allows databases to be forced into the constraints of physical memory and allows the BLAST program to avoid constantly writing to disk, granting major boosts in speed. mpiBLAST achieves database fragmentation using one of two methods, depending on the cluster configuration. For configurations

that allow for parallelized input and output (PIO) to the distributed system's shared memory, mpiBLAST relies on a method discussed in the paper “Efficient Data Access for Parallel BLAST,” in which the authors Heshan Lin et al. discuss how using parallelized input and output operations could allow database fragmentation to occur on-the-fly instead of requiring a constant reparsing of the database using NCBI's formatdb tool (Lin, et al. 2005). This method requires that the database be processed only once instead of being reformatted multiple times depending on the size of the cluster. In non-PIO configurations, mpiBLAST will simply rely on the formatdb tool to cut databases into numerous fragments and then pass these fragments out to the various workers. This process is slower and requires a major increase in the preprocessing stage of execution, but it is still a viable method for achieving super linear speed-up.


Chapter 3 Approach

This chapter includes an overview of the algorithms designed as part of the contribution of this dissertation. The first methods developed were the multicast communication methods, which automatically create a distributed system out of existing machines across a local area network. This distributed system can then be utilized to solve problems in parallel across the various members of the distributed system with the aid of the distributed application framework. This framework grants developers the ability to quickly and easily create distributed applications that run across distributed systems built by the multicast communication methods.

Once the distributed application framework was completed, an example distributed application was constructed to perform distributed BLAST processing. This distributed BLAST application makes use of the distributed application framework and expands on it to allow the distributed system to correctly and efficiently perform large scale BLAST operations in near linear time or, in some cases, super linear time. These algorithms are then implemented in order to show they are in fact solutions to the problems they were created to solve.

3.1 Creating a Distributed System from Existing Nodes

Achieving significant speed increases in bioinformatics tools and applications requires utilizing distributed resources that meet all the criteria of a distributed system and are capable of meeting the six challenges for distributed systems as

discussed in section 2.2 Distributed Computing. This section will first discuss the algorithm used to create the distributed system and then discuss how this system meets both the definition and the challenges of a distributed system. Figure 4 is shown at the end of this section in order to demonstrate the layout of and interactions between the master and worker nodes in the distributed application.

3.1.1 Algorithm for creating a Distributed System from Existing Nodes

The master and worker nodes will each contain the necessary code to establish and join an IP multicast group on the local area network. An IP multicast, as described by

Stephen E. Deering and David R. Cheriton in their paper “Multicast routing in datagram internetworks and extended LANs,” is the transmission of data to a subset of hosts in the network. This method of data passing is more efficient and requires less overhead than broadcasting the data to all hosts in the network or unicasting to each host individually (Deering and Cheriton 1990). As each master and worker node loads the distributed application framework, it will join the multicast group or, if one is not yet established, it will establish the multicast group.

The master and worker nodes will only be capable of broadcasting the following three messages into the multicast group:

1) The worker node discovery message informs all connected worker nodes to

report their presence and status to the broadcasting master node.


2) The new worker node activated message informs any currently connected

master nodes that a new worker node has joined the multicast.

3) The worker node response message informs the broadcasting master node of

the worker node’s presence and status in the multicast.

The worker node discovery message is the only message the master node is capable of sending into the multicast group. This message allows the master node to discover which worker nodes are currently in the multicast group and what each connected worker node’s current status is. Before issuing this message into the multicast, the master node will mark all of the worker nodes it has made contact with as being offline. Once the message has been broadcast only worker nodes that respond within a reasonable amount of time will be marked as online. Thus the master node will always assume worker nodes have gone offline unless a worker node sends a message stating otherwise.

The worker nodes have two messages they are capable of sending into the multicast group. The first message is the new worker node activated message which is broadcasted directly following its connection to the multicast group. This message is shorter than the worker node response message because no status information is included with the broadcast. Master nodes that receive this message will add the new worker node to their list of known worker nodes if it was previously undiscovered, otherwise the master node will do nothing. The second message sent by the worker node is the worker node response message. This message is

broadcast immediately upon receiving the worker node discovery message from a master node. This message contains the message code assigned to the worker node response message so that the master node knows what message has been sent; it also contains two short status codes that inform the master node how much RAM the node has and whether or not the node is currently undertaking any work for the master node.

Using a multicast to transmit messages to worker nodes allows the master node to quickly and easily determine the current status of each worker node in the distributed system without requiring the user to manually inform the master node of the existence of these nodes. By storing the host name, IP address, memory amount, and known status of each responding worker, the master node is able to establish connections to worker nodes, transmit data to and from worker nodes, and stay up to date on the status and size of the evolving distributed system.
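A minimal sketch of this discovery exchange is shown below. The multicast group address, the UDP port, and the single-character message codes are assumptions made for illustration; the dissertation does not fix these values.

    using System;
    using System.Net;
    using System.Net.Sockets;
    using System.Text;

    // Sketch of the master node's side of the multicast discovery exchange.
    class DiscoverySketch
    {
        static readonly IPAddress Group = IPAddress.Parse("239.0.0.50"); // assumed group address
        const int Port = 15021;                                          // assumed multicast port

        static void Main()
        {
            UdpClient udp = new UdpClient(Port);
            try
            {
                udp.JoinMulticastGroup(Group);
                udp.Client.ReceiveTimeout = 5000;   // bounded response window

                // Broadcast the worker node discovery message ("D" is an assumed code).
                byte[] discover = Encoding.ASCII.GetBytes("D");
                udp.Send(discover, discover.Length, new IPEndPoint(Group, Port));

                // Collect worker node response messages until the window expires.
                while (true)
                {
                    IPEndPoint remote = new IPEndPoint(IPAddress.Any, 0);
                    string msg = Encoding.ASCII.GetString(udp.Receive(ref remote));
                    // A response ("R", assumed) carries RAM and busy/idle status codes;
                    // any worker that never responds stays marked as offline.
                    if (msg.StartsWith("R"))
                        Console.WriteLine("Worker {0} is online: {1}", remote.Address, msg);
                }
            }
            catch (SocketException)
            {
                // Receive timed out: this discovery round is over.
            }
            finally
            {
                udp.Close();
            }
        }
    }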

3.1.2 Methods for creating a Distributed Application Framework

The communication methods described in section 3.1.1 Algorithm for creating a

Distributed System from Existing Nodes provide the needed communication framework to build a distributed system, but they lack the methods required to support running distributed computations across the newly created system. To accomplish this, a distributed application framework must be constructed to supply

developers with access to the methods required to perform distributed computations. The distributed application framework designed for this project contained methods to perform the following:

1) store information regarding remote nodes,

2) transfer string and byte data between master and worker nodes,

3) transfer files between master and worker nodes, and

4) begin execution of a process on remote nodes.

3.1.2.1 Storing Information Regarding Remote Nodes

Created as an extension to the multicast communication framework, these methods and their class objects provide the developer a way to store information regarding each remote node that the master node makes contact with. By default the system will store the remote node’s IP Address, hostname, total amount of RAM in megabytes, and the availability of the remote node. The system, however, is built to be easily configurable in order to incorporate storing any other data regarding remote nodes that the developer deems necessary.
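A sketch of the kind of class object these methods maintain for each discovered node might look like the following; the property names are illustrative assumptions rather than the framework's actual identifiers.

    using System;
    using System.Net;

    // Illustrative record of what the master node stores for each remote node.
    class RemoteNodeInfo
    {
        public IPAddress Address { get; set; }    // IP address reported over the multicast
        public string HostName { get; set; }      // host name of the worker
        public int RamMegabytes { get; set; }     // total amount of RAM in megabytes
        public bool IsOnline { get; set; }        // cleared before each discovery broadcast
        public bool IsBusy { get; set; }          // currently performing work for this master
        public DateTime LastSeen { get; set; }    // time of the last response message
    }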

While storing information regarding remote nodes should generate few errors, the class objects and methods contain the necessary error checking to ensure they are able to withstand faults regarding network communication failures and corrupt network data that would lead to unidentified IP addresses or hostnames. In the

event any of these errors are encountered, the methods will attempt to correct the problem by requesting the information be sent again, if possible. If the error cannot be corrected, then the remote node is simply removed from the list of nodes in order to prevent the node from creating other problems, and an error code is returned so that the master node may continue performing other tasks until another attempt is made to contact the failing remote node.

3.1.2.2 Transfer String and Byte Data between Master and Worker Nodes

As discussed in section 2.2 Distributed Computing, a distributed system must rely on some form of message passing in order to sync operations up across the system.

By providing methods that allow the developer to easily transfer short byte or string messages, a message passing system for syncing up these operations is established.

These methods are primarily used by other parts of the distributed application framework as part of the message passing system; however they are able to be extended and called directly by the developer so that they too can sync up operations across the system.

This message passing system requires knowledge of the remote node's IP address and the port number that the remote node will be listening on. By default each remote node will have a dedicated port on which to listen for messages from other nodes. When messages are received on these ports, the node will respond to the command issued, such as opening a file transfer connection and

awaiting a file, before sending out an acknowledgement message back to the sender.

The sender will wait for a pre-determined amount of time (two minutes by default) for a remote node to send its acknowledgement before the remote node is classified as offline and all contact with the node is broken until it begins responding to the sender again via the multicast communication. If an acknowledgement is received, then the system will continue forward with the operation it had requested.

All requests, acknowledgements, and other crucial communications between the nodes are run through these methods due to their high fault tolerance. Failed communications are attempted again if the system deems that the message is vital to system stability or if the receiver requests that an operation be attempted again, such as after a file transfer error. All network communications are marked to time out if the connection fails to show any activity for a specified amount of time. These time outs ensure that the system continues to respond and prevent the system from deadlocking if a remote node fails to sync properly. Between each attempt at network communication the methods will also ensure the connection is still active; an inactive connection signifies that the TCP connection between the nodes has been severed. When an inactive connection has been discovered, the method will close the connection and mark the remote node as offline in order to ensure the system does not continue attempting to make contact with a failing or offline remote node.

The remote nodes, which use the same framework, will also be able to detect when the master node has ceased communication and will return to a listening state if

such an event occurs. This ensures the remote node will not get deadlocked when it has failed to sync with a master node.
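The send-and-acknowledge pattern described above can be sketched as follows, assuming a single acknowledgement byte and modelling the two-minute wait with a read timeout; the acknowledgement value and message format are illustrative.

    using System;
    using System.Net.Sockets;
    using System.Text;

    // Sketch of sending a short command string and waiting for an acknowledgement.
    static class MessageSketch
    {
        public static bool SendWithAck(string host, int port, string message)
        {
            try
            {
                using (TcpClient client = new TcpClient(host, port))
                using (NetworkStream stream = client.GetStream())
                {
                    stream.WriteTimeout = 30 * 1000;      // abort connections that show no activity
                    stream.ReadTimeout = 2 * 60 * 1000;   // two-minute acknowledgement window

                    byte[] payload = Encoding.ASCII.GetBytes(message);
                    stream.Write(payload, 0, payload.Length);

                    int ack = stream.ReadByte();          // blocks until the ack arrives or times out
                    return ack == 0x06;                   // assumed acknowledgement byte
                }
            }
            catch (Exception)
            {
                // Timeout or severed connection: the caller marks the node offline
                // and waits for it to reappear via the multicast group.
                return false;
            }
        }
    }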

3.1.2.3 Transfer Files between Master and Worker Nodes

Transferring files between the master nodes and the worker nodes requires that a sync operation, as described in section 3.1.2.2 Transfer String and Byte Data between Master and Worker Nodes, occur. The sender will make use of the syncing methods in order to inform the receiver that a file needs to be transferred. Upon receipt of this message, the receiver will open a new TCP connection and begin listening on the port. The port number and a byte code acknowledging the request are then returned to the sender, which in turn will open a TCP connection to the receiving node and begin transferring the file. Upon completion, another request for acknowledgement is sent across the network along with information regarding the file so that the receiver may verify the file's contents. If the file has transferred successfully, an acknowledgement is sent and the connection is severed on both ends; however, if the file failed to transfer without corruption, then a failure message is sent to the sender along with a short byte code stating to either resend the file or give up. If a file has failed multiple times, the receiver will request that the sender cease sending the file for a short time and try again later; otherwise the file transfer will begin again. In the event a sync operation fails or a file transfer fails due to a network error or a network time out, the master node will mark the worker node as offline and wait for the worker node to resume contact via multicast communication

while the worker node will simply close all file transfer connections with the master node and resume listening for instructions from the master node. This allows each node to resume its previous operations without being deadlocked by a failed connection, ensuring that each node can continue to function even when other nodes begin to fail.
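As a sketch of the verification step, the fragment below compares an MD5 digest of the received file against a digest supplied by the sender; MD5 is an assumed choice, since the text only states that the receiver verifies the file's contents.

    using System.IO;
    using System.Security.Cryptography;

    // Sketch of verifying a transferred file against a digest sent with it.
    static class FileTransferSketch
    {
        public static bool VerifyFile(string path, byte[] expectedDigest)
        {
            using (MD5 md5 = MD5.Create())
            using (FileStream fs = File.OpenRead(path))
            {
                byte[] actual = md5.ComputeHash(fs);
                if (actual.Length != expectedDigest.Length)
                    return false;                          // corrupt transfer: request a resend
                for (int i = 0; i < actual.Length; i++)
                    if (actual[i] != expectedDigest[i])
                        return false;                      // corrupt transfer: request a resend
                return true;                               // intact: send the acknowledgement
            }
        }
    }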

3.1.2.4 Begin Execution of a Process on Remote Nodes

Starting a process on a remote node requires that the master node first transfer the required data up to the remote node. This data transfer requires that a sync operation, as described in section 3.1.2.2 Transfer String and Byte Data between

Master and Worker Nodes occur. Data sent to the remote server may be small enough to send via the same string or byte message passing methods used to establish the sync, but more often than not a file will need to be transferred to the remote node using the methods discussed in 3.1.2.3 Transfer Files between Master and Worker Nodes. Once the data has been successfully transferred, the developer may begin execution of a process on that data by calling the run process remotely method. This method takes the command line parameters that will be used on the remote machine with a few small modifications and passes that string to the remote node using the string message passing methods as well as a byte code to inform the remote machine that a process will need to be started using this data. The parameter string that is passed to the remote node may contain special strings that

the remote node will replace before execution will begin. By default the methods are able to handle the following special strings:

• #PROGRAMPATH# will be replaced on the remote node with the correct path

to the folder containing the program executable,

• #INPUTFILEPATH# will be replaced on the remote node with the correct

path to the folder where file transfers are stored, and

• #OUTPUTFILEPATH# will be replaced on the remote node with the correct

path to the folder where the output files are stored before being transferred

back to the master node.

Once the command to begin work remotely has been issued, the method will enter a waiting loop where it will wait for the remote node to signal it has completed the work. Within this wait loop three different aspects of the run are being monitored in order to prevent the master node from waiting indefinitely for a remote node that may have failed to complete the task assigned to it. The wait loop will continuously check the established TCP connection between it and the remote node in order to detect if the remote node has severed a connection before the remote node signals it has completed the work assigned to it. The wait loop will also check the remote node’s information object, which is being updated by the multicast methods, to ensure the remote node is continuing to stay in communication with the master node and, more importantly, is continuing to state that it is performing work for the master node. If the remote node informs the master that it is no longer processing work and the master node has not been signaled that processing is complete, then

the distributed system will sever the connection with the remote node and return an error code stating the work for this node was not completed. The distributed system is unable to predict the amount of time a remote node will need to complete any given task, thus monitoring these aspects of the run is the only way to ensure the work is being completed. Much like the other methods, upon encountering a processing error, this method will simply break the connection with the remote node, mark the remote node as offline, and return an error code so the program may attempt passing the work to another remote node.
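A sketch of how a worker node might expand the special strings and launch the requested program is shown below; the folder locations are placeholder assumptions, and the logic is a simplified stand-in for the run process remotely method described above.

    using System.Diagnostics;

    // Sketch of expanding the special path strings and starting the remote process.
    static class RemoteExecutionSketch
    {
        public static Process RunRemoteCommand(string executable, string parameters)
        {
            string expanded = parameters
                .Replace("#PROGRAMPATH#", @"C:\DistributedApp\bin")         // assumed program folder
                .Replace("#INPUTFILEPATH#", @"C:\DistributedApp\input")     // assumed transfer folder
                .Replace("#OUTPUTFILEPATH#", @"C:\DistributedApp\output");  // assumed output folder

            ProcessStartInfo info = new ProcessStartInfo(executable, expanded)
            {
                UseShellExecute = false,
                CreateNoWindow = true
            };

            // The wait loop described above monitors this process (and the node's
            // multicast status) until the worker signals that the work is complete.
            return Process.Start(info);
        }
    }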

3.1.3 Meeting the definition of a distributed system

As discussed in 2.2 Distributed Computing, in order for a system to be classified as a distributed system, it must meet the following criteria:

1) The system must be concurrent, allowing each machine in the distributed

system to work in parallel or independently of the other machines in the

system.

2) The system must lack a global clock. Each component of the distributed

network may only communicate via some form of message passing. However

there are limitations on how accurately the components can synchronize

their clocks, thus forcing each component to keep track of their own local

clock, instead of referencing some global clock.


3) The system must allow for and be tolerant of independent component and

system failures. The distributed system should allow for independent

components to fail without it disrupting the capabilities of the system as a

whole.

The multicast method for determining the status of each worker node in the distributed system allows each machine to run concurrently. The file and data transfer methods also allow each remote node to receive data separately of one another and the remote processing methods allow each machine to be given separate execution commands, regardless of the work being performed on other remote nodes. While the master node is keeping track of each remote node it encounters, the remote nodes are only able to store information about the current master node they are performing work for. If the remote node is not currently processing any data for a master node, then the worker node will not have any information stored regarding any of the master nodes. This allows each worker node to work independently of any other node in the system.

Because the machines remain completely independent of one another, they are unable to share a global clock. Each machine runs completely separately from the other nodes, regardless of their current clocks. Synchronization of work is performed strictly through passing messages between master and worker nodes.


As stated previously, the multicast method in effect will work under the assumption that any node that does not respond in a timely manner is no longer online. The master node is unaware as to why a worker node has gone offline and will simply ignore the worker node when it comes time to pass out work to the worker nodes.

In the event of a network failure that disrupts all communication, the master node will simply fall idle until the problem is resolved and worker nodes begin responding to the master node’s attempts to contact them. Once the network failure has been resolved the process will continue as if the failure had never occurred.

Lastly, in the event the master node experiences a system failure, the entire distributed system will halt. Without a master node to direct the worker nodes, the worker nodes will simply complete any work they had been assigned and then fall idle until a master node calls upon them. If the system failure is recoverable, the system will resume operation when the failure is corrected. In the event the system failure requires the user to intervene, then the distributed system will resume operation starting at the point of failure when the user has restarted the master node. As such the system currently will tolerate a master node failure; however, the system may cease completing work until the master node is restored or a new master node takes its place.

3.1.4 Meeting the challenges of a distributed system

As discussed in 2.2 Distributed Computing, a distributed system must be capable of overcoming the following challenges:


1) The distributed algorithm should be able to tolerate a heterogeneous

distributed system.

2) The distributed system must be secure. In order for the entire system to stay

secure, each individual system and each program running on these systems

must adhere to the protection of the system’s confidentiality, integrity, and

availability.

3) Distributed systems must also allow for scalability, the ability to remain

effective even when a significant number of users and resources are added to

or removed from the system.

4) A distributed algorithm must be able to handle failures. Unlike in a non-

distributed system, failures in a distributed system should follow the rules

discussed before, in that the distributed system should tolerate the loss of

components without it causing failures throughout the entire system

5) Distributed systems are required to be concurrent and thus distributed

algorithms running across these systems must also be concurrent

6) Lastly, distributed systems should keep the separation of components within

the system concealed from the user and the application programmer. This

process is known as transparency and is used to ensure the distributed

computer is perceived as a single system instead of a collection of

independently working components. A brief summary of the eight forms of

transparency as listed in the RM-ODP (ISO/IEC 1996) are as follows:

a. Access transparency

b. Concurrency transparency


c. Location transparency

d. Replication transparency

e. Resource transparency

f. Failure transparency

g. Federation transparency

The distributed application framework is built to run on any hardware configuration and will tolerate heterogeneous hardware between the nodes. The framework does, however, require that the master and worker nodes be installed on machines that can support binaries compiled from C# code and have the .NET 3.5 framework installed.

The distributed application framework adheres to the protection of the system’s confidentiality and integrity simply because the message passing system does not allow for master nodes to request any information or alter any data. Each worker node also protects the system’s security by only communicating with one master node at a time, thus preventing other master nodes from corrupting a worker’s data.

This single master communication method also adheres to the protection of a system’s availability by ensuring that while other master nodes can receive information regarding the worker node’s status, the system does not allow two master nodes to secure it as a resource, thus ensuring the system cannot be deadlocked by multiple master nodes.


As explained in section 3.1.1 Algorithm for creating a Distributed System from

Existing Nodes, the multicast method allows for scalability of both master nodes and worker nodes. As worker nodes join the distributed system, they are incorporated into the lists contained by all active master nodes. As master nodes join the system, they will begin separately keeping track of all the worker nodes. As such the system is fully scalable and can sustain either a rapid increase or a sudden decrease in master or worker nodes. Because the distributed application framework is based upon the multicast method, it will inherit this quality.

Section 3.1.3 Meeting the definition of a distributed system states that the multicast method allows the system to meet the definition of a distributed system. The challenge of making the algorithm capable of tolerating errors is covered by the third criterion of that definition. The distributed application framework will inherit this property because it is based upon the algorithm that established the multicast system. As such the algorithm and the system will be capable of tolerating both software and hardware failures.

The first criterion of a distributed system covered in section 3.1.3 Meeting the definition of a distributed system states that the distributed system must be concurrent. The multicast method meets this requirement by keeping worker nodes isolated from one another in the sense that each worker node is unaware of the existence of other worker nodes. With each worker node only aware of a single master node at a time, these nodes can work entirely separated from one another

without any form of conflict arising. This is ensured since the distributed application framework will force worker nodes to only handle one master node at a time.

Of the eight forms of transparency listed in the RM-ODP (ISO/IEC 1996), the distributed application framework meets six of them. The eight forms of transparency and the reasoning behind the distributed application framework’s ability to meet, or failure to meet, the transparency form is listed below:

1) Access transparency: The distributed application framework does not

show the mechanisms used to access any local or remote object.

2) Concurrency transparency: The distributed application framework does

allow for the developer to state which worker nodes are currently being

accessed and used to perform the distributed processing. However the

concurrent connections between these worker nodes and the concurrent

operations being performed locally on the master node are concealed

from the user. As such the algorithm does meet concurrency

transparency.

3) Location transparency: The distributed application framework does allow

the developer to state which worker nodes are currently being accessed

and used to perform the distributed work. Since the multicast forces the

algorithm to stay within the bounds of a local area network and the user

may be shown the names of each host that is currently performing work,

location transparency is not being upheld.


4) Replication transparency: The distributed application framework hides

the fact that each remote worker node is being run on a separate thread,

thus concealing from the user the replication of various files and services

used to both connect to the worker node and tolerate any failures that

may arise within the thread or remote worker node.

5) Resource transparency: When worker nodes or other remote or local

resources fail, the distributed application framework will conceal these

failures from the user. However as new resources become available,

worker nodes go offline, or previously unused worker nodes are used to

process data, the user may be made aware of the changes. While the

developer may choose to obfuscate this data it is normally best if they do

not simply so the user can monitor the status of the distributed system

and can manually request that the distributed application ignore poorly

performing resources. As such the algorithm is not forcing resource

transparency.

6) Failure transparency: Failures within the distributed system may be

concealed from the user as long as user intervention is not required.

However developers may wish to inform the user of failures that may

require the user to intervene, such as when a worker node is constantly

failing or the master node’s network connection has been severed. The

system, however, will tolerate the failures and all automatic recovery

methods will be concealed from the user.


7) Federation transparency: The distributed application framework will be

required to cross administrative boundaries on both master and worker

nodes. The program cannot rely on the user running the distributed

process on the master node to have an account with administrative

privileges on the machine running the worker node. As such the

algorithm itself will need to operate across multiple administrative

boundaries. The user has no need to be informed of these elevations or

de-elevations in privileges, and thus that data shall be concealed from the

user. As such the algorithm will perform federation transparency.


Figure 4: Distributed System Layout and Interactions. The master node performs node discovery, receives the user input and query file, performs query segmentation and query transfer, and compiles the returned results into an output file for the user; each of the worker nodes (Node 1 through Node N) performs master discovery, establishes its connection(s), receives its query, runs the program (i.e. BLAST), and returns its results to the master node.

3.2 Algorithm behind the Distributed BLAST Master Node

In order to fully evaluate the distributed application framework, a distributed

BLAST application was constructed using the framework along with some additional features required to run BLAST operations. To better explain how the distributed BLAST application executes, this section will discuss the distributed BLAST algorithm one portion at a time. Each portion of the algorithm shall be described by the tasks it carries out, the means by which it carries out those tasks, and the interactions it has with the other portions at various stages of the program's execution. The portions this section will cover for the master node include: node discovery and connection establishment, user interface, query segmentation, database segmentation, database and query transfer, and results compilation.

3.2.1 Node Discovery and Connection Establishment

When the distributed BLAST program begins execution on the master node, the program will first need to perform node discovery. Node discovery allows the master node to determine which worker nodes are currently connected and able to perform BLASTs. In order to perform node discovery, the master node must first connect to the multicast group as explained in section 3.1.1 Algorithm for creating a

Distributed System from Existing Nodes. The master node will then broadcast the node discover message as defined in the same section. Once the message has been broadcast, the master node will accept messages from the worker nodes for a maximum of five seconds. Any node that does not respond to the master within five seconds will be classified as offline. After the initial node discovery, the master node will perform the step again approximately every 30 seconds. Before broadcasting the node discovery message, the master node will first classify each remote node as offline. As explained in section 3.1.1 Algorithm for creating a Distributed System

from Existing Nodes, this allows the program to work under the assumption that any worker that is unable to respond to a node discovery request within five seconds should be treated as being offline.

3.2.2 User Interface

The master node will contain a graphical user interface to make interacting with the program easier. This user interface will primarily be used by the user to feed input into the distributed system regarding the BLAST to be completed. However, the interface will also allow the user to monitor progress and the current status of the

BLAST. The user interface would contain all the necessary options for executing a

BLAST, including the input query file, the database to query against, the BLAST program to run (e.g. blastall, megablast), as well as most of the optional parameters accepted by the various BLAST programs. The interface will also allow the user to select some options regarding how the distributed system will handle their BLAST. These options would include whether or not to restart a previously started yet unfinished

BLAST and whether or not the BLAST should be run on the local machine instead of being distributed. The interface will also provide the user with information regarding the current state of their BLAST such as how much of their BLAST has completed and which worker nodes are currently performing which tasks. The user can then act upon this data to block, or unblock, worker nodes on the fly in order to increase the speed or reliability of the system.


The user interface will also contain a method for the user to add or remove sequence databases from the distributed BLAST engine. When a new sequence database is added, an XML file will be generated to describe the database. This file will be used by distributed BLAST to determine which files belong to a particular database and will allow distributed BLAST to determine if a worker node has the correct database. This XML information would contain the following: the database's name, the database's type (e.g. nucleotide or protein), the number of sequences contained in the database, the size of the database, the number of fragments the database contains, and the number of sequences in and size of each fragment if there is more than one.
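A hypothetical example of such an XML information file is shown below. The element names and the values are illustrative assumptions; the dissertation specifies only which fields the file records.

    <!-- Hypothetical database descriptor; element names and values are assumptions. -->
    <Database>
      <Name>ExampleNucleotideDB</Name>
      <Type>nucleotide</Type>
      <SequenceCount>200000</SequenceCount>
      <SizeInBytes>4000000000</SizeInBytes>
      <FragmentCount>2</FragmentCount>
      <Fragment Index="1" Sequences="100000" SizeInBytes="2000000000" />
      <Fragment Index="2" Sequences="100000" SizeInBytes="2000000000" />
    </Database>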

3.2.3 Query Segmentation

If the user selects to begin the distributed BLAST engine using the query segmentation method, then the master node will need to determine how many segments to generate by selecting how many sequences should be added to each segment. The user interface will allow the user to select this value if they so choose, or they can allow the system to select this value. Value selection will require the system to see how much RAM and processing power each worker node has and then attempt to estimate how long it would take to complete X sequences, where X is the current estimated value. Next the program will determine how many segments would be generated if each segment had X sequences. Assuming it takes an estimated T' time to BLAST a segment on any given worker node and an estimated T'' time to write the result file to the final result file, the program will attempt to select an X such that T' - T'' = 0. Doing so ensures that we are attempting to spend the same amount of time BLASTing the file as we do writing the final file.

Upon determining the best estimated value of X, the master node will pull X * (N * 2) sequences into memory, such that N is the number of worker nodes available to perform work. At any given time X * N sequences are being processed by the distributed BLAST implementation given that each node (N) is processing X sequences. The distributed BLAST implementation will try to keep worker nodes busy as often as possible and as such will ensure it has enough sequences read in from the input file at any given time to immediately serve sequences to each worker node. As such the program must keep the X * N sequences currently being processed stored in memory as well as X * N additional sequences to ensure each worker node has sequences ready to be sent to them. In the event that all N nodes complete

BLASTing at the same time, then each of the nodes can be passed out X sequences for processing without waiting for the master node to read them from the input file.

These sequences are then sent out to the worker nodes as discussed in section 3.2.5

Database and Query Transfer.
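The double-buffering described above can be sketched as follows. The queue-based approach and the method names are illustrative assumptions; X and N are whatever values the estimation step and node discovery produced.

    using System.Collections.Generic;

    // Sketch of keeping X * (N * 2) sequences in memory: X * N out for processing
    // and another X * N ready to hand out the moment any worker finishes.
    class QueryBufferSketch
    {
        private readonly Queue<string> readySequences = new Queue<string>();

        public void Refill(IEnumerator<string> inputFile, int x, int n)
        {
            // Top the buffer back up to two full rounds of work (X sequences per node).
            while (readySequences.Count < x * n * 2 && inputFile.MoveNext())
                readySequences.Enqueue(inputFile.Current);
        }

        public List<string> NextSegment(int x)
        {
            // Hand a worker its next X sequences without waiting on the input file.
            List<string> segment = new List<string>(x);
            while (segment.Count < x && readySequences.Count > 0)
                segment.Add(readySequences.Dequeue());
            return segment;
        }
    }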

3.2.4 Database Segmentation

If the user selects to begin the distributed BLAST engine using the database segmentation method, then the master node will need to ensure the database

already exists in segments. Unlike mpiBLAST (Darling, Carey and Feng 2003), which will split a database into segments on the fly, our distributed BLAST implementation will instead rely on formatdb to segment databases that are larger than one half of any given worker node's RAM. Because executing formatdb is a slow operation, we will only run it when the database is added to distributed BLAST.

Thus the database may not have been fragmented if it was not larger than half of any of the worker node’s RAM at the time. If the database was large enough to be fragmented, then the database will be split into N fragments such that each fragment is approximately one half the smallest worker node’s RAM in size.
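As a concrete restatement of this sizing rule, the sketch below computes how many fragments formatdb would be asked to produce so that each fragment is roughly half of the smallest worker node's RAM; it is an illustration of the rule, not the program's actual code.

    using System;

    // Sketch of the fragment-count calculation used for database segmentation.
    static class FragmentSizingSketch
    {
        public static int FragmentCount(long databaseBytes, long smallestWorkerRamBytes)
        {
            long targetFragmentSize = smallestWorkerRamBytes / 2;  // half the smallest node's RAM
            if (databaseBytes <= targetFragmentSize)
                return 1;                                          // small databases are left whole
            return (int)Math.Ceiling((double)databaseBytes / targetFragmentSize);
        }
    }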

3.2.5 Database and Query Transfer

Once the distributed BLAST engine has been started, the master must confirm that each worker node has all the files necessary to complete the BLAST. As discussed in section 3.2.2 User Interface, when a new database is added to distributed BLAST an information file containing all the information needed by distributed BLAST is generated. When it comes time to determine if a worker node needs a copy of the database, this information file is transferred over to the worker node via tcp port

15020. The worker will then check if it has a copy of the same information file and the same files described by the information file and will report back whether or not it has the exact database files being described. If the information file did not already exist on the worker node, the worker node was missing any of the files listed in the new copy of the information file, or the information files contained different data,

then a new copy of the database is transferred to the worker node using the same port.

Once the worker node has received the database files required for BLASTing, the master node will begin handing out query sequences. Query transfer is handled by having the master node scan through the list of worker nodes to determine which workers meet the following requirements: they are currently connected and responding to the master node, they have successfully received the database files, and they are not already doing work for the master. Any worker node found to meet these requirements will have a separate handler thread spawned to monitor and communicate with it. This handler thread will then begin communicating with the worker node to establish a file transfer connection so that it may transfer a query file to BLAST. In the case of query segmentation, the worker node will receive a query segment containing some previously determined number of sequences, which it will then BLAST against the entire sequence database. In the case of database segmentation, the worker node will receive the entire set of query sequences and instructions to BLAST against one fragment of the database.

The worker node will then BLAST the files it received until it has either completed the BLAST or the NCBI BLAST program fails. In either event the worker node will notify the handler thread on the master node of the BLAST’s status. Successful completion of a BLAST will have another file transfer connection established and the resulting BLAST output file will be transferred back to the master node. In the event

of a failure, the master node will mark the run as a failure so that another worker node may take up the job. The handler thread then shuts down the connection between the worker and master node so that a new handler may establish a connection when the worker node is needed again.

3.2.6 Results Compilation

If the user selected to run distributed BLAST using the query segmentation method, then the result files that return from the worker nodes will not require any correction before they can be concatenated together to form the complete result file.

These files will not require any correction because the worker node would have already corrected them as detailed in section 3.3.4 Result Correction. As such, once each result file has been received from a worker node, it will be added to the final result file in the order it was segmented. This ensures that the final result file would contain the same information, in the same order, as the result file attained from running BLAST locally.

If the user selected to run distributed BLAST using the database fragmentation method, then the result files that return from worker nodes will require processing using a selection algorithm. This selection algorithm is as follows:

1. Once each result file returns from the worker nodes, the master node will

begin reading the resulting BLAST output data from each worker node. The

master node will then read in and print out all the XML header information


all the way down to the first iteration element, see section 3.3.4 Result

Correction for more information regarding what data is in each element.

Because this data should be the same for each file, the header information

from one of the files will be immediately copied into the final BLAST output

file.

2. From each file, all the hit and HSP elements will be read in from the next

iteration element. By comparing the E-values for each HSP from each hit,

the master node will be able to sort the hits from the most significant to the

least significant. Once sorted the top X, where X is the number of hits the

user selected to receive, will be taken and written into the final BLAST

output file.

3. Step 2 will then repeat for each iteration in the BLAST output files.

4. Once the final iteration has been handled, the footer information from the

BLAST output files will be copied into the final BLAST output file and all the

files will be closed.

5. The segment BLAST output files are then deleted as they are no longer

needed.
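A simplified sketch of steps 2 and 3 of this selection algorithm is shown below. Real BLAST output is XML with iteration, hit, and HSP elements; here the parsed hits are represented by a minimal placeholder type, and only the pooling, sorting, and top-X selection are illustrated.

    using System.Collections.Generic;
    using System.Linq;

    // Minimal stand-in for a parsed BLAST hit (the real data comes from the
    // iteration/hit/HSP elements of the XML output).
    class BlastHit
    {
        public string HitId;     // subject sequence identifier
        public double EValue;    // most significant HSP E-value for this hit
    }

    static class ResultMergeSketch
    {
        // Pool the hits reported by every database fragment for one query
        // (iteration), sort from most to least significant, and keep the top X.
        public static List<BlastHit> MergeIteration(IEnumerable<List<BlastHit>> hitsPerFragment, int topX)
        {
            return hitsPerFragment
                .SelectMany(hits => hits)
                .OrderBy(hit => hit.EValue)   // smaller E-value means more significant
                .Take(topX)
                .ToList();
        }
    }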

3.3 Algorithm behind distributed BLAST Worker Nodes

The distributed BLAST’s worker node algorithm is broken into numerous parts spread across a small number of stages in the program’s execution. Each portion of the algorithm shall be described by the tasks it carries out, the means by which it

will carry out those tasks, and the interactions it has with the other portions at various stages of the program's execution. The portions this section will cover for the worker nodes include: master discovery and connection establishment, query transfer, BLAST, result correction, and result transfer.

3.3.1 Master Discovery and Connection Establishment

When the distributed BLAST program begins execution on a worker node, the program will first need to inform the multicast group of its presence. By informing the multicast group of its presence, the worker node can immediately inform all master nodes that the worker node has joined the distributed system instead of waiting for each master node to perform another node discovery. Performing this step requires that the worker node first connect to the multicast group as explained in section 3.1.1 Algorithm for creating a Distributed System from Existing Nodes.

The worker node will then broadcast the new worker node activated message as defined in the same section. Once the message has been broadcast, the worker node will begin monitoring port 15020 for commands from existing master nodes.

Simultaneously the worker node will also monitor the multicast group for messages regarding node discovery. Once a node discovery message is received from a master node, the worker node will reply with the worker node response message defined in section 3.1.1 Algorithm for creating a Distributed System from Existing Nodes. The worker node will not store information on other worker nodes or any master nodes that contact it through the multicast group. Only master nodes that contact the worker node via port 15020 will have their information stored, and this information will be deleted as soon as the work for that master node has been completed.
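
A bare-bones sketch of this start-up announcement is shown below. The multicast group address and port are placeholders (only TCP port 15020 is named in the text), and the message string stands in for the new worker node activated message; this is an illustration under those assumptions, not the application's actual implementation.

    import socket
    import struct

    MULTICAST_GROUP = "239.0.0.1"   # placeholder group address
    MULTICAST_PORT = 15000          # placeholder multicast port

    def announce_worker():
        """Broadcast the worker-activated message and join the multicast group
        so that node discovery messages from master nodes can be heard."""
        sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
        sender.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
        sender.sendto(b"WORKER_NODE_ACTIVATED", (MULTICAST_GROUP, MULTICAST_PORT))

        # Join the group so discovery messages from master nodes are received.
        listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
        listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        listener.bind(("", MULTICAST_PORT))
        membership = struct.pack("4sl", socket.inet_aton(MULTICAST_GROUP), socket.INADDR_ANY)
        listener.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership)
        return listener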

3.3.2 Database and Query Transfer

Once the worker node has discovered a master node it will begin monitoring TCP port 15020, waiting for the master node to assign it a task. Before the master node can assign a worker node a task to complete, it must first ensure the worker node has the database files needed to complete the work. The master node will signal the worker node on port 15020 in order to alert the worker node that a database check needs to occur. Once the worker node has replied, the information file discussed in section 3.2.2 User Interface is transferred to the worker node for comparison. The worker will first check whether it has a file with the same name, whether the contents of both files are an exact match, and whether the database files described in the information file exist. The worker node will then report back whether or not it has the database files being described. If the information file did not already exist on the worker node, if the worker node was missing any of the files listed in the new copy of the information file, or if the information files contained different data, then a new copy of the entire database is transferred to the worker node using the same port. Because the worker node cannot guarantee it will be given a task by the master node, it will simply signal to the master node that the database transfer completed successfully and then resume monitoring port 15020.
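
A sketch of the worker-side database check is given below. The information file layout is an assumption (a first line holding N and n, remaining lines listing database file names); the three tests themselves, a matching file, identical contents, and all listed database files present, follow the description above.

    import os

    def database_is_current(local_info_path, received_info_text, database_dir):
        """Return True if the worker already holds the database described by
        the master's information file, so no transfer is required."""
        # The local information file must exist.
        if not os.path.exists(local_info_path):
            return False
        # Its contents must match the master's copy exactly.
        with open(local_info_path) as local_info:
            if local_info.read() != received_info_text:
                return False
        # Every database file listed in the information file must be present.
        for name in received_info_text.splitlines()[1:]:
            if name and not os.path.exists(os.path.join(database_dir, name)):
                return False
        return True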


A worker node that has been selected to perform a BLAST will receive a signal on port 15020 instructing it to prepare to BLAST. If the worker node is not currently performing a BLAST it will respond to the master node that it is prepared to BLAST. The query sequences that the master wishes the worker to BLAST will then be transferred across the same port and stored into the worker node’s query directory. The master will then pass the BLAST options to the worker node so that it may perform the BLAST.

3.3.3 BLAST

With the sequence database files, query file, and BLAST options transferred from the master node, the worker node can now begin execution of NCBI BLAST. This is done by calling the program specified in the BLAST options (e.g. blastall, megablast) and passing it the file paths for the sequence database, query file, and the BLAST options.

NCBI BLAST will now execute normally and once complete the worker node will read the exit code given by the program. If the exit code shows the program completed successfully, then the worker node will continue on to correcting the resulting blastout file. If the exit code shows NCBI BLAST encountered an error of some form, then the master node will be notified of the error and the worker node will return to an idle state awaiting commands from a master node.
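
As a rough illustration, the worker's call into NCBI BLAST could look like the sketch below. The option strings are those of the legacy blastall command line; the file paths are placeholders, and -m 7 is used here because the correction step works on XML output. This is a sketch under those assumptions, not the application's exact invocation.

    import subprocess

    def run_blast(program, database_path, query_path, output_path):
        """Invoke the legacy NCBI blastall tool and report whether it succeeded."""
        command = [
            "blastall",
            "-p", program,        # e.g. "blastn" or "tblastx"
            "-d", database_path,
            "-i", query_path,
            "-o", output_path,
            "-m", "7",            # request XML-formatted blastout output
        ]
        exit_code = subprocess.call(command)
        # A zero exit code means the worker can move on to result correction;
        # anything else is reported back to the master node as a failure.
        return exit_code == 0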


3.3.4 Result Correction

Once a BLAST completes, the worker node must make corrections to the resulting file. When a file was run through BLAST using the query segmentation method, the resulting blastout file must have certain parts removed so that the master node can easily reassemble the pieces. The resulting blastout file contains all the BLAST results for each sequence; however, it also contains some introductory text and parameters at the beginning of the file, and some closing text and information at the end of the file. When the worker node received the file it also received a number stating its position in the query segments. During result correction the worker node will check this value to determine what parts, if any, need to be removed. If the file is the first query segment, then the file keeps the introductory text and parameter information at the top of the file, but the data at the end is removed unless this segment is also the final segment. If the segment is the final segment, then the text at the end of the file is left in place but the introductory text and parameters at the beginning of the file are removed, unless the segment is also the first segment. In all other cases, both the starting text and the end text are removed from the file. Thus, when the master node receives the pieces it can simply place them all together without risk of the introductory and end text repeating multiple times throughout the middle of the file.
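
A compact sketch of this trimming logic follows. It treats the blastout file as header, body, and footer regions and assumes the boundaries can be located at the first and last iteration elements of the XML output; the exact markers are an assumption for illustration rather than the application's actual parsing rules.

    def trim_segment_output(text, is_first, is_last):
        """Trim a query-segment blastout file so the segments can simply be
        concatenated by the master node.

        The header is kept only for the first segment and the footer only for
        the last segment; boundaries are assumed to fall at the first and last
        <Iteration> elements of the XML output."""
        header_end = text.index("<Iteration>")
        footer_start = text.rindex("</Iteration>") + len("</Iteration>")

        trimmed = text[header_end:footer_start]
        if is_first:
            trimmed = text[:header_end] + trimmed
        if is_last:
            trimmed = trimmed + text[footer_start:]
        return trimmed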

When a file was run through BLAST using the database fragmentation method, the resulting blastout file must have all of its E-values corrected. In order to calculate the E-value reported by NCBI BLAST, the BLAST program must know the total number of sequences in the database, N, as well as the total size of the database, n. However, when the worker node ran the query sequences over a fragment of the database, BLAST reported the E-value scores for only that single database fragment. As such, the N and n values used to calculate the scores were incorrect, as they covered only a subset of the entire database. As discussed previously in section 3.3.2 Database and Query Transfer, the worker node will receive a database information file whenever it receives a copy of the database that will be used by BLAST. This information file, generated on the master node, contains the values of N and n, the number of sequences in the database and the size of the database respectively, for the entire unsegmented database. Thus, upon completing a BLAST, the worker node can pull the values of N and n from this file for use in calculating the correct E-value.


Figure 5: Overview of the BLAST Output Layout (header XML data; per-iteration query sequence length m, raw HSP score S, incorrect E-value, kappa k, and lambda λ; footer XML data)

In order to calculate the correct E-value, the worker node must open the blastout file that resulted from running BLAST so that it may read in the BLAST output data and write the corrected data to a new blastout file. The blastout file is written in XML format and contains numerous lines of header information describing which BLAST program and version was used to generate the file, what parameters were passed to the program, and the name of the database used. Following the header data are one or more iteration XML elements, each providing the BLAST information regarding a sequence from the input file. Each iteration element contains the length of the query sequence, m, an iteration statistics element, and one or more hit XML elements, each representing a single match between a query sequence and a database sequence. Within each hit XML element the blastout file contains one or more HSP XML elements, each representing a single HSP match between the query sequence and the database sequence. By storing the information within each of these elements, the program can determine all the information needed to recalculate the E-value for each HSP. Each HSP element contains the raw score, S, for the HSP as well as the E-value that will be recalculated. From the iteration statistics element included in the iteration element, the program will retrieve the values for kappa, k, and lambda, λ, that the BLAST program used to find the E-values. A brief overview of the BLAST output file’s XML layout, showing only the information the worker node is concerned with, is shown above in Figure 5. The worker node will continue collecting data from each iteration element and, once it has collected all the information it requires, will recalculate the E-value for each HSP element. Recalculation of the E-value is done using the equations detailed in section 2.4.3 E-Value Calculation, which are shown below for reference:

S′ = (λS − ln k) / ln 2

m′ = m − ℓ

n′ = n − N × ℓ

E = m′ × n′ × 2^(−S′)

and ℓ = β + (α / λ) × [ln k + ln((m − ℓ) × (n − N × ℓ))], subject to (m − ℓ) × (n − N × ℓ) ≥ max(m, n)

where ℓ is the length adjustment and α and β are the scoring-system parameters used by NCBI BLAST.
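
As a minimal sketch of the recalculation, the fragment below recomputes an HSP's expect value from its raw score using the Karlin-Altschul relation E = k × m′ × n′ × e^(−λS). It assumes the effective lengths m′ and n′ have already been derived from the whole-database values of N and n in the information file, and it omits the length-adjustment step; the function and argument names are illustrative only.

    import math

    def corrected_evalue(raw_score, kappa, lam, m_eff, n_eff):
        """Recompute an HSP's expect value against the full, unsegmented database.

        raw_score      raw HSP score S read from the blastout file
        kappa, lam     the k and lambda values from the iteration statistics element
        m_eff, n_eff   effective query and database lengths m' and n', computed
                       using the whole-database values of N and n"""
        return kappa * m_eff * n_eff * math.exp(-lam * raw_score)

    # Illustrative call with made-up numbers:
    # corrected = corrected_evalue(120, 0.041, 0.267, 450, 3.5e10)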

3.3.5 Result Transfer

Once the resulting BLAST output file has been corrected, the worker node will notify the master node that the BLAST completed successfully. Once notified, the master node will accept a file transfer connection from the worker node and the BLAST output file will be transferred to the handler thread on the master node. At this point the worker node will close the connection with the master node and enter into an idle state while it awaits commands from another master node.
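
A bare-bones sketch of the worker-side transfer appears below. The status message text, the chunk size, and the use of a fresh TCP connection back to the handler thread are assumptions made for illustration, not the application's actual protocol.

    import socket

    def send_result_file(master_address, master_port, result_path):
        """Notify the master node of success and stream the corrected blastout
        file back to its handler thread over a TCP connection."""
        with socket.create_connection((master_address, master_port)) as connection:
            connection.sendall(b"BLAST_COMPLETE\n")   # placeholder status message
            with open(result_path, "rb") as result_file:
                while True:
                    chunk = result_file.read(64 * 1024)
                    if not chunk:
                        break
                    connection.sendall(chunk)
        # After the transfer the worker closes the connection and returns to idle.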


Chapter 4 Results

Chapter 4 shall discuss how the experiments were set up and how data was chosen and used to demonstrate the execution capabilities of the distributed BLAST engine versus standard local BLAST. This chapter shall then continue with a discussion regarding how each experiment was conducted as well as the results of these experiments.

4.1 Experiment Setup and Environment

The goal of these experiments was to see how the distributed BLAST system would handle executing BLAST runs in parallel using a distributed BLAST application that was developed as per the discussions in chapter 3. Testing the capabilities of the distributed BLAST system was done using three separate experiments: two to test the query set segmentation method on a small database using BLASTN and TBLASTX, and a final experiment to test the database segmentation method against a larger database using BLASTN. The first experiment tested a set of 200,000 nucleotide sequences against a database containing the same 200,000 nucleotide sequences using the BLASTN application, see section 2.3.1 BLASTN and MegaBLAST, from NCBI. The second experiment tested the same set of 200,000 nucleotide sequences against the same database using the TBLASTX application, see section 2.3.5 TBLASTX, from NCBI. The third experiment tested the same set of 200,000 nucleotide sequences against a database created from the larger set of sequences that these 200,000 were randomly selected from. This experiment ran the query sequences against the database using the BLASTN application from NCBI. These experiments are explained in greater detail and their results are discussed in the following sections.

The testing environment consisted of 11 computers which have been designated as Class A, Class B, and Class C machines; see Table 4 below. Class A consisted of four Windows Server 2003 machines with two four-core processors running at 2.66 GHz and 16 GB of RAM each. Class B consisted of six Windows Server 2008 machines with two four-core processors running at 2.67 GHz and 6 GB of RAM each. Class C consisted of one Windows Vista x64 machine with two four-core processors running at 2.16 GHz and 6 GB of RAM. The first two experiments, detailed in the following sections, were run three times under each of the following named configurations:

• Local 1: Running BLAST locally on one Class A machine.

• Local 2: Running BLAST locally on one Class B machine.

• Configuration 1: Running distributed BLAST across two Class A machines, with the Class C machine running as the Master Node.

• Configuration 2: Running distributed BLAST across four Class A machines, with the Class C machine running as the Master Node.

• Configuration 3: Running distributed BLAST across four Class A machines and two Class B machines, with the Class C machine running as the Master Node.

• Configuration 4: Running distributed BLAST across four Class A machines and four Class B machines, with the Class C machine running as the Master Node.

• Configuration 5: Running distributed BLAST across four Class A machines and six Class B machines, with the Class C machine running as the Master Node.

Class  OS                       Members  RAM    Processor(s)   CPU Speed  Cores
A      Windows Server 2008 x64  4        16 GB  Intel Xeon x2  2.66 GHz   4 each / 8 total
B      Windows Server 2003 x64  6        6 GB   Intel Xeon x2  2.67 GHz   4 each / 8 total
C      Windows Vista x64        1        6 GB   Intel Xeon x2  2.13 GHz   4 each / 8 total

Table 4: Brief Description of Machine Classes

The final experiment, detailed in section 4.3 Distributed BLAST versus local BLAST on a large database, was run three times under each of the following configurations:

• Local 1 B: Running BLAST locally on one Class B machine.

• Configuration 1 B: Running distributed BLAST across five Class B machines, with the Class C machine running as the Master Node.


The sequence files used in the first two experiments were derived by downloading the nucleotide coding regions for all known bacterial genomes from the NCBI website (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/). The all.ffn.tar.gz file contains all of these coding regions and, once downloaded, the separate coding regions were merged together into one large sequence file containing 4,113,631 sequences, as of February 2011. A Python script was then written to randomly choose 200,000 unique numbers between 1 and 4,113,631 and then pull those sequences out of the file. This created a FASTA formatted sequence file that contained 200,000 randomly selected nucleotide sequences. This file was then used as the input query sequence file for all three experiments as well as the database file, formatted using formatdb, for the first two experiments. The third experiment uses the NT database downloaded from the NCBI website (ftp://ftp.ncbi.nih.gov/blast/db/). The NT database contains all the nucleotide sequences recorded in the GenBank, EMBL, DDBJ, and PDB databases and contains 36,185,985,577 nucleotide letters spread across 15,639,775 sequences. From this database a Python script was used to randomly select 2,000 sequences and create a FASTA formatted query sequence file. This 2,000 sequence file was then executed using BLASTN against the entire NT database.
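
The selection script itself is not reproduced in the dissertation; the sketch below shows one way such a script could work, sampling record numbers without replacement and copying the chosen FASTA records to a new file. The file names and function name are placeholders.

    import random

    def sample_fasta(input_path, output_path, sample_size, total_records):
        """Copy a random subset of FASTA records from input_path to output_path."""
        chosen = set(random.sample(range(1, total_records + 1), sample_size))
        record_number = 0
        keep = False
        with open(input_path) as source, open(output_path, "w") as target:
            for line in source:
                if line.startswith(">"):       # a header line starts a new record
                    record_number += 1
                    keep = record_number in chosen
                if keep:
                    target.write(line)

    # e.g. sample_fasta("all_bacteria.ffn", "query_200k.fasta", 200000, 4113631)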

4.2 Distributed BLAST versus local BLAST on a small database

The first two experiments were designed to test the distributed BLAST application’s ability to perform distributed BLAST operations using the query set segmentation method. This method is designed to execute a query set against a database small enough to fit into memory on each machine. These experiments tested the distributed BLAST application’s ability to balance the workload across a variable number of systems by running each experiment against multiple configurations of machines, with each test being run three times. The first experiment tests the application’s ability to scale when running the 200,000 sequence query set against the 200,000 sequence database using the BLASTN tool. The BLASTN tool was chosen to ensure the distributed BLAST application can achieve linear speed up when dealing with a query set and a database that will not require any translation before execution. The second experiment tests the application’s ability to scale when running the 200,000 sequence query set against the 200,000 sequence database using the TBLASTX tool. The TBLASTX tool was chosen to ensure the distributed BLAST application can achieve linear speed up when dealing with a query set and a database that must both first be translated from nucleotide to protein sequences. The translation process will cause the number of sequences in the query set that need to be compared against the database to triple to 600,000 protein sequences. The translation process will have a similar effect on the database set, causing it to also triple in size from 200,000 sequences to 600,000 sequences. The layout and results of each of these experiments are documented below in the following two subsections.


4.2.1 Comparing Local and Distributed BLAST on a small database using BLASTN

Experiment one was conducted in order to verify the distributed BLAST application’s ability to reach near-linear performance increases when executing the 200,000 sequence query set against the 200,000 sequence database when using BLASTN. This experiment required that the BLAST first be run three times locally on a Class A machine and three times locally on a Class B machine. These values give us the baseline on which we base our estimates of what constitutes a linear increase in speed. The BLAST execution run on the Class A machine returned a result after 1 hour and 36 minutes on all three executions. The same BLAST execution run on the Class B machine returned a result after 2 hours on all three executions. Next the distributed BLAST application was used to execute the BLASTN application under Configuration 1, which consists of two Class A machines. This test returned a completed BLAST result in 52 minutes. Configuration 2, consisting of four Class A machines, returned a completed BLAST result in 27 minutes. Configuration 3, which consists of four Class A machines and two Class B machines, completed the BLAST execution in 18 minutes, while Configuration 4, which consists of four Class A machines and four Class B machines, completed the BLAST execution in 14 minutes. The final test of the distributed BLAST application used Configuration 5, consisting of four Class A machines and six Class B machines, and completed the BLAST execution in 12 minutes. Table 5 below shows the average run times for each configuration used in experiment 1.


Configuration    Class A Machines  Class B Machines  Average Time in Hours
Local 1          1                 0                 1.61
Local 2          0                 1                 2.007
Configuration 1  2                 0                 0.867
Configuration 2  4                 0                 0.469
Configuration 3  4                 2                 0.307
Configuration 4  4                 4                 0.241
Configuration 5  4                 6                 0.205

Table 5: Results from Experiment #1

In Figure 6 and Figure 7, shown below, the amount of time taken to complete a BLASTN execution under the varying number of nodes is depicted in a line graph and a bar graph respectively. The line/bar labeled ‘Linear Time for Class A’ shows the number of hours the Class A machines should have taken if their local execution time scaled linearly. The line/bar labeled ‘Linear Time for Class B’ shows the number of hours the Class B machines should have taken if their local execution scaled linearly. These figures show that Configurations 1 and 2, using two and four Class A machines respectively, achieved near-linear time but did spend some additional time performing the communication and correction stages, causing them to run slightly slower than linear. Configurations 3, 4 and 5 show similar results, as these configurations use four Class A machines with two, four, and six Class B machines respectively. Because the Class B machines are slower and contain less memory, we would expect them to have a negative impact on the time taken to complete a BLAST execution. The addition of the Class B machines causes the time to skew further away from the ‘Linear Time for Class A’ line and move closer to the ‘Linear Time for Class B’ line as more Class B machines were used to perform the BLAST execution.

[Line graph: time taken to complete BLAST (hours) versus the number of nodes performing BLASTN, plotting Distributed BLAST Time, Linear Time for Class A, and Linear Time for Class B.]

Figure 6: Experiment #1 Execution Times - Line Graph


[Bar graph: time taken to complete BLAST (hours) for the local run and for 2, 4, 6, 8, and 10 nodes performing BLASTN, comparing Distributed BLAST Time, Linear Time for Class A, and Linear Time for Class B.]

Figure 7: Experiment #1 Execution Times - Bar Graph

4.2.2 Comparing Local and Distributed BLAST on a small database using TBLASTX

The second experiment was conducted in order to verify the distributed BLAST application’s ability to reach near-linear performance increases when executing the 200,000 sequence query set against the 200,000 sequence database when using TBLASTX. This experiment required that the BLAST first be run three times locally on a Class A machine and three times locally on a Class B machine. These values give us the baseline on which we base our estimates of what constitutes a linear increase in speed. The BLAST execution run on the Class A machine returned a result after 69 hours and 15 minutes on average across the three executions. The same BLAST execution run on the Class B machine returned a result after 91 hours and 13 minutes on average across the three executions. Next the distributed BLAST application was used to execute the TBLASTX application under Configuration 1, which consists of two Class A machines. This test returned a completed BLAST result in 27 hours and 43 minutes. Configuration 2, consisting of four Class A machines, returned a completed BLAST result in 12 hours and 58 minutes. Configuration 3, which consists of four Class A machines and two Class B machines, completed the BLAST execution in 10 hours and 12 minutes, while Configuration 4, which consists of four Class A machines and four Class B machines, completed the BLAST execution in 7 hours and 30 minutes. The final test of the distributed BLAST application used Configuration 5, consisting of four Class A machines and six Class B machines, and completed the BLAST execution in 5 hours and 21 minutes. Table 6 below shows the average run times for each configuration used in experiment 2.

Configuration    Class A Machines  Class B Machines  Average Time in Hours
Local 1          1                 0                 69.25
Local 2          0                 1                 91.21
Configuration 1  2                 0                 27.717
Configuration 2  4                 0                 12.967
Configuration 3  4                 2                 10.2
Configuration 4  4                 4                 7.5
Configuration 5  4                 6                 5.35

Table 6: Results from Experiment #2


In Figure 8 and Figure 9, shown below, the amount of time each TBLASTX execution took under the varying number of nodes is depicted in a line graph and a bar graph respectively. The line/bar labeled ‘Linear Time for Class A’ shows the number of hours the Class A machines should have taken if their local execution time scaled linearly. The line/bar labeled ‘Linear Time for Class B’ shows the number of hours the Class B machines should have taken if their local execution scaled linearly. These figures show that Configurations 1 and 2, using two and four Class A machines respectively, achieved a super-linear increase in speed, which allowed them to complete the task faster than the linear projection for the Class A machines. Configurations 3, 4 and 5 show similar results, as these configurations use four Class A machines with two, four, and six Class B machines respectively. Because the Class B machines are slower and contain less memory, we would expect them to have a negative impact on the time taken to complete a BLAST execution. The addition of the Class B machines causes the execution time to slide closer to the time represented by the ‘Linear Time for Class A’ line; however, the increase in speed using this method for TBLASTX still allows it to maintain a super-linear increase in speed, even when compared against the ‘Linear Time for Class A’ line. It appears, though, that the addition of two or four more Class B machines would have caused the execution time to fall behind the linear time for the Class A machines, while still achieving a super-linear increase overall.

Achieving super-linear increases in speed occurs because of the way the NCBI BLAST algorithm handles TBLASTX executions. TBLASTX requires that both the query sequences and the database sequences be translated from one nucleotide sequence into three protein sequences each. This causes the number of query sequences to increase three fold and the number of database sequences to also increase three fold. The TBLASTX algorithm requires that the database sequences be translated for every query sequence, requiring three translations for every nucleotide sequence. This causes the algorithm execution time to increase at a greater than linear rate for every query sequence, whereas in the BLASTN algorithm changes to the number of query sequences cause a linear increase or decrease in execution time. By reducing the number of sequences run on each node, we are able to achieve a super-linear increase in speed because we reduce the time each node must spend translating the database from nucleotide sequences to protein sequences.


[Line graph: time taken to complete BLAST (hours) versus the number of nodes performing TBLASTX, plotting Distributed BLAST Time, Linear Time for Class A, and Linear Time for Class B.]

Figure 8: Experiment #2 Execution Times - Line Graph


[Bar graph: time taken to complete BLAST (hours) for the local run and for 2, 4, 6, 8, and 10 nodes performing TBLASTX, comparing Distributed BLAST Time, Linear Time for Class A, and Linear Time for Class B.]

Figure 9: Experiment #2 Execution Times - Bar Graph

4.3 Distributed BLAST versus local BLAST on a large database

The final experiment was designed to examine the distributed BLAST application’s ability to perform distributed BLAST operations using the database fragmentation method. This method is designed to execute a query set against a database that is too large to fit into memory. Because the BLAST algorithm looks at one sequence at a time and compares it to each member of the database, the system will only run smoothly if all the database sequences fit into memory so that the computer can perform all of the operations in memory. If the database does not fit, the computer must constantly waste time moving sequences back and forth between memory and the secondary storage device, i.e. the hard drive. This added time is what causes large database executions to take many times longer to complete than executions on small databases, causing the time taken to increase much faster than on a linear scale.

By splitting the database into fragments that are small enough to fit into the memory of a machine, the time spent searching the database is reduced at more than a linear rate. For example, if a database that was too large to fit into memory was cut into two equal pieces, each piece capable of fitting in memory, then the time it would take to run a BLAST across each piece individually would be less than half the time it would have taken to run the BLAST against the single database. As such, if we send one fragment of the database to one machine and the other fragment to a second machine, we would retrieve the results from each in less than half the time it would have taken to run the BLAST locally. In other words, the time, TF, it takes to run the BLAST against a single fragment is less than the time, TL, it took to run the BLAST locally divided by the number of fragments, f, as shown in the equation below:

TF < TL / f

The final experiment was run locally three times on Class B machines and was not run on Class A machines. Because the Class A machines have 16 GB of RAM, they are capable of holding the entirety of the NT database in RAM without issue. Because this test is to show that super-linearity can be attained using the distributed BLAST application, I chose to remove the Class A machines from the experiment as they would never attain super-linear speed ups and would, instead, only be capable of attaining linear or near-linear increases in speed. When run locally, the Class B machines completed the BLASTN execution in 15 hours and 36 minutes on average.

As discussed in section 2.4.2 Sequence Database Segmentation, the database fragmentation method splits the database into smaller fragments and then executes the entire sequence query file against each fragment. The NT database was downloaded already split into ten roughly equal-size partitions. The third experiment was run using the distributed BLAST application by having each Class B node perform the BLASTN execution on 2 segments of the NT database at the same time. This caused each machine to only deal with one-fifth of the NT database at a time, allowing each node’s share of the database to fit into memory. As such, the third experiment was run using the distributed BLAST application only three times, each time using five Class B nodes to perform the work. These Class B machines were able to achieve super-linear speed ups by returning a corrected BLASTN result in only 24 minutes, or approximately 1/39 of the time it took to run the BLAST locally. The results of this experiment are shown below in Table 7.

Configuration      Class B Machines  Average Time in Hours
Local 1 B          1                 15.6
Configuration 1 B  5                 0.4

Table 7: Results from Experiment #3


In Figure 10 and Figure 11, the amount of time each BLASTN execution took using the database fragmentation method is depicted in a line graph and a bar graph respectively. The line/bar labeled ‘Linear Time for Class B’ shows the number of hours the Class B machines should have taken if their local execution scaled linearly. Because the experiment was only run under two configurations, the local BLASTN execution and the distributed BLASTN execution across five Class B nodes, the figures only show these two data points. As was expected, performing the BLASTN execution across five nodes, each holding one-fifth of the NT database, allowed the program to achieve a super-linear speed up, which is depicted in both figures.

[Line graph: time taken to complete BLAST (hours) versus the number of nodes performing BLASTN, plotting Distributed BLAST Time and Linear Time for Class B.]

Figure 10: Experiment #3 Execution Times - Line Graph


[Bar graph: time taken to complete BLAST (hours) for the local run and for 5 nodes performing BLASTN, comparing Distributed BLAST Time and Linear Time for Class B.]

Figure 11: Experiment #3 Execution Times - Bar Graph


Chapter 5 Conclusions

This chapter will discuss the conclusions of the study as well as suggest future research and work that could be done in the field of distributed application development and statistical correction.

5.1 Results and Conclusions

The purpose of this study was to develop a framework capable of automatically building a distributed system that conforms to the standards set forth for all distributed algorithms and the distributed systems they are executed upon. As discussed in section 1.2 Problem Statement the developed system needed to be capable of running on heterogeneous machines, running securely, scaling up or down as needed, using proper fault tolerance/recovery to keep the system stable, running processes concurrently, and allowing for transparency.

During the course of this study we devised a communication method containing only three messages that allowed for the automatic connection and scaling of a distributed system. This message passing system then formed the core of our distributed system by allowing the heterogeneous nodes to communicate with one another using only messages. By implementing only three messages the system stays simple enough to easily secure and protect against faults while staying robust enough to allow multiple nodes to synchronize their operations using only message passing.

Once the communication method was developed, an application framework was constructed to allow for secure execution of distributed applications in a distributed heterogeneous environment that is created and maintained using the communication methods. By implementing the distributed application framework as small, self-contained methods, the framework grants future developers access to the methods required to perform concurrent operations on the remote nodes and to execute bioinformatics tools across the framework without altering their source code. Each of these methods was developed to allow only secure interactions between nodes as well as to protect the system from faults by tolerating minor faults and recovering from larger faults.

Lastly, a distributed BLAST application was constructed using the distributed communication framework to demonstrate the capability of the communication framework. Construction of this distributed BLAST application was accomplished by developing two applications, a master node application and a worker node application. These applications work in tandem to perform BLAST executions in parallel across the distributed environment, either by executing fragments of the query file on each node and then combining the final results together or by executing the entire query file on each node against a fragment of the database and then merging all the result files together. In the process of developing this second method, a method for quickly determining the correct statistical expect value for each final sequence was developed that allows the final result file to contain correct expect values.

The distributed BLAST application was expected to achieve linear speed up when the query files were segmented and run across the distributed system. In the case of BLASTN this did occur, with the system executing the BLASTN application across the distributed system and achieving a near-linear speed up on each test. However, TBLASTX was able to achieve super-linear speed up due to the way the BLAST algorithm handles the translation of query and database files. Because the BLAST algorithm translates each query sequence and then runs it against each translated member of the database, the number of execution passes needed to scan each translated sequence increases at a rate that is greater than linear. This allows the TBLASTX algorithm to achieve a super-linear speed up in certain cases when run in a distributed environment, as was noted in experiment two of this study.

The distributed BLAST application was expected to achieve a super-linear speed up when the database files were fragmented and the query sequences were run across these smaller databases. The database fragmentation method can only achieve super-linear increases in speed when the database is too large to fit into memory. As such, the more powerful machines that the experiment would have run on would have been capable of holding the entire database in memory without any issue and as such would have only achieved a near-linear increase in speed. However, the weaker machines that the experiment was run on, marked as Class B machines, did not contain enough memory to store the entire database. As such, the local execution of BLASTN against the NT database required over 15 hours to complete while the distributed BLASTN execution using only five Class B machines accomplished the task in under 25 minutes.

5.2 Future Work

Currently the distributed application framework is only capable of providing future developers with the methods required to construct distributed bioinformatics tools such as the distributed BLAST application developed in this study. Future work in the field of distributed bioinformatics applications should expand upon this framework to construct a general distributed bioinformatics application. This application should take as input a set of configuration files that describe how the master and worker nodes should behave and what application is being distributed. The application would then pass a copy of the executable to each remote node, along with the data to be processed, and perform the distributed executions until a result can be returned. A system of this nature would greatly benefit the field of bioinformatics as well as the field of distributed computing.


References

3tera. Cloud Computing For The Enterprise. 3tera. 2010. http://www.3tera.com/Cloud-computing/ (accessed April 9, 2010).

Altschul, Stephen F, and Warren Gish. "Local Alignment Statistics." Methods Enzymol. 266 (1996): 460-480.

Altschul, Stephen F., Warren Gish, Miller Webb, Eugene W. Myers, and David J. Lipman. "Basic local alignment search tool." Journal of Molecular Biology 215, no. 3 (October 1990): 403-410.

Amazon. Amazon Simple Storage Service (Amazon S3). Amazon. 2010. http://aws.amazon.com/s3/ (accessed April 9, 2010).

ANSA Project. ANSA Reference Manual. Cambridge: ANSA Project, 1987.

Benson, Dennis A., Ilene Karsh-Mizrachi, David J. Lipman, James Ostell, and David L. Wheeler. "GenBank." Nucleic Acids Research, 2007: D21-D25.

Carvalho, Paulo C., Rafael V. Glória, Antonio B. de Miranda, and Wim M. Degrave. "Squid - a simple bioinformatics grid." BMC Bioinformatics, 2005.

Chao, Kun-Mao, and Louxin Zhang. Sequence Comparison: Theory and Methods. London: Springer-Verlag London Limited, 2009.

Condor Team. Condor BLAST. April 15, 2004. http://www.cs.wisc.edu/condor/tools/BLAST/ (accessed May 19, 2010).

Costa, Rogério Luís de Carvalho, and Sérgio Lifschitz. "Database Allocation Strategies for Parallel BLAST Evaluation on Clusters." Distributed and Parallel Databases, 2003: 99-127.

Coulouris, George, Jean Dollimore, and Tim Kindberg. Distributed Systems: Concepts and Design (4th Edition). Pearson Education Limited, 2005.

Cristianini, Nello, and Matthew W. Hahn. Introduction to Computational Genomics: A case studies approach. Cambridge: Cambridge University Press, 2007.

Darling, Aaron E., Lucas Carey, and Wu-chun Feng. "The Design, Implementation, and Evaluation of mpiBLAST." 4th International Conference on Linux Clusters: The HPC Revolution 2003 in conjunction with ClusterWorld Conference & Expo. 2003.

Deering, Stephen E., and David R. Cheriton. "Multicast routing in datagram internetworks and extended LANs." ACM Transactions on Computer Systems (TOCS) 8, no. 2 (May 1990): 85-110.

Dowd, Scot E., Joaquin Zaragoza, Javier R. Rodriguez, Melvin J. Oliver, and Paxton R. Payton. "Windows .NET Network Distributed Basic Local Alignment Search Toolkit (W.ND-BLAST)." BMC Bioinformatics, 2005.

Durbin, Richard, Sean Eddy, Anders Krogh, and Graeme Mitchison. Biological sequence analysis. 12. Cambridge: Cambridge University Press, 2007.

Google. What is Google App Engine? Google. 2010. http://code.google.com/appengine/docs/whatisgoogleappengine.html (accessed April 9, 2010).

Grant, J. D., R. L. Dunbrack, F. J. Manion, and M. F. Ochs. "BeoBLAST: distributed BLAST and PSI-BLAST on a Beowulf cluster." Bioinformatics Applications Notice, 2002: 765-766.

Haubold, Bernhard, and Thomas Wiehe. Introduction to Computational Biology: An Evolutionary Approach. Basel: Birkhäuser Verlag, 2006.

ISO/IEC. "Information Technology - Open Distributed Processing - Reference Model: Architecture." Geneva: ISO/IEC, September 15, 1996.

Jones, Neil C., and Pavel A. Pevzner. An Introduction to Bioinformatics Algorithms. London: MIT Press, 2004.

Karlin, Samuel, and Stephen F Altschul. "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes." Proceedings of the National Academy of Sciences 87 (March 1990): 2264-2268.

Khanna, Raman, ed. Distributed Computing: Implementation and Management Strategies. Englewood Cliffs, NJ: Prentice Hall PTR, 1994.

Korf, Ian, Ian Yandell, and Joseph Bedell. BLAST: An Essential Guide to the Basic Local Alignment Search Tool. Sebastopol, CA: O'Reilly & Associates, 2003.

Lagnel, Jacques, Costas S. Tsigenopoulos, and Ioannis Iliopoulos. "NOBLAST and JAMBLAST: New Options for BLAST and a Java Application Manager for BLAST results." Bioinformatics 25, no. 6 (2009): 824-826.

Lazakidou, Athina. Biocomputational and Biomedical Informatics. Hershey: Medical Information Science Reference, 2010.

Lin, Hershan, Xiaosong Ma, Praveen Chandramohan, Al Geist, and Nagiza Samatova. "Efficient Data Access for Parallel BLAST." Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01. IEEE Computer Society, 2005. 72.2.

Markel, Scott, and Darryl León. Sequence Analysis In a Nutshell: A Guide to Common Tools and Databases. Sebastopol, CA: O'Reilly & Associates, 2003.

Microsoft. Windows Azure Platform. Microsoft. 2010. http://www.microsoft.com/windowsazure/windowsazure/ (accessed April 9, 2010).

Mount, David W. Bioinformatics: Sequence and Genome Analysis. 2nd. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press, 2004.

Nair, Achuthsankar S. "Computational Biology & Bioinformatics: A Gentle Overview." Communications of the Computer Society of India, January 2007.

Orengo, C. A., D. T. Jones, and J. M. Thornton. Bioinformatics: Genes, Proteins & Computers. Oxford: BIOS Scientific Publishers Limited, 2003.

Orlov, Y. L., and V. N. Potapov. "Complexity: an internet resource for analysis of DNA sequence complexity." Nucleic Acids Research, 2004: W628-W633.

Palankar, Mayur R., Adriana Iamnitchi, Matei Ripeanu, and Simson Garfinkel. "Amazon S3 for science grids: a viable solution?" Proceedings of the 2008 International Workshop on Data-Aware Distributed Computing, 2008: 55-64.

Pevsner, Jonathan. Bioinformatics and Functional Genomics. Hoboken, NJ: John Wiley & Sons, 2003.

Rey, Sébastien, Jennifer L Gardy, and Fiona SL Brinkman. "Assessing the precision of high-throughput computational and laboratory approaches for the genome-wide identification of protein subcellular localization in bacteria." BMC Genomics, 2005.

Rosenberg, Michael S., ed. Sequence Alignment: Methods, Models, Concepts, and Strategies. Berkeley, CA: University of California Press, 2009.

Smith, Temple F, and Michael S Waterman. "Identification of Common Molecular Subsequences." Journal of Molecular Biology, no. 147 (1981): 195-197.

Talbi, El-Ghazali, and Albert Y. Zomaya, eds. Grid Computing for Bioinformatics and Computational Biology. Hoboken, NJ: John Wiley & Sons, 2008.

Troyanskaya, Olga G., Kara Dolinksi, Art B. Owen, Russ B. Altman, and David Botstein. "A Bayesian Framework for combining heterogenous data sources for gene function prediction (in Saccharomyces cerevisiae)." Proceedings of the National Academy of Sciences 100, no. 14 (July 2003): 8348-8353.

Wang, Jiren, and Qing Mu. "Soap-HT-BLAST: high throughput BLAST based on Web services." Bioinformatics Applications Note, 2003: 1863-1864.

Zomaya, Albert Y, ed. Parallel Computing for Bioinformatics and Computational Biology. Hoboken, NJ: John Wiley & Sons, 2006.