Parallelization Methods for the Distribution of High-Throughput Bioinformatics Algorithms
by
Eric James Rees, B.S.
A Dissertation
In
COMPUTER SCIENCE
Submitted to the Graduate Faculty of Texas Tech University in Partial Fulfillment of the Requirements for the Degree of
DOCTOR OF PHILOSOPHY
Submitted to:
Dr. Eunseog Youn
Dr. Scot Dowd
Dr. Michael San Francisco
Dr. Peggy Gordon Miller
Dean of the Graduate School
May, 2011
Copyright 2011, Eric Rees
Texas Tech University, Eric James Rees, May 2011
Acknowledgements
Dr. Eunseog Youn and Dr. Scot Dowd have both been excellent advisors. Dr. Youn has been with me since the beginning of my time as a Bioinformaticist and has given me the guidance needed to ensure I was able to solve the problems I encountered.
Dr. Dowd has always been the one to keep me focused and on task while helping me see solutions to problems by pointing out new ways to view the problem. Both have forced me to think in ways I had never been forced to think before and without their guidance I may have never reached this point.
I thank my family for everything they have done for me throughout the years. My parents,
Bill and Linda, have always supported my decisions and have always pushed me to excel at everything I try. I could not have done it without the great support of my family.
I also thank my friends from Texas Tech whom I have worked with in some regard or another for
five years: Eric Garcia, Brad Nemanich, and Viktoria Gonthcharova. Each of them has been there for me when I needed a break from work and research or when I needed help solving an equation.
My friends outside of Texas Tech who helped me relax when things became stressful, including Morgan Cadena, Joanna Burk, Shawna Miller, and Clint Miller.
Lastly, I would like to thank my amazing girlfriend Teresa, whose loving support and ability to know exactly what to do to keep me going is what made this achievement possible.
Table of Contents
Acknowledgements ...... ii
Table of Contents ...... iii
Abstract ...... vi
List of Tables ...... viii
List of Figures ...... ix
Chapter 1 Introduction ...... 1
1.1 Motivation ...... 1
1.2 Problem Statement ...... 4
1.3 Overview of the Dissertation ...... 8
Chapter 2 Related Work...... 9
2.1 Bioinformatics...... 9
2.2 Distributed Computing...... 11
2.3 BLAST ...... 17
2.3.1 BLASTN and MegaBLAST ...... 24
2.3.2 BLASTP ...... 25
2.3.3 BLASTX ...... 25
2.3.4 TBLASTN...... 26
2.3.5 TBLASTX...... 27
2.4 Distributed BLAST ...... 27
2.4.1 Query Set Segmentation ...... 28
2.4.2 Sequence Database Segmentation...... 30
2.4.3 E-Value Calculation ...... 32
2.5 Existing Distributed BLAST Applications ...... 36
Chapter 3 Approach ...... 43
3.1 Creating a Distributed System from Existing Nodes ...... 43
3.1.1 Algorithm for creating a Distributed System from Existing Nodes ...... 44
3.1.2 Methods for creating a Distributed Application Framework ...... 46
3.1.3 Meeting the definition of a distributed system ...... 53
3.1.4 Meeting the challenges of a distributed system ...... 55
3.2 Algorithm behind the Distributed BLAST Master Node ...... 62
3.2.1 Node Discovery and Connection Establishment...... 63
3.2.2 User Interface ...... 64
3.2.3 Query Segmentation...... 65
3.2.4 Database Segmentation ...... 66
3.2.5 Database and Query Transfer...... 67
3.2.6 Results Compilation ...... 69
3.3 Algorithm behind distributed BLAST Worker Nodes ...... 70
3.3.1 Master Discovery and Connection Establishment ...... 71
3.3.2 Database and Query Transfer...... 72
3.3.3 BLAST ...... 73
3.3.4 Result Correction ...... 74
3.3.5 Result Transfer ...... 78
Chapter 4 Results ...... 79
4.1 Experiment Setup and Environment ...... 79
4.2 Distributed BLAST versus local BLAST on a small database ...... 82
4.2.1 Comparing Local and Distributed BLAST on a small database using BLASTN ...... 84
4.2.2 Comparing Local and Distributed BLAST on a small database using TBLASTX...... 87
Texas Tech University, Eric James Rees, May 2011
4.3 Distributed BLAST versus local BLAST on a large database ...... 92
Chapter 5 Conclusions ...... 97
5.1 Results and Conclusions ...... 97
5.2 Future Work ...... 100
References ...... 101
Abstract
The development of high-throughput bioinformatics technologies has caused a massive influx of biological data over the course of the past decade. During this same span of time, computational hardware has also been rapidly increasing in speed while decreasing in price, multi-core processors have become standard in home and office environments, and distributed and cloud based computing has become affordable and readily available to researchers with implementations such as Amazon’s S3, Microsoft’s Azure, Google’s App Engine, and the 3Tera Cloud.
Bioinformatics software tools such as BLAST, a tool for finding local alignments between a set of unknown genetic sequences and a set of known genetic sequences, often have simple interfaces and few installation requirements, so biologists can use them easily in the laboratory without needing an in-depth knowledge of how computer systems work. This, however, is rarely the case for distributed implementations of bioinformatics tools, which often require the user to first set up and configure the underlying program that will handle the distribution, such as the Message Passing Interface (MPI) or Remote Procedure Calls (RPC). Once the underlying distribution algorithm is chosen, many of the software tools require the user to then configure the program to work with their chosen method and, in some cases, write the necessary source code to link the program with the underlying service. These are difficult steps for most computer scientists and are nearly impossible for the average biologist.
By constructing a modularized set of methods that can connect to, broadcast to, and read from a multicast created by the methods, future bioinformatics software developers will be able to construct the underlying message passing system without requiring the end-user, often a biologist, to set up and configure one of their own.
Using these multicast methods will allow any program to seek out and track any nodes on the network that will be used in the distributed system. This communication method allows the program to easily scale up and down depending on available nodes, without direct user intervention to alter the size of the system.
This system is then tested by creating a program that connects NCBI's Basic Local Alignment Search Tool to the multicast system, allowing the BLAST algorithm to be distributed across multiple nodes. This new system will demonstrate how future programs could connect stand-alone tools, such as BLAST, to the multicast system to create programs that execute on a distributed system and automatically scale depending on the network size, without altering the tool's source code.
List of Tables
Table 1: Distributed BLAST Implementations - Fault tolerance comparisons ...... 3
Table 2: Distributed BLAST Implementations - Speed up comparisons ...... 4
Table 3: Brief description regarding the data used and returned by each BLAST program ...... 24
Table 4: Brief Description of Machine Classes ...... 81
Table 5: Results from Experiment #1 ...... 85
Table 6: Results from Experiment #2 ...... 88
Table 7: Results from Experiment #3 ...... 94
List of Figures
Figure 1: A simplified view of the BLAST algorithm ...... 23
Figure 2: Diagram of Query Set Segmentation ...... 29
Figure 3: Diagram of Database Segmentation ...... 31
Figure 4: Distributed System Layout and Interactions ...... 62
Figure 5: Overview of the BLAST Output Layout ...... 76
Figure 6: Experiment #1 Execution Times - Line Graph ...... 86
Figure 7: Experiment #1 Execution Times - Bar Graph ...... 87
Figure 8: Experiment #2 Execution Times - Line Graph ...... 91
Figure 9: Experiment #2 Execution Times - Bar Graph ...... 92
Figure 10: Experiment #3 Execution Times - Line Graph ...... 95
Figure 11: Experiment #3 Execution Times - Bar Graph ...... 96
Chapter 1 Introduction
1.1 Motivation
Bioinformatics is a rapidly expanding interdisciplinary field that applies computer science to answer the questions of biology (Nair 2007). Due to the development of numerous high-throughput technologies, the amount of biological data is increasing rapidly (Troyanskaya, et al. 2003). To meet this massive influx of data, computational hardware has also been rapidly increasing in speed while decreasing in price. Multi-core processors are now becoming the standard in both home and office environments and, concurrently, distributed computers and cloud-based computing have also become readily available to researchers with implementations such as Amazon's S3 (Amazon 2010) (Palankar, et al. 2008), Microsoft's Azure
(Microsoft 2010), Google’s App Engine (Google 2010), and the 3Tera Cloud (3tera
2010). However, despite these recent trends towards distributed computing and multiprocessor parallelization, few bioinformatics algorithms have been implemented to make use of this additional computational power.
The tools used in bioinformatics are developed by computer scientists for use by biologists. Tools such as BLAST, a tool for finding local alignments between a set of unknown genetic sequences and a set of known genetic sequences, have a simple command-line interface that does not require users to have a deep understanding of computers or computer science. These tools are often executed by
simply downloading the program and running it with little to no set up. Few, if any, of the major bioinformatics tools have requirements beyond which operating systems they will and will not run on, thus keeping the entire process simple for users. However, this is rarely the case for distributed implementations of bioinformatics tools. Distributed implementations of these tools, including BLAST, often require the user to first set up the underlying program that will handle the distribution, such as the Message Passing Interface (MPI). This additional step can be rather complex even for a computer specialist, much less the average biologist.
Once the underlying distribution program has been set up, many distributed implementations will still require that the users then set up the program to work with the options chosen during the creation of the MPI or RPC environment, adding yet another step that can be quite complex to biologists and other non-technical users.
Distributed implementations of bioinformatics tools often suffer from a major issue that could easily be corrected but is often completely ignored by programmers: a core principle of distributed systems is that they must be able to tolerate system failures and faults, so distributed bioinformatics algorithms must account for faults that may occur within the distributed system and handle them accordingly. The problem, however, is that most distributed algorithms, including many implementations of the BLAST algorithm, are unable to handle many common faults that occur during their
execution, such as: the master node or worker nodes losing connection to the network, the system encountering a race condition on a worker node, the system encountering a race condition on the master node, the system encountering a deadlock on a worker node, or the system encountering a deadlock on the master node.
The NCBI bioinformatics tool BLAST, the Basic Local Alignment Search Tool, is one of the most widely used bioinformatics tools in the field. For this reason, this dissertation shall focus on this tool's algorithm in order to provide examples of how the distributed algorithm would interact with a bioinformatics tool, as well as to show an example of our distributed algorithm being used in a real-world tool. A survey of existing distributed BLAST implementations reveals that most of these applications have little, if any, ability to perform fault tolerance or recovery.
The results of this survey can be found below in Table 1.
Distributed Implementation   Fault Tolerance and Recovery   Reference
BeoBLAST                     Medium                         (Grant, et al. 2002)
Condor BLAST                 Medium                         (Condor Team 2004)
mpiBLAST                     Low                            (Darling, Carey and Feng 2003)
Soap-HT-BLAST                Medium                         (Wang and Mu 2003)
Squid                        Medium                         (Carvalho, et al. 2005)
W.ND-BLAST                   Medium                         (Dowd, et al. 2005)
Table 1: Distributed BLAST Implementations - Fault tolerance comparisons
Distributed BLAST implementations should be able to achieve linear speed up in most cases and super linear speed up in BLAST runs involving databases that are too large to fit in memory. While achieving linear speed up is not an incredibly difficult task, achieving super linear speed up requires some additional work that most distributed BLAST applications have failed to implement. A survey of existing distributed BLAST implementations reveals that most of these applications have not yet implemented the code required to achieve super linear speed up. The results of this survey can be found below in Table 2.
Distributed Implementation   Achieves super linear speed up   Reference
BeoBLAST                     No                               (Grant, et al. 2002)
Condor BLAST                 No                               (Condor Team 2004)
mpiBLAST                     Yes                              (Darling, Carey and Feng 2003)
Soap-HT-BLAST                No                               (Wang and Mu 2003)
Squid                        No                               (Carvalho, et al. 2005)
W.ND-BLAST                   No                               (Dowd, et al. 2005)
Table 2: Distributed BLAST Implementations - Speed up comparisons
Currently, no distributed BLAST implementation offers a high, or even a medium, degree of fault tolerance and recovery while also achieving super linear speed up. However, the distributed implementation of the BLAST algorithm constructed during this research has the ability to detect, tolerate, and recover from faults while still achieving super linear speed up on large databases.
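The speed up terminology used above can be made concrete with a small calculation. The timings below are hypothetical numbers chosen for illustration, not measured results:

```python
def speedup(serial_time, parallel_time):
    """Classic speedup ratio: time on one node divided by time on n nodes."""
    return serial_time / parallel_time

# Hypothetical timings for illustration only (not measured results).
n_nodes = 4
t_serial = 400.0    # seconds on one node, with the database thrashing in and out of memory
t_parallel = 80.0   # seconds per node once each node's database slice fits in RAM

s = speedup(t_serial, t_parallel)
assert s > n_nodes  # s == 5.0 here: super linear, because segmenting the
                    # database lets each fragment fit in memory, eliminating
                    # the disk I/O that dominated the serial run
```

Linear speed up would give s == n_nodes exactly; the super linear case arises only when segmentation changes the memory behavior of the workload, which is why it is limited to databases too large for a single node's RAM.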
1.2 Problem Statement
Distributed bioinformatics applications must conform to the standards set forth for all distributed algorithms, discussed in depth in section 2.2, Distributed Computing. These standards require that distributed algorithms be able to run on heterogeneous machines, run securely, have the ability to scale, use proper fault tolerance/recovery, be concurrent, and allow for transparency. However, an implied but not officially stated standard requires that a distributed algorithm actually be capable of performing the same task as a non-distributed algorithm at faster speeds. If a distributed algorithm cannot accomplish this, then the algorithm should not be distributed at all. Expanding on this idea, distributed algorithms should attempt to minimize processing time without introducing errors into the final output.
As such, our distributed algorithm and our distributed BLAST application must go beyond the work of their predecessors by meeting all of the requirements stated above. To accomplish this, the following steps must be taken:
1. Devise a communication method capable of automatically scaling the size of a
distributed system.
2. Develop an application framework that will allow for various bioinformatics
tools to securely execute in a distributed heterogeneous environment. This
framework will provide fault tolerance within the heterogeneous
environment while also making use of the aforementioned communication
method. This application framework should be capable of:
a. treating bioinformatics tools as extensions of the framework without
requiring any alterations be made to the tool’s source code.
b. tolerating and effectively recovering from faults,
c. allowing for concurrent operations to be performed, and
d. manipulating data in a secure manner.
3. Create a distributed BLAST application using the application framework. This
application should be capable of:
a. acting as an add-on to the BLAST executable by running the BLAST
executable supplied by NCBI without any alterations made to the
BLAST source code,
b. maximizing the speed up gained to near linear time in most cases and
super linear time in all cases where such time could be achieved.
In order to accomplish the first goal, I will construct a modularized set of methods that can connect to, broadcast to, and read from a multicast created by the methods.
This multicast will provide any program using the methods the ability to seek out and track any nodes on the network that will be used in the distributed system. This communication method will allow the program to easily scale up and down depending on available nodes without direct user intervention to alter the size of the system. The details of this communication method are explained further in section 3.1 Creating a Distributed System from Existing Nodes.
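A minimal sketch of such a multicast discovery mechanism, assuming Python's standard socket API; the group address, port, and message format here are illustrative choices, not the dissertation's actual implementation:

```python
import socket
import struct

# Hypothetical multicast group and port, chosen purely for illustration.
MCAST_GROUP = "239.255.42.42"
MCAST_PORT = 50000

def make_announcer():
    """UDP socket a worker node uses to broadcast its availability to the group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 2)
    return sock

def announce(sock, node_name):
    """Broadcast a one-line presence message that any listening master can read."""
    sock.sendto(node_name.encode("utf-8"), (MCAST_GROUP, MCAST_PORT))

def make_listener():
    """UDP socket a master node uses to read announcements from the group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", MCAST_PORT))
    # Join the multicast group on the default interface.
    mreq = struct.pack("4sl", socket.inet_aton(MCAST_GROUP), socket.INADDR_ANY)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return sock
```

In this sketch the master would poll the listener, record each node name it hears, and prune names that stop announcing, which is exactly the automatic scale up and scale down described above.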
The second goal can be accomplished by constructing a pair of algorithms capable of performing distributed computations using the communication method described above. The first algorithm will establish a master node capable of communicating with remote worker nodes, collecting information from remote nodes, monitoring the status of remote nodes, establishing file transfers to and from the remote node,
establishing data transfer between the master and remote worker nodes, spawning processes on remote nodes, stopping processes on remote nodes, deleting application specific files on remote nodes, taking input from the user, and generating output for the user in the form of log files and status updates. The second algorithm will establish the worker nodes capable of communicating with the master node, gathering system information for the master node, establishing file transfers to and from the master node, establishing data transfer between itself and the master node, running applications at the request of the master node, ending applications at the request of the master node, and deleting tool specific files at the request of the master node. These algorithms are described in section 3.1, Creating a Distributed System from Existing Nodes.
The third goal, creating a distributed BLAST application, can be accomplished by creating two applications, a master application and a worker application. The master application will use the master node half of the communication algorithm to establish contact with and pass commands to the various worker nodes while the worker application will use the worker node half of the communication algorithm to respond to the commands of the various master nodes. The file and data transfer methods within the communication algorithm will be used to pass database files, input files, and BLAST command line instructions to the worker node as well as provide a method for which the worker node can pass completed BLAST output files back to the master. The master and worker node application algorithms are
described in section 3.2, Algorithm behind the Distributed BLAST Master Node, and section 3.3, Algorithm behind distributed BLAST Worker Nodes, respectively.
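Two of the responsibilities described above, segmenting the query set and issuing unmodified BLAST command lines, can be sketched as follows. The flags assume the NCBI BLAST+ command-line interface (e.g. `blastn -query -db -out`) purely for illustration, and the `split_fasta` helper is hypothetical:

```python
def split_fasta(text, n_chunks):
    """Split the records of a FASTA-format query set into n roughly equal
    chunks, one per worker node, without breaking any record apart."""
    records, current = [], []
    for line in text.splitlines():
        if line.startswith(">") and current:
            records.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        records.append("\n".join(current))
    chunks = [[] for _ in range(n_chunks)]
    for i, rec in enumerate(records):
        chunks[i % n_chunks].append(rec)  # round-robin assignment to workers
    return ["\n".join(c) for c in chunks if c]

def blast_command(program, query_file, db, out_file):
    """Command line a worker would run against its query chunk; note that the
    BLAST executable itself is invoked as-is, with no source changes."""
    return [program, "-query", query_file, "-db", db, "-out", out_file]
```

Splitting on record boundaries matters: a FASTA record cut in half would be an invalid query, so the unit of distribution is the sequence, not the byte.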
1.3 Overview of the Dissertation
This dissertation consists of five chapters. Chapter 2 details related work in the fields of bioinformatics and distributed computation, with an emphasis on work relating to other distributed BLAST implementations and methods. Chapter 3 describes the multicast communication method used to create and maintain a distributed heterogeneous environment, as well as the distributed application framework developed to allow high-throughput bioinformatics programs to run across this distributed system. Chapter 3 also discusses the distributed algorithm implemented to execute distributed BLAST runs across the distributed system created by the communication method. Chapter 4 describes the results attained by executing the algorithms detailed in Chapter 3, while Chapter 5 discusses the results attained in Chapter 4 and gives ideas and plans for future work in the area.
Chapter 2 Related Work
2.1 Bioinformatics
In the paper "Computational Biology and Bioinformatics: A Gentle Overview" by Achuthsankar Nair, the author defines bioinformatics as follows: "Bioinformatics is the application of computer sciences and allied technologies to answer the questions of Biologists, about the mysteries of life" (Nair 2007). This short sentence accurately describes not only the application of computer science to biology, but also lays out the scope and magnitude of this research by applying it to solve the mysteries of life. The author continues through much of the paper laying out exactly what data we as bioinformaticians handle and how we go about working with the vast amount of biological data. Bioinformatics deals primarily with biological data in either text files containing large quantities of DNA, RNA, or protein sequences, or in images containing microarray data. This data can then be analyzed using a multitude of bioinformatics algorithms, such as using Hidden Markov Models (HMMs) to find genes in DNA sequences, using the Nussinov folding algorithm to predict the secondary structure of RNA sequences, or using subcellular localization algorithms to predict a protein's location within a cell based on its protein sequence.
Bioinformatics has given biologists new ways to tackle lingering problems in their field as well as more efficient methods to complete the problems that they have already solved. Bioinformatics has also allowed biologists to improve the precision of their results while decreasing their time spent in the lab. For example, in the paper "Assessing the precision of high-throughput computational and laboratory approaches for the genome-wide identification of protein subcellular localization in bacteria" by Sébastien Rey et al., the authors compared computational subcellular localization methods with laboratory proteomics approaches in an attempt to determine the most effective approach for genome-wide localization characterization and annotation (Rey, Gardy and Brinkman 2005). Their final results showed that the computational methods for subcellular localization had a higher level of precision than the high-throughput laboratory approaches.
Bioinformatics has also created the need for large sequence repositories, typically referred to as sequence databases. These databases store an immense amount of sequence data that has been collected by various projects worldwide. The most notable of these databases include the European Molecular Biology Laboratory DNA database (EMBL), the DNA Data Bank of Japan (DDBJ), the National Center for
Biotechnology Information’s genetic database (GenBank), the Swiss Institute of
Bioinformatics' Protein Sequence Database (SWISS-PROT), and the Protein 3D structure database (PDB). These databases are growing at an immense rate; Dennis A. Benson et al. state in their paper "GenBank" that the GenBank database is growing exponentially, doubling in size every 15 to 18 months (Benson, et al. 2007). Thus, GenBank alone is growing faster than Moore's Law, which states that the number of transistors that can be placed inexpensively on an integrated circuit has doubled approximately every two years. This poses a number of challenges not only to the process of mining data from these databases but also to using this data in popular programs like BLAST, a program that uses local alignment to identify unknown sequences and that shifts portions of these large databases into and out of memory during a search. In order to overcome this obstacle it has been proposed that bioinformatics algorithms be updated, or developed, so that they can begin making use of multi-core processors and distributed computers.
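The growth comparison above can be checked with a line of arithmetic: a doubling time of d months corresponds to an annual growth factor of 2^(12/d). The following is illustrative arithmetic only, not data from the dissertation:

```python
def annual_growth_factor(doubling_months):
    """Factor by which a quantity grows per year, given its doubling time in months."""
    return 2 ** (12.0 / doubling_months)

genbank_fast = annual_growth_factor(15)   # ~1.74x per year (15-month doubling)
genbank_slow = annual_growth_factor(18)   # ~1.59x per year (18-month doubling)
moores_law   = annual_growth_factor(24)   # ~1.41x per year (two-year doubling)

# Even the slower GenBank estimate outpaces Moore's law, so database size
# grows faster than single-processor capability.
assert genbank_slow > moores_law
```

This is the quantitative core of the argument for parallelization: if data outgrows any single processor's capability year over year, aggregate capacity must come from additional nodes.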
2.2 Distributed Computing
Distributed computing is a field of computer science that researches distributed systems. In the book Distributed Systems: Concepts and Design, 4th Edition by George
Coulouris et al. and the book Distributed Computing: Implementation and
Management Strategies by Raman Khanna, a distributed system is defined as a
system in which components located at networked computers communicate and
coordinate their actions only by passing messages (Coulouris, Dollimore and
Kindberg 2005) (Khanna 1994). This definition of distributed systems, however,
requires that for a system to be considered distributed it must meet the following
three characteristics:
1) The system must be concurrent, allowing each machine in the distributed
system to work in parallel or independently of the other machines in the
system.
2) The system must lack a global clock. Each component of the distributed
network may only communicate via some form of message passing. However
there are limitations on how accurately the components can synchronize
their clocks, thus forcing each component to keep track of their own local
clock, instead of referencing some global clock.
3) The system must allow for and be tolerant of independent component and
system failures. The distributed system should allow for independent
components to fail without it disrupting the capabilities of the system as a
whole.
The authors George Coulouris et al. continue on to discuss challenges that face the construction of distributed systems as well as the distributed algorithms that make use of these systems. These challenges have all been identified and met in previous research and all distributed algorithms should be able to handle these challenges.
The challenges discussed by the authors are as follows:
1) The distributed algorithm should be able to tolerate a heterogeneous
distributed system. Heterogeneous distributed systems are made up of
components that may have different hardware, make use of different
operating systems, and use different methods of connecting to the
distributed systems network. Overcoming this challenge is typically done by
implementing the algorithm using some form of middleware. The
middleware acts as an abstraction layer, allowing for the developer to handle
various system calls in a similar manner across heterogeneous systems.
2) Keeping the distributed system secure is of considerable importance. In
order for the entire system to stay secure, each individual system and each
program running on these systems must adhere to the protection of the
system’s confidentiality, integrity, and availability. A system’s confidentiality
is based entirely upon its ability to protect against the disclosure of data or
resources to unauthorized users. Upholding system integrity requires the
system to protect against the alteration and corruption of its data. Lastly, the
system’s availability is its ability to protect itself against interference in
respect to its access of various local and remote resources. Overcoming this
challenge requires developers to stay vigilant in their efforts to ensure that
their programs do not violate or allow for the violation of the system’s
confidentiality, integrity, and availability. This means ensuring that the
contents of all messages sent across the distributed system are secure,
ensuring the systems sending and receiving messages are authorized to do
so, and either backing up important data or finding ways to ensure corrupted
data can be recovered or recreated safely and efficiently.
3) Distributed systems must also allow for scalability, the ability to remain
effective even when a significant number of users and resources are added to
or removed from the system. Distributed algorithms should be able to
handle the increase, as well as the decrease, in resources and users and make
an attempt to balance resource usage across the system. Using
algorithms such as divide and conquer to split a task into equal pieces, such
that each piece of the larger task is completed by a separate component,
allows algorithms to easily scale both up and down in a dynamic distributed
system.
4) Just as would occur in a non-distributed algorithm, a distributed algorithm
must be able to handle failures. Unlike in a non-distributed system, failures in
a distributed system should follow the rules discussed before, in that the
distributed system should tolerate the loss of components without it causing
failures throughout the entire system. As such, distributed algorithms should
be able to handle these failures such that they too can tolerate these failures.
There are a number of ways to handle faults. For instance, the algorithm can
attempt to detect when faults have occurred, known as fault detection,
and when such an event occurs the algorithm will take measures to minimize
and contain the fault and either recover what it can or restart the lost process
elsewhere. Redundancy is another way of dealing with fault tolerance by
having multiple systems in the distributed system mirror the actions of
another system. Thus if a task is being completed on three systems and one
system goes down, the other two will still manage to complete the task and
report their results. Redundancy can also be as simple as having multiple
network connections between a system and the distributed system, allowing
the system to tolerate a hardware failure on one network so long as the
other remains available.
5) Distributed systems are required to be concurrent and thus distributed
algorithms running across these systems must also be concurrent.
Distributed algorithms provide resources that can be shared by multiple
users within the distributed system. Thus it becomes highly probable that at
various times, multiple users may attempt to share a resource at the same
time. Whether this resource is a file, a system resource, a database, or an
application, the distributed algorithm must not only allow the sharing to
occur, but must remain stable and keep data correctly synced during these
moments of shared usage. Resources in the system are considered safe only
when operations performed on them are synchronized in such a way that
their data remains consistent, despite shared usage. Overcoming this
problem requires following similar algorithm development standards found
in operating systems and other parallel software, such as the use of
semaphores and mutex locks.
6) Lastly, distributed systems should keep the separation of components within
the system concealed from the user and the application programmer. This
process is known as transparency and is used to ensure the distributed
computer is perceived as a single system instead of a collection of
independently working components. George Coulouris et al. continue on to
discuss the eight major forms of transparency first discussed in the ANSA
Reference Manual (ANSA Project 1987) and the International Organization
for Standardization’s Reference Model for Open Distributed Processing (RM-
ODP) (ISO/IEC 1996). A brief summary of the eight forms of transparency as
listed in the RM-ODP are as follows:
a. Access transparency: The ability to hide from a user the details of the
access mechanisms for any given server object. Access transparency
hides the difference between local and remote provision of the
service.
b. Concurrency transparency: The ability to hide from the client the
existence of concurrent accesses being made to various resources. This
hides the effects of concurrent operations performed by any given
user on a service used by multiple users.
c. Location transparency: The ability to conceal the location of the
resource currently being accessed by a user.
d. Replication transparency: The ability to hide from the users of a
service the presence of multiple copies of that service, while
maintaining the consistency of the multiple copies of its data.
e. Resource transparency: The ability to hide from a user the
mechanisms that manage the allocation of resources by activating and
deactivating them as demand for these resources varies.
f. Failure transparency: The ability to mask certain failures, and the
recovery of resources from those failures, from the user. This provides
fault tolerance for the distributed system.
g. Federation transparency: The ability to hide from users the effects of
operations that cross multiple administrative boundaries, allowing
users and resources to interwork between multiple administrative
and technological domains.
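The synchronization requirement described in point 5, keeping a shared resource consistent while multiple users operate on it at the same time, can be illustrated with a minimal Python sketch using a mutex lock. The shared counter and the function names here are hypothetical, standing in for any shared file, database record, or system resource.

```python
import threading

# Hypothetical shared resource: a counter that several clients update.
counter = 0
counter_lock = threading.Lock()  # mutex guarding the shared resource

def increment(times):
    """Simulates one client repeatedly updating the shared resource."""
    global counter
    for _ in range(times):
        # Without the mutex, concurrent read-modify-write cycles could
        # interleave and lose updates; holding the lock serializes them.
        with counter_lock:
            counter += 1

# Four concurrent clients, 10,000 updates each.
clients = [threading.Thread(target=increment, args=(10000,)) for _ in range(4)]
for t in clients:
    t.start()
for t in clients:
    t.join()

print(counter)  # 40000: no update is lost despite the shared usage
```

A semaphore (threading.Semaphore) generalizes the same idea to resources that may safely be shared by a bounded number of users at once.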
2.3 BLAST
Basic Local Alignment Search Tool (BLAST) was developed in the late 1980s and
early 1990s and is among the most widely used bioinformatics programs in the world. The program is an approximation of the Smith-Waterman algorithm, developed by Temple F. Smith and Michael S. Waterman in 1981, for performing local sequence alignment. The
Smith-Waterman algorithm compares two sequences against one another in order to detect similarities between the sequences, known as alignments (Smith and
Waterman 1981) (Durbin, et al. 2007). The algorithm takes two sequences, A and B,
such that A = a1a2...an and B = b1b2...bm. We define the similarity between two sequence elements a and b as s(a,b), and gaps in the sequence of length k are given weight Wk. According to the Smith-Waterman algorithm, in order to detect pairs of segments between the sequences that contain a high similarity, we first need to establish a matrix M such that
Mk0 = M0l = 0 for 0 ≤ k ≤ n and 0 ≤ l ≤ m
This creates a matrix of size (n + 1) x (m + 1) with the first column and first row filled with zeros and all other positions left empty. These positions are then filled in such that Mij is the maximum similarity of two segments ending in ai and bj, respectively. Because the
Smith-Waterman algorithm searches for local instead of global alignments, the
matrix will contain 0 in any position where Mij would have contained a negative
number. These 0's are used to determine where new alignments begin, allowing
multiple alignments to begin and end within the sequence pair (Rosenberg 2009).
As such, Mij is calculated for all i and j, where 1 ≤ i ≤ n and 1 ≤ j ≤ m, using the equation

Mij = max { 0,
            Mi-1,j-1 + s(ai, bj),
            max k ≥ 1 { Mi-k,j − Wk },
            max l ≥ 1 { Mi,j-l − Wl } }
Once this matrix has been constructed, the alignment with the maximum similarity
between sequences A and B can be found by first locating the maximum element in M.
From this element a traceback algorithm is used to build the alignment in reverse,
starting from the maximum element in M and continuing until a 0 element is encountered. At each
step of the traceback process we move back from the current cell Mij to the cell
from which the value in Mij was derived, located at either Mi-1,j-1, Mi-1,j, or Mi,j-1. At the same time we also build A' and B', the alignment sequence pair between A and B respectively, by adding either a letter or a gap to the front of each sequence in the pair. If the value for element Mij was derived from Mi-1,j-1, then A' adds the sequence element ai to the front and B' adds the sequence element bj to the front. If the value
for element Mij was derived from Mi-1,j, then A' adds the sequence element ai to the
front and B' adds the gap character '-' to the front. If the value for element Mij was derived from Mi,j-1, then A' adds '-' to the front and B'
adds the sequence element bj to the front. This process is repeated until an
element where Mij = 0 is reached (Durbin, et al. 2007) (Jones and Pevzner 2004) (Orengo, Jones and
Thornton 2003).
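The recurrence and traceback just described can be sketched in Python. This is a simplified illustration rather than a production implementation: it assumes a linear gap weight Wk = k * d (so only single-step gap moves need to be considered at each cell), and the scoring function and score values are toy choices made for the example.

```python
def smith_waterman(A, B, s, d=1):
    """Local alignment of A and B (Smith and Waterman 1981).

    s(a, b) is the similarity score for a pair of elements; gaps are
    scored with the linear weight W_k = k * d. Returns the aligned
    pair (A', B') with gaps written as '-'.
    """
    n, m = len(A), len(B)
    # M[k][0] = M[0][l] = 0: first row and first column are zero.
    M = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Negative values are replaced by 0, allowing a new local
            # alignment to begin at any position in the matrix.
            M[i][j] = max(0,
                          M[i - 1][j - 1] + s(A[i - 1], B[j - 1]),
                          M[i - 1][j] - d,
                          M[i][j - 1] - d)
            if M[i][j] > best:
                best, bi, bj = M[i][j], i, j
    # Traceback: walk back from the maximum element until a 0 is hit,
    # prepending a letter or a gap to each aligned sequence.
    a_aln, b_aln, i, j = "", "", bi, bj
    while i > 0 and j > 0 and M[i][j] > 0:
        if M[i][j] == M[i - 1][j - 1] + s(A[i - 1], B[j - 1]):
            a_aln, b_aln = A[i - 1] + a_aln, B[j - 1] + b_aln
            i, j = i - 1, j - 1
        elif M[i][j] == M[i - 1][j] - d:
            a_aln, b_aln, i = A[i - 1] + a_aln, "-" + b_aln, i - 1
        else:
            a_aln, b_aln, j = "-" + a_aln, B[j - 1] + b_aln, j - 1
    return a_aln, b_aln

# Toy scoring: +2 for a match, -1 for a mismatch (hypothetical values).
match = lambda a, b: 2 if a == b else -1
print(smith_waterman("ACGT", "AGT", match))  # ('ACGT', 'A-GT')
```

With general gap weights Wk, the two gap terms in the recurrence become maxima over all previous cells in the same row or column, at a correspondingly higher computational cost.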
As discussed in the book "Introduction to Computational Genomics" (Cristianini and
Hahn 2007) by Nello Cristianini and Matthew Hahn, the increase in DNA sequences
deposited into the public genomic databases in the late 1980s caused searches of the
three main genomic databases to take immense amounts of time. The Smith-Waterman
algorithm takes on the order of O(nm) time and space, often referred to as O(n^2), to calculate results. Although this cost is modest for a single pairwise comparison, it was simply too high for the large-scale database searching applications the algorithm was being used for
(Cristianini and Hahn 2007).
The BLAST algorithm was developed in 1990 to attain results similar to those of the Smith-
Waterman algorithm but at a fraction of the computational cost. BLAST achieves this goal using two shortcuts: 1) it does not attempt to find the optimal alignment, and 2) it does not examine the entire search space, instead attempting to quickly locate regions of high similarity without checking every possible local alignment (Cristianini and Hahn 2007). The BLAST algorithm can be simplified into the steps described in the Basic Local Alignment Search Tool paper
(S. F. Altschul, et al. 1990) by Stephen F. Altschul et al. and expanded upon in the book "Bioinformatics: Sequence and Genome Analysis" (Mount 2004) by David
Mount and "BLAST: An Essential Guide to the Basic Local Alignment Search Tool" by
Ian Korf et al. (Korf, Yandell and Bedell 2003). A brief overview of the algorithm can
be found below in Figure 1. Also, an in-depth description of the BLAST algorithm is provided for curious readers as follows:
1) The first stage of the algorithm involves removing sequence repeats and
regions of low complexity, i.e., regions of biased composition containing simple
sequence repeats (Orlov and Potapov 2004), from the query sequence
(Mount 2004).
2) BLAST will then create a list of k-mers from the query sequence. This
requires creating a unique list by cutting the query sequence into words such
that each word is of length k. For example, if k = 3 and we have the query
sequence PGQQFPGQEP, then we would have a list containing PGQ, GQQ,
QQF, QFP, FPG, GQE, and QEP. Keep in mind that since we are storing a unique list
we will only add the 3-mer PGQ once (Mount 2004) (Zomaya 2006).
3) Create a list of high scoring match words each of length k. For each member
of the list generated in step 2 above, all the possible matching words are
generated and then scored against the original element using a scoring
matrix. If k = 3 and the program is handling amino acids, then a total of 20^3
possible match words and scores would be generated. For instance, the
sequence PGQ would generate the matching words PEG and PQA, with
BLOSUM62 matching scores of 15 and 12 respectively (Mount 2004)
(Zomaya 2006).
4) A match cutoff score known as the neighborhood word score threshold, T, is
selected in order to cull the list of match scores. By traversing the list of
match scores and removing any match score that does not exceed the value
of T we are able to generate a match list that contains only the highest
scoring match words (Mount 2004) (Zomaya 2006).
5) Steps 3 and 4 are repeated for each k-letter word in the list of k-mers created
in step 2 (Zomaya 2006).
6) The list of remaining highest scoring match words is then reorganized into
an efficient search tree. This allows the BLAST program to compare the
matching words to elements within database sequences quickly and
efficiently (Mount 2004).
7) Each sequence in the database is scanned by the BLAST program in order to
find k-mers in the database sequence that match k-mers from the list of highest
scoring match words (Mount 2004) (Zomaya 2006).
8) Each matching region found in a database sequence will then be compared
against the others to determine their distances. Each match that is within A
letters of another match will be joined with that match into a longer match,
with all the letters between them being incorporated into the new match. At
this point the match will be extended in each direction until the accumulated
total score of the HSP, or High Scoring Pair, begins to decrease (Mount 2004)
(Zomaya 2006).
9) A cutoff score known as the segment score threshold, S, is used in order to
remove HSPs that do not meet or exceed S. By examining each HSP and
removing any HSP whose score falls below S, we will be left with a list of
HSPs whose values are large enough to accurately calculate significance
(Mount 2004) (Korf, Yandell and Bedell 2003) (Zomaya 2006).
10) Next, BLAST will assess each HSP in order to determine the significance of
the HSP's score. Statistical significance is determined by the expect value, E,
which is the number of times that an unrelated database sequence would
obtain some score S greater than x by chance (Mount 2004). The
equation used to calculate the expect value, or E-value, states that the number
of alignments expected by chance (E) during a sequence database search is a
function of the size of the effective search space (m'n'), the normalized score
(λS), and a minor constant (k) (Korf, Yandell and Bedell 2003). In order to
solve for E the following equation is used: