K-mulus: A Database-Clustering Approach to Protein BLAST in the Cloud
Carl H. Albach†, Sebastian G. Angel†, Christopher M. Hill†, Mihai Pop
Department of Computer Science, University of Maryland, College Park
College Park, MD 20741
{calbach, sga001, cmhill, [email protected]
†Authors contributed equally.

ABSTRACT
With the increased availability of next-generation sequencing technologies, researchers are gathering more data than they are able to process and analyze. One of the most widely performed analyses is identifying regions of similarity between DNA or protein sequences using the Basic Local Alignment Search Tool, or BLAST. Due to the large scale of sequencing data produced, parallel implementations of BLAST are needed to process the data in a timely manner. In this paper we present K-mulus, an application that performs distributed BLAST queries via Hadoop and MapReduce and aims to generate speedups by clustering the database. Clustering the sequence database reduces the search space for a given query and allows K-mulus to easily parallelize the remaining work. Our results show that while K-mulus offers a significant theoretical speedup, in practice the distribution of protein sequences prevents this potential from being realized. We show K-mulus's potential with a comparison against standard BLAST using a simulated dataset that is clusterable into disjoint clusters. Furthermore, analysis of K-mulus's poor performance using NCBI's Non-Redundant (nr) database provides insight into the limitations of clustering and indexing a protein database.

General Terms
Bioinformatics, High-Performance Computing

Keywords
BLAST, Cloud Computing, MapReduce, Hadoop, Bioinformatics

1. BACKGROUND
Identifying regions of similarity between DNA or protein sequences is one of the most widely studied problems in bioinformatics. These similarities can be the result of functional, structural, or evolutionary relationships between the sequences. As a result, many tools have been developed with the intention of efficiently searching for these similarities. The most widely used application is the Basic Local Alignment Search Tool, or BLAST [3].

With the increased availability of next-generation sequencing technologies, researchers are gathering more data than ever before. This large influx of data has become a major issue, as researchers have a difficult time processing and analyzing it. For this reason, optimizing the performance of BLAST and developing new alignment tools has been a well-researched topic over the past few years. Take the example of environmental sequencing projects, in which the biodiversity of various environments, including the human microbiome, is analyzed and characterized, generating on the order of several terabytes of data [2]. One common way in which biologists use these massive quantities of data is by running protein BLAST on large sets of unprocessed, repetitive reads to identify putative genes [9, 14, 16]. Unfortunately, performing this task in a timely manner on terabytes of data far exceeds the capabilities of most existing BLAST implementations.

As a result of this trend, large sequencing projects require high-performance and distributed systems. However, most researchers do not have access to such computer clusters, due to their high cost and maintenance requirements. Fortunately, cloud computing offers a solution to this problem, allowing researchers to run their jobs on demand without owning or managing any large infrastructure. The speedup offered by running large jobs in parallel, as well as the opportunities for novel reduction of the BLAST protein search space [13, 19], are the motivating factors that led us to devote this paper to effectively parallelizing protein BLAST (blastx and blastp).

One of the original algorithms behind BLAST is "seed and extend" alignment. This approach requires at least one k-mer (sequence substring of length k) match between the query and a database sequence before the expensive alignment algorithm is run between them [3]. Using this rule, BLAST can bypass any database sequence that shares no k-mers with the query, and this heuristic lets us design a distributed version of BLAST on top of the MapReduce model. One aspect of BLAST we hoped to take advantage of is database indexing of k-mers. While some versions of BLAST have adopted k-mer indexing for DNA databases, this approach has not been feasibly scaled to protein databases [13]. For this reason, BLAST iterates through nearly every database sequence to find k-mer hits. K-mulus attempts to optimize this process by using lightweight database indexing that lets the query iteration bypass entire partitions of the database.
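To make the seeding rule concrete, the sketch below checks whether a query shares at least one k-mer with a database sequence before any alignment work is done. It is our own Python illustration rather than code from the paper, and it uses exact k-mer equality, whereas protein BLAST also accepts high-scoring "neighborhood" words.

```python
def kmers(seq, k=3):
    """Return the set of all length-k substrings (k-mers) of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def shares_seed(query, subject, k=3):
    """Seed test: only subjects sharing at least one k-mer with the query
    are worth passing to the expensive extension/alignment step."""
    return not kmers(query, k).isdisjoint(kmers(subject, k))

# The first subject shares the 3-mer "MKT" with the query; the second
# shares nothing, so BLAST could skip it entirely.
print(shares_seed("MKTAYIAKQR", "LLVMKTGGES"))  # True
print(shares_seed("MKTAYIAKQR", "WWWWWWWWWW"))  # False
```

K-mulus applies the same cheap set-intersection idea one level up: instead of testing individual database sequences, it tests whole database partitions through their k-mer indices.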
In this paper we present K-mulus, an application that distributes queries over many database clusters and produces output equivalent to that of BLAST. K-mulus aims to achieve speedups by generating database clusters that enable both the parallelization of BLAST and a reduction of the search space. After the database is partitioned by cluster, a lightweight k-mer index is generated for each partition. During a query, the k-mers of the query sequences can be quickly compared against these indices to determine whether a cluster contains potential matches. In this way, a query sequence need only be compared against a subset of the original database, thereby reducing the search space. Note that the speed of K-mulus is completely dependent on cluster variance; if a query sequence matches every partition index, there is no reduction in search space.

mpiBLAST [5] is a popular distributed version of BLAST that can yield super-linear speedup over running BLAST on a single node. mpiBLAST works by segmenting the database into equal-sized chunks and distributing these chunks among the available nodes. All nodes then search the entire query set against all chunks of the database, and the results from individual nodes are later aggregated. Although K-mulus and mpiBLAST both divide the database into smaller chunks, mpiBLAST chunks the data arbitrarily, while K-mulus does so according to similarity-based clustering.

Another parallel implementation of BLAST is CloudBLAST [11]. CloudBLAST uses the MapReduce paradigm with Hadoop to parallelize BLAST. Whereas mpiBLAST segments the database, CloudBLAST's parallelization approach involves segmenting the queries.

K-mulus differs from these previous parallel implementations of BLAST by using similarity-based clustering of the database along with a database index for each cluster. There is a large potential advantage to this approach: K-mulus identifies each cluster by a center, which lets us determine whether or not a query will find a match in the corresponding segment of the database. This optimization has the potential to dramatically reduce the search space for each individual query, eliminating unnecessary searches.

There are several advantages to using the MapReduce [6] framework over other parallel processing frameworks. The entire framework rests on two simple methods, a mapper and a reducer. During the map phase, the input is distributed among all nodes assigned to the mapping task. For example, if a user wanted to count the number of occurrences of each word in a body of text, they could design a MapReduce application that does this in parallel with little to no effort. The body of text would be fed in as the input, and individual words from the file would be distributed evenly among all available nodes. The mappers would output key-value pairs of the form (word, 1). When all mappers have finished, each reducer receives a key together with a list of values, in this case a list composed exclusively of 1s. The reducer then aggregates the entries in the list and outputs the final key-value pair, (word, count).
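The sketch below shows that word-count job in Python. The mapper and reducer are the only two pieces a developer would write; the small local driver that groups intermediate pairs by key is our own stand-in for the shuffle the framework normally performs, not Hadoop code.

```python
from collections import defaultdict

def word_count_map(text):
    """Map phase: emit a (word, 1) pair for every word in the input split."""
    for word in text.split():
        yield word.lower(), 1

def word_count_reduce(word, counts):
    """Reduce phase: each reducer gets one key and all of its values
    (here, a list made up entirely of 1s) and aggregates them."""
    return word, sum(counts)

def run_local(splits):
    """Stand-in for the framework: map each split, group pairs by key
    (the 'shuffle'), then reduce each group."""
    grouped = defaultdict(list)
    for split in splits:
        for word, one in word_count_map(split):
            grouped[word].append(one)
    return dict(word_count_reduce(w, vs) for w, vs in grouped.items())

print(run_local(["the cat sat", "the dog sat"]))
# {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}
```

On a real cluster, Hadoop would run many mapper and reducer instances in parallel and route the intermediate pairs over the network, which is exactly the communication that, as noted below, is hidden from the programmer.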
Another advantage of the MapReduce framework is that it abstracts the communication between nodes, allowing software developers to run their jobs in parallel over potentially thousands of processors. Although this makes programs simple to write, the lack of direct control over communication can make them inefficient.

Lastly, MapReduce has become a de facto standard for distributed processing, so code written for it is very portable.

The MapReduce framework is tightly integrated with a distributed file system, allowing it to coordinate computation with the location of the data. The Google File System (GFS) [7] is one such implementation. Files in GFS are partitioned into smaller chunks and stored in a distributed fashion across a cluster of computers. A master node is responsible for storing the metadata, such as the chunk locations of each file and their replicas, and for granting access to those chunks. MapReduce attempts to schedule computation on the nodes storing the input chunks. Because GFS is proprietary, Hadoop [4] provides an open-source distributed file system along with its own implementation of MapReduce. Applications built on the Hadoop framework only need to specify the map and reduce functions. Hadoop has become the de facto standard for cloud computing, a model of parallel computing in which infrastructure is treated as a service: end users and application developers need not worry about maintenance or cluster configuration. An example of a popular cloud computing provider is Amazon's Elastic Compute Cloud (EC2) [1]. K-mulus is implemented using the MapReduce framework so it can be easily integrated with both public and private clouds.

2. DESIGN
Our model is divided into two parts. First, we present a methodology for dividing the existing database into smaller clusters and generating indices for each cluster; this is the preprocessing step of K-mulus, and a rough sketch of it is given below. Second, we consider the process by which a query is executed in K-mulus.

In order to cluster the database, we used a collection of clustering algorithms: k-means [8], k-medoid [18], and our own
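The following sketch illustrates the preprocessing step referenced above: each database sequence is represented as a vector of amino-acid k-mer counts, the sequences are partitioned with k-means, and the set of k-mers present in each partition is kept as its lightweight index. The 2-mer encoding and the use of scikit-learn's KMeans are our own assumptions for illustration, not K-mulus's exact procedure.

```python
from itertools import product
import numpy as np
from sklearn.cluster import KMeans

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
K = 2  # short k-mers keep the toy vectors small (20^2 = 400 dimensions)
KMER_POS = {"".join(p): i for i, p in enumerate(product(AMINO_ACIDS, repeat=K))}

def kmer_vector(seq):
    """Count vector over all possible amino-acid k-mers for one sequence."""
    v = np.zeros(len(KMER_POS))
    for i in range(len(seq) - K + 1):
        kmer = seq[i:i + K]
        if kmer in KMER_POS:
            v[KMER_POS[kmer]] += 1
    return v

def cluster_database(sequences, n_clusters=2):
    """Partition sequences with k-means and build a k-mer index per cluster."""
    X = np.array([kmer_vector(s) for s in sequences])
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    index = {c: set() for c in range(n_clusters)}
    for seq, label in zip(sequences, labels):
        index[label].update(seq[i:i + K] for i in range(len(seq) - K + 1))
    return labels, index

def clusters_to_search(query, index):
    """A query only needs to be searched against clusters whose index
    shares at least one k-mer with it."""
    q = {query[i:i + K] for i in range(len(query) - K + 1)}
    return [c for c, kmer_set in index.items() if q & kmer_set]

db = ["MKTAYIAKQR", "MKTLLVAAGG", "GGGSSGGSSG", "SSGGSGGSSA"]
labels, index = cluster_database(db)
print(clusters_to_search("AYIAK", index))  # typically a single partition id
```

If a query's k-mers appear in every partition's index, clusters_to_search returns all of the partitions and no search is avoided, which is the cluster-variance caveat noted in the Background section.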