K-mulus: A Database-Clustering Approach to Protein BLAST in the Cloud

Carl H. Albach†, Sebastian G. Angel†, Christopher M. Hill†, Mihai Pop
Department of Computer Science, University of Maryland, College Park
College Park, MD 20741
{calbach, sga001, cmhill, mpop}@umiacs.umd.edu
†Authors contributed equally.

ABSTRACT
With the increased availability of next-generation sequencing technologies, researchers are gathering more data than they are able to process and analyze. One of the most widely performed analyses is identifying regions of similarity between DNA or protein sequences using the Basic Local Alignment Search Tool, or BLAST. Due to the large scale of sequencing data produced, parallel implementations of BLAST are needed to process the data in a timely manner. In this paper we present K-mulus, an application that performs distributed BLAST queries via Hadoop and MapReduce and aims to generate speedups by clustering the database. Clustering the sequence database reduces the search space for a given query, and allows K-mulus to easily parallelize the remaining work. Our results show that while K-mulus offers a significant theoretical speedup, in practice the distribution of protein sequences prevents this potential from being realized. We show K-mulus's potential with a comparison against standard BLAST using a simulated dataset that is clusterable into disjoint clusters. Furthermore, analysis of K-mulus's poor performance using NCBI's Non-Redundant (nr) database provides insight into the limitations of clustering and indexing a protein database.

General Terms
High-Performance Computing

Keywords
BLAST, Cloud Computing, MapReduce, Hadoop, Bioinformatics

1. BACKGROUND
Identifying regions of similarity between DNA or protein sequences is one of the most widely studied problems in bioinformatics. These similarities can be the result of functional, structural, or evolutionary relationships between the sequences. As a result, many tools have been developed with the intention of efficiently searching for these similarities. The most widely used application is the Basic Local Alignment Search Tool, or BLAST [3].

With the increased availability of next-generation sequencing technologies, researchers are gathering more data than ever before. This large influx of data has become a major issue, as researchers have a difficult time processing and analyzing it. For this reason, optimizing the performance of BLAST and developing new alignment tools has been a well researched topic over the past few years. Take the example of environmental sequencing projects, in which the biodiversity of various environments, including the human microbiome, is analyzed and characterized, generating on the order of several terabytes of data [2]. One common way in which biologists use these massive quantities of data is by running protein BLAST on large sets of unprocessed, repetitive reads to identify putative genes [9, 14, 16]. Unfortunately, performing this task in a timely manner while dealing with terabytes of data far exceeds the capabilities of most existing BLAST implementations.

As a result of this trend, large sequencing projects require the utilization of high-performance and distributed systems. However, most researchers do not have access to such computer clusters, due to their high cost and maintenance requirements. Fortunately, cloud computing offers a solution to this problem, allowing researchers to run their jobs on demand without the need to own or manage any large infrastructure. The speedup offered by running large jobs in parallel, as well as the opportunities for novel reduction of the BLAST protein search space [13, 19], are the motivating factors that led us to devote this paper to effectively parallelizing protein BLAST (blastx and blastp).

One of the original algorithms that BLAST uses is "seed and extend" alignment.
This approach requires that there be at least one k-mer (a sequence sub-string of length k) match between the query and a database sequence before running the expensive alignment algorithm between them [3]. Using this rule, BLAST can bypass any database sequence which does not share any common k-mers with the query. Using this heuristic, we can design a distributed version of BLAST using the MapReduce model. One aspect of BLAST which we hoped to take advantage of was database indexing of k-mers. While some versions of BLAST have adopted database k-mer indexing for DNA databases, this approach has not been feasibly scaled to protein databases [13]. For this reason, BLAST iterates through nearly every database sequence to find k-mer hits. K-mulus attempts to optimize this process by using lightweight database indexing that allows query iteration to bypass certain partitions of the database.

In this paper we present K-mulus, an application which distributes queries over many database clusters and produces output which is equivalent to that of BLAST. K-mulus aims to achieve speedups by generating database clusters which enable the parallelization of BLAST and a reduction of the search space. After the database is partitioned by cluster, a lightweight k-mer index is generated for each partition. During a query, the k-mers of the query sequences can be quickly compared against these indices to determine if the cluster contains potential matches. In this way, a query sequence need only be compared against a subset of the original database, thereby reducing the search space. Note that the speed of K-mulus is completely dependent on cluster variance; if a query sequence matches every partition index, there is no reduction in search space.

mpiBLAST [5] is a popular distributed version of BLAST, which can yield super-linear speed-up over running BLAST on a single node. mpiBLAST works by segmenting the database into equal-sized chunks and distributing these chunks among the available nodes. All nodes then proceed to search the entirety of the query set against all chunks of the database, and the results from individual nodes are later aggregated. Although K-mulus and mpiBLAST both divide the database into smaller chunks, mpiBLAST chunks the data arbitrarily, while K-mulus does so according to similarity-based clustering.

Another parallel implementation of BLAST is CloudBLAST [11]. CloudBLAST uses the MapReduce paradigm with Hadoop to parallelize BLAST. Whereas mpiBLAST segments the database, CloudBLAST's parallelization approach involves segmenting the queries.

K-mulus differs from the previous parallel implementations of BLAST by using similarity-based clustering of the database along with the creation of a database index for each cluster. There is a large potential advantage to our approach: K-mulus identifies each cluster by a center, which allows us to determine whether or not a query will find a match in the corresponding segment of the database. This optimization has the potential to dramatically reduce the search space for each individual query, thus eliminating unnecessary searches.

There are a few advantages of using the MapReduce [6] framework over other existing parallel processing frameworks. The entirety of the framework rests on two simple methods, a mapper and a reducer. During the map phase, input is distributed among all nodes assigned to the mapping task. Each node processes its input according to the developer's specifications and outputs key-value pairs. Once all nodes have finished outputting their key-value pairs, all the values for a given key are aggregated into a list and sent to a reducer. During the reduce phase, the (key, list of values) pairs are received, and the list of values is used to compute the final result according to the application's needs. For instance, if a developer wanted to count the number of occurrences of each of the words in a body of text, they could design a MapReduce application that would do this in parallel with little to no effort. Initially, they would feed the body of text as the input. Individual words from the file would be distributed evenly among all available nodes. The mappers would output key-value pairs of the form (word, 1). Finally, when all mappers have finished, the reducers will each get a key and a list of values; in this case, the list of values is exclusively composed of 1s. The reducer would then aggregate all the entries in the list and output the final key-value pair, which corresponds to (word, count).
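To make the word-count example concrete, the following is a minimal sketch of the two scripts a Hadoop Streaming job could use. The script names and the use of Python streaming (rather than the native Java API) are our own illustrative choices and do not come from the paper.

```python
#!/usr/bin/env python
# wc_mapper.py -- emit (word, 1) for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        # Hadoop Streaming treats everything before the first tab as the key.
        print("%s\t1" % word)
```

```python
#!/usr/bin/env python
# wc_reducer.py -- sum the 1s for each word; input arrives sorted by key.
import sys

current, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print("%s\t%d" % (current, count))
        current, count = word, 0
    count += int(value)
if current is not None:
    print("%s\t%d" % (current, count))
```

With Hadoop Streaming, these two scripts would be passed as the -mapper and -reducer arguments of the streaming jar; Hadoop sorts the mapper output by key before the reducer sees it.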
Another advantage of the MapReduce framework is that it abstracts the communication between nodes, thus allowing software developers to run their jobs in parallel over potentially thousands of processors. Although this makes the jobs simple to program, the lack of direct control over the communication may be inefficient. Lastly, MapReduce has become a de-facto standard for distributed processing, so code written against it is very portable.

The MapReduce framework is tightly integrated with a distributed file system, allowing it to coordinate computation with the location of the data. The Google File System (GFS) [7] is one such implementation. Files in GFS are partitioned into smaller chunks and stored distributively among a cluster of computers. A master node is responsible for storing the metadata, such as the chunk locations for each file and their replicas, and for granting access to said chunks. MapReduce attempts to schedule computation at the nodes storing the input chunks. Because GFS is proprietary, Hadoop [4], an open-source distributed file system and MapReduce implementation, provides its own version of this model. Applications built using the Hadoop framework only need to specify the map and reduce functions. Hadoop has become the de-facto standard for cloud computing, a model of parallel computing where infrastructure is treated as a service, and the end users, or application developers, need not worry about maintenance or cluster configuration. An example of a popular cloud computing provider is Amazon's Elastic Compute Cloud (EC2) [1]. K-mulus is implemented using the MapReduce framework so that it can be easily integrated with both public and private clouds.

2. DESIGN
Our model is divided into two sections. First, we present a methodology for dividing the existing database into smaller clusters and generating indices for each cluster; this is the pre-processing step for K-mulus. Second, we consider the process by which a query is executed in K-mulus.
In order to cluster the database, we used a collection of clustering algorithms: k-means [8], k-medoids [18], and our own MapReduce-efficient hierarchical clustering. We also keep track of the centers of each cluster, as they play the crucial role of identifying membership in a cluster. Fig. 1 shows a high-level view of our pre-processing model. We further explain the details involved in clustering the existing database in Section 3, as well as our preliminary results in Section 4.

Figure 1: The database is partitioned into k smaller disjoint clusters using the k-means clustering algorithm. The center of each cluster is tracked, as the centers are fundamental for subsequent parts of K-mulus.

After the database has been clustered, we compare the input query sequences to all centers. This comparison is explained in detail in Section 3. The key idea is that by comparing the input query sequence to the cluster centers, we can determine whether a potential match is present in a given cluster. If this is the case, we run the NCBI BLAST algorithm on the query sequence and the database clusters that we determined as relevant for the query. Fig. 2 shows a diagram of the query processing carried out by K-mulus.

2.1 Pre-processing
Since text input is split on newlines by default in MapReduce, the sequences are transformed from FASTA format to simpleFasta format, where each sequence shares the same line as its header, separated by a single white space. In the future, we plan on implementing a custom Hadoop InputFormat that can read native FASTA formatted files.

2.2 Database-specific
First, the database sequences are transformed into simpleFasta format and uploaded to HDFS. A k-mer presence vector is created for each sequence; the details of the k-mer presence vector are described in Section 3. During the map phase, each sequence is emitted as a k-mer presence vector. This vector has 20^k entries, one for each possible k-mer. These vectors are then clustered using one of the clustering algorithms detailed below. For each cluster, a center vector is produced that is the union of all k-mer presence vectors of the cluster. In other words, the center vector represents all k-mers present in the cluster.

In the final stage of database preparation, the formatdb program is used to build a BLAST database for each cluster. The resulting collection of BLAST databases is zipped and uploaded back to HDFS. The databases have to be zipped because the HDFS namenode stores each file, regardless of size, as a series of 128MB chunks. Compressing the databases into a single archive not only dramatically saves HDFS space, but makes it simpler to load all databases onto each machine during the BLAST MapReduce. In the future, the user will be able to download the clusters of the NCBI Non-Redundant database from our website.
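As a concrete sketch of the presence-vector construction described above: the amino-acid ordering, the set-based representation, and the helper names below are our own illustrative assumptions; the paper's implementation stores actual bit vectors of length 20^k.

```python
# Sketch: k-mer presence vectors over the 20-letter amino acid alphabet.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
K = 3  # the paper uses 3-mers for its analyses

def kmer_index(kmer):
    """Map a k-mer to a unique position in [0, 20^k) via base-20 encoding."""
    idx = 0
    for aa in kmer:
        idx = idx * 20 + INDEX[aa]
    return idx

def presence_vector(seq, k=K):
    """Indices of the k-mers present in a sequence (a sparse bit vector)."""
    return {kmer_index(seq[i:i + k])
            for i in range(len(seq) - k + 1)
            if all(c in INDEX for c in seq[i:i + k])}

def cluster_center(vectors):
    """Center of a cluster: the union of its members' presence vectors."""
    center = set()
    for v in vectors:
        center |= v
    return center
```

A set of k-mer indices is used here instead of a dense 20^k bit array purely for readability; the union operation corresponds to the bitwise OR of the underlying bit vectors.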

2.3 Clustering
K-mulus attempts to minimize the amount of k-mer overlap between different clusters of sequences. If two clusters share a high percentage of k-mers in common, then a query that shares a k-mer with one cluster has a higher chance of overlapping with the other, and the query will then be sent to both clusters during the reduce phase of K-mulus described below. Determining the correct number of clusters is very important: if the number of clusters is decreased, the k-mer overlap between clusters increases. The percentage of k-mer overlap for different numbers of clusters is discussed in the results.

K-mulus only clusters the database sequences. We considered clustering the query sequences as well, but decided not to for several reasons. The advantages of clustering the query sequences are that it requires slightly less network overhead in Hadoop to move a cluster of queries instead of single queries, and that it requires less computation to compare the k-mers of query-cluster centers against database centers than to compare all query k-mers against database centers. However, the k-mer lookup is quite fast already, as it can be done in time on the order of the length of the query. The trade-off of this approach is that overlapping the k-mers of two clusters is inherently noisier than overlapping one query with a cluster: each sequence will be mapped to more partitions of the database, which results in unnecessary work by BLAST. Furthermore, this approach incurs the cost of performing the query clustering.

By default, BLAST ignores regions of low complexity in the input by repeat masking. For proteins, BLAST uses the SEG program to mask both the query and database sequences [20]. K-mulus similarly masks low-complexity sequences within the query and database to avoid spurious matches caused by such regions. Since K-mulus uses an approach identical to that of BLAST in this regard, this optimization causes no loss of sensitivity when compared to a normal BLAST search.
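One way to quantify the inter-cluster k-mer overlap discussed above, reusing the set-based presence vectors from the previous sketch; the specific overlap metric is our own assumption and is not taken from the paper.

```python
def center_overlap(center_a, center_b):
    """Fraction of the smaller center's k-mers that also appear in the other.
    Values near 1.0 mean a query hitting one cluster will likely hit both."""
    if not center_a or not center_b:
        return 0.0
    shared = len(center_a & center_b)
    return shared / min(len(center_a), len(center_b))

def mean_pairwise_overlap(centers):
    """Average overlap across all pairs of cluster centers; useful when
    comparing different choices for the number of clusters."""
    pairs = [(a, b) for i, a in enumerate(centers) for b in centers[i + 1:]]
    if not pairs:
        return 0.0
    return sum(center_overlap(a, b) for a, b in pairs) / len(pairs)
```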

Figure 2: K-mulus takes a collection of protein sequences and BLASTs them against a collection of databases. First, the query's low-complexity regions are masked, then each query is compared to the cluster center k-mer vectors. If the query shares a k-mer with a cluster center, the query is sent to the core responsible for running BLAST with that cluster's database.

2.4 BLAST
The actual BLAST step of K-mulus is a single MapReduce job. The query protein sequences are converted into simpleFasta format and uploaded to HDFS. At the start of the MapReduce job, the compressed archive of all previously compiled BLAST databases is added to Hadoop's DistributedCache. Archived files added to the DistributedCache are extracted to the local working directory of each compute node that executes a map or reduce function. This way we can effectively transfer the databases and BLAST binaries provided by the user for use during the reduce step.

Prior to the map phase, each computing node loads the k-mer presence vectors of the cluster centers. The centers are only loaded once per node, regardless of the number of map functions the node is responsible for. During the map phase, each k-mer of a query is checked against all cluster centers' presence vectors. If a cluster contains a given k-mer, then that query must be aligned against the sequences within that cluster. Therefore, if there exists a shared k-mer between the query sequence q and a cluster center c, then the mapper emits (c, q).

The reducer aggregates all query sequences that share a k-mer with a given cluster. These query sequences are written to a temporary file on the local disk of the reduce node. The reduce node then forks a child process that executes the BLAST binary along with any arguments passed by the user. The BLAST binary determines which database to load based on the key received. The output of the forked process is captured and redirected to the reducer's output.
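The map and reduce logic just described can be sketched as follows. This is not the authors' implementation: the function signatures, the temporary-file handling, and the legacy blastall invocation are illustrative assumptions (presence_vector comes from the earlier sketch), and a real deployment would read the cluster centers from the DistributedCache and stream key-value pairs through Hadoop.

```python
import subprocess
import tempfile

def kmulus_map(query_id, query_seq, centers, k=3):
    """Emit (cluster_id, query) for every cluster whose center shares a k-mer
    with the query; `centers` maps cluster_id -> presence-vector set."""
    qvec = presence_vector(query_seq, k)
    for cluster_id, center in centers.items():
        if qvec & center:  # at least one shared k-mer
            yield cluster_id, (query_id, query_seq)

def kmulus_reduce(cluster_id, queries, db_dir="clusters"):
    """Write the queries that mapped to this cluster to a temporary FASTA file
    and fork BLAST against that cluster's pre-built database."""
    with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as fa:
        for query_id, seq in queries:
            fa.write(">%s\n%s\n" % (query_id, seq))
        query_path = fa.name
    cmd = ["blastall", "-p", "blastp",
           "-d", "%s/%s" % (db_dir, cluster_id), "-i", query_path]
    return subprocess.run(cmd, capture_output=True, text=True).stdout
```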
2.5 Post-processing
Since BLAST takes into account the size of the database when computing alignment statistics, the individual BLAST results must have their scores adjusted for the database segmentation. In order to determine the K and λ values, a test query of one sequence is run against the entire non-segmented database.

3. METHODS DETAILS

3.1 Presence Vector
In the context of this paper, a presence vector is a vector of bits in which the value at each position indicates the presence of a specific sequence k-mer. The index of each k-mer in the vector is trivial to compute.

3.2 Clustering
We considered three different clustering algorithms for use in K-mulus: k-means, k-medoids, and a modified form of hierarchical clustering. Pseudo-code for these algorithms is given below, in addition to details about our hierarchical clustering algorithm.

We used a modified form of agglomerative clustering to perform hierarchical clustering on the database sequences. This general algorithm uses a simple bottom-up approach, in which all sequences begin in disjoint clusters of size one and are iteratively merged with the nearest cluster [10]. Implementations of agglomerative clustering are differentiated by the distance function used to compare clusters.

Our algorithm performs clustering with respect to the presence vectors of each input sequence. For each cluster, a center presence vector is computed as the union of all sequence presence vectors in the cluster. The distance between clusters is taken as the Hamming distance, or number of bitwise differences, between these cluster centers. This design choice creates a tighter correspondence between the clustering algorithm and the metrics for success of the results, which depend entirely on the cluster presence vectors as computed above.

Note that this algorithm requires the intensive computation of all-pairs distances at every clustering iteration. For this reason, a novel algorithm was used to efficiently make these computations in the cloud. This algorithm is described in detail in the next section.

Algorithm 1 Hierarchical Clustering Algorithm (Modified)
1: Define each input presence vector as a cluster c
2: repeat
3:   for all clusters c do
4:     Assign c.k as the union of all presence vectors in c
5:   end for
6:   Compute all pairwise distances for all c.k
7:   repeat
8:     Merge the two clusters c1, c2 with the shortest distance
9:   until all previous clusters c have been merged
10: until one cluster remains
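A minimal sketch of one merge step of Algorithm 1, assuming the center presence vectors are stored as Python sets of k-mer indices, with the Hamming distance computed as the size of the symmetric difference:

```python
def hamming(center_a, center_b):
    """Hamming distance between two presence vectors stored as index sets."""
    return len(center_a ^ center_b)

def merge_closest(clusters):
    """One agglomerative step: merge the two clusters with the closest centers.
    Each cluster is a dict with 'members' (list of vectors) and 'center'."""
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            d = hamming(clusters[i]["center"], clusters[j]["center"])
            if best is None or d < best[0]:
                best = (d, i, j)
    _, i, j = best
    merged = {"members": clusters[i]["members"] + clusters[j]["members"],
              "center": clusters[i]["center"] | clusters[j]["center"]}
    return [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
```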

Algorithm 3 K-Medoid Algorithm
1: Select k centroids randomly
2: repeat
3:   for all sequences i in the database do
4:     Assign i to the cluster with the closest centroid
5:   end for
6:   for all clusters c do
7:     for all elements e in cluster c do
8:       Find the e that minimizes the total distance to all other elements in c
9:       Make e the new centroid of cluster c
10:     end for
11:   end for
12: until centroids do not move

Figure 3: Fano Plane

3.3 Efficient Distributed All-Pairs
We initially considered clustering the query sequences in addition to clustering the database sequences. We decided not to explore this option, for reasons mentioned elsewhere in the paper; however, we recognize that for some applications such an approach would be useful. Query clustering may require all pairwise distances to be computed in parallel, and for this reason we include the efficient distributed all-pairs algorithm which we used for clustering the database sequences.

It is trivial to use Hadoop to perform exactly n(n−1)/2 comparisons. However, naive use of memory and network overhead can lead to run times which are worse than serial execution [12]. Our algorithm maps a sizable number of comparisons to each node, which reduces overhead. Furthermore, for the given workloads, our algorithm minimizes network overhead and guarantees that each comparison occurs only once. An implication of this is that for all data a node receives, it may freely compute all pairs within that set without any global duplication of work. Proof of this point is straightforward: in any other configuration (where some available comparisons are not executed at the node), fewer comparisons per sequence are being done at the node, and it follows that such a configuration will require a strictly larger amount of overall data movement.

The above is achieved through projective geometry. A projective plane, the simplest of which is shown in Fig. 3, has the following properties:

• For all pairs of distinct points, there is exactly one line which contains that pair.
• For any pair of lines, there is exactly one point at which they intersect.
• There exist four points such that no line intersects more than two of them.

We equate "lines" to our task input groups, and "points" to the data items. Using this transformation with the first property, it follows that for all pairs of items there is exactly one group which contains that pair. A configuration of data distribution such as this meets the condition listed above, and is therefore an optimal distribution of data for groups of that size.

Projective planes of order p can be calculated trivially, where p is a prime power, and such a plane has a capacity of p^2 + p + 1 points. There are some complications to this algorithm which arise as a result of this restriction on p. For input sizes which differ from this order, additional positions must be "padded" out and some efficiency is lost. Overall, this algorithm is very efficient in the average case for performing intensive all-pairs comparisons.
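A minimal sketch of this construction for prime p (prime powers require finite-field arithmetic, which is omitted here). Each "line" of the plane becomes one task's input group, so every pair of items falls into exactly one group.

```python
def projective_plane_groups(p):
    """Return the p^2 + p + 1 lines of the projective plane of prime order p.
    Points are labelled 0 .. p^2 + p; each line contains p + 1 points."""
    def pt(x, y):   # affine point (x, y)
        return x * p + y
    def inf(m):     # point at infinity for slope m; m == p is the vertical one
        return p * p + m
    lines = []
    for m in range(p):                  # lines y = m*x + b, plus their slope point
        for b in range(p):
            lines.append([pt(x, (m * x + b) % p) for x in range(p)] + [inf(m)])
    for c in range(p):                  # vertical lines x = c
        lines.append([pt(c, y) for y in range(p)] + [inf(p)])
    lines.append([inf(m) for m in range(p + 1)])   # the line at infinity
    return lines
```

For p = 2 this produces the Fano plane of Fig. 3: seven points arranged in seven groups of three, with every pair of points appearing together in exactly one group.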
4. RESULTS
Common BLAST word sizes range from 2 to 5, thus we experimented with k-mers of sizes 2-5 [15]. For all subsequent analyses we used 3-mers. While three different clustering algorithms were considered in this paper, all subsequent K-mulus results were generated with k-means clustering (explanation below). In BLAST, neighboring words are k-mers which are biologically similar to a given k-mer within a certain threshold. In our results, neighboring words are not considered during K-mulus database indexing. This had a negligible effect on results, as nearly all queries already matched every cluster index.

4.1 Speedup Potential
First we show that K-mulus achieves significant speedups on well-clustered data. To demonstrate this we simulated an ideal data set of 1,000 sequences, where the sequences were comprised of one of two disjoint sets of 3-mers. The database sequences were clustered into two even-size clusters. The sample query was 10,000 sequences, also comprised of one of two disjoint sets of 3-mers. Table 1 shows the result of running BLAST on the query using Hadoop Streaming with query segmentation (the method CloudBLAST uses to run BLAST) and K-mulus. K-mulus running on 2 cores with 2 databases yields a 225% improvement over running BLAST using Hadoop Streaming on 2 cores. In practice this degree of separability is nearly impossible to replicate, but this model allows us to set an upper bound for the speedup contributed by clustering and search space reduction.

          Hadoop Streaming    K-mulus
1 core    15 mins, 38 sec     6 mins, 55 sec
2 cores   9 mins, 48 sec      4 mins, 25 sec

Table 1: Runtimes of BLAST using Hadoop Streaming and K-mulus on a query of 10,000 protein sequences. K-mulus is run with two clustered databases which contain no overlapping k-mers. K-mulus running on 2 cores with 2 databases yields a 225% improvement over running BLAST using Hadoop Streaming on 2 cores.

4.2 K-mulus in Practice
A more practical BLAST query uses the nr database, containing 3,429,135 sequences. K-mulus is compared against the naive Hadoop Streaming method using a realistic query of 30,000 sequences from the HMP project (Table 2). K-mulus performs poorly compared to the default Hadoop Streaming BLAST implementation because of the very high k-mer overlap between clusters. Due to the high k-mer overlap, each query sequence is replicated and compared against nearly all clusters. Since the complete nr database can fit into memory, we achieve no benefit from segmenting the database. Running K-mulus with 10,000 clusters performs slightly worse compared to only 100 clusters because of the additional overhead of rebuilding the query index for each BLAST instance.

Method                      Time
Hadoop Streaming            0 hr, 39 mins
K-mulus, 100 clusters       1 hr, 50 mins
K-mulus, 10,000 clusters    2 hr, 17 mins

Table 2: Runtimes of BLAST using Hadoop Streaming and K-mulus using the nr database on a query of 30,000 sequences on 100 cores. K-mulus is run using both 100 and 10,000 clusters to compare performance. Due to the low variance of k-mers between sequences in nr, K-mulus performs poorly compared to Hadoop Streaming. K-mulus with 10,000 clusters performs worse than K-mulus with 100 clusters because of the additional overhead of query indexing between BLAST instantiations.

5. DISCUSSION
In our nr query results, K-mulus shows poor performance due entirely to noisy, overlapping clusters. In the worst case, K-mulus will map every query to every cluster and devolve to a naive parallelized BLAST on database segments, while also including some overhead due to database indexing. This is close to the behaviour we observed when running K-mulus on the nr database. In order to describe the best possible clusters we could have generated from a database, we considered a lower limit on the exact k-mer overlap between single sequences in the nr database (Fig. 4). We generated this plot by taking 50 random samples of 3,000 nr sequences each, computing the pairwise k-mer intersection between them, and plotting a histogram of the magnitude of pairwise k-mer overlap. This shows that there are very few sequences in the nr database which have no k-mer overlap, which makes the generation of disjoint clusters impossible. Furthermore, this plot is optimistic in that it does not include BLAST's neighboring words, nor does it illustrate comparisons against cluster centers, which will have intersection greater than or equal to that of a single sequence.
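The sampling procedure behind Fig. 4 can be sketched as follows (the sample counts follow the text; the Counter-based framing and the reuse of presence_vector from the earlier sketch are ours). Note that this is exactly the kind of all-pairs computation that the scheme of Section 3.3 is designed to distribute.

```python
import random
from collections import Counter

def intersection_histogram(sequences, n_samples=50, sample_size=3000, k=3):
    """Histogram of exact pairwise k-mer intersection sizes across random samples."""
    hist = Counter()
    for _ in range(n_samples):
        sample = random.sample(sequences, sample_size)
        vecs = [presence_vector(s, k) for s in sample]
        for i in range(len(vecs)):
            for j in range(i + 1, len(vecs)):
                hist[len(vecs[i] & vecs[j])] += 1
    return hist
```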

Figure 4: A histogram of the number of exact pairwise k-mer intersections within a set of nr database sequences (frequency of occurrence versus number of k-mer intersections, shown for the original and the repeat-masked sequences). 50 samples of 3,000 random sequences were taken from nr and repeats were masked by SEG [20]. K-mer overlaps were plotted pairwise within each original sample and each masked sample. At a high level, this plot indicates k-mer variance in nr. Lower magnitudes of intersection imply better possible clusters and improved K-mulus performance.

In order to show the improvement offered by repeat masking, we ran SEG [20] on the sequences before computing the intersection. On average, SEG resulted in a 6% reduction in the number of exact k-mer overlaps between two given sequences. Repeat masking caused a significant, favorable shift in k-mer intersection and would clearly improve clustering results. However, the nr database had so much existing k-mer overlap that using SEG preprocessing would have almost no effect on the speed of K-mulus.

While we considered three different clustering algorithms in this paper, we determined that the cluster overlap was so excessive that any difference between the clustering outputs was negligible. For this reason, we exclusively used k-means clustering for our analysis.

6. FUTURE WORK
While K-mulus did not meet the goals set forth in this paper, it has great potential as a platform for improving distributed BLAST performance. The logical next step for K-mulus is nucleotide database indexing, which has historically had far more success than protein indexing in BLAST. With a four-character alphabet and simplified substitution rules, nucleotides are much easier to work with than amino acids, and allow for much more efficient hashing by avoiding the ambiguity inherent in amino acids. Furthermore, we expect a more random distribution of nucleotide k-mers than amino acid k-mers, which allows for better clustering. While we chose to pursue improvements to protein BLAST, in our analysis it has become clear that a K-mulus nucleotide BLAST not only has a promising outlook for a speedup, but could serve as a model for effectively executing the more complex task of protein clustering.
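To illustrate why a four-character alphabet hashes so much more cheaply than the amino-acid alphabet, the following is a sketch of the standard 2-bit packing of nucleotide k-mers; this is a textbook encoding, not code from K-mulus, and the word size shown is arbitrary.

```python
NUC_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack_kmer(kmer):
    """Pack a DNA k-mer into an integer, 2 bits per base (4^k possible values)."""
    code = 0
    for base in kmer:
        code = (code << 2) | NUC_BITS[base]
    return code

def nucleotide_index(seq, k=11):
    """Packed integer codes for every unambiguous k-mer of a nucleotide sequence."""
    return {pack_kmer(seq[i:i + k])
            for i in range(len(seq) - k + 1)
            if all(b in NUC_BITS for b in seq[i:i + k])}
```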

Our work did not include analysis of any clustering or indexing methods which would have resulted in a loss of BLAST search sensitivity. For example, K-mulus clustering might benefit from the positive results shown for protein k-mer indexing of large k-mers over compressed alphabets [17]. Another possible improvement would be to map queries according to percent k-mer identity to a cluster, or to raise the threshold for required k-mer overlap with a cluster index. While these approaches would reduce BLAST sensitivity, the trade-off with search speed may be favorable. The motivation for this work comes from our evidence that if protein clustering in K-mulus can be improved, large speedups can be achieved.

7. REFERENCES
[1] Amazon Elastic Compute Cloud. http://aws.amazon.com/ec2/.
[2] Human Microbiome Project. http://commonfund.nih.gov/hmp/, 2006.
[3] Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman, Basic local alignment search tool, Journal of Molecular Biology 215 (1990), 403–410.
[4] Dhruba Borthakur, The Hadoop Distributed File System: Architecture and Design, The Apache Software Foundation, 2007.
[5] Aaron E. Darling, Lucas Carey, and Wu-chun Feng, The design, implementation, and evaluation of mpiBLAST, Proceedings of ClusterWorld 2003, 2003.
[6] Jeffrey Dean and Sanjay Ghemawat, MapReduce: simplified data processing on large clusters, Communications of the ACM 51 (2008), 107–113.
[7] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, The Google file system, SIGOPS Operating Systems Review 37 (2003), 29–43.
[8] J. A. Hartigan and M. A. Wong, A k-means clustering algorithm, Applied Statistics 28 (1979), 100–108.
[9] Ying Li, Hong M. Luo, Chao Sun, Jing Y. Song, Yong Z. Sun, Qiong Wu, Ning Wang, Hui Yao, Andre Steinmetz, and Shi L. Chen, EST analysis reveals putative genes involved in glycyrrhizin biosynthesis, BMC Genomics 11 (2010), no. 1, 268.
[10] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, New York, NY, 2008.
[11] Andréa Matsunaga, Maurício Tsugawa, and José Fortes, CloudBLAST: Combining MapReduce and virtualization on distributed resources for bioinformatics applications, Fourth IEEE International Conference on eScience, 2008.
[12] C. Moretti, J. Bulosan, D. Thain, and P. J. Flynn, All-Pairs: An abstraction for data-intensive cloud computing, IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2008), 2008, pp. 1–11.
[13] Aleksandr Morgulis, George Coulouris, Yan Raytselis, Thomas L. Madden, Richa Agarwala, and Alejandro A. Schäffer, Database indexing for production MegaBLAST searches, Bioinformatics 24 (2008), 1757–1764.
[14] J. Murray, J. Larsen, T. E. Michaels, A. Schaafsma, C. E. Vallejos, and K. P. Pauls, Identification of putative genes in bean (Phaseolus vulgaris) genomic (Bng) RFLP clones and their conversion to STSs, Genome 45 (2002), no. 6, 1013–1024.
[15] NCBI, NCBI C++ Toolkit Book, http://www.ncbi.nlm.nih.gov/books/NBK7160/pdf/TOC.pdf.
[16] J. Schloss, E. Mitchell, M. White, R. Kukatla, E. Bowers, H. Paterson, and S. Kresovich, Characterization of RFLP probe sequences for gene discovery and SSR development in Sorghum bicolor (L.) Moench, Theoretical and Applied Genetics 105 (2002), no. 6-7, 912–920.
[17] Sergey A. Shiryev, Jason S. Papadopoulos, Alejandro A. Schäffer, and Richa Agarwala, Improved BLAST searches using longer words for protein seeding, Bioinformatics 23 (2007), no. 21, 2949–2951.
[18] Mark Van Der Laan, Katherine Pollard, and Jennifer Bryan, A new partitioning around medoids algorithm, Journal of Statistical Computation and Simulation 73 (2003), no. 8, 575–584.
[19] Hugh E. Williams and Justin Zobel, Indexing nucleotide databases for fast query evaluation, EDBT, 1996, pp. 275–288.
[20] John C. Wootton and Scott Federhen, Statistics of local complexity in amino acid sequences and sequence databases, Computers and Chemistry 17 (1993), no. 2, 149–163.