15-712 Project Proposal

Anthony Gitter, Alex Grubb, and Jeffrey Barnes
October 8, 2007

1 Background

In virtually all scientific disciplines, technological advances over the past years and decades have resulted in exponential growth in the quantity of raw scientific data available for analysis. Massive datasets, ranging from hundreds of MBs to many TBs, are increasingly common and available to the public. Spanning the breadth of scientific fields, this publicly accessible information includes NASA astrophysics and astronomy collections [1], partial three-dimensional mappings of the universe [2], earthquake and seismic data [3], complete genomes for a multitude of organisms [4], and so on.

However, collecting and publishing raw data is only the beginning of scientific understanding. Unfortunately, the enormous benefits to be gained by thoroughly analyzing these datasets are typically matched by the complexity of such analysis. For instance, sequence alignment is a powerful technique for predicting the function of newly discovered genes [5], but computational limitations prevent the best available algorithms from being used on full datasets. Supercomputers and customized high-performance solutions such as the CLC Bioinformatics Cube [6] can sometimes be employed, but the cost of such solutions is prohibitive. Furthermore, for some classes of scientific problems involving huge datasets (including those mentioned above), processing power can be largely wasted even on a supercomputer if the computation is disk-bound.

With the advent of multi-core personal computers, clusters of commodity machines have the potential to become a reasonable alternative platform for intense scientific computing. Using clusters of readily available machines in place of supercomputers or problem-specific high-performance machines could not only reduce costs, power consumption, and possibly execution time, but also open the multitude of rich, massive datasets to a wider community of researchers, conceivably accelerating the rate of scientific discovery.

2 Project Overview

We aim to demonstrate that a cluster of standard computers can be used instead of supercomputers for scientific computing over very large datasets. Because our objective is not to derive new results from these datasets, we will select a well-known problem that 1) requires a supercomputer to solve or 2) has only been partially solved because the best algorithms running on typical computers cannot handle the complete dataset. By adapting the target algorithms to take advantage of a cluster and exploiting parallelism in this environment, we expect to reduce cost or power consumption in case 1) or increase the scale of problems that can be solved in case 2). While we have yet to decide on one particular problem and dataset to target, we present two viable alternatives below.

3 Sequence Alignment

Sequence alignment allows biological researchers to identify similar genes and develop hypotheses about common functionality between sequentially similar, or homologous, genes [7]. The problem is non-trivial because genetic mutations may introduce gaps in sequences that are otherwise very similar, and the size of the peptide and nucleotide sequence databases to be searched continues to grow dramatically [8]. The Smith-Waterman algorithm [9] is recognized as one of the best, if not the best, sequence alignment algorithms, but when used to search entire sequence databases (the typical use case), it is unacceptably slow. Therefore, BLAST [5], an approximation of the Smith-Waterman algorithm, has emerged as the preferred way to perform sequence alignment because it accommodates large sequence databases.

However, the company CLC bio claims that BLAST “can potentially miss up to 50% of what you are searching for” [6]. Their solution is a specialized high-performance computer, the CLC Bioinformatics Cube, which is reported to run Smith-Waterman over 100 times faster than a 3.0 GHz Pentium desktop computer. While pricing is not provided on their corporate web site, this custom machine is most likely expensive and has limited use (if any) outside of bioinformatics applications.

Thus, there is a significant gap in the options available to biologists who wish to perform sequence alignment. They can run an approximate algorithm, sacrificing the quality of their results, or they can invest in a (presumably expensive) single-purpose machine to run the best available algorithm on all obtainable data. If we focus on this particular problem, we will seek to eliminate this gap by implementing the Smith-Waterman algorithm on a cluster of standard machines with a runtime that is reasonably close to that of BLAST.
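To give a sense of the computation involved, below is a minimal sketch of the Smith-Waterman scoring recurrence in Java with a linear gap penalty. The match, mismatch, and gap scores and the example sequences are illustrative placeholders, not parameters taken from BLAST, the Cube, or any other cited tool. Filling the dynamic-programming matrix costs O(mn) time for a length-m query against a length-n database sequence, which is why exact search over a full database is so expensive; scoring against different database sequences is independent, however, which is the parallelism a cluster implementation would exploit.

```java
// Minimal sketch of Smith-Waterman local alignment scoring (linear gap
// penalty). Scoring constants and example sequences are illustrative only.
public class SmithWaterman {
    static final int MATCH = 2, MISMATCH = -1, GAP = -1;

    // Returns the best local alignment score between sequences a and b.
    static int score(String a, String b) {
        int[][] h = new int[a.length() + 1][b.length() + 1];
        int best = 0;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int diag = h[i - 1][j - 1]
                        + (a.charAt(i - 1) == b.charAt(j - 1) ? MATCH : MISMATCH);
                int up = h[i - 1][j] + GAP;     // gap in sequence b
                int left = h[i][j - 1] + GAP;   // gap in sequence a
                h[i][j] = Math.max(0, Math.max(diag, Math.max(up, left)));
                best = Math.max(best, h[i][j]); // local alignment: track maximum cell
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(score("ACACACTA", "AGCACACA"));
    }
}
```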

4 Machine Learning Over Massive Datasets

With the increased popularity of machine learning techniques, the number of problems being solved with statistical data mining has grown rapidly. Many groups, including Carnegie Mellon’s Auton Lab [10], are working specifically to make these problems computationally feasible and efficiently solvable, due to the broad applicability of these techniques and the increasing need to apply them to massive sets of data. In many of these cases, the large datasets cannot be usefully evaluated by humans due to their enormous scale, and machine learning techniques will be necessary to discover the properties and phenomena embodied in them. This is especially true of datasets generated by observing natural processes and systems, such as the computational biology data mentioned above and the astronomical data mentioned in [2].

The Sloan Sky Survey astronomical data provides a motivating example of the need for efficient systems for solving these problems. The raw data contained in this dataset is over 10 TB of image data covering approximately 25% of the night sky. Further processed forms of this data exist and are used for analysis, but even these forms often comprise multiple gigabytes of data. A number of initial efforts to mine this data have been made [11], and smaller subsets of the dataset are often used to test the efficiency of data mining techniques [12]. Currently, much of the processing done on this dataset in particular takes the form of nearest-neighbor matching on the processed data to discover clusters of similar galaxies (a naive sketch of this operation appears below), because many other data mining methods are computationally infeasible with current techniques.

Clearly, further investigation is needed here; we plan to meet with researchers from the Auton Lab to get a better idea of which problems would benefit most from efficient parallelization and a proper distributed architecture.
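As a deliberately naive illustration of the nearest-neighbor operation mentioned above, the Java sketch below answers a single query by scanning every stored feature vector. The vectors, their dimensionality, and the use of Euclidean distance are placeholder assumptions rather than actual survey attributes; the point is that each query touches the entire dataset, which is the cost that becomes prohibitive at survey scale and that a distributed implementation would spread across machines.

```java
// Brute-force nearest-neighbor lookup over a small set of feature vectors.
// Data values and dimensionality are made-up placeholders.
public class NearestNeighbor {
    // Squared Euclidean distance between two feature vectors of equal length.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Returns the index of the dataset point closest to the query vector.
    static int nearest(double[][] data, double[] query) {
        int best = -1;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < data.length; i++) {
            double d = squaredDistance(data[i], query);
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] data = { { 0.1, 2.3 }, { 1.5, 0.2 }, { 3.0, 3.1 } };
        System.out.println(nearest(data, new double[] { 1.4, 0.5 })); // prints 1
    }
}
```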

5 Parallelization

The essence of our project is to adapt a resource-intensive, data-driven problem to take advantage of a highly parallel architecture. This section describes how we will do so and, in particular, what existing software frameworks we might use to implement a solution.

One of the most famous frameworks relevant to our task is MapReduce [13], a programming model from Google. The basic idea is that system developers (i.e., us) frame their computation in terms of a map function and a reduce function. The map function takes a key/value pair as input, processes it, and generates a set of zero or more intermediate key/value pairs. The reduce function, which the framework invokes once for each intermediate key, aggregates the intermediate values associated with that key. In principle, expressing the solution in terms of these primitives is all the developer needs to do. Lower-level tasks, such as partitioning the input data, scheduling execution events, and handling inter-process communication, are all automated. MapReduce is especially well suited to parallel computations over large (perhaps many terabytes) data sets, which is exactly the domain with which we are dealing.
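To make the map/reduce contract concrete, here is a minimal, single-process word-count sketch in Java; word counting is the standard illustrative example, and the document names, text, and in-memory “shuffle” grouping are our own placeholders. This is not Hadoop API code, only an illustration of the two functions a developer supplies; a real framework would partition the input, run map and reduce tasks on different machines, and perform the grouping over the network.

```java
import java.util.*;

// Conceptual, single-process sketch of the MapReduce programming model.
public class MapReduceSketch {

    // map: (documentName, documentText) -> list of intermediate (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String docName, String text) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<Map.Entry<String, Integer>>();
        for (String word : text.split("\\s+")) {
            pairs.add(new AbstractMap.SimpleEntry<String, Integer>(word, 1));
        }
        return pairs;
    }

    // reduce: (word, list of counts) -> total count for that word
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, String> input = new LinkedHashMap<String, String>();
        input.put("doc1", "the quick brown fox");
        input.put("doc2", "the lazy dog and the fox");

        // "Shuffle" step, simulated in memory: group intermediate values by key.
        Map<String, List<Integer>> grouped = new TreeMap<String, List<Integer>>();
        for (Map.Entry<String, String> doc : input.entrySet()) {
            for (Map.Entry<String, Integer> pair : map(doc.getKey(), doc.getValue())) {
                if (!grouped.containsKey(pair.getKey())) {
                    grouped.put(pair.getKey(), new ArrayList<Integer>());
                }
                grouped.get(pair.getKey()).add(pair.getValue());
            }
        }

        // The framework would call reduce once per intermediate key.
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            System.out.println(entry.getKey() + "\t" + reduce(entry.getKey(), entry.getValue()));
        }
    }
}
```

For our project, the map and reduce bodies would instead contain the per-chunk and per-key portions of whichever scientific computation we ultimately select.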

The Google implementation of MapReduce is written in C++ and is proprietary. However, open-source implementations are available, of which the most mature by far is Hadoop [14, 15], a Java software framework that implements MapReduce. Hadoop is part of the Lucene project [16], a popular Java information retrieval library supported by the Apache Software Foundation. Hadoop boasts the ability to scale up to thousands of nodes and petabytes of data. The Hadoop implementation of MapReduce runs on top of the Hadoop Distributed File System (HDFS) [17], which differs from other distributed file systems in that it is intended to be deployed on low-cost commodity hardware that is subject to failure. In addition, HDFS is tuned to support typical file sizes of gigabytes or terabytes.

In short, Hadoop is ideally suited to our aims. It presents a high-level framework for the development of highly parallel software, abstracting away the messiest implementation details, while offering impressive performance and scalability. These are important concerns, since we may indeed be running applications on terabytes of data and dozens of cores. The advantage we gain by utilizing an existing framework that offers a simple interface for highly parallel programming can hardly be overstated, given the limited time available to us for implementation and our team members’ lack of experience programming applications distributed across more than one or two cores.

6 Hardware

For implementation and testing, we intend to take advantage of the Intel cluster that Prof. David Andersen mentioned. We do not yet know the technical details of these machines, but we believe the cluster has dozens of cores, which should provide a satisfactory testbed for our implementation.

7 Timeline

This section presents a list of tasks that we expect to constitute our project. This serves two purposes. First, it provides a more concrete description of what our project will entail. Second, it provides us with a set of milestones by which to judge our progress during the course of the semester (and targets for which to aim). In total, there are slightly more than eight weeks between the submittal of this proposal and the scheduled presentation of our project. We will allocate our time approximately as follows.

Selection of the data set and learning algorithm; literature search (1–2 weeks) Our most immediate task will be to decide from among the data sets we mentioned in Section 3 and to pick a machine learning problem suited to the data, as described in Section 4. We must pick a problem that is computationally feasible but difficult to carry out on conventional architectures. Our data set must be neither too small nor too big, and our learning problem must be neither too hard (computationally) nor too easy. In addition, once we have pinned down the problem, a deeper literature search is in order.

Algorithm development (1–2 weeks) Once we have identified a problem, the next step is to devise a solution that takes advantage of the hardware available to us. In particular, we expect to take an existing machine learning algorithm and adapt it to take advantage of a parallel architecture. This is the most “creative” phase of the project in the sense that it requires the invention of a novel computational technique. As a consequence, it is difficult to predict how long this will take. However, it may be the least systems-oriented portion of the project, so our emphasis lies elsewhere.

Implementation (3–4 weeks) We have allotted the greatest amount of time for implementing our algorithm to run on a real system. Of course, coding the logic of the algorithm itself is unlikely to be the greatest task. Rather, there is a great deal of work to do in figuring out how to load, use, and store data; how to handle input and output; and even such basic questions as how one writes code that runs on dozens of cores. Relying on an existing framework such as Hadoop should alleviate much of this burden.

Testing (1–2 weeks) Once our implementation is complete, we need to test it on our target machines. Since we do not yet even know what data set we will use, it is hard to say what these tests will entail.

Evaluation (1 week or less) Finally, we must evaluate our results to assess how our system performed. We will want to compare our metrics (e.g., time to completion or cost per performance) to existing results. (If no existing results are available, we would want to either calculate or measure empirically the performance of our system on, say, a powerful single-core machine.)
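As one concrete way to frame such a comparison (assuming we can measure a baseline completion time $T_1$ on a single reference machine and a completion time $T_p$ on our cluster using $p$ cores), we could report

\[ S = \frac{T_1}{T_p}, \qquad E = \frac{S}{p}, \]

where $S$ is speedup and $E$ is parallel efficiency. Cost per performance could then be approximated by dividing each platform’s purchase price (or power draw) by its throughput on the same workload.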

Presentation authoring (days) Project presentations are scheduled for the week of December 3. We will need to write and prepare a PowerPoint presentation.

Report composition (days) Finally, written reports are due on December 12. We have at most a week between the presentation and the report deadline. However, we should be writing things up throughout the semester, reducing our load in the final week.

References

[1] Obtaining data from the NSSDC. Government Web site, 2007. URL http://nssdc.gsfc.nasa.gov/nssdc/obtaining_data.html.

[2] Sloan Digital Sky Survey. Project Web site, 2006. URL http://www.sdss.org/background/.

[3] Earthquake Hazards Program: Scientific data. Government Web site, 2007. URL http://earthquake.usgs.gov/research/topics.php?areaID=13.

[4] Genome project statistics. Government Web site, 2007. URL http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html.

[5] Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403–410, 1990.

[6] CLC Bioinformatics Cube. Corporate Web site, 2007. URL http://www.clccube.com/.

[7] Similarity searching. Government Web site, 2000. URL http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/similarity.html.

[8] Databases available for BLAST search. Government Web site, 2007. URL http://www.ncbi.nlm.nih.gov/blast/blast_databases.shtml.

[9] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195–197, 1981.

[10] Auton Lab. University Web site, 2007. URL http://www.autonlab.org/autonweb/2.html.

[11] S. G. Djorgovski, R. R. de Carvalho, S. C. Odewahn, R. R. Gal, J. Roden, P. Stolorz, and A. Gray. Data-mining a large digital sky survey: From the challenges to the scientific results. Applications of Digital Image Processing, XX:98–109, 1997.

[12] Ting Liu, Andrew W. Moore, Alexander Gray, and Ke Yang. An investigation of practical approximate nearest neighbor algorithms. In Advances in Neural Information Processing Systems (NIPS), December 2004.

[13] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI ’04: Sixth Symposium on Operating System Design and Implementation, San Francisco, December 2004. URL http://labs.google.com/papers/mapreduce-osdi04.pdf.

[14] Hadoop. Project Web site, 2007. URL http://lucene.apache.org/hadoop/.

[15] Hadoop project description. Project wiki page, 2007. URL http://wiki.apache.org/lucene-hadoop/ProjectDescription.

[16] Lucene. Project Web site, 2007. URL http://lucene.apache.org/.

[17] Dhruba Borthakur. The Hadoop Distributed File System: Architecture and design. Technical report, Apache, 2007. URL http://lucene.apache.org/hadoop/hdfs_design.html.
