Data-Intensive Computing with Hadoop
15-712 Project Proposal
Anthony Gitter, Alex Grubb, and Jeffrey Barnes
October 8, 2007

1 Background

In virtually all scientific disciplines, technological advances over the past years and decades have resulted in an exponential growth in the quantity of raw scientific data available for analysis. Massive datasets, ranging from hundreds of MBs to many TBs, are increasingly common and available to the public. Spanning the breadth of scientific fields, this publicly accessible information includes NASA astrophysics and astronomy collections [1], partial three-dimensional mappings of the universe [2], earthquake and seismic data [3], complete genomes for a multitude of organisms [4], and so on.

However, collecting and publishing raw data is only the beginning of scientific understanding. Unfortunately, the enormous benefits to be gained by thoroughly analyzing these datasets are typically matched by the complexity of such analysis. For instance, sequence alignment is a powerful technique for predicting the function of newly discovered genes [5], but computational limitations prevent the best algorithms available from being used on full datasets. Supercomputers and customized high-performance solutions such as the CLC Bioinformatics Cube [6] can sometimes be employed, but the cost of such solutions is prohibitive. Furthermore, for some classes of scientific problems involving huge datasets (including those mentioned above), processing power can be largely wasted even on a supercomputer if the computation is disk-bound.

With the advent of multi-core personal computers, clusters of commodity machines have the potential to become a reasonable alternative platform for intense scientific computing. Using clusters of readily available machines in place of supercomputers or problem-specific high-performance machines could not only reduce costs, power consumption, and possibly execution time, but also open the multitude of rich, massive datasets to a wider community of researchers, conceivably accelerating the rate of scientific discovery.

2 Project Overview

We aim to demonstrate that a cluster of standard computers can be used instead of supercomputers for scientific computing over very large datasets. Because our objective is not to derive new results from these datasets, we will select a well-known problem that 1) requires a supercomputer to solve or 2) has only been partially solved because the best algorithms running on typical computers cannot handle the complete dataset. By adapting the target algorithms to take advantage of a cluster and exploiting parallelism in this environment, we expect to reduce cost or power consumption in case 1) or increase the scale of problems that can be solved in case 2). While we have yet to decide on one particular problem and dataset to target, we present two viable alternatives below.

3 Sequence Alignment

Sequence alignment allows biological researchers to identify similar genes and develop hypotheses about common functionality between sequentially similar or homologous genes [7]. The problem is non-trivial because genetic mutations may introduce gaps in sequences that are otherwise very similar, and the size of the peptide and nucleotide sequence databases to be searched continues to grow dramatically [8]. The Smith-Waterman algorithm [9] is recognized as one of the best sequence alignment algorithms, if not the very best, but when used with entire sequence databases (the typical use case), it is unacceptably slow.
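To make the cost of exhaustive alignment concrete, the following is a minimal, unoptimized sketch of the Smith-Waterman local alignment recurrence. The class name and scoring parameters (match, mismatch, and linear gap penalties) are our own illustrative placeholders rather than values used by any particular tool, and a production implementation would also perform traceback and typically use affine gap penalties and a substitution matrix.

```java
/**
 * Minimal sketch of the Smith-Waterman local alignment recurrence.
 * Scoring values are illustrative placeholders only.
 */
public class SmithWatermanSketch {

    static final int MATCH = 2;      // reward for aligning identical characters
    static final int MISMATCH = -1;  // penalty for aligning different characters
    static final int GAP = -1;       // linear gap penalty

    /** Returns the best local alignment score between sequences a and b. */
    static int score(String a, String b) {
        int[][] h = new int[a.length() + 1][b.length() + 1];
        int best = 0;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int sub = (a.charAt(i - 1) == b.charAt(j - 1)) ? MATCH : MISMATCH;
                int cell = Math.max(0, h[i - 1][j - 1] + sub); // extend a match/mismatch
                cell = Math.max(cell, h[i - 1][j] + GAP);      // gap in sequence b
                cell = Math.max(cell, h[i][j - 1] + GAP);      // gap in sequence a
                h[i][j] = cell;
                best = Math.max(best, cell);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Each query/database pair requires filling a full O(m*n) table.
        System.out.println(score("ACACACTA", "AGCACACA"));
    }
}
```

Filling the table for one pair of sequences is quick, but repeating it for a query against every entry in a database of millions of sequences is what makes exhaustive searches impractical on a single machine.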
Therefore, BLAST [5], an approximation of the Smith-Waterman algorithm, has emerged as the preferred way to perform sequence alignment because it accommodates large sequence databases. However, the bioinformatics corporation CLC bio claims that BLAST “can potentially miss up to 50% of what you are searching for” [6]. Their solution is a specialized high-performance computer, the CLC Bioinformatics Cube, which is reported to run Smith-Waterman over 100 times faster than a 3.0 GHz Pentium desktop computer. While pricing is not provided on their corporate web site, this custom machine is most likely expensive and has limited use (if any) outside of bioinformatics applications. Thus, there is a substantial gap in the options available to biologists who wish to perform sequence alignment: they can run an approximate algorithm, sacrificing the quality of their results, or they can invest in a (presumably expensive) single-purpose machine to run the best available algorithm on all obtainable data. If we focus on this particular problem, we will seek to eliminate this gap by implementing the Smith-Waterman algorithm on a cluster of standard machines with a runtime that is reasonably close to that of BLAST.

4 Machine Learning Over Massive Datasets

With the increased popularity of machine learning techniques, the number of problems being solved with statistical data mining has grown rapidly. Many groups, including Carnegie Mellon’s Auton Lab [10], are working specifically to make these problems computationally feasible and efficiently solvable, due to the broad applicability of these techniques and the increasing need to apply them to massive sets of data. In many of these cases, the datasets cannot be usefully evaluated by humans because of their enormous scale, and machine learning techniques will be necessary to discover the properties and phenomena embodied in them. This is especially true of data generated by observing natural processes and systems, such as the computational biology datasets mentioned above and the astronomical data mentioned in [2].

The Sloan Sky Survey astronomical data provide a motivating example of the need for efficient systems for solving these problems. The raw data contained in this dataset is over 10 TB of image data covering approximately 25% of the night sky. Further processed forms of this data exist and are used for analysis, but even these forms are often multiple gigabytes in size. A number of initial research efforts into mining this data have been made [11], and smaller subsets of the dataset are often used to test the efficiency of data mining techniques [12]. Currently, much of the processing done on this dataset in particular takes the form of nearest-neighbor matching on the processed datasets to discover clusters of similar galaxies, because many other data mining approaches are computationally infeasible with current methods. Further investigation is needed here, and a meeting with researchers from the Auton Lab is planned to get a better idea of which problems would benefit most from efficient parallelization and a proper distributed architecture, but the need for efficient systems for solving these problems is clear.

5 Parallelization

The essence of our project is to adapt a resource-intensive, data-driven problem to take advantage of a highly parallel architecture. This section describes how we will do so and, in particular, what existing software frameworks we might avail ourselves of in order to implement a solution.

One of the most famous frameworks relevant to our task is MapReduce [13], a programming model from Google. The basic idea is that system developers (i.e., us) frame their problem in terms of a map function and a reduce function. The map function takes a key/value pair as input, processes it, and generates a set of zero or more intermediate key/value pairs. The reduce function, which is invoked once for each intermediate key, aggregates the intermediate values associated with that key. In principle, expressing the solution in terms of these primitives is all the developer needs to do; lower-level tasks, such as partitioning the input data, scheduling execution events, and handling inter-process communication, are all automated. MapReduce is especially well suited to parallel computations over large (perhaps many terabytes) data sets, which is exactly the domain with which we are dealing.
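As a concrete, if toy, illustration of this programming model, the sketch below counts length-k subsequences (k-mers) across a small collection of sequences using a map function and a reduce function. The class and method names are ours, and the sequential runJob driver merely simulates the shuffle/group-by-key step that a real framework would distribute across many machines.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy k-mer counting expressed as map and reduce functions. */
public class KmerCountExample {

    static final int K = 3;  // length of the subsequences to count

    /** Map: (sequenceId, sequence) -> list of (kmer, 1) intermediate pairs. */
    static List<Map.Entry<String, Integer>> map(String sequenceId, String sequence) {
        // The input key (sequenceId) is unused in this toy task.
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (int i = 0; i + K <= sequence.length(); i++) {
            out.add(Map.entry(sequence.substring(i, i + K), 1));
        }
        return out;
    }

    /** Reduce: called once per intermediate key, aggregates all of its values. */
    static int reduce(String kmer, List<Integer> counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        return total;
    }

    /** Sequential stand-in for the framework: shuffle intermediate pairs by key, then reduce. */
    static Map<String, Integer> runJob(Map<String, String> input) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> record : input.entrySet()) {
            for (Map.Entry<String, Integer> pair : map(record.getKey(), record.getValue())) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> input = Map.of("seq1", "ACGTACGT", "seq2", "ACGTTT");
        System.out.println(runJob(input));
    }
}
```

The important point is that the map and reduce bodies contain no parallelism or communication logic; in a real deployment, the framework supplies data partitioning, scheduling, and the shuffle.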
The Google implementation of MapReduce is in C++ and is proprietary. However, open-source implementations are available, of which the most mature by far is Hadoop [14, 15], a Java software framework that implements MapReduce. Hadoop is part of the Lucene project [16], a popular Java information retrieval library supported by the Apache Software Foundation. Hadoop boasts the ability to scale up to thousands of nodes and petabytes of data. The Hadoop implementation of MapReduce runs on top of the Hadoop Distributed File System (HDFS) [17], which differs from other distributed file systems in that it is intended to be deployed on low-cost commodity hardware that is subject to failure. In addition, HDFS is tuned to support typical file sizes of gigabytes or terabytes.

In short, Hadoop is ideally suited to our aims. It presents a high-level framework for the development of highly parallel software, abstracting away the messiest implementation details while offering impressive performance and scalability. These are important concerns, since we may indeed be running applications on terabytes of data and dozens of cores. The advantage we gain by utilizing an existing framework that offers a simple interface for highly parallel programming can hardly be overstated, given the limited time available to us for implementation and our team members’ lack of experience programming applications distributed across more than one or two cores.

6 Hardware

For implementation and testing, we intend to take advantage of the Intel cluster that Prof. David Andersen mentioned. We do not know the technical details of these machines, but, if we are not mistaken, the cluster has dozens of cores, which should provide a satisfactory testbed for our implementation.

7 Timeline

This section presents a list of tasks that we expect to constitute our project. The list serves two purposes. First, it provides a more concrete description of what our project will entail. Second, it provides us with a set of milestones by which to judge our progress during the course of the semester (and targets for which to aim). In total, there are slightly more than eight weeks between the submission of this proposal and the scheduled presentation of our project.