Applicability of Dryadlinq to Scientific Applications

Applicability of DryadLINQ to Scientific Applications Salsa Group, Pervasive Technology Institute, Indiana University http://salsaweb.ads.iu.edu/salsa/ Jan 30th 2010 Contents 1. Introduction ...................................................................................................................................... 4 2. Overview ........................................................................................................................................... 4 2.1 Microsoft DryadLINQ .................................................................................................................... 4 2.2 Apache Hadoop ............................................................................................................................. 5 2.3 Twister: Iterate MapReduce ......................................................................................................... 5 2.4 MPI ................................................................................................................................................ 5 3. Performance and Usability of Applications using DryadLINQ ........................................................... 7 3.1 EST (Expressed Sequence Tag) sequence assembly program using DNA sequence assembly program software CAP3. ........................................................................................................................... 7 3.1.1 Evaluations and Findings ........................................................................................................... 8 3.1.2 Inhomogeneity of data partitions and scheduling partitions to nodes .................................... 9 3.1.3 Threads vs. Processes (Issue No. 2) ........................................................................................ 11 3.2 Pairwise Alu sequence alignment using Smith Waterman GOTOH ............................................ 12 3.2.1 ALU Clustering ......................................................................................................................... 12 3.2.2 Smith Waterman Dissimilarities .............................................................................................. 12 3.2.3 The O(N2) Factor of 2 and structure of processing algorithm ................................................. 13 3.2.4 DryadLINQ Implementation ................................................................................................... 13 3.2.5 MPI Implementation ............................................................................................................... 14 3.2.6 Performance of Smith Waterman Gotoh SW-G Algorithm ..................................................... 14 3.2.7 Apache Hadoop Implementation ............................................................................................ 15 3.2.8 Performance comparison of DryadLINQ and Hadoop SW-G implementations ...................... 16 3.2.9 Inhomogeneous data study .................................................................................................... 17 3.3 PhyloD Application ...................................................................................................................... 19 3.3.1 PhyloD Algorithm .................................................................................................................... 19 3.3.2 DryadLINQ Implementation .................................................................................................... 19 3.3.3 DryadLINQ PhyloD Performance ............................................................................................. 21 3.4 HEP Processing large column of physics data using software Root and produce histogram results for data analysis. ......................................................................................................................... 23 3.4.1 Evaluations and Findings ......................................................................................................... 24 3.5 K-means Clustering ..................................................................................................................... 26 3.5.1 Evaluations and Findings ......................................................................................................... 26 3.5.2 Another Relevant Application - Matrix Multiplication ............................................................ 27 4. Analysis ........................................................................................................................................... 29 4.1 DryadLINQ vs. Other Runtimes ................................................................................................... 29 4.1.1 Handling Data .......................................................................................................................... 29 4.1.2 Parallel Topologies .................................................................................................................. 29 4.1.3 Twister: Iterate MapReduce ................................................................................................... 31 4.2 Performance and Usability of Dryad ........................................................................................... 31 4.2.1 Installation and Cluster Access ............................................................................................... 32 4.2.2 Developing and Deployment of Applications ........................................................................ 32 4.2.3 Debugging ............................................................................................................................... 32 4.2.4 Fault Tolerance........................................................................................................................ 33 4.2.5 Monitoring .............................................................................................................................. 33 5. Summary of key features of applications that suitable and not suitable for Dryad ....................... 34 References .............................................................................................................................................. 36 Appendix A .............................................................................................................................................. 38 Appendix B .............................................................................................................................................. 38 1. Introduction Applying high level parallel runtimes to data/compute intensive applications is becoming increasingly common. The simplicity of the MapReduce programming model and the availability of open source MapReduce runtimes such as Hadoop, are attracting more users to the MapReduce programming model. Recently, Microsoft has released DryadLINQ for academic use, allowing users to experience a new programming model and a runtime that is capable of performing large scale data/compute intensive analyses. The goal of our study is to explore the applicability of DryadLINQ to real scientific applications and compare its performance with other relevant parallel runtimes such as Hadoop. To achieve this goal we have developed a series of scientific applications using DryadLINQ, namely, CAP3 DNA sequence assembly program [1], Pairwise ALU sequence alignment, High Energy Physics(HEP)data analysis, and K-means Clustering [2]. Each of these applications has unique requirements for parallel runtimes. For example, the HEP data analysis application requires ROOT [3] data analysis framework to be available in all the compute nodes and in Pairwise ALU sequence alignment the framework must handle computing of distance matrix with hundreds of millions of points. We have implemented all these applications using DryadLINQ and Hadoop, and used them to compare the performance of these two runtimes. Twister and MPI are used in applications where the contrast in performance needs to be highlighted. In the sections that follow, we first present an overview of the different parallel runtimes we use in this analysis followed by a detailed discussion of the data analysis applications we developed. Here we discuss the mappings of parallel algorithms to the DryadLINQ programming model and present performance comparisons with Hadoop implementations of the same applications. In section 4 we analyze DryadLINQ’s programming model comparing it with other relevant technologies such as Hadoop and Twister. We also include a set of usability requirements for DryadLINQ. We present our conclusions in section 5. 2. Overview This section presents a brief introduction to a set of parallel runtimes we use our evaluations. 2.1 Microsoft DryadLINQ Dryad [4] is a distributed execution engine for coarse grain data parallel applications. Dryad considers computation tasks as directed acyclic graphs (DAG) where the vertices represent computation tasks and while the edges acting as communication channels over which the data flow from one vertex to another. In the HPC version of Dryad the data is stored in (or partitioned to) Windows shared directories in local compute nodes and a meta-data file is use to produce a description of the data distribution and replication. Dryad schedules the execution of vertices depending on the data locality. Dryad also stores the output of vertices in local disks, and the other vertices which depend on these results, access them via the shared directories. This enables Dryad to re-execute failed vertices, a step

Load more