DryadLINQ for Scientific Analyses

Jaliya Ekanayake 1,a, Atilla Soner Balkir c, Thilina Gunarathne a, Geoffrey Fox a,b, Christophe Poulain d, Nelson Araujo d, Roger Barga d
a School of Informatics and Computing, Indiana University Bloomington
b Pervasive Technology Institute, Indiana University Bloomington
c Department of Computer Science, University of Chicago
d Microsoft Research
{jekanaya, tgunarat, gcf}@indiana.edu, [email protected], {cpoulain, nelson, barga}@microsoft.com

Abstract

Applying high-level parallel runtimes to data/compute-intensive applications is becoming increasingly common. The simplicity of the MapReduce programming model and the availability of open-source MapReduce runtimes such as Hadoop attract more users to the MapReduce programming model. Recently, Microsoft released DryadLINQ for academic use, allowing users to experience a new programming model and a runtime that is capable of performing large-scale data/compute-intensive analyses. In this paper, we present our experience in applying DryadLINQ to a series of scientific data analysis applications, identify their mapping to the DryadLINQ programming model, and compare their performances with Hadoop implementations of the same applications.

1. Introduction

Among the many applications that benefit from cloud technologies such as DryadLINQ [1] and Hadoop [2], the data/compute-intensive applications are the most important. The deluge of data and the highly compute-intensive applications found in many domains such as particle physics, biology, chemistry, finance, and information retrieval mandate the use of large computing infrastructures and parallel runtimes to achieve considerable performance gains. The support for handling large data sets, the concept of moving computation to the data, and the better quality of services provided by Hadoop and DryadLINQ make them a favorable choice of technologies with which to implement such problems.

Cloud technologies such as Hadoop with the Hadoop Distributed File System (HDFS), Microsoft DryadLINQ, and CGL-MapReduce [3] adopt a more data-centered approach to parallel processing. In these frameworks, the data is staged on the data/compute nodes of clusters or large-scale data centers, and the computations are shipped to the data in order to perform the processing. HDFS allows Hadoop to access data via a customized distributed storage system built on top of heterogeneous compute nodes, while DryadLINQ and CGL-MapReduce access data from local disks and shared file systems. The simplicity of these programming models enables better support for quality of services such as fault tolerance and monitoring. Although DryadLINQ comes with standard samples such as Terasort and word count, its applicability for large-scale data/compute-intensive scientific applications has not been studied well. A comparison of these programming models and their performances would benefit many users who need to select the appropriate technology for the problem at hand.

We have developed a series of scientific applications using DryadLINQ, namely the CAP3 DNA sequence assembly program [4], High Energy Physics (HEP) data analysis, CloudBurst [5] (a parallel seed-and-extend read-mapping application), and Kmeans Clustering [6]. Each of these applications has unique requirements for parallel runtimes. For example, the HEP data analysis application requires the ROOT [7] data analysis framework to be available on all the compute nodes, and in CloudBurst the framework needs to support processing different workloads at map/reduce/vertex tasks. We have implemented all these applications using DryadLINQ and Hadoop, and used them to compare the performance of these two runtimes. CGL-MapReduce and MPI are used in applications where the contrast in performance needs to be highlighted.

In the sections that follow, we first present the DryadLINQ programming model and its architecture in an HPC environment, along with a brief introduction to Hadoop.
In section 3, we discuss the data analysis applications and the challenges we faced in implementing them, along with a performance analysis of these applications. In section 4, we present the related work to this research, and in section 5 we present our conclusions and future work.

2. DryadLINQ and Hadoop

A central goal of DryadLINQ is to provide a wide array of developers with an easy way to write applications that run on clusters of computers to process large amounts of data. The DryadLINQ environment shields the developer from many of the complexities associated with writing robust and efficient distributed applications by layering the set of technologies shown in Figure 1.

Figure 1. DryadLINQ software stack

Working at his workstation, the programmer writes code in one of the managed languages of the .NET Framework using Language Integrated Queries (LINQ). The LINQ operators are mixed with imperative code to process data held in collections of strongly typed objects. A single collection can span multiple computers, thereby allowing for scalable storage and efficient execution. The code produced by a DryadLINQ programmer looks like the code for a sequential LINQ application. Behind the scenes, however, DryadLINQ translates LINQ queries into Dryad computations (Directed Acyclic Graph (DAG) based execution flows). While the Dryad engine executes the distributed computation, the DryadLINQ client application typically waits for the results to continue with further processing. The DryadLINQ system and its programming model are described in detail in [1].

This paper describes results obtained with the so-called Academic Release of Dryad and DryadLINQ, which is publicly available [8]. This newer version includes changes in the DryadLINQ API since the original paper. Most notably, all DryadLINQ collections are represented by the PartitionedTable<T> type. Hence, the example computation cited in Section 3.2 of [1] is now expressed as:

    var input = PartitionedTable.Get<LineRecord>("file://in.tbl");
    var result = MainProgram(input, …);
    var output = result.ToPartitionedTable("file://out.tbl");

As noted in Figure 1, the Dryad execution engine operates on top of an environment which provides certain cluster services. In [1][9], Dryad is used in conjunction with the proprietary Cosmos environment. In the Academic Release, Dryad operates in the context of a cluster running Windows High-Performance Computing (HPC) Server 2008. While the core of Dryad and DryadLINQ does not change, the bindings to a specific execution environment are different and may lead to differences in performance.

Table 1. Comparison of features supported by Dryad and Hadoop

Feature                         | Hadoop                                  | Dryad
Programming Model               | MapReduce                               | DAG-based execution flows
Data Handling                   | HDFS                                    | Shared directories / local disks
Intermediate Data Communication | HDFS / point-to-point via HTTP          | Files / TCP pipes / shared-memory FIFOs
Scheduling                      | Data locality / rack aware              | Data locality / network-topology-based run-time graph optimizations
Failure Handling                | Persistence via HDFS; re-execution of map and reduce tasks | Re-execution of vertices
Monitoring                      | Monitoring support for HDFS and MapReduce computations     | Monitoring support for execution graphs
Language Support                | Implemented using Java; other languages supported via Hadoop Streaming | Programmable via C#; DryadLINQ provides a LINQ programming API for Dryad

Apache Hadoop has an architecture similar to Google's MapReduce runtime. It accesses data via HDFS, which maps all the local disks of the compute nodes into a single file system hierarchy, allowing the data to be dispersed across all the data/computing nodes. Hadoop schedules the MapReduce computation tasks depending on data locality in order to improve the overall I/O bandwidth.
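To make the MapReduce model concrete, the following Java fragment is a minimal, single-process word count. It is an illustrative sketch only: the class and method names are our own, and it uses no Hadoop APIs, but its two phases mirror the map tasks and the grouped reduce tasks that Hadoop distributes across the cluster.

```java
import java.util.*;
import java.util.stream.*;

public class WordCountSketch {
    // "Map" phase: emit (word, 1) pairs from one input line,
    // as a Hadoop map task would for each record of its input split.
    static Stream<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                     .filter(w -> !w.isEmpty())
                     .map(w -> Map.entry(w, 1));
    }

    // "Reduce" phase: group intermediate pairs by key and sum their
    // values, standing in for the shuffle plus the reduce tasks.
    static Map<String, Integer> reduce(Stream<Map.Entry<String, Integer>> pairs) {
        return pairs.collect(Collectors.groupingBy(
                Map.Entry::getKey,
                Collectors.summingInt(Map.Entry::getValue)));
    }

    public static Map<String, Integer> wordCount(List<String> lines) {
        return reduce(lines.stream().flatMap(WordCountSketch::map));
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
            wordCount(List.of("the quick brown fox", "the lazy dog"));
        System.out.println(counts.get("the")); // 2
    }
}
```

In the real runtimes, the grouping step is where the frameworks differ most: Hadoop materializes the intermediate pairs on local disks and pulls them over HTTP, while Dryad can also stream them through TCP pipes or shared-memory FIFOs, as summarized in Table 1.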
The outputs of the map tasks are first stored on local disks until the reduce tasks access (pull) them via HTTP connections. Although this approach simplifies the fault-handling mechanism in Hadoop, it adds significant communication overhead to the intermediate data transfers, especially for applications that produce small intermediate results frequently. The current release of DryadLINQ also communicates using files, and hence we expect similar overheads in DryadLINQ as well. Table 1 presents a comparison of DryadLINQ and Hadoop on various features supported by these technologies.

3. Scientific Applications

In this section, we present the details of the DryadLINQ applications that we developed, the techniques we adopted in optimizing the applications, and their performance characteristics compared with Hadoop implementations. For all our benchmarks, we used two clusters with almost identical hardware

We developed a DryadLINQ application to perform the above data analysis in parallel. The DryadLINQ application executes the CAP3 executable as an external program, passing an input data file name and the other necessary program parameters to it. Since DryadLINQ executes CAP3 as an external executable, it must only know the input file names and their locations. We achieve the above functionality as follows: (i) the input data files are partitioned among the nodes of the cluster so that each node stores roughly the same number of input data files; (ii) a "data-partition" (a text file for this application) is created on each node containing the names of the original data files available on that node; (iii) a Dryad "partitioned-file" (a meta-data file for DryadLINQ) is created to point to the individual data-partitions located on the nodes of the cluster. Following the above steps, a DryadLINQ program can be developed to read the data file names from the provided partitioned-file, and execute the
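The driver pattern described above — read a list of input file names from a data-partition and run an external executable on each — can be sketched outside DryadLINQ. The Java fragment below is a hypothetical, minimal analogue: the data-partition is modeled as a list of file names, and the CAP3 invocation is abstracted behind a function so that only the embarrassingly parallel structure is shown (a real driver would launch the executable, e.g. via ProcessBuilder, and check its exit status).

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class Cap3StyleDriver {
    // Apply an external-program stand-in to every input file in parallel,
    // collecting each tool's per-file result keyed by the input file name.
    // 'runTool' abstracts the CAP3 invocation; the tasks are independent,
    // so they can run concurrently without coordination.
    public static Map<String, String> runOnPartition(List<String> inputFiles,
                                                     Function<String, String> runTool) {
        return inputFiles.parallelStream()
                         .collect(Collectors.toConcurrentMap(f -> f, runTool));
    }

    public static void main(String[] args) {
        // Illustrative names; a real data-partition file would list them.
        List<String> partition = List.of("seq0.fasta", "seq1.fasta", "seq2.fasta");
        Map<String, String> results =
            runOnPartition(partition, f -> "assembled:" + f);
        System.out.println(results.get("seq1.fasta")); // assembled:seq1.fasta
    }
}
```

Because each input file is processed independently, this "map-only" style of computation maps naturally onto DryadLINQ vertices or Hadoop map tasks alike.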