DryadLINQ for Scientific Analyses

Jaliya Ekanayake (1, a), Atilla Soner Balkir (c), Thilina Gunarathne (a), Geoffrey Fox (a, b), Christophe Poulain (d), Nelson Araujo (d), Roger Barga (d)

(a) School of Informatics and Computing, Indiana University Bloomington
(b) Pervasive Technology Institute, Indiana University Bloomington
(c) Department of Computer Science, University of Chicago
(d) Microsoft Research

{jekanaya, tgunarat, gcf}@indiana.edu, [email protected], {cpoulain,nelson,barga}@microsoft.com

Abstract

Applying high-level parallel runtimes to data/compute intensive applications is becoming increasingly common. The simplicity of the MapReduce programming model and the availability of open source MapReduce runtimes such as Hadoop attract more users to the MapReduce programming model. Recently, Microsoft has released DryadLINQ for academic use, allowing users to experience a new programming model and a runtime that is capable of performing large-scale data/compute intensive analyses. In this paper, we present our experience in applying DryadLINQ to a series of scientific data analysis applications, identify their mapping to the DryadLINQ programming model, and compare their

data-centered approach to parallel processing. In these frameworks, the data is staged in data/compute nodes of clusters or large-scale data centers and the computations are shipped to the data in order to perform data processing. HDFS allows Hadoop to access data via a customized distributed storage system built on top of heterogeneous compute nodes, while DryadLINQ and CGL-MapReduce access data from local disks and shared file systems. The simplicity of these programming models enables better support for quality of service features such as fault tolerance and monitoring.

Although DryadLINQ comes with standard samples such as Terasort and word count, its applicability to large-scale data/compute intensive scientific applications has not been well studied.
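As a rough illustration of the idea of staging data on nodes and shipping the computation to the data, the following Python sketch simulates a word count over partitions that are already placed on two nodes. It is not DryadLINQ, Hadoop, or CGL-MapReduce code: the node names, the `staged_partitions` layout, and the helper functions are all hypothetical and only stand in for what a real runtime would do across machines.

```python
from collections import Counter

# Hypothetical layout: each node name maps to the text partitions
# already staged on that node's local disk.
staged_partitions = {
    "node-a": ["dryad linq dryad", "hadoop"],
    "node-b": ["hadoop dryad"],
}


def count_words(partition):
    """Map step: count the words in a single partition."""
    return Counter(partition.split())


def run_on_node(node, partitions):
    """Simulate shipping count_words to one node and combining its local results.

    In a real runtime this would execute on the machine named by `node`;
    here the name is only kept to make the placement explicit.
    """
    local = Counter()
    for partition in partitions:
        local.update(count_words(partition))
    return local


def word_count(staged):
    """Reduce step: merge the small per-node summaries instead of moving raw data."""
    total = Counter()
    for node, partitions in staged.items():
        total.update(run_on_node(node, partitions))
    return total


totals = word_count(staged_partitions)
```

Only the per-node counters cross the simulated node boundary, which is the point of the data-centered approach: the summaries are typically far smaller than the partitions they were computed from.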