Paper Title (Use Style: Paper Title) s44

Parallel Applications And Tools For Cloud Computing Environments

Judy Qiu1,2, Adam Lee Hughes2, Saliya Ekanayake 1,2 , Thilina Gunarathne1,2, Stephen Tak-lon Wu1,2, Hui Li1,2, Jong Youl Choi1,2, Seung-Hee Bae1,2, Yang Ruan1,2 1Pervasive Technology Institute 2School of Informatics and Computing Indiana University Bloomington IN, U.S.A {xqiu, adalhugh, sekanaya, tgunarat, taklwu, lihui, jychoi, sebae, yangruan}@indiana.edu

The main research focus of SALSA project is two-fold. Fir e master node handles the fault tolerance, task assignment and st, we investigate new programming models of parallel multico monitoring of Map & Reduce tasks among other responsibilitie re computing and Cloud/Grid computing. It aims aimed at deve s. Cloud environments are known to be more brittle than the tra loping and applying parallel and distributed Cyberinfrastructure ditional compute clusters and the cloud applications should be to support large scale data analysis. Second, we develop user-fr developed to expect and withstand the failures. Hence AzureM iendly and integrated Cloud computing environments and tools apReduce is designed as a decentralized controlled system avoi to provide our parallel algorithms and Cloud infrastructures as ding the possible single point of failures. AzureMapReduce als services. o provides users with the capability to dynamically scale up or down the number of compute instances, even in middle of a Ma To fulfill our research objectives, we have been researched pReduce computation, as and when based on the needs of the u various parallel applications and programming models, especial ser. The map & reduce tasks of the AzureMapReduce runtime a ly designed for real life computing-intensive data analysis. The re dynamically scheduled using a global queue. AzureMapRed most notable example is Twister , which is an iterative MapRed uce provides fault tolerance capabilities similar to the other Ma uce parallel programming model developed for Cloud computi pReduce run times such as Hadoop by rerunning of the failed o ng environments. In addition to those efforts, we have also dev r slow tasks. eloped integrated Cloud computing environments to extend our research efforts to real-life use cases. For this purpose, we have In this demo, we are going to demonstrate the usage of Azu built a portal web service to integrate our various Cloud applica reMapReduce for scientific computations with the use of sever tions and tools along with convenient workflow engines. al Bioinformatics application use cases. We showcase the viabi lity of AzureMapReduce and the ability of the underlying cloud We will demonstrate the following key research efforts of t infrastructure services to deliver comparable performance and e he SALSA project. fficiency with appropriately designed systems.

A. AzureMapReduce B. Large-scale PageRank with Twister AzureMapReduce is a distributed decentralized MapReduc PageRank is a well-studied web graph ranking algorithm. A e runtime for Windows Azure cloud infrastructure developed b s there is a data deluge in the internet, the study of appliance of y utilizing Azure cloud infrastructure services. Usage of cloud i PageRank in large graph is becoming increasingly popular. The nfrastructure services allows the AzureMapReduce implementa main challenges of large scale PageRank processing we faced a tion to take advantage of the scalability and the distributed natu re 1) most distributed job execution engines do not support iter re of such services guaranteed by the cloud service providers. ative map reduce well. 2) there is a significant amount of netwo AzureMapReduce is able to overcome the latencies of the clou rk traffic in large scale PageRank processing. We resolve this i d services by the use of sufficiently coarser grained map and re ssue by implementing PageRank with Twister , a lightweight it duce tasks. We overcome the eventual consistency nature of cl erative map reduce execution engine. Further, we optimize the oud services in AzureMapReduce by designing the system not t distribution of both Mappers and Reducers in Twister to decrea o rely on the immediate availability of data with the use of retr se the communication cost among them. Result shows that our ying. AzureMapReduce runtime currently uses Azure Queues f method outperforms tradition technology that process large scal or map/reduce task scheduling, Azure tables for metadata & m e PageRank job. onitoring data storage, Azure blob storage for input/output/inter mediate data storage and Window Azure Compute worker roles We evaluate Twister PageRank performance by using Clue to perform the computations. Web data set collected in January 2009. We built the adjacency matrix using this data set. The whole web graph are partitioned Google MapReduce , Hadoop as well as Twister MapRed into few parts and stored in the format of sub adjacency matrix, uce computations are centrally controlled using a master node a which is called static data in Twister. During the computation, nd assume master node failures to be rare. In these run times, th each partition of static data is cached in the memory; and it can hms MDS (Multi-dimensional Scaling) and GTM (Generative be used multiple times by the iterative Map task. This strategy Topographic Mapping) , and their variants, such as interpolatio brings much benefit as it does not need reload the static data fro n algorithms for MDS and GTM. m disk to memory in every iteration. With the use of PlotViz and parallel MDS and GTM, users can easily identify points of interests by colors or select a group C. SALSA Portal and Biosequence Analysis Workflow of points distinguished by structural distribution in 3D space. A The advent and continued refinement of modern high-throu dditional functions are included in browsing data by rotating, z ghput sequencing techniques have led to a proliferation of raw ooming, or panning the 3D space to search for more details. D biosequence data, as labs routinely generate millions of sequen ynamic updating labels of points or adding new data points are ce reads in a matter of days. Analyzing these results is beyond t also supported by integrating an external repository system, Ch he computational capacity of single-lab resources, necessitating em2Bio2RDF, by using the standard web query language, SPA the use of high-performance computing resources, which prese RQL. nts significant overhead in terms of manual data manipulation a nd job monitoring. To overcome this bottleneck, application pi Our tool can help very large and high-dimensional data anal pelines are being developed in conjunction with a generalized s ysis, which is very common in many life science researches. As ervice-based portal system to abstract the details of job creation a result, we have been successfully used PlotViz and parallel M and management. This infrastructure will eventually provide se DS and GTM with our internal collaborators for analyzing larg rvice interfaces to map-reduce technologies, leveraging the co e and high-dimensional life science data, including PubChem d mputing resources offered by traditional grids and clusters, as ata analysis for drug discoveries and genetic data analysis. well as emerging Cloud platforms. We demonstrate the following key features of PlotViz. We demonstrate the design and implementation of a softwa i) PlotViz which is a lightweight 3D data visualization clien re package to automate the setup steps for a specific sequence a t to browse large (up to a few millions) and high-dimensional d nalysis workflow and a generalized service-oriented architectur ata. e (SOA) to expose this pipeline and other analysis tools as web services. Then, a simple web interface for submitting and man ii) On-line data fetching by connecting a remote external sy aging jobs on high-performance computing platforms is present stem, Chem2Bio2RDF, which is an integrated repository of che ed as an example of the flexibility in building applications affor mogenomic and systems chemical biology data. ded by web service composition. iii) Research results for drug discovery with mining cause-e ffect relationship between large number of chemical compound D. PlotViz Visualization with parallel MDS/GTM s and diseases PlotViz is a 3D data point visualization tool for high-dimen sional data analysis. It is developed to visualize results of our i REFERENCES n-house high-performance parallel dimension reduction algorit