Granules: A Lightweight, Streaming Runtime for Cloud Computing With Support for Map-Reduce
Shrideep Pallickara (1,2), Jaliya Ekanayake (2), and Geoffrey Fox (2)
(1) Department of Computer Science, Colorado State University, Fort Collins, USA
(2) Community Grids Lab, Indiana University, Bloomington, USA
[email protected] {jekanaya, gcf}@indiana.edu
Abstract— Cloud computing has gained significant traction in recent years. The Map-Reduce framework is currently the most dominant programming model in cloud computing settings. In this paper, we describe Granules, a lightweight, streaming-based runtime for cloud computing which incorporates support for the Map-Reduce framework. Granules provides rich lifecycle support for developing scientific applications with support for iterative, periodic and data driven semantics for individual computations and pipelines. We describe our support for variants of the Map-Reduce framework. The paper presents a survey of related work in this area. Finally, this paper describes our performance evaluation of various aspects of the system, including (where possible) comparisons with other comparable systems.

Keywords- map-reduce, cloud computing, streaming, cloud runtimes, and content distribution networks

I. INTRODUCTION

Cloud computing has gained significant traction in recent years. By facilitating access to an elastic (meaning the available resource pool can expand or contract over time) set of resources, cloud computing has demonstrable applicability to a wide range of problems in several domains.

Appealing features within cloud computing include access to a vast number of computational resources and inherent resilience to failures. The latter feature arises because in cloud computing the focus of execution is not a specific, well-known resource but rather the best available one. Another characteristic of many programs written for cloud computing is that they tend to be stateless. Thus, when failures do take place, the appropriate computations are simply re-launched with the corresponding datasets.

Among the forces that have driven the need for cloud computing are falling hardware costs and burgeoning data volumes. The ability to procure cheaper, more powerful CPUs, coupled with improvements in the quality and capacity of networks, has made it possible to assemble clusters at increasingly attractive prices. The proliferation of networked devices, internet services, and simulations has resulted in large volumes of data being produced. This, in turn, has fueled the need to process and store vast amounts of data. These data volumes cannot be processed by a single computer or a small cluster of computers. Furthermore, in most cases, this data can be processed in a pleasingly parallel fashion. The result has been the aggregation of large numbers of commodity hardware components in vast data centers.

Map-Reduce [1], introduced by Dean and Ghemawat at Google, is the most dominant programming model for developing applications in cloud settings. Here, large datasets are split into smaller, more manageable sizes, which are then processed by multiple map instances. The results produced by individual map functions are then sent to reducers, which collate these partial results to produce the final output. A clear benefit of such concurrent processing is a speed-up that is proportional to the number of computational resources. Map-Reduce can be thought of as an instance of the SPMD [2] programming model for parallel computing introduced by Frederica Darema. Applications that can benefit from Map-Reduce include data- and/or task-parallel algorithms in domains such as information retrieval, machine learning, graph theory and visualization, among others.

In this paper we describe Granules [3], a lightweight streaming-based runtime for cloud computing. Granules allows processing tasks to be deployed on a single resource or a set of resources. Besides the basic support for Map-Reduce, we have incorporated support for variants of the Map-Reduce framework that are particularly suitable for scientific applications. Unlike most Map-Reduce implementations, Granules utilizes streaming for disseminating intermediate results, as opposed to using file-based communications. This leads to demonstrably better performance (see benchmarks in section 6).

This paper is organized as follows. In section 2, we provide a brief overview of the NaradaBrokering substrate that we use for streaming. We discuss some of the core elements of Granules in section 3. Section 4 outlines our support for Map-Reduce and for the creation of complex computational pipelines. In section 5, we describe related work in this area. In section 6, we profile several aspects of the Granules runtime, and where possible contrast its performance with comparable systems such as Hadoop, Dryad and MPI. In section 7, we present our conclusions.

II. NARADABROKERING

Granules uses the NaradaBrokering [4-6] streaming substrate (developed by Pallickara) for all its stream disseminations. The NaradaBrokering content distribution network (depicted in Fig. 1) comprises a set of cooperating router nodes known as brokers. Producers and consumers do not interact directly with each other. Entities, which are connected to one of the brokers within the broker network, use their hosting broker to funnel streams into the broker network and, from there on, to other registered consumers of those streams.

NaradaBrokering is application-independent and incorporates several services to mitigate network-induced problems as streams traverse domains during disseminations. The system provisions easy-to-use guarantees, while delivering consistent and predictable performance that is adequate for use in real-time settings.

Figure 1. The NaradaBrokering broker network

Consumers of a given data stream can specify, very precisely, the portions of the data stream that they are interested in consuming. By preferentially deploying links during disseminations, the routing algorithm [4] in NaradaBrokering ensures that the underlying network is optimally utilized. This preferential routing ensures that consumers receive only those portions of streams that are of interest to them. Since a given consumer is typically interested in only a fraction of the streams present in the system, preferential routing ensures that a consumer is not deluged by streams that it will subsequently discard.

The system incorporates support for reliable streaming and secure streaming. In reliable streaming, the substrate copes with disconnects and process/link failures of different components within the system, with the ability to fine-tune redundancies [5] for a specific stream. Secure streaming [6] enforces the authorization and confidentiality constraints associated with the generation and consumption of secure streams while coping with denial of service attacks.

Some of the domains in which NaradaBrokering has been deployed include earthquake science, particle physics, environmental monitoring, geosciences, GIS systems, and defense applications.

III. GRANULES

Granules orchestrates the concurrent execution of processing tasks on a distributed set of machines. Granules is itself distributed, and its components permeate not only the computational resources on which it interleaves processing, but also the desktop from where the applications are being deployed in the first place. The runtime manages the execution of a set of processing tasks through the various stages of their lifecycle: deployment, initialization, execution and termination. Fig. 2 depicts the various components that comprise Granules.

Figure 2. Overview of the Granules runtime (components include discovery services, lifecycle management, diagnostics, concurrent execution, dataset management, deployment queues, and initializers)

A. Computational task

The most fundamental unit in Granules is the notion of a computational task. This computational task encapsulates processing functionality, specifies its scheduling strategy and operates on different types of datasets. These computational tasks can take on additional interchangeable roles (such as map and reduce) and, when cascaded, can form complex execution pipelines.

Computational tasks require the domain specialists to specify processing functionality. This processing typically operates upon a collection of datasets encapsulated within the computational task.

The computational task encapsulates functionality for processing a given fine-grained unit of data. This data granularity could be a packet, a file, a set of files, or a database record. For example, a computational task can be written to evaluate a regular expression query (grep) on a set of characters, a file, or a set of files. In some cases there will not be a specific dataset; rather, each computational task instance initializes itself using a random-seed generator.

Computational tasks include several pieces of metadata such as versioning information, timestamps, domain identifiers and computation identifiers. Individual instances of the computational tasks include instance identifiers and task identifiers, which in turn allows us to group several related computational tasks together.

B. Datasets and collections

In Granules, datasets are used to simplify access to the underlying data type. Datasets currently supported within Granules include streams and files; support for databases is being incorporated. For a given data type, besides managing the allocation and reclamation of assorted resources, Granules also mediates access to it. For example, Granules performs actions related to simplifying the production and consumption of streams, the reading and writing of files, and access to databases.

A data collection is associated with every computational task. The data collection represents a collection of datasets, and maintains information about the type, number and identifiers associated with every encapsulated dataset. All that the domain specialist needs to specify is the number and type of the datasets involved. The system imposes no limits on the number of datasets within a dataset collection. During initialization of the dataset collection, depending on the type of the constituent datasets, Granules subscribes to the relevant streams, configures access to files on networked file systems, and sets up connections (JDBC) to the databases.

Dataset collections allow observers to be registered to track data availability, and dataset initializations and closure. This simplifies data processing since it obviates the need to perform polling.

C. Specifying a scheduling strategy

Computational tasks specify a scheduling strategy, which in turn governs their lifetimes. Computational tasks can specify their scheduling strategy along 3 dimensions (see Fig. 3).

Figure 3. Dimensions for scheduling strategy

The counts axis specifies the number of times a computational task needs to be executed. The data driven axis specifies that a computational task needs to be scheduled for execution whenever data is available on any one of its constituent datasets. The periodicity axis specifies that computational tasks be periodically scheduled for execution at pre-defined intervals (specified in milliseconds).

Each of these axes can extend to infinity, in which case it constitutes a stay-alive primitive. A domain specialist can also specify a custom scheduling strategy that permutes along these 3 dimensions. Thus, one can specify a scheduling strategy that limits a computational task to be executed a maximum of 500 times, either when data is available or at regular intervals.

A computational task can change its scheduling strategy during execution, and Granules will enforce the newly established scheduling strategy during the next round of execution (section 3.5). This scheduling change can be a significant one – from data driven to periodic. The scheduling change could also be a minor one, with changes to the number of times the computation needs to be executed or an update to the periodicity interval.

In addition to the aforementioned primitives, another primitive – stay alive until termination condition reached – can be specified. In this case, the computational task continues to stay alive until it asserts that its termination condition has been reached. The termination condition overrides any other primitives that may have been specified and results in the garbage collection of the computational task.

D. Finite state machine for a computational task

At a given computational resource, Granules maintains a finite state machine for every computational task. This finite state machine, depicted in Fig. 4, has four states: initialize, activated, dormant, and terminate.

Figure 4. FSM for a computational task

The transition triggers for this finite state machine include external requests, elapsed time intervals, data availability, reset counters and assertions of the termination condition being reached. When a computational task is first received in a deployment request, Granules proceeds to initialize the computational task. The finite state machine created for this computational task starts off in the initialize state.

If, for some reason, the computational task cannot proceed in its execution, either because the datasets are not available or the start-up time has not yet elapsed, the computational task transitions into the dormant state. If there were problems in initialization, the computational task transitions into the terminate state. If, on the other hand, the computational task was initialized successfully, and is ready for execution with accessible datasets, the computational task transitions into the activated state.

The diagnostics maintained for a computational task include the number of times it was scheduled for execution, its queuing overheads, its CPU-bound time, the time it was memory-resident, and the total execution time. A computational task can also assert that diagnostic messages be sent back to the client during any (or some) of its state transitions. On the client side, an observer can be registered for collections of computational tasks to track their progress without the need to actively poll individual computational tasks.

E. Interleaving execution of computational tasks

At each computational resource, Granules maintains a pool of worker threads to manage and interleave the concurrent execution of multiple computational tasks. When a computational task is activated and ready for execution, it is moved into the activated queue. As, and when, worker threads become available, computational tasks are pulled from this FIFO queue and executed in a separate thread. Upon completion of the computational task,

IV. SUPPORT FOR MAP-REDUCE IN GRANULES

Map-Reduce is the dominant framework used in cloud computing settings. In Map-Reduce, a large dataset is broken up into smaller chunks that are concurrently operated upon by map function instances. The results from these map functions (usually
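The map and reduce roles just described can be sketched as a minimal in-memory word histogram (the same workload profiled in section 6): each map instance emits partial word counts for its chunk, and a reduce step collates the partial results. This is only an illustration of the programming model; the class and method names are ours, not Granules' actual API, and Granules would stream the partial results between instances rather than hold them in one process.

```java
// Minimal in-memory sketch of the map-reduce flow: map emits partial
// (word, count) tallies per chunk; reduce collates them into a final
// histogram. Illustrative names; not the Granules runtime API.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WordHistogram {
    // Map phase: one invocation per chunk of the dataset.
    static Map<String, Integer> map(String chunk) {
        Map<String, Integer> partial = new HashMap<>();
        for (String word : chunk.split("\\s+")) {
            if (!word.isEmpty()) partial.merge(word, 1, Integer::sum);
        }
        return partial;
    }

    // Reduce phase: collate the partial results from every map instance.
    static Map<String, Integer> reduce(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> p : partials) {
            p.forEach((word, count) -> total.merge(word, count, Integer::sum));
        }
        return total;
    }
}
```

In a distributed run, each `map` invocation would execute on a different resource, and the reducer would receive the partial maps over streams rather than via a shared list.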
Figure 9. Processing time for histogramming words (x-axis: total size of dataset in GB)
The results, depicted in Fig. 9, demonstrate the benefits of using streaming as opposed to file-based communications. As the size of the datasets increases, there is a concomitant increase in the number and size of the (file-based) intermediate results. This contributes to the slower performance of Hadoop and Dryad. We expect the performance of Dryad's socket-based version to be faster than its file-based version.

Two cluster machines were involved in the streaming benchmark. The producer and consumer were hosted on the same machine to obviate the need to account for clock drifts while measuring latencies for streams issued by the producer, and routed by the broker (hosted on the second machine) to the consumer. The reported delay, in the results depicted in Fig. 8, is the average of 50 samples for a given payload size; the standard deviation for these samples is also reported, on the right-hand Y-axis of the graph. Streaming latencies vary from 750 microseconds per hop for 100-byte payloads to 1.5 milliseconds per hop for stream fragments of 10 KB in cluster settings.

Figure 8. Streaming overheads in cluster settings

C. K-means: Iterative

Machine learning provides a fertile ground for iterative algorithms. In our benchmarks, we considered a simple algorithm in the area of unsupervised machine learning: k-means. Given a set of n data points, the objective is to organize these points into k clusters. The algorithm starts off by selecting k centroids and then associates the data points with one of the clusters based on their proximity to the centroids. For each of the clusters, new centroids are then computed. The algorithm is said to converge when the cumulative Euclidean distance between the centroids in successive iterations is less than a pre-defined threshold.

In k-means, the number of iterations depends on the initial choice of the centroids, the number of data points and the specified error rate (signifying that the centroid movements are acceptable). The initial set of data points is loaded at each of the map functions. Each map is responsible for processing a portion of the entire dataset. What changes from iteration to iteration are the centroids. The output of each map function is a set of centroids. The benchmarks, which were run on 5 machines, also contrast the performance of Granules with MPI using a C++ implementation of the k-means algorithm.
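A single iteration of the k-means computation described above can be sketched as follows: assign each point to its nearest centroid, recompute the centroids, and report the cumulative centroid movement used in the convergence test. For readability the sketch uses one-dimensional points and a single process; in the benchmarked implementations (Granules/Java, MPI/C++) each map instance processes only its portion of the dataset and emits its partial centroids. Names are illustrative, not taken from the benchmark code.

```java
// Sketch of one k-means iteration over 1-D points: assignment to the
// nearest centroid, centroid recomputation, and the cumulative movement
// checked against a convergence threshold. Illustrative, single-process.
class KMeansStep {
    // One iteration; returns the updated centroids.
    static double[] iterate(double[] points, double[] centroids) {
        double[] sums = new double[centroids.length];
        int[] counts = new int[centroids.length];
        for (double p : points) {
            int nearest = 0;
            for (int c = 1; c < centroids.length; c++) {
                if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[nearest])) nearest = c;
            }
            sums[nearest] += p;
            counts[nearest]++;
        }
        double[] updated = new double[centroids.length];
        for (int c = 0; c < centroids.length; c++) {
            // An empty cluster keeps its previous centroid.
            updated[c] = (counts[c] == 0) ? centroids[c] : sums[c] / counts[c];
        }
        return updated;
    }

    // Cumulative distance between successive centroid sets; the algorithm
    // converges when this falls below a pre-defined threshold.
    static double movement(double[] previous, double[] updated) {
        double total = 0;
        for (int c = 0; c < previous.length; c++) total += Math.abs(previous[c] - updated[c]);
        return total;
    }
}
```

The driver loop, omitted here, would repeat `iterate` until `movement` drops below the chosen threshold; this is the iteration count that the initial centroids and error rate determine.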
Figure 10. Performance of the k-means algorithm

The graphs, depicted in Fig. 10, have been plotted on a log-log scale so that the trends can be visualized a little better. We varied the number of data points in the dataset from 10^5 to 4x10^7. The results indicate that Hadoop's performance is orders of magnitude slower than that of Granules and MPI. In Hadoop, the centroids are transferred using files, while Granules uses streaming. Furthermore, since Hadoop does not support iterative semantics, map functions need to be re-initialized and the datasets need to be reloaded using HDFS. Though these file system reads are performed locally (thanks to HDFS and data collocation), these costs can still be prohibitive, as evidenced in our benchmarks. Additionally, as the size of the dataset increases, the performance of the MPI/C++ implementation of k-means and the Granules/Java implementation of k-means start to converge.

D. Periodic scheduling

In this subsection we benchmark the ability of Granules to periodically schedule tasks for execution. For this particular benchmark, we initialized 10,000 map functions that needed to be scheduled for execution every 4 seconds. The objective of this benchmark is to show that a single Granules instance can indeed enforce periodicity for a reasonable number of map instances.

Figure 11. Periodic scheduling of 10,000 computational tasks

Fig. 11 depicts the results of periodic executions of 10,000 maps for 17 iterations. The graph depicts the spacing in the times at which these maps are scheduled for execution. The X-axis represents a specific map instance (assigned IDs from 1 to 10,000) and the Y-axis represents the spacing between the times at which a given instance was scheduled. Each map instance reports 17 values. The first time a computational task is scheduled for execution, a base time tb is recorded. Subsequent iterations report the difference between the base time tb and the current time, tc. In almost all cases, the spacing between successive executions for any given instance was between 3.9 and 4.1 seconds. In some cases, there is a small notch; this reflects cases where the first execution was delayed by a small amount, the (constant) impact of which is reflected in subsequent iterations for that map instance.

E. Data driven

In this subsection, we describe the performance of matrix multiplication using Granules. In this case, the objective is to compute the product of two dense 16000 x 16000 matrices, i.e., each matrix has 256 million elements with predominantly non-zero values.

The matrix multiplication example demonstrates how computational tasks can stay alive and be scheduled for execution when data is available. The maps are scheduled for execution as, and when, data is available for the computations to proceed. For this benchmark, we vary the number of machines involved in the experiment from 1 to 8. There are a total of 16000 map instances. At a given time, each of these maps processes portions of the rows and columns that comprise the matrix. Each Granules instance copes with fragments of more than 2000 concurrent streams. In total, every Granules instance copes with 32000 distinct streams.

Our objective as part of the CAP3 benchmark was also to see how Granules can be used to maximize core utilization on a machine. CAP3 takes as input a set of files; in our benchmark we need to process 256 files during the assembly. On a given machine, we fine-tuned the concurrency by setting the number of worker threads within the thread pool to different values. By restricting the number of threads, we also restricted the amount of concurrency and the underlying core utilization.
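The effect of restricting the worker pool can be sketched with a standard `java.util.concurrent` fixed-size pool: the pool size caps how many tasks ever run at once, and therefore how many cores are utilized. This is a generic sketch of the mechanism with illustrative names; Granules' internal worker-pool implementation is not shown in the paper and may differ.

```java
// Sketch: capping concurrency by sizing a fixed worker pool, as in the
// CAP3 benchmark where the pool size limits core utilization.
// Illustrative names; Granules' internal pool may differ.
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class WorkerPoolSketch {
    // Run every task on a pool of the given size; returns the maximum
    // number of tasks that were ever running concurrently.
    static int process(List<Runnable> tasks, int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        for (Runnable task : tasks) {
            pool.execute(() -> {
                int now = running.incrementAndGet();
                peak.accumulateAndGet(now, Math::max);
                try {
                    task.run();
                } finally {
                    running.decrementAndGet();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return peak.get();
    }
}
```

With 256 assembly tasks and a pool of 8 threads, at most 8 assemblies (and hence at most 8 cores' worth of work) proceed at any instant, which is the knob varied in the CAP3 experiments.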
We started off by setting the worker pool size to 1, 2, 4 and 8 on one machine, and then used 8 worker threads on 2, 4 and 8 machines. This allowed us to report results for 1, 2, 4, 8, 16, 32 and 64 cores. The results of our benchmark, in terms of processing costs and the speed-ups achieved, are depicted in Fig. 14 and Fig. 15 respectively. In general, as the number of available cores increases, there is a corresponding improvement in execution times.

Figure 12. Processing time for matrix multiplication on different machines

The results for the matrix multiplication processing times (plotted on a log-log scale) can be seen in Fig. 12. In general, as the number of available machines increases, there is a proportional improvement in the processing time. Our plots of the speed-up (Fig. 13) in processing times with the availability of additional machines reflect this.
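The per-map work in the data-driven matrix multiplication can be sketched as a row-block product: each task computes a contiguous block of rows of C = A x B as its row and column operands arrive. Partitioning, stream delivery and scheduling are the runtime's job; this sketch (with illustrative names and small sizes) shows only the arithmetic a single map instance performs.

```java
// Sketch of the per-task work in the data-driven matrix multiplication:
// each map-style task computes a contiguous block of rows of C = A x B.
// Illustrative; partitioning and streaming are handled by the runtime.
class MatrixBlockTask {
    // Compute rows [rowStart, rowEnd) of the product of A and B.
    static double[][] multiplyRows(double[][] a, double[][] b, int rowStart, int rowEnd) {
        int inner = b.length;       // shared dimension
        int cols = b[0].length;
        double[][] block = new double[rowEnd - rowStart][cols];
        for (int i = rowStart; i < rowEnd; i++) {
            for (int k = 0; k < inner; k++) {
                double aik = a[i][k];
                for (int j = 0; j < cols; j++) {
                    block[i - rowStart][j] += aik * b[k][j];
                }
            }
        }
        return block;
    }
}
```

For the benchmarked 16000 x 16000 matrices, 16000 such tasks would each own a slice of the rows, firing as their operand streams arrive.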
Figure 13. Speed-up for matrix multiplication

Figure 14. Processing time for EST assembly on different cores using Granules and CAP3
In general, these graphs demonstrate that Granules can bring substantial benefits to data driven applications by amortizing the computational load across a set of machines. Domain scientists do not need to write a single line of networking code; Granules manages this in a transparent fashion for the applications.

F. Assembling mRNA sequences

This subsection describes the performance of Granules in orchestrating the execution of applications developed in languages other than Java. The application we consider is the CAP3 [20] messenger ribonucleic acid (mRNA) sequence assembly application (C++) developed at Michigan Tech. An Expressed Sequence Tag (EST) corresponds to mRNAs transcribed from genes residing on chromosomes. Individual EST sequences represent a fragment of mRNA. CAP3 allows us to perform EST assembly to reconstruct full-length mRNA sequences for each expressed gene.

Figure 15. Speed-up for EST assembly using Granules and CAP3

The results demonstrate that, when configured correctly, Granules can maximize core utilization on a given machine. The graphs, plotted on a log-log scale, indicate that for every doubling of the available cores, the processing time for assembling the mRNA sequences reduces (approximately) by half. Currently, the Granules runtime reads the thread-pool sizing information from a configuration file; we will be investigating mechanisms that will allow us to dynamically size these thread-pools.

VII. CONCLUSIONS

In this paper, we described the Granules runtime. Rich lifecycle support within Granules allows computations to retain state, which in turn is particularly applicable for several classes of scientific applications. Granules allows complex computational graphs to be created. As discussed, these graphs can encapsulate both control flow and data flow. Granules enforces the semantics of complex distributed computational graphs that have one or more feedback loops. The domain scientist does not have to cope with IO, threading, synchronization or networking libraries while developing applications that span multiple stages, with multiple distributed instances comprising each stage. These computational pipelines can be iterative, periodic, data-driven, or termination-condition dependent. Demonstrable performance benefits have accrued to Granules as a result of using streaming for disseminating intermediate results.

Granules' rich lifecycle support, and its performance when contrasted with comparable systems, underscores the feasibility of using Granules in several settings. As part of our future work, we will be investigating support for autonomic error detection and recovery within Granules.

REFERENCES

[1]. J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Commun. ACM, vol. 51, Jan. 2008, pp. 107-113.
[2]. F. Darema, "SPMD model: past, present and future," Recent Advances in Parallel Virtual Machine and Message Passing Interface, 8th European PVM/MPI Users' Group Meeting, Santorini/Thera, Greece, 2001.
[3]. S. Pallickara, J. Ekanayake, and G. Fox, "An Overview of the Granules Runtime for Cloud Computing," (Short Paper) Proceedings of the IEEE International Conference on e-Science, Indianapolis, 2008.
[4]. S. Pallickara and G. Fox, "NaradaBrokering: A Middleware Framework and Architecture for Enabling Durable Peer-to-Peer Grids," Proceedings of the ACM/IFIP/USENIX International Middleware Conference (Middleware 2003), pp. 41-61.
[5]. S. Pallickara et al., "A Framework for Secure End-to-End Delivery of Messages in Publish/Subscribe Systems," Proceedings of the 7th IEEE/ACM International Conference on Grid Computing (GRID 2006), Barcelona, Spain.
[6]. S. Pallickara, H. Bulut, and G. Fox, "Fault-Tolerant Reliable Delivery of Messages in Distributed Publish/Subscribe Systems," 4th IEEE International Conference on Autonomic Computing, Jun. 2007, p. 19.
[7]. R. Pike, S. Dorward, R. Griesemer, and S. Quinlan, "Interpreting the data: Parallel analysis with Sawzall," Scientific Programming Journal, Special Issue on Grids and Worldwide Computing Programming Models and Infrastructure, vol. 13, no. 4, pp. 227-298, 2005.
[8]. Apache Hadoop, http://hadoop.apache.org/core/
[9]. S. Garfinkel, "An Evaluation of Amazon's Grid Computing Services: EC2, S3 and SQS," Tech. Rep. TR-08-07, Harvard University, August 2007.
[10]. Message Passing Interface Forum, "MPI: A Message Passing Interface," in Proc. of Supercomputing '93, pp. 878-883, IEEE Computer Society Press, November 1993.
[11]. M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: Distributed data-parallel programs from sequential building blocks," European Conference on Computer Systems (EuroSys), March 2007.
[12]. J. B. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations," Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press, 1:281-297, 1967.
[13]. C. Ranger, R. Raghuraman, A. Penmetsa, G. R. Bradski, and C. Kozyrakis, "Evaluating MapReduce for Multi-core and Multiprocessor Systems," Proc. International Symposium on High-Performance Computer Architecture (HPCA), 2007, pp. 13-24.
[14]. Qt Concurrent: Simplified MapReduce in C++ with support for multicores. http://labs.trolltech.com/page/Projects/Threads/QtConcurrent, April 2009.
[15]. Disco project, http://discoproject.org/
[16]. S. Schlatt, T. Hübel, S. Schmidt and U. Schmidt, "The Holumbus distributed computing framework & MapReduce in Haskell," http://holumbus.fh-wedel.de/trac, 2009.
[17]. A. Pisoni, "Skynet: A Ruby MapReduce Framework," http://skynet.rubyforge.org/, April 2009.
[18]. J. Ekanayake, S. Pallickara and G. Fox, "Map-Reduce for Scientific Applications," Proceedings of the IEEE International Conference on e-Science, Indianapolis, 2008.
[19]. J. M. Squyres and A. Lumsdaine, "A Component Architecture for LAM/MPI," in Proceedings, Euro PVM/MPI, October 2003.
[20]. X. Huang and A. Madan, "CAP3: A DNA Sequence Assembly Program," Genome Research, 9:868-877, 1999.