Granules: A Lightweight, Streaming Runtime for Cloud Computing with Support for Map-Reduce

Shrideep Pallickara 1,2, Jaliya Ekanayake 2 and Geoffrey Fox 2
1 Department of Computer Science, Colorado State University, Fort Collins, USA
2 Community Grids Lab, Indiana University, Bloomington, USA
[email protected], {jekanaya, gcf}@indiana.edu

Abstract— Cloud computing has gained significant traction in of data. These data volumes cannot be processed by a single recent years. The Map-Reduce framework is currently the computer or a small cluster of computer. Furthermore, in most dominant programming model in cloud computing most cases, this data can be processed in a pleasingly settings. In this paper, we describe Granules, a lightweight, parallel fashion. The result has been the aggregation of a streaming-based runtime for cloud computing which large number of commodity hardware components in vast incorporates support for the Map-Reduce framework. data centers. Granules provides rich lifecycle support for developing Map-Reduce [1], introduced by Dean and Ghemawat at scientific applications with support for iterative, periodic and Google, is the most dominant programming model for data driven semantics for individual computations and developing applications in cloud settings. Here, large pipelines. We describe our support for variants of the Map- datasets are split into smaller more manageable sizes which Reduce framework. The paper presents a survey of related work in this area. Finally, this paper describes our are then processed by multiple map instances. The results performance evaluation of various aspects of the system, produced by individual map functions are then sent to including (where possible) comparisons with other comparable reducers, which collate these partial results to produce the systems. final output. A clear benefit of such concurrent processing is a speed-up that is proportional to the number of Keywords- map-reduce, cloud computing, streaming, cloud computational resources. Map-Reduce can be thought of as runtimes, and content distribution networks an instance of the SPMD [2] programming model for introduced by Federica Derema. I. 
INTRODUCTION Applications that can benefit from Map-Reduce include data and/or task-parallel algorithms in domains such as Cloud computing has gained significant traction in information retrieval, machine learning, graph theory and recent years. By facilitating access to an elastic (meaning visualization among others. the available resource pool can expand or contract over In this paper we describe Granules [3], a lightweight time) set of resources, cloud computing has demonstrable streaming-based runtime for cloud computing. Granules applicability to a wide-range of problems in several allows processing tasks to be deployed on a single resource domains. or a set of resources. Besides the basic support for Map- Appealing features within cloud computing include Reduce, we have incorporated support for variants of the access to a vast number of computational resources and Map-Reduce framework that are particularly suitable for inherent resilience to failures. The latter feature arises, scientific applications. Unlike most Map-Reduce because in cloud computing the focus of execution is not a implementations, Granules utilizes streaming for specific well-known resource but rather the best available disseminating intermediate results, as opposed to using file- one. Another characteristic of a lot of programs that have based communications. This leads to demonstrably better been written for cloud computing is that they tend to be performance (see benchmarks in section 6). stateless. Thus, when failures do take place, the appropriate This paper is organized as follows. In section 2, we computations are simply re-launched with the corresponding provide a brief overview of the NaradaBrokering substrate datasets. that we use for streaming. We discuss some of the core Among the forces that have driven the need for cloud elements of Granules in section 3. 
Section 4 outlines our computing are falling hardware costs and burgeoning data support for Map-Reduce and for the creation of complex volumes. The ability to procure cheaper, more powerful computational pipelines. In section 5, we describe related CPUs coupled with improvements in the quality and work in this area. In section 6, we profile several aspects of capacity of networks have made it possible to assemble the Granules runtime, and where possible contrast its clusters at increasingly attractive prices. The proliferation of performance with comparable systems such as Hadoop, networked devices, services, and simulations has Dryad and MPI. In section 7, we present our conclusions. resulted in large volumes of data being produced. This, in turn, has fueled the need to and store vast amounts II. NARADABROKERING Some of the domains that NaradaBrokering has been deployed in include earthquake science, particle physics, Granules uses the NaradaBrokering [4-6] streaming environmental monitoring, geosciences, GIS systems, and substrate (developed by Pallickara) for all its streams defense applications. disseminations. The NaradaBrokering content distribution network (depicted in Fig. 1) comprises a set of cooperating III. GRANULES router nodes known as brokers. Producers and consumers do not directly interact with each other. Entities, which are Granules orchestrates the concurrent execution of connected to one of the brokers within the broker network, processing tasks on a distributed set of machines. Granules use their hosting broker to funnel streams into the broker is itself distributed, and its components permeate not only network and, from thereon, to other registered consumers of the computational resources on which it interleaves those streams. processing, but also the desktop from where the applications NaradaBrokering is application-independent and are being deployed in the first place. 
The runtime manages incorporates several services to mitigate network-induced the execution of a set of processing tasks through various problems as streams traverse domains during stages of its lifecycle: deployment, initialization, execution disseminations. The system provisions easy to use and termination. Fig. 2, depicts the various components that guarantees, while delivering consistent and predictable comprise Granules. performance that is adequate for use in real-time settings. A A A C G G Content Distribution Network

G G G R1 R2 RN Discovery Services Granules Lifecycle Diagnostics Management Concurrent Execution Dataset Queue Deployment Initializer Manager

G Granules Computational Resources C Client X Producers or Consumers Figure 2. Overview of the Granules runtime Broker node

Figure 1. The NaradaBrokering broker network A. Computational task Consumers of a given data stream can specify, very The most fundamental unit in Granules is the notion of a precisely, the portions of the data stream that they are computational task. This computational task encapsulates interested in consuming. By preferentially deploying links processing functionality, specifies its scheduling strategy during disseminations, the routing algorithm [4] in and operates on different types of datasets. These NaradaBrokering ensures that the underlying network is computational tasks can take on additional interchangeable optimally utilized. This preferential routing ensures that roles (such as map and reduce) and, when cascaded, can consumers receive only those portions of streams that are of form complex execution pipelines. interest to them. Since a given consumer is typically Computational tasks require the domain specialists to interested in only a fraction of the streams present in the specify processing functionality. This processing typically system, preferential routing ensures that a consumer is not operate upon a collection of datasets encapsulated within the deluged by streams that it will subsequently discard. computational task. The system incorporates support for reliable streaming The computational task encapsulates functionality for and secure streaming. In reliable streaming, the substrate processing for a given fine grained unit of data. This data copes with disconnects and process/link failures of different granularity could be a packet, a file, a set of files, or a components within the system with the ability to fine-tune database record. For example, a computational task can be redundancies [5] for a specific stream. Secure streaming [6] written to evaluate a regular expression query (grep) on a set enforces the authorization and confidentiality constraints of characters, a file, or a set of files. 
In some cases there will associated with the generation and consumption of secure not be a specific dataset; rather, each computational task streams while coping with denial of service attacks. instance initializes itself using a random-seed generator. Computational tasks include several metadata such as for execution whenever data is available on any one of its versioning information, timestamps, domain identifiers and constituent datasets. The periodicity axis specifies that computation identifiers. Individual instances of the computational tasks be periodically scheduled for execution computational tasks include instance identifiers and task at pre-defined intervals (specified in milliseconds). identifiers, which in turn allows us to group several related Each of these axes can extend to infinity, in which case, computational tasks together. it constitutes a stay-alive primitive. A domain specialist can also specify a custom scheduling strategy that permutes B. Datasets and collections along these 3-dimensions. Thus, one can specify a scheduling strategy that limits a computational task to be In Granules, datasets are used to simplify access to the executed a maximum of 500 times either when data is underlying data type. Datasets currently supported within available or at regular intervals. Granules include streams and files; support for databases is A computational task can change its scheduling strategy being incorporated. For a given data type, besides managing during execution, and Granules will enforce the newly the allocation and reclamation of assorted resources, established scheduling strategy during the next round of Granules also mediates access to it. For example, Granules execution (section 3.5). This scheduling change can be a performs actions related to simplifying the production and significant one – from data driven to periodic. 
The consumption of streams, the reading and writing of files, scheduling change could also be a minor one with changes and access to databases. to the number of times the computation needs to be A data collection is associated with every computational executed or an update to the periodicity interval. task. The data collection represents a collection of datasets, In addition to the aforementioned primitives, another and maintains information about the type, number and primitive – stay alive until termination condition reached – identifiers associated with every encapsulated dataset. can be specified. In this case, the computational task All that the domain specialist needs to specify is the continues to be stay alive until the computational task number and type of the datasets involved. The system asserts that its termination condition has been reached. The imposes no limits on the number of datasets within a dataset termination condition overrides any other primitives that collection. During initializations of the dataset collection, may have been specified and results in the garbage depending on the type of the constituent datasets, Granules collection of the computational task. subscribes to the relevant streams, configures access to files on networked file systems, and sets up connections (JDBC) D. Finite state machine for a computational task to the databases. Dataset collections allow observers to be registered to At a given computational resources, Granules maintains track data availability, and dataset initializations and a finite state machine for every computational task. This closure. This simplifies data processing since it obviates the finite state machine, depicted in Fig. 4 has four states: need to perform polling. initialize, activated, dormant, and terminate.
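To make the three scheduling axes (counts, data driven, periodicity) concrete, here is a minimal, illustrative sketch of how a runtime might evaluate such a strategy after each round of execution. This is not the Granules API; all class and method names are our own invention.

```python
import time

class SchedulingStrategy:
    """Illustrative 3-axis scheduling strategy: counts, data-driven, periodicity."""
    INFINITY = float("inf")  # an axis extending to infinity acts as a stay-alive primitive

    def __init__(self, max_executions=INFINITY, data_driven=False, period_ms=None):
        self.remaining = max_executions   # counts axis
        self.data_driven = data_driven    # data driven axis
        self.period_ms = period_ms        # periodicity axis (milliseconds)
        self.last_run_ms = None

    def should_run(self, data_available, now_ms=None):
        """Decide whether the task is due for another round of execution."""
        if self.remaining <= 0:           # counts axis exhausted
            return False
        if self.data_driven and data_available:
            return True
        if self.period_ms is not None:
            now_ms = now_ms if now_ms is not None else time.time() * 1000
            if self.last_run_ms is None or now_ms - self.last_run_ms >= self.period_ms:
                return True
        return False

    def record_run(self, now_ms=None):
        """Book-keeping after a round of execution completes."""
        self.remaining -= 1
        self.last_run_ms = now_ms if now_ms is not None else time.time() * 1000

# A task limited to 500 executions, triggered whenever data is available:
s = SchedulingStrategy(max_executions=500, data_driven=True)
print(s.should_run(data_available=True))  # True
```

Changing the strategy mid-flight, as the text describes, would then amount to swapping in a new `SchedulingStrategy` object that the runtime consults before the next round.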

C. Specifying a scheduling strategy Dormant Computational tasks specify a scheduling strategy, which in turn govern their lifetimes. Computational tasks can specify their scheduling strategy along 3-dimensions (see Fig. 3). Initialize

Terminate Activated

Figure 4. FSM for a computational task

The transition triggers for this finite state machine include external requests, elapsed time intervals, data availability, reset counters and assertions of the termination condition being reached. When a computational task is first received in a deployment request, Granules proceeds to initialize the computational task. The finite state machine created for this Figure 3. Dimensions for scheduling strategy computational task starts-off in the initialize state. If, for some reason, the computational task cannot The counts axis specifies the number of times a proceed in its execution, either because the datasets are not computational task needs to be executed. The data driven available or the start-up time has not yet elapsed, the axis specifies that computational task needs to be scheduled computational task transitions into the dormant state. If there were problems in initialization, the computational task the number of times a computational task was scheduled for transitions into the terminate state. execution, its queuing overheads, its CPU-bound time, the If, on the other hand, the computational task was time it was memory-resident, and the total execution time. A initialized successfully, and is ready for execution with computational task can also assert that diagnostic messages accessible datasets, the computational task transitions into be sent back to the client during any (or some) of its state the activated state. transitions. On the client side, an observer can be registered for collections of computational tasks to track their progress E. Interleaving execution of computational tasks without the need to actively poll individual computational tasks. At each computational resource, Granules maintains a pool of worker threads to manage and interleave the IV. SUPPORT FOR MAP-REDUCE IN GRANULES concurrent execution of multiple computational tasks. 
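The four-state lifecycle can be sketched as a small transition table. The following is an illustration of the state machine described in the text, not Granules code; the trigger names are ours, chosen to mirror the transition triggers listed above.

```python
# States as in Fig. 4: initialize, activated, dormant, terminate.
TRANSITIONS = {
    ("initialize", "datasets_ready"):        "activated",
    ("initialize", "datasets_unavailable"):  "dormant",
    ("initialize", "init_failed"):           "terminate",
    ("dormant",    "data_available"):        "activated",
    ("dormant",    "interval_elapsed"):      "activated",
    ("dormant",    "termination_condition"): "terminate",
    ("activated",  "round_complete"):        "dormant",
    ("activated",  "termination_condition"): "terminate",
}

def step(state, trigger):
    """Apply a transition trigger; unknown triggers leave the state unchanged."""
    return TRANSITIONS.get((state, trigger), state)

# A task that runs two rounds and then asserts its termination condition:
state = "initialize"
for trig in ["datasets_ready", "round_complete", "data_available",
             "round_complete", "termination_condition"]:
    state = step(state, trig)
print(state)  # terminate
```

The toggling between dormant and activated that the text describes corresponds to repeatedly traversing the `data_available`/`interval_elapsed` and `round_complete` edges until a termination trigger fires.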
When a computational task is activated and ready for Map-Reduce is the dominant framework used in cloud execution, it is moved into the activated queue. As, and computing settings. In Map-Reduce, a large dataset is when, worker threads become available the computational broken-up into smaller chunks that are concurrently tasks are pulled from the FIFO queue and executed in a operated upon by map function instances. The results from separate . Upon completion of the computational task, these map functions (usually pairs) are the worker thread is returned back to the thread-pool, to be combined in the reducers which collates the values for used to execute other pending computational tasks within individual keys. Typically, there are multiple reducers, and the activated queue. The computational task is placed either the outputs from these reducers constitute the final result. in the dormant queue or scheduled for garbage collection This is depicted in Fig. 5. depending on the state of its finite state machine. After a computational task has finished its latest (or the d1 first) round of execution, checks are made to see if it should Map1 be terminated. To do so, the scheduling strategy associated Reduce with the computational task is retrieved. If a computational 1 o1 task needs to execute a fixed number of times a check is d2 made to see if the counter has reset. If the computational Map2 task specifies a stay-alive primitive based either on data- availability or periodicity, checks are made to see if the Dataset datasets continue to be available or if the periodicity interval ReduceK has elapsed. A check is also made to see if the oK computational task has asserted that its termination dN Map N condition has been reached. If none of these checks indicate that the computational Figure 5. The basic Map-Reduce framework task should be terminated, the computational task is The Map-Reduce framework has several advantages. 
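The thread-pool sizing rule of thumb (worker threads roughly equal to the number of hardware execution pipelines) translates directly into code. A sketch using Python's standard thread pool; the 8-thread figure is the paper's quad-core example, not something this snippet guarantees on any particular machine.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# os.cpu_count() reports logical CPUs, i.e. hardware execution pipelines;
# for a quad-core CPU with 2 pipelines per core it would typically report 8,
# matching the guideline in the text.
pool_size = os.cpu_count() or 1
executor = ThreadPoolExecutor(max_workers=pool_size)

def computational_task(task_id):
    return task_id * task_id  # stand-in for one round of execution

# Interleave 8 pending tasks across the pool's worker threads.
results = sorted(executor.map(computational_task, range(8)))
executor.shutdown()
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

Oversizing the pool well beyond `pool_size` would add context-switching overhead without adding concurrency, which is the trade-off the text describes.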
scheduled for another round of execution or the First, the domain scientist only needs to provide the Map- computational task transitions into the dormant state. A Reduce functionality, and the datasets. Second, it is the computational task can continually toggle between the responsibility of the framework to transparently scale as the dormant and activated state till a termination condition has number of available resources, and the problem size, been reached. increases. Finally, orchestration of the concurrent data- 1) Sizing thread-pools parallel execution is managed by the framework. In traditional Map-Reduce intermediate stages exchange The number of worker threads within the thread-pool is results using a set of pairs. We have configurable. In general, the number of threads needs to be incorporated support for this basic result type. But we have balanced so that the accrued concurrency gains are not also incorporated support for exchange of primitive data offset by context-switching overheads among the threads. types such as int, short, boolean, char, long, As a general rule, it is a good idea to set this number to be float, and double. We have also incorporated support approximately equal to the number of execution pipelines for exchanging arrays ([]) and 2D arrays ([][]) of these available on a given machine. Thus, for a quad-core CPU primitive data types. There is also support for exchanging with 2 execution pipelines per core, the thread-pool will be Objects setup to have approximately 8 threads. that encapsulated compound data types, along with arrays and 2D arrays of these Objects. F. Diagnostics The intermediate results in most Map-Reduce implementations utilize file IO for managing results In Granules, a user can track the status of a specific produced by the intermediate stages. The framework then computational task or collections (job) of computational notifies appropriate reducers to pull or retrieve these results tasks. 
The system maintains diagnostic information about for further processing. every computational task. This includes information about Depending on the application, the overheads introduced reducers. Similarly, reducers are allowed to add/remove by performing such disk-IO can be quite high. In Granules, maps. The functions are functionally equivalent. Granules we use streaming to push these results onto appropriate also allows the map and reduce roles to be interchangable: a reducers. Streaming, as validated by our benchmarks map can act as a reducer, and vice versa. Fig. 6 depicts how (described in section 6), is significantly faster and we think support for addition/removal of roles combined with role that there are several classes of applications that can benefit inter-changeability, can be used to create a graph with a from this. feedback loop. In our benchmarks, involving the k-means Additionally, since the results are being streamed as and machine learning algorithm, we have 3 stages with a when they have been computed, successive stages have feedback loop from the output of stage 2 to its input. access to partial results from preceding stages instead of Granules manages overheads related to ensuring that the waiting for the entire computation to complete. This is outputs from the map are routed to the correct reducers. particularly useful in situations where one is interested in getting as many results as possible within a fixed amount of time. M1 R1 A. Two sides of the same coin M2 RM In Granules, map and reduce are two roles associated RM.addReduce(M1) with the computational task. These roles inherit all the M3 RM.addReduce(M2) computational task functionality, while adding functionality R2 RM.addReduce(M3) RM.addReduce(M4) specific to their roles. M4 The map role adds functionality related to adding, removing, tracking and enumerating the reducers associated with the map function. Typically, a map function has one Figure 6. 
Creating a simple feedback loop reducer associated with it. In Granules, we do not limit the number of reducers associated with a map function. This Additionally, Granules can create execution graphs once feature can be used to fine-tune redundancies within a the number of map and reduce instances in a have computational pipeline. been specified. Granules ensures the appropriate linkage of The reduce role adds functionality related to adding, the map reduce instances. removing, tracking and enumerating maps associated with it. The reducer has facilities to track output generated by the C. Creating computational pipelines constituent maps. Specifically, a reducer can determine if partial or complete outputs have been received from the Typically, in Map-Reduce the instances that comprise an maps. The reduce role also incorporates support to detect execution pipeline are organized in a directed acyclic graph and discard any duplicate outputs that may be received. (DAG) with the execution proceeding in sequence through The map and reduce roles have facilities to create and monotonically increasing stages. publish results. The payloads for these results can be primitive data types that we discussed earlier, Objects encapsulating compound data types, pairs, STAGE 1 STAGE 2 STAGE 3 arrays and 2D arrays of the same. In Granules, generated results include sequencing information and metadata specific to the generator. Additionally, an entity is allowed to assert if these results are partial results and/or if the Figure 7. Creating pipelines with cycles processing has been completed. Since map and reduce are two roles of the computational In Granules, we have incorporated support for cycles to task in Granules, they inherit functionality related to be present. This allows Granules to feedback the outputs of scheduling strategy (and lifecycle management), diagnostic some stage, within a pipeline, to any of its preceding stages. strategy and dataset management. 
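The add/remove operations and role interchangeability described here can be sketched as graph wiring. This is an invented toy data structure, not the Granules API; it only illustrates how adding edges from a reducer back to earlier maps yields the kind of feedback loop shown in Fig. 6.

```python
class Node:
    """Illustrative pipeline node; it can play either the map or reduce role."""
    def __init__(self, name):
        self.name = name
        self.downstream = []  # nodes this node feeds results to

    def add_downstream(self, node):
        self.downstream.append(node)

# Four maps feed an intermediate reducer RM; role interchangeability lets RM
# act as a map, feeding its output back to M1 (a feedback edge) and on to R2.
m1, m2, m3, m4 = Node("M1"), Node("M2"), Node("M3"), Node("M4")
rm, r2 = Node("RM"), Node("R2")
for m in (m1, m2, m3, m4):
    m.add_downstream(rm)
rm.add_downstream(m1)  # the cycle that a strict DAG model would disallow
rm.add_downstream(r2)

print([n.name for n in rm.downstream])  # ['M1', 'R2']
```

The feedback edge is exactly what the stay-alive primitive makes usable: M1 remains resident across rounds, so RM's output can trigger its next activation.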
The system places no restrictions on the span-length, or the Individual map and reduce instances toggle between the number, of the feedback in the pipeline. In a sense it can be activated and dormant states (section 3.5) till such time that argued that Granules supports both data and control flow it is ready to assert that its termination condition has been graphs. An example of such a computational graph in reached. For example, a reducer may assert that it has Granules is depicted in Fig. 7 reached its termination condition only after it has received, One feature of the computational task plays a role in and processed, the outputs of its constituent maps. allowing these loops: the notion of the stay-alive computation. Furthermore, since this is available at the B. Setting up graphs micro-level (computational task), individual stages, collection of stages, or the computational pipeline itself can Granules supports a set of operations that allow these itself be iterative, periodic, data-driven, or termination graphs to be setup. Individual maps can add/remove condition dependent. Granules manages the pipeline complexity. The domain (first proposed by Federica Derema) in tandem with the scientist does not need to cope with fan-in complexity, Message Passing Interface (MPI) [10] library. The SPMD which corresponds to the number of units that feed results model is a powerful one, and Map-Reduce can in fact be into a given instance. Once a pipeline has been created, a thought of as an instance of the SPMD model. The use of domain specialist does not have to cope with IO, MPI has however, not been as widespread outside the synchronization, or networking related issues. The runtime scientific community. includes facilities to track outputs from preceding stages. Research’s Dryad [11] is a system designed as a programming model for developing scalable parallel and D. Observing the lifecycle of a pipeline distributed applications. 
Dryad is based on directed, acyclic graphs (DAG). In this model, sequential programs are At the client side, during the deployment process, connected using one-way channels. It is intended to be a Granules allows a lifecycle observer to be registered for an super-set of the core Map-Reduce framework. Dryad execution pipeline. This observer processes diagnostic provides job management and autonomic capabilities, and messages received from different computational resources makes use of the Microsoft Shared Directory Service. running Granules. These diagnostic messages relate to state However, since Dryad is developed based on DAGs it is not transitions associated with the different computational task possible to develop systems that have cycles in them. For instances (and the map, reduce roles) and the pertinent example, in our benchmarks, we were not able to implement metrics associated with the computation task. The lifecycle the k-means machine learning algorithm [12] using the basic observer reports to the client upon completion of an Dryad framework. execution pipeline. The observer also reports errors in the Phoenix [13] is an implementation of MapReduce for execution of any of the units that comprise the pipeline. multi-core and multiprocessor systems. A related effort is Qt Concurrent [14], which provides a simplified V. RELATED WORK implementation of the Map-Reduce framework in C++. Qt The original Map-Reduce paper [1] by Ghemawat and Concurrent, automatically optimizes thread utilizations on Dean described how their programming abstraction was multi-core machines depending on core availability. Disco being used in the Google search engine and other data- [15], from Nokia, is an open source Map-Reduce runtime intensive applications. This work was itself inspired by map developed using the Erlang functional programming and reduce primitives present in Lisp and other functional language. Similar to the Hadoop architecture, Disco stores programming languages. 
Google Map-Reduce is written in the intermediate results in local files and accesses them C++ with extensions for Java and Python. Swazall [7] is an using HTTP connections from the appropriate reduce tasks. interpreted, procedural programming language used by Holumbus [16] includes an implementation of the Map- Google to develop Map-Reduce applications. Reduce framework, developed in the Haskell functional Hadoop [8] was originally developed at Yahoo, and is programming language at the FH Wedel University of now an Apache project. It is by far the most widely used Applied Sciences, Germany. implementation of the Map-Reduce framework. In addition Skynet [17] is an open-source Ruby based to the vast number of applications at Yahoo, it is also part of implementation of the Map-Reduce framework. Skynet the Google/IBM initiative to support university courses in utilizes a peer-recovery system for tracking the constituent . Hadoop is also hosted as a tasks. Peers track each other and, once failure is detected, framework over Amazon’s EC2 [9] cloud. Unlike Granules, can spawn a replica of the failed peer. Hadoop supports only exactly-once semantics, meaning that We had originally developed a prototype there is direct support within the framework for map and implementation of Map-Reduce, CGL-MapReduce [18], reduce functions to maintain state. which implemented Map-Reduce using streaming (once Hadoop uses the Hadoop Distributed File System again, using NaradaBrokering) with support for iteration. (HDFS) files for communicating intermediate results Granules represents an overhaul, and incorporates several between the map and reduce functions, while Granules uses new capabilities such as built-in support for sophisticated streaming for these disseminations, thus allowing access to lifecycle management (periodicity, data driven and partial results. termination conditions), powerful creation and duplicate HDFS allows for replicated, robust access to files. 
detection of results, support for richer datasets, and During the data staging phase, Hadoop allows creation of diagnostics in addition to the ability to create complex replicas on the local file system; computations are then computational pipelines with feedback loops in multiple spawned to exploit data locality. Hadoop supports stages. The code-base for the Granules (available for automated recovery from failures. Currently, Granules does download) runtime has also been developed from scratch. not incorporate support for automated recovery from failures. This will be the focus of our future work in this VI. BENCHMARKS area: here, we plan to harness the reliable streaming In our benchmarks we profile several aspects of the capabilities available in NaradaBrokering. Granules’ performance. We are specifically interested in The most dominant model for developing parallel determining system performance for different lifecycles applications in the HPC community is the SPMD [2] model associated with the computational tasks. The different lifecycles we benchmark include exactly-once, iterative, B. Information Retrieval: Exactly once periodic and data driven primitives. Where possible, we contrast the performance of Granules with comparable In this sub-section we present results from a simple systems such as Hadoop, Dryad and MPI. It is expected that information retrieval example. Given a set of text files, the these benchmarks would be indicative of the performance objective is to histogram the counts associated with various that can be expected in different deployments. words in these files. The performance of Granules is All machines involved in these benchmarks have 4 dual- contrasted with that of Hadoop and Dryad. The Dryad core CPUs, 2.4GHz clock and 8GB of RAM. These version that we have access to uses C#, LINQ and file-based machines were hosted on a 100 Mbps LAN. 
The Operating communications using the Microsoft Shared Directory System on these machines is Red Hat Enterprise Linux service. The OS involved in the Dryad benchmarks is version 4. All Java processes executed within version 1.6 of Windows XP. Sun’s JVM. We used version 3.4.6 of the gcc complier for For this benchmark, we vary the cumulative size of the C++, and for MPI we used version 7.1.4 of the Local Area datasets that need to be processed. The total amount of data Multicomputer (LAM) MPI [19]. that is processed is varied from 20 GB to 100 GB. There were a total of 128 map instances that were deployed on the 5 machines involved in the benchmark. A. Streaming Substrate 3000 Hadoop Since we use the NaradaBrokering streaming substrate DRYAD for all communications between entities, we present a 2500 Granules simple benchmark to give the reader an idea of the costs involved in streaming. Our results outline the 2000 communication latencies in a simplified setting involving one producer, one consumer and one broker. The 1500 communication latencies are reported for stream fragments (Seconds) with different payload sizes. Additional NaradaBrokering 1000 benchmarks in distributed settings can be found in [4,5]. 500 Round-trip delays for different payload sizes (100B - 10KB)

Figure 9. Processing time for histogramming words [plot: overhead and standard deviation vs. total size of dataset (GB)]

The results, depicted in Fig. 9, demonstrate the benefits of using streaming as opposed to file-based communications. As the size of the datasets increases, there is a concomitant increase in the number and size of the (file-based) intermediate results. This contributes to the slower performance of Hadoop and Dryad. We expect the performance of Dryad's socket-based version to be faster than its file-based version.

Two cluster machines were involved in the streaming-overheads benchmark. The producer and consumer were hosted on the same machine to obviate the need to account for clock drifts while measuring latencies for streams issued by the producer, and routed by the broker (hosted on the second machine) to the consumer. The reported delay, in the results depicted in Fig. 8, is the average of 50 samples for a given payload size; the standard deviation for these samples is also reported. The Y-axis for the standard deviation is the axis on the right side (blue) of the graph. Streaming latencies vary from 750 microseconds per hop for 100 bytes to 1.5 milliseconds per hop for stream fragments of 10 KB in cluster settings.

Figure 8. Streaming overheads in cluster settings [plot: mean transit delay and standard deviation (milliseconds) vs. content payload size (bytes)]

C. K-means: Iterative

Machine learning provides a fertile ground for iterative algorithms. In our benchmarks, we considered a simple algorithm in the area of unsupervised machine learning: k-means. Given a set of n data points, the objective is to organize these points into k clusters.

The algorithm starts off by selecting k centroids and then associates the different data points within the dataset with one of the clusters based on their proximity to the centroids. For each of the clusters, new centroids are then computed. The algorithm is said to converge when the cumulative Euclidean distance between the centroids in successive iterations is less than a pre-defined threshold.

In k-means, the number of iterations depends on the initial choice of the centroids, the number of data points and the specified error rate (signifying that the centroid movements are acceptable). The initial set of data points is loaded at each of the map functions. Each map is responsible for processing a portion of the entire dataset. What changes from iteration to iteration are the centroids; the output of each map function is a set of centroids.

The benchmarks, which were run on 5 machines, also contrast the performance of Granules with MPI, using a C++ implementation of the k-means algorithm.
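The iterative scheme described above can be sketched as follows. This is a minimal, self-contained illustration (not the Granules/Java or MPI/C++ benchmark code): points are 2-D arrays, and the loop terminates when the cumulative Euclidean distance between successive centroid sets falls below the threshold.

```java
import java.util.Arrays;

// Minimal k-means sketch: assign points to the nearest centroid, recompute
// centroids, and stop when the cumulative centroid movement is small enough.
public class KMeansSketch {
    static double dist(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return Math.sqrt(dx * dx + dy * dy);
    }

    // One iteration: assign each point to its nearest centroid, then
    // recompute each centroid as the mean of its assigned points.
    static double[][] step(double[][] points, double[][] centroids) {
        int k = centroids.length;
        double[][] sums = new double[k][2];
        int[] counts = new int[k];
        for (double[] p : points) {
            int best = 0;
            for (int c = 1; c < k; c++)
                if (dist(p, centroids[c]) < dist(p, centroids[best])) best = c;
            sums[best][0] += p[0];
            sums[best][1] += p[1];
            counts[best]++;
        }
        double[][] next = new double[k][2];
        for (int c = 0; c < k; c++) {
            if (counts[c] == 0) { next[c] = centroids[c].clone(); continue; }
            next[c][0] = sums[c][0] / counts[c];
            next[c][1] = sums[c][1] / counts[c];
        }
        return next;
    }

    // Iterate until the cumulative Euclidean distance between centroids in
    // successive iterations is below the pre-defined threshold.
    public static double[][] cluster(double[][] points, double[][] centroids,
                                     double threshold) {
        while (true) {
            double[][] next = step(points, centroids);
            double movement = 0;
            for (int c = 0; c < centroids.length; c++)
                movement += dist(centroids[c], next[c]);
            centroids = next;
            if (movement < threshold) return centroids;
        }
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        double[][] init = {{0, 0}, {10, 10}};
        System.out.println(Arrays.deepToString(cluster(points, init, 1e-6)));
    }
}
```

In the benchmarked setting, each map instance would run the assignment/update step over its own partition of the dataset and emit only the resulting centroids downstream.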

Figure 10. Performance of the k-means algorithm [log-log plot: overhead (seconds) vs. number of 2-D data points (millions), for Hadoop, Granules and MPI]

The graphs, depicted in Fig. 10, have been plotted on a log-log scale so that the trends can be visualized a little better. We varied the number of data points in the dataset from 10^5 to 4×10^7. The results indicate that Hadoop's performance is orders of magnitude slower than that of Granules and MPI. In Hadoop, the centroids are transferred using files, while Granules uses streaming. Furthermore, since Hadoop does not support iterative semantics, map functions need to be initialized and the datasets need to be reloaded using HDFS. Though these file system reads are performed locally (thanks to HDFS and data collocation), these costs can still be prohibitive, as evidenced in our benchmarks. Additionally, as the size of the dataset increases, the performance of the MPI/C++ implementation of k-means and the Granules/Java implementation of k-means start to converge.

D. Periodic scheduling

In this subsection we benchmark the ability of Granules to periodically schedule tasks for execution. For this particular benchmark, we initialized 10,000 map functions that needed to be scheduled for execution every 4 seconds. The objective of this benchmark is to show that a single Granules instance can indeed enforce periodicity for a reasonable number of map instances.

Figure 11. Periodic scheduling of 10,000 computational tasks [plot: time task scheduled at (seconds) vs. computational tasks, IDs 1-10,000]

Fig. 11 depicts the results of periodic executions of 10,000 maps for 17 iterations. The graph depicts the spacing in the times at which these maps are scheduled for execution. The X-axis represents a specific map instance (assigned IDs from 1 to 10,000) and the Y-axis represents the spacing between the times at which a given instance was scheduled. Each map instance reports 17 values.

The first time a computational task is scheduled for execution, a base time tb is recorded. Subsequent iterations report the difference between the base time tb and the current time, tc. In almost all cases, the spacing between successive executions for any given instance was between 3.9 and 4.1 seconds. In some cases, there is a small notch; this reflects cases where the first execution was delayed by a small amount, the (constant) impact of which is reflected in subsequent iterations for that map instance.

E. Data driven

In this subsection, we describe the performance of matrix multiplication using Granules. In this case, the objective is to compute the product of two dense 16000 × 16000 matrices, i.e. each matrix has 256 million elements with predominantly non-zero values.

The matrix multiplication example demonstrates how computational tasks can stay alive, and be scheduled for execution when data is available. The maps are scheduled for execution as and when data is available for the computations to proceed.

For this benchmark, we vary the number of machines involved in the experiment from 1 to 8. There are a total of 16000 map instances. At a given time, each of these maps processes portions of the rows and columns that comprise the matrix. Each Granules instance copes with fragments from more than 2000 concurrent streams; in total, every Granules instance copes with 32000 distinct streams.
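The data-driven execution model can be sketched as follows. This is an illustrative stand-in, not the Granules API: a long-lived task fires only once all of the stream fragments it depends on (e.g. a row fragment and a column fragment) have arrived, and then waits for the next round of inputs.

```java
// Sketch of a data-driven, long-lived task: it executes only when all of
// its required input fragments have arrived, then resets for the next round.
public class DataDrivenTask {
    private final int needed;   // fragments required per firing
    private int arrived = 0;    // fragments received so far in this round
    private int firings = 0;    // how many times the task has executed

    DataDrivenTask(int needed) { this.needed = needed; }

    // Called whenever a fragment arrives on one of the task's input streams.
    synchronized void onFragment() {
        arrived++;
        if (arrived == needed) {  // all inputs present: task is runnable
            arrived = 0;          // reset for the next round of inputs
            firings++;            // stand-in for the actual map logic
        }
    }

    synchronized int firings() { return firings; }

    public static void main(String[] args) {
        // A task that needs a row fragment and a column fragment per firing.
        DataDrivenTask task = new DataDrivenTask(2);
        task.onFragment();        // row arrives: not yet runnable
        task.onFragment();        // column arrives: task fires
        task.onFragment();        // next round begins
        task.onFragment();        // second firing
        System.out.println(task.firings());
    }
}
```

The runtime, rather than the domain scientist, is responsible for tracking fragment arrival and scheduling the task when it becomes runnable.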
Figure 12. Processing time for matrix multiplication on different machines [log-log plot: processing time (seconds) vs. number of machines]

The results for the processing times (plotted on a log-log scale) can be seen in Fig. 12. In general, as the number of available machines increases, there is a proportional improvement in the processing time. Our plots of the speed-up (Fig. 13) in processing times with the availability of additional machines reflect this.

Figure 13. Speed-up for matrix multiplication [plot: speed-up vs. number of machines, 1 to 8]
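The thread-pool based concurrency throttling used in the CAP3 benchmark can be sketched as follows. This is an illustrative sketch, not the Granules scheduler: the per-file work is simulated, and the external `cap3` invocation shown in the comment is a hypothetical placeholder.

```java
import java.util.concurrent.*;

// Sketch of bounding concurrency (and hence core utilization) with a
// fixed-size worker pool that fans out one task per input file.
public class WorkerPoolSketch {
    public static int process(int poolSize, int numFiles) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        CompletionService<Integer> cs = new ExecutorCompletionService<>(pool);
        for (int i = 0; i < numFiles; i++) {
            final int id = i;
            // In the real benchmark each task would spawn the external
            // assembler, e.g. new ProcessBuilder("cap3", "est_" + id + ".fa")
            cs.submit(() -> id);   // stand-in for the per-file work
        }
        int completed = 0;
        for (int i = 0; i < numFiles; i++) { cs.take().get(); completed++; }
        pool.shutdown();
        return completed;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(process(8, 256));   // 8 workers, 256 input files
    }
}
```

With a pool of size w, at most w files are assembled concurrently, which is how the benchmark restricts the number of cores that are kept busy.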

In general, these graphs demonstrate that Granules can bring substantial benefits to data-driven applications by amortizing the computational load over a set of machines. Domain scientists do not need to write a single line of networking code; Granules manages this in a transparent fashion for the applications.

F. Assembling mRNA sequences

This subsection describes the performance of Granules in orchestrating the execution of applications developed in languages other than Java. The application we consider is the CAP3 [20] messenger Ribonucleic acid (mRNA) sequence assembly application (C++) developed at Michigan Tech.

An Expressed Sequence Tag (EST) corresponds to mRNAs transcribed from genes residing on chromosomes. Individual EST sequences represent a fragment of mRNA. CAP3 allows us to perform EST assembly to reconstruct full-length mRNA sequences for each expressed gene.

Our objective as part of this benchmark was also to see how Granules can be used to maximize core utilization on a machine. CAP3 takes as input a set of files; in our benchmark we need to process 256 files during the assembly. On a given machine, we fine-tuned the concurrency by setting the number of worker threads within the thread pool to different values. By restricting the number of threads, we also restricted the amount of concurrency and the underlying core utilization. We started off by setting the worker pool size to 1, 2, 4 and 8 on 1 machine, and then used 8 worker threads on 2, 4 and 8 machines. This allowed us to report results for 1, 2, 4, 8, 16, 32 and 64 cores.

The results of our benchmark in terms of processing costs and the speed-ups achieved are depicted in Fig. 14 and Fig. 15 respectively. In general, as the number of available cores increases, there is a corresponding improvement in execution times.

Figure 14. Processing time for EST assembly on different cores using Granules and CAP3 [log-log plot: processing time (seconds) vs. number of cores]

Figure 15. Speed-up for EST assembly using Granules and CAP3 [plot: speed-up vs. number of cores, 1 to 64]

The results demonstrate that, when configured correctly, Granules can maximize core utilization on a given machine. The graphs, plotted on a log-log scale, indicate that for every doubling of the available cores, the processing time for assembling the mRNA sequences reduces by half (approximately). Currently, the Granules runtime reads the thread-pool sizing information from a configuration file; we will be investigating mechanisms that will allow us to dynamically size these thread-pools.

VII. CONCLUSIONS

In this paper, we described the Granules runtime. The rich lifecycle support within Granules allows computations to retain state, which in turn is particularly applicable for several classes of scientific applications. Granules allows complex computational graphs to be created. As discussed, these graphs can encapsulate both control flow and data flow. Granules enforces the semantics of complex distributed computational graphs that have one or more feedback loops. The domain scientist does not have to cope with IO, threading, synchronization or networking libraries while developing applications that span multiple stages, with multiple distributed instances comprising each stage. These computational pipelines can be iterative, periodic, data-driven, or termination-condition dependent.

Demonstrable performance benefits have accrued to Granules as a result of using streaming for disseminating intermediate results. Granules' rich lifecycle support, and its performance when contrasted with comparable systems, underscores the feasibility of using Granules in several settings. As part of our future work, we will be investigating support for autonomic error detection and recovery within Granules.

REFERENCES

[1]. J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Commun. ACM, vol. 51, Jan. 2008, pp. 107-113.
[2]. F. Darema, "SPMD model: past, present and future," Recent Advances in Parallel Virtual Machine and Message Passing Interface, 8th European PVM/MPI Users' Group Meeting, Santorini/Thera, Greece, 2001.
[3]. S. Pallickara, J. Ekanayake, and G. Fox, "An Overview of the Granules Runtime for Cloud Computing" (short paper), Proceedings of the IEEE International Conference on e-Science, Indianapolis, 2008.
[4]. S. Pallickara and G. Fox, "NaradaBrokering: A Middleware Framework and Architecture for Enabling Durable Peer-to-Peer Grids," Proceedings of the ACM/IFIP/USENIX International Middleware Conference (Middleware 2003), pp. 41-61.
[5]. S. Pallickara et al., "A Framework for Secure End-to-End Delivery of Messages in Publish/Subscribe Systems," Proceedings of the 7th IEEE/ACM International Conference on Grid Computing (GRID 2006), Barcelona, Spain.
[6]. S. Pallickara, H. Bulut, and G. Fox, "Fault-Tolerant Reliable Delivery of Messages in Distributed Publish/Subscribe Systems," 4th IEEE International Conference on Autonomic Computing, Jun. 2007, p. 19.
[7]. R. Pike, S. Dorward, R. Griesemer, and S. Quinlan, "Interpreting the data: Parallel analysis with Sawzall," Scientific Programming Journal, Special Issue on Grids and Worldwide Computing Programming Models and Infrastructure, vol. 13, no. 4, pp. 227-298, 2005.
[8]. Apache Hadoop, http://hadoop.apache.org/core/
[9]. S. Garfinkel, "An Evaluation of Amazon's Grid Computing Services: EC2, S3 and SQS," Tech. Rep. TR-08-07, Harvard University, August 2007.
[10]. Message Passing Interface Forum, "MPI: A Message Passing Interface," Proc. of Supercomputing '93, pp. 878-883, IEEE Computer Society Press, November 1993.
[11]. M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: Distributed data-parallel programs from sequential building blocks," European Conference on Computer Systems (EuroSys), March 2007.
[12]. J. B. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations," Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press, 1967, 1:281-297.
[13]. C. Ranger, R. Raghuraman, A. Penmetsa, G. R. Bradski, and C. Kozyrakis, "Evaluating MapReduce for Multi-core and Multiprocessor Systems," Proc. International Symposium on High-Performance Computer Architecture (HPCA), 2007, pp. 13-24.
[14]. Qt Concurrent: Simplified MapReduce in C++ with support for multicores. http://labs.trolltech.com/page/Projects/Threads/QtConcurrent. April 2009.
[15]. Disco project, http://discoproject.org/
[16]. S. Schlatt, T. Hübel, S. Schmidt and U. Schmidt, "The Holumbus distributed computing framework & MapReduce in Haskell," http://holumbus.fh-wedel.de/trac, 2009.
[17]. A. Pisoni, "Skynet: A Ruby MapReduce Framework," http://skynet.rubyforge.org/, April 2009.
[18]. J. Ekanayake, S. Pallickara and G. Fox, "Map-Reduce for Scientific Applications," Proceedings of the IEEE International Conference on e-Science, Indianapolis, 2008.
[19]. J. M. Squyres and A. Lumsdaine, "A Component Architecture for LAM/MPI," Proceedings, Euro PVM/MPI, October 2003.
[20]. X. Huang and A. Madan, "CAP3: A DNA Sequence Assembly Program," Genome Research, 9:868-877, 1999.