Lambda Architecture for Cost-Effective Batch and Speed Big Data Processing

Mariam Kiran
School of Computer Science, University of Bradford
[email protected]

Peter Murphy, Inder Monga, Jon Dugan
Energy Sciences Network (ESnet)
{pmurphy, imonga, jdugan}@es.net

Sartaj Singh Baveja
Netaji Subhas Institute of Technology, New Delhi
[email protected]

Abstract—Sensor and smart phone technologies present opportunities for data explosion, streaming and collecting from heterogeneous devices every second. Analyzing these large datasets can unlock multiple behaviors previously unknown, and help optimize approaches to city-wide applications or societal use cases. However, collecting and handling these massive datasets presents challenges in how to perform optimized online data analysis 'on-the-fly', as current approaches are often limited by capability, expense and resources. This presents a need for developing new methods for data management, particularly using public clouds to minimize cost, network resources and on-demand availability.

This paper presents an implementation of the lambda architecture design pattern to construct a data-handling backend on Amazon EC2, providing high-throughput, dense and intense data demand delivered as services, minimizing the cost of network maintenance. The paper combines ideas from database management, cost models, query management and cloud computing to present a general architecture that could be applied in any given scenario where affordable online data processing of Big Datasets is needed. The results are presented with a case study of processing router sensor data on the current ESnet network as a working example of the approach. The results showcase a reduction in cost and argue the benefits of performing online analysis and anomaly detection for sensor data.

Keywords—big data processing, lambda architecture, Amazon EC2, sensor data analysis

I. INTRODUCTION

The cloud computing paradigm is a promising environment delivering IT-as-a-service for industries and researchers to deploy their applications [4, 8]. These capabilities have laid foundations for more innovative research challenges in Big Data projects, with a continuing growth of massive and diverse data volumes, along with the use of data-intensive applications. These areas present a need to investigate effective means for data management in efficient and cost-effective ways. Forecasting a growth of $75 billion for small and medium-sized businesses using clouds for data management applications, SAP industries argue lower costs, fewer installation needs, and ease of managing fewer IT resources as an attractive business model [14]. However, this technological innovation comes with increased challenges, with network availability, security and reliability among the biggest concerns for businesses world-wide.

Initiatives such as Smart City projects are highly reliant on the availability of various services to fulfil their aims of data collection, management and processing. Access to certain architectures and resources to enable users to conduct Big Data research has raised a number of issues of availability, know-how and security [20]. A constant growth in devices such as smartphones, sensors, household appliances and RFID devices, joining internet capabilities, is producing global data traffic of massive volumes and varieties, presenting various challenges for the security and management of these data-as-a-service applications [20].

With this in mind, multiple vendors are delivering services for data processing, such as Amazon Web Services (AWS), Rackspace hosting and Microsoft Azure, presenting a collection of tools for online data collection, cloud-hosted databases and map reduce processing using Hadoop, Hive or Spark. By offering users virtual machines to host, compute and manage their data, users gain advantages such as elasticity, multi-tenancy and the pay-as-you-go cost model. For instance, cloud resources can be rented with current Amazon services priced at $0.853/hour for small data resources (i2.xlarge) and $5.520/hour for large data resources (d2.8xlarge) on demand. Additionally, reserved instances can be rented on 1 to 3 year terms, but may prove expensive in the long run, especially if data needs are not intensive at all times.

This paper presents a cost-optimised architecture for online and batch processing of massive volumes of sensor data, as an adaptation of the lambda architecture design pattern currently being used by companies such as Twitter and AWS [21]. The architecture combines both batch and stream processing capabilities for online processing and handling of massive data volumes in a uniform manner, reducing costs in the process. The paper presents flexible data provisioning based on user needs and achieves the following:

• Data can be collected online and processed on-the-fly for real-time analysis.

• Massive batch processes can be performed on historical data sets to observe data patterns over longer periods of time.

• Cost-effective solutions using cloud services are investigated for deploying this architecture.

The paper is organised as follows: section 2 presents related work and research pertaining to data processing architectures and the challenges still faced in them. This section also presents an overview of the lambda architecture and how it is currently being used with Apache Storm and Hadoop. Section 3 presents the proposed architecture and the implementation challenges of porting similar data processing toolkits to AWS. These observations are supported by a case study presented in section 4, showing online data collection and processing for multiple router sensors sending data at a constant rate of 30 seconds. The results and conclusions are discussed in sections 5 and 6, with further future extensions to the architecture to enable future smart city projects.

II. RELATED WORK

A. Current Data Processing Solutions

Data analytics are essential to plan and create decision support systems for optimising the underlying infrastructure. This involves not only processing of the online data, in search of certain events, but also the historical data sources which may be needed to find data patterns which influence decisions. Cloud providers are paramount for the availability and durability of their resources, but present various challenges. For instance, for availability, data is often replicated across multiple servers in different geographical locations, sometimes in untrustworthy locations [6]. There are also additional computational challenges in handling elasticity by allocating resources on-the-fly to handle increased demand.

Mian et al. [22] presented a cost-effective model for virtual machine provisioning to execute dynamic data analytic workloads while trying to satisfy all service level agreement (SLA) constraints. The paper highlighted how an optimised infrastructure would rely more on the provider setting up experiments than on defined SLAs. Dobre et al. [23] presented a context-aware framework, specifically designed for handling multiple devices, mapping between components and caching or handling requests from multiple users. While supporting intelligent data processing through contexts, the authors did not discuss how the data is moved through multiple abstraction layers to aid the speed and cost of delivery. Further projects such as M3 [24] proposed a disk communication layer between the mappers and reducers to allow dynamic rate-based load balancing and multi-streaming of applications. Another version of the project, Chameleon [25], used specific context-based indexing to augment queries for fast data delivery. Other concrete projects, such as Yahoo's Pig [17], Microsoft's SCOPE [5] and Google's initiatives [9], are aiming to integrate declarative query constructs from the database community into MapReduce-like software to allow greater data independence, code reusability, and automatic query optimization. These projects approached the problem as a distributed model; however, further work needs to explore hybrid solutions which consider resources, data models and varied queries in accordance with network traffic or cost.

Researchers have often merged techniques with other tools to develop field-related solutions. Abouzied [26] discussed HadoopDB, a hybrid of MapReduce and DBMS technologies, to allow scalability and performance of massive data processing. The authors present the application for biological protein analysis or for business warehousing. Another example of merging was for image analysis in medical fields [27]. Bruns [28] discussed how the current Twitter APIs were extended for third-party researchers to deploy their own data analysis on Twitter feeds in order to enhance business practices. However, unique solutions that allow multiple users of varying backgrounds to write and deploy optimised data processing applications are still needed, particularly tailored solutions for online and batch data processing which keep in line non-functional attributes such as cost and network complexities.

Further work has used similar data processing toolkits in smart grid applications, where it is important to forecast and redistribute resources on the fly [31]. Current industry focus on using Spark SQL has aided faster processing, reducing some of the weaknesses of the Hadoop processing model [30].

B. Lambda architecture

Presented as a software design pattern, the lambda architecture unifies online and batch processing within a single framework. The pattern is suited to applications where there are time delays in data collection and availability through dashboards, requiring data validity for online processing as it arrives. The pattern also allows for batch processing of older data sets to find behavioural patterns as per user needs [21].

Fig. 1. Basic lambda architecture for speed and batch processing.

Figure 1 shows the basic architecture of how the lambda architecture works. It caters for three layers: (1) batch processing for precomputing large amounts of data sets; (2) speed or real-time computing to minimize latency by doing real-time calculations as the data arrives; and (3) a layer to respond to queries, interfacing to query and provide the results of the calculations.

Fig. 2. Main lambda architecture implemented on Amazon web services.
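In code terms, the pattern in Figures 1 and 2 reduces to three cooperating pieces. The following is an illustrative in-memory sketch (ours, not the AWS implementation described later): a batch view recomputed over the master data set, a speed view absorbing records that arrive after the batch run, and a serving-layer query that merges the two.

```python
from collections import defaultdict

def batch_view(master_records):
    """Batch layer: recompute per-router totals from the full, immutable data set."""
    view = defaultdict(int)
    for router, count in master_records:
        view[router] += count
    return dict(view)

class SpeedView:
    """Speed layer: incrementally absorbs records that arrived after the
    last batch run, keeping query latency low."""
    def __init__(self):
        self.view = defaultdict(int)

    def absorb(self, router, count):
        self.view[router] += count

def query(batch, speed, router):
    """Serving layer: answer = precomputed batch result + real-time delta."""
    return batch.get(router, 0) + speed.view.get(router, 0)

master = [("rtr1", 10), ("rtr2", 5), ("rtr1", 7)]
bv = batch_view(master)       # recomputed once per batch cycle
sv = SpeedView()
sv.absorb("rtr1", 3)          # record arriving after the batch run
print(query(bv, sv, "rtr1"))  # -> 20
```

When the next batch cycle re-reads the master data set (now including the new record), the speed view for that period can be discarded, which is what keeps the two layers consistent.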

Lambda architecture allows users to optimise their data processing costs by understanding which parts of the data need online or batch processing. The architecture also partitions datasets to allow various kinds of calculation scripts to be executed on them [21]. However, a few critiques of the architecture have argued that the multiple sets of projects that need to be maintained under the data branch, to allow multiple data executions, require more skills from the developers setting up the jobs to execute and produce results.

Despite this, the architecture is well suited to big data processing problems with multiple kinds of analysis needed to study the online data arriving through sensors. The online stream can be used to detect data anomalies, verifying whether data is accurate before processing it further. Verified data can then be stored in databases, on which batch scripts can be performed once a day or a month to study data patterns over a time period. Users can reduce the costs of performing these scripts on larger data sets by breaking the problem down into manageable steps, reducing cost and tailoring the data analysis routines to suit their needs. This architecture can be adapted for collecting and analysing online sensor data to find efficient solutions to process large data sets.

III. PROPOSED ARCHITECTURE

Scenarios such as smart cities involve working with large, complex networks of sensors continuously fetching and recording data to a central repository for efficient decisions. Examples such as when to send garbage collection vans, or when to grit the roads for better driving conditions, can all be motivated through the visual, motion and temperature sensor networks that already exist in city infrastructures.

Public clouds provide a number of services which could be employed for online and batch processing. Table 1 presents a comparison of Microsoft Azure and Amazon AWS services offering similar capabilities. For the purpose of this paper, Amazon EC2 was chosen as a starting point for accessing multiple services. A comparison of the services presented in Table 1 shows that the online processing needs stream and batch processing, which was easier to perform in the Amazon cloud than with the Azure services. The availability of services and cost plans for first-time users of the Amazon infrastructure were also suitable for the project objectives.

TABLE I. COMPARISON OF CLOUD SERVICES

Service category              | Microsoft Azure                              | Amazon Web Services
Available region              | Azure Region                                 | AWS Global Infrastructure
Compute services              | Virtual Machines (VMs)                       | Elastic Compute Cloud (EC2)
Storage options               | Azure Storage (Blobs, Tables, Queues, Files) | Amazon Simple Storage Service (S3)
Database options              | Azure SQL Database                           | Amazon Relational Database Service (RDS), Amazon Redshift
NoSQL database                | Azure DocumentDB                             | Amazon DynamoDB
Cache options                 | Azure Managed Cache (Redis Cache)            | Amazon ElastiCache
Data orchestration            | Azure Data Factory                           | AWS Data Pipeline
Administration & security     | Azure Active Directory                       | AWS Directory Service, AWS Identity and Access Management (IAM)
Analytics                     | Azure Stream Analytics                       | Amazon Kinesis
Other services & integrations | Azure Machine Learning                       | AWS Lambda, AWS Config

Fig. 3. ESnet router production network.

Amazon AWS offers a collection of services which could be used for different purposes, each differing in cost and time. Selecting the appropriate cloud service to map onto the general lambda architecture was not obvious, and required comparisons and study of performance and cost. One of these decisions is showcased in Table 2, which presents a comparison of using either S3 or DynamoDB as a means to handle and process data. Although DynamoDB is much more expensive compared to S3, the speed of query processing would reduce the total effective cost, as we plan for long-term use of DynamoDB rather than S3.

TABLE II. COMPARING S3 AND DYNAMODB SERVICES

                | DynamoDB                       | S3
Request pricing | $0.02 per 100,000 transactions | $0.005 per 1,000 requests
Storage costs   | Vary; maximum is $0.09         | Vary; $0.03 per GB
Characteristics | Faster; database (DB) storage  | Blob storage

Similarly, a number of decisions had to be addressed in terms of cost and usefulness of the services. For the purposes of online processing of data, Amazon Kinesis was chosen and merged with AWS Lambda for event-based processing of the data. Figure 2 describes the final processing architecture that was built on Amazon web services to read router data every 30 seconds and process it both as it arrives and in batch jobs. A recent report by Amazon [29] uses Apache Spark and Storm for processing the data stream. It also uses an event processing service which allows processing scripts to be triggered when data arrives in the Kinesis stream. In our architecture (figure 2) the event processing was omitted because, in the use case, data was known to be arriving every 30 seconds, making an event processing element less necessary. An event processing element also charges every time it is triggered, which would eventually cost more than the current architecture implemented.

The initial implementation report [29] also uses Spark SQL to perform batch processing for fast query analysis. In figure 2, the basic Elastic MapReduce functions were implemented with Hadoop to perform map reduce processing jobs on hourly, daily and monthly bases in batches. The batch jobs could be triggered via cron jobs or through a job scheduler to run once a day, after the online data has been collected for the day. The map reduce jobs can filter and sort the data on either hourly, 5-hourly or daily bases.

IV. USE CASE: ESNET NETWORK SENSOR TESTBED

We used the entire ESnet router production network as the testbed to experiment with this architecture (shown in Figure 3). An existing SNMP data collection software, ESxSNMP, was used to collect router in and out bytes from every interface every 30 seconds.

A. Real-time (online) or Speed processing

The raw data arrives at 30-second intervals from multiple router interfaces in the form of json files. These data sets were read and processed to calculate averages across 5-minute intervals and the maximum values recorded, as follows:

Arriving json raw data: [router_id, interface_id, variable_id, timestamp, data_recorded]
5-minute aggregations: [router_id, interface_id, variable_id, 5_minute_avg, maximum_data_in_5_minutes]

The 5-minute aggregations were output to a new stream, which could then be used to visualise the data as it arrives.

B. Batch processing

Figure 4 shows the batch processing jobs on the raw data sets. Multiple map reduce jobs can be triggered to read the raw data sets and produce consolidated 1-day, 7-day, 90-day and 1-year aggregations. These batch files can only perform calculations on stored data sets. Outputs for the calculated data sets can be read into an output directory to visualise the averaged data sets. These outputs are also stored in separate S3 buckets.
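The 5-minute aggregation step above can be sketched as follows. This is an illustrative reconstruction, not the deployed Kinesis consumer: the record layout follows the text, while the windowing-by-timestamp logic is our assumption.

```python
import json
from collections import defaultdict

WINDOW = 5 * 60  # aggregation window in seconds

def aggregate(raw_json_lines):
    """Group 30-second records into 5-minute windows per
    (router_id, interface_id, variable_id) key and emit the
    [router, interface, variable, average, maximum] aggregation rows."""
    buckets = defaultdict(list)
    for line in raw_json_lines:
        # record layout from section IV.A:
        # [router_id, interface_id, variable_id, timestamp, data_recorded]
        router, iface, var, ts, value = json.loads(line)
        window_start = ts - (ts % WINDOW)  # align timestamp to its 5-minute window
        buckets[(router, iface, var, window_start)].append(value)
    return [
        [router, iface, var, sum(vals) / len(vals), max(vals)]
        for (router, iface, var, _), vals in sorted(buckets.items())
    ]

raw = [json.dumps(["rtr1", "eth0", "in_bytes", t, v])
       for t, v in [(0, 100), (30, 200), (60, 300), (300, 50)]]
print(aggregate(raw))
# first window (t = 0..270): avg 200.0, max 300; second window: avg 50.0, max 50
```

In the deployed system these output rows would be written to the downstream Kinesis stream for visualisation rather than printed.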

Fig. 5. Cost per service during two months (Month 1 and Month 2) of the experiment.

Figure 5 shows the cost variation between two months. Cloud services charge for the amount of usage. As shown in Figure 5, Month 1 placed more demand on EC2 services than Month 2. This is reflected in Figures 6 and 7, where Month 3 used even fewer resources, bringing the costs down from 80 dollars to 20 dollars. Figure 7 shows that the cheapest services to use were the micro machines, with the xlarge machines being the most expensive.
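The month-to-month cost differences follow directly from billed instance-hours and per-type rates, with partial hours charged as full hours (a behaviour noted again in the conclusions). A rough sketch of such an estimate, using placeholder rates rather than the study's actual prices:

```python
import math

def billed_hours(runtime_minutes, runs):
    """Each run is rounded up to a whole hour, as EC2 billed at the time."""
    return runs * math.ceil(runtime_minutes / 60)

def monthly_cost(usage, rates):
    """usage: {instance_type: (runtime_minutes_per_run, runs_per_month)};
    rates: {instance_type: dollars_per_hour}."""
    return sum(billed_hours(mins, runs) * rates[itype]
               for itype, (mins, runs) in usage.items())

# Placeholder hourly rates and workloads (illustrative, not the paper's data):
rates = {"t2.micro": 0.013, "c1.medium": 0.13, "m3.xlarge": 0.266}
usage = {"c1.medium": (10, 30),   # daily 10-minute batch job
         "m3.xlarge": (5, 30)}    # daily 5-minute batch job
print(round(monthly_cost(usage, rates), 2))  # 30*0.13 + 30*0.266 = 11.88
```

The rounding explains why a 5-minute m3.xlarge job costs as much per run as a 55-minute one, and why switching to smaller instance types dominates the savings reported below.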

Fig. 4. Batch processing on raw data sets.

The EMR scripts used c1.medium machines as master, core and task nodes, with machine image version 3.8.0 and the Amazon 2.4.0 Hadoop distribution. The map reduce command used was as follows:

hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
  -files s3://location-of-mapper/mapper.py,s3://location-of-reducer/reducer.py \
  -libjars /home/hadoop/CustomOutputFormats3.jar \
  -outputformat oddjob.hadoop.MultipleTextOutputFormatByKey \
  -mapper "python mapper.py" -reducer "python reducer.py" \
  -input s3n://location-of-inputs/jsons/ \
  -output s3n://location-of-output-job

The command above allows users to specify the location of the mapper and reducer files, the input files, and where to produce outputs. Further java files can be passed as arguments to specify the format of the outputs generated, as an optional step.

Fig. 6. Instance hours used over 3 months (Months 1-3), by instance type (t2.micro, m3.xlarge, c1.medium, m1.small, t1.micro).
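The mapper.py and reducer.py passed to the command above are not listed in the paper. A minimal sketch of what such a job could look like, with a local stand-in for the Hadoop shuffle and a key layout assumed from the record format in section IV:

```python
import json
from itertools import groupby

def map_record(json_line, bucket_seconds=86400):
    """Mapper: emit (key, value), with the timestamp floored to an
    hourly/daily bucket so the shuffle sorts records into time bins."""
    router, iface, var, ts, value = json.loads(json_line)
    return (router, iface, var, ts - ts % bucket_seconds), value

def reduce_group(key, values):
    """Reducer: average and maximum per (router, interface, variable, bucket)."""
    vals = list(values)
    return key, (sum(vals) / len(vals), max(vals))

def run_job(lines, bucket_seconds=86400):
    """Local stand-in for the Hadoop shuffle: sort by key, group, reduce.
    In the real job, map_record and reduce_group would instead read from
    and write to stdin/stdout in mapper.py and reducer.py."""
    pairs = sorted(map_record(l, bucket_seconds) for l in lines)
    return [reduce_group(k, (v for _, v in grp))
            for k, grp in groupby(pairs, key=lambda kv: kv[0])]

lines = [json.dumps(["rtr1", "eth0", "in_bytes", t, v])
         for t, v in [(0, 10), (3600, 30), (90000, 5)]]
print(run_job(lines))  # day 0: avg 20.0, max 30; day 1: avg 5.0, max 5
```

Changing bucket_seconds to 3600 or 5*3600 gives the hourly and 5-hourly sorts mentioned in section III, without touching the mapper or reducer logic.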

V. RESULTS

The experiments were set up to run for 3 months, executing certain jobs at specific times during the months. Figures 5-11 represent the statistics in terms of instance hours used, costs and the types of services used during the time period.

Fig. 7. Cost by instance type (t2.micro, m3.xlarge, c1.medium, m1.small, t1.micro) over Months 1-3.

Further analysis of the usage of the online kinesis stream and the batch processing can be seen in Figures 8-11. Figures 8 and 9 show the distribution of requests used by the kinesis stream. Most of the kinesis cost arose from the stream hosting the data, shown as shards active per hour. This shows that in the future cost could be optimised by limiting the amount of time the stream needs to be active, by starting and deleting it dynamically when it is used.

Fig. 8. Online stream: Kinesis usage (number of requests over time) during 1 month.

Fig. 9. Online stream: Distribution of work in terms of requests for kinesis (Storage-ShardHour, PutRequestVolume, PutRequestPayloadUnits, PutRequestBytes, DataTransfer-Out-Bytes).

Fig. 10. Batch processing: usage of various instance types (BoxUsage:m3.xlarge, BoxUsage:c1.medium) in EMR over 3 months.

Figures 10 and 11 depict the usage of instances and the cost incurred by running batch jobs over the three-month period using elastic map reduce. Using large machine instances such as m3.xlarge produces higher costs even if run only a few times. These costs can be optimised by replacing larger machines with smaller machines such as c1.medium, to give similar results but at lower costs. Elastic map reduce involves creating a cluster of machines acting as master and slaves to deploy and run the map reduce jobs. Therefore, replacing these machines with smaller machines and more nodes can help reduce the machine cost while performing the same processing. The runtime of the c1.medium cluster was approximately 10 minutes, which was slower (by 5 minutes) than the m3.xlarge machine clusters. Figure 11 shows the reduction in cost achieved by replacing the machine specifications for the map reduce clusters.

Fig. 11. Batch processing: price distribution of the EMR during 3 months.

VI. DISCUSSION AND CONCLUSIONS

Lambda architecture allows multiple data processing scripts which are tailored to specific data sets. For online processing, stream processing can be used to perform calculations as data arrives, and batch processing scripts can be created to run on data stored from before. The above architecture was implemented on Amazon AWS, utilising multiple resources and producing various results in terms of cost and usage. A number of lessons were learned while exploring the architecture for both batch and real-time processing.

For batch processing:

• Performing local tests before deploying the scripts on EC2 helps to find code errors and reduces the costs of failed clusters on EC2.

• Even if the cluster executes for only 2 minutes, Amazon cloud charges the machine for a full hour, which causes the steep increase in EMR costs shown in Figure 10. Which machines are used in the cluster impacts the cost, and also affects the time the service takes to execute the jobs.
• Future work can involve requesting spot instances, to optimise the costs even further, though this may introduce more delays in the processing being fulfilled.

For real-time or online processing:

• The kinesis stream reads data in a last-in-first-out manner, which meant data needed to be saved locally to calculate the 5-minute data aggregations.

• Amazon cloud management allows roles and rights to be assigned to multiple members of a team. Multiple members may have rights for kinesis processing, but if the kinesis stream interacts with other services by moving data across, the member requires rights to those services as well. Software testing strategies, such as try and catch exceptions, need to be implemented in the code to prevent services from failing.

In terms of processing time, the kinesis stream was able to process data in real-time, while the EMR cluster took approximately 10 minutes to complete a job. Some of the services charge while they are active, and thus should only be dynamically started and stopped when being used, in order to optimise costs.

In conclusion, the lambda architecture on Amazon AWS was able to provide a proof-of-concept for data processing both as data arrives and after it has been saved. The tailored solutions allow users to perform cost-optimised processing. Both processes can produce data aggregations which are easier to plot, reducing the time for data processing through the data visualisation interface.

Further work needs to explore issues of data security, availability zones and data replication for more complex operations of the services. Scripts which can perform data roll-ups in case of data loss, and machine-learning algorithms for performing online anomaly detection, can be embedded into the stream to check the data as it arrives. This would allow verification and validation of the data sets to be conducted on arrival, preventing the loss in time and cost of running these scripts after the data has been saved and processed. Overall, the above experiment shows huge potential for processing Big Data sets in cost-effective ways by implementing the lambda architecture design pattern, and shows that the architecture is particularly suited to cloud services where multiple resources are available for controlling and processing the data. This sort of architecture would prove extremely useful for processing sensor-related data in Smart City research, which is to be explored further in the future.

ACKNOWLEDGMENT

The work was supported by the NEMODE EPSRC grant (www.nemode.ac.uk) for Big Data analysis on Cloud services, and further via ESnet resources and DOE support for Amazon Cloud services. ESnet is operated by Lawrence Berkeley National Laboratory, which is operated by the University of California for the U.S. Department of Energy under contract DE-AC02-05CH11231. This work was supported by the Directors of the Office of Science, Office of Advanced Scientific Computing Research, Facilities Division.

REFERENCES

[1] D. Abadi, Data Management in the Cloud: Limitations and Opportunities, IEEE Data Engineering Bulletin, vol. 32, no. 1, pp. 3-12, 2009.
[2] R.M. Badia, M. Corrales, T. Dimitrakos, K. Djemame, E. Elmroth, A. Ferrer, N. Forgó, J. Guitart, F. Hernández, B. Hudzia, A. Kipp, K. Konstanteli, S. Nair, T. Sharif, C. Sheridan, J. Tordsson, T. Varvarigou, S. Wesner, W. Ziegler, C. Zsigri, Demonstration of the OPTIMIS Toolkit for Cloud Service Provisioning, Towards a Service-Based Internet, LNCS vol. 6994, 2011.
[3] J. Brown, Migration, Integration, Challenges in Government Cloud deployments, Govtech, 2015.
[4] R. Buyya, C. Yeo, S. Venugopal, J. Broberg, I. Brandic, Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility, FGCS 25, 2009.
[5] R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou, SCOPE: Easy and efficient parallel processing of massive data sets, in Proc. of VLDB, 2008.
[6] M. Dikaiakos, G. Pallis, D. Katsaros, P. Mehra, A. Vakali, Cloud Computing: Distributed Internet Computing for IT and Scientific Research, IEEE Internet Computing, 2009.
[7] e-skills UK, Jan 2013, https://www.e-skills.com/research/research-themes/big-data-analytics/
[8] I. Foster, Y. Zhao, I. Raicu, S. Lu, Cloud Computing and Grid Computing 360-Degree Compared, Proceedings of the IEEE Grid Computing Environments Workshop, pp. 1-10, 2008.
[9] J. N. Hoover, Start-Ups Bring Google's Parallel Processing to Data Warehousing, InformationWeek, August 29th, 2008.
[10] K. Jackson, K. Muriki, L. Ramakrishnan, K. Runge, R. Thomas, Performance and cost analysis of the Supernova factory on the Amazon AWS cloud, Scientific Programming 19, 2011.
[11] T. Jin, C. Tracy, M. Veeraraghavan, Characterization of high-rate large-sized flows, University of Virginia, Master's thesis, 2013. Online: http://www.cs.virginia.edu/events/colloquia/jin.html
[12] Z. Liu, M. Veeraraghavan, C. Tracy, J. Tie, I. Foster, J. Dennis, J. Hick, W. Yang, On using virtual circuits for GridFTP transfers, Int. Conf. for HPC, Networking, Storage and Analysis, pp. 81:1, 2012.
[13] Y. Lu, M. Wang, B. Prabhakar, F. Bonomi, ElephantTrap: A low cost device for identifying large flows, 15th Annual IEEE Symposium on High-Performance Interconnects, pp. 99-108, 2007.
[14] D. Menon, Benefits and challenges in deployment of Cloud solutions for SME, InsideSAP, 2014.
[15] C. Olston, B. Reed, U. Srivastava, R. Kumar, A. Tomkins, Pig latin: a not-so-foreign language for data processing, SIGMOD Conference, pp. 1099-1110, 2008.
[16] S. Sakr, A. Liu, D. M. Batista, M. Alomari, A Survey of Large Scale Data Management Approaches in Cloud Environments, IEEE Communication Surveys & Tutorials, vol. 13, no. 3, 2011.
[17] US DOE Office of Science ASCR, Terabit networks for extreme-scale science workshop report. Available online: http://science.energy.gov/, 2014.
[18] S. Sarvotham, R. Riedi, R. Baraniuk, Connection-level analysis and modelling of network traffic, ACM SIGCOMM Internet Measurement Workshop, pp. 99-104, 2001.
[19] P. Xiong, Y. Chi, S. Zhu, H.J. Moon, C. Pu, H. Hacigumus, Intelligent Management of Virtualized Resources for Database Systems in Cloud Environment, in ICDE, 2011.
[20] N. Mitton, S. Papavassiliou, A. Puliafito, K. S. Trivedi, Combining Cloud and sensors in a smart city environment, EURASIP Journal on Wireless Communications and Networking, December 2012, 2012:247.
[21] N. Marz, J. Warren, Big Data: Principles and best practices of scalable realtime data systems, Manning Publications, 2013.
[22] R. Mian, P. Martin, J.L. Vazquez-Poletti, Provisioning data analytic workloads in a cloud, Future Generation Computer Systems 29 (2013) 1452-1458.
[23] C. Dobre, F. Xhafa, Intelligent services for Big Data science, Future Generation Computer Systems 37 (2014) 267-281.
[24] A.M. Aly, A. Sallam, B.M. Gnanasekaran, L.V. Nguyen-Dinh, W.G. Aref, M. Ouzzani, A. Ghafoor, M3: stream processing on main-memory MapReduce, in: Proceedings of the 2012 IEEE 28th International Conference on Data Engineering, ICDE '12, IEEE Computer Society, Washington, DC, USA, 2012, pp. 1253-1256.
[25] T.M. Ghanem, A.K. Elmagarmid, P.-A. Larson, W.G. Aref, Supporting views in data stream management systems, ACM Transactions on Database Systems 35 (2008) 1:1-1:47.
[26] A. Abouzied, K. Bajda-Pawlikowski, J. Huang, D.J. Abadi, and A. Silberschatz, HadoopDB in action: building real world applications, in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, ACM, New York, USA, 2010.
[27] X.-W. Song, Z.-Y. Dong, X.-Y. Long, S.-F. Li, X.-N. Zuo, C.-Z. Zhu, et al., REST: A Toolkit for Resting-State Functional Magnetic Resonance Imaging Data Processing, PLoS ONE 6(9): e25031, 2011. doi:10.1371/journal.pone.0025031
[28] A. Bruns, Y.E. Liang, Tools and methods for capturing Twitter data during natural disasters, First Monday, March 2012. ISSN 13960466.
[29] Amazon Web Services, Lambda Architecture for Batch and Stream Processing on AWS, May 2015.
[30] R. Xin, J. Rosen, M. Zaharia, M. Franklin, S. Shenker, I. Stoica, Shark: SQL and Rich Analytics at Scale, 2013.
[31] Y. Simmhan, V. Agarwal, S. Aman, A. Kumbhare, S. Natarajan, N. Rajguru, I. Robinson, S. Stevens, W. Yin, Q. Zhou and V. Prasanna, Adaptive Energy Forecasting and Information Diffusion for Smart Power Grids, IEEE International Scalable Computing Challenge (SCALE 2012), 2012.