Lambda Architecture for Cost-Effective Batch and Speed Big Data Processing

Mariam Kiran
School of Computer Science, University of Bradford
[email protected]

Peter Murphy, Inder Monga, Jon Dugan
Energy Sciences Network (ESnet)
{pmurphy, imonga, jdugan}@es.net

Sartaj Singh Baveja
Netaji Subhas Institute of Technology, New Delhi
[email protected]

Abstract—Sensor and smart phone technologies present opportunities for data explosion, streaming and collecting from heterogeneous devices every second. Analyzing these large datasets can unlock multiple behaviors previously unknown, and help optimize approaches to city-wide applications or societal use cases. However, collecting and handling these massive datasets presents challenges in how to perform optimized online data analysis 'on-the-fly', as current approaches are often limited by capability, expense and resources. This presents a need for developing new methods for data management, particularly using public clouds to minimize cost, network resources and on-demand availability.

This paper presents an implementation of the lambda architecture design pattern to construct a data-handling backend on Amazon EC2, providing high-throughput, dense and intense data demand delivered as services, minimizing the cost of network maintenance. The paper combines ideas from database management, cost models, query management and cloud computing to present a general architecture that could be applied in any given scenario where affordable online data processing of Big Datasets is needed. The results are presented with a case study of processing router sensor data on the current ESnet network as a working example of the approach. The results showcase a reduction in cost and argue the benefits of performing online analysis and anomaly detection for sensor data.

Keywords—big data processing, lambda architecture, Amazon EC2, sensor data analysis

I. INTRODUCTION

The cloud computing paradigm is a promising environment delivering IT-as-a-service for industries and researchers to deploy their applications [4, 8]. These capabilities have laid foundations for more innovative research challenges in Big Data projects, with a continuing growth of massive and diverse data volumes, along with the use of data-intensive applications. These areas present a need to investigate effective means for data management in efficient and cost-effective ways. Forecasting a growth of $75 billion for small and medium-sized businesses using clouds for data management applications, SAP industries argue lower costs, fewer installation needs, and ease of managing fewer IT resources as an attractive business model [14]. However, this technological innovation comes with increased challenges, with network availability, security and reliability among the biggest concerns for businesses world-wide.

Initiatives such as Smart City projects are highly reliant on the availability of various services to fulfil their aims of data collection, management and processing. Access to certain architectures and resources to enable users to conduct Big Data research has raised a number of issues of availability, know-how and security [20]. A constant growth in devices such as smartphones, sensors, household appliances and RFID devices, joining internet capabilities, is producing global data traffic of massive volumes and varieties, presenting various challenges for the security and management of these data-as-a-service applications [20].

With this in mind, multiple vendors are delivering services for data processing, such as Amazon Web Services (AWS), Rackspace hosting and Microsoft Azure, presenting a collection of tools for online data collection, cloud-hosted databases and map reduce processing using Hadoop, Hive or Spark. By offering users virtual machines to host, compute and manage their data, users gain advantages such as elasticity, multi-tenancy and the pay-as-you-go cost model. For instance, cloud resources can be rented with current Amazon services priced at $0.853/hour for small data resources (i2.xlarge) and $5.520/hour for large data resources (d2.8xlarge) on demand. Additionally, reserved instances can be rented on 1 to 3 year terms, but may prove expensive in the long run, especially if data needs are not intensive at all times.

This paper presents a cost-optimised architecture for online and batch processing of massive volumes of sensor data, as an adaptation of the lambda architecture design pattern currently being used by companies such as Twitter and AWS [21]. The architecture combines both batch and stream processing capabilities for online processing and handling of massive data volumes in a uniform manner, reducing costs in the process. The paper presents flexible data provisioning based on user needs and achieves the following:

• Data can be collected online and processed on-the-fly for real-time analysis.

• Massive batch processes can be performed on historical data sets to observe data patterns over longer periods of time.

• Cost-effective solutions using cloud services are investigated for deploying this architecture.

The paper is organised as follows: section 2 presents related work and research pertaining to data processing architectures and the challenges still faced in them. This section also presents an overview of the lambda architecture and how it is currently being used with Apache Storm and Hadoop. Section 3 presents the proposed architecture and the implementation challenges of porting similar data processing toolkits to AWS. These observations are supported by a case study presented in section 4, showing online data collection and processing for multiple router sensors sending data at a constant rate of 30 seconds. The results and conclusions are discussed in sections 5 and 6, with further future extensions to the architecture to enable future smart city projects.

II. RELATED WORK

A. Current Data Processing Solutions

Data analytics are essential to plan and create decision support systems for optimising the underlying infrastructure. This involves not only processing of the online data, in search of certain events, but also the historical data sources which may be needed to find data patterns which influence decisions. Cloud providers are paramount for the availability and durability of their resources, but present various challenges. For instance, for availability, data is often replicated across multiple servers in different geographical locations, sometimes in untrustworthy locations [6]. There are also additional computational challenges in handling elasticity by allocating resources on-the-fly to handle increased demand.

Mian et al. [22] presented a cost-effective model for virtual machine provisioning to execute dynamic data analytic workloads while trying to satisfy all service level agreement (SLA) constraints. The paper highlighted how an optimised infrastructure would rely more on the provider setting up experiments than on defined SLAs. Dobre et al. [23] presented a context-aware framework, specifically designed for handling multiple devices, mapping between components and caching or handling requests from multiple users. While supporting intelligent data processing through contexts, the authors did not discuss how the data is moved through multiple abstraction layers to aid the speed and cost of delivery. Further projects such as M3 [24] proposed a disk communication layer between the mappers and reducers to allow dynamic rate-based load balancing and multi-streaming of applications. Another version of the project, Chameleon [25], used specific context-based indexing to augment queries for fast data delivery. Other concrete projects, such as Yahoo's Pig [17], Microsoft's SCOPE [5] and Google's initiatives [9], are aiming to integrate declarative query constructs from the database community into MapReduce-like software to allow greater data independence, code reusability, and automatic query optimization. These projects approached the problem as a distributed model; however, further work needs to explore hybrid solutions which consider resources, data models and varied queries in accordance with network traffic or cost.

Researchers have often merged techniques with other tools to develop field-related solutions. Abouzied [26] discussed HadoopDB, a hybrid of MapReduce and DBMS technologies, to allow scalability and performance of massive data processing. The authors present the application for biological protein analysis or for business warehousing. Another example of merging was for image analysis in medical fields [27]. Bruns [28] discussed how the current Twitter APIs were extended for third-party researchers to deploy their own data analysis on Twitter feeds in order to enhance business practices. However, unique solutions that allow multiple users of varying backgrounds to write and deploy optimised data processing applications are still needed, particularly tailored solutions for online and batch data processing which keep in line non-functional attributes such as cost and network complexities.

Further work has used similar data processing toolkits in smart grid applications, where it is important to forecast and redistribute resources on the fly [31]. Current industry focus on using Spark SQL has aided faster processing, reducing some of the weaknesses of the Hadoop processing model [30].

B. Lambda architecture

Presented as a software design pattern, the lambda architecture unifies online and batch processing within a single framework. The pattern is suited to applications where there are time delays in data collection and availability through dashboards, requiring data validity for online processing as it arrives. The pattern also allows for batch processing of older data sets to find behavioural patterns as per user needs [21].

Fig. 1. Basic lambda architecture for speed and batch processing.

Figure 1 shows the basic architecture of how the lambda architecture works. It caters for three layers: (1) batch processing for precomputing large amounts of data sets; (2) speed or real-time computing to minimize latency by doing real-time calculations as the data arrives; and (3) a layer to respond to queries, interfacing to query and provide the results of the calculations.

Fig. 2. Main lambda architecture implemented on Amazon web services.
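In code terms, the pattern in Figures 1 and 2 reduces to three cooperating pieces. The following is an illustrative in-memory sketch (ours, not the AWS implementation described later): a batch view recomputed over the master data set, a speed view absorbing records that arrive after the batch run, and a serving-layer query that merges the two.

```python
from collections import defaultdict

def batch_view(master_records):
    """Batch layer: recompute per-router totals from the full, immutable data set."""
    view = defaultdict(int)
    for router, count in master_records:
        view[router] += count
    return dict(view)

class SpeedView:
    """Speed layer: incrementally absorbs records that arrived after the
    last batch run, keeping query latency low."""
    def __init__(self):
        self.view = defaultdict(int)

    def absorb(self, router, count):
        self.view[router] += count

def query(batch, speed, router):
    """Serving layer: answer = precomputed batch result + real-time delta."""
    return batch.get(router, 0) + speed.view.get(router, 0)

master = [("rtr1", 10), ("rtr2", 5), ("rtr1", 7)]
bv = batch_view(master)       # recomputed once per batch cycle
sv = SpeedView()
sv.absorb("rtr1", 3)          # record arriving after the batch run
print(query(bv, sv, "rtr1"))  # -> 20
```

When the next batch cycle re-reads the master data set (now including the new record), the speed view for that period can be discarded, which is what keeps the two layers consistent.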

Lambda architecture allows users to optimise their data processing costs by understanding which parts of the data need online or batch processing. The architecture also partitions datasets to allow various kinds of calculation scripts to be executed on them [21]. However, a few critiques of the architecture have argued that the multiple sets of projects that need to be maintained under the data branch, to allow multiple data executions, require more skills from the developers setting up the jobs to execute and produce results.

Despite this, the architecture is well suited to big data processing problems with multiple kinds of analysis needed to study the online data arriving through sensors. The online stream can be used to detect data anomalies, verifying whether data is accurate before processing it further. Verified data can then be stored in databases, on which batch scripts can be performed once a day or a month to study data patterns over a time period. Users can reduce the costs of performing these scripts on larger data sets by breaking the problem down into manageable steps, reducing cost and tailoring the data analysis routines to suit their needs. This architecture can be adapted for collecting and analysing online sensor data to find efficient solutions to process large data sets.

III. PROPOSED ARCHITECTURE

Scenarios such as smart cities involve working with large, complex networks of sensors continuously fetching and recording data to a central repository for efficient decisions. Examples such as when to send garbage collection vans, or when to grit the roads for better driving conditions, can all be motivated through the visual, motion and temperature sensor networks that already exist in city infrastructures.

Public clouds provide a number of services which could be employed for online and batch processing. Table 1 presents a comparison of Microsoft Azure and Amazon AWS services offering similar capabilities. For the purpose of this paper, Amazon EC2 was chosen as a starting point for accessing multiple services. A comparison of the services presented in Table 1 shows that the online processing needs stream and batch processing, which was easier to perform in the Amazon cloud than with the Azure services. The availability of services and cost plans for first-time users of the Amazon infrastructure were also suitable for the project objectives.

TABLE I. COMPARISON OF CLOUD SERVICES

Service category              | Microsoft Azure                              | Amazon Web Services
Available region              | Azure Region                                 | AWS Global Infrastructure
Compute services              | Virtual Machines (VMs)                       | Elastic Compute Cloud (EC2)
Storage options               | Azure Storage (Blobs, Tables, Queues, Files) | Amazon Simple Storage Service (S3)
Database options              | Azure SQL Database                           | Amazon Relational Database Service (RDS), Amazon Redshift
NoSQL database                | Azure DocumentDB                             | Amazon DynamoDB
Cache options                 | Azure Managed Cache (Redis Cache)            | Amazon ElastiCache
Data orchestration            | Azure Data Factory                           | AWS Data Pipeline
Administration & security     | Azure Active Directory                       | AWS Directory Service, AWS Identity and Access Management (IAM)
Analytics                     | Azure Stream Analytics                       | Amazon Kinesis
Other services & integrations | Azure Machine Learning                       | AWS Lambda, AWS Config

Fig. 3. ESnet router production network.

Amazon AWS offers a collection of services which could be used for different purposes, each differing in cost and time. Selecting the appropriate cloud service to map onto the general lambda architecture was not obvious, and required comparisons and study of performance and cost. One of these decisions is showcased in Table 2, which presents a comparison of using either S3 or DynamoDB as a means to handle and process data. Although DynamoDB is much more expensive compared to S3, the speed of query processing would reduce the total effective cost, as we plan for long-term use of DynamoDB rather than S3.

TABLE II. COMPARING S3 AND DYNAMODB SERVICES

                | DynamoDB                       | S3
Request pricing | $0.02 per 100,000 transactions | $0.005 per 1,000 requests
Storage costs   | Vary; maximum is $0.09         | Vary; $0.03 per GB
Characteristics | Faster; database (DB) storage  | Blob storage

Similarly, a number of decisions had to be addressed in terms of cost and usefulness of the services. For the purposes of online processing of data, Amazon Kinesis was chosen and merged with AWS Lambda for event-based processing of the data. Figure 2 describes the final processing architecture that was built on Amazon web services to read router data every 30 seconds and process it both as it arrives and in batch jobs. A recent report by Amazon [29] uses Apache Spark and Storm for processing the data stream. It also uses an event processing service which allows processing scripts to be triggered when data arrives in the Kinesis stream. In our architecture (figure 2) the event processing was omitted because, in the use case, data was known to be arriving every 30 seconds, making an event processing element less necessary. An event processing element also charges every time it is triggered, which would eventually cost more than the current architecture implemented.

The initial implementation report [29] also uses Spark SQL to perform batch processing for fast query analysis. In figure 2, the basic Elastic MapReduce functions were implemented with Hadoop to perform map reduce processing jobs on hourly, daily and monthly bases in batches. The batch jobs could be triggered via cron jobs or through a job scheduler to run once a day, after the online data has been collected for the day. The map reduce jobs can filter and sort the data on either hourly, 5-hourly or daily bases.

IV. USE CASE: ESNET NETWORK SENSOR TESTBED

We used the entire ESnet router production network as the testbed to experiment with this architecture (shown in Figure 3). An existing SNMP data collection software, ESxSNMP, was used to collect router in and out bytes from every interface every 30 seconds.

A. Real-time (online) or Speed processing

The raw data arrives at 30-second intervals from multiple router interfaces in the form of json files. These data sets were read and processed to calculate averages across 5-minute intervals and the maximum values recorded, as follows:

Arriving json raw data: [router_id, interface_id, variable_id, timestamp, data_recorded]
5-minute aggregations: [router_id, interface_id, variable_id, 5_minute_avg, maximum_data_in_5_minutes]

The 5-minute aggregations were output to a new stream, which could then be used to visualise the data as it arrives.

B. Batch processing

Figure 4 shows the batch processing jobs on the raw data sets. Multiple map reduce jobs can be triggered to read the raw data sets and produce consolidated 1-day, 7-day, 90-day and 1-year aggregations. These batch files can only perform calculations on stored data sets. Outputs for the calculated data sets can be read into an output directory to visualise the averaged data sets. These outputs are also stored in separate S3 buckets.
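The 5-minute aggregation step above can be sketched as follows. This is an illustrative reconstruction, not the deployed Kinesis consumer: the record layout follows the text, while the windowing-by-timestamp logic is our assumption.

```python
import json
from collections import defaultdict

WINDOW = 5 * 60  # aggregation window in seconds

def aggregate(raw_json_lines):
    """Group 30-second records into 5-minute windows per
    (router_id, interface_id, variable_id) key and emit the
    [router, interface, variable, average, maximum] aggregation rows."""
    buckets = defaultdict(list)
    for line in raw_json_lines:
        # record layout from section IV.A:
        # [router_id, interface_id, variable_id, timestamp, data_recorded]
        router, iface, var, ts, value = json.loads(line)
        window_start = ts - (ts % WINDOW)  # align timestamp to its 5-minute window
        buckets[(router, iface, var, window_start)].append(value)
    return [
        [router, iface, var, sum(vals) / len(vals), max(vals)]
        for (router, iface, var, _), vals in sorted(buckets.items())
    ]

raw = [json.dumps(["rtr1", "eth0", "in_bytes", t, v])
       for t, v in [(0, 100), (30, 200), (60, 300), (300, 50)]]
print(aggregate(raw))
# first window (t = 0..270): avg 200.0, max 300; second window: avg 50.0, max 50
```

In the deployed system these output rows would be written to the downstream Kinesis stream for visualisation rather than printed.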

Fig. 5. Cost per service during two months (Month 1 and Month 2) of the experiment.

Figure 5 shows the cost variation between two months. Cloud services charge for the amount of usage. As shown in Figure 5, Month 1 placed more demand on EC2 services than Month 2. This is reflected in Figures 6 and 7, where Month 3 used even fewer resources, bringing the costs down from 80 dollars to 20 dollars. Figure 7 shows that the cheapest services to use were the micro machines, with the xlarge machines being the most expensive.
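The month-to-month cost differences follow directly from billed instance-hours and per-type rates, with partial hours charged as full hours (a behaviour noted again in the conclusions). A rough sketch of such an estimate, using placeholder rates rather than the study's actual prices:

```python
import math

def billed_hours(runtime_minutes, runs):
    """Each run is rounded up to a whole hour, as EC2 billed at the time."""
    return runs * math.ceil(runtime_minutes / 60)

def monthly_cost(usage, rates):
    """usage: {instance_type: (runtime_minutes_per_run, runs_per_month)};
    rates: {instance_type: dollars_per_hour}."""
    return sum(billed_hours(mins, runs) * rates[itype]
               for itype, (mins, runs) in usage.items())

# Placeholder hourly rates and workloads (illustrative, not the paper's data):
rates = {"t2.micro": 0.013, "c1.medium": 0.13, "m3.xlarge": 0.266}
usage = {"c1.medium": (10, 30),   # daily 10-minute batch job
         "m3.xlarge": (5, 30)}    # daily 5-minute batch job
print(round(monthly_cost(usage, rates), 2))  # 30*0.13 + 30*0.266 = 11.88
```

The rounding explains why a 5-minute m3.xlarge job costs as much per run as a 55-minute one, and why switching to smaller instance types dominates the savings reported below.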

Fig. 4. Batch processing on raw data sets.

The EMR scripts used c1.medium machines as master, core and task nodes, with machine image version 3.8.0 and the Amazon 2.4.0 Hadoop distribution. The map reduce command used was as follows:

hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
  -files s3://location-of-mapper/mapper.py,s3://location-of-reducer/reducer.py \
  -libjars /home/hadoop/CustomOutputFormats3.jar \
  -outputformat oddjob.hadoop.MultipleTextOutputFormatByKey \
  -mapper "python mapper.py" -reducer "python reducer.py" \
  -input s3n://location-of-inputs/jsons/ \
  -output s3n://location-of-output-job

The command above allows users to specify the location of the mapper and reducer files, the input files, and where to produce outputs. Further java files can be passed as arguments to specify the format of the outputs generated, as an optional step.

Fig. 6. Instance hours used over 3 months (Months 1-3), by instance type (t2.micro, m3.xlarge, c1.medium, m1.small, t1.micro).
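The mapper.py and reducer.py passed to the command above are not listed in the paper. A minimal sketch of what such a job could look like, with a local stand-in for the Hadoop shuffle and a key layout assumed from the record format in section IV:

```python
import json
from itertools import groupby

def map_record(json_line, bucket_seconds=86400):
    """Mapper: emit (key, value), with the timestamp floored to an
    hourly/daily bucket so the shuffle sorts records into time bins."""
    router, iface, var, ts, value = json.loads(json_line)
    return (router, iface, var, ts - ts % bucket_seconds), value

def reduce_group(key, values):
    """Reducer: average and maximum per (router, interface, variable, bucket)."""
    vals = list(values)
    return key, (sum(vals) / len(vals), max(vals))

def run_job(lines, bucket_seconds=86400):
    """Local stand-in for the Hadoop shuffle: sort by key, group, reduce.
    In the real job, map_record and reduce_group would instead read from
    and write to stdin/stdout in mapper.py and reducer.py."""
    pairs = sorted(map_record(l, bucket_seconds) for l in lines)
    return [reduce_group(k, (v for _, v in grp))
            for k, grp in groupby(pairs, key=lambda kv: kv[0])]

lines = [json.dumps(["rtr1", "eth0", "in_bytes", t, v])
         for t, v in [(0, 10), (3600, 30), (90000, 5)]]
print(run_job(lines))  # day 0: avg 20.0, max 30; day 1: avg 5.0, max 5
```

Changing bucket_seconds to 3600 or 5*3600 gives the hourly and 5-hourly sorts mentioned in section III, without touching the mapper or reducer logic.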

V. RESULTS

The experiments were set up to run for 3 months, executing certain jobs at specific times during the months. Figures 5-11 represent the statistics in terms of instance hours used, costs and the types of services used during the time period.

Fig. 7. Cost by instance type (t2.micro, m3.xlarge, c1.medium, m1.small, t1.micro) over Months 1-3.

Further analysis of the usage of the online kinesis stream and the batch processing can be seen in Figures 8-11. Figures 8 and 9 show the distribution of requests used by the kinesis stream. Most of the kinesis cost arose from the stream hosting the data, shown as shards active per hour. This shows that in the future cost could be optimised by limiting the amount of time the stream needs to be active, by starting and deleting it dynamically when it is used.

Fig. 8. Online stream: Kinesis usage (number of requests over time) during 1 month.

Fig. 9. Online stream: Distribution of work in terms of requests for kinesis (Storage-ShardHour, PutRequestVolume, PutRequestPayloadUnits, PutRequestBytes, DataTransfer-Out-Bytes).

Fig. 10. Batch processing: usage of various instance types (BoxUsage:m3.xlarge, BoxUsage:c1.medium) in EMR over 3 months.

Figures 10 and 11 depict the usage of instances and the cost incurred by running batch jobs over the three-month period using elastic map reduce. Using large machine instances such as m3.xlarge produces higher costs even if run only a few times. These costs can be optimised by replacing larger machines with smaller machines such as c1.medium, to give similar results but at lower costs. Elastic map reduce involves creating a cluster of machines acting as master and slaves to deploy and run the map reduce jobs. Therefore, replacing these machines with smaller machines and more nodes can help reduce the machine cost while performing the same processing. The runtime of the c1.medium cluster was approximately 10 minutes, which was slower (by 5 minutes) than the m3.xlarge machine clusters. Figure 11 shows the reduction in cost achieved by replacing the machine specifications for the map reduce clusters.

Fig. 11. Batch processing: price distribution of the EMR during 3 months.

VI. DISCUSSION AND CONCLUSIONS

Lambda architecture allows multiple data processing scripts which are tailored to specific data sets. For online processing, stream processing can be used to perform calculations as data arrives, and batch processing scripts can be created to run on data stored from before. The above architecture was implemented on Amazon AWS, utilising multiple resources and producing various results in terms of cost and usage. A number of lessons were learned while exploring the architecture for both batch and real-time processing.

For batch processing:

• Performing local tests before deploying the scripts on EC2 helps to find code errors and reduces the costs of failed clusters on EC2.

• Even if the cluster executes for only 2 minutes, Amazon cloud charges the machine for a full hour, which causes the steep increase in EMR costs shown in Figure 10. Which machines are used in the cluster impacts the cost, and also affects the time the service takes to execute the jobs.
• Future work can involve requesting spot instances, to optimise the costs even further, though this may introduce more delays in the processing being fulfilled.

For real-time or online processing:

• The kinesis stream reads data in a last-in-first-out manner, which meant data needed to be saved locally to calculate the 5-minute data aggregations.

• Amazon cloud management allows roles and rights to be assigned to multiple members of a team. Multiple members may have rights for kinesis processing, but if the kinesis stream interacts with other services by moving data across, the member requires rights to those services as well. Software testing strategies, such as try and catch exceptions, need to be implemented in the code to prevent services from failing.

In terms of processing time, the kinesis stream was able to process data in real-time, while the EMR cluster took approximately 10 minutes to complete a job. Some of the services charge while they are active, and thus should only be dynamically started and stopped when being used, in order to optimise costs.

In conclusion, the lambda architecture on Amazon AWS was able to provide a proof-of-concept for data processing both as data arrives and after it has been saved. The tailored solutions allow users to perform cost-optimised processing. Both processes can produce data aggregations which are easier to plot, reducing the time for data processing through the data visualisation interface.

Further work needs to explore issues of data security, availability zones and data replication for more complex operations of the services. Scripts which can perform data roll-ups in case of data loss, and machine-learning algorithms for performing online anomaly detection, can be embedded into the stream to check the data as it arrives. This would allow verification and validation of the data sets to be conducted on arrival, preventing the loss in time and cost of running these scripts after the data has been saved and processed. Overall, the above experiment shows huge potential for processing Big Data sets in cost-effective ways by implementing the lambda architecture design pattern, and shows that the architecture is particularly suited to cloud services where multiple resources are available for controlling and processing the data. This sort of architecture would prove extremely useful for processing sensor-related data in Smart City research, which is to be explored further in the future.

ACKNOWLEDGMENT

The work was supported by the NEMODE EPSRC grant (www.nemode.ac.uk) for Big Data analysis on Cloud services, and further via ESnet resources and DOE support for Amazon Cloud services. ESnet is operated by Lawrence Berkeley National Laboratory, which is operated by the University of California for the U.S. Department of Energy under contract DE-AC02-05CH11231. This work was supported by the Directors of the Office of Science, Office of Advanced Scientific Computing Research, Facilities Division.

REFERENCES

[1] D. Abadi, Data Management in the Cloud: Limitations and Opportunities, IEEE Data Engineering Bulletin, vol. 32, no. 1, pp. 3-12, 2009.
[2] R.M. Badia, M. Corrales, T. Dimitrakos, K. Djemame, E. Elmroth, A. Ferrer, N. Forgó, J. Guitart, F. Hernández, B. Hudzia, A. Kipp, K. Konstanteli, S. Nair, T. Sharif, C. Sheridan, J. Tordsson, T. Varvarigou, S. Wesner, W. Ziegler, C. Zsigri, Demonstration of the OPTIMIS Toolkit for Cloud Service Provisioning, Towards a Service-Based Internet, LNCS vol. 6994, 2011.
[3] J. Brown, Migration, Integration, Challenges in Government Cloud deployments, Govtech, 2015.
[4] R. Buyya, C. Yeo, S. Venugopal, J. Broberg, I. Brandic, Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility, FGCS 25, 2009.
[5] R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou, SCOPE: Easy and efficient parallel processing of massive data sets, in Proc. of VLDB, 2008.
[6] M. Dikaiakos, G. Pallis, D. Katsaros, P. Mehra, A. Vakali, Cloud Computing: Distributed Internet Computing for IT and Scientific Research, IEEE Internet Computing, 2009.
[7] e-skills UK, Jan 2013, https://www.e-skills.com/research/research-themes/big-data-analytics/
[8] I. Foster, Y. Zhao, I. Raicu, S. Lu, Cloud Computing and Grid Computing 360-Degree Compared, Proceedings of the IEEE Grid Computing Environments Workshop, pp. 1-10, 2008.
[9] J. N. Hoover, Start-Ups Bring Google's Parallel Processing to Data Warehousing, InformationWeek, August 29th, 2008.
[10] K. Jackson, K. Muriki, L. Ramakrishnan, K. Runge, R. Thomas, Performance and cost analysis of the Supernova factory on the Amazon AWS cloud, Scientific Programming 19, 2011.
[11] T. Jin, C. Tracy, M. Veeraraghavan, Characterization of high-rate large-sized flows, University of Virginia, Master's thesis, 2013. Online: http://www.cs.virginia.edu/events/colloquia/jin.html
[12] Z. Liu, M. Veeraraghavan, C. Tracy, J. Tie, I. Foster, J. Dennis, J. Hick, W. Yang, On using virtual circuits for GridFTP transfers, Int. Conf. for HPC, Networking, Storage and Analysis, pp. 81:1, 2012.
[13] Y. Lu, M. Wang, B. Prabhakar, F. Bonomi, ElephantTrap: A low cost device for identifying large flows, 15th Annual IEEE Symposium on High-Performance Interconnects, pp. 99-108, 2007.
[14] D. Menon, Benefits and challenges in deployment of Cloud solutions for SME, InsideSAP, 2014.
[15] C. Olston, B. Reed, U. Srivastava, R. Kumar, A. Tomkins, Pig latin: a not-so-foreign language for data processing, SIGMOD Conference, pp. 1099-1110, 2008.
[16] S. Sakr, A. Liu, D. M. Batista, M. Alomari, A Survey of Large Scale Data Management Approaches in Cloud Environments, IEEE Communication Surveys & Tutorials, vol. 13, no. 3, 2011.
[17] US DOE Office of Science ASCR, Terabit networks for extreme-scale science workshop report. Available online: http://science.energy.gov/, 2014.
[18] S. Sarvotham, R. Riedi, R. Baraniuk, Connection-level analysis and modelling of network traffic, ACM SIGCOMM Internet Measurement Workshop, pp. 99-104, 2001.
[19] P. Xiong, Y. Chi, S. Zhu, H.J. Moon, C. Pu, H. Hacigumus, Intelligent Management of Virtualized Resources for Database Systems in Cloud Environment, in ICDE, 2011.
[20] N. Mitton, S. Papavassiliou, A. Puliafito, K. S. Trivedi, Combining Cloud and sensors in a smart city environment, EURASIP Journal on Wireless Communications and Networking, December 2012, 2012:247.
[21] N. Marz, J. Warren, Big Data: Principles and best practices of scalable realtime data systems, Manning Publications, 2013.
[22] R. Mian, P. Martin, J.L. Vazquez-Poletti, Provisioning data analytic workloads in a cloud, Future Generation Computer Systems 29 (2013) 1452-1458.
[23] C. Dobre, F. Xhafa, Intelligent services for Big Data science, Future Generation Computer Systems 37 (2014) 267-281.
[24] A.M. Aly, A. Sallam, B.M. Gnanasekaran, L.V. Nguyen-Dinh, W.G. Aref, M. Ouzzani, A. Ghafoor, M3: stream processing on main-memory MapReduce, in: Proceedings of the 2012 IEEE 28th International Conference on Data Engineering, ICDE '12, IEEE Computer Society, Washington, DC, USA, 2012, pp. 1253-1256.
[25] T.M. Ghanem, A.K. Elmagarmid, P.-A. Larson, W.G. Aref, Supporting views in data stream management systems, ACM Transactions on Database Systems 35 (2008) 1:1-1:47.
[26] A. Abouzied, K. Bajda-Pawlikowski, J. Huang, D.J. Abadi, and A. Silberschatz, HadoopDB in action: building real world applications, in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, ACM, New York, USA, 2010.
[27] X.-W. Song, Z.-Y. Dong, X.-Y. Long, S.-F. Li, X.-N. Zuo, C.-Z. Zhu, et al., REST: A Toolkit for Resting-State Functional Magnetic Resonance Imaging Data Processing, PLoS ONE 6(9): e25031, 2011. doi:10.1371/journal.pone.0025031
[28] A. Bruns, Y.E. Liang, Tools and methods for capturing Twitter data during natural disasters, First Monday, March 2012. ISSN 13960466.
[29] Amazon Web Services, Lambda Architecture for Batch and Stream Processing on AWS, May 2015.
[30] R. Xin, J. Rosen, M. Zaharia, M. Franklin, S. Shenker, I. Stoica, Shark: SQL and Rich Analytics at Scale, 2013.
[31] Y. Simmhan, V. Agarwal, S. Aman, A. Kumbhare, S. Natarajan, N. Rajguru, I. Robinson, S. Stevens, W. Yin, Q. Zhou and V. Prasanna, Adaptive Energy Forecasting and Information Diffusion for Smart Power Grids, IEEE International Scalable Computing Challenge (SCALE 2012), 2012.