Master Thesis
August 19, 2014
Investigating the Lambda Architecture

Nicolas Bär of Zürich ZH, Switzerland

Student-ID: 08-857-195 [email protected]

Advisor: Dr. Thomas Scharrenbach

Prof. Abraham Bernstein, PhD
Institut für Informatik
Universität Zürich
http://www.ifi.uzh.ch/ddis

Acknowledgements

I would like to thank Prof. Abraham Bernstein for giving me the opportunity to write this thesis at the Dynamic and Distributed Information Systems group and for providing the necessary resources to conduct the experiments. Special thanks go to Dr. Thomas Scharrenbach for the valuable feedback and great discussions throughout the course of this thesis. I would like to thank the S3IT group for providing me with resources to conduct pre-studies. And special thanks go to Dr. Jakob Bär for proofreading this thesis.

Zusammenfassung

The ongoing integration of information systems increasingly challenges systems for the real-time analysis of large volumes of data: on the one hand, the results should be as precise as possible; on the other hand, the data should be processed and made available quickly. The lambda architecture sketched by Marz offers a new approach to these problems, although no reference implementation has been published so far. This thesis presents a possible realization of this architecture based on open source software components. The foundation of the batch layer is a scalable incremental mechanism that stores incoming messages in a replicated fashion and can repeat operations in case of failures. The distributed speed layer, in contrast, drops unprocessed messages when unexpected failures occur so that new messages can be processed more quickly. The architecture promises "eventual accuracy": the possibly inaccurate real-time results of the speed layer can be replaced by the precise results of the batch layer. This thesis also presents the results of an evaluation of the proposed design with the data sets of the SRBench benchmark and the DEBS Grand Challenge 2014. It shows the behavior of the architecture and its performance under infrastructure instability and varying data frequencies.

Abstract

Information systems are becoming increasingly integrated, which creates new challenges for providing real-time analytics on high volumes of data. The lambda architecture proposed by Marz offers a new approach to this problem, but the lack of a reference implementation limits its analysis. This thesis presents a possible implementation of the lambda architecture based on open source software components. The design of the batch layer rests on a scalable incremental mechanism that stores incoming data in a distributed and highly available storage engine, which provides replay functionality in case of failures. The speed layer provides no recovery mechanisms; in case of machine failures it drops messages and continues with the most recent data available. The architecture guarantees eventual accuracy: the possibly inaccurate results of the speed layer are provided in real time and later replaced by the accurate results of the batch layer. The evaluation of the designed architecture measured its capabilities on the SRBench benchmark and the DEBS Grand Challenge 2014 task and stressed its behavior with varying data frequency rates on an unreliable infrastructure.

Table of Contents

1 Introduction
  1.1 Motivation
  1.2 Outline

2 Preliminaries
  2.1 Stream Processing
  2.2 Batch Processing

3 Related Work

4 Architecture
  4.1 Frameworks and Services
    4.1.1 Coordination and Provisioning
    4.1.2 Stream Processing
    4.1.3 Event Processing
    4.1.4 Persistent Storage
  4.2 Coordination
    4.2.1 Implementation Design
    4.2.2 Persistent Messaging
    4.2.3 In-Memory Messaging
  4.3 Batch Layer
    4.3.1 Implementation Design
    4.3.2 Micro-Batch Processing
    4.3.3 Replay Mechanism
    4.3.4 Precise Recovery
  4.4 Speed Layer
    4.4.1 Implementation Design
    4.4.2 Node Failures
    4.4.3 Scalability
  4.5 Orchestration
    4.5.1 Resource Management and Cluster Health
    4.5.2 Node Failure Simulation
  4.6 Service Layer
    4.6.1 Logging Infrastructure
    4.6.2 Process Monitoring

5 Design of Experiments
  5.1 Infrastructure
    5.1.1 Automatic Deployment
  5.2 Experimental Setup
  5.3 Data Sets and Queries
    5.3.1 SRBench Data Set
    5.3.2 DEBS Grand Challenge 2014 Data Set
    5.3.3 Data Set Statistics
    5.3.4 Baseline
    5.3.5 Partitioning
  5.4 Node Failure Simulation

6 Results
  6.1 Key Performance Indicators
  6.2 SRBench Data Set
    6.2.1 Rainfall Observed once an Hour
    6.2.2 Broken Station Detection
  6.3 DEBS Grand Challenge 2014 Data Set
    6.3.1 Load Prediction
    6.3.2 Average Load
    6.3.3 Eventual Accuracy

7 Discussion
  7.1 Batch Layer
    7.1.1 Message Delivery Guarantee
    7.1.2 Micro Batch Processing
    7.1.3 Node Failure Recovery
  7.2 Speed Layer
    7.2.1 Quality of Service
    7.2.2 Node Failure Recovery
  7.3 Data Partitioning

8 Limitations
  8.1 Single Point of Failure
  8.2 Concurrent Node Failures
  8.3 Partitioning
  8.4 Stream Imperfections

9 Future Work

10 Conclusions

1 Introduction

There is a wide range of applications in which data is consumed via streams. In most cases an external environment generates data and pushes it asynchronously to stream processing systems. These systems compute results based on the continuous data streams in a time-discrete manner. Stream processing requirements are found in both business and scientific domains; examples include financial markets, surveillance, manufacturing, healthcare, infrastructure monitoring and radio astronomy [5]. The data frequency of streams depends on the problem an application solves and may range from a few to millions of data items per second. Use cases enforce different quality of service (QoS) constraints regarding the response time of stream-based applications. For example, a reactive use case with high-volume data streams may require an answer in a timely fashion. In such a scenario, complex computation processes have to be distributed, and in case of data loss, for example through communication or hardware failures, the QoS constraints allow for very limited fault recovery due to response time restrictions. Other use cases involve QoS constraints regarding the precision of the results. In such a scenario data loss is not acceptable and response time is traded for the sake of obtaining precise and complete results. The lambda architecture introduced by Marz [44] is an interesting proposal to address the latency challenges in real-time stream processing. It decomposes the problem into three layers: (i) the batch layer focuses on fault tolerance and optimizes for precise results, (ii) the speed layer is optimized for short response times and only takes into account the most recent data, and (iii) the serving layer provides low-latency views of the results of the batch layer. The reason to divide the architecture into three layers is the flexibility it offers to potential applications. The fast but possibly inaccurate results of the speed layer are eventually replaced by the precise results of the batch layer.

1.1 Motivation

The purpose of this thesis is to investigate the lambda architecture in the context of purely stream-based applications and to measure its effect with regard to different QoS metrics. The batch and the speed layer consume from the same stream, but they differ in their requirements regarding response times and fault tolerance.

Very little related work on the lambda architecture is available, and a reference implementation has not been published yet. This thesis presents a design based on open source software: the batch layer follows an incremental approach and the speed layer provides reduced fault-recovery guarantees. The design is then used to generate key performance indicators in order to measure the behavior of the system for different use cases, namely the SRBench benchmark and the DEBS Grand Challenge 2014 task. A proof of concept includes the simulation of failures in the underlying system to check its fault-recovery behavior. Different frequencies and bursts in the data are simulated to evaluate the key performance indicators of the architecture.

1.2 Outline

Chapter 2 introduces the concepts of stream and batch processing and highlights recent research in these areas with regard to the lambda architecture. Chapter 3 discusses the related work and positions the research question of this thesis within its field. The designed architecture is presented in Chapter 4, which introduces the frameworks that form the basis of the design and presents the proposed solutions for the speed and batch layer. The setup of the experiments conducted to qualify and quantify the performance of the architecture design is described in Chapter 5. The results of the experiments are reported in Chapter 6 and discussed in Chapter 7. Chapter 8 highlights the limitations of the designed architecture, and Chapter 9 lists possible future work. Finally, the conclusions of this work are presented in Chapter 10.

2 Preliminaries

The lambda architecture is a new concept and very little related work is available. It combines techniques and methods from the stream and batch processing areas, which are described in the following. First, a brief introduction to stream processing systems is given and the work relevant to the batch and speed layer is highlighted. Second, recent work in the area of batch processing is introduced.

2.1 Stream Processing

Andrade et al. [5] describe the development of stream processing systems as the result of a continuous evolution in the technology for managing high volumes of data. In particular, advances in data warehousing and scalable data processing solutions led to a higher degree of system interconnection and an emerging need for strongly integrated and fast data analysis. Distributed systems and their ability to scale horizontally are the building blocks of stream processing and provide integration with other technologies and systems to combine data sources from a heterogeneous landscape of software components. The synergies of an interconnected software landscape and the progress in the areas of data mining and business intelligence ultimately contributed to the need to process information in a real-time, stream-based manner. Cugola and Margara [22] further describe stream processing systems as an evolution of database-oriented data processing methods. Traditional database management systems require data to be persistently stored and indexed before it can be processed; the processing of data is asynchronous to its arrival and happens only on request. Nevertheless, stream processing systems show similarities with database systems: data streams are consumed in sequences and events are processed with common SQL operators such as aggregates, joins and filters. The architecture proposed in this thesis primarily focuses on two classes of stream processing systems: (i) micro-batch based processing systems with strong fault-recovery mechanisms and (ii) real-time processing systems providing high throughput or low latency. The first class is applied to the batch layer, where incoming data is processed incrementally using micro batches and where the fault-recovery mechanism ensures precise results. The latter class empowers the speed layer to compute results in a timely fashion.

Comet [33] and Spark Streaming [61] are stream processing systems that collect data from continuous data streams into batches. They then periodically process these batches using batch computations similar to MapReduce [24]. For example, a producer sends data to a queue in sequential order, and the queue combines the data into multiple batches. Both systems accept either a time-based batching mechanism, e.g. one batch holds the data of the last second, or a size-based batching mechanism, e.g. one batch holds 10 MB of data. Das et al. [23] further enhance this model to adaptively adjust the batch size in order to decrease the end-to-end latency. However, the focus of the batch layer is not primarily to reduce latency, but to generate precise results and to provide the necessary mechanisms to recover from node failures. Such goals are strongly bound to the batching mechanism and the corresponding technique to checkpoint and recover in case of failures. Kwon et al. [38] present a stream processing model that checkpoints the state of the processing components periodically and, in case of failure, resumes from the most recent checkpoint. Their model lowers the performance impact by checkpointing in an asynchronous fashion. The asynchronous aspect of the checkpointing mechanism introduces new challenges for the batch layer, where global or local state may reflect only the last checkpoint and could therefore cause inaccurate results. The recent work on batch-based stream processing systems with fault recovery highlighted above lacks the following properties that are necessary to provide a fault-tolerant incremental algorithm for the batch layer: (i) allowing only periodic checkpoints based on time or data size does not provide enough granularity to recover to a precise moment in time, which may cause inaccurate results for time-window based computations, and (ii) the checkpointing mechanism has to be bound to the computation process so that the recovery of data and the corresponding state is synchronized. A solution based on the open source stream processing system Samza is introduced in Section 4.3. A second class of stream processing systems is based on a continuous operator model, where the streaming process consists of perpetual operators that form a graph and exchange messages with each other according to the designed processing layout. The stream processing systems Aurora [2], Borealis [1], TimeStream [49], Naiad [46], TelegraphCQ [18], and Storm [9] are recent examples of this class. Although these systems use batching in input or output buffers to improve throughput, their main concept does not revolve around batch-based communication and the batches have no influence on the client API. Lohrmann et al. [41] provide a broad analysis of this class of stream processing systems and observe that the QoS requirements of potential applications have been disregarded. To fill this gap they introduce a stream processing system that optimizes the computation layout and communication channels with respect to user-specified latency and throughput constraints. The speed layer of the lambda architecture has to keep up with an ever-increasing volume of data that stresses the data flow of the system. At a certain point, trading latency for throughput may no longer yield the desired response time and drastic measures, such as dropping messages, have to be taken.
The implementation design proposed for the speed layer in Section 4.4 uses Storm [9] to meet these requirements, and the results of the experiments in Chapter 6 highlight the impact of these measures.


2.2 Batch Processing

The publication of the MapReduce [24] programming model and the Google File System [31] brought new attention to research on batch processing systems. Hadoop, with its distributed file system [51], is an open source implementation of the concepts of MapReduce and the Google File System. Many ideas evolved leveraging the possibilities of the newly introduced concept, and due to the broad adoption of Hadoop in industry, different technologies were built on top of it: (i) high-level languages to fill the gap between low-level procedural MapReduce and declarative SQL-based programming methods and (ii) distributed storage systems to organize structured data and provide low-latency queries. Pig [47] is a high-level data flow language for MapReduce that supports ad-hoc analysis of very large data sets. Pig creates a logical query plan for a specified task and then compiles the high-level language into a series of MapReduce tasks. Pig does not include a storage system to optimize data access and latency. Hive [55], Jaql [11] and Impala [57] offer integrated solutions to manage an indexed file structure in HDFS and provide query languages to analyze the data. Impala promises real-time data analytics, while Hive and Jaql make no promises in this regard. Cattell [16] summarizes current SQL and NoSQL storage systems and highlights the following systems that support or are based on Hadoop: HBase [30], Hypertable [Hypertable Inc.] and Cassandra [39]. HBase and Hypertable follow the design of BigTable [19], while Cassandra is a distributed storage engine that supports the Hadoop distributed file system. Marz [44, chap. 2, 3 and 5] discusses possible solutions to store data on the batch layer and dismisses the above-mentioned methods for the following reasons: (i) custom languages introduce a barrier between parts of the code, (ii) modularization of code becomes increasingly difficult and (iii) the general-purpose programming language and the data processing language are not tightly coupled, which complicates the workflow. Marz [44] implemented frameworks of this class that integrate with Clojure. However, the coupling of these tool sets to a specific programming language discourages broad adoption.


3 Related Work

The lambda architecture proposed by Marz [44] starts from the problem of querying petabytes of data. Such a query is unreasonably expensive and imposes high latency. The problem is therefore divided into three layers. The batch layer precomputes the query function based on the full data set and updates the serving layer. This operation involves high latency, and by the time the view of the precomputed query is finished it is already outdated. The speed layer operates only on the most recent data in order to provide views for the time span not yet covered by the batch layer. The batch layer holds the master data set that includes all data needed to precompute the necessary results. Marz [44] suggests storing all data as immutable and eternally true facts in order to be able to iterate the preprocessing in replay. This also implies that each data item is stored only once in the data storage in order to produce precise results. The data storage system has to meet the following requirements: (i) efficient writes for new data, (ii) scalability to cope with the increasing need to store more data and (iii) support for parallel processing and the ability to partition the data. Since the data set grows continuously, precomputing the batch views on the whole data set becomes increasingly expensive and the time span needed to catch up with the speed layer increases. The batch layer designed in this thesis therefore processes new incoming data using an incremental algorithm. Depending on the use case, an incremental algorithm may be able to update the batch view faster, but it could involve higher complexity to store the data required to incrementally update the results. The serving layer provides fast access to the precomputed results of the batch layer. These results are outdated because of the high latency of the batch layer. Therefore the write speed to the serving layer is less important than the read performance. The serving layer has to provide low-latency random reads in order to answer queries efficiently. The speed layer is responsible for providing results on the most recent data and has to fulfill certain latency constraints depending on the use case. This implies that it is not possible to compute results based on the full master data set. Instead, incremental computation is applied to the most recent data, in combination with persistent state if necessary. The speed layer is more complex and error-prone than the batch layer, but any error is eventually compensated by the batch layer. Fan and Bifet [28] suggest the lambda architecture as one solution to the future challenges of big data with regard to data mining. Robak et al. [50] explored the application of big data and linked data concepts in supply chain management and listed the lambda architecture as one possible solution to deal with high volumes of data in this field. They argue that such an architecture could provide lower latency to react in critical situations. However, the possibility that the real-time results may be inaccurate and could therefore bias the decision-making process is not discussed. Ye et al. [58] present a cloud-based big data mining and analyzing service platform with integration of the R programming language, consisting of four layers. They list the lambda architecture as related work, although no comparison is provided due to the lack of information about the lambda architecture and a missing reference implementation.
Google announced a new data processing service on top of their cloud platform called Google Cloud Dataflow.1 It shows similarities to the lambda architecture by allowing clients to process data with stream and batch processing methods. The framework abstracts the burden of managing the underlying services and their configuration. Google Cloud Dataflow evolved from MapReduce [24] and successor technologies published by Google such as Flume [17] and MillWheel [4]. No technical report has been published yet, but the announcement of Google Cloud Dataflow indicates similarities to Spark [60]: both frameworks rely on parallel collections of any size that are distributed across multiple machines in order to provide scalability. Due to this very limited information, it is not possible to compare the Google Cloud Dataflow framework to the work of this thesis. Amazon maintains a set of loosely coupled but well integrated tools that provide the means to apply the lambda architecture. Kinesis is a fully managed real-time stream processing service that uses a messaging concept similar to Kafka [37].2 Both systems enforce a partitioning scheme that defines the maximum parallelism of the computation and provide retention policies. While Kafka allows the user to define the retention policy, Kinesis only supports storage for 24 hours. Kinesis integrates with Amazon's storage engine S3 and its MapReduce service Elastic MapReduce.3 Amazon provides the necessary tools to build a batch and a speed layer, but the integration of both layers and its challenges are not solved.

1 http://googlecloudplatform.blogspot.ch/2014/06/sneak-peek-google-cloud-dataflow-a-cloud-native-data-processing-service.html
2 http://docs.aws.amazon.com/kinesis/latest/dev/key-concepts.html
3 http://aws.amazon.com/s3/, http://aws.amazon.com/elasticmapreduce/

4 Architecture

The architecture is divided into five components: (i) orchestration, (ii) coordination, (iii) batch layer, (iv) speed layer and (v) service layer. Figure 4.1 illustrates the components of the lambda architecture and the data flow through the system. The orchestration component manages resource allocation, provides scheduling functionality to the batch, service and speed layer, and allows the simulation of node or process failures within the system. The coordination component provides the entry point for data and offers high-level services to synchronize processes across the different components. The batch layer comprises a micro-batch processing system with high reliability and fault tolerance. The speed layer embeds a real-time stream processing system with a focus on fast response times. The service layer collects results and performance data from both the batch and the speed layer in order to measure the key performance indicators of this architecture.

4.1 Frameworks and Services

4.1.1 Coordination and Provisioning

YARN

YARN (Yet Another Resource Negotiator) [56] evolved from the need to leverage Hadoop computational clusters to run heterogeneous tasks in a multi-tenant environment. Hadoop was designed to run MapReduce jobs on large data sets in a distributed fashion. It offers a simple API to its clients and abstracts the common challenges of provisioning distributed systems, such as scheduling, replication, failover and input partitioning [51]. Hadoop has been widely researched and its limitations and benefits are well understood. YARN addresses the resource management and scheduling limitations of Hadoop. It separates the resource management functionality from the programming model, and its API allows clients to specify fine-grained scheduling and resource allocation policies. This allows a broad range of heterogeneous tasks to run on a Hadoop infrastructure, such as stream-based applications that require a programming model different from MapReduce. The architecture of YARN is based on one master node (the resource manager) and multiple worker nodes (node managers). The resource manager acts as the central authority responsible for resource scheduling and job submission.


Figure 4.1: Components of the lambda architecture: External sensors send data continuously to the coordination component. The batch and speed layer provide the computational facilities to compute results and send those results to the service layer. The orchestration component manages the provisioning and orchestration and it embeds a node failure simulation mechanism.

The node managers report information about resource availability, faults and job lifecycle management to the resource manager in a heartbeat fashion. From these heartbeats the resource manager acquires a global view of the available resources and the status of jobs. Based on the available resources, the resource manager can provision leases (containers) on the node managers to run client applications.

The API enables clients to submit an application master, which encapsulates the logic to negotiate resources and start jobs in the form of containers, to the resource manager. The resource manager's responsibility does not include the management of node failures or status changes of client applications, but it offers a communication channel for the application master to handle such events. Figure 4.2 illustrates the architecture of YARN.


Figure 4.2: YARN architecture overview [56]: The resource manager orchestrates the YARN cluster and manages task provisioning. The node managers are worker nodes that provide resources for computation tasks. The client registers an application with the resource manager. Then the application master (AM) is started as a container on one of the node managers. It then negotiates resources with the resource manager to start containers.

Apache ZooKeeper

Apache ZooKeeper is a wait-free coordination service for distributed systems with the goal of providing a coordination kernel with relaxed consistency guarantees [34]. ZooKeeper does not implement specific primitives; instead, it offers high flexibility through its API and enables clients to build their own coordination logic such as rendezvous, group membership, configuration management, simple locks, simple locks without herd effect, read/write locks and double barriers. The power of ZooKeeper revolves around its simple design based on hierarchical namespaces with a strong similarity to file systems. Clients can write and read nodes within the namespace and watch for changes on specific nodes. A node stored in the ZooKeeper namespace is either regular or ephemeral. The former is manipulated explicitly by clients. The latter is bound to the heartbeat of a specific client and is deleted when the client fails to send its heartbeat within a defined timeout. ZooKeeper enforces two ordering guarantees: (i) linearizable writes guarantee that all updates to the state of ZooKeeper are serializable and respect precedence and (ii) all requests from one client are executed in FIFO order.
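
The coordination logic described above can be built directly against the ZooKeeper client API. The following minimal sketch registers a group member as an ephemeral node and watches the group for membership changes; the connection string, znode paths and member name are placeholders, and the code is an illustration of the primitives rather than part of the implemented architecture.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;
import java.util.List;

public class GroupMembershipSketch {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper ensemble; the watcher receives session events.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {});

        // Regular (persistent) node representing the group.
        if (zk.exists("/workers", false) == null) {
            zk.create("/workers", new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Ephemeral node for this member: it is removed automatically when the
        // client fails to send its heartbeat within the session timeout.
        zk.create("/workers/worker-1", "host-a".getBytes(),
                Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Watch the group: the watcher fires once when the children change,
        // e.g. when a member joins or an ephemeral node expires.
        List<String> members = zk.getChildren("/workers",
                event -> System.out.println("membership changed: " + event));
        System.out.println("current members: " + members);
    }
}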


Apache Kafka

Apache Kafka [37] is a distributed and scalable messaging system. It combines the benefits of traditional log aggregators with an API to consume events in real time. A topic defines a stream of messages of one particular type. Producers publish messages to a topic and these messages are stored on nodes called brokers. Consumers subscribe to one or more topics and pull messages from the brokers. A consumer utilizes the iterator interface of one particular message stream in order to consume the messages being produced as a continuous stream. The iterator blocks if no new messages are available and continues upon arrival of new messages. A topic is further divided into multiple partitions. Partitions are the smallest unit of parallelism and enable load balancing by dividing the partitions among the brokers. Consumer groups can divide and distribute the load according to the partitioning scheme of Kafka. This eases the common state management and locking challenges of traditional messaging systems. A message stored in Kafka is identified by its logical offset. A consumer receives messages from one partition sequentially and acknowledges the offset of the messages it has read. A broker in Kafka is stateless and holds no information about the progress of the consumers; each consumer is responsible for managing the last consumed offset. This not only reduces the complexity and overhead of a broker, but also allows consumers to rewind the stream and consume already read messages again. This is a powerful feature in case of node or application failures, since the application can replay messages from a stored offset after it recovers. Brokers and consumers coordinate among themselves in a decentralized fashion using the functionality of ZooKeeper. In particular, ZooKeeper is used for two main tasks: (i) detecting new or removed brokers and consumers and rebalancing them according to the current topic and partitioning scheme, and (ii) storing and managing the partition offsets of consumer groups in a highly available fashion. Kafka provides two distinct interfaces to consume messages from partitions: (i) a consumer group implementation that leverages the functionality of ZooKeeper to synchronize different consumers and their respective offsets and (ii) a simple consumer that enables developers to implement their own logic to consume messages in a distributed fashion. The first choice abstracts much low-level functionality and therefore lowers the risk of faulty code. It makes two assumptions about the consuming application: (i) the at-least-once message delivery guarantee and the resulting possibility of duplicate messages are acceptable and (ii) the order of messages consumed from different partitions is irrelevant. Kafka producers can send messages in either a synchronous or an asynchronous fashion. Producing messages synchronously means sending one message at a time: the thread blocks in order to deliver each message to Kafka and returns after the broker acknowledges the message. An asynchronous producer holds a batch of n messages in memory and whenever the batch is full it starts a new thread to deliver all messages of the current batch to Kafka. Figure 4.3 illustrates the performance implications of the batch size on the producer. While the synchronous mode shows the lowest performance, an asynchronous producer implies further problems with regard to fault tolerance. In case an asynchronous producer fails and is restarted, determining the precise restore point is only possible by inspecting the corresponding topic.

Figure 4.3: Kafka producer batch-size performance comparison [8]: The x-axis shows the batch size used by the producer to send messages to Kafka. The left y-axis shows the amount of data (MB) produced and the right y-axis shows the corresponding number of messages.
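
The consumer group interface described above can be used as in the following sketch, written against the Kafka 0.8 Java API current at the time of writing. The topic name, group id and ZooKeeper address are placeholders; the snippet illustrates the blocking iterator and the ZooKeeper-based offset management rather than the code used in this architecture.

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class ConsumerGroupSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "localhost:2181"); // consumers coordinate via ZooKeeper
        props.put("group.id", "batch-layer");             // offsets are tracked per consumer group
        props.put("auto.commit.interval.ms", "1000");

        ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

        // Request a single stream for the topic; partitions are balanced
        // among all consumers of the group by the connector.
        Map<String, Integer> topicCountMap = Collections.singletonMap("sensor-data", 1);
        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                connector.createMessageStreams(topicCountMap);

        // The iterator blocks while no new messages are available.
        ConsumerIterator<byte[], byte[]> it = streams.get("sensor-data").get(0).iterator();
        while (it.hasNext()) {
            byte[] payload = it.next().message();
            System.out.println("received " + payload.length + " bytes");
        }
    }
}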

4.1.2 Stream Processing

Apache Samza

Apache Samza is a distributed stream processing framework that leverages the capabilities of YARN to provide fault tolerance, resource management services and process isolation. Its design for task distribution and messaging strongly relies on Kafka. Samza provides a simple API to process messages and abstracts low-level operations such as state management, processor isolation and fault recovery. A client implements the processing method for one or multiple continuous streams, and Samza manages the incoming messages and provides interfaces for outgoing messages. Figure 4.4 illustrates an example data flow within a Samza topology. The processing layout of Samza allows clients to define computational graphs and enables jobs to consume from and write to one or many streams. Samza makes no assumptions about the graph, so both cyclic and acyclic topologies are possible. A Samza job can either write to a Kafka stream or provide its own implementation to write to an external data store.
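
The processing interface mentioned above boils down to implementing a single callback. The following sketch shows a task that drops empty payloads and forwards the rest to an output stream; the stream names are placeholders and the example is illustrative rather than taken from the batch layer implementation.

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

// Samza instantiates one task instance per input partition and calls
// process() for every incoming message of that partition.
public class FilterTask implements StreamTask {
    private static final SystemStream OUTPUT = new SystemStream("kafka", "filtered-readings");

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        String message = (String) envelope.getMessage();
        if (message != null && !message.isEmpty()) {
            // Emit to the output stream; Samza writes it back to Kafka.
            collector.send(new OutgoingMessageEnvelope(OUTPUT, message));
        }
    }
}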

Task Distribution

A Samza job is a logical unit that processes a set of input streams and outputs messages to a set of output streams. In order to distribute the load of a set of input streams, Samza splits one job into multiple tasks based on the Kafka partitioning scheme.


A task that consumes one Kafka partition is therefore the smallest unit of parallelism. Messages are consumed sequentially in order of their offset. If a task joins two streams, the messages are processed in round-robin fashion; alternatively, the client may specify a custom method through the defined interfaces. Tasks are distributed according to the available resources on YARN. Figure 4.5 shows the distribution of stream tasks with respect to the Kafka topic partitioning and the available resources on YARN. In this example two containers are available on YARN and Samza runs two tasks on each container in order to process the four partitions of each topic. The distribution of partitions to tasks is based on the partitioning scheme of Kafka. This mapping is negotiated the first time Samza starts and is not rebalanced in case partitions are added or removed. Two factors influence the load balancing mechanism of Samza: (i) the number of partitions and (ii) the number of containers to start on YARN. The number of containers Samza will start is configurable per job. In case the number of containers is lower than the number of partitions, Samza starts multiple threads on each container, as illustrated in Figure 4.5. The total number of threads a Samza job starts is always bound to the number of partitions. If a job consumes from multiple topics with an unequal number of partitions, some tasks will only consume messages from one topic. For example, if topic A holds two partitions and topic B holds four partitions, Samza starts four tasks in order to consume from both topics. Tasks one and two consume from both topics, but tasks three and four only consume from topic B, since topic A only holds two partitions. The number of tasks is therefore defined as the maximum number of partitions over all topics a job consumes from.

Figure 4.4: Samza stream processing topology: Every Samza job can consume messages from one or many data streams and emit messages to many streams. The processing layout may be designed as a directed cyclic or acyclic graph.


Figure 4.5: Samza task distribution: The task distribution of Samza is constrained by the maximum number of partitions in a Kafka topic and the number of available YARN containers. A task can consume multiple Kafka topics and will always receive messages from the same partition number in each topic.

Checkpointing and Fault Recovery

Samza promises fault recovery in case of node failures or when YARN containers are deprovisioned. For each task Samza creates a checkpointing topic on Kafka. At a configurable time interval, Samza writes the most recently processed offset to the checkpointing topic. In case a task fails, Samza restarts a container on YARN and the task restores to the last offset written to the checkpointing topic. The checkpointing interval is specified in milliseconds and is bound to the system time. A client can enforce checkpointing within its processing implementation. Currently an enforced checkpoint commits not only the offset of the current task, but also the offsets of all other tasks running within the same container.
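
This checkpointing behavior is driven by job configuration. The following sketch lists the relevant properties as they would be passed to a job, assuming the Kafka-backed checkpoint manager shipped with the Samza release used here; the exact property names may differ in other versions and the ten-second interval is chosen arbitrarily for illustration.

import java.util.Properties;

public class CheckpointConfigSketch {
    // Returns the checkpoint-related part of a Samza job configuration.
    public static Properties checkpointProperties() {
        Properties p = new Properties();
        // Store one checkpoint topic per task on Kafka, as described above.
        p.setProperty("task.checkpoint.factory",
                "org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory");
        p.setProperty("task.checkpoint.system", "kafka");
        // Commit the most recently processed offsets every 10 seconds of system time.
        p.setProperty("task.commit.ms", "10000");
        return p;
    }
}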

State Management

Samza provides a configurable key-value store based on LevelDB [25] to manage persistent state. The key-value store is isolated at the task level. Each task therefore holds a reference to its own key-value store, and queries that join state from different tasks are not possible. However, local storage offers the benefit of fast performance that is only limited by local resources and not by bandwidth or latency.


Samza creates a separate checkpointing topic for every persistent store of a task in order to recover the persistent state after task failure. The checkpointing interval is bound to the checkpointing interval of the task. Therefore Samza guarantees synchronization of the persistent store with the task checkpoint.

Apache Storm

Apache Storm [9] is an open source distributed real-time stream processing system. The main concept of Storm is to represent the entire streaming application as a graph of computation. This graph is built using the Storm API and then deployed to the Storm cluster. The Storm cluster takes care of the distribution of tasks within the given infrastructure, and the API is designed to handle message passing, task discovery and fault tolerance. Storm enables clients to run topologies consisting of streams, spouts and bolts. Figure 4.6 shows a processing graph consisting of two spouts and five bolts. A stream is an unbounded sequence of tuples that is processed in parallel and distributed among multiple worker nodes. The source of a stream is called a spout and the processing units are called bolts. Streams are defined as tuples with a corresponding schema that names their fields. Spouts read data from an external input source and emit the data in the form of tuples to the topology. A spout is either reliable or unreliable. A reliable spout handles communication failures and replays tuples that fail to be transferred or processed by the topology, while an unreliable spout keeps no buffer of the emitted tuples, so in case of failure the tuple is lost. A spout may emit tuples to multiple streams, which allows partitioning of data and enables scalable processing of multiple streams. The Storm framework defines basic abstractions for spouts in order to match arbitrary use cases. Bolts are the processing units of a Storm topology and may fulfill a wide variety of use cases such as filtering, aggregations, joins, or talking to external persistence layers. In order to build a computational graph with multiple processing vertices, the topology needs to define at least one input stream for each bolt. Bolts may emit tuples to one or multiple streams after processing the input stream. Storm allows for parallelization of spouts and bolts using different tuple grouping strategies to pass message streams. The grouping strategy defines the partitioning of a stream and controls the degree of parallelism of the next computational unit. Storm offers seven built-in grouping strategies (a minimal topology sketch follows the list below):

Shuffle grouping Distributes tuples randomly across the bolt's tasks and guarantees that each bolt receives an equal number of tuples.

Fields grouping The stream is partitioned by one or multiple values of the emitted tuple. Tuples with the same grouping field value will always be handled by the same bolt, but different values may go to different bolts.


Global grouping The entire stream is emitted to the bolt task with the lowest id.

None grouping None grouping indicates no preference regarding the grouping of the stream and currently this will default to the shuffle grouping.

Direct grouping Direct grouping allows the producer to directly specify which consumer will receive the data for each tuple independently.

Local or shuffle grouping In case the consuming bolt has multiple tasks in the same worker process the emitted tuples will be shuffled to the in-process tasks only. Otherwise shuffle grouping is applied.

All grouping The stream is replicated to all receiving bolts.

An interface specifies the necessary behavior of a grouping strategy, and clients can implement their own strategy to fit the parallelization needs of their use case.
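
The following minimal sketch illustrates how such a topology is wired together and how the grouping strategies listed above are attached to bolts. The spout and bolt implementations, the field name stationId and the parallelism hints are invented for the example and do not correspond to the topologies used in the speed layer.

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public class TopologySketch {
    // Emits random readings; a real spout would read from an external source.
    public static class SensorSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final Random random = new Random();
        public void open(Map conf, TopologyContext ctx, SpoutOutputCollector collector) {
            this.collector = collector;
        }
        public void nextTuple() {
            Utils.sleep(100);
            collector.emit(new Values("station-" + random.nextInt(5), random.nextDouble()));
        }
        public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("stationId", "value"));
        }
    }

    // Sums the readings per station; one task handles all tuples of a station.
    public static class AggregateBolt extends BaseBasicBolt {
        private final Map<String, Double> totals = new HashMap<>();
        public void execute(Tuple t, BasicOutputCollector c) {
            totals.merge(t.getString(0), t.getDouble(1), Double::sum);
        }
        public void declareOutputFields(OutputFieldsDeclarer d) { }
    }

    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sensor-spout", new SensorSpout(), 2);
        // Fields grouping: all tuples of one station end up in the same bolt task.
        builder.setBolt("aggregate-bolt", new AggregateBolt(), 4)
               .fieldsGrouping("sensor-spout", new Fields("stationId"));

        Config conf = new Config();
        conf.setNumWorkers(2);
        new LocalCluster().submitTopology("sketch", conf, builder.createTopology());
    }
}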

Reliability

The reliability features of Storm guarantee that every tuple emitted by a spout will be fully processed by the corresponding topology. Storm tracks the tree of tuples emitted by every spout in order to enable at-least-once processing of tuples. In case a spout tuple is not completed within a certain message timeout, the tuple is failed and may be replayed later. Acknowledgment messages are sent on a separate stream to indicate that a tuple has been completely processed by a bolt. The reliability features of Storm are customizable and enable clients to adapt them to the requirements of the use case.
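
This tracking is driven by the bolts: an emitted tuple is anchored to its input tuple, and the input is explicitly acknowledged or failed. The following sketch shows the pattern for a hypothetical forwarding bolt; the field name payload is illustrative only.

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import java.util.Map;

// Forwards the payload field and participates in Storm's reliability tracking.
public class ReliableForwardBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        try {
            // Anchoring: the emitted tuple is linked to the input tuple, so the
            // spout's tuple tree is only completed once downstream bolts ack it.
            collector.emit(input, new Values(input.getValueByField("payload")));
            collector.ack(input);   // mark this tuple as fully processed by this bolt
        } catch (Exception e) {
            collector.fail(input);  // triggers a replay by a reliable spout
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("payload"));
    }
}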

Streams

Streams [12] is a framework to define topologies and their respective data flow in XML notation. Its functionality was briefly assessed as a possible module of the speed layer, but due to the lack of flexibility regarding the stream grouping mechanisms of Storm, Streams could not be applied in this architecture. Currently Streams only supports shuffle grouping to pass messages between tasks.

Spark Streaming

Spark Streaming [61] was briefly evaluated as a possible module of the batch layer, since it combines the event-driven aspects of stream-based applications with the batch processing programming model of MapReduce. The basic concept of Spark Streaming is to split the input stream into a series of deterministic batch computations.


Each split is stored reliably across the cluster as a resilient distributed dataset (RDD) [60] and is then further processed using stateless and stateful operators, such as map or aggregates. An interface allows reading messages from Kafka on multiple nodes and then further processing these messages using RDDs. A test scenario showed Spark Streaming to be inefficient and unable to recover from node failures with regard to its Kafka interface. The scenario included two Spark Streaming nodes and two Kafka broker nodes. The application written for this test periodically generated six random numbers from zero to ten and published each number to one partition of a Kafka topic. In order to leverage the distributed aspects of Spark Streaming, the parallelization factor of Spark was chosen to match the number of partitions. Although six Spark Streaming tasks were started on the cluster, only one task was consuming messages from Kafka. This task started six threads in order to consume from the six partitions and added up the data from each thread without locking or synchronization. The missing synchronization of these threads caused data loss. However, Spark Streaming allows repartitioning the tasks on the nodes during runtime. Using this functionality Spark Streaming was able to consume from one topic on different nodes, but the implementation uses the high-level consumer API of Kafka on each node, leading to random behavior regarding offset management. In some cases messages were skipped or consumed multiple times. Furthermore, in case a node running the Spark Streaming Kafka consumer fails, the task is marked as finished and does not recover.

Figure 4.6: Storm topology [44]: A Storm topology consists of bolts and spouts. Data enters the topology through spouts and is then emitted to an acyclic graph consisting of bolts.

4.1.3 Event Processing

Event processing systems or frameworks perform operations on incoming events. The most common subset of operations includes creating, reading, transforming and deleting events [27]. An event processing framework abstracts the underlying complexity of performing these operations and allows events to be further processed using a domain-specific language.

Esper

Esper [26] is a complex event processing framework with capabilities to analyze large volumes of incoming messages or events. It targets event-driven architectures with either real-time or historical data. Esper offers a domain-specific language (DSL) for event processing. The processing language follows a declarative syntax for dealing with large volumes of high-frequency, time-based messages. Esper enables clients to build their business logic on top of its DSL and supports various ways to filter and analyze incoming data streams and respond to certain conditions of interest. Esper is a lightweight kernel written in Java and provides a highly scalable way to compute metrics in memory with minimal latency [10]. It is capable of handling historical data as well as medium- to high-velocity and high-variety data. Esper considers every source of input an event stream, and clients build logic on top of these streams. Its architecture focuses on low-latency queries where events are processed in memory, with additional possibilities to access persistent storage. The domain-specific language allows expressing rich event conditions such as correlations and aggregations, with the possibility to span time windows.
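
As an illustration of the DSL, the following sketch registers a plain Java event type and a continuous query that aggregates a hypothetical rainfall stream over a sliding one-hour window. The event and field names are invented for the example and the API calls follow the Esper release current at the time of writing.

import com.espertech.esper.client.Configuration;
import com.espertech.esper.client.EPServiceProvider;
import com.espertech.esper.client.EPServiceProviderManager;
import com.espertech.esper.client.EPStatement;

public class EsperSketch {
    // Simple POJO event; Esper reads the properties via getters.
    public static class RainfallEvent {
        private final String station;
        private final double millimeters;
        public RainfallEvent(String station, double millimeters) {
            this.station = station;
            this.millimeters = millimeters;
        }
        public String getStation() { return station; }
        public double getMillimeters() { return millimeters; }
    }

    public static void main(String[] args) {
        Configuration config = new Configuration();
        config.addEventType("Rainfall", RainfallEvent.class);
        EPServiceProvider engine = EPServiceProviderManager.getDefaultProvider(config);

        // Average rainfall per station over a sliding one-hour time window.
        EPStatement stmt = engine.getEPAdministrator().createEPL(
                "select station, avg(millimeters) as avgRain " +
                "from Rainfall.win:time(1 hour) group by station");

        // The listener is invoked whenever the aggregate changes.
        stmt.addListener((newData, oldData) -> {
            if (newData != null) {
                System.out.println(newData[0].get("station") + ": " + newData[0].get("avgRain"));
            }
        });

        engine.getEPRuntime().sendEvent(new RainfallEvent("station-1", 1.4));
    }
}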

4.1.4 Persistent Storage

The architecture has to provide persistent or temporary storage facilities to store results or global state across different streams. Redis and MongoDB are introduced in the following sections.

Redis

Redis is a fast non-relational database that is widely used as a key-value cache or store [15]. It stores a mapping of keys to five different types of values. It supports in-memory or persistent disk storage, replication to scale read performance, and sharding to scale write performance. Redis can persist data under different conditions: (i) based on the number of writes in a given period of time or (ii) when the user manually calls the corresponding command. In addition, Redis can log every operation to disk as it happens.

MongoDB

MongoDB is a document-oriented database that provides high performance, high availability and automatic scaling [20]. A record in MongoDB is a document composed of field and value pairs. A value may include documents, arrays or arrays of documents. MongoDB provides high-performance data persistence and supports embedded data models to reduce I/O activity as well as indexes to reduce the latency of queries. MongoDB provides horizontal scalability in the form of automatic sharding of data across a cluster of machines.


In addition, the data may be replicated among multiple nodes in the cluster.

4.2 Coordination

The coordination layer is the entry point of data into the system. A typical scenario involves reading from an input source and delivering the messages to either the batch or the speed layer. Figure 4.7 highlights the main components of the coordination layer, the data flow between these components, and the interfaces to the batch and speed layer. An input reader fetches messages from an external data source (e.g. a file, database or socket) and sends these messages to the time synchronizer. The time synchronizer has two main responsibilities: (i) incoming data has to be forwarded in order of its timestamps and (ii) events may be delayed according to a data frequency rate. The delay is defined relative to the system time, e.g. the data of one minute is processed in one second of system time. The detailed design of the coordination layer is described as follows. First, the implementation design of the coordination layer is introduced. Second, persistent messaging and the interface to the batch layer are described. Third, in-memory messaging and the respective interface to the speed layer are addressed.
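
The throttling rule of the time synchronizer can be expressed as a simple mapping from data time to system time. The following sketch is not the thesis implementation but illustrates the idea: a speedup factor of 60 replays one minute of data per second of system time, and events are delayed until their scaled timestamp is due.

// Replays timestamped messages at a configurable speedup relative to system
// time, e.g. a factor of 60 delivers one minute of data per second.
public class TimeSynchronizerSketch {
    private final double speedup;       // data-time seconds per system-time second
    private long firstEventMillis = -1; // timestamp of the first forwarded event
    private long startSystemMillis;

    public TimeSynchronizerSketch(double speedup) {
        this.speedup = speedup;
    }

    // Blocks until the event with the given data timestamp is due, then returns.
    public void awaitTurn(long eventMillis) throws InterruptedException {
        if (firstEventMillis < 0) {
            firstEventMillis = eventMillis;
            startSystemMillis = System.currentTimeMillis();
            return;
        }
        long dataElapsed = eventMillis - firstEventMillis;
        long dueAt = startSystemMillis + (long) (dataElapsed / speedup);
        long wait = dueAt - System.currentTimeMillis();
        if (wait > 0) {
            Thread.sleep(wait); // throttle: forward events no faster than the target rate
        }
    }
}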


Figure 4.7: Coordination layer data flow: The coordination layer encapsulates three steps. First, the data is received from an external storage. Second, the time synchronizer enables throttling of events. Third, a producer (Kafka producer or in-memory producer) enables the batch and speed layer to consume data.


4.2.1 Implementation Design

Different setups are possible to run the coordinator in a distributed environment. The most basic setup is to run one instance of the coordinator on one machine, reading from only one input source and producing messages either in memory or in a persistent fashion to the batch and speed layer. The coordinator is designed to run in multiple scenarios, e.g. on multiple machines and reading from multiple input sources. The coordinator operates in pipelines (see Figure 4.8), where one pipeline corresponds to one input source. Each pipeline starts separate threads for the input reader, the time synchronizer and the message producer. Messaging between these threads is based on a ring buffer implementation [LMAX Trading].1 The potential bottlenecks of the coordinator are the reader and the producer. Therefore each element of the pipeline keeps a buffer of messages pending to be processed, e.g. the input reader will read up to 2,048 messages before it waits for the time synchronizer to free up space in the buffer. It is assumed that each input source corresponds to a separate partition of the data; therefore the order of messages is only guaranteed within a pipeline and not across pipelines. Depending on the number and type of input sources, the coordinator achieves better throughput when distributed to multiple nodes, e.g. in case of a file-based input source the reader achieves better performance by reading only one file on one disk at a time in order to leverage the performance benefits of sequential reads. Each pipeline can be started on a separate machine with a different level of data throughput throttling in order to simulate bursts from selected input sources.


Figure 4.8: Coordination pipelines: The coordination component starts a pipeline consisting of an input reader, a time synchronizer and a producer for each input source. Pipelines are fully decoupled and may reside on different nodes or processes.

1 Using the SRBench data set, the ring buffer achieves a throughput of 500,000 messages/s, in contrast to a blocking array queue implementation with a throughput of 200,000 messages/s.


4.2.2 Persistent Messaging

Apache Kafka (see 4.1.1) is used to deliver messages from the coordination layer to the batch layer. Kafka provides a producer interface to send messages to Kafka and stores the received messages first on the disk of the partition leader and then on the replication nodes. Different options are available to fine-tune the behavior of the producer and the Kafka brokers. The most important configuration options are first listed and then discussed with respect to the specific use case [6]:

request.required.acks This is the most important configuration option to control reliability. It allows for three different modes: (i) the producer never waits for an acknowledgment by the broker, (ii) the producer waits for an acknowledgment from the lead broker, but not from the replication brokers, and (iii) the producer waits for acknowledgments from the leader and from all replication brokers.

producer.type The Kafka producer can operate in either synchronous or asynchronous mode. While the asynchronous mode allows for higher throughput, the synchronous mode provides higher reliability in case of producer failure. Higher throughput is achieved by batching multiple requests together and sending the data as a bundle instead of sending each message in a separate request.

batch.size The number of messages a producer may cache before dispatching all messages to a Kafka broker.

queue.buffering.max.ms The maximum time messages are kept in the producer queue. After this time interval has elapsed, the data is produced to Kafka. Messages are thus sent either when the batch size is reached or when the specified time has elapsed.

queue.buffering.max.messages The maximum number of unsent messages that can be queued by the producer in asynchronous mode before either the producer must be blocked or data must be dropped.

The coordinator sends messages to Kafka in asynchronous mode, waiting for all replication brokers to receive and persist the data. It is assumed that the coordination layer does not fail, and therefore the implications of using an asynchronous producer are not applicable. In case of a broker or network failure, the producer waits for the leader election to select a new partition leader and resends any unacknowledged batch. Only at-least-once delivery semantics are supported. An open proposal [Kreps] suggests design changes to the implementation of the Kafka producer and the behavior of the brokers in such a scenario in order to support exactly-once message delivery. The proposed solution builds on the idea of a unique identifier for each message a producer sends and on a catalog of received messages on the broker. To circumvent the inefficiency implied by maintaining a database of O(number of messages), the design proposal suggests leveraging the sequential ordering of messages on the Kafka broker and keeping only the most recent state per producer ID (PID) on the broker.
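
The producer settings discussed in this section translate into a small configuration. The following sketch uses the Kafka 0.8 producer API with an asynchronous producer that waits for acknowledgments from all replicas; the broker list, topic, key and values are placeholders, and the exact property names follow the Kafka version used at the time of writing.

import java.util.Properties;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class CoordinatorProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("metadata.broker.list", "broker1:9092,broker2:9092");
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        props.put("request.required.acks", "-1");   // wait for the leader and all replicas
        props.put("producer.type", "async");        // batch messages for higher throughput
        props.put("batch.num.messages", "200");     // dispatch once 200 messages are buffered
        props.put("queue.buffering.max.ms", "500"); // or after at most 500 ms

        Producer<String, String> producer = new Producer<>(new ProducerConfig(props));
        producer.send(new KeyedMessage<>("sensor-data", "station-1", "12.5"));
        producer.close();
    }
}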

4.2.3 In-Memory Messaging

The communication between the coordination and the speed layer has to provide high throughput. To avoid the latency disadvantage of a persistent message queue, the coordinator provides an in-memory messaging service. The in-memory messaging service uses the buffer of the producer within the coordination pipeline to store messages for retrieval. A socket is opened to receive requests from clients. The management of open sockets is handled through the asynchronous event-driven network application framework Netty [Project Netty]. The asynchronous aspects of Netty are especially important to handle connection failures or read timeouts. In general, client timeouts on the server side are detected using a timeout threshold and in case of failure the channel is re-opened. Netty allows for asynchronous connection handling: in case of failure the client establishes a new connection and the server asynchronously spawns a new thread to deliver messages to the client, while the broken connection stays alive until a timeout is reached. Messages are stored in a thread-safe queue. The in-memory message producer implements a simple pull-based protocol on top of TCP, as highlighted in Figure 4.9. After a connection between server and client is established, the client initiates a data transfer with the control command "fetch". After receiving the fetch command, the server sends a batch of messages to the client using the same connection. The data is lost in case the connection is closed during this transfer; the protocol is not designed to replay messages. A push-based model is an alternative to the pull-based model described above, where the server starts sending messages to the client after a connection is successfully established and stops on connection loss. However, a push-based model would cause a significant amount of data loss due to the lack of fast and reliable detection of connection errors. Hence, the pull-based protocol is better suited.
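
On the client side, the pull-based protocol reduces to a simple request loop. The following sketch uses plain blocking sockets instead of Netty for brevity; the port, the literal fetch and finish commands and the length-prefixed framing are assumptions made for illustration and do not reflect the exact wire format of the implementation.

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.net.Socket;

// Pull-based client: repeatedly requests a batch with "fetch" and stops when
// the server answers with an empty batch, then sends "finish".
public class InMemoryQueueClientSketch {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket("localhost", 7070);
             DataOutputStream out = new DataOutputStream(socket.getOutputStream());
             DataInputStream in = new DataInputStream(socket.getInputStream())) {

            while (true) {
                out.writeUTF("fetch");              // request the next batch
                out.flush();
                int batchBytes = in.readInt();      // assumed length prefix per batch
                if (batchBytes == 0) {              // no data left in the producer buffer
                    out.writeUTF("finish");
                    out.flush();
                    break;
                }
                byte[] batch = new byte[batchBytes];
                in.readFully(batch);
                // Hand the batch over to the speed layer for processing.
                System.out.println("received batch of " + batchBytes + " bytes");
            }
        }
    }
}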

4.3 Batch Layer

The goal of the batch layer is to compute precise results that eventually replace the possibly inaccurate results of the speed layer. In case of node failures the processing may be delayed in order to keep this promise. There are two possible strategies to implement the batch layer: (i) a recomputation algorithm that periodically computes the results over the full master data set or (ii) an incremental algorithm that processes new data as it is introduced into the system. The design of the batch layer follows the latter approach. It is designed for stream-based data that continuously flows into the system. In comparison with the recomputation approach the incremental method has to be designed more thoroughly, but may result in lower latency. The incoming stream of data has to be stored persistently to provide the possibility to recover from node failures.


Figure 4.9: In-memory messaging protocol: The in-memory message queue implements a pull-based model that clients use to pull data from the queue. After the connection between client and server is established, the client sends the control command "fetch" to the server and receives a batch of messages in return; the exchange ends with the control command "finish".

The detailed description of the batch layer is structured as follows. First, the implementation design of the batch layer and its interfaces to the coordination and service layer are discussed. Second, the micro-batch processing mechanism is analyzed and compared to a recomputation approach. Third, the replay mechanism designed for recovery is described. Fourth, the recovery mechanism in case of node failures is explained with regard to failures on different levels, e.g. when the messaging service is not available or the underlying provisioning system is unreliable.

4.3.1 Implementation Design

The batch layer has two main components: (i) a stream processing system to compute results based on micro-batches and (ii) a persistent message queue to communicate between different nodes and processes, as well as with the coordination layer. Figure 4.10 illustrates the components of the batch layer, its interfaces to the other layers and the data flow through the system.


Figure 4.10: Batch layer components: The batch layer receives data from the coordination layer and stores these messages in a persistent message queue. The stream processing system of the batch layer interacts with the message queue and may consume or produce messages. The results of the computation are then sent to the service layer.

The persistent message queue Kafka is the entry point of data into the batch layer. A persistent message queue offers the benefit of recovery in case of node failures, compared to an in-memory queue which loses data on node failures. Another important aspect of the message queue is its retention policy. A message queue may, for example, delete a message after a peer consumes it. Kafka not only stores messages persistently, but also allows for data replication on different Kafka broker nodes, and the message retention policy is configurable in terms of time spent in the queue. The components of the batch layer may be distributed over multiple compute nodes. Kafka's capability of splitting message streams into multiple partitions sets the maximum degree of parallelism of the stream processing system. Therefore a stream with ten partitions may be processed by at most ten processes on at most ten nodes. Each Kafka partition may reside on a different Kafka broker node. Figure 4.11 illustrates a possible setup of the batch layer components on four nodes. In this example two Kafka brokers manage the two partitions with a replication factor of two, and two nodes run the components of the stream processing system. The loose coupling of the persistent message queue and the stream processing service ensures flexibility for different use cases. The batch layer is not designed to identify the physical distance between these components. As a result it is not guaranteed that a stream processing unit will consume messages from the Kafka broker that resides on the same node. The stream processing system is the computing element of the batch layer. It consumes messages from the persistent message queue and computes the results defined by the task of the compute job. The batch layer leverages the Samza stream processing framework to distribute the computation and communicate with Kafka. The stream processing system operates on micro-batches and not on the full batch of the data. The benefits of micro-batches are discussed in Section 4.3.2. Samza offers no high-level API to compute analytics on the data. Therefore the batch layer uses Esper to provide an SQL-like querying mechanism over a micro-batch of data.



Figure 4.11: Batch layer node layout: The node layout with four nodes shows an example distribution of message queue partitions and compute jobs. It shows a simple setup with two partitions and the corresponding two compute jobs. One node may be responsible for more than one partition, but the replication has to reside on a different node. The compute jobs, on the other hand, can all reside on the same node if necessary.

The integration of Esper enables a client to formulate computing problems in a high-level language. In addition, Esper features strong support for data analysis based on time windows. As a result, each micro-batch of data is sent to Esper, which in turn provides a result after the micro-batch is finished. The result of a successfully computed micro-batch is then sent to the service layer to store and further analyze the results. Figure 4.12 illustrates the batch layer components with Samza as the stream processing system. Samza uses the Apache YARN resource scheduling system to distribute processing containers on multiple nodes. The maximum number of processing containers is defined by the number of partitions in one Kafka topic.
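As an illustration of the SQL-like querying layer, the following sketch registers a time-window aggregation with Esper and reacts to each completed batch. The event type, field names and EPL statement are hypothetical and do not reproduce the queries used in the evaluation; the API shown corresponds to the Esper 5.x client library.

```java
import com.espertech.esper.client.Configuration;
import com.espertech.esper.client.EPServiceProvider;
import com.espertech.esper.client.EPServiceProviderManager;
import com.espertech.esper.client.EPStatement;
import com.espertech.esper.client.EventBean;
import com.espertech.esper.client.UpdateListener;

import java.util.HashMap;
import java.util.Map;

public class EsperBatchQuerySketch {
    public static void main(String[] args) {
        Configuration config = new Configuration();
        // Register a hypothetical map-based event type with two fields.
        Map<String, Object> fields = new HashMap<String, Object>();
        fields.put("stationId", String.class);
        fields.put("rainfall", Double.class);
        config.addEventType("Observation", fields);

        EPServiceProvider engine = EPServiceProviderManager.getDefaultProvider(config);

        // Aggregate rainfall per station over one-hour batch windows.
        EPStatement stmt = engine.getEPAdministrator().createEPL(
                "select stationId, sum(rainfall) as total "
              + "from Observation.win:time_batch(1 hour) group by stationId");

        // Emit the result of every completed window, e.g. towards the service layer.
        stmt.addListener(new UpdateListener() {
            public void update(EventBean[] newEvents, EventBean[] oldEvents) {
                if (newEvents == null) return;
                for (EventBean event : newEvents) {
                    System.out.println(event.get("stationId") + " -> " + event.get("total"));
                }
            }
        });

        // In the batch layer, the events of a micro-batch would be fed in like this:
        Map<String, Object> sample = new HashMap<String, Object>();
        sample.put("stationId", "station-42");
        sample.put("rainfall", 1.5);
        engine.getEPRuntime().sendEvent(sample, "Observation");
    }
}
```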

4.3.2 Micro-Batch Processing

Micro-batch processing is a method to process data in small batches in a time-discrete manner [44]. Figure 4.13 illustrates multiple batches in a continuous stream of data. A time window or an indication within the data itself usually determines the batch size. A micro-batch based on time windows is either bound to the system time or to a timestamp of the data. Consuming messages in batches enables greater throughput than processing messages one by one. Although high throughput is not the primary goal of the batch layer, optimizing the access to persistent storage is necessary to improve the imposed latency. Kafka is capable of receiving and sending messages in batches defined by the number of messages. A consumer may consume messages in batches or one at a time, and the same holds for the producer. It is highly important to understand the reliability implications of consuming and writing in batches from the stream processing system. This sensitivity can be illustrated as follows:



Figure 4.12: Batch layer partitioning: The data from the coordination layer is partitioned across three Kafka brokers. In this example the Kafka brokers hold five data partitions and five Samza containers consume from the corresponding partitions.

Figure 4.14 illustrates a stream processing system with two jobs and two Kafka topics. Job1 consumes from topic1 (consumer1), computes analytics based on micro-batches and stores the results to topic2 for further processing (producer1). The second job consumes messages from topic2 and processes them (consumer2). Assume consumer1 consumes messages in batches and commits an offset each time a time window is completed. Let us further assume producer1 produces messages in batches (asynchronously) and therefore keeps all messages in memory until a time-based or count-based threshold is reached. In case job1 fails, the messages in memory are lost, since the last committed offset reflects the last fully processed message of job1, but not the fully transferred messages of producer1. In such a scenario producer1 has to send messages in synchronous mode. Micro-batch processing can be applied to two use cases [44]: (i) computation that only occurs within one batch where no global state is managed and (ii) computation tasks that involve state management over multiple batches. The latter use case requires users to carefully design the state management mechanism. Samza, for example, is designed to handle state management according to the offset management of the consumed streams and therefore has built-in support for data integrity based on the processing state. In comparison with typical batch processing systems (one-at-a-time processing) such as Hadoop [51], the micro-batch processing method may achieve a shorter latency for one message completing the processing cycle [44].
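The constraint that producer1 must not commit an offset before its outgoing batch is durable can be captured in a few lines. The interfaces below are hypothetical placeholders rather than the Samza or Kafka APIs; the sketch only illustrates the required ordering of flush and commit.

```java
import java.util.List;

/** Hypothetical minimal interfaces, used only to illustrate the ordering. */
interface BatchConsumer {
    List<String> pollWindow();          // messages of one completed time window
    long lastPolledOffset();
    void commitOffset(long offset);     // checkpoint of the last processed message
}

interface ResultProducer {
    void send(String result);           // may buffer internally (asynchronous mode)
    void flush();                       // blocks until all buffered results are durable
}

public class WindowedJobSketch {
    private final BatchConsumer consumer;
    private final ResultProducer producer;

    public WindowedJobSketch(BatchConsumer consumer, ResultProducer producer) {
        this.consumer = consumer;
        this.producer = producer;
    }

    public void processOneWindow() {
        List<String> window = consumer.pollWindow();
        for (String message : window) {
            // Placeholder for the actual analytics, e.g. an Esper statement.
            producer.send("result-of(" + message + ")");
        }
        // Make the outgoing results durable BEFORE committing the offset.
        // If the offset were committed first and the job crashed, the buffered
        // results would be lost but never recomputed on restart.
        producer.flush();
        consumer.commitOffset(consumer.lastPolledOffset());
    }
}
```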

4.3.3 Replay Mechanism

The checkpointing system of Samza is optimized for stream processing applications that process live data. Offline processing would imply the ability to commit an offset based on the timestamp included in the data rather than the system time. The possibility to enforce checkpoints within a task implementation is not suited, since it enforces checkpointing on all tasks and these tasks may not be synchronized regarding the incoming data.


Figure 4.13: Micro-batches [44]: A continuous stream of data holds many incoming tuples. These tuples are organized in batches.


Figure 4.14: Memory buffers of batch-based consumers and producers: Job 1 and job 2 consume messages in batches. Job 1 also produces messages in batches. The memory buffers store full batches and are not resilient to node failure. Therefore job 1 has to synchronize the outgoing messages with the checkpoint of the consumed batches.

The flowchart illustrated in Figure 4.15 shows a hypothetical solution to this problem within the task logic itself. Each time a new time window is initialized based on the timestamp of the data, the offset is written to the persistent key-value store of Samza. In case of task recovery the offset is read from the persistent store and all messages between the current offset and the last time window offset are reread from Kafka. This approach is inefficient, because it mirrors the built-in functionality of Kafka. Based on this reflection, the Samza project has been modified to support offset commits from within the task logic that do not propagate to other tasks. A task can therefore control the offset itself and only commit the checkpoint when a new time window is started. This drastically reduces the overhead of managing an additional recovery strategy.
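A sketch of the described task logic, committing an offset only when the data timestamp crosses into a new time window, is given below. The MessageEnvelope and OffsetCommitter types are hypothetical stand-ins for the (modified) Samza task API, whose exact signatures are not reproduced here, and the one-hour window size is an assumption.

```java
/** Hypothetical stand-ins for the task runtime; not the actual Samza API. */
interface MessageEnvelope {
    long dataTimestamp();   // timestamp carried inside the message
    String offset();        // position of the message in the Kafka partition
    String payload();
}

interface OffsetCommitter {
    void commit(String offset);   // checkpoint visible only to this task
}

public class WindowedCheckpointTaskSketch {
    private static final long WINDOW_MS = 60 * 60 * 1000L;  // assumed one-hour windows

    private final OffsetCommitter committer;
    private long currentWindowStart = -1;

    public WindowedCheckpointTaskSketch(OffsetCommitter committer) {
        this.committer = committer;
    }

    public void process(MessageEnvelope envelope) {
        long windowStart = (envelope.dataTimestamp() / WINDOW_MS) * WINDOW_MS;
        if (windowStart != currentWindowStart) {
            // A new time window begins with this message: checkpoint its offset,
            // so that on recovery the full window is replayed from Kafka.
            committer.commit(envelope.offset());
            currentWindowStart = windowStart;
        }
        // Forward the message to the analytics engine (e.g. Esper) here.
    }
}
```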



Figure 4.15: The flowchart shows the checkpointing mechanism that Samza lacks for recovering a full time window of data and processing it with Esper. Samza only provides checkpoints based on system time.

4.3.4 Precise Recovery

The batch layer has to be able to recover from node or processing failures on different levels in order to guarantee precise results. The recovery of a failed Samza job is handled by the offset commit management and its replay mechanism described in Section 4.3.3. But possible failures may also involve the YARN resource manager or Kafka brokers. Both systems are further analyzed with regard to node failures and their impact on the batch layer. Two different scenarios are considered in case a YARN node fails during processing: (i) a Samza task is destroyed and (ii) the application manager is stopped. The first case results in a message to the application manager, which in turn starts a new Samza task. Due to the offset committing and checkpointing mechanisms of Samza the task can resume from the last known state. In the latter case YARN starts a new application manager for Samza. The new application manager destroys all running tasks of the old application manager and starts corresponding new tasks. Samza stores the offset commits and the checkpointing data in separate Kafka topics. Although Kafka allows topic replication to guarantee a certain degree of fault redundancy, the live data and the replication might be lost at a certain level of node failures. Potential risks are illustrated by the following three examples. Let us first consider a scenario where the offset topic is available, but the data of the state management was lost because the live broker and the replication broker failed. This causes a corrupt state, and therefore local state that needs to be kept over multiple micro-batches is lost.


Let us then consider an example where one or multiple partitions of the topic that a job is consuming from are not recoverable due to broker failures. In this case unconsumed messages are lost and Samza will continue processing the new incoming data from the new leader. A special scenario is the loss of all offset commits. If at the same time the corresponding Samza task fails, Samza will restart the task and, due to the lack of offset commits, it will start again either from the beginning of the corresponding topic or from the most recent offset of the partitions. This option is configurable for each Samza job.

4.4 Speed Layer

The main goal of the speed layer is to deliver results in real-time and therefore to optimize the latency and throughput of processing incoming data. In comparison to the batch layer, the speed layer does not guarantee the same high availability and fault tolerance. In case of node or processing failures the speed layer drops messages in order to rapidly process the new incoming data instead of replaying all messages up to the last processed state. This design leads to inaccurate results in case of node failure, but these results are eventually replaced by the results of the batch layer. The detailed design of the speed layer is described as follows. First, the implementation design of the speed layer is discussed. Second, the impact of node failures and its implications for data loss are analyzed. Third, the scalability of the speed layer is addressed.

4.4.1 Implementation Design

The in-memory queue of the coordination layer is the data source of the speed layer. Messages are transferred using the protocol described in Section 4.2.3. The consumed messages are then sent into a Storm topology. Figure 4.16 illustrates the main components of the speed layer. A spout called NettySpout communicates with the coordination layer to fetch messages from the in-memory queue. These messages are then sent to the bolts for processing. The persistent distributed caching system is deployed in order to keep a global state which is not bound to the runtime of one bolt. The NettySpout is the entry point of data into the system. It communicates with the coordinator using the Netty framework [Project Netty]. Storm requires a spout to be non-blocking, and in case no data is available nothing is emitted. Therefore the NettySpout starts a new thread to communicate with the coordinator. This thread implements the pull-based protocol of the coordinator and keeps a buffer of 5,000 messages. This buffer is responsible for steadily providing the spout with new messages. If the spout pulled data from the coordinator only when Storm periodically asks for a new tuple, the network latency would slow down processing. To detect failures in the socket connection to the in-memory message queue of the coordinator, a heartbeat is implemented. The heartbeat checks periodically for connection failures and re-establishes a connection if necessary. In case of worker failure the buffer of the NettySpout is lost and not recoverable, since the in-memory message queue of the coordination layer discards messages after delivery.
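The combination of a non-blocking spout and a background fetch thread might look roughly as follows. This is a simplified sketch against the Storm 0.9 API (backtype.storm namespace); the fetch loop, field names and sleep interval are assumptions, and the heartbeat and reconnection logic of the actual NettySpout are omitted.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

public class PullingSpoutSketch extends BaseRichSpout {
    // Buffer filled by a background thread that speaks the pull protocol (5,000 messages).
    private final BlockingQueue<String> buffer = new LinkedBlockingQueue<String>(5000);
    private SpoutOutputCollector collector;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        Thread fetcher = new Thread(new Runnable() {
            public void run() {
                while (!Thread.currentThread().isInterrupted()) {
                    // Placeholder: fetch a batch from the coordinator's in-memory
                    // queue and add each message to the buffer.
                    buffer.offer("raw-event");
                    try {
                        Thread.sleep(10);
                    } catch (InterruptedException e) {
                        return;
                    }
                }
            }
        });
        fetcher.setDaemon(true);
        fetcher.start();
    }

    @Override
    public void nextTuple() {
        // Must not block: emit only if a message is already buffered.
        String message = buffer.poll();
        if (message != null) {
            // The message id enables acknowledgments, used here only for flow control.
            collector.emit(new Values(message), message);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("event"));
    }
}
```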



Figure 4.16: The speed layer consumes data from the coordination layer through the NettySpout. The NettySpout then forwards the messages to the processing bolt. The processing bolt uses a distributed persistent cache to store temporary data and stores the results of the computation task in the service layer. All components within the speed layer are highly scalable.

The grouping scheme that defines the destination bolts is based on the partitioning of the data. The built-in functionality 'field grouping' (see Section 4.1.2) was first evaluated as a possible grouping scheme. However, field grouping only guarantees that messages with the same key are transferred to the same bolt; one bolt might still receive messages with different keys. In order to guarantee that one bolt is responsible for exactly one partition of the data, a custom partitioning scheme was implemented. The custom partitioning scheme provides compatibility with the unified partitioning method of this architecture described in Section 5.3.5.
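A custom grouping that routes each partition key to exactly one bolt task can be expressed with Storm's CustomStreamGrouping interface, as sketched below. The assumption that the partition key is the first tuple field and the simple modulo assignment are illustrative; the actual implementation follows the unified partitioning method of Section 5.3.5.

```java
import java.io.Serializable;
import java.util.Collections;
import java.util.List;

import backtype.storm.generated.GlobalStreamId;
import backtype.storm.grouping.CustomStreamGrouping;
import backtype.storm.task.WorkerTopologyContext;

public class PartitionGroupingSketch implements CustomStreamGrouping, Serializable {
    private List<Integer> targetTasks;

    @Override
    public void prepare(WorkerTopologyContext context, GlobalStreamId stream,
                        List<Integer> targetTasks) {
        this.targetTasks = targetTasks;
    }

    @Override
    public List<Integer> chooseTasks(int taskId, List<Object> values) {
        // Assume the partition key (e.g. station or house id) is the first field.
        Object key = values.get(0);
        int bucket = (key.hashCode() & Integer.MAX_VALUE) % targetTasks.size();
        // Every key is always routed to the same single task.
        return Collections.singletonList(targetTasks.get(bucket));
    }
}
```

When building the topology, such a grouping could be attached via builder.setBolt("processing", new ProcessingBolt(), 8).customGrouping("netty-spout", new PartitionGroupingSketch()), where ProcessingBolt and the component names are hypothetical.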

Storm can operate a topology either in a reliable fashion with acknowledgments or in an unreliable way. Acknowledgments are activated, but do not provide any reliability guarantees in this setup. The spout does not hold a list of pending messages to re-emit in case of failure. Also, the spout is not capable of replaying messages from the in-memory message queue. The acknowledgment mechanism serves the sole purpose of controlling the message throughput. If acknowledgments were completely disabled in a Storm topology, the spout would emit messages until memory is exhausted, with a correspondingly high risk of data loss in case of failures.

Redis is used as a persistent caching system to store global or local state and is further described in Section 4.1.4. Global state management might be used for two purposes: (i) to keep local state in a more reliable destination than memory and (ii) to share state across multiple processing units. The second purpose may need further consideration to guarantee ACID transactions and is not further analyzed in this thesis, because the evaluated use cases do not manage global state.


4.4.2 Node Failures

In order to fully understand the impact of node failures it is important to analyze the internal messaging service of Storm. The following configuration options control the buffering of messages within a Storm topology [7]. The values used for this architecture are stated in brackets.

topology.max.spout.pending (30,000): The maximum number of unprocessed tuples emitted by a spout. In case this value is not set, Storm will keep all messages emitted by the spout in memory.

topology.message.timeout.secs: The maximum time a message emitted by the spout has to be fully processed and acknowledged before it is failed.

topology.transfer.buffer.size (32): The size of the transfer queue of one worker to buffer messages before sending them to another worker.

topology.receiver.buffer.size (8): The maximum number of messages to receive from the network and store as a batch before sending them to the executor.

topology.executor.receive.buffer.size (16,384): The size of the executor queue that receives messages from the receiver of the worker.

topology.executor.send.buffer.size (16,384): The size of the executor queue that sends messages to the sender of the worker.
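The values stated above can be set programmatically when submitting the topology, for example as in the following sketch against the Storm 0.9 Config API; the topology name and the (omitted) spout and bolt wiring are placeholders.

```java
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class SpeedLayerSubmitterSketch {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // ... spout and bolt wiring omitted ...

        Config conf = new Config();
        // Bound the number of unacknowledged tuples in flight per spout.
        conf.setMaxSpoutPending(30000);
        conf.put(Config.TOPOLOGY_TRANSFER_BUFFER_SIZE, 32);
        conf.put(Config.TOPOLOGY_RECEIVER_BUFFER_SIZE, 8);
        conf.put(Config.TOPOLOGY_EXECUTOR_RECEIVE_BUFFER_SIZE, 16384);
        conf.put(Config.TOPOLOGY_EXECUTOR_SEND_BUFFER_SIZE, 16384);

        StormSubmitter.submitTopology("speed-layer", conf, builder.createTopology());
    }
}
```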

Figure 4.17 highlights the messaging components of a worker. Each worker has a single receiver thread that stores incoming tuples in a list. A tuple is then sent to the ring buffer of the bolt or spout. After processing the message, the bolt or spout can emit a tuple to its corresponding sender thread. The sender thread stores the message in a ring buffer, and the transfer thread consumes it and stores the data in a list. The buffer of the transfer thread is then flushed after a certain time span or when the buffer reaches a certain queue size. The speed layer of this architecture does not replay messages in case of failure. Therefore any message stored in these buffers is lost in case of failure. To reduce the impact of a node failure these buffers should be kept small, but these buffers also control the throughput and latency. A desirable property to increase the throughput of the system is to send messages over the network only in batches, but the longer a message resides in a buffer, the higher its latency. On the other hand, a buffer size of 1 does not reduce the latency either, because the system will spend more time waiting for available slots in the buffers. Fine-tuning these properties depends on the use case of the application. The values stated above are optimized for the data sets used to test this architecture.



Figure 4.17: The Storm worker process manages multiple queues at different levels. One worker has a single receiver thread that keeps an internal (list) buffer of received tuples. The spouts or bolts managed by this worker receive data through their ring buffer. The sender buffer of the executor keeps the outgoing data in a ring buffer and forwards it to the transfer thread. The transfer thread keeps a list of tuples and flushes it periodically to the network.

4.4.3 Scalability

Storm provides its own resource management system consisting of one master node called nimbus and one or more worker nodes called supervisors. Nimbus is responsible for controlling a topology and the worker nodes that execute spouts and bolts. In general a topology may start as many spouts and bolts as the given cluster resources allow. Two factors of this architecture constrain the degree of parallelism of spouts and bolts: (i) the number of pipelines in the coordination layer and their in-memory queues determines the maximum number of spouts and (ii) the maximum number of bolts is constrained by the number of partitions of the data set. For example, one pipeline would constrain the maximum number of spouts to one, but if the data consumed from the pipeline is divided into ten partitions, the processing bolts could scale up to ten processes. Figure 4.18 shows a possible setup of pipelines, spouts and bolts. This example setup could cause a performance bottleneck in the coordination layer.

4.5 Orchestration

The lambda architecture embeds highly distributed services in each layer, with the disadvantage of difficult synchronization and resource management. In order to mitigate these challenges an orchestration layer is introduced. This layer leverages the Apache Hadoop YARN framework and the built-in Storm service for resource allocation, and Apache ZooKeeper and Kafka for coordination.



Figure 4.18: The scalability of the speed layer is partially defined by the coordination layer. One spout of the speed layer can consume data from one pipeline of the coordination layer. The spout can then further partition the data and emit it to multiple bolts.

First, the challenges of the orchestration and coordination with regard to resource management and cluster health are discussed. Second, the details of the node failure simulations are explained.

4.5.1 Resource Management and Cluster Health

Each machine within a compute cluster serves a certain purpose which is defined by the services running on the node and the available hardware resources. YARN and Storm's nimbus service use two different strategies for resource utilization. The former distributes processes according to the resource request of the client; currently the amount of memory and the number of CPU cores are supported. The latter distributes all tasks in an even or almost even way over the available resources. Both resource managers employ a heartbeat mechanism to detect unresponsive nodes. A node failure may not be detected unless the heartbeat fails. Therefore the detection of node failures is delayed by at least the minimum heartbeat frequency. Apache Kafka is the persistent messaging system of the batch layer. Kafka follows a master-less cluster setup and supports replication and fault tolerance. However, it does not circumvent the CAP theorem. The CAP theorem highlights the trade-off between being correct and being always available in distributed systems [14]. Gilbert and Lynch [32] proved the corresponding conjecture: it is impossible to implement a read-write storage in an asynchronous network that always guarantees availability, consistency and partition tolerance. One of the design goals of Kafka is to tolerate n-1 node failures, where n is the total number of nodes in a cluster setup [37]. Each partition is managed by one leader and may have multiple replication brokers.

If n-1 nodes fail, the remaining node becomes the partition leader. At this point Kafka trades consistency for availability and allows clients to read and write the managed topic, even though no replication hosts are available. But in case the last remaining node fails before any replication node becomes available again, the messages received during the failure are lost. When a replication node recovers, it resumes operations from its last known state. This scenario depends not only on the number of brokers in the cluster, but also on the replication factor defined for each topic. In case the replication factor r is not set to n-1, where n is the number of nodes in the cluster, the failure of r+1 nodes may already result in data loss.

4.5.2 Node Failure Simulation

The goal of the node failure simulation is to gather information on the behavior of the services and on the impact on the outcome of a computational task, e.g. a decrease in precision, recall and throughput. A node failure is defined as stopping every service of this architecture on the selected machine. To simulate a machine failure the SIGKILL signal is sent to each running process and its sub-processes. The SIGKILL signal does not allow the process to perform clean-up actions and therefore produces results similar to a complete machine failure with regard to the running process [42]. Node failure simulation not only involves the shutdown of processes and services; a node may also restart services. Two different scenarios are possible for the restart of a machine and its services: (i) a machine may experience unrecoverable hardware issues resulting in the complete loss of data or (ii) a node may fail on the operating system level and its state is recoverable after a hard reboot. The node failure simulation component embeds a coin-flipping mechanism to determine which machine to fail at what time. The following arguments are considered:

Probability The probability of a node failure occurring within the given interval. In each interval a pseudo-random number between 0 and 1 is generated using the Mersenne Twister [45] pseudo-random number generator. If the pseudo-random number is below the given probability, a node failure is initiated.

Interval The interval of possible node failures in seconds. This is a floating value: each interval may be enlarged or reduced in a pseudo-random fashion in order to decrease the regularity of node failures, possibly leading to more diverse insights.

Concurrent Number of Nodes The maximum number of concurrent node failures that will occur in the same interval.

Re-alive Timeout The time to wait until the services on the failed nodes are restarted.
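A sketch of the coin-flipping scheduler using the Mersenne Twister implementation from Apache Commons Math is shown below. The 20% jitter factor, the way a victim node is selected and the handling of only one concurrent failure are assumptions, and the actual kill and restart logic (sending SIGKILL to the services) is only indicated by placeholders.

```java
import java.util.List;

import org.apache.commons.math3.random.MersenneTwister;

public class NodeFailureSimulatorSketch {
    private final MersenneTwister rng = new MersenneTwister();
    private final double probability;     // failure probability per interval
    private final long intervalMs;        // base interval between coin flips
    private final long reAliveTimeoutMs;  // time before services are restarted
    private final List<String> nodes;     // candidate hosts

    public NodeFailureSimulatorSketch(double probability, long intervalMs,
                                      long reAliveTimeoutMs, List<String> nodes) {
        this.probability = probability;
        this.intervalMs = intervalMs;
        this.reAliveTimeoutMs = reAliveTimeoutMs;
        this.nodes = nodes;
    }

    public void run() throws InterruptedException {
        while (true) {
            // Jitter the interval by up to +/-20% (assumed factor) to avoid regular failures.
            long jitter = (long) (intervalMs * 0.2 * (rng.nextDouble() * 2 - 1));
            Thread.sleep(intervalMs + jitter);

            if (rng.nextDouble() < probability) {
                String victim = nodes.get(rng.nextInt(nodes.size()));
                // Placeholder: send SIGKILL to all architecture services on the victim.
                System.out.println("killing services on " + victim);
                Thread.sleep(reAliveTimeoutMs);
                // Placeholder: restart the services on the victim node.
                System.out.println("restarting services on " + victim);
            }
        }
    }
}
```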


4.6 Service Layer

The service layer provides a set of tools to support near real-time process monitoring, localized performance measurements and result collection. The service layer does not correspond to the serving layer of the lambda architecture as defined by Marz [44]. Marz [44] defines the serving layer by its ability to access the results of the batch layer computations with low latency. The serving layer would index views of the batch layer and provide the necessary interfaces to access the precomputed data with low-latency queries. This thesis does not analyze possible solutions for the serving layer and the merge layer. It focuses on the effects of the batch and speed layer. The service layer provides a toolset to further analyze and understand these effects. The description of this layer is structured as follows: First, the corresponding infrastructure in terms of services and dependencies is discussed. Second, the process monitoring component and its visual output are explained.

4.6.1 Logging Infrastructure

Figure 4.19 illustrates the log collection and aggregation architecture and its dependencies on external processes. A process uses the Log4j2 framework to emit log messages regarding performance and operations.2 Log4j2 starts an asynchronous process to emit them to the Flume appender.3 The Flume appender opens a socket connection to the Flume agent. The internal pipeline of Flume receives messages from a Flume source, sends them to a channel, and finally the Flume sink redirects the messages to the configured output. At each step arbitrary log transformations are possible. After transformation the Flume sink sends the data to the Elasticsearch database.4 Elasticsearch indexes the received log messages on every accessible field and provides a real-time view to the user in the form of the Kibana web interface.5
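From the perspective of a single process, emitting a performance log message only requires the Log4j2 API, as in the sketch below. The routing to the Flume agent is assumed to be configured externally via the Flume appender in the Log4j2 configuration, and the logger name and message format are hypothetical.

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class ThroughputLoggerSketch {
    // Hypothetical logger name; the Flume appender routing is defined in the Log4j2 config.
    private static final Logger PERF_LOG = LogManager.getLogger("performance");

    private long processed = 0;

    public void onEvent() {
        processed++;
        // Emit a performance record every 1,000 processed events (cf. Section 6.1).
        if (processed % 1000 == 0) {
            PERF_LOG.info("component=batch-samza processed={} ts={}",
                    processed, System.currentTimeMillis());
        }
    }
}
```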

4.6.2 Process Monitoring

The logging infrastructure described in Section 4.6.1 enables a user to gain insights into the running processes in near real-time in the form of a web service. Due to the different steps of log collection, transformation and aggregation, a delay of approximately 10 s is anticipated. While the architecture allows for a very diverse set of analyses based on the data ingested into the log collection and aggregation framework, the following two aspects are the focus of monitoring the architecture: (i) showing the progress of the computation in terms of messages processed on each level of the architecture (see Figure 4.22) and (ii) showing the throughput over time for each component of the architecture (see Figures 4.20 and 4.21).

2 http://logging.apache.org/log4j/2.x/
3 http://flume.apache.org/
4 http://www.elasticsearch.org/
5 http://www.elasticsearch.org/overview/kibana/



Figure 4.19: Log collection and aggregation: The client logic can send log messages through the Log4j2 interface to an asynchronous logger. This logger transfers the received messages to Flume. Flume opens multiple agents to receive log messages and, after internal processing, sends these messages to Elasticsearch. Elasticsearch is fully indexed and can be accessed from the web interface Kibana.

Figure 4.20: Batch layer monitoring: The graphs show histograms of the average time it takes to process messages at different components of the architecture. From top left to bottom right: (i) input reader (ii) time synchronizer (iii) Kafka producer and (iv) Samza job


Figure 4.21: Speed layer monitoring: The graphs show histograms of the average time it takes to process messages at different components of the architecture. From top left to bottom right: (i) input reader (ii) time synchronizer (iii) NettySpout and (iv) Bolt.

Figure 4.22: Progress monitor: Shows the total number of events processed by the components of this architecture.

5 Design of Experiments

The design and setup of the experiments are described as follows: First, the infrastructure on which the experiments are conducted is introduced. Second, the deployment of the services and components is described and the automatic deployment scripts are briefly explained. Third, the data sets and corresponding queries used to test the architecture are highlighted. Fourth, the node failure scenarios of the experiments are explained.

5.1 Infrastructure

The experiments were run on an 8-node cluster provided by the Dynamic and Distributed Information Systems Group at the Department of Informatics of the University of Zurich. Table 5.1 shows the hardware information for the nodes in the cluster.

Memory: 128 GB
CPU: 40 CPUs, 2 threads per core, 10 cores per socket, 64-bit, 1200.000 MHz
Ethernet: 1 Gbit

Table 5.1: Shows the resource information for each node of the cluster the experiments were run on.

5.1.1 Automatic Deployment

The cluster used to run the experiments is controlled by the SLURM [59] resource management system. SLURM allows specifying the amount of resources needed for a task and handles the provisioning of multiple tasks depending on the available resources. The automatic deployment is started as a SLURM job with parameters for the following options: data set, speed or batch layer, number of nodes, query to analyze the data, speed of the coordination layer, node failure probability, node failure interval, node failure re-alive timeout and number of concurrent node failures. The automatic deployment is responsible for starting all services of the architecture, starting the experiment and collecting the results. The cluster layout is defined in a JSON-based configuration file.

The starting procedure includes the following four steps. First, the master daemons of the orchestration layer are started, e.g. the Storm nimbus, the YARN resource manager and ZooKeeper. Second, the worker nodes are connected to the master daemons, e.g. the Storm supervisors, YARN node managers and Kafka brokers. Third, the batch or speed layer is started and prepared to receive data. The distribution of tasks to worker nodes is managed through the corresponding mechanism in YARN or Storm. Fourth, the coordination layer is started on each node and the flow of the input data is delayed until the node failure simulation is started. After a successful experiment the results are stored for further analysis as a database dump.

5.2 Experimental Setup

Figure 5.1 highlights the cluster setup for the experiments. The master node provides all master daemons of the batch and speed layer. The worker nodes are responsible for processing and therefore hold all worker daemons and processing layers. The batch and speed layer are not processes by themselves, but represent components managed by the respective orchestration services. The coordination layer is a standalone process and is started through the automatic deployment script as described in Section 5.1.1. Kafka is configured with a replication factor of three for each partition. A higher replication factor leads to more network traffic within the cluster, but would increase fault tolerance.

5.3 Data Sets and Queries

In order to gain insights into the designed architecture and to generate key performance indicators such as throughput, precision and recall, two distinct data sets and their respective queries were implemented to complement this architecture. The DEBS Grand Challenge 2014 data set [3] and the SRBench data set [62] were used to generate key performance indicators. For each data set two distinct queries were selected to measure the system's behavior. The results of the SRBench queries are either true or false without further granularity. The queries selected for the DEBS data set produce floating-point values and enable further discussion of their difference to the baseline.

5.3.1 SRBench Data Set

The SRBench benchmark is a general-purpose data set primarily designed for streaming applications such as RDF/SPARQL engines. It is based on real-world data from the Linked Open Data Cloud [62]. The data set contains a subset of the US weather data collected since 2002 by MesoWest. It includes sensor-measured phenomena such as temperature, visibility, precipitation, pressure, wind speed and humidity during time periods in which several major storms were active, including the hurricanes Katrina, Bill, Bertha, Ike and Wilma. Due to time and resource constraints only the Bertha data set from the SRBench benchmark is used for the experiments. This data set was chosen randomly.


Figure 5.1: Shows the setup of this architecture on the master and worker nodes. The master node runs ZooKeeper, the Storm nimbus, the YARN resource manager, the HDFS namenode, the log collection services, MongoDB, Redis and a coordination layer instance. Each of the three worker nodes runs a YARN node manager, a Storm supervisor, a Kafka broker, MongoDB, Redis, the batch and speed layer components and a coordination layer instance.

Furthermore, only queries 1 and 7 were run during the experiments. Both queries produce boolean results. The decision to choose this type of result is explained in Section 5.3. The corresponding key performance indicators produce results that are either true or false in nature.

5.3.2 DEBS Grand Challenge 2014 Data Set

The DEBS Grand Challenge 2014 data set originates from smart plug recordings of private households. A smart plug is equipped with a range of sensors in order to measure values related to power consumption. Every smart plug deployed for this challenge collected data roughly every second for each of its sensors. The data set was collected in an uncontrolled environment and therefore possibly includes malformed or missing measurements. The resulting data set contains over 4 billion measurements from 2,125 plugs distributed across 40 houses. The measurements cover a one-month period starting from the 1st of September 2013. The first query of the DEBS Grand Challenge comprises the prediction of the load for each house within a given time window. This query was implemented according to the baseline prediction given in the query task [3]. A second query "average load" measures the average load in batch windows of 15 minutes. The first query uses global storage, while the second query only operates on the flowing data.


5.3.3 Data Set Statistics Table 5.2 highlights the key statistics for both data sets. The SRBench data set is further categorized by the name of the observed storm.

Data set        Event Count     Avg. Size   Max.    Min.    Variance   Std. Deviation
DEBS            4,055,508,721   110.473 B   120 B   88 B    23.176     4.814
SRBench Bertha  21,254,512      170.593 B   200 B   152 B   133.449    11.552

Table 5.2: Data set statistics for the DEBS Grand Challenge 2014 and SRBench: number of events, average event size, maximum event size, minimum event size, variance of the event size and standard deviation of the event size.

5.3.4 Baseline

A baseline was calculated for each query on each data set. The baseline was computed by running the query on the batch layer five consecutive times, each run yielding the same results. In addition, the baseline was verified by comparing the results with the outcome of the speed layer. The baseline includes the results of the query and the start and end time of the corresponding time window. A particular result is therefore comparable to the exact time window it originates from.

5.3.5 Partitioning

The partitioning of a data set depends on the computational task to be solved. The queries implemented for the SRBench data set compute results based on events of the stations. Every SRBench sub data set includes many hundreds of stations. The queries implemented for the DEBS data set compute aggregates and predictions based either on the house or on the plug. A hash bucket algorithm (consistent hashing) was applied in order to define one unified method to partition the data of both data sets. Two constraints apply to this method: (i) the partitions may differ in size and (ii) adding new partitions at runtime will cause reshuffling of certain buckets. The number of partitions for the experiments was set to 8. Section 8.3 explains the reason for this decision and the resulting limitations regarding the evaluation of the experiments.
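A minimal sketch of such a hash-bucket assignment is shown below. The key extraction (station or house id) and the fixed bucket count of 8 follow the description above, while the use of hashCode and the simple modulo mapping (rather than a full consistent-hashing ring) are assumptions about the concrete implementation.

```java
public class HashBucketPartitionerSketch {
    private final int numPartitions;

    public HashBucketPartitionerSketch(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    /** Maps a partition key (e.g. station id or house id) to a bucket in [0, numPartitions). */
    public int partitionFor(String key) {
        // Math.abs(Integer.MIN_VALUE) is negative, so mask the sign bit instead.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        HashBucketPartitionerSketch partitioner = new HashBucketPartitionerSketch(8);
        System.out.println(partitioner.partitionFor("station-42"));  // same key -> same bucket
        System.out.println(partitioner.partitionFor("house-7"));
    }
}
```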

5.4 Node Failure Simulation

Node failures may occur in different flavors. Two different node failure scenarios were considered for the two data sets: (i) the scenario of an unstable network, where node failures occur frequently and the network may become available again within seconds, and (ii) an elastic setup where node failures may occur frequently and a monitoring component detects failed nodes and fires up a new node within minutes.


The DEBS Grand Challenge 2014 data set is bigger in size and number of events than the SRBench data set and is therefore better suited for the latter use case. Due to the small size of the SRBench data set, only frequent and short node failures can be simulated. A node failure includes the shutdown of the following processes with the method described in Section 4.5.2: (i) YARN node manager, (ii) Storm supervisor and (iii) Kafka broker.


6 Results

This chapter presents the results of the experiments; an in-depth discussion of the results follows in Chapter 7. The experiments were conducted by running two distinct queries on each data set with and without node failure simulation. The node failure simulation is capable of simulating multiple concurrent node failures, but in order to adequately measure the impact of node failures only one node failure at a time was simulated. The failing node was chosen at random as described in Section 4.5.2. At times multiple nodes were unavailable during an experiment, but no experiment enforced more than one simulated concurrent node failure. However, concurrent node failures caused by system failures were possible and occurred throughout the experiments. All throughput measurements were stripped of the first and last 10 seconds in order to remove the effects of the initial bootstrap and the shutdown mechanism. First, the relevant key performance indicators (KPIs) are introduced and the corresponding technique or process used to obtain the data is described. Second, the KPIs of the relatively small SRBench data set are highlighted. Third, the results of the DEBS Grand Challenge data set are described.

6.1 Key Performance Indicators

The following list compiles the key performance indicators used to elaborate on the outcome of the experiments:

Throughput The throughput shows the number of events per second processed by the system, accumulated over all nodes. It is measured in the processing component of the batch or speed layer in steps of 1,000 events. A node failure may therefore distort this measure by at most 1,000 events per failed node. The throughput is stated in events/s and MB/s; the latter is calculated using the average size of one event (see Section 5.3).

Precision Precision refers to the fraction of accurate results within the returned result set. A result is only counted as precise if it exactly matches the result from the baseline (see Section 5.3.4).

Variance The variance shows the average variance of the difference to the baseline within a time frame, for results measured as floating-point values.

Recall Recall refers to the fraction of accurate results within the possible result set. The experiments contain two different types of results: (i) a boolean type that is either accurate or not and (ii) a floating-point type with an additional indication of the variance of the results. The latter is highlighted in two flavors: first, the recall of the accurate results is shown and second, the recall of the results below the variance is stated with a corresponding indication.

Health Health is measured by the fraction of responsive nodes over time in order to indicate the health of the cluster within different time frames.

Time-to-precision Time-to-precision measures the time span it takes to replace an inaccurate result of the speed layer with the precise result of the batch layer.
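For reference, a formalization consistent with these descriptions is as follows, where R denotes the multiset of returned results, B the set of baseline results, and a returned result counts as accurate if it exactly matches the baseline result of its time window:

\[
\text{precision} = \frac{|\{r \in R : r \text{ is accurate}\}|}{|R|},
\qquad
\text{recall} = \frac{|\{r \in R : r \text{ is accurate}\}|}{|B|}.
\]

Because duplicate accurate results are counted in the numerator of the recall, the recall can exceed 1 when parts of the data are reprocessed after node failures (cf. the values above 1 reported in Section 6.2).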

Each experiment stores the result of the query. Based on the result, the precision, recall and variance are measured. The health of the cluster is calculated using log files for each node that provide the current state of the node at a given time. The throughput is measured based on a performance log message sent every 1,000 processing steps. Section 4.6.1 shows the process used to collect and aggregate these performance measurements.

6.2 SRBench Data Set

The SRBench Bertha data set was tested using two queries: (i) get the rainfall observed once an hour and (ii) detect broken stations. Multiple experiments were conducted for each query. Experiments with node failures were conducted using a throttling mechanism in the system time synchronizer of the coordination layer (see Section 4.2), because the node failure simulation requires failure intervals longer than the time needed to process the full data set without throttling.

6.2.1 Rainfall Observed once an Hour

This query tests the engine's ability to filter and aggregate events. First, a performance run of experiments without throttling and node failures is highlighted. Second, experiments with throttling are shown. Third, experiments with throttling and node failures are described.


Performance Run

This experiment measures the system's behavior with the maximum throughput possible and no node failures. Table 6.1 highlights the overall statistics of this experiment for the batch and speed layer. Both layers show the maximum precision and recall possible, while the speed layer finishes the experiment 5.313 times faster than the batch layer.

Layer   Time        Throughput (events/s)          Precision   Recall   Health
Batch   170.103 s   avg: 117,000 (19.035 MB/s)     1.0         1.0      1.0
                    max: 220,000 (35.792 MB/s)
                    min: 17,000 (2.766 MB/s)
Speed   32.012 s    avg: 521,000 (84.762 MB/s)     1.0         1.0      1.0
                    max: 684,000 (111.280 MB/s)
                    min: 380,000 (61.822 MB/s)

Table 6.1: Overall statistics of the query "rainfall observed once an hour" run on the batch and speed layer without node failure simulation using the SRBench Bertha data set.

Figure 6.1 highlights the results of the batch layer (a, b) and the speed layer (c, d) without node failures and throttling. It shows that the throughput of the batch layer fluctuated dramatically between 10,000 and 210,000 events/s in the first 30 seconds of the experiment and stabilized around 120,000 events/s. In the last 20 seconds the throughput declined steadily until the end of the experiment. The throughput of the speed layer fluctuated slightly in the first 23 seconds of the experiment and then fluctuated dramatically for the last 20 seconds. Neither layer was subject to any node failures, and the precision and recall remained stable at 1.0. No failures in the results were measured.

Throughput Throttling

This experiment shows the behavior of the batch and speed layer with a throttled data input speed. Table 6.2 shows the overall statistics of this experiment. Both layers process events at the same throughput rate of 41,000 events/s on average. No events were lost during the experiment and therefore the precision and recall of both the batch and speed layer are 1.0.


Layer   Time        Throughput (events/s)        Precision   Recall   Health
Batch   498.470 s   avg: 41,000 (6.670 MB/s)     1.0         1.0      1.0
                    max: 55,000 (8.947 MB/s)
                    min: 24,000 (3.905 MB/s)
Speed   498.392 s   avg: 41,000 (6.670 MB/s)     1.0         1.0      1.0
                    max: 56,000 (9.110 MB/s)
                    min: 33,000 (5.369 MB/s)

Table 6.2: Overall statistics of the query "rainfall observed once an hour" run on the batch and speed layer with input speed throttling applied to the SRBench Bertha data set.

Figure 6.2 highlights the results of the batch layer (a, b) and the speed layer (c, d) with a throttled input speed. The throughput of the speed and the batch layer gradually fluctuates between 35,000 and 50,000 events/s. Both layers show similar patterns in the fluctuation of the throughput. The speed of the input throttle therefore imposes the same latency on both layers.

Node Failure

This experiment highlights the results of the batch and speed layer with node failures and input speed throttling (see Section 4.2). The node failure simulation was configured with an interval of 45 seconds and a re-alive time of 15 seconds. The probability of a node failure was set to 1 in order to guarantee at least one node failure every interval.

Table 6.3 shows the overall statistics of this experiment. In total, both layers show the same cluster health throughout the experiment. The batch layer finished 1.6 times faster than the speed layer. The throughput of the batch layer is 1,000 events/s slower than without node failures. The precision of the batch layer is 1.0 with a recall of 1.284. Therefore all retrieved results are precise, but the result set includes more results than the baseline, e.g. duplicates. The speed layer shows a slow processing speed of 17,000 events/s, which is 2.4 times slower than without node failures. In addition, the precision dropped to 0.987 with an even lower recall of 0.798.


Layer   Time          Throughput (events/s)        Precision   Recall   Health
Batch   658.817 s     avg: 40,000 (6.508 MB/s)     1.0         1.284    0.963
                      max: 304,000 (49.457 MB/s)
                      min: 0
Speed   1,040.746 s   avg: 17,000 (2.766 MB/s)     0.987       0.798    0.963
                      max: 549,000 (89.316 MB/s)
                      min: 0

Table 6.3: Overall statistics of the query "rainfall observed once an hour" run on the batch and speed layer with a throttled input stream and one node failure every 45-second interval for 15 seconds applied to the SRBench Bertha data set.

Figure 6.3 highlights the results of the batch layer (a, b) and the speed layer (c, d) with node failures. The throughput histogram of the speed layer shows a very irregular behavior in Figure 6.3c. Processing stops at the same time as a node failure occurs and resumes within a period of no node failures. During the short time the speed layer processes data, the throughput exceeds the throughput of the experiment without node failures and throttling. The input speed is throttled with regard to the system time as described in Section 4.2. Therefore the processing may catch up with the coordinator and proceed at the highest throughput possible. A node failure always affects the throughput of the computation, but the throughput does not increase every time a node recovers from its failure. It is possible that multiple periods with a health of 1.0 pass before the speed layer continues processing. The precision of the speed layer zigzagged slightly throughout the experiment, with sudden drops at the beginning of every node failure. The first two node failures had the biggest impact on the recall. The first node failure caused a steep decrease and the recall bottomed out at 0.12. The node failure between the 60th and 100th time window had a lower impact on the recall and a slower throughput. Node failures in the batch layer cause sudden drops in throughput followed by a steep increase. The throughput peaks at 300,000 events/s and is followed by a relatively long period of no throughput. This demonstrates the catching-up effect of the batch layer after a time of low throughput. Figure 6.3a shows two slightly bigger drops in health in periods with node failures. This is the surprising result of unplanned system failures during the experiment. The precision remains stable at 1.0 during the whole experiment. Figure 6.3b shows different effects caused by node failures. The first two node failures caused rapid fluctuation in the throughput, but did not affect the precision. The third node failure leads to a considerable increase of throughput and recall. The increase of the recall to a peak of over 1.6 indicates the appearance of duplicate results. This effect is further described in Section 7.1.3.



Figure 6.1: Results of the query "rainfall observed once an hour" applied to the data set SRBench Bertha without node failures. Panels (a) Batch Layer Throughput and (c) Speed Layer Throughput show the throughput/health histograms; panels (b) Batch Layer Precision, Recall, Health and (d) Speed Layer Precision, Recall, Health highlight precision, recall, health and throughput for each time window.



Figure 6.2: Results of the query "rainfall observed once an hour" applied to the data set SRBench Bertha with a throttled event input speed. Panels (a) Batch Layer Throughput and (c) Speed Layer Throughput show the throughput/health histograms; panels (b) Batch Layer Precision, Recall, Health and (d) Speed Layer Precision, Recall, Health highlight precision, recall, health and throughput for each time window.



Figure 6.3: Results of the query "rainfall observed once an hour" applied to the data set SRBench Bertha with a throttled event input speed and node failures. Panels (a) Batch Layer Throughput and (c) Speed Layer Throughput show the throughput/health histograms; panels (b) Batch Layer Precision, Recall, Health and (d) Speed Layer Precision, Recall, Health highlight precision, recall, health and throughput for each time window.


6.2.2 Broken Station Detection

This query tests the engine's capability to detect missing data points. A station is defined as broken when it suddenly stops producing data. Information about the stability of stations is an important factor that can be inferred from the absence of data. This query shows results similar to the query "rainfall observed once an hour"; therefore only the performance run and the node failure scenario are described further.

Performance Run

This experiment measures the results with the maximum possible input speed for the batch and speed layer. Table 6.4 highlights the overall key performance indicators of this experiment. In comparison with the query "rainfall observed once an hour", the speed layer is 1.4 times slower on the same data set, which indicates an increase in the complexity of the task. The batch layer shows similar results in both queries. Figure 6.4 shows the key performance indicators of the performance run.

Layer   Time        Throughput (events/s)         Precision  Recall  Health
Batch   174.057 s   avg: 115,000 (18.70 MB/s)     1.0        1.0     1.0
                    max: 141,000 (22.939 MB/s)
                    min: 14,000 (2.277 MB/s)
Speed   53.505 s    avg: 355,000 (57.76 MB/s)     1.0        1.0     1.0
                    max: 429,000 (69.794 MB/s)
                    min: 189,000 (30.748 MB/s)

Table 6.4: Overall statistics of the query ”broken station detection” run on the batch and speed layer without node failure simulation applied to the SRBench Bertha data set.

Node Failure

The node failure was tested using a throttled input source. Figure 6.5 highlights the key performance indicators of a throttled throughput applied to this query. In comparison with the query "rainfall observed once an hour", the throughput of the batch layer is almost twice as high and the recall over the result set exceeds 2; the batch layer thus processed the data set more than twice during this experiment, while all generated results were precise. The substantial increase of recall and throughput started after the second node failure (see figure 6.6a). The first node failure did not cause an increase of recall, but a slight rise in throughput. This is further analyzed in Section 7.1.3.


Layer   Time        Throughput (events/s)         Precision  Recall  Health
Batch   527.049 s   avg: 81,000 (13.17 MB/s)      1.0        2.178   0.963
                    max: 347,000 (56.453 MB/s)
                    min: 0
Speed   611.320 s   avg: 26,000 (4.23 MB/s)       0.999      0.970   0.965
                    max: 442,000 (71.909 MB/s)
                    min: 0

Table 6.5: Overall statistics of the query "broken station detection" run on the batch and speed layer with a throttled input stream and one node failure every 45 seconds lasting 15 seconds, applied to the SRBench Bertha data set.


Figure 6.4: Results of the query "broken station detection" applied to the data set SRBench Bertha without node failures. Figure (a) and (c) show the throughput/health histogram. Figure (b) and (d) highlight precision, recall, health and throughput for each time window.


Figure 6.5: Results of the query "broken station detection" applied to the data set SRBench Bertha with a throttled event input speed. Figure (a) and (c) show the throughput/health histogram. Figure (b) and (d) highlight precision, recall, health and throughput for each time window.


Figure 6.6: Results of the query "broken station detection" applied to the data set SRBench Bertha with a throttled event input speed and node failures. Figure (a) and (c) show the throughput/health histogram. Figure (b) and (d) highlight precision, recall, health and throughput for each time window.


6.3 DEBS Grand Challenge 2014 Data Set

Two queries were tested on the DEBS Grand Challenge 2014 data set: (i) load prediction and (ii) average load. The first query uses global storage in order to generate load predictions based on temporarily stored data. The latter query aggregates the load over a time window of 15 minutes.

6.3.1 Load Prediction

This query tests the ability of the architecture to aggregate events within time windows and to provide predictions of future load. These predictions rely on temporary data. The speed layer stores such temporary data in a high-performance key-value store described in Section 4.1.4, and the batch layer uses its internal local storage described in Section 4.1.2. Two different setups were used for the experiment: (i) a performance run shows the key performance indicators of this query with the maximum input speed and no node failures, and (ii) the node failure run shows the system's behavior when simulating node failures every hour for 5 minutes according to the mechanism explained in Section 4.5.2.

Performance Run

This experiment measures the maximum throughput possible with no node failure simulation. Table 6.6 shows the overall statistics of the batch and speed layer. On average the speed layer processed more than three times as many events per second as the batch layer, while both layers produced fully accurate results. The recall is also 1.0 for both layers: both layers processed all events and all corresponding results were precise.

Layer   Time           Throughput (events/s)        Precision  Recall  Health
Batch   46,901.208 s   avg: 86,000 (9.061 MB/s)     1.0        1.0     1.0
                       max: 223,000 (23.494 MB/s)
                       min: 0
Speed   14,534.837 s   avg: 278,000 (29.289 MB/s)   1.0        1.0     1.0
                       max: 505,000 (53.204 MB/s)
                       min: 26,000 (2.739 MB/s)

Table 6.6: Overall statistics of the query ”load prediction” run on the batch and speed layer without node failure simulation using the DEBS data set.

Figure 6.7 illustrates the performance run. The throughput histograms of the batch and speed layer show a stepwise decrease in throughput during the experiment; each decrease corresponds to the end of one partition. This is further analyzed in Chapter 7. The peak of the speed layer throughput is at 505,000 events/s, compared to the peak of the batch layer at 223,000 events/s. The speed layer shows four sequential decreases in throughput starting after 10,000 seconds and reaches its lowest point after 12,000 seconds. The batch layer shows a similar throughput curve after 30,000 seconds with three sequential drops.

Node Failure

The node failure simulation shows the results of the batch and speed layer with one node failure every hour and a recovery time of 5 minutes. This experiment tests the engine's ability to recover from node failures and shows their influence on the accuracy of the results. Table 6.7 shows the overall results of this experiment. The average throughput of the speed layer is similar to the performance run above; the throughput of the batch layer, however, is higher than in the performance run. In addition, the batch layer shows a recall of more than 1.0 and a precision of 1.0: the batch layer produced more results than the baseline, and all results were precise. Figure 6.8a shows a considerable increase of throughput after certain node failures, while other node failures cause only slight fluctuations. This is further analyzed in Section 7.1.3. Figure 6.8b shows a sharp increase of recall and throughput after 250 and 2,300 time windows, which matches the occurrence of the first node failure. The figure shows a decrease of health only in the first 10 time windows; afterwards health remains stable around 0.998. Therefore almost all time windows were calculated while at least one node was not available. This effect occurred due to the partitioning mechanism, which does not split the data set into equally sized buckets as described in Section 5.3.5.

The speed layer finished the experiment 3.4 times faster than the batch layer, but the precision dropped to 0.411 with a recall of 0.41. Figure 6.7d shows the similarity of the precision and recall curves; the curves are aligned except for a few fluctuations in the precision. This experiment tested the engine's ability to predict the load of a house, so it is important to analyze the variance in addition to the precision and recall. Figure 6.9 shows the average variance, as defined in Section 6.1, for each time window; the recall only includes results whose difference to the baseline is lower than the average variance within the time window. The variance shows only slight fluctuations between 0.0 and 0.05, with two exceptions. The results were therefore not accurate, but an average variance of 0.0169 indicates only a small difference to the baseline.
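To make the variance-based filtering used in figure 6.9 concrete, the sketch below computes, for one time window, the variance of the differences between the produced results and the baseline, and a recall that only counts results whose difference stays below the window's average variance. The class and method names are hypothetical, and the computation only approximates the metric definition of Section 6.1.

```java
import java.util.List;

/** Illustrative per-window KPI computation; names and exact metric are simplifications, not the evaluation code. */
public class WindowKpiSketch {

    /** Variance of the absolute differences between produced results and the baseline of one time window. */
    static double varianceOfDifferences(List<Double> produced, List<Double> baseline) {
        int n = Math.min(produced.size(), baseline.size());
        if (n == 0) return 0.0;
        double mean = 0.0;
        for (int i = 0; i < n; i++) mean += Math.abs(produced.get(i) - baseline.get(i));
        mean /= n;
        double var = 0.0;
        for (int i = 0; i < n; i++) {
            double d = Math.abs(produced.get(i) - baseline.get(i)) - mean;
            var += d * d;
        }
        return var / n;
    }

    /** Recall that only counts results whose difference to the baseline stays below the window's average variance. */
    static double varianceFilteredRecall(List<Double> produced, List<Double> baseline, double avgVariance) {
        int n = Math.min(produced.size(), baseline.size());
        int accepted = 0;
        for (int i = 0; i < n; i++) {
            if (Math.abs(produced.get(i) - baseline.get(i)) <= avgVariance) accepted++;
        }
        return baseline.isEmpty() ? 0.0 : (double) accepted / baseline.size();
    }
}
```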


Layer   Time           Throughput (events/s)        Precision  Recall  Variance  Health
Batch   51,350.092 s   avg: 117,000 (12.327 MB/s)   1.0        1.619   0.0       0.998
                       max: 323,000 (34.029 MB/s)
                       min: 8,000 (0.842 MB/s)
Speed   14,728.679 s   avg: 275,000 (28.973 MB/s)   0.411      0.41    0.0169    0.998
                       max: 527,000 (55.522 MB/s)
                       min: 0

Table 6.7: Overall statistics of the query "load prediction" run on the batch and speed layer with node failure simulation using the DEBS data set.


Figure 6.7: Results of the query "load prediction" applied to the data set DEBS without node failures. Figure (a) and (c) show the throughput/health histogram. Figure (b) and (d) highlight precision, recall, health and throughput for each time window.


Figure 6.8: Results of the query "load prediction" applied to the data set DEBS with node failures. Figure (a) and (c) show the throughput/health histogram. Figure (b) and (d) highlight precision, recall, health and throughput for each time window.


Figure 6.9: KPI of the query "load prediction" applied to the DEBS data set with node failures. In contrast to figure 6.8d this graph shows the variance, and the recall excludes only values higher than the average variance of a time window from the result set.


6.3.2 Average Load

This query tests the engine's ability to aggregate events within a time window of 15 minutes. In comparison with the query "load prediction" described in Section 6.3.1, this query does not rely on global state to calculate results, so errors caused by stored data are not possible. In addition, this query does not involve lookups to a key-value store for each time window. First, the performance run experiment is presented, which computes the results at the maximum possible throughput of the architecture. Second, the results of this query with node failures are described. The focus of this section lies in highlighting the differences between the results of the queries "load prediction" and "average load".

Performance Run

Table 6.8 highlights the overall statistics of this experiment. The performance run shows results of the batch layer similar to the results of the query "load prediction". The speed layer is slightly faster than for the query "load prediction" and shows a higher average throughput. Figure 6.10 illustrates the KPIs of this experiment. The same decline in throughput is observable as highlighted in figure 6.7. Precision and recall are 1.0 for both the batch and the speed layer. Although this query does not include a lookup of temporarily stored values, the results are similar to those of the query "load prediction".

Layer   Time           Throughput (events/s)        Precision  Recall  Health
Batch   47,618.753 s   avg: 85,000 (8.955 MB/s)     1.0        1.0     1.0
                       max: 185,000 (19.490 MB/s)
                       min: 4,000 (0.421 MB/s)
Speed   13,455.517 s   avg: 301,000 (31.712 MB/s)   1.0        1.0     1.0
                       max: 518,000 (54.574 MB/s)
                       min: 45,000 (4.741 MB/s)

Table 6.8: Overall statistics of the query ”average load” run on the batch and speed layer without node failure simulation using the DEBS data set.

Node Failure

The node failure simulation, highlighted in table 6.9 and figure 6.11, shows results similar to the query "load prediction". Contrary to the "load prediction" query, this query does not include temporarily stored data, and therefore errors in results are not propagated over multiple windows. However, the precision and recall of the speed layer improved only marginally. The variance curve in figure 6.12 fluctuates only slightly between 0.0 and 0.01, compared to figure 6.9.


Layer   Time           Throughput (events/s)        Precision  Recall  Variance  Health
Batch   47,371.585 s   avg: 114,000 (12.600 MB/s)   1.0        1.625   0.0       0.998
                       max: 291,000 (30.658 MB/s)
                       min: 30,000 (3.161 MB/s)
Speed   13,615.096 s   avg: 297,000 (31.290 MB/s)   0.419      0.418   0.0161    0.998
                       max: 521,000 (54.890 MB/s)
                       min: 0

Table 6.9: Overall statistics of the query ”average load” run on the batch and speed layer with node failure simulation using the DEBS data set.


Figure 6.10: Results of the query "average load" applied to the data set DEBS without node failures. Figure (a) and (c) show the throughput/health histogram. Figure (b) and (d) highlight precision, recall, health and throughput for each time window.


Figure 6.11: Results of the query "average load" applied to the data set DEBS with node failures. Figure (a) and (c) show the throughput/health histogram. Figure (b) and (d) highlight precision, recall, health and throughput for each time window.


Figure 6.12: KPI of the query "average load" applied to the DEBS data set with node failures. In contrast to figure 6.11d this graph shows the variance, and the recall excludes only values higher than the average variance of a time window from the result set.


6.3.3 Eventual Accuracy

The lambda architecture provides an eventual accuracy guarantee: an inaccurate result of the speed layer will eventually be replaced by the correct result of the batch layer. The two queries tested with node failure simulation described above are well suited to analyze this mechanism. Figure 6.13 shows the time it takes the batch layer to replace the possibly inaccurate results of the speed layer (time-to-precision) for both queries. Time-to-precision includes all results and makes no distinction between inaccurate and accurate ones; the batch layer will eventually replace all results, because there is no exact way to know which ones are inaccurate before the batch layer starts computing. The skewed partitions of the DEBS data set do not have an impact on the eventual accuracy, because the speed layer and the batch layer operate on the same data set and therefore the same preconditions are imposed on both layers. The steady decrease in performance described in Section 6.3.1 is reflected in both layers, and the time-to-precision is measured in relative terms. The time-to-precision increases linearly with the response time of the speed layer. The impact of node failures is clearly observable in both queries: the time-to-precision curve increases slightly faster in scenarios with node failures, as highlighted by the linear regression analysis in figure 6.13. The result is further described for the query "load prediction". More than 4 billion events were processed within 4.09 hours. The produced results include 59.9% inaccurate results with an average variance of 0.0169, and in total only a very small fraction of the results was missing (the recall is 0.001 smaller than the precision). It took an additional 10.17 hours to replace all inaccurate results with the accurate results of the batch layer and to include the results that were not computed by the speed layer.
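The time-to-precision measurement can be illustrated with a minimal sketch: for each result identifier it is the delay between the speed layer first emitting its possibly inaccurate value and the batch layer overwriting it. The class below is purely illustrative and does not correspond to the evaluation tooling.

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal sketch of the time-to-precision measurement (hypothetical names). */
public class TimeToPrecisionSketch {

    // Timestamp (ms) at which the speed layer emitted a result, keyed by result id.
    private final Map<String, Long> speedEmitTime = new HashMap<>();

    public void onSpeedResult(String resultId, long emitTimeMs) {
        // Keep the first emission only; later speed-layer updates do not reset the clock.
        speedEmitTime.putIfAbsent(resultId, emitTimeMs);
    }

    /** Returns the time-to-precision in ms, or -1 if the speed layer never produced this result. */
    public long onBatchResult(String resultId, long overwriteTimeMs) {
        Long emitted = speedEmitTime.get(resultId);
        return emitted == null ? -1 : overwriteTimeMs - emitted;
    }
}
```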

Figure 6.13: Highlights the eventual accuracy promise of the lambda architecture measured on the DEBS data set. The x-axis measures the response time of the speed layer to produce results. The y-axis shows the time-to-precision, i.e. the time it takes the batch layer to overwrite the possibly inaccurate results of the speed layer. Panels: (a) query "load prediction" performance run, (b) query "load prediction" node failure, (c) query "average load" performance run, (d) query "average load" node failure.

7 Discussion

The discussion interprets the results of the experiments and compares the concepts of the designed architecture to the lambda architecture proposed by Marz [44]. The speed layer computes results with high throughput, but these become inaccurate when node failures are introduced into the system. The batch layer, on the other hand, provides accurate results that are not diminished by node failures. The results described in Chapter 6 show that the combination of these capabilities is efficient and demonstrate the trade-off imposed by the guarantee of eventual accuracy.

7.1 Batch Layer

The results described in Chapter 6 demonstrate the reliability of the batch layer in processing different tasks and data sets. Throughout all experiments the precision and recall of the batch layer results did not drop below 1.0; however, introducing node failures into the system caused the recall to rise above 1.0. The following discussion highlights three key aspects. First, the message delivery guarantee of the batch layer is described and its impact on the recall is analyzed. Second, the micro-batch processing mechanism is analyzed and its shortcomings are compared to the recomputation approach. Third, the reaction to node failures and the recovery method are discussed.

7.1.1 Message Delivery Guarantee

The batch layer relies on the stream processing system Samza and its deeply integrated messaging queue Kafka. Both Kafka and Samza provide only an at-least-once message delivery guarantee; as a consequence, duplicate events may be sent to the computing task. The results described in Chapter 6 show a sudden increase of recall after node failures, but not every node failure caused such an increase. This points to two different effects in Samza: (i) the application master suffers a timeout, or (ii) computing tasks are destroyed. The latter case results in a restart of the corresponding tasks and a recovery to the last committed checkpoint. This causes duplicate results if the node fails after a result is stored but before the checkpoint is committed. Samza does not provide a recovery mechanism for this scenario. Tanenbaum and Van Steen [54, chap. 8.5] propose a three-phase commit mechanism to solve such problems.

YARN will restart the application master of Samza in case it is destroyed due to node failures, but YARN only recognizes such a failure after a certain timeout. Until then, each task that is still alive continues to produce results. When the application master is restarted, it stops all running tasks and submits new processing tasks. This leads back to the problem described above, where a task is killed before it commits the latest checkpoint. In addition, Samza will not use the latest committed checkpoint of each partition to restart a task, but rolls back further to make sure no data is lost during a node failure. A three-phase commit mechanism can reduce the probability of duplicate results. However, Samza's recovery strategy for the application master will always cause duplicate results, and therefore a duplicate result check is required. A distributed storage with eventual consistency, such as the MongoDB configuration used during the experiments, does not provide a safe environment to prevent storing duplicate entries.
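One pragmatic duplicate result check is to derive a deterministic identifier from query, partition and time window and to write results with an upsert on that identifier, so that a replayed computation overwrites rather than duplicates the stored value. The sketch below illustrates this idea with a hypothetical result store interface; it is not the implemented batch layer, and whether such upserts are safe under the eventually consistent MongoDB configuration used in the experiments would still have to be verified.

```java
/** Hypothetical sketch of idempotent result writes keyed by a deterministic id. */
public class IdempotentResultWriter {

    /** Minimal store abstraction; a real implementation could map this to an upsert by id. */
    public interface ResultStore {
        void upsert(String resultId, double value);
    }

    private final ResultStore store;

    public IdempotentResultWriter(ResultStore store) {
        this.store = store;
    }

    /** Deterministic id: replaying the same (query, partition, window) always targets the same record. */
    static String resultId(String query, int partition, long windowStartMs) {
        return query + ":" + partition + ":" + windowStartMs;
    }

    public void write(String query, int partition, long windowStartMs, double value) {
        // A message replayed after a checkpoint rollback overwrites the existing record
        // instead of inserting a duplicate.
        store.upsert(resultId(query, partition, windowStartMs), value);
    }
}
```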

7.1.2 Micro Batch Processing

This architecture uses micro-batch processing, in contrast to the proposal of Marz [44] to use a one-time batch process to compute the results of the batch layer. The latency until a result based on a certain event becomes available may decrease when using micro-batch processing compared with a one-time batch process. Although all results produced by the batch layer are accurate, new problems arise, such as duplicate results. The method to be selected for the batch layer strongly depends on the constraints of the computational task. If the batch layer relies on data with stream imperfections, such as delayed data, the micro-batch processing method not only has to provide a recovery mechanism in case of failure, but also has to roll back in order to recalculate results based on delayed data. Stream imperfections are a limitation of this architecture and are further described in Section 8.4. The experiments showed that the micro-batch processing method delivers accurate results for the applied queries, but the recomputation approach would improve the system's capability to handle duplicate results and delayed messages.

7.1.3 Node Failure Recovery

YARN manages the detection of and recovery from node failures. A node failure is detected when no heartbeat is received within a configurable timeout. In case of failure, YARN negotiates new resources with the application master, or, if the application master itself failed, YARN restarts the application master. By default YARN sets such timeouts in the magnitude of minutes, and timeouts below 5 minutes cause cluster instability. A system with low-latency constraints has to enforce short timeouts in order to detect node failures quickly. The batch layer of this architecture does not impose such a constraint and therefore the timeouts are not optimized.

Every node failure leads to a considerable decrease in throughput (see figures 6.8a, 6.11a, 6.3a and 6.6a). A node failure not only stops the computing tasks running on the failing node, but also stops the Kafka broker. The impact of node failures on a Kafka cluster is explained in Section 4.5.1. If a Kafka broker fails, the Kafka cluster needs time to possibly re-elect new leaders for partitions or to decide to skip a replication broker until it becomes available again. This not only impacts the tasks that failed due to the node failure, but also influences the tasks running on healthy nodes. As described in Section 5.2, Kafka was started on 7 nodes and configured with a replication factor of 3; therefore one node failure impacts at least two other nodes and leads to a short timeout of over 40% of the Kafka cluster. All tasks are stopped and restarted if the failing node was running the application master of Samza. After a node failure and the corresponding drop in throughput, the batch layer may catch up with the produced data and the throughput increases dramatically. Figure 6.8a shows node failures followed by an increase of throughput as well as node failures without any impact on throughput. In case a message consumer is destroyed, but the producer can still send messages at the same speed, the consumer lags behind and may catch up after a restart.
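As an illustration of the timeout trade-off, the snippet below lowers the NodeManager liveness timeout through Hadoop's Configuration API. The property name follows the Hadoop 2.x defaults (yarn.nm.liveness-monitor.expiry-interval-ms, 600,000 ms by default); as noted above, values well below 5 minutes caused cluster instability in this setup, so this is a sketch of the trade-off rather than a recommended setting.

```java
import org.apache.hadoop.conf.Configuration;

/** Sketch: lowering YARN's node-liveness timeout to detect failed NodeManagers earlier. */
public class YarnTimeoutSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Default is 600,000 ms (10 minutes); shorter timeouts detect failures faster,
        // but values well below ~5 minutes destabilized the cluster in the experiments.
        conf.setLong("yarn.nm.liveness-monitor.expiry-interval-ms", 300_000L);
        System.out.println(conf.getLong("yarn.nm.liveness-monitor.expiry-interval-ms", 600_000L));
    }
}
```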

7.2 Speed Layer

The speed layer is optimized for real-time computing and does not impose any message delivery guarantees. In particular, messages lost during node failures or other unexpected behavior are not recoverable. The results of the experiments show a considerably higher throughput in the speed layer than in the batch layer. The batch layer relies on persistent messaging and failure recovery, while the speed layer discards messages during failures and is not subject to the overhead of storing each message persistently and managing recovery points. As a result, node failures cause a loss of accuracy in the results. The damage caused by the loss of messages depends strongly on the computational task. This section further analyzes the quality of service and the node failure recovery.

7.2.1 Quality of Service

The maximum throughput of the speed layer over all experiments is 684,000 events/s, as illustrated in figure 6.1c, and even at this throughput the speed layer produced accurate results. However, the throughput of the speed layer is not constant across experiments: the query "rainfall observed once an hour" shows an average throughput of 521,000 events/s, while the query "broken station detection" only reaches an average of 355,000 events/s during the performance runs. The possible throughput strongly correlates with the complexity of the computational task. In order to apply QoS constraints to the speed layer, the impact of the computational task has to be considered carefully. Also, unexpected failures in the computation nodes may lead to different outcomes regarding the precision and recall of the result. The comparison of these KPIs between the DEBS and SRBench data sets shows that the latter experiments were less influenced by failures with regard to precision and recall: the precision curve in figure 6.3d fluctuates between 0.82 and 1.0, while a node failure during the query "load prediction" applied to the DEBS data set initiates a steady decrease of precision.

73 74 CHAPTER 7. DISCUSSION

The query "rainfall observed once an hour" generates a boolean result that is either true or false, while the "load prediction" query generates a floating-point result. The latter was further analyzed using the variance of the difference to the correct result, as illustrated in figure 6.9. The average variance of the results is 0.0169 and indicates only a very small imprecision over all time windows. If the computational task allows for higher latency, the speed layer could be enhanced to provide a certain degree of recovery guarantees. The current implementation of the speed layer limits the QoS constraints to the maximum throughput possible with no message delivery guarantee. However, this architecture can be extended to allow fine-grained QoS constraints. To leverage the real-time computing aspect of the speed layer, it is possible to let the speed layer only process events within a certain latency constraint and discard any event that exceeds this latency. There are two options for such a behavior: (i) the in-memory queue of the coordinator can be extended to accept a retention policy for messages, similar to the zeromq implementation, or (ii) the spout embeds a mechanism to only forward messages within an adequate retention policy.1 A minimal sketch of option (ii) is given below. Such an implementation strongly depends on the QoS constraints of the computational use case and therefore needs further analysis regarding its impact on the precision of the results.

The precision and recall of the query "load prediction" tested under node failures show a consistent decline (see figure 6.8c). Marz's [44] proposal of the lambda architecture includes a serving layer, which integrates the results from the batch and speed layer and provides a way to access results from the batch layer with low latency. The serving layer was not implemented in this architecture, but could lead to better results for the query "load prediction". This query takes data from the past three days into account to generate the prediction of the load; hence the results of both layers could be accessed in order to generate more precise results. This was not implemented in the experiments, since they were run in an offline processing fashion with the highest throughput possible and not according to the actual timestamps of the data. When dealing with real-time data, the introduction of an additional serving layer that allows for fast access to the results of the batch layer may produce better results, provided the batch layer can keep pace with the necessary computational requirements.

Section 6.3.3 shows the effect of eventual accuracy: the inaccurate results from the speed layer are eventually replaced by the precise results of the batch layer. Throughout the experiments the time-to-precision increased steadily. A possible hypothesis is that the time-to-precision increases in the beginning and at a certain point evens out to reflect the performance difference between the batch and speed layer. However, the time-to-precision does not even out, but rather increases linearly with the response time of the speed layer. The experiments with node failure simulation show a slightly faster increase in time-to-precision due to the recovery mechanism of the batch layer, which first rewinds back to the last checkpoint and then continues processing.

1http://zeromq.org/
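The following sketch outlines option (ii): a forwarding component that drops events older than a configurable latency bound. The names are hypothetical and do not correspond to the implemented spout; an actual Storm spout would additionally have to implement the IRichSpout contract.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Hypothetical latency-bounded forwarder: events older than maxLatencyMs are dropped. */
public class LatencyBoundedForwarder {

    static final class Event {
        final long createdAtMs;
        final byte[] payload;
        Event(long createdAtMs, byte[] payload) {
            this.createdAtMs = createdAtMs;
            this.payload = payload;
        }
    }

    private final BlockingQueue<Event> queue = new LinkedBlockingQueue<>();
    private final long maxLatencyMs;

    LatencyBoundedForwarder(long maxLatencyMs) {
        this.maxLatencyMs = maxLatencyMs;
    }

    void offer(Event e) {
        queue.offer(e);
    }

    /** Forwards the next event if it is still within the latency bound, otherwise drops it. */
    void drainOnce(Consumer<Event> downstream) throws InterruptedException {
        Event e = queue.take();
        if (System.currentTimeMillis() - e.createdAtMs <= maxLatencyMs) {
            downstream.accept(e);
        }
        // else: silently dropped, mirroring the speed layer's best-effort semantics
    }
}
```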


7.2.2 Node Failure Recovery

Node failure simulations were run in two different flavors as described in Section 5.4. Experiments run with the SRBench data set show a fast drop in throughput and only limited capabilities to recover from node failures. Figure 6.6c further illustrates this behavior and also shows that the speed layer was not able to recover from a node failure for more than 100 seconds. The nimbus service of Storm is responsible for detecting node failures and possibly starting the affected tasks on another machine; it uses a different implementation than the corresponding component of the batch layer. The running topology freezes after a node failure, and within a time range of 10 to 30 seconds the tasks of a topology are reshuffled to different nodes. Storm not only reshuffles the tasks of the failed machine, but takes all tasks into account. The same behavior can be observed when the node becomes available again, which explains the throughput histograms 6.6c and 6.3c. The speed layer shows a higher processing time than the batch layer in tables 6.3 and 6.5 due to the topology freeze after a change in the cluster. The node failure experiments applied to the DEBS data set show different results than the SRBench data set: these experiments were run with a longer interval between node failures, so the topology freeze did not impact the throughput as severely, but the node failures cause a noticeably lower precision and recall, as described in Section 7.2.1.

7.3 Data Partitioning

The data partitioning using a hash bucket algorithm, as described in Section 5.3.5, causes skewed partitions in the DEBS data set. The throughput histograms in Sections 6.3.1 and 6.3.2 show a sequential decrease of throughput during the experiment due to these skewed partitions. The partitioning key of the DEBS data set was defined according to the query to solve: the partitioning method divided the data set into 8 partitions keyed by the house number, which ranges from 0 to 40. The SRBench data set shows a better result with this partitioning method; it was divided into 8 partitions keyed by the station name, and each data set holds the data of hundreds of different stations.
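The skew can be reproduced with the plain hash-modulo assignment sketched below: with only 41 distinct house identifiers spread over 8 partitions, some buckets necessarily receive more keys, and more data, than others, whereas the hundreds of SRBench station names smooth the distribution. The snippet is illustrative and simplifies the hash bucket algorithm of Section 5.3.5.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative hash-modulo partitioning: few distinct keys lead to skewed buckets. */
public class PartitionSkewSketch {
    public static void main(String[] args) {
        int numPartitions = 8;
        Map<Integer, Integer> keysPerPartition = new HashMap<>();
        for (int house = 0; house <= 40; house++) {            // DEBS: 41 house ids
            int partition = Math.floorMod(Integer.hashCode(house), numPartitions);
            keysPerPartition.merge(partition, 1, Integer::sum);
        }
        // With 41 keys over 8 partitions one bucket gets 6 keys and the others 5,
        // and since houses emit very different event volumes the data skew is larger still.
        System.out.println(keysPerPartition);
    }
}
```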


8 Limitations

This chapter describes the limitations of this architecture based on the concepts described in Chapter 4 and the results presented in Chapter 6. Each identified limitation is discussed in the following sections: first, the limitation is identified and its importance is highlighted; second, the nature of the limitation is analyzed and the choices that caused it are justified; third, a suggestion to overcome the limitation is provided.

8.1 Single Point of Failure

The architecture has a single point of failure in both the speed layer and the batch layer. In both cases the master node of the orchestration service is affected, but the two layers react differently to its failure. The batch layer relies on the YARN resource manager (see Section 4.1.1) and the corresponding master service, called resource manager. The community recently introduced high availability features for the resource manager, but due to the tight integration within the architecture an update was not possible within the time frame of this thesis.1 A failure of the node that runs the resource manager would provoke a complete shutdown of all computing tasks of the batch layer. The speed layer relies on Storm's internal orchestration service called nimbus. Nimbus currently does not provide high availability features, but it follows a different approach than the YARN resource manager: in case the node running nimbus is unavailable, all worker nodes continue processing and wait for the nimbus service to become active again. However, during this time the computing tasks are not managed by the nimbus service, and in case of a second node failure the affected tasks will not be restarted. The community is currently working on a solution to overcome this limitation.2 Due to the lack of high availability features in YARN and nimbus, the master node was excluded from the node failure simulations. Hence, the ZooKeeper service was not run in a high availability quorum, although it offers these features; running ZooKeeper in a high availability setup could further influence the performance measurements.

1https://issues.apache.org/jira/browse/YARN-149
2https://issues.apache.org/jira/browse/STORM-166

8.2 Concurrent Node Failures

The experiments did not include simulations of concurrent node failures. The batch layer, for example, could produce inaccurate results in case of multiple node failures at the same time. A theoretical discussion is provided for the batch layer in Section 4.3.4, for the speed layer in Section 4.4.2 and for the coordination layer in Section 4.5.1. However, this discussion is based on the promises of the underlying services and is not backed by experiments. The design of the node failure simulation discussed in Section 4.5.2 enables clients to specify the number of concurrent node failures to simulate. The architecture therefore provides the capabilities to further investigate concurrent node failures, but it was not possible to present such results within the time frame of this thesis.

8.3 Partitioning

The following assumptions were made regarding the partitioning of the data as described in Section 5.3.5: (i) the highest degree of parallelism throughout all experiments is 8, (ii) the number of partitions, and therefore the degree of parallelism, cannot be changed at runtime, and (iii) the data sets are always partitioned using a constant hashing algorithm. The degree of parallelism noticeably influences the possible throughput in the experiments; in particular, the message queue of the batch layer shows a considerable decrease of performance for each additional partition.3 Due to time restrictions and resource limitations, 8 was chosen as the maximum degree of parallelism. However, the design of the architecture is not bound to a certain degree of parallelism. The number of partitions cannot be changed at runtime, because the underlying stream processing system Samza distributes tasks according to the partitioning scheme during bootstrap. An increase of partitions at runtime is desirable in order to scale horizontally when the data volume increases. Currently the only solution to this problem is to restart the computation job. This causes higher latency in the batch layer, but since the batch layer always lags behind the speed layer, this problem is minor. The data partitioning method does not guarantee an equal distribution of data and therefore influences the scalability of the computational tasks, as described in Section 7.3. The partitioning algorithm described in Section 5.3.5 enables clients to specify a partitioning key and abstracts the partitioning logic throughout the components of the architecture. The method was tested with the SRBench data set, but not with the DEBS data set. There is no one-size-fits-all solution to the problem of data partitioning, and pre-studies are necessary to determine an appropriate method. Hence the partitioning component of this architecture should enable clients to provide their own data-specific implementation to use throughout all modules of this architecture; a possible shape of such an interface is sketched below.

3A pre-study based on the SRBench Bertha data set measured the following throughput when producing messages, depending on the number of partitions: 1 partition 50,000 events/s, 8 partitions 20,000 events/s and 20 partitions 1,000 events/s. Kafka has to keep a file handle open for each partition it manages. If one Kafka broker manages multiple partitions, the corresponding file handles are always open and only allow sequential writes for certain batch sizes.

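Such a client-provided strategy could take the shape of a small interface along the lines of the sketch below; the interface and names are hypothetical and indicate a possible direction rather than the existing implementation.

```java
/** Hypothetical client-pluggable partitioning strategy for the ingestion components. */
public interface PartitioningStrategy<K> {

    /** Maps a record key to a partition in [0, numPartitions). */
    int partition(K key, int numPartitions);

    /** Default strategy: the constant hashing used in the experiments. */
    static <K> PartitioningStrategy<K> hashModulo() {
        return (key, numPartitions) -> Math.floorMod(key.hashCode(), numPartitions);
    }
}
```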

8.4 Stream Imperfections

Conventional database systems allow queries to span the full data set, whereas stream-oriented systems process data as it flows through the system, so events may arrive delayed or out of sequence. The batch layer of this architecture computes results using an incremental algorithm that processes new incoming data in a stream-oriented fashion and, as described in Section 7.1.2, currently discards delayed events. The data sets used to run the experiments were sorted by time and did not include any delayed events, but possible use cases for this architecture may include delayed data. Depending on the computational task, a delayed message may be processed without access to historical data. For example, the query "rainfall observed once an hour" can include a delayed message in its result set, whereas the query "load prediction" has to access the full time window of the delayed message in order to provide accurate predictions. This requires a bookkeeping mechanism to roll back to the messages of the corresponding time window. Alternatively, the serving layer may be used to store all relevant data needed to provide predictions. For example, a query that calculates an average value over a 1-hour time window could store the sum and count of events and would therefore allow updates in case of delayed messages. However, this solution has to be further assessed with regard to the volume of storage necessary to produce results.
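The sum-and-count idea for the average query can be sketched as follows: keeping both values per window makes the aggregate updatable when a delayed event arrives, at the cost of retaining per-window state. The class below is an illustration of this idea, not part of the implemented batch layer.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative updatable average per time window; late events simply update sum and count. */
public class UpdatableAverageSketch {

    private static final class SumCount {
        double sum;
        long count;
    }

    private final long windowSizeMs;
    private final Map<Long, SumCount> windows = new HashMap<>();

    public UpdatableAverageSketch(long windowSizeMs) {
        this.windowSizeMs = windowSizeMs;
    }

    /** Works for on-time and delayed events alike, as long as the window state is retained. */
    public void addEvent(long eventTimeMs, double value) {
        long windowStart = eventTimeMs - (eventTimeMs % windowSizeMs);
        SumCount sc = windows.computeIfAbsent(windowStart, w -> new SumCount());
        sc.sum += value;
        sc.count++;
    }

    public double average(long windowStartMs) {
        SumCount sc = windows.get(windowStartMs);
        return sc == null || sc.count == 0 ? Double.NaN : sc.sum / sc.count;
    }
}
```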


9 Future Work

This thesis provides an implementation design of the speed and batch layer of the lambda architecture and shows the corresponding key performance indicators for a selected set of use cases. Further research is necessary in order to qualify and quantify the eventual accuracy guarantee of this architecture. A selected list of possible future work is highlighted below.

The experiments in Chapter 5 show the trade-off of the eventual accuracy guarantee provided by this architecture. It is necessary to investigate this promise further to better understand its impact. A desirable outcome includes the time-to-precision for different levels of accuracy. A possible hypothesis is that with higher accuracy in the speed layer the time-to-precision would decrease, due to the impact of fault recovery on the efficiency of the speed layer.

The batch layer is designed to use an incremental algorithm to process new data in batches. Recent work shows the relation of batch size to performance and introduces a mechanism to dynamically adjust the batch size according to the observed workload [23]. The model is based on one producer and one consumer communicating through a messaging queue with batch support. Its adjustment algorithm is based on fixed-point iteration and copes with changing workload, noise and unpredictable operational problems. The dynamic adjustment of the batch size is shown to reduce latency in their experimental design (a simplified sketch of such a feedback-driven adjustment is given at the end of this chapter). However, it is necessary to further analyze the impact of the batch size on latency in a distributed environment with more processing steps and multiple machines, because such a scenario involves more network activity and affects latency considerably. The batched message processing paradigm also requires further research on scalability. As highlighted in Section 7.1.3, the batch layer rolls back to the latest recovery point in case of failure and starts reprocessing. In case of failures in the application master, the recovery point is set further back than the last checkpoint to ensure precise results. In such a scenario it would be possible to continue processing from the last known checkpoint and to leverage the scalability of the system by starting new tasks to recompute the data further back than the latest checkpoint.

A theoretical discussion of the scalability of the coordination, batch and speed layer is provided in Chapter 4, but it is yet unclear how throughput and latency are affected by increasing or decreasing the following properties: (i) number of machines, (ii) number of partitions, (iii) number of computational tasks, (iv) replication factor and (v) size of the messages. For example, small messages influence the performance considerably due to the operational overhead of the systems, and the replication factor increases the bandwidth used to coordinate between different machines.

The speed layer of this architecture processes messages with the maximum throughput possible and delivers precise results if no node failures are involved. Certain use cases and the respective volume of data further constrain the speed layer to guarantee results within a certain maximum latency. Such a scenario requires further control mechanisms in the speed layer, as discussed in Section 7.2.1. This architecture provides the possibility to simulate bursts in data as described in Section 4.2.1 and controls the data flow of the incoming messages.
However, more control mechanisms are needed in order to set fine-grained QoS constraints on messages and the corresponding computational results. Lohrmann et al. [41] analyzed common design patterns of open source stream processing systems with regard to the QoS goal of latency; in particular, the output buffers and the thread/process model influence the latency of the studied systems. They introduce a model to trade throughput in favor of latency by introducing two new techniques: (i) adaptive output buffer sizing and (ii) dynamic task chaining. It allows clients to specify upper latency constraints for critical series of vertices and edges within the processing graph. These constraints are supplied during the bootstrap of the task and do not adapt to changes in the user's needs. Storm, for example, does not allow the user to define a maximum latency for a computational task; instead, the user has to fine-tune configuration options that impose these requirements. Work is in progress to automatically adjust configuration parameters in order to provide lower latency.1 The lambda architecture imposes further constraints regarding latency, and messages may be dropped in favor of fast response times. The real-time processing system should be able to react to changing requirements and allow for more fine-grained control to optimize the trade-off between latency and accuracy. Latency is not only influenced by the output buffer size and the thread/process model, but also by the network layout and the distribution of tasks. The proposed architecture leverages different services and frameworks with loose coupling. The producer/consumer pattern is used to synchronize these services, but improvements on the communication channels between the services are necessary. For example, location awareness is a desirable property for the distribution of tasks: a computing task of the batch layer should be located physically close to the Kafka leader of the partition it consumes from. Communication within services is not optimized either; for example, a Storm topology with multiple computing steps may distribute the tasks of one pipeline on different machines, leading to higher latency through network communication. Fischer et al. [29] propose a graph-partitioning based method to minimize the total number of messages sent over the network within a Storm topology. To minimize the network overhead of the lambda architecture, the task scheduling of the speed layer and the interfaces to other services are equally important.

1https://issues.apache.org/jira/browse/STORM-31
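To make the direction of dynamic batch sizing more tangible, the sketch below adjusts the batch size with a simple damped feedback rule toward a latency target. It is a strong simplification of the fixed-point iteration proposed in [23] and uses hypothetical names throughout.

```java
/** Simplified adaptive batch sizing: scale the batch toward a processing-latency target. */
public class AdaptiveBatchSizer {

    private final long targetLatencyMs;
    private final int minBatch;
    private final int maxBatch;
    private int batchSize;

    public AdaptiveBatchSizer(long targetLatencyMs, int initialBatch, int minBatch, int maxBatch) {
        this.targetLatencyMs = targetLatencyMs;
        this.batchSize = initialBatch;
        this.minBatch = minBatch;
        this.maxBatch = maxBatch;
    }

    /** Call after each processed batch with its observed processing time; returns the next batch size. */
    public int onBatchProcessed(long observedLatencyMs) {
        // Multiplicative adjustment toward the target, damped to avoid oscillation.
        double ratio = (double) targetLatencyMs / Math.max(1, observedLatencyMs);
        double damped = 1.0 + 0.5 * (ratio - 1.0);
        batchSize = (int) Math.max(minBatch, Math.min(maxBatch, Math.round(batchSize * damped)));
        return batchSize;
    }
}
```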

10 Conclusions

The lambda architecture promises a new solution to cope with an ever increasing volume of data and the need to make decisions based on real-time analytics. Current research in this field centers around the idea of optimizing the horizontal scalability and lowering the latency of distributed systems in order to fulfill these needs. The lambda architecture takes a different approach and introduces the idea of eventual accuracy. Only limited research is available that discusses the impact of the conflict of goals imposed by this promise, and there is no reference implementation that could form the basis for further analysis. The contribution of this thesis is to take a step forward and broaden the discussion of the lambda architecture.

The designed architecture is based on open source stream-processing, orchestration and provisioning software, and the following implementation is provided for the two computing layers of the lambda architecture. The batch layer follows an incremental approach that first stores incoming messages persistently in a highly available message queue; this queue provides replay capabilities to recover from unexpected behavior. The computation tasks are bound to the partitioning scheme of the queue, and both are highly scalable. The speed layer consumes messages from an in-memory queue that does not provide recovery mechanisms. In case of unexpected behavior, such as machine failures, the speed layer drops messages and continues with the most recent data available.

The architecture includes components to coordinate, monitor and measure the computing jobs. The coordination component provides the interface to send data into the system. The orchestration component manages the resources and includes an automatic deployment of the architecture. The analysis of the results and the monitoring of the processes are managed in the service layer.

An evaluation based on the SRBench and DEBS Grand Challenge 2014 data sets measured the capabilities of the designed architecture and stressed its behavior on an unreliable infrastructure. The results of the batch layer were completely accurate throughout all experiments and replaced the inaccurate results of the speed layer caused by node failures. The speed layer, on the other hand, generated results up to five times faster than the batch layer. The main limitations of the designed architecture are its inability to handle stream imperfections, such as delayed messages, and the fact that the partitioning of messages based on consistent hashing does not guarantee an equal distribution of data.

The imposed QoS constraints of the designed architecture are currently not configurable. Future research is necessary to better understand the impact of timely but inaccurate results and to investigate the possibilities of dynamically adapting QoS constraints based on the desired accuracy and latency levels.

References

[1] Abadi, D. J., Ahmad, Y., Balazinska, M., Cetintemel, U., Cherniack, M., Hwang, J.-H., Lindner, W., Maskey, A., Rasin, A., Ryvkina, E., et al. (2005). The design of the borealis stream processing engine. In CIDR, volume 5, pages 277–289.

[2] Abadi, D. J., Carney, D., Çetintemel, U., Cherniack, M., Convey, C., Lee, S., Stonebraker, M., Tatbul, N., and Zdonik, S. (2003). Aurora: a new model and architecture for data stream management. The VLDB Journal—The International Journal on Very Large Data Bases, 12(2):120–139.

[3] ACM (2014). Debs grand challenge 2014. http://www.cse.iitb.ac.in/debs2014/?page_id=42, Retrieved 2014-08-14.

[4] Akidau, T., Balikov, A., Bekiroğlu, K., Chernyak, S., Haberman, J., Lax, R., McVeety, S., Mills, D., Nordstrom, P., and Whittle, S. (2013). Millwheel: fault-tolerant stream processing at internet scale. Proceedings of the VLDB Endowment, 6(11):1033–1044.

[5] Andrade, H., Gedik, B., and Turaga, D. (2014). Fundamentals of Stream Processing: Application Design, Systems, and Analytics. Cambridge University Press.

[6] Apache Foundation. Apache kafka configuration table. http://kafka.apache.org/08/configuration.html, Retrieved 2014-08-14.

[7] Apache Foundation. Config class. https://storm.incubator.apache.org/apidocs/backtype/storm/Config.html, Retrieved 2014-08-14.

[8] Apache Foundation. Performance results. http://kafka.apache.org/07/performance.html, Retrieved 2014-08-14.

[9] Apache Foundation. Storm, distributed and fault-tolerant realtime computation. https://storm.incubator.apache.org, Retrieved 2014-08-14.

[10] Bernhardt, T. and Vasseur, A. (2007). Esper: Event stream processing and correlation. ONJava, http://www.onjava.com/lpt/a/6955, O'Reilly.

[11] Beyer, K. S., Ercegovac, V., Gemulla, R., Balmin, A., Eltabakh, M., Kanne, C.-C., Ozcan, F., and Shekita, E. J. (2011). Jaql: A scripting language for large scale semistructured data analysis. In Proceedings of VLDB Conference.

[12] Bockermann, C. and Blom, H. (2012a). The streams framework. Technical Report 5, TU Dortmund University.

[13] Bockermann, C. and Blom, H. (2012b). The streams framework. Technical Report 5, TU Dortmund University, December 2012.

[14] Brewer, E. A. (2000). Towards robust distributed systems. In PODC, page 7.

[15] Carlson, J. L. (2013). Redis in Action. Manning Publications Co.

[16] Cattell, R. (2011). Scalable sql and nosql data stores. SIGMOD Rec., 39(4):12–27.

[17] Chambers, C., Raniwala, A., Perry, F., Adams, S., Henry, R. R., Bradshaw, R., and Weizenbaum, N. (2010). Flumejava: easy, efficient data-parallel pipelines. In ACM Sigplan Notices, volume 45, pages 363–375. ACM.

[18] Chandrasekaran, S., Cooper, O., Deshpande, A., Franklin, M. J., Hellerstein, J. M., Hong, W., Krishnamurthy, S., Madden, S. R., Reiss, F., and Shah, M. A. (2003). Telegraphcq: continuous dataflow processing. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data, pages 668–668. ACM.

[19] Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. (2008). Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems (TOCS), 26(2):4.

[20] Chodorow, K. (2013). MongoDB: the definitive guide. O'Reilly Media, Inc.

[21] Condie, T., Conway, N., Alvaro, P., Hellerstein, J. M., Elmeleegy, K., and Sears, R. (2010). Mapreduce online. In NSDI, volume 10, page 20.

[22] Cugola, G. and Margara, A. (2012). Processing flows of information: From data stream to complex event processing. ACM Comput. Surv., 44(3):15:1–15:62.

[23] Das, T., Zhong, Y., Stoica, I., and Shenker, S. (2014). Adaptive stream processing using dynamic batch sizing.

[24] Dean, J. and Ghemawat, S. (2003). Mapreduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113.

[25] Dean, J. and Ghemawat, S. (2011). leveldb – a fast and lightweight key/value database library by google. https://code.google.com/p/leveldb, Retrieved 2014-08-14.

[26] EsperTech Inc. (2014). Esper - complex event processing. http://esper.codehaus.org, Retrieved 2014-08-14.


[27] Etzion, O. and Niblett, P. (2010). Event Processing in Action. Manning Publications Co.

[28] Fan, W. and Bifet, A. (2013). Mining big data: current status, and forecast to the future. ACM SIGKDD Explorations Newsletter, 14(2):1–5.

[29] Fischer, L., Scharrenbach, T., and Bernstein, A. (2013). Network-aware workload scheduling for scalable linked data stream processing. In International Semantic Web Conference (Posters & Demos), pages 281–284.

[30] George, L. (2011). HBase: The Definitive Guide. O'Reilly Media, Inc.

[31] Ghemawat, S., Gobioff, H., and Leung, S.-T. (2003). The Google file system. In ACM SIGOPS Operating Systems Review, volume 37, pages 29–43. ACM.

[32] Gilbert, S. and Lynch, N. (2002). Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33(2):51–59.

[33] He, B., Yang, M., Guo, Z., Chen, R., Su, B., Lin, W., and Zhou, L. (2010). Comet: batched stream processing for data intensive distributed computing. In Proceedings of the 1st ACM Symposium on Cloud Computing, pages 63–74. ACM.

[34] Hunt, P., Konar, M., Junqueira, F. P., and Reed, B. (2010). ZooKeeper: wait-free coordination for internet-scale systems. In Proceedings of the 2010 USENIX Annual Technical Conference, volume 8, pages 11–11.

[35] Hypertable Inc. Hypertable. http://hypertable.com/, Retrieved 2014-08-14.

[36] Kreps, J. Idempotent producer. https://cwiki.apache.org/confluence/display/KAFKA/Idempotent+Producer, Retrieved 2014-08-14.

[37] Kreps, J., Narkhede, N., and Rao, J. (2011). Kafka: A distributed messaging system for log processing. In Proceedings of the NetDB.

[38] Kwon, Y., Balazinska, M., and Greenberg, A. (2008). Fault-tolerant stream processing using a distributed, replicated file system. Proceedings of the VLDB Endowment, 1(1):574–585.

[39] Lakshman, A. and Malik, P. (2010). Cassandra: a decentralized structured storage system. ACM SIGOPS Operating Systems Review, 44(2):35–40.

[40] LMAX Trading. Disruptor library. https://github.com/LMAX-Exchange/disruptor, Retrieved 2014-08-14.

[41] Lohrmann, B., Warneke, D., and Kao, O. (2014). Nephele streaming: stream processing under QoS constraints at scale. Cluster Computing, 17(1):61–78.

[42] Love, R. (2013). Linux System Programming: Talking Directly to the Kernel and C Library. O'Reilly Media, Inc.


[43] Manning, J. (2004). Apache Storm. Signet.

[44] Marz, N. (2013). Big Data: Principles and best practices of scalable realtime data systems. Manning Publications.

[45] Matsumoto, M. and Nishimura, T. (1998). Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Transactions on Modeling and Computer Simulation (TOMACS), 8(1):3–30.

[46] Murray, D. G., McSherry, F., Isaacs, R., Isard, M., Barham, P., and Abadi, M. (2013). Naiad: a timely dataflow system. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 439–455. ACM.

[47] Olston, C., Reed, B., Srivastava, U., Kumar, R., and Tomkins, A. (2008). Pig latin: a not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 1099–1110. ACM.

[48] Project Netty. Netty project. http://netty.io/index.html, Retrieved 2014-08-14.

[49] Qian, Z., He, Y., Su, C., Wu, Z., Zhu, H., Zhang, T., Zhou, L., Yu, Y., and Zhang, Z. (2013). Timestream: Reliable stream computation in the cloud. In Proceedings of the 8th ACM European Conference on Computer Systems, pages 1–14. ACM.

[50] Robak, S., Franczyk, B., and Robak, M. (2013). Applying big data and linked data concepts in supply chains management. In Computer Science and Information Systems (FedCSIS), 2013 Federated Conference on, pages 1215–1221. IEEE.

[51] Shvachko, K., Kuang, H., Radia, S., and Chansler, R. (2010). The Hadoop distributed file system. In Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, pages 1–10. IEEE.

[52] Stephens, R. (1997). A survey of stream processing. Acta Informatica, 34(7):491–541.

[53] Stonebraker, M., Çetintemel, U., and Zdonik, S. (2005). The 8 requirements of real-time stream processing. ACM SIGMOD Record, 34(4):42–47.

[54] Tanenbaum, A. and Van Steen, M. (2007). Distributed systems. Pearson Prentice Hall.

[55] Thusoo, A., Sarma, J. S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., and Murthy, R. (2009). Hive: a warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment, 2(2):1626–1629.

[56] Vavilapalli, V. K., Murthy, A. C., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H., Seth, S., et al. (2013). Apache Hadoop YARN: Yet another resource negotiator. In Proceedings of the 4th Annual Symposium on Cloud Computing, page 5. ACM.


[57] Wanderman-Milne, S. and Li, N. (2014). Runtime code generation in cloudera impala. IEEE Data Eng. Bull., 37(1):31–37.

[58] Ye, F., Wang, Z.-J., Zhou, F.-C., Wang, Y.-P., and Zhou, Y.-C. (2013). Cloud-based big data mining & analyzing services platform integrating R. In Advanced Cloud and Big Data (CBD), 2013 International Conference on, pages 147–151. IEEE.

[59] Yoo, A. B., Jette, M. A., and Grondona, M. (2003). Slurm: Simple linux utility for resource management. In Job Scheduling Strategies for Parallel Processing, pages 44–60. Springer.

[60] Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., and Stoica, I. (2010). Spark: cluster computing with working sets. In Proceedings of the 2nd USENIX conference on Hot topics in cloud computing, pages 10–10.

[61] Zaharia, M., Das, T., Li, H., Shenker, S., and Stoica, I. (2012). Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters. In Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Computing, pages 10–10. USENIX Association.

[62] Zhang, Y., Duc, P. M., Corcho, O., and Calbimonte, J.-P. (2012). SRBench: a streaming RDF/SPARQL benchmark. In The Semantic Web–ISWC 2012, pages 641–657. Springer.


List of Figures

4.1 Architecture Overview ...... 10
4.2 YARN architecture ...... 11
4.3 Kafka producer performance ...... 13
4.4 Samza stream processing topology ...... 14
4.5 Samza stream task distribution ...... 15
4.6 Storm topology ...... 18
4.7 Coordination layer data flow ...... 20
4.8 Coordination pipelines ...... 21
4.9 In-memory messaging protocol ...... 24
4.10 Batch layer components ...... 25
4.11 Batch layer node layout ...... 26
4.12 Batch layer partitioning ...... 27
4.13 Micro-batches ...... 28
4.14 Memory buffers of batch based consumers and producers ...... 28
4.15 Samza message recovery ...... 29
4.16 Speed layer components ...... 31
4.17 Storm worker messaging buffers ...... 33
4.18 Speed layer scalability ...... 34
4.19 Log collection and aggregation ...... 37
4.20 Batch layer monitoring ...... 37
4.21 Speed layer monitoring ...... 38
4.22 Progress monitor ...... 38

5.1 Infrastructure setup ...... 41

6.1 KPIs SRBench Query 1 Clean Run ...... 50
6.2 KPIs SRBench Query 1 Throughput Throttling ...... 51
6.3 KPIs SRBench Query 1 Node Failure ...... 52
6.4 KPIs SRBench Query 1 Clean Run ...... 55
6.5 KPIs SRBench Query 1 Throughput Throttling ...... 56
6.6 KPIs SRBench Query 1 Node Failure ...... 57
6.7 KPIs DEBS Query "load prediction" Performance Run ...... 61
6.8 KPIs DEBS Query "load prediction" Node Failure ...... 62

6.9 Variance DEBS Query "load prediction" Node Failure ...... 63
6.10 KPIs DEBS Query "average load" Performance Run ...... 66
6.11 KPIs DEBS Query "average load" Node Failure ...... 67
6.12 Variance DEBS Query "load prediction" Node Failure ...... 68
6.13 Eventual Accuracy DEBS data set ...... 70

List of Tables

5.1 Hardware of the compute nodes ...... 39
5.2 Data set statistics ...... 42

6.1 KPIs SRBench Query "rainfall observed once an hour" Performance Run ...... 47
6.2 KPIs SRBench Query "rainfall observed once an hour" Throttled Throughput ...... 48
6.3 KPIs SRBench Query "rainfall observed once an hour" Node Failures ...... 49
6.4 KPIs SRBench Query "broken station detection" Performance Run ...... 53
6.5 KPIs SRBench Query "broken station detection" Node Failures ...... 54
6.6 KPIs DEBS Query "load prediction" Performance Run ...... 58
6.7 KPIs DEBS Query "load prediction" Node Failure ...... 60
6.8 KPIs DEBS Query "average load" Performance Run ...... 64
6.9 KPIs DEBS Query "average load" Node Failure ...... 65