Open Source Lambda Architecture for Interactive Analytics

Proceedings of the 50th Hawaii International Conference on System Sciences | 2017 The RADStack: Open Source Lambda Architecture for Interactive Analytics Fangjin Yang, Gian Merlino, Nelson Ray, Xavier Léauté, Himanshu Gupta, Eric Tschetter {fangjinyang, gianmerlino, ncray86, xavier.leaute, g.himanshu, echeddar}@gmail.com ABSTRACT An ideal data serving layer alone is often not sufficient The Real-time Analytics Data Stack, colloquially referred to as a complete analytics solution. In most real-world use as the RADStack, is an open-source data analytics stack de- cases, raw data cannot be directly stored in the serving layer. signed to provide fast, flexible queries over up-to-the-second Raw data suffers from many imperfections and must first data. It is designed to overcome the limitations of either be processed (transformed, or cleaned) before it is usable a purely batch processing system (it takes too long to sur- [17]. The drawback of this requirement is that loading and face new events) or a purely real-time system (it’s difficult processing batch data is slow, and insights on events cannot to ensure that no data is left behind and there is often no be obtained until hours after the events have occurred. way to correct data after initial processing). It will seam- To address the delays in data freshness caused by batch lessly return best-effort results on very recent data combined processing frameworks, numerous open-source stream pro- with guaranteed-correct results on older data. In this paper, cessing frameworks such as Apache Storm[12], Apache Spark we introduce the architecture of the RADStack and discuss Streaming[25], and Apache Samza[1] have gained popular- our methods of providing interactive analytics and a flexible ity for offering a low-latency model to ingest and process data processing environment to handle a variety of real-world event streams at near real-time speeds. The drawback of workloads. almost all stream processors is that they do not necessarily provide the same correctness guarantees as batch processing frameworks. Events can come in days late, and may need 1. INTRODUCTION to be corrected after the fact. Large batches of data may also need to be reprocessed when new columns are added or The rapid growth of the Hadoop[16] ecosystem has en- removed. abled many organizations to flexibly process and gain in- Combining batch processing, streaming processing, and sights from large quantities of data. These insights are typ- a serving layer in a single technology stack is known as a ically generated from business intelligence, or OnLine Ana- lambda architecture[9]. In lambda architectures, data en- lytical Processing (OLAP) queries. Hadoop has proven to be tering the system is concurrently fed to both the batch and an extremely effective framework capable of providing many streaming processing layer. The streaming layer is responsi- analytical insights and is able to solve a wide range of dis- ble for immediately processing incoming data, however, the tributed computing problems. However, as much as Hadoop processed data may suffer from duplicated events and other is lauded for its wide range of use cases, it is derided for its imperfections in data accuracy. The batch layer processes high latency in processing and returning results. A common incoming data much slower than the streaming layer, but is approach to surface data insights is to run MapReduce jobs able to provide accurate views of data. The serving layer that may take several hours to complete. merges the results from the batch and streaming layers and Data analysis and data-driven applications are becoming provides an interface for queries. Although each individual increasingly important in industry, and the long query times component in a lambda architecture has their own limita- encountered with using batch frameworks such as Hadoop tions, the pieces complement each other extremely well and are becoming increasingly intolerable. User facing appli- the overall stack is robust enough to handle a wide array of cations are replacing traditional reporting interfaces as the data processing and querying challenges at scale. preferred means for organizations to derive value from their The RADStack is an open source lambda architecture im- datasets. In order to provide an interactive user experi- plementation meant to offer flexible, low-latency analytic ence with data applications, queries must complete in an queries on near real-time data. The solution combines the order of milliseconds. Because most of these interactions low latency guarantees of stream processors and the correct- revolve around data exploration and computation, organi- ness and flexibility guarantees of batch processors. It also zations quickly realized that in order to support low latency introduces a serving layer specifically designed for interac- queries, dedicated serving layers were necessary. Today, tive analytics. The stack’s main building blocks are Apache most of these serving layers are Relational Database Man- Kafka[11], Apache Samza, Apache Hadoop, and Druid [23], agement Systems (RDBMS) or NoSQL key/value stores. and we have found that the combination of technologies is Neither RDBMS nor NoSQL key/value stores are partic- flexible enough to handle a wide variety of processing re- ularly designed for analytics [19], but these technologies are quirements and query loads. Each piece of the stack is de- still frequently selected as serving layers. Solutions that in- signed to do a specific set of things very well. This paper volve these broad-focus technologies can be inflexible once will cover the details and design principles of the RADStack. tailored to the analytics use case, or suffer from architecture Our contributions are around the architecture of the stack drawbacks that prevent them from returning queries fast itself, the introduction of Druid as a serving layer, and our enough to power interactive, user-facing applications [20]. URI: http://hdl.handle.net/10125/41359 1 ISBN: 978-0-9981331-0-2 CC-BY-NC-ND 1703 joined, cleaned up, and transformed before it was usable in Druid, but that was the trade-off we were willing to make in order to get the performance necessary to power an interactive data application. We introduced stream processing to our stack to provide the processing required before raw data could be loaded into Druid. Our stream processing jobs range from simple data transformations, such as id to name lookups, up to complex operations such as multi- stream joins. Pairing Druid with a stream processor enabled flexible data processing and querying, but we still had prob- Figure 1: The components of the RADStack. Kafka lems with event delivery. Our events were delivered from acts as the event delivery endpoints. Samza and many different locations and sources, and peaked at several Hadoop process data to load data into Druid. Druid million events per second. We required a high throughput acts as the endpoint for queries. message bus that could hold these events for consumpation by our stream processor. To simplify data transmission for our clients, we wanted the message bus to be the single de- model for unifying real-time and historical workflows. livery endpoint for events entering our cluster. The structure of the paper is as follows: Section 2 de- Our stack would be complete here if real-time processing scribes the problems and use cases that led to the creation were perfect, but the open source stream processing space of the RADStack. Section 3 describes Druid, the serving is still young. Processing jobs can go down for extended layer of the stack, and how Druid is built for real-time and periods of time and events may be delivered more than batch data ingestion, as well as exploratory analytics. Sec- once. These are realities of any production data pipeline. tion 4 covers the role of Samza and Hadoop for data pro- To overcome these issues, we included Hadoop in our stack cessing, and Section 5 describes the role of Kafka for event to periodically clean up any data generated by the real-time delivery. In Section 6, we present our production metrics. pipeline. We stored a copy of the raw events we received in Section 7 presents our experiences with running the RAD- a distributed file system, and periodically ran batch process- Stack in production, and in Section 8 we discuss the related ing jobs over this data. The high level architecture of our solutions. setup is shown in Figure 1. Each component is designed to do a specific set of things well, and there is isolation in 2. BACKGROUND terms of functionality. Individual components can entirely fail without impacting the services of the other components. The RADStack was first developed to address problems in the online advertising. In online advertising, automated systems from different organizations will place bids against 3. THE SERVING LAYER one another to display users ads in the milliseconds before a webpage loads. These actions generate a tremendous volume Druid is a column-oriented data store designed for ex- of data. The data shown in Table 1 is an example of such ploratory analytics and is the serving layer in the RAD- data. Each event is comprised of three components: a times- Stack. A Druid cluster consists of different types of nodes tamp indicating when the event occurred; a set of dimen- and, similar to the overall design of the RADStack, each sions indicating various attributes about the event; and a set node type is instrumented to perform a specific set of things of metrics concerning the event. Organizations frequently well. We believe this design separates concerns and simpli- serve this insights to this data to ad publishers through vi- fies the complexity of the overall system. To solve complex sualizations and data applications. These applications must data analysis problems, the different node types come to- rapidly compute drill-down and aggregates with this data, gether to form a fully working system.

Open Source Lambda Architecture for Interactive Analytics

Aligning Machine Learning for the Lambda Architecture

Building a Scalable Distributed Data Platform Using Lambda Architecture

Introduction to Big Data & Architectures

A Scalable Data Store and Analytic Platform for Real-Time Monitoring of Data-Intensive Scientiﬁc Infrastructure

Real-Time Data Processing with Lambda Architecture

The Radstack: Open Source Lambda Architecture for Interactive Analytics

Dmla: a Dynamic Model-Based Lambda Architecture for Learning And

Lambda Architecture for Distributed Stream Processing in the Fog

Fundamentals of Real-Time Data Processing Architectures Lambda and Kappa

Efficient Stream Data Management

How to Build and Run a Big Data Platform in the 21St Century