DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2021

FlinkNDB: Guaranteed Data Streaming Using External State

MUHAMMAD HASEEB ASIF

Master Thesis in Big Data & Distributed Computing
KTH Royal Institute of Technology
School of Electrical Engineering and Computer Science
SE-164 40 Kista, Stockholm, Sweden 2021
TRITA-ICT XXXX:XX

Academic dissertation which, with the permission of KTH Royal Institute of Technology, is presented for public examination for the degree of licentiate in Cloud Computing & Services on Friday, 8 January 2021, in .

© Muhammad Haseeb Asif, January 2021

Printed by: Universitetsservice US AB

Abstract

Apache Flink is a stream processing framework that provides a unified state management mechanism which, at its core, treats stream processing as a sequence of distributed transactions. Flink handles failures, re-scaling, and reconfiguration seamlessly via a form of two-phase commit protocol that periodically commits all past side effects consistently into the state backends. This involves invoking and combining checkpoints and, when needed, re-distributing the state to resume data pipelines.

All the existing Flink state backend implementations, such as RocksDB, are embedded and coupled with the compute nodes. Recovery time is therefore proportional to the state that needs to be reconfigured, which can take from a few seconds to hours. Because of the embedded state backends, if the application logic is compute-heavy and Flink's tasks are overloaded, scaling out the compute pipeline means scaling out storage together with the compute tasks, and vice versa. It also introduces delays due to expensive state re-shuffles and moving large state over the network.

This thesis proposes decoupling state storage from compute to improve Flink's scalability. It introduces the design and implementation of a new state backend, FlinkNDB, that decouples state storage from compute. Furthermore, we designed and implemented new techniques to perform snapshotting and failure recovery that reduce the recovery time to close to zero.

Keywords: Apache Flink, NDB, Flink State Backend, RocksDB State Backend, State management, Large State Applications

Sammanfattning

Apache Flink is a stream processing framework that provides a unified state management mechanism which, at its core, treats stream processing as a sequence of distributed transactions. Flink handles failures, re-scaling, and reconfiguration seamlessly via a form of two-phase commit protocol that regularly commits all past side effects consistently into the state backends. This involves invoking and combining checkpoints and, when needed, redistributing the state to resume data pipelines. All existing Flink state backend implementations, such as RocksDB, are embedded and coupled to the compute nodes. Recovery time is therefore proportional to the state that needs to be reconfigured, which can take from a few seconds to hours. If the application logic is compute-heavy and Flink's tasks are overloaded, scaling out the compute pipeline entails scaling out the storage together with the compute tasks, and vice versa, because of the embedded state backends. It also introduces delays caused by expensive state movements and the transfer of large data volumes that occupy much of the bandwidth. This thesis work proposes decoupling state storage from compute in order to improve Flink's scalability. It introduces the design and implementation of a new state backend, FlinkNDB, which decouples state storage from compute. Finally, we designed and implemented new techniques for performing snapshotting and failure recovery to reduce the recovery time to close to zero.

Keywords: Apache Flink, NDB, Flink State Backend, RocksDB State Backend, State management, Large State Applications

Acknowledgements

Throughout my master thesis, I had the opportunity to meet and learn from people at my research lab, and online as well, during the interesting COVID times. I really appreciate everyone's time and support in helping with this accomplishment. In particular, my supervisor, Paris Carbone, and Mahmoud Ismail have been a great support throughout this thesis journey; their guidance has been instrumental during the whole project. Furthermore, it is worth mentioning the efforts and motivation from Sruthi during challenging and head-scratching moments. Finally, and most importantly, I would like to thank my family for their unbounded emotional support during my studies.

M Haseeb Asif, Stockholm, Jan 2021

Contents

Contents
List of Figures
List of Tables
List of Acronyms

1 Introduction
  1.1 Background
  1.2 Research Questions
  1.3 Goals
  1.4 Research Methodology
  1.5 Ethics and Sustainability
  1.6 Delimitations
  1.7 Thesis Organization

2 Background
  2.1 Big Data Analytics
    2.1.1 Map Reduce
    2.1.2 Batch Processing
    2.1.3 Stream processing
  2.2 Apache Kafka
  2.3 Apache Flink
    2.3.1 Flink Architecture
    2.3.2 Flink Programming Model
    2.3.3 Windowing
  2.4 Flink Application State
    2.4.1 Keyed State
    2.4.2 Operator State
  2.5 Flink State Management
    2.5.1 Spark State Backend
    2.5.2 Flink State Backends
    2.5.3 External State Approach
    2.5.4 Flink Re-scalable State
  2.6 RocksDB State Backend
  2.7 Flink Fault Tolerance
    2.7.1 Checkpointing
    2.7.2 Consistent Snapshots - Chandy-Lamport
    2.7.3 Flink 2PC protocol
    2.7.4 Flink State Checkpointing
  2.8 Transactional processing for Streaming application
  2.9 NDB
  2.10 Summary

3 Design and Implementation of FlinkNDB
  3.1 FlinkNDB Architecture
    3.1.1 Cache Layer
    3.1.2 Database Layer
    3.1.3 Primary Key
  3.2 NDB Schema
  3.3 Checkpointing
    3.3.1 NDB Schema Enhancements
  3.4 State Type Schema
    3.4.1 List State
    3.4.2 Map State
  3.5 Cache Optimizations
    3.5.1 Active Cache
    3.5.2 Commit Cache
    3.5.3 Cache Implementation
  3.6 Recovery
    3.6.1 RocksDB Approach
    3.6.2 FlinkNDB Approach
  3.7 Summary

4 Benchmarking & Results
  4.1 Benchmarking Framework
    4.1.1 Nexmark Benchmark
    4.1.2 NDW Benchmark
  4.2 Hardware Infrastructure
  4.3 Benchmarking Architecture
  4.4 Objectives
  4.5 Experimental Evaluation
    4.5.1 Experiment 1
    4.5.2 Experiment 2
    4.5.3 Experiment 3
    4.5.4 Experiment 4
    4.5.5 Experiment 5
  4.6 Evaluation Summary

5 Conclusion and Future Work
  5.1 Conclusion
  5.2 Future Work

Bibliography

List of Figures

1.1 Flink reconfiguration from 2 node to 3 node cluster [1]

2.1 Map Reduce - Execution Flow [2]
2.2 Map Reduce vs Apache Spark [3]
2.3 Apache Flink Architecture overview [4]
2.4 Translation from Logical to Physical Execution Graphs [5]
2.5 Separate compute and storage [6]
2.6 Reshuffling of keys while changing parallelism [7]
2.7 Flink RocksDB state backend
2.8 An example of an inconsistent (C1) and a consistent cut (C2) [5]
2.9 NDB Cluster [1]

3.1 FlinkNDB initial architecture
3.2 Flink injection of barriers into data stream [8]
3.3 FlinkNDB state backend architecture with cache
3.4 FlinkNDB - NDB Table Schema
3.5 Flink NDB Cache Activity diagram

4.1 NEXMark - Online Auction system
4.2 FlinkNDB data processing pipeline [1]
4.3 Apache Beam NEXMark - Performance comparison of Flink state backends [1]
4.4 Experiment 1 evaluation graphs
4.5 Experiment 2 evaluation graphs
4.6 Experiment 3 evaluation graphs
4.7 Experiment 4 evaluation graphs
4.8 Experiment 5 evaluation graphs

List of Tables

2.1 Comparison of Flink State Backends [9]

4.1 Summary of input parameters for experiments
4.2 Summary of Evaluation metrics for Flink State backends

List of Acronyms

API Application Programming Interface
CPU Central Processing Unit
DAG Directed Acyclic Graph
EC2 Elastic Compute Cloud
GCE Google Compute Engine
HDFS Hadoop Distributed File System
IoT Internet of Things
I/O Input/Output
JVM Java Virtual Machine
NDB Network Database
OLTP Online Transactional Processing
OLAP Online Analytical Processing
OSI Open Systems Interconnection
POJO Plain Old Java Object
S3 Simple Storage Service
URL Uniform Resource Locator


Chapter 1

Introduction

Technology has been advancing at a rapid pace, and more data is being generated than ever before. Apache Flink is a prominent processing engine that enables users to handle data at a large scale. It is a distributed, fault-tolerant, and scalable processing engine. Although Apache Flink meets most industry needs, it could do better at dynamically scaling in and out without major delays. In this thesis, we explore an alternative to the existing Flink architecture and show how Apache Flink can improve on its current state.

1.1 Background

The Internet of Things (IoT) and the digitization of different sectors are among the major sources of the massive amounts of data that cannot be analyzed and processed by conventional processing systems. Raw data by itself is of little use; in fact, it costs money to store large amounts of data. Analyzing huge amounts of data and extracting business intelligence from it in a timely manner is still an active area of research. Big data tools and technologies are catching up with the pace of data generation and are being developed rapidly by the community to meet data processing needs. Various tools exist in the industry at the moment.

In the early days, all the data was collected and loaded into a single machine for data processing, known as batch processing. Historically, batch processing leveraged vertical scaling¹ of the systems to meet the need for processing large data sets. In recent years, with advances in distributed systems, tools have been developed that can process data by leveraging horizontal scaling². A major switch from single-system processing techniques to distributed systems happened with the introduction of MapReduce[2] by Google. Later, quite a few tools were developed that enhanced processing capabilities, such as Apache Tez[10],

¹ Vertical scaling refers to adding more resources (CPU/RAM/disk) to the same server.
² Horizontal scaling means scaling by adding more machines to the pool of resources.


Apache Spark[11], Apache Samza[12], and Apache Flink[13]. Most of these tools can process unlimited streams of data, or data in motion, in contrast to their predecessors, where all the data needed to be stored before processing. Apache Flink is an open-source, unified stream-processing and batch-processing framework developed by the Apache Software Foundation. The core of Apache Flink is a distributed streaming dataflow engine. Flink executes arbitrary dataflow programs in a data-parallel and pipelined (hence task-parallel) manner. Flink's pipelined runtime system enables the execution of bulk/batch and stream processing programs[14]. This thesis focuses on Apache Flink.

1.2 Research Questions

Although data processing engines, including Apache Flink, have advanced by leaps and bounds, there are still areas where industry needs are not completely met. Apache Flink is a stateful processing engine: while doing stateful computation, it needs to store the state in a state backend. Currently, it supports a few state backends. One of them, the RocksDB state backend, is widely used in industry. It uses the embedded RocksDB database to store the state on the processing system. The RocksDB state backend provides fast access for data-parallel stateful operations; however, its performance degrades during scaling in and out, failure recovery, or any type of reconfiguration done on a running pipeline. It can take hours to reconfigure the data pipelines if the state size is a few terabytes. The end-user experience can deteriorate badly if there are long processing delays. A large state size can thus lead to significant delays when jobs scale in or out, proportional to the size and the partitioning granularity of the state, as shown in fig. 1.1. Therefore, our research question for the thesis is:

• Can we substitute embedded state with external state while maintaining the same consistency guarantees and not violating the performance requirements?

1.3 Goals

Overall, the project goal is to investigate the decoupling of state storage and compute for Apache Flink and benchmark the performance against the existing embedded-storage state backends. FlinkNDB is a state backend that uses NDB to store the state externally to Flink; an initial state backend implementation exists by Sruthi[1]. This thesis complements the work by Sruthi[1] with the checkpointing and recovery implementation. The following are the research goals for the thesis study:

1. Design and develop the schema for NDB to add support for checkpointing in FlinkNDB.

Figure 1.1: Flink reconfiguration from 2 node to 3 node cluster [1]

2. Research and implement the checkpointing strategy for FlinkNDB.

3. Research and implement FlinkNDB recovery from checkpointed snapshots during failure or reconfiguration.

1.4 Research Methodology

The degree project consists of three parts:

1. Research and understand existing Flink state backends, their implementation, and the underlying storage engines

2. Design and implementation of snapshotting and recovery for FlinkNDB state backend

3. Benchmarking and experimental evaluation of the failure recovery performance of different state backends

The research methodology is thus an exploratory literature review for the aforementioned research question, followed by implementation and experimental evaluation using different benchmarking metrics.

1.5 Ethics and Sustainability

Stream processing systems provide real-time data processing capabilities, but they need to run all the time and hence consume more resources. These systems should therefore be performant and use resources effectively. This project is focused on optimizing Flink cluster scalability, which will reduce the power consumption, heat production, and waste of underutilized resources in the digital infrastructure. The result is a more sustainable stream processing system.

Additionally, this project adheres to all ethical standards. During the research and development of the project, a great deal of qualitative and quantitative research was done, but no personal data was used. All the data collected and used during the project was aggregated and anonymized wherever any personal information was involved. Furthermore, there is no data that might cause security risks to individuals or organizations. Finally, references have been furnished for all previous work that is leveraged throughout this project.

1.6 Delimitations

This thesis is part of a joint project with Sruthi[1] to design and develop a new state backend, FlinkNDB. Sruthi's[1] work focuses on state storage for FlinkNDB, while this thesis work is about the design and development of checkpointing and recovery.

1.7 Thesis Organization

This thesis is organized into five chapters as follows:

• Chapter 1 gives a high-level overview of the overall project and defines the research goal for the thesis.

• Chapter 2 details the background of the numerous concepts discussed in this thesis. It starts by establishing the basis of big data processing and its evolution. Afterward, it introduces Apache Flink, touches on the different types of state backends and how each state backend stores the state, and then discusses the keyed state backend and the different possible state data structures. It then explains how Flink performs snapshotting and recovery with the existing state backends. Finally, the chapter introduces NDB and some of its key features.

• Chapter 3 introduces the architecture, design, and implementation of FlinkNDB. It discusses the different possible solutions and why a certain solution was chosen. It also discusses the optimizations introduced to improve the performance of the initial design.

• Chapter 4 explains the different ways FlinkNDB was benchmarked against the existing implementations. It covers existing benchmarking frameworks as well as the development of a newer framework to measure performance for specific performance metrics.

• Chapter 5 concludes with a discussion of the research question and the contributions of the thesis toward the creation of new knowledge in this domain. Finally, it mentions possible improvements and extensible features as future work.

Chapter 2

Background

Applications running in production require more resources as demand increases. Increasing or decreasing resources is known as system scaling. There are two types of scaling: vertical scaling and horizontal scaling. Vertical scaling adds more resources (CPU, RAM, disk) to the same computer system. Horizontal scaling, on the other hand, refers to adding more computer systems to the cluster to increase the available resources.

MapReduce[2] started the shift towards distributed data processing by scaling out instead of scaling up. After that, systems introduced interactive query execution and DAG (directed acyclic graph) data flows. Apache Spark is considered a third-generation tool; it introduced lineage graphs, iterative data processing, and near real-time data processing. The current generation, Apache Flink, is focused on scalable stream processing with native iterative processing and real-time streaming. There are still quite a few open challenges in the field of scalable stream processing. The three most critical challenges [5] are fault tolerance and scalable state management, computation sharing, and semantics for sliding windows and iterative data streaming.

Apache Flink has been developed with scalable stream processing as a design consideration. It provides a streaming API that is used to apply stream transformations while writing distributed applications for data streams. It exposes basic abstract types such as DataStream and WindowedStream, which support different types of transformations. Flink employs a long-running task architecture where each program has a three-phase compilation: a logical, an optimized, and a physical representation. Each stream transformation can have state, which is taken care of by Flink itself. Additionally, windows are used to group continuous data to apply different grouping operations and also to store state.

Apache Flink programs written with the DataStream API maintain state for their operations, and Flink manages the state transparently without the user having to worry about it. It ensures reliability by periodically taking consistent snapshots of the system. A consistent snapshot represents the global state of the system at any

given point in time, and in the event of failure, Flink can resume processing without losing information. Flink's mechanism for drawing these snapshots is described in "Lightweight Asynchronous Snapshots for Distributed Dataflows"[15]. It is inspired by the famous Chandy-Lamport algorithm for distributed snapshots and is specifically tailored to Flink's execution model.

Flink state is stored in different state backends[16]. Currently, Flink has three state backend implementations in terms of storage: JVM heap, file system, and RocksDB. RocksDB is an embedded key-value store that keeps the state on the disk of the compute node. It is good for large state that cannot fit in memory, but it compromises performance because objects require serialization and de-serialization. The FsStateBackend and the memory state backend keep their working state on the JVM heap, so they are much faster than the disk-based RocksDB, but they require a large heap size.

This chapter presents an overview of batch and stream processing, followed by Apache Flink and its programming and execution model. Later, we discuss NDB and the two-phase commit (2PC) protocol. Finally, the chapter concludes with a discussion of the different challenges at hand.

2.1 Big Data Analytics

Digitization across industries is generating an enormous amount of data. Logs and events are now analyzed from every digital device, e.g., smartphones, smartwatches, and home automation sensors. Big data is often broken down into the Vs: volume, velocity, value, variety, and veracity, each of which attributes certain properties to the data. With advances in technology, data is being generated in such quantity and at such velocity that conventional data processing systems cannot cope with it.

There are two complementary approaches to dealing with big data: scale-up and scale-out. Scale-up, or vertical scaling, means adding more resources to the same node in a system as the demand for resources to manage and process big data increases. It works well in most cases, but a system can only be scaled up to a certain level, beyond which adding more resources becomes more expensive than the desired gain. Graphics Processing Units (GPUs)[17] and High-Performance Computing (HPC) clusters are common examples of vertically scaled systems.

Scale-out, or horizontal scaling, refers to adding more nodes to the overall system. In recent years, the cost of storage and computing has gone down drastically, which makes it easier to store and process huge amounts of data. Hence, horizontal scaling is used more often[18], even though it requires a huge amount of data transfer over the wire and complex synchronization protocols for coordination among the different nodes.

The data processing ecosystem has gone through different generations of big data processing tools to match the needs of the industry. This revolution started with the inception of MapReduce, followed by batch processing and stream processing.

2.1.1 Map Reduce

MapReduce [2] is an architecture for processing big data on distributed or vertically scaled systems. It parallelizes both data and computation. MapReduce has three stages, as shown in fig. 2.1: map, shuffle, and reduce. Map is the initial phase, where input data is read and key-value pairs are produced as intermediate outputs. Shuffle, also referred to as the sorting phase, then sorts and consolidates the intermediate data from all the mappers based on the keys. Finally, the reduce task runs the user logic on all intermediate values for the same key and produces the final output.

Figure 2.1: Map Reduce - Execution Flow [2]

MapReduce abstracts the system-level details from application developers and provides parallelization, fault tolerance, and data distribution under the hood. Developers can define the operations using functional programming, and the framework takes care of the parallel execution on the distributed system. It is a different style of programming, where developers specify what to do instead of how to do it through sequential steps. Apache Hadoop is one of the first frameworks with a MapReduce implementation and is considered the first generation of big data analytics.
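To make the three stages concrete, the following is a minimal word-count sketch against the Hadoop MapReduce API. The class names and the tokenization are illustrative assumptions, and the shuffle stage is carried out by the framework between the two classes.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map stage: emit (word, 1) for every token in the input split.
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            context.write(word, ONE); // intermediate key-value pair
        }
    }
}

// Reduce stage: the framework has already shuffled and grouped by key,
// so the reducer only sums the counts for one word.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}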

2.1.2 Batch Processing

Batch processing is a method and system architecture for processing a bounded data set, i.e., a finite amount of data. Examples include financial transactions from the last week being processed on weekends, or trading activity during the day being processed overnight. Hadoop MapReduce was slow because the data after each step goes to disk: it reads data from disk, performs an operation on it, and stores it back to disk, which hurts performance. Apache Spark[11], one of the most widely used batch processing engines in the industry, improved performance by storing the intermediate state of MapReduce operations in memory instead of on disk, as in fig. 2.2.

Figure 2.2: Map Reduce vs Apache Spark [3]

Apache Spark is a third-generation tool for big data analytics. It is a general-purpose batch and stream (close to real-time) processing engine. It also supports iterative processing, which is helpful for machine learning and other types of workloads. The RDD (Resilient Distributed Dataset) is at the core of Spark. Spark is much faster than MapReduce because it performs in-memory computation and optimizes processing. It provides high-level APIs in Java, Scala, and Python.
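As a rough sketch of the RDD API, the same word count looks as follows in Spark's Java API; the input and output paths are illustrative assumptions, and the intermediate pair RDD stays in memory rather than being written to disk between stages.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("word-count");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("hdfs:///input/corpus.txt"); // illustrative path

            JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum); // intermediate results stay in memory

            counts.saveAsTextFile("hdfs:///output/counts"); // illustrative path
        }
    }
}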

2.1.3 Stream processing

Stream processing is an alternative computing architecture for processing an unbounded data stream. Simply put, it is an approach to processing data in motion over an infinite data set. The data stream is unbounded, with no predefined beginning or end, and the data can be unstructured, semi-structured, or structured. Examples of stream processing applications include transaction fraud detection, monitoring user activity, and online recommendations.

Stream processing has two design principles: continuous and micro-batch processing. When a continuous stream processing engine receives an event from an input stream, it may trigger a computation on it, update the related aggregation, or store state for future events. In contrast, micro-batch processing divides the unbounded data stream into small subsets of bounded data sets and then processes each batch in an atomic operation. Apache Flink is a continuous streaming platform, while Apache Spark provides streaming support via micro-batch processing.

Stream processing also has two programming models: declarative or record-at-a-time. The declarative model is high-level programming where application developers specify what to compute instead of how to compute it for each new event. Alternatively, record-at-a-time is a low-level model in which each record is handed to the application developer's code for computation. It gives more fine-grained control but exposes a lot of complexity for application developers to handle, whereas declarative models abstract away the application complexities but give only coarse-grained control.

Stream processing engines have tackled a lot of challenges. They can handle larger volumes of data than other processing systems because they only need to store the relevant information after processing the incoming data stream. Contrary to batch processing, where all data is stored first and then processed, stream processing handles data points as they are ingested into the system. It provides insights in near real-time, compared to the high latency of batch processing. Batch processing can thus be considered scheduled processing, while stream processing is real-time data processing.

Stateful stream processing refers to data processing where the streaming engine holds state related to the computation. Although data processing engines are stateless by default, there are scenarios where developers need to maintain state during data processing. Stateful stream processing combines the computation and the state store in one platform and abstracts the complex state management away from application developers. State is required in most data processing cases. For example, a monitoring application checking for spikes in application usage will keep the last usage values in state to compare against the incoming values; once the result is calculated, the state is updated with the new values.

2.2 Apache Kafka

Apache Kafka is a fault-tolerant messaging system that captures and stores events from different data sources, such as APIs and IoT devices, in real time. It supports integration with many other systems, where it can act as an input data source. The granular unit of messages in Kafka is known as a topic. It also supports processing and routing events retrospectively.

The Apache Kafka connector[19] for Apache Flink provides capabilities to read and write data to Kafka with exactly-once guarantees. It supports moving back and forth in the event stream and replaying it, which is useful when an application needs to restart the input data stream from a certain point in case of a failure.
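A minimal sketch of consuming a Kafka topic from Flink with this connector is shown below. The topic name, broker address, and group id are illustrative assumptions; FlinkKafkaConsumer is the consumer class provided by the flink-connector-kafka module.

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class KafkaSourceJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // illustrative address
        props.setProperty("group.id", "flink-consumer");          // illustrative group id

        // On recovery, Flink restores the checkpointed Kafka offsets and
        // replays the stream from that point, enabling exactly-once state.
        FlinkKafkaConsumer<String> source =
            new FlinkKafkaConsumer<>("events", new SimpleStringSchema(), props);

        DataStream<String> stream = env.addSource(source);
        stream.print();

        env.execute("kafka-source-job");
    }
}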

2.3 Apache Flink

Apache Flink is a distributed processing engine for stateful computations over unbounded and bounded data streams, and one of the top Apache projects [20]. It is designed to process real-time streams, contrary to its predecessor frameworks. It supports windowing, out-of-order processing, iterative processing, and stateful stream processing computations. Additionally, it provides fault tolerance, parallelization, and reconfiguration of pipelines. DataSet and DataStream are the core APIs, but there are multiple higher-level APIs as well; Flink provides Java, Scala, Python, and SQL APIs for developing applications.

Apache Flink uses a master-slave architecture[21]: a job manager acts as the master, while task managers are the worker (slave) nodes. Apache Flink runs on the JVM (Java Virtual Machine). It supports various types of windowing, as windows are at the heart of stream processing. Flink stores state using state backends for stateful stream processing. Recovering from a failure to a specific point in time requires loading the state, which is supported by snapshotting the state at regular intervals.

2.3.1 Flink Architecture

Apache Flink has a layered software stack, like the OSI model. Each layer abstracts the details of the layers below it and makes it easier for the end-user to develop applications. The Flink runtime layer sits on top of the deployment layer; above it are the core APIs, followed by specialized abstraction libraries (FlinkML, Gelly), as in fig. 2.3.

Flink has two core APIs, DataSet and DataStream. The DataSet API is used for batch processing, while the DataStream API is used for streaming applications. These APIs provide interfaces to build applications, apply transformations on streams, and manage application state and timers. Flink also provides specialized APIs, e.g., Gelly for graph processing, Table and SQL for relational queries with a table representation, and FlinkML for machine learning pipelines. These APIs generate the logical job graph, which is then optimized by different API-specific optimizers to produce the physical graph. Later, this physical graph is executed on the actual nodes.

The streaming dataflow runtime, or Flink runtime, is a distributed system that schedules and executes different applications. There are mainly three components in a Flink cluster: the client, the JobManager, and the TaskManagers. The client module translates the application code from the API layer above and submits a graph to the JobManager. The JobManager acts as the master, and the TaskManagers are the worker (slave) nodes. The JobManager manages application deployment, monitoring, and execution. Finally, the TaskManagers do the actual work and coordinate all the resources

Figure 2.3: Apache Flink Architecture overview [4]

required to execute a task. With the help of the job manager and task manager nodes, the Flink runtime abstracts all these complexities from application developers. Flink offers a variety of deployment options as well. A cluster, either standalone or with YARN, is the most obvious one, since Flink is a distributed system. It can also run on a local machine for development or prototyping purposes. Additionally, cloud vendors, e.g., GCE or EC2, have support for deploying Apache Flink.

2.3.2 Flink Programming Model

Flink provides a functional programming model for streaming applications. The DataStream API provides support for applying transformations as higher-order functions such as map, reduce, and filter. A Flink program is composed of sources, sinks, and computations. It has lazy execution: programs are compiled and optimized before being executed on the distributed runtime.

Flink has a three-phase compilation, as shown in fig. 2.4. Initially, a logical graph is generated from the program source code; each transformation corresponds to a logical operator that executes event-based logic. After that, optimizations are applied to the different operators to fuse them wherever possible for better performance. Finally, when a Flink program is deployed, the physical execution graph is generated on the distributed system, with multiple instances of the same operator across multiple nodes. A minimal DataStream job illustrating this model is sketched after fig. 2.4.

Figure 2.4: Translation from Logical to Physical Execution Graphs [5]
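The sketch below shows the source-transformation-sink structure described above; the job name and element values are illustrative assumptions. Nothing runs until execute() is called, at which point the logical graph is optimized and deployed as a physical graph across the task managers.

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SimplePipeline {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Source -> transformation -> sink.
        DataStream<String> source = env.fromElements("a", "b", "a");
        source
            .map((MapFunction<String, String>) String::toUpperCase)
            .print();

        env.execute("simple-pipeline"); // lazy execution happens here
    }
}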

2.3.3 Windowing

Windowing is a crucial concept in stream processing frameworks, or whenever dealing with an infinite amount of data. In batch processing, the data is finite, so a computation can be applied to all of it at once; this is not possible with stream processing, because the input data is unbounded. Windowing is an approach that breaks the data stream into mini-batches, or finite streams, so that different transformations can be applied to them.

An Apache Flink window opens when the first data element arrives and closes when the criteria to close the window are met. The criteria can be based on time, a count of messages, or a more complex condition. There are different types of windowing strategies: tumbling, sliding, session, and global windows. Additionally, you can create your own, more complex implementations beyond the predefined ones.
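As a sketch of the windowing API, a tumbling processing-time window that sums a value per key could look as follows; the key field, window length, and element values are illustrative assumptions.

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowedSum {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Tuple2<String, Integer>> events =
            env.fromElements(Tuple2.of("sensor-1", 3), Tuple2.of("sensor-1", 5));

        events
            .keyBy(e -> e.f0)                                            // partition by sensor id
            .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))  // close window every 10 s
            .sum(1)                                                      // aggregate the value field
            .print();

        env.execute("windowed-sum");
    }
}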

2.4 Flink Application State

Application state is a first-class citizen in Flink, and it is required for most stream processing use cases. For example, when processing credit card transactions to monitor fraudulent activity, the last transactions need to be stored to identify malicious transactions; and when monitoring temperature spikes from IoT sensors, all new values are compared against older readings.

Apache Flink abstracts the state management complexities from application developers. It provides multiple state primitives, pluggable state backends, fault tolerance with checkpointing, and failure recovery mechanisms. It can handle large state sizes and supports redistribution of the state as well.

Flink supports two types of state: keyed state and operator state. For purely data-parallel stream operations, data can be partitioned based on keys, which makes computations and state management independent for each key. The keyed state is bound to a key and is used for keyed streams; Flink's KeyBy transformation is used to transform a data stream into a keyed stream. Operator state is also known as non-keyed state. It is declared at the level of a physical operator, and each operator state is bound to one parallel operator instance.

Both keyed state and operator state exist in two forms: managed and raw. Managed state is represented in data structures controlled by the Flink runtime. Raw state is seen as raw bytes by Flink, which knows nothing about the state's data structures; the operators keep the state in their own data structures. Using managed state is recommended, because Flink can automatically redistribute it when the parallelism is changed or the system scales in or out.

2.4.1 Keyed State

Managed keyed state provides state management for several primitive types. Each of these is scoped to the key of the current element, since we are operating on a keyed stream; the key is provided automatically by the system. The supported state primitives are as follows:

• ValueState<T>: like a key-value pair, where a value is stored against the currently scoped input key. The value is read using T value() and set using the update(T) method.

• ListState<T>: maintains a list of elements of type T. It is also scoped to the input element's key, so there is one list for each input key. ListState has add(T) and addAll(List<T>) methods to append to the state, update(List<T>) to overwrite it, and Iterable<T> get() to retrieve the list elements.

• ReducingState<T>: holds only the reduced, or accumulated, value for the current key. Values can be added with the add(T) method, but they are reduced by the user-provided ReduceFunction and a single value is stored. The value is fetched with T get().

• AggregatingState<IN, OUT>: similar to reducing state, with the exception that the reduced, or aggregated, type OUT can be different from the input type IN. It also stores a single value, obtained by applying the AggregateFunction to all the inputs provided via add(IN). The value can be retrieved with the get() method.

• MapState<UK, UV>: maintains key-value mappings, like a map data structure. Values for individual keys can be fetched using get(UK). It provides the methods put(UK, UV) and putAll(Map<UK, UV>) to set values in the state. Additionally, entries(), keys(), and values() return iterables over the stored key-value pairs, keys, and values respectively.

Flink state is accessed through a state descriptor for each type. Application developers provide a state name, value type information, and other parameters specific to that state type, such as a default value, a ReduceFunction, or an AggregateFunction. The state name is used to reference a state when multiple states of the same type are created. The value type information is used to serialize and de-serialize values. Flink's RuntimeContext provides methods for the various type descriptors.

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class CountWindowAverage
        extends RichFlatMapFunction<Tuple2<Long, Long>, Tuple2<Long, Long>> {

    private transient ValueState<Tuple2<Long, Long>> sum;

    @Override
    public void flatMap(Tuple2<Long, Long> input, Collector<Tuple2<Long, Long>> out)
            throws Exception {
        Tuple2<Long, Long> currentSum = sum.value();

        currentSum.f0 += 1;            // update the count
        currentSum.f1 += input.f1;     // update the running sum
        sum.update(currentSum);

        if (currentSum.f0 >= 2) {
            out.collect(new Tuple2<>(input.f0, currentSum.f1 / currentSum.f0));
            sum.clear();
        }
    }

    @Override
    public void open(Configuration config) {
        ValueStateDescriptor<Tuple2<Long, Long>> descriptor =
            new ValueStateDescriptor<>(
                "average",                                                 // the state name
                TypeInformation.of(new TypeHint<Tuple2<Long, Long>>() {}),
                Tuple2.of(0L, 0L));                                        // default value

        sum = getRuntimeContext().getState(descriptor);
    }
}

// This can be used in a streaming program like this (assuming we have a
// StreamExecutionEnvironment env):
// env.fromElements(Tuple2.of(1L, 3L), Tuple2.of(1L, 5L), Tuple2.of(1L, 7L),
//                  Tuple2.of(1L, 4L), Tuple2.of(1L, 2L))
//    .keyBy(0)
//    .flatMap(new CountWindowAverage())
//    .print();
// the printed output will be (1,4) and (1,5)

Listing 2.1: Keyed state example using value state to store the count

Listing 2.1 is an example of a FlatMapFunction using ValueState [22], which implements a counting window. The example applies the keyBy transformation on the first field of the input tuples and stores the count and a rolling sum. The ValueState holds a tuple whose first field is the count and whose second field is the running sum. Once the count reaches the required count of 2, the function triggers the window, emits the average value, and clears the state for the next window.

import java.util.ArrayList;
import java.util.List;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;

public class BufferingSink
        implements SinkFunction<Tuple2<String, Integer>>,
                   CheckpointedFunction {

    private final int threshold;

    private transient ListState<Tuple2<String, Integer>> checkpointedState;

    private List<Tuple2<String, Integer>> bufferedElements;

    public BufferingSink(int threshold) {
        this.threshold = threshold;
        this.bufferedElements = new ArrayList<>();
    }

    @Override
    public void invoke(Tuple2<String, Integer> value, Context context) throws Exception {
        bufferedElements.add(value);
        if (bufferedElements.size() == threshold) {
            for (Tuple2<String, Integer> element : bufferedElements) {
                // send it to the sink
            }
            bufferedElements.clear();
        }
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        // replace the previous checkpoint contents with the current buffer
        checkpointedState.clear();
        for (Tuple2<String, Integer> element : bufferedElements) {
            checkpointedState.add(element);
        }
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        ListStateDescriptor<Tuple2<String, Integer>> descriptor =
            new ListStateDescriptor<>(
                "buffered-elements",
                TypeInformation.of(new TypeHint<Tuple2<String, Integer>>() {}));

        checkpointedState = context.getOperatorStateStore().getListState(descriptor);

        // on recovery, re-populate the buffer from the checkpointed state
        if (context.isRestored()) {
            for (Tuple2<String, Integer> element : checkpointedState.get()) {
                bufferedElements.add(element);
            }
        }
    }
}

Listing 2.2: Operator state example

2.4.2 Operator State

Flink operator state is bound to each individual operator instance. Operator state is used by implementing the CheckpointedFunction interface, or the ListCheckpointed interface, in a stateful function. The CheckpointedFunction interface requires users to implement the snapshotState and initializeState methods. initializeState is called when a stateful function is invoked for the first time, or when it is recovering from an existing checkpoint; it therefore also contains the state recovery logic. The snapshotState method is called to create a checkpoint of the operator state.

Currently, the only supported managed operator state is of list type, similar to ListState in the keyed state. It is essentially a list data structure of serialized objects, so conceptually operator state is a big list of state elements returned by the different operators. When recovering the state, there are two types of redistribution schemes: even-split and union. Even-split redistributes the state equally across all the operators, while union shares the whole list with all the operators. The ListCheckpointed interface is a limited version of CheckpointedFunction; it provides only list operations with the even-split distribution.

Listing 2.2 is an example of operator state which implements a stateful SinkFunction using CheckpointedFunction. It buffers elements until it reaches the threshold; once it has enough elements, it emits them to the sink.

Managed operator state is accessed similarly to keyed state, through state descriptors of a list type, and it also requires a state name and value type information. However, Flink's RuntimeContext has a different method, getOperatorStateStore, for operator state. Depending on the type of distribution, one can use getListState or getUnionListState.

2.5 Flink State Management

Stateful stream processing engines provide support for dynamic reconfiguration and failure recovery while ensuring strong consistency guarantees. The state is maintained and stored in a state store, usually referred to as a state backend. It can be anything from a basic in-memory HashMap, to a persistent file system like HDFS, to distributed storage systems like Cassandra, to a local embedded store like RocksDB. The purpose of a state backend is to provide a reliable place where the stream processing engine can write to, and read from, the intermediate results of data processing. Although such capabilities can be achieved by integrating different storage systems, research is ongoing to provide these features transparently and take the integration challenges away from the users.

2.5.1 Spark State Backend

Spark[11], one of the well-known stream processing engines in the industry, provides both stateless and stateful streaming. Thanks to the stored state, Spark can reliably recover to the point of failure in case any of its components or nodes fails. It supports only one state backend (or state store, in Spark terminology) out of the box, HDFSBackedStateStore[23], which stores state in an in-memory HashMap backed by HDFS for fault tolerance. Storing the state in the in-memory HashMap of the compute node introduces challenges as the state size grows.

2.5.2 Flink State Backends

Apache Flink stores the state in persistent storage to ensure its consistency guarantees. The state can be stored in memory, in a local file, or on a distributed file system - referred to as the state backend. When a Flink application is running, the state is snapshotted periodically and stored in the state backend, if configured. Flink supports various types of storage systems through three types of state backends: MemoryStateBackend[16], FsStateBackend[16], and RocksDBStateBackend[16]. Each of these keeps the working state on the compute nodes themselves and hence provides local read and write performance for state operations. Application developers can configure a different state backend, but MemoryStateBackend is the default. Table 2.1 shows a comparison of the different Apache Flink state backends.

Name                 | Working State        | State Backup            | Snapshotting
MemoryStateBackend   | JVM heap             | JobManager JVM heap     | Full
FsStateBackend       | JVM heap             | Distributed file system | Full
RocksDBStateBackend  | Local disk (tmp dir) | Distributed file system | Full / Incremental

Table 2.1: Comparison of Flink State Backends [9]

2.5.2.1 Memory State Backend

The MemoryStateBackend uses hash tables to manage the state internally. These hash tables are stored on the Java heap of the task manager. When a snapshot is created, it is stored on the job manager's (master) heap. It also supports asynchronous snapshots to avoid blocking the stream processing pipeline. It is very performant and useful for smaller state sizes, as it cannot hold any state larger than the job manager's memory. It is not recommended for production use in state-critical applications, because in case of a system failure all state is lost from memory and there is no backup from which to recover.

2.5.2.2 File State Backend

The FsStateBackend is like the MemoryStateBackend, except that upon checkpointing it stores the state snapshot on a configured file system. A file system URL is required on initialization, and it supports HDFS as well as local file systems. It can also be used for applications with large state sizes.

2.5.2.3 RocksDB State Backend

The RocksDBStateBackend stores the state in RocksDB, an embedded key-value store, on each task manager node. Upon checkpointing, a snapshot is stored on a configured file system, as with the FsStateBackend, and all snapshots are executed asynchronously. A file system address or path is required at initialization, as it is used to store the snapshots. It can be used for highly available systems with large state size requirements. The RocksDBStateBackend can store as much state as there is disk space available, in contrast to the other two backends, where the size is limited by memory. On the other hand, performance is impacted because data must be serialized to be stored in RocksDB, compared to direct object manipulation on the Java heap. It is the only state backend that supports incremental checkpointing - an approach where the state delta between checkpoints is stored instead of a full state snapshot.
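A minimal sketch of selecting this backend in an application, using the Flink 1.x configuration API; the checkpoint URI is an illustrative assumption.

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BackendConfig {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Working state lives in embedded RocksDB on each task manager;
        // snapshots are written to the distributed file system below.
        RocksDBStateBackend backend =
            new RocksDBStateBackend("hdfs:///flink/checkpoints", true); // true = incremental

        env.setStateBackend(backend);

        env.fromElements(1, 2, 3).print(); // trivial pipeline for illustration
        env.execute("backend-config");
    }
}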

2.5.3 External State Approach

Traditionally, data processing pipelines consist of a cluster of compute nodes, which are typically virtual machines (VMs) with one or more embedded or network-attached storage disks, communicating over a fast network. Overall, this architecture works well as long as the dataflow graph of the pipeline is easily parallelizable.

There are different types of state that Flink maintains for its operations as well. For example, all the windowing operations that need to be aggregated must store their input somewhere until the trigger fires to perform the computation. Flink stores this on the compute node disk, which means there is a lot of data in streaming pipelines that needs to be preserved.

All the existing state backend implementations use embedded state, which comes with a few challenges. For example, when performing a re-scaling operation with the RocksDB state backend, if the application logic is compute-heavy and Flink's tasks are overloaded, scaling out means scaling out storage together with the compute tasks. Similarly, if the application logic is state-intensive, then scaling out storage also allocates more virtual CPUs, which are not required. Additionally, the problem with this approach, especially in the compute-intensive case, is that every time the application re-scales, it also needs to re-shuffle and move state around, which is expensive. So, it is time to rethink the traditional data processing architecture.

A better strategy, for both re-scaling and recovery purposes, is to decouple compute and state. If we have a compute-intensive task, we scale out compute only, and if we have a state-intensive task, we scale out the storage. Therefore, we only acquire the resources that the running pipeline actually needs.

Figure 2.5: Separate compute and storage [6]

It is more efficient and easier to auto-scale when state storage is separate from compute. On the left-hand side of fig. 2.5, we have an architecture with a streaming engine where state storage is separate from compute, while on the right-hand side is the traditional architecture that keeps state storage on the workers. In the traditional architecture, the unit of scaling is completely vertical: state has to scale together with compute. The decoupled streaming engine, in contrast, can scale compute separately from storage: if the pipeline is more compute-intensive, it launches more workers, and if the pipeline is more state-intensive, it increases the storage resources. Furthermore, Flink already provides excellent support for compute reconfiguration, but for decoupled storage we need a storage system that acts separately from Flink while leveraging existing Flink features, so that we can provide the best of both worlds.

Dataflow [24] is another unified stream and batch data processing engine, by Google. Initially, it used a persistent disk embedded in the worker node to store the state, with intensive memory caching to provide better performance. Later, they developed a state backend that separated compute from state storage to improve the scalability of streaming and batch pipelines.

2.5.4 Flink Re-scalable State

Apache Flink follows a shared-nothing architecture. Every distributed task on the compute nodes processes the data or key groups assigned to it and needs no external information from other nodes in the cluster. Currently, when a Flink cluster needs to be reconfigured, a checkpoint is triggered to snapshot the state to external persistent storage such as HDFS. Then the cluster is stopped, and the snapshotted state from the last checkpoint is used to redistribute state onto the reconfigured cluster nodes.

Figure 2.6: Reshuffling of keys while changing parallelism [7]

Keyed state is only available for keyed streams. Flink also ensures that all the events for a specific key are processed by one specific operator instance. The mapping between a key and an operator is computed through hash partitioning on the key[7]. During recovery or restart, all the subtasks can read the last checkpointed state based on the computed hash.

However, when we re-scale a keyed state backend, the computed hash values change because of the different number of nodes, which introduces a lot of random reads, as shown in fig. 2.6. Flink uses key groups to solve this challenge. A key group is the atomic unit of state assignment. Each subtask is assigned a range of key groups, which makes the reads on restore sequential within each key group.
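The assignment can be sketched as follows. This is a simplification modeled on Flink's KeyGroupRangeAssignment, not the exact implementation (Flink additionally applies a murmur hash to the key's hash code):

// Simplified sketch of key-group based state assignment.
public final class KeyGroupAssignment {

    // Map a key to one of maxParallelism key groups.
    static int assignToKeyGroup(Object key, int maxParallelism) {
        return Math.abs(key.hashCode() % maxParallelism);
    }

    // Map a key group to an operator subtask; each subtask thus owns a
    // contiguous range of key groups, keeping restore reads sequential.
    static int keyGroupToOperator(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128;
        int keyGroup = assignToKeyGroup("user-42", maxParallelism);
        System.out.println("key group " + keyGroup
            + " -> subtask " + keyGroupToOperator(keyGroup, maxParallelism, 4));
    }
}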

2.6 RocksDB State Backend

The RocksDB state backend stores the state as serialized byte strings. RocksDB is a key-value store based on a log-structured merge tree (LSM tree). Flink uses a serialized composite key consisting of the key, the key group, and the namespace. For every read or write operation, both keys and values need to be de/serialized, which can compromise performance compared to the in-memory state backends.

Figure 2.7: Flink RocksDB state backend

RocksDB has transient memory that acts as a cache, as in fig. 2.7: multiple memory tables used for read and write operations. A write operation stores the data in the currently active memory table. When an active memory table is full, it becomes a read-only memory table and is replaced by a new active memory table. The old, read-only memory tables are flushed to disk asynchronously; on disk they are called SSTables.

Similarly, a read operation always tries to read from the active memory table first. If it finds the data, it deserializes the value and returns the response, which is nearly equivalent to a memory access. If the key is not in the active memory table, the read-only memory tables are searched, from the most recent to the oldest. If the value is still not found, the SSTables are finally searched for the key. There are multiple possible optimizations to avoid hitting the disk.
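The lookup order described above can be sketched as follows; this is an illustrative simplification of the LSM read path, not RocksDB's actual code.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the LSM-tree read path described above.
final class LsmStore {
    private final Map<String, byte[]> activeMemTable = new HashMap<>();
    // Read-only (immutable) memtables, most recent first.
    private final Deque<Map<String, byte[]>> readOnlyMemTables = new ArrayDeque<>();

    byte[] get(String key) {
        byte[] value = activeMemTable.get(key);          // 1. active memtable
        if (value != null) {
            return value;
        }
        for (Map<String, byte[]> table : readOnlyMemTables) {
            value = table.get(key);                      // 2. immutable memtables
            if (value != null) {
                return value;
            }
        }
        return searchSsTables(key);                      // 3. on-disk SSTables
    }

    private byte[] searchSsTables(String key) {
        // Placeholder: a real store would consult bloom filters and the
        // SSTable index blocks before touching the disk.
        return null;
    }
}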

The RocksDB state backend uses an embedded RocksDB instance on each compute node, which provides data locality. It uses checkpointing to back up the persistent database log files to a distributed file system, e.g., HDFS or S3. The embedded RocksDB instances provide good performance, but they also create certain challenges and add reconfiguration delays. Furthermore, Flink stores the state as serialized bytes in RocksDB. This means that data has to be de/serialized with every read or write operation, which can compromise performance.

2.7 Flink Fault Tolerance

Flink provides a robust fault tolerance API to checkpoint applications and recover them from failure. It captures snapshots periodically to recover from failures. A snapshot is a global state of the system, storing enough information to restart the application from that specific state. Flink provides different delivery guarantees, ensuring that the application receives a record exactly once or at least once.

State management comes out of the box with Flink. While Flink abstracts the traditional state complexities from application developers, it has to do a lot more to provide stateful fault-tolerant applications: it needs to checkpoint the state frequently and restore it in case of failures. Checkpointing in a distributed system is more complex because of the dynamic and unpredictable nature of the network. Distributed snapshots - snapshots of a distributed system - at any point in time contain the state of all the processes (vertices) and their network connections (edges). Flink is a distributed stream processing engine, hence it uses a distributed snapshot algorithm for checkpointing; it leverages a variant of the famous Chandy-Lamport algorithm for snapshotting.

Flink requires a replayable data source in addition to the state backend for checkpointing. When an application fails, checkpoints are used to restore the application based on the snapshotted position of the data source, e.g., an Apache Kafka offset, to go back and replay the lost messages.

2.7.1 Checkpointing

Application checkpointing is a common technique in computer science to make applications fault-tolerant. In this approach, we make a copy of the application state, called a snapshot, at a regular interval, and store it. When the application fails, we restart it using the last saved snapshot. This helps streaming applications resume processing from the last checkpoint instead of starting all the calculations from the beginning.

Checkpointing is an expensive operation, hence it cannot be done after processing each record. A naive algorithm for checkpointing would be to pause or stop the application, back up the state from memory to some reliable storage, and resume the application. It makes no sense to run a checkpoint every second if taking a snapshot takes 2-3 seconds. Application developers need to weigh this trade-off and choose a checkpointing interval that achieves optimal performance.
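A minimal sketch of enabling periodic checkpointing and tuning this trade-off; the interval values are illustrative assumptions.

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot every 60 s; the interval is a trade-off between
        // checkpointing overhead and the amount of replay on recovery.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        // Leave some breathing room between consecutive checkpoints.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);

        env.fromElements(1, 2, 3).print(); // trivial pipeline for illustration
        env.execute("checkpointed-job");
    }
}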

Distributed systems generally use two different approaches for checkpointing: coordinated and uncoordinated checkpointing. Coordinated checkpoints are more complex, since all the nodes need to coordinate and align themselves before taking a snapshot: the state of all the compute nodes, as well as the in-flight messages on the network, must be captured. Generally, two-phase commit protocols are used for coordinated checkpointing. In uncoordinated checkpointing, all the nodes save their state at a certain interval without coordination. This does not guarantee global consistency, because each node may be at a different processing stage, and during recovery some of the messages may be lost or replayed multiple times.

Flink uses a variant of the Chandy-Lamport algorithm for coordinated checkpointing to draw consistent snapshots of the data stream and operator state [25]. It uses barriers to align the distributed data streams, and stores the snapshots in the state backends.

2.7.2 Consistent Snapshots - Chandy-Lamport

Determining the global state of a distributed system in order to capture a snapshot is a common challenge. The global state does not only include the state of the individual nodes in the distributed system, but also the messages on the communication channels. A consistent snapshot is important for a distributed system for various reasons: it is used to determine safe points, find out the current load of the system, and verify whether a deadlock exists. It is also widely used to detect whether a distributed algorithm has terminated.

Capturing a snapshot of a distributed system poses quite a few challenges. One cannot capture all the processes at the same time, and the messages flowing on the communication channels cannot be observed. Nodes do not have identical processing times, and network channels always have varying delays. By the time some nodes capture their state, it will already be out-of-date, since other nodes might have received messages from the future. A naive approach to consistent snapshots would require a globally synchronized clock, such that each node could snapshot its state at the exact same time. Since such a clock is not available in practice, a few algorithms exist that capture a consistent past global state without one. One of the well-known algorithms is Chandy-Lamport, which uses a consistent cut to capture a consistent snapshot, as in fig. 2.8. A cut in a distributed system is considered consistent if, for each event it contains, it also contains all the events that happened before it.

Figure 2.8: An example of an inconsistent (C1) and a consistent cut (C2) [5]

Chandy and Lamport suggested a model of a distributed system where all the processes (nodes) have their own state. It assumes that the graph is strongly connected and that all the processes are connected by channels which follow these rules:

• FIFO guarantees

• Infinite buffer size

• Messages are delivered in a finite amount of time

Marker messages are sent on the channels to trigger the consistent cut and record the messages. This model is used as the basis for many other distributed algorithms, but stream processing engines need to enhance these algorithms to capture their snapshots.
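As an illustration of the marker rule, the following is a minimal sketch of one Chandy-Lamport process; the class, its method names, and the channel representation are our own simplifications, not taken from any real system.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SnapshotProcess {
    private final Set<Integer> inputChannels;       // ids of incoming channels
    private Object recordedState = null;            // local state captured at snapshot time
    private final Set<Integer> recordingChannels = new HashSet<>();
    private final Map<Integer, List<Object>> inFlight = new HashMap<>();

    public SnapshotProcess(Set<Integer> inputChannels) {
        this.inputChannels = inputChannels;
    }

    /** Marker received on input channel `channel`. */
    public void onMarker(int channel, Object currentLocalState) {
        if (recordedState == null) {
            // First marker seen: snapshot the local state, forward markers
            // downstream, and start recording every other input channel.
            recordedState = currentLocalState;
            sendMarkerOnAllOutputChannels();
            recordingChannels.addAll(inputChannels);
        }
        // The channel the marker arrived on is now fully recorded.
        recordingChannels.remove(channel);
    }

    /** Normal message received on input channel `channel`. */
    public void onMessage(int channel, Object message) {
        if (recordingChannels.contains(channel)) {
            // Message was in flight when the snapshot started: record it.
            inFlight.computeIfAbsent(channel, c -> new ArrayList<>()).add(message);
        }
        // ...normal processing of the message would continue here...
    }

    private void sendMarkerOnAllOutputChannels() {
        // Network send omitted in this sketch.
    }
}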

2.7.3 Flink 2PC Protocol

Apache Flink uses a variant of the Chandy-Lamport algorithm, the epoch commit protocol [5], and performs an asynchronous epoch commit. It records the state between two consecutive checkpoints and stores it in permanent storage. Since it is an asynchronous operation, it does not block normal stream processing; marker messages, called checkpoint barriers, are used to capture the consistent cuts and create the snapshot to be stored in external storage. Apache Flink also allows multiple checkpoints to run at the same time, and the end user can configure the delay between consecutive checkpoints.

Apache Flink runs a two-phase commit protocol to ensure transactional guarantees: the job manager first asks all task managers to prepare the snapshot and, once it has received acknowledgments from all of them, sends a commit message to complete the checkpoint. Concretely, the snapshot coordinator process, running at the job manager, initiates an instance of the epoch commit protocol. It triggers an epoch change on all the task managers, which then run the snapshotting algorithm and send out the marker messages. All the tasks snapshot their state and send a success acknowledgment back to the coordinator; this preparation phase finishes only when acknowledgments have been received from all the task managers. If any acknowledgment times out or fails, the checkpointing process aborts.

The commit phase is initiated by the snapshot coordinator sending a commit-epoch message, only after all prepare acknowledgments have been received successfully. All the task managers then commit the prepared state to their backends or to external storage, depending on the type of the backend.

2.7.4 Flink State Checkpointing

Apache Flink checkpoints or savepoints are also used to upgrade a streaming application seamlessly without losing the running application's state. The application state is captured with a savepoint before shutting the application down, and then the application updates are deployed. On starting again, it loads the savepoint and resumes the application where it left off.

Savepoints are user-triggered checkpoints where Apache Flink takes a snapshot and stores it on the state backend, similar to checkpointing. Savepoints are not removed automatically, as they are used to back up the state of the application, while checkpoints are temporary and removed after program execution unless explicitly retained.

Under the hood, the different state backend serializers behave very differently. The HeapStateBackend keeps Java objects and uses lazy serialization and eager deserialization. The RocksDBStateBackend, on the other hand, uses eager serialization and lazy deserialization. Apache Flink supports serialization of the basic Java types and some composite types while storing the state, including Tuples, POJOs, and Apache Avro. For any other generic type, it falls back to Kryo serialization and deserialization unless a custom serializer is provided.

Flink uses checkpoint barriers to capture the consistent snapshots that are used to restore the state in case of a failure. Flink checkpointing is asynchronous [25] and does not block normal stream processing. Barriers are lightweight, and multiple barriers can be in the data stream at the same time. The snapshot coordinator initiates the checkpointing, and barriers are injected into the data stream at the source operators. Once an operator has received a barrier from all of its sources, it takes its snapshot and emits a barrier for the downstream operators. Once the sink operator has received the barriers from all of its incoming channels, it acknowledges the checkpoint completion to the snapshot coordinator.

Flink supports incremental snapshotting as well. Incremental checkpoints only snapshot the state difference between the last checkpoint and the current state. This is helpful for applications with large state, as checkpointing the delta reduces the snapshot size, which in turn reduces the time to complete a checkpoint. Flink's robust snapshotting provides a strong recovery mechanism: when a failure happens, Flink chooses the last completed checkpoint and reloads the state of the different operators from the snapshot. For incremental snapshots, it copies the last full snapshot and applies all the incrementally snapshotted state changes.
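As an illustration of these options, the sketch below enables exactly-once checkpoints on the RocksDB state backend with incremental checkpointing and retained (savepoint-like) checkpoints; the checkpoint URI is a placeholder.

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class IncrementalCheckpointExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Exactly-once checkpoints every 10 seconds.
        env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);

        // RocksDB backend with incremental checkpointing enabled: only the
        // delta since the last checkpoint is written to durable storage.
        // The HDFS path is a placeholder.
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));

        // Keep the last checkpoint when the job is cancelled, so it can be
        // used for a manual restart, mirroring savepoint semantics.
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
    }
}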

2.8 Transactional Processing for Streaming Applications

OLTP (online transaction processing) and OLAP (online analytical processing) are commonly used terms for classical data processing applications. OLTP captures, processes, and stores data from real-time transactions, while OLAP uses complex queries to extract information from the aggregated historical data stored by OLTP. Transactional processing is the core concept of OLTP and ensures the atomicity of multiple transactions or units of work. These systems are not designed for unbounded and continuous streams of input data.

On the other hand, most modern stream processing platforms [13, 11, 12] are designed to process unbounded streams. Such systems typically run in main memory to avoid the extreme latency caused by disk access, and they are not inherently designed for ACID transactions. Hence, they bind storage and compute to the same system to leverage local read/write operations. Separating compute and storage can enhance scalability, but it requires designing systems in which all the data is stored on a networked or distributed system. Such systems use local storage only for transient data, which can be re-read from persistent storage in case of failure. Having storage on different nodes introduces the need for transactional processing, due to factors such as unreliable network communication or failure of the remote storage systems.

S-Store is the world's first transactional streaming database system [26], in which a workflow is modeled as a dataflow graph of transactions. Transaction processing guarantees coordination safety for distributed storage. It provides three streaming guarantees: ACID (Atomicity, Consistency, Isolation, Durability) transactions, data ordering, and exactly-once execution. It is built on top of H-Store, a distributed main-memory OLTP system.

2.9 NDB Database

NDB (Network Database) is an in-memory, highly available, distributed storage engine. It is widely used in the telecom industry but is mostly known as NDB Cluster, since it is the underlying storage engine of MySQL Cluster. NDB uses a shared-nothing architecture, which makes it possible to build the cluster from commodity hardware: each machine has its own resources, and network communication is limited.

An NDB cluster usually has two types of nodes: management nodes and data nodes. A cluster should have at least one management node and as many data nodes as needed. Management nodes act as leaders and are responsible for configuration and network partition management; they also orchestrate starting and stopping the data nodes, and running backups [27]. Data nodes are at the core of the cluster: they store the actual data and perform the main work of the queries.

Figure 2.9: NDB Cluster [1]

NDB can have up to 48 data nodes, and the recommended practice is to keep two replicas of the data; higher replication factors require more data nodes in the cluster. NDB handles node failures automatically and transparently to the end user. It can handle failures safely because of data node replication: the cluster survives as long as one replica of each partition is still alive, so if a node goes down, a running transaction will fail, but the next attempt will get the data from a replica node. NDB provides good performance because data is stored in memory, but data is also persisted to disk through checkpoints.

Sharding and partitioning of the data across multiple nodes is handled by NDB, but can be controlled and customized by applications. The cluster is organized into node groups consisting of one or more nodes that store partitions, i.e. sets of fragment replicas. The number of node groups is not directly configurable; it is the number of data nodes divided by the number of replicas, so, for example, 4 data nodes with 2 replicas form 2 node groups.
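For illustration, the sketch below accesses NDB through ClusterJ, the native Java connector that talks directly to the data nodes; the connect string, database name, and the mapped table kv_state are assumptions for the example.

import java.util.Properties;
import com.mysql.clusterj.ClusterJHelper;
import com.mysql.clusterj.Session;
import com.mysql.clusterj.SessionFactory;
import com.mysql.clusterj.annotation.PersistenceCapable;
import com.mysql.clusterj.annotation.PrimaryKey;

public class NdbAccessExample {

    // Hypothetical table mapping; the table kv_state is an assumption.
    @PersistenceCapable(table = "kv_state")
    public interface KvRow {
        @PrimaryKey
        int getId();
        void setId(int id);

        byte[] getValue();
        void setValue(byte[] value);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Management-node address and database name are placeholders.
        props.setProperty("com.mysql.clusterj.connectstring", "mgmt-node:1186");
        props.setProperty("com.mysql.clusterj.database", "flinkndb");

        SessionFactory factory = ClusterJHelper.getSessionFactory(props);
        Session session = factory.getSession();

        // Primary-key read: routed directly to the data node owning the partition.
        KvRow row = session.find(KvRow.class, 42);

        // Primary-key write: creates or overwrites the row with id 42.
        KvRow updated = session.newInstance(KvRow.class);
        updated.setId(42);
        updated.setValue(new byte[]{1, 2, 3});
        session.savePersistent(updated);   // upsert semantics

        session.close();
    }
}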

2.10 Summary

In this chapter, we introduced a number of fundamental concepts about Apache Flink and the technologies related to FlinkNDB. We started with the different generations of big data analytics and discussed the state-of-the-art solutions. Afterward, we focused on Apache Flink and reviewed its architecture, programming model, and supported streaming features. We covered state management in detail, discussing the different types of state, state backends, and how failure recovery is supported by state backends. We then described transactional processing in streaming platforms. Finally, we gave a brief introduction to NDB and its capabilities as a distributed database. With this background established, chapter 3 focuses on the design, architecture, and implementation that address the dynamic scalability challenge.

Chapter 3

Design and Implementation of FlinkNDB

Flink state is embedded in the compute nodes. To meet our goal, we need to decouple it from the compute nodes to speed up reconfiguration. The FlinkNDB state backend is designed to reconfigure a running application with minimal delays. The core principle is to decouple the state from the compute nodes while still providing read and write performance equivalent to memory. The other consideration is to achieve better failure recovery and snapshotting performance compared to the other state backends.

A state backend is responsible for handling operator state as well as keyed state. FlinkNDB uses the default Flink implementation for operator state, while it provides a custom implementation of the keyed state backend. A keyed state backend has various types of state descriptors that need to be implemented to store and read the data from NDB; the major ones are value state, list state, and map state. NDB provides good performance because it is an in-memory database, but processing delays still need to be taken care of, because data is transferred over the wire.

The different state types use individual tables to store the state in the database. FlinkNDB has three tables for each type of state: an active state table, a commit state table, and a snapshot table. The active state represents the state of the application at any point in time, while the commit state stores the snapshotted state for each epoch. The snapshot table records the epochs committed in the commit table and is mainly used for snapshotting and recovery purposes.

FlinkNDB was implemented in two phases. Sruthi [1] implemented the bare-bones state backend, which stored everything in the active state tables. This thesis complements Sruthi's work [1] in the same code base, forming the basis of the open-source backend and adding support for checkpointing and recovery. The following sections explain the basic structure, the specifics of each type of state, and different possible implementations with their challenges. Furthermore, they present how FlinkNDB does snapshotting and handles failure recovery.

3.1 FlinkNDB Architecture

NDB gives robust performance with the right use of primary keys and partition groups. Flink key-groups are used for partitioning across the NDB cluster, so when FlinkNDB performs read and write operations, it always reads the state from a specific partition. Hence, when it reads by primary key from the right partition, performance should be close to in-memory reads and writes.

The FlinkNDB state backend has an application layer, a transient or cache layer, and finally a persistent or database layer. Flink sits at the application layer, where the user reads and writes the different states through an abstract view, without knowledge of the underlying complexities of the different state backends. Flink manages the compute tasks and delegates every read and write operation to the FlinkNDB state backend.

3.1.1 Cache Layer

FlinkNDB's state storage has an in-memory cache and a persistent store, a distributed database. It implements a light form of multi-version concurrency control on top of NDB to support all necessary Flink operations. The in-memory cache, the transient layer, acts similarly to the RocksDB memtable: it logs all recent changes to the state. The main difference is that a cache hit gives performance equivalent to a direct in-memory operation, since values in the cache are not serialized or deserialized, in contrast to what RocksDB does.

3.1.2 Database Layer

At the database layer, FlinkNDB has two tables: the active and commit state tables. The active table represents the state values at any point in time, while the commit table maintains the committed state values over the different epochs. On a cache miss, the record is fetched externally from a special table in NDB which keeps the latest active state of a running Flink application, called the active table. All fetched values are also kept in the cache for fast future access.

In order to speed up state commits, FlinkNDB also maintains an in-memory write-ahead log to track the incremental changes, i.e. the writes between Flink checkpoints. When state needs to be checkpointed, or whenever an entry needs to be evicted from the cache, the write-ahead log is flushed into another special table in NDB, referred to as the commit table. The commit table logs all durable changes across checkpoints, as it stores the values for each checkpoint epoch. It effectively takes the role that HDFS plays for RocksDB, yet it is orders of magnitude faster in committing changes.

Once a checkpoint happens, FlinkNDB spills the whole write-ahead cache over to the database layer, into the committed state tables. Additionally, if the transient cache fills up, it is synchronized to the persistent state in the database as well. Finally, in the case of a failure, the commit table is used to recover the state into the active table.

3.1.3 Primary Key

FlinkNDB uses a composite primary key with four columns: key, key-group, state name, and namespace. The key is the partition key of the keyed data stream; Flink ensures it has a unique value across the keyed state. A single key-group can contain multiple keys. The namespace and state name are used to differentiate between different states on the same key. A state name column was not required in the other backends, because they keep separate database instances, but FlinkNDB has a single repository for all the state from all the different compute instances of the cluster.
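A plain-Java sketch of such a composite key is shown below; the class is illustrative and not FlinkNDB's actual implementation.

import java.util.Arrays;
import java.util.Objects;

public final class StateKey {
    private final byte[] key;        // serialized partition key of the keyed stream
    private final int keyGroup;      // Flink key-group the key hashes into
    private final String stateName;  // state descriptor name, e.g. "counts"
    private final String namespace;  // disambiguates equal state names

    public StateKey(byte[] key, int keyGroup, String stateName, String namespace) {
        this.key = key;
        this.keyGroup = keyGroup;
        this.stateName = stateName;
        this.namespace = namespace;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof StateKey)) return false;
        StateKey other = (StateKey) o;
        return keyGroup == other.keyGroup
                && Arrays.equals(key, other.key)
                && stateName.equals(other.stateName)
                && namespace.equals(other.namespace);
    }

    @Override
    public int hashCode() {
        return Objects.hash(Arrays.hashCode(key), keyGroup, stateName, namespace);
    }
}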

3.2 NDB Schema

FlinkNDB stores the state in an active table [1], which is used for both reads and writes. Its primary key consists of key, key-group, state name, and namespace. The key is the Flink state key, which is used to partition the input stream, since this is a keyed state backend. A key-group is an atomic unit for partitioning the state data at a granular level. The state name can be based on the Flink operator or provided by the application developer. Finally, the namespace is also part of the key: since FlinkNDB stores all the state values in a single table, the namespace differentiates occurrences of the same state name in two different namespaces.

The active state schema does not support multiple versions of the same value across different checkpoints. FlinkNDB needs a schema such that, if a failure happens, it can differentiate between the current values and all the changes that have happened since the last successful checkpoint. So, to maintain different versions of the value for the same key, an additional version column needs to be added to the active state table.

The initial approach was to add a checkpoint-epoch column to the active table, maintaining the state versions after each checkpoint. For each epoch, a new copy of the state value is written into the active table, so after the nth successful checkpoint, the active state table holds n versions of a specific key. This schema change, the addition of the epoch column, made it possible to replay the state values over the different epochs for querying and debugging purposes. It was a reasonable approach, but it adds long delays during recovery, because it requires group-by operations on the active state table.

Figure 3.1: FlinkNDB initial architecture

A single-purpose active state table has quite a few challenges. For example, all the state values are written on each epoch, and there is no way to do incremental checkpointing, which adds processing delays and requires extra storage. More importantly, this design does not use the primary key on NDB. So, the schema needs to be designed such that it stores only the changes over time and provides a way to restore from the historical log efficiently.

3.3 Checkpointing

State is critical to Flink's operation, because any function or operator can be stateful and may need to store data for each input element processed. Hence, Apache Flink checkpoints the state periodically, so that it can restore the complete state of a running application in case of failure and resume processing as if nothing had happened. Apache Flink allows application developers to configure the checkpoint interval in the code or through the Flink configuration file. As checkpoints are intended for fault tolerance, they are transparent to the end user and deleted once a job is cancelled or finished. Flink can be configured to retain them, in which case the user needs to delete them manually. While a job is running, Flink retains only the n most recent checkpoints (n being configurable).

When Flink's snapshot coordinator starts a snapshot, it inserts markers into the data stream, as shown in fig. 3.2, and each task eventually receives a marker. Once a task has received the marker on all of its inputs, its snapshot method is invoked.

The snapshot table has key-group, namespace, and epoch as its primary key, because Flink tasks and state have different granularity: state is always fragmented by key-group, and a single task can own many key-groups.


Figure 3.2: Flink injection of barriers into data stream [8]

All key-groups within a namespace need to be marked with a status for a specific epoch. In Flink semantics, the namespace identifies the operator being snapshotted. So, an operator has key-groups, and each key-group has values snapshotted in a specific epoch, marked with a completion status. When an application runs, state is stored in both the active and commit state tables. Once a checkpoint triggers, the snapshot table is populated with all the respective key-groups, marking that the commit state table contains the final values for that epoch. Writes happen in order, as Flink performs the alignment automatically, so FlinkNDB does not need to worry about out-of-order elements at the epoch level. If the application crashes before the completion of the next checkpoint, the state values from the commit table are used to restore the state to that point in time.

FlinkNDB does incremental checkpointing by default, because state granularity is at the key-group level. This means that, for any given key, it writes to the database only if there were changes during that epoch; otherwise nothing is logged in the database. The snapshot table keeps track of which epoch has been committed for a specific key and is used to perform recovery.

When recovering from the commit state table into the active state table after a failure or reconfiguration, FlinkNDB cannot determine on its own which epoch to use to restore the values. Since FlinkNDB writes to both the active and commit state tables, the commit state table can contain values for multiple epochs. For example, if the application crashes after the nth successful epoch but before the next checkpoint, the commit state table will also contain dirty values for the (n+1)th epoch. Hence, the snapshot table is consulted to get the last successfully committed epoch, which is then used to reconstruct the active table from the commit table.

The current implementation stores the metadata on the file system, as the RocksDB state backend does, but all the application state is stored in the database. During checkpointing, the metadata is stored at the user-configured location; on restart after a failure or reconfiguration, FlinkNDB loads the metadata and populates the data from the database.
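The sketch below outlines this checkpoint path; the two table interfaces and their method names are hypothetical stand-ins for the NDB access code.

import java.util.Map;

public class EpochCheckpointer {

    interface CommitTable {
        // Writes one state entry for the given epoch (epoch is part of the PK).
        void write(long epoch, Object stateKey, byte[] value);
    }

    interface SnapshotTable {
        // Marks the epoch as fully committed for one key-group and namespace.
        void markComplete(int keyGroup, String namespace, long epoch);
    }

    private final CommitTable commitTable;
    private final SnapshotTable snapshotTable;

    EpochCheckpointer(CommitTable commitTable, SnapshotTable snapshotTable) {
        this.commitTable = commitTable;
        this.snapshotTable = snapshotTable;
    }

    // Invoked when a checkpoint barrier has been received on all inputs.
    // Only keys modified during the epoch are in the write-ahead cache,
    // which is what makes the checkpoint incremental.
    void snapshot(long epoch, int keyGroup, String namespace,
                  Map<Object, byte[]> writeAheadCache) {
        for (Map.Entry<Object, byte[]> entry : writeAheadCache.entrySet()) {
            commitTable.write(epoch, entry.getKey(), entry.getValue());
        }
        // Recovery only trusts epochs recorded here; anything newer in the
        // commit table is treated as dirty and ignored.
        snapshotTable.markComplete(keyGroup, namespace, epoch);
        writeAheadCache.clear();   // start collecting deltas for the next epoch
    }
}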

Figure 3.3: FlinkNDB state backend architecture with cache

3.3.1 NDB Schema Enhancements

FlinkNDB leverages the primary key for all possible operations on NDB, so the active table was split into 3 different tables - active, commit, and snapshot - as in Fig. 3.4. The active state table represents the state value at any point in time, while the commit table maintains state versions for recovery purposes. The snapshot table maintains the metadata indicating which epoch needs to be used to restore the data from the commit table.

Figure 3.4: FlinkNDB - NDB Table Schema

With the changed schema, the active state table is used for reads and writes, but all operations are also written to the commit state table to keep track of the state value after each epoch. The active state table schema includes an additional epoch column. The epoch is not part of its primary key, which means the table cannot hold duplicate entries for a key: each write during a different epoch overwrites the value of the same key, so the active state table always returns the latest value, which is what is required during normal operation with no failures.

The commit state table has the same columns as the active table, but a different primary key: the commit table includes the epoch in the primary key, so each row uniquely represents a specific key of a specific operator at a specific epoch. The commit state table therefore has an entry per epoch, like a log of all the operations at the end of each epoch. Contrary to the active state table, which only keeps the latest value, the commit table holds the state values over the different epochs, because the epoch is part of the primary key.

3.4 State Type Schema

Value state is the simplest among the state types supported by Flink: it stores a single value for each key of the keyed stream. So far, mainly the value state schema has been discussed, but Flink supports other types as well, as described in 2.4.1.

3.4.1 List State

List state stores a list of elements. It has a different schema than value state, because the state backend needs to store the list values along with their indices; hence, an additional column is introduced in the table to keep the index of each list item. All database read operations happen on the active table, and write operations are done on both the active and commit tables.

As mentioned earlier, FlinkNDB leverages the primary key for most database operations to provide fast, comparable performance. Reading an element row is a primary-key operation, with the specified index as part of the primary key. Reading the whole list can be implemented with two different approaches: either use an iterator and do a batched read of primary-key operations, or do a non-primary-key operation without the index. The current implementation uses the first approach, leveraging primary-key operations with all the requests batched together to speed up performance.

On the other hand, adding an element at a specific index of the list is a tricky operation, and Flink does not support it; inserting an element appends the value at the end of the list. FlinkNDB needs to know the list size to determine where to put the element. A first approach would be to read the whole list and count its size before doing the insert, but this requires multiple operations and does not use the primary key. Instead, FlinkNDB has an additional table that maintains the number of items in each list; it is used to read the index at which a new element should be inserted using a primary-key operation, as sketched below.
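The sketch below shows that append path; both table interfaces and their method names are hypothetical.

public class ListStateAppend {

    interface ListTable {
        // Primary-key write of one list element at (stateKey, index).
        void writeElement(Object stateKey, int index, byte[] element);
    }

    interface ListSizeTable {
        // Returns the current number of elements for the key (PK read).
        int getSize(Object stateKey);
        void setSize(Object stateKey, int size);
    }

    private final ListTable listTable;
    private final ListSizeTable sizeTable;

    ListStateAppend(ListTable listTable, ListSizeTable sizeTable) {
        this.listTable = listTable;
        this.sizeTable = sizeTable;
    }

    // Appends at the end of the list using only primary-key operations.
    void add(Object stateKey, byte[] element) {
        int size = sizeTable.getSize(stateKey);   // next free index
        listTable.writeElement(stateKey, size, element);
        sizeTable.setSize(stateKey, size + 1);
    }
}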

3.4.2 Map State

Map state stores a hash map against each key of the keyed state. It is used for partitioned key-value state, where key-value pairs can be added, updated, and removed. The state is accessed and modified by user-defined functions and checkpointed consistently. Map state has a different schema to support storing and retrieving individual map entries efficiently, similar to the other state types: an additional column stores the user key, and it is part of the primary key. When a specific user key is requested, this column is used, along with the other primary-key columns, to fetch the value.

3.5 Cache Optimizations

FlinkNDB stores the state in a database over the network, in contrast to the in-memory operations of the RocksDB state backend. Hence, FlinkNDB's performance suffered from the network delays of read/write operations. To overcome that challenge, a caching layer was introduced between the state backend and the database.

The caching layer, or transient state, consists of two types of cache - active and commit - mirroring the active and commit state tables in the database, as shown in fig. 3.3. It is an in-memory cache and stores values without serialization. FlinkNDB implements a light form of multi-version concurrency control on top of NDB to support all necessary Flink operations. The in-memory cache acts somewhat like the RocksDB memtable: it logs all recent changes to the state. The main difference is that a cache hit performs like a direct in-memory operation, since no serialization or deserialization of cached values is required, in contrast to RocksDB. So, all reads and writes happen on the cache, and state values are committed at checkpoints or when cache entries are evicted. Since reads and writes are served from the in-memory cache, performance is equivalent to an in-memory state backend.

3.5.1 Active Cache

The active state cache stores all recent reads, like any normal cache. Whenever the application requests the state value for a key, FlinkNDB checks the cache first. On a cache miss, the value is fetched externally from the active state table in NDB, which always keeps the latest active state of the running Flink application. All fetched values are also kept in the cache for fast future access.

At the start of the application, the active state cache is empty, and it is populated during the warm-up phase. Active cache entries do not expire except when the cache is full; in that case, the least recently used (LRU) entries are removed to make room for new values.

3.5.2 Commit Cache

In order to speed up state commits, FlinkNDB also maintains an in-memory write-ahead log, referred to as the commit cache, to track the incremental changes, i.e. the writes between Flink checkpoints. The commit cache stores the state delta over an epoch. When the application starts, the commit cache starts capturing all the writes. When a checkpoint triggers, all the values are written to the NDB database and the cache is cleared to store the changes for the next epoch, as shown in fig. 3.5. In case of a failure, the most recent committed version in the commit table is used to reconstruct the active table.

Figure 3.5: FlinkNDB cache activity diagram

3.5.3 Cache Implementation

We used the Caffeine cache [28], a Java-based, high-performance, near-optimal caching library. It provides quite a few useful features, such as frequency- and recency-based eviction, time-based expiration, automatic loading of entries into the cache, and so on. There are various eviction strategies for individual cache entries when the cache is full or a time-based expiry happens. It allows us to configure the number of entries to be stored in the cache along with the expiry time.
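A minimal construction of such a cache with Caffeine could look as follows; the capacity, expiry, and the body of the eviction listener are illustrative.

import java.time.Duration;
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.RemovalCause;

public class ActiveCacheExample {
    public static void main(String[] args) {
        Cache<String, byte[]> activeCache = Caffeine.newBuilder()
                .maximumSize(100_000)                        // illustrative capacity
                .expireAfterAccess(Duration.ofMinutes(10))   // illustrative expiry
                .removalListener((String key, byte[] value, RemovalCause cause) ->
                        // Placeholder: an evicted entry would be flushed to NDB here.
                        System.out.println("evicted " + key + " (" + cause + ")"))
                .build();

        activeCache.put("key-1", new byte[]{1, 2, 3});       // write stays in memory
        byte[] hit = activeCache.getIfPresent("key-1");      // no serialization on a hit
    }
}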

3.6 Recovery

Flink leverages checkpointing and message replay to ensure robust fault tolerance. Each checkpoint stores the state snapshot at a point in time, along with the offset of the streaming input. Each checkpoint combines the snapshots produced by the individual tasks and bundles them together with the state handles for all the tasks. So what happens when an application crashes? It recovers from the latest complete snapshot: all the state handles are sent to their respective tasks, which restore their respective state, and the input messages are replayed from the offset recorded with the snapshot.

When an application is started from a snapshot, or restarted due to a failure, the recovery strategy is determined by the type of checkpointing, either incremental or complete. Flink assigns state handles to the different tasks to recover the state; when re-scaling, the state handles are redistributed to a different set of tasks than the ones that wrote them at checkpoint time. The recovery operation goes through all the state handles and reads the metadata to determine the different types of state (value, list, map) that need to be initialized with the respective type of serializer. Once all the metadata has been read and the data structures have been initialized, it reads the state data based on the key-group partitioning.

3.6.1 RocksDB Approach

RocksDB is an embedded database, so it stores all the state on the compute nodes, while snapshots are stored on HDFS or another distributed file system. Using a distributed file system as snapshot storage has its own pros and cons. It makes the state accessible to all the nodes in the cluster, which makes it easy to redistribute across nodes when re-scaling, and it makes the snapshots fault-tolerant as well. On the other hand, it has one major downside: all tasks have to read the state over the wire, which increases the recovery time. For recovery, the state is copied from the distributed file system to the compute nodes, which takes time due to the data movement over the network. Hence, recovery time is directly proportional to the state size: the larger the state, the longer it takes to recover.

3.6.2 FlinkNDB Approach

FlinkNDB performs minimal work for recovery, because the only step needed is to rebuild the active table from the commit table. There is no data movement over the network, because the data is recovered from the checkpointed snapshot on the same data node, from the commit table to the active table. There are two different approaches to recovery: eager and lazy. Eager recovery loads all the data into the active tables before starting the application, which may take some time depending on the state size. Lazy recovery, on the other hand, performs almost no work up front and starts processing right away, loading the data during the application's warm-up phase.

3.6.2.1 Eager Recovery

Our initial implementation uses the eager recovery approach. Once a checkpoint happens, the state is stored in the active and commit tables, and the metadata is stored in the auxiliary snapshot table. During epoch processing, if an eviction happens on the cache, the evicted entry is also stored in the commit table.

During recovery, the metadata is used to determine which epoch needs to be recovered; this information can also be determined by reading the most recent successful epoch in the snapshot table. After determining the recovery epoch, the next step is to populate the active table from the commit table: FlinkNDB reads the latest values up to the recovery epoch for all the keys under the specific key-group and populates the active table.

The optimizations made during checkpointing introduce a few challenges for recovery. One optimization is that, at checkpoint time, a state value is snapshotted only if it changed during that epoch; otherwise it is not captured in the database for the latest checkpoint. Consequently, copying the state from the commit table to the active table for the nth epoch is no longer straightforward. For example, suppose the input stream has three keys: x, y, and z. If all of them change continuously until epoch 3, both the active and commit tables will have rows for each epoch, and recovery only needs to take the rows with epoch 3 from the commit table and update the active table before restarting the process. It gets interesting when only the value of x changes during epoch 4: the checkpoint records only the updated value, and there are no commit table rows for keys y and z. When recovering from epoch 4, populating the active table with only the epoch-4 records from the commit table would break data integrity: the state values for keys y and z would not be available after restarting the application. The solution is to take, for each key, the most recent value up to the nth epoch, e.g. epoch 4 in our example: the value for key x is taken from epoch 4, while for keys y and z the values from epoch 3 are restored, with their respective epoch ids. Once the application restarts, all the data is read from the active table irrespective of the assigned epoch number.
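This per-key selection can be expressed as in the sketch below, which operates on a simplified in-memory view of the commit table; the row class and method names are illustrative.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EagerRecovery {

    static class CommitRow {
        final String key;
        final long epoch;
        final byte[] value;
        CommitRow(String key, long epoch, byte[] value) {
            this.key = key; this.epoch = epoch; this.value = value;
        }
    }

    // For every key, picks the row with the highest epoch <= recoveryEpoch.
    // Keys untouched in recent epochs (e.g. y and z in the example above)
    // are restored from their last committed epoch.
    static Map<String, CommitRow> latestUpTo(List<CommitRow> commitTable, long recoveryEpoch) {
        Map<String, CommitRow> result = new HashMap<>();
        for (CommitRow row : commitTable) {
            if (row.epoch > recoveryEpoch) {
                continue;   // dirty rows written after the last successful checkpoint
            }
            CommitRow best = result.get(row.key);
            if (best == null || row.epoch > best.epoch) {
                result.put(row.key, row);
            }
        }
        return result;   // rows to copy into the active table before restart
    }
}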

3.6.2.2 Lazy Recovery

Lazy recovery is another approach, which can potentially bring the recovery time down to close to zero, as it does not move data around at all. The idea is not to load the active table during recovery, but to fall back to the commit table whenever a record is not found: when a read from the active table (after a cache miss) finds no value, the required value is fetched from the commit table, and the active table is updated before the value is returned. This approach has higher read latency during the initial warm-up period, while the cache and the active table are being populated lazily. So, it is a trade-off between reducing the recovery or reconfiguration time and the performance of the data processing system during the warm-up phase. Lazy recovery remained purely in the ideation phase and could not be implemented due to lack of time.
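Since lazy recovery never left the ideation phase, the following is only a sketch of how the read path could look; every interface and method name in it is hypothetical.

import java.util.Optional;

public class LazyRecoveryReader {

    interface ActiveTable {
        Optional<byte[]> read(Object stateKey);
        void write(Object stateKey, byte[] value);
    }

    interface CommitTable {
        // Latest committed value up to the recovery epoch, if any.
        Optional<byte[]> readLatest(Object stateKey, long recoveryEpoch);
    }

    private final ActiveTable activeTable;
    private final CommitTable commitTable;
    private final long recoveryEpoch;

    LazyRecoveryReader(ActiveTable activeTable, CommitTable commitTable, long recoveryEpoch) {
        this.activeTable = activeTable;
        this.commitTable = commitTable;
        this.recoveryEpoch = recoveryEpoch;
    }

    // Called on a cache miss after a restart.
    byte[] read(Object stateKey) {
        // Fast path: the key was already restored (or written) after restart.
        Optional<byte[]> active = activeTable.read(stateKey);
        if (active.isPresent()) {
            return active.get();
        }
        // Slow path during warm-up: restore the key on demand from the
        // commit table and write it back so the next read is fast.
        byte[] restored = commitTable.readLatest(stateKey, recoveryEpoch).orElse(null);
        if (restored != null) {
            activeTable.write(stateKey, restored);
        }
        return restored;
    }
}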

3.7 Summary

This chapter started with a discussion of state management in different stream processing engines before going into the specifics of Apache Flink and the FlinkNDB state backend. It described the RocksDB architecture and set up the basis for the FlinkNDB design. Afterwards, it discussed how FlinkNDB and its database schema have evolved from Sruthi's work [1]. Furthermore, it walked through the different caching strategies implemented to improve performance. Finally, it closed with a discussion of the recovery implementation. Next, chapter 4 reviews the benchmarks that were developed and the experiments performed to compare FlinkNDB's performance with the other state backends.

Chapter 4

Benchmarking & Results

In this chapter, we evaluate the performance of FlinkNDB using a number of benchmarks. We used different benchmarking tools, including our own, to measure different performance metrics for the thesis. This includes data generation, running different test scenarios, setting up the log processing pipeline, and drawing insights out of the data. We ran different benchmarks focusing on the individual read and write times of the different state operations. Furthermore, we also measured the time to perform snapshotting and to recover from the snapshotted state.

4.1 Benchmarking Framework

Stream processing benchmarks are still an active area of research, and there are only a few known benchmarks for evaluating queries over continuous data streams. Initially, we ran experiments using NEXMark, but later, due to certain limitations, we implemented our own benchmark.

4.1.1 NEXMark Benchmark

NEXMark (Niagara Extension to XMark) is a benchmark consisting of multiple queries that exercise different sets of functionality of stream processing engines. It models a simple but realistic online auction system, where people put items on sale while other people place bids on those items, as shown in Fig. 4.1. It has three entities - persons, auctions, and bids - and different queries are executed on the system to evaluate the performance of various functionalities of the system under test.

4.1.2 NDW Benchmark

Sruthi [1] developed a custom benchmark for benchmarking FlinkNDB. It is built on top of an open-source project [29], and the data source is traffic data from NDW (Nationaal Dataportaal Wegverkeer) [30]. It contains data from different road measurement locations in the Netherlands.


Figure 4.1: NEXMark - Online Auction system

At these locations, sensors count the number of cars that pass by on a certain lane and the average speed of these cars within a time interval. The data mainly consists of two measurements: flow (number of cars) and speed (average speed). The data generator from [29] was used; it generates data and publishes a subset of this data set into the Kafka input topics, named flow and speed. The ingested data is read from the Kafka flow and speed topics; the flow and speed data is in JSON format and is parsed into speed and flow POJOs. Afterwards, stateful operations are performed on the flow and speed streams, and different performance metrics are measured.
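A skeleton of such a pipeline, using Flink's Kafka connector [19], is sketched below; the topic names match the benchmark, but the broker address, group id, and the trivial sinks are placeholders for the real parsing and stateful operators.

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class NdwPipelineSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder
        props.setProperty("group.id", "ndw-benchmark");           // placeholder

        // Raw JSON records from the two benchmark topics.
        DataStream<String> flow = env.addSource(
                new FlinkKafkaConsumer<>("flow", new SimpleStringSchema(), props));
        DataStream<String> speed = env.addSource(
                new FlinkKafkaConsumer<>("speed", new SimpleStringSchema(), props));

        // In the real benchmark, the JSON is parsed into POJOs and keyed,
        // stateful operations are applied; a keyBy on the measurement id
        // would go here instead of the trivial print sinks.
        flow.print();
        speed.print();

        env.execute("NDW benchmark sketch");
    }
}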

4.2 Hardware Infrastructure

Running experiments on different types of hardware yields different results. During the development of FlinkNDB, we ran experiments on quite a few different setups, both to evaluate ad-hoc performance and to gather data for a detailed comparative analysis. Google Cloud was used to create the different setups. The major setups for the experiments were the following:

1. Running Flink and NDB on the same machine. Most of these experiments were run on a Mac and an HP ProBook (Core i5, 16 GB).

2. Running Flink on a single node with NDB running in cluster mode. Flink used an n1-standard-16 (16 vCPUs, 60 GB memory) VM. The NDB cluster had 3 n1-standard-16 (16 vCPUs, 60 GB memory) nodes, of which 2 were data nodes and the third was a management node.

3. Running both Flink and NDB in cluster mode. The Flink cluster had 3 n1-standard-16 (16 vCPUs, 60 GB memory) VMs. The NDB cluster had 5 n1-standard-16 (16 vCPUs, 60 GB memory) VMs, of which 4 were data nodes and the last one was a management node.

For the purpose of this report, we only included the results from setups 2 and 3 above. Setup 1 showed a lot of variation between runs, and its results were collected mostly for development purposes.

4.3 Benchmarking Architecture

A benchmarking pipeline was developed to collect, process, and analyze the data from the different benchmark experiments, as in fig. 4.2. While the experiments run, all the individual reads and writes are logged to log files. Once an experiment finishes, the log files are processed with a Python script on the same compute nodes to extract only the relevant information, as the file sizes are quite large. Once we have only the read or write times against timestamps, we upload those values to Google Cloud Storage. Finally, Google Colab accesses Google Cloud Storage to read the data and plot the graphs.

Figure 4.2: FlinkNDB data processing pipeline [1]

4.4 Objectives

Apache Flink already supports several state backends, but the RocksDB state backend is the most widely used, so we compared FlinkNDB's performance against RocksDB. State backend performance can be compared on a great many metrics, but our experiments focused on the following:

1. Read time: Comparing the read time for the state value.

2. Write time: Time to write data to the state, including any serialization and deserialization time.

3. Checkpoint time: Time it takes to checkpoint the state.

4. Recovery time: How long it takes to recover from a snapshot, depending on the state size.

5. Application run-time: End-to-end execution time of the complete data pipeline.

We updated the Flink code to add logging of how much time an operation (read or write) takes from the moment Flink requests it, in order to compare performance at the state backend level.

Figure 4.3: Apache Beam NEXMark - Performance comparison of Flink state backends [1]

4.5 Experimental Evaluation

Initially, we ran experiments using the Apache Beam implementation of NEXMark for all the queries, as shown in fig. 4.3. FlinkNDB showed comparable performance, and for some queries it performed better than the other state backends. Later, the NDW benchmarking framework [1] was used to run different experiments for the performance evaluation using the metrics listed in 4.4.

The NDW benchmark experiments use different sets of keys, different numbers of input data points, and different state sizes per data point to better understand the strengths of FlinkNDB. A variable number of keys is used to assess how the different state backends perform when all the state fits into the in-memory cache, and when it does not. The state size is determined by the type of information stored for each unique key. The number of input data points represents the workload to be processed and is used to evaluate total execution time. The parameters of each experiment are summarized in table 4.1.

Experiment No. | Input Size | State Size (KB) | Unique Keys
1              | 379600     | 33.70           | 345
2              | 410400     | 10.75           | 110
3              | 403984     | 1253            | 12832
4              | 1787146    | 34.50           | 35328
5A             | 205200     | 10.75           | 110
5B             | 205200     | 1430            | 110

Table 4.1: Summary of input parameters for experiments

4.5.1 Experiment 1

Experiment 1 evaluates the performance with a small set of unique but repetitive keys. The input data set has only 345 keys, and the state was small enough to fit entirely into the cache, so all reads were cache hits except the first read of each key. FlinkNDB took slightly less time than RocksDB, as can be seen from both the read and write graphs in 4.4a and 4.4b. FlinkNDB has close to zero read time, because all reads are cache hits; the only NDB reads happen when a key is fetched for the first time. On the other hand, FlinkNDB's write time hovered around 50 µs, while RocksDB averaged below 20 µs with a few spikes.

4.5.2 Experiment 2

Experiment 2 is similar to the first experiment, but with a higher number of input data points and a failure recovery. It had 110 unique keys and merely 128 cache misses during the whole program execution; that means only 18 keys were evicted from the cache and had to be read again from the database during the experiment.

Figure 4.4: Experiment 1 evaluation graphs: (a) read rolling average, (b) write rolling average, (c) read/write box plot

The state was also small enough to fit into the system's memory. The RocksDB state backend took less time than FlinkNDB to finish the experiment run, as in fig. 4.5a and 4.5b. FlinkNDB has close to zero read time, as in 4.5c, because there were very few cache misses. Similarly, its write time hovered around 50 µs, while RocksDB averaged below 20 µs, in line with experiment 1. Finally, recovery took the same amount of time for both backends, 13-14 seconds.

Figure 4.5: Experiment 2 evaluation graphs: (a) read rolling average, (b) write rolling average, (c) read/write box plot

4.5.3 Experiment 3

Experiment 3 has around 13k keys, far more than experiments 1 and 2, but the data set still fits into the cache, and there were only a few cache misses. So, all reads happen from the cache except for the first access, when the cache is warmed up from the database. Overall, FlinkNDB took longer to finish than RocksDB, as can be seen from both the read and write graphs in 4.6a and 4.6b. The read/write box plot in 4.6c shows the same pattern as in the previous experiments.

Figure 4.6: Experiment 3 evaluation graphs: (a) read rolling average, (b) write rolling average, (c) read/write box plot

4.5.4 Experiment 4

Experiment 4 has far more keys, 35k. This was a long-running experiment with the keys spread over time and a bigger state size as well. Due to the large set of keys spread over time, we can see spikes in the read and write graphs for FlinkNDB in 4.7a and 4.7b. These spikes are caused by reads from the database: since most of the keys in the input arrive in sequential order, every time a new set of keys appears, it causes a spike due to database reads.

Furthermore, RocksDB's read time went up as well, due to the increased state size. Additionally, this experiment measured the time to complete the snapshot for each checkpoint, as in fig. 4.7c. RocksDB takes longer in most cases, since its state needs to be moved to S3.

Figure 4.7: Experiment 4 evaluation graphs: (a) read rolling average, (b) write rolling average, (c) checkpoint time

4.5.5 Experiment 5

Experiment 5 compares the performance of the state backends for different state sizes using the same input data. It has 110 keys, and the experiment was run twice on the same input: first with state sizes in bytes, and then with the state size per key increased to kilobytes (KB). Fig. 4.8a shows that FlinkNDB has consistent read times for sequential reads regardless of the state size, whereas RocksDB's read time is proportional to the state size: its read time increased once the state size was increased, while the increase did not affect FlinkNDB. FlinkNDB took less time to finish processing the complete input, because RocksDB's individual reads and writes are slower, as in Fig. 4.8b. Finally, fig. 4.8c shows that FlinkNDB takes less time to checkpoint than RocksDB. After increasing the state size, RocksDB's checkpointing time doubled; hence, checkpointing time is proportional to the state size for both state backends.

4.6 Evaluation Summary

The different experiments lead to the conclusion that FlinkNDB has good read performance compared to the RocksDB state backend, thanks to the cache optimizations. On the other hand, FlinkNDB lags behind on individual write operations, because of the serialization cost and the composition of multiple objects for checkpointing. The current implementation is therefore well suited to read-heavy workloads, where it can outperform the other Flink state backends.

Exp. No. | State backend | Read (µs) | Write (µs) | Checkpoint Time (s) | Total time
1        | RocksDB       | 10        | 15         | -                   | -
1        | NDB           | 1         | 40         | -                   | -
2        | RocksDB       | 8         | 20         | -                   | -
2        | NDB           | 1         | 60         | -                   | -
3        | RocksDB       | 30        | 20         | -                   | -
3        | NDB           | 2         | 60         | -                   | -
4        | RocksDB       | 50        | 20         | 8                   | 23 min
4        | NDB           | 30        | 60         | 4.5                 | 23 min
5A       | RocksDB       | 7.5       | 15         | 2                   | 110 s
5A       | NDB           | <1        | 45         | <1                  | 120 s
5B       | RocksDB       | 13.5      | 200        | 4                   | 230 s
5B       | NDB           | <1        | 230        | 1.5                 | 180 s

Table 4.2: Summary of evaluation metrics for Flink state backends

Figure 4.8: Experiment 5 evaluation graphs

Chapter 5

Conclusion and Future Work

This chapter concludes the thesis with a review of the research work and the achieved results. Furthermore, it discusses possibilities that can be explored in the future to improve the performance of FlinkNDB.

5.1 Conclusion

Apache Flink provides stateful streaming capabilities, and applications need to store state while processing data. The state is stored in a state backend, and all the existing state backends are embedded in the compute nodes; hence, scaling out storage independently is not possible.

Therefore, we implemented a new state backend, FlinkNDB. FlinkNDB solves the problem at its core by taking a different approach - decoupling compute and storage - while still maintaining the processing guarantees provided by Flink's underlying snapshotting mechanism. Scaling in or out is no longer expensive, since state and compute are not co-located anymore: for a compute-intensive task, we scale out compute only, and for a state-intensive task, we scale out the storage.

FlinkNDB is built on top of NDB, one of the world's fastest in-memory open-source distributed databases. FlinkNDB complies with Flink's core design choices: dedicated task assignment based on key-groups and fast, guaranteed processing. On top of that, it adds fast reconfiguration and unbounded historical rollback capabilities via strict multi-version control orchestrated by Flink's snapshots. FlinkNDB has a transient cache layer that captures the incremental changes during an epoch; all read and write operations happen on the cache layer, so its performance is close to a memory store. Once an epoch finishes and a checkpoint triggers, all the changes are serialized into the permanent distributed storage, NDB. All serialization and deserialization is delayed until the data needs to be stored, which boosts performance.

55 56 CHAPTER 5. CONCLUSION AND FUTURE WORK

Additionally, recovery from failure is much quicker, since the data only needs to be copied from one table to another on the same node. This reduces recovery time compared to existing implementations, because the state does not need to be copied across nodes over the wire.

FlinkNDB is an optimal solution for applications where all the keys fit into the active cache. It gives its best performance for sequential, read-intensive applications rather than sparse reads, because sparse reads lead to cache misses. It is therefore crucial to set the cache size according to the application's needs. FlinkNDB is also recommended for applications where frequent dynamic scaling is a requirement and the state size is large: it should work well for large-state applications, as checkpointing takes less time, and state redistribution does not require copying the state. On the other hand, write-intensive workloads might not fare as well, because writes constantly go over the network and do not leverage the caching capabilities.

5.2 Future Work

In chapter 4, we compared the performance of FlinkNDB with the RocksDB state backend and identified that FlinkNDB lags behind in write performance because of the early serialization cost. The obvious solution is to move the serialization to checkpoint time, when the state is stored in the database, by implementing a checkpoint serializer for FlinkNDB. We could not implement it because of the tight schedule.

FlinkNDB still uses the same implementation as RocksDB to store the checkpoint metadata and reads it from the file system during recovery. Since FlinkNDB is moving away from the file system to an external storage system, the metadata should be stored in the database as well.

The current implementation of FlinkNDB performs eager recovery, but we presented the lazy recovery concept in chapter 3.6.2.2 as well. Once lazy recovery is implemented, it will be interesting to see the results of failure recovery or reconfiguration of the Flink cluster with variable state sizes. In addition, the current implementation does not pre-load the cache and relies on cache warm-up driven by the application. Different strategies exist to overcome a cold cache, so it is worth exploring different cache warming strategies and benchmarking performance for both cold and warm caches.

Another area of research is how to combine the active and commit caches. One idea is to flag dirty writes as sticky during the epoch, so that they cannot be evicted until successfully checkpointed - for instance, an additional property on the POJO marking that the object was written to the cache in the currently running epoch and has not yet been committed to the database. Once a checkpoint triggers, such entries can be flushed to the database and made non-sticky. This might require a custom caching implementation instead of a COTS (commercial off-the-shelf) solution or a third-party library.

Finally, more extensive benchmarks should be run to measure the performance of some of the trickier use cases, e.g. scaling in or out with large state, or failure recovery of the cluster with large state.

Bibliography

[1] S. Sree Kumar, External Streaming State Abstractions and Benchmarking. KTH, 2021. [Online]. Available: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-

[2] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” in OSDI’04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, 2004, pp. 137–150.

[3] A. H. Payberah, “Parallel Processing - Spark,” GitHub, p. 125, 2019.

[4] M. H. Asif, “Apache Flink Architecture Overview,” Feb. 2020. [Online]. Available: https://medium.com/big-data-processing/apache-flink-architecture-overview-abbe19199fd0

[5] P. Carbone, “Scalable and Reliable Data Stream Processing,” KTH Royal Institute of Technology, 2018. [Online]. Available: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-233527

[6] Fwdays, “Sergei Sokolenko "Advances in Stream Analytics: Apache Beam and Googl...” [Online]. Available: https://www2.slideshare.net/fwdays/sergei-sokolenko-advances-in-stream-analytics-apache-beam-and-google-cloud-dataflow-deepdive

[7] “Apache Flink: A Deep Dive into Rescalable State in Apache Flink.” [Online]. Available: https://flink.apache.org/features/2017/07/04/flink-rescalable-state.html

[8] “Apache Flink: Stateful Stream Processing.” [Online]. Available: https://ci.apache.org/projects/flink/flink-docs-stable/concepts/stateful-stream-processing.html

[9] “Apache Flink 1.11 Documentation: Fault Tolerance via State Snapshots.” [Online]. Available: https://ci.apache.org/projects/flink/flink-docs-stable/learn-flink/fault_tolerance.html

[10] “Apache Tez - Welcome to Apache TEZ.” [Online]. Available: https://tez.apache.org/

[11] “Apache Spark - Unified Analytics Engine for Big Data.” [Online]. Available: https://spark.apache.org/

[12] “Samza.” [Online]. Available: http://samza.apache.org/

[13] “Apache Flink: Stateful Computations over Data Streams.” [Online]. Available: https://flink.apache.org/

[14] “Apache Flink,” Dec. 2020, page Version ID: 995687974. [Online]. Available: https://en.wikipedia.org/w/index.php?title=Apache_Flink&oldid=995687974

[15] P. Carbone, S. Ewen, G. Fora, S. Haridi, S. Richter, and K. Tzoumas, “State management in Apache Flink®: consistent stateful distributed stream processing,” Proceedings of the VLDB Endowment, vol. 10, no. 12, pp. 1718–1729, Aug. 2017. [Online]. Available: https://doi.org/10.14778/3137765.3137777

[16] “Apache Flink 1.11 Documentation: State Backends.” [Online]. Available: https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/state_backends.html

[17] “CUDA Zone,” Jul. 2017. [Online]. Available: https://developer.nvidia.com/CUDA-zone

[18] A. Ali and M. Abdullah, “A Survey on Vertical and Horizontal Scaling Platforms for Big Data Analytics,” International Journal of Integrated Engineering, vol. 11, Sep. 2019.

[19] “Apache Flink 1.12 Documentation: Apache Kafka Connector.” [Online]. Available: https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html

[20] “The Apache Software Foundation Announces Apache Flink as a Top-Level Project : The Apache Software Foundation Blog.” [Online]. Available: https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces69

[21] “Advanced Apache Flink Tutorial 1: Analysis of Runtime Core Mechanism.” [Online]. Available: https://www.alibabacloud.com/blog/advanced-apache-flink-tutorial-1-analysis-of-runtime-core-mechanism_595686

[22] “Apache Flink 1.10 Documentation: Working with State.” [Online]. Available: https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/stream/state/state.html

[23] “HDFSBackedStateStore · The Internals of Spark Structured Streaming.” [Online]. Available: https://jaceklaskowski.gitbooks.io/spark-structured-streaming/content/spark--streaming-HDFSBackedStateStore.html

[24] “Dataflow | Google Cloud.” [Online]. Available: https://cloud.google.com/dataflow

[25] P. Carbone, G. Fora, S. Ewen, S. Haridi, and K. Tzoumas, “Lightweight Asynchronous Snapshots for Distributed Dataflows,” arXiv:1506.08603 [cs], Jun. 2015. [Online]. Available: http://arxiv.org/abs/1506.08603

[26] “S-Store.” [Online]. Available: http://sstore.cs.brown.edu/about.html

[27] M. Ismail, “Distributed File System Metadata and its Applications,” KTH Royal Institute of Technology, 2020. [Online]. Available: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-285872

[28] “ben-manes/caffeine.” [Online]. Available: https://github.com/ben-manes/caffeine

[29] G. van Dongen and D. Van den Poel, “Evaluation of stream processing frameworks,” IEEE Transactions on Parallel and Distributed Systems, vol. PP, pp. 1–1, Mar. 2020.

[30] “Home - Nationaal Dataportaal Wegverkeer.” [Online]. Available: https://ndw.nu/en/
