DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2021

FlinkNDB: Guaranteed Data Streaming Using External State

MUHAMMAD HASEEB ASIF

Master Thesis in Big Data & Distributed Computing
KTH Royal Institute of Technology
School of Electrical Engineering and Computer Science
SE-164 40 Kista, Stockholm, Sweden 2021
TRITA-ICT XXXX:XX

Academic dissertation which, with the permission of KTH Royal Institute of Technology, is presented for public examination for the degree of licentiate in Cloud Computing & Services on Friday, 8 January 2021, in .

© Muhammad Haseeb Asif, January 2021

Printed by: Universitetsservice US AB

Abstract

Apache Flink is a stream processing framework that provides a unified state management mechanism which, at its core, treats stream processing as a sequence of distributed transactions. Flink handles failures, re-scaling, and reconfiguration seamlessly via a form of two-phase commit protocol that periodically commits all past side effects consistently into the state backends. This involves invoking and combining checkpoints and, when needed, re-distributing the state to resume data pipelines.

All the existing Flink state backend implementations, such as RocksDB, are embedded and coupled with the compute nodes. Recovery time is therefore proportional to the state that needs to be reconfigured, which can take from a few seconds to hours. Because of the embedded state backends, if the application logic is compute-heavy and Flink's tasks are overloaded, scaling out the compute pipeline means scaling out storage together with the compute tasks, and vice versa. It also introduces delays due to expensive state re-shuffles and moving large state over the network.

This thesis proposes decoupling state storage from compute to improve Flink's scalability. It introduces the design and implementation of a new state backend, FlinkNDB, that decouples state storage from compute. Furthermore, we designed and implemented new techniques to perform snapshotting and failure recovery that reduce the recovery time to close to zero.

Keywords: Apache Flink, NDB, Flink State Backend, RocksDB State Backend, State management, Large State Applications

Sammanfattning

Apache Flink is a stream processing framework that provides a unified state management mechanism which, at its core, treats stream processing as a sequence of distributed transactions. Flink handles failures, re-scaling, and reconfiguration seamlessly via a form of two-phase commit protocol that regularly commits all past side effects consistently into the state backends. This involves invoking and combining checkpoints and, when needed, redistributing the state to resume data pipelines. All existing Flink state backend implementations, such as RocksDB, are embedded and coupled to the compute nodes. Recovery time is therefore proportional to the state that needs to be reconfigured, which can take from a few seconds to hours. If the application logic is compute-heavy and Flink's tasks are overloaded, scaling out the compute pipeline entails scaling out the storage together with the compute tasks, and vice versa, because of the embedded state backends. It also introduces delays caused by expensive state movements and the transfer of large data volumes that occupy much of the bandwidth. This thesis work proposes decoupling state storage from compute in order to improve Flink's scalability. It introduces the design and implementation of a new state backend, FlinkNDB, which decouples state storage from compute. Finally, we designed and implemented new techniques for performing snapshotting and failure recovery to reduce the recovery time to close to zero.

Keywords: Apache Flink, NDB, Flink State Backend, RocksDB State Backend, State management, Large State Applications

Acknowledgements

Throughout my master thesis, I had the opportunity to meet and learn from people at my research lab, and online as well, during the interesting COVID times. I really appreciate everyone's time and support in helping with this accomplishment. In particular, my supervisor, Paris Carbone, and Mahmoud Ismail have been a great support throughout this thesis journey; their guidance has been instrumental during the whole project. Furthermore, it is worth mentioning the efforts and motivation from Sruthi during challenging and head-scratching moments. Finally, and most importantly, I would like to thank my family for their unbounded emotional support during my studies.

M Haseeb Asif, Stockholm, Jan 2021

Contents

Contents
List of Figures
List of Tables
List of Acronyms

1 Introduction
  1.1 Background
  1.2 Research Questions
  1.3 Goals
  1.4 Research Methodology
  1.5 Ethics and Sustainability
  1.6 Delimitations
  1.7 Thesis Organization

2 Background
  2.1 Big Data Analytics
    2.1.1 Map Reduce
    2.1.2 Batch Processing
    2.1.3 Stream processing
  2.2 Apache Kafka
  2.3 Apache Flink
    2.3.1 Flink Architecture
    2.3.2 Flink Programming Model
    2.3.3 Windowing
  2.4 Flink Application State
    2.4.1 Keyed State
    2.4.2 Operator State
  2.5 Flink State Management
    2.5.1 Spark State Backend
    2.5.2 Flink State Backends
    2.5.3 External State Approach
    2.5.4 Flink Re-scalable State
  2.6 RocksDB State Backend
  2.7 Flink Fault Tolerance
    2.7.1 Checkpointing
    2.7.2 Consistent Snapshots - Chandy-Lamport
    2.7.3 Flink 2PC protocol
    2.7.4 Flink State Checkpointing
  2.8 Transactional processing for Streaming application
  2.9 NDB
  2.10 Summary

3 Design and Implementation of FlinkNDB
  3.1 FlinkNDB Architecture
    3.1.1 Cache Layer
    3.1.2 Database Layer
    3.1.3 Primary Key
  3.2 NDB Schema
  3.3 Checkpointing
    3.3.1 NDB Schema Enhancements
  3.4 State Type Schema
    3.4.1 List State
    3.4.2 Map State
  3.5 Cache Optimizations
    3.5.1 Active Cache
    3.5.2 Commit Cache
    3.5.3 Cache Implementation
  3.6 Recovery
    3.6.1 RocksDB Approach
    3.6.2 FlinkNDB Approach
  3.7 Summary

4 Benchmarking & Results
  4.1 Benchmarking Framework
    4.1.1 Nexmark Benchmark
    4.1.2 NDW Benchmark
  4.2 Hardware Infrastructure
  4.3 Benchmarking Architecture
  4.4 Objectives
  4.5 Experimental Evaluation
    4.5.1 Experiment 1
    4.5.2 Experiment 2
    4.5.3 Experiment 3
    4.5.4 Experiment 4
    4.5.5 Experiment 5
  4.6 Evaluation Summary

5 Conclusion and Future Work
  5.1 Conclusion
  5.2 Future Work

Bibliography

List of Figures

1.1 Flink reconfiguration from 2 node to 3 node cluster [1]

2.1 Map Reduce - Execution Flow [2]
2.2 Map Reduce vs Apache Spark [3]
2.3 Apache Flink Architecture overview [4]
2.4 Translation from Logical to Physical Execution Graphs [5]
2.5 Separate compute and storage [6]
2.6 Reshuffling of keys while changing parallelism [7]
2.7 Flink RocksDB state backend
2.8 An example of an inconsistent (C1) and a consistent cut (C2) [5]
2.9 NDB Cluster [1]

3.1 FlinkNDB initial architecture
3.2 Flink injection of barriers into data stream [8]
3.3 FlinkNDB state backend architecture with cache
3.4 FlinkNDB - NDB Table Schema
3.5 Flink NDB Cache Activity diagram

4.1 NEXMark - Online Auction system
4.2 FlinkNDB data processing pipeline [1]
4.3 Apache Beam NEXMark - Performance comparison of Flink state backends [1]
4.4 Experiment 1 evaluation graphs
4.5 Experiment 2 evaluation graphs
4.6 Experiment 3 evaluation graphs
4.7 Experiment 4 evaluation graphs
4.8 Experiment 5 evaluation graphs

List of Tables

2.1 Comparison of Flink State Backends [9]

4.1 Summary of input parameters for experiments
4.2 Summary of Evaluation metrics for Flink State backends

List of Acronyms

API Application Programming Interface
CPU Central Processing Unit
DAG Directed Acyclic Graph
EC2 Elastic Compute Cloud
GCE Google Compute Engine
HDFS Hadoop Distributed File System
IoT Internet of Things
I/O Input/Output
JVM Java Virtual Machine
NDB Network Database
OLTP Online Transactional Processing
OLAP Online Analytical Processing
OSI Open Systems Interconnection
POJO Plain Old Java Object
S3 Simple Storage Service
URL Uniform Resource Locator


Chapter 1

Introduction

Technology has been advancing at a rapid pace, and more data is being generated than ever before. Apache Flink is a prominent processing engine that enables users to handle data at a large scale. It is a distributed, fault-tolerant, and scalable processing engine. Although Apache Flink meets most industry needs, it could do better at dynamically scaling in and out without major delays. In this thesis, we explore an alternative to the existing Flink architecture and show how Apache Flink can improve on its current state.

1.1 Background

The Internet of Things (IoT) and the digitization of different sectors are among the major sources of the massive amounts of data that cannot be analyzed and processed by conventional processing systems. Raw data by itself is of little use; in fact, it costs money to store large amounts of data. Analyzing huge amounts of data and extracting business intelligence from it in a timely manner is still an active area of research. Big data tools and technologies are catching up with the pace of data generation and are being developed rapidly by the community to meet data processing needs. Various tools exist in the industry at the moment.

In the early days, all the data was collected and loaded into a single machine for data processing, known as batch processing. Historically, batch processing leveraged vertical scaling¹ of the systems to meet the need for processing large data sets. In recent years, with advances in distributed systems, tools have been developed that can process data by leveraging horizontal scaling². A major switch from single-system processing techniques to distributed systems happened with the introduction of MapReduce[2] by Google. Later, quite a few tools were developed that enhanced processing capabilities, such as Apache Tez[10],

¹ Vertical scaling refers to adding more resources (CPU/RAM/disk) to the same server.
² Horizontal scaling means scaling by adding more machines to the pool of resources.


Apache Spark[11], Apache Samza[12], and Apache Flink[13]. Most of these tools can process unlimited streams of data, or data in motion, in contrast to their predecessors, where all the data needed to be stored before processing. Apache Flink is an open-source, unified stream-processing and batch-processing framework developed by the Apache Software Foundation. The core of Apache Flink is a distributed streaming dataflow engine. Flink executes arbitrary dataflow programs in a data-parallel and pipelined (hence task-parallel) manner. Flink's pipelined runtime system enables the execution of bulk/batch and stream processing programs[14]. This thesis focuses on Apache Flink.

1.2 Research Questions

Although data processing engines, including Apache Flink, have advanced by leaps and bounds, there are still areas where industry needs are not completely met. Apache Flink is a stateful processing engine: while doing stateful computation, it needs to store the state in a state backend. Currently, it supports a few state backends. One of them, the RocksDB state backend, is widely used in industry. It uses the embedded RocksDB database to store the state on the processing system. The RocksDB state backend provides fast access for data-parallel stateful operations; however, its performance degrades during scaling in and out, failure recovery, or any type of reconfiguration done on a running pipeline. It can take hours to reconfigure the data pipelines if the state size is a few terabytes. The end-user experience can deteriorate badly if there are long processing delays. A large state size can thus lead to significant delays when jobs scale in or out, proportional to the size and the partitioning granularity of the state, as shown in fig. 1.1. Therefore, our research question for the thesis is:

• Can we substitute embedded state with external state while maintaining the same consistency guarantees and not violating the performance requirements?

1.3 Goals

Overall, the project goal is to investigate the decoupling of state storage and compute for Apache Flink and benchmark the performance against the existing embedded-storage state backends. FlinkNDB is a state backend that uses NDB to store the state externally to Flink; an initial state backend implementation exists by Sruthi[1]. This thesis complements the work by Sruthi[1] with the checkpointing and recovery implementation. The following are the research goals for the thesis study:

1. Design and develop the schema for NDB to add support for checkpointing in FlinkNDB.

Figure 1.1: Flink reconfiguration from 2 node to 3 node cluster [1]

2. Research and implement the checkpointing strategy for FlinkNDB.

3. Research and implement FlinkNDB recovery from checkpointed snapshots during failure or reconfiguration.

1.4 Research Methodology

The degree project consists of three parts:

1. Research and understand existing Flink state backends, their implementation, and the underlying storage engines

2. Design and implementation of snapshotting and recovery for FlinkNDB state backend

3. Benchmarking and experimental evaluation of the failure recovery performance of different state backends

The research methodology is thus an exploratory literature review for the aforementioned research question, followed by implementation and experimental evaluation using different benchmarking metrics.

1.5 Ethics and Sustainability

Stream processing systems provide real-time data processing capabilities, but they need to run all the time and hence consume more resources. These systems should therefore be performant and use resources effectively. This project is focused on optimizing Flink cluster scalability, which will reduce the power consumption, heat production, and waste of underutilized resources in the digital infrastructure. The result is a more sustainable stream processing system.

Additionally, this project adheres to all ethical standards. During the research and development of the project, a great deal of qualitative and quantitative research was done, but no personal data was used. All the data collected and used during the project was aggregated and anonymized wherever any personal information was involved. Furthermore, there is no data that might cause security risks to individuals or organizations. Finally, references have been furnished for all previous work that is leveraged throughout this project.

1.6 Delimitations

This thesis is part of a joint project with Sruthi[1] to design and develop a new state backend, FlinkNDB. Sruthi's[1] work focuses on state storage for FlinkNDB, while this thesis work is about the design and development of checkpointing and recovery.

1.7 Thesis Organization

This thesis is organized into five chapters as follows:

• Chapter 1 gives a high-level overview of the overall project and defines the research goal for the thesis.

• Chapter 2 details the background of the numerous concepts discussed in this thesis. It starts by establishing the basis of big data processing and its evolution. Afterward, it introduces Apache Flink, touches on the different types of state backends and how each state backend stores the state, and then discusses the keyed state backend and the different possible state data structures. It then explains how Flink performs snapshotting and recovery with the existing state backends. Finally, the chapter introduces NDB and some of its key features.

• Chapter 3 introduces the architecture, design, and implementation of FlinkNDB. It discusses the different possible solutions and why a certain solution was chosen. It also discusses the optimizations introduced to improve the performance of the initial design.

• Chapter 4 explains the different ways FlinkNDB was benchmarked against the existing implementations. It covers existing benchmarking frameworks as well as the development of a newer framework to measure performance for specific performance metrics.

• Chapter 5 concludes with a discussion of the research question and the contributions of the thesis toward the creation of new knowledge in this domain. Finally, it mentions possible improvements and extensible features as future work.

Chapter 2

Background

Applications running in production require more resources as demand increases. Increasing or decreasing resources is known as system scaling. There are two types of scaling: vertical scaling and horizontal scaling. Vertical scaling adds more resources (CPU, RAM, disk) to the same computer system. Horizontal scaling, on the other hand, refers to adding more computer systems to the cluster to increase the available resources.

MapReduce[2] started the shift towards distributed data processing by scaling out instead of scaling up. After that, systems introduced interactive query execution and DAG (directed acyclic graph) data flows. Apache Spark is considered a third-generation tool; it introduced lineage graphs, iterative data processing, and near real-time data processing. The current generation, Apache Flink, is focused on scalable stream processing with native iterative processing and real-time streaming. There are still quite a few open challenges in the field of scalable stream processing. The three most critical challenges [5] are fault tolerance and scalable state management, computation sharing, and semantics for sliding windows and iterative data streaming.

Apache Flink has been developed with scalable stream processing as a design consideration. It provides a streaming API that is used to apply stream transformations while writing distributed applications for data streams. It exposes basic abstract types such as DataStream and WindowedStream, which support different types of transformations. Flink employs a long-running task architecture where each program has a three-phase compilation: a logical, an optimized, and a physical representation. Each stream transformation can have state, which is taken care of by Flink itself. Additionally, windows are used to group continuous data to apply different grouping operations and also to store state.

Apache Flink programs written with the DataStream API maintain state for their operations, and Flink manages the state transparently without the user having to worry about it. It ensures reliability by periodically taking consistent snapshots of the system. A consistent snapshot represents the global state of the system at any

given point in time, and in the event of failure, Flink can resume processing without losing information. Flink's mechanism for drawing these snapshots is described in "Lightweight Asynchronous Snapshots for Distributed Dataflows"[15]. It is inspired by the famous Chandy-Lamport algorithm for distributed snapshots and is specifically tailored to Flink's execution model.

Flink state is stored in different state backends[16]. Currently, Flink has three state backend implementations in terms of storage: JVM heap, file system, and RocksDB. RocksDB is an embedded key-value store that keeps the state on the disk of the compute node. It is good for large state that cannot fit in memory, but it compromises performance because objects require serialization and de-serialization. The FsStateBackend and the memory state backend keep their working state on the JVM heap, so they are much faster than the disk-based RocksDB, but they require a large heap size.

This chapter presents an overview of batch and stream processing, followed by Apache Flink and its programming and execution model. Later, we discuss NDB and the two-phase commit (2PC) protocol. Finally, the chapter concludes with a discussion of the different challenges at hand.

2.1 Big Data Analytics

Digitization across industries is generating an enormous amount of data. Logs and events are now analyzed from every digital device, e.g., smartphones, smartwatches, and home automation sensors. Big data is often broken down into the Vs: volume, velocity, value, variety, and veracity, each of which attributes certain properties to the data. With advances in technology, data is being generated in such quantity and at such velocity that conventional data processing systems cannot cope with it.

There are two complementary approaches to dealing with big data: scale-up and scale-out. Scale-up, or vertical scaling, means adding more resources to the same node in a system as the demand for resources to manage and process big data increases. It works well in most cases, but a system can only be scaled up to a certain level, beyond which adding more resources becomes more expensive than the desired gain. Graphics Processing Units (GPUs)[17] and High-Performance Computing (HPC) clusters are common examples of vertically scaled systems.

Scale-out, or horizontal scaling, refers to adding more nodes to the overall system. In recent years, the cost of storage and computing has gone down drastically, which makes it easier to store and process huge amounts of data. Hence, horizontal scaling is used more often[18], even though it requires a huge amount of data transfer over the wire and complex synchronization protocols for coordination among the different nodes.

The data processing ecosystem has gone through different generations of big data processing tools to match the needs of the industry. This revolution started with the inception of MapReduce, followed by batch processing and stream processing.

2.1.1 Map Reduce

MapReduce [2] is an architecture for processing big data on distributed or vertically scaled systems. It parallelizes both data and computation. MapReduce has three stages, as shown in fig. 2.1: map, shuffle, and reduce. Map is the initial phase, where input data is read and key-value pairs are produced as intermediate outputs. Shuffle, also referred to as the sorting phase, then sorts and consolidates the intermediate data from all the mappers based on the keys. Finally, the reduce task runs the user logic on all intermediate values for the same key and produces the final output.

Figure 2.1: Map Reduce - Execution Flow [2]

MapReduce abstracts the system-level details from application developers and provides parallelization, fault tolerance, and data distribution under the hood. Developers can define the operations using functional programming, and the framework takes care of the parallel execution on the distributed system. It is a different style of programming, where developers specify what to do instead of how to do it through sequential steps. Apache Hadoop is one of the first frameworks with a MapReduce implementation and is considered the first generation of big data analytics.
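To make the three stages concrete, the following is a minimal word-count sketch against the Hadoop MapReduce API. The class names and the tokenization are illustrative assumptions, and the shuffle stage is carried out by the framework between the two classes.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map stage: emit (word, 1) for every token in the input split.
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            context.write(word, ONE); // intermediate key-value pair
        }
    }
}

// Reduce stage: the framework has already shuffled and grouped by key,
// so the reducer only sums the counts for one word.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}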

2.1.2 Batch Processing

Batch processing is a method and system architecture for processing a bounded data set, i.e., a finite amount of data. Examples include financial transactions from the last week being processed on weekends, or trading activity during the day being processed overnight. Hadoop MapReduce was slow because the data after each step goes to disk: it reads data from disk, performs an operation on it, and stores it back to disk, which hurts performance. Apache Spark[11], one of the most widely used batch processing engines in the industry, improved performance by storing the intermediate state of MapReduce operations in memory instead of on disk, as in fig. 2.2.

Figure 2.2: Map Reduce vs Apache Spark [3]

Apache Spark is a third-generation tool for big data analytics. It is a general-purpose batch and stream (close to real-time) processing engine. It also supports iterative processing, which is helpful for machine learning and other types of workloads. The RDD (Resilient Distributed Dataset) is at the core of Spark. Spark is much faster than MapReduce because it performs in-memory computation and optimizes processing. It provides high-level APIs in Java, Scala, and Python.
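As a rough sketch of the RDD API, the same word count looks as follows in Spark's Java API; the input and output paths are illustrative assumptions, and the intermediate pair RDD stays in memory rather than being written to disk between stages.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("word-count");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("hdfs:///input/corpus.txt"); // illustrative path

            JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum); // intermediate results stay in memory

            counts.saveAsTextFile("hdfs:///output/counts"); // illustrative path
        }
    }
}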

2.1.3 Stream processing

Stream processing is an alternative computing architecture for processing an unbounded data stream. Simply put, it is an approach to processing data in motion over an infinite data set. The data stream is unbounded, with no predefined beginning or end, and the data can be unstructured, semi-structured, or structured. Examples of stream processing applications include transaction fraud detection, monitoring user activity, and online recommendations.

Stream processing has two design principles: continuous and micro-batch processing. When a continuous stream processing engine receives an event from an input stream, it may trigger a computation on it, update the related aggregation, or store state for future events. In contrast, micro-batch processing divides the unbounded data stream into small subsets of bounded data sets and then processes each batch in an atomic operation. Apache Flink is a continuous streaming platform, while Apache Spark provides streaming support via micro-batch processing.

Stream processing also has two programming models: declarative or record-at-a-time. The declarative model is high-level programming where application developers specify what to compute instead of how to compute it for each new event. Alternatively, record-at-a-time is a low-level model in which each record is handed to the application developer's code for computation. It gives more fine-grained control but exposes a lot of complexity for application developers to handle, whereas declarative models abstract away the application complexities but give only coarse-grained control.

Stream processing engines have tackled a lot of challenges. They can handle larger volumes of data than other processing systems because they only need to store the relevant information after processing the incoming data stream. Contrary to batch processing, where all data is stored first and then processed, stream processing handles data points as they are ingested into the system. It provides insights in near real-time, compared to the high latency of batch processing. Batch processing can thus be considered scheduled processing, while stream processing is real-time data processing.

Stateful stream processing refers to data processing where the streaming engine holds state related to the computation. Although data processing engines are stateless by default, there are scenarios where developers need to maintain state during data processing. Stateful stream processing combines the computation and the state store in one platform and abstracts the complex state management away from application developers. State is required in most data processing cases. For example, a monitoring application checking for spikes in application usage will keep the last usage values in state to compare against the incoming values; once the result is calculated, the state is updated with the new values.

2.2 Apache Kafka

Apache Kafka is a fault-tolerant messaging system that captures and stores events from different data sources, such as APIs and IoT devices, in real time. It supports integration with many other systems, where it can act as an input data source. The granular unit of messages in Kafka is known as a topic. It also supports processing and routing events retrospectively.

The Apache Kafka connector[19] for Apache Flink provides capabilities to read and write data to Kafka with exactly-once guarantees. It supports moving back and forth in the event stream and replaying it, which is useful when an application needs to restart the input data stream from a certain point in case of a failure.
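A minimal sketch of consuming a Kafka topic from Flink with this connector is shown below. The topic name, broker address, and group id are illustrative assumptions; FlinkKafkaConsumer is the consumer class provided by the flink-connector-kafka module.

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class KafkaSourceJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // illustrative address
        props.setProperty("group.id", "flink-consumer");          // illustrative group id

        // On recovery, Flink restores the checkpointed Kafka offsets and
        // replays the stream from that point, enabling exactly-once state.
        FlinkKafkaConsumer<String> source =
            new FlinkKafkaConsumer<>("events", new SimpleStringSchema(), props);

        DataStream<String> stream = env.addSource(source);
        stream.print();

        env.execute("kafka-source-job");
    }
}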

2.3 Apache Flink

Apache Flink is a distributed processing engine for stateful computations over unbounded and bounded data streams, and one of the top Apache projects [20]. It is designed to process real-time streams, contrary to its predecessor frameworks. It supports windowing, out-of-order processing, iterative processing, and stateful stream processing computations. Additionally, it provides fault tolerance, parallelization, and reconfiguration of pipelines. DataSet and DataStream are the core APIs, but there are multiple higher-level APIs as well; Flink provides Java, Scala, Python, and SQL APIs for developing applications.

Apache Flink uses a master-slave architecture[21]: a job manager acts as the master, while task managers are the worker (slave) nodes. Apache Flink runs on the JVM (Java Virtual Machine). It supports various types of windowing, as windows are at the heart of stream processing. Flink stores state using state backends for stateful stream processing. Recovering from a failure to a specific point in time requires loading the state, which is supported by snapshotting the state at regular intervals.

2.3.1 Flink Architecture

Apache Flink has a layered software stack, like the OSI model. Each layer abstracts the details of the layers below it and makes it easier for the end-user to develop applications. The Flink runtime layer sits on top of the deployment layer; above it are the core APIs, followed by specialized abstraction libraries (FlinkML, Gelly), as in fig. 2.3.

Flink has two core APIs, DataSet and DataStream. The DataSet API is used for batch processing, while the DataStream API is used for streaming applications. These APIs provide interfaces to build applications, apply transformations on streams, and manage application state and timers. Flink also provides specialized APIs, e.g., Gelly for graph processing, Table and SQL for relational queries with a table representation, and FlinkML for machine learning pipelines. These APIs generate the logical job graph, which is then optimized by different API-specific optimizers to produce the physical graph. Later, this physical graph is executed on the actual nodes.

The streaming dataflow runtime, or Flink runtime, is a distributed system that schedules and executes different applications. There are mainly three components in a Flink cluster: the client, the JobManager, and the TaskManagers. The client module translates the application code from the API layer above and submits a graph to the JobManager. The JobManager acts as the master, and the TaskManagers are the worker (slave) nodes. The JobManager manages application deployment, monitoring, and execution. Finally, the TaskManagers do the actual work and coordinate all the resources

Figure 2.3: Apache Flink Architecture overview [4]

required to execute a task. With the help of the job manager and task manager nodes, the Flink runtime abstracts all these complexities from application developers. Flink offers a variety of deployment options as well. A cluster, either standalone or with YARN, is the most obvious one, since Flink is a distributed system. It can also run on a local machine for development or prototyping purposes. Additionally, cloud vendors, e.g., GCE or EC2, have support for deploying Apache Flink.

2.3.2 Flink Programming Model

Flink provides a functional programming model for streaming applications. The DataStream API provides support for applying transformations as higher-order functions such as map, reduce, and filter. A Flink program is composed of sources, sinks, and computations. It has lazy execution: programs are compiled and optimized before being executed on the distributed runtime.

Flink has a three-phase compilation, as shown in fig. 2.4. Initially, a logical graph is generated from the program source code; each transformation corresponds to a logical operator that executes event-based logic. After that, optimizations are applied to the different operators to fuse them wherever possible for better performance. Finally, when a Flink program is deployed, the physical execution graph is generated on the distributed system, with multiple instances of the same operator across multiple nodes. A minimal DataStream job illustrating this model is sketched after fig. 2.4.

Figure 2.4: Translation from Logical to Physical Execution Graphs [5]
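The sketch below shows the source-transformation-sink structure described above; the job name and element values are illustrative assumptions. Nothing runs until execute() is called, at which point the logical graph is optimized and deployed as a physical graph across the task managers.

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SimplePipeline {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Source -> transformation -> sink.
        DataStream<String> source = env.fromElements("a", "b", "a");
        source
            .map((MapFunction<String, String>) String::toUpperCase)
            .print();

        env.execute("simple-pipeline"); // lazy execution happens here
    }
}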

2.3.3 Windowing

Windowing is a crucial concept in stream processing frameworks, or whenever dealing with an infinite amount of data. In batch processing, the data is finite, so a computation can be applied to all of it at once; this is not possible with stream processing, because the input data is unbounded. Windowing is an approach that breaks the data stream into mini-batches, or finite streams, so that different transformations can be applied to them.

An Apache Flink window opens when the first data element arrives and closes when the criteria to close the window are met. The criteria can be based on time, a count of messages, or a more complex condition. There are different types of windowing strategies: tumbling, sliding, session, and global windows. Additionally, you can create your own, more complex implementations beyond the predefined ones.
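As a sketch of the windowing API, a tumbling processing-time window that sums a value per key could look as follows; the key field, window length, and element values are illustrative assumptions.

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowedSum {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Tuple2<String, Integer>> events =
            env.fromElements(Tuple2.of("sensor-1", 3), Tuple2.of("sensor-1", 5));

        events
            .keyBy(e -> e.f0)                                            // partition by sensor id
            .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))  // close window every 10 s
            .sum(1)                                                      // aggregate the value field
            .print();

        env.execute("windowed-sum");
    }
}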

2.4 Flink Application State

Application state is a first-class citizen in Flink, and it is required for most stream processing use cases. For example, when processing credit card transactions to monitor fraudulent activity, the last transactions need to be stored to identify malicious transactions; and when monitoring temperature spikes from IoT sensors, all new values are compared against older readings.

Apache Flink abstracts the state management complexities from application developers. It provides multiple state primitives, pluggable state backends, fault tolerance with checkpointing, and failure recovery mechanisms. It can handle large state sizes and supports redistribution of the state as well.

Flink supports two types of state: keyed state and operator state. For purely data-parallel stream operations, data can be partitioned based on keys, which makes computations and state management independent for each key. The keyed state is bound to a key and is used for keyed streams; Flink's KeyBy transformation is used to transform a data stream into a keyed stream. Operator state is also known as non-keyed state. It is declared at the level of a physical operator, and each operator state is bound to one parallel operator instance.

Both keyed state and operator state exist in two forms: managed and raw. Managed state is represented in data structures controlled by the Flink runtime. Raw state is seen as raw bytes by Flink, which knows nothing about the state's data structures; the operators keep the state in their own data structures. Using managed state is recommended, because Flink can automatically redistribute it when the parallelism is changed or the system scales in or out.

2.4.1 Keyed State

Managed keyed state provides state management for several primitive types. Each of these is scoped to the key of the current element, since we are operating on a keyed stream; the key is provided automatically by the system. The supported state primitives are as follows:

• ValueState<T>: like a key-value pair, where a value is stored against the currently scoped input key. The value is read using T value() and set using the update(T) method.

• ListState<T>: maintains a list of elements of type T. It is also scoped to the input element's key, so there is one list for each input key. ListState has add(T) and addAll(List<T>) methods to append to the state, update(List<T>) to overwrite it, and Iterable<T> get() to retrieve the list elements.

• ReducingState<T>: holds only the reduced, or accumulated, value for the current key. Values can be added with the add(T) method, but they are reduced by the user-provided ReduceFunction and a single value is stored. The value is fetched with T get().

• AggregatingState<IN, OUT>: similar to reducing state, with the exception that the reduced, or aggregated, type OUT can be different from the input type IN. It also stores a single value, obtained by applying the AggregateFunction to all the inputs provided via add(IN). The value can be retrieved with the get() method.

• MapState<UK, UV>: maintains key-value mappings, like a map data structure. Values for individual keys can be fetched using get(UK). It provides the methods put(UK, UV) and putAll(Map<UK, UV>) to set values in the state. Additionally, entries(), keys(), and values() return iterables over the stored key-value pairs, keys, and values respectively.

Flink state is accessed through a state descriptor for each type. Application developers provide a state name, value type information, and other parameters specific to that state type, such as a default value, a ReduceFunction, or an AggregateFunction. The state name is used to reference a state when multiple states of the same type are created. The value type information is used to serialize and de-serialize values. Flink's RuntimeContext provides methods for the various type descriptors.

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class CountWindowAverage
        extends RichFlatMapFunction<Tuple2<Long, Long>, Tuple2<Long, Long>> {

    private transient ValueState<Tuple2<Long, Long>> sum;

    @Override
    public void flatMap(Tuple2<Long, Long> input, Collector<Tuple2<Long, Long>> out)
            throws Exception {
        Tuple2<Long, Long> currentSum = sum.value();

        currentSum.f0 += 1;            // update the count
        currentSum.f1 += input.f1;     // update the running sum
        sum.update(currentSum);

        if (currentSum.f0 >= 2) {
            out.collect(new Tuple2<>(input.f0, currentSum.f1 / currentSum.f0));
            sum.clear();
        }
    }

    @Override
    public void open(Configuration config) {
        ValueStateDescriptor<Tuple2<Long, Long>> descriptor =
            new ValueStateDescriptor<>(
                "average",                                                 // the state name
                TypeInformation.of(new TypeHint<Tuple2<Long, Long>>() {}),
                Tuple2.of(0L, 0L));                                        // default value

        sum = getRuntimeContext().getState(descriptor);
    }
}

// This can be used in a streaming program like this (assuming we have a
// StreamExecutionEnvironment env):
// env.fromElements(Tuple2.of(1L, 3L), Tuple2.of(1L, 5L), Tuple2.of(1L, 7L),
//                  Tuple2.of(1L, 4L), Tuple2.of(1L, 2L))
//    .keyBy(0)
//    .flatMap(new CountWindowAverage())
//    .print();
// the printed output will be (1,4) and (1,5)

Listing 2.1: Keyed state example using value state to store the count

Listing 2.1 is an example of a FlatMapFunction using ValueState [22], which implements a counting window. The example applies the keyBy transformation on the first field of the input tuples and stores the count and a rolling sum. The ValueState holds a tuple whose first field is the count and whose second field is the running sum. Once the count reaches the required count of 2, the function triggers the window, emits the average value, and clears the state for the next window.

import java.util.ArrayList;
import java.util.List;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;

public class BufferingSink
        implements SinkFunction<Tuple2<String, Integer>>,
                   CheckpointedFunction {

    private final int threshold;

    private transient ListState<Tuple2<String, Integer>> checkpointedState;

    private List<Tuple2<String, Integer>> bufferedElements;

    public BufferingSink(int threshold) {
        this.threshold = threshold;
        this.bufferedElements = new ArrayList<>();
    }

    @Override
    public void invoke(Tuple2<String, Integer> value, Context context) throws Exception {
        bufferedElements.add(value);
        if (bufferedElements.size() == threshold) {
            for (Tuple2<String, Integer> element : bufferedElements) {
                // send it to the sink
            }
            bufferedElements.clear();
        }
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        // replace the previous checkpoint contents with the current buffer
        checkpointedState.clear();
        for (Tuple2<String, Integer> element : bufferedElements) {
            checkpointedState.add(element);
        }
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        ListStateDescriptor<Tuple2<String, Integer>> descriptor =
            new ListStateDescriptor<>(
                "buffered-elements",
                TypeInformation.of(new TypeHint<Tuple2<String, Integer>>() {}));

        checkpointedState = context.getOperatorStateStore().getListState(descriptor);

        // on recovery, re-populate the buffer from the checkpointed state
        if (context.isRestored()) {
            for (Tuple2<String, Integer> element : checkpointedState.get()) {
                bufferedElements.add(element);
            }
        }
    }
}

Listing 2.2: Operator state example

2.4.2 Operator State

Flink operator state is bound to each individual operator instance. Operator state is used by implementing the CheckpointedFunction interface, or the ListCheckpointed interface, in a stateful function. The CheckpointedFunction interface requires users to implement the snapshotState and initializeState methods. initializeState is called when a stateful function is invoked for the first time, or when it is recovering from an existing checkpoint; it therefore also contains the state recovery logic. The snapshotState method is called to create a checkpoint of the operator state.

Currently, the only supported managed operator state is of list type, similar to ListState in the keyed state. It is essentially a list data structure of serialized objects, so conceptually operator state is a big list of state elements returned by the different operators. When recovering the state, there are two types of redistribution schemes: even-split and union. Even-split redistributes the state equally across all the operators, while union shares the whole list with all the operators. The ListCheckpointed interface is a limited version of CheckpointedFunction; it provides only list operations with the even-split distribution.

Listing 2.2 is an example of operator state which implements a stateful SinkFunction using CheckpointedFunction. It buffers elements until it reaches the threshold; once it has enough elements, it emits them to the sink.

Managed operator state is accessed similarly to keyed state, through state descriptors of a list type, and it also requires a state name and value type information. However, Flink's RuntimeContext has a different method, getOperatorStateStore, for operator state. Depending on the type of distribution, one can use getListState or getUnionListState.

2.5 Flink State Management

Stateful stream processing engines provide support for dynamic reconfiguration and failure recovery while ensuring strong consistency guarantees. The state is maintained and stored in a state store, usually referred to as a state backend. It can be anything from a basic in-memory HashMap, to a persistent file system like HDFS, to distributed storage systems like Cassandra, to a local embedded store like RocksDB. The purpose of a state backend is to provide a reliable place where the stream processing engine can write to, and read from, the intermediate results of data processing. Although such capabilities can be achieved by integrating different storage systems, research is ongoing to provide these features transparently and take the integration challenges away from the users.

2.5.1 Spark State Backend

Spark[11], one of the well-known stream processing engines in the industry, provides both stateless and stateful streaming. Thanks to the stored state, Spark can reliably recover to the point of failure in case any of its components or nodes fails. It supports only one state backend (or state store, in Spark terminology) out of the box, HDFSBackedStateStore[23], which stores state in an in-memory HashMap backed by HDFS for fault tolerance. Storing the state in the in-memory HashMap of the compute node introduces challenges as the state size grows.

2.5.2 Flink State Backends

Apache Flink stores the state in persistent storage to ensure its consistency guarantees. The state can be stored in memory, in a local file, or on a distributed file system - referred to as the state backend. When a Flink application is running, the state is snapshotted periodically and stored in the state backend, if configured. Flink supports various types of storage systems through three types of state backends: MemoryStateBackend[16], FsStateBackend[16], and RocksDBStateBackend[16]. Each of these keeps the working state on the compute nodes themselves and hence provides local read and write performance for state operations. Application developers can configure a different state backend, but MemoryStateBackend is the default. Table 2.1 shows a comparison of the different Apache Flink state backends.

Name                 | Working State        | State Backup            | Snapshotting
MemoryStateBackend   | JVM heap             | JobManager JVM heap     | Full
FsStateBackend       | JVM heap             | Distributed file system | Full
RocksDBStateBackend  | Local disk (tmp dir) | Distributed file system | Full / Incremental

Table 2.1: Comparison of Flink State Backends [9]

2.5.2.1 Memory State Backend

The MemoryStateBackend uses hash tables to manage the state internally. These hash tables are stored on the Java heap of the task manager. When a snapshot is created, it is stored on the job manager's (master) heap. It also supports asynchronous snapshots to avoid blocking the stream processing pipeline. It is very performant and useful for smaller state sizes, as it cannot hold any state larger than the job manager's memory. It is not recommended for production use in state-critical applications, because in case of a system failure all state is lost from memory and there is no backup from which to recover.

2.5.2.2 File State Backend

The FsStateBackend is like the MemoryStateBackend, except that upon checkpointing it stores the state snapshot on a configured file system. A file system URL is required on initialization, and it supports HDFS as well as local file systems. It can also be used for applications with large state sizes.

2.5.2.3 RocksDB State Backend

The RocksDBStateBackend stores the state in RocksDB, an embedded key-value store, on each task manager node. Upon checkpointing, a snapshot is stored on a configured file system, as with the FsStateBackend, and all snapshots are executed asynchronously. A file system address or path is required at initialization, as it is used to store the snapshots. It can be used for highly available systems with large state size requirements. The RocksDBStateBackend can store as much state as there is disk space available, in contrast to the other two backends, where the size is limited by memory. On the other hand, performance is impacted because data must be serialized to be stored in RocksDB, compared to direct object manipulation on the Java heap. It is the only state backend that supports incremental checkpointing - an approach where the state delta between checkpoints is stored instead of a full state snapshot.
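A minimal sketch of selecting this backend in an application, using the Flink 1.x configuration API; the checkpoint URI is an illustrative assumption.

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BackendConfig {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Working state lives in embedded RocksDB on each task manager;
        // snapshots are written to the distributed file system below.
        RocksDBStateBackend backend =
            new RocksDBStateBackend("hdfs:///flink/checkpoints", true); // true = incremental

        env.setStateBackend(backend);

        env.fromElements(1, 2, 3).print(); // trivial pipeline for illustration
        env.execute("backend-config");
    }
}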

2.5.3 External State Approach

Traditionally, data processing pipelines consist of a cluster of compute nodes, which are typically virtual machines (VMs) with one or more embedded or network-attached storage disks, communicating over a fast network. Overall, this architecture works well as long as the dataflow graph of the pipeline is easily parallelizable.

There are different types of state that Flink maintains for its operations as well. For example, all the windowing operations that need to be aggregated must store their input somewhere until the trigger fires to perform the computation. Flink stores this on the compute node disk, which means there is a lot of data in streaming pipelines that needs to be preserved.

All the existing state backend implementations use embedded state, which comes with a few challenges. For example, when performing a re-scaling operation with the RocksDB state backend, if the application logic is compute-heavy and Flink's tasks are overloaded, scaling out means scaling out storage together with the compute tasks. Similarly, if the application logic is state-intensive, then scaling out storage also allocates more virtual CPUs, which are not required. Additionally, the problem with this approach, especially in the compute-intensive case, is that every time the application re-scales, it also needs to re-shuffle and move state around, which is expensive. So, it is time to rethink the traditional data processing architecture.

A better strategy, for both re-scaling and recovery purposes, is to decouple compute and state. If we have a compute-intensive task, we scale out compute only, and if we have a state-intensive task, we scale out the storage. Therefore, we only acquire the resources that the running pipeline actually needs.

Figure 2.5: Separate compute and storage [6]

It is more efficient and easier to auto-scale when state storage is separate from compute. On the left-hand side of fig. 2.5, we have an architecture with a streaming engine where state storage is separate from compute, while on the right-hand side is the traditional architecture that keeps state storage on the workers. In the traditional architecture, the unit of scaling is completely vertical: state has to scale together with compute. The decoupled streaming engine, in contrast, can scale compute separately from storage: if the pipeline is more compute-intensive, it launches more workers, and if the pipeline is more state-intensive, it increases the storage resources. Furthermore, Flink already provides excellent support for compute reconfiguration, but for decoupled storage we need a storage system that acts separately from Flink while leveraging existing Flink features, so that we can provide the best of both worlds.

Dataflow [24] is another unified stream and batch data processing engine, by Google. Initially, it used a persistent disk embedded in the worker node to store the state, with intensive memory caching to provide better performance. Later, they developed a state backend that separated compute from state storage to improve the scalability of streaming and batch pipelines.

2.5.4 Flink Re-scalable State

Apache Flink follows a shared-nothing architecture. Every distributed task on the compute nodes processes the data or key groups assigned to it and needs no external information from other nodes in the cluster. Currently, when a Flink cluster needs to be reconfigured, a checkpoint is triggered to snapshot the state to external persistent storage such as HDFS. Then the cluster is stopped, and the snapshotted state from the last checkpoint is used to redistribute state onto the reconfigured cluster nodes.

Figure 2.6: Reshuffling of keys while changing parallelism [7]

Keyed state is only available for keyed streams. Flink also ensures that all the events for a specific key are processed by one specific operator instance. The mapping between a key and an operator is computed through hash partitioning on the key[7]. During recovery or restart, all the subtasks can read the last checkpointed state based on the computed hash.

However, when we re-scale a keyed state backend, the computed hash values change because of the different number of nodes, which introduces a lot of random reads, as shown in fig. 2.6. Flink uses key groups to solve this challenge. A key group is the atomic unit of state assignment. Each subtask is assigned a range of key groups, which makes the reads on restore sequential within each key group.
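The assignment can be sketched as follows. This is a simplification modeled on Flink's KeyGroupRangeAssignment, not the exact implementation (Flink additionally applies a murmur hash to the key's hash code):

// Simplified sketch of key-group based state assignment.
public final class KeyGroupAssignment {

    // Map a key to one of maxParallelism key groups.
    static int assignToKeyGroup(Object key, int maxParallelism) {
        return Math.abs(key.hashCode() % maxParallelism);
    }

    // Map a key group to an operator subtask; each subtask thus owns a
    // contiguous range of key groups, keeping restore reads sequential.
    static int keyGroupToOperator(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128;
        int keyGroup = assignToKeyGroup("user-42", maxParallelism);
        System.out.println("key group " + keyGroup
            + " -> subtask " + keyGroupToOperator(keyGroup, maxParallelism, 4));
    }
}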

2.6 RocksDB State Backend

The RocksDB state backend stores the state as serialized byte strings. RocksDB is a key-value store based on a log-structured merge tree (LSM tree). Flink uses a serialized composite key consisting of the key, the key group, and the namespace. For every read or write operation, both keys and values need to be de/serialized, which can compromise performance compared to the in-memory state backends.

Figure 2.7: Flink RocksDB state backend

RocksDB has transient memory that acts as a cache, as in fig. 2.7: multiple memory tables used for read and write operations. A write operation stores the data in the currently active memory table. When an active memory table is full, it becomes a read-only memory table and is replaced by a new active memory table. The old, read-only memory tables are flushed to disk asynchronously; on disk they are called SSTables.

Similarly, a read operation always tries to read from the active memory table first. If it finds the data, it deserializes the value and returns the response, which is nearly equivalent to a memory access. If the key is not in the active memory table, the read-only memory tables are searched, from the most recent to the oldest. If the value is still not found, the SSTables are finally searched for the key. There are multiple possible optimizations to avoid hitting the disk.
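The lookup order described above can be sketched as follows; this is an illustrative simplification of the LSM read path, not RocksDB's actual code.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the LSM-tree read path described above.
final class LsmStore {
    private final Map<String, byte[]> activeMemTable = new HashMap<>();
    // Read-only (immutable) memtables, most recent first.
    private final Deque<Map<String, byte[]>> readOnlyMemTables = new ArrayDeque<>();

    byte[] get(String key) {
        byte[] value = activeMemTable.get(key);          // 1. active memtable
        if (value != null) {
            return value;
        }
        for (Map<String, byte[]> table : readOnlyMemTables) {
            value = table.get(key);                      // 2. immutable memtables
            if (value != null) {
                return value;
            }
        }
        return searchSsTables(key);                      // 3. on-disk SSTables
    }

    private byte[] searchSsTables(String key) {
        // Placeholder: a real store would consult bloom filters and the
        // SSTable index blocks before touching the disk.
        return null;
    }
}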

The RocksDB state backend uses an embedded RocksDB instance on each compute node, which provides data locality. It uses checkpointing to back up the persistent database log files to a distributed file system, e.g., HDFS or S3. The embedded RocksDB instances provide good performance, but they also create certain challenges and add reconfiguration delays. Furthermore, Flink stores the state as serialized bytes in RocksDB. This means that data has to be de/serialized with every read or write operation, which can compromise performance.

2.7 Flink Fault Tolerance

Flink provides a robust fault tolerance API to checkpoint applications and recover them from failure. It captures snapshots periodically to recover from failures. A snapshot is a global state of the system, storing enough information to restart the application from that specific state. Flink provides different delivery guarantees, ensuring that the application receives a record exactly once or at least once.

State management comes out of the box with Flink. While Flink abstracts the traditional state complexities from application developers, it has to do a lot more to provide stateful fault-tolerant applications: it needs to checkpoint the state frequently and restore it in case of failures. Checkpointing in a distributed system is more complex because of the dynamic and unpredictable nature of the network. Distributed snapshots - snapshots of a distributed system - at any point in time contain the state of all the processes (vertices) and their network connections (edges). Flink is a distributed stream processing engine, hence it uses a distributed snapshot algorithm for checkpointing; it leverages a variant of the famous Chandy-Lamport algorithm for snapshotting.

Flink requires a replayable data source in addition to the state backend for checkpointing. When an application fails, checkpoints are used to restore the application based on the snapshotted position of the data source, e.g., an Apache Kafka offset, to go back and replay the lost messages.

2.7.1 Checkpointing

Application checkpointing is a common technique in computer science to make applications fault-tolerant. In this approach, we make a copy of the application state, called a snapshot, at a regular interval, and store it. When the application fails, we restart it using the last saved snapshot. This helps streaming applications resume processing from the last checkpoint instead of starting all the calculations from the beginning.

Checkpointing is an expensive operation, hence it cannot be done after processing each record. A naive algorithm for checkpointing would be to pause or stop the application, back up the state from memory to some reliable storage, and resume the application. It makes no sense to run a checkpoint every second if taking a snapshot takes 2-3 seconds. Application developers need to weigh this trade-off and choose a checkpointing interval that achieves optimal performance.
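A minimal sketch of enabling periodic checkpointing and tuning this trade-off; the interval values are illustrative assumptions.

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot every 60 s; the interval is a trade-off between
        // checkpointing overhead and the amount of replay on recovery.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        // Leave some breathing room between consecutive checkpoints.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);

        env.fromElements(1, 2, 3).print(); // trivial pipeline for illustration
        env.execute("checkpointed-job");
    }
}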

Distributed systems generally use two different approaches for checkpointing: coordinated and uncoordinated checkpointing. Coordinated checkpoints are more complex, since all the nodes need to coordinate and align themselves before taking a snapshot: the state of all the compute nodes, as well as the in-flight messages on the network, must be captured. Generally, two-phase commit protocols are used for coordinated checkpointing. In uncoordinated checkpointing, all the nodes save their state at a certain interval without coordination. This does not guarantee global consistency, because each node may be at a different processing stage, and during recovery some of the messages may be lost or replayed multiple times.

Flink uses a variant of the Chandy-Lamport algorithm for coordinated checkpointing to draw consistent snapshots of the data stream and operator state [25]. It uses barriers to align the distributed data streams, and stores the snapshots in the state backends.

2.7.2 Consistent Snapshots - Chandy-Lamport

Determining the global state of a distributed system in order to capture a snapshot is a common challenge. The global state does not only include the state of the individual nodes in the distributed system, but also the messages on the communication channels. A consistent snapshot is important for a distributed system for various reasons: it is used to determine safe points, find out the current load of the system, and verify whether a deadlock exists. It is also widely used to detect whether a distributed algorithm has terminated.

Capturing a snapshot of a distributed system poses quite a few challenges. One cannot capture all the processes at the same time, and the messages flowing on the communication channels cannot be observed. Nodes do not have identical processing times, and network channels always have varying delays. By the time some nodes capture their state, it will already be out-of-date, since other nodes might have received messages from the future. A naive approach to consistent snapshots would require a globally synchronized clock, such that each node could snapshot its state at the exact same time. Since such a clock is not available in practice, a few algorithms exist that capture a consistent past global state without one. One of the well-known algorithms is Chandy-Lamport, which uses a consistent cut to capture a consistent snapshot, as in fig. 2.8. A cut in a distributed system is considered consistent if, for each event it contains, it also contains all the events that happened before it.

Figure 2.8: An example of an inconsistent (C1) and a consistent cut (C2) [5]

Chandy and Lamport suggested a model of a distributed system where all the processes (nodes) have their own state. It assumes that the graph is strongly connected and that all the processes are connected by channels which follow these rules:

• FIFO guarantees

• Infinite buffer size

• Messages are delivered in a finite amount of time

Marker messages are sent on the channels to trigger the consistent cut and record the messages. This model is used as the basis for many other distributed algorithms, but stream processing engines need to enhance these algorithms to capture their snapshots.
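As an illustration of the marker rule, the following is a minimal sketch of one Chandy-Lamport process; the class, its method names, and the channel representation are our own simplifications, not taken from any real system.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SnapshotProcess {
    private final Set<Integer> inputChannels;       // ids of incoming channels
    private Object recordedState = null;            // local state captured at snapshot time
    private final Set<Integer> recordingChannels = new HashSet<>();
    private final Map<Integer, List<Object>> inFlight = new HashMap<>();

    public SnapshotProcess(Set<Integer> inputChannels) {
        this.inputChannels = inputChannels;
    }

    /** Marker received on input channel `channel`. */
    public void onMarker(int channel, Object currentLocalState) {
        if (recordedState == null) {
            // First marker seen: snapshot the local state, forward markers
            // downstream, and start recording every other input channel.
            recordedState = currentLocalState;
            sendMarkerOnAllOutputChannels();
            recordingChannels.addAll(inputChannels);
        }
        // The channel the marker arrived on is now fully recorded.
        recordingChannels.remove(channel);
    }

    /** Normal message received on input channel `channel`. */
    public void onMessage(int channel, Object message) {
        if (recordingChannels.contains(channel)) {
            // Message was in flight when the snapshot started: record it.
            inFlight.computeIfAbsent(channel, c -> new ArrayList<>()).add(message);
        }
        // ...normal processing of the message would continue here...
    }

    private void sendMarkerOnAllOutputChannels() {
        // Network send omitted in this sketch.
    }
}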

2.7.3 Flink 2PC Protocol

Apache Flink uses a variant of the Chandy-Lamport algorithm, the epoch commit protocol [5], and performs an asynchronous epoch commit. It records the state between two consecutive checkpoints and stores it in permanent storage. Since it is an asynchronous operation, it does not block normal stream processing; marker messages, called checkpoint barriers, are used to capture the consistent cuts and create the snapshot to be stored in external storage. Apache Flink also allows multiple checkpoints to run at the same time, and the end user can configure the delay between consecutive checkpoints.

Apache Flink runs a two-phase commit protocol to ensure transactional guarantees: the job manager first asks all task managers to prepare the snapshot and, once it has received acknowledgments from all of them, sends a commit message to complete the checkpoint. Concretely, the snapshot coordinator process, running at the job manager, initiates an instance of the epoch commit protocol. It triggers an epoch change on all the task managers, which then run the snapshotting algorithm and send out the marker messages. All the tasks snapshot their state and send a success acknowledgment back to the coordinator; this preparation phase finishes only when acknowledgments have been received from all the task managers. If any acknowledgment times out or fails, the checkpointing process aborts.

The commit phase is initiated by the snapshot coordinator sending a commit-epoch message, only after all prepare acknowledgments have been received successfully. All the task managers then commit the prepared state to their backends or to external storage, depending on the type of the backend.

2.7.4 Flink State Checkpointing

Apache Flink checkpoints or savepoints are also used to upgrade a streaming application seamlessly without losing the running application's state. The application state is captured with a savepoint before shutting the application down, and then the application updates are deployed. On starting again, it loads the savepoint and resumes the application where it left off.

Savepoints are user-triggered checkpoints where Apache Flink takes a snapshot and stores it on the state backend, similar to checkpointing. Savepoints are not removed automatically, as they are used to back up the state of the application, while checkpoints are temporary and removed after program execution unless explicitly retained.

Under the hood, the different state backend serializers behave very differently. The HeapStateBackend keeps Java objects and uses lazy serialization and eager deserialization. The RocksDBStateBackend, on the other hand, uses eager serialization and lazy deserialization. Apache Flink supports serialization of the basic Java types and some composite types while storing the state, including Tuples, POJOs, and Apache Avro. For any other generic type, it falls back to Kryo serialization and deserialization unless a custom serializer is provided.

Flink uses checkpoint barriers to capture the consistent snapshots that are used to restore the state in case of a failure. Flink checkpointing is asynchronous [25] and does not block normal stream processing. Barriers are lightweight, and multiple barriers can be in the data stream at the same time. The snapshot coordinator initiates the checkpointing, and barriers are injected into the data stream at the source operators. Once an operator has received a barrier from all of its sources, it takes its snapshot and emits a barrier for the downstream operators. Once the sink operator has received the barriers from all of its incoming channels, it acknowledges the checkpoint completion to the snapshot coordinator.

Flink supports incremental snapshotting as well. Incremental checkpoints only snapshot the state difference between the last checkpoint and the current state. This is helpful for applications with large state, as checkpointing the delta reduces the snapshot size, which in turn reduces the time to complete a checkpoint. Flink's robust snapshotting provides a strong recovery mechanism: when a failure happens, Flink chooses the last completed checkpoint and reloads the state of the different operators from the snapshot. For incremental snapshots, it copies the last full snapshot and applies all the incrementally snapshotted state changes.
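As an illustration of these options, the sketch below enables exactly-once checkpoints on the RocksDB state backend with incremental checkpointing and retained (savepoint-like) checkpoints; the checkpoint URI is a placeholder.

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class IncrementalCheckpointExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Exactly-once checkpoints every 10 seconds.
        env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);

        // RocksDB backend with incremental checkpointing enabled: only the
        // delta since the last checkpoint is written to durable storage.
        // The HDFS path is a placeholder.
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));

        // Keep the last checkpoint when the job is cancelled, so it can be
        // used for a manual restart, mirroring savepoint semantics.
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
    }
}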

2.8 Transactional Processing for Streaming Applications

OLTP (online transaction processing) and OLAP (online analytical processing) are commonly used terms for classical data processing applications. OLTP captures, processes, and stores data from real-time transactions, while OLAP uses complex queries to extract information from the aggregated historical data stored by OLTP. Transactional processing is the core concept of OLTP and ensures the atomicity of multiple transactions or units of work. These systems are not designed for unbounded and continuous streams of input data.

On the other hand, most modern stream processing platforms [13, 11, 12] are designed to process unbounded streams. Such systems typically run in main memory to avoid the extreme latency caused by disk access, and they are not inherently designed for ACID transactions. Hence, they bind storage and compute to the same system to leverage local read/write operations. Separating compute and storage can enhance scalability, but it requires designing systems in which all the data is stored on a networked or distributed system. Such systems use local storage only for transient data, which can be re-read from persistent storage in case of failure. Having storage on different nodes introduces the need for transactional processing, due to factors such as unreliable network communication or failure of the remote storage systems.

S-Store is the world's first transactional streaming database system [26], in which a workflow is modeled as a dataflow graph of transactions. Transaction processing guarantees coordination safety for distributed storage. It provides three streaming guarantees: ACID (Atomicity, Consistency, Isolation, Durability) transactions, data ordering, and exactly-once execution. It is built on top of H-Store, a distributed main-memory OLTP system.

2.9 NDB Database

NDB (Network Database) is an in-memory, highly available, distributed storage engine. It is widely used in the telecom industry but is mostly known as NDB Cluster, since it is the underlying storage engine of MySQL Cluster. NDB uses a shared-nothing architecture, which makes it possible to build the cluster from commodity hardware: each machine has its own resources, and network communication is limited.

An NDB cluster usually has two types of nodes: management nodes and data nodes. A cluster should have at least one management node and as many data nodes as needed. Management nodes act as leaders and are responsible for configuration and network partition management; they also orchestrate starting and stopping the data nodes, and running backups [27]. Data nodes are at the core of the cluster: they store the actual data and perform the main work of the queries.

Figure 2.9: NDB Cluster [1]

NDB can have up to 48 data nodes, and the recommended practice is to keep two replicas of the data; higher replication factors require more data nodes in the cluster. NDB handles node failures automatically and transparently to the end user. It can handle failures safely because of data node replication: the cluster survives as long as one replica of each partition is still alive, so if a node goes down, a running transaction will fail, but the next attempt will get the data from a replica node. NDB provides good performance because data is stored in memory, but data is also persisted to disk through checkpoints.

Sharding and partitioning of the data across multiple nodes is handled by NDB, but can be controlled and customized by applications. The cluster is organized into node groups consisting of one or more nodes that store partitions, i.e. sets of fragment replicas. The number of node groups is not directly configurable; it is the number of data nodes divided by the number of replicas, so, for example, 4 data nodes with 2 replicas form 2 node groups.
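For illustration, the sketch below accesses NDB through ClusterJ, the native Java connector that talks directly to the data nodes; the connect string, database name, and the mapped table kv_state are assumptions for the example.

import java.util.Properties;
import com.mysql.clusterj.ClusterJHelper;
import com.mysql.clusterj.Session;
import com.mysql.clusterj.SessionFactory;
import com.mysql.clusterj.annotation.PersistenceCapable;
import com.mysql.clusterj.annotation.PrimaryKey;

public class NdbAccessExample {

    // Hypothetical table mapping; the table kv_state is an assumption.
    @PersistenceCapable(table = "kv_state")
    public interface KvRow {
        @PrimaryKey
        int getId();
        void setId(int id);

        byte[] getValue();
        void setValue(byte[] value);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Management-node address and database name are placeholders.
        props.setProperty("com.mysql.clusterj.connectstring", "mgmt-node:1186");
        props.setProperty("com.mysql.clusterj.database", "flinkndb");

        SessionFactory factory = ClusterJHelper.getSessionFactory(props);
        Session session = factory.getSession();

        // Primary-key read: routed directly to the data node owning the partition.
        KvRow row = session.find(KvRow.class, 42);

        // Primary-key write: creates or overwrites the row with id 42.
        KvRow updated = session.newInstance(KvRow.class);
        updated.setId(42);
        updated.setValue(new byte[]{1, 2, 3});
        session.savePersistent(updated);   // upsert semantics

        session.close();
    }
}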

2.10 Summary

In this chapter, we introduced a number of fundamental concepts about Apache Flink and the technologies related to FlinkNDB. We started with the different generations of big data analytics and discussed the state-of-the-art solutions. Afterward, we focused on Apache Flink and reviewed its architecture, programming model, and supported streaming features. We covered state management in detail, discussing the different types of state, state backends, and how failure recovery is supported by state backends. We then described transactional processing in streaming platforms. Finally, we gave a brief introduction to NDB and its capabilities as a distributed database. With this background established, chapter 3 focuses on the design, architecture, and implementation that address the dynamic scalability challenge.

Chapter 3

Design and Implementation of FlinkNDB

Flink state is embedded in the compute nodes. To meet our goal, we need to decouple it from the compute nodes to speed up reconfiguration. The FlinkNDB state backend is designed to reconfigure a running application with minimal delays. The core principle is to decouple the state from the compute nodes while still providing read and write performance equivalent to memory. The other consideration is to achieve better failure recovery and snapshotting performance compared to the other state backends.

A state backend is responsible for handling operator state as well as keyed state. FlinkNDB uses the default Flink implementation for operator state, while it provides a custom implementation of the keyed state backend. A keyed state backend has various types of state descriptors that need to be implemented to store and read the data from NDB; the major ones are value state, list state, and map state. NDB provides good performance because it is an in-memory database, but processing delays still need to be taken care of, because data is transferred over the wire.

The different state types use individual tables to store the state in the database. FlinkNDB has three tables for each type of state: an active state table, a commit state table, and a snapshot table. The active state represents the state of the application at any point in time, while the commit state stores the snapshotted state for each epoch. The snapshot table records the epochs committed in the commit table and is mainly used for snapshotting and recovery purposes.

FlinkNDB was implemented in two phases. Sruthi [1] implemented the bare-bones state backend, which stored everything in the active state tables. This thesis complements Sruthi's work [1] in the same code base, forming the basis of the open-source backend and adding support for checkpointing and recovery. The following sections explain the basic structure, the specifics of each type of state, and different possible implementations with their challenges. Furthermore, they present how FlinkNDB does snapshotting and handles failure recovery.

3.1 FlinkNDB Architecture

NDB gives robust performance with the right use of primary keys and partition groups. Flink key-groups are used for partitioning across the NDB cluster, so when FlinkNDB performs read and write operations, it always reads the state from a specific partition. Hence, when it reads by primary key from the right partition, performance should be close to in-memory reads and writes.

The FlinkNDB state backend has an application layer, a transient or cache layer, and finally a persistent or database layer. Flink sits at the application layer, where the user reads and writes the different states through an abstract view, without knowledge of the underlying complexities of the different state backends. Flink manages the compute tasks and delegates every read and write operation to the FlinkNDB state backend.

3.1.1 Cache Layer

FlinkNDB's state storage has an in-memory cache and a persistent store, a distributed database. It implements a light form of multi-version concurrency control on top of NDB to support all necessary Flink operations. The in-memory cache, the transient layer, acts similarly to the RocksDB memtable: it logs all recent changes to the state. The main difference is that a cache hit gives performance equivalent to a direct in-memory operation, since values in the cache are not serialized or deserialized, in contrast to what RocksDB does.

3.1.2 Database Layer

At the database layer, FlinkNDB has two tables: the active and commit state tables. The active table represents the state values at any point in time, while the commit table maintains the committed state values over the different epochs. On a cache miss, the record is fetched externally from a special table in NDB which keeps the latest active state of a running Flink application, called the active table. All fetched values are also kept in the cache for fast future access.

In order to speed up state commits, FlinkNDB also maintains an in-memory write-ahead log to track the incremental changes, i.e. the writes between Flink checkpoints. When state needs to be checkpointed, or whenever an entry needs to be evicted from the cache, the write-ahead log is flushed into another special table in NDB, referred to as the commit table. The commit table logs all durable changes across checkpoints, as it stores the values for each checkpoint epoch. It effectively takes the role that HDFS plays for RocksDB, yet it is orders of magnitude faster in committing changes.

Once a checkpoint happens, FlinkNDB spills the whole write-ahead cache over to the database layer, into the committed state tables. Additionally, if the transient cache fills up, it is synchronized to the persistent state in the database as well. Finally, in the case of a failure, the commit table is used to recover the state into the active table.

3.1.3 Primary Key

FlinkNDB uses a composite primary key with four columns: key, key-group, state name, and namespace. The key is the partition key of the keyed data stream; Flink ensures it has a unique value across the keyed state. A single key-group can contain multiple keys. The namespace and state name are used to differentiate between different states on the same key. A state name column was not required in the other backends, because they keep separate database instances, but FlinkNDB has a single repository for all the state from all the different compute instances of the cluster.
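A plain-Java sketch of such a composite key is shown below; the class is illustrative and not FlinkNDB's actual implementation.

import java.util.Arrays;
import java.util.Objects;

public final class StateKey {
    private final byte[] key;        // serialized partition key of the keyed stream
    private final int keyGroup;      // Flink key-group the key hashes into
    private final String stateName;  // state descriptor name, e.g. "counts"
    private final String namespace;  // disambiguates equal state names

    public StateKey(byte[] key, int keyGroup, String stateName, String namespace) {
        this.key = key;
        this.keyGroup = keyGroup;
        this.stateName = stateName;
        this.namespace = namespace;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof StateKey)) return false;
        StateKey other = (StateKey) o;
        return keyGroup == other.keyGroup
                && Arrays.equals(key, other.key)
                && stateName.equals(other.stateName)
                && namespace.equals(other.namespace);
    }

    @Override
    public int hashCode() {
        return Objects.hash(Arrays.hashCode(key), keyGroup, stateName, namespace);
    }
}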

3.2 NDB Schema

FlinkNDB stores the state in an active table [1], which is used for both reads and writes. Its primary key consists of key, key-group, state name, and namespace. The key is the Flink state key, which is used to partition the input stream, since this is a keyed state backend. A key-group is an atomic unit for partitioning the state data at a granular level. The state name can be based on the Flink operator or provided by the application developer. Finally, the namespace is also part of the key: since FlinkNDB stores all the state values in a single table, the namespace differentiates occurrences of the same state name in two different namespaces.

The active state schema does not support multiple versions of the same value across different checkpoints. FlinkNDB needs a schema such that, if a failure happens, it can differentiate between the current values and all the changes that have happened since the last successful checkpoint. So, to maintain different versions of the value for the same key, an additional version column needs to be added to the active state table.

The initial approach was to add a checkpoint-epoch column to the active table, maintaining the state versions after each checkpoint. For each epoch, a new copy of the state value is written into the active table, so after the nth successful checkpoint, the active state table holds n versions of a specific key. This schema change, the addition of the epoch column, made it possible to replay the state values over the different epochs for querying and debugging purposes. It was a reasonable approach, but it adds long delays during recovery, because it requires group-by operations on the active state table.

Figure 3.1: FlinkNDB initial architecture

A single-purpose active state table has quite a few challenges. For example, all the state values are written on each epoch, and there is no way to do incremental checkpointing, which adds processing delays and requires extra storage. More importantly, this design does not use the primary key on NDB. So, the schema needs to be designed such that it stores only the changes over time and provides a way to restore from the historical log efficiently.

3.3 Checkpointing

State is critical to Flink's operation, because any function or operator can be stateful and may need to store data for each input element processed. Hence, Apache Flink checkpoints the state periodically, so that it can restore the complete state of a running application in case of failure and resume processing as if nothing had happened. Apache Flink allows application developers to configure the checkpoint interval in the code or through the Flink configuration file. As checkpoints are intended for fault tolerance, they are transparent to the end user and deleted once a job is cancelled or finished. Flink can be configured to retain them, in which case the user needs to delete them manually. While a job is running, Flink retains only the n most recent checkpoints (n being configurable).

When Flink's snapshot coordinator starts a snapshot, it inserts markers into the data stream, as shown in fig. 3.2, and each task eventually receives a marker. Once a task has received the marker on all of its inputs, its snapshot method is invoked.

The snapshot table has key-group, namespace, and epoch as its primary key, because Flink tasks and state have different granularity: state is always fragmented by key-group, and a single task can own many key-groups.


Figure 3.2: Flink injection of barriers into data stream [8]

All key-groups within a namespace need to be marked with a status for a specific epoch. In Flink semantics, the namespace identifies the operator being snapshotted. So, an operator has key-groups, and each key-group has values snapshotted in a specific epoch, marked with a completion status. When an application runs, state is stored in both the active and commit state tables. Once a checkpoint triggers, the snapshot table is populated with all the respective key-groups, marking that the commit state table contains the final values for that epoch. Writes happen in order, as Flink performs the alignment automatically, so FlinkNDB does not need to worry about out-of-order elements at the epoch level. If the application crashes before the completion of the next checkpoint, the state values from the commit table are used to restore the state to that point in time.

FlinkNDB does incremental checkpointing by default, because state granularity is at the key-group level. This means that, for any given key, it writes to the database only if there were changes during that epoch; otherwise nothing is logged in the database. The snapshot table keeps track of which epoch has been committed for a specific key and is used to perform recovery.

When recovering from the commit state table into the active state table after a failure or reconfiguration, FlinkNDB cannot determine on its own which epoch to use to restore the values. Since FlinkNDB writes to both the active and commit state tables, the commit state table can contain values for multiple epochs. For example, if the application crashes after the nth successful epoch but before the next checkpoint, the commit state table will also contain dirty values for the (n+1)th epoch. Hence, the snapshot table is consulted to get the last successfully committed epoch, which is then used to reconstruct the active table from the commit table.

The current implementation stores the metadata on the file system, as the RocksDB state backend does, but all the application state is stored in the database. During checkpointing, the metadata is stored at the user-configured location; on restart after a failure or reconfiguration, FlinkNDB loads the metadata and populates the data from the database.
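The sketch below outlines this checkpoint path; the two table interfaces and their method names are hypothetical stand-ins for the NDB access code.

import java.util.Map;

public class EpochCheckpointer {

    interface CommitTable {
        // Writes one state entry for the given epoch (epoch is part of the PK).
        void write(long epoch, Object stateKey, byte[] value);
    }

    interface SnapshotTable {
        // Marks the epoch as fully committed for one key-group and namespace.
        void markComplete(int keyGroup, String namespace, long epoch);
    }

    private final CommitTable commitTable;
    private final SnapshotTable snapshotTable;

    EpochCheckpointer(CommitTable commitTable, SnapshotTable snapshotTable) {
        this.commitTable = commitTable;
        this.snapshotTable = snapshotTable;
    }

    // Invoked when a checkpoint barrier has been received on all inputs.
    // Only keys modified during the epoch are in the write-ahead cache,
    // which is what makes the checkpoint incremental.
    void snapshot(long epoch, int keyGroup, String namespace,
                  Map<Object, byte[]> writeAheadCache) {
        for (Map.Entry<Object, byte[]> entry : writeAheadCache.entrySet()) {
            commitTable.write(epoch, entry.getKey(), entry.getValue());
        }
        // Recovery only trusts epochs recorded here; anything newer in the
        // commit table is treated as dirty and ignored.
        snapshotTable.markComplete(keyGroup, namespace, epoch);
        writeAheadCache.clear();   // start collecting deltas for the next epoch
    }
}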

Figure 3.3: FlinkNDB state backend architecture with cache

3.3.1 NDB Schema Enhancements

FlinkNDB leverages the primary key for all possible operations on NDB, so the active table was split into 3 different tables - active, commit, and snapshot - as in Fig. 3.4. The active state table represents the state value at any point in time, while the commit table maintains state versions for recovery purposes. The snapshot table maintains the metadata indicating which epoch needs to be used to restore the data from the commit table.

Figure 3.4: FlinkNDB - NDB Table Schema

With the changed schema, the active state table is used for reads and writes, but all operations are also written to the commit state table to keep track of the state value after each epoch. The active state table schema includes an additional epoch column. The epoch is not part of its primary key, which means the table cannot hold duplicate entries for a key: each write during a different epoch overwrites the value of the same key, so the active state table always returns the latest value, which is what is required during normal operation with no failures.

The commit state table has the same columns as the active table, but a different primary key: the commit table includes the epoch in the primary key, so each row uniquely represents a specific key of a specific operator at a specific epoch. The commit state table therefore has an entry per epoch, like a log of all the operations at the end of each epoch. Contrary to the active state table, which only keeps the latest value, the commit table holds the state values over the different epochs, because the epoch is part of the primary key.

3.4 State Type Schema

Value state is the simplest among the state types supported by Flink: it stores a single value for each key of the keyed stream. So far, mainly the value state schema has been discussed, but Flink supports other types as well, as described in 2.4.1.

3.4.1 List State

List state stores a list of elements. It has a different schema than value state, because the state backend needs to store the list values along with their indices; hence, an additional column is introduced in the table to keep the index of each list item. All database read operations happen on the active table, and write operations are done on both the active and commit tables.

As mentioned earlier, FlinkNDB leverages the primary key for most database operations to provide fast, comparable performance. Reading an element row is a primary-key operation, with the specified index as part of the primary key. Reading the whole list can be implemented with two different approaches: either use an iterator and do a batched read of primary-key operations, or do a non-primary-key operation without the index. The current implementation uses the first approach, leveraging primary-key operations with all the requests batched together to speed up performance.

On the other hand, adding an element at a specific index of the list is a tricky operation, and Flink does not support it; inserting an element appends the value at the end of the list. FlinkNDB needs to know the list size to determine where to put the element. A first approach would be to read the whole list and count its size before doing the insert, but this requires multiple operations and does not use the primary key. Instead, FlinkNDB has an additional table that maintains the number of items in each list; it is used to read the index at which a new element should be inserted using a primary-key operation, as sketched below.
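The sketch below shows that append path; both table interfaces and their method names are hypothetical.

public class ListStateAppend {

    interface ListTable {
        // Primary-key write of one list element at (stateKey, index).
        void writeElement(Object stateKey, int index, byte[] element);
    }

    interface ListSizeTable {
        // Returns the current number of elements for the key (PK read).
        int getSize(Object stateKey);
        void setSize(Object stateKey, int size);
    }

    private final ListTable listTable;
    private final ListSizeTable sizeTable;

    ListStateAppend(ListTable listTable, ListSizeTable sizeTable) {
        this.listTable = listTable;
        this.sizeTable = sizeTable;
    }

    // Appends at the end of the list using only primary-key operations.
    void add(Object stateKey, byte[] element) {
        int size = sizeTable.getSize(stateKey);   // next free index
        listTable.writeElement(stateKey, size, element);
        sizeTable.setSize(stateKey, size + 1);
    }
}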

3.4.2 Map State

Map state stores a hash map against each key of the keyed state. It is used for partitioned key-value state, where key-value pairs can be added, updated, and removed. The state is accessed and modified by user-defined functions and checkpointed consistently. Map state has a different schema to support storing and retrieving individual map entries efficiently, similar to the other state types: an additional column stores the user key, and it is part of the primary key. When a specific user key is requested, this column is used, along with the other primary-key columns, to fetch the value.

3.5 Cache Optimizations

FlinkNDB stores the state in a database over the network, in contrast to the in-memory operations of the RocksDB state backend. Hence, FlinkNDB's performance suffered from the network delays of read/write operations. To overcome that challenge, a caching layer was introduced between the state backend and the database.

The caching layer, or transient state, consists of two types of cache - active and commit - mirroring the active and commit state tables in the database, as shown in fig. 3.3. It is an in-memory cache and stores values without serialization. FlinkNDB implements a light form of multi-version concurrency control on top of NDB to support all necessary Flink operations. The in-memory cache acts somewhat like the RocksDB memtable: it logs all recent changes to the state. The main difference is that a cache hit performs like a direct in-memory operation, since no serialization or deserialization of cached values is required, in contrast to RocksDB. So, all reads and writes happen on the cache, and state values are committed at checkpoints or when cache entries are evicted. Since reads and writes are served from the in-memory cache, performance is equivalent to an in-memory state backend.

3.5.1 Active Cache

The active state cache stores all recent reads, like any normal cache. Whenever the application requests the state value for a key, FlinkNDB checks the cache first. On a cache miss, the value is fetched externally from the active state table in NDB, which always keeps the latest active state of the running Flink application. All fetched values are also kept in the cache for fast future access.

At the start of the application, the active state cache is empty, and it is populated during the warm-up phase. Active cache entries do not expire except when the cache is full; in that case, the least recently used (LRU) entries are removed to make room for new values.

3.5.2 Commit Cache

In order to speed up state commits, FlinkNDB also maintains an in-memory write-ahead log, referred to as the commit cache, to track the incremental changes, i.e. the writes between Flink checkpoints. The commit cache stores the state delta over an epoch. When the application starts, the commit cache starts capturing all the writes. When a checkpoint triggers, all the values are written to the NDB database and the cache is cleared to store the changes for the next epoch, as shown in fig. 3.5. In case of a failure, the most recent committed version in the commit table is used to reconstruct the active table.

Figure 3.5: FlinkNDB cache activity diagram

3.5.3 Cache Implementation

We used the Caffeine cache [28], a Java-based, high-performance, near-optimal caching library. It provides quite a few useful features, such as frequency- and recency-based eviction, time-based expiration, automatic loading of entries into the cache, and so on. There are various eviction strategies for individual cache entries when the cache is full or a time-based expiry happens. It allows us to configure the number of entries to be stored in the cache along with the expiry time.
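A minimal construction of such a cache with Caffeine could look as follows; the capacity, expiry, and the body of the eviction listener are illustrative.

import java.time.Duration;
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.RemovalCause;

public class ActiveCacheExample {
    public static void main(String[] args) {
        Cache<String, byte[]> activeCache = Caffeine.newBuilder()
                .maximumSize(100_000)                        // illustrative capacity
                .expireAfterAccess(Duration.ofMinutes(10))   // illustrative expiry
                .removalListener((String key, byte[] value, RemovalCause cause) ->
                        // Placeholder: an evicted entry would be flushed to NDB here.
                        System.out.println("evicted " + key + " (" + cause + ")"))
                .build();

        activeCache.put("key-1", new byte[]{1, 2, 3});       // write stays in memory
        byte[] hit = activeCache.getIfPresent("key-1");      // no serialization on a hit
    }
}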

3.6 Recovery

Flink leverages checkpointing and message replay to ensure robust fault tolerance. Each checkpoint stores the state snapshot at a point in time, along with the offset of the streaming input. Each checkpoint combines the snapshots produced by the individual tasks and bundles them together with the state handles for all the tasks. So what happens when an application crashes? It recovers from the latest complete snapshot: all the state handles are sent to their respective tasks, which restore their respective state, and the input messages are replayed from the offset recorded with the snapshot.

When an application is started from a snapshot, or restarted due to a failure, the recovery strategy is determined by the type of checkpointing, either incremental or complete. Flink assigns state handles to the different tasks to recover the state; when re-scaling, the state handles are redistributed to a different set of tasks than the ones that wrote them at checkpoint time. The recovery operation goes through all the state handles and reads the metadata to determine the different types of state (value, list, map) that need to be initialized with the respective type of serializer. Once all the metadata has been read and the data structures have been initialized, it reads the state data based on the key-group partitioning.

3.6.1 RocksDB Approach

RocksDB is an embedded database, so it stores all the state on the compute nodes, while snapshots are stored on HDFS or another distributed file system. Using a distributed file system as snapshot storage has its own pros and cons. It makes the state accessible to all the nodes in the cluster, which makes it easy to redistribute across nodes when re-scaling, and it makes the snapshots fault-tolerant as well. On the other hand, it has one major downside: all tasks have to read the state over the wire, which increases the recovery time. For recovery, the state is copied from the distributed file system to the compute nodes, which takes time due to the data movement over the network. Hence, recovery time is directly proportional to the state size: the larger the state, the longer it takes to recover.

3.6.2 FlinkNDB Approach

FlinkNDB performs minimal work for recovery, because the only step needed is to rebuild the active table from the commit table. There is no data movement over the network, because the data is recovered from the checkpointed snapshot on the same data node, from the commit table to the active table. There are two different approaches to recovery: eager and lazy. Eager recovery loads all the data into the active tables before starting the application, which may take some time depending on the state size. Lazy recovery, on the other hand, performs almost no work up front and starts processing right away, loading the data during the application's warm-up phase.

3.6.2.1 Eager Recovery

Our initial implementation uses the eager recovery approach. Once a checkpoint happens, the state is stored in the active and commit tables, and the metadata is stored in the auxiliary snapshot table. During epoch processing, if an eviction happens on the cache, the evicted entry is also stored in the commit table.

During recovery, the metadata is used to determine which epoch needs to be recovered; this information can also be determined by reading the most recent successful epoch in the snapshot table. After determining the recovery epoch, the next step is to populate the active table from the commit table: FlinkNDB reads the latest values up to the recovery epoch for all the keys under the specific key-group and populates the active table.

The optimizations made during checkpointing introduce a few challenges for recovery. One optimization is that, at checkpoint time, a state value is snapshotted only if it changed during that epoch; otherwise it is not captured in the database for the latest checkpoint. Consequently, copying the state from the commit table to the active table for the nth epoch is no longer straightforward. For example, suppose the input stream has three keys: x, y, and z. If all of them change continuously until epoch 3, both the active and commit tables will have rows for each epoch, and recovery only needs to take the rows with epoch 3 from the commit table and update the active table before restarting the process. It gets interesting when only the value of x changes during epoch 4: the checkpoint records only the updated value, and there are no commit table rows for keys y and z. When recovering from epoch 4, populating the active table with only the epoch-4 records from the commit table would break data integrity: the state values for keys y and z would not be available after restarting the application. The solution is to take, for each key, the most recent value up to the nth epoch, e.g. epoch 4 in our example: the value for key x is taken from epoch 4, while for keys y and z the values from epoch 3 are restored, with their respective epoch ids. Once the application restarts, all the data is read from the active table irrespective of the assigned epoch number.
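This per-key selection can be expressed as in the sketch below, which operates on a simplified in-memory view of the commit table; the row class and method names are illustrative.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EagerRecovery {

    static class CommitRow {
        final String key;
        final long epoch;
        final byte[] value;
        CommitRow(String key, long epoch, byte[] value) {
            this.key = key; this.epoch = epoch; this.value = value;
        }
    }

    // For every key, picks the row with the highest epoch <= recoveryEpoch.
    // Keys untouched in recent epochs (e.g. y and z in the example above)
    // are restored from their last committed epoch.
    static Map<String, CommitRow> latestUpTo(List<CommitRow> commitTable, long recoveryEpoch) {
        Map<String, CommitRow> result = new HashMap<>();
        for (CommitRow row : commitTable) {
            if (row.epoch > recoveryEpoch) {
                continue;   // dirty rows written after the last successful checkpoint
            }
            CommitRow best = result.get(row.key);
            if (best == null || row.epoch > best.epoch) {
                result.put(row.key, row);
            }
        }
        return result;   // rows to copy into the active table before restart
    }
}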

3.6.2.2 Lazy Recovery

Lazy recovery is another approach, which can potentially bring the recovery time down to close to zero, as it does not move data around at all. The idea is not to load the active table during recovery, but to fall back to the commit table whenever a record is not found: when a read from the active table (after a cache miss) finds no value, the required value is fetched from the commit table, and the active table is updated before the value is returned. This approach has higher read latency during the initial warm-up period, while the cache and the active table are being populated lazily. So, it is a trade-off between reducing the recovery or reconfiguration time and the performance of the data processing system during the warm-up phase. Lazy recovery remained purely in the ideation phase and could not be implemented due to lack of time.
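Since lazy recovery never left the ideation phase, the following is only a sketch of how the read path could look; every interface and method name in it is hypothetical.

import java.util.Optional;

public class LazyRecoveryReader {

    interface ActiveTable {
        Optional<byte[]> read(Object stateKey);
        void write(Object stateKey, byte[] value);
    }

    interface CommitTable {
        // Latest committed value up to the recovery epoch, if any.
        Optional<byte[]> readLatest(Object stateKey, long recoveryEpoch);
    }

    private final ActiveTable activeTable;
    private final CommitTable commitTable;
    private final long recoveryEpoch;

    LazyRecoveryReader(ActiveTable activeTable, CommitTable commitTable, long recoveryEpoch) {
        this.activeTable = activeTable;
        this.commitTable = commitTable;
        this.recoveryEpoch = recoveryEpoch;
    }

    // Called on a cache miss after a restart.
    byte[] read(Object stateKey) {
        // Fast path: the key was already restored (or written) after restart.
        Optional<byte[]> active = activeTable.read(stateKey);
        if (active.isPresent()) {
            return active.get();
        }
        // Slow path during warm-up: restore the key on demand from the
        // commit table and write it back so the next read is fast.
        byte[] restored = commitTable.readLatest(stateKey, recoveryEpoch).orElse(null);
        if (restored != null) {
            activeTable.write(stateKey, restored);
        }
        return restored;
    }
}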

3.7 Summary

This chapter started with a discussion of state management in different stream processing engines before going into the specifics of Apache Flink and the FlinkNDB state backend. It described the RocksDB architecture and set up the basis for the FlinkNDB design. Afterwards, it discussed how FlinkNDB and its database schema have evolved from Sruthi's work [1]. Furthermore, it walked through the different caching strategies implemented to improve performance. Finally, it closed with a discussion of the recovery implementation. Next, chapter 4 reviews the benchmarks that were developed and the experiments performed to compare FlinkNDB's performance with the other state backends.

Chapter 4

Benchmarking & Results

In this chapter, we evaluate the performance of FlinkNDB using a number of benchmarks. We used different benchmarking tools, including our own, to measure different performance metrics for the thesis. This includes data generation, running different test scenarios, setting up the log processing pipeline, and drawing insights out of the data. We ran different benchmarks focusing on the individual read and write times of the different state operations. Furthermore, we also measured the time to perform snapshotting and to recover from the snapshotted state.

4.1 Benchmarking Framework

Stream processing benchmarks are still an active area of research, and there are only a few known benchmarks for evaluating queries over continuous data streams. Initially, we ran experiments using NEXMark, but later, due to certain limitations, we implemented our own benchmark.

4.1.1 NEXMark Benchmark

NEXMark (Niagara Extension to XMark) is a benchmark consisting of multiple queries that exercise different sets of functionality of stream processing engines. It models a simple but realistic online auction system, where people put items on sale while other people place bids on those items, as shown in Fig. 4.1. It has three entities - persons, auctions, and bids - and different queries are executed on the system to evaluate the performance of various functionalities of the system under test.

4.1.2 NDW Benchmark

Sruthi [1] developed a custom benchmark for benchmarking FlinkNDB. It is built on top of an open-source project [29], and the data source is traffic data from NDW (Nationaal Dataportaal Wegverkeer) [30]. It contains data from different road measurement locations in the Netherlands.


Figure 4.1: NEXMark - Online Auction system

At these locations, sensors count the number of cars that pass by on a certain lane and the average speed of these cars within a time interval. The data mainly consists of two measurements: flow (number of cars) and speed (average speed). The data generator from [29] was used; it generates data and publishes a subset of this data set into the Kafka input topics, named flow and speed. The ingested data is read from the Kafka flow and speed topics; the flow and speed data is in JSON format and is parsed into speed and flow POJOs. Afterwards, stateful operations are performed on the flow and speed streams, and different performance metrics are measured.
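A skeleton of such a pipeline, using Flink's Kafka connector [19], is sketched below; the topic names match the benchmark, but the broker address, group id, and the trivial sinks are placeholders for the real parsing and stateful operators.

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class NdwPipelineSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder
        props.setProperty("group.id", "ndw-benchmark");           // placeholder

        // Raw JSON records from the two benchmark topics.
        DataStream<String> flow = env.addSource(
                new FlinkKafkaConsumer<>("flow", new SimpleStringSchema(), props));
        DataStream<String> speed = env.addSource(
                new FlinkKafkaConsumer<>("speed", new SimpleStringSchema(), props));

        // In the real benchmark, the JSON is parsed into POJOs and keyed,
        // stateful operations are applied; a keyBy on the measurement id
        // would go here instead of the trivial print sinks.
        flow.print();
        speed.print();

        env.execute("NDW benchmark sketch");
    }
}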

4.2 Hardware Infrastructure

Running experiments on different types of hardware yields different results. During the development of FlinkNDB, we ran experiments on quite a few different setups, both to evaluate ad-hoc performance and to gather data for a detailed comparative analysis. Google Cloud was used to create the different setups. The major setups for the experiments were the following:

1. Running Flink and NDB on the same machine. Most of these experiments were run on a Mac and an HP ProBook (Core i5, 16 GB).

2. Running Flink on a single node with NDB running in cluster mode. Flink used an n1-standard-16 (16 vCPUs, 60 GB memory) VM. The NDB cluster had 3 n1-standard-16 (16 vCPUs, 60 GB memory) nodes, of which 2 were data nodes and the third was a management node.

3. Running both Flink and NDB in cluster mode. The Flink cluster had 3 n1-standard-16 (16 vCPUs, 60 GB memory) VMs. The NDB cluster had 5 n1-standard-16 (16 vCPUs, 60 GB memory) VMs, of which 4 were data nodes and the last one was a management node.

For the purpose of this report, we only included the results from setups 2 and 3 above. Setup 1 showed a lot of variation between runs, and its results were collected mostly for development purposes.

4.3 Benchmarking Architecture

A benchmarking pipeline was developed to collect, process, and analyze the data from the different benchmark experiments, as in fig. 4.2. While the experiments run, all the individual reads and writes are logged to log files. Once an experiment finishes, the log files are processed with a Python script on the same compute nodes to extract only the relevant information, as the file sizes are quite large. Once we have only the read or write times against timestamps, we upload those values to Google Cloud Storage. Finally, Google Colab accesses Google Cloud Storage to read the data and plot the graphs.

Figure 4.2: FlinkNDB data processing pipeline [1]

4.4 Objectives

Apache Flink already supports several state backends, but the RocksDB state backend is the most widely used, so we compared FlinkNDB's performance against RocksDB. State backend performance can be compared on a great many metrics, but our experiments focused on the following:

1. Read time: Comparing the read time for the state value.

2. Write time: Time to write data to the state, including any serialization and deserialization time.

3. Checkpoint time: Time it takes to checkpoint the state.

4. Recovery time: How long it takes to recover from a snapshot, depending on the state size.

5. Application run-time: End-to-end execution time of the complete data pipeline.

We updated the Flink code to add logging of how much time an operation (read or write) takes from the moment Flink requests it, in order to compare performance at the state backend level.

Figure 4.3: Apache Beam NEXMark - Performance comparison of Flink state backends [1]

4.5 Experimental Evaluation

Initially, we ran experiments using the Apache Beam implementation of NEXMark for all the queries, as shown in fig. 4.3. FlinkNDB showed comparable performance, and for some queries it performed better than the other state backends. Later, the NDW benchmarking framework [1] was used to run different experiments for the performance evaluation using the metrics listed in 4.4.

The NDW benchmark experiments use different sets of keys, different numbers of input data points, and different state sizes per data point to better understand the strengths of FlinkNDB. A variable number of keys is used to assess how the different state backends perform when all the state fits into the in-memory cache, and when it does not. The state size is determined by the type of information stored for each unique key. The number of input data points represents the workload to be processed and is used to evaluate total execution time. The parameters of each experiment are summarized in table 4.1.

Experiment No. | Input Size | State Size (KB) | Unique Keys
1              | 379600     | 33.70           | 345
2              | 410400     | 10.75           | 110
3              | 403984     | 1253            | 12832
4              | 1787146    | 34.50           | 35328
5A             | 205200     | 10.75           | 110
5B             | 205200     | 1430            | 110

Table 4.1: Summary of input parameters for experiments

4.5.1 Experiment 1

Experiment 1 evaluates the performance with a small set of unique but repetitive keys. The input data set has only 345 keys, and the state was small enough to fit entirely into the cache, so all reads were cache hits except the first read of each key. FlinkNDB took slightly less time than RocksDB, as can be seen from both the read and write graphs in 4.4a and 4.4b. FlinkNDB has close to zero read time, because all reads are cache hits; the only NDB reads happen when a key is fetched for the first time. On the other hand, FlinkNDB's write time hovered around 50 µs, while RocksDB averaged below 20 µs with a few spikes.

4.5.2 Experiment 2

Experiment 2 is similar to the first experiment, but with a higher number of input data points and a failure recovery. It had 110 unique keys and merely 128 cache misses during the whole program execution; that means only 18 keys were evicted from the cache and had to be read again from the database during the experiment.

Figure 4.4: Experiment 1 evaluation graphs: (a) read rolling average, (b) write rolling average, (c) read/write box plot

The state was also small enough to fit into the system's memory. The RocksDB state backend took less time than FlinkNDB to finish the experiment run, as in fig. 4.5a and 4.5b. FlinkNDB has close to zero read time, as in 4.5c, because there were very few cache misses. Similarly, its write time hovered around 50 µs, while RocksDB averaged below 20 µs, in line with experiment 1. Finally, recovery took the same amount of time for both backends, 13-14 seconds.

Figure 4.5: Experiment 2 evaluation graphs: (a) read rolling average, (b) write rolling average, (c) read/write box plot

4.5.3 Experiment 3

Experiment 3 has around 13k keys, far more than experiments 1 and 2, but the data set still fits into the cache, and there were only a few cache misses. So, all reads happen from the cache except for the first access, when the cache is warmed up from the database. Overall, FlinkNDB took longer to finish than RocksDB, as can be seen from both the read and write graphs in 4.6a and 4.6b. The read/write box plot in 4.6c shows the same pattern as in the previous experiments.

Figure 4.6: Experiment 3 evaluation graphs: (a) read rolling average, (b) write rolling average, (c) read/write box plot

4.5.4 Experiment 4

Experiment 4 has far more keys, 35k. This was a long-running experiment with the keys spread over time and a bigger state size as well. Due to the large set of keys spread over time, we can see spikes in the read and write graphs for FlinkNDB in 4.7a and 4.7b. These spikes are caused by reads from the database: since most of the keys in the input arrive in sequential order, every time a new set of keys appears, it causes a spike due to database reads.

Furthermore, RocksDB's read time went up as well, due to the increased state size. Additionally, this experiment measured the time to complete the snapshot for each checkpoint, as in fig. 4.7c. RocksDB takes longer in most cases, since its state needs to be moved to S3.

Figure 4.7: Experiment 4 evaluation graphs: (a) read rolling average, (b) write rolling average, (c) checkpoint time

4.5.5 Experiment 5

Experiment 5 compares the performance of the state backends for different state sizes using the same input data. It has 110 keys, and the experiment was run twice on the same input: first with state sizes in bytes, and then with the state size per key increased to kilobytes (KB). Fig. 4.8a shows that FlinkNDB has consistent read times for sequential reads regardless of the state size, whereas RocksDB's read time is proportional to the state size: its read time increased once the state size was increased, while the increase did not affect FlinkNDB. FlinkNDB took less time to finish processing the complete input, because RocksDB's individual reads and writes are slower, as in Fig. 4.8b. Finally, fig. 4.8c shows that FlinkNDB takes less time to checkpoint than RocksDB. After increasing the state size, RocksDB's checkpointing time doubled; hence, checkpointing time is proportional to the state size for both state backends.

4.6 Evaluation Summary

The different experiments lead to the conclusion that FlinkNDB has good read performance compared to the RocksDB state backend, thanks to the cache optimizations. On the other hand, FlinkNDB lags behind on individual write operations, because of the serialization cost and the composition of multiple objects for checkpointing. The current implementation is therefore well suited to read-heavy workloads, where it can outperform the other Flink state backends.

Exp. No. | State backend | Read (µs) | Write (µs) | Checkpoint Time (s) | Total time
1        | RocksDB       | 10        | 15         | -                   | -
1        | NDB           | 1         | 40         | -                   | -
2        | RocksDB       | 8         | 20         | -                   | -
2        | NDB           | 1         | 60         | -                   | -
3        | RocksDB       | 30        | 20         | -                   | -
3        | NDB           | 2         | 60         | -                   | -
4        | RocksDB       | 50        | 20         | 8                   | 23 min
4        | NDB           | 30        | 60         | 4.5                 | 23 min
5A       | RocksDB       | 7.5       | 15         | 2                   | 110 s
5A       | NDB           | <1        | 45         | <1                  | 120 s
5B       | RocksDB       | 13.5      | 200        | 4                   | 230 s
5B       | NDB           | <1        | 230        | 1.5                 | 180 s

Table 4.2: Summary of evaluation metrics for Flink state backends

Figure 4.8: Experiment 5 evaluation graphs

Chapter 5

Conclusion and Future Work

This chapter concludes the thesis with a review of the research work and the achieved results. Furthermore, it discusses possibilities that can be explored in the future to improve the performance of FlinkNDB.

5.1 Conclusion

Apache Flink provides stateful streaming capabilities, and applications need to store state while processing data. The state is stored in a state backend, and all the existing state backends are embedded in the compute nodes; hence, scaling out storage independently is not possible.

Therefore, we implemented a new state backend, FlinkNDB. FlinkNDB solves the problem at its core by taking a different approach - decoupling compute and storage - while still maintaining the processing guarantees provided by Flink's underlying snapshotting mechanism. Scaling in or out is no longer expensive, since state and compute are not co-located anymore: for a compute-intensive task, we scale out compute only, and for a state-intensive task, we scale out the storage.

FlinkNDB is built on top of NDB, one of the world's fastest in-memory open-source distributed databases. FlinkNDB complies with Flink's core design choices: dedicated task assignment based on key-groups and fast, guaranteed processing. On top of that, it adds fast reconfiguration and unbounded historical rollback capabilities via strict multi-version control orchestrated by Flink's snapshots. FlinkNDB has a transient cache layer that captures the incremental changes during an epoch; all read and write operations happen on the cache layer, so its performance is close to a memory store. Once an epoch finishes and a checkpoint triggers, all the changes are serialized into the permanent distributed storage, NDB. All serialization and deserialization is delayed until the data needs to be stored, which boosts performance.

55 56 CHAPTER 5. CONCLUSION AND FUTURE WORK

Additionally, recovery from failure is much quicker, since the data only needs to be copied from one table to another on the same node. This reduces recovery time compared to existing implementations, because the state does not need to be copied across nodes over the wire.

FlinkNDB is an optimal solution for applications where all the keys fit into the active cache. It gives its best performance for sequential, read-intensive applications rather than sparse reads, because sparse reads lead to cache misses. It is therefore crucial to set the cache size according to the application's needs. FlinkNDB is also recommended for applications where frequent dynamic scaling is a requirement and the state size is large: it should work well for large-state applications, as checkpointing takes less time, and state redistribution does not require copying the state. On the other hand, write-intensive workloads might not fare as well, because writes constantly go over the network and do not leverage the caching capabilities.

5.2 Future Work

In chapter 4, we compared the performance of FlinkNDB with the RocksDB state backend and identified that FlinkNDB lags behind in write performance because of the early serialization cost. The obvious solution is to move the serialization to checkpoint time, when the state is stored in the database, by implementing a checkpoint serializer for FlinkNDB. We could not implement it because of the tight schedule.

FlinkNDB still uses the same implementation as RocksDB to store the checkpoint metadata and reads it from the file system during recovery. Since FlinkNDB is moving away from the file system to an external storage system, the metadata should be stored in the database as well.

The current implementation of FlinkNDB performs eager recovery, but we presented the lazy recovery concept in chapter 3.6.2.2 as well. Once lazy recovery is implemented, it will be interesting to see the results of failure recovery or reconfiguration of the Flink cluster with variable state sizes. In addition, the current implementation does not pre-load the cache and relies on cache warm-up driven by the application. Different strategies exist to overcome a cold cache, so it is worth exploring different cache warming strategies and benchmarking performance for both cold and warm caches.

Another area of research is how to combine the active and commit caches. One idea is to flag dirty writes as sticky during the epoch, so that they cannot be evicted until successfully checkpointed - for instance, an additional property on the POJO marking that the object was written to the cache in the currently running epoch and has not yet been committed to the database. Once a checkpoint triggers, such entries can be flushed to the database and made non-sticky. This might require a custom caching implementation instead of a COTS (commercial off-the-shelf) solution or a third-party library.

Finally, more extensive benchmarks should be run to measure the performance of some of the trickier use cases, e.g. scaling in or out with large state, or failure recovery of the cluster with large state.

Bibliography

[1] S. Sree Kumar, External Streaming State Abstractions and Benchmarking. KTH, 2021. [Online]. Available: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-

[2] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” in OSDI’04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, 2004, pp. 137–150.

[3] A. H. Payberah, “Parallel Processing - Spark,” GitHub, p. 125, 2019.

[4] M. H. Asif, “Apache Flink Architecture Overview,” Feb. 2020. [Online]. Available: https://medium.com/big-data-processing/apache-flink-architecture-overview-abbe19199fd0

[5] P. Carbone, “Scalable and Reliable Data Stream Processing,” KTH Royal Institute of Technology, 2018. [Online]. Available: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-233527

[6] Fwdays, “Sergei Sokolenko "Advances in Stream Analytics: Apache Beam and Googl...” [Online]. Available: https://www2.slideshare.net/fwdays/sergei-sokolenko-advances-in-stream-analytics-apache-beam-and-google-cloud-dataflow-deepdive

[7] “Apache Flink: A Deep Dive into Rescalable State in Apache Flink.” [Online]. Available: https://flink.apache.org/features/2017/07/04/flink-rescalable-state.html

[8] “Apache Flink: Stateful Stream Processing.” [Online]. Available: https://ci.apache.org/projects/flink/flink-docs-stable/concepts/stateful-stream-processing.html

[9] “Apache Flink 1.11 Documentation: Fault Tolerance via State Snapshots.” [Online]. Available: https://ci.apache.org/projects/flink/flink-docs-stable/learn-flink/fault_tolerance.html

[10] “Apache Tez - Welcome to Apache TEZ.” [Online]. Available: https://tez.apache.org/

[11] “Apache Spark - Unified Analytics Engine for Big Data.” [Online]. Available: https://spark.apache.org/

[12] “Samza.” [Online]. Available: http://samza.apache.org/

[13] “Apache Flink: Stateful Computations over Data Streams.” [Online]. Available: https://flink.apache.org/

[14] “Apache Flink,” Dec. 2020, page Version ID: 995687974. [Online]. Available: https://en.wikipedia.org/w/index.php?title=Apache_Flink&oldid=995687974

[15] P. Carbone, S. Ewen, G. Fora, S. Haridi, S. Richter, and K. Tzoumas, “State management in Apache Flink®: consistent stateful distributed stream processing,” Proceedings of the VLDB Endowment, vol. 10, no. 12, pp. 1718–1729, Aug. 2017. [Online]. Available: https://doi.org/10.14778/3137765.3137777

[16] “Apache Flink 1.11 Documentation: State Backends.” [Online]. Available: https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/state_backends.html

[17] “CUDA Zone,” Jul. 2017. [Online]. Available: https://developer.nvidia.com/CUDA-zone

[18] A. Ali and M. Abdullah, “A Survey on Vertical and Horizontal Scaling Platforms for Big Data Analytics,” International Journal of Integrated Engineering, vol. 11, Sep. 2019.

[19] “Apache Flink 1.12 Documentation: Apache Kafka Connector.” [Online]. Available: https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html

[20] “The Apache Software Foundation Announces Apache Flink as a Top-Level Project : The Apache Software Foundation Blog.” [Online]. Available: https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces69

[21] “Advanced Apache Flink Tutorial 1: Analysis of Runtime Core Mechanism.” [Online]. Available: https://www.alibabacloud.com/blog/advanced-apache-flink-tutorial-1-analysis-of-runtime-core-mechanism_595686

[22] “Apache Flink 1.10 Documentation: Working with State.” [Online]. Available: https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/stream/state/state.html

[23] “HDFSBackedStateStore · The Internals of Spark Structured Streaming.” [Online]. Available: https://jaceklaskowski.gitbooks.io/spark-structured-streaming/content/spark--streaming-HDFSBackedStateStore.html

[24] “Dataflow | Google Cloud.” [Online]. Available: https://cloud.google.com/dataflow

[25] P. Carbone, G. Fora, S. Ewen, S. Haridi, and K. Tzoumas, “Lightweight Asynchronous Snapshots for Distributed Dataflows,” arXiv:1506.08603 [cs], Jun. 2015. [Online]. Available: http://arxiv.org/abs/1506.08603

[26] “S-Store.” [Online]. Available: http://sstore.cs.brown.edu/about.html

[27] M. Ismail, “Distributed File System Metadata and its Applications,” KTH Royal Institute of Technology, 2020. [Online]. Available: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-285872

[28] “ben-manes/caffeine.” [Online]. Available: https://github.com/ben-manes/caffeine

[29] G. van Dongen and D. Van den Poel, “Evaluation of stream processing frameworks,” IEEE Transactions on Parallel and Distributed Systems, vol. PP, pp. 1–1, Mar. 2020.

[30] “Home - Nationaal Dataportaal Wegverkeer.” [Online]. Available: https://ndw.nu/en/
