A Survey of State Management in Big Data Processing Systems Quoc-Cuong To1 Juan Soto1,2 Volker Markl1,2 1 German Research Center for 2 Technische Universität Berlin Artificial Intelligence (DFKI) FG DIMA, Sekr. EN-7 Raum 728, Alt-Moabit 91c Einsteinufer 17 10559 Berlin, Germany 10587 Berlin, Germany
[email protected] [email protected] [email protected] ABSTRACT graphs or trees (in the absence of iterations or shared results). From The concept of state and its applications vary widely across big data this perspective, the analysis results are the roots, operators are the processing systems. This is evident in both the research literature intermediate nodes, and data are the leaves. Each operator node and existing systems, such as Apache Flink, Apache Heron, Apache performs an operation that transforms inputs flowing through it into Samza, Apache Spark, and Apache Storm. Given the pivotal role outputs. Data flows from the leaves through the operator nodes to that state management plays, particularly, for iterative batch and the roots. stream processing, in this survey, we present examples of state as Operators come in two varieties. Stateless operators are an enabler, discuss the alternative approaches used to handle and purely functional and they produce output, solely based on their implement state, capture the many facets of state management, and input. Examples of stateless operators include relational selection, highlight new research directions. Our aim is to provide insight into relational projection without duplicate elimination, or merging two disparate state management techniques, motivate others to pursue inputs. In contrast, stateful operators compute their output on a research in this area, and draw attention to open problems.