Towards a Unified Ingestion-and-Storage Architecture for Stream Processing

Ovidiu-Cristian Marcu∗, Alexandru Costan†, Gabriel Antoniu∗, María S. Pérez-Hernández‡, Radu Tudoran§, Stefano Bortoli§ and Bogdan Nicolae§
∗Inria Rennes - Bretagne Atlantique, France   †IRISA / INSA Rennes, France   ‡Universidad Politécnica de Madrid, Spain   §Huawei Germany Research Center

Abstract—Big Data applications are rapidly moving from a batch-oriented execution model to a streaming execution model in order to extract value from the data in real time. However, processing live data alone is often not enough: in many cases, such applications need to combine the live data with previously archived data to increase the quality of the extracted insights. Current streaming-oriented runtimes and middlewares are not flexible enough to deal with this trend, as they address ingestion (collection and pre-processing of data streams) and persistent storage (archival of intermediate results) using separate services. This separation often leads to I/O redundancy (e.g., writing data twice to disk or transferring data twice over the network) and interference (e.g., I/O bottlenecks when collecting data streams and writing archival data simultaneously). In this position paper, we argue for a unified ingestion and storage architecture for streaming data that addresses the aforementioned challenge. We identify a set of constraints and benefits for such a unified model, while highlighting the important architectural aspects required to implement it in real life. Based on these aspects, we briefly sketch our plan for future work that develops the position defended in this paper.

Fig. 1: The usual streaming architecture: data is first ingested and then it flows through the processing layer, which relies on the storage layer for storing aggregated data or for archiving streams for later usage.
Index Terms—Big Data, Streaming, Storage, Ingestion, Unified Architecture

I. INTRODUCTION

Under the pressure of increasing data velocity, Big Data analytics shifts towards stream computing: it needs to deliver insights and results as soon as possible, with minimal latency and at a high frequency that keeps up with the rate of new data. In this context, online and interactive Big Data runtimes designed for stream computing are rapidly evolving to complement traditional, batch-oriented runtimes (such as MapReduce [1]) that are insufficient to meet the need for low latency and high update frequency [2].

This online dimension [3], [4] of data processing was recognized in both industry and academia and led to the design and development of an entire family of online Big Data analytics runtimes: Apache Flink [5], Apache Spark Streaming [6], Streamscope [7], Apache Apex [8], etc.

As depicted in Figure 1, a typical state-of-the-art online Big Data analytics runtime is built on top of a three-layer stack:

• Ingestion: this layer serves to acquire, buffer and optionally pre-process data streams (e.g., filter) before they are consumed by the analytics application. The ingestion layer does not guarantee persistence: it buffers the data only temporarily and enables limited access semantics to it (e.g., it assumes a producer-consumer pattern that is not optimized for random access).

• Storage: unlike the ingestion layer, this layer is responsible for the persistent storage of data. This typically involves either the archival of the buffered data streams from the ingestion layer or the storage of the intermediate stream analytics results, both of which are crucial to enable fault tolerance or deeper, batch-oriented analytics that complement the online analytics.

• Processing: this layer consumes the streaming data buffered by the ingestion layer and sends the output and intermediate results to the storage or ingestion layers.

In practice, each layer is implemented as a dedicated solution that is independent from the other layers, under the assumption that specialization for a particular role enables better optimization opportunities. Such an architecture is commonly referred to as the lambda architecture [9].

However, as the need for more complex online data manipulations arises, so does the need to enable better coupling between these three layers. Emerging scenarios (Section IV) emphasize complex data pre-processing, tight fault tolerance and archival of streaming data for deeper analytics based on batch processing. Under these circumstances, first, data is often written twice to disk or sent twice over the network (e.g., as part of the fault tolerance strategy of the ingestion layer and the persistency requirement of the storage layer). Second, the lack of coordination between the layers can lead to I/O interference (e.g., the ingestion layer and the storage layer compete for the same I/O resources). Third, the processing layer often implements custom advanced data management (e.g., operator state persistence, checkpoint-restart) on top of an inappropriate basic ingestion/storage API, which results in significant performance overhead.
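The double-write redundancy described above can be made concrete with a small sketch. The layer classes and file names below are hypothetical, deliberately simplified stand-ins (not the API of any real system): because each layer persists data for its own reasons, the same record hits the disk twice.

```python
import json
import os
import tempfile

class IngestionLayer:
    """Buffers incoming records and persists them to a local log
    as part of its own fault-tolerance strategy (disk write #1)."""
    def __init__(self, log_path):
        self.log_path = log_path

    def ingest(self, record):
        with open(self.log_path, "a") as log:
            log.write(json.dumps(record) + "\n")
        return record

class StorageLayer:
    """Archives records for later batch analytics, with no knowledge
    of the ingestion layer's log (disk write #2 of the same bytes)."""
    def __init__(self, archive_path):
        self.archive_path = archive_path

    def archive(self, record):
        with open(self.archive_path, "a") as archive:
            archive.write(json.dumps(record) + "\n")

# The same record crosses both uncoordinated layers.
tmp = tempfile.mkdtemp()
ingestion = IngestionLayer(os.path.join(tmp, "ingest.log"))
storage = StorageLayer(os.path.join(tmp, "archive.log"))

record = {"key": "sensor-42", "value": 17.3}
storage.archive(ingestion.ingest(record))
# Both files now hold an identical copy of the record.
```

A unified layer would instead let the archival path reuse the bytes already made durable during ingestion, removing the second write.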
In this paper, we argue that the aforementioned challenges are significant enough to offset the benefits of specializing each layer independently of the other layers. To this end, we discuss the potential benefits of a unified storage and ingestion architecture. Specifically, we contribute with a study on the general characteristics of data streams and the requirements for the ingestion and storage layers (Section II), an overview of the state of the art and its limitations and missing features (Section III), and a proposal for a unified solution based on a series of design principles and an architecture (Section V).

II. REQUIREMENTS OF INGESTION AND STORAGE FOR DATA STREAMS

In this section we focus on the characteristics of data streams and discuss the main requirements and constraints of the ingestion and storage layers.

A. Characteristics of Data Streams

Data Size. Stream data takes various forms and most systems do not target a certain size of the stream event, although this is an important characteristic for performance optimizations [10]. However, a few of the analyzed systems are sensitive to the data size, with records defined by the message length (from tens or hundreds of bytes up to a few megabytes).

Data Access Pattern. Stream-based applications dictate the way data is accessed, with most of them (e.g., [3]) requiring fine-grained access. Stream operators may leverage multi-key-value interfaces in order to optimize the number of accesses to the data store, or may access data items sequentially (scan-based), randomly, or even clustered (queries). When coupled with offline analytics, the data access pattern plays a crucial role in the scalability of wide transformations [11].

Data Model. This describes the representation and organization of the single information items flowing from sources to sinks [12]. Streams are modeled as simple records (optional key, blob value for attributes, optional timestamp) or as row/column sets in a table. The data model together with the data size influence the data rate (i.e., throughput), an important performance metric related to stream processing. Depending on the workload, representing the data layout in a column-oriented approach [13] or in a hybrid row-column layout may be necessary.

Data Compression. In order to improve the consumer throughput (e.g., due to scarce network bandwidth) or to reduce storage size, a set of compression techniques (e.g., GZIP, Snappy, LZ4, data encodings, serialization) can be implemented and supported by stream systems.
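As a rough, generic illustration of the compression trade-off above (using only Python's standard gzip module, not tied to any particular stream system), compressing a whole batch of records exploits the redundancy across similar records, whereas compressing each small record individually pays the format overhead every time and can even inflate the data:

```python
import gzip
import json

# A batch of small, similar records, as a stream source typically emits.
records = [{"key": f"sensor-{i % 8}", "value": i * 0.5} for i in range(1000)]

raw = "\n".join(json.dumps(r) for r in records).encode("utf-8")

# Compressing the whole batch amortizes headers and exploits
# redundancy across records...
batch_compressed = gzip.compress(raw)

# ...while per-record compression pays the gzip header/trailer cost
# (~18 bytes) for every tiny payload.
per_record_total = sum(
    len(gzip.compress(json.dumps(r).encode("utf-8"))) for r in records
)

print(len(raw), len(batch_compressed), per_record_total)
```

This is one reason ingestion systems tend to compress record batches (or log segments) rather than individual events.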
B. Functional Requirements

Adequate Support for Data Stream Partitioning. A stream partition is a (possibly) ordered sequence of records that represents a unit of a data stream. Partitioning for streaming is a recognized technique used to increase processing throughput and scalability, e.g., [14], [15]. We differentiate between data partitioning techniques implemented by the ingestion/storage layers and application-level (user-defined) partitioning [16], [17], as it is not always straightforward to reconcile these two aspects.

Support for Versatile Stream Metadata. Streaming metadata is small data (e.g., a record's key or attributes) that describes stream data and is used to easily find particular instances of data (e.g., to index data by certain stream attributes that are later used in a query).

Support for Search. The search capability is a big challenge: it is more appropriate to specialized storage systems, and ingestion frameworks usually do not support such a feature. A large number of stream applications [18] need APIs able to scan (randomly or sequentially) the stream datasets and to support the execution of real-time queries.

Advanced Support for Message Routing. Routing defines the flow of a message (record) through a system in order to ensure it is processed and eventually stored. Applications may need to define dynamic routing schemes or can rely on predefined static routing allocations. Readers can refer to [19] for a characterization of message transport and routing.

Backpressure Support. Backpressure refers to the situation where a system can no longer sustain the rate at which the stream data is received. It is advisable for streaming systems to durably buffer data in case of temporary load spikes, and to provide a mechanism to later replay this data (e.g., by offsets in Kafka). For example, one approach for handling fast data stream arrivals is to artificially restrict the rate at which tuples are delivered to the system (i.e., rate throttling); this technique could be implemented either by the data stream sources or by the stream processing engines.
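The buffer-and-replay behaviour just described can be sketched as a minimal, single-process analogue of Kafka-style offset-based consumption; the class and method names below are illustrative, not an actual API. Producers keep appending during a load spike, the consumer is throttled to a bounded batch size per poll, and any earlier offset can be replayed after a failure:

```python
import threading

class DurableStreamBuffer:
    """Append-only log with offset-based reads: producers append
    during load spikes, and a lagging or restarted consumer can
    replay records from any earlier offset."""
    def __init__(self, max_rate_per_poll=2):
        self._log = []                 # stands in for durable storage
        self._lock = threading.Lock()
        self.max_rate_per_poll = max_rate_per_poll  # rate throttling

    def append(self, record):
        with self._lock:
            self._log.append(record)
            return len(self._log) - 1  # offset of the new record

    def poll(self, offset):
        """Deliver at most max_rate_per_poll records starting at
        `offset`, plus the next offset to resume from."""
        with self._lock:
            batch = self._log[offset:offset + self.max_rate_per_poll]
        return batch, offset + len(batch)

buf = DurableStreamBuffer(max_rate_per_poll=2)
for i in range(5):          # a burst arrives faster than it is consumed
    buf.append(f"tuple-{i}")

offset, seen = 0, []
while True:
    batch, offset = buf.poll(offset)
    if not batch:
        break
    seen.append(batch)      # throttled: at most 2 tuples per poll

# Replay from the beginning after a (simulated) consumer failure.
replayed, _ = buf.poll(0)
```

Note that throttling here is enforced on the delivery side (the buffer), matching the observation that rate limiting can live either in the sources or in the processing engine.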
C. Non-functional Requirements

High Availability. Failures can disrupt or even halt stream processing components, can cause the loss of potentially large amounts of processed stream data, or can prevent downstream nodes from making further progress: stream processing systems should incorporate high-availability mechanisms that allow them to operate continuously for a long time in spite of failures [20], [21].

Temporal Persistence versus Stream Durability. Temporal persistence refers to the ability of a system to temporarily store a stream of events in memory or on disk (e.g., a configurable data retention period of a short period of time