Structured Streaming: a Declarative API for Real-Time Applications in Apache Spark
Total Page:16
File Type:pdf, Size:1020Kb
Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark Michael Armbrusty, Tathagata Dasy, Joseph Torresy, Burak Yavuzy, Shixiong Zhuy, Reynold Xiny, Ali Ghodsiy, Ion Stoicay, Matei Zahariayz yDatabricks Inc., zStanford University Abstract with Spark Streaming [37], one of the earliest stream processing With the ubiquity of real-time data, organizations need streaming systems to provide a high-level, functional API. We found that two systems that are scalable, easy to use, and easy to integrate into challenges frequently came up with users. First, streaming systems business applications. Structured Streaming is a new high-level often ask users to think in terms of complex physical execution streaming API in Apache Spark based on our experience with Spark concepts, such as at-least-once delivery, state storage and triggering Streaming. Structured Streaming differs from other recent stream- modes, that are unique to streaming. Second, many systems focus ing APIs, such as Google Dataflow, in two main ways. First, itisa only on streaming computation, but in real use cases, streaming is purely declarative API based on automatically incrementalizing a often part of a larger business application that also includes batch static relational query (expressed using SQL or DataFrames), in con- analytics, joins with static data, and interactive queries. Integrating trast to APIs that ask the user to build a DAG of physical operators. streaming systems with these other workloads (e.g., maintaining Second, Structured Streaming aims to support end-to-end real-time transactionality) requires significant engineering effort. applications that integrate streaming with batch and interactive Motivated by these challenges, we describe Structured Stream- analysis. We found that this integration was often a key challenge ing, a new high-level API for stream processing that was developed in practice. Structured Streaming achieves high performance via in Apache Spark starting in 2016. Structured Streaming builds on Spark SQL’s code generation engine and can outperform Apache many ideas in recent stream processing systems, such as separating Flink by up to 2× and Apache Kafka Streams by 90×. It also offers processing time from event time and triggers in Google Dataflow [2], rich operational features such as rollbacks, code updates, and mixed using a relational execution engine for performance [12], and of- streaming/batch execution. We describe the system’s design and use fering a language-integrated API [17, 37], but aims to make them cases from several hundred production deployments on Databricks, simpler to use and integrated with the rest of Apache Spark. Specif- the largest of which process over 1 PB of data per month. ically, Structured Streaming differs from other widely used open source streaming APIs in two ways: ACM Reference Format: • Incremental query model: Structured Streaming automati- M. Armbrust et al.. 2018. Structured Streaming: A Declarative API for Real- Time Applications in Apache Spark. In SIGMOD’18: 2018 International Con- cally incrementalizes queries on static datasets expressed through ference on Management of Data, June 10–15, 2018, Houston, TX, USA. ACM, Spark’s SQL and DataFrame APIs [8], meaning that users typ- New York, NY, USA, 13 pages. https://doi.org/10.1145/3183713.3190664 ically only need to understand Spark’s batch APIs to write a streaming query. Event time concepts are especially easy to ex- 1 Introduction press and understand in this model. Although incremental query execution and view maintenance are well studied [11, 24, 29, 38], Many high-volume data sources operate in real time, including we believe Structured Streaming is the first effort to adopt them sensors, logs from mobile applications, and the Internet of Things. in a widely used open source system. We found that this incre- As organizations have gotten better at capturing this data, they also mental API generally worked well for both novice and advanced want to process it in real time, whether to give human analysts the users. For example, advanced users can use a set of stateful pro- freshest possible data or drive automated decisions. Enabling broad cessing operators that give fine-grained control to implement access to streaming computation requires systems that are scalable, custom logic while fitting into the incremental model. easy to use and easy to integrate into business applications. While there has been tremendous progress in distributed stream • Support for end-to-end applications: Structured Streaming’s processing systems in the past few years [2, 15, 17, 27, 32], these sys- API and built-in connectors make it easy to write code that is tems still remain fairly challenging to use in practice. In this paper, “correct by default" when interacting with external systems and we begin by describing these challenges, based on our experience can be integrated into larger applications using Spark and other software. Data sources and sinks follow a simple transactional Permission to make digital or hard copies of all or part of this work for personal or model that enables “exactly-once" computation by default. The classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation incrementalization based API naturally makes it easy to run a on the first page. Copyrights for components of this work owned by others than the streaming query as a batch job or develop hybrid applications author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission that join streams with static data computed through Spark’s and/or a fee. Request permissions from [email protected]. batch APIs. In addition, users can manage multiple streaming SIGMOD’18, June 10–15, 2018, Houston, TX, USA queries dynamically and run interactive queries on consistent © 2018 Copyright held by the owner/author(s). Publication rights licensed to Associa- tion for Computing Machinery. ACM ISBN 978-1-4503-4703-7/18/06...$15.00 https://doi.org/10.1145/3183713.3190664 snapshots of stream output, making it possible to write applica- time discussing these challenges with users and designers of other tions that go beyond computing a fixed result to let users refine streaming systems, including Spark Streaming, Truviso, Storm, and drill into streaming data. Dataflow and Flink. This section details the challenges wesaw. Beyond these design decisions, we made several other design 2.1 Complex and Low-Level APIs choices in Structured Streaming that simplify operation and in- crease performance. First, Structured Streaming reuses the Spark Streaming systems were invariably considered more difficult to use SQL execution engine [8], including its optimizer and runtime code than batch ones due to complex API semantics. Some complexity is generator. This leads to high throughput compared to other stream- to be expected due to new concerns that arise only in streaming: for ing systems (e.g., 2× the throughput of Apache Flink and 90× that example, the user needs to think about what type of intermediate of Apache Kafka Streams in the Yahoo! Streaming Benchmark [14]), results the system should output before it has received all the data as in Trill [12], and also lets Structured Streaming automatically relevant to a particular entity, e.g., to a customer’s browsing session leverage new SQL functionality added to Spark. The engine runs on a website. However, other complexity arises due to the low- in a microbatch execution mode by default [37] but it can also use level nature of many streaming APIs: these APIs often ask users to a low-latency continuous operators for some queries because the specify applications at the level of physical operators with complex API is agnostic to execution strategy [6]. semantics instead of a more declarative level. Second, we found that operating a streaming application can be As a concrete example, the Google Dataflow model [2] has a challenging, so we designed the engine to support failures, code powerful API with a rich set of options for handling event time updates and recomputation of already outputted data. For example, aggregation, windowing and out-of-order data. However, in this one common issue is that new data in a stream causes an applica- model, users need to specify a windowing mode, triggering mode tion to crash, or worse, to output an incorrect result that users do and trigger refinement mode (essentially, whether the operator not notice until much later (e.g., due to mis-parsing an input field). outputs deltas or accumulated results) for each aggregation operator. In Structured Streaming, each application maintains a write-ahead Adding an operator that expects deltas after an aggregation that event log in human-readable JSON format that administrators can outputs accumulated results will lead to unexpected results. In use to restart it from an arbitrary point. If the application crashes essence, the raw API [10] asks the user to write a physical operator due to an error in a user-defined function, administrators can up- graph, not a logical query, so every user of the system needs to date the UDF and restart from where it left off, which happens understand the intricacies of incremental processing. automatically when the restarted application reads the log. If the Other APIs, such as Spark Streaming [37] and Flink’s DataStream application was outputting incorrect data instead, administrators API [18], are also based on writing DAGs of physical operators and can manually roll it back to a point before the problem started and offer a complex array of options for managing state[20]. In addition, recompute its results starting from there. reasoning about applications becomes even more complex in sys- Our team has been running Structured Streaming applications tems that relax exactly-once semantics [32], effectively requiring for customers of Databricks’ cloud service since 2016, as well as the user to design and implement a consistency model.