it – Information Technology 2016; 58(4): 186–194 DE GRUYTER OLDENBOURG

Special Issue

Wolfram Wingerath*, Felix Gessert, Steffen Friedrich, and Norbert Ritter

Real-time stream processing for Big Data

DOI 10.1515/itit-2016-0002
Received January 15, 2016; accepted May 2, 2016

Abstract: With the rise of the web 2.0 and the Internet of things, it has become feasible to track all kinds of information over time, in particular fine-grained user activities and sensor data on their environment and even their biometrics. However, while efficiency remains mandatory for any application trying to cope with huge amounts of data, only part of the potential of today's Big Data repositories can be exploited using traditional batch-oriented approaches, as the value of data often decays quickly and high latency becomes unacceptable in some applications. In the last couple of years, several distributed data processing systems have emerged that deviate from the batch-oriented approach and tackle data items as they arrive, thus acknowledging the growing importance of timeliness and velocity in Big Data analytics.
In this article, we give an overview over the state of the art of stream processors for low-latency Big Data analytics and conduct a qualitative comparison of the most popular contenders, namely Storm and its abstraction layer Trident, Samza and Spark Streaming. We describe their respective underlying rationales, the guarantees they provide and discuss the trade-offs that come with selecting one of them for a particular task.

Keywords: Distributed real-time stream processing, Big Data analytics.

ACM CCS: General and reference → Document types → Surveys and overviews, Computer systems organization → Architectures → Distributed architectures → Cloud computing, Computer systems organization → Real-time systems → Real-time system architecture, Information systems → Data management systems → Database management system engines → Stream management, Computing methodologies → Distributed computing methodologies

*Corresponding author: Wolfram Wingerath, Univ. of Hamburg, CS Dept., D-22527 Hamburg, Germany, e-mail: [email protected]
Felix Gessert, Steffen Friedrich, Norbert Ritter: Univ. of Hamburg, CS Dept., D-22527 Hamburg, Germany

1 Introduction

Through technological advance and increasing connectivity between people and devices, the amount of data available to (web) companies, governments and other organisations is constantly growing. The shift towards more dynamic and user-generated content in the web and the omnipresence of smart phones, wearables and other mobile devices, in particular, have led to an abundance of information that is only valuable for a short time and therefore has to be processed immediately. Companies like Amazon and Netflix have already adapted and are monitoring user activity to optimise product or video recommendations for the current user context. Twitter performs continuous sentiment analysis to inform users on trending topics as they come up, and even Google has parted with batch processing for indexing the web to minimise the latency by which it reflects new and updated sites [34].

However, the idea of processing data in motion is not new: Complex Event Processing (CEP) engines [11, 12] and DBMSs with continuous query capabilities [36] can provide processing latency on the order of milliseconds and usually expose high-level, SQL-like interfaces and sophisticated querying functionalities like joins. But while typical deployments of these systems do not span more than a few nodes, the systems covered in this article have been designed specifically for deployments with 10s or 100s of nodes. Much like MapReduce, the main achievement of these new systems is abstraction from scaling issues, which makes development, deployment and maintenance of highly scalable systems feasible.

In the following, we provide an overview over some of the most popular distributed stream processing systems currently available and highlight similarities, differences and trade-offs taken in their respective designs. Section 2 covers the environment in which the processing systems featured in this article are typically deployed, while the systems we focus on are described in Section 3. We give an overview over other systems for stream processing in Section 4 and conclude in Section 5.

Bereitgestellt von | Staats- und Universitätsbibliothek Hamburg Angemeldet Heruntergeladen am | 13.10.16 19:14 DE GRUYTER OLDENBOURG W. Wingerath et al., Real-time streaming analytics for Big Data | 187

2 Real-time analytics: Big Data in motion

In contrast to traditional data analytics systems that collect and periodically process huge – static – volumes of data, streaming analytics systems avoid putting data at rest and process it as it becomes available, thus minimising the time a single data item spends in the processing pipeline. Systems that routinely achieve latencies of several seconds or even subsecond latency between receiving data and producing output are often described as "real-time". However, large parts of today's Big Data infrastructure are built from distributed components that communicate via asynchronous network and are engineered on top of the JVM (Java Virtual Machine). Thus, these systems are only soft-real-time systems and never provide strict upper bounds on the time they take to produce an output.

Figure 1 illustrates the typical layers of a streaming analytics pipeline. Data like user clicks, billing information or unstructured content such as images or text messages are collected from various places inside an organisation and then moved to the streaming layer (e.g. a queueing system like Kafka, Kinesis or ZeroMQ), from which it is accessible to a stream processor that performs a certain task to produce an output. This output is then forwarded to the serving layer, which might for example be an analytics web GUI like trending topics at Twitter or a database where a materialised view is maintained.

Figure 1: An abstract view on a streaming analytics pipeline.

In an attempt to combine the best of both worlds, an architectural pattern called the Lambda Architecture [32] has become quite popular that complements the slow batch-oriented processing with an additional real-time component and thus targets both the Volume and the Velocity challenge of Big Data [29] at the same time. As illustrated in Figure 2a, the Lambda Architecture describes a system comprising three layers: Data is stored in a persistence layer like HDFS, from which it is ingested and processed by the batch layer periodically (e.g. once a day), while the speed layer handles the portion of the data that has not yet been processed by the batch layer, and the serving layer consolidates both by merging the output of the batch and the speed layer. The obvious benefit of having a real-time system compensate for the high latency of batch processing is paid for by increased complexity in development, deployment and maintenance. If the batch layer is implemented with a system that supports both batch and stream processing (e.g. Spark), the speed layer can often be implemented with minimal overhead by using the corresponding streaming API (e.g. Spark Streaming) to make use of existing business logic and the existing deployment. For Hadoop-based and other systems that do not provide a streaming API, however, the speed layer is only available as a separate system. Using an abstract language like Summingbird [18] to write the business logic enables automatic compilation of code for both the batch and the stream processing system (e.g. Hadoop and Storm) and thus eases development in those cases where batch and speed layer can use (parts of) the same business logic, but the overhead for deployment and maintenance still remains.

Figure 2: Lambda and Kappa Architecture in comparison.

Another approach that, in contrast, dispenses with the batch layer in favour of simplicity is known as the Kappa Architecture [25] and is illustrated in Figure 2b. The basic idea is to not periodically recompute all data in the batch layer, but to do all computation in the stream processing system alone and only perform recomputation

when the business logic changes by replaying historical data. To achieve this, the Kappa Architecture employs a powerful stream processor capable of coping with data at a far greater rate than it is incoming, and a scalable streaming system for data retention. An example of such a streaming system is Kafka, which has been specifically designed to work with the stream processor Samza in this kind of architecture. Archiving data (e.g. in HDFS) is still possible, but not part of the critical path and often not required, as Kafka, for instance, supports retention times in the order of weeks. On the downside, however, the effort required to replay the entire history increases linearly with data volume, and the naive approach of retaining the entire change stream may have significantly greater storage requirements than periodically processing the new data and updating an existing database, depending on whether and how efficiently the data is compacted in the streaming layer. As a consequence, the Kappa Architecture should only be considered an alternative to the Lambda Architecture in applications that do not require unbounded retention times or allow for efficient compaction (e.g. because it is reasonable to only keep the most recent value for each given key).

Of course, the latency displayed by the stream processor (speed layer) alone is only a fraction of the end-to-end application latency due to the impact of the network or other systems in the pipeline. But it is obviously an important factor and may dictate which system to choose in applications with strict timing SLAs. This article concentrates on the available systems for the stream processing layer.

3 Real-time processors

While all stream processors share some common ground regarding their underlying concepts and working principle, an important distinction between the individual systems that directly translates to the achievable speed of processing, i.e. latency, is the processing model as illustrated in Figure 3: Handling data items immediately as they arrive minimises latency at the cost of high per-item overhead (e.g. through messaging), whereas buffering and processing them in batches yields increased efficiency, but obviously increases the time the individual item spends in the data pipeline. Purely stream-oriented systems such as Storm and Samza provide very low latency and relatively high per-item cost, while batch-oriented systems achieve unparalleled resource-efficiency at the expense of latency that is prohibitively high for real-time applications.

Figure 3: Choosing a processing model means trading off between latency and throughput.

The space between these two extremes is vast, and some systems like Storm Trident and Spark Streaming employ micro-batching strategies to trade latency against throughput: Trident groups tuples into batches to relax the one-at-a-time processing model in favour of increased throughput, whereas Spark Streaming restricts batch size in a native batch processor to reduce latency. In the following, we go into more detail on the specificities of the above-mentioned systems and highlight inherent trade-offs and design decisions.

3.1 Storm

Storm (current stable version: 1.0.0) has been in development since late 2010, was open-sourced in September 2011 by Twitter and eventually became an Apache top-level project in 2014. It is the first distributed stream processing system to gain traction throughout research and practice and was initially promoted as the "Hadoop of real-time" [30, 31], because its programming model provided an abstraction for stream processing similar to the abstraction that the MapReduce paradigm provides for batch processing. But apart from being the first of its kind, Storm also has a wide user base due to its compatibility with virtually any language: On top of the Java API, Storm is also Thrift-compatible and comes with adapters for numerous languages such as Perl, Python and Ruby. Storm can run on top of Mesos [5], as a dedicated cluster or even on a single machine. The vital parts of a Storm deployment are a ZooKeeper cluster for reliable coordination, several supervisors for execution and a Nimbus server to distribute code across the cluster and take action in case of worker failure; in order to shield against a failing Nimbus server, Storm allows having several hot-standby Nimbus instances. Storm is scalable, fault-tolerant and even elastic, as work may be reassigned at runtime. As of version 1.0.0, Storm provides reliable state implementations that survive and recover from supervisor failure. Earlier versions focused on stateless processing and


thus required state management at the application level to achieve fault-tolerance and elasticity in stateful applications. Storm excels at speed and is thus able to perform in the realm of single-digit milliseconds in certain applications. Through the impact of network latency and garbage collection, however, real-world topologies usually do not display end-to-end latency below 50 ms [23, Chapter 7].

A data pipeline or application in Storm is called a topology and, as illustrated in Figure 4, is a directed graph that represents data flow as directed edges between nodes, which in turn represent the individual processing steps: The nodes that ingest data and thus initiate the data flow in the topology are called spouts and emit tuples to the nodes downstream, which are called bolts and do processing, write data to external storage and may send tuples further downstream themselves. Storm comes with several groupings that control data flow between nodes, e.g. for shuffling or hash-partitioning a stream of tuples by some attribute value, but also allows arbitrary custom groupings. By default, Storm distributes spouts and bolts across the nodes in the cluster in a round-robin fashion, though the scheduler is pluggable to account for scenarios in which a certain processing step has to be executed on a particular node, for example because of hardware dependencies. The application logic is encapsulated in a manual definition of data flow and the spouts and bolts, which implement interfaces to define their behaviour during start-up and on data ingestion or receiving a tuple, respectively.

Figure 4: Data flow in a Storm topology: Data is ingested from the streaming layer and then passed between Storm components, until the final output reaches the serving layer.

While Storm does not provide any guarantee on the order in which tuples are processed, it does provide the option of at-least-once processing through an acknowledgement feature that tracks the processing status of every single tuple on its way through the topology: Storm will replay a tuple if any bolt involved in processing it explicitly signals failure or does not acknowledge successful processing within a given timeframe. Using an appropriate streaming system, it is even possible to shield against spout failures, but the acknowledgement feature is often not used at all in practice, because the messaging overhead imposed by tracking tuple lineage, i.e. a tuple and all the tuples that are emitted on its behalf, noticeably impairs achievable system throughput [20]. With version 1.0.0, Storm introduced a backpressure mechanism to throttle data ingestion as a last resort whenever data is ingested faster than it can be processed. If processing becomes a bottleneck in a topology without such a mechanism, throughput degrades as tuples eventually time out and are either lost (at-most-once processing) or replayed repeatedly to possibly time out again (at-least-once processing), thus putting even more load on an already overburdened system.

3.1.1 Storm Trident

In autumn 2012, with version 0.8.0, Trident was released as a high-level API with stronger ordering guarantees and a more abstract programming interface with built-in support for joins, aggregations, grouping, functions and filters. In contrast to Storm, Trident topologies are directed acyclic graphs (DAGs) as they do not support cycles; this makes them less suitable for implementing iterative algorithms and is also a difference to plain Storm topologies, which are often wrongfully described as DAGs, but actually can contain cycles. Also, Trident does not work on individual tuples, but on micro-batches and correspondingly introduces batch size as a parameter to increase throughput at the cost of latency, which, however, may still be as low as several milliseconds for small batches [22]. All batches are by default processed in sequential order, one after another, although Trident can also be configured to process multiple batches in parallel. On top of Storm's scalability and elasticity, Trident provides its own API for fault-tolerant state management with exactly-once processing semantics. In more detail, Trident prevents data loss by using Storm's acknowledgement feature and guarantees that every tuple is reflected only once in persistent state by maintaining additional information alongside state and by applying updates transactionally. As of writing, two variants of state management are available: One only stores the sequence number of the last-processed batch together with current state, but may block the entire topology when one or more tuples of a failed batch cannot be replayed (e.g. due to unavailability of the data source), whereas the other can tolerate this kind of failure, but is more heavyweight as it also stores the last-known state.
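The tuple-lineage tracking behind Storm's acknowledgement feature (which Trident builds on) can be kept to a single XOR checksum per spout tuple: every tuple in the tree gets a random 64-bit id that is XORed into the checksum once when the tuple is emitted and once when it is acknowledged, so the checksum returns to zero exactly when the whole tree has been acknowledged. A minimal sketch of this bookkeeping (hypothetical class and method names, not Storm's actual API):

```python
class AckTracker:
    """Toy version of Storm's acker: one XOR checksum per spout tuple tree."""

    def __init__(self):
        self.checksums = {}  # root tuple id -> XOR of all emitted/acked tuple ids

    def emitted(self, root, tuple_id):
        # A tuple enters the tree: XOR its id into the tree's checksum.
        self.checksums[root] = self.checksums.get(root, 0) ^ tuple_id

    def acked(self, root, tuple_id):
        # The tuple finished processing: XOR its id out again.
        self.checksums[root] ^= tuple_id

    def is_complete(self, root):
        # Zero checksum <=> every tuple that entered the tree was acked.
        return self.checksums.get(root, 0) == 0


# Storm uses random 64-bit ids; fixed ids keep this sketch deterministic.
tracker = AckTracker()
tracker.emitted("spout-tuple", 0x1234)  # the spout emits the root tuple
tracker.emitted("spout-tuple", 0x5678)  # a bolt emits a child tuple
tracker.acked("spout-tuple", 0x1234)    # root tuple processed
assert not tracker.is_complete("spout-tuple")
tracker.acked("spout-tuple", 0x5678)    # child processed as well
assert tracker.is_complete("spout-tuple")
```

The appeal of this design is that the acker needs constant memory per spout tuple no matter how large the tuple tree grows; the messaging overhead mentioned above stems from the ack messages themselves, not from the bookkeeping.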


Irrespective of whether batches are processed in parallel or one by one, state updates have to be persisted in strict order to guarantee correct semantics. As a consequence, their size and frequency can become a bottleneck and Tri- dent can therefore only feasibly manage small state.
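The strict-order requirement is also what makes the first state variant described above work: the store keeps the sequence number (transaction id) of the last applied batch next to the state itself, so a replayed batch can be recognised and skipped, and at-least-once replays do not double-count. A toy sketch of this idea (hypothetical names, not Trident's State API; a real store would persist the counts and the transaction id in one atomic write):

```python
class TransactionalState:
    """Toy word-count state that applies each batch at most once,
    in strict batch order, as outlined above."""

    def __init__(self):
        self.last_txid = 0  # sequence number of the last applied batch
        self.counts = {}

    def apply_batch(self, txid, words):
        if txid <= self.last_txid:
            return  # replayed batch: already reflected in state, skip it
        assert txid == self.last_txid + 1, "batches must be applied in order"
        for w in words:
            self.counts[w] = self.counts.get(w, 0) + 1
        # In a real store this update and the txid are written atomically.
        self.last_txid = txid


state = TransactionalState()
state.apply_batch(1, ["a", "b", "a"])
state.apply_batch(2, ["b"])
state.apply_batch(2, ["b"])  # replay after a failure: ignored
assert state.counts == {"a": 2, "b": 2}
```

Because every update must carry and compare the batch sequence number, the cost of this scheme grows with the size and frequency of state writes, which is exactly why Trident state stays feasible only while it remains small.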

3.2 Samza

Samza (current stable version: 0.10.0) is very similar to Storm in that it is a stream processor with a one-at-a-time processing model and at-least-once processing semantics. It was initially created at LinkedIn, submitted to the Apache Incubator in July 2013 and was granted top-level status in 2015. Samza was co-developed with the queueing system Kafka [4, 26] and therefore relies on the same messaging semantics: Streams are partitioned, and messages (i.e. data items) inside the same partition are ordered, whereas there is no order between messages of different partitions. Even though Samza can work with other queueing systems, Kafka's capabilities are effectively required to use Samza to its full potential, and it is therefore assumed to be deployed with Samza for the rest of this section. In comparison to Storm, Samza requires a little more work to deploy as it does not only depend on a ZooKeeper cluster, but also runs on top of Hadoop YARN [6] for fault-tolerance: In essence, application logic is implemented as a job that is submitted through the Samza YARN client, which has YARN then start and supervise one or more containers. Scalability is achieved through running a Samza job in several parallel tasks, each of which consumes a separate Kafka partition; the degree of parallelism, i.e. the number of tasks, cannot be increased dynamically at runtime. Similar to Kafka, Samza focuses on support for JVM languages, particularly Java. Contrasting Storm and Trident, Samza is designed to handle large amounts of state in a fault-tolerant fashion by persisting state in a local database and replicating state updates to Kafka. By default, Samza employs a key-value store for this purpose, but other storage engines with richer querying capabilities can be plugged in.

As illustrated in Figure 5, a Samza job represents one processing step in an analytics pipeline and thus roughly corresponds to a bolt in a Storm topology. In stark contrast to Storm, however, where data is directly sent from one bolt to another, output produced by a Samza job is always written back to Kafka, from where it can be consumed by other Samza jobs. Although a single Samza job or a single Kafka persistence hop may delay a message by only a few milliseconds [8, 24], latency adds up, and complex analytics pipelines comprising several processing steps eventually display higher end-to-end latency than comparable Storm implementations.

Figure 5: Data flow in a typical Samza analytics pipeline: Samza jobs cannot communicate directly, but have to use a queueing system such as Kafka as message broker.

However, this design also decouples individual processing steps and thus eases development. Another advantage is that buffering data between processing steps makes (intermediate) results available to unrelated parties, e.g. other teams in the same company, and further eliminates the need for a backpressure algorithm, since there is no harm in the backlog of a particular job filling up temporarily, given a reasonably sized Kafka deployment. Since Samza processes messages in order and stores processing results durably after each step, it is able to prevent data loss by periodically checkpointing current progress and reprocessing all data from that point onwards in case of failure; in fact, Samza does not support a weaker guarantee than at-least-once processing, since there would be virtually no performance gain in relaxing this guarantee. While Samza does not provide exactly-once semantics, it allows configuring the checkpointing interval and thus offers some control over the amount of data that may be processed multiple times in an error scenario.

3.3 Spark Streaming

The Spark framework [39] (current stable version: 1.6.1) is a batch-processing framework that is often mentioned as the unofficial successor of Hadoop, as it offers several benefits in comparison, most notably a more concise API resulting in less verbose application logic and significant performance improvements through in-memory caching: In particular, iterative algorithms (e.g. machine learning algorithms such as k-means clustering or logistic regression) are accelerated by orders of magnitude, because data is not necessarily written to and loaded from disk between

every processing step. In addition to these performance benefits, Spark provides a variety of machine learning algorithms out of the box through the MLlib library. Originating from UC Berkeley in 2009, Spark was open-sourced in 2010 and donated to the Apache Software Foundation in 2013, where it became a top-level project in February 2014. It is mostly written in Scala and has a Java, Scala and Python API. The core abstraction of Spark are distributed and immutable collections called RDDs (resilient distributed datasets) that can only be manipulated through deterministic operations. Spark is resilient to machine failures by keeping track of any RDD's lineage, i.e. the sequence of operations that created it, and by checkpointing RDDs that are expensive to recompute, e.g. to HDFS [35]. A Spark deployment consists of a cluster manager for resource management (and supervision), a driver program for application scheduling and several worker nodes to execute the application logic. Spark runs on top of Mesos, YARN or in standalone mode, in which case it may be used in combination with ZooKeeper to remove the master node, i.e. the cluster manager, as a single point of failure.

Spark Streaming [40] shifts Spark's batch-processing approach towards real-time requirements by chunking the stream of incoming data items into small batches, transforming them into RDDs and processing them as usual. It further takes care of data flow and distribution automatically. Spark Streaming has been in development since late 2011 and became part of Spark in February 2013. Being a part of the Spark framework, Spark Streaming had a large developer community and also a huge group of potential users from day one, since both systems share the same API and since Spark Streaming runs on top of a common Spark cluster. Thus, it can be made resilient to failure of any component [37], like Storm and Samza, and further supports dynamically scaling the resources allocated for an application. Data is ingested and transformed into a sequence of RDDs called a DStream (discretised stream) before being processed by workers. All RDDs in a DStream are processed in order, whereas data items inside an RDD are processed in parallel without any ordering guarantees. Since there is a certain job scheduling delay when processing an RDD, batch sizes below 50 ms tend to be infeasible [10, section "Performance Tuning"]. Accordingly, processing an RDD takes around 100 ms in the best case, although Spark Streaming is designed for latency in the order of a few seconds [40, Sec. 2]. To prevent data loss even for unreliable data sources, Spark Streaming grants the option of using a write-ahead log (WAL) from which data can be replayed after failure. State management is realised through a state DStream that can be updated through a DStream transformation.

3.4 Discussion

Table 1 sums up the properties of these systems in direct comparison. Storm provides low latency, but does not offer ordering guarantees and is often deployed providing no delivery guarantees at all, since the per-tuple acknowledgement required for at-least-once processing effectively doubles messaging overhead. Stateful exactly-once processing is available in Trident through idempotent state updates, but has notable impact on performance and even fault-tolerance in some failure scenarios. Samza is another native stream processor that has not been geared towards low latency as much as Storm and puts more focus on providing rich semantics, in particular through a built-in concept of state management. Having been developed for use with Kafka in the Kappa Architecture, Samza and Kafka are tightly integrated and share messaging semantics; thus, Samza can fully exploit the ordering guarantees provided by Kafka. Spark Streaming effectively unifies batch and stream processing and offers a high-level API, exactly-once processing guarantees and a rich set of libraries, all of which can greatly reduce the complexity of application development. However, being a native batch

Table 1: Storm/Trident, LinkedIn's Samza and Spark Streaming in direct comparison.

                         Storm          Trident            Samza                     Spark Streaming
strictest guarantee      at-least-once  exactly-once       at-least-once             exactly-once
achievable latency       ≪ 100 ms       < 100 ms           < 100 ms                  < 1 s
state management         yes            yes (small state)  yes                       yes
processing model         one-at-a-time  micro-batch        one-at-a-time             micro-batch
backpressure mechanism   yes            yes                not required (buffering)  yes
ordering guarantees      no             between batches    within stream partitions  between batches
elasticity               yes            yes                no                        yes
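The exactly-once guarantee listed for Trident and Spark Streaming in Table 1 is typically constructed from at-least-once delivery plus idempotent state updates: a redelivered message then changes the state only once. A simplified illustration of the principle (hypothetical names, not the API of any of the systems above; real systems deduplicate per partition or per batch rather than keeping a global set of ids):

```python
class IdempotentConsumer:
    """Applies each message at most once, so at-least-once redelivery
    still yields an exactly-once effect on state."""

    def __init__(self):
        self.seen = set()  # ids of messages already applied
        self.total = 0

    def handle(self, msg_id, amount):
        if msg_id in self.seen:
            return  # duplicate delivery: drop it
        self.seen.add(msg_id)
        self.total += amount


consumer = IdempotentConsumer()
# At-least-once delivery may hand over the same message more than once:
for msg_id, amount in [(1, 10), (2, 5), (1, 10)]:
    consumer.handle(msg_id, amount)
assert consumer.total == 15  # the redelivered message is counted only once
```

Note that this construction only covers effects on managed state; external side-effects triggered during a replayed message, such as sending a notification, still happen again.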


processor, Spark Streaming loses to its contenders with respect to latency [20].

All these different systems show that low latency is involved in a number of trade-offs with other desirable properties such as throughput, fault-tolerance, reliability (processing guarantees) and ease of development. Throughput can be optimised by buffering data and processing it in batches to reduce the impact of messaging and other overhead per data item, whereas this obviously increases the in-flight time of individual data items. Abstract interfaces hide system complexity and ease the process of application development, but sometimes also limit the possibilities of performance tuning. Similarly, rich processing guarantees and fault-tolerance for stateful operations increase reliability and make it easier to reason about semantics, but require the system to do additional work, e.g. acknowledgements and (synchronous) state replication. Exactly-once semantics are particularly desirable and can be implemented through combining at-least-once guarantees with either transactional or idempotent state updates, but they cannot be achieved for actions with side-effects such as sending a notification to an administrator.

4 Further systems

In the last couple of years, a great number of stream processors have emerged that all aim to provide high availability, fault-tolerance and horizontal scalability. Flink [2] (formerly known as Stratosphere [15]) is a project that has many parallels to Spark Streaming as it also originated from research and advertises the unification of batch and stream processing in the same system, providing exactly-once guarantees for the stream programming model and a high-level API comparable to that of Trident. In contrast to Spark Streaming, though, Flink does not rely on batching for stream processing internally and thus delivers low latency in the order of Storm or Samza. Project Apex [1] is the open-sourced DataTorrent RTS core engine. Much like Flink, Apex promises high performance in stream and batch processing with low latency in streaming workloads. Like Spark/Spark Streaming, Flink and Apex are complemented by a host of database, file system and other connectors as well as pattern matching, machine learning and other algorithms through additional libraries. A system that is not publicly available is Twitter's Heron [27]. Designed to replace Storm at Twitter, Heron is completely API-compatible to Storm and improves on several aspects such as backpressure, efficiency, resource isolation, multitenancy, ease of debugging and performance monitoring, but reportedly does not provide exactly-once delivery guarantees. MillWheel [13] is an extremely scalable stream processor that offers similar qualities as Flink and Apex, e.g. state management and exactly-once semantics. MillWheel and FlumeJava [19] are the execution engines behind Google's Dataflow cloud service for data processing. Like other Google services and unlike most other systems discussed in this section, Dataflow is fully managed and thus relieves its users of the burden of deployment and all related troubles. The Dataflow programming model [14] combines batch and stream processing and is also agnostic of the underlying processing system, thus decoupling business logic from the actual implementation. The runtime-agnostic API was open-sourced in 2015 and has evolved into the Apache Beam project (short for Batch and stream, currently incubating) [7] to bundle it with the corresponding execution engines (runners): As of writing, Flink, Spark and the proprietary Google Dataflow cloud service are supported. The only other fully managed stream processing system apart from Google Dataflow that we are aware of is IBM Infosphere Streams [17]. However, in contrast to Google Dataflow, which is documented to be highly scalable (quota limit for customers: 1000 compute nodes [9]), it is hard to find evidence for high scalability of IBM Infosphere Streams; performance evaluations made by IBM [21] only indicate that it performs well in small deployments with up to a few nodes. Apache Flume [3] is a system for efficient data aggregation and collection that is often used for data ingestion into Hadoop as it integrates well with HDFS and can handle large volumes of incoming data. But Flume also supports simple operations such as filtering or modifying incoming data through Flume Interceptors [23, Chapter 7], which may be chained together to form a low-latency processing pipeline. The list of distributed stream processors goes on [16, 28, 33, 38], but since a complete discussion of the Big Data stream processing landscape is out of the scope of this article, we do not go into further detail here.

5 Wrap-up

Batch-oriented systems have done the heavy lifting in data-intensive applications for decades, but they do not reflect the unbounded and continuous nature of data as it is produced in many real-world applications. Stream-oriented systems, on the other hand, process data as it arrives and thus are oftentimes a more natural fit, though inferior with respect to efficiency. While a growing number of production deployments implementing the Lambda Architecture and emerging hybrid systems like Dataflow/Beam, Flink or Apex document significant efforts to close the


gap between batch and stream processing in both research and practice, the Kappa Architecture completely eschews the traditional approach and even heralds the advent of purely stream-oriented Big Data analytics. However, whether at the core of novel system designs or as a complement to existing architectures, horizontally scalable stream processors are gaining momentum as the requirement for low latency has become a driving force in modern Big Data analytics pipelines.

References

1. Apache Apex. http://apex.incubator.apache.org/.
2. Apache Flink. https://flink.apache.org/.
3. Apache Flume. https://flume.apache.org/.
4. Apache Kafka. http://kafka.apache.org/.
5. Apache Mesos. http://mesos.apache.org/.
6. Apache Yarn. http://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YARN.html.
7. Beam proposal – Apache Incubator wiki. https://wiki.apache.org/incubator/BeamProposal.
8. Comparisons: Spark Streaming. http://samza.apache.org/learn/documentation/0.7.0/comparisons/spark-streaming.html. Accessed: 2016-01-12.
9. Resource quotas. https://cloud.google.com/dataflow/quotas. Accessed: 2016-01-14.
10. Spark Streaming programming guide. https://spark.apache.org/docs/1.6.0/streaming-programming-guide.html. Accessed: 2016-01-12.
11. D. J. Abadi, Y. Ahmad, M. Balazinska, M. Cherniack, J.-H. Hwang, W. Lindner, A. S. Maskey, E. Rasin, E. Ryvkina, N. Tatbul, Y. Xing, and S. Zdonik. The design of the Borealis stream processing engine. In CIDR, pages 277–289, 2005.
12. D. J. Abadi, D. Carney, U. Çetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, and S. Zdonik. Aurora: A new model and architecture for data stream management. The VLDB Journal, 12(2):120–139, Aug. 2003.
13. T. Akidau, A. Balikov, K. Bekiroglu, et al. MillWheel: Fault-tolerant stream processing at internet scale. In Very Large Data Bases, pages 734–746, 2013.
14. T. Akidau, R. Bradshaw, C. Chambers, S. Chernyak, R. J. Fernández-Moctezuma, R. Lax, S. McVeety, D. Mills, F. Perry, E. Schmidt, and S. Whittle. The Dataflow model: A practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. Proceedings of the VLDB Endowment, 8:1792–1803, 2015.
15. A. Alexandrov, R. Bergmann, S. Ewen, et al. The Stratosphere platform for big data analytics. The VLDB Journal, 2014.
16. R. Ananthanarayanan, V. Basker, S. Das, A. Gupta, et al. Photon: Fault-tolerant and scalable joining of continuous data streams. In SIGMOD '13, 2013.
17. A. Biem, E. Bouillet, H. Feng, et al. IBM InfoSphere Streams for scalable, real-time, intelligent transportation services. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, 2010.
18. O. Boykin, S. Ritchie, I. O'Connell, et al. Summingbird: A framework for integrating batch and online MapReduce computations. PVLDB, 2014.
19. C. Chambers, A. Raniwala, F. Perry, et al. FlumeJava: Easy, efficient data-parallel pipelines. In PLDI 2010, 2010.
20. S. Chintapalli, D. Dagit, B. Evans, et al. Benchmarking streaming computation engines at Yahoo! http://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at, December 2015. Accessed: 2016-01-11.
21. IBM Corporation. Of streams and storms. Technical report, IBM Software Group, 2014.
22. Ericsson. Trident – benchmarking performance. http://www.ericsson.com/research-blog/data-knowledge/trident-benchmarking-performance/. Accessed: 2016-01-12.
23. M. Grover, T. Malaska, J. Seidman, and G. Shapira. Hadoop Application Architectures. O'Reilly, Beijing, 2015.
24. J. Kreps. Benchmarking Apache Kafka: 2 million writes per second (on three cheap machines). https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines. Accessed: 2016-01-12.
25. J. Kreps. Questioning the Lambda Architecture. http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html, July 2014. Accessed: 2015-12-17.
26. J. Kreps, N. Narkhede, and J. Rao. Kafka: A distributed messaging system for log processing. In NetDB'11, 2011.
27. S. Kulkarni, N. Bhagat, M. Fu, V. Kedigehalli, C. Kellogg, S. Mittal, J. M. Patel, K. Ramasamy, and S. Taneja. Twitter Heron: Stream processing at scale. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, pages 239–250, New York, NY, USA, 2015. ACM.
28. W. Lam, L. Liu, S. Prasad, et al. Muppet: MapReduce-style processing of fast data. VLDB 2012, 2012.
29. D. Laney. 3D data management: Controlling data volume, velocity, and variety. Technical report, META Group, February 2001.
30. N. Marz. Preview of Storm: The Hadoop of realtime processing. http://web.archive.org/web/20120509023348/http://tech.backtype.com/preview-of-storm-the-hadoop-of-realtime-proce, May 2012. Accessed: 2015-12-17.
31. N. Marz. History of Apache Storm and lessons learned. http://nathanmarz.com/blog/history-of-apache-storm-and-lessons-learned.html, October 2014. Accessed: 2015-12-17.
32. N. Marz and J. Warren. Big Data: Principles and Best Practices of Scalable Realtime Data Systems. Manning Publications Co., Greenwich, CT, USA, 1st edition, 2015.
33. L. Neumeyer, B. Robbins, and A. Kesari. S4: Distributed stream computing platform. In Intl. Workshop on Knowledge Discovery Using Cloud and Distributed Computing Platforms (KDCloud), 2010.
34. D. Peng and F. Dabek. Large-scale incremental processing using distributed transactions and notifications. In Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation, 2010.
35. K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop Distributed File System. In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), MSST '10, pages 1–10, Washington, DC, USA, 2010. IEEE Computer Society.


36. D. Terry, D. Goldberg, D. Nichols, and B. Oki. Continuous queries over append-only databases. SIGMOD Rec., 21(2):321–330, June 1992.
37. B. Venkat, P. Padmanabhan, A. Arokiasamy, and R. Uppalapati. Can Spark Streaming survive Chaos Monkey? http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html, March 2015. Accessed: 2016-01-11.
38. F. Yang, Z. Qian, X. Chen, I. Beschastnikh, L. Zhuang, L. Zhou, and G. Shen. Sonora: A platform for continuous mobile-cloud computing. Technical Report MSR-TR-2012-34, March 2012.
39. M. Zaharia, M. Chowdhury, T. Das, et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, NSDI'12, pages 2–2, Berkeley, CA, USA, 2012. USENIX Association.
40. M. Zaharia, T. Das, H. Li, et al. Discretized streams: Fault-tolerant streaming computation at scale. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP '13, pages 423–438, New York, NY, USA, 2013. ACM.

Bionotes

Wolfram Wingerath
Univ. of Hamburg, CS Dept., D-22527 Hamburg, Germany
[email protected]

Wolfram Wingerath is a Ph.D. student under the supervision of Norbert Ritter, teaching and researching at the University of Hamburg. He was co-organiser of the BTW 2015 conference and has held workshop and conference talks on his published work on several occasions. Wolfram is part of the databases and information systems group and his research interests revolve around scalable NoSQL database systems, cloud computing and Big Data analytics, but he also has a background in data quality and duplicate detection. His current work is related to real-time stream processing and explores the possibilities of providing always-up-to-date materialised views and continuous queries on top of existing non-streaming DBMSs.

Felix Gessert
Univ. of Hamburg, CS Dept., D-22527 Hamburg, Germany
[email protected]

Felix Gessert is a Ph.D. student at the databases and information systems group at the University of Hamburg. His main research fields are scalable database systems, transactions and web technologies for cloud data management. His thesis addresses caching and transaction processing for low-latency mobile and web applications. He is also founder and CEO of the startup Baqend, which implements these research results in a cloud-based backend-as-a-service platform. Since their product is based on a polyglot, NoSQL-centric storage model, he is very interested in both the research and practical challenges of leveraging and improving these systems. He frequently gives talks on different NoSQL topics.

Steffen Friedrich
Univ. of Hamburg, CS Dept., D-22527 Hamburg, Germany
[email protected]

Steffen Friedrich is a Ph.D. student working under the supervision of Norbert Ritter at the University of Hamburg. He has taken part in several workshops and conferences, both as presenter (e.g. DMC 2014) and as co-organiser (BTW 2015). Being a member of the databases and information systems group, Steffen is interested in large-scale data management and data-intensive computing. Furthermore, in his Master thesis, he also dealt with data quality issues, specifically with duplicate detection in probabilistic data. His research project is primarily concerned with benchmarking of non-functional characteristics (e.g. consistency and availability) in distributed NoSQL database systems.

Prof. Dr.-Ing. Norbert Ritter
Univ. of Hamburg, CS Dept., D-22527 Hamburg, Germany
[email protected]

Prof. Dr.-Ing. Norbert Ritter is a full professor of computer science at the University of Hamburg, where he heads the databases and information systems group. He received his Ph.D. from the University of Kaiserslautern in 1997. His research interests include distributed and federated database systems, transaction processing, caching, cloud data management, information integration and autonomous database systems. He has been teaching NoSQL topics in various courses for several years. Seeing the many open challenges for NoSQL systems, he and Felix Gessert have been organizing the annual Scalable Cloud Data Management Workshop (www.scdm2015.com) for three years to promote research in this area.
