Apache Samza
Martin Kleppmann

Definition

Apache Samza is an open source framework for distributed processing of high-volume event streams. Its primary design goal is to support high throughput for a wide range of processing patterns, while providing operational robustness at the massive scale required by Internet companies. Samza achieves this goal through a small number of carefully designed abstractions: partitioned logs for messaging, fault-tolerant local state, and cluster-based task scheduling.

Overview

Stream processing is playing an increasingly important part in the data management needs of many organizations. Event streams can represent many kinds of data, for example, the activity of users on a website, the movement of goods or vehicles, or the writes of records to a database.

Stream processing jobs are long-running processes that continuously consume one or more event streams, invoking some application logic on every event, producing derived output streams, and potentially writing output to databases for subsequent querying. While a batch process or a database query typically reads the state of a dataset at one point in time and then finishes, a stream processor is never finished: it continually awaits the arrival of new events, and it only shuts down when terminated by an administrator.

Many tasks can be naturally expressed as stream processing jobs, for example:

• aggregating occurrences of events, e.g., counting how many times a particular item has been viewed;
• computing the rate of certain events, e.g., for system diagnostics, reporting, and abuse prevention;
• enriching events with information from a database, e.g., extending user click events with information about the user who performed the action;
• joining related events, e.g., joining an event describing an email that was sent with any events describing the user clicking links in that email;
• updating caches, materialized views, and search indexes, e.g., maintaining an external full-text search index over text in a database;
• using machine learning systems to classify events, e.g., for spam filtering.

Apache Samza, an open source stream processing framework, can be used for any of the above applications (Kleppmann and Kreps, 2015; Noghabi et al., 2017). It was originally developed at LinkedIn, then donated to the Apache Software Foundation in 2013, and became a top-level Apache project in 2015. Samza is now used in production at many Internet companies, including LinkedIn (Paramasivam, 2016), Netflix (Netflix Technology Blog, 2016), Uber (Chen, 2016; Hermann and Del Balso, 2017), and TripAdvisor (Calisi, 2016).

Samza is designed for usage scenarios that require very high throughput: in some production settings, it processes millions of messages per second or trillions of events per day (Feng, 2015; Paramasivam, 2016; Noghabi et al., 2017). Consequently, the design of Samza prioritizes scalability and operational robustness above most other concerns.

The core of Samza consists of several fairly low-level abstractions, on top of which high-level operators have been built (Pathirage et al., 2016). However, the core abstractions have been carefully designed for operational robustness, and the scalability of Samza is directly attributable to the choice of these foundational abstractions.
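The first bullet above, maintaining a running count per item, illustrates the essential shape of a stream processor: state that is updated by application logic invoked once per event, in a process that in principle never terminates. The following is a minimal, framework-independent sketch; the ViewCounter class and event names are hypothetical constructions for illustration, not part of Samza.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a stream processor's core: derived state
// (view counts per item) updated by logic invoked once per event.
// A real processor would block awaiting further events indefinitely;
// here we simulate only a finite prefix of the stream.
public class ViewCounter {
    private final Map<String, Integer> counts = new HashMap<>();

    // Application logic invoked once per event.
    public void process(String itemId) {
        counts.merge(itemId, 1, Integer::sum);
    }

    public int countFor(String itemId) {
        return counts.getOrDefault(itemId, 0);
    }

    public static void main(String[] args) {
        ViewCounter counter = new ViewCounter();
        String[] events = {"item-1", "item-2", "item-1"};
        for (String event : events) {
            counter.process(event);
        }
        System.out.println(counter.countFor("item-1")); // prints 2
    }
}
```

Unlike a batch job, which would compute these counts once over a fixed dataset and exit, this loop simply continues as new events arrive, so the counts are continuously up to date.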
The remainder of this article provides further detail on those design decisions and their practical consequences.

Partitioned Log Processing

A Samza job consists of a set of Java Virtual Machine (JVM) instances, called tasks, that each process a subset of the input data. The code running in each JVM comprises the Samza framework and user code that implements the required application-specific functionality. The primary API for user code is the Java interface StreamTask, which defines a method process(). Figure 1 shows two examples of user classes implementing the StreamTask interface.

Once a Samza job is deployed and initialized, the framework calls the process() method once for every message in any of the input streams. The execution of this method may have various effects, including querying or updating local state and sending messages to output streams.
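How a StreamTask implementation is wired to its input streams is a deployment concern, expressed in Samza's classic properties-based job configuration. The following fragment is an illustrative sketch only; the job name, class name, and stream names are hypothetical, and details vary between Samza versions.

```properties
# Illustrative job configuration (names are hypothetical)
job.name=word-count
# The StreamTask implementation the framework instantiates in each task:
task.class=samza.examples.CountWords
# Inputs in "system.stream" form: consume the "words" topic of the "kafka" system:
task.inputs=kafka.words
# Bind the "kafka" system name to Samza's Kafka consumer/producer implementation:
systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
```

Given such a configuration, the framework handles consumption, partition assignment, and delivery; the user code only sees individual messages arriving at process().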
This model of computation is closely analogous to a map task in the well-known MapReduce programming model (Dean and Ghemawat, 2004), with the difference that a Samza job's input is typically never-ending (unbounded).

Similarly to MapReduce, each Samza task is a single-threaded process that iterates over a sequence of input records. The inputs to a Samza job are partitioned into disjoint subsets, and each input partition is assigned to exactly one processing task. More than one partition may be assigned to the same processing task, in which case the processing of those partitions is interleaved on the task thread. However, the number of partitions in the input determines the job's maximum degree of parallelism.

    class SplitWords implements StreamTask {
      static final SystemStream WORD_STREAM =
          new SystemStream("kafka", "words");

      public void process(
          IncomingMessageEnvelope in,
          MessageCollector out,
          TaskCoordinator _) {

        String str = (String) in.getMessage();
        for (String word : str.split(" ")) {
          out.send(
              new OutgoingMessageEnvelope(
                  WORD_STREAM, word, 1));
        }
      }
    }

    class CountWords implements StreamTask, InitableTask {
      private KeyValueStore<String, Integer> store;

      public void init(Config config,
                       TaskContext context) {
        store = (KeyValueStore<String, Integer>)
            context.getStore("word-counts");
      }

      public void process(
          IncomingMessageEnvelope in,
          MessageCollector out,
          TaskCoordinator _) {

        String word = (String) in.getKey();
        Integer inc = (Integer) in.getMessage();
        Integer count = store.get(word);
        if (count == null) count = 0;
        store.put(word, count + inc);
      }
    }

Fig. 1 The two operators of a streaming word-frequency counter using Samza's StreamTask API (Image source: Kleppmann and Kreps (2015), © 2015 IEEE, reused with permission)

Fig. 2 A Samza task consumes input from one partition, but can send output to any partition (Image source: Kleppmann and Kreps (2015), © 2015 IEEE, reused with permission)

Fig. 3 A task's local state is made durable by emitting a changelog of key-value pairs to Kafka (Image source: Kleppmann and Kreps (2015), © 2015 IEEE, reused with permission)

The log interface assumes that each partition of the input is a totally ordered sequence of records and that each record is associated with a monotonically increasing sequence number or identifier (known as an offset).
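Monotonically increasing offsets make progress tracking simple, as the surrounding text describes: a consumer periodically records the offset of the last record it processed, and resumes from there after a restart. The sketch below is a simplified illustration, not Samza's actual checkpointing code; the commit interval and the in-memory list standing in for a partition and for durable storage are assumptions.

```java
import java.util.List;

// Simplified sketch of offset tracking over a totally ordered partition:
// commit the current offset every COMMIT_INTERVAL records, and resume
// from the last committed offset after a restart.
public class OffsetTracker {
    private long committedOffset = -1;          // last durably recorded offset
    private static final int COMMIT_INTERVAL = 100;

    // Process all records after the committed offset; returns how many
    // records were processed in this run.
    public long consume(List<String> partition) {
        long processed = 0;
        for (long offset = committedOffset + 1; offset < partition.size(); offset++) {
            handle(partition.get((int) offset));
            processed++;
            if (processed % COMMIT_INTERVAL == 0) {
                committedOffset = offset;       // stands in for a durable write
            }
        }
        return processed;
    }

    private void handle(String record) { /* application logic */ }
}
```

Note the consequence of committing only periodically: after a crash, the records between the last commit and the failure point are processed again, giving at-least-once rather than exactly-once processing.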
Since the records in each partition are read sequentially, a job can track its progress by periodically writing the offset of the last read record to durable storage. If a stream processing task is restarted, it resumes consuming the input from the last recorded offset.

Most commonly, Samza is used in conjunction with Apache Kafka (see the separate article on Kafka). Kafka provides a partitioned, fault-tolerant log that allows publishers to append messages to a log partition, and consumers (subscribers) to sequentially read the messages in a log partition (Wang et al., 2015; Kreps et al., 2011; Goodhope et al., 2012). Kafka also allows stream processing jobs to reprocess previously seen records by resetting the consumer offset to an earlier position, a fact that is useful during recovery from failures.

However, Samza's stream interface is pluggable: besides Kafka, it can use any storage or messaging system as input, provided that the system can adhere to the partitioned log interface. By default, Samza can also read files from the Hadoop Distributed Filesystem (HDFS) as input, in a way that parallels MapReduce jobs, at competitive performance (Noghabi et al., 2017). At LinkedIn, Samza is commonly deployed with Databus inputs: Databus is a change data capture technology that records the log of writes to a database and makes this log available for applications to consume (Das et al., 2012; Qiao et al., 2013). Processing the stream of writes to a database enables jobs to maintain external indexes or materialized views onto data in a database, and is especially relevant in conjunction with Samza's support for local state (see Section "Fault-tolerant Local State").

While every partition of an input stream is assigned to one particular task of a Samza job, the output partitions are not bound to tasks. That is, when a StreamTask emits output messages, it can assign them to any partition of the output stream. This fact can be used to group related data items into the same partition: for example, in the word-counting application illustrated in Figure 2, the SplitWords task chooses the output partition for each word based on a hash of the word. This ensures that when different tasks encounter occurrences of the same word, they are all written to the same output partition, from where a downstream job can read and aggregate the occurrences.

When stream tasks are composed into multistage processing pipelines, the output of one task becomes the input to another task. Unlike many other stream processing frameworks, Samza does not implement its own message transport layer to deliver messages between stream operators. Instead, Kafka is used for this purpose; since Kafka writes all messages to disk, it provides a large buffer between stages of the processing pipeline, limited only by the available disk space on the Kafka brokers.
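The hash-based grouping described above can be sketched as follows. This is an illustrative partitioner only; the hash functions actually used by Samza and Kafka may differ in detail.

```java
// Illustrative partition assignment: hashing the message key ensures that
// all occurrences of the same word land in the same output partition,
// regardless of which upstream task emitted them.
public class HashPartitioner {
    public static int partitionFor(String key, int numPartitions) {
        // Mask the sign bit rather than using Math.abs, which fails
        // for Integer.MIN_VALUE.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }
}
```

Because the assignment depends only on the key, every task computes the same partition for the same word, which is exactly what allows a downstream CountWords task to see all occurrences of a given word in one place.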
Typically, Kafka is configured to retain several days' or weeks' worth of messages in each topic. Thus, if one stage of a processing pipeline fails or begins to run slowly, Kafka can simply buffer the input to that stage while leaving ample time for the problem to be resolved.

point, an additional consumer can be at-