Introduction to Apache Storm
Dennis Patrone, Ph.D.
UB CDSE Days, March 2016

About Me
• Member of the Principal Professional Staff and Chief Engineer of the Large-Scale Analytic Systems group at the Johns Hopkins University Applied Physics Laboratory, Laurel, MD
  o 18+ years working primarily in distributed systems
  o Last 6 years working on big data challenges
• Computer Science background
  o Ph.D., UB
  o M.S., Johns Hopkins U
  o B.S., St. Bonaventure U
• Cloudera Certified Developer for Apache Hadoop

Outline
• Big Data Architectures
• Apache Storm
  o Terminology
  o Concepts
  o Architecture
• Hands-On 1: Hello, World!
• Reliability API (Guaranteed Message Processing)
• Tips & Tricks
• Advanced Topics (briefly)
• Hands-On 2: Twitter Sentiment Analysis

Batch Architectures (simplified)
[Diagram components: Data Source, “Raw” Dataset, Batch Layer (e.g., Hadoop), Pre-computed Views, Query]
• Batch is a “snapshot” of data
• CAP (Consistency, Availability, and Partition Tolerance) Theorem: “P” is required; choose “C” or “A”!
• Consistency: all nodes see the same data at the same time
• Availability: every request receives a response
• Partition tolerance: the system continues to operate despite being segmented by network failures
See: Nathan Marz, “How to beat the CAP theorem” (2011) http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html

Lambda Architecture (simplified)
[Diagram components: Data Source; Append-only, Immutable “Raw” Dataset; Batch Layer (e.g., Hadoop); Pre-computed Views; Realtime Layer (e.g., Storm); Realtime Views; Query]
• Batch Layer handles massive-scale, historical data
• Realtime Layer handles realtime queries on “recent” data (a significantly smaller problem)
• Immutable raw dataset makes processing “human fault-tolerant”
See: Jay Kreps, “Questioning the Lambda Architecture” (2014) https://www.oreilly.com/ideas/questioning-the-lambda-architecture

Kappa Architecture (simplified)
[Diagram components: Data Source; Append-only, Immutable “Raw” Dataset; Stream Process, v1.0; Processed View, v1.0; Stream Process, v2.0; Processed View, v2.0; Query]
• Process everything through the stream architecture
• When the processing changes, reprocess from the “beginning” of the stream
• Trades space (2x) for the complexity of maintaining 2 codebases
• When the second stream catches up, switch the query data source and clean up the old data
See: Jay Kreps, “Questioning the Lambda Architecture” (2014) https://www.oreilly.com/ideas/questioning-the-lambda-architecture

Apache Storm Terminology
• Tuple: a named list of values
• Stream: an unbounded sequence of tuples
• Spout: a source of a stream; emits tuples
• Bolt: a stream processor (can also optionally emit tuples and create new streams)
• Topology: a network of spouts and bolts
[Figure from http://storm.apache.org/tutorial.html]

Spout
    void open(Map cfg, TopologyContext ctx, SpoutOutputCollector collector);
    void declareOutputFields(OutputFieldsDeclarer declarer);
    void nextTuple();

    class SpoutOutputCollector {
        List<Integer> emit(List<Object> tuple);
        List<Integer> emit(List<Object> tuple, Object msgId);
        List<Integer> emit(String stream, List<Object> tuple, Object msgId);
        …
    }

Example Spout
    public class RandomIntSpout extends BaseRichSpout {
        transient SpoutOutputCollector collector;

        // Called once when the spout task starts; keep the collector for later emits.
        public void open(Map cfg, TopologyContext ctx, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        // Called repeatedly by Storm; emits one random integer in [0, 100) per call.
        public void nextTuple() {
            Random r = new Random();
            collector.emit(new Values(r.nextInt(100)));
        }

        // Names the single field of every emitted tuple.
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("number"));
        }
    }

Bolt
    void prepare(Map cfg, TopologyContext ctx, OutputCollector collector);
    void execute(Tuple input);
    void declareOutputFields(OutputFieldsDeclarer declarer);

Tuple
    MessageId getMessageId();
    String getSourceComponent();
    String getSourceStreamId();
    int getSourceTask();
    Fields getFields();
    List<Object> getValues();
    List<Object> select(Fields fields);
    boolean contains(String field);
    int fieldIndex(String field);
    int size();
    Boolean getBoolean(int idx);
    Boolean getBooleanByField(String field);
    Double getDouble(int idx);
    Double getDoubleByField(String field);
    …

Example Bolt
    public class NumberCounterBolt extends BaseRichBolt {
        transient OutputCollector collector;
        // Running count per distinct number seen by this task.
        Map<Integer, Long> counts;

        public void prepare(Map cfg, TopologyContext ctx, OutputCollector collector) {
            this.collector = collector;
            counts = new HashMap<Integer, Long>();
        }

        public void execute(Tuple t) {
            Integer val = t.getIntegerByField("number");
            Long count = 1L;
            if (counts.containsKey(val)) {
                count = counts.get(val) + 1L;
            }
            counts.put(val, count);
            // Emit the number together with its updated count.
            collector.emit(new Values(val, count));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("number", "count"));
        }
        … (other methods) …
    }

Spouts and Bolts Topology
• Streams are connected together in a topology to perform processing
• Spouts and bolts may be (likely will be) replicated across the Storm cluster (different JVMs and hosts)
  o User controls how streams are distributed
[Diagram: Twitter API (Spout), Tweet Normalizer, Hashtag Extractor, Hashtag Counter, Trending Topics, in that order; the second version of the diagram shows the Tweet Normalizer, Hashtag Extractor, and Hashtag Counter replicated]

Stream Grouping: Shuffle
Randomly assigns tuples to tasks in a way that guarantees even load distribution

Stream Grouping: Fields
Assigns tuples based on a user-specified subset of fields (tuples with equal values for those fields always go to the same task)

Stream Grouping: Partial Key*
[Diagram: tasks and “Agg” tasks]
Assigns tuples based on a user-specified subset of fields to one of two tasks, chosen using local routing estimates, to better distribute skewed data
* Anis Uddin Nasir, et al. "The power of both choices: Practical load balancing for distributed stream processing engines." In Data Engineering (ICDE), 2015 IEEE 31st International Conference on, pp. 137-148. IEEE, 2015.

Stream Grouping: All
Tuple is replicated to all tasks (use with care)

Stream Grouping: Global
Entire stream goes to a single task of the bolt (note: other streams may use the same bolt with a different grouping)

Stream Grouping: None
It doesn’t matter how the tuples are distributed. Currently behaves the same as shuffle, but future improvements will localize execution with the source when possible

Stream Grouping: LocalOrShuffle
[Diagram: two Workers, each containing four Tasks]
If the target bolt has one or more tasks in the same worker process, tuples are sent to those in-process tasks; otherwise it falls back to “normal” shuffle

Stream Grouping: Direct
The tuple producer explicitly decides which task to send the tuple to with: emitDirect(taskId, tuple)

CustomStreamGrouping
    interface CustomStreamGrouping {
        void prepare(WorkerTopologyContext context, GlobalStreamId stream, List<Integer> targetTasks);
        List<Integer> chooseTasks(int taskId, List<Object> values);
    }

    builder.setBolt("dest", new MyExampleBolt())
           .customGrouping("src", new MyCustomGrouping());

Trending Example Grouping
• Shuffle grouping
• Local or Shuffle
• Fields Grouping (field = “hashtag”)
• Global
[Diagram: Twitter API (Spout); replicated Tweet Normalizers, Hashtag Extractors, and Hashtag Counters; Top N Merger; Trending Topics]

Storm Architecture
[Diagram: Nimbus (Master Node); Zookeeper nodes (Cloud Coordination); Supervisors (host), each running Workers (JVM), which run Executors (thread), which run Tasks]
From: http://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html

Storm Parallelization
    Config conf = new Config();
    conf.setNumWorkers(2);
    topologyBuilder.setSpout("blue-spout", new BlueSpout(), 2);
    topologyBuilder.setBolt("green-bolt",