Introduction to Stateful Stream Processing with Apache Flink


Robert Metzger, @rmetzger_, [email protected]

Ververica Platform
● From the original creators of open source Apache Flink
● Apache Flink® + Application Manager

Apache Flink 101

Apache Flink
● Apache Flink is an open source stream processing framework
  ○ Low latency
  ○ High throughput
  ○ Stateful
  ○ Distributed

What is Apache Flink?
Stateful computations over data streams:
● Data stream processing: realtime results from data streams
● Batch processing: processing of static and historic data
● Event-driven applications: data-driven actions and services

Use Case: Streaming ETL
● Periodic ETL is the traditional approach
  ○ An external tool periodically triggers an ETL batch job
● Data pipelines continuously move data
  ○ Ingestion with low latency
  ○ No artificial data boundaries

Use Case: Data Analytics
● Batch analytics is great for ad-hoc queries
  ○ The queries change faster than the data
  ○ Interactive analytics / prototyping
● Stream analytics continuously processes data
  ○ The data changes faster than the queries
  ○ Live / low-latency results
  ○ No Lambda architecture required!

Use Case: Event-Driven Applications
● Traditional application design
  ○ Compute & data tier architecture
  ○ React to and process events
  ○ State is stored in a (remote) database
● Event-driven application
  ○ State is maintained locally
  ○ Consistency is guaranteed by periodic state checkpoints
  ○ Tight coupling of logic and data (microservice architecture)
  ○ Highly scalable design
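Returning to the Streaming ETL use case above, the sketch below wires a continuously ingesting source to a simple per-record transformation with the DataStream API. It is a minimal illustration rather than anything from the original slides: the topic name "raw-events", the broker address, and the cleansing logic are assumptions, and the connector classes follow the Flink 1.x Kafka connector.

    import java.util.Properties
    import org.apache.flink.api.common.serialization.SimpleStringSchema
    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

    object StreamingEtlJob {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment

        val props = new Properties()
        props.setProperty("bootstrap.servers", "localhost:9092") // assumed broker address

        // Continuously ingest raw records -- no artificial batch boundaries
        val raw: DataStream[String] = env.addSource(
          new FlinkKafkaConsumer[String]("raw-events", new SimpleStringSchema(), props))

        // Cleanse / normalize each record as it arrives
        val cleaned = raw.map(_.trim).filter(_.nonEmpty)

        // A real pipeline would attach a connector sink here (filesystem, JDBC, ...)
        cleaned.print()

        env.execute("streaming-etl")
      }
    }

Because the source never terminates, the job runs until cancelled; there is no batch boundary that an external tool would have to trigger.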
Hardened at Scale
[Figure: logos of companies running Flink in production] Details about
their use cases and more users are listed on Flink's website at
https://flink.apache.org/poweredby.html

Case Study: Singles' Day
● Chinese shopping festival
● Very high peak load
  ○ Hundreds of millions of records per second
  ○ Hundreds of thousands of payments per second
  ○ Tens of terabytes of managed state
  ○ Tens of thousands of cores
● Flink is used in many areas of the process, including payment,
  shipping, realtime recommendations, and the giant dashboard

References
● https://www.ververica.com/blog/singles-day-2018-data-in-a-flink-of-an-eye
● https://medium.com/@alitech_2017/how-to-cope-with-peak-data-traffic-on-11-11-the-secrets-of-alibaba-stream-computing-17d5e807980c

Building Blocks

The Core Building Blocks
● Event streams: real-time and hindsight
● State: complex business logic
● (Event) time: consistency with out-of-order data and late data
● Snapshots: forking / versioning / time-travel

Stateful Event & Stream Processing

    val lines: DataStream[String] =
      env.addSource(new FlinkKafkaConsumer(…))      // Source

    val events: DataStream[Event] =
      lines.map((line) => parse(line))              // Transformation

    val stats: DataStream[Statistic] = events
      .keyBy("sensor")
      .timeWindow(Time.seconds(5))                  // Transformation
      .sum(new MyAggregationFunction())

    stats.addSink(new RollingSink(path))            // Sink

Streaming dataflow: Source → Transform → Window (state read/write) → Sink

Stateful Event & Stream Processing
● A pipeline of sources, filter/transform operators with state
  read/write access, and sinks

Scalable Embedded State
● Access at memory speed; scales with the parallel operators

Rolling Back Computation / Re-processing
● Re-load the state
● Reset the read positions in the input streams

Time: Different Notions of Time
● Event time: when the event was produced (at the event producer)
● Ingestion time: when the event entered the Flink data source (from
  the message queue / broker partitions)
● Processing time: when the event is processed (e.g., at the window
  operator)

Time: Event Time Example
● Star Wars episodes in processing-time order (release year): Episode
  IV (1977), V (1980), VI (1983), I (1999), II (2002), III (2005),
  VII (2015) — while the event-time order (story order) is Episode I
  through VII

Recap: The Core Building Blocks
● Event streams: real-time and hindsight
● State: complex business logic
● (Event) time: consistency with out-of-order data and late data
● Snapshots: forking / versioning / time-travel

APIs

The APIs
● Stream SQL: analytics
● Table API (dynamic tables): stream and batch processing
● DataStream API (streams, windows): stateful stream processing
● Process Function (events, state, time): event-driven applications

Process Function

    class MyFunction extends ProcessFunction[MyEvent, Result] {

      // declare state to use in the program
      lazy val state: ValueState[CountWithTimestamp] =
        getRuntimeContext().getState(…)

      def processElement(event: MyEvent, ctx: Context,
                         out: Collector[Result]): Unit = {
        // work with the event and state
        (event, state.value) match { … }

        out.collect(…)   // emit events
        state.update(…)  // modify state

        // schedule a timer callback
        ctx.timerService.registerEventTimeTimer(event.timestamp + 500)
      }

      def onTimer(timestamp: Long, ctx: OnTimerContext,
                  out: Collector[Result]): Unit = {
        // handle the callback when the event-/processing-time instant is reached
      }
    }

Data Stream API

    val lines: DataStream[String] =
      env.addSource(new FlinkKafkaConsumer(…))

    val events: DataStream[Event] =
      lines.map((line) => parse(line))

    val stats: DataStream[Statistic] = events
      .keyBy("sensor")
      .timeWindow(Time.seconds(5))
      .sum(new MyAggregationFunction())

    stats.addSink(new RollingSink(path))

Table API & Stream SQL
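The deck shows no code for this slide, so here is a minimal sketch of running a continuous SQL query over the events stream from the example above. It assumes the Flink 1.9-era Scala Table API (entry points and registration methods vary across versions) and that Event exposes a sensor field:

    import org.apache.flink.table.api.scala._

    // assumes `env` and the parsed `events` stream from the DataStream example
    val tableEnv = StreamTableEnvironment.create(env)

    // register the stream as a dynamic table named "sensors"
    tableEnv.registerDataStream("sensors", events)

    // a continuous query: its result updates as new events arrive
    val counts = tableEnv.sqlQuery(
      "SELECT sensor, COUNT(*) AS readings FROM sensors GROUP BY sensor")

The result of the query is itself a dynamic table: as rows arrive in "sensors", the per-sensor counts are added or modified rather than computed once.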
Deployment & Integrations

Deployment
[Figure: deployment options]

Integrations
● Event logs: Kafka, Kinesis, Pulsar*
● File systems: S3, HDFS, NFS, MapR FS, …
● Encodings: Avro, JSON, CSV, ORC, Parquet
● Databases: JDBC, HCatalog
● Key-value stores: Cassandra, Elasticsearch, Redis*

Concluding…

Early bird ticket sales end July 15th.

Q & A
● Get in touch via email: [email protected], [email protected]
● Get in touch via Twitter: @rmetzger_, @ApacheFlink

What's Happening in Flink These Days
● INSERT INTO flink_sql SELECT * FROM blink_sql
● Turning the Table API into an API unified across batch and streaming
  (FLINK-11439)
● Integration with the Hive ecosystem (FLINK-10556)
● Batch runtime improvements: fine-grained recovery (FLINK-4256), more
  schedulers (FLINK-10429), pluggable shuffle service (FLINK-10653)
● Machine learning pipelines on the Table API (FLIP-39)
● Table API: caching of intermediate results (.cache() API) (FLINK-11199)
● Table API: Python support (FLINK-12308)
[Figure: API stack — DataStream and Table / SQL APIs on top of the
Stream Operator & DAG API and the runtime; the DataSet API is
deprecated]

Implementation: State Checkpointing

State, Snapshots, Recovery
● Coordination via markers (checkpoint barriers) injected into the
  streams

Implementation: Stream SQL

Stream SQL: Intuition
● Stream / table duality: a stream corresponds to a table without a
  primary key, or to a table with a primary key

Stream SQL: Implementation
● Differential computation (add / modify / delete)
● Query compilation

Implementation: Rescaling

Rescaling State / Elasticity
● Similar to consistent hashing: the key space is split into key
  groups, and key groups are assigned to tasks
● Rescaling changes the key-group assignment
● The maximum parallelism is defined by the number of key groups
● Rescaling happens by restoring a savepoint with the new parallelism

Implementation: Time-Handling

Time: Different Notions of Time
● Event time: when the event was produced (at the event producer)
● Ingestion time: when the event entered the Flink data source (from
  the message queue / broker partitions)
● Processing time: when the event is processed (e.g., at the window
  operator)

Time: Watermarks
[Figure: watermarks in an in-order stream (W(11), W(20)) and in an
out-of-order stream (W(11), W(17)); a watermark W(t) flows with the
events and marks the point after which no events with timestamp ≤ t
are expected]

Time: Watermarks in Parallel
[Figure: watermarks are generated at the parallel sources and flow
with the events through the dataflow; an operator with multiple
inputs tracks the event time of each input stream and advances its
own event time to the minimum of the input watermarks]

Implementation: Queryable State

Queryable State
[Figure: traditional setup (state queried from an external store)
vs. queryable state (Flink's internal state is queried directly)]

Queryable State: Implementation
Query: /job/operator/state-name/key
(1) The query client asks the state location server for the location
    of the key's partition.
(2) Operators register the location of their local state in the state
    registry (alongside the ExecutionGraph and deployment status on
    the job manager).
(3) The state location server responds with the location.
(4) The client queries the task manager holding the state for the
    given job, operator, state name, and key.

State, Snapshots, Recovery
● Events flow without replication or synchronous writes; events are
  persistent and ordered (per partition / key) in the log (e.g.,
  Apache Kafka)
● Each stateful operation maintains a local state index (hash table
  or RocksDB)
● Trigger checkpoint: inject a checkpoint barrier at the sources
● Take state snapshot: RocksDB triggers a copy-on-write snapshot of
  the state
● Persist state snapshots: snapshots are durably persisted
  asynchronously while the processing pipeline continues
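A minimal sketch of how a job opts into this checkpointing mechanism, assuming the RocksDB state backend dependency is on the classpath and using a placeholder checkpoint path (the exact configuration API varies slightly across Flink versions):

    import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend

    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // inject checkpoint barriers at the sources every 10 seconds
    env.enableCheckpointing(10000)

    // keep working state in embedded RocksDB; snapshots are persisted
    // asynchronously to durable storage (the path is a placeholder)
    env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"))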
Powerful Abstractions
Layered abstractions to navigate from simple to complex use cases:

● High-level analytics API: Stream SQL / tables (dynamic tables)

● Stream and batch data processing: DataStream API (streams, windows)

    val stats = stream
      .keyBy("sensor")
      .timeWindow(Time.seconds(5))
      .sum((a, b) => a.add(b))

● Stateful event-driven applications: Process Function (events, state,
  time)

    def processElement(event: MyEvent, ctx: Context,
                       out: Collector[Result]) = {
      // work with the event and state
      (event, state.value) match { … }

      out.collect(…)   // emit events
      state.update(…)  // modify state

      // schedule a timer callback
      ctx.timerService.registerEventTimeTimer(event.timestamp + 500)
    }

Event Sourcing + Memory Image
● The event log durably persists the events (temporarily)
● The process consumes the events and updates local
  variables / structures in main memory
● The memory is snapshotted periodically

Event Sourcing + Memory Image: Recovery
● Restore the latest snapshot, then replay the events logged since
  that snapshot

Distributed Memory Image
● A distributed application consists of many memory images
● All snapshots together form a consistent view
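To make the pattern concrete, here is a small, framework-free sketch of a memory image over an event log. The Event and Snapshot types and the counting logic are illustrative assumptions, not part of the deck:

    // illustrative event and snapshot types
    final case class Event(key: String, delta: Long)
    final case class Snapshot(counts: Map[String, Long], position: Int)

    class MemoryImage {
      private var counts = Map.empty[String, Long] // local in-memory state
      private var position = 0                     // events applied so far

      // update local structures from the next event in the log
      def applyEvent(e: Event): Unit = {
        counts = counts.updated(e.key, counts.getOrElse(e.key, 0L) + e.delta)
        position += 1
      }

      // periodically snapshot the memory together with the log position
      def snapshot(): Snapshot = Snapshot(counts, position)

      // recovery: restore a snapshot, then replay all events logged after it
      def restore(s: Snapshot, log: Seq[Event]): Unit = {
        counts = s.counts
        position = s.position
        log.drop(s.position).foreach(applyEvent)
      }
    }

Flink applies the same idea per operator: the local state is the memory image, checkpoints are the periodic snapshots, and the source offsets recorded with each checkpoint identify where replay must resume.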