Technology Workshop "Big Data", FZI Karlsruhe, June 22, 2015
Introduction to Apache Flink
Christoph Boden | Deputy Project Coordinator, Berlin Big Data Center (BBDC) | TU Berlin DIMA
http://bbdc.berlin

Introducing the "Data Scientist" – "Jack of All Trades!"
(Word-cloud slide: the data scientist combines domain expertise – e.g., Industry 4.0, medicine, physics, engineering, energy, logistics – with skills from machine learning/statistics (mathematical programming, linear algebra, stochastic gradient descent, error estimation, active sampling, regression, Monte Carlo, sketches, hashing, convergence, curse of dimensionality, iterative algorithms) and from data management/systems (relational algebra/SQL, data warehouse/OLAP, NF²/XQuery, resource management, hardware adaptation, fault tolerance, memory management, parallelization, scalability, memory hierarchy, data analysis languages, compilers, query optimization, indexing, dataflow, control flow, real-time).)

Machine Learning + Data Management = X
(Word-cloud slide: Technology X must combine both worlds – think ML algorithms in a scalable way, process iterative algorithms in a scalable way, declarative languages and representations, automatic adaptation, and scalable processing. Goal: data analysis without system programming!)

Berlin Big Data Center (BBDC), http://bbdc.berlin – principal investigators and partners:
- Volker Markl (Data Management), Klaus-R. Müller (Machine Learning), Hans Uszkoreit (Language Technology), Anja Feldmann (Computer Networks), Odej Kao (Distributed Systems), Thomas Wiegand (Video Mining)
- Technische Universität Berlin, Data Analytics Lab; DFKI
- Beuth Hochschule: Stefan Edlich (Software Engineering)
- ZIB Berlin: Christof Schütte (Information-based Medicine), Alexander Reinefeld (File Systems, Supercomputing)
- Fritz-Haber-Institut, Max-Planck-Gesellschaft: Matthias Scheffler (Material Science)

X = Big Data Analytics – System Programming! ("What", not "How")
(Figure: Data Analyst → Machine)
- Larger human base of "data scientists"
- Reduction of "human" latencies
- Cost reduction
Description of "What" (declarative specification) vs. description of "How" (state of the art in scalable data analysis: Hadoop, MPI). Technology X must: think ML algorithms in a scalable way; process iterative algorithms in a scalable way; analyze "data in motion" and multimodal data; combine declarative specification with automatic optimization, parallelization, and hardware adaptation; support dataflow and control flow with user-defined functions, iterations, and distributed state; provide fault tolerance and consistent analysis of intermediate results; numerical stability; software-defined networking; scalable algorithms and debugging.

Application examples: technology drivers and validation
Application examples serve as technology drivers and validation: an economics-based marketplace for information (hierarchical text data flows, declarative technology), a society-based application for health (multimodal data integration: video, images, text; windowing), and science-based materials science (numerical simulation data, numerical stability).

Open source data infrastructure
- Applications: Hive, Cascading, Mahout, Pig, …
- Data processing engines: MapReduce, Flink, Spark, Storm, Tez
- App and resource management: YARN, Mesos
- Storage, streams: HDFS, HBase, Kafka, …

Engine paradigms & systems
- MapReduce → Apache Hadoop (OSDI ’04)
- Dryad, Nephele → Apache Tez (EuroSys ’07)
- PACTs → Apache Flink (SOCC ’10, VLDB ’12)
- RDDs → Apache Spark (HotCloud ’10, NSDI ’12)

Engine comparison
- API: MapReduce on k/v pairs (Hadoop) | transformations on collections of k/v pairs (Spark) | iterative transformations on collections (Flink)
- Paradigm: MapReduce | RDD | cyclic dataflows
- Optimization: none | optimization of SQL queries | optimization in all APIs
- Execution: batch with sorting | batch with memory pinning | stream with out-of-core algorithms

APACHE FLINK
(Stack diagram:)
- Libraries: Flink ML (machine learning), Gelly (graphs), Table (relational) – on top of the core APIs
- Core APIs: DataSet API (batch processing) and DataStream API (stream processing)
- Runtime: distributed streaming dataflow
- Deploy: local (single JVM), cluster (standalone, YARN), cloud (GCE, EC2)
An open source platform for scalable batch and stream data processing.

Data sets and operators
Program:
DataSet A → Operator X → DataSet B → Operator Y → DataSet C

Parallel execution:
A(1) → X → B(1) → Y → C(1)
A(2) → X → B(2) → Y → C(2)
Rich operator and functionality set
Map, Reduce, Join, CoGroup, Union, Iterate, Delta Iterate, Filter, FlatMap, GroupReduce, Project, Aggregate, Distinct, Vertex-Update, Accumulators
(Example dataflow: Source → Map feeding an iterated Reduce, joined with a second Source → Map branch, then Reduce → Sink.)
Base operators:
- Map: transforms each element of a data set individually.
- Reduce: combines the elements of a group into a single element.
- Cross: builds the Cartesian product of two data sets L and R.
- Join: pairs up the elements of L and R that agree on a join key.
- CoGroup: groups L and R by key and hands both groups of a key to a user function.
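The semantics of these five base operators can be sketched on plain Java collections. This is an illustrative plain-JDK sketch with made-up helper names (`BaseOps`, `map`, `coGroup`, …), not the Flink API – in Flink these run as parallel dataflow operators:

```java
import java.util.*;
import java.util.function.*;

// Plain-JDK sketch of the five base operators' semantics on in-memory lists.
public class BaseOps {
    // Map: apply a function to every element independently.
    public static <T, R> List<R> map(List<T> in, Function<T, R> f) {
        List<R> out = new ArrayList<>();
        for (T t : in) out.add(f.apply(t));
        return out;
    }

    // Reduce: combine all elements into one value.
    public static <T> T reduce(List<T> in, BinaryOperator<T> f) {
        T acc = in.get(0);
        for (int i = 1; i < in.size(); i++) acc = f.apply(acc, in.get(i));
        return acc;
    }

    // Cross: build the Cartesian product of two data sets.
    public static <L, R> List<List<Object>> cross(List<L> l, List<R> r) {
        List<List<Object>> out = new ArrayList<>();
        for (L a : l) for (R b : r) out.add(Arrays.asList(a, b));
        return out;
    }

    // Join: pair up elements with equal keys.
    public static <L, R, K> List<List<Object>> join(List<L> l, List<R> r,
            Function<L, K> lk, Function<R, K> rk) {
        List<List<Object>> out = new ArrayList<>();
        for (L a : l) for (R b : r)
            if (lk.apply(a).equals(rk.apply(b))) out.add(Arrays.asList(a, b));
        return out;
    }

    // CoGroup: hand all left and all right elements of one key to the user function.
    public static <L, R, K, O> List<O> coGroup(List<L> l, List<R> r,
            Function<L, K> lk, Function<R, K> rk,
            BiFunction<List<L>, List<R>, O> f) {
        Map<K, List<L>> lg = new LinkedHashMap<>();
        Map<K, List<R>> rg = new LinkedHashMap<>();
        for (L a : l) lg.computeIfAbsent(lk.apply(a), k -> new ArrayList<>()).add(a);
        for (R b : r) rg.computeIfAbsent(rk.apply(b), k -> new ArrayList<>()).add(b);
        Set<K> keys = new LinkedHashSet<>(lg.keySet());
        keys.addAll(rg.keySet());
        List<O> out = new ArrayList<>();
        for (K k : keys)
            out.add(f.apply(lg.getOrDefault(k, List.of()),
                            rg.getOrDefault(k, List.of())));
        return out;
    }
}
```

Note the difference between Join and CoGroup: Join emits one pair per matching combination, while CoGroup calls the user function exactly once per key – including keys that occur on only one side.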
WordCount in Java
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

DataSet<String> text = env.readTextFile(textInput);

DataSet<Tuple2<String, Integer>> counts = text
    .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String term : line.split("\\W+")) {
                if (!term.isEmpty()) {
                    out.collect(new Tuple2<>(term, 1));
                }
            }
        }
    })
    .groupBy(0)
    .sum(1);

env.execute();
WordCount in Scala
val env = ExecutionEnvironment.getExecutionEnvironment

val input = env.readTextFile(textInput)

val counts = input
  .flatMap { line => line.split("\\W+") }
  .filter { term => term.nonEmpty }
  .map { term => (term, 1) }
  .groupBy(0)
  .sum(1)

env.execute()
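What this dataflow computes can be checked without a cluster. The following is a plain-JDK sketch mirroring the flatMap → filter → map → groupBy/sum pipeline (the `WordCountSketch` helper is hypothetical, not part of Flink):

```java
import java.util.*;

// Plain-JDK mirror of the WordCount dataflow:
// flatMap (tokenize) -> filter (drop empties) -> map to (term, 1) -> groupBy(0).sum(1).
public class WordCountSketch {
    public static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String term : line.split("\\W+")) {   // flatMap
                if (term.isEmpty()) continue;           // filter
                counts.merge(term, 1, Integer::sum);    // groupBy + sum
            }
        }
        return counts;
    }
}
```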
Long operator pipelines: a single Flink program can chain many operators over several DataSets into one dataflow.
Beyond key/value pairs

DataSet<Page> pages = ...;
DataSet<Impression> impressions = ...;

// custom data types
class Page {
    public String url;
    public String topic;
}

class Impression {
    public String url;
    public long count;
}

Flink’s optimizer
• Inspired by the optimizers of parallel database systems: cost models and reasoning about interesting properties
• Physical optimization follows a cost-based approach:
  – selects the data shipping strategy (forward, partition, broadcast)
  – selects the local execution strategy (sort-merge join vs. hash join)
  – keeps track of interesting properties such as sorting, grouping, and partitioning
• Optimizing Flink programs is harder than the relational case:
  – operator semantics are not fully specified due to UDFs
  – unknown UDFs complicate estimating intermediate result sizes
  – no pre-defined schema is present
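As a toy illustration of a cost-based shipping choice for a join: the numbers and formulas below are simplified assumptions for illustration, not Flink's actual cost model:

```java
// Toy cost model for choosing a shipping strategy for a join of R and S
// across n workers. Simplified assumptions:
// - repartition both inputs: ship roughly |R| + |S| bytes over the network
// - broadcast the smaller input: ship |smaller| * n bytes, larger input stays put
public class ShippingChoice {
    public static String choose(long sizeR, long sizeS, int n) {
        long repartitionCost = sizeR + sizeS;
        long broadcastCost = Math.min(sizeR, sizeS) * n;
        return broadcastCost < repartitionCost ? "broadcast" : "partition";
    }
}
```

With a tiny build side, broadcasting wins; once both inputs are large, repartitioning is cheaper – the same trade-off the optimizer reasons about with its estimated result sizes.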
Optimization example
case class Order(id: Int, priority: Int, ...)
case class Item(id: Int, price: Double, ...)
case class PricedOrder(id: Int, priority: Int, price: Double)

val orders = DataSource(...)
val items = DataSource(...)
val filtered = orders filter { ... }
val prio = filtered join items where { _.id } isEqualTo { _.id } map
  { (o, li) => PricedOrder(o.id, o.priority, li.price) }
val sales = prio groupBy { p => (p.id, p.priority) } aggregate ({ _.price }, SUM)
(Figure: two alternative physical plans for this program over the Orders and Items inputs, both filtering Orders, then partitioning on the join key (0) and joining. They differ in where the sort is placed: one sorts on (0) for the join and re-sorts on (0,1) for the grouping/aggregation; the other sorts on (0,1) up front so the Grp/Agg can reuse the join’s partitioning and sort order.)

Memory management
(Figure: the JVM heap is divided into the unmanaged heap for user objects such as `public class WC { public String word; public int count; }`, the Flink-managed heap – a pool of fixed-size memory pages holding serialized data – and the network buffers.)

• Flink manages its own memory
• User data is stored in serialized byte arrays
• In-memory caching and data processing happen in a dedicated memory fraction
• Never breaks the JVM heap
• Very efficient disk spilling and network transfers

Built-in vs. driver-based iterations
Loop outside the system, in the driver program (client): the iterative program looks like many independent jobs (step, step, step, …).

Dataflows with feedback edges (e.g., map/join/reduce with a feedback edge): the system is iteration-aware and can optimize the job.
“Iterate” operator
(Figure: the initial partial solution enters the Iterate operator; the step function X → Y – which may read other data sets – produces a new partial solution that replaces the previous one; after the last iteration, the partial solution becomes the iteration result.)
• Built-in operator to support looping over data
• Applies the step function to the partial solution until convergence
• The step function can be an arbitrary Flink program
• Convergence via a fixed number of iterations or a custom convergence criterion
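These semantics can be sketched as a driver-side loop. This is illustrative plain Java (made-up `IterateSketch` helper), not the Flink operator – the point of Flink's Iterate is precisely that the loop runs inside the dataflow instead of the driver:

```java
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Sketch of "Iterate" semantics: apply a step function to the partial solution
// until a convergence criterion holds or a fixed iteration count is reached.
public class IterateSketch {
    public static <S> S iterate(S initial, UnaryOperator<S> step,
                                Predicate<S> converged, int maxIterations) {
        S solution = initial;
        for (int i = 0; i < maxIterations && !converged.test(solution); i++) {
            solution = step.apply(solution);   // new partial solution replaces the old one
        }
        return solution;
    }
}
```

Usage example: Newton iteration for the square root of 2, with a custom convergence criterion as the stopping condition.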
“Delta Iterate” operator
(Figure: the delta iteration keeps two data sets – a workset and a partial solution. Each step X → Y computes, from the current workset, the initial workset A/B, and other data sets, a new workset and a delta set; the deltas are merged into the partial solution, which the next iteration reads again. After the last iteration, the partial solution is the result.)
• Computes the next workset and the changes to the partial solution until the workset is empty
• Generalizes the vertex-centric computing of Pregel and GraphLab
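A plain-Java sketch of the delta-iterate semantics, using connected components (min-label propagation) as the example. Illustrative only – the `DeltaIterateSketch` helper is made up and this is not the Flink API:

```java
import java.util.*;

// Sketch of "Delta Iterate" semantics: keep a solution set and a workset,
// recompute only for elements in the workset, merge the resulting deltas into
// the solution, and stop when the workset is empty.
public class DeltaIterateSketch {
    // graph: undirected adjacency lists; returns the component label per vertex
    public static Map<Integer, Integer> components(Map<Integer, List<Integer>> graph) {
        Map<Integer, Integer> solution = new HashMap<>();        // vertex -> min label
        for (Integer v : graph.keySet()) solution.put(v, v);     // initial solution
        Deque<Integer> workset = new ArrayDeque<>(graph.keySet()); // initial workset
        while (!workset.isEmpty()) {                             // until workset empty
            int v = workset.poll();
            int label = solution.get(v);
            for (int nb : graph.get(v)) {
                if (solution.get(nb) > label) {   // delta: neighbor's label improves
                    solution.put(nb, label);      // merge delta into solution
                    workset.add(nb);              // changed vertex joins next workset
                }
            }
        }
        return solution;
    }
}
```

Only vertices whose label changed re-enter the workset, so work shrinks as the computation converges – the property that makes the delta iteration much cheaper than re-processing the full solution every round.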
ReCap: What is Apache Flink?

Apache Flink is an open source platform for scalable batch and stream data processing.
• The core of Flink is a distributed streaming dataflow engine
• It executes dataflows in parallel on clusters
• It provides a reliable foundation for various workloads
• The DataSet and DataStream programming abstractions are the foundation for user programs and higher layers

http://flink.apache.org

Working on and with Apache Flink
• Flink homepage https://flink.apache.org
• Flink Mailing Lists https://flink.apache.org/community.html#mailing-lists
• Flink Meetup in Berlin http://www.meetup.com/de/Apache-Flink-Meetup/

Flink community
(Chart: number of unique contributor ids by git commits, growing steadily between Aug 2010 and Jul 2015; y-axis scale 0–120.)

Evolution of Big Data Platforms
1G: Relational Databases
2G: Hadoop – scale-out, Map/Reduce, UDFs
3G: Spark – in-memory performance and improved programming model
4G: Flink – in-memory + out-of-core performance, declarativity, optimization, iterative algorithms, streaming/lambda
"The cards are dealt anew!" (Big Data Value Association Summit, Madrid, June 17–19, 2015)
https://medium.com/chasing-buzzwords/is-apache-flink-europes-wild-card-into-the-big-data-race-a189fcf27c4c

Forbes on Apache Flink:
• "[…] Flink, which is also a top-level project of the Apache Software Foundation, has just recently begun to attract many of the same admiring comments directed Spark’s way 12–18 months ago. Despite sound technical credentials, ongoing development, big investments, and today’s high-profile endorsement from IBM, it would be unwise (and implausible) to crown Spark as the winner just yet. […]"
http://www.forbes.com/sites/paulmiller/2015/06/15/ibm-backs-apache-spark-for-big-data-analytics/
• http://www.infoworld.com/article/2919602/hadoop/flink-hadoops-new-contender-for-mapreduce-spark.html
• http://www.datanami.com/2015/06/12/8-new-big-data-projects-to-watch/

Flink Forward:
• Two-day developer conference with in-depth talks from
  – developers and contributors
  – industry and research users
  – related projects
• Flink training sessions (in parallel): system introduction, real-time stream processing, APIs on top
• Flink Forward registration & call for abstracts is open now at http://flink-forward.org/