Technology Workshop, FZI Karlsruhe, June 22, 2015

Introduction to Apache Flink

Christoph Boden | Deputy Project Coordinator, Berlin Big Data Center (BBDC) | TU Berlin DIMA

http://bbdc.berlin

Introducing the "Data Scientist" – "Jack of All Trades!"

[Slide: word cloud placing "Data Science" at the intersection of three skill sets: Domain Expertise (e.g., Industry 4.0, Medicine, Physics, Engineering, Energy, Logistics); Data Analysis (mathematical programming, linear algebra, statistics, regression, stochastic gradient descent, error estimation, active sampling, Monte Carlo, sketches, hashing, convergence, curse of dimensionality, iterative algorithms); and Data Management (relational algebra/SQL, data warehouse/OLAP, NF²/XQuery, resource management, hardware adaptation, memory management and hierarchy, parallelization, query optimization, indexing, data flow, control flow, real-time, compilers, language decoupling). Bottom line: Data Analysis + Data Management = X]

[Slide: the same skills recast as requirements on "Technology X": on the analysis side, think ML algorithms in a scalable way (declarative representation of algorithms such as SVMs and GPs, feature engineering, regression, error estimation, active sampling, Monte Carlo, sketches/hashing, convergence, curse of dimensionality, automatic adaptation); on the systems side, scalable processing of iterative algorithms (scalability, fault tolerance, isolation, resource management, hardware adaptation, memory management and hierarchy, parallelization, query optimization, indexing, dataflow and control flow, compilers). Goal: data analysis without system programming!]

[Slide: the Berlin Big Data Center (BBDC) consortium and principal investigators]

• Technische Universität Berlin, Data Analytics Lab / DFKI: Volker Markl (Data Management), Klaus-Robert Müller (Machine Learning), Anja Feldmann (Computer Networks), Odej Kao (Distributed Systems), Thomas Wiegand (Video Mining), Hans Uszkoreit (Language Technology)
• Beuth Hochschule: Stefan Edlich (Software Engineering)
• ZIB Berlin: Christof Schütte (Information-based Medicine), Alexander Reinefeld (File Systems, Supercomputing)
• Fritz-Haber-Institut, Max-Planck-Gesellschaft: Matthias Scheffler (Material Science)

http://bbdc.berlin

X = Big Data Analytics – System Programming! ("What", not "How")

• Larger human base of "data scientists"
• Reduction of "human" latencies
• Cost reduction

[Slide: two columns. The data analyst gives a declarative description of the "What": analysis of "data in motion", iterative algorithms in a scalable way, multimodal data, consistent analysis of intermediate results, numerical stability, dataflow and control flow with user-defined functions, iterations and distributed state. Technology X handles the "How", the state of the art in scalable data analysis (e.g. Hadoop, MPI): thinking ML algorithms in a scalable way, fault tolerance, automatic optimization, parallelization and hardware adaptation, software-defined networking, scalable algorithms and debugging.]

Application Examples: Technology Drivers and Validation

[Slide: three application examples serve as technology drivers and for validation: a marketplace for information (economics-based), health (society-based), and material science (science-based). Associated technical challenges include hierarchical data flows, multimodal data integration (video, images, text), text analysis, windowing, numerical simulation data, and numerical stability.]

Open source data infrastructure

• Applications: Hive, Cascading, Mahout, Pig, …
• Data processing engines: MapReduce, Flink, Storm, Tez
• App and resource management: Yarn, Mesos
• Storage, streams: HDFS, HBase, Kafka, …

Engine paradigms & systems

• MapReduce (OSDI'04)
• Apache Tez; Nephele (EuroSys'07)
• PACTs → Apache Flink (SOCC'10, VLDB'12)
• RDDs → Apache Spark (HotCloud'10, NSDI'12)

Engine comparison

• MapReduce – API: MapReduce on k/v pairs; paradigm: MapReduce; optimization: none; execution: batch with sorting
• Spark – API: transformations on k/v pair collections; paradigm: RDD; optimization: of SQL queries; execution: batch with memory pinning
• Flink – API: iterative transformations on collections; paradigm: cyclic dataflows; optimization: in all APIs; execution: stream with out-of-core algorithms

APACHE FLINK

[Slide: the Flink stack]

• Libraries: Flink ML (machine learning), Gelly (graph processing), Table API (batch & streaming)
• Core APIs: DataSet API (batch processing), DataStream API (stream processing)
• Runtime: distributed streaming dataflow
• Deploy: local (single JVM), cluster (standalone, YARN), cloud (GCE, EC2)

An open source platform for scalable batch and stream data processing.

Data sets and operators

[Slide: a Flink program is a dataflow of data sets and operators: DataSet A → Operator X → DataSet B → Operator Y → DataSet C. In parallel execution each operator runs as parallel instances on partitions of the data, e.g. A(1) → X → B(1) → Y → C(1) and A(2) → X → B(2) → Y → C(2).]

Rich operator and functionality set

Map, Reduce, Join, CoGroup, Union, Iterate, Delta Iterate, Filter, FlatMap, GroupReduce, Project, Aggregate, Distinct, Vertex-Update, Accumulators

[Slide: an example job graph combining these operators: two sources feed a map and a reduce, a join merges the branches inside an iteration, followed by a reduce and a sink.]

Base operators

• Map and Reduce are single-input operators over one data set D: Map transforms element-wise, Reduce combines group-wise.
• Cross, Join, and CoGroup are two-input operators over a left input L and a right input R: Cross builds the Cartesian product, Join pairs elements with matching keys, and CoGroup hands all elements of both inputs that share a key to the user function together.
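To make the two-input operators concrete, here is a small illustrative sketch in Scala against the DataSet API (the data sets `left` and `right` and all values are made up for illustration):

import org.apache.flink.api.scala._
import org.apache.flink.util.Collector

val env = ExecutionEnvironment.getExecutionEnvironment

val left: DataSet[(Int, String)] = env.fromElements((1, "a"), (2, "b"))
val right: DataSet[(Int, Double)] = env.fromElements((1, 0.5), (3, 1.5))

// Cross: full Cartesian product of both inputs (4 pairs here)
val crossed = left.cross(right)

// Join: only pairs whose keys match -> ((1,"a"), (1,0.5))
val joined = left.join(right).where(0).equalTo(0)

// CoGroup: all elements of both inputs sharing a key are handed to the
// function together; a key present on only one side still produces a
// call, with the other iterator empty
val coGrouped = left.coGroup(right).where(0).equalTo(0) {
  (ls, rs, out: Collector[(Int, Int, Int)]) =>
    val l = ls.toSeq
    val r = rs.toSeq
    val key = (l.map(_._1) ++ r.map(_._1)).head
    out.collect((key, l.size, r.size))  // key, #left, #right
}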

WordCount in Java

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

DataSet<String> text = env.readTextFile(input);

DataSet<Tuple2<String, Integer>> counts = text
    .map(l -> l.split("\\W+"))
    .flatMap((String[] tokens, Collector<Tuple2<String, Integer>> out) -> {
        Arrays.stream(tokens)
            .filter(t -> t.length() > 0)
            .forEach(t -> out.collect(new Tuple2<>(t, 1)));
    })
    .groupBy(0)
    .sum(1);

env.execute("Word Count Example");

WordCount in Scala

val env = ExecutionEnvironment.getExecutionEnvironment

val input = env.readTextFile(textInput)

val counts = input
  .flatMap { line => line.split("\\W+") }
  .filter { term => term.nonEmpty }
  .map { term => (term, 1) }
  .groupBy(0)
  .sum(1)

env.execute()

Long operator pipelines

DataSet<...> large = env.readCsv(...);
DataSet<...> medium = env.readCsv(...);
DataSet<...> small = env.readCsv(...);

DataSet<...> joined1 = large.join(medium)
    .where(3).equalTo(1)
    .with(new JoinFunction() { ... });

DataSet<...> joined2 = small.join(joined1)
    .where(0).equalTo(2)
    .with(new JoinFunction() { ... });

DataSet<...> result = joined2.groupBy(3).max(2);

Beyond Key/Value Pairs

// custom data types
class Page {
    public String url;
    public String topic;
}

class Impression {
    public String url;
    public long count;
}

DataSet<Page> pages = ...;
DataSet<Impression> impressions = ...;

DataSet<Impression> aggregated = impressions
    .groupBy("url")
    .sum("count");

pages.join(impressions).where("url").equalTo("url");

Flink's optimizer

• Inspired by the optimizers of parallel database systems
  – cost models and reasoning about interesting properties

• Physical optimization follows a cost-based approach
  – selects the data shipping strategy (forward, partition, broadcast)
  – selects the local execution strategy (sort join / hash join)
  – keeps track of interesting properties such as sorting, grouping and partitioning

• Optimization of Flink programs is more difficult than in the relational case:
  – no fully specified operator semantics, due to UDFs
  – unknown UDFs complicate estimating intermediate result sizes
  – no pre-defined schema is present
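Because UDFs hide semantics from the optimizer, the DataSet API also lets the user volunteer size information. A minimal sketch in Scala, assuming a small dimension table `small` and a large fact table `big` (both hypothetical, as is the HDFS path):

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment

val small: DataSet[(Int, String)] = env.fromElements((1, "a"), (2, "b"))
val big: DataSet[(Int, Double)] =
  env.readCsvFile[(Int, Double)]("hdfs:///path/to/big")  // hypothetical input

// joinWithTiny hints that `small` fits in memory, steering the optimizer
// toward broadcasting it instead of repartitioning `big`
val joined = big.joinWithTiny(small).where(0).equalTo(0)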

Optimization example

case class Order(id: Int, priority: Int, ...)
case class Item(id: Int, price: Double, ...)
case class PricedOrder(id: Int, priority: Int, price: Double)

val orders = DataSource(...)
val items = DataSource(...)

val filtered = orders filter { ... }
val prio = filtered join items where { _.id } isEqualTo { _.id } map { (o, li) => PricedOrder(o.id, o.priority, li.price) }
val sales = prio groupBy { p => (p.id, p.priority) } aggregate ({ _.price }, SUM)

[Slide: two candidate execution plans over the Orders and Items inputs. Both filter Orders and partition both inputs on field 0 for the join (Join (0) = (0)); the better plan sorts on (0,1) instead of (0), so the grouping/aggregation (Grp/Agg on (0,1)) can reuse the join's partitioning and sort order instead of reshuffling.]

Memory management

[Slide: the TaskManager JVM heap is divided into an unmanaged heap holding user objects (e.g. public class WC { public String word; public int count; }), a Flink-managed heap of fixed-size memory pages, and network buffers drawn from the same pool of pages.]

• Flink manages its own memory
• User data is stored in serialized byte arrays
• In-memory caching and data processing happen in a dedicated memory fraction
• Never breaks the JVM heap

• Very efficient disk spilling and network transfers
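The sizes of these memory regions are configurable. A hedged sketch of the relevant flink-conf.yaml keys (key names as in the 0.9-era configuration; the values are arbitrary examples, not recommendations):

taskmanager.heap.mb: 4096                  # JVM heap per TaskManager
taskmanager.memory.fraction: 0.7           # share of the heap Flink manages itself
taskmanager.network.numberOfBuffers: 2048  # pool backing network transfers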

Built-in vs. driver-based iterations

• Loop outside the system, in the driver program: each step is submitted by the client separately, so the iterative program looks like many independent jobs.
• Dataflows with feedback edges (e.g. a map/join/reduce dataflow with a back edge): the system is iteration-aware and can optimize the job.

“Iterate” operator

[Slide: the initial solution enters the iteration; a step function (an arbitrary dataflow X → Y, possibly reading other data sets) produces a new partial solution that replaces the previous one; after the final iteration the partial solution becomes the iteration result.]

• Built-in operator to support looping over data
• Applies the step function to the partial solution until convergence
• The step function can be an arbitrary Flink program
• Convergence via a fixed number of iterations or a custom convergence criterion
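As a concrete illustration, here is a minimal Scala sketch in the spirit of the canonical Flink example of estimating π with Monte Carlo sampling (the bound of 10000 iterations is arbitrary):

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment

// Start with a single element holding the running hit count
val initial = env.fromElements(0)

// Each pass samples one random point and adds 1 if it falls inside
// the unit quarter-circle; the loop runs a fixed 10000 times
val count = initial.iterate(10000) { iterationInput =>
  iterationInput.map { i =>
    val x = Math.random()
    val y = Math.random()
    i + (if (x * x + y * y < 1) 1 else 0)
  }
}

// hits / samples ≈ π/4
val pi = count.map { c => c / 10000.0 * 4 }
pi.print()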

“Delta Iterate” operator

[Slide: the delta iteration keeps two data sets. The step function (A → B) consumes the current workset and the partial solution (solution set), possibly reading other data sets, and produces the next workset plus a delta set; the deltas are merged into the solution set, which becomes the iteration result once the iteration terminates.]

• Computes the next workset and changes to the partial solution until the workset is empty
• Generalizes the vertex-centric computing model of Pregel and GraphLab
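A minimal Scala sketch of a delta iteration, in the spirit of connected components (the `vertices` and `edges` data sets and the bound of 100 iterations are made up for illustration; each vertex repeatedly adopts the smallest component ID it hears from a neighbour):

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment

// (vertexId, componentId) pairs; initially every vertex is its own component
val vertices: DataSet[(Long, Long)] = env.fromElements((1L, 1L), (2L, 2L), (3L, 3L))
val edges: DataSet[(Long, Long)] = env.fromElements((1L, 2L), (2L, 3L))

// Solution set and initial workset are both `vertices`; key is field 0
val components = vertices.iterateDelta(vertices, 100, Array(0)) {
  (solution, workset) =>
    // component IDs each vertex hears from its neighbours
    val candidates = workset
      .join(edges).where(0).equalTo(0)
      .map { case ((_, comp), (_, dst)) => (dst, comp) }
      .groupBy(0).min(1)
    // keep only candidates that improve on the current solution entry
    val deltas = candidates
      .join(solution).where(0).equalTo(0)
      .filter { case (candidate, current) => candidate._2 < current._2 }
      .map { case (candidate, _) => candidate }
    // deltas are merged into the solution set and form the next workset;
    // the iteration stops as soon as the workset is empty
    (deltas, deltas)
}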

ReCap: What is Apache Flink?

Apache Flink is an open source platform for scalable batch and stream data processing.

• The core of Flink is a distributed streaming dataflow engine
  – executing dataflows in parallel on clusters
  – providing a reliable foundation for various workloads
• The DataSet and DataStream programming abstractions are the foundation for user programs and higher layers

http://flink.apache.org

Working on and with Apache Flink

• Flink homepage https://flink.apache.org

• Flink Mailing Lists https://flink.apache.org/community.html#mailing-lists

• Flink Meetup in Berlin http://www.meetup.com/de/Apache-Flink-Meetup/

Flink community

[Chart: number of unique contributor IDs by git commits, Aug 2010 – Jul 2015, growing from a handful of contributors to roughly 120.]

Evolution of Big Data Platforms

• 1G: Relational databases
• 2G: Hadoop – scale-out, Map/Reduce, UDFs
• 3G: Spark – in-memory performance and an improved programming model
• 4G: Flink – in-memory + out-of-core performance, declarativity, optimization, iterative algorithms, streaming/Lambda

Big Data Value Association Summit, Madrid, June 17–19, 2015

The cards are dealt anew!
https://medium.com/chasing-buzzwords/is-apache-flink-europes-wild-card-into-the-big-data-race-a189fcf27c4c

Forbes on Apache Flink:

• "[…] Flink, which is also a top-level project of the Apache Software Foundation, has just recently begun to attract many of the same admiring comments directed Spark's way 12–18 months ago. Despite sound technical credentials, ongoing development, big investments, and today's high-profile endorsement from IBM, it would be unwise (and implausible) to crown Spark as the winner just yet. […]" http://www.forbes.com/sites/paulmiller/2015/06/15/ibm-backs-apache-spark-for-big-data-analytics/

• http://www.infoworld.com/article/2919602/hadoop/flink‐hadoops‐new‐contender‐for‐ ‐spark.html

• http://www.datanami.com/2015/06/12/8-new-big-data-projects-to-watch/

Flink Forward

• Two-day developer conference with in-depth talks from
  – developers and contributors
  – industry and research users
  – related projects

• Flink training sessions (in parallel)
  – system introduction, real-time stream processing, APIs on top

• Flink Forward registration & call for abstracts is open now at http://flink-forward.org/