Dataflow Programming for Big Data Stream Processing

Dataflow Programming for Big Data Stream Processing Kai-Uwe Sattler Database & Information Systems Group TU Ilmenau www.tu-ilmenau.de/dbis Outline } Big Dynamic Data: Use Cases & Requirements } Foundations of Stream Processing } APIs, Query & Dataflow Languages } Piglet: Language & Compiler } Conclusion & Outlook Quelle: pixabay.com 1 K. Sattler | TU Ilmenau 07.02.17 The 4th Paradigm & Big Data } J. Gray: shift of paradigm in } Big Data = „too big“ for natural sciences traditional methods } 1000 years ago: empirical experiments Velocity } Since 100 years: grand Realtime theories, models, .. Interval } Last decades: Computational Batch Sciences (Simulation) Relations } today: data exploration MB (eSciences) XML PB Text, Images, ... ZB Variety Volume 2 K. Sattler | TU Ilmenau 07.02.17 Dynamic Data in the Big Data Age } Availability of sensors connecting IT and physical world } Enormous volume of data from real-world observations (=„dirty“ data) } Many heterogeneous sources incl. potentially infinite streams } Valid or useful only for a short period of time } Not inspectible by humans anymore 3 K. Sattler | TU Ilmenau 07.02.17 Use Case: Linked Stream Data } Combine and analyze sensor data e.g. from weather stations (e.g. www.weatherground.com) } Link sensor data with metadata (sensors, stations, GeoNames) } SRBench https://www.w3.org/wiki/SRBench Observations Metadata Detect a hurricane, detect a station observing a blizzard, hourly average temperature, .... 4 K. Sattler | TU Ilmenau 07.02.17 Use Case: Traffic Monitoring } Traffic monitoring using } To l l stations, } Induction loops, } smartphones, ... } tasks: } To l l calculation } Speed control } Detection of traffic jam and accidents } Further applications: } Variable toll (based on time, congestion, …) } Dynamic route planning 5 K. Sattler | TU Ilmenau 07.02.17 Use Case: Engineering Data } Find defects in nano structures } online source localization in (e.g. electronic components) human brain } 3D surface measurement (x,y } EEG/MEG sensors, data rate: 600- position, z measure) 5000 Hz } Resolution: nm (100s MB - 10s } Complex analysis pipeline: signal TBs per scan) filtering, decomposition, matrix inversion, time-based folding 6 K. Sattler | TU Ilmenau 07.02.17 Common Concepts Combining Declarative „Stream“ of base operators Specification data items Abstractions for • data flow • execution model • data + operators ⩯ database queries Data flow } Further applications } System monitoring, smart grid, process automation (Industry 4.0), Social media analysis, ... 7 K. Sattler | TU Ilmenau 07.02.17 Outline } Big Dynamic Data: Use Cases & Requirements } Foundations of Stream Processing } Processing Model } Windows } Basic Operators } Scalable Stream Processing } Examples } APIs, Query & Dataflow Languages } Piglet: Language & Compiler } Conclusion & Outlook Quelle: pixabay.com 8 K. Sattler | TU Ilmenau 07.02.17 Processing models } Batch Processing = } Online Processing = Store & Process Continuous Query Processing (CQP) Query DSMS Quer Quer Quer y y y DBMS } Traditional databases } Data Stream Management Systems (DSMS), Stream } MapReduce Processing Engines (SPE) 9 K. Sattler | TU Ilmenau 07.02.17 Stream Processing: Basic Principles } Data stream := < oi, oi+1, oi+2, ... > } Stream element: oi = (data items, timestamp) } Publish subscribe paradigm process(oi) publish() ... oi+1 oi Publisher Subscriber subscribe() state } Challenges } (Theoretically) unbounded stream } Limited resources (memory, CPU time, ...) for state (computation and representation) 10 K. Sattler | TU Ilmenau 07.02.17 Controlling Stream Processing: Punctuations } Punctuations mark the end of a substream } Conceptually, as a predicate that evaluates to false for every element following the punctuation } But represented as data item (=stream element) } Simple form = heart beat } Advanced forms: end of window, end of stream, end of session, end of auction, .... } Processing punctuations: } match(t: Tuple, p: Punctuation) ➝ bool } combine(p: Punctuation, q: Punctuation) ➝ Punctuation 11 K. Sattler | TU Ilmenau 07.02.17 Punctuations: Example ts=04:25 ts=04:20 ts=04:15 ts=04:10 ts > 04:22 p.m. } Allows to } implement e.g. window semantics } Avoid blocking } Optimize processing of aggregates, etc. } Out-of-order processing 12 K. Sattler | TU Ilmenau 07.02.17 Windows on Data Streams } Aggregations, joins etc. would require access to all data elements ➟ blocking operators } Solution: define a window (subset) over a data stream } Different possible semantics: } Time-based: last 5 minutes } Count-based: 100 data items } Data-driven: Web session } Eviction strategies: sliding vs. tumbling vs. landmark 13 K. Sattler | TU Ilmenau 07.02.17 Windowed Aggregation & Joins window aggregate aggregate(_) aggregate } Examples: } Moving average over 5 minutes (sliding window) symmetric } URL access per day (tumbling window) hash join window ... ⨝ join join window } Example: } Correlated observations, e.g. high temperatures within a certain period of time 14 K. Sattler | TU Ilmenau 07.02.17 Window Implementation } report on report: window aggregate } Re-evaluation } Incremental evaluation } Strategies for incremental evaluation } Possible delay in evaluation if no new tuple arrives } Input triggered } Hard to separate window sum: 2 5 1 3 -3 +2 } Allows to implement windows } Negative tuples separately } +2 Avoids delay 2 5 1 3 aggregate } Increases the number of tuples -3 15 K. Sattler | TU Ilmenau 07.02.17 Complex Event Processing (CEP) } Detecting complex patterns of events in a stream of data } Event = stream element; complex event = sequence of events (not necessarily consecutively) } Defined using logical and temporal conditions Data values + combinations Within a given period of time 24°C, Station#1, 13:00 SEQ(A, B, C) WITH 5 min 23°C, Station#2, 13:00 A.Temp > 23°C && 21°C, Station#1, 13:02 B.Station = A.Station && B.Temp < A.Temp && 20°C, Station#1, 13:05 C.Station = A.Station && A.Temp-C.Temp > 3 16 K. Sattler | TU Ilmenau 07.02.17 CEP: Implementation } Composite events constructed e.g. by } SEQ, AND, OR, NEG, ... } SEQ(e1, e2) ➝ (e1, t1) ∧ (e2, t2)∧t1 ≤ t2 ∧ e1,e2 ∊ � } Implemented by constructing a NFA } Example: SEQ(A, B, C) B C 1 A 2 3 4 * * 17 K. Sattler | TU Ilmenau 07.02.17 Scalability: Data-parallel processing } High volume of data, high arrival rates ➟ parallel processing (Multicore CPUs, Cluster, ...) ➟ requires partitioning of the data stream Operator o4o1 o4o3o2o1 o2 Operator Split Merge o 3 Operator Parallel processing 18 K. Sattler | TU Ilmenau 07.02.17 Data parallelism in data streams } Non-trivial problem } Stateful operators (e.g. aggregates, ...) } Operators relying on ordering of tuples (e.g. windows, CEP, ...) } Dynamic load balancing / elasticity in case of changing workloads or skewed data } State migration } Tasks } Partitioning (split + merge) } Safe auto-parallelization } State management + migration 19 K. Sattler | TU Ilmenau 07.02.17 Partitioning } Some possible solutions } Partitioning: hash-based (e.g. consistent hashing), range-based, ... } Operator-specific merge } „semantic “ partitioning SEQ(A, B) WITH B.Id = A.Id && ... [sum(p1), p1=[oi,o2i,...] avg count(p )] 1 CEP pi=f(Id) avg CEP p2=[oi+1,o2i+i,...] @tj: 20 K. Sattler | TU Ilmenau 07.02.17 Fault tolerance } Continuous queries on unbounded data streams require to deal with } Loss of message } Node failures } Guaranteed message delivery: } Redundant processing and/or } Best effort recovery } At least once } Exactly once Publisher Subscriber publish(o) Publisher Subscriber Replica ACK 21 K. Sattler | TU Ilmenau 07.02.17 Fault Tolerance: Standby } Standby Operator Failover Replica } Simple for stateless operators like filters, etc. } But for stateful operators (e.g. aggregates, ...)?? } Hot Standby Multicast Operator Switch Failover Operator 22 K. Sattler | TU Ilmenau 07.02.17 Fault Tolerance: Upstream Backup } Operator buffers data sent to the the next operator until acknowledgement } In case of failure: retransmit } Can be combined with standby/replicas Op1 Op2 ACK buffer 23 K. Sattler | TU Ilmenau 07.02.17 Fault Tolerance: Checkpointing Operator } Checkpointing/ Logging State Log Restore Operator } Basic idea: } Periodically/on changes: saving the state (checkpointing, snapshot) or saving state changes (logging) } In case of failures: recover the state from the log } Challenges: } State management (buffer, routing, processing), overhead } Consistency among multiple operators during recovery 24 K. Sattler | TU Ilmenau 07.02.17 Challenges in Dataflow Compilation } Optimization: finding the best plan for a declaratively specified dataflow } Choosing algorithms (for operators) and data structures (for windows, state representation, ...) } Reordering operators, e.g. pushdown of selection } Auto-parallelization & partitioning of data streams for distributed/parallel processing } Placement of operators at computing nodes } Resource allocation (memory, CPU time, ...) } ... } Extensibility: integrating and optimizing new operators (possible as black box) and new target platforms 25 K. Sattler | TU Ilmenau 07.02.17 Optimization } requires } Knowledge about algebraic properties of operators such as associativity, commutativity ➟ reordering } Transformations and their properties } Cost model: expected processing costs, arrival rates, data distribution, ... } Challenges } Operators beyond standard (relational) operators = user- defined code: functions, operators, ... } Cost model: parameters, statistics for streams 26 K. Sattler | TU Ilmenau 07.02.17

Load more