Dataflow Programming for Big Data Stream Processing

Dataflow Programming for Big Data Stream Processing

Dataflow Programming for Big Data Stream Processing Kai-Uwe Sattler Database & Information Systems Group TU Ilmenau www.tu-ilmenau.de/dbis Outline } Big Dynamic Data: Use Cases & Requirements } Foundations of Stream Processing } APIs, Query & Dataflow Languages } Piglet: Language & Compiler } Conclusion & Outlook Quelle: pixabay.com 1 K. Sattler | TU Ilmenau 07.02.17 The 4th Paradigm & Big Data } J. Gray: shift of paradigm in } Big Data = „too big“ for natural sciences traditional methods } 1000 years ago: empirical experiments Velocity } Since 100 years: grand Realtime theories, models, .. Interval } Last decades: Computational Batch Sciences (Simulation) Relations } today: data exploration MB (eSciences) XML PB Text, Images, ... ZB Variety Volume 2 K. Sattler | TU Ilmenau 07.02.17 Dynamic Data in the Big Data Age } Availability of sensors connecting IT and physical world } Enormous volume of data from real-world observations (=„dirty“ data) } Many heterogeneous sources incl. potentially infinite streams } Valid or useful only for a short period of time } Not inspectible by humans anymore 3 K. Sattler | TU Ilmenau 07.02.17 Use Case: Linked Stream Data } Combine and analyze sensor data e.g. from weather stations (e.g. www.weatherground.com) } Link sensor data with metadata (sensors, stations, GeoNames) } SRBench https://www.w3.org/wiki/SRBench Observations Metadata Detect a hurricane, detect a station observing a blizzard, hourly average temperature, .... 4 K. Sattler | TU Ilmenau 07.02.17 Use Case: Traffic Monitoring } Traffic monitoring using } To l l stations, } Induction loops, } smartphones, ... } tasks: } To l l calculation } Speed control } Detection of traffic jam and accidents } Further applications: } Variable toll (based on time, congestion, …) } Dynamic route planning 5 K. Sattler | TU Ilmenau 07.02.17 Use Case: Engineering Data } Find defects in nano structures } online source localization in (e.g. electronic components) human brain } 3D surface measurement (x,y } EEG/MEG sensors, data rate: 600- position, z measure) 5000 Hz } Resolution: nm (100s MB - 10s } Complex analysis pipeline: signal TBs per scan) filtering, decomposition, matrix inversion, time-based folding 6 K. Sattler | TU Ilmenau 07.02.17 Common Concepts Combining Declarative „Stream“ of base operators Specification data items Abstractions for • data flow • execution model • data + operators ⩯ database queries Data flow } Further applications } System monitoring, smart grid, process automation (Industry 4.0), Social media analysis, ... 7 K. Sattler | TU Ilmenau 07.02.17 Outline } Big Dynamic Data: Use Cases & Requirements } Foundations of Stream Processing } Processing Model } Windows } Basic Operators } Scalable Stream Processing } Examples } APIs, Query & Dataflow Languages } Piglet: Language & Compiler } Conclusion & Outlook Quelle: pixabay.com 8 K. Sattler | TU Ilmenau 07.02.17 Processing models } Batch Processing = } Online Processing = Store & Process Continuous Query Processing (CQP) Query DSMS Quer Quer Quer y y y DBMS } Traditional databases } Data Stream Management Systems (DSMS), Stream } MapReduce Processing Engines (SPE) 9 K. Sattler | TU Ilmenau 07.02.17 Stream Processing: Basic Principles } Data stream := < oi, oi+1, oi+2, ... > } Stream element: oi = (data items, timestamp) } Publish subscribe paradigm process(oi) publish() ... oi+1 oi Publisher Subscriber subscribe() state } Challenges } (Theoretically) unbounded stream } Limited resources (memory, CPU time, ...) for state (computation and representation) 10 K. Sattler | TU Ilmenau 07.02.17 Controlling Stream Processing: Punctuations } Punctuations mark the end of a substream } Conceptually, as a predicate that evaluates to false for every element following the punctuation } But represented as data item (=stream element) } Simple form = heart beat } Advanced forms: end of window, end of stream, end of session, end of auction, .... } Processing punctuations: } match(t: Tuple, p: Punctuation) ➝ bool } combine(p: Punctuation, q: Punctuation) ➝ Punctuation 11 K. Sattler | TU Ilmenau 07.02.17 Punctuations: Example ts=04:25 ts=04:20 ts=04:15 ts=04:10 ts > 04:22 p.m. } Allows to } implement e.g. window semantics } Avoid blocking } Optimize processing of aggregates, etc. } Out-of-order processing 12 K. Sattler | TU Ilmenau 07.02.17 Windows on Data Streams } Aggregations, joins etc. would require access to all data elements ➟ blocking operators } Solution: define a window (subset) over a data stream } Different possible semantics: } Time-based: last 5 minutes } Count-based: 100 data items } Data-driven: Web session } Eviction strategies: sliding vs. tumbling vs. landmark 13 K. Sattler | TU Ilmenau 07.02.17 Windowed Aggregation & Joins window aggregate aggregate(_) aggregate } Examples: } Moving average over 5 minutes (sliding window) symmetric } URL access per day (tumbling window) hash join window ... ⨝ join join window } Example: } Correlated observations, e.g. high temperatures within a certain period of time 14 K. Sattler | TU Ilmenau 07.02.17 Window Implementation } report on report: window aggregate } Re-evaluation } Incremental evaluation } Strategies for incremental evaluation } Possible delay in evaluation if no new tuple arrives } Input triggered } Hard to separate window sum: 2 5 1 3 -3 +2 } Allows to implement windows } Negative tuples separately } +2 Avoids delay 2 5 1 3 aggregate } Increases the number of tuples -3 15 K. Sattler | TU Ilmenau 07.02.17 Complex Event Processing (CEP) } Detecting complex patterns of events in a stream of data } Event = stream element; complex event = sequence of events (not necessarily consecutively) } Defined using logical and temporal conditions Data values + combinations Within a given period of time 24°C, Station#1, 13:00 SEQ(A, B, C) WITH 5 min 23°C, Station#2, 13:00 A.Temp > 23°C && 21°C, Station#1, 13:02 B.Station = A.Station && B.Temp < A.Temp && 20°C, Station#1, 13:05 C.Station = A.Station && A.Temp-C.Temp > 3 16 K. Sattler | TU Ilmenau 07.02.17 CEP: Implementation } Composite events constructed e.g. by } SEQ, AND, OR, NEG, ... } SEQ(e1, e2) ➝ (e1, t1) ∧ (e2, t2)∧t1 ≤ t2 ∧ e1,e2 ∊ � } Implemented by constructing a NFA } Example: SEQ(A, B, C) B C 1 A 2 3 4 * * 17 K. Sattler | TU Ilmenau 07.02.17 Scalability: Data-parallel processing } High volume of data, high arrival rates ➟ parallel processing (Multicore CPUs, Cluster, ...) ➟ requires partitioning of the data stream Operator o4o1 o4o3o2o1 o2 Operator Split Merge o 3 Operator Parallel processing 18 K. Sattler | TU Ilmenau 07.02.17 Data parallelism in data streams } Non-trivial problem } Stateful operators (e.g. aggregates, ...) } Operators relying on ordering of tuples (e.g. windows, CEP, ...) } Dynamic load balancing / elasticity in case of changing workloads or skewed data } State migration } Tasks } Partitioning (split + merge) } Safe auto-parallelization } State management + migration 19 K. Sattler | TU Ilmenau 07.02.17 Partitioning } Some possible solutions } Partitioning: hash-based (e.g. consistent hashing), range-based, ... } Operator-specific merge } „semantic “ partitioning SEQ(A, B) WITH B.Id = A.Id && ... [sum(p1), p1=[oi,o2i,...] avg count(p )] 1 CEP pi=f(Id) avg CEP p2=[oi+1,o2i+i,...] @tj: 20 K. Sattler | TU Ilmenau 07.02.17 Fault tolerance } Continuous queries on unbounded data streams require to deal with } Loss of message } Node failures } Guaranteed message delivery: } Redundant processing and/or } Best effort recovery } At least once } Exactly once Publisher Subscriber publish(o) Publisher Subscriber Replica ACK 21 K. Sattler | TU Ilmenau 07.02.17 Fault Tolerance: Standby } Standby Operator Failover Replica } Simple for stateless operators like filters, etc. } But for stateful operators (e.g. aggregates, ...)?? } Hot Standby Multicast Operator Switch Failover Operator 22 K. Sattler | TU Ilmenau 07.02.17 Fault Tolerance: Upstream Backup } Operator buffers data sent to the the next operator until acknowledgement } In case of failure: retransmit } Can be combined with standby/replicas Op1 Op2 ACK buffer 23 K. Sattler | TU Ilmenau 07.02.17 Fault Tolerance: Checkpointing Operator } Checkpointing/ Logging State Log Restore Operator } Basic idea: } Periodically/on changes: saving the state (checkpointing, snapshot) or saving state changes (logging) } In case of failures: recover the state from the log } Challenges: } State management (buffer, routing, processing), overhead } Consistency among multiple operators during recovery 24 K. Sattler | TU Ilmenau 07.02.17 Challenges in Dataflow Compilation } Optimization: finding the best plan for a declaratively specified dataflow } Choosing algorithms (for operators) and data structures (for windows, state representation, ...) } Reordering operators, e.g. pushdown of selection } Auto-parallelization & partitioning of data streams for distributed/parallel processing } Placement of operators at computing nodes } Resource allocation (memory, CPU time, ...) } ... } Extensibility: integrating and optimizing new operators (possible as black box) and new target platforms 25 K. Sattler | TU Ilmenau 07.02.17 Optimization } requires } Knowledge about algebraic properties of operators such as associativity, commutativity ➟ reordering } Transformations and their properties } Cost model: expected processing costs, arrival rates, data distribution, ... } Challenges } Operators beyond standard (relational) operators = user- defined code: functions, operators, ... } Cost model: parameters, statistics for streams 26 K. Sattler | TU Ilmenau 07.02.17

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    72 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us