
Ananke: A Streaming Framework for Live Forward Provenance

Dimitris Palyvos-Giannas (Chalmers University of Technology, Gothenburg, Sweden), [email protected]
Bastian Havers (Chalmers University of Technology and Volvo Car Corporation, Gothenburg, Sweden), [email protected]
Marina Papatriantafilou (Chalmers University of Technology, Gothenburg, Sweden), [email protected]
Vincenzo Gulisano (Chalmers University of Technology, Gothenburg, Sweden), [email protected]

ABSTRACT

Data streaming enables online monitoring of large and continuous event streams in Cyber-Physical Systems (CPSs). In such scenarios, fine-grained backward provenance tools can connect streaming query results to the source data producing them, allowing analysts to study the dependency/causality of CPS events. While CPS monitoring commonly produces many events, backward provenance does not help prioritize event inspection, since it does not specify whether an event's provenance could still contribute to future results. To cover this gap, we introduce Ananke, a framework that extends any fine-grained backward provenance tool to deliver a live bipartite graph of fine-grained forward provenance. With Ananke, analysts can prioritize the analysis of provenance data based on whether such data is still potentially being processed by the monitoring queries. We prove our solution is correct, discuss multiple implementations, including one leveraging streaming APIs for parallel analysis, and show that Ananke results in small overheads, close to those of existing tools for fine-grained backward provenance.

PVLDB Reference Format: Dimitris Palyvos-Giannas, Bastian Havers, Marina Papatriantafilou, and Vincenzo Gulisano. Ananke: A Streaming Framework for Live Forward Provenance. PVLDB, 14(3): 391-403, 2021. doi:10.14778/3430915.3430928

PVLDB Artifact Availability: The source code, data, and/or other artifacts have been made available at https://github.com/dmpalyvos/ananke.

1 INTRODUCTION

Distributed, large, heterogeneous Cyber-Physical Systems (CPSs) like Smart Grids or Vehicular Networks [22] rely on online analysis applications to monitor device data. In this context, the data streaming paradigm [36] and the DataFlow model [2] enable the inspection of large volumes of continuous data to identify specific patterns [16, 28]. Streaming applications fit CPSs' requirements due to the high-throughput, low-latency, scalable analysis enabled by Stream Processing Engines (SPEs) [1, 4, 5, 8, 31] and their correctness guarantees, which are critical for sensitive analysis.

Figure 1: a) Two sample queries to monitor car location and mean speed (all tuples up to 8:21 are processed), and their b) backward and c) live forward provenance graphs.

Figure 1a shows two streaming applications, or queries, monitoring a vehicular network to spot cars visiting a specific area (Q1) or speeding (Q2). Vehicle reports (timestamp, id, position), or tuples, arrive every 5 minutes; t_v^j is the j-th tuple from car v, and a_q^j the j-th alert from query q. This scenario is our running use-case throughout the paper.

Motivating challenge. CPSs need continuous monitoring for emerging threats or dangerous events [32, 37], which may result in many alerts that analysts are then left to prioritize [15, 34]. For streaming-based analysis, provenance techniques [19, 30], which connect results to their contributing data, are a practical way to inspect data dependencies, since breakpoint-based inspection is not fit for live queries that run in a distributed manner and cannot be paused [14]. Existing provenance tools for traditional databases target backward tracing, to find which source tuples contribute to a result [7, 11, 12, 14, 19] (Figure 1b), and forward tracing, to find which results originate from a source tuple [7, 12]; however, streaming-based tools only exist for backward provenance [19, 30].

The need for live, streaming, forward provenance is multifold: (1) while backward tracing can give assurance on the trustworthiness of end-results [11], forward tracing allows identifying all results linked to specific inputs [12], e.g., to mark all results linked to a privacy-sensitive datapoint (e.g., a picture of a pedestrian, in the context of Vehicular Networks) before such results are analyzed further; (2) live maintenance of the provenance graph avoids data duplication and allows the analysis of provenance data to start safely (e.g., once all sensitive results that could be connected to the aforementioned picture have been marked); (3) a streaming-based forward provenance tool does not require intermediate disk storage (which might be forbidden for pictures taken in public areas) and enables lean dependency and causality analysis of the monitored events. Note that, as shown in our evaluation, providing live, streaming, forward provenance with tools external to the SPE running the monitoring queries incurs significant costs that can be avoided by relying on intra-SPE provenance processing instead.

Contribution. Motivated by the open issues, our key question is: "Can we enrich data streaming frameworks that deliver backward provenance to efficiently provide live, duplicate-free, fine-grained, forward provenance for arbitrarily complex sets of queries?" We answer affirmatively with our contributions:

• We formulate the concrete goals and evaluation metrics of solutions for live, duplicate-free, fine-grained, forward provenance.

• We implement a general framework, Ananke (1), able to ingest backward provenance and deliver an evolving bipartite graph of live, duplicate-free, fine-grained, forward provenance (or simply live provenance) for arbitrary sets of queries. Ananke delivers each result and each source tuple contributing to one or more results exactly once, distinguishing source data that could still contribute to more results from expired source data that cannot.

• Ananke's key idea builds on our insights on forward provenance w.r.t. the backward provenance problem and defines a simple yet efficient approach, enabling specialized-operator-based implementations as well as modular ones that utilize native operators of the underlying SPE. We design and prove the correctness of two streaming-based algorithmic implementations: one targeting the labeling of expired source data as fast as possible, and one showing that the general parallel APIs of SPEs are sufficient to parallelize Ananke's algorithm and thus sustain higher loads of provenance data.

• We conduct a thorough evaluation of our Ananke implementation on top of Apache Flink [8], with real-world use cases and data, and also compare with previous experiments and with an implementation that delivers live forward provenance by relying on tools external to the SPE, for a fair comparison of Ananke's overheads. The implementations used in our evaluation are open-sourced at [18] for reproducibility.

Figure 1c shows Ananke's live provenance assuming both queries have processed all tuples up to time 8:21. Each source and sink tuple appears exactly once in the bipartite graph. Some tuples are labeled with a green check-mark, indicating that they are expired and will not be connected to future results.

Organization: §2 covers preliminary data streaming and provenance concepts. §3 provides the definitions we use and includes a formal problem formulation. §4-§5 cover our contribution, later evaluated in §6. We discuss related work in §7 and conclude in §8.

(1) In Greek mythology, Ananke personifies inevitability, compulsion, and necessity.

2 PRELIMINARIES

2.1 Data Streaming Basics

Like Apache Flink [8] (or simply Flink), Ananke builds on the DataFlow model [2]. Streams are unbounded sequences of tuples. Tuples have two attributes: the metadata φ and the payload π, an array of sub-attributes. The metadata φ carries the timestamp τ and possibly further sub-attributes. To refer to a sub-attribute of t, e.g., τ, we use the notation t.τ. We reference π's i-th sub-attribute as t.π[i] (omitting t when it is clear from the context). In combined notation, a stream tuple is written as ⟨φ, π⟩ = ⟨τ, ..., [π[1], π[2], ...]⟩.

Streaming queries (or simply queries) are composed of Sources, operators, and Sinks. A Source forwards a stream of source tuples (e.g., events measured by a sensor or reported by other applications). Each source stream can be fed to one or more operators, the basic units manipulating tuples. Operators, connected in a Directed Acyclic Graph (DAG), process input tuples and produce output tuples; eventually, sink tuples are delivered to Sinks, which deliver results to end-users or other applications. In our model, we assume each tuple is immutable. Tuples are created by Sources and operators; the latter can also forward or discard tuples.

As source tuples correspond to events, τ is set by the Source to when the event took place, the event time. Operators set the τ of each output tuple according to their semantics, while π is set by user-defined functions. Event time is not continuous but progresses in discrete increments defined by the SPE (e.g., milliseconds). We denote the smallest such increment of an SPE by δ. All major SPEs [3-5, 8] support user-defined operators but also provide native ones: Map, Filter, Aggregate, and Join. Since we make use of such native operators, we provide their formal description in the following for self-containment. However, Ananke provides live provenance without imposing any restriction on the operators of the query. Figure 2 (SPEs' native operators, Source, and Sink) illustrates the native operators, the Source, and the Sink. We begin with stateless operators, which process tuples one-by-one.

A Filter (F) relies on a user-defined filtering condition f to either forward an input tuple, when f holds, or discard it otherwise. A Map (M) uses a user-defined function F_M (to transform an input tuple into n ≥ 1 output tuples) and Σ, the schema of the output tuple payloads. It copies the φ of each input into the outputs.

Differently from stateless operators, stateful ones run their analysis on windows, delimited groups of tuples maintained by the operators. Time windows are defined by their size WS (the length of the window), advance WA (the time difference between the left boundaries of consecutive windows), and offset WO (the alignment of windows relative to a reference time; in Flink, this is the Unix epoch). For example, a window having WS, WA, and WO set to 60, 10, and 5 minutes, respectively, will cover periods [00:05, 01:05), [00:15, 01:15), etc. Consecutive periods covered by a window can overlap when WA < WS. Lastly, the left and right boundaries of a window are inclusive and exclusive, respectively. We say a tuple t falls in a window [l, r) if l ≤ t.τ < r. As windows can overlap, a tuple can fall into one or more windows, as the sketch below illustrates.
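The window semantics above can be made concrete with a small helper. The following plain-Java sketch is our own illustration (not part of Ananke): it lists the windows [l, l + WS) that a tuple with timestamp τ falls into for given WS, WA, and WO.

    import java.util.ArrayList;
    import java.util.List;

    public class WindowSemantics {

        /** Returns the windows [l, l + WS), with l = WO + k * WA for integer k,
         *  such that l <= tau < l + WS, i.e., the windows the tuple falls into. */
        public static List<long[]> windowsOf(long tau, long ws, long wa, long wo) {
            List<long[]> windows = new ArrayList<>();
            // Left boundary of the last window starting at or before tau.
            long last = wo + Math.floorDiv(tau - wo, wa) * wa;
            for (long l = last; l > tau - ws; l -= wa) {
                windows.add(new long[]{l, l + ws});
            }
            return windows;
        }

        public static void main(String[] args) {
            // WS = 60, WA = 10, WO = 5 (minutes): a tuple at minute 60 falls into
            // six windows, from [55, 115) down to [5, 65), matching the text's example.
            for (long[] w : windowsOf(60, 60, 10, 5)) {
                System.out.printf("[%d, %d)%n", w[0], w[1]);
            }
        }
    }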

We now present stateful operators in more detail. An Aggregate (A) is defined by: (1) WS, WA, WO: the window size, advance, and offset; (2) f_K: an optional key-by function to maintain separate (yet aligned) windows for different key-by values; (3) F_A: a function to aggregate the tuples falling in one window into the π attribute of the output tuple created for that window; and (4) Σ: the output tuple's payload schema. When an output tuple is created for a window (and a key-by value, if f_K is defined), we assume its timestamp is set to that window's right boundary [3, 5, 8].

A Join (J) matches tuples from two input streams, R and S. It keeps two windows, one for R and one for S tuples, which share the same values for parameters WS, WA, and WO. Each pair of R and S tuples sharing a common key is matched for every pair of windows covering the same event-time period they fall in. The Join operator relies on the following parameters: (1) WS, WA, WO: the window size, advance, and offset; (2) f_K: a key-by function to maintain separate (yet aligned) pairs of windows for different key-by values; (3) P(t_R, t_S): a predicate for pairs of tuples from the two input streams; (4) F_J: a function to create the π attribute of the output tuple, for each pair of input tuples for which P(t_R, t_S) holds; and (5) Σ: the schema of the output tuple's payload π. Similarly to the Aggregate, when an output tuple is created by a Join, its sub-attribute τ is set to the right boundary of the window.

In the remainder, we (1) differentiate between stateless and stateful operators only if necessary, (2) assume WO = 0 unless otherwise stated, and (3) assume a stream can be multiplexed to many operators.

Figure 3: The DAG of Q1, Q2 from Figure 1. Source tuples contain the reports' timestamp in φ, and the vehicle ID veh and position x, y in π. Source tuples are forwarded to both queries. Tuples arriving at the Sinks correspond to the alerts in Figure 1.

Figure 3 presents Figure 1's queries. Source I emits tuples of schema ⟨τ, [veh, x, y]⟩ (timestamp, vehicle ID, x- and y-coordinates). In Q1, Filter F1 forwards tuples within region A, Aggregate A1 counts each car's reports, and Filter F2 forwards to Sink S1 only tuples with a count higher than 2 (as tuples arrive every 5 minutes, the count can only be 1, 2, or 3). In Q2, Aggregate A2 emits the mean speed of each car within the last 15 minutes. Filter F3 forwards the tuples whose mean speed exceeds 110 km/h to Sink S2. (2)

2.2 Watermarks and Correctness Guarantees

Because of asynchronous, parallel, and distributed execution, stateful operators processing tuples from multiple streams can receive such tuples out-of-order. Hence, receiving a tuple with a timestamp greater than some window's right boundary does not imply that tuples received later could not still contribute to said window. To ensure result correctness for out-of-order stream processing [25], Aggregate and Join rely on watermarks to make this distinction, as suggested by pioneer as well as state-of-the-art SPEs [8, 25]; the definition is paraphrased here:

Definition 2.1. The watermark W_i^ω of operator O_i at a point in wall-clock time ω (3) is the earliest event time a tuple to be processed by O_i can have from time ω on (i.e., t.τ ≥ W_i^ω, ∀t processed from ω on).

Watermarks are created periodically by Sources and propagate as special tuples through the DAG (4). Upon receiving a watermark, an operator stores the watermark's time, updates its watermark to the minimum of the latest watermarks received from each of its input streams, and forwards its watermark downstream. For an Aggregate or Join O_i, every time the watermark advances from W to W', an output tuple is created for each window maintained by O_i that has a right boundary less than or equal to W'. If multiple results are created, they are emitted in event-time order.

2.3 Backward Provenance

Ananke aims at extending frameworks that provide backward provenance (§1). Such frameworks [19, 20, 30] rely on instrumented operators, i.e., wrappers that add extra functionality to operators. Our contribution can extend any streaming framework providing backward provenance, assuming each sink tuple has additional sub-attributes in φ that can be used to retrieve the unique source tuples contributing to it. Such sub-attributes can be pointers to source tuples [30] or IDs identifying source tuples maintained in a dedicated provenance buffer [19, 20]. In the remainder, we rely on a general function get_provenance to retrieve backward provenance.

(2) In Figure 1, given the grid's cell size, cars' mean speed is 120 km/h if covering four cells in three consecutive tuples.
(3) Notice that, from here on, we only differentiate between wall-clock time (or simply time) and event time if such distinction is not clear from the context.
(4) Notice that this assumption about in-band watermarks is not a constraint. Different watermarking schemes that could also be adopted are discussed in, e.g., [25, 26].
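The exact shape of get_provenance depends on the backward-provenance framework being extended. The following minimal Java interface is our own illustration of what Ananke assumes of such a provider, not Ananke's actual API:

    import java.util.List;

    /** What a backward-provenance provider must offer: given a sink tuple,
     *  return the unique source tuples that contributed to it. */
    public interface BackwardProvenance<SinkTuple, SourceTuple> {
        /** E.g., follow GeneaLog-style pointers stored in the sink tuple's
         *  metadata, or look IDs up in a dedicated provenance buffer. */
        List<SourceTuple> getProvenance(SinkTuple sinkTuple);
    }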

3 DEFINITIONS AND GOALS

This section includes the definitions we use to present and prove our contribution's correctness, and a formal statement of our goals.

Definition 3.1. We say a tuple t contributes directly to another tuple t* if an operator produces t* based on the processing of t, and write t → t*. We then say t contributes to t*, and use the notation t ⇝ t*, if t → t' → t'' → ... → t*. Thus, if t → t*, then t ⇝ t*.

From the above, a source tuple t_s from Source s contributes to a sink tuple t_k received by Sink k if there is a directed, topologically-sorted path s, O_1, ..., O_i, ..., O_n, k and a sequence of tuples t_s = t_0, ..., t_{i-1}, t_i, ..., t_n = t_k, such that, for all i = 1, ..., n, t_{i-1} and t_i are input and output tuples of O_i and t_{i-1} → t_i.

Definition 3.2. At time ω, tuple t is active if it can still contribute directly to a tuple produced by an operator. Otherwise, t is inactive. At time ω, tuple t is alive if it is active or if there is at least one active tuple t* such that t ⇝ t*. Otherwise, t is expired.

For instance, if a Map produces a tuple t2 upon ingesting tuple t1, the latter is inactive, but remains alive as long as t2 is being processed downstream (or as long as t2 itself is alive). Note that all sink tuples are expired by definition. Using the above, we define the live provenance to be delivered by Ananke:

Definition 3.3. At time ω in the execution of a set of queries Q, being K_Q the set of Q's Sinks, the live, duplicate-free, bipartite graph forward provenance F(Q, ω) consists of (1) a set of vertices V, containing exactly one vertex for each sink tuple forwarded to K_Q and exactly one vertex for each source tuple contributing to any sink tuple forwarded to K_Q; (2) a set of edges E, each edge connecting a sink tuple's vertex with one of its contributing source tuples' vertices; and (3) a set of "expired" labels L, one for each vertex in V whose tuple is expired.

Notice that if, at time ω, a vertex in V is alive, then more edges connecting that vertex to other vertices could be found later in the execution of Q. In Figure 1c, which shows F(Q, 8:21), new edges could connect the vertices of tuples not marked as expired to vertices referring to sink tuples that have not yet been produced.

Given the preceding definitions, we now formulate our goals and the necessary requirements to reach them (using prefix R- for the latter). For brevity, we refer to vertices containing a source tuple or a sink tuple as source vertices and sink vertices, respectively, and use the expression expired for tuples and vertices interchangeably.

Problem formulation. Being B_Q a set of streams delivering K_Q's sink tuples with backward provenance retrievable via attribute φ, the goal is to continuously deliver F(Q, ω)'s V, E, and L, for increasing values of ω, as one or multiple streams, to meet the following requirements:

(R-V) Each vertex referring to a source or a sink tuple is delivered exactly once, by a tuple ⟨τ_s, [id_s, t_s]⟩ or ⟨τ_k, [id_k, t_k]⟩, respectively, where id_t is a unique ID for the vertex associated to t;
(R-E) each edge between vertices v_s and v_k is delivered exactly once, by a tuple ⟨τ_e, [id_s, id_k]⟩, with τ_e greater than or equal to the timestamps τ of the connected vertices; and
(R-L) an "expired" label is delivered once for each vertex v_t, by a tuple ⟨τ_l, [id_t]⟩, with τ_l ≥ τ_e for each edge e adjacent to v_t.

For the queries in Figure 1a, Figure 4 shows F(Q, 8:16) and the tuples delivered to update it to F(Q, 8:22).

While referring to a set of queries Q and a set of Sinks K_Q for generality, our problem formulation is justified even for a single query with exactly one Source and Sink, because subsequent sink tuples can have overlapping sets of source tuples (as in the example of Figure 1), and each such source tuple still needs to be delivered exactly once and later marked as expired.
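The three delivery shapes in (R-V), (R-E), and (R-L) map naturally onto tagged records. The following Java 17 records are our own illustration of the graph output stream, not Ananke's actual types:

    /** Graph-stream records mirroring the problem formulation:
     *  vertices (R-V), edges (R-E), and "expired" labels (R-L). */
    public sealed interface GraphTuple permits Vertex, Edge, ExpiredLabel {
        long tau(); // event-time timestamp of the graph component
    }
    record Vertex(long tau, String id, Object tuple) implements GraphTuple {}
    record Edge(long tau, String sourceId, String sinkId) implements GraphTuple {}
    record ExpiredLabel(long tau, String id) implements GraphTuple {}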

Performance metrics. For provenance to be practical in real-world applications, its performance overheads need to be small. The efficiency of our solution is evaluated through its overhead on the following metrics:

• Throughput, the number of tuples a query ingests per unit of time.
• Processing Latency, the delay in the production of a sink tuple after all its contributing source tuples have arrived at the query.
• CPU utilization, the percentage of the total CPU time a query utilizes, across all available processors (0-100%).
• Memory consumption, the amount of RAM a query utilizes.

We also introduce the provenance latency metric, to quantify the event time it takes for F(Q, ω)'s components to become available:

• For tuple t's vertex v, it is computed as τ_v − t.τ.
• For an edge e, it is computed as τ_e − min(τ_s, τ_k), where τ_s and τ_k are the timestamps of the vertices connected by the edge.
• For the label l of some vertex v, it is computed as τ_l − τ_v.

Implementation requirements. We want Ananke to be a streaming-based extension (of Q) that delivers the vertices, edges, and labels of F(Q, ω) through its output stream(s). A solution can rely on user-defined or native operators (§2). Being a streaming-based extension of Q, the latter's watermarks are propagated to Ananke's operators, too. For generality, we assume a user-defined operator needs to support two methods (not invoked concurrently by the SPE): on_tuple(t), invoked upon reception of tuple t, and on_watermark(W), invoked when the watermark W is updated.
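As a reference point for §5, this is how the two-callback contract could look in Java; a sketch of ours under the stated assumptions, not Ananke's published API:

    /** The contract §3 assumes of a user-defined operator: the SPE invokes
     *  the two methods, never concurrently. */
    public interface ProvenanceOperator<T> {
        /** Invoked upon reception of tuple t. */
        void onTuple(T tuple);
        /** Invoked when the operator's watermark advances to w. */
        void onWatermark(long w);
    }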

Figure 4: Live provenance graph for Figure 1's queries at 8:16 and 8:22, and the stream of graph tuples received in between.

Figure 5: Sample query showing actual and maximal contributions of a source tuple to downstream tuples.

4 DISCERNING ALIVE AND EXPIRED TUPLES

Live provenance needs to discern alive from expired tuples. Even if sink tuples cannot contribute directly to other tuples, source tuples can be inactive but alive. As we show, alive and expired tuples can be separated using static query attributes and the Sink watermarks.

Figure 5 illustrates how a source tuple's contribution "ripples" through event time for a query composed of Aggregates A_1 and A_2, Map M, and Filter F. A source tuple, t_s, falls into two windows of A_1, which emit two outputs that pass through M and contribute to three separate windows in A_2. Two of the three output tuples of A_2 get dropped by F, and a single tuple arrives at the Sink. When that happens, it is unknown whether t_s will contribute to more sink tuples; for example, it is uncertain whether F will drop the next output of A_2. The figure also explores t_s's hypothetical maximal contributions: we ignore exact window placements and examine the extreme case of tuples always falling at the beginning or the end of windows. We shade all event times that can contain (direct or indirect) contributions of t_s. As shown, the growth of the shaded region only depends on the window sizes of the stateful operators. Tuple t_s can contribute to any tuple inside the shaded region. Thus, if the Sink's watermark falls after that region (W_2 in the example), then t_s is surely expired. U_k denotes the width of the shaded region. With this intuition, we proceed with proving the following theorem:

Theorem 4.1. Given K_Q (Definition 3.3), it is possible to statically compute constants U_k, one for each Sink k ∈ K_Q, so that: if a source tuple t_s has contributed to a sink tuple t_k that arrives at k at time ω or later, then

    t_s.τ ≥ W_k^ω − U_k.    (4.1)

Inversely, if t_s.τ < min_k (W_k^ω − U_k), then t_s is expired and cannot contribute to any sink tuple fed to K_Q at time ω or later.

We move bottom-up towards the proof. First, we study pairs of chained operators, each with one downstream peer, in Lemma 4.1. In Lemma 4.2, we focus on longer operator chains, before examining arbitrary paths between sets of operators in Corollary 4.1 and finally concluding with the proof of Theorem 4.1. Here, we use the term operator in a broad sense, also to refer to Sources and Sinks.

Lemma 4.1. For any operator O_i with downstream operator O_{i+1}, and a tuple t_{i+1} arriving at O_{i+1} at time ω or later, it holds that if an input tuple t_i of O_i contributed to t_{i+1}, then t_i.τ ≥ W_{i+1}^ω − WS_i.

Proof of Lemma 4.1. From the way stateful operators set their output timestamps (§2), we obtain t_i.τ ≥ t_{i+1}.τ − WS_i. (5) Furthermore, from Definition 2.1 we get t_{i+1}.τ ≥ W_{i+1}^ω. Thus, Lemma 4.1 follows immediately; it also holds for a stateless O_i by setting WS_i = 0. □

We now expand the proof to chains of operators.

Lemma 4.2. For any chain of operators O_1, ..., O_n, their downstream operator O_{n+1}, and a tuple t_{n+1} that arrives at O_{n+1} at time ω or later, it holds that if an input tuple t_1 of operator O_1 contributed to t_{n+1}, then t_1.τ ≥ W_{n+1}^ω − Σ_{i=1}^{n} WS_i.

Proof of Lemma 4.2. We begin by showing that:

    ∀t_1, t_{n+1}: t_1 ⇝ t_{n+1} ⇒ t_1.τ ≥ t_{n+1}.τ − Σ_{i=1}^{n} WS_i.    (4.2)

Let us denote by t_i tuples arriving at operator O_i, with t_i → t_{i+1} for i ∈ [1, n]. From the proof of Lemma 4.1, we know that t_1.τ ≥ t_2.τ − WS_1. Plugging this relation again into the first term on the right-hand side of the inequality, we obtain t_1.τ ≥ t_2.τ − WS_1 ≥ t_3.τ − WS_2 − WS_1. Performing this step n times yields Equation 4.2. Given Definition 2.1, t_{n+1}.τ ≥ W_{n+1}^ω; hence:

    ∀t_1, t_{n+1}: t_1 ⇝ t_{n+1} ⇒ t_1.τ ≥ t_{n+1}.τ − Σ_{i=1}^{n} WS_i ≥ W_{n+1}^ω − Σ_{i=1}^{n} WS_i. □

Corollary 4.1. Given a set of operators O = {O_i} connected to operator O_j through a set of paths P, for any tuple t_j fed to O_j at time ω or later, it holds that if an input tuple t_i of O_i contributed to t_j, then t_i.τ ≥ W_j^ω − max_{p ∈ P} D_p, where D_p = Σ_{O_l ∈ p} WS_l.

The above corollary follows directly from Lemma 4.2 and allows us to compute the maximum "delay" between tuples traversing the longest path in a query, enabling us to prove Theorem 4.1.

Proof of Theorem 4.1. To prove Equation 4.1, we apply Corollary 4.1 to any sink tuple t_k arriving at Sink k at time ω or later and any source tuple t_s, which gives t_s.τ ≥ W_k^ω − U_k. The constants U_k = max_{p ∈ P} D_p, with P the set of all paths to Sink k, can be computed statically, based on the attributes of the query graph. □

The following remark, stemming directly from Theorem 4.1, introduces a per-application safety margin for expired tuples.

Remark 4.1. If we define U = max_k U_k, then any source tuple t_s with t_s.τ < min_k W_k^ω − U is expired.

(5) Although the assumption about how timestamps are set covers commonly used SPEs, the analysis also holds for any output timestamp within the window boundaries. If the output tuples of stateful operator O_i have timestamps that are Δ_i from the left boundary l_w of the window, it holds that t_{i+1}.τ ≤ l_w + Δ_i. By definition of the left boundary, for any tuple t_i in the window it is true that t_i.τ ≥ l_w, and the relation becomes t_i.τ ≥ t_{i+1}.τ − Δ_i. Since WS_i is the maximum value of Δ_i, using WS_i always gives correct results (possibly with a higher delay).
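As a worked illustration of Theorem 4.1 and Remark 4.1, the sketch below statically computes D_p per Source-to-Sink path and U as their maximum. The paths and window sizes are our own assumptions, loosely modeled on Figure 1's queries (taking 15-minute windows for both Aggregates); they are not taken from the paper's artifacts.

    import java.util.List;

    public class SafetyMargin {

        /** D_p: sum of the window sizes WS of the stateful operators on a path
         *  (stateless operators contribute 0, cf. Lemma 4.1). */
        static long pathDelay(List<Long> windowSizes) {
            return windowSizes.stream().mapToLong(Long::longValue).sum();
        }

        public static void main(String[] args) {
            // Q1: Source -> F1 (0) -> A1 (WS = 15 min) -> F2 (0) -> Sink S1
            long u1 = pathDelay(List.of(0L, 15L, 0L));
            // Q2: Source -> A2 (WS = 15 min) -> F3 (0) -> Sink S2
            long u2 = pathDelay(List.of(15L, 0L));
            // U = max_k U_k: any source tuple with tau < min_k W_k - U is expired.
            long u = Math.max(u1, u2);
            System.out.println("U = " + u + " minutes");
        }
    }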
5 ALGORITHMIC IMPLEMENTATION

As mentioned in §2, SPEs provide native operators and support user-defined ones. Here we show how Ananke's goals can be met by a user-defined operator (ANK-1, §5.1) or by composing native operators (ANK-N, §5.2). While ANK-1 targets the prompt final labeling of F(Q, ω)'s vertices, ANK-N shows how the APIs for parallel execution commonly provided by SPEs are sufficient to parallelize Ananke's algorithm. We study their trade-offs in §6.

Both in ANK-1 and ANK-N, the ID of t_s's source vertex is based on t_s's attributes. Hence, source tuples with equal attributes refer to the same source vertex. As each sink tuple represents a unique event, it results in a sink vertex with a unique ID. Since each sink tuple can carry each source tuple at most once in its provenance, edges are also unique. We discuss in §5.3 how to extend Ananke to other ID policies. In the following, we make use of Remark 4.1 for distinguishing alive from expired tuples. As mentioned in §3, the set of streams B_Q, delivering backward provenance to Ananke, forwards the required watermarks. For both implementations, we show how they meet the requirements for vertices (R-V), edges (R-E), and labels (R-L) from §3.

5.1 ANK-1: Single User-defined Operator

As introduced in §3, user-defined operators must support two methods: on_tuple and on_watermark. Algorithm 1 covers these methods for ANK-1; methods unique_id() and get_id(t) respectively generate a unique ID and compute the ID of t based on its attributes.

Claim 5.1. A user-defined operator fed B_Q can correctly deliver F(Q, ω) with Algorithm 1's on_tuple and on_watermark methods.

Algorithm 1: ANK-1 algorithmic implementation

    Data: Set S of pairs (τ, id), ordered on τ, and watermark W
    1  Method on_tuple(t_k)
    2      id_k = unique_id();
    3      emit(⟨W, [id_k, t_k]⟩)            // Emit sink vertex
    4      sourceTuples ← get_provenance(t_k);
    5      for t : sourceTuples do
    6          id_s = get_id(t);
    7          if (t.τ, id_s) ∉ S then
    8              emit(⟨W, [id_s, t]⟩)      // Emit source vertex
    9              S ← S ∪ {(t.τ, id_s)}
    10         emit(⟨W, [id_s, id_k]⟩)       // Emit edge
    11     emit(⟨W, [id_k]⟩)                 // Emit sink vertex label

    12 Method on_watermark(W)
    13     for (τ, id_s) ∈ S do
    14         if τ ≥ W − U then
    15             return
    16         emit(⟨W, [id_s]⟩)             // Emit source vertex label
    17         S ← S \ {(τ, id_s)}

Proof. Upon reception of sink tuple t_k from B_Q, ANK-1 emits exactly once the corresponding (1) sink vertex; (2) source vertices, each only if its ID is not stored in the timestamp-sorted set S (i.e., if the corresponding source vertex was not forwarded before), thus meeting requirement (R-V); (3) edges; and (4) the sink vertex label (L1-11). Set S represents ANK-1's "memory" of forwarded source vertices. The emitted tuples carry as timestamp the current watermark value, thus meeting requirements (R-E), as the edge does not precede the vertices, and (R-L) for the sink vertices. Each source vertex id_s in S is purged once the corresponding source tuple is expired, i.e., when W − U is greater than its τ (L12-17). Upon purging of id_s, exactly one "expired" label is generated for the corresponding source vertex, with the current watermark as the timestamp. Since watermarks are strictly increasing, it is guaranteed that each label has a timestamp higher than or equal to that of its source vertex, meeting (R-L) for the source vertex. □

5.2 ANK-N: Native Operator Composition

We now present ANK-N, based on native operators. First, we study the case U > 0. For ease of exposition, we initially rely on two auxiliary stateful operators, Delay (D) and Forward Once (FO), that help meet the requirements, and later show how D's and FO's semantics can be satisfied by native operators. Finally, we cover the case U = 0, where all of Q's operators are stateless.

A Delay (D) operator produces, for each input tuple t with a unique payload π, an output tuple t' as a copy of t with t'.τ = delay(t.τ) := ⌊t.τ/U⌋ · U + 2U and t'.π = t.π, so that U < t'.τ − t.τ ≤ 2U.

A Forward Once (FO) operator guarantees that, whether one or more tuples sharing the same ID sub-attribute are fed to it, only the earliest such tuple is output, with its payload unchanged but its timestamp delayed by delay() as in D. After such a tuple is output, FO produces nothing for the subsequent input tuples with that ID until a period of U has passed in the input stream. As identical source tuples that appear at different times in B_Q are not spaced apart further than U (Remark 4.1) and have the same ID, FO will output unique source tuples exactly once.

Figure 6: Overview of ANK-N. Algorithm 2 shows F_{M_P}.
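A minimal plain-Java rendering of Algorithm 1 can be sketched as follows. This is our own sketch: the emit target, ID helpers, and tuple types are simplified placeholders, not Ananke's classes.

    import java.util.*;
    import java.util.function.Consumer;

    /** Plain-Java sketch of Algorithm 1 (ANK-1). */
    public class Ank1<SinkT, SrcT> {

        interface ProvenanceFn<K, V> {
            List<V> getProvenance(K sinkTuple); // get_provenance(t_k)
            String idOf(V sourceTuple);         // get_id(t): ID from tuple attributes
            long tauOf(V sourceTuple);          // t.tau
        }

        private final long u;                   // safety margin U (Remark 4.1)
        private final ProvenanceFn<SinkT, SrcT> prov;
        private final Consumer<Object[]> emit;  // stand-in for the output stream
        private final NavigableMap<Long, Set<String>> s = new TreeMap<>(); // set S, ordered on tau
        private long w = Long.MIN_VALUE;        // current watermark W

        public Ank1(long u, ProvenanceFn<SinkT, SrcT> prov, Consumer<Object[]> emit) {
            this.u = u; this.prov = prov; this.emit = emit;
        }

        public void onTuple(SinkT tk) {
            String idK = UUID.randomUUID().toString();          // unique_id()
            emit.accept(new Object[]{w, "K", idK, tk});         // sink vertex
            for (SrcT ts : prov.getProvenance(tk)) {
                String idS = prov.idOf(ts);
                Set<String> ids = s.computeIfAbsent(prov.tauOf(ts), x -> new HashSet<>());
                if (ids.add(idS)) {
                    emit.accept(new Object[]{w, "S", idS, ts}); // source vertex, once
                }
                emit.accept(new Object[]{w, "E", idS, idK});    // edge
            }
            emit.accept(new Object[]{w, "L", idK});             // sink vertex label
        }

        public void onWatermark(long watermark) {
            w = watermark;
            // Purge expired source vertices (tau < W - U), in timestamp order.
            Iterator<Map.Entry<Long, Set<String>>> it = s.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<Long, Set<String>> e = it.next();
                if (e.getKey() >= w - u) {
                    return;
                }
                for (String idS : e.getValue()) {
                    emit.accept(new Object[]{w, "L", idS});     // source vertex label
                }
                it.remove();
            }
        }
    }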

ANK-N overview: Using D, FO, and native operators, we construct the DAG of Figure 6 to meet the requirements from §3, with B_Q as input. First, we outline the main idea. Let us consider sink tuple t_k from B_Q, and follow its path through the DAG. Upon processing t_k, M_P produces a sink vertex that carries a copy of t_k, a unique id_k, and the character "K" as sub-attributes of its payload. M_P also produces, for all source tuples in t_k's provenance, a source vertex with character "S" (carrying an ID based on the source tuple attributes) as well as an edge. Each edge carries the character "E" and the IDs of the source and sink tuples that it connects. In the proof of the following claim, we continue tracing the paths of "K", "E", and "S" tuples and show that all requirements for delivering a live provenance graph are met.

Algorithm 2: Map function F_M of M_P

    1  def out = F_{M_P}(t_k):
    2      id_k = unique_id()
    3      out.add(["K", t_k, id_k])        // Add sink vertex
    4      for t_s in get_provenance(t_k) do
    5          id_s = get_id(t_s)
    6          out.add(["S", t_s, id_s])    // Add source vertex
    7          out.add(["E", (id_s, id_k)]) // Add edge
    8      return out                       // Produce tuples

Claim 5.2. The DAG in Figure 6, using the D and FO operators, as well as native ones with the mapping function F_M defined in Algorithm 2, once fed B_Q, correctly delivers F(Q, ω).

Proof. We first prove that for each source tuple t_s, its vertex, edges, and label are delivered correctly:

(1) Ensuring t_s's vertex is created once. t_s can appear multiple times, as provenance of multiple sink tuples. Based on Theorem 4.1, after contributing to sink tuple t_k (timestamped τ_k), t_s cannot contribute to later sink tuples timestamped ≥ τ_k + U. Thus, no pair of source tuples with the same ID can be farther apart than U. As source vertices (marked with "S") are forwarded to FO, which is defined to output each source vertex with a given ID exactly once with τ_s' = delay(τ_s), (R-V) is met for source vertices.

(2) Ordering t_s's edges behind the vertex. For every t_s in the provenance of t_k, M_P produces the connecting edge "E". For these edges to come after t_s's vertex, they are forwarded by F_1 to D_1, which outputs copies of each edge with τ_e = delay(τ_k), meeting (R-E).

(3) Producing t_s's label correctly. The latest edge e' involving t_s could be produced by M_P at event time τ_s + U − δ (for δ > 0), according to Remark 4.1. As e' will be delayed to τ_{e'} = delay(τ_s + U − δ), the source label tuple must be delayed beyond τ_{e'} to meet (R-L). From FO, the source vertex (which has been delayed already) is multiplexed to D_1, delayed again, and mapped by M_L to a label tuple with timestamp τ_{s,l} = delay(delay(τ_s)) = delay(τ_s) + 2U, which is strictly greater than τ_{e'}, meeting (R-L) for source tuples.

Thus, the components involving t_s meet the requirements. We now focus on sink tuple t_k's vertex, edges, and label:

(1) Ensuring t_k's vertex is created once. F_2 forwards the single instance of t_k's vertex (timestamp τ_k), meeting (R-V) for sink tuples.

(2) Ordering t_k's edges behind the vertex. As explained in (2) above, edges involving t_k are delayed to τ_e = delay(τ_k), and (R-E) is met.

(3) Producing t_k's label correctly. The label for t_k's vertex must not have a lower timestamp than any edge connected to t_k's vertex, and these edges are delayed. As the sink vertex "K" is multiplexed from F_2 to D_2 and mapped to a label by M_L, the resulting label has timestamp τ_{k,l} = delay(τ_k) = τ_e, meeting (R-L) for sink tuples.

Thus, the components involving t_k also meet the requirements. Lastly, we show how Aggregate A emits vertices, edges, and labels in order. Each tuple, timestamped τ, falls in one window [l, l + U) of A, and a copy of each tuple is produced by A when A receives a watermark > l + U. As Aggregates emit results in timestamp order (§2), this effectively sorts A's outputs. Thus, ANK-N correctly delivers live provenance, meeting the requirements in §3. □

We now construct D and FO using native operators.

Delay. An Aggregate A, Filter F, and Map M (as in Figure 7) can enforce this operator's semantics. A and F create the delay, while M restores the input tuple's payload, creating a delayed copy of it. A's window size is twice as big as its window advance, and WA = U: any input tuple with a unique payload falls into two windows and contributes directly to two output tuples of A, with timestamps spaced U apart. As each output tuple r of A carries the timestamp of its corresponding input tuple in its payload, the delay induced on the input tuple can be computed as r.τ − r.π[1]. F ensures that this delay is greater than U, which is always the case for exactly one of the two tuples produced by A for each input tuple; the other is delayed at most U and thus discarded. In the extreme case, an input tuple t can be delayed by A by 2U (the window size), namely if t.τ coincides with a window's left boundary. The window advance dictates that the two output tuples produced from t have timestamps spaced U apart; thus, there will also be a tuple delayed by exactly U, which F discards. In all other cases, where t.τ does not coincide with a window's left boundary, this earlier output tuple is delayed even less and is thus also discarded. From this discussion, it is also apparent that the delay for two input tuples t_1, t_2 with t_1.τ = t_2.τ is identical (as both are equally distanced from the left boundaries of the windows they fall in).

Figure 7: Composition of the D operator.

Forward Once. FO ensures that from a group of tuples t_1, ..., t_n with increasing timestamps, a common key (ID), and t_n.τ − t_1.τ < U, exactly one tuple t_out with payload t_1.π is produced. This tuple is delayed from t_1 by at least U and at most 2U. Figure 8 shows how FO can be constructed using two Aggregates A_1 and A_2 and a Join J, to satisfy the required semantics. J's predicate is defined as:

    FOpredicate(r, s) = ((r.τ ≤ s.τ) ∧ (r.π[2] ≤ s.π[2])) ∨ ((s.τ ≤ r.τ) ∧ (s.π[2] ≤ r.π[2])),    (5.1)

where π[2] is the count emitted by the Aggregate operators.

Figure 8: Composition of the FO operator.

Figure 9: Illustration of how two different groups of input tuples (non-primed and primed) are processed by FO's operators. Each group leads to the emission of one tuple.

The delay() arithmetic that both constructions realize is illustrated by the sketch below.
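The following plain-Java sketch (ours, for illustration only) mirrors the Delay composition: it computes the two candidate output timestamps that A's WS = 2U, WA = U windows assign to an input timestamp τ and keeps the one F retains, reproducing delay(τ) = ⌊τ/U⌋ · U + 2U.

    public class DelayDemo {

        /** The right boundaries of the two WS = 2U, WA = U windows that a tuple
         *  with timestamp tau falls into; i.e., A's two output timestamps. */
        static long[] candidates(long tau, long u) {
            long base = Math.floorDiv(tau, u) * u;
            return new long[]{base + u, base + 2 * u};
        }

        /** F keeps only the copy delayed by more than U. */
        static long delay(long tau, long u) {
            for (long out : candidates(tau, u)) {
                if (out - tau > u) {
                    return out;
                }
            }
            throw new AssertionError("exactly one candidate exceeds U");
        }

        public static void main(String[] args) {
            long u = 10;
            // tau = 23: candidates 30 (delay 7, discarded) and 40 (delay 17, kept).
            System.out.println(delay(23, u)); // 40
            // tau = 20 (on a boundary): candidates 30 (delay 10, not > U, discarded)
            // and 40 (delay 20 = 2U, kept), the extreme case discussed above.
            System.out.println(delay(20, u)); // 40
        }
    }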

The group of input tuples to FO is multiplexed to both A_1 and A_2. Since t_n.τ − t_1.τ < U and the windows of A_2 are offset by U relative to A_1's, t_1, ..., t_n will land in either (1) two windows of one Aggregate and one window of the other, or (2) exactly one window in both Aggregates. Figure 9 exemplifies how FO achieves the required exactly-once forwarding for both cases. (6) A_1 and A_2 produce output tuples that carry the common payload of the input tuples and the count n of input tuples per window. Their window alignment guarantees that a tuple produced by A_1 or A_2 lands in exactly two of J's windows. Let us now explain the two different cases in detail:

• If t_1, ..., t_n fall in one A_1 and one A_2 window, output tuples r and s are fed to J. Since r.τ < s.τ and r.π[2] = s.π[2], predicate 5.1 holds. When J emits t_out, it holds that t_1.τ + U < t_out.τ ≤ t_1.τ + 2U.

• If t'_1, ..., t'_n fall in two A_1 windows and one A_2 window, output tuples r'_1, r'_2, and s' are fed to J. Then, two windows of J have one tuple each from A_1 and A_2; however, predicate 5.1 holds only for the earlier of the two windows, in which r' has the lower timestamp and the lower count n. Also in this case, t'_1.τ + U < t'_out.τ ≤ t'_1.τ + 2U.

Thus, in both cases, the input is deduplicated and delayed, and the timestamp of the output tuple t_out/t'_out is given by delay(t_1.τ)/delay(t'_1.τ).

Corner case. If all operators in Q are stateless, then U = 0. This corner case is not covered by the above implementation of ANK-N, as D and FO cannot have WS = 0. If U = 0, each sink tuple and its single provenance source tuple have the same timestamp; thus, source tuples are immediately expired once event time passes beyond their timestamp. Each sink tuple is contributed to by a single source tuple; however, the latter could contribute to several sink tuples. One approach for U = 0 is to replace D and FO with identity Maps. The final Aggregate A will then deduplicate source vertices (and source labels, which could now be duplicated as well), as they share the same payload and fall into the same window.

(6) The case in which t_1, ..., t_n fall in one window for both A_1 and A_2 but A_1's window ends later than A_2's, and the case in which t'_1, ..., t'_n fall in two A_2 windows and one A_1 window, are given by "swapping" A_1 and A_2.

Listing 1: Transparently instrumenting a query

    // Provenance decorators highlighted in blue
    ANK.source(sourceStream)
       .filter(ANK.filter(t -> t.type() == 0 && t.speed() == 0))
       .keyBy(ANK.key(t -> t.getKey()))
       .window(SlidingEventTimeWindows.of(WS, WA))
       .aggregate(ANK.aggregate(new AverageAggregate()))

Figure 10: Performance - Linear Road queries (panels: Rate (t/s), Latency (s), Memory (MB), CPU Utilization (%), for NP, GL, ANK-1, ANK-1/T, ANK-N, ANK-N/T).
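Listing 1 shows Ananke's decorator API, but its exact signatures are not spelled out in the text. The following is therefore a hypothetical sketch of ours of how such a transparent wrapper can work for the filter case: the user predicate sees the plain tuple, while the provenance metadata rides along in a meta-tuple (a Filter forwards or discards tuples, so the metadata passes through untouched).

    import java.util.function.Predicate;

    /** A meta-tuple pairing a query tuple with opaque provenance metadata. */
    class MetaTuple<T> {
        final T payload;
        final Object provenance;
        MetaTuple(T payload, Object provenance) {
            this.payload = payload;
            this.provenance = provenance;
        }
    }

    /** Hypothetical decorator in the spirit of ANK.filter. */
    class AnkDecorators {
        static <T> Predicate<MetaTuple<T>> filter(Predicate<T> userPredicate) {
            return meta -> userPredicate.test(meta.payload);
        }
    }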
As illustrated Ananke’s support for multiple Sinks, we run queries from the same in Listing 1, the developer simply encapsulates each Flink operator function with the appropriate framework decorators. Those decora- Rate (t/s) Latency (s) tors encapsulate the tuples inside special meta-tuples that contain 0.2 the provenance metadata populated according to the semantics of the underlying operator function, constructing the provenance 20000 0.1 +2.5% +2.0% +1.4% -0.2% -2.7% -7.2% -14.1% graph without user intervention. While more fexible, the extra ab- -16.8% straction layers of this technique can increase the data serialization 0 0.0 Memory (MB) CPU Utilization (%)

500 Table 1: Query confgurations explored in the evaluation. 20 250 +13.8% +13.3% +1.1% +1.1% +0.0% -0.1% +5.2% +2.6%

NP GL ANK-1 ANK-1/T ANK-N ANK-N/T 0 0 NP GL NP GL ANK-1 ANK-1 Provenance - Backward Live Live Live Live ANK-1/T ANK-NANK-N/T ANK-1/T ANK-NANK-N/T Native Ops - No No No Yes Yes Transparent - No No Yes No Yes Figure 11: Performance - Smart Grid queries.

398 Figure 12: Real-world example queries: a) Vehicle Tracking queries. �, lat, lon give the timestamp, GPS latitude and GPS longi- tude of the car with ID car�� . b) Object Annotation queries. ��, �� are the camera images from the left- and right-facing camera modules; � is the LiDAR point cloud. obj is a concretely-bounded object within an image or a point cloud. domain together and present the aggregated performance results. for 24 hours. The second detects anomalies, through meters that Each experiment explores all confgurations in Table 1. report abnormal consumption at midnight as compensation for the previous day. Both queries receive hourly power measurements and Linear Road. We run two queries from the Linear Road Bench- use Aggregates and Filters; the second also has a Join (we refer the mark [6] on an Odroid. The frst query detects broken-down vehi- reader to [30] for more details). On average, a sink tuple depends cles through consecutive reports of zero speed and constant position. on 192 source tuples in the frst query and 24 in the second. The The second detects accidents from cars stopped at the same posi- provenance overhead is higher here since the queries have larger tion. Both queries receive reports from vehicles every 30 seconds, aggregation windows and higher volumes of provenance data. GL and contain Aggregates and Filters (we refer the reader to [30] for results in a 9% rate drop and 3% latency increase, while ANK-1 (/T) more details). Each sink tuple depends on 4 source tuples in the causes a further 2.7% (14.1%) drop in the rate and a jump of 2.6% frst query and 8 in the second. Figure 10 shows the query perfor- (5.2%) in the CPU. For ANK-N (/T), the rate drops by 7.2% (16.8%) mance without provenance (NP), with GeneaLog (GL), and Ananke compared to GL and the latency rises by at most 2.5%, ANK-N’s with the user-defned operator (ANK-1, ANK-1/T) or the native higher number of operators causes a jump in the CPU, around 14%. operators (ANK-N, ANK-N/T). The text in Ananke’s bars shows the percentage diference from GL. The performance impact of both GL Vehicle Tracking queries. This use-case is based on Figure 1. It and Ananke is small. GL results in about a 3% performance drop for uses the GeoLife dataset, composed of 18670 GPS traces of various rate and latency and Ananke causes a further drop of 2% for ANK-1 vehicles over 4 years around Beijing [41]. We employ 10046 traces and ANK-N and up to 5.6% for the transparent variants (/T). of cars driving a full day each to simulate a large feet driving simultaneously. Figure 12a shows the queries, fed tuples carrying Figure 11 shows the performance of two queries Smart Grid. the car ID, timestamp, and latitude/longitude. �1 calculates the from the smart grid domain, run on an Odroid. The frst reports immediate and average speed of the last two minutes per car, and long-term blackouts by identifying meters with zero consumption forwards to �1 tuples with average speed �¯ > 70km/h. �2 forwards

Rate (t/s) Latency (s) Rate (t/s) Latency (s)

200 0.2 1000 2 100 0.1 +18.3% +46.3% +45.2% +15.6% -3.4% -4.4% +3.5% +2.2% +25.3% +23.6% -15.1% -16.1% -3.7% -4.3% -5.4% -6.1% 0 0 0 0.0 Memory (MB) CPU Utilization (%) Memory (MB) CPU Utilization (%) 4000

20 0.5 200 2000 10 +57.0% +95.5% +8.7% +56.3% +34.5% +34.4% +91.8% +4.6% -3.3% -6.2% +58.0% +7.0% +5.1% +49.2% +14.9% +13.9% 0 0 0 0.0 NP GL NP GL NP GL NP GL ANK-1ANK-1/T ANK-NANK-N/T ANK-1ANK-1/T ANK-NANK-N/T ANK-1ANK-1/T ANK-NANK-N/T ANK-1ANK-1/T ANK-NANK-N/T

Figure 13: Performance - Vehicular Tracking queries. Figure 14: Performance - Object Annotation queries.

399 ANK-1 ANK-N in Figure 15 in multiples of � (see §5), for vertices (SINK/SOURCE), 1.5 Query edges (EDGE), and labels (SINK-L/SOURCE-L) of vertices. As shown, LR VT ANK-1 generally has a lower provenance latency than ANK-N. The 1.0 SG OA main diference is that, while ANK-1 can output SOURCE almost

0.5 immediately (and then store its ID to not output it again), ANK-N delays the production of SOURCE by at least � to avoid duplicates. 0.0

Provenance Latency (U) This delay propagates to SINK-L (see §5.2). Also, ANK-1 can imme- SINK EDGE SINK-L SINK EDGE SINK-L SOURCE SOURCE-L SOURCE SOURCE-L diately emit EDGE and SINK-L without any delay (see Algorithm 1). Both variants have low SINK latency as those are safe to emit imme- Figure 15: Provenance Latency in multiples of � . diately upon arrival of a sink tuple. The variance is close to zero, as the frequency of watermark updates, the only execution-dependent events with more than 9 coordinates within area �, a region around feature of provenance latency, does not change between repetitions. Yuyuantan Park in Beijing, to �2. This experiment is performed on an Odroid receiving the input data via 100MBit/s Ethernet. Sink 6.3 ANK-1 vs. ANK-N Trade-ofs tuples of �1 depend on around 30-160 tuples and sink tuples of �2 Here, we compare ANK-1 and ANK-N for diferent data character- depend on 25-250 tuples. As shown in Figure 13, the performance istics and query confgurations. The experiments, run on the Xeon- of Ananke follows the same trend as before. The impact on rate and Phi server, use synthetic queries in which Sources feed Ananke latency is small for ANK-1 (/T), at 2-4.5% more than GL and higher (non-transparent) with pre-populated provenance graphs. for ANK-N (/T), up to 18.3%. The larger provenance graphs of �1 Figure 16 shows the performance for diferent provenance sizes and �2 cause Ananke to have higher resource requirements, with (x-axis) and overlaps (bar colors). The former is the number of the memory utilization almost doubling in some cases and the CPU source tuples each sink tuple depends on, the latter is the percentage utilization jumping by up to 57% in the worst case (ANK-N). of shared provenance between subsequent sink tuples. ANK-1 has better performance and lower resource requirements than ANK-N, Object Annotation queries. These queries enrich an in-vehicle due to the simpler algorithm and single-task deployment. Both computer vision system based on LiDAR and two cameras. We ANK-1’s and ANK-N’s performance drops as the provenance size use the Argoverse Tracking dataset [10], with 113 segments of increases, as more data is maintained and transferred between tasks. 15-30s continuous sensor recordings of urban driving, plus 3D Larger provenance overlaps result in slightly better performance annotations of surrounding objects. The two queries, shown in since fewer source vertices and labels are emitted. Figure 12b, receive a stream of tuples carrying sets of annotations Figure 17 studies ANK-N’s ability to take advantage of the scal- O�����, O���,�, O���,�, of objects found by the vehicle’s LiDAR ability features of the SPE. The x-axis is the number of queries and a left- and right-facing camera. These sets contain objects feeding data to Ananke, and the bars refer to diferent parallelism labeled with the type (e.g., "pedestrian"), 2D position, and a unique values of Ananke. The provenance size is 50, and the overlap 25%. object ID. �1 reproduces all objects found by the LiDAR, while �1 As shown (in log scale), ANK-N outperforms ANK-1 for parallelism forwards only bicycles found in front of the vehicle. � and �2 then 4 or higher since ANK-1 does not support parallel execution. ANK- forward a tuple to �1 if a specifc bike was in front of the vehicle for N’s resource consumption increases with parallelism, making it more than 11 frames during a 6s window. In �2, only tuples referring better suited for use-cases with more available resources. 
For both to pedestrians are forwarded as two streams to a Join. If, during 2s, a certain pedestrian is found by both cameras, the pedestrian 3 10 ANK-1 ANK-N has crossed, and a tuple is forwarded to �2. To simulate powerful, × specialized vehicular hardware, this experiment was performed 5 on the Xeon-Phi server. On average, �1’s sink tuples depend on Rate (t/s) 0 −1 15-50 source tuples whereas �2’s ones depend on 2. The tuples of ×10 these queries are much bigger than all previous use-cases, in the 2.5 order of kilobytes instead of bytes. As evident by the performance 0.0 of NP in Figure 14, these queries are much more demanding. For Latency (s) example, GL drops 21% in rate, mostly due to the large volume of 2 provenance data transferred between the SPE tasks. Ananke has a small efect on the rate, causing a further drop of 3.9-6.6%. Latency CPU (%) 0

4 is afected more, increasing by about 25% for ANK-1(/T) and 48% ×10 for ANK-N(/T), while remaining at small absolute values. Memory 1 consumption does not change signifcantly compared to GL. The 0 CPU grows for ANK-N(/T), similar to previous use-cases. Memory (MB) 10 50 100 10 50 100 Provenance (#tuples) Provenance (#tuples)

Provenance Overlap (%) 6.2 Provenance Latency 0 25 50 We now study the provenance latency (§3) for ANK-1 and ANK- N for the Linear Road (LR), Smart Grid (SG), Vehicular Tracking Figure 16: Performance of the synthetic query for diferent (VT), and Object Annotation (OA) queries. The results are shown overlaps and provenance sizes.

400 ANK-1 ANK-N Rate (t/s) Latency (s)

3 10 20000 0.2 Rate (t/s) -0.9% -2.7% -3.0% -5.1% -6.4% +42.6% +42.1% -14.3% +8.8% +8.7% +8.1% +1.1% 0 10 0 0.0 Delivery Latency CPU Utilization (%)

Latency (s) 40

2 +75.4x +57.6x 20 1 +40.8x 10 +37.7% +36.4% +11.1x +30.4% +123.4x +8.6% +7.7% +6.2x +0.5%

CPU (%) 0 0

4 ANK-1 SQL-I/R SQL-I/A ANK-1 SQL-I/R SQL-I/A 2 × 10 SQL-P/R NoSQL/RSQL-P/A NoSQL/A SQL-P/R NoSQL/RSQL-P/A NoSQL/A

4 10 Figure 18: Smart Grid: Performance comparison between Memory (MB) 1 4 8 1 4 8 #queries #queries Ananke and on-demand implementations.

Parallelism 1 4 8 16 32 Rate (t/s) Latency (s) 200

Figure 17: Performance of the synthetic query for diferent 0.5 +39.4% +39.1% 100 +37.3% parallelisms and #queries (logarithmic scale). -5.9% +2.4% +3.0x -5.1% -5.8% -15.0% +2.4x -24.4% -25.6% 0 0.0 Delivery Latency CPU Utilization (%) ANK-1 and ANK-N, a higher number of queries results in a drop in performance, caused by the increased data fow. 2 +305.1x 2 +183.6x +39.1% +27.7% +22.1% +134.1x -3.5% -4.0% +54.4x +4.4x +487.5x +28.0x 6.4 Comparison with On-demand Techniques 0 0

ANK-1 and ANK-N are not the only ways to achieve the goals of §3. Here, we compare Ananke with ad-hoc alternatives relying on existing systems (external to the SPE) to produce the provenance graph on-demand. These alternatives can seem appealing due to their additional features, e.g., persistent storage of backward provenance. Thus, a comparison with them is crucial to understand the properties of the fully-streaming approach provided by Ananke.

We study alternatives based on database systems with varying performance and safety guarantees. The first, SQL-P, relies on an established relational database (PostgreSQL [33]); the second, SQL-I, on a fast, self-contained relational database running in-memory (SQLite [35]); and the third, NoSQL, on a non-relational database (MongoDB [27]).7 The SQL implementations adhere strictly to the goals of §3, whereas NoSQL follows a best-effort principle, without strict ordering guarantees (due to the concurrent accesses by several threads). In contrast with Ananke, the alternatives produce the (streaming) provenance graph on-demand: a thread polls the database periodically and performs the data transforms. We evaluate an aggressive polling strategy (suffix /A), which polls as frequently as possible, and a relaxed one (suffix /R), which polls every second (see the sketch below).

We compare ANK-1 with the above alternatives for two real-world experiments (the Smart Grid and Object Annotation queries), as well as for a synthetic one. In addition to the previous performance metrics, we also study the delivery latency of the provenance graph, to assess the benefits and drawbacks of the different polling strategies. This metric expresses the delay (in wall-clock time) between a graph component (vertex, edge, "expired" label) being ready to be delivered and actually being delivered. We do not study memory consumption, as the usage of external systems with different storage mechanisms creates a non-uniform measurement environment.

Figure 19: Object Annotation: Performance comparison between Ananke and on-demand implementations.

Figures 18 and 19 present the performance of the Smart Grid and Object Annotation queries, respectively. The text inside the bars indicates the relative difference from ANK-1. The performance of the on-demand implementations ranges widely, with relaxed polling (/R) having better rate, latency, and CPU but more than one order of magnitude higher delivery latency. Aggressive polling (/A) lowers the delivery latency but severely degrades the other metrics, in most cases. SQL-I/A achieves the lowest delivery latency (still up to 6.2x higher than ANK-1), whereas SQL-I/R has the least impact on the original queries (slightly better rate and latency than ANK-1, at the price of 28x the delivery latency and 4.4x the CPU). SQL-I's implementation closely resembles ANK-1's, temporarily maintaining in-memory only unsent graph components instead of persisting all backward provenance data like SQL-P and NoSQL; this similarity in implementation explains their similarity in performance. NoSQL's best-effort strategy results in a relatively low impact on the original queries but multiple orders of magnitude higher delivery latency than ANK-1, in most cases. SQL-P performs between SQL-I and NoSQL, with lower delivery latency than NoSQL but a much higher impact on the other metrics. This is expected, as SQL-P persists the backward provenance (unlike SQL-I), while also strictly adhering to the goals of §3, in contrast with NoSQL. Both experiments indicate that ANK-1 significantly outperforms all studied alternatives in most or all metrics.

7 A graph database (Neo4j [29]) seems like an obvious choice for provenance data [38], but our preliminary experiments indicated that it performed much worse than the alternatives in our streaming applications; it is thus not presented here.
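To make the on-demand architecture concrete, the following minimal sketch illustrates such a polling thread. It is only an illustration of the approach described above, not code from Ananke or from the evaluated implementations: GraphComponent, ProvenanceStore, and all method names are hypothetical stand-ins for the store-specific access logic (PostgreSQL, SQLite, or MongoDB) and for the graph transforms.

```java
import java.util.List;

// A component of the provenance graph: a vertex, an edge, or an "expired"
// label. Hypothetical type, for illustration only.
class GraphComponent {
    final String payload;      // serialized vertex/edge/label
    final long readyAtMillis;  // wall-clock time at which it became deliverable

    GraphComponent(String payload, long readyAtMillis) {
        this.payload = payload;
        this.readyAtMillis = readyAtMillis;
    }
}

// Hides the store-specific access (PostgreSQL, SQLite, or MongoDB).
interface ProvenanceStore {
    // Returns the backward provenance written by the SPE since the previous call.
    List<GraphComponent> fetchNewComponents();
}

class OnDemandPoller implements Runnable {
    private final ProvenanceStore store;
    private final long pollIntervalMillis; // 0: aggressive (/A); 1000: relaxed (/R)

    OnDemandPoller(ProvenanceStore store, long pollIntervalMillis) {
        this.store = store;
        this.pollIntervalMillis = pollIntervalMillis;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            // Note the absence of back-pressure: the poller fetches as much
            // backward provenance as the store returns, whether or not the
            // transform below keeps up (cf. the backlog discussed later on).
            for (GraphComponent c : store.fetchNewComponents()) {
                deliver(transform(c));
            }
            if (pollIntervalMillis > 0) {
                try {
                    Thread.sleep(pollIntervalMillis); // relaxed polling waits
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }
    }

    private GraphComponent transform(GraphComponent c) {
        return c; // placeholder for the backward-to-forward graph transform
    }

    private void deliver(GraphComponent c) {
        System.out.println(c.payload); // placeholder for the downstream consumer
    }
}
```

Setting pollIntervalMillis to 0 approximates the aggressive /A strategy, while a value of 1000 corresponds to the relaxed /R one.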

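Under the same illustrative types, the delivery latency defined above reduces to a per-component wall-clock difference, recorded at the moment a component is actually delivered. The helper below is again hypothetical, reusing GraphComponent from the sketch above:

```java
class DeliveryLatency {
    // Delay (wall-clock) between a graph component being ready to be
    // delivered and actually being delivered.
    static long ofMillis(GraphComponent c, long deliveredAtMillis) {
        return deliveredAtMillis - c.readyAtMillis;
    }
}
```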
Table 2: Performance comparison between Ananke and on-demand implementations in the synthetic query.

                     ANK-1   SQL-P/R  SQL-P/A  SQL-I/R*  SQL-I/A*  NoSQL/R*  NoSQL/A*
Rate (t/s)           131.3   54.43    54.80    211.4     209.72    19.81     19.75
Latency (s)          1.03    2.86     2.82     0.62      0.70      16.19     16.61
Deliv. Latency (s)   0.07    2.97     0.78     25.10     15.24     27.63     27.29
CPU (%)              1.18    0.97     1.30     1.75      1.97      0.92      0.89

* The values for SQL-I and NoSQL do not reflect the steady-state performance, which is significantly lower. Refer to the text for more details.

Table 2 presents the aggregated results of the synthetic experiment, with a setup8 similar to that of Figure 16, a provenance overlap of 0%, and provenance sizes 10–100. The relative performance is comparable to the real-world experiments, with ANK-1 outperforming the alternatives overall. SQL-I seems to outperform ANK-1 in rate and latency, but a detailed examination of the experimental data indicates that SQL-I can only sustain 40% of the measured rate without exhausting the memory of the system. Since, in contrast to ANK-1, the on-demand alternatives lack a back-pressure mechanism, they can fetch backward provenance from the SPE faster than they can process it. This leads to a continuously-increasing provenance backlog (illustrated by the high delivery latencies). For an in-memory database like SQL-I, this is unsustainable.9 While back-pressure could be added manually to the studied alternatives, this would be "reinventing the wheel", as such mechanisms are available out-of-the-box in Ananke, which runs inside the SPE.

8 For a fair comparison with the on-demand implementations, in this experiment ANK-1 writes the graph to disk and thus has a lower performance than in Figure 16.
9 NoSQL suffers from the same issue; however, in its case, the issue is less critical since NoSQL does not need to maintain its state in-memory.

Among the studied on-demand alternatives to Ananke, SQL-I performs best but is still disadvantaged by not being streaming-oriented. In contrast, ANK-1 has a much better sustained performance and does not have to maintain "raw" provenance. This enables the immediate forwarding of the forward provenance graph stream to a secondary ingesting system without keeping unnecessary state, saving space and computational resources.

Evaluation summary. Ananke has similar overheads to the state-of-the-art in backward streaming provenance while offering live, forward rather than simply backward provenance. Compared to [30], the best-performing implementations of Ananke incur less than a 5% drop in rate in all use-cases and less than a 3% increase in latency in all but one use-case. The evaluation shows that Ananke is suitable for deployments of real-world applications requiring both efficient processing and live provenance capture. Alternative existing systems fall short in providing the graph in a timely and sustainable fashion suited to the data streaming paradigm.

7 RELATED WORK

Data provenance, extensively studied in databases [11, 13, 23], has only recently come into focus in data streaming. Early such work [39] targets coarse-grained data stream dependencies. A finer-grained approach, in [40], produces time intervals which may contain provenance tuples. [24] focuses on minimizing the storage requirements of streaming provenance but produces approximate results and lacks support for some native operators (e.g., Join).

To the best of our knowledge, Ananke is the first to deliver live forward provenance. [14] presents a system for debugging streaming queries with integrated provenance capture and visualization. However, it focuses on one SPE [17] and on slices of the execution (with the option to lazily create the complete query provenance). It requires users to declare relationships between input and output tuples and does not distinguish expired tuples. Likewise, StreamTrace [7], targeting the Trill SPE [9], assists the development and debugging of queries with data visualization, relying on provenance through instrumented operators and ad-hoc query rewriting. Ananke's provenance graph allows creating similar visualizations for the end-to-end system provenance. GeneaLog [30] is the state-of-the-art technique and framework for fine-grained provenance in streaming applications. It records and traces the provenance metadata while incurring a small, constant per-tuple overhead. In this work, we extend GeneaLog to support out-of-order data and use it as the provider of backward provenance for Ananke. Ariadne [19, 20] is a similar framework for fine-grained streaming data provenance using instrumentation. However, it requires variable-length annotations and needs to temporarily store all alive source tuples (including those that do not contribute to any sink tuples) until they become expired. The authors hypothesize the use of static query analysis to discern alive and expired tuples, but without further details. As discussed in §3, Ananke can be adjusted for use with Ariadne or any streaming backward provenance framework.

8 CONCLUSIONS

We presented Ananke, a framework to extend streaming tools for backward provenance and to deliver a live bipartite graph of fine-grained forward provenance. Ananke provides users with richer provenance information, not only specifying which source tuples contribute to which query results, but also whether each source tuple can potentially contribute to future results. This distinction can help analysts prioritize the inspection of the large volumes of events commonly observed when monitoring CPSs. We formally prove Ananke's correctness and implement two variations (available at [18]) in Flink. Through our thorough evaluation, we show that Ananke incurs small overheads while outperforming alternatives that rely on tools external to the SPE. Future work can address (1) finer-grained debugging and exploration, by expanding Ananke's live provenance to include intermediate tuples produced by query operators and indicating whether such tuples could contribute to future results, and (2) exploring how Ananke's theoretical foundations about alive/expired tuples can be used in fault-tolerant stream processing.

ACKNOWLEDGMENTS

Work supported by the Swedish Government Agency for Innovation Systems VINNOVA, proj. "AutoSPADA" grant nr. DNR 2019-05884 in the funding program FFI: Strategic Vehicle Research and Innovation; the Swedish Foundation for Strategic Research, proj. "FiC" grant nr. GMT14-0032; the Swedish Research Council (Vetenskapsrådet), proj. "HARE" grant nr. 2016-03800; and Chalmers Energy AoA framework proj. INDEED and STAMINA.

REFERENCES
[1] Tyler Akidau, Alex Balikov, Kaya Bekiroğlu, Slava Chernyak, Josh Haberman, Reuven Lax, Sam McVeety, Daniel Mills, Paul Nordstrom, and Sam Whittle. 2013. MillWheel: Fault-tolerant Stream Processing at Internet Scale. Proceedings of the VLDB Endowment 6, 11 (2013), 1033–1044.
[2] Tyler Akidau, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael J. Fernández-Moctezuma, Reuven Lax, Sam McVeety, Daniel Mills, Frances Perry, Eric Schmidt, and Sam Whittle. 2015. The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing. Proceedings of the VLDB Endowment 8, 12 (2015), 1792–1803. https://doi.org/10.14778/2824032.2824076
[3] Apache. 2020. Beam. Retrieved November 12, 2020 from https://beam.apache.org/
[4] Apache. 2020. Heron. Retrieved November 12, 2020 from https://heron.incubator.apache.org/
[5] Apache. 2020. Storm. Retrieved November 12, 2020 from http://storm.apache.org/
[6] Arvind Arasu, Mitch Cherniack, Eduardo Galvez, David Maier, Anurag S. Maskey, Esther Ryvkina, Michael Stonebraker, and Richard Tibbetts. 2004. Linear Road: A Stream Data Management Benchmark. In Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30 (Toronto, Canada) (VLDB '04). VLDB Endowment, 480–491. http://dl.acm.org/citation.cfm?id=1316689.1316732
[7] Leilani Battle, Danyel Fisher, Robert DeLine, Mike Barnett, Badrish Chandramouli, and Jonathan Goldstein. 2016. Making Sense of Temporal Queries with Interactive Visualization. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (CHI '16). ACM Press, Santa Clara, California, USA, 5433–5443. https://doi.org/10.1145/2858036.2858408
[8] Paris Carbone, Asterios Katsifodimos, Stephan Ewen, Volker Markl, Seif Haridi, and Kostas Tzoumas. 2015. Apache Flink: Stream and Batch Processing in a Single Engine. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering 36, 4 (2015), 28–38.
[9] Badrish Chandramouli, Jonathan Goldstein, Mike Barnett, Robert DeLine, Danyel Fisher, John C. Platt, James F. Terwilliger, and John Wernsing. 2015. Trill: A High-Performance Incremental Query Processor for Diverse Analytics. Proceedings of the VLDB Endowment 8, 4 (2015), 401–412. https://doi.org/10.14778/2735496.2735503
[10] Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, et al. 2019. Argoverse: 3D Tracking and Forecasting With Rich Maps. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 8740–8749.
[11] James Cheney, Laura Chiticariu, and Wang-Chiew Tan. 2007. Provenance in Databases: Why, How, and Where. Foundations and Trends in Databases 1, 4 (2007), 379–474. https://doi.org/10.1561/1900000006
[12] Daniel Crawl, Jianwu Wang, and Ilkay Altintas. 2011. Provenance for MapReduce-Based Data-Intensive Workflows. In Proceedings of the 6th Workshop on Workflows in Support of Large-Scale Science (Seattle, Washington, USA) (WORKS '11). Association for Computing Machinery, New York, NY, USA, 21–30. https://doi.org/10.1145/2110497.2110501
[13] Yingwei Cui, Jennifer Widom, and Janet L. Wiener. 2000. Tracing the Lineage of View Data in a Warehousing Environment. ACM Transactions on Database Systems 25, 2 (June 2000), 179–227. https://doi.org/10.1145/357775.357777
[14] Wim De Pauw, Mihai Leţia, Buğra Gedik, Henrique Andrade, Andy Frenkiel, Michael Pfeifer, and Daby Sow. 2010. Visual Debugging for Stream Processing Applications. In Runtime Verification, Howard Barringer, Ylies Falcone, Bernd Finkbeiner, Klaus Havelund, Insup Lee, Gordon Pace, Grigore Roşu, Oleg Sokolsky, and Nikolai Tillmann (Eds.). Vol. 6418. Springer Berlin Heidelberg, Berlin, Heidelberg, 18–35. https://doi.org/10.1007/978-3-642-16612-9_3
[15] Katarina Nielsen Dominiak and Anders Ringgaard Kristensen. 2017. Prioritizing alarms from sensor-based detection models in livestock production - A review on model performance and alarm reducing methods. Computers and Electronics in Agriculture 133 (2017), 46–67. https://doi.org/10.1016/j.compag.2016.12.008
[16] Romaric Duvignau, Vincenzo Gulisano, Marina Papatriantafilou, and Vladimir Savic. 2019. Streaming Piecewise Linear Approximation for Efficient Data Management in Edge Computing. In Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing (SAC '19). ACM Press, Limassol, Cyprus, 593–596. https://doi.org/10.1145/3297280.3297552
[17] Bugra Gedik, Henrique Andrade, Kun-Lung Wu, Philip S. Yu, and Myungcheol Doo. 2008. SPADE: The System S Declarative Stream Processing Engine. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (SIGMOD '08). Association for Computing Machinery, New York, NY, USA, 1123–1134. https://doi.org/10.1145/1376616.1376729
[18] GitHub. 2020. Ananke Implementation. Retrieved November 12, 2020 from https://github.com/dmpalyvos/ananke
[19] Boris Glavic, Kyumars Sheykh Esmaili, Peter M. Fischer, and Nesime Tatbul. 2014. Efficient Stream Provenance via Operator Instrumentation. ACM Transactions on Internet Technology 14, 1, Article 7 (Aug. 2014), 26 pages. https://doi.org/10.1145/2633689
[20] Boris Glavic, Kyumars Sheykh Esmaili, Peter Michael Fischer, and Nesime Tatbul. 2013. Ariadne: Managing Fine-Grained Provenance on Data Streams. In Proceedings of the 7th ACM International Conference on Distributed Event-Based Systems (Arlington, Texas, USA) (DEBS '13). Association for Computing Machinery, New York, NY, USA, 39–50. https://doi.org/10.1145/2488222.2488256
[21] HardKernel. 2020. Odroid-XU4. Retrieved November 12, 2020 from http://www.hardkernel.com
[22] Bastian Havers, Romaric Duvignau, Hannaneh Najdataei, Vincenzo Gulisano, Marina Papatriantafilou, and Ashok Chaitanya Koppisetty. 2020. DRIVEN: A framework for efficient Data Retrieval and clustering in Vehicular Networks. Future Generation Computer Systems 107 (2020), 1–17.
[23] Melanie Herschel, Ralf Diestelkämper, and Houssem Ben Lahmar. 2017. A Survey on Provenance: What for? What Form? What From? The VLDB Journal 26, 6 (2017), 881–906. https://doi.org/10.1007/s00778-017-0486-1
[24] Mohammad Rezwanul Huq, Andreas Wombacher, and Peter M.G. Apers. 2011. Adaptive Inference of Fine-grained Data Provenance to Achieve High Accuracy at Lower Storage Costs. IEEE Computer Society, USA, 202–209. https://doi.org/10.1109/eScience.2011.36
[25] Jeong-Hyon Hwang, Ugur Cetintemel, and Stan Zdonik. 2007. Fast and Reliable Stream Processing over Wide Area Networks. In 2007 IEEE 23rd International Conference on Data Engineering Workshop. 604–613. https://doi.org/10.1109/ICDEW.2007.4401047
[26] Jin Li, Kristin Tufte, Vladislav Shkapenyuk, Vassilis Papadimos, Theodore Johnson, and David Maier. 2008. Out-of-order Processing: A New Architecture for High-performance Stream Systems. Proceedings of the VLDB Endowment 1, 1 (2008), 274–288.
[27] MongoDB. 2020. MongoDB. Retrieved November 12, 2020 from https://www.mongodb.com
[28] Hannaneh Najdataei, Yiannis Nikolakopoulos, Vincenzo Gulisano, and Marina Papatriantafilou. 2018. Continuous and Parallel LiDAR Point-Cloud Clustering. In 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS). IEEE Computer Society, Vienna, Austria, 671–684. https://doi.org/10.1109/ICDCS.2018.00071
[29] Neo4j. 2020. Neo4j. Retrieved November 12, 2020 from https://neo4j.com/
[30] Dimitris Palyvos-Giannas, Vincenzo Gulisano, and Marina Papatriantafilou. 2019. GeneaLog: Fine-grained data streaming provenance in cyber-physical systems. Parallel Computing 89 (2019), 102552.
[31] Dimitris Palyvos-Giannas, Vincenzo Gulisano, and Marina Papatriantafilou. 2019. Haren: A Framework for Ad-Hoc Thread Scheduling Policies for Data Streaming Applications. In Proceedings of the 13th ACM International Conference on Distributed and Event-Based Systems (DEBS '19). ACM, Darmstadt, Germany, 19–30. https://doi.org/10.1145/3328905.3329505
[32] Fabio Pasqualetti, Florian Dörfler, and Francesco Bullo. 2013. Attack Detection and Identification in Cyber-Physical Systems. IEEE Transactions on Automatic Control 58, 11 (2013), 2715–2729.
[33] PostgreSQL. 2020. PostgreSQL. Retrieved November 12, 2020 from https://www.postgresql.org
[34] Saeed Salah, Gabriel Maciá-Fernández, and Jesús E. Díaz-Verdejo. 2013. A model-based survey of alert correlation techniques. Computer Networks 57, 5 (2013), 1289–1317. https://doi.org/10.1016/j.comnet.2012.10.022
[35] SQLite. 2020. SQLite. Retrieved November 12, 2020 from https://www.sqlite.org/
[36] Michael Stonebraker, Uğur Çetintemel, and Stan Zdonik. 2005. The 8 Requirements of Real-time Stream Processing. ACM SIGMOD Record 34, 4 (2005), 42–47.
[37] Joris van Rooij, Vincenzo Gulisano, and Marina Papatriantafilou. 2018. LoCoVolt: Distributed Detection of Broken Meters in Smart Grids through Stream Processing. In Proceedings of the 12th ACM International Conference on Distributed and Event-Based Systems (Hamilton, New Zealand) (DEBS '18). Association for Computing Machinery, New York, NY, USA, 171–182. https://doi.org/10.1145/3210284.3210298
[38] Chad Vicknair, Michael Macias, Zhendong Zhao, Xiaofei Nan, Yixin Chen, and Dawn Wilkins. 2010. A Comparison of a Graph Database and a Relational Database: A Data Provenance Perspective. In Proceedings of the 48th Annual Southeast Regional Conference (Oxford, Mississippi) (ACM SE '10). Association for Computing Machinery, New York, NY, USA, Article 42, 6 pages. https://doi.org/10.1145/1900008.1900067
[39] Nithya N. Vijayakumar and Beth Plale. 2006. Towards Low Overhead Provenance Tracking in Near Real-Time Stream Filtering. In Provenance and Annotation of Data, Luc Moreau and Ian Foster (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 46–54.
[40] Min Wang, Marion Blount, John Davis, Archan Misra, and Daby Sow. 2007. A Time-and-value Centric Provenance Model and Architecture for Medical Event Streams. In Proceedings of the 1st ACM SIGMOBILE International Workshop on Systems and Networking Support for Healthcare and Assisted Living Environments (San Juan, Puerto Rico) (HealthNet '07). ACM, New York, NY, USA, 95–100. https://doi.org/10.1145/1248054.1248082
[41] Yu Zheng, Xing Xie, and Wei-Ying Ma. 2010. GeoLife: A Collaborative Social Networking Service among User, Location and Trajectory. IEEE Data Engineering Bulletin 33, 2 (2010), 32–39.
