A Tree Clock Data Structure for Causal Orderings in Concurrent Executions

Umang Mathur, University of Illinois at Urbana-Champaign, USA ([email protected])
Andreas Pavlogiannis, Aarhus University, Denmark ([email protected])
Mahesh Viswanathan, University of Illinois at Urbana-Champaign, USA ([email protected])

Conference'17, July 2017, Washington, DC, USA. 2021. ACM ISBN 978-x-xxxx-xxxx-x/YY/MM...$15.00. https://doi.org/10.1145/nnnnnnn.nnnnnnn

Abstract

Dynamic techniques are a scalable and effective way to analyze concurrent programs. Instead of analyzing all behaviors of a program, these techniques detect errors by focusing on a single program execution. Often a crucial step in these techniques is to define a causal ordering between events in the execution, which is then computed using vector clocks, a simple data structure that stores logical times of threads. The two basic operations of vector clocks, namely join and copy, require Θ(k) time, where k is the number of threads. Thus they are a computational bottleneck when k is large. In this work, we introduce tree clocks, a new data structure that replaces vector clocks for computing causal orderings in program executions. Joining and copying tree clocks takes time that is roughly proportional to the number of entries being modified, and hence the two operations do not suffer the a-priori Θ(k) cost per application. We show that when used to compute the standard happens-before (HB) partial order, tree clocks are optimal, in the sense that no other data structure can lead to smaller asymptotic running time. Moreover, we demonstrate that tree clocks can be used to compute other partial orders, such as schedulably-happens-before (SHB), and thus are a versatile data structure. Our experiments on standard benchmarks show that the time for computing HB and SHB reduces to 50% and 57%, respectively, simply by replacing vector clocks with tree clocks.

Keywords: concurrency, happens-before, vector clocks

1 Introduction

The analysis of concurrent programs is one of the major challenges in formal methods, due to the non-determinism of inter-thread communication. The large space of communication interleavings poses a significant challenge to the programmer, as intended invariants can be broken by unexpected communication patterns. The subtlety of these patterns also makes verification a demanding task, as exposing a bug requires searching an exponentially large space [29]. Consequently, significant efforts are made towards understanding and detecting concurrency bugs efficiently [4, 12, 24, 45, 50, 54].

Dynamic analyses and partial orders. One popular approach to the scalability problem of concurrent program verification is dynamic analysis [16, 28, 32, 42]. Such techniques have the more modest goal of discovering faults by analyzing program executions instead of whole programs. Although this approach cannot prove the absence of bugs, it is far more scalable than static analysis and typically makes sound reports of errors. These advantages have rendered dynamic analyses a very effective and widely used approach to error detection in concurrent programs.

The first step in virtually all techniques that analyze concurrent executions is to establish a causal ordering between the events of the execution. Although the notion of causality varies with the application, its transitive nature makes it naturally expressible as a partial order between these events. The partial order most commonly used in this context is Lamport's happens-before (HB) [23], initially proposed in the context of distributed systems [43]. In the context of testing multi-threaded programs, partial orders play a crucial role in dynamic race detection techniques, and have been thoroughly exploited to explore trade-offs between soundness, completeness, and running time of the underlying analysis. Prominent examples include the widespread use of HB [11, 16, 21, 32, 44], schedulably-happens-before (SHB) [25], causally-precedes (CP) [46], weak-causally-precedes (WCP) [22], doesn't-commute (DC) [36], strong/weak-dependently-precedes (SDP/WDP) [19], and M2 [31]. Beyond race detection, partial orders are often employed to detect and reproduce other concurrency bugs such as atomicity violations [2, 18, 27] and deadlocks [41, 48].

Vector clocks in dynamic analyses. Often, the computational task of determining the partial ordering between events of an execution is achieved using a simple data structure called a vector clock. Informally, a vector clock C is an integer array indexed by the processes/threads in the execution, and succinctly encodes the knowledge of a process about the whole system. For a vector clock Ct1 associated with thread t1, if Ct1(t2) = i then the latest event of t1 is ordered after the first i events of thread t2 in the partial order. Vector clocks thus seamlessly capture a partial order, with the point-wise ordering of the vector timestamps of two events capturing the ordering between the events with respect to the partial order. For this reason, vector clocks are instrumental in computing the HB partial order efficiently [14, 15, 28], and are ubiquitous in the efficient implementation of analyses based on partial orders even beyond HB [16, 22, 25, 27, 36, 41, 48].

The fundamental operation on vector clocks is the pointwise join Ct1 ← Ct1 ⊔ Ct2, which occurs whenever there is a causal ordering from thread t2 to t1. Operationally, a join is performed by updating Ct1(t) ← max(Ct1(t), Ct2(t)) for every thread t, and captures the transitivity of causal orderings: as t1 learns about t2, it also learns about other threads t that t2 knows about. Note that if t1 is already aware of a later event of t, this update is vacuous. With k threads, a vector clock join takes Θ(k) time, and can quickly become a bottleneck in systems with large k. This motivates the following question: is it possible to speed up join operations by proactively avoiding vacuous updates? The challenge in such a task comes from the efficiency of the join operation itself: since it only requires linear time in the size of the vector, any improvement must operate in sub-linear time, i.e., not even touch certain entries of the vector clock. We illustrate this idea on a concrete example, and present the key insight of this work.

Motivating example. Consider the example shown in Figure 1. It shows a partial trace of a concurrent system with 6 threads, with vector times at each event. When event e2 is ordered before event e3 due to a synchronization event, the vector clock Ct2 of t2 is joined with that of t1, i.e., the tj-th entry of Ct1 is updated to the maximum of Ct1(tj) and Ct2(tj). Now assume that thread t2 has learned of the current times of threads t3, t4, t5 and t6 via thread t3. Since the t3-th component of the vector timestamp of event e1 is larger than the corresponding component of event e2, t1 cannot possibly learn any new information about threads t4, t5, and t6 through the join performed at event e3. Hence the naive pointwise updates will be redundant for the indices j ∈ {3, 4, 5, 6}. Unfortunately, the flat structure of vector clocks is not amenable to such reasoning and cannot avoid these redundant operations.

To alleviate this problem, in this work we introduce a new data structure for maintaining vector times, called a tree clock. The nodes of the tree encode local clocks, just like entries in a vector clock. In addition, the structure of the tree naturally captures which clocks have been learned transitively via intermediate threads. Figure 1 (right) depicts a (simplified) tree clock encoding the vector times of Ct2. The subtree rooted at thread t3 encodes the fact that t2 has learned about the current times of t4, t5 and t6 transitively, via t3. To perform the join operation Ct1 ← Ct1 ⊔ Ct2, we start from the root of Ct2, and given a current node u, we proceed to the children of u if and only if u represents the time of a thread that is not known to t1. Hence, the join operation will now access only the light-gray area of the tree, and thus compute the join without accessing the whole tree, resulting in a sublinear running time.

The above principle, which we call direct monotonicity, is one of two key ideas exploited by tree clocks; the other being indirect monotonicity. The key technical challenge in developing the tree clock data structure lies in (i) using direct and indirect monotonicity to perform efficient updates, and (ii) performing these updates such that direct and indirect monotonicity are preserved for future operations. We refer to Section 3.1 for a more in-depth illustration of the intuition behind these two principles.

Contributions. The contributions of this work are as follows.

1. We introduce the tree clock, a new data structure for maintaining logical times in concurrent executions. In contrast to the flat structure of traditional vector clocks, the hierarchical structure of tree clocks allows for join and copy operations that run in sublinear time. As a data structure, tree clocks offer high versatility, as they can be used to compute many different ordering relations.
2. We prove that, for computing the HB partial order, tree clocks offer an optimality guarantee we call vector-time (or vt-) optimality. Intuitively, vt-optimality guarantees that tree clocks are an optimal data structure for HB, in the sense that the total computation time cannot be improved (asymptotically) by replacing tree clocks with any other data structure. Vector clocks, on the other hand, do not enjoy this property.
3. We illustrate the versatility of tree clocks by presenting a tree clock-based algorithm to compute the SHB partial order.
4. We perform an experimental evaluation of the tree clock data structure for computing HB and SHB, and compare its performance against the standard vector clock data structure. Our results show that, just by replacing vector clocks with tree clocks, the running time reduces on average to 50% for HB and 57% for SHB. Given our experimental results, we believe that replacing vector clocks with tree clocks in partial order-based algorithms can lead to significant improvements.
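To make the cost of the pointwise join concrete, here is a minimal vector clock sketch (the class and field names are ours, not the paper's implementation); its join touches all k entries even when, as in the motivating example, only one of them actually changes.

```java
// Minimal vector clock sketch (illustrative; not the paper's implementation).
// join() scans all k entries even when only a few of them actually change.
class VectorClock {
    final int[] times;

    VectorClock(int... times) { this.times = times.clone(); }

    // Pointwise join: C(t) <- max(C(t), other(t)) for every thread t.
    // Returns how many entries actually changed, to expose vacuous updates.
    int join(VectorClock other) {
        int changed = 0;
        for (int t = 0; t < times.length; t++) {   // always Theta(k) work
            if (other.times[t] > times[t]) {
                times[t] = other.times[t];
                changed++;
            }
        }
        return changed;
    }
}
```

On the clocks of Figure 1, the loop performs six comparisons but changes only the t2-entry; the figure's final clock additionally shows t1's own entry advanced from 27 to 28, which the pure join alone does not do.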


[Figure 1: (Left) threads t1, t2 with events e1, e2, e3 and clocks
Ct1 = [27, 5, 9, 45, 17, 26]
Ct2 = [11, 6, 5, 32, 14, 20]
after the join at e3: Ct1 = [28, 6, 9, 45, 17, 26].
(Right) the tree clock of Ct2: root (t2, 6) with children (t3, 5) and (t1, 11); under (t3, 5): (t5, 14), (t4, 32), (t6, 20).]

Figure 1. (Left) Illustration of the effect of a join operation Ct1 ← Ct1 ⊔ Ct2 on the clocks of the two threads. The j-th entry in the timestamps corresponds to thread tj. Red entries remain unchanged, as t1 already knows of a later time. (Right) A tree representation of the clock Ct2 that encodes transitivity. Dark gray marks the threads whose clock has progressed in Ct2 compared to Ct1 (i.e., just t2). Light gray marks the nodes that we need to examine when performing the join operation.

2 Preliminaries

We consider traces σ of events of the form ⟨t, op⟩, where t is a thread and op is an operation such as acq(ℓ) or rel(ℓ) on a lock ℓ (and, for SHB, a read or write of a variable). We write tid(e) for the thread of event e, lTimeσ(e) for its local time (the number of events of tid(e) up to and including e), ≤σtr for the trace order, and ≤σTO for the thread order, i.e., the restriction of ≤σtr to pairs of events of the same thread.

When two events e1 and e2 are unordered by a partial order P (denoted e1 ∥σP e2), they can be deemed concurrent. This principle forms the backbone of all partial order-based race detection techniques: look for two conflicting events in the trace that are unordered by the partial order of choice.

A naïve approach for detecting data races is to explicitly construct a directed acyclic graph with nodes as events in the trace and edges representing the orderings imposed by the partial order ≤σP, and then perform a graph search for two events that are not connected. Vector clocks, on the other hand, provide a more efficient method to represent partial orders and therefore are the key data structure used in most partial order-based algorithms. The use of vector clocks enables designing streaming algorithms that detect races as the execution is observed. These algorithms associate vector timestamps [14, 15, 28] with events so that the point-wise ordering between timestamps reflects the underlying partial order. Let us formalize these notions now.

Algorithm 1: Computing the HB partial order.
1 procedure acquire(t, ℓ)
2   Ct.Join(Cℓ)
3 procedure release(t, ℓ)
4   Cℓ.Copy(Ct)

Our data structures perform in-place operations. In particular, there are methods such as Join(·), Copy(·) or Increment(·, ·) that store the result of the corresponding vector time operation in the original instance of the data structure. For example, for a vector clock C and a vector time V, a function call C.Join(V) (resp. C.Copy(V)) stores the value C ⊔ V (resp. V) back in C. A typical implementation of this operation for the vector clock data structure iterates over all thread identifiers (indices of the internal array representation), compares the corresponding components of C and V, and stores the maximum of the two in the corresponding index of C. Assuming arithmetic operations take constant time, the running time of the in-place join operation for the vector clock data structure is thus O(k), where k is the number of threads in the trace. Similarly, copy and comparison operations take O(k) time, while an increment operation takes O(1) time with vector clocks.

Vector Timestamps. Let us fix the set of threads Thrds in the trace. A vector timestamp (or simply vector time) is a mapping V : Thrds → N. It supports the following operations:

V1 ⊑ V2  iff  ∀t : V1(t) ≤ V2(t)                       (Comparison)
V1 ⊔ V2  =  λt : max(V1(t), V2(t))                     (Join)
V[t′ ↦ +i] = λt : V(t) + i if t = t′, and V(t) otherwise   (Increment)

We write V1 = V2 to denote that V1 ⊑ V2 and V2 ⊑ V1. Let us see how vector timestamps provide an efficient implicit representation of partial orders.

Timestamping for a partial order. Consider a partial order ≤σP defined on the set of events of σ such that ≤σTO ⊆ ≤σP. In this case, we define the P-timestamp of an event e as the following vector timestamp:

C^P_e = λu : max { lTimeσ(f) | f ≤σP e, tid(f) = u }

We remark that C^P_e(tid(e)) = lTimeσ(e). The following observation shows that the timestamps defined above precisely capture the order ≤σP.

Lemma 2.1. Let ≤σP be a partial order defined on the set of events of trace σ such that ≤σTO ⊆ ≤σP. Then for any two events e1, e2 of σ, we have C^P_{e1} ⊑ C^P_{e2} ⇐⇒ e1 ≤σP e2.

2.3 Happens-Before in Dynamic Race Detection

Although happens-before is a general-purpose partial order, it has seen wide adoption in dynamic data race detection techniques, and forms the basis of popular race detectors such as ThreadSanitizer [44]. Here we use this context to illustrate happens-before.

Happens-before. Given a trace σ, the happens-before (HB) partial order ≤σHB of σ is the smallest partial order over the events of σ that satisfies the following conditions:
1. ≤σTO ⊆ ≤σHB.
2. For every release event rel(ℓ) and acquire event acq(ℓ) on the same lock ℓ with rel(ℓ) <σtr acq(ℓ), we have rel(ℓ) ≤σHB acq(ℓ).

For two events e1, e2 in trace σ, we use e1 ∥σHB e2 to denote that neither e1 ≤σHB e2 nor e2 ≤σHB e1. Two conflicting events e1, e2 with e1 ∥σHB e2 are reported as a data race.

1. Direct monotonicity. Recall that a vector clock-based algorithm like Algorithm 1 maintains a vector clock Ct which intuitively captures thread t's knowledge about all threads. However, it does not maintain how this information was acquired. Knowledge of how such information was acquired can be exploited in join operations, as we show through an example. Consider a computation of the HB partial order for the trace σ shown in Figure 2a. At event e7, thread t4 transitively learns information about events in the trace through thread t3, because e6 is ordered before e7.

3.2 Tree Clocks

We now present the tree clock data structure in detail.

Tree clocks. A tree clock TC consists of the following.

1. T = (V, E) is a rooted tree of nodes of the form (tid, clk, aclk) ∈ (Thrds, N, N). Every node u stores its children in an ordered list (e.g., a stack with random access) Chld(u), in descending aclk order. We also store a parent pointer Prnt(u) for each node u, and a map ThrMap from thread identifiers to their nodes in T.

[Figure 2: two traces over threads t1–t4, each with seven sync(ℓ) events on locks ℓ1, ℓ2, ℓ3, with HB edges drawn between critical sections on the same lock.]

(a) Direct monotonicity. (b) Indirect monotonicity.
Figure 2. Illustration of the two insights behind tree clocks. An event sync(ℓ) represents two events acq(ℓ), rel(ℓ).

Each node (tid, clk, aclk) points to the unique event e corresponding to (tid, clk), i.e., tid(e) = tid and lTime(e) = clk. Intuitively, if v = Prnt(u), then u represents the following information.
1. TC has the local time u.clk for thread u.tid.
2. u.aclk is the attachment time of u on v.tid, i.e., the local time that v had when u was attached to v (the time when v.tid learned about time u.clk of u.tid).

Naturally, if u = T.root then u.aclk = ⊥. Figure 3 illustrates the tree-clock representations of the event e7 in the two traces of Figure 2.

[Figure 3: left tree — root (t4, 2, ⊥) with child (t3, 2, 2), whose children are (t1, 1, 1) and (t2, 1, 2); right tree — root (t4, 2, ⊥) with children (t2, 2, 1) and (t3, 3, 2), the latter with child (t1, 1, 1).]

Figure 3. The tree clock of t4 after processing the event e7 in the traces of Figure 2a (left) and Figure 2b (right).

Tree clock operations. Just like vector clocks, tree clocks provide functions for initialization, update and comparison. There are two main operations worth noting. The first is Join — much like vector clocks, C1.Join(C2) joins the tree clock C2 to C1. In contrast to vector clocks, this operation takes advantage of the direct and indirect monotonicity outlined in Section 3.1 to perform the join in time sublinear in the size of C1 and C2 (when possible). The second is MonotoneCopy. We use C1.MonotoneCopy(C2) to copy C2 to C1 when we know that C1 ⊑ C2. The idea is that when this holds, the copy operation has the same semantics as the join, and hence the principles that make Join run in sublinear time also apply to MonotoneCopy.

Algorithm 2 gives a pseudocode description of this functionality. The functions in the left column present operations that can be performed on tree clocks, while the right column lists helper routines for the more involved functions Join and MonotoneCopy. In the following we give an intuitive description of each function.

1. Init(t). This function initializes a tree clock TCt that belongs to thread t, by creating a node u = (t, 0, ⊥). Node u will always be the root of TCt. This initialization function is only used for tree clocks that represent the clocks of threads. Auxiliary tree clocks that store the vector times of release events do not execute this initialization.

2. Get(t). This function simply returns the time of thread t stored in TC. If TC is not aware of any event of thread t, the value returned is 0.

3. Increment(i). This function increments the time of the root node of TC. It is only used on tree clocks that have been initialized using Init, i.e., tree clocks that belong to a thread, which is always stored at the root of the tree.

4. LessThan(TC′). This function compares the vector time of TC to the vector time of TC′, i.e., it returns True iff TC ⊑ TC′.

5. Join(TC′). This function implements the join operation with TC′, i.e., updating TC ← TC ⊔ TC′. At a high level, the function performs the following steps.
1. Routine getUpdatedNodesForJoin performs a pre-order traversal of TC′, and gathers in a stack S the nodes of TC′ that have progressed in TC′ compared to TC.
2. Routine detachNodes detaches from TC the nodes whose tid appears in the nodes of S, as these will be updated and repositioned in the tree.
3. Routine attachNodes updates the nodes of TC that were detached in the previous step, and repositions them in the tree. This step effectively creates a subtree of nodes of TC that is identical to the subtree of TC′ containing the progressed nodes computed by getUpdatedNodesForJoin.
4. Finally, the last 4 lines of Join attach the subtree constructed in the previous step under the root z of TC, at the front of the Chld(z) list.

We illustrate the tree-clock Join functionality in Figure 4.

6. MonotoneCopy(TC′). This function implements the copy operation TC ← TC′, assuming that TC ⊑ TC′. The function is very similar to Join. The key difference is that this time the root of TC′ is always considered to have progressed in TC, even if the respective times are equal. This is required for

[Figure 4: tree clocks TC1 (root (t1, 16, ⊥)) and TC2 (root (t12, 25, ⊥)) before the join, and the result TC2′, where the progressed subtree of TC1 is placed at the front of the child list of the root of TC2.]

Figure 4. Illustration of TC2.Join(TC1) of Algorithm 2. Light gray marks the nodes of TC1 whose time is compared to the time of the respective thread in TC2 (i.e., the total iterations of Line 37). Dark gray marks the nodes that are updating/being updated (i.e., the size of S). TC2′ is the result of the join, where dark gray marks the sub-tree updated by Join.
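The MonotoneCopy operation illustrated next rests on a simple vector-time identity: if C ⊑ C′ then C ⊔ C′ = C′, so under this precondition a copy has the same semantics as a join. A quick check of this identity with plain arrays (example values ours):

```java
// If C ⊑ C' (pointwise), then C ⊔ C' = C', so a monotone copy may be
// implemented with the same traversal machinery as a join.
class MonotoneCopyCheck {
    // Pointwise comparison: a ⊑ b.
    static boolean lessOrEqual(int[] a, int[] b) {
        for (int t = 0; t < a.length; t++)
            if (a[t] > b[t]) return false;
        return true;
    }

    // Pointwise join: (a ⊔ b)(t) = max(a(t), b(t)).
    static int[] join(int[] a, int[] b) {
        int[] r = new int[a.length];
        for (int t = 0; t < a.length; t++) r[t] = Math.max(a[t], b[t]);
        return r;
    }
}
```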

[Figure 5: tree clocks TC1 (root (t1, 28, ⊥)) and TC2 (root (t3, 14, ⊥)) before the copy, and the result TC2′, re-rooted at (t1, 28, ⊥).]

Figure 5. Illustration of TC2.MonotoneCopy(TC1) of Algorithm 2. Light gray marks the nodes of TC1 whose time is compared to the time of the respective thread in TC2 (i.e., the total iterations of Line 66). Dark gray marks the nodes that are updating/being updated (i.e., the size of S). TC2′ is the result of the copy, where dark gray marks the sub-tree updated by MonotoneCopy. Observe that node (t3, 14, ⊥) (i.e., the root) of TC2 is updated although thread t3 has not progressed in TC1, as it is placed under the new root (t1, 28, ⊥) in TC2′ (Line 71).

changing the root of TC from the current node to one whose tid equals that of the root of TC′. Figure 5 provides an illustration.

The crucial parts of Join and MonotoneCopy that exploit the hierarchical structure of tree clocks are in getUpdatedNodesForJoin and getUpdatedNodesForCopy. In each case, we proceed from a parent u′ to its children v′ only if u′ has progressed with respect to its time in TC (recall Figure 2a), capturing direct monotonicity. Moreover, we proceed from a child v′ of u′ to the next child v′′ (in order of appearance in Chld(u′)) only if TC is not yet aware of the attachment time of v′ on u′ (recall Figure 2b), capturing indirect monotonicity.

Race detection with tree clocks. Race detection with tree clocks is shown in Algorithm 3. When processing a lock-acquire event, the vector-clock join operation has been replaced by a tree-clock join. Moreover, in light of Lemma 4.1, when processing a lock-release event, the vector-clock copy operation has been replaced by a tree-clock monotone copy. Figure 6 shows an example run of Algorithm 3 on a trace σ, showing how tree clocks grow during the execution. The figure shows the tree clocks Ct of the threads; the tree clocks Cℓ of locks are just copies of the former after processing a lock-release event (shown in parentheses in the figure).

Correctness.
4 Tree Clocks for Happens-Before

Let us see how tree clocks are employed for computing the HB partial order. We start with the following observation.

Lemma 4.1 (Monotonicity of copies). Whenever Algorithm 1 processes a lock-release event ⟨t, rel(ℓ)⟩, we have Cℓ ⊑ Ct.

We now turn our attention to the correctness of Algorithm 3, i.e., we show that the algorithm indeed computes the HB partial order. We start with the following monotonicity invariants of tree clocks.

Lemma 4.2. Consider any tree clock C and a node u of C.T. For any tree clock C′, the following assertions hold.
1. Direct monotonicity: If u.clk ≤ C′.Get(u.tid), then for every descendant w of u we have w.clk ≤ C′.Get(w.tid).

Algorithm 2: The tree clock data structure.

// Initialize a clock-tree for thread t
1  function Init(t)
2    Let u ← (t, 0, ⊥)
3    Make u the root of T
4    Let ThrMap(t) ← u

// Get the clock for thread t
5  function Get(t)
6    if ThrMap(t) ≠ ⊥ then
7      Let u ← ThrMap(t)
8      return u.clk
9    return 0

// Increment the clock of the root thread
10 function Increment(i)
11   Let z ← T.root
12   z.clk ← z.clk + i

// True iff TC ⊑ TC′
13 function LessThan(TC′)
14   Let z ← T.root
15   return z.clk ≤ TC′.Get(z.tid)

// Update with TC ← TC ⊔ TC′
16 function Join(TC′)
17   Let z′ ← TC′.T.root
18   if z′.clk ≤ Get(z′.tid) then
19     return
20   Let S ← an empty stack
21   getUpdatedNodesForJoin(S, z′)
22   detachNodes(S)
23   attachNodes(S)
     // Place the updated subtree under the root of T
24   Let w ← ThrMap(z′.tid)
25   Let z ← T.root
26   Assign w.aclk ← z.clk
27   pushChild(w, z)

// Monotone copy, assumes that TC ⊑ TC′
28 function MonotoneCopy(TC′)
29   Let z′ ← TC′.T.root
30   Let z ← T.root
31   Let S ← an empty stack
32   getUpdatedNodesForCopy(S, z′, z)
33   detachNodes(S)
34   attachNodes(S)
     // New root has the same tid as the root of TC′.T
35   Assign T.root ← ThrMap(z′.tid)

// Populate S with a pre-order traversal of the subtree rooted at u′,
// restricted to nodes whose clock has progressed.
36 routine getUpdatedNodesForJoin(S, u′)
37   foreach v′ in Chld(u′) do
38     if Get(v′.tid) < v′.clk then
39       getUpdatedNodesForJoin(S, v′)
40     else
41       if v′.aclk ≤ Get(u′.tid) then
42         break
43   Push u′ in S

// Detach from T the nodes whose tid appears in S
44 routine detachNodes(S)
45   foreach v′ in S do
46     if ThrMap(v′.tid) ≠ ⊥ then
47       Let v ← ThrMap(v′.tid)
48       if v ≠ T.root then
49         Let x ← Prnt(v)
50         Remove v from Chld(x)

// Re-attach the nodes of T whose tid appears in S to obtain
// the shape corresponding to TC′.T
51 routine attachNodes(S)
52   while S is not empty do
53     Let u′ ← pop S
54     if ThrMap(u′.tid) ≠ ⊥ then
55       Let u ← ThrMap(u′.tid)
56     else
57       Let u ← (u′.tid, 0, ⊥)
58       Let ThrMap(u.tid) ← u
59     Assign u.clk ← u′.clk
60     Let y′ ← Prnt(u′)
61     if y′ ≠ ⊥ then
62       Assign u.aclk ← u′.aclk
63       Let y ← ThrMap(y′.tid)
64       pushChild(u, y)

// Similar to getUpdatedNodesForJoin
65 routine getUpdatedNodesForCopy(S, u′, z)
66   foreach v′ in Chld(u′) do
67     if Get(v′.tid) < v′.clk then
68       getUpdatedNodesForCopy(S, v′, z)
69     else
70       if z ≠ ⊥ and v′.tid = z.tid then
71         Push v′ in S
72       if v′.aclk ≤ Get(u′.tid) then
73         break
74   Push u′ in S

// Push u to the front of Chld(v)
75 routine pushChild(u, v)
76   Assign Prnt(u) ← v
77   Push u to the front of Chld(v)
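As a complement to the pseudocode, the following is a deliberately simplified, runnable rendering of the Join path (our own sketch, not Algorithm 2 itself): it exploits only direct monotonicity, skipping the entire subtree of any node whose time is already known, omits aclk and indirect monotonicity, and re-attaches updated nodes directly under the root instead of mirroring the source subtree.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

// Simplified tree clock sketch (our illustration, not the paper's Algorithm 2):
// joins prune via direct monotonicity only, and updated nodes are re-attached
// directly under the root rather than mirroring the source subtree shape.
class TreeClock {
    static class Node {
        final int tid;
        int clk;
        Node parent;
        final LinkedList<Node> children = new LinkedList<>();
        Node(int tid, int clk) { this.tid = tid; this.clk = clk; }
    }

    final Map<Integer, Node> thrMap = new HashMap<>();
    final Node root;

    TreeClock(int tid) {                       // Init(t): node (t, 0) is the root
        root = new Node(tid, 0);
        thrMap.put(tid, root);
    }

    int get(int tid) {                         // Get(t): 0 if t is unknown
        Node u = thrMap.get(tid);
        return u == null ? 0 : u.clk;
    }

    void increment(int i) { root.clk += i; }   // Increment(i) on the root

    // TC <- TC ⊔ TC': traverse other's tree, pruning the subtrees of nodes
    // that have not progressed; detach each updated node and re-attach it
    // under the root, which has just learned the new values.
    void join(TreeClock other) {
        if (other.root.clk <= get(other.root.tid)) return;  // nothing to learn
        List<Node> progressed = new ArrayList<>();
        collectProgressed(other.root, progressed);
        for (Node u : progressed) {
            if (u.tid == root.tid) continue;   // own time is never behind (HB usage)
            Node mine = thrMap.get(u.tid);
            if (mine == null) {
                mine = new Node(u.tid, 0);
                thrMap.put(u.tid, mine);
            } else if (mine.parent != null) {
                mine.parent.children.remove(mine); // detach (O(deg) here; the
            }                                      // paper uses O(1) linked lists)
            mine.clk = u.clk;
            mine.parent = root;
            root.children.addFirst(mine);      // re-attach under the root
        }
    }

    // Pre-order traversal of other's subtree, recursing into a child only if
    // its clock has progressed; otherwise direct monotonicity guarantees that
    // every descendant is already known, so the whole subtree is skipped.
    private void collectProgressed(Node u, List<Node> out) {
        out.add(u);
        for (Node v : u.children)
            if (get(v.tid) < v.clk) collectProgressed(v, out);
    }
}
```

Re-attaching under the root preserves the invariant that every node's subtree was known to it at its recorded time (the root has just learned the updated values), at the price of weaker pruning than the detach/attach scheme of Algorithm 2, which reproduces the source subtree shape.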


[Figure 6: (a) an example trace σ over threads t1–t5 and locks ℓ1, ℓ2, ℓ3 with sixteen events e1–e16, in order: acq(ℓ1), rel(ℓ1), acq(ℓ2), rel(ℓ2), acq(ℓ3), rel(ℓ3), acq(ℓ1), acq(ℓ3), rel(ℓ3), rel(ℓ1), acq(ℓ3), rel(ℓ3), acq(ℓ1), rel(ℓ1), acq(ℓ2), rel(ℓ2). (b) The sequence of tree-clock updates after HB processes each event ei of σ. Only the tree clock TCj of the thread j that performs ei is shown. When the thread performs a lock-release, the corresponding tree clock of the lock is mentioned in parentheses.]

Figure 6. Example run of HB and the updates on the corresponding clock-trees.

Algorithm 3: HB with tree clocks.
1 procedure acquire(t, ℓ)
2   Ct.Join(Cℓ)
3 procedure release(t, ℓ)
4   Cℓ.MonotoneCopy(Ct)

2. Indirect monotonicity: If v is a child of u and v.aclk ≤ C′.Get(u.tid), then for every descendant w of v we have w.clk ≤ C′.Get(w.tid).

The following lemma establishes the correctness of Algorithm 3 and follows immediately from the above invariants.

Lemma 4.3. The following assertions hold.
1. After Algorithm 3 processes an event ⟨t, acq(ℓ)⟩ we have Ct = Ct ⊔ Cℓ.
2. After Algorithm 3 processes an event ⟨t, rel(ℓ)⟩ we have Cℓ = Ct.

Data structure optimality. Just like vector clocks, computing HB with tree clocks takes O(n · k) time in the worst case. However, as we have seen, tree clocks can take sublinear time for performing join and copy operations, whereas vector clocks always require time proportional to the size of the vector (i.e., O(k)). A natural question arises: is there a more efficient data structure than tree clocks? More generally, what is the most efficient data structure for the HB algorithm to represent vector times?

To answer this question, we next define vector-time work, which intuitively gives a lower bound on the number of data structure operations the HB algorithm has to perform regardless of the actual data structure used to store vector times. Then, we show that tree clocks achieve this lower bound, hence yielding a notion of optimality for Algorithm 3.

Vector-time work. Consider the general HB algorithm (i.e., Algorithm 1) and let D = {C1, C2, ..., Cm} be the set of the vector-time data structures used. Consider the execution of the algorithm on an input trace σ. Given a number 1 ≤ i ≤ |σ|, we let C_j^i denote the vector time represented by Cj after the algorithm has processed the i-th event of σ. We define the vector-time work (or vt-work, for short) on σ as

VTWork(σ) = Σ_{1 ≤ i ≤ |σ|} Σ_j |{t ∈ Thrds : C_j^{i−1}(t) ≠ C_j^i(t)}|.

In words, for every event processed by the algorithm we add the number of vector-time entries that change as a result of processing the event. Thus, VTWork(σ) counts the total number of entries of vector times that have been updated in the overall course of the algorithm. Note that vt-work is independent of the actual data structure used to represent each Cj, and satisfies the following inequality

n ≤ VTWork(σ) ≤ n · k,

since with every event of σ the algorithm updates at least one entry of some Cj.

Vector-time optimality. Given an input trace σ, we denote by T_DS(σ) the time taken by the HB algorithm (Algorithm 1) to process σ using the data structure DS to store vector times. Intuitively, VTWork(σ) captures the number of times that instances of DS change state. For data structures that represent vector times explicitly, VTWork(σ) presents a natural lower bound for T_DS(σ). Hence, we say that the data structure DS is vt-optimal if T_DS(σ) = O(VTWork(σ)). It is not hard to see that vector clocks are not vt-optimal: taking DS = VC to be the vector clock data structure, one can construct simple traces σ where VTWork(σ) = O(n) but T_VC(σ) = O(n · k), and thus the running time is k times more than the vt-work that must be performed on σ. In contrast, we will show that tree clocks are vt-optimal, i.e., when DS = TC is the tree clock data structure, we always have T_TC(σ) = O(VTWork(σ)).

First remote acquires. Consider a trace σ and a lock-release event e = ⟨t, rel(ℓ)⟩ of σ, such that there exists a later acquire event e′ = ⟨t′, acq(ℓ)⟩ (e <σtr e′) performed by a different thread t′ ≠ t; the first such event e′ is called the first remote acquire of e.

SHB is the smallest partial order satisfying:
1. ≤σHB ⊆ ≤σSHB.
2. For every read event r, we have lwσ(r) ≤σSHB r.

Algorithm for SHB. Similarly to HB, the SHB partial order can be computed by a single pass of the input trace σ using vector times [25]. The SHB algorithm processes synchronization events (i.e., acq(ℓ)/rel(ℓ)) in the same manner as HB (Algorithm 1). In addition, for each variable x, the algorithm maintains one additional data structure LWx that stores the vector time of the last write event on x.

Algorithm 4: SHB with tree clocks.
1 procedure acquire(t, ℓ)
2   Ct.Join(Lℓ)
3 procedure release(t, ℓ)
4   Lℓ.MonotoneCopy(Ct)
5 procedure read(t, x)
6   Ct.Join(LWx)
7 procedure write(t, x)
8   LWx.CopyCheckMonotone(Ct)  // Deep or Monotone
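The handlers of Algorithm 4 can be rendered compactly as follows. This is our sketch of its structure: it uses map-based vector clocks in place of tree clocks, advances the writer's local clock at each write as one common convention, and uses a plain copy where Algorithm 4 uses CopyCheckMonotone.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the SHB event handlers (our illustration of Algorithm 4's
// structure, using plain map-based vector clocks instead of tree clocks).
class ShbSketch {
    final Map<String, Map<String, Integer>> clockOf = new HashMap<>();   // C_t
    final Map<String, Map<String, Integer>> lockClock = new HashMap<>(); // L_l
    final Map<String, Map<String, Integer>> lastWrite = new HashMap<>(); // LW_x

    private Map<String, Integer> clock(Map<String, Map<String, Integer>> m,
                                       String id) {
        return m.computeIfAbsent(id, k -> new HashMap<>());
    }

    private static void join(Map<String, Integer> dst, Map<String, Integer> src) {
        src.forEach((t, c) -> dst.merge(t, c, Math::max));
    }

    void acquire(String t, String l) { join(clock(clockOf, t), clock(lockClock, l)); }

    void release(String t, String l) {
        lockClock.put(l, new HashMap<>(clock(clockOf, t)));  // L_l <- C_t (copy)
    }

    // read(t, x): the reader is ordered after the last write on x.
    void read(String t, String x) { join(clock(clockOf, t), clock(lastWrite, x)); }

    // write(t, x): record the writer's current time as LW_x.
    void write(String t, String x) {
        Map<String, Integer> c = clock(clockOf, t);
        c.merge(t, 1, Integer::sum);                 // advance local time (convention)
        lastWrite.put(x, new HashMap<>(c));
    }
}
```

A write followed by a read of the same variable orders the reader after the writer, which is exactly condition 2 of the SHB definition.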

The SHB partial order. Schedulably-happens-before (SHB) is a strengthening of HB. It was introduced in [25] in the context of race detection, and has the property that for every 6 Experiments ∥σ two events e1, e2 of a trace σ, if e1 SHB e2, then σ can be ∗ soundly reordered to a trace σ that ends with e1, e2. The In this section we report on an implementation and experi- partial order SHB is defined as follows. Given a trace σ and mental evaluation of the tree clock data structure. The pri- a read event r let lwσ (r) be the last write event of σ before mary goal of these experiments is to evaluate the practical r with Variable(w) = Variable(r). Then, SHB is the smallest advantage of tree clocks over the vector clocks for keeping partial order that satisfies the following conditions. track of logical times in a concurrent program. 10 A Tree Clock Data Structure for Causal Orderings in Concurrent Executions Conference’17, July 2017, Washington, DC, USA
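For intuition, the per-event update rules of the SHB algorithm can also be sketched with plain vector clocks, which is what the tree-clock variant in Algorithm 4 accelerates. The class below is an illustrative sketch, not the paper's artifact: the names are ours, and the placement of the local-clock increments (at release and write events) is a common FastTrack-style convention that we assume here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Sketch of the SHB update rules with flat vector clocks of size k.
// C[t] is the clock of thread t, L[l] the clock of lock l, and LW[x]
// the clock of the last write on variable x (cf. Algorithm 4).
class ShbSketch {
    final int k;                                     // number of threads
    final int[][] C;                                 // thread clocks
    final Map<String, int[]> L = new HashMap<>();    // lock clocks
    final Map<String, int[]> LW = new HashMap<>();   // last-write clocks

    ShbSketch(int k) {
        this.k = k;
        this.C = new int[k][k];
        for (int t = 0; t < k; t++) C[t][t] = 1;     // local time starts at 1
    }

    // Pointwise maximum: always touches all k entries, unlike a tree clock.
    private static void join(int[] target, int[] src) {
        for (int i = 0; i < src.length; i++) target[i] = Math.max(target[i], src[i]);
    }

    void acquire(int t, String l) { join(C[t], L.computeIfAbsent(l, x -> new int[k])); }
    void release(int t, String l) { L.put(l, Arrays.copyOf(C[t], k)); C[t][t]++; }
    void read(int t, String x)    { join(C[t], LW.computeIfAbsent(x, y -> new int[k])); }
    void write(int t, String x)   { LW.put(x, Arrays.copyOf(C[t], k)); C[t][t]++; }
}
```

In this sketch, a read joins the last-write clock into the reader's clock, which realizes condition 2 of the SHB definition (lwσ(r) ≤σ_SHB r) on top of the usual HB lock rules.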

Implementation. Our implementation is in Java and closely follows Algorithm 2. For efficiency reasons, recursive routines have been made iterative. The thread map ThrMap is implemented as a list with random access, while the Chld(u) data structure for storing the children of node u is implemented as a doubly linked list. We also implemented Djit+-style optimizations [32] for both tree and vector clocks.

Benchmarks. Our benchmark set consists of standard benchmarks found in the recent literature, mostly in the context of race detection [20, 22, 25, 26, 31, 36]. It contains several concurrent programs taken from standard benchmark suites: the IBM Contest benchmark suite [12], the Java Grande suite [47], DaCapo [3], and SIR [10]. In order to fairly compare the performance of vector clocks and tree clocks, we logged the execution traces of each benchmark program using RV-Predict [38] and ran both the VC-based and TC-based algorithms on the same trace.

Setup. Our experimental setup consists of two implementations for each of the HB and SHB algorithms, one using vector clocks and the other using tree clocks to represent vector times (Algorithm 3 and Algorithm 4). Each algorithm was given as input a trace produced from the above benchmarks, and we measured the running time to construct the respective partial order. We do not include small benchmarks where this time is below 10ms, as these cases are very simple to handle using vector clocks and the tree clock data structure is not likely to offer any advantage. As our focus is on the impact of the vector-time data structure on timestamping, we did not perform any further analysis after constructing the partial order (e.g., detecting races). Although these partial orders are useful for various analyses, we remark that any analysis component will be identical under vector clocks and tree clocks, and thus does not contribute to the comparison of the two data structures.

Table 1. Running times for HB and SHB using vector clocks (VC) and tree clocks (TC). The geometric means of the speedups are GMeanHB = 2 and GMeanSHB = 1.75.

Benchmark     n      k    v      HB VC   HB TC   SHB VC   SHB TC
jigsaw        3.1M   12   103K   0.06s   0.07s   1.09s    0.73s
allocvector   29K    103  10K    0.07s   0.03s   0.29s    0.25s
startup       31M    5    307K   0.07s   0.05s   5.39s    2.34s
ftpserver     49K    12   5.5K   0.08s   0.05s   0.10s    0.12s
batik         157M   7    4.9M   0.09s   0.09s   42.73s   32.85s
xml           42M    43   1.9M   0.09s   0.10s   16.25s   4.34s
derby         1.4M   5    185K   0.10s   0.11s   0.51s    0.67s
biojava       221M   4    121K   0.12s   0.08s   40.23s   22.15s
elevator      211K   6    725    0.14s   0.09s   0.18s    0.21s
luindex       397M   3    2.5M   0.18s   0.15s   1m8s     49.36s
lusearch      217M   8    5.2M   0.19s   0.16s   51.69s   28.46s
cryptorsa     58M    9    1.7M   0.19s   0.15s   18.21s   6.68s
bbuffer       504K   16   34     0.21s   0.07s   0.37s    0.25s
zxing         546M   15   37M    0.21s   0.11s   4m33s    2m33s
chinserts     24M    4    5.8M   0.29s   0.12s   9.84s    11.22s
sor           606M   5    1.0M   0.31s   0.23s   2m47s    2m44s
chadddelete   8.0M   23   2.6M   0.31s   0.19s   3.76s    10.51s
hsqldb        18M    44   945K   0.34s   0.13s   9.76s    3.92s
eclipse       90M    15   10M    0.46s   0.27s   31.23s   20.27s
derby         471M   3    20M    0.52s   0.39s   1m27s    1m10s
tomcat        49M    48   6.1M   0.59s   0.31s   36.57s   20.56s
tradesoap     39M    221  2.8M   0.61s   0.15s   1m22s    13.79s
xalan         122M   7    4.4M   0.64s   0.39s   32.56s   21.62s
bufwriter     11M    7    56     0.74s   0.20s   2.18s    0.90s
tradebeans    39M    222  2.8M   0.93s   0.14s   1m20s    19.20s
chiterator    14M    16   3.0M   0.95s   0.34s   5.50s    4.08s
crypto        294M   43   13M    2.30s   0.63s   2m24s    41.75s
cassandra     259M   173  9.9M   37.43s  2.95s   7m41s    53.30s
raxextended   304M   26   48     49.94s  10.10s  1m11s    21.76s
Totals        -      -    -      1m38s   17.84s  29m7s    13m0s

Running times. The running times for vector clocks and tree clocks are shown in Table 1. We see that SHB requires significantly more time to construct than HB, for both vector clocks and tree clocks. This is expected, as HB only orders synchronization events (i.e., release/acquire events) across threads, while SHB also orders memory access events. We observe that tree clocks incur a significant speedup over vector clocks for both partial orders in most benchmarks. In the case of HB, the whole benchmark set is processed approximately 5.5 times faster using tree clocks, while for SHB it is processed 2.24 times faster. The speedup for SHB is smaller than for HB. This is expected, because SHB uses as many vector/tree clocks as there are variables. However, for some of these variables, the corresponding vector/tree clock is only used in a few join and copy operations. As the tree clock is a heavier data structure, the number of these operations is not large enough to balance the initialization overhead of tree clocks. Nevertheless, overall tree clocks deliver a generous speedup to both HB and SHB.

To get a better aggregate representation of the advantage of tree clocks, we have also computed the geometric mean of the ratios of vector-clock times over tree-clock times, averaged over all benchmarks. For HB, the speedup is GMeanHB = 2, while for SHB the speedup is GMeanSHB = 1.75. These numbers show that just by replacing vector clocks with tree clocks, the running time reduces on average to 50% for HB and to 57% for SHB, regardless of the total running time of the benchmark. These are significant reductions, especially coming from using a new data structure without any
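The aggregate numbers above are plain geometric means of the per-benchmark speedup ratios. A small sketch of that computation (the sample ratios below are illustrative placeholders, not the measured data):

```java
// Geometric mean of per-benchmark speedup ratios (VC time / TC time).
class GeoMean {
    static double geoMean(double[] ratios) {
        double logSum = 0.0;
        for (double r : ratios) logSum += Math.log(r);   // average in log-space
        return Math.exp(logSum / ratios.length);
    }

    public static void main(String[] args) {
        double[] hbRatios = {1.0, 2.0, 4.0};             // hypothetical VC/TC ratios
        double g = geoMean(hbRatios);                    // cbrt(1*2*4) = 2.0
        // A geometric-mean speedup of 2 means the TC time is, on average,
        // 100%/2 = 50% of the VC time.
        System.out.printf("GMean = %.2f, avg TC time = %.0f%% of VC%n", g, 100.0 / g);
        // prints: GMean = 2.00, avg TC time = 50% of VC
    }
}
```

The geometric mean is used (rather than an arithmetic mean) so that each benchmark contributes equally, regardless of its absolute running time.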

Figure 7. Comparing the number of operations (vt-work) performed using vector clocks and tree clocks in HB (y-axis is log-scale).

attempt to optimize other aspects, such as the algorithm that constructs the respective partial order.

Deep and monotone copies in SHB. Recall that in SHB with tree clocks, the processing of a write event leads to a CopyCheckMonotone operation, which might resolve to a deep copy between the tree clocks instead of a monotone one (Algorithm 4). If the number of deep copies is large, the advantage of tree clocks is lost, as the operation touches the whole data structure. Here we have evaluated our expectation that the frequency of deep copies is negligible compared to monotone copies. Indeed, we have seen that in all benchmarks, the number of deep copies is many orders of magnitude smaller than the number of monotone copies. For a few representative numbers, in the case of zxing, CopyCheckMonotone resolved 137M times to a monotone copy and only 43 times to a deep copy. In the case of hsqldb, the corresponding numbers are 2.3M and 25. Analogous observations hold for all other benchmarks.

Comparison with vt-work. We also investigate the total number of entries updated using each of the data structures. Recall that the metric VTWork(σ) (Section 4) measures the minimum number of updates that any implementation of the vector time must perform when computing a partial order. We can likewise define the metrics TCWork(σ) and VCWork(σ) as the number of entries updated when processing each event using, respectively, tree clocks and vector clocks. These metrics are visualized in Figure 7 for our benchmark suite and accurately explain the performance improvement of tree clocks over vector clocks (Table 1). The figure shows that the VCWork(σ)/VTWork(σ) ratio is often considerably large. In contrast, the ratio TCWork(σ)/VTWork(σ) is typically significantly smaller. The differences in running times between vector and tree clocks are a direct reflection of the discrepancies between TCWork(·) and VCWork(·). In fact, the benchmarks which have the highest ratios VCWork(σ)/TCWork(σ) also show a correspondingly high speedup (cassandra, tradebeans, hsqldb and raxextended). Next, we remark that TCWork(σ) is always at most a constant factor larger than VTWork(σ); our proof of Theorem 4.5 implies that TCWork(σ) ≤ 3 · VTWork(σ). The ratios in Figure 7 confirm this theoretical bound. Interestingly, for the benchmark bufwriter, we have TCWork(σ) ≃ 2.99 · VTWork(σ), i.e., this benchmark pushes tree clocks to their worst performance relative to vt-work.

7 Related Work and Conclusion

Here we discuss related work and potential applications of tree clocks in other analyses.

Other partial orders and tree clocks. As we have mentioned in the introduction, besides HB and SHB, many other partial orders are computed by dynamic analyses using vector clocks. In such cases, tree clocks can replace vector clocks either partially or completely, sometimes requiring small extensions to the data structure as presented here. To foster future research, we touch on these points here.

One such example is the DC partial order [36]. Although tree clocks can be directly applied here, a naive application might not be the most efficient. The reason is that the underlying algorithm maintains various queues which store copies of vector clocks. For a more efficient approach, one has to consider ways to alleviate the cost of deep copies of tree clocks. Another example is the WCP partial order [22]. The challenge here is that WCP does not contain the thread order, and hence, at first glance, the monotonicity and transitivity properties of tree clocks fail. However, because WCP composes with HB, these two properties only fail for the root of the tree clock, and apply as usual from the first level on. Thus, the join and monotone copy operations have to be adapted to account for this anomaly on the root. The SDP partial order has a similar flavor to WCP, but incurs fewer orderings [19]. Here, some tree clocks have to be further generalized to forest-like clocks that have multiple roots. Although such extensions are beyond the scope of this paper, they serve as fertile ground for future work.

Speeding up dynamic analyses. Vector-clock based dynamic race detection is known to be slow [39], which many prior works have aimed to mitigate. One of the most prominent performance bottlenecks is the linear dependence of the size of vector timestamps on the number of threads. Despite theoretical limits [6], prior research exploits special structure in traces [1, 7, 9, 13, 49] that enables succinct vector time representations. The Goldilocks [11] algorithm infers HB-orderings using locksets instead of vector timestamps, but incurs severe slowdown [16]. The FastTrack [16] optimization uses epochs for maintaining succinct access histories, and our work complements this optimization: tree clocks offer optimizations for the remaining clocks (thread and lock clocks). Other optimizations in clock representations are catered towards dynamic thread creation [33, 34, 51]. Another major source of slowdown is program instrumentation and expensive metadata synchronization. Several approaches have attempted to minimize this slowdown, including hardware assistance [8, 55], hybrid race detection [30, 53] based on the lockset principle [42], static analysis [17, 35], or sophisticated ownership protocols [5, 37, 52].

Conclusion. In this work we have introduced tree clocks, a new data structure for maintaining logical times in concurrent executions. In contrast to standard vector clocks, tree clocks can perform join and copy operations in sublinear time, thereby avoiding the traditional overhead of these operations when possible. Moreover, we have shown that tree clocks are vector-time optimal for computing the HB partial order. Finally, our experiments show that tree clocks effectively reduce the running time for computing the HB and SHB partial orders by significant factors, and thus offer a promising alternative to vector clocks in future research.


References

[1] Kunal Agrawal, Joseph Devietti, Jeremy T. Fineman, I-Ting Angelina Lee, Robert Utterback, and Changming Xu. 2018. Race Detection and Reachability in Nearly Series-Parallel DAGs. In SODA '18. SIAM, 156–171.
[2] Swarnendu Biswas, Jipeng Huang, Aritra Sengupta, and Michael D. Bond. 2014. DoubleChecker: Efficient Sound and Precise Atomicity Checking. In PLDI '14. ACM, 28–39. https://doi.org/10.1145/2594291.2594323
[3] Stephen M. Blackburn, Robin Garner, Chris Hoffmann, Asjad M. Khang, Kathryn S. McKinley, Rotem Bentzur, Amer Diwan, Daniel Feinberg, Daniel Frampton, Samuel Z. Guyer, Martin Hirzel, Antony Hosking, Maria Jump, Han Lee, J. Eliot B. Moss, Aashish Phansalkar, Darko Stefanović, Thomas VanDrunen, Daniel von Dincklage, and Ben Wiedermann. 2006. The DaCapo Benchmarks: Java Benchmarking Development and Analysis. In OOPSLA '06.
[4] Hans-J. Boehm. 2011. How to Miscompile Programs with "Benign" Data Races. In HotPar '11. USENIX Association.
[5] Michael D. Bond, Milind Kulkarni, Man Cao, Minjia Zhang, Meisam Fathi Salmi, Swarnendu Biswas, Aritra Sengupta, and Jipeng Huang. 2013. OCTET: Capturing and Controlling Cross-Thread Dependences Efficiently. In OOPSLA '13. ACM, 693–712. https://doi.org/10.1145/2509136.2509519
[6] Bernadette Charron-Bost. 1991. Concerning the size of logical clocks in distributed systems. Inform. Process. Lett. 39, 1 (1991), 11–16. https://doi.org/10.1016/0020-0190(91)90055-M
[7] Guang-Ien Cheng, Mingdong Feng, Charles E. Leiserson, Keith H. Randall, and Andrew F. Stark. 1998. Detecting Data Races in Cilk Programs That Use Locks. In SPAA '98. ACM, 298–309.
[8] Joseph Devietti, Benjamin P. Wood, Karin Strauss, Luis Ceze, Dan Grossman, and Shaz Qadeer. 2012. RADISH: Always-on Sound and Complete Race Detection in Software and Hardware. In ISCA '12. IEEE, 201–212.
[9] Dimitar Dimitrov, Martin Vechev, and Vivek Sarkar. 2015. Race Detection in Two Dimensions. In SPAA '15. ACM, 101–110. https://doi.org/10.1145/2755573.2755601
[10] Hyunsook Do, Sebastian G. Elbaum, and Gregg Rothermel. 2005. Supporting Controlled Experimentation with Testing Techniques: An Infrastructure and its Potential Impact. Empirical Software Engineering 10, 4 (2005), 405–435.
[11] Tayfun Elmas, Shaz Qadeer, and Serdar Tasiran. 2007. Goldilocks: A Race and Transaction-aware Java Runtime. In PLDI '07. ACM, 245–255. https://doi.org/10.1145/1250734.1250762
[12] Eitan Farchi, Yarden Nir, and Shmuel Ur. 2003. Concurrent Bug Patterns and How to Test Them. In IPDPS '03. IEEE. http://dl.acm.org/citation.cfm?id=838237.838485
[13] Mingdong Feng and Charles E. Leiserson. 1997. Efficient Detection of Determinacy Races in Cilk Programs. In SPAA '97. ACM, 1–11. https://doi.org/10.1145/258492.258493
[14] Colin Fidge. 1991. Logical Time in Distributed Computing Systems. Computer 24, 8 (Aug. 1991), 28–33. https://doi.org/10.1109/2.84874
[15] Colin J. Fidge. 1988. Timestamps in message-passing systems that preserve the partial ordering. In Proc. 11th Australian Computer Science Conference. 56–66.
[16] Cormac Flanagan and Stephen N. Freund. 2009. FastTrack: Efficient and Precise Dynamic Race Detection. In PLDI '09. ACM, 121–133. https://doi.org/10.1145/1542476.1542490
[17] Cormac Flanagan and Stephen N. Freund. 2013. RedCard: Redundant Check Elimination for Dynamic Race Detectors. In ECOOP 2013. Springer, 255–280.
[18] Cormac Flanagan, Stephen N. Freund, and Jaeheon Yi. 2008. Velodrome: A Sound and Complete Dynamic Atomicity Checker for Multithreaded Programs. In PLDI '08. ACM, 293–303. https://doi.org/10.1145/1375581.1375618
[19] Kaan Genç, Jake Roemer, Yufan Xu, and Michael D. Bond. 2019. Dependence-Aware, Unbounded Sound Predictive Race Detection. In OOPSLA 2019. To appear.
[20] Jeff Huang, Patrick O'Neil Meredith, and Grigore Rosu. 2014. Maximal Sound Predictive Race Detection with Control Flow Abstraction. In PLDI '14. ACM, 337–348. https://doi.org/10.1145/2594291.2594315
[21] Ayal Itzkovitz, Assaf Schuster, and Oren Zeev-Ben-Mordehai. 1999. Toward Integration of Data Race Detection in DSM Systems. J. Parallel Distrib. Comput. 59, 2 (Nov. 1999), 180–203. https://doi.org/10.1006/jpdc.1999.1574
[22] Dileep Kini, Umang Mathur, and Mahesh Viswanathan. 2017. Dynamic Race Prediction in Linear Time. In PLDI 2017. ACM, 157–170. https://doi.org/10.1145/3062341.3062374
[23] Leslie Lamport. 1978. Time, Clocks, and the Ordering of Events in a Distributed System. Commun. ACM 21, 7 (July 1978), 558–565. https://doi.org/10.1145/359545.359563
[24] Shan Lu, Soyeon Park, Eunsoo Seo, and Yuanyuan Zhou. 2008. Learning from Mistakes: A Comprehensive Study on Real World Concurrency Bug Characteristics. In ASPLOS XIII. ACM, 329–339. https://doi.org/10.1145/1346281.1346323
[25] Umang Mathur, Dileep Kini, and Mahesh Viswanathan. 2018. What Happens-after the First Race? Enhancing the Predictive Power of Happens-before Based Dynamic Race Detection. Proc. ACM Program. Lang. 2, OOPSLA, Article 145 (Oct. 2018), 29 pages. https://doi.org/10.1145/3276515
[26] Umang Mathur, Andreas Pavlogiannis, and Mahesh Viswanathan. 2021. Optimal Prediction of Synchronization-Preserving Races. In POPL '21. To appear.
[27] Umang Mathur and Mahesh Viswanathan. 2020. Atomicity Checking in Linear Time Using Vector Clocks. In ASPLOS '20. ACM, 183–199. https://doi.org/10.1145/3373376.3378475
[28] Friedemann Mattern. 1989. Virtual Time and Global States of Distributed Systems. In Parallel and Distributed Algorithms. Elsevier, 215–226.
[29] Madanlal Musuvathi, Shaz Qadeer, Thomas Ball, Gerard Basler, Piramanayagam Arumuga Nainar, and Iulian Neamtiu. 2008. Finding and Reproducing Heisenbugs in Concurrent Programs. In OSDI '08. USENIX Association, 267–280. http://dl.acm.org/citation.cfm?id=1855741.1855760
[30] Robert O'Callahan and Jong-Deok Choi. 2003. Hybrid Dynamic Data Race Detection. SIGPLAN Not. 38, 10 (June 2003), 167–178. https://doi.org/10.1145/966049.781528
[31] Andreas Pavlogiannis. 2019. Fast, Sound, and Effectively Complete Dynamic Race Prediction. Proc. ACM Program. Lang. 4, POPL, Article 17 (Dec. 2019), 29 pages. https://doi.org/10.1145/3371085
[32] Eli Pozniansky and Assaf Schuster. 2003. Efficient On-the-fly Data Race Detection in Multithreaded C++ Programs. SIGPLAN Not. 38, 10 (June 2003), 179–190. https://doi.org/10.1145/966049.781529
[33] Raghavan Raman, Jisheng Zhao, Vivek Sarkar, Martin Vechev, and Eran Yahav. 2012. Efficient data race detection for async-finish parallelism. Formal Methods in System Design 41, 3 (2012), 321–347. https://doi.org/10.1007/s10703-012-0143-7
[34] Veselin Raychev, Martin Vechev, and Manu Sridharan. 2013. Effective Race Detection for Event-Driven Programs. In OOPSLA '13. ACM, 151–166. https://doi.org/10.1145/2509136.2509538
[35] Dustin Rhodes, Cormac Flanagan, and Stephen N. Freund. 2017. BigFoot: Static Check Placement for Dynamic Race Detection. In PLDI 2017. ACM, 141–156. https://doi.org/10.1145/3062341.3062350
[36] Jake Roemer, Kaan Genç, and Michael D. Bond. 2018. High-coverage, Unbounded Sound Predictive Race Detection. In PLDI 2018. ACM, 374–389. https://doi.org/10.1145/3192366.3192385
[37] Jake Roemer, Kaan Genç, and Michael D. Bond. 2020. SmartTrack: Efficient Predictive Race Detection. In PLDI 2020. ACM, 747–762. https://doi.org/10.1145/3385412.3385993
[38] Grigore Rosu. 2018. RV-Predict, Runtime Verification. Accessed: 2018-04-01.
[39] Caitlin Sadowski and Jaeheon Yi. 2014. How Developers Use Data Race Detection Tools. In PLATEAU '14. ACM, 43–51. https://doi.org/10.1145/2688204.2688205
[40] Mahmoud Said, Chao Wang, Zijiang Yang, and Karem Sakallah. 2011. Generating Data Race Witnesses by an SMT-based Analysis. In NFM '11. Springer, 313–327. http://dl.acm.org/citation.cfm?id=1986308.1986334
[41] Malavika Samak and Murali Krishna Ramanathan. 2014. Trace Driven Dynamic Deadlock Detection and Reproduction. In PPoPP '14. ACM, 29–42. https://doi.org/10.1145/2555243.2555262
[42] Stefan Savage, Michael Burrows, Greg Nelson, Patrick Sobalvarro, and Thomas Anderson. 1997. Eraser: A Dynamic Data Race Detector for Multithreaded Programs. ACM Trans. Comput. Syst. 15, 4 (Nov. 1997), 391–411. https://doi.org/10.1145/265924.265927
[43] Reinhard Schwarz and Friedemann Mattern. 1994. Detecting causal relationships in distributed computations: In search of the holy grail. Distributed Computing 7, 3 (1994), 149–174.
[44] Konstantin Serebryany and Timur Iskhodzhanov. 2009. ThreadSanitizer: Data Race Detection in Practice. In WBIA '09.
[45] Yao Shi, Soyeon Park, Zuoning Yin, Shan Lu, Yuanyuan Zhou, Wenguang Chen, and Weimin Zheng. 2010. Do I Use the Wrong Definition? DeFuse: Definition-use Invariants for Detecting Concurrency and Sequential Bugs. In OOPSLA '10. ACM, 160–174. https://doi.org/10.1145/1869459.1869474
[46] Yannis Smaragdakis, Jacob Evans, Caitlin Sadowski, Jaeheon Yi, and Cormac Flanagan. 2012. Sound Predictive Race Detection in Polynomial Time. In POPL '12. ACM, 387–400. https://doi.org/10.1145/2103656.2103702
[47] L. A. Smith, J. M. Bull, and J. Obdrzálek. 2001. A Parallel Java Grande Benchmark Suite. In SC '01. ACM. https://doi.org/10.1145/582034.582042
[48] Martin Sulzmann and Kai Stadtmüller. 2018. Two-Phase Dynamic Analysis of Message-Passing Go Programs Based on Vector Clocks. In PPDP '18. ACM, Article 22, 13 pages. https://doi.org/10.1145/3236950.3236959
[49] Rishi Surendran and Vivek Sarkar. 2016. Dynamic determinacy race detection for task parallelism with futures. In RV 2016. Springer, 368–385.
[50] Tengfei Tu, Xiaoyu Liu, Linhai Song, and Yiying Zhang. 2019. Understanding Real-World Concurrency Bugs in Go. In ASPLOS '19. ACM, 865–878. https://doi.org/10.1145/3297858.3304069
[51] Xinli Wang, J. Mayo, W. Gao, and J. Slusser. 2006. An Efficient Implementation of Vector Clocks in Dynamic Systems. In PDPTA '06.
[52] Benjamin P. Wood, Man Cao, Michael D. Bond, and Dan Grossman. 2017. Instrumentation Bias for Dynamic Data Race Detection. Proc. ACM Program. Lang. 1, OOPSLA, Article 69 (Oct. 2017), 31 pages. https://doi.org/10.1145/3133893
[53] Yuan Yu, Tom Rodeheffer, and Wei Chen. 2005. RaceTrack: Efficient Detection of Data Race Conditions via Adaptive Tracking. SIGOPS Oper. Syst. Rev. 39, 5 (Oct. 2005), 221–234. https://doi.org/10.1145/1095809.1095832
[54] M. Zhivich and R. K. Cunningham. 2009. The Real Cost of Software Errors. IEEE Security and Privacy 7, 2 (March 2009), 87–90. https://doi.org/10.1109/MSP.2009.56
[55] P. Zhou, R. Teodorescu, and Y. Zhou. 2007. HARD: Hardware-Assisted Lockset-based Race Detection. In HPCA 2007. IEEE, 121–132. https://doi.org/10.1109/HPCA.2007.346191


A Proofs

Lemma 4.1 (Monotonicity of copies). Whenever Algorithm 1 processes a lock-release event ⟨t, rel(ℓ)⟩, we have C_ℓ ⊑ C_t.

Proof. Consider a trace σ, a release event rel(ℓ), and let acq(ℓ) be the matching acquire event. When acq(ℓ) is processed, the algorithm performs C_t ← C_t ⊔ C_ℓ, and thus C_ℓ ⊑ C_t after this operation. By lock semantics, there exists no release event rel′(ℓ) such that acq(ℓ) <_σ rel′(ℓ) <_σ rel(ℓ), hence C_ℓ is not modified until rel(ℓ) is processed. Since vector clock entries are never decremented, when rel(ℓ) is processed we still have C_ℓ ⊑ C_t, as desired. □

Lemma 4.2. Consider any tree clock C and node u of C.T. For any tree clock C′, the following assertions hold.
1. Direct monotonicity: If u.clk ≤ C′.Get(u.tid), then for every descendant w of u we have w.clk ≤ C′.Get(w.tid).
2. Indirect monotonicity: If v is a child of u and v.aclk ≤ C′.Get(u.tid), then for every descendant w of v we have w.clk ≤ C′.Get(w.tid).

Proof. First, note that after initialization u has no children, hence each statement is trivially true. Now assume that both statements hold when the algorithm processes an event e; we show that they both hold after the algorithm has processed e. We distinguish cases based on the type of e.

e = ⟨t, acq(ℓ)⟩. The algorithm performs the operation C_t.Join(L_ℓ), hence the only tree clock modified is C_t, and thus it suffices to examine the cases that C_t is C and C_t is C′.

1. C_t is C. First consider the case that u = C_t.T.root. Observe that u.clk > C′.Get(u.tid), and thus Item 1 holds trivially. For Item 2, we distinguish cases based on whether v.clk has progressed by the Join operation. If yes, then we have v.aclk = u.clk, and the statement holds trivially for the same reason as in Item 1. Otherwise, for every descendant w of v, the clock w.clk has not progressed by the Join operation, hence the statement holds by the induction hypothesis on C_t. Now consider the case that u ≠ C_t.T.root. If u.clk has not progressed by the Join operation, then each statement holds by the induction hypothesis on C_t. Otherwise, using the induction hypothesis one can show that for every descendant w of u, there exists a node w_ℓ of L_ℓ that is a descendant of a node u_ℓ such that w_ℓ.tid = w.tid and u_ℓ.tid = u.tid. Then, each statement holds by the induction hypothesis on L_ℓ.

2. C_t is C′. For Item 1, if u.clk ≤ C′.Get(u.tid) holds before the Join operation, then the statement holds by the induction hypothesis, since Join does not decrease the clocks of C_t. Otherwise, the statement follows by the induction hypothesis on L_ℓ. The analysis for Item 2 is similar. The desired result follows.

e = ⟨t, rel(ℓ)⟩. The algorithm performs the operation L_ℓ.MonotoneCopy(C_t). After this operation L_ℓ is isomorphic to C_t, and both statements follow by the induction hypothesis on C_t. □

Lemma 4.3. The following assertions hold.
1. After Algorithm 3 processes an event ⟨t, acq(ℓ)⟩, we have C_t = C_t ⊔ L_ℓ.
2. After Algorithm 3 processes an event ⟨t, rel(ℓ)⟩, we have C_ℓ = C_t.

Proof. The lemma follows directly from Lemma 4.2. In each case, if the clock of a node w of the tree clock that performs the corresponding operation (i.e., Join for an event ⟨t, acq(ℓ)⟩ and MonotoneCopy for ⟨t, rel(ℓ)⟩) does not progress, then we are guaranteed that w.clk is not smaller than the time of thread w.tid in the tree clock that is passed as an argument to the operation. □

Lemma 4.4. Consider the execution of Algorithm 3 on a trace σ. For every tree clock C_i and node u of C_i.T other than the root, the following assertions hold.
1. u points to a lock-release event rel(ℓ).
2. rel(ℓ) has a first remote acquire acq(ℓ), and (v.tid, u.aclk) points to acq(ℓ), where v is the parent of u in C_i.T.

Proof. The lemma follows by a straightforward induction on σ. □

Theorem 4.5 (Tree-clock Optimality). For any input trace σ, we have T_TC(σ) = O(VTWork(σ)).

Proof. Consider a critical section of a thread t on lock ℓ, marked by two events acq(ℓ), rel(ℓ). We define the following vector times.
1. V_t^1 and V_t^2 are the vector times of C_t right before and right after acq(ℓ) is processed, respectively.
2. V_ℓ^1 is the vector time of C_ℓ right before acq(ℓ) is processed.
3. V_t^3 is the vector time of C_t right before rel(ℓ) is processed.
4. V_ℓ^3 and V_ℓ^4 are the vector times of C_ℓ right before and right after rel(ℓ) is processed, respectively.
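The four snapshots just defined can be made concrete with a toy critical section. In the following sketch, the flat-list vector times and the helper `join` are our own illustration (not the paper's tree-clock algorithms); the point is that the number of entries that actually change at the acquire and at the release is exactly the vt-work that the proof charges to the two tree-clock operations.

```python
# A toy critical section over k = 4 threads, illustrating the four
# vector-time snapshots around acq(l)/rel(l).  List representation and
# helper names are illustrative assumptions, not the paper's code.

def join(v1, v2):
    """Pointwise maximum of two vector times."""
    return [max(a, b) for a, b in zip(v1, v2)]

Vt1 = [3, 0, 1, 0]    # V_t^1: C_t right before acq(l)
Vl1 = [1, 2, 1, 0]    # V_l^1: C_l right before acq(l)
Vt2 = join(Vt1, Vl1)  # V_t^2: C_t right after acq(l) = [3, 2, 1, 0]

Vt3 = [4, 2, 1, 0]    # V_t^3: C_t right before rel(l); note V_t^1 <= V_t^3
Vl3 = Vl1             # V_l^3 = V_l^1: no other release touched C_l meanwhile
Vl4 = list(Vt3)       # V_l^4: C_l right after rel(l) (copy of C_t)

# vt-work: entries that actually changed at each operation.
WJ = sum(1 for a, b in zip(Vt1, Vt2) if a != b)  # 1 (only thread 1 advanced)
WC = sum(1 for a, b in zip(Vl3, Vl4) if a != b)  # 1 (only thread 0 advanced)
print(WJ, WC)  # 1 1
```

A plain vector-clock implementation pays Θ(k) = 4 per operation here even though the vt-work of each operation is 1; the theorem shows that tree clocks meet the vt-work bound asymptotically.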

First, note that (i) V_t^1 ⊑ V_t^3, and (ii) due to lock semantics, we have V_ℓ^3 = V_ℓ^1. Let W = W_J + W_C, where

    W_J = |{t′ : V_t^2(t′) ≠ V_t^1(t′)}|   and   W_C = |{t′ : V_ℓ^4(t′) ≠ V_ℓ^3(t′)}|,

i.e., W_J and W_C are the vt-work for handling acq(ℓ) and rel(ℓ), respectively. Let T_J be the time spent in C_t.Join due to acq(ℓ). Similarly, let T_C be the time spent in C_ℓ.MonotoneCopy due to rel(ℓ). We will argue that T_J = O(W_J) and T_C = O(W_C), and thus T_J + T_C = O(W). Note that this proves the theorem, simply by summing over all critical sections of σ.

We start with T_J. Observe that the time spent in this operation is proportional to the number of times the loop in Line 37 is executed, i.e., the number of nodes v′ that the loop iterates over. Consider the if statement in Line 38. If Get(v′.tid) < v′.clk, then we have V_t^2(v′.tid) > V_t^1(v′.tid), and thus this iteration is accounted for in W_J. On the other hand, if Get(v′.tid) > v′.clk, then we have V_t^1(v′.tid) > V_ℓ^1(v′.tid). Due to (i) and (ii) above, we have V_ℓ^4(v′.tid) > V_ℓ^3(v′.tid), and thus this iteration is accounted for in W_C. Finally, consider the case that Get(v′.tid) = v′.clk, and let v be the node of C_t such that v.tid = v′.tid. There can be at most one such v that is not the root of C_t. For every other such v, let u = Prnt(v). Note that v′ is not the root of C_ℓ, and let u′ = Prnt(v′). Let rel′(ℓ) be the lock-release event that v and v′ point to. By Lemma 4.4, rel′(ℓ) has a first remote acquire acq′(ℓ) such that (i) u.tid = u′.tid = t′, where t′ is the thread of acq′(ℓ), and (ii) v.aclk is the local clock of acq′(ℓ). Since getUpdatedNodesForCopy examines v′, we must have u′.clk > u.clk. In turn, we have u.clk ≥ v.aclk, and thus u′.clk > v.aclk. Hence, due to Line 41, u′ can have at most one child v′ with v′.clk = Get(v′.tid). Thus, we can account for the time of this case in W_J. Hence, T_J = O(W_J), as desired.

We now turn our attention to T_C. Similarly to the previous case, the time spent in this operation is proportional to the number of times the loop in Line 66 is executed. Consider the if statement in Line 67. If Get(v′.tid) < v′.clk, then we have V_ℓ^4(v′.tid) > V_ℓ^3(v′.tid), and thus this iteration is accounted for in W_C. Note that since the copy is monotone (Lemma 4.1), we cannot have Get(v′.tid) > v′.clk. Finally, the reasoning for the case where Get(v′.tid) = v′.clk is similar to the analysis of T_J, using Line 72 instead of Line 41. Hence, T_C = O(W_C), as desired.

The desired result follows. □
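The accounting above rests on the fact that Lemma 4.2 lets the tree-clock operations skip entire subtrees whose root carries no new information. The following is a minimal sketch of that pruning under our own simplifying assumptions: the class and function names are ours, the target clock is a plain tid-to-clock dict, and the aclk-based indirect pruning of Lines 41/72 is omitted.

```python
# Sketch of subtree pruning via direct monotonicity (Lemma 4.2, Item 1).
# Node/updated_nodes are illustrative names, not the paper's algorithms.

class Node:
    def __init__(self, tid, clk, aclk=0, children=None):
        self.tid, self.clk, self.aclk = tid, clk, aclk
        self.children = children or []

def updated_nodes(u, other):
    """Nodes of u's subtree whose time is ahead of `other` (tid -> clk).
    If `other` already knows u's time, Lemma 4.2(1) guarantees it also
    knows every descendant's time, so the whole subtree is skipped."""
    if u.clk <= other.get(u.tid, 0):
        return []  # prune: no descendant can carry newer information
    out = [u]
    for v in u.children:
        out += updated_nodes(v, other)
    return out

# `other` already knows thread 2's time, so the subtree under thread 2
# (including thread 3) is never visited:
root = Node(1, 10, children=[
    Node(2, 4, aclk=7, children=[Node(3, 2, aclk=3)]),
    Node(4, 6, aclk=5),
])
other = {1: 8, 2: 4, 3: 2, 4: 1}
print([n.tid for n in updated_nodes(root, other)])  # [1, 4]
```

Only the nodes that actually advance the target clock (plus, in the full algorithm, at most one extra child per such node) are ever touched, which is what makes the total running time proportional to the vt-work.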
