<<

PAM: Parallel Augmented Maps

Yihan Sun Daniel Ferizovic Guy E. Belloch Carnegie Mellon University Karlsruhe Institute of Technology Carnegie Mellon University [email protected] [email protected] [email protected] Abstract 1 Introduction Ordered (key-value) maps are an important and widely-used The map type (also called key-value store, dictionary, for large-scale data processing frameworks. Beyond table, or ) is one of the most important da- simple search, insertion and deletion, more advanced oper- ta types in modern large-scale data analysis, as is indicated ations such as range extraction, filtering, and bulk updates by systems such as F1 [60], Flurry [5], RocksDB [57], Or- form a critical part of these frameworks. acle NoSQL [50], LevelDB [41]. As such, there has been We describe an for ordered maps that is augment- significant interest in developing high-performance paral- ed to support fast range queries and sums, and introduce a lel and concurrent and implementations of maps parallel and concurrent called PAM (Parallel Augment- (e.g., see Section 2). Beyond simple insertion, deletion, and ed Maps) that implements the interface. The interface includes search, this work has considered “bulk” functions over or- a wide variety of functions on augmented maps ranging from dered maps, such as unions [11, 20, 33], bulk-insertion and basic insertion and deletion to more interesting functions such bulk-deletion [6, 24, 26], and range extraction [7, 14, 55]. as union, intersection, filtering, extracting ranges, splitting, One particularly useful function is to take a “sum” over a and range-sums. We describe algorithms for these functions range of keys, where sum means with respect to any associa- that are efficient both in theory and practice. tive combine function (e.g., addition, maximum, or union). As As examples of the use of the interface and the performance an example of such a range sum consider a of sales of PAM we apply the library to four applications: simple range receipts keeping the value of each sale ordered by the time of sums, interval trees, 2D range trees, and ranked word index sale. When analyzing the data for certain trends, it is likely searching. The interface greatly simplifies the implementation useful to quickly query the sum or maximum of sales during of these data over direct implementations. Sequen- a period of time. Although such sums can be implemented tially the code achieves performance that matches or exceeds naively by scanning and summing the values within the key existing libraries designed specially for a single application, range, these queries can be answered much more efficiently and in parallel our implementation gets speedups ranging using augmented trees [18, 25]. For example, the sum of any from 40 to 90 on 72 cores with 2-way hyperthreading. range on a map of size 푛 can be answered in 푂(log 푛) time. This bound is achieved by using a balanced binary and CCS Concepts • and its engineering → General augmenting each with the sum of the subtree. programming languages; • Social and professional topics Such a data can also implement a significantly → History of programming languages; more general form of queries efficiently. In the sales receipt- ACM Format: s example they can be used for reporting the sales above a Yihan Sun, Daniel Ferizovic, and Guy E. Belloch. 2018. PAM: Par- threshold in 푂(푘 log(푛/푘 + 1)) time (푘 is the output size) if allel Augmented Maps. In Proceedings of PPoPP ’18: 23nd ACM the augmentation is the maximum of sales, or in 푂(푘 + log 푛) SIGPLAN Symposium on Principles and Practice of Parallel Pro- time [44] with a more complicated augmentation. More gen- gramming, Vienna, Austria, February 24–28, 2018 (PPoPP ’18), erally they can be used for interval queries, 푘-dimensional 15 pages. range queries, inverted indices (all described later in the pa- https://doi.org/10.1145/3178487.3178509 per), segment intersection, windowing queries, point location, rectangle intersection, range overlaps, and many others. Permission to make digital or hard copies of all or part of this work for Although there are dozens of implementations of efficient personal or classroom use is granted without fee provided that copies are not range sums, there has been very little work on parallel or made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components concurrent implementations—we know of none for the gen- of this work owned by others than ACM must be honored. Abstracting with eral case, and only two for specific applications2 [ , 34]. In credit is permitted. To copy otherwise, or republish, to post on servers or to this paper we present a general library called PAM (Parallel redistribute to lists, requires prior specific permission and/or a fee. Request Augmented Maps) for supporting in-memory parallel and permissions from [email protected]. concurrent maps with range sums. PAM is implemented in PPoPP ’18, February 24–28, 2018, Vienna, Austria ++. We use augmented value to refer to the abstract “sum” © 2018 Association for Computing Machinery. ACM ISBN 978-1-4503-4982-6/18/02. . . $15.00 on a map (defined in Section 3). When creating a map type https://doi.org/10.1145/3178487.3178509 the user specifies two augmenting functions chosen based on

290 PPoPP ’18, February 24–28, 2018, Vienna, Austria Yihan Sun, Daniel Ferizovic, and Guy E. Belloch

In Theory In Practice (Asymptotic bound) (Running Time in seconds) Application Construct Construct Query Query Work Span Size Seq. Par. Spd. Size Seq. Par. Spd. Range Sum 푂(푛 log 푛) 푂(log 푛) 푂(log 푛) 1010 1844.38 28.24 65.3 108 271.09 3.04 89.2 푂(푛 log 푛) 푂(log 푛) 푂(log 푛) 108 14.35 0.23 63.2 108 53.35 0.58 92.7 2d 푂(푛 log 푛) 푂(log3 푛) 푂(log2 푛) 108 197.47 3.10 63.7 106 48.13 0.55 87.5 Inverted Index 푂(푛 log 푛) 푂(log2 푛) * 1.96 × 109 1038 12.6 82.3 105 368 4.74 77.6 Table 1. The asymptotic cost and experimental results of the applications using PAM. Seq. = sequential, Par. = Parallel (on 72 cores with 144 hyperthreads), Spd. = Speedup. “Work” and “Span” are used to evaluate the theoretical bound of parallel algorithms (see Section 4). *: Depends on the query. their application: a base function 푔 which gives the augment- a map using integer addition. For this case we report both ed value of a single element, and a combine function 푓 which sequential and parallel times for a wide variety of operations combines multiple augmented values, giving the augment- (union, search, multi-insert, range-sum, insertion, deletion, ed value of the map. The library can then make use of the filter, build). We also present performance comparisons to functions to keep “partial sums” (augmented values of sub- other implementations of maps that do not support augment- maps) in a tree that asymptotically improve the performance ed values. Secondly we use augmented maps to implement of certain queries. interval trees. An interval tree maintains a of intervals (e.g. Augmented maps in PAM support standard functions on the intervals of times in which users are logged into a site, or ordered maps (which maintain the partial sums), as well as the intervals of time for FTP connections) and can quickly additional function specific to augmented maps (see Figure 1 answer queries such as if a particular point is covered by for a partial list). The standard functions include simple func- any interval (e.g. is there any user logged in at a given time). tions such as insertion, and bulk functions such as union. The Thirdly we implement 2d range trees. Such trees maintain a functions specific to augmented maps include efficient range- set of points in 2 dimensions and allow one to count or list all sums, and filtering based on augmented values. PAM uses entries within a given rectangular range (e.g. how many users theoretically efficient parallel algorithms for all bulk func- are between 20 and 25 years old and have salaries between tions, and is implemented based on using the “join” function $50K and $90K). Such counting queries can be answered to support parallelism on balanced trees [11]. We extend the in 푂(log2 푛) time. We present performance comparison to approach of using joins to augmented values, and also the sequential range-tree structure available in CGAL [47]. give algorithms based on “join” for some other operations Finally we implement a weighted inverted index that supports such as filtering, multi-insert, and mapReduce. and/or queries, which can quickly return the top 푘 matches. PAM uses functional data structures and hence the maps The theoretical cost and practical performance of these four are fully persistent—updates will not modify an existing map applications are shown in Table 1. but will create a new version [22]. Persistence is useful in The main contributions of this paper are: various applications, including the range tree and inverted 1. An interface for augmented maps (Section 3). index applications described in this paper. It is also useful 2. Efficient parallel algorithms and an implementation for in supporting a form of concurrency based on snapshot iso- the interface as part of the PAM library (Section 4). lation. In particular each concurrent process can atomically 3. Four example applications of the interface (Section 5). read a snapshot of a map, and can manipulate and modify 4. Experimental analysis of the examples (Section 6). their “local” copy without affecting the view of other users, or being affected by any other concurrent modification to the 2 Related Work shared copy1. However PAM does not directly support tradi- Many researchers have studied concurrent algorithms and tional concurrent updates to a shared map. Instead concurrent implementations of maps based on balanced search trees fo- updates need to be batched and applied in bulk in parallel. cusing on insertion, deletion and search [6, 10, 13, 15, 21, We present examples of four use cases for PAM along 24, 26, 36, 37, 40, 46]. Motivated by applications in data with experimental performance numbers. Firstly we consid- analysis recently researchers have considered mixing atomic er the simple case of maintaining the sum of the values in range scanning with concurrent updates (insertion and dele- tion) [7, 14, 55]. None of them, however, has considered 1Throughout the paper we use parallel to indicate using multiple processors to work on a single bulk function, such as multi-insertion or filtering, and we sub-linear time range sums. use concurrent to indicate independent “users” (or processes) asynchronously There has also been significant work on parallel algorithms accessing the same structure at the same time. and implementations of bulk operations on ordered maps and

291 PAM: Parallel Augmented Maps PPoPP ’18, February 24–28, 2018, Vienna, Austria sets [3, 11, 12, 20, 24, 26, 33, 52, 53]. Union and intersection, (Partial) Interface AugMap AM(퐾, 푉, 퐴, <, 푔, 푓, 퐼) : for example, are available as part of the multicore version of empty or ∅ : 푀 the C++ Standard Library [26]. Again none of this size : 푀 → Z work has considered fast range sums. There has been some single : 퐾 × 푉 → 푀 work on parallel data structures for specific applications of find : 푀 × 퐾 → 푉 ∪ {} range sums such as range trees [34]. insert : 푀 × 퐾 × 푉 × (푉 × 푉 → 푉 ) → 푀 There are many theoretical results on efficient sequential union : 푀 × 푀 × (푉 × 푉 → 푉 ) → 푀 data-structures and algorithms for range-type queries using filter :(퐾 × 푉 → bool) × 푀 → 푀 augmented trees in the context of specific applications such upTo : 푀 × 퐾 → 푀 as interval queries, k-dimensional range sums, or segment range : 푀 × 퐾 × 퐾 → 푀 intersection queries (see e.g. [43]). Several of these approach- mapReduce :(퐾 × 푉 → 퐵) × (퐵 × 퐵 → 퐵) × 퐵 es have been implemented as part of systems [35, 47]. Our × 푀 → 퐵 work is motivated by this work and our goal is to parallelize build :(퐾 × 푉 ) seq. × (푉 × 푉 → 푉 ) → 푀 many of these ideas and put them in a framework in which it augVal : 푀 → 퐴 is much easier to develop efficient code. We know of no other augLeft : 푀 × 퐾 → 퐴 general framework as described in Section 3. augRange : 푀 × 퐾 × 퐾 → 퐴 Various forms of range queries have been considered in the augFilter :(퐴 → bool) × 푀 → 푀 context of relational [16, 27–29, 31]. Ho et. al. [31] augProject :(퐴 → 퐵) × (퐵 × 퐵 → 퐵) specifically consider fast range sums. However the workis × 푀 × 퐾 × 퐾 → 퐵 sequential, only applies to static data, and requires that the sum function has an inverse (e.g. works for addition, but not Figure 1. The (partial) interface for an , with maximum). More generally, we do not believe that traditional key type 퐾, value type 푉 , and augmented value type 퐴. The (flat) relational databases are well suited for our approach augmenting is (퐴, 푓, 퐼). Other functions not listed since we use arbitrary data types for augmentation—e.g. our include delete, intersect, difference, split, join, downTo, 2d range tree has augmented maps nested as their augmented previous, next, rank, and select. In the table seq. means a values. Recently there has been interest in range queries in sequence. large clusters under systems such as Hadoop [2, 4]. Although As an example, the augmented map type: these systems can extract ranges in work proportional to the AM(Z, < , Z, Z, (푘, 푣) → 푣, + , 0) (1) number of elements in the range (or close to it), they do not Z Z support fast range sums. None of the “” systems based defines an augmented map with integer keys and values, or- on key-value stores [5, 41, 50, 57] support fast range sums. dered by

292 PPoPP ’18, February 24–28, 2018, Vienna, Austria Yihan Sun, Daniel Ferizovic, and Guy E. Belloch sums) in a tree structure. Table 2 gives the asymptotic costs can be applied to AVL trees [1] red-black trees [8], weight- of the functions based on the implementation described in balanced trees and [59]. We implemented all of them in Section 4, which uses augmented balanced search trees. The PAM. By default, we use weight-balanced trees [48] in PAM, function augVal(푚) returns 풜(푚), which is equivalent to because it does not require extra balancing criteria in each mapReduce(푔, 푓, 퐼, 푚) but can run in constant instead of node (the node size is already stored), but users can change linear work. This is because the functions 푓 and 푔 are cho- to any specific balancing scheme using C++ templates. sen ahead of time and integrated into the augmented map Augmentation. We implement augmentation by storing with data type, and therefore the sum can be maintained during every tree node the augmented sum of the subtree rooted updates. The function augRange(푚, 푘1, 푘2) is equivalent to at that node. This localizes application of the augmentation augVal(range(푚, 푘1, 푘2)) and augLeft(푚, 푘) is equivalen- functions 푓 and 푔 to when a node is created or updated2. In t to augVal(upTo(푚, 푘)). These can also be implemented particular when creating a node with a left child 퐿, right child efficiently using the partial sums. 푅, key 푘 and value 푣 the augmented value can be calculated The last two functions accelerate two common queries on as 푓(풜(퐿), 푓(푔(푘, 푣), 풜(푅))), where 풜(·) extracts the aug- augmented maps, but are only applicable when their func- mented value from a node. Note that it takes two applications tion arguments meet certain requirements. They also can be of 푓 since we have to combine three values, the left, middle computed using the plain map functions, but can be much and right. We do not store 푔(푘, 푣). In our algorithms, creation more efficient when applicable. The augFilter(ℎ, 푚) func- of new nodes is handled in JOIN, which also deals with rebal- tion is equivalent to filter(ℎ′, 푚), where ℎ′ : 퐾 × 푉 ↦→ ancing when needed. Therefore all the algorithms and code bool satisfies ℎ(푔(푘, 푣)) ⇔ ℎ′(푘, 푣). It is only applicable if that do not explicitly need the augmented value are unaffected ℎ(푎) ∨ ℎ(푏) ⇔ ℎ(푓(푎, 푏)) for any 푎 and 푏 (∨ is the logical by and even oblivious of augmentation. or). In this case the filter function can make use of the par- Parallelism. We use fork-join parallelism to implement in- tial sums. For example, assume the values in the map are ternal parallelism for the bulk operations. In pseudocode the boolean values, 푓 is a logical-or, 푔(푘, 푣) = 푣, and we want notation “푠 || 푠 ” means that the two statements 푠 and 푠 ′ 1 2 1 2 to filter the map using function ℎ (푘, 푣) = 푣. In this case can run in parallel, and when both are finished the overall we can filter out a whole sub-map once we see ithas false statement finishes. In most cases parallelism is over the struc- as a partial sum. Hence we can set ℎ(푎) = 푎 and directly ture of the trees—i.e. applying some function in parallel over use augFilter(푚, ℎ). The function is used in interval trees the two children, and applying this parallelism in a nested ′ ′ (Section 5.1). The augProject(푔 , 푓 , 푚, 푘1, 푘2) function is fashion recursively (Figure 2 shows several examples). The ′ equivalent to 푔 (augRange(푚, 푘1, 푘2)). It requires, however, only exception is build, where we use parallelism in a sort ′ ′ ′ ′ ′ that (퐵, 푓 , 푔 (퐼)) is a monoid and that 푓 (푔 (푎), 푔 (푏)) = and in removing duplicates. In the PAM library fork-join ′ 푔 (푓(푎, 푏)). This function is useful when the augmented val- parallelism is implemented with the cilkplus extensions to ues are themselves maps or other large data structures. It C++ [39]. We have a granularity set so parallelism is not used allows projecting the augmented values down onto another on very small trees. type by 푔′ (e.g. project augmented values with complicated Theoretical bounds. To analyze the asymptotic costs in the- structures to values like their sizes) then summing them by ory we use work (푊 ) and span (푆), where work is the total 푓 ′, and is much more efficient when applicable. For exam- number of operations (the sequential cost) and span is the ple in range trees where each augmented value is itself an length of the critical path [18]. Almost all the algorithms augmented map, it greatly improves performance for queries. we describe, and implemented in PAM, are asymptotically optimal in terms of work in the comparison model, i.e., the 4 and Algorithms total number of comparisons. Furthermore they achieve very In this section we outline a data structure and associated good parallelism (i.e. polylogarithmic span). Table 2 list the algorithms used in PAM that can be used to efficiently im- cost of some of the functions in PAM. When the augmenting plement augmented maps. Our data structure is based on functions 푓 and 푔 both take constant time, the augmentation abstracting the balancing criteria of a class of trees (e.g. AVL only affects all the bounds by a constant factor. Furthermore or Red-Black trees) in terms of a single JOIN function [11], experiments show that this constant factor is reasonably small, which joins two maps with a key between them (defined more typically around 10% for simple functions such as summing formally below). Prior work describes parallel - the values or taking the maximum. s for UNION,DIFFERENCE, and INTERSECT for standard Join, Split, Join2 and Union. As mentioned, we adopt and un-augmented sets and maps using just JOIN [11]. Here we extend the methodology introduced in [11], which builds extend the methodology to handle augmentation, and also 2 describe some other functions based on join. Because the bal- For supporting persistence, updating a tree node, e.g., when it is involved in rotations and is set a new child, usually results in the creation of a new node. ancing criteria is fully abstracted in JOIN, similar algorithm See details in the persistence part.

293 PAM: Parallel Augmented Maps PPoPP ’18, February 24–28, 2018, Vienna, Austria

1 UNION(푇1, 푇2, ℎ) = Function Work Span 2 if 푇1 = ∅ then 푇2 Map operations 3 else if 푇 = ∅ then 푇 2 1 insert, delete, find, first, 4 else let ⟨퐿2, 푘, 푣, 푅2⟩ = 푇2 ′ last, previous, next, rank, log 푛 log 푛 5 and ⟨퐿1, 푣 , 푅2⟩ = SPLIT(푇1, 푘)

6 and 퐿 = UNION(퐿1, 퐿2) || 푅 = UNION(푅1, 푅2) select, upTo, downTo ′ ′ 7 in if 푣 ̸=  then JOIN(퐿, 푘, ℎ(푣, 푣 ), 푅) Bulk operations 8 else JOIN(퐿, 푘, 푣, 푅) join log 푛 − log 푚 log 푛 − log 푚 9 INSERT(푇, 푘, 푣, ℎ) = union*, intersect*, 푚 log(︀ 푛 + 1)︀ log 푛 log 푚 10 if 푇 = ∅ then SINGLETON(푘, 푣) difference* 푚 11 else let ⟨퐿, 푘′, 푣′, 푅⟩ = 푇 in ′ ′ ′ mapReduce 푛 log 푛 if 푘 < 푘 then OIN( NSERT(퐿, 푘, 푣), 푘 , 푣 , 푅) 12 J I 2 ′ ′ ′ 푛 log 푛 13 else if 푘 > 푘 then JOIN(퐿, 푘 , 푣 , INSERT(푅, 푘, 푣)) filter ′ 14 else JOIN(퐿, 푘, ℎ(푣 , 푣), 푅) range, split, join2 log 푛 log 푛 ′ ′ ′ 15 MAPREDUCE(푇, 푔 , 푓 , 퐼 ) = build 푛 log 푛 log 푛 16 if 푇 = ∅ then 퐼′ Augmented operations 17 else let ⟨퐿, 푘, 푣, 푅⟩ = 푇 augVal 1 1 ′ ′ ′ ′ 18 and 퐿 = MAPREDUCE(퐿, 푔 , 푓 , 퐼 ) || log 푛 log 푛 ′ ′ ′ ′ augRange, augProject 19 푅 = MAPREDUCE(푅, 푔 , 푓 , 퐼 ) augFilter (output size 푘) 푘 log(푛/푘 + 1) log2 푛 20 in 푓 ′(퐿′, 푔′(푘, 푣), 푅′) ′ Table 2. The core functions in PAM and their asymptotic 21 AUGLEFT(푇, 푘 ) = costs (all big-O). The cost is given under the assumption that 22 if 푇 = ∅ then 퐼 the base function 푔, the combine function 푓 and the functions 23 else let ⟨퐿, 푘, 푣, 푅⟩ = 푇 ′ ′ as parameters (e.g., for AUGPROJECT) take constant time to 24 in if 푘 < 푘 then AUGLEFT(퐿, 푘 ) * ′ return. For the functions noted with , the efficient algorithms 25 else 푓(풜(퐿), 푔(푘, 푣),AUGLEFT(푅, 푘 )) with bounds shown in the table are introduced and proved 26 FILTER(푇, ℎ) = in [11]. For functions with two input maps (e.g., UNION), 푛 27 if 푇 = ∅ then ∅ is the size of the larger input, and 푚 of the smaller. 28 else let ⟨퐿, 푘, 푣, 푅⟩ = 푇 ′ ′ 29 and 퐿 = FILTER(퐿, ℎ) || 푅 = FILTER(푅, ℎ) ′ ′ ′ ′ 30 in if ℎ(푘, 푣) then JOIN(퐿 , 푘, 푣, 푅 ) else JOIN2(퐿 , 푅 ) all map functions using JOIN(퐿, 푘, 푅). The JOIN function 31 AUGFILTER(푇, ℎ) = takes two ordered maps 퐿 and 푅 and a key-value pair (푘, 푣) 32 if (푇 = ∅) OR (¬ℎ(풜(푇 ))) then ∅ that sits between them (i.e. max(퐿) < 푘 < min(푅)) and 33 else let ⟨퐿, 푘, 푣, 푅⟩ = 푇 returns the composition of 퐿, (푘, 푣) and 푅. Using JOIN it ′ 34 and 퐿 = AUGFILTER(퐿, ℎ) || is easy to implement two other useful functions: SPLIT and ′ 35 푅 = AUGFILTER(푅, ℎ) JOIN2. The function ⟨퐿, 푣, 푅⟩ = SPLIT(퐴, 푘) splits the map ′ ′ ′ ′ 36 in if ℎ(푔(푘, 푣)) then JOIN(퐿 , 푘, 푣, 푅 ) else JOIN2(퐿 , 푅 ) 퐴 into the elements less than 푘 (placed in 퐿) and those greater 37 BUILD’(푆, 푖, 푗) = (placed in 푅). If 푘 appears in 퐴 its associated value is returned 38 if 푖 = 푗 then ∅ as 푣, otherwise 푣 is set to be an empty value noted as . 39 else if 푖 + 1 = 푗 then SINGLETON(푆[푖]) The function 푇 =JOIN2(푇퐿, 푇푅) works similar to JOIN, but 40 else let 푚 = (푖 + 푗)/2 without the middle element. These are useful in the other 41 and 퐿 = BUILD’(푆, 푖, 푚) || 푅 = BUILD’(푆, 푚 + 1, 푗) algorithms. Algorithmic details can be found in [11]. 42 in JOIN(퐿, 푆[푚], 푅) The first function in Figure 2 defines an algorithm forU- NION based on SPLIT and JOIN. We add a feature that it can UILD(푆) = 43 B accept a function ℎ as parameter. If one key appears in both 44 BUILD’(REMOVEDUPLICATES(SORT(푆)), 0, |푆|) sets, the values are combined using ℎ. In the pseudocode Figure 2. Pseudocode for some of the functions on augmented we use ⟨퐿, 푘, 푣, 푅⟩ = 푇 to extract the left child, key, value maps. UNION is from [11], the rest are new. For an associated binary and right child from the tree node 푇 , respectively. The pseu- function 푓, 푓(푎, 푏, 푐) means 푓(푎, 푓(푏, 푐)). docode is written in a functional (non side-effecting) style. This matches our implementation, as discussed in persistence below.

294 PPoPP ’18, February 24–28, 2018, Vienna, Austria Yihan Sun, Daniel Ferizovic, and Guy E. Belloch

Insert and Delete. Instead of the classic implementations pruned (see Figure 1). For 푘 output entries, this function takes of INSERT and DELETE, which are specific to the balanc- 푂(푘 log(푛/푘 + 1)) work, which is asymptotically smaller ing scheme, we define versions based purely onJOIN, and than 푛 and is significantly more efficient when 푘 is small. Its 2 hence independent of the balancing scheme. Like the UNION span is 푂(log 푛). function, INSERT also takes an addition function ℎ as input, Other Functions. We implement many other functions in such that if the key to be inserted is found in the map, the PAM, including all in Figure 1. In MAPREDUCE(푔′, 푓 ′, 퐼′, 푇 ), values will be combined by ℎ. The algorithm for insert is for example, it is recursively applied on 푇 ’s two subtrees in given in Figure 2, and the algorithm for deletion is similar. parallel, and 푔′ is applied to the root. The three values are then ′ ′ ′ The algorithms run in 푂(log 푛) work (and span since sequen- combined using 푓 . The function AUGPROJECT(푔 , 푓 , 푚, 푘1, 푘2) tial). One might expect that abstracting insertion using JOIN on the top level adopts a similar method as AUGRANGE to ′ instead of specializing for a particular balance criteria has get related entries or subtrees in range [푘1, 푘2], projects 푔 to significant . Our experiments show this is not the their augmented values and combine results by 푓 ′. case—and even though we maintain the reference counter for Persistence. The PAM library uses functional data structures, persistence, we are only 17% slower sequentially than the and hence does not modify existing tree nodes but instead highly-optimized C++ STL library (see section 6). creates new ones [49]. This is not necessary for implementing Build. To construct an augmented map from a sequence of augmented maps, but is helpful in the parallel and concurren- key-value pairs we first sort the sequence by the keys, then t implementation. Furthermore the fact that functional data remove the duplicates (which are contiguous in sorted order), structures are persistent (no existing data is modified) has and finally use a balanced divide-and-conquer withJOIN. The many applications in developing efficient data structures22 [ ]. algorithm is given in Figure 2. The work is then 푂(푊sort(푛) + In this paper three of our four applications (maintaining in- 푊remove(푛) + 푛) and the span is 푂(푆sort(푛) + 푆remove(푛) + verted indices, interval trees and range trees) use persistence log 푛). For work-efficient sort and remove-duplicates with in a critical way. The JOIN function copies nodes along the 푂(log 푛) span this gives the bounds in Table 2. join path, leaving the two original trees unchanged. All of Reporting Augmented Values. As an example, we give the our code is built on JOIN in the functional style, returning algorithm of AUGLEFT(푇, 푘′) in Figure 2, which returns new trees rather than modifying old ones. Such functional the augmented value of all entries with keys less than 푘′. It data structures mean that parts of trees are shared, and that compares the root of 푇 with 푘′, and if 푘′ is smaller, it calls old trees for which there are no longer any pointers need to AUGLEFT on its left subtree. Otherwise the whole left sub- be garbage collected. We use a reference counting garbage tree and the root should be counted. Thus we directly extract collector. When the reference count is one we use a stan- the augmented value of the left subtree, convert the entry in dard reuse optimization—reusing the current node instead of the root to an augmented value by 푔, and recursively call collecting it and allocating a new one [32]. AUGLEFT on its right subtree. The three results are com- Concurrency. In PAM any number of users can concurrently bined using 푓 as the final answer. This function visits at most access and update their local copy (snapshot) of any map. 푂(log 푛) nodes, so it costs 푂(log 푛) work and span assuming This is relatively easy to implement based on functional data 푓, 푔 and 퐼 return in constant time. The AUGRANGE function, structures (persistent) since each process only makes copies which reports the augmented value of all entries in a range, of new data instead of modifying shared data. The one tricky can be implemented similarly with the same asymptotical part is maintaining the reference counts and implementing bound. the memory and garbage collector to be safe for Filter and AugFilter. The filter and augFilter function both concurrency. This is all implemented in a lock-free fashion. select all entries in the map satisfying condition ℎ. For a A compare-and-swap is used for updating reference counts, (non-empty) tree T,FILTER recursively filters its left and and a combination of local pools and a global pool are used right branches in parallel, and combines the two results with for memory allocation. Updates to the shared instance of a JOIN or JOIN2 depending on whether ℎ is satisfied for the map can be made atomically by swapping in a new pointer entry at the root. It takes linear work and 푂(log2 푛) span (e.g., with a compare-and-swap). This means that updates for a balanced tree. The augFilter(ℎ, 푚) function has the are sequentialized. However they can be accumulated and same effect as filter(ℎ′, 푚), where ℎ′ : 퐾 × 푉 ↦→ bool applied when needed in bulk using the parallel multi-insert or satisfies ℎ(푔(푘, 푣)) ⇔ ℎ′(푘, 푣) and is only applicable if multi-delete. ℎ(푎) ∨ ℎ(푏) ⇔ ℎ(푓(푎, 푏)). This can asymptotically improve efficiency since if ℎ(풜(푇 )) is false, then we know that ℎ will 5 Applications 3 not hold for any entries in the tree 푇 , so the search can be In this section we describe applications that can be imple- 3Similar methodology can be applied if there exists a function ℎ′′ to decide mented using the PAM interface. Our first application is given if all entries in a subtree will be selected just by reading the augmented value. in Equation 1, which is a map storing integer keys and values,

295 PAM: Parallel Augmented Maps PPoPP ’18, February 24–28, 2018, Vienna, Austria and keeping track of sum over values. In this section we give amap::aug filter (푂(푘 log(푛/푘 +1)) work for 푘 results), three more involved applications of augmented maps: inter- which is more efficient than a plain filter. val trees, range trees and word indices (also called inverted indices). We note that although we use trees as the implemen- 1 struct interval_map { tation, the abstraction of the applications to augmented maps 2 using interval = pair; is independent of representations. 3 struct entry { 4 using key_t = point; 5 using val_t = point; using aug_t = point; 5.1 Interval Trees 6 7 static bool comp(key_t a, key_t b) We give an example of how to use our interface for interval 8 { return a < b;} trees [18, 19, 23, 25, 35, 43]. This data structure maintains 9 static aug_t identity() a set of intervals on the real line, each defined by a left and 10 { return 0;} right endpoint. Various queries can be answered, such as a 11 static aug_t base(key_t k, val_t v) stabbing query which given a point reports whether it is in an 12 { return v;} interval. 13 static aug_t combine(aug_t a, aug_t b) { There are various versions of interval trees. Here we discuss 14 return (a > b) ? a : b;} }; the version as described in [18]. In this version each interval 15 16 using amap = aug_map; is stored in a tree node, sorted by the left endpoint (key). A 17 amap m; point 푥 is covered by in an interval in the tree if the maximum right endpoint for all intervals with keys less than 푥 is greater 18 interval_map(interval* A, size_t n) { than 푥 (i.e. an interval starts to the left and finishes to the 19 m = amap(A,A+n); } right of 푥). By storing at each tree node the maximum right 20 bool stab(point p) { endpoint among all intervals in its subtree, the stabbing query 21 return (amap::aug_left(m,p) > p);} can be answered in 푂(log 푛) time. An example is shown in 22 amap report_all(point p) { Figure 4. 23 amap t = amap::up_to(m,p); In our framework this can easily be implemented by using 24 auto h = [] (P a) -> bool {return a>p;} the left endpoints as keys, the right endpoints as values, and 25 return amap::augFilter(t,h);}; using max as the combining function. The definition is: Figure 3. The definition of interval maps using PAM in C++. 퐼 = AM(, (1,7) (2,6) 푝) and the combine function 푓(푎, 푏) = max(푎, 푏) satisfy (4,5) (5,8) (4,5) 9 (6,7) ℎ(푎) ∨ ℎ(푏) ⇔ ℎ(푓(푎, 푏)). This means that to get all nodes (7,9) (2,6) 7 (6,7) 9 with values > 푝, if the maximum value of a subtree is less (key,val) aug_val than 푝, the whole subtree can be discarded. Thus we can apply (1,7) 7 (3,5) 5 (5,8) 8 (7,9) 9 Figure 4. An example of an interval tree.

296 PPoPP ’18, February 24–28, 2018, Vienna, Austria Yihan Sun, Daniel Ferizovic, and Guy E. Belloch

Outer Tree ( ) Node sum of weights in the rectangle. When 푓 ′ is an addition, lc key val combine base aug_val rc ′ 𝑶𝑶 푔 returns the range sum, and 푓 is a UNION, the condition , Union Singleton𝑹𝑹 Inner tree 푓 ′(푔′(푎), 푔′(푏)) = 푔′(푎) + 푔′(푏) = 푔′(푎 ∪ 푏) = 푔′(푓(푎, 푏)) Inner Tree ( ) Node 〈𝑥𝑥 𝑦𝑦〉 lc𝑤𝑤 key val combine base aug_val rc UG ROJECT 𝑰𝑰 holds, so A P is applicable. Combining the two Outer tree , Add (+) value𝑹𝑹 ( ) Outer tree 2 node node steps together, the query time is 푂(log 푛). We can also an- Inner tree Inner tree Left 〈𝑦𝑦 𝑥𝑥〉 𝑤𝑤 𝑤𝑤 ∑𝑤𝑤 Right subtree node node subtree swer range queries that report all point inside a rectangle in Left Right 2 subtree subtree time 푂(푘 + log 푛), where 푘 is the output size.

Figure 5. The range tree data structure under PAM frame- 5.3 Ranked Queries on Inverted Indices work. In the illustration we omit some attributes such as size, Our last application of augmented maps is building and search- reference counter and the identity function. Note that the ing a weighted inverted index of the used by search functions combine, base and identity of both the out- engines [56, 66] (also called an inverted file or posting file). er tree and inner tree are static functions, so these functions For a given corpus, the index stores a mapping from words shown in this figure actually do not take any real spaces. to second-level mappings. Each second-level mapping, maps each document that the term appears in to a weight, corre- sponding to the importance of the word in the document and Then a range query can be done by two nested queries on the importance of the document itself. Using such a represen- x- and y-coordinates respectively. Sequential construction 2 tation, conjunctions and disjunctions on terms in the index can time and query time is 푂(푛 log 푛) and 푂(log 푛) respectively be found by taking the intersection and union, respectively, of (the query time can be reduced to 푂(log 푛) with reasonably the corresponding maps. Weights are combined when taking complicated approaches). unions and intersections. It is often useful to only report the In our interface the outer tree (푅푂) is represented as an aug- 푘 results with highest weight, as a search engine would list mented map in which keys are points (sorted by x-coordinates) on its first page. and values are weights. The augmented value, which is the This can be represented rather directly in our interface. inner tree, is another augmented map (푅퐼 ) storing all points in The inner map, maps document-ids (퐷) to weights (푊 ) and its subtree using the points as the key (sorted by y-coordinates) uses maximum as the augmenting function 푓. The outer map and the weights as the value. The inner map is augmented maps terms (푇 ) to inner maps, and has no augmentation. This by the sum of the weights for efficiently answering range corresponds to the maps: sums. Union is used as the combine function for the outer 푀퐼 = AM ( 퐷, <퐷, 푊 , 푊 , (푘, 푣) → 푣, max푊 , 0 ) map. The range tree layout is illustrated in in Figure 5, and 푀푂 = ( 푇 , <푇 푀퐼 , ) the definition in our framework is: M We use (퐾, < , 푉 ) to represent a plain map with key 푅 = ( 푃 , < , 푊 , 푊 , (푘, 푣) → 푣, + , 0 ) M 퐾 퐼 AM 푌 푊 푊 type 퐾, total ordering defined by < and value type 푉 . In the 푅 = ( 푃 , < , 푊 , 푅 , 푅 .singleton, ∪, ∅ ) 퐾 푂 AM 푋 퐼 퐼 implementation, we use the feature of PAM that allows pass- Here 푃 = 푋 × 푌 is the point type. 푊 is the weight type. ing a combining function with UNION and INTERSECT (see +푊 and 0푊 are the addition function on 푊 and its identity Section 4), for combining weights. The AUGFILTER function respectively. can be used to select the 푘 best results after taking unions It is worth noting that because of the persistence supported and intersections over terms. Note that an important feature is by the PAM library, the combine function UNION does not that the UNION function can take time that is much less that affect the inner trees in its two children, but builds a new the size of the output (e.g., see Section 4). Therefore using version of 푅 containing all the elements in its subtree. This 퐼 augmentation can significantly reduce the cost of finding the is important in guaranteeing the correctness of the algorithm. top 푘 relative to naively checking all the output to pick out To answer the query, we conduct two nested range searches: the 푘 best. The C++ code for our implementation is under 50 (푥 , 푥 ) on the outer tree, and (푦 , 푦 ) on the related inner 퐿 푅 퐿 푅 lines. trees [43, 58]. It can be implemented using the augmented map functions as: 6 Experiments QUERY(푟푂, 푥퐿, 푥푅, 푦퐿, 푦푅) = ′ For the experiments we use a 72-core Dell R930 with 4 x let 푔 (푟퐼 ) = AUGRANGE(푟퐼 , 푦퐿, 푦푅) ′ Intel(R) Xeon(R) E7-8867 v4 (18 cores, 2.4GHz and 45MB in AUGPROJECT(푔 , +푊 , 푟푂, 푥퐿, 푥푅) L3 ), and 1Tbyte memory. Each core is 2-way hyper- The augProject on 푅푂 is the top-level searching of x- threaded giving 144 hyperthreads. Our code was compiled coordinates in the outer tree, and 푔′ projects the inner trees using g++ 5.4.1, which supports the Cilk Plus extensions. to the weight sum of the corresponding y-range. 푓 ′ (i.e., Of these we only use cilk spawn and cilk sync for fork- ′ +푊 ) combines the weight of all results of 푔 to give the join, and cilk for as a parallel loop. We compile with -O2.

297 PAM: Parallel Augmented Maps PPoPP ’18, February 24–28, 2018, Vienna, Austria

n m T T Spd. We use numactl -i all in all experiments with more than 1 144 one thread. It evenly spreads the memory pages across the PAM (with augmentation) processors in a round-robin fashion. Union 108 108 12.517 0.2369 52.8 We ran experiments that measure performance of our four Union 108 105 0.257 0.0046 55.9 applications: the augmented sum (or max), the interval tree, Find 108 108 113.941 1.1923 95.6 the 2D range tree and the word index searching. Insert 108 − 205.970 − − Build 108 − 16.089 0.3232 49.8 10 6.1 The Augmented Sum Build 10 − 1844.38 28.24 65.3 108 − 4.578 0.0804 56.9 Times for set functions such as union using the same algo- Filter 8 8 rithm in PAM without augmentation have been summarized Multi-Insert 10 10 23.797 0.4528 52.6 in [11]. In this section we summarize times for a simple aug- Multi-Insert 108 105 0.407 0.0071 57.3 mentation, which just adds values as the augmented value Range 108 108 44.995 0.8033 56.0 (see Equation 1). We test the performance of multiple func- AugLeft 108 108 106.096 1.2133 87.4 tions on this structure. We also compare PAM with some AugRange 108 108 193.229 2.1966 88.0 sequential and parallel libraries, as well as some concurrent AugRange 1010 108 271.09 3.04 89.2 data structures. None of the other implementations support AugFilter 108 106 0.807 0.0163 49.7 augmentation. We use 64- integer keys and values. The 8 5 results on running time are summarized in Table 3. Our times AugFilter 10 10 0.185 0.0030 61.2 include the cost of any necessary garbage collection (GC). Non-augmented PAM (general map functions) We also present space usage in Table 4. Union 108 108 11.734 0.1967 59.7 We test versions both with and without augmentation. For Insert 108 − 186.649 − − general map functions like UNION or INSERT, maintaining build 108 − 15.782 0.3008 52.5 the augmented value in each node costs overhead, but it seems Range 108 108 42.756 0.7603 56.2 to be minimal in running time (within 10%). This is likely because the time is dominated by the number of cache misses, Non-augmented PAM (augmented functions) 8 4 which is hardly affected by maintaining the augmented value. AugRange 10 10 21.642 0.4368 49.5 The overhead of space in maintaining the augmented value AugFilter 108 106 2.695 0.0484 55.7 is 20% in each tree node (extra 8 for the augmented AugFilter 108 105 2.598 0.0497 52.3 value). For the functions related to the range sum, the aug- mentation is necessary for theoretical efficiency, and greatly STL 8 8 improves the performance. For example, the AUGRANGE Union Tree 10 10 166.055 − − function using a plain (non-augmented) tree structure would Union Tree 108 105 82.514 − − require scanning all entries in the range, so the running time is Union Array 108 108 1.033 − − proportional to the number of related entries. It costs 0.44s to Union Array 108 105 0.459 − − 4 process 10 parallel AUGRANGE queries. With augmentation, Insert 108 − 158.251 − − AUGRANGE has performance that is close to a simple FIND function, which is only 3.04s for 108 queries. Another exam- MCSTL 8 8 ple to show the advantage of augmentation is the AUGFILTER Multi-Insert 10 10 51.71 7.972 6.48 8 5 function. Here we use MAX instead of taking the sum as the Multi-Insert 10 10 0.20 0.027 7.36 combine function, and set the filter function as selecting all Table 3. Timings in seconds for various functions in PAM, entries with values that are larger than some threshold 휃. We the C++ Standard Template Library (STL) and the library set the parameter 푚 as the output size, which can be adjusted Multi-core STL (MCSTL) [61]. Here “푇144” means on all 72 by choosing appropriate 휃. Such an algorithm has theoretical cores with hyperthreads (i.e., 144 threads), and “푇1” means work of 푂(푚 log(푛/푚 + 1)), and is significantly more effi- the same algorithm running on one thread. “Spd.” means the cient than a plain implementation (linear work) when 푚 ≪ 푛. speedup (i.e., 푇1/푇144). For insertion we test the total time We give two examples of tests on 푚 = 105 and 106. The of 푛 insertions in turn starting from an empty tree. All other change of output size does not affect the running time of libraries except PAM are not augmented. the non-augmented version, which is about 2.6s sequentially and 0.05s in parallel. When making use of the augmentation, For sequential performance we compare to the C++ Stan- we get a 3x improvement when 푚 = 106 and about 14x dard Template Library (STL) [45], which supports set union improvement when 푚 = 105. on sets based on red-black trees and sorted vectors (arrays).

298 PPoPP ’18, February 24–28, 2018, Vienna, Austria Yihan Sun, Daniel Ferizovic, and Guy E. Belloch

(a). 5 × 107 insertions, throughput (M/s), 푝 = 144. (b). 107 concurrent reads, throughput (M/s), 푝 = 144. Compare to some concurrent data structures. Compare to some concurrent data structures.

100 100

10 10

Skiplist Skiplist

Throughput Throughput 1 1 OpenBW OpenBW Masstree Masstree B+ tree B+ tree PAM 0.1 PAM 0.1 1 2 4 8 16 32 72 144 1 2 4 8 16 32 72 144 Number of Threads Number of Threads (c). 푛 = 108, running time. (). 푛 = 108, speedup. (e). 푛 = 108, sequential building time. PAM on different input sizes. The interval tree using PAM. The range tree. Compare to CGAL.

90 600 64 0.10000 32 100

0.01000 16 10 8 Time (s) Speedup 0.00100 4 1 Building Time (s) Union 2 Build CGAL 0.00010 Build Query PAM 1 0.1 102 103 104 105 106 107 108 1 2 4 8 16 32 72 144 105 106 107 108 Input Size Number of Threads Number of points

Figure 6. (a), (b) The performance (throughput, millions of elements per second) of PAM comparing with some concurrent data structures. In (a) we use our MULTIINSERT, which is not as general as the concurrent insertions in other implementations. (c) The running time of UNION and BUILD using PAM on different input sizes. (d) The speedup on interval tree construction and query. (e) The running time on range tree construction.

Overhead for aug. Saving from node-sharing In parallel, the speedup on the aggregate functions such Func. Type node aug. over- #nodes in Actual Saving as UNION and BUILD is above 50. Generally, the speedup is size size head theory #nodes ratio correlated to the ratio of reads to writes. With all (or mostly) m = 108 48B 8B 20% 390M 386M 1.2% reads to the tree structure, the speedup is often more than 72 Union (number of cores) with hyperthreads (e.g., FIND,AUGLEFT m = 105 48B 8B 20% 200M 102M 49.0% and AUGRANGE). With mostly writes (e.g., building a new Range Outer 48B 8B 20% 100M 100M 0.0% tree as output) it is 40-50 (e.g., FILTER,RANGE,UNION Tree Inner 40B 4B 11% 266M 229M 13.8% and AUGFILTER). The BUILD function is relatively special Table 4. Space used by the UNION function and the range because the parallelism is mainly from the parallel sorting. tree application. We use B for , M for million. We also give the performance of the MULTIINSERT function in the Multicore STL (MCSTL) [61] for reference. On our server MCSTL does not scale to 144 threads, and we show We denote the two versions as Union-Tree and Union-Array. the best time it has (on 8-16 threads). On the functions we test, In Union-Tree results are inserted into a new tree, so it is also PAM outperforms MCSTL both sequentially and in parallel. persistent. When the two sets have the same size, the array PAM is scalable to very large data, and still achieve very implementation is faster because of its flat structure and better good speedup. On our machine, PAM can process up to 1010 cache performance. If one map is much smaller, PAM per- elements (highlighted in Table 3). It takes more than half an forms better than Union-Array because of better theoretical hour to build the tree sequentially, but only needs 28 seconds bound (푂(푚 log(푛/푚 + 1)) vs. 푂(푛 + 푚)). It outperforms in parallel, achieving a 65-fold speedup. For AUGRANGE the Union-Tree because it supports persistence more efficiently, speedup is about 90. i.e., sharing nodes instead of making a copy of all output Also, using path-copying to implement persistence im- entries. Also, our JOIN-based INSERT achieves performance proves space-efficiency. For the persistent UNION on two close to (about 17% slower) the well-optimized STL tree in- maps of size 108 and 105, we save about 49% of tree nodes sertion even though PAM needs to maintain the reference because most nodes in the larger tree are re-used in the output counts. tree. When the two trees are of the same size and the keys of

299 PAM: Parallel Augmented Maps PPoPP ’18, February 24–28, 2018, Vienna, Austria

both trees are extracted from the similar distribution, there is Lib. Func. n m T1 T144 Spd. little savings. PAM Build 108 - 14.35 0.227 63.2 We present the parallel running times of UNION and BUILD (interval) Query 108 108 53.35 0.576 92.7 on different input sizes in Figure 6 (c). For UNION we set one Build 108 - 197.47 3.098 63.7 tree of size 108 and vary the other tree size. When the tree size PAM Q-Sum 108 106 48.13 0.550 87.5 is small, the parallel running time does not shrink proportional (range) Q-All 108 103 44.40 0.687 64.6 to size (especially for BUILD), but is still reasonably small. 8 This seems to be caused by insufficient parallelism on small CGAL Build 10 - 525.94 - - 8 3 sizes. When the input size is larger than 106, the algorithms (range) Q-All 10 10 110.94 - - scales very well. Table 5. The running time (seconds) of the range tree and the We also compare with four comparison-based concurrent interval tree implemented with PAM interface on 푛 points data structures: skiplist, OpenBw-tree [63], Masstree [42] and 푚 queries. Here “푇1” reports the sequential running time and B+ tree [65]. The implementations are from [63]4. We and “푇144” means on all 72 cores with hyperthreads (i.e., 144 compare their concurrent insertions with our parallel MUL- threads). “Spd.” means the speedup (i.e., 푇1/푇144). “Q-Sum” TIINSERT and test on YCSB microbenchmark C (read-only). and “Q-All” represent querying the sum of weights and query- 7 We first use 5×10 insertions to an empty tree to build the ini- ing the full list of all points in a certain range respectively. 7 tial database, and then test 10 concurrent reads. The results We give the result on CGAL range tree for comparisons with are given in Figure 6(a) (insertions) and (b) (reads). For inser- our range tree. tions, PAM largely outperforms all of them sequentially and in parallel, although we note that their concurrent insertions are more general than our parallel multi-insert (e.g., they can deal with ongoing deletions at the same time). For concurrent In parallel, on 108 intervals, our code can build an interval reads, PAM performs similarly to B+ tree and Masstree with tree in about 0.23 second, achieving a 63-fold speedup. We less than 72 cores, but outperforms all of them on all 144 also give the speedup of our PAM interval tree in Figure 6(d). threads. We also compare to Intel TBB [54, 62] concurrent Both construction and queries scale up to 144 threads (72 hash map, which is a parallel implementation on unordered cores with hyperthreads). maps. On inserting 푛 = 108 entries into a pre-allocated table of appropriate size, it takes 0.883s compared to our 0.323s 6.3 Range Trees (using all 144 threads). We test our 2D range tree as described in Section 5.2. A 6.2 Interval Trees summary of run times is presented in Table 5. We com- pared our sequential version with the range tree in the CGAL We test our interval tree (same code as in Figure 3) using the library [51] (see Figure 6(e)). The CGAL range tree is se- 109 PAM library. For queries we test stabbing queries. We quential and can only report all the points in the range. For 108 give the results of our interval tree on intervals in Table 5 108 input points we control the output size of the query to be and the speedup figure in Figure 6(d). around 106 on average. Table 5 gives results of construction 108 Sequentially, even on intervals the tree construction and query time using PAM and CGAL respectively. Note that 휇 only takes 14 seconds, and each query takes around 0.58 s. our range tree also stores the weight and reference counting We did not find any comparable open-source interval-tree li- in the tree while CGAL only stores the coordinates, which brary in C++ to compare with. The only available library is a means CGAL version is more space-efficient than ours. Even Python interval tree implementation [30], which is sequential, considering this, PAM is always more efficient than CGAL and is very inefficient (about 1000 times slower sequentially). and less than half the running time both in building and query- Although unfair to compare performance of C++ to python ing time on 108 points. Also our code can answer the weight- (python is optimized for ease of programming and not per- sum in the window in a much shorter time, while CGAL can formance), our interval tree is much simpler than the python only give the full list. code—30 lines in Figure 3 vs. over 2000 lines of python. We then look at the parallel performance. As shown in This does not include our code in PAM (about 4000 lines of Table 5 it took 3 seconds (about a 64-fold speedup) to build code), but the point is that our library can be shared among a tree on 108 points. On 144 threads the PAM range tree many applications while the Python library is specific for the can process 1.82 million queries on weight-sum per second, interval query. Also our code supports parallelism. achieving a 87-fold speedup. We also report the number of allocated tree nodes in Table 4We do not compare to the fastest implementation (the Adaptive 4. Because of path-copying, we save 13.8% space by the [38]) in [63] because it is not comparison-based. sharing of inner tree nodes.

300 PPoPP ’18, February 24–28, 2018, Vienna, Austria Yihan Sun, Daniel Ferizovic, and Guy E. Belloch

1 Core 72* Cores both sequentially and in parallel. The code of the applications n Speed- Time Melts Time Gelts implemented with PAM outperforms some existing libraries (×109) up and implementations, and also achieves good parallelism. (secs) /sec (secs) /sec Without any specific optimizations, the speedup is about more Build 1.96 1038 1.89 12.6 0.156 82.3 than 60 for both building interval trees and building range Queries 177 368 480.98 4.74 37.34 77.6 trees, and 82 for building word index trees on 72 cores. For Table 6. The running time and rates for building and queer- parallel queries the speedup is always over 70. ing an inverted index. Here “one core” reports the sequen- tial running time and “72* cores” means on all 72 cores Acknowledgments with hyperthreads (i.e., 144 threads). Gelts/sec calculated as This material is based upon work supported by the National 푛/(time × 109). Science Foundation under Grant No. CCF-1314590 and Grant 6.4 Word Index Searching No. CCF-1533858. Any opinions, findings, and conclusions To test the performance of the inverted index data structure de- or recommendations expressed in this material are those of scribed in Section 5.3, we use the publicly available Wikipedi- the author and do not necessarily reflect the views of the a database [64] (dumped on Oct. 1, 2016) consisting of 8.13 National Science Foundation. million documents. We removed all XML markup, treated everything other than alphanumeric characters as separators, and converted all upper case to lower case to make search- [1] Georgy Adelson-Velsky and E. M. Landis. 1962. An Algorithm for the es case-insensitive. This leaves 1.96 billion total words with Organization of Information. Proc. of the USSR Academy of Sciences 145 (1962), 263–266. In Russian, English translation by Myron J. Ricci 5.09 million unique words. We assigned a random weight to in Soviet Doklady, 3:1259-1263, 1962. each word in each document (the values of the weights make [2] Pankaj K. Agarwal, Kyle Fox, Kamesh Munagala, and Abhinandan no difference to the runtime). We measure the performance Nath. 2016. Parallel Algorithms for Constructing Range and Nearest- of building the index from an array of (word, doc id, Neighbor Searching Data Structures. In Proc. ACM SIGMOD-SIGACT- weight) triples, and the performance of queries that take SIGAI Symp. on Principles of Database Systems (PODS). 429–440. [3] Yaroslav Akhremtsev and Peter Sanders. 2015. Fast Parallel Operations an intersection (logical-and) followed by selecting the top 10 on Search Trees. arXiv preprint arXiv:1510.05433 (2015). documents by weight. [4] Ahmed M. Aly, Hazem Elmeleegy, Yan Qi, and Walid Aref. 2016. Kan- Unfortunately we could not find a publicly available C++ garoo: Workload-Aware Processing of Range Data and Range Queries version of inverted indices to compare to that support and/or in Hadoop. In Proc. ACM International Conference on Web Search and queries and weights although there exist benchmarks sup- Data Mining (WSDM). ACM, New York, NY, USA, 397–406. [5] Flurry analytics. -. (-). https://developer.yahoo.com/flurry/docs/ porting plain searching on a single word [17]. However the analytics/ experiments do demonstrate speedup numbers, which are in- [6] Antonio Barbuzzi, Pietro Michiardi, Ernst Biersack, and Gennaro Bog- teresting in this application since it is the only one which does gia. 2010. Parallel Bulk Insertion for Large-scale Analytics Appli- concurrent updates. In particular each query does its own in- cations. In Proc. International Workshop on Large Scale Distributed tersection over the shared posting lists to create new lists (e.g., Systems and Middleware (LADIS). [7] Dmitry Basin, Edward Bortnikov, Anastasia Braginsky, Guy Golan- multiple users are searching at the same time). Timings are Gueta, Eshcar Hillel, Idit Keidar, and Moshe Sulamy. 2017. KiWi: shown in Table 6. Our implementation can build the index for A Key-Value Map for Scalable Real-Time Analytics. In Proc. ACM Wikipedia in 13 seconds, and can answer 100K queries with SIGPLAN Symp. on Principles and Practice of Parallel Programming a total of close to 200 billion documents across the queries (PPOPP). 357–369. in under 5 seconds. It demonstrates that good speedup (77x) [8] Rudolf Bayer. 1972. Symmetric Binary B-Trees: Data Structure and Maintenance Algorithms. Acta Informatica 1 (1972), 290–306. can be achieved for the concurrent updates in the query. [9] Jon Louis Bentley. 1979. Decomposable searching problems. Informa- tion processing letters 8, 5 (1979), 244–251. 7 Conclusion [10] Juan Besa and Yadran Eterovic. 2013. A concurrent red-black tree. J. In this paper we introduce the augmented map, and describe Parallel Distrib. Comput. 73, 4 (2013), 434–449. [11] Guy E Blelloch, Daniel Ferizovic, and Yihan Sun. 2016. Just Join for an interface and efficient algorithms to for it. Based on the Parallel Ordered Sets. In Proc. of the ACM Symp. on Parallelism in interface and algorithms we develop a library supporting Algorithms and Architectures (SPAA). 253–264. the augmented map interface called PAM, which is parallel, [12] Guy E. Blelloch and Margaret Reid-Miller. 1998. Fast Set Opera- work-efficient, and supports persistence. We also give four tions Using Treaps. In Proc. ACM Symp. on Parallel Algorithms and example applications that can be adapted to the abstraction of Architectures (SPAA). 16–26. [13] Nathan Grasso Bronson, Jared Casper, Hassan Chafi, and Kunle Oluko- augmented maps, including the augmented sum, interval trees, tun. 2010. A practical concurrent binary . In Proc. ACM 2D range trees and the inverted indices. We implemented all SIGPLAN Symp. on Principles and Practice of Parallel Programming these applications with the PAM library. Experiments show (PPoPP). 257–268. that the functions in our PAM implementation are efficient

301 PAM: Parallel Augmented Maps PPoPP ’18, February 24–28, 2018, Vienna, Austria

[14] Trevor Brown and Hillel Avni. 2012. Range Queries in Non-blocking and Distrib. Comput. 73, 8 (2013), 1195 – 1207. k-ary Search Trees. Springer Berlin Heidelberg, Berlin, Heidelberg, [35] Hans-Peter Kriegel, Marco Potke,¨ and Thomas Seidl. 2000. Managing 31–45. Intervals Efficiently in Object-Relational Databases. In Proc. Interna- [15] Trevor Brown, Faith Ellen, and Eric Ruppert. 2014. A general technique tional Conference on Very Large Data Bases (VLDB). 407–418. for non-blocking trees. In Proc. ACM SIGPLAN Symp. on Principles [36] H. T. Kung and Philip L. Lehman. 1980. Concurrent Manipulation of and Practice of Parallel Programming (PPoPP). 329–342. Binary Search Trees. ACM Trans. Database Syst. 5, 3 (1980), 354–382. [16] Chee Yong Chan and Yannis E. Ioannidis. 1999. Hierarchical Prefix [37] Kim S. Larsen. 2000. AVL Trees with Relaxed Balance. J. Comput. Cubes for Range-Sum Queries. In Proc. International Conference on Syst. Sci. 61, 3 (2000), 508–522. Very Large Data Bases (VLDB). Morgan Kaufmann Publishers Inc., [38] Viktor Leis, Alfons Kemper, and Thomas Neumann. 2013. The adap- San Francisco, CA, USA, 675–686. tive radix tree: ARTful indexing for main-memory databases. In Data [17] Rosetta Code. -. Inverted index. (-). https://rosettacode.org/wiki/ Engineering (ICDE), 2013 IEEE 29th International Conference on. Inverted index IEEE, 38–49. [18] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. 2001. Intro- [39] Charles E. Leiserson. 2010. The Cilk++ concurrency platform. The duction to Algorithms (second edition). MIT Press and McGraw-Hill. Journal of Supercomputing 51, 3 (2010), 244–257. [19] Costin S. 2012. C# based interval tree. https://code.google.com/ [40] Justin J. Levandoski, David B. Lomet, and Sudipta Sengupta. 2013. archive/p/intervaltree. (2012). The Bw-Tree: A B-tree for New Hardware Platforms. In Proc. IEEE [20] Bolin Ding and Arnd Christian Konig.¨ 2011. Fast set intersection in International Conference on Data Engineering (ICDE). 302–313. memory. Proc. of the VLDB Endowment 4, 4 (2011), 255–266. [41] LevelDB. -. (-). leveldb.org [21] Dana Drachsler, Martin T. Vechev, and Eran Yahav. 2014. Practical [42] Yandong Mao, Eddie Kohler, and Robert Tappan Morris. 2012. Cache concurrent binary search trees via logical ordering. In Proc. ACM craftiness for fast multicore key-value storage. In Proc. of the 7th ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming European Conference on Systems. ACM, 183–196. (PPoPP). 343–356. [43] Mark De Berg Mark, Mark Van Kreveld, Mark Overmars, and Ot- [22] James R Driscoll, Neil Sarnak, Daniel Dominic Sleator, and Robert En- fried Cheong Schwarzkopf. 2000. . Springer. dre Tarjan. 1986. Making data structures persistent. In Proc. ACM [44] Edward M. McCreight. 1985. Priority Search Trees. SIAM J. Comput. Symp. on Theory of computing (STOC). 109–121. 14, 2 (1985), 257–276. [23] Herbert Edelsbrunner. 1980. Dynamic Rectangle Intersrction Search- [45] David R Musser, Gillmer J Derge, and Atul Saini. 2009. STL tutorial ing. Technical Report Institute for Technical Processing Report 47. and reference guide: C++ programming with the standard template Technical University of Graz, Austria. library. Addison-Wesley Professional. [24] Stephan Erb, Moritz Kobitzsch, and Peter Sanders. 2014. Parallel [46] Aravind Natarajan and Neeraj Mittal. 2014. Fast concurrent lock-free bi-objective shortest paths using weight-balanced b-trees with bulk binary search trees. In Proc. ACM SIGPLAN Symp. on Principles and updates. In Experimental Algorithms. Springer, 111–122. Practice of Parallel Programming (PPoPP). 317–328. [25] Preparata Franco and Michael Ian Preparata Shamos. 1985. Computa- [47] Gabriele Neyer. 2017. dD Range and Segment Trees. In tional geometry: an introduction. Springer Science & Business Media. CGAL User and Reference Manual (4.10 ed.). CGAL Edito- [26] Leonor Frias and Johannes Singler. 2007. Parallelization of Bulk rial Board. http://doc.cgal.org/4.10/Manual/packages.html# Operations for STL Dictionaries. In Euro-Par 2007 Workshops: Parallel PkgRangeSegmentTreesDSummary Processing, HPPC 2007, UNICORE Summit 2007, and VHPC 2007. [48] Jurg¨ Nievergelt and Edward M. Reingold. 1973. Binary Search Trees 49–58. of Bounded Balance. SIAM J. Comput. 2, 1 (1973), 33–43. [27] Hong Gao and Jian-Zhong Li. 2005. Parallel Data Cube Storage Struc- [49] Chris Okasaki. 1999. Purely functional data structures. Cambridge ture for Range Sum Queries and Dynamic Updates. J. Comput. Sci. University Press. Technol. 20, 3 (May 2005), 345–356. [50] Oracle. -. Oracle NoSQL. (-). https://www.oracle.com/database/ [28] Jim Gray, Surajit Chaudhuri, Adam Bosworth, Andrew Layman, Don nosql/index.html Reichart, Murali Venkatrao, Frank Pellow, and Hamid Pirahesh. 1997. [51] Mark H Overmars. 1996. Designing the computational geometry al- Data Cube: A Relational Aggregation Operator Generalizing Group-By, gorithms library CGAL. In Applied computational geometry towards Cross-Tab, and Sub-Totals. Data Mining and Knowledge Discovery 1, geometric engineering. Springer, 53–58. 1 (01 Mar 1997), 29–53. [52] Heejin Park and Kunsoo Park. 2001. Parallel algorithms for red–black [29] Antonin Guttman. 1984. R-trees: A Dynamic Index Structure for trees. Theoretical 262, 1 (2001), 415–435. Spatial Searching. In Proc. ACM SIGMOD International Conference [53] Wolfgang J. Paul, Uzi Vishkin, and Hubert Wagener. 1983. Parallel on Management of Data (SIGMOD). ACM, New York, NY, USA, 47– Dictionaries in 2-3 Trees. In Proc. International Colloq. on Automata, 57. Languages and Programming (ICALP). 597–609. [30] Chaim-Leib Halbet and Konstantin Tretyakov. 2015. Python based in- [54] Chuck Pheatt. 2008. Intel® threading building blocks. Journal of terval tree. https://github.com/chaimleib/intervaltree. (2015). Computing Sciences in Colleges 23, 4 (2008), 298–298. [31] Ching-Tien Ho, Rakesh Agrawal, Nimrod Megiddo, and Ramakrishnan [55] Aleksandar Prokopec, Nathan Grasso Bronson, Phil Bagwell, and Mar- Srikant. 1997. Range Queries in OLAP Data Cubes. In Proc. ACM tin Odersky. 2012. Concurrent with Efficient Non-blocking Snap- International Conference on Management of Data (SIGMOD). ACM, shots. In Proc. ACM SIGPLAN Symp. on Principles and Practice of New York, NY, USA, 73–88. Parallel Programming (PPOPP). [32] Paul Hudak. 1986. A Semantic Model of Reference Counting and Its [56] Anand Rajaraman and Jeffrey David Ullman. 2011. Mining of Massive Abstraction (Detailed Summary). In Proc. of the 1986 ACM Conference Datasets:. Cambridge University Press. on LISP and (LFP ’86). 351–363. [57] RocksDB. -. (-). rockdb.org [33] Hiroshi Inoue, Moriyoshi Ohara, and Kenjiro Taura. 2014. Faster set [58] Hanan Samet. 1990. The design and analysis of spatial data structures. intersection with SIMD instructions by reducing branch mispredictions. Vol. 199. Addison-Wesley Reading, MA. Proc. of the VLDB Endowment 8, 3 (2014), 293–304. [59] Raimund Seidel and Celcilia R. Aragon. 1996. Randomized Search [34] Jinwoong Kim, Sul-Gi Kim, and Beomseok Nam. 2013. Parallel multi- Trees. Algorithmica 16 (1996), 464–497. dimensional range query processing with R-trees on GPU. J. Parallel

302 PPoPP ’18, February 24–28, 2018, Vienna, Austria Yihan Sun, Daniel Ferizovic, and Guy E. Belloch

[60] Jeff Shute, Radek Vingralek, Bart Samwel, Ben Handy, Chad Whip- ∙ function base: 퐾 × 푉 ↦→ 퐴: the base function (푔) key, Eric Rollins, Mircea Oancea, Kyle Littlefield, David Menestrina, ∙ function combine: 퐴 × 퐴 ↦→ 퐴: the combine func- Stephan Ellner, John Cieslewicz, Ian Rae, Traian Stancescu, and Hi- tion (푓) mani Apte. 2013. F1: A Distributed SQL Database That Scales. Proc. identity VLDB Endow. 6, 11 (Aug. 2013), 1068–1079. ∙ function : ∅ ↦→ 퐴: the identity of f (퐼) [61] Johannes Singler, Peter Sanders, and Felix Putze. 2007. MCSTL: The Then an augmented map is defined with C++ template as Multi-core Standard Template Library. In Proc. European Conf. on aug map. Parallel Processing (Euro-Par). 682–694. Note that a plain ordered map (pam map) is [62] TBB -. Intel® Threading Building Blocks 2.0 for Open Source. (-). https://www.threadingbuildingblocks.org/. defined as an augmented map with no augmentation (i.e., it [63] Ziqi Wang, Andrew Pavlo, Hyeontaek Lim, Viktor Leis, Huanchen only has 퐾, <퐾 and 푉 in its entry) and a plain ordered set Zhang, Michael Kaminsky, and David G. Andersen. [n. d.]. Build- (pam set) is similarly defined as an augmented ing A Bw-Tree Takes More Than Just Buzz Words [Experiments and map with no augmentation and no value type. Analyses] (Unpublished). ([n. d.]). Here is an example of defining an augmented map 푚 that [64] Wikimedia Foundation. 2016. Wikepedia:Database download. https: //en.wikipedia.org/wiki/Wikipedia:Database download. has integer keys and values and is augmented with value sums (2016). (similar as the augmented sum example in our paper): [65] Huanchen Zhang, David G Andersen, Andrew Pavlo, Michael Kamin- 1 struct entry { sky, Lin Ma, and Rui Shen. 2016. Reducing the storage overhead 2 using key_t = int; of main-memory OLTP databases with hybrid indexes. In Proc. of using val_t = int; the 2016 International Conference on Management of Data. ACM, 3 1567–1581. 4 using aug_t = int; [66] Justin Zobel and Alistair Moffat. 2006. Inverted Files for Text Search 5 static bool comp(key_t a, key_t b) { Engines. ACM Comput. Surv. 38, 2, Article 6 (July 2006). 6 return a < b;} 7 static aug_t identity() { return 0;} A Artifact description 8 static aug_t base(key_t k, val_t v) { 9 return v;} A.1 Abstract 10 static aug_t combine(aug_t a, aug_t b) { PAM (Parallel Augmented Maps) is a parallel C++ library 11 return a+b;}}; implementing the interface for augmented maps (defined in 12 aug_map m; this paper). It is designed for maintaining an ordered map data Another quick example can be found in Section 5.1, which structure while efficiently answering range-based and other shows how to implement an interval tree using the PAM related queries. In the experiments we use the interface in interface. four examples: augmented-sums, interval-queries, 2d range- queries, and an inverted index. The released code includes A.2.1 Check-list (artifact meta information) both the code for the library and the code implementing the ∙ Algorithm: Join-based balanced algorithms, and applications. We provide scripts for running the specific ex- applications of them, as described in the paper. periments reported in the paper. It is also designed so it is ∙ Program: C++ code with the Cilk Plus extensions. easy to try in many other scenarios (different sizes, differ- ∙ Compilation: g++ 5.4.0 (or later versions), which supports ent numbers of cores, and other operations described in the the Cilk Plus extensions. paper, but not reported in the experiments, and even other ∙ Data set: Mostly generated internally, but for one experiment applications that fit the augmented map framework). we use the publicly available Wikipedia database. ∙ Run-time environment: Linux with numactl installed (we A.2 Description used ubuntu 16.04.3). To just run the experiments and tests as shown in the paper, y- ∙ Hardware: Any modern x86-based multicore machine. Most experiments run with 64GB memory. Some require 256GB, or ou can skip this part and directly use the scripts in our released 1TB. We ran on a machine with 72 cores (144 hyperthreads) version. and 1TB memory. To use the library and define an augmented map using PAM, ∙ Output: Results shown on the screen and written to files: users need to include the header file pam.h, and specify the benchmark name, parameters, median runtime, and speedup. parameters including type names and (static) functions in an ∙ Experiment workflow: git clone; run a script (or modify and entry structure entry. run for more options). ∙ typename key t: the key type (퐾), ∙ Publicly available?: Yes. ∙ function comp: 퐾 × 퐾 ↦→ bool: the comparison func- A.2.2 How delivered tion on K (<퐾 ) ∙ typename val t: the value type (푉 ), Released publicly on GitHub at: ∙ typename aug t: the augmented value type (퐴), https://github.com/syhlalala/PAM-AE

303 PAM: Parallel Augmented Maps PPoPP ’18, February 24–28, 2018, Vienna, Austria

A.2.3 Hardware dependencies ∙ The inverted indices (in directory index/). Any modern (2010+) x86-based multicore machines. Relies In each of the directories there is a separated makefile and a on 128-bit CMPXCHG (requires -mcx16 flag) but script to run the timings for the corresponding application. does not need hardware transactional memory (TSX). Most All tests include parallel and sequential running times. The experiments require 64GB memory, but range query requires sequential versions are the algorithms running directly on 256GB memory and aug sum on the large input requires 1TB one thread, and the parallel versions use all threads on the memory. Times reported are for a 72-core Dell R930 with 4 x machine using “numactl -i all” if numactl is installed. Intel(R) Xeon(R) E7-8867 v4 (18 cores, 2.4GHz and 45MB L3 cache), and 1Tbyte memory. A.5 Evaluation and expected result By running the script, the median running time over multiple A.2.4 Software dependencies runs of each application, along with the parameters used, will PAM requires g++ 5.4.0 or later versions supporting the Cilk be output to both stdout and a file. Each experiment runs on all Plus extensions. The scripts that we provide in the repository threads available on the machine in parallel, and sequentially use numactl for better performance. All tests can also run on one thread. In our experience the times do not deviate by directly without numactl. much from run to run (from 1-5%). The experiments correspond to the numbers reported in A.2.5 Datasets Tables 3, 4 and 5 in the paper. If run on a similar machine, We use the publicly available Wikipedia database (dumped on one should observe similar numbers as reported in the paper. Oct. 1, 2016) for the inverted index experiment. We release All executable files can run with different input arguments. a sample (1% of total size) in the GitHub repository (35MB More details about command line arguments can be found in compressed). The full data (3.5TB compressed) is available the repository. The number of working threads is set using on request. All other applications use randomly generated CILK NWORKERS, i.e.: data. export CILK_NWORKERS=

A.3 Installation A.6 Experiment customization After cloning the repository, scripts are provided for compil- It is not hard to run our experiments on different input sizes ing and running PAM. and number of cores. There is also code to run timings for all functions discussed in the paper even if not included in the A.4 Experiment workflow tables in the paper and default scripts. It is also easy to use At the top level there is a makefilemake ( ) and a script for our library to design user’s own augmented maps based on compiling and running all timings (./run all). The source their applications. code of the library is provided in the directory c++/, and the other directories each corresponds to some examples of A.7 Notes applications (as we show in the paper). There are four example It may take a long time to run all experiments (especially for applications provided in our repository: the sequential ones) since we run each experiment multiple ∙ The range sum (in directory aug sum/). rounds and take the median. Each test can be run separately ∙ The interval tree (in directory interval/). with smaller input size. See the document in the repository ∙ The range tree (in directory range query/). for more details. If the interface changes in the future, check the documents on GitHub.

304