Parallel Balanced Binary Trees in Shared Memory
SPAA 2020 Tutorial

Yihan Sun, University of California, Riverside

What's in this tutorial

• Algorithms and implementation details for parallel balanced binary trees (P-trees)
• Simple algorithms supporting a wide range of functionality

• An open-source parallel library (PAM) and example code to use it

• Applications that can be solved using the algorithms and the library in this tutorial

Trees
• Trees are fundamental data structures for organizing data
• Taught in entry-level undergraduate courses:
  [Sedgewick and Wayne] Sec. 3.2 Binary Search Trees, Sec. 3.3 Balanced Search Trees
  [TAOCP] 2.3 Trees, 6.2.2 Binary Searching, 6.2.3 Balanced Trees, 6.2.4 Multiway Trees
  [CLRS] Ch. 12 Binary Search Trees, Ch. 13 Red-Black Trees, Sec. 14.3 Interval Trees
• In real-world applications, things are more complicated... especially in parallel

Applications Using Trees: Document Search Engine

Example query: find information about "balanced" and "binary" (Balanced AND Binary; 1,234,567 results), returning documents such as Doc 1 ("A binary tree is balanced if it keeps its height small...") and Doc 2 ("AVL tree is a balanced binary structure...").

A possible interface:

struct Doc { int l; pair<...>* w; };   // template arguments elided on the slide
class doc_tree {
  void build(Doc* d);
  Doc_set search(Word w);
  Doc_set and_search(Word* w);
  Doc_set or_search(Word* w);
  void add_doc(Doc d);
  ...
};

Searching and updating may need to be done concurrently.

Applications Using Trees: Databases

Example query: find all young CS students with good grades.

select name
from students
where age < 25
  and major = 'CS'
  and grade >= 'A'

A possible interface:

struct Student { id, name, grade, age, major, ... };
class database {
  void build(Student* s);
  Student* search(int id);
  Student* filter(function f);
  void add_student(Student s);
  ...
};

Applications Using Trees: Geometric Queries

Example (a 2D range query): find the average temperature in Riverside.

struct Point { X x; Y y; Weight w; };
class range_tree {
  void build(Point* p);
  double ave_weight_search(X x1, X x2, Y y1, Y y2);
  int count_search(X x1, X x2, Y y1, Y y2);
  int list_all_search(X x1, X x2, Y y1, Y y2);
  void insert(Point p);
  range_tree filter(func f);
  Point* output();
  void update(X x, Y y, Weight w);
};

Balanced Binary Trees

• Binary: each tree node has at most two children
• Balanced: the tree has bounded height, usually O(log n) for size n

(Illustrations: a wild balanced hyphaene compressa binary tree, a wild balanced binary tree, and an abstract balanced binary tree.)

Balanced Binary Trees

• Balancing schemes: invariants to keep the tree balanced and to bound the tree height
• We discuss four standard balancing schemes:

• Height balanced: AVL trees, red-black trees
• Size balanced: weight-balanced trees
• Randomized: treaps

Applications Using Trees
• Document search engines: find information about "balanced" and "binary"
• Databases: find all young CS students with good grades
• Geometric queries (2D range queries): find the average temperature in the Riverside area (62°F)

What we want:
• Elegance – Framework: generic for balancing schemes, generic for applications
• Massive data – Performance: parallelism and concurrency; efficiency both in theory and in practice
• Comprehensive queries – Functionality: range queries, bulk updates, augmentation, dynamicity, multi-versioning, ...

What does a P-tree look like?
• Elegance – Framework: generic for balancing schemes, generic for applications
• Massive data – Performance: parallelism and concurrency; efficiency both in theory and in practice
• Comprehensive queries – Functionality: range queries, bulk updates, augmentation, dynamicity, multi-versioning, ...

All of this is built on Join, a primitive for trees.

1 Generic for balancing schemes

All algorithms except join are identical across balancing schemes

One algorithm for multiple balancing schemes!

2 Generic for applications

Multiple applications based on the same tree structure

One tree for different problems!

A primitive for trees: Join

The Primitive Join

• T = Join(T_L, e, T_R): T_L and T_R are two trees of a certain balancing scheme, and e is an entry/node (the pivot)
• Requires T_L < e < T_R
• Returns a valid tree T whose contents are T_L ∪ {e} ∪ T_R

(Figure: T = Join(T_L, e, T_R); the result is rebalanced if necessary.)

The Primitive Join: an example

(Figure: two balanced trees T_L and T_R and a pivot entry e to be joined.)

• Connect T_L, e, and T_R at a balancing point: walk down the spine of the larger tree until reaching a subtree that is balanced with respect to the smaller tree, and attach e there with the smaller tree as its other child
• Rebalance (specific to the balancing scheme)
• Join algorithms for all four balancing schemes, together with their cost bounds, are given in [SPAA'16]

(Figures: the example tree after connecting at a balancing point, and the final, rebalanced result.)
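As one concrete instance (illustrative code only, not PAM's implementation), here is a join for treaps: whichever of the two roots and the pivot carries the highest priority becomes the new root, and the recursion descends the corresponding spine. The other three schemes differ only in how they pick the balancing point and rebalance.

#include <climits>
#include <cstdlib>

// A minimal treap join sketch. Node and prio are illustrative names.
struct Node {
  int key;
  int prio;                 // random priority; the tree is a max-heap on priorities
  Node *lc, *rc;
  explicit Node(int k) : key(k), prio(rand()), lc(nullptr), rc(nullptr) {}
};

int prio(Node* t) { return t ? t->prio : INT_MIN; }

// Precondition: every key in L < m->key < every key in R.
Node* join(Node* L, Node* m, Node* R) {
  if (prio(m) >= prio(L) && prio(m) >= prio(R)) {
    m->lc = L; m->rc = R;    // the pivot has the highest priority: it becomes the root
    return m;
  }
  if (prio(L) > prio(R)) {   // L's root stays on top; join along its right spine
    L->rc = join(L->rc, m, R);
    return L;
  }
  R->lc = join(L, m, R->lc); // R's root stays on top; join along its left spine
  return R;
}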

How does Join help?

1 Algorithms Using Join
• Generic across balancing schemes
• Highly parallel
• Theoretically efficient

2 Augmentation Using Join
• A unified framework for augmentation
• Models multiple applications

3 Persistence Using Join
• Multi-versioning on trees

PART 1: Algorithms Using Join
• Generic algorithms across balancing schemes
• Parallel algorithms using the divide-and-conquer paradigm
• Theoretically efficient

20 Join-based insertion

Join-based Algorithms: Insertion

insert(T, k)
  if T = ∅ then return Singleton(k)
  else let (L, k', R) = T
    if k < k' then return Join(insert(L, k), k', R)
    else if k > k' then return Join(L, k', insert(R, k))
    else return T

(Figures: inserting 4 into an example tree; Join reassembles the result, rebalancing if required — Join does the rebalancing for insert.)

O(log n) work per insertion.
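For concreteness, the pseudocode maps directly onto the treap join sketched earlier; the following is the same insert written against that Node/join (again an illustrative, in-place, single-version sketch — not PAM's generic, persistent implementation):

// Join-based insert using the treap Node/join sketch above.
Node* insert(Node* T, int k) {
  if (T == nullptr) return new Node(k);            // Singleton(k)
  if (k < T->key) return join(insert(T->lc, k), T, T->rc);
  if (k > T->key) return join(T->lc, T, insert(T->rc, k));
  return T;                                        // key already present
}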

Join-based split and join2

The Inverse of Join: Split

• (T_L, b, T_R) = Split(T, k)
• T_L: all keys in T that are < k
• T_R: all keys in T that are > k
• b: a flag indicating whether k ∈ T

(Figure: splitting an example tree by 42; b = true because 42 is in the tree.)

split(T, k)
  if T = ∅ then return (∅, false, ∅)
  else
    k' = key at the root of T
    if k = k' then                 // same key as the root
      T_L = T.left; T_R = T.right; flag = true
    if k < k' then                 // split the left subtree
      (L, flag, R) = split(T.left, k)
      T_L = L; T_R = Join(R, k', T.right)
    if k > k' then { /* symmetric */ }
    return (T_L, flag, T_R)

Another helper function: join2

• join2(T_L, T_R): similar to join, but without the middle key
• Can be done by first splitting out the last key k of T_L, then using k to join the rest of T_L with T_R

join2(T_L, T_R) {
  (T_L', k) = split_last(T_L);
  return join(T_L', k, T_R);
}

Other Join-based algorithms

BST Algorithms

• BST algorithms using a divide-and-conquer scheme
• Recursively deal with the two subtrees (possibly in parallel)
• Combine the results of the recursive calls and the root (e.g., using join or join2)
• Usually gives polylogarithmic bounds on span

func(T, ...) {
  if (T is empty) return base_case;
  M = do_something(T.root);
  in parallel:
    L = func(T.left, ...);
    R = func(T.right, ...);
  return combine_results(L, R, M, ...);
}

Get the maximum value

• In each node we store a key and a value. The nodes are sorted by the keys.

get_max(Tree T) {
  if (T is empty) return -∞;
  in parallel:
    L = get_max(T.left);
    R = get_max(T.right);
  return max(max(L, T.root.value), R);
}

O(n) work and O(log n) span. A similar algorithm works for any map-reduce function.

Map and reduce
• Map each entry in the tree to a value using the function map, then reduce all the mapped values using reduce (with identity identity)
• Assume map and reduce both have constant cost

map_reduce(Tree T, function map, function reduce, value_type identity) {
  if (T is empty) return identity;
  M = map(T.root);
  in parallel:
    L = map_reduce(T.left, map, reduce, identity);
    R = map_reduce(T.right, map, reduce, identity);
  return reduce(reduce(L, M), R);
}

O(n) work and O(log n) span.

Filter

• Select all entries in the tree that satisfy a predicate f
• Return a tree of all these entries

filter(Tree T, function f) {
  if (T is empty) return an empty tree;
  in parallel:
    L = filter(T.left, f);
    R = filter(T.right, f);
  if (f(T.root)) return join(L, T.root, R);
  else return join2(L, R);
}

O(n) work and O(log² n) span.

Construction

T = build(Array A, int size) {
  A' = parallel_sort(A, size);
  return build_sorted(A', 0, size);
}
O(n log n) work and O(log n) span, bounded by the sorting algorithm.

T = build_sorted(Array A, int start, int end) {
  if (start == end) return an empty tree;
  if (start == end-1) return singleton(A[start]);
  mid = (start+end)/2;
  in parallel:
    L = build_sorted(A, start, mid);
    R = build_sorted(A, mid+1, end);
  return join(L, A[mid], R);
}
O(n) work and O(log n) span.

Output to array

• Output the entries of a tree T to an array, in in-order
• Assume each tree node stores its subtree size (an empty tree has size 0)

to_array(Tree T, array A, int offset) {
  if (T is empty) return;
  A[offset + T.left.size] = get_entry(T.root);   // the root goes right after the left subtree
  in parallel:
    to_array(T.left, A, offset);
    to_array(T.right, A, offset + T.left.size + 1);
}
O(n) work and O(log n) span.

Range query (1D)

• Report all entries in the key range [k_L, k_R]
• Returning them as a tree: O(log n) work and span
• Flattening them into an array: O(k + log n) work and O(log n) span, for output size k

Equivalent to using two splits (which in turn call join); only O(log n) related nodes/subtrees are touched.

Range(T, k_L, k_R) {
  (t1, b, t2) = split(T, k_L);
  (t3, b, t4) = split(t2, k_R);
  return t3;
}

Join-based Algorithms: Union

• Input: T_1 and T_2 (of sizes n and m, with m ≤ n)
• Output: T containing all elements of T_1 and T_2

• Can be used to combine a batch of elements into a tree

• The lower bound (on comparisons) is O(m log(n/m + 1))
  • When m = n, it is O(n)
  • When n ≫ m, it is about O(m log n) (e.g., when m = 1, it is O(log n))

Join-based Algorithms: Union

union(T_1, T_2)
  if T_1 = ∅ then return T_2        // base case
  if T_2 = ∅ then return T_1
  (L_2, k_2, R_2) = extract(T_2)
  (L_1, b, R_1) = split(T_1, k_2)
  in parallel:
    T_L = union(L_1, L_2)
    T_R = union(R_1, R_2)
  return Join(T_L, k_2, T_R)

(Figures: running union on two example trees — T_1 is split by k_2, the root of T_2; the left and right parts are united recursively in parallel; and Join(T_L, k_2, T_R) assembles the result.)

Similarly, we can implement intersection and difference.

Theorem 1. For AVL trees, red-black trees, weight-balanced trees, and treaps, the above algorithm for merging two balanced BSTs of sizes m and n (m ≤ n) has O(m log(n/m + 1)) work and O(log m log n) span (in expectation for treaps).

• The bound also holds for intersection and difference

Join-based algorithms

• A wide variety of algorithms using Join are proved to be work-optimal and to have polylogarithmic span.

  Functions                              Work                Span
  insert, delete, range, split, join2    O(log n)            O(log n)
  union, intersection, difference        O(m log(n/m + 1))   O(log n log m)
  filter                                 O(n)                O(log² n)
  map_reduce                             O(n)                O(log n)
  build                                  O(n log n)          O(log n)

Join captures the intrinsic properties of the different balanced binary trees in these algorithms.

Achieves 50-90x speedup on 72 cores with hyperthreading.

Outline

1 Algorithms Using Join
• Generic across balancing schemes
• Parallel, using divide-and-conquer
• Simple
• Theoretically efficient
• Fast in practice

2 Augmentation Using Join

3 Persistence Using Join

PART 2: Augmentation Using Join
• Augmented trees for fast range sums
• The framework for applications

Augmentation for Range Queries

Geometric queries (2D range queries), e.g.: find the average temperature in the Riverside area (62°F), or find all nearby Pokémon / Poké Stops.

Augmented Trees for 1D Range Query

• Each tree node stores some extra information about the whole subtree rooted at it (e.g., a partial sum)

(Figure: a tree of key,value pairs in which each node is augmented with the sum of the values in its subtree; a range query over [k_L, k_R] touches only O(log n) related nodes/subtrees.)

Range sum query (the sum of the values in a key range): O(log n) time.

Augmented Trees for 1D Range Query

• Different functionalities are achieved by different augmentations

(Figure: the same tree augmented instead with the maximum of the values in its subtree — a different "partial sum".)

Augmentation – Formalization

(Figure: the key,value tree with partial sums from the previous slide.)

• There is no standard formal definition of augmented trees...
• We give a definition with respect to ordered key-value pairs and a map-reduce operation
  • Each tree node stores a key-value pair (in K × V)
  • Each tree node also maintains some information about its whole subtree (in A): the augmented value

(Figure: in the running example, each node's augmented value is the sum of the values in its subtree; at the root, 54 = 13 + g(4,11) + 30, where 13 = 4 + 7 + 2 and 30 = 17 + 10 + 3.)

The augmented value:
• A "map" function g : K × V → A maps an entry to an augmented value (g is (k, v) ↦ v in this example)
• A "reduce" function f : A × A → A combines augmented values; f is associative with identity I, so (A, f, I) is a monoid (f is + in this example)
• a(u) = f( f( a(lc(u)), g(entry(u)) ), a(rc(u)) ),
  where a(u) is the augmented value of node u, entry(u) is the entry stored in u, and lc(u) and rc(u) are the left/right children of u
• The augmented value needs to be updated only in Join
• This does not affect the asymptotic cost if g and f are simple
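To make the recurrence concrete, here is a small illustrative sketch (not PAM's code) of maintaining a(u) for the range-sum example and of the O(log n) range-sum query that uses it; as above, g(k, v) = v, f = +, and I = 0.

// Maintaining the augmented value a(u) and answering a range sum, for the
// example where g(k, v) = v, f = +, and I = 0. Illustrative only.
struct AugNode {
  int key, val;
  int aug;                        // a(u): sum of the values in this subtree
  AugNode *lc, *rc;
};

int aug_of(AugNode* t) { return t ? t->aug : 0; }     // empty tree: identity I = 0

// a(u) = f(f(a(lc(u)), g(entry(u))), a(rc(u))); called whenever a node is
// (re)built, e.g., inside Join.
void update_aug(AugNode* u) { u->aug = aug_of(u->lc) + u->val + aug_of(u->rc); }

// Sum of the values with keys >= kl in t.
int sum_right_of(AugNode* t, int kl) {
  if (!t) return 0;
  if (t->key < kl) return sum_right_of(t->rc, kl);          // the left part is all < kl
  return sum_right_of(t->lc, kl) + t->val + aug_of(t->rc);  // whole right subtree counts
}

// Sum of the values with keys <= kr in t.
int sum_left_of(AugNode* t, int kr) {
  if (!t) return 0;
  if (t->key > kr) return sum_left_of(t->lc, kr);
  return aug_of(t->lc) + t->val + sum_left_of(t->rc, kr);
}

// Sum of the values with keys in [kl, kr]. Only O(log n) nodes are visited,
// because fully contained subtrees contribute their precomputed augmented value.
int range_sum(AugNode* t, int kl, int kr) {
  if (!t) return 0;
  if (t->key < kl) return range_sum(t->rc, kl, kr);
  if (t->key > kr) return range_sum(t->lc, kl, kr);
  return sum_right_of(t->lc, kl) + t->val + sum_left_of(t->rc, kr);
}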

Augmented trees
• Define the keys and values
• Define what the augmented value is
• Define the map function
• Define the reduce function (and its identity)

The augmented map we defined [PPoPP'18] is an abstract data type (ADT) that extends augmented trees; it now has its own Wikipedia page.

Augmentation: Applications

• Can be applied to a variety of applications, each an instance of AM(K, <_K, V, A, g, f, I):
  ❑ Range sum:
     S = AM(Z, <_Z, Z, Z, (k, v) ↦ v, +_Z, 0)
  ❑ 1D stabbing query (yields an interval tree):
     T = AM(R, <_R, R, R, (k, v) ↦ v, max, -∞)
  ❑ 2D range query (yields a range tree):
     R_I = AM(X × Y, <_Y, W, W, (k, v) ↦ v, +_W, 0_W)
     R_O = AM(X × Y, <_X, W, R_I, R_I.singleton, R_I.union, R_I.empty)
  ❑ Document searching:
     M_I = AM(D, <_D, W, W, (k, v) ↦ v, max, 0)
     M_O = AM(T, <_T, M_I, -, -, -, -)

• Also applies to many other tree-based geometric problems, such as segment queries, rectangle queries, ...

Code for the Interval Tree (interval tree: 53 lines; range tree: 176 lines; document searching: 98 lines)

struct interval_map {
  using interval = pair<int, int>;        // template arguments elided on the slide
  struct entry {                          // defines g and f
    using key_t = int;
    using val_t = int;
    using aug_t = int;
    static bool comp(key_t a, key_t b) { return a < b; }
    static aug_t base(key_t k, val_t v) { return v; }
    static aug_t combine(aug_t a, aug_t b) { return (a > b) ? a : b; }
    static aug_t I() { return 0; }
  };
  using amap = aug_map<entry>;            // alias name and template argument inferred
  amap m;

  interval_map(interval* A, int n) { m = amap(A, A + n); }   // construction
  bool stabbing(int p) { return (m.aug_left(p) > p); }       // query
};
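The same entry pattern instantiates the range-sum map S from the application list above; a minimal sketch follows. The entry layout mirrors the interval-tree code, while the query call shown in the comments assumes an aug_range-style function whose exact name in PAM may differ.

// A sketch of the range-sum augmented map S = AM(Z, <, Z, Z, (k,v) -> v, +, 0),
// written in the same entry style as the interval tree above. Illustrative only.
struct sum_entry {
  using key_t = int;
  using val_t = int;
  using aug_t = int;
  static bool comp(key_t a, key_t b) { return a < b; }      // <_K
  static aug_t base(key_t k, val_t v) { return v; }         // g(k, v) = v
  static aug_t combine(aug_t a, aug_t b) { return a + b; }  // f = +
  static aug_t I() { return 0; }                            // identity
};
// Assumed usage (names may differ from PAM's actual interface):
//   using sum_map = aug_map<sum_entry>;
//   sum_map m(A, A + n);            // build from an array of (key, value) pairs
//   int s = m.aug_range(k1, k2);    // sum of values with keys in [k1, k2]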

Range Queries – Sequential and Parallel [PPoPP'18, ALENEX'19]
• 10^8 input points
• Run PAM directly on one core
• Compare to the range tree in CGAL and the R-tree in Boost (both are sequential libraries)

(Plots: running times for construction (s), small-window queries (us), and large-window queries (s); lower is better. PAM is fast even running sequentially — roughly 2-9x faster than CGAL and 1.4-26x faster than Boost across construction and the two query types. On 72 cores with 144 threads, PAM also gets ~60x self-speedup in construction and 60-80x in queries.)

Applications using augmentation

• 1D stabbing queries
• 2D range, segment, and rectangle queries
• 2D sweepline algorithms
• Inverted index searching

• Example code for all applications is available on GitHub: https://github.com/cmuparlay/PAM
• Algorithms and more experiments:
  • PAM: Parallel Augmented Maps, Yihan Sun, Daniel Ferizovic, and Guy Blelloch, ACM Symposium on Principles and Practice of Parallel Programming (PPoPP), 2018
  • Parallel Range, Segment and Rectangle Queries with Augmented Maps, Yihan Sun and Guy E. Blelloch, Algorithm Engineering and Experiments (ALENEX), 2019

Outline

1 Algorithms Using Join

2 Augmentation Using Join
• Formalized augmented trees, implemented with just join
• Multiple applications: range sums, stabbing queries, range queries, inverted index searching
• Simple code
• Parallel solution
• Fast in practice

3 Persistence Using Join

PART 3: Persistence Using Join
• Persistent algorithms
• Multi-version concurrency control (MVCC)

What Are Persistence and MVCC?

• Persistence [DSST'86], for data structures (e.g., T2 = T.insert(k)):
  • Preserves the previous version of the structure
  • Always yields a new version when updated
• Multi-version concurrency control (MVCC), for databases:
  • Write transactions create new versions
  • Ongoing queries work on old versions

Why Persistence and MVCC?
• To guarantee that concurrent updates and queries work correctly and efficiently
  • Queries work on a consistent version
  • Writers and readers do not block each other

Why Persistence and MVCC? An example: a document search engine.

(Figure: a document database answering concurrent queries such as "whales OR dog", "Riverside", "calories", and "room AND magic" over documents 1-4 while new documents are added.)

For the end-user experience:
• Queries shouldn't be delayed by updates
• Queries must be done on a consistent version of the database

This is generally useful for any database system with concurrent updates and queries, e.g., a Hybrid Transactional and Analytical Processing (HTAP) database system.

Persistence Using Join

• Path copying: copy the affected path in the tree (e.g., T2 = T1.insert(4))
• Copying occurs only in Join!
• Always copy the "middle" node passed to Join
• All the other parts of the algorithms remain unchanged
• No extra asymptotic cost in time, and only a small overhead in space
• Safe for concurrency – enables multi-version concurrency control (MVCC)

(Figures: inserting 4 into T1 copies only the nodes on the affected path, 5 and 3, as 5' and 3'; the new version T2 shares all other nodes (1, 8, 9) with T1.)
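A minimal sketch of path copying for a plain insert (its own node type; balance information and Join are omitted for brevity). The join-based algorithms copy nodes in exactly the same places, inside Join:

// Persistent insert by path copying: nodes on the search path are copied,
// everything else is shared between the old and the new version.
struct VNode {
  int key;
  VNode *lc, *rc;
};

VNode* make(int key, VNode* lc, VNode* rc) {   // always allocate a fresh node
  return new VNode{key, lc, rc};
}

VNode* persistent_insert(VNode* t, int k) {
  if (t == nullptr) return make(k, nullptr, nullptr);
  if (k < t->key)                              // copy t, share the untouched right subtree
    return make(t->key, persistent_insert(t->lc, k), t->rc);
  if (k > t->key)                              // copy t, share the untouched left subtree
    return make(t->key, t->lc, persistent_insert(t->rc, k));
  return t;                                    // key already present: share the whole subtree
}

// Usage: VNode* T2 = persistent_insert(T1, 4);  // T1 remains a valid version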

Transactions Using Multi-version Concurrency Control (MVCC)

(Figure: a bank example. A transaction "Wendy +$2, Carol -$2" creates new versions v1 → v_temp → v2 by path copying, while concurrent readers — "find all accounts with balance > 10", "add up the total balance", an insurance agent checking Bob's balance — keep working on the versions that were current when they started.)

• Lock-free atomic updates ☺: a series of operations, or a bulk of operations (e.g., union)
• Easy roll-back ☺
• Does not affect other concurrent operations ☺
• Any operation works as if on a single-versioned tree, with no extra asymptotic cost ☺

Remaining issues:
• Concurrent writes? Concurrent transactions work on snapshots, so they do not take effect on the same tree
• Useless old nodes? Out-of-date nodes should be collected in time

Batching
• Collect all concurrent writes and commit them with a single writer once in a while

(Figure: concurrent clients issue Insert 15, Delete 16, Insert 8, Delete 6, Insert 5, Insert 9, Delete 3; the batching layer turns them into a Union with [5, 8, 9, 15] and a Difference with [3, 6, 16] applied to the database.)
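As a semantic sketch of one commit of the batching layer (using std::set as a stand-in for the persistent tree; the real system would apply a join-based union and difference, producing a new version by path copying):

#include <set>
#include <vector>

// One commit: a single writer applies all buffered inserts and deletes at once.
// std::set stands in for the P-tree; taking `db` by value models creating a new
// version while the caller's old version stays untouched.
std::set<int> commit_batch(std::set<int> db,
                           const std::vector<int>& inserts,   // e.g., {5, 8, 9, 15}
                           const std::vector<int>& deletes) { // e.g., {3, 6, 16}
  for (int k : inserts) db.insert(k);   // corresponds to Union with the insert batch
  for (int k : deletes) db.erase(k);    // corresponds to Difference with the delete batch
  return db;                            // the new version to install as "current"
}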

Compare to Concurrent Data Structures [SPAA'19]
• Compare with concurrent data structures: skiplist, OpenBW tree [Wang et al.], Masstree [Mao et al.], B+tree [Wang et al.], Chromatic tree [Brown et al.]
• Test on the Yahoo! Cloud Serving Benchmark (YCSB)
  • Streams of updates and queries (insertion/update/search)
  • Initial database size: 5 × 10^7; transactions: 10^7
  • Delay within 50 ms

(Plots: throughput in M txns/s, higher is better, on 72 cores with hyperthreading, with 5 × 10^7 initial key-value pairs and 10^7 single-operation transactions, for YCSB workloads A (50/50), B (95/5), and C (100/0) read/write, comparing PAM with OpenBW, Masstree, B+tree, and Chromatic tree. PAM is 1.2-1.4x faster than all the other concurrent data structures, with almost-linear speedup scaling up to 144 threads on workload A.)

Garbage Collection

• Reference-counting garbage collector
• Each tree node records the number of other tree nodes/pointers that refer to it
• Nodes 8 and 1 in the example have reference count 2

(Figure: versions T1 and T2 from the earlier insert; the shared nodes 1 and 8 have reference count 2, all other nodes have count 1.)

Garbage Collection [SPAA'19]

• Collect a node if and only if its reference count is 1

collect(node* t) {
  if (!t) return;
  if (t->ref_cnt == 1) {
    node* lc = t->lc, *rc = t->rc;
    free(t);
    in parallel:
      collect(lc);
      collect(rc);
  } else dec(t->ref_cnt);
}

(Figure: collecting one version decrements the reference counts of the shared nodes — 1 and 8 drop from 2 to 1 — and frees the nodes unique to that version.)

Document Searching
• Inverted index: a map from each word to the list of documents containing it, e.g., blue → {1, 4}, whale → {1, 4}, earth → {1}, largest → {1, 2}, calories → {3, 4}, ...
• Search queries: find the corresponding document lists
• Adding a new document (e.g., document 4, "Blue whales eat half a million calories in one mouthful"): add 4 to the lists of {blue, whales, eat, half, million, calories, mouthful}
• All updates need to be done atomically

Inverted Indexes
• Store each word in the outer tree, with its document list as an inner tree
• Queries: concurrent analytical queries are done on the current version
  • OR/AND query: union/intersection of the two document lists
• Nested trees represent the affiliation relation

(Figure: a nested tree for the inverted index — an outer tree of words (algorithm, join, optimal, parallel, tree) whose nodes point to inner trees of documents such as D1-D5; an OR query unions two inner document trees.)
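To make the OR/AND semantics concrete, here is a small stand-in sketch using std::map and std::set in place of the nested persistent trees (in the real system the inner sets are P-trees and the union/intersection are the parallel join-based versions from Part 1):

#include <algorithm>
#include <iterator>
#include <map>
#include <set>
#include <string>

using doc_set = std::set<int>;                          // stand-in for an inner P-tree of doc ids
using inverted_index = std::map<std::string, doc_set>;  // stand-in for the outer word tree

static doc_set docs_of(const inverted_index& idx, const std::string& w) {
  auto it = idx.find(w);
  return it == idx.end() ? doc_set{} : it->second;
}

// OR query: union of the two words' document lists.
doc_set or_search(const inverted_index& idx, const std::string& a, const std::string& b) {
  doc_set da = docs_of(idx, a), db = docs_of(idx, b), out;
  std::set_union(da.begin(), da.end(), db.begin(), db.end(),
                 std::inserter(out, out.end()));
  return out;
}

// AND query: intersection of the two words' document lists.
doc_set and_search(const inverted_index& idx, const std::string& a, const std::string& b) {
  doc_set da = docs_of(idx, a), db = docs_of(idx, b), out;
  std::set_intersection(da.begin(), da.end(), db.begin(), db.end(),
                        std::inserter(out, out.end()));
  return out;
}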

Inverted Indexes
• Updates: take a union with path copying (updating the inner tree for duplicate words)
  • Atomic: multiple word-document pairs become visible at once
  • Non-destructive: old versions remain available for ongoing queries

(Figure: adding document D7, containing the words "parallel" and "tree", copies the outer nodes parallel' and tree' and gives them inner trees that include D7; the old version stays available for ongoing queries until the new version is installed.)

new_version = current_version.add((parallel, D7), (tree, D7))
current_version = new_version

Experiments on TPC Benchmarks
• 100 GB of data, 72 cores with hyperthreading
• Compare to HyPer [Neumann et al.] and MemSQL [Shamgnov'14]

(Plots: query and update throughput, higher is better. The P-tree achieves good performance for both updates and queries in HTAP workloads: roughly 4-9x faster than both systems in queries, and comparable to the best of the previous systems in updates.)

Applications using join-based MVCC

• Inverted index searching
• Indexes for HTAP database systems
• Transactional systems with precise GC
• Graph processing

• Example code for all applications is available on GitHub
• Algorithms and more experiments:
  • (library) PAM: Parallel Augmented Maps, Yihan Sun, Daniel Ferizovic, and Guy Blelloch, PPoPP 2018
  • (version control, garbage collection) Multiversion Concurrency with Bounded Delay and Precise Garbage Collection, Naama Ben-David, Guy E. Blelloch, Yihan Sun, and Yuanhao Wei, SPAA 2019
  • (graph processing, compression) Low-Latency Graph Streaming Using Compressed Purely-Functional Trees, Laxman Dhulipala, Guy Blelloch, and Julian Shun, PLDI 2019
  • (database system) On Supporting Efficient Snapshot Isolation for Hybrid Workloads with Multi-Versioned Indexes, Yihan Sun, Guy E. Blelloch, Wan Shen Lim, and Andrew Pavlo, PVLDB 13(2)

Outline

1 Algorithms Using Join

2 Augmentation Using Join

3 Persistence Using Join
• Multi-version concurrency control (MVCC)
• Database systems with concurrent updates and queries
• Fast in practice

Summary

Join works across balancing schemes: AVL trees, red-black trees, weight-balanced trees, and treaps.

Join supports:
• Algorithms
• Augmentation
• Persistence

In theory: join-based algorithms
• Balancing-scheme independent
• Supporting a wide range of algorithms
• Work-optimal with polylogarithmic span
• Abstract augmentation
• Persistent and multi-versioning

In practice: the PAM library and P-trees
• Simple, concise implementation
• High performance both sequentially and in parallel
• Applications: 2D range / segment / rectangle queries, interval trees, inverted index searching, database management systems, ...