Data Organization and Processing – Hierarchical Indexing (NDBI007)

David Hoksza
http://siret.ms.mff.cuni.cz/hoksza

Outline

• Background
  • prefix trees
  • graphs
  • …
  • search trees
  • binary trees
• Newer types of tree structures
  • m-ary trees

• Typical tree type structures
  • B-tree
  • B+-tree
  • B*-tree

Motivation

• As in hashing, the motivation is to find record(s) given a query key using only a few operations
  • unlike hashing, trees allow retrieving records with keys from a given range

• Tree structures use “clustering” to efficiently filter out non-relevant records from the data set

• Tree-based indexes practically implement the index file data organization
  • for each search key, an indexing structure can be maintained

Drawbacks of Index(-Sequential) Organization

• Static nature

  • when inserting a record at the beginning of the file, the whole index needs to be rebuilt
  • an overflow handling policy can be applied → the performance degrades as the file grows

• reorganization can take a lot of time, especially for large tables

• not possible for online transactional processing (OLTP) scenarios (as opposed to OLAP – online analytical processing) where many insertions/deletions occur

Tree Indexes

• Most common dynamic indexing structure for external memory

• When inserting into/deleting from the primary file, the indexing structure(s) residing in the secondary file are modified to accommodate the new key

• The modification of a tree is implemented by splitting/merging nodes

• Used not only in databases
  • the NTFS directory structure is built over a B+-tree

Trees

• Undirected graph without cycles
• We will be interested in rooted trees
  • one node designated as the root of the hierarchy → orientation
• Nodes are in a parent-child relation
  • every parent has a finite set of descendants (children nodes)
  • from a parent an edge points to each child
  • every child has exactly one parent
• root
  • the only node without a parent
• inner nodes
  • nodes with children
• leaves
  • nodes without children
• branch
  • path from a leaf to the root

Tree Characteristics (1)

(figure: a tree of arity 3 with height/depth 3; each node is labeled with its depth D and height H, levels 0–3 with widths 1, 2, 5, 2)

• Tree arity/degree
  • maximum number of children of any node
• Node depth
  • the length of the simplest path (number of edges) from the root node
  • depth(root) = 0
• Tree depth
  • maximum of node depths
• Tree level
  • set of nodes with the same depth (distance from root)
• Level width
  • number of nodes at a given level
• Node height
  • number of edges on the longest downward simple path from the node to a leaf
• Tree height
  • height of the root node
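The characteristics above are easy to compute recursively. A minimal sketch in Python; the `Node` class is illustrative, not lecture material:

```python
# Minimal sketch: computing tree characteristics defined above.
# The Node class is a hypothetical representation, not from the lecture.
class Node:
    def __init__(self, *children):
        self.children = list(children)

def height(node):
    """Number of edges on the longest downward path to a leaf."""
    if not node.children:
        return 0
    return 1 + max(height(c) for c in node.children)

def arity(node):
    """Maximum number of children of any node in the subtree."""
    if not node.children:
        return 0
    return max(len(node.children), max(arity(c) for c in node.children))

# A small tree: root with two children, one of which has a single child.
leaf = Node()
root = Node(Node(leaf), Node())
print(height(root))  # tree height = height of the root = 2
print(arity(root))   # tree arity = 2
```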

Tree Characteristics (2)

• Balanced tree
  • a tree whose subtrees differ in height by no more than one and the subtrees are height-balanced as well

• Unsorted tree
  • general tree – the descendants of a node are not sorted at all

• Sorted tree
  • unlike a general tree, the order of children of each node matters
  • children of a node are sorted based on a given key

Binary Trees

• Each inner node contains at most two child nodes (tree arity = 2)

• Types
  • perfect binary tree – every non-leaf node has two child nodes and all leaves are at the same depth
  • complete binary tree – full tree except for the last level, where nodes are as far left as possible

• Characteristics
  • number of nodes of a perfect binary tree of height h:
    #nodes_perf = Σ_{i=0}^{h} 2^i = 2^{h+1} − 1
  • number of leaf nodes of a perfect binary tree of height h:
    #leaf_nodes_perf = 2^h

Binary Search Tree (BST)

• Binary sorted tree where
  • each node is assigned a value corresponding to one of the keys
  • all nodes in the left subtree of a node X correspond to keys with lower values than the value of X
  • all nodes in the right subtree of a node X correspond to keys with higher values than the value of X
  • when using a BST for storing records, the records can be stored directly in the nodes or addressed from them

• Searching for a record with a key k
  1. Enter the tree at the root level
  2. Compare k with the value v of the current node
  3. If k = v, return the record corresponding to the current node key
  4. If k < v, descend to the left subtree
  5. If k > v, descend to the right subtree
  6. Repeat steps 2–5 until a leaf is reached. If the condition in step 3 is not met at any level, no record with the key k is present
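The search procedure above can be sketched as follows (a minimal illustration in Python; the tuple-based node representation is an assumption, not the lecture's):

```python
# Minimal BST search sketch; a node is a (key, record, left, right) tuple.
# The tuple layout is illustrative, not the lecture's representation.
def bst_search(node, k):
    while node is not None:
        key, record, left, right = node
        if k == key:
            return record                      # step 3: key found
        node = left if k < key else right      # steps 4-5: descend
    return None                                # fell off a leaf: not present

# BST over keys 4, 2, 7, 9 (records are just strings here)
leaf2 = (2, "rec2", None, None)
leaf9 = (9, "rec9", None, None)
node7 = (7, "rec7", None, leaf9)
root = (4, "rec4", leaf2, node7)
print(bst_search(root, 7))  # "rec7"
print(bst_search(root, 5))  # None
```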

Redundant vs. Non-Redundant BST

Non-Redundant BST
• Records stored in (addressed from) both inner and leaf nodes

Redundant BST
• Records stored in (addressed from) the leaves
• Structure of inner and leaf nodes differ

(figure: the same key set 2, 4, 7, 8, 9, 12 stored in a non-redundant BST and in a redundant BST whose leaf level holds all the keys)

Applications of Binary Trees (not related to storing data records)

• Expression trees
  • leaves = variables
  • inner nodes = operators

• Coding trees (e.g., Huffman coding)
  • leaves = data
  • coding along a branch leading to a given leaf = leaf's binary representation

• Query optimizers in DBMSs
  • a query can be represented by an algebraic expression which can in turn be represented by a binary tree

M-ary Trees (1)

• Motivation

• binary trees are not suitable for magnetic disks because of their height – indexing 16 items would need an index of height 4 → up to 4 disk accesses (potentially in different locations) to retrieve one of the 16 records

• log2 1,000 ≈ 10, log2 1,000,000 ≈ 20

• Increasing the arity decreases the tree height

M-ary Trees (2)

• Generalization of binary search trees
  • a binary tree is a special case of an m-ary tree for m = 2
  • only m-ary trees with m > 2 are usually considered true m-ary trees
  • unlike a binary tree, an m-ary tree contains m − 1 discriminators (keys) in every node
  • every node N points to up to m child nodes (subtrees)
  • every subtree of N contains records with keys restricted by the pair of discriminators of N between which the subtree is rooted
    • the leftmost subtree contains only records with keys having lower values than all the discriminators in N
    • the rightmost subtree contains only records with keys having higher values than all the discriminators in N

M-ary Trees (3)

(figure: a node of an m-ary tree with m = 4 stored in one disk block; it holds keys k1, k2, k3 with associated data1, data2, data3 and four child pointers: a subtree with all keys ki < k1, one with k1 < ki < k2, one with k2 < ki < k3, and one with k3 < ki)

Data could be stored directly in the node as well, but this is not usual in real-world database environments.

Characteristics of M-Ary Trees (1)

• Maximum number of nodes
  • tree height h
  • level_0 = m^0 = 1, level_1 = m^1, level_2 = m^2, …, level_h = m^h
  • Σ_{i=0}^{h} m^i = (m^{h+1} − 1) / (m − 1)

• Maximum number of records/keys
  • every node contains up to m − 1 keys/records
  • (m^{h+1} − 1) / (m − 1) × (m − 1) = m^{h+1} − 1
  • m = 3, h = 3 → #records ≤ 3^4 − 1 = 80
  • m = 100, h = 3 → #records ≤ 100^4 − 1 = 99,999,999
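The closed form can be checked against a brute-force per-level sum (a quick sketch, not lecture material):

```python
# Check m^{h+1} - 1 against summing m-1 keys per node over all levels.
def max_records(m, h):
    nodes = sum(m**i for i in range(h + 1))   # geometric series of level sizes
    return nodes * (m - 1)                    # up to m-1 keys in each node

print(max_records(3, 3))    # 80, i.e. 3^4 - 1
print(max_records(100, 3))  # 99999999, i.e. 100^4 - 1
```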

Characteristics of M-Ary Trees (2)

• Minimum height
  • h ≐ log_m n
  • the minimum height of the tree corresponds to the minimum number of disk operations needed to fetch a record → search complexity O(log_m n)

• Maximum height
  • h ≐ n
  • the maximum height of the tree corresponds to the maximum number of disk operations needed to fetch a record → search complexity O(n)

• The challenge is to keep the complexity logarithmic, that is, to keep the tree more or less balanced

B-tree

• Bayer & McCreight, 1972

• B-tree is a sorted balanced m-ary (not binary) tree with additional constraints restricting the branching in each node thus causing the tree to be reasonably “wide”

• Inserting or deleting a record in a B-tree causes only local changes and does not require rebuilding the whole index

B-tree definition (1)

• B-trees are balanced m-ary trees fulfilling the following conditions:

1. The root has at least two children unless it is a leaf.
2. Every inner node except the root has at least ⌈m/2⌉ and at most m children.
3. Every node contains at least ⌈m/2⌉ − 1 and at most m − 1 (pointers to) data records.
4. Each branch has the same length.

B-tree definition (2)

5. The nodes have the following organization:

   p0, (k1, p1, d1), (k2, p2, d2), …, (kn, pn, dn), u

   where p0, p1, …, pn are pointers to the child nodes, k1, k2, …, kn are keys, d1, d2, …, dn are associated data, u is unused space, the records (ki, pi, di) are sorted in increasing order with respect to the keys, and ⌈m/2⌉ − 1 ≤ n ≤ m − 1.

6. If a subtree U(pi) corresponds to the pointer pi, then:
   i.  ∀k ∈ U(p_{i−1}): k < ki
   ii. ∀k ∈ U(pi): k > ki

Example of a B-Tree

• Keys: 25, 48, 27, 91, 35, 78, 12, 56, 38, 19

48

19 27 78

12 25 35 38 56 91

Redundant B-Tree

• Redundant vs. non-redundant B-trees

• the presented definition introduced the non-redundant B-tree where each key value occurred just once in the whole tree

• redundant B-trees store the data values in the leaves and thus have to allow keys to repeat in the inner nodes
  • restriction 6(i) is modified so that ∀k ∈ U(p_{i−1}): k ≤ ki
  • moreover, the inner nodes do not contain pointers to the data records

B-Tree Implementations

• Nodes vs. pages/blocks
  • usually one page/block contains one node

• Existing DBMS implementations
  • one page usually takes 8 KB
  • redundant B-trees
    • inner nodes contain only pairs (ki, pi) → inner nodes and leaf nodes differ in structure and size
    • data are not stored in the indexing structure itself but addressed from the leaf nodes
  • if the key is a 64-bit integer (8 B) and the DBMS runs on a 64-bit OS (8 B pointers) with 8 KB blocks → (ki, pi) = 16 B → one page can accommodate 8192/16 = 512 pairs → the arity of the B-tree is then about 500
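The back-of-the-envelope arity computation can be written out (a quick sketch using the slide's example sizes):

```python
# Arity estimate for a redundant B-tree whose inner nodes hold (key, pointer) pairs.
PAGE = 8 * 1024   # 8 KB page
KEY = 8           # 64-bit integer key
PTR = 8           # pointer on a 64-bit OS

pairs_per_page = PAGE // (KEY + PTR)
print(pairs_per_page)  # 512 -> arity of roughly 500 after per-node overhead
```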

B-Tree – Search

• Searching a (non-redundant) tree 푇 for a record with key 푘

1. Enter the tree at the root node.
2. If the node contains a key ki such that ki = k, return the associated data di.
3. Else if the node is a leaf, return NULL.
4. Else find the lowest i such that k < ki and set j = i − 1. If there is no such i, set j to the rightmost index with an existing key.
5. Fetch the node pointed to by pj.
6. Repeat the process from step 2.
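One iteration of this loop can be sketched per node (Python; the dict-based node layout with parallel sorted `keys`, `data`, and `children` lists is an assumption, not the lecture's structure):

```python
import bisect

# Sketch of non-redundant B-tree search; a node is a dict with sorted
# "keys", parallel "data", and "children" (empty list in a leaf).
def btree_search(node, k):
    while node is not None:
        i = bisect.bisect_left(node["keys"], k)
        if i < len(node["keys"]) and node["keys"][i] == k:
            return node["data"][i]            # step 2: key found in this node
        if not node["children"]:
            return None                       # step 3: reached a leaf
        node = node["children"][i]            # steps 4-5: descend to child j

leafA = {"keys": [12, 25], "data": ["d12", "d25"], "children": []}
leafB = {"keys": [56, 91], "data": ["d56", "d91"], "children": []}
root = {"keys": [48], "data": ["d48"], "children": [leafA, leafB]}
print(btree_search(root, 56))  # "d56"
print(btree_search(root, 40))  # None
```

With 0-based indexing, `bisect_left` directly yields the child index j from step 4.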

Updating a B-Tree

• The logarithmic complexity is ensured by the condition that every node has to be at least half full

• Inserting
  • find the leaf where the new record should be inserted
  • when inserting into a not yet full node, no splitting occurs
  • when inserting into a full node, the node is split in such a way that the two resulting nodes are at least half full
  • splitting a node increments the number of subtrees of the parent node and thus possibly causes splitting of the parent node → split cascade

• Deleting
  • when deleting a record from a node more than half full, no reorganization happens
  • deleting in a half-full node induces merging with a neighboring node
  • merging decreases the number of child nodes in the parent node, thus possibly causing merging in the parent node → merge cascade

Inserting into a B-tree (1)

• Insert into a (non-redundant) tree T a record r with key k

1. If the tree is empty, allocate a new node, insert the key k and (a pointer to) the record r, and return.
2. Else find the leaf node L where the key k belongs.
3. If L is not full, insert r and k into L in such a position that the keys stay sorted, and return.
4. Else create a new node L′.
5. Leave the lower half of the records (all the items from L plus r) in L and move the higher half into L′, except for the item with the middle key k′.
   a) If L is the root, create a new root node, move the record with key k′ to the new root, point it to L and L′, and return.
   b) Else move the record with key k′ to the parent node P into the appropriate position based on the value k′ and point the “left” pointer to L and the “right” pointer to L′.
6. If P overflows, repeat step 5 for P, else return.

Inserting into a B-tree (2)

(figure: starting from the tree with root (27) and leaves (25) and (48), Insert 91 puts 91 into the leaf (48) → (48, 91))

Insert 35: 35 falls into the node (48, 91) → (35, 48, 91) → the middle key (48) moves to the parent node, yielding root (27, 48) with leaves (25), L = (35) and L′ = (91)

Inserting into a B-tree (3)

(figure: Insert 78, 12 – both fit into existing leaves without splitting: root (27, 48) with leaves (12, 25), (35), (78, 91))

Insert 56: 56 falls into the node (78, 91) → (56, 78, 91) → the middle key (78) moves to the parent node; 78 moves into the node (27, 48) → (27, 48, 78) → the middle key (48) moves to the parent node → new root (48)

(figure: the resulting tree – root (48), inner nodes (27) and (78), leaves (12, 25), (35), (56), (91))
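The splitting walk-through above can be sketched in code (an illustrative Python B-tree insert for small m; the `Node` class and the list-based representation are assumptions, not the lecture's structures):

```python
import bisect

# Illustrative B-tree insert with node splitting (keys only, m = 3, so a
# node overflows when it holds m keys). A sketch, not a production structure.
class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []        # empty list for a leaf

M = 3  # maximum number of children; at most M - 1 keys per node

def insert(root, k):
    path, node = [], root
    while node.children:                      # descend to the leaf for k
        i = bisect.bisect_left(node.keys, k)
        path.append((node, i))
        node = node.children[i]
    bisect.insort(node.keys, k)
    while len(node.keys) > M - 1:             # split upwards while overflowing
        mid = len(node.keys) // 2
        k_mid = node.keys[mid]
        right = Node(node.keys[mid + 1:], node.children[mid + 1:])
        node.keys, node.children = node.keys[:mid], node.children[:mid + 1]
        if path:
            parent, i = path.pop()
            parent.keys.insert(i, k_mid)       # middle key moves to the parent
            parent.children.insert(i + 1, right)
            node = parent
        else:                                  # splitting the root: tree grows
            root = Node([k_mid], [node, right])
            node = root
    return root

root = Node()
for key in [25, 48, 91, 35]:                   # replays the slide's example
    root = insert(root, key)
print(root.keys)                               # [48]
print([c.keys for c in root.children])         # [[25, 35], [91]]
```

Replaying 25, 48, 91, 35 reproduces the example: the root becomes (48) with leaves (25, 35) and (91).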

Deleting from a B-tree (1)

• Delete from tree 푇 record 푟 with key 푘

1. Find the node N containing the key k.
2. Remove r from N.
3. If the number of keys in N ≥ ⌈m/2⌉ − 1, return.
4. Else, if possible, merge N with either the right or the left sibling (this includes an update of the parent node accompanied by a decrease of the number of keys in the parent node).
5. Else reorganize the records among N, its sibling and the parent node.
6. If needed, reorganize the parent node in the same way (steps 3–5).

Deleting from a B-tree (2)

(figure: Delete 25 – in the tree with root (27, 48) and leaves (12, 25), (35), (78, 91), removing 25 leaves the leaf (12), which is still valid)

(figure: Delete 35 – the leaf (35) underflows; it is merged with its sibling and the separating key 27 is pulled down, yielding root (48) with leaves (12, 27) and (78, 91))

Deleting from a B-tree (3)

(figure: Delete 41 – in a tree with m = 5, root (30) and leaves (13, 25, 27, 29) and (41, 43), removing 41 makes the right leaf underflow; records are redistributed through the parent node: 29 moves up and 30 moves down, yielding root (29) with leaves (13, 25, 27) and (30, 43))

B-Tree – Complexity

• Let us have
  • 8 KB = 8 × 2^10 B page size (R)
  • 10 B key (K)
  • 8 B pointer to a tree node (P_T)
  • 9 B pointer to data (pointer to a data page + offset) (P_D)

• Arity of a B-tree with such values
  • m × P_T + (m − 1) × (P_D + K) ≤ R
  • m × 8 + (m − 1) × (9 + 10) ≤ 8192
  • m ≤ (8192 + 19) / 27 ≐ 304
  • every node can accommodate up to 303 records (m − 1)
  • if every node shows utilization 2/3, then every node contains about 202 records
  • thus on average
    • tree of height 0 → 202
    • tree of height 1 → 40,804
    • tree of height 2 → 8,242,408
    • tree of height 3 → 1,664,966,416
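The capacity figures can be reproduced with a few lines (a sketch using the slide's values; like the slide, it approximates the capacity of a tree of height h as 202^{h+1}):

```python
# Reproduce the slide's arity and capacity estimates.
R, K, PT, PD = 8 * 2**10, 10, 8, 9           # page, key, node ptr, data ptr sizes
m = (R + (PD + K)) // (PT + PD + K)          # from m*8 + (m-1)*19 <= 8192
per_node = (2 * (m - 1)) // 3                # 2/3 utilization -> ~202 records
print(m, per_node)                           # 304 202
for h in range(4):
    print(h, per_node ** (h + 1))            # 202, 40804, 8242408, 1664966416
```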

B+-tree

• Redundant B-tree
  • records are pointed to from the leaf nodes only

• The leaf level is chained
  • the leaf nodes do not have to be physically next to each other since they are connected using pointers

• Preferred in existing DBMS

• In some DBMS (e.g., Microsoft SQL Server), also the non-leaf levels are chained

B+-Tree – Example (m=3)

(figure: a B+-tree over employee names; root separator GJ, inner separators BK and MG, chained leaves BK, DT, GJ, MG, WD, WR pointing into the data table)

Name | Department
Brown Kevin | PR
Brown Kevin | Purchasing
Brown Kevin | Marketing
Duffy Terri | Development
Duffy Terri | Research
Galvin Janice | Purchasing
Matthew Gigi | Purchasing
Walters David | Production
Walters Rob | Marketing
Walters Rob | Development
Walters Rob | PR

B- vs. B+-Tree

B-tree Advantages
• Non-redundant
  • can be redundant as well
• A record can be identified faster (if it resides in an inner level)

B+-tree Advantages
• Smaller inner nodes (no pointers to the data) → nodes can accommodate more records → lower tree height → faster search
  • especially of range queries
• Insert and delete operations are simpler
• Simple implementation

Bulk Loading B-Trees

• When indexing a large collection, inserting records one by one can be tedious → bulk loading

1. Sort the data based on the search key in the pages
2. Insert a pointer to the leftmost page into a new root
3. Move over the data file and insert the respective keys into the rightmost index page above the leaf level. If the rightmost page overflows, split.

(figure: sorted leaf pages holding 3 4 6 | 9 10 11 12 | 13 20 22 23 | 31 35 36 38 | 41 44; the index levels above, e.g. root (13) over (9), (31) over (4), (11), (22), (36), are built by repeatedly inserting into the rightmost page)

Page Balancing

• Modification of the B-tree where an overflow does not have to lead to a page split
• When a page overflows
  • sibling pages are checked
  • the content of the overflowed page is joined into a set X with the left or right neighbor
  • the record to be inserted is added into X and the content is equally distributed into the two nodes
  • the changes are projected into the parent node where the keys have to be modified (but no new key is inserted → no split cascade)
• For high m this change leads to about 75% utilization in the worst case
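The redistribution step can be sketched as follows (illustrative Python; the list-based pages and the separator convention, first key of the right page as in a redundant tree, are assumptions):

```python
# Page balancing sketch: instead of splitting an overflowed page, pool its
# keys with a sibling plus the new key (the set X) and split the pool evenly.
def balance(left, right, new_key):
    pool = sorted(left + right + [new_key])    # the set X from the slide
    mid = len(pool) // 2
    left[:], right[:] = pool[:mid], pool[mid:]
    return right[0]    # new separator for the parent (redundant-tree style)

left, right = [3, 4, 6, 9], [22, 23]           # left page is full, sibling is not
sep = balance(left, right, 10)
print(left, right, sep)    # [3, 4, 6] [9, 10, 22, 23] 9
```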

B*-Tree

• Generalization of page balancing
  • the root node has at least 2 children
  • all the branches are of equal length
  • every node different from the root has at least ⌈(2m − 1)/3⌉ children

• Tree modification
  • two full pages are split into 3 pages (one new page)
  • if a node is full but none of its neighbors is full, page balancing takes place
  • if the insert occurs in a full page which has a full left or right neighbor
    • their content is joined into a set X
    • the new record is added into X
    • a new page P is allocated
    • the records from X are equally distributed into the 3 pages (the 2 existing and P)
    • a new key is added into the parent node and the keys are adjusted
  • the delete operation is handled similarly

B*-Tree – Example

7 24 37

5 6 7 11 23 24 33 34 37 43 45 47

7 24 37

Insert 20

5 6 7 11 20 23 24 33 34 37 43 45 47

7 23 37

Insert 22

5 6 7 11 20 22 23 24 33 34 37 43 45 47

B*-Tree – Example (cont.)

7 23 37

5 6 7 11 20 22 23 24 33 34 37 43 45 47

Insert 32 – to be redistributed: 11, 20, 22, 23, 24, 33, 34, 37 and 32

7 22 32 37

5 6 7 11 20 22 23 24 32 33 34 37 43 45 47

Deferred Splitting

• In the original B-trees, certain sequences of inserts can lead to only half utilization → deferred splitting

• let us keep an overflow page for each node

• when a page overflows, the overflowing record is inserted into the respective overflow page

• when both the original and the overflow page are full, the original page is split and the overflow page is emptied

Deferred Splitting – Example

Original nonredundant B-tree with m=5

(figure: root (9); inner nodes (3, 6) and (12, 15, 18); leaves (1 2), (4 5), (7 8), (10 11), (13 14), (16 17), (19 20))

Fill factor = 0.5 (worst possible)

Deferred Splitting – Example (cont.)

(figure: page (1 2 3 4); after Insert 5 its overflow page holds (5); after Insert 6, 7, 8 the overflow page holds (5 6 7 8))

(figure: Insert 9 – both the page and its overflow page are full → split: root (5) with pages (1 2 3 4) and (6 7 8 9))

(figure: Insert 10–20 – the tree grows to root (5 10 15) with pages (1 2 3 4), (6 7 8 9), (11 12 13 14), (16 17 18 19) and overflow page (20); Fill factor = 0.83)

Variable Length Records

• Often we want to index not only numbers but also strings → variable-length records (VLR) → a different m for different nodes

• When splitting a page with VLRs, the length of the records is taken into account rather than their number → the distribution is driven by the resulting length → can lead to a violation of the condition regarding the minimum number of records in a B-tree
  • longer records tend to get closer to the root → lower arity close to the root
  • when merging, a short record can be replaced by a longer one → height increase

VLR – Example

Representing the sentence: “DO NOT FALL ASLEEP”
• node size is 15
• pointers are represented by @
• for the sake of simplicity, let us consider the size of a pointer to be identical to the size of a character

@ D O @ F A L L @ N O T @

• Inserting “ASLEEP” causes an overflow → splitting
• Sequence to be split: @ASLEEP@DO@FALL@NOT@
  • O is the middle character

@ D O @

@ A S L E E P @

@ F A L L @ N O T @

Prefix (B-)Tree

• Modification of the redundant B-tree
  • inner node keys do not have to be subsets of the keys in the leaf level
  • inner node keys only have to separate the subtrees
  • smaller keys lead to higher node capacity → lower trees → faster access
  • a suitable choice of separators are prefixes of the keys

F

B Co M

An As E I N No

(leaf level words: A, An, And, As, By, Certain, Computation, Computations, Equation, Equations, For, From, In, Its, Method, Methods, New, Note, Notes)

Clustered vs Nonclustered Indexes

(figure: clustered vs nonclustered index – index pages above data pages; in the clustered case the order of the index leaves matches the order of the data pages, in the nonclustered case it does not)

Clustered vs Nonclustered Indexes

• Clustered index
  • the logical order of the key values determines the physical order of the corresponding data records, i.e., the order of the data in the data file follows the order of data in the index file
  • a file can be clustered over at most one search key → one clustered index
  • basically corresponds to the idea of index-sequential file organization

• Nonclustered index
  • the order of data in the index and in the primary file is not related
  • multiple nonclustered indexes can exist

• Motivation
  • when range querying over a nonclustered index, every record might reside in a different data page
  • a clustered index should be defined over an attribute over which range scans often happen

(Non)Clustered Indexes – SQL

CREATE TABLE Product(
  id INT PRIMARY KEY NONCLUSTERED,
  code NVARCHAR(5),
  name NVARCHAR(50),
  type INT);

CREATE NONCLUSTERED INDEX ixProductCode ON Product(Code);

CREATE CLUSTERED INDEX ixProductType ON Product(Type);

Indexes in Existing Leading Database Systems

               | Oracle 11g | Microsoft SQL Server 2012  | PostgreSQL 9.2 | MySQL 5.5
Standard index | B+-tree    | B+-tree                    | B+-tree        | B+-tree
Bitmap index   | Yes        | No                         | No             | No
Hash index     | Yes        | Yes (clustering)           | Yes            | Yes (clustering)
Spatial index  | R-tree     | B+-tree (mapping 2D to 1D) | R-tree         | R-tree

Less Traditional Approaches to Indexing

• Update-optimized structures
  • Buffered repository tree (BRT)
  • Streaming B-tree

• Cache-oblivious structures
  • Cache-oblivious B-tree
  • Streaming B-tree

B-Tree Insert Drawback

• Inserting in (amortized) time O(log_b N) is suboptimal

• With a search/insert tradeoff one can get much faster inserts for an increase in the search time
  • suitable for applications with a lot of inserts (e.g., streaming databases (sensor data), file systems, …)

             | Search            | Insert/Delete
B-tree       | O(log_b N)        | O(log_b N)
BR-tree      | O(log_2 N)        | O((1/b) log_2 N) *
B^ε-tree     | O((1/ε) log_b N)  | O((1/(ε b^{1−ε})) log_b N) *
B^{1/2}-tree | O(2 log_b N)      | O((2/√b) log_b N) *

* amortized

Buffered Repository Tree

• Buchsbaum, Goldwasser, Venkatasubramanian, Westbrook, 2000 (BGVW2000)

• The buffered repository tree is an augmented (2,4) tree
  • each leaf node holds up to b items
  • each inner node contains keys (2–4) and one buffer holding up to b items
  • all items stored in descendant leaves and buffers of a node obey the standard search criteria with respect to the rank among its sibling(s)

• Insert
  • the record is inserted into the root's buffer
  • when a buffer overflows, the data are redistributed to the descendants based on the key items in that node
  • any time a buffer is distributed to children nodes during an insert, we apportion the O(1) I/Os uniformly to the b items that are distributed; each item is thus charged O(1/b) I/Os per level

• Search
  • the root buffer is scanned
  • if the item is not found, the procedure recursively descends into the next level based on the keys and scans the respective buffer at the given level → O(log_2 N)

Cache-oblivious model

• Full memory hierarchy where each memory can have different characteristics
• The algorithm does not know the size M of its cache, or the block size B

(figure: CPU connected to a full memory hierarchy – different types of caches and external memory types)

COM search trees

• Search tree in the RAM model
  • O(log_B N) search, insert, delete
  • constant factor for search: 1

• Search tree in the COM model
  • O(log_B N) search, insert, delete
  • constant factor for search: e (we will show 4)

• Based on binary trees → we need a good enough mapping between the order of nodes and the sequential organization (array) so that traversal does not touch too many blocks on disk

COM static search tree – idea

• van Emde Boas ordering
• Cut the tree at half its height → extract the individual subtrees (of size ≈ √N) → repeat recursively → concatenate + label

1

2 3

4 7 10 13

5 6 8 9 11 12 14 15

vEB ordering example

Source: Bender, Michael, Erik D. Demaine, and Martin Farach-Colton. “Cache-oblivious B-trees.” Proceedings of the 41st Annual Symposium on Foundations of Computer Science (FOCS), IEEE, 2000.

COM static search trees – complexity

• For a given block size B, let us stop the recursion so that the size of the subtrees (T_S) ≤ B (and T_S is of maximum size)

• Each T_S can span at most 2 blocks

• The height of each T_S is in ((1/2) log_2 B ; log_2 B] → reading the tree from root to leaf takes at most

  2 × (log_2 N) / ((1/2) log_2 B) = 4 log_B N

  block transfers

COM B-tree

• Requires ordered file maintenance
  • [Itai, Alon, Alan G. Konheim, and Michael Rodeh. A sparse table implementation of priority queues. Springer Berlin Heidelberg, 1981]
  • enables storing N entries in a given order in O(N) space with O(1) gaps (between two elements there is at most a constant number of gaps)
  • enables insert/delete at a given position by moving entries in an interval of size O(log^2 N) amortized

• We stack a static search tree over the ordered file and crosslink (even the gaps)
  • leaves contain the keys from the ordered file
  • inner nodes contain the maximum of the left and right subtree

(figure: a binary tree over the ordered file 2, 7, 10, 12, 20, 25 with gaps; each inner node stores the maximum of its subtrees, e.g. root 25 over (10, 25), then (2, 10, 12, 25))

COM B-tree update

1. Search for the position in the ordered file: O(log_B N)
2. Update the ordered file: O((log^2 N)/B) → O(log_B N)
3. Propagate the changes (post-order): O(log_B N + (log^2 N)/B) → O(log_B N)

10 25 8 25

2 10 12 25 2 8 12 25

2 7 10 12 20 25 2 7 8 10 12 20 25

2 7 8 10 12 20 25    2 7 10 12 20 25

COM B-tree update analysis (1)

Since we update post-order, to update the whole tree of size B², we first update the maxes of the lower leftmost subtree, then go to the next one, and so on. Each subtree is in at most two pages, so with a cache of size 4, jumping between the lower and upper level is free.

(figure: the bottom 2 levels of the tree partitioned into subtrees of size ≤ B)

→ O(1 + (log^2 N)/B) trees (blocks) need to be visited in the bottom 2 levels (amortized)

• Above the bottom 2 levels: ≤ log_2 N nodes → O(log_B N) memory transfers

(figure: the upper part of the tree; there are ≈ (log^2 N)/B nodes in a subtree/level of the lower part, amortized)

COM B-tree update analysis (3)

• The cost of an update is thus

  O(log_B N + (log^2 N)/B)

  • search: O(log_B N)
  • update of the file and propagation in the lower part of the tree: O((log^2 N)/B)
  • propagation in the upper part of the tree: O(log_B N)

• For B big enough this is as good as a B-tree

• Using indirection to get O(log_B N)
  • split the items into blocks of size Θ(log_2 N)
  • for each block, store only its minimum in the vEB + ordered file structure → N/log_2 N records
  • search = search in the tree + scan of one block: O(log_B N + (log_2 N)/B)
  • an update involves a rewrite of the whole block + possibly a split/merge of blocks
  • block utilization ∈ ((1/4) log_2 N ; log_2 N]
  • amortized update cost: O((log_2^2 N)/(B log_2 N) + (log_2 N)/B) = O((log_2 N)/B) < O(log_B N)

Streaming B-trees

• Implemented in the TokuDB engine of MySQL → Fractal tree
• We show its older implementation, which is based on COLA (the current implementation is based on the B^ε-tree)

• Cache-oblivious lookahead array (COLA)
  • consists of log_2 N arrays where the i-th array stores 2^i elements
  • each array is either completely full or empty
  • cache-oblivious since the block size does not appear in the data structure

• Invariants
  • the k-th array contains items if and only if the k-th least significant bit of the binary representation of N (the number of records) is 1
  • each array contains its items in ascending order by key

• When a new item is inserted, we effectively perform a carry
  • we create a list of length one with the new item, and as long as there are two lists of the same length, we merge them into the next bigger size
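The carry-style insert can be sketched as follows (an illustrative Python COLA without forward pointers; the representation of the structure as a list of levels, and the function names, are assumptions):

```python
import bisect
import heapq

# Minimal COLA sketch: level i is either None (empty) or a sorted list of
# 2^i keys. Inserting performs a binary-counter "carry" of equal-sized runs.
def cola_insert(levels, key):
    carry = [key]                       # a full run of size 1
    i = 0
    while True:
        if i == len(levels):
            levels.append(None)
        if levels[i] is None:           # empty slot: deposit the carry
            levels[i] = carry
            return
        carry = list(heapq.merge(levels[i], carry))  # merge two 2^i runs
        levels[i] = None
        i += 1

def cola_search(levels, key):           # naive search: binary search per level
    for run in levels:
        if run:
            j = bisect.bisect_left(run, key)
            if j < len(run) and run[j] == key:
                return True
    return False

levels = []
for k in [12, 17, 23, 30]:              # the slide's 4-record example
    cola_insert(levels, k)
print(levels)                           # [None, None, [12, 17, 23, 30]]
print(cola_search(levels, 17), cola_search(levels, 5))  # True False
```

After 4 inserts only the 4-array is full, matching the invariant for N = 4 = 100 in binary.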

Simplified fractal tree – example

A fractal tree with 4 records

12 17 23 30

Every number can be written as the sum of powers of 2 → the non-zero members correspond to the non-empty arrays.

A fractal tree with 10 records

5 10

3 6 8 12 17 23 26 30

Fractal tree – Insert

15

5 10

Insert 15

3 6 8 12 17 23 26 30

Record with key 15 fits into the 1-array.

In order for insert to work, each array has a temporary storage. At the beginning of each stage, all the temporary arrays are empty.

Fractal tree – Insert (cont.)

15 7

5 10

Insert 7

3 6 8 12 17 23 26 30

5 10 7 15

3 6 8 12 17 23 26 30

Fractal tree – Insert (cont.)

5 10 7 15

3 6 8 12 17 23 26 30

5 7 10 15

3 6 8 12 17 23 26 30

Fractal tree – Insert (cont.)

In some cases the cascade can be very long.

9 5 10 2 18 33 40 3 6 8 12 17 23 26 30

9 5 10 2 18 33 40 3 6 8 12 17 23 26 30

31

5 10 2 18 33 40 3 6 8 12 17 23 26 30

9 31

2 18 33 40 3 6 8 12 17 23 26 30

5 9 10 31

3 6 8 12 17 23 26 30

2 5 9 10 18 31 33 40

2 3 5 6 8 9 10 12 17 18 23 26 30 31 33 40

Fractal tree – complexity

• Insert
  • O((1/b) log_2 N) amortized and O(log_2 N) worst-case block transfers
  • each element is merged at most O(log_2 N) times
  • the cost to merge 2 arrays of size X is O(X/b) block I/Os
  • the cost per element per merge is O(1/b), since we merge blocks holding b items

• Search
  • O(log_2^2 N) block transfers
    • naïve solution where each level is searched using binary search, which takes O(log_2 N) time per level (basic COLA)
  • O(log_2 N)
    • using forward pointers

Forward pointers

• Fractional cascading (Chazelle, Guibas, 1986)
  • each element contains a forward pointer to where it fits in the next-level array
  • when searching for an element, only a part of the array needs to be scanned

(figure: levels 9 | 2 14 | 5 7 13 25 | 3 6 8 12 17 23 26 30 with forward pointers between consecutive levels)

Fractal tree – insert (Time)

• B-tree: O(log_B N) = O((log_2 N) / (log_2 B))
• Fractal tree: O((log_2 N) / B)

• N = 10^9, B = 4096
  • log_2 10^9 ≈ 30, log_2 4096 = 12
  • inserting into a B-tree requires log_2 N / log_2 B = 30/12 ≈ 3 disk accesses
  • inserting into a fractal tree requires log_2 N / B = 30/4096 ≈ 0.007 disk accesses
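These estimates are easy to recompute (a quick sketch; the rounding follows the slide's example values):

```python
import math

# Amortized insert cost estimates for N = 10^9 records, block size B = 4096.
N, B = 10**9, 4096
lgN, lgB = math.log2(N), math.log2(B)
print(round(lgN), round(lgB))   # ~30 and 12
print(lgN / lgB)                # B-tree: ~2.5, i.e. about 3 disk accesses
print(lgN / B)                  # fractal tree: ~0.007 disk accesses
```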
