Data Organization and
Processing
Hierarchical Indexing
(NDBI007)
David Hoksza http://siret.ms.mff.cuni.cz/hoksza
1
Outline
• prefix tree
• …
• Background
• graphs
• search trees
• binary trees • m-ary trees
• Newer types of tree structures
• Typical tree type structures
• B-tree • B+-tree
• B*-tree
2
Motivation
• Similarly as in hashing, the motivation is to find record(s) given a query
key using only a few operations
• unlike hashing, trees allow to retrieve set of records with keys from given range
• Tree structures use “clustering” to efficiently filter out non relevant records from the data set
• Tree-based indexes practically implement the index file data organization
• for each search key, an indexing structure can be maintained
3
Drawbacks of Index(-Sequential) Organization
• Static nature
• when inserting a record at the beginning of the file, the whole index needs to be
rebuilt
• an overflow handling policy can be applied → the performance degrades as the file grows
• reorganization can take a lot of time, especially for large tables
• not possible for online transactional processing (OLTP) scenarios (as opposed
to OLAP – online analytical processing) where many insertions/deletions occur
4
Tree Indexes
• Most common dynamic indexing structure for external memory
• When inserting/deleting into/from the primary file, the indexing
structure(s) residing in the secondary file is modified to accommodate the
new key
• The modification of a tree is implemented by splitting/merging nodes
• Used not only in databases
• NTFS directory structure is built over B+ tree
5
Trees
• root
• Undirected graph without cycles
• is the only node without parent – root of the hierarchy
• we will be interested in rooted trees
→ one node designated as the root → orientation
• leaves
• nodes without children
• inner nodes
• nodes with children
• Nodes are in parent-child relation
• branch
• every parent has a finite set of
descendants (children nodes)
• from a parent an edge points to child
• path from a leaf to the root
• every child has exactly one parent
6
- Tree height = 3
- Root
Tree arity = 3
Tree Characteristics (1)
D:0-H:3
width = 1 width = 2
level 0 level 1
• Tree arity/degree
• maximum number of children of any node
D:1-H:1
D:1-H:2
• Node depth
• the length of the simplest path (number of edges) from the root node
D:2-H:1
• depth(root) = 0
width = 5
• Tree depth
level 2 level 3
• maximum of node depths
D:2-H:0
• Tree level
width = 2
• set of nodes with the same depth (distance from root)
• Level width
D:3-H:0
• number of nodes at given level
• Node height
Tree depth = 3
• number of edges on the longest downward simple path from Y to a leaf
• Tree height
• height of the root node
7
Tree Characteristics (2)
• Balanced (vyv ážený ) tree
• a tree whose subtrees differ in height by no more than one and the subtrees are height-balanced as well
• Unsorted tree
• general tree – the descendants of a node are not sorted at all
• Sorted tree
• unlike general tree the order of children of each node matters
• children of a node are sorted based on a given key
8
Binary Tree
• Each inner node contains at
most two child nodes (tree arity
= 2)
• Characteristics
• number of nodes of a perfect
binary tree of height ℎ
풉
#풏풐풅풆풔풑풆풓풇 = ퟐ풊 = ퟐ풉+ퟏ − ퟏ
• Types
• perfect (perfektní) binary tree –
every non-leaf node has two child
nodes
• complete (komletní) binary tree
– full tree except for the first nonleaf level where nodes are as far left
as possible
풊=ퟎ
• number of leaf nodes of a
perfect binary tree of height ℎ
#풍풆풂풇_풏풐풅풆풔풑풆풓풇 = ퟐ풉
9
Binary Search Tree (BST)
- • Binary sorted tree where
- • Searching a record with a key 푘
• each node is assigned a value
1. Enter the tree at the root level
corresponding to one of the keys
2. Compare 푘 with the value 푣 of the
• all nodes in the left subtree of a node
푋 correspond to keys with lower values than the value of 푋
• all nodes in the right subtree of a
node 푋 correspond to keys with higher values than the value of 푋 current node
3. If 푘 = 푣 return the record
corresponding to the current node key
4. If 푘 < 푣 descend to the left subtree 5. If 푘 > 푣 descend to the right subtree
6. Repeat steps 2-5 until a leaf is
reached. If the condition in step 3 is not met at any level, no record with the key 푘 is present
• when using BST for storing records, the records can be stored directly in the nodes or addressed from them
10
Redundant vs. Non-Redundant BST
- Non-Redundant BST
- Redundant BST
- • Records stored in (addressed
- • Records stored in (addressed
- from) both inner and leaf nodes
- from) the leaves
• Structure of inner and leaf nodes differ
8
8
- 4
- 9
12
4
12
9
2
7
- 2
- 7
- 9
- 7
- 8
11
- 2
- 4
Applications of Binary Trees (not related to
storing data records)
• Expression trees
• leaves = variables • inner nodes = operands
• Huffman coding
• leaves = data
• coding along a branch leading to give leaf = leaf’s binary representation
• Query optimizers in DBMS
• query can be represented by an algebraic expression which can be in turn
represented by a binary tree
12
M-ary Trees (1)
• Motivation
• binary trees are not suitable for magnetic disks because of their height – indexing
16 items would need an index of height 4 → up to 4 disk accesses (potentially in
different locations) to retrieve one of the 16 records
• log2 1000 = 10 , log2 1,000,000 = 20
• But increasing arity leads to decreasing the tree height
13
M-ary Trees (2)
• Generalization of binary search trees
• A binary tree is a special case of an m-ary tree for 푚 = 2
• only m-ary trees with 푚 > 2 are usually considered as true m−ary trees
• unlike binary tree, m-ary tree contains 풎 − ퟏ discriminators (keys) in every
node
• every node 푁 points to up to 풎 child nodes (subtrees)
• every subtree of 푁 contains records with keys restricted by the pair of
discriminators of 푁 between which the subtree is rooted
• the leftmost subtree contains only records with keys having lower values than all the discriminators in 푁
• the rightmost subtree contains only records with keys having higher values
than all the discriminators in 푁
14
M-ary Trees (3)
풎 = ퟒ
Data could be stored directly in the node as well. But it is not usual in real-world database environments.
disk block
푘1
- 푘2
- 푘3
data1 data3 data2
subtree with all keys
푘1 < 푘푖 <
푘2
subtree with all keys
푘2 < 푘푖 <
푘3
subtree with all keys subtree with all keys
푘3 < 푘푖
푘푖 < 푘1
15
Characteristics of M-Ary Trees (1)
• Maximum number of nodes • Maximum number of
records/keys
• tree height ℎ
• 푙푒푣푒푙0 = 푚0 = 1, 푙푒푣푒푙1 =
• every node contains up to 푚 − 1
푚1, 푙푒푣푒푙2 = 푚2, … 푙푒푣푒푙ℎ = 푚ℎ
keys/records
푚ℎ+1 − 1
ℎ
풎풉+ퟏ − ퟏ
× 푚 − 1 = 풎풉+ퟏ − ퟏ
푚 − 1
푚푖 =
풎 − ퟏ
푖=0
• 푚 = 3, ℎ = 3 → #푟푒푐표푟푑푠 ≤
34 − 1 = 80
• 푚 = 100, ℎ = 3 → #푟푒푐표푟푑푠 ≤
1004 − 1 = 100,000,000
16
Characteristics of M-Ary Trees (2)
• Minimum height
풉 = 퐥퐨퐠풎 풏
• Maximum height
풏
풉 ≐
풎
• maximum height of the tree
corresponds to the maximum number of disk operations needed
to fetch a record → search complexity
푶(풏)
• minimum height of the tree
corresponds to the minimum number of disk operations needed
to fetch a record → search complexity
푶(퐥퐨퐠퐦 풏)
• The challenge is to keep the complexity logarithmic, that is to
keep the tree more or less balanced
17
B-tree
• Bayer & McCreight, 1972
• B-tree is a sorted balanced m-ary (not binary) tree with additional
constraints restricting the branching in each node thus causing the
tree to be reasonably “wide”
• Inserting or deleting a record in B-tree causes only local changes and not rebuilding of the whole index
18
B-tree definition (1)
• B-trees are balanced m-ary trees fulfilling the following conditions:
1. The root has at least two children unless it is a leaf.
풎
2. Every inner node except of the root has at least and at most 풎
ퟐ
children.
풎
3. Every node contains at least − ퟏ and at most 풎 − ퟏ (pointers
ퟐ
to) data records.
4. Each branch has the same length.
19
B-tree definition (2)
5. The nodes have following organization:
풑ퟎ, 풌ퟏ, 풑ퟏ, 풅ퟏ , 풌ퟐ, 풑ퟐ, 풅ퟐ , … , , 풌풏, 풑풏, 풅풏 , 풖
• where 풑ퟎ, 풑ퟏ, … , 풑풏 are pointers to the child nodes, 풌ퟏ, 풌ퟐ, … , 풌풏 are keys,
풅ퟏ, 풅ퟐ, … , 풅풏 are associated data, 풖 is unused space and the records 풌풊, 풑풊, 풅풊 are
sorted in increasing order with respect to the keys and
풎
− ퟏ ≤ 풏 ≤ 풎 − ퟏ
ퟐ
6. If a subtree 푼 풑풊 corresponds to the pointer 푝푖 then:
i. ∀풌 ∈ 푼 풑풊−ퟏ : 풌 < 풌풊 ii. ∀풌 ∈ 푼 풑풊 : 풌 > 풌풊
20
Example of a B-Tree
• Keys: 25, 48, 27, 91, 35, 78, 12, 56, 38, 19
48
19
25
- 27
- 78
- 91
- 12
- 35
- 38
- 56
21
Redundant B-Tree
• Redundant vs. non-redundant B-trees
• the presented definition introduced the non-redundant B-tree where each key
value occurred just once in the whole tree
• redundant B-trees store the data values in the leaves and thus have to allow repeating of keys in the inner nodes
• restriction 6(i) is modified so that
• ∀풌 ∈ 푼 풑풊−ퟏ : 풌 ≤ 풌풊
• moreover, the inner nodes do not contain pointers to the data records
22
B-Tree implementations
• Nodes vs. pages/blocks
• usually one page/block contains one node
• Existing DBMS implementations
• one page usually takes 8KB
• redundant B-trees
• inner nodes contain only pairs 푘1, 푝1 → inner nodes and leaf nodes differ in structure and
size
• data are not stored in the indexing structure itself but addressed from the leaf nodes
• if the key is of type 64-bit integer (8B) and the DMBS runs on a 64-bit OS (8B pointers)
with 8KB blocks → (푘1, 푝1) = 16퐵 → one page can accommodate 8192/16 =
512 pairs → arity of the B-tree is then about 500
23
B-Tree - Search
• Searching a (non-redundant) tree 푇 for a record with key 푘
• Enter the tree in the root node.
• If the node contains a key 풌풊 such that 풌풊 = 풌 return the data associated with
풅풊.
• Else if the node is leaf, return NULL.
• Else find lowest 풊 such that 풌 < 풌풊 and set 풋 = 풊 − ퟏ. If there is no such ꢀ set 풋
as the rightmost index with existing key.
• Fetch the node pointed to by 풑풋. • Repeat the process from step 2.
24
Updating a B-Tree
• The logarithmic complexity is ensured by the condition that every node has
to be at least half full
• Inserting
• finding a leaf where the new record should be inserted
• when inserting into a not yet full node no splitting occurs
• when inserting into a full node, the node is split in such a way that the two resulting nodes are at least half full
• splitting a node increments the number of subtrees of the parent node and thus possibly causes
splitting of the parent node → split cascade
• Deleting
• when deleting a record from a node more than half full, no reorganization happens • deleting in a half full node induces merging of the neighboring nodes
• merging decreases the number of child nodes in the parent node thus possibly causing merging of
the parent node→ merge cascade
25
Inserting into a B-tree (1)
• Insert into a (non-redundant) tree 푇 a record 푟 with key 푘
• If the tree is empty, allocate a new node, insert the key 푘 and (pointer to record) 푟
and return.
• Else find the leaf node 푳 where the key 풌 belongs.
• If 푳 is not full insert 풓 and 푘 into 퐿 in such a position that the keys are sorted and
return.
• Else create a new node 푳′. • Leave lower half records (all the items from 퐿 plus 푟) in 푳 and the higher half
records into 푳′ except of the item with the middle key 푘′.
a) If 퐋 is the root, create a new root node, move the record with key 푘′ to the new root and point it to 푳 and 푳′ and return.
b) Else move the record with key 푘′ to the parent node 푷 into appropriate position based on
the value 푘′ and point the “left” pointer to 퐋 and the “right” pointer to 푳′.
• If 푷 overflows, repeat step 5 for 푃 else return.
26
Inserting into a B-tree (2)
27
27
Insert 91
48
25
- 48
- 91
- 25
35 falls into the node (48,91)
→ (35, 48, 91) → middle key
(48) moves to the parent node
27 35
48
Insert 35
- 25
- 91
- 퐿
- 퐿′
27
Inserting into a B-tree (3)
- 27
- 48
Insert 78,12