Data Organization and Processing – Hierarchical Indexing (NDBI007)

David Hoksza
http://siret.ms.mff.cuni.cz/hoksza

Outline

• Background
  • prefix trees
  • graphs
  • …
  • search trees
  • binary trees
• Newer types of tree structures
  • m-ary trees

• Typical tree type structures
  • B-tree
  • B+-tree
  • B*-tree

Motivation

• As in hashing, the motivation is to find record(s) given a query key using only a few operations
  • unlike hashing, trees allow retrieving records with keys from a given range

• Tree structures use “clustering” to efficiently filter out non-relevant records from the data set

• Tree-based indexes practically implement the index file data organization
  • for each search key, an indexing structure can be maintained

Drawbacks of Index(-Sequential) Organization

• Static nature

  • when inserting a record at the beginning of the file, the whole index needs to be rebuilt
  • an overflow handling policy can be applied → the performance degrades as the file grows

• reorganization can take a lot of time, especially for large tables

• not possible for online transactional processing (OLTP) scenarios (as opposed to OLAP – online analytical processing) where many insertions/deletions occur

Tree Indexes

• Most common dynamic indexing structure for external memory

• When inserting into/deleting from the primary file, the indexing structure(s) residing in the secondary file are modified to accommodate the new key

• The modification of a tree is implemented by splitting/merging nodes

• Used not only in databases
  • the NTFS directory structure is built over a B+-tree

Trees

• Undirected graph without cycles
• We will be interested in rooted trees
  • one node designated as the root of the hierarchy → orientation
• Nodes are in a parent-child relation
  • every parent has a finite set of descendants (children nodes)
  • from a parent an edge points to each child
  • every child has exactly one parent
• root
  • the only node without a parent
• inner nodes
  • nodes with children
• leaves
  • nodes without children
• branch
  • path from a leaf to the root

Tree Characteristics (1)

(figure: a tree of arity 3 with height/depth 3; each node is labeled with its depth D and height H, levels 0–3 with widths 1, 2, 5, 2)

• Tree arity/degree
  • maximum number of children of any node
• Node depth
  • the length of the simplest path (number of edges) from the root node
  • depth(root) = 0
• Tree depth
  • maximum of node depths
• Tree level
  • set of nodes with the same depth (distance from root)
• Level width
  • number of nodes at a given level
• Node height
  • number of edges on the longest downward simple path from the node to a leaf
• Tree height
  • height of the root node
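The characteristics above are easy to compute recursively. A minimal sketch in Python; the `Node` class is illustrative, not lecture material:

```python
# Minimal sketch: computing tree characteristics defined above.
# The Node class is a hypothetical representation, not from the lecture.
class Node:
    def __init__(self, *children):
        self.children = list(children)

def height(node):
    """Number of edges on the longest downward path to a leaf."""
    if not node.children:
        return 0
    return 1 + max(height(c) for c in node.children)

def arity(node):
    """Maximum number of children of any node in the subtree."""
    if not node.children:
        return 0
    return max(len(node.children), max(arity(c) for c in node.children))

# A small tree: root with two children, one of which has a single child.
leaf = Node()
root = Node(Node(leaf), Node())
print(height(root))  # tree height = height of the root = 2
print(arity(root))   # tree arity = 2
```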

Tree Characteristics (2)

• Balanced tree
  • a tree whose subtrees differ in height by no more than one and the subtrees are height-balanced as well

• Unsorted tree
  • general tree – the descendants of a node are not sorted at all

• Sorted tree
  • unlike a general tree, the order of children of each node matters
  • children of a node are sorted based on a given key

Binary Trees

• Each inner node contains at most two child nodes (tree arity = 2)

• Types
  • perfect binary tree – every non-leaf node has two child nodes and all leaves are at the same depth
  • complete binary tree – full tree except for the last level, where nodes are as far left as possible

• Characteristics
  • number of nodes of a perfect binary tree of height h:
    #nodes_perf = Σ_{i=0}^{h} 2^i = 2^{h+1} − 1
  • number of leaf nodes of a perfect binary tree of height h:
    #leaf_nodes_perf = 2^h

Binary Search Tree (BST)

• Binary sorted tree where
  • each node is assigned a value corresponding to one of the keys
  • all nodes in the left subtree of a node X correspond to keys with lower values than the value of X
  • all nodes in the right subtree of a node X correspond to keys with higher values than the value of X
  • when using a BST for storing records, the records can be stored directly in the nodes or addressed from them

• Searching for a record with a key k
  1. Enter the tree at the root level
  2. Compare k with the value v of the current node
  3. If k = v, return the record corresponding to the current node key
  4. If k < v, descend to the left subtree
  5. If k > v, descend to the right subtree
  6. Repeat steps 2–5 until a leaf is reached. If the condition in step 3 is not met at any level, no record with the key k is present
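The search procedure above can be sketched as follows (a minimal illustration in Python; the tuple-based node representation is an assumption, not the lecture's):

```python
# Minimal BST search sketch; a node is a (key, record, left, right) tuple.
# The tuple layout is illustrative, not the lecture's representation.
def bst_search(node, k):
    while node is not None:
        key, record, left, right = node
        if k == key:
            return record                      # step 3: key found
        node = left if k < key else right      # steps 4-5: descend
    return None                                # fell off a leaf: not present

# BST over keys 4, 2, 7, 9 (records are just strings here)
leaf2 = (2, "rec2", None, None)
leaf9 = (9, "rec9", None, None)
node7 = (7, "rec7", None, leaf9)
root = (4, "rec4", leaf2, node7)
print(bst_search(root, 7))  # "rec7"
print(bst_search(root, 5))  # None
```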

Redundant vs. Non-Redundant BST

Non-Redundant BST
• Records stored in (addressed from) both inner and leaf nodes

Redundant BST
• Records stored in (addressed from) the leaves
• Structure of inner and leaf nodes differ

(figure: the same key set 2, 4, 7, 8, 9, 12 stored in a non-redundant BST and in a redundant BST whose leaf level holds all the keys)

Applications of Binary Trees (not related to storing data records)

• Expression trees
  • leaves = variables
  • inner nodes = operators

• Coding trees (e.g., Huffman coding)
  • leaves = data
  • coding along a branch leading to a given leaf = leaf's binary representation

• Query optimizers in DBMSs
  • a query can be represented by an algebraic expression which can in turn be represented by a binary tree

M-ary Trees (1)

• Motivation

• binary trees are not suitable for magnetic disks because of their height – indexing 16 items would need an index of height 4 → up to 4 disk accesses (potentially in different locations) to retrieve one of the 16 records

• log2 1,000 ≈ 10, log2 1,000,000 ≈ 20

• Increasing the arity decreases the tree height

M-ary Trees (2)

• Generalization of binary search trees
  • a binary tree is a special case of an m-ary tree for m = 2
  • only m-ary trees with m > 2 are usually considered true m-ary trees
  • unlike a binary tree, an m-ary tree contains m − 1 discriminators (keys) in every node
  • every node N points to up to m child nodes (subtrees)
  • every subtree of N contains records with keys restricted by the pair of discriminators of N between which the subtree is rooted
    • the leftmost subtree contains only records with keys having lower values than all the discriminators in N
    • the rightmost subtree contains only records with keys having higher values than all the discriminators in N

M-ary Trees (3)

(figure: a node of an m-ary tree with m = 4 stored in one disk block; it holds keys k1, k2, k3 with associated data1, data2, data3 and four child pointers: a subtree with all keys ki < k1, one with k1 < ki < k2, one with k2 < ki < k3, and one with k3 < ki)

Data could be stored directly in the node as well, but this is not usual in real-world database environments.

Characteristics of M-Ary Trees (1)

• Maximum number of nodes
  • tree height h
  • level_0 = m^0 = 1, level_1 = m^1, level_2 = m^2, …, level_h = m^h
  • Σ_{i=0}^{h} m^i = (m^{h+1} − 1) / (m − 1)

• Maximum number of records/keys
  • every node contains up to m − 1 keys/records
  • (m^{h+1} − 1) / (m − 1) × (m − 1) = m^{h+1} − 1
  • m = 3, h = 3 → #records ≤ 3^4 − 1 = 80
  • m = 100, h = 3 → #records ≤ 100^4 − 1 = 99,999,999
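The closed form can be checked against a brute-force per-level sum (a quick sketch, not lecture material):

```python
# Check m^{h+1} - 1 against summing m-1 keys per node over all levels.
def max_records(m, h):
    nodes = sum(m**i for i in range(h + 1))   # geometric series of level sizes
    return nodes * (m - 1)                    # up to m-1 keys in each node

print(max_records(3, 3))    # 80, i.e. 3^4 - 1
print(max_records(100, 3))  # 99999999, i.e. 100^4 - 1
```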

Characteristics of M-Ary Trees (2)

• Minimum height
  • h ≐ log_m n
  • the minimum height of the tree corresponds to the minimum number of disk operations needed to fetch a record → search complexity O(log_m n)

• Maximum height
  • h ≐ n
  • the maximum height of the tree corresponds to the maximum number of disk operations needed to fetch a record → search complexity O(n)

• The challenge is to keep the complexity logarithmic, that is, to keep the tree more or less balanced

B-tree

• Bayer & McCreight, 1972

• B-tree is a sorted balanced m-ary (not binary) tree with additional constraints restricting the branching in each node thus causing the tree to be reasonably “wide”

• Inserting or deleting a record in a B-tree causes only local changes and does not require rebuilding the whole index

B-tree definition (1)

• B-trees are balanced m-ary trees fulfilling the following conditions:

1. The root has at least two children unless it is a leaf.
2. Every inner node except the root has at least ⌈m/2⌉ and at most m children.
3. Every node contains at least ⌈m/2⌉ − 1 and at most m − 1 (pointers to) data records.
4. Each branch has the same length.

B-tree definition (2)

5. The nodes have the following organization:

   p0, (k1, p1, d1), (k2, p2, d2), …, (kn, pn, dn), u

   where p0, p1, …, pn are pointers to the child nodes, k1, k2, …, kn are keys, d1, d2, …, dn are associated data, u is unused space, the records (ki, pi, di) are sorted in increasing order with respect to the keys, and ⌈m/2⌉ − 1 ≤ n ≤ m − 1.

6. If a subtree U(pi) corresponds to the pointer pi, then:
   i.  ∀k ∈ U(p_{i−1}): k < ki
   ii. ∀k ∈ U(pi): k > ki

Example of a B-Tree

• Keys: 25, 48, 27, 91, 35, 78, 12, 56, 38, 19

48

19 27 78

12 25 35 38 56 91

Redundant B-Tree

• Redundant vs. non-redundant B-trees

• the presented definition introduced the non-redundant B-tree where each key value occurred just once in the whole tree

• redundant B-trees store the data values in the leaves and thus have to allow keys to repeat in the inner nodes
  • restriction 6(i) is modified so that ∀k ∈ U(p_{i−1}): k ≤ ki
  • moreover, the inner nodes do not contain pointers to the data records

B-Tree Implementations

• Nodes vs. pages/blocks
  • usually one page/block contains one node

• Existing DBMS implementations
  • one page usually takes 8 KB
  • redundant B-trees
    • inner nodes contain only pairs (ki, pi) → inner nodes and leaf nodes differ in structure and size
    • data are not stored in the indexing structure itself but addressed from the leaf nodes
  • if the key is a 64-bit integer (8 B) and the DBMS runs on a 64-bit OS (8 B pointers) with 8 KB blocks → (ki, pi) = 16 B → one page can accommodate 8192/16 = 512 pairs → the arity of the B-tree is then about 500
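The back-of-the-envelope arity computation can be written out (a quick sketch using the slide's example sizes):

```python
# Arity estimate for a redundant B-tree whose inner nodes hold (key, pointer) pairs.
PAGE = 8 * 1024   # 8 KB page
KEY = 8           # 64-bit integer key
PTR = 8           # pointer on a 64-bit OS

pairs_per_page = PAGE // (KEY + PTR)
print(pairs_per_page)  # 512 -> arity of roughly 500 after per-node overhead
```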

B-Tree – Search

• Searching a (non-redundant) tree 푇 for a record with key 푘

1. Enter the tree at the root node.
2. If the node contains a key ki such that ki = k, return the associated data di.
3. Else if the node is a leaf, return NULL.
4. Else find the lowest i such that k < ki and set j = i − 1. If there is no such i, set j to the rightmost index with an existing key.
5. Fetch the node pointed to by pj.
6. Repeat the process from step 2.
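One iteration of this loop can be sketched per node (Python; the dict-based node layout with parallel sorted `keys`, `data`, and `children` lists is an assumption, not the lecture's structure):

```python
import bisect

# Sketch of non-redundant B-tree search; a node is a dict with sorted
# "keys", parallel "data", and "children" (empty list in a leaf).
def btree_search(node, k):
    while node is not None:
        i = bisect.bisect_left(node["keys"], k)
        if i < len(node["keys"]) and node["keys"][i] == k:
            return node["data"][i]            # step 2: key found in this node
        if not node["children"]:
            return None                       # step 3: reached a leaf
        node = node["children"][i]            # steps 4-5: descend to child j

leafA = {"keys": [12, 25], "data": ["d12", "d25"], "children": []}
leafB = {"keys": [56, 91], "data": ["d56", "d91"], "children": []}
root = {"keys": [48], "data": ["d48"], "children": [leafA, leafB]}
print(btree_search(root, 56))  # "d56"
print(btree_search(root, 40))  # None
```

With 0-based indexing, `bisect_left` directly yields the child index j from step 4.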

Updating a B-Tree

• The logarithmic complexity is ensured by the condition that every node has to be at least half full

• Inserting
  • find the leaf where the new record should be inserted
  • when inserting into a not yet full node, no splitting occurs
  • when inserting into a full node, the node is split in such a way that the two resulting nodes are at least half full
  • splitting a node increments the number of subtrees of the parent node and thus possibly causes splitting of the parent node → split cascade

• Deleting
  • when deleting a record from a node more than half full, no reorganization happens
  • deleting in a half-full node induces merging with a neighboring node
  • merging decreases the number of child nodes in the parent node, thus possibly causing merging in the parent node → merge cascade

Inserting into a B-tree (1)

• Insert into a (non-redundant) tree T a record r with key k

1. If the tree is empty, allocate a new node, insert the key k and (a pointer to) the record r, and return.
2. Else find the leaf node L where the key k belongs.
3. If L is not full, insert r and k into L in such a position that the keys stay sorted, and return.
4. Else create a new node L′.
5. Leave the lower half of the records (all the items from L plus r) in L and move the higher half into L′, except for the item with the middle key k′.
   a) If L is the root, create a new root node, move the record with key k′ to the new root, point it to L and L′, and return.
   b) Else move the record with key k′ to the parent node P into the appropriate position based on the value k′ and point the “left” pointer to L and the “right” pointer to L′.
6. If P overflows, repeat step 5 for P, else return.

Inserting into a B-tree (2)

(figure: starting from the tree with root (27) and leaves (25) and (48), Insert 91 puts 91 into the leaf (48) → (48, 91))

Insert 35: 35 falls into the node (48, 91) → (35, 48, 91) → the middle key (48) moves to the parent node, yielding root (27, 48) with leaves (25), L = (35) and L′ = (91)

Inserting into a B-tree (3)

(figure: Insert 78, 12 – both fit into existing leaves without splitting: root (27, 48) with leaves (12, 25), (35), (78, 91))

Insert 56: 56 falls into the node (78, 91) → (56, 78, 91) → the middle key (78) moves to the parent node; 78 moves into the node (27, 48) → (27, 48, 78) → the middle key (48) moves to the parent node → new root (48)

(figure: the resulting tree – root (48), inner nodes (27) and (78), leaves (12, 25), (35), (56), (91))
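The splitting walk-through above can be sketched in code (an illustrative Python B-tree insert for small m; the `Node` class and the list-based representation are assumptions, not the lecture's structures):

```python
import bisect

# Illustrative B-tree insert with node splitting (keys only, m = 3, so a
# node overflows when it holds m keys). A sketch, not a production structure.
class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []        # empty list for a leaf

M = 3  # maximum number of children; at most M - 1 keys per node

def insert(root, k):
    path, node = [], root
    while node.children:                      # descend to the leaf for k
        i = bisect.bisect_left(node.keys, k)
        path.append((node, i))
        node = node.children[i]
    bisect.insort(node.keys, k)
    while len(node.keys) > M - 1:             # split upwards while overflowing
        mid = len(node.keys) // 2
        k_mid = node.keys[mid]
        right = Node(node.keys[mid + 1:], node.children[mid + 1:])
        node.keys, node.children = node.keys[:mid], node.children[:mid + 1]
        if path:
            parent, i = path.pop()
            parent.keys.insert(i, k_mid)       # middle key moves to the parent
            parent.children.insert(i + 1, right)
            node = parent
        else:                                  # splitting the root: tree grows
            root = Node([k_mid], [node, right])
            node = root
    return root

root = Node()
for key in [25, 48, 91, 35]:                   # replays the slide's example
    root = insert(root, key)
print(root.keys)                               # [48]
print([c.keys for c in root.children])         # [[25, 35], [91]]
```

Replaying 25, 48, 91, 35 reproduces the example: the root becomes (48) with leaves (25, 35) and (91).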

Deleting from a B-tree (1)

• Delete from tree 푇 record 푟 with key 푘

1. Find the node N containing the key k.
2. Remove r from N.
3. If the number of keys in N ≥ ⌈m/2⌉ − 1, return.
4. Else, if possible, merge N with either the right or the left sibling (this includes an update of the parent node accompanied by a decrease of the number of keys in the parent node).
5. Else reorganize the records among N, its sibling and the parent node.
6. If needed, reorganize the parent node in the same way (steps 3–5).

Deleting from a B-tree (2)

(figure: Delete 25 – in the tree with root (27, 48) and leaves (12, 25), (35), (78, 91), removing 25 leaves the leaf (12), which is still valid)

(figure: Delete 35 – the leaf (35) underflows; it is merged with its sibling and the separating key 27 is pulled down, yielding root (48) with leaves (12, 27) and (78, 91))

Deleting from a B-tree (3)

(figure: Delete 41 – in a tree with m = 5, root (30) and leaves (13, 25, 27, 29) and (41, 43), removing 41 makes the right leaf underflow; records are redistributed through the parent node: 29 moves up and 30 moves down, yielding root (29) with leaves (13, 25, 27) and (30, 43))

B-Tree – Complexity

• Let us have
  • 8 KB = 8 × 2^10 B page size (R)
  • 10 B key (K)
  • 8 B pointer to a tree node (P_T)
  • 9 B pointer to data (pointer to a data page + offset) (P_D)

• Arity of a B-tree with such values
  • m × P_T + (m − 1) × (P_D + K) ≤ R
  • m × 8 + (m − 1) × (9 + 10) ≤ 8192
  • m ≤ (8192 + 19) / 27 ≐ 304
  • every node can accommodate up to 303 records (m − 1)
  • if every node shows utilization 2/3, then every node contains about 202 records
  • thus on average
    • tree of height 0 → 202
    • tree of height 1 → 40,804
    • tree of height 2 → 8,242,408
    • tree of height 3 → 1,664,966,416
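The capacity figures can be reproduced with a few lines (a sketch using the slide's values; like the slide, it approximates the capacity of a tree of height h as 202^{h+1}):

```python
# Reproduce the slide's arity and capacity estimates.
R, K, PT, PD = 8 * 2**10, 10, 8, 9           # page, key, node ptr, data ptr sizes
m = (R + (PD + K)) // (PT + PD + K)          # from m*8 + (m-1)*19 <= 8192
per_node = (2 * (m - 1)) // 3                # 2/3 utilization -> ~202 records
print(m, per_node)                           # 304 202
for h in range(4):
    print(h, per_node ** (h + 1))            # 202, 40804, 8242408, 1664966416
```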

B+-tree

• Redundant B-tree
  • records are pointed to from the leaf nodes only

• The leaf level is chained
  • the leaf nodes do not have to be physically next to each other since they are connected using pointers

• Preferred in existing DBMS

• In some DBMS (e.g., Microsoft SQL Server), also the non-leaf levels are chained

B+-Tree – Example (m=3)

(figure: a B+-tree over employee names; root separator GJ, inner separators BK and MG, chained leaves BK, DT, GJ, MG, WD, WR pointing into the data table)

Name | Department
Brown Kevin | PR
Brown Kevin | Purchasing
Brown Kevin | Marketing
Duffy Terri | Development
Duffy Terri | Research
Galvin Janice | Purchasing
Matthew Gigi | Purchasing
Walters David | Production
Walters Rob | Marketing
Walters Rob | Development
Walters Rob | PR

B- vs. B+-Tree

B-tree Advantages
• Non-redundant
  • can be redundant as well
• A record can be identified faster (if it resides in an inner level)

B+-tree Advantages
• Smaller inner nodes (no pointers to the data) → nodes can accommodate more records → lower tree height → faster search
  • especially of range queries
• Insert and delete operations are simpler
• Simple implementation

Bulk Loading B-Trees

• When indexing a large collection, inserting records one by one can be tedious → bulk loading

1. Sort the data based on the search key in the pages
2. Insert a pointer to the leftmost page into a new root
3. Move over the data file and insert the respective keys into the rightmost index page above the leaf level. If the rightmost page overflows, split.

(figure: sorted leaf pages holding 3 4 6 | 9 10 11 12 | 13 20 22 23 | 31 35 36 38 | 41 44; the index levels above, e.g. root (13) over (9), (31) over (4), (11), (22), (36), are built by repeatedly inserting into the rightmost page)

Page Balancing

• Modification of the B-tree where an overflow does not have to lead to a page split
• When a page overflows
  • sibling pages are checked
  • the content of the overflowed page is joined into a set X with the left or right neighbor
  • the record to be inserted is added into X and the content is equally distributed into the two nodes
  • the changes are projected into the parent node where the keys have to be modified (but no new key is inserted → no split cascade)
• For high m this change leads to about 75% utilization in the worst case
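The redistribution step can be sketched as follows (illustrative Python; the list-based pages and the separator convention, first key of the right page as in a redundant tree, are assumptions):

```python
# Page balancing sketch: instead of splitting an overflowed page, pool its
# keys with a sibling plus the new key (the set X) and split the pool evenly.
def balance(left, right, new_key):
    pool = sorted(left + right + [new_key])    # the set X from the slide
    mid = len(pool) // 2
    left[:], right[:] = pool[:mid], pool[mid:]
    return right[0]    # new separator for the parent (redundant-tree style)

left, right = [3, 4, 6, 9], [22, 23]           # left page is full, sibling is not
sep = balance(left, right, 10)
print(left, right, sep)    # [3, 4, 6] [9, 10, 22, 23] 9
```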

B*-Tree

• Generalization of page balancing
  • the root node has at least 2 children
  • all the branches are of equal length
  • every node different from the root has at least ⌈(2m − 1)/3⌉ children

• Tree modification
  • two full pages are split into 3 pages (one new page)
  • if a node is full but none of its neighbors is full, page balancing takes place
  • if the insert occurs in a full page which has a full left or right neighbor
    • their content is joined into a set X
    • the new record is added into X
    • a new page P is allocated
    • the records from X are equally distributed into the 3 pages (the 2 existing and P)
    • a new key is added into the parent node and the keys are adjusted
  • the delete operation is handled similarly

B*-Tree – Example

7 24 37

5 6 7 11 23 24 33 34 37 43 45 47

7 24 37

Insert 20

5 6 7 11 20 23 24 33 34 37 43 45 47

7 23 37

Insert 22

5 6 7 11 20 22 23 24 33 34 37 43 45 47

B*-Tree – Example (cont.)

7 23 37

5 6 7 11 20 22 23 24 33 34 37 43 45 47

Insert 32 – to be redistributed: 11, 20, 22, 23, 24, 33, 34, 37 and 32

7 22 32 37

5 6 7 11 20 22 23 24 32 33 34 37 43 45 47

Deferred Splitting

• In the original B-trees, certain sequences of inserts can lead to only half utilization → deferred splitting

• let us keep an overflow page for each node

• when a page overflows, the overflowing record is inserted into the respective overflow page

• when both the original and the overflow page are full, the original page is split and the overflow page is emptied

Deferred Splitting – Example

Original nonredundant B-tree with m=5

(figure: root (9); inner nodes (3, 6) and (12, 15, 18); leaves (1 2), (4 5), (7 8), (10 11), (13 14), (16 17), (19 20))

Fill factor = 0.5 (worst possible)

Deferred Splitting – Example (cont.)

(figure: page (1 2 3 4); after Insert 5 its overflow page holds (5); after Insert 6, 7, 8 the overflow page holds (5 6 7 8))

(figure: Insert 9 – both the page and its overflow page are full → split: root (5) with pages (1 2 3 4) and (6 7 8 9))

(figure: Insert 10–20 – the tree grows to root (5 10 15) with pages (1 2 3 4), (6 7 8 9), (11 12 13 14), (16 17 18 19) and overflow page (20); Fill factor = 0.83)

Variable Length Records

• Often we want to index not only numbers but also strings → variable-length records (VLR) → a different m for different nodes

• When splitting a page with VLRs, the length of the records is taken into account rather than their number → the distribution is driven by the resulting length → can lead to a violation of the condition regarding the minimum number of records in a B-tree
  • longer records tend to get closer to the root → lower arity close to the root
  • when merging, a short record can be replaced by a longer one → height increase

VLR – Example

Representing the sentence: “DO NOT FALL ASLEEP”
• node size is 15
• pointers are represented by @
• for the sake of simplicity, let us consider the size of a pointer to be identical to the size of a character

@ D O @ F A L L @ N O T @

• Inserting “ASLEEP” causes an overflow → splitting
• Sequence to be split: @ASLEEP@DO@FALL@NOT@
  • O is the middle character

@ D O @

@ A S L E E P @

@ F A L L @ N O T @

Prefix (B-)Tree

• Modification of the redundant B-tree
  • inner node keys do not have to be subsets of the keys in the leaf level
  • inner node keys only have to separate the subtrees
  • smaller keys lead to higher node capacity → lower trees → faster access
  • a suitable choice of separators are prefixes of the keys

F

B Co M

An As E I N No

(leaf level words: A, An, And, As, By, Certain, Computation, Computations, Equation, Equations, For, From, In, Its, Method, Methods, New, Note, Notes)

Clustered vs Nonclustered Indexes

(figure: clustered vs nonclustered index – index pages above data pages; in the clustered case the order of the index leaves matches the order of the data pages, in the nonclustered case it does not)

Clustered vs Nonclustered Indexes

• Clustered index
  • the logical order of the key values determines the physical order of the corresponding data records, i.e., the order of the data in the data file follows the order of data in the index file
  • a file can be clustered over at most one search key → one clustered index
  • basically corresponds to the idea of index-sequential file organization

• Nonclustered index
  • the order of data in the index and in the primary file is not related
  • multiple nonclustered indexes can exist

• Motivation
  • when range querying over a nonclustered index, every record might reside in a different data page
  • a clustered index should be defined over an attribute over which range scans often happen

(Non)Clustered Indexes – SQL

CREATE TABLE Product(
  id INT PRIMARY KEY NONCLUSTERED,
  code NVARCHAR(5),
  name NVARCHAR(50),
  type INT);

CREATE NONCLUSTERED INDEX ixProductCode ON Product(Code);

CREATE CLUSTERED INDEX ixProductType ON Product(Type);

Indexes in Existing Leading Database Systems

               | Oracle 11g | Microsoft SQL Server 2012  | PostgreSQL 9.2 | MySQL 5.5
Standard index | B+-tree    | B+-tree                    | B+-tree        | B+-tree
Bitmap index   | Yes        | No                         | No             | No
Hash index     | Yes        | Yes (clustering)           | Yes            | Yes (clustering)
Spatial index  | R-tree     | B+-tree (mapping 2D to 1D) | R-tree         | R-tree

Less Traditional Approaches to Indexing

• Update-optimized structures
  • Buffered repository tree (BRT)
  • Streaming B-tree

• Cache-oblivious structures
  • Cache-oblivious B-tree
  • Streaming B-tree

B-Tree Insert Drawback

• Inserting in (amortized) time O(log_b N) is suboptimal

• With a search/insert tradeoff one can get much faster inserts for an increase in the search time
  • suitable for applications with a lot of inserts (e.g., streaming databases (sensor data), file systems, …)

             | Search            | Insert/Delete
B-tree       | O(log_b N)        | O(log_b N)
BR-tree      | O(log_2 N)        | O((1/b) log_2 N) *
B^ε-tree     | O((1/ε) log_b N)  | O((1/(ε b^{1−ε})) log_b N) *
B^{1/2}-tree | O(2 log_b N)      | O((2/√b) log_b N) *

* amortized

Buffered Repository Tree

• Buchsbaum, Goldwasser, Venkatasubramanian, Westbrook, 2000 (BGVW2000)

• The buffered repository tree is an augmented (2,4) tree
  • each leaf node holds up to b items
  • each inner node contains keys (2–4) and one buffer holding up to b items
  • all items stored in descendant leaves and buffers of a node obey the standard search criteria with respect to the rank among its sibling(s)

• Insert
  • the record is inserted into the root's buffer
  • when a buffer overflows, the data are redistributed to the descendants based on the key items in that node
  • any time a buffer is distributed to children nodes during an insert, we apportion the O(1) I/Os uniformly to the b items that are distributed; each item is thus charged O(1/b) I/Os per level

• Search
  • the root buffer is scanned
  • if the item is not found, the procedure recursively descends into the next level based on the keys and scans the respective buffer at the given level → O(log_2 N)

Cache-oblivious model

• Full memory hierarchy where each memory can have different characteristics
• The algorithm does not know the size M of its cache, or the block size B

(figure: CPU connected to a full memory hierarchy – different types of caches and external memory types)

COM search trees

• Search tree in the RAM model
  • O(log_B N) search, insert, delete
  • constant factor for search: 1

• Search tree in the COM model
  • O(log_B N) search, insert, delete
  • constant factor for search: e (we will show 4)

• Based on binary trees → we need a good enough mapping between the order of nodes and the sequential organization (array) so that traversal does not touch too many blocks on disk

COM static search tree – idea

• van Emde Boas ordering
• Cut the tree at half its height → extract the individual subtrees (of size ≈ √N) → repeat recursively → concatenate + label

1

2 3

4 7 10 13

5 6 8 9 11 12 14 15

vEB ordering example

Source: Bender, Michael, Erik D. Demaine, and Martin Farach-Colton. “Cache-oblivious B-trees.” Proceedings of the 41st Annual Symposium on Foundations of Computer Science (FOCS), IEEE, 2000.

COM static search trees – complexity

• For a given block size B, let us stop the recursion so that the size of the subtrees (T_S) ≤ B (and T_S is of maximum size)

• Each T_S can span at most 2 blocks

• The height of each T_S is in ((1/2) log_2 B ; log_2 B] → reading the tree from root to leaf takes at most

  2 × (log_2 N) / ((1/2) log_2 B) = 4 log_B N

  block transfers

COM B-tree

• Requires ordered file maintenance
  • [Itai, Alon, Alan G. Konheim, and Michael Rodeh. A sparse table implementation of priority queues. Springer Berlin Heidelberg, 1981]
  • enables storing N entries in a given order in O(N) space with O(1) gaps (between two elements there is at most a constant number of gaps)
  • enables insert/delete at a given position by moving entries in an interval of size O(log^2 N) amortized

• We stack a static search tree over the ordered file and crosslink (even the gaps)
  • leaves contain the keys from the ordered file
  • inner nodes contain the maximum of the left and right subtree

(figure: a binary tree over the ordered file 2, 7, 10, 12, 20, 25 with gaps; each inner node stores the maximum of its subtrees, e.g. root 25 over (10, 25), then (2, 10, 12, 25))

COM B-tree update

1. Search for the position in the ordered file: O(log_B N)
2. Update the ordered file: O((log^2 N)/B) → O(log_B N)
3. Propagate the changes (post-order): O(log_B N + (log^2 N)/B) → O(log_B N)

10 25 8 25

2 10 12 25 2 8 12 25

2 7 10 12 20 25 2 7 8 10 12 20 25

2 7 8 10 12 20 25    2 7 10 12 20 25

COM B-tree update analysis (1)

Since we update post-order, to update the whole tree of size B², we first update the maxes of the lower leftmost subtree, then go to the next one, and so on. Each subtree is in at most two pages, so with a cache of size 4, jumping between the lower and upper level is free.

(figure: the bottom 2 levels of the tree partitioned into subtrees of size ≤ B)

→ O(1 + (log^2 N)/B) trees (blocks) need to be visited in the bottom 2 levels (amortized)

• Above the bottom 2 levels: ≤ log_2 N nodes → O(log_B N) memory transfers

(figure: the upper part of the tree; there are ≈ (log^2 N)/B nodes in a subtree/level of the lower part, amortized)

COM B-tree update analysis (3)

• The cost of an update is thus

  O(log_B N + (log^2 N)/B)

  • search: O(log_B N)
  • update of the file and propagation in the lower part of the tree: O((log^2 N)/B)
  • propagation in the upper part of the tree: O(log_B N)

• For B big enough this is as good as a B-tree

• Using indirection to get O(log_B N)
  • split the items into blocks of size Θ(log_2 N)
  • for each block, store only its minimum in the vEB + ordered file structure → N/log_2 N records
  • search = search in the tree + scan of one block: O(log_B N + (log_2 N)/B)
  • an update involves a rewrite of the whole block + possibly a split/merge of blocks
  • block utilization ∈ ((1/4) log_2 N ; log_2 N]
  • amortized update cost: O((log_2^2 N)/(B log_2 N) + (log_2 N)/B) = O((log_2 N)/B) < O(log_B N)

Streaming B-trees

• Implemented in the TokuDB engine of MySQL → Fractal tree
• We show its older implementation, which is based on COLA (the current implementation is based on the B^ε-tree)

• Cache-oblivious lookahead array (COLA)
  • consists of log_2 N arrays where the i-th array stores 2^i elements
  • each array is either completely full or empty
  • cache-oblivious since the block size does not appear in the data structure

• Invariants
  • the k-th array contains items if and only if the k-th least significant bit of the binary representation of N (the number of records) is 1
  • each array contains its items in ascending order by key

• When a new item is inserted, we effectively perform a carry
  • we create a list of length one with the new item, and as long as there are two lists of the same length, we merge them into the next bigger size
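The carry-style insert can be sketched as follows (an illustrative Python COLA without forward pointers; the representation of the structure as a list of levels, and the function names, are assumptions):

```python
import bisect
import heapq

# Minimal COLA sketch: level i is either None (empty) or a sorted list of
# 2^i keys. Inserting performs a binary-counter "carry" of equal-sized runs.
def cola_insert(levels, key):
    carry = [key]                       # a full run of size 1
    i = 0
    while True:
        if i == len(levels):
            levels.append(None)
        if levels[i] is None:           # empty slot: deposit the carry
            levels[i] = carry
            return
        carry = list(heapq.merge(levels[i], carry))  # merge two 2^i runs
        levels[i] = None
        i += 1

def cola_search(levels, key):           # naive search: binary search per level
    for run in levels:
        if run:
            j = bisect.bisect_left(run, key)
            if j < len(run) and run[j] == key:
                return True
    return False

levels = []
for k in [12, 17, 23, 30]:              # the slide's 4-record example
    cola_insert(levels, k)
print(levels)                           # [None, None, [12, 17, 23, 30]]
print(cola_search(levels, 17), cola_search(levels, 5))  # True False
```

After 4 inserts only the 4-array is full, matching the invariant for N = 4 = 100 in binary.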

Simplified fractal tree – example

A fractal tree with 4 records

12 17 23 30

Every number can be written as the sum of powers of 2 → the non-zero members correspond to the non-empty arrays.

A fractal tree with 10 records

5 10

3 6 8 12 17 23 26 30

Fractal tree – Insert

15

5 10

Insert 15

3 6 8 12 17 23 26 30

Record with key 15 fits into the 1-array.

In order for insert to work, each array has a temporary storage. At the beginning of each stage, all the temporary arrays are empty.

Fractal tree – Insert (cont.)

15 7

5 10

Insert 7

3 6 8 12 17 23 26 30

5 10 7 15

3 6 8 12 17 23 26 30

Fractal tree – Insert (cont.)

5 10 7 15

3 6 8 12 17 23 26 30

5 7 10 15

3 6 8 12 17 23 26 30

Fractal tree – Insert (cont.)

In some cases the cascade can be very long.

9 5 10 2 18 33 40 3 6 8 12 17 23 26 30

9 5 10 2 18 33 40 3 6 8 12 17 23 26 30

31

5 10 2 18 33 40 3 6 8 12 17 23 26 30

9 31

2 18 33 40 3 6 8 12 17 23 26 30

5 9 10 31

3 6 8 12 17 23 26 30

2 5 9 10 18 31 33 40

2 3 5 6 8 9 10 12 17 18 23 26 30 31 33 40

Fractal tree – complexity

• Insert
  • O((1/b) log_2 N) amortized and O(log_2 N) worst-case block transfers
  • each element is merged at most O(log_2 N) times
  • the cost to merge 2 arrays of size X is O(X/b) block I/Os
  • the cost per element per merge is O(1/b), since we merge blocks holding b items

• Search
  • O(log_2^2 N) block transfers
    • naïve solution where each level is searched using binary search, which takes O(log_2 N) time per level (basic COLA)
  • O(log_2 N)
    • using forward pointers

Forward pointers

• Fractional cascading (Chazelle, Guibas, 1986)
  • each element contains a forward pointer to where it fits in the next-level array
  • when searching for an element, only a part of the array needs to be scanned

(figure: levels 9 | 2 14 | 5 7 13 25 | 3 6 8 12 17 23 26 30 with forward pointers between consecutive levels)

Fractal tree – insert (Time)

• B-tree: O(log_B N) = O((log_2 N) / (log_2 B))
• Fractal tree: O((log_2 N) / B)

• N = 10^9, B = 4096
  • log_2 10^9 ≈ 30, log_2 4096 = 12
  • inserting into a B-tree requires log_2 N / log_2 B = 30/12 ≈ 3 disk accesses
  • inserting into a fractal tree requires log_2 N / B = 30/4096 ≈ 0.007 disk accesses
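These estimates are easy to recompute (a quick sketch; the rounding follows the slide's example values):

```python
import math

# Amortized insert cost estimates for N = 10^9 records, block size B = 4096.
N, B = 10**9, 4096
lgN, lgB = math.log2(N), math.log2(B)
print(round(lgN), round(lgB))   # ~30 and 12
print(lgN / lgB)                # B-tree: ~2.5, i.e. about 3 disk accesses
print(lgN / B)                  # fractal tree: ~0.007 disk accesses
```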
