Trees • Binary Trees • Newer Types of Tree Structures • M-Ary Trees
Total Page:16
File Type:pdf, Size:1020Kb
Data Organization and Processing Hierarchical Indexing (NDBI007) David Hoksza http://siret.ms.mff.cuni.cz/hoksza 1 Outline • Background • prefix tree • graphs • … • search trees • binary trees • Newer types of tree structures • m-ary trees • Typical tree type structures • B-tree • B+-tree • B*-tree 2 Motivation • Similarly as in hashing, the motivation is to find record(s) given a query key using only a few operations • unlike hashing, trees allow to retrieve set of records with keys from given range • Tree structures use “clustering” to efficiently filter out non relevant records from the data set • Tree-based indexes practically implement the index file data organization • for each search key, an indexing structure can be maintained 3 Drawbacks of Index(-Sequential) Organization • Static nature • when inserting a record at the beginning of the file, the whole index needs to be rebuilt • an overflow handling policy can be applied → the performance degrades as the file grows • reorganization can take a lot of time, especially for large tables • not possible for online transactional processing (OLTP) scenarios (as opposed to OLAP – online analytical processing) where many insertions/deletions occur 4 Tree Indexes • Most common dynamic indexing structure for external memory • When inserting/deleting into/from the primary file, the indexing structure(s) residing in the secondary file is modified to accommodate the new key • The modification of a tree is implemented by splitting/merging nodes • Used not only in databases • NTFS directory structure is built over B+ tree 5 Trees • Undirected graph without cycles • root • we will be interested in rooted trees • is the only node without parent – root → one node designated as the root of the hierarchy → orientation • leaves • nodes without children • inner nodes • Nodes are in parent-child relation • nodes with children • every parent has a finite set of • branch descendants (children nodes) • path from a leaf to the root • from a parent an edge points to child • every child has exactly one parent 6 Tree height = 3 Root Tree Characteristics (1) Tree arity = 3 width = 1 D:0-H:3 level 0 • Tree arity/degree • maximum number of children of any node width = 2 D:1-H:1 D:1-H:2 • Node depth level 1 • the length of the simplest path (number of edges) from the root node • depth(root) = 0 D:2-H:1 width = 5 • Tree depth level 2 • maximum of node depths • Tree level D:2-H:0 • set of nodes with the same depth (distance from root) width = 2 level 3 • Level width • number of nodes at given level D:3-H:0 • Node height • number of edges on the longest downward simple path from Y to a leaf Tree depth = 3 • Tree height • height of the root node 7 Tree Characteristics (2) • Balanced (vyvážený) tree • a tree whose subtrees differ in height by no more than one and the subtrees are height-balanced as well • Unsorted tree • general tree – the descendants of a node are not sorted at all • Sorted tree • unlike general tree the order of children of each node matters • children of a node are sorted based on a given key 8 Binary Tree • Each inner node contains at • Characteristics most two child nodes (tree arity • number of nodes of a perfect = 2) binary tree of height ℎ 풉 풊 풉+ퟏ • Types #풏풐풅풆풔풑풆풓풇 = ퟐ = ퟐ − ퟏ • perfect (perfektní) binary tree – 풊=ퟎ every non-leaf node has two child nodes • number of leaf nodes of a • complete (komletní) binary tree perfect binary tree of height ℎ – full tree except for the first non- 풉 leaf level where nodes are as far left #풍풆풂풇_풏풐풅풆풔풑풆풓풇 = ퟐ as possible 9 Binary Search Tree (BST) • Binary sorted tree where • Searching a record with a key 푘 • each node is assigned a value 1. Enter the tree at the root level corresponding to one of the keys 2. Compare 푘 with the value 푣 of the • all nodes in the left subtree of a node current node 푋 correspond to keys with lower values 3. If 푘 = 푣 return the record than the value of 푋 corresponding to the current node key • all nodes in the right subtree of a 4. If 푘 < 푣 descend to the left subtree node 푋 correspond to keys with higher 5. If 푘 > 푣 descend to the right subtree values than the value of 푋 6. Repeat steps 2-5 until a leaf is reached. If the condition in step 3 is • when using BST for storing records, not met at any level, no record with the records can be stored directly in the the key 푘 is present nodes or addressed from them 10 Redundant vs. Non-Redundant BST Non-Redundant BST Redundant BST • Records stored in (addressed • Records stored in (addressed from) both inner and leaf nodes from) the leaves • Structure of inner and leaf nodes differ 8 8 4 9 4 12 2 7 9 12 2 7 9 2 4 7 8 11 Applications of Binary Trees (not related to storing data records) • Expression trees • leaves = variables • inner nodes = operands • Huffman coding • leaves = data • coding along a branch leading to give leaf = leaf’s binary representation • Query optimizers in DBMS • query can be represented by an algebraic expression which can be in turn represented by a binary tree 12 M-ary Trees (1) • Motivation • binary trees are not suitable for magnetic disks because of their height – indexing 16 items would need an index of height 4 → up to 4 disk accesses (potentially in different locations) to retrieve one of the 16 records • log2 1000 = 10 , log2 1,000,000 = 20 • But increasing arity leads to decreasing the tree height 13 M-ary Trees (2) • Generalization of binary search trees • A binary tree is a special case of an m-ary tree for 푚 = 2 • only m-ary trees with 푚 > 2 are usually considered as true m−ary trees • unlike binary tree, m-ary tree contains 풎 − ퟏ discriminators (keys) in every node • every node 푁 points to up to 풎 child nodes (subtrees) • every subtree of 푁 contains records with keys restricted by the pair of discriminators of 푁 between which the subtree is rooted • the leftmost subtree contains only records with keys having lower values than all the discriminators in 푁 • the rightmost subtree contains only records with keys having higher values than all the discriminators in 푁 14 M-ary Trees (3) 풎 = ퟒ Data could be stored directly in disk block the node as well. But it is not usual in real-world database environments. 푘1 푘2 푘3 data1 data2 data3 subtree with subtree with subtree with subtree with all keys all keys all keys all keys 푘1 < 푘 < 푘2 < 푘 < 푘 < 푘1 푘3 < 푘 푘2 푘3 15 Characteristics of M-Ary Trees (1) • Maximum number of nodes • Maximum number of • tree height ℎ records/keys 0 • 푙푒푣푒푙0 = 푚 = 1, 푙푒푣푒푙1 = • every node contains up to 푚 − 1 1 2 ℎ keys/records 푚 , 푙푒푣푒푙2 = 푚 , … 푙푒푣푒푙ℎ = 푚 ℎ 푚ℎ+1 − 1 풎풉+ퟏ − ퟏ × 푚 − 1 = 풎풉+ퟏ − ퟏ 푚 = 푚 − 1 풎 − ퟏ =0 • 푚 = 3, ℎ = 3 → #푟푒푐표푟푑푠 ≤ 34 − 1 = 80 • 푚 = 100, ℎ = 3 → #푟푒푐표푟푑푠 ≤ 1004 − 1 = 100,000,000 16 Characteristics of M-Ary Trees (2) • Minimum height • Maximum height 풏 풉 ≐ 풎 풉 = 퐥퐨퐠풎 풏 • maximum height of the tree • minimum height of the tree corresponds to the maximum corresponds to the minimum number of disk operations needed number of disk operations needed to fetch a record → search complexity to fetch a record → search complexity 푶(풏) 푶(퐥퐨퐠퐦 풏) • The challenge is to keep the complexity logarithmic, that is to keep the tree more or less balanced 17 B-tree • Bayer & McCreight, 1972 • B-tree is a sorted balanced m-ary (not binary) tree with additional constraints restricting the branching in each node thus causing the tree to be reasonably “wide” • Inserting or deleting a record in B-tree causes only local changes and not rebuilding of the whole index 18 B-tree definition (1) • B-trees are balanced m-ary trees fulfilling the following conditions: 1. The root has at least two children unless it is a leaf. 풎 2. Every inner node except of the root has at least and at most 풎 ퟐ children. 풎 3. Every node contains at least − ퟏ and at most 풎 − ퟏ (pointers ퟐ to) data records. 4. Each branch has the same length. 19 B-tree definition (2) 5. The nodes have following organization: 풑ퟎ, 풌ퟏ, 풑ퟏ, 풅ퟏ , 풌ퟐ, 풑ퟐ, 풅ퟐ , … , , 풌풏, 풑풏, 풅풏 , 풖 • where 풑ퟎ, 풑ퟏ, … , 풑풏 are pointers to the child nodes, 풌ퟏ, 풌ퟐ, … , 풌풏 are keys, 풅ퟏ, 풅ퟐ, … , 풅풏 are associated data, 풖 is unused space and the records 풌풊, 풑풊, 풅풊 are sorted in increasing order with respect to the keys and 풎 − ퟏ ≤ 풏 ≤ 풎 − ퟏ ퟐ 6. If a subtree 푼 풑풊 corresponds to the pointer 푝 then: i. ∀풌 ∈ 푼 풑풊−ퟏ : 풌 < 풌풊 ii. ∀풌 ∈ 푼 풑풊 : 풌 > 풌풊 20 Example of a B-Tree • Keys: 25, 48, 27, 91, 35, 78, 12, 56, 38, 19 48 19 27 78 12 25 35 38 56 91 21 Redundant B-Tree • Redundant vs. non-redundant B-trees • the presented definition introduced the non-redundant B-tree where each key value occurred just once in the whole tree • redundant B-trees store the data values in the leaves and thus have to allow repeating of keys in the inner nodes • restriction 6(i) is modified so that • ∀풌 ∈ 푼 풑풊−ퟏ : 풌 ≤ 풌풊 • moreover, the inner nodes do not contain pointers to the data records 22 B-Tree implementations • Nodes vs.