Trees • Binary Trees • Newer Types of Tree Structures • M-Ary Trees

Data Organization and Processing Hierarchical Indexing (NDBI007) David Hoksza http://siret.ms.mff.cuni.cz/hoksza 1 Outline • Background • prefix tree • graphs • … • search trees • binary trees • Newer types of tree structures • m-ary trees • Typical tree type structures • B-tree • B+-tree • B*-tree 2 Motivation • Similarly as in hashing, the motivation is to find record(s) given a query key using only a few operations • unlike hashing, trees allow to retrieve set of records with keys from given range • Tree structures use “clustering” to efficiently filter out non relevant records from the data set • Tree-based indexes practically implement the index file data organization • for each search key, an indexing structure can be maintained 3 Drawbacks of Index(-Sequential) Organization • Static nature • when inserting a record at the beginning of the file, the whole index needs to be rebuilt • an overflow handling policy can be applied → the performance degrades as the file grows • reorganization can take a lot of time, especially for large tables • not possible for online transactional processing (OLTP) scenarios (as opposed to OLAP – online analytical processing) where many insertions/deletions occur 4 Tree Indexes • Most common dynamic indexing structure for external memory • When inserting/deleting into/from the primary file, the indexing structure(s) residing in the secondary file is modified to accommodate the new key • The modification of a tree is implemented by splitting/merging nodes • Used not only in databases • NTFS directory structure is built over B+ tree 5 Trees • Undirected graph without cycles • root • we will be interested in rooted trees • is the only node without parent – root → one node designated as the root of the hierarchy → orientation • leaves • nodes without children • inner nodes • Nodes are in parent-child relation • nodes with children • every parent has a finite set of • branch descendants (children nodes) • path from a leaf to the root • from a parent an edge points to child • every child has exactly one parent 6 Tree height = 3 Root Tree Characteristics (1) Tree arity = 3 width = 1 D:0-H:3 level 0 • Tree arity/degree • maximum number of children of any node width = 2 D:1-H:1 D:1-H:2 • Node depth level 1 • the length of the simplest path (number of edges) from the root node • depth(root) = 0 D:2-H:1 width = 5 • Tree depth level 2 • maximum of node depths • Tree level D:2-H:0 • set of nodes with the same depth (distance from root) width = 2 level 3 • Level width • number of nodes at given level D:3-H:0 • Node height • number of edges on the longest downward simple path from Y to a leaf Tree depth = 3 • Tree height • height of the root node 7 Tree Characteristics (2) • Balanced (vyvážený) tree • a tree whose subtrees differ in height by no more than one and the subtrees are height-balanced as well • Unsorted tree • general tree – the descendants of a node are not sorted at all • Sorted tree • unlike general tree the order of children of each node matters • children of a node are sorted based on a given key 8 Binary Tree • Each inner node contains at • Characteristics most two child nodes (tree arity • number of nodes of a perfect = 2) binary tree of height ℎ 풉 풊 풉+ퟏ • Types #풏풐풅풆풔풑풆풓풇 = ퟐ = ퟐ − ퟏ • perfect (perfektní) binary tree – 풊=ퟎ every non-leaf node has two child nodes • number of leaf nodes of a • complete (komletní) binary tree perfect binary tree of height ℎ – full tree except for the first non- 풉 leaf level where nodes are as far left #풍풆풂풇_풏풐풅풆풔풑풆풓풇 = ퟐ as possible 9 Binary Search Tree (BST) • Binary sorted tree where • Searching a record with a key 푘 • each node is assigned a value 1. Enter the tree at the root level corresponding to one of the keys 2. Compare 푘 with the value 푣 of the • all nodes in the left subtree of a node current node 푋 correspond to keys with lower values 3. If 푘 = 푣 return the record than the value of 푋 corresponding to the current node key • all nodes in the right subtree of a 4. If 푘 < 푣 descend to the left subtree node 푋 correspond to keys with higher 5. If 푘 > 푣 descend to the right subtree values than the value of 푋 6. Repeat steps 2-5 until a leaf is reached. If the condition in step 3 is • when using BST for storing records, not met at any level, no record with the records can be stored directly in the the key 푘 is present nodes or addressed from them 10 Redundant vs. Non-Redundant BST Non-Redundant BST Redundant BST • Records stored in (addressed • Records stored in (addressed from) both inner and leaf nodes from) the leaves • Structure of inner and leaf nodes differ 8 8 4 9 4 12 2 7 9 12 2 7 9 2 4 7 8 11 Applications of Binary Trees (not related to storing data records) • Expression trees • leaves = variables • inner nodes = operands • Huffman coding • leaves = data • coding along a branch leading to give leaf = leaf’s binary representation • Query optimizers in DBMS • query can be represented by an algebraic expression which can be in turn represented by a binary tree 12 M-ary Trees (1) • Motivation • binary trees are not suitable for magnetic disks because of their height – indexing 16 items would need an index of height 4 → up to 4 disk accesses (potentially in different locations) to retrieve one of the 16 records • log2 1000 = 10 , log2 1,000,000 = 20 • But increasing arity leads to decreasing the tree height 13 M-ary Trees (2) • Generalization of binary search trees • A binary tree is a special case of an m-ary tree for 푚 = 2 • only m-ary trees with 푚 > 2 are usually considered as true m−ary trees • unlike binary tree, m-ary tree contains 풎 − ퟏ discriminators (keys) in every node • every node 푁 points to up to 풎 child nodes (subtrees) • every subtree of 푁 contains records with keys restricted by the pair of discriminators of 푁 between which the subtree is rooted • the leftmost subtree contains only records with keys having lower values than all the discriminators in 푁 • the rightmost subtree contains only records with keys having higher values than all the discriminators in 푁 14 M-ary Trees (3) 풎 = ퟒ Data could be stored directly in disk block the node as well. But it is not usual in real-world database environments. 푘1 푘2 푘3 data1 data2 data3 subtree with subtree with subtree with subtree with all keys all keys all keys all keys 푘1 < 푘 < 푘2 < 푘 < 푘 < 푘1 푘3 < 푘 푘2 푘3 15 Characteristics of M-Ary Trees (1) • Maximum number of nodes • Maximum number of • tree height ℎ records/keys 0 • 푙푒푣푒푙0 = 푚 = 1, 푙푒푣푒푙1 = • every node contains up to 푚 − 1 1 2 ℎ keys/records 푚 , 푙푒푣푒푙2 = 푚 , … 푙푒푣푒푙ℎ = 푚 ℎ 푚ℎ+1 − 1 풎풉+ퟏ − ퟏ × 푚 − 1 = 풎풉+ퟏ − ퟏ 푚 = 푚 − 1 풎 − ퟏ =0 • 푚 = 3, ℎ = 3 → #푟푒푐표푟푑푠 ≤ 34 − 1 = 80 • 푚 = 100, ℎ = 3 → #푟푒푐표푟푑푠 ≤ 1004 − 1 = 100,000,000 16 Characteristics of M-Ary Trees (2) • Minimum height • Maximum height 풏 풉 ≐ 풎 풉 = 퐥퐨퐠풎 풏 • maximum height of the tree • minimum height of the tree corresponds to the maximum corresponds to the minimum number of disk operations needed number of disk operations needed to fetch a record → search complexity to fetch a record → search complexity 푶(풏) 푶(퐥퐨퐠퐦 풏) • The challenge is to keep the complexity logarithmic, that is to keep the tree more or less balanced 17 B-tree • Bayer & McCreight, 1972 • B-tree is a sorted balanced m-ary (not binary) tree with additional constraints restricting the branching in each node thus causing the tree to be reasonably “wide” • Inserting or deleting a record in B-tree causes only local changes and not rebuilding of the whole index 18 B-tree definition (1) • B-trees are balanced m-ary trees fulfilling the following conditions: 1. The root has at least two children unless it is a leaf. 풎 2. Every inner node except of the root has at least and at most 풎 ퟐ children. 풎 3. Every node contains at least − ퟏ and at most 풎 − ퟏ (pointers ퟐ to) data records. 4. Each branch has the same length. 19 B-tree definition (2) 5. The nodes have following organization: 풑ퟎ, 풌ퟏ, 풑ퟏ, 풅ퟏ , 풌ퟐ, 풑ퟐ, 풅ퟐ , … , , 풌풏, 풑풏, 풅풏 , 풖 • where 풑ퟎ, 풑ퟏ, … , 풑풏 are pointers to the child nodes, 풌ퟏ, 풌ퟐ, … , 풌풏 are keys, 풅ퟏ, 풅ퟐ, … , 풅풏 are associated data, 풖 is unused space and the records 풌풊, 풑풊, 풅풊 are sorted in increasing order with respect to the keys and 풎 − ퟏ ≤ 풏 ≤ 풎 − ퟏ ퟐ 6. If a subtree 푼 풑풊 corresponds to the pointer 푝 then: i. ∀풌 ∈ 푼 풑풊−ퟏ : 풌 < 풌풊 ii. ∀풌 ∈ 푼 풑풊 : 풌 > 풌풊 20 Example of a B-Tree • Keys: 25, 48, 27, 91, 35, 78, 12, 56, 38, 19 48 19 27 78 12 25 35 38 56 91 21 Redundant B-Tree • Redundant vs. non-redundant B-trees • the presented definition introduced the non-redundant B-tree where each key value occurred just once in the whole tree • redundant B-trees store the data values in the leaves and thus have to allow repeating of keys in the inner nodes • restriction 6(i) is modified so that • ∀풌 ∈ 푼 풑풊−ퟏ : 풌 ≤ 풌풊 • moreover, the inner nodes do not contain pointers to the data records 22 B-Tree implementations • Nodes vs.

Trees • Binary Trees • Newer Types of Tree Structures • M-Ary Trees

Details

Download

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

Support