Advanced Data Structures
Anubhav Baweja
May 19, 2020

1 Introduction

Design of data structures is important for storing and retrieving information efficiently. All data structures have a set of operations that they support within some time bounds. For instance, balanced binary search trees such as AVL trees support find, insert, delete, and other operations in O(log n) time, where n is the number of elements in the tree. In this report we will talk about two data structures that support the same operations but with different time complexities. This problem is called the fixed-universe predecessor problem, and the two data structures we will be looking at are Van Emde Boas trees and Fusion trees.

2 Setting the stage

For this problem, we will make the fixed-universe assumption. That is, the only elements we care about are w-bit integers. Additionally, we will be working in the word RAM model: we can assume that w ≥ log n, where n is the size of the problem, and we can do operations such as addition, subtraction, etc. on w-bit words in O(1) time. These are fair assumptions to make, since these are restrictions and advantages that real computers offer. So now, given the data structure T, we want to support the following operations:

1. insert(T, a): insert integer a into T. If it already exists, do not duplicate.
2. delete(T, a): delete integer a from T, if it exists.
3. predecessor(T, a): return the largest b in T such that b ≤ a, if one exists.
4. successor(T, a): return the smallest b in T such that b ≥ a, if one exists.

Note that balanced binary search trees can support all of these operations in O(log n) time, so we will try to do better here.

3 Van Emde Boas Trees

If we are given that the size of the universe is u, then using vEB trees we can support all the given operations in O(log log u) time [1]. If we are considering all integers of word length w, then u = 2^w, so our time bound becomes O(log w). If we further assume that w = O(log n), i.e. the universe is at most polynomial in the problem size, the time is O(log log n), significantly better than the complexity for binary search trees. In order to motivate this complexity, we need a recurrence that solves to O(log log u). The classic example of such a recurrence is T(u) = T(√u) + O(1), so we will strive to get to that point. But first let's incrementally build this data structure in steps.

3.1 The first solution

A naive thing we could do is just maintain a bit vector over all possible elements in our universe. This gives us O(1) inserts and deletes, but predecessor and successor can be as bad as O(u). However, we can make the following optimization: we store another bit vector of half the size where each element is the OR of two adjacent entries. Then we combine adjacent elements of this bit vector to get another one, and so on.

Figure 1: The data structure for the set S = {1, 2, 3, 7} where u = 8

It is clear that we can do insert and delete in O(log u) time with this modification (just update the ancestor blocks, and also maintain a counter for the number of elements present in each range), but now we can also do the predecessor and successor operations in that time. We first find the leaf at the corresponding position, and do the rest in two phases:

1. Up phase: We keep going up until we enter a node from the left side such that the right child of the node is also 1.
2. Down phase: From there we go down to the left child if it contains a 1, otherwise we go down to the right child.

For example, if we wanted the successor of 1 in Figure 1, we go up until the 0-3 block, since its right child also has a 1, and from there we go down to the 2 block, so we get that the successor of 1 is 2. This is particularly nice because we see that the hard case for successor(T, a) is when a is already in T: otherwise we can make the search for a the Down phase itself, bypassing the Up phase completely.
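To make the two phases concrete, here is a minimal Python sketch of the layered bit vector. The class and method names are illustrative rather than from these notes; following the example above, successor returns the next element strictly after a, and delete (with the per-block counters mentioned above) is omitted for brevity.

    # Level 0 stores one bit per universe element; level k ORs pairs from
    # level k-1, so the top level is a single bit.  Insert touches every
    # ancestor block (O(log u)); successor runs the Up phase, then the Down phase.
    class LayeredBitVector:
        def __init__(self, u):
            self.u = u                              # universe size, a power of 2
            self.levels = []
            size = u
            while size >= 1:
                self.levels.append([0] * size)
                size //= 2

        def insert(self, a):
            for k in range(len(self.levels)):       # set the leaf and all its ancestors
                self.levels[k][a >> k] = 1

        def successor(self, a):
            # Up phase: climb until we arrive from a left child whose right
            # sibling also contains a 1.
            k, i = 0, a
            while True:
                if i % 2 == 0 and i + 1 < len(self.levels[k]) and self.levels[k][i + 1]:
                    i += 1                          # switch to the right sibling
                    break
                k, i = k + 1, i // 2
                if k >= len(self.levels):
                    return None                     # nothing larger than a exists
            # Down phase: prefer the left child whenever it contains a 1.
            while k > 0:
                k, i = k - 1, 2 * i
                if not self.levels[k][i]:
                    i += 1
            return i

    # The example from Figure 1: S = {1, 2, 3, 7}, u = 8.
    v = LayeredBitVector(8)
    for x in (1, 2, 3, 7):
        v.insert(x)
    assert v.successor(1) == 2 and v.successor(3) == 7 and v.successor(7) is None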
3.2 Motivation for vEB trees

Note that the reason the above solution has O(log u) complexity is that we divide the entire set into 2 parts at every layer, so the recurrence we are implicitly solving is T(u) = T(u/2) + O(1). Since we want to move towards T(u) = T(√u) + O(1) instead, let's divide the set into √u parts of size √u. Therefore every vEB stores √u many vEBs of size √u each, and the total height of the tree is O(log log u).

When we divided the set into 2 halves, we could just OR the results stored in the 2 halves to compute the result for the whole set. This worked because at the end of the Up phase there was always only one place to look: the right child of the node. However, now there might be as many as √u − 1 options to pick from, and we cannot go through each one of them, since that would destroy our complexity. So we need to figure out some other way to do this.

Here is the super clever part: deciding which child to go down into is like solving the successor problem again. Since the node has √u children, we can enumerate them from 0 to √u − 1, and while coming up from child/block i, we can just ask for the successor of i on a bit set that we maintain over these children. And this can be solved with a vEB tree of size √u. So not only does a vEB contain √u child vEBs with √u elements each, but we also have an extra vEB called the "summary", which also contains √u elements (although these elements are artificial in the sense that they correspond to the child/block numbers that we have assigned to the children).
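As a side note on how this √u split is usually realized in code: for u = 2^w with w even, the child/block number of an element is just its high-order w/2 bits, and its position inside that child is its low-order w/2 bits. The helper names high, low, and index below are the conventional ones from the literature, not from these notes.

    # Splitting an element of a universe of size u = 2^w (w even) into the
    # index of the child vEB that contains it and its offset inside that child.
    def high(x, w):
        return x >> (w // 2)                 # which of the sqrt(u) children

    def low(x, w):
        return x & ((1 << (w // 2)) - 1)     # position inside that child

    def index(h, l, w):
        return (h << (w // 2)) | l           # inverse: rebuild x from (high, low)

    w = 4                                    # u = 16: 4 children of 4 elements each
    assert (high(13, w), low(13, w)) == (3, 1)
    assert index(high(13, w), low(13, w), w) == 13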
3.3 Cleanup

This is all really cool, but we need to tie up some loose ends. In particular, do we really query this summary vEB at every node in the Up phase? Note that if we do, then the recursive formula is no longer T(u) = T(√u) + O(1). So we can only query the summary at most a constant number of times in the Up phase. In fact, if we store the maximum element of each vEB, then we only need to query the summary once: at the end, when we flip to the Down phase. In the Up phase, we can check whether the queried integer is equal to the max. If it is, we continue going up; otherwise we know that there is a greater integer in the set that lies within this vEB, so we query the summary and start the Down phase with the appropriate block. Note that in order to support the predecessor operation, we need to do a similar thing and store the minimum. With these additions to our data structure, we have finally achieved O(log log u) time predecessor and successor operations.

3.4 Insert and Delete

Now we just need to make sure insertions and deletions can still be supported in O(log log u) time (a code sketch combining these ideas with the successor query appears at the end of these notes):

1. insert(V, a): Starting at the root, we can figure out which child vEB to go down into in O(1) time. If a is smaller than the minimum or larger than the maximum, we update the corresponding value in O(1). Now there are two cases:

• The child vEB we want to insert a into is not empty. In this case we do not need to update the summary at all, and we can just proceed by inserting a into the child, which takes T(√u) time.

• The child vEB we want to insert a into is empty. In this case we need to enter the child's index into the summary, which will take T(√u) time. Now one might think that we need to insert a into the child vEB recursively as well, and that takes another T(√u) time, breaking our recursive formula. However, we know that the child vEB is empty to begin with, so we have already reached our 'base case' in a way. The remaining cost of such insertions will total O(log log u) because that is the height of the tree, so the total cost is T(u) + O(log log u) where T(u) = T(√u) + O(1), and we are good.

2. delete(V, a): Note that there is nothing to be done if the tree is empty. Just like insert, we now need to consider a few cases:

• There is a single element in V. In this case, the Min and Max are equal (which can be checked in O(1) time), so we just check if they are equal to a.
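Putting Sections 3.3 and 3.4 together, here is a compact Python sketch of a vEB node with insert and successor. It follows the standard textbook formulation of the ideas above: min and max are cached at every node, the min is not stored recursively (so inserting into an empty child really is the O(1) 'base case' described above), and the summary is updated only when a child goes from empty to non-empty. The names, the dictionary-of-children representation, and the restriction to w a power of two are choices made for this sketch, not details from the notes; delete is omitted for brevity.

    class VEB:
        def __init__(self, w):
            self.w = w                        # universe is {0, ..., 2^w - 1}, w a power of 2
            self.min = None                   # min is cached here and never stored recursively
            self.max = None
            if w > 1:
                self.half = w // 2
                self.summary = VEB(self.half)     # successor structure over non-empty children
                self.children = {}                # child index -> VEB(self.half), created lazily

        def _high(self, x):  return x >> self.half
        def _low(self, x):   return x & ((1 << self.half) - 1)
        def _index(self, h, l): return (h << self.half) | l

        def insert(self, x):
            if self.min is None:                  # empty node: O(1)
                self.min = self.max = x
                return
            if x == self.min or x == self.max:    # already present, do not duplicate
                return
            if x < self.min:
                self.min, x = x, self.min         # new min stays here; old min is pushed down
            if x > self.max:
                self.max = x
            if self.w > 1:
                h, l = self._high(x), self._low(x)
                if h not in self.children:
                    self.children[h] = VEB(self.half)
                child = self.children[h]
                if child.min is None:
                    self.summary.insert(h)        # child was empty: recurse on the summary,
                child.insert(l)                   # ...and this insert then hits the O(1) case

        def successor(self, x):
            # Smallest element strictly greater than x, or None.
            if self.min is not None and x < self.min:
                return self.min
            if self.w == 1:
                return 1 if x == 0 and self.max == 1 else None
            h, l = self._high(x), self._low(x)
            child = self.children.get(h)
            if child is not None and child.max is not None and l < child.max:
                return self._index(h, child.successor(l))   # answer lies in x's own child
            nh = self.summary.successor(h)                   # single summary query, then descend
            if nh is None:
                return None
            return self._index(nh, self.children[nh].min)

    # The set from Figure 1, embedded in a universe of size 16 (w = 4).
    T = VEB(4)
    for x in (1, 2, 3, 7):
        T.insert(x)
    assert T.successor(1) == 2 and T.successor(3) == 7 and T.successor(7) is None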