Range & Windowing Queries (2 Lectures)

Range & windowing queries 1/180 Geometric Algorithms Range & windowing queries (2 lectures) Database queries 2/180 G. Ometer born: Aug 16, 1954 salary salary: $3,500 A database query may ask for all employees with age between a1 and a2, and salary between s1 and s2 19,500,000 19,559,999 date of birth Data structures 3/180 Idea of data structures I Representation of structure, for convenience (like DCEL) I Preprocessing of data, to be able to solve future questions really fast (sub-linear time) A (search) data structure has a storage requirement, a query time, and a construction time (and an update time) 1D range query problem 4/180 1D range query problem: Preprocess a set of n points on the real line such that the ones inside a 1D query range (interval) can be answered fast The points p1;:::; pn are known beforehand, the query [x;x0] only later A solution to a query problem is a data structure, a query algorithm, and a construction algorithm Question: What are the most important factors for the efficiency of a solution? Balanced binary search trees 5/180 A balanced binary search tree with the points in the leaves 49 23 80 10 37 62 89 3 19 30 49 59 70 89 93 3 10 19 23 30 37 59 62 70 80 93 97 Balanced binary search trees 6/180 The search path for 25 49 23 80 10 37 62 89 3 19 30 49 59 70 89 93 3 10 19 23 30 37 59 62 70 80 93 97 Balanced binary search trees 7/180 The search paths for 25 and for 90 49 23 80 10 37 62 89 3 19 30 49 59 70 89 93 3 10 19 23 30 37 59 62 70 80 93 97 Example 1D range query 8/180 A 1-dimensional range query with [25; 90] 49 23 80 10 37 62 89 3 19 30 49 59 70 89 93 3 10 19 23 30 37 59 62 70 80 93 97 Node types for a query 9/180 Three types of nodes for a given query: I White nodes: never visited by the query I Grey nodes: visited by the query, unclear if they lead to output I Black nodes: visited by the query, whole subtree is output Question: What query time do we hope for? Node types for a query 10/180 The query algorithm comes down to what we do at each type of node Grey nodes: use query range to decide how to proceed: to not visit a subtree (pruning), to report a complete subtree, or just continue Black nodes: traverse and enumerate all points in the leaves Example 1D range query 11/180 A 1-dimensional range query with [61; 90] 49 23 80 10 37 62 89 3 19 30 49 59 70 89 93 3 10 19 23 30 37 59 62 70 80 93 97 Example 1D range query 12/180 A 1-dimensional range query with [61; 90] 49 split node 23 80 10 37 62 89 3 19 30 49 59 70 89 93 3 10 19 23 30 37 59 62 70 80 93 97 1D range query algorithm 13/180 Algorithm 1DRANGEQUERY(T;[x : x0]) 1. nsplit FINDSPLITNODE(T;x;x0) 2. if nsplit is a leaf 3. then Check if the point in nsplit must be reported. 4. else n lc(nsplit) 5. while n is not a leaf 6. do if x xn ≤ 7. then REPORTSUBTREE(rc(n)) 8. n lc(n) 9. else n rc(n) 10. Check if the point stored in n must be reported. 11. Similarly, follow the path to x0, and ::: Query time analysis 14/180 The efficiency analysis is based on counting the numbers of nodes visited for each type I White nodes: never visited by the query; no time spent I Grey nodes: visited by the query, unclear if they lead to output; time determines dependency on n I Black nodes: visited by the query, whole subtree is output; time determines dependency on k, the output size Query time analysis 15/180 Grey nodes: they occur on only two paths in the tree, and since the tree is balanced, its depth is O(logn) Black nodes: a (sub)tree with m leaves has m 1 internal − nodes; traversal visits O(m) nodes and finds m points for the output The time spent at each node is O(1) O(logn + k) query time ) Storage requirement and preprocessing 16/180 A (balanced) binary search tree storing n points uses O(n) storage A balanced binary search tree storing n points can be built in O(n) time after sorting Result 17/180 Theorem: A set of n points on the real line can be preprocessed in O(nlogn) time into a data structure of O(n) size so that any 1D range query can be answered in O(logn + k) time, where k is the number of answers reported quadtrees: good in practice, but not so good worst-case query time Kd-trees: queries take O(pn + k) (Chapter 5.2) range trees: today Range queries in 2D 18/180 Range queries in 2D 18/180 quadtrees: good in practice, but not so good worst-case query time Kd-trees: queries take O(pn + k) (Chapter 5.2) range trees: today Back to 1D range queries 19/180 A 1-dimensional range query with [61; 90] 49 split node 23 80 10 37 62 89 3 19 30 49 59 70 89 93 3 10 19 23 30 37 59 62 70 80 93 97 Examining 1D range queries 20/180 Observation: Ignoring the search path leaves, all answers are jointly represented by the highest nodes strictly between the two search paths Question: How many highest nodes between the search paths can there be? Examining 1D range queries 21/180 For any 1D range query, we can identify O(logn) nodes that together represent all answers to a 1D range query Toward 2D range queries 22/180 For any 2d range query, we can identify O(logn) nodes that together represent all points that have a correct first coordinate Toward 2D range queries 23/180 (1, 5) (3, 8) (4, 2) (5, 9) (6, 7) (7, 3) (8, 1) (9, 4) Toward 2D range queries 24/180 (1, 5) (3, 8) (4, 2) (5, 9) (6, 7) (7, 3) (8, 1) (9, 4) Toward 2D range queries 25/180 data structure for searching on y-coordinate (1, 5) (3, 8) (4, 2) (5, 9) (6, 7) (7, 3) (8, 1) (9, 4) Toward 2D range queries 26/180 (5, 9) (3, 8) (6, 7) (1, 5) (9, 4) (7, 3) (4, 2) (8, 1) (1, 5) (3, 8) (4, 2) (5, 9) (6, 7) (7, 3) (8, 1) (9, 4) 2D range trees 27/180 Every internal node stores a whole tree in an associated structure, on y-coordinate Question: How much storage does this take? Storage of 2D range trees 28/180 To analyze storage, two arguments can be used: I By level: On each level, any point is stored exactly once. So all associated trees on one level together have O(n) size I By point: For any point, it is stored in the associated structures of its search path. So it is stored in O(logn) of them Construction algorithm 29/180 Algorithm BUILD2DRANGETREE(P) 1. Construct the associated structure: Build a binary search tree Tassoc on the set Py of y-coordinates in P 2. if P contains only one point 3. then Create a leaf n storing this point, and make Tassoc the associated structure of n. 4. else Split P into Pleft and Pright, the subsets and > ≤ the median x-coordinate xmid 5. nleft BUILD2DRANGETREE(Pleft) 6. nright BUILD2DRANGETREE(Pright) 7. Create a node n storing xmid, make nleft the left child of n, make nright the right child of n, and make Tassoc the associated structure of n 8. return n Efficiency of construction 30/180 The construction algorithm takes O(nlog2 n) time T(1) = O(1) T(n) = 2 T(n=2) + O(nlogn) · which solves to O(nlog2 n) time Efficiency of construction 31/180 Suppose we pre-sort P on y-coordinate, and whenever we split P into Pleft and Pright, we keep the y-order in both subsets For a sorted set, the associated structure can be built in linear time Efficiency of construction 32/180 The adapted construction algorithm takes O(nlogn) time T(1) = O(1) T(n) = 2 T(n=2) + O(n) · which solves to O(nlogn) time 2D range queries 33/180 How are queries performed and why are they correct? I Are we sure that each answer is found? I Are we sure that the same point is found only once? 2D range queries 34/180 ν p p µ µ0 p p Query algorithm 35/180 Algorithm 2DRANGEQUERY(T;[x : x ] [y : y ]) 0 × 0 1. nsplit FINDSPLITNODE(T;x;x0) 2. if nsplit is a leaf 3. then report the point stored at nsplit, if an answer 4. else n lc(nsplit) 5. while n is not a leaf 6. do if x xn ≤ 7. then 1DRANGEQ(Tassoc(rc(n));[y : y0]) 8. n lc(n) 9. else n rc(n) 10. Check if the point stored at n must be reported. 11. Similarly, follow the path from rc(nsplit) to x0 ::: 2D range query time 36/180 Question: How much time does a 2D range query take? Subquestions: In how many associated structures do we search? How much time does each such search take? 2D range queries 37/180 ν µ µ0 2D range query efficiency 38/180 We search in O(logn) associated structures to perform a 1D range query; at most two per level of the main tree The query time is O(logn) O(logm + k ), or × 0 ∑O(lognn + kn ) n where ∑kn = k the number of points reported 2D range query efficiency 39/180 Use the concept of grey and black nodes again: 2D range query efficiency 40/180 The number of grey nodes is O(log2 n) The number of black nodes is O(k) if k points are reported The query time is O(log2 n + k), where k is the size of the output Result 41/180 Theorem: A set of n points in the plane can be preprocessed in O(nlogn) time into a data structure of O(nlogn) size so that any 2D range query can be answered in O(log2 n + k) time, where k is the number of answers reported In contrast, a kd-tree has O(n) size and answers queries in O(pn + k) time (Chapter 5.2).

Load more