GIS-E1070 Theories and Techniques in GIS L

Lecture 8: Vector Data Indexing

Jussi Nikander

The contents of this lecture

• The different problems in vector data indexing
  • Point location
  • Range search
  • Window search
• The data structures for indexing the data
  • Trapezoidal map
  • Kd-tree
  • Segment tree
  • Partition tree

Learning goals of this lecture

• Understand the different point location and windowing problems
• Know how they differ from each other and how these differences affect the data structures required for efficient problem solving
• Recall the different data structures and their most important details

Literature for this lecture

• De Berg et al.: Computational Geometry: Algorithms and Applications
  • Chapters 5-6, 10, 16

The vector data indexing problems

Different kinds of vector data

• Recall that vector data consists of points, polylines and polygons
  • A point is a 0-dimensional coordinate pair
  • A polyline consists of two or more points
  • A polygon is bounded by polyline rings (first and last point are identical)

Different kinds of vector data searches

• In a search problem, the goal is to find data elements that fulfill a given search criterion
  • Which elements need to be shown on-screen?
  • Which elements are inside a given polygon?
• Different search criteria include
  • In a polygon network, find the polygon that contains a given search point
    • E.g. inside which map element is the mouse cursor?
  • For a set of points, find the points that are inside a given rectangular or arbitrary search window
  • For a set of polylines or polygons, find the elements that overlap a given search window
    • Again, rectangular and arbitrary windows are separate cases

Different kinds of vector data searches

[Figure panels: Search windows for a set of points; A given point in polygon network; Search window for polylines and polygons]

• In breakout rooms, consider the following questions
  • Why do computers need different approaches to all these problems?
  • Try to put into words what the fundamental differences are
  • How could computers attempt to solve these problems?

The point location problem

• Consider a polygon network depicting municipalities
• Consider a situation where the municipalities are visualized on a computer
• How can we efficiently find over which municipality the cursor is?
• Or, more generally, given a polygon network, in which polygon is the point p?
• Naive solution is to make a point-in-polygon calculation for each polygon
  • There are n polygons and one test takes O(log n) time, giving O(n log n) in total
• Thus, an efficient data structure for storing the relevant information is required

The range search problem

• Problem: given a set of points in 2D, we want to do queries on points that are inside a given window [x1, y1, x2, y2]
• For point sets, this is basically a range query
  • Report points that are in range [x1, x2] × [y1, y2]
• All points are either completely inside or outside the range

Simple solution to range searching

• Again, we could check all elements to see whether they are inside the window
  • This would take O(n) time
  • Checking whether a point is inside a rectangle can be done in O(1)
• However, good search structures typically have search times comparable to O(log n)
• Therefore, again, we can do better
• In the one-dimensional case, we can just use a regular search tree
  • A balanced BST or its variants, for example
• When looking through the nodes, make recursive calls to all subtrees that might contain nodes belonging to the area
• Each recursive call divides the area into two halves

The window search problem

• Problem: Select all map elements that intersect a given window [x1, y1, x2, y2]
• Now, map elements may contain points, lines and polygons
• Therefore, a more sophisticated approach is required than with point range queries
• Even more complicated approaches are required if the windows are not axis-parallel or rectangular
• Consider in what functionally different ways lines or polygons can overlap a rectangular search area

Segment endpoint: inside and outside cases

• From an algorithmic point of view, there does not need to be a distinction between polylines and polygons
  • A polygon can be considered a set of polylines
  • Polygons that contain the whole search window require some extra work
• There are two different types of segments that intersect the window
  • Those that have at least one endpoint inside the window
  • Those that have both endpoints outside
• For segments that have an endpoint inside the window, we can use a range query
  • Kd-trees or range trees
• For segments that have both endpoints outside the window, we need a different structure
  • Capable of answering interval queries
  • Which line segments intersect the vertical query segment [x, x] × [y1, y2] or the horizontal query segment [x1, x2] × [y, y]
• We assume that the line segments do not intersect each other
  • Data stored in, for example, DCEL

Windowing with polygon windows

• We have now covered
  • Windowing using rectangular windows
  • Both for point data (kd-trees and range trees), and line data (segment trees)
• Now, we generalize the situation to using polygonal windows with point data
  • Line data can be stored using a modification of this method

The point location problem

• Consider a polygon network depicting municipalities
• Consider a situation where the municipalities are visualized on a computer
• How can we efficiently find over which municipality the cursor is?
• Or, more generally, given a polygon network, in which polygon is the point p?

Searchable subdivisions

• A polygon network is a subdivision of an area
• Unfortunately, since polygons are arbitrary, it is not a very good subdivision for searching
• For searching, dividing the polygons into trapezoids works better
• Now it is possible to do comparisons such as "is the point to the left/right of, or above/below, a certain line"

Image source: de Berg et al: Computational geometry

Search structure using trapezoidal map

• A trapezoidal map consists of vertical lines (created for the map) and non-vertical line segments (part of the original polygon network)
• For searching, these edges are arranged into a search structure
• The structure is "tree-like" and consists of inner nodes and leaf nodes
• Each inner node represents a vertical line or a non-vertical line segment
  • Each node has two child nodes
  • Vertical lines divide the area into parts on the left and right (x-nodes)
  • Non-vertical line segments divide the area into parts above and below (y-nodes)
  • Whenever a y-node is encountered, the corresponding line segment spans the x-values possible in that subset of the search space
• Leaf nodes represent trapezoids

Search structure using trapezoidal map

• There is a root node from which all searches start
• There may be several paths from the root node to a leaf node
• Thus, the resulting structure is a rooted, directed, acyclic graph
• In x-nodes the comparison is left/right (white)
• In y-nodes the comparison is above/below (gray)
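The search through x-nodes and y-nodes described above can be sketched in Python as follows. This is only an illustration, not the lecture's or de Berg et al.'s implementation: the class names `XNode`, `YNode`, `Leaf`, the function `locate`, and the hand-built example structure for a single segment are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Union

Point = tuple[float, float]


@dataclass
class Leaf:
    name: str  # label of the trapezoid this leaf represents


@dataclass
class XNode:
    x: float        # x-coordinate of the vertical line
    left: "Node"    # search continues here if the query point is to the left
    right: "Node"   # ... and here otherwise


@dataclass
class YNode:
    p: Point        # left endpoint of the non-vertical segment
    q: Point        # right endpoint of the segment
    above: "Node"   # search continues here if the query point is above
    below: "Node"   # ... and here otherwise


Node = Union[Leaf, XNode, YNode]


def locate(node: Node, pt: Point) -> str:
    """Walk from the root to a leaf and return the trapezoid label."""
    while not isinstance(node, Leaf):
        if isinstance(node, XNode):
            # x-node: left/right comparison against a vertical line
            node = node.left if pt[0] < node.x else node.right
        else:
            # y-node: above/below comparison via the sign of a cross product
            (px, py), (qx, qy) = node.p, node.q
            cross = (qx - px) * (pt[1] - py) - (qy - py) * (pt[0] - px)
            node = node.above if cross > 0 else node.below
    return node.name


# Hypothetical structure for one segment from (2, 2) to (6, 4):
# A = left of the segment, B = above it, C = below it, D = right of it.
example_map = XNode(
    2, Leaf("A"),
    XNode(6, YNode((2, 2), (6, 4), Leaf("B"), Leaf("C")), Leaf("D")),
)
```

For example, `locate(example_map, (4, 5))` passes two x-nodes and one y-node before reaching trapezoid B. Points lying exactly on a line or segment would need a tie-breaking rule, which this sketch leaves out.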

Image source: de Berg et al: Computational geometry

Building a trapezoidal map

• A trapezoidal map works by limiting the search area using x- and y-axes
• Each step should divide the remaining search area approximately in half
• Thus, the map must be structured so that each node divides the remaining area approximately in half
• Creating a structure that guarantees this is hard
• Creating a structure that practically always exhibits such behavior can be done through randomization
  • Line segments are added to the trapezoidal map in random order
  • After each line segment insertion the structure is updated to be a valid trapezoidal map
• The resulting structure has expected O(n) space and expected O(log n) search time efficiency

Line segment addition examples

Image source: de Berg et al: Computational geometry

[Figure sequence: Trapezoidal map search example, search steps numbered 1-6]

Trapezoidal map search example

• Actual search path depends on
  • The structure of the trapezoidal map
  • The point location
• Thus, over all the possible trapezoidal map implementations for any polygon network, the number of possible search paths is very large
• The paths in any specific implementation are well-defined

Range searching

The range search problem

• Problem: given a set of points in 2D, we want to do queries on points that are inside a given window [x1, y1, x2, y2]
• For point sets, this is basically a range query
  • Report points that are in range [x1, x2] × [y1, y2]
• All points are either completely inside or outside the range

Recursive splitting of search area

• The 1-dimensional example shows how the search area is continually split into smaller pieces
• A similar approach can be applied in higher dimensions: at each node, the search area is split into two halves on one axis
• Different axes can be used on different levels in the tree
• Figure: range search for range [18, 77]
  • Light grey nodes are on the search path, dark grey nodes are known to be inside the range without having to check each element separately
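A minimal sketch of the one-dimensional recursive splitting, assuming the data is kept in a sorted Python list that is treated as an implicit balanced search tree (the middle element of each slice plays the role of the node). The function name `range_query` and the sample values are illustrative assumptions, not taken from the slides.

```python
def range_query(sorted_vals, lo, hi):
    """Report all values v in sorted_vals with lo <= v <= hi."""
    out = []

    def visit(start, end):
        # The half-open slice [start, end) is the current subtree.
        if start >= end:
            return
        if lo <= sorted_vals[start] and sorted_vals[end - 1] <= hi:
            # Whole subtree lies inside the range: report it without
            # checking each element separately.
            out.extend(sorted_vals[start:end])
            return
        mid = (start + end) // 2
        v = sorted_vals[mid]
        if lo <= v:            # left subtree may still contain values >= lo
            visit(start, mid)
        if lo <= v <= hi:      # the node itself
            out.append(v)
        if v <= hi:            # right subtree may still contain values <= hi
            visit(mid + 1, end)

    visit(0, len(sorted_vals))
    return out


values = [3, 10, 19, 23, 30, 37, 49, 59, 62, 70, 80, 89, 100, 105]
print(range_query(values, 18, 77))
```

Subtrees that cannot contain values in the range are never entered, which is the same pruning idea that the multidimensional structures apply one axis at a time.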

Image source: de Berg et al: Computational geometry

[Figure sequence: Recursive splitting example]

The kd-tree

• Recursive splitting can be implemented using a structure called kd-tree
• At each level of the tree, elements are split into two groups according to one of the axes
• All data elements are in leaf nodes
• Internal nodes contain info about the lines used for the split
  • Or (hyper)planes for dimensions higher than 2
• Kd-tree is a multidimensional generalization of a binary search tree
  • Technically a 2d-tree here, since the k stands for the number of dimensions

The kd-tree

Image source: de Berg et al: Computational geometry

Building a kd-tree

• Kd-tree is one of those structures that (in its basic form) need to be constructed for a fixed set of points P
  • Adding points to (or removing points from) a kd-tree can make it unbalanced, which will cause a loss of efficiency
• It is, however, possible to construct variations of kd-tree that allow for modifications
  • Not covered here
• The good news is that kd-tree is both conceptually simple, and simple to implement (in Python, at least)
  • For construction you need a means to sort a set according to a specific value (available in Python)
  • And you need to be able to make recursive calls properly (both for construction and searching)

Kd-tree construction algorithm

• As input, take a set of points P and depth d (initially 0)
• As output, return the root node of a kd-tree storing P
• If P contains one point, return a leaf node storing the point
• If d is even
  • Split P into two subsets P1 and P2 according to the median x-coordinate of P
• Else
  • Split P into two subsets P1 and P2 according to the median y-coordinate of P
• v_left ← kd-tree(P1, d + 1)
• v_right ← kd-tree(P2, d + 1)
• Return node v, which contains
  • The median coordinate
  • v_left as left child
  • v_right as right child

Searching in the kd-tree

• Searching a kd-tree is also simple
• The parameter for search is the search area R
• Leaf nodes (each holding one point) are checked for belonging to the search area
• Internal nodes completely inside the search area can be directly reported (all points in the subtree are reported)
• Internal nodes not completely inside the search area are recursively searched
• Internal nodes outside the search area are discarded
• The trick in efficient searching is keeping track of the area of the current subtree and comparing it to R
  • If the area covered by a subtree is topologically within R, all data items in the subtree are naturally inside R
  • The area of the subtree is limited by the lines induced by the median coordinate values

Searching in the kd-tree

• Light gray is the search area
• Darker gray is the part of the tree that is completely within the search area
• Can you match the lines delimiting the dark gray area in the left picture to the corresponding internal nodes in the right picture?
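The kd-tree construction and the region-tracking search can be sketched together in Python as follows. This is a minimal illustration under stated assumptions, not the lecture's reference code: the names `KdNode`, `build`, `search`, `range_search` and the tuple layout `(x1, y1, x2, y2)` for rectangles are choices made for this sketch, and the builder re-sorts at every level, which is simple but gives O(n log² n) build time rather than O(n log n).

```python
import math
from dataclasses import dataclass
from typing import Optional

Point = tuple[float, float]
Rect = tuple[float, float, float, float]  # (x1, y1, x2, y2), closed


@dataclass
class KdNode:
    point: Optional[Point] = None  # set only in leaf nodes
    axis: int = 0                  # 0 = split on x, 1 = split on y
    split: float = 0.0             # median coordinate used for the split
    left: Optional["KdNode"] = None
    right: Optional["KdNode"] = None


def build(points, depth=0):
    """Build a kd-tree for a non-empty list of points."""
    if len(points) == 1:
        return KdNode(point=points[0])
    axis = depth % 2                       # even depth: x, odd depth: y
    pts = sorted(points, key=lambda p: p[axis])
    mid = (len(pts) + 1) // 2              # left half gets the median
    return KdNode(axis=axis, split=pts[mid - 1][axis],
                  left=build(pts[:mid], depth + 1),
                  right=build(pts[mid:], depth + 1))


def inside(p: Point, r: Rect) -> bool:
    return r[0] <= p[0] <= r[2] and r[1] <= p[1] <= r[3]


def contains(outer: Rect, inner: Rect) -> bool:
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def report_all(node, out):
    """Report every point stored in the subtree rooted at node."""
    if node.point is not None:
        out.append(node.point)
    else:
        report_all(node.left, out)
        report_all(node.right, out)


def search(node, query: Rect, region: Rect, out):
    """Report points of the subtree at node (covering region) inside query."""
    if node.point is not None:
        if inside(node.point, query):
            out.append(node.point)
        return
    # Child regions: the split line cuts the current region in two.
    left_region, right_region = list(region), list(region)
    left_region[node.axis + 2] = node.split   # upper bound on the split axis
    right_region[node.axis] = node.split      # lower bound on the split axis
    for child, reg in ((node.left, tuple(left_region)),
                       (node.right, tuple(right_region))):
        if contains(query, reg):
            report_all(child, out)            # fully inside: report subtree
        elif intersects(reg, query):
            search(child, query, reg, out)    # partially inside: recurse


def range_search(root, query: Rect):
    out = []
    search(root, query, (-math.inf, -math.inf, math.inf, math.inf), out)
    return out
```

A possible use would be `range_search(build(points), (3, 1, 8, 5))` for a list of coordinate pairs. Subtrees whose region does not touch the query rectangle are never visited.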

Image source: de Berg et al: Computational geometry

Kd-tree search algorithm

• As input, take a kd-(sub)tree rooted at v and search area R
• As output, give all points in R
• If v is a leaf
  • Report the point stored in v if it is in R, and return
• If v_left is fully in R
  • Traverse v_left and report all points in it
• Else if v_left intersects R
  • SearchKdTree(v_left, R)
• If v_right is fully in R
  • Traverse v_right and report all points in it
• Else if v_right intersects R
  • SearchKdTree(v_right, R)

Kd-tree efficiency

• Kd-tree can be built in O(n log n) time
  • However, this requires some doing, as it assumes the data set is not repeatedly sorted at each level of recursion
  • Repeated sorting gives O(n log² n) time efficiency
• Storage efficiency is O(n)
  • That's the best possible, because we need to store all the data
• Unfortunately, the efficiency of kd-tree search is not O(log n), but O(√n + k)
  • k is the number of reported points
• In order to reach logarithmic efficiency, we need to use more storage space than O(n)
  • This means that at least some elements need to be stored several times
• This is common for multidimensional data structures: either time or space efficiency needs to be sacrificed

The range tree

• A better search time can be achieved using a data structure called range tree
• Range tree consists of a 'main' tree that sorts the data on the x-axis
• Each node in the main tree has an auxiliary tree containing all the data elements covered by the subtree rooted at the node
• The data in each auxiliary tree is sorted on the y-axis
• Space and build time efficiency are both O(n log n)

Image source: de Berg et al: Computational geometry

Range tree searching

• Search proceeds first in the main tree (x-axis)
• When the search splits into two
  • Left path reports all right subtrees along the path
  • Right path reports all left subtrees along the path
• Make a y-axis query in the auxiliary tree in each reported subtree root
• Main tree query delimits the area on the x-axis, auxiliary tree queries on the y-axis
• Time efficiency is O(log² n + k)
• A technique called fractional cascading would reduce the query time to O(log n + k)
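The two-level query can be sketched in Python as follows. This is a simplified illustration under stated assumptions: each auxiliary structure is a y-sorted list searched with `bisect` rather than a balanced tree, the names `RangeNode`, `build_range_tree` and `query` are hypothetical, and the builder re-sorts at each level, so it does not reach the O(n log n) build time quoted above.

```python
import bisect
from dataclasses import dataclass
from typing import Optional


@dataclass
class RangeNode:
    xmin: float   # smallest x-coordinate in this subtree
    xmax: float   # largest x-coordinate in this subtree
    by_y: list    # auxiliary structure: points of the subtree sorted by y
    ykeys: list   # y-coordinates of by_y, used as bisect keys
    left: Optional["RangeNode"] = None
    right: Optional["RangeNode"] = None


def _build(pts):
    # pts is non-empty and sorted by x
    by_y = sorted(pts, key=lambda p: p[1])
    node = RangeNode(pts[0][0], pts[-1][0], by_y, [p[1] for p in by_y])
    if len(pts) > 1:
        mid = len(pts) // 2
        node.left = _build(pts[:mid])
        node.right = _build(pts[mid:])
    return node


def build_range_tree(points):
    """Build the main tree (on x) for a non-empty list of (x, y) tuples."""
    return _build(sorted(points))


def query(node, x1, y1, x2, y2, out):
    """Append all points with x1 <= x <= x2 and y1 <= y <= y2 to out."""
    if node.xmax < x1 or node.xmin > x2:
        return  # subtree is outside the x-range
    if x1 <= node.xmin and node.xmax <= x2:
        # Subtree is fully inside the x-range: answer with a y-axis query
        # on its auxiliary structure.
        lo = bisect.bisect_left(node.ykeys, y1)
        hi = bisect.bisect_right(node.ykeys, y2)
        out.extend(node.by_y[lo:hi])
        return
    # Subtree partially overlaps the x-range: continue in the main tree.
    query(node.left, x1, y1, x2, y2, out)
    query(node.right, x1, y1, x2, y2, out)
```

Because every point appears in the auxiliary list of each of its ancestors, the structure trades extra storage for the faster query, which is the space/time trade-off the slides describe.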

Image source: de Berg et al: Computational geometry

Windowing

The window search problem

• Problem: Select all map elements that intersect a given window [x1, y1, x2, y2]
• Now, map elements may contain points, lines and polygons
• Therefore, a more sophisticated approach is required than with point range queries
• Even more complicated approaches are required if the windows are not axis-parallel or rectangular
• Consider in what functionally different ways lines or polygons can overlap a rectangular search area

Segment tree: 1D case

• There are several different types of structures that can answer 1D interval queries
  • Interval trees, for example
• We will discuss the segment tree, which can be used to create general 2D interval queries with minor modifications
• 1D problem statement: find all intervals that intersect a query point q
  • The query point corresponds to the x-coordinate of a vertical query segment in 2D, or the y-coordinate of a horizontal one
• The solution is based on the fact that the situation changes only at interval endpoints

Interval partitioning and the segment tree

• A set of N intervals can be partitioned into elementary intervals by
  • Sorting the endpoints
  • Making each interval between endpoints into one elementary interval
• Elementary intervals can then be used to create a segment tree
• A segment tree is a balanced binary tree
• Each leaf node corresponds to either
  • An interval endpoint
  • An elementary interval
• Internal nodes correspond to intervals that are unions of their child nodes
• Each node in the tree stores all intervals that span the whole subtree rooted at the node, but do not span the subtree rooted at the parent of the node
  • Elements are stored as high up in the tree as possible

1D segment tree

• Space efficiency O(n log n)
  • Elements are stored multiple times
• Time efficiency for construction O(n log n)
• Time efficiency for query O(log n + k), where k is the number of segments reported

[Figure labels: Segment endpoints; Elementary intervals]
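A 1D segment tree for point (stabbing) queries can be sketched in Python as follows. This is a minimal illustration, not the lecture's implementation: the class names `SegNode` and `SegmentTree` are hypothetical, intervals are assumed to be closed pairs `(a, b)` with `a <= b`, and the leaves are numbered so that endpoint number i gets index 2i + 1 and the open elementary interval just before it gets index 2i.

```python
import bisect


class SegNode:
    """Node covering the leaf indices lo..hi (inclusive)."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.stored = []            # intervals spanning this node but not its parent
        self.left = self.right = None
        if lo < hi:
            mid = (lo + hi) // 2
            self.left = SegNode(lo, mid)
            self.right = SegNode(mid + 1, hi)

    def insert(self, a, b, item):
        # Invariant: the index range a..b overlaps lo..hi.
        if a <= self.lo and self.hi <= b:
            self.stored.append(item)  # store as high up as possible
            return
        if a <= self.left.hi:
            self.left.insert(a, b, item)
        if b >= self.right.lo:
            self.right.insert(a, b, item)

    def stab(self, idx, out):
        # Every interval stored on the root-to-leaf path contains the query.
        out.extend(self.stored)
        if self.left is not None:
            child = self.left if idx <= self.left.hi else self.right
            child.stab(idx, out)


class SegmentTree:
    def __init__(self, intervals):
        # Sorted distinct endpoints define the elementary intervals.
        self.endpoints = sorted({e for iv in intervals for e in iv})
        self.root = SegNode(0, 2 * len(self.endpoints))
        pos = {e: 2 * i + 1 for i, e in enumerate(self.endpoints)}
        for a, b in intervals:
            self.root.insert(pos[a], pos[b], (a, b))

    def stab(self, q):
        """Return all stored intervals that contain the query point q."""
        i = bisect.bisect_left(self.endpoints, q)
        on_endpoint = i < len(self.endpoints) and self.endpoints[i] == q
        out = []
        self.root.stab(2 * i + 1 if on_endpoint else 2 * i, out)
        return out


tree = SegmentTree([(1, 5), (3, 8), (6, 9), (10, 12)])
print(sorted(tree.stab(4)))
```

The query only follows one root-to-leaf path, so its cost is the path length plus the number of reported intervals, matching the O(log n + k) bound above.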

Image source: de Berg et al: Computational geometry

2D Segment trees

• A segment tree solves the problem in one dimension
• For the two-dimensional case it divides the plane into vertical slabs that correspond to the intervals
• We still need to find which of the line segments that intersect the slab also intersect the window border
• Since the line segments do not intersect each other, we can put them in order
• Segments corresponding to one slab (subtree in the segment tree) can be stored in a balanced binary search tree
• Using this structure we can check the segments in order and see which of them intersect the actual window boundary

2D segment tree example

• Space efficiency O(n log n)
  • Elements are stored multiple times
• Time efficiency for construction O(n log n)
• Time efficiency for query O(log² n + k), where k is the number of segments reported

[Figure labels: Auxiliary tree; Slabs]

Image source: de Berg et al: Computational geometry

Windowing with polygons

Windowing with polygon windows

• We have now covered
  • Windowing using rectangular windows
  • Both for point data (kd-trees and range trees), and line data (segment trees)
• Now, we generalize the situation to using polygonal windows with point data
  • Line data can be stored using a modification of this method

The basis for 2D searches

• A simple solution would be to use point-in-polygon for all points in the point set
• However, for searching, the goal is to have O(log n) search time (or at least close to that)
• Sublinear search times typically come from dividing the set to be searched intelligently into subsets
• For two-dimensional points, an intelligent division into subsets would be, for example, polygons
  • Each point belongs to only one subset
  • Subsets can overlap
• Now, in order to solve a partition query, we can first check which polygons intersect the query area
  • Only those that intersect the edge of the query area need further processing

Partition tree

• Partition tree is a data structure for storing 2D points
• It divides the set of points into distinct subsets
  • Each subset is a triangle
  • Subsets can recursively be divided into smaller sets
• Answers polygon queries (rather) efficiently
• Subsets are a fine partition of the data points
  • In a fine partition no subset contains more than twice the average number of elements
• A fine partition can be constructed in a manner where the number of triangles a given query line intersects is relatively small

Partition tree example

Image source: de Berg et al: Computational geometry

Partition tree and cutting tree

• Search time in partition tree is O(n^(1/2+ε))
  • Storage space is linear
  • The value of ε depends on how many child nodes each node has
• Partition tree stores only points
  • Multilevel partition tree can store line segments
• Search times can be reduced to logarithmic using a data structure called cutting tree
  • Cutting tree divides the plane into disjoint triangles
  • Search time becomes logarithmic, O(log n)
  • Storage space becomes quadratic, O(n^(2+ε))
• Details in de Berg et al., if you're interested

Questions?