Adaptive Main-Memory Indexing for High-Performance Point-Polygon Joins
Andreas Kipf, Harald Lang, Varun Pandey, Raul Alexandru Persa, Christoph Anneser, Eleni Tzirita Zacharatou*, Harish Doraiswamy⋄, Peter Boncz⋆, Thomas Neumann, Alfons Kemper
TUM, TU Berlin*, NYU⋄, CWI⋆
{kipf, langh, pandey, raul.persa, anneser, neumann, kemper}@in.tum.de

ABSTRACT

Connected mobility applications rely heavily on geospatial joins that associate point data, such as locations of Uber cars, to static polygonal regions, such as city neighborhoods. These joins typically involve expensive geometric computations, which makes it hard to provide an interactive user experience.

In this paper, we propose an adaptive polygon index that leverages true hit filtering to avoid expensive geometric computations in most cases. In particular, our approach closely approximates polygons by combining quadtrees with true hit filtering, and stores these approximations in a query-efficient radix tree. Based on this index, we introduce two geospatial join algorithms: an approximate one that guarantees a user-defined precision, and an exact one that adapts to the expected point distribution. In summary, our technique outperforms existing CPU-based joins by up to two orders of magnitude and is competitive with state-of-the-art GPU implementations.

© 2020 Copyright held by the owner/author(s). Published in Proceedings of the 23rd International Conference on Extending Database Technology (EDBT), March 30-April 2, 2020, ISBN 978-3-89318-083-7 on OpenProceedings.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0. Series ISSN: 2367-2005. DOI: 10.5441/002/edbt.2020.31

1 INTRODUCTION

Connected mobility companies need to process vast amounts of location data in near real-time to run their businesses. For example, Uber needs to map locations of cars and passenger requests (points) to predefined zones (polygonal regions) for allocation and dynamic pricing purposes [40]. These polygonal regions are typically largely disjoint (non-overlapping) and mostly static. Points, on the other hand, are often not known a priori. Thus, the problem is how to efficiently find the polygons that contain an incoming point.

Traditionally, such point-polygon joins [19] follow the filter and refine approach. In this two-phase evaluation strategy, the filtering phase typically uses an index (e.g., an R-tree) on the minimum bounding rectangles (MBRs) of polygons and probes the index for each point to obtain a list of candidate join pairs. Then, in the refinement phase, expensive point-in-polygon (PIP) tests are performed to discard false matches.

We argue that the time has come to rethink this strategy: First, main memory is not a scarce resource anymore and modern machines offer multiple terabytes of memory. Combined with the city-centric model of geospatial applications (e.g., Uber), we show that it is possible to maintain highly fine-grained indexes for entire cities (e.g., Uber’s operating zones) in main memory, dramatically reducing the number of CPU-intensive PIP tests. Second, geospatial positions, nowadays typically obtained by smartphones or wearables, are inherently imprecise [41]. Thus, we argue that it is in many cases admissible to trade off accuracy for performance. Based on these two insights, we transform the traditionally CPU-intensive problem of point-polygon joins into one that is bound by memory access latencies.

In contrast to the classical filter and refine approach, true hit filtering [9] identifies actual join pairs already in the filtering phase, and thus partially avoids expensive refinements. This is achieved by using additional approximations (such as inner rectangles [20]) to approximate the interior of polygons, so that when a point falls into an interior approximation, it can be safely deduced that the point is contained in the polygon.

Building on this seminal idea, we present an improved algorithm that combines true hit filtering with quadtrees [23] to holistically index an entire set of polygons. This is in contrast to existing implementations of true hit filtering that approximate polygons individually [15, 21] or use non-hierarchical (single-resolution) grids [6, 39, 49]. In our approach, polygons are translated into a single set of multi-resolution grid cells that approximates their boundary and interior areas. To support efficient queries, we store one-dimensional identifiers of the cells in a new in-memory radix tree (trie) named Adaptive Cell Trie (ACT). We show that ACT is more query-efficient than previous approaches for indexing cell identifiers (e.g., B-trees, like in [21]).

Another distinguishing feature of our approach is that it can entirely avoid the expensive refinement phase by refining cells in the boundary areas until a user-defined precision is guaranteed. Naturally, this comes at the cost of higher memory consumption than traditional filter and refine approaches. However, as stated above, we argue that we can nowadays actually afford this higher memory consumption in exchange for higher performance.

Our approach can also provide accurate results by performing expensive PIP tests for points that are potential hits. To reduce their number, we adapt (train) our index based on historical data points to provide higher precision where it is actually needed. As we show in our experiments, our accurate algorithm performs very few PIP tests. Compared to a filter based on the polygons’ MBRs, our index (trained with 1 M historical points) reduces the number of required PIP tests by >97% for a join between NYC taxi pick-up locations and neighborhood polygons. This algorithm can also be used when ACT cannot guarantee the desired precision given a certain memory budget.

In summary, we make the following contributions:

• An algorithm that computes quadtree-based grid approximations for sets of polygons with precision guarantees
• A radix tree data structure (ACT) that is optimized for indexing cell identifiers: for a join of NYC’s yellow taxi data with NYC’s neighborhoods, we achieve a throughput of >50 M points/s per CPU core under a <4 m precision bound
• An evaluation of ACT in contrast to more traditional data structures, such as B-trees
• An accurate algorithm that trains the index structure based on historical data points
• An experimental comparison against state-of-the-art GPU-based point-polygon joins

In the remainder of this paper, we first give some background about the building blocks of our approach in Section 2. Section 3 describes our approach and Section 4 presents the evaluation with real-world and synthetic data. Finally, we summarize related work in Section 5 before concluding in Section 6.

2 BACKGROUND

Location Discretization. Our approach relies on a quadtree-based (hierarchical) decomposition of space (the surface of the Earth in this case). This decomposition is static and thus data-independent. We enumerate the quadtree cells using a space-filling curve (e.g., the Hilbert or the Z curve) to index them in a one-dimensional data structure. Our approach does not depend on a concrete space-filling curve. For our indexing strategy to work, the cell enumeration must only fulfill the property that child cells share a common prefix with their parent cell.

Figure 1 shows the hierarchical decomposition of two cells at levels i and i + 1 and the corresponding bitwise representations that encode the cells’ positions along the Hilbert curve. Each cell consists of four subcells, which it completely covers. Child cells share a common prefix with their parent cell, allowing us to compute contains relationships using efficient bitwise operations.

Figure 1: Quadtree-based cell decomposition and Hilbert curve-based enumeration.

[...] contains neither conflicting nor duplicate cells. Two cells are conflicting when one cell contains the other. Only when the covering is normalized can cell containment checks be efficiently implemented using a binary search on the sorted vector (O(log n)).

While binary search on a sorted vector is a good strategy for querying small collections of cells (e.g., the covering cells of a single polygon), it is not the most efficient way to search larger collections (e.g., coverings of multiple polygons). In this work, we store large cell collections in ACT, a query-efficient radix tree, and evaluate its performance compared to alternative physical representations (including a sorted vector and a B-tree).

Figure 2: A covering (blue cells) and an interior covering (green cells) of an individual polygon.

PIP Test. A point-in-polygon (PIP) test determines whether a point lies within a polygon. Typically, such a test is performed using complex geometric operations, such as the ray-tracing algorithm [17], which involves drawing a line from the query point to a point that is known to be outside of the polygon and counting the number of edges that the line crosses. If the line crosses an odd number of edges, the query point lies within the polygon. The runtime complexity of this algorithm is O(n), n being the number of edges. While there are many conceptual optimizations to the PIP test, this operation remains computationally expensive since it processes real numbers (e.g., latitude/longitude coordinates) and thus involves floating-point arithmetic.

3 GEOSPATIAL JOIN APPROACH

In this work, we target the problem of mapping points to static, largely disjoint polygons. We show how to accelerate such joins by computing fine-grained cell-based approximations of sets of polygons and maintaining them in a query-efficient in-memory radix tree, which enables efficient cell lookups and significantly reduces (or even eliminates) expensive geometric tests.

In contrast to techniques that first reduce the number of can-