Effective Spatial Data Partitioning for Scalable Query Processing
Ablimit Aji, HP Labs, Palo Alto, California, USA
Hoang Vo, Emory University, Atlanta, Georgia, USA
Fusheng Wang, Stony Brook University, Stony Brook, New York, USA

arXiv:1509.00910v1 [cs.DB] 3 Sep 2015

ABSTRACT

Recently, MapReduce based spatial query systems have emerged as a cost effective and scalable solution to large scale spatial data processing and analytics. MapReduce based systems achieve massive scalability by partitioning the data and running query tasks on those partitions in parallel. Therefore, effective data partitioning is critical for task parallelization and load balancing, and directly affects system performance. However, several pitfalls of spatial data partitioning make this task particularly challenging. First, data skew is very common in spatial applications. To achieve the best query performance, data skew needs to be reduced to a minimum. Second, spatial partitioning approaches generate boundary objects that cross multiple partitions and add extra query processing overhead. Therefore, boundary objects need to be minimized. Third, the high computational complexity of spatial partitioning algorithms, combined with massive amounts of data, requires an efficient approach to partitioning in order to achieve overall fast query response. In this paper, we provide a systematic evaluation of multiple spatial partitioning methods with a set of different partitioning strategies, and study their implications on the performance of MapReduce based spatial queries. We also study sampling based partitioning methods and their impact on queries, and propose several MapReduce based high performance spatial partitioning methods. The main objective of our work is to provide comprehensive guidance for optimal spatial data partitioning to support scalable and fast spatial data processing in distributed computing environments such as MapReduce. The algorithms developed in this work are open source and can be easily integrated into different high performance spatial data processing systems.

1. INTRODUCTION

The proliferation of ubiquitous positioning technology, mobile devices, and the rapid improvement of high resolution data acquisition technologies have enabled us to collect massive amounts of spatial data in a way that was never before possible. The volume and velocity of data will only increase as we shift towards the Internet of Things paradigm, in which devices have spatial awareness and produce data while interacting with each other. As science and businesses are becoming increasingly data-driven, timely analysis and management of such data is of utmost importance to data owners. A wide spectrum of applications and scientific disciplines, such as GIS, Location Based Social Networks (LBSN), neuroscience [4], medical imaging [38] and astronomy [27], can benefit from an efficient spatial query system to cope with the challenges of Spatial Big Data.

To effectively store, manage and process such large amounts of spatial data, a scalable distributed data management system is essential. Recently, the MapReduce framework [9] has become the de facto standard for handling large scale data processing tasks, and it has many salient features such as massive scalability, fault-tolerance, easy programmability and low deployment cost. With the success of MapReduce, a number of spatial query systems [5, 23, 30] and frameworks [6, 12] have emerged to enable large scale spatial query processing on MapReduce and cloud platforms.

Data partitioning is a powerful mechanism for improving the efficiency of data management systems, and it is a standard feature in modern database systems. In fact, state-of-the-art systems employ a shared-nothing architecture [36], and both MapReduce and parallel DBMS are examples of such an architecture. Aside from the fact that data partitioning improves the overall manageability of large datasets, it improves query performance in two ways. First, partitioning the data into smaller units enables a query to be processed in parallel, which improves throughput. Second, with a proper partitioning scheme, I/O can be significantly reduced by scanning only the few partitions that contain data relevant to the query. Therefore, a partitioning approach that evenly distributes the data across nodes and facilitates parallel processing is essential for achieving fast query response and optimal system performance.

Spatial data partitioning, however, is particularly challenging due to several pitfalls that are endemic to spatial data and query processing.

Spatial Data Skew. Data skew is very common and severe in spatial applications. For example, in a microscopic pathology imaging scenario, tumorous tissues contain far more spatial objects (segmented cells), whereas cells are more evenly distributed in healthy tissues. In geospatial applications (e.g., OpenStreetMap), some countries and regions have more detailed mapping information due to enthusiastic data contributors. For example, if OpenStreetMap is partitioned into 1000 x 1000 fixed size tiles, the number of objects contained in the most skewed tile is nearly three orders of magnitude greater than the number in an average tile. Needless to say, data skew is detrimental to query performance [35] and curtails system scalability [29]. Therefore, to achieve the best query performance, a spatial partitioning approach should avoid skewed partitioning whenever possible.

Boundary Objects. Spatial partitioning approaches generate boundary objects that cross multiple partitions, thus violating partition independence. As spatial objects have complex boundaries and extents, imposing a rectangular region based partitioning on a sufficiently large dataset would almost certainly produce objects that cross multiple partition boundaries. Spatial query processing algorithms get around the boundary problem by using a replicate-and-filter approach [29, 39], in which boundary objects are replicated to multiple spatial partitions and the side effects of such replication are remedied by filtering the duplicates at the end of the query processing phase. This process adds extra query processing overhead, which increases along with the volume of boundary objects. Therefore, a good spatial partitioning approach should aim to minimize the number of boundary objects.

Performance. Spatial partitioning algorithms are expensive to compute compared to conventional one dimensional table partitioning algorithms, such as hash and range partitioning, that can be done quickly on the fly. The multidimensional nature of spatial data entails that most spatial operators have high computational complexity. This high computational complexity, combined with massive amounts of data, requires an efficient approach to spatial partitioning in order to achieve overall fast query response. This is particularly important for spatial-temporal data, where new spatial data has to be partitioned and processed in a timely fashion.

To the best of our knowledge, no spatial database system provides a graceful approach to spatial partitioning. Previously, Paradise [29], a parallel spatial database system, used regular fixed grid partitioning for parallel join processing. Fixed grid partitioning is the basis of many spatial algorithms and it is easy to compute. However, as mentioned in the original work, the fixed grid approach suffers from both the data skew problem and the boundary object problem.

In the relevant research literature, some of those challenges are given some attention in various contexts. However, in most cases the problem is not fully explored,

problem for parallel query processing.
2. We present six spatial partitioning algorithms in detail, and provide a general classification of those approaches along three dimensions.
3. We systematically study various properties of the presented spatial partitioning algorithms and their effects on query performance, and provide a comprehensive empirical evaluation on two large scale real-world datasets.
4. We propose MapReduce based algorithms for parallel spatial partitioning, and evaluate their performance in detail.

2. BACKGROUND

2.1 Spatial Query Processing with MapReduce

Recently, several MapReduce based spatial query systems [5, 12] have emerged to support scalable spatial query processing on large datasets. While these systems may vary in implementation details and at the query language layer, conceptually they are very similar. Algorithm 1 sketches HadoopGIS, a general MapReduce based spatial query processing framework [6]. As the algorithm shows, data is spatially partitioned and staged to HDFS; spatial queries are expressed as a set of operators that can be translated to MapReduce tasks at runtime. Tasks run on the partitioned input for parallel query processing. Queries are implicitly parallelized through MapReduce, and a tile (as spatial partitioning closely resembles tiling of two dimensional space, we use tile and spatial partition interchangeably hereafter) is the parallelization unit that a Mapper/Reducer can process independently.

Algorithm 1: MapReduce based spatial query processing framework
    A. Data/space partitioning;
    B. Staging of partitioned data to HDFS;
    C. Pre-query processing (optional);
    D. for tile in input do
           Index building for objects in the tile;
           Tile based spatial query processing;
    E. Boundary object handling;
    F. Post-query processing (optional);
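
As a rough illustration of steps A, D and E of Algorithm 1, the following single-process Python sketch partitions objects with a fixed grid, replicates boundary objects to every tile they overlap, runs a query tile by tile, and filters duplicate results. It is a minimal sketch under stated assumptions, not the HadoopGIS implementation: objects are approximated by axis-aligned bounding boxes, the query is an illustrative self-join that reports intersecting pairs, and the names tiles_for, partition, intersects, self_join and skew are hypothetical helpers introduced here.

```python
from collections import defaultdict
from itertools import combinations

# Assumption: each object is an axis-aligned bounding box
# (xmin, ymin, xmax, ymax), keyed by an object id.

def tiles_for(box, tile_size):
    """Return the ids of all fixed-grid tiles that the box overlaps."""
    xmin, ymin, xmax, ymax = box
    cols = range(int(xmin // tile_size), int(xmax // tile_size) + 1)
    rows = range(int(ymin // tile_size), int(ymax // tile_size) + 1)
    return [(c, r) for c in cols for r in rows]

def partition(objects, tile_size):
    """Step A: assign each object to every tile it overlaps.
    Boundary objects are replicated into several tiles."""
    tiles = defaultdict(list)
    for obj_id, box in objects.items():
        for tile in tiles_for(box, tile_size):
            tiles[tile].append(obj_id)
    return tiles

def intersects(a, b):
    """True if two closed bounding boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def self_join(objects, tile_size):
    """Steps D and E: query each tile independently, then drop duplicates."""
    tiles = partition(objects, tile_size)
    results = set()  # a set filters pairs reported by more than one tile
    for members in tiles.values():  # each tile could run as a separate task
        for p, q in combinations(sorted(members), 2):
            if intersects(objects[p], objects[q]):
                results.add((p, q))
    return tiles, results

def skew(tiles):
    """Ratio of the largest tile size to the average tile size."""
    sizes = [len(members) for members in tiles.values()]
    return max(sizes) / (sum(sizes) / len(sizes))

objects = {
    "a": (1, 1, 4, 4),
    "b": (3, 3, 12, 12),   # boundary object: overlaps four tiles
    "c": (11, 11, 14, 14),
    "d": (20, 20, 21, 21),
    "e": (8, 8, 10, 10),   # boundary object: overlaps four tiles
}
tiles, pairs = self_join(objects, tile_size=10)
```

In this sketch the pair of boundary objects "b" and "e" is found in every tile they share, and the result set removes the repeats, which is the replicate-and-filter idea in miniature. The skew helper gives one simple way to compare partitionings; other skew measures would serve equally well.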