Sphinx: Distributed Execution of Interactive SQL Queries on Big Spatial Data

Ahmed Eldawy∗   Mostafa Elganainy†   Ammar Bakeer†   Ahmed Abdelmotaleb†   Mohamed Mokbel∗†
∗University of Minnesota   †GIS Technology Innovation Center
{eldawy, mokbel}@cs.umn.edu   {melganainy, abakeer, aothman}@gistic.org

This work is supported in part by the GIS Technology Innovation Center under project GISTIC-13-05.

SIGSPATIAL'15, November 03-06, 2015, Bellevue, WA, USA. © 2015 ACM. ISBN 978-1-4503-3967-4/15/11. DOI: http://dx.doi.org/10.1145/2820783.2820869

ABSTRACT

This paper presents Sphinx, a full-fledged distributed system which uses a standard SQL interface to process big spatial data. Sphinx adds spatial data types, indexes, and query processing inside the code-base of Cloudera Impala for efficient processing of spatial data. In particular, Sphinx is composed of four main components, namely, the query parser, indexer, query planner, and query executor. The query parser injects spatial data types and functions into the SQL interface of Sphinx. The indexer creates spatial indexes in Sphinx by adopting a two-layered index design. The query planner utilizes these indexes to construct efficient query plans for range query and spatial join operations. Finally, the query executor carries out these plans on big spatial datasets in a distributed cluster. A system prototype of Sphinx running on real datasets shows up to three orders of magnitude performance improvement over traditional Impala.

Categories and Subject Descriptors

H.2.8 [Database Applications]: Spatial Databases and GIS

Keywords

Spatial, Impala, SQL, Range Query, Spatial Join, Sphinx

1. INTRODUCTION

The recent explosion in the amounts of spatial data generated by many applications, such as satellite images, GPS tracks, medical images, and geotagged tweets, urged researchers and developers to extend big data systems to efficiently support spatial data. This includes Hadoop-GIS [1], SpatialHadoop [4], and the ESRI tools for Hadoop [12], among others. Unfortunately, all these systems suffer from two limitations: (1) despite providing SQL-like languages, such as HiveQL, they lack an ANSI-standard SQL interface, which is much preferred by existing DBMS users, and (2) they inherit the limitations of the underlying systems, such as significant startup time and materializing intermediate data to disk, which impede these systems from reaching the full potential of the underlying hardware.

In this paper, we introduce Sphinx, a full-fledged system for distributed execution of interactive SQL queries on big spatial data. We have chosen to build Sphinx inside Impala [7] rather than any other open-source distributed big data system (e.g., Hadoop or Spark) as Impala has several advantages: (1) it adopts the ANSI-standard SQL interface, (2) it employs query optimization, (3) it uses C++ runtime code generation, and (4) it performs low-level direct disk access. With these features, Impala achieves an order of magnitude speedup [5, 7, 11] on standard TPC-H and TPC-DS queries compared to other popular SQL-on-Hadoop systems such as Hive [10] and Spark-SQL.

SELECT COUNT(*) FROM OSM_Points
WHERE x ≥ x1 AND x < x2 AND y ≥ y1 AND y < y2;
(a) Range query in Impala

SELECT COUNT(*) FROM OSM_Points
WHERE Contains(Rectangle(x1, y1, x2, y2), OSM_Points.coords);
(b) Range query in Sphinx

Figure 1: Range query in Impala vs. Sphinx

Figure 1 shows a range query example that gives the essence of Sphinx. Figure 1(a) shows a range query expressed in Impala using primitive data types and operations; it takes 21 minutes on a table of 2.7 billion points with a cluster of 10 cores. As Impala does not understand the properties of this spatial query, it has to perform a full-table scan with very little room for optimization. In Sphinx, the same query is expressed as shown in Figure 1(b).
In addition to the more expressive language, this query runs in one second on Sphinx, giving three orders of magnitude speedup over plain-vanilla Impala. The main reason behind this performance boost is the spatial indexes that we add in Sphinx and the spatial query processing that is injected into the core of the query planner and executor.

Figure 2 gives an overview of Sphinx, which consists of four components, all implemented inside the core of Impala. (1) The query parser (Section 2) enriches the SQL interface with spatial data types (e.g., Point and Polygon), spatial functions (e.g., Overlap and Touch), and new commands to construct and import spatial indexes. (2) The indexer (Section 3) constructs spatial indexes based on a grid, R-tree, or Quad-tree, organized in two layers as one global index and multiple local indexes. (3) The query planner (Section 4) utilizes the spatial indexes to introduce new efficient query plans for the range query and spatial join operations. (4) The query executor (Section 5) introduces the R-tree scanner and spatial join operators, which use C++ runtime code generation to efficiently execute range and spatial join queries, respectively. We conduct an experimental evaluation on the proposed system prototype using a publicly available real dataset of 2.7 billion points extracted from OpenStreetMap. We show that Sphinx outperforms traditional Impala by up to three orders of magnitude for both range and spatial join queries. In addition, we show that Sphinx scales well with both the input size and the cluster size.

Figure 2: Overview of Sphinx

Figure 3: Indexing plan in Sphinx

2. QUERY PARSER

To make it user-friendly and easy to use, Sphinx extends the query parser of Impala to introduce spatial data types, spatial functions, and new commands to construct and import spatial indexes.

Spatial Data Types. Sphinx adds the Geometry datatype as an abstraction for all standard spatial datatypes, such as Point, Linestring, and Polygon, as defined by the Open Geospatial Consortium (OGC). We adopt the standard Well-Known Text (WKT) format to be able to import text files from other systems such as PostGIS and Oracle Spatial.
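To make the WKT representation concrete, the following is a minimal C++ sketch of parsing a WKT point into a geometry value. The names (Point, parsePointWkt) are illustrative assumptions for this example and do not correspond to Sphinx's actual internals.

#include <cstdio>
#include <optional>
#include <string>

// Minimal stand-in for the OGC Point type behind the Geometry abstraction.
struct Point {
  double x;
  double y;
};

// Parse a Well-Known Text (WKT) point such as "POINT (31.2 30.1)".
// Returns std::nullopt if the string does not have the expected form.
std::optional<Point> parsePointWkt(const std::string& wkt) {
  Point p;
  if (std::sscanf(wkt.c_str(), " POINT ( %lf %lf )", &p.x, &p.y) == 2) {
    return p;
  }
  return std::nullopt;
}

int main() {
  // WKT keeps text files interchangeable with systems such as PostGIS.
  if (auto p = parsePointWkt("POINT (31.2 30.1)")) {
    std::printf("x=%f y=%f\n", p->x, p->y);
  }
  return 0;
}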
Spatial Functions. Sphinx adds OGC-compliant spatial functions which are implemented as either user-defined functions (UDFs) or user-defined aggregate functions (UDAFs). It is imperative to mention that all these functions work only in Sphinx, as the input and/or output of each function is of the Geometry datatype, which is supported only in Sphinx. These functions include basic functions, e.g., MakePoint; spatial predicates, e.g., Touch; spatial analysis functions, e.g., Union; and spatial aggregate functions, e.g., Envelope, which computes the minimum bounding rectangle (MBR) of a set of objects.

Spatial Indexing. Sphinx also adds new commands to construct spatial indexes and to import existing indexes from SpatialHadoop.

/* Create R-tree index on the coords attribute */
CREATE INDEX PointsIndex
ON Points USING RTREE (coords);

/* Import an index built on the coords attribute */
CREATE EXTERNAL TABLE OSM_Points
(... /* Schema definition */)
INDEXED ON coords AS RTREE
LOCATION('/osm_points');

3. SPATIAL INDEXER

In this section, we describe how Sphinx constructs spatial indexes on HDFS-resident tables. The main goal is to store the records in a spatial-aware way by grouping nearby records and storing them physically together in the same HDFS block. While traditional Impala already provides a partitioned table feature, where a table is hierarchically partitioned based on a sequence of columns, Sphinx employs spatial indexes which overcome the following three limitations of partitioned tables. (1) While Impala assigns one value per partition, e.g., a department ID, Sphinx assigns a region, i.e., a rectangle, to each partition, which is more suitable for spatial data. (2) While Impala assigns each record to exactly one partition, Sphinx can replicate a record to multiple partitions, which can be used to index polygons that span multiple partitions. (3) Impala puts the burden of choosing partition boundaries on the user, which is not suitable for skewed spatial data; Sphinx provides a one-statement index command which takes care of defining partition boundaries based on the data distribution. In the rest of this section, we first describe how the index is stored in Sphinx, and then we explain how Sphinx builds this spatial index efficiently.

3.1 Index Layout

Sphinx employs a two-layered design for spatial indexes in HDFS [1, 4, 12], where the global index is stored in the master node and defines how records are partitioned across machines, while local indexes are stored in the slave nodes and define how records are internally organized inside each node. This design fits well with the architecture of Impala and Sphinx, where the global index is stored in the catalog server on the master, and the local indexes are stored in HDFS data nodes and processed by the impalad processes running on the slaves. This also allows Sphinx to easily import an index which was built in SpatialHadoop by simply importing the global index into the catalog server.

3.2 Index Construction in Sphinx

Sphinx provides an efficient algorithm for constructing a spatial index on a user-selected attribute in a table. We focus on the construction of the R-tree and R+-tree as examples of non-replicated and replicated indexes, respectively. Figure 3 illustrates the index construction plan in Sphinx. When the user issues a CREATE INDEX statement, Sphinx creates this query plan, which is executed in parallel using its query execution engine. The indexing algorithm consists of four phases, namely, sampling, subdivision, partitioning, and local indexing, described below.

The Sampling Phase. The job of this phase is to summarize the data so that it fits on a single machine while preserving its distribution, to some level. This summary will be used in the next step to decide how to partition the space across machines while balancing the load. To summarize the data, this phase scans the input, in parallel, and reads a sample of 1% of the records. Each record is converted to a point by taking the centroid of the index key. Finally, all sample points are grouped on one machine for the next phase.
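The per-node logic of this phase can be sketched in C++ as follows. This is a simplified illustration assuming rectangular index keys and a Bernoulli sample; the names (Rect, samplePhase) are ours, not Sphinx's.

#include <random>
#include <vector>

struct Point { double x, y; };
struct Rect { double x1, y1, x2, y2; };

// Draw a ~1% Bernoulli sample of the index keys and reduce each sampled
// key to a single point by taking the centroid of its rectangle. Each
// node runs this over its local records; the resulting points are then
// gathered on one machine for the subdivision phase.
std::vector<Point> samplePhase(const std::vector<Rect>& keys,
                               double rate = 0.01) {
  std::mt19937 gen(std::random_device{}());
  std::bernoulli_distribution keep(rate);
  std::vector<Point> sample;
  for (const Rect& k : keys) {
    if (keep(gen)) {
      sample.push_back({(k.x1 + k.x2) / 2, (k.y1 + k.y2) / 2});
    }
  }
  return sample;
}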

Figure 4: Spatial Join Query Plans. (a) Logical plan; (b) overlap join physical plan; (c) partition join physical plan; (d) co-partition join physical plan.

The Subdivision Phase. This phase runs on a single machine, and it subdivides the space into n cells, which will be used to partition the input records as one partition per cell. The objective is to balance the load across partitions while fitting each partition in a single HDFS block, typically 128MB. Sphinx sets the number of cells n to the number of HDFS blocks in the indexed dataset, which ensures that the average partition size is equal to the HDFS block capacity. Then, it subdivides the space into n cells, each containing roughly the same number of sample points, by bulk loading the sample into an R-tree using the sort-tile-recursive (STR) algorithm [8]. The leaf nodes of the STR-tree are used as cell boundaries, which are broadcast to all nodes to be used in the next phase.
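The following C++ sketch shows a simplified version of this subdivision step in the spirit of STR [8], assuming the sample fits in memory. The names (subdivide, Rect) are illustrative, and the real planner derives n from the number of HDFS blocks rather than taking it as a parameter.

#include <algorithm>
#include <cmath>
#include <vector>

struct Point { double x, y; };
struct Rect { double x1, y1, x2, y2; };

// Subdivide space into roughly n cells holding similar numbers of sample
// points: sort the sample by x into about sqrt(n) vertical slices, then
// sort each slice by y and cut it into cells (the tile steps of STR).
// The MBR of each cell's points is returned as the cell boundary.
std::vector<Rect> subdivide(std::vector<Point> sample, size_t n) {
  std::vector<Rect> cells;
  size_t slices = std::max<size_t>(1, (size_t)std::lround(std::sqrt((double)n)));
  size_t perSlice = (sample.size() + slices - 1) / slices;
  size_t cellsPerSlice = (n + slices - 1) / slices;
  std::sort(sample.begin(), sample.end(),
            [](const Point& a, const Point& b) { return a.x < b.x; });
  for (size_t s = 0; s < sample.size(); s += perSlice) {
    size_t sliceEnd = std::min(sample.size(), s + perSlice);
    std::sort(sample.begin() + s, sample.begin() + sliceEnd,
              [](const Point& a, const Point& b) { return a.y < b.y; });
    size_t perCell =
        std::max<size_t>(1, (sliceEnd - s + cellsPerSlice - 1) / cellsPerSlice);
    for (size_t c = s; c < sliceEnd; c += perCell) {
      size_t cellEnd = std::min(sliceEnd, c + perCell);
      Rect r{sample[c].x, sample[c].y, sample[c].x, sample[c].y};
      for (size_t i = c; i < cellEnd; ++i) {  // MBR of this cell's points
        r.x1 = std::min(r.x1, sample[i].x); r.x2 = std::max(r.x2, sample[i].x);
        r.y1 = std::min(r.y1, sample[i].y); r.y2 = std::max(r.y2, sample[i].y);
      }
      cells.push_back(r);  // one leaf boundary, broadcast to all nodes
    }
  }
  return cells;
}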
The Partitioning Phase. This phase scans the input table and assigns each record to its overlapping cell(s). The main challenge is to handle boundary objects which overlap more than one cell. Sphinx employs either a distribution or a replication strategy, for the R-tree and the R+-tree, respectively. The distribution strategy assigns a record to exactly one cell and expands the cell boundaries to fully contain the record. The replication strategy replicates a record to all overlapping cells, and the query executor has to handle this replication to ensure a correct answer, as described in Section 5. This step is implemented as a broadcast join where the smaller table (the cell boundaries) is replicated to all machines and the join predicate applies either the distribution or the replication strategy. (A sketch of the two strategies is given after this subsection.)

The Local Indexing Phase. In the final phase, records are shuffled across machines and grouped by the CellID column. The contents of each cell are processed separately, where they are bulk loaded into an in-memory local index, e.g., an R-tree, and the index is written to HDFS as one file. Since the size of each partition is expected to be within the HDFS block capacity, this step can handle arbitrarily large files. If one partition goes beyond the HDFS block capacity, it is split into chunks, each of which fits in one HDFS block.
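The two record-assignment strategies referenced above can be sketched as follows. This is a simplified illustration in which the distribution strategy picks the first overlapping cell, whereas a real implementation would pick the best one (e.g., the cell requiring the least expansion); all names are ours.

#include <algorithm>
#include <vector>

struct Rect { double x1, y1, x2, y2; };

bool overlaps(const Rect& a, const Rect& b) {
  return a.x1 <= b.x2 && b.x1 <= a.x2 && a.y1 <= b.y2 && b.y1 <= a.y2;
}

// Replication strategy (R+-tree): a record goes to every overlapping cell.
// The query executor later eliminates the duplicates this may produce.
std::vector<int> replicate(const Rect& record, const std::vector<Rect>& cells) {
  std::vector<int> targets;
  for (int i = 0; i < (int)cells.size(); ++i)
    if (overlaps(record, cells[i])) targets.push_back(i);
  return targets;
}

// Distribution strategy (R-tree): a record goes to exactly one cell, whose
// boundaries are expanded to fully contain it.
int distribute(const Rect& record, std::vector<Rect>& cells) {
  for (int i = 0; i < (int)cells.size(); ++i) {
    if (overlaps(record, cells[i])) {
      Rect& c = cells[i];
      c.x1 = std::min(c.x1, record.x1); c.y1 = std::min(c.y1, record.y1);
      c.x2 = std::max(c.x2, record.x2); c.y2 = std::max(c.y2, record.y2);
      return i;
    }
  }
  return -1;  // record overlaps no cell (should not happen if cells cover space)
}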

4. QUERY PLANNER

The query planner in Sphinx is responsible for generating a query plan for a user query, which is later executed by the query executor. In general, the query planner first generates a single-machine logical plan, which is never executed. Then, it translates it into a distributed physical plan which is executed in parallel. Sphinx introduces new query plans for both the range query and spatial join operations, as described below.

4.1 Range Query Plans

In a range query, the input is a query range A and a spatial attribute x in a table R, while the output is all records r ∈ R where r.x overlaps A. Traditional Impala supports only one plan for range queries, which employs a full table scan and compares each record to the query area. If the input table R is indexed on the search column x, Sphinx utilizes the spatial index to build a more efficient R-tree search plan. This plan improves over the full scan plan by two new features. (1) The early pruning feature utilizes the global index to prune partitions that are outside the query area. (2) It uses R-tree scanners which utilize the local indexes in the selected partitions to quickly select matching records. The details of the R-tree scanner are described in Section 5.
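The early-pruning step amounts to a simple filter over the in-memory global index, as in the following sketch; the names (PartitionEntry, prunePartitions) are illustrative and do not correspond to Impala or Sphinx internals.

#include <vector>

struct Rect { double x1, y1, x2, y2; };

bool overlaps(const Rect& a, const Rect& b) {
  return a.x1 <= b.x2 && b.x1 <= a.x2 && a.y1 <= b.y2 && b.y1 <= a.y2;
}

struct PartitionEntry {
  int id;    // partition identifier (maps to HDFS blocks)
  Rect mbr;  // partition MBR kept in the catalog's global index
};

// Early pruning: keep only the partitions whose MBRs overlap the query
// range A. Only these partitions are handed to R-tree scanners; all
// other partitions are never read from HDFS.
std::vector<int> prunePartitions(const std::vector<PartitionEntry>& globalIndex,
                                 const Rect& A) {
  std::vector<int> selected;
  for (const PartitionEntry& p : globalIndex)
    if (overlaps(p.mbr, A)) selected.push_back(p.id);
  return selected;
}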

4.2 Spatial Join Plans

In a spatial join, the input is composed of two tables, R and S, with designated geometric columns, R.x and S.y, and a spatial predicate θ, such as touches or overlaps. The output is a table T that contains all records t = ⟨r, s⟩ where the predicate θ is true for ⟨r.x, s.y⟩, r ∈ R, and s ∈ S. Figure 4(a) shows the logical query plan of the spatial join. Traditional Impala translates this plan into one physical plan that uses the naive spatial join algorithm, which computes the cross join R × S followed by a spatial filter on the predicate θ. Sphinx improves on this approach by introducing three alternative physical plans based on whether both tables are indexed, only one table is indexed, or neither table is indexed, as described below.

(1) Overlap Join: This plan, shown in Figure 4(b), is used if the two input tables are indexed on the join columns. The basic idea is to find the pairs of overlapping partitions and perform a single-machine spatial join between every pair. To realize this plan, Sphinx introduces the novel spatial multicast connection pattern, which creates a communication stream between every pair of machines that are assigned overlapping partitions. To join a pair of partitions, Sphinx uses the spatial join operator described in Section 5.

(2) Partition Join: This plan, shown in Figure 4(c), is employed if only one input table is indexed. In this case, the non-indexed table, say R, is partitioned to match the other (indexed) table. Once the table R is partitioned, there is a one-to-one correspondence between the partitions of R and S, where each pair of corresponding partitions is joined using the spatial join operator.

(3) Co-partition Join: If neither input is indexed, Sphinx employs the co-partition join, which is a port of the traditional partition-based spatial-merge (PBSM) join algorithm [9]. In this plan, shown in Figure 4(d), both inputs are partitioned using a common uniform grid, and the contents of each grid cell from the two inputs are spatially joined.

5. QUERY EXECUTOR

The query executor is the component that executes the physical query plans, created by the query planner, in the distributed environment. Sphinx introduces two new components in the query executor, namely, the R-tree scanner for range queries and the spatial join operator. These components are completely written in C++ and make use of Impala's runtime code generation [11], which gives higher performance compared to other big data systems written in Java.

5.1 R-tree Scanner

The R-tree scanner takes as input one locally-indexed partition P and a rectangular query range A, and returns all records in P that overlap A. The R-tree scanner starts by computing the estimated selectivity σ using the equation σ = Area(A ∩ P) / Area(P), where A and P denote the MBRs of the query range and the processed partition, respectively. The MBR of P is available as part of the global index, which is stored in the main memory of the master node. Based on the selectivity, the R-tree scanner has three modes of operation.

(1) Match All (σ = 1.0): If P is completely contained in A, all records are added to the answer without testing them against the query range A. (2) Full Scan (δ < σ < 1.0): If the selectivity is larger than a threshold δ, the R-tree is known to impose a significant overhead on the search query. Thus, the R-tree scanner skips the index and compares each record against the query range A. (3) R-tree Search (σ ≤ δ): If the selectivity is at most δ, the R-tree index is utilized to quickly retrieve the matching records.

If the index is replicated, e.g., with an R+-tree, a final duplicate avoidance step is carried out to remove duplicate answers.
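The mode selection follows directly from the selectivity formula above, as in this minimal sketch assuming axis-aligned MBRs; the names (ScanMode, chooseMode) are ours, not Sphinx's.

#include <algorithm>

struct Rect { double x1, y1, x2, y2; };

double area(const Rect& r) { return (r.x2 - r.x1) * (r.y2 - r.y1); }

// Area of the intersection of two rectangles; zero when disjoint.
double intersectionArea(const Rect& a, const Rect& b) {
  double w = std::min(a.x2, b.x2) - std::max(a.x1, b.x1);
  double h = std::min(a.y2, b.y2) - std::max(a.y1, b.y1);
  return (w > 0 && h > 0) ? w * h : 0.0;
}

enum class ScanMode { MatchAll, FullScan, RTreeSearch };

// Pick the scanner mode from sigma = Area(A ∩ P) / Area(P), where A is
// the query range and P is the partition MBR from the global index.
// delta is the threshold above which the R-tree costs more than it saves.
ScanMode chooseMode(const Rect& A, const Rect& P, double delta) {
  double sigma = intersectionArea(A, P) / area(P);
  if (sigma >= 1.0) return ScanMode::MatchAll;   // P lies entirely inside A
  if (sigma > delta) return ScanMode::FullScan;  // delta < sigma < 1.0
  return ScanMode::RTreeSearch;                  // sigma <= delta
}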
5.2 Spatial Join Operator

The spatial join operator joins two partitions P1 and P2, retrieved from the two inputs, and returns every pair of overlapping records in the two partitions. This operator also has three modes of execution based on the local indexes in the two joined partitions.

(1) R-tree Join: If both partitions are locally indexed with an R-tree, this execution mode uses the synchronized traversal algorithm [2, 6] to concurrently traverse both trees while pruning disjoint tree nodes. (2) Bulk Index Join: If only one partition is locally indexed, this execution mode uses the bulk index join algorithm [3, 6], which partitions the non-indexed partition according to the R-tree of the indexed partition, and then joins each pair of corresponding partitions. (3) Plane-sweep Join: If neither partition is indexed, the spatial join operator performs a plane-sweep join algorithm [6], which works efficiently with non-indexed data.

Similar to the R-tree scanner, the spatial join operator applies a duplicate avoidance step if the input partitions are indexed using a replicated index.

6. EXPERIMENTS

Sphinx is implemented inside Impala 1.2.1 and deployed on an Amazon EC2 cluster of 20 single-core nodes. Figure 5(a) shows a linear scale-up of the index construction time as the input table increases from 25GB to 100GB. Figure 5(b) shows a three orders of magnitude speedup of Sphinx over Impala when running a range query. Even with high selectivity ratios, Sphinx gives almost the same performance as Impala, which indicates a low index overhead.

Figure 5(c) compares the performance of the spatial join query in Impala and Sphinx, where the two input tables are of the same size. In this experiment, we only use the overlap join algorithm in Sphinx, which is executed when the two inputs are indexed. As shown in the figure, the naive algorithm in Impala does not scale at all, as it quickly fails even for small input sizes. Figure 5(d) shows the performance of the spatial join in Sphinx with much larger files while increasing the cluster size from 5 to 20 nodes. This experiment shows the performance and scalability of the spatial join operation in Sphinx.

Figure 5: Experimental results of Impala and Sphinx. (a) Index construction time; (b) range query; (c) spatial join as input size varies; (d) spatial join as cluster size varies.

7. CONCLUSION

In this paper, we introduced Sphinx, the first and only system that extends the core of Impala to provide real-time processing of SQL queries on spatial data. Sphinx modifies the query parser by injecting standard spatial data types and spatial functions, as well as a new command to construct spatial indexes. It adopts a two-layered approach to build a spatial index which consists of one global index, stored in the catalog, which partitions records across machines, and multiple local indexes, stored in HDFS blocks, which organize the records contained in each partition. We also extend the query planner by building efficient query plans for range and spatial join queries; these plans utilize the global index to prune unwanted partitions from the input. We further implemented two new components in the query executor, namely, the R-tree scanner and the spatial join operator, which use runtime code generation to produce optimized machine code that runs natively on the system. Finally, we provided an experimental study on large-scale real data that shows the efficiency and scalability of Sphinx compared to Impala.

8. REFERENCES

[1] A. Aji et al. Hadoop-GIS: A High Performance Spatial Data Warehousing System over MapReduce. In VLDB, 2013.
[2] T. Brinkhoff et al. Efficient Processing of Spatial Joins Using R-Trees. In SIGMOD, pages 237–246, 1993.
[3] J. V. den Bercken et al. The Bulk Index Join: A Generic Approach to Processing Non-Equijoins. In ICDE, page 257, 1999.
[4] A. Eldawy and M. F. Mokbel. SpatialHadoop: A MapReduce Framework for Spatial Data. In ICDE, 2015.
[5] A. Floratou et al. SQL-on-Hadoop: Full Circle Back to Shared-Nothing Database Architectures. PVLDB, 7(12), 2014.
[6] E. H. Jacox and H. Samet. Spatial Join Techniques. TODS, 32(1):7, 2007.
[7] M. Kornacker et al. Impala: A Modern, Open-Source SQL Engine for Hadoop. In CIDR, 2015.
[8] S. Leutenegger, M. Lopez, and J. Edgington. STR: A Simple and Efficient Algorithm for R-Tree Packing. In ICDE, 1997.
[9] J. Patel and D. DeWitt. Partition Based Spatial-Merge Join. In SIGMOD, 1996.
[10] A. Thusoo et al. Hive: A Warehousing Solution over a Map-Reduce Framework. PVLDB, 2009.
[11] S. Wanderman-Milne et al. Runtime Code Generation in Cloudera Impala. IEEE Data Engineering Bulletin, 37(1):31–37, 2014.
[12] R. T. Whitman, M. B. Park, S. A. Ambrose, and E. G. Hoel. Spatial Indexing and Analytics on Hadoop. In SIGSPATIAL, 2014.