GPU Rasterization for Real-Time Spatial Aggregation over Arbitrary Polygons
Eleni Tzirita Zacharatou*‡, Harish Doraiswamy*†, Anastasia Ailamaki‡, Cláudio T. Silva†, Juliana Freire†
‡ École Polytechnique Fédérale de Lausanne    † New York University
{eleni.tziritazacharatou, anastasia.ailamaki}@epfl.ch    {harishd, csilva, [email protected]
* These authors contributed equally to this work.

ABSTRACT

Visual exploration of spatial data relies heavily on spatial aggregation queries that slice and summarize the data over different regions. These queries comprise computationally-intensive point-in-polygon tests that associate data points to polygonal regions, challenging the responsiveness of visualization tools. This challenge is compounded by the sheer amounts of data, requiring a large number of such tests to be performed. Traditional pre-aggregation approaches are unsuitable in this setting since they fix the query constraints and support only rectangular regions. On the other hand, query constraints are defined interactively in visual analytics systems, and polygons can be of arbitrary shapes. In this paper, we convert a spatial aggregation query into a set of drawing operations on a canvas and leverage the rendering pipeline of the graphics hardware (GPU) to enable interactive response times. Our technique trades off accuracy for response time by adjusting the canvas resolution, and can even provide accurate results when combined with a polygon index. We evaluate our technique on two large real-world data sets, exhibiting superior performance compared to index-based approaches.

PVLDB Reference Format:
E. Tzirita Zacharatou, H. Doraiswamy, A. Ailamaki, C. T. Silva, and J. Freire. GPU Rasterization for Real-Time Spatial Aggregation over Arbitrary Polygons. PVLDB, 11(3): 352-365, 2017.
DOI: https://doi.org/10.14778/3157794.3157803

1. INTRODUCTION

The explosion in the number and size of spatio-temporal data sets from urban environments (e.g., [10, 41, 60]) and social sensors (e.g., [43, 62]) creates new challenges for analyzing these data. The complexity and cost of evaluating queries over space and time for large volumes of data often limits analyses to well-defined questions, what Tukey described as confirmatory data analysis [61], typically accomplished through a batch-oriented pipeline. To support exploratory analyses, systems must provide interactive response times, since high latency reduces the rate at which users make observations, draw generalizations, and generate hypotheses [34].

Not surprisingly, the problem of providing efficient support for visualization tools and interactive queries over large data has attracted substantial attention recently, predominantly for relational data [1, 6, 27, 30, 31, 33, 35, 56, 66]. While methods have also been proposed for speeding up selection queries over spatio-temporal data [17, 70], these do not support interactive rates for aggregate queries that slice and summarize the data in different ways, as required by visual analytics systems [4, 20, 44, 51, 58, 67].

Motivating Application: Visual Exploration of Urban Data Sets. In an effort to enable urban planners and architects to make data-driven decisions, we developed Urbane, a visualization framework for the exploration of several urban data sets [20]. The framework allows the user to visualize a data set of interest at different resolutions and also enables the visual comparison of several data sets. Figures 1(a) and 1(b) show the distribution of NYC taxi pickups (data set) in the month of June 2012 using a heat map over two resolutions: neighborhoods and census tracts. To build these heat maps, aggregate queries are issued that count the number of pickups in each neighborhood and census tract. Through its visual interface, Urbane allows the user to change different parameters dynamically, including the time period, the distribution of interest (e.g., count of taxi pickups, average trip distance, etc.), and even the polygonal regions. Figure 1(c) shows multiple data sets being compared using a single visualization: a parallel coordinate chart [28]. In this chart, each data set (or dimension) is represented as a vertical axis, and each region (neighborhood) is mapped to a polyline that traverses all of the axes, crossing each axis at a position proportional to its value for that dimension. Note that each point on an axis corresponds to a different aggregation over the selected time range for each neighborhood, e.g., Taxi reflects the number of pickups, while Price shows the average price of a square foot. This visual representation is effective for analyzing multivariate data, and can provide insights into the relationships between different indicators. For example, by filtering and varying crime rates, users can observe related patterns in property prices and noise levels over the different neighborhoods.

Motivating Application: Interactive Urban Planning. Policy makers frequently rezone different parts of the city, not only adjusting the zonal boundaries, but also changing various laws (e.g., new construction rules, building policies for different building types). During this process, they are interested in viewing how other aspects of the city (represented by urban data sets) vary with the new zoning. This operation typically consists of users changing polygonal boundaries and inspecting the summary aggregation of the data sets until they are satisfied with a particular configuration. In this process, urban planners may also place new resources (e.g., bus stops, police stations), and again inspect the coverage with respect to different urban data sets. The coverage is commonly computed by using a restricted Voronoi diagram [7] to associate each resource with a polygonal region, and then aggregating the urban data over these polygons. To be effective, these summarizations must be executed in real-time as configurations change.

Figure 1: Exploring urban data sets using Urbane: (a) visualizing data distribution per neighborhood, (b) visualizing data distribution per census tract, (c) comparing data over different neighborhoods. The blue line denotes the NYC average for these data.
Problem Statement and Challenges. In this paper, we propose new approaches to speed up the execution of spatial aggregation queries, which, as illustrated in the examples above, are essential to explore and visualize spatio-temporal data. These queries can be translated into the following SQL-like query that computes an aggregate function over the result of a spatial join between two data sets, typically a set of points and a set of polygons.

    SELECT AGG(a_i) FROM P, R
    WHERE P.loc INSIDE R.geometry [AND filterCondition]*
    GROUP BY R.id

Given a set of points of the form P(loc, a_1, a_2, ...), where loc and a_i are the location and attributes of the point, and a set of regions R(id, geometry), this query performs an aggregation (AGG) over the result of the join between P and R. Functions commonly used for AGG include the count of points and the average of a specified attribute a_i. The geometry of a region can be any arbitrary polygon. The query can also have zero or more filterConditions on the attributes. In general, P and R can be either tables (representing data sets) or the results from a sub-query (or nested query).

The heat maps in Figures 1(a) and 1(b) were generated by setting P as the pickup locations of the taxi data; R as either the neighborhood (a) or census tract (b) polygons; AGG as COUNT(*); and filtering on time (June 2012). On the other hand, to obtain the parallel coordinate chart in Figure 1(c), one such aggregate query is issued for each data set (axis) shown in the chart.
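For illustration only, the following sketch evaluates the Figure 1(a) instantiation of this template in the conventional way, as a spatial join followed by aggregation, i.e., the two-stage strategy discussed next. It is a minimal baseline using off-the-shelf tools (a recent GeoPandas), not our technique; the toy geometries and the names pickups, neighborhoods, pickup_time, and id are hypothetical.

    import geopandas as gpd
    import pandas as pd
    from shapely.geometry import Point, Polygon

    # Toy stand-ins for R (neighborhood polygons) and P (taxi pickups).
    neighborhoods = gpd.GeoDataFrame({
        "id": [0, 1],
        "geometry": [Polygon([(0, 0), (2, 0), (2, 2), (0, 2)]),
                     Polygon([(2, 0), (4, 0), (4, 2), (2, 2)])],
    })
    pickups = gpd.GeoDataFrame({
        "pickup_time": pd.to_datetime(["2012-06-03", "2012-06-15", "2012-07-04"]),
        "geometry": [Point(1.0, 1.0), Point(3.0, 1.0), Point(1.0, 1.5)],
    })

    # filterCondition: keep only June 2012 pickups.
    june = pickups[(pickups.pickup_time >= "2012-06-01") &
                   (pickups.pickup_time < "2012-07-01")]

    # Stage 1: spatial join (P.loc INSIDE R.geometry), which performs a
    # point-in-polygon test for every candidate point-polygon pair.
    joined = gpd.sjoin(june, neighborhoods, predicate="within")

    # Stage 2: aggregate over the materialized join result
    # (AGG = COUNT(*), GROUP BY R.id).
    print(joined.groupby("id").size())  # one June pickup per neighborhood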
Evaluating such a query typically follows a two-stage strategy: the spatial join between the points and the polygons is computed first, which requires a point-in-polygon test for each candidate pair, a computationally expensive task. This two-stage evaluation strategy also introduces the overhead of materializing the results of the first stage. Finally, the aggregates are computed over the materialized join results and incur additional query processing costs. Data cube-based structures (e.g., [33]) can be used to maintain aggregate values. However, creating such structures requires costly pre-processing, while the memory overhead can be prohibitively high. More importantly, these techniques do not support queries over arbitrary polygonal regions, and thus are unsuitable for our purposes.

Last but not least, while powerful servers might be accessible to some, many users have no alternative other than commodity hardware (e.g., business-grade laptops and desktops). Approaches that efficiently evaluate the above queries on commodity systems can help democratize large-scale visual analytics and make these techniques available to a wider community.

For visual analytics systems, approximate answers to queries are often sufficient as long as they do not alter the resulting visualizations. Moreover, the exploration is typically performed using the "level-of-detail" (LOD) paradigm: first look at the overview, and then zoom into the regions of interest for more details [53]. Thus, these systems can greatly benefit from an approach that trades off accuracy for response times, and enables LOD exploration that improves accuracy when focusing on details.

Our Approach. By leveraging the massive parallelism provided by current-generation graphics hardware (Graphics Processing Units, or GPUs), we aim to support interactive response times for spatial aggregation over large data.
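As a minimal illustration of this idea (on the CPU, without the GPU rendering pipeline that our technique actually exploits), the sketch below rasterizes the polygons onto a canvas of region ids and then answers a COUNT(*) aggregation by binning points into canvas cells, with no point-in-polygon tests at query time. The canvas resolution res controls the accuracy/response-time trade-off; the function names and toy data are hypothetical.

    import numpy as np
    from matplotlib.path import Path

    def rasterize_regions(polygons, bounds, res):
        """'Draw' each polygon onto a res x res canvas of region ids.
        A cell stores the id of the polygon containing its center
        (-1 if none); a higher res gives a more accurate rasterization."""
        xmin, ymin, xmax, ymax = bounds
        xs = xmin + (np.arange(res) + 0.5) * (xmax - xmin) / res
        ys = ymin + (np.arange(res) + 0.5) * (ymax - ymin) / res
        gx, gy = np.meshgrid(xs, ys)
        centers = np.column_stack([gx.ravel(), gy.ravel()])
        canvas = np.full(res * res, -1, dtype=np.int32)
        for rid, poly in enumerate(polygons):
            canvas[Path(poly).contains_points(centers)] = rid
        return canvas.reshape(res, res)

    def count_per_region(points, canvas, bounds):
        """Approximate COUNT(*) GROUP BY region id: bin each point into
        its canvas cell and read off the region id painted there."""
        res = canvas.shape[0]
        xmin, ymin, xmax, ymax = bounds
        ix = np.clip(((points[:, 0] - xmin) / (xmax - xmin) * res).astype(int), 0, res - 1)
        iy = np.clip(((points[:, 1] - ymin) / (ymax - ymin) * res).astype(int), 0, res - 1)
        ids = canvas[iy, ix]
        return np.bincount(ids[ids >= 0])

    # Toy usage: two square 'neighborhoods' and three pickup points.
    bounds = (0.0, 0.0, 4.0, 2.0)
    polys = [[(0, 0), (2, 0), (2, 2), (0, 2)],
             [(2, 0), (4, 0), (4, 2), (2, 2)]]
    pts = np.array([[1.0, 1.0], [3.0, 1.0], [3.5, 0.5]])
    canvas = rasterize_regions(polys, bounds, res=64)
    print(count_per_region(pts, canvas, bounds))  # -> [1 2]

Because each polygon is drawn once and each point costs only a single cell lookup, the per-query work is independent of polygon complexity; errors arise only from cells straddling polygon boundaries and shrink as res grows, which is precisely the accuracy-for-speed knob described in the abstract.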