A Fast Location Service for Partial Spatial Replicas

Yun Tian and Philip J. Rhodes
Department of Computer and Information Science
University of Mississippi
University, MS, USA 38677
Email: [email protected], [email protected]

Abstract—This paper describes a design and implementation of a distributed high-performance partial spatial replica location service. Our replica location service identifies the set of partial replicas that intersect with a region of interest, an important component of partial spatial replica selection. We find that using an R-tree data structure is superior to relying on a relational database alone when handling spatial queries. We have also added a collection of optimizations that together improve performance. In particular, database Query Aggregation and using a Morton curve during R-tree construction produce significant performance gains. Experimental results show that the proposed partial spatial replica location service scales well for multi-client and distributed large spatial queries, queries that return more than 10,000 replicas. Individual servers with one million pieces of replica metadata in the backend database can support up to 100 clients concurrently when handling large spatial queries. Our previous work solved the same problem using an unmodified Globus Toolkit, but the work described here modifies and extends existing Globus Toolkit code to handle spatial metadata operations.

Keywords-Globus RLS; R-tree; replica; spatial;

I. INTRODUCTION

Replication and Replica Selection are widely used in distributed systems to help distribute both data and the computation performed upon it. The Globus Toolkit includes a Replica Location Service (RLS) that provides users with a flexible mechanism to discover replicas that are convenient for a particular computation [1], [2], [3], [4], [5]. By providing multiple copies of the same dataset, replicas can increase both I/O bandwidth and options for scheduling computation. Recently, we have been working on the problem of partial replica selection for spatial data.

A spatial dataset associates data values with locations in an n-dimensional domain and is commonly used in various fields of engineering and the sciences. For example, a climatologist might use spatial data produced by simulation to predict temperature changes for the Gulf of Mexico over a long time period. An agronomist might use spatial data to represent soil humidity for a large tract of land. The size of spatial datasets is growing rapidly due to advances in measuring instrument technology and more accessible and less expensive computational power.

The Geographical Information Systems (GIS) community has been actively investigating distributed data access for some time [6], [7], [8]. GIS data generally refers to the surface of the earth, and is usually two dimensional. Information such as road locations, municipal boundaries, bodies of water, etc., can be represented using a sequence of points denoting the feature. Although this type of data is certainly spatial, our own research is focused more on volumetric data. For example, Computational Fluid Dynamics (CFD) simulations commonly represent three or four dimensional volumes using rectilinear grids of data points that span the volume. Conceptually, such datasets (and subsets extracted from them) often have the shape of a rectangular prism, while GIS data may have an entirely irregular shape.

The work described here was performed within the context of the Granite Scientific Database System, which provides efficient access to spatial datasets stored on local or remote disks [9], [10]. Granite allows its users to specify spatial subsets of larger n-dimensional volumes for retrieval. It takes advantage of UDT [11], a UDP based reliable data transfer protocol that is well suited to the transfer of large data volumes over high bandwidth-delay product connections. The combination of UDT and Granite provides fast access to subsets of datasets stored remotely. If a computation needs only a comparatively small subset of a larger volume, accessing that subset remotely may be much faster than moving the entire volume. The flexibility provided by this additional option is especially welcome in heterogeneous environments, where the hardware that best suits a computation (e.g. a GPU) might be located far from the dataset.

The Magnolia component of the Granite system is intended to integrate Granite's unique spatial capabilities with existing Grid software. Replica selection for partial spatial replicas is an important part of Magnolia, and requires both a way to discover the partial replicas that intersect with a spatial query and a model of access time that allows Magnolia to choose between them. The Granite system already incorporates the concept of a storage model, which can be used to infer disk access costs. This paper concentrates on the problem of efficiently determining the set of partial replicas that intersect with the spatial query bounds given by the user.

In a previous paper [12], we described the Globus Toolkit R-tree (GTR-tree), an implementation of an important spatial data structure on top of an existing grid infrastructure. In that work, an R-tree [13] is constructed in the Globus Toolkit RLS backend relational database, and metadata describing spatial replicas is managed via an unmodified Globus Toolkit.

The GTR-tree is one possible implementation of a Spatial Replica Location Service (SRLS), which represents the spatial information of replicas in a grid or other distributed system, associating the spatial extent of replicas with their physical addresses. When a query region is submitted to an SRLS, it must return the set of replicas that intersect with that region. A data grid may contain many SRLS instances, each with a different set of replica metadata, and each able to service multiple clients simultaneously. This allows not only the metadata, but also the intersection computation, to be distributed across a grid.

Using an R-tree for an SRLS implementation is very effective since we can selectively traverse the tree, greatly reducing the number of required intersection tests. Implementing the R-Tree data structure on top of an unchanged Globus implementation eases deployment and avoids compatibility issues, but at some cost to performance. In this paper, we describe the Mortonized Aggregated Query (MAQ) R-Tree, a new SRLS implementation that is now directly integrated into the Globus Toolkit. Although we made minor changes to the Globus source code, the MAQR-tree implementation is independent of Globus and could easily be ported to other grid systems. We have also added a collection of optimizations that together improve performance over previous work by a substantial margin. Namely, we have improved upon the GTR-tree by changing the table design in the backend database, and by aggregating several queries into one larger query, which reduces overhead. We now also use the Morton Space-filling Curve during R-tree construction, which improves spatial locality. Experiments evaluating multi-client and distributed spatial queries demonstrate good scalability, and a very substantial improvement over previous results.

The rest of the paper is organized as follows. Section II presents related work. In section III, we further describe the problem that we are addressing. Section IV analyzes our previous GTR-tree implementation. The MAQR-tree implementation is presented in section V. In section VI, the advantages and usefulness of the proposed techniques are validated through experiments. We conclude by summarizing this work and pointing out future research directions in section VII.

II. RELATED WORK

Replica selection in grid computing has been addressed by many researchers. Chervenak et al. described the Giggle framework, including RLS requirements, implementation, and performance results [2]. Cai et al. proposed a Peer-to-Peer Replica Location Service (P-RLS) with the properties of self-organization, fault-tolerance and improved scalability [3]. More recently, Chervenak et al. systematically described the Globus RLS framework design, implementation, performance and scalability evaluation, and its production deployments [5].

In the GIS context, Wu et al. described a framework for a spatial data catalog implemented on top of a grid and P2P system [6]. Wei and Di et al. implemented a Grid-enabled catalogue service for GIS data by adapting the Open Geospatial Consortium (OGC) Catalogue Service for Web Specification and the ebXML Registry Information Model (ebRIM) [7], [8]. The OGC publishes a number of standards for GIS data and applications, but these are not well suited for other types of spatial data, especially those that entail three or more dimensions. For this reason, databases such as PostGIS [14] that implement the OGC GIS standard are not easily applied to other types of scientific data or metadata.

Narayanan et al. described GridDB-Lite, a middleware which provides basic scientific database support for data-driven applications in the grid [15]. To expedite selection of the data of interest, they implemented an R-tree via summary index files and detailed index files in their indexing service. In another paper [16], Narayanan presented a generic framework to support partial data replication and data reordering. Weng et al. described a partial replica selection algorithm for serving range queries on multidimensional datasets [17]. Our work differs from these efforts in two important respects. First, our MAQR-tree integrates an R-tree into the Globus toolkit to support spatial metadata operations. Second, we investigated the efficient representation of an R-tree constructed in a relational database.

To manage spatial metadata for fast retrieval, we chose the R-Tree [13] data structure for our system from among several other spatial data structures, including the Quadtree, Octree, Kd-tree and UB-tree. We require a data structure which returns the collection of replica Minimal Bounding Rectangles (MBRs) that intersect with a query region. OLAP methods like the UB-tree [18] return a collection of points (records) that are contained within a query region, making them inconvenient for our purposes. As described by Kamel et al., Quadtrees and Octrees typically divide the spatial replicas into smaller blocks, thus generating more partial spatial replicas during construction of the tree [19]. Kd-trees [20] have a similar problem because they require us to divide the space with a splitting hyperplane. However, the R-tree and related methods (e.g. the X-tree [21]) allow for efficient retrieval of the replica MBRs that intersect with a query region, without splitting replicas into smaller pieces.

Section V-C1 describes our use of the Morton Space-filling curve, also known as z-ordering. The Morton Space-filling curve and Hilbert curve are widely used in multidimensional access methods [19], [22]. Faloutsos et al. compared the Hilbert curve and the Morton curve. Their work concluded that the Hilbert curve has a better distance-preserving mapping than the Morton curve, but the Hilbert curve is more complex to calculate [23]. Both the Morton and Hilbert curves are found in commercial database systems. Microsoft SQL Server 2008 used the Hilbert curve to determine the value of a cell identifier, thus maintaining good spatial locality [24]. Oracle has used the Morton curve in their products for some time [22]. The UB-Tree has also been integrated into the kernel of a relational database system to support spatial indexing; a UB-tree combines a B-Tree with the Morton curve, mapping multi-dimensional data to one-dimensional space using the Morton value of the data points [18]. In P2P systems, Ganesan et al. [25] used the Morton curve to reduce multi-dimensional data to one dimension, then partitioned the dataset according to contiguous Morton value ranges. Because the Morton value can be simply computed by bit-interleaving, in this paper we use the Morton curve to re-order the 3D replicas and then construct the MAQR-tree.

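As a concrete illustration of the bit-interleaving just mentioned, the following minimal sketch interleaves the bits of a 3D replica center into a single 64-bit key that can be used to sort replicas before tree construction. The class and method names are ours, not taken from the MAQR-tree sources; the constants are the standard "magic bits" used for 21-bit-per-axis Morton codes.

    // Sketch of a 3D Morton (z-order) encoding by bit interleaving, as used to
    // order replica center points before bottom-up tree construction.
    public final class Morton3D {

        /** Spread the low 21 bits of v so consecutive bits land 3 positions apart. */
        private static long spreadBits(long v) {
            v &= 0x1FFFFFL;                        // 21 bits per axis is enough for a 2048^3 domain
            v = (v | (v << 32)) & 0x1F00000000FFFFL;
            v = (v | (v << 16)) & 0x1F0000FF0000FFL;
            v = (v | (v << 8))  & 0x100F00F00F00F00FL;
            v = (v | (v << 4))  & 0x10C30C30C30C30C3L;
            v = (v | (v << 2))  & 0x1249249249249249L;
            return v;
        }

        /** Interleave the bits of (x, y, z) into a single 63-bit Morton value. */
        public static long encode(long x, long y, long z) {
            return spreadBits(x) | (spreadBits(y) << 1) | (spreadBits(z) << 2);
        }

        public static void main(String[] args) {
            // Center of a replica MBR inside the 2048^3 dataset domain.
            System.out.println(Morton3D.encode(1024, 512, 7));
        }
    }
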

III. ALGORITHM OVERVIEW

Before a computation begins in a grid, partial spatial replica selection must address two subproblems. The first subproblem is to efficiently identify the set of partial replicas that intersect the subvolume required by the computation in a distributed system. Second, from that set, we want to select the "best" replicas to minimize transfer time. This requires a metric based on factors such as disk and network bandwidth and latency, data storage organization, etc. Our current work focuses only on the first subproblem.

Figure 1 illustrates how the intersecting replicas are identified in the system. Spatial replicas and queries are commonly represented as MBRs, which approximate the extent of an object in space using minimum and maximum values for each dimension. In the 2D case, a spatial replica or query can be represented as {xmin, ymin, xmax, ymax}. These MBRs identify the region of the larger data space that each replica represents. Intersection tests are performed to determine which replicas intersect with the spatial query. In the worst case, each spatial replica in the catalog will be examined against the spatial query, which is very computationally expensive.

Figure 1. Intersected Replicas in 2D. The dotted square represents the spatial query, and intersected replicas are shown as shaded rectangles.

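The per-replica test itself is simple; the cost comes from repeating it for every replica in the catalog. The sketch below shows the axis-aligned MBR intersection test in Java, assuming integer bounds per dimension; the class and field names are illustrative, not the paper's.

    // Minimal sketch of the axis-aligned MBR intersection test described above.
    public class MBR {
        final int[] min;   // minimum coordinate in each dimension
        final int[] max;   // maximum coordinate in each dimension

        public MBR(int[] min, int[] max) {
            this.min = min;
            this.max = max;
        }

        /** Two MBRs intersect iff they overlap (or touch) in every dimension. */
        public boolean intersects(MBR other) {
            for (int d = 0; d < min.length; d++) {
                if (min[d] > other.max[d] || max[d] < other.min[d]) {
                    return false;   // separated along dimension d
                }
            }
            return true;
        }

        public static void main(String[] args) {
            MBR query   = new MBR(new int[]{50, 50, 50}, new int[]{100, 100, 100});  // query Q3
            MBR replica = new MBR(new int[]{90, 40, 60}, new int[]{140, 80, 120});
            System.out.println(query.intersects(replica));   // true: overlap in x, y and z
        }
    }
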
However, an R-tree can be used on each grid node to organize the metadata of all spatial replicas and to prune the search space. We define the out degree of a tree node as the actual number of child nodes. Each node of an R-tree can have a variable number of child nodes, up to a maximum value that we will refer to as the fanout. For our purposes, leaf nodes contain a reference to partial replicas, along with an MBR for each replica. Internal nodes contain references to child tree nodes, along with an MBR that encloses the MBRs of all the child nodes. During search, an internal node's MBR can be tested for intersection with the query rectangle, allowing large sections of the tree to be pruned.

IV. LIMITATIONS OF THE GTR-TREE

Our previous approach was implemented on top of the Globus Toolkit RLS. In particular, we repurposed the Physical File Name (PFN) to be a reference to a tree node. We then associated two string attributes "MBR" and "Info" with that PFN in order to represent its spatial extent and references to its children respectively. To enable distributed queries in a grid, we used the GRAM API to invoke the R-tree traversal routine, which was located on the same machine as the RLS server. Query results can be returned by using GridFTP or a Java socket.

The approach described above is still subject to some disadvantages. First, the fanout of the GTR-tree is limited by the size of the string attribute in the RLS backend database. Although R-trees of large fanout can be represented in the database by introducing more user-defined attributes, we propose a more elegant solution in section V-C2.

Second, the GTR-tree node representation was not optimized and was constrained by the design of the existing table structure in the backend database. In particular, one tree node was stored in three different tables, which made queries and updates inefficient due to an expensive join operation.

Third, although the GTR-tree dramatically prunes the search space, some amount of metadata was still sent from the server code to the query routine via a local socket, which incurs some overhead. Also, because we needed to use existing Globus tools, we incurred additional latencies associated with GRAM and GridFTP when performing a distributed query.

V. A NEW APPROACH

In this paper, we modified the Globus Toolkit RLS to support spatial metadata management in a grid. We evaluated the performance of an R-tree stored in a relational database, and some related factors which influence the performance of spatial queries or updates, including Fanout, the Morton Space-filling Curve and Query Aggregation.

A. RLS in the Globus Toolkit

Communication between client and server in the Globus RLS uses a simple string-based RPC protocol [3], [26], which relies on the Grid Security Infrastructure (GSI) and the globus_io socket layer from the Globus Toolkit [2]. Method names, parameters and results are all encoded as null terminated strings.

An RPC method invocation in the Globus RLS includes several steps. First, the client sends the method name and all the arguments to the server. A thread on the server searches through a method array for an element matching the requested method name. After the matching method is invoked, execution results are sent back to the client. It follows that we can easily extend the functionality of the RLS by adding new entries to the method array, where each entry includes a method name and a reference to the code itself.

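The RLS server is written in C and dispatches through a static method array, but the extension mechanism can be illustrated with a small, purely hypothetical dispatch table: supporting a new spatial operation amounts to registering one more name/handler pair. All names below are ours.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.function.Function;

    // Conceptual sketch of the method-dispatch idea described above; illustrative only.
    public class RpcDispatcher {
        private final Map<String, Function<String[], String>> methods = new HashMap<>();

        public void register(String name, Function<String[], String> impl) {
            methods.put(name, impl);           // adding an entry extends the server
        }

        public String invoke(String name, String[] args) {
            Function<String[], String> m = methods.get(name);
            return (m == null) ? "ERROR: unknown method" : m.apply(args);
        }

        public static void main(String[] args) {
            RpcDispatcher server = new RpcDispatcher();
            // A new spatial query method is just another entry in the table.
            server.register("spatial_query", a -> "replicas intersecting " + String.join(" ", a));
            System.out.println(server.invoke("spatial_query",
                    new String[]{"50", "50", "50", "100", "100", "100"}));
        }
    }
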
B. Modifying The RLS

The current work extends the RLS in Globus Toolkit 5.0.3 for spatial replica selection in several ways. First, we added a new table to the backend database, allowing all tree nodes to be stored in a single database table. Second, we added spatial replica query and insertion methods to the RPC method array on the server, to service the spatial requests from the clients. Third, we implemented new methods on the server to communicate with its backend database, where query aggregation is used. Fourth, we added the spatial replica query and insertion functionality to the RLS client tool, client C API and Java API of the existing RLS.

C. The MAQR-tree

To distinguish it from the previous GTR-tree implementation of an SRLS, we call our new work the MAQR-tree. Two critical factors influence tree performance. First, we consider how to select sets of MBRs to be siblings of a single parent, sometimes called clustering [19]. Ideally, sibling nodes would be nearby in space, and also stored close together in the underlying database to improve I/O performance. The space-filling curve is a useful tool for ordering child MBRs in a manner that addresses these concerns. The other critical factor is how to efficiently represent the tree structure in the RLS database. Reducing the storage volume for the tree representation and improving access speed are both important goals here.

1) Morton R-tree: To construct the Morton R-tree, all spatial replicas are first sorted according to the Morton values of their centers. Then we construct the R-tree in a bottom-up manner based on the specified fanout and the out degree of the tree node [27]. Replicas whose MBRs are associated with adjacent Morton values will be clustered into the same tree node. We evaluate the MAQR-tree performance with and without the Morton re-ordering in section VI.

2) New Tree Node Representation: We added a new table to the RLS backend database to store the metadata of all spatial replicas. Figure 2 describes the representation of the MAQR-tree, in which columns x_min, y_min, x_max and y_max describe the MBR of replicas.

[Figure 2 consists of three panels: (a) the layout of 2D spatial replicas, with the partial spatial replicas (leaves of the R-tree) at the far left and the root node at the far right; (b) an R-tree with fanout (maximum out degree) of 6 and out degree of 3 built for the replicas in (a), with leaves 7, 8, 9, 13, 14, 15, 19, 20, 21 under internal nodes 1, 2, 3 and root 0; and (c) the MAQR-tree node representation for the tree in (b), one row per node with columns Node_id (int), num_child (int), x_min, y_min, x_max, y_max (int) and address (varchar), e.g. node 0 with 3 children, MBR (0,0)-(4,6) and a null address, and leaf node 21 with MBR (1,1)-(2,6) and address "gsiftp://node1..../d8.bin".]
Figure 2. An MAQR-tree example on a single grid node. The number in each tree node indicates the tree node id.

Given a current node id pid, the tree fanout f, and out degree d, we can compute all its child ids using:

    first_child_id = pid × f + 1                          (1a)
    last_child_id = first_child_id + f − 1                (1b)
    last_occupied_child_id = first_child_id + d − 1       (1c)

Also, given a child id cid, we can calculate its parent id pid by:

    pid = ⌊(cid − 1) / f⌋                                 (2)

To construct an R-tree with n spatial replicas, the MAQR-tree node ids on each level can be determined by using equation 1 and the tree height H. During tree construction, note that only d replicas are clustered into the same tree node; the remaining (f − d) unused node ids and all their descendant ids are reserved for tree updates, with d ≤ f.

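Equations (1) and (2) make the parent/child relationship purely arithmetic, so no child or parent ids need to be stored. The small sketch below (method names are ours) implements these computations and checks them against the fanout-6, out-degree-3 example of figure 2.

    // Sketch of the implicit MAQR-tree node addressing from equations (1) and (2).
    // f is the fanout and d the out degree, with d <= f.
    public final class NodeIds {

        public static long firstChildId(long pid, long f) {                 // equation (1a)
            return pid * f + 1;
        }

        public static long lastChildId(long pid, long f) {                  // equation (1b)
            return firstChildId(pid, f) + f - 1;
        }

        public static long lastOccupiedChildId(long pid, long f, long d) {  // equation (1c)
            return firstChildId(pid, f) + d - 1;
        }

        public static long parentId(long cid, long f) {                     // equation (2)
            return (cid - 1) / f;   // integer division gives the floor for cid >= 1
        }

        public static void main(String[] args) {
            long f = 6, d = 3;                                // values used in figure 2
            System.out.println(firstChildId(0, f));           // 1
            System.out.println(lastOccupiedChildId(0, f, d)); // 3
            System.out.println(parentId(21, f));              // 3 (node 21 is a child of node 3)
        }
    }
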

Table I
EXPERIMENTAL GRID NODE CHARACTERISTICS

    OS                 2.6.18
    Processor          Intel Xeon SE 2.40GHz
    Cores              16
    Memory             24G
    Java version       1.6.0_23
    Globus version     v5.0.3
    unixODBC version   v2.2.11
    MySQL database     v5.0.77
    MyODBC library     v3.51.26

Table II
SPATIAL EXTENT OF REPRESENTATIVE QUERIES

    3D Query   MBR (String)
    Q1         "1400 1400 1400 1400 1400 1400"
    Q2         "1500 1500 1600 1510 1520 1640"
    Q3         "50 50 50 100 100 100"
    Q4         "100 100 100 200 200 200"
    Q5         "200 200 200 400 400 400"
    Q6         "800 900 1000 1000 1100 1400"
    Q7         "1024 1024 1024 1324 1324 1324"
    Q8         "1600 1600 1600 2048 2048 2048"
    Q9         "900 900 1000 1400 1100 1400"

Table III
NUMBER OF REPLICAS INTERSECTED WITH QUERY IN OUR GRID

              Q1     Q2     Q3     Q4     Q5      Q6      Q7      Q8      Q9
    Average   1995   2578   1288   3405   10667   15990   20181   22351   26523
    Maximum   2063   2667   1350   3534   10865   16221   20484   22493   26822
    Minimum   1911   2518   1243   3307   10527   15786   19976   22193   26347

Table IV
ADVANTAGE OF MAQR-TREE QUERY OVER PURE RELATIONAL

                                   Data1     Data2     Data3     Data4
    Total Number (N)               1000000   1000000   1000000   1000000
    Average number returned (AR)   11645     4649      1816      2642
    Average speedup (AS)           1.9       4.36      7.75      8.13


3) Advantages of The New Representation: There are several advantages associated with the new tree representation, relative to our previous work and other designs.

• Node ID Representation: We use an integer to represent the node id. In the underlying database, performance is improved when tree nodes are retrieved using integer keys, especially after a relational database index has been constructed on the column Node_id.
• Table Simplification: Metadata of spatial replicas is stored in a single table, and no relational join is performed during a query.
• No explicit representation of child and parent ids: The storage size of a tree node is dramatically decreased, resulting in a more compact backend database, because no stored reference to the children is required. Instead we can calculate all child ids on the fly using equation 1.
• Arbitrarily large fanout values: The fanout of the MAQR-tree is no longer limited by the size of a table column. We can create a MAQR-tree with a fanout of more than one thousand, for instance. Although a large number of node ids are reserved in the tree, the storage utilization is not influenced by the tree fanout.
• Query Aggregation: Database queries can be aggregated into a single database transaction using a range of tree node ids.
• Support for insertion and deletion: The new MAQR-tree is not a Packed R-tree. Instead it is dynamic, and allows insertion and deletion of replicas in the tree.

Because the MAQR-tree is stored in the RLS's backend relational database, we are able to take advantage of the provided functionality. In addition to the convenient data access the RDBMS provides, we can also use the atomic transaction mechanism to maintain consistency when the server is simultaneously processing both queries and updates.

D. MAQR-Tree Traversal And Query Aggregation

We use a linked list queue to implement the MAQR-tree Breadth-first Search. Each queue element stores the integer node id of a tree node whose MBR intersects the query MBR. To reduce the latency cost associated with each replica query, we aggregate a group of replica reads into one larger read when the RLS server retrieves information from the backend database. We call this technique query aggregation.

Under the MAQR-tree representation, during the tree traversal, a tree node id is first dequeued and all its child information is retrieved by an I/O module. The I/O module next computes the first child id and the last child id using equation 1. Then it performs a range query on the table column Node_id, and all child information is retrieved in one transaction. We constructed a relational index on the table column Node_id, so the range query on that column is fast.

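A compact sketch of this traversal is shown below. It reuses the MBR class sketched in section III; the NodeStore interface and fetchChildren method are hypothetical stand-ins for the I/O module, which issues a single range query over Node_id so that all children arrive in one transaction. Without aggregation, each child would cost a separate round trip to the RDBMS; the range query amortizes that overhead, which is what section VI-C measures.

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Queue;

    // Sketch of the breadth-first MAQR-tree traversal with query aggregation.
    // TreeNode, NodeStore and fetchChildren() stand in for the paper's I/O module,
    // which issues one range query such as
    //   SELECT * FROM <tree table> WHERE Node_id BETWEEN first AND last;
    public class MaqrTraversal {

        interface NodeStore {
            /** Return all existing tree nodes whose id lies in [firstId, lastId]. */
            List<TreeNode> fetchChildren(long firstId, long lastId);
        }

        static class TreeNode {
            long id;
            int numChild;      // 0 for a leaf (a partial replica)
            MBR mbr;           // MBR class sketched in section III
            String address;    // replica address for leaves, null for internal nodes
        }

        /** Return the addresses of all replicas whose MBR intersects the query MBR. */
        static List<String> search(NodeStore store, long fanout, long rootId, MBR query) {
            List<String> hits = new ArrayList<>();
            Queue<Long> queue = new ArrayDeque<>();   // ids of nodes still to be expanded
            queue.add(rootId);
            while (!queue.isEmpty()) {
                long pid = queue.poll();
                long first = pid * fanout + 1;        // equation (1a)
                long last = first + fanout - 1;       // equation (1b)
                // One aggregated range query retrieves every existing child at once.
                for (TreeNode child : store.fetchChildren(first, last)) {
                    if (!child.mbr.intersects(query)) continue;        // prune this subtree
                    if (child.numChild == 0) hits.add(child.address);  // leaf: report replica
                    else queue.add(child.id);
                }
            }
            return hits;
        }
    }
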
VI. EVALUATION

The Distributed Research Testbed (DiRT) is a multi-site instrument for developing and evaluating the performance of distributed software. We constructed a fifteen-node grid on DiRT using a version of Globus Toolkit 5.0.3 modified to include our spatial replica location service. In the grid, eight nodes are located at the University of Mississippi, while seven of them are located at the University of Florida. The characteristics of each grid node are described in table I. The proposed service uses the unixODBC manager and the MyODBC driver to communicate with the MySQL server.

We used independent datasets on each grid node. One million 3D replicas were generated on each grid node using two separate Java Random objects. One object randomly generates the center point of the 3D rectangle in the entire dataset domain (a 2048³ cube). Another Random object randomly generates the length of the replica along each axis, which is bounded in the range of 1 to P. The following tests use P = 512, unless otherwise noted.

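The sketch below illustrates this generation procedure. The seeds, class name, and the exact clamping of replicas to the domain boundary are our assumptions, not taken from the paper's test code.

    import java.util.Random;

    // Sketch of the synthetic replica generation described above: one Random draws
    // the center of each 3D replica inside the 2048^3 domain, a second Random draws
    // the edge length along each axis, bounded between 1 and P.
    public class ReplicaGenerator {
        private static final int DOMAIN = 2048;

        public static void main(String[] args) {
            int p = 512;                       // maximum edge length, P
            Random centers = new Random(17);   // arbitrary seeds for the sketch
            Random lengths = new Random(42);

            for (int i = 0; i < 5; i++) {      // the paper generates one million per grid node
                int[] min = new int[3];
                int[] max = new int[3];
                for (int d = 0; d < 3; d++) {
                    int center = centers.nextInt(DOMAIN);
                    int half = (1 + lengths.nextInt(p)) / 2;
                    min[d] = Math.max(0, center - half);
                    max[d] = Math.min(DOMAIN - 1, center + half);
                }
                System.out.printf("replica %d: (%d,%d,%d)-(%d,%d,%d)%n",
                        i, min[0], min[1], min[2], max[0], max[1], max[2]);
            }
        }
    }
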
The same nine representative spatial queries used in [12] were also used for testing in this paper. Queries include one point query and eight rectangular queries of various sizes. The extent of these nine spatial queries, relative to the entire dataset (a 2048³ cube), is shown in table II. For each query, table III shows the average number (per grid node) of replicas that intersect that query, and the maximum and minimum number of replicas returned in the grid. All of the following results were collected as an average of five runs to reduce the effect of filesystem caching, network behavior and other variables.

When we tested the performance of various MAQR-tree fanout and out degree values, we found that a MAQR-tree with an out degree of 50 yields the best performance on average. In the following tests, we used the MAQR-tree with an out degree of 50 and a fanout of 70.

A. Experimental Verification

We performed a series of experiments to verify the usefulness of the MAQR-tree and the effectiveness of the tree representation in the database.

We verified that an R-tree in a relational database is useful for spatial metadata by comparing MAQR-tree performance with a non-tree implementation that relies heavily on the underlying relational database. For this pure relational implementation, we used the database schema shown in figure 3. Here, MBR coordinates are stored as separate integer fields in the database, making them directly available in SQL queries. Figure 4 shows an SQL implementation of a spatial intersection query using this schema. Indexes were built for all fields.

    Column Name:  id    x_min   y_min   x_max   y_max   address
    Type:         int   int     int     int     int     varchar
Figure 3. The schema used for the pure relational implementation; components of the MBR are stored as integers.

    select address from t_spatial_rep where
      (not ((x_min > qxmin and x_min > qxmax) or x_max < qxmin)
       and not ((y_min > qymin and y_min > qymax) or y_max < qymin));
Figure 4. SQL statement for conducting spatial queries without a MAQR-tree. Used with the schema shown in figure 3, this SQL statement can retrieve replicas that intersect with a bounding rectangle {qxmin, qymin, qxmax, qymax}. Performance was found to be worse than the MAQR-Tree on average.

To carefully examine the advantage of the MAQR-tree over the pure relational implementation, we performed the comparison on four different datasets. We generated these datasets randomly as described before, but with different values for P, which determines the maximum size of replicas. This allows us to control the average number of replicas that intersect the query region in the different datasets. The advantage of the MAQR-tree query over the pure relational query is presented in table IV.

In table IV, the row "AR" gives the average number of replicas returned per representative query on each dataset. The row "AS" is calculated by averaging, over the queries, the ratio of MAQR-tree query performance to that of the pure relational implementation. We observed that the MAQR-tree query can be up to 8.13 times quicker than the pure relational implementation.

The reason for this advantage is very likely the multi-dimensional nature of the MAQR-tree. For the pure relational implementation, even if an index can be used to quickly find the set of replicas with appropriate x values, a y index cannot help with pruning this set further because it contains all the replicas in the database. A time consuming exhaustive search is therefore necessary. In contrast, the MAQR-tree query is able to drastically prune the set of replicas under consideration as it encounters multidimensional MBRs at each level of the tree, improving performance enormously. Table IV also shows that the MAQR-tree has the greatest advantage when the number of replicas returned is smaller. These more selective queries exhibit more aggressive pruning while traversing the tree, which enhances performance.

We also tested the insertion cost with the two implementations. The pure relational implementation relies entirely on the underlying database to support insertion. For the MAQR-tree, we must perform several steps in order to insert a replica. Once the proper tree location has been determined, all siblings (sharing the same parent) after this location must increment their node ids by one in order to make space for the insertion. In the rare case that no space is available, a node split operation must be performed [28], taking roughly 100 milliseconds in our tests. Next, the parent and ancestor MBRs are updated to account for the new node.

We tested MAQR-tree insertions with various groups of replica MBRs, including cubic MBRs of size 50³, 100³, 200³, and 300³. We inserted these cube replicas into different regions of the entire domain. It takes an average of 30 milliseconds to insert one replica into a MAQR-tree. The pure relational approach is slightly quicker, taking an average of 18 milliseconds to insert, which is 1.67 times faster than the MAQR-tree. Overall, the MAQR-tree approach is effective for both scientific database and replica selection applications, which involve intensive queries and relatively few updates.

B. Morton R-tree

We evaluated the effect of introducing the Morton Space-filling curve to the MAQR-tree implementation, as described in section V-C1, and found that adding this technique improves performance by more than 30 times on average. The results presented in this paper use the MAQR-tree, but we expect slightly better query performance when applying the Hilbert curve in the R-tree construction. Experiments show that the Hilbert R-tree outperforms the MAQR-tree by 7% on dataset Data1 in table IV.

By associating each MBR with a Morton value, spatially adjacent replicas are mapped to similar Morton values. Therefore, they can be grouped into the same or neighboring parent tree nodes. In this way, the Morton value helps the enclosing MBRs to stay local, resulting in reduced "dead space" and overlap in the tree nodes [28]. In short, the enclosing MBR is more efficiently used, and the constructed R-tree is more compact. This decreases the number of nodes that must be examined for intersection, thereby improving performance.

C. Query Aggregation

We evaluated the benefit of aggregating the queries made to the underlying database, as described in section V-D. Comparing MAQR-tree query performance with and without this feature activated demonstrates an average 15x performance improvement for Query Aggregation. This improvement is largely due to amortizing the overhead associated with each RDBMS query over many child nodes. Tests with cold and warm filesystem caches lead us to believe that it is software costs, rather than disk access costs, that are most directly addressed with this technique.


D. Comparison with the GTR-tree

To make a fair comparison of the MAQR-tree with our previous GTR-tree, we tested on a single grid node using the same dataset. We compared the efficiency of the tree representation and associated techniques by sorting all replicas for both the MAQR-tree and the old GTR-tree with the same out degree. We observed that the performance of the MAQR-tree is on average 24.5 times better than the previous GTR-tree. We attribute these speedups to the advantages of the MAQR-tree over the previous GTR-tree described in sections V-C3 and IV.

E. Multi-client and Distributed Experiments

We wrote a multi-threaded client program using our C API and the pthreads library. We conducted the multi-client tests with the server and clients on the same machine, on which our threaded client program allows us to specify how many client threads are created simultaneously. In the distributed query, we used as many remote machines as possible, submitting queries from a grid node at the University of Mississippi to grid nodes at the University of Florida. The results of the multi-client and distributed queries are shown in figure 5. Figure 5(a) shows that server performance scales nearly linearly as the number of simultaneous client threads increases. In (b) we see that performance also scales well as we increase the number of server nodes and total number of replicas. Each server maintains one million replicas. With fifteen nodes we are handling fifteen million replicas, but execution time is less than five times as long as for one million replicas.

Figure 5. (a) Multi-client experiment of the spatial replica location service: execution time (milliseconds) for queries Q3, Q4 and Q5 versus the number of client threads (1 to 100). (b) Distributed query using one to fifteen grid nodes: execution time (milliseconds) for queries Q3, Q4 and Q5 versus the number of grid nodes. The multi-client test in (a) shows that server performance scales well as the number of simultaneous client threads increases. In (b) we see that performance scales well as we increase the number of server nodes and total number of replicas. In each test, the total number of replicas is the number of nodes multiplied by one million.

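The multi-client test program itself uses the C API with pthreads. Purely for illustration, the same pattern expressed with Java threads might look like the sketch below, where SpatialQueryClient is a hypothetical placeholder for the client API, not a real Globus or MAQR-tree interface.

    import java.util.concurrent.CountDownLatch;

    // Illustrative sketch of the multi-client experiment: N threads submit the same
    // spatial query concurrently and the total elapsed time is recorded.
    public class MultiClientTest {

        interface SpatialQueryClient {
            void query(String mbr);   // e.g. "50 50 50 100 100 100" for Q3
        }

        static long runClients(int numThreads, SpatialQueryClient client, String mbr)
                throws InterruptedException {
            CountDownLatch done = new CountDownLatch(numThreads);
            long start = System.currentTimeMillis();
            for (int i = 0; i < numThreads; i++) {
                new Thread(() -> {
                    client.query(mbr);   // each thread issues one spatial query
                    done.countDown();
                }).start();
            }
            done.await();                // wait for all client threads to finish
            return System.currentTimeMillis() - start;
        }

        public static void main(String[] args) throws InterruptedException {
            SpatialQueryClient stub = mbr -> { /* would call the replica location service */ };
            System.out.println("elapsed ms: " + runClients(100, stub, "50 50 50 100 100 100"));
        }
    }
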

VII. CONCLUSION

Our eventual goal is a system where both data and computation are truly distributed. We envision applications where only a subvolume of a larger dataset is required for a computation, and where that subvolume may be available from various combinations of different sources. If we hope to provide more than simple batch computation for such applications, we must be able to rapidly identify the set of partial replicas that intersect with the required subvolume, and then choose an optimal combination of replicas to read from. This paper presents work that rapidly identifies the set of intersecting replicas in a distributed environment.

Our previous work on this topic [12] solved this problem with an implementation that lay entirely on top of an unmodified Globus Toolkit. This approach has the advantage of easy deployment using existing software infrastructure, but must also pay a performance penalty. The work described here examined the performance benefits of an approach that modifies the Globus source code to provide a more efficient implementation. We also provided important new functionality in the form of insertion and update operations.

With regard to the internal implementation of the R-tree data structure and associated operations, we have also added several improvements. Ordering tree node children according to the Morton curve was very effective, as was Query Aggregation. The net result is performance roughly 24.5 times better than the GTR-tree implementation. Experiments show that a single server with one million replicas in its database can scale up to 100 client threads when handling large spatial queries.

There are several avenues for future research. We may continue our optimization efforts by replacing the database with our own storage module, giving greater control over disk behavior. The problem of selecting an optimal combination of partial replicas must also be addressed, along with the development of suitable performance metrics. Lastly, we entertain the possibility of using GPUs to not only support user computation directly, but to also assist in computation intensive selection and management tasks.

ACKNOWLEDGMENT

This work was supported by the National Science Foundation under grants CCF-0541239 and CRI-0855136.

REFERENCES

[1] S. Vazhkudai, S. Tuecke, and I. Foster, "Replica selection in the Globus data grid," in CCGrid. IEEE Computer Society, 2001, pp. 106–113.

[2] A. Chervenak, E. Deelman, I. Foster, L. Guy, W. Hoschek, A. Iamnitchi, C. Kesselman, P. Kunszt, M. Ripeanu, B. Schwartzkopf, H. Stockinger, K. Stockinger, and B. Tierney, "Giggle: a framework for constructing scalable replica location services," in Proceedings of the 2002 ACM/IEEE Conference on Supercomputing. Baltimore, Maryland: IEEE Computer Society Press, 2002, pp. 1–17.

[3] M. Cai, A. Chervenak, and M. Frank, "A peer-to-peer replica location service based on a distributed hash table," in Proceedings of the 2004 ACM/IEEE Conference on Supercomputing. IEEE Computer Society, 2004, p. 56.

[4] Globus.org, "Globus Toolkit 5.0.3 RLS User's Guide." [Online]. Available: http://www.globus.org/toolkit/docs/5.0/5.0.3/data/rls/user/#rlsUser

[5] A. Chervenak, R. Schuler, M. Ripeanu, A. Amer, S. Bharathi, I. Foster, A. Iamnitchi, and C. Kesselman, "The Globus replica location service: design and experience," IEEE Transactions on Parallel and Distributed Systems, vol. 20, no. 9, pp. 1260–1272, 2009.

[6] W. Peng-fei, F. Yu, C. Bin, and W. Xi, "GloSDC: A framework for a global spatial data catalog," in Geoscience and Remote Sensing Symposium, 2008 (IGARSS 2008), IEEE International, vol. 2. IEEE, 2009.

[7] Y. Wei, L. Di, B. Zhao, G. Liao, A. Chen, Y. Bai, and Y. Liu, "The design and implementation of a grid-enabled catalogue service," in International Geoscience and Remote Sensing Symposium, vol. 6, 2005, p. 4224.

[8] L. Di, A. Chen, W. Yang, Y. Liu, Y. Wei, P. Mehrotra, C. Hu, and D. Williams, "The development of a geospatial data grid by integrating OGC web services with Globus-based grid technology," Concurrency and Computation: Practice and Experience, vol. 20, no. 14, pp. 1617–1635, 2008.


[9] P. J. Rhodes, X. Tang, R. D. Bergeron, and T. M. Sparr, "Iteration aware prefetching for large multidimensional scientific datasets," in SSDBM 2005: Proceedings of the 17th International Conference on Scientific and Statistical Database Management. Berkeley, CA, USA: Lawrence Berkeley Laboratory, 2005, pp. 45–54.

[10] P. J. Rhodes and S. Ramakrishnan, "Iteration aware prefetching for remote data access," in Proceedings of the 1st International Conference on e-Science and Grid Computing, H. Stockinger, R. Buyya, and R. Perrott, Eds., 2005, pp. 279–286.

[11] Y. Gu and R. Grossman, "UDT: UDP-based data transfer for high-speed wide area networks," Computer Networks, vol. 51, no. 7, pp. 1777–1799, 2007.

[12] Y. Tian and P. J. Rhodes, "The Globus Toolkit R-tree for partial spatial replica selection," in Proceedings of the 2010 11th IEEE/ACM International Conference on Grid Computing. Brussels, Belgium: IEEE, Oct. 2010, pp. 169–176.

[13] A. Guttman, "R-trees: A dynamic index structure for spatial searching," in SIGMOD '84: Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data. New York, NY, USA: ACM, 1984, pp. 47–57.

[14] PostGIS webpage. [Online]. Available: http://postgis.refractions.net/

[15] S. Narayanan, T. Kurc, U. Catalyurek, and J. Saltz, "Database support for data-driven scientific applications in the grid," Parallel Processing Letters, vol. 13, no. 2, pp. 245–272, 2003.

[16] S. Narayanan, U. Catalyurek, T. Kurc, V. Kumar, and J. Saltz, "A runtime framework for partial replication and its application for on-demand data exploration," in High Performance Computing Symposium (HPC 2005), SCS Spring Simulation Multiconference, 2005.

[17] L. Weng, U. Catalyurek, T. Kurc, G. Agrawal, and J. Saltz, "Servicing range queries on multidimensional datasets with partial replicas," in Cluster Computing and the Grid, 2005 (CCGrid 2005), IEEE International Symposium on, vol. 2. IEEE, 2005, pp. 726–733.


[18] F. Ramsak, V. Markl, R. Fenk, M. Zirkel, K. Elhardt, and R. Bayer, "Integrating the UB-tree into a database system kernel," in VLDB 2000: Proceedings of the 26th International Conference on Very Large Data Bases, Cairo, Egypt, A. E. Abbadi, M. L. Brodie, S. Chakravarthy, U. Dayal, N. Kamel, G. Schlageter, and K.-Y. Whang, Eds. Morgan Kaufmann, 2000, pp. 263–272.

[19] I. Kamel and C. Faloutsos, "Hilbert R-tree: An improved R-tree using fractals," in Proceedings of the International Conference on Very Large Databases, 1994, pp. 500–509.

[20] J. L. Bentley, "Multidimensional binary search trees used for associative searching," Communications of the ACM, vol. 18, pp. 509–517, September 1975.

[21] S. Berchtold, D. Keim, and H. Kriegel, "The X-tree: An index structure for high-dimensional data," Readings in Multimedia Computing and Networking, vol. 12, p. 451, 2002.

[22] V. Gaede and O. Günther, "Multidimensional access methods," ACM Computing Surveys (CSUR), vol. 30, no. 2, pp. 170–231, 1998.

[23] C. Faloutsos and S. Roseman, "Fractals for secondary key retrieval," in Proceedings of the Eighth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. ACM, 1989, pp. 247–252.

[24] Y. Fang, M. Friedman, G. Nair, M. Rys, and A. Schmid, "Spatial indexing in Microsoft SQL Server 2008," in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. ACM, 2008, pp. 1207–1216.

[25] P. Ganesan, B. Yang, and H. Garcia-Molina, "One torus to rule them all: multi-dimensional queries in P2P systems," in Proceedings of the 7th International Workshop on the Web and Databases (WebDB '04). New York, NY, USA: ACM, 2004, pp. 19–24.

[26] Globus.org, "Replica Location Service RPC Protocol Description," 2003. [Online]. Available: http://www.globus.org/toolkit/docs/5.0/5.0.3/data/rls/developer/rpcprotocol.pdf

[27] S. Leutenegger, M. Lopez, and J. Edgington, "STR: A simple and efficient algorithm for R-tree packing," in Data Engineering, 1997: Proceedings of the 13th International Conference on, 1997, pp. 497–506.


[28] N. Beckmann, H. Kriegel, R. Schneider, and B. Seeger, "The R*-tree: an efficient and robust access method for points and rectangles," ACM SIGMOD Record, vol. 19, no. 2, pp. 322–331, 1990.