International Journal of Geo-Information

Article

An Effective NoSQL-Based Vector Map Tile Management Approach

Lin Wan 1, Zhou Huang 2,* and Xia Peng 3,4

1 Faculty of Information Engineering, China University of Geosciences, Wuhan 430074, China; [email protected]
2 Institute of Remote Sensing & GIS, Peking University, Beijing 100871, China
3 Institute of Tourism, Beijing Union University, Beijing 100101, China; [email protected]
4 State Key Laboratory of Resources and Environmental Information System, Institute of Geographical Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing 100101, China
* Correspondence: [email protected]; Tel.: +86-10-6276-0132

Academic Editors: Bert Veenendaal, Maria Antonia Brovelli, Serena Coetzee, Peter Mooney and Wolfgang Kainz Received: 15 August 2016; Accepted: 5 November 2016; Published: 12 November 2016

Abstract: Within a digital map service environment, the rapid growth of Spatial Big-Data is driving new requirements for effective mechanisms for massive online vector map tile processing. The emergence of Not Only SQL (NoSQL) databases has resulted in a new data storage and management model for scalable spatial data deployments and fast tracking. They better suit the scenario of high-volume, low-latency network map services than traditional standalone high-performance computer (HPC) or relational databases. In this paper, we propose a flexible storage framework that provides feasible methods for tiled map data parallel clipping and retrieval operations within a distributed NoSQL database environment. We illustrate the parallel vector tile generation and querying algorithms with the MapReduce programming model. Three different processing approaches, including local caching, distributed file storage, and the NoSQL-based method, are compared by analyzing the concurrent load and calculation time. An online geological vector tile map service prototype was developed to embed our processing framework in the China Geological Survey Information Grid. Experimental results show that our NoSQL-based parallel tile management framework can support applications that process huge volumes of vector tile data and improve performance of the tiled map service.

Keywords: digital map; map tile; NoSQL; MapReduce; cloud computing

1. Introduction

Digital maps, or online electronic maps, are the core components of modern Web geographic information systems (GIS). Supported by various thematic information databases, web-based map series can be provided at different scales for multiple geographic themes, such as streets, traffic, politics, and agricultural phenomena. Such a map service can be considered as a computing framework for the integration of hardware, algorithms, software, networking and spatial data. With the rapid growth of various networked geospatial data sources, such as volunteered geographic information in social networks, real-time data acquired from sensor networks and large quantities of earth observation data from satellites, web-based electronic map service infrastructures have faced the big challenge of organizing and processing the huge volume of spatial data with its inherent complexity. This has resulted in the pursuit of advanced high-performance computing architectures and sophisticated algorithms to achieve scalable map data storage and fast processing.

ISPRS Int. J. Geo-Inf. 2016, 5, 215; doi:10.3390/ijgi5110215 www.mdpi.com/journal/ijgi

Currently, most map service systems, such as Google Maps, Yahoo! Maps, Bing Maps, and OpenStreetMap, are primarily built on vector tile map models, in which a set of map images that are pre-generated or dynamically rendered from vector GIS data are delivered to the requester via a web browser. To facilitate visualization at different resolutions and map-based spatial analysis, the entire vector map should be split into many chunks with specific clip algorithms before tile generation. The map tiles, or so-called tile caches, are basic units of one specific tile level and are identified by the corresponding zoom level, row and column. There are many tile storage structures, and the Tile Pyramid structure based on a level-of-detail (LOD) model is renowned for its efficiency in large map image rendering and fast tile scheduling. The use of the Tile Pyramid model can support the display of high-resolution tile images with a high level of performance.

The volume of tiles in Tile Pyramid structures is growing quickly, and the number of users with concurrent access is also increasing because geo-social apps are used by more and more people. However, the traditional tile storage and management strategies, standalone high-performance computer (HPC) architectures and commercial relational databases cannot easily accomplish the goals of preprocessing, integration, exploration, scheduling, and visualization for big-data tiles. When there is a sharp rise in tile data volume and the number of concurrent sessions, these classic management models exhibit slow reading and writing speeds, limited storage capacity, poor scalability and high building and maintenance costs. Hence, within such a big geo-data environment, the development of effective mechanisms that facilitate fast tile operations becomes a key requirement. Technical issues should be addressed to implement the combination of distributed vector tile data and distributed geoprocessing by means of a software system. We concentrate on the exploration of massive vector tile data management methods, particularly on a versatile, scalable computing framework that provides both distributed tile storage and parallel vector tile processing (tile generation and fast retrieval). The framework forms the link between the network map data organization model and the spatial operations that are associated with this distributed model.

In recent studies, cloud-based mechanisms, such as Not Only SQL (NoSQL) storage and the MapReduce programming paradigm, have been gradually adopted by many researchers in the geocomputing field to explore new methods for large-volume spatial data processing. The number of cloud geocomputing frameworks has proliferated, and they vary in their main points of study. For example, some researchers emphasized distributed storage means for tile storage, whereas other studies focused on MapReduce-based spatial data processing against one specific batch-oriented, high-latency platform such as Apache Hadoop. With the need for a systematic approach that satisfies both mass storage and fast parallel processing for big tile data, how to integrate a scalable map data model with fast operating capabilities is becoming a critical goal of cloud-based tile computing.

The objectives of this paper focus on two aspects. First, we propose a new tile data storage model based on one type of NoSQL paradigm: wide column-oriented storage. The model distributes massive tile data across scalable data nodes while maintaining global consistency and allows the effective handling of specific tile data. Second, based on our tile data model, a parallel algorithm framework is presented for fast vector tile generation and querying using the MapReduce programming model. We also incorporate the new algorithm framework into a tiled map service as a component of a real geological survey information vector map service prototype and assess the performance of our NoSQL tile data infrastructure in comparison with other methods.

In the remainder of the paper, we sequentially introduce applications of cloud-based computing methods, such as Hadoop, NoSQL, and MapReduce, in big geo-data processing. This is followed by an illustration of the extended vector map tile generation and locating algorithm, preparing for the NoSQL-based tile data model. Next, we outline vector map tile storage access models and illustrate how to map tile data into NoSQL column-family stores. The tile map parallel clipping and query framework based on MapReduce is discussed in detail. We also devote a section to the elaboration of the parallel tile generation and static/dynamic query algorithms. After the description of the delivery of the parallel tile operation algorithm as a GIS service, we conduct a series of comparative experiments to evaluate its performance. Finally, we close with some concluding remarks.

2. Related Work

2.1. Cloud-Based Big Geo-Data Processing

Our main objective is to integrate appropriate cloud computing mechanisms based on open standards with massive spatial data management, particularly vector map tile data management. In the geospatial domain, some progress has been made towards implementing an operational big geo-data computing architecture. Part of the research has focused on web service-based geoprocessing models for distributed spatial data sharing and computing [1–4], and some studies have investigated CyberGIS-based [5,6] methods to address computationally intensive and collaborative geographic problems by exploiting HPC infrastructure, such as computational grids and parallel clusters [7–9], particularly focusing on the integration of particular CyberGIS components, spatial middleware and high-performance computing and communication resources for geospatial research [10,11]. Cloud computing is becoming an increasingly prominent solution for complex big geo-data computing. It provides scalable storage to manage and organize continuously increasing geospatial data, elastically adjusts its processing capability for data-parallel computing, and offers an effective paradigm to integrate many heterogeneous geocomputing resources [12,13]. Thus far, the advancements of cloud-enabled geocomputing involve dealing with the intensities of data, computation, concurrent access, and spatiotemporal patterns [14]. To date, the integration of advanced big data processing methods and practical geocomputing scenarios is still in its infancy and is a very active research field.

2.2. Hadoop and Big Geo-Data Processing

Hadoop is a reputable software ecosystem for distributed massive data processing in a cloud environment. As an open-source variant of Google's big data processing platform, its primary aim is to provide scalable storage capability with lower hardware costs, effective integration of heterogeneous computing resources for cooperative computing, and a highly available cloud service on top of loosely coupled clusters. As the core component, the Hadoop Distributed File System (HDFS), which runs on scalable clusters consisting of cheap commodity hardware, can afford reliable large-dataset (typically in the GB to TB range) storage capabilities with high bandwidth and streaming file system data access characteristics. Based on this, it can accelerate intensive computing on massive data following the idea that "moving computation is cheaper than moving data." By adopting a master/slave architecture, the master node of HDFS, often called the NameNode, maintains the global file metadata and is responsible for data distribution as well as accessing specific data blocks, whereas slaves (DataNodes, also commonly called computing nodes) manage their own data regions allocated by the master node, performing data block creation, modification, deletion and replication operations issued from the NameNode. On the basis of these advanced distributed file system features, HDFS meets the requirement for large-scale and high-performance concurrent access and has gradually become a foundation of cloud computing for big data storage and management.

Another core component of Hadoop is the MapReduce [15] parallel data processing framework. Built on a Hadoop cluster, the MapReduce model can divide a sequential processing task into two sub-task execution phases: Map and Reduce. In the Map stage, one client-specified computation is applied to each record unit (identified by unique key/value pairs) of the input dataset. The execution course is composed of multiple instances of the computation, and these instance programs are executed in parallel. The output of this computation, all intermediate values, is immediately aggregated (merged by the same key identifier) by another client-specified computation in the Reduce stage. One of the most significant advantages of MapReduce is that it provides an abstraction that hides many system-level details from the client. Unlike other parallel computing frameworks, the MapReduce model can decompose complex algorithms into a sequence of MapReduce parallel jobs without considering their communication process.

Hadoop MapReduce is the open-source implementation of the programming model and execution framework built on HDFS. In this framework, the job submission node (typically the NameNode) runs the JobTracker, which is the administration point of contact for clients wishing to execute a parallel MapReduce job. The input to a MapReduce job comes from data stored on the distributed file system or existing MPP (massively parallel processing) relational databases, and the output is written back to the distributed file system or relational databases; any other system that satisfies the proper abstractions can serve as a data source or sink.
In the Map stage, JobTracker receives MapReduce programs consisting of code segments for Mappers and Reducers with data configuration parameters, divides the input into smaller sub-problems by generating intermediate key/value pairs, and distributes them to each TaskTracker node (usually DataNodes) in the cluster to solve sub-problems, whereas the Reduce procedure merges all intermediate values associated with the same key and returns the final results to the JobTracker node. During the entire process, JobTracker also monitors the progress of the running MapReduce jobs and is responsible for coordinating the execution of the Mappers and Reducers. To date, besides Hadoop, there are also many other applications and implementations of the MapReduce paradigm, e.g., targeted specifically for multi-core processors, for GPUs (Graphics Processing Unit), and for the NoSQL database architecture. Because Hadoop provides a distributed file system and a scalable computation framework by partitioning computation processes across many computing nodes, it is better for addressing big data challenges and has been widely adopted in big geo-data processing and analysis scenarios [16–18].
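To make the Map and Reduce phases concrete, the following minimal Hadoop MapReduce job counts tile records per pyramid level from text input consisting of "level,row,col,..." lines. It is a generic illustration of the programming model described above, not code from the paper; the class names and the assumed input format are illustrative only.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TileLevelCount {

  // Map phase: emit (level, 1) for every input record "level,row,col,..."
  public static class LevelMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String level = value.toString().split(",")[0];
      ctx.write(new Text(level), ONE);
    }
  }

  // Reduce phase: merge all intermediate values that share the same level key
  public static class LevelReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "tile-level-count");
    job.setJarByClass(TileLevelCount.class);
    job.setMapperClass(LevelMapper.class);
    job.setReducerClass(LevelReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The Mapper instances run in parallel against the input splits, and the framework groups the intermediate pairs by key before handing them to the Reducer, mirroring the two-stage flow described above.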

2.3. Not Only SQL (NoSQL)-Enabled Big Data Management

The emergence of NoSQL databases was accompanied by the urgent demand for new methods to handle larger-volume data, which forced the exploration of building massive data processing platforms through distributed clusters of commodity servers. In addition to the underlying processing platform, the data model must be designed to make upper-layer applications more suitable for fast data storage and parallel computing, for which the relational model has some limitations, such as impedance mismatch [19] and the fact that relational tuples do not support complex value structures. Generally speaking, NoSQL databases refer to a number of recent non-relational databases, which are schema-less, more scalable on clusters, and have more flexible mechanisms for trading off traditional consistency for other useful properties, such as increased performance, better scalability and ease of application development. Currently, there are several primary types of NoSQL databases:

• Key/Value data model: This NoSQL storage is simply considered as one hash table or one relational tuple with two attributes, a key and its corresponding value, and all access to the database is via the primary key. Because key-value storage always uses primary-key access, it generally has great performance and can be easily scaled. Typical implementations of the Key/Value data model are distributed in-memory databases, such as Memcached DB, Redis, Riak and Berkeley DB.

• Document data model: This NoSQL model stores document contents in the value part of the Key/Value model using XML, JSON and BSON types. It can be recognized as key-value storage in which the value is self-describing. With a document model, we can quickly query part of the document rather than the entire contents, and data indexes can be created on the contents of the document entity. Some popular implementations of the document data model are MongoDB, CouchDB, and RavenDB.

• Column-Family data model: Column-family storage can be considered a generic key-value model. The only special part is that the value consists of multiple column-families, each of which is a Map data structure. In this Map structure, each column has to be part of a single column family, and the column acts as the basic unit for access, with the assumption that data for a particular column family will usually be accessed together. As an open-source clone of Google's Bigtable, HBase is a sparse, distributed, persistent multidimensional implementation of the column-family model, indexed by row key, column key, and timestamp, and frequently used as a source of input and as storage of MapReduce output. Other popular column-family databases are Cassandra, Amazon DynamoDB and Hypertable.

• Graph data model: The graph data model focuses on the relationships between the stored entities and is designed to allow us to find meaningful patterns between the entity nodes. The graph data model can represent multilevel relationships between domain entities for things such as categories, paths, time-trees, quad-trees for spatial indexing, or linked lists for sorted access. Some representative examples are Neo4J, FlockDB and OrientDB.

Based on the good scalability, large-scale concurrent processing capability and MapReduce parallel model of NoSQL databases, preliminary research has been conducted toward exploring distributed spatial data storage and processing [20–24].

3. Design Issues

With the rapid development of the Internet, the map tile service is more and more widely used, and a large number of users might access the service at the same time. This creates processing pressure on the server; thus, using high-performance computing technologies to optimize map tile data management and access becomes the natural choice. The tile data management approaches of current mainstream map tile services, such as Google Maps, Bing Maps, Baidu Map, etc., are still black boxes and not known to the outside world. Therefore, in view of the tremendous advantages of the emerging NoSQL storage and MapReduce parallel computing frameworks, we try to explore the key technologies of using NoSQL databases to manage map tile data, including fast tile-clipping methods and real-time tile query approaches. In particular, we use open source NoSQL databases rather than commercial products to achieve efficient management of map tile data. The goal of this paper is to provide a low-cost, effective and transparent NoSQL-based technical framework, associated with key approaches, for readers who need to build their own map tile services. The map tile service we have envisioned consists of two types of application scenarios:

(1) The user is interested in a single layer and submits query and visualization requests. In this case, tile data of the single layer are generated into a huge number of pictures stored on the server side in advance. Hence, when the query request arrives, the server will return the corresponding tile picture collection according to the query range. Generating tiles in advance is also called static tile clipping.

(2) The user is interested in a number of layers and submits query and visualization requests. In this case, because the user-requested layers cannot be predicted in advance, tile data of multiple layers are dynamically generated based on the Tile Pyramid model and user-defined parameters, only after query requests arrive. At this point, multiple layers are superimposed and rendered together and returned as a set of tile images. Generating tiles at running time is also called dynamic clipping. Furthermore, the dynamically generated clipped tiles are then stored in the database. Hence, when a similar request arrives next time, the result can be returned directly without dynamic clipping (see the sketch after this list).
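A minimal sketch of the dynamic-clipping scenario in item (2): check the tile store first, clip and render only on a miss, then persist the result so that a later similar request is served directly. All type and method names here (TileStore, TileRenderer, lookupTiles, clipAndRender, storeTiles) are hypothetical placeholders, not the prototype's API.

```java
import java.util.List;

/** Sketch of the dynamic-clipping scenario described above: serve cached tiles
 *  when present, otherwise clip on demand and persist the result. All names
 *  here are illustrative placeholders rather than the prototype's API. */
public class DynamicTileService {

  public interface TileStore {                       // NoSQL-backed tile cache
    List<byte[]> lookupTiles(List<String> layers, int level, double[] bbox);
    void storeTiles(List<byte[]> tiles);
  }

  public interface TileRenderer {                    // clips and renders vector layers
    List<byte[]> clipAndRender(List<String> layers, int level, double[] bbox);
  }

  private final TileStore store;
  private final TileRenderer renderer;

  public DynamicTileService(TileStore store, TileRenderer renderer) {
    this.store = store;
    this.renderer = renderer;
  }

  /** bbox = {xmin, ymin, xmax, ymax} of the query range. */
  public List<byte[]> getTiles(List<String> layers, int level, double[] bbox) {
    List<byte[]> cached = store.lookupTiles(layers, level, bbox);
    if (!cached.isEmpty()) {
      return cached;                                 // a similar request was served before
    }
    List<byte[]> generated = renderer.clipAndRender(layers, level, bbox);
    store.storeTiles(generated);                     // cache for the next similar request
    return generated;
  }
}
```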

Corresponding to the above two scenarios, the NoSQL database, with distributed storage and the parallel computing framework (i.e., the MapReduce framework), fully meets the needs of map tile clipping and querying. A typical example, as shown in Figure 1, illustrates how distributed storage and the MapReduce framework work together in data processing. Firstly, there is a cluster comprising multiple computer nodes. Then, at the distributed storage level, the data are partitioned into several blocks and stored separately at different nodes. Therefore, when a query or processing request is submitted, several parallel map tasks associated with the data blocks are set up. After the parallel map tasks are finished, the reduce task is performed to obtain the final computing result by merging temporary results output by the map phase. Furthermore, the MapReduce framework sets up a TaskTracker on each running node to manage and track the map or reduce task, and a JobTracker is built for the whole MapReduce workflow's control issues.

Figure 1. MapReduce framework based map tile data management.

Through the concurrent map tasks, the map data can be quickly clipped into tiles and distributed onto NoSQL storage, and real-time tile queries can also be achieved through the MapReduce framework. In addition, there are four types of NoSQL storage models, including key-value, document, column and graph. A key-value database such as Redis is often memory-based, but the huge amount of map tile data is often much larger than the amount of memory. The graph database is used for graph-type data management. A document database like MongoDB is often used for semi-structured data (e.g., social media user data) management. The column database physically stores the data in accordance with the column, and thus achieves a better data compression rate compared with row storage. In particular, it is critical to use a column database to physically store the tile data together to improve the I/O efficiency in query processing. Therefore, we use HBase, the most popular open source column database, to store massive map tile data, and HBase's built-in MapReduce framework is utilized to achieve both effective map clipping and real-time tile querying.

Furthermore, as to map tile services, spatially adjacent tiles are often accessed and returned together to the user, so adjacent tiles with the same level need to be stored in a server or adjacent servers in the cluster environment as much as possible. This avoids/reduces cross-server data access, thereby improving efficiency. At the same time, under HBase's column storage model, tiles are also put together at the physical storage level, significantly improving the I/O efficiency, and thus better meeting real-time access needs.

4. Vector-Based Tile Generation and Locating Method

4.1. Basic Methods

Tile maps typically use the Tile Pyramid model with a quadtree tile-index structure (as shown in Figure 2) for data organization. As a top-down hierarchical model of variable resolution, many map service providers, such as Google Maps, Bing Maps and ArcGIS Online, use it to organize their large map tiles.

In the Tile Pyramid model, each tile size is fixed to 256 × 256 pixels and the level determines the map size. At the lowest level of detail (Level 0), the global map is 256 × 256 pixels. When the level increases, the map size, including width and height, grows by a factor of 2 × 2: Level 1 is 512 × 512 pixels, Level 2 is 1024 × 1024 pixels, Level 3 is 2048 × 2048 pixels, and so on. Generally, the map size (in pixels) can be calculated as Equation (1):

map width = map height = 2^level × 256 (1)

As shown in Figure 2, a vector dataset can be organized into multi-level maps using a quadtree recursive subdivision strategy. Then, each tile (represented by one rectangular block) of one specific level can be clipped and rendered into one picture as a part of the tile map. For a given map tile, we can use the triple T = {level, row, col} to identify it, where level represents the resolution level of the tile, starting with level 0; row and col denote the row coordinate and column coordinate of the corresponding tile in the current level, respectively. Obviously, the N-level (N > 0) tile layer is the successive result clipped from the (N-1)-level. For instance, tile level 1 is the further clipped result of level 0; the rows and columns of level 0, bisected in half, result in a total of four tiles, and the row coordinates and column coordinates can simply be numbered from 0 to 1. After that, the upper left corner of the tile map can be identified by the tile triple {1, 0, 0}. Likewise, level-2 tile generation is based on level-1 tiles, and each tile will be equally divided into four portions, namely, rows and columns are dissected into four equal portions, resulting in a total of 16 tiles. Then, the q-level tile map has its rows and columns divided equally into 2^q portions, finally resulting in a total of 4^q tiles after rendering operations. Using this method, we can accurately derive the triple identification for any tile of any level.
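To make Equation (1) and the {level, row, col} identification concrete, the following sketch computes the map size at a given level and the triple of the tile containing a geographic coordinate. It assumes the global grid of Figure 2 (longitude -180..180, latitude -90..90, row 0 and column 0 at the upper-left corner); the class and method names are illustrative, not part of the prototype.

```java
/** Minimal sketch of the Tile Pyramid arithmetic in Section 4.1.
 *  Assumes the quadtree grid of Figure 2: the world spans lon -180..180,
 *  lat -90..90, with row 0 / col 0 at the upper-left corner. */
public final class TilePyramid {

  public static final int TILE_SIZE = 256;

  /** Equation (1): map width = map height = 2^level * 256 pixels. */
  public static long mapSizePixels(int level) {
    return (1L << level) * TILE_SIZE;
  }

  /** Number of rows (= columns) of tiles at a level: 2^level. */
  public static long tilesPerAxis(int level) {
    return 1L << level;
  }

  /** Tile triple {level, row, col} containing the point (lat, lon). */
  public static long[] tileTriple(int level, double lat, double lon) {
    long n = tilesPerAxis(level);
    long col = (long) Math.floor((lon + 180.0) / 360.0 * n);
    long row = (long) Math.floor((90.0 - lat) / 180.0 * n);
    // Clamp points lying exactly on the east/south edge into the last tile.
    col = Math.min(Math.max(col, 0), n - 1);
    row = Math.min(Math.max(row, 0), n - 1);
    return new long[] { level, row, col };
  }
}
```

For example, mapSizePixels(3) returns 2048, matching the Level 3 value quoted above, and tileTriple(1, 60.0, -100.0) identifies the upper-left tile {1, 0, 0} under the assumed grid convention.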

Figure 2. Quadtree-based tile-clipping pattern for global map tile generation.

In the application of a massive tile map service, a tile map with a very large coverage area is often generated from many sub-maps. Hence, different tile partitions clipped by a sub-vector dataset need to be identified in the context of a larger tile map. The key to a map tile-locating algorithm is how to calculate the tile identifier set covered by the query region. Different sub-map documents could have inconsistent ranges, resulting in differences in the row- and column-encoded rule of the tile cache generated by the clipping operation for each map document. To conduct a global uniform tile-locating operation in practical applications, we need to specify the origin for both the global tile grid of the large map and the local tile grid of the sub-map. In Figure 3, tile-locating methods for a map of China, used as an example, illustrate the tile retrieval course in a large vector map scope. In this example, before starting to render tile data for the sub-scope, we need first to designate the origin O for the global map grid. Here, we set the geographic coordinate of the upper left corner of China's geographic scope to be the origin of the global grid, and we also specify the display ratio for each level and the height and width of the tile unit. Then, both the size of the grid of each specific level and the row and column number covered by the global grid can be determined. The sub-map region in Figure 3 is divided into a grid that has eight rows and ten columns. When some local area in this scope needs to be clipped into tiles, we can quickly find the corresponding grid identifiers at the same level of the global coverage grid. Point O' in Figure 3 represents the origin of one sub-map's local area in the global map and is numbered 22 in the country tile grid at the same level. Using this method, we can use geographic coordinates to compute the corresponding tile level and the coverage area in the level for tile query requests and then easily calculate the tile identifiers covered by this geographic range. Vector tiles of large maps, such as the China map example in Figure 3, can be clipped in parallel and deployed in a distributed manner using this tile generation strategy.
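The tile-locating step above can be expressed as a small coordinate-to-identifier computation. The sketch below, with assumed conventions (upper-left global origin O, rows increasing southwards) and illustrative names, returns the row/column range of tiles covered by a query rectangle at one level; a sub-map such as the 8 × 10 grid of Figure 3 would simply offset these identifiers by the position of its local origin O' in the global grid.

```java
/** Sketch of the global tile-locating step: compute the identifier range of
 *  tiles covered by a query rectangle at one level. The upper-left global
 *  origin and the per-level tile ground size are assumed inputs. */
public final class TileLocator {

  /** Returns {rowMin, rowMax, colMin, colMax} of tiles intersecting the query box.
   *  originX/originY: geographic coordinate of the global grid's upper-left corner O.
   *  tileWidth/tileHeight: ground extent of one tile unit at this level. */
  public static long[] coveredTiles(double originX, double originY,
                                    double tileWidth, double tileHeight,
                                    double xmin, double ymin, double xmax, double ymax) {
    long colMin = (long) Math.floor((xmin - originX) / tileWidth);
    long colMax = (long) Math.floor((xmax - originX) / tileWidth);
    long rowMin = (long) Math.floor((originY - ymax) / tileHeight); // rows grow southwards
    long rowMax = (long) Math.floor((originY - ymin) / tileHeight);
    return new long[] { rowMin, rowMax, colMin, colMax };
  }
}
```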


Figure 3. Local map area locating in a global tile map grid.

4.2. Tile-Clipping Range Expansion Algorithm

Because map documents for clipping are often in an irregular figure pattern, generating tiles by the map range will cause the clipping results to be inconsistent with the tile range defined by the Tile Pyramid model. In other words, the original map range needs to be extended to just cover an integer number of tiles. Therefore, we present a range expansion algorithm to extend the original range of the map, taking the expanded range as the clipping range.

A range expansion algorithm primarily refers to appropriately adjusting the actual data range with certain rules according to the clip goal to easily address it. Generally, let the irregular clip region expand to a rectangular (rectangle or square) area, resize the zone height and width to the same size as the tile grid unit, and then set the expanded scope as the new clipping range.

For the sake of simplicity, in our study, we extend the map document to a square figure before the tile data generation. This method primarily sets the left bottom as the origin of the original scope, expanding left to right and bottom to top. The original scope is expanded to a new clipping area according to this rule; then, the processing vector data for generating each level's map tiles is increased by the exponential growth model in this context.

Assume that the original range is expressed as {(x0min, y0min), (x0max, y0max)}, the expanded range is denoted as {(xmin, ymin), (xmax, ymax)}, and the height and width of the tile grid unit are represented by Height and Width, respectively. Let rate0 = (y0max − y0min)/(x0max − x0min); thus,

(1) When rate0 > Height/Width, then ymax = y0max, (y0max − y0min)/(xmax − x0min) = Height/Width, and the tile map expands to the left horizontally, as shown in Figure 4.

Figure 4. Tile map horizontal expansion.

(2) In Figure 5, if rate0 = Height/Width, i.e., xmax = x0max, ymax = y0max, then the tile map does not need any expansion.

Figure 5. No expansion in the tile map.

(3) If rate0 < Height/Width, then xmax = x0max, (ymax − y0min)/(xmax − x0min) = Height/Width, and the tile map expands to the top vertically, as shown in Figure 6.

Figure 6. Tile map vertical expansion.
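The three cases above reduce to a single aspect-ratio comparison against Height/Width. The following sketch is one illustrative reading of that expansion rule, keeping the lower-left origin (x0min, y0min) fixed; it is not the authors' implementation.

```java
/** Sketch of the range-expansion rule in Section 4.2: extend the original data
 *  range so that its aspect ratio matches the tile grid unit (Height/Width),
 *  keeping the lower-left origin (x0min, y0min) fixed. Illustrative only. */
public final class RangeExpansion {

  /** range = {x0min, y0min, x0max, y0max}; returns the expanded {xmin, ymin, xmax, ymax}. */
  public static double[] expand(double[] range, double tileHeight, double tileWidth) {
    double x0min = range[0], y0min = range[1], x0max = range[2], y0max = range[3];
    double rate0 = (y0max - y0min) / (x0max - x0min);
    double target = tileHeight / tileWidth;
    double xmax = x0max, ymax = y0max;
    if (rate0 > target) {
      // Case (1): the data range is relatively tall, so expand horizontally
      // until (y0max - y0min)/(xmax - x0min) = Height/Width.
      xmax = x0min + (y0max - y0min) * tileWidth / tileHeight;
    } else if (rate0 < target) {
      // Case (3): the data range is relatively wide, so expand vertically (towards the top)
      // until (ymax - y0min)/(x0max - x0min) = Height/Width.
      ymax = y0min + (x0max - x0min) * tileHeight / tileWidth;
    }
    // Case (2): rate0 equals Height/Width, so no expansion is needed.
    return new double[] { x0min, y0min, xmax, ymax };
  }
}
```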

5. NoSQL-Based Vector Map Tile Storage Access Model

A tile map pyramid, as a hierarchical model with variable resolution, has many excellent characteristics, such as massive storage capacity and a dynamic update mechanism. Using the world vector tile map as an example, the total tile number of the 1 to 20 zoom levels is nearly 3 × 10^12. In this massive data scenario, the relational database and file system-based management approach, which is not suitable for the storage and processing of numerous tiles, will seriously affect data storage and retrieval actions, with slow computation and update rates. NoSQL databases support mass data storage and highly concurrent queries with better scalability and other great features as well. Moreover, some NoSQL variants, such as HBase, built on top of a distributed file system, also provide a fault-tolerant, replicated and scalable storage mechanism and are suitable for processing huge quantities of tile data records. Therefore, we here consider applying NoSQL methods to build a distributed storage management model for massive tile map data.

5.1. Distributed Storage Model Based on Column-Family

In our studies, map tile data, which are clipped from a vector dataset, are stored using a NoSQL column-family model similar to Google's BigTable.
In HBase, a table is comprised of a series of rows consisting of a RowKey and a number of ColumnFamilies. RowKey is the primary key of the table, and the records in the table are sorted according to RowKey. A ColumnFamily can be composed of any number of Columns; that is, a ColumnFamily supports dynamic expansion without the need to pre-define the number and type of Columns. Directly concatenating the separated field values is one option for the RowKey design, but in this case the row key is a string of variable length, and delimiters should be used to connect the separated elements. One alternative is to use a hash algorithm like MD5 to obtain a fixed-length row key. Using "MD5 (Map_Name) MD5 (Level) MD5 (Row) MD5 (Col) + reverse timestamp" as the RowKey has two advantages: (1) a fixed-length RowKey could help HBase better predict the read and write performance; (2) the RowKey is usually expected to be evenly distributed, and the MD5-based method enables this goal. Because data blocks are automatically distributed onto region servers by row keys in HBase, evenly distributed keys can avoid the problem of writing hot spots, and thus avoid load concentration in a small part of the region servers.

As shown in Figure 7, RowKey is used to globally identify a specific tile object, initially composed of a map document identifier, tile level, row number, and column number. Furthermore, to facilitate frequent updating and fast locating, a reverse timestamp for each data update action is also considered as one component of RowKey to ensure that the latest version of the updated tile data is automatically arranged at the top of the tile data table. This allows the latest tile version to be given the utmost access priority in a query scheduling session. Based on that, we use the method of calculating 16-bit MD5 (Message-Digest Algorithm 5) values of each field and then combining them together to yield the RowKey (i.e., the GMK function, as shown in Figure 8). In column-family storage, we adopt a single column-family structure (only one ColumnFamily, Tile, and one Column, Image) to speed up data access while avoiding the wide-column performance cost. The value of Image is the corresponding map binary data for this tile unit.

RowKey: MD5(Map_ID) MD5(Level) MD5(Row) MD5(Col) + reverse ts; ColumnFamily: Tile; Column: Image; Value: Tile Binary.

Figure 7. Column-family-based tile data storage model.
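The GMK-style RowKey construction can be sketched as follows. The 16-hex-character truncation of each field's MD5 digest and the reverse-timestamp formula (Long.MAX_VALUE minus the update time) are assumptions used for illustration; only the overall composition "MD5 (Map_Name) MD5 (Level) MD5 (Row) MD5 (Col) + reverse timestamp" comes from the text.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Sketch of a GMK-style RowKey: a fixed-length MD5 digest per field plus a
 *  reverse timestamp so the newest tile version sorts first. The truncation
 *  length and reverse-timestamp formula are assumptions for illustration. */
public final class TileRowKey {

  public static String generate(String mapId, int level, long row, long col, long updateMillis) {
    // Reverse timestamp: newer updates produce smaller values and therefore sort first.
    String reverseTs = String.format("%019d", Long.MAX_VALUE - updateMillis);
    return md5Prefix(mapId) + md5Prefix(String.valueOf(level))
         + md5Prefix(String.valueOf(row)) + md5Prefix(String.valueOf(col))
         + reverseTs;
  }

  /** First 16 hex characters of the MD5 digest of one field value. */
  private static String md5Prefix(String field) {
    try {
      MessageDigest md = MessageDigest.getInstance("MD5");
      byte[] digest = md.digest(field.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder();
      for (byte b : digest) hex.append(String.format("%02x", b));
      return hex.substring(0, 16);
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 not available", e);
    }
  }
}
```

Because the hashed prefixes are uniformly distributed, consecutive tiles of one map no longer cluster on a single region server, which is exactly the hot-spot avoidance argued for above.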

Figure 8. Distribution of map tiles in a NoSQL cluster. Figure 8. Distribution of map tiles in a NoSQL cluster. Based on this, weFigure propose 8. aDistribution NoSQL cluster-base of mapd tiles map in tile a NoSQL distributed cluster. storage model: as shown in Figure 8, in the process of clipping, the different levels of tile data of the map document generate theBased corresponding on this, we calculated propose RowKey a NoSQL, namely, cluster-base the tiled identifier, map tile distributedaccording to storage the GMK model: method as (i.e.,shown Basedin Figure“MD5 on (Map_Name) 8, this, in the we process propose MD5 (Lofevel) clipping, a NoSQL MD5 (Row)the cluster-based differen MD5 (Col)t levels + mapreverse of tile tile timestamp data distributed of”). the Then, map storage tile document data model: blocks generate are as shown in Figurethestored corresponding8, in into the records process calculated of ofthe clipping, NoSQL RowKey column the, namely, different table, theand tile levelsdifferent identifier, of records tile according data are of assigned the to mapthe to GMK different document method Region generate(i.e., the corresponding“MD5nodes (Map_Name) according calculated toMD5 the (Lpartevel)RowKeyitioning MD5, strategy;(Row) namely, MD5 here, the (Col) tileRegion + identifier, reverse refers timestamp to accordingthe basic”). Then,element to thetile of dataGMK availability blocksmethod are (i.e., “MD5stored (Map_Name)and intodistribution records MD5 forof (Level)thethe NoSQLtile MD5table column and (Row) is compritable, MD5 andsed (Col) differentof +record reverse recordspartitions timestamp are for assigned ”).the Then,column to different tile family data TileRegion blocks. are storednodes intoTypically, according records column-based of to the the NoSQL part NoSQLitioning column databases strategy; table, such here, and as HBaseRegion different achieve refers records to scalable the arebasic storage assigned element of mass toof differentavailabilitydata and Region andbalanced distribution data distributionfor the tile viatable different and is Regioncompri divisions,sed of record and features partitions such for as theMemStore column and family HLog Tile . nodes according to the partitioning strategy; here, Region refers to the basic element of availability and Typically,mechanisms column-based can ensure NoSQLhigh efficiency databases for readingsuch as andHBase writing achieve tile datascalable and highstorage reliability of mass of data data. and Tile distributionbalancedWhen for data the querying distribution tile table and and viaacquiring isdifferent comprised tile Regionmaps of with divisions, record different partitions and access features forpatterns, thesuch column asthe MemStore RowKey family of andthe tile.HLog Typically, column-basedmechanismsmap to be NoSQL accessedcan ensure databases first high needs efficiency suchto be asdetermined for HBase reading achieve according and writing scalable to the tile document storagedata and identifier of high mass reliability dataof the andvector of data. balanced data distributionmap,When the levelquerying via of different tile and data, acquiring andRegion the rowdivisions, tile and maps column andwith number. features different These such access parameters as patterns, MemStore are the passed and RowKey HLog to the of mechanismsGMK the tile can ensuremapfunction to high be accessedto efficiency generate first a for needshash-based reading to be andRowKeydetermined writing. 
Then, according tile the data RowKeys andto the of high documentthe reliabilitytile map identifier to of be data. queried of the vectorare Whenmap,submitted the querying level to of the tile and master data, acquiring nodeand thein the tilerow NoSQL maps and column withcluster. different number. The master accessThese node, parameters patterns, which maintains the are RowKeypassed the global toof the the tile GMK tile map metadata, processes the received information and quickly returns the target data node path in which to befunction accessed to first generate needs a to hash-based be determined RowKey according. Then, the to theRowKeys document of the identifier tile map ofto thebe queried vector map,are the levelsubmitted of tile data, to andthe master the row node and in columnthe NoSQL number. cluster. These The master parameters node, which are passed maintains to thethe GMKglobalfunction tile to generatemetadata, a hash-based processes theRowKey received. Then,information the RowKeys and quicklyof thereturns tile the map target to be data queried node path are submittedin which to

ISPRS Int. J. Geo-Inf. 2016, 5, 215 11 of 25 the master node in the NoSQL cluster. The master node, which maintains the global tile metadata, processes the received information and quickly returns the target data node path in which the tiles are stored. After that, the tile requestor can obtain the desired tile data by directly accessing the data node through RowKey matching. This management mechanism can effectively lighten the increased workload via high concurrency and can also improve the efficiency of storage and the access of tile map data.ISPRS Int. J. Geo-Inf. 2016, 5, 215 11 of 25

5.2. Vector Tile Map Parallel Clipping and Query Framework Based on MapReduce

Based on the distributed NoSQL tile data storage model, we propose a framework that is based on MapReduce for parallel clipping and fast scheduling of map tiles. This framework aims to improve the access efficiency of mass tile data. As shown in Figure 9, the bottom of the framework is a NoSQL storage cluster (e.g., an HBase database), in which multiple region hosts, acting as data nodes, are used to store distributed tile map data, deploying different tile records in the column-family stores (hereinafter referred to as TCFS, e.g., HTable of HBase) according to our tile data model. Record sets in TCFS are split into multiple independent partitions—tile regions, according to region unit capacity—and are stored in the different data nodes to which their respective regions were allocated. The coordination service of NoSQL (such as ZooKeeper) synchronizes tile data regions among different data nodes, maintaining the mapping relationship between the global tile datasets and each part in the data nodes and ensuring the logical integrity of the global tile dataset. The master node performs the unified allocation and scheduling of the storage resources of tile data on different data nodes according to regions. The advantage is that access to multiple map tiles is performed directly by the corresponding distributed data nodes. In addition, the data redundancy replica mechanism supported by the distributed file system (e.g., HDFS) ensures the reliability of tile data. Based on this framework, we realized the parallel generation and retrieval methods of vector map tiles using the MapReduce model.
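For readers who want to reproduce the storage layer, the following minimal sketch creates an HBase table playing the role of the TCFS, with a single column family Tile and pre-split regions so that tile records are spread across RegionServers from the start. It uses the classic client API of the HBase 0.9x line mentioned in the experiments; the table name and split points are hypothetical choices, not values prescribed by the paper.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

/** Sketch: create a tile column-family store (TCFS) with one "Tile" family,
 *  pre-split into regions so tile records spread across RegionServers. */
public class CreateTileTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);          // classic API of the HBase 0.9x line

        HTableDescriptor table = new HTableDescriptor(TableName.valueOf("tile_store"));
        table.addFamily(new HColumnDescriptor("Tile"));   // tile binaries live in this family

        // Hypothetical split points: hex RowKey prefixes so regions are balanced across nodes.
        byte[][] splits = {
            Bytes.toBytes("2"), Bytes.toBytes("4"), Bytes.toBytes("6"),
            Bytes.toBytes("8"), Bytes.toBytes("a"), Bytes.toBytes("c"), Bytes.toBytes("e")
        };
        admin.createTable(table, splits);
        admin.close();
    }
}
```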

Figure 9. MapReduce-based tile map parallel processing framework.


5.3. Tile Map Parallel Clipping Algorithm

The map tile-clip operations generally fall into two categories, static clip and dynamic clip, according to different application scenarios. Static tile clipping mainly refers to pre-generating a tile map before a user accesses the data and distributing the tile map among tile databases or some type of file systems. At present, the main handling method of static clipping is to use either a sequential processing mechanism or a multi-process computation method, and both are based on standalone nodes. However, the extension of computational capabilities is seriously limited by the hardware capacity of a standalone computer, particularly when a vector map consists of many sub-documents or a map document contains a large number of levels; in such cases, it often takes a long time to clip maps, sometimes even hours. This could lead to extreme inefficiencies in static tile generation.

In our framework, by using the MapReduce distributed data processing model, a parallel algorithm is presented to execute tile generating operations (see steps C1 to C5 in Figure 9): a client submits a clip task and uploads the computing resource—the map doc set, which includes multiple vector map documents and task configuration information (tile size: tile height/width; the maximum level of tiles: maxlevel; etc.)—to the task execution endpoint. JobTracker, the task scheduling node of the MapReduce framework, accepts the task and then assigns it to various job execution nodes—TaskTracker. Each TaskTracker automatically downloads the appropriate computing resources from the task resources pool for parallel execution of sub-tasks. First, Mapper instances are constructed to perform Map calculations. As shown in Figure 10, the main process of the Map algorithm is described as follows: t represents a tile unit in a specified level of the given map document. The clip function performs the tile generating calculation for each t to output the corresponding tile data, which are then put into a TCFS record table (i.e., an HBase table). Each map-task's computing result is a Key/Value object identified by a tile record access number (i.e., rk); then, all the rk keys are sent to the MapReduce framework for the next Reduce execution phase. In the Reduce stage, TaskTracker performs the Reduce operation when receiving the execution request of sub-tasks, and parallel merging is performed on the tile access numbers that represent the tile units successfully yielded according to different vector map documents. The final result set is returned to the client after merging. In the entire process, through execution of the two stages, Map and Reduce, the MapReduce framework accomplishes the parallel computation work for large-volume map tile data.

Figure 10. Pseudo-code for the static tile-clip algorithm in MapReduce.
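The pseudo-code of Figure 10 can be approximated with the standard Hadoop MapReduce API as follows. This is a simplified sketch under our own assumptions: clip() stands in for the MapGIS-backed rendering call, rowKey() for the GMK function, and the input format (one map document name and one level per line) is illustrative only.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** Sketch of the static-clip Map/Reduce pair; clip() and rowKey() are placeholders (assumptions). */
public class StaticClip {

    public static class ClipMapper extends Mapper<LongWritable, Text, Text, Text> {
        private HTable tileTable;

        @Override
        protected void setup(Context ctx) throws IOException {
            tileTable = new HTable(ctx.getConfiguration(), "tile_store"); // TCFS record table
        }

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            // Each input line names one map document and one tile level, e.g. "geology_1_200000 3".
            String[] parts = value.toString().split("\\s+");
            String mapDoc = parts[0];
            int level = Integer.parseInt(parts[1]);
            long tilesPerAxis = 1L << level;                      // n = 2^level

            for (long row = 0; row < tilesPerAxis; row++) {
                for (long col = 0; col < tilesPerAxis; col++) {
                    byte[] png = clip(mapDoc, level, row, col);   // render one 256 x 256 tile
                    String rk = rowKey(mapDoc, level, row, col);  // GMK-style identifier
                    Put put = new Put(Bytes.toBytes(rk));
                    put.add(Bytes.toBytes("Tile"), Bytes.toBytes("png"), png);
                    tileTable.put(put);
                    ctx.write(new Text(mapDoc), new Text(rk));    // rk goes on to the Reduce phase
                }
            }
        }

        @Override
        protected void cleanup(Context ctx) throws IOException {
            tileTable.close();
        }
    }

    public static class MergeReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text mapDoc, Iterable<Text> rowKeys, Context ctx)
                throws IOException, InterruptedException {
            // Merge the access numbers of tiles generated for one map document.
            StringBuilder merged = new StringBuilder();
            for (Text rk : rowKeys) merged.append(rk).append(';');
            ctx.write(mapDoc, new Text(merged.toString()));
        }
    }

    // Placeholders for the MapGIS-backed clip call and the GMK RowKey; both are assumptions here.
    static byte[] clip(String mapDoc, int level, long row, long col) { return new byte[0]; }
    static String rowKey(String mapDoc, int level, long row, long col) {
        return mapDoc + "_" + level + "_" + row + "_" + col;
    }
}
```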


As to the clip function implementation details, the rendering function is provided by MapGIS K9, one of the most popular GIS software platforms in China. The MapGIS rendering function's prototype statement is listed as follows: void rendering(Layer layers, SLD descriptors, GeoEnvelope domain, Rectangle png_size, String output_path); where layers represents the multiple map documents that can be specified, descriptors represents the corresponding SLD (Styled Layer Descriptor) files (i.e., each map document has its own SLD file), domain represents the geographical rectangle for map rendering, png_size represents the desired output PNG picture's size and is fixed to the tile unit size of 256 × 256 in the clip function, and output_path represents the output file path for the rendered picture. Because the clip function passes parameters such as the map documents associated with SLD files, level, xtile (tile index on the x axis) and ytile (tile index on the y axis), the following Equations (2)–(4) are used to convert level, xtile and ytile into the geographical rectangle comprising latitude and longitude ranges. After these transformations, the rendering function can be invoked by the clip function, and the tile picture can thus be rendered correctly. For example, the geographical rectangle of the tile (level = 1, xtile = 0, ytile = 0) is calculated as (longitude(2^1, 0) latitude(2^1, 0), longitude(2^1, 1) latitude(2^1, 1)) = (−180° 85.05°, 0° 0°).

n = 2^level (2)

longitude(n, xtile) = xtile/n × 360.0 − 180.0 (3)

latitude(n, ytile) = arctan(sinh(π × (1 − 2 × ytile/n))) × 180/π (4)
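Equations (2)–(4) translate directly into code; the following sketch computes the geographic bounding box of a tile and reproduces the (level = 1, xtile = 0, ytile = 0) example from the text. The class and method names are our own.

```java
/** Sketch: convert a tile address (level, xtile, ytile) into its geographic bounding box
 *  using Equations (2)-(4); longitudes/latitudes are in degrees (Web-Mercator tiling). */
public class TileExtent {

    static double longitude(long n, long xtile) {              // Equation (3)
        return xtile / (double) n * 360.0 - 180.0;
    }

    static double latitude(long n, long ytile) {               // Equation (4)
        return Math.toDegrees(Math.atan(Math.sinh(Math.PI * (1 - 2.0 * ytile / n))));
    }

    /** Returns {minLon, minLat, maxLon, maxLat} for one tile. */
    static double[] extent(int level, long xtile, long ytile) {
        long n = 1L << level;                                   // Equation (2): n = 2^level
        return new double[] {
            longitude(n, xtile), latitude(n, ytile + 1),        // south-west corner
            longitude(n, xtile + 1), latitude(n, ytile)         // north-east corner
        };
    }

    public static void main(String[] args) {
        // The example from the text: tile (level = 1, xtile = 0, ytile = 0)
        // spans roughly (-180, 0) to (0, 85.05) degrees.
        double[] box = extent(1, 0, 0);
        System.out.printf("%.2f %.2f %.2f %.2f%n", box[0], box[1], box[2], box[3]);
    }
}
```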

In a dynamic tile-clip manner, vector datasets need not be processed in advance; instead, the corresponding tile map is dynamically generated according to the map zoom level, query range, etc., only when users access vector map tiles, and the generated map tiles are then stored into the NoSQL TCFS store table. The main difference from static tile production is that tile data are dynamically generated and loaded when users continuously access a series of tiles across multiple map levels, multiple non-adjacent tiles or some specific areas, rather than using tiles generated in advance. Because the map data do not need pre-processing, this method is more flexible and convenient and saves storage space on the server side. However, this processing mode requires real-time tile generation, so the background processing algorithm must have low time complexity. Currently, most dynamic tile-clipping implementations are also based on a quadtree index structure; when users randomly access a single tile or multiple tiles of a region, the framework first checks whether the requested data exist in the tile datasets; if not, the tile-clipping operation is performed on the vector map data using quadtree-based map clip algorithms. The scenario shown in Figure 11 is a quadtree-based tile map partition graph in a dynamic request session, where the corresponding geographic location of the tile map is Wuhan City, China.

In the NoSQL-based tile processing algorithm framework we have proposed, a parallel dynamic clipping method on the basis of the MapReduce model is also considered; the algorithm steps (D1 to D6) are shown in Figure 9. The basic principle of each computation unit is roughly the same as for static tile clipping—the inputs are the tile level and the map range generated from the map doc set accessed dynamically. After tasks and related computing resources are submitted to the execution end, JobTracker of the MapReduce framework notifies each sub-task execution end, i.e., TaskTracker, to fetch its corresponding map doc and allocated resources; each TaskTracker constructs a Mapper object instance to execute the Map method, as shown in Figure 12, and the tiles function creates tile object collections using the spatial extent m within level l of the map document object d. For each tile object t in tiles, the tile's binary is obtained from the global tile record table of NoSQL; if its value is empty, the clip function is called to generate the object t immediately; the calculation results are stored in the tile data table, and its access number rk is encapsulated as an object that can be processed in the Reduce stage; if the value is not empty, it indicates that the tile data have already been generated; then, the record number is packaged into Key/Value objects and dispatched to the Reduce stage.
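The per-tile decision at the heart of the dynamic Map method just described (look up the tile, clip it only if it is absent) can be written as follows; it reuses the hypothetical clip() and rowKey() placeholders from the static-clip sketch above and is an illustration, not the authors' implementation.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

/** Sketch of the per-tile step of the dynamic Map method: reuse a tile if it already exists in the
 *  global tile record table, otherwise clip it on the fly and store it. Helper names are assumptions. */
public class DynamicClipStep {

    /** Returns the RowKey (access number) of the tile, clipping it first if necessary. */
    static String clipIfMissing(HTable tileTable, String mapDoc, int level, long row, long col)
            throws IOException {
        String rk = StaticClip.rowKey(mapDoc, level, row, col);   // GMK-style identifier (see above)
        Get get = new Get(Bytes.toBytes(rk));
        Result result = tileTable.get(get);

        if (result.isEmpty()) {                                   // tile not generated yet
            byte[] png = StaticClip.clip(mapDoc, level, row, col);
            Put put = new Put(Bytes.toBytes(rk));
            put.add(Bytes.toBytes("Tile"), Bytes.toBytes("png"), png);
            tileTable.put(put);
        }
        return rk;                                                // dispatched to the Reduce stage
    }
}
```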

The execution mechanism of the Reduce phase on subtasks is similar to that of static tile processing: the TaskTracker end starts the Reduce operation when qualified objects are received. The successfully clipped tile units and those already existing in the tile database are merged in parallel according to different vector map documents, and the final result set is sent back to the requestor after reduction.

Figure 11. Quadtree-based partition graph of a vector map in dynamic tile clipping.

Figure 12. Pseudo-code for the dynamic tile-clip algorithm in MapReduce.

5.4. Parallel Query of the Tile Map

The parallel query algorithm of the tile map (See Figure 13), primarily implemented by means of the MapReduce computing paradigm, yields the parallel retrieval of tile objects in our NoSQL tile storage model. The input of this algorithm is the query condition collection Q = {c | c = <d, l, m>}, where the triple c is comprised of three query conditions: d, l, and m, which denote the map document object, zoom level and coverage, respectively. Subtask execution ends (i.e., TaskTracker) obtain query objects from the collection set Q, download the computing resources, and quickly scan the tile table, constructing a tile object subset according to each query object. For each tile object t in the subset, the associated RowKey in the global tile data table is obtained and then packaged into computing units handled in the Reduce phase according to the query level. The EMIT function dispatches these units to the MapReduce task scheduler (i.e., JobTracker). It is worth noting that if the query result is null, the EMIT function sets the value of the emitted object to zero. The algorithm process in the Reduce stage is similar to that of dynamic tile clipping: composing a tuple according to the map document and query level, merging query results, and returning the final result to the requestor.
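A minimal sketch of the query-side Map step is given below: each tile covered by a query triple <d, l, m> is resolved to its RowKey and fetched with a direct Get, and either the key or a zero marker is emitted for the Reduce merge. The helper names and the way the tile range is derived from the extent m are assumptions.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

/** Sketch of the query-side Map step: resolve each tile of a query triple <d, l, m> to its RowKey,
 *  look it up directly, and emit the key (or a zero marker) for the Reduce merge. Assumed helpers. */
public class TileQueryStep {

    /** rows/cols describe the tile range covered by the query extent m at level l (computed elsewhere). */
    static List<String> lookup(HTable tileTable, String mapDoc, int level,
                               long rowMin, long rowMax, long colMin, long colMax) throws IOException {
        List<String> emitted = new ArrayList<>();
        for (long row = rowMin; row <= rowMax; row++) {
            for (long col = colMin; col <= colMax; col++) {
                String rk = StaticClip.rowKey(mapDoc, level, row, col);
                Result r = tileTable.get(new Get(Bytes.toBytes(rk)));
                // EMIT: the RowKey when the tile exists, a zero marker otherwise.
                emitted.add(r.isEmpty() ? "0" : rk);
            }
        }
        return emitted;
    }
}
```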

Figure 13. Pseudo-code for the parallel tile query algorithm in MapReduce.

6. Experiments

To evaluate the performance of the model proposed in the previous section in a real environment, we set up a test environment using the China Geological Survey Information Grid (hereinafter referred to as the CGS-Grid) and created a verification prototype. Based on this, we carried out a series of experiments. These experiments mainly compared our framework, in terms of tile storage efficiency and performance, with the traditional model, which adopts a local tile cache mechanism, and with a distributed file system-based management model.

Our testing prototype system is built on the CGS-Grid tile data center; the system architecture is shown in Figure 14. It is mainly composed of a tile data cluster with 8 NoSQL servers, where data nodes are connected with each other via Gigabit Ethernet. Each computation end is configured with one 2.8-GHz Dual-Core Intel processor, 4 GB of 1066-MHz DDR3 ECC memory and 1 TB of disk storage, running the CentOS Linux 5.8 64-bit operating system. One of these NoSQL nodes is used as a distributed tile storage controller and task scheduler (i.e., NameNode and JobTracker in Figure 14), whereas the seven other nodes are selected for data storage and job execution (i.e., DataNode and TaskTracker in Figure 14). The tile cluster uses the Hadoop 1.1.2 platform and the HBase 0.96 NoSQL database, which provide distributed NoSQL stores. To facilitate the comparison between our method and others, we also prepared one high-performance computer, which offers a standalone tile-clipping service using the classic model; its configuration is as follows: 24 Intel Dual-Core 2.8-GHz processors, 64 GB of memory, and a Fiber Channel-based Storage Area Network (FC SAN) with 50 TB of storage capacity. In these experiments, we used the unified public datasets of the China Geological Survey [25], which include 110 vector map documents (covering geophysical, geochemical and many other geological survey topics). The size of the vector datasets is over 300 GB, and the maximum level of LOD model-based tiles reaches 25. Tile map clipping and search functions are wrapped into RESTful GIS services in accordance with the different processing frameworks, including the HPC-based standalone GIS tile service, the Hadoop Distributed File System (HDFS)-based distributed tile processing system, and our tile processing algorithm framework based on MapReduce and NoSQL.


Figure 14. NoSQL cluster-based tile processing architecture.

Using the above-mentioned tile services, a web map client (See Figure 15) is developed to provide geological map services suitable for three different tile data storage and management models. Based on this, we conducted a group of experiments to compare the parallel processing capabilities and RS1 RS2 RS3 RS4 RS5 RS6 RS7 scalability between our NoSQL-based method and other common processing patterns. The former DataNode/TaskTracker/NoSQL RegionServer two experiments are used to evaluate the parallel tile query processing capabilities of our algorithm framework, whereas the latterFigure two 14. areNoSQL used cluster-based to assess tile its processing parallel architecture. tile-clipping performance.

Figure 15. Web portal for the tile map of a China geological survey in the GSI Grid.


6.1. Data Description

Our experiment is based on the real datasets of the China Geological Survey Information Service Platform [25]. The datasets are classified into 39 major categories and contain 247 thematic geological datasets (as shown in Table 1), including multi-scale geological vector maps of the ten most important geological spatial databases of China (covering the 1:200,000 and 1:500,000 geological maps, the 1:200,000 hydrological geological map, the 1:200,000 geochemical map, mineral deposits, natural heavy minerals, isotope distribution, the 1:500,000 environmental geological map and geological exploration databases) and dozens of thematic geological survey datasets. The volume of the datasets has reached about 1 TB [26] and is still expanding continuously.

Table 1. Data catalogues of China Geological Survey Information Service Platform.

ID   Layer Type
1    Stream sediment geochemistry exploration dataset (scale: 1:50,000/1:100,000/1:200,000/1:500,000)
2    Soil geochemical exploration dataset (scale: >1:50,000/1:50,000/1:100,000/1:500,000)
3    Petro-geochemical exploration dataset (scale: 1:200,000/1:500,000)
4    Comprehensive geochemical exploration dataset (scale: ≤1:100,000/>1:100,000)
5    1:200,000 scale geochemical exploration datasets of China (including 37 elements and compounds: Al2O3/Au/B/Ba/Be/Bi/CaO/Co/Cr/Cu/F/Fe2O3/K2O/Li/MgO/Mn/Mo/Na2O/Nb/Ni/P/Sb/SiO2/Sn/Sr/Th/Ti/V/W/Zn/Zr/Ag)
6    Geomagnetic survey dataset (scale: <1:25,000/≥1:25,000)
7    Geo-electric survey dataset (scale: ≤1:100,000/1:25,000/1:100,000/≥1:25,000)
8    Ground radiation survey dataset (scale: ≤1:100,000/>1:100,000)
9    Seismic survey dataset (scale: ≤1:100,000/>1:100,000)
10   Airborne magnetic survey dataset (scale: 1:1,000,000/1:500,000/1:200,000/1:100,000/>1:100,000)
11   Airborne electromagnetic survey dataset
12   Airborne radioactive survey dataset
13   Other methods of geophysical survey dataset
14   Deep geophysical investigation dataset
15   Gravimetric dataset (scale: 1:1,000,000/1:500,000/1:200,000/1:100,000/>1:100,000)
16   Comprehensive geophysical exploration dataset (scale: ≤1:100,000/1:25,000/1:100,000/≥1:25,000)
17   Engineering geological investigation dataset (scale: ≤1:100,000/>1:100,000)
18   Urban environmental geological investigation dataset
19   China geological hazard dataset (1:2,500,000 scale) (abrupt severe geological hazards/disaster-affected level datasets for landslide, rockfall and debris flow hazards/synthetic evaluation dataset for abrupt geological hazards)
20   Geological survey dataset of mining environment
21   Ecological environment geological survey dataset
22   Water pollution geological survey dataset
23   Nonmetallic mineral exploration dataset
24   Precious metals mineral exploration dataset
25   Ferrous metals mineral exploration dataset
26   Coal mineral exploration dataset
27   Nonferrous metals mineral exploration dataset
28   Comprehensive mineral exploration dataset
29   Regional geological survey datasets (scale: 1:200,000/1:250,000/1:500,000/1:1,000,000), mining area geological mapping dataset cum comprehensive regional geological survey dataset
30   Hydro-geological survey dataset (scale: 1:200,000/1:250,000/1:500,000/1:1,000,000) and other types of hydro-geological dataset (scale: >1:100,000/≤1:100,000)
31   Hydro-geological, engineering and environmental geological survey dataset and comprehensive geochemical survey dataset (scale: 1:50,000/1:100,000)
32   Natural heavy minerals dataset
33   China isotopic dating dataset
34   Drill dataset
35   China important thermal springs dataset (1:6,000,000 scale)
36   China important karst cave dataset (1:4,000,000 scale)
37   China geological park dataset (1:4,000,000 scale)
38   Qinghai-Tibet Plateau geological dataset (1:250,000 scale)
39   China groundwater resources dataset

6.2. Experiment 1—Concurrency

The first experiment compared the tile data access performance of the standalone file cache-based method, the distributed file system (Hadoop HDFS)-based method and our proposed NoSQL-based management framework for tile map dataset processing. In these experiments, we primarily utilized the same tile map sets (the China geochemical map set, at a scale of 1:200,000, contains 32 map documents; the vector data size of the set is approximately 20 GB, and the tile files generated by the standalone clipping operation, with a size of 46.3 GB, are used in one standalone HPC, a Hadoop cluster and the NoSQL cluster framework mentioned above) and conducted performance tests on three access patterns; in addition, the HDFS replica switch was closed during the test. The first pattern is accessing one particular map tile based on the identification triple of the tile object, and the corresponding experiment sets the single tile unit at row 0, column 0 of level 0 as the target area; the second mode accesses multiple tile maps of the same level by using the rectangle range condition, and the access scope in the experiment covers the four total tiles of map level 1; the third method is accessing multiple tile maps crossing several levels according to the rectangle range, and in the experiment, we used 21 total tiles covering level 0 to level 2 as the query target. The average access times of multiple vector map tile documents in the three architectures under different users and access models are analyzed. The experimental results are shown in Table 2.

Table 2. Average response time of three concurrent access patterns (seconds).

Virtual Users   Single-Tile Query                        Multiple-Tile Query in the Same Level    Multiple-Tile Query across Multiple Levels
                File System   HDFS    NoSQL              File System   HDFS    NoSQL              File System   HDFS    NoSQL
10              0.013         0.019   0.016              0.033         0.045   0.036              1.125         1.252   1.147
50              0.135         0.180   0.143              0.362         0.506   0.425              1.973         2.094   1.982
100             0.304         0.297   0.263              0.731         0.636   0.520              3.429         3.303   3.075
200             0.765         0.629   0.554              1.049         0.883   0.725              5.641         4.786   4.142
300             0.942         0.986   0.873              1.275         1.143   1.012              6.272         5.681   5.093
400             1.252         1.026   0.943              1.536         1.464   1.258              6.941         6.548   5.586
500             1.460         1.245   1.123              1.840         1.644   1.460              7.267         6.863   6.092

From Figures 16–18, we can see that when a concurrent query is on a small scale, the standalone system is considerably more efficient than the distributed architectures. In the case that the number of concurrent requests is less than 50—either concurrent queries of a single-tile unit or those of multiple tiles in the same level—the average response times of the distributed architectures are generally higher than those of the standalone server architecture. Nevertheless, the efficiency of our NoSQL-based framework is higher than that of the distributed file system architecture. Obviously, in this scenario, the underlying network communication cost of the distributed architecture is a higher proportion of the total execution time, and the time cost of data block scheduling across multiple nodes of the distributed file system is higher than that of parallel tile retrieval using the MapReduce model in our tile processing framework. However, when concurrent sessions reach more than 100, the average access time of tile data in the distributed environment is less than that of the local cache mode; this indicates that the concurrent scale has a specified threshold value and that, when the threshold is reached, a gradual increase in virtual users, in the complexity of tile querying (e.g., from simple queries in the same level to queries covering multiple levels) or in the total number of tiles to be queried will cause the efficiency of the mass query in the distributed environment to be much higher than that of the local file cache mechanism. In addition, under different modes of concurrent queries, the NoSQL-based parallel framework has better performance than the distributed file system-based framework. All of the above demonstrate that our NoSQL-based parallel processing model is suitable for fast data processing in high-concurrency and large-volume map tile scenarios.

Figure 16. Response time of a single-tile query by concurrent service requests.

Figure 17. Response time of a rectangle-range query in the same level.

Figure 18. Response time of a cross-level rectangle-range query.


6.3. Experiment 2—Scalability

In addition to the number of concurrent sessions and the ways in which tile data are accessed, the processing capabilities of tile map querying are also dependent on the number of nodes participating in the parallel processing, especially in large-scale vector tile map application scenarios. Experiment 2 is mainly used to explore the relationship between the response time of query requests and the number of processing nodes of the NoSQL architecture under the same query conditions. In the experiment, we used a China geological vector map on a scale of 1:200,000 (the size of the vector data is 7 GB, 17 tile levels are generated using the NoSQL framework, and the capacity of the tile record table of NoSQL reaches approximately 17.3 GB). We gradually increased the number of cluster nodes, one at a time, from three nodes at the beginning to seven nodes. Every time a node is added, a data balancer balances the distribution of the global tile data in the NoSQL table over all Region hosts, which can yield better load balance and improve storage scalability. The generation rule of query objects is described as follows: for each query, one tile unit of each level from level 3 to level 17 is randomly selected to compose a cross-level query object set (which includes 15 tiles). We conducted the experiment under different concurrent access sessions with the query objects and recorded the average response time of the framework. The results are shown in Table 3.

Table 3. Response time of a complex query with multiple nodes by concurrent requests (seconds).

Virtual Users   3 Nodes   4 Nodes   5 Nodes   6 Nodes   7 Nodes
10              0.948     0.949     0.950     0.949     0.953
50              1.283     1.280     1.287     1.285     1.283
100             1.562     1.560     1.556     1.548     1.522
200             1.982     1.983     1.962     1.931     1.904
300             2.586     2.543     2.492     2.456     2.421
400             3.663     3.526     3.364     3.181     3.042
500             4.247     4.132     3.916     3.785     3.472

The experimental data in Table 3 can be expressed as a line chart (Figure 19); as we can see, when the number of concurrent sessions is less than 100, the average response time of a query is not significantly affected by the increase in node members. However, when the number of virtual users reaches 100, the more nodes we have participating in the processing, the faster our framework responds to the query. In the scenarios of small-scale concurrent query tasks, the effect of improving the processing efficiency by increasing the number of NoSQL nodes is not apparent; the reason for this is that the parallel acceleration gained by increasing the processing nodes is partly offset by the concurrently increased cost of network communication. When the scale of the concurrent query grows to a particular threshold (e.g., 200 users in Figure 19), the communication overhead directly introduced by new nodes becomes negligible compared to the parallel processing acceleration capabilities brought by the expansion. The experimental results show that, by increasing the processing nodes properly, we can distribute query tasks to minimize the load on any single processing node and thus increase the supportable number of parallel tile query tasks. All of this can effectively expand the parallel processing capabilities of our NoSQL computing framework.

Figure 19. Response time of concurrent tile requests by different quantities of processing nodes.

Let u denote the number of concurrent sessions initiated by a tile query transaction, and let S be the number of tiles in the query tasks. S can be defined as follows:

S = ∑(i = 1 to u) ni (5)

where ni represents the number of tiles requested by the i-th concurrent session. For our NoSQL-based framework with N processing nodes (where each node has o processor cores), the average response time of the tile query operation Tr is given generally by:

Tr = S × (1/N) × (1/o) × ((t + cm)/m + (p + cr)/r) (6)

where t is the average query time of a single tile object in the NoSQL table using our data storage model (in our model, a tile record's RowKey is designed to be generated according to the row and column number of each tile unit, thus avoiding a full-table scan when performing a query operation), and p denotes the average time of the merge operation on query results, whereas m and r represent the number of Map and Reduce jobs, respectively, on one single node. In addition, cm and cr are random variables that obey exponential distributions with parameters λm and λr, respectively. They refer to the time overhead caused by uncertainty factors, such as communication latency, disk addressing and busy clients, when performing each Map or Reduce operation.

Obviously, the execution time Tr is mainly proportional to the number of concurrent users, the total number of tiles, the execution time of the query program and the network communication time, and it is inversely related to the number of nodes involved in the calculation. When the scale of S is small, an increase of N does not reduce Tr significantly, but when S reaches a certain level, continued growth of N has a more obvious effect on reducing Tr.
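Purely to illustrate the scaling behaviour implied by Equation (6) as reconstructed above, the following snippet evaluates Tr for hypothetical parameter values while the node count N grows; the numbers are not taken from the experiments.

```java
/** Illustrative-only evaluation of Equation (6) as reconstructed above; all parameter
 *  values are hypothetical and merely show how Tr shrinks as nodes N (or cores o) grow. */
public class ResponseTimeModel {

    static double tr(long S, int N, int o, double t, double cm, double p, double cr, int m, int r) {
        return S * (1.0 / N) * (1.0 / o) * ((t + cm) / m + (p + cr) / r);
    }

    public static void main(String[] args) {
        long S = 500 * 15;                                   // e.g., 500 sessions of 15 tiles each
        double t = 0.002, cm = 0.01, p = 0.001, cr = 0.008;  // seconds, hypothetical
        int o = 2, m = 8, r = 4;                             // cores and jobs per node, hypothetical
        for (int N = 3; N <= 7; N++) {
            System.out.printf("N=%d  Tr=%.3f s%n", N, tr(S, N, o, t, cm, p, cr, m, r));
        }
    }
}
```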
Consequently, our distributed tile management framework is suitable for application scenarios with high concurrency and a large amount of data processing. In these scenarios, increasing the number of parallel processing nodes can effectively improve the efficiency of massive tile queries. Moreover, to make tile scheduling as efficient as possible, the hardware configuration also needs to be improved, for example by increasing the bandwidth of the network adaptor and the number of processor cores.


6.4. Experiment 3—Static-Clipping Performance

Experiment 3 compared the computing performance of three different static tile-clipping architectures—a traditional high-performance server, a distributed file-based processing system and our NoSQL-based parallel framework. In the experiment, we selected the same vector datasets as in Experiment 2, generating tile levels 0 to 6 for a total of 981 tile map units through the static clipping operation requested by the client. With the HDFS replica function switch closed, we compared the average execution time of static clipping in the single-server mode and the distributed computing modes. The results shown in Figure 20 indicate that the local file cache method spent the most time on processing (264.3 s), whereas the MapReduce-based NoSQL framework has obvious advantages (92.7 s) in efficiency. We can clearly see that the average transaction completion time of performing the MapReduce clipping task based on the HDFS framework is slightly better than that of the centralized model, but the advantage is not obvious; the reason for this is that the cross-node data access costs brought by the increased tile files in the processing should not be neglected. However, our NoSQL framework with MapReduce parallel algorithms for static tile clipping on the vector dataset is more efficient.

Figure 20. Average execution time of the static-clip operation by three processing models.

6.5. Experiment 4—Dynamic-Clipping Performance

Experiment 4 compared the task completion time of a dynamic-clipping transaction by the traditional method of a high-performance server, the Hadoop distributed file system processing method and our proposed approach, i.e., dynamic tile generation using a NoSQL cluster. Using the same real datasets as Experiments 2 and 3, we made dynamic-clipping requests for China geological vector map data in the three architecture environments; according to the one-tile-unit-per-level rule, we made requests to ten tiles randomly selected from levels 3 to 12 of the vector map sets and then analyzed their dynamic clipping task completion times, as shown in Figure 21.

The execution of the aforementioned three methods is more time-consuming in our experiments, owing to their dynamic clipping characteristics, than that of their corresponding static methods. In the case of small-scale concurrent requests, the difference in the average time consumed for dynamic clipping between the single processing architecture and the HDFS distributed architecture was small when generating 10 random tiles; the execution times were 13.47 s and 12.68 s, respectively. Furthermore, besides the operation cost of tile clipping, the store operation for the tile data generated in real time also brought a certain amount of time overhead, particularly in the case of enabling the replica function of the HDFS file system (the time expenditure of the HDFS method was then even greater than that of the standalone mechanism). However, the advantages of the NoSQL architecture are obvious, as the time expenditure of parallel dynamic clipping was just 8.73 s in the experiment.


Figure 21. Average response time of dynamic tile clipping by three methods.

7. Conclusions Conclusions and and Future Future Work Work This article has introduced a parallel processing method for vector tile data that can be applied to clippingclipping and storage operations for massive data tiles in high-concurrency map service scenarios. A column-family-basedcolumn-family-based distributed tile data storagestorage model was developed for the cases of the construction, manipulation,manipulation, retrieval, retrieval, and and sharding sharding of of large-scale large-scale tile tile data data in a in NoSQL a NoSQL cluster. cluster. This dataThis modeldata model addresses addresses the critical the critical need to need support to support high storage high storage efficiency, efficiency, rapid scaling rapid ability scaling and ability balanced and databalanced distribution. data distribution. To improve the processing processing efficiency efficiency of of a a tiled tiled map map in in the the NoSQL NoSQL data data model, model, we we designed designed an analgorithmic algorithmic framework framework for performing for performing distributed distributed clipping clipping and query and query operations operations on massive on massive vector vectortiles. The tiles. framework The framework supports supports both static both and static dynamic and dynamic tile clipping tile clipping in parallel in parallelmode for mode our NoSQL for our NoSQLstorage model storage and model also and implements also implements the fast retrieval the fast retrieval of a large-scale of a large-scale tile dataset tile deployed dataset deployed in NoSQL in NoSQLusing the using MapReduce the MapReduce programming programming paradigm. paradigm. Towa Towardrd this thisend, end, we we also also embed embed the the parallel processing frameworkframework withinwithin a high-performance webweb service environment, which is established on the foundation of the China Geological Survey Information Grid, to develop a tiled map services the foundation of the China Geological Survey Information Grid, to develop a tiled map services prototype that can provide clipping and map services for big vector data. Timing experiments reveal prototype that can provide clipping and map services for big vector data. Timing experiments reveal that our framework performs well in network scenarios in which vector tiles for geological survey that our framework performs well in network scenarios in which vector tiles for geological survey information are distributed over the NoSQL cluster, especially when the growth of both the spatial information are distributed over the NoSQL cluster, especially when the growth of both the spatial data data volume and the number of concurrent sessions presents challenges regarding the system volume and the number of concurrent sessions presents challenges regarding the system performance performance and execution time of tile data processing. and execution time of tile data processing. Our NoSQL-based processing method for vector tile data is an initial attempt at developing an Our NoSQL-based processing method for vector tile data is an initial attempt at developing effective means to tackle massive volume spatial storage and computing problems in the Cloud. an effective means to tackle massive volume spatial storage and computing problems in the Cloud. 
Distributed storage, access, and a fast processing model for large-volume three-dimensional tile data, and in particular ways of enhancing the efficiency and effectiveness of parallel clipping, geospatial query, and analysis, will be one of our next research emphases. In future work, we also hope to collaborate with other researchers to explore index structures for high-dimensional spatial data in the NoSQL framework, complex querying algorithms for massive spatio-temporal data using the MapReduce programming model, and a coupling mechanism based on the communication and logical relationships between Map and Reduce tasks, in favor of optimizing spatial processing algorithms that contain intensive data operations, such as spatial joins.
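The paper itself does not list source code; purely as an illustration of the wide-column access pattern summarized above, the following sketch shows how a clipped vector tile, addressed by zoom level, row, and column, might be written to and read back from an HBase-style table. The table name vector_tiles, the column family t, and the zoom/row/column key layout are assumptions made for this example, not the exact schema of the framework described in this article.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

/** Illustrative tile accessor for a wide-column (HBase-style) tile table. */
public class TileStore {

    // Assumed names for this sketch only.
    private static final TableName TABLE = TableName.valueOf("vector_tiles");
    private static final byte[] CF = Bytes.toBytes("t");
    private static final byte[] QUALIFIER = Bytes.toBytes("data");

    /** Builds a row key of the form "zz/rrrrrrrrrr/cccccccccc" so that tiles of one
     *  zoom level and neighbouring rows stay close together in key order. */
    static byte[] rowKey(int zoom, long row, long col) {
        return Bytes.toBytes(String.format("%02d/%010d/%010d", zoom, row, col));
    }

    /** Writes one clipped tile (already serialized to bytes) into the table. */
    public static void putTile(Connection conn, int zoom, long row, long col,
                               byte[] tileBytes) throws IOException {
        try (Table table = conn.getTable(TABLE)) {
            Put put = new Put(rowKey(zoom, row, col));
            put.addColumn(CF, QUALIFIER, tileBytes);
            table.put(put);
        }
    }

    /** Fetches a single tile by its zoom/row/column address, or null if absent. */
    public static byte[] getTile(Connection conn, int zoom, long row, long col)
            throws IOException {
        try (Table table = conn.getTable(TABLE)) {
            Result result = table.get(new Get(rowKey(zoom, row, col)));
            return result.isEmpty() ? null : result.getValue(CF, QUALIFIER);
        }
    }

    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            putTile(conn, 12, 1702, 3096, new byte[]{ /* serialized tile */ });
            byte[] tile = getTile(conn, 12, 1702, 3096);
            System.out.println(tile == null ? "miss" : tile.length + " bytes");
        }
    }
}

Keeping the zoom level as the key prefix groups all tiles of one pyramid level into a contiguous key range, which is convenient for range scans during level-wise clipping or retrieval; any production schema would also need to consider key salting to avoid write hot spots.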

Acknowledgments: This research was supported by grants from the National Natural Science Foundation of China (41401449, 41501162, 41625003), the Scientific Research Key Program of the Beijing Municipal Commission of Education (KM201611417004), the New Starting Point Program of Beijing Union University (ZK10201501), the State Key Laboratory of Resources and Environmental Information System, and the Key Laboratory of Satellite Surveying Technology and Application, Chinese National Administration of Surveying, Mapping and Geoinformation.

Author Contributions: Zhou Huang conceived and designed the experiments; Lin Wan performed the experiments; Xia Peng analyzed the data; Zhou Huang, Lin Wan and Xia Peng wrote the paper.

Conflicts of Interest: The authors declare no conflict of interest.


© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).