A New Initiative for Tiling, Stitching and Processing Geospatial Big Data in Distributed Computing Environments
ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume III-4, 2016. XXIII ISPRS Congress, 12–19 July 2016, Prague, Czech Republic. doi:10.5194/isprsannals-III-4-111-2016

A. Olasz a*, B. Nguyen Thai b, D. Kristóf a

a Department of Geoinformation, Institute of Geodesy, Cartography and Remote Sensing (FÖMI), 5. Bosnyák sqr., Budapest, 1149 Hungary, (olasz.angela, kristof.daniel)@fomi.hu
b Department of Cartography and Geoinformatics, Eötvös Loránd University (ELTE), 1/A Pázmány Péter sétány, Budapest, 1117 Hungary, [email protected]

Commission ICWG IV/II

KEYWORDS: Distributed computing, GIS processing, raster data tiling, data assimilation, remote sensing data analysis, geospatial big data, spatial big data

ABSTRACT

In recent years, several new approaches and solutions for Big Data processing have been developed. The geospatial world still lacks well-established distributed processing solutions tailored to the volume and heterogeneity of geodata, especially when fast data processing is a must. The goal of such systems is to improve processing time by distributing data transparently across processing (and/or storage) nodes. These methodologies are based on the divide-and-conquer principle. Nevertheless, in the context of geospatial processing, most distributed computing frameworks have important limitations regarding both data distribution and data partitioning methods. Moreover, flexibility and extensibility for handling various data types (often in binary formats) are also strongly required. This paper presents a concept for tiling, stitching and processing of big geospatial data. The system is based on the IQLib concept (https://github.com/posseidon/IQLib/) developed in the frame of the IQmulus EU FP7 research and development project (http://www.iqmulus.eu). The data distribution framework imposes no limitations on the programming language environment and can execute scripts (and workflows) written in different development frameworks (e.g. Python, R or C#). It is capable of processing raster, vector and point cloud data. The prototype is presented through a case study dealing with country-wide processing of raster imagery. Further investigations on algorithmic and implementation details are planned for the near future.

1 INTRODUCTION

Progress and innovation are no longer hindered by the ability to collect data; the most important issue is how we exploit these geospatial big data (Lee and Kang, 2015). We consider that we are facing a paradigm shift from data-driven research to a knowledge-driven scientific method in Big Data, a shift R. Kitchin identified as a challenge in 2014. In our previous work (Nguyen Thai and Olasz, 2015), the Big Data concept and its well-known four dimensions, Volume, Velocity, Variety (the three dimensions originally proposed by Laney in 2001) and Veracity, were discussed taking geospatial considerations and characteristics into account. Additional dimensions keep appearing to describe the concept better, such as Value, Validity and Visibility (Li et al., 2015). Moreover, we have proposed a fifth "V" for Geospatial Big Data, Visualization, because it has been of great importance for communicating the desired information since the very beginning of Geographical Information Science. On the way from geo-location information to knowledge, Visualization is fundamental in every phase of the Data Life Cycle pipeline (collect, aggregate, analyse, return as knowledge, share) (Nguyen Thai and Olasz, 2015). Beyond this, we consider that in a Big Data environment Visualization plays an important role in (1) geospatial data processing (due to volume and variety), helping analysts to identify trends, relations, correlations and patterns efficiently; and (2) broadcasting geo-information to citizens (and decision makers) employing interactive and eye-tracking approaches. We therefore conclude that Visualization is a definite geospatial component of Big Data.

Our goal is to find a solution for geoprocessing of big geospatial data in a distributed ecosystem, providing an environment to run algorithms, services and processing modules without any limitation on the implementation programming language, the data partitioning strategy or the distribution among computational nodes. As a first step, we focus on (i) data decomposition and (ii) distributed processing. The challenges associated with each focus area, the related methodology and the first results are analyzed and discussed in the paper.
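To make the divide-and-conquer flow targeted by the paper concrete, the following minimal Python sketch illustrates tiling, per-tile processing and stitching on an in-memory raster. It is an illustration only, not the IQLib implementation: the function names, the NumPy representation and the trivial doubling operation in process_tile are assumptions made for brevity; in a distributed setting each tile would be dispatched to a worker node running a user-supplied script (e.g. Python, R or C#).

    import numpy as np

    def tile(raster, tile_size):
        """Decompose a 2-D raster into (row, col, block) tiles (hypothetical helper)."""
        rows, cols = raster.shape
        for r in range(0, rows, tile_size):
            for c in range(0, cols, tile_size):
                yield r, c, raster[r:r + tile_size, c:c + tile_size]

    def process_tile(block):
        """Placeholder per-tile computation; any script or language could run here."""
        return block * 2  # stand-in for a real geoprocessing step

    def stitch(shape, results):
        """Reassemble processed tiles into a full-size output raster."""
        out = np.empty(shape, dtype=float)
        for r, c, block in results:
            out[r:r + block.shape[0], c:c + block.shape[1]] = block
        return out

    if __name__ == "__main__":
        raster = np.random.rand(1000, 1200)
        # In a distributed setting each tile would be shipped to a worker node;
        # here the "workers" are a simple in-process loop.
        results = [(r, c, process_tile(b)) for r, c, b in tile(raster, 256)]
        output = stitch(raster.shape, results)
        assert np.allclose(output, raster * 2)

The point of the decomposition is that process_tile never needs the full dataset, so tiles can be distributed transparently across nodes and the output reassembled afterwards.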
2 PROBLEM DESCRIPTION

2.1 Geospatial Big Data

Reliable analysis of geospatial data is an extremely important basis for supporting better decision making with location-aware data, even in our changing world. The challenges of handling geospatial big data include capture, storage, search, sharing, transfer, analysis and visualization (Jewell et al., 2014). Furthermore, with newly adopted data management requirements and initiatives, even more open data will become available on the web; these data need to be handled, the latent information shared, and the extracted knowledge applied at the level of decision making (Wu and Beng, 2014). Big data, open data and open government have a joint interest in location, and many challenges considered from the geospatial perspective will soon benefit from it (Jewell et al., 2014). We consider GI analysis a principal capability for transforming information into knowledge. In theory, geospatial big data are defined as data whose volume, variety and update frequency exceed the capability of spatial computing technology (Lee and Kang, 2015; Li et al., 2015; Kambatla et al., 2014). In Table 1 we have collected the main characteristics of geospatial big data for each data type: representation formats, GIS operations, volume, velocity, variety and visualization aspects. This gives a better understanding of the main attributes of geospatial data, because it is hard to delineate the margin where data start to "exceed the capability of spatial computing technology". Estimating the size of the processable amount of data is use-case specific; there are good examples (Evans et al., 2014) where the authors tried to identify the differences between Geospatial Data and Geospatial Big Data.
2.2 Defining Geospatial Big Data

Spatial data (also known as geospatial data, geo-information, geodata, etc.) have many definitions depending on the background of the author. All of them emphasize the geographic location of the described phenomena as the basic criterion. The digital representations of continuous space can be grouped into four or five types. Traditionally, we consider two types of geospatial data, vector and raster (Elek, 2006); owing to the development of information technology, we nowadays also have higher-abstraction data types such as point clouds and graph networks. An additional particular kind of location-aware data is also examined by analysts: social-media-like data, which require a particular approach to collection and processing as well.

Table 1. Main characteristics of geospatial big data by data type.

Vector
- Formats: point, line, polygon (multi)
- GIS operations: overlapping, vector geoprocessing
- Volume: available amount of vector data (for instance nation-wide cadastre, land cover, roads, waterways, utility networks, etc.)
- Velocity: real-time monitoring and rapid response are emerging
- Visualization: thanks to OGC standards and web-GIS platforms, non-GIS researchers are able to use GIS data for many different purposes

3D point cloud, TIN (or triangular mesh)
- GIS operations: 3D modelling, urban modelling, simulation, flight visibility operations, semi-automatic point cloud feature detection, classification, terrestrial laser scanning, BIM
- Volume: the available amount of point cloud (or TIN) data for the creation of DSM/DTM models, feature extraction and simulation requires huge computational capacity
- Velocity: time-sensitive data require rapid processing (disaster management and simulation)
- Visualization: 3D (perspective) and from-above camera views; 3D content together with thematic content in reduced representation is essential for spreading information to different levels of end users

Raster grid
- Formats: terrestrial, aerial and satellite (multispectral and hyperspectral) imagery (airplane, UAV), earth observation data
- GIS operations: Local, Focal, Zonal (Global) Map Algebra processing, image analysis using raster image processing methods to extract the relevant information
- Volume: available (free) series of earth observation data; processing earth observation data requires huge computational capacity (independently of the extent of the processing)
- Velocity: real-time spatio-temporal earth observation monitoring and processing is needed more than ever
- Variety: several types of the previously mentioned data need to be combined
- Visualization: results of novel processing need to be considered in GIS processing; visualization solutions are also needed to transform information into human-readable form

Network graph (nodes, edges)
- GIS operations: routing, network analysis, …
- Volume: trillions of edges and nodes for network analysis and routing, available from location-based networks (also from social …)
- Velocity: real-time processing, monitoring of moving objects, …
- Visualization: information in … visualization techniques are …
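Among the GIS operations listed for raster grids, Local, Focal and Zonal (Global) Map Algebra operations are the typical per-cell and per-neighbourhood workloads a tiling system must support. As a hedged illustration (again not IQLib code; the NDVI band inputs and the mean window radius are assumptions), a Local and a Focal operation can be sketched as follows:

    import numpy as np

    def local_ndvi(nir, red):
        """Local Map Algebra: NDVI computed independently for each cell."""
        return (nir - red) / np.maximum(nir + red, 1e-9)  # guard against /0

    def focal_mean(raster, radius=1):
        """Focal Map Algebra: each output cell is the mean of its neighbourhood."""
        padded = np.pad(raster, radius, mode="edge")  # replicate border cells
        out = np.zeros_like(raster, dtype=float)
        rows, cols = raster.shape
        for r in range(rows):
            for c in range(cols):
                window = padded[r:r + 2 * radius + 1, c:c + 2 * radius + 1]
                out[r, c] = window.mean()
        return out

The distinction matters for tiling and stitching: Local operations tile trivially because every cell is independent, whereas Focal operations read a neighbourhood around each cell, so tiles must be cut with an overlap (halo) at least as wide as the kernel radius and the overlap discarded when the results are stitched back together.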