Function Modeling Improves the Efficiency of Spatial Modeling Using Big Data from Remote Sensing

John Hogland * and Nathaniel Anderson

Rocky Mountain Research Station, U.S. Forest Service, Missoula, MT 59801, USA; [email protected]
* Correspondence: [email protected]; Tel.: +1-406-329-2138

Received: 26 June 2017; Accepted: 10 July 2017; Published: 13 July 2017

Abstract: Spatial modeling is an integral component of most geographic information systems (GISs). However, conventional GIS modeling techniques can require substantial processing time and storage space and have limited statistical and machine learning functionality. To address these limitations, many have parallelized spatial models using multiple coding libraries and have applied those models in a multiprocessor environment. Few, however, have recognized the inefficiencies associated with the underlying spatial modeling framework used to implement such analyses. In this paper, we identify a common inefficiency in processing spatial models and demonstrate a novel approach to address it using lazy evaluation techniques. Furthermore, we introduce a new coding library that integrates the Accord.NET and ALGLIB numeric libraries and uses lazy evaluation to facilitate a wide range of spatial, statistical, and machine learning procedures within a new GIS modeling framework called function modeling. Results from simulations show a 64.3% reduction in processing time and an 84.4% reduction in storage space attributable to function modeling. In an applied case study, this translated to a reduction in processing time from 2247 h to 488 h and a reduction in storage space from 152 terabytes to 913 gigabytes.

Keywords: function modeling; remote sensing; machine learning; geographic information system

1. Introduction

Spatial modeling has become an integral component of geographic information systems (GISs) and remote sensing. Combined with classical statistical and machine learning algorithms, spatial modeling in GIS has been used to address wide-ranging questions in a broad array of disciplines, from epidemiology [1] and climate science [2] to geosciences [3] and natural resources [4–6]. However, in most GISs, the current workflow used to integrate statistical and machine learning algorithms and to process raster models limits the types of analyses that can be performed. This process can be generally described as a series of sequential steps: (1) build a sample data set using a GIS; (2) import that sample data set into statistical software such as SAS [7], R [8], or MATLAB [9]; (3) define a relationship (e.g., a predictive regression model) between response and explanatory variables that can be used within a GIS to create predictive surfaces; and (4) build a representative spatial model within the GIS that uses the outputs from the predictive model to create spatially explicit surfaces. Often, the multi-software complexity of this practice warrants building tools that streamline and automate many aspects of the process, especially the export and import steps that bridge different software. However, a number of challenges place significant limitations on producing final outputs in this manner, including learning additional software, implementing predictive model outputs, managing large data sets, and handling the long processing time and large storage space requirements associated with this workflow [10,11].
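For concreteness, the four-step workflow described above can be sketched in code. The following Python example is purely illustrative and is not part of the original study: NumPy arrays stand in for raster data, scikit-learn stands in for the statistical package (SAS, R, or MATLAB), and all file names, array sizes, and sample sizes are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Step 1: build a sample data set (here, values drawn at random plot
# locations from two in-memory "raster" bands that stand in for GIS layers).
rng = np.random.default_rng(0)
band1 = rng.random((1000, 1000))            # explanatory surface
band2 = rng.random((1000, 1000))            # explanatory surface
response = 2.0 * band1 + 0.5 * band2        # synthetic response surface
rows = rng.integers(0, 1000, size=200)
cols = rng.integers(0, 1000, size=200)
X_sample = np.column_stack([band1[rows, cols], band2[rows, cols]])
y_sample = response[rows, cols]

# Step 2: in practice the sample is exported from the GIS and re-imported
# into statistical software; here the hand-off is mimicked with a CSV file.
np.savetxt("sample.csv", np.column_stack([X_sample, y_sample]), delimiter=",")

# Step 3: define the predictive relationship between response and
# explanatory variables.
model = LinearRegression().fit(X_sample, y_sample)

# Step 4: apply the fitted model back to the full rasters to create a
# spatially explicit predictive surface; with eager (conventional) modeling
# this full-sized output is computed and written to disk in its entirety.
prediction = model.predict(
    np.column_stack([band1.ravel(), band2.ravel()])
).reshape(band1.shape)
np.save("prediction_surface.npy", prediction)
```

Each hand-off in this sketch (the CSV export, the full-sized prediction surface) materializes data on disk, which is exactly where the processing time and storage costs discussed below accumulate as data sets grow.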
These challenges have intensified over the past decade because large, fine-resolution remote sensing data sets, such as meter and sub-meter imagery and Lidar, have become widely available and less expensive to procure, but the tools to use such data efficiently and effectively have not always kept pace, especially in the desktop environment.

To address some of these limitations, advances in both GIS and statistical software have focused on integrating functionality through coding libraries that extend the capabilities of any software package. Common examples include RSGISlib [12], GDAL [13], SciPy [14], ALGLIB [15], and Accord.NET [16]. At the same time, new processing techniques have been developed to address common challenges with big data that aim to more fully leverage improvements in computer hardware and software configurations. For example, parallel processing libraries such as OpenCL [17] and CUDA [18] are stable and actively being used within the GIS community [19,20]. Similarly, frameworks such as Hadoop [21] are being used to facilitate cloud computing, and offer improvements in big data processing by partitioning processes across multiple CPUs within a large server farm, thereby improving user access, affordability, reliability, and data sharing [22,23]. While the integration, functionality, and capabilities of GIS and statistical software continue to expand, the underlying framework of how procedures and methods are used within spatial models in GIS tends to remain the same, which can impose artificial limitations on the type and scale of analyses that can be performed.

Spatial models are typically composed of multiple sequential operations.
Each operation reads data from a given data set, transforms the data, and then creates a new data set (Figure 1). In programming, this is called eager evaluation (or strict semantics) and is characterized by a flow that evaluates all expressions (i.e., arguments) regardless of the need for the values of those expressions in generating final results [24]. Though eager evaluation is intuitive and used by many traditional programming languages, creating and reading new data sets at each step of a model in GIS comes at a high processing and storage cost, and is not viable for large area analysis outside of the supercomputing environment, which is not currently available to the vast majority of GIS users.

Figure 1. A schematic comparing conventional modeling (CM), which employs eager evaluation, to function modeling (FM), which uses lazy evaluation. The conceptual difference between the two modeling techniques is that FM delays evaluation until results are needed and produces results without storing intermediate data sets to disk (green squares). When input and output operations or disk storage space are the primary bottlenecks in running a spatial model, this difference in evaluation can substantially reduce processing time and storage requirements.

In contrast, lazy evaluation (also called lazy reading and call-by-need) can be employed to perform the same types of analyses, reducing the number of reads and writes to disk, which reduces processing time and storage.
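To make the difference between the two evaluation strategies concrete, the sketch below contrasts an eager and a lazy version of a simple two-step raster model. It is an illustrative Python/NumPy example under assumed names (eager_chain, lazy_chain, the window-based access pattern) and is not the API of the coding library introduced in this paper.

```python
import numpy as np

def eager_chain(raster):
    # Conventional modeling (eager evaluation): every operation computes a
    # full intermediate raster and writes it to disk before the next step runs.
    smoothed = raster * 0.5 + 0.25                    # intermediate data set 1
    np.save("step1_smoothed.npy", smoothed)           # stored on disk
    classified = (smoothed > 0.5).astype(np.uint8)    # intermediate data set 2
    np.save("step2_classified.npy", classified)       # stored on disk
    return classified

def lazy_chain(raster):
    # Function modeling (lazy evaluation): each step returns a function that
    # describes the transformation; nothing is computed or stored until a
    # caller requests values, and then only for the requested window.
    def smoothed(window):
        return raster[window] * 0.5 + 0.25
    def classified(window):
        return (smoothed(window) > 0.5).astype(np.uint8)
    return classified

rng = np.random.default_rng(0)
raster = rng.random((2000, 2000))

eager_result = eager_chain(raster)               # computes and writes everything

surface = lazy_chain(raster)                     # builds functions only; no I/O
tile = surface((slice(0, 512), slice(0, 512)))   # evaluates just one 512 x 512 tile
```

Because the lazy chain merely describes the transformations, requesting a single tile evaluates only a fraction of the pixels and never writes the intermediate rasters to disk, which is the source of the processing time and storage savings that function modeling targets.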