Distributed, File Based, Geospatial Array DBMS
Total Page:16
File Type:pdf, Size:1020Kb
ChronosDB: Distributed, File Based, Geospatial Array DBMS Ramon Antonio Rodriges Zalipynis National Research University Higher School of Economics, Moscow, Russia [email protected] ABSTRACT that are very accustomed to them. NCO (NetCDF Oper- An array DBMS streamlines large N-d array management. ators) have been under development since about 1995 [43], A large portion of such arrays originates from the geospatial GDAL (Geospatial Data Abstraction Library) provides tools for managing over 150 raster file formats, has 106 lines of domain. The arrays often natively come as raster files while ≈ standalone command line tools are one of the most popu- code and hundreds of contributors [14]. Many tools utilize lar ways for processing these files. Decades of development multicore CPUs but mostly work on a single machine. and feedback resulted in numerous feature-rich, elaborate, The in situ technique has gained increased attention due free and quality-assured tools optimized mostly for a single to the explosive growth of raster data volumes in diverse file machine. ChronosDB partially delegates in situ data pro- formats [3, 85, 6, 8]. Unlike the in-db approach, the in situ cessing to such tools and offers a formal N-d array data approach operates on data in their original file formats in a model to abstract from the files and the tools. ChronosDB standard file system and does not require importing into an readily provides a rich collection of array operations at scale internal DBMS format. Files may be preprocessed before and outperforms SciDB by up to 75 on average. querying but this is usually much faster than a full import. × Keeping data of an array DBMS in open, widely adopted, PVLDB Reference Format: and standardized file formats has numerous advantages: eas- R.A. Rodriges Zalipynis. ChronosDB: Distributed, File Based, ier data sharing, powerful storage capabilities, the absence Geospatial Array DBMS. PVLDB, 11(10): 1247-1261, 2018. of a time-consuming and error-prone import phase for many DOI: https://doi.org/10.14778/3231751.3231754 data types, direct accessibility of DBMS data to other sys- tems, and easier migration to other DBMS to name a few [59]. 1. INTRODUCTION Many raster processing algorithms require significant im- A georeferenced raster or more generally an N-d array is plementation efforts, but already existing stable and multi- the primary data type in a broad range of fields including functional tools are largely ignored in this research trend. climatology, geology, and remote sensing which are expe- Algorithms are re-implemented almost from scratch delay- riencing tremendous growth of data volumes. For example, ing the emergence of an array DBMS being competitive with DigitalGlobe is a commercial satellite imagery provider that existing operational pipelines, e.g. SciDB appeared in 2008 collects 70 TB of imagery on an average day [35]. and still lacks even core operations like interpolation [65]. Traditionally rasters are stored in files, not in databases. The idea of partially delegating in situ array operations The European Centre for Medium-Range Weather Forecasts to existing tools was first realized in ChronosServer [59, 58, (ECMWF) has alone accumulated 137:5 106 files sized 11, 60, 62, 61]. Its successor is ChronosDB extended with 52.7 PB in total [26]. This file-centric model× resulted in a a significantly improved data model (section 2), new com- broad set of sophisticated raster file formats. For example, ponents, optimizations, array management, query execution GeoTIFF is an effort by over 160 different remote sensing, capabilities (section 3), and array operations (section 4). GIS (Geographic Information System), cartographic, and The ChronosDB formal data model uniformly represents surveying related companies and organizations [23]. NetCDF diverse raster data types and formats, takes into account the has been under development since 1990 [56], is standardized distributed context, and is independent of the underlying by OGC [44], and supports multidimensional arrays, chunk- raster file formats and tools at the same time (section 2). ing, compression, and hierarchical name space [40]. ChronosDB distributed algorithms are based on the model, Command line tools have long being developed to man- unified, formalized, and targeted at the in situ processing of age raster files. Many tools have large user communities arbitrary geospatial N-d arrays (section 4). The algorithms are designed to always delegate signifi- Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are cant portions of their work to the tools. This proves wide not made or distributed for profit or commercial advantage and that copies applicability of the tools and the delegation approach. The bear this notice and the full citation on the first page. To copy otherwise, to delegation happens by direct submission of files to a tool as republish, to post on servers or to redistribute to lists, requires prior specific command line arguments to process them on a single cluster permission and/or a fee. Articles from this volume were invited to present node. ChronosDB re-partitions and streams input/output their results at The 44th International Conference on Very Large Data Bases, files between the nodes and tools to scale out the processing. August 2018, Rio de Janeiro, Brazil. Proceedings of the VLDB Endowment, Vol. 11, No. 10 In summary, the major contributions of our work include: Copyright 2018 VLDB Endowment 2150-8097/18/06... $ 10.00. (1) we show that it is possible to build a well-abstracted, DOI: https://doi.org/10.14778/3231751.3231754 1247 extensible, and efficient distributed geospatial N-d array a totally ordered set, di[j] < di[j + 1], and di[j] = NA for 6 DBMS by leveraging existing elaborate command line tools j Di. In this case, A is defined as A(d1; d2; : : : ; dN ): T. 8 2 designed for a single machine; A hyperslab A0 A is an N-d subarray of A. The hyper- v (2) we give a holistic description of the complete DBMS slab A0 is defined by the notation A[b1 : e1; : : : ; bN : eN ] = leveraging the tools: ChronosDB has a relatively simple yet A0(d10 ; : : : ; dN0 ), where bi; ei Z, 0 bi ei < li, di0 = 2 6 6 flexible coordination layer on top of powerful components. di[bi : ei], d0 = ei bi + 1, and for all yi [0; ei bi] the j ij − 2 − This enables exceptional performance, rich functionality and following holds: A0[y1; : : : ; yN ] = A[y1 + b1; : : : ; yN + bN ], avoids re-implementing array algorithms from scratch. It is di0 [yi] = di[yi + bi](A and A0 have a common coordinate also easier to manage the development life cycle of such a subspace over which cell values of A and A0 coincide). The DBMS as it is comprehensive by a small team of engineers; dimensionality of A and A0 is the same. We will omit \: ei" (3) we thoroughly compare ChronosDB and SciDB on up if bi = ei or \bi : ei" if bi = 0 and ei = di 1. to 32-node clusters in the Cloud on real-world data: Landsat Arrays X and Y overlap iff Q : Q j Xj − Q Y . Array satellite scenes and climate reanalysis. We took SciDB as it Q is called the greatest common9 hyperslabv ^of Xvand Y and is the only freely available distributed array DBMS to date. denoted by gch(X; Y ) iff @W :(W X) (W Y ) (Q W ) (Q = W ). An array X coversv an array^ Yviff Y ^ X.v ^ 6 v 2. CHRONOSDB DATA MODEL 2.3 Datasets Two dataset types exist: raw and regular. Raw datasets 2.1 Motivation capture a broad range of possible raster data types, e.g. a The most widely used industry-standard data models for set of scattered, overlapping satellite scenes, fig. 1. A raw abstracting from raster file formats are Unidata CDM, GDAL dataset must be \cooked" into a regular one in order to Data Model and ISO 19123. Industrial models are map- perform array operations on it, section 3.3. Many real-world pable to each other [40] and have resulted from decades of array data already satisfy regular dataset criteria and need considerable practical experience. However, they work with not to be cooked, e.g. gridded geophysical data, section 5. a single file, not with a set of files as a single array. Both datasets are two-level and contain a user-level array The most well-known research models and algebras for and a set of system-level arrays (array and subarrays for dense N-d, general-purpose arrays are AML [38], AQL [37], short). Subarrays are distributed among cluster nodes and and RAM [81]. They are mappable to Array Algebra [4]. stored as ordinary raster files. An array is never materialized The creation of the ChronosDB data model was motivated and stored explicitly: an operation with an array is mapped by the following features that are not simultaneously present to a sequence of operations with respective subarrays. in the existing data models: (1) the treatment of arrays in Datasets are read-only: an array operation produces a multiple files arbitrarily distributed over cluster nodes as a new dataset. This has a strong practical motivation. The single array, (2) formalized industrial experience to leverage majority of raster data come from instrumental observation it in the algorithms, (3) a rich set of data types (Gaussian, and numerical simulation. Original data are never changed irregular grids, etc.), (4) subarray mapping to a raster file is once they are produced. Derivative data differ vastly: raster almost 1:1 but still independent from a format. As can be algorithms usually alter large portions of an array. raw raw seen from [40], the N-d array model (section 2.2) resembles Formally, a raw dataset D = (A; M; P ) has a user- CDM while the two-level set-oriented dataset model (sec- level array A(d1; d2; : : : ; dN ): T, metadata M, and a set raw tion 2.3) provides additional necessary abstractions.