Array DBMS in Environmental Science
The 9th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications, 21-23 September 2017, Bucharest, Romania

Array DBMS in Environmental Science: Satellite Sea Surface Height Data in the Cloud

Ramon Antonio Rodriges Zalipynis
National Research University Higher School of Economics, Moscow, Russia
[email protected]

This work was partially supported by the Russian Science Foundation (grant 17-11-01052) and the Russian Foundation for Basic Research (grant 16-37-00416).

Abstract – Nowadays environmental science experiences tremendous growth of raster data: N-dimensional (N-d) arrays coming mainly from numeric simulation and Earth remote sensing. An array DBMS is a tool to streamline raster data processing. However, raster data are usually stored in files, not in databases. Moreover, numerous command line tools exist for processing raster files. This paper describes a distributed array DBMS under development that partially delegates raster data processing to such tools. Our DBMS offers a new N-d array data model to abstract from the files and the tools, and processes data in a distributed fashion directly in their native file formats. As a case study, popular satellite altimetry data were used for the experiments, carried out on 8- and 16-node clusters in the Microsoft Azure Cloud. The new array DBMS is up to 70× faster than SciDB, which is the only freely available distributed array DBMS to date.

Keywords – SciDB; in situ; command line tools; NetCDF

I. INTRODUCTION

Modern volumes of raster data are enormous. The European Centre for Medium-Range Weather Forecasts alone has accumulated 137.5 million files sized 52.7 PB in total [1]. The main goal of an array DBMS is to process N-d arrays via a flexible yet efficient declarative query style.

The long history of file-based data storage has resulted in many sophisticated raster file formats. For example, the NetCDF format supports multidimensional arrays, chunking, compression, diverse data types, metadata, and a hierarchical namespace [2]. Decades of development have also produced numerous elaborate tools for processing raster files. For example, NetCDF Operators (NCO) have been under development since 1995 [3]. GDAL has about $10^6$ lines of code made by hundreds of contributors [4].

The idea of partially delegating raster data processing to existing command line tools was first presented, and proved to outperform SciDB 3× to 193×, in [5]. The delegation ability is being integrated into ChronosServer [6]. Work [5] used NCEP/DOE Reanalysis (R2) data, a single machine was used for the experiments, and no formal array model or formal distributed algorithms were given.
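To make the delegation idea concrete, the sketch below shows how an average over the record (time) dimension of several NetCDF files can be handed to NCO's ncra tool from Python. This is an illustration of the general approach only, not ChronosServer code, and the file names are hypothetical:

    # Delegate record-dimension (time) averaging of NetCDF files to the
    # NCO command line tool ncra; -O overwrites an existing output file.
    import subprocess

    def average_records(input_files, output_file):
        subprocess.run(["ncra", "-O", *input_files, output_file], check=True)

    # Hypothetical altimetry files: one averaged output file is produced.
    average_records(["sla_2016_01.nc", "sla_2016_02.nc"], "sla_avg.nc")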
The main goal of this paper is to advance the approach proposed earlier. To achieve this goal, a new two-level data model is designed to uniformly represent diverse raster data types and formats, take a distributed cluster environment into account, and at the same time be independent of the underlying raster file formats. Also, new distributed algorithms are proposed based on the model.

Four modern raster data management trends are relevant to this paper: industrial raster data models, formal array models and algebras, in situ data processing algorithms, and raster (array) DBMS. A good survey of the algorithms is contained in [7]. A recent survey of existing array DBMS and similar systems is in [5]. It is worth mentioning SciDB [8], Oracle Spatial [9], ArcGIS IS [10], RasDaMan [11], Intel TileDB [12], and PostGIS [13]. The most well-known array models and algebras are Array Algebra, AML, AQL, and RAM; all of them are mappable to Array Algebra [14]. SciDB does not have a formal description of its data model. The most widely used industry standard data models for abstracting from raster file formats are CDM, GDAL, and ISO 19123. These models are mappable to each other [2] and have resulted from decades of considerable practical experience, but they work with a single file, not with a set of files as a single array.

The following requirements, not satisfied by the existing data models, governed the creation of the new model: (i) treat arrays in multiple files distributed over cluster nodes as a single array, (ii) formalize the industrial experience to leverage it in the algorithms, (iii) provide a rich set of data types (Gaussian grids, irregular grids, etc.), (iv) make the mapping of the model to a format almost 1:1 while keeping the model independent of the format. As can be seen from [2], the array model in section II-A closely follows CDM, while the two-level set-oriented data model in section II-B provides additional necessary abstractions.

The major contributions of this paper are: (i) a new two-level formal N-d array data model, (ii) new distributed algorithms, (iii) a performance evaluation of ChronosServer and SciDB on popular satellite data in the Cloud.

The rest of the paper is organized as follows. Section II formally describes the ChronosServer data model. Section III presents generic distributed algorithms for processing arbitrary N-d arrays in NetCDF format by delegating portions of the work to NCO/GDAL tools. Performance evaluation is in section IV. Conclusions are given in section V.

II. CHRONOSSERVER

A. ChronosServer Multidimensional Array Model

In this paper, an N-dimensional array (N-d array) is the mapping $A : D_1 \times D_2 \times \dots \times D_N \mapsto \mathbb{T}$, where $N > 0$, $D_i = [0, l_i) \subset \mathbb{Z}$, $0 < l_i$ is a finite integer, and $\mathbb{T}$ is a numeric type. The $l_i$ is said to be the size or length of the $i$th dimension (in this paper, $i \in [1, N] \subset \mathbb{Z}$). Let us denote the N-d array by

$$A\langle l_1, l_2, \dots, l_N \rangle : \mathbb{T} \qquad (1)$$

By $l_1 \times l_2 \times \dots \times l_N$ denote the shape of $A$, and by $|A|$ denote the size of $A$ such that $|A| = \prod_i l_i$. A cell or element value of $A$ with integer indexes $(x_1, x_2, \dots, x_N)$ is referred to as $A[x_1, x_2, \dots, x_N]$, where $x_i \in D_i$. Each cell value of $A$ is of type $\mathbb{T}$.

Indexes $x_i$ are optionally mapped to specific values of the $i$th dimension by coordinate arrays $A.d_i\langle l_i \rangle : \mathbb{T}_i$, where $\mathbb{T}_i$ is a totally ordered set and $d_i[j] < d_i[j + 1]$ for all $j \in D_i$. In this case, $A$ is defined as

$$A(d_1, d_2, \dots, d_N) : \mathbb{T} \qquad (2)$$
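As an illustration (not part of the paper's formalism), the definitions of section II-A can be mirrored with NumPy; the coordinate values below are invented:

    # A 2-d array A<5,4>:T, form (1), with coordinate arrays A.d1, A.d2
    # mapping integer indexes to dimension values, form (2).
    import numpy as np

    lat = np.array([-60.0, -30.0, 0.0, 30.0, 60.0])  # A.d1: strictly increasing
    lon = np.array([0.0, 90.0, 180.0, 270.0])        # A.d2: strictly increasing
    A = np.random.rand(lat.size, lon.size)           # cells of numeric type T

    assert A.shape == (5, 4)   # shape l1 x l2
    assert A.size == 5 * 4     # size |A| = product of the l_i
    value = A[2, 3]            # cell value A[x1, x2], x_i in D_i = [0, l_i)
    coords = (lat[2], lon[3])  # dimension values of the same cell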
A hyperslab $A' \sqsubseteq A$ is an N-d subarray of $A$. The hyperslab $A'$ is defined by the notation

$$A[b_1 : e_1, \dots, b_N : e_N] = A'(d'_1, \dots, d'_N) \qquad (3)$$

where $b_i, e_i \in \mathbb{Z}$, $b_i \leq e_i < l_i$, $d'_i = d_i[b_i : e_i]$, $|d'_i| = e_i - b_i + 1$, and for all $y_i \in [0, e_i - b_i]$ the following holds:

$$A'[y_1, \dots, y_N] = A[y_1 + b_1, \dots, y_N + b_N] \qquad (4a)$$
$$d'_i[y_i] = d_i[y_i + b_i] \qquad (4b)$$

B. ChronosServer Datasets

A dataset $D = (A, H, P)$ contains a user-level array $A(d_1, \dots, d_N) : \mathbb{T}$ and the set of system-level arrays $P = \{(A_k, k, nid_k)\}$, where $A_k \sqsubseteq A$, $k = (k_1, \dots, k_N) \in \mathbb{Z}^N$ is an N-d key, $nid_k$ is the ID of the cluster node storing array $A_k$, and $H\langle t_1, \dots, t_N \rangle : \mathrm{int}$ is such that $A_k = A[h_1 : h'_1, \dots, h_N : h'_N]$, where $h_i = H[k_1, \dots, k_i, \dots, k_N]$ and $h'_i = H[k_1, \dots, k_i + 1, \dots, k_N]$ (array $A$ is divided by N-d hyperplanes into subarrays: this is quite usual in practice, see the top of fig. 1). A user-level array is never stored explicitly: operations with $A$ are mapped to operations with the subarrays $A_k$.

Arrays $A.d_i$, $H$, and the elements of every $p \in P$ except $p.A$ are stored on Gate. Upon startup, workers connect to Gate and receive a list of all available datasets and file naming rules. Workers scan their local filesystems to discover the datasets and create $p.k$ by parsing file names or reading file metadata. The found set of keys is transmitted to Gate.

III. ARRAY OPERATIONS

A. Aggregation

The aggregate of an N-d array $A(d_1, \dots, d_N) : \mathbb{T}$ over axis $d_1$ is the (N − 1)-d array $A_{aggr}(d_2, \dots, d_N) : \mathbb{T}$ such that $A_{aggr}[x_2, \dots, x_N] = f_{aggr}(cells(A[0 : |d_1| - 1, x_2, \dots, x_N]))$, where $x_2, \dots, x_N$ are valid integer indexes, $f_{aggr} : \mathbb{T}_w \mapsto \mathbb{T}$ is an aggregation function, $\mathbb{T}_w$ is a multiset of values $w \in \mathbb{T}$, and $cells : A' \mapsto \mathbb{T}_w$ yields the multiset of all cell values of an array $A' \sqsubseteq A$.

Line 5 of Algorithm 1 (highlighted in the original listing) delegates the aggregation of subarrays within a single node to a proven, optimized command line tool (ncra in the case of the NetCDF format). The mapping $\mu : (k_2, \dots, k_N) \mapsto id$ returns the ID of the worker to which a partial aggregate $a^{key}_{aggr}$ must be sent; $\mu$ is sent by Gate to the respective workers along with other necessary parameters. The set $P_L \subset P$ contains the locally stored subarrays. Algorithm 1 is executed on each involved worker.

Algorithm 1 Distributed Array Aggregation
Input: P_L, f_aggr, µ
1: P'_L ← {}  ▷ local subarrays for the new dataset
2: for each (k_2, ..., k_N) ∈ {p.k[2:N] : p ∈ P_L} do
3:     key ← (k_2, ..., k_N)
4:     C_key ← {p.A_k : p.k[2:N] = key}
5:     a_aggr^key ← f_aggr(all arrays in C_key)  ▷ delegation
6:     send a_aggr^key to worker µ(key)
7: accept subarrays from other workers: P_aggr^key
8: aggregate all p ∈ P_aggr^key into p_aggr^key
9: P'_L ← P'_L ∪ {(p_aggr^key, key, thisWorkerId)}

Algorithm 1 is illustrated in fig. 1. Subarrays with the same color reside on the same node. Lines 2–6 perform local aggregation of the 3-d $A_k(time, lat, lon)$ subarrays at the top of fig. 1.
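As a rough worker-side illustration of lines 2–6 of Algorithm 1, the sketch below groups local NetCDF subarrays by key and delegates the partial aggregation to ncra. The grouping structure, the hash-based µ, and the send callback are assumptions made for the example; the paper leaves placement and transport to Gate, and lines 7–9 are omitted:

    # Sketch of Algorithm 1, lines 2-6, on one worker (not ChronosServer
    # code): local subarrays sharing a key (k2, ..., kN) are aggregated
    # by the external ncra tool and the partial result is sent onward.
    import subprocess

    def mu(key, n_workers):
        # Hypothetical placement function standing in for the mapping mu
        # distributed by Gate (deterministic for integer-tuple keys).
        return hash(key) % n_workers

    def aggregate_local(local_subarrays, n_workers, send):
        # local_subarrays: dict key (k2, ..., kN) -> list of NetCDF paths,
        # i.e. the subarrays p in P_L grouped by p.k[2:N] (lines 2-4).
        for key, files in local_subarrays.items():
            out = "partial_" + "_".join(map(str, key)) + ".nc"
            # Line 5, the highlighted delegation step.
            subprocess.run(["ncra", "-O", *files, out], check=True)
            send(mu(key, n_workers), out)  # line 6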