The 9th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications, 21-23 September, 2017, Bucharest, Romania

Array DBMS in Environmental Science: Satellite Sea Surface Height Data in the Cloud

Ramon Antonio Rodriges Zalipynis
National Research University Higher School of Economics, Moscow, Russia
[email protected]

Abstract – Nowadays environmental science experiences tremendous growth of raster data: N-dimensional (N-d) arrays coming mainly from numeric simulation and Earth remote sensing. An array DBMS is a tool to streamline raster data processing. However, raster data are usually stored in files, not in databases. Moreover, numerous command line tools exist for processing raster files. This paper describes a distributed array DBMS under development that partially delegates raster data processing to such tools. Our DBMS offers a new N-d array data model to abstract from the files and the tools, and processes data in a distributed fashion directly in their native file formats. As a case study, popular satellite altimetry data were used for the experiments carried out on 8- and 16-node clusters in Microsoft Azure Cloud. The new array DBMS is up to 70× faster than SciDB, which is the only freely available distributed array DBMS to date.

Keywords – SciDB; in situ; command line tools; NetCDF

I. INTRODUCTION

Modern volumes of raster data are enormous. The European Centre for Medium-Range Weather Forecasts alone has accumulated 137.5 million files sized 52.7 PB in total [1]. The main goal of an array DBMS is to process N-d arrays via a flexible yet efficient declarative query style.

The long history of file-based data storage has resulted in many sophisticated raster file formats. For example, the NetCDF format supports multidimensional arrays, chunking, compression, diverse data types, metadata, and a hierarchical namespace [2]. Decades of development have also resulted in numerous elaborate tools for processing raster files. For example, NetCDF Operators (NCO) have been under development since 1995 [3]. GDAL has about 10^6 lines of code written by hundreds of contributors [4].

The idea of partially delegating raster data processing to existing command line tools was first presented and proved to outperform SciDB 3× to 193× in [5]. The delegation ability is being integrated into ChronosServer [6]. Work [5] used NCEP/DOE Reanalysis (R2) data, and a single machine was used for the experiments. No formal array model or formal distributed algorithms were given in [5].

The main goal of this paper is to advance the approach proposed earlier. To achieve this goal, a new two-level data model is designed to uniformly represent diverse raster data types and formats, take into account a distributed cluster environment, and at the same time be independent of the underlying raster file formats. Also, new distributed algorithms are proposed based on the model.

Four modern raster data management trends are relevant to this paper: industrial raster data models, formal array models and algebras, in situ data processing algorithms, and raster (array) DBMS. A good survey of the algorithms is contained in [7]. A recent survey of existing array DBMS and similar systems is in [5]. It is worth mentioning SciDB [8], Oracle Spatial [9], ArcGIS IS [10], rasdaman [11], Intel TileDB [12], and PostGIS [13].

The most well-known array models and algebras are Array Algebra, AML, AQL, and RAM. All of them are mappable to Array Algebra [14]. SciDB does not have a formal description of its data model. The most widely used industry standard data models to abstract from raster file formats are CDM, GDAL, and ISO 19123. These models are mappable to each other [2] and have resulted from decades of considerable practical experience, but they work with a single file, not with a set of files as a single array.

The following requirements, not satisfied by the existing data models, governed the creation of the new model: (i) treat arrays in multiple files distributed over cluster nodes as a single array, (ii) formalize the industrial experience to leverage it in the algorithms, (iii) provide a rich set of data types (Gaussian, irregular grids, etc.), (iv) make the model mapping to a format almost 1:1 but still independent from the format. As can be seen from [2], the array model in section II-A closely follows CDM, while the two-level set-oriented data model in section II-B provides additional necessary abstractions.

The major contributions of this paper are: (i) a new two-level formal N-d array data model, (ii) new distributed algorithms, (iii) a performance evaluation of ChronosServer and SciDB on popular satellite data in the Cloud.

The rest of the paper is organized as follows. Section II formally describes the ChronosServer data model. Section III presents generic distributed algorithms for processing arbitrary N-d arrays in NetCDF format by delegating portions of work to NCO/GDAL tools. Performance evaluation is in section IV. Conclusions are given in section V.

This work was partially supported by the Russian Science Foundation (grant 17-11-01052) and the Russian Foundation for Basic Research (grant 16-37-00416).
II. CHRONOSSERVER

A. ChronosServer Multidimensional Array Model

In this paper, an N-dimensional array (N-d array) is the mapping $A : D_1 \times D_2 \times \dots \times D_N \mapsto \mathbb{T}$, where $N > 0$, $D_i = [0, l_i) \subset \mathbb{Z}$, $0 < l_i$ is a finite integer, and $\mathbb{T}$ is a numeric type. The $l_i$ is said to be the size or length of the $i$-th dimension (in this paper, $i \in [1, N] \subset \mathbb{Z}$). Let us denote the N-d array by

    $A\langle l_1, l_2, \dots, l_N \rangle : \mathbb{T}$                                  (1)

By $l_1 \times l_2 \times \dots \times l_N$ denote the shape of $A$ and by $|A|$ the size of $A$ such that $|A| = \prod_i l_i$. A cell or element value of $A$ with integer indexes $(x_1, x_2, \dots, x_N)$ is referred to as $A[x_1, x_2, \dots, x_N]$, where $x_i \in D_i$. Each cell value of $A$ is of type $\mathbb{T}$.

Indexes $x_i$ are optionally mapped to specific values of the $i$-th dimension by coordinate arrays $A.d_i\langle l_i \rangle : \mathbb{T}_i$, where $\mathbb{T}_i$ is a totally ordered set and $d_i[j] < d_i[j+1]$ for all $j \in D_i$. In this case, $A$ is defined as

    $A(d_1, d_2, \dots, d_N) : \mathbb{T}$                                      (2)

A hyperslab $A' \sqsubseteq A$ is an N-d subarray of $A$. The hyperslab $A'$ is defined by the notation

    $A[b_1 : e_1, \dots, b_N : e_N] = A'(d'_1, \dots, d'_N)$                    (3)

where $b_i, e_i \in \mathbb{Z}$, $b_i \leqslant e_i < l_i$, $d'_i = d_i[b_i : e_i]$, $|d'_i| = e_i - b_i + 1$, and for all $y_i \in [0, e_i - b_i]$ the following holds:

    $A'[y_1, \dots, y_N] = A[y_1 + b_1, \dots, y_N + b_N]$                      (4a)
    $d'_i[y_i] = d_i[y_i + b_i]$                                                (4b)
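To make the index arithmetic of eqs. (1)-(4b) concrete, below is a minimal sketch of a dense N-d array with a row-major backing buffer and hyperslab extraction per eqs. (4a)-(4b). It is illustrative only; the class and method names are hypothetical and are not ChronosServer code.

    /** Minimal dense N-d array sketch; names are hypothetical, not ChronosServer API. */
    final class NdArray {
        final int[] shape;      // l_1, ..., l_N, eq. (1)
        final double[] data;    // cells in row-major order

        NdArray(int... shape) {
            this.shape = shape.clone();
            int size = 1;                       // |A| = l_1 * ... * l_N
            for (int l : shape) size *= l;
            this.data = new double[size];
        }

        /** Row-major offset of cell A[x_1, ..., x_N]. */
        int offset(int... x) {
            int off = 0;
            for (int i = 0; i < shape.length; i++) off = off * shape[i] + x[i];
            return off;
        }

        /** Hyperslab A[b_1:e_1, ..., b_N:e_N], eqs. (3)-(4a): A'[y] = A[y + b]. */
        NdArray hyperslab(int[] b, int[] e) {
            int n = shape.length;
            int[] outShape = new int[n];
            for (int i = 0; i < n; i++) outShape[i] = e[i] - b[i] + 1;   // |d'_i|
            NdArray out = new NdArray(outShape);
            int[] y = new int[n], x = new int[n];
            for (int c = 0; c < out.data.length; c++) {
                for (int i = 0; i < n; i++) x[i] = y[i] + b[i];          // eq. (4a)
                out.data[c] = data[offset(x)];
                for (int i = n - 1; i >= 0; i--) {                       // odometer increment of y
                    if (++y[i] < outShape[i]) break;
                    y[i] = 0;
                }
            }
            return out;
        }
    }

The odometer loop enumerates the output cells in the same row-major order in which they are stored, so the copy in eq. (4a) needs no extra index bookkeeping.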

B. ChronosServer Datasets

A dataset $D = (A, H, P)$ contains a user-level array $A(d_1, \dots, d_N) : \mathbb{T}$ and a set of system-level arrays $P = \{(A_k, k, nid_k)\}$, where $A_k \sqsubseteq A$, $k = (k_1, \dots, k_N) \in \mathbb{Z}^N$ is an N-d key, $nid_k$ is the ID of the cluster node storing the array $A_k$, and $H\langle t_1, \dots, t_N \rangle : \text{int}$ is such that $A_k = A[h_1 : h'_1, \dots, h_N : h'_N]$, where $h_i = H[k_1, \dots, k_i, \dots, k_N]$ and $h'_i = H[k_1, \dots, k_i + 1, \dots, k_N]$ (the array $A$ is divided by N-d hyperplanes into subarrays: this is quite usual in practice, see the top of fig. 1). A user-level array is never stored explicitly: operations with $A$ are mapped to a sequence of operations with the respective arrays $A_k$. For short, let us call a user-level array and a system-level array an array and a subarray, respectively. Due to space constraints, we skip the description of dataset metadata.

C. ChronosServer Architecture

ChronosServer runs on a computer cluster of commodity hardware and operates on arrays stored in diverse raster file formats. A file is always stored entirely on a node, in contrast to parallel or distributed file systems. Workers run on each node and are responsible for data processing. One Gate at a dedicated node receives client queries and coordinates the workers. A file may be replicated on several nodes for fault tolerance and load balancing.

The arrays $A.d_i$, $H$, and the elements of every $p \in P$ except $p.A$ are stored on the Gate. Upon startup, workers connect to the Gate and receive the list of all available datasets and file naming rules. Workers scan their local filesystems to discover the datasets and create $p.k$ by parsing file names or reading file metadata. The found set of keys is transmitted to the Gate.

III. ARRAY OPERATIONS

A. Aggregation

The aggregate of an N-d array $A(d_1, \dots, d_N) : \mathbb{T}$ over the axis $d_1$ is the (N-1)-d array $A_{aggr}(d_2, \dots, d_N) : \mathbb{T}$ such that $A_{aggr}[x_2, \dots, x_N] = f_{aggr}(cells(A[0 : |d_1| - 1, x_2, \dots, x_N]))$, where $x_2, \dots, x_N$ are valid integer indexes, $f_{aggr} : \overline{\mathbb{T}} \mapsto w$ is an aggregation function, $\overline{\mathbb{T}}$ is a multiset of values from $\mathbb{T}$, $w \in \mathbb{T}$, and $cells : A' \mapsto \overline{\mathbb{T}}$ yields the multiset of all cell values of an array $A' \sqsubseteq A$.

Line 5 of Algorithm 1 delegates the aggregation of subarrays within a single node to a proven, optimized command line tool (ncra in the case of the NetCDF format). The mapping $\mu : (k_2, \dots, k_N) \mapsto id$ returns the ID of the worker to which a partial aggregate $a^{key}_{aggr}$ must be sent; $\mu$ is sent by the Gate to the respective workers along with other necessary parameters. The set $P_L \subset P$ contains the locally stored subarrays. Algorithm 1 is executed on each involved worker.

Algorithm 1 Distributed Array Aggregation
Input: $P_L$, $f_{aggr}$, $\mu$
 1: $P'_L \leftarrow \{\}$                                  ▷ local subarrays for the new dataset
 2: for each $(k_2, \dots, k_N) \in \{p.k[2:N] : p \in P_L\}$ do
 3:   $key \leftarrow (k_2, \dots, k_N)$
 4:   $C \leftarrow \{p.A_k : p.k[2:N] = key\}$
 5:   $a^{key}_{aggr} \leftarrow f_{aggr}(\text{all arrays in } C)$          ▷ delegation
 6:   send $a^{key}_{aggr}$ to worker $\mu(key)$
 7: accept subarrays from other workers: $P^{key}_{aggr}$
 8: aggregate all $p \in P^{key}_{aggr}$ into $p^{key}_{aggr}$
 9: $P'_L \leftarrow P'_L \cup \{(p^{key}_{aggr}, key, thisWorkerId)\}$
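Line 5 is the delegation step. The sketch below shows how a worker might shell out to ncra; the helper names are hypothetical, and only documented NCO behavior is assumed (by default ncra averages over the record dimension, and its -y flag selects another operation such as max or min).

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    /** Sketch of Algorithm 1, line 5: delegate local aggregation to ncra (NCO).
     *  Class and method names are hypothetical, not actual ChronosServer code. */
    final class NcraDelegate {
        /**
         * Aggregates all local NetCDF subarray files sharing one key over the
         * record (time) dimension into a single partial-aggregate file.
         * op is an ncra operation type: "avg", "max", "min", ...
         */
        static void aggregateLocal(List<String> inputFiles, String outputFile,
                                   String op) throws IOException, InterruptedException {
            List<String> cmd = new ArrayList<>();
            cmd.add("ncra");      // NCO record aggregator
            cmd.add("-O");        // overwrite the output file if it exists
            cmd.add("-y");        // select the aggregation operation
            cmd.add(op);
            cmd.addAll(inputFiles);
            cmd.add(outputFile);

            Process p = new ProcessBuilder(cmd).inheritIO().start();
            if (p.waitFor() != 0)
                throw new IOException("ncra failed for files: " + inputFiles);
            // The resulting file holds a^{key}_{aggr}; the worker then sends it
            // to worker mu(key) (Algorithm 1, line 6).
        }
    }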
Algorithm 1 is illustrated in fig. 1. Subarrays with the same color reside on the same node. Lines 2-6 perform local aggregation of the 3-d $A_k(time, lat, lon)$ subarrays at the top of fig. 1 over the first axis to obtain the intermediate 2-d aggregates at the bottom of fig. 1. The 2-d arrays covering the same area (they have the same 2-d key) are gathered on one of the nodes, which calculates the final result for a particular 2-d key (not shown). For example, the red and blue 2-d arrays at the lower left of fig. 1 reside on two different nodes. One of the nodes will send its subarray to the other one to calculate the final result.

[Figure 1. Array aggregation.]

Algorithm 1 is for $f_{aggr} \in \{max, min, sum\}$. Calculation of the average is reduced to calculating the sum and dividing each cell of the resulting array by $|A.d_1|$.

B. Multiresolution pyramid

Digital maps like Google or Bing Maps display satellite imagery depending on the current map scale. First, several zoom levels are defined, e.g. $Z = \{0, 1, \dots, 16\}$. A digital map switches its current zoom level when a user zooms in or out of the map. Usually, at zoom level $z \in Z$ the image resolution is $2^z\times$ less than the original one. The multiresolution pyramid is the stack of images for all zoom levels, fig. 2. The display of downsampled images at coarser map scales significantly reduces network traffic and system load. Thus, downsampling functionality is very important for an array DBMS.

[Figure 2. Multiresolution pyramid (3 levels): level 0 is the original array, level 1 is 1/2 resolution, level 2 is 1/4 resolution.]

Numerous techniques exist for image downsampling [15]. However, simple averaging of cells is evaluated in section IV since SciDB does not support anything more. In contrast, a large number of command line utilities with a plethora of options exist specifically targeted at creating coarser image versions. ChronosServer delegates downsampling of a $p \in P$ to gdalwarp. Workers downsample each $p$ independently of other subarrays.
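For reference, the simple cell averaging evaluated in section IV amounts to the following 2× block average, i.e., one pyramid level. This is an illustrative sketch of the operation itself, not of ChronosServer internals, which delegate the work to gdalwarp (its documented -ts and -r average options produce such coarser versions).

    /** 2x downsampling by averaging each 2x2 cell block: one pyramid level.
     *  Assumes even dimensions for brevity; illustration only. */
    final class Downsample {
        static double[][] downsample2x(double[][] src) {
            int h = src.length / 2, w = src[0].length / 2;
            double[][] dst = new double[h][w];
            for (int y = 0; y < h; y++)
                for (int x = 0; x < w; x++)
                    dst[y][x] = (src[2*y][2*x]     + src[2*y][2*x + 1]
                               + src[2*y + 1][2*x] + src[2*y + 1][2*x + 1]) / 4.0;
            return dst;
        }
    }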

C. Interpolation

Given an array $A(d_1, \dots, d_N) : \mathbb{T}$, a value at a coordinate $(y_1, \dots, y_N)$ can be estimated by the operation called interpolation; $y_i \in (d_i[j], d_i[j+1]) \subset \mathbb{T}_i$, and $j \in [0, |d_i| - 1) \subset \mathbb{N}$. Interpolation is a core raster processing operation with many applications [15]. For example, image resolution can be increased twice by interpolation when $y_i = (d_i[j] + d_i[j+1])/2$.

As in the case of downsampling, numerous interpolation techniques exist [15]. The most basic technique is the nearest neighbor: an unknown cell value at a given coordinate is obtained by copying the value from the nearest cell with a known value. The SciDB xgrid operator mimics nearest neighbor interpolation. The operator increases the length of the input array dimensions by an integer scale, replicating the original values, fig. 3. This is only almost equivalent to the nearest neighbor approach, since a generic interpolation must be able to increase the image resolution not only by an integer scale. ChronosServer delegates interpolation of each $p \in P$ to gdalwarp. Workers interpolate each $p$ independently of other subarrays.

[Figure 3. Nearest neighbor 2× interpolation.]
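A sketch of the xgrid-style 2× replication from fig. 3: every source cell is copied into a 2×2 block of the output, which is the integer-scale special case of nearest neighbor interpolation. This is an illustration, not SciDB or ChronosServer code.

    /** xgrid-style 2x upsampling: replicate every cell into a 2x2 block (fig. 3). */
    final class Upsample {
        static double[][] upsample2x(double[][] src) {
            int h = src.length, w = src[0].length;
            double[][] dst = new double[2 * h][2 * w];
            for (int y = 0; y < 2 * h; y++)
                for (int x = 0; x < 2 * w; x++)
                    dst[y][x] = src[y / 2][x / 2];  // nearest source cell
            return dst;
        }
    }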

D. Hyperslabbing

Hyperslabbing is the extraction of a hyperslab from an array, eq. (3). Consider a 2-d array $A(lat, lon)$ consisting of 16 subarrays separated from each other by thick lines, fig. 4. The subarrays possibly reside on different cluster nodes. The hatched area marks the hyperslab $A' = A[3:7, 2:7]$.

[Figure 4. Array hyperslabbing.]

Hyperslabbing of an array is reduced to hyperslabbing of the respective subarrays as follows. Some subarrays do not participate in the hyperslabbing: almost 50% of them do not overlap with $A'$ and can be filtered out beforehand (e.g., $A[0:1, 0:1]$). Also, almost 40% of the subarrays are entirely inside $A'$ and must migrate to the resulting dataset as is (e.g., $A[6:7, 6:7]$). It is necessary to hyperslab only 3 subarrays to complete the operation. Since a subarray is fully located on a node in a single file, hyperslabbing is delegated to a command line tool. Most raster tools support file subsetting, but most of them work on a single machine. ChronosServer scales them out by orchestrating their massively parallel execution. Hyperslabbing of NetCDF is delegated to ncks (NCO). Due to space constraints, the hyperslabbing algorithm is not shown.
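The per-subarray case analysis above (filter out, migrate as is, or clip) can be sketched for the 2-d case as follows; the names are hypothetical and the boxes are inclusive global cell index ranges. The clip case is the one delegated to ncks, whose documented -d dim,min,max option subsets a dimension by an index range.

    /** Classify a 2-d subarray against hyperslab A' = A[b1:e1, b2:e2].
     *  Hypothetical sketch of the case analysis in section III-D. */
    final class HyperslabPlan {
        enum Action { SKIP, MIGRATE_AS_IS, CLIP }

        static Action classify(int sb1, int se1, int sb2, int se2,   // subarray box
                               int b1, int e1, int b2, int e2) {     // hyperslab box
            boolean disjoint = se1 < b1 || e1 < sb1 || se2 < b2 || e2 < sb2;
            if (disjoint) return Action.SKIP;           // no overlap: filter out
            boolean inside = b1 <= sb1 && se1 <= e1 && b2 <= sb2 && se2 <= e2;
            if (inside) return Action.MIGRATE_AS_IS;    // fully covered: keep file as is
            return Action.CLIP;  // partial overlap: delegate, e.g. ncks -d lat,... -d lon,...
        }
    }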
IV. PERFORMANCE EVALUATION

Sea Surface Height Anomalies (SSHA) above a mean sea surface, 1/6° × 1/6° (2160 × 960 cells) grids every 5 days from 05-Jan-2013 to 31-Dec-2015 (≈3.39 GB, NetCDF3), were taken for the performance evaluation [16]. The dataset contains the fully corrected heights derived from the SSHA data of TOPEX/Poseidon, Jason-1, Jason-2 and Jason-3 as reference data from the level 2 swath data and SARAL-AltiKa, ERS-1, ERS-2, Envisat, CryoSat-2, depending on the date, from RADS. The gridding is done by the kriging method.

Microsoft Azure Cloud was used for the experiments. Cluster creation and scaling up/down with given parameters were fully automated with the Java Azure SDK. SciDB v16.9 (the latest at the time of writing) was used, running on Ubuntu Linux 14.04 LTS. D2 v2 machines were used: 2 CPU cores (Intel Xeon E5-2673 v3 (Haswell) 2.4 GHz), 7 GB RAM, 100 GB local SSD drive (4 virtual data disks), max 4 × 500 IOPS.

We have written a Java program that converts NetCDF files to CSV files in order to feed the latter to SciDB. To date, this is the only way to import an external file into SciDB 16.9. SSHA data import into SciDB took about 8 hours. SciDB parameters used: 0 redundancy, 2 instances per machine, 5 execution and prefetch threads, 1 prefetch queue size, 1 operator thread, 1024 MB array cache, etc. ChronosServer is 100% Java code and ran one worker per node on OpenJDK 1.8.0_111 64 bit with max heap size 978 MB (-Xmx). NCO v4.4.2 and GDAL v1.10.1 tools available from the standard Ubuntu 14.04 repository were used.

A ChronosServer node contained 1/M of the files, where M is the number of cluster nodes. SciDB also distributes data uniformly. Cold and hot query runs were evaluated: a query is executed for the first and the second time, respectively. The respective OS commands were issued to free pagecache, dentries, and inodes each time before executing a cold query to prevent data caching at various OS levels. ChronosServer benefits from native OS caching and is much faster during hot runs. There is no significant runtime difference between cold and hot SciDB runs. The results are in Table I.

TABLE I. RESULTS OF PERFORMANCE EVALUATION

                                     Execution time, sec.          SciDB/Chronos
    Operation             Nodes   Chronos (cold/hot)    SciDB      Cold     Hot
    Average               8         17.18 /  1.84      133.81      7.79    72.72
                          16        21.15 /  2.21      118.55      5.61    53.64
    Maximum               8         17.57 /  1.78       82.64      4.70    46.43
                          16        19.44 /  2.01       68.82      3.54    34.24
    Minimum               8         17.48 /  1.81       62.40      3.57    34.48
                          16        19.14 /  2.11       69.51      3.63    32.94
    Hyperslab             8          4.73 /  0.42       16.99      3.59    40.45
    [:, 0:100, 0:100]     16         2.95 /  0.35        9.15      3.10    26.14
    Interpolate 2×        8         72.44 / 22.17      188.79      2.61     8.52
    (4320 × 1920)         16        34.45 / 10.75      114.65      3.33    10.67
    Pyramid: 1 level      8         20.84 /  5.48       57.74      2.77    10.54
    (1080 × 480)          16        10.71 /  4.68       37.56      3.51     8.03

V. CONCLUSIONS AND FUTURE WORK

ChronosServer runs much faster than SciDB due to partial delegation of raster data processing to highly optimized command line tools. At the same time, the ChronosServer data model keeps a high level of independence from the underlying file formats and the tools. The tools are shown to be widely applicable: each algorithm is designed for general-purpose N-d array processing and always has a large portion of work for a tool.

Future work includes adding ACID guarantees and fault tolerance. This may be easier than in a traditional DBMS since ChronosServer datasets are read-only.

REFERENCES

[1] M. Grawinkel et al., "Analysis of the ECMWF storage landscape," in 13th USENIX Conf. on File and Storage Technologies, 2015, p. 83.
[2] S. Nativi, J. Caron, B. Domenico, and L. Bigagli, "Unidata's common data model mapping to the ISO 19123 data model," Earth Sci. Inform., vol. 1, pp. 59-78, 2008.
[3] "NCO homepage," http://nco.sourceforge.net/.
[4] "Coverity scan: GDAL," https://scan.coverity.com/projects/gdal.
[5] R. A. Rodriges Zalipynis, "ChronosServer: Fast in situ processing of large multidimensional arrays with command line tools," in Supercomputing: Second Russian Supercomputing Days, RuSCDays 2016, Moscow, Russia, September 26-27, 2016, Revised Selected Papers, ser. Communications in Computer and Information Science, vol. 687. Cham: Springer International Publishing, 2016, pp. 27-40. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-55669-7_3
[6] R. A. Rodriges Zalipynis, "ChronosServer: real-time access to 'native' multi-terabyte retrospective data warehouse by thousands of concurrent clients," Inform., Cybern. Comput. Eng., vol. 14, no. 188, pp. 151-161, 2011.
[7] S. Blanas et al., "Parallel data analysis directly on scientific file formats," in ACM SIGMOD 2014.
[8] P. Cudre-Mauroux et al., "A demonstration of SciDB: A science-oriented DBMS," Proc. of VLDB Endowment, vol. 2, no. 2, pp. 1534-1537, 2009.
[9] "Oracle spatial and graph," http://www.oracle.com/technetwork/database/options/spatialandgraph/overview/index.html.
[10] "ArcGIS for Server - Image Extension," http://www.esri.com/software/arcgis/arcgisserver/extensions/image-extension.
[11] P. Baumann et al., "The array database that is not a database: File based array query answering in rasdaman," in SSTD 2013.
[12] "TileDB," http://istc-bigdata.org/tiledb/index.html.
[13] "PostGIS raster data management," http://postgis.net/docs/manual-2.2/using_raster_dataman.html.
[14] P. Baumann and S. Holsten, "A comparative analysis of array models for databases," Int. J. Database Theory Appl., vol. 5, no. 1, pp. 89-120, 2012.
[15] J. A. Richards, Remote Sensing Digital Image Analysis: An Introduction, 5th ed. Springer-Verlag Berlin Heidelberg, 2013.
[16] "Sea Level Data," http://dx.doi.org/10.5067/SLREF-CDRV1.