Array DBMS in Environmental Science

Total pages: 16

File type: PDF, size: 1020 KB

Array DBMS in Environmental Science: Satellite Sea Surface Height Data in the Cloud

The 9th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications, 21–23 September 2017, Bucharest, Romania

Ramon Antonio Rodriges Zalipynis, National Research University Higher School of Economics, Moscow, Russia, [email protected]

Abstract – Nowadays environmental science experiences tremendous growth of raster data: N-dimensional (N-d) arrays coming mainly from numeric simulation and Earth remote sensing. An array DBMS is a tool to streamline raster data processing. However, raster data are usually stored in files, not in databases. Moreover, numerous command line tools exist for processing raster files. This paper describes a distributed array DBMS under development that partially delegates raster data processing to such tools. Our DBMS offers a new N-d array data model to abstract from the files and the tools, and processes data in a distributed fashion directly in their native file formats. As a case study, popular satellite altimetry data were used for experiments carried out on 8- and 16-node clusters in the Microsoft Azure Cloud. The new array DBMS is up to 70× faster than SciDB, which is the only freely available distributed array DBMS to date.

Keywords – SciDB; in situ; command line tools; NetCDF

I. INTRODUCTION

Modern volumes of raster data are enormous. The European Centre for Medium-Range Weather Forecasts alone has accumulated 137.5 million files sized 52.7 PB in total [1]. The main goal of an array DBMS is to process N-d arrays via a flexible yet efficient declarative query style.

The long history of file-based data storage has resulted in many sophisticated raster file formats. For example, the NetCDF format supports multidimensional arrays, chunking, compression, diverse data types, metadata, and a hierarchical namespace [2]. Decades of development have also produced numerous elaborate tools for processing raster files. For example, the NetCDF Operators (NCO) have been under development since 1995 [3]. GDAL has about 10^6 lines of code contributed by hundreds of developers [4].

The idea of partially delegating raster data processing to existing command line tools was first presented, and shown to outperform SciDB by 3× to 193×, in [5]. The delegation ability is being integrated into ChronosServer [6]. Work [5] used NCEP/DOE Reanalysis (R2) data, a single machine was used for the experiments, and no formal array model or formal distributed algorithms were given.

The main goal of this paper is to advance the approach proposed earlier. To achieve this goal, a new two-level data model is designed to uniformly represent diverse raster data types and formats, take into account a distributed cluster environment, and at the same time be independent of the underlying raster file formats. Also, new distributed algorithms are proposed based on the model.

Four modern raster data management trends are relevant to this paper: industrial raster data models, formal array models and algebras, in situ data processing algorithms, and raster (array) DBMS. A good survey of the algorithms is contained in [7]. A recent survey of existing array DBMS and similar systems is in [5]. It is worth mentioning SciDB [8], Oracle Spatial [9], ArcGIS IS [10], RasDaMan [11], Intel TileDB [12], and PostGIS [13].

The most well-known array models and algebras are Array Algebra, AML, AQL, and RAM. All of them are mappable to Array Algebra [14]. SciDB does not have a formal description of its data model. The most widely used industry-standard data models that abstract from raster file formats are CDM, GDAL, and ISO 19123. These models are mappable to each other [2] and have resulted from decades of considerable practical experience, but they work with a single file, not with a set of files as a single array.

The following requirements, not satisfied by the existing data models, governed the creation of the new model: (i) treat arrays in multiple files distributed over cluster nodes as a single array, (ii) formalize the industrial experience to leverage it in the algorithms, (iii) provide a rich set of data types (Gaussian and irregular grids, etc.), (iv) make the model mapping to a format almost 1:1 but still independent from the format. As can be seen from [2], the array model in section II-A closely follows CDM, while the two-level set-oriented data model in section II-B provides additional necessary abstractions.

The major contributions of this paper are: (i) a new two-level formal N-d array data model, (ii) new distributed algorithms, (iii) a performance evaluation of ChronosServer and SciDB on popular satellite data in the Cloud.

The rest of the paper is organized as follows. Section II formally describes the ChronosServer data model. Section III presents generic distributed algorithms for processing arbitrary N-d arrays in NetCDF format by delegating portions of the work to NCO/GDAL tools. Performance evaluation is in section IV. Conclusions are given in section V.

(This work was partially supported by the Russian Science Foundation (grant 17-11-01052) and the Russian Foundation for Basic Research (grant 16-37-00416).)
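The NetCDF capabilities mentioned in the introduction (multidimensional variables, chunking, rich metadata, partial reads) can be inspected directly with the netCDF4 Python bindings; a minimal sketch, assuming a hypothetical altimetry file ssh_2017.nc with a variable adt:

# Minimal sketch (not from the paper) of inspecting a NetCDF file with the
# netCDF4 Python library; "ssh_2017.nc" and the variable name "adt" are
# hypothetical placeholders for a satellite altimetry product.
from netCDF4 import Dataset

with Dataset("ssh_2017.nc", mode="r") as nc:            # open read-only
    print(nc.dimensions)                                # e.g. time, lat, lon and their lengths
    print(nc.variables.keys())                          # multidimensional variables stored in the file
    print(nc.ncattrs())                                 # global (file-level) metadata attributes

    var = nc.variables["adt"]                           # an N-d variable (hypothetical name)
    print(var.dimensions, var.shape, var.datatype)      # dimension names, sizes, numeric type
    print(var.chunking())                               # chunk sizes, or 'contiguous'
    subset = var[0, :10, :10]                           # read only a hyperslab, not the whole array
    print(subset.shape)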
II. CHRONOSSERVER

A. ChronosServer Multidimensional Array Model

In this paper, an N-dimensional array (N-d array) is the mapping A : D_1 × D_2 × ⋯ × D_N → T, where N > 0, D_i = [0, l_i) ⊂ ℤ, 0 < l_i is a finite integer, and T is a numeric type. The l_i is said to be the size or length of the i-th dimension (in this paper, i ∈ [1, N] ⊂ ℤ). Let us denote the N-d array by

  A⟨l_1, l_2, …, l_N⟩ : T                                              (1)

By l_1 × l_2 × ⋯ × l_N denote the shape of A, and by |A| denote the size of A such that |A| = ∏_i l_i. A cell or element value of A with integer indexes (x_1, x_2, …, x_N) is referred to as A[x_1, x_2, …, x_N], where x_i ∈ D_i. Each cell value of A is of type T.

Indexes x_i are optionally mapped to specific values of the i-th dimension by coordinate arrays A.d_i⟨l_i⟩ : T_i, where T_i is a totally ordered set and d_i[j] < d_i[j + 1] for all j ∈ D_i. In this case, A is defined as

  A(d_1, d_2, …, d_N) : T                                              (2)

A hyperslab A′ ⊑ A is an N-d subarray of A. The hyperslab A′ is defined by the notation

  A[b_1 : e_1, …, b_N : e_N] = A′(d′_1, …, d′_N)                       (3)

where b_i, e_i ∈ ℤ, b_i ≤ e_i < l_i, d′_i = d_i[b_i : e_i], |d′_i| = e_i − b_i + 1, and for all y_i ∈ [0, e_i − b_i] the following holds:

  A′[y_1, …, y_N] = A[y_1 + b_1, …, y_N + b_N]                         (4a)
  d′_i[y_i] = d_i[y_i + b_i]                                           (4b)

B. ChronosServer Datasets

A dataset D = (A, H, P) contains a user-level array A(d_1, …, d_N) : T and the set of system-level arrays P = {(A_k, k, nid_k)}, where A_k ⊑ A, k = (k_1, …, k_N) ∈ ℤ^N is an N-d key, and nid_k is the ID of the cluster node storing array A_k. The array H⟨t_1, …, t_N⟩ : int is such that A_k = A[h_1 : h′_1, …, h_N : h′_N], where h_i = H[k_1, …, k_i, …, k_N] and h′_i = H[k_1, …, k_i + 1, …, k_N] (array A is divided by N-d hyperplanes into subarrays: this is quite usual in practice, see the top of fig. 1). A user-level array is never stored explicitly: operations with A are mapped to operations on its system-level subarrays A_k.

Arrays A.d_i, H, and all elements of every p ∈ P except p.A are stored on Gate. Upon startup, workers connect to Gate and receive a list of all available datasets and file naming rules. Workers scan their local filesystems to discover the datasets and create p.k by parsing file names or reading file metadata. The found set of keys is transmitted to Gate.
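The definitions of section II-A map naturally onto in-memory arrays; a minimal numpy sketch (illustrative sizes and coordinate values, not the paper's data) of an array with coordinate arrays and a hyperslab satisfying (4a)/(4b):

# Minimal sketch (not from the paper) of the model in section II using numpy:
# an array A(time, lat, lon) with coordinate arrays, and a hyperslab A' = A[b:e, ...]
# that satisfies (4a)/(4b). Sizes and coordinate values are made up for illustration.
import numpy as np

l = (4, 3, 5)                                  # shape l1 x l2 x l3, |A| = 60
A = np.arange(np.prod(l), dtype=np.float64).reshape(l)

# Coordinate arrays A.d_i: strictly increasing, one value per index of dimension i.
d_time = np.array([0, 6, 12, 18])              # hours
d_lat  = np.array([-10.0, 0.0, 10.0])          # degrees
d_lon  = np.array([0.0, 2.0, 4.0, 6.0, 8.0])   # degrees

# Hyperslab A' = A[b1:e1, b2:e2, b3:e3] (inclusive bounds b_i <= e_i < l_i).
b, e = (1, 0, 2), (2, 1, 4)
Ap = A[b[0]:e[0] + 1, b[1]:e[1] + 1, b[2]:e[2] + 1]
dp_lon = d_lon[b[2]:e[2] + 1]                  # d'_i = d_i[b_i:e_i], property (4b)

# Property (4a): A'[y1,...,yN] = A[y1 + b1, ..., yN + bN].
assert Ap[0, 0, 0] == A[b[0], b[1], b[2]]
assert dp_lon[0] == d_lon[b[2]]
print(Ap.shape, dp_lon)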
III. ARRAY OPERATIONS

Aggregation. The aggregate of an N-d array A(d_1, …, d_N) : T over axis d_1 is the (N − 1)-d array A_aggr(d_2, …, d_N) : T such that A_aggr[x_2, …, x_N] = f_aggr(cells(A[0 : |d_1| − 1, x_2, …, x_N])), where x_2, …, x_N are valid integer indexes, f_aggr : 𝕋 → w is an aggregation function, 𝕋 is a multiset of values from T, w ∈ T, and cells : A′ → 𝕋 yields the multiset of all cell values of an array A′ ⊑ A.

Algorithm 1 on the highlighted line 5 delegates aggregation of subarrays within a single node to a proven, optimized command line tool (ncra in the case of the NetCDF format). The mapping µ : (k_2, …, k_N) → id returns the ID of the worker to which a partial aggregate a_aggr^key must be sent; µ is sent by Gate to the respective workers along with other necessary parameters. The set P_L ⊂ P contains the locally stored subarrays. Algorithm 1 is executed on each involved worker.

Algorithm 1 Distributed Array Aggregation
Input: P_L, f_aggr, µ
 1: P′_L ← {}                               ▷ local subarrays for the new dataset
 2: for each (k_2, …, k_N) ∈ {p.k[2:N] : p ∈ P_L} do
 3:   key ← (k_2, …, k_N)
 4:   C ← {p.A_k : p.k[2:N] = key}
 5:   a_aggr^key ← f_aggr(all arrays in C)  ▷ delegation
 6:   send a_aggr^key to worker µ(key)
 7: accept subarrays from other workers: P_aggr^key
 8: aggregate all p ∈ P_aggr^key into p_aggr^key
 9: P′_L ← P′_L ∪ {(p_aggr^key, key, thisWorkerId)}

Algorithm 1 is illustrated in fig. 1. Subarrays with the same color reside on the same node. Lines 2–6 perform local aggregation of 3-d A_k(time, lat, lon) subarrays at the top of fig. 1.
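A sketch of the per-worker core of Algorithm 1 for NetCDF inputs, using the NCO tool ncra for the line-5 delegation; the file layout, the send callback, and the hash-based µ are assumptions of this illustration, and the merge of partials received from other workers (lines 7–9) is omitted:

# Sketch (not the paper's code) of the local part of Algorithm 1 for NetCDF inputs:
# subarrays sharing a key (k2,...,kN) are aggregated locally by delegating to the
# NCO tool ncra (line 5), then the partial result is routed to the worker chosen
# by the mapping mu. Helper names and file layout are hypothetical.
import subprocess
from collections import defaultdict

def mu(key, num_workers):
    """Assumed key-to-worker mapping: hash the key tuple onto a worker ID."""
    return hash(key) % num_workers

def local_aggregate(local_subarrays, num_workers, send):
    """local_subarrays: list of (key_tuple, netcdf_path) for subarrays on this node.
    send(worker_id, path): assumed transport callback (e.g. TCP or a shared FS)."""
    groups = defaultdict(list)                    # line 2: group local files by key
    for key, path in local_subarrays:
        groups[key].append(path)

    for key, paths in groups.items():             # lines 3-6
        out = f"partial_{'_'.join(map(str, key))}.nc"
        # Line 5 (delegation): ncra aggregates the inputs along the record (time) axis.
        subprocess.run(["ncra", "-O", *paths, out], check=True)
        send(mu(key, num_workers), out)           # line 6: route the partial aggregate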
Recommended publications
  • An Array DBMS for Simulation Analysis and ML Models Predictions
SAVIME: An Array DBMS for Simulation Analysis and ML Models Predictions

Hermano L. S. Lustosa¹, Anderson C. Silva¹, Daniel N. R. da Silva¹, Patrick Valduriez², Fabio Porto¹
¹ National Laboratory for Scientific Computing, Rio de Janeiro, Brazil – {hermano, achaves, dramos, fporto}@lncc.br
² Inria, University of Montpellier, CNRS, LIRMM, France – [email protected]

Abstract. Limitations in current DBMSs prevent their wide adoption in scientific applications. In order to make them benefit from DBMS support, enabling declarative data analysis and visualization over scientific data, we present an in-memory array DBMS called SAVIME. In this work we describe the system SAVIME, along with its data model. Our preliminary evaluation shows how SAVIME, by using a simple storage definition language (SDL), can outperform the state-of-the-art array database system, SciDB, during the process of data ingestion. We also show that it is possible to use SAVIME as a storage alternative for a numerical solver without affecting its scalability, making it useful for modern ML based applications.

Categories and Subject Descriptors: H.2.4 [Database Management]: Systems

Keywords: Scientific Data Management, Multidimensional Array, Machine Learning

1. INTRODUCTION

Due to the increasing computational power of HPC environments, vast amounts of data are now generated in different research fields, and a relevant part of this data is best represented as array data. Geospatial and temporal data in climate modeling, astronomical images, medical imagery, multimedia and simulation data are naturally represented as multidimensional arrays. To analyze such data, DBMSs offer many advantages, like query languages, which ease data analysis and avoid the need for extensive coding/scripting, and a logical data view that isolates data from the applications that consume it.
  • Array Databases: Concepts, Standards, Implementations
Baumann et al. J Big Data (2021) 8:28. https://doi.org/10.1186/s40537-020-00399-2 – SURVEY PAPER, Open Access

Array databases: concepts, standards, implementations

Peter Baumann, Dimitar Misev, Vlad Merticariu and Bang Pham Huu*
*Correspondence: b.phamhuu@jacobs-university.de; Large-Scale Scientific Information Systems Research Group, Jacobs University, Bremen, Germany

Abstract: Multi-dimensional arrays (also known as raster data or gridded data) play a key role in many, if not all science and engineering domains where they typically represent spatio-temporal sensor, image, simulation output, or statistics "datacubes". As classic database technology does not support arrays adequately, such data today are maintained mostly in silo solutions, with architectures that tend to erode and not keep up with the increasing requirements on performance and service quality. Array Database systems attempt to close this gap by providing declarative query support for flexible ad-hoc analytics on large n-D arrays, similar to what SQL offers on set-oriented data, XQuery on hierarchical data, and SPARQL and CIPHER on graph data. Today, Petascale Array Database installations exist, employing massive parallelism and distributed processing. Hence, questions arise about technology and standards available, usability, and overall maturity. Several papers have compared models and formalisms, and benchmarks have been undertaken as well, typically comparing two systems against each other. While each of these represent valuable research, to the best of our knowledge there is no comprehensive survey combining model, query language, architecture, practical usability, and performance aspects. The size of this comparison differentiates our study as well, with 19 systems compared and four benchmarked to an extent and depth clearly exceeding previous papers in the field; for example, subsetting tests were designed in a way that systems cannot be tuned to specifically these queries.
  • The Gamma Operator for Big Data Summarization on an Array DBMS
JMLR: Workshop and Conference Proceedings 36:88–103, 2014 – BIGMINE 2014

The Gamma Operator for Big Data Summarization on an Array DBMS

Carlos Ordonez [email protected], Yiqun Zhang [email protected], Wellington Cabrera [email protected] – University of Houston, USA
Editors: Wei Fan, Albert Bifet, Qiang Yang and Philip Yu

Abstract: SciDB is a parallel array DBMS that provides multidimensional arrays, a query language and basic ACID properties. In this paper, we introduce a summarization matrix operator that computes sufficient statistics in one pass and in parallel on an array DBMS. Such sufficient statistics benefit a big family of statistical and machine learning models, including PCA, linear regression and variable selection. Experimental evaluation on a parallel cluster shows our matrix operator exhibits linear time complexity and linear speedup. Moreover, our operator is shown to be an order of magnitude faster than SciDB built-in operators, two orders of magnitude faster than SQL queries on a fast column DBMS and even faster than the R package when the data set fits in RAM. We show SciDB operators and the R package fail due to RAM limitations, whereas our operator does not. We also show PCA and linear regression computation is reduced to a few minutes for large data sets. On the other hand, a Gibbs sampler for variable selection can iterate much faster in the array DBMS than in R, exploiting the summarization matrix.

Keywords: Array, Matrix, Linear Models, Summarization, Parallel

1. Introduction

Row DBMSs remain the best technology for OLTP [Stonebraker et al. (2007)] and column DBMSs are becoming a competitor in OLAP (Business Intelligence) query processing [Stonebraker et al. (2005)] in large data warehouses [Stonebraker et al. (2010)].
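The summarization idea in this abstract can be illustrated with a small numpy sketch (my illustration of the general concept, not the SciDB operator itself): augmenting each data row and taking one matrix product yields the sufficient statistics n, L = Σxᵢ and Q = Σxᵢxᵢᵀ, from which, for example, linear-regression coefficients follow.

# Illustration (not the paper's SciDB operator) of one-pass sufficient statistics:
# augment each row x_i with a leading 1 and the response y_i, then Gamma = Z^T Z
# contains n, L = sum(x_i), Q = sum(x_i x_i^T) and the cross-terms with y.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                              # 1000 points, d = 3
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)

Z = np.hstack([np.ones((X.shape[0], 1)), X, y[:, None]])    # z_i = [1, x_i, y_i]
Gamma = Z.T @ Z                                             # (d+2) x (d+2) summarization matrix

n = Gamma[0, 0]                                             # count
L = Gamma[0, 1:-1]                                          # column sums of X
Q = Gamma[1:-1, 1:-1]                                       # X^T X
XTy = Gamma[1:-1, -1]                                       # X^T y
beta = np.linalg.solve(Q, XTy)                              # least squares from the summary alone
print(n, beta)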
  • Multidimensional Arrays for Analysing Geoscientific Data
International Journal of Geo-Information – Review

Multidimensional Arrays for Analysing Geoscientific Data

Meng Lu 1,2,*,†, Marius Appel 1 and Edzer Pebesma 1
1 Institute for Geoinformatics, University of Muenster, 48149 Muenster, Germany; [email protected] (M.A.); [email protected] (E.P.)
2 Department of Physical Geography, Faculty of Geoscience, Utrecht University, 3584 CB Utrecht, The Netherlands
* Correspondence: [email protected]; Tel.: +31 648901629
† Current address: Vening Meinesz Building A, Princetonlaan 8A, 3584 CB Utrecht, The Netherlands.
Received: 27 June 2018; Accepted: 30 July 2018; Published: 3 August 2018

Abstract: Geographic data is growing in size and variety, which calls for big data management tools and analysis methods. To efficiently integrate information from high dimensional data, this paper explicitly proposes array-based modeling. A large portion of Earth observations and model simulations are naturally arrays once digitalized. This paper discusses the challenges in using arrays such as the discretization of continuous spatiotemporal phenomena, irregular dimensions, regridding, high-dimensional data analysis, and large-scale data management. We define categories and applications of typical array operations, compare their implementation in open-source software, and demonstrate dimension reduction and array regridding in study cases using Landsat and MODIS imagery. It turns out that arrays are a convenient data structure for representing and analysing many spatiotemporal phenomena. Although the array model simplifies data organization, array properties like the meaning of grid cell values are rarely being made explicit in practice.

Keywords: multidimensional arrays; geoscientific data; data analysis; spatiotemporal modeling

1. Introduction

An array is a storage form for a sequence of objects of similar type.
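A small sketch of the kind of array operation the paper categorizes – reducing a (time, y, x) image stack over the time dimension to a composite – using synthetic data in place of Landsat/MODIS scenes:

# Sketch of dimension reduction on a (time, y, x) stack; data are synthetic,
# with a random mask standing in for cloudy pixels.
import numpy as np

rng = np.random.default_rng(1)
stack = rng.normal(loc=0.3, scale=0.05, size=(12, 100, 100))   # 12 monthly scenes
cloudy = rng.random(stack.shape) < 0.2                          # mask ~20% of the cells
masked = np.where(cloudy, np.nan, stack)

composite = np.nanmedian(masked, axis=0)    # reduce the time dimension, ignoring masked cells
anomaly = masked - composite                # broadcast back over time: per-pixel anomalies
print(composite.shape, np.nanmean(anomaly))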
  • A Survey on Array Storage, Query Languages, and Systems
A Survey on Array Storage, Query Languages, and Systems

Florin Rusu, Yu Cheng
University of California, Merced, 5200 N Lake Road, Merced, CA 95343
{frusu,ycheng4}@ucmerced.edu
February 2013

Abstract: Since scientific investigation is one of the most important providers of massive amounts of ordered data, there is a renewed interest in array data processing in the context of Big Data. To the best of our knowledge, a unified resource that summarizes and analyzes array processing research over its long existence is currently missing. In this survey, we provide a guide for past, present, and future research in array processing. The survey is organized along three main topics. Array storage discusses all the aspects related to array partitioning into chunks. The identification of a reduced set of array operators to form the foundation for an array query language is analyzed across multiple such proposals. Lastly, we survey real systems for array processing. The result is a thorough survey on array data storage and processing that should be consulted by anyone interested in this research topic, independent of experience level. The survey is not complete though. We greatly appreciate pointers towards any work we might have forgotten to mention.

1 Introduction

Big Data [33] is the new buzz word in computer science as of 2012. There are new conferences organized specifically to tackle Big Data issues. Many classical research areas – beyond data management and databases – allocate significant attention to Big Data problems. And, most importantly, state governments provide unprecedented amounts of funds to support Big Data research [36] – at the end of the day, Big Data analytics played a significant role in the 2012 US presidential elections.
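The chunking topic mentioned in the abstract boils down to simple index arithmetic; a minimal sketch (chunk shape chosen arbitrarily) of mapping an N-d cell coordinate to its chunk and in-chunk offset:

# Sketch of the basic bookkeeping behind array chunking: with a fixed chunk shape,
# a cell coordinate splits into a chunk index (which chunk) and an offset (where inside it).
from typing import Tuple

def chunk_of(cell: Tuple[int, ...], chunk_shape: Tuple[int, ...]):
    """Return (chunk_index, offset_within_chunk) for an N-d cell coordinate."""
    chunk_idx = tuple(c // s for c, s in zip(cell, chunk_shape))
    offset = tuple(c % s for c, s in zip(cell, chunk_shape))
    return chunk_idx, offset

# A 3-d array chunked into 100 x 180 x 360 tiles (e.g. time x lat x lon).
print(chunk_of((250, 95, 723), (100, 180, 360)))   # -> ((2, 0, 2), (50, 95, 3))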
  • The Bigdawg Polystore System
The BigDAWG Polystore System

Jennie Duggan (Northwestern), Aaron J. Elmore (Univ. of Chicago), Michael Stonebraker (MIT), Magda Balazinska (Univ. of Washington), Bill Howe (Univ. of Washington), Jeremy Kepner (MIT), Sam Madden (MIT), David Maier (Portland State), Tim Mattson (Intel), Stan Zdonik (Brown Univ.)

ABSTRACT
This paper presents a new view of federated databases to address the growing need for managing information that spans multiple data models. This trend is fueled by the proliferation of storage engines and query languages based on the observation that "no one size fits all". To address this shift, we propose a polystore architecture; it is designed to unify querying over multiple data models. We consider the challenges and opportunities associated with polystores. Open questions in this space revolve around query optimization and the assignment of objects to storage engines. We introduce our approach to these topics and discuss our prototype in the context of the Intel Science and Technology Center for Big Data.

and Technology Center for Big Data, we have constructed a medical example of this new class of applications, based on the MIMIC II dataset [18]. These publicly available patient records cover 26,000 intensive care unit admissions at Boston's Beth Israel Deaconess Hospital. It includes waveform data (up to 125 Hz measurements from bedside devices), patient metadata (name, age, etc.), doctor's and nurse's notes (text), lab results, and prescriptions filled (semi-structured data). A production implementation would store all of the historical data augmented by real-time streams from current patients. Given the variety of data sources, this system must support an assortment of data types, standard SQL analytics (e.g., how many patients were given
  • The Architecture of Scidb
The Architecture of SciDB

Michael Stonebraker, Paul Brown, Alex Poliakov, Suchi Raman
Paradigm4, Inc., 186 Third Avenue, Waltham, MA 02451

Abstract. SciDB is an open-source analytical database oriented toward the data management needs of scientists. As such it mixes statistical and linear algebra operations with data management ones, using a natural nested multi-dimensional array data model. We have been working on the code for two years, most recently with the help of venture capital backing. Release 11.06 (June 2011) is downloadable from our website (SciDB.org). This paper presents the main design decisions of SciDB. It focuses on our decisions concerning a high-level, SQL-like query language, the issues facing our query optimizer and executor, and efficient storage management for arrays. The paper also discusses implementation of features not usually present in DBMSs, including version control, uncertainty and provenance.

Keywords: scientific data management, multi-dimensional array, statistics, linear algebra

1 Introduction and Background

The Large Synoptic Survey Telescope (LSST) [1] is the next "big science" astronomy project, a telescope being erected in Chile, which will ultimately collect and manage some 100 Petabytes of raw and derived data. In October 2007, the members of the LSST data management team realized the scope of their data management problem, and that they were uncertain how to move forward. As a result, they organized the first Extremely Large Data Base (XLDB-1) conference at the Stanford National Accelerator Laboratory [2]. Present were many scientists from a variety of natural science disciplines as well as representatives from large web properties. All reported the following requirements: Multi-petabyte amounts of data.
  • On the Integration of Array and Relational Models in Databases
On the Integration of Array and Relational Models in Databases

by Dimitar Mišev

A thesis submitted in partial fulfillment for the degree of Doctor of Philosophy in Computer Science

Prof. Dr. Peter Baumann, Jacobs University Bremen
Prof. Dr. Michael Sedlmair, Jacobs University Bremen
Prof. Dr. Tore Risch, Uppsala University
Dr. Heinrich Stamerjohanns, Jacobs University Bremen

Date of defense: May 15th, 2018
Computer Science & Electrical Engineering

Statutory Declaration
Family Name, Given/First Name: Mišev, Dimitar
Matriculation number: 20327580
Type of thesis: PhD

English: Declaration of Authorship
I hereby declare that the thesis submitted was created and written solely by myself without any external support. Any sources, direct or indirect, are marked as such. I am aware of the fact that the contents of the thesis in digital form may be revised with regard to usage of unauthorized aid as well as whether the whole or parts of it may be identified as plagiarism. I do agree my work to be entered into a database for it to be compared with existing sources, where it will remain in order to enable further comparisons with future theses. This does not grant any rights of reproduction and usage, however. This document was neither presented to any other examination board nor has it been published.

German: Erklärung der Autorenschaft (Urheberschaft)
Ich erkläre hiermit, dass die vorliegende Arbeit ohne fremde Hilfe ausschließlich von mir erstellt und geschrieben worden ist. Jedwede verwendeten Quellen, direkter oder indirekter Art, sind als solche kenntlich gemacht worden. Mir ist die Tatsache bewusst, dass der Inhalt der Thesis in digitaler Form geprüft werden kann im Hinblick darauf, ob es sich ganz oder in Teilen um ein Plagiat handelt.
  • Scidb DBMS Research at M.I.T
SciDB DBMS Research at M.I.T.

Michael Stonebraker, Jennie Duggan, Leilani Battle, Olga Papaemmanouil
{stonebraker, jennie, [email protected], [email protected]

Abstract
This paper presents a snapshot of some of our scientific DBMS research at M.I.T. as part of the Intel Science and Technology Center on Big Data. We focus our efforts primarily on SciDB, although some of our work can be used for any backend DBMS. We summarize our work on making SciDB elastic, providing skew-aware join strategies, and producing scalable visualizations of scientific data.

1 Introduction

In [19] we presented a description of SciDB, an array-based parallel DBMS oriented toward science applications. In that paper we described the tenets on which the system is constructed, the early use cases where it has found acceptance, and the state of the software at the time of publication. In this paper, we consider a collection of research topics that we are investigating at M.I.T. as part of the Intel Science and Technology Center on Big Data [20]. We begin in Section 2 with the salient characteristics of science data that guide our explorations. We then consider algorithms for making a science DBMS elastic, a topic we cover in Section 3. Then, we turn in Section 4 to query processing algorithms appropriate for science DBMS applications. Lastly, in Section 5 we discuss our work on producing a scalable visualization system for science applications.

2 Characteristics of Science DBMS Applications

In this section we detail some of the characteristics of science applications that guide our explorations, specifically an array data model, variable density of data, skew, and the need for visualization.
  • Principles of Distributed Database Systems: Spotlight on Newsql
Principles of Distributed Database Systems: spotlight on NewSQL
Patrick Valduriez (slides, SBBD 2020)

Outline
• Distributed database systems
• NoSQL
• Polystores
• Spotlight on NewSQL
• Taxonomy of NewSQL systems
• Current trends

The Story of the Book (Principles of Distributed Database Systems, Tamer Özsu & Patrick Valduriez)
• 1991: Relational databases. "In the following 10 years, centralized DBMSs would be an antique curiosity and most organizations would move towards distributed DBMSs." – M. Stonebraker (1988)
• 1999: Advanced transaction models, query optimization, object data management, parallel DBMSs.
• 2011: Data replication, database clusters, web data integration, P2P, cloud.
• 2020: Blockchain, big data, data streaming, graph data analytics, NoSQL, NewSQL, polystores.

Distributed Database – User View 1991 / User View 2020 (diagrams)
Distributed DBMS – Reality (diagram: users and application software issuing queries against several DBMS software instances connected by a communication subsystem)

Definitions
• A distributed database (DDB) is a collection of multiple, logically interrelated databases distributed over a computer network (WAN, LAN, cluster interconnect)
• A distributed database system (distributed DBMS) is the software that manages the DDB and provides an access mechanism that makes this distribution transparent to the users

Promises of Distributed DBMSs
• Transparent management of distributed, fragmented, and replicated data
  • Declarative query language (SQL)
• Improved reliability/availability through replication and distributed transactions
  • Strong ACID consistency
  • Failover and online recovery
• Improved performance
  • Proximity of data to its points of use
  • Data-based parallelism
  • Query optimization
• Easier and more economical system expansion
  • Elasticity in the cloud
  • Skew-Aware Join Optimization for Array Databases
Skew-Aware Join Optimization for Array Databases

Jennie Duggan‡, Olga Papaemmanouil†, Leilani Battle⋆, Michael Stonebraker⋆
‡ Northwestern University, † Brandeis University, ⋆ MIT
‡ [email protected], †[email protected], ⋆{leilani, stonebraker}@csail.mit.edu

ABSTRACT
Science applications are accumulating an ever-increasing amount of multidimensional data. Although some of it can be processed in a relational database, much of it is better suited to array-based engines. As such, it is important to optimize the query processing of these systems. This paper focuses on efficient query processing of join operations within an array database. These engines invariably "chunk" their data into multidimensional tiles that they use to efficiently process spatial queries. As such, traditional relational algorithms need to be substantially modified to take advantage of array tiles. Moreover, most n-dimensional science data is unevenly distributed in array space because its underlying observations rarely follow a uniform pattern.

In addition, queries to scientific databases do not resemble the ones found in traditional business data processing applications. Complex analytics, such as predictive modeling and linear regression, are prevalent in such workloads, replacing the more traditional SQL aggregates found in business intelligence applications. Such analytics are invariably linear algebra-based and are more CPU-intensive than RDBMS ones.

The relational data model has shown itself to be ill-suited for many of these science workloads [37], and performance can be greatly improved through use of a multidimensional, array-based data model [34]. As a result, numerous array processing solutions have emerged to support science applications in a distributed environment.
  • Scientific Dbmss at Scale and Scidb Mike Stonebraker Outline
Scientific DBMSs at Scale and SciDB
Mike Stonebraker (slides)

Outline
• Data architecture – file system, RDBMS, other
• Support for analytics

Example Application – MODIS
• I know nothing about HEP, so you get satellite imagery
• 2 satellites
  – Trace a "wide piece of scotch tape" continuously around the earth
  – At multiple frequencies
• Cooked into several levels of data products by NASA
  – For example, "best cloud free cell" from multiple passes
• Baked into science metrics by users
  – For example snow cover

Snow Cover in the Sierras (image)

General Model
(diagram) Sensors → Cooking Algorithm(s) (pipeline) → Derived data

Traditional Wisdom (1)
• Cooking pipeline in hard code or custom hardware
• All data in some file system

Problems
• Can't find anything – no query language, no schema
• Metadata often not recorded – sensor parameters, cooking parameters
• Can't easily share anything – have to export your access programs
• Can't easily recook anything – big problem with MODIS
• Everything is custom code – supported by an army of Postdocs

Traditional Wisdom (2)
• Cooking pipeline outside DBMS
• Derived data loaded into RDBMS for subsequent querying
• Used by Sloan Sky Survey

RDBMS Issues
• Pretty hopeless on raw data – simulating arrays on top of tables likely to cost a factor of 10-100
• Not pretty on time series data – find me a sensor reading whose average value over the last 3 days is within 1% of the average value of the adjoining 5 sensors
• Not pretty on spatial data – find me snow cover within 10 miles of Mt. Whitney

RDBMS Summary
• Wrong data model – arrays not tables
• Wrong operations – regrid not join
• Missing features – versions, no-overwrite, provenance, support for uncertain data, …