Data Management and Data Processing Support on Array-Based Scientific Data


Dissertation Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By Yi Wang, B.S., M.S.
Graduate Program in Computer Science and Engineering, The Ohio State University, 2015

Dissertation Committee: Gagan Agrawal (Advisor), Feng Qin, Srinivasan Parthasarathy

© Copyright by Yi Wang 2015

Abstract

Scientific simulations are now being performed at finer temporal and spatial scales, leading to an explosion of the output data (mostly in array-based formats) and to challenges in effectively storing, managing, querying, disseminating, analyzing, and visualizing these datasets. Many of the paradigms and tools used today for large-scale scientific data management and processing are too heavy-weight and have inherent limitations, making it extremely hard to cope with the "big data" challenges in a variety of scientific domains. Our overall goal is to provide high-performance data management and data processing support for array-based scientific data, targeting data-intensive applications and various scientific array storage systems. We believe that such high-performance support can significantly reduce the prohibitively expensive costs of data translation, transfer, ingestion, integration, processing, and storage involved in many scientific applications, leading to better performance, ease of use, and responsiveness. On the data management side, we have investigated four topics. First, we built a light-weight data management layer over scientific datasets stored in HDF5, one of the popular array formats.
Unlike many popular data transport protocols such as OPeNDAP, which require costly data translation and data transfer before remote data can be accessed, our implementation supports server-side flexible subsetting and aggregation with high parallel efficiency. Second, to avoid the high upfront data ingestion costs of loading large-scale array data into array databases like SciDB, we designed a system referred to as SAGA, which provides database-like support over native array storage. Specifically, we focused on implementing a number of structural aggregations (grid, sliding, hierarchical, and circular), which are unique to the array data model. Third, we proposed a novel approximate aggregation approach over array data using bitmap indexing. This approach operates on the compact bitmap indices rather than the original raw datasets, and supports fast, accurate, and flexible aggregations over any array or its subset without data reorganization. Fourth, we extended bitmap indexing to assist subgroup discovery, a data mining task, over array data. Like the aggregation approach, our algorithm operates entirely on bitmap indices, and it efficiently handles a key challenge associated with array data: a subgroup identified over array data can be described by value-based and/or dimension-based attributes. On the data processing side, we focused on both offline and in-situ paradigms in the context of MapReduce. To process disk-resident scientific data in various data formats, we developed a customizable MapReduce-like framework, SciMATE, which can be adapted to support transparent processing of any scientific data format. This avoids the unnecessary data integration and data reloading incurred by applying the traditional MapReduce paradigm to scientific data processing. We then designed another MapReduce-like framework, Smart, to support efficient in-situ scientific analytics in both time-sharing and space-sharing modes.
In contrast to offline processing, our implementation can avoid, either completely or to a very large extent, both data transfer and data storage costs.

This is dedicated to the ones I love: my parents and my fiancée.

Acknowledgments

I would never have been able to complete my dissertation without the guidance of my committee members, help from friends, and support from my family and fiancée over the years. Foremost, I would like to express my sincerest gratitude to my advisor, Prof. Gagan Agrawal, for his continuous support of my Ph.D. career. He is the researcher with the broadest interests I have ever seen, and he has offered me excellent opportunities to delve into a variety of research topics, including high-performance computing, programming models, databases, data mining, and bioinformatics. This has helped me learn diverse techniques, acquire different methodologies, and develop my potential in an all-around way. I am also grateful that he is not a harsh boss who pushes an immense workload with close supervision, but a thoughtful mentor who taught me how to be self-motivated and to enjoy research. My five-year exploration would not have been so rewarding without his patience, encouragement, and advice. Besides my advisor, I would like to thank my candidacy and dissertation committee members, Prof. Feng Qin, Prof. Srinivasan Parthasarathy, Prof. Han-Wei Shen, and Prof. Spyros Blanas, for their valuable time and insightful advice. I would also like to thank Dr. Arnab Nandi, Dr. Kun Huang, Dr. Hatice Gulcin Ozer, Dr. Xin Gao, and Dr. Jim-Jingyan Wang for our collaborations during my Ph.D. career. I also want to acknowledge a CSE staff member, Aaron Jenkins; I will always remember the nice time I worked with him in my first quarter. My sincere thanks also go to my hosts, Dr. Marco Yu and Dr. Donghua Xu, and my mentor, Dr. Nan Boden, for their help in both work and life during my summer internship at Google.
Under their guidance, I learned diverse techniques quickly and was even able to apply them to my research projects after the internship. Working with them was a rewarding experience that I might not have been able to obtain in academia. The insightful discussions with them always brought me inspiration for interesting research directions, some of which I have already pursued in the last year of my Ph.D. study. Their enthusiasm and charisma led me to decide to join the same team at Google after graduation. I also would like to thank the colleagues in our group, the Data-Intensive and High Performance Computing Research Group. My Ph.D. career would not have been so colorful without their friendship. Former and present members of our lab have provided me with numerous help, in both daily life and work: Wenjing Ma, Vignesh Ravi, Tantan Liu, Tekin Bicer, Jiedan Zhu, Vignesh San, Xin Huo, Ashish Nagavaram, Liyang Tang, Yu Su, Linchuan Chen, Mehmet Can Kurt, Mucahid Kutlu, Jiaqi Liu, Roee Ebenstein, Sameh Shohdy, Gangyi Zhu, Peng Jiang, Jiankai Sun, Qiwei Yang, and David Siegal. Last but not least, I express special thanks to my parents and my fiancée. My parents have always been considerate and supportive. My fiancée has always cheered me up and stood by me through the ups and downs.

Vita

July 29th, 1987: Born in Wuhan, China
2010: B.S. Software Engineering, Wuhan University, Wuhan, China
2014: M.S. Computer Science & Engineering, The Ohio State University
May 2014 to August 2014: Software Engineer Intern, Google Inc., Mountain View, CA

Publications

Yi Wang, Linchuan Chen, and Gagan Agrawal. "Supporting User-Defined Estimation and Early Termination on a MapReduce-Like Framework for Online Analytics". Under submission.

Yi Wang, Gagan Agrawal, Tekin Bicer, and Wei Jiang.
"Smart: A MapReduce-Like Framework for In-Situ Scientific Analytics". In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), November 2015.

Yi Wang, Gagan Agrawal, Tekin Bicer, and Wei Jiang. "Smart: A MapReduce-Like Framework for In-Situ Scientific Analytics". Technical Report, OSU-CISRC-4/15-TR05, April 2015.

Yi Wang, Yu Su, Gagan Agrawal, and Tantan Liu. "SciSD: Novel Subgroup Discovery over Scientific Datasets Using Bitmap Indices". Technical Report, OSU-CISRC-3/15-TR03, March 2015.

Gangyi Zhu, Yi Wang, and Gagan Agrawal. "SciCSM: Novel Contrast Set Mining over Scientific Datasets Using Bitmap Indices (Short Paper)". In Proceedings of the International Conference on Scientific and Statistical Database Management (SSDBM), June 2015.

Yi Wang, Yu Su, and Gagan Agrawal. "A Novel Approach for Approximate Aggregations Over Arrays". In Proceedings of the International Conference on Scientific and Statistical Database Management (SSDBM), June 2015.

Yu Su, Yi Wang, and Gagan Agrawal. "In-Situ Bitmaps Generation and Efficient Data Analysis based on Bitmaps". In Proceedings of the International Symposium on High-Performance Parallel and Distributed Computing (HPDC), June 2015.

Jim Jing-Yan Wang, Yi Wang, Shiguang Zhao, and Xin Gao. "Maximum Mutual Information Regularized Classification". In Engineering Applications of Artificial Intelligence, 2015.

Yi Wang, Arnab Nandi, and Gagan Agrawal. "SAGA: Array Storage as a DB with Support for Structural Aggregations". In Proceedings of the International Conference on Scientific and Statistical Database Management (SSDBM), June 2014.

Yi Wang, Gagan Agrawal, Gulcin Ozer, and Kun Huang. "Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data".
In Proceedings of the International Workshop on High Performance Computational Biology (HiCOMB), held in conjunction with the International Parallel & Distributed Processing Symposium (IPDPS), May 2014.

Yu Su, Yi Wang, Gagan Agrawal,
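The approximate aggregation approach summarized in the abstract can be illustrated with a toy sketch. The binning scheme and all names below are hypothetical simplifications for illustration, not the dissertation's actual algorithm: each value bin keeps a bitmap of the cells that fall into it, and a SUM over any subset is estimated from bitmap intersections using per-bin representative values, without touching the raw data.

```python
# Toy sketch of approximate aggregation over an array using bitmap indices.
# The binning scheme and function names are illustrative assumptions.

def build_bitmaps(values, bin_width):
    """Map each cell to a value bin; each bin keeps a bitmap (int) of cell positions."""
    bitmaps = {}
    for i, v in enumerate(values):
        b = int(v // bin_width)
        bitmaps[b] = bitmaps.get(b, 0) | (1 << i)
    return bitmaps

def approx_sum(bitmaps, subset_mask, bin_width):
    """Estimate SUM over the cells in subset_mask using only bin midpoints."""
    total = 0.0
    for b, bm in bitmaps.items():
        count = bin(bm & subset_mask).count("1")  # cells of this bin inside the subset
        total += count * (b + 0.5) * bin_width    # midpoint as representative value
    return total

values = [3.2, 7.9, 5.1, 5.8, 1.4, 9.6]
bitmaps = build_bitmaps(values, bin_width=2.0)
subset = 0b001111                 # a subset covering the first four cells
est = approx_sum(bitmaps, subset, bin_width=2.0)
exact = sum(values[:4])
print(est, exact)                 # the estimate is close to the exact sum
```

The per-cell error is bounded by half the bin width, which is the accuracy/space trade-off bitmap binning exposes.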
Recommended publications
  • Array DBMS in Environmental Science
    The 9th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications, 21-23 September 2017, Bucharest, Romania. Array DBMS in Environmental Science: Satellite Sea Surface Height Data in the Cloud. Ramon Antonio Rodriges Zalipynis, National Research University Higher School of Economics, Moscow, Russia, [email protected]. Abstract: Nowadays environmental science experiences tremendous growth of raster data: N-dimensional (N-d) arrays coming mainly from numeric simulation and Earth remote sensing. An array DBMS is a tool to streamline raster data processing. However, raster data are usually stored in files, not in databases. Moreover, numerous command line tools exist for processing raster files. This paper describes a distributed array DBMS under development that partially delegates raster data processing to such tools. Our DBMS offers a new N-d array data model to abstract from the files and the tools, and processes data in a distributed fashion directly in their native file formats. The data model is designed to uniformly represent diverse raster data types and formats, take into account a distributed cluster environment, and be independent of the underlying raster file formats at the same time. Also, new distributed algorithms are proposed based on the model. Four modern raster data management trends are relevant to this paper: industrial raster data models, formal array models and algebras, in situ data processing algorithms, and raster (array) DBMS. A good survey on the algorithms is contained in [7]. A recent survey of existing array DBMS and similar systems is in [5]. As a case study, popular satellite altimetry data were used for the experiments carried out.
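The in-situ, native-format processing advocated above can be sketched minimally: the raster stays a flat binary file and statistics are computed chunk by chunk rather than after database ingestion. The float64 layout, chunk size, and file name here are assumptions for illustration only.

```python
# Minimal sketch of in-situ, chunk-wise processing of a raster stored as a
# flat binary file of little-endian float64 cells (layout is an assumption).
import os
import struct
import tempfile

def write_raster(path, cells):
    """Write cells to disk in the assumed native binary layout."""
    with open(path, "wb") as f:
        f.write(struct.pack("<%dd" % len(cells), *cells))

def chunked_mean(path, chunk_cells=4):
    """Stream the file chunk by chunk; never materialize the whole array."""
    total, count = 0.0, 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(chunk_cells * 8)   # 8 bytes per float64 cell
            if not buf:
                break
            vals = struct.unpack("<%dd" % (len(buf) // 8), buf)
            total += sum(vals)
            count += len(vals)
    return total / count

path = os.path.join(tempfile.mkdtemp(), "sea_surface_height.raw")
cells = [0.1 * i for i in range(10)]
write_raster(path, cells)
print(chunked_mean(path))   # matches the in-memory mean of the same cells
```

A real system would dispatch such per-chunk work to command-line tools or cluster nodes; the point is only that aggregation needs no load step.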
  • An Array DBMS for Simulation Analysis and ML Models Predictions
    SAVIME: An Array DBMS for Simulation Analysis and ML Models Predictions. Hermano L. S. Lustosa, Anderson C. Silva, Daniel N. R. da Silva, Patrick Valduriez, Fabio Porto. National Laboratory for Scientific Computing, Rio de Janeiro, Brazil, {hermano, achaves, dramos, fporto}@lncc.br; Inria, University of Montpellier, CNRS, LIRMM, France, [email protected]. Abstract: Limitations in current DBMSs prevent their wide adoption in scientific applications. In order to make scientific applications benefit from DBMS support, enabling declarative data analysis and visualization over scientific data, we present an in-memory array DBMS called SAVIME. In this work we describe the system SAVIME, along with its data model. Our preliminary evaluation shows how SAVIME, by using a simple storage definition language (SDL), can outperform the state-of-the-art array database system SciDB during the process of data ingestion. We also show that it is possible to use SAVIME as a storage alternative for a numerical solver without affecting its scalability, making it useful for modern ML-based applications. Categories and Subject Descriptors: H.2.4 [Database Management]: Systems. Keywords: Scientific Data Management, Multidimensional Array, Machine Learning. Due to the increasing computational power of HPC environments, vast amounts of data are now generated in different research fields, and a relevant part of this data is best represented as array data. Geospatial and temporal data in climate modeling, astronomical images, medical imagery, multimedia, and simulation data are naturally represented as multidimensional arrays. To analyze such data, DBMSs offer many advantages, like query languages, which ease data analysis and avoid the need for extensive coding/scripting, and a logical data view that isolates data from the applications that consume it.
  • Array Databases: Concepts, Standards, Implementations
    Baumann et al. J Big Data (2021) 8:28, https://doi.org/10.1186/s40537-020-00399-2. Survey Paper, Open Access. Array databases: concepts, standards, implementations. Peter Baumann, Dimitar Misev, Vlad Merticariu and Bang Pham Huu. Correspondence: [email protected]; Large-Scale Scientific Information Systems Research Group, Jacobs University, Bremen, Germany. Abstract: Multi-dimensional arrays (also known as raster data or gridded data) play a key role in many, if not all, science and engineering domains where they typically represent spatio-temporal sensor, image, simulation output, or statistics "datacubes". As classic database technology does not support arrays adequately, such data today are maintained mostly in silo solutions, with architectures that tend to erode and not keep up with the increasing requirements on performance and service quality. Array Database systems attempt to close this gap by providing declarative query support for flexible ad-hoc analytics on large n-D arrays, similar to what SQL offers on set-oriented data, XQuery on hierarchical data, and SPARQL and Cypher on graph data. Today, Petascale Array Database installations exist, employing massive parallelism and distributed processing. Hence, questions arise about technology and standards available, usability, and overall maturity. Several papers have compared models and formalisms, and benchmarks have been undertaken as well, typically comparing two systems against each other. While each of these represents valuable research, to the best of our knowledge there is no comprehensive survey combining model, query language, architecture, and practical usability and performance aspects. The size of this comparison differentiates our study as well, with 19 systems compared and four benchmarked, to an extent and depth clearly exceeding previous papers in the field; for example, subsetting tests were designed in a way that systems cannot be tuned to specifically these queries.
  • The Gamma Operator for Big Data Summarization on an Array DBMS
    JMLR: Workshop and Conference Proceedings 36:88-103, 2014, BIGMINE 2014. The Gamma Operator for Big Data Summarization on an Array DBMS. Carlos Ordonez [email protected], Yiqun Zhang [email protected], Wellington Cabrera [email protected], University of Houston, USA. Editors: Wei Fan, Albert Bifet, Qiang Yang and Philip Yu. Abstract: SciDB is a parallel array DBMS that provides multidimensional arrays, a query language, and basic ACID properties. In this paper, we introduce a summarization matrix operator that computes sufficient statistics in one pass and in parallel on an array DBMS. Such sufficient statistics benefit a big family of statistical and machine learning models, including PCA, linear regression, and variable selection. Experimental evaluation on a parallel cluster shows our matrix operator exhibits linear time complexity and linear speedup. Moreover, our operator is shown to be an order of magnitude faster than SciDB built-in operators, two orders of magnitude faster than SQL queries on a fast column DBMS, and even faster than the R package when the data set fits in RAM. We show SciDB operators and the R package fail due to RAM limitations, whereas our operator does not. We also show PCA and linear regression computation is reduced to a few minutes for large data sets. On the other hand, a Gibbs sampler for variable selection can iterate much faster in the array DBMS than in R, exploiting the summarization matrix. Keywords: Array, Matrix, Linear Models, Summarization, Parallel. Row DBMSs remain the best technology for OLTP [Stonebraker et al. (2007)] and column DBMSs are becoming a competitor in OLAP (Business Intelligence) query processing [Stonebraker et al. (2005)] in large data warehouses [Stonebraker et al. (2010)].
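The sufficient statistics behind a summarization operator of this kind are n (count), L = sum of x_i, and Q = sum of the outer products x_i x_i^T; mean and covariance then derive from them without a second pass. The sketch below is a minimal single-threaded Python illustration, not the paper's parallel in-DBMS operator:

```python
# One-pass sufficient statistics in the spirit of the Gamma operator:
# n (count), L = sum of x_i, Q = sum of outer products x_i x_i^T.

def gamma_summarize(points):
    """Accumulate n, L, Q in a single pass over the data."""
    d = len(points[0])
    n = 0
    L = [0.0] * d
    Q = [[0.0] * d for _ in range(d)]
    for x in points:
        n += 1
        for i in range(d):
            L[i] += x[i]
            for j in range(d):
                Q[i][j] += x[i] * x[j]
    return n, L, Q

def mean_and_cov(n, L, Q):
    """Derive the mean vector and (population) covariance from n, L, Q."""
    d = len(L)
    mean = [Li / n for Li in L]
    cov = [[Q[i][j] / n - mean[i] * mean[j] for j in range(d)] for i in range(d)]
    return mean, cov

pts = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
n, L, Q = gamma_summarize(pts)
mean, cov = mean_and_cov(n, L, Q)
print(mean)   # [3.0, 4.0]
```

Because n, L, and Q are additive, per-chunk partial results can be merged, which is what makes the one-pass, parallel formulation possible.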
  • Multidimensional Arrays for Analysing Geoscientific Data
    International Journal of Geo-Information, Review: Multidimensional Arrays for Analysing Geoscientific Data. Meng Lu, Marius Appel and Edzer Pebesma. Institute for Geoinformatics, University of Muenster, 48149 Muenster, Germany; Department of Physical Geography, Faculty of Geoscience, Utrecht University, 3584 CB Utrecht, The Netherlands. Correspondence: [email protected]. Received: 27 June 2018; Accepted: 30 July 2018; Published: 3 August 2018. Abstract: Geographic data is growing in size and variety, which calls for big data management tools and analysis methods. To efficiently integrate information from high dimensional data, this paper explicitly proposes array-based modeling. A large portion of Earth observations and model simulations are naturally arrays once digitized. This paper discusses the challenges in using arrays, such as the discretization of continuous spatiotemporal phenomena, irregular dimensions, regridding, high-dimensional data analysis, and large-scale data management. We define categories and applications of typical array operations, compare their implementation in open-source software, and demonstrate dimension reduction and array regridding in study cases using Landsat and MODIS imagery. It turns out that arrays are a convenient data structure for representing and analysing many spatiotemporal phenomena. Although the array model simplifies data organization, array properties like the meaning of grid cell values are rarely made explicit in practice. Keywords: multidimensional arrays; geoscientific data; data analysis; spatiotemporal modeling. An array is a storage form for a sequence of objects of similar type.
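Array regridding of the kind demonstrated in the review reduces, in the simplest aligned case, to block averaging. The sketch below assumes a regular 2-D grid whose shape is divisible by the block size; real regridding must also handle irregular dimensions and cell geometry:

```python
# Sketch of a simple regridding operation: downsample a 2-D grid by taking
# the mean of non-overlapping k x k blocks (aligned, regular grid assumed).

def block_mean(grid, k):
    rows, cols = len(grid), len(grid[0])
    out = []
    for r in range(0, rows, k):
        row = []
        for c in range(0, cols, k):
            block = [grid[r + i][c + j] for i in range(k) for j in range(k)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

grid = [[1.0, 2.0, 3.0, 4.0],
        [5.0, 6.0, 7.0, 8.0],
        [9.0, 10.0, 11.0, 12.0],
        [13.0, 14.0, 15.0, 16.0]]
print(block_mean(grid, 2))   # 2 x 2 coarser grid of block means
```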
  • OLAP Cubes Ming-Nu Lee OLAP (Online Analytical Processing)
    OLAP Cubes. Ming-Nu Lee. OLAP (Online Analytical Processing) performs multidimensional analysis of business data, provides capability for complex calculations, trend analysis, and sophisticated modelling, and is the foundation of many kinds of business applications. An OLAP cube is a multidimensional database: a method of storing data in a multidimensional form, optimized for data warehouse and OLAP applications. Structure: data (measures) categorized by dimensions, with each dimension a hierarchy, i.e., a set of parent-child relationships; it is not exactly a "cube". Schema: 1) Star schema: every dimension is represented with only one dimension table; the dimension table has attributes and is joined to the fact table using a foreign key; dimension tables are not joined to each other; dimension tables are not normalized; easy to understand and provides optimal disk usage. 2) Snowflake schema: dimension tables are normalized; uses smaller disk space; easier to implement; query performance is reduced; maintenance is increased. Dimension members identify a data item's position and description within a dimension. Two types of dimension members: 1) detailed members, the lowest level of members, which have no child members and store values at member intersections that are stored as fact data; 2) aggregate members, parent members that are the total or sum of their child members, a level above other members. Fact data are data values described by a company's business activities; they exist at member intersection points and are the aggregation of transactions integrated from an RDBMS, or the result of hierarchy or cube formula calculations.
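A star schema of the kind the slides describe can be sketched with SQLite: one fact table of measures joined to one denormalized dimension table through a foreign key, rolled up by a dimension attribute. All table and column names here are invented for illustration.

```python
# Minimal star-schema sketch with SQLite (schema and data are invented).
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY,
                              name TEXT, category TEXT);
    CREATE TABLE fact_sales  (product_id INTEGER REFERENCES dim_product,
                              amount REAL);
    INSERT INTO dim_product VALUES (1, 'pen', 'office'), (2, 'ink', 'office'),
                                   (3, 'mug', 'kitchen');
    INSERT INTO fact_sales VALUES (1, 2.0), (2, 3.0), (3, 5.0), (1, 1.0);
""")
# Aggregate the fact measures along one dimension attribute (category).
rows = con.execute("""
    SELECT d.category, SUM(f.amount)
    FROM fact_sales f JOIN dim_product d ON f.product_id = d.product_id
    GROUP BY d.category ORDER BY d.category
""").fetchall()
print(rows)   # [('kitchen', 5.0), ('office', 6.0)]
```

Snowflaking would split `dim_product` into normalized `product` and `category` tables, trading one extra join at query time for less redundancy.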
  • Data Warehousing & OLAP
    CPSC 304 Introduction to Database Systems: Data Warehousing & OLAP. Textbook reference: Database Management Systems, Sect. 25.1-25.5, 25.7-25.10. Other references: Database Systems: The Complete Book, 2nd edition; The Data Warehouse Toolkit, 3rd edition, by Kimball & Ross. Hassan Khosravi, based on slides from Ed, George, Laks, Jennifer Widom (Stanford), and Jiawei Han (Illinois).
    Learning goals:
    - Compare and contrast OLAP and OLTP processing (e.g., focus, clients, amount of data, abstraction levels, concurrency, and accuracy).
    - Explain the ETL tasks (i.e., extract, transform, load) for data warehouses.
    - Explain the differences between a star schema design and a snowflake design for a data warehouse, including potential tradeoffs in performance.
    - Argue for the value of a data cube in terms of the type of data in the cube (numeric, categorical, counts, sums) and the goals of OLAP (e.g., summarization, abstractions).
    - Estimate the complexity of a data cube in terms of the number of views that a given fact table and set of dimensions could generate, and provide some ways of managing this complexity.
    - Given a multidimensional cube, write regular SQL queries that perform roll-up, drill-down, slicing, dicing, and pivoting operations on the cube.
    - Use the SQL:1999 standards for aggregation (e.g., GROUP BY CUBE) to efficiently generate the results for multiple views.
    - Explain why having materialized views is important for a data warehouse.
    - Determine which set of views is most beneficial to materialize.
    - Given an OLAP query and a set of materialized views, determine which views could be used to answer the query more efficiently than the fact table (base view).
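The SQL:1999 GROUP BY CUBE mentioned in the learning goals computes every grouping of the listed dimensions at once. SQLite does not implement CUBE, so this sketch emulates a two-dimension cube with a UNION ALL of the four groupings (the schema and data are invented for illustration):

```python
# Emulating GROUP BY CUBE (region, year) in SQLite via UNION ALL.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales (region TEXT, year INTEGER, amount REAL);
    INSERT INTO sales VALUES ('east', 2014, 10.0), ('east', 2015, 20.0),
                             ('west', 2014, 30.0);
""")
cube = con.execute("""
    SELECT region, year, SUM(amount) FROM sales GROUP BY region, year
    UNION ALL
    SELECT region, NULL, SUM(amount) FROM sales GROUP BY region
    UNION ALL
    SELECT NULL, year, SUM(amount) FROM sales GROUP BY year
    UNION ALL
    SELECT NULL, NULL, SUM(amount) FROM sales
""").fetchall()
for row in cube:
    print(row)   # detail rows plus roll-ups along each dimension
```

With d dimensions a full cube needs 2^d groupings, which is exactly the view-count complexity the learning goals ask students to estimate.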
  • A Survey on Array Storage, Query Languages, and Systems
    A Survey on Array Storage, Query Languages, and Systems. Florin Rusu, Yu Cheng. University of California, Merced, 5200 N Lake Road, Merced, CA 95343, {frusu,ycheng4}@ucmerced.edu. February 2013. Abstract: Since scientific investigation is one of the most important providers of massive amounts of ordered data, there is a renewed interest in array data processing in the context of Big Data. To the best of our knowledge, a unified resource that summarizes and analyzes array processing research over its long existence is currently missing. In this survey, we provide a guide for past, present, and future research in array processing. The survey is organized along three main topics. Array storage discusses all the aspects related to array partitioning into chunks. The identification of a reduced set of array operators to form the foundation for an array query language is analyzed across multiple such proposals. Lastly, we survey real systems for array processing. The result is a thorough survey on array data storage and processing that should be consulted by anyone interested in this research topic, independent of experience level. The survey is not complete though; we greatly appreciate pointers towards any work we might have forgotten to mention. Big Data [33] is the new buzz word in computer science as of 2012. There are new conferences organized specifically to tackle Big Data issues. Many classical research areas, beyond data management and databases, allocate significant attention to Big Data problems. And, most importantly, state governments provide unprecedented amounts of funds to support Big Data research [36]; at the end of the day, Big Data analytics played a significant role in the 2012 US presidential elections.
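The array-partitioning topic that opens the survey rests on one small piece of arithmetic: mapping an N-d cell coordinate to its chunk and to the cell's position inside that chunk. The sketch below covers only regular, aligned chunking with row-major layout; the names are illustrative.

```python
# Regular chunking sketch: map an N-d cell coordinate to its chunk coordinate
# and the cell's row-major offset inside the chunk.

def locate(coord, chunk_shape):
    """Return (chunk coordinate, within-chunk offset) for one array cell."""
    chunk = tuple(c // s for c, s in zip(coord, chunk_shape))
    local = tuple(c % s for c, s in zip(coord, chunk_shape))
    # linearize the within-chunk coordinate in row-major order
    offset = 0
    for l, s in zip(local, chunk_shape):
        offset = offset * s + l
    return chunk, offset

# cell (5, 7) in an array chunked into 4 x 4 tiles
print(locate((5, 7), (4, 4)))   # ((1, 1), 7)
```

Overlapped or irregular chunking, which the survey also covers, replaces the integer division with a search over chunk boundaries.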
  • The Bigdawg Polystore System
    The BigDAWG Polystore System. Jennie Duggan (Northwestern), Aaron J. Elmore (Univ. of Chicago), Michael Stonebraker (MIT), Magda Balazinska (Univ. of Washington), Bill Howe (Univ. of Washington), Jeremy Kepner (MIT), Sam Madden (MIT), David Maier (Portland State), Tim Mattson (Intel), Stan Zdonik (Brown Univ.). Abstract: This paper presents a new view of federated databases to address the growing need for managing information that spans multiple data models. This trend is fueled by the proliferation of storage engines and query languages based on the observation that "no one size fits all". To address this shift, we propose a polystore architecture; it is designed to unify querying over multiple data models. We consider the challenges and opportunities associated with polystores. Open questions in this space revolve around query optimization and the assignment of objects to storage engines. We introduce our approach to these topics and discuss our prototype in the context of the Intel Science and Technology Center for Big Data. As part of the Intel Science and Technology Center for Big Data, we have constructed a medical example of this new class of applications, based on the MIMIC II dataset [18]. These publicly available patient records cover 26,000 intensive care unit admissions at Boston's Beth Israel Deaconess Hospital. The dataset includes waveform data (up to 125 Hz measurements from bedside devices), patient metadata (name, age, etc.), doctor's and nurse's notes (text), lab results, and prescriptions filled (semi-structured data). A production implementation would store all of the historical data augmented by real-time streams from current patients. Given the variety of data sources, this system must support an assortment of data types, standard SQL analytics (e.g., how many patients were given
  • The Architecture of Scidb
    The Architecture of SciDB Michael Stonebraker, Paul Brown, Alex Poliakov, Suchi Raman Paradigm4, Inc. 186 Third Avenue Waltham, MA 02451 Abstract. SciDB is an open-source analytical database oriented toward the data management needs of scientists. As such it mixes statistical and linear algebra operations with data management ones, using a natural nested multi-dimensional array data model. We have been working on the code for two years, most recently with the help of venture capital backing. Release 11.06 (June 2011) is downloadable from our website (SciDB.org). This paper presents the main design decisions of SciDB. It focuses on our decisions concerning a high-level, SQL-like query language, the issues facing our query optimizer and executor and efficient storage management for arrays. The paper also discusses implementation of features not usually present in DBMSs, including version control, uncertainty and provenance. Keywords: scientific data management, multi-dimensional array, statistics, linear algebra 1 Introduction and Background The Large Synoptic Survey Telescope (LSST) [1] is the next “big science” astronomy project, a telescope being erected in Chile, which will ultimately collect and manage some 100 Petabytes of raw and derived data. In October 2007, the members of the LSST data management team realized the scope of their data management problem, and that they were uncertain how to move forward. As a result, they organized the first Extremely Large Data Base (XLDB-1) conference at the Stanford National Accelerator Laboratory [2]. Present were many scientists from a variety of natural science disciplines as well as representatives from large web properties. All reported the following requirements: Multi-petabyte amounts of data.
  • On the Integration of Array and Relational Models in Databases
    On the Integration of Array and Relational Models in Databases, by Dimitar Mišev. A thesis submitted in partial fulfillment for the degree of Doctor of Philosophy in Computer Science. Prof. Dr. Peter Baumann (Jacobs University Bremen), Prof. Dr. Michael Sedlmair (Jacobs University Bremen), Prof. Dr. Tore Risch (Uppsala University), Dr. Heinrich Stamerjohanns (Jacobs University Bremen). Date of defense: May 15th, 2018. Computer Science & Electrical Engineering. Statutory Declaration. Family name, given/first name: Mišev, Dimitar. Matriculation number: 20327580. Type of thesis: PhD. English: Declaration of Authorship. I hereby declare that the thesis submitted was created and written solely by myself without any external support. Any sources, direct or indirect, are marked as such. I am aware of the fact that the contents of the thesis in digital form may be revised with regard to usage of unauthorized aid as well as whether the whole or parts of it may be identified as plagiarism. I do agree to my work being entered into a database for it to be compared with existing sources, where it will remain in order to enable further comparisons with future theses. This does not grant any rights of reproduction and usage, however. This document was neither presented to any other examination board nor has it been published. German: Erklärung der Autorenschaft (Urheberschaft), in translation: I hereby declare that the present work was created and written exclusively by myself without outside help. Any sources used, of direct or indirect kind, have been marked as such. I am aware of the fact that the content of the thesis may be examined in digital form with regard to whether it constitutes plagiarism in whole or in part.
  • A Data Cube Metamodel for Geographic Analysis Involving Heterogeneous Dimensions
    International Journal of Geo-Information, Article: A Data Cube Metamodel for Geographic Analysis Involving Heterogeneous Dimensions. Jean-Paul Kasprzyk and Guénaël Devillet. SPHERES, Geomatics Unit, University of Liege, 4000 Liège, Belgium; SPHERES, SEGEFA, University of Liege, 4000 Liège, Belgium; [email protected]. Correspondence: [email protected]. Abstract: Due to their multiple sources and structures, big spatial data require adapted tools to be efficiently collected, summarized, and analyzed. For this purpose, data are archived in data warehouses and explored by spatial online analytical processing (SOLAP) through dynamic maps, charts, and tables. Data are thus converted into data cubes characterized by a multidimensional structure on which exploration is based. However, multiple sources often lead to several data cubes defined by heterogeneous dimensions. In particular, dimension definitions can change depending on the analyzed scale, territory, and time. In order to consider these three issues specific to geographic analysis, this research proposes an original data cube metamodel defined in unified modeling language (UML). Based on concepts like common dimension levels and metadimensions, the metamodel can instantiate constellations of heterogeneous data cubes, allowing SOLAP to perform multiscale, multi-territory, and time analysis. Afterwards, the metamodel is implemented in a relational data warehouse and validated by an operational tool designed for a social economy case study. This tool, called "Racines", gathers and compares multidimensional data about social economy business in Belgium and France through interactive cross-border maps, charts, and reports. Thanks to the metamodel, users remain independent from IT specialists regarding data exploration and integration.