Integrating Memory-Mapping and N-Dimensional Hash Function for Fast and Efficient Grid-Based Climate Data Query
Annals of GIS, ISSN: 1947-5683 (Print), 1947-5691 (Online). Journal homepage: https://www.tandfonline.com/loi/tagi20

Mengchao Xu (a), Liang Zhao (b), Ruixin Yang (a), Jingchao Yang (a), Dexuan Sha (a) and Chaowei Yang (a)

(a) Department of Geography and Geoinformation Science, George Mason University, Fairfax, VA, USA; (b) Department of Information Science and Technology, George Mason University, Fairfax, VA, USA

CONTACT Chaowei Yang [email protected]

To cite this article: Mengchao Xu, Liang Zhao, Ruixin Yang, Jingchao Yang, Dexuan Sha & Chaowei Yang (2020): Integrating memory-mapping and N-dimensional hash function for fast and efficient grid-based climate data query, Annals of GIS, DOI: 10.1080/19475683.2020.1743354

To link to this article: https://doi.org/10.1080/19475683.2020.1743354

© 2020 The Author(s). Published by Informa UK Limited, trading as Taylor & Francis Group, on behalf of Nanjing Normal University. Published online: 02 Apr 2020.

ARTICLE HISTORY
Received 3 April 2019
Accepted 11 March 2020

KEYWORDS
Multidimensional array; gridded data; array database; memory-mapping; MERRA; data cube

ABSTRACT
Database systems are pervasive components in the current big data era. However, efficiently managing and querying grid-based or array-based multidimensional climate data is still beyond the capabilities of most databases. The mismatch between the array data model and the relational data model limits the performance of multidimensional data queries in a traditional database once the data volume reaches a certain threshold. Even a trivial data retrieval on large multidimensional datasets in a relational database is time-consuming and requires enormous storage space. Given the scientific interest in and application demand for time-sensitive spatiotemporal data query and analysis, there is an urgent need for efficient data storage and fast data retrieval solutions for large multidimensional datasets. In this paper, we introduce a method for storing and accessing multidimensional data, which includes a new hash function algorithm that works on a unified data storage structure and is coupled with memory-mapping technology. A prototype database library, LotDB, developed as an implementation of the method, is described in this paper and shows promising query performance compared with SciDB, MongoDB, and PostgreSQL.

1. Introduction

In the climate and earth science community, with the growth of both the volume and the complexity of spatiotemporal gridded data, the importance of efficient data storage, fast access, and analysis is well recognized by both data producers and data users. Using new technologies to enhance big earth data discovery, storage, and retrieval is attractive and in demand not only in this community but also in all science domains and data-related industries (Yang et al. 2017). By their nature, many natural phenomena can be modelled as data sets with some dimensionality and known lengths (Baumann 1999). Grid-based data sets, raster-structured data sets, array-based data sets, and data cubes all refer to the same data model concept, the array data model. In the computer science context, it is a multidimensional (n-D) array with values. In geoscience, it is an n-D gridded data set. Earth Observation data is one of the main gridded data sources in the climate and earth science community. According to the 2014 metrics of the Earth Observing System Data and Information System (EOSDIS), it manages over 9 PB of data and adds 6.4 TB of data to its archives every day (Blumenfeld 2015). Its data are often distributed as flat files, and almost all the data are array based. Traditional file storage for multidimensional arrays uses various formats such as the popular GeoTiff, NetCDF, and HDF. Data in those formats are suitable for data transfer because of their size advantages but require programming skills for further analysis. In addition to the original formats, data often need to be stored in DBMSs, and the mainstream technologies for sensing image management are relational database and array database technologies (Tan and Yue 2016). For DBMSs, a well-developed relational database management system (RDMS) is based on the relational data model, and the database storage structures on the lower storage level are often trees (B+ trees). Historically, since traditional databases were neither designed for the climate and earth science domain nor built to store array data sets, RDMS does not directly support the array data model to the same extent as sets and tables. Mapping attempts have been made to utilize the mature relational model as the backend to support array datasets, as in van Ballegooij's work (2004), and theories have been developed to support array queries (Libkin, Machlin, and Wong 1996; Marathe and Salem 1997); however, with the growth in data volume, storing, accessing, and analysing multidimensional arrays in RDMS became problematic.
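The array data model described above treats a grid as an n-D array with known dimension lengths. As a minimal sketch of why such a layout allows direct cell access, the following Python example flattens an n-D index into a single integer offset using standard row-major ordering. The function name `flat_offset` and the example grid shape are hypothetical, and this is generic linearization shown for illustration only, not the n-dimensional hash function that the paper introduces in Section 3.

```python
from typing import Sequence


def flat_offset(index: Sequence[int], shape: Sequence[int]) -> int:
    """Map an n-D index to a 0-based row-major offset (illustrative only)."""
    if len(index) != len(shape):
        raise ValueError("index and shape must have the same number of dimensions")
    offset = 0
    for i, n in zip(index, shape):
        if not 0 <= i < n:
            raise IndexError(f"index {i} out of range for dimension of length {n}")
        # Horner-style accumulation: the last dimension varies fastest.
        offset = offset * n + i
    return offset


# Hypothetical 3-D grid: 4 time steps x 3 latitudes x 5 longitudes.
shape = (4, 3, 5)
cells = list(range(4 * 3 * 5))  # stand-in for cell values stored contiguously

# One arithmetic pass over the dimensions locates the cell; no tree traversal.
value = cells[flat_offset((2, 1, 3), shape)]
```

Because the offset depends only on the index and the known dimension lengths, the lookup cost does not grow with the number of stored cells, which is the property that makes contiguous array storage attractive for static gridded data.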
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Meanwhile, NoSQL databases have been developed to meet the demand to store and query the increasing amount of data from various sources. In these systems, array data can also be mapped to different data models, for example by using key-value pairs to store the cells of an array. However, the data model mapping brings longer data pre-processing time and larger data storage volume. At the same time, as the data volume and complexity increase, no promising performance can be expected because of the corresponding increase in unit computer resource consumption. Recently, array databases have been drawing increasing attention because they support the array data model at the database level, which provides a more native solution for array data storage. Tree data structures, which are usually used at the on-disk storage level of relational databases, have lower computational complexity for value search than an array data structure. In contrast, the array data structure offers the lowest computational complexity for data access, which makes it well suited to static data retrieval if indexes can be represented as integers. Popular array databases such as Rasdaman and SciDB are widely used in many practical projects (Baumann et al. 1998; Stonebraker et al. 2013b) and show promising performance gains compared with traditional methods. However, existing array databases still suffer from problems including 1) low performance in non-cluster environments: the standalone versions of array databases usually have limited performance compared with their clustered setups (Hu et al. 2018); 2) long data pre-processing time: current array databases do not provide direct support for multidimensional gridded climate data, so data pre-processing is necessary and cannot be avoided (Das et al. 2015); and 3) a high data expansion rate compared with raw data: because of the data pre-processing and data chunking inside databases, the data volume can expand to multiple times that of the raw data.

As the multidimensional array datasets are growing in both size and complexity, we are becoming limited in

1.1. Outline

The rest of this paper is organized as follows. In Section 2, studies on storing arrays in databases are reviewed. Section 3 introduces the n-dimensional hash function as an efficient method for data retrieval on a unified data storage structure. LotDB is introduced and compared with PostgreSQL, MongoDB, and SciDB in Section 4. Section 5 concludes this research with a summary and an outlook for future work.

2. Related work

There have been many attempts to store multidimensional arrays in a database system. These studies can be grouped into 1) array data model mapping in relational databases, and 2) development of native array databases. Meanwhile, recent studies that use big data frameworks such as Hadoop systems to store and retrieve multidimensional climate data sets report excellent performance on data analysis queries (Li et al. 2017; Cuzzocrea, Song, and Davis 2011; Song et al. 2015). However, most of these solutions were customized for specific data formats or data sets and could not be applied to arbitrary multidimensional arrays. The related work in this section focuses on database solutions for multidimensional array storage and retrieval, and on efforts by researchers to enhance array data queries.

2.1. Relational database

The data model in relational databases is the table, and most relational database systems do not support storing multidimensional arrays natively. A natural way to store arrays in tables is to split array cells into rows, which is straightforward for 1D/2D datasets but problematic for higher dimensional