A Metadata Based Approach for Supporting Subsetting Queries Over Parallel HDF5 Datasets
Thesis

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By Vignesh Santhanagopalan, B.S.

Graduate Program in Computer Science and Engineering
The Ohio State University
2011

Thesis Committee:
Dr. Gagan Agrawal, Advisor
Dr. Radu Teodorescu

ABSTRACT

A key challenge in scientific data management is coping with data sizes that are growing very rapidly. Scientific datasets are typically stored in low-level binary formats, which makes specifying and processing the data very hard. Also, because the volume of data is huge, parallel configurations must be used to process the data efficiently. We have developed a data virtualization approach for supporting subsetting queries on scientific datasets stored in their native format. The data is stored in the Hierarchical Data Format (HDF5), which is one of the popular formats for storing scientific data. Our system supports SQL queries using the Select, From and Where clauses. We support queries based on the dimensions of the dataset, and also queries based on both the dimensions and the attributes of the dataset (attributes provide extra information about the dataset). To support these different types of queries, the system includes pre-processing and post-processing modules. We also parallelize the selection queries involving the dimensions and the attributes. Our system offers the following advantages. We provide an SQL-like abstraction for specifying subsets of interest, which is a powerful mechanism. Our system is efficient because it operates on the actual layout of the data, and we enable more efficient access to scientific data by using parallel computing. We have evaluated our system for the different types of queries.
Also, we compare the performance of the sequential version with the parallel version for different sizes of data. Finally, we show the scalability of our system by evaluating its performance while varying the number of nodes.

I dedicate this work to my family

ACKNOWLEDGMENTS

Firstly, I would like to express my sincere gratitude to my advisor Prof. Gagan Agrawal for his continuous support of my Masters study and research. He has been a great motivation for me throughout, and his guidance has helped me with my Masters. His constant encouragement and complete knowledge of the subject have made me learn and explore more about my research. My sincere thanks also go to my thesis committee member Dr. Radu Teodorescu for his encouragement and support. I would like to thank Yu Su for all his guidance throughout my thesis. Many thanks to Vignesh Trichy Ravi for his motivation and help. Also, my thanks to all my other lab members for their support. Finally, I would like to extend my gratitude to my family and friends for all their help and motivation, without whom this Masters would not have been possible.

VITA

2005: B.Tech., Information Technology, Anna University, Chennai, India.
2009-2011: Masters Student, Department of Computer Science and Engineering, The Ohio State University.

FIELDS OF STUDY

Major Field: Computer Science and Engineering
Studies in High Performance Computing, Scientific Data Management: Prof. Gagan Agrawal

TABLE OF CONTENTS

Abstract
Dedication
Acknowledgments
Vita
List of Figures

Chapters:

1. Introduction
   1.1 Scientific Data Management
   1.2 Specific Goals
   1.3 Contributions
   1.4 Organization
2. System Design
   2.1 Background
       2.1.1 HDF5
       2.1.2 Parallel-HDF5
   2.2 Overview of the System
   2.3 Specific Goals of the System
   2.4 Technical Challenges
   2.5 Main Steps involved in the design of our System
   2.6 Metadata Extraction and Handling
   2.7 Pre-Processing and Post-Processing Modules
   2.8 Parallelization with Parallel HDF5
   2.9 Summary
3. Experimental Evaluation
   3.1 Experiments
   3.2 Summary
4. Conclusions

Bibliography

LIST OF FIGURES

2.1 HDF5 group structure
2.2 Metadata Information
2.3 Overview of the System Design
2.4 Continuation: Overview of the System Design
2.5 Continuation: Overview of the System Design
2.6 Example datasets
2.7 Path and Information for dataset SolarZenithAngle
2.8 Path and Information for dataset CloudPressureFraction
2.9 Shared access to a HDF5 file by different processes
3.1 Performance Comparison of Sequential and Parallel execution (4 processors) for a dataset of size 500MB
3.2 Performance Comparison of Sequential and Parallel execution (4 processors) for a dataset of size 1GB
3.3 Performance Comparison of Sequential and Parallel execution (4 processors) for a dataset of size 2GB
3.4 Performance Comparison of Sequential and Parallel execution (4 processors) for a dataset of size 4GB
3.5 Performance Comparison of Sequential and Parallel execution by varying the number of nodes for a dataset of size 500MB
3.6 Performance Comparison of Sequential and Parallel execution by varying the number of nodes for a dataset of size 1GB
3.7 Performance Comparison of Sequential and Parallel execution by varying the number of nodes for a dataset of size 2GB
3.8 Performance Comparison of Sequential and Parallel execution by varying the number of nodes for a dataset of size 4GB

CHAPTER 1

INTRODUCTION

1.1 Scientific Data Management

Data-driven applications are emerging rapidly in scientific and high-end computing. Scientific simulations and a growing number of high-precision data collection instruments are creating very large datasets.
The data collected from scientific instruments and simulations is very important in many scientific applications. Examples of scientific datasets are those produced by medical imaging modalities or by sensors attached to a satellite.

The main challenge arises from the fact that dataset sizes are growing enormously. Also, it is very difficult for application scientists to manage and process these huge datasets. There has been a lot of interest recently in scientific data management. A new science database called SciDB [19] has recently been released. It provides a database for scientific applications in which the natural way of storing data is arrays. Also, the Large Synoptic Survey Telescope (LSST) is another scientific data management project, which aims to provide time-lapse digital imaging of faint astronomical objects in the sky [9].

Scientific datasets are typically stored in low-level formats like HDF5 [7], NetCDF [13] and ROOT [16]. These formats store the data as binary or character flat files and provide compact storage for scientific data. They are very useful for storing scientific data and are used in various domains. The choice of format depends on the domain and also on the application.

Relational databases are not used to store scientific datasets, because their use results in significant storage and processing overheads. Also, the use of traditional databases is not justified, because scientific datasets are updated very infrequently. Relational databases are very heavyweight for read-only, large-scale scientific data, and they do not support array-based data or the metadata that the user prefers to see. Queries in scientific applications typically retrieve a subset of the data or compute an aggregation, so there is no need for transaction processing as in relational databases.
There is no need to provide strict Atomicity, Consistency, Isolation and Durability (ACID) guarantees. The standard relational model is not efficient for the types of data used by scientific applications. Relations can be used to represent time series and spatial grids, but at a very high cost in processing time and space. The many reasons why scientific data is not stored in relational databases are documented in [5] [11] [20] [9].

1.2 Specific Goals

The low-level formats used for storing scientific datasets make the specification of processing much harder. The users of scientific datasets are typically interested in retrieving a subset of the data. To do this, the user needs to understand the layout of the data. For each query, the user needs to write a separate program to retrieve the necessary subset of data. Also, the user needs to become familiar with the libraries of the low-level format in order to extract a subset of data.

There has been a lot of work in the area of scientific data management. Sarawagi and Stonebraker [17] showed how array chunks could be described and accessed as objects in an object-relational database. Beomseok Nam and Alan Sussman [12] have described ways to improve access to multi-dimensional self-describing scientific data. They propose an indexing scheme for efficiently accessing a subset of data, but they do not provide any notion of data virtualization, nor do they use parallel configurations. A lot of work has focused on providing fast bitmap indexing technologies to answer queries efficiently [3] [10] [14] [18]. There has also been a lot of interest recently in extending relational database technology to support the needs of extreme-scale scientific data [20] [9].

Our approach focuses on providing database-like support for users' subsetting requests, but the main difference in our approach is that we keep the data in its native format, thereby eliminating the need to load the dataset into a database system.
The approach, called automatic data virtualization, was proposed by Li Weng et al. seven years ago [21]. We provide an implementation of the data virtualization approach to enable support for complex and large low-level formats.
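
To make the idea of a dimension-based subsetting request concrete, the following is a minimal sketch and not the system described in this thesis. It assumes a hypothetical query whose Where clause gives half-open index ranges on the dimensions of a two-dimensional dataset, and it uses a nested Python list as a stand-in for data that would really reside in an HDF5 file. The function names ranges_to_hyperslab and read_hyperslab, the query form, and the 4 x 5 example array are all illustrative assumptions. The (start, count) pair mirrors the kind of block description that HDF5 uses for hyperslab selections.

```python
# Hypothetical sketch: function names, query form, and data are illustrative
# assumptions, not the interface of the system described in the thesis.

def ranges_to_hyperslab(shape, ranges):
    """Turn per-dimension half-open index ranges into a (start, count) pair.

    shape  -- tuple of dimension sizes, as recorded in the dataset metadata
    ranges -- dict mapping dimension index -> (low, high), high exclusive;
              dimensions not mentioned in the query are selected in full
    """
    start, count = [], []
    for dim, size in enumerate(shape):
        low, high = ranges.get(dim, (0, size))
        low, high = max(low, 0), min(high, size)  # clip to the dataset extent
        start.append(low)
        count.append(max(high - low, 0))          # empty if the range is inverted
    return tuple(start), tuple(count)


def read_hyperslab(data, start, count):
    """Read a 2-D block from a nested-list stand-in for an on-disk dataset."""
    row0, col0 = start
    nrows, ncols = count
    return [row[col0:col0 + ncols] for row in data[row0:row0 + nrows]]


# Stand-in for a 4 x 5 dataset; each value encodes its (row, column) position.
solar_zenith_angle = [[10 * r + c for c in range(5)] for r in range(4)]

# Conceptually: SELECT SolarZenithAngle WHERE 1 <= dim0 < 3 AND 2 <= dim1 < 4
start, count = ranges_to_hyperslab((4, 5), {0: (1, 3), 1: (2, 4)})
subset = read_hyperslab(solar_zenith_angle, start, count)
print(start, count, subset)
```

The point of the sketch is that the user states only the ranges of interest, while the translation to a block of the stored array is derived from the dataset's shape metadata, so the data never has to be loaded into a database system.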