Data Management and Data Processing Support on Array-Based Scientific Data

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Yi Wang, B.S., M.S.

Graduate Program in Computer Science and Engineering

The Ohio State University

2015

Dissertation Committee:

Gagan Agrawal, Advisor

Feng Qin

Srinivasan Parthasarathy

© Copyright by

Yi Wang

2015

Abstract

Scientific simulations are now being performed at finer temporal and spatial scales,

leading to an explosion of the output data (mostly in array-based formats), and challenges

in effectively storing, managing, querying, disseminating, analyzing, and visualizing these

datasets. Many paradigms and tools used today for large-scale scientific data management

and data processing are often too heavy-weight and have inherent limitations, making it

extremely hard to cope with the ‘big-data’ challenges in a variety of scientific domains.

Our overall goal is to provide high-performance data management and data processing

support on array-based scientific data, targeting data-intensive applications and various

scientific array storages. We believe that such high-performance support can significantly

reduce the prohibitively expensive costs of data translation, data transfer, data ingestion,

data integration, data processing, and data storage involved in many scientific applications,

leading to better performance, ease-of-use, and responsiveness.

On one hand, we have investigated four data management topics as follows. First, we built a light-weight data management layer over scientific datasets stored in HDF5 format, which is one of the popular array formats. Unlike many popular data transport protocols such as OPeNDAP, which require costly data translation and data transfer before accessing remote data, our implementation can support server-side flexible subsetting and aggregation, with high parallel efficiency. Second, to avoid the high upfront data ingestion costs of loading large-scale array data into array DBMSs like SciDB, we designed a system referred to as SAGA, which can provide database-like support over native array storage. Specifically, we focused on implementing a number of structural (grid, sliding, hierarchical, and circular) aggregations, which are unique to the array data model. Third, we proposed a novel approximate aggregation approach over array data using bitmap indexing. This approach can operate on the compact bitmap indices rather than the original raw datasets, and can support fast, accurate, and flexible aggregations over any array or its subset without data reorganization. Fourth, we extended bitmap indexing to assist the task of subgroup discovery over array data. Like the aggregation approach, our algorithm can operate entirely on bitmap indices, and it can efficiently handle a key challenge associated with array data – a subgroup identified over array data can be described by value-based and/or dimension-based attributes.

On the other hand, we focused on both offline and in-situ data processing paradigms in the context of MapReduce. To process disk-resident scientific data in various formats, we developed a customizable MapReduce-like framework, SciMATE, which can be adapted to support transparent processing on any of the scientific data formats. Thus, the unnecessary data integration and data reloading incurred by applying the traditional MapReduce paradigm to scientific data processing can be avoided. We then designed

another MapReduce-like framework, Smart, to support efficient in-situ scientific analytics

in both time sharing and space sharing modes. In contrast to offline processing, our

implementation can avoid, either completely or to a very large extent, both data transfer

and data storage costs.

This is dedicated to the ones I love: my parents and my fiancee.

Acknowledgments

I would never have been able to complete my dissertation without the guidance of my committee members, help from friends, and support from my family and fiancee over the years.

Foremost, I would like to express my sincerest gratitude to my advisor Prof. Gagan

Agrawal for his continuous support of my Ph.D. career. He is the researcher with the broadest interests I have ever known, and he has offered me excellent opportunities to delve into a variety of research topics, including high-performance computing, programming models, databases, data mining, and bioinformatics. This has helped me learn diverse techniques, gain different methodologies, and develop my potential in an all-around way. I am also grateful that he is not a harsh boss who pushes an immense workload with close supervision, but a thoughtful mentor who has taught me how to be self-motivated and enjoy the research. My 5-year exploration would not have been so rewarding without his patience, encouragement, and advice.

Besides my advisor, I would like to thank my candidacy and dissertation committee members, Prof. Feng Qin, Prof. Srinivasan Parthasarathy, Prof. Han-Wei Shen, and Prof.

Spyros Blanas, for their valuable time and insightful advice. I would also like to thank

Dr. Arnab Nandi, Dr. Kun Huang, Dr. Hatice Gulcin Ozer, Dr. Xin Gao, and Dr. Jim Jing-Yan Wang for the collaborations during my Ph.D. career. I also need to acknowledge a CSE staff member, Aaron Jenkins, and I will always remember the nice time I worked with him in my

first quarter.

My sincere thanks also go to my hosts Dr. Marco Yu and Dr. Donghua Xu, and my mentor Dr. Nan Boden, for their help in both work and life during the summer internship at

Google. Under their guidance, I quickly learned diverse techniques and was even able to apply them to my research projects after the internship. Working with them was quite a rewarding experience that I might not have been able to obtain in academia. The insightful discussions with them always brought me inspiration for interesting research directions, some of which I have already pursued in the last year of my Ph.D. study. Their enthusiasm and charisma led me to decide to join the same team at Google after graduation.

I also would like to thank the colleagues in our group, Data-Intensive and High

Performance Computing Research Group. My Ph.D. career would not have been so colorful without their friendship. Former and present members of our lab have provided me with a great deal of help, in both daily life and work, and they are: Wenjing Ma,

Vignesh Ravi, Tantan Liu, Tekin Bicer, Jiedan Zhu, Vignesh San, Xin Huo, Ashish

Nagavaram, Liyang Tang, Yu Su, Linchuan Chen, Mehmet Can Kurt, Mucahid Kutlu,

Jiaqi Liu, Roee Ebenstein, Sameh Shohdy, Gangyi Zhu, Peng Jiang, Jiankai Sun, Qiwei

Yang, and David Siegal.

Last but not least, I express special thanks to my parents and my fiancee. My parents have always been considerate and supportive. My fiancee has always been cheering me up and standing by me through the ups and downs.

Vita

July 29th, 1987 ...... Born - Wuhan, China

2010 ...... B.S. Software Engineering, Wuhan University, Wuhan, China

2014 ...... M.S. Computer Science & Engineering, The Ohio State University

May 2014 - August 2014 ...... Software Engineer Intern, Google Inc., Mountain View, CA

Publications

Research Publications

Yi Wang, Linchuan Chen, and Gagan Agrawal. “Supporting User-Defined Estimation and Early Termination on a MapReduce-Like Framework for Online Analytics”. Under Submission.

Yi Wang, Gagan Agrawal, Tekin Bicer, and Wei Jiang. “Smart: A MapReduce-Like Framework for In-Situ Scientific Analytics”. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), November 2015.

Yi Wang, Gagan Agrawal, Tekin Bicer, and Wei Jiang. “Smart: A MapReduce-Like Framework for In-Situ Scientific Analytics”. Technical Report, OSU-CISRC-4/15-TR05, April, 2015.

Yi Wang, Yu Su, Gagan Agrawal, and Tantan Liu. “SciSD: Novel Subgroup Discovery over Scientific Datasets Using Bitmap Indices”. Technical Report, OSU-CISRC-3/15-TR03, March 2015.

Gangyi Zhu, Yi Wang, and Gagan Agrawal. “SciCSM: Novel Contrast Set Mining over Scientific Datasets Using Bitmap Indices (Short Paper)”. In Proceedings of the International Conference on Scientific and Statistical Database Management (SSDBM), June 2015.

Yi Wang, Yu Su, and Gagan Agrawal. “A Novel Approach for Approximate Aggregations Over Arrays”. In Proceedings of the International Conference on Scientific and Statistical Database Management (SSDBM), June 2015.

Yu Su, Yi Wang, and Gagan Agrawal. “In-Situ Bitmaps Generation and Efficient Data Analysis based on Bitmaps”. In Proceedings of the International Symposium on High-Performance Parallel and Distributed Computing (HPDC), June 2015.

Jim Jing-Yan Wang, Yi Wang, Shiguang Zhao, and Xin Gao. “Maximum Mutual Information Regularized Classification”. In Engineering Applications of Artificial Intelligence (ENG APPL ARTIF INTEL), 2015.

Yi Wang, Arnab Nandi, and Gagan Agrawal. “SAGA: Array Storage as a DB with Support for Structural Aggregations”. In Proceedings of the International Conference on Scientific and Statistical Database Management (SSDBM), June 2014.

Yi Wang, Gagan Agrawal, Gulcin Ozer, and Kun Huang. “Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data”. In Proceedings of the International Workshop on High Performance Computational Biology (HiCOMB) in Conjunction with the International Parallel & Distributed Processing Symposium (IPDPS), May 2014.

Yu Su, Yi Wang, Gagan Agrawal, and Rajkumar Kettimuthu. “SDQuery DSI: Integrating Data Management Support with a Wide Area Data Transfer Protocol”. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), November 2013.

Yi Wang, Yu Su, and Gagan Agrawal. “Supporting a Light-Weight Data Management Layer Over HDF5”. In Proceedings of the International Symposium on Cluster, Cloud and Grid Computing (CCGrid), May, 2013.

Yi Wang, Wei Jiang, and Gagan Agrawal. “SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats”. In Proceedings of the International Symposium on Cluster, Cloud and Grid Computing (CCGrid), May, 2012.

Fields of Study

Major Field: Computer Science and Engineering

Table of Contents

Page

Abstract ...... ii

Dedication ...... iv

Acknowledgments ...... v

Vita ...... vii

List of Tables ...... xv

List of Figures ...... xvii

List of Algorithms ...... xxi

1. Introduction...... 1

1.1 Motivation ...... 1
1.2 Contributions ...... 2
    1.2.1 Data Management Support ...... 3
    1.2.2 Data Processing Support ...... 13
1.3 Outline ...... 17

2. Supporting a Light-Weight Data Management Layer over HDF5 ...... 19

2.1 Background ...... 19
    2.1.1 HDF5 ...... 19
    2.1.2 High-Level HDF5 ...... 20
2.2 System Design ...... 20
    2.2.1 Motivation ...... 20
    2.2.2 System Overview ...... 22
2.3 Metadata Generation Strategy ...... 24
2.4 Query Execution ...... 26
    2.4.1 Hyperslab Selection and Content-Based Filtering ...... 26
    2.4.2 Parallelization of Query Execution ...... 30
    2.4.3 Aggregation Queries ...... 32
2.5 Experimental Results ...... 33
    2.5.1 Sequential Comparison with OPeNDAP ...... 34
    2.5.2 Parallelization of Subsetting Queries ...... 38
    2.5.3 Sequential and Parallel Performance of Aggregation Queries ...... 41
2.6 Related Work ...... 43
2.7 Summary ...... 45

3. SAGA: Array Storage as a DB with Support for Structural Aggregations . . . . 47

3.1 Structural Aggregations ...... 47
    3.1.1 Background: Structural Groupings and Structural Aggregations ...... 48
    3.1.2 Grid Aggregation ...... 49
    3.1.3 Sliding Aggregation ...... 49
    3.1.4 Hierarchical and Circular Aggregations ...... 50
3.2 Grid Aggregation Algorithms ...... 51
    3.2.1 Outline of a Simple Parallel Grid Aggregation Algorithm ...... 51
    3.2.2 Challenges and Different Partitioning Strategies ...... 52
    3.2.3 A Cost Model for Choosing Partitioning Strategy ...... 57
3.3 Overlapping Aggregation Algorithms ...... 62
    3.3.1 Naive Approach ...... 63
    3.3.2 Data-Reuse Approach ...... 63
    3.3.3 All-Reuse Approach ...... 64
3.4 Optimization for Chunked Array Storage ...... 65
3.5 Experimental Results ...... 67
    3.5.1 Experimental Setup ...... 68
    3.5.2 Handling Skewed Data and Prediction Accuracy of Partitioning Strategy Decider ...... 70
    3.5.3 Performance Comparison with SciDB ...... 73
    3.5.4 Performance of Parallel Structural Aggregations ...... 77
3.6 Related Work ...... 82
3.7 Summary ...... 84

4. A Novel Approach for Approximate Aggregations Over Arrays...... 86

4.1 Challenges and Limitations ...... 86
    4.1.1 Challenges in Approximate Aggregations over Array Data ...... 86
    4.1.2 Existing Approximate Aggregation Techniques and Limitations ...... 88
4.2 Approximate Aggregations Using Bitmap Indices ...... 90
    4.2.1 Background: Bitmap Indexing and Binning Strategies ...... 90
    4.2.2 Bitmaps and Approximate Aggregation ...... 92
    4.2.3 Approximate Aggregation Method ...... 93
4.3 V-Optimized Binning ...... 98
    4.3.1 Motivation ...... 98
    4.3.2 Unbiased V-Optimized Binning ...... 99
    4.3.3 Weighted V-Optimized Binning ...... 102
4.4 Experimental Results ...... 104
    4.4.1 Experimental Setup ...... 104
    4.4.2 Aggregation Accuracy of Different Approximate Aggregation Methods ...... 107
    4.4.3 Aggregation Costs of Different Aggregation Methods ...... 114
    4.4.4 Aggregation Accuracy of Different Binning Strategies ...... 117
    4.4.5 Indexing Costs of Different Binning Strategies ...... 119
    4.4.6 Discussion ...... 121
4.5 Related Work ...... 122
4.6 Summary ...... 124

5. SciSD: Novel Subgroup Discovery over Scientific Datasets Using Bitmap Indices ...... 125

5.1 Subgroup Discovery and Array Data ...... 125
    5.1.1 Preliminaries ...... 126
    5.1.2 An Example Involving Arrays ...... 127
    5.1.3 Extended Subgroup Description ...... 129
    5.1.4 Quality Function and Subgroup Combination ...... 130
5.2 SciSD Algorithm ...... 131
    5.2.1 Tight Optimistic Estimates ...... 131
    5.2.2 Search Strategy ...... 132
5.3 Subgroup Pruning and Combination ...... 136
    5.3.1 Pruning Measures ...... 136
    5.3.2 Combination Rules ...... 137
5.4 Algorithm Optimization Using Bitmaps ...... 140
5.5 Experimental Results ...... 143
    5.5.1 Experimental Setup ...... 144
    5.5.2 Comparison with SD-Map* ...... 145
    5.5.3 Evaluation over Larger Datasets ...... 152
    5.5.4 Effectiveness of Search Strategy ...... 155
5.6 Related Work ...... 157
5.7 Summary ...... 159

6. SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats ...... 160

6.1 Background ...... 160
    6.1.1 MapReduce and MATE ...... 160
    6.1.2 Scientific Data Formats ...... 161
6.2 System Design ...... 162
    6.2.1 System Overview ...... 162
    6.2.2 API for Integrating a New Data Format ...... 165
6.3 System Optimization ...... 166
    6.3.1 Data Access Strategies and Patterns ...... 166
    6.3.2 Optimizing Different Types of Accesses ...... 168
6.4 Experimental Results ...... 171
    6.4.1 Applications and Environmental Setup ...... 172
    6.4.2 Functionality Evaluation ...... 172
    6.4.3 Optimization Experiments ...... 177
6.5 Related Work ...... 179
6.6 Summary ...... 181

7. Smart: A MapReduce-Like Framework for In-Situ Scientific Analytics . . . . . 184

7.1 MapReduce and In-Situ Analytics ...... 184
    7.1.1 Opportunity and Feasibility ...... 184
    7.1.2 Challenges ...... 185
7.2 System Design ...... 189
    7.2.1 System Overview ...... 189
    7.2.2 Two In-Situ Modes ...... 191
    7.2.3 Runtime System ...... 193
7.3 System Optimization for Window-Based Analytics ...... 198
    7.3.1 Motivation ...... 198
    7.3.2 Optimization: Early Emission of Reduction Objects ...... 199
7.4 Experimental Results ...... 200
    7.4.1 Applications and Environmental Setup ...... 201
    7.4.2 Performance Comparison with Spark ...... 202
    7.4.3 Performance Comparison with Low-Level Analytics Programs ...... 205
    7.4.4 Scalability Evaluation ...... 207
    7.4.5 Evaluating Memory Efficiency ...... 209
    7.4.6 Comparing Time Sharing and Space Sharing Modes ...... 211
    7.4.7 Evaluating the Optimization for Window-Based Analytics ...... 214
7.5 Related Work ...... 215
7.6 Summary ...... 218

8. Conclusions and Future Work ...... 219

8.1 Contributions ...... 219
8.2 Future Work ...... 221
    8.2.1 Bitmap-Based Data Analysis ...... 222
    8.2.2 Smart Extension ...... 224

Bibliography ...... 226

List of Tables

Table Page

2.1 Type 2 and Type 3 Query Examples ...... 37

2.2 Aggregation Query Examples ...... 41

2.3 Data Transfer Volume Comparison against OPeNDAP ...... 42

3.1 Structural Aggregation APIs and Mathematical Definitions ...... 48

3.2 Summary of Model Parameters ...... 58

4.1 Definitions of the Aggregation Interface for Different Aggregate Operators ...... 96

4.2 Query Examples of Five Types ...... 106

4.3 Average Relative Error (%) of MAX Aggregation of Different Methods on the Real-World Dataset ...... 110

5.1 The Best 8 Subgroups Discovered on WOA09 5DEG by SD-Map* ...... 146

5.2 The Best 8 Subgroups Discovered on WOA09 5DEG by SciSD with Maximum CWE of 1.5 ...... 147

5.3 Comparison between SD-Map* and SciSD on WOA09 5DEG with Varying Maximum CWE ...... 148

5.4 The Best 8 Subgroups Discovered on WOA13 by SD-Map* ...... 150

5.5 The Best 8 Subgroups Discovered on WOA13 by SciSD with Maximum CWE of 1.5 ...... 150

5.6 Comparison between SD-Map* and SciSD on WOA13 with Varying Maximum CWE ...... 151

5.7 Search Efficiency Evaluation on WOA09 1DEG with Varying # of Bins and Maximum CWE of 1.5 ...... 156

6.1 Descriptions of the functions in the Scientific Data Processing API ...... 182

6.2 Invoked Calls in NetCDF/HDF5 Lib for full read Implementation ...... 183

List of Figures

Figure Page

2.1 Execution Overview ...... 23

2.2 Metadata Structure ...... 25

2.3 Hyperslab Selector Module ...... 27

2.4 Parallelism of Query Execution ...... 30

2.5 Performance Comparison for Type 1 Queries ...... 35

2.6 Performance Comparison for Type 2 and Type 3 Queries ...... 37

2.7 Parallel Query Processing Times with Different Dataset Sizes ...... 38

2.8 Parallel Query Processing Times with Different Queries ...... 39

2.9 Performance Comparison of Content-Filtering over Atomic Datasets and Compound Datasets ...... 40

2.10 Comparison of Data Aggregation Times ...... 43

3.1 Structural Aggregation Examples ...... 47

3.2 Examples of Different Partitioning Strategies ...... 53

3.3 Processing Times (left) and Predicted Processing Times (right) on Synthetic Datasets of Varying Skew ...... 68

3.4 Processing Times (left) and Predicted Processing Times (right) on Cloud Pressure Dataset ...... 68

3.5 Processing Times (left) and Predicted Processing Times (right) on Terrain Pressure Dataset ...... 69

3.6 Grid Aggregation Performance Comparison ...... 75

3.7 Sliding Aggregation Performance Comparison ...... 76

3.8 Parallel Performance of Grid Aggregations Implemented by Different Partitioning Strategies ...... 78

3.9 Parallel Performance of Grid Aggregations with and without Chunk-Based Optimization ...... 79

3.10 Parallel Performance of Overlapping Aggregations ...... 80

4.1 An Example of Bitmap Indexing ...... 90

4.2 An Example of Pre-Aggregation Statistics as Applied to the Example in Figure 4.1 ...... 93

4.3 SUM Aggregation Accuracy of Different Methods on the Real-World Dataset ...... 107

4.4 SUM Aggregation Times of Different Methods on the Real-World Dataset ...... 109

4.5 SUM Aggregation Accuracy of Different Binning Strategies on the Synthetic Dataset ...... 113

4.6 SUM Aggregation Accuracy of Different Binning Strategies with Varying Number of Bins on the Synthetic Dataset ...... 116

4.7 Index Creation Times with Different Binning Strategies (Varying Number of Bins, Real-World Dataset) ...... 119

4.8 Space Requirements of Bitmaps with Different Binning Strategies (Varying Number of Bins, Real-World Dataset) ...... 120

5.1 A Running Example of Subgroup Discovery ...... 128

5.2 An Example of Set-Enumeration Tree ...... 133

5.3 Use of Bitmaps in SciSD ...... 140

5.4 Evaluation on WOA09 1DEG Dataset: Impact of Varying Maximum CWE and # of Bins on # of Subgroups, Execution Times, and Average CWRAcc ...... 153

5.5 Evaluation on POP Dataset: Impact of Varying Maximum CWE and # of Bins on # of Subgroups, Execution Times, and Average CWRAcc ...... 154

6.1 Execution Overview of SciMATE ...... 162

6.2 Details of the Scientific Data Processing Module ...... 164

6.3 Contiguity Experiments ...... 170

6.4 K-means: Data Processing Times (# of Nodes = 1) ...... 173

6.5 PCA: Data Processing Times (# of Nodes = 1) ...... 173

6.6 KNN: Data Processing Times (# of Nodes = 1) ...... 174

6.7 K-means: Data Processing Times (8 Threads per Node) ...... 174

6.8 PCA: Data Processing Times (8 Threads per Node) ...... 175

6.9 KNN: Data Processing Times (8 Threads per Node) ...... 175

6.10 K-means: Data Loading Times (8 Threads per Node) ...... 176

6.11 PCA: Data Loading Times (8 Threads per Node) ...... 176

6.12 PCA: Comparison between Partial Read and Full Read ...... 177

6.13 KNN: Comparison between Fixed-size Read and Contiguous Column Read on NetCDF Datasets ...... 179

6.14 KNN: Comparison between Fixed-size Column Read and Contiguous Column Read on HDF5 Datasets ...... 180

7.1 System Overview of Smart ...... 190

7.2 Two In-Situ Modes ...... 192

7.3 Performance Comparison with Spark ...... 203

7.4 Performance Comparison with Low-Level Programs ...... 206

7.5 In-Situ Processing Times with Varying # of Nodes on Heat3D (Using 8 Cores per Node) ...... 208

7.6 In-Situ Processing Times with Varying # of Threads on Lulesh (Using 64 Nodes) ...... 209

7.7 Evaluating the Efficiency of Time Sharing Mode ...... 210

7.8 Evaluating the Efficiency of Space Sharing Mode ...... 212

7.9 Evaluating the Effect of Optimization for Window-Based Analytics ...... 214

List of Algorithms

1 localAggregation(hyperslab, content-basedCondition, groupbyIndex, aggregationType) ...... 33

2 GridAgg(slab S, num procs n) ...... 52

3 SlidingAgg(slab S, num procs n) ...... 65

4 ChunkBasedGridAgg(slab S, num procs n) ...... 66

5 Agg(bitmap B, statistics S, aggOperator Op, valBasedPredicate Pv, dimBasedPredicate Pd) ...... 94

6 (num elements N, num bins B) ...... 101

7 SciSD(dim-based attr. universal set Ω_A^D, val-based attr. universal set Ω_A^V, dim-range universal set R^D, val-range universal set R^V) ...... 134

8 run(const In* in, size_t in_len, Out* out, size_t out_len) ...... 195

9 reduce(Split split) ...... 199

Chapter 1: Introduction

1.1 Motivation

Our overall research work in this dissertation has been motivated by two recent trends.

First, the past decade has witnessed an unprecedented scientific data explosion, mostly collected from high-throughput instruments or generated from massive simulations. Along with data structures like graphs, sets, and trees, various array storages have been serving as the heart of large-scale scientific data analytics [116]. Thus, effectively managing, disseminating, querying, and analyzing arrays becomes one of the key ‘big-data’ challenges today. This trend has sparked a new class of array-based scientific data management and data processing tools, including scientific data transport protocols like OPeNDAP [62] that simplify access to remote data of arbitrary size, many Array DBMSs, including SciDB [39], RasDaMan [32], and MonetDB [258], that facilitate array manipulations, and MapReduce variants like SciHadoop [42] that support MapReduce tasks over scientific data.

Second, since scientific datasets are often stored in a variety of data formats, with diverse underlying physical structures, most scientific data management and data processing tools require upfront data transformation and data transfer before the data processing phase, which is very likely to be prohibitively expensive for massive amounts of data.

Some prominent examples are as follows: 1) to hide the data formats and data processing details, current data transport protocols (e.g., OPeNDAP) require translating the original data format into a standard OPeNDAP data format, and also require the user client to download the entire dataset first and then write its own code to perform subsetting and aggregation; 2) current Array DBMSs, including SciDB, RasDaMan, and MonetDB, like their relational counterparts, require ingesting data into the database system before processing queries; 3) current MapReduce implementations often require that data be integrated into specialized file systems, like the Hadoop Distributed File System (HDFS); and 4) implementations of many data mining algorithms (e.g., VIKAMINE [14], which implements a subgroup discovery algorithm [27]), though they may be adapted to analyze array data, mostly still require that the data be transformed into relational data, and many of them cannot process large-scale datasets. However, with the rapidly growing size of scientific datasets, such heavy-weight tools may not be feasible in many data-intensive applications [21].

Therefore, our overall work focuses on the design of high-performance data

management and data processing support on array-based scientific data, which can

significantly reduce the prohibitively expensive costs of data integration, data translation,

data transfer, data ingestion, data processing, and data storage involved in many scientific

applications. Generally, we¹ would like to avoid data transformations and minimize data movement involved in the data flow of a variety of scientific applications.

1.2 Contributions

The dissertation focuses on providing high-performance support in two aspects: data management and data processing. The former includes 1) a light-weight data management

¹The word “we” is used as a personal preference. The primary contributor to the work presented in this dissertation is the author.

layer over HDF5, 2) an implementation which uses native array storage as a database with the support for structural aggregations, 3) an approximate aggregation approach using novel bitmap indices, and 4) a novel subgroup discovery algorithm using bitmap indices.

The latter includes two novel MapReduce-like frameworks, one for offline processing over multiple scientific data formats, and the other for in-situ analytics.

1.2.1 Data Management Support

Supporting a Light-Weight Data Management Layer over HDF5: Many of the ‘big-data’ challenges today are arising from increasing computing ability, as data collected from simulations has become extremely valuable for a variety of scientific endeavors. With the growing computational capabilities of parallel machines, scientific simulations are being performed at finer spatial and temporal scales, leading to a data explosion.

Finer granularity of simulation data offers both an opportunity and a challenge. On one hand, it can allow an understanding of the underlying phenomena and features in a way that would not be possible with coarser granularity. On the other hand, larger datasets are extremely difficult to store, manage, disseminate, analyze, and visualize. Neither the memory capacity of parallel machines, memory access speeds, nor disk bandwidths are increasing at the same rate as computing power, contributing to the difficulty in storing, managing, and analyzing these datasets. Simulation data is often disseminated widely, through portals like the Earth System Grid (ESG), and downloaded by researchers all over the world. Such dissemination efforts are hampered by dataset size growth, as wide area data transfer bandwidths are growing at a much slower pace. Finally, while visualizing datasets, human perception is inherently limited.

In the context of visualization, for almost all simulations, the dataset sizes have grown well past the point where it is feasible for a scientist to look through all of the data [111].

In addition, because of growing dataset sizes, most of the time while performing data visualization and analysis is now spent in I/O routines, reducing the effectiveness of a visualization effort.

Despite much emphasis on large-scale visualization, the state-of-the-art in dealing with large-scale datasets is very limited. If we look at the current popular tools, like ParaView and VisIt, most subsetting operations are done through an in-memory interface, where all data is loaded at once. Then, either through filtering, i.e., applying data transformations, or selection interfaces, the data is subset. The memory footprint for this operation is the total size of the data plus the size of the data after subsetting operations. Clearly, with growing dataset sizes, this is a serious limitation. Therefore, we must look towards pushing data subsetting and querying operations in large-scale analysis tools to the I/O level (i.e. perform them prior to loading all the data in memory), to cope with the memory limitations.

Similarly, while disseminating data, in many of the existing data dissemination portals (e.g.,

Earth System Grid (ESG)) data downloading is facilitated by OPeNDAP [62], which has only a very limited support for data subsetting, and no support for aggregation at the server side.

The key underlying reason for the limited state-of-the-art is that the management of scientific (array-based) data has received very limited attention over the years. On one hand, while a typical database provides a high-level language, it also requires all data to be loaded into the system, which is often extremely time-consuming. On the other hand, more ad-hoc solutions for data management, which do not require that data be reloaded in such a fashion, unfortunately involve format-specific coding and use of lower-level languages.

In this work, we describe an approach which combines the best of the two approaches. On one hand, our approach does not require data to be loaded into a specific

system or be reformatted. At the same time, we allow use of a high-level language for

specification of processing, which is also independent of the data format. We present an

implementation of this approach for HDF5, one of the most popular formats for storing

scientific data. (The following documentation lists some of the major users of HDF5².)

Our tool supports SQL select and aggregation queries specified over the virtual relational table of the data. Besides supporting selection over dimensions, which is directly

supported by the HDF5 API, we also support queries involving dimension scales and those involving data values. For this, we generate code for a hyperslab selector and a content-based filter in our system. We also effectively parallelize selection and aggregation queries using novel algorithms.

We have extensively evaluated our implementation with queries of different types, and have compared its performance and functionality against OPeNDAP. We demonstrate that even for subsetting queries that are directly supported in OPeNDAP, the sequential performance of our system is better by at least a factor of 3.9. For other types of queries, where OPeNDAP requires hyperslab selector and/or content-based filter code to be written manually, the performance difference is even larger. In addition, our system is capable of scaling performance by parallelizing the queries, and reducing wide area data transfers through server-side data aggregation. In terms of functionality, our system also supports certain state-of-the-art HDF5 features including dimension scale and compound datatype.

²http://www.hdfgroup.org/HDF5/users5.html

Array Storage as a DB with Support for Structural Aggregations: Large-scale arrays of different sizes and dimensions are used for storing images, sensor data, simulation outputs, and statistical data in a variety of scientific domains, including earth sciences, space sciences, life sciences, and social sciences [116]. With rapid growth in dataset sizes in each of these areas, effectively managing, querying, and processing such array data is one of the key ‘big-data’ challenges today. Many studies have observed the intrinsic mismatch between the array model and relational table view [32,39,61,79,91,96,97,199,258], and thus, it is well understood that the relational model (and the systems implementing this model, i.e., the relational databases) cannot be used (efficiently) for addressing this challenge.

In recent years, many Array DBMSs, including SciDB [39] and RasDaMan [32] have been designed to address this mismatch. As their names suggest, the key feature of these systems is that arrays, and not the relational tables, are the first-class citizens. Various complex operations desired in scientific applications, including correlations, curve fitting, and clustering, are naturally defined in terms of arrays (and array elements). Overall, the array-view, instead of the relational table view, often leads to better expressibility as well as performance.

A key functionality associated with array databases is the set of operations classified as structural aggregations. These operations collect and aggregate elements within each group, where the groups are created based on positional relationships. This is in contrast to the value-based groupings that are common with the relational table view. As scientific applications where data needs to be stored and/or viewed as arrays typically need support for such operations, Array DBMSs like SciDB [39],

RasDaMan [32], and MonetDB (which provides a prototype implementation of the language SciQL [258]) include the support for structural aggregations.
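To make the notion of positional grouping concrete, the sketch below computes a grid aggregation (SUM) over a small dense 2-D array: the array is tiled into non-overlapping cells, and every element contributes to the cell determined solely by its position. This is an illustrative stand-alone example, not code from the SAGA system; the array contents, tile size, and function names are made up.

```cpp
#include <cstdio>
#include <vector>

// Illustrative grid aggregation (SUM) over a dense row-major 2D array.
// Each non-overlapping gridRows x gridCols tile forms one structural group.
std::vector<double> gridSum(const std::vector<double>& a,
                            size_t rows, size_t cols,
                            size_t gridRows, size_t gridCols) {
  size_t outRows = rows / gridRows, outCols = cols / gridCols;
  std::vector<double> out(outRows * outCols, 0.0);
  for (size_t i = 0; i < outRows * gridRows; ++i)
    for (size_t j = 0; j < outCols * gridCols; ++j)
      // The group an element belongs to is determined purely by its position.
      out[(i / gridRows) * outCols + (j / gridCols)] += a[i * cols + j];
  return out;
}

int main() {
  size_t rows = 4, cols = 4;
  std::vector<double> a(rows * cols);
  for (size_t k = 0; k < a.size(); ++k) a[k] = static_cast<double>(k);
  std::vector<double> cells = gridSum(a, rows, cols, 2, 2);  // 2x2 grid cells
  for (double v : cells) std::printf("%.1f\n", v);           // 10 18 42 50
  return 0;
}
```

A sliding aggregation differs only in that the groups overlap (each output element aggregates a window around its position), which is what motivates the data-reuse and all-reuse algorithms developed in Chapter 3.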

Array DBMSs, like their relational counterparts, involve an expensive data ingestion phase, where arrays are restructured so as to speed up query processing later. These systems are clearly well suited for situations where data needs to be loaded once and then queried frequently, and thus, the cost of data ingestion is well justified. On the other hand, there are applications, such as analysis of simulation data [200, 206, 234, 237], where massive amounts of data are analyzed only infrequently, and hence the cost of data ingestion cannot be justified. In fact, with current Array DBMSs, because of the use of specialized data formats in various scientific domains, ingesting data into a database system can require a series of transformations. For example, the current recommended way to import high-volume satellite imagery data, which is captured in the HDF-EOS format, into SciDB involves converting the HDF-EOS files into CSV text files, transforming the CSV files into external files in the SciDB specialized format, loading the data into SciDB in the form of temporary 1-dimensional arrays, and finally, casting the loaded 1-dimensional arrays into n-dimensional ones within SciDB [172]. Clearly, the process is extremely time-consuming and requires the availability of additional memory and disk space for transformations and for storing data in intermediate formats. In practice, such data ingestion is extremely expensive, easily taking 100x the time of executing a simple query over the same amount of data.

As an alternative to databases that require expensive data ingestion, a new paradigm of using native storage as a DB and providing database-like support has recently been shown to be an effective approach for dealing with infrequently queried data. This paradigm aims to develop a database engine on top of the native storage. Examples include the NoDB

approach [22] and automatic data virtualization [238]. Although this paradigm can be a promising approach for handling massive arrays that are not queried often, to the best of our knowledge, so far it has only been applied to relational tables stored in flat files [22] or for simple selection queries over arrays [238].

Applications that generate massive arrays, such as scientific applications, often store the data in one of a small number of popular array storage formats, like NetCDF and HDF5. Thus, it is important to examine whether database-like querying support can be provided on top of such data storage. In this work, we describe an approach of using array storage as a DB with support for structural aggregations. This approach has been implemented in a system we refer to as Structural AGgregations over Array storage

(SAGA). We focus on efficiently implementing and effectively parallelizing key structural aggregation operations, including grid, sliding, hierarchical, and circular aggregations.

We propose different partitioning strategies, with the goal of handling both computationally expensive and inexpensive aggregations, as well as dealing with possibly skewed data. Moreover, we design an analytical model for choosing the best scheme for a given query and dataset. We have also developed aggregation methods for improving the performance on chunked array storage. Our system can process data from any array storage formats (including formats like HDF and NetCDF that are frequently used for scientific data) and even chunked or compressed array storage, as long as an interface for loading an array slab is provided [200,206,234].

We have extensively evaluated our structural aggregation approaches by using multiple real-world datasets and a set of synthetic datasets of varying skew. We show how the relative performance of different partitioning strategies changes with varying amounts of computation in the aggregation function and different levels of data skew. Moreover, we demonstrate the effectiveness of our cost models for choosing the best partitioning

strategy. By comparing performance with SciDB, we show that despite working on native

array storage, the aggregation costs with our system are significantly lower. Finally, we

also show that our structural aggregation implementations achieve high parallel efficiency.

Approximate Aggregations Using Novel Bitmap Indices: An important trend over the

years has been the advent of approximate query processing (AQP), which is a cost-effective technique for handling large-scale data. The idea here has been to sacrifice some level of accuracy to meet stringent response-time requirements. Since aggregation serves as a key functionality for databases, various approximate aggregation techniques have been proposed [45,101,105,110,174,220,221]. However, almost all of the proposed techniques have targeted relational databases and data warehouses. With the wide use of arrays to store images, sensor data, simulation outputs, and statistical data, and with the growing size of such data, it is very important to be able to apply approximate aggregation over array data.

In this work, we develop a novel approximate aggregation approach suitable for large-scale array data. Our work is based on a novel application of one of the indexing structures, the bitmap or bitvector, with the insight that a bitmap is a summary structure that can preserve both value and spatial distributions within the array. We also note that data repositories are likely to leverage bitmap indexing to accelerate selection operations

(or subsetting) over arrays [56,77,201], and we are proposing to use bitmaps not only for selection query processing, but also as stand-alone synopses for approximate aggregations. The aggregate operators supported by our approach include the common aggregations that are widely used and supported, and moreover, by customizing the pre-aggregation statistics stored in the metadata, our approach can also support certain

user-defined aggregation functions. Further, flexible dimension-based and/or value-based

predicates can be applied, and no data reorganization is needed.
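As a rough illustration of the idea (not the algorithm developed in Chapter 4), the sketch below estimates a SUM over a subset selected by a dimension-based predicate, using only per-bin bitvectors and pre-aggregated per-bin statistics: each bin contributes its stored sum scaled by the fraction of its elements that the predicate selects. The bin layout, uncompressed bitvectors, and names are assumptions made for the example; __builtin_popcountll is a GCC/Clang builtin.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// One bin of a bitmap index: which positions fall into the bin's value range,
// plus pre-aggregated statistics kept in the metadata.
struct Bin {
  std::vector<uint64_t> bits;  // uncompressed bitvector over array positions
  double sum;                  // sum of all values falling into this bin
  uint64_t count;              // number of values falling into this bin
};

// Approximate SUM over the subset selected by a dimension-based predicate,
// itself given as a bitvector. Each bin contributes in proportion to how many
// of its elements the predicate selects, scaled by the bin's stored sum.
double approxSum(const std::vector<Bin>& index,
                 const std::vector<uint64_t>& predicate) {
  double est = 0.0;
  for (const Bin& b : index) {
    uint64_t hit = 0;
    for (size_t w = 0; w < b.bits.size(); ++w)
      hit += static_cast<uint64_t>(__builtin_popcountll(b.bits[w] & predicate[w]));
    if (b.count > 0)
      est += b.sum * (static_cast<double>(hit) / static_cast<double>(b.count));
  }
  return est;
}

int main() {
  // Two bins over an 8-element array; the predicate selects positions 0-3.
  std::vector<Bin> index = {
      {{0b00001111ULL}, 10.0, 4},   // low-value bin: positions 0-3
      {{0b11110000ULL}, 100.0, 4},  // high-value bin: positions 4-7
  };
  std::vector<uint64_t> predicate = {0b00001111ULL};
  std::printf("approx SUM = %.1f\n", approxSum(index, predicate));  // 10.0
  return 0;
}
```

The approach in Chapter 4 additionally handles value-based predicates, other common operators (Table 4.3, for instance, reports MAX aggregation accuracy), and user-defined aggregations via customized pre-aggregation statistics.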

In the process, we have developed a novel binning algorithm to improve the aggregation accuracy over skewed data. The conventional binning strategies have been equi-width and equi-depth [246]. By taking inspiration from the v-optimal histogram techniques [88, 103], we have developed a v-optimized binning strategy to improve the aggregation accuracy. An important contribution is performing such binning (or its approximation) efficiently. Moreover, we design a weighted extension to further improve the accuracy when the probability of querying each element is not necessarily equal.

We have extensively evaluated our approximate aggregation approach using both real-world and synthetic datasets. Our results demonstrate that 1) our approximate aggregation approach can be effectively applied to any subsets specified by dimension-based and/or value-based predicate(s); 2) our bitmap-based aggregation provides higher accuracy and/or lower processing overheads than methods like sampling or multi-dimensional histograms, and significantly lower storage costs than multi-dimensional histograms; and 3) our v-optimized binning provides higher accuracy than other conventional binning strategies for most cases, even if the data is skewed, while the indexing costs with such binning are only slightly higher than with equi-depth binning.

Novel Subgroup Discovery Over Scientific Datasets Using Bitmap Indices: Subgroup

discovery [119, 240] is a broadly applicable exploratory technique, which identifies

interesting subgroups with respect to a property of interest. The underlying goal is to

extract relations with interesting characteristics between a dependent or target variable

and multiple independent or explaining variables. For example, professional basketball

players are mostly taller than the general population, where the attribute ‘occupation’

(basketball player) is the explaining variable and the attribute ‘height’ (greater than the

average of the general population) is the target variable.

In both the data management and data analytics areas, there is an increasing emphasis on array data [56,206], whereas the existing work on subgroup discovery has been primarily restricted to relational or tabular datasets. In this data-driven era, it is highly desirable to customize the subgroup discovery technique to discover interesting patterns when data is stored as arrays. For example, given a set of arrays that are output from a scientific simulation, scientists may be interested in identifying the underlying relationships between variables. As a specific instance, in the output from an ocean simulation, the relationship between large values of salinity and other variables, such as temperature and depth, is of great interest to scientists [55]. Analysis of array-based data imposes unique challenges – e.g., temperature is a value-based attribute that is stored as a separate array, whereas depth is a dimension-based attribute that corresponds to one of the array dimensions. In practice, it can be observed that the subsets of output grid points that have a low depth and/or a high temperature are very likely to have a salinity value significantly higher than the average of the entire dataset. Clearly, the data subsets

(i.e., subgroups) described by depth and/or temperature ranges can be a set of interesting subgroups, and identifying such subgroups can be of great interest. However, there is no work that can identify interesting subgroups over array data automatically. A seemingly viable approach can be to apply existing relational-table-based subgroup discovery algorithms directly over array data by treating dimension-based attributes as additional relational attributes. However, this approach requires data reorganization, which can often be

prohibitively expensive for large datasets, in terms of both computation and storage

costs [203].

Developing subgroup discovery algorithms for array-based scientific data involves at least three sets of challenges. First, as we have already stated above, unlike the conventional subgroups discovered over relational data, a subgroup identified over array data can be described with value-based attributes and/or dimension-based attributes. Second, in scientific datasets, the attributes are mostly numeric, whereas most conventional subgroup discovery algorithms [28, 115, 119, 130, 133, 240] mainly target binary or categorical attributes. Third, the conventional algorithms for subgroup discovery are only able to operate with small datasets. In comparison, datasets of interest to us, particularly those from simulations, can be extremely large, and are likely to become larger in the future.

In this work, we present a novel algorithm, SciSD, for exhaustive but efficient subgroup discovery over array-based scientific datasets, in which all attributes are numeric. Our algorithm is able to effectively discover interesting subgroups of both great generality and high quality. Moreover, our algorithm directly operates on the compact bitmap indices instead of the raw datasets, and utilizes fast bitwise operations on them, allowing processing of larger datasets.

We have demonstrated both high efficiency and effectiveness of our algorithm by using multiple real-life datasets. We first experimented on a small dataset to compare the performance (execution efficiency) as well as the quality of the output subgroups from our algorithm against SD-Map* [27], a popular subgroup discovery algorithm. The results have shown that our algorithm not only produces subgroups of significantly greater generality and higher quality, but the execution times are also lower by up to two orders of

magnitude. We also evaluated our algorithm over larger datasets. We find that by using

only a small number of bins, we obtain both high-quality subgroups and low execution

times, which demonstrates the practicality of our algorithm.

1.2.2 Data Processing Support

A Novel MapReduce-Like Framework for Multiple Scientific Data Formats: Gaining insights from data to facilitate scientific discoveries has emerged as the fourth paradigm for research, in addition to theory, experimentation, and computation. At the same time, the amount of data collected from instruments or generated from simulations is increasing at a massive rate. Thus, there is a growing need for tools that can help the development of data analysis applications, which can then be executed in a scalable fashion on large datasets.

In both scientific and commercial sectors, there has been a lot of interest in data-intensive computing. Much of the activity in this area has been around the

MapReduce paradigm for implementing data-intensive applications [63]. MapReduce and its variants are quite effective in simplifying the development of data-intensive applications, through the use of a simple API, and with a robust runtime system.

Despite the popularity of MapReduce, there are several obstacles to applying it for developing scientific data analysis applications. Current MapReduce implementations require that data be loaded into specialized file systems, like the Hadoop Distributed File

System (HDFS). On the other hand, high performance computing systems where large-scale simulation data is produced use file systems like PVFS. Though certain recent developments could enable execution of Hadoop jobs on PVFS [42, 210], scientific data tends to be stored in formats like NetCDF, HDF5, or ADIOS. Hadoop is not capable of

transparently processing data stored in these formats. Moreover, with the rapidly growing size of scientific datasets, reloading data into another file system or format is not feasible [21].

In this work, we describe a framework that allows scientific data in different formats to

be processed with a MapReduce-like API. Our system is referred to as SciMATE, and is

based on the MATE system developed at Ohio State [108,109]. SciMATE is developed as

a customizable system, which can be adapted to support processing on any of the

scientific data formats. Particularly, it has a data format adaptation API, which can be used

to create an instance of the system for processing data in a particular format. Once such an

instance has been created, data processing applications developed using a

MapReduce-like API can be executed. Applications can be developed assuming a simple

logical view of data. Thus, a data analysis application developed for our system is

independent of the specific data format used for storing the data.
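As a rough sketch of what such an adapter layer can look like (the actual SciMATE API is described in Chapter 6 and Table 6.1, and its function names and signatures differ), the following illustrative interface exposes a flat logical view of a variable while hiding the format-specific read calls:

```cpp
#include <cstddef>
#include <cstdio>
#include <utility>
#include <vector>

// Hypothetical sketch of a data format adaptation interface; the real
// SciMATE API (see Table 6.1) differs in function names and signatures.
class DataFormatAdapter {
 public:
  virtual ~DataFormatAdapter() = default;
  // Read the entire variable into a flat buffer of doubles.
  virtual void fullRead(std::vector<double>* buf) = 0;
  // Read a 1-D slab described by an offset and a count.
  virtual void partialRead(size_t offset, size_t count,
                           std::vector<double>* buf) = 0;
};

// Toy adapter over an in-memory array, standing in for a NetCDF/HDF5 adapter
// that would issue nc_get_vara_double / H5Dread calls instead.
class InMemoryAdapter : public DataFormatAdapter {
 public:
  explicit InMemoryAdapter(std::vector<double> data) : data_(std::move(data)) {}
  void fullRead(std::vector<double>* buf) override { *buf = data_; }
  void partialRead(size_t offset, size_t count,
                   std::vector<double>* buf) override {
    buf->assign(data_.begin() + offset, data_.begin() + offset + count);
  }
 private:
  std::vector<double> data_;
};

int main() {
  InMemoryAdapter adapter({1, 2, 3, 4, 5});
  std::vector<double> chunk;
  adapter.partialRead(1, 3, &chunk);               // the runtime hands out such chunks
  for (double v : chunk) std::printf("%.1f ", v);  // 2.0 3.0 4.0
  std::printf("\n");
  return 0;
}
```

A NetCDF or HDF5 instance of such an interface would implement the read methods with the corresponding library calls, so that the same MapReduce-like analytics code runs unchanged over either format.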

We have demonstrated the functionality of our system by creating instances that can process NetCDF and HDF5 formats as well as flat-files. We have also implemented

three popular data mining applications and have evaluated their execution with each of the

three instances of our system.

A MapReduce-Like Framework for In-Situ Scientific Analytics: A major challenge

faced by data-driven discovery from scientific simulations is a shift towards architectures

where memory and I/O capacities are not keeping pace with the increasing computing

power [141,144,228,259]. There are many reasons for this constraint on HPC machines.

Most critically, the need for providing high performance in a cost and power effective

fashion is driving architectures with two critical bottlenecks — memory bound and data movement costs [121]. Scientific simulations are being increasingly executed on systems

with coprocessors and accelerators, including GPUs and the Intel MIC, which have a large

number of cores but only a small amount of memory per core. As power considerations are

driving both the design and operation of HPC machines, power costs associated with data

movement must be avoided.

In response to this unprecedented challenge, in-situ analytics [118,127,255,267] has emerged as a promising data processing paradigm, and is beginning to be adopted by the

HPC community. This approach co-locates the upstream simulations with the downstream analytics on the same compute nodes, and hence it can launch analytics as soon as simulated data becomes available. Compared with traditional scientific analytics that processes simulated data offline, in-situ analytics can avoid, either completely or to a very large extent, the expensive data movement of massive simulation output to persistent storage.

This translates to saving in execution times, power, and storage costs.

The current in-situ analytics research can be very broadly classified into two areas: 1) in-situ algorithms at the application level, including indexing [117, 127], compression [128,268], visualization [114,250,266], and other analytics [129,219,257]; and 2) in-situ resource scheduling platforms at the system level, which aim to enhance resource utilization and simplify the management of co-located analytics code [18,37,66,167,219,262]. These in-situ middleware systems mainly play the role of a coordinator, aiming to facilitate the underlying scheduling tasks, such as cycle stealing [262] and asynchronous I/O [219].

Despite a large volume of recent work in this area, an important question remains almost completely unexplored: “can the applications be mapped more easily to the platforms for in-situ analytics?”. In other words, we posit that programming model research on in-situ analytics is needed. Particularly, in-situ algorithms are currently

implemented with low-level parallel programming libraries such as MPI, OpenMP, and Pthread, which offer high performance but require that programmers manually handle all the parallelization complexities. Moreover, because similar analytics may be applied in both in-situ and offline modes, another interesting question is “can the offline and in-situ analytics codes be (almost) identical?”. Clearly, this is likely only if the implementation is in a high-level framework, where details like loading and staging data and the complexity of parallelization are hidden from the application developer.

In this work, we describe a novel MapReduce-like framework for in-situ scientific analytics. To the best of our knowledge, this framework is the first work to exploit a high-level MapReduce-like API in in-situ scientific analytics. The system is referred to as in-Situ MApReduce liTe (Smart). Our system can support a variety of scientific analytics on simulation nodes, with minimal modification of simulation code and without any specialized deployment (such as installing HDFS). Compared with traditional MapReduce frameworks, Smart supports efficient in-situ processing by accessing simulated data directly from memory in each node of a cluster or a distributed memory parallel machine.

Moreover, unlike the traditional implementations, we base our work on a variant of the

MapReduce API, which avoids outputting key-value pairs and thus keeps the memory consumption of analytics programs low. To address the mismatch between the parallel programming view of simulation code and the sequential programming view of MapReduce, Smart can be launched from a parallel (OpenMP and/or MPI) code region once each simulation output partition is ready, while the global analytics result can be directly obtained after the parallel code converges. Further, we have developed both time sharing and space sharing modes for maximizing the performance in different scenarios.

Additionally, for memory-intensive window-based analytics, we improve the in-situ efficiency by supporting early emission of the reduction object.
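The following is only a schematic of the time sharing usage pattern, not the actual Smart API (Algorithm 8 lists the real run(const In* in, size_t in_len, Out* out, size_t out_len) signature, and the framework itself handles splitting, scheduling, and merging): an analytics object consumes each simulation output partition directly from memory, right after a time step, and folds it into a small reduction object rather than emitting key-value pairs. All names and values below are made up for illustration.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Schematic stand-in for an in-situ analytics kernel: fold a partition into a
// small reduction object (here, a running maximum) instead of key-value pairs.
struct GlobalMax {
  void run(const double* in, size_t in_len, double* out, size_t /*out_len*/) {
    for (size_t i = 0; i < in_len; ++i)
      if (in[i] > *out) *out = in[i];
  }
};

int main() {
  const size_t n = 1024;
  std::vector<double> field(n, 0.0);   // stands in for one node's partition
  double result = -1e300;              // the reduction object
  GlobalMax analytics;

  for (int step = 0; step < 10; ++step) {
    // ... simulation time step updates field in place (omitted) ...
    field[step] = static_cast<double>(step);
    // Time sharing mode: analytics runs on the same cores right after the
    // step, reading the partition in place -- no copy to disk or to HDFS.
    analytics.run(field.data(), field.size(), &result, 1);
  }
  std::printf("global max so far: %.1f\n", result);  // 9.0
  return 0;
}
```

In a real multi-node run, the per-node reduction objects would additionally be merged (e.g., with an MPI reduction) to obtain the global analytics result once the parallel code converges.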

We have extensively evaluated both the functionality and efficiency of our system, by using different scientific simulations and analytics tasks on both multi-core and many-core clusters. We first show that our system can outperform Spark [252] by at least an order of magnitude for three applications. Second, we show that our middleware does not add much overhead (typically less than 10%) compared with analytics programs written with low-level programming libraries (i.e., MPI and OpenMP). Next, by varying the number of nodes and threads, we demonstrate the high scalability of our system. Moreover, by comparing with another implementation of our system that involves an extra copy of simulated data, we show the efficiency of our design (for time sharing mode). Further, we also evaluate how our space sharing mode is suitable for clusters with many-core nodes. Finally, we show that our optimization for in-situ window-based analytics can achieve a speedup of up to

5.6, by comparing it with an implementation that disables early emission of the reduction object.

1.3 Outline

The rest of the dissertation is organized as follows. In Chapter 2, we describe a

light-weight data management tool, which allows server-side subsetting and aggregation

on scientific datasets stored in HDF5, one of the most popular scientific data formats.

Chapter 3 presents algorithms, different partitioning strategies, and an analytical model

for supporting structural (grid, sliding, hierarchical, and circular) aggregations over native

array storage, and describes the implementation of this approach in a system we refer to as

Structural AGgregations over Array storage (SAGA). A novel approximate aggregation

method for massive array data using novel bitmap indices is presented in Chapter 4.

Chapter 5 presents a bitmap-based subgroup discovery algorithm over scientific datasets.

Chapters 6 and 7 introduce two MapReduce-like frameworks that process scientific data in offline and in-situ manners. The former system SciMATE processes disk-resident data in multiple data formats, and the latter system Smart supports in-situ analytics in both time sharing and space sharing modes. Lastly, we conclude and propose our future work in

Chapter 8.

Chapter 2: Supporting a Light-Weight Data Management Layer over HDF5

In this chapter, we focus on managing scientific data in the conventional relational table

view, and we discuss an SQL implementation over HDF5, one of the most popular scientific

data formats. This implementation is a light-weight data management tool, which allows

server-side subsetting and aggregation on HDF5 data.

2.1 Background

This section provides background information on the HDF5 data format and its high-level libraries.

2.1.1 HDF5

HDF5 [3] is a widely used data model, library, and file format for storing and managing data. It supports a variety of data types, and is designed for flexible and efficient I/O and for high volume and complex data. HDF5 is portable and extensible, and HDF5 files are organized in a hierarchical structure, with two primary structures: groups and datasets. A group contains instances of zero or more groups or datasets, and a dataset is essentially a multidimensional array of data elements, where both of them can have associated metadata.

19 Additionally, HDF5 is designed to support any data type, so, besides a limited number of atomic datatypes, compound datatype is also supported in HDF5. A compound datatype may have an arbitrary number of data members in any order, and the data members may be of any datatype, including compound.

2.1.2 High-Level HDF5

The HDF5 high-level APIs [4] provide a set of functions built on top of the basic HDF5

Library. The purpose of the high-level API is two-fold: 1) to define functions that perform more complex operations in a single call, as compared to the basic HDF5 interface, and 2) to build sets of functions that provide standard entity definitions (like images or tables).

The HDF5 dimension scale is one of the important features here. An HDF5 dimension scale is an auxiliary dataset that is associated with a dimension of a primary dataset. It can provide coordinate system support for the primary dataset by constructing a mapping between dimension index values and the values of the dimension scale dataset. A common example is a 2-dimensional array with spatial information, such as latitude and longitude, associated with it.
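As a minimal sketch of how a dimension scale is created and attached using the HDF5 high-level library (the file name, dataset names, and coordinate values below are made up for illustration):

```cpp
#include "hdf5.h"
#include "hdf5_hl.h"

// Minimal sketch: create a 2-D "temperature" dataset and attach a "lat"
// dimension scale to its first dimension.
int main() {
  hid_t file = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

  // Primary dataset: 4 (lat) x 8 (lon) grid of temperatures.
  hsize_t dims[2] = {4, 8};
  hid_t space = H5Screate_simple(2, dims, NULL);
  hid_t dset = H5Dcreate2(file, "temperature", H5T_NATIVE_DOUBLE, space,
                          H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

  // Auxiliary dataset holding the latitude coordinate values.
  double lat[4] = {-45.0, -15.0, 15.0, 45.0};
  hsize_t latdims[1] = {4};
  hid_t latspace = H5Screate_simple(1, latdims, NULL);
  hid_t latdset = H5Dcreate2(file, "lat", H5T_NATIVE_DOUBLE, latspace,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
  H5Dwrite(latdset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, lat);

  // Mark "lat" as a dimension scale and attach it to dimension 0 of
  // "temperature", establishing the index-to-coordinate mapping.
  H5DSset_scale(latdset, "latitude");
  H5DSattach_scale(dset, latdset, 0);

  H5Dclose(latdset); H5Sclose(latspace);
  H5Dclose(dset); H5Sclose(space);
  H5Fclose(file);
  return 0;
}
```

Once attached, the latitude values provide the coordinate system against which coordinate-based queries (discussed in Section 2.2.1) are resolved.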

2.2 System Design

In this section, we discuss the design and implementation of our system.

2.2.1 Motivation

An HDF5 dataset is stored in a binary format. The advantage is that it supports self-describing, portable, and compact storage, as well as efficient I/O. At the same time, the disadvantage is that the API involves overly nitty-gritty details, which makes data processing much harder, forcing scientists who are interested in data subsetting to have a

very detailed understanding of the data layout before being able to extract the subset. In

addition, they also have to get familiar with the HDF5 libraries in order to write an

application that can extract the subset. For each different subsetting task, the scientists

may need to write a separate program.

To address these problems, our solution involves three ideas. The first idea is that a virtual relational table view can be supported over an HDF5 dataset, and standard SQL queries with SELECT, FROM, and WHERE clauses on such a view can provide a very convenient

yet powerful way to specify the data subsetting. The second idea is that data aggregation

and group-by, using the same abstractions, can be applied to extract data summaries, which

can minimize data transfer costs. The third idea is that sequential subsetting of data can be

prohibitively expensive with growing datasets, and therefore, it’s more desirable to exploit

parallelism to accelerate data subsetting and aggregation queries.

An HDF5 dataset consists of a set of multi-dimensional arrays. These arrays typically

involve spatial and/or temporal dimensions and, in many cases, corresponding dimension

scales as well. Scientists are usually interested in querying a subset of the data based on either dimension indices or the coordinate system expressed by dimension scales. Hence, we

can divide the queries into three categories: 1) Queries based on dimensional index value(s)

- i.e., those related to the dataset physical layout; 2) Queries based on coordinate value(s)

- Besides the dimension information, the scientists may also query data subsets based on

spatial or temporal values, which are usually stored in dimension scales; 3) Queries based

on data value(s) - The scientists could also query the target data value within a specific

value range. Note that, since compound datatypes are used in HDF5 datasets to describe objects with composite structure, queries based on the values of member(s) of compound data object(s) should also be supported.

Similarly, we can also categorize all the query conditions into three types: index-based,

coordinate-based, and content-based conditions. The goal of our system is to support any combination of the above three types of queries.

2.2.2 System Overview

Figure 2.1 gives an overview of the execution flow of processing a typical query using

our system. The purpose of our system is to translate an SQL query into efficient data

access code in HDF5 context and to complete the query by parallel access. After an SQL

query passes through an SQL parser, the SQL parser extracts all the paths (from the

FROM clause) of HDF5 files involved in that query, and then it sends this information to a

metadata generator. Afterwards, the metadata generator accesses all the required HDF5

files according to the paths retrieved by the SQL parser, and then it generates metadata at

runtime. The detailed information about the metadata and the metadata generation

strategy will be introduced in Section 2.3.

The SQL parser is used to generate a 2-dimensional query list. First, all the elementary

queries in the same AND clause will be organized in the same 1-dimensional query list.

Second, all these 1-dimensional lists will be grouped in an OR-logic and hence organized

in a 2-dimensional list. Our parser is implemented with certain modifications of the parser

from SQLite [12].
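
For illustration, the following sketch (a hypothetical in-memory representation, not the actual parser output format) shows how the condition of the example used later in Figure 2.3 can be organized as a 2-dimensional query list, with the inner lists combined by AND and the outer list combined by OR:

# WHERE (dim1 >= 300 AND time < 740000 OR layers > 60
#        OR rows > 20 AND cols < 200) AND salinity < 70
# normalized into OR-of-AND form; each tuple is
# (condition type, attribute, operator, value)
query_list_2d = [
    [("index", "dim1", ">=", 300),
     ("coordinate", "time", "<", 740000),
     ("content", "salinity", "<", 70)],
    [("coordinate", "layers", ">", 60),
     ("content", "salinity", "<", 70)],
    [("coordinate", "rows", ">", 20),
     ("coordinate", "cols", "<", 200),
     ("content", "salinity", "<", 70)],
]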

The hyperslab selector module mainly performs three tasks: First, for all the

coordinate values that appear in the queries, convert them into index values specific to the

physical layout, by retrieving corresponding dimension scales from the metadata. Second,

for each 1-dimensional query list organized in an AND-logic, combine all the elementary

queries into a single composite (and probably reduced) query. Third, with the data layout

Figure 2.1: Execution Overview. (The SQL query and the HDF5 file(s) are fed to the SQL parser and the metadata generator, which produce a 2D-query list and in-situ metadata; the hyperslab selector, query partition, query processor, and content-based filter then transform these into subqueries and local query results, which the query combination merges into the global query result.)

information provided by the metadata and the index value(s) provided in the query condition, each composite query can generate a hyperslab on the hyperspace. As a result, a collection of hyperslabs are selected, and a final 1-dimensional query list is generated.

The detailed information about the hyperslab selector will be discussed in Section 2.4.

The query partition module is used to divide each query in the 1-dimensional query list into multiple subqueries. Each subquery is a query with the same content-based condition as the one in the original query, yet responsible for a disjoint subset of the original hyperslab. Thus, each process executes one subquery, and hence collective I/O can be used to achieve better performance compared with independent I/O. The detailed information about the query partition will be discussed in Section 2.4.

With the hyperslab information generated during the previous phases, the query

processor is used to invoke native HDF5 calls to perform data accesses. After query

processing, each subquery retrieves the required data block, but all the content-based

conditions, which are dependent on data values, still remain to be checked. Content-based

filter then performs a full scan over the data block retrieved after query processing, to filter

out the data elements which cannot satisfy the condition. However, since an aggregation query also requires the same full scan, content-based filtering can actually be performed while processing aggregation queries. In this way, there will always be at most one full scan during the entire execution flow.

Finally, each process obtains a local query result. For aggregation queries, a query combination is needed to generate the global query result. For non-aggregation queries, the combination is considerably expensive because it requires massive data transfer.

Fortunately, most applications only require each process to output all local results in a specified order.

2.3 Metadata Generation Strategy

In this section, we discuss the design and implementation of the metadata generation strategy.

For each input query, a metadata generator is used to collect metadata at runtime, without requiring a loading process beforehand. The function of the metadata generator is to collect dataset physical storage information, dataset logical layout information, and

user metadata (i.e. intrinsic header metadata) stored in the header. The motivation for this design is three-fold: First, a large HDF5 dataset may be distributed among multiple files.

Query processing requires dataset physical storage information from each of these files,

and therefore, it is desirable to collect this information at one place. Second, certain

dataset logical layout information such as dataset dimensions and dimension scales is

often frequently required during query processing. Again, prefetching this information in

advance is desirable, so we can avoid repeated I/O requests during query processing.

Third, HDF5 datasets are organized in a hierarchical structure, so the user metadata may

be dispersed in separate header blocks for each object. Therefore, collecting the scattered

user metadata can also help reduce the header I/O overhead.

Figure 2.2: Metadata Structure. (The metadata consists of a physical storage descriptor, which maps each dataset to its file paths; logical layout descriptors, which record the dataset name, dataset ID, rank, dataspace, datatype, data class, and dimension scales such as time and cols; and user metadata descriptors, which record the dataset storage layout, link names, hierarchical structure, and min/max dataset values.)

As Figure 2.2 shows, the metadata structure consists of three components: 1) Physical

Storage Descriptor: It describes physical locations where each dataset is resident. All the paths of the required dataset files are specified in the FROM clause in an SQL query, thus

by recording the datasets contained in these files, each dataset involved in the query can be

mapped to a set of corresponding file paths. 2) Logical Layout Descriptor: It exposes the

low-level dataset layouts, including datatypes, dataspaces, and dimension scales.

Specifically, dimension scales, which are relatively small datasets, are fully loaded to

support the queries that are based on coordinate values. 3) User Metadata Descriptor: It

contains the descriptions about user metadata, which is optionally defined and provided by

the users. Some user-provided information such as minimum/maximum dataset values can

facilitate later processing such as hyperslab selection.
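
The following sketch (in Python with h5py; the file names are hypothetical, and only datasets in the root group are walked for brevity) illustrates the kind of runtime metadata collection described above: dataset-to-path mappings, logical layout information with fully loaded dimension scales, and optional user-provided min/max values stored as attributes:

import h5py

def generate_metadata(paths):
    metadata = {"physical": {}, "layout": {}, "user": {}}
    for path in paths:
        with h5py.File(path, "r") as f:
            for name, dset in f.items():     # only the root group, for brevity
                if not isinstance(dset, h5py.Dataset):
                    continue
                metadata["physical"].setdefault(name, []).append(path)
                metadata["layout"][name] = {
                    "rank": dset.ndim,
                    "shape": dset.shape,
                    "dtype": str(dset.dtype),
                    # dimension scales are small, so they are loaded fully
                    "scales": [dset.dims[i][0][...] if len(dset.dims[i]) > 0 else None
                               for i in range(dset.ndim)],
                }
                # optional user metadata, e.g., min/max values stored as attributes
                metadata["user"][name] = {key: dset.attrs[key]
                                          for key in ("min", "max") if key in dset.attrs}
    return metadata

meta = generate_metadata(["modb_part1.h5", "modb_part2.h5"])   # hypothetical paths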

2.4 Query Execution

In this section, we discuss our query execution methods, including hyperslab selection, content-based filtering, processing of aggregation queries, and parallelization of both selection and aggregation queries.

2.4.1 Hyperslab Selection and Content-Based Filtering

Recall that in Section 2.2, we identified three types of queries, which are queries based on dimensional index values, coordinate values, and data values, respectively. The standard HDF5 API can only support the first type of queries (based on dimensional index values). Even for these queries, one has to know the exact index range, which, in turn, requires a detailed understanding of the data layout. As the high-level HDF5 feature dimension scale becomes increasingly popular in real applications, a growing number of queries involve subsetting on spatial and/or temporal attributes, which are queries based on coordinate values. Currently, for such queries, users have to first query the dimension scales, manually determine each index range based on these values, and then they can perform a subsetting query with the index range. This requires complex programming

with detailed knowledge of the HDF5 APIs and their programming models, and even the query

execution time can be unnecessarily high.

Figure 2.3: Hyperslab Selector Module. (The example queries the salinity dataset, whose layout is dim1: time [0, 1023], dim2: cols [0, 166], dim3: rows [0, 62], and dim4: layers [0, 33]. The original SQL condition (dim1 >= 300 AND time < 740000 OR layers > 60 OR rows > 20 AND cols < 200) AND salinity < 70 passes through query parsing, query reduction, dimension scale mapping, AND-logic combination, and hyperslab generation, yielding the hyperslabs dim1[300, 1000); dim2[0, 166]; dim3[0, 62]; dim4[0, 33] and dim1[0, 1023]; dim2[0, 166]; dim3(21, 62]; dim4[0, 33], each with the content-based condition salinity < 70.)

In our system, we use hyperslab selector combined with content-based filter to both improve efficiency and support higher-level operators. Figure 2.3 shows an example of how the hyperslab selector module works. In the example, we queried a 4-dimensional dataset called salinity, where 4 dimension scales time, cols, rows and layers are associated with these 4 dimensions, respectively. In an SQL query, all the target dataset names are required to appear in the SELECT clause, and all the required HDF5 file names have to be specified in the FROM clause.

After the initial query parsing, the SQL parser generates a 2-dimensional query list based on the query condition. As we mentioned earlier in Section 2.2, for a 2-dimensional query list, the lower dimension is an AND-logic dimension, and the higher dimension is an OR-logic dimension. Any original SQL condition can be broken into multiple elementary queries, each of which is organized in this fashion, and each elementary query corresponds to only one condition type. For instance, for the first 1-dimensional query list after query parsing, dim1 refers to the index value of the first dimension, so that dim1>=300 is an index-based condition, time<740000 is a coordinate-based condition, and salinity<70 is a content-based condition.

In the next step, query reduction is performed to reduce the number of elementary queries. The goal of this phase is to avoid unnecessary or redundant data accesses in the query processing. We perform the query reduction mainly based on the following three query reduction rules: 1) If an elementary query condition is unconditionally false, then the 1-dimensional query list where this query condition belongs can be nullified; 2) If an elementary query condition is unconditionally true, then this single query condition in this dimension can be nullified; and 3) For any two elementary conditions in the same

AND-logic dimension, if either query range is entirely covered by the other, then the condition with the larger query range can be nullified. Since all the boundary values of indices and dimension scales have been stored in the metadata, we can apply the above rules without loading the dataset. In the example, since the maximum values of layers and cols are 33 and 166 respectively, the second 1-dimensional query list which contains layers>60 can be removed according to the first rule, and cols<200 can also be removed from the last 1-dimensional query list because of the second rule. Specifically, the third rule is often applied after the later dimension scale mapping phase. Assume that

the query condition time<740000 in the first 1-dimensional query list is time>740000 instead. Since time is associated with the first dimension, this query condition is equivalent to the index-based condition dim1>1000. Because this query

range is entirely covered by another directly-specified query range dim1>=300 in the same query list, we can nullify the larger query range dim1>=300 based on the third rule. Additionally, since the HDF5 library allows one dataset dimension to be associated with multiple dimension scales, a variant of the above case is a query condition that involves two different dimension scales that nevertheless correspond to the same dimension. Moreover, besides the boundary values of indices and dimension scales, if the minimum/maximum dataset values are provided in the original user metadata, then we can also apply the above rules, in this case using these boundary dataset values, which would have been stored as part of the metadata. As a result, certain content-based conditions can be nullified, and hence the workload of the later content-based filtering can be reduced.
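
The following sketch (reusing the hypothetical (type, name, op, value) condition tuples from the earlier sketch, with boundary values taken from the metadata; open versus closed interval endpoints are ignored for brevity, and the time and salinity bounds are placeholders) illustrates how the three reduction rules can be applied to the example of Figure 2.3:

import math

query_list_2d = [
    [("index", "dim1", ">=", 300), ("coordinate", "time", "<", 740000),
     ("content", "salinity", "<", 70)],
    [("coordinate", "layers", ">", 60), ("content", "salinity", "<", 70)],
    [("coordinate", "rows", ">", 20), ("coordinate", "cols", "<", 200),
     ("content", "salinity", "<", 70)],
]
bounds = {"dim1": (0, 1023), "time": (0, 1.0e7), "cols": (0, 166),
          "rows": (0, 62), "layers": (0, 33), "salinity": (-1e9, 1e9)}

def interval(cond):
    _, _, op, value = cond
    if op in (">", ">="):
        return (value, math.inf)
    if op in ("<", "<="):
        return (-math.inf, value)
    return (value, value)

def reduce_and_list(and_list):
    reduced = []
    for cond in and_list:
        lo, hi = interval(cond)
        bmin, bmax = bounds[cond[1]]
        if hi < bmin or lo > bmax:           # rule 1: condition can never hold
            return None                      # drop the entire AND-list
        if lo <= bmin and hi >= bmax:        # rule 2: condition always holds
            continue                         # drop just this condition
        # rule 3: drop the condition with the larger range on the same attribute
        if any(c is not cond and c[1] == cond[1] and
               lo <= interval(c)[0] and interval(c)[1] <= hi for c in and_list):
            continue
        reduced.append(cond)
    return reduced

# layers > 60 can never hold (max 33), and cols < 200 always holds (max 166)
reduced = [r for r in map(reduce_and_list, query_list_2d) if r is not None]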

During the dimension scale mapping phase, all the coordinate-based conditions are

converted into index-based conditions. In the example, we can see that the coordinate

value rows has been updated to its corresponding dimensional index value after this step.

Followed by this mapping, an AND-logic combination is performed to combine any two

joint query ranges in the same dimension into a single query range. As dim1>=300 and

dim1<1000 are combined together, only an intersection of multiple joint query ranges

remains.

In the hyperslab generation phase, all the index boundary values are added to the

existing query ranges. Consequently, the previous 2-dimensional query list is reduced into

a 1-dimensional query list, which is essentially a collection of hyperslabs with

content-based condition(s). With the generated hyperslabs, HDF5 APIs can be invoked to perform an initial subsetting.

Finally, for the content-based queries, a content-based filter is designed to handle the remaining content-based condition(s). The filter will scan the data subsets that correspond to the generated hyperslabs and extract the data elements that meet the remaining query constraints.
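
The following sketch (in Python with h5py and numpy; the file name is hypothetical, and the dimension sizes follow the example of Figure 2.3) illustrates reading one generated hyperslab and then applying the remaining content-based condition as a filter:

import h5py
import numpy as np

def read_hyperslab(path, dataset, index_ranges):
    """index_ranges holds one (start, stop) pair per dimension."""
    with h5py.File(path, "r") as f:
        slices = tuple(slice(lo, hi) for lo, hi in index_ranges)
        return f[dataset][slices]            # HDF5 reads only the selected block

# first hyperslab of the running example: dim1 in [300, 1000), other dims full
block = read_hyperslab("modb.h5", "salinity",
                       [(300, 1000), (0, 167), (0, 63), (0, 34)])

# remaining content-based condition: keep only elements with salinity < 70
filtered = block[block < 70.0]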

2.4.2 Parallelization of Query Execution

Figure 2.4: Parallelism of Query Execution. (a) High-Level Parallelism: the SQL query maps to multiple files and datasets, and each target dataset yields its own 1D query list. (b) Low-Level Parallelism: each 1D query list is broken into subqueries, each subquery is executed by processes P1-P4, and local combinations are followed by a global combination.

As shown in Figure 2.4, we exploit the parallelism of query execution mainly at two levels. On one hand, the high-level parallelism is mainly explored based on the physical

storage information provided by the metadata. For an SQL query, a FROM clause may

provide multiple file paths, and a SELECT clause may provide multiple target dataset names as well. With the information retrieved from the physical storage descriptor in the metadata, we identified all the target datasets in each file. After the hyperslab selection phase, as described earlier, a 1-dimensional query list is generated for each target dataset.

Since collective I/O operation is supported by HDF5 library, a group of processes can perform one query instance collectively, over different disjoint data subsets of the same

file.

On the other hand, the low-level parallelism is exploited via the query partition. For each 1-dimensional query list, we first break it into multiple subqueries, where each corresponds to one hyperslab. Afterwards, the query partition module works on each individual subquery serially. According to the number of processes in the running environment, the hyperslab associated with each subquery is divided into an equal number of disjoint partitions, and then each partition is assigned to one processor so that the subquery can be processed in parallel.

Given that different partitioning strategies can result in vastly different I/O performance, we use an optimized partitioning strategy in the query partition module. Generally, we perform partitioning along the highest dimension as much as possible. In this way, the original data contiguity is preserved to the largest extent, and the resulting contiguous data accesses lead to better I/O performance.
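
The following sketch (a hypothetical helper, reusing the (start, stop) per-dimension hyperslab representation of the earlier sketch) illustrates partitioning a subquery's hyperslab along the highest dimension so that each process reads a contiguous block:

def partition_hyperslab(index_ranges, num_procs):
    """Split the hyperslab along its highest (slowest-varying) dimension."""
    start, stop = index_ranges[0]
    chunk = (stop - start + num_procs - 1) // num_procs   # ceiling division
    partitions = []
    for rank in range(num_procs):
        lo = start + rank * chunk
        hi = min(lo + chunk, stop)
        if lo >= hi:
            break                            # more processes than slices
        partitions.append([(lo, hi)] + list(index_ranges[1:]))
    return partitions

# splitting the hyperslab of the previous sketch across 4 processes:
parts = partition_hyperslab([(300, 1000), (0, 167), (0, 63), (0, 34)], 4)
# parts[0] == [(300, 475), (0, 167), (0, 63), (0, 34)], and so on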

After the partitioning, each processor invokes HDF5 APIs via the query processor to load the corresponding partition. If any content-based condition is attached to the subquery, then the content-based filter will be launched during the partition loading. Followed by the partition loading and possible content-based filtering, if data aggregation is involved

in the query, then a local combination will be performed to obtain the aggregation result

from a subquery. Since all the subqueries are organized in an OR-logic, finally a global

combination is performed to acquire the union set of all the local combination results.

2.4.3 Aggregation Queries

Currently, HDF5 libraries do not support data aggregation. However, in many user scenarios, this functionality is highly desirable. Scientists currently have to perform an aggregation by writing their own code, which can be quite challenging. Furthermore, the data volume before aggregation may be much larger than that after aggregation. Therefore, in a wide-area environment, the current approach often leads to a much higher volume of data transfer.

Our system addresses this problem by proposing two-phase data aggregations, i.e., local aggregation and global aggregation. Our implementation works as follows. The master process first takes a query as input and checks if this query has the data aggregation requirement. The input query is partitioned into a collection of subqueries which are then assigned to different slave processes. Each process executes a subquery and then stores the local aggregation results in a sequence of buckets. Finally, a global aggregation is performed by the master process.

Algorithm 1 shows the pseudo-code of local aggregation. Each subquery comprises hyperslab, content-basedCondition, groupbyIndex and aggregationType. Within each hyperslab, different aggregationGroups are formed based on the different index values in the group-by dimension. By calling different aggregation functions with respect to different aggregationTypes, each slave process performs aggregation over all the aggregationGroups within the corresponding hyperslab serially. In the meanwhile,

Algorithm 1: localAggregation(hyperslab, content-basedCondition, groupbyIndex, aggregationType)
1: bucketNum ← hyperslab.dim[groupbyIndex].length
2: allocate a 1-dimensional buffer results based on bucketNum
3: for each indexVal in hyperslab.dim[groupbyIndex] do
4:   count ← 0 {count is used for computing the average}
5:   for each element in aggregationGroup[indexVal] do
6:     if element satisfies content-basedCondition then
7:       count ← count + 1
8:       if aggregationType = SUM or AVG then
9:         results[indexVal].data ← results[indexVal].data + element
10:        {average is computed in the global aggregation phase}
11:      else
12:        code to process other aggregation types
13:      end if
14:    end if
15:  end for
16:  results[indexVal].count ← count
17: end for
18: return results

content-based filtering is also launched during the same iteration according to the

content-basedCondition associated with each subquery. Consequently, each partial

aggregation result will be stored in a corresponding bucket in results.

After the local aggregation, each slave process will send the partial aggregation results to the master process, which then will perform a global aggregation. Additional merge information is needed for the global aggregation, and this information is generated by the query partition module before dispatching all the subqueries.
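
The following sketch (in Python with mpi4py and numpy; load_partition and partition_offset are hypothetical stand-ins for the query partition module and its merge information) illustrates the two-phase aggregation for SUM/COUNT/AVG with a content-based filtering condition:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def local_aggregation(block, groupby_axis, condition):
    """One (sum, count) bucket per index value along the group-by dimension."""
    buckets = []
    for idx in range(block.shape[groupby_axis]):
        group = np.take(block, idx, axis=groupby_axis)
        selected = group[condition(group)]           # content-based filtering
        buckets.append((float(selected.sum()), int(selected.size)))
    return buckets

block = load_partition(rank)           # hypothetical: this rank's hyperslab partition
group_start = partition_offset(rank)   # hypothetical: merge information from the query partition
local = local_aggregation(block, groupby_axis=0, condition=lambda v: v < 70.0)

# global aggregation: the master merges buckets by global group index
tagged = [(group_start + i, s, c) for i, (s, c) in enumerate(local)]
gathered = comm.gather(tagged, root=0)
if rank == 0:
    merged = {}
    for buckets in gathered:
        for g, s, c in buckets:
            total_s, total_c = merged.get(g, (0.0, 0))
            merged[g] = (total_s + s, total_c + c)
    averages = {g: s / c for g, (s, c) in merged.items() if c > 0}   # AVG is finished here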

2.5 Experimental Results

In this section, we evaluate the functionality and scalability of our system on a cluster

of multi-core machines. We designed the experiments with the following goals: 1) to

compare the functionality and performance of our system with OPeNDAP, a scientific

data management system widely used to support data dissemination for environmental and oceanographic datasets [62], 2) to evaluate the parallel scalability of our system, for which we measure the speedups obtained with different number of nodes and for different types of queries, and 3) to demonstrate how data aggregation queries reduce the data transfer cost.

Our experimental datasets were generated from a real database available to the scientific community, the Mediterranean Oceanic Data Base (MODB, http://modb.oce.ulg.ac.be/). The MODB data mainly consists of two separate datasets, salinity and temperature, which were generated from a simulation for a 34-layer space in the Mediterranean sea. A sample data file available for download (http://www.space-research.org/save hdf5.htm) is modeled by 167 columns and 63 rows. Because we did not have access to the real database, we extrapolated this data by extending a time dimension to create larger datasets. To demonstrate that our system can also support compound datatypes, we created a compound dataset cell, which comprises both salinity and temperature in the same simulation space. HDF5 version 1.8.7 was used for all our experiments.

2.5.1 Sequential Comparison with OPeNDAP

OPeNDAP provides data virtualization through a data access protocol and data translation mechanism. OPeNDAP supports data retrieval in various formats, including

ASCII, DODS binary objects, and various other common formats. Specifically, an HDF5-OPeNDAP data handler component was developed so that general HDF5 data can be served via the OPeNDAP software framework, and we used this handler for our comparisons.


OPeNDAP has several limitations. First, in order to hide the data formats and data

processing details, OPeNDAP requires translating the original data format into a standard

OPeNDAP data format, which leads to undesirable overheads. Besides, neither relational

constraint expressions (query based on dimension scales or data values) nor data

aggregation is supported in OPeNDAP. Note that our system can execute data queries in

parallel, whereas OPeNDAP can only support sequential query execution. To make a fair

comparison, we experimented with a sequential version of our system. The comparison

between OPeNDAP and our sequential system was performed on a machine with 8 GB of

memory, with Intel(R) Core(TM) i7-2600K 3.40 GHz CPU. The size of each data file was

4 GB.

As we mentioned in Section 2.2, there are three different types of queries: queries based on dimensional index values (type 1), queries based on coordinate values (type 2), and queries based on data values (type 3). Since OPeNDAP can only support queries based on dimensional index values for arrays and grids (i.e. the type 1 queries), we designed

separate experiments for type 1 queries and type 2 and 3 queries.

Figure 2.5: Performance Comparison for Type 1 Queries. (Execution times in seconds for the baseline, our sequential system, and OPeNDAP, grouped by query coverage: <20%, 20%-40%, 40%-60%, 60%-80%, and >80%.)

In the first experiment, we only considered type 1 queries and compared our system with OPeNDAP. We generated 100 type 1 queries, which included subsetting based on different dimensions. As Figure 2.5 shows, the results were classified into 5 groups based on different proportions (<20%, 20%-40%, 40%-60%, 60%-80%, >80%) of data subset coverage.

The first bar in each group shows the execution time of the intrinsic HDF5 query function, without any additional data processing time. The second bar shows the total execution time of our sequential system. For type 1 queries, since there is no dimension scale mapping or content-based filtering cost, the extra processing overhead is mainly brought by SQL parsing, metadata generation, and hyperslab selection, and these overheads are all quite small. We can see that the total sequential processing time for type 1 queries is almost indistinguishable from the baseline, i.e., the intrinsic HDF5 query function. The third bar shows the total execution time of OPeNDAP. We can see that OPeNDAP is consistently much slower, and moreover, scales poorly as more data has to be output, mainly due to the additional data translation overhead.

The next experiment involved type 2 and type 3 queries. Since OPeNDAP is unable to support relational constraint expressions for arrays and grids, the client has to download the entire dataset from the server first and then apply its own filter to generate data subset results. Thus, we implemented a client-side filter for OPeNDAP and then used this modified version to compare with our system. In this experiment, we generated 100 queries, including ones that combine type 2 and type 3 queries. Table 2.1 presents several examples of type 2 and type 3 queries.

Figure 2.6 shows the execution time over different queries. The first bar in each group is the baseline of our current experiment, which contains two sub-parts, execution time

Table 2.1: Type 2 and Type 3 Query Examples

ID    Query Example
SQL1  SELECT salinity FROM MODB WHERE time>589530 AND time<995760 AND salinity>32.4;
SQL2  SELECT temperature FROM MODB WHERE layers<=20 OR layers>=30 AND temperature<10.0;
SQL3  SELECT salinity, temperature FROM MODB WHERE (rows>=15 OR rows<50) AND (cols<120 OR cols>=40);
SQL4  SELECT cell FROM MODB WHERE time<589530;

Figure 2.6: Performance Comparison for Type 2 and Type 3 Queries. (Execution times in seconds for the baseline, our sequential system, and OPeNDAP, grouped by query coverage.)

of the intrinsic HDF5 query function, and the content-based filtering time. Since the HDF5 library does not support subsetting based on data values, content-based filtering requires a full scan over the hyperslabs selected by the hyperslab selector. This baseline execution time is proportional to the amount of the data subset. The second bar represents the execution time of our system. For type 2 and type 3 queries, we can see that it takes almost the same amount of time to perform hyperslab selection compared with the baseline, because the overhead of the additional dimension scale mapping is still quite small. The third bar

shows the execution time of OPeNDAP. Without hyperslab selection support for subsetting,

OPeNDAP has to scan the entire dataset to perform content-based filtering. Thus, both the

data query cost and the data filtering cost are much higher than those of our system.

2.5.2 Parallelization of Subsetting Queries

As we have stated earlier, besides better performance and expressibility of our system over OPeNDAP, one of our advantages is the support for parallelization. In this subsection, we evaluate the performance improvements through parallelization of data subsetting operations. Our experiments were conducted on a cluster of multi-core machines. The system uses AMD Opteron(TM) Processor 8218 with 4 dual-core CPUs (8 cores in all). The clock frequency of each core is 2.6 GHz, and the system has a 12 GB memory. We have used up to 128 cores (16 nodes) for our study.

Figure 2.7: Parallel Query Processing Times with Different Dataset Sizes. (Execution times in seconds on 1, 2, 4, 8, and 16 nodes for dataset sizes of 4 GB, 8 GB, 16 GB, and 32 GB.)

The first experiment evaluates the performance of parallel subsetting with different

dataset sizes. In this experiment, the input queries simply fully scan the datasets. The

experimental data sizes vary from 4 GB to 32 GB, while the number of nodes used for

parallel subsetting varies from 1 to 16. Figure 2.7 shows the results as we scale the

number of nodes. We can see that our system is capable of scaling the performance of

processing such queries.

Figure 2.8: Parallel Query Processing Times with Different Queries. (Execution times in seconds on 1, 2, 4, 8, and 16 nodes, grouped by query coverage.)

In the second experiment, we evaluated the performance of parallel subsetting. With the three different types of query conditions, we generated 300 queries which cover different proportions of the dataset based on different dimensions. Figure 2.8 presents the results, which are divided into 5 coverage proportion groups. The experimental data sizes were 16

GB, and the number of nodes used for parallel subsetting varied from 1 to 16. We can see that with our optimized partitioning approach, which leads to a better data contiguity of partitions, the parallelization time for different queries is approximately proportional to the amount of subsetting data.

Since the previous experiments did not involve any type 3 query on a compound dataset, we conducted the third experiment to compare the performance of content-filtering over atomic datasets against compound datasets. In order to highlight the performance gap between the type 3 queries on atomic datasets and on the compound dataset, we generated 100 queries which perform a pure content-filtering over the compound dataset cell. Figure 2.9 presents the results as we scaled the number of nodes from 1 to 16 for 16 GB datasets. We can see that the overhead of content-filtering over the compound dataset is approximately twice as high as that over atomic datasets. This is because, although content-filtering is performed based on the value of a single data member within a compound data element, the entire compound data element has to be loaded first. By contrast, content-filtering over atomic datasets can avoid the cost of loading query-unrelated data (i.e., other data member(s) within a compound data element), since different atomic data is stored in separate datasets. The reason why the overhead ratio is close to 2 is that, in our scenario, the size of the experimental compound datatype happens to be twice as large as that of either experimental atomic datatype. We believe that if the compound datatype consists of more data members, or the data member sizes become larger, the performance gap would become correspondingly larger.

Figure 2.9: Performance Comparison of Content-Filtering over Atomic Datasets and Compound Datasets. (Execution times in seconds on 1, 2, 4, 8, and 16 nodes.)

2.5.3 Sequential and Parallel Performance of Aggregation Queries

Table 2.2: Aggregation Query Examples

Type  Query Example
AG1   SELECT AVG(temperature), AVG(salinity) FROM MODB;
AG2   SELECT COUNT(temperature) FROM MODB GROUP BY time HAVING time>=319288;
AG3   SELECT MAX(temperature), MIN(temperature), MAX(salinity), MIN(salinity) FROM MODB GROUP BY layer;

This subsection is designed to compare both the data transfer cost and the data

aggregation performance between OPeNDAP and our system, by performing different

types of data aggregation. Recall that OPeNDAP does not support data aggregation. Thus,

to perform data aggregations with OPeNDAP, users have to first download all the data

from the server side and then develop their own code to perform the aggregation at the client

side. In contrast, our system supports parallel server-side data aggregation which can lead

to a much smaller cost for both data transfer and data aggregation.

Table 2.2 shows aggregation types and examples used in our evaluation. We generated

100 aggregation queries and categorized them into three types: 1) AG1: query only

includes aggregations; 2) AG2: query includes both aggregations and GROUP BY with HAVING-based subsetting; and 3) AG3: query includes both aggregations and GROUP BY without HAVING-based subsetting. To highlight the efficiency improvement of parallel data aggregation, we did not include any WHERE clause in the queries in this experiment. The sizes of the experimental datasets were 16 GB.

Table 2.3: Data Transfer Volume Comparison against OPeNDAP

Aggregation Type   Avg. Data Transfer Volume (bytes)
                   OPeNDAP            Our System
AG1                15,979,863,952     64
AG2                15,979,863,952     188
AG3                15,979,863,952     2,040

One advantage of our parallel server-side data aggregation is the significant reduction

in data transfer volume. This advantage is very likely to bring about considerable

performance improvements when the server and the client need to communicate over the

wide area network. Table 2.3 compares the total data volume that is transferred over the

network between OPeNDAP and our system. We can see that, regardless of aggregation

type, the entire dataset has to be transferred over the network before OPeNDAP performs

data aggregation. On the other hand, in our system, AG1 has the smallest data transfer

cost. This is because, as no GROUP BY clause is involved, only a single grand aggregation result needs to be transferred. Moreover, AG2 has a smaller data transfer volume than AG3, because AG2 involves a HAVING clause that reduces the number of aggregation groups, and hence fewer aggregation results are transferred. To conclude, for all three aggregation types, our system demonstrated a tremendously smaller data transfer cost than OPeNDAP.

Besides the performance improvement in data transfer, our method can lead to a much smaller data aggregation cost. Our last experiment compared the data aggregation performance of OPeNDAP with our parallel system. We used up to 16 nodes to perform different types of aggregation queries over 16 GB datasets. The results presented in

Figure 2.10: Comparison of Data Aggregation Times. (Execution times in seconds for aggregation types AG1, AG2, and AG3, comparing OPeNDAP with our system on 2, 4, 8, and 16 nodes.)

Figure 2.10 demonstrate a good speedup of our parallel aggregation approach over OPeNDAP's sequential aggregation, and good scalability as the number of nodes increases. The relative speedups reach up to 8.13, 7.35, and 6.55 for the three types of aggregation queries, respectively.

2.6 Related Work

The topics of scientific data management have attracted a lot of attention in recent years.

Because of the large volume of work in this area, we concentrate on work specific to the tools or approaches that can be used with scientific data formats like HDF5, and then we provide a brief overview of the most significant other work.

OPeNDAP [62] supports data virtualization via a data access protocol and data representation. We have conducted extensive comparison of functionality and performance between our system and OPeNDAP. Barrodale Computing Services

(BCS) [17] has developed a plug-in called the Universal File Interface (UFI) [13], which establishes a virtual table interface to allow Informix DBMS to transparently access HDF5

files as if they were relational tables. However, it can support neither data aggregation nor

parallel query processing. The NetCDF Operator (NCO) library [9] and its parallel

implementation [223] have been extended to support parallel query processing over

NetCDF files. The queries supported by this tool have to be expressed in very specialized

scripting languages, which unfortunately severely undermines their ease-of-use.

MLOC [76] can optimize the storage layout of scientific data and hence facilitate queries

based on coordinates, data values, and even precision. Besides that, it can support precision-based queries, which can lead to reduced I/O and computation overheads if a lower resolution of data is required. However, it requires an upfront data transformation, which may be prohibitively expensive, while our approach can operate directly on the original dataset. SciHadoop [42] and SciMATE [237] have integrated map-reduce and its variants with scientific data libraries to enable map-reduce tasks over scientific data. While

map-reduce can be used to implement selection and aggregation queries, SQL queries are

simpler to write.

On the other hand, besides subsetting and aggregation, more well-known database techniques have been applied to scientific data management. Scientific Data Manager

(SDM) employs the Meta-data Management System (MDMS) [166] and provides a high-level, user-friendly interface which interacts with a database, to hide low-level parallel I/O operations for complex scientific processing. In comparison, our approach is more light-weight, and it can support more HDF5 features such as dimension scales, compound datatypes, and user metadata. FastQuery [107] utilizes parallel bitmap indexing to accelerate searches on scientific data files. By contrast, we focus on supporting various types of queries over scientific datasets with a standard high-level interface, whereas their work neither provides a high-level API nor hides the indexing and NetCDF details from users. In the future, we will extend our work by incorporating more database techniques

such as indexing.

Scientific data management has drawn much attention lately. Several groups have articulated the requirements in this area [19, 79, 163]. Recently, there has been a significant interest in extending (relational) database technology to support the needs of

(extreme-scale) scientific data [207], leading to multiple Extreme Scale Databases

Workshops (XLDB). The requirements arising from these efforts have been summarized in a position paper by Stonebraker [199], and now, an initial version of SciDB is available. While our approach focuses on providing database-like support to the users (i.e., user-defined subsetting and aggregations), the key difference in our approach is that the data is kept in its native form (e.g., flat files, HDF5, or NetCDF), completely eliminating the need for loading the dataset into a database system. Our approach also provides a light-weight solution, which can be made available to different application domains more easily. This work can be viewed as an implementation of the more general data virtualization approach introduced in earlier work [238]. More recently, we reported a similar implementation for NetCDF [200]. The specific contributions of this work are in handling HDF5 features, like the more distributed metadata and the notion of hyperslabs. The design reported in this work can also optimize more complex queries.

2.7 Summary

This work describes the implementation of a light-weight data management approach for scientific datasets stored in HDF5, which is one of the most popular scientific data formats. Our tool supports parallel data subsetting and aggregation in a virtual data view that is specified by SQL. Besides supporting selection over dimensions, which is also

directly supported by the HDF5 API, our system also supports queries based on dimension scales and/or data values. We have extensively evaluated our implementation and compared its performance and functionality against OPeNDAP. We demonstrate that even for the queries that are directly supported in OPeNDAP, the sequential performance of our system is better. In addition, our system is capable of supporting a larger variety of queries, scaling performance by parallelizing the queries, and reducing wide area data transfers through server-side data aggregation.

Chapter 3: SAGA: Array Storage as a DB with Support for Structural Aggregations

In contrast to the previous chapter, we realize that it is more natural to manage array-

based scientific data in an array-view rather than a relational table view. Particularly, we

focus on implementing a number of structural aggregations, which are unique in array data

model. All the operations are still supported in a light-weight manner - over native array

storage.

3.1 Structural Aggregations

Figure 3.1: Structural Aggregation Examples. (a) Grid Aggregation: a 4 x 4 array divided into disjoint 2 x 2 grids. (b) Sliding Aggregation: a 3 x 3 grid sliding over a 4 x 4 array. (c) Hierarchical Aggregation: concentric 2 x 2, 4 x 4, and 6 x 6 grids in a 6 x 6 array. (d) Circular Aggregation: the corresponding concentric but disjoint circles.

This section lists different types of structural aggregations that arise in scientific applications and are supported in our system.

Table 3.1: Structural Aggregation APIs and Mathematical Definitions

API and mathematical definition (for the aggregation operator SUM in a 2-dimensional space):

gridAgg(array slab S, grid size g): $\forall (x, y) \in S$ with $(x - x_0) \bmod g_x = 0$ and $(y - y_0) \bmod g_y = 0$: $\sum_{i=x}^{x+g_x-1} \sum_{j=y}^{y+g_y-1} S(i, j)$

slidingAgg(array slab S, grid size g, stride s): $\forall (x, y) \in S$ with $(x - x_0) \bmod s_x = 0$ and $(y - y_0) \bmod s_y = 0$: $\sum_{i=x-\frac{g_x}{2}}^{x+\frac{g_x}{2}} \sum_{j=y-\frac{g_y}{2}}^{y+\frac{g_y}{2}} S(i, j)$

hierarchicalAgg(array slab S, initial radius r, step size s): $\forall (x, y) \in S$ with $\frac{x_c - r_x - x}{s_x} = \frac{y_c - r_y - y}{s_y} = n,\ n \ge 0$: $\sum_{i=x}^{x+2(r_x+n s_x)} \sum_{j=y}^{y+2(r_y+n s_y)} S(i, j)$

circularAgg(array slab S, initial radius r, step size s): $\forall (x, y) \in S$ with $\frac{x_c - r_x - x}{s_x} = \frac{y_c - r_y - y}{s_y} = n,\ n \ge 1$: $\sum_{i=x}^{x+2(r_x+n s_x)} \sum_{j=y}^{y+2(r_y+n s_y)} S(i, j) - \sum_{i=x}^{x+2(r_x+(n-1) s_x)} \sum_{j=y}^{y+2(r_y+(n-1) s_y)} S(i, j)$

3.1.1 Background: Structural Groupings and Structural Aggregations

Grouping has been one of the most important operations provided by databases, most often specified by a GROUP BY clause in standard SQL. Additionally, aggregation functions and corresponding aggregation filters can be added to the grouping. The most common grouping used in a database management system is value-based grouping, where elements of the same value comprise one group. Over the years, SQL extensions, particularly SQL:2003, introduced structural grouping, where elements are mapped to a group based on positional relationships. It turns out that such groupings are very common when arrays are the main structure. For example, calculating a simple moving average over array elements involves a positional relationship (i.e., a set of adjacent values).

In this work, we refer to aggregations that are based on any structural or positional grouping as structural aggregations - they include not only traditional aggregates such as

COUNT, SUM, AVG, as well as MIN and MAX, but also user-defined aggregation functions for domain-specific analysis. For our presentation, we consider structural aggregations of three types, which are the grid, sliding, and hierarchical (or circular) aggregations, described in the following paragraphs.

The structural aggregations we consider process a rectilinear section of any size and dimensionality from an array (or an array slab). Unlike the aggregations based on array access patterns, which can load multiple array subareas into one group and then return a single aggregation result, structural aggregations divide an array slab into multiple and possibly overlapping groups and then return multiple corresponding aggregation results at one time. Structural aggregations can also be combined with value-based filtering conditions, as we will show through an example later.

Figure 3.1 shows the structural aggregation examples graphically, while the APIs and mathematical definitions are summarized in Table 3.1. This description assumes that we have a 2D array slab S, where the upper left element, the centroid, and the number of

elements are denoted as (x0,y0), (xc,yc), and |S|, respectively. For simplicity, we assume

that the aggregation operator is SUM.

3.1.2 Grid Aggregation

Grid aggregation is widely used in many scientific applications. It involves dividing an

array slab into multiple disjoint smaller blocks and then aggregating values within each

smaller block. For example, in astrophysics, vast astronomy events are stored in massive

multi-dimensional arrays. The data is often broken into smaller disjoint spatial grids of

equal grid size, and a corresponding multi-dimensional histogram can be produced by

binning the events over those spatial grids [116]. In Figure 3.1(a), a 4 × 4 array is split into

4 non-overlapping 2 × 2 grids, and then values within each smaller grid are aggregated.
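
As a concrete illustration, the following sketch (in Python with numpy) computes the grid aggregation of Figure 3.1(a) by reshaping a 4 × 4 array into disjoint 2 × 2 blocks and summing each block:

import numpy as np

array = np.arange(16, dtype=float).reshape(4, 4)
gx, gy = 2, 2
# reshape into (rows of grids, gx, columns of grids, gy) and sum each block
grid_sums = array.reshape(4 // gx, gx, 4 // gy, gy).sum(axis=(1, 3))
# grid_sums has shape (2, 2): one SUM per disjoint 2 x 2 grid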

3.1.3 Sliding Aggregation

Sliding aggregation calculates a sequence of aggregates within an array slab through a sliding grid of a fixed grid size. The grid moves from a starting element to an ending

element with a stride, where the stride value is 1 by default. Sliding aggregation also arises in a number of scientific domains. For example, in earth sciences, it is common to apply certain image denoising algorithms, such as non-local means (NL-means), to preprocess satellite imagery data [40,249]. More specifically, a Gaussian kernel function is applied on a sliding window to smooth out the outliers. As Figure 3.1(b) illustrates, a 3 × 3 grid slides within a 4 × 4 array in row-major fashion.
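
The following sketch (in Python with numpy 1.20 or later, which provides sliding_window_view) computes the sliding aggregation of Figure 3.1(b) with a stride of 1:

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

array = np.arange(16, dtype=float).reshape(4, 4)
windows = sliding_window_view(array, (3, 3))   # shape (2, 2, 3, 3)
sliding_sums = windows.sum(axis=(2, 3))        # one SUM per grid position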

3.1.4 Hierarchical and Circular Aggregations

In space sciences, scientists may be interested in looking into the gradual influence of radiation from a source, which may be a pollution source or an explosion location, over its adjacent region. A set of regions of increasing radii (or sizes) are analyzed for this purpose.

For hierarchical aggregation, the centroids of all the grids are the same as the centroid of the entire array slab, and each grid is entirely covered by a concentric outer grid. The radius of the innermost grid is specified by an initial radius. For the other grids, the radius increases by a fixed step size, until it reaches the array slab boundary. In Figure 3.1(c), three concentric grids are aggregated in a 6 × 6 array: the innermost one is a 2 × 2 grid, the middle one is a 4 × 4 grid, and the outermost one is the entire array.

A variant of such hierarchical aggregation is circular aggregation, which calculates a sequence of aggregates of concentric but disjoint circles instead of regularly shaped grids.

Figure 3.1(d) illustrates a circular aggregation that corresponds to the hierarchical aggregation shown in Figure 3.1(c).

There is a difference between grid aggregation and three other aggregations we have introduced so far. Particularly, we can also refer to grid aggregation as non-overlapping aggregation, since each input value is used towards aggregating a disjoint grid. In

comparison, sliding and hierarchical aggregations are all forms of overlapping aggregation. For all practical purposes, a circular aggregation is also an overlapping aggregation, since no library for scientific arrays provides an interface to exactly load a circle of array elements at a time.
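
The following sketch (in Python with numpy; the innermost grid is treated as the first circle) computes both the hierarchical and the circular aggregates for the 6 × 6 array of Figures 3.1(c) and 3.1(d), with an initial radius of 1 and a step size of 1:

import numpy as np

array = np.arange(36, dtype=float).reshape(6, 6)
center = (array.shape[0] // 2, array.shape[1] // 2)   # centroid of the slab
radius, step = 1, 1

hierarchical, circular = [], []
r = radius
while 2 * r <= array.shape[0]:
    x0, y0 = center[0] - r, center[1] - r
    grid_sum = float(array[x0:x0 + 2 * r, y0:y0 + 2 * r].sum())
    hierarchical.append(grid_sum)                     # concentric 2r x 2r grid
    inner = hierarchical[-2] if len(hierarchical) > 1 else 0.0
    circular.append(grid_sum - inner)                 # current grid minus inner grid
    r += step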

3.2 Grid Aggregation Algorithms

We now focus on parallel algorithms for performing grid aggregations. We assume that the array is stored in a shared file system from which processes on different nodes and cores can access this data. This model matches the configuration of many high performance systems today, and also supports parallelization across cores within a node and nodes in a cluster/cloud in a uniform fashion. As stated previously, we also assume that the array has been stored in a native form that is not controlled by our system. This leads to unique challenges in our work, since a system such as SciDB that has a data ingestion phase reorganizes the array during such a phase, and therefore, the algorithms involved can be quite different.

Given an array slab, it is relatively easy to implement sequential grid aggregation, but efficient parallelization involves several challenges. To understand possible challenges, we

first consider a simple approach for parallelization.

3.2.1 Outline of a Simple Parallel Grid Aggregation Algorithm

The outline of the parallel grid aggregation method is shown as Algorithm 2.

According to the grid size specified by the user, a set of grids are initialized and grids within this set are then partitioned. Note that a single grid may be within one partition or may be further divided, with sub-grids assigned to different partitions (or processors), since such partitioning is not performed specific to a particular query. There are two phases after the partitioning, i.e., a local aggregation phase and a global aggregation

phase. In the local aggregation phase, each processor is responsible for aggregating each

grid in its own partition. In the global aggregation phase, for those grids distributed among

different partitions, one of the processors serves as a master processor and merges the local aggregates of all partial grids.

Algorithm 2: GridAgg(slab S, num procs n)
1: Initialize m grids G and m corresponding aggregates A
2: Divide S into n partitions (set P)
3: Each grid Gi (i = 1, ..., m) is assigned to one or more partitions
4: Each processor loads its partition Pj (j = 1, ..., n)
5: for each (possibly partial) grid G'i in Pj do
6:   Ai ← aggregate(G'i) {local aggregation}
7:   if G'i ≠ Gi then
8:     Merge Ai of all partial grids {global aggregation}
9:   end if
10: end for
11: return A

This simple algorithm has multiple advantages. First, by ensuring that each element is read by only one process, we avoid any redundant reads or computation. Second, workload distribution is static, and therefore, there is no runtime overhead of assigning work to processors.

3.2.2 Challenges and Different Partitioning Strategies

The challenges in executing the algorithm effectively and achieving high parallel efficiency are: 1) how to ensure that the set of partitions to be processed by a particular process are physically contiguous on the disks, so as to minimize disk seek times, 2) how to balance the workload across processes when the datasets are skewed - particularly when a value-based filtering condition is added to a compute-intensive aggregation (as

explained further below), and 3) how to minimize the communication cost when

calculating global aggregates. The specific importance of each depends upon the nature of

the queries.

Many aggregates of interest for scientific applications are domain-specific, user-defined, and compute-intensive (possibly even quadratic or cubic with respect to the grid size). Now, to further explain the second challenge we listed, consider the following query: suppose the aggregation involves computations over all the elements that are above a certain threshold (or not equal to a pre-defined missing value) within each grid. The use of this threshold (or excluding missing values) is what we also referred to as a value-based

filtering condition above. Because of the distribution of values across different partitions, the number of arithmetic operations may vary significantly.

Figure 3.2: Examples of Different Partitioning Strategies. (a) Coarse-Grained Partitioning Example: 10 grids assigned to processors P1-P4 as contiguous runs, with grids 3 and 8 split between two processors. (b) Fine-Grained Partitioning Example: the elements of every grid distributed across all four processors. (c) Hybrid Partitioning Example: grids assigned to processors in a round-robin fashion, with the two extra grids partitioned in a coarse-grained manner.

To address the above challenges, we have developed different partitioning strategies, each of which addresses one or more of the challenges.

Coarse-Grained Partitioning: Coarse-grained partitioning divides the given array slab into multiple equal-sized partitions, each of which comprises contiguous grids. As an example, see Figure 3.2(a), where 10 grids in a 2D array slab are partitioned for 4 processors, with grid 3 evenly distributed to processors 1 and 2, and grid 8 to processors 3 and 4. Broadly, the array slab is evenly partitioned, using the highest dimension (e.g., rows when the layout is row-major), so that the grids are contiguous. If the number of grids is divisible by the number of processors, then all the aggregates can be calculated locally.

Otherwise, any grid that contains the partition boundary will be covered by two adjacent partitions, resulting in an extra (parallel) merge operation over each two local partial aggregates.

To summarize, this approach gives good I/O performance and low communication cost.

On the negative side, in the presence of data skew, this approach does not provide load balance.

Fine-Grained Partitioning: Fine-grained partitioning evenly distributes elements within each grid from the given array slab across all processors. An example can be seen from

Figure 3.2(b), where a 2D array slab that consists of 10 grids is partitioned across 4 processors. Each processor calculates a partial aggregate for each grid during the local aggregation, and then the global aggregation phase involves an all-to-one or all-to-all reduction over all the local aggregates.

As the example shows, the partitioned data, although contiguous within the same row of grids, is scattered among different rows. Thus, the I/O performance will not be as good as with the coarse-grained partitioning, and the communication cost can also be higher.

54 However, a fine-grained distribution is likely to handle skew better than a coarse-grained distribution.

Hybrid Partitioning: Hybrid partitioning combines elements of coarse-grained and fine-grained partitioning. First, each partition covers an equal number of grids in the given array slab, with assignment in a round-robin fashion. If the number of grids is not divisible by the number of processors, the extra grids can be partitioned in a coarse-grained or fine-grained manner. As Figure 3.2(c) shows, 10 grids in a 2D array slab are partitioned across 4 processors. Starting from grid 1, every 4 consecutive grids are mapped to the 4 processors in order, and then grids 9 and 10 are partitioned in a coarse-grained manner.

Overall, hybrid partitioning has an advantage with respect to communication cost and dealing with skew. This is because most aggregates can be calculated locally, and round-robin distribution ensures that one is likely to be more skew-tolerant. However, grids within a single partition are not contiguous.
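
The following sketch (hypothetical helper functions over grid indices; boundary-grid sharing in the coarse-grained case is omitted for brevity) contrasts how the coarse-grained, fine-grained, and hybrid strategies map the grids of Figure 3.2 to processors:

def coarse_grained(num_grids, num_procs):
    """Contiguous runs of whole grids per processor."""
    per_proc = -(-num_grids // num_procs)             # ceiling division
    return {p: list(range(p * per_proc, min((p + 1) * per_proc, num_grids)))
            for p in range(num_procs)}

def fine_grained(num_grids, num_procs):
    """Every processor receives a 1/num_procs share of the elements of every grid."""
    return {p: [(g, p) for g in range(num_grids)] for p in range(num_procs)}

def hybrid(num_grids, num_procs):
    """Whole grids assigned to processors in a round-robin fashion."""
    assignment = {p: [] for p in range(num_procs)}
    for g in range(num_grids):
        assignment[g % num_procs].append(g)
    return assignment

# with the 10 grids and 4 processors of Figure 3.2 (0-based grid numbers):
# coarse_grained(10, 4) -> {0: [0, 1, 2], 1: [3, 4, 5], 2: [6, 7, 8], 3: [9]}
# hybrid(10, 4)         -> {0: [0, 4, 8], 1: [1, 5, 9], 2: [2, 6], 3: [3, 7]}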

Auto-Grained Partitioning: Among the partitioning strategies we have presented so far, both the fine-grained and hybrid schemes can deal with data skew. However, they do so in an ad-hoc fashion, by round-robin distribution of grids or grid elements across the partitions, which is likely to balance the amount of data satisfying value-based predicates across processes. In comparison, the strategy we introduce now is designed specifically for handling data skew and workload imbalance. Rather than evenly partitioning the input data (and hence balancing the data loading costs), auto-grained partitioning aims to equally divide the total processing cost. Thus, the size of each partition is not necessarily equal.

Auto-grained partitioning assumes we have a certain estimate of the data skew for different grids, and thus of the different underlying processing costs. Overall, it works as follows. First, given a value-based filtering condition, a lightweight uniform sampling is applied at runtime to estimate the density of all the grids after filtering. In practice, such sampling is performed in parallel over different parts of the array slab. Second, based on the sampled density, for each grid we have an estimate of the total processing cost, which is the sum of the loading cost (fixed) and the computation cost (which depends upon the density and the computational complexity of the aggregation). Thus, the input of the partitioning algorithm can be viewed as a one-dimensional cost array, where the value of each element is the estimated cost of the corresponding grid. This problem can be modeled as a balanced contiguous multi-way partitioning problem, where the objective is to find a sequence of separator indices that divides this cost array into contiguous partitions such that the maximum cost over all the partitions is minimized.

We have designed two different algorithms to solve this problem. The first approach is a dynamic programming algorithm with Θ(NM²)-time complexity, where N is the number of processors, i.e., the number of partitions, and M is the number of grids. Besides M and N, the other input is the cost array of M elements, denoted as A. The decomposed sub-problem P(n, m) denotes the maximum partition cost after dividing the first m elements of A into n contiguous subarrays in a balanced way, so that the output is P(N, M). The problem can be decomposed as follows:

P(n, m) =
\begin{cases}
A_1, & \text{if } m = 1 \\
\sum_{i=1}^{m} A_i, & \text{if } n = 1 \\
\min_{1 \le k \le m-1} \left( \max\left( P(n-1, k), \sum_{i=k+1}^{m} A_i \right) \right), & \text{otherwise}
\end{cases}
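A minimal sketch of this dynamic programming formulation is given below; the use of prefix sums and all identifiers are our own, and the dissertation's actual implementation may differ. The triple loop reflects the Θ(NM²) complexity noted above.

# Sketch of the dynamic programming solution to the balanced contiguous
# multi-way partitioning problem defined by the recurrence above.
from itertools import accumulate

def balanced_partition(costs, num_parts):
    """Return (optimal maximum partition cost, separator indices) for dividing
    the cost array into num_parts contiguous pieces."""
    M = len(costs)
    prefix = [0] + list(accumulate(costs))          # prefix[m] = A_1 + ... + A_m

    def seg(lo, hi):                                # cost of grids lo+1 .. hi (1-based)
        return prefix[hi] - prefix[lo]

    INF = float("inf")
    # P[n][m]: best achievable maximum cost for the first m grids in n parts
    P = [[INF] * (M + 1) for _ in range(num_parts + 1)]
    choice = [[0] * (M + 1) for _ in range(num_parts + 1)]
    for m in range(1, M + 1):
        P[1][m] = seg(0, m)                         # base case n = 1
    for n in range(2, num_parts + 1):
        for m in range(n, M + 1):
            for k in range(n - 1, m):               # last part covers grids k+1 .. m
                cand = max(P[n - 1][k], seg(k, m))
                if cand < P[n][m]:
                    P[n][m], choice[n][m] = cand, k
    # Recover the separator indices by walking the recorded choices backwards.
    seps, n, m = [], num_parts, M
    while n > 1:
        seps.append(choice[n][m])
        m, n = choice[n][m], n - 1
    return P[num_parts][M], sorted(seps)

if __name__ == "__main__":
    # Example: a skewed per-grid cost array split into 4 contiguous partitions.
    print(balanced_partition([5, 1, 1, 1, 9, 2, 2, 8, 1, 3], 4))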

The second approach is a faster greedy algorithm. When the number of grids is relatively small, the system uses the dynamic programming algorithm to obtain an optimal workload balance. When the number of grids is large, the system chooses the greedy algorithm, which runs much faster and produces a suboptimal solution that is almost as good as the optimal solution provided by the dynamic programming algorithm.

Overall, apart from the good workload balance, the auto-grained partitioning has a very low communication cost, since all the aggregates are calculated locally. Moreover, its I/O performance is also as good as the coarse-grained partitioning, as all the grids in the same partition are contiguous. The disadvantage is the overhead of parallel sampling and runtime partitioning.

3.2.3 A Cost Model for Choosing Partitioning Strategy

As discussed above, the partitioning strategies we have introduced involve trade-offs with respect to communication and data loading costs, as well as their ability to handle skew. To automatically choose the partitioning strategy that results in the lowest total processing cost, we have developed a cost-based partitioning strategy decider.

Broadly, we can divide the total processing cost of parallel grid aggregation into five parts: the initialization cost for result values, the loading cost for the partitions to be read, the filtering cost (i.e., the cost of applying a value-based predicate before aggregation), the computation cost of calculating aggregates, and the communication cost of merging local aggregates.

For simplicity, we can ignore both the initialization cost and the filtering cost, since the former is trivial and the latter can be merged with the data loading phase. Moreover, it turns out that, with the exception of fine-grained partitioning with very small grid sizes, the communication cost can also be ignored. To keep our model simple, we assume that fine-grained partitioning is not an option when the grid sizes are small. Thus, we can simplify the total processing cost as:

C_{total} = C_{load} + C_{comp} \quad (3.1)

Here, C_{load} represents the loading cost, and C_{comp} is the computation cost.

Table 3.2: Summary of Model Parameters

Parameter   Type     Definition
Nglb        App      Number of elements in total
Nloc        App      Number of elements in a partition
NglbGrids   App      Number of grids in total
NlocGrids   App      Number of grids in a partition
Ngrid       App      Number of elements in a grid
Nprocs      App      Number of processors (also the number of partitions)
Pelem       App      Processing cost per element
NftdGrid    Dataset  Number of elements after filtering in a grid
Floc        Dataset  Local filtering factor for a partition
Fglb        Dataset  Global filtering factor for the entire dataset
R           Env      Ratio between the computation cost and the I/O cost for a single element
fcoarse     Env      Loading cost factor for the coarse-grained partitioning
ffine       Env      Loading cost factor for the fine-grained partitioning
fhybrid     Env      Loading cost factor for the hybrid partitioning
phybrid     Env      Penalty factor for the hybrid partitioning
pauto       Env      Penalty factor for the auto-grained partitioning
csamp       Env      Sampling cost for the auto-grained partitioning

The parameters used in the cost model are defined in Table 3.2. These parameters can be categorized into three types: 1) Application Parameters, which correspond to the query parameters, including the given array slab, the grid size, the computation involved in calculating aggregates, and other similar parameters; 2) Dataset Parameters, particularly those reflecting data skew, which we assume we have broad knowledge of in advance or acquire through a preprocessing round; and 3) Environment Parameters, which correspond to the architecture and disk system and are learnt through sample runs.

First, since all grids are dense, we have Ngrid = Nglb/NglbGrids. Second, except when auto-grained partitioning is used, the sizes of all the partitions are equal, i.e., Nloc = Nglb/Nprocs and NlocGrids = NglbGrids/Nprocs.

Loading Cost Analysis: The loading cost is clearly independent of the data skew. In addition, through extensive experiments, we also observed that this cost is almost independent of the grid size (Ngrid), unless the grid size is very small. At the same time, the loading cost does depend upon the partitioning strategy and the partitioned data size (Nloc), since they impact the disk seek times. Rather than modeling seek times analytically, we take an experimental approach, where the impact of disk seeks for a particular partitioning strategy is measured for a particular environment. This cost is then represented as a weighting function, called the loading cost factor, and we estimate the loading cost by multiplying the loading cost factor with the size of the partitioned data. We denote these loading cost factors by fcoarse, ffine, and fhybrid, for coarse-grained, fine-grained, and hybrid partitioning, respectively. Then, the expression for the loading cost of a single partition, shown for coarse-grained partitioning as an example, is

C^{coarse}_{load} = f_{coarse} \times N_{loc} = f_{coarse} \times \frac{N_{glb}}{N_{procs}} \quad (3.2)

In the environment we used, these three weighting functions were learnt using a set of representative queries. These queries did not involve the same array slab sizes or aggregation functions as the ones used in our experiments, and thus, our experiments validated the cost model. The specific values we obtained were fcoarse = 12.2, ffine = 19.5, and fhybrid = 16.

For the auto-grained partitioning strategy, we estimate the total loading cost rather than the loading cost of a single partition. Because the partitioned grids are contiguous, the loading cost factor for coarse-grained partitioning can be used, and thus we have:

\sum_{i=1}^{N_{procs}} C^{auto}_{load_i} = f_{coarse} \times N_{glb} \quad (3.3)

Computation Cost Analysis: The computation cost for a partition is the sum of the computation costs for all the local grids. The computation cost for a grid is dependent upon the density of the grid as well as the nature of the aggregation computed (i.e., its computational complexity). To capture the latter, the parameter Pelem denotes the per-element processing cost for one grid aggregation. This parameter needs to be obtained for each aggregation function using sample datasets.

For the first three partitioning strategies, which evenly partition the input data, the overall computation cost should be the maximum cost among all the partitions, which can be formally stated as:

C_{comp} = \max_{1 \le i \le N_{procs}} \left( \sum_{j=1}^{N_{locGrids}} R \times P_{elem} \times N_{ftdGrid_j} \right) \quad (3.4)

Here, R is the adjustable ratio between the computation cost and the I/O cost for a single element. The value of the parameter R is dependent upon the environment and can be determined during a learning phase, similar to how the loading factor parameters are obtained. We now elaborate on how the other parameters are estimated for each of the strategies:

Coarse-Grained Computation Cost: Since the number of elements to be processed (those left after filtering) differs across partitions in the case of the coarse-grained partitioning strategy, the computation cost is determined by the partition from which the least amount of data is filtered out. Particularly, we use a parameter Floc to indicate the proportion of data remaining after filtering in a given partition. We also have:

\sum_{j=1}^{N_{locGrids}} N_{ftdGrid_j} = F_{loc} \times N_{loc} \quad (3.5)

By using Equations 3.4 and 3.5, the coarse-grained computation cost can be estimated as follows:

C^{coarse}_{comp} = \max_{1 \le i \le N_{procs}} \left( R \times P_{elem} \times F_{loc_i} \times \frac{N_{glb}}{N_{procs}} \right) \quad (3.6)

Fine-Grained Computation Cost: Because the fine-grained partitioning strategy is likely to evenly distribute the data after filtering, the local filtering factor of any partition is equal to the global filtering factor, i.e., Floc = Fglb. Because of this even distribution of filtered data among all the partitions, the fine-grained partitioning strategy evenly distributes the computation workload. Therefore, to estimate the fine-grained computation cost, we can specialize Equation 3.4 and obtain:

C^{fine}_{comp} = \sum_{j=1}^{N_{locGrids}} \left( R \times P_{elem} \times N_{ftdGrid_j} \right) = R \times P_{elem} \times F_{loc} \times \frac{N_{glb}}{N_{procs}} = R \times P_{elem} \times F_{glb} \times \frac{N_{glb}}{N_{procs}} \quad (3.7)

Hybrid Computation Cost: Similar to the fine-grained partitioning strategy, the hybrid partitioning strategy can also lead to a reasonably well-balanced computation workload. Therefore, we can infer the hybrid computation cost from Equation 3.7, but to capture possible small workload imbalances, we empirically add a penalty factor p_{hybrid} = 0.06:

C^{hybrid}_{comp} = \left( R \times P_{elem} \times F_{glb} \times \frac{N_{glb}}{N_{procs}} \right) \times (1 + p_{hybrid}) \quad (3.8)
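To illustrate how the model might be evaluated for the three even-partitioning strategies, the sketch below combines the loading costs (Equation 3.2) with the computation costs (Equations 3.6-3.8). This is our own illustration, not the decider's actual code; the input values are made up for the example, except for the loading cost factors reported above.

# Sketch of comparing the estimated total cost of the even-partitioning
# strategies (coarse-grained, fine-grained, hybrid) using Eqs. 3.1-3.8.

def total_cost_coarse(N_glb, N_procs, f_coarse, R, P_elem, F_loc_per_part):
    load = f_coarse * N_glb / N_procs                       # Eq. 3.2
    comp = max(R * P_elem * F_loc * N_glb / N_procs         # Eq. 3.6
               for F_loc in F_loc_per_part)
    return load + comp                                      # Eq. 3.1

def total_cost_fine(N_glb, N_procs, f_fine, R, P_elem, F_glb):
    load = f_fine * N_glb / N_procs
    comp = R * P_elem * F_glb * N_glb / N_procs             # Eq. 3.7
    return load + comp

def total_cost_hybrid(N_glb, N_procs, f_hybrid, R, P_elem, F_glb, p_hybrid=0.06):
    load = f_hybrid * N_glb / N_procs
    comp = R * P_elem * F_glb * N_glb / N_procs * (1 + p_hybrid)   # Eq. 3.8
    return load + comp

if __name__ == "__main__":
    # Illustrative inputs: a skewed dataset where one partition keeps far more
    # data after filtering than the others. The loading factors follow the
    # values reported in the text (12.2, 19.5, 16); everything else is made up.
    N_glb, N_procs, R, P_elem = 8_000_000, 4, 0.5, 3.0
    F_loc = [0.1, 0.4, 0.9, 0.2]
    F_glb = sum(F_loc) / len(F_loc)
    costs = {
        "coarse": total_cost_coarse(N_glb, N_procs, 12.2, R, P_elem, F_loc),
        "fine":   total_cost_fine(N_glb, N_procs, 19.5, R, P_elem, F_glb),
        "hybrid": total_cost_hybrid(N_glb, N_procs, 16.0, R, P_elem, F_glb),
    }
    print(min(costs, key=costs.get), costs)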

Auto-Grained Computation Cost: Calculating costs with the auto-grained partitioning strategy involves additional considerations. Particularly, an extra sampling cost should be added, and because the auto-grained partitioning strategy does not evenly partition the input data, we have to estimate the loading cost and the computation cost together. Through extensive experiments, we observed a slight workload imbalance for auto-grained partitioning. To formally capture this workload imbalance, we empirically add a penalty factor p_{auto} to the total processing cost, where p_{auto} = 0.06:

C^{auto}_{total} = \frac{\sum_{i=1}^{N_{procs}} C^{auto}_{load_i} + \sum_{i=1}^{N_{procs}} C^{auto}_{comp_i}}{N_{procs}} \times (1 + p_{auto}) + c_{samp} \quad (3.9)

where c_{samp} refers to the sampling cost. Although the sampling cost can also be determined by the user with a given sampling probability, for simplicity, we treat it as a small constant, c_{samp} = 2.

Finally, the total computation cost in Equation 3.9 can be estimated as follows:

\sum_{i=1}^{N_{procs}} C^{auto}_{comp_i} = R \times P_{elem} \times F_{glb} \times N_{glb} \quad (3.10)

3.3 Overlapping Aggregation Algorithms

In this section, we present algorithms for overlapping aggregations, i.e., sliding, hierarchical, and circular aggregations.

Unlike non-overlapping aggregations, an overlapping aggregation can incur repeated loads of and computations on the same data element. This leads to two important considerations in the design of the algorithms: 1) I/O cost: we would like to reuse the data loaded into memory to reduce repeated disk I/O operations, and 2) memory accesses: we would like to avoid repeated accesses to the same element even in memory, since they can cause unnecessary cache misses.

Based on these two considerations, we have designed three approaches: a naive approach, which, as the name suggests, is a simple scheme that ignores both I/O cost and memory access considerations, a data-reuse approach that addresses I/O costs but does not reduce memory accesses, and an all-reuse approach, which is conscious of both.

3.3.1 Naive Approach

To calculate each aggregate in an overlapping aggregation, one possibility is to simply load a grid of elements and then perform an aggregation over them, repeating the process for each aggregate to be computed. Thus, if there are N aggregates to calculate in total, N separate grids are loaded. To parallelize this operation, the aggregations to be performed can be evenly distributed among the processors. Any of the partitioning strategies discussed in Section 3.2 can be used for this purpose. Obviously, the main drawback of this approach is that data resident in memory and/or cache is not reused, leading to redundant

I/O and repeated memory accesses. Besides, significant workload imbalance may arise in hierarchical and circular aggregations due to different grid sizes.

3.3.2 Data-Reuse Approach

Unlike the naive approach, the data-reuse approach can reuse the data loaded into memory. We load a large array portion in each step, which can be either the entire array slab involved in the query, or its subset (e.g., the outermost grid in the hierarchical or circular aggregations). Aggregations that need this data can be processed without any

further disk I/O operations. Now, to parallelize the aggregation, the loaded array slab can be evenly distributed to all the processors. Certain aggregations will require data from multiple partitions (or even different array portions, which is the case even for the sequential version). These cases can be handled by performing a local aggregation followed by a global aggregation, as is needed in the case of non-overlapping aggregations.

Clearly, this approach can massively reduce disk I/O costs. However, other limitations of the naive approach still remain, i.e., it involves repeated memory accesses to the same element.

3.3.3 All-Reuse Approach

Similar to the data-reuse approach, the all-reuse approach also loads a large array slab at a time and reuses the loaded data to reduce disk I/O. However, the distinguishing characteristic of this method is that each element is accessed only once - even for multiple aggregations - leading to much fewer memory accesses. The main underlying idea is that it is more computationally efficient to iterate over elements and update the associated aggregates in the process, rather than iterating over aggregation results that need to be calculated and then accessing the required elements for producing these results.

Algorithm 3 shows sliding aggregation by using the all-reuse approach. This method can only be applied if the aggregation operator is algebraic (also referred to as associative and commutative). The array slab is evenly partitioned in the highest dimension to ensure data contiguity. After a partition is loaded, local aggregates corresponding to all sliding grids that overlap with this partition are initialized. Once an element is read, it is used to update all the aggregates it contributes to. In this way, when each element is loaded into

cache, it will be fully reused before being flushed out, leading to fewer cache misses. Because some of the sliding grids overlap with two neighboring partitions, the corresponding partial aggregation results need to be merged in a global aggregation phase.

Algorithm 3: SlidingAgg(slab S, num_procs n)
1:  Evenly divide S into n partitions (set P)
2:  Each processor loads its partition Pi (i = 1, ..., n)
3:  Initialize m local aggregates A for the m sliding grids that Pi overlaps with
4:  for each element e in Pi do
5:    for each aggregate Aj (j = 1, ..., m) associated with e do
6:      Aj <- aggregate(e, Aj)
7:    end for
8:  end for
9:  Perform global aggregation
10: return A
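The core idea of the all-reuse approach can be illustrated with a one-dimensional sliding SUM, sketched below; this is our own serial simplification of Algorithm 3, not the parallel implementation.

# A 1D sketch of the all-reuse idea: each element is visited exactly once and
# pushed into every sliding-window aggregate it contributes to.

def sliding_sum_all_reuse(data, window):
    num_windows = len(data) - window + 1
    aggregates = [0.0] * num_windows                 # one aggregate per sliding grid
    for idx, value in enumerate(data):
        # Windows containing position idx are those starting in [first, last].
        first = max(0, idx - window + 1)
        last = min(idx, num_windows - 1)
        for w in range(first, last + 1):
            aggregates[w] += value                   # Aj <- aggregate(e, Aj)
    return aggregates

if __name__ == "__main__":
    print(sliding_sum_all_reuse([1, 2, 3, 4, 5, 6], window=3))  # [6, 9, 12, 15]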

A similar method can be applied for hierarchical aggregation. The main differences are that local aggregates are initialized for all concentric grids, rather than only for the sliding grids that a partition overlaps with, and that the global aggregation involves merging all local aggregates instead of only the few that overlap with two neighboring partitions. Further, circular aggregation is a special case, since each element contributes to exactly one aggregate, corresponding to a disjoint circle. Therefore, there is no need for the inner loop of Algorithm 3 that iterates over different aggregates.

3.4 Optimization for Chunked Array Storage

Our goal in this work has been to support an approach of directly using array storage as a database, where no data ingestion costs are involved and operations are instead supported over native storage. The default array storage layout on disk is the same as the in-memory layout, which is a contiguous block stored in row-major or column-major fashion. It turns out that many libraries (e.g., NetCDF and HDF5) that manage array storage on disk can also support a more optimized layout, called chunked storage [183]. Chunked storage involves splitting an array into chunks based on a multi-dimensional partitioning. As a result, array elements are not stored contiguously along any single dimension, and hence dimension dependency can be alleviated. Libraries that support such storage also have the feature that when a single element is loaded into memory, all the other elements in the same chunk are cached as well.
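As an illustration (our own example, not code from this work), the h5py binding to HDF5 lets one declare such a chunked layout directly; the file name, dataset name, shape, and chunk shape below are arbitrary choices.

# Example of declaring chunked storage with the h5py binding to HDF5.
import numpy as np
import h5py

with h5py.File("example_chunked.h5", "w") as f:
    dset = f.create_dataset(
        "pressure",
        shape=(8192, 8192),
        dtype="f8",
        chunks=(512, 512),          # elements of a 512 x 512 chunk are stored together
    )
    dset[0:512, 0:512] = np.random.rand(512, 512)

with h5py.File("example_chunked.h5", "r") as f:
    # Reading any element of a chunk causes the whole chunk to be fetched and
    # cached by the HDF5 chunk cache, which is what Algorithm 4 below exploits.
    block = f["pressure"][0:512, 0:512]
    print(f["pressure"].chunks, block.shape)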

Because the grid aggregation methods discussed in Section 3.2 do not consider the chunked array storage, the algorithms can be quite inefficient when chunked storage is used. For example, if a chunk is shared by multiple logical grids, this chunk may be loaded repeatedly to perform aggregations on different grids. Similarly, if a chunk is shared by different partitions, this chunk is loaded redundantly by multiple processors. Avoiding such overheads is the motivation for our method (Algorithm 4).

Algorithm 4: ChunkBasedGridAgg(slab S, num_procs n)
1:  Identify chunks comprising the queried array slab S
2:  Evenly divide chunks into n contiguous chunk sets (set C)
3:  Set a load buffer B  {buffer size is a multiple of chunk size and less than cache size}
4:  Each processor divides Ci (i = 1, ..., n) into subsets of chunks of size at most B
5:  for each subset of size at most B do
6:    Load chunk(s) into B
7:    for each chunk c in B do
8:      for each element e in c do
9:        for each aggregate Aj (j = 1, ..., m) associated with e do
10:         Aj <- aggregate(e, Aj)
11:       end for
12:     end for
13:   end for
14: end for
15: return A

First, to minimize the number of chunks that have to be loaded by more than one processor, instead of mapping logical grids to processors, we map physical chunks to processors. Second, to favor I/O performance, the partitioning is performed in such a way that the chunks within one partition are contiguous to the extent possible. To fully reuse the cached data in the same chunk, once a chunk of data is read, all elements in it are used to update the aggregates they contribute to. Note that the size of the load buffer needs to be carefully tuned, i.e., it should be a multiple of the chunk size and less than the cache size. If both the chunk size and the buffer size are very small, extra overheads can be caused by frequent small I/O requests. If the buffer size is too large compared with the cache size, the first few cached chunks will be flushed out before they are actually used in the aggregation. Since the actual number of chunks in the load buffer is usually less than the number of processors used, the fine-grained partitioning strategy is not supported; otherwise, one chunk would be redundantly loaded by multiple processors.

A similar chunk-based optimization can also be applied to overlapping aggregations, since the all-reuse approach does not require data to be loaded in any specific manner.

3.5 Experimental Results

In this section, we evaluate the performance of our approach and implementations for various structural aggregations, using two real datasets, as well as synthetic datasets created to vary the amount of skew. Our experiments were conducted on a cluster of nodes with a shared file system (for evaluating parallel performance) and on the Amazon EC2 platform (for comparing with SciDB). We designed the experiments with the following goals: 1) to demonstrate the performance of our proposed partitioning strategies for grid aggregation (especially their ability to handle skewed data) and the effectiveness of the

cost models for choosing the best scheme, 2) to compare the performance of our system with SciDB, a popular Array DBMS, and 3) to evaluate the performance of the parallel algorithms for the different structural aggregations we have introduced.


Figure 3.3: Processing Times (left) and Predicted Processing Times (right) on Synthetic Datasets of Varying Skew


Figure 3.4: Processing Times (left) and Predicted Processing Times (right) on Cloud Pressure Dataset

3.5.1 Experimental Setup

Our experiments were conducted using two real large-scale array-based datasets.

These are both two-dimensional satellite imagery datasets, which use the HDF5 format


Figure 3.5: Processing Times (left) and Predicted Processing Times (right) on Terrain Pressure Dataset

that is popular in many scientific areas. These two datasets are referred to as cloud pressure and terrain pressure, since they record effective cloud pressure and terrain pressure, respectively. Each one comprises 1,000K × 1K double-precision elements and has a size of 8 GB. They were downloaded from the Land Parameter Retrieval Model (LPRM) Level 2 (swath) collection [6]. Additionally, in Section 3.5.2 we also used 4-GB synthetic datasets of varying skew, each of which comprises 512K × 1K double-precision elements. Since we focused on evaluating the performance of our system on native array storage, the arrays are stored in the default row-major order on disk, i.e., the array storage was not preprocessed or reorganized. An exception is the set of experiments with the chunk-based grid aggregation. We used traditional aggregates including COUNT, SUM, AVG, MIN, and MAX in our experiments. Additionally, a compute-intensive user-defined aggregation function was used in Sections 3.5.2 and 3.5.4.

The performance comparison between our system and SciDB was conducted on an

Amazon EC2 instance with 32 GB of main memory and 8-core Intel(R) Xeon(R) E5-2670

CPU, and the clock frequency of each core was 2.6 GHz. The version of SciDB was 13.12.

All the other experiments were conducted on a cluster of machines, where each node had an AMD Opteron(TM) Processor 8218 with 4 dual-core CPUs (8 cores in all). The clock frequency of each core was 2.6 GHz, and the system had 16 GB of main memory. We used up to 16 compute nodes for our study. HDF5 version 1.8.10 was used for all our experiments.

3.5.2 Handling Skewed Data and Prediction Accuracy of Partitioning Strategy Decider

Our first experiment evaluates the performance of grid aggregation methods over skewed data and the prediction accuracy of the partitioning strategy decider. Recall that the relative performance of different partitioning strategies and our cost models are primarily impacted by two factors: the amount of computation involved in the aggregation and the level of data skew. Thus, we designed experiments to vary both, and compared the performance of different partitioning strategies and the prediction accuracy of our models.

Note that all the processors we used in this set of experiments were on different machines.

We varied skew as follows. We designed synthetic datasets focusing on execution on 2 processors, where we applied a value-based predicate to filter out 50% of the data elements in each dataset. We created four synthetic datasets with different distributions of the data after filtering over the two partitions: 1) very low skew: 24% in the first half and 26% in the second half; 2) low skew: 20% in the first half and 30% in the second half; 3) high skew: 10% in the first half and 40% in the second half; and 4) extremely high skew: 0% in the first half (empty) and 50% in the second half (full).

Also, the amount of computation in aggregation was varied in the following fashion.

First, note that while calculating traditional aggregates like COUNT and AVG, the dominant cost is I/O, as the amount of computation is very small. Because I/O costs can be trivially balanced across processors with no impact from skew, we used a user-defined, compute-intensive aggregation function, where the data within each grid was clustered using the popular k-means algorithm. The amount of computation per element was further varied by using different numbers of iterations (before convergence) in the k-means algorithm.

The first set of experiments was with the synthetic datasets of varying skew. The grid size was varied from 2,000K to 6,000K elements, and we set 30 centroids and 4 iterations for each grid. Figures 3.3(a) and 3.3(b) show the actual processing times and the predicted processing times from our models, respectively. With very low skew, the best results were obtained from the coarse-grained scheme. As the data skew increased, the performance of the coarse-grained partitioning began to get worse, and the best results were obtained from

fine-grained and auto-grained schemes. As also predicted by our cost model, the processing costs with fine-grained, hybrid, and auto-grained schemes do not vary with skew.

Second, we experimented on both the cloud pressure and the terrain pressure datasets with 4 processors. The number of iterations for clustering was varied from 2 to 12 for cloud pressure, and from 2 to 22 for terrain pressure, thereby increasing the processing time per element. The grid size was varied from 2,000K to 6,000K elements, and we set 20 centroids for each grid. For the cloud pressure dataset, we filtered out all the missing values.

This introduced significant skew - specifically, among the 4 partitions obtained using the coarse-grained partitioning, the proportions of data after filtering were 55%, 100%, 100%, and 56%, respectively. For the terrain pressure dataset, we filtered out all the values less than a threshold, and the fraction of the data in all the 4 partitions were 65%, 43%, 40%, and 62%, respectively. Figures 3.4(a) and 3.4(b) show the results on the cloud pressure dataset, and Figures 3.5(a) and 3.5(b) show the results on the terrain pressure dataset. The

observations from these charts are as follows. When there are only 2 iterations on both datasets, the I/O cost is the dominating cost. As a result, the coarse-grained partitioning gives the best performance, because it minimizes disk seek times and hence has the best I/O performance. When more iterations are involved, the computation cost begins to dominate the total processing cost, and skew potentially impacts the workload balance. Therefore, both the fine-grained and auto-grained schemes result in better performance, as they are able to balance the load even in the presence of skew. Additionally, we can also see that when the number of iterations is 10 in Figure 3.4(a) and 22 in Figure 3.5(b), the hybrid partitioning can outperform the coarse-grained partitioning, because the former handles skew better. Lastly, when neither the I/O cost nor the computation cost overwhelmingly outweighs the other, the auto-grained partitioning has the best performance. This is because it has a load balancing advantage over the coarse-grained partitioning strategy, and it has better I/O performance than the fine-grained and hybrid partitioning strategies.

We can also see that the predicted cost follows actual processing times, and in almost

all cases, it can predict the relative performance of different partitioning strategies

accurately. The only case where the model is not correct in predicting performance is

when the performances of two strategies are very close. This is because it is a simple

model, which has not considered factors like communication cost (this cost is trivial only

with an exception for fine-grained partitioning with a very small grid size). Overall, we

have shown that the model can help choose the appropriate partitioning strategy for a

given query, in view of the dataset skew and amount of computation involved per element

in the aggregation.

Additionally, two issues from this experiment need to be clarified. First, to demonstrate the prediction accuracy, we have focused on the “crossover” of the performance of different partitioning schemes. These results may give the appearance that the best scheme only marginally outperforms the others. However, if we were to further increase the amount of computation, the skew-tolerant partitioning schemes (the fine-grained and auto-grained schemes) could outperform the coarse-grained scheme by more than a factor of 2. Second, for handling data skew, the fine-grained partitioning might appear to be sufficient, which may seem to imply that the other schemes are unnecessary. However, two disadvantages of the fine-grained scheme are not apparent from our experiments: 1) this scheme can incur significant communication costs when the queries involve a small grid size, and 2) as mentioned in Section 3.4, this scheme does not work for chunked array storage.

3.5.3 Performance Comparison with SciDB

Our next experiment compared the performance of our system with SciDB, a popular Array DBMS that could support many of the operations we support. We performed the comparison on an EC2 instance with 8 cores. As we have discussed throughout, SciDB involves a considerable upfront cost of data transformation and reloading. Our figures only report SciDB's query processing times after all the data was loaded, i.e., excluding the data ingestion time, since our goal is to show that our approach of using array storage as a database can still allow efficient structural aggregations over array data. It should be noted that the data ingestion time for the dataset we used could easily take 100x the time for executing a simple query over the same amount of data; this included the time for transforming the HDF5 datasets into CSV files and casting the temporary 1D arrays within SciDB into 2D arrays.

Because SciDB does not support hierarchical or circular aggregation, we only compared the performance of grid and sliding aggregations (referred to as window aggregations in SciDB). We used the 8-GB cloud pressure dataset, and we evaluated the performance of our system on both the default flat array storage and the chunked storage (with the chunk size set to 1K × 512, which is also the chunk size used when the dataset was loaded into SciDB). Note that this chunk size (4 MB) matches the chunk sizes recommended by SciDB (4-8 MB) 5.

To thoroughly evaluate the performance differences between the two systems, we created queries using different aggregation operators and varied parameters such as the grid size and the size of the array slab. Specifically, the grid size was varied from 32K to 256K elements for grid aggregation and from 3 × 3 to 7 × 7 for sliding aggregation. We also aggregated over array slabs of varying sizes, with the ratio between the queried array slab size and the dataset size varied from 12.5% to 100%. With our system, the coarse-grained partitioning was used for grid aggregations, and the all-reuse approach was used for sliding aggregations, since these gave the best performance on this dataset and the queries we used.

Performance Comparison of Grid Aggregation: The comparison results of grid aggregations are shown in Figure 3.6. Since we did not observe any measurable difference in relative performance across different aggregation operators or grid sizes, we report only a single (average) processing time with each approach, and each ratio between array slab size and dataset size. We can see that our system had better performance irrespective of whether the array storage is flat or chunked, or whether the queried array slab is a subarray or the entire array. As shown in Figure 3.6(a), the sequential performance of our system is better by an average of 39% and 25% on flat arrays and chunked arrays, respectively. To

5. http://www.scidb.org/forum/viewtopic.php?f=11&t=334

(a) Sequential Performance Comparison    (b) Parallel Performance Comparison

Figure 3.6: Grid Aggregation Performance Comparison

the best of our understanding, this difference arises because our system loads grids or chunks in order, leading to better I/O performance, whereas SciDB loads and processes chunks in random order and does not preserve sequential I/O.

Moreover, as shown in Figure 3.6(b), the parallel performance of our system is also superior to that of SciDB. We believe SciDB scales less well mainly due to the nontrivial overhead of process scheduling. Note that although our sequential performance on chunked arrays is worse than that on flat arrays, due to the overhead of addressing the mismatch between the logical grid layout and the physical chunk layout, processing queries on chunked arrays is likely to provide better scalability, especially for a data subset.

(a) Sequential Performance Comparison    (b) Parallel Performance Comparison

Figure 3.7: Sliding Aggregation Performance Comparison

Performance Comparison of Sliding Aggregation: Some additional background is important before we elaborate on the performance difference for sliding aggregations.

SciDB potentially simplifies sliding aggregations by replicating or overlapping chunk boundary elements while loading data [143]. The number of elements to be replicated across neighboring chunks should be set by the user at the time the data is loaded into SciDB. Assuming a sufficient number of boundary elements have been overlapped, sliding aggregation results can be obtained by processing data from each chunk independently, or what we will refer to as aggregation over overlapping chunks. If such a chunk overlap has not been set while loading data, or if a sliding aggregation with a grid size greater than the chunk overlap is performed, SciDB has to stitch neighboring chunks together when processing chunk boundary elements (aggregation over non-overlapping chunks), and this results in considerable performance loss. Because all possible queries and the proper chunk overlap size may not be known before the data is loaded into the database, both aggregations over overlapping chunks and over non-overlapping chunks are likely, and we have evaluated both of them. Note that overlapping chunked array storage is not supported by popular libraries for scientific datasets, like NetCDF or HDF5.

As we can see in Figure 3.7, our system outperforms SciDB with sliding aggregation over non-overlapping chunks (average of 83% for sequential performance and 88% for parallel performance), and even outperforms SciDB over overlapping chunks by an average of 79% for both sequential and parallel performance. To the best of our understanding, such performance difference is because of our all-reuse approach, which can provide completely sequential I/O and significantly reduce cache misses.

3.5.4 Performance of Parallel Structural Aggregations

Parallel Performance of Grid Aggregations: To evaluate our algorithms, we considered all four partitioning strategies and used the terrain pressure dataset, with both I/O bound and CPU bound grid aggregations. The grid size was varied from 8K to 256K elements in the I/O bound aggregation that calculates traditional aggregates, and from 1,000K to 8,000K elements in the CPU bound aggregation (k-means clustering, with 20 centroids and 20 iterations for each grid). The number of nodes we used varied from 1 to 16. Note that since there is no workload imbalance in the I/O bound aggregation, the auto-grained partitioning strategy becomes equivalent to the coarse-grained one and is not shown here.

Figures 3.8(a) and 3.8(b) show the results of I/O bound and CPU bound parallel grid aggregations, respectively. Not surprisingly, for I/O bound parallel grid aggregations, the

(a) I/O Bound Grid Aggregation    (b) CPU Bound Grid Aggregation

Figure 3.8: Parallel Performance of Grid Aggregations Implemented by Different Partitioning Strategies

coarse-grained partitioning results in the best performance, as it preserves better data contiguity. Both coarse-grained and hybrid schemes have better scalability than the

fine-grained scheme, because of their lower communication costs. For CPU bound parallel grid aggregations, as the number of nodes grows, both the coarse-grained and auto-grained schemes have better scalability because the grids are contiguous in all partitions. Again, not surprisingly, the auto-grained scheme, which explicitly accounts for skew, has the best performance with 16 nodes. Overall, we can see that the best scheme achieves at least 75% parallel efficiency.

Chunk-Based Optimization Evaluation: To evaluate the optimization for chunked array storage, we compared the performance of (I/O bound) grid aggregations with chunk-based optimization against the ones without such optimization. As mentioned earlier in


Figure 3.9: Parallel Performance of Grid Aggregations with and without Chunk-Based Optimization

Section 3.4, to avoid redundant chunk reads by multiple processors, the fine-grained partitioning is not supported over chunked arrays. Thus, here we only compared the chunk-based grid aggregations implemented by coarse-grained and hybrid partitioning with the original grid aggregation implemented by the coarse-grained partitioning, which has proved to have the best I/O performance on flat arrays. We transformed both real datasets to use chunked storage. We used the same I/O bound grid aggregation parameters on both real datasets as in Section 3.5.4, with up to 16 nodes. The chunk size was 64K × 1.

As Figure 3.9 shows, the chunk-based grid aggregations implemented by both coarse-grained and hybrid partitioning can outperform the original implementation with the coarse-grained partitioning, by a factor of up to 2.75 and 2.30, respectively. This is because the original implementation allows different subsets of a data chunk to be read by different processors, and hence the same chunk is very likely to be loaded by multiple processors redundantly. In contrast, with our chunk-based optimization, each

(a) Sliding Aggregation


(b) Hierarchical Aggregation


(c) Circular Aggregation

Figure 3.10: Parallel Performance of Overlapping Aggregations

array chunk is loaded by exactly one processor, leading to better I/O performance. Note that these results do not contradict the results in Figure 3.6. This is because, although we can improve the grid aggregation performance over chunked storage with our optimized algorithm, it is still worse than the performance over flat storage, due to the extra overhead of handling the mismatch between the logical grid layout and the physical chunk layout.

Parallel Performance of Sliding Aggregation: We experimented on the two real datasets to evaluate the performance of the sliding aggregation algorithms, i.e., the naive approach, the data-reuse approach, and the all-reuse approach. The number of nodes we used was varied from 1 to 16. The sliding grid size was varied from 3 × 3 to 7 × 7. We used the coarse-grained partitioning strategy for the naive approach.

Figure 3.10(a) shows the results. We can see that the aggregations implemented by the naive approach are dramatically slower than the other two approaches, as we would expect. The all-reuse approach outperforms the data-reuse approach by 37.5%, 33.9%, 26.1%, 21.5%, and 17.7% for 1, 2, 4, 8, and 16 nodes, respectively, because it is more cache-friendly.

Parallel Performance of Hierarchical and Circular Aggregations: The last experiment evaluates the parallel performance of hierarchical and circular aggregations, implemented by the naive approach, the data-reuse approach, and the all-reuse approach. We experimented on the two real datasets. Up to 16 nodes were used for these two aggregations. The size of the innermost grid was 512K × 512, and the grid expanded by 32K in the first dimension and 32 in the second dimension in each step, for a total of 16 concentric grids.

Figures 3.10(b) and 3.10(c) show the parallel performance of hierarchical and circular aggregations, respectively. Again, we can see that the all-reuse approach leads to the best performance, for the same reasons we discussed earlier. However, unlike the previous results, we can see that the naive approach can even outperform the data-reuse approach for smaller numbers of nodes. This is because, unlike sliding aggregation, with hierarchical and circular aggregations there are not too many frequent, small I/O requests even in the naive approach. Further, for the naive approach, the data loaded for each grid is always contiguous in memory, and the data is accessed sequentially during aggregation, which is cache-friendly.

3.6 Related Work

We now compare our work with existing efforts on array database systems, array query languages, work on parallel aggregations in databases, and closely related techniques for handling data skew.

SciDB [39] is a popular Array DBMS which has drawn considerable attention recently.

We have conducted extensive performance comparison between our system and SciDB, and

demonstrated that we can process structural aggregations faster, in addition to alleviating

the need for expensive data ingestion steps. RasDaMan [32] is another robust system,

which is based on an array algebraic framework [31]. ArrayDB [152] is a prototype array

database system, which is mainly used for processing small 2D images. MonetDB [217]

is a column-store DBMS for spatial applications. RasDaMan, MonetDB, and SciDB all

support certain structural aggregations similar to our system.

Apart from a variety of query languages designed for relational data, tree data, and graph data [38, 190, 216], query languages or operators for array operations have been studied by the above efforts, as well as a number of other projects. RasDaMan uses RasQL [32]; SciDB [39] supports both an SQL-like query language AQL (Array Query Language) and a functional language AFL (Array Functional Language), whereas MonetDB initially used RAM [214] and now uses SciQL [258]. In other efforts, the open source Postgres database [10] contains an array data type, with a fairly extensive language binding to SQL. AQL [140] involves a nested relational calculus to manipulate arrays. AML [151] is a generic array query language, which takes bit patterns as parameters for all the array operations. AQuery [138] makes use of ordered relational tables to process natural order-dependent queries, specifically over 1D arrays. Other projects have also made similar proposals [44, 61, 159, 177].

Parallelization of structural aggregations has already been carried out in the context of RasDaMan [32] and SciDB [39]. Prior to that, Shatdal and Naughton developed parallel algorithms for value-based aggregations in relational databases [186]. Our all-reuse approach for overlapping aggregations has some similarities to aggregations at multiple granularities on OLAP data cubes [92], because such multi-level aggregation also involves overlaps among aggregated regions. However, the difference is that our method accumulates the associated aggregation results concurrently, while OLAP multi-level aggregation executes in a bottom-up fashion.

Load imbalance caused by data skew has been recognized as an important issue in supporting array operations. Since prior systems involved a data ingestion phase, they addressed the problem by reorganizing data in a way that the impact of skew was minimized [180, 191, 196]. Particularly, the techniques included storing the array in irregular chunks of unequal sizes but with the same amount of data, or shuffling the chunks to randomly distribute them across processors. The data skew problem has recently been addressed in the context of MapReduce as well. As summarized by Kwon et al. [126], static optimizers [100, 124] leverage cost models to balance the expected processing cost, and dynamic optimizers [24, 125, 218] detect and repartition the bottleneck tasks at runtime. Other efforts [25, 253] optimize MapReduce speculative execution.

Our work takes inspiration from the NoDB approach [22] and related work on automatic data virtualization [238]. Our contribution, however, is in building a database engine on top of the native array storage rather than the flat files in CSV format, which essentially present a relational table view. We have implemented our system on top of earlier efforts that were also based on the native array storage, but supported only the selection queries [200,206,234]. Other projects with a similar focus include the Oracle

Database Filesystem (DBFS) [123], which can extend certain database features to unstructured data storage. PostgresRaw [22] supports advanced query processing over raw

files on the fly, assuming they can be mapped to relational tables. MonetDB data vault [102] also intends to support transparent access to the data from an external scientific

file repository. FastBit [241] and FastQuery [77, 117] accelerate the query processing over array storage using bitmap indices. By contrast, our system can support additional types of aggregations as well as optimized aggregations over skewed or chunked data.

3.7 Summary

This work has focused on providing database-like support over native array storage, with a specific focus on implementing a number of structural aggregation operations. These operations arise in a number of scientific disciplines. We have demonstrated that by developing nuanced algorithms and cost models, efficient array operations can be supported over native array storage. Detailed performance comparisons with SciDB further confirm this: despite incurring no data ingestion overhead, our aggregation processing costs are lower. We have also shown high parallel efficiency with our algorithms.

Chapter 4: A Novel Approach for Approximate Aggregations Over Arrays

To tame big data, approximate query processing (AQP) often serves as a cost-effective technique, sacrificing some level of accuracy to meet stringent response-time requirements. In this chapter, we utilize novel bitmap indexing to effectively perform the flexible approximate aggregations that are frequently used in scientific data management systems. Compared with other approximate aggregation methods, our method is specifically designed for fast and accurate aggregations on array data.

4.1 Challenges and Limitations

4.1.1 Challenges in Approximate Aggregations over Array Data

Extending the scope of approximate aggregation techniques to the domain of array data

analysis, however, imposes certain unique challenges, which are summarized as follows.

Flexible Aggregation over Any Subset: Clearly, for both relational and array data, approximate aggregations may be needed over any subset of data. However, what is distinct about array data is that subsets may be constructed based on dimensions (or array index), in addition to subsets involving value-based predicates based on array element.

Broadly, we can divide all the subsetting predicates involved in querying array data into three categories. Consider a 3-dimensional dataset whose dimensions are latitude, longitude, and depth, and whose array elements have the attributes temperature and pressure. The three types of subsetting predicates are: 1) dimension-based predicates, which are specified by dimension indices (or the coordinate system) and relate to the physical array layout, e.g., latitude > 3 or depth < 5; 2) value-based predicates, which are described by a specific value range of the array elements, e.g., temperature < 4 or pressure > 7; and 3) combined predicates, which combine the above two types. Techniques developed in the context of relational databases can be trivially extended to array data by considering each array dimension as an attribute. However, as we will demonstrate experimentally later, this can be extremely expensive.
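The distinction can be illustrated with a small NumPy sketch (our own example; the array contents and thresholds are arbitrary): dimension-based predicates become index slices, while value-based predicates become boolean masks.

# Illustration of the three predicate types on a small 3D array.
import numpy as np

rng = np.random.default_rng(0)
temperature = rng.uniform(0, 10, size=(8, 8, 8))   # dims: latitude, longitude, depth

# 1) Dimension-based predicate (e.g., latitude > 3 and depth < 5): pure slicing.
dim_subset = temperature[4:, :, :5]

# 2) Value-based predicate (e.g., temperature < 4): a boolean mask over values.
val_subset = temperature[temperature < 4]

# 3) Combined predicate: restrict the dimensions first, then filter by value.
combined = dim_subset[dim_subset < 4]

print(dim_subset.shape, val_subset.size, combined.mean())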

Aggregation Accuracy: Approximate aggregation methods often operate on a small summary structure or data synopsis instead of the entire original dataset. Accordingly, two factors are extremely important while creating summary structures over array data so as to maintain accuracy. The first is value distribution, i.e., the value distribution of the summary structure should be as close to that of the original dataset as possible.

Preserving this factor in the summary structure is particularly critical while analyzing skewed data. The second is spatial distribution, i.e., the aggregation accuracy should be maintained not only for the entire dataset but also for various spatial sub-blocks. To the best of our knowledge, none of the existing approaches is able to preserve both distributions efficiently.

Aggregation without Data Reorganization: Although certain existing methods can produce summaries that preserve both value and spatial distributions, they require data reorganization before any query can be processed. For example, aggregation can be approximated based on KD-tree based stratified sampling [239]. However, before such sampling can be performed, costly data reorganization is required, leading to massive memory and disk I/O costs. In addition, because data in the original layout may also be required for other reasons, such data reorganization can lead to the need to maintain multiple copies of the original dataset. This implies at least a factor of two increase in the storage costs. In summary, we need approximate aggregation methods that can avoid data reorganization and maintain the data in its original layout.

4.1.2 Existing Approximate Aggregation Techniques and Limitations

As mentioned earlier, approximate aggregation methods operate on the summary structures that are much more compact than the original dataset. Existing techniques to obtain such summary structures include those based on sampling [29, 46, 171, 222], histograms [87,88,90,101,162,164,175] and wavelets [104,198].

Sampling-Based Techniques: Most of the sampling methods [171,222] developed in the context of scientific data management focus on the spatial distribution, but ignore value distribution. On the other hand, value distribution based sampling [29,46] has been well studied and proven to be effective for handling possibly skewed datasets. These methods, however, are not developed for array data, and do not even consider spatial distribution.

To the best of our knowledge, so far only random sampling has been employed by modern array databases for approximate query processing, e.g., SciDB [39] supports Bernoulli sampling for approximate aggregations.

Histogram-Based Techniques: Histograms have been extensively studied as a tool for approximate aggregations [87, 88, 101, 175]. The main advantage of a (1-dimensional) histogram over other techniques is that it occupies very little space and incurs only a small runtime overhead. However, a histogram is unable to capture the spatial distribution of the original dataset. As a result, histogram-based techniques can only be applied to relational data, where no spatial distribution information is needed (e.g., the user does not care whether an aggregated tuple is the first or the last tuple in a relational view). Now, one method for handling array data could be to treat all dimension-based attributes as additional relational attributes and construct multi-dimensional histograms [90, 162, 164] for approximation. However, these histograms either incur a very high space cost, or the partitioning granularity needs to increase exponentially as the array dimensionality grows. This, in turn, leads to substantial runtime overheads or high inaccuracy [90]. To the best of our knowledge, multi-dimensional histograms have been primarily applied to selectivity estimation [135, 176], which is an internal step in query optimization, and not to answering aggregate queries from users.

Wavelet-Based Techniques: Wavelets, which were proposed in the context of signal and image processing [104, 198], have been shown to be an effective mathematical tool for processing approximate range-sum queries over OLAP data cubes [45, 153, 220, 221]. By applying hierarchical decomposition functions, raw data can be transformed into wavelet coefficients. However, if these techniques are applied over array data, by mapping an OLAP data cube to an array, wavelets can only process dimension-based but not value-based predicates. Alternatively, a value-based attribute can be added as an additional dimension to the data cube, but then data reorganization (e.g., sorting) is required. In addition, the aggregate operators supported by this method are restricted to SUM, COUNT, and AVERAGE; other desirable operators, including user-defined aggregation functions, cannot be supported.

89 4.2 Approximate Aggregations Using Bitmap Indices

This section first provides background information on bitmap indexing and binning strategies, then describes our bitmap-based approximate aggregation method.

4.2.1 Background: Bitmap Indexing and Binning Strategies

Indexing provides an efficient way to support value-based queries and has been extensively studied and used in the context of relational databases. Bitmap indexing, which utilizes the fast bitwise operations supported by the computer hardware, has been shown to be an efficient approach for querying static (i.e., read-only or append-only) data.

It was initially used in data warehouses [169,247,248], and more recently, has been used in scientific data management [194, 195, 235, 242]. Particularly, recent work has shown that bitmap indexing can help support efficient querying of scientific datasets stored in native formats [56,77,201,206].

ID  Value | e0(=1) e1(=2) e2(=3) e3(=4) e4(=5) e5(=6) | i0[1,2] i1[3,4] i2[5,6]
 0    5   |   0      0      0      0      1      0    |   0       0       1
 1    4   |   0      0      0      1      0      0    |   0       1       0
 2    2   |   0      1      0      0      0      0    |   1       0       0
 3    5   |   0      0      0      0      1      0    |   0       0       1
 4    6   |   0      0      0      0      0      1    |   0       0       1
 5    1   |   1      0      0      0      0      0    |   1       0       0
 6    3   |   0      0      1      0      0      0    |   0       1       0
 7    1   |   1      0      0      0      0      0    |   1       0       0
(Dataset)   (Low-Level Indices)                         (High-Level Indices)

Figure 4.1: An Example of Bitmap Indexing

Figure 4.1 shows an example of bitmap indexing. In this example, the dataset contains

8 elements with 6 distinct values. The low-level bitmap indices contain 6 bitvectors, where each bitvector corresponds to one distinct value. The number of bits within each bitvector is equal to the total number of elements in the dataset. In each bitvector, a bit is set to 1 if the value of the corresponding data element is equal to the bitvector value, i.e., the particular distinct value for which this vector is created. The high-level indices can be generated based on a certain binning strategy; in this example, 3 high-level bitvectors are built.
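To make this construction concrete, the following Python sketch (illustrative only, not code from our implementation; bitvectors are modeled as plain lists rather than packed, compressed words) builds the low-level and high-level bitvectors for the dataset of Figure 4.1:

data = [5, 4, 2, 5, 6, 1, 3, 1]            # values of elements 0..7 in Figure 4.1
distinct_values = [1, 2, 3, 4, 5, 6]        # one low-level bitvector per distinct value
high_level_bins = [(1, 2), (3, 4), (5, 6)]  # one high-level bitvector per bin

# Low-level index: bit k of bitvector e_v is 1 iff data[k] == v.
low_level = {v: [1 if x == v else 0 for x in data] for v in distinct_values}

# High-level index: bit k of bitvector i_b is 1 iff data[k] falls into bin b.
high_level = {b: [1 if lo <= x <= hi else 0 for x in data]
              for b, (lo, hi) in enumerate(high_level_bins)}

print(low_level[5])   # [1, 0, 0, 1, 0, 0, 0, 0] -- matches e4 in Figure 4.1
print(high_level[2])  # [1, 0, 0, 1, 1, 0, 0, 0] -- matches i2 (values in [5, 6])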

This simple example only contains integer values. Bitmap indexing has also been shown to be an efficient method for floating-point values [245]. For such datasets, instead of building a bitvector for each distinct value (even in the case of low-level bitvectors), we

first group a set of values together by a certain binning strategy, and build bitvectors for these bins. This way, the total number of bitvectors can be kept at a manageable level.

Even after binning, the number of bits within each level of bitmap indices is n × m, where n is the total number of elements and m is the total number of bitvectors. This can potentially result in sizes even greater than the size of the original dataset, causing high time and space overheads for index creation, storage, and query processing. To solve this problem, run-length compression algorithms such as Byte-aligned Bitmap Code

(BBC) [26] and Word-Aligned Hybrid (WAH) [243, 244] have been developed to reduce the bitmap size. The main idea of these approaches is that for long sequences of 0s and 1s within each bitvector, an encoding is used to store the number of continuous 0s or 1s.

Another property of these run-length compression methods is that they can support fast bitwise operations without decompressing the data.
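As a toy illustration of the run-length idea (only the basic encoding; the actual BBC and WAH schemes are byte- and word-aligned and, unlike this sketch, can operate directly on the compressed form):

def rle_encode(bits):
    """Encode a 0/1 sequence as (bit, run_length) pairs."""
    runs, i = [], 0
    while i < len(bits):
        j = i
        while j < len(bits) and bits[j] == bits[i]:
            j += 1
        runs.append((bits[i], j - i))
        i = j
    return runs

print(rle_encode([0] * 62 + [1, 1] + [0] * 100))
# [(0, 62), (1, 2), (0, 100)] -- long runs of identical bits collapse into pairs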

As we stated above, binning is a critical part of using bitmaps. There are two conventional binning strategies for bitmap indexing: equi-width and equi-depth [246]. Equi-width binning divides the entire dataset value domain into equal intervals. This binning strategy is straightforward to implement and has been shown to perform well in some cases. Equi-depth binning involves creating partitions in such a way that each bin contains approximately an equal number of elements.
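The following small sketch (our own illustration, with made-up data) contrasts how the two strategies place bin boundaries on a skewed value distribution:

def equi_width_boundaries(values, num_bins):
    """Split the value domain [min, max] into equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    return [lo + k * width for k in range(1, num_bins)]

def equi_depth_boundaries(values, num_bins):
    """Place boundaries so each bin holds roughly the same number of elements."""
    ordered = sorted(values)
    step = len(ordered) / num_bins
    return [ordered[int(k * step)] for k in range(1, num_bins)]

skewed = [i / 100.0 for i in range(90)] + [x * 10.0 for x in range(1, 11)]
print(equi_width_boundaries(skewed, 4))  # [25.0, 50.0, 75.0] -- ignores the skew
print(equi_depth_boundaries(skewed, 4))  # [0.25, 0.5, 0.75]  -- follows the data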

Bitmap indexing was initially designed for accelerating selection operation, where value-based predicate(s) are applied. When binning is used, the elements within the non-edge bins are fully selected, whereas the elements within the edge bin(s) need to be further evaluated. Specifically, an extra step is required to check whether certain candidate elements within the edge bin(s) can satisfy the specified predicate, which is referred to as the candidate check [182]. When the data is highly skewed, the candidate check for equi-width binning can be, in the worst case, as expensive as a sequential scan of the entire dataset. In contrast, equi-depth binning can ensure that the candidate check costs are equal for all query ranges. Thus, this method improves the worst-case response time.

Our goal is to use bitmap indexing for approximate aggregations, and we can expect the choice of binning strategy to impact accuracy and/or performance. However, since bitmap indexing has so far not been used for approximate aggregation, none of the existing binning strategies has considered aggregation accuracy.

4.2.2 Bitmaps and Approximate Aggregation

Bitmap indices can be viewed as a summary (potentially approximate) representation of the entire data, similar to those used for approximate query processing. However, bitmap indexing has so far been used almost exclusively for accelerating the processing of

selection queries [56, 77, 201, 206, 242], and not for approximate aggregation. One can

attribute this to two possible reasons. First, bitmap indexing was initially used in relational

databases where there is no concept of dimension-based predicate. For such cases, bitmap

indices cannot support approximate aggregation any better than histograms, since each bin

can be mapped to a bucket in a histogram. Second, existing binning strategies do not

consider aggregation accuracy at all. In contrast, histograms can provide better query

efficiency because they have a more compact data structure, and also higher accuracy

because of the various nuanced construction methods that have been developed over

time [87,88,101,175].

Statistics \ Bin #                          i0    i1    i2
COUNT                                       3     2     3
SUM                                         4     7     16
SQUARED SUM                                 6     25    86
MIN                                         1     3     5
MAX                                         2     4     6
User-Defined Pre-Aggregation Statistics     ...   ...   ...

Figure 4.2: An Example of Pre-Aggregation Statistics as Applied to the Example in Figure 4.1

Since our focus is on array structure which includes both dimensions and values, bitmap indices’ ability to preserve dimensional information is now a clear advantage. The second issue, approximation accuracy, remains, but will be addressed in Section 4.3.

4.2.3 Approximate Aggregation Method

We now focus on how bitmaps can be used for aggregation processing over array data.

To facilitate approximate aggregations, it turns out that some additional information needs

to be stored in conjunction with bitvectors. We refer to this information as pre-aggregation

statistics. Figure 4.2 shows the pre-aggregation statistics corresponding to the example in

Figure 4.1. The information stored here includes COUNT, SUM, SQUARED SUM, as well as MIN and MAX (i.e., the two boundary values of a bin). The use of this information for approximate aggregations will be shown later in this section. However, one way of looking at this information is the following. While the spatial distribution is preserved by the bitvectors in the index file, the value distribution is preserved by these pre-aggregation statistics. Note that the statistics can also be customized by the user to support user-defined aggregation functions, which will also be discussed later.
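A minimal sketch of computing these per-bin statistics for the example (names are ours; a real implementation would compute them in the same pass that builds the bitvectors):

data = [5, 4, 2, 5, 6, 1, 3, 1]
bins = [(1, 2), (3, 4), (5, 6)]

stats = []
for lo, hi in bins:
    members = [x for x in data if lo <= x <= hi]
    stats.append({
        "count": len(members),
        "sum": sum(members),
        "squared_sum": sum(x * x for x in members),
        "min": min(members),
        "max": max(members),
    })

print(stats[1])  # bin [3, 4]: count 2, sum 7, squared_sum 25, min 3, max 4 (cf. Figure 4.2)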

Algorithm 5: Agg(bitmap B, statistics S, aggOperator Op, valBasedPredicate Pv, dimBasedPredicate Pd)
 1: Let B′ be the subset of B after value-based filtering
 2: if Pv ≠ ∅ then
 3:   B′ ← SELECT(B, Pv)   {value-based filtering}
 4: else
 5:   B′ ← B
 6: end if
 7: Let the lower-bound and upper-bound bitvectors in B′ be b_i and b_j, respectively
 8: if Pd ≠ ∅ then
 9:   Translate Pd into a predicate bitvector b_d
10:   for each bitvector b_k ∈ B′ do
11:     Let b′_k be the bitvector after dimension-based filtering
12:     b′_k ← b_k ∧ b_d   {dimension-based filtering}
13:     CntArr[k] ← COUNT(b′_k)
14:   end for
15: end if
16: return COMP(CntArr, i, j, S, Op)   {see Table 4.1}

The bitmap-based approximate aggregation method is formally shown as Algorithm 5.

Given an aggregate query, all the predicates can be categorized into value-based and dimension-based predicates, denoted as Pv and Pd, respectively. The aggregate operator is

denoted as Op. Additionally, for the queried dataset, both bitmap indices B and the corresponding pre-aggregation statistics S are assumed to be available. As Algorithm 5 shows, the approximate aggregation method takes five steps. Let us now use the example in Figure 4.1, further assume the given value-based and dimension-based predicates are

Value > 3 and ID < 4, respectively, and follow the steps of the algorithm.

The first step is value-based filtering, which selects only a subset of the original bitmap indices by applying Pv. Applying this to our example, only two bins, i1 and i2, will be selected, since they overlap with the value range Value > 3.

In the second step, Pd is translated into a predicate bitvector bd, in which each 1 indicates an array element that satisfies the predicate. Here the predicate bitvector is

11110000, and it indicates that only the first 4 elements are within the dimensional range

ID < 4. Note that any dimension-based predicate over a multi-dimensional array can still be translated into a single 1-dimensional predicate bitvector. Such a translation can, however, appear more expensive for multi-dimensional arrays, because the resulting bitvector contains several distinct segments of contiguous

1s. It turns out that this step only takes a small fraction of total query processing time

(typically less than 10%), and does not negatively impact the overall performance.
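A sketch of this translation for a 2-dimensional array, assuming a row-major layout (the dimension names and shape below are illustrative):

import numpy as np

shape = (4, 6)                                         # e.g., (level, time)
level, time = np.indices(shape)                        # coordinate grids
predicate = (level >= 1) & (level <= 2) & (time < 3)   # a dimension-based predicate

bd = predicate.reshape(-1).astype(np.uint8)            # 1-D predicate bitvector
print(bd)
# The 1s form several distinct contiguous segments, because the selected
# sub-block is split across rows once the array is linearized.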

The third step is dimension-based filtering. In this step, a bitwise AND operation is performed over each bitvector in the bitmap subset and bd, such that all the 1s in the output bitvectors indicate the elements that satisfy both Pv and Pd. In this example, the two output bitvectors are 01000000 and 10010000, respectively.

The fourth step is to generate a count array CntArr by counting the number of 1s in each output bitvector. This operation can also be supported by compressed bitmap formats, i.e., no decompression will be needed. Here, the two corresponding elements in the count array are 1 and 2, respectively.

Table 4.1: Definitions of the Aggregation Interface for Different Aggregate Operators

AggOperator     COMP(countArray CntArr, lowerBound i, upperBound j, Statistics S, AggOperator Op)
MIN             S.min[k], where k is the lowest non-empty bin ID
MAX             S.max[k], where k is the highest non-empty bin ID
SUM             $\sum_{k=i}^{j} S.sum[k] \times CntArr[k] / S.count[k]$
COUNT           $\sum_{k=i}^{j} CntArr[k]$
SQUARED SUM     $\sum_{k=i}^{j} S.squaredSum[k] \times CntArr[k] / S.count[k]$
LOG SUM         $\sum_{k=i}^{j} S.logSum[k] \times CntArr[k] / S.count[k]$

Finally, letting the output bitvectors correspond to bins from the lower bound i to the upper bound j, the aggregation result can be computed by the aggregation function COMP(CntArr, i, j, S, Op), where the definitions for different aggregate operators are given in Table 4.1. S consists of the per-bin statistics, including count, sum, and squaredSum, as well as min and max.

Returning to our example and further assuming that the given aggregate operator is

SUM, from the pre-aggregation statistics, we know that the count values for the two output bins are 2 and 3, respectively, and similarly, the sum values for the same two bins are 7 and 16, respectively. According to the aggregation function for the SUM operator in Table 4.1, the aggregation result, the sum over the two bins, can be estimated as

$7 \times \frac{1}{2} + 16 \times \frac{2}{3} \approx 14.167$.

Note that this result is an approximate answer very close to the precise result, which is 14.
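The same estimate, written out as a quick numeric check of the SUM interface in Table 4.1 (values taken from the running example):

cnt_arr = [1, 2]       # elements of bins i1, i2 that also satisfy ID < 4
bin_count = [2, 3]     # COUNT statistic of bins i1, i2
bin_sum = [7, 16]      # SUM statistic of bins i1, i2

estimate = sum(s * c / n for s, c, n in zip(bin_sum, cnt_arr, bin_count))
print(round(estimate, 3))   # 14.167, versus the precise answer of 14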

A few other observations on the method can be made. If no value-based predicate is involved, the entire bitmap needs to be loaded to perform the bitwise AND operation with the predicate bitvector, leading to the worst-case approximation cost. Note that in this

case, there will be no error for the COUNT aggregation, because the number of elements

aggregated within each bin is exactly equal to the corresponding value in the count array,

which is calculated based on the accurate bitwise operations. On the other hand, if no

dimension-based predicate is involved, both dimension-based filtering and count array

generation can be avoided. Thus, only the pre-aggregation statistics stored in the metadata

are involved in the aggregation, and bitvectors stored in the index file need not be loaded.

One can view the COUNT field in the pre-aggregation statistics as the default value of

count array. In this case, our method is essentially equivalent to the techniques based on

1-dimensional histograms.

Returning to Table 4.1, the implementation of operators MIN, MAX, and COUNT is simple, whereas the remaining two operators follow the same ideas as the SUM operator.

It is also worth noting that once SUM, SQUARED SUM and COUNT are calculated, other common aggregate operators like AVERAGE, VARIANCE and STANDARD

DEVIATION can be easily derived. For instance, if we denote the results of

VARIANCE, SQUARED SUM, SUM, AVERAGE, and COUNT by var,

squaredSum, sum, avg, and count, respectively, we can have:

$$var = \frac{squaredSum - 2 \times sum \times avg + count \times avg^2}{count}$$

Further, since it can also be desirable to support user-defined aggregation functions,

pre-aggregation statistics can be customized to support them. The only constraint is that

we should be able to estimate the aggregation results based on the count array and the pre-

aggregation statistics. As a specific example, if the scientist needs to support an aggregation

function geometric mean, then we can add a user-defined pre-aggregation statistic log sum, which computes the sum of all the natural log values of the elements in each bin. For any subset of an array, we should be able to estimate the log sum in a similar fashion as other

functions in Table 4.1, and then the geometric mean can be computed as:

$$GeometricMean = \exp\left(\frac{logSum}{count}\right)$$

where exp denotes the exponential function in the natural base.
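A sketch of this user-defined extension (our own naming), reusing the running example: the log sum of the queried subset is estimated exactly like SUM in Table 4.1, and the geometric mean follows from it and the estimated count.

import math

bin_log_sum = [math.log(4) + math.log(3),                 # LOG SUM of bin i1 = {4, 3}
               math.log(5) + math.log(5) + math.log(6)]   # LOG SUM of bin i2 = {5, 5, 6}
cnt_arr, bin_count = [1, 2], [2, 3]

log_sum = sum(ls * c / n for ls, c, n in zip(bin_log_sum, cnt_arr, bin_count))
count = sum(cnt_arr)
print(round(math.exp(log_sum / count), 3))   # estimated geometric mean of the queried subset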

4.3 V-Optimized Binning

As mentioned in Section 4.2.3, the existing binning strategies may not be able to provide satisfactory accuracy compared with other approximate aggregation methods, especially when the data is skewed. In this section, we propose a novel v-optimized binning strategy for better accuracy. First, we discuss an unbiased binning strategy, where we assume that the probabilities of querying each element are equal. Second, we extend this binning strategy to a weighted variant, to further improve the accuracy when the querying probabilities are unequal and can be taken as prior knowledge.

4.3.1 Motivation

Through extensive experiments, we observed that when the data is skewed, the existing

binning strategies can lead to highly inaccurate aggregation results. This is because the

purpose of the conventional binning strategies is to reduce the sizes of bitmap indices and

thus enhance the efficiency of selection queries. Nevertheless, there may exist a prominent

difference between the real data distribution in the original dataset and the data

distribution captured by bitmap indices after the conventional binning. As mentioned in

Section 4.2.1, since the extra candidate check step can guarantee the accuracy of selection

queries, such difference is not considered by conventional binning strategies. In

comparison, to meet stringent response-time requirements, our approximate aggregation

method cannot involve the costly candidate check. Thus, a new binning strategy is needed

for reducing the difference between the two distributions, which is critical for aggregation

accuracy.

A common metric to measure the difference between the two distributions is sum squared error (SSE). Formally, let N and B be the number of elements and bins in a bitmap, respectively, and let ni and vi be the number of elements and representative

(mean) value in each bin, respectively, where $1 \le i \le B$ and $\sum_{i=1}^{B} n_i = N$. Additionally, the jth element in the ith bin is denoted as $v_{ij}$, where $1 \le j \le n_i$.

$$SSE = \sum_{i=1}^{B} \sum_{j=1}^{n_i} (v_i - v_{ij})^2 \qquad (4.1)$$

As Equation 4.1 shows, the SSE of a bitmap is the cumulative variance over all the bins. Thus, given a fixed number of bins, we propose a novel binning strategy, v-optimized, which aims to effectively reduce this SSE, i.e., the total

variance over the entire bitmap. Note that the number of bins should be fixed (i.e., space

bound), since it is critical to control the size of bitmap indices for aggregation efficiency.
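Equation 4.1 can be transcribed directly; the toy example below (our own data) simply compares the SSE of two alternative 3-bin partitions of the values from Figure 4.1:

def sse(binning):
    """binning: list of lists, each inner list holding the values assigned to one bin."""
    total = 0.0
    for values in binning:
        mean = sum(values) / len(values)               # representative (mean) value v_i
        total += sum((mean - v) ** 2 for v in values)  # within-bin squared error
    return total

print(sse([[1, 1, 2], [3, 4], [5, 5, 6]]))   # the binning of Figure 4.1
print(sse([[1, 1], [2, 3, 4], [5, 5, 6]]))   # an alternative partition with a higher SSE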

4.3.2 Unbiased V-Optimized Binning

First, we simplify the problem by assuming that all the elements are queried with the same probability. This assumption is often applied when the data is randomly queried or no prior knowledge of the querying probabilities can be provided. We take inspiration from the v-optimal histogram technique [103], because for a given number of buckets,

a space-bound v-optimal histogram can effectively minimize the SSE. Overall, there are

two major challenges of designing an effective binning strategy that maintains such v-

optimal flavor. On one hand, v-optimal histogram construction requires quadratic time in

the number of buckets [103], and even most of the near-optimal algorithms [86,88,211]

are still problematic for large-scale array data in terms of time efficiency. On the other

hand, unlike histogram construction, which divides a sequence of data ordered by a certain sort

parameter (e.g., data value, frequency or area) into contiguous segments, bitmap binning can group discrete elements, where their positions are preserved by the associated bitvector.

Thus, some alternative heuristic histogram construction methods such as MaxDiff [175] and DNS [211] cannot be applied. To address these challenges, we propose an (unbiased) v-optimized binning strategy, which aims to provide both near-optimal partitioning quality

(i.e., near-minimum SSE) and practical time efficiency (i.e., near-linear time complexity).

The v-optimized binning strategy comprises three steps: 1) initial binning: we begin with equi-depth binning, which is shown to be more skew-tolerant than equi-width binning, to set up initial bin boundaries; 2) iterative refinement: to reduce the SSE, we can refine the partitioning by iteratively adjusting bin boundaries; 3) bitvector generation: Generate a bitvector for each bin to mark the positions of all the elements within the bin.

Clearly, if two neighboring bins are merged, then the SSE will increase, and conversely, if a single bin is split into two, then the SSE will decrease. Based on this property, the partitioning can be refined by iteratively merging two neighboring bins with small SSE increment and splitting one bin with large SSE decrement. To partition the given N elements into B bins, we need to maintain two priority queues, a potential SSE increment queue Q+ of size B − 1, and a potential SSE decrement queue Q− of size B, respectively. Each element in Q+ corresponds to a bin boundary in order, and its value represents the potential SSE increment if this boundary is removed (i.e., two bins beside this boundary are merged). The minimum value has the highest priority. In comparison, each element in Q− corresponds to a bin in order, and its value indicates the potential SSE

decrement if this bin is split by further using equi-depth binning. The maximum value has the highest priority.

Algorithm 6: partition(num_elements N, num_bins B)
 1: Initialize B − 1 bin boundaries with equi-depth binning
 2: Initialize Q+ and Q−, where E+_min is the minimum element in Q+ and E−_max is the maximum element in Q−
 3: while E+_min < E−_max do
 4:   remove the ith boundary, which corresponds to E+_min
 5:   update Q+ and Q−
 6:   split the jth bin, which corresponds to E−_max, in two
 7:   if i = j then
 8:     break
 9:   end if
10:   update Q+ and Q−
11: end while

Algorithm 6 shows the first two steps of the v-optimized binning. After the initial

equi-depth binning, the two priority queues are built. At each iteration, the boundary

corresponding to the minimal potential SSE increment in Q+ is removed, the bin

corresponding to the maximal potential SSE decrement in Q− is split in two, and then the two

priority queues are accordingly updated. This step is repeated until no further move

that decreases the total SSE can be made. Finally, once the bin boundaries are finalized,

bitvectors are generated according to the positions of the elements within each bin. The

time complexity of v-optimized binning is $\Theta(N \log N + \frac{MN}{B})$, where $\Theta(N \log N)$ is the cost of the initial equi-depth binning, M is the number of iterations, and $\Theta(\frac{N}{B})$ is the cost of the merge, split, and update at each iteration. Note that since all the iterative refinements operate on the two priority queues, and no operation on the bitmap is involved during this step, the cost of such refinement is relatively low. Another minor detail is that, though the elements in a bin are non-contiguous, it only requires $\Theta(\frac{N}{B})$ time to iterate over them, because we have recorded their positions in a separate $\Theta(\frac{N}{B})$ space.

Note that since the v-optimized binning strategy is designed for approximate aggregations,

unlike equi-depth binning, it may not be able to effectively reduce the candidate

check cost in the worst case and optimize the performance of processing selection queries.

Therefore, our v-optimized strategy is not intended to replace its counterpart, but serves

as a complementary tool for aggregation tasks. Since bitmap indices are often very

compact compared with the original datasets, multiple copies of bitmap indices generated

by different binning strategies can be kept at the same time, for handling different tasks.
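To make the refinement step concrete, the following simplified sketch shows only the test at the heart of Algorithm 6, with bins represented as explicit value lists (our simplification; the actual binning works on boundaries and per-bin statistics, and iterates until no beneficial move remains):

import heapq

def bin_sse(values):
    mean = sum(values) / len(values)
    return sum((mean - v) ** 2 for v in values)

def merge_increment(left, right):   # SSE increase if the boundary between two bins is removed
    return bin_sse(left + right) - bin_sse(left) - bin_sse(right)

def split_decrement(values):        # SSE decrease if a bin is split at its equi-depth midpoint
    ordered = sorted(values)
    mid = len(ordered) // 2
    return bin_sse(ordered) - bin_sse(ordered[:mid]) - bin_sse(ordered[mid:])

bins = [[0.1, 0.2, 0.3, 0.4], [0.5, 9.0], [10.0, 11.0, 90.0, 100.0]]

q_plus = [(merge_increment(bins[i], bins[i + 1]), i) for i in range(len(bins) - 1)]
q_minus = [(-split_decrement(b), i) for i, b in enumerate(bins)]
heapq.heapify(q_plus)    # smallest potential SSE increment first
heapq.heapify(q_minus)   # stored negated, so the largest potential decrement comes first

min_increment, boundary = q_plus[0]
max_decrement, bin_id = -q_minus[0][0], q_minus[0][1]
if min_increment < max_decrement and boundary != bin_id:
    print("merge across boundary", boundary, "and split bin", bin_id, "-> total SSE drops")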

4.3.3 Weighted V-Optimized Binning

The unbiased v-optimized binning strategy holds an important assumption – the

probabilities of querying each element are equal. In the context of relational databases,

where only value-based predicates are supported, violating this assumption has only a trivial influence

on the aggregation accuracy. This is because the error is only incurred by aggregating a

subset of elements in the edge bin(s), while all the other queried bins can guarantee the

accuracy with pre-aggregation.

However, in our context, a unique challenge needs to be addressed if the assumption

does not hold. When a dimension-based predicate is applied, only a subset of the elements

within each bin are queried, and hence error can be incurred by each bin. In this case, the

querying probabilities of different data subareas should be considered in the binning strategy

for better accuracy. Thus, we extend our unbiased v-optimized binning strategy to a

weighted variant. We assume that the querying probabilities can be taken as prior

knowledge, e.g., from query logs or from domain knowledge.

First, by adapting the conventional SSE formulation, we define a new metric – the weighted sum squared error (WSSE) – to measure the difference between the real data distribution in the original dataset and the data distribution captured by bitmap indices, given the querying probabilities of all the elements. Formally, let $w_{ij}$ be the weight of the jth element within the ith bin, where $1 \le i \le B$ and $1 \le j \le n_i$. Let the value of $w_{ij}$ represent the corresponding querying probability, so that $\sum_{i=1}^{B} \sum_{j=1}^{n_i} w_{ij} = 1$. WSSE is defined as:

$$WSSE = \sum_{i=1}^{B} \sum_{j=1}^{n_i} w_{ij} \times (v'_i - v_{ij})^2 \qquad (4.2)$$

where $v'_i$ is the representative value of $b_i$.

Clearly, our weighted v-optimized binning strategy aims to effectively reduce this WSSE

that indicates the cumulative weighted variance over the entire bitmap, when not all subsets

are queried with the same probability. Compared with the unbiased binning strategy, we

need to make two modifications for such weighted variant. First, Algorithm 6 can be easily

adapted, by letting the two priority queues maintain the increments/decrements of potential

WSSE instead of SSE. Second, we need to determine the representative values of all the

bins and store them in the pre-aggregation statistics. The representative values can be

determined by Theorem 1.

Theorem 1. To minimize WSSE, for each given bin bi with ni elements, we should set its

representative value

$$v'_i = \frac{\sum_{j=1}^{n_i} w_{ij} \times v_{ij}}{\sum_{j=1}^{n_i} w_{ij}} \qquad (4.3)$$

Proof. Once the binning is finalized, since the value range of each bin $b_i$ is fixed, $n_i$ is a constant. Thus, WSSE is minimized only if each term $WSSE_i = \sum_{j=1}^{n_i} w_{ij} \times (v'_i - v_{ij})^2$ is minimized. To find the minimum, we differentiate $WSSE_i$ with respect to $v'_i$:

$$WSSE_i' = 2 \times \sum_{j=1}^{n_i} w_{ij} \times (v'_i - v_{ij}).$$

We solve $WSSE_i' = 0$:

$$2 \times \sum_{j=1}^{n_i} w_{ij} \times (v'_i - v_{ij}) = 0 \;\Leftrightarrow\; \sum_{j=1}^{n_i} w_{ij} \times v'_i - \sum_{j=1}^{n_i} w_{ij} \times v_{ij} = 0 \;\Leftrightarrow\; v'_i = \frac{\sum_{j=1}^{n_i} w_{ij} \times v_{ij}}{\sum_{j=1}^{n_i} w_{ij}}$$
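A quick numeric check of Theorem 1 for a single bin, with made-up values and weights: the weighted mean of Equation 4.3 yields a smaller per-bin WSSE than the unweighted mean used by the unbiased scheme.

values  = [1.0, 2.0, 10.0]
weights = [0.5, 0.4, 0.1]      # querying probabilities of the elements in this bin

def wsse_i(rep):
    return sum(w * (rep - v) ** 2 for w, v in zip(weights, values))

weighted_rep = sum(w * v for w, v in zip(weights, values)) / sum(weights)   # Equation 4.3
unweighted_rep = sum(values) / len(values)

print(weighted_rep, wsse_i(weighted_rep))       # 2.3, about 6.81
print(unweighted_rep, wsse_i(unweighted_rep))   # about 4.33, about 10.94 -- larger error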

4.4 Experimental Results

In this section, we evaluate both the efficiency and accuracy of our approximate aggregation approach using both real-world and synthetic datasets. We designed the experiments with two goals. The first goal is to compare both the efficiency and accuracy of our method against sampling-based and histogram-based methods. Particularly, we used stratified random sampling [58], as well as a multi-dimensional equi-depth histogram [162]. The second goal is to demonstrate the efficacy of the v-optimized binning

strategy (including both unbiased and weighted schemes), by comparing the aggregation

accuracy and the indexing costs against the two conventional binning strategies –

equi-width and equi-depth.

4.4.1 Experimental Setup

Our experiments were conducted using both real-world and synthetic datasets stored in

HDF5, one of the popular array formats. The real-world dataset is referred to as

aerosol, which has two dimensions – level and time, and it is obtained from ACOS

GOSAT/TANSO-FTS Level 2 Full Physics Standard Product Set [1]. Because the dataset size we could obtain was less than 5 GB, we created a larger dataset, which had a size of

60 GB. The synthetic dataset freq simulates a series of frequencies based on the Zipf distribution [265] with a skew parameter of 0.5, and has a size of 30 GB. It has three dimensions – lat, long, and time.

We used all the aggregate operators defined in Table 4.1 in our experiments. It turns out that the accuracy trends of COUNT, SQUARED SUM, and LOG SUM are mostly similar to SUM (because the estimations mainly depend on the accuracy of COUNT). In comparison, the accuracy of MIN and MAX is different, but similar to each other. Thus, we only report the average relative error of SUM aggregation in most of our experiments, and add results from MAX aggregation in Section 4.4.2.

In terms of the data skew of the two datasets, we divided the value domain into two ranges: 1) dense range, which takes up less than 5% of the entire value domain but contains at least 90% of the data elements, and 2) sparse range, which contains fewer than

10% of the elements but takes up over 95% of the entire value domain. Thus, we could perform aggregations over the dense and sparse ranges separately. Note that the division into ‘dense’ and

‘sparse’ can be adjusted based on the data skew, and it is easy to check if a given value range is dense or sparse, e.g., by histogram.

Moreover, recall that we had categorized all the subsetting predicates into three types in

Section 4.1.1. Because we can apply different predicate types over dense and sparse ranges separately, we divided all the queries into five types: 1) with dimension-based predicate, 2) with value-based predicates over dense range, 3) with value-based predicates over sparse range, 4) with combined predicates over dense range, and 5) with combined predicates over sparse range. For brevity, we denote the above five query types as ‘DB’, ‘VBD’, ‘VBS’, ‘CD’,

Table 4.2: Query Examples of Five Types

Type   Query Example
DB     select sum(freq) from freq.h5 where long < 200;
       select sum(aerosol) from aerosol.h5 where time > 4000000;
VBD    select sum(freq) from freq.h5 where freq < 22.1;
       select sum(aerosol) from aerosol.h5 where aerosol < 0.06;
VBS    select sum(freq) from freq.h5 where freq > 312;
       select sum(aerosol) from aerosol.h5 where aerosol > 9;
CD     select sum(freq) from freq.h5 where freq < 22.1 and lat > 500;
       select sum(aerosol) from aerosol.h5 where aerosol < 0.06 and level > 200;
CS     select sum(freq) from freq.h5 where freq > 312 and long < 500;
       select sum(aerosol) from aerosol.h5 where aerosol > 9 and level < 110;

and ‘CS’, respectively. The queried coverage, which indicates the ratio between the number of selected elements and the total number of elements in the entire dataset, was varied from

10% to 90% over the dense range. Over the sparse range, the queried coverage was varied from 0.01% to 5%, with value-based predicates, and from 0.005% to 2%, with combined predicates. For each query type, we used 90 queries on each dataset. We implemented our method on top of the precursor system [233,234], which can execute aggregate queries in

SQL over HDF5 datasets. Table 4.2 lists example queries of the five types.

When evaluating the weighted v-optimized binning, we assumed that a 25% subarea of the real-world dataset and a 50% subarea of the synthetic dataset were queried frequently.

Particularly, the ratio of querying possibilities over the frequently queried subarea and the infrequently queried subarea was 10:1. When the queried coverage was less than the total coverage of the frequently queried subarea, we assumed that only the frequently queried subarea was queried, with an exception of the queries over the sparse range, as a frequently queried subarea could not be solely formed by the sparse range.

[Figure 4.3 consists of five panels – (a) DB, (b) VBD, (c) VBS, (d) CD, and (e) CS – each plotting the average relative error (log scale) against the queried coverage for Sampling_2%, Sampling_20%, Hist1, Hist2, Equi-Depth, and the V-Optimized schemes.]

Figure 4.3: SUM Aggregation Accuracy of Different Methods on the Real-World Dataset

Our experiments were conducted on a machine which has an Intel(R) Xeon(R)

Processor with 4 dual-core CPUs (8 cores in all). The clock frequency of each core is 2.53

GHz, and the system has a 12 GB main memory.

4.4.2 Aggregation Accuracy of Different Approximate Aggregation Methods

In this section, we experiment with the real-world dataset and compare our approach against other approximate aggregation methods. From the techniques listed in

Section 4.1.2, we focused on techniques that could potentially handle both value-based and dimension-based predicates, and did not require any data reorganization. Thus, we chose two techniques to compare our method against – a popular sampling-based method, stratified random sampling [58], and a multi-dimensional equi-depth histogram-based

method [162]. Although we were also aware of a number of more sophisticated

histogram-based methods [90, 164], we could not directly compare with them, mostly

because the implementations were not available to us. However, we believe that our

comparison with the basic histogram-based method still highlights the main differences

between the histogram and bitmap based approaches.

Both sampling and histogram involve a trade-off between aggregation times and accuracy, based on sampling rate and the number of histogram buckets, respectively.

Thus, to fully explore the capabilities of these two methods, we chose two different versions as follows. For the sampling-based method, we used two different sampling rates

– 2% and 20%. In the implementation, a spatial region of size 128 MB is viewed as a stratum, and random sampling is applied to each such stratum, choosing the same fraction of data from each stratum. For the histogram-based method, the two dimensions, level and time, were treated in the same fashion as the (value-based) attribute aerosol. We also used two histograms of different sizes – the first (coarse-grained) histogram, which we referred to as HIST1, used 200, 18,000, and 200 buckets to partition the attributes level, time, and aerosol, respectively. The second, fine-grained, histogram, which we referred to as HIST2, used 200, 72,000, and 800 buckets to partition the same attributes. Particularly, HIST2 was even larger than the original dataset, due to the space consumed in storing dimension-based attributes.
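A sketch of this baseline as we understand it from the description above (the stratum size and sampling rate are parameters of the experiments; the code is illustrative, not the evaluated implementation):

import random

def stratified_sample(data, stratum_size, rate, seed=0):
    """Draw the same fraction of elements uniformly from each fixed-size spatial stratum."""
    rng = random.Random(seed)
    sample = []
    for start in range(0, len(data), stratum_size):
        stratum = data[start:start + stratum_size]
        k = max(1, int(len(stratum) * rate))
        sample.extend(rng.sample(stratum, k))
    return sample

data = list(range(1_000_000))
print(len(stratified_sample(data, stratum_size=100_000, rate=0.02)))  # 20000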

As we will show later, equi-width binning is not a viable binning strategy for skewed data, and is not used in the set of experiments reported in this subsection. 400 bins were used for bitmap indexing. We evaluated the accuracy of SUM and MAX aggregations, as described below.

[Figure 4.4 consists of five panels – (a) DB, (b) VBD, (c) VBS, (d) CD, and (e) CS – each plotting the aggregation times in seconds (log scale) against the queried coverage for Sampling_2%, Sampling_20%, Hist1, Hist2, Equi-Depth, the V-Optimized schemes, and accurate aggregation.]

Figure 4.4: SUM Aggregation Times of Different Methods on the Real-World Dataset

Table 4.3: Average Relative Error (%) of MAX Aggregation of Different Methods on the Real-World Dataset

Method \ Query Type    DB       VBD     VBS      CD      CS
Sampling 2%            67.874   2.492   58.476   0.000   86.321
Sampling 20%           28.972   0.453   42.565   0.000   54.084
Hist1                  22.012   0.000   0.000    1.672   46.895
Hist2                  19.478   0.000   0.000    0.692   36.185
Equi-Depth             24.241   0.000   0.000    1.941   42.168
V-Optimized            0.000    0.000   0.000    0.694   0.000

Accuracy of SUM Aggregations: The results are shown in Figure 4.3. We first explain

why results have not been shown for all methods for each case.

In applying sampling, one important observation was as follows. Above a certain

threshold, a higher sampling rate does not lead to a noticeably better approximation of the

spatial distribution and thus a higher accuracy, when only dimension-based predicates are

applied (DB case). Moreover, when combined predicates are involved (CD and CS cases), the overall sampling inaccuracy is still dominated by the error caused by dimension-based predicates. Thus, it turns out that for the DB, CD, and CS cases, accuracy with 2% sampling is only at most 3% worse than with 20% sampling. As a result, in

Figures 4.3(a), 4.3(d) and 4.3(e), only 20% sampling results are included.

Another note with respect to Figure 4.3 is that, there is no difference between unbiased and weighted v-optimized schemes when only valued-based predicates are involved (VBD and VBS cases). Thus, a single curve for v-optimized binning is shown for these cases.

Now, focusing on comparing our method against sampling and considering

Figures 4.3(a),

4.3(d), and 4.3(e), we can see that sampling is very inaccurate most of the time. In comparison, we can see that our bitmap-based method, with either equi-depth binning or v-optimized binning, can be significantly more accurate, because of a better approximation of the spatial distribution. Specifically, v-optimized binning gives a higher accuracy (measured by average relative error) by an average of 4,000,000x, 17x, and 80x for the DB, CD, and CS cases, respectively. However, the situation is quite different when only value-based predicates are involved. As Figures 4.3(b) and 4.3(c) show, the sampling-based method, especially with a 20% sampling rate, can achieve a higher accuracy in most cases. Note that the accuracy of our bitmap-based method, especially with v-optimized binning, is also relatively high – the relative error is normally less than 3%.

This small error is caused by the edge bin(s) that overlap with the queried value range.

The result is not surprising, since our method has been designed for handling dimension-based or combined predicates for array data, and cannot be expected to outperform existing methods when dimension-based predicates are not involved.

Next, we compare our method against the histogram-based method. As Figure 4.3(a) shows, with only dimension-based predicates, v-optimized binning is more accurate than

HIST1 and HIST2 by an average of 50,000x and 25,000x, respectively. There are two reasons for this. First, the histograms can only provide an approximate estimation of the queried subset size, whereas our method can obtain the precise queried subset size after the dimension-based filtering based on bitmaps. Second, the implementation of histograms we used partitions the value-based attribute in an equi-depth manner, whereas v-optimized binning can give a better approximation of the value distribution. As shown in Figure 4.3(b), HIST2 is very effective in dealing with value-based predicates over the

dense range, but this high accuracy should be put in the context of its high storage cost,

which is even greater than the original dataset.

Finally, in Figures 4.3(d) and 4.3(e), when combined predicates are involved, the accuracy of HIST1 is lower than equi-depth binning by an average of 189% and 101%,

respectively, because of its coarse-grained partitioning over attributes. Similarly, since

HIST2 has a finer partitioning than equi-depth binning, its accuracy is higher by an

average of 2% and 36%, respectively. Compared with v-optimized binning, the histograms

(especially HIST2) are more accurate over the dense range, but much less accurate over

the sparse range. This is because the histograms consume more buckets than v-optimized

binning for partitioning the dense range, leading to a better approximation of the dense

range. Similarly, v-optimized binning involves more bins over the sparse range, and hence

gives a finer partitioning over the sparse range. A similar conclusion can also be drawn

from the comparison between equi-depth binning and (unbiased) v-optimized binning.

Accuracy of MAX Aggregations: Table 4.3 shows the accuracy results of the MAX

aggregations. Since it turns out that the results of unbiased and weighted v-optimized

binning are not noticeably different, we do not report the results of the weighted scheme

here. Like the previous set of experiments, we only report an average relative error for

each query type. Clearly, the accuracy of the sampling is the worst, with the exception of

the CD case, which we will discuss below. We can see that v-optimized binning provides the

highest accuracy in all the cases except for CD, because of the high-quality approximation

of both the value and spatial distribution. Additionally, we can also see that the histograms

are generally more accurate than equi-depth binning, because the former partitions all the

dimension-based attributes, whereas the latter does not involve any a priori partitioning along

any bitvector.

Now, we explain why sampling is better for the CD case. The sampling-based method is conservative – it always uses the maximum value in the sampled data that satisfies all the predicates, and hence it is likely to be more accurate over the dense range, where the distribution is more uniform. In contrast, as illustrated by the aggregation interface for

MAX operator in Table 4.1, our bitmap-based method is aggressive, i.e., it always uses the maximum bin boundary value as the answer, which might be filtered out by dimension- based predicates. The histograms also work in the same aggressive manner, and have lower accuracy than the sampling only for the CD case.

[Figure 4.5 consists of five panels – (a) DB, (b) VBD, (c) VBS, (d) CD, and (e) CS – each plotting the average relative error (log scale) against the queried coverage for Equi-Width, Equi-Depth, and the Unbiased and Weighted V-Optimized binning strategies.]

Figure 4.5: SUM Aggregation Accuracy of Different Binning Strategies on the Synthetic Dataset

4.4.3 Aggregation Costs of Different Aggregation Methods

Our next experiment evaluated the aggregation costs of different approximate aggregation methods, including sampling-based, histogram-based, and our bitmap-based methods, as well as accurate aggregation (using all data elements and thus providing

100% accuracy). We experimented on the real-world dataset with the same queries used in

Section 4.4.2. The results are shown in Figure 4.4, and do not include any preprocessing

costs, i.e., the costs of obtaining sampled data, histogram construction, or bitmap

indexing. It turns out that, with an exception of the queries with combined predicates over

the dense range, there is no noticeable difference between the two v-optimized schemes.

Thus, we only report the results of the weighted scheme for the CD case (Figure 4.4(d)).

We first compare our method against the sampling with two different rates. It can be

seen that our method can outperform sampling in almost all the cases. As discussed in

Section 4.2.3, if only value-based predicates are involved, only the pre-aggregation

statistics stored in metadata are processed with our method, and when the sparse range is

queried, only a small subset of bitmap is involved in the aggregation. Thus, the only case

where our method has higher execution times than sampling with a rate of 2% is in

Figures 4.4(a) and 4.4(d). This is because when dimension-based predicates are involved

over the dense range, a substantial fraction of bitvectors are processed with our method,

whereas the size of queried subset in the sampled data can be much smaller, e.g., 0.2% of

the original dataset when the queried coverage is 10%. However, as noted earlier,

sampling method with 2% sampling rate has poor accuracy for these cases.

Next, we compare our method against the histogram-based method. Since HIST2 is

even larger than the original dataset, leading to high computation costs, we do not

compare with it here. We can make three observations as follows. First, HIST1 can

outperform our method in Figure 4.4(e), because of its coarser partitioning over the sparse

range. Second, HIST1 is slower than our method when only value-based predicates are involved (Figures 4.4(b) and 4.4(d)), mainly due to the additional costs of processing dimension-based attributes stored in the histogram. Another reason is that, our method extensively uses the bitwise operations over compressed bitvectors, which are also more efficient than the algebraic operations over histogram grids. Third, the performances of

HIST1 and our method are close in Figures 4.4(a) and 4.4(d).

Finally, we also compare the three bitmap-based methods, i.e., equi-depth binning, as well as the unbiased and weighted v-optimized binning. As shown in Figures 4.4(b) and 4.4(c), with only value-based predicates, equi-depth binning and v-optimized binning have approximately the same performance, which is also better than other methods. This is because only the pre-aggregation statistics are processed in these two cases. When dimension-based or combined predicates are involved, the aggregation costs are roughly proportional to the number of bins that overlap with the queried value range. In

Figure 4.4(a), i.e., the case when only dimension-based predicates are involved, all the bins have to be used for dimension-based filtering, and the performances of equi-depth binning and v-optimized binning are again identical. On the other hand, in Figure 4.4(d), which represents combined predicates over the dense range, equi-depth binning leads to a

finer partitioning over the dense range, and thereby has to load more bins, leading to a higher aggregation cost. Additionally, in Figure 4.4(e), the unbiased scheme marginally outperforms the weighted scheme as well as equi-depth binning, though this comes with a reduced accuracy.

Clearly, we can see that our method can lead to the lowest aggregation costs when only value-based predicates are involved. In other cases, the performance of our method is

within the same order of magnitude as the best performance, which is achieved by either sampling with a rate of 2% or HIST1. Thus, our method provides the best overall performance.

[Figure 4.6 consists of three panels – (a) Equi-Width Binning, (b) Equi-Depth Binning, and (c) V-Optimized Binning – each plotting the average relative error against the queried coverage with 50, 100, 200, 400, and 800 bins.]

Figure 4.6: SUM Aggregation Accuracy of Different Binning Strategies with Varying Number of Bins on the Synthetic Dataset

4.4.4 Aggregation Accuracy of Different Binning Strategies

To further evaluate the aggregation accuracy of different binning strategies, we

conducted two sets of experiments on the synthetic dataset, by using both a fixed number

of bins and a varying number of bins, respectively. Besides the binning strategies used in

earlier experiments, we also experimented with equi-width binning.

Using a Fixed Number of Bins: In this experiment, 400 bins were used by all the binning strategies – equi-width, equi-depth, unbiased v-optimized, and weighted v-optimized.

Figure 4.5 shows the results. Particularly, it turns out that, when no dimension-based predicate is applied or when only the sparse range is queried, the unbiased and weighted schemes are equally accurate. This is because value-based predicates can only marginally impact the aggregation accuracy (explained in Section 4.3.3), and weighting does not influence binning over the sparse range. Thus, we only show the results of the weighted scheme in Figures 4.5(a) and 4.5(d).

We can make four observations as follows. First, not surprisingly, as the queried coverage increases, the results obtained by all the binning strategies tend to be more accurate in almost all the cases. This is because generally the approximation will be better as more elements are involved in the aggregation. Second, we can see that equi-width binning is quite inaccurate in all the cases, and even more so over the sparse range. This is because equi-width binning is not good at preserving value distribution of the original dataset. Thus, we can conclude that equi-width binning is not a viable solution for aggregating skewed data.

Third, focusing on the comparison between equi-depth binning and (unbiased) v-optimized binning, we find that when only dimension-based predicates are involved, v-optimized binning is more accurate than equi-depth binning by an average of 1,000x.

This is because v-optimized binning can effectively reduce the average error over the

entire value domain. In addition, we also observe that when only the sparse range is

queried, the accuracy of v-optimized binning is also higher by an average of 10x, while

equi-depth is slightly more accurate over the dense range. As also explained earlier in

Section 4.4.2 when the histograms and v-optimized binning are compared, this is because

v-optimized binning consumes more bins over the sparse range, and fewer bins over the

dense range.

Lastly, we compare the two schemes of v-optimized binning. As Figures 4.5(a) and 4.5(d) show, when the queried subareas are mostly covered by the frequently queried subarea, the weighted scheme can achieve better accuracy than the unbiased scheme, especially when the queried coverage is close to the portion of frequently queried subarea in the entire dataset (50% in this experiment). Particularly, we can see that the weighted scheme is consistently more accurate until the queried coverage is over 70%. Once the queried coverage is over 50% and infrequently queried subarea is involved, the accuracy difference between the two schemes begins to narrow down. This is because the weighted scheme has a better approximation over the frequently queried subarea, whereas the unbiased scheme is better at approximating the infrequently queried subarea as well as the entire domain. Overall, the weighted scheme can be taken as a tradeoff of the accuracy over the frequently and infrequently queried subarea.

Using a Varying Number of Bins: In the next experiment, we evaluated the aggregation accuracy by varying the number of bins from 50 to 800, with combined predicates over both dense and sparse ranges. Here we only report the results over the sparse range, since it turns out that the results over both sparse and dense ranges lead to the same conclusion.

Additionally, recall that weighting does not influence binning over the sparse range. Thus,

the performance difference between the unbiased and weighted schemes is not noticeable,

and we only report the results of the unbiased scheme.

Figure 4.6 shows the results of the three different binning strategies separately. We can see that, for equi-width binning, the accuracy improvement with the growing number of bins is trivial, because equi-width binning is very unlikely to utilize the increased number of bins to better approximate the value distribution. In contrast, for equi-depth binning and v-optimized binning, increasing the number of bins can lead to noticeable accuracy improvement. Particularly, such improvement is very significant for v-optimized binning before the number of bins reaches 400.

4.4.5 Indexing Costs of Different Binning Strategies

[Figure 4.7 plots the index creation times in seconds against the number of bins (50 to 800) for Equi-Width, Equi-Depth, Unbiased V-Optimized, and Weighted V-Optimized binning.]

Figure 4.7: Index Creation Times with Different Binning Strategies (Varying Number of Bins, Real-World Dataset)

This experiment evaluated the indexing times and storage costs with different binning strategies using the real-world dataset. The number of bins used was varied from 50 to 800 for all schemes. The results are shown in Figures 4.7 and 4.8. The compression algorithm used was the Word-Aligned Hybrid method [243,244].

[Figure 4.8 plots the sizes of the bitmap indices in GB against the number of bins (50 to 800) for Equi-Width, Equi-Depth, Unbiased V-Optimized, and Weighted V-Optimized binning.]

Figure 4.8: Space Requirements of Bitmaps with Different Binning Strategies (Varying Number of Bins, Real-World Dataset)

First, as expected, both the indexing times and the space costs of all the binning strategies increase with a growing number of bins, which is due to the increasing computation and the larger number of bins, respectively. Second, the indexing times of both equi-depth binning and v-optimized binning are 3x - 7x those of equi-width binning. This is because equi-width binning only requires a single pass over the dataset once the value domain is known, whereas equi-depth binning involves the overheads of either sorting the data or dynamically adjusting the bin boundaries, and v-optimized binning iteratively refines the output of equi-depth binning. The indexing times of unbiased and weighted v-optimized binning are higher than equi-depth binning with an equal number of bins by only an average of 15% and 18%, respectively. As explained in Section 4.3.2, this is because the iterative refinement only operates on two small priority queues, and is not very expensive.

However, the index creation time advantage of equi-width binning should be put in the context of accuracy results reported earlier, which demonstrated that it is not a viable binning strategy for approximate aggregation.

The size of the bitmap after compression depends not only on the number of bins used, but also on the compression ratio achieved. It turns out that the binning strategy impacts

compression as well. The space costs with equi-width binning are much lower than those of the other

schemes, because most of the bins are empty when equi-width binning is applied to skewed

data, allowing better compression. Again, this advantage does not necessarily imply that

this scheme should be preferred, because the aggregation accuracy is really poor. For the

other three schemes, we believe that a better approximation (i.e., a smaller SSE) can also

lead to a better compression ratio, and this is the reason why the space requirements of

v-optimized binning are lower, especially with a smaller number of bins. This turns out to

be another advantage of v-optimized binning over equi-depth binning. For instance, with

200 bins, the size of the bitmap with unbiased v-optimized binning is only 15% of the size

of the original dataset, and only 29% of the size with equi-depth binning.

Compared to bitmap indices, the construction times of HIST1 and HIST2 (in

Section 4.4.2) are 1875 and 5972 seconds, respectively. Moreover, the sizes of HIST1

and HIST2 are 5.38 and 86 GB, respectively. As shown by the previous experiments,

although HIST1 is relatively small, it cannot provide satisfactory aggregation accuracy,

while HIST2 is often impractical due to the extremely large size. In comparison, bitmap indices can not only provide a higher aggregation accuracy, but also have a storage advantage. Although bitmap indexing is more expensive due to the bitvector construction, this disadvantage can often be well justified by frequent query processing (e.g., interactive data analysis).

4.4.6 Discussion

Our evaluation has been with commonly available scientific datasets. As is often the case with scientific datasets, these are not high-dimensional. However, this does not imply that our approach cannot work with high-dimensional arrays. Since we are not

computing distances or correlations across dimensions, a high-dimensional array can be

mapped to a 1-dimensional array, and a dimension-based predicate involving an arbitrary

number of dimensions can still be represented as a bitvector.

4.5 Related Work

We now compare our work with the existing efforts on approximate query processing in the context of both relational databases and scientific data management, and the efforts closely related to bitmap indexing.

As stated previously, the existing approaches for approximate query processing can be mainly categorized into three types, i.e., the sampling-based, histogram-based, and wavelet-based techniques. The most popular sampling methods include simple random sampling, which randomly selects a certain percent of elements out of original dataset, and stratified random sampling [58], which first divides the dataset into strata and then performs random sampling within each stratum. We have already compared the efficiency and accuracy of our method over stratified random sampling, treating different spatial blocks as strata. Methods for improving sampling accuracy over skewed data have been extensively studied in the context of relational databases [29, 46–48, 105, 106, 110].

Another set of more recent efforts has concentrated on sampling in the context of scientific data management [171, 222, 239]. However, to the best of our knowledge, existing sampling-based methods either are not effective at preserving both the spatial and value distributions of the original dataset, or (like KD-tree based stratified sampling [239]) require reorganizing the data, which can be prohibitively expensive for large-scale datasets.

Various histogram construction algorithms have been proposed in the context of relational databases [41, 86–88, 101, 103, 175, 211]. As mentioned earlier, when no

122 dimension-based predicate is involved, our approach produces approximate answers solely based on the pre-aggregation statistics stored in the metadata, and the structure of these statistics is equivalent to a 1-dimensional histogram. Such histograms, however, are not capable of preserving spatial distribution. Multi-dimensional histograms [90, 162] may serve as an alterative solution, and can be applied to array data by treating dimensions as additional attributes. Our experiments have shown that this method is less efficient and less accurate than our method. Also, multi-dimensional histograms have primarily been used for selectivity estimation [135, 176] (an internal step in query optimization) as opposed to user aggregations in the past. Buccafurri et al. [41] focused on the problem of improving the frequency estimation inside each histogram bucket with

fixed boundaries. Lastly, wavelets [45, 153, 220, 221] have been devised mainly for processing approximate range-sum queries over OLAP data cubes. This approach cannot efficiently process value-based predicates in an array view, and there are also limits on the aggregate operators that can be supported when wavelets are used to summarize data.

Online aggregation [30,94,134] is another closely related area. The idea is to provide a series of approximate answers for aggregate queries and incrementally improve the accuracy as the data scan proceeds. Again, we are not aware of online aggregation for array data. Among the techniques used, MRA-tree [134] is a generic tree structure with aggregation statistics for all its indexed nodes, and can be built on top of any tree index.

Thus, an MRA-tree can preserve either spatial distribution (e.g., using a quadtree or a K-D-B tree) or value distribution (e.g., based on an R-tree), but expensive data reorganization is required in each case. More recent work has focused on online aggregation in the context of MapReduce [59, 60, 170].

Although bitmap indexing was initially proposed in the context of data warehouses [169, 247, 248], it has recently been widely applied in the area of scientific data management [56, 194, 195, 202–204, 206, 242]. To improve query efficiency, different binning strategies [78, 182, 197, 246] and encoding methods [122] have been proposed over time. Unlike our v-optimized binning strategy, which is better at reducing the variance between the real data distribution and its bitmap-based approximation, the most recent binning strategies are mainly designed for reducing the cost of candidate checks [182] and accelerating selection queries. In other recent work from our group, we have applied our aggregation approach to a number of scientific analytics tasks, including correlation mining [205], subgroup discovery [236], and contrast set mining [264] – however, the underlying algorithms are different in each case.

4.6 Summary

This work has described a novel approximate aggregation approach over array data, leveraging the fact that bitmap indices can serve as a summary structure that preserves both the spatial and the value distribution of the data. Moreover, existing compression algorithms for bitvectors make bitmaps a very compact structure, and fast bitwise operations allow efficient processing of queries involving dimension-based predicates. Our aggregation method, in conjunction with a novel binning strategy, enables high accuracy and low processing costs.

Chapter 5: SciSD: Novel Subgroup Discovery over Scientific Datasets Using Bitmap Indices

In the previous chapter, we have shown that flexible aggregations over array data can be approximated based on bitmap indexing, without involving any raw data. Motivated by this work, we find that bitmap indices also have the potential to handle a number of data mining tasks that involve frequent aggregations. In this chapter, we focus on a data mining task that can be broadly applied to scientific exploration – subgroup discovery.

Particularly, we develop an efficient subgroup discovery algorithm, which directly operates on the compact bitmap indices instead of the raw datasets, and utilizes fast bitwise operations on them, allowing processing of larger datasets.

5.1 Subgroup Discovery and Array Data

We first introduce the formal definition of subgroup discovery. Next, we introduce an example involving arrays, and extend the definition of subgroups to apply to array data.

Afterwards, to process numeric attributes in array data, we discuss the quality function we choose and subgroup combination.

5.1.1 Preliminaries

A comprehensive overview of subgroup discovery can be found in [95]. Subgroup discovery is a combination of predictive and descriptive induction, which focuses on the extraction of relations, with respect to the property of interest given by a target variable.

Unlike classification techniques, which address classification and prediction problems, subgroup discovery describes local patterns of the data in the form of individual rules. As formulated by Gamberger et al. [73] and Lavrač et al. [133], a rule (R) that consists of an induced subgroup description can be formally defined as:

R : Cond → Target_val    (5.1)

where Target_val is a value with respect to the property of interest given by a target variable, and Cond is commonly a conjunction of attribute-value pairs given by several explaining variables. A subgroup discovery task mainly comprises four elements: the subgroup description language, the quality function, the target variable, and the search strategy, described in the following paragraphs. In our description, Ω_A denotes the set of all attributes, and for each attribute a ∈ Ω_A, a value domain dom(a) is defined.

Description 1 (Subgroup Description). A subgroup description is defined as a conjunction of attribute-value pairs, where for each attribute-value pair a_i = v_i, a_i ∈ Ω_A and v_i ∈ dom(a_i).

An important measure associated with a subgroup is the quality, which measures the

“interestingness” of the subgroup, and is used for extracting and ranking the rules. There is no consensus on the best quality function to be used for evaluating the interestingness of a subgroup.

Description 2 (Quality Function). Let Ω_sd denote the universal set of all possible subgroup descriptions. Given a particular target variable t ∈ Ω_A, a quality function q is a mapping q : Ω_sd × dom(t) → ℝ, where ℝ is the space of real numbers denoting the quality score.

The attributes, including both the explaining variables and the target variable, can be binary, categorical, or numeric. For processing numeric attributes, discretization is often utilized to divide a continuous value range into multiple discrete intervals.

5.1.2 An Example Involving Arrays

Our focus is on subgroup discovery over array-based datasets. We use the example shown in Figure 5.1 to illustrate the idea. This example involves three 2-dimensional arrays A, B, and T, where T indicates the target variable, and A and B are the two explaining variables. The two dimensions are indexed from 0 to 4, and denoted as i and j, respectively. One can view this dataset as a relational table with 25 records, where each record has three attributes.

Our goal is to identify interesting subsets (or subgroups) of T such that each subset is significantly different from the entire array T, which has a size of 25 and a mean value of

3.6. Each subgroup is evaluated by a quality score that is based on a combination of two metrics: subgroup size (or support) and the difference with respect to the mean between the subgroup and the entire array. One can identify several subgroups of interest from the target array T.

The identified subgroups can be described by: 1) dimensional ranges of i and/or j,

2) value ranges of A and/or B, and 3) both value ranges and dimensional ranges. Subgroups 1 and 2, which belong to the first type, simply select the first two rows and the first two columns of the array T, respectively.

Input arrays (5 × 5, with i and j indexed from 0 to 4):
A = [[3,7,3,6,9], [2,0,3,0,2], [1,7,2,2,7], [9,2,9,3,1], [9,1,4,8,5]]
B = [[6,5,5,2,1], [6,9,7,6,6], [8,9,0,3,6], [2,8,7,6,2], [3,9,7,4,0]]
T = [[3,2,9,8,9], [4,3,5,9,7], [0,2,6,0,5], [1,0,5,0,2], [1,1,1,4,3]]
T.Size = 25, T.Mean = 3.6

Example output subgroups:
Subgroup 1: 0 ≤ i ≤ 1 (Support = 0.40, Mean = 5.90, Quality = 0.92)
Subgroup 2: 0 ≤ j ≤ 1 (Support = 0.40, Mean = 1.70, Quality = -0.76)
Subgroup 3: 1 ≤ i ≤ 2 AND 6 ≤ B[i][j] ≤ 7 (Support = 0.20, Mean = 6.00, Quality = 0.48)
Subgroup 4: 1 ≤ A[i][j] ≤ 2 (Support = 0.32, Mean = 2.50, Quality = -0.35)

Figure 5.1: A Running Example of Subgroup Discovery

As an example of the second type, Subgroup 4 comprises the elements of T where the corresponding elements of A have a value between 1 and 2. Subgroup 3 exemplifies the third type, selecting the elements of T within the second and third rows where the corresponding elements of B have a value between 6 and 7. In each subgroup of interest, the mean of T within the subgroup is either significantly higher or lower than the mean of the entire array, and is accordingly associated with a positive or negative quality score.
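The statistics reported for these subgroups can be reproduced in a few lines of NumPy; the sketch below (our own check, with the array values transcribed from Figure 5.1) evaluates Subgroup 4, computing its quality as support × (subgroup mean − overall mean), which matches the figure:

```python
import numpy as np

# Arrays A and T transcribed from Figure 5.1.
A = np.array([[3,7,3,6,9],[2,0,3,0,2],[1,7,2,2,7],[9,2,9,3,1],[9,1,4,8,5]])
T = np.array([[3,2,9,8,9],[4,3,5,9,7],[0,2,6,0,5],[1,0,5,0,2],[1,1,1,4,3]])

# Subgroup 4: elements of T whose corresponding A value lies in [1, 2].
member = (A >= 1) & (A <= 2)

support = member.sum() / T.size           # 8 / 25 = 0.32
mean = T[member].mean()                   # 2.5
quality = support * (mean - T.mean())     # 0.32 * (2.5 - 3.6) = -0.352

print(support, mean, round(quality, 2))   # 0.32 2.5 -0.35
```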

5.1.3 Extended Subgroup Description

Compared with the relational data model that conventional subgroup discovery algorithms operate on, one key difference of array-based scientific datasets is that every array element is explicitly indexed by an array subscript or coordinate value in each dimension. In practice, an interesting subgroup (i.e., data subset) may also correspond to a particular spatial subarea. For example, within a 3-dimensional space modeled by longitude, latitude, and depth, a target variable such as salinity may not only be influenced by value-based attributes like pressure and temperature, which correspond to different arrays with the same dimensional layout, but may also be correlated with dimension-based attributes like latitude and depth that index the array dimensions.

Formally, we categorize all the attributes possibly involved in the subgroup description into two types: dimension-based and value-based. Let Ω_A^D and Ω_A^V denote the set of all dimension-based and value-based attributes, respectively. Each dimension-based attribute a^d indicates an array dimension, and a corresponding dimensional domain dom(a^d) is defined (based on dimension indices or coordinate values). Each value-based attribute a^v indicates an array name, and a corresponding value domain dom(a^v) is defined based on the element values in the array. For a dimension-based attribute a_i^d, from dom(a_i^d) we can obtain the set of all possible dimensional intervals R_i^d – each interval r_i^d comprises a set of contiguous values from dom(a_i^d). Similarly, for a value-based attribute a_i^v, from dom(a_i^v) we can obtain the set of all possible value intervals R_i^v, where each interval r_i^v comprises a set of contiguous values from dom(a_i^v). Let R^D and R^V be the universal sets of all possible dimensional intervals and value intervals, respectively.

Description 3 (Extended Subgroup Description). A subgroup description is defined as a conjunction of attribute-range (or attribute-interval) pairs, where each attribute can be either dimension-based or value-based. Each dimension-based attribute-range pair is of the form a_i^d = r_i^d, where a_i^d ∈ Ω_A^D and r_i^d ∈ R_i^d, and each value-based attribute-range pair is of the form a_i^v = r_i^v, where a_i^v ∈ Ω_A^V and r_i^v ∈ R_i^v.
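A natural in-memory representation of such an extended description (a sketch of our own, not the system's actual data structures) is a list of attribute-range pairs, each tagged as dimension-based or value-based; the sample ranges below are borrowed from Table 5.2:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class AttributeRange:
    name: str                        # attribute name, e.g. "depth" or "AOU"
    kind: str                        # "dimension" or "value"
    interval: Tuple[float, float]    # a contiguous range [lo, hi]

@dataclass
class SubgroupDescription:
    pairs: List[AttributeRange]      # interpreted as a conjunction

    def describe(self) -> str:
        return " AND ".join(
            f"{p.interval[0]} <= {p.name} <= {p.interval[1]}" for p in self.pairs
        )

# A subgroup combining a dimension-based and a value-based attribute-range pair.
sg = SubgroupDescription([
    AttributeRange("depth", "dimension", (0, 200)),
    AttributeRange("AOU", "value", (-0.43, 1.95)),
])
print(sg.describe())    # 0 <= depth <= 200 AND -0.43 <= AOU <= 1.95
```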

5.1.4 Quality Function and Subgroup Combination

The initial subgroup discovery algorithms [119, 240] were proposed to facilitate data exploration in the medical domain, and most existing subgroup discovery algorithms [64, 82, 115, 133, 254] mainly consider binary and categorical attributes. Some of these algorithms process numeric attributes by a straightforward discretization, i.e., creating a few intervals and then applying a quality function and search strategy designed for categorical attributes. Our initial attempts at applying existing algorithms to scientific simulation datasets showed that such strategies are not adequate. Thus, to process numeric attributes, we take two major steps besides discretization.

Quality Function: Most quality functions used for subgroup discovery, such as interest,

novelty, significance, specificity, and Weighted Relative Accuracy (WRAcc) [95], are only

applicable to binary or categorical attributes. By contrast, the quality function applicable

to numeric attributes usually involves weighting and aggregation over a subgroup. In our work, we consider the mean with respect to the target variable as the property of interest,

although other statistics can also be considered. As also used in [27], the quality function

we use is referred to as Continuous Weighted Relative Accuracy (CWRAcc):

q_CWRAcc = (n / N) × (m − M)    (5.2)

where n and N denote the subgroup size and the general population size, respectively, and m and M denote the mean with respect to the target variable in the subgroup and in the general population, respectively. Note that the quality computed by this function can be either positive or negative, indicating that m is greater than or less than M, respectively.
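Equation 5.2 translates directly into code. The helper below is a small sketch of our own (not the system's implementation), applied to Subgroup 1 of Figure 5.1 as a consistency check:

```python
import numpy as np

def cwracc(subgroup_values: np.ndarray, population_values: np.ndarray) -> float:
    """Continuous Weighted Relative Accuracy (Equation 5.2): q = (n / N) * (m - M)."""
    n, N = subgroup_values.size, population_values.size
    m, M = subgroup_values.mean(), population_values.mean()
    return (n / N) * (m - M)

# Subgroup 1 of Figure 5.1 is the first two rows of the target array T.
T = np.array([[3,2,9,8,9],[4,3,5,9,7],[0,2,6,0,5],[1,0,5,0,2],[1,1,1,4,3]])
print(round(cwracc(T[:2].ravel(), T.ravel()), 2))    # 0.92
```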

Subgroup Combination: Unlike the cases where one is processing only binary or

categorical attributes, it is often reasonable to combine two subgroups concerning the

same attribute but with adjacent value ranges, based on a set of combination rules.

Otherwise, it is very likely that we will have a massive number of subgroups, of which the

majority can be further merged into the ones of greater generality. For example, two

adjacent ranges 11 ≤ depth ≤ 30 and 31 ≤ depth ≤ 50 may be combined into a larger range 11 ≤ depth ≤ 50. The combination rules we use will be discussed in detail in Section 5.3.2.

5.2 SciSD Algorithm

This section describes the design of our SciSD algorithm. Our algorithm extensively uses bitmap indices as a summary and efficient representation of data, instead of the actual array data. However, to simplify the presentation in this section, the algorithm will be presented as if it is operating on the array data. Use of bitmap indices to accelerate the algorithm will be described in Section 5.4.

5.2.1 Tight Optimistic Estimates

Before discussing the search strategy used in the algorithm, we first introduce the notion of optimistic estimates. Optimistic estimates [82] aim to safely prune the search space, especially when monotonicity with respect to the property of interest cannot be applied.

An optimistic estimate of a given subgroup computes an upper bound of the quality, i.e., a maximal quality that can never be exceeded by any subset of this subgroup. A tight optimistic estimate means that there exists a subset of the given subgroup whose quality reaches the estimated value. Note that such a subset with the maximal quality does not necessarily correspond to a subgroup description [82].

Given a subgroup sg, we denote the quality of the subgroup as q(sg). Further, we denote by sg_pos the subset comprising all elements whose values are greater than the mean of sg. Similarly, we denote by sg_neg the subset comprising all elements whose values are lower than the mean of sg.

Theorem 2. For the Continuous Weighted Relative Accuracy quality function and a given subgroup sg, the tight optimistic estimate oe(sg) can be formulated as:

oe(sg) = q(sg_pos) if q(sg) > 0, and oe(sg) = q(sg_neg) if q(sg) < 0    (5.3)

Theorem 2 has been proved by Atzmüller et al. [27]. Our search strategy will leverage optimistic estimates to safely prune the search space, as described in the next subsection.
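A minimal sketch of Equation 5.3 (our own helper functions, operating on plain value arrays rather than bitmaps) evaluates the tight optimistic estimate by restricting a subgroup to the elements above or below its own mean, depending on the sign of its quality:

```python
import numpy as np

def cwracc(sub: np.ndarray, pop: np.ndarray) -> float:
    return (sub.size / pop.size) * (sub.mean() - pop.mean())

def tight_optimistic_estimate(sub: np.ndarray, pop: np.ndarray) -> float:
    """Equation 5.3: q(sg_pos) when q(sg) > 0, q(sg_neg) when q(sg) < 0."""
    q = cwracc(sub, pop)
    best = sub[sub > sub.mean()] if q > 0 else sub[sub < sub.mean()]
    return cwracc(best, pop) if best.size else q

pop = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
sub = np.array([2.0, 5.0, 6.0])
print(round(cwracc(sub, pop), 3))                     # 0.417
print(round(tight_optimistic_estimate(sub, pop), 3))  # 0.667, never exceeded by any subset
```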

5.2.2 Search Strategy

Our SciSD algorithm performs an exhaustive search, i.e., the space of all possible subgroups needs to be explored. Clearly, the search space is exponential in the number of attribute-range pairs. Therefore, we use certain pruning measures as well as binning of both dimensional and value ranges to reduce the computation costs. The search for subgroups is based on a set-enumeration tree [34], which ensures that every node is either visited only once or not visited at all (if it is pruned). The equivalence between set-enumeration trees and canonical orderings has been proved in the literature [181]. Thus, a simple rule for creating the tree is to generate children nodes by appending only those attributes that follow all

existing attributes in a given ordering. Both dimension-based and value-based attributes are treated equally in such an ordering.

Level 0: ∅
Level 1: A, B, i, j
Level 2: A ∧ B, A ∧ i, A ∧ j, B ∧ i, B ∧ j, i ∧ j
Level 3: A ∧ B ∧ i, A ∧ B ∧ j, A ∧ i ∧ j, B ∧ i ∧ j
Level 4: A ∧ B ∧ i ∧ j

Figure 5.2: An Example of Set-Enumeration Tree

Figure 5.2 shows an example of set-enumeration tree corresponding to Figure 5.1 in

Section 5.1. To keep the illustration simple, subgroup combination is not considered here.

Moreover, let us assume the dimensional space of input arrays is only 2 × 2. Thus, both

attributes i and j have two possible distinct values, 0 and 1. In this tree, each node indicates not one but a set of subgroup candidates: the set associated with a node comprises the conjunctions over all distinct ranges of the attributes that the node involves. For example, the node i ∧ j indicates the subgroup candidates i = 0 ∧ j = 0,

i = 0 ∧ j = 1, i = 1 ∧ j = 0, and i = 1 ∧ j = 1, and candidates generated by any

combination of these. Ordering is maintained while generating child nodes in the tree; e.g., the node B can have the children B ∧ i and B ∧ j, but not B ∧ A. For our explanation, we use the terms parent subgroup and child subgroup – for illustration, A = 1 is the parent subgroup of A = 1 ∧ B = 2, which is its child subgroup.
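The child-generation rule can be sketched in a few lines (an illustrative helper of our own, not the actual implementation): a node may only be extended with attributes that come strictly after its last attribute in the fixed ordering, which guarantees that each node is generated at most once.

```python
ATTRS = ["A", "B", "i", "j"]        # a fixed (canonical) attribute ordering

def children(node):
    """Children of a set-enumeration tree node (a tuple of attributes):
    only attributes after the node's last attribute may be appended."""
    start = ATTRS.index(node[-1]) + 1 if node else 0
    return [node + (a,) for a in ATTRS[start:]]

print(children(()))          # [('A',), ('B',), ('i',), ('j',)]
print(children(("B",)))      # [('B', 'i'), ('B', 'j')]  -- never ('B', 'A')
print(children(("A", "B")))  # [('A', 'B', 'i'), ('A', 'B', 'j')]
```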

Algorithm 7: SciSD(dim-based attribute universal set Ω_A^D, value-based attribute universal set Ω_A^V, dimensional-range universal set R^D, value-range universal set R^V)
1: Let Q be a queue for searching subgroup candidates
2: Let Q′ be a queue for output subgroups
3: Push the root node (representing the general population) into Q {this node will be pruned later}
4: while Q is not empty do
5:   Let sg_f be the first subgroup in Q
6:   for each attribute a_i ∈ Ω_A^D or Ω_A^V, where a_i follows the last attribute in the description of sg_f in the given ordering do
7:     for each range r_j ∈ R_i^D or R_i^V do
8:       Let sg_l be the last subgroup in Q {produce child subgroup candidates}
9:       Let sg_k be the child subgroup candidate of sg_f, formed by adding a new attribute-range pair a_i = r_j
10:      if sg_l can be combined with sg_k then
11:        Update sg_l by combining it with sg_k
12:        Let sg_l′ be the last subgroup in Q′
13:        sg_l′ ← sg_l
14:      else if oe(sg_k) is not pruned then
15:        Push sg_k into Q
16:        if sg_k is not pruned then
17:          Push sg_k into Q′
18:        end if
19:      end if
20:    end for
21:  end for
22: end while
23: return Q′

Algorithm 7 shows our search strategy. We use two queues – subgroup candidate queue Q and the output subgroup queue Q′ (lines 1 and 2). Q stores the subgroups that can potentially be ‘interesting’ but require further validation, and Q′ stores ‘interesting’

subgroups after validation, i.e., the ones that are going to be a part of the output. Our

search traverses the set-enumeration tree level by level, starting from the root node that

represents the general population without any subsetting (line 3). To identify attribute-range pairs that can contribute to ‘interesting’ subgroups, lines 6 and 7 iterate over each attribute in a given ordering (e.g., the canonical ordering) and over each range in order, respectively. Each subgroup candidate in the subgroup candidate queue Q can produce a set of child subgroups by appending another attribute-range pair, where the appended attribute follows the last attribute in its parent subgroup (line 9).

For each produced child subgroup, we first consider subgroup combination (lines 10 to

13). Line 10 attempts to combine adjacent subgroup candidates at the same level according to certain combination rules, which will be discussed in Section 5.3.2. Such subgroup combination is only considered between the last subgroup candidate in the subgroup candidate queue Q and the most recently produced child subgroup.

Note that to prevent an over-pruning situation, i.e., a situation where two subgroups should not be pruned but their combined subgroup is pruned, we need to ensure that once two subgroups are combined, the combined subgroup will not be pruned. Lines 11 to 13 update the last subgroup in both queues Q and Q′ after combination.

Subgroup pruning occurs if the tentative subgroup combination fails (lines 14 to 19).

Similar to subgroup combination, a pruning test is applied based on certain pruning measures, which will be discussed in Section 5.3.1. The pruning process includes two steps. First, we check if the optimistic estimate of the subgroup can pass the pruning test

(line 14). If so, the subgroup is viewed as a subgroup candidate, which is added to the subgroup candidate queue Q. By adding it to the queue Q, we can check if any of its child subgroups can be a qualified subgroup as the search proceeds (line 15). Otherwise, this

subgroup can be safely pruned because of the property of the optimistic estimate, and its

child subgroups are not evaluated any further. If a subgroup qualifies because of the value

of the optimistic estimate, we apply the same pruning test to the subgroup itself (line 16).

Only if the subgroup still qualifies is it added to the output subgroup queue Q′ as an

output subgroup (line 17). In practice, all the output subgroups can be ranked according to the quality function.

5.3 Subgroup Pruning and Combination

In our algorithm, efficiency is maintained by subgroup pruning based on certain pruning measures, as well as subgroup combination based on a set of combination rules, both of which are explained in this section. Due to space limits, the proofs of all theorems can be found in an extended version [236].

5.3.1 Pruning Measures

Our algorithm uses three pruning measures to determine whether a subgroup should be pruned. Two of them are relatively straightforward – a minimum support (or size) of the subgroup and a minimum quality threshold – since a subgroup of interest should be large enough and significantly different from the general population.

Pruning Measure 1 (Minimum Support). Given a subgroup sg with sup(sg) = n / N, where n and N are the sizes of sg and the general population, respectively, if sup(sg) < sup_min, sg is pruned.

Pruning Measure 2 (Minimum Absolute Quality). Given a subgroup sg, if |q(sg)| < q_min, sg is pruned.

The third pruning measure we use is as follows – compared with its parent subgroup, a child subgroup should be considered only if it improves upon the quality of its parent (in the direction of the parent's sign). The motivation is that if such a pruning rule is not used, a large number of small and redundant subgroups will be generated by the algorithm.

Pruning Measure 3 (Minimum Relative Quality). Given a subgroup sg_parent and its child subgroup sg_child, sg_child is of interest only if

q(sg_child) − q(sg_parent) ≥ q_min^r, if q(sg_parent) > 0
q(sg_parent) − q(sg_child) ≥ q_min^r, if q(sg_parent) < 0

According to Algorithm 7, for the subgroup candidates that involve multiple attribute-range pairs, their non-root parent subgroups must have already satisfied Pruning Measure 2. By Theorem 3, Pruning Measure 3 can therefore be considered a sufficient condition of Pruning Measure 2 for these subgroup candidates.

Theorem 3. Given a parent subgroup sg_parent that satisfies Pruning Measure 2, and its child subgroup sg_child, if both sg_parent and sg_child satisfy Pruning Measure 3, then sg_child also satisfies Pruning Measure 2.
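Taken together, the three measures amount to a simple screening predicate. The function below is our own illustration rather than the system's code; the default thresholds mirror the settings reported later in Section 5.5.3:

```python
def passes_pruning(sup, q, q_parent, sup_min=0.01, q_min=0.04, q_min_rel=0.001):
    """Screen a subgroup with support `sup`, quality `q`, and parent quality
    `q_parent` against Pruning Measures 1-3."""
    if sup < sup_min:                                  # Measure 1: minimum support
        return False
    if abs(q) < q_min:                                 # Measure 2: minimum absolute quality
        return False
    if q_parent > 0 and q - q_parent < q_min_rel:      # Measure 3: minimum relative quality
        return False
    if q_parent < 0 and q_parent - q < q_min_rel:
        return False
    return True

print(passes_pruning(sup=0.20, q=0.48, q_parent=0.30))   # True
print(passes_pruning(sup=0.20, q=0.25, q_parent=0.30))   # False: no quality gain over parent
```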

5.3.2 Combination Rules

Three combination rules are defined to determine if two subgroups can be combined or

not – all three conditions must be met for two subgroups to be combined. First, intuitively,

only two subgroups that belong to the same parent subgroup and are adjacent at the same level in the set-enumeration tree can be combined. In other words, one of the

prerequisites of subgroup combinability is the existence of adjacency.

Combination Rule 1 (Existence of Adjacency). Given two subgroups described by two conjunctions of attribute-range pairs a_1 = r_1 ∧ a_2 = r_2 ∧ ... ∧ a_n = r_n and a_1′ = r_1′ ∧ a_2′ = r_2′ ∧ ... ∧ a_n′ = r_n′, respectively, the two subgroups are considered adjacent and can be combined only if ∀i, 1 ≤ i ≤ n : a_i = a_i′, ∀i, 1 ≤ i < n : r_i = r_i′, and r_n and r_n′ are two adjacent dimensional ranges in R_n^D or value ranges in R_n^V.

Second, as also mentioned in Section 5.2.2, even if two adjacent subgroups each pass the pruning test, the combined subgroup may still be pruned due to its low quality. This should be avoided. Therefore, it is necessary to set a combination rule ensuring that the combined subgroup passes the pruning test, i.e., one combination rule should be a sufficient condition for satisfying all three pruning measures defined in Section 5.3.1. Note that, as Algorithm 7 shows, one of the subgroups used for combination comes from the output subgroup queue, and hence has already passed the pruning test.

Here, we propose an easy-to-compute criterion referred to as difference homogeneity, which is defined as Combination Rule 2. A positive value of difference homogeneity indicates that the means with respect to the target variable in the two subgroups are homogeneous, i.e., both greater than, or both less than, the mean of the general population. This combination rule can ensure that a combined subgroup passes the pruning test; the proof can be found in the extended version.

Combination Rule 2 (Difference Homogeneity). Let m_1 and m_2 be the means with respect to the target variable in the two given subgroups, and let M be the mean of the general population; then the two subgroups can be combined only if (m_1 − M) × (m_2 − M) > 0.

Lastly, in practice it is helpful to users if a subgroup does not spread too broadly. Therefore, it is necessary to evaluate the distribution purity of a combined subgroup. Before introducing the next combination rule, we propose another criterion referred to as Continuous Weighted Entropy (CWE), which is adapted from the entropy measure used for evaluating the purity of classification:

CWE = − Σ_{i=1}^{n} p(i) × log p(i) × (|v_i − M| / (v_n − v_1))    (5.4)

where n is the number of intervals for the target variable, p(i) and v_i denote the probability and representative value of the i-th interval, respectively, and M represents the mean of the general population. Since v_1 and v_n are the minimum and maximum representative values, respectively, the factor |v_i − M| / (v_n − v_1) represents the weight, i.e., the normalized distance from the i-th interval to the mean of the general population. Therefore, for two subgroups of equal size, the CWE of a subgroup in which most elements fall within intervals remote from the mean of the general population tends to be greater than that of a subgroup in which most elements fall within intervals close to that mean.

The third combination rule aims to restrict the extent to which adjacent subgroups can be combined, by introducing a user-specified parameter – the maximum continuous weighted entropy CWE_max.

Combination Rule 3 (Distribution Purity). Let sg_combined be the subgroup obtained by combining two given subgroups, and let CWE(sg_combined) be its continuous weighted entropy; then the two subgroups can be combined only if CWE(sg_combined) ≤ CWE_max.

This combination rule effectively controls the granularity of the output subgroups and hence helps keep their number at a manageable level.

Generally, a larger maximum CWE value tends to generate fewer subgroups yet of greater generality and higher quality.
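The combinability test induced by Rules 2 and 3 can be sketched as follows. This is our own helper: the natural logarithm is assumed for Equation 5.4, and the interval probabilities and representative values of the tentatively combined subgroup would in practice come from its bitmap bins:

```python
import numpy as np

def cwe(p, v, M):
    """Continuous Weighted Entropy (Equation 5.4), natural logarithm assumed.
    p[i] and v[i] are the probability and representative value of the i-th
    target-variable interval; M is the mean of the general population."""
    p, v = np.asarray(p, dtype=float), np.asarray(v, dtype=float)
    weight = np.abs(v - M) / (v[-1] - v[0])          # normalized distance to M
    entropy_terms = np.array([pi * np.log(pi) if pi > 0 else 0.0 for pi in p])
    return float(-np.sum(entropy_terms * weight))

def can_combine(m1, m2, M, combined_p, combined_v, cwe_max):
    """Combination Rule 2 (difference homogeneity) and Rule 3 (distribution purity)."""
    homogeneous = (m1 - M) * (m2 - M) > 0
    return homogeneous and cwe(combined_p, combined_v, M) <= cwe_max

# Toy usage: two adjacent subgroups, both above the population mean M = 3.6.
p = [0.1, 0.2, 0.4, 0.3]      # interval probabilities of the combined subgroup
v = [4.0, 5.0, 6.0, 7.0]      # representative values of those intervals
print(can_combine(m1=5.9, m2=6.2, M=3.6, combined_p=p, combined_v=v, cwe_max=1.5))  # True
```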

5.4 Algorithm Optimization Using Bitmaps

In this section, we discuss how the algorithm is accelerated using bitmaps. Note that our actual implementation is based entirely on bitmaps; for clarity in the write-up, we present bitmaps as an optimization mechanism. Since our approach can involve a search over all enumerations, it would require multiple passes over the entire dataset if bitmaps were not used, which would be prohibitively expensive on any disk-resident dataset.

Input arrays (2 × 2, row-major order): T = [[1,2],[3,4]] (target), A = [[1,2],[2,1]], B = [[1,2],[3,1]]
Value-based bitvectors: T=1: 1000, T=2: 0100, T=3: 0010, T=4: 0001; A=1: 1001, A=2: 0110; B=1: 1001, B=2: 0100, B=3: 0010
Dimension-based bitvectors: i=1: 1100, i=2: 0011; j=1: 1010, j=2: 0101
Sign bitvectors (w.r.t. T's mean of 2.5): positive (> 2.5): 0011, negative (< 2.5): 1100

Figure 5.3: Use of Bitmaps in SciSD

A key characteristic of our algorithm is that its input is compact bitmap indices instead of the original raw datasets. Note that the bitmap indices are not used in the conventional way, i.e., for querying the actual data more efficiently. Instead,

bitmaps are used as a summary representation of the data. Also note that bitmaps lose a

certain level of precision because of binning for value-based attributes. However, to

control costs, our subgroup discovery method must use binning of value-based and

dimension-based attributes.

A bitmap example is shown in Figure 5.3. Besides the normal representation of the data with bitmaps, our algorithm uses two other types of bitmaps. Returning to our running example, assume that within a 2 × 2 space, the two dimensions (i.e., dimension-based attributes), row and column, are denoted as i and j, respectively. The subgroup discovery involves three arrays (i.e., value-based attributes) T, A, and B, where T is the target variable. All the bitmap indices here are generated in a row-major fashion. Three types of bitmap indices are generated in this example: 1) (conventional) indices for value-based attributes, e.g., T,

A and B; 2) indices for dimension-based attributes, e.g., i and j; and 3) two sign bitvectors

– both a positive bitvector and a negative bitvector.

These bitvectors are used in the following fashion. With the first two types of indices, each attribute-range pair can be represented by a bitvector in the bitmap, whether the attribute is dimension-based or value-based. In particular, the second type of indices is created in the same way value-based attributes are indexed, i.e., by treating each coordinate value as an array element value. Equi-depth binning [26], which partitions the entire domain in such a way that each bin contains approximately the same number of elements, is used for indexing both value-based and dimension-based attributes. Use of binning restricts the ranges (of both values and dimensions) that can be used in describing a subgroup. However, this has a very significant impact on the execution time of the algorithm, as both the size of the bitvectors and the number of potential subgroups to be considered are now restricted to a manageable level.

The two sign bitvectors are used for optimistic estimates. In particular, the positive bitvector indicates all the elements greater than the mean with respect to the target variable in the original dataset (2.5 in this example), i.e., the elements that contribute a positive quality. Similarly, the negative bitvector corresponds to a negative quality.
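For the 2 × 2 example of Figure 5.3, all three types of indices can be generated in a few lines. Plain boolean NumPy arrays stand in here for the compressed bitvectors used by the actual system, so this is only a functional sketch (also note that the row and column indices are 0-based below, whereas Figure 5.3 labels them from 1):

```python
import numpy as np

T = np.array([[1, 2], [3, 4]])      # target array from Figure 5.3
A = np.array([[1, 2], [2, 1]])
B = np.array([[1, 2], [3, 1]])
flatT, flatA, flatB = T.ravel(), A.ravel(), B.ravel()   # row-major order

# 1) Value-based indices: one bitvector per distinct value (per bin in general).
value_index = {("A", int(v)): (flatA == v) for v in np.unique(flatA)}
value_index.update({("B", int(v)): (flatB == v) for v in np.unique(flatB)})
target_index = {int(v): (flatT == v) for v in np.unique(flatT)}

# 2) Dimension-based indices: coordinate values treated like element values.
rows, cols = np.indices(T.shape)
dim_index = {("i", r): (rows.ravel() == r) for r in range(T.shape[0])}
dim_index.update({("j", c): (cols.ravel() == c) for c in range(T.shape[1])})

# 3) Sign bitvectors relative to the overall target mean (2.5 here).
positive = flatT > flatT.mean()     # 0011 in Figure 5.3
negative = flatT < flatT.mean()     # 1100 in Figure 5.3

print(value_index[("A", 1)].astype(int))   # [1 0 0 1], i.e. 1001
print(dim_index[("i", 1)].astype(int))     # [0 0 1 1], i.e. the second row
print(positive.astype(int))                # [0 0 1 1]
```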

Compared with conventional bitmap indexing, which only outputs the first type of indices, our approach generates more indices. However, it turns out that neither the bitvectors for dimension-based attributes nor the two sign bitvectors add much space or time complexity. This is because of the regularity of array elements within the same dimensional range, which leads to very high compression ratios (e.g., 1%).

Now, we explain how efficiency is achieved in our algorithm using bitmaps. First, since

the size of bitmap indices is often only 15%-30% of the original data size, I/O performance

can be clearly improved. In addition, specific algorithm steps are accelerated as follows.

First, a conjunction of multiple attribute-range pairs can be calculated by bitwise AND

operations over different bitvectors. As each attribute-range pair corresponds to a bitvector

in our bitmaps, a subgroup described by a conjunction of multiple attribute-range pairs can

now be represented by a single bitvector after bitwise AND operations. This property is

particularly useful every time a parent subgroup produces a child subgroup by appending

another attribute-range pair (see line 9 in Algorithm 7). In our running example, A = 1 ∧ i = 2 can be derived by 1001 ∧ 0011 = 0001.

Next, a disjunction of two attribute-range pairs can be obtained by a bitwise OR operation between two bitvectors. This can speed up the subgroup combination process in our algorithm (see line 11 in Algorithm 7), since the combination of two attribute-range pairs can be viewed as a disjunction. In the above example, B ≤ 2 can be derived from 1001 ∨ 0100 = 1101.

Another benefit of bitmaps is that both the subgroup size and the mean value with respect to the target variable in a subgroup can be computed efficiently with bitmaps. This property can accelerate all the computations involved in our quality evaluations, pruning measures, and combination rules. As the membership of a subgroup can be described by a single bitvector, the subgroup size can be easily obtained by counting the number of 1s in the bitvector. Moreover, if we denote the bitmap that corresponds to the target variable as the target bitmap, the mean with respect to the target variable in a subgroup can be calculated as follows:

m = ( Σ_{i=1}^{n} v_i × COUNT(b_i ∧ b) ) / ( Σ_{i=1}^{n} COUNT(b_i ∧ b) )    (5.5)

where m is the mean with respect to the target variable in the subgroup, n is the number of bitvectors in the target bitmap, v_i is the representative value of the bitvector b_i in the target bitmap, and b is the bitvector that represents the membership of the subgroup. It turns out

that the mean value can be calculated efficiently and with reasonably high accuracy based

on the above process. In the above example, as the subgroup A = 1 can be represented

by the bitvector 1001, the mean with respect to T in this subgroup can be computed by

(1 × COUNT(1000 ∧ 1001) + 4 × COUNT(0001 ∧ 1001)) / (COUNT(1000 ∧ 1001) + COUNT(0001 ∧ 1001)) = 2.5. Finally, the two sign bitvectors can facilitate the optimistic estimates discussed in

Section 5.2.1 (as used in line 14 of Algorithm 7). In Equation 5.3, given a bitvector that

indicates all the elements in a given subgroup sg, a simple bitwise AND operation

between this bitvector and the positive bitvector can result in the bitvector corresponding

to sg_pos. Similarly, sg_neg can be efficiently calculated with the negative bitvector. In the

above example, to obtain the subset with element values greater than the mean in the

subgroup A = 1 represented by the bitvector 1001, we can perform a bitwise AND

between this bitvector and the positive bitvector, 1001 ∧ 0011 = 0001.
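The bitwise manipulations described above can be reproduced end-to-end on the toy bitvectors of Figure 5.3; uncompressed boolean arrays again stand in for compressed bitvectors, so the sketch below only illustrates the behavior, not the actual implementation:

```python
import numpy as np

bv = lambda s: np.array([c == "1" for c in s])     # "1001" -> boolean bitvector

# Bitvectors taken from Figure 5.3.
T_bins = {1: bv("1000"), 2: bv("0100"), 3: bv("0010"), 4: bv("0001")}
A1, i2 = bv("1001"), bv("0011")
B1, B2 = bv("1001"), bv("0100")
positive = bv("0011")

# Conjunction of attribute-range pairs: bitwise AND (line 9 of Algorithm 7).
print((A1 & i2).astype(int))          # A = 1 AND i = 2  ->  [0 0 0 1]

# Combination of adjacent ranges: bitwise OR (line 11 of Algorithm 7).
print((B1 | B2).astype(int))          # B <= 2           ->  [1 1 0 1]

# Subgroup size and mean (Equation 5.5) from bit counts alone.
def size_and_mean(member):
    counts = {v: int((b & member).sum()) for v, b in T_bins.items()}
    n = sum(counts.values())
    return n, sum(v * c for v, c in counts.items()) / n

print(size_and_mean(A1))              # (2, 2.5) for the subgroup A = 1

# Optimistic estimates: restrict a subgroup to its elements above the overall mean.
print((A1 & positive).astype(int))    # sg_pos of A = 1  ->  [0 0 0 1]
```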

5.5 Experimental Results

In this section, we evaluate the performance (execution efficiency) of our algorithm as

well as the quality of the output subgroups. We designed the experiments with the

following goals: 1) to compare our algorithm with SD-Map* [27], which is a popular

subgroup discovery algorithm (applicable to relational data involving numeric attributes) – it has been implemented in the open-source software VIKAMINE [14], and 2) to demonstrate both the performance of our algorithm and the quality of the output subgroups,

demonstrate both performance of our algorithm and the quality of the output subgroups,

by varying both the maximum continuous weighted entropy (CWE) and the number of

bins used in bitmap indexing.

5.5.1 Experimental Setup

Our experiments were conducted using four real-life scientific datasets, which are all stored in NetCDF, one of the popular array formats. The first two datasets are downloaded from the World Ocean Atlas 2009 (WOA09) [15] monthly compositing data. Because these two datasets are generated on 5° and 1° grids, we refer to them as WOA09 5DEG

and WOA09 1DEG, respectively. Each dataset comprises five attributes including

apparent oxygen utilization (AOU), temperature, dissolved

oxygen (DO),

salinity, and oxygen saturation (OS), in a 4-dimensional space, which is

modeled by longitude, latitude, depth, and time. The sizes of WOA09 5DEG

and WOA09 1DEG datasets are 14 MB and 373 MB, respectively. The third dataset is

obtained from World Ocean Atlas 2013 (WOA13) [16] annual compositing data, which is

generated on 5° grids. Compared with the first two datasets, this dataset is modeled with

the same four dimensions, but it comprises three more attributes – silicate,

phosphate, and nitrate. We refer to this dataset as WOA13. The fourth dataset is generated by Parallel Ocean Program (POP) [112]. POP is an ocean circulation model, and the execution we used has a grid resolution of approximately 10 km (horizontally),

and vertically, it has a grid spacing of nearly 10 m near the surface, reaching 250 m in

the deep ocean. The dataset mainly comprises four attributes, salinity,

temperature, UVEL and VVEL, where UVEL and VVEL observe the velocity of ocean current in the grid-x and grid-y directions, respectively. POP generates 1.4 GB data for each attribute per time-slice, and each attribute is modeled with three dimensions: latitude, longitude, and depth. We only used one time-slice which is referred to as the POP dataset in our experiments, and thus, the total data size of the fourth dataset was 5.6 GB. The number of attributes involved in our experiments is up to 8, which we believe is a large number compared to the number of attributes likely of interest to any given scientist. Our experiments were conducted on a machine with 8 GB of main memory and Intel(R) Xeon(R) 2.53 GHz CPU. The version of VIKAMINE we used was

2.2.

5.5.2 Comparison with SD-Map*

Our first set of experiments compared the performance of our algorithm and the quality of its output subgroups against SD-Map*, a popular subgroup discovery algorithm over numeric attributes. SD-Map* has been implemented in an open-source software

VIKAMINE, and it uses the same quality function – Continuous Weighted Relative

Accuracy (CWRAcc), which is defined by Equation 5.2. Although we are also aware of a small number of other algorithms that can work with numeric attributes, we cannot directly compare with them, because they either use different quality functions or their implementations were not available to us. Since VIKAMINE can only support subgroup discovery over small relational datasets, we only used the WOA09 5DEG and WOA13 datasets, which have sizes of 14 MB and 8 MB, respectively. To match

the format expected by VIKAMINE, we converted the data from NetCDF to the CSV

format. In the process, the four dimension-based attributes were also added as additional

columns, since VIKAMINE can only process relational data. Note that treating array

dimensions as additional relational attributes is not a practical approach for large datasets,

due to the high data reorganization and storage costs.

First, among the five value-based attributes in WOA09 5DEG dataset, we set OS as the

target variable – its mean value is 71.03. We used 100 bins for indexing each value-based

attribute, and 12 bins for indexing each dimension-based attribute. The minimum support,

minimum absolute quality, and minimum relative quality were set as 0.01, 0.09, and 0.001,

respectively. The maximum CWE was varied from 0.25 to 1.5.

Table 5.1: The Best 8 Subgroups Discovered on WOA09 5DEG by SD-Map*

Rank | Subgroup | CWRAcc | Support | Mean
1 | AOU < 0.08 | 6.24 | 0.20 | 102.24
2 | depth < 40 | 6.08 | 0.21 | 99.81
3 | AOU < 0.08 AND depth < 40 | 4.65 | 0.15 | 102.45
4 | DO ≥ 6.14 | 4.63 | 0.20 | 94.18
5 | 0.08 ≤ AOU < 0.75 | 4.62 | 0.20 | 94.13
6 | temperature > 18.41 | 4.55 | 0.20 | 93.79
7 | 4.98 ≤ DO < 6.14 | 3.85 | 0.20 | 90.29
8 | temperature > 18.41 AND AOU < 0.08 | 3.56 | 0.11 | 102.14

Tables 5.1 and 5.2 show the best 8 subgroups discovered by SD-Map* and our algorithm on WOA09 5DEG dataset, respectively, with the maximum CWE of 1.5. The subgroups

within each table are in descending order of absolute quality measured by CWRAcc.

Table 5.2: The Best 8 Subgroups Discovered on WOA09 5DEG by SciSD with Maximum CWE of 1.5

Rank | Subgroup | CWRAcc | Support | Mean
1 | -0.43 ≤ AOU ≤ 1.95 | 12.26 | 0.58 | 92.17
2 | 2.68 ≤ AOU ≤ 7.99 | -12.25 | 0.34 | 35.00
3 | 0.01 ≤ DO ≤ 4.29 | -11.61 | 0.36 | 38.78
4 | 4.43 ≤ DO ≤ 10.57 | 11.44 | 0.60 | 90.10
5 | 0 ≤ depth ≤ 200 | 9.30 | 0.51 | 89.34
6 | 250 ≤ depth ≤ 1500 | -9.30 | 0.49 | 52.14
7 | 2.36 ≤ temperature ≤ 12.97 | -6.57 | 0.50 | 57.89
8 | 16.16 ≤ temperature ≤ 32.68 | 5.15 | 0.25 | 91.64

We can make the following four observations about the two algorithms. First, our

algorithm can discover more interesting subgroups (i.e., subgroups of higher quality). For

example, the quality of the best 4 subgroups discovered by our algorithm is almost twice

the best subgroup discovered by SD-Map*. This is because Combination Rule 2 leads to

subgroups that are larger and of higher quality. Second, SD-Map* is only able to discover

the subgroups with the mean value significantly greater than the general population, i.e.,

the subgroups with a positive CWRAcc. Our algorithm can also discover interesting

subgroups with a negative CWRAcc. This is because our Pruning Measure 2 is also able

to capture interesting subgroups with significantly smaller means. Third, as Table 5.1

shows, a nontrivial fraction of subgroups generated by SD-Map* are redundant.

Subgroup 3 is really the intersection of Subgroup 1 and Subgroup 2, and in fact, has a lower CWRAcc than either of its two super-subgroups. Similarly, Subgroup 8 is an analogous intersection of Subgroup 1 and Subgroup 6. In contrast, our algorithm does

not result in any subgroup that is redundant. This is because our Pruning Measure 3 can

prevent any subgroup specialization that does not lead to quality gain. Finally, the

subgroups discovered by our algorithm tend to have greater generality, i.e., larger

subgroup size, while we observed that most subgroups discovered by SD-Map* could

have been merged into larger subgroups. For example, Subgroup 1 and Subgroup 5, as

well as Subgroup 4 and Subgroup 7 in Table 5.1, could have been combined into a larger

subgroup. In comparison, our algorithm can combine as many subgroups as possible,

under the user-specified maximum CWE. Therefore, our algorithm can ensure that any

two of the output subgroups cannot be further combined, facilitating better exploration by scientists.

Table 5.3: Comparison between SD-Map* and SciSD on WOA09 5DEG with Varying Maximum CWE

Algorithm | # of Subgroups | Exe Times (secs) | CWRAcc | Support | # of Attr.
SciSD 0.25 | 111 | 58.61 | 0.75 | 0.04 | 1.05
SciSD 0.5 | 88 | 48.53 | 1.03 | 0.06 | 1.06
SciSD 1 | 53 | 32.67 | 1.90 | 0.12 | 1.10
SciSD 1.5 | 26 | 17.37 | 4.17 | 0.27 | 1.19
SD-Map* | 1792 | 2116 | 0.48 | 0.02 | 3.06

Table 5.3 summarizes the statistics of the subgroups discovered by SD-Map* and our algorithm with varying maximum CWE on WOA09 5DEG dataset. ‘SciSD 0.25’,

‘SciSD 0.5’, ‘SciSD 1’, and ‘SciSD 1.5’ denote our algorithm with the maximum

CWE of 0.25, 0.5, 1, and 1.5, respectively. These statistics include the number of output

subgroups, execution times, average support, average CWRAcc, and the number of

attributes involved per subgroup. Note that the reported execution times of SD-Map* do not include the data preprocessing time, i.e., the data conversion time.

We can make three observations. First, our algorithm not only produces subgroups of greater generality and higher quality than SD-Map*, but its execution times are also lower by up to two orders of magnitude. Second, our algorithm manages

to restrict the number of subgroups within a manageable level, whereas SD-Map* reports

a large number of subgroups, which tend to be too specialized. A prominent example is

that, with respect to the attribute AOU, our algorithm with the maximum CWE of 1.5 only

discovered two subgroups, with positive and negative quality, respectively. In comparison,

SD-Map* discovered 766 subgroups, and all of them turn out to be overlapping subsets

of the only subgroup with the positive quality discovered by our algorithm (this subgroup

is Subgroup 1 in Table 5.2). Note that VIKAMINE supports a post-processing module named ‘minimal improvement filter’, which aims to filter out redundant subgroups and minimize the final output, but such post-processing is currently not supported over numeric attributes. Lastly, we can see that the maximum CWE can be used to effectively control the granularity of the output subgroups. Generally, a larger maximum CWE allows more subgroups to be combined, leading to fewer output subgroups, which are of greater generality and higher quality. A larger maximum CWE also results in lower execution times.

Next, we experimented on another dataset, WOA13, by setting nitrate as the target variable – its mean value is 6.32. We used 100 bins for indexing each value-based attribute, and 12 bins for indexing each dimension-based attribute, with the exception of the time dimension. Only 1 bin was used for indexing time, because this annual compositing dataset only comprises a single time slice. Our method set the minimum

support, minimum absolute quality, and minimum relative quality as 0.002, 0.14, and

0.001, respectively. The maximum CWE was varied from 0.25 to 1.5.

Table 5.4: The Best 8 Subgroups Discovered on WOA13 by SD-Map*

Rank | Subgroup | CWRAcc | Support | Mean
1 | 1.16 ≤ phosphate | 3.22 | 0.20 | 22.40
2 | 1.16 ≤ phosphate AND time = 1 | 3.22 | 0.20 | 22.40
3 | 12.75 ≤ silicate AND 1.16 ≤ phosphate | 2.66 | 0.15 | 23.74
4 | 12.75 ≤ silicate AND 1.16 ≤ phosphate AND time = 1 | 2.66 | 0.15 | 23.74
5 | 12.75 ≤ silicate | 2.65 | 0.20 | 19.61
6 | 12.75 ≤ silicate AND time = 1 | 2.65 | 0.20 | 19.61
7 | 1.16 ≤ phosphate AND 15.50 ≤ depth | 0.84 | 0.05 | 23.30
8 | 1.16 ≤ phosphate AND 15.50 ≤ depth AND time = 1 | 0.84 | 0.05 | 23.30

Table 5.5: The Best 8 Subgroups Discovered on WOA13 by SciSD with Maximum CWE of 1.5

Rank | Subgroup | CWRAcc | Support | Mean
1 | phosphate < 0.84 | -3.41 | 0.74 | 1.71
2 | 0.88 ≤ phosphate < 3.39 | 3.40 | 0.25 | 19.90
3 | silicate < 8.45 | -2.66 | 0.75 | 2.76
4 | 13.90 ≤ silicate < 185.23 | 2.65 | 0.19 | 20.20
5 | 40 ≤ depth < 175 | -0.98 | 0.25 | 2.35
6 | 400 ≤ depth < 1050 | -0.81 | 0.20 | 2.24
7 | 0 ≤ depth < 35 | 0.80 | 0.08 | 16.17
8 | 200 ≤ depth < 375 | 0.76 | 0.06 | 19.39

Tables 5.4 and 5.5 show the best 8 subgroups discovered by SD-Map* and our algorithm

(with the maximum CWE of 1.5) on WOA13 dataset, respectively. We can still draw

Table 5.6: Comparison between SD-Map* and SciSD on WOA13 with Varying Maximum CWE

Algorithm | # of Subgroups | Exe Times (secs) | CWRAcc | Support | # of Attr.
SciSD 0.25 | 37 | 17.69 | 0.39 | 0.059 | 1
SciSD 0.5 | 33 | 14.64 | 0.69 | 0.180 | 1
SciSD 1 | 19 | 10.27 | 1.00 | 0.189 | 1.16
SciSD 1.5 | 16 | 6.12 | 1.06 | 0.140 | 1.30
SD-Map* | 3036 | 508 | 0.24 | 0.017 | 4.28

the same conclusions as those from the previous experiment. Particularly, in Table 5.4,

we can see that the best 8 subgroups discovered by SD-Map* are actually all covered by

either Subgroup 1 or Subgroup 5. The other six subgroups are all subsets of these two subgroups. In comparison, among the best 8 subgroups discovered by our algorithm, if we only consider the subgroups with a positive CWRAcc, Subgroup 2 in Table 5.5 has higher quality than Subgroup 1 in Table 5.4, and Subgroup 4 in Table 5.5 has no lower quality than Subgroup 5 in Table 5.4.

Similar to Table 5.3, Table 5.6 summarizes the comparison results on WOA13 dataset.

It turns out that we can make similar observations to those from the previous experiment. First, the average quality of the output subgroups discovered by our algorithm is 1.6x–4.4x that of SD-Map*'s output. Second, our execution time is only 1.2% of SD-Map*'s when a large maximum CWE (1.5) is used. This is because lenient subgroup combination can effectively reduce the processing costs. Third, compared with SD-Map*, which produces a large number of redundant subgroups, our output subgroups are much more compact. For instance, among the total 3036 output subgroups, the attribute phosphate occurs 2187 times, but only in the attribute-range pair ‘1.16 ≤ phosphate’, which

corresponds to Subgroup 1 in Table 5.4. This implies that the other 2186 subgroups (or 72% of the output) involving phosphate are all redundant. In addition, 50% of the output subgroups discovered by SD-Map* are simply created by involving one more attribute-range pair, time = 1, which does not help improve the quality.

5.5.3 Evaluation over Larger Datasets

We next experimented on the two larger datasets – WOA09 1DEG and POP datasets.

In this set of experiments we varied the maximum CWE and the number of bins used for

indexing value-based attributes. One of our main goals was to understand how binning (a

critical step in using bitvectors) impacts the execution time and the output quality of the

algorithm. We set the minimum support, minimum absolute quality, and minimum relative

quality as 0.01, 0.04, and 0.001, respectively. For WOA09 1DEG dataset, the maximum

CWE was varied to be 0.25, 0.5, 1, and 1.5. The number of bins used for indexing each of

the value-based attributes was varied between 20, 40, 60, 80, and 100. The number of bins

used for indexing dimension-based attributes – latitude, longitude, depth, and time, was fixed throughout all the experiments, and was 18, 36, 24, and 12, respectively, for these four dimensions. For POP dataset, the maximum CWE was varied from 0.25 to 1.

The number of bins used for indexing each of the value-based attributes was varied from 20 to 60. The number of bins used for indexing the dimension-based attributes latitude, longitude, and depth was also fixed – 24, 36, and 21, respectively, for these three

dimensions.

Figures 5.4 and 5.5 show the evaluation results on WOA09 1DEG and POP datasets.

First, we can see that using a larger maximum CWE and fewer bins can result in fewer

subgroups being output. This is because a larger maximum CWE allows more small

152 300 20

250 40

s 60 200 80 group b 150 100 Su of

# 100

50

0 0.25 0.5 1 1.5 Maximum CWE (a) # of Subgroups on WOA09 1DEG

3500 20

3000 40 ecs) s

( 60

2500 80

imes 2000 T 100 1500 tion u u 1000 Exec . g g 500 Av

0 0.25 0.5 1 1.5 Maximum CWE (b) Avg. Exe Times on WOA09 1DEG

6.0 20

5.0 40

60 4.0

RAcc 80 W W

C 3.0

100 vg. A A 2.0

1.0

0.0 0.25 0.5 1 1.5 Maximum CWE (c) Avg. CWRAcc on WOA09 1DEG

Figure 5.4: Evaluation on WOA09 1DEG Dataset: Impact of Varying Maximum CWE and # of Bins on # of Subgroups, Execution Times, and Average CWRAcc

[Plots omitted: (a) # of Subgroups, (b) Avg. Execution Times (secs), and (c) Avg. CWRAcc on POP, each plotted against the maximum CWE (0.25, 0.5, 0.75, 1) with one curve per bin count (20, 30, 40, 50, 60).]

Figure 5.5: Evaluation on POP Dataset: Impact of Varying Maximum CWE and # of Bins on # of Subgroups, Execution Times, and Average CWRAcc

subgroups to be combined, and fewer bins lead to more coarse-grained binning and fewer attribute-range pairs. Second, in most cases, a larger maximum CWE and the use of fewer bins also lead to lower processing times. The reason is two-fold. On one hand, more lenient subgroup combination can help eliminate more of the search space where multiple attributes are involved. On the other hand, use of fewer bins also implies fewer attribute-range pairs involved in the subgroup discovery. Finally, for both datasets, we can see higher average quality with a larger maximum CWE and a smaller number of bins.

However, note that the subgroup quality cannot always be improved by increasing maximum CWE or decreasing the number of bins, since such quality is also highly dependent on the dataset itself. Two exceptions can be found in Figure 5.4(c). The first exception is that, when the number of bins is 20, using the maximum CWE of 1.5 cannot result in a subgroup of quality higher than using the maximum CWE of 1. The other exception is that, when the number of bins is greater than 20 and the maximum CWE is

1.5, the subgroup quality cannot be further improved with more bins. However, overall, high-quality results and reduced execution times with 20 bins point to the practicality of our algorithm, i.e., it can be effective on large datasets and can provide reasonable response times.

5.5.4 Effectiveness of Search Strategy

Lastly, we evaluated the search efficiency on the WOA09 1DEG dataset, using almost the same parameters as the previous experiment (the exception being that a fixed maximum CWE of 1.5 was used). In Table 5.7, we report the size of the search space, the number of visited subgroups during our search, the number of visited subgroups eliminated by

Table 5.7: Search Efficiency Evaluation on WOA09 1DEG with Varying # of Bins and Maximum CWE of 1.5

# of Bins | 20 | 40 | 60 | 80 | 100
Size of Search Space | 3.0E+10 | 4.8E+11 | 2.4E+12 | 7.6E+12 | 1.9E+13
# of Visited Subgroups | 1626 | 2216 | 2676 | 3136 | 3596
# of Pruned Subgroups | 1470 (90.4%) | 1984 (89.5%) | 2366 (88.4%) | 2752 (87.8%) | 3139 (87.3%)
# of Combined Subgroups | 136 (8.4%) | 211 (9.5%) | 289 (10.8%) | 363 (11.6%) | 436 (12.1%)

pruning, and the number of visited subgroups eliminated by combination. Additionally, we also report the proportion of eliminated subgroups among all the visited ones.

The size of the search space equals the product of the numbers of bins over all attributes.

The number of eliminated subgroups reported only includes the ones that are directly pruned or combined based on our pruning measures and combination rules, and excludes the child subgroups of those. We can make the following observations. First, the search space increases exponentially as more bins are used, and even with a small number of bins, the search space is still huge. This shows that a brute-force algorithm will have an unacceptable cost. However, the number of visited subgroups is at least 7 orders of magnitude smaller, and does not increase very rapidly with increasing number of bins.

This is because our level-wise search strategy can eliminate unqualified or small subgroups by pruning or combination as early as possible, and hence unnecessary exploration of their child subgroups can be avoided. As a result, the depth of our search tree is relatively small, mostly less than 3. Second, we can see that around 90% of the visited subgroups are pruned based on our pruning measures, and around 10% of the visited subgroups are combined to provide higher quality. Therefore, generally less than 1% of the visited subgroups are finally qualified, leaving only the subgroups of most significance to the users.

5.6 Related Work

As an important data mining technique that is often applied to data exploration and descriptive induction, subgroup discovery has been extensively studied in recent years.

The existing algorithms can be broadly classified into three types: extensions of classification algorithms [73, 119, 130, 133, 240], extensions of association algorithms [28,81,82,115], and extensions of evolutionary fuzzy algorithms [35, 43, 64].

However, only a small number of subgroup discovery algorithms are capable of processing numeric attributes. As an extension of SD-Map [28], SD-Map* is designed based on FP-growth, and it uses a frequent pattern tree as the search tree. In contrast, our algorithm uses a set-enumeration-tree-based search strategy, with the pruning and combination methods we have introduced. We have extensively compared our algorithm with SD-Map*, and have shown that our algorithm can lead to subgroups of greater generality and higher quality, as well as up to two orders of magnitude lower execution times.

Other efforts have focused on discretization methods – for example,

TargetCluster [161] discretizes the target variable based on clustering, instead of the simple equi-width/equi-depth binning that is more common, whereas MergeSD [95] involves discretization with overlapping intervals, by adjusting interval bounds with a bound table. Moreover, subgroup discovery techniques have also been applied in the context of spatial databases [120, 155]. Additionally, the objective of our subgroup discovery is similar to bump hunting [72], which also aims to identify subsets that are considerably greater or smaller than the average of the general population. Mehta et al. [158] propose a correlation-preserving discretization method that takes into account the inherent correlations among multiple variables, and this method can potentially be used to optimize

the binning process in our algorithm. However, no existing work is designed for

processing large-scale array-based scientific datasets. Compared with all the existing

subgroup discovery algorithms, a key difference in our approach is that we directly

operate on the compact bitmap indices, instead of the raw datasets, and utilize fast bitwise

operations on them, allowing processing of larger datasets.

Reducing the irrelevancy and redundancy of the output is also an important issue in subgroup discovery. As a modified version of classical subgroup discovery, relevant subgroup discovery has been proposed to eliminate irrelevant subgroups, based on the theory of relevancy [131, 132]. Relevant subgroup discovery algorithms are devised mainly based on either optimistic estimates or closed sets. Most algorithms [131, 136] based on optimistic estimates can lead to a dramatic reduction of the search space as well as the execution time, but they may not be able to guarantee correct results [80]. Closed-set-based algorithms [74, 80] require quadratic time complexity, which is clearly not applicable to large scientific datasets. Additionally, Chen et al. [54] applied a sequential coverage

approach where the formal relevancy criteria is not followed. Other non-redundant

subgroup discovery algorithms [36,215] proposed different redundancy criteria to follow.

However, to the best of our knowledge, none of these algorithms is designed for the

subgroup discovery over numeric array data. In the future, our method can be combined

with a separate post-processing step of relevancy check.

Apart from subgroup discovery, other important descriptive data mining tasks that involve various forms of differential analysis include contrast set mining [33], emerging pattern mining [67], differential rule mining [142], and anomaly identification [83]. Again, the distinctive aspect of our algorithm is efficient processing of large-scale array data.

5.7 Summary

This work has presented a novel algorithm for subgroup discovery over array-based datasets. This algorithm extends the subgroup description to the processing of array data, and is capable of effectively handling numeric attributes. Moreover, our algorithm directly operates on the compact bitmap indices instead of the raw datasets, and utilizes fast bitwise operations on them, allowing processing of larger datasets. We have extensively evaluated our algorithm using multiple real-life datasets, and compared it with a popular subgroup discovery algorithm, SD-Map*. The results demonstrate both the high efficiency and effectiveness of our algorithm.

Chapter 6: SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats

In practice, scientific data analysis can often be more complex than subsetting and aggregations, and MapReduce can be adopted to simplify the implementations. However, it often requires reloading scientific data into another file system or format (for offline processing), which can often be prohibitively expensive. In this chapter, we present a framework that can facilitate ad-hoc analysis in many scientific applications, by allowing scientific data in different formats to be processed with a MapReduce-like API.

6.1 Background

This section provides background information on the MapReduce model and the MATE system, as well as the popular scientific data formats, including NetCDF and HDF5.

6.1.1 MapReduce and MATE

MapReduce [63] was proposed by Google for scalable application development in data centers. With a simple interface of two functions, map and reduce, this model is well suited for parallel implementations of a variety of applications, including search engine support and machine learning [57], [75]. The map function takes a set of input instances and generates a set of corresponding intermediate output (key, value) pairs.

The MapReduce library groups together all of the intermediate values associated with the same key and shuffles them to the reduce function. The reduce function, also written by the user, accepts a key and a set of values associated with that key, and merges these values together to form a possibly smaller set of values.

In recent years, our research group also designed a system we refer to as MapReduce with AlternaTE API (MATE) [109], [108]. The distinctive characteristic of MATE is that it enables the users to explicitly declare a reduction object, and uses this to overcome a key inefficiency in MapReduce, namely the cost of storing and shuffling a large number of key-value pairs.

6.1.2 Scientific Data Formats

For efficient storage and I/O on large-scale scientific data, a number of scientific data formats have been proposed, including CDF (Common Data Format), FITS (Flexible Image Transport System), GRIB (GRid In Binary), HDF and HDF5 (Hierarchical Data Format), HDF-EOS (Earth Observing System extensions to HDF), and NetCDF (Network Common Data Format). Among these scientific data formats, NetCDF and HDF5 seem to be the most popular ones at the current time.

NetCDF [8] comprises software libraries and a self-describing, portable data format, with the goal of supporting the creation, access, and sharing of scientific data. It is commonly used in climatology, meteorology, and GIS applications. A NetCDF dataset comprises dimensions, variables, and attributes, all of which have both a name and an ID number by which they are identified.
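To make the NetCDF data model more concrete, the following minimal sketch opens a dataset, looks up a variable by name, queries its dimensions, and reads it into a buffer using the standard NetCDF C API. The file name "sample.nc" and variable name "temperature" are hypothetical placeholders, and error checking is omitted for brevity.

#include <netcdf.h>
#include <cstdio>
#include <vector>

// Minimal sketch: read a hypothetical variable "temperature" from sample.nc.
int main() {
  int ncid, varid, ndims;
  nc_open("sample.nc", NC_NOWRITE, &ncid);        // open the file
  nc_inq_varid(ncid, "temperature", &varid);      // look up the variable by name
  nc_inq_varndims(ncid, varid, &ndims);           // number of dimensions

  std::vector<int> dimids(ndims);
  nc_inq_vardimid(ncid, varid, dimids.data());    // IDs of the variable's dimensions

  size_t total = 1;
  for (int d = 0; d < ndims; ++d) {
    size_t len;
    nc_inq_dimlen(ncid, dimids[d], &len);         // length of each dimension
    total *= len;
  }

  std::vector<double> buf(total);
  nc_get_var_double(ncid, varid, buf.data());     // read the whole variable
  std::printf("read %zu values\n", total);
  nc_close(ncid);
  return 0;
}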

HDF5 [3] is another widely used data model and file format for storing and managing data. It has been introduced in Section 2.1.

6.2 System Design

In this section, we discuss the design and implementation of the SciMATE system.

After giving a brief overview of the system, we describe an API for integrating a new data format into the system.

Figure 6.1: Execution Overview of SciMATE (the scientific data processing module performs partitioning, scheduling, restructuring, and splitting over the input; the runtime system carries out reduction, combination, and finalization over reduction objects driven by the user program)

6.2.1 System Overview

SciMATE's execution is closely based on its precursor system, MATE, which has been described in our earlier publications [108, 109]. The key new feature is the scientific data processing module, which is responsible for partitioning input scientific datasets, loading partitioned data into memory blocks, and possibly restructuring data while loading non-contiguous data.

Figure 6.1 gives an overview of the execution flow of a typical application using SciMATE in a distributed environment. First, given an input scientific dataset, the scientific data processing module will invoke a specific partitioning function, which corresponds to the data format involved. This function divides the input into a number of partitions. The partitioning function needs to be implemented for each specific data format, and it interacts with the corresponding library to help retrieve dataset information such as dataset dimensionality/rank, the length of each dimension, the number of units, and the total size of the dataset. Additionally, while performing partitioning, data locality is considered so that each node can load most of its input partition from its own local disk.

Figure 6.2 shows additional details of the scientific data processing module. Since NetCDF and HDF5 are two of the most popular scientific data formats, we have implemented data processing support for these two formats in the current version of SciMATE. Although SciMATE is able to process large NetCDF/HDF5 datasets, all the nitty-gritty details of the NetCDF/HDF5 APIs are abstracted within the data processing module and are transparent to application developers. Thus, users can develop applications over on-disk scientific datasets as if they were ordinary arrays located in memory.

The most important component of the scientific data processing module is the block loader, which is scheduled by the runtime to load partitioned data into memory blocks. As Figure 6.2 illustrates, the block loader is connected with both a data format selector and an access strategy selector, so that the module can handle two options specified by the users: the data format of the input datasets and the access strategy. The current data format selector provides three system-defined options: NetCDF, HDF5, and flat-file, i.e., binary data.

Figure 6.2: Details of the Scientific Data Processing Module (a block loader that moves partitioned data into memory blocks, connected to a data format selector with NetCDF, HDF5, flat-file, and third-party adapters over the corresponding libraries, and to an access strategy selector offering full read and partial read)

Once the users select one of the three options, the runtime will dynamically bind a corresponding adapter through the data format selector. The launched adapter translates calls to the generic interface of the block loader into calls to the original scientific file format libraries (NetCDF/HDF5) or the standard file I/O library. On the other hand, the current access strategy selector provides two options, full read and partial read, which will be introduced in Section 6.3.

6.2.2 API for Integrating a New Data Format

Besides supporting NetCDF, HDF5, and flat-files, our system provides enhanced reusability and extensibility by allowing a new data format adapter to be added easily.

Thus, the same data analysis application can be executed even on a dataset which has a different data format. The three system-defined data format adapters can be viewed as instances of a generic adapter interface. Given a new data format, a third-party adapter can be developed by implementing the block loading prototypes that interact with the corresponding library.
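The sketch below illustrates what such a third-party adapter could look like. The class and method names (GenericAdapter, XYZAdapter) are hypothetical and only meant to mirror the block-loading prototypes of Table 6.1; they are not SciMATE's actual class names.

#include <cstddef>

// Hypothetical generic adapter interface mirroring the prototypes of Table 6.1.
class GenericAdapter {
 public:
  virtual ~GenericAdapter() = default;
  virtual int get_dimensionality(const char* file, const char* dataset) = 0;
  virtual size_t get_dataset_size(const char* file, const char* dataset) = 0;
  virtual int full_read(const char* file, const char* dataset, void* buf) = 0;
  virtual int partial_read(const char* file, const char* dataset, void* buf,
                           const size_t* start, const size_t* size) = 0;
};

// Skeleton for a new format "XYZ": each method forwards to the XYZ format's own library.
class XYZAdapter : public GenericAdapter {
 public:
  int get_dimensionality(const char* file, const char* dataset) override {
    // e.g., open the file with the XYZ library and query the dataset rank here.
    return 0;  // placeholder
  }
  size_t get_dataset_size(const char* file, const char* dataset) override { return 0; }
  int full_read(const char* file, const char* dataset, void* buf) override { return 0; }
  int partial_read(const char* file, const char* dataset, void* buf,
                   const size_t* start, const size_t* size) override { return 0; }
};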

The API set for scientific data processing is summarized in Table 6.1. To integrate a new data format, a customized partition function is required, which computes both the starting location and the size of each partition. The same function then sends these results to the assigned compute nodes. While some critical information, such as the number of nodes, can be retrieved using certain MPI functions, other important information is obtained via data-format-specific functions. These functions include get_dimensionality, get_dataset_size, get_dimlens, and get_unit_num, which help obtain the dataset dimensionality/rank, the dataset size, the length of each dimension, and the number of units, respectively.

Moreover, the API set also allows the use of data-format-specific libraries to implement different access strategies, including full read and partial read. The full_read function reads complete data records. The partial_read function specifies how to load a single contiguous subregion. Additionally, partial read can also be implemented via certain access patterns: the partial_read_by_block function is for the strided pattern, which will be discussed in Section 6.3; the partial_read_by_column function allows the users to select required columns that are mapped to certain dimensions, regardless of column contiguity; and finally, the partial_read_by_list function serves to retrieve a list of discrete elements in the input datasets.

To show how this API is used for different data formats, we take the full_read function as an example, and demonstrate how the NetCDF and HDF5 calls are implemented. For double-precision data, the steps involved are: 1) open the file and retrieve the file id by file name; 2) open the dataset and retrieve the dataset id by file id and dataset name; 3) check if the data type is double; 4) retrieve the dataset rank by dataset id; 5) retrieve the length of each dimension; 6) define the data space and memory space with the same rank and the same length of each dimension; 7) load the entire dataset into the read buffer; and 8) release the file id and other descriptors if necessary. The NetCDF and HDF5 calls implementing these steps are listed in Table 6.2. Note that because of a slight difference between the NetCDF and HDF5 programming models, steps 4-6 can be skipped for the NetCDF adapter. Based on this, we expect that an adapter for any other array-based scientific data format can be implemented in a similar fashion.
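As a concrete, though simplified, illustration of these eight steps, the sketch below strings together the HDF5 calls from Table 6.2 for a double-precision dataset. Error checking is omitted and the function name is our own; the NetCDF path is analogous but shorter, since steps 4-6 are skipped (nc_open, nc_inq_varid, nc_inq_vartype, nc_get_var_double, nc_close).

#include <hdf5.h>
#include <vector>

// Simplified full read of a double-precision HDF5 dataset (error checks omitted).
int full_read_hdf5(const char* file_name, const char* dataset_name,
                   std::vector<double>& buf) {
  hid_t file_id = H5Fopen(file_name, H5F_ACC_RDONLY, H5P_DEFAULT);      // step 1
  hid_t dset_id = H5Dopen(file_id, dataset_name, H5P_DEFAULT);          // step 2
  hid_t type_id = H5Dget_type(dset_id);                                 // step 3
  if (H5Tget_class(type_id) != H5T_FLOAT || H5Tget_size(type_id) != sizeof(double))
    return -1;                                                          // not double data

  hid_t file_space = H5Dget_space(dset_id);                             // step 6 (data space)
  int rank = H5Sget_simple_extent_ndims(file_space);                    // step 4
  std::vector<hsize_t> dims(rank);
  H5Sget_simple_extent_dims(file_space, dims.data(), nullptr);          // step 5

  hsize_t total = 1;
  for (int d = 0; d < rank; ++d) total *= dims[d];
  hid_t mem_space = H5Screate_simple(rank, dims.data(), nullptr);       // step 6 (memory space)

  buf.resize(total);
  H5Dread(dset_id, H5T_NATIVE_DOUBLE, mem_space, file_space,
          H5P_DEFAULT, buf.data());                                     // step 7

  H5Sclose(mem_space); H5Sclose(file_space); H5Tclose(type_id);         // step 8
  H5Dclose(dset_id); H5Fclose(file_id);
  return 0;
}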

6.3 System Optimization

In this section, we discuss optimizations that have been implemented in the SciMATE system to help improve the performance of reading data.

6.3.1 Data Access Strategies and Patterns

In our framework, we have considered two different types of access strategies: Full Read and Partial Read. The partial read strategy, in turn, can be implemented via different access patterns.

Full Read: Full read involves loading the entire original dataset for processing. It is the simplest case of block loading, where all the block data is required for computation. Therefore, no subsetting or merge operations are required.

Partial Read: Partial read refers to loading non-contiguous subsets of the input dataset. The need for such a read operation can arise for many reasons. First, not all data records may be analyzed by a particular application. Similarly, not all dimensions may be needed. There is also a possibility that the data elements needed to construct a single record may not be stored contiguously. The key issue with a partial read is that a simple read of all required data elements may lead to poor performance, due to frequent non-contiguous and/or small I/O requests. Such non-contiguous and small reads may be replaced by contiguous or large reads. SciMATE performs this replacement after the input data is partitioned and partitions are assigned to each node, but before each partition is loaded into a memory block. Note that all the details of such restructuring are transparent to the users.

To perform partial read, the users only need to specify the layout of a high-level view of the required data subsets, which usually can be described by a certain high-level access pattern. In other words, partial read is performed based on a specific high-level access pattern explicitly determined by the users beforehand. The different access patterns we support are as follows:

• Strided Pattern: If the required non-contiguous regions have a regular pattern, a strided pattern can be specified, via the starting dimension/record, the access stride vector within each dimension/record subset, the number of dimension/record subsets, as well as the number of dimensions/records within each subset. All the details related to the low-level data layout are mapped from the high-level view of the datasets in terms of records/dimensions, and are hence hidden from the users. The strided patterns can be categorized into a simple-strided pattern and a nested-strided pattern. A simple-strided pattern refers to a series of I/O requests where each request is for a segment that contains the same number of elements, and each segment's starting location is incremented by the same factor in a single dimension/record growth direction. A nested-strided pattern is more complex: it is composed of strided segments separated by regular strides, i.e., a nested-strided pattern is defined by two or more strides in different dimensions and/or the record growth direction, instead of one stride as in the simple-strided access pattern. (A minimal sketch of how a simple-strided pattern can be expressed with the underlying library calls follows this list.)

• Column Pattern: If the required data consists of a set of arbitrary columns, a column pattern can be specified, using that set of columns. This is because, in many HPC data analysis applications, only a subset of the dataset dimensions may be needed for a certain type of data analysis, and SciMATE can map each dimension to a single column during computation.

• Discrete Point Pattern: If the required regions cannot be described by a regular pattern, a discrete point pattern can be specified, using a vector of element locations and the number of required elements.
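To make the strided pattern concrete, the sketch below shows how a simple-strided selection over a 2-D dataset (every other row, all columns) could be expressed with the underlying libraries, using nc_get_vars_double in NetCDF and a hyperslab selection in HDF5. The function name and the choice of selection are illustrative only; SciMATE's adapters hide such calls behind partial_read_by_block.

#include <hdf5.h>
#include <netcdf.h>
#include <vector>

// Simple-strided read over a 2-D dataset: every 2nd row, all columns (illustrative).
void strided_read(int ncid, int varid, hid_t dset_id,
                  size_t nrows, size_t ncols, std::vector<double>& buf) {
  buf.resize((nrows / 2) * ncols);

  // NetCDF: start/count/stride vectors express the strided selection directly.
  size_t start[2] = {0, 0};
  size_t count[2] = {nrows / 2, ncols};   // number of selected rows and columns
  ptrdiff_t stride[2] = {2, 1};           // take every 2nd row, every column
  nc_get_vars_double(ncid, varid, start, count, stride, buf.data());

  // HDF5: the same selection as a hyperslab on the file dataspace.
  hid_t file_space = H5Dget_space(dset_id);
  hsize_t h_start[2] = {0, 0}, h_stride[2] = {2, 1}, h_count[2] = {nrows / 2, ncols};
  H5Sselect_hyperslab(file_space, H5S_SELECT_SET, h_start, h_stride, h_count, nullptr);
  hsize_t mem_dims[1] = {(nrows / 2) * ncols};
  hid_t mem_space = H5Screate_simple(1, mem_dims, nullptr);
  H5Dread(dset_id, H5T_NATIVE_DOUBLE, mem_space, file_space, H5P_DEFAULT, buf.data());
  H5Sclose(mem_space);
  H5Sclose(file_space);
}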

6.3.2 Optimizing Different Types of Accesses

The following optimizations are performed in our framework. For strided patterns, appropriate strided reads are invoked from the NetCDF/HDF5 libraries. We appropriately map the high-level parameters, such as the dimension/record index, to the low-level layout, such as the column/row index. The optimization for the column pattern is described in detail below. For the discrete point pattern, no optimization has been implemented currently.

When only a subset of columns is needed, the first question is whether to use full read or partial read. It turns out that when a contiguous subset of columns is needed, partial read can be beneficial. However, when a non-contiguous subset is needed, the performance can depend upon the particular layout being used.

It also turns out that NetCDF has a better tolerance for column non-contiguity than HDF5. If the loaded columns are composed of multiple discrete regions, the performance of HDF5 partial read is likely to get worse. In comparison, the loading time of any NetCDF partial read, including reading a number of discrete column regions, will be at most twice as long as a full read (while reading the same amount of data). Thus, if a small subset of data is to be loaded, partial read is advantageous for NetCDF.

The second issue is to optimize partial read with column accesses. We choose from two approaches for reading a set of columns, which are fixed-size column read and contiguous column read.

Fixed-size Column Read: Fixed-size column read is a naive strategy. This strategy allows only a fixed number of columns to be read at a time, and the system default value of this fixed number is 1. Thus, the amount of data loaded each time will be the same, resulting in a balanced workload distribution.

Contiguous Column Read: The contiguous column read strategy reads a contiguous column set at a time. Instead of a fixed number, the number of loaded columns in this strategy is determined by column contiguity at runtime. Contiguous column read takes advantage of column contiguity so that the number of column reads equals the number of contiguous regions. This strategy can minimize the number of reads and significantly reduce the overhead resulting from frequent small I/O requests.
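The core of the contiguous column read strategy is grouping the requested column indices into maximal contiguous runs, so that one library read is issued per run. A minimal sketch of that grouping step, independent of any particular format library, is shown below; the function name is ours.

#include <cstddef>
#include <utility>
#include <vector>

// Group sorted column indices into maximal contiguous runs: {0,1,2, 5,6, 9}
// becomes {[0,3), [5,7), [9,10)}, i.e., three column reads instead of six.
std::vector<std::pair<size_t, size_t>> contiguous_runs(const std::vector<size_t>& cols) {
  std::vector<std::pair<size_t, size_t>> runs;  // [first column, one past last column)
  for (size_t i = 0; i < cols.size(); ) {
    size_t j = i + 1;
    while (j < cols.size() && cols[j] == cols[j - 1] + 1) ++j;  // extend the run
    runs.emplace_back(cols[i], cols[j - 1] + 1);
    i = j;
  }
  return runs;
}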

Figure 6.3: Contiguity Experiments (average data loading times, in seconds, for NetCDF and HDF5 under the column read scenarios 1, 2C, 3C, 2D, 2C+2C, and 3D)

The choice of strategy in our system is guided by the observations from a number of experiments we performed. Here, we use k-means clustering to illustrate the influence of column contiguity on the performance of column read. We executed the k-means application on 8 GB 10-dimensional NetCDF and HDF5 datasets. We considered several different possible application scenarios, where different numbers of contiguous/discrete columns were read from the datasets. The results are shown in Figure 6.3. On the horizontal axis, different scenarios are shown. Particularly, “1” means only a single column was read, “2C” indicates that 2 contiguous columns were required, “3C” stands for 3 contiguous columns, “2D” denotes 2 discrete columns, “2C+2C” represents 2 discrete column sets, where each consists of 2 contiguous columns, and finally, “3D” refers to 3 discrete columns.

First, we observed that the data loading performance could be categorized into 3 levels: 1, 2C, and 3C being the first level, 2D and 2C+2C being the second level, and 3D being the third level. To simplify the analysis, we ignored the influence of the number of loaded columns, especially between “1” and “2C”, in the experiments on the HDF5 datasets. Although the numbers of loaded columns differed in the first 3 scenarios, their data loading times were close, because the column reads were performed only once. Similarly, it took 2 column reads to complete data loading in “2D” and “2C+2C”, so the data loading times were longer than the first three, while being shorter than “3D”. Overall, we can conclude that the performance primarily depends upon the number of separate reads, and not on the volume of data read.

Second, the results showed that NetCDF had a better column non-contiguity tolerance than HDF5. For instance, reading 3 discrete HDF5 columns could sometimes be quite expensive, even over 3 times slower than reading a single column, while reading discrete NetCDF columns caused a relatively small overhead. Additionally, we also observed that the strides among the discrete regions had some impact on the performance.

To summarize, we use the observations made from our experiments to guide the selection of the appropriate column access approach, i.e., contiguous column read. Note, however, that a future development may improve the attractiveness of fixed-size column read: once the NetCDF and/or HDF5 libraries allow parallel data accesses from multiple threads in a single process, the performance of fixed-size column read may improve, since a better load balance can be achieved.

6.4 Experimental Results

In this section, we evaluate the functionality and scalability of the SciMATE system on a cluster of multi-core machines. We show the performance achieved with three different data formats: NetCDF, HDF5, and flat-file. Last, we present the results from the optimization evaluation, including comparing the performance of partial read against full read, as well as the performance of fixed-size column read against contiguous column read.

6.4.1 Applications and Environmental Setup

We used three popular data analysis algorithms: k-means clustering (k-means), principal components analysis (PCA), and k-nearest neighbor search (kNN). K-means is one of the most popular data mining algorithms. PCA is a popular dimensionality reduction method that was developed by Pearson in 1901; its goal is to compute the mean vector and the covariance matrix for a set of data points that are represented by a matrix. KNN is a type of instance-based learning method [20] for classifying objects based on the closest training examples in the feature space, and it has been successfully applied in diverse domains such as protein function prediction and image de-noising [178].
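For reference, the core computation of PCA mentioned above reduces to a straightforward accumulation over the data points; the sketch below (our own illustration, not SciMATE code) computes the mean vector and the sample covariance matrix of n d-dimensional points stored row-major.

#include <cstddef>
#include <vector>

// Mean vector and sample covariance matrix of n d-dimensional points (row-major).
void mean_and_covariance(const std::vector<double>& data, size_t n, size_t d,
                         std::vector<double>& mean, std::vector<double>& cov) {
  mean.assign(d, 0.0);
  cov.assign(d * d, 0.0);
  for (size_t i = 0; i < n; ++i)
    for (size_t j = 0; j < d; ++j)
      mean[j] += data[i * d + j];
  for (size_t j = 0; j < d; ++j) mean[j] /= static_cast<double>(n);
  for (size_t i = 0; i < n; ++i)
    for (size_t j = 0; j < d; ++j)
      for (size_t k = 0; k < d; ++k)
        cov[j * d + k] += (data[i * d + j] - mean[j]) * (data[i * d + k] - mean[k]);
  for (size_t e = 0; e < d * d; ++e) cov[e] /= static_cast<double>(n - 1);
}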

The datasets and the application parameters we used are as follows. The sizes of the datasets used for the evaluation in Sections 6.4.2 and 6.4.3 were 16 GB and 8 GB, respectively. With k-means, we used 10-dimensional points, and the number of clusters, k, was set to 100. For both PCA and kNN, the points were 1000-dimensional. In kNN, the number of clusters, k, was set to 100, and the number of nearest neighbors was set to 30.

Our experiments were conducted on a cluster of multi-core machines. Each node uses an AMD Opteron(TM) Processor 8218 with 4 dual-core CPUs (8 cores in all). The clock frequency of each core is 2.6 GHz, and each node has 8 GB of main memory. We have used up to 128 cores (16 nodes) for our study. The NetCDF version is 4.1.1, and the HDF5 version is 1.8.7.

6.4.2 Functionality Evaluation

In this section, we evaluate the data processing performance and the data loading performance of SciMATE, by executing the three data analysis algorithms in a way that involves full reads.

Figure 6.4: K-means: Data Processing Times (# of Nodes = 1)

Figure 6.5: PCA: Data Processing Times (# of Nodes = 1)

For each of the applications, we use 16 GB datasets that are stored in three different data formats: NetCDF, HDF5, and flat-file (FLAT).

Data Processing Times: Figures 6.4 through 6.6 show the comparison results for the three applications as we scale the number of threads used on a single node. Figures 6.7 through 6.9 show the results as we scale the number of nodes used, with 8 threads on each node. We can see that the data processing performance on these three different data formats is quite similar, and moreover, scales with both the number of threads and the number of nodes.

Figure 6.6: KNN: Data Processing Times (# of Nodes = 1)

Figure 6.7: K-means: Data Processing Times (8 Threads per Node)

This shows that SciMATE is capable of supporting different data formats, and of scaling the performance of data analysis applications.

Data Loading Times: Figures 6.10 and 6.11 show the data loading times for k-means and PCA, respectively. The results indicate that loading NetCDF/HDF5 datasets is faster than loading flat-file datasets, demonstrating up to a 62.6% throughput improvement. We believe this is because the highly structured nature of scientific data can facilitate parallel I/O in distributed environments.

Figure 6.8: PCA: Data Processing Times (8 Threads per Node)

Figure 6.9: KNN: Data Processing Times (8 Threads per Node)

Compared with the unstructured flat-file datasets, the data layout of NetCDF/HDF5 datasets is described in the header information, so certain unnecessary scans for each partition or split can be avoided. Moreover, the results also show that loading NetCDF datasets is slightly faster than loading HDF5 datasets. This is related to the performance of the libraries, and is mostly unrelated to SciMATE. Specifically, NetCDF has a linear data layout, i.e., the data arrays are either stored in contiguous space (with a predefined order) or interleaved with a regular pattern. Such regularity in the layout helps improve MPI-IO performance.

Figure 6.10: K-means: Data Loading Times (8 Threads per Node)

Figure 6.11: PCA: Data Loading Times (8 Threads per Node)

In comparison, HDF5 chooses a hierarchical data layout, which is more flexible but leads to more irregular accesses. Particularly, HDF5 uses dataspaces and hyperslabs to map and transfer data between the memory and the file system. This causes buffer packing/unpacking in a recursive way, and results in a performance loss to a certain degree. Another factor is that NetCDF has smaller header I/O overhead than HDF5: in NetCDF, only one header is stored, while in HDF5 the header metadata is dispersed in separate header blocks for each object.

6.4.3 Optimization Experiments

In this section, we evaluate two of the optimizations proposed in Section 6.3, i.e., the use of partial reads and the use of contiguous column reads. Our third optimization, the use of APIs for strided patterns, is not evaluated here, since it is directly supported in NetCDF and HDF5.

Comparing Partial Read with Full Read: To evaluate the benefits of partial read, we designed the following experiment. We executed the PCA application with 4 nodes and 1 thread per node, on both NetCDF and HDF5 datasets (8 GB each). We considered the following cases: 200, 400, 600, and 800 contiguous columns were required by the data analysis application, when the dataset had a total of 1000 columns. For these four cases, we compared the performance of partial read against full read.

Figure 6.12: PCA: Comparison between Partial Read and Full Read

The results reported in Figure 6.12 show that if the number of contiguous columns to be loaded is no more than half of the total number, partial read can outperform full read (full read is shown as the right-most bars, i.e., when all 1000 columns are read); otherwise, full read may be a better option. This is because both NetCDF and HDF5 datasets have row-major ordering, and therefore, the data corresponding to a subset of contiguous columns is actually not stored contiguously in the dataset files. Thus, a partial read involves multiple I/O requests for loading segments from disk files to the read buffer. Therefore, unless the volume of data to be read is quite small, the overhead of separate I/O requests slows down the read operation.

Moreover, compared with the full read, the partial read only loads the required data into memory, leading to better data locality and lower memory consumption. Therefore, partial read can sometimes also accelerate the subsequent reduction process because of the better data locality.

Comparing Fixed-Size Column Read with Contiguous Column Read: We compare the performance of fixed-size column read against that of contiguous column read, by executing the kNN application with 2 threads on one node, on both NetCDF and HDF5 8 GB datasets. In these experiments, SciMATE reads three contiguous column sets with the same number of columns in the 1000-dimensional datasets; each contiguous column set contains 100, 200, or 300 columns in three different comparison experiments.

Figures 6.13 and 6.14 show that contiguous column read outperforms fixed-size column read, even when the number of contiguous regions is quite limited. Contiguous column read can gain a speedup of about 3.0 on the NetCDF datasets, and a speedup of about 30.0 on the HDF5 datasets.

Figure 6.13: KNN: Comparison between Fixed-size Column Read and Contiguous Column Read on NetCDF Datasets

6.5 Related Work

The topics of data-intensive computing and MapReduce have attracted a lot of attention in the past few years. Because of the large volume of work in this area, we will restrict ourselves to work specific to scientific data processing.

At Indiana University, CGL-MapReduce [70] has been developed as an extension to MapReduce. It uses streaming for all the communications, and thus improves the performance to some extent. More recently, a runtime system, Twister [69], has been developed to optimize iterative applications. Somewhat similar optimizations are already implemented in MATE.

Integrating MapReduce systems with scientific data has been a topic of much interest recently [42, 89, 107, 146, 225, 229, 260], as also summarized by Buck et al. [42]. The Kepler+Hadoop project [225], as the name suggests, combines MapReduce processing with Kepler, a scientific workflow platform. Its limitation is that it cannot support processing of data in different scientific formats. Another system [260] allows NetCDF processing using Hadoop, but the data has to be converted into text, causing high overheads.

Figure 6.14: KNN: Comparison between Fixed-size Column Read and Contiguous Column Read on HDF5 Datasets

SciHadoop [42] integrates Hadoop with NetCDF library support to allow processing of NetCDF data with the MapReduce API. We expect to compare the performance of our system with SciHadoop in the future. We would like to note, however, that our underlying system, MATE, has been shown to outperform Hadoop by a large factor in previous studies [108, 109]. Moreover, their current implementation is restricted to handling NetCDF. The authors plan on extending their work to support HDF5 (and list several challenges in accomplishing this), whereas our current implementation already handles HDF5. MARP [185] is another MapReduce-based framework to support HPC analytical applications, which optimizes for certain access patterns. However, it cannot directly operate on scientific data formats.

Research has also been conducted towards processing and querying data in formats like NetCDF with other APIs (i.e., besides MapReduce). The NetCDF Operator (NCO) library has been extended to support parallel processing of NetCDF files [223]. However, a specialized, format-specific API has to be used. To enable querying of large-scale scientific data, the FastQuery framework has been developed [107], but its focus is querying (subsetting) of data rather than processing of data.

6.6 Summary

With the growing importance of analysis of large-scale data in various scientific areas, two trends have become clear. First, data collected from instruments and simulations is going to be stored in certain specialized data formats, depending upon the specific domain (e.g., climate data is almost always stored in NetCDF). Second, with the rapid increase in dataset sizes, it is not feasible to reformat the data, or to load it into a different file system or a database. Thus, we need the ability to develop scalable data analysis applications that can work with the popular scientific data formats.

This work has presented such a framework. Our system, SciMATE, has an API which allows it to be customized for any new scientific data format. We have also developed several optimizations, particularly driven by the observation that most scientific datasets have a large number of attributes, but only a subset of those are needed by any particular application. We have created three instances of the system, for NetCDF, HDF5, and flat-files, and have evaluated them using three popular data analysis tasks. Our results show that each of the three instances of the system scales well on a multi-core cluster.

Table 6.1: Descriptions of the functions in the Scientific Data Processing API

int partition(char*, char*, int, size_t**, size_t**)
    Partitions the input dataset for each computing node by computing the starting location and the size of each partition. The arguments are: file name, dataset name, number of nodes, starting location of each partition, and the size of each partition.

int get_dimensionality(char*, char*)
    Retrieves the dataset dimensionality/rank. The arguments are: file name and dataset name.

size_t get_dataset_size(char*, char*)
    Retrieves the dataset size. The arguments are: file name and dataset name.

int get_dimlens(char*, char*, size_t*)
    Retrieves the length of each dataset dimension. The arguments are: file name, dataset name, and dataset dimensionality.

unsigned long get_unit_num(char*, char*)
    Retrieves the number of units in the dataset. The arguments are: file name and dataset name.

int full_read(char*, char*, void*)
    Reads the entire input dataset. The arguments are: file name, dataset name, and read buffer.

int partial_read(char*, char*, void*, size_t*, size_t*)
    Reads a contiguous selection of the input dataset. The arguments are: file name, dataset name, read buffer, starting location, and buffer size.

int partial_read_by_block(char*, char*, void*, size_t*, size_t*, size_t*, size_t*)
    Reads a non-contiguous selection of the input dataset by regular block. The arguments are: file name, dataset name, read buffer, block starting location, stride vector among blocks, block count, and block size.

int partial_read_by_column(char*, char*, void*, size_t*, size_t*, int*, int)
    Reads a non-contiguous selection of the input dataset by column. The arguments are: file name, dataset name, read buffer, starting location, buffer size, the indexes of the loaded columns, and the number of loaded columns.

int partial_read_by_list(char*, char*, void*, size_t**, int)
    Reads a non-contiguous selection of the input dataset by specifying a list of discrete elements. The arguments are: file name, dataset name, read buffer, a vector of element locations, and the element count.

Table 6.2: Invoked Calls in the NetCDF/HDF5 Libraries for the full_read Implementation

Step  Invoked NetCDF Call    Invoked HDF5 Call(s)
1     nc_open                H5Fopen
2     nc_inq_varid           H5Dopen
3     nc_inq_vartype         H5Dget_type
4     (nc_inq_varndims)      H5Sget_simple_extent_ndims
5     (nc_inq_dimlen)        H5Sget_simple_extent_dims
6     -                      H5Dget_space, H5Screate_simple
7     nc_get_var_double      H5Dread
8     nc_close               H5Dclose, H5Tclose, H5Sclose, H5Fclose

Chapter 7: Smart: A MapReduce-Like Framework for In-Situ Scientific Analytics

In the previous chapter, we focused on processing disk-resident scientific datasets, which involves handling multiple scientific data formats. In this chapter, we discuss adapting MapReduce for in-situ analytics, which processes scientific data in (distributed) memory as it is generated by a simulation program.

7.1 MapReduce and In-Situ Analytics

In this section, we first argue why the MapReduce API is suitable for in-situ analytics, and then focus on the challenges in applying the MapReduce idea for efficient in-situ scientific analytics.

7.1.1 Opportunity and Feasibility

MapReduce [63] has been one of the most widely adopted programming models for developing data analytics implementations – though it is perhaps not as widely accepted in science areas as it is in commercial areas. The MapReduce API not only simplifies parallelization, but its framework implementation also handles much of the scheduling, task management, and data movement. However, none of its current implementations is directly suitable for in-situ scientific analytics.

We posit that the MapReduce API is indeed suitable for a large set of analytics tasks one might perform in-situ on a scientific simulation. We now give some specific use cases of in-situ analytics reported in various studies, and explain how these cases can potentially fit the MapReduce paradigm: 1) visualization algorithms [114, 250], where most steps are embarrassingly parallel and the others involve reductions; 2) statistical analytics [266] and similarity analytics [205], where statistics like averages, histograms, and mutual information need to be calculated – steps that are very well suited for MapReduce [42], or even for higher-level frameworks built on top of MapReduce, e.g., Pig [168] and Hive [212]; and 3) feature analytics [129] and clustering analytics [257], which have been efficiently implemented in MapReduce (though in an offline fashion), e.g., logistic regression and k-means clustering through Spark [252].

Besides the match between the target applications and the choice of programming model, another important issue tends to be that of programmer expertise. In this respect, we argue that as MapReduce has been gaining great popularity in recent years, many scientists are now well trained in writing MapReduce-style code for scientific analytics [42, 146, 154, 185, 213, 225, 237, 260]. Therefore, an in-situ MapReduce-like framework can be a promising approach to improving both the productivity and the maintainability of scientific analytics.

7.1.2 Challenges

We next discuss the challenges of bridging the gap between in-situ scientific analytics and MapReduce, which we summarize as four mismatches.

Data Loading Mismatch

As the name ‘in-situ’ implies, the downstream analytics program is required to take its input directly from (distributed) memory rather than from a file system, as soon as the simulated data becomes available. However, existing MapReduce implementations are not designed for such a scenario. To further elaborate on this data loading mismatch, we first categorize all the MapReduce implementations into four types according to the data loading mechanism.

1. Loading Data from Distributed File Systems: A prominent example is Hadoop, as well as its variants like M3R [192] and SciHadoop [42], which load data from the Hadoop Distributed File System (HDFS). Moreover, Hadoop actually mimics Google's MapReduce [63], which loads data from the Google File System (GFS). Additionally, Disco [2], a MapReduce implementation in Erlang, loads data from the Disco Distributed File System (DDFS).

2. Loading Data from Shared and/or Local File Systems: Systems like MARIANE [71] and CGL-MapReduce [70] have adapted MapReduce to scientific analytics environments by loading data from a shared file system. Moreover, other MPI-based implementations like MapReduce-MPI [173] and MRO-MPI [160] can load data from a shared file system and/or local disk.

3. Loading Data from Memory: Pthread-based MapReduce prototypes like Phoenix [179], Phoenix++ [208], and MATE [109] can load data from memory. However, these prototypes are restricted to shared-memory environments, and hence they are not currently available for distributed computing.

4. Loading Data from a Data Stream: Though MapReduce was originally designed for batch processing, systems like HOP [59], M3 [23], and iMR [147] have focused on stream processing.

Clearly, the first two categories, where data is loaded from file systems, cannot support in-situ analytics. Similarly, the third class lacks native support for the global synchronization required in a distributed environment. The fourth group seems more suitable, as one might consider the possibility of wrapping simulation output as a data stream. However, this approach is still problematic, as simulation outputs are sporadic, and more critically, stream systems do not normally support iterative processing [230].

However, among all the MapReduce implementations we have examined, we find Spark [252] to be an exception here. Its input data layout is defined as a Resilient Distributed Dataset (RDD) [251], which can be derived from all of the above data source options. However, Spark still has functionality and performance limitations, which will be demonstrated through a series of experiments we report in Section 7.4.

Programming View Mismatch

Simulations are usually implemented in MPI (or a PGAS language) suitable for distributed memory environments, possibly in conjunction with a shared-memory API like OpenMP/OpenCL. With these low-level parallel programming libraries, the programmers explicitly express parallelism in a parallel programming view. On the other hand, the simplified interface of MapReduce presents a sequential programming view, which hides all the parallelization complexities. Thus, traditional MapReduce implementations cannot explicitly take partitioned simulation output as the input, or launch the execution of analytics from an SPMD region. Without any change on the downstream MapReduce side, this mismatch cannot be addressed in a realistic way.

An elegant option would be to develop a new MapReduce implementation that can present a hybrid programming view. Particularly, at the beginning, a parallel programming view should be presented, to allow the programmers to be aware of all the partitions during the parallel execution. After the partitioned data are input, a sequential programming view should follow, so that parallelism details are hidden.

Memory Constraint Mismatch

Since simulation programs normally execute with problem sizes that require all or almost all available main memory on each node, the in-situ analytics program can only use a very small amount of memory. However, nearly all existing MapReduce implementations are memory-intensive. This is primarily because, in the mapping phase, each element results in intermediate data in the form of one or more key-value pairs, which can have an even greater size than the original input data. Note that although a combiner function at the mapper side can significantly reduce the size of intermediate data in the shuffling phase, it does not help reduce the peak memory consumption in the mapping phase. The memory constraint mismatch cannot be addressed unless we redesign the MapReduce execution flow – particularly, we need to avoid the intermediate key-value pairs.

Programming Language Mismatch

The last mismatch comes from the programming languages that are used to implement simulation and analytics programs with MapReduce. Almost all of the HPC simulations in use are written in Fortran or C/C++, whereas both Hadoop and Spark, which are the most widely adopted MapReduce implementations (though Spark also provides other functionality), cannot natively support Fortran or C/C++. Although this mismatch can be alleviated by using alternate C/C++-based MapReduce implementations [71, 160, 173, 179], these systems are not widely adopted.

7.2 System Design

In this section, we discuss the design and implementation of our system. Overall, Smart's design addresses all the challenges we described in the last section, specifically: 1) to address the data loading mismatch, Smart supports processing data directly from the memory of the simulation program – and in one of the in-situ modes (time sharing), does so without requiring an extra data copy; 2) to address the programming view mismatch, Smart offers a hybrid programming view – this exposes the data partitions to the analytics while launching the data processing, and can still hide parallelism during the execution; 3) to address the memory constraint mismatch, Smart achieves high memory efficiency by modifying the original MapReduce API (while still keeping the programming effort very low), and more specifically, avoids the large number of key-value pairs and the need for shuffling; and 4) to address the programming language mismatch, Smart is implemented in C++11, in conjunction with OpenMP and MPI.

7.2.1 System Overview

Figure 7.1 gives an overview of the execution flow of a typical application using Smart in a distributed environment. First, given a simulation program, each compute node generates a data partition at each time-step. Instead of the data being output to disk, the memory-resident data partitions are immediately taken as the input by the downstream Smart analytics job(s). Since the data partitions are generated from the SPMD region of the simulation program, the Smart jobs are also launched from the same code region.

Figure 7.1: System Overview of Smart (each simulation process produces a partition per time-step; the Smart runtime splits each memory block, applies map/reduction over reduction objects, and combines the results into the final output)

Unlike most distributed data processing systems, Smart can directly expose these partitions to the subsequent processing, rather than involving any explicit data partitioning among the compute nodes.

Next, the Smart runtime scheduler processes partitioned data block by block. For each data block, the Smart runtime scheduler equally divides it into multiple splits, where each split is assigned to a thread for processing. Additionally, Smart binds each thread to a specific CPU core to maximize the performance.
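A minimal sketch of this block-splitting step is shown below. It is illustrative only, not Smart's actual scheduler code: each OpenMP thread is handed an equal, contiguous share of the block's unit chunks; binding threads to cores can be requested, e.g., via the OMP_PROC_BIND environment variable.

#include <omp.h>
#include <cstddef>

// Illustrative split of one data block among threads: each thread processes an
// equal, contiguous range of unit chunks of the block.
void process_block(const double* block, size_t num_chunks, size_t chunk_size) {
  #pragma omp parallel
  {
    int tid = omp_get_thread_num();
    int nthreads = omp_get_num_threads();
    size_t per_thread = (num_chunks + nthreads - 1) / nthreads;
    size_t begin = tid * per_thread;
    size_t end = begin + per_thread < num_chunks ? begin + per_thread : num_chunks;
    for (size_t c = begin; c < end; ++c) {
      const double* chunk = block + c * chunk_size;
      // ... generate a key for the chunk and accumulate it on a reduction object ...
      (void)chunk;
    }
  }
}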

In processing the elements within a split, there are two key operations, reduction and combination, which are carried out on two core map structures, the reduction map and the combination map, respectively. To support these operations, the programmers need to define a reduction object, which represents the data structure of the value in the key-value pairs of the two maps. This data structure maintains the accumulated (or reduced) value across all key-value pairs that have the same key. In the reduction operation, a key is first generated for each element in the split. With this key, the runtime next locates a reduction object in the reduction map, and then the corresponding element is accumulated onto this reduction object. In the combination process, all the reduction maps are first combined into a single combination map locally, and then all the combination maps on each node are further merged on the master node.

The above execution flow modifies the original MapReduce processing, but it is also the key to the high memory efficiency of Smart. Specifically, the explicit declaration of the reduction object eliminates the shuffling phase of MapReduce. We will show the specifics of the API through examples later in this section, but the key point is that beyond the declaration of the reduction object, the programming effort is not any higher than that involved in using the original MapReduce API.

7.2.2 Two In-Situ Modes

To maximize performance in different scenarios, our system provides two in-situ modes – time sharing and space sharing. More specifically, we observe that: 1) for certain simulations and/or architectures, memory can be a significant constraint, and we must avoid unnecessary data copying; and 2) on many-core architectures, simulations may not be able to use all available cores effectively, and dedicating a certain number of cores to data analytics can be feasible and desirable. These two situations (which are not necessarily exclusive) lead to the time sharing and space sharing modes, respectively.

Time Sharing Mode: Time sharing mode aims to minimize the memory consumption of analytics, by avoiding an extra data copy of the simulation output. Note that although the memory copy itself is likely not an expensive operation, it can increase the total memory requirements, which can lead to performance degradation in certain cases.

Figure 7.2: Two In-Situ Modes ((a) time sharing mode, where the Smart scheduler reads each time-step's output in place via a read pointer, with simulation and analytics threads alternating on all cores; (b) space sharing mode, where simulation threads feed time-step outputs into a circular buffer consumed by analytics threads on dedicated cores)

As shown in Figure 7.2(a), to avoid an extra data copy, Smart sets a read pointer to the memory space corresponding to the output from a particular time-step (when the data is ready). Thus, this data can now be shared by both the simulation and analytics programs. However, because this memory space is subject to being overwritten by the simulation program, the analytics logic must execute before the simulation resumes. As a result, in this mode simulation and analytics run in turns, and each makes full use of all the cores of each node (hence the name time sharing).

Space Sharing Mode: Consider a cluster where every node is an Intel Xeon Phi. Since each coprocessor has a much larger number of cores than a standard CPU, a simulation program written for a standard multi-core cluster is unlikely to use all the cores of the Xeon Phi effectively. In this case, instead of periodically stopping the progress of the simulation to perform the analytics, one can simply dedicate a certain number of the available cores to the analytics.

As shown in Figure 7.2(b), Smart maintains a circular buffer internally, in which each cell can allocate memory on demand and be used for caching the output from a time-step. In this mode, one can view the simulation program and Smart as a producer and a consumer, respectively. Once a time-step's output is generated, if the circular buffer is not full, this data can be fed to the Smart middleware by copying it into an empty cell. Otherwise, the simulation program will be blocked until a cell in the circular buffer becomes available.
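A minimal sketch of the producer-consumer handshake implied by this circular buffer is shown below. It is illustrative only and not Smart's internal implementation: the simulation blocks in feed() when the buffer is full, and the analytics side blocks in take() when it is empty. The class and method names are our own.

#include <condition_variable>
#include <mutex>
#include <queue>
#include <vector>

// Illustrative bounded buffer between a simulation (producer) and analytics (consumer).
class TimeStepBuffer {
 public:
  explicit TimeStepBuffer(size_t capacity) : capacity_(capacity) {}

  void feed(std::vector<double> step) {        // called by the simulation task
    std::unique_lock<std::mutex> lock(m_);
    not_full_.wait(lock, [&] { return q_.size() < capacity_; });
    q_.push(std::move(step));
    not_empty_.notify_one();
  }

  std::vector<double> take() {                 // called by the analytics task
    std::unique_lock<std::mutex> lock(m_);
    not_empty_.wait(lock, [&] { return !q_.empty(); });
    std::vector<double> step = std::move(q_.front());
    q_.pop();
    not_full_.notify_one();
    return step;
  }

 private:
  size_t capacity_;
  std::queue<std::vector<double>> q_;
  std::mutex m_;
  std::condition_variable not_empty_, not_full_;
};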

7.2.3 Runtime System

Launching Smart Runtime

Smart is written in C++11, using OpenMP and MPI to achieve parallelism and to be compatible with a scientific simulation environment. Thus, launching Smart does not require installing any additional libraries (e.g., HDFS). We now show how to launch Smart in the two different in-situ analytics modes.

Listing 7.1: Launching Smart in Time Sharing Mode
1 void simulate(Out* out, size_t out_len, const Param& p) {
2   /* Each process simulates an output partition of data type In and length in_len. */
3   // Launch Smart after simulation in the parallel code region.
4   SchedArgs args(num_threads, chunk_size, extra_data, num_iters);
5   unique_ptr<Scheduler<In, Out>> smart(new DerivedScheduler<In, Out>(args));
6   smart->run(partition, in_len, out, out_len);
7 }

Launching Smart in Time Sharing Mode: As demonstrated in Listing 7.1, to run Smart in this mode, only 3 lines (lines 4 - 6) need to be added to the simulation program. The example code shows the execution of processing a single time-step. Lines 4 and 5 construct a derived Smart scheduler by specifying the number of threads, the size of a unit data chunk (i.e., unit element), the extra data besides the input array (e.g., the initial k centroids required in k-means clustering), and the number of iterations. Note that the Smart scheduler class is defined as a template class, and hence Smart can take any array type as input or output, without complicating the application code. In line 6, Smart launches the analytics by taking the partitioned data as the input, and the final result will be output to the given destination. Note that the definition of the reduction object, as well as the derived Smart scheduler class, are implemented in a separate file based on another API set, which does not add any complexity to the original simulation code.

Listing 7.2: Launching Smart in Space Sharing Mode
1 void simulate(Out* out, size_t out_len, const Param& p) {
2   /* Initialize both simulation and Smart. */
3   #pragma omp parallel num_threads(2)
4   #pragma omp single
5   {
6     #pragma omp task // Simulation task.
7     {
8       omp_set_num_threads(num_sim_threads);
9       for (int i = 0; i < num_steps; ++i) {
10        /* Each process simulates an output partition of length in_len. */
11        smart->feed(partition, in_len);
12      }
13    }
14    #pragma omp task // Analytics task.
15    for (int i = 0; i < num_steps; ++i)
16      smart->run(out, out_len);
17  }
18 }

Launching Smart in Space Sharing Mode: As shown in Listing 7.2, the space sharing mode requires a somewhat larger amount of work than the time sharing mode, since an extra level of task parallelism has to be deployed. Particularly, two OpenMP tasks are created for concurrent execution. After the initialization of both the simulation and Smart, one task encapsulates the simulation code and then feeds its output to Smart (lines 6 - 13), and the other task runs the analytics (lines 14 - 16). The number of threads used for the simulation is specified within the simulation task, and the number of threads used for the analytics is specified when Smart is initialized. Note that MPI code is hidden in both the simulation task and the analytics task, and in this mode MPI functions may be called concurrently by different threads. Thus, to avoid potential data races, the level of thread support should be upgraded to MPI_THREAD_MULTIPLE when the MPI environment is initialized.

Data Processing Mechanism

Algorithm 8: run(const In* in, size_t in_len, Out* out, size_t out_len)
 1: process_extra_data(extra_data_, combination_map_)  {Process the extra data if needed}
 2: for each iteration iter do
 3:     if iter > 1 then
 4:         Distribute the global combination map to each local combination map
 5:     end if
 6:     Distribute the local combination map to each reduction map
 7:     for each processing unit chunk ∈ in do
 8:         key ← gen_key(chunk, data_, combination_map_)
 9:         accumulate(chunk, data_, reduction_map_[key])
10:     end for  {Reduction}
11:     for each (key, red_obj) ∈ reduction_map_ do
12:         if key exists in combination_map_ then
13:             merge(red_obj, combination_map_[key])
14:         else
15:             move red_obj to combination_map_[key]
16:         end if
17:     end for  {Local combination and global combination}
18:     post_combine(combination_map_)  {Perform post-combine operations if needed}
19: end for
20: if out ≠ NULL and out_len > 0 then
21:     for each (key, red_obj) ∈ combination_map_ do
22:         convert(red_obj, out[key])
23:     end for
24: end if  {Output results from the combination map}

Algorithm 8 shows the data processing mechanism in Smart. Note that all the functions called in the pseudo-code are from our API, and either have to be or can be implemented by the programmers. Due to space limits, the full description of the API is available in a separate extended version [230]. At the beginning, some extra input can be processed to help initialize the combination maps as well as the reduction maps (lines 1 - 6). In the reduction phase (lines 7 - 10), as a data block is divided into multiple splits, each thread processes a data split one unit element at a time. In line 8, a key is generated for the unit element. Line 9 accumulates the derived data from the element into a reduction object, which can be located in the reduction map by the generated key. The reduction object is updated in place – no intermediate key-value pair is emitted or stored, and thus, no shuffling phase is needed during the reduction. This is a key difference between our alternate API and the conventional MapReduce paradigm.

Lines 11 - 17 show the combination phase consisting of two steps – local combination and global combination. In the local combination, the reduction maps maintained by all the threads on a process are combined into a local combination map. Particularly, the two reduction objects associated with the same key are merged into one. In the global combination, the local combination maps on all compute nodes are further combined into a global combination map that holds the global result. This global combination leverages the same merge operation used for the local combination. Line 18 can update reduction objects after the combination phase for each iteration, e.g., computing average based on sum and count. Finally, lines 20 - 23 convert all the reduction objects in the global combination map into the desired output.
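One plausible way to realize the global combination step for a simple reduction object, such as the histogram bucket of the next subsection, is an MPI reduction over the per-node counts. The sketch below is only an illustration of the idea, not Smart's actual combination code, and the function name is our own.

#include <mpi.h>
#include <vector>

// Illustrative global combination for per-node histogram counts:
// each rank holds local bucket counts; rank 0 receives the element-wise sums.
void combine_counts(const std::vector<unsigned long>& local_counts,
                    std::vector<unsigned long>& global_counts) {
  global_counts.assign(local_counts.size(), 0);
  MPI_Reduce(local_counts.data(), global_counts.data(),
             static_cast<int>(local_counts.size()),
             MPI_UNSIGNED_LONG, MPI_SUM, /*root=*/0, MPI_COMM_WORLD);
}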

A Smart Analytics Example: Histogram

Listing 7.3: Histogram as an Example Application 1 Derive a reduction object: 2 struct Bucket : public RedObj { 3 size_t count = 0; 4 }; 5 Derive a system scheduler: 6 template 7 class Histogram : public Scheduler { 8 // Compute the bucket ID as the key. 9 int gen_key(const Chunk& chunk, const In* data, const map>& combination_map) const override { 10 // Each chunk has a single element. 11 return (data[chunk.start] - MIN)) / BUCKET_WIDTH; 12 } 13 // Accumulate chunk on red obj. 14 void accumulate(const Chunk& chunk, const In* data, unique_ptr& red_obj) override { 15 if (red_obj == nullptr) red_obj.reset(new Bucket); 16 red_obj->count++; 17 } 18 // Merge red obj into com obj. 19 void merge(const RedObj& red_obj, unique_ptr& com_obj) override { 20 com_obj->count += red_obj->count; 21 } 22 };

We now illustrate how to develop an analytics program by using Smart. The analytics

code can be reused in any (even offline) analytics mode. As an example, Listing 7.3 shows

equi-width histogram construction. Two major steps are taken. To begin with, the user

needs to define a derived reduction object class. Here the class Bucket represents a histogram bucket, consisting of a single field count.

In the second step, a derived system scheduler class should be defined, e.g., Histogram

here. Note that to facilitate the manipulation of datasets of different types, in our

system the derived class can be defined as either a template class or a class specific to an

input and/or output array type. For this kind of non-iterative application, the user usually

only needs to implement three functions – gen_key, accumulate, and merge. First, the gen_key function computes the bucket ID based on the element value in the input data chunk, and this bucket ID serves as the returned key. For example, if the element value falls within the value range of the first bucket, then 0 is returned. For simplicity, we assume that the minimum element value is available as a priori knowledge or can be retrieved by an earlier Smart analytics job. Note that in this application, since each element should be examined individually, each chunk serving as a processing unit contains only a single array element. Next, in the reduction phase, the accumulate function increments the count of the bucket that corresponds to the key returned by gen_key. Lastly, given two reduction objects, where the first one, red_obj, is from the reduction map, and the second one, com_obj, is from the combination map, the merge function adds the count of red_obj to com_obj in the combination phase.
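
As a usage illustration only, a driver for the Histogram scheduler of Listing 7.3 might look as follows. It assumes the run(in, in_len, out, out_len) entry point shown in Algorithm 8, that MIN and BUCKET_WIDTH are defined consistently with the listing, and that the Smart headers are available; the names data, counts, and NUM_BUCKETS, as well as the default construction of the scheduler, are hypothetical.

#include <cstddef>
// #include "scheduler.h"   // Smart headers providing Scheduler, Chunk, RedObj

int main() {
  // In a real in-situ run, this pointer would reference the time-step output
  // already resident in the simulation's memory; a tiny array is used here.
  double data[] = {0.3, 1.7, 2.2, 0.9, 3.4, 2.8};
  const size_t len = sizeof(data) / sizeof(data[0]);

  const size_t NUM_BUCKETS = 4;        // number of equi-width buckets
  size_t counts[NUM_BUCKETS] = {0};    // output: one count per bucket

  Histogram<double> hist;              // derived scheduler from Listing 7.3
  hist.run(data, len, counts, NUM_BUCKETS);
  return 0;
}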

From the above example, we can see that Smart provides a sequential programming

view for application development, and the user only needs to write some sequential code

based on the declared reduction object. Thus, like a traditional MapReduce framework, our

system makes parallelism entirely transparent to the application code. Note that unlike a

MapReduce job optimized by a combiner function, our application code does not emit any

key-value pair as intermediate result. More code examples can be found in our extended

technical report [230].

7.3 System Optimization for Window-Based Analytics

7.3.1 Motivation

In practice, simulation output may contain some short-term volatility or undesired fine-scale structures. In such cases, it is important to perform analytics for specific ranges of time-steps, also referred to as sliding windows. In some other cases, in-situ analytics can involve certain preprocessing steps like denoising [98] and smoothing [93, 156], which also execute on a sliding window basis. A simple example of such window-based analytics is moving average, where the average of the elements within every window snapshot is

computed.

A critical challenge in the implementation of such window-based analytics is that of

high memory consumption. More broadly, suppose we are calculating a derived quantity

corresponding to each separate window snapshot centered at each input element (e.g., the average for each distinct window of length W). The space complexity using Smart’s

processing is Θ(R × N), where R and N denote two factors – the size of each reduction

object and the maximal number of reduction objects maintained by Smart, respectively.

On one hand, the size of the reduction object is dependent on the specific application, and

it typically varies from Θ(1) to Θ(W). For example, since average is algebraic and can

be computed by sum and count, the size of reduction object for moving average will only

be Θ(1), whereas median is holistic and can only be computed by preserving all elements, so the size of the reduction object for moving median is Θ(W). Another example is the K-nearest

neighbor smoother, where the size of reduction object is Θ(K), 1 ≤ K ≤ W .

On the other hand, the total number of reduction objects in window-based analytics is

equal to the input data size, since each input element corresponds to a window snapshot in

the form of a reduction object. Irrespective of the application, since N can often be too large to meet the memory constraints of in-situ scenarios, it is very desirable to reduce the space complexity, especially by reducing the maximal number of reduction objects.
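
To make the two endpoints of this range concrete, the following hypothetical reduction objects illustrate why an algebraic moving average needs only Θ(1) state while a holistic moving median needs Θ(W) state; the RedObj base class is from Smart's API, but the member layouts are illustrative assumptions.

#include <cstddef>
#include <vector>

struct RedObj { virtual ~RedObj() = default; };  // base reduction object

// Θ(1): average is algebraic, so a running sum and a count suffice.
struct MovingAvgObj : public RedObj {
  double sum = 0.0;
  size_t count = 0;
};

// Θ(W): median is holistic, so all W elements of the window must be kept.
struct MovingMedianObj : public RedObj {
  std::vector<double> window_elems;  // grows up to the window size W
};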

7.3.2 Optimization: Early Emission of Reduction Objects

Algorithm 9: reduce(Split split)
 1: for each data chunk ∈ split do
 2:   for each key k generated by chunk do
 3:     Let the reduction object red_obj be reduction_map_[key]
 4:     accumulate(chunk, data_, red_obj)
 5:     if red_obj.trigger() then
 6:       convert(red_obj, out_[key])
 7:       reduction_map_.erase(key)
 8:     end if   {Optimization for early emission}
 9:   end for
10: end for

We develop the optimization based on the following observation. For most elements,

all the associated window snapshots are entirely covered by their respective local split of

data. As a result, most reduction object values have been finalized in the (local) reduction

phase, and they will not be involved in the subsequent combination phase. By capturing this

observation, we design a mechanism that can support early emission of reduction objects in

the reduction phase, which is in contrast to the original design that holds all the reduction

objects until the combination phase ends.

Our optimization is implemented as follows. First, we extend the reduction object class by adding a trigger function. This trigger evaluates a self-defined emission condition, and determines if the reduction object should be emitted early from the reduction map. By default, the function returns false, and hence no early emission is triggered. Second, we extend the implementation of the reduce operation, which is an internal step in Smart scheduling. Lines 5 - 7 in Algorithm 9 show the extension. Once a data element is accumulated on a reduction object (line 4), the added trigger function evaluates a user-defined emission condition (line 5). If this condition is satisfied, the reduction object will be immediately converted into an output result, and then be erased from the reduction map (lines 6 and 7). With such an optimization, the maximal number of reduction objects

that need to be maintained is reduced from the input data size to the window size.

To support such an optimization, the user only needs to override the trigger function when deriving the reduction object class. Particularly for window-based applications, the emission condition can be that the number of elements that have so far contributed to a reduction object equals the window size, which indicates that the reduction object value has been finalized. It should be noted that this optimization is not specific to in-situ window-based analytics; it can also be broadly applied to other applications, even for offline analytics [145, 148–150, 187–189, 224, 226, 227, 231]. A simple example is matrix multiplication, where the number of element-wise multiplications that contribute to a single output element is a fixed number.
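
A minimal sketch of this emission condition, assuming the extended RedObj base class exposes the trigger() hook described above (the field names count and window_size are illustrative assumptions), is shown below for a moving-average reduction object.

#include <cstddef>

struct RedObj {
  virtual ~RedObj() = default;
  // Default: never emit early, matching the unoptimized behavior.
  virtual bool trigger() const { return false; }
};

struct MovingAvgObj : public RedObj {
  double sum = 0.0;
  size_t count = 0;        // elements accumulated so far
  size_t window_size = 0;  // W, assumed to be set when the object is created

  // Emit as soon as all W elements of this window snapshot have arrived.
  bool trigger() const override { return count == window_size; }
};

Once trigger() returns true (line 5 of Algorithm 9), the object is converted and erased immediately, so only the objects for windows that are still incomplete remain in the reduction map at any time.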

7.4 Experimental Results

In this section, we evaluate both the efficiency and the scalability of our system on both multi-core and many-core clusters. First, we compare with Spark [252] – a popular

MapReduce implementation (while also providing other functionality), which has been

shown to outperform Hadoop by up to 100x. Second, we compare with analytics

programs written with lower-level APIs (MPI and OpenMP), to measure both the

programmability and overheads of our middleware approach. Third, we evaluate the

scalability of Smart as the number of nodes and cores is increased. Next, we focus on

understanding and comparing performance for time sharing and space sharing modes.

Lastly, we evaluate the effect of the optimization for window-based analytics, by

comparing the performance with an implementation that disables the trigger mechanism.

7.4.1 Applications and Environmental Setup

We experimented with nine applications that represent six different classes of in-situ analytics – these classes were previously described as in-situ use cases from the literature in Section 7.1.1. The classes of analytics and specific applications are: 1) visualization: grid aggregation [233] groups the elements within a grid into a single element for multi-resolution visualization; 2) statistical analytics: histogram renders the data distribution with equi-width buckets; 3) similarity analytics: mutual information reflects the similarity or correlation between two variables; 4) feature analytics: logistic regression measures the relationship between a dependent variable and multiple independent variables; 5) clustering analytics: k-means tracks the movement of centroids in different time-steps [257]; and 6) window-based analytics: moving average and moving median compute the average and median in a sliding window, respectively; Gaussian kernel density estimation plots data density with the Gaussian kernel; and the Savitzky-Golay filter [184] is a well-known smoothing filter.

The above analytics programs can be applied on a variety of simulation programs.

However, from a performance view-point, only two aspects of the simulation program are

important for us – the memory requirements for the simulation, and relative to it, the

amount of data that is either output or needs to be analyzed every time-step. Thus, we

choose two open-source simulation programs that have very different amounts of output.

Specifically, for every time-step in our experimental setup, Heat3D [5] generates large

volumes of data, e.g., 400 MB per node, whereas Lulesh [7] has a moderate amount of

output, which is typically smaller than 100 MB on each node.

Our experiments were conducted on two different clusters. The first cluster is a more traditional cluster with multi-core nodes – specifically, each node is an Intel(R) Xeon(R)

Processor with 4 dual-core CPUs (8 cores in all). The clock frequency of each core is

2.53 GHz, and the system has 12 GB of main memory. We experiment with time sharing mode only on this cluster, as the simulation program can be expected to scale with all available cores. We have used up to 512 cores (64 nodes). The second cluster has a many-core accelerator on each node, and both time sharing and space sharing modes are used and compared. Each node on this cluster has an Intel Xeon Phi SE10P coprocessor, with

61 cores and a clock frequency of 1.1 GHz (488 cores in total). The memory size of the coprocessor is 8 GB.

7.4.2 Performance Comparison with Spark

Although Spark can directly load data from memory and hence can address the data loading mismatch, it cannot overcome the other three mismatches mentioned in

Section 7.1.2. Thus, to make a fair comparison, we let Spark bypass all the other mismatches with the following setup: 1) to bypass the programming view mismatch, the simulation program was replaced by a simple emulator – a sequential program that outputs double precision array elements that follow a normal distribution, and in addition, the

Figure 7.3: Performance Comparison with Spark
[Computation times (secs) of Smart versus Spark using 1, 2, 4, and 8 threads: (a) Logistic Regression, (b) K-Means, (c) Histogram.]

experiments were only conducted on a single node with 8 cores, 2) the memory constraint

mismatch was also addressed by the use of the emulator which hardly consumed any extra

memory, and thus there was no tight memory bound for the analytics programs; and 3) to

bypass the programming language mismatch, the emulator used by Spark was written in

Java. 40 GB data was output from the simulation, over 800 time-steps, and the number of

threads used for analytics was varied from 1 to 8. The version of Spark used was 1.1.1.

We used three applications for comparison, with the following parameters – 1) logistic regression: the number of iterations and the number of dimensions were 10 and 15, respectively; 2) k-means: the number of centroids, the number of iterations, and the

number of dimensions were 8, 10, and 64, respectively; and 3) histogram: 100 buckets

were generated. Particularly, both logistic regression and k-means were implemented

based on the example codes provided by Spark. Since the emulation code was not

parallelized, here we only report the computation times of analytics.

The comparison results are shown in Figure 7.3. Smart can outperform Spark by up to

21x, 62x, and 92x, on logistic regression, k-means, and histogram, respectively. The

reason for such a large performance difference is three-fold. First, like other MapReduce

implementations, Spark emits massive amounts of intermediate data after the map

operation, and grouping is required before reduction. By contrast, Smart performs all

reduction in place within the reduction maps, avoids emitting any key-value pairs, and thus,

completely eliminates the need for grouping. Moreover, every Spark transformation

operation makes a new RDD (Resilient Distributed Dataset) [251] due to its immutability.

In comparison, all Smart operations are carried out on reduction maps and combination

maps, and these maps can be reused even for iterative processing. Further, Spark serializes

RDDs and sends them over the network even in local mode, whereas Smart avoids copying

any reduction object from the reduction map to the combination map, by taking advantage of the

shared-memory environment within each compute node.

Besides the efficiency advantage, we can also see that Smart scales much better than

Spark, at least in the shared-memory environment. In particular, Smart can achieve a speedup of 7.95, 7.71, and 7.96 by using 8 threads on logistic regression, k-means, and histogram, respectively. This is because, although Spark allows the user to control the number of worker threads, it still launches extra threads for other tasks, e.g., communication and the driver’s user interface. Consequently, when 8 worker threads were used for Spark execution, the speedup becomes relatively small, because not all 8 cores are devoted to the computation. By contrast, Smart does not launch any extra threads, and the analytics is efficiently parallelized on all threads.

In addition, Smart also achieves much higher memory efficiency than Spark. It turns out that for all three applications, Spark consistently takes up over 90% of the total memory (12 GB), whereas the memory consumption of Smart is only 4.3% (528 MB).

Since the time-step size is already 512 MB, the analytics program run by Smart actually consumes only around 16 MB of memory. Note that the time-step size is much smaller than the memory capacity, and hence Spark is very unlikely to spill the input to the disk.

7.4.3 Performance Comparison with Low-level Analytics Programs

In the second experiment, we compared both the programmability and performance of analytics programs written using Smart against the ones that were manually implemented in OpenMP and MPI. We used logistic regression and k-means with the same parameters as in Section 7.4.2. 1 TB of data was processed on a varying number of nodes, ranging from

8 to 64.

Figure 7.4: Performance Comparison with Low-Level Programs
[Computation times (secs) of Smart versus hand-written MPI/OpenMP implementations on 8, 16, 32, and 64 nodes: (a) K-Means, (b) Logistic Regression.]

First, it turns out that Smart is effective in simplifying application development, by

saving the effort of implementing and debugging low-level parallelization details.

Specifically, for k-means and logistic regression, 55% and 69%, respectively, of the lines

of OpenMP/MPI codes in the low-level implementations are either eliminated or

converted into sequential code by Smart. Note that such low-level code is usually the most error-prone part for programmers.

Second, we would like to understand the performance overheads that arise as well.

Figure 7.4 shows the results. First, we find that the low-level code for k-means can outperform the Smart version by up to 9%. This performance difference is mainly due to the extra overheads involved in the global combination of Smart. In the manual implementation, the synchronized data is stored in contiguous arrays, and the global synchronization can be done by a single MPI function call (MPI_Allreduce). By

comparison, Smart stores reduction objects in a map structure noncontiguously, and hence

an extra serialization of these objects is required by global combination. Note that we

follow such a design for better applicability and flexibility – the keys do not have to be

contiguous integers on each node, and early emission of reduction objects can be

supported. Second, it turns out that the performance difference on logistic regression is

unnoticeable, because only a single key-value pair is maintained in this application and

trivial serialization is needed. Overall, since in practice the total processing cost is mostly

dominated by the simulation program, we do not expect noticeable overheads from our

framework over hand-written low-level code.
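
For illustration only (this is not the actual baseline code), the single-call global synchronization used by the hand-written k-means implementation described above might look as follows: per-node centroid sums and cluster counts are kept in contiguous arrays so that one MPI_Allreduce per array combines them across all nodes. The array sizes and names are hypothetical.

#include <mpi.h>
#include <vector>

// Combine the local k-means statistics of all nodes in place.
void global_sync(std::vector<double>& centroid_sums,      // k * dims entries
                 std::vector<long long>& cluster_counts)  // k entries
{
  MPI_Allreduce(MPI_IN_PLACE, centroid_sums.data(),
                static_cast<int>(centroid_sums.size()),
                MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  MPI_Allreduce(MPI_IN_PLACE, cluster_counts.data(),
                static_cast<int>(cluster_counts.size()),
                MPI_LONG_LONG, MPI_SUM, MPI_COMM_WORLD);
}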

7.4.4 Scalability Evaluation

The next set of experiments evaluates the scalability of Smart, using both the Heat3D and Lulesh simulations and nine analytics programs: 1) grid aggregation: the grid size

was 1,000; 2) histogram: the number of buckets was 1,200; 3) mutual information: the

number of buckets for each variable was 100, and hence the 2-dimensional space was

divided into up to 10,000 cells; 4) logistic regression: the number of iterations and the

number of dimensions were 3 and 15, respectively; 5) k-means: the number of centroids,

the number of iterations, and the number of dimensions were 8, 10, and 4, respectively;

and 6) the four window-based applications, including moving average, moving median,

(Gaussian) kernel density estimation, as well as Savitzky-Golay filter: the window sizes were all 25.

First, we evaluate the total processing times on Heat3D, as we scale the number of compute nodes from 4 to 32, with 8 threads on each node being used for both simulation and analytics. 1 TB data was output by Heat3D over 100 time-steps. As Figure 7.5 shows,

Smart can achieve 93% parallel efficiency on average for all the applications. Particularly, we can even see that, for some cases where 16 nodes are used, a super linear scalability can

Figure 7.5: In-Situ Processing Times with Varying # of Nodes on Heat3D (Using 8 Cores per Node)
[Normalized processing times of the nine analytics applications and the total, using 4, 8, 16, and 32 nodes.]

be achieved. Such an extra speedup is caused by the reduction in memory requirements per

node as more compute nodes are used.

Second, we evaluate the performance of scaling the number of threads on 64 nodes by using Lulesh. Lulesh output 1 TB data over 93 time-steps. The number of threads used for both simulation and analytics per node was up to 8. Figure 7.6 shows the results. Smart can achieve 59% and 79% parallel efficiency on average for the first five applications, and the other four window-based applications, respectively. The difference in parallel efficiency is related to the nature of these applications. For example, compared with the

first five applications, the window-based applications are more compute-intensive, and the synchronization overheads weigh much less in the total processing cost, leading to a better scalability.

Figure 7.6: In-Situ Processing Times with Varying # of Threads on Lulesh (Using 64 Nodes)
[Normalized processing times of the nine analytics applications and the total, using 1, 2, 4, and 8 threads per node.]

7.4.5 Evaluating Memory Efficiency

We next demonstrate a key advantage of the Smart design (its time sharing mode implementation) – in-situ analytics can be supported without involving an extra copy of the simulation output. Many simulation programs in practice use almost all available memory on the machine, and unnecessary copying of data can lead to severe performance degradation – this is even more important as the memory-to-flop ratio has decreased on many recent systems. We evaluate this impact by comparing the performance with an implementation involving data copy.

In this set of experiments, 1 TB data was output by Heat3D on 4 nodes, and by Lulesh on 64 nodes. Logistic regression and mutual information were used as analytics programs on Heat3D and Lulesh, respectively, with the same parameters as for the experiments in

Figure 7.7: Evaluating the Efficiency of Time Sharing Mode
[Total processing times with and without data copy: (a) In-Situ Processing Times of Logistic Regression on Heat3D (Using 4 Nodes), with the time-step size varied from 0.6 to 1.8 GB; (b) In-Situ Processing Times of Mutual Information on Lulesh (Using 64 Nodes), with the edge size of each cubic partition varied from 100 to 233.]

Section 7.4.4. To vary the memory consumption in the simulation program, we varied the

time-step size for each simulation run as follows. For Heat3D, we could vary the length

of one dimension of the 3D problem size, and hence we varied the time-step size as well

as the memory consumption linearly. Particularly, the time-step size was varied from 0.6

GB to 1.8 GB. For Lulesh, we could vary the edge size of a 3D array cube simulated on

each node, and hence by varying the edge size linearly, we obtained a cubic growth

of memory consumption. Particularly, the edge size was varied from 100 to 233, and the

corresponding time-step size was varied from 1.5 GB to 18.3 GB.

As shown in Figure 7.7, when our system does not involve any data copy, there can be a notable performance improvement. For Heat3D, with a time-step size greater than 1.4 GB, our system can outperform the other implementation by up to 11%.

Note that a time-step of 1.8 GB makes the system reach the memory bound in our setup, since a time-step of 2 GB can result in a crash. For Lulesh, with an edge size smaller than

220, only a performance gain of up to 7% is achieved. This is because the size of simulated data on each node is only 247 MB, which is very small compared with the memory capacity (12 GB). However, when the edge size reaches 233, the memory consumption of the implementation involving data copy becomes very close to the physical capacity, and hence its processing time increases substantially. For this case, our system can achieve a speedup of 5x.

7.4.6 Comparing Time Sharing and Space Sharing Modes

Recall that in the space sharing mode, both simulation and analytics run concurrently, using two separate groups of cores on each node. All of our experiments so far have considered the time sharing mode only. Now we evaluate the efficiency of space sharing

Figure 7.8: Evaluating the Efficiency of Space Sharing Mode
[Total processing times (secs) under different schemes of using 60 threads per node: (a) Histogram, (b) K-Means, (c) Moving Median.]

mode, by comparing against the performance in time sharing mode, as well as the

performance of pure simulation as a baseline, on a many-core cluster.

In this set of experiments, 1 TB data was output by Lulesh on 8 Xeon Phi nodes. Since

it turns out that the simulation could not benefit from hyperthreading on the coprocessors,

we only used 60 threads for computation in this mode, and 1 core was reserved for

scheduling and communication. Histogram, k-means, and moving median were used as

analytics programs, with the same parameters as for the experiments in Section 7.4.4.

Besides time sharing and ‘simulation-only’ versions, to vary the number of cores used for

simulation and analytics in space sharing mode, we used 5 different versions, in which the

number of cores used for simulation was varied from 50 to 10, and the remaining cores

were used for analytics.

The results are shown in Figure 7.8. Here, “n m” denotes a space sharing scheme with n threads for simulation and m threads for analytics. First, although the best performance in space sharing mode is achieved by different schemes for different applications, it does not incur too much overhead compared with the ‘simulation-only’ performance, even with a moderate amount of computation (as shown in Figure 7.8(c)). Second, the best performances in space sharing mode for k-means and moving median are achieved by

“50 10” and “30 30”, and reflect a performance improvement over the time sharing

mode by 10% and 48%, respectively. This is because space sharing mode can make better

use of some cores, when simulation reaches its scalability bottleneck. In addition, we also

notice that not all applications can benefit from space sharing mode – the best

performance of histogram in space sharing mode (achieved by “50 10”) is 4.4% lower

than the performance from the time sharing mode. This is because the synchronization (or

message passing) cost in histogram is relatively higher than in the other two applications,

and space sharing mode can only execute the message passing in simulation and analytics sequentially, to avoid a potential data race in MPI, i.e., only a single thread can call an MPI function at a time during concurrent execution. Thus, we conclude that space sharing mode can be advantageous when a simulation program does not scale well with an increasing number of cores, but it is not a good fit for applications involving frequent synchronization.

7.4.7 Evaluating the Optimization for Window-Based Analytics

Figure 7.9: Evaluating the Effect of Optimization for Window-Based Analytics
[Total processing times with and without the trigger: (a) In-Situ Processing Times of Moving Average on Heat3D (Using 4 Nodes), with the time-step size varied from 0.5 to 1 GB; (b) In-Situ Processing Times of Moving Median on Lulesh (Using 64 Nodes), with the edge size of each cubic partition varied from 60 to 200.]

The last set of experiments evaluates the effect of the optimization for window-based

analytics. Specifically, we compare the optimized version against an implementation that

does not set a trigger function and hence cannot support early emission of the reduction

object. In the first experiment, we used Heat3D to simulate 300 GB data, and used moving

average as the analytics program on 4 nodes. Similar to the previous experiment, we

varied the time-step size from 0.5 to 1 GB in Heat3D, and the window size of moving

average was 7. In the second experiment, 1 TB data was output by Lulesh, and then

analyzed by moving median on 64 nodes. We also varied the size of simulated data on

each node from 5.2 to 186 MB in Lulesh, by varying the edge size of array cube from 60

to 200, and the window size of moving median was 11.

The results are shown in Figure 7.9. We can see that the optimization can lead to a speedup of up to 5.6 and 5.2 in the two experiments, respectively, owing to its memory efficiency. For instance, with this optimization, it turns out that the maximal number of reduction objects maintained by Smart can be decreased by a factor of 1,000,000 for the moving average application. Moreover, a time-step of 1 GB in Heat3D, or an edge size of 200 in Lulesh, results in a crash for the implementation without the trigger mechanism, due to the extremely large memory consumption, so we have not reported the results for these two cases.

7.5 Related Work

As recent years have witnessed an increasing performance gap between I/O and compute capabilities, in-situ scientific analytics [118, 127, 255, 267] has attracted much attention. The research on in-situ scientific analytics has been mainly focused on two areas – applications and platforms. In-situ applications and algorithms have been

extensively studied, with work on topics including indexing [117, 127], compression [128, 268], visualization [114, 250, 266], and other analytics like object tracking [257], feature extraction [129], and fractal dimension analysis [219]. On the other hand, in-situ resource scheduling research that offers platforms can be classified into time sharing and space sharing categories. An example of a time sharing platform is

GoldRush [262], which runs analytics on the same simulation cores. Since simulation and analytics are tightly coupled, cycle stealing becomes critical for performance optimization. In the case of space sharing platforms, the CPU utilization of simulation and analytics is decoupled, while the memory bound on analytics still holds. Examples of efforts include Functional Partitioning [139], and the systems Damaris [68] and

CoDS [256]. By contrast, our work explores the opportunities in in-situ scientific analytics at the programming model level. Broadly, in-situ applications can benefit from

Smart by adopting the system API, which abstracts parallelization away, while Smart itself can be deployed on top of any of these in-situ resource scheduling platforms.

In a broader context of online resource scheduling platforms, another two processing modes have been studied in addition to in-situ processing. The first is in-transit processing, where by leveraging extra resources, online analytics can be moved to dedicated staging nodes that are different from the nodes where simulation runs. Platforms supporting this mode include PreDatA [261], GLEAN [219], JITStager [18], and NESSIE [167]. Based on the observation that in-situ and in-transit modes can complement each other, the second mode is that of hybrid processing. This mode is supported on many platforms, including

ActiveSpaces [65], DataSpace [66], FlexIO [263], and others [37]. Our system can be incorporated into these platforms to support in-transit or hybrid processing.

We earlier compared the limitations of various MapReduce implementations for possible in-situ analytics, and have extensively compared our work against Spark. In addition, iMR [147] is specifically designed for in-situ log stream processing. To meet the in-situ resource constraints, iMR focuses on lossy processing and load shedding. Smart, in comparison, is based on a distinct API that reduces memory requirements. Further, integrating MapReduce with scientific analytics has been a topic of much interest recently [42, 146, 154, 185, 213, 225, 237, 260]. SciHadoop [42] integrates Hadoop with

NetCDF library support to allow processing of NetCDF data with MapReduce API.

SciMATE [237] is a MapReduce variant that can transparently process scientific data in multiple scientific formats. Zhao et al. [260] implement a parallel storage and access

method for NetCDF data based on MapReduce. The Kepler+Hadoop project [225]

integrates MapReduce with Kepler, which is a scientific workflow platform. Other

MapReduce frameworks [154, 185, 213] are also developed for scientific analytics. In

contrast, Smart is designed for in-situ processing, and accordingly, the focus is on

addressing the data loading mismatch, memory constraint, and other similar issues.

Moreover, Smart is not bound to any specific scientific data format, since its input is

considered to be resident in (distributed) memory.

Additionally, the topic of accelerating applications using GPUs, Intel MIC and

heterogeneous resources has also been studied in our group [49–53]. These works have focused on different aspects, including scheduling, memory hierarchy, and

pattern-specific optimizations, and they can be potentially integrated into our framework

on different architectures.

7.6 Summary

In this work, we have developed and evaluated a system that applies a MapReduce-style

API for developing in-situ analytics programs. Our work has addressed a number of challenges in creating data analytics programs from a high-level API that is efficient and can share resources with an ongoing simulation program.

We have extensively evaluated our framework. Performance comparison with Spark shows that our system can achieve high efficiency in in-situ analytics, by outperforming

Spark by at least an order of magnitude. We also show that our middleware does not add much overhead (typically less than 10%) compared with low-level implementations.

Moreover, we have demonstrated both the functionality and scalability of our system by running different simulation and analytics programs in different in-situ modes on clusters with multi-core and many-core nodes. We can achieve 93% parallel efficiency on average.

Finally, we show that our optimization for in-situ window-based analytics can achieve a speedup of up to 5.6. Smart is open-source software, and the source code can be accessed at https://github.com/SciPioneer/Smart.git.

Chapter 8: Conclusions and Future Work

In this chapter, we first summarize the contributions of this dissertation, and then we discuss the future work.

8.1 Contributions

Our main objective is to provide high-performance data management and data processing support on array-based scientific data, targeting data-intensive applications and various scientific array storages. Specifically, we have developed a number of high-performance data management and data processing utilities, which can significantly reduce costs of data translation, data transfer, data ingestion, data integration, data processing, and data storage involved in many scientific applications. Our contributions can be summarized as follows.

• Our initial efforts towards scientific data management have built an SQL interface

over HDF5 datasets, with the support for parallel server-side subsetting and

aggregation. The queries can involve different predicates based on dimensional

and/or value ranges. Our implementation has been shown to be able to significantly

outperform OPeNDAP by saving both data translation and data transfer costs.

• Motivated by the data model mismatch between array-based scientific data and

relational data for which SQL is originally designed, we have designed a system to

support a number of structural aggregations, which are unique to the array data model

and cannot be supported by SQL. Extensive evaluations have shown that our

system not only can effectively avoid the data ingestion cost of loading data into a

database, but also can significantly outperform SciDB even without considering its

data ingestion cost.

• We have further improved the efficiency of scientific data management through the

paradigm of approximate query processing. With the insight that bitmap indices can

be a good summary structure to preserve both spatial and value distribution of the

array data, we propose a novel approximate aggregation approach over arrays. The

key feature is the capability of operating entirely on the compact bitmap indices

rather than the original raw datasets. A comparison with both histogram and

sampling methods has shown high efficiency of our implementation.

• We posit that bitmap indexing can also be applied to more complex data mining

tasks over array data, especially for the ones involving frequent aggregations.

Particularly, we have designed a novel subgroup discovery algorithm over arrays.

We have demonstrated both high efficiency and effectiveness of our algorithm by

comparing against VIKAMINE, which implements a different subgroup discovery

algorithm.

• We investigate scientific data processing mainly in the context of MapReduce.

Based on the observation that scientific data is often stored in different data formats,

we have developed a novel MapReduce-like framework that can be customized to

support transparent processing on any of the scientific data formats. In contrast to

the traditional MapReduce implementations that require integrating and/or

reloading scientific data, our system can avoid the data integration and/or data

reloading costs. We have evaluated the performance using three popular data

analysis tasks, and each instance has processed multiple data formats, including

NetCDF, HDF5, and flat files.

• We then shift the focus from offline analytics to in-situ analytics, which can avoid

expensive data movement of simulation output data by co-locating both simulation

and analytics program. We have developed a MapReduce-like framework which can

support efficient in-situ scientific analytics in two different modes – time sharing

and space sharing. This system can substantially reduce both data transfer and data

storage costs. Performance comparison with Spark has demonstrated that our

system can outperform it by at least an order of magnitude. We have also shown

high efficiency and scalability of our middleware, by using nine applications on

both multi-core and many-core clusters.

In the future, we would like to extend our work in both data management and data

mining directions. On one hand, we can apply bitmap indexing to some data mining tasks

such as exception rule mining [99, 209] or anomaly detection [84, 85], and also develop

some middleware for bitmap-based analysis tasks. On the other hand, we can adapt our

MapReduce implementations for bioinformatics applications and approximate algorithms.

8.2 Future Work

In the future, we plan to extend our work to provide richer data management and data processing support. The future work is mainly motivated by two observations. First, we

find that bitmap indexing has huge potential in assisting many data analysis tasks over array-based scientific datasets, especially the ones that involve some sort of associations that are

described by both dimensional and value-based predicates. Second, it is also desirable to

adapt our MapReduce frameworks for other applications.

8.2.1 Bitmap-Based Data Analysis

Bitmap-Based Exception Rule Mining

Given a set of predefined rules, which may come from a priori knowledge or a certain rule mining application (e.g., correlation mining or subgroup discovery), exception rule mining [99] aims to discover interesting exceptions that contradict some of these rules.

Unlike common sense rules, which often have high support and confidence, exception rules can represent critical phenomena that are either unknown or overlooked, and these rules often have low support but high confidence. A motivating example in scientific data is as follows. For oceanic data, temperature is often correlated with depth to some extent – in common sense, temperature decreases as depth increases. However, in a certain area there may exist an exception that contradicts this correlation. This can be of great interest, indicating a heat source, e.g., an undersea volcano.

While there is clearly a need to apply exception rule mining to scientific discovery over array-based scientific data, to the best of our knowledge, none of the existing algorithms has considered this problem. In the future, we plan to design a bitmap-based exception rule mining algorithm. Like the subgroup discovery algorithm presented in Chapter 7, this algorithm can operate entirely on bitmap indices, because each rule can still be described in the form of an association, i.e., a conjunction of dimension-based or value-based attribute-range pairs.

Moreover, there can also be a very straightforward solution from the viewpoint of subgroup discovery. Intuitively, an exception rule can be considered as a special subgroup in the context of one or multiple given rules. Based on this intuition, our exception rule

mining can be carried out in two steps. First, for each given rule, we can extract a data

subset from the input by applying the rule as a filter. This step can be translated into a

series of bitwise operations, as in our bitmap-based approximate aggregation

method. Second, we can treat the data subset as the entire dataset in the view of subgroup

discovery, and then run our subgroup discovery algorithm with a customized quality

function, to identify interesting subgroups that should be equivalent to the output of

exception rule mining.
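
As a rough sketch of the first step under this plan (the bitvector layout and rule encoding are assumptions; real bitmap indices would typically be compressed, e.g., WAH-style), each attribute-range pair in a rule selects one precomputed bitvector, and ANDing the selected bitvectors yields the bitmap of the data subset that satisfies the rule.

#include <cstddef>
#include <cstdint>
#include <vector>

using BitVector = std::vector<uint64_t>;  // uncompressed: 64 elements per word

// AND together the bitvectors selected by a rule's attribute-range predicates.
// Assumes at least one predicate and equal-length bitvectors.
BitVector apply_rule(const std::vector<BitVector>& predicate_bitvectors) {
  BitVector result = predicate_bitvectors.front();
  for (std::size_t i = 1; i < predicate_bitvectors.size(); ++i) {
    for (std::size_t w = 0; w < result.size(); ++w) {
      result[w] &= predicate_bitvectors[i][w];
    }
  }
  return result;  // set bits mark the elements satisfying the rule
}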

Bitmap-Based Middleware

Besides subgroup discovery, our group has also designed a number of bitmap-based data analysis tasks, including contrast set mining [264], correlation analysis [202], correlation mining [205], and missing value imputation [193]. However, all of the above implementations are built in a low-level programming language (C/C++), and the designs are neither fault-tolerant nor sufficiently parallelized.

We plan to build a middleware with the following goals: 1) to abstract the execution

flow with generic data structures and operations; 2) to make fault tolerance and parallelization transparent to the programmers. The first goal can not only facilitate the maintenance of the existing implementations, but also simplify the development of other similar bitmap-based applications. The second goal can help scale out the implementations.

We can fulfill the above goals on Spark [11], since Spark has been widely adopted and

can well meet our needs. The design can be implemented in two steps. First, we define

a bitmap-based resilient distributed dataset (RDD) [251], which can represent a summary

structure for an entire input array or its subset. Particularly, it can be in the form of a vector or

map, where each entry encapsulates a bitvector and its associated pre-aggregation statistics

for a variable. The pre-aggregation statistics should be customizable. Second, we define a set of bitmap-based operations to support RDD transformations. These operations should at least include: 1) an index operation to produce an RDD given an array; 2) a filter operation based on a dimension-based or value-based predicate; 3) a merge operation over two bitmap-based RDDs; 4) a histogram operation to count the number of 1s in each bitvector and then produce a histogram; and 5) a number of aggregate operations over the produced histogram to produce an aggregation result.
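
Independent of Spark and of any particular RDD API, the histogram and aggregate operations listed above amount to counting set bits per bin and combining the counts with the per-bin pre-aggregation statistics. The sketch below illustrates one way this could look; all names, the choice of bin representatives as the statistics, and the use of the GCC/Clang __builtin_popcountll intrinsic are assumptions for illustration.

#include <cstddef>
#include <cstdint>
#include <vector>

struct BinEntry {
  std::vector<uint64_t> bits;  // bitvector: which elements fall in this value bin
  double representative;       // pre-aggregation statistic, e.g., the bin midpoint
};

// Histogram operation: number of set bits (elements) in each bin.
std::vector<uint64_t> histogram(const std::vector<BinEntry>& index) {
  std::vector<uint64_t> counts;
  for (const BinEntry& bin : index) {
    uint64_t c = 0;
    for (uint64_t word : bin.bits) c += __builtin_popcountll(word);  // GCC/Clang intrinsic
    counts.push_back(c);
  }
  return counts;
}

// Aggregate operation: an approximate SUM from the histogram and statistics.
double approx_sum(const std::vector<BinEntry>& index,
                  const std::vector<uint64_t>& counts) {
  double sum = 0.0;
  for (std::size_t i = 0; i < index.size(); ++i) {
    sum += counts[i] * index[i].representative;
  }
  return sum;
}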

8.2.2 Smart Extension

Supporting Bioinformatics Applications

Although MapReduce was originally designed to process text files, it has been adapted for processing datasets in a variety of disciplines. For example, many MapReduce variants (e.g., GATK [157], Biodoop [137], and Hadoop-BAM [165]) have demonstrated that MapReduce already has a substantial base in Next-Generation sequencing analysis as well as other bioinformatics domains.

Motivated by this trend, we plan to customize our Smart system for bioinformatics analytics. Although the current input data type of our system is the array, it is not hard to adapt it to take sequencing data as input. Specifically, we can mimic our precursor system [231] for partitioning sequencing data and implementing some analysis tasks.

Supporting Online Analytics for Approximate Algorithms

Recent years have witnessed a rapid growth in dataset sizes in a variety of disciplines, from business to the sciences, and large-scale data analytics is one of the key ‘big data’ challenges today. Specifically, both time and resource constraints are often imposed in many use cases. First, to obtain crucial and timely insights for decision making and

planning, the analysis over large datasets is often expected to be interactive, and hence high delay should be avoided. Consequently, early results from runtime approximation techniques like online sampling have been gaining great popularity, by giving better control over the entire analytics plan. Besides the requirement of responsiveness, another major trend in both the scientific and commercial sectors is that a growing number of large-scale analytics tasks are now executed in ‘pay-as-you-go’ cloud environments. In response to this trend, online analytics based on runtime approximation has been shown to be a cost-effective paradigm for increasingly data-intensive analytics [113]. In particular, early termination of analytics with satisfying estimates can be translated into savings in execution time, power, and machines.

We plan to extend our Smart system to support online analytics. Specifically, it can incrementally process the input through online sampling. Given the partial results of the data already processed, fast estimates can be updated at runtime, based on a user-defined estimation method. Once the estimates satisfy the user’s requirement, which is described as a termination condition, early termination is carried out to reduce latency and save resources.

We plan to extend the API by allowing the user to customize both estimation method and termination condition. Initial efforts have been made in [232].

Bibliography

[1] ACOS. http://mirador.gsfc.nasa.gov/cgi-bin/mirador/presentNavigation.pl?tree=project&project=ACOS.

[2] Disco Project. http://discoproject.org/.

[3] HDF5. http://hdf.ncsa.uiuc.edu/HDF5/.

[4] HDF5 High-level APIs. http://www.hdfgroup.org/HDF5/doc/HL/.

[5] Heat3D. http://dournac.org/info/parallel_heat3d.

[6] Land Parameter Retrieval Model (LPRM) Level 2 (Swath) Collection. http://disc.sci.gsfc.nasa.gov/Aura/data-holdings/OMI/index.shtml.

[7] LULESH. https://codesign.llnl.gov/lulesh.php.

[8] NETCDF. http://www.unidata.ucar.edu/software/netcdf/.

[9] NetCDF Operator (NCO). http://nco.sourceforge.net/.

[10] PostgreSQL. http://www.postgresql.org.

[11] Spark. https://spark.apache.org/.

[12] SQLite. http://www.sqlite.org.

[13] Universal File Interface (UFI). http://www.barrodale.com/bcs/universal-file-interface-ufi/.

[14] VIKAMINE. http://www.vikamine.org.

[15] WOA09. http://www.nodc.noaa.gov/OC5/WOA09/netcdf_data.html.

[16] WOA13. http://www.nodc.noaa.gov/OC5/woa13/woa13data.html.

[17] Barrodale Computing Services. http://www.barrodale.com/, 2010.

[18] H. Abbasi, G. Eisenhauer, M. Wolf, K. Schwan, and S. Klasky. Just in time: adding value to the IO pipelines of high performance applications with JITStaging. In HPDC, pages 27–36. ACM, 2011.

[19] G. Agrawal and Y. Su. Light-weight data management solutions for visualization and dissemination of massive scientific datasets-position paper. In SCC, pages 1296– 1300. IEEE, 2012.

[20] D. W. Aha, D. F. Kibler, and M. K. Albert. Instance-based Learning Algorithms. Machine Learning, pages 6:37–66, 1991.

[21] J. Ahrens, B. Hendrickson, G. Long, S. Miller, R. Ross, and D. Williams. Data- Intensive Science in the US DOE: Case Studies and Future Challenges. Computing in Science Engineering, 13(6):14 –24, nov.-dec. 2011.

[22] I. Alagiannis, R. Borovica, M. Branco, S. Idreos, and A. Ailamaki. NoDB: efficient query execution on raw data files. In SIGMOD, pages 241–252, 2012.

[23] A. M. Aly, A. Sallam, B. M. Gnanasekaran, L. Nguyen-Dinh, W. G. Aref, M. Ouzzani, and A. Ghafoor. M3: Stream processing on main-memory mapreduce. In ICDE, pages 1253–1256. IEEE, 2012.

[24] G. Ananthanarayanan, C. Douglas, R. Ramakrishnan, S. Rao, and I. Stoica. True elasticity in multi-tenant data-intensive compute clusters. In SoCC, page 24. ACM, 2012.

[25] G. Ananthanarayanan, S. Kandula, A. G. Greenberg, I. Stoica, Y. Lu, B. Saha, and E. Harris. Reining in the Outliers in Map-Reduce Clusters using Mantri. In OSDI, pages 1–16, 2010.

[26] G. Antoshenkov. Byte-aligned bitmap compression. In DCC, page 476. IEEE, 1995.

[27] M. Atzmüller and F. Lemmerich. Fast subgroup discovery for continuous target concepts. In Foundations of Intelligent Systems, pages 35–44. Springer, 2009.

[28] M. Atzmüller and F. Puppe. SD-Map – A fast algorithm for exhaustive subgroup discovery. In Knowledge Discovery in Databases: PKDD 2006, pages 6–17. Springer, 2006.

[29] B. Babcock, S. Chaudhuri, and G. Das. Dynamic sample selection for approximate query processing. In SIGMOD, pages 539–550. ACM, 2003.

[30] D. Barbará and X. Wu. Supporting online queries in rolap. In Data Warehousing and Knowledge Discovery, pages 234–243. Springer, 2000.

[31] P. Baumann. A Database Array Algebra for Spatio-Temporal Data and Beyond. In Next Generation Information Technologies and Systems, pages 76–93, 1999.

[32] P. Baumann, A. Dehmel, P. Furtado, R. Ritsch, and N. Widmann. The Multidimensional Database System RasDaMan. In SIGMOD, pages 575–577, 1998.

[33] S. D. Bay and M. J. Pazzani. Detecting group differences: Mining contrast sets. Data Mining and Knowledge Discovery, 5(3):213–246, 2001.

[34] R. J. Bayardo Jr. Efficiently mining long patterns from databases. In SIGMOD Rec., volume 27, pages 85–93. ACM, 1998.

[35] F. Berlanga, M. J. del Jesus, P. González, F. Herrera, and M. Mesonero. Multiobjective evolutionary induction of subgroup discovery fuzzy rules: A case study in marketing. In Advances in Data Mining. Applications in Medicine, Web Mining, Marketing, Image and Signal Mining, pages 337–349. Springer, 2006.

[36] M. Boley and H. Grosskreutz. Non-redundant subgroup discovery using a closure system. In Machine Learning and Knowledge Discovery in Databases, pages 179– 194. Springer, 2009.

[37] D. A. Boyuka, S. Lakshminarasimham, X. Zou, Z. Gong, J. Jenkins, E. R. Schendel, N. Podhorszki, Q. Liu, S. Klasky, and N. F. Samatova. Transparent In Situ Data Transformations in ADIOS. In CCGRID, pages 256–266. IEEE, 2014.

[38] A. Brodsky, S. G. Halder, and J. Luo. Dg-query: An xquery-based decision guidance query language. In ICEIS, 2014.

[39] P. G. Brown. Overview of SciDB: large scale array storage, processing and analysis. In SIGMOD, pages 963–968, 2010.

[40] A. Buades and B. Coll. A non-local algorithm for image denoising. In CVPR, pages 60–65, 2005.

[41] F. Buccafurri, G. Lax, D. Sacc`a, L. Pontieri, and D. Rosaci. Enhancing histograms by tree-like bucket indices. VLDB, 17:1041–1061, 2008.

[42] J. B. Buck, N. Watkins, J. LeFevre, K. Ioannidou, C. Maltzahn, N. Polyzotis, and S. Brandt. SciHadoop: Array-based Query Processing in Hadoop. In SC, 2011.

[43] C. J. Carmona, P. González, M. J. del Jesús, and F. Herrera. Non-dominated multi-objective evolutionary algorithm based on fuzzy rules extraction for subgroup discovery. In Hybrid Artificial Intelligence Systems, pages 573–580. Springer, 2009.

[44] J. A. P. Cerveira Cordeiro, G. Câmara, U. Moura De Freitas, and F. Almeida. Yet Another Map Algebra. Geoinformatica, 13(2):183–202, June 2009.

[45] K. Chakrabarti, M. Garofalakis, R. Rastogi, and K. Shim. Approximate query processing using wavelets. VLDB J., 10(2-3):199–223, 2001.

[46] S. Chaudhuri, G. Das, M. Datar, R. Motwani, and V. Narasayya. Overcoming limitations of sampling for aggregation queries. In ICDE, pages 534–542. IEEE, 2001.

[47] S. Chaudhuri, G. Das, and V. Narasayya. A robust, optimization-based approach for approximate answering of aggregate queries. In SIGMOD Rec., volume 30, pages 295–306. ACM, 2001.

[48] S. Chaudhuri, G. Das, and V. Narasayya. Optimized stratified sampling for approximate query processing. TODS, 32(2):9, 2007.

[49] L. Chen and G. Agrawal. Optimizing mapreduce for gpus with effective shared memory usage. In HPDC, pages 199–210. ACM, 2012.

[50] L. Chen and G. Agrawal. Optimizing mapreduce for gpus with effective shared memory usage. In SC, pages 199–210. ACM, 2012.

[51] L. Chen, X. Huo, and G. Agrawal. Scheduling methods for accelerating applications on architectures with heterogeneous cores. In IPDPSW, pages 48–57. IEEE, 2014.

[52] L. Chen, X. Huo, and G. Agrawal. A pattern specification and optimizations framework for accelerating scientific computations on heterogeneous clusters. In IPDPS. IEEE, 2015.

[53] L. Chen, X. Huo, B. Ren, S. Jain, and G. Agrawal. Efficient and simplified parallel graph processing over cpu and mic. In IPDPS. IEEE, 2015.

[54] H. Cheng, X. Yan, J. Han, and P. S. Yu. Direct discriminative pattern mining for effective classification. In ICDE, pages 169–178. IEEE, 2008.

[55] R. Chester. Marine Geochemistry. John Wiley & Sons, 2009.

[56] J. Chou, K. Wu, O. Rubel, M. Howison, J. Qiang, B. Austin, E. W. Bethel, R. D. Ryne, A. Shoshani, et al. Parallel index and query for large scale data analysis. In SC, pages 1–11. IEEE, 2011.

[57] C.-T. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. R. Bradski, A. Y. Ng, and K. Olukotun. Map-Reduce for Machine Learning on Multicore. In NIPS, pages 281–288. MIT Press, 2006.

[58] W. G. Cochran. Sampling techniques. John Wiley & Sons, 2007.

[60] T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, J. Gerth, J. Talbot, K. Elmeleegy, and R. Sears. Online aggregation and continuous query support in mapreduce. In SIGMOD, pages 1115–1118. ACM, 2010.

229 [60] T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, J. Gerth, J. Talbot, K. Elmeleegy, and R. Sears. Online aggregation and continuous query support in mapreduce. In SIGMOD, pages 1115–1118. ACM, 2010.

[61] R. Cornacchia, S. Héman, M. Zukowski, A. P. Vries, and P. Boncz. Flexible and efficient IR using array databases. VLDB J., 17(1):151–168, 2008.

[62] P. Cornillon, J. Gallagher, and T. Sgouros. OPeNDAP: Accessing Data in a Distributed, Heterogeneous Environment. Data Science Journal, 2(0):164–174, 2003.

[63] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI, pages 137–150, 2004.

[64] M. J. del Jesus, P. González, F. Herrera, and M. Mesonero. Evolutionary fuzzy rule induction process for subgroup discovery: A case study in marketing. Fuzzy Systems, IEEE Transactions on, 15(4):578–592, 2007.

[65] C. Docan, M. Parashar, J. Cummings, and S. Klasky. Moving the code to the data- dynamic code deployment using activespaces. In IPDPS, pages 758–769. IEEE, 2011.

[66] C. Docan, M. Parashar, and S. Klasky. DataSpaces: an interaction and coordination framework for coupled simulation workflows. Cluster Computing, 15(2):163–181, 2012.

[67] G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and differences. In SIGKDD, pages 43–52. ACM, 1999.

[68] M. Dorier. Src: Damaris-using dedicated i/o cores for scalable post-petascale hpc simulations. In ICS, pages 370–370. ACM, 2011.

[69] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox. Twister: A Runtime for Iterative MapReduce. In HPDC, pages 810–818, New York, NY, USA, 2010. ACM.

[70] J. Ekanayake, S. Pallickara, and G. Fox. Mapreduce for data intensive scientific analyses. In eScience, pages 277–284. IEEE, 2008.

[71] Z. Fadika, E. Dede, M. Govindaraju, and L. Ramakrishnan. Mariane: Mapreduce implementation adapted for hpc environments. In GRID, pages 82–89. IEEE, 2011.

[72] J. H. Friedman and N. I. Fisher. Bump hunting in high-dimensional data. Statistics and Computing, 9(2):123–143, 1999.

[73] D. Gamberger and N. Lavrac. Expert-guided subgroup discovery: Methodology and application. Journal of Artificial Intelligence Research, 17(1):501–527, 2002.

[74] G. C. Garriga, P. Kralj, and N. Lavrač. Closed sets for labeled data. The Journal of Machine Learning Research, 9:559–580, 2008.

[75] D. Gillick, A. Faria, and J. DeNero. MapReduce: Distributed Computing for Machine Learning. Berkley (December 18, 2006), 2006.

[76] Z. Gong, T. Rogers, J. Jenkins, H. Kolla, S. Ethier, J. Chen, R. Ross, S. Klasky, and N. F. Samatova. MLOC: Multi-level Layout Optimization Framework for Compressed Scientific Data Exploration with Heterogeneous Access Patterns. In ICPP, September 2012.

[77] L. Gosink, J. Shalf, K. Stockinger, K. Wu, and W. Bethel. HDF5-FastQuery: Accelerating complex queries on HDF datasets using fast bitmap indices. In SSDBM, pages 149–158. IEEE, 2006.

[78] N. Goyal and Y. Sharma. New binning strategy for bitmap indices on high cardinality attributes. In COMPUTE, page 22. ACM, 2009.

[79] J. Gray, D. T. Liu, M. Nieto-Santisteban, A. Szalay, D. J. DeWitt, and G. Heber. Scientific data management in the coming decade. SIGMOD Rec., 34:34–41, 2005.

[80] H. Grosskreutz and D. Paurat. Fast and memory-efficient discovery of the top-k relevant subgroups in a reduced candidate space. In Machine Learning and Knowledge Discovery in Databases, pages 533–548. Springer, 2011.

[81] H. Grosskreutz and S. Rüping. On subgroup discovery in numerical domains. Data mining and knowledge discovery, 19(2):210–226, 2009.

[82] H. Grosskreutz, S. Rüping, and S. Wrobel. Tight optimistic estimates for fast subgroup discovery. In Machine Learning and Knowledge Discovery in Databases, pages 440–456. Springer, 2008.

[83] Q. Guan and S. Fu. auto-AID: A data mining framework for autonomic anomaly identification in networked computer systems. In IPCCC, pages 73–80. IEEE, 2010.

[84] Q. Guan and S. Fu. Wavelet-based multi-scale anomaly identification in cloud computing systems. In GLOBECOM, pages 1379–1384. IEEE, 2013.

[85] Q. Guan, S. Fu, N. DeBardeleben, and S. Blanchard. Exploring time and frequency domains for accurate and automated anomaly detection in cloud computing systems. In PRDC, pages 196–205. IEEE, 2013.

[86] S. Guha, N. Koudas, and K. Shim. Data-streams and histograms. In STOC, pages 471–475. ACM, 2001.

[87] S. Guha and K. Shim. A note on linear time algorithms for maximum error histograms. Knowledge and Data Engineering, IEEE Transactions on, 19(7):993–997, 2007.

[88] S. Guha, K. Shim, and J. Woo. REHIST: Relative error histogram construction algorithms. In VLDB, pages 300–311. VLDB Endowment, 2004.

[89] T. Gunarathne, T.-L. Wu, J. Qiu, and G. Fox. Mapreduce in the clouds for science. In CloudCom, pages 565–572. IEEE, 2010.

[90] D. Gunopulos, G. Kollios, V. J. Tsotras, and C. Domeniconi. Approximating multi-dimensional aggregate range queries over real attributes. SIGMOD Rec., 29(2):463–474, 2000.

[91] M. Gyssens and L. V. S. Lakshmanan. A foundation for multi-dimensional databases. In VLDB, pages 106–115, 1997.

[92] J. Han, M. Kamber, and J. Pei. Data mining: concepts and techniques. Morgan kaufmann, 2006.

[93] C. Heitzinger, A. Hossinger, and S. Selberherr. On smoothing three-dimensional monte carlo ion implantation simulation results. TCAD, 22(7):879–883, 2003.

[94] J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. In SIGMOD Rec., volume 26, pages 171–182. ACM, 1997.

[95] F. Herrera, C. J. Carmona, P. González, and M. J. del Jesus. An overview on subgroup discovery: foundations and applications. Knowledge and information systems, 29(3):495–525, 2011.

[96] A. J. Hey, S. Tansley, K. M. Tolle, et al., editors. The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, 2009.

[97] B. Howe and D. Maier. Algebraic manipulation of scientific datasets. VLDB J., 14(4):397–416, 2005.

[98] L. Hsu, S. G. Self, D. Grove, T. Randolph, K. Wang, J. J. Delrow, L. Loo, and P. Porter. Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics, 6(2):211–226, 2005.

[99] F. Hussain, H. Liu, E. Suzuki, and H. Lu. Exception rule mining with a relative interestingness measure. In Knowledge Discovery and Data Mining: Current Issues and New Applications, pages 86–97. Springer, 2000.

[100] S. Ibrahim, H. Jin, L. Lu, S. Wu, B. He, and L. Qi. LEEN: Locality/Fairness-Aware Key Partitioning for MapReduce in the Cloud. In CLOUDCOM, pages 17–24, Washington, DC, USA, 2010.

[101] Y. E. Ioannidis and V. Poosala. Histogram-based approximation of set-valued query-answers. In VLDB, volume 99, pages 174–185, 1999.

[102] M. Ivanova, M. L. Kersten, and S. Manegold. Data Vaults: A Symbiosis Between Database Technology And Scientific File Repositories. In SSDBM, pages 485–494, June 2012.

[103] H. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. C. Sevcik, and T. Suel. Optimal histograms with quality guarantees. In VLDB, volume 98, pages 24–27, 1998.

[104] B. Jawerth and W. Sweldens. An overview of wavelet based multiresolution analyses. SIAM review, 36(3):377–412, 1994.

[105] C. Jermaine. Robust estimation with sampling and approximate pre-aggregation. In VLDB, pages 886–897. VLDB Endowment, 2003.

[106] C. Jermaine, S. Arumugam, A. Pol, and A. Dobra. Scalable approximate query processing with the dbo engine. TODS, 33(4):23, 2008.

[107] J. Chou, K. Wu, and Prabhat. FastQuery: A Parallel Indexing System for Scientific Data. In CLUSTER, pages 455–464. IEEE, 2011.

[108] W. Jiang and G. Agrawal. Ex-MATE: Data Intensive Computing with Large Reduction Objects and Its Application to Graph Mining. In CCGRID, pages 475–484, 2011.

[109] W. Jiang, V. T. Ravi, and G. Agrawal. A Map-Reduce System with an Alternate API for Multi-core Environments. In CCGRID, pages 84–93, 2010.

[110] R. Jin, L. Glimcher, C. Jermaine, and G. Agrawal. New sampling-based estimators for olap queries. In ICDE, pages 18–18. IEEE, 2006.

[111] C. Johnson and R. Ross. Visualization and knowledge discovery: Report from the DOE/ASCR workshop on visual analysis and data exploration at extreme scale. Technical report, Department of Energy Office of Science ASCR, Oct. 2007.

[112] P. W. Jones, P. H. Worley, Y. Yoshida, J. White, and J. Levesque. Practical performance portability in the Parallel Ocean Program (POP). Concurrency and Computation: Practice and Experience, 17(10):1317–1327, 2005.

[113] N. Kamat, P. Jayachandran, K. Tunga, and A. Nandi. Distributed and interactive cube exploration. In ICDE, pages 472–483. IEEE, 2014.

[114] H. Karimabadi, B. Loring, P. O’Leary, A. Majumdar, M. Tatineni, and B. Geveci. In-situ visualization for global hybrid simulations. In XSEDE, page 57. ACM, 2013.

[115] B. Kavšek, N. Lavrač, and V. Jovanoski. Apriori-SD: Adapting Association Rule Learning to Subgroup Discovery. In Advances in Intelligent Data Analysis V, pages 230–241. Springer, 2003.

[116] M. Kersten, Y. Zhang, M. Ivanova, and N. Nes. SciQL, A Query Language for Science Applications. In EDBT/ICDT Array Databases Workshop, pages 1–12. ACM, 2011.

[117] J. Kim, H. Abbasi, L. Chacon, C. Docan, S. Klasky, Q. Liu, N. Podhorszki, A. Shoshani, and K. Wu. Parallel in situ indexing for data-intensive computing. In LDAV, pages 65–72. IEEE, 2011.

[118] S. Klasky, H. Abbasi, J. Logan, M. Parashar, K. Schwan, A. Shoshani, M. Wolf, S. Ahern, I. Altintas, W. Bethel, et al. In situ data processing for extreme-scale computing. SciDAC, 2011.

[119] W. Klösgen. Explora: A multipattern and multistrategy discovery assistant. In Advances in knowledge discovery and data mining, pages 249–271. American Association for Artificial Intelligence, 1996.

[120] W. Klösgen and M. May. Spatial subgroup mining integrated in an object-relational spatial database. In Principles of Data Mining and Knowledge Discovery, pages 275–286. Springer, 2002.

[121] P. M. Kogge and T. J. Dysart. Using the top500 to trace and project technology and architecture trends. In SC, page 28. ACM, 2011.

[122] N. Koudas. Space efficient bitmap indexing. In CIKM, pages 194–201. ACM, 2000.

[123] K. Kunchithapadam, W. Zhang, A. Ganesh, and N. Mukherjee. Oracle Database Filesystem. In SIGMOD, pages 1149–1160, 2011.

[124] Y. Kwon, M. Balazinska, B. Howe, and J. Rolia. Skew-resistant parallel processing of feature-extracting scientific user-defined functions. In SoCC, pages 75–86. ACM, 2010.

[125] Y. Kwon, M. Balazinska, B. Howe, and J. Rolia. Skewtune: mitigating skew in mapreduce applications. In SIGMOD, pages 25–36. ACM, 2012.

[126] Y. Kwon, K. Ren, M. Balazinska, B. Howe, and J. Rolia. Managing skew in Hadoop. IEEE Data Eng. Bull., 36(1):24–33, 2013.

[127] S. Lakshminarasimhan, D. A. Boyuka, S. V. Pendse, X. Zou, J. Jenkins, V. Vishwanath, M. E. Papka, and N. F. Samatova. Scalable in situ scientific data encoding for analytical query processing. In HPDC, pages 1–12. ACM, 2013.

[128] S. Lakshminarasimhan, N. Shah, S. Ethier, S. Klasky, R. Latham, R. Ross, and N. F. Samatova. Compressing the incompressible with ISABELA: In-situ reduction of spatio-temporal data. In Euro-Par, pages 366–379. Springer, 2011.

[129] A. G. Landge, V. Pascucci, A. Gyulassy, J. C. Bennett, H. Kolla, J. Chen, and P.- T. Bremer. In-situ feature extraction of large scale combustion simulations using segmented merge trees. In SC, pages 1020–1031. IEEE, 2014.

[130] N. Lavrač, P. Flach, B. Kavšek, and L. Todorovski. Adapting classification rule induction to subgroup discovery. In ICDM, pages 266–273. IEEE, 2002.

[131] N. Lavrač and D. Gamberger. Relevancy in constraint-based subgroup discovery. Springer, 2006.

[132] N. Lavrač, D. Gamberger, and V. Jovanoski. A study of relevance for learning in deductive databases. The Journal of Logic Programming, 40(2):215–249, 1999.

[133] N. Lavrač, B. Kavšek, P. Flach, and L. Todorovski. Subgroup discovery with CN2-SD. The Journal of Machine Learning Research, 5:153–188, 2004.

[134] I. Lazaridis and S. Mehrotra. Progressive approximate aggregate queries with a multi-resolution tree structure. In SIGMOD Rec., volume 30, pages 401–412. ACM, 2001.

[135] J.-H. Lee, D.-H. Kim, and C.-W. Chung. Multi-dimensional selectivity estimation using compressed histogram information. In SIGMOD Rec., volume 28, pages 205–214. ACM, 1999.

[136] F. Lemmerich, M. Rohlfs, and M. Atzmueller. Fast discovery of relevant subgroup patterns. In FLAIRS, 2010.

[137] S. Leo, F. Santoni, and G. Zanetti. Biodoop: Bioinformatics on Hadoop. In ICPPW, pages 415–422. IEEE, 2009.

[138] A. Lerner and D. Shasha. AQuery: query language for ordered data, optimization techniques, and experiments. In VLDB, pages 345–356, 2003.

[139] M. Li, S. S. Vazhkudai, A. R. Butt, F. Meng, X. Ma, Y. Kim, C. Engelmann, and G. Shipman. Functional partitioning to optimize end-to-end performance on many-core architectures. In SC, pages 1–12. IEEE, 2010.

[140] L. Libkin, R. Machlin, and L. Wong. A query language for multidimensional arrays: design, implementation, and optimization techniques. In SIGMOD Rec., pages 228–239, 1996.

[141] Q. Liu, J. Logan, Y. Tian, H. Abbasi, N. Podhorszki, J. Y. Choi, S. Klasky, R. Tchoua, J. Lofstead, R. Oldfield, et al. Hello ADIOS: the challenges and lessons of developing leadership class I/O frameworks. Concurrency and Computation: Practice and Experience, 26(7):1453–1473, 2014.

[142] T. Liu, F. Wang, J. Zhu, and G. Agrawal. Differential analysis on deep web data sources. In ICDMW, pages 33–40. IEEE, 2010.

[143] Y. Liu and V. Vlassov. Replication in distributed storage systems: State of the art, possible directions, and open issues. In CyberC, pages 225–232. IEEE, 2013.

[144] Y. Liu, V. Xhagjika, V. Vlassov, and A. Al Shishtawy. Bwman: Bandwidth manager for elastic services in the cloud. In ISPA, pages 217–224. IEEE, 2014.

[145] Z. Liu, A. Abbas, B.-Y. Jing, and X. Gao. Wavpeak: picking nmr peaks through wavelet-based smoothing and volume-based filtering. Bioinformatics, 28(7):914– 920, 2012.

[146] S. Loebman, D. Nunley, Y.-C. Kwon, B. Howe, M. Balazinska, and J. P. Gardner. Analyzing massive astrophysical datasets: Can Pig/Hadoop or a relational DBMS help? In CLUSTER, pages 1–10. IEEE, 2009.

[147] D. Logothetis, C. Trezzo, K. C. Webb, and K. Yocum. In-situ MapReduce for log processing. In USENIX ATC, page 115, 2011.

[148] Z. Lv, S. U. Réhman, and G. Chen. WebVRGIS: A P2P network engine for VR data and GIS analysis. In Neural Information Processing, pages 503–510. Springer, 2013.

[149] Z. Lv and T. Su. 3d seabed modeling and visualization on ubiquitous context. In SIGGRAPH Asia 2014 Posters, page 33. ACM, 2014.

[150] Z. Lv, A. Tek, F. Da Silva, C. Empereur-mot, M. Chavent, and M. Baaden. Game on, science-how video game technology may help biologists tackle visualization challenges. PloS one, 8(3):e57990, 2013.

[151] A. P. Marathe and K. Salem. A Language for Manipulating Arrays. In VLDB, pages 46–55, 1997.

[152] A. P. Marathe and K. Salem. Query processing techniques for arrays. VLDB J., 11(1):68–91, 2002.

[153] Y. Matias and D. Urieli. Optimal workload-based weighted wavelet synopses. In ICDT, pages 368–382. Springer, 2005.

[154] M. Matsuda, N. Maruyama, and S. Takizawa. K MapReduce: A scalable tool for data-processing and search/ensemble applications on large-scale supercomputers. In CLUSTER, pages 1–8. IEEE, 2013.

[155] M. May and L. Ragia. Spatial subgroup discovery applied to the analysis of vegetation data. In Practical Aspects of Knowledge Management, pages 49–61. Springer, 2002.

[156] W. J. McCausland, S. Miller, and D. Pelletier. Simulation smoothing for state-space models: A computational efficiency analysis. Computational Statistics & Data Analysis, 55(1):199–212, 2011.

[157] A. McKenna, M. Hanna, E. Banks, A. Sivachenko, K. Cibulskis, A. Kernytsky, K. Garimella, D. Altshuler, S. Gabriel, M. Daly, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome research, 20(9):1297–1303, 2010.

[158] S. Mehta, S. Parthasarathy, and H. Yang. Toward unsupervised correlation preserving discretization. Knowledge and Data Engineering, IEEE Transactions on, 17(9):1174–1185, 2005.

[159] J. Mennis and C. D. Tomlin. Cubic map algebra functions for spatio-temporal analysis. CaGIS, 32:17–32, 2005.

[160] H. Mohamed and S. Marchand-Maillet. MRO-MPI: MapReduce overlapping using MPI and an optimized data exchange policy. Parallel Computing, 39(12):851–866, 2013.

[161] K. Moreland and K. Truemper. Discretization of target attributes for subgroup discovery. In Machine Learning and Data Mining in Pattern Recognition, pages 44–52. Springer, 2009.

[162] M. Muralikrishna and D. J. DeWitt. Equi-depth multidimensional histograms. In SIGMOD Rec., volume 17, pages 28–36. ACM, 1988.

[163] R. Musick and T. Critchlow. Practical lessons in supporting large-scale computational science. SIGMOD Rec., 28:49–57, December 1999.

[164] S. Muthukrishnan, V. Poosala, and T. Suel. On rectangular partitionings in two dimensions: Algorithms, complexity and applications. In ICDT, pages 236–256. Springer, 1999.

[165] M. Niemenmaa, A. Kallio, A. Schumacher, P. Klemelä, E. Korpelainen, and K. Heljanko. Hadoop-BAM: directly manipulating next generation sequencing data in the cloud. Bioinformatics, 28(6):876–877, 2012.

[166] J. No, R. Thakur, and A. Choudhary. Integrating Parallel File I/O and Database Support for High-Performance Scientific Data Management, 2000.

[167] R. A. Oldfield, G. D. Sjaardema, G. F. Lofstead II, and T. Kordenbrock. Trilinos i/o support (trios). Scientific Programming, 20(2):181–196, 2012.

[168] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A Not-So-Foreign Language for Data Processing. In SIGMOD, pages 1099–1110. ACM, 2008.

[169] P. O’Neil and D. Quass. Improved query performance with variant indexes. In SIGMOD Rec., volume 26, pages 38–49. ACM, 1997.

[170] N. Pansare, V. Borkar, C. Jermaine, and T. Condie. Online aggregation for large mapreduce jobs. VLDB, 4(11):1135–1145, 2011.

[171] V. Pascucci and R. J. Frank. Global static indexing for real-time exploration of very large regular grids. In SC, pages 45–45. IEEE, 2001.

[172] G. Planthaber, M. Stonebraker, and J. Frew. EarthDB: scalable analysis of MODIS data using SciDB. In BigSpatial, pages 11–19, 2012.

[173] S. J. Plimpton and K. D. Devine. MapReduce in MPI for large-scale graph algorithms. Parallel Computing, 37(9):610–632, 2011.

[174] V. Poosala and V. Ganti. Fast approximate answers to aggregate queries on a data cube. In SSDBM, pages 24–33. IEEE, 1999.

[175] V. Poosala, P. J. Haas, Y. E. Ioannidis, and E. J. Shekita. Improved histograms for selectivity estimation of range predicates. SIGMOD Rec., 25(2):294–305, 1996.

[176] V. Poosala and Y. E. Ioannidis. Selectivity estimation without the attribute value independence assumption. In VLDB, volume 97, pages 486–495, 1997.

[177] D. Pullar. MapScript: A Map Algebra Programming Language Incorporating Neighborhood Analysis. Geoinformatica, 5(2):145–163, June 2001.

[178] Y. Qi and M. J. Atallah. Efficient Privacy-Preserving k-Nearest Neighbor Search. In ICDCS, pages 311–319, 2008.

[179] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating MapReduce for multi-core and multiprocessor systems. In HPCA, pages 13–24. IEEE, 2007.

[180] B. Reiner, K. Hahn, G. Höfling, and P. Baumann. Hierarchical Storage Support and Management for Large-Scale Multidimensional Array Database Management Systems. In DEXA, 2002.

[181] P. Riddle, R. Segal, and O. Etzioni. Representation design and brute-force induction in a Boeing manufacturing domain. Applied Artificial Intelligence an International Journal, 8(1):125–147, 1994.

[182] D. Rotem, K. Stockinger, and K. Wu. Optimizing candidate check costs for bitmap indices. In CIKM, pages 648–655. ACM, 2005.

[183] S. Sarawagi and M. Stonebraker. Efficient organization of large multidimensional arrays. In ICDE, pages 328–336, 1994.

[184] R. W. Schafer. What is a Savitzky-Golay filter? [Lecture notes]. Signal Processing Magazine, IEEE, 28(4):111–117, 2011.

[185] S. Sehrish, G. Mackey, J. Wang, and J. Bent. MRAP: A Novel MapReduce-based Framework to Support HPC Analytics Applications with Access Patterns. In HPDC, pages 107–118, 2010.

[186] A. Shatdal and J. F. Naughton. Adaptive parallel aggregation algorithms. In SIGMOD, pages 104–114. ACM, 1995.

[187] J. Shen and S. Cheung. Layer depth denoising and completion for structured-light rgb-d cameras. In CVPR, pages 1187–1194. IEEE, 2013.

[188] J. Shen, P.-C. Su, S.-c. S. Cheung, and J. Zhao. Virtual mirror rendering with stationary rgb-d cameras and stored 3-d background. Image Processing, IEEE Transactions on, 22(9):3433–3448, 2013.

[189] J. Shen and W.-t. Tan. Image-based indoor place-finder using image to plane matching. In ICME, pages 1–6. IEEE, 2013.

[190] L. Sheng, Z. Özsoyoğlu, and G. Özsoyoğlu. A graph query language and its query processing. In ICDE, pages 572–581. IEEE, 1999.

[191] T. Shimada, T. Tsuji, and K. Higuchi. A storage scheme for multidimensional data alleviating dimension dependency. In ICDIM, pages 662–668. IEEE, 2008.

[192] A. Shinnar, D. Cunningham, V. Saraswat, and B. Herta. M3R: increased performance for in-memory Hadoop jobs. VLDB, 5(12):1736–1747, 2012.

[193] S. Shohdy, Y. Su, and G. Agrawal. Accelerating data mining on incomplete datasets by bitmaps-based missing value imputation. In DBKDA, 2015.

[194] R. R. Sinha, S. Mitra, and M. Winslett. Bitmap indexes for large scientific data sets: A case study. In IPDPS, pages 10–pp. IEEE, 2006.

[195] R. R. Sinha and M. Winslett. Multi-resolution bitmap indexes for scientific data. TODS, 32(3):16, 2007.

[196] E. Soroush, M. Balazinska, and D. Wang. ArrayStore: A Storage Manager for Complex Parallel Array Processing. In SIGMOD, pages 253–264, 2011.

[197] K. Stockinger, K. Wu, and A. Shoshani. Evaluation strategies for bitmap indices with binning. In Database and Expert Systems Applications, pages 120–129. Springer, 2004.

[198] E. J. Stollnitz and T. D. De Rose. Wavelets for computer graphics: theory and applications. Morgan Kaufmann, 1996.

[199] M. Stonebraker, J. Becla, D. J. DeWitt, K.-T. Lim, D. Maier, O. Ratzesberger, and S. B. Zdonik. Requirements for Science Data Bases and SciDB. In CIDR, 2009.

[200] Y. Su and G. Agrawal. Supporting User-Defined Subsetting and Aggregation over Parallel NetCDF Datasets. In CCGRID, pages 212–219, May 2012.

[201] Y. Su, G. Agrawal, and J. Woodring. Indexing and parallel query processing support for visualizing climate datasets. In ICPP, pages 249–258. IEEE, 2012.

[202] Y. Su, G. Agrawal, J. Woodring, A. Biswas, and H.-W. Shen. Supporting correlation analysis on scientific datasets in parallel and distributed settings. In HPDC, pages 191–202. ACM, 2014.

[203] Y. Su, G. Agrawal, J. Woodring, K. Myers, J. Wendelberger, and J. Ahrens. Taming massive distributed datasets: data sampling using bitmap indices. In HPDC, pages 13–24. ACM, 2013.

[204] Y. Su, G. Agrawal, J. Woodring, K. Myers, J. Wendelberger, and J. Ahrens. Effective and efficient data sampling using bitmap indices. Cluster Computing, 17(4):1081–1100, 2014.

[205] Y. Su, Y. Wang, and G. Agrawal. In-situ bitmaps generation and efficient data analysis based on bitmaps. In HPDC, pages 61–72. ACM, 2015.

[206] Y. Su, Y. Wang, G. Agrawal, and R. Kettimuthu. SDQuery DSI: integrating data management support with a wide area data transfer protocol. In SC, page 47. ACM, 2013.

[207] A. S. Szalay, P. Z. Kunszt, A. Thakar, J. Gray, D. Slutz, and R. J. Brunner. Designing and mining multi-terabyte astronomy archives: the Sloan Digital Sky Survey. In SIGMOD, pages 451–462, New York, NY, USA, 2000. ACM.

[208] J. Talbot, R. M. Yoo, and C. Kozyrakis. Phoenix++: modular MapReduce for shared-memory systems. In MapReduce'11, pages 9–16. ACM, 2011.

[209] D. Taniar, W. Rahayu, V. Lee, and O. Daly. Exception rules in association rule mining. Applied Mathematics and Computation, 205(2):735–750, 2008.

[210] W. Tantisiriroj, S. W. Son, S. Patil, S. J. Lang, G. Gibson, and R. B. Ross. On the Duality of Data-Intensive File System Design: Reconciling HDFS and PVFS. In SC, pages 67:1–67:12. ACM, 2011.

[211] E. Terzi and P. Tsaparas. Efficient algorithms for sequence segmentation. In SDM. Citeseer, 2006.

[212] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive - A Warehousing Solution Over a Map-Reduce Framework. PVLDB, 2(2):1626–1629, 2009.

[213] T. Tu, C. A. Rendleman, D. W. Borhani, R. O. Dror, J. Gullingsrud, M. Jensen, J. L. Klepeis, P. Maragakis, P. Miller, K. A. Stafford, et al. A scalable parallel framework for analyzing terascale molecular dynamics simulation trajectories. In SC, pages 1–12. IEEE, 2008.

[214] A. R. van Ballegooij. RAM: a multidimensional array DBMS. In EDBT 2004 Workshops, pages 154–165, 2005.

[215] M. van Leeuwen and A. Knobbe. Non-redundant subgroup discovery in large and complex data. In Machine Learning and Knowledge Discovery in Databases, pages 459–474. Springer, 2011.

[216] M. Y. Vardi. The complexity of relational query languages. In STOC, pages 137– 146. ACM, 1982.

[217] M. Vermeij, W. Quak, M. Kersten, and N. Nes. MonetDB, a novel spatial column-store DBMS. In FOSS4G, pages 193–199, 2008.

[218] R. Vernica, A. Balmin, K. S. Beyer, and V. Ercegovac. Adaptive MapReduce using situation-aware mappers. In EDBT, pages 420–431. ACM, 2012.

[219] V. Vishwanath, M. Hereld, and M. E. Papka. Toward simulation-time data analysis and i/o acceleration on leadership-class systems. In LDAV, pages 9–14. IEEE, 2011.

[220] J. S. Vitter and M. Wang. Approximate computation of multidimensional aggregates of sparse data using wavelets. In SIGMOD, pages 193–204. ACM Press, 1999.

[221] J. S. Vitter, M. Wang, and B. Iyer. Data cube approximation and histograms via wavelets. In CIKM, pages 96–104. ACM, 1998.

[222] C. Wang, A. Garcia, and H.-W. Shen. Interactive level-of-detail selection using image-based quality metric for large volume visualization. TVCG, 13(1):122–134, 2007.

[223] D. L. Wang, C. S. Zender, and S. F. Jenks. Clustered workflow execution of retargeted data analysis scripts. In CCGRID, pages 449–458. IEEE, 2008.

[224] H. Wang and J. Wang. An effective image representation method using kernel classification. In ICTAI 2014, pages 853–858, 2014.

[225] J. Wang, D. Crawl, and I. Altintas. Kepler + Hadoop: A General Architecture Facilitating Data-Intensive Applications in Scientific Workflow Systems. In SC-WORKS, 2009.

[226] J. Wang, Y. Zhou, K. Duan, J. J.-Y. Wang, and H. Bensmail. Supervised cross-modal factor analysis for multiple modal data classification. In SMC 2015. IEEE, 2015.

[227] J. J.-Y. Wang, Y. Wang, S. Zhao, and X. Gao. Maximum mutual information regularized classification. Engineering Applications of Artificial Intelligence, 37:1–8, 2015.

[228] K. Wang, X. Zhou, H. Chen, M. Lang, and I. Raicu. Next generation job management systems for extreme-scale ensemble computing. In HPDC, pages 111–114. ACM, 2014.

[229] Y. Wang, G. Agrawal, T. Bicer, and W. Jiang. Smart: A MapReduce-Like Framework for In-Situ Scientific Analytics. In SC. ACM, 2015.

[230] Y. Wang, G. Agrawal, T. Bicer, and W. Jiang. Smart: A MapReduce-Like Framework for In-Situ Scientific Analytics. Technical report, OSU-CISRC-4/15-TR05, Ohio State University, 2015.

[231] Y. Wang, G. Agrawal, G. Ozer, and K. Huang. Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data. In IPDPSW, pages 508–517. IEEE, 2014.

[232] Y. Wang, L. Chen, and G. Agrawal. Supporting user-defined estimation and early termination on a MapReduce-like framework for online analytics. Under submission, 2015.

[233] Y. Wang, A. Nandi, and G. Agrawal. SAGA: Array Storage as a DB with Support for Structural Aggregations. In SSDBM, page 9. ACM, 2014.

[234] Y. Wang, Y. Su, and G. Agrawal. Supporting a Light-Weight Data Management Layer Over HDF5. In CCGRID, pages 335–342, May 2013.

[235] Y. Wang, Y. Su, and G. Agrawal. A Novel Approach for Approximate Aggregations Over Arrays. In SSDBM, page 4. ACM, 2015.

[236] Y. Wang, Y. Su, G. Agrawal, and T. Liu. SciSD: Novel Subgroup Discovery Over Scientific Datasets Using Bitmap Indices. Technical report, OSU-CISRC-3/15-TR03, Ohio State University, 2015.

[237] Y. Wang, J. Wei, and G. Agrawal. SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats. In CCGRID, pages 443–450, May 2012.

[238] L. Weng, G. Agrawal, U. Catalyurek, T. Kurc, S. Narayanan, and J. Saltz. An Approach for Automatic Data Virtualization. In HPDC, pages 24–33, 2004.

[239] J. Woodring, J. Ahrens, J. Figg, J. Wendelberger, S. Habib, and K. Heitmann. In-situ sampling of a large-scale particle simulation for interactive visualization and analysis. In Computer Graphics Forum, volume 30, pages 1151–1160. Wiley Online Library, 2011.

[240] S. Wrobel. An algorithm for multi-relational discovery of subgroups. In Principles of Data Mining and Knowledge Discovery, pages 78–87. Springer, 1997.

[241] K. Wu, S. Ahern, et al. FastBit: interactively searching massive data. In Journal of Physics: Conference Series, volume 180, page 012053, 2009.

[242] K. Wu, W. Koegler, J. Chen, and A. Shoshani. Using bitmap index for interactive exploration of large datasets. In SSDBM, pages 65–74. IEEE, 2003.

[243] K. Wu, E. J. Otoo, and A. Shoshani. Compressing bitmap indexes for faster search operations. In SSDBM, pages 99–108. IEEE, 2002.

[244] K. Wu, E. J. Otoo, and A. Shoshani. Optimizing bitmap indices with efficient compression. TODS, 31(1):1–38, 2006.

[245] K. Wu, K. Stockinger, and A. Shoshani. Breaking the curse of cardinality on bitmap indexes. In SSDBM, pages 348–365. Springer, 2008.

[246] K.-L. Wu and P. S. Yu. Range-based bitmap indexing for high cardinality attributes with skew. In COMPSAC, pages 61–66. IEEE, 1998.

[247] M.-C. Wu. Query optimization for selections using bitmaps. In SIGMOD Rec., volume 28, pages 227–238. ACM, 1999.

[248] M.-C. Wu and A. P. Buchmann. Encoded bitmap indexing for data warehouses. In ICDE, pages 220–230. IEEE, 1998.

[249] J. Yang and Z. Fei. Broadcasting with prediction and selective forwarding in vehicular networks. International Journal of Distributed Sensor Networks, 2013, 2013.

[250] H. Yu, C. Wang, R. W. Grout, J. H. Chen, and K.-L. Ma. In situ visualization for large-scale combustion simulations. IEEE Computer Graphics and Applications, 30(3):45–57, 2010.

[251] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, pages 2–2. USENIX Association, 2012.

[252] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: cluster computing with working sets. In HotCloud, pages 10–10, 2010.

[253] M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz, and I. Stoica. Improving MapReduce Performance in Heterogeneous Environments. In OSDI, volume 8, pages 29–42, 2008.

[254] F. Železný and N. Lavrač. Propositionalization-based relational subgroup discovery with RSD. Machine Learning, 62(1-2):33–63, 2006.

[255] B. Zhang, T. Estrada, P. Cicotti, and M. Taufer. Enabling in-situ data analysis for large protein-folding trajectory datasets. In IPDPS, pages 221–230. IEEE, 2014.

[256] F. Zhang, C. Docan, M. Parashar, S. Klasky, N. Podhorszki, and H. Abbasi. Enabling in-situ execution of coupled scientific workflow on multi-core platform. In IPDPS, pages 1352–1363. IEEE, 2012.

[257] F. Zhang, S. Lasluisa, T. Jin, I. Rodero, H. Bui, and M. Parashar. In-situ Feature-Based Objects Tracking for Large-Scale Scientific Simulations. In SCC, pages 736–740. IEEE, 2012.

[258] Y. Zhang, M. Kersten, M. Ivanova, and N. Nes. SciQL: Bridging the Gap Between Science and Relational DBMS. In IDEAS, pages 124–133, Sept. 2011.

[259] D. Zhao, Z. Zhang, X. Zhou, T. Li, K. Wang, D. Kimpe, P. Carns, R. Ross, and I. Raicu. FusionFS: Toward supporting data-intensive scientific applications on extreme-scale high-performance computing systems. In Big Data, pages 61–70. IEEE, 2014.

[260] H. Zhao, S. Ai, Z. Lv, and B. Li. Parallel Accessing Massive NetCDF Data Based on MapReduce. In WISM, pages 425–431, Berlin, Heidelberg, 2010. Springer-Verlag.

[261] F. Zheng, H. Abbasi, C. Docan, J. Lofstead, Q. Liu, S. Klasky, M. Parashar, N. Podhorszki, K. Schwan, and M. Wolf. PreDatA – preparatory data analytics on peta-scale machines. In IPDPS, pages 1–12. IEEE, 2010.

[262] F. Zheng, H. Yu, C. Hantas, M. Wolf, G. Eisenhauer, K. Schwan, H. Abbasi, and S. Klasky. GoldRush: resource efficient in situ scientific data analytics using fine-grained interference aware execution. In SC, page 78. ACM, 2013.

[263] F. Zheng, H. Zou, G. Eisenhauer, K. Schwan, M. Wolf, J. Dayal, T.-A. Nguyen, J. Cao, H. Abbasi, S. Klasky, et al. FlexIO: I/O Middleware for Location-Flexible Scientific Data Analytics. In IPDPS, pages 320–331. IEEE, 2013.

[264] G. Zhu, Y. Wang, and G. Agrawal. SciCSM: Novel Contrast Set Mining over Scientific Datasets Using Bitmap Indices. In SSDBM, page 38. ACM, 2015.

[265] G. K. Zipf. Human behavior and the principle of least effort. 1949.

[266] H. Zou, K. Schwan, M. Slawinska, M. Wolf, G. Eisenhauer, F. Zheng, J. Dayal, J. Logan, Q. Liu, S. Klasky, et al. FlexQuery: An online query system for interactive remote visual data exploration at large scale. In CLUSTER, pages 1–8. IEEE, 2013.

[267] H. Zou, Y. Yu, W. Tang, and H.-W. M. Chen. FlexAnalytics: a flexible data analytics framework for big data applications with I/O performance improvement. Big Data Research, 1:4–13, 2014.

[268] H. Zou, F. Zheng, M. Wolf, G. Eisenhauer, K. Schwan, H. Abbasi, Q. Liu, N. Podhorszki, and S. Klasky. Quality-Aware Data Management for Large Scale Scientific Applications. In SCC, pages 816–820, 2012.
