Supporting Advanced Queries on Scientific Array Data

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Roee Ebenstein , B.Sc., M.Sc.

Graduate Program in Computer Science and Engineering

The Ohio State University

2018

Dissertation Committee:

Gagan Agrawal, PhD, Advisor
P. Sadayappan, PhD
Arnab Nandi, PhD

Copyright by Roee Ebenstein 2018

Abstract

Distributed scientific array data is becoming more prevalent and increasing in size, and there is a growing need for advanced analytics (and for performance in such analytics) over these data. In this dissertation, we focus on addressing issues to allow data management, efficient declarative querying, and advanced analytics over array data. We formalize the semantics of array data querying, and introduce distributed querying abilities over these data. We show how to improve the optimization phase of join querying, while developing efficient methods to execute joins in general. In addition, we introduce a class of operations that is closely related to the traditional joins performed on relational tables – including an operation we refer to as Mutual Range Joins (MRJ), which arises for scientific data that is not only numerical but also carries measurement noise. While working closely with our colleagues to provide them usable analytics over array data, we uncovered a new type of analytical querying – analytics over windows with an inner window ordering (in contrast to the external window ordering available elsewhere). Last, we adjust our join optimization approach for skewed settings, addressing resource skew observed in real environments as well as data skew that arises while data is processed.

Several major contributions are introduced throughout this dissertation. First, we formalize querying over scientific array data (basic operators, such as subsettings, as well as complex analytical functions and joins). We focus on distributed data, and present a framework to execute queries over variables that are distributed across multiple containers (DSDQuery DSI) – this framework is used in production environments. Next, we present an optimization approach for join queries over geo-distributed data. This approach considers networking properties such as throughput and latency to optimize the execution of join queries. For such complex optimization, we introduce methods and algorithms to efficiently prune candidate execution plans (DistriPlan). Then, after the join has been optimized, we show how to execute distributed joins and optimize the MRJ operator. We demonstrate how bitmap indexes can be used for accelerating the execution of distributed joins – we do so by introducing a new bitmap index structure that fits the MRJ goals (BitJoin). Afterwards, we introduce analytical functions (window querying) to the domain of scientific arrays (FDQ). Last, we revisit join optimization for different settings, while addressing data and resource skew (Sckeow).

We thoroughly evaluate our systems. We show that DSDQuery DSI produces output of optimal size and produces it efficiently – performance decreases linearly with increasing dataset sizes. DistriPlan finds the optimal plan while considering a reasonable number of plans (out of an exponential number of candidate plans). BitJoin improves the performance of MRJs and equi-joins by 140% and 113%, on average. By using a new processing model with an efficient memory allocation approach, FDQ improves the performance of existing functionality by 538% on average. In addition, FDQ efficiently processes queries of types that were not available before – its performance improves linearly with scaled resources. Last, Sckeow improves the performance of queries by 396% for heterogeneous settings and 368% for homogeneous ones. For heterogeneous settings, in most cases Sckeow generates an ideal plan directly, while generating about half the number of plans other engines do in homogeneous settings.

This is dedicated to the ones I love: my parents, my brothers and sister, my family, and my friends around this globe. You all kept me on the right path for this moment to happen.

Acknowledgments

This dissertation was completed thanks to all the professional as well as personal help I received during my days at The Ohio State University. There are so many people to whom I want to express my deepest gratitude, and not enough room to list them all. Without their help and support, it would not have been possible for me to finish this program.

Foremost, I wish to thank my advisor, Prof. Gagan Agrawal. He has supported me since I began working in the lab – it was he who brought me into the high performance computing for data processing domain. His knowledge of the research area and insightful guidance not only broadened my knowledge, but also helped me establish my research interests and develop my problem solving skills. I also want to thank him for his kindness and patience, which helped me throughout my course of studies – especially in regards to academic writing.

In addition, I want to thank Prof. Arnab Nandi and Mr. David Bertram. Prof. Nandi was the first advisor I worked with, and although I departed to another laboratory, I still feel his influence on my academic progress. Graduating would not have been possible without the values and curiosity he instilled in me, even after I left his laboratory. While initially at this university, my funds were provided through my work with the Advanced Computing Center for the Arts and Design (ACCAD). Mr. Bertram, my supervisor there, put me on the right path academically and socially. I am grateful for the guidance and friendship both of you provided me – this moment would not have happened without you.

I would also like to thank the rest of my academic committee: Prof. P. Sadayappan and Prof. Feng Qin. Their insightful comments, incisive analysis, and teaching made my thesis stronger and more solid, and more importantly made me understand the domain better.

The group I have been part of, the Data Intensive and High Performance Computing Research Group, developed me in many ways, personally and professionally. The friendship within it has made my Ph.D. life much more enjoyable. I thank my fellow labmates and friends across this university: Yu Su, Yi Wang, Mehmet Can Kurt, Sameh Shohdy, Gangyi Zhu, Peng Jiang, Jiaqi Liu, Jiankai Sun, Piyali Das, Jia Guo, Shuangsheng Luo, Omid Asudeh, Emin Ozturk, Bill (Haoyuan) Xing, Qiwei Yang, Yang Xia, David Siegal, Niranjan Kamat, and Soumaya Dutta for the stimulating discussions and for all the fun we have had in the past years.

Special gratitude goes to those who made me socialize and enjoy my stay on this amazing campus and on my travels: Joshua Laney (and his family), Joshua Conner, Brad Shook, Lindsy Schwerer, Jessica Franz, Nate Moffitt, Patrick Spaulding, Jeff Starr, Gregory Hanel, Michael Baker, Joel Howard, Noga Adler, Orly Gilad, Pery Stosser, Tomer Peled, Idan Cohen, Adi Hochmann, Jonathan Fishner, Dima Machlin, Karen Cohen, Zalman and Sarah Deitsch (and the OSU Chabad family), and all the on-campus OSJews.

Finally, I want to extend my special thanks to my parents, my brothers and sister, my extended family, and my friends, who have always been supportive. No words can express the help you all provided me.

Vita

2006 ...... Software Technician, Center of Computing and Information Systems, M.O.D, Israel
2010 ...... B.Sc. in Computer Science, The Open University of Israel
2017 ...... M.Sc. in Computer Science, The Ohio State University
2017 ...... Software Engineering Intern, Google, Mountain View, CA

Publications

Research Publications

Ebenstein, Roee and Agrawal, Gagan, DSDQuery DSI: Querying scientific data repositories with structured operators. In 2015 IEEE International Conference on Big Data (BigData).

Ebenstein, Roee, Kamat, Niranjan, and Nandi, Arnab, FluxQuery: An Execution Framework for Highly Interactive Query Workloads. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD).

Ebenstein, Roee and Agrawal, Gagan, DistriPlan - An Optimized Join Execution Framework for Geo-Distributed Scientific Data. In Proceedings of the 29th International Conference on Scientific and Statistical Database Management (SSDBM).

Ebenstein, Roee and Agrawal, Gagan, BitJoin: Executing Distributed Range Based Joins. Under Submission.

Ebenstein, Roee, Agrawal, Gagan, Wang, Jiali, Boley, Joshua, and Kettimuthu, Rajkumar, FDQ: Advance Analytics over Real Scientific Array Datasets. Under Submission.

Ebenstein, Roee and Agrawal, Gagan, ScKeow: An Optimizer for Multi-Level Join Execution of Skewed and Distributed Array Data. Under Submission.

Fields of Study

Major Field: Computer Science and Engineering

Table of Contents

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures
List of Algorithms

1. Introduction
   1.1 Introduction
       1.1.1 Overview
       1.1.2 High Level Motivation
   1.2 Background
       1.2.1 Array Data
       1.2.2 Bitmap Index Structure
   1.3 Motivation and Contributions
       1.3.1 DSDQuery DSI - Declarative Querying of Distributed Scientific Data
       1.3.2 DistriPlan - Distributed Plan Optimization
       1.3.3 BitJoin - Scientific Data Join
       1.3.4 FDQ - Analytical Querying of Array Data
       1.3.5 Sckeow - Multi-Level Join Query Optimization over Skewed Resources and Skewed Data

2. Structured Querying of Scientific Data - DSDQuery
   2.1 Background
       2.1.1 SDQuery DSI
       2.1.2 DSDQuery - Functionality and Challenges
   2.2 Formal Definition of Query Operators
       2.2.1 Formalization - Relational Algebra
       2.2.2 Example Queries
   2.3 System Design and Implementation
       2.3.1 Metadata Extraction
       2.3.2 Query Analysis
       2.3.3 System Implementation
   2.4 Evaluation
       2.4.1 Metadata Processing Overhead
       2.4.2 Increasing Number of Files and Distributed Variables
       2.4.3 Impact of Selectivity
   2.5 Summary

3. Optimization of Join Queries - DistriPlan
   3.1 Background
       3.1.1 Distributed Array Data Joins
       3.1.2 Existing Approaches
       3.1.3 Execution Plans
   3.2 Plan Selection Algorithms
       3.2.1 Query Execution Plans
       3.2.2 Plan Distribution Algorithm
       3.2.3 Pruning Search Space
   3.3 Cost Model
       3.3.1 Costs
       3.3.2 Summarizing – Choosing a Plan
   3.4 Evaluation
       3.4.1 Pruning of Query Plans
       3.4.2 Query Execution Performance Improvement
       3.4.3 Impact of Network Latency
   3.5 Summary

4. Execution of Scientific Data Join - BitJoin
   4.1 Background
       4.1.1 Mutual Range Join
       4.1.2 Join Execution
   4.2 Bitmap Based Join
       4.2.1 Putting It All Together
   4.3 Mutual Range Join - MRJ
       4.3.1 Binning and Mutual Range Joins
       4.3.2 MRJ with Distributed Data
   4.4 Evaluation
       4.4.1 Implementation
       4.4.2 Equi-Join Over Discrete Data
       4.4.3 MRJ Processing
       4.4.4 Throughput Effect
       4.4.5 Impact of Changing Error and Range Values
   4.5 Summary

5. Advance Analytics over Real Scientific Array Datasets - FDQ
   5.1 Domain
       5.1.1 Climate Datasets and Variables
       5.1.2 Querying Needs and Tools
   5.2 Analytical/Window Querying
       5.2.1 Definition
       5.2.2 Minus Function
   5.3 Query Evaluation and Optimization Algorithms
       5.3.1 Analytical Functions Implementations
       5.3.2 Optimization: Dimensional Restructuring
       5.3.3 Distributed Calculations
   5.4 System Implementation
       5.4.1 FDQ Engine
       5.4.2 ANL Web Portal
   5.5 Evaluation
       5.5.1 Average Function Performance
       5.5.2 Analytical Function Performance and ...
   5.6 Summary

6. An Optimizer for Multi-Level Join Execution of Skewed and Distributed Array Data - ScKeow
   6.1 Background
       6.1.1 Addressing Skew in Processing of Array Data: Open Problems and Our Contributions
       6.1.2 Sparse Matrices Representation
       6.1.3 Array Data Joins and Aggregations
       6.1.4 Query Optimizer
   6.2 Overview - Sckeow Optimization
       6.2.1 Formal Problem Statement
       6.2.2 The Plan Generation Process
   6.3 Cost Model
       6.3.1 MPU - Minimal Profitable Unit
       6.3.2 Cost Evaluation
   6.4 Plan Generation Considerations
       6.4.1 Plan Pruning
   6.5 Evaluation
       6.5.1 Optimization Performance
       6.5.2 Query Execution Times
   6.6 Summary

7. Related Work
   7.1 Scientific Data Management
   7.2 Optimization of Advanced Join Queries over Distributed Data
   7.3 Operators Execution: Joins and Analytical Functions

8. Conclusions and Future Work

Bibliography

List of Tables

2.1 Example Tables and Simple Relational Algebra Queries
2.2 DSDQuery Resultset Size based on selectivity (S) and number of files (F) – value based query
3.1 DistriPlan: Costs of Operations by Operator
3.2 DistriPlan: Expected Resultset Size by Operator
3.3 DistriPlan Experimental Settings: Join selectivities and processed dataset average sizes by query
3.4 DistriPlan Spanning Time: Time required to span plans and number of spanned plans for a three-way join distributed among multiple nodes
3.5 DistriPlan Experiment Setup: Distributed Join Slowdown
3.6 DistriPlan Join Queries Execution Time
3.7 DistriPlan Performance Slowdown Percentage of a Plan Optimized for a Specific Optimizer Setting (Set) but Executed with a Different Actual (Act) Network Setting - Q1 with a setting of 3,5,4
3.8 DistriPlan Performance Slowdown Percentage of a Plan Optimized for a Specific Optimizer Setting (Set) but Executed with a Different Actual (Act) Network Setting - Q3-24 in Table 3.5
4.1 BitJoin Percent of Additional Records Output With Varying 'e' Values for 10M Values
6.1 Node Cost Examples. Join Costs Calculated for Non-Sparse Nested-Loop Equi-Join
6.2 Sckeow Non-Distributed Trees Spanning Performance; Increasing Number of Joins with Different Selectivities
6.3 Sckeow Distributed Tree Spanning Performance; Queries with Different Selectivities
6.4 Sckeow Distributed Plan Spanning Engines Performance Comparison
6.5 Scale of Number of Trees Spanned for Heterogeneous Environments

List of Figures

1.1 Array and Relational Data Storage Compared
1.2 Bitmap Index Example
1.3 Alternative Ways of Evaluating a Query Across Multiple Repositories
1.4 An Example Network with Skewed Resources
2.1 Mapping Between the Structured File, UML, and the Relational Model
2.2 Example of Simple Subsetting and Aggregation Queries
2.3 DSDQuery Repository Structure Example
2.4 DSDQuery System Architecture: Bridger and DSDQuery DSI
2.5 DSDQuery's Metadata Processing Overhead
2.6 DSDQuery Performance for Querying Increasing Number of Files
2.7 DSDQuery's Output Resultset Size, as Data is Queried from Increasing Number of Files
2.8 DSDQuery Response to Querying of Changing Selectivities – Dimension Based Queries
2.9 DSDQuery Selectivity Impact on Performance – Value-based Queries
3.1 Query for Join Walkthrough Example
3.2 A Walkthrough of the Join Process
3.3 Example Query: Simple Join
3.4 Non-Distributed Execution Plan Samples
3.5 Distributed Execution Plan Examples
3.6 Distribution Options for an Array (Using DistriPlan Rule 2) Distributed over 3 Nodes
3.7 DistriPlan Pruning: Number of Pruned Candidate Join Plans as the Number of Hosts Increases for a Join between Two Variables
3.8 DistriPlan Execution Time Slowdown (in %) for the Most Distributed and Median Plans Compared to the Cheapest Plan
4.1 Example Query: Temperature Join
4.2 Steps in the Regular Join Algorithm
4.3 BitJoin: A Walkthrough of the Join Process
4.4 Bitmap Index Types Example
4.5 BitJoin Processing Time for Equi-Joins over Categorical Data with Increasing Number of Bins
4.6 BitJoin Resultset Size of Discrete Data Join with Increasing Number of Bins
4.7 BitJoin Performance Improvement over Small Datasets with Increasing Number of Nodes – Speedup Compared to Regular Join on the Same Number of Nodes
4.8 BitJoin Execution Time Comparison over Large Datasets
4.9 BitJoin Execution Time with Decreasing Throughput
4.10 BitJoin Speedup over Regular Join for Changing 'r' Values
5.1 Analytical Query SQL Example
5.2 Analytical Query Example: Demonstration of Expected Behavior of the MINUS Analytical Function with Data Reset (Highlighted)
5.3 FDQ: A 3 Dimensional Variable
5.4 FDQ Query 1 - Analytical Function Query Example
5.5 FDQ Query 1 Resultset
5.6 ANL Querying Portal, Backed by FDQ
5.7 FDQ Execution Time of the AVG Function over Partitions with Different Selectivities (16% and 100%) [128 days - 90GB]
5.8 Query Runtime Speedups: FDQ over SciDB
5.9 Runtime of Minus Function: Benefits of Memory Optimizations in FDQ [1024 days - 24GB]
5.10 Comparison of Analytical Functions Runtimes Using FDQ [1024 days - 24GB]
5.11 FDQ Scale Up: Execution Time of the Average Function with Increasing Number of Threads
5.12 FDQ Scale Out: Execution Time of the Average Function with Increasing Number of Nodes
6.1 Example Query: Tornado Hypothesis Verification
6.2 Sckeow: Cost Based Model Optimization Engines Comparison
6.3 Sckeow Execution Time (in Seconds) Comparison for Different Array Sizes on Increasing Number of Nodes
6.4 Breakdown of Time Spent per Task by Engine – Sckeow, DSDQuery, Skew-Aware Join, and Similarity Join
6.5 Sckeow Comparison of Execution Time (in Seconds) for Heterogeneous Settings Using Fixed Array Size

List of Algorithms

1  Query Analysis
2  DistriPlan – Build Plans Without Repetitions
3  DistriPlan – Build Arrays to Spread Data Holding Rule 2
4  Naïve Array Data Equi-Join
5  Regular Array Data Join
6  The Naïve Bitmap Index Equi-Join
7  BitJoin Dimensional Join
8  BitJoin Index Restructure
9  Distributed BitJoin Master
10 Distributed BitJoin Process
11 The Minus Analytical Function
12 FDQ Results Accumulation
13 High Level Optimizer

Chapter 1: Introduction

1.1 Introduction

Scientific data collected from instruments and simulations [11, 19, 126, 127, 130, 147] has been extremely valuable for a variety of scientific endeavors. These data are typically disseminated through portals such as the Earth System Grid (ESG) or the Mikulski Archive for Space Telescopes (MAST). These portals typically support processing a collection of files that are stored in low-level formats such as NetCDF [102] or HDF5 [48] (these file formats usually hold two sections: metadata and data, where the metadata holds the data structure, while the data itself is stored sequentially). The key challenge being faced in these scientific efforts is that while dataset sizes continue to grow rapidly and become more distributed, disk speeds and wide-area transfer bandwidths have not kept pace. Thus, software tools for dealing with scientific data must be enhanced to incorporate new approaches if data-driven scientific advances are to be maintained in the future.

While data is collected at large scale and distributed through portals, the scientists using these data often require it to be processed (e.g., aggregated or subsetted) and restructured before they can find it effective. For example, climate researchers often use aggregation of certain attributes (from a large set of attributes) for specific spatial ranges over the geographically distributed arrays they collect. Currently, those scientific data portals provide limited querying functionality to obtain the desired data, and do not provide cross-cluster capabilities. Examples of currently available functionality include simple interfaces for browsing through the set of files and selecting the ones of interest to the user, or, in some cases, simple (predefined and hard-coded) subsetting functionality within one or multiple files. These functionalities do not suffice.

1.1.1 Overview

In this dissertation, we address Distributed Scientific Array Data Querying with the goal of enabling declarative, complex queries over distributed array data, which is expected to substantially reduce the time required for analyzing these data and to advance research wherever scientific array data are used. We have identified two main issues on the path to this goal: querying functionality and performance. Querying functionality refers to the flexibility of the enabled querying, the ease of phrasing such queries, and the features enabled by the expressed queries. Query performance refers to how the phrased query is optimized and prepared for execution (e.g., how should the data be read? How should it be returned? In what order should the different query parts be executed, and should execution use other data structures, such as indexes?).

We begin by developing our framework for query execution, and show how we address advanced distributed subsetting queries (DSDQuery DSI, Chapter 2), which we later use for addressing join queries and advanced analytical queries (FDQ, Chapter 5). We address join queries by introducing query optimizers (DistriPlan, Chapter 3, and Sckeow, Chapter 6) and an execution engine (BitJoin, Chapter 4) for advanced join queries over distributed arrays.

Each chapter begins by introducing the issue we address there. Then, if necessary, specific background is provided. Immediately afterwards, we dive into the matter at hand. We conclude each chapter by evaluating and summarizing the presented work.

1.1.2 High Level Motivation

Querying and analyzing scientific data is critical for data-driven discovery. However, several characteristics of scientific data make such analysis challenging, which is the reason current systems provide limited querying ability over these data. First, the data are typically in the form of arrays and often stored in native system files. Second, both the nature of the desired operations, and the challenges in executing them efficiently, are quite different from those of the relational databases that have been developed over the past four decades by the database community. Last, scientific data is often distributed, due to its size and/or the way it is generated, collected, and/or managed; a point we will discuss thoroughly later.

Querying semantics over scientific array data is different compared to traditional relational data. Currently, querying scientific data entails programming the query execution (the scientist provides code in languages such as C++, Java, R, or Python that is expected to produce the intended results). Two issues arise with this approach: each query takes a long time to develop, and it must go through a rigorous testing phase (which assures the query generates the intended results).
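As an illustration of the hand-coded approach described above, the following is a minimal sketch (in Python, using the netCDF4 and NumPy libraries) of what a scientist might write for a single spatial subsetting task; the file name, variable names, and bounds are hypothetical, not taken from any particular dataset:

    # A hand-written subsetting "query": read a temperature variable from one
    # NetCDF file and keep only a latitude/longitude box. Every new question
    # requires editing and re-testing code like this.
    import numpy as np
    from netCDF4 import Dataset

    ds = Dataset("POP.nc")                  # hypothetical file name
    lat = ds.variables["lat"][:]            # dimension-mapping variables
    lon = ds.variables["lon"][:]
    temp = ds.variables["TEMP"][:]          # data variable (lat x lon)

    # Boolean masks over the dimension values select the region of interest.
    lat_sel = (lat >= 30.0) & (lat <= 40.0)
    lon_sel = (lon >= -100.0) & (lon <= -80.0)

    subset = temp[np.ix_(lat_sel, lon_sel)]  # the subsetted array
    print(subset.shape)

Even this simple task fixes one file, one variable, and one region in code; changing any of them, or querying across files, means rewriting and revalidating the program.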

These issues raise the need to support scientific array data processing using declarative languages or structured operators. Many systems aiming to address this need have been built [23, 114, 121]. Systems that can handle some (but not all) of the characteristics of array data include relational distributed systems [94], Map-Reduce implementations [40], and scientific array databases [25, 141]. These systems simplify the query specification process as compared to the ad-hoc approach and/or using low-level languages. Some of these systems require that the data be loaded into a database [23, 121], whereas others can provide query processing capabilities working directly with low-level data layouts, such as a flat file, or data in formats like NetCDF and HDF5 [114]. However, there is still limited adoption of these systems, particularly because none support all of the desired features, and almost all have a high cost of initially loading the data. Thus, it is important to continue to provide effective data analytics functionality for this data. We will discuss these issues more thoroughly in each chapter, while targeting some of the needs that are not currently addressed by existing systems.

1.2 Background

In this section we describe the background necessary for all chapters. Information that is relevant only to specific chapters will be described in the first relevant chapter.

1.2.1 Array Data

Scientific array data in our context is defined to be a group of variables, each of which is comprised of a set of multidimensional arrays. The dimensional coordinate values are either just the indices (as in the default storage of an array in a programming language implementation), or a value stored in another array and mapped by its index location.

Figure 1.1: Array and Relational Data Storage Compared

In Figure 1.1 we demonstrate a simple array within this model. The data in this array is the availability of services on different platforms. There are three arrays in the diagram. The main array is the boolean variable, whereas the two other arrays are mapped dimensional coordinates. The values of mapped dimensional coordinates are globally unique – e.g., in the example, the combination of "Netflix" and "Apple TV" is unique and can appear at only one location, since otherwise a conflict could occur.

Data stored in the array model can also be mapped to the relational model. In the equivalent relational storage, each tuple has an identifier (usually its row location, commonly known as ROWID) that allows addressing it directly.
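To make the two models concrete, here is a small sketch that holds the same platform/service availability information once as an array with dimension-mapping arrays and once as relational-style tuples; the specific truth values are illustrative, not the exact contents of Figure 1.1:

    # Array model: a dense 2-D boolean array plus one mapping array per dimension.
    services = ["Netflix", "Hulu"]           # maps coordinate -> dimension value
    platforms = ["Apple TV", "TV App"]
    available = [[True, False],              # available[i][j] is one cell
                 [True, True]]

    # Relational model: every cell becomes a (service, platform, value) tuple,
    # and the implicit row position acts as a ROWID.
    rows = [(s, p, available[i][j])
            for i, s in enumerate(services)
            for j, p in enumerate(platforms)]

    print(available[0][1])   # array model: direct offset access
    print(rows)              # relational model: each value carries its full context

The relational form repeats the dimensional context in every tuple, which is one source of the space overhead discussed next.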

However, storing arrays as relational tables significantly increases space requirements and harms the performance of some operations. These size differences, together with the types of querying used over this data, make relational storage inefficient, and many management and storage engines specific to array data have been developed [24, 44, 95, 114, 146]. In addition, the relational model is designed to prevent users from knowing the underlying storage properties, whereas array data is intended to be queried by offsets. This difference results in many mathematical operations (like multiplication, scaling, masking, and others) that are easy to implement and optimize over an array data representation being hard to implement efficiently over relational data.

Although array database systems have been a topic of much research [24, 25, 43, 95, 104], most scientific data are still held as raw files in native formats [48, 102]. For example, climate simulation studies often create a different file for every measurement epoch (time range). Each file holds all available weather data in a different area or section of the globe within that epoch [11]. All these files together can be considered to logically construct one large array, referred to as a variable, with the file properties themselves translating into a virtual dimension, such as time. Many useful operations need access to multiple physical files simultaneously. For example, where climate data accumulates from multiple sensors at different times, calculating the average temperature requires multiple files, since the data for all required time epochs and areas are hosted in different files and at potentially different locations.

Array data properties are different compared to relational data. Within this work we assume that array data is interpreted by its dimensionality – the dimensions give context which assigns meaning to individual array cells. In addition, we assume data is not sparse when collected or generated, yet data may become sparse while being processed. On top of that, we refer to array data that is distributed across multiple files and possibly many machines or clusters. These characterizations result in the need to address array data in a different manner compared to traditional relational data methods.

Figure 1.2: Bitmap Index Example

1.2.2 Bitmap Index Structure

A bitmap index involves a set of bins, each of which represents a value (or a range of values). This value (or range of values) is referred to as the bin key, or simply the key. Corresponding to each key is a set of ordered bits, or a BitArray. In a BitArray, the first bit refers to the first value in the data array, the second bit refers to the second one, and so on. If a specific bit in the BitArray is set (has the value 1), the corresponding value in the array contains the bin key value (or a value within the range the bin represents). All other BitArrays contain 0 for that specific location, since keys and ranges do not normally overlap.

Array data can have discrete or continuous values. If the values are discrete, the keys match the values and the number of bins is equal to the number of distinct values. If the values are continuous, a binning method is used [135].

In Figure 1.2 we show an example of a bitmap index for the data in Figure 1.1. As can be seen, the data provided in the index is the same as in the previous example variable. At the top, there is a map holding the keys. Each key points to a BitArray that contains the index data – for example, the 2nd, 3rd, and 4th values (at indexes 1, 2, and 3) are all Y.

BitArrays are sparse by nature – the larger the number of bins, the higher the frequency of 0 values. This property can be exploited for compression, and compressed bitmap indexes have been extensively researched [10, 31, 133]. These structures also provide efficient implementations of bit-wise AND and OR operations directly over two compressed structures. Although we only use compressed bitmaps, specifically WAH [133], our discussions will not explicitly consider compression.
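To make the structure concrete, the following is a minimal sketch, in Python with plain uncompressed bit lists and illustrative data, of how a bitmap index is built over a small value array and how a bit-wise AND answers a conjunctive condition; it is a toy stand-in for the WAH-compressed structures used in our implementation:

    # Build a simple bitmap index: one BitArray (list of 0/1 bits) per distinct key.
    def build_bitmap_index(values):
        index = {}
        for pos, v in enumerate(values):
            bits = index.setdefault(v, [0] * len(values))
            bits[pos] = 1                      # set the bit for this position
        return index

    data = ["Y", "N", "Y", "Y"]                # illustrative discrete values
    other = ["A", "A", "B", "A"]               # a second attribute over the same cells

    idx1 = build_bitmap_index(data)
    idx2 = build_bitmap_index(other)

    # Conjunction (data == "Y" AND other == "A") is a bit-wise AND of two BitArrays.
    conjunction = [a & b for a, b in zip(idx1["Y"], idx2["A"])]
    print(conjunction)                         # -> [1, 0, 0, 1]

With compressed bitmaps the same AND is performed directly over the compressed representations, which is what makes them attractive for the distributed joins discussed later.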

1.3 Motivation and Contributions

Next, we briefly motivate each chapter and discuss the challenges and contributions introduced there. While currently users query array data by writing proprietary software to do so (investing time and resources in each query), this dissertation enables users to execute advanced (traditional and analytical) queries by using a declarative language over distributed array data stored in native system files that hold the data in different formats.

1.3.1 DSDQuery DSI - Declarative Querying of Distributed Scientific Data

Climate simulations like the Community Earth System Model (CESM) are producing massive datasets. The current output organization involves keeping all the variables for the entire globe, for one time-slice, in a single NetCDF file. In the future, this data organization is likely to change. However, most researchers focus on a specific geographical region (and often certain time-ranges), and a small number of variables. Selecting the data of interest for the user involves spatial or spatio-temporal subsetting of data over a Cartesian (non-rectilinear) grid. In either case, the user's need cannot be met by simply selecting appropriate files. Unfortunately, it means that the user has to download orders of magnitude more data over the Wide Area Network (WAN) than they need, and then manage and filter the data at their end.

If the data were stored in a standard database, extracting the data of interest would be a relatively straightforward subsetting query. However, for various reasons, scientific data is not stored using the popular (and mature) commercial database systems. This gap has been addressed to an extent by new emerging array databases such as SciDB [23]. These systems require loading the data while changing its storage format to the system's proprietary one, imposing significant overheads. Such overheads cannot be justified for massive datasets that are either read-only, append-only, or accessed infrequently.

Challenges and Contributions

DSDQuery DSI addresses three problems that have not been targeted before. The first is locating the data that is relevant for the query without requiring the user to intimately know the datasets, enabling true declarative querying. The second is querying across multiple files and extracting relevant data. The third is combining the output elements into one unified and combined output dataset (NetCDF, for example), which allows the user to apply the rest of the workflow operations in a more convenient and traditional way (e.g., a visualization program that is specific to the output format).

The main challenges are related to querying semantics and to the execution of both metadata extraction and the query itself. We discuss the challenges in depth in Subsection 2.1.2 and present our end-to-end solution, used by our collaborators at Argonne National Laboratory, in Chapter 2.

1.3.2 DistriPlan - Distributed Plan Optimization

One of the issues that remains unaddressed by DSDQuery is providing advanced query capabilities over data distributed across multiple geographically distributed repositories. In supporting structured query operators over distributed data, most common query operators (selections, projections, and aggregations) are relatively straightforward to support. However, the classical join operator and its variants [86] are challenging to support when the data is at geographically disparate locations. DistriPlan focuses on the challenge of executing and optimizing the join operator over geographically distributed array data.

Consider the current status of dissemination of climate data: in the United States, much of the climate data is disseminated by the Earth System Grid Federation (ESGF)1. However, worldwide, climate data is also made available by agencies of other countries, such as those from Japan, Australia, and others. A climate scientist interested in comparing data across datasets collected from different satellites or agencies will need to run multiple queries across these repositories and to bind the data manually.

1See https://www.earthsystemgrid.org/about/overview.htm

[Figure 1.3 sketches two candidate plans for joining array A (stored on Machine1) with array B (distributed over Machine2 and Machine3): Plan 1 ships A to the holders of B, Plan 2 gathers B at Machine1.]

Figure 1.3: Alternative Ways of Evaluating a Query Across Multiple Repositories

Similarly, array data related to other disciplines are spread across multiple repositories as well – e.g., genetic variation data is found in the 1000 (Human) Genomes Project and the 1000 Plant Genomes repository. Scientists are likely to need to pose join queries over different variables collected in these arrays, for example, linking behavioral data to genes or correlating particle behavior.

Challenges and Contributions

To illustrate the challenges in executing join(-like) queries across multiple repositories, we take a specific example. Given a declarative query, the system needs to decide what to do to provide the intended results – a process referred to as building an execution plan. An execution plan is a tree representation of ordered steps to perform for retrieving the expected results. Figure 1.3 shows two simple execution plans for executing a join between two one-dimensional arrays A and B. A is stored entirely on Machine1 and is of length 100, whereas B is distributed between Machine2 and Machine3, each having a sub-array of length 10. The join selectivity, which means the percent of results holding the join criteria out of all possible results generated by a Cartesian multiplication of both joined relations [60], is 1%. In Plan 1, the variable A is sent to both Machine2 and Machine3. A partial resultset of length 10 is produced on each machine, and then the resultset from Machine3 is sent to Machine2 for combining with the local resultset. The final resultset is sent to the user. Plan 2 combines the distributed array B before performing the join on Machine1.
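The trade-off between the two plans can be made concrete with a back-of-the-envelope transfer-volume estimate. The sketch below is a simplification; a real cost model, including ours, also weighs latency, parallelism, and local compute. It only counts the elements shipped between machines under the example's sizes and selectivity:

    # Rough data-movement comparison for the Figure 1.3 example.
    len_a, len_b_part, selectivity = 100, 10, 0.01
    partial_result = int(len_a * len_b_part * selectivity)    # 10 matches per machine

    # Plan 1: ship A to both holders of B, then ship one partial resultset back.
    plan1_transfer = 2 * len_a + partial_result               # 210 elements

    # Plan 2: ship both pieces of B to Machine1 and join locally.
    plan2_transfer = 2 * len_b_part                           # 20 elements

    print(plan1_transfer, plan2_transfer)

Counting elements alone favors Plan 2 here, but as the discussion below notes, link latencies and the opportunity for parallel computation can flip the preference – which is exactly why a cost-based optimizer is needed.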

Multiple challenges have been implicitly introduced here. Translating a query to a plan has been thoroughly researched before [28, 34], yet building execution plans that consider different processing orderings on different nodes, while the data is distributed among multiple nodes and sites, has not. In our example, anticipating which of the two presented plans would execute faster is not trivial. A possible reason for choosing Plan 1 would be that more calculations are performed in parallel. However, with communication latencies taken into account, Plan 2 may be preferable. In addition, when two or more joins need to be performed, producing an execution plan becomes hard, since the number of distribution and evaluation options increases exponentially.

This work presents a methodology for executing and optimizing join queries over geographically distributed array data. We show that Cost-Based Optimizers (CBOs) provide better optimization opportunities for our target setting, where simple heuristics are not sufficient. We develop algorithms for building distributed query execution plans while pruning inefficient and isomorphic plans. We introduce a cost model for queries over distributed data. The cost model considers the physical distribution of the data and the networking properties.

1.3.3 BitJoin - Scientific Data Join

While DistriPlan discusses the optimization of join queries, it does not address their execution. BitJoin focuses on the execution of two distributed join types: the Equi-Join and the Mutual Range Join (MRJ). While our focus is initially on the equi-join operation, a join where the join criterion is equality of values, the techniques developed can be easily adjusted for other types of join criteria. The MRJ is an operator that is similar to "Range Joins" [81], but with an additional notion of sampling noise.

Join algorithms [105, 109] have been extensively researched and implemented in relational DataBase Management Systems (DBMS). However, the state of the art in executing joins over distributed data is limited. Distributed DBMSs often use the same join algorithm implementations, without adjusting them for issues related to limited networking properties. In MapReduce systems [40] the programmer manually implements such joins. DBMS implementations over MapReduce, such as Hive [116], focus on executing queries within a cluster and not when data is geographically dispersed [20]. SciDB [24], a Scientific DBMS (SDBMS) for array data, and similar systems, provide only centralized, not distributed, join implementations.

Furthermore, we posit that the traditional join definition does not fit scientific (numerical) data well. The storage properties of array data are different compared to relational settings, changing the definition of a join. In addition, sensors collecting numerical data often have sampling errors; comparing array cell values based on a precise match does not align with users' intuition. Instead, it is more meaningful to look for close values. While pre-processing (particularly, data discretization, i.e., assigning discrete values to continuous ones) may help, it is desirable to define a high-level operator for this need.
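As a simple illustration of the intuition behind joining on closeness rather than equality, the following sketch pairs cells whose values agree within a tolerance; the tolerance value and the data are hypothetical, and the nested-loop form is for clarity only (Chapter 4 develops the actual bitmap-based algorithms):

    # Pair up measurements that agree within a tolerance 'e' instead of exactly.
    def close_pairs(a, b, e):
        return [(i, j)
                for i, x in enumerate(a)
                for j, y in enumerate(b)
                if abs(x - y) <= e]

    temps_a = [21.97, 25.10, 30.02]     # illustrative noisy samples
    temps_b = [22.03, 24.80, 31.50]

    print(close_pairs(temps_a, temps_b, e=0.1))   # -> [(0, 0)]; exact equality finds nothing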

Contributions

This work addresses challenges in dealing with distributed joins, including the MRJ. We accelerate distributed equi-joins by using bitmaps. Though there has been prior work on accelerating joins by using bitmaps [81], we focus on the case of distributed data, and specifically, on transferring only the compressed bitmaps, which leads to reduced network time. Moreover, we develop algorithms for the case where variable dimensional values do not match. We later motivate and formulate Mutual Range Join (MRJ) queries. For supporting these operations, we define a variant of bitmaps, i.e., we introduce the type B bitmap (index) to complement the original (type A) bitmap.

1.3.4 FDQ - Analytical Querying of Array Data

While DSDQuery and BitJoin address subsettings and joins over raw data, advanced analytical querying is not available through these engines (nor any other engine). An example of a query that cannot be addressed by current systems is "calculate the daily median of the differences between corresponding measurements (same time) of current and previous days". For example, if three samples are collected a day, at 10 am, 5 pm, and 10 pm, and we have 2 days of data – day 1 with [85, 75, 60] and day 2 with [80, 75, 50] – subtracting day 1 from day 2 (current minus previous) produces [-5, 0, -10]. The reported median in this case should be -5. This calculation cannot be phrased using one query (without the usage of "sub-queries" and joins) over array data, mainly since the notion of matching samples is lacking in current querying languages such as the Structured Query Language (SQL) [84] (including SQL/MDA, an in-development SQL extension for Multi-Dimensional Array data).
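The example result can be checked with a few lines of straightforward code; the point of FDQ is precisely that the user should not have to write such per-query programs (the data values below are those of the example above):

    # "Daily median of differences between matching samples of current and previous day"
    day1 = [85, 75, 60]                 # previous day's samples (10 am, 5 pm, 10 pm)
    day2 = [80, 75, 50]                 # current day's samples

    diffs = [c - p for c, p in zip(day2, day1)]   # matching-sample differences: [-5, 0, -10]

    def median(xs):
        s = sorted(xs)
        mid = len(s) // 2
        return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2

    print(diffs, median(diffs))         # -> [-5, 0, -10] -5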

In this work we provide the ability to generically phrase such queries and to provide such functionality efficiently. This work builds on top of our earlier work, where we had developed structured query support on top of scientific data [44, 45]. We focused on querying over native array storage formats (e.g., NetCDF), avoiding the costs other systems require (data duplication or reformatting). We used SQL as the interface to our querying systems – some semantic adjustments to the operators were made to fit the array data context.

We introduce FDQ, the Analytical Functions Distributed Querying engine. FDQ focuses on a class of operations called Analytical Functions, which had not been defined or supported before for array data. In analytical queries, the user provides dimensions that are used to partition the data (similar to the SQL partitions generated by the GROUP BY clause). Analytical functions process the data in each of these partitions (which are referred to as windows). However, unlike the partitions created by the GROUP BY clause, windows in our context are ordered, and therefore, the processing engine can access values of different windows.

Challenges and Contributions

Multiple challenges arise in efficiently calculating analytical functions over scientific array data. First is establishing the syntax and semantics of analytical querying over array data. Second, aggregation requires heavy mathematical calculations, even more so for analytics that cross window boundaries; minimizing the calculation overheads and processing queries efficiently is a challenge. Third, generating arrays requires pre-defining dimensions, yet the dimensions can be determined only after we know the results. Last, processing of such data requires a lot of memory – being memory efficient requires careful design of algorithms.

Overall, we contribute by introducing Analytical Functions (also known as "Window Querying") over scientific array data. We provide syntax and semantics, and extend Analytical Functions to not only allow window ordering, but to also allow ordering each window internally. We optimize our querying engines by using "chunked" in-memory processing of array data. We refer to these memory chunks, or tiles, as sections in this work. Sectioning (or tiling) allows batching calculations and utilizing memory caches to improve performance, decreasing disk accesses. We introduce methods to distribute window calculations. Last, we present approaches to distribute the query execution, based on both sectioning and dimensional distribution approaches, while discussing the benefits of each approach over the other.
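The following sketch illustrates the sectioning idea in a one-dimensional setting: data is processed in fixed-size chunks that fit in memory while a running window aggregate is carried across chunk boundaries. The chunk size, window size, and data are hypothetical; the real engine tiles multidimensional arrays and distributes the sections:

    # Chunked ("sectioned") computation of a moving-window average.
    from collections import deque
    from itertools import islice

    def windowed_average(values, window, section_size):
        it = iter(values)
        buf, results = deque(maxlen=window), []
        while True:
            section = list(islice(it, section_size))   # read one in-memory section
            if not section:
                break
            for v in section:                          # window state survives sections
                buf.append(v)
                if len(buf) == window:
                    results.append(sum(buf) / window)
        return results

    print(windowed_average(range(10), window=3, section_size=4))   # [1.0, 2.0, ..., 8.0]

Because the window buffer survives section boundaries, the chunked computation produces exactly the same answer as a single pass over the whole array, while only one section needs to be resident in memory at a time.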

[Figure 1.4 depicts four clusters, A through D, with 32 to 250 servers each, different CPU models (E5-2680, E5-2670, E5640), internal interconnects ranging from 1 GBPS and 10 GBPS Ethernet to Infiniband, and inter-cluster links of 100-400 MBPS with 20-40 ms latencies.]

Figure 1.4: An Example Network with Skewed Resources

1.3.5 Sckeow - Multi-Level Join Query Optimization over Skewed Resources and Skewed Data

While we addressed the optimization of join queries in DistriPlan, the focus there is limited to environments with homogeneous resources. Complex join queries over heterogeneous resources implicitly introduce multiple challenges. First, the data is stored at different locations and in different formats – e.g., tornado data is provided by the Storm Prediction Center, while temperature data is collected from simulations made locally and in part provided by the National Weather Service. Second, data for specific variables, or different portions of the same variable, may be distributed among multiple clusters due to storage availability or retention policies. Third, some of the data are sparse in nature, for example, storm data. Last, some of the data becomes sparse during processing (for example, a temperature self-join produces a sparse resultset).

In Figure 1.4 we show an example network that holds multiple variables (A, B, C, and D) at different locations – this diagram is based on real environments. Executing join queries over this environment requires carefully planning data accesses and their processing, a step conducted by a Query Optimizer [68]. The state of the art in query optimization over scientific data is limited – it either assumes uniformity in network properties [43, 145] or requires the full network architecture [45] (Chapter 3). For the former, not using the specific network properties of each cluster when generating plans over distributed data will likely not produce efficient plans. For the latter, too many options can exist, which makes it an impractical approach for complex queries.

Challenges and Contributions

These introduce multiple challenges, all are rooted in the exponential number of execution plans generated when all options are considered. First, each cluster’s

18 computational power are different and hard to compare based on technical hard- ware specifications. Second, data transportation needs to be encapsulated within the plans. Evaluating the full network properties is impractical since too many options may be generated. Last, it might be preferable to change the data repre- sentation format (e.g. from traditional array layout to a sparse representation).

In addressing these issues we first introduce the CF - Conversion Factor, a nu- merical value that enables data format change at run time, decreasing the pro- cessed array sizes. We also introduce the MPU, Minimal Profitable Unit. MPU’s contain multiple parameters, measured automatically for each environment, which allow accurate computational power estimation. These properties are used within our introduced cost model, intended for complex queries over highly dimensional scientific arrays.

19 Chapter 2: Structured Querying of Scientific Data - DSDQuery

Here we present our first engine, which targets querying of Scientific Array

Data. We named this system DSDQuery DSI (and we often refer to it in short as

DSDQuery) since it extends SDQuery DSI [114] for distributed datasets.

Specifically, DSDQuery allows the use of a subset of SQL to create a structured

database based on extraction of content from multiple datasets without knowing

the names of the files that host the data and their formats [51]. DSDQuery is built

from two components: Bridger, the component that automatically detects new files and datasets, and extracts its metadata; and DSDQuery DSI, an extension of SD-

Query that allows querying multiple databases and datasets while structuring a unified result dataset based on a user’s query. When Bridger detects a file of a known type (currently, NetCDF or HDF5), Bridger uses its appropriate provider to extract the relevant metadata and updates the repository. Bridger detects new

files and datasets not by scanning the file system periodically, but by using the file system notification interfaces, particularly inotify in linux [80]. After locating all the

files that are relevant for a given query, DSDQuery generates single file query calls for each of the located files. Unlike for SDQuery, the results are padded with their dimensions. Afterwards, DSDQuery reformats, reorganizes, and restructures the results to a valid dataset format based on the user request (currently, the supported

20 formats are the coordinate form and NetCDF). This work can also be viewed as an application of the previously proposed No-DB approach [3, 72, 129].

2.1 Background

2.1.1 SDQuery DSI

DSDQuery is built on top of SDQuery DSI [114]. SDQuery DSI was developed to minimize the amount of data that needs to be transferred over the Wide Area

Network, by subsetting data from within a single file. One way of viewing SD-

Query is that it embeds data management functionality with a data transfer pro- tocol, where the data transfer protocol engine is Globus GridFTP [5]. GridFTP is a

File Transfer Protocol (FTP) server that allows a developer to load external mod- ules (called DSIs) that react to FTP events. SDQuery DSI registers within GridFTP, and once an FTP GET request arrives, SDQuery DSI extracts the query and the file name of the dataset, and executes the query over a single NetCDF or HDF5 file.

FTP supports several different commands, but the most common ones are PUT and GET. SDQuery DSI reacts to a PUT by receiving the sent file, and generating an index over it. The framework responds to a GET by examining the given com- mand, and if it contains a query, processing it. If, for example, the user issues the following ftp command:

GET /home/user/POP.nc(SQL:SELECT TEMP FROM POP WHERE TEMP GE

8.1 AND TEMP LT 10.2)

SDQuery DSI would access the file POP.nc, and based on the query enclosed in the command, will extract the values of the variable TEMP that are greater or equal to

8.1 and smaller than 10.2.

21 2.1.2 DSDQuery - Functionality and Challenges

DSDQuery DSI is based on SDQuery DSI, but provides the following additional functionality.

1. DSDQuery DSI allows querying multiple files.

2. DSDQuery DSI can locate the relevant files and datasets for a given query by

building and maintaining a metadata repository that is populated automati-

cally when file system events occur.

3. In SDQuery DSI, a simple set of records were returned to the user. DSDQuery

DSI generates a full database file in a format chosen by the user (although

currently the only formats available are native coordinate, (x,y,val), and

NetCDF). The generated file contains a remap of the variable dimensions

to a denser representation that contains only the dimension values that are

needed. In some cases, for representing the different data sources that results

were sourced from, we add another dimension for the data source, that by

default is mapped to the source file name.

To further illustrate the points above, we use an example. In SDQuery, a cli- mate researcher accessing a repository with simulation output will state (rephrased in plain English) “I want to see the evaporation data of a specific area from the file

/server/2015/01/21/POP.evap.nc”. Having to specify the file requires the scientist to learn the file system structure of the source repository. In DSDQuery, the query would be phrased as “I want to see the evaporation data of a specific area from the

01/21/2015 simulation”. We believe this declarative statement forms a better com- munication with the data source provider for multiple reasons: the user does not

22 need to know where the files are stored, the data provider can change its directory structure without communicating this change with its clients, and, most impor- tantly, if there are multiple data sources that have the relevant data, they can all be queried.

However, multiple challenges are raised from this declarative query:

Query Meaning: many engines exist that provide querying ability using declarative languages, but nuances and implications of such queries are not always clear. This is particularly important for our system, as we use SQL operators on array data.

In our work, we use relational algebra to assign meaning to operators.

File Detection and Metadata Extraction: we show how using the dataset and ad- ditional metadata allows us to further assign semantics and context to queries.

Specifically, non-unique array and dimension names can form unique meaning by using additional metadata.

Unionizing (and Aggregating) Data: we show how our target operators should be run over multiple files.

Storage Location and Query Planning: locating data sources for queries to execute on is critical for efficient execution of queries using declarative language over large repositories. We developed algorithms for creating master queries and data sources adjusted queries, each of the latter being a subset of the master query optimized for a specific data source.

2.2 Formal Definition of Query Operators

DSDQuery supports a subset of SQL for querying array data. We have chosen

SQL because of its popularity, and particularly, because none of the array query

23 languages [15, 23, 29, 38, 77, 82, 83, 85, 99, 121, 142] have reached level of popu-

larity or maturity comparable to SQL. The subset of SQL we are using is similar

to the AQL used by SciDB, although we chose not to refer to AQL since it is not

standardized.

To make SQL operations applicable to array data, we first formalize a map-

ping between an array dataset to relational algebra[36], something that have not

been done before for two reasons: current systems do not support the various fea-

tures described before and current systems do not support distributed data, which

makes the problem simpler and more intuitive. The subset of SQL we support in-

volves simple operations on a single table or array – complex operations such as

analytical functions and joins are handled by our subsequent systems, FDQ and

BitJoin.

The following relational algebra is described upon one data set. When using

multiple datasets the data from these needs to be unionized (and sometimes ag-

gregated) to form the full data set.

2.2.1 Formalization - Relational Algebra

We first describe the nuances of Scientific Array Data that are relevant here.

Array databases are comprised of dimensions and variables. Each variable is con-

structed of a data type and dimensions – the same dimension(s) can be shared

among variables. The dimensions can have a mapping variable, with the same name, that is used for mapping different values by its coordinate. For example, a dimen- sion longitude would, probably, have a mapped variable that maps the coordinate

0 to a value such as 39.997, 1 to 39.998, and so on.

[Figure 2.1 depicts three variables (Var1, Var2, Var3) defined over shared dimensions (Dim1, Dim2, Dim3) in a scientific database, the corresponding UML view, and the equivalent relational tables.]

Figure 2.1: Mapping Between the Structured File, UML, and the Relational Model

In Figure 2.1, we demonstrate the dissonance between an array database and a relational one. In an array database, each dimensional coordinate in a variable, which can be referred to as a cell, has exactly one value. All of the cells together form a multi-dimensional array. Mapping the array model to the relational one, we can define a tuple to be all the values of the different variables that have the same set of dimensional coordinates. For example, if we have three dimensions – longitude, latitude, and depth – and two variables which are based on these dimensions – temperature and evaporation – we can either look at these as two different relations, temperature and evaporation, or as one relation, climate, which has two properties: temperature and evaporation. The second form provides more convenience while posing queries, and we develop the algebra over the second suggested form. More specifically, in the model we describe here, each array database source is a set of multiple relations, each having different dimensions. All the variables with the same set of dimensions within a specific dataset file are looked at as one relation.

Projection: in relational algebra, a projection is defined as a restriction of the properties, or atoms, of a relation. For example, if we have a relation R that is constructed of four atoms, dim1, dim2, var1, and var2, where dim1 and dim2 are the dimensions of the variables, the result of Π_{dim1,var1}(R) is a relation that includes only dim1 and var1. This definition cannot be directly applied to array data because the data cannot be brought out of its context, i.e., its dimensions, without using aggregation first. Therefore, we modify the definition of projection to apply only to non-dimensional atoms. Using the last example dataset, Π_{var2}(R) is a valid request for the data of the variable var2 alone. We allow querying multiple variables in one query: Π_{var1,var2}(R) is a valid query.

Table 2.1: Example Tables and Simple Relational Algebra Queries. Above each table is its definition; N stands for NULL.

Selection: a selection operation is of the form σ_{φ}(DS), where φ is a list of conditioned atoms that are connected by and operators (∧) or or operators (∨), and DS is the dataset at hand. The atoms in φ may be dimension or variable names; as in traditional relational algebra, each variable name and dimension name can be treated as a column name. For example, the query:

σ_{longitude ≤ 39.885 ∧ depth > 5}(Π_{temp}(DS))

implies: extract all the temperature values from the dataset DS that are located in an area with a longitude smaller than or equal to 39.885 and a depth greater than 5. When applying the selection operation to array datasets, remapping of the dimension values might occur, and each data type has to have a NaN or NULL value (we do not distinguish here between different types of NULL values, such as "not sampled", although these need to be implemented). Each value that is not selected, but for which a corresponding dimensional coordinate exists, is replaced with NULL.

Dimensional coordinates that do not hold any values are omitted. There is one special case: if a dimension does not have a mapped variable (one that maps the coordinates it contains to other values), we have two options. The first is to copy the dimension as is and replace the values of all unselected coordinates with NULL, whereas the second is to create a mapped dimension that remaps the remaining coordinates.

For example, if we have a variable with two dimensions, as DS (shown in Table 2.1 (a)), and we execute a query such as σ_{val=1}(DS), since the second column presented has no value equal to 1, this dimensional value can be dismissed. Assuming a mapped variable for the first dimension does not exist, the output may be the one shown in Table 2.1 (b), for the case where we do not create a mapped variable. Alternatively, the result would be as shown in Table 2.1 (c) and (d), where the former holds the data for the mapped dimension values and the latter holds the resulting variable – we implemented the latter since, although it requires a small amount of additional processing time, it produces a significantly smaller resultset, which leads to shorter file transfer time. If the dimensional coordinates have not been modified, as is the case with the vertical dimension in the example, no additional processing is required.
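To make these selection semantics concrete, the following is a minimal NumPy sketch (not the DSDQuery implementation) of value-based selection with NULL filling and removal of empty dimensional coordinates; the function name select_by_value is hypothetical.

# Minimal sketch of value-based selection over a 2-D array variable.
# Unselected cells become NaN (standing in for NULL), and coordinates
# along each dimension holding no selected values are dropped.
import numpy as np

def select_by_value(var, predicate):
    out = np.where(predicate(var), var, np.nan)
    # Drop rows/columns (dimension coordinates) that contain no values.
    keep_rows = ~np.all(np.isnan(out), axis=1)
    keep_cols = ~np.all(np.isnan(out), axis=0)
    return out[np.ix_(keep_rows, keep_cols)], keep_rows, keep_cols

if __name__ == "__main__":
    DS = np.array([[1.0, 2.0, 3.0],
                   [3.0, 4.0, 1.0]])
    result, rows, cols = select_by_value(DS, lambda v: v == 1.0)
    print(result)  # NaN marks cells that exist dimensionally but were not selected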

Set Operators: we describe, using examples, the way one should think of set operations over array data.

Union: a union combines two different variables to appear as one in the output. In relational algebra, two relations need to be union compatible for the union to have meaning. In our case, union-compatible variables are two variables that share the same dimensions. When the sets of dimension values are not identical, a dimensional merge is performed and the new cells are filled with either NaN or NULL. Moreover, unlike the relational case, a union of two non-union-compatible variables is possible and allows constructing a dataset with multiple variables. For example, A ∪ B for the variables A and B in Table 2.1 (e) and (f), respectively, is presented in Table 2.1 (g), where the first row contains the dimension values.

If the dimensional coordinates do not collide, as in the case where an array is spread over multiple nodes, two options exist – the first is to unionize the data as described above, by adding a dimension for the source. The other option is to not create this dimension – since the values do not collide, there will be no case where duplicate values are targeted at the same dimensional coordinate. This behavior is more appropriate when unionizing results from a distributed array. In our system, if the union is performed over multiple datasets, we use the latter technique, unless it cannot be used because of data collisions. If the union is performed because of a user request in the query, we perform the former union, though we consider extending the query syntax in the future to allow the user to choose which union algorithm to perform.
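As a concrete illustration of the dimensional merge described above, here is a minimal sketch over 1-D variables stored as coordinate-to-value mappings; it is not the DSDQuery code, the helper name union_compatible is hypothetical, and the values loosely mirror the A and B example of Table 2.1.

# Minimal sketch of a union of two union-compatible 1-D variables.
# Coordinates are merged; cells missing in one input are filled with NaN.
import math

def union_compatible(a, b):
    """a, b: dicts mapping a dimension coordinate to a value."""
    coords = sorted(set(a) | set(b))
    merged_a = [a.get(c, math.nan) for c in coords]
    merged_b = [b.get(c, math.nan) for c in coords]
    return coords, merged_a, merged_b

A = {1: 1.0, 2: 1.0, 3: 2.0}   # variable A over coordinates 1, 2, 3
B = {1: 1.0, 3: 1.0, 4: 0.0}   # variable B over coordinates 1, 3, 4
coords, ua, ub = union_compatible(A, B)
print(coords)  # [1, 2, 3, 4]
print(ua)      # [1.0, 1.0, 2.0, nan]
print(ub)      # [1.0, nan, 1.0, 0.0]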

Intersection: intersection reports the values that are the same among the given variables, ignoring dimensions that do not match. The output has the same dimensions as its sources. NULL is reported for dimensions that are created but for which no coordinate matches between the variables. In Table 2.1 (h) we demonstrate the result of the intersection between variable A, presented in (e), and B, presented in (f).

Aggregations: the operator G_{AGGR(dimension)}(Var, DS) executes the aggregation function AGGR over the given dimension on the variable Var that is stored in the dataset DS. For example, when G_{sum()}(A, DS) is run on the data of variable A in Table 2.1 (e), a variable that has the value 4.0 is returned, since no dimensions were given for the operation. G_{avg(dim)}(A, DS) would likewise return a variable holding a single value, since the variable's only dimension is the one the operation aggregates over.
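The aggregation operator can be illustrated with a short NumPy sketch; this assumes an in-memory array rather than the engine's on-disk processing, and the function name aggregate is hypothetical.

# Minimal sketch of the aggregation operator G_AGGR(dimension)(Var, DS).
import numpy as np

def aggregate(var, aggr=np.mean, axes=None):
    """Aggregate `var` over the given dimension indices.
    If no dimensions are given, aggregate over all of them."""
    if axes is None:
        return aggr(var)
    return aggr(var, axis=tuple(axes))

A = np.array([1.0, 1.0, 2.0])
print(aggregate(A, np.sum))        # 4.0 – no dimensions given
print(aggregate(A, np.mean, [0]))  # aggregate over the only dimension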

Mathematical Operators: mathematical operators can be applied to variables or dimensions. In either case, there are two possibilities – the operator may be applied with a singular value or with an array of values. In the case of a singular value, for both dimensions and variables, the mathematical operation is

performed for each value of the relevant dimension or variable. If, for example, the

user specifies var1 × 2 or dim1 × 2, each of the values within var1, or each of dim1's coordinate values, is multiplied by 2. In the case where a variable or a dimension is multiplied by another variable, dimension, or a given array, there are two possible interpretations: one can either perform a matrix operation or, alternatively, a by-place (element-wise) operation. The choice needs to be specified by the user – the ".op" suffix is used for a by-place operation and "op" for a matrix operation, where op is the mathematical operator.
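The distinction between the ".op" (by-place) and "op" (matrix) forms can be sketched as follows with NumPy; the operator-suffix parsing itself is not shown, and the function names are illustrative.

# Minimal sketch of by-place (element-wise) versus matrix operations.
import numpy as np

def by_place(a, b):
    """'.op' style: element-wise operation, cell by cell."""
    return np.multiply(a, b)

def matrix_op(a, b):
    """'op' style: matrix multiplication."""
    return a @ b

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[10.0, 0.0], [0.0, 10.0]])
print(by_place(A, 2))   # every cell multiplied by 2
print(by_place(A, B))   # cell-by-cell product
print(matrix_op(A, B))  # standard matrix product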

2.2.2 Example Queries

As concrete examples, we use two queries – they involve filtering results by value and by dimensions. We consider a climate simulation output stored in a repository across a number of files. Only a small subset of these files contains the variable we are querying, which is TEMP. The variable we query has three dimensions in most files, which are longitude, latitude, and depth. In a fraction of the files that contain TEMP, only a subset of these three dimensions is involved, whereas there are additional dimensions in some of the files.

Figure 2.2: Example of Simple Subsetting and Aggregation Queries

The relational algebra for the first query shown in Figure 2.2 (a) is

⋃_{d ∈ DS} G_{AVG(longitude, latitude, depth)}( σ_{TEMP ≥ 5 ∧ TEMP < 15}( Π_{TEMP}(d) ) )

Here, we union the results of the same query from multiple datasets. The query extracts the variable TEMP from each queried dataset d, then subsets the variable by value, and finally aggregates the results by three of its dimensions. In some of the files an aggregation over the additional dimensions, which are not among the three dimensions mentioned above, is required. The aggregation uses the "AVG" (average) aggregation function. To recap, in order of execution: first, the system finds the relevant datasets in the way described in the next section; then, for each dataset, a selection based on the variable values is performed; afterwards an optional aggregation is executed if needed, e.g., if there are more dimensions than those returned; last, the data is collected from each dataset and unionized (and possibly aggregated) to form the final result.

The second query, presented in Figure 2.2 (b), is distinct in the sense that the condition is not on the value of the variable, but on its dimensions. The relational algebra that represents this query is:

⋃_{d ∈ DS} G_{AVG(longitude, latitude, depth)}( σ_{depth ≥ 15 ∧ latitude < 0}( Π_{TEMP}(d) ) )

Here, we unionize the results of the same query from multiple datasets, as explained before. The query extracts the variable TEMP from each queried dataset d, then subsets the variable by two of its three dimensions, and aggregates the results by three of its dimensions. Afterwards, the data collected from all of the datasets is unionized to form the final result set.
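A minimal end-to-end sketch of the per-dataset pipeline just described (project TEMP, select by value, average away extra dimensions, then union the per-dataset results) is shown below; it assumes toy in-memory arrays rather than NetCDF files and simplifies the final union-with-aggregation step.

# Minimal sketch of the per-dataset pipeline for the first example query.
import numpy as np

def run_on_dataset(temp, extra_axes=()):
    """temp: array over (longitude, latitude, depth [, extra dims...])."""
    selected = np.where((temp >= 5) & (temp < 15), temp, np.nan)  # selection
    if extra_axes:                                                # AVG over extras
        selected = np.nanmean(selected, axis=extra_axes)
    return selected

datasets = [
    np.random.uniform(0, 20, size=(4, 3, 2)),     # lon, lat, depth
    np.random.uniform(0, 20, size=(4, 3, 2, 5)),  # ... plus an extra dimension
]
partials = [run_on_dataset(datasets[0]),
            run_on_dataset(datasets[1], extra_axes=(3,))]
result = np.nanmean(np.stack(partials), axis=0)   # union + aggregate across datasets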

2.3 System Design and Implementation

This section describes key components of the system we have developed for

DSDQuery DSI.

2.3.1 Metadata Extraction

The metadata extraction is divided into five parts: new file detection, atom extraction, statistical analysis, atom unification, and DB enrichment. The first step, new file detection, is done using two methods. First, the system registers a set of inotify file system listeners on the relevant directories. When a file is added or modified, a metadata scan is initiated and the new file is registered, or the modified file's metadata is updated. In addition, on initialization, the system verifies that its repository contains up-to-date data by scanning the file system. Differences between the repository and the file system are processed.

The next step, atom detection, finds the relevant metadata atoms. Currently we support NetCDF and HDF5 files, though other file formats can be added easily in the future. The metadata atoms are mostly the variables and their dimensions. As will be described in Subsection 2.3.3, additional metadata can be added to each dataset and/or atom. Assuming we detected a new file with the variable from the example in Subsection 2.2.2, the output of this process would be variable: TEMP, dimensions: depth, longitude, latitude. Additional atoms can appear in the dataset's metadata or in the additional metadata file.

Next, for each atom, statistical information extraction is activated, which helps execute and/or optimize the queries. A query can be optimized by, for example, removing datasets that are irrelevant, or by allocating resources according to the expected portion of results or the dataset size. This step finds the minimum and the maximum values for each atom we detect – additionally, a histogram of values can be built, and further statistical data can be collected if needed.

Atom unification makes sure all the atoms, manually added metadata enrichment, and statistical information look the same in the unified repository. The unification of the metadata from all the different dataset types (for example, NetCDF or HDF5) ensures the execution engine is agnostic to the dataset type. In our example, TEMP, depth, longitude, and latitude would be sent to the appropriate repository tables that contain variables and dimensions. The final step, DB enrichment, sends the unified atoms to the shared system's repository.
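The atom-extraction and statistics steps can be sketched for a NetCDF file using the netCDF4 and NumPy Python packages, as below; the file path, the record layout, and the insert_into_repository call are illustrative assumptions, not DSDQuery's actual schema or API.

# Minimal sketch of per-file atom extraction with simple statistics.
# Assumes numeric, non-empty variables.
import numpy as np
from netCDF4 import Dataset

def extract_atoms(path):
    atoms = []
    with Dataset(path) as nc:
        for name, var in nc.variables.items():
            data = var[:]
            atoms.append({
                "variable": name,
                "dimensions": list(var.dimensions),
                "min": float(np.nanmin(data)),   # per-atom statistics
                "max": float(np.nanmax(data)),
            })
    return atoms

# for record in extract_atoms("/data/climate/1.nc"):
#     insert_into_repository(record)   # hypothetical DB-enrichment call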

2.3.2 Query Analysis

In Algorithm 1 we describe how we analyze a query. This algorithm generates a map from each variable to the multiple single-file queries that need to be run.

The query performance is dependent on the query execution engine. As mentioned before, the query execution engine comes from the precursor system SDQuery, and is further described in an earlier publication [114]. The query shown in Figure 2.2 (b) is used to demonstrate the algorithm.

The input to the algorithm is an SQL query. Each query, in addition to stating the user's intended results, is a dataset definition corresponding to the output generated at runtime. The output dataset definition is based on the given variables and dimensions, and on the current files' content – if there are three files that contain the queried variable with the dimensions mentioned in the query, the output variable will contain all the dimensions that intersect across these files, even if some of the dimensions were not explicitly queried.

Algorithm 1 Query Analysis
 1: function QUERYANALYZE(query)
 2:     Maps ← ∅
 3:     pt ← ParseTree(query)
 4:     for each v ∈ variables in pt.FROM do
 5:         allq ← ∅
 6:         s ← v.dimensions (in all query sections)
 7:         mdq ← build query for v with all s
 8:         DS ← executeQuery(Repository, mdq)
 9:         cd ← FindCommonDimensions(v, DS)
10:         for each ds ∈ DS do
11:             stat ← queryDataSourceStats(v, ds)
12:             q ← regenerateQuery(query, cd, ds, stat)
13:             allq ← allq ∪ q
14:         end for
15:         Maps ← Maps ∪ (v, allq)
16:     end for
17:     return Maps
18: end function

When a query is submitted, we parse it, build a parsed query tree (line 3), and determine which variables are required for the query execution. These are iterated on (line 4). In the example query the only variable is TEMP. Afterwards, in line 6, for each variable we extract all the dimensions that are mentioned in the original input query. In the case a user did not explicitly relate a dimension to a variable, for example, the user wrote:

... FROM T WHERE A = ...

instead of:

... FROM T WHERE T.A = ...

the system deduces which variable the dimension should be related to, by using the metadata in the repository, the number of variables within the query, and/or query analysis. In our running example, the created map would be (TEMP, [depth, latitude]).

In line 7, we build a repository query for retrieving the datasets that contain the atoms mentioned in the original query. The query is executed in line 8 and all the relevant dataset names and their locations are returned. For the example query mentioned above, the repository query asks for the datasets that have a variable named TEMP with both of the dimensions depth and latitude. Next, in line 9, we find the variable dimensions that intersect across all the detected datasets using the repository. In our example, the output would be [depth, longitude, latitude]: although longitude is not mentioned in the query, it appears in all the datasets that have the variable TEMP and the dimensions latitude and depth.

For each dataset and variable combination, in line 12 we generate an SDQuery query that is based on a subset of the given original query. The new query selects only the portion of the original query that is relevant for the specific variable and dataset, based on the dimensions, aggregations, and statistics that are relevant. The common dimensions are used here for building SDQuery queries that aggregate the necessary data before the merge process begins, which improves performance significantly.
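A minimal sketch of the common-dimension computation (line 9 of Algorithm 1) and of the per-file decision of whether an aggregation is needed is given below; the metadata layout and function names are hypothetical, and the values mirror the repository example discussed next.

# Minimal sketch of FindCommonDimensions-style logic.
def find_common_dimensions(variable, datasets_meta):
    """datasets_meta: list of {'file': str, 'dims': set} entries for `variable`."""
    dims = [set(m["dims"]) for m in datasets_meta]
    return set.intersection(*dims) if dims else set()

def needs_aggregation(file_dims, common_dims):
    # A file with dimensions beyond the common ones must be aggregated
    # (e.g., AVG with GROUP BY over the common dimensions) before the union.
    return bool(set(file_dims) - set(common_dims))

meta = [
    {"file": "1.nc", "dims": {"depth", "longitude", "latitude"}},
    {"file": "2.nc", "dims": {"depth", "longitude", "latitude", "frequency"}},
]
common = find_common_dimensions("TEMP", meta)  # {'depth', 'longitude', 'latitude'}
print([(m["file"], needs_aggregation(m["dims"], common)) for m in meta])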

After the execution of the engine, for each variable we insert the variable that was processed and all the queries that were produced for it into the Map (implemented as such for efficiency) that is returned to the function caller. This allows the function invoker to easily find and separate the queries by the variable given in the query itself.

Figure 2.3: DSDQuery Repository Structure Example

In Figure 2.3, we present an example repository that has three files, two of which have the variable TEMP, while the last one does not. The first demonstrated file, 1.nc, has three dimensions for the variable TEMP: depth, longitude, latitude. The second file has four dimensions for a variable with the same name: depth, longitude, latitude, frequency. For the example query, given the repository described above, the common dimensions are: depth, longitude, latitude. The dimension longitude was added since both files in the repository contain it, while the dimension frequency was not, because it is not contained in all the "TEMP" variables that were retrieved from the metadata repository. Assuming the statistical analysis concluded that all the depths in the first file are lower than 15, the optimized SDQuery request for that file would be:

GET 1.nc(SQL:SELECT TEMP FROM TEMP WHERE latitude < 0)

while the SDQuery request for the second file would be:

GET 2.nc(SELECT AVG(TEMP) FROM TEMP WHERE depth ≥ 15 and

latitude < 0 GROUP BY depth, latitude, longitude).

Notice that an aggregation is required in this query since the dimension frequency is not among the common dimensions, and therefore the variable in this file cannot otherwise be unionized with the variable in the previous file.

The third file is not requested at all since it does not contain the variable TEMP.

A query such as SELECT TEMP FROM TEMP WHERE FREQUENCY = 1 would query only the file 2.nc, while a query such as SELECT TEMP FROM TEMP WHERE NO = 1 would produce an empty dataset. The content of the repository determines the content of the output, and therefore the same query on different repositories may produce not only different content, but also different data structures.

In the process of query generation, as demonstrated in the example above, we use the collected statistics to estimate whether, and how many, results are expected. If no results are expected, a query is not generated at all. If results are expected, the query can be optimized based on the collected statistics. In some cases the data in the repository suffices and a query does not need to execute on the actual dataset at all, since the statistics evaluation phase would return the requested data.

Figure 2.4: DSDQuery System Architecture: Bridger and DSDQuery DSI

2.3.3 System Implementation

In Figure 2.4 we show our system architecture. The system is constructed from two parts: Bridger and DSDQuery. Bridger's responsibility is to detect and extract metadata and add it to the system's repository. DSDQuery's responsibilities are receiving user requests, executing each user's query, and sending the results to the user using the embedded transportation service.

Bridger

Bridger maintains the system's repository. Since array databases are stored in a file system, changes can occur only when a file is created, modified, or deleted. For detecting when a file is added, we add inotify listeners [80] to each directory the system is configured on. The system can be configured to automatically, and recursively, add listeners to subdirectories of the configured directories, either on initialization or as these subdirectories are created. When a file system event occurs, the system determines whether the detected file is relevant by checking if it is a metadata file (Subsection 2.3.3) or if there is a plugin that can be used to load the file.

Once a relevant file is detected, the system uses the file-type plugin to extract the file metadata and to extract statistical information with the file-type-specific statistical engine. All the extracted data is unified and normalized for storage in a relational SQLite [93] database.

Additional Metadata

The in-file dataset metadata is often not sufficient for semantic querying using a declarative query language, since the data is brought out of context when only variable and dimension names are used. The ability to enhance the dataset's metadata is therefore necessary. Since array datasets are physically stored in file systems, the file system hierarchy often has a meaning, and therefore multiple types of metadata enhancement are required:

• Enhancing a specific directory, and all of its subdirectories.

• Enhancing a specific dataset, and all of its variables and dimensions.

• Enhancing a specific variable / dimension within a specific dataset.

Each detected file with the extension .metadata adds metadata to the system's repository. The metadata file name, without the extension, determines which directory or dataset the additional metadata applies to. The metadata file is structured as text lines; each line provides a metadata atom to add to the repository, or instructions for the scanning process (such as which subdirectories should be added and whether it should be done recursively and online).

Since SQL supports hierarchical data structures, we added the capability to sub-select based on the metadata by using the prefix metadata.*, where * is the metadata atom name. For limiting the results of a specific array, the keyword metadata is followed by the array name. For example, to limit the variables to those provided by a specific sensor, while giving a meaningful semantic context to the query, one could use the where clause: metadata.TEMP.sensor = 'sensor name'.

DSI Implementation

DSDQuery DSI is implemented as a plugin to Globus GridFTP [5], in the same way the precursor system SDQuery [114] was. If an incoming request is not a DSDQuery one, DSDQuery DSI tunnels it to the relevant FTP engine. If the request is a DSDQuery request, detected by the string DSQL: immediately preceding a valid SQL statement, the engine extracts the SQL statement and applies the algorithm described in Section 2.3.2 to analyze the query, locate the relevant array datasets, and generate the queries that should run on each individual dataset. The output of Algorithm 1 is a map from each variable that should appear in the resulting array dataset to a set of SDQuery queries whose union, and possibly an aggregation over them, would fill that variable.

For example, the following FTP GET call:

GET /(DSQL:SELECT TEMP FROM POP

WHERE TEMP GE 8.1 AND TEMP LT 10.2) would obtain from the process above a set of SDQuery queries similar to:

GET /home/user/POP.nc(SQL:SELECT TEMP FROM POP WHERE TEMP GE

8.1 AND TEMP LT 10.2)

The generated queries differ in the location and name of the dataset that should be used, and in some cases in their aggregations and subsetting, as described in Section 2.3. Next, for each expected variable and dataset we call the SDQuery engine with the given query. We modified the SDQuery engine to return the variable metadata and, for each returned value, its coordinates – data that was missing in the original SDQuery engine. As mentioned before, the performance of the query execution is dependent on the SDQuery engine – we did not modify its query execution itself.

The results of each SDQuery query are unionized and aggregated into an intermediate result set file. This file holds, for each output variable, its metadata, its source files, and the result set in a coordinate format: ((x,y,z,value)).

A result file for each of the example queries in Section 2.2.2 would look like:

TEMP, float,

3, depth, float, longitude, float, latitude, float,

DS1, 1, (x,y,z,val),

DS2, 2, (x1,y1,z1,val1), (x2,y2,z2,val2).

The first line gives the variable name and type, followed by the number of dimensions and then the dimension names with their matching data types. Afterwards, for each dataset that returned results, we have the dataset name, followed by the number of results and the results themselves. As shown, the results are kept in the coordinate format within the intermediate file. There is one file for each variable; this file includes the data from all the datasets, and each file can be built in parallel. After the intermediate files are generated, we use the requested output format to decide whether and how to aggregate the data.
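The intermediate file layout just described can be sketched with a small writer; the exact on-disk grammar DSDQuery uses is simplified here, and the function name and values are illustrative.

# Illustrative sketch of writing one intermediate result file in coordinate format.
def write_intermediate(path, var_name, var_type, dims, per_dataset_results):
    """dims: list of (dim_name, dim_type); per_dataset_results:
    dict mapping dataset name -> list of (x, y, z, value) tuples."""
    with open(path, "w") as f:
        f.write(f"{var_name}, {var_type},\n")
        f.write(f"{len(dims)}, " + ", ".join(f"{n}, {t}" for n, t in dims) + ",\n")
        for ds_name, rows in per_dataset_results.items():
            coords = ", ".join(str(r) for r in rows)
            f.write(f"{ds_name}, {len(rows)}, {coords}\n")

write_intermediate(
    "TEMP.intermediate", "TEMP", "float",
    [("depth", "float"), ("longitude", "float"), ("latitude", "float")],
    {"DS1": [(0, 1, 2, 9.5)], "DS2": [(0, 0, 1, 8.3), (1, 2, 0, 10.1)]},
)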

While converting and reformatting the intermediate file to the output format, we perform additional aggregations if required. The aggregations over dimensions are performed at the SDQuery level, which creates a unified format for each variable, yet additional aggregations might be needed in the case of joins or SET operators. As discussed before, the default aggregation function, average, can be changed in the SQL statement. The resulting dataset is sent to the user using the

GridFTP standard parallel file sending mechanism.

2.4 Evaluation

We evaluate DSDQuery DSI's overall performance. Our emphasis is on testing the ability of the implementation to "scale". There are several dimensions along which the scalability of the system can be measured. We first consider query execution with an increasing number of files, with queries designed in such a way that only one file contains the queried variable. The goal here is to see how efficient the processing of metadata is.

In the second set of experiments, we consider query execution with an increasing number of files, but unlike the previous experiments, all of the files contain the queried variable. Within this set of experiments, the output variable sizes vary. Further, we consider both dimension-based and value-based subsetting in the queries.

The engine is implemented in such a way that although we experiment on queries with one variable that is distributed over multiple files, the experiments cover queries over multiple variables. The engine that merges the results currently works linearly for the data collection, and therefore the relevant FOR loops behave the same for a variable that is spread over two files and for two different queried variables (even if these reside in the same file).

We use two simple queries – a query that selects data by the variable value; and a query that selects data using two out of the three dimensions. The first query is SELECT TEMP FROM TEMP WHERE TEMP ≤ X – we change the value of X to vary the selectivity. The second query is SELECT TEMP FROM TEMP WHERE depth ≤ 10.0 and longitude > 250.0. Here, the values of the depth and the longitude change based on the datasets, to maintain a small selectivity of 0.5%.

A set of NetCDF files, each containing 500 MB of data of which the queried variable is 100 MB, was used for our experiments.

All experiments were performed on servers with an 8-core, 2.53 GHz Intel(R) Xeon(R) processor and 12 GB of memory. We used The Ohio State University Computer Science and Engineering RI cluster for these experiments. Each experiment was executed three times, after a system warm-up run, and for each experiment the reported results are the averages among these runs. Because we did not witness high standard deviation values in any of the experiments, we do not report them. The reported results are measured at the client end, which ran on the same machine as the server, and include sending the request, the processing by the server process, and the transportation of the results back to the client.

Figure 2.5: DSDQuery's Metadata Processing Overhead

2.4.1 Metadata Processing Overhead

In this experiment we increase the number of files in the system; the added files do not contain the queried variable. The resultset size does not change, and is 8.4 KB for both queries – a selectivity of 0.005%. The results are shown in Figure 2.5, where the X-axis is logarithmic and the Y-axis is linear. The increase in execution time is modest, and in fact almost non-existent until the number of files becomes quite large. This shows that our system is able to process additional metadata efficiently.

2.4.2 Increasing Number of Files and Distributed Variable

In this experiment we increase the number of files that contain the queried variable (this is equivalent to increasing the number of variables queried, as explained before). The resulting file size increases linearly with the number of queried datasets. The selectivity of the first query, which queries by value, is 20%. Although not all the content of the variable is returned, all the dimensions are created, forcing the output variable size to be the same as that of the queried variable before subsetting. For the second query, which queries by dimensions, the selectivity value is very low, i.e., 0.1%. The output size of the first query, including its dimensions, is 100 MB per file. The second query outputs 557 KB for each file.

In Figure 2.6, we show the effect of increasing the number of files on the performance. As a reference, the size of the result file is also shown in Figure 2.7. Both axes are logarithmic in both figures. We can see that the system behaves as expected – the processing time is linear in the number of files touched (when a similar amount of data is extracted from each file). We can also see that the duration of query processing is unrelated to the query selectivity – both queries take nearly the same time, yet the output sizes differ by orders of magnitude: the duration of the operations performed by our engine, which is dependent on the data size, is negligible compared to the duration of the NetCDF interface calls, which is nearly the same in the two different scenarios.


Figure 2.6: DSDQuery Performance for Querying Increasing Number of Files


Figure 2.7: DSDQuery’s Output Resultset Size, as Data is Queried from Increasing Number of Files


Figure 2.8: DSDQuery Response to Querying of Changing Selectivities – Dimension-Based Queries

2.4.3 Impact of Selectivity

We continued the previous experiment (varying the number of files that contain the variable) by further modifying the selectivity of the queries. We executed experiments with varying selectivities for dimension-based queries and for value-based queries.

Using a series of queries where dimensional ranges are varied to control the selectivity, we measure the output file size and the query execution performance. We show in Figure 2.8 the performance for each selectivity level by number of files, in a logarithmically scaled graph. The output sizes are linear in both selectivity and the number of files. The query duration increases linearly within each selectivity, and the difference between the selectivities is linear as well: the query duration for 0.1% is about 10 times shorter than for 1%, and so on.

When a query's subsetting is by value, the output files can have different dimensions. Next, we study the effect of the selectivity on the file size and on the duration of the query execution. As mentioned before, the variable size in each of our files is 100 MB. When selecting 1%, about 1 MB of data can be expected – however, it turns out to be more complicated, because of the coordinates that are added as a result of the dimension remap. For example, when selecting two values that are at the coordinates (0, 0) and (1, 1), the coordinates (0, 1) and (1, 0) have to be created as well (the value for these will be NULL).

  S \ F          1      2      4      8      16
  0.1% (0.02%)   20K    701K   12M    83M    600M
  1%   (0.2%)    17M    123M   373M   797M   1.6G
  10%  (2%)      100M   200M   400M   800M   1.6G

Table 2.2: DSDQuery Resultset Size Based on Selectivity (S) and Number of Files (F) – Value-Based Query

In Table 2.2 we show the output dataset sizes based on the number of datasets being queried and the selectivity. We state the selectivity of the values from the variable, while in parentheses we give the selectivity with respect to the whole dataset file. For a selectivity of 10%, although not all the values are selected, a full dimensional map is created, which leads to a file size of 100 MB – this behavior is consistent with the statistical analysis presented in MR-Cube [89], when adjusted to array data and with the dimension selection process treated as Bernoulli trials. The same happens for a selectivity of 1% when 16 files are touched. The content of each file affects how fast the file size increases: if the queried values are at different coordinates (sparse) in each file, the file size increases faster than if the values are condensed in the same coordinate area.

Figure 2.9: DSDQuery Selectivity Impact on Performance – Value-Based Queries

In Figure 2.9 we show that the growth in execution time across the different selectivities is linear. It is also noticeable that the execution time is nearly the same for all of them. Although a higher selectivity means more values need to be written to the output variable (a very expensive operation in the NetCDF library we use), the time difference is nearly negligible. Overall, this shows that the system can handle growing output sizes efficiently.

2.5 Summary

In this chapter we introduced DSDQuery, a system that allows declarative querying over distributed scientific arrays. We first formalized what a scientific array query is. Afterwards, we defined the expected behavior of querying over distributed data. We presented approaches and algorithms to address distributed array data querying, while describing our system architecture. Last, we evaluated our approaches and showed they are beneficial and efficient.

Chapter 3: Optimization of Join Queries - DistriPlan

In the previous chapter we addressed simple querying (subsetting and simple aggregations) of distributed array data using declarative languages. The introduced engine, DSDQuery, accesses all the files from the server on which it runs and processes data locally. However, scientific data can be stored across geographically dispersed machines, requiring data to be copied to a central machine for processing by systems such as DSDQuery. In addition, scientific data contains many variables, and correlating (joining) the data is essential for advancing scientific research. These raise the need to execute join queries over data that is distributed among multiple machines and repositories. Moreover, unlike simple queries, complex queries (such as join queries) require the execution to be planned carefully before it begins – an operation referred to as query optimization, conducted by a tool called an optimizer.

In this chapter we describe our framework for optimizing join-like operations over multi-dimensional array datasets that are spread across multiple sites. The optimization methodology includes enumeration algorithms for candidate plans, methods for pruning plans before they are enumerated, and a detailed cost model for selecting the best (cheapest) plan. While in this chapter we focus on the performance optimization of join queries, in the following chapter we address their execution.

3.1 Background

3.1.1 Distributed Array Data Joins

Join operations help compare data across multiple relations, and have been extremely common in the relational database world. In scientific array data analysis, joins are also essential for analyzing data and confirming hypotheses. As an example, consider a simple hypothesis such as "when wind speed increases, the temperature drops". Verifying this hypothesis using climate simulation outputs involves multiple joins across different datasets (because scientists commonly program these operations by hand, they often do not directly identify or recognize that a join was used). For detecting the change (increases/drops), each relation has to be compared with a subset of itself, i.e., we perform what is referred to as a self join. The pattern detection, i.e., relating the wind speed to the temperature value, is a regular join across both relations.

Formal Definition

The definition provided here is very similar to the one found in the relational database literature and only differs to match the domain it is imposed on. Here, we address dimensions and values instead of columnar values. In addition, aggregations are an inherent part of scientific array data querying (including joins). In real scientific settings, queries rarely avoid aggregations. Dimension reduction is conducted using aggregations – the collected data includes a higher-dimensional setting than the one used in queries, resulting in the common use of aggregations. For example, when querying the average temperature we reduce multiple dimensions, such as time and sampling depth. Therefore, our engine uses aggregations by default, and our implementation minimizes their cost by integrating aggregations into the join operator.

Formally, the operator ⋈ signifies a join – A ⋈_C^G B joins the relations A and B based on the set of conditions C, using the aggregation function G. C is a concatenation of conditions of the form A′ = B′, connected using ∧ (and) or ∨ (or), where A′ and B′ can be either a set of dimensions or the relation names themselves (the latter being referred to as joining by value). G (in the superscript) controls the aggregation function used for the join; if no aggregation function is mentioned, the default function, used when necessary, is AVG (average).

If C is not mentioned, i.e., the operation is A ⋈ B, the common dimensions of both relations are joined based on their names, while the rest of the dimensions are aggregated using an aggregation function. On the other hand, when the join explicitly states certain dimensions, an aggregation over the non-mentioned dimensions is expected. One can use the rename operator, ρ_{from,to}, which renames a dataset, variable, or dimension, to force a name match when necessary. It may also be needed to join only by values (without limiting the dimensions), an operation we mark as ⋈̄, which is similar to a Cartesian product.
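A minimal pandas sketch of an array join with the default AVG aggregation over a non-joined dimension is given below; it uses pandas rather than our engines, stores each relation in coordinate form, and the data values are illustrative.

# Minimal sketch of A ⋈_{A(d1)=B(d1)} B with AVG over the non-joined dimension d2.
import pandas as pd

# Each relation in coordinate form: one row per cell.
A = pd.DataFrame({"d1": [1, 1, 2, 2, 3, 3], "d2": [0.5, 1.0, 0.5, 1.0, 0.5, 1.0],
                  "val": [1.0, 2.0, 2.0, 3.0, 3.0, 4.0]})
B = pd.DataFrame({"d1": [2, 2, 3, 3, 4, 4], "d2": [1.0, 1.5, 1.0, 1.5, 1.0, 1.5],
                  "val": [3.0, 4.0, 4.0, 4.0, 1.0, 4.0]})

# Aggregate away the non-joined dimension d2 with AVG, then join on d1.
A_agg = A.groupby("d1", as_index=False)["val"].mean()
B_agg = B.groupby("d1", as_index=False)["val"].mean()
joined = A_agg.merge(B_agg, on="d1", suffixes=("_A", "_B"))
print(joined)   # rows only for the d1 values common to A and B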

In Figure 3.1 we translate the relational algebra query A ⋈_{A(d1)=B(d1)} B to SQL [84]. In Figure 3.2 we show a walkthrough of this query's execution. The common values along the joined dimension, d1, are 2 and 3. Since the other dimension in this array, d2, does not appear in the join criteria, aggregation has been used for it (as emphasized in the declarative query). An additional dimension has been added to record the data source, referred to in the figure as d3; in the relational model this dimension is translated to two columns. Another option, instead of a dimension, is to use a structure holding both values as the variable data type.

Figure 3.1: Query for Join Walkthrough Example

In Figure 3.3 we show another SQL example for a subset of the query stated at

the beginning of this section, demonstrating self joins. The SQL shown is equiv-

alent to the request: “Return the temperature difference for each day from its previous day”. In the query, we first rename both relations from TEMP to A and B – this is

done since this query is a self-join and therefore we need to be able to address both

relations separately. We can look at relation A as if it represents a specific day's temperatures, while B represents the previous day's temperatures. Therefore, the output

is, for each distinct latitude and longitude, the value of today’s temperature minus

yesterday’s temperature, as requested.

Figure 3.2: A Walkthrough of the Join Process

Figure 3.3: Example Query: Simple Join
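Since Figure 3.3 itself is not reproduced here, the following is a plausible pandas rendering of the self-join just described; the column names, the day alignment, and the data are illustrative assumptions rather than the dissertation's SQL.

# Plausible sketch: day-over-day temperature difference per (latitude, longitude).
import pandas as pd

TEMP = pd.DataFrame({
    "day":       [1, 1, 2, 2, 3, 3],
    "latitude":  [10.0, 20.0, 10.0, 20.0, 10.0, 20.0],
    "longitude": [50.0, 60.0, 50.0, 60.0, 50.0, 60.0],
    "temp":      [14.0, 18.0, 15.5, 17.0, 13.0, 19.0],
})

A = TEMP.rename(columns={"temp": "temp_today"})             # rename: TEMP -> A
B = TEMP.rename(columns={"temp": "temp_yesterday"}).copy()  # rename: TEMP -> B
B["day"] = B["day"] + 1                                     # align B to the next day

diff = A.merge(B, on=["day", "latitude", "longitude"])      # the self join
diff["delta"] = diff["temp_today"] - diff["temp_yesterday"]
print(diff[["day", "latitude", "longitude", "delta"]])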

3.1.2 Existing Approaches

Today, queries over distributed data are executed in one of several ways. First, scientists often simply copy all relevant data from the repositories in which they are stored to their local environment and process it using existing tools – a solution very unlikely to remain feasible as data sizes increase. Writing a workflow engine can be another option, but details of data movement, partitioning of the work, and its placement are then handled by the developer. Moreover, individual operations in a workflow are still written in low-level languages.

Other approaches use ideas from the database (DB) domain to optimize overall query performance. Query optimization has been thoroughly researched in the domain of relational data [34]; however, these systems work on data at a central location or within a cluster (with all nodes connected by a Local Area Network – LAN). For tightly coupled settings, distributed query plans use Rule Based Optimizers (RBOs) [9, 69, 92, 138]. RBOs build execution plans from parsed queries based on a pre-defined set of rules or heuristics. Heuristics are used to allow taking optimization decisions for each node of the parsed query tree separately, i.e., a local decision. An example of such a heuristic is: "execute each query in the most distributed manner possible until data unionization or aggregation is necessary". Some of the DB query optimization approaches have been extended to geographically distributed data [9, 27, 44, 47, 114, 124, 125]. These extensions often use a heuristic to transfer all data to a central site and process the query there. None of these extensions considers data shuffle as part of the query optimization process, assuming the most-distributed heuristic cannot be improved.

Data processing has increasingly moved towards using implementations of the

Map-Reduce concept [40]. Most systems in this space are limited to processing within a single cluster, where network latency is predictable and low. Note that joins can be supported over such systems using a high-level language, like in

Hive [115]. In such cases, join optimization [2, 123] is based on a set of rules or heuristics that is not sufficient to fully optimize geo-distributed join queries, as will be shown later. MapReduce systems have also been extended to geographically distributed data [62], but we are not aware of any work optimizing the join operation in such settings, which requires careful attention. Stratosphere [4] offers a Cost Based Optimizer (CBO) for join queries. The CBO there optimizes the operator order, and not the distribution of work, which is handled with the simple heuristic mentioned in the previous paragraph – the more the parallelism, the faster a query should execute.

3.1.3 Execution Plans

An execution plan, or simply a plan, is a tree representation that contains processing instructions to an execution engine for producing the correct query results. While the SQL handles the "what", the plan targets the "how". Plans contain a hierarchical ordering of operators, each of which has at most two child nodes. When the plan tree is followed from the bottom to its top, the intended query results are produced. In Figure 3.4, we show two possible trees for the query A ⋈ B ⋈ C (we omitted the (A ⋈ C) ⋈ B option for brevity). Each plan shows a different ordering of operations that produces the intended results.


Figure 3.4: Non-Distributed Execution Plan Samples

Optimizing (or choosing) query plans, and especially data join optimization,

has been extensively studied in the database community. The distinct part of our

work is optimizing data joins when the data and execution are geo-distributed –

for example, a situation where cluster one contains the temperatures in Canada

and the U.S., while cluster two contains the temperatures elsewhere.

Our distributed plan has additional information within each node, as well as additional nodes, which provide the necessary information for parallel and distributed execution of queries. In Figure 3.5 we demonstrate two different distributed execution plans for the simple query A ⋈ B, where A is an array distributed over three nodes, 1, 2, and 3, while B is an array distributed over two nodes, i.e., nodes 3 and 4. Plan 1 utilizes the most parallelism possible in this case – 3 nodes process data in parallel, and afterwards all data are sent to node 1 where they are accumulated and unionized (unionize means combining multiple datasets into one).


Figure 3.5: Distributed Execution Plan Examples

Plan 2 demonstrates the other extreme, in which all the data is copied to one node, which then processes it. Since only one node processed the data, there is no need to unionize data. These are just the two extreme plans, among many possible options.

More broadly, distributed execution plans need to represent parallel execution correctly, including the representation of data communication among the nodes, unionization, and synchronization. There can be many options for such distribution. As one extreme, all the datasets' content is collected at a central node and joined there (as presented in Figure 3.5, Plan 2). After the join is processed, the results are tunneled either to the client, if this is the final plan step, or pipelined to the next step defined by the query execution plan. Another extreme can be as follows: since a join is performed between two relations, we choose one relation that we refer to as the internal relation (either relation can be the internal one; we choose it based on the cost model presented later). The internal relation is kept stationary, while the other (external) relation is sent across the network to all the nodes that contain the internal relation. Subsequently, each receiving node executes the join(s) and the results are sent forward according to the execution plan. We note that some machines may contain many datasets (a common practice in real environments), and therefore one node may communicate with another multiple times in this approach. Avoiding this behavior introduces a third option: we force each machine to transport at most one dataset of each relation it holds, by unionizing the multiple datasets each machine has before processing. The rest of the processing is done similarly to the previous method – the sample in Figure 3.5, Plan 1, demonstrates such a plan. The advantage of this approach is that it decreases the number of times data movement occurs and still allows parallel execution of joins.

3.2 Plan Selection Algorithms

3.2.1 Query Execution Plans

Formally, an execution plan for a given query is a tree representation of the query, where each node has at most two children, left and right. Each tree node represents either a source (relation, array, or dimension) or an operator. Each node outputs a data stream, and each operator node receives up to two incoming data streams. In the case a node represents an operator, the operator applies to the node's inputs.

In extending execution plans to distributed ones, additional operators are introduced (an example plan has already been introduced in Figure 3.5). We introduce three new node types: Sync, SendData, and Union. Sync synchronizes the execution of its dependent nodes by waiting for all related sync nodes to start execution. All machines executing a sync node will start its execution at different times, but will complete it at the same time, resulting in their dependent nodes beginning execution at the same time. In many cases synchronization is implicit and partial. A SendData node implies that machines that execute the child nodes need to send the produced results to a set of nodes. Thus, this operator achieves data distribution from a set of machines that produced a dataset to a (possibly distinct) set of machines that will later execute operations on that data. Union nodes are used to accumulate distributed data received from their children.

Each node has a tag that holds information needed for the operator execution.

The tag includes what type of node it is, what subsetting conditions it executes

(if applicable), statistical information, and operator specific data. For example, a

SendData node’s tag contains which data is sent, where from, and to which node.

3.2.2 Plan Distribution Algorithm

Our goal is to create an efficient Cost Based Optimizer (CBO) to find the optimal distribution of a query. A CBO is an alternative to a Rule (or Heuristic) Based Optimizer (RBO). RBOs are unlikely to build optimal plans in this case due to the complexity of our queries and the diverse set of environments where they may be executed. Note that in a CBO, the optimizer produces a cost value that aligns with execution time, and thus helps to choose the lowest-cost plan. CBOs do not aim to predict the actual execution time.

For implementing a CBO we first need to span, or enumerate, different execution plans and subsequently evaluate these plans' costs. We enumerate the plans using a two-step process: (1) choosing between different orderings of operators, leading to a set of non-distributed plans – these are built using a simple RBO since we span all options here; (2) enumerating all possible distributed plans corresponding to each non-distributed plan, each of which involves different choices for where the data is processed and the required data movements. An advantage of this two-step method is that producing all non-distributed plans has been well researched previously [34]. Therefore, we focus here on the second step.

3.2.3 Pruning Search Space

The key challenge we face is that the number of distribution options for a given (non-distributed) plan can be extremely large. When all options of data sending are considered, a blowup of options occurs. For example, if n nodes are involved, n^n options of data movement exist. However, not all options have different costs or are even sensible candidates. As a simple example, one server can send its data to a neighboring node, accept data from the other node, or even do both (both nodes swap their data in this last setting). Given this, we must be able to prune the search space efficiently.

Our first observation is that many of these options are essentially repeats of each other. For example, assuming homogeneity of nodes for simplicity, and given data distributed on 3 nodes, processing on 2 nodes should cost the same irrespective of which two nodes are used. Although it is not always guaranteed that the nodes' computing power and data sizes are homogeneous, within scientific data environments those assumptions often hold. We form the following two rules:

Rule 1: A node can receive data only if it does not send any data. This rule prevents two nodes from swapping data with each other. The following is an analysis of the reduction in spanning options. Out of the n nodes that have the data, we pick i nodes that will receive data. There are n − i nodes left, which send their data to all options of the chosen i nodes. The summation over all of these reduces the number of options that need to be spanned.
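To make the counting argument concrete – this is our restatement of the sentence above, not a formula appearing elsewhere in the text – if every non-receiving node sends its data to exactly one of the i chosen receivers, the number of candidate send patterns under Rule 1 is

\sum_{i=1}^{n} \binom{n}{i} \, i^{\,n-i},

which is considerably smaller than the n^n unrestricted options.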

Intuitively, a plan in which one node processes another node's data, while the other node processes the first node's data, is more expensive than a plan in which the nodes do not swap the data. Consider the cost of processing two datasets when node 1 and node 2 both process their own data, assuming the data sizes are the same, versus when node 1 processes node 2's data and vice versa. The processing cost is the same, yet the data transportation, which does not exist in the first setting, exists in the second. The full scenario is more complex, yet this consideration helps clarify why swapping data is more expensive than keeping data stationary. Based on the above discussion, the optimal plan cannot be pruned by Rule 1, as shown in Theorem 1.

Theorem 1. Rule 1 does not prune the optimal plan.

Proof. We prove this claim by contradiction. Assume all machines have the same computing power (as justified before) and that the optimal plan contains a node receiving data while sending its own. Two options arise: (1) the number of machines processing the query is the same after the data transportation, or (2) the number of machines has decreased. If the number of machines is the same, data transportation was added but the same computational power is used; a similar plan without the data swap must be cheaper. If the number of machines processing the query has decreased, we have both reduced the available computational power and added data transportation overhead; again, a similar plan without the data transportation must be cheaper. In both cases, we found a cheaper plan that does not include a node sending its data while receiving another node's data; therefore the original plan was not optimal, and a plan with a node that sends its own data while receiving another node's data can safely be pruned.

Rule 2: Isomorphism Removal. Rule 1 prunes options which are obviously more expensive than other plans, yet many of the remaining plans are still isomorphic to one another. Clearly, there is no particular advantage in choosing one plan over another if they are isomorphic.

We avoid the generation of isomorphic plans using the following approach. First, we assume that nodes are ordered and ranked by a unique identification number. With that, we require that a higher-ranked node receives data from at least as many nodes as its lower-ranked neighbor does. The nodes are ranked so that the algorithm provides consistent results and to enable addressing non-homogeneous environments in the future; the ranking can be dynamic, and can be used to address additional challenges (e.g., non-homogeneous data sizes, by assigning larger chunks of data to stronger nodes). For example, if the first node, assumed to have the highest rank, receives data from 4 nodes (including itself), the second node can receive data from at most 4 nodes, and so on. The optimal plan is again not pruned since all isomorphic plans evaluate to the same cost – the cost of isomorphic plans is the same since machines are assumed to be homogeneous (have the same processing power), are connected to the same network, the size of the processed data is the same, and they perform the same calculation.

An analysis of Rule 2 shows it prevents isomorphism. Assume we have n nodes and d datasets, where n ≤ d. Generically, each of the n nodes processes at least one dataset and all the nodes together process the d datasets. Represent this information in an array where the index of the array represents the node id and the content is the number of datasets processed on this node, for example [1, 3, 2, 5, 4] (n = 5, d = 15). Consider the setting after sorting the array ([5, 4, 3, 2, 1] for the example above) – the sorted setting satisfies Rule 2. Therefore, for any distribution setting chosen, there is a setting isomorphic to it that satisfies Rule 2. The approach that issues such plans is referred to as the reverse waterfall, and we discuss it in more detail later.

For example, assume array A is distributed over nodes ranked 1, 2, and 3. In Figure 3.6 we demonstrate all possible distribution options following the given rule. Notice that any other option, given the nodes' ranking, would either not make sense in a non-homogeneous setting, or would repeat an already existing option when nodes are homogeneous. E.g., consider the semi-parallel option of sending the data from node 2 to node 1 instead of from node 3, while node 3 keeps its data. This would either imply that using a weaker node to process the same amount of data is preferable, or be equivalent to the semi-parallel option presented.

Algorithm2 enumerates all distributed plans for a given non-distributed plan node. The input to this algorithm is a list of ranked nodes (ordered in an array),

Figure 3.6: Distribution Options for an Array Distributed over 3 Nodes (Using DistriPlan Rule 2). The figure shows the fully-parallel, semi-parallel, and non-distributed execution options for array A on nodes 1, 2, and 3.

The goal of the algorithm is to return all distribution options for populating each distribution node's tag. The algorithm iteratively builds all the options for sending data, by enumerating all options for processing nodes. At a high level, the algorithm begins by enumerating all options for the number of nodes processing data (from 1 to the number of nodes). For each of these numbers, the algorithm iteratively builds all the options for distributing the data, going from the most distributed approach to the least. More specifically, i represents the number of nodes processing data, i.e., by Rule 1, maintaining their data locally. The algorithm spans all options for i between 1 and the total number of nodes. In lines 5-7, we build a base option for the current i, which is the option where each "free node" (a node that is not processing its own data) sends its data to the first node. In line 8, index j represents the number of nodes that receive data from other nodes (for the base option, j is 1). In line 10 we define k to be the number of nodes not sending data to the first node. In line 15, all options spanned by Algorithm 3 are added (based on the sending/receiving pattern given by the resulting array) – we send data to higher ranked nodes from the lower ranked ones, ensuring an optimal plan will not be pruned.

This algorithm provides all distribution options for a specific SendData node.

Each option is used to fill a SendData node’s tag.

In Algorithm 3 we use a reverse waterfall method to create all sending options.

This algorithm outputs an array of numbers, each of which represents how many free nodes need to send data to the matching node. In general, we first determine the most distributed setting by initializing an array to contain the largest number of nodes each receiving node may receive data from (line 7). Then, in lines 11-30, we bubble up nodes by following Rule 2 – the number of nodes sending data to the lower ranked nodes is systematically decreased while that of their higher ranked neighbors is increased.

For example, if there are 4 nodes, and 2 nodes are set to receive data, the reverse waterfall will create the following arrays: [1, 1, 0, 0] and [2, 0, 0, 0], marking that for the first array both nodes receive data from one node each, while for the second array, one of the nodes sending data to the second node has been bubbled up, resulting in the first node receiving data from both free nodes. We call this method the reverse waterfall since one can view the numbers as flowing upward: whenever a number to the right decreases, the number to its left increases. Notice each array represents a unique option, since we always send data from the "weakest" node to the "strongest" one; an array such as [1, 1, 0, 0] means the lowest ranked node, 4, sends its data to the strongest one, node 1, while node 3 sends its data to node 2.
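To make the reverse-waterfall enumeration concrete, the following is a minimal C++ sketch (not the optimizer's actual code) that lists, for a given number of free nodes and receiving nodes, all non-increasing receive-count arrays such as [2, 0] and [1, 1] from the example above; here the array covers only the receiving nodes (the trailing zeros of [1, 1, 0, 0] are omitted), and names such as enumerateReceiveCounts are illustrative only.

#include <algorithm>
#include <iostream>
#include <vector>

// Recursively enumerate non-increasing receive-count arrays:
// 'remaining' free nodes still to assign, at most 'maxPerSlot' to the
// current slot (Rule 2: a node never receives from more nodes than a
// higher-ranked node does).
void enumerateReceiveCounts(int slot, int remaining, int maxPerSlot,
                            std::vector<int>& counts,
                            std::vector<std::vector<int>>& out) {
    if (slot == (int)counts.size()) {
        if (remaining == 0) out.push_back(counts);
        return;
    }
    for (int c = std::min(remaining, maxPerSlot); c >= 0; --c) {
        counts[slot] = c;
        enumerateReceiveCounts(slot + 1, remaining - c, c, counts, out);
    }
    counts[slot] = 0;
}

int main() {
    int freeNodes = 2, receivingNodes = 2;   // the 4-node example: 2 receivers
    std::vector<int> counts(receivingNodes, 0);
    std::vector<std::vector<int>> options;
    enumerateReceiveCounts(0, freeNodes, freeNodes, counts, options);
    for (const auto& o : options) {          // prints [2,0] and [1,1]
        for (int c : o) std::cout << c << ' ';
        std::cout << '\n';
    }
}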

Algorithm 2 DistriPlan – Build Plans Without Repetitions
 1: function DataSendWithoutRepetitions(nodeList)
 2:     options ← ∅
 3:     n ← number of nodes (receiving data)
 4:     for i ← 1..n do                              ▷ i – number of nodes not sending data
 5:         ∀j: baseOption.from[j] = j               ▷ Initialize senders
 6:         ∀j ≤ i: baseOption.to[j] = j             ▷ Send to itself
 7:         ∀j > i: baseOption.to[j] = 1             ▷ All free nodes send to the first node
 8:         for j ← 1..min(i, n−i) do
 9:             currOption ← duplicate(baseOption)
10:             for k ← 1..(i−1) × ((n−i)/i) do
11:                 hasToGet ← min(k, i−1)
12:                 ∀t = 0..hasToGet−1: currOption.to[i+t] = t+1
13:                 for l ← 1..(k−hasToGet) do
14:                     ar ← BuildArraysOfOptions(n, j, l)
15:                     options ∪= span options based on ar
16:                 end for
17:             end for
18:         end for
19:     end for
20:     return options
21: end function

3.3 Cost Model

Cost models for facilitating query optimization in a non-distributed environment are well researched [17, 34, 50, 61]. Enabling a Cost Based Optimizer (CBO) for distributed settings involves a number of additional challenges, particularly capturing network latency and bandwidth. As mentioned before, the cost is calculated for the query, including all its terms and operators, with the goal that when comparing multiple plans, the plan with the lowest cost will also execute the fastest.

Algorithm 3 DistriPlan – Build arrays to spread data holding Rule 2
 1: function BuildArraysOfOptions(TotalNodes, ReceivingNodes, NodesToAssign)
 2:     ▷ How many nodes will be spread
 3:     for i ← 1..NodesToAssign do
 4:         actualRcvNodes ← min(i, ReceivingNodes)
 5:         ▷ How many nodes receive data
 6:         for j ← 1..actualRcvNodes do
 7:             ar ← mostDistributedOption
 8:             arRet ∪= ar
 9:             lastPopulated ← LargestNonZeroIndex(ar)
10:             ar ← duplicate(ar)
11:             while lastPopulated != 0 do
12:                 while ar[lastPopulated] != 0 do
13:                     t ← lastPopulated − 1
14:                     ar[t+1] ← ar[t+1] − 1
15:                     ar[t] ← ar[t] + 1
16:                     while ar[t] > ar[t−1] do
17:                         t ← t−1
18:                         if t == 0 then
19:                             break
20:                         end if
21:                         ar[t+1] ← ar[t+1] − 1
22:                         ar[t] ← ar[t] + 1
23:                     end while
24:                     if t == 0 then
25:                         break
26:                     end if
27:                     arRet ∪= ar
28:                 end while
29:                 lastPopulated ← lastPopulated − 1
30:             end while
31:         end for
32:     end for
33:     return arRet
34: end function


To motivate the need for a nuanced model, consider the following example.

Suppose there are two Value Joins (⋈̄) with a selectivity of 10%, over three relations (A, B, and C), each containing 100 tuples. We expect the first join, between A and B, to produce 1,000 tuples, and the second one to produce 10,000 tuples. Assume the data is distributed in the following way: array A is distributed on nodes 1 and 2, array B is distributed on nodes 3, 4, 5, and 6, and array C is distributed over nodes 1 and 2. The plan that involves the most parallelism would require copying array A to the nodes that contain array B, and doing the same for array C. This plan's execution involves: sending data to 4 nodes from 2 (50 tuples), [an implicit] synchronization across 4 nodes, sending data to 4 nodes from 2 (50 tuples), [an implicit] synchronization across 4 nodes, and processing of 27,500 tuples on each processing node. If the same join is executed on nodes 1 and 2, its execution would involve: sending data to 2 nodes from 4 (25 tuples), [an implicit] synchronization across 2 nodes, sending data from 2 nodes to 1 (25 tuples), [an implicit] synchronization across 2 nodes, and processing 55,000 tuples on each node. If a relatively slow wide-area network connects these nodes, the synchronizations and data movement operations can be expensive. Thus, it is quite possible that despite more computation on any given node, the second plan would execute faster.
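One plausible per-node accounting that reproduces these figures – assuming the copied arrays are broadcast in full, each join's work is counted as the product of its per-node input sizes, and intermediate results are split evenly across the processing nodes – is:

Parallel plan, per node: 100_A × 25_B = 2,500, producing 1,000 / 4 = 250 intermediate tuples; then 250 × 100_C = 25,000; total 2,500 + 25,000 = 27,500.
Two-node plan, per node: 50_A × 100_B = 5,000, producing 1,000 / 2 = 500 intermediate tuples; then 500 × 100_C = 50,000; total 5,000 + 50,000 = 55,000.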

The goal of our model is to assess these options and choose the best plan. Note that the cost model itself can be very dependent upon the communication and processing modalities used. We present the model at a high level.

Operator               Cost (C(n))
Projection             E(n) / normalizer
Filter                 E(n → left) / normalizer + Σ_{i ∈ dims} (i → length)
Distribute             Penalty^(numOfNodes−1) × E(n) / (PacketSize × normalizer) + E(n) / normalizer
Join over dimensions   (E(n → left) + E(n → right)) / normalizer
Join over values       (E(n → left) × E(n → right)) / normalizer
Union                  Σ_{k ∈ n → children} (E(k) × k → numOfNodes) / normalizer
Sync                   Penalty^(numOfNodes−1)
Source                 E(n) / normalizer

Table 3.1: DistriPlan: Costs of Operations by Operator

3.3.1 Costs:

We define C(n) as the cost of the node n and E(n) as the size of the expected resultset after the node operator is executed. Costs are evaluated recursively, starting with the root, summing all costs together. Each node has up to two children nodes, denoted by n → left and n → right, where the right child node is populated only for operators that accept two inputs, like joins. We also use n → children to denote both children together, left and right. For each operation, we list its cost in Table 3.1 and its expected resultset size in Table 3.2. The cost of an empty node, C(NULL), is defined to be 0. The selectivity, n → selectivity, is evaluated beforehand, within the RBO, by using techniques established in the literature [37, 56].

Operator               Expected Results (E(n))
Projection             E(n → left)
Filter                 E(n → left) × n → selectivity
Distribute             E(n → left)
Join over dimensions   E(n → children) × n → selectivity
Join over values       E(n → children) × n → selectivity
Union                  Σ_{k ∈ n → children} (E(k) × k → numOfNodes) / (n → numOfNodes)
Sync                   E(n → left)
Source                 n → sourceCells

Table 3.2: DistriPlan: Expected Resultset Size by Operator
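To illustrate how C(n) and E(n) are evaluated recursively, the following is a minimal C++ sketch covering a subset of the operators, following the formulas as reconstructed in Tables 3.1 and 3.2; the node layout, field names, and constants are illustrative assumptions, not the DistriPlan implementation, and the Filter dimension-scan term is omitted for brevity.

#include <cmath>

enum class Op { Source, Filter, Distribute, JoinValues, Sync };

struct PlanNode {
    Op op;
    PlanNode* left = nullptr;
    PlanNode* right = nullptr;
    double selectivity = 1.0;   // estimated by the RBO beforehand
    double sourceCells = 0.0;   // for Source nodes
    int numOfNodes = 1;         // nodes participating in the operator
};

// Illustrative constants; in practice they depend on the deployment.
const double kPenalty = 400.0, kPacketSize = 1024.0, kNormalizer = 1e3;

// E(n): expected resultset size (subset of Table 3.2).
double E(const PlanNode* n) {
    if (!n) return 0.0;
    switch (n->op) {
        case Op::Source:     return n->sourceCells;
        case Op::Filter:     return E(n->left) * n->selectivity;
        case Op::Distribute: return E(n->left);
        case Op::JoinValues: return E(n->left) * E(n->right) * n->selectivity;
        case Op::Sync:       return E(n->left);
    }
    return 0.0;
}

// C(n): cost of the subtree rooted at n (subset of Table 3.1),
// summed recursively over the node and its children.
double C(const PlanNode* n) {
    if (!n) return 0.0;
    double local = 0.0;
    switch (n->op) {
        case Op::Source:     local = E(n) / kNormalizer; break;
        case Op::Filter:     local = E(n->left) / kNormalizer; break;
        case Op::Distribute: local = std::pow(kPenalty, n->numOfNodes - 1) * E(n)
                                       / (kPacketSize * kNormalizer)
                                   + E(n) / kNormalizer; break;
        case Op::JoinValues: local = E(n->left) * E(n->right) / kNormalizer; break;
        case Op::Sync:       local = std::pow(kPenalty, n->numOfNodes - 1); break;
    }
    return local + C(n->left) + C(n->right);
}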

Penalty represents the synchronization and communication overheads. We use the generic term normalizer as a tool to normalize the returned values to a reasonable and usable scale, to ensure no overflow occurs and to decrease the side effects of arithmetic over large numbers. For simplicity, we use a fixed set of penalties in our experiments, whose value depends upon the network configuration we use. This penalty value normally averages the networking delays (mainly latency) across the clusters. In a more heterogeneous environment, where averaging the penalties does not suffice, multiple penalty values can be used.

Filtering: multiple filtering operators are supported in our framework: =, !=, <, >, <=, and >=. Each filter has a different volume or fraction of expected results, which can typically be estimated based on data statistics. In addition, there are multiple ways to scan the data and optimize the query, especially when index structures are available. Dimensional array values are unique, and in most cases are also sorted – both properties can be used for estimating selectivities and the number of data scans.

Distributing: in a distributed plan, we assess the cost of data movement among nodes. The cost of distribution has two components – the volume of data sent and the number of packets that need to be sent. Both are calculated for the evaluation.

Joining: joins can only be run between 2 relations or dimensions at a time. Therefore, when multiple join criteria are mentioned in the join clause, they are nested one after another, a common scenario for array data since these often involve multiple dimensions. Consider joining by value, or a non-contextual join, ⋈̄. The cost would simply be the product of the joined array sizes, while the expected resultset size is the same value multiplied by the join selectivity – assuming a Cartesian-like join is used. When the join is of a different type, the value should be adjusted by an equation controlling how many values are touched, evaluating the cost more accurately for that scenario. If the join is over dimensions, a variable reconstruction might be needed. In this case, the dimensions are joined first, and afterwards, the resulting dimensions are used to subset the variable. For efficiency, this process involves communicating the indices of the values that matched from the nodes executing the join to the nodes holding the variable data. We do not focus on this step within our cost calculation here.

For brevity, we assume merge sort is used for the dimensional joins, since dimensions are very rarely unsorted. The total join time is the summation of both – the set of 1-D dimensional joins and the actual data join. An exhaustive discussion of this issue can be found in the existing literature [53, 106].

3.3.2 Summarizing – Choosing a Plan

For a given query, a rule based optimizer builds all options for a non-distributed plan. Since the number of plans built is small, we distribute each of these plans separately by using the two pruning rules given before (Subsection 3.2.2 and Subsection 3.2.3). The cost of each plan is evaluated by using the cost model presented above, and the cheapest plan is selected for execution.

Since all possible plans are evaluated, and only isomorphic or repeating plans are pruned, the cheapest (optimal) plan is found. In case multiple plans have the same cost, we use a simple heuristic: choose the least distributed plan among these plans. Contrary to expectation, we found that the least distributed plan among equal-cost plans runs somewhat faster, due to slight overheads the cost model does not account for; mainly meta-data communication and data broadcast, which are assumed to be parallelized but are partially executed sequentially due to hardware limitations and the frameworks we use.
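A minimal C++ sketch of this selection step (the types and the processingNodes field are illustrative assumptions; they stand in for whatever measure of distribution the optimizer records):

#include <vector>

struct CandidatePlan {
    double cost;          // evaluated by the cost model
    int processingNodes;  // how distributed the plan is
    // ... the actual plan tree would be referenced here
};

// Pick the cheapest plan; among equal-cost plans prefer the least
// distributed one (fewer processing nodes).
const CandidatePlan* choosePlan(const std::vector<CandidatePlan>& plans) {
    const CandidatePlan* best = nullptr;
    for (const auto& p : plans) {
        if (!best || p.cost < best->cost ||
            (p.cost == best->cost && p.processingNodes < best->processingNodes)) {
            best = &p;
        }
    }
    return best;
}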

3.4 Evaluation

This section evaluates both our plan building methods and the performance of plan execution.

System: we built two systems for the experimentation – a query optimizer and an execution engine. The query optimizer was written in C++, for efficiency, while the query execution engine, which executes plans produced by the optimizer, was written in Java, which allows its integration in many available frameworks. Because of the challenges of performing repeatable experiments in wide-area settings, communication latencies were introduced programmatically. The query optimizer accepts a query as input and, based on the provided algorithms, creates an XML file with the cheapest plan. The query executor reads the execution plan from the XML file, follows the plan for executing the query (communication is implemented by using MPI [35]), and sends the query results to the user in the form of a NetCDF [102] file. The join algorithm itself and the order of execution are dynamic and provided by the optimizer. Because of the properties of scientific array data, dimensional joins are implemented by using sort-merge join, while skipping the sort since the data is already sorted. The array data join is implemented as described in Section 1.2.

Queries: all queries executed by the engine in the evaluation are value joins, ⋈̄, which differ in the number of joined arrays, selectivities, and array sizes. We focus on joins since analytical querying requires full data scans and often does not decrease the array sizes – all other operators are cheaper. For the evaluation, we use at most 5 relations (A, B, C, D, E). Query 1 (Q1) joins 3 tables by using 2 joins: (A ⋈̄ B) ⋈̄ C, Query 2 (Q2) joins 4 tables by using 3 joins: ((A ⋈̄ B) ⋈̄ C) ⋈̄ D, and last, Query 3 (Q3) joins 5 tables by using 4 joins: (((A ⋈̄ B) ⋈̄ C) ⋈̄ D) ⋈̄ E. These three simple queries represent a wide range of real world queries – by using a small selectivity we can simulate subsetting conditions as well as joinability. Since the optimizer is mostly challenged by the number of nodes on which a relation is distributed (and not the number of relations), we report results of joins of up to 5 relations. Using more relations in a query does not affect the feasibility of the approaches presented in this work; specifically, the number of trees that need to be spanned depends on the distribution, not on the number of relations.

Q   N   J1     J2     J3     J4     AVG DS
1   3   5.0%   0.5%   –      –      381 MB
2   4   1.0%   1.0%   0.1%   –      762 MB
3   5   0.1%   0.5%   0.1%   1.0%   40 MB

Table 3.3: DistriPlan Experimental Settings: join selectivities and average processed dataset sizes by query. N – number of tables involved in the join (N − 1 joins), Jn – the selectivity of the nth join, AVG DS – average dataset size on each node.


Data Sizes: the experiments were designed so that each node stores an array with a data size between 8 MB and 800 MB. These sizes were chosen based on real datasets available on the ESGF portal. Also, the reported size is the size of an individual array read from a file (files are larger as they typically host multiple arrays). The array size mentioned is the local portion of each array; when we experiment over n nodes, the total array size is the local one multiplied by n (for example, when experimenting over 800 MB arrays which are distributed over 10 nodes, the complete join is conducted over approximately 8 GB, which is the actual array size). The data we used was generated in a pattern designed to enforce the given selectivities and follows the same patterns as real climate data.

In Table 3.3 we present, for each query, the join selectivities and the average size of the processed arrays. The selectivities are between 0.1% and 5%, values which are commonly observed in Data Warehousing queries [101]. Since data is distributed over multiple nodes, the total data sizes vary for each query and for each array; therefore, we present the average dataset size.

Experimental Setting: all experiments in which we execute the queries ran on a cluster where each node has an 8-core, 2.53 GHz Intel(R) Xeon(R) processor with 12 GB of memory. All the experiments where we focus on building plans were executed on a 4-core, 3.3 GHz Intel(R) Core(TM) i5 processor with 2 GB of memory. All machines run Linux kernel version 2.6. The reported results are over 3 different consecutive runs, with no warm-up runs. Standard deviation is not reported since results were largely consistent. Unless mentioned otherwise, we set the optimizer penalty to be 400, equivalent to the network latency of a WAN – this latency is based on data shown in related work [27, 66, 122].

Since no other environment provides the ability to join distributed scientific array data, we could not compare our engine to any other system. For example, SciDB [24] does not allow running generic joins as described here and does not support geographically distributed data. No implementation of (and/or higher level layer on top of) MapReduce can directly support joins over geographically distributed data.

Throughout the experiments, we assume no multi-tenant effect exists. The domain of multi-tenancy in the context of data analytics is being researched separately [12, 90], and considering it is beyond the scope of this work.

Figure 3.7: DistriPlan Pruning: number of candidate join plans as the number of hosts increases, for a join between two variables. Each panel fixes the number of nodes for the first relation (1, 2, 4, 8, 16, or 32) and varies the number of nodes for the second relation along the X-axis; the Y-axis (log scale) shows the number of spanned trees, with and without pruning (DistriPlan vs. Not Pruned Plans).

3.4.1 Pruning of Query Plans

We initially consider a simple join operation between two arrays, where each array is split across a given number of hosts. In Figure 3.7 we show how the number of trees spanned for a join between two variables increases, showing that the two pruning rules we have introduced drastically decrease the number of plans we need to evaluate. In the figure, there are multiple graphs. In each graph, the title includes the number of nodes the first joined relation is distributed over, the X-axis represents the number of nodes the second joined relation is distributed over (between 1 and 32), and the Y-axis holds a logarithmic scale for the number of spanned trees built for each of the settings. The continuous line represents the number of trees generated by DistriPlan, while the other represents the number of trees that need to be built when no pruning technique is used. Notice the difference in the initial values between the two graphs, and the scale changes, as the number of nodes increases between the graphs. For example, when both the arrays being joined are spread among 32 machines, our algorithms generate only 4,100 plans, out of the ~1.4 × 10^71 optional ones. In practice, a dataset is likely to be split across a much smaller number of distributed repositories than 32. The maximum runtime of our optimizer was 0.46 seconds, while the average was 0.08 seconds, showing that query plans can be enumerated quickly with our method.

Next, we consider a join over three arrays, i.e., query Q2. In Table 3.4 we show how long it takes to span plans for the distribution of the non-distributed plan shown in Figure 3.4, "Plan 1". We consider a set of representative distribution options of the three datasets across different numbers of nodes (each between 1 and 32 nodes). Each row considers a specific partitioning of the three datasets and shows the number of plans traversed and the time taken.

A    B    C    Spanning time (s)   Options spanned
1    1    32   0.19                4,104
1    4    16   0.02                1,816
1    16   16   1.76                72,646
2    4    32   0.91                16,427
4    4    32   1.27                24,644
4    8    8    0.01                1,521
4    16   8    0.30                15,034
8    8    8    0.03                2,614
8    8    16   0.37                17,398
16   16   4    0.41                15,762
16   16   8    0.84                29,640

Table 3.4: DistriPlan Spanning Time: time required to span plans and number of spanned plans for a three-way join distributed among multiple nodes. The first columns present, for each relation (A, B, C), how many nodes it is distributed over.

As one sees, all execution times are under 2 seconds. Rules 1 and 2, given in Subsection 3.2.3, limit the increase in the number of distribution options (for comparison, without Rules 1 and 2, the first row would have had to span 2.63 × 10^35 trees – clearly not a feasible option). In all cases we experimented with, the query performance improvement was substantial compared to the other, not optimized, plans. For example, we saved about 20 minutes of execution time compared to the median plan for the setting where the 3 involved datasets are partitioned across 8, 8, and 16 nodes respectively, while spanning the plans for this setting took only 0.37 seconds. The same plan ran 20 seconds faster compared to the plan with the most parallelism. Similar gains were seen in most experiments. Overall, we conclude the conditions provided in Section 3.2 are sufficient for handling cases where the data to be queried is spread across a modest number of repositories.

Q    No   A    B    C    D    E
1    16   4    4    8    –    –
1    20   4    8    8    –    –
1    24   8    8    8    –    –
1    32   8    16   8    –    –
2    8    2    2    2    2    –
2    12   2    2    4    4    –
2    34   12   8    6    8    –
3    20   4    4    4    4    4
3    24   7    5    3    4    5
3    26   4    6    6    4    6
3    30   7    4    4    4    4

Table 3.5: DistriPlan Experiment Setup: Distributed Join Slowdown. The distribution of relations for each query – the specific distribution of each array for the queries presented in Figure 3.8.


3.4.2 Query Execution Performance Improvement

In this experiment, we measure how effective our selection method (and the underlying cost model) is. We execute three different plans for each query: the cheapest and median cost plans generated by our CBO model, and the plan with the most parallelism. The latter is a commonly used heuristic in current systems such as Hive [20, 115] and Stratosphere [4]. For each of the queries presented above, we first build a plan using our CBO for different distributed settings. The latency used in all experiments is that of a WAN, equivalent to 400 ms.

In Figure 3.8 we report the increase in execution time of the median and most parallel plan versions, compared to the cheapest plan our optimizer selected.

Figure 3.8: DistriPlan Execution Time Slowdown (in %) for the Most Distributed and Median Plans Compared to the Cheapest Plan (log-scale Y-axis, percentage slowdown). Each query is listed by its query ID, and the detailed per-array distribution appears in Table 3.5.

              Q1-16   Q1-20   Q1-32   Q2-34    Q3-20   Q3-24   Q3-26
Cheapest      6 s     20 s    57 s    1093 s   31 s    36 s    146 s
Distributed   11 s    29 s    75 s    1191 s   34 s    39 s    161 s
Median        29 s    30 s    183 s   3304 s   36 s    41 s    152 s

Table 3.6: DistriPlan Join Queries Execution Time. The settings given in the column titles are described in Table 3.5.

Along the X-axis, we list the query and the number of nodes that held the data for the query (each array is distributed differently for each query – this information can be seen in Table 3.5). Along the Y-axis, we list the slowdown of the median plan and the most parallel plan compared to the cheapest plan. For example, the Q1-16 cheapest plan ran 83% faster than the most parallel plan. We note that in certain specific cases (depending on the selectivities of the joins, among other factors) the cheapest plan is also the one with the most parallelism. These cases are not shown.

When arrays are distributed unevenly, the optimal plans are rarely the most distributed ones. In fact, the number of processing nodes is often smaller than the number of nodes that originally hold the inner joined array in the cheapest plan, i.e., at least one of the processing nodes processes data of the same array that is copied from another node.

In Table 3.6 we show the actual execution time for some chosen settings from Table 3.5. As can be seen, in all cases the cheapest plan executes the fastest. Furthermore, in nearly all cases the most parallel plan is faster than the median plan. In addition, plans that are mostly similar have a small performance difference (such is the case for Q3-30 in Figure 3.8).

A pattern uncovered here is a decrease in the improvement for some of the more complex queries (queries executing more joins and/or using a larger number of nodes). For example, in the case of Q1-16 (4,4,8), the slowdown of the most distributed query is 83%, while for Q3-20 (4,4,4,4,4) it is ~10%. This behavior is rooted in the parallelism that the more complex plans enable to begin with – for example, a 3-way join forces sequentiality, while for a 4-way join, processing of some of the joins can be performed in parallel for certain plans.

We conclude that the CBO approach is beneficial for improving performance. In all cases observed using our cost model, the fastest query to execute is the one the CBO evaluated to be the cheapest. In addition, the fastest query execution time was always faster than those of the median cost plan and the most distributed plan (when they differed from the cheapest).

3.4.3 Impact of Network Latency

We executed the query optimizer and the resulting query plans to emulate four different cases. Here, we set the optimizer to build plans for a specific value of the network penalty, and execute each plan using latency values different from the one that matches the optimizer setting. We chose the following values for the penalty: 0 (no latency), 40 (Cluster), 400 (WAN), and 4000 (extreme), thus covering a wide variety of possible networks. We built plans optimized for each penalty and executed each plan multiple times using different actual settings.

In Tables 3.7 and 3.8 we show the percentage slowdown in execution time of a plan optimized for one penalty value but executed under a different actual network setting, compared to the plan optimized for that actual setting.

Set. \ Act.      0 (None)   40 (Cluster)   400 (WAN)   4000 (Extreme)
0 (None)         0.00%      75.00%         11.20%      12.42%
40 (Cluster)     0.00%      0.00%          10.37%      0.00%
400 (WAN)        75.00%     75.00%         0.00%       0.18%
4000 (Extreme)   150.00%    150.00%        1.66%       0.00%

Table 3.7: DistriPlan Performance Slowdown Percentage of a Plan Optimized for a Specific Optimizer Setting (Set.) Penalty but Executed with a Different Actual (Act.) Network Setting – Q1 with a setting of 3,5,4.

Set. \ Act.      0 (None)   40 (Cluster)   400 (WAN)   4000 (Extreme)
0 (None)         0.00%      18.18%         31.36%      17.64%
40 (Cluster)     11.11%     0.00%          49.54%      36.96%
400 (WAN)        6122.22%   1436.36%       0.00%       19.90%
4000 (Extreme)   159.26%    0.00%          26.81%      0.00%

Table 3.8: DistriPlan Performance Slowdown Percentage of a Plan Optimized for a Specific Optimizer Setting (Set.) Penalty but Executed with a Different Actual (Act.) Network Setting – Q3-24 in Table 3.5.

For example, the value in the first row, second column of the first table signifies that the plan optimized for a penalty value of 0 executed 75% slower than the plan optimized for 40 when the actual setting was the one intended for 40 – the execution time of the cheapest plan optimized for a penalty of 0 is 7 seconds, while the plan optimized for a penalty of 40 executes in 4 seconds. Similarly, in the second table, for the first row, last column, the execution time of a plan optimized for a penalty of 4000, which also matched the actual setting, is 2,386 seconds, while the plan optimized for a 0 penalty ran in this setting for 2,807 seconds – a slowdown of 17.64%.

Overall, we can conclude that the penalty has to be selected carefully to reflect the actual setting – wrong values might harm performance as significantly as the right values improve it. It also shows that the best plan can vary significantly depending upon the latency, which implies that detailed cost modeling is critical.

3.5 Summary

In this chapter we have presented and evaluated a framework for optimized execution of array-based joins in geo-distributed settings. We developed a query optimizer which prunes plans as it generates them. For our target queries, the number of plans is kept at a manageable level, and subsequently, a cost model we have developed can be used for selecting the cheapest plan. We have shown that our pruning approach makes the plan spanning problem practical to solve. We evaluated our system and have shown that the cost model's cheapest plan executes faster than more expensive plans. We have also shown through experimentation that the penalty parameter introduced in the cost model is a critical one, and should be adjusted carefully to fit the physical system setting.

Chapter 4: Execution of Scientific Data Join – BitJoin

In the previous chapter we discussed the optimization of the join operation, yet we did not target the execution of the join. In this chapter, we focus on the execution of two variants of data joins – the classical equi-join and an operation we refer to as Mutual Range Joins (MRJ). The latter is a new operation we introduce, which is needed because scientific data is not only numerical, but often also contains measurement noise.

Consider the following query over a climate dataset: "find the days that had significant temperature change compared to the daily average". In Figure 4.1 we formally state this query, and quantify the significant change as 5 degrees. The query has 3 join criteria: 2 of them are over dimensions and the last criterion looks for a lack of a match to a range.

Figure 4.1: Example Query: Temperature Join

This notion of range query arises in nearly all contexts within scientific domains that involve numerical data collected by sensors or estimated by simulations [126, 147]. We refer to these range queries as Mutual Range Joins (MRJ).

In this chapter, we first demonstrate how bitmap indexes can be used for accelerating the execution of distributed array joins. The most expensive component in executing a distributed join in this domain is the communication, and we significantly reduce this overhead by transporting compressed bitmap indexes instead of the full data arrays. Afterwards, we focus on mutual range joins, and show how a modification of the bitmap index structure can help execute these operations efficiently.

4.1 Background

In this section we discuss the background for join execution. We first describe MRJ's in greater depth, and then discuss the intuitive join algorithms for array data.

4.1.1 Mutual Range Join

Earlier, we motivated the need for a variant of the traditional join operation, which is the mutual range join, or MRJ. Assuming Val1 and Val2 are the values from two different arrays to be joined, two MRJ variations can be stated as:

• Val1 BETWEEN Val2 − r + e AND Val2 + r − e

• Val1 NOT BETWEEN Val2 − r + e AND Val2 + r − e

Here, r is the range that results should be matched with, and e is an acceptable error range – often derived from the sampling error. The error should be smaller than the range, i.e., e < r.
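Stated as a predicate on a pair of values, the first variant is equivalent to |Val1 − Val2| ≤ r − e; a minimal C++ sketch of both variants follows (the function names are illustrative, not part of our system):

#include <cmath>

// Mutual Range Join predicate: Val1 BETWEEN Val2 - r + e AND Val2 + r - e,
// which is equivalent to |Val1 - Val2| <= r - e (requires e < r).
bool mrjMatch(double val1, double val2, double r, double e) {
    return std::fabs(val1 - val2) <= r - e;
}

// The NOT BETWEEN variant simply negates the condition.
bool mrjNoMatch(double val1, double val2, double r, double e) {
    return !mrjMatch(val1, val2, r, e);
}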

4.1.2 Join execution

In Algorithm 4 we present the basic join algorithm; this is a conceptual algorithm, and is not expected to be implemented. The inputs are two arrays and the output is the result of an equi-join between these arrays. In Line 2 we initiate an empty array. Lines 3-11 iterate over each existing value, and find the matching coordinates across both arrays. If there is a match, in Line 8, we add the value to the resultset. Since we started with an empty array, adding a value may result in reshaping and re-building the entire array – an expensive operation.

Algorithm 4 Naïve Array Data Equi-Join
 1: function NaïveArrayDataJoin(array1, array2)
 2:     results ← ∅
 3:     for value1 ← each cell in array1 do
 4:         dimSetting ← Extract value1 dimensional setting
 5:         if array2[dimSetting] != ∅ then
 6:             value2 ← array2[dimSetting]
 7:             if value1 = value2 then
 8:                 results[dimSetting] ← value1
 9:             end if
10:         end if
11:     end for
12:     return results
13: end function

Algorithm 5 Regular Array Data Join
 1: function ArrayDataJoin(array1, array2)
 2:     dims ← common dimensions
 3:     dimRes ← ∅
 4:     array1m ← aggregate(array1, dims)
 5:     array2m ← aggregate(array2, dims)
 6:     for each dimension dim1 in array1m do
 7:         dim2 ← matching dimension for dim1 in array2m
 8:         dimRes ∪= JoinDims(dim1, dim2)
 9:     end for
10:     array1m ← restructure(array1m, dimRes)
11:     array2m ← restructure(array2m, dimRes)
12:     result ← JoinData(array1m, array2m)
13:     return result
14: end function

In Algorithm 5 we present the Regular Join algorithm, a practical approach to implementing array joins. A walk-through example for this algorithm can be found in Figure 4.2. First, we find the common dimensions across the joined arrays, and aggregate both arrays, based on the specified (or default) aggregation function, so both arrays will only have the common dimensions. This removes dimensions that are not used by the counterpart array, forcing both arrays to have the same dimensional configuration. Another option for reshaping would be to add dimensions and join over a higher set of dimensions – however, our focus is on decreasing the dimensionality since it results in smaller arrays; adding dimensions can be done explicitly, by using sub-queries.

Next, in lines 6-11, we execute the dimensional equi-join. Based on the results, we restructure both arrays, removing values of dimensions that did not match – as described in Algorithm 8. At this point both arrays have the same dimensional setting. Concretely, array1m and array2m are of the same shape and have the same values in the same order. Last, we execute the data join based on the provided join criteria.

The data join is conducted over matching array cells, i.e., the values at index 0 of array1m and array2m are compared to each other, and so on. We may use this approach only because the restructure process guarantees matching values are at matching indexes. This operation is practically a Nested Loop (NL) that iterates through all available cells. Unlike in relational systems, where the NL complexity is O(N^2) with N the number of rows, for scientific array data the NL complexity is O(N), where N is the number of cells – therefore, NL for a value based join over array data is an efficient algorithm.
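A minimal C++ sketch of this aligned-cell data join over two restructured (same-shape, same-order) arrays; the flat-array representation and the NaN sentinel for NULL are illustrative assumptions:

#include <cstddef>
#include <limits>
#include <vector>

// Data join over two restructured arrays: cells at the same flat index
// correspond to the same dimensional setting, so a single O(N) pass suffices.
std::vector<double> joinData(const std::vector<double>& a1,
                             const std::vector<double>& a2) {
    const double kNull = std::numeric_limits<double>::quiet_NaN();
    std::vector<double> result(a1.size(), kNull);
    for (std::size_t i = 0; i < a1.size() && i < a2.size(); ++i) {
        if (a1[i] == a2[i]) {       // equi-join criterion; other criteria plug in here
            result[i] = a1[i];
        }
    }
    return result;
}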

Figure 4.2: Steps in the Regular Join Algorithm – the aggregation and dimensional joins (lines 4-9), the restructuring of both arrays into Array1m and Array2m (lines 10-11), and the final data join (line 12), shown on two small example arrays.

In Figure 4.2 we illustrate the join algorithm. For simplicity, we use arrays with matching dimensions. The Dim1 dimensional join matches values 2 and 4, while Dim2 has 1 match – 2. Array1 and Array2 are restructured based on the dimensional join results. Array1m and Array2m are built and a data join is executed over them, producing the resultset. If the join criterion does not hold, a NULL value is used (e.g., [4,2] in the example). Although dimensions could be reduced when an empty plane is generated, we chose not to reduce dimensions in this scenario – maintaining the dimensional join results.

4.2 Bitmap Based Join

Bitmap Index Join

Algorithm 6 The Naïve Bitmap Index Equi-Join
 1: function JoinData(index1, arr1, index2, arr2)
 2:     FinalResults ← ∅
 3:     for each key in index1 do
 4:         bin1 ← index1[key]
 5:         bin2 ← index2[key]
 6:         if bin2 = ∅ then
 7:             continue
 8:         end if
 9:         result ← bin1 & bin2
10:         for each i ← place of 1 in result do
11:             if arr1[i] != arr2[i] then
12:                 result[i] = 0
13:             end if
14:         end for
15:         if result > 0 then
16:             FinalResults ∪= (key, result)
17:         end if
18:     end for
19:     return FinalResults
20: end function

In Algorithm 6 we present the naïve Bitmap Join algorithm – this corresponds to the call in Line 12 of Algorithm 5. The algorithm's inputs are both arrays (to be joined) and their corresponding bitmap indexes. In phase 1, Lines 4-9, we take all matching bins (by their bin keys) and execute a bitwise AND operation between key-matched BitArrays. Here we assume that the bitwise AND compares keys with the same dimensional values, which is guaranteed because of the restructuring process. Phase 2, Lines 10-14, verifies the results – this is necessary when data is not discrete and bins represent ranges of values.

The performance advantage of this algorithm comes from the efficiency of the bit-wise operations. While the regular join algorithm requires values to be compared one by one, with a bit-wise AND the processor accelerates the calculation by the number of bits it can process simultaneously, when data is not compressed. When data is compressed, the acceleration can be even higher, depending on the sparsity of each BitArray.
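For uncompressed BitArrays, the per-word speedup comes from ANDing machine words instead of comparing values one by one; a minimal sketch (assuming a packed 64-bit-word representation, which is an illustrative choice):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bitwise AND of two uncompressed BitArrays, 64 cells at a time.
std::vector<uint64_t> andBitArrays(const std::vector<uint64_t>& b1,
                                   const std::vector<uint64_t>& b2) {
    std::vector<uint64_t> out(std::min(b1.size(), b2.size()));
    for (std::size_t w = 0; w < out.size(); ++w) {
        out[w] = b1[w] & b2[w];   // one instruction covers 64 candidate matches
    }
    return out;
}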

Dimensional Joins and Restructuring

To use bitwise AND/OR for the data join, matching values must be located at the same offsets within both indexes; if they are not, an index restructure is necessary. We use two algorithms to do that restructure efficiently. The first generates BitMappings that mark which dimensional values should be compared with which across array boundaries – for example, if the two BitMappings for a specific dimension of the joined arrays are 010 and 001, the 2nd dimensional value of the first array corresponds to the 3rd one in the second array. Clearly, the number of lighted bits on both BitMappings matches, and both mapped dimensions must be sorted. The second algorithm uses these BitMappings to perform the array restructure.
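To make the mapping concrete, here is a minimal C++ sketch (illustrative names, not BitJoin's code) that pairs the lighted-bit positions of two BitMappings with the same number of lighted bits; for 010 and 001 it pairs position 1 with position 2, matching the example above.

#include <cstddef>
#include <utility>
#include <vector>

// Pair up the lighted-bit positions of two BitMappings produced by the
// dimensional join; the k-th lighted bit of one maps to the k-th lighted bit
// of the other (both are guaranteed to have the same number of lighted bits).
std::vector<std::pair<std::size_t, std::size_t>>
pairDimensionValues(const std::vector<bool>& m1, const std::vector<bool>& m2) {
    std::vector<std::size_t> pos1, pos2;
    for (std::size_t i = 0; i < m1.size(); ++i) if (m1[i]) pos1.push_back(i);
    for (std::size_t i = 0; i < m2.size(); ++i) if (m2[i]) pos2.push_back(i);
    std::vector<std::pair<std::size_t, std::size_t>> pairs;
    for (std::size_t k = 0; k < pos1.size() && k < pos2.size(); ++k)
        pairs.emplace_back(pos1[k], pos2[k]);   // e.g., {1, 2} for 010 and 001
    return pairs;
}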

In Algorithm 7 we show the dimensions join algorithm, which generates each dimension's BitMappings – this corresponds to the call in Line 8 of Algorithm 5. The algorithm begins with two reset (to 0) BitMappings, each the length of the input dimension. We go through each value, one by one, in a technique very similar to merge sort (since dimensional values are assumed to be sorted). If we find a match, we append a 1 to the bit array. If we do not, we skip the smaller item while keeping track of which items were skipped (by appending 0's to the relevant BitMappings).

Algorithm 7 BitJoin Dimensional Join
 1: procedure JoinDims(dim1, dim2)
 2:     i ← 0
 3:     j ← 0
 4:     BitMappingsDim1 ← 0..0
 5:     BitMappingsDim2 ← 0..0
 6:     while i < size(dim1) && j < size(dim2) do
 7:         if dim1[i] = dim2[j] then
 8:             dimToRet ∪= dim1[i]
 9:             BitMappingsDim1 << 1
10:             BitMappingsDim2 << 1
11:             i ← i+1
12:             j ← j+1
13:         else
14:             if dim1[i] < dim2[j] then
15:                 BitMappingsDim1 << 0
16:                 i ← i+1
17:             else
18:                 BitMappingsDim2 << 0
19:                 j ← j+1
20:             end if
21:         end if
22:     end while
23:     return ((BitMappingsDim1, BitMappingsDim2), dimToRet)
24: end procedure

Algorithm 8 BitJoin Index Restructure
 1: function Restructure(index, bitMappings)
 2:
 3:     ▷ If restructure is not needed, skip restructuring
 4:     if bitMappings has only 1's then
 5:         return index
 6:     end if
 7:
 8:     ▷ Build the whole index bitMapping
 9:     histState ← bitMappings[0]
10:     currState ← ∅
11:     zeros ← bitstring of 0's of length len(histState)
12:     for i ← 1..size(bitMappings)−1 do
13:         for j ← 0..size(bitMappings[i])−1 do
14:             if bitMappings[i][j] then
15:                 currState ← currState.append(histState)
16:             else
17:                 currState ← currState.append(zeros)
18:             end if
19:         end for
20:         histState ← currState
21:         currState ← ∅
22:         zeros ← histState && 0
23:     end for
24:     currState ← compress(histState)
25:
26:     ▷ Rebuild each bin
27:     for each bin b in index do
28:         ▷ The next call removes bits that are marked by 0 in currState
29:         ▷ from the current bin, preserving only correlated values
30:         newBin ← restructure b based on currState
31:         if newBin has 1 in it then
32:             newIndex ∪= newBin
33:         end if
34:     end for
35:     return newIndex
36: end function

In Algorithm 8 we take an index structure and restructure it based on the generated BitMappings. In line 24 we call compress for simplicity; the algorithm can execute directly over compressed structures. This algorithm corresponds to the calls in Lines 10-11 of Algorithm 5. The input for this algorithm is the bitmap index, equivalent to the array data, and the BitMappings, which encapsulate the dimensional join results.

In lines 3-6 we determine if a restructure is needed. If it is, in Lines 8-23, for each dimension index (represented by i) we go through the bitmap that represents the locations we need to keep on restructure (represented by j). If we need to keep the current dimensional value, we shift left the entire set of values which were calculated before (histState); otherwise, we append zeros of the same length instead. The reason we append the entire data that was collected before is that a 1 on the x dimension's BitMappings requires all the dimensions with indexes smaller than x to appear. Lastly, the algorithm iterates through the bins and restructures the bitmap – described graphically in Figure 4.3.

4.2.1 Putting It All Together

In Figure 4.3 we demonstrate the full join process. First, we join the dimensions, and produce the dimension BitMappings for each dimension of each array. Since the value 2 of Dim1 matches, for the left array BitMappingsDim1 is 01 (since 2 is the second value). Similarly, BitMappingsDim2 of the left array is 10. The same process is executed for the right array as well (producing 10 for Dim1 and 01 for Dim2).

Figure 4.3: BitJoin: A Walkthrough of the Join Process – the dimensional joins, the resulting BitMappings, the restructuring of both indexes, and the final bitwise AND between matching bins.

The second part of the restructure uses the BitMappings which were created. First, we multiply (vector multiplication) all the BitMappings to create a global array BitMapping. Then, we modify the indexes by using the mapping, maintaining only the values for lighted bits in the BitMapping – in this case, the BitMapping for the left array is 0100, instructing us to take the 2nd bit of each index bin, while ignoring all others. Similarly, we take the 3rd bit of each index bin of the right array. Last, we join the matching bins by using the bitwise AND operator to generate the join results.

r = 2, e = 0.5

Bins:                         -0.5–0  0–0.5  0.5–1  1–1.5  1.5–2  2–2.5  2.5–3  3–3.5  3.5–4  4–4.5
Index Type A for value 2.2:     0      0      0      0      0      1      0      0      0      0
Index Type B for value 1.7:     1      1      1      1      1      1      1      1      1      0

Figure 4.4: Bitmap Index Types Example

4.3 Mutual Range Join - MRJ

4.3.1 Binning and Mutual Range Joins

For using bitmaps to find values that are within a range, we propose building two different indexes. The first index uses a bin width of e – referred to as Bitmap Index Type A. The second index, Bitmap Index Type B, uses the same bin width, e, but has a different structure. Index Type B has multiple lighted bits in different bins for the same indexed value; specifically, all bins that represent values within the range [val − r, val + r] are lighted. Both indexes are generated in advance.
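A minimal C++ sketch of building the two index entries for a single value; the bin layout (bins of width e starting at a fixed lower bound lo) and the function names are illustrative assumptions:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Type A: exactly one bit is lighted - the bin containing the value.
std::vector<bool> typeABits(double val, double lo, double e, std::size_t numBins) {
    std::vector<bool> bits(numBins, false);
    long b = static_cast<long>(std::floor((val - lo) / e));
    if (b >= 0 && b < static_cast<long>(numBins)) bits[b] = true;
    return bits;
}

// Type B: every bin that intersects [val - r, val + r] is lighted.
std::vector<bool> typeBBits(double val, double r, double lo, double e,
                            std::size_t numBins) {
    std::vector<bool> bits(numBins, false);
    long first = static_cast<long>(std::floor((val - r - lo) / e));
    long last  = static_cast<long>(std::floor((val + r - lo) / e));
    first = std::max(first, 0L);
    last  = std::min(last, static_cast<long>(numBins) - 1);
    for (long b = first; b <= last; ++b) bits[b] = true;
    return bits;
}

// With lo = -0.5, e = 0.5, numBins = 10: typeABits(2.2, ...) lights only the
// bin [2, 2.5), and typeBBits(1.7, 2.0, ...) lights bins 0 through 8,
// matching Figure 4.4.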

In Figure 4.4 we demonstrate the two index types side by side, showing each bin's bit value for the given indexed value. Index Type A has exactly one lighted bit across all bins – since the value 2.2 lies within the range [2, 2.5), this bin holds a 1. Index Type B has 9 lighted bits, representing all the values in the range [−0.3, 3.7] – the result of [1.7 − 2, 1.7 + 2].

MRJ Algorithm: we modify Algorithm 6 to use Index Type A for one of the join operands, while using Index Type B for the other. We also skip "phase 2", the results verification phase, which is not necessary due to the index structure.

The output of the algorithm is slightly different from the definition of the original MRJ operation. Specifically, the range and error boundaries are extended to include the adjacent bins. While a simple check can eliminate these records with low overhead, we believe, based on the usage of our current systems and discussions with research scientists, that it is reasonable from a practical standpoint to report all results.

Theorem 2. Suppose we define relations X and Y with index i_x of type A on X and index i_y of type B on Y, and the parameters r and e as range and error accordingly. Consider the records with matching variable values x and y. In addition, we define b_x to be ⌊x/e⌋ × e, the "leftmost" value of the bin which x is in – b_y is defined similarly. These records are part of the output from the algorithm iff b_y − r ≤ b_x < b_y + r + e.

Proof. Define BitI_x and BitI_y to be the values of the ith bit of the relevant bins of i_x and i_y. More formally, BitI_x = i_x[b_x][i] and BitI_y = i_y[b_y][i]. In this case, by definition, BitI_x is 1 – i_x[b_x] is the bin that is defined to be the one that contains the value x, therefore its ith bit must contain 1. Assume b_y − r ≤ b_x < b_y + r + e. In this case, the bin i_y[b_y] is at a distance smaller than r + e from b_x (when the bins are ordered in increasing order). By definition, BitI_y is 1. A bitwise AND between BitI_x and BitI_y will produce 1, and make this record part of the output. In contrast, if b_y − r > b_x or b_y + r + e ≤ b_x, then BitI_y is 0, based on the same reasoning. The bitwise AND between BitI_x and BitI_y will produce 0, showing the other direction.
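As a quick sanity check against the Figure 4.4 example (x = 2.2, y = 1.7, e = 0.5, r = 2):

b_x = ⌊2.2 / 0.5⌋ × 0.5 = 2.0,  b_y = ⌊1.7 / 0.5⌋ × 0.5 = 1.5,  and indeed  b_y − r = −0.5 ≤ 2.0 < b_y + r + e = 4.0,

so the pair is reported – consistent with the bitwise AND of the two rows in Figure 4.4, which is 1 at the bin [2, 2.5).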

An important property of our indexes is that if the range r is smaller than needed, an in-place Type-B index modification can be done to adjust the index to the requested r value. This allows using the same Type-B index for different r values. However, such a modification is rarely needed, since domain experts often anticipate the needed values with high accuracy.

4.3.2 MRJ with Distributed Data

Joining distributed data requires additional communication. Transporting indexes, instead of the original arrays, can decrease the amount of necessary communication. We suggest executing the entire join by communicating local bitmap indexes (indexes over the local, partitioned, array data).

We assume each machine has access to at least one of the variables' partitions. We choose one variable to stay stationary, while sending the other one to the machines hosting the chosen variable; the join is processed on these machines. The decision of which variable remains stationary is made by a Query Optimizer and communicated through an execution plan (Section 7.2).

Algorithm 9 Distributed BitJoin Master
 1: procedure DistributedJoinMainNode()
 2:     plan ← DistriPlan optimizer
 3:     for each node n do
 4:         Send plan to node n
 5:     end for
 6:     if master node is a processing node then
 7:         DistributedJoinProcessing(plan)
 8:     end if
 9:     for each processing node n do
10:         res ∪= results from node n
11:     end for
12: end procedure

Algorithm 10 Distributed BitJoin Process
 1: procedure DistributedJoinProcessing(plan)
 2:     communicate dimensions within the cluster
 3:     if there are dimensions to join then
 4:         {BitMappings, newDims} ← JoinDims
 5:     end if
 6:     communicate BitMappings
 7:     subIndexesA ← ReformBits(IndexA, BitMappings)
 8:     subIndexesB ← ReformBits(IndexB, BitMappings)
 9:     communicate subIndexA to the relevant nodes
10:     res ← subIndexA & subIndexB
11:     send results (Bitmaps and dimensions) to the main node
12: end procedure

We divide the processing algorithm to consider actions on the master node and each processing node, and similarly, the dimensions join phase and the bit operations phase. The master node orchestrates the full join execution as shown in Algorithm 9. The main node is also a processing node (Lines 6-8), and is likely to process a join in addition to its responsibility to distribute the execution plan and collect all results.

The processing thread execution is presented in Algorithm 10. First, each node determines whether it is a data receiving node, a data sending node, or both. If it is a data sending node, it sends the relevant dimensions to the receiving nodes. The receiving nodes execute the dimensional joins, generating the BitMappings (Section 4.2). These are used to restructure both pre-built indexes (Lines 7-8) – the restructure occurs on the node that holds the array itself, since all the communicated dimensions are needed there. The restructured index is sent back to the receiving node (while avoiding the need to communicate and hold in memory the dimensional data). The receiving nodes join the data by using the bitwise AND over compressed data – this is why we restructured the data earlier. The accumulation of these joined indexes is the query result.

4.4 Evaluation

We now evaluate the performance of our algorithms for equi-joins and MRJ's. The datasets we use contain climate data generated by simulations [126, 127, 147, 148] conducted by the Department Of Energy (DOE) at Argonne National Laboratory (ANL). The variables used have 4 dimensions – longitude, latitude, height, and time. The data distribution is based on the time dimension, i.e., data for different timestamps is stored in different files (and possibly different locations).

Our experiments were conducted on a cluster where each node has two 2.4 GHz Intel(R) Xeon(R) E5-2680 v4 processors, with 128 GB of memory, running Linux kernel version 3.10. Nodes are connected through Infiniband. As we will describe later, some experiments used 1 node of this cluster, and in some, lowered bandwidth is used. We also vary the number of nodes (sites) the data is distributed across.

We compare our implementation to the basic "Regular Join" algorithm, which is a combination of Nested Loop Join (NL) [120] and Merge Join, as described in Section 1.2. The complexity of the NL used is O(N), and not quadratic as in the relational systems domain, as discussed in Subsection 4.1.2.

4.4.1 Implementation

Our system is constructed from two different parts: an optimizer and an execution engine. Our optimizer is based on the ideas presented in DistriPlan [45] (Chapter 3). The execution engine is written in C++ and includes the following components: a network manager, a join engine, and a results collector.

When a query is introduced to the system, the optimizer generates an execution plan. All the information regarding where to read the data from (e.g., disk or network), what to communicate, which node accumulates the results, etc. is encapsulated within the execution plan. The (join) execution engine, networking components, and result collector follow the plan to produce the query results.

4.4.2 Equi-Join Over Discrete Data

In this experiment we join discrete (categorical) data, and compare the bitmap based algorithm against the regular join algorithm. For simplicity, data in this experiment is on a single node. Both joined dataset sizes were 781.2 MB (102,400,000 items in each array). The query executed is a two-relation equi-join whose dimensions were joined upon as well (the join criteria are "WHERE A.A = B.B AND A.dim1 = B.dim1..."). For climate data, this corresponds to a query that helps identify erroneous sensors that produce the exact same readings.

We measured both the join duration and the resultset size. The BitJoin resultset is produced as a set of bitmaps, and is sent to the user in its compressed representation. Therefore, its size is based on both the number of bins and the compression ratio (WAH [133]). It should be noted that this representation does not lose accuracy when the data is discrete (since the bins are structured to match discrete values).

In Figure 4.5 we present the processing duration for 10 to 50 bins. The greater the number of processed bins, the longer it takes to process using BitJoin, while the regular join algorithm is agnostic to this factor, as expected. There is nearly a 5 times speed advantage when the number of bins is 10, but this reduces when the number of bins is 50.

Figure 4.5: BitJoin Processing Time for Equi-Joins Over Categorical Data with Increasing Number of Bins

Figure 4.6: BitJoin Resultset Size of a Discrete Data Join with Increasing Number of Bins

Overall, this shows that even when data is not distributed, there can be advantages to using bitmaps, because of their compact size and because of the efficiency of bit-wise operations.

Another advantage of using bitmaps is that the size of the results (which may be communicated over the network) is small and somewhat predictable. In Figure 4.6 we present the resultset sizes with a varying number of bins. As one can see, the size of the resultset with bitmaps is quite small – in fact, there is nearly a factor 5 improvement when 50 bins are involved. There is a larger advantage with respect to size because of bitmap compression; this is why we use the compressed index not only as an internal format to communicate for the join execution, but also as the output of the system.

When the cardinality increases (the number of distinct values in the index goes up), although the processing takes longer using BitJoin, the average processing duration of each bin decreases. This behavior is caused by the WAH compression algorithm over sparse data. Since the bitmap index bins are usually more dense when there are fewer distinct values, the processing load of the compression is higher, leading to the behavior we observed. When the number of bins is increased from 10 to 100, the average bin processing time is decreased by half, from 0.19 s to 0.09 s. This suggests that using a more current and efficient bitmap compression scheme, such as UpBit [10], may improve BitJoin performance.

4.4.3 MRJ Processing

We now focus on the processing of mutual range joins. Next, we consider datasets of different sizes, and increase the number of nodes over which the data is distributed.

The query executed is a two-relation MRJ whose dimensions were joined upon as well (the join criteria are "WHERE A.A BETWEEN B.B-r AND B.B+r ..."). Using this query, we can identify areas that were more heavily affected by global warming. We experiment over 1, 10, 100, 1,000, and 10,000 million array values, corresponding to local datasets of sizes 7.81 MB, 78.13 MB, 781.25 MB, 7,812.5 MB, and 78,125.0 MB. Each total array size is the local dataset size multiplied by the number of nodes it is distributed over – the largest array size used is 78 GB over 32 nodes, a total of about 2.5 TB. We assume both indexes, Type A and Type B, are available, and were built when the data was generated. All the data distributions mentioned represent the number of nodes that each array is hosted on – when we say "32 nodes", for example, 64 nodes are involved (32 containing array A, and 32 containing array B).

In Figure 4.7 we show the improvement of MRJ execution using our algorithms compared to an implementation that aggressively optimizes distributed joins. We show the improvement only for the 3 smaller datasets (out of 5), since the compared algorithms cannot process the larger datasets and therefore the improvement cannot be calculated. We used a relatively large number of bins, i.e., 100. We chose such a large number since for some datasets the performance of BitJoin is slower on 1 node compared to the regular join when the index contains 100 bins, as shown earlier. Thus, we emphasize the benefits of using BitJoin with an increasing number of nodes over which the data is distributed.

Figure 4.7: BitJoin Performance Improvement over Small Datasets With Increasing Number of Nodes – Speedup Compared to Regular Join on the Same No. of Nodes

Figure 4.8: BitJoin Execution Time Comparison over Large Datasets

We can expect higher speedups if fewer bins are used, as shown in the previous experiment.

In the figure, the scale of the number of nodes (the X-axis) is logarithmic while the improvement is not; the improvement is nearly linear in the number of nodes. As shown, the greater the number of nodes over which data is distributed, the larger the performance improvement of BitJoin over the regular join. For example, for the 781 MB dataset, BitJoin runs 31 times faster than the regular join when each array is distributed over 32 nodes. For the other two datasets, the speedups on 32 nodes are nearly 15 and 24, respectively. Thus, the relative advantage of using bitmaps also increases with increasing dataset sizes. This is primarily because of the high cost of moving larger (source) data in the traditional algorithms.

In Figure 4.8 we present the execution time of the join between the two larger datasets with an increasing number of nodes. Unlike the previous figure, we report absolute execution times here. From the figure we see that for larger datasets using bitmaps is preferable over previous approaches. This is mainly due to the advantage of working over compressed data. Once data needs to be transported, the benefits increase substantially. In addition, the low memory footprint of compressed bitmaps allows the processing of large datasets, which traditional algorithms fail to execute.

4.4.4 Throughput Effect

In this experiment we modified the networking throughput to match different settings (throttling the network bandwidth and latency), and measured the effect

Figure 4.9: BitJoin Execution Time With Decreasing Throughput

it had on join performance. Our goal was to emulate a typical ‘big data’ cluster (which uses Ethernet, not Infiniband), and then a wide-area-like setting.

All experiments ran with 4 nodes, 2 hosting each of the two joined datasets.

In Figure 4.9 we present the execution time for queries running over indexes with a varying number of bins, when each joined dataset is 1.6GB. When the throughput is 16 times slower, the regular join executes 3 times slower than BitJoin. The reason for this dramatic difference is the compression of the bitmaps, which drastically decreases the amount of data that needs to be communicated. We also executed this experiment for other dataset sizes and other numbers of bins, and the same pattern holds – the bigger the dataset, the better BitJoin performs compared to the regular join.

For comparison, when the dataset size is 16 GB and 10 bins are used, the slowdown of the regular join is 11.94 seconds, while for BitJoin it is only 0.66 seconds (compared to 0.52 seconds and 0.1 seconds for a dataset of 1.6 GB). Moreover, as we had shown earlier, BitJoin’s advantage increases when the number of nodes on which the data is distributed is larger. We have distributed each array over only 2 nodes, and we can expect larger gains as this number increases.

We conclude that BitJoin is clearly preferable as datasets get larger, as data is distributed over a larger number of nodes, and/or as the network bandwidth gets slower.

4.4.5 Impact of Changing Error and Range Values

We first experiment with changing error, i.e., e values. As we discussed earlier, our algorithm further stretches the query range limits, reporting more records in the resultset (yet all within the error boundaries). In the following experiment, we measure the percentage of additional results reported using our methods as the e value changes. We execute this experiment for a constant array size, 78.13 MB (10 million values). We experiment with varying e values, which consequently increase the number of bins generated for each index.

In Table 4.1 we report the results. As expected, the smaller the error value, the smaller the percentage of additional, out-of-range, records. However, even for a fairly large e value we have less than 10% additional results, while for commonly used e values we have less than 4%.

e value                0.25     0.025    0.016    0.0125   0.01
Additional records     14.85%   7.53%    5.07%    3.79%    3.04%

Table 4.1: BitJoin Percent of Additional Records Output With Varying ‘e’ values for 10M values

Next we experiment with changing range, i.e., r, values. The r value required for a specific query is only known at runtime, and therefore we cannot normally pre-build an index for this specific value. In this experiment, we change the values of r used in queries over variables that already have existing indexes that were created for a smaller r value, and measure the performance overhead. We used 2 arrays of the same size and executed the joins on 2 nodes. We ran the experiment for varying array sizes – 10MB to 1GB.

In Figure 4.10 we present the speedup of BitJoin compared to the regular join when queries are introduced to the system with a multiplier of the r value used for generating the index. Both engines slow down with increasing r – the regular join slows due to the number of results it needs to emit, while BitJoin slows due to the bitmap index manipulations. As can be seen, if the r value multiplier is only 2, the regular join performs slightly better. Yet, as the multiplier increases, BitJoin performs increasingly better. Overall, we conclude that the performance penalty for increasing r values is reasonable and that BitJoin performs better than the regular join. In addition, caching or storing the modified index that was generated at runtime for the higher r multiplier will result in better performance at a modest storage cost.

Figure 4.10: BitJoin Speedup over Regular Join for Changing ‘r’ values

4.5 Summary

In this chapter we have considered joins and join-like operations over scientific (array) datasets. We developed techniques for improving the execution time of equi-joins by using bitmap indexes. Afterwards, we developed a variant of standard bitmaps, and showed how it can be used for efficiently executing MRJ queries. Last, we presented algorithms for distributed execution of joins. The basis of our techniques is a set of methods to efficiently join array dimensions and restructure bitmap indexes without accessing the raw data – only addressing and transporting compressed, limited-in-size, bitmap indexes.

Our experiments show that BitJoin is preferable over traditional approaches. The bigger, more sparse, and more distributed the data, the better our methods perform. In addition, the resultsets BitJoin produces are significantly smaller than those of other methods.

Chapter 5: Advanced Analytics over Real Scientific Array Datasets - FDQ

So far, we have developed limited querying abilities over array data – specifically, we introduced subsettings and joins over distributed array data. In this chapter we address analytical querying and uncover a new class of analytical queries – queries over windows where the planes that construct these windows are internally ordered. Currently, in analytical [or window] queries, an order is imposed across different windows; we address queries where an ordering within the window is required.

First, we introduce FDQ – an Analytical Functions Distributed Querying Engine for Array Data – our querying engine that addresses this new query type. Then we describe in detail memory management optimizations for efficient processing of analytical (and other structured operator) queries over large datasets. Last, we provide efficient methods to execute these queries in parallel, using a sectioned (tiled) approach.

5.1 Domain

In this section we provide a technical description of the climate data we use and the querying needs of research scientists. This further motivates the need for the new querying class introduced later.

5.1.1 Climate Datasets and Variables

The datasets generated by ANL’s Climate Research Group are saved in NetCDF [102] files, each of which represents a specific epoch the file was sampled at or calculated for. Each file contains about 100 variables, each of which is structured from up to 4 dimensions (longitude, latitude, vertical levels, and time). In the datasets we have used, the time dimension has only 1 value, since the simulations generating the datasets are configured to output one time step [108, 113, 137] per file. Last, not all variables contain vertical levels.

Each variable used can be large – in our case, the spatial resolution was 12 kilometers (about 7.45 miles), and the model covers most of North America. The data used contains in total 2TB per year with a retention of 10 years – a total of 20TB. Each raw file is self-describing and contains the dimensions used, including their values. Usually, all the variables in the same raw file use the same dimension definitions – e.g., the precipitation and temperature variables within each file will often describe the same area.

The two variables we use for most of the queries in this work are temperature and precipitation, simulated by a regional climate model [126, 127, 147, 148]. Temperature outputs are stored as instantaneous absolute values, as measured in observatory stations or reported by weather forecasts. Precipitation amounts are output in an accumulated mode – data is accumulated from the first epoch until the last one, and reported as the value accumulated up to the sampling point. However, to avoid reporting a large number as the precipitation amount, the program (or operators) resets the precipitation amount at some time point and lets the model output accumulated precipitation again from that point. To calculate the amount of precipitation during a certain time interval, if there is no reset, we simply subtract the first reading from the last one. On the other hand, for those data sequences where a reset has been applied, special processing is required. For example, for 5 samples with the values 3, 5, 2, 2, and 4, the amount of rain between the first and last samples is 6 – 2 from sample 1 to 2, 2 from sample 2 to 3 (which included a reset of the “counter”), none between samples 3 and 4, and 2 more between the last two samples. Usually the counter reset is infrequent, which makes it possible to detect when a reset occurred.
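To make the reset handling concrete, the following is a minimal sketch (our own illustration, not code from the climate model or the engine) of accumulating precipitation from such counter readings; the function name and the use of std::vector are ours.

#include <iostream>
#include <vector>

// Illustrative only: total precipitation between the first and last sample,
// treating any drop in the accumulated counter as a reset.
double accumulated_precipitation(const std::vector<double>& samples) {
    double total = 0.0;
    for (size_t i = 1; i < samples.size(); ++i) {
        if (samples[i] >= samples[i - 1])
            total += samples[i] - samples[i - 1];   // normal accumulation
        else
            total += samples[i];                    // counter was reset; count the new reading
    }
    return total;
}

int main() {
    // The example from the text: 3, 5, 2, 2, 4 -> 6
    std::cout << accumulated_precipitation({3, 5, 2, 2, 4}) << std::endl;
    return 0;
}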

5.1.2 Querying Needs and Tools

Currently, the workflow at ANL and its research communities includes manually extracting the variables and storing them locally using tools such as NCO [139] and CDO [107]. The extracted data is used for simple mathematical and algebraic manipulations or loaded into a programming environment. The most common local data calculated are daily means, or seasonal/annual averages. More advanced analysis of the data is conducted by scripts written and executed using NCL [67], Matlab [7, 117], Python [63, 65], or R [30, 103].

When it comes to collaborations with, or assistance provided to, other groups and institutions to address different problems, transporting the more than 20TB of generated raw data is tedious and, most importantly, unnecessary. ANL offers pre-processing analytics to decrease the provided dataset size and better address the needs of collaborators and colleagues. Lowering the overhead on ANL’s personnel in addressing these needs is important, and FDQ, in part, allows ANL to address these generic inquiries over long periods of time efficiently, while providing results faster.

5.2 Analytical/Window Querying

5.2.1 Definition

Analytical Queries are queries over data partitions, each of which is referred to as a window. While analytical queries have some similarities to queries involving aggregations over partitions (the GROUP BY clause), Analytical Queries are unique in allowing access to other windows’ raw data while processing a specific window. For example, subtracting each day’s average value from that of the previous day cannot be expressed directly using the GROUP BY clause (it would require an inefficient use of sub-queries and joins).

Analytical Functions are usually listed in the SELECT clause of the query. As a consequence, Analytical Querying enables a single query to use multiple Analytical Functions – providing the ability to use multiple aggregations over different partitioning schemes within one query. The syntax for Analytical Functions is:

FUNC = FUNCTION(param_list) OVER ([PARTITION BY part_dim_list]
       [INTERNAL ORDER BY part_int_ord [INCOMPLETE|COMPLETE]]
       ORDER BY part_ord)

FUNC is the syntax clause for analytical functions over array data, and is intended to appear in place of a column (or dimension) wherever column lists are used in the SQL standard [84]. Brackets mean the clause within them is optional, a standard in the notation used to describe database querying syntax. FUNCTION is the analytical function used – in our setting the available functions are average AVG, minimum MIN, maximum MAX, lead LEAD, lag LAG, median MEDIAN, and minus MINUS.

The function parameters vary by function; for example, the MINUS function takes 2 parameters: the variable upon which it should operate, and the distance to the anchor window (the window whose values are used for the subtraction).

The OVER clause marks that we are using an analytical function. The PARTITION BY... clause accepts a list of functions or dimensions (parallel to columns in relational settings) that are used to build the partitions upon which the calculations occur. Using a list of functions or dimensions, the ORDER BY clause determines the external window order – only dimensions or functions that are listed in the PARTITION BY clause can be used. Up to here, although discussed in the context of array data, similar functionality exists for relational systems and is standardized under the SQL and SQL/MDA standards.

Next, we discuss the extension we suggest. The INTERNAL ORDER BY clause is provided a list of functions or dimensions as well, but here only dimensions that are not listed in the PARTITION BY section can be used. This clause is used to determine the order of planes internal to the partition, as if there was a secondary partitioning, internal to the generated window, based on the provided list of dimensions. The INCOMPLETE clause instructs the engine to provide (rather than dismiss, as the default COMPLETE behavior does) window values that are calculated based on missing values (providing an advanced and nuanced handling of the case where NULL values are present).

In Figure 5.1 we show an example of an analytical query for finding the differences between subsequent days’ average temperatures. The syntax shown is simplified compared to the ANSI standard for brevity. The query has three different parts: the first is a call to the analytical function


SELECT AVG(TEMP.val) OVER (PARTITION BY DAY_OF_YEAR(TEMP.date)
                           ORDER BY DAY_OF_YEAR(TEMP.date)) AS day_avg,
       LAG(day_avg, 1) OVER (ORDER BY DAY_OF_YEAR(TEMP.date)) AS day_before_avg,
       day_avg - day_before_avg AS average_difference
FROM TEMP

Figure 5.1: Analytical Query SQL Example

Average: the average is calculated over a window that corresponds to each day’s data (the window is built through a function that extracts the day of year); the ORDER BY clause does not impact the results and could be omitted here. Next, we use the function LAG, which gives us access to data that was produced for the previous window, as determined by the provided ORDER BY clause, which is critical here. The function LEAD, demonstrated in Figure 5.4, gives access to the subsequent window’s values. Last, we calculate the difference between the currently calculated day and the day before, showing the difference in temperatures across subsequent days.

Analytical Queries challenge the query optimization process since they require repeated calculations, large memory caches, and in some cases disk usage. Efficient algorithms have been developed to address these issues for relational databases [18, 26, 59, 76, 88, 91, 131, 149]. Yet, to date, no engine for array data provides the full strength of analytical functions; therefore, developing new methods, which utilize the sequential memory layout, for efficient processing of these calculations in that context is crucial.

5.2.2 Minus Function

We now introduce a new analytical function, MINUS, which is similar to what we described in the previous subsection. This function calculates the difference between the maximum value of each data point within each window and the last value of the matching data point within the previous window. Mathematically, it can be described as

MAX2(val) − LAG(val)     (5.1)

where MAX2 calculates the maximum in a manner that compensates for data resets. This analytical function cannot be phrased in SQL without the INTERNAL ORDER BY clause, which was introduced here.

In Figure 5.2 we demonstrate the minus function over precipitation data. Data was collected from 2 locations during 3 days, while the array contains 3 dimensions: day, time, and location. The query’s PARTITION BY clause contains only 2 of the 3 dimensions – the 3rd dimension is aggregated upon. The INTERNAL ORDER BY clause lets the MINUS Analytical Function know the internal partition ordering (for determining the “matching value” to subtract). The traditional ORDER BY clause is used here as well, allowing the aggregation function to access the previous (or next) partition’s values. We highlighted in the figure where a data reset occurred (and marked the specific sample with an asterisk).

In the figure, since the measuring begins at midnight, it should end at midnight of the next day – although we have 3 days of data, we are missing a data point to

SELECT MINUS(Rain, 1d)
  OVER (PARTITION BY Date, Location
        INTERNAL ORDER BY TimeOfDay
        ORDER BY Date)
FROM RAIN

Figure 5.2: Analytical Query Example: Demonstration of expected behavior of the MINUS analytical function with data reset (highlighted)

calculate the full last day. For clarity, we demonstrate the calculation of the parking location for January 2nd:

(89 − 88) + 3 + (20 − 3) = 21     (5.2)

Figure 5.3: FDQ: A 3 Dimensional Variable

SELECT AVG(TEMP) - LEAD(AVG(TEMP))
  OVER (PARTITION BY extract_week(Time) ORDER BY extract_week(Time))
FROM TEMP

Figure 5.4: FDQ Query 1 - Analytical Function Query Example

5.3 Query Evaluation and Optimization Algorithms

5.3.1 Analytical Functions Implementations

Our discussion here considers three analytical functions that together cover all the different needs that arise in implementing any analytical function: MIN – Minimum, AVG – Average, and MINUS – Minus.

Before discussing the algorithms, we explain some of the issues that arise. Figure 5.3 shows one variable with three dimensions – Dim1, Dim2, and date (presented by its value above each partial array). Running the query in Figure 5.4


Figure 5.5: FDQ Query 1 resultset

should produce results similar to those in Figure 5.5. The first question to address is whether “Week 2” should be produced at all. The INCOMPLETE clause allows fine-grained control over this matter.

Had we decided not to emit the second window, the whole Time dimension would have to be removed. Dimensional removals and additions introduce a challenge since, as mentioned before, current frameworks do not allow modifying dimensional sizes and variable definitions after the initial creation – enforcing two scans of the results in some cases.

Next, we describe the implementation of the three functions we are focusing on. Analytical functions require multiple data passes, while simultaneously performing additional calculations. Since many calculations are repeated, it makes sense to pre-allocate memory in a way that allows caching the expected re-used values. For example, for LAG (a function to access previously produced values at a specific window offset), we can hold in memory, for every currently processed window, the previous values, allowing fast access to that data without disk accesses or value recalculations.

However, memory is limited and this approach might require more memory than is available. We limit the memory allocation to the available size, and work section-by-section, i.e., iteratively produce full output for n values at a time. Two options arise – to fill (calculate all values of) each window before calculating the values of the next one, or to calculate a limited set of values for all the windows and do so iteratively, until all sections, or tiles, are calculated. Which method should be used depends on the function; for example, LAG should use the latter approach, as illustrated in the sketch below.
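To make the two orders concrete, the following minimal sketch (our own illustration; compute_tile is a hypothetical stand-in for the per-tile work) contrasts window-major and tile-major iteration.

#include <cstddef>

// Hypothetical per-tile work; stands in for the actual window computation.
void compute_tile(std::size_t window, std::size_t tile) { (void)window; (void)tile; }

// Option 1: fill each window completely before moving to the next one.
void window_major(std::size_t num_windows, std::size_t num_tiles) {
    for (std::size_t w = 0; w < num_windows; ++w)
        for (std::size_t t = 0; t < num_tiles; ++t)
            compute_tile(w, t);
}

// Option 2: compute one tile (section) across all windows before advancing to the next
// tile – the order functions such as LAG prefer, since the previous window's tile is
// still cached in memory when the current window needs it.
void tile_major(std::size_t num_windows, std::size_t num_tiles) {
    for (std::size_t t = 0; t < num_tiles; ++t)
        for (std::size_t w = 0; w < num_windows; ++w)
            compute_tile(w, t);
}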

The Minimum function implementation is straightforward. Each value is initialized to NULL, the empty value, while a data scan updates values as necessary.

The Average function implementation uses two arrays, one for a value counter and one for the data – both reset to 0. Every non-NULL value we process increases the corresponding counter by 1, while adding the value to the data array. After all the data for the currently processed section has been scanned, we divide the second array by the first one (a ‘by place’ division), and process the subsequent windows (after resetting the array values back to 0).
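A minimal sketch of this two-array scheme is shown below; it is an illustration under our own assumptions (NULLs encoded as NaN, one flat vector per raw sample), not the engine's actual code.

#include <cmath>
#include <vector>

// Sketch of the two-array average: one counter array and one sum array per section,
// both initialized to zero. NULLs are assumed to be encoded as NaN; names and the
// "vector of raw samples" input layout are illustrative only.
std::vector<double> section_average(const std::vector<std::vector<double>>& samples,
                                    size_t cells_per_section) {
    std::vector<double> sum(cells_per_section, 0.0);
    std::vector<long>   count(cells_per_section, 0);

    for (const auto& sample : samples)                      // one scan over the raw data
        for (size_t c = 0; c < cells_per_section; ++c)
            if (!std::isnan(sample[c])) {                   // skip NULL (missing) values
                sum[c] += sample[c];
                ++count[c];
            }

    for (size_t c = 0; c < cells_per_section; ++c)          // "by place" division
        sum[c] = count[c] ? sum[c] / count[c] : std::nan("");
    return sum;
}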

Implementation of Minus

The MINUS function was described in Section 5.2.2. A significant complexity of the implementation comes from the data counter resets. This can be addressed by using three pre-allocated section caches – the previous values, the current maximums, and a carefully maintained current output (which is a temporary summation of all the reset values).

In Algorithm 11 we show the implementation of the MINUS analytical function in detail, including memory optimization and proper handling of counter resets. First, we iterate over the sections of the relevant variable. We populate each coordinate with its appropriate value (based on the presented calculations). After each section is populated, the calculated values are cached for the next calculation to take place. For brevity, we assume that once a non-NULL value has been assigned to a coordinate, at least one non-NULL value will appear again for this coordinate in each later section.

As can be seen, there is an additional optimization – we maintain calculated sub-results in place. When a function requires access to the previous window, as the MINUS function does, we alternate pointers to where memory is allocated, switching the reading and writing addresses, instead of moving values to different memory areas. This results in a more complicated memory addressing mechanism, but it minimizes disk accesses while also decreasing in-memory data movement.
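The pointer-alternation idea can be illustrated with the following sketch; the buffer names and the placeholder computation are ours, not the engine's.

#include <utility>
#include <vector>

// Illustration of pointer swapping: instead of copying the current window's results
// into a "previous window" buffer, the two buffers exchange roles each iteration.
void process_windows(size_t num_windows, size_t section_size) {
    std::vector<double> buf_a(section_size), buf_b(section_size);
    std::vector<double>* current  = &buf_a;   // written for the current window
    std::vector<double>* previous = &buf_b;   // read-only values of the previous window

    for (size_t w = 0; w < num_windows; ++w) {
        // Fill the current window's buffer; functions like MINUS or LAG read *previous here.
        for (size_t i = 0; i < section_size; ++i)
            (*current)[i] = static_cast<double>(w) - (*previous)[i];   // placeholder computation
        std::swap(current, previous);         // O(1) role swap: no values are moved in memory
    }
}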

We use multiple functions for handling the calculated results: CopyToOutput, SwapPointers, and SendOutputOfSection. Although for simplicity these are shown as independent function calls, in the implementation we carefully calculate where we should write to, and modify pointer addresses. These allow us to avoid unnecessary data access and movement.

Algorithm 11 The Minus Analytical Function

1: procedure MINUS()
2:   curr ← ALLOCATESECTIONMEMORY
3:   out ← ALLOCATESECTIONMEMORY
4:   lastMax ← ALLOCATESECTIONMEMORY
5:   previousWindow ← ALLOCATESECTIONMEMORY
6:   RESETTONULL(lastMax)
7:   RESETTONULL(previousWindow)
8:   for each section s do
9:     for each ordered window w in s do
10:      RESETTONULL(curr)
11:      RESETTONULL(lastMax)
12:      for each raw data source r in w do
13:        for each coordinate c in r do
14:          sc ← MAPTOSECTIONCOORDINATE(s, c)
15:          DoAddition ← False
16:          HadReset ← False
17:          if r[c] == NULL then
18:            continue
19:          end if
20:          if curr[sc] == NULL then
21:            curr[sc] ← r[c]
22:            lastMax[sc] ← r[c]
23:            if previousWindow[sc] != NULL &&
24:               curr[sc] < previousWindow[sc] then
25:              HadReset ← True
26:            end if
27:          else
28:            if lastMax[sc] == NULL then
29:              curr[sc] ← r[c]
30:            else
31:              if lastMax[sc] ≤ r[c] then
32:                if DoAddition then
33:                  curr[sc] ← curr[sc] + r[c]
34:                           − lastMax[sc]
35:                else
36:                  curr[sc] ← curr[sc]
37:                end if
38:              else
39:                DoAddition ← True
40:                curr[sc] ← curr[sc] + r[c]
41:              end if
42:              lastMax[sc] ← r[c]
43:            end if
44:          end if
45:          if !HadReset &&
46:             previousWindow[sc] != NULL then
47:            out[sc] ← curr[sc] −
48:                      previousWindow[sc]
49:          else
50:            out[sc] ← curr[sc]
51:          end if
52:        end for
53:      end for
54:      COPYTOOUTPUT(curr, s, out)
55:      SWAPPOINTERS(curr, lastMax)
56:    end for
57:    SENDOUTPUTOFSECTION(out, s)
58:  end for
59: end procedure

5.3.2 Optimization: Dimensional Restructuring

Dimensional restructuring is an expensive phase in the scientific query execution process. This process entails removing dimensional values for which all matching variable values are empty (NULL), decreasing the resultset size. Since this process requires not only a scan of the data, but also copying and remapping all existing values to new locations, it is extremely expensive.

The three approaches to restructure dimensions are:

• A la carte – dimensions are built on the go. Once a new dimensional value is detected, it is added.

• Post-building – execute sub-queries, accumulate results, and after all results are retrieved build the dimensions and the array.

• Pre-building – build the dimensional values, and afterwards fill the data; requires executing the queries twice, or caching some query results.

For the first approach listed above, multiple implementations are feasible. Since this process requires completely rebuilding the variable every time a dimensional value is added, we consider this option too expensive. In addition, since the array changes its size while it is formed, memory pre-allocation cannot easily be used. Moreover, current array data interfaces (NetCDF, for example) require the dimensional setting to be known before an array can be addressed, ruling this approach out as a viable option.

Post-building is the approach used in DSDQuery [44] (Chapter 2). In this approach, each sub-query runs once, while caching (to disk, due to the potential size) the results in an intermediate format. Then, the cached results are scanned twice – the first scan is used for building and configuring the dimensions, and the second data scan is used to populate the created variable. This approach does not fit queries with large selectivities well – for example, DSDQuery failed to execute a simple aggregation query because it needed to store more than 200GB of intermediate results to a disk smaller than that, in order to generate a 2MB output.

The last approach, pre-building, also includes a two-step process. We first run queries to extract and build the dimensions (in a streamed manner). Then, we run data-filling queries over the source data to assign the results to their place. This method enjoys the benefits of the previous approach, yet does not store intermediate results. We experienced a performance boost of up to 4,000% using this approach. This method works well mainly for condensed data, where executing the query twice is more beneficial than caching results to disk – the trade-off between the last two presented methods.

Working with climate data resulted in another optimization that is often valid for scientific data. If the data contains thousands of files, but all originate from a specific simulation method or set of sensors, only one file from each source needs to be used for building the dimensions (since simulations/data sources produce the same dimensional output).

5.3.3 Distributed Calculations

The calculation of the resultset can be distributed along two orthogonal options: sections and dimensions. In the first, section-based distribution, each section can be calculated by a different process (either locally or remotely). This results in each process accessing all input data sources, yet reading only smaller portions of them. The performance of this approach depends on the properties of the distributed storage system.

The other option is distribution by dimensions. In this option, each process accesses a set of dimensions from each relevant file. The advantage of using dimensions over sections comes from virtual dimensions. Since virtual dimensions are stored in separate files, if the distribution is based on a virtual dimension, two processes rarely access the same file. Thus, this method allows geographical file distribution, and optimizes file access. However, certain analytical functions need access to previous windows, and it is possible that two processes will need to access the same file.

Another aspect is the output file generation, which can be done using two different options: writing the output file in a distributed manner, or adding a process to accumulate results and generate the output file locally. We experimented with both options, and concluded that the differences in performance are marginal. In settings where the storage system is well connected, like ours, both approaches produce similar performance. In settings where the network is fast but storage is slow, adding a process near the end user that generates the final resultsets can be preferable. We chose the second approach since it works well in both settings. The added process for writing the results will be referred to as the management process and is discussed next.

Results Accumulation

We designate a process to accumulate the resultsets. In Algorithm 11 we showed how we transport results from the calculation nodes to the management process in units of sections (line 57). Next, for completeness, we describe the management process.

Algorithm 12 FDQ Results Accumulation

1: procedure ACCUMULATERESULTS()
2:   curr ← ALLOCATESECTIONMEMORY
3:   output ← INITIALIZEOUTPUT
4:   for each node n do
5:     if WindowCanBeCopied then
6:       address ← CALCULATEOFFSET(n, 0, 0)
7:       address += output
8:       GETRESULTSFROMNODE(n, address)
9:     else
10:      GETRESULTSFROMNODE(n, input)
11:      for each section s do
12:        for each window w do
13:          address ← CALCULATEOFFSET(n, s, w)
14:          address ← output + address
15:          COPYWINDOW(address, input, w)
16:        end for
17:      end for
18:    end if
19:  end for
20: end procedure

In Algorithm 12 we show how results are accumulated, taking sectioning into account. While when building the results we can remain agnostic to the final output layout, when accumulating results we may not. In the algorithm, we distinguish between two scenarios. The first is where each window is sequential and the memory contains the fully calculated window. In this scenario, we may simply copy the input to the output. The more complicated scenario is when the output sent from the calculating node is not sequential (sections are used). In this case, we need to copy section-by-section and window-by-window, reducing efficiency. As mentioned before, this use case is rare due to how scientific data is usually held; yet, for completeness, it must be addressed.
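For concreteness, one way the CALCULATEOFFSET step could be realized, assuming a simple layout in which every node contributes equally sized, window-major slices, is sketched below; the layout, parameters, and names are our assumptions rather than the engine's actual code.

#include <cstddef>

// Hypothetical sketch of an offset computation for results accumulation, assuming the
// global output is laid out node-major, then section-major, then window-major, with
// equally sized slices. All sizes and the layout itself are illustrative assumptions.
std::size_t calculate_offset(std::size_t node, std::size_t section, std::size_t window,
                             std::size_t sections_per_node, std::size_t windows_per_section,
                             std::size_t section_size) {
    std::size_t node_base    = node * sections_per_node * windows_per_section * section_size;
    std::size_t section_base = section * windows_per_section * section_size;
    std::size_t window_base  = window * section_size;
    return node_base + section_base + window_base;   // element offset into the output buffer
}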

5.4 System Implementation

5.4.1 FDQ Engine

We have developed two different versions of FDQ – a non-distributed version and a distributed one. The non-distributed version of FDQ is based on DSDQuery DSI [44] and is currently in use by scientists at ANL. The code base is implemented as a DSI plugin to Globus GridFTP [5, 33].

The distributed version uses MPI [57] for communication. To improve performance, we decrease the number of transported messages by calculating multiple sections and sending all of them to the management process together, within larger, but fewer, messages.

The implementation of both engines is in C++. The current system database is implemented using SQLite [93]. Although it is not the most efficient relational system, it was chosen because it is the easiest to maintain and access, especially when user permissions are limited. This database is used for internal engine metadata queries, such as which variables are stored and where, as explained in DSDQuery [44] (Chapter 2).

5.4.2 ANL Web Portal

A web portal has been implemented at ANL to facilitate scientists’ visual interaction with their data. In Figure 5.6, we show a screenshot of some of the provided functionality. This web portal is hosted on a Laboratory Computing Resource Center (LCRC)2 web application server. The FDQ engine is run from a location accessible to this portal.

2http://www.lcrc.anl.gov

Figure 5.6: ANL Querying Portal, backed by FDQ

The portal page presents scientists with the relevant search parameter values and models that can be queried, and then maps user input back on the application server into an FDQ query. An extended SQL syntax that includes support for Analytical Window Querying is used. The web application uses globus-url-copy3 (an open source GridFTP client) to submit the query to the FDQ-enabled server and retrieves the resulting NetCDF file. The NetCDF file is made available on an anonymous read-only GridFTP server. It can then be downloaded by the user using the Globus transfer service [6], which is a hosted GridFTP client and is more user-friendly than globus-url-copy. We chose this architecture since it is likely a user would like to download the results to their local environment or that different users would run the same query – this architecture eases the implementation of both.

3http://toolkit.globus.org/toolkit/docs/latest-stable/gridftp/user/

5.5 Evaluation

In this section, we evaluate our system. Our goal is to show that using FDQ is feasible with reasonable run times, that it is scalable, and that it outperforms earlier implementations (for supported queries). We consider both group-by queries and analytical queries, using both the non-distributed and distributed versions of our engine.

With the exception of experiments involving comparison with SciDB, experiments were executed on the Blues cluster at ANL. The cluster comprises about 350 nodes, each with 64GB of memory and Intel Xeon E5-2670 2.6GHz processors. All the data used for these experiments is real climate data generated by ANL. The datasets comprise the temperature and precipitation variables (Section 5.1), have 2 common dimensions (latitude and longitude) and 1 virtual dimension (sampling time), and contain data for North America. Each day usually consists of 8 samples, distributed among different files. The daily precipitation data (not including any metadata) is 24MB, while the daily temperature data is 704MB (the latter has an additional dimension – vertical levels). In our experiments we accessed up to 20TB of array data, and processed up to 1TB of it.

5.5.1 Average Function Performance

In this experiment we compared the performance of the Average analytical function against our predecessor engine and against the community-leading Scientific Array DBMS, SciDB [112]. This function does not require cross-window access, and can therefore be executed by using the GROUP BY clause for partitioning. We used an array that varied in size between 100MB and 90GB.

Figure 5.7: FDQ Execution Time of The AVG Function Over Partitions With Different Selectivities (16% and 100%) [128 days - 90GB]

In Figure 5.7 we show the query execution times and compare them to the precursor engine, DSDQuery (DSQL). Each queried day involves about 700MB of data, and thus a 128-day query processes 90GB of data. We consider two queries, one which averages 16% of all data, and another that averages 100% of the data. The baseline system crossed the timeout threshold, 15 minutes, when querying 90GB of data, which is why we report results only up to 128 days (and 32 days for 100% selectivity). While FDQ is comparable for smaller datasets, when larger datasets are used FDQ performs several times faster. For example, for a query over 32 days (22GB), FDQ performs 3.5 times faster for queries with low selectivity and 13.3 times faster when querying all data. FDQ’s improvement increases with dataset size – for 128 days (90GB) the improvement of FDQ over DSDQuery increases to 5.05 times. This stems from the low impact different selectivities have on FDQ query execution – querying 6 times more data slowed DSQL by about 600%, while FDQ slowed by at most 22%. This is because of the more nuanced memory management, which increases efficiency and decreases the amount of data movement.

Next, we compare our engine to SciDB. We used a different environment for this experiment, where we could install SciDB: Linux kernel 4.4 (Debian) on a machine with an i5-4590 processor and 16GB of RAM. Both engines were set to use only 1 process. We used 3 3-dimensional arrays (100MB, 1GB, 10GB). It should be noted that there is a significant cost for loading data into SciDB, which we do not account for here. Our queries calculated averages over different partitioning configurations. Specifically, the queries involve 1 partition (longitude, generating 5KB of data), 2 partitions (longitude and latitude, generating 1.8MB of data), or 3 partitions (which returns the original array).

Figure 5.8: Query Runtime Speedups: FDQ Over SciDB

In some of the queries we used value-based subsettings (WHERE val < x) to control selectivity – in FDQ, subsetting queries trigger the use of bitmap indexes to locate the queried data, and we wanted to measure their cost. For SciDB, we report the fastest execution out of 4 consecutive runs – this allows SciDB’s caches to fill correctly and be utilized. FDQ does not use cross-query caches, therefore warm-up runs are not necessary.

In Figure 5.8 we present the results of the SciDB comparison. Each bar in the figure shows the speedup of FDQ compared to SciDB, while the bars are grouped based on the number of partitions used in the queries. The vertical axis, speedup, is logarithmic. The baseline of the figure is at the value 1; bars that point downwards signal that SciDB performed better. The larger the dataset, the better FDQ is – for an array size of 100MB the execution times of both engines are similar, yet for larger arrays our speedup is between 2 and 21. When we use value-based subsettings, the results are more moderate – an improvement of between 1.2 and 2.586. This is due to the dimensional restructuring, the rebuilding of the array in a more condensed structure, and the generation of a NetCDF file as the query result. In comparison, SciDB emits coordinate-based results, which allows it to skip the restructuring process and handle sparsity better, but prevents users from using their existing tools to visualize and analyze query resultsets. The last presented setting, 3 partitions, has a speedup of 9.4x for the 10GB dataset, lower than the speedups measured for the two other cases, 19.23x and 20.53x. This is because of the use of virtual memory due to the resultset size.

The results for the “No Partition” group show that for most datasets, when no partitioning is used, SciDB performs better, as expected. FDQ is not optimized for this case. Yet, when data does not fit in SciDB’s cache, it performs quite badly – a factor of 6 slowdown was observed for the 10GB array.

Overall, we can see the following: FDQ’s memory optimizations are desirable for array data querying, and outperform previous designs. In addition, FDQ behaves well for queries with changing selectivities and different column orders in the PARTITION BY clause; SciDB slows down by 2 to 4 times if the partitions used for the window generation are not in the optimized order.

5.5.2 Analytical Function Performance and Scalability

We focus on the execution of several analytical functions, including the most expensive analytical function, Minus. While the support for Minus queries was only introduced here, for comparison, we upgraded DSDQuery to support this

Figure 5.9: Runtime of Minus Function: Benefits of Memory Optimizations in FDQ [1024 days - 24GB]

function while using the same methods for memory access and management. No other engine provides the ability to run these queries.

In Figure 5.9 we show a performance comparison of DSDQuery with FDQ. The X-axis scale is logarithmic, while the Y-axis is not. As can be seen, DSDQuery is slower, and its run times are not feasible for a scientific workflow – querying 16 days of data (about 384MB) takes more than 10 minutes using DSDQuery, while it takes 28 seconds using FDQ. Querying a larger amount of data, 64 days (1.5GB), takes more than 30 minutes using DSDQuery while taking only 59 seconds using FDQ.

Figure 5.10: Comparison of Analytical Functions Runtimes using FDQ [1024 days - 24GB]

Figure 5.11: FDQ Scale Up: Execution time of Average Function with Increasing Number of Threads

In Figure 5.10 we show a runtime comparison among the most commonly used functions. The graph shows that the simplest (maximum) and the most complicated (minus) analytical functions perform similarly, with the same pattern, suggesting any needed analytical function will execute efficiently, within these two boundaries.

Next, we evaluated how the system scales on a machine with 8 cores. We ran a query using the Average analytical function over a changing number of days – varying from 64 (1.5GB) to 512 (12GB). As can be seen in Figure 5.11, the system scales well, up to the number of cores available on the machine. The improvement is linear, as expected.

Figure 5.12: FDQ Scale Out: Execution Time of the Average Function With Increasing Number of Nodes

Next, we ran an average query in a fully distributed manner. We executed the same query for different numbers of days, calculating the daily average. All results were consistent – for example, processing 128 days on one node took 40.99s, while processing 256 days on one node took 81.71s; as an additional example, processing 1024 days took 6.57s using 64 nodes, while processing 256 days took 1.66s and processing 128 days took 0.93s on the same number of nodes.

In Figure 5.12 we report the run times of a query using the Average analytical function for a changing number of days. It is noticeable that as the parallelism increases, the performance improves linearly. An interesting pattern that was uncovered is that when an extremely small number of days is processed on one node (1 day in the 128-day experiment over 128 nodes), the performance degrades. This is a result of the overhead of the communication and reduction algorithms.

In conclusion, FDQ performs and scales well. The new memory management architecture, together with the optimizations described before, allows scientists to use structured query operators efficiently. Calculations are distributed in a manner that allows linear improvement with the number of nodes used. A noticeable advantage is the ability to run an analytical function over a large amount of data and windows in only a few seconds while processing a large number of raw files. For example, we processed 1024 days over 64 nodes in 6.57 seconds – run times that are unheard of within these communities.

5.6 Summary

In this work we introduced analytical functions to the domain of scientific array data. We showed that the current tools used by scientists are not sufficient, and introduced new syntax, algorithms, and techniques that enable the use of a structured querying engine to efficiently process complex analytical queries. We have done so by planning memory alignment, dividing calculations and executing them in parallel, and working in chunks that fit in memory. We evaluated our work over real datasets, and showed that we enable near real-time analytical querying, outperforming other options.

Chapter 6: An Optimizer for Multi-Level Join Execution of Skewed and Distributed Array Data - ScKeow

In this chapter we revisit the join optimization problem, which was partially addressed in Chapter 3 by DistriPlan. Our previous coverage of this issue addressed generic environments; yet, real environments (such as the ones described in Figure 1.4) have properties we can utilize to generate fewer evaluation plans.

Complex queries, such as the Tornado Query demonstrated in Figure 6.1, over data distributed in such real environments could not be optimized using DistriPlan, due to the large number of nodes in the environments used for their processing.

Here, we target efficient execution of join queries over scientific array data that is distributed across heterogeneous clusters, where each cluster is internally homogeneous, while considering how data skew changes during processing. We do so by utilizing environment properties (the similarity of machines within each used cluster) to generate an ideal plan (unlike the optimal plan DistriPlan generates).

We first consider the ability to convert data representation formats (from the traditional array representation to a sparse one) during query execution, and suggest integrating this ability within cost models. Then, we introduce the CF, Conversion

Figure 6.1: Example Query: Tornado Hypothesis Verification

Factor, and the MPU, Minimal Profitable Unit – parameters we use in our suggested cost model to prune and evaluate optional plans. We evaluate our methods and show that they are beneficial. In most heterogeneous settings our approach produces an ideal plan directly, while for homogeneous settings, in most cases, our approach produces half the number of plans compared with traditional approaches.

6.1 Background

The commonality of the Digital Sky Survey [71, 118], the Large Hadron Collider (LHC) project at CERN [1, 75], and gas and climate simulations [103, 108, 137] is that all generate increasing amounts of daily array data, starting at a few Megabytes and reaching up to 20TB. Due to the way it is collected, the data is often stored in raw data files in formats such as NetCDF and HDF5 [48, 102], distributed across multiple repositories. Traditionally, scientists wrote low-level programs or scripts to analyze this data [103, 108, 113, 137], but lately Scientific Database Management Systems (SDBMS), which allow querying array data with declarative languages [24, 25, 41, 43, 44, 81, 141, 145], have been emerging.

While advanced querying engines over scientific data exist [25, 41, 43], they are limited in the number of subsequent computation-heavy operators they can process in a single query [43, 145]. To optimize queries, these engines mostly use a set of rules and heuristics – these mainly fit simple queries [64, 78]. A few systems use a cost-based model [43, 45, 145], which can address complex queries, but they do not consider the overall network architecture, which we will show is essential.

In addition, array data can be represented in multiple ways. While the common representation is the traditional flattened and continuous layout, where many cells are empty (contain NULL values), sparse representations that decrease memory but require data reformatting and lookups [16, 96] can also be appropriate. Current systems stick with a specific, pre-chosen data representation, and after an initial conversion to that representation, process queries [43, 44, 111, 112]. The format change may occur either when data is loaded into their systems, e.g. SciDB [23], or after the query begins execution, e.g. DSDQuery [44] (Chapter 2). Scientists often

prefer to store the data in its native format, due to the versatility of the tools available for the native formats and the high cost of data loading – forcing data representation format changes to be applied at runtime.

6.1.1 Addressing Skew in Processing of Array Data: Open Problems and Our Contributions

Figure 6.2: Sckeow: Cost Based Model Optimization Engines Comparison

In Figure 6.2 we graphically compare the current state of the art optimizers for join queries over array data, considering what type of array skew they target. Prior work mainly focused on simple queries containing two variables (arrays) and one join [43, 45, 145]. Moreover, skewed data in Relational DataBase

Data skew optimization for array data is targeted in Skew-Aware Joins [43] – the authors assume data are stored unevenly on multiple nodes, while each node can- not access data of other nodes. A later work, Similarity Join over Array Data [145] widens the definition of joins to include aerial aggregations (similar to masking), but has a similar approach for query optimization (the main difference is that while both assume the most distributed approach is cheaper, the latter adds network con- gestion to the cost evaluation process, aiming to improve the network utilization).

DistriPlan [45] (Chapter3) addresses resource skew – more specifically, DistriPlan minimizes execution time while considering the full network architecture.

This work, Sckeow, considers both data skew and resource skew, while targeting clusters where storage is not local to each node but is centrally accessible to all cluster machines, networks have different capabilities, and each cluster has different hardware. Our work includes an optimizer for advanced join queries over distributed array data, where data becomes skewed while being processed over skewed resources.

Generating efficient query execution plans for complex queries over distributed arrays is challenging. First, each cluster has a different computational power, and it is hard to evaluate, based on the environment specification, which cluster is preferable for a specific calculation (while accounting for data movement as well). Second, adjudicating data transportation, shuffling, in a cost-based model requires spanning many plans for data transportation. Encapsulating and using the full network properties results in too many plans, which is impractical to generate and evaluate in a highly distributed setting. Third, a large number of joins in a single query results in an exponential number of options to process the query. Last, it might be preferable to change the data representation format (e.g., from the traditional array layout to a sparse representation).

Our solution involves the following components. We introduce the CF – Conversion Factor. Using this numerical value, which can be calculated for each array efficiently at run time, we prune plans that will not profit from a data representation change before they are generated. We also introduce the concept of the MPU, Minimal Profitable Unit, an automatic evaluation of cluster performance. The MPU entails multiple properties: a size that is used as a threshold for determining whether data calculation should be distributed within each cluster, the performance of processing this size on that cluster’s machines, and the network properties within each cluster and across clusters. The main innovation in relation to the MPU is that it is collected on each cluster by software that simulates distributed calculations, allowing an accurate computational power estimation. Last, we introduce a cost model for complex queries over highly dimensional scientific arrays.

We evaluate our system by using an execution engine that executes plans generated by different approaches. We show that methods which consider the networking properties are more efficient than those that do not – in most heterogeneous settings our approach directly spans an ideal plan, which executes 508% faster than plans produced by traditional optimization approaches. In addition, we show that although our method does not consider all the networking properties, it performs similarly, and in some cases better (on average, 172% faster), compared to an optimizer that does – resulting from the flexible and nuanced data format change our approach entails.

6.1.2 Sparse Matrices Representation

Arrays can be represented as sequential values that are mapped to different dimensional settings – this is the default array representation, and the one used by programming languages such as C++. This method of storage is wasteful when only a few values are stored in the array, and the rest are empty (NULL values). For example, when only the diagonal values are stored for a 2-dimensional array of size n × n, as is the case for the identity matrix, we store n^2 values while using only n of them. Sparse matrices address the storage and lookup for such arrays.

The most intuitive method of storing sparse arrays is by holding the data in the full setting format (dimensional values, followed by the array cell value). For the example above, the data would be stored as [1, 1, val1], [2, 2, val2]....

More advanced techniques have been developed to hold sparse matrices, most of which are targeted either at a specific problem or at a specific data distribution/structure [16, 96]. In this work, we represent our sparse matrices in a format called hashmaps-of-hashmaps. The first dimension has a hashmap in which each value points to the next-level hashmap, and so forth until the last-level dimension. At the last level we point to the values that were stored in the array. For communication, we build a continuous layout in the dimensional format and transport it – we do so since hashmaps are not sequential, and do not transport efficiently, while sequential structures do.
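The following is a minimal sketch of this layout for a 2-dimensional array, together with the flattening used before communication; the type and function names are illustrative, not the system's actual interfaces.

#include <unordered_map>
#include <vector>

// Sketch of a 2-dimensional hashmap-of-hashmaps sparse array: the outer map is keyed by
// the first dimension, each inner map by the second, pointing at the stored cell value.
using SparseArray2D = std::unordered_map<long, std::unordered_map<long, double>>;

// A flattened cell in the "dimensional format" used for communication:
// the dimension values followed by the array cell value.
struct FlatCell { long d1; long d2; double value; };

// Build a continuous layout for transport, since hashmaps are not sequential in memory.
std::vector<FlatCell> flatten(const SparseArray2D& arr) {
    std::vector<FlatCell> out;
    for (const auto& [d1, inner] : arr)
        for (const auto& [d2, value] : inner)
            out.push_back({d1, d2, value});
    return out;
}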

6.1.3 Array Data Joins and Aggregations

The definition of an array data join is debated, and defined differently in each work. Skew-Aware Join [43] defines a join as a comparison of matching cells or a comparison of all cells of one array to all cells of the other (resembling a Cartesian join). If a comparison condition holds, the value (or a structure containing both values, for inequality conditions) is returned. Similarity Join over Array Data [145] slightly expands this definition to include a wider set of criteria while using a mask on one, or both, of the arrays. The whole processing in these works is conducted in a sparse dimensional format.

In this work (Subsection 3.1.1) we provide an alternative, more generic definition for the join operation, which is based on how arrays are stored in real environments. The join, as defined here, is not based on matching array cells; instead, the match is made over the underlying dimensional mapped values. So if, for example, two arrays have dimensions that describe a location, and the location values are stored in a mapped variable [44], these values are used for the comparison, allowing arrays that are structured differently but describe the same content to be joined.

6.1.4 Query Optimizer

This subsection repeats some of the content that was described earlier in Subsection 3.1.3. It is provided for completeness, as well as to summarize and emphasize the content that is relevant for this chapter.

An execution plan, or simply a plan, is a tree representation that contains processing instructions for an execution engine to produce the correct query results. While the declarative query handles the “What”, the plan targets the “How”. Plans contain a hierarchical ordering of operators, each of which has at most two children nodes. When the plan tree is followed from bottom to top, the intended query results are produced.

Critical to producing an efficient query plan are query optimizers. Two types of optimizers are traditionally used – Rule-Based Optimizers (RBO) and Cost-Based Optimizers (CBO). In RBOs, we use a set of predetermined rules or heuristics. For example, a rule most RBOs have is “push down all subsettings to the lowest possible tree level” – this is a good rule since it decreases the effort of the “upper” part of the plan, which often requires heavier calculations; the less work represented there, the faster results will be produced. RBOs are unlikely to build optimal plans in complex scenarios due to the generality of the rules they use.

CBO's approach the optimization differently. Instead of determining in advance what to do, they generate (span) all viable options, as determined by a set of simple rules. CBO's use a Cost Model, which assigns a cost to each plan node – the cost of a plan is a function over the costs of all the nodes in it. The goal of a CBO is that the cheapest plan will execute more efficiently than more expensive plans.

A common mistake is to assume that the cost function is a time estimation.

While in some cases it is true that we measure efficiency by time (but not always – for example, on mobile devices efficiency would often be linked to energy consumption), the cost model is not intended to predict time directly. Even when efficiency is measured by execution time, a good cost model would guarantee that if the cost of plan 1 is smaller than the cost of plan 2, it would also execute more efficiently (faster). The difference between two plans' costs does not indicate how much better one plan is than the other.

The first phase in a CBO is spanning plans, which are later evaluated by the cost model. An ideal approach is to span and evaluate all possible plans. Yet, for complex queries, this results in an exponential number of plans, which cannot even be held in memory. Therefore, we must prune inviable options and span only relevant plans. The decision of which plans are viable and which are not is performed by a plan spanner. The plan spanner gets as input a parsed query, generated from the user's input, and builds all possible execution plans that satisfy the query. It spans plans with the operators in different orders, as well as different options to satisfy the same user criteria – for example, for a join query requiring sorted results, the plan spanner may consider a raw data sort followed by a merge-join, or a join followed by a sort. In many cases, the plan spanner can prune plans before they are generated; for example, if there is a join and a subsetting, the subsetting should be executed before the join.

6.2 Overview - Sckeow Optimization

6.2.1 Formal Problem Statement

Given n clusters with m_i machines in the i-th cluster, n(n − 1)/2 different networks that connect the n clusters pairwise, n internal networks (one for each cluster), data descriptors (which describe the data locations, distributions, and possibly other information), and a (join) query, our goal is to build an efficient execution plan to correctly execute the query. In other words, the problem we target is how to optimize a query over distributed data, given a resource-skewed (or heterogeneous) environment.

6.2.2 The Plan Generation Process

We suggest using a two-level optimization approach. The first level addresses global processing – e.g., which cluster should process which part of the query. Since the communication is evaluated based on message sizes, the decisions regarding data representation conversions are made at this level. The second level is local to each cluster, i.e., subject to the global plan, we build a local execution plan based on the required tasks each cluster has to perform. In most cases the second-level optimization can be done by an RBO.

Unlike in the internal cluster setting, we assume each cluster has a different configuration – i.e., each cluster is homogeneous but different clusters have different configurations. This allows us to simplify some of the optimizer calculations.

High-Level Plan

This plan spanning stage considers multiple aspects: each cluster’s computing power, data representation, data distribution across clusters, and the cost of data

format conversion. Though individual steps will be described later, the summary of the process is as follows. We begin by finding all the data sources relevant to the query (while analyzing where they are stored), and build all possible join orders for these sources up to the root. Then, we evaluate each plan's cost by using the relevant cost equations.

In Algorithm 13 we present the high-level optimizer. First, the query is parsed, and afterwards all data sources are extracted and feasible join options are spanned.

Then, data distribution is considered (by spanning for each plan all its distribution options, and as a consequence, all relevant data transportations). Last, the cost of each plan is evaluated and the cheapest plan is chosen. We discuss the plan distribution process in Subsection 6.3.1 and the cost evaluation in Subsection 6.3.2.

Algorithm 13 Query Optimization: High Level Optimizer
1: function HIGHLEVELPLAN(query)
2:     pq ← PARSEQUERY(query)
3:     ds ← EXTRACTDATASOURCES(pq)
4:     plans ← BUILDALLPOSSIBLEJOINS(ds, pq)
5:     for each plan p in plans do
6:         dplans ← DISTRIBUTEPLAN(p)
7:         for each plan dp in dplans do
8:             dp.cost ← EVALUATECOST(dp)
9:             DistributedPlans.append(dp)
10:        end for
11:    end for
12:    chosenPlan ← FINDCHEAPEST(DistributedPlans)
13:    return chosenPlan
14: end function

Low-Level Plan

The low-level plan is built locally within each cluster. This plan is built mainly by using traditional methods [49, 61, 78] – intra-cluster communication often has high throughput and low latency, therefore heuristics can be used. Although the low-level plan can be built using traditional methods, we adjust the local rules to use the computing power we measure (Subsection 6.3.1) when making decisions.

6.3 Cost Model

6.3.1 MPU - Minimal Profitable Unit

MPU's are intended to assess the execution environment properties and aid in evaluating each cluster's computational power. We measure two properties: the Minimal Distribution Size (MDS) and the Communication (comm), and use them to extract three MPU's: MPU_MDS, MPU_comm, and MPU_CP. The first, MPU_MDS, is used to decide how to distribute calculations (how many nodes should be involved in an operation). The second, MPU_comm, is used to evaluate which cluster should process operations. The last, MPU_CP, which is itself based on the MPU_MDS evaluation, is used within the cost model to anticipate each operation's cost.

MPU_MDS is a dataset size such that if we process an array smaller than it in a distributed manner, it will perform worse than processing on a single machine. We evaluate this by first building a large array with random data and locally (using one node) calculating aggregations (and joins) over this data. Then, we divide the array into smaller chunks and distribute the same calculation over the cluster. We repeat this process with smaller arrays until there is no performance gap between the local calculation and the distributed one. At this point, we have obtained the MPU_MDS.

The MPU_MDS is calculated for each cluster separately. We assume that the MDS determination will terminate; it might be necessary to set search bounds in some environments. The MDS is mainly used to determine how many nodes should be used within a cluster for executing calculations, preventing the performance degradation observed in Figure 5.12. This is done by ensuring each node gets input data that is at least the size of the MDS, an action that might result in utilizing fewer resources than are available.
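The calibration loop can be summarized as sketched below; run_local and run_distributed stand in for benchmark runs of the same aggregation (or join) on one node and on the whole cluster, and are assumptions rather than real interfaces of our engine.

```python
# Hypothetical sketch of the MPU_MDS calibration described above.
def find_mds(run_local, run_distributed, start_size, min_size, tolerance=0.05):
    """Shrink the benchmark array until distributing it no longer beats one node."""
    size = start_size
    while size >= min_size:                      # search bound, as noted above
        t_local = run_local(size)                # seconds on a single node
        t_distributed = run_distributed(size)    # seconds across the cluster
        if t_distributed >= t_local * (1.0 - tolerance):
            return size                          # no meaningful gap left: this is the MDS
        size //= 2                               # repeat with a smaller array
    return min_size
```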

MPU_CP is a numeric value, where the best performing cluster has the score 1. All other clusters get a score such that dividing their performance by their MPU_CP yields the performance of the same operation on the best performing cluster. Although internally we distinguish the performance measured for each type of calculation (some operations perform better on specific machines), in the following discussion, for brevity, we do not differentiate between them.
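For illustration, normalizing measured benchmark times into MPU_CP scores could look as follows; this is a sketch, and the benchmark times are made up (chosen to reproduce the scores used in the Tornado example later in this section).

```python
# Hypothetical sketch: the fastest cluster gets MPU_CP = 1, others are scaled relative to it,
# so dividing a cluster's measured time by its MPU_CP yields the best cluster's time.
def mpu_cp_scores(benchmark_seconds):
    best = min(benchmark_seconds.values())
    return {cluster: t / best for cluster, t in benchmark_seconds.items()}

print(mpu_cp_scores({"cluster1": 12.0, "cluster2": 36.0, "cluster3": 15.6}))
# {'cluster1': 1.0, 'cluster2': 3.0, 'cluster3': 1.3}
```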

The communication MPU (MPU_comm) is used to assist in evaluating the profitability of transporting data across clusters. For example, a cluster with fewer nodes might be stronger in total than a larger cluster, resulting in the need to evaluate which cluster should process what part of the query. MPU_comm can be viewed as a function that takes two parameters – which cluster sends data and which receives it – and returns a numerical value, or cost, for transporting the MDS between these two clusters in the given direction. We correlate the MPU_comm values with the MPU_CP, so that both represent the same performance units.

To measure MPU_comm, we distribute a join operation within a cluster, where each node holds an array of the size of the MDS – since we use an equi-join, arrays are not transported within the cluster (in contrast to across clusters), but are read from the cluster storage. Then, we distribute the join calculations across two clusters – an additional cluster is used to hold one of the arrays and transport it to the cluster that previously performed the calculation. We calculate the performance gap between these two executions, which uncovers the cost of the communication from the first cluster to the one performing the calculation.

Taking our Tornado query example (presented in this chapter's motivation, Figure 6.1), the data is split across 3 clusters (which we refer to as 1, 2, and 3), where two clusters have similar CPU's (1 and 3), while the other cluster's CPU is weaker. This was measured by our MPU software as: cluster 1 – 1.0, cluster 2 – 3.0, and cluster 3 – 1.3. The networking between clusters 1 and 2 is of LAN speeds (10GBPS), while cluster 3 is at a remote site, with access speeds of less than 1GBPS. After correlation, examples of communication scores are: about 40 for the network between clusters 1 and 3 (38.2 when sending data to cluster 3, and 42 when receiving from it), and about 5 between clusters 1 and 2. Notice that the higher networking numbers suggest the CPU's are strong – this is the result of correlating and scaling the values to the same time units.

6.3.2 Cost Evaluation

EvaluateCost, the call in Line 8 of Algorithm 13, initiates the plan cost evaluation. The equations for calculating the costs are shown in Table 6.1. Here, E(n) is a normalized data size, representing the expected read load. In some cases we use left and right, referring to the current node's inputs (children). When only one child exists, we assume it is the left one.

The cost of a plan is calculated recursively over its sub-plans. Given the root in the initial call, a plan cost is:

PlanCost(n) = Cost(n) + max_{i ∈ n→kids} PlanCost(i)
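A direct transcription of this recursion is sketched below; the plan-node structure is an assumption. Taking the maximum over the children, rather than their sum, presumably reflects that sibling sub-plans can execute concurrently.

```python
# Hypothetical sketch of the recursive plan-cost evaluation.
from dataclasses import dataclass, field
from typing import List

@dataclass
class PlanNode:
    cost: float                                   # Cost(n), from the cost model (e.g., Table 6.1)
    kids: List["PlanNode"] = field(default_factory=list)  # at most two children

def plan_cost(node: PlanNode) -> float:
    """PlanCost(n) = Cost(n) + max over n's children of PlanCost(child)."""
    if not node.kids:
        return node.cost
    return node.cost + max(plan_cost(k) for k in node.kids)

# A join over two scans: only the slower child branch is added to the join's own cost.
root = PlanNode(cost=5.0, kids=[PlanNode(2.0), PlanNode(7.0)])
print(plan_cost(root))   # 12.0
```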

In Table 6.1 we demonstrate a subset of the cost equations to illustrate our overall approach. We use multiple markers: dims is the set of dimensions.

Operator                 Cost(n)
Filter                   mpu_cp[cluster] × ∏_{i∈dims} i→length
Filter of Sparse Data    mpu_cp[cluster] × E(left) × SGL + (E(left) − E(n)) × SGC
Distribute               (mpu_comm[from][to] + mpu_cp[cluster]) × ∏_{i∈dims} i→length + ∏_{i∈dims} i→length
Distribute Sparse Data   mpu_comm[from][to] × E(left) × SGL + (mpu_cp[from] + mpu_cp[to]) × E(left) × SGC
Join, Not Sparse         mpu_cp[cluster] × (E(left) × E(right) + E(n))
Join, One Sparse         mpu_cp[cluster] × (E(left) × SGL + E(right)) + E(n) × SGC
Join, Both Sparse        mpu_cp[cluster] × E(left) × SGL × 2 + E(n) × SGC

Table 6.1: Node Cost Examples. Join Costs Calculated for a Non-Sparse Nested-Loop Equi-Join.

SGC stands for the Sparse data Generation Cost, and refers to the cost involved in adding or removing items from the sparse structure. Similarly, SGL refers to the lookup cost over the sparse data. In all cases, cluster is the id of the cluster that performs the calculations.

As an illustration of some of the factors: the cost of filtering data in the traditional representation equals the cost of scanning it. However, the cost of filtering data in a sparse representation is different – it is the size of its input (in the condensed representation) plus the number of items removed from it multiplied by the cost of removing those items. Another approach to implement subsetting is to create a new sparse matrix to which each value that satisfies the criteria is added (instead of being removed, as in the other approach) – the decision between the two can be made by an optimizer, based on the expected selectivity.
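The sketch below illustrates how an optimizer could choose between the two sparse-subsetting strategies from their modeled costs. The removal cost follows the "Filter of Sparse Data" row of Table 6.1 (without the mpu_cp factor, which cancels in the comparison); the construction cost and the SGL/SGC values are illustrative assumptions.

```python
# Hypothetical sketch: choose how to implement a sparse subsetting from its modeled cost.
def cost_filter_by_removal(n_input, n_output, sgl, sgc):
    # Look up every stored item, pay the removal cost for each dropped item.
    return n_input * sgl + (n_input - n_output) * sgc

def cost_filter_by_construction(n_input, n_output, sgl, sgc):
    # Look up every stored item, pay the insertion cost only for each kept item (assumed).
    return n_input * sgl + n_output * sgc

def choose_sparse_filter(n_input, expected_selectivity, sgl=1.0, sgc=3.0):
    n_output = int(n_input * expected_selectivity)
    remove = cost_filter_by_removal(n_input, n_output, sgl, sgc)
    build = cost_filter_by_construction(n_input, n_output, sgl, sgc)
    return "remove_non_matching" if remove <= build else "build_new_matrix"

print(choose_sparse_filter(1_000_000, expected_selectivity=0.9))  # remove_non_matching
print(choose_sparse_filter(1_000_000, expected_selectivity=0.1))  # build_new_matrix
```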

6.4 Plan Generation Considerations

In our optimization process, generating a large number of plans is not desirable, for several reasons, including the time it takes to evaluate the plans and the memory required for storing them. Here, we introduce several methods to prune plan options before they are even generated.

6.4.1 Plan Pruning Communication

The approach we use to prune communication options is as follows. We first use the MPU's to evaluate the performance of the relevant calculation on each of the candidate clusters, as well as the cost of transporting the produced result set to the cluster that originally distributed its data. For example, if a join involves clusters 1 and 2, we evaluate the performance of calculating the join on each cluster, and the cost of transporting the results to the other involved cluster.

If calculating the results on a specific cluster and sending them to the other one is faster than calculating them directly on the other cluster, we use that configuration, and the other options are pruned. If it is not clearly preferable to use a specific cluster, all options are spanned and the cost model is used to evaluate them all, while considering their global effect.

This method ensures that even if data transportation is later necessary because the results were not produced on the cluster that consumes them, the chosen plan remains globally cheaper, and therefore the cost model is correct.
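A sketch of this pairwise check using the MPU values follows; the function, its parameters, and the cost expressions are simplified assumptions rather than the engine's actual code.

```python
# Hypothetical sketch of pruning the processing location for a two-cluster join.
def choose_join_cluster(work, result_size, mpu_cp, mpu_comm, c1, c2):
    """Return the single preferable cluster, or None if the cost model must decide."""
    # Compute on c1 and ship the results to c2 -- and the symmetric option.
    cost_on_c1 = work * mpu_cp[c1] + result_size * mpu_comm[(c1, c2)]
    cost_on_c2 = work * mpu_cp[c2] + result_size * mpu_comm[(c2, c1)]
    # Compute on a cluster without shipping (the consumer already holds the result).
    local_c1 = work * mpu_cp[c1]
    local_c2 = work * mpu_cp[c2]
    if cost_on_c1 < local_c2:   # computing on c1 and sending beats computing on c2
        return c1
    if cost_on_c2 < local_c1:
        return c2
    return None                 # not clearly preferable: span both, let the CBO decide
```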

CF – Conversion Factor

We define CF to be a threshold for the profitability of changing data representation formats. For example, if the CF is 5% and our query contains a join of two arrays with an anticipated selectivity of 2%, it might be preferable to build the join results in a sparse representation instead of in the traditional array memory layout. Should the selectivity be 7%, the traditional memory layout should be used. Dimensional subsettings, unlike value-based subsettings, do not generate sparse matrices, since they decrease the total matrix size rather than removing sporadic values within it.

CF is actually structured as a two-tuple (s_u, s_l). The threshold above which changing the data representation format would harm performance is defined as s_u. We also calculate a lower threshold below which not changing the data representation format would harm performance; this selectivity is defined as s_l.

More formally, s_u and s_l are calculated as follows. We use dimCount for the number of dimensions and dim for the dimension data type – for simplicity, we assume all dimensions are of the same type. The value of the CF upper bound is:

s_u = 1 / (dimCount + sizeof(data)/sizeof(dim)) = 1 / (dimCount + 1)    (6.1)

This matches our intuition, since the sparse matrix representation may weigh up to the number of dimensions multiplied by the number of values within the matrix, and any selectivity above s_u would produce a size that is potentially bigger than the original array, and definitely more expensive to transport and process. (The second equality in Equation 6.1 holds when the value and dimension types have the same size.) This equation should be adjusted to each relevant data representation and to the actual data types.

s_l = s_u / cf_v    (6.2)

where cf_v is a factor of the computation required for converting the data format – for any practical purpose, that would be the number of data passes required to convert the data to its new representation multiplied by the number of operations required for it. For the standard sparse matrix representation, the value of cf_v is simply the number of dimensions increased by 1 (each value is read once, while each dimension is looked up in the relevant dimensional storage map once as well).

These two terms can be simply stated as s_u ≈ 1/(dimCount + 1) and s_l ≈ s_u², which allows using simple logic for pruning plans.
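Putting the two thresholds together, the representation decision for a node can be sketched as follows (under the assumptions above; e.g., for a 3-dimensional array whose value and dimension types have the same size, s_u = 1/4 and s_l = 1/16).

```python
# Hypothetical sketch of the CF-based pruning of data-representation choices.
def cf_thresholds(dim_count, sizeof_data=8, sizeof_dim=8):
    s_u = 1.0 / (dim_count + sizeof_data / sizeof_dim)   # Equation 6.1
    cf_v = dim_count + 1                                  # passes needed for the conversion
    s_l = s_u / cf_v                                      # Equation 6.2
    return s_u, s_l

def representation_options(expected_selectivity, dim_count):
    s_u, s_l = cf_thresholds(dim_count)
    if expected_selectivity > s_u:
        return ["dense"]          # conversion would only hurt: prune sparse plans
    if expected_selectivity < s_l:
        return ["sparse"]         # not converting would hurt: prune dense plans
    return ["dense", "sparse"]    # in between: span both and let the cost model decide

print(cf_thresholds(3))                   # (0.25, 0.0625)
print(representation_options(0.9, 3))     # ['dense']
print(representation_options(0.02, 3))    # ['sparse']
print(representation_options(0.1, 3))     # ['dense', 'sparse']
```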

6.5 Evaluation

In this section, we evaluate our approach. Specifically, we show that our optimized plans perform better than plans that were optimized by the approaches described in Skew-Aware Join [43], Similarity Join [145], and DistriPlan [45].

The datasets we use contain real temperature data generated by climate simulations [127, 147] conducted by the Department of Energy (DOE) at Argonne National Laboratory (ANL). We used multiple variables, including temperature, precipitation, and others. The dataset sizes range from a few MB up to 2 TB. In these datasets, each variable is distributed among multiple files, where each file contains data for a specific time; thus, the time dimension is a virtual dimension. Our experiments ran on a cluster where each node has 16 cores, i.e., two 2.6 GHz Intel(R) Xeon(R) E5640 processors, with 24 GB of memory, running Linux kernel version 2.6. Nodes are connected using Infiniband.

6.5.1 Optimization Performance

We first focus on the efficiency of the optimization engine, by executing queries and measuring the number of spanned trees each engine generates. We use queries with between 1 and 255 joins; we note that most real queries contain between 3 and 15 joins. In each query there are log2(N) + 1 involved variables (N is the number of joins). The queries are structured so that all variables join with each other, similarly to the example query. The selectivity of each join is configured to be the same for all joins in a query, and set to 0.06, 0.3, 0.6, or 0.9. We chose these values because a selectivity between 0.06 and 0.3 falls in the range that prevents pruning some of the plans (between the CF lower and upper bounds).

Se     St   1 Join  3 Joins  7 Joins  15 Joins  31 Joins  63 Joins  127 Joins
0.9    T    1       1        1        1         1         1         1
       Sp   1       2        4        9         19        50        107
0.6    T    1       1        1        4         25        676       456,976
       Sp   1       2        4        8         18        60        27,085
0.3    T    1       1        4        16        256       65,536
       Sp   1       2        4        8         22        2,156
0.06   T    1       1        1        1         1         1         1
       Sp   1       2        4        8         17        37        86

Table 6.2: Sckeow Non-Distributed Trees Spanning Performance; Increasing Number of Joins with Different Selectivities. Title abbreviations: Se – Selectivity, St – Settings. Settings abbreviations: T – # of Trees, Sp – Span Time. Spanning time is reported in units of 100µs (1 × 10^-4 seconds is reported as 1).

We limited the duration allowed for spanning plans, as is often done in real optimizers, to 10 seconds.

Homogeneous Configuration

We first evaluate the plans where the MPU's are fixed to be equal for all parameters, and we configure the environment to consist of 2 clusters. With the exception of DistriPlan, all optimizers are agnostic to the size of each cluster. In all experiments where data is distributed, the data is initially distributed in a manner that forces data transportation – if there are 4 arrays, 1, 2, 3, and 4, and the executed joins are between arrays 1-2, 3-4, and the two previous joins' results, then arrays 1 and 2 and arrays 3 and 4 are placed on different clusters. This design forces the worst case for data transportation optimization.

# of Joins   1   3   7     15       31
0.9          1   4   36    2,916    19,131,876
0.6          1   4   36    11,664   28,697,814
0.3          1   4   144   46,656   28,697,814
0.06         1   4   36    2,916    19,131,876

Table 6.3: Sckeow Distributed Tree Spanning Performance; Queries with Different Selectivities

In Table 6.2 we present the tree spanning performance for queries with different selectivities. An empty cell means it took more than 10 seconds to produce plans for the referenced setting. As can be seen, when the selectivity is 0.6 or 0.3, at some point of processing the selectivity falls between the CF thresholds, resulting in the inability to prune these plans and the need to evaluate an increasing number of plans using the cost model. But when selectivities are low or high, and it is clear whether, and where, data conversion should be made, our optimizer prunes the other options and produces a manageable number of trees within a relatively short time. As can also be seen, the more joins in a query, the longer it takes to process it; yet while the number of joins increases exponentially, the processing time increases linearly in most cases, and depends on the number of generated trees – not on the number of joins.

In Table 6.3 we show the number of trees spanned for a distributed setting. An interesting result is that the spanning of the plans is independent of the number of clusters – the number of spanned plans for data distributed over 2, 3, 4, and 5 clusters is the same.

# of Joins        1   3    7       15           31
Sckeow-Best       1   4    36      2,916        19,131,876
Sckeow-Worst      1   4    36      11,664       28,697,814
Skew-Aware        2   6    54      4,374
Similarity Join   2   6    54      4,374
DistriPlan - 2    4   16   256     65,536
DistriPlan - 3    6   36   1,296   1,679,616
DistriPlan - 4    8   64   4,096   16,777,216

Table 6.4: Sckeow Distributed Plan Spanning Engines Performance Comparison. The number mentioned next to DistriPlan is the number of nodes in each cluster.

In Table 6.4 we compare the number of plans spanned by each engine. While our engine adds additional nodes for evaluating data representation changes, other engines do not. Sckeow performs best when the join selectivity is either high or low (“Sckeow-Best”). The other engines generate about twice as many plans, in at least twice the time it takes Sckeow to generate and select its plan. Unlike the other systems, DistriPlan depends on each cluster's size – it considers intra-cluster communication. This results in an extremely high number of options it spans.

Overall, we can conclude that Sckeow performs well. The spanning time and the number of plans are at most the same as those of other engines when the configuration is similar. Adding the ability to convert data representations increases the number of plans issued, but the number is still manageable.

Variable Cluster and Network Setting

Next we evaluate how our optimizer behaves when settings are not uniform, as is the case with many distributed environments. In Table 6.5 we show the scale of the number of plans spanned when environment parameters were modified to the values shown there.

Configuration       Engine   1    3     7        15       31
# of Nodes: 5, 5    Sc       1    1     1        1        1
MCO: 0.1, 0         Si       2    6     54       4×10^3   28×10^6
MCP: 1, 1           D        10   100   10^4     10^8
Se: 0.9

# of Nodes: 7, 5    Sc       1    1     1        1        1
MCO: 0, 0           Si       2    6     54       4×10^3   29×10^6
MCP: 1, 1           D        12   144   21×10^3  43×10^7
Se: 0.9

# of Nodes: 5, 5    Sc       1    1     1        1        1
MCO: 0, 0           Si       2    6     54       4×10^3   29×10^6
MCP: 1, 1.01        D        10   100   10^4     10^8
Se: 0.9

# of Nodes: 5, 5    Sc       1    1     1        4        25
MCO: 0, 0           Si       2    6     54       4×10^3   29×10^6
MCP: 1, 1.01        D        10   100   10^4     10^8
Se: 0.6

# of Nodes: 3, 6    Sc       1    4     25       961      10^6
MCO: 0, 0           Si       2    6     54       4×10^3   29×10^6
MCP: 2, 1           D        9    81    7×10^3   43×10^6
Se: 0.9

Table 6.5: Scale of the Number of Trees Spanned for Heterogeneous Environments. Engines: Sc – Sckeow, Si – Similarity Join (and Skew-Aware Join), D – DistriPlan. Configurations: MCO – MPU_comm, MCP – MPU_CP, Se – Selectivity.

We highlight the modified values in each row. The first column describes the number of nodes in each cluster – the setting 3, 6 means the first cluster had 3 nodes while the second had 6. Except for DistriPlan, the cluster size does not affect the number of spanned plans (we kept this number small so DistriPlan could finish execution).

As can be seen, in nearly all experiments Sckeow pruned most options based on the MPU's. Only in the last row did Sckeow span a large number of plans, and this is because we provided contradicting values – the first cluster has half the nodes of the other one, but each of its machines is exactly twice as strong.

We conclude that Sckeow outperforms other systems when the environment is not uniform, as is the case with most real environments. In most cases Sckeow can issue an ideal plan directly without evaluating costs.

6.5.2 Query Execution Times

Next we execute a generic query (shown earlier, in Figure 6.1) over increasing array sizes and different cluster configurations, used to emulate different realistic settings. Data skew is introduced through the join selectivity setting, and is addressed by the data representation conversions.

In all experiments, we logically configure three clusters (over the one physical cluster we use). Each query was executed 3 times, and since the results are consistent, we report the average performance among all executions. In the heterogeneous setting, the machine and network properties are throttled to the desired setting by using the MPU configuration.

Homogeneous Setting

We run the query in a setting where we do not modify the MPU's, all the machines have the same CP, and the network is not throttled. The MPU parameters used are collected automatically and used as is.

In Figure 6.3 we show the execution time (in seconds). The horizontal axis is logarithmic, and missing bars indicate that the corresponding engine ran out of memory. As can be seen, Sckeow behaves well – it performs on average 3% better than DistriPlan, 554% better than Skew-Aware Join, and 470% better than Similarity Join in the homogeneous setting. The added complexities used for addressing heterogeneous settings do not harm the performance in the homogeneous setting. DistriPlan and Sckeow have similar performance, as do Skew-Aware Join and Similarity Join. In both cases the plans for retrieving the data are quite different (e.g., Sckeow reformats the data to be held in a sparse representation). DistriPlan is faster than Sckeow in some cases because of its different networking evaluation – DistriPlan separates the latency from the throughput, while we measure all networking performance in one unified, simple parameter. However, as we have already shown, DistriPlan needs to enumerate a very large number of plans, making it infeasible in practice.

In Figure 6.4 we present a breakdown of the time spent per task for each engine. As seen, although the execution time is similar for different engines, the chosen plans are quite different. DistriPlan does not convert the data representation, while all other engines do – compared to Sckeow, both spend similar portions of their execution on data transportation, but Sckeow performs better since the transported data is smaller thanks to the nuanced data conversion.

[Figure 6.3: bar chart comparing Similarity, Skew Aware, DistriPlan, and Sckeow; one group of bars per configuration, from 60MB on 3 nodes up to 1.6TB on 105 nodes; execution time axis from 0.25 to 256 seconds.]

Figure 6.3: Sckeow Execution Time (in Seconds) Comparison for Different Array Sizes on Increasing Number of Nodes

[Figure 6.4: four pie charts – Sckeow, Skew-Aware, DistriPlan, and Similarity Joins – breaking down execution time into data send, communication, scan time, calculations, and data conversion.]

Figure 6.4: Breakdown of Time Spent per Task by Engine – Sckeow, DistriPlan, Skew-Aware Join, and Similarity Join

Similarity Join is intended to solve the network contention observed in Skew-Aware Join, and as evident from the “data send” task portion, it does. However, this contention removal results in longer communication waits. DistriPlan and Sckeow avoid the network contention through their cost model.

We conclude that Sckeow behaves well – queries optimized by Sckeow perform on average 468% faster. Compared to each of the other engines, Sckeow's performance is either similar or better.

Heterogeneous Setting

Now, we focus on experimenting with a static number of nodes, 60, while changing resource allocations. In this set of experiments, we allocated 10 nodes for the first cluster, 20 for the second, and 30 for the last. We keep each variable (global array) size constant at 7.2GB (there are 5 variables involved in the query). As before, we use the Tornado query (Figure 6.1).

The network setting is configured to slow the Infiniband from its approximately 40GBPS down to 10GBPS or 1GBPS. When nothing is mentioned, the network was used at its full capacity. When LAN is mentioned, we throttled the speed to around 10GBPS across all clusters. The WAN configuration refers to cluster 1 being remote (throttled to 1GBPS) from clusters 2 and 3, which are themselves on a LAN connection (10GBPS). These values were chosen based on the actual network architecture where the original data is held.

The values presented as “MPU” refer to the enforced hogging of the Computational Power (MPU_CP) – a value of 1 means no hogging for that cluster, 2 means the CPU is twice as slow for arithmetic calculations, etc. We show the hogging configuration for each cluster in the order of the clusters – a setting of 1,2,3 means the first cluster has an MPU_CP value of 1, the second of 2, while the third has a value of 3. These values were chosen based on the MPU_CP values observed on the several clusters we use.

In Figure 6.5 we show the results of executing the query in a heterogeneous environment. Overall, Sckeow is faster by 9% to 668% (an average improvement of 72% compared to DistriPlan, 508% compared to Skew-Aware Join, and 509% compared to Similarity Join). Here, as in the previous experiments, although some engines perform similarly, their execution plans are different. Skew-Aware Join and Similarity Join perform poorly because of the initial data conversion – our analysis of the results shows that removing the conversion from these engines results in Similarity Join performing similarly to DistriPlan, while Skew-Aware Join performs about 3-4 times slower than Sckeow.

We conclude that the combination of the MPU's and the CF is essential for spanning pruned, yet ideal, plans. The combination of data conversion on demand, together with relevant estimation of networking overheads and execution times, allows Sckeow to outperform other approaches.

[Figure 6.5: bar chart comparing Similarity, Skew Aware, DistriPlan, and Sckeow across settings – No Hogging, MPU=(3,2,1), MPU=(1,2,3), MPU=(1,5,10), LAN, WAN, and their combinations; execution time axis from 1 to 256 seconds.]

Figure 6.5: Sckeow Comparison of Execution Time (in Seconds) for Heterogeneous Settings Using Fixed Array Size

6.6 Summary

In this chapter we addressed the optimization of array data joins, considering skewed data in heterogeneous environments. These are motivated by real-world scenarios, including the processing of climate data, which is collected by numerous agencies and stored across multiple repositories. We have developed a Cost Based Optimization approach. We addressed data skew by adding an option to change the data representation to one that fits the current data state. We have introduced new parameters that allow measuring costs more effectively – the CF (Conversion Factor) and the MPU (Minimal Profitable Unit). The CF allows pruning execution plans before they are generated if the need for data conversion can be definitively determined in advance, whereas the MPU's allow doing the same based on data processing and transportation performance estimations. These factors have been used to develop a cost model.

We evaluated our engine and compared it to state-of-the-art approaches. We show not only that our plans perform better (396% faster in heterogeneous settings and 368% faster in homogeneous ones), but also that the aggressive pruning results in fewer plans (often a single plan is generated for heterogeneous settings, and half the number of plans spanned by other engines for homogeneous settings).

Chapter 7: Related Work

In this chapter we describe prior and related work. We discuss the different works by the domains they are related to: scientific data management, performance optimization of distributed queries, and execution of analytical and join queries.

7.1 Scientific Data Management

Scientific arrays are often stored in low-level formats, such as NetCDF [102] and HDF5 [48], and distributed using portals, such as ESGF [19, 130]. Such storage is preferable for several reasons: it has low storage overheads and no data ingestion costs despite the rate at which these data are produced, the data is often very large and queried infrequently, and existing tools and scripts need such formats [127, 147].

The portals disseminating these data use multiple methods in their backend to store and retrieve data (by using certain predefined, fixed capabilities for searching within datasets). For example, OPeNDAP [39] provides the ability to share and access remote data. Yet, it does not provide any querying capability directly – and in fact requires data to be converted to its internal format.

The most common data transportation technology used is FTP [98], and querying operators have been integrated with one implementation of FTP, GridFTP [114]. However, querying using this method is limited in the complexity of queries and may only use local resources (preventing the usage of distributed data or distributed processing).

Database-like approaches for viewing array data have slowly developed over time. UFI [14] is a tool that allows viewing a semi-structured local file as if it were a database – including files in NetCDF and HDF5 formats. However, it does not directly support a high-level query language nor distributed data.

Database-like querying functionality over native formats is not a new idea either. Prior to the precursor work on SDQuery, similar functionality was supported on flat files containing array data [129]. More recently, NoDB [3] systems allow querying over raw data files, though their focus is on record-based, relational data. The authors further extended their original work to an adaptive query engine [72]. The key novel aspect of our work is that we focus on a collection of files, and specifically work with distributed scientific data stored in native formats like NetCDF.

A more structured approach to querying scientific data involves array databases, and there is a large body of work in this area [15, 23, 29, 38, 77, 82, 83, 85, 99, 121, 142]. These systems require that the data be ingested by a central system before it can be queried. Thus, they cannot support queries across multiple repositories, nor can they directly operate on low-level scientific data. Our engines are intended to provide querying support without having to load the data inside a traditional system – an approach that is especially suitable for read-only or append-only data that is queried relatively infrequently and generated at a high rate, as scientific data are.

SciDB [24] is considered the state-of-the-art centralized system that addresses array data (and, as most other systems, requires data to be pre-loaded into it). It uses database-like caches to process array queries, but it provides limited querying abilities and requires manual array storage configuration. Storage engines such as TileDB [95] and ArrayDB [83] provide different approaches for configuring array storage, yet all are similar in requiring the data to be loaded and reformatted into a proprietary format.

MapReduce [40, 58, 143] systems use HDFS [22], or a variant, to store data and allow users to query it by either programming an application to do so or using a high-level querying framework like Hive [115]. MapReduce frameworks for processing array data, e.g., SciHadoop [25] and MERRA [104], provide native querying abilities over array data; yet, they do not provide generic declarative querying support, do not support geographically distributed data, and do not provide the join operator or analytical function support. Finally, WANalytics [124], although somewhat addressing joins in distributed settings, does not handle window querying nor consider array data.

A few querying languages have been suggested for querying array data. AQL [79] is the most referenced and used one. In an attempt to standardize multi-dimensional array querying syntax, there is an effort to extend SQL [84] to address these data – referred to as SQL/MDA (Multi-Dimensional Array); neither considers advanced analytical querying. In addition, both implicitly assume array data is not distributed through virtual dimensions.

7.2 Optimization of Advanced Join Queries over Distributed Data

Optimization of queries over distributed data (outside a cluster) was considered in the Volcano project [52, 54, 55]. However, this work did not include a cost-based optimizer that considered different options for distributing the processing and data movement, and it uses implicit heuristics which cannot address complex queries in highly distributed settings. WANalytics [124] is a recent proposal from Microsoft for developing analytics on geographically distributed datasets, but it does not target scientific data nor largely distributed environments. Stratosphere [4] offers a CBO for optimizing joins (among other operations). However, the CBO is used only for join ordering, and choices among options are made using a simple heuristic.

Query optimizers such as the one in Hive [64, 115] optimize distributed querying over data within a cluster. Since the focus is on data within a single cluster, these systems assume low latencies and high throughput – assumptions that do not hold for geo-distributed arrays. This requires different approaches to query optimization for our target environments.

Another approach for joining distributed data is to not evenly transport, or shuffle, the data. For example, instead of holding the machines that store Array 2 idle while the machines that hold Array 1 process all the data, both arrays can be sent to all candidate processing nodes, utilizing the machines that held the outer joined array as well (increasing the number of machines used) [97, 128]. While these approaches minimize the communication traffic (assuming less communication leads to faster execution times), the same assumptions of low latency and a local cluster are made. Our approaches complement these methods and can be adjusted to fit them. In addition, these approaches do not currently handle arrays stored in distributed patterns, and it is unclear what performance hit adjusting these algorithms for that setting would entail.

Relational equi-join optimizations have been researched thoroughly in the literature. Current work focuses on the optimization of equi-joins in different settings based on the usage of indexes and a cost model [21, 78]; our work has focused on the use of bitmaps for the acceleration of distributed joins and MRJ's over scientific data.

Another research avenue, referred to as Shared Servers [46, 73, 119], improves performance by merging the execution of similar operators locally. Yet, scientific data is rarely accessed simultaneously by multiple users – it is queried infrequently; therefore, although this approach can be used, its benefits are expected to be limited.

Optimization of join queries by using Compressed Bitmaps has been looked at before [81]. The main differences are that we target distributed settings (including arrays with non-conforming dimensional attributes), and we consider optimizing MRJ's over scientific data. The WAH compression [134] is used for efficiently processing over compressed data in both works; more recent bitmap compression schemes, such as Upbit and Roaring Bitmaps [10, 31], may improve our performance. More broadly, a large body of work exists on bitmap index utilization [8, 13, 110, 132, 136]; these do not focus on join queries nor distributed settings. Since the Type-B index is not generic like advanced encodings for bitmap indexes [32], it is smaller in size and more efficient for MRJ's.

More advanced querying systems over scientific array data have recently been developed. As stated earlier, Skew-Aware Join Optimization for Array Data [43], Similarity Join over Array Data [145], and DistriPlan [45] (Chapter 3) are the current state of the art in this domain. We have extensively compared our work against these and demonstrated the advantage of our latest approach.

7.3 Operator Execution: Joins and Analytical Functions

Join Execution: the usage of Bloom Filters has been suggested for executing distributed joins [100, 140]. Bloom Filters are not beneficial for numerical, floating-point values. Once the data goes through bucketing, as required for unrestricted continuous numerical values, the advantages of simple bitmaps are higher. In addition, the Bloom Filter structure cannot be used for MRJ's: Bloom Filters are intended for item lookups; they are not ordered by items and are not intended to allow the fast information crossing that MRJ's require. Second, distributed Bloom Filters inherently assume small selectivities (since they require result verification, which becomes more expensive if very large portions of the data need to be verified). Therefore, due to both reasons – the assumption of low selectivity and, more importantly, the lack of ability to use the structure for MRJ's – we could not use Bloom Filters in this work. Bloom Filters can be used for equi-joins, while our approaches can be extended to other join types.

Materializing query results [144] was suggested to improve query performance as well. Since we target exploratory queries, materialized views are not expected to assist, and therefore using indexes, which target generic queries, is preferable.

Another popular idea is to control the data shuffle so that each node processes a similar amount of data [43, 145]. The join execution algorithm there is what we refer to as the Naïve Join Algorithm, and therefore this work is unique in targeting the join operation itself. Yet, the assumption that data is skewed at the storage level rarely holds for data collected by sensors, and does not need to be addressed for such data. We suggest allowing a CBO to plan the data communication based on its global effect, rather than using simple heuristics. In addition, in this dissertation we define the join differently. While Similarity Join uses aerial aggregations to match values (by using masks), we join values based on mapped dimensions [44]. Therefore, our systems do not impose constraints on the dimensions of the queried variables/arrays. Although we do not target similarity join queries, our model allows their execution by using aggregations followed by joins.

Analytical/Window functions: analytical functions have been available in relational systems for a while [26, 59, 76]. The introduction of these functions to scientific array DBMS's is still limited. Specifically, SciDB [70] supports a limited syntax for analytical functions. It does not provide cross-window data access, but does allow a window to be defined. With the SciDB window, joining two different window queries is a possible mechanism to manually implement functionality such as lead and lag. However, because the functionality is broken across different queries, efficiency is limited – memory and execution optimizations cannot be used, since both queries need to execute before the join can be executed.

ArrayUDF [42] addresses aggregations over adjacent array cells. Some of the functionality provided by ArrayUDF intersects with the functionality provided by DSDQuery [44] (which preceded this system). While DSDQuery allows using functions in the query GROUP BY clause, it does not provide a framework for writing these functions. FDQ provides more advanced analytical abilities – ArrayUDF targets aerial aggregations (similar to those executed on one of the join sides in Similarity Joins [145]), and not the “cross windows” querying abilities that we target in this work.

Chapter 8: Conclusions and Future Work

In this dissertation we have developed multiple systems using innovative, state-of-the-art approaches, which enable end-to-end data management and declarative querying over scientific array data. We address all facets of scientific array data querying, starting with indexing and metadata extraction of incepted data, through advanced analytics over this data, to parallel and distributed execution of complex queries that generate new datasets. Using the techniques we develop, we enable users to use declarative languages to query scientific array data stored in their native formats on the file system, saving time and resources for scientists (who until now had to develop their own software for executing advanced analytics over these data).

We started by presenting a framework that can process structured queries on scientific data repositories. In the process, we have addressed several challenges, including: 1) extending the relational algebra to apply over array datasets, giving concise meaning to queries; 2) locating files for processing a user query, allowing us to develop a truly ‘declarative querying’ approach; 3) querying over multiple distributed array datasets; and 4) producing an output dataset that contains an aggregated union of the user-requested data.

Then, we presented a framework for the optimized execution of array-based joins in geo-distributed settings. We first developed a query optimizer which prunes plans as it generates them. For our target queries, the number of plans is kept at a manageable level, and subsequently, a cost model we have developed can be used for selecting the cheapest plan. We have shown that our pruning approach makes the plan spanning problem practical to solve.

Afterwards, we have considered the execution of joins and join-like operations over scientific datasets. For equi-joins, we developed techniques for improving their performance by using bitmaps and for executing them in distributed settings.

For the Mutual Range Join (MRJ), we have developed a variant of standard bitmaps and showed how they can be used for efficient execution. We have also discussed distributed algorithms for join execution. The basis for our techniques are methods that efficiently join array dimensions and restructure bitmap indexes to new dimensions without rebuilding the index – a property that allows transporting only the indexes, which saves communication time.

We continue by unveiling a new type of analytical query, relevant mainly to scientific domains, and address it by developing the FDQ engine. The new class of queries is analytical querying where the internal window ordering needs to be explicit, since queries may access a specific, ordered plane of a different window. We motivate these queries (required by our collaborators) and relate them to the context of array data. Then, we discuss memory planning, caching, and algorithms for analytical querying. We continue by suggesting the usage of sectioning (tiling) for ensuring data remains in memory and for enabling parallelism in window calculations.

We conclude this dissertation by revisiting the join optimization problem. In the revisit, we address skew of both data and resources. While traditional methods address data skew in the stored data, we focus on data that becomes skewed while being processed – we address this change in data skew by introducing the ability to change the data representation format as part of the query execution, and the ability to process the data in all supported representations. Resource skew is addressed in a manner that targets real settings (where a few heterogeneous clusters are used, yet each cluster's internal resources are homogeneous) by incorporating each cluster's “Computational Power” into our CBO's cost model. Unlike previous approaches, we measure each cluster's computational power automatically. Last, we developed methods to efficiently represent and prune plans, so that their number remains low and manageable by our evaluation engine.

We evaluated our engines and have shown, through experimentation, that our methods are profitable compared to existing approaches. We have shown that our execution engines' performance improves linearly with decreasing dataset sizes and that they scale well (linearly with the available resources). In addition, we have shown that our DistriPlan (Chapter 3) optimizer finds the optimal (DistriPlan) plan. Later, we have shown that Sckeow (Chapter 6) finds an ideal plan with lower overheads compared to DistriPlan. Both engines generate plans that, when executed, outperform plans generated using the current state-of-the-art approaches.

This work can be extended in multiple directions. Here, we focus on two different aspects of scientific data querying: query planning and distributed query execution. While our work was generic in nature, we focused on joins and analytical querying over distributed data. We used indexes built on raw data to allow efficient query execution.

Possible directions for developing this work further may include targeting individual query executions – for example, using different indexing methods for query execution. This would result in any single sub-query executing faster, accelerating the total process. On top of that, more approaches for merging and restructuring array data should be developed, since these two operations are the most expensive ones. These might require developing new techniques for storing array data in a manner which allows changing the array structure efficiently. These, in turn, will require new methods to optimize queries (by pruning more spanned plan options, while considering the newly developed indexing approaches and execution methods) to achieve acceptable performance.

Last, extending the query types beyond analytical and join queries would be beneficial for the scientific community. For example, preparing data for learning solely by using SQL would require many additional features for querying over scientific array data. Allowing the use of declarative languages for most, if not all, of the scientific workflow will enable fast technological advances in all domains where these data are used.

Bibliography

[1] Georges Aad, E Abat, et al. The atlas experiment at the cern large hadron

collider. Journal of instrumentation, 3(8), 2008.

[2] Foto N Afrati and Jeffrey D Ullman. Optimizing joins in a map-reduce en-

vironment. In Proceedings of the 13th International Conference on Extending

Database Technology, pages 99–110. ACM, 2010.

[3] Ioannis Alagiannis, Renata Borovica, Miguel Branco, Stratos Idreos, and

Anastasia Ailamaki. Nodb: efficient query execution on raw data files. In

Proceedings of the 2012 ACM SIGMOD International Conference on Management

of Data, pages 241–252. ACM, 2012.

[4] Alexander Alexandrov, Rico Bergmann, et al. The stratosphere platform for

big data analytics. The VLDB Journal, 2014.

[5] William Allcock, John Bresnahan, Rajkumar Kettimuthu, Michael Link,

Catalin Dumitrescu, Ioan Raicu, and Ian Foster. The globus striped gridftp

framework and server. In Proceedings of the 2005 ACM/IEEE conference on

Supercomputing, page 54. IEEE Computer Society, 2005.

[6] Bryce Allen, John Bresnahan, Lisa Childers, et al. Software as a service for

data scientists. CACM, 55(2):81–88, 2012.

[7] Claus A Andersson and Rasmus Bro. The n-way toolbox for matlab. Chemo-

metrics and intelligent laboratory systems, 52(1):1–4, 2000.

[8] Gennady Antoshenkov. Byte-aligned bitmap compression. In DCC, page

476. IEEE, 1995.

[9] Peter MG Apers, AR Hevner, et al. Optimization algorithms for distributed

queries. Software Engineering, IEEE Transactions, (1):57–68, 1983.

[10] Manos Athanassoulis, Zheng Yan, and Stratos Idreos. Upbit: Scalable in-

memory updatable bitmap indexing. In SIGMOD, pages 1319–1332. ACM,

2016.

[11] Maximilian Auffhammer, Solomon M Hsiang, Wolfram Schlenker, and

Adam Sobel. Using weather data and climate model output in economic

analyses of climate change. Review of Environmental Economics and Policy,

page ret016, 2013.

[12] Stefan Aulbach, Torsten Grust, Dean Jacobs, Alfons Kemper, and Jan Rit-

tinger. Multi-tenant databases for software as a service: schema-mapping

techniques. In Proceedings of the 2008 ACM SIGMOD international conference

on Management of data, pages 1195–1206. ACM, 2008.

[13] Jay Ayres, Jason Flannick, Johannes Gehrke, and Tomi Yiu. Sequential pat-

tern mining using a bitmap representation. In SIGKDD, pages 429–435.

ACM, 2002.

[14] Barrodale Computing Services (BCS). Universal File Interface (UFI). http:

//www.barrodale.com/universal-file-interface-ufi, 2010.

[15] P. Baumann, A. Dehmel, et al. The Multidimensional Database System Ras-

DaMan. In SIGMOD, pages 575–577, 1998.

[16] Mehmet Belgin, Godmar Back, et al. Pattern-based sparse matrix represen-

tation for memory-efficient smvm kernels. In SC. ACM, 2009.

[17] Alberto Belussi and Christos Faloutsos. Self-spacial join selectivity estima-

tion using fractal concepts. ACM Transactions on Information Systems (TOIS),

16(2):161–201, 1998.

[18] Itzik Ben-Gan. Microsoft SQL Server 2012 High-performance T-SQL Using Win-

dow Functions. Pearson Education, 2012.

[19] David Bernholdt, Shishir Bharathi, et al. The earth system grid: Supporting

the next generation of climate modeling research. Proceedings of the IEEE,

93(3):485–495, 2005.

[20] Spyros Blanas, Jignesh M Patel, et al. A comparison of join algorithms for

log processing in mapreduce. In SIGMOD, pages 975–986. ACM, 2010.

[21] Roger Bolsius and Kevin Malaney. Methods and systems for optimizing

queries through dynamic and autonomous database schema analysis, 2003.

US Patent App. 10/448,962.

[22] Dhruba Borthakur et al. Hdfs architecture guide. Hadoop Apache Project, 53,

2008.

[23] Paul G. Brown. Overview of SciDB: large scale array storage, processing and

analysis. In SIGMOD, pages 963–968, 2010.

[24] Paul G Brown. Overview of scidb: large scale array storage, processing and

analysis. In Proceedings of the 2010 ACM SIGMOD International Conference on

Management of data. ACM, 2010.

[25] Joe B Buck, Noah Watkins, et al. Scihadoop: array-based query processing

in hadoop. In SC, pages 1–11. IEEE, 2011.

[26] Yu Cao, Chee-Yong Chan, et al. Optimization of analytic window functions.

VLDB, 5(11):1244–1255, 2012.

[27] Robert L Carter and Mark E Crovella. Server selection using dynamic path

characterization in wide-area networks. In INFOCOM’97. Sixteenth Annual

Joint Conference of the IEEE Computer and Communications Societies. Driving the

Information Revolution., Proceedings IEEE, volume 3, pages 1014–1021. IEEE,

1997.

[28] Stefano Ceri and Georg Gottlob. Translating sql into relational algebra: Op-

timization, semantics, and equivalence of sql queries. IEEE Transactions on

software engineering, 11(4):324, 1985.

[29] João Pedro Cerveira Cordeiro, Gilberto Câmara, et al. Yet Another Map Al-

gebra. Geoinformatica, 13(2):183–202, June 2009.

[30] John Chambers. Software for data analysis: programming with R. Springer

Science & Business Media, 2008.

[31] Samy Chambi, Daniel Lemire, et al. Better bitmap performance with roaring

bitmaps. Software: practice and experience, 46(5):709–719, 2016.

[32] C. Chan and Y. Ioannidis. An efficient bitmap encoding scheme for selection

queries. In SIGMOD Record, volume 28. ACM, 1999.

[33] Kyle Chard, Ian Foster, et al. Globus: Research data management as service

and platform. In PEARC, pages 1–5. ACM, 2017.

[34] Surajit Chaudhuri. An overview of query optimization in relational systems.

In Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium

on Principles of database systems, pages 34–43. ACM, 1998.

[35] Lyndon Clarke, Ian Glendinning, and Rolf Hempel. The mpi message pass-

ing interface standard. In Programming environments for massively parallel dis-

tributed systems, pages 213–218. Springer, 1994.

[36] E. F. Codd. A relational model of data for large shared data banks. Commun.

ACM, 13(6):377–387, June 1970.

[37] Richard L. Cole and Goetz Graefe. Optimization of dynamic query evalua-

tion plans. SIGMOD Rec., 23(2):150–160, May 1994.

[38] Roberto Cornacchia, Sándor Héman, et al. Flexible and efficient IR using

array databases. VLDB J., 17(1):151–168, 2008.

[39] Peter Cornillon, James Gallagher, and Tom Sgouros. Opendap: Access-

ing data in a distributed, heterogeneous environment. Data Science Journal,

2(5):164–174, 2003.

[40] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: simplified data processing

on large clusters. Communications of the ACM, 51(1):107–113, 2008.

[41] Bin Dong, Surendra Byna, and Kesheng Wu. Spatially clustered join on het-

erogeneous scientific data sets. In Big Data (Big Data), 2015 IEEE International

Conference on, pages 371–380. IEEE, 2015.

[42] Bin Dong, Kesheng Wu, Surendra Byna, Jialin Liu, Weijie Zhao, and Florin

Rusu. Arrayudf: User-defined scientific data analysis on arrays. In Pro-

ceedings of the 26th International Symposium on High-Performance Parallel and

Distributed Computing, pages 53–64. ACM, 2017.

[43] Jennie Duggan, Olga Papaemmanouil, Leilani Battle, and Michael Stone-

braker. Skew-aware join optimization for array databases. In Proceedings of

the 2015 ACM SIGMOD International Conference on Management of Data, pages

123–135. ACM, 2015.

[44] Roee Ebenstein and Gagan Agrawal. Dsdquery dsi-querying scientific data

repositories with structured operators. In Big Data (Big Data), 2015 IEEE

International Conference on, pages 485–492. IEEE, 2015.

[45] Roee Ebenstein and Gagan Agrawal. Distriplan - an optimized join execution

framework for geo-distributed scientific data. In SSDBM. ACM, 2017.

[46] Roee Ebenstein, Niranjan Kamat, and Arnab Nandi. Fluxquery: An exe-

cution framework for highly interactive query workloads. In Proceedings of

the 2016 International Conference on Management of Data, SIGMOD ’16, pages

1333–1345, New York, NY, USA, 2016. ACM.

[47] Robert Epstein, Michael Stonebraker, and Eugene Wong. Distributed query processing in a relational data base system. In Proceedings of the 1978 ACM SIGMOD International Conference on Management of Data, SIGMOD '78, pages 169–180, New York, NY, USA, 1978. ACM.

[48] Mike Folk, Gerd Heber, Quincey Koziol, Elena Pourmal, and Dana Robinson. An overview of the hdf5 technology suite and its applications. In Proceedings of the EDBT/ICDT 2011 Workshop on Array Databases, AD '11, pages 36–47, New York, NY, USA, 2011. ACM.

[49] Johann Christoph Freytag. A rule-based view of query optimization, volume 16. ACM, 1987.

[50] Lise Getoor, Benjamin Taskar, and Daphne Koller. Selectivity estimation using probabilistic models. In ACM SIGMOD Record, volume 30, pages 461–472. ACM, 2001.

[51] Roy Goldman and Jennifer Widom. Dataguides: Enabling query formulation and optimization in semistructured databases. Technical report, Stanford, 1997.

[52] Goetz Graefe. Parallelizing the volcano database query processor. In Compcon IEEE Computer Society International Conference, pages 490–493. IEEE, 1990.

[53] Goetz Graefe. Query evaluation techniques for large databases. ACM Comput. Surv., 25(2):73–169, June 1993.

[54] Goetz Graefe. Volcano - an extensible and parallel query evaluation system. IEEE Transactions on Knowledge and Data Engineering, 1994.

[55] Goetz Graefe and William J McKenna. The volcano optimizer generator: Extensibility and efficient search. In Data Engineering, pages 209–218. IEEE, 1993.

[56] Goetz Graefe and Karen Ward. Dynamic query evaluation plans. In ACM SIGMOD Record, volume 18, pages 358–366. ACM, 1989.

[57] William Gropp, Ewing Lusk, et al. A high-performance, portable implementation of the mpi message passing interface standard. Parallel Computing, 22(6):789–828, 1996.

[58] Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, and Geoffrey Fox. Mapreduce in the clouds for science. In CloudCom, pages 565–572. IEEE, 2010.

[59] Abhinav Gupta. Method for minimizing the number of sorts required for a query block containing window functions, May 14 2002. US Patent 6,389,410.

[60] Peter J Haas, Jeffrey F Naughton, et al. Fixed-precision estimation of join selectivity. In ACM SIGACT-SIGMOD-SIGART, pages 190–201. ACM, 1993.

[61] Peter J Haas, Jeffrey F Naughton, and Arun N Swami. On the relative cost of sampling for join selectivity estimation. In ACM SIGACT-SIGMOD-SIGART, pages 14–24. ACM, 1994.

[62] Benjamin Heintz, Abhishek Chandra, and Jon Weissman. Cloud Computing for Data-Intensive Applications, chapter Cross-Phase Optimization in MapReduce, pages 277–302. Springer New York, New York, NY, 2014.

[63] Jonathan Helmus and Scott Collis. The python arm radar toolkit (py-art), a library for working with weather radar data in the python programming language. Journal of Open Research Software, 4(1), 2016.

[64] Herodotos Herodotou and Shivnath Babu. Profiling, what-if analysis, and cost-based optimization of mapreduce programs. Proceedings of the VLDB Endowment, 4(11):1111–1122, 2011.

[65] Stephan Hoyer and Joe Hamman. xarray: N-D labeled arrays and datasets in python. Journal of Open Research Software, 5(1), 2017.

[66] Chengdu Huang and Tarek Abdelzaher. Towards content distribution networks with latency guarantees. In Quality of Service, 2004. IWQOS 2004. Twelfth IEEE International Workshop on, pages 181–192. IEEE, 2004.

[67] W Huang. Using ncl to visualize and analyse nasa/noaa satellite data in format of netcdf, hdf, hdf-eos. In AGU Fall Meeting Abstracts, 2014.

[68] Yannis E Ioannidis. Query optimization. ACM Computing Surveys (CSUR), 28(1):121–123, 1996.

[69] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Operating Systems Review, 41(3):59–72, 2007.

[70] Li Jiang, Hideyuki Kawashima, and Osamu Tatebe. Incremental window aggregates over array database. In Big Data, pages 183–188. IEEE, 2014.

[71] Nicholas Kaiser, Hervé Aussel, et al. Pan-starrs: a large synoptic survey telescope array. In Survey and Other Telescope Technologies and Discoveries, volume 4836. International Society for Optics and Photonics, 2002.

[72] Manos Karpathiotakis, Miguel Branco, Ioannis Alagiannis, and Anastasia Ailamaki. Adaptive query processing on raw data. Proceedings of the VLDB Endowment, 7(12):1119–1130, 2014.

[73] Meichen Lai, Tony Kuen Lee, et al. System and procedure for concurrent database access by multiple user applications through shared connection processes, 1997. US Patent 5,596,745.

[74] Per-Åke Larson, Eric N Hanson, and Susan L Price. Columnar storage in sql server 2012. IEEE Data Eng. Bull., 35(1):15–20, 2012.

[75] Christiane Lefevre. The cern accelerator complex. Technical report, 2008.

[76] Viktor Leis, Kan Kundhikanjana, Alfons Kemper, et al. Efficient processing of window functions in analytical sql queries. VLDB, 8(10):1058–1069, 2015.

[77] Alberto Lerner and Dennis Shasha. AQuery: query language for ordered data, optimization techniques, and experiments. In VLDB, pages 345–356, 2003.

[78] Jonathan Lewis. Cost-Based Oracle Fundamentals. Apress, 2006.

[79] Leonid Libkin, Rona Machlin, and Limsoon Wong. A query language for multidimensional arrays: design, implementation, and optimization techniques. In ACM SIGMOD Record, volume 25, pages 228–239. ACM, 1996.

[80] Robert Love. Kernel korner: Intro to inotify. Linux J., 2005(139):8–, November 2005.

[81] Kamesh Madduri and Kesheng Wu. Efficient joins with compressed bitmap indexes. In Information and Knowledge Management. ACM, 2009.

[82] Arunprasad P. Marathe and Kenneth Salem. A Language for Manipulating Arrays. In VLDB, pages 46–55, 1997.

[83] Arunprasad P Marathe and Kenneth Salem. Query processing techniques for arrays. VLDB J., 11(1):68–91, 2002.

[84] Jim Melton. Iso/ansi: Database language sql. ISO/IEC SQL Revision. New York: American National Standards Institute, 1992.

[85] Jeremy Mennis and C. Dana Tomlin. Cubic map algebra functions for spatio-temporal analysis. CaGIS, 32:17–32, 2005.

[86] Priti Mishra and Margaret H Eich. Join processing in relational databases. ACM Computing Surveys (CSUR), 1992.

[87] Brian Robert Muras. Building database statistics across a join network using skew values, April 19 2011. US Patent 7,930,296.

[88] Arnab Nandi, Cong Yu, Philip Bohannon, and Raghu Ramakrishnan. Distributed cube materialization on holistic measures. In Data Engineering (ICDE), 2011 IEEE 27th International Conference on, pages 183–194. IEEE, 2011.

[89] Arnab Nandi, Cong Yu, Philip Bohannon, and Raghu Ramakrishnan. Data cube materialization and mining over mapreduce. IEEE Transactions on Knowledge and Data Engineering, 24(10):1747–1759, 2012.

[90] Vivek Narasayya, Sudipto Das, Manoj Syamala, Badrish Chandramouli, and Surajit Chaudhuri. Sqlvm: Performance isolation in multi-tenant relational database-as-a-service. 2013.

[91] Raymond T. Ng, Alan Wagner, and Yu Yin. Iceberg-cube computation with pc clusters. SIGMOD Rec., 30(2):25–36, May 2001.

[92] Chris Olston, Jing Jiang, and Jennifer Widom. Adaptive filters for continuous queries over distributed data streams. In SIGMOD, pages 563–574. ACM, 2003.

[93] Michael Owens. Embedding an sql database with sqlite. Linux J., 2003(110):2–, June 2003.

[94] M. Tamer Özsü and Patrick Valduriez. Principles of Distributed Database Systems. Springer Science & Business Media, 2011.

[95] Stavros Papadopoulos, Kushal Datta, et al. The tiledb array data storage manager. VLDB, 10(4):349–360, 2016.

[96] Sergio Pissanetzky. Sparse Matrix Technology - electronic edition. Academic Press, 1984.

[97] Orestis Polychroniou, Rajkumar Sen, et al. Track join: distributed joins with minimal network traffic. In SIGMOD. ACM, 2014.

[98] Jon Postel and Joyce Reynolds. File transfer protocol. RFC, 1985.

[99] David Pullar. MapScript: A Map Algebra Programming Language Incorporating Neighborhood Analysis. Geoinformatica, 5(2):145–163, June 2001.

[100] Sukriti Ramesh, Odysseas Papapetrou, and Wolf Siberski. Optimizing distributed joins with bloom filters. In ICDCIT, pages 145–156. Springer, 2008.

[101] Naveen Reddy and Jayant R Haritsa. Analyzing plan diagrams of database query optimizers. In Conference on Very Large Data Bases, pages 1228–1239. VLDB Endowment, 2005.

[102] Russ Rew and Glenn Davis. Netcdf: an interface for scientific data access. Computer Graphics and Applications, IEEE, 10(4):76–82, 1990.

[103] David R Roberts and Andreas Hamann. Predicting potential climate change impacts with bioclimate envelope models: a palaeoecological perspective. Global Ecology and Biogeography, 21(2):121–133, 2012.

[104] John L Schnase, Daniel Q Duffy, et al. Merra analytic services: Meeting the big data challenges of climate science through cloud-enabled climate analytics-as-a-service. Computers, Environment and Urban Systems, 61:198–211, 2017.

[105] Donovan A Schneider and David J DeWitt. A performance evaluation of four parallel join algorithms in a shared-nothing multiprocessor environment, volume 18. ACM, 1989.

[106] Donovan A. Schneider and David J. DeWitt. A performance evaluation of four parallel join algorithms in a shared-nothing multiprocessor environment. SIGMOD Rec., 18(2):110–121, June 1989.

[107] Uwe Schulzweida, Luis Kornblueh, and Ralf Quast. Cdo users guide. Climate Data Operators, Version, 1(6), 2006.

[108] MJ Shaffer, BK Wylie, et al. Using climate/weather data with the nleap model to manage soil nitrogen. Agricultural and Forest Meteorology, 69(1-2):111–123, 1994.

[109] Leonard D Shapiro. Join processing in database systems with large main memories. ACM Transactions on Database Systems (TODS), 11(3):239–264, 1986.

[110] Rishi Rakesh Sinha, Soumyadeb Mitra, and Marianne Winslett. Bitmap indexes for large scientific data sets: A case study. In IPDPS, 10 pp. IEEE, 2006.

[111] Michael Stonebraker. Sql databases v. nosql databases. Communications of the ACM, 53(4):10–11, 2010.

[112] Michael Stonebraker, Paul Brown, Alex Poliakov, and Suchi Raman. The architecture of scidb. In International Conference on Scientific and Statistical Database Management, pages 1–16. Springer, 2011.

[113] RA Stratton. A high resolution amip integration using the hadley centre model hadam2b. Climate Dynamics, 15(1):9–28, 1999.

[114] Yu Su, Yi Wang, Gagan Agrawal, and Rajkumar Kettimuthu. SDQuery DSI: integrating data management support with a wide area data transfer protocol. In SC, page 47. ACM, 2013.

[115] Ashish Thusoo, Joydeep Sen Sarma, et al. Hive: a warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment, 2(2):1626–1629, 2009.

[116] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. Hive - A Warehousing Solution Over a Map-Reduce Framework. PVLDB, 2(2):1626–1629, 2009.

[117] Martin H Trauth, Robin Gebbers, Norbert Marwan, and Elisabeth Sillmann. MATLAB recipes for earth sciences. Springer, 2006.

[118] J Anthony Tyson. Large synoptic survey telescope: overview. In Survey and Other Telescope Technologies and Discoveries, volume 4836, pages 10–21. International Society for Optics and Photonics, 2002.

[119] Philipp Unterbrunner, Georgios Giannikis, et al. Predictable performance for unpredictable workloads. VLDB, 2(1), 2009.

[120] Patrick Valduriez and Georges Gardarin. Join and semijoin algorithms for a multiprocessor database machine. TODS, 9(1):133–161, 1984.

[121] Alex R. van Ballegooij. RAM: a multidimensional array DBMS. In EDBT 2004 Workshops, pages 154–165, 2005.

[122] Shanmugam Veeramani, Muhammad Nasir Masood, and Amandeep S Sidhu. A pacs alternative for transmitting dicom images in a high latency environment. In Biomedical Engineering and Sciences (IECBES), 2014 IEEE Conference on, pages 975–978. IEEE, 2014.

[123] Stratis D Viglas, Jeffrey F Naughton, and Josef Burger. Maximizing the output rate of multi-way join queries over streaming information sources. In Proceedings of the 29th International Conference on Very Large Data Bases - Volume 29, pages 285–296. VLDB Endowment, 2003.

[124] Ashish Vulimiri, Carlo Curino, et al. Wanalytics: Analytics for a geo-distributed data-intensive world. In CIDR, 2015.

[125] Fangju Wang. Query optimization for a distributed geographic information system. Photogrammetric Engineering and Remote Sensing, 65:1427–1438, 1999.

[126] Jiali Wang and Veerabhadra R Kotamarthi. Downscaling with a nested regional climate model in near-surface fields over the contiguous united states. JGR Atmospheres, 119(14):8778–8797, 2014.

[127] Jiali Wang and Veerabhadra R Kotamarthi. High-resolution dynamically downscaled projections of precipitation in the mid and late 21st century over north america. Earth's Future, 3(7):268–288, 2015.

[128] Daniel Warneke and Odej Kao. Exploiting dynamic resource allocation for efficient parallel data processing in the cloud. IEEE Transactions on Parallel and Distributed Systems, 22(6):985–997, 2011.

[129] Li Weng, Gagan Agrawal, Umit Catalyurek, T. Kurc, Sivaramakrishnan Narayanan, and Joel Saltz. An Approach for Automatic Data Virtualization. In HPDC, pages 24–33, 2004.

[130] Dean N Williams, Gavin Bell, Luca Cinquini, Peter Fox, John Harney, and Robin Goldstone. Earth system grid federation: Federated and integrated climate data from multiple sources. In Earth System Modelling - Volume 6, pages 61–77. Springer, 2013.

[131] Andrew Witkowski, Srikanth Bellamkonda, Tolga Bozkaya, Nathan Folkert, Abhinav Gupta, John Haydu, Lei Sheng, and Sankar Subramanian. Advanced sql modeling in rdbms. ACM Transactions on Database Systems (TODS), 30(1):83–121, 2005.

[132] Kesheng Wu, Ekow Otoo, and Arie Shoshani. On the performance of bitmap indices for high cardinality attributes. In Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30, pages 24–35. VLDB Endowment, 2004.

[133] Kesheng Wu, Ekow J Otoo, and Arie Shoshani. Compressing bitmap indexes for faster search operations. In SSDBM, pages 99–108. IEEE, 2002.

[134] Kesheng Wu, Ekow J Otoo, and Arie Shoshani. Optimizing bitmap indices with efficient compression. TODS, 31(1), 2006.

[135] Kesheng Wu, Kurt Stockinger, and Arie Shoshani. Breaking the curse of cardinality on bitmap indexes. In SSDBM, pages 348–365. Springer, 2008.

[136] Ming-Chuan Wu. Query optimization for selections using bitmaps. In SIGMOD Rec., volume 28, pages 227–238. ACM, 1999.

[137] Cenk Yavuzturk and Jeffrey D Spitler. Comparative study of operating and control strategies for hybrid ground-source heat pump systems using a short time step simulation model. Ashrae Transactions, 106:192, 2000.

[138] Yuan Yu, Pradeep Kumar Gunda, and Michael Isard. Distributed aggregation for data-parallel computing: interfaces and implementations. In SIGOPS, pages 247–260. ACM, 2009.

[139] Charlie Zender. Nco users guide, 2004.

[140] Changchun Zhang, Lei Wu, and Jing Li. Optimizing distributed joins with bloom filters using mapreduce. In Computer Applications for Graphics, Grid Computing, and Industrial Environment, pages 88–95. Springer, 2012.

[141] Ying Zhang, Martin Kersten, Milena Ivanova, and Niels Nes. Sciql: bridging the gap between science and relational dbms. In Proceedings of the 15th Symposium on International Database Engineering & Applications, pages 124–133. ACM, 2011.

[142] Ying Zhang, Martin Kersten, Milena Ivanova, and Niels Nes. SciQL: Bridging the Gap Between Science and Relational DBMS. In IDEAS, pages 124–133, September 2011.

[143] Hui Zhao, SiYun Ai, ZhenHua Lv, and Bo Li. Parallel Accessing Massive NetCDF Data Based on MapReduce. In WISM, pages 425–431, Berlin, Heidelberg, 2010. Springer-Verlag.

[144] W. Zhao, F. Rusu, et al. Incremental view maintenance over array data. In SIGMOD. ACM, 2017.

[145] Weijie Zhao, Florin Rusu, et al. Similarity join over array data. In SIGMOD. ACM, 2016.

[146] Y. Zhu et al. Geometadb: powerful alternative search engine for the gene expression omnibus. Bioinformatics, 24(23), 2008.

[147] Zachary Zobel, Jiali Wang, et al. Evaluations of high-resolution dynamically downscaled ensembles over the contiguous united states. Climate Dynamics, pages 1–22, 2017.

[148] Zachary Zobel, Jiali Wang, et al. High-resolution dynamical downscaling ensemble projections of future extreme temperature distributions for the united states. Earth's Future, 2017.

[149] Calisto Zuzarte, Hamid Pirahesh, et al. Winmagic: Subquery elimination using window aggregation. In SIGMOD, pages 652–656. ACM, 2003.
