Applying OLAP Pre-Aggregation Techniques to Speed Up Aggregate Query Processing in Array Databases

by

Angélica García Gutiérrez

A thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science

Approved, Thesis Committee:

Prof. Dr. Peter Baumann
Prof. Dr. Vikram Unnithan
Prof. Dr. Inés Fernando Vega López

Date of Defense: November 12, 2010

School of Engineering and Science

In memory of my grandmother, Naty.

Acknowledgments

I would like to express my sincere gratitude to my thesis advisor, Prof. Dr. Peter Baumann, for his excellent guidance throughout the course of this dissertation. With his tremendous passion for science and his great efforts to explain things clearly and simply, he made this research one of the richest experiences of my life. He always suggested new ideas and guided my research through many pitfalls. Furthermore, I learned from him to be kind and cooperative. Thank you for every single meeting, for every single discussion that you always managed to make thought-provoking, for your continued encouragement, and for believing that I could bring this project to success.

I am also grateful to Prof. Dr. Inés Fernando Vega López for his valuable suggestions. He not only provided me with technical advice but also gave me some important hints on scientific writing that I applied in this dissertation. My sincere gratitude also goes to Prof. Dr. Vikram Unnithan. Despite being one of Jacobs University's most popular and busiest professors due to his genuine engagement with student life beyond academics, Prof. Unnithan took interest in this work and provided me with unconditional support.

I would like to thank two promising graduate students, Irina Calciu and Eugen Sorbalo, for their outstanding contributions to some of the experiments presented in Chapter 5 of this thesis. I am especially grateful to my colleagues Michael Owonibi, Salah Al Jubeh, and Yu Jinsongdi for many valuable discussions, and for providing a stimulating and fun environment in which to learn and grow.

I am grateful to the team assistants at the School of Engineering and Science for helping the School run smoothly and for assisting me in many different ways. Sigrid Manss deserves special mention. Thank you for all your kindness and caring.

Also, I would like to thank Connie Garcia, Jim Toersten, Greg White, Irina Prjadeha, and all of my friends who helped me proofread this thesis. Victoria Inness-Brown deserves special mention for applying her expertise as an editor in reviewing each chapter of this thesis.

Thank you to all my great friends who provided support and encouragement in so many ways, for helping me to see the bright side of my problems in difficult times, and for all the emotional support, camaraderie, entertainment, and caring provided. Especially to Salah Al Jubeh, Asma Alazeib, Talina Eslava, Rainer Gruenheid, Yu Jinsongdi, Maria Joy, Ghada Kadamany, Ingrid Lara, Blessing Musunda, Michael Owonibi, Jessica Price, Irina Prjadeha, Joerg Reinekirchen, Yannic Ramaye, Mila Tarabashkina, Ruiju Tong, Derya Toykan, Iyad Tumar, Vanya Uzunova, Tanja Vaitulevich, and Justo Vargas. You all have a place in my heart. Also, to my friend Samantha Hooton, whom I learned to love as a sister shortly after meeting her. Her authenticity, self-confidence, and drive to succeed are a real inspiration. Thank you for your caring, for sharing your wisdom, for taking me to the hospital when I was in pain, and for being there anytime I needed a friend.

My warmest thanks to Father Matthew I. Nwoko for his spiritual guidance, his caring, his advice, and above all, for his unconditional love.

Thank you to my parents and my brother and sisters, who have always been very supportive of my aspirations. Their support has been instrumental in getting me on the path that brought me to this project. Especially, thank you, Mom, for being my example of tenacity and commitment. To you, too, I dedicate this thesis.
To DAAD and CONACYT, the financial support and trust are gratefully acknowledged. To everybody who has been a part of my life, thank you very much. Lastly, I thank the Lord God Almighty for giving me health, ideas, and wisdom to enable me to complete this research project successfully.

Abstract

Large multidimensional arrays of data are common in a variety of scientific applications. In the past, arrays have typically been stored in files and then manipulated by customized programs operating on those files. Nowadays, with science moving toward computational databases, the trend is toward a new class of database, the array database. In the broadest sense, the array database supports various types of multidimensional array data, including remote-sensor data, satellite imagery, and data resulting from scientific simulations.

As with traditional databases for business applications, analytics in array databases often involves the extraction of general characteristics from large repositories. This requires efficient methods for computing queries that involve data summarization, such as aggregate queries. A typical solution is to pre-compute queries, in whole or in part, and to store the results of those queries that are frequently submitted against the database as well as those that can be used to compute the results of similar future queries. This process is known as pre-aggregation. Unfortunately, pre-aggregation support for array databases is currently limited to one specific operation, scaling (zooming), and to two-dimensional datasets (images).

In this respect, database technology for business applications is much more mature. Technologies such as On-Line Analytical Processing (OLAP) provide the means to analyze business data from one or multiple sources, and thus facilitate the decision-making process. In OLAP, information is viewed as data cubes. These cubes are typically stored in relational tables, in multidimensional arrays, or in a hybrid model. In order to enable fast interactive multidimensional data analysis, database systems frequently pre-compute and store the results of aggregate queries. While there are some valuable research results in the realm of OLAP pre-aggregation techniques with varying degrees of power and refinement, not enough work has been done and reported for array databases.

The purpose of this thesis is to investigate the application of OLAP pre-aggregation techniques with the objective of speeding up aggregate operations in array databases. In particular, we consider enhancing aggregate computation in Geographic Information Systems (GIS) and remote-sensing imaging applications. To this end, we describe a set of fundamental operations in GIS based on a sound algebraic framework. This allows us to identify those operations that require data summarization and that therefore may benefit from pre-aggregation. We introduce a conceptual framework and cost model for rewriting basic aggregate queries in terms of pre-aggregated data, and conduct experiments to assess the performance of our algorithms. Results show that query response times can be substantially reduced by strategically selecting the pre-aggregate with the least cost in terms of execution time. We also investigate the problem of selecting a set of queries for pre-aggregation, but failed to find an analytical solution for all possible types of aggregate queries. Nevertheless, we present a framework and algorithms for the selection of scaling operations for pre-aggregation considering 2D, 3D, and 4D datasets. The results of our experiments with 2D datasets outperform the results of image pyramids, the current technique used to speed up scaling operations on 2D datasets.
Furthermore, our experiments on 3D and 4D datasets show that query response times can also be substantially reduced by intelligently selecting a set of scaling operations for pre-aggregation. The work presented in this thesis is the first of its kind for array databases in scientific applications.

Contents

1 Introduction and Problem Statement 9 1.1 Overview of Thesis and Contributions ...... 12 1.2 Publications Related to this Thesis ...... 12

2 Background and Related Work 15 2.1 Array Databases ...... 15 2.1.1 Basic Notion of Arrays ...... 15 2.1.2 2D Data Models ...... 16 2.1.3 Multidimensional Data Models ...... 17 2.1.4 Storage Management ...... 18 2.1.5 2D Pre-Aggregation ...... 19 2.1.6 Pre-Aggregation Beyond 2D ...... 23 2.1.7 Summary ...... 25 2.2 On-Line Analytical Processing (OLAP) ...... 25 2.2.1 OLAP Data model ...... 25 2.2.2 OLAP Operations ...... 26 2.2.3 OLAP Architectures ...... 26 2.2.4 OLAP Pre-Aggregation ...... 30 2.3 Discussion ...... 33

3 Fundamental Geo-Raster Operations 37 3.1 Array Algebra ...... 37 3.1.1 Constructor ...... 38 3.1.2 Condenser ...... 39 3.1.3 Sorter ...... 39 3.2 Geo-Raster Operations ...... 39 3.2.1 Mathematical Operations ...... 39 3.2.2 Aggregation Operations ...... 45 3.2.3 Statistical Aggregate Operations ...... 51 3.2.4 Affine Transformations ...... 55 3.2.5 Terrain Analysis ...... 57 3.2.6 Other Operations ...... 59 3.3 Summary ...... 61

4 Answering Basic Aggregate Queries Using Pre-Aggregated Data 63 4.1 Framework ...... 63 4.1.1 Aggregation ...... 64 4.1.2 Pre-Aggregation ...... 64 4.1.3 Aggregate Query and Pre-Aggregate Equivalence ...... 64 4.2 Cost Model ...... 67 4.2.1 Computing Queries from Raw Data ...... 68 4.2.2 Computing Queries from Independent and Overlapped Pre-Aggregates ...... 68 4.2.3 Computing Queries from Dominant Pre-Aggregates ...... 69 4.3 Implementation ...... 70 4.4 Experimental Results ...... 73 4.5 Summary ...... 74

5 Pre-Aggregation Support Beyond Basic Aggregate Operations 77 5.1 Non-Standard Aggregate Operations ...... 77 5.2 Conceptual Framework ...... 78 5.2.1 Lattice Representation ...... 79 5.2.2 Pre-Aggregation Selection Problem ...... 80 5.3 Pre-Aggregates Selection ...... 82 5.3.1 Complexity Analysis ...... 83 5.4 Answering Scaling Operations Using Pre-Aggregated Data ...... 83 5.5 Experimental Results ...... 85 5.5.1 2D Datasets ...... 86 5.5.2 3D Datasets ...... 91 5.5.3 4D Datasets ...... 98 5.6 Summary ...... 100

6 Conclusion 103 6.1 Future Work ...... 104

List of Figures

2.1 3D Array ...... 16 2.2 Map Algebra Functions ...... 17 2.3 Image Tiling ...... 19 2.4 Image Pyramids ...... 20 2.5 Nearest Neighbor, Bilinear and Cubic Interpolation Methods . . . . . 22 2.6 3D Scaling Operations on Time-Series Imagery Datasets ...... 24 2.7 OLAP Data Cube ...... 26 2.8 Typical OLAP Cube Operations ...... 27 2.9 OLAP Approaches: MOLAP, ROLAP, and HOLAP ...... 27 2.10 MOLAP Storage Scheme ...... 28 2.11 ROLAP Storage Scheme ...... 29 2.12 Typical Query as Expressed in ROLAP and MOLAP Systems . . . . 29 2.13 Star Model of a Spatial Warehouse ...... 32 2.14 Comparison of Roll-Up and Scaling Operations ...... 34

3.1 Reduction of Contrast in the Green Channel of an RGB Image . . . . 40 3.2 Highlighted Infrared Areas of an NRG Image ...... 41 3.3 Cells of Rasters A and B with Equal Values ...... 42 3.4 Re-Classification of the Cell Values of a Raster Image ...... 43 3.5 Computation of a Proximity Operation ...... 44 3.6 Computation of an Overlay Operation ...... 45 3.7 Computation of an Overlay Operation Considering Values Greater than Zero ...... 46 3.8 Calculation of the Total Sum of Cell Values in a Raster ...... 47 3.9 Result of an Average Aggregate Operation ...... 48 3.10 Result of a Maximum Aggregate Operation ...... 48 3.11 Result of a Minimum Aggregate Operation ...... 49 3.12 Computation of the Histogram for a Raster Image ...... 50 3.13 Computation of the Diversity for a Raster Image ...... 50 3.14 Computation of a Majority Operation for a Raster Image ...... 51 3.15 Computation of the Variance for a Raster Image ...... 52 3.16 Computation of the for a Raster Image ...... 52 3.17 Computation of for a Raster Image ...... 54 3.18 Computation of a Top-k Operation for a Raster Image ...... 54 3.19 Computation of a Translation Operation for a Raster Image ...... 56

3.20 Computation of a Scaling Operation for a Raster Image ...... 57 3.21 Slopes Along the X and Y Directions ...... 58 3.22 Flow Directions ...... 59 3.23 Sobel Masks ...... 60 3.24 Computation of an Edge-Detection for a Raster Image ...... 60

4.1 Types of Pre-Aggregates ...... 66 4.2 Selected Queries for Pre-Aggregation (left) and Decomposed Queries (right) ...... 67

5.1 Sample Lattice Diagram for a Workload with Five Scaling Operations 79 5.2 Query Workload with Uniform Distribution ...... 87 5.3 Query Workload with Poisson Distribution ...... 88 5.4 Selected Queries for Pre-Aggregation ...... 89 5.5 Query Workload with Peak Distribution ...... 90 5.6 Selected Queries for Pre-Aggregation ...... 90 5.7 Query Workload with Step Distribution ...... 91 5.8 Selected Queries for Pre-Aggregation ...... 92 5.9 Workload with Uniform Distribution along x, y, and t ...... 93 5.10 Average Query Cost over Storage Space ...... 93 5.11 Selected Pre-Aggregates, c = 36% ...... 94 5.12 Workload with Uniform Distribution Along x, y, and Poisson Distribution in t ...... 95 5.13 Average Query Cost as Space is Varied ...... 95 5.14 Selected Pre-Aggregates, c = 26% ...... 96 5.15 Workload with Poisson Distribution Along x, y, and t ...... 96 5.16 Average Query Cost as Space is Varied ...... 97 5.17 Selected Pre-Aggregates, c = 30% ...... 97 5.18 Workload with Poisson Distribution Along x, y, and Uniform Distribution in t ...... 98 5.19 Average Query Cost as Space is Varied ...... 99 5.20 Selected Pre-Aggregates, c = 21% ...... 99

List of Tables

3.1 UNO and FAO Suitability Classifications ...... 43 3.2 Capability Indexes for Different Capability Classes ...... 43 3.3 Array Algebra Classification of Geo-Raster Operations...... 62

4.1 Cost Parameters ...... 68 4.2 Database and Queries of the Experiment...... 74 4.3 Comparison of Query Evaluation Costs Using Pre-Aggregated Data and Original Data...... 74

5.1 Sample Pre-Aggregates...... 84 5.2 ECHAM T-42 Climate Simulation Dimensions ...... 100 5.3 4D Scaling: Scale Vector Distribution ...... 100 5.4 4D Scaling: Selected Pre-Aggregates ...... 100

Chapter 1

Introduction and Problem Statement

Scientific computing platforms and infrastructures are making new kinds of experiments possible, resulting in the generation of vast volumes of array data. This is happening in many specialized application areas such as meteorology, oceanography, hydrology, astronomy, medical imaging, and exploration systems for oil, natural gas, coal, and diamonds. These datasets range from uniformly spaced points (cells) along a single dimension to multidimensional arrays containing several different types of data. For example, astronomy and earth sciences operate on two- or three-dimensional spatial grids, often using a plethora of spherical coordinate systems. Furthermore, nearly all sciences must deal with data series over time. It is frequently necessary to understand relationships between consecutive elements in time, or to analyze entire sequences of observations, and such datasets may represent spatial, temporal, or spatio-temporal information. For example, if ocean measurements such as temperature, salinity, and oxygen are recorded every hour at spacings of one meter in depth and ten meters in the two horizontal dimensions, the result is a four-dimensional array with three spatial dimensions and one temporal dimension, and three values attached to each cell of the array.

In the past, arrays were typically stored in files and then manipulated by programs that operated on these files. Nowadays, with science moving toward being computational and data-based, the trend is toward a new class of database system which provides support not only for traditional, or coded, data types such as text, integers, etc., but also for richer data types like multidimensional arrays. This new class of databases is referred to as array databases.

Implementing an efficient array database management system (DBMS) can be very challenging. Typically, there are two approaches that can be taken to store array datasets in a DBMS. In the first, the values of each cell are stored in a separate row, along with fields describing the position of the cell in the array. The most obvious drawback of this approach is the need for a large multidimensional index to efficiently find rows in the table. Moreover, the space taken by a multidimensional index is larger than the size of the table itself if all dimensions forming an array are used as the key. In the second approach, a multidimensional array is written to a Binary Large Object (BLOB), which is stored in a field of a table in the database. Applications then fetch

the contents of the BLOB when they wish to operate on the data. The main drawback to this approach is that it either requires the entire array to be passed to the client, or it requires that the client perform a large number of BLOB input/output (I/O) operations to read only the required portions of the array. With databases growing beyond a few tens of terabytes, the analysis of large volumes of array datasets is severely limited by the relatively low I/O performance of most of today's computing platforms. High-performance numerical simulations are also increasingly feeling the I/O bottleneck.

To improve data management and analytics on large repositories of data, aggregation has been put forward as a key process when describing high-level data. An example of data aggregation is the computation and storage of statistical parameters, such as count, average, median, and standard deviation. Aggregate computation has been studied in a variety of settings [4, 21, 66]. In particular, On-Line Analytical Processing (OLAP) technology has emerged to address the problem of efficiently computing complex multidimensional aggregate queries on large data warehouses. Most OLAP systems rely on the process of selecting aggregate combinations, and then pre-computing and storing their results so the database system can make use of them in subsequent requests. Such a process is known as pre-aggregation, which has proved to speed up aggregate queries by several orders of magnitude in business and statistical applications [31, 41].

While considerable work has been done on the problem of efficiently computing aggregate queries in OLAP-based applications, such computations continue to be a data management challenge in scientific applications. A relevant example in which the use of advanced data management and efficient query processing are highly desirable is hyper-spectral remote-sensing imaging, in which an image spectrometer collects hundreds or even thousands of measurements for the same area of the surface of the Earth. The scenes provided by such sensors are often called data cubes to denote the dimensionality of the data. Notably, efficient query processing techniques facilitate the exploration of spatio-temporal data patterns, both interactively and in batch on archived data.

A significant fraction of scientific data is image-based and can be naturally represented in multidimensional arrays. These datasets fit poorly into relational databases, which lack efficient support for the concepts of physical proximity and order. They are typically stored in array-friendly formats such as HDF5, netCDF, or FITS. The extremely high computational requirements introduced by image-based scientific applications make them an excellent case study for our research.

Since array databases and OLAP/data warehousing both deal with large multidimensional datasets and aggregate queries, adapting OLAP pre-aggregation techniques to the management and computation of aggregate queries in array databases may provide a strong potential benefit. This thesis investigates the application of OLAP pre-aggregation techniques in speeding up query processing in array databases. In particular, we focus on enhancing aggregate computation in GIS and remote-sensing imaging applications. However, the results can be generalized to other domains as well.

Relevant and complementary questions to this thesis are:

1. What factors influence the decision of selecting an aggregate query for pre-aggregation?

2. What formalisms are necessary to establish an efficient and scalable pre-aggregation framework for array databases?

3. What types of constraints are typically considered by existing OLAP pre-aggregation algorithms, and how do they affect performance?

The thesis objectives are outlined as follows:

1. To illustrate the necessity for improving aggregate computation in array databases for GIS and remote-sensing imaging applications.

2. To achieve a solid understanding of OLAP pre-aggregation algorithms and architectural issues when manipulating large amounts of data.

3. To formally describe fundamental operations in GIS and remote-sensing imaging applications and identify those that involve data summarization.

4. To design a theoretical pre-aggregation framework for array databases supporting GIS and remote-sensing imaging applications.

5. To design query selection and query rewriting algorithms using existing OLAP/data warehousing pre-aggregation techniques.

6. To implement the algorithms in an array database management system.

7. To conduct a performance study of the developed algorithms.

The methodological approach employed in this thesis is centered on a three-stage design methodology:

• Identification of fundamental operations in GIS and remote-sensing imaging applications. A literature review helped us identify fundamental operations in GIS that require data summarization. The literature included different classification schemes, international standards, and best practices.

• Design and implementation. Existing OLAP pre-aggregation techniques are used as a basis for the construction of a pre-aggregation framework for array databases. Storage space constraints are considered while designing query selection algorithms. The algorithms were developed using the C++ programming language and tested in the RasDaMan multidimensional array database management system.

• Evaluation. Performance of the developed algorithms is measured on 2D, 3D, and 4D datasets. For scaling operations on 2D datasets, we compare our results against those of the traditional image pyramids approach.

1.1 Overview of Thesis and Contributions

This section provides an overview of the following chapters.

Chapter 2 presents a comparative study between array databases and OLAP, and devotes special attention to data structures and operations. It starts with a discussion of existing approaches for data modeling, storage management, and query processing in both array databases and the data warehousing/OLAP environment. Existing pre-aggregation and related techniques are also discussed in both application domains. From this study, one can observe similarities with regard to data structures and operations between the two application domains. This suggests that array databases can benefit from pre-aggregation schemes to accelerate the computation of aggregate queries.

Chapter 3 describes fundamental operations in GIS and remote-sensing imaging applications. The selection of operations is based on a thorough review of existing surveys regarding GIS operations, on international standards, and on feedback from GIS practitioners. To better understand the structural characteristics of common queries in array databases, such operations were formalized using a proven array model. This allowed us to identify a set of operations requiring data summarization (aggregation) and the candidate operations to be supported by pre-aggregation techniques.

Chapter 4 deals with the computation of aggregate queries in array databases using pre-aggregated data. The proposed pre-aggregation framework distinguishes different types of pre-aggregates and shows that such a distinction is useful in finding an optimal solution that reduces the CPU cost required for the computation of aggregate queries. A cost model is used to assess the benefit of using pre-aggregated data for computing aggregate queries. The measurements on real-life raster image datasets show that the computation of aggregate queries is always faster with our algorithms in comparison to traditional methods.

Chapter 5 considers the problem of offering pre-aggregation support to non-standard aggregate operations in GIS and remote-sensing imaging applications. A discussion is presented on the issues found while attempting to provide pre-aggregation support for all non-standard aggregate operations, as well as the motivation for focusing on scaling operations. The framework and cost model presented in Chapter 4 are adapted to support scaling operations. Experiments covering 2D, 3D, and 4D datasets show how our pre-aggregation approach not only generalizes the most common approach for 2D, but also helps reduce computation times for 2D, 3D, and 4D datasets.

Chapter 6 presents a summary of our findings and outlines future lines of research.

1.2 Publications Related to this Thesis

A number of papers have been published that relate to the work described in this thesis. Doctoral workshops provided a platform to discuss the feasibility of the proposed research and an opportunity to receive feedback from experts in computer science [6] and the GIS scientific community [5]. Participation in those workshops led to a refinement of the research objectives outlined in Chapter 1. The study and algebraic modeling of geo-raster operations reported in Chapter 3 are presented in [7, 8].

The pre-aggregation framework described in Chapter 4 is presented in [9]. Finally, findings about the query selection problem addressed in Chapter 5 have been accepted for publication in [10].

Chapter 2

Background and Related Work

This chapter describes existing database technology for two environments: GIS/remote-sensing imaging and data warehousing/OLAP. Our investigation shows that conceptual data models and operations are similar in both application domains. This suggests that array database technology can be substantially enhanced by adopting a pre-aggregation scheme built on existing OLAP technology.

2.1 Array Databases

Multidimensional data analysis has recently taken the spotlight in the context of scientific applications. A fundamental demand from science users is extremely fast response times for multidimensional queries. While most scientific users can use relational tables and have been forced to do so by many commercial DBMS systems, only a few users find tables to be a natural data model that closely matches their data. Furthermore, few users are satisfied with SQL as the interface language [30]. In contrast, it appears that arrays are a natural data model for a significant subset of science users, specifically in astronomy, oceanography, and remote-sensing applications. Moreover, a table with a primary key is merely a 1D array. Hence, an array data model can subsume the needs of users who are satisfied with tables.

Next we review the existing database technology supporting multidimensional arrays in scientific applications: 1D sensor time-series, 2D satellite imagery, 3D image time-series, and 4D atmospheric data.

2.1.1 Basic Notion of Arrays

Several approaches have been proposed toward the formalization of arrays and array query languages. The underlying methods of formalization differ, and the discussion is still open. However, the following notion of arrays is quite common [79]: An array is a set of cells of a fixed data type T, with a fixed cell size. Each cell corresponds to one element in the multidimensional domain of the array. The domain D of an array is a d-dimensional subinterval of a discrete coordinate set S = S1 × ... × Sd, where each Si, i = 1, ..., d, is a finite totally ordered discrete set and d is the dimensionality of the array.

15 16 2. Background and Related Work

The definition domain of an array is expressed as a multidimensional interval by its lower and upper bounds, li and ui respectively, along each dimension i of the domain, denoted as D = [l1 : u1; ...; ld : ud], where li < ui, i = 1, ..., d, and li, ui ∈ Si. Figure 2.1(a) shows the constituents of a sample 3D array.
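For concreteness, the following minimal C++ sketch (purely illustrative; the type and function names are assumptions and not the API of any particular array DBMS) represents such a definition domain and computes the number of cells it contains:

#include <cstddef>
#include <vector>

// Hypothetical representation of a d-dimensional definition domain
// D = [l1:u1; ...; ld:ud]; each interval bounds one dimension.
struct Interval { long lo, hi; };               // lo <= hi
struct Domain   { std::vector<Interval> dims; };

// Number of cells contained in the domain: the product of the extents.
std::size_t cellCount(const Domain& d) {
    std::size_t n = 1;
    for (const Interval& iv : d.dims)
        n *= static_cast<std::size_t>(iv.hi - iv.lo + 1);
    return n;
}

// Example: the 3D domain [0:999; 0:999; 0:11] contains
// 1000 * 1000 * 12 = 12,000,000 cells.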

Figure 2.1. 3D Array

The following subsections provide a brief summary of the main contributions of data modeling and query languages that support array data in GIS and remote-sensing imaging applications.

2.1.2 2D Data Models

A uniform representation and algebraic notation for manipulating image-based data structures known as map algebra was first advanced by Tomlin and Berry [56]. While not the first ones to describe this type of spatial data processing, Tomlin and Berry put forward the methodological basis for the organization of this form of geographical data analysis. Map algebra represents a method of treating individual rasters or array layers as members of algebraic equations. Map algebra functions are grouped into the following categories:

• Local functions create outputs in which output cell values are determined on a cell-by-cell basis without regard for the value of neighboring cells.

• Focal functions create outputs in which the value of the output grid is affected by the value of neighboring cells. Low-pass filters are commonly used to smooth out data (a short code sketch of a local and a focal function follows this list).

• Zonal functions create outputs in which the values of output cells are determined in part by the spatial association between cells in the input grids.

• Global functions compute an output raster where the value for each output cell is potentially a function of all of the input cell values.
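As an illustration of the first two categories, the following hedged C++ sketch (the raster layout and function names are assumptions, not tied to any particular GIS product) applies a local function cell by cell and a focal 3x3 mean filter:

#include <functional>
#include <vector>

// A simple row-major 2D raster of double-valued cells (illustrative only).
struct Raster {
    int rows = 0, cols = 0;
    std::vector<double> cells;                  // size == rows * cols
    double  at(int r, int c) const { return cells[r * cols + c]; }
    double& at(int r, int c)       { return cells[r * cols + c]; }
};

// Local function: each output cell depends only on the corresponding input cell.
Raster localApply(const Raster& in, const std::function<double(double)>& f) {
    Raster out = in;
    for (double& v : out.cells) v = f(v);
    return out;
}

// Focal function: each output cell is computed from its 3x3 neighborhood
// (here a mean filter, i.e., a simple low-pass smoothing).
Raster focalMean3x3(const Raster& in) {
    Raster out = in;
    for (int r = 0; r < in.rows; ++r)
        for (int c = 0; c < in.cols; ++c) {
            double sum = 0.0; int n = 0;
            for (int dr = -1; dr <= 1; ++dr)
                for (int dc = -1; dc <= 1; ++dc) {
                    int rr = r + dr, cc = c + dc;
                    if (rr >= 0 && rr < in.rows && cc >= 0 && cc < in.cols) {
                        sum += in.at(rr, cc);
                        ++n;
                    }
                }
            out.at(r, c) = sum / n;
        }
    return out;
}

A zonal or global function would follow the same pattern, except that each output cell would additionally consult a zone raster or the entire input, respectively.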

Figure 2.2 shows a graphical classification of grid functions according to map algebra.

Figure 2.2. Map Algebra Functions

Map algebra is primarily oriented toward 2D static data. Each layer is associated with a particular moment or period of time, and analytical operations are intended to deal with spatial relationships. In its original form, map algebra was never intended to handle spatial data with a temporal component.

2.1.3 Multidimensional Data Models

AQL

Libkin et al. [63] presented an array data model called AQL that embeds array support into a specific nested relational calculus and treats arrays as functions rather than collection types. The AQL data model combines complex objects such as sets, bags, and lists with multidimensional arrays. To express complex object values, the core calculus on which AQL is based has been extended with concepts such as comprehensions, pattern matching, and block structures that strengthen the expressive power of the language. Still, AQL does not provide a declarative mechanism to define the order in which queries manipulate data.

Array Manipulation Language (AML)

AML is a query language for multidimensional array data [80]. The model is aimed towards applications in image databases, particularly for remote sensing, but it is customizable to support a wide variety of application domains. An interesting characteristic of this language is the use of bit patterns, an array indexing mechanism that allows for a more powerful access structure to arrays. AML's algebra consists of three operators that enable the manipulation of arrays: subsample, merge, and apply. Each operator takes one or more arrays as arguments and produces an array as result. Subsample is a unary operator that eliminates cells from an array by cutting out slices. Merge is a binary operator that combines two arrays defined over the same domain. The apply operator applies a user-defined function to an array, thereby producing a new array. All AML operators take bit patterns as parameters.

Data and Query Model for Stream Geo-Raster Imagery

Gertz et al. [67] introduced a data and query model for managing and querying streams of remote-sensing imagery. The data model considers the spatio-temporal and geo-referenced nature of satellite imagery. Three classes of operators allow the formulation of queries. A stream restriction operator acts as a filter that selects points from a stream that satisfy a given condition on the spatial, temporal, or spatio-temporal component of the image. The stream transform operator maps the point or value associated with a stream to a new point or value set. This class of operators is useful for processing on a point-by-point basis. The third class of operators is called stream compositions, which allows the combination of image data from different spectral bands. To this end, each stream is considered to represent a single spectral band. However, since the primary objective of the authors was to stream geo-raster image data, they put less emphasis on post-processing satellite images. Core operations such as Fourier transforms and edge detection are therefore not supported by their framework.

Array Algebra

Baumann [75] introduced a formal array model called Array Algebra that supports the description and manipulation of multidimensional array data types [76]. The simple algebra consists of three core operators: an array constructor, a general condenser for computing aggregations, and an index sorter. The expressive power of Array Algebra through these operators enables a wide range of signal processing, imaging, and statistical operations. Moreover, the termination of any well-formed query is guaranteed by limiting the expressive power to non-recursive operations. Array Algebra is described in more detail in Chapter 3.

To date, Array Algebra is the most comprehensive and complete approach supporting a variety of applications including sensor, image, and statistical data. Recently, a geo-raster service standard based on Array Algebra concepts has been issued by the Open GeoSpatial Consortium (OGC) [78]. A commercial and open-source implementation of Array Algebra is currently available for the scientific community.

2.1.4 Storage Management

At present, handling large image data stored in a database is usually carried out by adopting a tiling strategy [23]. An image is split into sub-images (tiles), as shown in Fig. 2.3. When a region of interest is requested in a given query operation, only the relevant tiles are accessed. This strategy results in significant I/O bandwidth savings. Tiles form the basic processing units for indexing and compression. Spatial indexing allows for the quick retrieval of the identifier and location of a required tile, while compression improves disk I/O bandwidth efficiency. The choice of tile size is crucial for efficiency: while large tiles return much redundant data in response to a range query, small tiles result in a poor compression ratio; typical tile sizes range from 8 KB (very small) to 512 KB (very large) [23, 96]. A comprehensive approach toward the storage of large amounts of data on tertiary storage media considering tiling techniques in multidimensional database management systems is presented in [23, 24, 25].
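To make the access pattern concrete, here is a minimal C++ sketch (tile size and function names are assumptions, not a particular system's storage interface) that determines which tiles of a regularly tiled 2D image intersect a queried region of interest; only those tiles need to be fetched from disk:

#include <utility>
#include <vector>

// A rectangular region of interest in (non-negative) cell coordinates, inclusive bounds.
struct Region { long x0, y0, x1, y1; };

// Return the (tile row, tile column) indices of all tiles of a regular
// tiling with tileW x tileH cells that intersect the region of interest.
std::vector<std::pair<long, long>>
tilesForRegion(const Region& roi, long tileW, long tileH) {
    std::vector<std::pair<long, long>> hits;
    for (long ty = roi.y0 / tileH; ty <= roi.y1 / tileH; ++ty)
        for (long tx = roi.x0 / tileW; tx <= roi.x1 / tileW; ++tx)
            hits.emplace_back(ty, tx);
    return hits;
}

// Example: with 512 x 512 tiles, a 100 x 100 region starting at cell (600, 600)
// touches only tile (1, 1) rather than the whole image.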

Figure 2.3. Image Tiling

A key factor influencing the effectiveness of a tiling scheme is compression. Raster data compression algorithms are the same as algorithms for compression of other image data. However, remote-sensing images are usually of much higher resolution, are multi-spectral, and have significantly larger volumes than natural images. To effectively compress raster data in GIS environments, emphasis must be placed on the management of schemas to deal with large volumes of remote-sensing imagery, and on the integration of various types of datasets such as vector and multidimensional datasets [3, 87].

Dehmel [3] proposed a comprehensive framework for the compression of multidimensional arrays based on different model layers, including various kinds of predictors and a generic wavelet engine for lossy compression with arbitrary quality levels. In particular, the author introduces concepts such as channel separation to compress values for each channel separately, and predictors that calculate approximate values for some cells and express those cell values relative to the approximate values. Further, the proposed method applies wavelets to transform the channels individually into multi-resolution representations with coarse approximations and various levels of detail information. This led to a wavelet engine architecture consisting of three major components (transformation, quantization, and compression) that helps improve compression rates considerably in array databases.

2.1.5 2D Pre-Aggregation

Aggregate operations on GIS and remote-sensing applications have been shown to be computationally expensive due to the size and complexity of the operations [8]. One such operation is zooming (scaling), which is carried out by interpolating the values of the original dataset to downsample it to a lower resolution. This is particularly necessary in web-based raster applications, where limitations such as bandwidth and other resources prevent the efficient processing of the original raster datasets. For smooth interactive panning, browsers load the image in tiles and in quantities larger than actually displayed. Zooming far out results in large scale factors, meaning that large amounts of data must be moved to deliver minimal results.

Current database technology for GIS and remote-sensing imaging applications employs multi-scale image pyramids to improve the performance of scaling operations on 2D raster images [51, 70, 82]. Image pyramids are a technique that consists of resampling the original dataset and creating a number of copies from it, where each copy is resampled at a coarser resolution (Fig. 2.4). The pyramid consists of a finite number of levels that differ in scale by a fixed step factor, and are much smaller in size than the original dataset but adequate for visualization at a lower scale (zoom ratio). Common practice is to construct pyramids in scale levels of a power of 2, yielding scale factors 2, 4, 8, 16, 32, 64, 128, 256, and 512. When more detailed data are needed, or when it becomes necessary to access the original image, better access speed can be achieved by accessing a smaller piece of the original data, provided the original data are cut into smaller pieces. A restricted area of the image, instead of the entire image, is then accessed.

Figure 2.4. Image Pyramids

Pyramid Construction

The construction of pyramid layers requires resampling of original image cell values. Resampling interpolates cell values or otherwise assigns values to cells of a new raster object. It results in a raster with larger or smaller cells and different dimensions. Resampling changes the scale of an input raster, and is used in conjunction with geometric transformation models that change the internal geometry of a raster. The following are the most popular interpolation methods [34]:

• Nearest neighbor is the resampling technique of choice for discrete (categorical) data since it does not alter the value of the input cells [64]. After the cell's center on the output raster dataset is located on the input raster, the nearest neighbor assignment determines the location of the closest cell center on the input raster and assigns the value of that cell to the cell on the output raster.

• Linear interpolation is used to interpolate along value curves. It assumes that cell values vary in proportion to distance along a value segment: v = a + bx. Linear interpolation may be used to interpolate feature attribute values along a line segment connecting any two point value pairs.

• Bilinear interpolation is used to interpolate cell values at direct positions within a quadrilateral grid. It assumes that feature attribute values vary as a bilinear function of position within the grid cell: v = a + bx + cy + dxy. Given a direct position, p, in a grid cell whose vertices are V, V + V1, V + V2, and V + V1 + V2, where V1 and V2 are offset vectors of the grid, and with cell values at the vertices v1, v2, v3, and v4, respectively, there are unique numbers i and j, with 0 ≤ i ≤ 1 and 0 ≤ j ≤ 1, such that p = V + iV1 + jV2. The cell value at p is: v = (1 − i)(1 − j)v1 + i(1 − j)v2 + j(1 − i)v3 + ijv4 (a worked sketch of this formula follows this list). Since the values for output cells are calculated according to the relative positions and values of input cells, bilinear interpolation is preferred for data where the location from a known point or phenomenon determines the value assigned to a cell (that is, continuous surfaces). Elevation, slope, intensity of noise from an airport, and salinity of groundwater near an estuary are phenomena represented as continuous surfaces and are most appropriately resampled using bilinear interpolation.

• Quadratic interpolation is used to interpolate cell values along curves. It assumes that cell values vary as a quadratic function of distance along a value segment: v = a + bx + cx², where a is the value of a cell at the start of a value segment and v is the value of a cell at distance x along the curve from the start. Three point value pairs are needed to provide control values for calculating the coefficients of the function.

• Cubic interpolation is used to interpolate cell values along curves. It assumes that cell values vary as a cubic function of distance along a value segment: v = a + bx + cx² + dx³, where a is the value of a cell at the start of a value segment and v is the value of a cell at distance x along the curve from the start. Four point value pairs are needed to provide control values for calculating the coefficients of the function. Cubic convolution has a tendency to sharpen the edges of the data more than bilinear interpolation, since more cells are involved in the calculation of the output values.
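As a concrete illustration of the bilinear case above, the following short C++ sketch (the function name and calling convention are assumptions) evaluates v = (1 − i)(1 − j)v1 + i(1 − j)v2 + j(1 − i)v3 + ijv4 for a position inside one source cell:

// Bilinear interpolation inside a single grid cell, following the formula above:
// v1..v4 are the cell values at the vertices V, V+V1, V+V2, and V+V1+V2, and
// (i, j) in [0,1] x [0,1] are the fractional offsets of the position p in the cell.
double bilinear(double v1, double v2, double v3, double v4, double i, double j) {
    return (1 - i) * (1 - j) * v1
         +      i  * (1 - j) * v2
         +      j  * (1 - i) * v3
         +      i  *      j  * v4;
}

// Resampling a whole output raster repeats this per output cell: the output cell
// center is mapped back into input coordinates, the enclosing input cell is found,
// and the value is interpolated from that cell's four vertex values.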

Pyramid Evaluation

During the evaluation of a scaling operation with a target scale factor s, the pyramid level with the largest scale factor s0 with s0 < s is determined. This level is loaded, and then an adjustment is made by scaling the resulting image by a factor of s/s0. If, for example, scaling by s = 11 is required, then pyramid level 3 with scale factor s0 = 8 is chosen, requiring a residual scaling of 11/8 = 1.375 and thereby touching only 1/64 of what would be read without a pyramid.

The computational complexity of a scaling operation depends on the chosen resampling method. For example, nearest-neighbor resampling considers the closest cell center of the input raster and assigns the value of that cell to the corresponding cell on the output raster. Other resampling methods such as bilinear and cubic interpolation consider a subset of cells to calculate each of the cell values in the output raster. Fig. 2.5 shows three common options for interpolating output cell values. Note that the bold outline (center image) indicates the current target cell for which a value is being interpolated.

(a) Portion of original raster (b) Portion of output raster (c) Input cells used by common resampling methods

Figure 2.5. Nearest Neighbor, Bilinear and Cubic Interpolation Methods
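The level-selection rule described above fits in a few lines. The sketch below (a hypothetical helper assuming power-of-two pyramid levels, not the implementation of any particular system) picks the coarsest precomputed level whose scale factor does not exceed the target factor and returns the residual scaling that remains to be applied:

#include <utility>

// Given a target scale factor s and a pyramid whose level k holds the image
// downscaled by 2^k (level 0 is the original), choose the largest precomputed
// factor s0 = 2^k not exceeding s and return (level, residual factor s / s0).
std::pair<int, double> choosePyramidLevel(double s, int numLevels) {
    int level = 0;
    double s0 = 1.0;
    while (level + 1 < numLevels && s0 * 2.0 <= s) {
        s0 *= 2.0;
        ++level;
    }
    return {level, s / s0};
}

// Example: s = 11 yields level 3 (s0 = 8) and a residual scaling of
// 11 / 8 = 1.375, matching the example in the text above.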

A characteristic of the pyramid approach is that it increases the size of a raster dataset by approximately 33 percent. This is because the additional reduced-resolution representations are stored in the system together with the original dataset (the successive levels contribute roughly 1/4 + 1/16 + 1/64 + ... ≈ 1/3 of the original size). This overhead is offset, however, by the improved response times obtained in return. The choice of resampling method for constructing the pyramid is influenced by the data characteristics and the type of analysis performed on the data. For example, visual appearance of remote-sensing imagery is best using nearest-neighbor resampling, whereas scientific interpretation may require cubic interpolation. Rasters representing categorical data, e.g., land-use data, do not allow interpolation since it is important that original data values remain unchanged; hence only nearest-neighbor resampling can be applied [64]. The reason categorical data should not be interpolated is that intermediate values cannot be derived with meaningful results. For example, soil type data cannot be interpolated, since a soil type 14 and a soil type 15 cannot sensibly be averaged to derive a soil type 14.5. Creating pyramids for different resampling methods is not efficient due to the additional resources required for storage and maintenance. Thus, the hard-wired resampling approach poses significant flexibility limitations for users when analytic objectives diverge.

Fast retrieval of raster image datasets has also been investigated in distributed database systems. Kitamoto [14] proposed a caching mechanism that allows two-dimensional satellite imagery to be cached at minimum resolution to provide a coarse view of the images in distributed satellite image databases. The cache management problem is treated as the knapsack problem [14], where the relevance and size of the data are considered to determine whether the data will be cached or not. Additionally, access patterns influence the relevance of the data. The frequency of requests for a given image and its resulting popularity rank are included in the strategy for caching selection. Prediction of user access patterns is not considered, however.

More recently, methods exploiting the capabilities of modern graphics hardware have been applied to the organization and processing of large amounts of satellite imagery. For example, Boettger et al. presented a method based on the concepts of perspective and complex logarithm [90] for visualization and navigation of satellite and aerial imagery [50]. Datasets are decomposed into tiles of different sizes and levels of resolution according to a pre-defined area of interest. The tiles closer to the center of interest have higher resolution, whereas low-resolution tiles are created for parts further away. The resulting tiles are indexed and cached in the memory of the graphics hardware, enabling quick access to the area of interest with the best available resolution. When the center of interest is changed, tiles not yet available in graphics memory are loaded. Based on the assumption that the graphics memory offers more space than needed, the cache contains not only the tiles that conform to the area of interest, but also those that will presumably be needed in the future.

2.1.6 Pre-Aggregation Beyond 2D

Geographic phenomena can be examined at different granularities. This includes different spatial perspectives and temporal views. Earth remote-sensing imagery can be treated as time-series data to study and track changes over time. For example, a user looking at changes in vegetation patterns over a certain region during the past 10 years can see their effect on the regional maps over that time period. Fig. 2.6 shows various instances of scaling operations on a 3D image time-series. Figure 2.6(a) shows the original dataset, which consists of two spatial dimensions (dim 1, dim 2) and one temporal dimension (dim 3). Figure 2.6(b) shows the original dataset scaled down along the two spatial dimensions. Figure 2.6(c) shows a scaling operation along the time dimension of the original dataset. Figure 2.6(d) shows the original dataset scaled down in the spatial and temporal dimensions.

Shifts in temporal detail have been studied in various application domains [18, 22, 43]. At the time of this writing, there is little support for zooming with respect to time in GIS technology: the focus has been on studying such alterations with respect to the geometric (vector) properties of objects [54, 58, 59].

Datasets in environmental observation and climate modeling are often defined over a 4D spatio-temporal space of the form (x, y, z, t), possibly extended with topology relationships. Scaling operations are also critical for these kinds of applications due to the size and dimensionality of the data. Extremely large volumes of data are generated during climate simulations. While only one part might be needed for a specific data analysis, huge data volumes are moved. This is particularly true for time-series data analysis. At the time of this writing, however, 4D scaling operations are not supported for GIS and remote-sensing imaging applications.

(a) 3D dataset (b) 3D dataset (scaled-down along dim1 and dim2 by a factor of 2)

(c) 3D dataset (scaled-down along dim3 by a factor of 4) (d) 3D dataset (scaled-down along all dimensions by a factor of 2)

Figure 2.6. 3D Scaling Operations on Time-Series Imagery Datasets

2.1.7 Summary

Array database theory is gradually entering its consolidation phase. The notion of arrays as functions mapping points of some hypercube-shaped domain to values of some range set is commonly accepted. Two main modeling paradigms are used: calculus and algebra. Multidimensional data models embed arrays into the relational world, either by providing conceptual stubs like Array Algebra, or by adding relational capabilities explicitly such as AQL and RAM. Notably, aggregate query processing plays a critical role given the large volumes of the arrays. Our study shows that pre-aggregation techniques focus only on 2D datasets, and that support is limited to one particular operation: scaling. We distinguish the pyramid approach as the most popular method for speeding up scaling operations on 2D datasets, despite its known limitations such as hard-wired interpolation and lack of support for higher-dimensional datasets. Advances in graphics hardware are enabling quicker and more accurate visualization and navigation capabilities for raster imagery. However, little work has been reported on how array database technology is progressively exploiting these hardware advances. A critical gap with respect to pre-aggregation is the lack of support for aggregate operations other than 2D scaling.

2.2 On-Line Analytical Processing (OLAP)

Data warehousing/OLAP is an application domain where complex multidimensional aggregates on large databases have been studied intensively. Typically, a data warehouse collects business data from one or multiple sources so that the desired financial, marketing, and business analyses can be performed. These kinds of analyses can detect trends and anomalies, make projections, and make business decisions [41]. When such analysis predominantly involves aggregate queries, it is called on-line analytical processing, or OLAP [38, 39]. To understand the mechanism of pre-computation, the following subsections review different approaches to structuring multidimensional data, storage mechanisms, and operations in OLAP.

2.2.1 OLAP Data model

The multidimensional OLAP model begins with the observation that the factors that influence decision-making processes are related to enterprise-specific facts, such as sales, shipments, hospital admissions, surgeries, and so on [68]. Instances of a fact correspond to events that occur. For example, every sale or shipment carried out is an event. Each fact is described by the values of a set of relevant measures providing quantitative descriptions of events; e.g., sales receipts, amounts shipped, hospital admission costs, and surgery times are all measures.

In OLAP, information is viewed conceptually as cubes that consist of descriptive categories (dimensions) and quantitative values (measures) [26, 81, 69, 83]. In the scientific literature, measures are at times called variables, metrics, properties, attributes, or indicators. Figure 2.7 illustrates a 3D OLAP data cube where business events

(facts) are mapped at the intersection of a specific combination of dimensions. Different attributes along each dimension are often organized in hierarchical structures that determine the different levels in which data can be further analyzed [26]. For example, within the time dimension, one may have levels composed of years, months, and days. Similarly, within the geography dimension, one may have levels such as country, region, state/province, or city. Hierarchical structures are used to infer summarization (aggregation), that is, whether an aggregate view (query) defined for some category can be correctly derived from a set of precomputed views defined for other categories.

Figure 2.7. OLAP Data Cube

2.2.2 OLAP Operations

OLAP includes a set of operations for the manipulation of dimensional data organized in multiple levels of abstraction. Basic OLAP operations are roll-up, drill-down, slice, dice, and pivot [44]. A roll-up (aggregation) operation computes higher aggregations from lower aggregations or base facts according to their hierarchies, whereas drill-down (disaggregation) is an analytic technique whereby the user navigates among levels of data ranging from the most summarized/aggregated to the most detailed. Typical OLAP aggregate functions include average, maximum, minimum, count, and sum. Drilling paths may be defined by the hierarchies within dimensions or by other dynamic relationships within or between dimensions. A slice consists of the selection of a smaller data cube or even the reduction of a multidimensional data cube to fewer dimensions by a point restriction in some dimension. The dice operation works similarly to the slice except that it performs a selection on two or more dimensions. Figure 2.8 provides a graphical description of these operations.
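As a hedged illustration of a roll-up along the time hierarchy (the fact layout, names, and SUM aggregate are assumptions, not the data model of any particular OLAP product), the following C++ sketch aggregates daily sales facts to monthly totals per (product, city) combination:

#include <map>
#include <string>
#include <tuple>
#include <vector>

// One fact: a sale measured on a given day for a given product and city.
struct Sale { std::string day, month, product, city; double amount; };

// Roll-up along the time hierarchy (day -> month): aggregate the measure with
// SUM for each (month, product, city) combination.
std::map<std::tuple<std::string, std::string, std::string>, double>
rollUpToMonth(const std::vector<Sale>& facts) {
    std::map<std::tuple<std::string, std::string, std::string>, double> cube;
    for (const Sale& f : facts)
        cube[std::make_tuple(f.month, f.product, f.city)] += f.amount;
    return cube;
}

// Drill-down is the inverse navigation: returning from the monthly totals to the
// underlying daily facts, which requires the finer-grained data to be retained.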

2.2.3 OLAP Architectures

Figure 2.9 shows different approaches for the implementation of OLAP functionalities: Multidimensional OLAP (MOLAP), Relational OLAP (ROLAP), and Hybrid OLAP

Figure 2.8. Typical OLAP Cube Operations

(HOLAP). These approaches offer a common view in the form of data cubes, which are independent of how the data is stored.

Figure 2.9. OLAP Approaches: MOLAP, ROLAP, and HOLAP

MOLAP

MOLAP maintains data in a multidimensional matrix based on a non-relational specialized storage structure [37], see Fig. 2.10(a). While building the storage structure, selected aggregations associated with all possible roll-ups are precomputed and stored [92]. Thus, roll-up and drill-down operations are executed in interactive time. Products such as Oracle Essbase, IBM Cognos Powerplay, and the open-source Palo have adopted this approach.

A MOLAP system is based on an ad-hoc logical model that directly represents multidimensional data and its applicable operations. The underlying multidimensional database physically stores data as arrays, and access to it is positional [68]. Grid-files [53, 55], R*-trees [71], and UB-trees [84] are among the techniques used for that purpose.

The main advantage of this approach is that it contains the pre-computed aggregate values that offer a very compact and efficient way to retrieve answers for specific aggregate queries [68]. One difficulty that MOLAP poses, however, pertains to the sparseness of the data. Sparseness means that many events did not take place, and valuable processing time is taken by adding up zeros [91]. For example, a company may not sell every item every day in every store, so no values appear at the intersection where products are not sold in a particular region at a particular time. On the other hand, MOLAP can be much faster for applications where subsets of the data cube are dense [100]. Another limitation of this approach is that the computation of a cube requires a complex aggregate query across all data in a warehouse. Though it is possible to incrementally update cubes as new data arrives, it is impractical to dynamically create new cubes to answer ad-hoc queries [68].

Figure 2.10. MOLAP Storage Scheme

ROLAP

In ROLAP, the underlying data is stored in a relational database, see Fig. 2.11(a). The relational model, however, does not include the concepts of dimension, measure, and hierarchy. Thus, specific types of schemata must be created so that the multidimensional model can be represented in terms of basic relational elements such as attributes, relations, and integrity constraints [68]. Such representations are done using a star schema data model, although the snowflake schema is also often adopted.

ROLAP implementations can handle large amounts of data and leverage all functionalities of the relational database [72]. Disadvantages are that overall performance is slow and that each ROLAP report represents an SQL query with the limitations of the genre. ROLAP vendors have tried to mitigate this problem by including out-of-the-box complex functions in their product offerings and by providing users the capability of defining their own functions. Another problem with ROLAP implementations results from the performance hit caused by costly join operations between large tables [68]. To overcome this issue, fact tables in data warehouses are usually de-normalized. Substantial performance gains can be achieved through the materialization of derived tables (views) that store aggregate data used for typical OLAP queries.

Figure 2.11. ROLAP Storage Scheme

Figure 2.12 shows the formulation of a typical query in both ROLAP and MOLAP. The query yields sales information for a specific product sold in a particular city by a given vendor. The formulation of the queries is done according to the syntax of Oracle 10g. Note the considerable difference in length between the two query formulations.

(a) Sample ROLAP query (b) Sample MOLAP query

Figure 2.12. Typical Query as Expressed in ROLAP and MOLAP Systems

HOLAP
The intermediate architecture type, HOLAP, mixes the advantages offered by ROLAP and MOLAP. It takes advantage of the standardization level and the ability to manage large amounts of data from ROLAP implementations, and of the query speed typical of MOLAP systems. For summary-type information, HOLAP leverages cube technology, and for drilling down into details it uses the ROLAP model. In a HOLAP architecture, the largest amount of data should be stored in an RDBMS to avoid the problems caused by sparsity, and a multidimensional system should store only the information users most frequently need to access [68]. If that information is not enough to solve a query, the system transparently accesses the data managed by the relational system.

2.2.4 OLAP Pre-Aggregation

OLAP systems require fast interactive multidimensional data analysis of aggregates. To fulfill this requirement, database systems frequently pre-compute aggregate views on some subset of dimensions and their corresponding hierarchies. Virtually all OLAP products resort to some degree of pre-computation of these aggregates, a process known as pre-aggregation. OLAP pre-aggregation techniques have proved to speed up aggregate queries by several orders of magnitude in business applications [31, 41]. A full pre-aggregation of all possible combinations of aggregate queries, however, is not considered feasible because it often exceeds the available storage limit and incurs a high maintenance cost. Therefore, modern OLAP systems adopt a partial pre-aggregation approach where only a set of aggregates is materialized so that it can be re-used for efficiently computing other aggregates. Pre-aggregation techniques consist of three inter-related processes: view selection, query rewriting, and view maintenance. A view is a derived relation defined in terms of base relations. Views can be materialized by storing the tuples of a view in a database, as was first investigated in the 1980s [36]. Like a cache, a materialized view provides fast access to its data. However, a cache may get dirty whenever its underlying base relations are updated. The process of updating a materialized view in response to changes to its base data is called view maintenance [12].

View Selection
Gupta et al. [13] proposed a framework that shows how to use materialized views to help answer aggregate queries. The framework provides a set of query rewriting rules to determine which materialized aggregate views can be employed to answer aggregate queries. An algorithm uses these rules to transform a query tree into an equivalent tree with some or all base relations replaced by materialized views. Thus, a query optimizer can choose the most efficient tree and provide the best query response time. Harinarayan et al. [92] investigated the issue of how to select views for materialization under storage space constraints so that the average query cost is minimal. To meet changing user needs, several dynamic pre-aggregation approaches have been proposed. In principle, views may be either selected on demand or pre-selected using some prediction strategy. For applications where storage space is a constraint, replacement algorithms identify those views that can be replaced with new selections [60]. Kotidis et al. [97] introduced a dynamic view selection approach called Multidimensional Range Queries (MRQ), known as slice queries in OLAP, which uses an on-demand fetching strategy. Within this approach, the level of detail or granularity is a compromise between the materialization of many small, highly specific queries, and the materialization of a few large queries followed by answering incoming queries at each stage using the materialized queries. This approach, however, does not take user access patterns into account before making selections. The first work to consider user access information to evaluate potential queries to be materialized is presented in [26], where the author introduced PROMISE, an approach that predicts the structure and value of the next query based on the current query. Yao et al. [99] proposed a different approach for the materialization of dynamic views. A set of batch queries is rewritten using certain canonical queries so that the total cost of execution can be reduced by using the intermediate results for answering queries appearing later in the batch. This approach requires all queries to be precisely known beforehand, and though the approach might work well in a particular database scenario, it might not be useful in dynamic OLAP, where it is extremely difficult to accurately predict the exact nature of future queries.
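To make the view-selection idea concrete, the following Python sketch greedily picks views under a storage budget by benefit per unit of storage. It is only an illustration with made-up view names, sizes, and savings; the actual algorithm of Harinarayan et al. additionally recomputes each view's benefit relative to the views already selected, which this sketch omits.

# Hypothetical illustration of greedy view selection under a storage budget.
# Candidate views are described by their size and the per-query cost savings
# they would provide; all names and numbers are invented for the example.
def select_views(candidates, budget):
    selected, used = [], 0
    remaining = dict(candidates)          # view name -> (size, savings)
    while remaining:
        # pick the view with the best benefit per unit of storage
        name, (size, savings) = max(remaining.items(),
                                    key=lambda kv: kv[1][1] / kv[1][0])
        if used + size > budget or savings <= 0:
            break                         # simplification: stop instead of trying smaller views
        selected.append(name)
        used += size
        del remaining[name]
    return selected, used

candidates = {
    "sales_by_month_product": (120, 900),   # (size in MB, estimated saving)
    "sales_by_month":         (15, 400),
    "sales_by_region":        (40, 350),
}
print(select_views(candidates, budget=60))  # (['sales_by_month', 'sales_by_region'], 55)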

View Maintenance
In most cases it is wasteful to maintain a view by recomputing it from scratch. Materialized views are therefore maintained using an incremental approach [11]. Only the changes to be propagated to the materialized view are computed, using the changes of the source relations [1, 33, 89]. At present, view maintenance has been investigated from these four dimensions [11]:

• Information Dimension: Focuses on accessing the information required for view maintenance, such as base relations and the materialized view.

• Modification Dimension: Focuses on the kinds of modifications, e.g., insertions and deletions, that a view maintenance algorithm can handle.

• Language Dimension: Addresses the problems related to the language of the views supported by the view maintenance algorithm. That is, what is the language of the views that can be maintained by the view maintenance algorithm? How are views expressed? Does the algorithm allow duplicates?

• Instance Dimension: Considers the applicability of the algorithm to all or a specific set of instances of the database.

View maintenance cost is the sum of the cost of propagating each base relation change to the affected materialized views. The sum can be weighted, where each weight indicates the frequency of propagations of the changes of the associated source relation. When a base relation affects more than one materialized view, multiple maintenance expressions must be evaluated. Multi-query optimization techniques can be used to detect common sub-expressions between the maintenance expressions so that an efficient global evaluation plan for the maintenance expressions can be achieved [61, 62]. Numerous methods have been developed for materialized view maintenance in conventional database systems. Zhuge et al. [101] introduced the Eager Compensating Algorithm (ECA), based on previous incremental view maintenance algorithms and compensating queries used to eliminate anomalies. In [102], the authors define the problem of keeping multiple views mutually consistent as the multiple view consistency problem. Further research from the same authors [102, 103] considers data warehouse views defined on base tables located in different data sources, i.e., if a view involves n base tables, then n data sources are also involved. A common characteristic of the early approaches to view maintenance is the considerable need for accessing base relations, which in most cases results in performance degradation. The improvement of the efficiency of view maintenance techniques has been a topic of active research in the database research community [15, 65, 85, 98].
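As a minimal illustration of the incremental idea (a sketch of delta propagation, not of ECA or of any particular system), the following Python snippet maintains a materialized SUM/COUNT view grouped by one key and applies only the inserted and deleted tuples of the base relation; all names and values are invented for the example.

# Minimal sketch of incremental maintenance for a materialized SUM/COUNT view.
from collections import defaultdict

class SumCountView:
    def __init__(self):
        self.sums = defaultdict(float)
        self.counts = defaultdict(int)

    def apply_delta(self, inserted=(), deleted=()):
        # Propagate only the changes (deltas) of the base relation,
        # instead of recomputing the view from scratch.
        for key, value in inserted:
            self.sums[key] += value
            self.counts[key] += 1
        for key, value in deleted:
            self.sums[key] -= value
            self.counts[key] -= 1

view = SumCountView()
view.apply_delta(inserted=[("p1", 10.0), ("p1", 5.0), ("p2", 7.0)])
view.apply_delta(deleted=[("p1", 10.0)])
print(view.sums["p1"], view.counts["p1"])   # 5.0 1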

Spatial OLAP (SOLAP)
The multidimensional approach used by data warehouses and OLAP does not support array data types or spatial data types such as points, lines, or polygons. Following the development trends of data warehouse and data mining techniques, Stefanovic et al. [52] proposed the construction of a spatial data warehouse to enable on-line data analysis in spatial-information repositories. The authors used a star/snowflake model to build a spatial data cube consisting of both spatial and non-spatial dimensions and measures: the data cube shown in Fig. 2.13 consists of one spatial dimension (region) and three non-spatial dimensions (precipitation, temperature, and time).

Figure 2.13. Star Model of a Spatial Warehouse

Current research in spatial data management focuses on querying spatial data, particularly regarding the improvement of aggregate query performance [57] for spatial-vector data structures. Alas, little attention has been given to spatial-raster data [42, 73, 86]. Support for spatial-raster data typically consists of creating a spatial-raster cube from information in the metadata file (such as size, level, width, height, date of creation, format, and location) [28, 94]. Vega et al. [40] presented a model to analyze and compare existing techniques for the evaluation of aggregate queries on spatial, temporal, and spatio-temporal data. The study shows that existing aggregate computation techniques rely on some form of pre-aggregation and that support is restricted to distributive aggregate functions such as COUNT, SUM, and MAX. Additionally, the authors identify several important needs concerning aggregate computation. First, they discuss the need to develop further and more substantial techniques to support holistic aggregate functions, e.g., MEDIAN and RANK, and to better support selective predicates. The second observation pertains to the lack of support for queries that need to be efficiently evaluated at every granule in time. Existing aggregate computation techniques focus only on spatial objects such as lines, points, and polygons but do not consider aggregate computation on data grid (array) structures.
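The difference between distributive and holistic aggregate functions mentioned above can be demonstrated in a few lines of Python (our illustration, with arbitrary numbers): SUM and MAX over the full dataset can be recombined from per-chunk partial results, whereas the median of per-chunk medians is in general not the global median.

# Illustration of distributive vs. holistic aggregate functions.
import statistics

chunk_a = [1, 2, 100]      # partial results computed per chunk / per tile
chunk_b = [3, 4]
full = chunk_a + chunk_b

# Distributive: SUM and MAX of the whole dataset can be combined from
# the per-chunk results alone.
assert sum(full) == sum([sum(chunk_a), sum(chunk_b)])
assert max(full) == max(max(chunk_a), max(chunk_b))

# Holistic: the median cannot be derived from the per-chunk medians.
partial = [statistics.median(chunk_a), statistics.median(chunk_b)]   # [2, 3.5]
print(statistics.median(partial))   # 2.75
print(statistics.median(full))      # 3  -> combining partial medians is wrong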

2.3 Discussion

Query performance is a major concern underlying the design of databases in both business and remote-sensing imaging applications. While there are some valuable research results in the realm of pre-aggregation techniques to support query processing in business and statistical applications, little has been done in the field of array databases. The question therefore arises: what distinguishes array data from traditional data types such that it cannot be fully supported by relational databases and thus take advantage of advanced technologies such as OLAP? OLAP from its very conception was designed to assist in the decision-making process of business applications, where business perspectives such as products and/or stores represented the dimensions of the data cube. And while the different columns in a data cube are usually called dimensions, they generally cannot be considered as a spatial extent of the entities modeled by the database. Instead, they are regarded as explicit attributes that characterize a particular entity. Some dimensions in a data cube (e.g., CustomerId) are defined over discrete domains which do not have a natural ordering among their values (customer 1000 cannot be considered close to customer 1001). In such cases, any ordering defined for the values in one of these columns is arbitrary [40]. For this reason, existing OLAP solutions and related pre-aggregation techniques cannot be applied to multidimensional arrays, at least not in a straightforward manner. Recently, however, a new trend in OLAP gained considerable popularity due to its capabilities to support Geo-spatial data. Spatial OLAP considers the case in which a data cube may have both spatial and non-spatial dimensions. However, spatial OLAP focuses mainly on spatial-vector data, and so far little support has been provided for spatial-raster data in terms of selective materialization for the optimization of aggregates. Support is limited only to those operations that can be constructed from metadata available for the raster, but not to the improvement of the computation of aggregate operations over the values of raster datasets. At present, pre-aggregation support in array databases is limited. Only one comparatively simple pre-aggregation technique has been used, namely image pyramids. The limitation of this technique to two-dimensional datasets and hard-wired interpolation calls for the development of more flexible and efficient techniques. From our study of data modeling, storage techniques, and operations in OLAP and remote-sensing imaging applications, we have observed the following similarities:

• Array databases and OLAP systems typically employ multidimensional data models to organize their data.

• Both application domains handle large volumes of multidimensional data.

• Operations convey a high degree of similarity, for instance, a roll-up (aggregate) operation in OLAP such as computing the weekly sales per product is very similar to scaling a satellite image by a factor of seven along the X axis. Figure 2.14 illustrates this similarity.

(a) Scaling operation (b) Roll-Up operation

Figure 2.14. Comparison of Roll-Up and Scaling Operations

• Both application domains use pre-aggregation approaches to speed up query processing: OLAP pre-aggregation techniques support a wide range of aggregate operations and speed up query processing by several orders of magnitude (the last reported benchmark showed speed-up factors of up to 100 [29, 88]). Scaling of 2D datasets always uses the same scale factor on each dimension to maintain a coherent view, whereas for datasets of higher dimensionality the scale factors are independent. Scaling resembles a primitive form of pre-aggregation in comparison to existing OLAP pre-aggregation techniques.

• While data in OLAP applications are sparsely populated, remote-sensing imagery usually is densely populated (100%). There are no guidelines stating when an OLAP data cube is considered sparse or dense; however, a data cube containing 30 percent empty cells is usually treated with sparsity-handling techniques in most OLAP systems.

Furthermore, when compared to well-known OLAP pre-aggregation techniques, GIS image pyramids are different in several respects:

• Image pyramids are constrained to 2D imagery. To the best of our knowledge there is no generalization of pyramids to n-D.

• The x and y axes are always zoomed by the same scalar factor s in the 2D zoom vector (s, s). This is exploited by image pyramids in that they only offer pre-aggregates along a scalar range. In this respect, image pyramids actually are 1D pre-aggregates.

• Several interpolation methods are used for resampling during scaling. Some techniques are standardized [48]; they include nearest-neighbor, bi-linear, bi-quadratic, bi-cubic, and barycentric. The two scaling steps incurred for image pyramids (construction of the pyramid level and rest scaling) must be done using the same interpolation technique to achieve valid results (a small construction sketch follows this list). In OLAP, summation during roll-up corresponds to linear interpolation in imaging.

• Scale factors are continuous, as opposed to the discrete hierarchy levels in OLAP. It is, therefore, impossible to materialize all possible pre-aggregates.
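The following Python sketch illustrates the points above: a pyramid is built by repeated 2x2 averaging, a scale query whose factor matches a materialized level is answered directly, and any other factor still needs a residual ("rest") scaling step. The code is an illustration under simplifying assumptions (block averaging instead of standardized interpolation); it is not the image-pyramid implementation of any particular GIS.

import numpy as np

def build_pyramid(image, levels):
    """Build pyramid levels by averaging non-overlapping 2x2 blocks."""
    pyramid = [image]
    for _ in range(levels):
        img = pyramid[-1]
        h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
        img = img[:h, :w]
        coarser = img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
        pyramid.append(coarser)
    return pyramid

def block_scale(image, factor):
    """Downscale by an integer factor via block averaging ('rest scaling')."""
    h, w = image.shape[0] // factor, image.shape[1] // factor
    return image[:h * factor, :w * factor].reshape(h, factor, w, factor).mean(axis=(1, 3))

base = np.arange(64 * 64, dtype=float).reshape(64, 64)
pyr = build_pyramid(base, levels=3)            # materialized scale factors 1, 2, 4, 8

# A query for scale factor 8 is answered directly from the materialized level:
print(np.allclose(pyr[3], block_scale(base, 8)))   # True

# A query for factor 6 would start from level 2 (factor 4) and still needs a
# residual scaling of 6/4 = 1.5, which requires interpolation (not shown here).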

Based on these observations, this thesis aims to systematically carry over results from OLAP to array databases and to provide pre-aggregation support not only for queries using basic aggregate functions, but also for more complex operations such as scaling. As a preliminary and fundamental step, it is necessary to have a clear understanding of the various operations performed on remote sensing imagery and to identify those that involve aggregation computation. The next chapter addresses this issue in more detail.

Chapter 3

Fundamental Geo-Raster Operations in GIS and Remote-sensing Applications

This chapter describes a set of fundamental operations in GIS and remote-sensing imaging applications. For rigorous comparison and classification, these operations are discussed by means of a sound mathematical framework. The aim is to identify those operations requiring data summarization that may benefit from a pre-aggregation approach. To that end, we use Array Algebra as our modeling framework.

3.1 Array Algebra

The rationale behind the selection of Array Algebra as the modeling framework is grounded in the following observations:

• It is oriented towards multidimensional data in a variety of applications includ- ing imaging.

• It provides the means to formulate a wide variety of operations on multidimen- sional arrays.

• There are commercial and open-source implementations of Array Algebra that show the soundness and maturity of the framework.

The expressive power of Array Algebra, the simplicity of its operators, and its successful implementation in both commercial and scientific applications make it suitable for our investigation. Essentially, the algebra consists of three operators: an array constructor, a generalized aggregation, and a multi-dimensional sorter [75, 76]. Array Algebra is minimal in the sense that no subset of its operations exhibits the same expressive power. It is safe in evaluation: every formula can be evaluated in a finite number of steps. It is closed in its application: any resulting expression is either a scalar or an array.


Arrays are represented as functions mapping n-dimensional points from discrete Euclidean space to values. The spatial domain of an array is defined as a finite set of n-dimensional points in Euclidean space forming a hypercube with boundaries parallel to the coordinate system axes. Let X ⊆ Z^d be a spatial domain and F a value set, i.e., a homogeneous algebra. Then, an F-valued d-dimensional array over the spatial domain X (a multi-dimensional array) is defined as:

a : X → F (i.e., a ∈ F^X), a = {(x, a(x)) : x ∈ X, a(x) ∈ F}

The array elements a(x) are referred to as cells. Auxiliary function sdom(a) denotes the spatial domain of some array a.

3.1.1 Constructor

The MARRAY array constructor allows arrays to be defined by indicating a spatial domain and an expression evaluated for each cell position of the array. An iteration variable bound to a spatial domain is available in the cell expression so that the cell value depends on its position. Let X be a spatial domain, F a value set, and v a free identifier. Let e_v be an expression with result type F containing zero or more free occurrences of v as placeholder(s) for an expression with result type X. Then, an array over spatial domain X with base type F is constructed through:

MARRAY_{X,v}(e_v) = {(x, a(x)) : a(x) = e_x, x ∈ X}

A straightforward application of MARRAY is spatio-temporal sub-setting by simply changing its domain. Example: For some 2-D grey-scale image a, its cutout to domain [x0:x1, y0:y1] (assumed to lie inside the array) is given by:

MARRAY_{[x0:x1,y0:y1],p}(a[p])

Similarly, trimming produces a cutout of an array of lower volume but unchanged dimensionality, and section cuts out a hyperplane with reduced dimensionality. We can also change an array's values by changing the e_v expression. In the simplest case this expression takes the cell value and modifies it. The following expression adds the values in the cells of two raster images, regardless of their extent and dimension:

a + b = MARRAY_{X,p}(a[p] + b[p])

If we allow the use of all operations known on the base algebra, i.e., on the pixel type, we immediately obtain a cohort of useful induced operations.
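As an informal illustration of the constructor (a Python sketch on our own toy array model, not the RasDaMan implementation), an array can be represented as a mapping from the integer points of its spatial domain to cell values:

# Toy model of MARRAY: an array is a dict from points of the domain to values.
from itertools import product

def sdom(lo, hi):
    """All integer points of the hyper-rectangle [lo1:hi1, ..., lod:hid]."""
    return list(product(*(range(l, h + 1) for l, h in zip(lo, hi))))

def marray(domain, cell_expr):
    """MARRAY_{X,v}(e_v): evaluate e_v at every point of X."""
    return {x: cell_expr(x) for x in domain}

X = sdom((0, 0), (2, 2))
a = marray(X, lambda p: p[0] + p[1])              # a[p] = x + y
b = marray(X, lambda p: 10)                       # constant array

# Induced cell-wise operation: a + b = MARRAY_{X,p}(a[p] + b[p])
c = marray(X, lambda p: a[p] + b[p])
print(c[(2, 1)])   # 13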

3.1.2 Condenser

The COND array condenser (aggregator) takes the values of an array's cells and combines them through some commutative and associative operation, thereby obtaining a scalar value. For some free identifier v, spatial domain X = {x_1, ..., x_n}, x_i ∈ Z^d, consisting of n points, and e_{a,v} an expression of result type F containing occurrences of an array a and identifier v, the condense of a by o is defined as:

COND_{o,X,v}(e_{a,v}) := o_{x ∈ X} e_{a,x} = e_{a,x_1} o ... o e_{a,x_n}

Example: Let a be the image defined above. The average over all pixel intensities in a is then given by:

COND_{+,sdom(a),p}(a) / (m ∗ n) = ( Σ_{x ∈ [1:m,1:n]} a[x] ) / (m ∗ n)
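A companion sketch of the condenser on the same toy model (repeated here so the snippet is self-contained); it folds a commutative, associative operation over all cells:

from functools import reduce
from itertools import product

X = list(product(range(3), range(3)))             # spatial domain [0:2,0:2]
a = {p: p[0] + p[1] for p in X}                   # a[p] = x + y

def cond(op, domain, cell_expr):
    """COND_{o,X,v}(e_{a,v}): combine e_{a,x} for all x in X with o."""
    return reduce(op, (cell_expr(x) for x in domain))

total = cond(lambda u, v: u + v, X, lambda p: a[p])
average = total / len(X)
maximum = cond(max, X, lambda p: a[p])
print(total, average, maximum)   # 18 2.0 4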

3.1.3 Sorter

The SORT array sorter proceeds along a selected dimension to reorder the corresponding hyperslices. A functional sort rearranges a given array along a specified dimension s without changing its value set or spatial domain. To that end, an order-generating function is provided that associates a sequence position with each (d-1)-dimensional hyperslice. Note that function f_{s,a} has all degrees of freedom to assess any of a's cell values for determining the measure value of the hyperslice at hand: it can be a particular cell value in the current hyperslice, the average of all hyperslice values, or the value of one or more neighboring slices. Note that the sort operator includes the relational group by. The language is recursive in the array expression e_v and hence allows arbitrary nesting of expressions. In the sequel we use the abbreviations introduced above for nested expressions.

3.2 Geo-Raster Operations

This section presents a set of fundamental operations for Geo-raster data. These operations have been selected based on an exhaustive literature review of classification schemes, international standards, and best practices [2, 19, 27, 32, 35, 45, 46, 47, 49]. By examining the Array Algebra operators involved in the computation of the oper- ations, we identify those that require data summarization (aggregation) and therefore may benefit from pre-aggregation. Queries were executed in a raster database management system (RasDaMan), and formulated according to the syntax of an SQL-based query language for multidimen- sional raster databases based on Array Algebra, namely, rasql.

3.2.1 Mathematical Operations

The following groups of mathematical operators are distinguished: arithmetic, trigonometric, boolean, and relational. They operate at cell level and can be applied to a single raster or to multiple rasters of numerical type and identical spatial domain. The basic arithmetic operators include addition (+), subtraction (-), multiplication (*), and division (/). Trigonometric functions perform trigonometric calculations on the values of an input raster: sine (sin), cosine (cos), tangent (tan) or their inverses (arcsin, arccos, arctan). Consider, for example, the following query:

Query 3.2.1. Consider an RGB (red, green, blue) raster image A. Extract the green component from the image, and reduce the contrast by a factor of 2.

With Array Algebra, the query can be computed as follows:

MARRAYsdom(A),i(A.green[i]/2)

Results are shown in Fig. 3.1.

(a) Original RGB image (b) Green component (c) Output raster

Figure 3.1. Reduction of Contrast in the Green Channel of an RGB Image

All or part of a raster image can be manipulated using the rules of Boolean algebra integrated into database query languages such as SQL [2]. Boolean algebra uses logical operators such as and, or, not, and xor to determine if a particular condition is true or false. These operators are often combined with relational operators: equal (=), not equal (≠), less than (<), less than or equal to (≤), greater than (>), and greater than or equal to (≥). Consider, for example, the following queries:

Query 3.2.2. Given a near-infrared green (NRG) raster image A, highlight the cells with sufficient near-infrared values.

This query can be answered by imposing a lower bound on the infrared intensity, and upper bounds on the green and blue intensities. The resulting boolean array is multiplied by the original image A, showing the original cell value where an infrared value prevails and black otherwise.

MARRAYsdom(A),i (A[i] ∗ ((A[i].nir ≥ 130) and (A[i].green ≤ 110) and (A[i].blue ≤ 140)))

Results are shown in Fig. 3.2.

(a) Original NRG raster (b) Output raster

Figure 3.2. Highlighted Infrared Areas of an NRG Image

Query 3.2.3. Compare the cell values of two 8-bit gray raster images A and B. Create a new raster where each cell value takes the value of 255 (white pixel) when the cell values of A and B are identical.

The algebraic formulation is as follows:

MARRAYsdom(A),i((A[i] = B[i]) ∗ 255)

Results are shown in Fig. 3.3.

(a) Grey 8-bit raster A (b) Grey 8-bit raster B (c) Output raster image

Figure 3.3. Cells of Rasters A and B with Equal Values

Reclassification
Reclassification is a generalization technique used to re-assign cell values in classified rasters. For example, consider the query below, where reclassification is based on a land suitability study.

Query 3.2.4. Given an 8-bit gray image A, map each cell value to its corresponding suitability class shown in Table 3.2, and decrease the contrast of the image according to the decrease factor.

The query can be answered as follows:

MARRAYsdom(A),g(((A[g] > 180) ∗ A[g]/2) + (((A[g] ≥ 130)and(A[g] < 180)) ∗ A[g]/3) + (((A[g] ≥ 80)and(A[g] < 130)) ∗ A[g]/4) + ((A[g] < 80) ∗ A[g]/5))

Results are shown in Fig. 3.4.

1 Classification taken from http://www.fao.org/docrep/X5310E/X5310E00.htm

Table 3.1. UNO and FAO Suitability Classifications

Classification   Description
S1               Highly suitable
S2               Moderately suitable
S3               Marginally suitable
NS               Not suitable

Table 3.2. Capability Indexes for Different Capability Classes

Capability index   Class   Suitability class   Decrease factor
>180               I       S1                  2
130-180            II      S2                  3
80-130             III     S3                  4
<80                IV      NS                  5

(a) Original raster (b) Output raster

Figure 3.4. Re-Classification of the Cell Values of a Raster Image

Proximity
The proximity operation creates a new raster where each cell value contains the distance to a specified reference point. As an example consider the following query:

Query 3.2.5. Estimate the proximity of each cell of the raster image shown in Fig. 3.4(a) to the reference cell located at [30,5].

The computation of this query can be formulated as:

MARRAYsdom(A),(g,h)(|g − 30| + |h − 5|)

Results are shown in Fig. 3.5.

Figure 3.5. Computation of a Proximity Operation

Overlay
The overlay operation refers to the process of stacking two or more identical geo-referenced rasters on top of each other so that each position in the covered area can be analyzed in terms of these data. The overlay operation can be solved using arithmetic and relational operators. For example, consider the following query:

Query 3.2.6. Given two 8-bit gray raster images A and B with identical spatial domain, perform an overlay operation. That is, make a cell-wise comparison between the two rasters. Each cell value of the new array must take the maximum cell value between A and B.

The computation of this query can be formulated as:

MARRAYsdom(A),g(((A[g] > B[g]) ∗ A[g]) + ((A[g] ≤ B[g]) ∗ B[g]))

The above formulation works as follows. The left part of the arithmetic addition tests whether the cell value of array A is greater than the cell value of B. The result of this operation is either 0 (condition not satisfied) or 1 (condition satisfied), which in turn is multiplied by the cell value of array A. Thus, the left part of the expression is either 0 or the cell value of array A. Similarly, the right-hand side of the arithmetic addition verifies whether the cell value of array A is less than or equal to the cell value of B. The result is either 0 or 1 depending on whether or not the condition is satisfied. This value is multiplied by the cell value of array B. Note that only one of the two parts of the addition expression will be greater than zero, and that this value corresponds to the higher value between arrays A and B. Results are shown in Fig. 3.6.

(a) 8-bit gray raster A (b) 8-bit gray raster B (c) Output raster

Figure 3.6. Computation of an Overlay Operation
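Outside Array Algebra, the same branchless trick is common in array languages; for instance, a NumPy sketch (ours, not rasql) of the overlay reads:

# Boolean conditions become 0/1 masks that are multiplied with the inputs.
import numpy as np

A = np.array([[10, 200], [50, 60]], dtype=np.int32)
B = np.array([[90, 120], [40, 70]], dtype=np.int32)

overlay = (A > B) * A + (A <= B) * B      # cell-wise maximum of A and B
print(overlay)
print(np.array_equal(overlay, np.maximum(A, B)))   # True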

An overlay operation can also be done considering a different condition to be tested while determining the cell values of the output array. For example: Query 3.2.7. Compute an overlay operation between rasters A and B. That is, com- pare cell-wise the two rasters: if the cell value of B is non-zero, then set this value as the cell value of the corresponding cell in array A. Otherwise, the cell value of A remains unchanged. The query can be answered as follows:

MARRAYsdom(A),g(((B[g] > 0) ∗ B[g]) + ((B[g] ≤ 0) ∗ A[g])) Results are shown in Fig. 3.7.

3.2.2 Aggregation Operations

We now present the modeling of operations consisting of one or more aggregate functions. An aggregate function takes a collection of cells and returns a single value that summarizes the information contained in the set of cells. The SQL standard provides a variety of aggregate functions. SQL-92 includes count, sum, average, min, and max. SQL:1999 adds every, some, and any. OLAP functions were first published as an addendum to the ISO SQL:1999 standard. They have since been completely incorporated into both the SQL:2003 and the recently published SQL:2008 ISO SQL standards. OLAP functions include rank, ntile, cume_dist, percent_rank, row_number, percentile_cont, and percentile_disc.

(a) Grey 8-bit raster A (b) Grey 8-bit raster B (c) Output raster

Figure 3.7. Computation of an Overlay Operation Considering Values Greater than Zero

Add
The add operation sums up the content of the cells and returns the total as a scalar value. It can also be applied to two or more rasters with an identical spatial domain, returning a new raster with the same spatial domain. In this case, the cells of the new raster contain the sum of the inputs computed on a cell-by-cell basis. As an example of the add operation on a single raster consider the following query:

Query 3.2.8. Return the sum of all cell values of the raster shown in Fig. 3.8(a).

add_cells(A) = COND_{+,sdom(A),i}(A[i])

Results are shown in Fig. 3.8.

(a) Original NRG raster (b) Output result

Figure 3.8. Calculation of the Total Sum of Cell Values in a Raster

Count
The count operation returns the number of cells that fulfill a boolean condition applied to a raster. For example, consider the following query:

Query 3.2.9. Return the number of cells of raster A of boolean type, containing true value in the green channel.

count_cells(A) = COND_{+,sdom(A),i}(A[i].green = 1)

Average
The average operation returns a scalar value representing the mean of all values contained in a raster. As an example consider the following query:

Query 3.2.10. Return the average of the cell values in each channel of the NRG image shown in Fig. 3.9(a).

Let sum_cells(A) be a function calculated as shown in Section 3.2.2, and card(sdom(A)) a function returning the cardinality of A. Then, the average of A is calculated as follows:

avg_cells(A) = sum_cells(A) / card(sdom(A))

Results are shown in Fig. 3.9.

Maximum
A maximum operation returns the largest cell value contained in a raster of numerical type. As an example, consider the following query:

(a) Original NRG raster (b) Output result

Figure 3.9. Result of an Average Aggregate Operation

Query 3.2.11. Return the maximum cell value of all cells contained in the NRG raster image shown in Fig. 3.10(a).

max_cells(A) = COND_{max,sdom(A),i}(A[i])

Results are shown in Fig. 3.10.

(a) Original NRG raster (b) Output result

Figure 3.10. Result of a Maximum Aggregate Operation

Minimum
A minimum operation returns the smallest cell value contained in a raster of numerical type. As an example, consider the following query:

Query 3.2.12. Return the smallest element of all cell values in the NRG raster image shown in Fig. 3.11(a).

min_cells(A) = COND_{min,sdom(A),i}(A[i])

Results are shown in Fig. 3.11.

(a) Original NRG raster (b) Output result

Figure 3.11. Result of a Minimum Aggregate Operation

Histogram
A histogram provides information about the number of times a value occurs across a range of possible values. For an 8-bit raster up to 256 different values are possible. As an example consider the following query:

Query 3.2.13. Calculate the histogram for a 2D raster A with 8-bit integer pixel resolution.

The query can be computed as follows:

MARRAY_{sdom(A),g}(count_cells(A = g[0]))    (3.1)

Results are shown in Fig. 3.12.
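A small NumPy sketch (ours, with synthetic data) of the same histogram computation, counting for every possible 8-bit value the cells of A equal to it:

import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(0, 256, size=(100, 100), dtype=np.uint8)

hist = np.array([(A == g).sum() for g in range(256)])
print(hist.sum() == A.size)          # True: every cell is counted exactly once
print(int(hist.argmax()))            # most frequent grey value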

Figure 3.12. Computation of the Histogram for a Raster Image

Diversity
The diversity operation returns the different classifications present in a raster. For example, consider the following query:

Query 3.2.14. Given the classifications in an 8-bit gray raster image, return true (1) for those classes whose total number of cells is greater than 0.

For the computation of this operation we make use of the histogram calculated in Query 3.2.13. Let B be a 1-D array containing the histogram values:

B = MARRAY_{sdom(A),g}(COND_{+,sdom(A),i}(A[i] = g))

Then, C is the array containing true values for those elements of the histogram that are greater than 0:

C = MARRAY_{sdom(B),i}(B[i] > 0)

Results are shown in Fig. 3.13.

Figure 3.13. Computation of the Diversity for a Raster Image

Majority/Minority
In a classified raster, the majority operation finds the class value with the largest number of elements in the raster. Similarly, the minority operation finds the cell value with the fewest elements. As an example, consider the following query:

Query 3.2.15. Return the cell value representing the majority of all cell values contained in the 2D 8-bit gray raster image A shown in Fig. 3.14(a).

To solve this query we use the histogram computed in Query 3.2.13, and then select the cell value representing the majority among the different classes. Let h be a 1-D array containing the histogram values, and h1 a 1-D array of spatial domain [0:255] containing a list of values from 0 to 255. Let h2 be an array containing the sum of h and h1:

h2 = MARRAY_{[0:255],g}(h + h1)

Then, the majority can be computed as follows:

COND_{+,sdom(A),i}((max_cells(h) = (h2[i] − h1[i])) ∗ h1[i])

Results are shown in Fig. 3.14.

(a) Classified raster (b) Majority class

Figure 3.14. Computation of a Majority Operation for a Raster Image
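Once the histogram is available, majority and minority can be read off it directly; the following NumPy sketch (ours, with synthetic class data) avoids the auxiliary arrays h1 and h2 of the algebraic formulation:

import numpy as np

rng = np.random.default_rng(1)
A = rng.integers(0, 4, size=(50, 50), dtype=np.uint8)    # a raster with 4 classes

hist = np.bincount(A.ravel(), minlength=256)
majority_class = int(hist.argmax())                       # class with the most cells
occurring = np.flatnonzero(hist)                          # classes with at least one cell
minority_class = int(occurring[hist[occurring].argmin()]) # occurring class with the fewest cells
print(majority_class, minority_class)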

3.2.3 Statistical Aggregate Operations

We now consider operations that consist of, or include, one or more statistical aggregate functions. The basic statistical aggregate functions include standard deviation, square root, power, median, variance, and top-k. These functions can be applied to a raster, or to a set of rasters retrieved by a logical search. Consider the following examples:

Variance
Let n be the cardinality of the spatial domain of A, n = card(sdom(A)), and avg a variable containing the average of all cell values of A, avg = avg_cells(A). The variance v of A can then be computed as follows:

v(A) = (1/n) ∗ COND_{+,sdom(A),i}((A[i] − avg) ∗ (A[i] − avg))

Results are shown in Fig. 3.15.

Figure 3.15. Computation of the Variance for a Raster Image

Standard Deviation
Query 3.2.16. Estimate the standard deviation of the cell values of the NRG raster image shown in Fig. 3.8(a).

Let n be the cardinality of the spatial domain of A, n = card(sdom(A)), and avg the average of the cell values of A, avg = avg_cells(A). The standard deviation s of A can then be computed as follows:

s(A) = sqrt( (1/n) ∗ COND_{+,sdom(A),i}((A[i] − avg) ∗ (A[i] − avg)) )

Results are shown in Fig. 3.16.

Figure 3.16. Computation of the Standard Deviation for a Raster Image

Median
The median can be calculated by sorting the cell values of raster A in ascending order and choosing the middle value. In case the number of cells is even, the median is the average of the two middle values. In solving this operation, we use the sort operator to perform the ascending sorting of array A. However, for an array of dimensionality higher than 1 it is necessary to flatten the array into a one-dimensional array. For example, the conversion from a two-dimensional raster A[0:m,0:n] into a one-dimensional raster B[0:m*n] can be calculated as follows:

Let d be the cardinality of A, d = card(sdom(A)); let r be the number of rows; and let c be the number of columns. Then, the flattening of A can be calculated as:

B = MARRAY_{[0:255],g}(COND_{+,[0:m,0:n],i}(((g > (m ∗ (i − 1))) and (g ≤ i)) ∗ A[1 : (g − (m ∗ (i − 1))), 1 : i]))

Let S be the raster containing the sorted values of B (the flattening of A), S = SORT^asc_{0,f}(B), and let n be the cardinality of S, n = card(sdom(S)). Assuming integer division and array indexing starting at zero, the median of array A can be obtained as follows: if n is odd, then the median is equal to S[n/2]; else median = (S[(n−1)/2] + S[(n+1)/2]) / 2. Consider the following query:

Query 3.2.17. Obtain the median of the 1-D array A whose cell values are shown in Fig. 3.17(a).

Since the array has an odd number of elements, the computation of the query is as follows:

A[card(A)/2]

Results are shown in Fig. 3.17(b).
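The flatten-sort-pick procedure is easy to state in NumPy (a sketch with made-up values, handling the odd and even cases exactly as described above):

import numpy as np

A = np.array([[7, 1, 5],
              [3, 9, 2]])
s = np.sort(A.ravel())            # flattened and sorted cell values
n = s.size
if n % 2 == 1:
    median = s[n // 2]
else:
    median = (s[n // 2 - 1] + s[n // 2]) / 2
print(median)                     # 4.0, the same value returned by np.median(A)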

Top-k
The Top-k function returns the k cells with the highest values within a raster. For example, consider the following query:

Query 3.2.18. Find the five highest values contained in raster A.

To solve this query we first sort A in ascending order and then select the top five values. Let d=0 indicate a sorting in the 0 dimension, and let f be the sorting function f_{d,A}(p) = A[p]. Then S is a sorted array of raster A (see Fig. 3.18):

S = SORT^asc_{0,f}(A)

Thus, the top five cell values are obtained by:

S[0 : 4]

(a) 1-D array

(b) Median

Figure 3.17. Computation of Median for a Raster Image

(a) Top five values

Figure 3.18. Computation of a Top-k Operation for a Raster Image

3.2.4 Affine Transformations

Geometric transformations permit the elimination of geometric distortions that occur when images are captured. An example is the attempt to match remotely sensed images of the same area taken after one year, when the more recent image was probably not taken from precisely the same position. Another example is the Landsat Level 1B data that are already transformed onto a plane, but that may not be rectified to the user's desired map projection [46]. Applying an affine transformation to a uniformly distorted raster image can correct for a range of perspective distortions by transforming the measurements from the ideal coordinates to those actually used. An affine transformation is an important class of linear 2-D geometric transformations that maps variables, e.g. cell intensity values located at position (x1, y1), in an input raster image into new variables (x2, y2) in an output raster image by applying a linear combination of translation, rotation, scaling and shearing operations. The computation of these operations often requires interpolation techniques. In the remainder of this section we discuss special cases of affine transformations.

Translation
Translation performs a geometric transformation that maps the position of each cell in an input raster image into a new position in an output raster image. Under translation, a cell located at (x1, y1) in the original is shifted to a new position (x2, y2) in the corresponding output raster image by displacing it through a user-specified translation vector (h, k). The cell values remain unchanged and the spatial domain of the output raster image is the same as that of the original input raster. Consider, for example, the following query:

Query 3.2.19. Shift the spatial domain of a raster defined as A[x1 : x2, y1 : y2] by the point [h:k].

The query can be solved by invoking the shift function of Array Algebra:

shift(A[x1 : x2, y1 : y2], [h : k])

Results are shown in Fig. 3.19.

Rotation

Rotation performs a geometric transformation that maps position (x1, y1) of a cell in an input raster image onto a position (x2, y2) in an output raster image by rotating it clockwise or counterclockwise, through a user-specified angle (θ) about origin O. The rotation operation performs a transformation of the form:

x2 = cos(θ) ∗ (x1 − x0) − sin(θ) ∗ (y1 − y0) + x0

y2 = sin(θ) ∗ (x1 − x0) + cos(θ) ∗ (y1 − y0) + y0

(a) Original domain (b) Translated domain

Figure 3.19. Computation of a Translation Operation for a Raster Image

where (x0, y0) are the coordinates of the center of rotation in the input raster image, and θ is the angle of rotation. Existing algorithms for the computation of rotation, unlike those employed by translation, can produce coordinates (x2, y2) that are not integers. A common solution to this problem is the application of interpolation techniques like nearest neighbor, bilinear, or cubic interpolation. For large raster datasets this is an intensive computing problem because every output cell must be computed separately using data from its neighbors. Consequently, the rotation operation is not yet properly supported by Array Algebra.

Scaling
Scaling stretches or compresses the coordinates of a raster (or part of it) according to a scaling factor. This operation can be used to change the visual appearance of an image, to alter the quantity of information stored in a scene representation, or as a low-level preprocessor in a multi-stage image processing chain that operates on features of a particular scale. For the estimation of the cell values in a scaled output raster image, two common approaches exist:

• one pixel value within a local neighborhood is chosen (perhaps randomly) to be representative of its surroundings. This method is computationally simple but may lead to poor results when the sampling neighborhood is too large and diverse.

• the second method interpolates cell values within a neighborhood by taking the average of the local intensity values.

As in the rotation operation, the application of scaling using interpolation techniques to large raster datasets is an intensive computing problem because every output cell must be computed separately using data from its neighbors. Consider the following query performing a scaling operation using bilinear interpolation. That is, the cell value for (x0,y0) in the output raster is calculated by averaging the values of its nearest cells: two in the horizontal plane (x0,x1) and two in the vertical plane (y0,y1). Note that the query is applied to a raster of spatial domain [0:255, 0:255], but as mentioned earlier, raster datasets tend to be extremely large (TB, PB).

Query 3.2.20. Scale the 2D raster shown in Fig. 3.20(a) along the x and y dimensions by a factor of 2.

The query can be solved as follows:

B = MARRAY_{[0:m/2, 0:n/2],(x,y)}(COND_{+,[0:1,0:1],(i,j)}(A[i + x ∗ 2, j + y ∗ 2]/4))

Results are shown in Fig. 3.20.

(a) Original raster (b) Scaled raster

Figure 3.20. Computation of a Scaling Operation for a Raster Image
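For reference, the same factor-2 block-averaging scaling can be expressed as a NumPy sketch (ours, on a small made-up raster), with the inner loop mirroring the COND over the 2x2 neighborhood:

import numpy as np

A = np.arange(16, dtype=float).reshape(4, 4)
m, n = A.shape
B = np.empty((m // 2, n // 2))
for x in range(m // 2):
    for y in range(n // 2):
        # COND_{+,[0:1,0:1],(i,j)}(A[i + 2x, j + 2y] / 4)
        B[x, y] = sum(A[2 * x + i, 2 * y + j] / 4 for i in range(2) for j in range(2))
print(B)
# [[ 2.5  4.5]
#  [10.5 12.5]]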

3.2.5 Terrain Analysis

Raster image data is particularly useful for tasks related to terrain analysis. Some of the most popular operations include slope/aspect, drainage networks, and catchments (or watersheds). The processing of these operations may involve interpolation techniques that lead to expensive computational costs. For simplicity, we model these operations with approaches not using interpolation methods.

Slope/Aspect
Slope is defined by a plane tangent to a topographic surface, as modeled by the Digital Elevation Model (DEM) at a point [2]. Slope is classified as a vector, thus having two components: a quantity (gradient) and a direction (aspect). The slope (gradient) is defined as the maximum rate of change in altitude, and aspect as the compass direction of the maximum rate of change. Several approaches exist for the computation of slope/aspect; we follow the method proposed by [32]:

• Slope in the X direction (difference in height values on either side of P):
  tan Θ_x = (z(r, c + 1) − z(r, c − 1)) / 2g

• Slope in the Y direction:
  tan Θ_y = (z(r + 1, c) − z(r − 1, c)) / 2g

• Gradient at P:
  sqrt(tan² Θ_x + tan² Θ_y)

• Direction or aspect of the gradient:
  tan α = tan Θ_x / tan Θ_y

Results are shown in Fig. 3.21.

Figure 3.21. Slopes Along the X and Y Directions

Note that after the calculation of the slopes for each cell in a raster image, the results may need to be classified to display them clearly on a map [2].

Query 3.2.21. Calculate the slope along the X direction of an 8-bit grey raster A:

MARRAY_{sdom(A),(r,c)}(arctan((A[r, c + 1] − A[r, c − 1]) / 2g))

Local Drain Directions (ldd) The ldd network is useful for computing several properties of a DEM because it ex- plicitly contains information about the connectivity of different cells. Two steps are required to derive a drainage network: the estimation of flow of material over the surface and the removal of pits. For instance (see Fig. 3.22), cell A1 has three neigh- boring cells (A2, B1 and B2) and the lowest of them is B1, thus the flow direction is south (downward). For cell C3, the lowest of its eight neighboring cells is D2, so the flow direction is southwest (to the lower left). This method is one of the most popular algorithms to estimate flow directions and it is commonly known as D8 algorithm [2].

Figure 3.22. Flow Directions

Query 3.2.22. Estimate the flow of material over raster A, where each cell contains the slope along the X direction.

Let A be a raster whose cells contain the slopes along the X direction. The ldd is then calculated as:

MARRAYsdom(A),(i,j)(CONDmin,[−1:1,−1:1],(v,w) (A[i + v, j + w]))

Irrespective of the algorithm used to compute flow directions, the resulting ldd network is extremely useful for computing other properties of a DEM such as stream channels, ridges, and catchments.

3.2.6 Other Operations

Edge Detection
Edge detection produces a new raster containing only the boundary cells of a given raster. The detection of intensity discontinuities in a raster is very useful, e.g. the boundary representation is easy to integrate into a large variety of detection algorithms. The following parameterized function can be used to express filtering operations in Array Algebra:

f(A, M) = MARRAY_{sdom(A),x}(COND_{+,sdom(M),i}(A[x + i] ∗ M(i)))

where sdom(M) is the size of the corresponding filter window, e.g., 3x3. As an example consider the following query:

(a) M1 (b) M2

Figure 3.23. Sobel Masks

Query 3.2.23. Apply edge detection to raster A shown in Fig. 3.24(a) using a 3x3 Sobel filter.

To compute this query, a Sobel filter and its inverse are applied to the original raster A (see Fig. 3.23):

(|f(A, M1)| + |f(A, M2)|) / 9

which in Array Algebra can be computed as follows:

MARRAY_{sdom(A),x}(COND_{+,sdom(M1),i}((abs(A[x + i] ∗ M1(i)) + abs(A[x + i] ∗ M2(i))) / 9))

Results are shown in Fig. 3.24.

(a) Original raster image (b) Output raster image

Figure 3.24. Computation of an Edge-Detection for a Raster Image
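The following NumPy sketch (ours) spells the same Sobel filtering out as an explicit convolution; border cells are simply skipped here, whereas a real system would pad or mirror the image:

import numpy as np

M1 = np.array([[-1, 0, 1],
               [-2, 0, 2],
               [-1, 0, 1]])          # horizontal Sobel mask
M2 = M1.T                            # vertical Sobel mask

def sobel(A):
    out = np.zeros_like(A, dtype=float)
    for x in range(1, A.shape[0] - 1):
        for y in range(1, A.shape[1] - 1):
            window = A[x - 1:x + 2, y - 1:y + 2]
            out[x, y] = (abs((window * M1).sum()) + abs((window * M2).sum())) / 9
    return out

A = np.zeros((6, 6)); A[:, 3:] = 255      # a synthetic raster with a vertical edge
print(sobel(A)[2])                        # strong response around the edge columns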

Slicing
The slicing operation extracts lower-dimensional sections from a raster. Array Algebra accomplishes the slicing operation by indicating the slicing position in the desired dimension. Thus, the operation reduces the dimensionality of the raster by one. For example, consider the following query:

Query 3.2.24. Slice raster A along the second dimension at position 50.

The query is solved by specifying the slicing position as follows:

MARRAYsdom(A),(x,y,z)(A[x, 50, z])

3.3 Summary

By examining the fundamental structure of Geo-raster operations and breaking down their computational steps into a few basic Array Algebra operators, we determine that Geo-raster operations fall into the following classes:

• COND and MARRAY combined operations. Operations whose computation requires both MARRAY and COND operators: add, count, average, maximum, minimum, majority, minority, histogram, diversity, variance, standard deviation, scaling, edge detection, and local drain directions.

• MARRAY exclusive operations. Operations whose computation requires only the MARRAY operator: arithmetic, trigonometric, boolean, logical, overlay, reclassification, proximity, translation, slicing, and slope/aspect.

• SORT operations. Operations whose computation requires the SORT operator: top-k, median.

• AFFINE transformations. Special cases of affine transformations partially or not yet supported by Array Algebra: rotation and scaling.

This classification allows us to identify a set of operations that require data summarization and thus are potential candidates to be treated with pre-aggregation techniques: add, count, average, maximum, minimum, majority, minority, histogram, diversity, variance, standard deviation, scaling, edge detection, and local drain directions. Table 3.3 summarizes the usage of Array Algebra operators for each operation discussed in Section 3.2.

Table 3.3. Array Algebra Classification of Geo-Raster Operations.

Operation                           MARRAY   COND   SORT   AFFINE
1. Count                                     x
2. Add                                       x
3. Average                                   x
4. Maximum                                   x
5. Minimum                                   x
6. Majority                         x        x
7. Minority                         x        x
8. Std. Deviation                            x
9. Median                           x               x
10. Variance                                 x
11. Top-k                                           x
12. Histogram                       x        x
13. Diversity                       x        x
14. Proximity                       x
15. Arithmetic                      x
16. Trigonometric                   x
17. Boolean                         x
18. Logical                         x
19. Overlay                         x
20. Re-classification               x
21. Translation                     x
22. Rotation                                               x
23. Scaling                         x        x             x
24. Slicing                         x
25. Edge Detection                  x        x
26. Slope/Aspect                    x
27. Local drain directions (ldd)    x        x

Chapter 4

Answering Basic Aggregate Queries Using Pre-Aggregated Data

As discussed in previous chapters, aggregation is an important mechanism that allows users to extract general characterizations from very large repositories of data. In this chapter, we study the effect of selecting a set of aggregate queries, computing their results, and using them for subsequent query requests. In particular, we study the effect of pre-aggregation in computing aggregate queries in the field of GIS and remote-sensing imaging applications. We introduce a pre-aggregation framework that distinguishes among different types of pre-aggregates for computing a query. We show that in most cases several pre-aggregates may qualify for answering an aggregate query, and we address the problem of selecting the best pre-aggregate in terms of execution time. To this end, we introduce a model that measures the cost of using qualified pre-aggregates for the computation of a query. We then present an algorithm that selects the best pre-aggregate for computing a query. We measure the performance of our algorithms in an array database management system (RasDaMan), and show that our algorithms give much better performance than straightforward methods.

4.1 Framework

Most major database management systems allow the user to store query results through a process known as view materialization. The query optimizer may then automatically use the materialized data to speed up the evaluation of a new query. Queries that benefit from using materialized data are those that involve the summarization of large amounts of data. They are known as aggregate queries because their query statements include one or more aggregate functions. The ANSI SQL:2008 standard defines a wide variety of aggregate functions including: COUNT, SUM, AVG, MAX, MIN, EVERY, ANY, SOME, VAR_POP, VAR_SAMP, STDDEV_POP, STDDEV_SAMP, ARRAY_AGG, REGR_COUNT, COVAR_POP, COVAR_SAMP, CORR, REGR_R2, REGR_SLOPE, and REGR_INTERCEPT [20].


4.1.1 Aggregation

An aggregate operation contains one or more aggregate functions that map a multiset of cell values in a dataset to a single scalar value. In our framework, queries may contain an arbitrary number of aggregate functions, e.g., COUNT, SUM, AVG, MAX, MIN, and a spatial domain. We formulate our queries using rasql, the declarative interface to the RasDaMan server. We use the Array Algebra notation for spatial domains:

sdom = [l_1 : h_1, . . . , l_d : h_d]    (4.1)

4.1.2 Pre-Aggregation

The term pre-aggregation refers to the process of pre-computing and storing the results of aggregate queries for subsequent use in the same or similar query requests. The decision to use pre-aggregated data during the computation of an aggregate query is influenced by the structural characteristics of the query and the pre-aggregate. By comparing the data structures between the two, one can determine if the pre-aggregated result contributes fully or partially to the final answer of the query, and if it is worth using pre-aggregated data.

4.1.3 Aggregate Query and Pre-Aggregate Equivalence

An aggregate query Q and a pre-aggregate pi are equivalent if and only if all the following conditions are met:

1. The aggregate operation of the query Q is the same as the aggregate operation defined for the pre-aggregate pi.

2. The aggregate operation of the query Q and the pre-aggregate pi must be applied over the same objects.

3. The same logical and boolean conditions, if any, apply to both the query Q and the pre-aggregate pi.

4. For aggregate operations to be applied over a specific spatial domain, the extent of the spatial domain in query Q must be the same as the one in pre-aggregate pi.

When all of the above conditions are satisfied, we say there is a full-matching between the query and the pre-aggregate. In this case, the time it takes to retrieve the pre-aggregated result will be much shorter than the time required to compute the query from raw (original) data. Moreover, the storage overhead required to save the pre-aggregated result is compensated by the faster computation of the query obtained in return. However, cases do occur where only conditions 1, 2, and 3 are satisfied. We refer to this case as a partial-matching between the query and the pre-aggregate. We can use the partial results provided by these pre-aggregates and thus speed up the computation of the query. However, further analysis must be carried out to find those pre-aggregates that provide the maximum speed-up for computing a query. To that end, we define the following types of pre-aggregates: independent, overlapped, and dominant.

1 rasql is an SQL-based query language for multidimensional raster databases based on Array Algebra.

Independent Pre-Aggregates

Definition 4.1 (Independent Pre-Aggregates) – A set of pre-aggregates is called Independent Pre-Aggregates (IPAS) with respect to Q if the spatial domain of each pre-aggregate is contained within the spatial domain of query Q and there is no intersection among the spatial domains of the pre-aggregates. Fig. 4.1(a) shows an example of an independent pre-aggregate.

IPAS := {p1, p2, . . . , pn | pi.sdom ⊆ Q.sdom, pi.sdom ∩ pj.sdom = ∅}    (4.2)


Overlapped Pre-Aggregates

Definition 4.2 (Overlapped Pre-Aggregates) – A set of pre-aggregates is called Overlapped Pre-Aggregates (OPAS) if the spatial domain of each pre-aggregate intersects with the spatial domain of the query Q. Fig. 4.1(b) shows an example of an overlapped pre-aggregate.

OPAS := {p1, p2, . . . , pn | pi.sdom ∩ Q.sdom ≠ ∅}    (4.3)

Dominant Pre-Aggregates

Definition 4.3 (Dominant Pre-Aggregates) – A set of pre-aggregates is called Dominant Pre-Aggregates (DPAS) if the spatial domain of the query Q is contained within the spatial domain of each pre-aggregate. Fig. 4.1(c) shows an example of a dominant pre-aggregate. Note that dominant pre-aggregates can only be used to answer the following types of aggregate queries: ADD, COUNT, and AVG.


DPAS := {p1, p2, . . . , pn | Q.sdom ⊆ pi.sdom}    (4.4)

Moreover, given an ordered DPAS

DPAS = {p1, p2, . . . , pn | Q.sdom ⊆ p1.sdom ⊆ ... ⊆ pn.sdom},    (4.5)

the closest dominant pre-aggregate (p_cd) to Q is given by p1, i.e., p_cd = p1.

(a) Independent pre-aggregate (b) Overlapped pre-aggregate (c) Dominant pre-aggregate

Figure 4.1. Types of Pre-Aggregates

Cases may occur where a pre-aggregate intersects with one or more pre-aggregates of the same or a different type. Intersections are problematic because the greater the number of intersections, the greater the number of cells that may need to be computed from raw data to determine the real contribution of a given pre-aggregate towards the result of the query. The computation process involves several intermediary operations such as decomposing the pre-aggregate into sub-partitions that in turn must be aggregated. Moreover, the same procedure must be performed on the other intersected pre-aggregates should we want to use their results. For example, assume that pre-aggregates p1, p2 and p3 can be used to answer query Q, and that they all intersect with each other. Since the result of each pre-aggregate includes a partial result of the other two pre-aggregates, we must use raw data to compute the intersected area and adjust the result of the pre-aggregate according to the aggregate function specified in the query predicate. To overcome this problem, a query selected for pre-aggregation for which other pre-aggregates exist with different spatial domains but identical structural properties can be decomposed into a set of sub-partitions prior to the pre-aggregation process.

By partitioning the query to be pre-aggregated we can avoid intersection among pre- aggregates, see example shown in Fig. 4.2.

Figure 4.2. Selected Queries for Pre-Aggregation (left) and Decomposed Queries (right)
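The decomposition step itself can be realized as an axis-by-axis subtraction of hyperrectangles. The sketch below is an illustrative assumption (helper names such as subtract_domain are hypothetical, not RasDaMan identifiers); it returns the disjoint sub-partitions of a domain that remain once the part shared with another domain is removed, which is the primitive needed both for the partitioning in Fig. 4.2 and for the pi.sdom − (pi.sdom ∩ Q.sdom) terms of the cost model below.

```python
from typing import List, Optional, Tuple

Domain = List[Tuple[int, int]]  # one closed interval (lo, hi) per dimension

def intersect_domain(a: Domain, b: Domain) -> Optional[Domain]:
    """Intersection of two hyperrectangles, or None if they are disjoint."""
    out = []
    for (alo, ahi), (blo, bhi) in zip(a, b):
        lo, hi = max(alo, blo), min(ahi, bhi)
        if lo > hi:
            return None
        out.append((lo, hi))
    return out

def subtract_domain(a: Domain, b: Domain) -> List[Domain]:
    """Decompose a \\ b into disjoint hyperrectangles (sub-partitions)."""
    inter = intersect_domain(a, b)
    if inter is None:
        return [a]                      # nothing to cut away
    pieces: List[Domain] = []
    rest = list(a)                      # shrinking core that still contains inter
    for d, ((alo, ahi), (ilo, ihi)) in enumerate(zip(a, inter)):
        if alo < ilo:                   # slab below the intersection in dimension d
            pieces.append(rest[:d] + [(alo, ilo - 1)] + rest[d + 1:])
        if ihi < ahi:                   # slab above the intersection in dimension d
            pieces.append(rest[:d] + [(ihi + 1, ahi)] + rest[d + 1:])
        rest[d] = (ilo, ihi)            # clamp this dimension before moving on
    return pieces

# Example: a 2D pre-aggregate partially covering a query region
query = [(0, 99), (0, 99)]
preagg = [(50, 149), (20, 79)]
print(subtract_domain(query, preagg))   # three disjoint sub-partitions of the query
```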

4.2 Cost Model

This section introduces a cost model that allows us to estimate the cost (in terms of execution time) of computing a query using pre-aggregates compared to raw data. In our model, the access cost is driven by the number of required disk I/Os and memory accesses. These parameters are influenced by the number of tiles needed to answer a given query and the number and size of the cells in the datasets. The following assumptions underlie our estimates.

1. We assume that the tiles needed to answer a given query are stored using implicit storage of coordinates, which is the prevalent storage format for raster image data [79]. Implicit storage of coordinate values is a storage technique that leads to a higher degree of clustering of cell values that are close in data space, that is, it preserves the spatial proximity of cell values. Given that state-of-the-art disk drives improve access to multidimensional datasets by allowing the spatial locality of the data to be preserved on the disk itself [93], we assume that it takes the same time to retrieve a tile from disk as to retrieve any other tile needed to answer a given query. Clearly, there are other factors, not considered here, that influence access cost. Among them are the cost of storing intermediate results, and the communication cost of sending the results from the server to the client. More complicated cost models are certainly possible, but we believe the cost model we pick, being both simple and realistic, enables us to design and analyze powerful algorithms.

2. We consider the time taken to access a given cell (pixel) in main memory to be the same as that required to access any other cell. That is, we assume that a tile sits in main memory and is not swapped out.

3. We ignore the time it takes to combine partial aggregate results. Investigations have shown this time to be negligible compared to tile iteration [74].

Table 4.1 lists the parameters involved in the different cost functions presented in the remainder of this section.

Table 4.1. Cost Parameters

Parameter   Description
Ntiles      Number of tiles
Ncells      Number of cells
sdom        Spatial domain
IPAS        Independent pre-aggregates set
OPAS        Overlapped pre-aggregates set
DPAS        Dominant pre-aggregates set
pcd         Closest dominant pre-aggregate
SP          Sub-partitions

4.2.1 Computing Queries from Raw Data

The cost of computing an aggregate query Q (or sub-partitions of pre-aggregates) from raw data, Cr, is given by

Cr(Q) = Cacc(Ntiles(Q)) + Cagg(Ncells(Q))    (4.6)

where Cacc is the cost of retrieving the tiles required to answer Q, and Cagg is the time taken to access and aggregate the cells given by the spatial component of the query.
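A direct transcription of Eq. (4.6) might look as follows; the per-tile and per-cell constants are hypothetical calibration parameters, not values taken from the thesis.

```python
# Hedged sketch of Eq. (4.6): cost of answering Q entirely from raw data.
# The constants are placeholders that would be calibrated per installation.
TILE_IO_COST = 1.0      # cost of fetching one tile from disk (assumed)
CELL_AGG_COST = 0.001   # cost of accessing and aggregating one cell (assumed)

def cost_raw(n_tiles: int, n_cells: int) -> float:
    """C_r(Q) = C_acc(N_tiles(Q)) + C_agg(N_cells(Q))."""
    c_acc = n_tiles * TILE_IO_COST
    c_agg = n_cells * CELL_AGG_COST
    return c_acc + c_agg
```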

4.2.2 Computing Queries from Independent and Overlapped Pre-Aggregates

The cost of answering an aggregate query using independent and overlapped pre-aggregates is given by:

CIOPAS(Q) = CIPAS(Q) + COPAS(Q) + CSP(Q)    (4.7)

where CIPAS and COPAS are the costs of using the results of independent and overlapped pre-aggregates, respectively, and CSP is the cost of decomposing the query Q into a set of sub-partitions and aggregating each from raw data.

Cost of independent pre-aggregates

The cost of retrieving the results of independent pre-aggregates (CIPAS) is given by:

CIPAS(Q, T) = Cfin(Q, T) + Σ_{i=1}^{|IPAS|} Cacc(pi)    (4.8)

where Cfin is the cost of finding the pre-aggregates ∈ IPAS in the pre-aggregated pool T, and Cacc is the accumulated cost of retrieving the results of the pre-aggregates.

Cost of overlapped pre-aggregates

The cost of retrieving the results of overlapped pre-aggregates (COPAS) is given by:

COPAS(Q) = Cfin(Q, T) + Σ_{i=1}^{|OPAS|} Cdec(pi) + Σ_{i=1}^{|S|} Cr(si)    (4.9)

where Cfin is the cost of finding the pre-aggregates ∈ OPAS in the pre-aggregated pool T, Cdec is the cost of decomposing the spatial domain of each pre-aggregate into a set of sub-partitions S such that the spatial domain of the partitioned pre-aggregate corresponds to pi.sdom − (pi.sdom ∩ Q.sdom), and Cr is the cost of aggregating each resulting sub-partition si ∈ S from raw data.

Cost of aggregating sub-partitions of a query

The cost of aggregating all sub-partitions forming a query is given by:

CSP(Q) = Cdec(Q) + Σ_{i=1}^{|SP|} Cr(si)    (4.10)

where Cdec is the cost of decomposing Q into a set SP of sub-partitions, and Cr is the cost of aggregating each resulting sub-partition si ∈ SP from raw data. Note that Cdec is influenced by the cost of accessing the tiles required to aggregate each sub-partition, and the cost of accessing the spatial properties of the pre-aggregates in IPAS and OPAS.
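A sketch of Eqs. (4.7)–(4.10) under the same assumptions; the constants and the dictionary layout of the sub-partitions are illustrative placeholders rather than the actual implementation.

```python
# Hedged sketch of Eqs. (4.7)-(4.10). All constants and helper names are
# illustrative assumptions; cost_raw is repeated here so the sketch is self-contained.
FIND_COST = 0.5        # C_fin: looking up matching pre-aggregates in the pool
FETCH_COST = 0.1       # C_acc: retrieving one stored pre-aggregate result
DECOMP_COST = 0.2      # C_dec: decomposing one domain into sub-partitions

def cost_raw(n_tiles, n_cells, tile_io=1.0, cell_agg=0.001):
    return n_tiles * tile_io + n_cells * cell_agg   # Eq. (4.6)

def cost_ipas(ipas):
    # Eq. (4.8): find the independent pre-aggregates, then fetch each stored result
    return FIND_COST + FETCH_COST * len(ipas)

def cost_opas(opas):
    # Eq. (4.9): for each overlapped pre-aggregate, cut away the part outside Q
    # and re-aggregate those sub-partitions from raw data
    cost = FIND_COST
    for p in opas:
        cost += DECOMP_COST
        cost += sum(cost_raw(s["tiles"], s["cells"]) for s in p["outside_subparts"])
    return cost

def cost_subpartitions(sub_parts):
    # Eq. (4.10): aggregate the uncovered parts of Q itself from raw data
    return DECOMP_COST + sum(cost_raw(s["tiles"], s["cells"]) for s in sub_parts)

def cost_iopas(ipas, opas, sub_parts):
    # Eq. (4.7): total cost of answering Q from independent + overlapped pre-aggregates
    return cost_ipas(ipas) + cost_opas(opas) + cost_subpartitions(sub_parts)
```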

4.2.3 Computing Queries from Dominant Pre-Aggregates

The cost of computing an aggregate query Q using a dominant pre-aggregate is given by:

CDPAS(Q) = CDP(Q, T) + Cagg(pcd)    (4.11)

where CDP is the sum of the cost of finding the pre-aggregates ∈ DPAS in the pre-aggregated pool T and the cost of finding the closest dominant pre-aggregate pcd, and Cagg is the cost of computing the aggregate difference of pcd corresponding to pcd.sdom − Q.sdom.

Cost of aggregating sub-partitions of the closest dominant pre-aggregate

The cost Cagg can be calculated as follows:

Cagg(pcd) = Cdec(pcd) + Σ_{i=1}^{|SP|} Cr(si)    (4.12)

where Cdec is the cost of decomposing pcd into a set SP of sub-partitions, and Cr is the cost of aggregating each resulting sub-partition si ∈ SP from raw data.
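Analogously, Eqs. (4.11) and (4.12) could be transcribed as follows; again a hedged sketch with placeholder constants, not the system's code.

```python
# Hedged sketch of Eqs. (4.11) and (4.12). Constants and the sub-partition
# dictionaries are placeholders; 'pcd_outside' describes p_cd.sdom - Q.sdom
# decomposed into sub-partitions that must be re-aggregated from raw data.
FIND_COST = 0.5      # C_DP: locating DPAS members and picking p_cd (assumed)
DECOMP_COST = 0.2    # C_dec: decomposing p_cd into sub-partitions (assumed)

def cost_raw(n_tiles, n_cells, tile_io=1.0, cell_agg=0.001):
    return n_tiles * tile_io + n_cells * cell_agg          # Eq. (4.6)

def cost_agg_pcd(pcd_outside):
    # Eq. (4.12): aggregate the part of p_cd that lies outside Q so that it can
    # be removed from p_cd's stored result
    return DECOMP_COST + sum(cost_raw(s["tiles"], s["cells"]) for s in pcd_outside)

def cost_dpas(pcd_outside):
    # Eq. (4.11): C_DPAS(Q) = C_DP(Q, T) + C_agg(p_cd)
    return FIND_COST + cost_agg_pcd(pcd_outside)
```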

4.3 Implementation

This section describes the application of a query optimization technique that transforms an input query written in terms of arrays so that it can be executed faster using pre-aggregated data. The query processing module of an array database management system (RasDaMan) has been extended with our pre-aggregation framework for query rewriting, which has been implemented as part of the optimization and evaluation phases. As discussed earlier in this chapter, there are two problems related to the computation of an aggregate query using pre-aggregated data. First, we must find all pre-aggregates that can be used to compute an aggregate query, including those that provide partial answers. Next, from all candidate pre-aggregates, we must find the one that minimizes the execution time (or cost) for computing the query. Our solution is based on an existing approach for answering queries using views in OLAP applications. Halevy et al. [95] showed that all possible rewritings of a query can be obtained by considering containment mappings from the bodies of the views to the body of the query. They also showed that this characterization is an NP-complete problem. The QUERYCOMPUTATION procedure returns the result of a query or an execution plan for a given query Q. An execution plan is an indicator of the kind of data that must be used to compute the query. It returns a raw indicator if the query must be computed from the original data. Other valid indicators include IPAS, OPAS, and DPAS, which indicate that the query will be answered using one or more partial pre-aggregates. The input of the algorithm is a query tree Qt of an aggregate query. The algorithm first verifies whether the conditions for a PERFECT-MATCHING between the query and the pre-aggregated queries are satisfied. If a perfect matching is found, it returns the result of the pre-aggregated query. Otherwise, the algorithm verifies whether the conditions for a PARTIALMATCHING between the query and the set of pre-aggregated queries are satisfied. The algorithm then makes use of our cost model to determine the cost of using pre-aggregates that satisfy partial-matching conditions for the computation of the query, and the cost of computing the query using the original data. Finally, the algorithm picks the plan with the least cost in terms of execution time. The algorithm makes use of the following auxiliary procedures:

• DECOMPOSEQUERY(Qt) examines the nodes of the query tree Qt and generates a standardized representation Sqt that can be manipulated via SQL statements.

Algorithm 1 QUERYCOMPUTATION
Require: A query tree Qt, a set of k pre-aggregate queries P
1: initialize R = 0, key = false
2: Sqt = decomposeQuery(Qt)
3: key = perfectMatching(Sqt, P)
4: if key then
5:   R = fetchResult(key)
6:   return R
7: end if
8: if !key then
9:   plan = partialMatching(Sqt, P)
10:  return plan
11: end if

• PERFECTMATCHING(Sqt) compares a standardized representation of the query tree Sqt against the k existing pre-aggregates. The output is the corresponding key of the matched pre-aggregated query. A null value is returned if no perfect match is found.

• FETCHRESULT(key) retrieves the result R of the pre-aggregated query identified by key.

The algorithm PARTIALMATCHING identifies an aggregate sub-expression in a query tree Qt, and finds pre-aggregated queries satisfying conditions 1, 2 and 3, but not condition 4, as defined in Section 4.1.2. It considers the use of pre-aggregates that partially contribute to the answer of a query sub-expression and that are either independent, overlapped, or dominant. The algorithm calculates the cost of using each pre-aggregate for computing the query, and returns an indicator of the plan providing the least cost. The aggregateOp() procedure compares a node n of a given query tree Qt against a list of pre-defined aggregate operations, e.g., add_cells, count_cells, avg_cells, max_cells, and min_cells. If the node matches any such operation, it returns a true value. The getSubtree() procedure receives as parameters a query tree Qt and a pointer to an aggregate node. If the aggregate node has children, it creates a subtree Q′ where the root node corresponds to the aggregate node. The findPreaggregate() procedure receives as parameters an aggregate operation op, an object identifier ro, and a spatial domain sd. It then determines if the values of these parameters match those of any existing pre-aggregate. If a match is found, the result of the matched pre-aggregate is returned. The findIpasPreaggregates() procedure receives as a parameter a subtree Q′ and verifies if any pre-aggregates satisfy conditions 1, 2 and 3 as defined in Section 4.1.2 for equivalence between a query and a pre-aggregate.

Algorithm 2 PARTIALMATCHING
Require: A standardized query tree Qt with m nodes
1: initialize IPAS, OPAS, DPAS = {}
2: initialize plan = "raw", key = false
3: for each node n of Qt do
4:   if aggregateOp(node[n]) then
5:     Q′ = getSubtree(Qt, node[n])
6:     op = getOperation(Q′)
7:     ro = getRasterObject(Q′)
8:     sd = getSpatialDomain(Q′)
9:     key = findPreaggregate(op, ro, sd)
10:    if key then
11:      R = fetchResult(key)
12:      return R
13:    end if
14:    if !key then
15:      IPAS = findIpasPreaggregates(op, ro, sd)
16:      OPAS = findOpasPreaggregates(op, ro, sd)
17:      DPAS = findDpasPreaggregates(op, ro, sd)
18:    end if
19:    plan = selectPlan(Q′, IPAS, OPAS, DPAS)
20:  end if
21: end for
22: return plan

For those pre-aggregates that qualify, it identifies those whose spatial domains are contained in the spatial domain of the query. The output is a set of independent pre-aggregates. The findOpasPreaggregates() procedure receives as a parameter a subtree Q′ and verifies if any pre-aggregates satisfy conditions 1, 2 and 3 as defined in Section 4.1.2. For those pre-aggregates that qualify, it identifies those whose spatial domains intersect with the spatial domain of the query. The output is a set of overlapped pre-aggregates. The findDpasPreaggregates() procedure receives as a parameter a subtree Q′ and verifies if any pre-aggregates satisfy conditions 1, 2 and 3 as defined in Section 4.1.2. For those pre-aggregates that qualify, it identifies those whose spatial domains dominate the spatial domain of the query. The output is a set of dominant pre-aggregates. The selectPlan() procedure receives as parameters a sub-query tree Q′, a set of independent pre-aggregates IPAS, a set of overlapped pre-aggregates OPAS, and a set of dominant pre-aggregates DPAS. It then calculates the cost of answering the query using the different types of pre-aggregates and raw data. The output of this procedure is an indicator of the best plan for executing the query.
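One possible realization of selectPlan() is sketched below, assuming the cost functions of Section 4.2 have already been evaluated for the sub-query; the plan indicators and the function name are illustrative.

```python
def select_plan(cost_raw_q, cost_iopas_q, cost_dpas_q):
    """Pick the cheapest execution plan for a sub-query Q'.

    cost_raw_q   -- cost of evaluating Q' entirely from raw data (Eq. 4.6)
    cost_iopas_q -- cost using independent/overlapped pre-aggregates (Eq. 4.7),
                    or None if IPAS and OPAS are empty
    cost_dpas_q  -- cost using the closest dominant pre-aggregate (Eq. 4.11),
                    or None if DPAS is empty
    """
    candidates = {"raw": cost_raw_q}
    if cost_iopas_q is not None:
        candidates["iopas"] = cost_iopas_q
    if cost_dpas_q is not None:
        candidates["dpas"] = cost_dpas_q
    # the indicator with the least estimated execution time wins
    return min(candidates, key=candidates.get)

# e.g. select_plan(12.4, 3.7, None) -> "iopas"
```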

Query Evaluation

The query optimizer module provides the optimized query tree, along with the plan suggested for the computation of the query, to the final phase, evaluation. Typically, the evaluation phase identifies the tiles affected by an aggregate query and executes the aggregate operation on each tile. Finally, it combines the results to generate the answer to the query. With the extension of pre-aggregation in the optimizer, the traditional process differs in that the selected plan is considered before proceeding to execution. If the plan corresponds to raw, then the computation of the query is done entirely from raw data. Otherwise, the aggregate operation is executed only on those sub-expressions for which there are no pre-aggregated results.

4.4 Experimental Results

This section presents the performance results of our algorithms on real-life raster image datasets. We ran our experiments on an Intel Pentium 4 (3.00 GHz) PC running SuSE Linux 9.1. The workstation had a total physical memory of 512 MB. The datasets were stored in RasDaMan, an array database management system (our research vehicle). Table 4.2 lists the test queries used in our experiments. We ran each query 200 times against the database to obtain average query response times. The queries are formulated in rasql syntax, the declarative query interface to the RasDaMan server. We performed a cold test where the queries were run sequentially; the cache buffer was cleaned after the completion of each query. The dataset consists of a collection of 2D raster images, each associated with an object identifier (oid). Each image shows a portion of the Black Sea, is 260 MB in size, and consists of 100 indexed tiles. We artificially created a set of pre-aggregates for the experiment. They are stored in a pre-aggregation pool containing a total of 5000 pre-aggregates, requiring a total storage space of 50 MB. Computing the test queries involves the execution of two fundamental operations in GIS and remote-sensing imaging: subsetting and aggregation. The values of the spatial domain of the queries were chosen such that we could measure the impact of using pre-aggregation for the following cases:

• The computation of queries Q1, Q2 and Q3 can be done by combining the results of partial pre-aggregates and the remaining parts from original data.

• The computation of queries Q4, Q5 and Q6 can be done by using the results of full pre-aggregates. That is, the full answer to these queries has been pre- computed and stored in the database.

• The computation of queries Q7, Q8 and Q9 can be done by combining the results of two or more pre-aggregates. There is no need to use original data to compute these queries.

Table 4.2. Database and Queries of the Experiment.

Qid  Description
Q1   select add_cells(y[6000:10000, 29000:32000]) from blacksea as y where oid(y) = 49153
Q2   select add_cells(y[7000:10000, 29000:31000]) from blacksea as y where oid(y) = 49154
Q3   select add_cells(y[6700:10000, 28000:30000]) from blacksea as y where oid(y) = 49155
Q4   select add_cells(y[7680:8191, 29000:31000]) from blacksea as y where oid(y) = 49153
Q5   select add_cells(y[8704:9215, 29000:31000]) from blacksea as y where oid(y) = 49154
Q6   select add_cells(y[9728:10000, 29000:31000]) from blacksea as y where oid(y) = 49155
Q7   select add_cells(y[7680:8191, 29696:30207]) from blacksea as y where oid(y) = 49153
Q8   select add_cells(y[8704:9215, 30720:31000]) from blacksea as y where oid(y) = 49154
Q9   select add_cells(y[9216:9727, 30208:30719]) from blacksea as y where oid(y) = 49155

Table 4.3 compares the CPU cost required for the computation of the queries using pre-aggregated data and raw data. The CPU cost was obtained using the time library of C++. The column #aff. tiles shows the number of tiles that need to be read to compute the given query. Column #preagg. tiles represents the number of pre-aggregates that can be used to compute the query. Column t pre shows the total CPU cost of computing the query using pre-aggregated data. Column t ex shows the time taken to execute the query entirely from raw data. Column ratio gives t pre as a percentage of t ex; it is always below 100%, i.e., CPU time is always lower when the computation uses pre-aggregated data.

Table 4.3. Comparison of Query Evaluation Costs Using Pre-Aggregated Data and Original Data.

Q id  #aff. tiles  #preagg. tiles  t pre  t ex  ratio
Q1    63           24              15.6   17.8  87%
Q2    35           24              6.9    9.3   74%
Q3    35           8               9.4    10    94%
Q4    5            5               1.02   1.55  65%
Q5    5            5               1.1    1.63  67%
Q6    5            5               0.74   1.01  73%
Q7    2            1               0.04   0.41  9%
Q8    2            1               0.04   0.45  8%
Q9    2            1               0.04   0.41  9%

4.5 Summary

In this chapter we presented a framework for computing aggregate queries in array databases using pre-aggregated data. We distinguished among different types of pre-aggregates: independent, overlapped, and dominant. We showed that such a distinction is useful for finding a set of pre-aggregated queries that can reduce the CPU cost of query computation. We proposed a cost model to calculate the cost of using different pre-aggregates and select the best option for evaluating a query using pre-aggregated data. The measurements on real-life raster images showed that the computation of the queries is always faster with our algorithms compared to straightforward methods. We focused on queries using basic aggregate functions covering a large number of operations in GIS and remote-sensing imaging applications. The challenge remains, however, in supporting more complex aggregate operations, e.g., scaling, which is discussed in the following chapter.

Chapter 5

Pre-Aggregation Support Beyond Basic Aggregate Operations

In this chapter we investigate the problem of offering pre-aggregation support to non-standard aggregate operations such as scaling and edge detection. We discuss issues found while attempting to provide a pre-aggregation framework for all non-standard aggregate operations. We then justify our reasons for focusing on scaling operations. We adapt the framework and cost model presented in Chapter 4 to support scaling operations. Finally, we discuss the efficiency of our algorithms based on a performance analysis covering 2D, 3D and 4D datasets. We indicate how our approach generalizes and outperforms the well-known 2D image pyramids widely used in Web mapping.

5.1 Non-Standard Aggregate Operations

As shown in Chapter 2, aggregate operations are not limited to queries using basic aggregate functions. In the GIS domain, operations such as scaling, edge detection, and those related to terrain analysis also require data summarization and may therefore benefit from pre-aggregation. See Table 3.3 for a complete list of operations requiring summarization. Finding a general pre-aggregation approach for computing those kinds of operations, however, introduces additional complications when compared to finding pre-aggregates for basic aggregate functions. Basic aggregate functions each consolidate the values of a group of cells and return a scalar value. The value may represent the total sum, the number of cells, the maximum or minimum cell value, or the average value of the affected cells. Affected cells are determined by the spatial domain defined in the predicate of the query. In contrast, the computation of a scaling operation may require consolidating the values of a group of cells to calculate each cell value in the output raster. The affected cells are determined by both the resampling method and the scale vector, as described in Chapter 3. A similar situation occurs with edge detection: the affected cells are determined by the size and values of the applied Sobel filter. For simplicity, we refer to those kinds of operations as non-standard aggregate operations. There is an important concern that must now be taken into account.

From Chapter 3, we see that the result returned by a group of affected cells for a given non-standard aggregate operation such as scaling is not likely to be useful in computing another non-standard aggregate operation such as edge detection. This is because non-standard operations differ significantly with respect to the way their affected cells are determined. Nevertheless, this result may be useful in computing the same type of non-standard operation under certain conditions. For example, the result of scaling by a factor of 8 could be used to compute scaling by a factor of 10 (assuming that both operations utilize the same resampling method). This result, however, is not likely to be useful in edge detection for the same object. We therefore simplify the problem of offering pre-aggregation support to non-standard aggregations by treating each type of non-standard operation separately. This simplification is similar to those found in data warehousing techniques, where pre-aggregation algorithms cover a specific type of query. For instance, pre-aggregation algorithms exist for queries that include a group-by clause in their predicates, while other algorithms are used for queries without join conditions. We now focus on pre-aggregation support for one non-standard aggregate operation, scaling, for the following reasons:

• One of the most frequent operations in GIS and remote-sensing imaging applications is downscaling of some dataset or part thereof, such as obtaining a 1 GB overview of a 10 TB dataset.

• Scaling is a very expensive operation as it normally requires a full scan of the dataset, plus costly main memory operations. Therefore, query optimization is critical to this class of retrieval operations.

• Scaling is the only operation that has already been supported by pre-aggregation, at least for 2D datasets. This provides a point of reference to compare the effectiveness of our algorithms against existing techniques.

Although the framework discussed in the following sections is centered around scaling operations, it can be adapted to support other non-standard aggregate operations by modifying the matching conditions, as discussed later in this chapter.

5.2 Conceptual Framework

A common optimization technique that speeds up scaling operations is to materialize selected downscaled versions of an object, e.g., using image pyramids. When evaluating a scaling operation with target scale factor s, the pyramid level with the largest scale factor s′ is determined, where s′ < s. This relationship between scaling operations places them within a lattice framework similar to that used for data cubes in data warehouse/OLAP applications [92]. Our conceptual framework and greedy algorithm for the selection of pre-aggregates are based on the work of Harinarayan et al. presented in [92]. The use of this approach was motivated by the similarities between our datasets (multidimensional arrays) and OLAP data cubes. Furthermore,

the lattice framework and the greedy algorithm have proven successful in a variety of business applications.

Figure 5.1. Sample Lattice Diagram for a Workload with Five Scaling Operations

5.2.1 Lattice Representation

A scaling lattice consists of a set of queries L and a dependence relation ⪯, denoted by ⟨L, ⪯⟩. The ⪯ operator imposes a partial ordering on the queries of the lattice. Consider two queries q1 and q2. We say q1 ⪯ q2 if q1 can be answered using only the results of q2. The base node of the lattice is the scaling operation with the smallest scale vector, upon which every query is dependent. Lattices are commonly represented in a diagram in which the elements are nodes, and there is a path downward from q1 to q2 if and only if q1 ⪯ q2. The selection of pre-aggregates, that is, of queries for materialization, is equivalent to selecting vertices from the underlying nodes of the lattice. Fig. 5.1 shows a lattice diagram for a workload containing five queries. Each node has an associated label that represents a scaling operation for a given dataset, scale vector, and resampling method. In our framework, we use the following function to define scaling operations:

scale(objName[lo1:hi1, ..., lon:hin], s⃗, resMeth)    (5.1)

where

• objName[lo1:hi1, ..., lon:hin] is the name of the multidimensional raster image to be scaled. The operation can be restricted to a specific area of the raster object. In that case, the area is specified by defining lower (loi) and upper (hii) bounds for each dimension i. If the spatial domain is omitted, the operation is performed on the full spatial extent defining the raster image.

• s⃗ is a vector where each element is a numeric value representing the scale factor applied along the corresponding dimension of the raster image.

• resMeth specifies the resampling method to be applied to the original raster object.

For example, scale(CalFires, [2, 2, 2], nn) defines a scaling operation by a factor of two in each dimension, using nearest-neighbor resampling, on a 3D dataset identified as CalFires.
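A minimal sketch of how scaling operations and the dependence relation ⪯ of Section 5.2.1 could be represented; the ScaleOp class and the assumption that larger factors mean coarser (more strongly downscaled) results are illustrative, not the system's actual data structures.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class ScaleOp:
    """scale(objName[...], s, resMeth) as in Eq. (5.1) -- illustrative only."""
    obj_name: str
    scale_vector: Tuple[float, ...]
    resampling: str            # e.g. "nn" (nearest neighbour) or "bi" (bilinear)

def depends_on(q1: ScaleOp, q2: ScaleOp) -> bool:
    """q1 <= q2 in the lattice: q1 can be answered from the result of q2."""
    return (q1.obj_name == q2.obj_name
            and q1.resampling == q2.resampling
            and len(q1.scale_vector) == len(q2.scale_vector)
            and all(a >= b for a, b in zip(q1.scale_vector, q2.scale_vector)))

# The 3D example from the text: scaling CalFires by 2 in every dimension
base = ScaleOp("CalFires", (2.0, 2.0, 2.0), "nn")
coarser = ScaleOp("CalFires", (8.0, 8.0, 8.0), "nn")
print(depends_on(coarser, base))   # True: the factor-8 overview can reuse the factor-2 one
```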

5.2.2 Pre-Aggregation Selection Problem

Definition 5.4 (Pre-Aggregates Selection Problem) – Given a query workload Q and a storage space constraint C, the pre-aggregates selection problem is to select a set P ⊆ Q of queries such that P minimizes the overall cost of computing Q while the storage space required by P does not exceed the limit given by C. □

Considering existing view selection strategies in data warehousing/OLAP, the following selection criteria are suggested for pre-aggregates:

• Frequency. Pre-aggregates yield particularly significant increases in processing speed when scaling operations are executed with high frequency within a workload.

• Storage space. The storage space constraint must be at least the storage required by the query in the workload with the smallest scale vector. This guarantees that for any query in the workload at least one pre-aggregate can be used for its computation.

• Benefit. A scaling operation may be used to compute the same and other dependent queries in the workload. A metric is therefore used to calculate the cost savings gained by using a candidate scaling operation. To evaluate the cost, we use the model presented in Section 4.2. We call this the benefit of a pre-aggregate set and normalize the benefit against the base object’s storage volume.

Frequency

The frequency of query q, denoted by F(q), is the relative number of occurrences of the query in a workload:

F(q) = N(q) / |Q|    (5.2)

where N(q) is a function that returns the number of occurrences of query q in workload Q.

Storage Space

The storage space of a given query, denoted by S(q), represents the storage space required to save the result of query q; it is determined by the number of cells composing the output object defined by query q.
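Assuming the scale factors are downscale factors applied per dimension, S(q) can be estimated directly from the spatial extent and the scale vector, as in the following sketch (the ceiling division and the helper name are assumptions).

```python
import math
from typing import Sequence, Tuple

def output_cells(domain: Sequence[Tuple[int, int]], scale_vector: Sequence[float]) -> int:
    """S(q): number of cells in the result of scaling 'domain' by 'scale_vector'
    (scale factors interpreted as downscale factors, one per dimension)."""
    cells = 1
    for (lo, hi), s in zip(domain, scale_vector):
        extent = hi - lo + 1
        cells *= max(1, math.ceil(extent / s))
    return cells

# e.g. dataset R1 ([0:15359, 0:10239]) scaled by factor 2 in both dimensions
print(output_cells([(0, 15359), (0, 10239)], (2, 2)))   # 7680 * 5120 = 39,321,600
```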

Benefit

The benefit of a candidate scaling operation q is computed by adding the savings in query cost for each scaling operation in the workload dependent on q, including all queries identical to q. That is, query q may contribute to saving processing costs for the same or similar queries in the workload. In both cases, specific matching conditions must be satisfied.

Full-Match Conditions. Let q be a candidate query for pre-aggregation and p a query in workload Q. Let p and q both be scaling operations as defined in Eq. 5.1. There is a full-match between q and p if and only if:

• the value of parameter objName[] in the scale function defined for q is the same as in p

• the value of parameter s⃗ in the scale function defined for q is the same as in p

• the value of parameter resMeth in the scale function defined for q is the same as in p

Partial-Match Conditions. Let q be a candidate query for pre-aggregation and p be a query in the workload Q. There is a partial-match between p and q if and only if:

• the value of parameter objName[] in the scale function defined for q is the same as in p

• the value of parameter resMeth in the scale function defined for q is the same as in p

• the parameter s⃗ for both q and p is of the same dimensionality

• the vector values defined in s⃗ for p are higher than those defined for q

Definition 5.5 (Benefit) – Let T ⊆ Q be a subset of scaling operations that can be fully or partially computed using query q. The benefit of query q per unit space, denoted by B(q), is the sum of the computational cost savings gained by selecting query q for pre-aggregation. □

B(q) = (F(q) · C(q) + Σ_{t∈T} (F(t) · Cr(t, q))) / size(q)    (5.3)

where F(q) represents the frequency of query q in the workload, C(q) is the cost of computing query q on the original dataset, Cr(t, q) is the relative cost of computing query t from q, and size(q) is a function that returns the number of cells composing the spatial domain component of query q.

5.3 Pre-Aggregates Selection

Pre-aggregating all distinct scaling operations in the workload is not always possible because of space limitations. This is similar to the problem of selecting views for materialization in OLAP. One approach to finding the optimal set of scaling operations to pre-compute consists of enumerating all possible combinations and finding the one that yields the minimum average query cost, or the maximum benefit. Finding the optimal set of pre-aggregates in this way has a complexity of O(2^n), where n is the number of queries in the workload. If the number of scaling operations on a given raster object is 50, there are 2^50 possible combinations of pre-aggregates for that object. Therefore, computing the optimal set of pre-aggregates exhaustively is not feasible. In fact, it is an NP-hard problem [92, 17]. We therefore consider the selection of pre-aggregates as an optimization problem where the input includes multidimensional datasets, a query workload, and an upper bound on available disk space. The output is a set of queries that minimizes the total cost of evaluating the query workload subject to the storage limit. We present an algorithm that uses the benefit per unit space of a scaling operation. We model the expected queries by a query workload, which is a set of scaling operations:

Q = {qi|0 < i ≤ n} (5.4)

where each qi has an associated non-negative frequency, fi. We normalize frequencies so that they sum up to 1:

Σ_{i=1}^{n} fi = 1    (5.5)

Based on this setup we study different workload patterns.

The PRE-AGGREGATESSELECTION procedure returns a set P = {pi | 0 < i ≤ n} of queries to be pre-aggregated. Its input is a workload Q and a storage space constraint c. The workload contains a number of queries, each corresponding to a scaling operation as defined in Eq. 5.1. Frequency, storage space, and benefit per unit space are calculated for each distinct query in the workload. When calculating the benefit, we assume that each query is evaluated using the root (top) node, which is the first selected pre-aggregate, p1. The second chosen pre-aggregate p2 is the one with the highest benefit per unit space. The algorithm then recalculates the benefit of each scaling operation, given that it is computed either from the root or from p2, whichever yields the lower cost. Subsequent selections are performed in a similar manner: the benefit is recalculated each time a scaling operation is selected for pre-aggregation. The algorithm stops selecting pre-aggregates when the storage space constraint is reached, or when there are no more queries in the workload to be considered for pre-aggregation, i.e., all scaling operations in the workload have already been selected.

Algorithm 3 PRE-AGGREGATESSELECTION
Require: A workload Q, and a storage space constraint c
1: P = {top scaling operation}
2: while (c > 0 and |P| != |Q|) do
3:   p = highestBenefit(Q, P)
4:   if (c − |p| > 0) then
5:     c = c − |p|
6:     P = P ∪ {p}
7:   else
8:     c = 0
9:   end if
10: end while
11: return P

The function highestBenefit(Q, P) returns the scaling operation with the highest benefit per unit space in Q that has not yet been selected. The complexity of the algorithm is O(k · n²) (k is the number of selected pre-aggregates and n is the number of vertices in the lattice), which arises from the cost of sorting the pre-aggregates by benefit per unit size.
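A compact rendering of the greedy selection is sketched below, with the benefit computation of Eq. (5.3) folded into the choice of the next candidate. The data layout (dictionaries with op, freq, and size fields) and the cost_from and depends_on callbacks are illustrative assumptions, not the thesis implementation.

```python
# Hedged sketch of the greedy selection (Algorithm 3). The workload is a list of
# dictionaries {"op": ..., "freq": ..., "size": ...}; evaluation costs and the
# depends_on() predicate are assumed to follow Section 4.2 and Eq. (5.1).
def benefit(q, workload, selected, cost_from, depends_on):
    """Benefit per unit space of materializing q, in the spirit of Eq. (5.3)."""
    gain = 0.0
    for t in workload:
        if t is q or depends_on(t["op"], q["op"]):
            current = min(cost_from(t, p) for p in selected)   # best cost so far
            with_q = cost_from(t, q)
            gain += t["freq"] * max(0.0, current - with_q)
    return gain / q["size"]

def select_preaggregates(workload, top, space_limit, cost_from, depends_on):
    """Greedy selection under a storage constraint (Algorithm 3)."""
    selected = [top]                         # the top (finest) node is mandatory
    space_left = space_limit
    candidates = [q for q in workload if q is not top]
    while space_left > 0 and candidates:
        best = max(candidates,
                   key=lambda q: benefit(q, workload, selected, cost_from, depends_on))
        if best["size"] > space_left:
            break                            # next candidate no longer fits
        selected.append(best)
        space_left -= best["size"]
        candidates.remove(best)
    return selected
```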

5.3.1 Complexity Analysis

Let m be the number of queries in the lattice. Suppose we have no queries selected except for the top query, which is mandatory. The time to answer a given query in the workload is then the time taken to compute the query using the top query, calculated according to our cost model. We denote this time by To. Suppose that, in addition to the top query, we choose a set of queries P. Denote the average time to answer a query by Tp. The benefit of the set of queries P is the reduction in average time to answer a query, that is, To − Tp. Thus, minimizing the average time to answer a query is equivalent to maximizing the benefit of a set of queries. Let p1, p2, ..., pk be the k queries selected by the PRE-AGGREGATESSELECTION algorithm. Let bi be the benefit achieved by the selection of pi, for i = 1, 2, ..., k. That is, bi is the benefit of pi with respect to the set consisting of the top query and p1, p2, ..., pi−1. Let P = {p1, p2, ..., pk}. Let O = {o1, o2, ..., ok} be an optimal set of k queries, i.e., those queries giving the maximum benefit. Let mi be the benefit achieved by the selection of oi, for i = 1, 2, ..., k. That is, mi is the benefit of oi with respect to the set consisting of the top query and o1, o2, ..., oi−1. Harinarayan et al. [92] proved that the benefit of the greedy algorithm can never be less than (e−1)/e ≈ 0.63 times the benefit of the optimum choice of pre-aggregated queries.

5.4 Answering Scaling Operations Using Pre-Aggregated Data

We say that a pre-aggregate p answers query q if there exists some other query q′ which, when executed on the result of p, provides the result of q. The result can be either exact with respect to q (q′ ∘ p ≡ q), or only an approximation (q′ ∘ p ≈ q). In practice, the result is often an approximation because of the effect of resampling the original dataset. The same effect is observed in the traditional image pyramids approach, but it is considered negligible since approximations are good enough for many applications. In our approach, when two or more pre-aggregates qualify for computing a given scaling operation, we pick the pre-aggregate with the scale vector closest to the one defined in the scaling operation.

Example 5.1 – Assume the queries listed in Table 5.1 have been pre-aggregated, and suppose we want to compute the following query: q = scale(ras01, (4.0, 4.0, 4.0), bi). From the list of available pre-aggregates, the query can be answered either by using p2 or p3. Of these two pre-aggregates, p3 has the closest scale vector to q. Thus, q′ = scale(p3, (0.87, 0.87, 0.87), bi). Note that q′ represents a rewritten scaling operation in terms of the pre-aggregate. □

Table 5.1. Sample Pre-Aggregates.

Raster Object ID  Raster Name  Scale Vector     Resampling Method
p1                ras01        (2.0, 2.0, 2.0)  nn
p2                ras01        (3.0, 3.0, 3.0)  bi
p3                ras01        (3.5, 3.5, 3.5)  bi
p4                ras01        (6.0, 6.0, 6.0)  bi

The REWRITEOPERATION procedure returns, for query q, a query q′ that has been rewritten in terms of a pre-aggregate identified by pid. The input of the algorithm is the scaling operation q and a set of pre-aggregates P. The algorithm looks for a FULL-MATCH between q and one of the elements of P. To this end, the algorithm verifies that the matching conditions listed in Section 5.2.2 are all satisfied. If a full match is found, it returns the identifier of the matched pre-aggregate. Otherwise, the algorithm verifies the PARTIAL-MATCH conditions for all pre-aggregates in P. All qualified pre-aggregates are added to set S. In the case of a partial match, the algorithm finds the pre-aggregate with the scale vector closest to the one defined in q. REWRITEQUERY rewrites the original query as a function of the selected pre-aggregate, and adjusts the values of the scale vector to perform the complementary scaling operation. The algorithm makes use of the following auxiliary functions.

• FULLMATCH(q, P). Verifies that all full-match conditions are satisfied. If no match is found, it returns 0; otherwise it returns the id of the matching pre-aggregate.

• PARTIALMATCH(q, P ). Verifies that all partial-match conditions are satisfied. Each qualified pre-aggregate of P is added to set S.

• CLOSESTSCALEVECTOR(q, S). Compares the scale vectors between q and the elements of S, and returns the identifier (pid) of the pre-aggregate whose scale vector is the closest to that defined for q.

• REWRITEQUERY(q, pid). Rewrites query q in terms of the selected pre-aggregate and adjusts the scale vector values accordingly.

Algorithm 4 REWRITEOPERATION
Require: A query q, and a set of pre-aggregates P
1: initialize S = {}, pid = 0
2: pid = fullMatch(q, P)
3: if (pid == 0) then
4:   S = partialMatch(q, P)
5:   pid = closestScaleVector(q, S)
6: end if
7: q′ = rewriteQuery(q, pid)
8: return q′
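The rewriting step can be sketched as follows, using the convention of Example 5.1 that the rewritten operation applies the complementary factor pi/qi to the already-reduced pre-aggregate; class and helper names are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass(frozen=True)
class ScaleOp:
    obj_name: str
    scale_vector: Tuple[float, ...]
    resampling: str

def closest_scale_vector(q: ScaleOp, candidates: List[ScaleOp]) -> Optional[ScaleOp]:
    """Among partially matching pre-aggregates, pick the one whose scale vector
    is closest to (but not coarser than) the one requested by q."""
    usable = [p for p in candidates
              if p.obj_name == q.obj_name and p.resampling == q.resampling
              and len(p.scale_vector) == len(q.scale_vector)
              and all(pv <= qv for pv, qv in zip(p.scale_vector, q.scale_vector))]
    if not usable:
        return None
    # largest factors still <= q's (lexicographic comparison as a simplification)
    return max(usable, key=lambda p: p.scale_vector)

def rewrite_query(q: ScaleOp, p: ScaleOp) -> ScaleOp:
    """Express q as a complementary scaling of the pre-aggregate p, as in
    Example 5.1 (factor p_i / q_i applied to p's already-reduced result)."""
    comp = tuple(pv / qv for pv, qv in zip(p.scale_vector, q.scale_vector))
    # the naming of the intermediate object is a made-up placeholder
    return ScaleOp(f"preagg({p.obj_name},{p.scale_vector})", comp, q.resampling)

# Reproducing Example 5.1: q = scale(ras01, (4,4,4), bi) answered from p3 = (3.5, 3.5, 3.5)
q = ScaleOp("ras01", (4.0, 4.0, 4.0), "bi")
p3 = ScaleOp("ras01", (3.5, 3.5, 3.5), "bi")
print(rewrite_query(q, p3).scale_vector)   # (0.875, 0.875, 0.875) -- truncated to 0.87 in Example 5.1
```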

5.5 Experimental Results

Experiments were conducted to evaluate the effectiveness of the pre-aggregation selection and rewriting algorithms in supporting scaling operations. They were run on a machine with a 3.00 GHz Intel Pentium 4 processor, running SuSE Linux 9.1. The workstation had a total physical memory of 512 MB. The query workload consisted of scaling operations with different scaling vectors. Different data distributions of the query workload were also considered. Despite the growing popularity of Web mapping services for GIS raster information processing, very few studies have been undertaken that report on user behavior while using those services. One of the primary reasons for the lack of research in this area may be the limited availability of the datasets outside of specialized research groups. Moreover, while query patterns related to scaling operations on 2D datasets are difficult to find, no empirical workload distributions were found for datasets of higher dimensionalities. We therefore resorted to using a set of artificial distributions that cover many practical situations in GIS and remote-sensing imaging. Most pre-aggregation algorithms in OLAP and image pyramids assume a uniform distribution of the values given for the scale vector in the query workload, so we considered the same type of distribution for our experiments. Furthermore, we also considered a Poisson distribution of the scale vector values. The rationale is that such a distribution covers situations where the dataset is scaled down by factors that typically fall within a narrow range of scale vectors. For example, very large objects may need to be scaled down by large scale vectors so they can be efficiently transferred back and forth via Web services [77]. We also considered applications where the dataset is scaled down by the same scale vector; we refer to such an access pattern as a peak distribution. Finally, we investigated a step distribution that covers cases where scaling operations can be grouped within specific ranges of scale vectors. Our experiments were performed on datasets generated from three real-life raster objects:

• Dataset R1. Consists of a 2D raster object with spatial domain [0 : 15359, 0 : 10239]. The dataset contains 600 tiles, each with a spatial domain of [0 : 512, 0 : 512]. The total number of cells composing the raster object is 157 million.

• Dataset R2. Consists of a 3D raster object with spatial domain [0 : 11299, 0 : 10459, 0 : 3650]. The dataset contains 3214 tiles, each with a spatial domain of [0 : 512, 0 : 512, 0 : 512]. The total number of cells composing the raster object is 43 trillion.

• Dataset R3. Consists of a 4D raster object with spatial domain [0 : 10150, 0 : 7259, 0 : 2430, 0 : 75640]. The dataset contains 197,070 tiles, each with a spatial domain of [0 : 512, 0 : 512, 0 : 512, 0 : 512]. The total number of cells composing the raster object is 1.35e+16.

In the rest of this section, we present the results of our experiments according to the dimensionality of the data.

5.5.1 2D Datasets

In this experiment the workload consisted of 12800 scaling operations defined for dataset R1.

Uniform Distribution

The scaling vectors of the queries in the workload were uniformly distributed. Scale vectors were integers ranging from 2 to 256. Following observations from practice, we assumed that both dimensions were coupled. We considered a storage space constraint of 35%, which is slightly higher than the additional storage space taken by image pyramids. The PRE-AGGREGATESSELECTION algorithm yields 12 pre-aggregates for this test, executing scaling operations with scale vectors 2, 4, 6, 11, 15, 22, 32, 46, 67, 95, 137, and 182. The cost of computing the workload using these pre-aggregates is 18,565. In contrast, image pyramids selects scaling operations with scale vectors 2, 4, 8, 16, 32, 64, 128, and 256, and requires 33% additional storage space. Image pyramids computes the workload at a cost of 29,166. The results of this experiment show that the pre-aggregates selected by our algorithm provide improved performance for scaling operations over image pyramids. The cost of computing the workload using our algorithm is 36% less than that incurred by image pyramids, at the price of 2% additional storage space.

Fig. 5.2(a) shows the distribution of the scale vectors of all queries in the workload. The pre-aggregates selected by image pyramids and by our pre-aggregation selection algorithm are shown in Fig. 5.2(b) and 5.2(c), respectively.

Figure 5.2. Query Workload with Uniform Distribution: (a) query workload (uniform distribution); (b) queries selected for materialization by image pyramids; (c) queries selected for materialization by our pre-aggregation selection algorithm.

Poisson Distribution

The workload for this experiment consisted of scaling operations whose scale vectors followed a Poisson distribution with a mean scale vector value of 50. The PRE-AGGREGATESSELECTION algorithm yields 33 pre-aggregates for this test, executing scaling operations with scale vectors from 34 to 66. The cost of computing the workload using these pre-aggregates is 42,455. In contrast, image pyramids selects scaling operations with scale vectors 2, 4, 8, 16, 32, 64, and 128, and the cost of computing the workload is 95,468. Thus, the cost of computing the workload using the pre-aggregates selected by our algorithm is 55% less than that incurred using image pyramids. There is also a major difference with respect to the additional storage space required by both approaches: image pyramids requires 33% additional storage space, while our algorithm requires only 5% additional space to store the selected pre-aggregates.

Figure 5.3. Query Workload with Poisson Distribution: (a) query workload.

Fig. 5.3(a) shows the distribution of the scale vectors of all queries in the workload. The pre-aggregates selected by image pyramids are shown in Fig. 5.4(a). Even when there are no queries in the workload with scale factors smaller than 33, image pyramids still allocates space for pre-aggregates 2, 4, 8, 16, and 32, which are the ones that account for much of the overall space requirement (33%). In contrast, our algorithm uses the query frequencies in the workload to select the queries for pre-aggregation; see Fig. 5.4(b). For this workload configuration, it is possible to pre-aggregate all distinct queries and provide much faster query response times than image pyramids. This shows the benefit of considering query frequencies in the workload. If we pick a mean higher than 50, the additional storage space needed by the pre-aggregates is minimal. Conversely, if the mean is shifted to a lower scale vector value, e.g., 16, the storage space needed by our pre-aggregation algorithm can increase to up to 35%.

Peak Distribution

In this experiment, the query workload consisted of scaling operations with a scale vector having a value of 100 in each dimension. The PRE-AGGREGATESSELECTION algorithm yields one pre-aggregate for this test, which executes a scaling operation with scale vector (100, 100). The cost of computing the workload using this pre-aggregate is 1.27e+08. In contrast, image pyramids selects scaling operations with scale factor values 2, 4, 8, 16, 32, 64, and 128 in each dimension, and the cost of computing the workload is 3.01e+08. Thus, the cost of computing the workload using the pre-aggregates selected by our algorithm is 58% less than the cost incurred by image pyramids. Furthermore, there is a major difference with respect to the storage space required by both approaches: image pyramids requires 33% additional storage space, while our algorithm only requires 5% additional space.

Figure 5.4. Selected Queries for Pre-Aggregation: (a) queries selected by image pyramids; (b) queries selected by our pre-aggregation selection algorithm.

Fig. 5.5(a) shows the distribution of the scale vectors for all queries in the workload. The pre-aggregates selected by image pyramids are shown in Fig. 5.6(a). Image pyramids allocates space for pre-aggregates with scale factors 2, 4, 8, 16, 32, 128, and 256 in each dimension. In contrast, our pre-aggregation selection algorithm selected one query, shown in Fig. 5.6(b). Although our algorithm makes more efficient use of storage space and computes the workload faster than image pyramids, this kind of scenario is not likely to occur in practice. The storage overhead is simply not justified. However, users may benefit from having a system that automatically pre-aggregates such operations with minimum overhead, a capability that can be provided by our algorithm.

Figure 5.5. Query Workload with Peak Distribution: (a) query workload.

Figure 5.6. Selected Queries for Pre-Aggregation: (a) queries selected by image pyramids; (b) queries selected by our pre-aggregation selection algorithm.

Step Distribution

We now consider a scenario where scale vectors are distributed in various ranges of frequencies, i.e., in a step distribution. The PRE-AGGREGATESSELECTION algorithm yields 6 pre-aggregates for this test, where scaling operations are executed with scale vectors 6, 8, 13, 19, 75, and 200. The cost of computing the workload using these pre-aggregates is 1.5e+09. In contrast, image pyramids selects scaling operations with scale vectors 2, 4, 8, 16, 32, 64, and 128, and the cost of computing the workload is 2.21e+09. The cost of computing the workload using the pre-aggregates selected by our algorithm is therefore 32% less than that incurred by image pyramids. Moreover, there is a major difference with respect to the additional storage space required by both approaches: image pyramids requires 33% additional storage space, while our algorithm only requires 15% additional space.

Figure 5.7. Query Workload with Step Distribution: (a) query workload.

Fig. 5.7(a) shows the distribution of the scale vectors for all queries in the workload. The pre-aggregates selected by image pyramids are shown in Fig. 5.8(a).

Figure 5.8. Selected Queries for Pre-Aggregation: (a) queries selected by image pyramids; (b) queries selected by our pre-aggregation selection algorithm.

5.5.2 3D Datasets

To test our pre-aggregation algorithms on 3D time-series datasets, we picked four data distribution patterns for the scaling vectors. For simplicity, we have labeled the dimensions x, y, and t, respectively. The following assumption (taken from observations in practice) is common to each data distribution type: the scale vector along the first two dimensions is the same, i.e., x = y. The aim of this test is to measure average query costs while varying the storage space available for pre-aggregation.

Uniform distribution in x, y, t

In this experiment, the workload consisted of 10,000 scaling operations referring to the 3D dataset R2 described at the beginning of this section. Scale vectors were uniformly distributed along the x, y, and t dimensions. Values of scale vectors ranged from 2 to 256. Fig. 5.9 shows the distribution of the scaling vectors in the workload. We executed the PRE-AGGREGATESSELECTION algorithm for different values of the storage space constraint (c). The minimum storage space required to support the root node of the lattice was 12.5% of the size of the original dataset.

Fig. 5.10 shows the average query cost as storage space is increased. A small amount of storage space dramatically reduces the average query cost. The improvement in average query cost decreases, however, as allocated space goes beyond 36%. Fig. 5.11 shows the scaling operations selected for pre-aggregation when c = 36%. For this instance of the storage space constraint, the algorithm selected 49 pre-aggregates. The total cost of computing the workload is 6.44e+05. In contrast, computing the workload using the original dataset incurs a cost of 1.28e+12.

Figure 5.9. Workload with Uniform Distribution along x, y, and t

Figure 5.10. Average Query Cost over Storage Space

Figure 5.11. Selected Pre-Aggregates, c = 36%

Uniform distribution in x, y and Poisson distribution in t

In this experiment, the workload consisted of 23,460 scaling operations referring to 3D dataset R2. The scale vectors were uniformly distributed along x and y, with a Poisson distribution along t. Values of scale vectors ranged from 2 to 256 in the x and y dimensions, whereas in t they ranged from 8 to 16, with a mean value of 12. Fig. 5.12 shows the distribution of scaling vectors in the workload. Note that the scale vector values in the dimensions x and y are coupled. The frequency of the various scale factor values is denoted by f. We ran the PRE-AGGREGATESSELECTION algorithm for different values of the storage space constraint. The minimum storage space required to support the root node of the lattice was 3.13% of the size of the original dataset. Fig. 5.13 shows the average query cost as storage space increases. A small amount of storage space dramatically improves the average query cost. However, we can also observe that the improvement in average query cost decreases as allocated space goes beyond 26%. Fig. 5.14 shows the scaling operations selected for pre-aggregation when c = 26%. For this instance of the storage space constraint, the algorithm selected 67 pre-aggregates. The total cost of computing the workload is 1.21e+07. In contrast, computing the workload using the original dataset incurs a cost of 2.31e+11.

Poisson distribution in x, y, t

In this experiment, the workload consisted of 600 scaling operations referring to 3D dataset R2. The scale vectors followed a Poisson distribution along the three dimensions x, y, and t. Values of scale vectors ranged from 2 to 10 in the x and y dimensions, whereas in t they ranged between 8 and 16, with a mean value of 12. Fig. 5.15 shows the distribution of the scaling vectors in the workload. We ran the PRE-AGGREGATESSELECTION algorithm for different values of the storage space constraint.

Figure 5.12. Workload with Uniform Distribution Along x, y, and Poisson Distribution in t

Figure 5.13. Average Query Cost as Space is Varied

The minimum storage space required to support the root node of the lattice was 4.18% of the size of the original dataset. Fig. 5.16 shows the average query cost as storage space is increased. A small amount of storage space dramatically improves the average query cost. However, the improvement in average query cost decreases as allocated space goes beyond 26%. Fig. 5.17 shows the scaling operations selected for pre-aggregation when c = 30%. For this instance of the storage space constraint, the algorithm selected 23 pre-aggregates. The total cost of computing the workload is 1680. In contrast, computing the workload using the original dataset incurs a cost of 1.34e+12.

Figure 5.14. Selected Pre-Aggregates, c = 26%

Figure 5.15. Workload with Poisson Distribution Along x, y, and t

Poisson distribution in x, y, and Uniform distribution along t

In this experiment, the workload consisted of 924 scaling operations referring to 3D dataset R2. The scale vectors followed a Poisson distribution along the dimensions x and y, and a uniform distribution along dimension t. Values of scale vectors ranged from 2 to 10 in the x and y dimensions, and were uniformly distributed along t. Fig. 5.18 shows the distribution of the scaling vectors in the workload. We ran the PRE-AGGREGATESSELECTION algorithm for different values of the storage space constraint. The minimum storage space required to support the root node of the lattice was 4% of the size of the original dataset. Fig. 5.19 shows the average query cost as storage space is increased. A small amount of storage space dramatically improves the average query cost. However, the improvement in average query cost decreases as allocated space goes beyond 21%. Fig. 5.20 shows the scaling operations selected for pre-aggregation when c = 21%.

Figure 5.16. Average Query Cost as Space is Varied

Figure 5.17. Selected Pre-Aggregates, c = 30%

Figure 5.18. Workload with Poisson Distribution Along x, y, and Uniform Distribution in t

For this instance of the storage space constraint, the algorithm selected 17 pre-aggregates. The total cost of computing the workload is 1472. In contrast, computing the workload using the original dataset incurs a cost of 1.63e+12.

5.5.3 4D Datasets

For 4D datasets, we considered ECHAM T-42 as a typical use case found in climate modeling. ECHAM T-42 is an energy and mass budget model developed by the Max-Planck-Institute for Meteorology [16]. We assumed that dimensions x and y are scaled down by the same scale value. However, the scale values along z and t may vary according to the specific analysis requirements of a given application. If we look at the sample dimensions of the ECHAM T-42 model shown in Table 5.2, it is clear that the extents along the first three dimensions are much smaller than that of the fourth dimension (time).

In this experiment, the workload consisted of 1,137 scaling operations referring to 4D dataset R3. We assumed that the scale vectors followed a Poisson distribution in each of the four dimensions. The rationale behind this assumption is that scientists are often interested in a highly selective data set, and a Poisson distribution fits this data access pattern nicely. Values of scale vectors ranged from 2 to 11 in the x and y dimensions with a mean of 6; from 10 to 19 along the z dimension with a mean of 14; and from 230 to 239 along t with a mean of 234. Table 5.3 shows the distribution of the scale factors of all scaling operations in the workload.

We ran the PRE-AGGREGATESSELECTION algorithm for different values of the storage space constraint. The minimum storage space required to support the root node of the lattice was 1.25% of the size of the original dataset. Table 5.4 shows the scaling operations selected for pre-aggregation when c = 1.3%.

Figure 5.19. Average Query Cost as Space is Varied

Figure 5.20. Selected Pre-Aggregates, c = 21%

For this instance of the storage space constraint, the algorithm selected the 4 pre-aggregates shown in Table 5.4. The total cost of computing the workload is 3361. In contrast, computing the workload using the original dataset incurs a cost of 1.35e+16.

Table 5.2. ECHAM T-42 Climate Simulation Dimensions

Dimension              Extent
Longitude              128
Latitude               64
Elevation              17
Time (24 min/slice)    200 years (2,190,000)
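To give a sense of scale, the extents in Table 5.2 imply a full-resolution cube of roughly

\[
128 \times 64 \times 17 \times 2{,}190{,}000 \;\approx\; 3.05 \times 10^{11}\ \text{cells},
\]

cell size and encoding aside.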

Table 5.3. 4D Scaling: Scale Vector Distribution

Scale Vector      Count
2,2,10,230        200
3,3,11,231        300
4,4,12,232        500
5,5,13,233        800
6,6,14,234        1000
7,7,15,235        1000
8,8,16,236        800
9,9,17,237        500
10,10,18,238      300
11,11,19,239      200

Table 5.4. 4D Scaling: Selected Pre-Aggregates

Scale Vector      Count
2,2,10,230        200
4,4,12,232        500
6,6,14,234        1000
8,8,16,236        800
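The selection mechanics described above can be illustrated by a minimal greedy sketch: candidate scaling operations are ranked by the workload cost they save per unit of storage they occupy, and admitted while the space constraint c permits. This is a simplified stand-in for the PRE-AGGREGATESSELECTION algorithm; the candidate list, sizes, and benefit figures below are invented for illustration only.

from dataclasses import dataclass

@dataclass
class Candidate:
    scale_vector: tuple   # e.g. (2, 2, 10, 230)
    frequency: int        # how often the workload requests this scaling
    size: float           # storage needed, as a fraction of the original dataset
    benefit: float        # workload cost saved if this pre-aggregate is materialized

def select_preaggregates(candidates, space_constraint):
    # Greedily pick pre-aggregates with the highest benefit per unit of storage
    # until the storage space constraint is exhausted.
    selected, used = [], 0.0
    for cand in sorted(candidates, key=lambda c: c.benefit / c.size, reverse=True):
        if used + cand.size <= space_constraint:
            selected.append(cand)
            used += cand.size
    return selected, used

# Hypothetical candidates; sizes and benefits are made-up numbers.
cands = [
    Candidate((2, 2, 10, 230), 200, 0.0125, 5.0e14),
    Candidate((4, 4, 12, 232), 500, 0.0016, 2.1e14),
    Candidate((6, 6, 14, 234), 1000, 0.0005, 1.5e14),
    Candidate((8, 8, 16, 236), 800, 0.0002, 6.0e13),
]
chosen, space_used = select_preaggregates(cands, space_constraint=0.013)
print([c.scale_vector for c in chosen], space_used)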

5.6 Summary

This chapter describes our investigations into the problem of intelligently picking a subset of scaling operations for pre-aggregation given a storage space constraint. There is a tradeoff between the amount of space allocated for pre-aggregation and the average query cost of scaling operations. We introduced a pre-aggregation selection algorithm that, based on a given query workload, determines a set of pre-aggregates in the face of storage space constraints.

We performed experiments on 2D, 3D, and 4D datasets using different distribution patterns for the scale vectors. We relied on artificial data distributions since no empirical distributions were available. In addition to uniformly distributed scale vectors, we considered non-uniform distributions including Poisson, peak, and step. For 2D datasets, we showed that our algorithm performs better than the image pyramid approach. In particular, for non-uniform distributions, our pre-aggregation selection algorithm not only provides a lower average query cost, but also makes much more efficient use of storage space. This is because our algorithm considers the frequency of each query and the cost savings (benefit) its materialization provides for computing the workload. Nevertheless, the major advantage of our algorithm over image pyramids is not the improved average query cost, but the reduced amount of storage space required for the pre-aggregates, especially for non-uniform distributions.

In our experiments with 3D and 4D datasets, we showed the effect of the available storage space for pre-aggregation on average query costs. We observed that a small amount of storage overhead is sufficient to dramatically reduce average query costs. Since there are no similar techniques against which we could compare our results, we compared them against the average query costs obtained by using the original data.

Chapter 6

Conclusion

One of the biggest challenges for database technology is to effectively and efficiently manage and archive extremely large volumes of multidimensional array data. This thesis investigates the problem of applying OLAP pre-aggregation technology to speed up aggregate query processing in array databases for GIS and remote-sensing imaging applications.

We presented a study of fundamental imaging operations in GIS. By using a formal algebraic framework, Array Algebra, we were able to classify GIS operations according to three basic algebraic operators and thus identify a set of operations that can benefit from pre-aggregation techniques. We argued that OLAP pre-aggregation techniques cannot be applied in a straightforward manner to array databases for our target applications. The reason is that, although similar, the data structures in the two application domains differ in fundamental aspects. In OLAP, multidimensional data spaces are spanned by axes where cell values sit on the grid at intersection points. This is paralleled by raster image data, which are discretized during acquisition; thus, the structure of an OLAP data cube is rather similar to a raster array. Dimension hierarchies in OLAP serve to group value ranges along an axis: querying data by referring to coordinates on the measure axes yields ground data, whereas queries using axes higher up in a dimension hierarchy return aggregated values. A main differentiating criterion between OLAP data and raster image data is density: OLAP data are sparse, typically 5% dense, whereas raster image datasets are 100% dense. Note also that dimensions in OLAP are treated as business perspectives, such as products and/or stores; these are non-spatial dimensions, which contrasts with the spatial nature of raster image datasets. There are, however, core similarities that motivated us to further research OLAP pre-aggregation techniques. For example, we observed that array databases and OLAP systems both employ multidimensional data models to organize their data. Also, the operations convey a high degree of similarity: a roll-up (aggregate) operation in OLAP is very similar to a scaling operation in the raster domain. Moreover, both application domains make use of pre-aggregation to speed up query processing; however, each has reached a different level of maturity and scalability.

We presented a framework that focuses on computing basic aggregate operations using pre-aggregated data.

We argued that the decision to compute an aggregate query using pre-aggregated data is influenced by the structural characteristics of the query and the pre-aggregate. Thus, by comparing the query tree structures of the two, one can determine whether the pre-aggregated result contributes fully or partially to the final answer of the query. The best case occurs when there is a full match between the query and the pre-aggregate, since the time taken to compute the query is reduced to the time it takes to retrieve the result. In the case of partial matching, however, several pre-aggregates may be considered for computing the answer of a query. The decision therefore has to be made as to which pre-aggregates provide the best performance in terms of execution time. To this end, we distinguished between different types of pre-aggregates and presented a cost model to calculate the cost of using each qualifying pre-aggregate. We then presented an algorithm that selects the best execution plan for evaluating a query over pre-aggregated data. Tests performed on real-life raster image datasets showed that our distinction between different types of pre-aggregates is useful for determining the pre-aggregate providing the highest benefit (in terms of execution time) for computing a given query.

We then described the issues that arise when generalizing our pre-aggregation framework to support more complex aggregate operations, and justified our decision to focus on one particular operation: scaling. Traditionally, 2D scaling operations have been performed using image pyramids. Practice shows that pyramids are typically constructed in scale levels of powers of 2, yielding scale vectors 2, 4, 8, 16, 32, 64, 128, 256, and 512. The materialization of the pyramid requires an estimated 33% of additional storage space. Our pre-aggregation selection algorithm is similar to the pyramid approach in that it selects a set of queries for materialization, where each level corresponds to a scaling operation with a defined scale factor. However, the selection of such queries is not restricted to a fixed number of levels separated by powers of two. Instead, our selection algorithm considers the frequency of each query in the workload and how the result of each individual query can help to reduce the overall cost of computing the workload. We compared the performance of our pre-aggregation algorithm against that of image pyramids: the results showed that for workloads with uniformly distributed scale vectors, our algorithm computes the workload at a cost 36% lower than image pyramids, while requiring 7% more space. For scale vectors following a Poisson distribution, our algorithm computes the workload at a cost 55% lower than the pyramid approach. Furthermore, our algorithm can be applied to datasets of higher dimensions, a feature not supported by traditional image pyramids.
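The 33% figure follows from the geometric series of pyramid level sizes. Assuming each pyramid level halves both spatial extents of a 2D image, the additional storage relative to the base image is

\[
\sum_{k=1}^{\infty} \left(\frac{1}{2^{k}}\right)^{2} \;=\; \sum_{k=1}^{\infty} \frac{1}{4^{k}} \;=\; \frac{1/4}{1-1/4} \;=\; \frac{1}{3} \;\approx\; 33\%.
\]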

6.1 Future Work

There are natural extensions to this work that would help expand and strengthen the results. One area of further work is adding self-management capabilities, so that the DBMS maintains statistics about each scaling operation appearing in the incoming queries and, at some suitable time, adjusts the pre-aggregate set accordingly. OLAP dynamic pre-aggregation addresses a similar problem. Another area is applying the results studied here to the many real-world situations where data cubes contain one or more non-spatio-temporal dimensions, such as pressure, which is common in meteorological and oceanographic data sets.

Workload distribution deserves further investigation. While the distributions chosen are practical and relevant, there might be further situations worth considering. Gaining empirical figures from user-exposed services like EarthLook1 can be useful to tune our pre-aggregation selection algorithms. Further investigation is also necessary in the realm of rewriting scaling operations. In OLAP applications, there is a trade-off between speed and accuracy. But accuracy may be critical for certain Geo-raster applications, so solutions to the query rewriting problem must weigh these two aspects according to user data analysis requirements. Moreover, they must consider the fact that the same dataset may be accessed by various users with totally different analysis needs.

1 www.earthlook.org

Bibliography

[1] Blakeley J. A., Larson P-K., and Tompa F. Efficiently updating materialized views. In SIGMOD Rec., volume 15, pages 61–71, New York, NY, USA, 1986. ACM.

[2] Burrough P. A. and McDonell R. A. Principles of Geographical Information Systems. Oxford, 2004.

[3] Dehmel A. A Compression Engine for Multidimensional Array Database Systems. PhD thesis, Technical University Munich, Germany, 2002.

[4] Dobra A., Garofalakis M., Gehrke J., and Rastogi R. Processing complex aggregate queries over data streams. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, pages 61–72, New York, NY, USA, 2002. ACM.

[5] Garcia-Gutierrez A. Applying OLAP pre-aggregation techniques to speed up query processing in raster-image databases. In GI-Days 2007 - Young Researchers Forum, pages 189–191, Muenster, Germany, 2007. IfGIprints 30.

[6] Garcia-Gutierrez A. Applying OLAP pre-aggregation techniques to speed up query response times in raster image databases. In ICSOFT (ISDM/EHST/DC), pages 259–266, 2007.

[7] Garcia-Gutierrez A. Modeling geo-raster operations with array algebra. In Technical Report (7), 2007.

[8] Garcia-Gutierrez A. and Baumann P. Modeling fundamental geo-raster operations with array algebra. In ICDM Workshops, pages 607–612, 2007.

[9] Garcia-Gutierrez A. and Baumann P. Computing aggregate queries in raster image databases using pre-aggregated data. In Proceedings of the International Conference on Computer Science and Applications, pages 84–89, San Francisco, CA, USA, 2008.

[10] Garcia-Gutierrez A. and Baumann P. Using pre-aggregation to speed up scaling operations on massive spatio-temporal data. In 29th International Conference on Conceptual Modeling, November 2010.


[11] Gupta A. and Mumick I. S. Maintenance of materialized views: Problems, techniques, and applications. In IEEE Data Engineering Bulletin, volume 18, pages 3–18, 1995.

[12] Gupta A. and Mumick I. S. Materialized Views. The MIT Press, 2007.

[13] Gupta A., Harinarayan V., and Quass D. Aggregate-query processing in data warehousing environments. In Proceedings of the 21st International Conference on Very Large Data Bases, pages 358–369, San Francisco, CA, USA, 1995. Morgan Kaufmann Publishers Inc.

[14] Kitamoto A. Multiresolution cache management for distributed satellite image database using NACSIS-Thai international link. In Proceedings of the 6th International Workshop on Academic Information Networks and Systems (WAINS), pages 243–250, 2000.

[15] Koeller A. and Rundensteiner E. A. Incremental maintenance of schema-restructuring views in SchemaSQL. In IEEE Transactions on Knowledge and Data Engineering, volume 16, pages 1096–1111, Piscataway, NJ, USA, 2004. IEEE Educational Activities Department.

[16] Lauer A., J. Hendricks, I. Ackermann, B. Schell, H. Hass, and S. Metzger. Simulating aerosol microphysics with the ECHAM/MADE GCM; Part I: Model description and comparison with observations. In Atmospheric Chemistry and Physics, volume 5, pages 3251–3276, 2005.

[17] Shukla A., Deshpande P., and Naughton J. F. Materialized view selection for multidimensional datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 488–499, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc.

[18] Spokoiny A. and Shahar Y. An active database architecture for knowledge-based incremental abstraction of complex concepts from continuously arriving time-oriented raw data. In Journal on Intelligent Information Systems, volume 28, pages 199–231, Hingham, MA, USA, 2007. Kluwer Academic Publishers.

[19] Stan A. Geographic information systems: A management perspective. In WDL Publications, 1991.

[20] American National Standards Institute Inc. (ANSI). ANSI/ISO/IEC 9075-2:2008, International Organization for Standardization (ISO), Information Technology – Database Languages – SQL – Part 2: Foundation (SQL/Foundation). Technical report, American National Standards Institute, 2008.

[21] Barbará B. and Imielinski T. Sleepers and workaholics: Caching strategies in mobile environments. In SIGMOD Conference, pages 1–12, 1994.

[22] Moon B., Vega-Lopez I. F., and Vijaykumar I. Scalable algorithms for large temporal aggregation. In Proceedings of the 16th International Conference on Data Engineering, page 145, Washington, DC, USA, 2000. IEEE Computer Society.

[23] Reiner B. HEAVEN A Hierarchical Storage and Archive Environment for Multidimensional Array Database Management Systems. PhD thesis, Technical University Munich, Germany, 2004.

[24] Reiner B. and Hahn K. Tertiary storage support for large-scale multidimensional array database management systems, 2002.

[25] Reiner B., Hahn K., Hoefling G., and Baumann P. Hierarchical storage support and management for large-scale multidimensional array database management systems. In Proceedings of the 3rd International Conference on Database and Expert Systems Applications (DEXA), Aix en Provence, 2002.

[26] Sapia C. PROMISE: Predicting query behavior to enable predictive caching strategies for OLAP systems. In Proceedings of the 2nd International Conference on Data Warehousing and Knowledge Discovery, pages 224–233, London, UK, 2000. Springer-Verlag.

[27] Open GIS Consortium. Web Coverage Processing Service (WCPS). In best practices document No. 06-035r1, pages 21–47, 2006.

[28] The OLAP Council. Efficient storage and management of environmental information. www.olapreport.com, Accessed July 11, 2002.

[29] The OLAP Council. APB-1 OLAP benchmark release II. http://www.olapcouncil.org/research/resrchly.htm, Accessed July 11, 2010.

[30] P. Cudre-Mauroux, H. Kimura, K.-T. Lim, J. Rogers, R. Simakov, E. Soroush, P. Velikhov, D. L. Wang, M. Balazinska, J. Becla, D. DeWitt, B. Heath, D. Maier, S. Madden, J. Patel, M. Stonebraker, and S. Zdonik. A demonstration of SciDB: a science-oriented DBMS. In Proceedings of the Very Large Data Bases Conference Endowment, volume 2, pages 1534–1537. VLDB Endowment, 2009.

[31] Chatziantoniou D. Ad hoc OLAP: Expression and evaluation. In Proceedings of the 15th International Conference on Data Engineering, page 250, Washington, DC, USA, 1999. IEEE Computer Society.

[32] O'Sullivan D. and Unwin D. Geographic Information Analysis. John Wiley, 2003.

[33] Quass D. Maintenance expressions for views with aggregation. In VIEWS, pages 110–118, 1996.

[34] Tveito I. D., Dobesch H., Grueter E., Perdigao A., Tveito O.E., Thornes J. E., Van der Wel F., and Bottai L. The use of geographic information systems in climatology and meteorology. In Final Report COST Action 719, 2006.

[35] Nguyen D.H. Using JavaScript for some interactive operations in virtual geographic model with GeoVRML. In Proceedings of the International Symposium on Geoinformatics for Spatial Infrastructure Development in Earth and Allied Sciences, 2006.

[36] Adiba M. E. and Lindsay B. G. Database snapshots. In Proceedings of the Sixth International Conference on Very Large Data Bases, October 1-3, 1980, Montreal, Quebec, Canada, pages 86–91. IEEE Computer Society, 1980.

[37] Thomsen E. OLAP Solutions: Building Multidimensional Information Systems. John Wiley and Sons, 1997.

[38] Codd E. F., Codd S. B., and Salley C.T. Beyond decision support. In Computer World, volume 27, 1993.

[39] Codd E. F., Codd S. B., and Salley C. T. Providing OLAP (on-line analytical processing) to user-analysts: An IT mandate. In Technical Report, 1993.

[40] Vega-Lopez I. F., Snodgrass R. T., and Moon B. Spatiotemporal aggregate computation: A survey. In IEEE Transactions on Knowledge and Data Engineering, volume 17, pages 271–286, Piscataway, NJ, USA, 2005. IEEE Educational Activities Department.

[41] Colliat G. OLAP, relational, and multidimensional database systems. In SIGMOD Rec., volume 25, pages 64–69, New York, NY, USA, 1996. ACM.

[42] Pestana G., da Silva M. M., and Bedard Y. Spatial OLAP modeling: An overview based on spatial objects changing over time. In IEEE 3rd International Conference on Computational Cybernetics, pages 149–154, April 2005.

[43] Wiederhold G., Jajodia S., and Litwin W. Dealing with granularity of time in temporal databases. In Proceedings of the 3rd international conference on Advanced information systems engineering, pages 124–140, New York, NY, USA, 1991. Springer-Verlag New York, Inc.

[44] García-Molina H., Ullman J. D., and Widom J. Database Systems: The Complete Book. Williams, 2002.

[45] Samet H. Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann Publishers, 2006.

[46] ERDAS IMAGINE. ERDAS Field Guide. 1997.

[47] ESRI Inc. ArcGIS 9 Geo Processing Commands, quick reference guide. ArcGIS, 2004.

[48] ISO. 19123:2005 geographic information - coverage geometry and functions, 2005.

[49] Albrecht J. Universal analytical GIS operations - a task-oriented systematization of data structure-independent GIS functionality. In Geographic Information Research - Transatlantic Perspectives, pages 577–591, 1998.

[50] Boettger J., Preiser M., Balzer M., and Deussen O. Detail-in-context visualization for satellite imagery. volume 27, pages 587–596, 2008.

[51] Burt P. J. and Adelson E. H. The Laplacian pyramid as a compact code. In IEEE Transactions on Communications, number 31, pages 532–540, 1983.

[52] Han J., Stefanovic N., and Koperski K. Selective materialization: An efficient method for spatial data cube construction. In Proceedings of the Second Pacific-Asia Conference on Research and Development in Knowledge Discovery and Data Mining, pages 144–158, London, UK, 1998. Springer-Verlag.

[53] Nievergelt J., Hinterberger H., and Sevcik K. C. The grid file: An adaptable, symmetric multikey file structure. In ACM Transactions on Database Systems, volume 9, pages 38–71, 1984.

[54] Peuquet D. J. Making space for time: Issues in space-time data representation. In Geoinformatica, volume 5, pages 11–32, Hingham, MA, USA, 2001. Kluwer Academic Publishers.

[55] Whang K. J. and Krishnamurthy R. The multilevel grid file - a dynamic hierarchical multidimensional file structure. In DASFAA, pages 449–459, 1991.

[56] Berry J. K. and Tomlin C. D. A Mathematical Structure for Cartographic Modeling in Environmental Analysis. In Proceedings of the American Congress on Surveying and Mapping, pages 269–283, 1979.

[57] Choi K. and Luk W. Processing aggregate queries on spatial OLAP data. In Proceedings of the 10th International Conference on Data Warehousing and Knowledge Discovery, pages 125–134, Berlin, Heidelberg, 2008. Springer-Verlag.

[58] Hornsby K. and Egenhofer M. J. Shifts in detail through temporal zooming. In International Workshop on Database and Expert Systems Applications, volume 0, page 487, Los Alamitos, CA, USA, 1999. IEEE Computer Society.

[59] Hornsby K. and Egenhofer M. J. Identity-based change: A foundation for spatio-temporal knowledge representation. In International Journal of Geographical Information Science, volume 14, pages 207–224, 2000.

[60] Ramachandran K., Shah B., and Raghavan V. V. Dynamic pre-fetching of views based on user-access patterns in an OLAP system. In ICEIS (1), pages 60–67, 2005.

[61] Sellis T. K. Multiple-query optimization. In ACM Trans. Database Syst., volume 13, pages 23–52, New York, NY, USA, 1988. ACM.

[62] Shim K., Sellis T., and Nau D. Improvements on a heuristic algorithm for multiple-query optimization. In Data and Knowledge Engineering, volume 12, pages 197–222, 1994.

[63] Libkin L., Machlin R., and Wong L. A query language for multidimensional arrays: Design, implementation, and optimization techniques. In SIGMOD Rec., volume 25, pages 228–239, New York, NY, USA, 1996. ACM.

[64] Usery E. L., Finn M. P., Scheidt D. J., Ruhl S., Beard T., and Bearden M. Geospatial data resampling and resolution effects on watershed modeling: A case study using the agricultural non-point source pollution model. In Journal of Geographical Systems, volume 6, pages 289–306, 2004.

[65] Yong K. L. and Kim M. H. Optimizing the incremental maintenance of multiple join views. In Proceedings of the 8th ACM International Workshop on Data Warehousing and OLAP, pages 107–113, New York, NY, USA, 2005. ACM.

[66] Benedikt M. and Libkin L. Exact and approximate aggregation in constraint query languages. In Proceedings of the 18th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 102–113, New York, NY, USA, 1999. ACM.

[67] Gertz M., Hart Q., Rueda C., Singhal S., and Zhang J. A data and query model for streaming geospatial image data. In EDBT Workshops, pages 687–699, 2006.

[68] Golfarelli M. and Rizzi S. Data Warehouse Design: Modern Principles and Methodologies. McGraw Hill, 2009.

[69] Gyssens M. and Lakshmanan L. V. A foundation for multi-dimensional databases. pages 106–115, 1997.

[70] Ogden J. M., Adelson E. H., Bergen J. R., and Burt P. J. Pyramid methods in computer graphics, 1985.

[71] Beckmann N., Kriegel H. P., Schneider R., and Seeger B. The r*-tree: an efficient and robust access method for points and rectangles. In SIGMOD Rec., volume 19, pages 322–331, New York, NY, USA, 1990. ACM.

[72] Roussopoulos N. Materialized views and data warehouses. In SIGMOD Record, volume 27, pages 21–26, 1997.

[73] Stefanovic N., Han J., and Koperski K. Object-based selective materialization for efficient implementation of spatial data cubes. In IEEE Transactions on Knowledge and Data Engineering, volume 12, pages 938–958, Piscataway, NJ, USA, 2000. IEEE Educational Activities Department.

[74] Widmann N. and Baumann P. Performance evaluation of multidimensional array storage techniques in databases. In Proceedings of the IDEAS Conference, 1999.

[75] Baumann P. Management of multidimensional discrete data. In The VLDB Journal, volume 3, pages 401–444, Secaucus, NJ, USA, 1994. Springer-Verlag New York, Inc.

[76] Baumann P. A database array algebra for spatio-temporal data and beyond. In Next Generation Information Technologies and Systems, pages 76–93, 1999.

[77] Baumann P. Web-enabled raster GIS services for large image and map databases. In Proceedings of the 12th International Workshop on Database and Expert Systems Applications, page 870, Washington, DC, USA, 2001. IEEE Computer Society.

[78] Baumann P. Web Coverage Processing Service (WCPS) implementation specification. OGC document No. 08-068, 1.0.0 edition, 2008.

[79] Furtado P. and Baumann P. Storage of multidimensional arrays based on arbitrary tiling. In Proceedings of the 15th International Conference on Data Engineering, page 480, Washington, DC, USA, 1999. IEEE Computer Society.

[80] Marathe A. P. and Salem K. A language for manipulating arrays. In Proceedings of the 23rd International Conference on Very Large Data Bases VLDB '97, pages 46–55, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers Inc.

[81] Vassiliadis P. Modeling multidimensional databases, cubes and cube operations. In Proceedings of the 10th International Conference on Scientific and Statistical Database Management, pages 53–62, Washington, DC, USA, 1998. IEEE Computer Society.

[82] Burt P. J. Fast filter transforms for image processing. In Computer Graphics and Image Processing, number 16, pages 16–51, 1981.

[83] Agrawal R., Gupta A., and Sarawagi S. Modeling multidimensional databases. In Proceedings of the 13th International Conference on Data Engineering, pages 232–243, Washington, DC, USA, 1997. IEEE Computer Society.

[84] Pieringer R., Markl V., Ramsak F., and Bayer R. HINTA: A linearization algorithm for physical clustering of complex OLAP hierarchies. In DMDW, page 11, 2001.

[85] Chen S., Liu B., and Rundensteiner E. A. Multiversion-based view maintenance over distributed data sources. In ACM Transactions on Database Systems, volume 29, pages 675–709, New York, NY, USA, 2004. ACM.

[86] Prasher S. and Zhou X. Multiresolution amalgamation: Dynamic spatial data cube generation. In Proceedings of the 15th Australasian Database Conference, pages 103–111, Darlinghurst, Australia, 2004. Australian Computer Society, Inc.

[87] Shekhar S. and Xiong H. Encyclopedia of GIS. Springer, 2008.

[88] SYBASE. Sybase solutions guide. http://www.sybase.cz/uploads/CEEMEA_SybaseIQ_FINAL.pdf, Accessed July 11, 2010.

[89] Griffin T. and Libkin L. Incremental maintenance of views with duplicates. In Proceedings of the SIGMOD Rec., volume 24, pages 328–339, New York, NY, USA, 1995. ACM.

[90] Needham T. Visual Complex Analysis. Oxford University Press, 1998.

[91] Niemi T., Nummenmaa J., and Thanisch P. Normalizing OLAP cubes for controlling sparsity. In Data Knowledge Engineering, volume 46, pages 317–343, Amsterdam, The Netherlands, 2003. Elsevier Science Publishers B. V.

[92] Harinarayan V., Rajaraman A., and Ullman J. D. Implementing data cubes efficiently. In SIGMOD Rec., volume 25, pages 205–216, New York, NY, USA, 1996. ACM.

[93] Schlosser S. W., Schindler J., Papadomanolakis S., Shao M., Ailamaki A., Faloutsos C., and Ganger G. R. On multidimensional data and modern disks. In Proceedings of the 4th USENIX Conference on File and Storage Technologies, pages 225–238. USENIX Association, 2005.

[94] Mingjie X. Experiments on remote sensing image cube and its OLAP. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, volume 7, pages 4398–4401, September 2004.

[95] Halevy A. Y. Answering queries using views: A survey. In The VLDB Journal, volume 10, pages 270–294, Secaucus, NJ, USA, December 2001. Springer-Verlag New York, Inc.

[96] Jiebing Y. and Dewitt D. J. Processing satellite images on tertiary storage: A study of the impact of tile size on performance. In Proceedings of the 5th NASA Goddard Conference on Mass Storage Systems and Technologies, pages 460–476, 1996.

[97] Kotidis Y. and Roussopoulos N. A case for dynamic view management. In ACM Transactions on Database Systems, volume 26, pages 388–423, New York, NY, USA, 2001. ACM.

[98] Lee K. Y., Son J. H., and Kim M. H. Efficient incremental view maintenance in data warehouses. In Proceedings of the 10th International Conference on Information and Knowledge Management, pages 349–356, New York, NY, USA, 2001. ACM.

[99] Qingsong Y. and Aijun A. Using user access patterns for semantic query caching. In DEXA, pages 737–746, 2003.

[100] Zhao Y., Deshpande P. M., and Naughton J. F. An array-based algorithm for simultaneous multidimensional aggregates. In SIGMOD Rec., volume 26, pages 159–170, New York, NY, USA, 1997. ACM.

[101] Zhuge Y., García-Molina H., Hammer J., and Widom J. View maintenance in a warehousing environment. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, pages 316–327, New York, NY, USA, 1995. ACM.

[102] Zhuge Y., García-Molina H., and Wiener J. L. Multiple view consistency for data warehousing. In Proceedings of the 13th International Conference on Data Engineering, pages 289–300, Washington, DC, USA, 1997. IEEE Computer Society.

[103] Zhuge Y., García-Molina H., and Wiener J. L. Consistency algorithms for multi-source warehouse view maintenance. In Distributed Parallel Databases, volume 6, pages 7–40, Hingham, MA, USA, 1998. Kluwer Academic Publishers.