Applying OLAP Pre-Aggregation Techniques to Speed Up Aggregate Query Processing in Array Databases

by

Angélica García Gutiérrez

A thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science

Approved, Thesis Committee:

Prof. Dr. Peter Baumann
Prof. Dr. Vikram Unnithan
Prof. Dr. Inés Fernando Vega López

Date of Defense: November 12, 2010

School of Engineering and Science

In memory of my grandmother, Naty.

Acknowledgments

I would like to express my sincere gratitude to my thesis advisor, Prof. Dr. Peter Baumann, for his excellent guidance throughout the course of this dissertation. With his tremendous passion for science and his great efforts to explain things clearly and simply, he made this research one of the richest experiences of my life. He always suggested new ideas and guided my research through many pitfalls. Furthermore, I learned from him to be kind and cooperative. Thank you for every single meeting, for every single discussion that you always managed to make thought-provoking, for your continued encouragement, and for believing that I could bring this project to success.

I am also grateful to Prof. Dr. Inés Fernando Vega López for his valuable suggestions. He not only provided me with technical advice but also gave me some important hints on scientific writing that I applied in this dissertation. My sincere gratitude also goes to Prof. Dr. Vikram Unnithan. Despite being one of Jacobs University's most popular and busiest professors due to his genuine engagement with student life beyond academics, Prof. Unnithan took interest in this work and provided me with unconditional support.

I would like to thank two promising graduate students, Irina Calciu and Eugen Sorbalo, for their outstanding contributions to some of the experiments presented in Chapter 5 of this thesis. I am especially grateful to my colleagues Michael Owonibi, Salah Al Jubeh, and Yu Jinsongdi for many valuable discussions, and for providing a stimulating and fun environment in which to learn and grow.

I am grateful to the team assistants at the School of Engineering and Science for helping the School run smoothly and for assisting me in many different ways. Sigrid Manss deserves special mention. Thank you for all your kindness and caring.

Also, I would like to thank Connie Garcia, Jim Toersten, Greg White, Irina Prjadeha, and all of my friends who helped me proofread this thesis. Victoria Inness-Brown deserves special mention for applying her expertise as an editor in reviewing each chapter of this thesis.

Thank you to all my great friends who provided support and encouragement in so many ways, for helping me to see the bright side of my problems in difficult times, and for all the emotional support, camaraderie, entertainment, and caring provided. Especially to Salah Al Jubeh, Asma Alazeib, Talina Eslava, Rainer Gruenheid, Yu Jinsongdi, Maria Joy, Ghada Kadamany, Ingrid Lara, Blessing Musunda, Michael Owonibi, Jessica Price, Irina Prjadeha, Joerg Reinekirchen, Yannic Ramaye, Mila Tarabashkina, Ruiju Tong, Derya Toykan, Iyad Tumar, Vanya Uzunova, Tanja Vaitulevich, and Justo Vargas. You all have a place in my heart. Also, to my friend Samantha Hooton, whom I learned to love as a sister shortly after meeting her. Her authenticity, self-confidence, and drive to succeed are a real inspiration. Thank you for your caring, for sharing your wisdom, for taking me to the hospital when I was in pain, and for being there anytime I needed a friend.

My warmest thanks to Father Matthew I. Nwoko for his spiritual guidance, his caring, his advice, and above all, for his unconditional love.

Thank you to my parents and my brother and sisters, who have always been very supportive of my aspirations. Their support has been instrumental in getting me on the path that brought me to this project. Especially, thank you, Mom, for being my example of tenacity and commitment. To you, too, I dedicate this thesis.
To DAAD and CONACYT, the financial support and trust are gratefully acknowledged. To everybody who has been a part of my life, thank you very much. Lastly, I thank the Lord God Almighty for giving me health, ideas, and wisdom to enable me to complete this research project successfully.

Abstract

Large multidimensional arrays of data are common in a variety of scientific applications. In the past, arrays have typically been stored in files and then manipulated by customized programs operating on those files. Nowadays, with science moving toward computational databases, the trend is toward a new class of database, the array database. In the broadest sense, the array database supports various types of multidimensional array data, including remote-sensor data, satellite imagery, and data resulting from scientific simulations.

As with traditional databases for business applications, analytics in array databases often involves the extraction of general characteristics from large repositories. This requires efficient methods for computing queries that involve data summarization, such as aggregate queries. A typical solution is to pre-compute queries, in whole or in part, and to store the results of those queries that are frequently submitted against the database as well as those that can be used to compute the results of similar future queries. This process is known as pre-aggregation. Unfortunately, pre-aggregation support for array databases is currently limited to one specific operation, scaling (zooming), and to two-dimensional datasets (images).

In this respect, database technology for business applications is much more mature. Technologies such as On-Line Analytical Processing (OLAP) provide the means to analyze business data from one or multiple sources, and thus facilitate the decision-making process. In OLAP, information is viewed as data cubes. These cubes are typically stored in relational tables, in multidimensional arrays, or in a hybrid model. In order to enable fast interactive multidimensional data analysis, database systems frequently pre-compute and store the results of aggregate queries. While there are some valuable research results in the realm of OLAP pre-aggregation techniques with varying degrees of power and refinement, not enough work has been done and reported for array databases.

The purpose of this thesis is to investigate the application of OLAP pre-aggregation techniques with the objective of speeding up aggregate operations in array databases. In particular, we consider enhancing aggregate computation in Geographic Information Systems (GIS) and remote-sensing imaging applications. To this end, we describe a set of fundamental operations in GIS based on a sound algebraic framework. This allows us to identify those operations that require data summarization and that therefore may benefit from pre-aggregation. We introduce a conceptual framework and cost model for rewriting basic aggregate queries in terms of pre-aggregated data, and conduct experiments to assess the performance of our algorithms. Results show that query response times can be substantially reduced by strategically selecting the pre-aggregate with the least cost in terms of execution time. We also investigate the problem of selecting a set of queries for pre-aggregation, but failed to find an analytical solution for all possible types of aggregate queries. Nevertheless, we present a framework and algorithms for the selection of scaling operations for pre-aggregation considering 2D, 3D, and 4D datasets. The results of our experiments with 2D datasets outperform the results of image pyramids, the current technique used to speed up scaling operations on 2D datasets.
Furthermore, our experiments on 3D and 4D datasets show that query response times can also be substantially reduced by intelligently selecting a set of scaling operations for pre-aggregation. The work presented in this thesis is the first of its kind for array databases in scientific applications.

Contents

1 Introduction and Problem Statement 9 1.1 Overview of Thesis and Contributions ...... 12 1.2 Publications Related to this Thesis ...... 12

2 Background and Related Work 15 2.1 Array Databases ...... 15 2.1.1 Basic Notion of Arrays ...... 15 2.1.2 2D Data Models ...... 16 2.1.3 Multidimensional Data Models ...... 17 2.1.4 Storage Management ...... 18 2.1.5 2D Pre-Aggregation ...... 19 2.1.6 Pre-Aggregation Beyond 2D ...... 23 2.1.7 Summary ...... 25 2.2 On-Line Analytical Processing (OLAP) ...... 25 2.2.1 OLAP Data model ...... 25 2.2.2 OLAP Operations ...... 26 2.2.3 OLAP Architectures ...... 26 2.2.4 OLAP Pre-Aggregation ...... 30 2.3 Discussion ...... 33

3 Fundamental Geo-Raster Operations 37 3.1 Array Algebra ...... 37 3.1.1 Constructor ...... 38 3.1.2 Condenser ...... 39 3.1.3 Sorter ...... 39 3.2 Geo-Raster Operations ...... 39 3.2.1 Mathematical Operations ...... 39 3.2.2 Aggregation Operations ...... 45 3.2.3 Statistical Aggregate Operations ...... 51 3.2.4 Affine Transformations ...... 55 3.2.5 Terrain Analysis ...... 57 3.2.6 Other Operations ...... 59 3.3 Summary ...... 61

4 Answering Basic Aggregate Queries Using Pre-Aggregated Data 63 4.1 Framework ...... 63 4.1.1 Aggregation ...... 64 4.1.2 Pre-Aggregation ...... 64 4.1.3 Aggregate Query and Pre-Aggregate Equivalence ...... 64 4.2 Cost Model ...... 67 4.2.1 Computing Queries from Raw Data ...... 68 4.2.2 Computing Queries from Independent and Overlapped Pre-Aggregates ...... 68 4.2.3 Computing Queries from Dominant Pre-Aggregates ...... 69 4.3 Implementation ...... 70 4.4 Experimental Results ...... 73 4.5 Summary ...... 74

5 Pre-Aggregation Support Beyond Basic Aggregate Operations 77 5.1 Non-Standard Aggregate Operations ...... 77 5.2 Conceptual Framework ...... 78 5.2.1 Lattice Representation ...... 79 5.2.2 Pre-Aggregation Selection Problem ...... 80 5.3 Pre-Aggregates Selection ...... 82 5.3.1 Complexity Analysis ...... 83 5.4 Answering Scaling Operations Using Pre-Aggregated Data ...... 83 5.5 Experimental Results ...... 85 5.5.1 2D Datasets ...... 86 5.5.2 3D Datasets ...... 91 5.5.3 4D Datasets ...... 98 5.6 Summary ...... 100

6 Conclusion 103 6.1 Future Work ...... 104

List of Figures

2.1 3D Array ...... 16 2.2 Map Algebra Functions ...... 17 2.3 Image Tiling ...... 19 2.4 Image Pyramids ...... 20 2.5 Nearest Neighbor, Bilinear and Cubic Interpolation Methods . . . . . 22 2.6 3D Scaling Operations on Time-Series Imagery Datasets ...... 24 2.7 OLAP Data Cube ...... 26 2.8 Typical OLAP Cube Operations ...... 27 2.9 OLAP Approaches: MOLAP, ROLAP, and HOLAP ...... 27 2.10 MOLAP Storage Scheme ...... 28 2.11 ROLAP Storage Scheme ...... 29 2.12 Typical Query as Expressed in ROLAP and MOLAP Systems . . . . 29 2.13 Star Model of a Spatial Warehouse ...... 32 2.14 Comparison of Roll-Up and Scaling Operations ...... 34

3.1 Reduction of Contrast in the Green Channel of an RGB Image . . . . 40 3.2 Highlighted Infrared Areas of an NRG Image ...... 41 3.3 Cells of Rasters A and B with Equal Values ...... 42 3.4 Re-Classification of the Cell Values of a Raster Image ...... 43 3.5 Computation of a Proximity Operation ...... 44 3.6 Computation of an Overlay Operation ...... 45 3.7 Computation of an Overlay Operation Considering Values Greater than Zero ...... 46 3.8 Calculation of the Total Sum of Cell Values in a Raster ...... 47 3.9 Result of an Average Aggregate Operation ...... 48 3.10 Result of a Maximum Aggregate Operation ...... 48 3.11 Result of a Minimum Aggregate Operation ...... 49 3.12 Computation of the Histogram for a Raster Image ...... 50 3.13 Computation of the Diversity for a Raster Image ...... 50 3.14 Computation of a Majority Operation for a Raster Image ...... 51 3.15 Computation of the Variance for a Raster Image ...... 52 3.16 Computation of the for a Raster Image ...... 52 3.17 Computation of for a Raster Image ...... 54 3.18 Computation of a Top-k Operation for a Raster Image ...... 54 3.19 Computation of a Translation Operation for a Raster Image ...... 56

3.20 Computation of a Scaling Operation for a Raster Image ...... 57 3.21 Slopes Along the X and Y Directions ...... 58 3.22 Flow Directions ...... 59 3.23 Sobel Masks ...... 60 3.24 Computation of an Edge-Detection for a Raster Image ...... 60

4.1 Types of Pre-Aggregates ...... 66 4.2 Selected Queries for Pre-Aggregation (left) and Decomposed Queries (right) ...... 67

5.1 Sample Lattice Diagram for a Workload with Five Scaling Operations 79 5.2 Query Workload with Uniform Distribution ...... 87 5.3 Query Workload with Poisson Distribution ...... 88 5.4 Selected Queries for Pre-Aggregation ...... 89 5.5 Query Workload with Peak Distribution ...... 90 5.6 Selected Queries for Pre-Aggregation ...... 90 5.7 Query Workload with Step Distribution ...... 91 5.8 Selected Queries for Pre-Aggregation ...... 92 5.9 Workload with Uniform Distribution along x, y, and t ...... 93 5.10 Average Query Cost over Storage Space ...... 93 5.11 Selected Pre-Aggregates, c = 36% ...... 94 5.12 Workload with Uniform Distribution Along x, y, and Poisson Distribution in t ...... 95 5.13 Average Query Cost as Space is Varied ...... 95 5.14 Selected Pre-Aggregates, c = 26% ...... 96 5.15 Workload with Poisson Distribution Along x, y, and t ...... 96 5.16 Average Query Cost as Space is Varied ...... 97 5.17 Selected Pre-Aggregates, c = 30% ...... 97 5.18 Workload with Poisson Distribution Along x, y, and Uniform Distribution in t ...... 98 5.19 Average Query Cost as Space is Varied ...... 99 5.20 Selected Pre-Aggregates, c = 21% ...... 99

List of Tables

3.1 UNO and FAO Suitability Classifications ...... 43 3.2 Capability Indexes for Different Capability Classes ...... 43 3.3 Array Algebra Classification of Geo-Raster Operations...... 62

4.1 Cost Parameters ...... 68 4.2 Database and Queries of the Experiment...... 74 4.3 Comparison of Query Evaluation Costs Using Pre-Aggregated Data and Original Data...... 74

5.1 Sample Pre-Aggregates...... 84 5.2 ECHAM T-42 Climate Simulation Dimensions ...... 100 5.3 4D Scaling: Scale Vector Distribution ...... 100 5.4 4D Scaling: Selected Pre-Aggregates ...... 100

Chapter 1

Introduction and Problem Statement

Scientific computing platforms and infrastructures are making new kinds of experiments possible, resulting in the generation of vast volumes of array data. This is happening in many specialized application areas such as meteorology, oceanography, hydrology, astronomy, medical imaging, and exploration systems for oil, natural gas, coal, and diamonds. These datasets range from uniformly spaced points (cells) along a single dimension to multidimensional arrays containing several different types of data. For example, astronomy and earth sciences operate on two- or three-dimensional spatial grids, often using a plethora of spherical coordinate systems. Furthermore, nearly all sciences must deal with data series over time. It is frequently necessary to understand relationships between consecutive elements in time, or to analyze entire sequences of observations, and such datasets may represent spatial, temporal, or spatio-temporal information. For example, if ocean measurements such as temperature, salinity, and oxygen are recorded every hour at spacings of one meter in depth and ten meters in the two horizontal dimensions, the result is a four-dimensional array with three spatial dimensions and one temporal dimension, and three values attached to each cell of the array.

In the past, arrays were typically stored in files and then manipulated by programs that operated on these files. Nowadays, with science moving toward being computational and data-based, the trend is toward a new class of database system which provides support not only for traditional, or coded, data types such as text, integers, etc., but also for richer data types like multidimensional arrays. This new class of databases is referred to as array databases.

Implementing an efficient array database management system (DBMS) can be very challenging. Typically, there are two approaches that can be taken to store array datasets in a DBMS. In the first, the values of each cell are stored in a separate row, along with fields describing the position of the cell in the array. The most obvious drawback of this approach is the need for a large multidimensional index to efficiently find rows in the table. Moreover, the space taken by a multidimensional index is larger than the size of the table itself if all dimensions forming an array are used as the key. In the second approach, a multidimensional array is written to a Binary Large Object (BLOB), which is stored in a field of a table in the database. Applications then fetch

the contents of the BLOB when they wish to operate on the data. The main drawback to this approach is that it either requires the entire array to be passed to the client, or it requires that the client perform a large number of BLOB input/output (I/O) operations to read only the required portions of the array. With databases growing beyond a few tens of terabytes, the analysis of large volumes of array datasets is severely limited by the relatively low I/O performance of most of today's computing platforms. High-performance numerical simulations are also increasingly feeling the I/O bottleneck.

To improve data management and analytics on large repositories of data, aggregation has been put forward as a key process when describing high-level data. An example of data aggregation is the computation and storage of statistical parameters, such as count, average, median, and standard deviation. Aggregate computation has been studied in a variety of settings [4, 21, 66]. In particular, On-Line Analytical Processing (OLAP) technology has emerged to address the problem of efficiently computing complex multidimensional aggregate queries on large data warehouses. Most OLAP systems rely on the process of selecting aggregate combinations, and then pre-computing and storing their results so the database system can make use of them in subsequent requests. Such a process is known as pre-aggregation, which has proved to speed up aggregate queries by several orders of magnitude in business and statistical applications [31, 41].

While considerable work has been done on the problem of efficiently computing aggregate queries in OLAP-based applications, such computations continue to be a data management challenge in scientific applications. A relevant example in which the use of advanced data management and efficient query processing are highly desirable is hyper-spectral remote-sensing imaging, in which an image spectrometer collects hundreds or even thousands of measurements for the same area of the surface of the Earth. The scenes provided by such sensors are often called data cubes to denote the dimensionality of the data. Notably, efficient query processing techniques facilitate the exploration of spatio-temporal data patterns, both interactively and in batch on archived data.

A significant fraction of scientific data is image-based and can be naturally represented in multidimensional arrays. These datasets fit poorly into relational databases, which lack efficient support for the concepts of physical proximity and order. They are typically stored in array-friendly formats such as HDF5, netCDF, or FITS. The extremely high computational requirements introduced by image-based scientific applications make them an excellent case study for our research.

Since array databases and OLAP/data warehousing both deal with large multidimensional datasets and aggregate queries, adapting OLAP pre-aggregation techniques to the management and computation of aggregate queries in array databases may provide a strong potential benefit. This thesis investigates the application of OLAP pre-aggregation techniques in speeding up query processing in array databases. In particular, we focus on enhancing aggregate computation in GIS and remote-sensing imaging applications. However, the results can be generalized to other domains as well.

Relevant and complementary questions to this thesis are:

1. What factors influence the decision of selecting an aggregate query for pre-aggregation?

2. What formalisms are necessary to establish an efficient and scalable pre-aggregation framework for array databases?

3. What types of constraints are typically considered by existing OLAP pre-aggregation algorithms, and how do they affect performance?

The thesis objectives are outlined as follows:

1. To illustrate the necessity for improving aggregate computation in array databases for GIS and remote-sensing imaging applications.

2. To achieve a solid understanding of OLAP pre-aggregation algorithms and architectural issues when manipulating large amounts of data.

3. To formally describe fundamental operations in GIS and remote-sensing imaging applications and identify those that involve data summarization.

4. To design a theoretical pre-aggregation framework for array databases supporting GIS and remote-sensing imaging applications.

5. To design query selection and query rewriting algorithms using existing OLAP/data warehousing pre-aggregation techniques.

6. To implement the algorithms in an array database management system.

7. To conduct a performance study of the developed algorithms.

The methodological approach employed in this thesis is centered on a three-stage design methodology:

• Identification of fundamental operations in GIS and remote-sensing imaging applications. A literature review helped us identify fundamental operations in GIS that require data summarization. The literature included different classification schemes, international standards, and best practices.

• Design and implementation. Existing OLAP pre-aggregation techniques are used as a basis for the construction of a pre-aggregation framework for array databases. Storage space constraints are considered while designing query selection algorithms. The algorithms were developed using the C++ programming language and tested in the RasDaMan multidimensional array database management system.

• Evaluation. Performance of the developed algorithms is measured on 2D, 3D, and 4D datasets. For scaling operations on 2D datasets, we compare our results against those of the traditional image pyramids approach.

1.1 Overview of Thesis and Contributions

This section provides an overview of the following chapters.

Chapter 2 presents a comparative study between array databases and OLAP, and devotes special attention to data structures and operations. It starts with a discussion of existing approaches for data modeling, storage management, and query processing in both array databases and the data warehousing/OLAP environment. Existing pre-aggregation and related techniques are also discussed in both application domains. From this study, one can observe similarities with regard to data structures and operations between the two application domains. This suggests that array databases can benefit from pre-aggregation schemes to accelerate the computation of aggregate queries.

Chapter 3 describes fundamental operations in GIS and remote-sensing imaging applications. The selection of operations is based on a thorough review of existing surveys regarding GIS operations, on international standards, and on feedback from GIS practitioners. To better understand the structural characteristics of common queries in array databases, such operations were formalized using a proven array model. This allowed us to identify a set of operations requiring data summarization (aggregation) and the candidate operations to be supported by pre-aggregation techniques.

Chapter 4 deals with the computation of aggregate queries in array databases using pre-aggregated data. The proposed pre-aggregation framework distinguishes different types of pre-aggregates and shows that such a distinction is useful in finding an optimal solution that reduces the CPU cost required for the computation of aggregate queries. A cost model is used to assess the benefit of using pre-aggregated data for computing aggregate queries. The measurements on real-life raster image datasets show that the computation of aggregate queries is always faster with our algorithms in comparison to traditional methods.

Chapter 5 considers the problem of offering pre-aggregation support to non-standard aggregate operations in GIS and remote-sensing imaging applications. A discussion is presented on the issues found while attempting to provide pre-aggregation support for all non-standard aggregate operations, as well as the motivation for focusing on scaling operations. The framework and cost model presented in Chapter 4 are adapted to support scaling operations. Experiments covering 2D, 3D, and 4D datasets show how our pre-aggregation approach not only generalizes the most common approach for 2D, but also helps reduce computation times for 2D, 3D, and 4D datasets.

Chapter 6 presents a summary of our findings and outlines future lines of research.

1.2 Publications Related to this Thesis

A number of papers have been published that relate to the work described in this thesis. Doctoral workshops provided a platform to discuss the feasibility of the proposed research and an opportunity to receive feedback from experts in computer science [6] and the GIS scientific community [5]. Participation in those workshops led to a refinement of the research objectives outlined in Chapter 1. The study and algebraic modeling of geo-raster operations reported in Chapter 3 are presented in [7, 8].

The pre-aggregation framework described in Chapter 4 is presented in [9]. Finally, findings about the query selection problem addressed in Chapter 5 have been accepted for publication in [10].

Chapter 2

Background and Related Work

This chapter describes existing database technology for two environments: GIS/remote-sensing imaging and data warehousing/OLAP. Our investigation shows that conceptual data models and operations are similar in both application domains. This suggests that array database technology can be substantially enhanced by adopting a pre-aggregation scheme built on existing OLAP technology.

2.1 Array Databases

Multidimensional data analysis has recently taken the spotlight in the context of scientific applications. A fundamental demand from science users is extremely fast response times for multidimensional queries. While most scientific users can use relational tables and have been forced to do so by many commercial DBMS systems, only a few users find tables to be a natural data model that closely matches their data. Furthermore, few users are satisfied with SQL as the interface language [30]. In contrast, it appears that arrays are a natural data model for a significant subset of science users, specifically in astronomy, oceanography, and remote-sensing applications. Moreover, a table with a primary key is merely a 1D array. Hence, an array data model can subsume the needs of users who are satisfied with tables.

Next we review the existing database technology supporting multidimensional arrays in scientific applications: 1D sensor time-series, 2D satellite imagery, 3D image time-series, and 4D atmospheric data.

2.1.1 Basic Notion of Arrays

Several approaches have been proposed toward the formalization of arrays and array query languages. The underlying methods of formalization differ, and the discussion is still open. However, the following notion of arrays is quite common [79]: An array is a set of cells of a fixed data type T, with a fixed cell size. Each cell corresponds to one element in the multidimensional domain of the array. The domain D of an array is a d-dimensional subinterval of a discrete coordinate set S = S1 × ... × Sd, where each Si, i = 1, ..., d, is a finite totally ordered discrete set and d is the dimensionality of the array.

15 16 2. Background and Related Work

The definition domain of an array is expressed as a multidimensional interval by its lower and upper bounds, li and ui respectively, along each dimension i of the domain, denoted as D = [l1 : u1; ...; ld : ud], where li < ui, i = 1, ..., d, and li, ui ∈ Si. Figure 2.1(a) shows the constituents of a sample 3D array.
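For concreteness, the following minimal C++ sketch (purely illustrative; the type and function names are assumptions and not the API of any particular array DBMS) represents such a definition domain and computes the number of cells it contains:

#include <cstddef>
#include <vector>

// Hypothetical representation of a d-dimensional definition domain
// D = [l1:u1; ...; ld:ud]; each interval bounds one dimension.
struct Interval { long lo, hi; };               // lo <= hi
struct Domain   { std::vector<Interval> dims; };

// Number of cells contained in the domain: the product of the extents.
std::size_t cellCount(const Domain& d) {
    std::size_t n = 1;
    for (const Interval& iv : d.dims)
        n *= static_cast<std::size_t>(iv.hi - iv.lo + 1);
    return n;
}

// Example: the 3D domain [0:999; 0:999; 0:11] contains
// 1000 * 1000 * 12 = 12,000,000 cells.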

Figure 2.1. 3D Array

The following subsections provide a brief summary of the main contributions of data modeling and query languages that support array data in GIS and remote-sensing imaging applications.

2.1.2 2D Data Models

A uniform representation and algebraic notation for manipulating image-based data structures known as map algebra was first advanced by Tomlin and Berry [56]. While not the first ones to describe this type of spatial data processing, Tomlin and Berry put forward the methodological basis for the organization of this form of geographical data analysis. Map algebra represents a method of treating individual rasters or array layers as members of algebraic equations. Map algebra functions are grouped into the following categories:

• Local functions create outputs in which output cell values are determined on a cell-by-cell basis without regard for the value of neighboring cells.

• Focal functions create outputs in which the value of the output grid is affected by the value of neighboring cells. Low-pass filters are commonly used to smooth out data (a short code sketch of a local and a focal function follows this list).

• Zonal functions create outputs in which the values of output cells are determined in part by the spatial association between cells in the input grids.

• Global functions compute an output raster where the value for each output cell is potentially a function of all of the input cell values.
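As an illustration of the first two categories, the following hedged C++ sketch (the raster layout and function names are assumptions, not tied to any particular GIS product) applies a local function cell by cell and a focal 3x3 mean filter:

#include <functional>
#include <vector>

// A simple row-major 2D raster of double-valued cells (illustrative only).
struct Raster {
    int rows = 0, cols = 0;
    std::vector<double> cells;                  // size == rows * cols
    double  at(int r, int c) const { return cells[r * cols + c]; }
    double& at(int r, int c)       { return cells[r * cols + c]; }
};

// Local function: each output cell depends only on the corresponding input cell.
Raster localApply(const Raster& in, const std::function<double(double)>& f) {
    Raster out = in;
    for (double& v : out.cells) v = f(v);
    return out;
}

// Focal function: each output cell is computed from its 3x3 neighborhood
// (here a mean filter, i.e., a simple low-pass smoothing).
Raster focalMean3x3(const Raster& in) {
    Raster out = in;
    for (int r = 0; r < in.rows; ++r)
        for (int c = 0; c < in.cols; ++c) {
            double sum = 0.0; int n = 0;
            for (int dr = -1; dr <= 1; ++dr)
                for (int dc = -1; dc <= 1; ++dc) {
                    int rr = r + dr, cc = c + dc;
                    if (rr >= 0 && rr < in.rows && cc >= 0 && cc < in.cols) {
                        sum += in.at(rr, cc);
                        ++n;
                    }
                }
            out.at(r, c) = sum / n;
        }
    return out;
}

A zonal or global function would follow the same pattern, except that each output cell would additionally consult a zone raster or the entire input, respectively.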

Figure 2.2 shows a graphical classification of grid functions according to map algebra.

Figure 2.2. Map Algebra Functions

Map algebra is primarily oriented toward 2D static data. Each layer is associated with a particular moment or period of time, and analytical operations are intended to deal with spatial relationships. In its original form, map algebra was never intended to handle spatial data with a temporal component.

2.1.3 Multidimensional Data Models

AQL

Libkin et al. [63] presented an array data model called AQL that embeds array support into a specific nested relational calculus and treats arrays as functions rather than collection types. The AQL data model combines complex objects such as sets, bags, and lists with multidimensional arrays. To express complex object values, the core calculus on which AQL is based has been extended with concepts such as comprehensions, pattern matching, and block structures that strengthen the expressive power of the language. Still, AQL does not provide a declarative mechanism to define the order in which queries manipulate data.

Array Manipulation Language (AML)

AML is a query language for multidimensional array data [80]. The model is aimed towards applications in image databases, particularly for remote sensing, but it is customizable to support a wide variety of application domains. An interesting characteristic of this language is the use of bit patterns, an array indexing mechanism that allows for a more powerful access structure to arrays. AML's algebra consists of three operators that enable the manipulation of arrays: subsample, merge, and apply. Each operator takes one or more arrays as arguments and produces an array as result. Subsample is a unary operator that eliminates cells from an array by cutting out slices. Merge is a binary operator that combines two arrays defined over the same domain. The apply operator applies a user-defined function to an array, thereby producing a new array. All AML operators take bit patterns as parameters.

Data and Query Model for Stream Geo-Raster Imagery

Gertz et al. [67] introduced a data and query model for managing and querying streams of remote-sensing imagery. The data model considers the spatio-temporal and geo-referenced nature of satellite imagery. Three classes of operators allow the formulation of queries. A stream restriction operator acts as a filter that selects points from a stream that satisfy a given condition on the spatial, temporal, or spatio-temporal component of the image. The stream transform operator maps the point or value associated with a stream to a new point or value set. This class of operators is useful for processing on a point-by-point basis. The third class of operators is called stream compositions, which allows the combination of image data from different spectral bands. To this end, each stream is considered to represent a single spectral band. However, since the primary objective of the authors was to stream geo-raster image data, they put less emphasis on post-processing satellite images. Core operations such as Fourier transforms and edge detection are therefore not supported by their framework.

Array Algebra

Baumann [75] introduced a formal array model called Array Algebra that supports the description and manipulation of multidimensional array data types [76]. The simple algebra consists of three core operators: an array constructor, a general condenser for computing aggregations, and an index sorter. The expressive power of Array Algebra through these operators enables a wide range of signal processing, imaging, and statistical operations. Moreover, the termination of any well-formed query is guaranteed by limiting the expressive power to non-recursive operations. Array Algebra is described in more detail in Chapter 3.

To date, Array Algebra is the most comprehensive and complete approach supporting a variety of applications including sensor, image, and statistical data. Recently, a geo-raster service standard based on Array Algebra concepts has been issued by the Open GeoSpatial Consortium (OGC) [78]. A commercial and open-source implementation of Array Algebra is currently available for the scientific community.

2.1.4 Storage Management

At present, handling large image data stored in a database is usually carried out by adopting a tiling strategy [23]. An image is split into sub-images (tiles), as shown in Fig. 2.3. When a region of interest is requested in a given query operation, only the relevant tiles are accessed. This strategy results in significant I/O bandwidth savings. Tiles form the basic processing units for indexing and compression. Spatial indexing allows for the quick retrieval of the identifier and location of a required tile, while compression improves disk I/O bandwidth efficiency. The choice of tile size is crucial for efficiency: while large tiles return much redundant data in response to a range query, small tiles result in a poor compression ratio; typical tile sizes range from 8 KB (very small) to 512 KB (very large) [23, 96]. A comprehensive approach toward the storage of large amounts of data on tertiary storage media considering tiling techniques in multidimensional database management systems is presented in [23, 24, 25].
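To make the access pattern concrete, here is a minimal C++ sketch (tile size and function names are assumptions, not a particular system's storage interface) that determines which tiles of a regularly tiled 2D image intersect a queried region of interest; only those tiles need to be fetched from disk:

#include <utility>
#include <vector>

// A rectangular region of interest in (non-negative) cell coordinates, inclusive bounds.
struct Region { long x0, y0, x1, y1; };

// Return the (tile row, tile column) indices of all tiles of a regular
// tiling with tileW x tileH cells that intersect the region of interest.
std::vector<std::pair<long, long>>
tilesForRegion(const Region& roi, long tileW, long tileH) {
    std::vector<std::pair<long, long>> hits;
    for (long ty = roi.y0 / tileH; ty <= roi.y1 / tileH; ++ty)
        for (long tx = roi.x0 / tileW; tx <= roi.x1 / tileW; ++tx)
            hits.emplace_back(ty, tx);
    return hits;
}

// Example: with 512 x 512 tiles, a 100 x 100 region starting at cell (600, 600)
// touches only tile (1, 1) rather than the whole image.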

Figure 2.3. Image Tiling

A key factor influencing the effectiveness of a tiling scheme is compression. Raster data compression algorithms are the same as algorithms for compression of other image data. However, remote-sensing images are usually of much higher resolution, are multi-spectral, and have significantly larger volumes than natural images. To effectively compress raster data in GIS environments, emphasis must be placed on the management of schemas to deal with large volumes of remote-sensing imagery, and on the integration of various types of datasets such as vector and multidimensional datasets [3, 87].

Dehmel [3] proposed a comprehensive framework for the compression of multidimensional arrays based on different model layers, including various kinds of predictors and a generic wavelet engine for lossy compression with arbitrary quality levels. In particular, the author introduces concepts such as channel separation to compress values for each channel separately, and predictors that calculate approximate values for some cells and express those cell values relative to the approximate values. Further, the proposed method applies wavelets to transform the channels individually into multi-resolution representations with coarse approximations and various levels of detail information. This led to a wavelet engine architecture consisting of three major components (transformation, quantization, and compression) that helps improve compression rates considerably in array databases.

2.1.5 2D Pre-Aggregation

Aggregate operations on GIS and remote-sensing applications have been shown to be computationally expensive due to the size and complexity of the operations [8]. One such operation is zooming (scaling), which is carried out by interpolating the values of the original dataset to downsample it to a lower resolution. This is particularly necessary in web-based raster applications, where limitations such as bandwidth and other resources prevent the efficient processing of the original raster datasets. For smooth interactive panning, browsers load the image in tiles and in quantities larger than actually displayed. Zooming far out results in large scale factors, meaning that large amounts of data must be moved to deliver minimal results.

Current database technology for GIS and remote-sensing imaging applications employs multi-scale image pyramids to improve the performance of scaling operations on 2D raster images [51, 70, 82]. Image pyramids are a technique that consists of resampling the original dataset and creating a number of copies from it, where each copy is resampled at a coarser resolution (Fig. 2.4). The pyramid consists of a finite number of levels that differ in scale by a fixed step factor, and are much smaller in size than the original dataset but adequate for visualization at a lower scale (zoom ratio). Common practice is to construct pyramids in scale levels of a power of 2, yielding scale factors 2, 4, 8, 16, 32, 64, 128, 256, and 512. When more detailed data are needed, or when it becomes necessary to access the original image, better access speed can be achieved by accessing a smaller piece of the original data, provided the original data are cut into smaller pieces. A restricted area of the image, instead of the entire image, is then accessed.

Figure 2.4. Image Pyramids

Pyramid Construction

The construction of pyramid layers requires resampling of original image cell values. Resampling interpolates cell values or otherwise assigns values to cells of a new raster object. It results in a raster with larger or smaller cells and different dimensions. Resampling changes the scale of an input raster, and is used in conjunction with geometric transformation models that change the internal geometry of a raster. The following are the most popular interpolation methods [34]:

• Nearest neighbor is the resampling technique of choice for discrete (categorical) data since it does not alter the value of the input cells [64]. After the cell's center on the output raster dataset is located on the input raster, the nearest neighbor assignment determines the location of the closest cell center on the input raster and assigns the value of that cell to the cell on the output raster.

• Linear interpolation is used to interpolate along value curves. It assumes that cell values vary in proportion to distance along a value segment: v = a + bx. Linear interpolation may be used to interpolate feature attribute values along a line segment connecting any two point value pairs.

• Bilinear interpolation is used to interpolate cell values at direct positions within a quadrilateral grid. It assumes that feature attribute values vary as a bilinear function of position within the grid cell: v = a + bx + cy + dxy. Given a direct position, p, in a grid cell whose vertices are V, V + V1, V + V2, and V + V1 + V2, where V1 and V2 are offset vectors of the grid, and with cell values at the vertices v1, v2, v3, and v4, respectively, there are unique numbers i and j, with 0 ≤ i ≤ 1 and 0 ≤ j ≤ 1, such that p = V + iV1 + jV2. The cell value at p is: v = (1 − i)(1 − j)v1 + i(1 − j)v2 + j(1 − i)v3 + ijv4 (a worked sketch of this formula follows this list). Since the values for output cells are calculated according to the relative positions and values of input cells, bilinear interpolation is preferred for data where the location from a known point or phenomenon determines the value assigned to a cell (that is, continuous surfaces). Elevation, slope, intensity of noise from an airport, and salinity of groundwater near an estuary are phenomena represented as continuous surfaces and are most appropriately resampled using bilinear interpolation.

• Quadratic interpolation is used to interpolate cell values along curves. It assumes that cell values vary as a quadratic function of distance along a value segment: v = a + bx + cx², where a is the value of a cell at the start of a value segment and v is the value of a cell at distance x along the curve from the start. Three point value pairs are needed to provide control values for calculating the coefficients of the function.

• Cubic interpolation is used to interpolate cell values along curves. It assumes that cell values vary as a cubic function of distance along a value segment: v = a + bx + cx² + dx³, where a is the value of a cell at the start of a value segment and v is the value of a cell at distance x along the curve from the start. Four point value pairs are needed to provide control values for calculating the coefficients of the function. Cubic convolution has a tendency to sharpen the edges of the data more than bilinear interpolation, since more cells are involved in the calculation of the output values.
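As a concrete illustration of the bilinear case above, the following short C++ sketch (the function name and calling convention are assumptions) evaluates v = (1 − i)(1 − j)v1 + i(1 − j)v2 + j(1 − i)v3 + ijv4 for a position inside one source cell:

// Bilinear interpolation inside a single grid cell, following the formula above:
// v1..v4 are the cell values at the vertices V, V+V1, V+V2, and V+V1+V2, and
// (i, j) in [0,1] x [0,1] are the fractional offsets of the position p in the cell.
double bilinear(double v1, double v2, double v3, double v4, double i, double j) {
    return (1 - i) * (1 - j) * v1
         +      i  * (1 - j) * v2
         +      j  * (1 - i) * v3
         +      i  *      j  * v4;
}

// Resampling a whole output raster repeats this per output cell: the output cell
// center is mapped back into input coordinates, the enclosing input cell is found,
// and the value is interpolated from that cell's four vertex values.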

Pyramid Evaluation

During the evaluation of a scaling operation with a target scale factor s, the pyramid level with the largest scale factor s0 with s0 < s is determined. This level is loaded, and then an adjustment is made by scaling the resulting image by a factor of s/s0. If, for example, scaling by s = 11 is required, then pyramid level 3 with scale factor s0 = 8 is chosen, requiring a residual scaling of 11/8 = 1.375 and thereby touching only 1/64 of what would be read without a pyramid.

The computational complexity of a scaling operation depends on the chosen resampling method. For example, nearest-neighbor resampling considers the closest cell center of the input raster and assigns the value of that cell to the corresponding cell on the output raster. Other resampling methods such as bilinear and cubic interpolation consider a subset of cells to calculate each of the cell values in the output raster. Fig. 2.5 shows three common options for interpolating output cell values. Note that the bold outline (center image) indicates the current target cell for which a value is being interpolated.

(a) Portion of original raster (b) Portion of output raster (c) Input cells used by common resampling methods

Figure 2.5. Nearest Neighbor, Bilinear and Cubic Interpolation Methods
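The level-selection rule described above fits in a few lines. The sketch below (a hypothetical helper assuming power-of-two pyramid levels, not the implementation of any particular system) picks the coarsest precomputed level whose scale factor does not exceed the target factor and returns the residual scaling that remains to be applied:

#include <utility>

// Given a target scale factor s and a pyramid whose level k holds the image
// downscaled by 2^k (level 0 is the original), choose the largest precomputed
// factor s0 = 2^k not exceeding s and return (level, residual factor s / s0).
std::pair<int, double> choosePyramidLevel(double s, int numLevels) {
    int level = 0;
    double s0 = 1.0;
    while (level + 1 < numLevels && s0 * 2.0 <= s) {
        s0 *= 2.0;
        ++level;
    }
    return {level, s / s0};
}

// Example: s = 11 yields level 3 (s0 = 8) and a residual scaling of
// 11 / 8 = 1.375, matching the example in the text above.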

A characteristic of the pyramid approach is that it increases the size of a raster dataset by approximately 33 percent. This is because the additional reduced-resolution representations are stored in the system together with the original dataset (the successive levels contribute roughly 1/4 + 1/16 + 1/64 + ... ≈ 1/3 of the original size). This overhead is offset, however, by the improved response times obtained in return. The choice of resampling method for constructing the pyramid is influenced by the data characteristics and the type of analysis performed on the data. For example, visual appearance of remote-sensing imagery is best using nearest-neighbor resampling, whereas scientific interpretation may require cubic interpolation. Rasters representing categorical data, e.g., land-use data, do not allow interpolation since it is important that original data values remain unchanged; hence only nearest-neighbor resampling can be applied [64]. The reason categorical data should not be interpolated is that intermediate values cannot be derived with meaningful results. For example, soil type data cannot be interpolated, since a soil type 14 and a soil type 15 cannot sensibly be averaged to derive a soil type 14.5. Creating pyramids for different resampling methods is not efficient due to the additional resources required for storage and maintenance. Thus, the hard-wired resampling approach poses significant flexibility limitations for users when analytic objectives diverge.

Fast retrieval of raster image datasets has also been investigated in distributed database systems. Kitamoto [14] proposed a caching mechanism that allows two-dimensional satellite imagery to be cached at minimum resolution to provide a coarse view of the images in distributed satellite image databases. The cache management problem is treated as the knapsack problem [14], where the relevance and size of the data are considered to determine whether the data will be cached or not. Additionally, access patterns influence the relevance of the data. The frequency of requests for a given image and its resulting popularity rank are included in the strategy for caching selection. Prediction of user access patterns is not considered, however.

More recently, methods exploiting the capabilities of modern graphics hardware have been applied to the organization and processing of large amounts of satellite imagery. For example, Boettger et al. presented a method based on the concepts of perspective and complex logarithm [90] for visualization and navigation of satellite and aerial imagery [50]. Datasets are decomposed into tiles of different sizes and levels of resolution according to a pre-defined area of interest. The tiles closer to the center of interest have higher resolution, whereas low-resolution tiles are created for parts further away. The resulting tiles are indexed and cached in the memory of the graphics hardware, enabling quick access to the area of interest with the best available resolution. When the center of interest is changed, tiles not yet available in graphics memory are loaded. Based on the assumption that the graphics memory offers more space than needed, the cache contains not only the tiles that conform to the area of interest, but also those that will presumably be needed in the future.

2.1.6 Pre-Aggregation Beyond 2D

Geographic phenomena can be examined at different granularities. This includes different spatial perspectives and temporal views. Earth remote-sensing imagery can be treated as time-series data to study and track changes over time. For example, a user looking at changes in vegetation patterns over a certain region during the past 10 years can see their effect on the regional maps over that time period. Fig. 2.6 shows various instances of scaling operations on a 3D image time-series. Figure 2.6(a) shows the original dataset, which consists of two spatial dimensions (dim 1, dim 2) and one temporal dimension (dim 3). Figure 2.6(b) shows the original dataset scaled down along the two spatial dimensions. Figure 2.6(c) shows a scaling operation along the time dimension of the original dataset. Figure 2.6(d) shows the original dataset scaled down in the spatial and temporal dimensions.

Shifts in temporal detail have been studied in various application domains [18, 22, 43]. At the time of this writing, there is little support for zooming with respect to time in GIS technology: the focus has been on studying such alterations with respect to the geometric (vector) properties of objects [54, 58, 59].

Datasets in environmental observation and climate modeling are often defined over a 4D spatio-temporal space of the form (x, y, z, t), possibly extended with topology relationships. Scaling operations are also critical for these kinds of applications due to the size and dimensionality of the data. Extremely large volumes of data are generated during climate simulations. While only one part might be needed for a specific data analysis, huge data volumes are moved. This is particularly true for time-series data analysis. At the time of this writing, however, 4D scaling operations are not supported for GIS and remote-sensing imaging applications.

(a) 3D dataset (b) 3D dataset (scaled-down along dim1 and dim2 by a factor of 2)

(c) 3D dataset (scaled-down along dim3 by a factor of 4) (d) 3D dataset (scaled-down along all dimensions by a factor of 2)

Figure 2.6. 3D Scaling Operations on Time-Series Imagery Datasets

2.1.7 Summary

Array database theory is gradually entering its consolidation phase. The notion of arrays as functions mapping points of some hypercube-shaped domain to values of some range set is commonly accepted. Two main modeling paradigms are used: calculus and algebra. Multidimensional data models embed arrays into the relational world, either by providing conceptual stubs like Array Algebra, or by adding relational capabilities explicitly such as AQL and RAM. Notably, aggregate query processing plays a critical role given the large volumes of the arrays. Our study shows that pre-aggregation techniques focus only on 2D datasets, and that support is limited to one particular operation: scaling. We distinguish the pyramid approach as the most popular method for speeding up scaling operations on 2D datasets, despite its known limitations such as hard-wired interpolation and lack of support for higher-dimensional datasets. Advances in graphics hardware are enabling quicker and more accurate visualization and navigation capabilities for raster imagery. However, little work has been reported on how array database technology is progressively exploiting these hardware advances. A critical gap with respect to pre-aggregation is the lack of support for aggregate operations other than 2D scaling.

2.2 On-Line Analytical Processing (OLAP)

Data warehousing/OLAP is an application domain where complex multidimensional aggregates on large databases have been studied intensively. Typically, a data warehouse collects business data from one or multiple sources so that the desired financial, marketing, and business analyses can be performed. These kinds of analyses can detect trends and anomalies, make projections, and make business decisions [41]. When such analysis predominantly involves aggregate queries, it is called on-line analytical processing, or OLAP [38, 39]. To understand the mechanism of pre-computation, the following subsections review different approaches to structuring multidimensional data, storage mechanisms, and operations in OLAP.

2.2.1 OLAP Data model

The multidimensional OLAP model begins with the observation that the factors that influence decision-making processes are related to enterprise-specific facts, such as sales, shipments, hospital admissions, surgeries, and so on [68]. Instances of a fact correspond to events that occur. For example, every sale or shipment carried out is an event. Each fact is described by the values of a set of relevant measures providing quantitative descriptions of events; e.g., sales receipts, amounts shipped, hospital admission costs, and surgery times are all measures.

In OLAP, information is viewed conceptually as cubes that consist of descriptive categories (dimensions) and quantitative values (measures) [26, 81, 69, 83]. In the scientific literature, measures are at times called variables, metrics, properties, attributes, or indicators. Figure 2.7 illustrates a 3D OLAP data cube where business events

(facts) are mapped at the intersection of a specific combination of dimensions. Different attributes along each dimension are often organized in hierarchical structures that determine the different levels in which data can be further analyzed [26]. For example, within the time dimension, one may have levels composed of years, months, and days. Similarly, within the geography dimension, one may have levels such as country, region, state/province, or city. Hierarchical structures are used to infer summarization (aggregation), that is, whether an aggregate view (query) defined for some category can be correctly derived from a set of precomputed views defined for other categories.

Figure 2.7. OLAP Data Cube

2.2.2 OLAP Operations

OLAP includes a set of operations for the manipulation of dimensional data organized in multiple levels of abstraction. Basic OLAP operations are roll-up, drill-down, slice, dice, and pivot [44]. A roll-up (aggregation) operation computes higher aggregations from lower aggregations or base facts according to their hierarchies, whereas drill-down (disaggregation) is an analytic technique whereby the user navigates among levels of data ranging from the most summarized/aggregated to the most detailed. Typical OLAP aggregate functions include average, maximum, minimum, count, and sum. Drilling paths may be defined by the hierarchies within dimensions or by other dynamic relationships within or between dimensions. A slice consists of the selection of a smaller data cube or even the reduction of a multidimensional data cube to fewer dimensions by a point restriction in some dimension. The dice operation works similarly to the slice except that it performs a selection on two or more dimensions. Figure 2.8 provides a graphical description of these operations.
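As a hedged illustration of a roll-up along the time hierarchy (the fact layout, names, and SUM aggregate are assumptions, not the data model of any particular OLAP product), the following C++ sketch aggregates daily sales facts to monthly totals per (product, city) combination:

#include <map>
#include <string>
#include <tuple>
#include <vector>

// One fact: a sale measured on a given day for a given product and city.
struct Sale { std::string day, month, product, city; double amount; };

// Roll-up along the time hierarchy (day -> month): aggregate the measure with
// SUM for each (month, product, city) combination.
std::map<std::tuple<std::string, std::string, std::string>, double>
rollUpToMonth(const std::vector<Sale>& facts) {
    std::map<std::tuple<std::string, std::string, std::string>, double> cube;
    for (const Sale& f : facts)
        cube[std::make_tuple(f.month, f.product, f.city)] += f.amount;
    return cube;
}

// Drill-down is the inverse navigation: returning from the monthly totals to the
// underlying daily facts, which requires the finer-grained data to be retained.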

2.2.3 OLAP Architectures

Figure 2.9 shows different approaches for the implementation of OLAP functionalities: Multidimensional OLAP (MOLAP), Relational OLAP (ROLAP), and Hybrid OLAP

Figure 2.8. Typical OLAP Cube Operations

(HOLAP). These approaches offer a common view in the form of data cubes, which are independent of how the data is stored.

Figure 2.9. OLAP Approaches: MOLAP, ROLAP, and HOLAP

MOLAP

MOLAP maintains data in a multidimensional matrix based on a non-relational specialized storage structure [37], see Fig. 2.10(a). While building the storage structure, selected aggregations associated with all possible roll-ups are precomputed and stored [92]. Thus, roll-up and drill-down operations are executed in interactive time. Products such as Oracle Essbase, IBM Cognos Powerplay, and the open-source Palo have adopted this approach.

A MOLAP system is based on an ad-hoc logical model that directly represents multidimensional data and its applicable operations. The underlying multidimensional database physically stores data as arrays, and access to it is positional [68]. Grid-files [53, 55], R*-trees [71], and UB-trees [84] are among the techniques used for that purpose.

The main advantage of this approach is that it contains the pre-computed aggregate values that offer a very compact and efficient way to retrieve answers for specific aggregate queries [68]. One difficulty that MOLAP poses, however, pertains to the sparseness of the data. Sparseness means that many events did not take place, and valuable processing time is taken by adding up zeros [91]. For example, a company may not sell every item every day in every store, so no values appear at the intersection where products are not sold in a particular region at a particular time. On the other hand, MOLAP can be much faster for applications where subsets of the data cube are dense [100]. Another limitation of this approach is that the computation of a cube requires a complex aggregate query across all data in a warehouse. Though it is possible to incrementally update cubes as new data arrives, it is impractical to dynamically create new cubes to answer ad-hoc queries [68].

Figure 2.10. MOLAP Storage Scheme

ROLAP

In ROLAP, the underlying data is stored in a relational database, see Fig. 2.11(a). The relational model, however, does not include the concepts of dimension, measure, and hierarchy. Thus, specific types of schemata must be created so that the multidimensional model can be represented in terms of basic relational elements such as attributes, relations, and integrity constraints [68]. Such representations are done using a star schema data model, although the snowflake schema is also often adopted.

ROLAP implementations can handle large amounts of data and leverage all functionalities of the relational database [72]. Disadvantages are that overall performance is slow and that each ROLAP report represents an SQL query with the limitations of the genre. ROLAP vendors have tried to mitigate this problem by including out-of-the-box complex functions in their product offerings and by providing users the capability of defining their own functions. Another problem with ROLAP implementations results from the performance hit caused by costly join operations between large tables [68]. To overcome this issue, fact tables in data warehouses are usually de-normalized. Substantial performance gains can be achieved through the materialization of derived tables (views) that store aggregate data used for typical OLAP queries.

Figure 2.11. ROLAP Storage Scheme

Figure 2.12 shows the formulation of a typical query in both ROLAP and MOLAP. The query yields sales information for a specific product sold in a particular city by a given vendor. The formulation of the queries is done according to the syntax of Oracle 10g. Note the considerable difference in length between the two query formulations.

(a) Sample ROLAP query (b) Sample MOLAP query

Figure 2.12. Typical Query as Expressed in ROLAP and MOLAP Systems

HOLAP
The intermediate architecture type, HOLAP, mixes the advantages offered by ROLAP and MOLAP. It takes advantage of the standardization level and the ability to manage large amounts of data from ROLAP implementations, and of the query speed typical of MOLAP systems. For summary-type information, HOLAP leverages cube technology, and for drilling down into details it uses the ROLAP model. In a HOLAP architecture, the largest amount of data should be stored in an RDBMS to avoid the problems caused by sparsity, and a multidimensional system should store only the information users most frequently need to access [68]. If that information is not enough to solve a query, the system transparently accesses the data managed by the relational system.

2.2.4 OLAP Pre-Aggregation

OLAP systems require fast interactive multidimensional data analysis of aggregates. To fulfill this requirement, database systems frequently pre-compute aggregate views on some subset of dimensions and their corresponding hierarchies. Virtually all OLAP products resort to some degree of pre-computation of these aggregates, a process known as pre-aggregation. OLAP pre-aggregation techniques have proved to speed up aggregate queries by several orders of magnitude in business applications [31, 41]. A full pre-aggregation of all possible combinations of aggregate queries, however, is not considered feasible because it often exceeds the available storage limit and incurs a high maintenance cost. Therefore, modern OLAP systems adopt a partial pre-aggregation approach where only a set of aggregates is materialized so that it can be re-used for efficiently computing other aggregates. Pre-aggregation techniques consist of three inter-related processes: view selection, query rewriting, and view maintenance. A view is a derived relation defined in terms of base relations. Views can be materialized by storing the tuples of a view in a database, as was first investigated in the 1980s [36]. Like a cache, a materialized view provides fast access to its data. However, a cache may get dirty whenever its underlying base relations are updated. The process of updating a materialized view in response to changes to its base data is called view maintenance [12].

View Selection
Gupta et al. [13] proposed a framework that shows how to use materialized views to help answer aggregate queries. The framework provides a set of query rewriting rules to determine which materialized aggregate views can be employed to answer aggregate queries. An algorithm uses these rules to transform a query tree into an equivalent tree with some or all base relations replaced by materialized views. Thus, a query optimizer can choose the most efficient tree and provide the best query response time. Harinarayan et al. [92] investigated the issue of how to select views for materialization under storage space constraints so that the average query cost is minimal. To meet changing user needs, several dynamic pre-aggregation approaches have been proposed. In principle, views may be either selected on demand or pre-selected using some prediction strategy. For applications where storage space is a constraint, replacement algorithms identify those views that can be replaced with new selections [60]. Kotidis et al. [97] introduced a dynamic view selection approach called Multidimensional Range Queries (MRQ), known as slice queries in OLAP, which uses an on-demand fetching strategy. Within this approach, the level of detail or granularity is a compromise between the materialization of many small, highly specific queries, and the materialization of a few large queries followed by answering incoming queries at each stage using the materialized queries. This approach, however, does not take user access patterns into account before making selections. The first work to consider user access information to evaluate potential queries to be materialized is presented in [26], where the author introduced PROMISE, an approach that predicts the structure and value of the next query based on the current query. Yao et al. [99] proposed a different approach for the materialization of dynamic views. A set of batch queries is rewritten using certain canonical queries so that the total cost of execution can be reduced by using the intermediate results for answering queries appearing later in the batch. This approach requires all queries to be precisely known beforehand, and though the approach might work well in a particular database scenario, it might not be useful in dynamic OLAP, where it is extremely difficult to accurately predict the exact nature of future queries.
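To make the view-selection idea concrete, the following Python sketch greedily picks views under a storage budget by benefit per unit of storage. It is only an illustration with made-up view names, sizes, and savings; the actual algorithm of Harinarayan et al. additionally recomputes each view's benefit relative to the views already selected, which this sketch omits.

# Hypothetical illustration of greedy view selection under a storage budget.
# Candidate views are described by their size and the per-query cost savings
# they would provide; all names and numbers are invented for the example.
def select_views(candidates, budget):
    selected, used = [], 0
    remaining = dict(candidates)          # view name -> (size, savings)
    while remaining:
        # pick the view with the best benefit per unit of storage
        name, (size, savings) = max(remaining.items(),
                                    key=lambda kv: kv[1][1] / kv[1][0])
        if used + size > budget or savings <= 0:
            break                         # simplification: stop instead of trying smaller views
        selected.append(name)
        used += size
        del remaining[name]
    return selected, used

candidates = {
    "sales_by_month_product": (120, 900),   # (size in MB, estimated saving)
    "sales_by_month":         (15, 400),
    "sales_by_region":        (40, 350),
}
print(select_views(candidates, budget=60))  # (['sales_by_month', 'sales_by_region'], 55)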

View Maintenance
In most cases it is wasteful to maintain a view by recomputing it from scratch. Materialized views are therefore maintained using an incremental approach [11]. Only the changes to be propagated to the materialized view are computed, using the changes of the source relations [1, 33, 89]. At present, view maintenance has been investigated from these four dimensions [11]:

• Information Dimension: Focuses on accessing the information required for view maintenance, such as base relations and the materialized view.

• Modification Dimension: Focuses on the kinds of modifications, e.g., insertions and deletions, that a view maintenance algorithm can handle.

• Language Dimension: Addresses the problems related to the language of the views supported by the view maintenance algorithm. That is, what is the language of the views that can be maintained by the view maintenance algorithm? How are views expressed? Does the algorithm allow duplicates?

• Instance Dimension: Considers the applicability of the algorithm to all or a specific set of instances of the database.

View maintenance cost is the sum of the cost of propagating each base relation change to the affected materialized views. The sum can be weighted, where each weight indicates the frequency of propagations of the changes of the associated source relation. When a base relation affects more than one materialized view, multiple maintenance expressions must be evaluated. Multi-query optimization techniques can be used to detect common sub-expressions between the maintenance expressions so that an efficient global evaluation plan for the maintenance expressions can be achieved [61, 62]. Numerous methods have been developed for materialized view maintenance in conventional database systems. Zhuge et al. [101] introduced the Eager Compensating Algorithm (ECA), based on previous incremental view maintenance algorithms and compensating queries used to eliminate anomalies. In [102], the authors define the problem of keeping multiple views mutually consistent as the multiple view consistency problem. Further research from the same authors [102, 103] considers data warehouse views defined on base tables located in different data sources, i.e., if a view involves n base tables, then n data sources are also involved. A common characteristic of the early approaches to view maintenance is the considerable need for accessing base relations, which in most cases results in performance degradation. The improvement of the efficiency of view maintenance techniques has been a topic of active research in the database research community [15, 65, 85, 98].
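As a minimal illustration of the incremental idea (a sketch of delta propagation, not of ECA or of any particular system), the following Python snippet maintains a materialized SUM/COUNT view grouped by one key and applies only the inserted and deleted tuples of the base relation; all names and values are invented for the example.

# Minimal sketch of incremental maintenance for a materialized SUM/COUNT view.
from collections import defaultdict

class SumCountView:
    def __init__(self):
        self.sums = defaultdict(float)
        self.counts = defaultdict(int)

    def apply_delta(self, inserted=(), deleted=()):
        # Propagate only the changes (deltas) of the base relation,
        # instead of recomputing the view from scratch.
        for key, value in inserted:
            self.sums[key] += value
            self.counts[key] += 1
        for key, value in deleted:
            self.sums[key] -= value
            self.counts[key] -= 1

view = SumCountView()
view.apply_delta(inserted=[("p1", 10.0), ("p1", 5.0), ("p2", 7.0)])
view.apply_delta(deleted=[("p1", 10.0)])
print(view.sums["p1"], view.counts["p1"])   # 5.0 1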

Spatial OLAP (SOLAP)
The multidimensional approach used by data warehouses and OLAP does not support array data types or spatial data types such as points, lines, or polygons. Following the development trends of data warehouse and data mining techniques, Stefanovic et al. [52] proposed the construction of a spatial data warehouse to enable on-line data analysis in spatial-information repositories. The authors used a star/snowflake model to build a spatial data cube consisting of both spatial and non-spatial dimensions and measures: the data cube shown in Fig. 2.13 consists of one spatial dimension (region) and three non-spatial dimensions (precipitation, temperature, and time).

Figure 2.13. Star Model of a Spatial Warehouse

Current research in spatial data management focuses on querying spatial data, particularly regarding the improvement of aggregate query performance [57] for spatial-vector data structures. Alas, little attention has been given to spatial-raster data [42, 73, 86]. Support for spatial-raster data typically consists of creating a spatial-raster cube from information in the metadata file (such as size, level, width, height, date of creation, format, and location) [28, 94]. Vega et al. [40] presented a model to analyze and compare existing techniques for the evaluation of aggregate queries on spatial, temporal, and spatio-temporal data. The study shows that existing aggregate computation techniques rely on some form of pre-aggregation and that support is restricted to distributive aggregate functions such as COUNT, SUM, and MAX. Additionally, the authors identify several important needs concerning aggregate computation. First, they discuss the need to develop further and more substantial techniques to support holistic aggregate functions, e.g., MEDIAN and RANK, and to better support selective predicates. The second observation pertains to the lack of support for queries that need to be efficiently evaluated at every granule in time. Existing aggregate computation techniques focus only on spatial objects such as lines, points, and polygons but do not consider aggregate computation on data grid (array) structures.
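The difference between distributive and holistic aggregate functions mentioned above can be demonstrated in a few lines of Python (our illustration, with arbitrary numbers): SUM and MAX over the full dataset can be recombined from per-chunk partial results, whereas the median of per-chunk medians is in general not the global median.

# Illustration of distributive vs. holistic aggregate functions.
import statistics

chunk_a = [1, 2, 100]      # partial results computed per chunk / per tile
chunk_b = [3, 4]
full = chunk_a + chunk_b

# Distributive: SUM and MAX of the whole dataset can be combined from
# the per-chunk results alone.
assert sum(full) == sum([sum(chunk_a), sum(chunk_b)])
assert max(full) == max(max(chunk_a), max(chunk_b))

# Holistic: the median cannot be derived from the per-chunk medians.
partial = [statistics.median(chunk_a), statistics.median(chunk_b)]   # [2, 3.5]
print(statistics.median(partial))   # 2.75
print(statistics.median(full))      # 3  -> combining partial medians is wrong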

2.3 Discussion

Query performance is a major concern underlying the design of databases in both business and remote-sensing imaging applications. While there are some valuable research results in the realm of pre-aggregation techniques to support query processing in business and statistical applications, little has been done in the field of array databases. The question therefore arises: what distinguishes array data from traditional data types such that it cannot be fully supported by relational databases and thus take advantage of advanced technologies such as OLAP? OLAP from its very conception was designed to assist in the decision-making process of business applications, where business perspectives such as products and/or stores represented the dimensions of the data cube. And while the different columns in a data cube are usually called dimensions, they generally cannot be considered as a spatial extent of the entities modeled by the database. Instead, they are regarded as explicit attributes that characterize a particular entity. Some dimensions in a data cube (e.g., CustomerId) are defined over discrete domains which do not have a natural ordering among their values (customer 1000 cannot be considered close to customer 1001). In such cases, any ordering defined for the values in one of these columns is arbitrary [40]. For this reason, existing OLAP solutions and related pre-aggregation techniques cannot be applied to multidimensional arrays, at least not in a straightforward manner. Recently, however, a new trend in OLAP gained considerable popularity due to its capabilities to support Geo-spatial data. Spatial OLAP considers the case in which a data cube may have both spatial and non-spatial dimensions. However, spatial OLAP focuses mainly on spatial-vector data, and so far little support has been provided for spatial-raster data in terms of selective materialization for the optimization of aggregates. Support is limited only to those operations that can be constructed from metadata available for the raster, but not to the improvement of the computation of aggregate operations over the values of raster datasets. At present, pre-aggregation support in array databases is limited. Only one comparatively simple pre-aggregation technique has been used, namely image pyramids. The limitation of this technique to two-dimensional datasets and hard-wired interpolation calls for the development of more flexible and efficient techniques. From our study of data modeling, storage techniques, and operations in OLAP and remote-sensing imaging applications, we have observed the following similarities:

• Array databases and OLAP systems typically employ multidimensional data models to organize their data.

• Both application domains handle large volumes of multidimensional data.

• Operations convey a high degree of similarity, for instance, a roll-up (aggregate) operation in OLAP such as computing the weekly sales per product is very similar to scaling a satellite image by a factor of seven along the X axis. Figure 2.14 illustrates this similarity.

(a) Scaling operation (b) Roll-Up operation

Figure 2.14. Comparison of Roll-Up and Scaling Operations

• Both application domains use pre-aggregation approaches to speed up query processing: OLAP pre-aggregation techniques support a wide range of aggregate operations and speed up query processing by several orders of magnitude (the last reported benchmark showed speed-up factors of up to 100 [29, 88]). Scaling of 2D datasets always uses the same scale factor on each dimension to maintain a coherent view, whereas for datasets of higher dimensionality the scale factors are independent. Scaling resembles a primitive form of pre-aggregation in comparison to existing OLAP pre-aggregation techniques.

• While data in OLAP applications are sparsely populated, remote-sensing imagery usually is densely populated (100%). There are no guidelines stating when an OLAP data cube is considered sparse or dense; however, a data cube containing 30 percent empty cells is usually treated with sparsity-handling techniques in most OLAP systems.

Furthermore, when compared to well-known OLAP pre-aggregation techniques, GIS image pyramids are different in several respects:

• Image pyramids are constrained to 2D imagery. To the best of our knowledge there is no generalization of pyramids to n-D.

• The x and y axes are always zoomed by the same scalar factor s in the 2D zoom vector (s, s). This is exploited by image pyramids in that they only offer pre-aggregates along a scalar range. In this respect, image pyramids actually are 1D pre-aggregates.

• Several interpolation methods are used for resampling during scaling. Some techniques are standardized [48]; they include nearest-neighbor, bi-linear, bi-quadratic, bi-cubic, and barycentric. The two scaling steps incurred for image pyramids (construction of the pyramid level and rest scaling) must be done using the same interpolation technique to achieve valid results (a small construction sketch follows this list). In OLAP, summation during roll-up corresponds to linear interpolation in imaging.

• Scale factors are continuous, as opposed to the discrete hierarchy levels in OLAP. It is, therefore, impossible to materialize all possible pre-aggregates.
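The following Python sketch illustrates the points above: a pyramid is built by repeated 2x2 averaging, a scale query whose factor matches a materialized level is answered directly, and any other factor still needs a residual ("rest") scaling step. The code is an illustration under simplifying assumptions (block averaging instead of standardized interpolation); it is not the image-pyramid implementation of any particular GIS.

import numpy as np

def build_pyramid(image, levels):
    """Build pyramid levels by averaging non-overlapping 2x2 blocks."""
    pyramid = [image]
    for _ in range(levels):
        img = pyramid[-1]
        h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
        img = img[:h, :w]
        coarser = img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
        pyramid.append(coarser)
    return pyramid

def block_scale(image, factor):
    """Downscale by an integer factor via block averaging ('rest scaling')."""
    h, w = image.shape[0] // factor, image.shape[1] // factor
    return image[:h * factor, :w * factor].reshape(h, factor, w, factor).mean(axis=(1, 3))

base = np.arange(64 * 64, dtype=float).reshape(64, 64)
pyr = build_pyramid(base, levels=3)            # materialized scale factors 1, 2, 4, 8

# A query for scale factor 8 is answered directly from the materialized level:
print(np.allclose(pyr[3], block_scale(base, 8)))   # True

# A query for factor 6 would start from level 2 (factor 4) and still needs a
# residual scaling of 6/4 = 1.5, which requires interpolation (not shown here).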

Based on these observations, this thesis aims to systematically carry over results from OLAP to array databases and to provide pre-aggregation support not only for queries using basic aggregate functions, but also for more complex operations such as scaling. As a preliminary and fundamental step, it is necessary to have a clear understanding of the various operations performed on remote sensing imagery and to identify those that involve aggregation computation. The next chapter addresses this issue in more detail.

Chapter 3

Fundamental Geo-Raster Operations in GIS and Remote-sensing Applications

This chapter describes a set of fundamental operations in GIS and remote-sensing imaging applications. For rigorous comparison and classification, these operations are discussed by means of a sound mathematical framework. The aim is to identify those operations requiring data summarization that may benefit from a pre-aggregation approach. To that end, we use Array Algebra as our modeling framework.

3.1 Array Algebra

The rationale behind the selection of Array Algebra as the modeling framework is grounded in the following observations:

• It is oriented towards multidimensional data in a variety of applications includ- ing imaging.

• It provides the means to formulate a wide variety of operations on multidimen- sional arrays.

• There are commercial and open-source implementations of Array Algebra that show the soundness and maturity of the framework.

The expressive power of Array Algebra, the simplicity of its operators, and its successful implementation in both commercial and scientific applications make it suitable for our investigation. Essentially, the algebra consists of three operators: an array constructor, a generalized aggregation, and a multi-dimensional sorter [75, 76]. Array Algebra is minimal in the sense that no subset of its operations exhibits the same expressive power. It is safe in evaluation: every formula can be evaluated in a finite number of steps. It is closed in its application: any resulting expression is either a scalar or an array.


Arrays are represented as functions mapping n-dimensional points from discrete Euclidean space to values. The spatial domain of an array is defined as a finite set of n-dimensional points in Euclidean space forming a hypercube with boundaries parallel to the coordinate system axes. Let X ⊆ Z^d be a spatial domain and F a value set, i.e., a homogeneous algebra. Then, an F-valued d-dimensional array over the spatial domain X (a multi-dimensional array) is defined as:

a : X → F (i.e., a ∈ F^X), a = {(x, a(x)) : x ∈ X, a(x) ∈ F}

The array elements a(x) are referred to as cells. Auxiliary function sdom(a) denotes the spatial domain of some array a.

3.1.1 Constructor

The MARRAY array constructor allows arrays to be defined by indicating a spatial domain and an expression evaluated for each cell position of the array. An iteration variable bound to a spatial domain is available in the cell expression so that the cell value depends on its position. Let X be a spatial domain, F a value set, and v a free identifier. Let e_v be an expression with result type F containing zero or more free occurrences of v as placeholder(s) for an expression with result type X. Then, an array over spatial domain X with base type F is constructed through:

MARRAY_{X,v}(e_v) = {(x, a(x)) : a(x) = e_x, x ∈ X}

A straightforward application of MARRAY is spatio-temporal sub-setting by simply changing its domain. Example: For some 2-D grey-scale image a, its cutout to domain [x0:x1, y0:y1] (assumed to lie inside the array) is given by:

MARRAY_{[x0:x1,y0:y1],p}(a[p])

Similarly, trimming produces a cutout of an array of lower volume but unchanged dimensionality, and section cuts out a hyperplane with reduced dimensionality. We can also change an array's values by changing the e_v expression. In the simplest case this expression takes the cell value and modifies it. The following expression adds the values in the cells of two raster images, regardless of their extent and dimension:

a + b = MARRAY_{X,p}(a[p] + b[p])

If we allow the use of all operations known on the base algebra, i.e., on the pixel type, we immediately obtain a cohort of useful induced operations.
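As an informal illustration of the constructor (a Python sketch on our own toy array model, not the RasDaMan implementation), an array can be represented as a mapping from the integer points of its spatial domain to cell values:

# Toy model of MARRAY: an array is a dict from points of the domain to values.
from itertools import product

def sdom(lo, hi):
    """All integer points of the hyper-rectangle [lo1:hi1, ..., lod:hid]."""
    return list(product(*(range(l, h + 1) for l, h in zip(lo, hi))))

def marray(domain, cell_expr):
    """MARRAY_{X,v}(e_v): evaluate e_v at every point of X."""
    return {x: cell_expr(x) for x in domain}

X = sdom((0, 0), (2, 2))
a = marray(X, lambda p: p[0] + p[1])              # a[p] = x + y
b = marray(X, lambda p: 10)                       # constant array

# Induced cell-wise operation: a + b = MARRAY_{X,p}(a[p] + b[p])
c = marray(X, lambda p: a[p] + b[p])
print(c[(2, 1)])   # 13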

3.1.2 Condenser

The COND array condenser (aggregator) takes the values of an array's cells and combines them through some commutative and associative operation, thereby obtaining a scalar value. For some free identifier v, spatial domain X = {x_1, ..., x_n}, x_i ∈ Z^d, consisting of n points, and e_{a,v} an expression of result type F containing occurrences of an array a and identifier v, the condense of a by o is defined as:

COND_{o,X,v}(e_{a,v}) := o_{x ∈ X} e_{a,x} = e_{a,x_1} o ... o e_{a,x_n}

Example: Let a be the image defined above. The average over all pixel intensities in a is then given by:

COND_{+,sdom(a),p}(a) / (m ∗ n) = ( Σ_{x ∈ [1:m,1:n]} a[x] ) / (m ∗ n)
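A companion sketch of the condenser on the same toy model (repeated here so the snippet is self-contained); it folds a commutative, associative operation over all cells:

from functools import reduce
from itertools import product

X = list(product(range(3), range(3)))             # spatial domain [0:2,0:2]
a = {p: p[0] + p[1] for p in X}                   # a[p] = x + y

def cond(op, domain, cell_expr):
    """COND_{o,X,v}(e_{a,v}): combine e_{a,x} for all x in X with o."""
    return reduce(op, (cell_expr(x) for x in domain))

total = cond(lambda u, v: u + v, X, lambda p: a[p])
average = total / len(X)
maximum = cond(max, X, lambda p: a[p])
print(total, average, maximum)   # 18 2.0 4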

3.1.3 Sorter

The SORT array sorter proceeds along a selected dimension to reorder the corresponding hyperslices. A functional sort rearranges a given array along a specified dimension s without changing its value set or spatial domain. To that end, an order-generating function is provided that associates a sequence position with each (d-1)-dimensional hyperslice. Note that function f_{s,a} has all degrees of freedom to assess any of a's cell values for determining the measure value of the hyperslice at hand: it can be a particular cell value in the current hyperslice, the average of all hyperslice values, or the value of one or more neighboring slices. Note that the sort operator includes the relational group by. The language is recursive in the array expression e_v and hence allows arbitrary nesting of expressions. In the sequel we use the abbreviations introduced above for nested expressions.

3.2 Geo-Raster Operations

This section presents a set of fundamental operations for Geo-raster data. These operations have been selected based on an exhaustive literature review of classification schemes, international standards, and best practices [2, 19, 27, 32, 35, 45, 46, 47, 49]. By examining the Array Algebra operators involved in the computation of the oper- ations, we identify those that require data summarization (aggregation) and therefore may benefit from pre-aggregation. Queries were executed in a raster database management system (RasDaMan), and formulated according to the syntax of an SQL-based query language for multidimen- sional raster databases based on Array Algebra, namely, rasql.

3.2.1 Mathematical Operations

The following groups of mathematical operators are distinguished: arithmetic, trigonometric, boolean, and relational. They operate at cell level and can be applied to a single raster or to multiple rasters of numerical type and identical spatial domain. The basic arithmetic operators include addition (+), subtraction (-), multiplication (*), and division (/). Trigonometric functions perform trigonometric calculations on the values of an input raster: sine (sin), cosine (cos), tangent (tan) or their inverses (arcsin, arccos, arctan). Consider, for example, the following query:

Query 3.2.1. Consider an RGB (red, green, blue) raster image A. Extract the green component from the image, and reduce the contrast by a factor of 2.

With Array Algebra, the query can be computed as follows:

MARRAYsdom(A),i(A.green[i]/2)

Results are shown in Fig. 3.1.

(a) Original RGB image (b) Green component (c) Output raster

Figure 3.1. Reduction of Contrast in the Green Channel of an RGB Image

All or part of a raster image can be manipulated using the rules of Boolean algebra integrated into database query languages such as SQL [2]. Boolean algebra uses logical operators such as and, or, not, and xor to determine if a particular condition is true or false. These operators are often combined with relational operators: equal (=), not equal (≠), less than (<), less than or equal to (≤), greater than (>), and greater than or equal to (≥). Consider, for example, the following queries:

Query 3.2.2. Given a near-infrared green (NRG) raster image A, highlight the cells with sufficient near-infrared values.

This query can be answered by imposing a lower bound on the infrared intensity, and upper bounds on the green and blue intensities. The resulting boolean array is multiplied by the original image A, showing the original cell value where an infrared value prevails and black otherwise.

MARRAYsdom(A),i (A[i] ∗ ((A[i].nir ≥ 130) and (A[i].green ≤ 110) and (A[i].blue ≤ 140)))

Results are shown in Fig. 3.2.

(a) Original NRG raster (b) Output raster

Figure 3.2. Highlighted Infrared Areas of an NRG Image

Query 3.2.3. Compare the cell values of two 8-bit gray raster images A and B. Create a new raster where each cell value takes the value of 255 (white pixel) when the cell values of A and B are identical.

The algebraic formulation is as follows:

MARRAYsdom(A),i((A[i] = B[i]) ∗ 255)

Results are shown in Fig. 3.3.

(a) Grey 8-bit raster A (b) Grey 8-bit raster B (c) Output raster image

Figure 3.3. Cells of Rasters A and B with Equal Values

Reclassification
Reclassification is a generalization technique used to re-assign cell values in classified rasters. For example, consider the query below, where reclassification is based on a land suitability study.

Query 3.2.4. Given an 8-bit gray image A, map each cell value to its corresponding suitability class shown in Table 3.2, and decrease the contrast of the image according to the decrease factor.

The query can be answered as follows:

MARRAYsdom(A),g(((A[g] > 180) ∗ A[g]/2) + (((A[g] ≥ 130)and(A[g] < 180)) ∗ A[g]/3) + (((A[g] ≥ 80)and(A[g] < 130)) ∗ A[g]/4) + ((A[g] < 80) ∗ A[g]/5))

Results are shown in Fig. 3.4.

1 Classification taken from http://www.fao.org/docrep/X5310E/X5310E00.htm

Table 3.1. UNO and FAO Suitability Classifications

Classification   Description
S1               Highly suitable
S2               Moderately suitable
S3               Marginally suitable
NS               Not suitable

Table 3.2. Capability Indexes for Different Capability Classes

Capability index   Class   Suitability class   Decrease factor
>180               I       S1                  2
130-180            II      S2                  3
80-130             III     S3                  4
<80                IV      NS                  5

(a) Original raster (b) Output raster

Figure 3.4. Re-Classification of the Cell Values of a Raster Image

Proximity
The proximity operation creates a new raster where each cell value contains the distance to a specified reference point. As an example consider the following query:

Query 3.2.5. Estimate the proximity of each cell of the raster image shown in Fig. 3.4(a) to the reference cell located at [30,5].

The computation of this query can be formulated as:

MARRAYsdom(A),(g,h)(|g − 30| + |h − 5|)

Results are shown in Fig. 3.5.

Figure 3.5. Computation of a Proximity Operation

Overlay
The overlay operation refers to the process of stacking two or more identical geo-referenced rasters on top of each other so that each position in the covered area can be analyzed in terms of these data. The overlay operation can be solved using arithmetic and relational operators. For example, consider the following query:

Query 3.2.6. Given two 8-bit gray raster images A and B with identical spatial domain, perform an overlay operation. That is, make a cell-wise comparison between the two rasters. Each cell value of the new array must take the maximum cell value between A and B.

The computation of this query can be formulated as:

MARRAYsdom(A),g(((A[g] > B[g]) ∗ A[g]) + ((A[g] ≤ B[g]) ∗ B[g]))

The above formulation works as follows. The left part of the arithmetic addition tests whether the cell value of array A is greater than the cell value of B. The result of this operation is either 0 (condition not satisfied) or 1 (condition satisfied), which in turn is multiplied by the cell value of array A. Thus, the left part of the expression is either 0 or the cell value of array A. Similarly, the right-hand side of the arithmetic addition verifies whether the cell value of array A is less than or equal to the cell value of B. The result is either 0 or 1 depending on whether or not the condition is satisfied. This value is multiplied by the cell value of array B. Note that only one of the two parts of the addition expression will be greater than zero, and that this value corresponds to the higher value between arrays A and B. Results are shown in Fig. 3.6.

(a) 8-bit gray raster A (b) 8-bit gray raster B (c) Output raster

Figure 3.6. Computation of an Overlay Operation
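Outside Array Algebra, the same branchless trick is common in array languages; for instance, a NumPy sketch (ours, not rasql) of the overlay reads:

# Boolean conditions become 0/1 masks that are multiplied with the inputs.
import numpy as np

A = np.array([[10, 200], [50, 60]], dtype=np.int32)
B = np.array([[90, 120], [40, 70]], dtype=np.int32)

overlay = (A > B) * A + (A <= B) * B      # cell-wise maximum of A and B
print(overlay)
print(np.array_equal(overlay, np.maximum(A, B)))   # True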

An overlay operation can also be done considering a different condition to be tested while determining the cell values of the output array. For example: Query 3.2.7. Compute an overlay operation between rasters A and B. That is, com- pare cell-wise the two rasters: if the cell value of B is non-zero, then set this value as the cell value of the corresponding cell in array A. Otherwise, the cell value of A remains unchanged. The query can be answered as follows:

MARRAYsdom(A),g(((B[g] > 0) ∗ B[g]) + ((B[g] ≤ 0) ∗ A[g])) Results are shown in Fig. 3.7.

3.2.2 Aggregation Operations

We now present the modeling of operations consisting of one or more aggregate functions. An aggregate function takes a collection of cells and returns a single value that summarizes the information contained in the set of cells. The SQL standard provides a variety of aggregate functions. SQL-92 includes count, sum, average, min, and max. SQL:1999 adds every, some, and any. OLAP functions were first published as an addendum to the ISO SQL:1999 standard. They have since been completely incorporated into both the SQL:2003 and the recently published SQL:2008 ISO SQL standards. OLAP functions include rank, ntile, cume_dist, percent_rank, row_number, percentile_cont, and percentile_disc.

(a) Grey 8-bit raster A (b) Grey 8-bit raster B (c) Output raster

Figure 3.7. Computation of an Overlay Operation Considering Values Greater than Zero

Add
The add operation sums up the content of the cells and returns the total as a scalar value. It can also be applied to two or more rasters with an identical spatial domain, returning a new raster with the same spatial domain. In this case, the cells of the new raster contain the sum of the inputs computed on a cell-by-cell basis. As an example of the add operation on a single raster consider the following query:

Query 3.2.8. Return the sum of all cell values of the raster shown in Fig. 3.8(a).

add_cells(A) = COND_{+,sdom(A),i}(A[i])

Results are shown in Fig. 3.8.

(a) Original NRG raster (b) Output result

Figure 3.8. Calculation of the Total Sum of Cell Values in a Raster

Count
The count operation returns the number of cells that fulfill a boolean condition applied to a raster. For example, consider the following query:

Query 3.2.9. Return the number of cells of raster A of boolean type, containing true value in the green channel.

count_cells(A) = COND_{+,sdom(A),i}(A[i].green = 1)

Average
The average operation returns a scalar value representing the mean of all values contained in a raster. As an example consider the following query:

Query 3.2.10. Return the average of the cell values in each channel of the NRG image shown in Fig. 3.9(a).

Let sum_cells(A) be a function calculated as shown in Section 3.2.2, and card(sdom(A)) a function returning the cardinality of A. Then, the average of A is calculated as follows:

avg_cells(A) = sum_cells(A) / card(sdom(A))

Results are shown in Fig. 3.9.

Maximum
A maximum operation returns the largest cell value contained in a raster of numerical type. As an example, consider the following query:

(a) Original NRG raster (b) Output result

Figure 3.9. Result of an Average Aggregate Operation

Query 3.2.11. Return the maximum cell value of all cells contained in the NRG raster image shown in Fig. 3.10(a).

max_cells(A) = COND_{max,sdom(A),i}(A[i])

Results are shown in Fig. 3.10.

(a) Original NRG raster (b) Output result

Figure 3.10. Result of a Maximum Aggregate Operation

Minimum
A minimum operation returns the smallest cell value contained in a raster of numerical type. As an example, consider the following query:

Query 3.2.12. Return the smallest element of all cell values in the NRG raster image shown in Fig. 3.11(a).

min_cells(A) = COND_{min,sdom(A),i}(A[i])

Results are shown in Fig. 3.11.

(a) Original NRG raster (b) Output result

Figure 3.11. Result of a Minimum Aggregate Operation

Histogram
A histogram provides information about the number of times a value occurs across a range of possible values. For an 8-bit raster up to 256 different values are possible. As an example consider the following query:

Query 3.2.13. Calculate the histogram for a 2D raster A with 8-bit integer pixel resolution.

The query can be computed as follows:

MARRAY_{sdom(A),g}(count_cells(A = g[0]))    (3.1)

Results are shown in Fig. 3.12.
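A small NumPy sketch (ours, with synthetic data) of the same histogram computation, counting for every possible 8-bit value the cells of A equal to it:

import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(0, 256, size=(100, 100), dtype=np.uint8)

hist = np.array([(A == g).sum() for g in range(256)])
print(hist.sum() == A.size)          # True: every cell is counted exactly once
print(int(hist.argmax()))            # most frequent grey value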

Figure 3.12. Computation of the Histogram for a Raster Image

Diversity
The diversity operation returns the different classifications present in a raster. For example, consider the following query:

Query 3.2.14. Given the classifications in an 8-bit gray raster image, return true (1) for those classes whose total number of cells is greater than 0.

For the computation of this operation we make use of the histogram calculated in Query 3.2.13. Let B be a 1-D array containing the histogram values:

B = MARRAY_{sdom(A),g}(COND_{+,sdom(A),i}(A[i] = g))

Then, C is the array containing true values for those elements of the histogram that are greater than 0:

C = MARRAY_{sdom(B),i}(B[i] > 0)

Results are shown in Fig. 3.13.

Figure 3.13. Computation of the Diversity for a Raster Image

Majority/Minority
In a classified raster, the majority operation finds the class value with the largest number of elements in the raster. Similarly, the minority operation finds the cell value with the fewest elements. As an example, consider the following query:

Query 3.2.15. Return the cell value representing the majority of all cell values contained in the 2D 8-bit gray raster image A shown in Fig. 3.14(a).

To solve this query we use the histogram computed in Query 3.2.13, and then select the cell value representing the majority among the different classes. Let h be a 1-D array containing the histogram values, and h1 a 1-D array of spatial domain [0:255] containing a list of values from 0 to 255. Let h2 be an array containing the sum of h and h1:

h2 = MARRAY_{[0:255],g}(h + h1)

Then, the majority can be computed as follows:

COND_{+,sdom(A),i}((max_cells(h) = (h2[i] − h1[i])) ∗ h1[i])

Results are shown in Fig. 3.14.

(a) Classified raster (b) Majority class

Figure 3.14. Computation of a Majority Operation for a Raster Image
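Once the histogram is available, majority and minority can be read off it directly; the following NumPy sketch (ours, with synthetic class data) avoids the auxiliary arrays h1 and h2 of the algebraic formulation:

import numpy as np

rng = np.random.default_rng(1)
A = rng.integers(0, 4, size=(50, 50), dtype=np.uint8)    # a raster with 4 classes

hist = np.bincount(A.ravel(), minlength=256)
majority_class = int(hist.argmax())                       # class with the most cells
occurring = np.flatnonzero(hist)                          # classes with at least one cell
minority_class = int(occurring[hist[occurring].argmin()]) # occurring class with the fewest cells
print(majority_class, minority_class)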

3.2.3 Statistical Aggregate Operations

We now consider operations that consist of, or include, one or more statistical aggregate functions. The basic statistical aggregate functions include standard deviation, square root, power, median, variance, and top-k. These functions can be applied to a raster, or to a set of rasters retrieved by a logical search. Consider the following examples:

Variance
Let n be the cardinality of the spatial domain of A, n = card(sdom(A)), and avg a variable containing the average of all cell values of A, avg = avg_cells(A). The variance v of A can then be computed as follows:

v(A) = (1/n) ∗ COND_{+,sdom(A),i}((A[i] − avg) ∗ (A[i] − avg))

Results are shown in Fig. 3.15.

Figure 3.15. Computation of the Variance for a Raster Image

Standard Deviation
Query 3.2.16. Estimate the standard deviation of the cell values of the NRG raster image shown in Fig. 3.8(a).

Let n be the cardinality of the spatial domain of A, n = card(sdom(A)), and avg the average of the cell values of A, avg = avg_cells(A). The standard deviation s of A can then be computed as follows:

s(A) = sqrt( (1/n) ∗ COND_{+,sdom(A),i}((A[i] − avg) ∗ (A[i] − avg)) )

Results are shown in Fig. 3.16.

Figure 3.16. Computation of the Standard Deviation for a Raster Image

Median
The median can be calculated by sorting the cell values of raster A in ascending order and choosing the middle value. In case the number of cells is even, the median is the average of the two middle values. In solving this operation, we use the sort operator to perform the ascending sorting of array A. However, for an array of dimensionality higher than 1 it is necessary to flatten the array into a one-dimensional array. For example, the conversion from a two-dimensional raster A[0:m,0:n] into a one-dimensional raster B[0:m*n] can be calculated as follows:

Let d be the cardinality of A, d = card(sdom(A)); let r be the number of rows; and let c be the number of columns. Then, the flattening of A can be calculated as:

B = MARRAY_{[0:255],g}(COND_{+,[0:m,0:n],i}(((g > (m ∗ (i − 1))) and (g ≤ i)) ∗ A[1 : (g − (m ∗ (i − 1))), 1 : i]))

Let S be the raster containing the sorted values of B (the flattening of A), S = SORT^asc_{0,f}(B), and let n be the cardinality of S, n = card(sdom(S)). Assuming integer division and array indexing starting at zero, the median of array A can be obtained as follows: if n is odd, then the median is equal to S[n/2]; else median = (S[(n−1)/2] + S[(n+1)/2]) / 2. Consider the following query:

Query 3.2.17. Obtain the median of the 1-D array A whose cell values are shown in Fig. 3.17(a).

Since the array has an odd number of elements, the computation of the query is as follows:

A[card(A)/2]

Results are shown in Fig. 3.17(b).
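The flatten-sort-pick procedure is easy to state in NumPy (a sketch with made-up values, handling the odd and even cases exactly as described above):

import numpy as np

A = np.array([[7, 1, 5],
              [3, 9, 2]])
s = np.sort(A.ravel())            # flattened and sorted cell values
n = s.size
if n % 2 == 1:
    median = s[n // 2]
else:
    median = (s[n // 2 - 1] + s[n // 2]) / 2
print(median)                     # 4.0, the same value returned by np.median(A)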

Top-k
The Top-k function returns the k cells with the highest values within a raster. For example, consider the following query:

Query 3.2.18. Find the five highest values contained in raster A.

To solve this query we first sort A in ascending order and then select the top five values. Let d=0 indicate a sorting in the 0 dimension, and let f be the sorting function f_{d,A}(p) = A[p]. Then S is a sorted array of raster A (see Fig. 3.18):

S = SORT^asc_{0,f}(A)

Thus, the top five cell values are obtained by:

S[0 : 4]

(a) 1-D array

(b) Median

Figure 3.17. Computation of Median for a Raster Image

(a) Top five values

Figure 3.18. Computation of a Top-k Operation for a Raster Image

3.2.4 Affine Transformations

Geometric transformations permit the elimination of geometric distortions that occur when images are captured. An example is the attempt to match remotely sensed images of the same area taken after one year, when the more recent image was probably not taken from precisely the same position. Another example is the Landsat Level 1B data that are already transformed onto a plane, but that may not be rectified to the user's desired map projection [46]. Applying an affine transformation to a uniformly distorted raster image can correct for a range of perspective distortions by transforming the measurements from the ideal coordinates to those actually used. An affine transformation is an important class of linear 2-D geometric transformations that maps variables, e.g. cell intensity values located at position (x1, y1), in an input raster image into new variables (x2, y2) in an output raster image by applying a linear combination of translation, rotation, scaling and shearing operations. The computation of these operations often requires interpolation techniques. In the remainder of this section we discuss special cases of affine transformations.

Translation
Translation performs a geometric transformation that maps the position of each cell in an input raster image into a new position in an output raster image. Under translation, a cell located at (x1, y1) in the original is shifted to a new position (x2, y2) in the corresponding output raster image by displacing it through a user-specified translation vector (h, k). The cell values remain unchanged and the spatial domain of the output raster image is the same as that of the original input raster. Consider, for example, the following query:

Query 3.2.19. Shift the spatial domain of a raster defined as A[x1 : x2, y1 : y2] by the point [h:k].

The query can be solved by invoking the shift function of Array Algebra:

shift(A[x1 : x2, y1 : y2], [h : k])

Results are shown in Fig. 3.19.

Rotation

Rotation performs a geometric transformation that maps position (x1, y1) of a cell in an input raster image onto a position (x2, y2) in an output raster image by rotating it clockwise or counterclockwise, through a user-specified angle (θ) about origin O. The rotation operation performs a transformation of the form:

x2 = cos(θ) ∗ (x1 − x0) − sin(θ) ∗ (y1 − y0) + x0

y2 = sin(θ) ∗ (x1 − x0) + cos(θ) ∗ (y1 − y0) + y0

(a) Original domain (b) Translated domain

Figure 3.19. Computation of a Translation Operation for a Raster Image

where (x0, y0) are the coordinates of the center of rotation in the input raster image, and θ is the angle of rotation. Existing algorithms for the computation of rotation, unlike those employed by translation, can produce coordinates (x2, y2) that are not integers. A common solution to this problem is the application of interpolation techniques like nearest neighbor, bilinear, or cubic interpolation. For large raster datasets this is an intensive computing problem because every output cell must be computed separately using data from its neighbors. Consequently, the rotation operation is not yet properly supported by Array Algebra.

Scaling
Scaling stretches or compresses the coordinates of a raster (or part of it) according to a scaling factor. This operation can be used to change the visual appearance of an image, to alter the quantity of information stored in a scene representation, or as a low-level preprocessor in a multi-stage image processing chain that operates on features of a particular scale. For the estimation of the cell values in a scaled output raster image, two common approaches exist:

• one pixel value within a local neighborhood is chosen (perhaps randomly) to be representative of its surroundings. This method is computationally simple but may lead to poor results when the sampling neighborhood is too large and diverse.

• the second method interpolates cell values within a neighborhood by taking the average of the local intensity values.

As in the rotation operation, the application of scaling using interpolation techniques to large raster datasets is an intensive computing problem because every output cell must be computed separately using data from its neighbors. Consider the following query performing a scaling operation using bilinear interpolation. That is, the cell value for (x0,y0) in the output raster is calculated by averaging the values of its nearest cells: two in the horizontal plane (x0,x1) and two in the vertical plane (y0,y1). Note that the query is applied to a raster of spatial domain [0:255, 0:255], but as mentioned earlier, raster datasets tend to be extremely large (TB, PB).

Query 3.2.20. Scale the 2D raster shown in Fig. 3.20(a) along the x and y dimensions by a factor of 2.

The query can be solved as follows:

B = MARRAY_{[0:m/2, 0:n/2],(x,y)}(COND_{+,[0:1,0:1],(i,j)}(A[i + x ∗ 2, j + y ∗ 2]/4))

Results are shown in Fig. 3.20.

(a) Original raster (b) Scaled raster

Figure 3.20. Computation of a Scaling Operation for a Raster Image
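For reference, the same factor-2 block-averaging scaling can be expressed as a NumPy sketch (ours, on a small made-up raster), with the inner loop mirroring the COND over the 2x2 neighborhood:

import numpy as np

A = np.arange(16, dtype=float).reshape(4, 4)
m, n = A.shape
B = np.empty((m // 2, n // 2))
for x in range(m // 2):
    for y in range(n // 2):
        # COND_{+,[0:1,0:1],(i,j)}(A[i + 2x, j + 2y] / 4)
        B[x, y] = sum(A[2 * x + i, 2 * y + j] / 4 for i in range(2) for j in range(2))
print(B)
# [[ 2.5  4.5]
#  [10.5 12.5]]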

3.2.5 Terrain Analysis

Raster image data is particularly useful for tasks related to terrain analysis. Some of the most popular operations include slope/aspect, drainage networks, and catchments (or watersheds). The processing of these operations may involve interpolation techniques that lead to expensive computational costs. For simplicity, we model these operations with approaches not using interpolation methods.

Slope/Aspect
Slope is defined by a plane tangent to a topographic surface, as modeled by the Digital Elevation Model (DEM) at a point [2]. Slope is classified as a vector, thus having two components: a quantity (gradient) and a direction (aspect). The slope (gradient) is defined as the maximum rate of change in altitude, and aspect as the compass direction of the maximum rate of change. Several approaches exist for the computation of slope/aspect; we follow the method proposed by [32]:

• Slope in the X direction (difference in height values on either side of P):
  tan Θ_x = (z(r, c + 1) − z(r, c − 1)) / 2g

• Slope in the Y direction:
  tan Θ_y = (z(r + 1, c) − z(r − 1, c)) / 2g

• Gradient at P:
  sqrt(tan² Θ_x + tan² Θ_y)

• Direction or aspect of the gradient:
  tan α = tan Θ_x / tan Θ_y

Results are shown in Fig. 3.21.

Figure 3.21. Slopes Along the X and Y Directions

Note that after the calculation of the slopes for each cell in a raster image, the results may need to be classified to display them clearly on a map [2].

Query 3.2.21. Calculate the slope along the X direction of an 8-bit grey raster A:

MARRAY_{sdom(A),(r,c)}(arctan((A[r, c + 1] − A[r, c − 1]) / 2g))

Local Drain Directions (ldd) The ldd network is useful for computing several properties of a DEM because it ex- plicitly contains information about the connectivity of different cells. Two steps are required to derive a drainage network: the estimation of flow of material over the surface and the removal of pits. For instance (see Fig. 3.22), cell A1 has three neigh- boring cells (A2, B1 and B2) and the lowest of them is B1, thus the flow direction is south (downward). For cell C3, the lowest of its eight neighboring cells is D2, so the flow direction is southwest (to the lower left). This method is one of the most popular algorithms to estimate flow directions and it is commonly known as D8 algorithm [2].

Figure 3.22. Flow Directions

Query 3.2.22. Estimate the flow of material over raster A, where each cell contains the slope along the X direction.

Let A be a raster whose cells contain the slopes along the X direction. The ldd is then calculated as:

MARRAYsdom(A),(i,j)(CONDmin,[−1:1,−1:1],(v,w) (A[i + v, j + w]))

Irrespective of the algorithm used to compute flow directions, the resulting ldd network is extremely useful for computing other properties of a DEM such as stream channels, ridges, and catchments.

3.2.6 Other Operations

Edge Detection
Edge detection produces a new raster containing only the boundary cells of a given raster. The detection of intensity discontinuities in a raster is very useful, e.g. the boundary representation is easy to integrate into a large variety of detection algorithms. The following parameterized function can be used to express filtering operations in Array Algebra:

f(A, M) = MARRAY_{sdom(A),x}(COND_{+,sdom(M),i}(A[x + i] ∗ M(i)))

where sdom(M) is the size of the corresponding filter window, e.g., 3x3. As an example consider the following query:

(a) M1 (b) M2

Figure 3.23. Sobel Masks

Query 3.2.23. Apply edge detection to raster A shown in Fig. 3.24(a) using a 3x3 Sobel filter.

To compute this query, a Sobel filter and its inverse are applied to the original raster A (see Fig. 3.23):

(|f(A, M1)| + |f(A, M2)|) / 9

which in Array Algebra can be computed as follows:

MARRAY_{sdom(A),x}(COND_{+,sdom(M1),i}((abs(A[x + i] ∗ M1(i)) + abs(A[x + i] ∗ M2(i))) / 9))

Results are shown in Fig. 3.24.

(a) Original raster image (b) Output raster image

Figure 3.24. Computation of an Edge-Detection for a Raster Image
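The following NumPy sketch (ours) spells the same Sobel filtering out as an explicit convolution; border cells are simply skipped here, whereas a real system would pad or mirror the image:

import numpy as np

M1 = np.array([[-1, 0, 1],
               [-2, 0, 2],
               [-1, 0, 1]])          # horizontal Sobel mask
M2 = M1.T                            # vertical Sobel mask

def sobel(A):
    out = np.zeros_like(A, dtype=float)
    for x in range(1, A.shape[0] - 1):
        for y in range(1, A.shape[1] - 1):
            window = A[x - 1:x + 2, y - 1:y + 2]
            out[x, y] = (abs((window * M1).sum()) + abs((window * M2).sum())) / 9
    return out

A = np.zeros((6, 6)); A[:, 3:] = 255      # a synthetic raster with a vertical edge
print(sobel(A)[2])                        # strong response around the edge columns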

Slicing
The slicing operation extracts lower-dimensional sections from a raster. Array Algebra accomplishes the slicing operation by indicating the slicing position in the desired dimension. Thus, the operation reduces the dimensionality of the raster by one. For example, consider the following query:

Query 3.2.24. Slice raster A along the second dimension at position 50.

The query is solved by specifying the slicing position as follows:

MARRAYsdom(A),(x,y,z)(A[x, 50, z])

3.3 Summary

By examining the fundamental structure of Geo-raster operations and breaking down their computational steps into a few basic Array Algebra operators, we determine that Geo-raster operations fall into the following classes:

• COND and MARRAY combined operations. Operations whose computation requires both MARRAY and COND operators: add, count, average, maximum, minimum, majority, minority, histogram, diversity, variance, standard deviation, scaling, edge detection, and local drain directions.

• MARRAY exclusive operations. Operations whose computation requires only the MARRAY operator: arithmetic, trigonometric, boolean, logical, overlay, reclassification, proximity, translation, slicing, and slope/aspect.

• SORT operations. Operations whose computation requires the SORT operator: top-k, median.

• AFFINE transformations. Special cases of affine transformations partially or not yet supported by Array Algebra: rotation and scaling.

This classification allows us to identify a set of operations that require data summarization and thus are potential candidates to be treated with pre-aggregation techniques: add, count, average, maximum, minimum, majority, minority, histogram, diversity, variance, standard deviation, scaling, edge detection, and local drain directions. Table 3.3 summarizes the usage of Array Algebra operators for each operation discussed in Section 3.2.

Table 3.3. Array Algebra Classification of Geo-Raster Operations.

Operation                           MARRAY   COND   SORT   AFFINE
1. Count                                     x
2. Add                                       x
3. Average                                   x
4. Maximum                                   x
5. Minimum                                   x
6. Majority                         x        x
7. Minority                         x        x
8. Std. Deviation                            x
9. Median                           x               x
10. Variance                                 x
11. Top-k                                           x
12. Histogram                       x        x
13. Diversity                       x        x
14. Proximity                       x
15. Arithmetic                      x
16. Trigonometric                   x
17. Boolean                         x
18. Logical                         x
19. Overlay                         x
20. Re-classification               x
21. Translation                     x
22. Rotation                                               x
23. Scaling                         x        x             x
24. Slicing                         x
25. Edge Detection                  x        x
26. Slope/Aspect                    x
27. Local drain directions (ldd)    x        x

Chapter 4

Answering Basic Aggregate Queries Using Pre-Aggregated Data

As discussed in previous chapters, aggregation is an important mechanism that allows users to extract general characterizations from very large repositories of data. In this chapter, we study the effect of selecting a set of aggregate queries, computing their results, and using them for subsequent query requests. In particular, we study the effect of pre-aggregation in computing aggregate queries in the field of GIS and remote-sensing imaging applications. We introduce a pre-aggregation framework that distinguishes among different types of pre-aggregates for computing a query. We show that in most cases several pre-aggregates may qualify for answering an aggregate query, and we address the problem of selecting the best pre-aggregate in terms of execution time. To this end, we introduce a model that measures the cost of using qualified pre-aggregates for the computation of a query. We then present an algorithm that selects the best pre-aggregate for computing a query. We measure the performance of our algorithms in an array database management system (RasDaMan), and show that our algorithms give much better performance than straightforward methods.

4.1 Framework

Most major database management systems allow the user to store query results through a process known as view materialization. The query optimizer may then automatically use the materialized data to speed up the evaluation of a new query. Queries that benefit from using materialized data are those that involve the summarization of large amounts of data. They are known as aggregate queries because their query statements include one or more aggregate functions. The ANSI SQL:2008 standard defines a wide variety of aggregate functions including: COUNT, SUM, AVG, MAX, MIN, EVERY, ANY, SOME, VAR_POP, VAR_SAMP, STDDEV_POP, STDDEV_SAMP, ARRAY_AGG, REGR_COUNT, COVAR_POP, COVAR_SAMP, CORR, REGR_R2, REGR_SLOPE, and REGR_INTERCEPT [20].


4.1.1 Aggregation

An aggregate operation contains one or more aggregate functions that map a multiset of cell values in a dataset to a single scalar value. In our framework, queries may contain an arbitrary number of aggregate functions, e.g., COUNT, SUM, AVG, MAX, MIN, and a spatial domain. We formulate our queries using rasql, the declarative interface to the RasDaMan server. We use the Array Algebra notation for spatial domains:

sdom = [l_1 : h_1, . . . , l_d : h_d]    (4.1)

4.1.2 Pre-Aggregation

The term pre-aggregation refers to the process of pre-computing and storing the results of aggregate queries for subsequent use in the same or similar query requests. The decision to use pre-aggregated data during the computation of an aggregate query is influenced by the structural characteristics of the query and the pre-aggregate. By comparing the data structures between the two, one can determine if the pre-aggregated result contributes fully or partially to the final answer of the query, and if it is worth using pre-aggregated data.

4.1.3 Aggregate Query and Pre-Aggregate Equivalence

An aggregate query Q and a pre-aggregate pi are equivalent if and only if all the following conditions are met:

1. The aggregate operation of the query Q is the same as the aggregate operation defined for the pre-aggregate pi.

2. The aggregate operation of the query Q and the pre-aggregate pi must be applied over the same objects.

3. The same logical and boolean conditions, if any, apply to both the query Q and the pre-aggregate pi.

4. For aggregate operations to be applied over a specific spatial domain, the extent of the spatial domain in query Q must be the same as the one in pre-aggregate pi.

When all of the above conditions are satisfied, we say there is a full-matching between the query and the pre-aggregate. In this case, the time it takes to retrieve the pre-aggregated result will be much shorter than the time required to compute the query from raw (original) data. Moreover, the storage overhead required to save the pre-aggregated result is compensated by the faster computation of the query obtained in return. However, cases do occur where only conditions 1, 2, and 3 are satisfied. We refer to this case as a partial-matching between the query and the pre-aggregate. We can use the partial results provided by these pre-aggregates and thus speed up the computation of the query. However, further analysis must be carried out to find those pre-aggregates that provide the maximum speed-up for computing a query. To that end, we define the following types of pre-aggregates: independent, overlapped, and dominant.

1 rasql is an SQL-based query language for multidimensional raster databases based on Array Algebra.

Independent Pre-Aggregates

Definition 4.1 (Independent Pre-Aggregates) – A set of pre-aggregates is called Independent Pre-Aggregates (IPAS) with respect to Q if the spatial domain of each pre-aggregate is contained within the spatial domain of query Q and there is no intersection among the spatial domains of the pre-aggregates. Fig. 4.1(a) shows an example of an independent pre-aggregate.

IPAS := {p1, p2, . . . , pn | pi.sdom ⊆ Q.sdom, pi.sdom ∩ pj.sdom = ∅}    (4.2)


Overlapped Pre-Aggregates

Definition 4.2 (Overlapped Pre-Aggregates) – A set of pre-aggregates is called Overlapped Pre-Aggregates (OPAS) if the spatial domain of each pre-aggregate intersects with the spatial domain of the query Q. Fig. 4.1(b) shows an example of an overlapped pre-aggregate.

OPAS := {p1, p2, . . . , pn | pi.sdom ∩ Q.sdom ≠ ∅}    (4.3)

Dominant Pre-Aggregates

Definition 4.3 (Dominant Pre-Aggregates) – A set of pre-aggregates is called Dominant Pre-Aggregates (DPAS) if the spatial domain of the query Q is contained within the spatial domain of each pre-aggregate. Fig. 4.1(c) shows an example of a dominant pre-aggregate. Note that dominant pre-aggregates can only be used to answer the following types of aggregate queries: ADD, COUNT, and AVG.


DPAS := {p1, p2, . . . , pn | Q.sdom ⊆ pi.sdom}    (4.4)

Moreover, given an ordered DPAS

DPAS = {p1, p2, . . . , pn | Q.sdom ⊆ p1.sdom ⊆ ... ⊆ pn.sdom},    (4.5)

the closest dominant pre-aggregate (p_cd) to Q is given by p1, i.e., p_cd = p1.

(a) Independent pre-aggregate (b) Overlapped pre-aggregate (c) Dominant pre-aggregate

Figure 4.1. Types of Pre-Aggregates

Cases may occur where a pre-aggregate intersects with one or more pre-aggregates of the same or a different type. Intersections are problematic because the greater the number of intersections, the greater the number of cells that may need to be computed from raw data to determine the real contribution of a given pre-aggregate towards the result of the query. The computation process involves several intermediary operations such as decomposing the pre-aggregate into sub-partitions that in turn must be aggregated. Moreover, the same procedure must be performed on the other intersected pre-aggregates should we want to use their results. For example, assume that pre-aggregates p1, p2 and p3 can be used to answer query Q, and that they all intersect with each other. Since the result of each pre-aggregate includes a partial result of the other two pre-aggregates, we must use raw data to compute the intersected area and adjust the result of the pre-aggregate according to the aggregate function specified in the query predicate. To overcome this problem, a query selected for pre-aggregation for which other pre-aggregates exist with different spatial domains but identical structural properties can be decomposed into a set of sub-partitions prior to the pre-aggregation process.

By partitioning the query to be pre-aggregated we can avoid intersection among pre- aggregates, see example shown in Fig. 4.2.

Figure 4.2. Selected Queries for Pre-Aggregation (left) and Decomposed Queries (right)
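The decomposition step itself can be realized as an axis-by-axis subtraction of hyperrectangles. The sketch below is an illustrative assumption (helper names such as subtract_domain are hypothetical, not RasDaMan identifiers); it returns the disjoint sub-partitions of a domain that remain once the part shared with another domain is removed, which is the primitive needed both for the partitioning in Fig. 4.2 and for the pi.sdom − (pi.sdom ∩ Q.sdom) terms of the cost model below.

```python
from typing import List, Optional, Tuple

Domain = List[Tuple[int, int]]  # one closed interval (lo, hi) per dimension

def intersect_domain(a: Domain, b: Domain) -> Optional[Domain]:
    """Intersection of two hyperrectangles, or None if they are disjoint."""
    out = []
    for (alo, ahi), (blo, bhi) in zip(a, b):
        lo, hi = max(alo, blo), min(ahi, bhi)
        if lo > hi:
            return None
        out.append((lo, hi))
    return out

def subtract_domain(a: Domain, b: Domain) -> List[Domain]:
    """Decompose a \\ b into disjoint hyperrectangles (sub-partitions)."""
    inter = intersect_domain(a, b)
    if inter is None:
        return [a]                      # nothing to cut away
    pieces: List[Domain] = []
    rest = list(a)                      # shrinking core that still contains inter
    for d, ((alo, ahi), (ilo, ihi)) in enumerate(zip(a, inter)):
        if alo < ilo:                   # slab below the intersection in dimension d
            pieces.append(rest[:d] + [(alo, ilo - 1)] + rest[d + 1:])
        if ihi < ahi:                   # slab above the intersection in dimension d
            pieces.append(rest[:d] + [(ihi + 1, ahi)] + rest[d + 1:])
        rest[d] = (ilo, ihi)            # clamp this dimension before moving on
    return pieces

# Example: a 2D pre-aggregate partially covering a query region
query = [(0, 99), (0, 99)]
preagg = [(50, 149), (20, 79)]
print(subtract_domain(query, preagg))   # three disjoint sub-partitions of the query
```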

4.2 Cost Model

This section introduces a cost model that allows us to estimate the cost (in terms of execution time) of computing a query using pre-aggregates compared to raw data. In our model, the access cost is driven by the number of required disk I/Os and memory accesses. These parameters are influenced by the number of tiles needed to answer a given query and the number and size of the cells in the datasets. The following assumptions underlie our estimates.

1. We assume that the tiles needed to answer a given query are stored using implicit storage of coordinates, which is the prevalent storage format for raster image data [79]. Implicit storage of coordinate values is a storage technique that leads to a higher degree of clustering of cell values that are close in data space, that is, it preserves the spatial proximity of cell values. Given that state-of-the-art disk drives improve access to multidimensional datasets by allowing the spatial locality of the data to be preserved on the disk itself [93], we assume that it takes the same time to retrieve a tile from disk as to retrieve any other tile needed to answer a given query. Clearly, there are other factors, not considered here, that influence access cost. Among them are the cost of storing intermediate results, and the communication cost of sending the results from the server to the client. More complicated cost models are certainly possible, but we believe the cost model we pick, being both simple and realistic, enables us to design and analyze powerful algorithms.

2. We consider the time taken to access a given cell (pixel) in main memory to be the same as that required to access any other cell. That is, we assume that a tile sits in main memory and is not swapped out.

3. We ignore the time it takes to combine partial aggregate results. Investigations have shown this time to be negligible compared to tile iteration [74].

Table 4.1 lists the parameters involved in the different cost functions presented in the remainder of this section.

Table 4.1. Cost Parameters

Parameter   Description
Ntiles      Number of tiles
Ncells      Number of cells
sdom        Spatial domain
IPAS        Independent pre-aggregates set
OPAS        Overlapped pre-aggregates set
DPAS        Dominant pre-aggregates set
pcd         Closest dominant pre-aggregate
SP          Sub-partitions

4.2.1 Computing Queries from Raw Data

The cost of computing an aggregate query Q (or sub-partitions of pre-aggregates) from raw data, Cr, is given by

Cr(Q) = Cacc(Ntiles(Q)) + Cagg(Ncells(Q))    (4.6)

where Cacc is the cost of retrieving the tiles required to answer Q, and Cagg is the time taken to access and aggregate the cells given by the spatial component of the query.
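A direct transcription of Eq. (4.6) might look as follows; the per-tile and per-cell constants are hypothetical calibration parameters, not values taken from the thesis.

```python
# Hedged sketch of Eq. (4.6): cost of answering Q entirely from raw data.
# The constants are placeholders that would be calibrated per installation.
TILE_IO_COST = 1.0      # cost of fetching one tile from disk (assumed)
CELL_AGG_COST = 0.001   # cost of accessing and aggregating one cell (assumed)

def cost_raw(n_tiles: int, n_cells: int) -> float:
    """C_r(Q) = C_acc(N_tiles(Q)) + C_agg(N_cells(Q))."""
    c_acc = n_tiles * TILE_IO_COST
    c_agg = n_cells * CELL_AGG_COST
    return c_acc + c_agg
```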

4.2.2 Computing Queries from Independent and Overlapped Pre-Aggregates

The cost of answering an aggregate query using independent and overlapped pre-aggregates is given by:

CIOPAS(Q) = CIPAS(Q) + COPAS(Q) + CSP(Q)    (4.7)

where CIPAS and COPAS are the costs of using the results of independent and overlapped pre-aggregates, respectively, and CSP is the cost of decomposing the query Q into a set of sub-partitions and aggregating each from raw data.

Cost of independent pre-aggregates

The cost of retrieving the results of independent pre-aggregates (CIPAS) is given by:

CIPAS(Q, T) = Cfin(Q, T) + Σ_{i=1}^{|IPAS|} Cacc(pi)    (4.8)

where Cfin is the cost of finding the pre-aggregates ∈ IPAS in the pre-aggregated pool T, and Cacc is the accumulated cost of retrieving the results of the pre-aggregates.

Cost of overlapped pre-aggregates

The cost of retrieving the results of overlapped pre-aggregates (COPAS) is given by:

COPAS(Q) = Cfin(Q, T) + Σ_{i=1}^{|OPAS|} Cdec(pi) + Σ_{i=1}^{|S|} Cr(si)    (4.9)

where Cfin is the cost of finding the pre-aggregates ∈ OPAS in the pre-aggregated pool T, Cdec is the cost of decomposing the spatial domain of each pre-aggregate into a set of sub-partitions S such that the spatial domain of the partitioned pre-aggregate corresponds to pi.sdom − (pi.sdom ∩ Q.sdom), and Cr is the cost of aggregating each resulting sub-partition si ∈ S from raw data.

Cost of aggregating sub-partitions of a query

The cost of aggregating all sub-partitions forming a query is given by:

CSP(Q) = Cdec(Q) + Σ_{i=1}^{|SP|} Cr(si)    (4.10)

where Cdec is the cost of decomposing Q into a set SP of sub-partitions, and Cr is the cost of aggregating each resulting sub-partition si ∈ SP from raw data. Note that Cdec is influenced by the cost of accessing the tiles required to aggregate each sub-partition, and the cost of accessing the spatial properties of the pre-aggregates in IPAS and OPAS.
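A sketch of Eqs. (4.7)–(4.10) under the same assumptions; the constants and the dictionary layout of the sub-partitions are illustrative placeholders rather than the actual implementation.

```python
# Hedged sketch of Eqs. (4.7)-(4.10). All constants and helper names are
# illustrative assumptions; cost_raw is repeated here so the sketch is self-contained.
FIND_COST = 0.5        # C_fin: looking up matching pre-aggregates in the pool
FETCH_COST = 0.1       # C_acc: retrieving one stored pre-aggregate result
DECOMP_COST = 0.2      # C_dec: decomposing one domain into sub-partitions

def cost_raw(n_tiles, n_cells, tile_io=1.0, cell_agg=0.001):
    return n_tiles * tile_io + n_cells * cell_agg   # Eq. (4.6)

def cost_ipas(ipas):
    # Eq. (4.8): find the independent pre-aggregates, then fetch each stored result
    return FIND_COST + FETCH_COST * len(ipas)

def cost_opas(opas):
    # Eq. (4.9): for each overlapped pre-aggregate, cut away the part outside Q
    # and re-aggregate those sub-partitions from raw data
    cost = FIND_COST
    for p in opas:
        cost += DECOMP_COST
        cost += sum(cost_raw(s["tiles"], s["cells"]) for s in p["outside_subparts"])
    return cost

def cost_subpartitions(sub_parts):
    # Eq. (4.10): aggregate the uncovered parts of Q itself from raw data
    return DECOMP_COST + sum(cost_raw(s["tiles"], s["cells"]) for s in sub_parts)

def cost_iopas(ipas, opas, sub_parts):
    # Eq. (4.7): total cost of answering Q from independent + overlapped pre-aggregates
    return cost_ipas(ipas) + cost_opas(opas) + cost_subpartitions(sub_parts)
```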

4.2.3 Computing Queries from Dominant Pre-Aggregates

The cost of computing an aggregate query Q using a dominant pre-aggregate is given by:

CDPAS(Q) = CDP(Q, T) + Cagg(pcd)    (4.11)

where CDP is the sum of the cost of finding the pre-aggregates ∈ DPAS in the pre-aggregated pool T and the cost of finding the closest dominant pre-aggregate pcd, and Cagg is the cost of computing the aggregate difference of pcd corresponding to pcd.sdom − Q.sdom.

Cost of aggregating sub-partitions of the closest dominant pre-aggregate

The cost Cagg can be calculated as follows:

Cagg(pcd) = Cdec(pcd) + Σ_{i=1}^{|SP|} Cr(si)    (4.12)

where Cdec is the cost of decomposing pcd into a set SP of sub-partitions, and Cr is the cost of aggregating each resulting sub-partition si ∈ SP from raw data.
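Analogously, Eqs. (4.11) and (4.12) could be transcribed as follows; again a hedged sketch with placeholder constants, not the system's code.

```python
# Hedged sketch of Eqs. (4.11) and (4.12). Constants and the sub-partition
# dictionaries are placeholders; 'pcd_outside' describes p_cd.sdom - Q.sdom
# decomposed into sub-partitions that must be re-aggregated from raw data.
FIND_COST = 0.5      # C_DP: locating DPAS members and picking p_cd (assumed)
DECOMP_COST = 0.2    # C_dec: decomposing p_cd into sub-partitions (assumed)

def cost_raw(n_tiles, n_cells, tile_io=1.0, cell_agg=0.001):
    return n_tiles * tile_io + n_cells * cell_agg          # Eq. (4.6)

def cost_agg_pcd(pcd_outside):
    # Eq. (4.12): aggregate the part of p_cd that lies outside Q so that it can
    # be removed from p_cd's stored result
    return DECOMP_COST + sum(cost_raw(s["tiles"], s["cells"]) for s in pcd_outside)

def cost_dpas(pcd_outside):
    # Eq. (4.11): C_DPAS(Q) = C_DP(Q, T) + C_agg(p_cd)
    return FIND_COST + cost_agg_pcd(pcd_outside)
```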

4.3 Implementation

This section describes the application of a query optimization technique that transforms an input query written in terms of arrays so that it can be executed faster using pre-aggregated data. The query processing module of an array database management system (RasDaMan) has been extended with our pre-aggregation framework for query rewriting, which has been implemented as part of the optimization and evaluation phases. As discussed earlier in this chapter, there are two problems related to the computation of an aggregate query using pre-aggregated data. First, we must find all pre-aggregates that can be used to compute an aggregate query, including those that provide partial answers. Next, from all candidate pre-aggregates, we must find the one that minimizes the execution time (or cost) for computing the query. Our solution is based on an existing approach for answering queries using views in OLAP applications. Halevy et al. [95] showed that all possible rewritings of a query can be obtained by considering containment mappings from the bodies of the views to the body of the query. They also showed that this characterization is an NP-complete problem. The QUERYCOMPUTATION procedure returns the result of a query or an execution plan for a given query Q. An execution plan is an indicator of the kind of data that must be used to compute the query. It returns a raw indicator if the query must be computed from the original data. Other valid indicators include IPAS, OPAS, and DPAS, which indicate that the query will be answered using one or more partial pre-aggregates. The input of the algorithm is a query tree Qt of an aggregate query. The algorithm first verifies whether the conditions for a PERFECT-MATCHING between the query and the pre-aggregated queries are satisfied. If a perfect matching is found, it returns the result of the pre-aggregated query. Otherwise, the algorithm verifies whether the conditions for a PARTIALMATCHING between the query and the set of pre-aggregated queries are satisfied. The algorithm then makes use of our cost model to determine the cost of using pre-aggregates that satisfy partial-matching conditions for the computation of the query, and the cost of computing the query using the original data. Finally, the algorithm picks the plan with the least cost in terms of execution time. The algorithm makes use of the following auxiliary procedures:

• DECOMPOSEQUERY(Qt) examines the nodes of the query tree Qt and generates a standardized representation Sqt that can be manipulated via SQL statements.

Algorithm 1 QUERYCOMPUTATION
Require: A query tree Qt, a set of k pre-aggregate queries P
1: initialize R = 0, key = false
2: Sqt = decomposeQuery(Qt)
3: key = perfectMatching(Sqt, P)
4: if key then
5:   R = fetchResult(key)
6:   return R
7: end if
8: if !key then
9:   plan = partialMatching(Sqt, P)
10:  return plan
11: end if

• PERFECTMATCHING(Sqt) compares a standardized representation of the query tree Sqt against the k existing pre-aggregates. The output is the corresponding key of the matched pre-aggregated query. A null value is returned if no perfect match is found.

• FETCHRESULT(key) retrieves the result R of the pre-aggregated query identified by key.

The algorithm PARTIALMATCHING identifies an aggregate sub-expression in a query tree Qt, and finds pre-aggregated queries satisfying conditions 1, 2 and 3, but not condition 4, as defined in Section 4.1.2. It considers the use of pre-aggregates that partially contribute to the answer of a query sub-expression and that are either independent, overlapped, or dominant. The algorithm calculates the cost of using each pre-aggregate for computing the query, and returns an indicator of the plan providing the least cost. The aggregateOp() procedure compares a node n of a given query tree Qt against a list of pre-defined aggregate operations, e.g., add_cells, count_cells, avg_cells, max_cells, and min_cells. If the node matches any such operation, it returns a true value. The getSubtree() procedure receives as parameters a query tree Qt and a pointer to an aggregate node. If the aggregate node has children, it creates a subtree Q′ where the root node corresponds to the aggregate node. The findPreaggregate() procedure receives as parameters an aggregate operation op, an object identifier ro, and a spatial domain sd. It then determines if the values of these parameters match those of any existing pre-aggregate. If a match is found, the result of the matched pre-aggregate is returned. The findIpasPreaggregates() procedure receives as a parameter a subtree Q′ and verifies if any pre-aggregates satisfy conditions 1, 2 and 3 as defined in Section 4.1.2 for equivalence between a query and a pre-aggregate.

Algorithm 2 PARTIALMATCHING
Require: A standardized query tree Qt with m nodes
1: initialize IPAS, OPAS, DPAS = {}
2: initialize plan = "raw", key = false
3: for each node n of Qt do
4:   if aggregateOp(node[n]) then
5:     Q′ = getSubtree(Qt, node[n])
6:     op = getOperation(Q′)
7:     ro = getRasterObject(Q′)
8:     sd = getSpatialDomain(Q′)
9:     key = findPreaggregate(op, ro, sd)
10:    if key then
11:      R = fetchResult(key)
12:      return R
13:    end if
14:    if !key then
15:      IPAS = findIpasPreaggregates(op, ro, sd)
16:      OPAS = findOpasPreaggregates(op, ro, sd)
17:      DPAS = findDpasPreaggregates(op, ro, sd)
18:    end if
19:    plan = selectPlan(Q′, IPAS, OPAS, DPAS)
20:  end if
21: end for
22: return plan

For those pre-aggregates that qualify, it identifies those whose spatial domains are contained in the spatial domain of the query. The output is a set of independent pre-aggregates. The findOpasPreaggregates() procedure receives as a parameter a subtree Q′ and verifies if any pre-aggregates satisfy conditions 1, 2 and 3 as defined in Section 4.1.2. For those pre-aggregates that qualify, it identifies those whose spatial domains intersect with the spatial domain of the query. The output is a set of overlapped pre-aggregates. The findDpasPreaggregates() procedure receives as a parameter a subtree Q′ and verifies if any pre-aggregates satisfy conditions 1, 2 and 3 as defined in Section 4.1.2. For those pre-aggregates that qualify, it identifies those whose spatial domains dominate the spatial domain of the query. The output is a set of dominant pre-aggregates. The selectPlan() procedure receives as parameters a sub-query tree Q′, a set of independent pre-aggregates IPAS, a set of overlapped pre-aggregates OPAS, and a set of dominant pre-aggregates DPAS. It then calculates the cost of answering the query using the different types of pre-aggregates and raw data. The output of this procedure is an indicator of the best plan for executing the query.
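One possible realization of selectPlan() is sketched below, assuming the cost functions of Section 4.2 have already been evaluated for the sub-query; the plan indicators and the function name are illustrative.

```python
def select_plan(cost_raw_q, cost_iopas_q, cost_dpas_q):
    """Pick the cheapest execution plan for a sub-query Q'.

    cost_raw_q   -- cost of evaluating Q' entirely from raw data (Eq. 4.6)
    cost_iopas_q -- cost using independent/overlapped pre-aggregates (Eq. 4.7),
                    or None if IPAS and OPAS are empty
    cost_dpas_q  -- cost using the closest dominant pre-aggregate (Eq. 4.11),
                    or None if DPAS is empty
    """
    candidates = {"raw": cost_raw_q}
    if cost_iopas_q is not None:
        candidates["iopas"] = cost_iopas_q
    if cost_dpas_q is not None:
        candidates["dpas"] = cost_dpas_q
    # the indicator with the least estimated execution time wins
    return min(candidates, key=candidates.get)

# e.g. select_plan(12.4, 3.7, None) -> "iopas"
```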

Query Evaluation

The query optimizer module provides the optimized query tree, along with the plan suggested for the computation of the query, to the final phase, evaluation. Typically, the evaluation phase identifies the tiles affected by an aggregate query and executes the aggregate operation on each tile. Finally, it combines the results to generate the answer to the query. With the extension of pre-aggregation in the optimizer, the traditional process differs in that the selected plan is considered before proceeding to execution. If the plan corresponds to raw, then the computation of the query is done entirely from raw data. Otherwise, the aggregate operation is executed only on those sub-expressions for which there are no pre-aggregated results.

4.4 Experimental Results

This section presents the performance results of our algorithms on real-life raster image datasets. We ran our experiments on an Intel Pentium 4 (3.00 GHz) PC running SuSE Linux 9.1. The workstation had a total physical memory of 512 MB. The datasets were stored in RasDaMan, an array database management system (our research vehicle). Table 4.2 lists the test queries used in our experiments. We ran each query 200 times against the database to obtain average query response times. The queries are formulated in rasql syntax, the declarative query interface to the RasDaMan server. We performed a cold test where the queries were run sequentially; the cache buffer was cleaned after the completion of each query. The dataset consists of a collection of 2D raster images, each associated with an object identifier (oid). Each image shows a portion of the Black Sea, is 260 MB in size, and consists of 100 indexed tiles. We artificially created a set of pre-aggregates for the experiment. They are stored in a pre-aggregation pool containing a total of 5000 pre-aggregates, requiring a total storage space of 50 MB. Computing the test queries involves the execution of two fundamental operations in GIS and remote-sensing imaging: subsetting and aggregation. The values of the spatial domain of the queries were chosen such that we could measure the impact of using pre-aggregation for the following cases:

• The computation of queries Q1, Q2 and Q3 can be done by combining the results of partial pre-aggregates and the remaining parts from original data.

• The computation of queries Q4, Q5 and Q6 can be done by using the results of full pre-aggregates. That is, the full answer to these queries has been pre- computed and stored in the database.

• The computation of queries Q7, Q8 and Q9 can be done by combining the results of two or more pre-aggregates. There is no need to use original data to compute these queries.

Table 4.2. Database and Queries of the Experiment.

Qid  Description
Q1   select add_cells(y[6000:10000, 29000:32000]) from blacksea as y where oid(y) = 49153
Q2   select add_cells(y[7000:10000, 29000:31000]) from blacksea as y where oid(y) = 49154
Q3   select add_cells(y[6700:10000, 28000:30000]) from blacksea as y where oid(y) = 49155
Q4   select add_cells(y[7680:8191, 29000:31000]) from blacksea as y where oid(y) = 49153
Q5   select add_cells(y[8704:9215, 29000:31000]) from blacksea as y where oid(y) = 49154
Q6   select add_cells(y[9728:10000, 29000:31000]) from blacksea as y where oid(y) = 49155
Q7   select add_cells(y[7680:8191, 29696:30207]) from blacksea as y where oid(y) = 49153
Q8   select add_cells(y[8704:9215, 30720:31000]) from blacksea as y where oid(y) = 49154
Q9   select add_cells(y[9216:9727, 30208:30719]) from blacksea as y where oid(y) = 49155

Table 4.3 compares the CPU cost required for the computation of the queries using pre-aggregated data and raw data. The CPU cost was obtained using the time library of C++. The column #aff. tiles shows the number of tiles that need to be read to compute the given query. Column #preagg. tiles represents the number of pre-aggregates that can be used to compute the query. Column t pre shows the total CPU cost of computing the query using pre-aggregated data. Column t ex shows the time taken to execute the query entirely from raw data. Column ratio gives t pre as a percentage of t ex; it is always below 100%, i.e., CPU time is always lower when the computation uses pre-aggregated data.

Table 4.3. Comparison of Query Evaluation Costs Using Pre-Aggregated Data and Original Data.

Q id  #aff. tiles  #preagg. tiles  t pre  t ex  ratio
Q1    63           24              15.6   17.8  87%
Q2    35           24              6.9    9.3   74%
Q3    35           8               9.4    10    94%
Q4    5            5               1.02   1.55  65%
Q5    5            5               1.1    1.63  67%
Q6    5            5               0.74   1.01  73%
Q7    2            1               0.04   0.41  9%
Q8    2            1               0.04   0.45  8%
Q9    2            1               0.04   0.41  9%

4.5 Summary

In this chapter we presented a framework for computing aggregate queries in array databases using pre-aggregated data. We distinguished among different types of pre-aggregates: independent, overlapped, and dominant. We showed that such a distinction is useful for finding a set of pre-aggregated queries that can reduce the CPU cost of query computation. We proposed a cost model to calculate the cost of using different pre-aggregates and select the best option for evaluating a query using pre-aggregated data. The measurements on real-life raster images showed that the computation of the queries is always faster with our algorithms compared to straightforward methods. We focused on queries using basic aggregate functions covering a large number of operations in GIS and remote-sensing imaging applications. The challenge remains, however, in supporting more complex aggregate operations, e.g., scaling, which is discussed in the following chapter.

Chapter 5

Pre-Aggregation Support Beyond Basic Aggregate Operations

In this chapter we investigate the problem of offering pre-aggregation support to non-standard aggregate operations such as scaling and edge detection. We discuss issues found while attempting to provide a pre-aggregation framework for all non-standard aggregate operations. We then justify our reasons for focusing on scaling operations. We adapt the framework and cost model presented in Chapter 4 to support scaling operations. Finally, we discuss the efficiency of our algorithms based on a performance analysis covering 2D, 3D and 4D datasets. We indicate how our approach generalizes and outperforms the well-known 2D image pyramids widely used in Web mapping.

5.1 Non-Standard Aggregate Operations

As shown in Chapter 2, aggregate operations are not limited to queries using basic aggregate functions. In the GIS domain, operations such as scaling, edge detection, and those related to terrain analysis also require data summarization and may therefore benefit from pre-aggregation. See Table 3.3 for a complete list of operations requiring summarization. Finding a general pre-aggregation approach for computing those kinds of operations, however, introduces additional complications when compared to finding pre-aggregates for basic aggregate functions. Basic aggregate functions each consolidate the values of a group of cells and return a scalar value. The value may represent the total sum, the number of cells, the maximum or minimum cell value, or the average value of the affected cells. Affected cells are determined by the spatial domain defined in the predicate of the query. In contrast, the computation of a scaling operation may require consolidating the values of a group of cells to calculate each cell value in the output raster. The affected cells are determined by both the resampling method and the scale vector, as described in Chapter 3. A similar situation occurs with edge detection: the affected cells are determined by the size and values of the applied Sobel filter. For simplicity, we refer to those kinds of operations as non-standard aggregate operations. There is an important concern that must now be taken into account.

From Chapter 3, we see that the result returned by a group of affected cells for a given non-standard aggregate operation such as scaling is not likely to be useful in computing another non-standard aggregate operation such as edge detection. This is because non-standard operations differ significantly with respect to the way their affected cells are determined. Nevertheless, this result may be useful in computing the same type of non-standard operation under certain conditions. For example, the result of scaling by a factor of 8 could be used to compute scaling by a factor of 10 (assuming that both operations utilize the same resampling method). This result, however, is not likely to be useful in edge detection for the same object. We therefore simplify the problem of offering pre-aggregation support to non-standard aggregations by treating each type of non-standard operation separately. This simplification is similar to those found in data warehousing techniques, where pre-aggregation algorithms cover a specific type of query. For instance, pre-aggregation algorithms exist for queries that include a group-by clause in their predicates, while other algorithms are used for queries without join conditions. We now focus on pre-aggregation support for one non-standard aggregate operation, scaling, for the following reasons:

• One of the most frequent operations in GIS and remote-sensing imaging applications is downscaling of some dataset or part thereof, such as obtaining a 1 GB overview of a 10 TB dataset.

• Scaling is a very expensive operation as it normally requires a full scan of the dataset, plus costly main memory operations. Therefore, query optimization is critical to this class of retrieval operations.

• Scaling is the only operation that has already been supported by pre-aggregation, at least for 2D datasets. This provides a point of reference to compare the effectiveness of our algorithms against existing techniques.

Although the framework discussed in the following sections is centered around scaling operations, it can be adapted to support other non-standard aggregate operations by modifying the matching conditions, as discussed later in this chapter.

5.2 Conceptual Framework

A common optimization technique that speeds up scaling operations is to materialize selected downscaled versions of an object, e.g., using image pyramids. When evaluating a scaling operation with target scale factor s, the pyramid level with the largest scale factor s′ is determined, where s′ < s. This relationship between scaling operations places them within a lattice framework similar to that used for data cubes in data warehouse/OLAP applications [92]. Our conceptual framework and greedy algorithm for the selection of pre-aggregates are based on the work of Harinarayan et al. presented in [92]. The use of this approach was motivated by the similarities between our datasets (multidimensional arrays) and OLAP data cubes. Furthermore,

the lattice framework and the greedy algorithm have proven successful in a variety of business applications.

Figure 5.1. Sample Lattice Diagram for a Workload with Five Scaling Operations

5.2.1 Lattice Representation

A scaling lattice consists of a set of queries L and a dependence relation ⪯, denoted by ⟨L, ⪯⟩. The ⪯ operator imposes a partial ordering on the queries of the lattice. Consider two queries q1 and q2. We say q1 ⪯ q2 if q1 can be answered using only the results of q2. The base node of the lattice is the scaling operation with the smallest scale vector, upon which every query is dependent. Lattices are commonly represented in a diagram in which the elements are nodes, and there is a path downward from q1 to q2 if and only if q1 ⪯ q2. The selection of pre-aggregates, that is, of queries for materialization, is equivalent to selecting vertices from the underlying nodes of the lattice. Fig. 5.1 shows a lattice diagram for a workload containing five queries. Each node has an associated label that represents a scaling operation for a given dataset, scale vector, and resampling method. In our framework, we use the following function to define scaling operations:

scale(objName[lo1:hi1, ..., lon:hin], s⃗, resMeth)    (5.1)

where

• objName[lo1:hi1, ..., lon:hin] is the name of the multidimensional raster image to be scaled. The operation can be restricted to a specific area of the raster object. In that case, the area is specified by defining lower (loi) and upper (hii) bounds for each dimension i. If the spatial domain is omitted, the operation is performed on the full spatial extent defining the raster image.

• s⃗ is a vector where each element is a numeric value representing the scale factor applied along the corresponding dimension of the raster image.

• resMeth specifies the resampling method to be applied to the original raster object.

For example, scale(CalFires, [2, 2, 2], nn) defines a scaling operation by a factor of two in each dimension, using nearest-neighbor resampling, on a 3D dataset identified as CalFires.
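A minimal sketch of how scaling operations and the dependence relation ⪯ of Section 5.2.1 could be represented; the ScaleOp class and the assumption that larger factors mean coarser (more strongly downscaled) results are illustrative, not the system's actual data structures.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class ScaleOp:
    """scale(objName[...], s, resMeth) as in Eq. (5.1) -- illustrative only."""
    obj_name: str
    scale_vector: Tuple[float, ...]
    resampling: str            # e.g. "nn" (nearest neighbour) or "bi" (bilinear)

def depends_on(q1: ScaleOp, q2: ScaleOp) -> bool:
    """q1 <= q2 in the lattice: q1 can be answered from the result of q2."""
    return (q1.obj_name == q2.obj_name
            and q1.resampling == q2.resampling
            and len(q1.scale_vector) == len(q2.scale_vector)
            and all(a >= b for a, b in zip(q1.scale_vector, q2.scale_vector)))

# The 3D example from the text: scaling CalFires by 2 in every dimension
base = ScaleOp("CalFires", (2.0, 2.0, 2.0), "nn")
coarser = ScaleOp("CalFires", (8.0, 8.0, 8.0), "nn")
print(depends_on(coarser, base))   # True: the factor-8 overview can reuse the factor-2 one
```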

5.2.2 Pre-Aggregation Selection Problem

Definition 5.4 (Pre-Aggregates Selection Problem) – Given a query workload Q and a storage space constraint C, the pre-aggregates selection problem is to select a set P ⊆ Q of queries such that P minimizes the overall cost of computing Q while the storage space required by P does not exceed the limit given by C. □

Considering existing view selection strategies in data warehousing/OLAP, the following selection criteria are suggested for pre-aggregates:

• Frequency. Pre-aggregates yield particularly significant increases in processing speed when scaling operations are executed with high frequency within a workload.

• Storage space. The storage space constraint must be at least the storage required by the query in the workload with the smallest scale vector. This guarantees that for any query in the workload at least one pre-aggregate can be used for its computation.

• Benefit. A scaling operation may be used to compute the same and other dependent queries in the workload. A metric is therefore used to calculate the cost savings gained by using a candidate scaling operation. To evaluate the cost, we use the model presented in Section 4.2. We call this the benefit of a pre-aggregate set and normalize the benefit against the base object’s storage volume.

Frequency

The frequency of query q, denoted by F(q), is the relative number of occurrences of the query in a workload:

F(q) = N(q) / |Q|    (5.2)

where N(q) is a function that returns the number of occurrences of query q in workload Q.

Storage Space

The storage space of a given query, denoted by S(q), represents the storage space required to save the result of query q; it is determined by the number of cells composing the output object defined by query q.
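Assuming the scale factors are downscale factors applied per dimension, S(q) can be estimated directly from the spatial extent and the scale vector, as in the following sketch (the ceiling division and the helper name are assumptions).

```python
import math
from typing import Sequence, Tuple

def output_cells(domain: Sequence[Tuple[int, int]], scale_vector: Sequence[float]) -> int:
    """S(q): number of cells in the result of scaling 'domain' by 'scale_vector'
    (scale factors interpreted as downscale factors, one per dimension)."""
    cells = 1
    for (lo, hi), s in zip(domain, scale_vector):
        extent = hi - lo + 1
        cells *= max(1, math.ceil(extent / s))
    return cells

# e.g. dataset R1 ([0:15359, 0:10239]) scaled by factor 2 in both dimensions
print(output_cells([(0, 15359), (0, 10239)], (2, 2)))   # 7680 * 5120 = 39,321,600
```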

Benefit

The benefit of a candidate scaling operation q is computed by adding the savings in query cost for each scaling operation in the workload dependent on q, including all queries identical to q. That is, query q may contribute to saving processing costs for the same or similar queries in the workload. In both cases, specific matching conditions must be satisfied.

Full-Match Conditions. Let q be a candidate query for pre-aggregation and p a query in workload Q. Let p and q both be scaling operations as defined in Eq. 5.1. There is a full-match between q and p if and only if:

• the value of parameter objName[] in the scale function defined for q is the same as in p

• the value of parameter s⃗ in the scale function defined for q is the same as in p

• the value of parameter resMeth in the scale function defined for q is the same as in p

Partial-Match Conditions. Let q be a candidate query for pre-aggregation and p be a query in the workload Q. There is a partial-match between p and q if and only if:

• the value of parameter objName[] in the scale function defined for q is the same as in p

• the value of parameter resMeth in the scale function defined for q is the same as in p

• the parameter s⃗ for both q and p is of the same dimensionality

• the vector values defined in s⃗ for p are higher than those defined for q

Definition 5.5 (Benefit) – Let T ⊆ Q be a subset of scaling operations that can be fully or partially computed using query q. The benefit of query q per unit space, denoted by B(q), is the sum of the computational cost savings gained by selecting query q for pre-aggregation. □

B(q) = (F(q) · C(q) + Σ_{t∈T} (F(t) · Cr(t, q))) / size(q)    (5.3)

where F(q) represents the frequency of query q in the workload, C(q) is the cost of computing query q on the original dataset, Cr(t, q) is the relative cost of computing query t from q, and size(q) is a function that returns the number of cells composing the spatial domain component of query q.

5.3 Pre-Aggregates Selection

Pre-aggregating all distinct scaling operations in the workload is not always possible because of space limitations. This is similar to the problem of selecting views for materialization in OLAP. One approach to finding the optimal set of scaling operations to pre-compute consists of enumerating all possible combinations and finding the one that yields the minimum average query cost, or the maximum benefit. Finding the optimal set of pre-aggregates in this way has a complexity of O(2^n), where n is the number of queries in the workload. If the number of scaling operations on a given raster object is 50, there are 2^50 possible combinations of pre-aggregates for that object. Therefore, computing the optimal set of pre-aggregates exhaustively is not feasible. In fact, it is an NP-hard problem [92, 17]. We therefore consider the selection of pre-aggregates as an optimization problem where the input includes multidimensional datasets, a query workload, and an upper bound on available disk space. The output is a set of queries that minimizes the total cost of evaluating the query workload subject to the storage limit. We present an algorithm that uses the benefit per unit space of a scaling operation. We model the expected queries by a query workload, which is a set of scaling operations:

Q = {qi|0 < i ≤ n} (5.4)

where each qi has an associated non-negative frequency, fi. We normalize frequencies so that they sum up to 1:

Σ_{i=1}^{n} fi = 1    (5.5)

Based on this setup we study different workload patterns.

The PRE-AGGREGATESSELECTION procedure returns a set P = {pi | 0 < i ≤ n} of queries to be pre-aggregated. Its input is a workload Q and a storage space constraint c. The workload contains a number of queries, each corresponding to a scaling operation as defined in Eq. 5.1. Frequency, storage space, and benefit per unit space are calculated for each distinct query in the workload. When calculating the benefit, we assume that each query is evaluated using the root (top) node, which is the first selected pre-aggregate, p1. The second chosen pre-aggregate p2 is the one with the highest benefit per unit space. The algorithm then recalculates the benefit of each scaling operation, given that it is computed either from the root or from p2, whichever yields the lower cost. Subsequent selections are performed in a similar manner: the benefit is recalculated each time a scaling operation is selected for pre-aggregation. The algorithm stops selecting pre-aggregates when the storage space constraint is reached, or when there are no more queries in the workload to be considered for pre-aggregation, i.e., all scaling operations in the workload have already been selected.

Algorithm 3 PRE-AGGREGATESSELECTION
Require: A workload Q, and a storage space constraint c
1: P = {top scaling operation}
2: while (c > 0 and |P| != |Q|) do
3:   p = highestBenefit(Q, P)
4:   if (c − |p| > 0) then
5:     c = c − |p|
6:     P = P ∪ {p}
7:   else
8:     c = 0
9:   end if
10: end while
11: return P

The function highestBenefit(Q, P) returns the scaling operation with the highest benefit per unit space in Q that has not yet been selected. The complexity of the algorithm is O(k · n²) (k is the number of selected pre-aggregates and n is the number of vertices in the lattice), which arises from the cost of sorting the pre-aggregates by benefit per unit size.
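A compact rendering of the greedy selection is sketched below, with the benefit computation of Eq. (5.3) folded into the choice of the next candidate. The data layout (dictionaries with op, freq, and size fields) and the cost_from and depends_on callbacks are illustrative assumptions, not the thesis implementation.

```python
# Hedged sketch of the greedy selection (Algorithm 3). The workload is a list of
# dictionaries {"op": ..., "freq": ..., "size": ...}; evaluation costs and the
# depends_on() predicate are assumed to follow Section 4.2 and Eq. (5.1).
def benefit(q, workload, selected, cost_from, depends_on):
    """Benefit per unit space of materializing q, in the spirit of Eq. (5.3)."""
    gain = 0.0
    for t in workload:
        if t is q or depends_on(t["op"], q["op"]):
            current = min(cost_from(t, p) for p in selected)   # best cost so far
            with_q = cost_from(t, q)
            gain += t["freq"] * max(0.0, current - with_q)
    return gain / q["size"]

def select_preaggregates(workload, top, space_limit, cost_from, depends_on):
    """Greedy selection under a storage constraint (Algorithm 3)."""
    selected = [top]                         # the top (finest) node is mandatory
    space_left = space_limit
    candidates = [q for q in workload if q is not top]
    while space_left > 0 and candidates:
        best = max(candidates,
                   key=lambda q: benefit(q, workload, selected, cost_from, depends_on))
        if best["size"] > space_left:
            break                            # next candidate no longer fits
        selected.append(best)
        space_left -= best["size"]
        candidates.remove(best)
    return selected
```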

5.3.1 Complexity Analysis

Let m be the number of queries in the lattice. Suppose we have no queries selected except for the top query, which is mandatory. The time to answer a given query in the workload is then the time taken to compute the query using the top query, calculated according to our cost model. We denote this time by To. Suppose that, in addition to the top query, we choose a set of queries P. Denote the average time to answer a query by Tp. The benefit of the set of queries P is the reduction in average time to answer a query, that is, To − Tp. Thus, minimizing the average time to answer a query is equivalent to maximizing the benefit of a set of queries. Let p1, p2, ..., pk be the k queries selected by the PRE-AGGREGATESSELECTION algorithm. Let bi be the benefit achieved by the selection of pi, for i = 1, 2, ..., k. That is, bi is the benefit of pi with respect to the set consisting of the top query and p1, p2, ..., pi−1. Let P = {p1, p2, ..., pk}. Let O = {o1, o2, ..., ok} be an optimal set of k queries, i.e., those queries giving the maximum benefit. Let mi be the benefit achieved by the selection of oi, for i = 1, 2, ..., k. That is, mi is the benefit of oi with respect to the set consisting of the top query and o1, o2, ..., oi−1. Harinarayan et al. [92] proved that the benefit of the greedy algorithm can never be less than (e−1)/e ≈ 0.63 times the benefit of the optimum choice of pre-aggregated queries.

5.4 Answering Scaling Operations Using Pre-Aggregated Data

We say that a pre-aggregate p answers query q if there exists some other query q′ which, when executed on the result of p, provides the result of q. The result can be either exact with respect to q (q′ ∘ p ≡ q), or only an approximation (q′ ∘ p ≈ q). In practice, the result is often an approximation because of the effect of resampling the original dataset. The same effect is observed in the traditional image pyramids approach, but it is considered negligible since approximations are good enough for many applications. In our approach, when two or more pre-aggregates qualify for computing a given scaling operation, we pick the pre-aggregate with the scale vector closest to the one defined in the scaling operation.

Example 5.1 – Assume the queries listed in Table 5.1 have been pre-aggregated, and suppose we want to compute the following query: q = scale(ras01, (4.0, 4.0, 4.0), bi). From the list of available pre-aggregates, the query can be answered either by using p2 or p3. Of these two pre-aggregates, p3 has the closest scale vector to q. Thus, q′ = scale(p3, (0.87, 0.87, 0.87), bi). Note that q′ represents a rewritten scaling operation in terms of the pre-aggregate. □

Table 5.1. Sample Pre-Aggregates.

Raster Object ID  Raster Name  Scale Vector     Resampling Method
p1                ras01        (2.0, 2.0, 2.0)  nn
p2                ras01        (3.0, 3.0, 3.0)  bi
p3                ras01        (3.5, 3.5, 3.5)  bi
p4                ras01        (6.0, 6.0, 6.0)  bi

The REWRITEOPERATION procedure returns, for query q, a query q′ that has been rewritten in terms of a pre-aggregate identified by pid. The input of the algorithm is the scaling operation q and a set of pre-aggregates P. The algorithm looks for a FULL-MATCH between q and one of the elements of P. To this end, the algorithm verifies that the matching conditions listed in Section 5.2.2 are all satisfied. If a full match is found, it returns the identifier of the matched pre-aggregate. Otherwise, the algorithm verifies the PARTIAL-MATCH conditions for all pre-aggregates in P. All qualified pre-aggregates are added to set S. In the case of a partial match, the algorithm finds the pre-aggregate with the scale vector closest to the one defined in q. REWRITEQUERY rewrites the original query as a function of the selected pre-aggregate, and adjusts the values of the scale vector to perform the complementary scaling operation. The algorithm makes use of the following auxiliary functions.

• FULLMATCH(q, P). Verifies that all full-match conditions are satisfied. If no match is found, it returns 0; otherwise it returns the id of the matching pre-aggregate.

• PARTIALMATCH(q, P ). Verifies that all partial-match conditions are satisfied. Each qualified pre-aggregate of P is added to set S.

• CLOSESTSCALEVECTOR(q, S). Compares the scale vectors between q and the elements of S, and returns the identifier (pid) of the pre-aggregate whose scale vector is the closest to that defined for q.

• REWRITEQUERY(q, pid). Rewrites query q in terms of the selected pre-aggregate and adjusts the scale vector values accordingly.

Algorithm 4 REWRITEOPERATION
Require: A query q, and a set of pre-aggregates P
1: initialize S = {}, pid = 0
2: pid = fullMatch(q, P)
3: if (pid == 0) then
4:   S = partialMatch(q, P)
5:   pid = closestScaleVector(q, S)
6: end if
7: q′ = rewriteQuery(q, pid)
8: return q′
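The rewriting step can be sketched as follows, using the convention of Example 5.1 that the rewritten operation applies the complementary factor pi/qi to the already-reduced pre-aggregate; class and helper names are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass(frozen=True)
class ScaleOp:
    obj_name: str
    scale_vector: Tuple[float, ...]
    resampling: str

def closest_scale_vector(q: ScaleOp, candidates: List[ScaleOp]) -> Optional[ScaleOp]:
    """Among partially matching pre-aggregates, pick the one whose scale vector
    is closest to (but not coarser than) the one requested by q."""
    usable = [p for p in candidates
              if p.obj_name == q.obj_name and p.resampling == q.resampling
              and len(p.scale_vector) == len(q.scale_vector)
              and all(pv <= qv for pv, qv in zip(p.scale_vector, q.scale_vector))]
    if not usable:
        return None
    # largest factors still <= q's (lexicographic comparison as a simplification)
    return max(usable, key=lambda p: p.scale_vector)

def rewrite_query(q: ScaleOp, p: ScaleOp) -> ScaleOp:
    """Express q as a complementary scaling of the pre-aggregate p, as in
    Example 5.1 (factor p_i / q_i applied to p's already-reduced result)."""
    comp = tuple(pv / qv for pv, qv in zip(p.scale_vector, q.scale_vector))
    # the naming of the intermediate object is a made-up placeholder
    return ScaleOp(f"preagg({p.obj_name},{p.scale_vector})", comp, q.resampling)

# Reproducing Example 5.1: q = scale(ras01, (4,4,4), bi) answered from p3 = (3.5, 3.5, 3.5)
q = ScaleOp("ras01", (4.0, 4.0, 4.0), "bi")
p3 = ScaleOp("ras01", (3.5, 3.5, 3.5), "bi")
print(rewrite_query(q, p3).scale_vector)   # (0.875, 0.875, 0.875) -- truncated to 0.87 in Example 5.1
```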

5.5 Experimental Results

Experiments were conducted to evaluate the effectiveness of the pre-aggregation selection and rewriting algorithms in supporting scaling operations. They were run on a machine with a 3.00 GHz Intel Pentium 4 processor, running SuSE Linux 9.1. The workstation had a total physical memory of 512 MB. The query workload consisted of scaling operations with different scaling vectors. Different data distributions of the query workload were also considered. Despite the growing popularity of Web mapping services for GIS raster information processing, very few studies have been undertaken that report on user behavior while using those services. One of the primary reasons for the lack of research in this area may be the limited availability of the datasets outside of specialized research groups. Moreover, while query patterns related to scaling operations on 2D datasets are difficult to find, no empirical workload distributions were found for datasets of higher dimensionalities. We therefore resorted to using a set of artificial distributions that cover many practical situations in GIS and remote-sensing imaging. Most pre-aggregation algorithms in OLAP and image pyramids assume a uniform distribution of the values given for the scale vector in the query workload, so we considered the same type of distribution for our experiments. Furthermore, we also considered a Poisson distribution of the scale vector values. The rationale is that such a distribution covers situations where the dataset is scaled down by factors that typically fall within a narrow range of scale vectors. For example, very large objects may need to be scaled down by large scale vectors so they can be efficiently transferred back and forth via Web services [77]. We also considered applications where the dataset is scaled down by the same scale vector; we refer to such an access pattern as a peak distribution. Finally, we investigated a step distribution that covers cases where scaling operations can be grouped within specific ranges of scale vectors. Our experiments were performed on datasets generated from three real-life raster objects:

• Dataset R1. Consists of a 2D raster object with spatial domain [0 : 15359, 0 : 10239]. The dataset contains 600 tiles, each with a spatial domain of [0 : 512, 0 : 512]. The total number of cells composing the raster object is 157 million.

• Dataset R2. Consists of a 3D raster object with spatial domain [0 : 11299, 0 : 10459, 0 : 3650]. The dataset contains 3214 tiles, each with a spatial domain of [0 : 512, 0 : 512, 0 : 512]. The total number of cells composing the raster object is 43 trillion.

• Dataset R3. Consists of a 4D raster object with spatial domain [0 : 10150, 0 : 7259, 0 : 2430, 0 : 75640]. The dataset contains 197,070 tiles, each with a spatial domain of [0 : 512, 0 : 512, 0 : 512, 0 : 512]. The total number of cells composing the raster object is 1.35e+16.

In the rest of this section, we present the results of our experiments according to the dimensionality of the data.

5.5.1 2D Datasets

In this experiment the workload consisted of 12800 scaling operations defined for dataset R1.

Uniform Distribution

The scaling vectors of the queries in the workload were uniformly distributed. Scale vectors were integers ranging from 2 to 256. Following observations from practice, we assumed that both dimensions were coupled. We considered a storage space constraint of 35%, which is slightly higher than the additional storage space taken by image pyramids. The PRE-AGGREGATESSELECTION algorithm yields 12 pre-aggregates for this test, executing scaling operations with scale vectors 2, 4, 6, 11, 15, 22, 32, 46, 67, 95, 137, and 182. The cost of computing the workload using these pre-aggregates is 18,565. In contrast, image pyramids selects scaling operations with scale vectors 2, 4, 8, 16, 32, 64, 128, and 256, and requires 33% additional storage space. Image pyramids computes the workload at a cost of 29,166. The results of this experiment show that the pre-aggregates selected by our algorithm provide improved performance for scaling operations over image pyramids. The cost of computing the workload using our algorithm is 36% less than that incurred by image pyramids, at the price of 2% additional storage space.

Fig. 5.2(a) shows the distribution of the scale vectors of all queries in the workload. The pre-aggregates selected by image pyramids and by our pre-aggregation selection algorithm are shown in Fig. 5.2(b) and 5.2(c), respectively.

Figure 5.2. Query Workload with Uniform Distribution: (a) query workload (uniform distribution); (b) queries selected for materialization by image pyramids; (c) queries selected for materialization by our pre-aggregation selection algorithm.

Poisson Distribution

The workload for this experiment consisted of scaling operations whose scale vectors followed a Poisson distribution with a mean scale vector value of 50. The PRE-AGGREGATESSELECTION algorithm yields 33 pre-aggregates for this test, executing scaling operations with scale vectors from 34 to 66. The cost of computing the workload using these pre-aggregates is 42,455. In contrast, image pyramids selects scaling operations with scale vectors 2, 4, 8, 16, 32, 64, and 128, and the cost of computing the workload is 95,468. Thus, the cost of computing the workload using the pre-aggregates selected by our algorithm is 55% less than that incurred using image pyramids. There is also a major difference with respect to the additional storage space required by both approaches: image pyramids requires 33% additional storage space, while our algorithm requires only 5% additional space to store the selected pre-aggregates.

Figure 5.3. Query Workload with Poisson Distribution: (a) query workload.

Fig. 5.3(a) shows the distribution of the scale vectors of all queries in the workload. The pre-aggregates selected by image pyramids are shown in Fig. 5.4(a). Even when there are no queries in the workload with scale factors smaller than 33, image pyramids still allocates space for pre-aggregates 2, 4, 8, 16, and 32, which are the ones that account for much of the overall space requirement (33%). In contrast, our algorithm uses the query frequencies in the workload to select the queries for pre-aggregation; see Fig. 5.4(b). For this workload configuration, it is possible to pre-aggregate all distinct queries and provide much faster query response times than image pyramids. This shows the benefit of considering query frequencies in the workload. If we pick a mean higher than 50, the additional storage space needed by the pre-aggregates is minimal. Conversely, if the mean is shifted to a lower scale vector value, e.g., 16, the storage space needed by our pre-aggregation algorithm can increase to up to 35%.

Peak Distribution

In this experiment, the query workload consisted of scaling operations with a scale vector having a value of 100 in each dimension. The PRE-AGGREGATESSELECTION algorithm yields one pre-aggregate for this test, which executes a scaling operation with scale vector (100, 100). The cost of computing the workload using this pre-aggregate is 1.27e+08. In contrast, image pyramids selects scaling operations with scale factor values 2, 4, 8, 16, 32, 64, and 128 in each dimension, and the cost of computing the workload is 3.01e+08. Thus, the cost of computing the workload using the pre-aggregates selected by our algorithm is 58% less than the cost incurred by image pyramids. Furthermore, there is a major difference with respect to the storage space required by both approaches: image pyramids requires 33% additional storage space, while our algorithm only requires 5% additional space.

Figure 5.4. Selected Queries for Pre-Aggregation: (a) queries selected by image pyramids; (b) queries selected by our pre-aggregation selection algorithm.

Fig. 5.5(a) shows the distribution of the scale vectors for all queries in the workload. The pre-aggregates selected by image pyramids are shown in Fig. 5.6(a). Image pyramids allocates space for pre-aggregates with scale factors 2, 4, 8, 16, 32, 128, and 256 in each dimension. In contrast, our pre-aggregation selection algorithm selected one query, shown in Fig. 5.6(b). Although our algorithm makes more efficient use of storage space and computes the workload faster than image pyramids, this kind of scenario is not likely to occur in practice. The storage overhead is simply not justified. However, users may benefit from having a system that automatically pre-aggregates such operations with minimum overhead, a capability that can be provided by our algorithm.

Figure 5.5. Query Workload with Peak Distribution: (a) query workload.

Figure 5.6. Selected Queries for Pre-Aggregation: (a) queries selected by image pyramids; (b) queries selected by our pre-aggregation selection algorithm.

Step Distribution

We now consider a scenario where scale vectors are distributed in various ranges of frequencies, i.e., in a step distribution. The PRE-AGGREGATESSELECTION algorithm yields 6 pre-aggregates for this test, where scaling operations are executed with scale vectors 6, 8, 13, 19, 75, and 200. The cost of computing the workload using these pre-aggregates is 1.5e+09. In contrast, image pyramids selects scaling operations with scale vectors 2, 4, 8, 16, 32, 64, and 128, and the cost of computing the workload is 2.21e+09. The cost of computing the workload using the pre-aggregates selected by our algorithm is therefore 32% less than that incurred by image pyramids. Moreover, there is a major difference with respect to the additional storage space required by both approaches: image pyramids requires 33% additional storage space, while our algorithm only requires 15% additional space.

Figure 5.7. Query Workload with Step Distribution: (a) query workload.

Fig. 5.7(a) shows the distribution of the scale vectors for all queries in the workload. The pre-aggregates selected by image pyramids are shown in Fig. 5.8(a).

Figure 5.8. Selected Queries for Pre-Aggregation: (a) queries selected by image pyramids; (b) queries selected by our pre-aggregation selection algorithm.

5.5.2 3D Datasets

To test our pre-aggregation algorithms on 3D time-series datasets, we picked four data distribution patterns for the scaling vectors. For simplicity, we have labeled the dimensions x, y, and t, respectively. The following assumption (taken from observations in practice) is common to each data distribution type: the scale vector along the first two dimensions is the same, i.e., x = y. The aim of this test is to measure average query costs while varying the storage space available for pre-aggregation.

Uniform distribution in x, y, t

In this experiment, the workload consisted of 10,000 scaling operations referring to the 3D dataset R2 described at the beginning of this section. Scale vectors were uniformly distributed along the x, y, and t dimensions. Values of scale vectors ranged from 2 to 256. Fig. 5.9 shows the distribution of the scaling vectors in the workload. We executed the PRE-AGGREGATESSELECTION algorithm for different values of the storage space constraint (c). The minimum storage space required to support the root node of the lattice was 12.5% of the size of the original dataset.

Fig. 5.10 shows the average query cost as storage space is increased. A small amount of storage space dramatically reduces the average query cost. The improvement in average query cost decreases, however, as allocated space goes beyond 36%. Fig. 5.11 shows the scaling operations selected for pre-aggregation when c = 36%. For this instance of the storage space constraint, the algorithm selected 49 pre-aggregates. The total cost of computing the workload is 6.44e+05. In contrast, computing the workload using the original dataset incurs a cost of 1.28e+12.

Figure 5.9. Workload with Uniform Distribution along x, y, and t

Figure 5.10. Average Query Cost over Storage Space

Figure 5.11. Selected Pre-Aggregates, c = 36%

Uniform distribution in x, y and Poisson distribution in t

In this experiment, the workload consisted of 23,460 scaling operations referring to 3D dataset R2. The scale vectors were uniformly distributed along x and y, with a Poisson distribution along t. Values of scale vectors ranged from 2 to 256 in the x and y dimensions, whereas in t they ranged from 8 to 16, with a mean value of 12. Fig. 5.12 shows the distribution of scaling vectors in the workload. Note that the scale vector values in the dimensions x and y are coupled. The frequency of the various scale factor values is denoted by f. We ran the PRE-AGGREGATESSELECTION algorithm for different values of the storage space constraint. The minimum storage space required to support the root node of the lattice was 3.13% of the size of the original dataset. Fig. 5.13 shows the average query cost as storage space increases. A small amount of storage space dramatically improves the average query cost. However, we can also observe that the improvement in average query cost decreases as allocated space goes beyond 26%. Fig. 5.14 shows the scaling operations selected for pre-aggregation when c = 26%. For this instance of the storage space constraint, the algorithm selected 67 pre-aggregates. The total cost of computing the workload is 1.21e+07. In contrast, computing the workload using the original dataset incurs a cost of 2.31e+11.

Poisson distribution in x, y, t

In this experiment, the workload consisted of 600 scaling operations referring to 3D dataset R2. The scale vectors followed a Poisson distribution along the three dimensions x, y, and t. Values of scale vectors ranged from 2 to 10 in the x and y dimensions, whereas in t they ranged between 8 and 16, with a mean value of 12. Fig. 5.15 shows the distribution of the scaling vectors in the workload. We ran the PRE-AGGREGATESSELECTION algorithm for different values of the storage space constraint.

Figure 5.12. Workload with Uniform Distribution Along x, y, and Poisson Distribution in t

Figure 5.13. Average Query Cost as Space is Varied

The minimum storage space required to support the root node of the lattice was 4.18% of the size of the original dataset. Fig. 5.16 shows the average query cost as storage space is increased. A small amount of storage space dramatically improves the average query cost. However, the improvement in average query cost decreases as allocated space goes beyond 26%. Fig. 5.17 shows the scaling operations selected for pre-aggregation when c = 30%. For this instance of the storage space constraint, the algorithm selected 23 pre-aggregates. The total cost of computing the workload is 1680. In contrast, computing the workload using the original dataset incurs a cost of 1.34e+12.

Figure 5.14. Selected Pre-Aggregates, c = 26%

Figure 5.15. Workload with Poisson Distribution Along x, y, and t

Poisson distribution in x, y, and Uniform distribution along t

In this experiment, the workload consisted of 924 scaling operations referring to 3D dataset R2. The scale vectors followed a Poisson distribution along the dimensions x and y, and a uniform distribution along dimension t. Values of scale vectors ranged from 2 to 10 in the x and y dimensions, and were uniformly distributed along t. Fig. 5.18 shows the distribution of the scaling vectors in the workload. We ran the PRE-AGGREGATESSELECTION algorithm for different values of the storage space constraint. The minimum storage space required to support the root node of the lattice was 4% of the size of the original dataset. Fig. 5.19 shows the average query cost as storage space is increased. A small amount of storage space dramatically improves the average query cost. However, the improvement in average query cost decreases as allocated space goes beyond 21%. Fig. 5.20 shows the scaling operations selected for pre-aggregation when c = 21%.

Figure 5.16. Average Query Cost as Space is Varied

Figure 5.17. Selected Pre-Aggregates, c = 30%

Figure 5.18. Workload with Poisson Distribution Along x, y, and Uniform Distribution in t

For this instance of the storage space constraint, the algorithm selected 17 pre-aggregates. The total cost of computing the workload is 1472. In contrast, computing the workload using the original dataset incurs a cost of 1.63e+12.

5.5.3 4D Datasets

For 4D datasets, we considered ECHAM T-42 as a typical use case found in climate modeling. ECHAM T-42 is an energy and mass budget model developed by the Max-Planck-Institute for Meteorology [16]. We assumed that dimensions x and y are scaled down by the same scale value. However, the scale values along z and t may vary according to the specific analysis requirements of a given application. If we look at the sample dimensions of the ECHAM T-42 model shown in Table 5.2, it is clear that the extents along the first three dimensions are much smaller than that of the fourth dimension (time).

In this experiment, the workload consisted of 1,137 scaling operations referring to 4D dataset R3. We assumed that the scale vectors followed a Poisson distribution in each of the four dimensions. The rationale behind this assumption is that scientists are often interested in a highly selective data set, and a Poisson distribution fits this data access pattern nicely. Values of scale vectors ranged from 2 to 11 in the x and y dimensions with a mean of 6; from 10 to 19 along the z dimension with a mean of 14; and from 230 to 239 along t with a mean of 234. Table 5.3 shows the distribution of the scale factors of all scaling operations in the workload.

We ran the PRE-AGGREGATESSELECTION algorithm for different values of the storage space constraint. The minimum storage space required to support the root node of the lattice was 1.25% of the size of the original dataset. Table 5.4 shows the scaling operations selected for pre-aggregation when c = 1.3%.

Figure 5.19. Average Query Cost as Space is Varied

Figure 5.20. Selected Pre-Aggregates, c = 21%

For this instance of the storage space constraint, the algorithm selected the 4 pre-aggregates shown in Table 5.4. The total cost of computing the workload is 3361. In contrast, computing the workload using the original dataset incurs a cost of 1.35e+16.

Table 5.2. ECHAM T-42 Climate Simulation Dimensions

Dimension              Extent
Longitude              128
Latitude               64
Elevation              17
Time (24 min/slice)    200 years (2,190,000)
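To give a sense of scale, the extents in Table 5.2 imply a full-resolution cube of roughly

\[
128 \times 64 \times 17 \times 2{,}190{,}000 \;\approx\; 3.05 \times 10^{11}\ \text{cells},
\]

cell size and encoding aside.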

Table 5.3. 4D Scaling: Scale Vector Distribution

Scale Vector      Count
2,2,10,230        200
3,3,11,231        300
4,4,12,232        500
5,5,13,233        800
6,6,14,234        1000
7,7,15,235        1000
8,8,16,236        800
9,9,17,237        500
10,10,18,238      300
11,11,19,239      200

Table 5.4. 4D Scaling: Selected Pre-Aggregates

Scale Vector      Count
2,2,10,230        200
4,4,12,232        500
6,6,14,234        1000
8,8,16,236        800
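The selection mechanics described above can be illustrated by a minimal greedy sketch: candidate scaling operations are ranked by the workload cost they save per unit of storage they occupy, and admitted while the space constraint c permits. This is a simplified stand-in for the PRE-AGGREGATESSELECTION algorithm; the candidate list, sizes, and benefit figures below are invented for illustration only.

from dataclasses import dataclass

@dataclass
class Candidate:
    scale_vector: tuple   # e.g. (2, 2, 10, 230)
    frequency: int        # how often the workload requests this scaling
    size: float           # storage needed, as a fraction of the original dataset
    benefit: float        # workload cost saved if this pre-aggregate is materialized

def select_preaggregates(candidates, space_constraint):
    # Greedily pick pre-aggregates with the highest benefit per unit of storage
    # until the storage space constraint is exhausted.
    selected, used = [], 0.0
    for cand in sorted(candidates, key=lambda c: c.benefit / c.size, reverse=True):
        if used + cand.size <= space_constraint:
            selected.append(cand)
            used += cand.size
    return selected, used

# Hypothetical candidates; sizes and benefits are made-up numbers.
cands = [
    Candidate((2, 2, 10, 230), 200, 0.0125, 5.0e14),
    Candidate((4, 4, 12, 232), 500, 0.0016, 2.1e14),
    Candidate((6, 6, 14, 234), 1000, 0.0005, 1.5e14),
    Candidate((8, 8, 16, 236), 800, 0.0002, 6.0e13),
]
chosen, space_used = select_preaggregates(cands, space_constraint=0.013)
print([c.scale_vector for c in chosen], space_used)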

5.6 Summary

This chapter describes our investigations into the problem of intelligently picking a subset of scaling operations for pre-aggregation given a storage space constraint. There is a tradeoff between the amount of space allocated for pre-aggregation and the average query cost of scaling operations. We introduced a pre-aggregation selection algorithm that, based on a given query workload, determines a set of pre-aggregates in the face of storage space constraints.

We performed experiments on 2D, 3D, and 4D datasets using different distribution patterns for the scale vectors. We relied on artificial data distributions since no empirical distributions were available. In addition to uniformly distributed scale vectors, we considered non-uniform distributions including Poisson, peak, and step. For 2D datasets, we showed that our algorithm performs better than the image pyramid approach. In particular, for non-uniform distributions, our pre-aggregation selection algorithm not only provides a lower average query cost, but also makes much more efficient use of storage space. This is because our algorithm considers the frequency of each query and the cost savings (benefit) its materialization provides for computing the workload. Nevertheless, the major advantage of our algorithm over image pyramids is not the improved average query cost, but the reduced amount of storage space required for the pre-aggregates, especially for non-uniform distributions.

In our experiments with 3D and 4D datasets, we showed the effect of the available storage space for pre-aggregation on average query costs. We observed that a small amount of storage overhead is sufficient to dramatically reduce average query costs. Since there are no similar techniques against which we could compare our results, we compared them against the average query costs obtained by using the original data.

Chapter 6

Conclusion

One of the biggest challenges for database technology is to effectively and efficiently manage and archive extremely large volumes of multidimensional array data. This thesis investigates the problem of applying OLAP pre-aggregation technology to speed up aggregate query processing in array databases for GIS and remote-sensing imaging applications.

We presented a study of fundamental imaging operations in GIS. By using a formal algebraic framework, Array Algebra, we were able to classify GIS operations according to three basic algebraic operators and thus identify a set of operations that can benefit from pre-aggregation techniques. We argued that OLAP pre-aggregation techniques cannot be applied in a straightforward manner to array databases for our target applications. The reason is that, although similar, the data structures in the two application domains differ in fundamental aspects. In OLAP, multidimensional data spaces are spanned by axes where cell values sit on the grid at intersection points. This is paralleled by raster image data, which are discretized during acquisition; thus, the structure of an OLAP data cube is rather similar to a raster array. Dimension hierarchies in OLAP serve to group value ranges along an axis: querying data by referring to coordinates on the measure axes yields ground data, whereas queries using axes higher up in a dimension hierarchy return aggregated values. A main differentiating criterion between OLAP data and raster image data is density: OLAP data are sparse, typically 5% dense, whereas raster image datasets are 100% dense. Note also that dimensions in OLAP are treated as business perspectives, such as products and/or stores; these are non-spatial dimensions, which contrasts with the spatial nature of raster image datasets. There are, however, core similarities that motivated us to further research OLAP pre-aggregation techniques. For example, we observed that array databases and OLAP systems both employ multidimensional data models to organize their data. Also, the operations convey a high degree of similarity: a roll-up (aggregate) operation in OLAP is very similar to a scaling operation in the raster domain. Moreover, both application domains make use of pre-aggregation to speed up query processing; however, each has reached a different level of maturity and scalability.

We presented a framework that focuses on computing basic aggregate operations using pre-aggregated data.

We argued that the decision to compute an aggregate query using pre-aggregated data is influenced by the structural characteristics of the query and the pre-aggregate. Thus, by comparing the query tree structures of the two, one can determine whether the pre-aggregated result contributes fully or partially to the final answer of the query. The best case occurs when there is a full match between the query and the pre-aggregate, since the time taken to compute the query is reduced to the time it takes to retrieve the result. In the case of partial matching, however, several pre-aggregates may be considered for computing the answer of a query. The decision therefore has to be made as to which pre-aggregates provide the best performance in terms of execution time. To this end, we distinguished between different types of pre-aggregates and presented a cost model to calculate the cost of using each qualifying pre-aggregate. We then presented an algorithm that selects the best execution plan for evaluating a query over pre-aggregated data. Tests performed on real-life raster image datasets showed that our distinction between different types of pre-aggregates is useful for determining the pre-aggregate providing the highest benefit (in terms of execution time) for computing a given query.

We then described the issues that arise when generalizing our pre-aggregation framework to support more complex aggregate operations, and justified our decision to focus on one particular operation: scaling. Traditionally, 2D scaling operations have been performed using image pyramids. Practice shows that pyramids are typically constructed in scale levels of powers of 2, yielding scale vectors 2, 4, 8, 16, 32, 64, 128, 256, and 512. The materialization of the pyramid requires an estimated 33% of additional storage space. Our pre-aggregation selection algorithm is similar to the pyramid approach in that it selects a set of queries for materialization, where each level corresponds to a scaling operation with a defined scale factor. However, the selection of such queries is not restricted to a fixed number of levels separated by powers of two. Instead, our selection algorithm considers the frequency of each query in the workload and how the result of each individual query can help to reduce the overall cost of computing the workload. We compared the performance of our pre-aggregation algorithm against that of image pyramids: the results showed that for workloads with uniformly distributed scale vectors, our algorithm computes the workload at a cost 36% lower than image pyramids, while requiring 7% more space. For scale vectors following a Poisson distribution, our algorithm computes the workload at a cost 55% lower than the pyramid approach. Furthermore, our algorithm can be applied to datasets of higher dimensions, a feature not supported by traditional image pyramids.
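The 33% figure follows from the geometric series of pyramid level sizes. Assuming each pyramid level halves both spatial extents of a 2D image, the additional storage relative to the base image is

\[
\sum_{k=1}^{\infty} \left(\frac{1}{2^{k}}\right)^{2} \;=\; \sum_{k=1}^{\infty} \frac{1}{4^{k}} \;=\; \frac{1/4}{1-1/4} \;=\; \frac{1}{3} \;\approx\; 33\%.
\]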

6.1 Future Work

There are natural extensions to this work that would help expand and strengthen the results. One area of further work is adding self-management capabilities, so that the DBMS maintains statistics about each scaling operation appearing in the incoming queries and, at some suitable time, adjusts the pre-aggregate set accordingly. OLAP dynamic pre-aggregation addresses a similar problem. Another area is applying the results studied here to the many real-world situations where data cubes contain one or more non-spatio-temporal dimensions, such as pressure, which is common in meteorological and oceanographic data sets.

Workload distribution deserves further investigation. While the distributions chosen are practical and relevant, there might be further situations worth considering. Gaining empirical figures from user-exposed services like EarthLook1 can be useful to tune our pre-aggregation selection algorithms. Further investigation is also necessary in the realm of rewriting scaling operations. In OLAP applications, there is a trade-off between speed and accuracy. But accuracy may be critical for certain Geo-raster applications, so solutions to the query rewriting problem must weigh these two aspects according to user data analysis requirements. Moreover, they must consider the fact that the same dataset may be accessed by various users with totally different analysis needs.

1 www.earthlook.org

Bibliography

[1] Blakeley J. A., Larson P-K., and Tompa F. Efficiently updating materialized views. In SIGMOD Rec., volume 15, pages 61–71, New York, NY, USA, 1986. ACM.

[2] Burrough P. A. and McDonell R. A. Principles of Geographical Information Systems. Oxford, 2004.

[3] Dehmel A. A Compression Engine for Multidimensional Array Database Systems. PhD thesis, Technical University Munich, Germany, 2002.

[4] Dobra A., Garofalakis M., Gehrke J., and Rastogi R. Processing complex aggregate queries over data streams. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, pages 61–72, New York, NY, USA, 2002. ACM.

[5] Garcia-Gutierrez A. Applying OLAP pre-aggregation techniques to speed up query processing in raster-image databases. In GI-Days 2007 - Young Researchers Forum, pages 189–191, Muenster, Germany, 2007. IfGIprints 30.

[6] Garcia-Gutierrez A. Applying OLAP pre-aggregation techniques to speed up query response times in raster image databases. In ICSOFT (ISDM/EHST/DC), pages 259–266, 2007.

[7] Garcia-Gutierrez A. Modeling geo-raster operations with array algebra. In Technical Report (7), 2007.

[8] Garcia-Gutierrez A. and Baumann P. Modeling fundamental geo-raster operations with array algebra. In ICDM Workshops, pages 607–612, 2007.

[9] Garcia-Gutierrez A. and Baumann P. Computing aggregate queries in raster image databases using pre-aggregated data. In Proceedings of the International Conference on Computer Science and Applications, pages 84–89, San Francisco, CA, USA, 2008.

[10] Garcia-Gutierrez A. and Baumann P. Using pre-aggregation to speed up scaling operations on massive spatio-temporal data. In 29th International Conference on Conceptual Modeling, November 2010.


[11] Gupta A. and Mumick I. S. Maintenance of materialized views: Problems, techniques, and applications. In IEEE Data Engineering Bulletin, volume 18, pages 3–18, 1995.

[12] Gupta A. and Mumick I. S. Materialized Views. The MIT Press, 2007.

[13] Gupta A., Harinarayan V., and Quass D. Aggregate-query processing in data warehousing environments. In Proceedings of the 21st International Conference on Very Large Data Bases, pages 358–369, San Francisco, CA, USA, 1995. Morgan Kaufmann Publishers Inc.

[14] Kitamoto A. Multiresolution cache management for distributed satellite image database using NACSIS-Thai international link. In Proceedings of the 6th International Workshop on Academic Information Networks and Systems (WAINS), pages 243–250, 2000.

[15] Koeller A. and Rundensteiner E. A. Incremental maintenance of schema-restructuring views in SchemaSQL. In IEEE Transactions on Knowledge and Data Engineering, volume 16, pages 1096–1111, Piscataway, NJ, USA, 2004. IEEE Educational Activities Department.

[16] Lauer A., J. Hendricks, I. Ackermann, B. Schell, H. Hass, and S. Metzger. Simulating aerosol microphysics with the ECHAM/MADE GCM; Part I: Model description and comparison with observations. In Atmospheric Chemistry and Physics, volume 5, pages 3251–3276, 2005.

[17] Shukla A., Deshpande P., and Naughton J. F. Materialized view selection for multidimensional datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 488–499, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc.

[18] Spokoiny A. and Shahar Y. An active database architecture for knowledge-based incremental abstraction of complex concepts from continuously arriving time-oriented raw data. In Journal on Intelligent Information Systems, volume 28, pages 199–231, Hingham, MA, USA, 2007. Kluwer Academic Publishers.

[19] Stan A. Geographic information systems: A management perspective. In WDL Publications, 1991.

[20] American National Standards Institute Inc. (ANSI). ANSI/ISO/IEC 9075-2:2008, International Organization for Standardization (ISO), Information Technology – Database Languages – SQL – Part 2: Foundation (SQL/Foundation). Technical report, American National Standards Institute, 2008.

[21] Barbará B. and Imielinski T. Sleepers and workaholics: Caching strategies in mobile environments. In SIGMOD Conference, pages 1–12, 1994.

[22] Moon B., Vega-Lopez I. F., and Vijaykumar I. Scalable algorithms for large temporal aggregation. In Proceedings of the 16th International Conference on Data Engineering, page 145, Washington, DC, USA, 2000. IEEE Computer Society.

[23] Reiner B. HEAVEN A Hierarchical Storage and Archive Environment for Multidimensional Array Database Management Systems. PhD thesis, Technical University Munich, Germany, 2004.

[24] Reiner B. and Hahn K. Tertiary storage support for large-scale multidimensional array database management systems, 2002.

[25] Reiner B., Hahn K., Hoefling G., and Baumann P. Hierarchical storage support and management for large-scale multidimensional array database management systems. In Proceedings of the 3rd International Conference on Database and Expert Systems Applications (DEXA), Aix en Provence, 2002.

[26] Sapia C. PROMISE: Predicting query behavior to enable predictive caching strategies for OLAP systems. In Proceedings of the 2nd International Conference on Data Warehousing and Knowledge Discovery, pages 224–233, London, UK, 2000. Springer-Verlag.

[27] Open GIS Consortium. Web Coverage Processing Service (WCPS). In best practices document No. 06-035r1, pages 21–47, 2006.

[28] The OLAP Council. Efficient storage and management of environmental information. www.olapreport.com, Accessed July 11, 2002.

[29] The OLAP Council. APB-1 OLAP benchmark release II. http://www.olapcouncil.org/research/resrchly.htm, Accessed July 11, 2010.

[30] P. Cudre-Mauroux, H. Kimura, K.-T. Lim, J. Rogers, R. Simakov, E. Soroush, P. Velikhov, D. L. Wang, M. Balazinska, J. Becla, D. DeWitt, B. Heath, D. Maier, S. Madden, J. Patel, M. Stonebraker, and S. Zdonik. A demonstration of SciDB: a science-oriented DBMS. In Proceedings of the Very Large Data Bases Conference Endowment, volume 2, pages 1534–1537. VLDB Endowment, 2009.

[31] Chatziantoniou D. Ad hoc OLAP: Expression and evaluation. In Proceedings of the 15th International Conference on Data Engineering, page 250, Washington, DC, USA, 1999. IEEE Computer Society.

[32] O'Sullivan D. and Unwin D. Geographic Information Analysis. John Wiley, 2003.

[33] Quass D. Maintenance expressions for views with aggregation. In VIEWS, pages 110–118, 1996.

[34] Tveito I. D., Dobesch H., Grueter E., Perdigao A., Tveito O.E., Thornes J. E., Van der Wel F., and Bottai L. The use of geographic information systems in climatology and meteorology. In Final Report COST Action 719, 2006.

[35] Nguyen D.H. Using JavaScript for some interactive operations in virtual geographic model with GeoVRML. In Proceedings of the International Symposium on Geoinformatics for Spatial Infrastructure Development in Earth and Allied Sciences, 2006.

[36] Adiba M. E. and Lindsay B. G. Database snapshots. In Proceedings of the Sixth International Conference on Very Large Data Bases, October 1-3, 1980, Montreal, Quebec, Canada, pages 86–91. IEEE Computer Society, 1980.

[37] Thomsen E. OLAP Solutions: Building Multidimensional Information Systems. John Wiley and Sons, 1997.

[38] Codd E. F., Codd S. B., and Salley C.T. Beyond decision support. In Computer World, volume 27, 1993.

[39] Codd E. F., Codd S. B., and Salley C. T. Providing OLAP (on-line analytical processing) to user-analysts: An IT mandate. In Technical Report, 1993.

[40] Vega-Lopez I. F., Snodgrass R. T., and Moon B. Spatiotemporal aggregate computation: A survey. In IEEE Transactions on Knowledge and Data Engineering, volume 17, pages 271–286, Piscataway, NJ, USA, 2005. IEEE Educational Activities Department.

[41] Colliat G. OLAP, relational, and multidimensional database systems. In SIGMOD Rec., volume 25, pages 64–69, New York, NY, USA, 1996. ACM.

[42] Pestana G., da Silva M. M., and Bedard Y. Spatial OLAP modeling: An overview based on spatial objects changing over time. In IEEE 3rd International Conference on Computational Cybernetics, pages 149–154, April 2005.

[43] Wiederhold G., Jajodia S., and Litwin W. Dealing with granularity of time in temporal databases. In Proceedings of the 3rd international conference on Advanced information systems engineering, pages 124–140, New York, NY, USA, 1991. Springer-Verlag New York, Inc.

[44] García-Molina H., Ullman J. D., and Widom J. Database Systems: The Complete Book. Williams, 2002.

[45] Samet H. Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann Publishers, 2006.

[46] ERDAS IMAGINE. ERDAS Field Guide. 1997.

[47] ESRI Inc. ArcGIS 9 Geo Processing Commands, quick reference guide. ArcGIS, 2004.

[48] ISO. 19123:2005 geographic information - coverage geometry and functions, 2005.

[49] Albrecht J. Universal analytical GIS operations - a task-oriented systematization of data structure-independent GIS functionality. In Geographic Information Research - Transatlantic Perspectives, pages 577–591, 1998.

[50] Boettger J., Preiser M., Balzer M., and Deussen O. Detail-in-context visualization for satellite imagery. volume 27, pages 587–596, 2008.

[51] Burt P. J. and Adelson E. H. The Laplacian pyramid as a compact code. In IEEE Transactions on Communications, number 31, pages 532–540, 1983.

[52] Han J., Stefanovic N., and Koperski K. Selective materialization: An efficient method for spatial data cube construction. In Proceedings of the Second Pacific-Asia Conference on Research and Development in Knowledge Discovery and Data Mining, pages 144–158, London, UK, 1998. Springer-Verlag.

[53] Nievergelt J., Hinterberger H., and Sevcik K. C. The grid file: An adaptable, symmetric multikey file structure. In ACM Transactions on Database Systems, volume 9, pages 38–71, 1984.

[54] Peuquet D. J. Making space for time: Issues in space-time data representation. In Geoinformatica, volume 5, pages 11–32, Hingham, MA, USA, 2001. Kluwer Academic Publishers.

[55] Whang K. J. and Krishnamurthy R. The multilevel grid file - a dynamic hierarchical multidimensional file structure. In DASFAA, pages 449–459, 1991.

[56] Berry J. K. and Tomlin C. D. A Mathematical Structure for Cartographic Modeling in Environmental Analysis. In Proceedings of the American Congress on Surveying and Mapping, pages 269–283, 1979.

[57] Choi K. and Luk W. Processing aggregate queries on spatial OLAP data. In Proceedings of the 10th International Conference on Data Warehousing and Knowledge Discovery, pages 125–134, Berlin, Heidelberg, 2008. Springer-Verlag.

[58] Hornsby K. and Egenhofer M. J. Shifts in detail through temporal zooming. In International Workshop on Database and Expert Systems Applications, volume 0, page 487, Los Alamitos, CA, USA, 1999. IEEE Computer Society.

[59] Hornsby K. and Egenhofer M. J. Identity-based change: A foundation for spatio-temporal knowledge representation. In International Journal of Geographical Information Science, volume 14, pages 207–224, 2000.

[60] Ramachandran K., Shah B., and Raghavan V. V. Dynamic pre-fetching of views based on user-access patterns in an OLAP system. In ICEIS (1), pages 60–67, 2005.

[61] Sellis T. K. Multiple-query optimization. In ACM Trans. Database Syst., volume 13, pages 23–52, New York, NY, USA, 1988. ACM.

[62] Shim K., Sellis T., and Nau D. Improvements on a heuristic algorithm for multiple-query optimization. In Data and Knowledge Engineering, volume 12, pages 197–222, 1994.

[63] Libkin L., Machlin R., and Wong L. A query language for multidimensional arrays: Design, implementation, and optimization techniques. In SIGMOD Rec., volume 25, pages 228–239, New York, NY, USA, 1996. ACM.

[64] Usery E. L., Finn M. P., Scheidt D. J., Ruhl S., Beard T., and Bearden M. Geospatial data resampling and resolution effects on watershed modeling: A case study using the agricultural non-point source pollution model. In Journal of Geographical Systems, volume 6, pages 289–306, 2004.

[65] Yong K. L. and Kim M. H. Optimizing the incremental maintenance of multiple join views. In Proceedings of the 8th ACM International Workshop on Data Warehousing and OLAP, pages 107–113, New York, NY, USA, 2005. ACM.

[66] Benedikt M. and Libkin L. Exact and approximate aggregation in constraint query languages. In Proceedings of the 18th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 102–113, New York, NY, USA, 1999. ACM.

[67] Gertz M., Hart Q., Rueda C., Singhal S., and Zhang J. A data and query model for streaming geospatial image data. In EDBT Workshops, pages 687–699, 2006.

[68] Golfarelli M. and Rizzi S. Data Warehouse Design: Modern Principles and Methodologies. McGraw Hill, 2009.

[69] Gyssens M. and Lakshmanan L. V. A foundation for multi-dimensional databases. pages 106–115, 1997.

[70] Ogden J. M., Adelson E. H., Bergen J. R., and Burt P. J. Pyramid methods in computer graphics, 1985.

[71] Beckmann N., Kriegel H. P., Schneider R., and Seeger B. The r*-tree: an efficient and robust access method for points and rectangles. In SIGMOD Rec., volume 19, pages 322–331, New York, NY, USA, 1990. ACM.

[72] Roussopoulos N. Materialized views and data warehouses. In SIGMOD Record, volume 27, pages 21–26, 1997.

[73] Stefanovic N., Han J., and Koperski K. Object-based selective materialization for efficient implementation of spatial data cubes. In IEEE Transactions on Knowledge and Data Engineering, volume 12, pages 938–958, Piscataway, NJ, USA, 2000. IEEE Educational Activities Department.

[74] Widmann N. and Baumann P. Performance evaluation of multidimensional array storage techniques in databases. In Proceedings of the IDEAS Conference, 1999.

[75] Baumann P. Management of multidimensional discrete data. In The VLDB Journal, volume 3, pages 401–444, Secaucus, NJ, USA, 1994. Springer-Verlag New York, Inc.

[76] Baumann P. A database array algebra for spatio-temporal data and beyond. In Next Generation Information Technologies and Systems, pages 76–93, 1999.

[77] Baumann P. Web-enabled raster GIS services for large image and map databases. In Proceedings of the 12th International Workshop on Database and Expert Systems Applications, page 870, Washington, DC, USA, 2001. IEEE Computer Society.

[78] Baumann P. Web Coverage Processing Service (WCPS) implementation specification. OGC document No. 08-068, 1.0.0 edition, 2008.

[79] Furtado P. and Baumann P. Storage of multidimensional arrays based on arbitrary tiling. In Proceedings of the 15th International Conference on Data Engineering, page 480, Washington, DC, USA, 1999. IEEE Computer Society.

[80] Marathe A. P. and Salem K. A language for manipulating arrays. In Proceedings of the 23rd International Conference on Very Large Data Bases VLDB '97, pages 46–55, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers Inc.

[81] Vassiliadis P. Modeling multidimensional databases, cubes and cube operations. In Proceedings of the 10th International Conference on Scientific and Statistical Database Management, pages 53–62, Washington, DC, USA, 1998. IEEE Computer Society.

[82] Burt P. J. Fast filter transforms for image processing. In Computer Graphics and Image Processing, number 16, pages 16–51, 1981.

[83] Agrawal R., Gupta A., and Sarawagi S. Modeling multidimensional databases. In Proceedings of the 13th International Conference on Data Engineering, pages 232–243, Washington, DC, USA, 1997. IEEE Computer Society.

[84] Pieringer R., Markl V., Ramsak F., and Bayer R. HINTA: A linearization algorithm for physical clustering of complex OLAP hierarchies. In DMDW, page 11, 2001.

[85] Chen S., Liu B., and Rundensteiner E. A. Multiversion-based view maintenance over distributed data sources. In ACM Transactions on Database Systems, volume 29, pages 675–709, New York, NY, USA, 2004. ACM.

[86] Prasher S. and Zhou X. Multiresolution amalgamation: Dynamic spatial data cube generation. In Proceedings of the 15th Australasian Database Conference, pages 103–111, Darlinghurst, Australia, 2004. Australian Computer Society, Inc.

[87] Shekhar S. and Xiong H. Encyclopedia of GIS. Springer, 2008.

[88] SYBASE. Sybase solutions guide. http://www.sybase.cz/uploads/CEEMEA_SybaseIQ_FINAL.pdf, Accessed July 11, 2010.

[89] Griffin T. and Libkin L. Incremental maintenance of views with duplicates. In Proceedings of the SIGMOD Rec., volume 24, pages 328–339, New York, NY, USA, 1995. ACM.

[90] Needham T. Visual Complex Analysis. Oxford University Press, 1998.

[91] Niemi T., Nummenmaa J., and Thanisch P. Normalizing OLAP cubes for controlling sparsity. In Data Knowledge Engineering, volume 46, pages 317–343, Amsterdam, The Netherlands, 2003. Elsevier Science Publishers B. V.

[92] Harinarayan V., Rajaraman A., and Ullman J. D. Implementing data cubes efficiently. In SIGMOD Rec., volume 25, pages 205–216, New York, NY, USA, 1996. ACM.

[93] Schlosser S. W., Schindler J., Papadomanolakis S., Shao M., Ailamaki A., Faloutsos C., and Ganger G. R. On multidimensional data and modern disks. In Proceedings of the 4th USENIX Conference on File and Storage Technologies, pages 225–238. USENIX Association, 2005.

[94] Mingjie X. Experiments on remote sensing image cube and its OLAP. In Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, volume 7, pages 4398–4401, September 2004.

[95] Halevy A. Y. Answering queries using views: A survey. In The VLDB Journal, volume 10, pages 270–294, Secaucus, NJ, USA, December 2001. Springer-Verlag New York, Inc.

[96] Jiebing Y. and Dewitt D. J. Processing satellite images on tertiary storage: A study of the impact of tile size on performance. In Proceedings of the 5th NASA Goddard Conference on Mass Storage Systems and Technologies, pages 460–476, 1996.

[97] Kotidis Y. and Roussopoulos N. A case for dynamic view management. In ACM Transactions on Database Systems, volume 26, pages 388–423, New York, NY, USA, 2001. ACM.

[98] Lee K. Y., Son J. H., and Kim M. H. Efficient incremental view maintenance in data warehouses. In Proceedings of the 10th International Conference on Information and Knowledge Management, pages 349–356, New York, NY, USA, 2001. ACM.

[99] Qingsong Y. and Aijun A. Using user access patterns for semantic query caching. In DEXA, pages 737–746, 2003.

[100] Zhao Y., Deshpande P. M., and Naughton J. F. An array-based algorithm for simultaneous multidimensional aggregates. In SIGMOD Rec., volume 26, pages 159–170, New York, NY, USA, 1997. ACM.

[101] Zhuge Y., García-Molina H., Hammer J., and Widom J. View maintenance in a warehousing environment. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, pages 316–327, New York, NY, USA, 1995. ACM.

[102] Zhuge Y., García-Molina H., and Wiener J. L. Multiple view consistency for data warehousing. In Proceedings of the 13th International Conference on Data Engineering, pages 289–300, Washington, DC, USA, 1997. IEEE Computer Society.

[103] Zhuge Y., García-Molina H., and Wiener J. L. Consistency algorithms for multi-source warehouse view maintenance. In Distributed Parallel Databases, volume 6, pages 7–40, Hingham, MA, USA, 1998. Kluwer Academic Publishers.