Supporting Advanced Queries on Scientific Array Data

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Roee Ebenstein , B.Sc., M.Sc.

Graduate Program in Computer Science and Engineering

The Ohio State University

2018

Dissertation Committee:

Gagan Agrawal, PhD, Advisor
P. Sadayappan, PhD
Arnab Nandi, PhD

Copyright by Roee Ebenstein 2018

Abstract

Distributed scientific array data is becoming more prevalent and increasing in size, and there is a growing need for advanced analytics (and for performance in such analytics) over these data. In this dissertation, we focus on addressing issues to allow data management, efficient declarative querying, and advanced analytics over array data. We formalize the semantics of array data querying, and introduce distributed querying abilities over these data. We show how to improve the optimization phase of join querying, while developing efficient methods to execute joins in general. In addition, we introduce a class of operations that is closely related to the traditional joins performed on relational tables – including an operation we refer to as Mutual Range Joins (MRJ), which arises for scientific data that is not only numerical but also carries measurement noise. While working closely with our colleagues to provide them usable analytics over array data, we uncovered a new type of analytical querying – analytics over windows with an inner window ordering (in contrast to the external window ordering available elsewhere). Last, we adjust our join optimization approach for skewed settings, addressing resource skew observed in real environments as well as data skew that arises while data is processed.

Several major contributions are introduced throughout this dissertation. First, we formalize querying over scientific array data (basic operators, such as subsettings, as well as complex analytical functions and joins). We focus on distributed data, and present a framework to execute queries over variables that are distributed across multiple containers (DSDQuery DSI) – this framework is used in production environments. Next, we present an optimization approach for join queries over geo-distributed data. This approach considers networking properties such as throughput and latency to optimize the execution of join queries. For such complex optimization, we introduce methods and algorithms to efficiently prune candidate execution plans (DistriPlan). Then, after the join has been optimized, we show how to execute distributed joins and optimize the MRJ operator. We demonstrate how bitmap indexes can be used for accelerating the execution of distributed joins – we do so by introducing a new bitmap index structure that fits the MRJ goals (BitJoin). Afterwards, we introduce analytical functions (window querying) to the domain of scientific arrays (FDQ). Last, we revisit join optimization for different settings, while addressing data and resource skew (Sckeow).

We thoroughly evaluate our systems. We show that DSDQuery DSI produces output of optimal size and produces it efficiently – performance decreases linearly with increasing dataset sizes. DistriPlan finds the optimal plan while considering a reasonable number of plans (out of an exponential number of candidate plans). BitJoin improves the performance of MRJs and equi-joins by 140% and 113%, on average. By using a new processing model with an efficient memory allocation approach, FDQ improves the performance of existing functionality by 538% on average. In addition, FDQ efficiently processes queries of types that were not available before – its performance improves linearly with scaled resources. Last, Sckeow improves the performance of queries by 396% for heterogeneous settings and 368% for homogeneous ones. For heterogeneous settings, in most cases Sckeow generates an ideal plan directly, while generating about half the number of plans other engines do in homogeneous settings.

This is dedicated to the ones I love: my parents, my brothers and sister, my family, and my friends around this globe. You all kept me on the right path for this moment to happen.

Acknowledgments

This dissertation was completed thanks to all the professional as well as personal help I received during my days at The Ohio State University. There are so many people to whom I want to express my deepest gratitude, and not enough room to list them all. Without their help and support, it would not have been possible for me to finish this program.

Foremost, I wish to thank my advisor, Prof. Gagan Agrawal. He has supported me since I began working in the lab – it was he who brought me into the high performance computing for data processing domain. His knowledge of the research area and insightful guidance not only broadened my knowledge, but also helped me establish my research interests and develop my problem solving skills. I also want to thank him for his kindness and patience, which helped me throughout my course of studies – especially in regards to academic writing.

In addition, I want to thank Prof. Arnab Nandi and Mr. David Bertram. Prof. Nandi was the first advisor I worked with, and although I departed to another laboratory, I still feel his influence on my academic progress. Graduating would not have been possible without the values and curiosity he instilled in me, even after I left his laboratory. While initially at this university, my funds were provided through my work with the Advanced Computing Center for the Arts and Design (ACCAD). Mr. Bertram, my supervisor there, put me on the right path academically and socially. I am grateful for the guidance and friendship both of you provided me – this moment would not have happened without you.

I would also like to thank the rest of my academic committee: Prof. P. Sadayappan and Prof. Feng Qin. Their insightful comments, incisive analysis, and teaching made my thesis stronger and more solid, and more importantly made me understand the domain better.

The group I have been part of, the Data Intensive and High Performance Computing Research Group, developed me in many ways, personally and professionally. The friendship within it has made my Ph.D. life much more enjoyable. I thank my fellow labmates and friends across this university: Yu Su, Yi Wang, Mehmet Can Kurt, Sameh Shohdy, Gangyi Zhu, Peng Jiang, Jiaqi Liu, Jiankai Sun, Piyali Das, Jia Guo, Shuangsheng Luo, Omid Asudeh, Emin Ozturk, Bill (Haoyuan) Xing, Qiwei Yang, Yang Xia, David Siegal, Niranjan Kamat, and Soumaya Dutta for the stimulating discussions and for all the fun we have had in the past years.

Special gratitude goes to those who made me socialize and enjoy my stay on this amazing campus and on my travels: Joshua Laney (and his family), Joshua Conner, Brad Shook, Lindsy Schwerer, Jessica Franz, Nate Moffitt, Patrick Spaulding, Jeff Starr, Gregory Hanel, Michael Baker, Joel Howard, Noga Adler, Orly Gilad, Pery Stosser, Tomer Peled, Idan Cohen, Adi Hochmann, Jonathan Fishner, Dima Machlin, Karen Cohen, Zalman and Sarah Deitsch (and the OSU Chabad family), and all the on-campus OSJews.

Finally, I want to extend my special thanks to my parents, my brothers and sister, my extended family, and my friends, who have always been supportive. No words can express the help you all provided me.

Vita

2006 ...... Software Technician, Center of Computing and Information Systems, M.O.D, Israel
2010 ...... B.Sc. in Computer Science, The Open University of Israel
2017 ...... M.Sc. in Computer Science, The Ohio State University
2017 ...... Software Engineering Intern, Google, Mountain View, CA

Publications

Research Publications

Ebenstein, Roee and Agrawal, Gagan, DSDQuery DSI: Querying scientific data repositories with structured operators. In 2015 IEEE International Conference on Big Data (BigData).

Ebenstein, Roee, Kamat, Niranjan, and Nandi, Arnab, FluxQuery: An Execution Framework for Highly Interactive Query Workloads. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD).

Ebenstein, Roee and Agrawal, Gagan, DistriPlan - An Optimized Join Execution Framework for Geo-Distributed Scientific Data. In Proceedings of the 29th International Conference on Scientific and Statistical Database Management (SSDBM).

Ebenstein, Roee and Agrawal, Gagan, BitJoin: Executing Distributed Range Based Joins. Under Submission.

Ebenstein, Roee, Agrawal, Gagan, Wang, Jiali, Boley, Joshua, and Kettimuthu, Rajkumar, FDQ: Advance Analytics over Real Scientific Array Datasets. Under Submission.

Ebenstein, Roee and Agrawal, Gagan, ScKeow: An Optimizer for Multi-Level Join Execution of Skewed and Distributed Array Data. Under Submission.

Fields of Study

Major Field: Computer Science and Engineering

Table of Contents

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures
List of Algorithms

1. Introduction
   1.1 Introduction
       1.1.1 Overview
       1.1.2 High Level Motivation
   1.2 Background
       1.2.1 Array Data
       1.2.2 Bitmap Index Structure
   1.3 Motivation and Contributions
       1.3.1 DSDQuery DSI - Declarative Querying of Distributed Scientific Data
       1.3.2 DistriPlan - Distributed Plan Optimization
       1.3.3 BitJoin - Scientific Data Join
       1.3.4 FDQ - Analytical Querying of Array Data
       1.3.5 Sckeow - Multi-Level Join Query Optimization over Skewed Resources and Skewed Data

2. Structured Querying of Scientific Data - DSDQuery
   2.1 Background
       2.1.1 SDQuery DSI
       2.1.2 DSDQuery - Functionality and Challenges
   2.2 Formal Definition of Query Operators
       2.2.1 Formalization - Relational Algebra
       2.2.2 Example Queries
   2.3 System Design and Implementation
       2.3.1 Metadata Extraction
       2.3.2 Query Analysis
       2.3.3 System Implementation
   2.4 Evaluation
       2.4.1 Metadata Processing Overhead
       2.4.2 Increasing Number of Files and Distributed Variables
       2.4.3 Impact of Selectivity
   2.5 Summary

3. Optimization of Join Queries - DistriPlan
   3.1 Background
       3.1.1 Distributed Array Data Joins
       3.1.2 Existing Approaches
       3.1.3 Execution Plans
   3.2 Plan Selection Algorithms
       3.2.1 Query Execution Plans
       3.2.2 Plan Distribution Algorithm
       3.2.3 Pruning Search Space
   3.3 Cost Model
       3.3.1 Costs
       3.3.2 Summarizing – Choosing a Plan
   3.4 Evaluation
       3.4.1 Pruning of Query Plans
       3.4.2 Query Execution Performance Improvement
       3.4.3 Impact of Network Latency
   3.5 Summary

4. Execution of Scientific Data Join - BitJoin
   4.1 Background
       4.1.1 Mutual Range Join
       4.1.2 Join Execution
   4.2 Bitmap Based Join
       4.2.1 Putting It All Together
   4.3 Mutual Range Join - MRJ
       4.3.1 Binning and Mutual Range Joins
       4.3.2 MRJ with Distributed Data
   4.4 Evaluation
       4.4.1 Implementation
       4.4.2 Equi-Join Over Discrete Data
       4.4.3 MRJ Processing
       4.4.4 Throughput Effect
       4.4.5 Impact of Changing Error and Range Values
   4.5 Summary

5. Advance Analytics over Real Scientific Array Datasets - FDQ
   5.1 Domain
       5.1.1 Climate Datasets and Variables
       5.1.2 Querying Needs and Tools
   5.2 Analytical/Window Querying
       5.2.1 Definition
       5.2.2 Minus Function
   5.3 Query Evaluation and Optimization Algorithms
       5.3.1 Analytical Functions Implementations
       5.3.2 Optimization: Dimensional Restructuring
       5.3.3 Distributed Calculations
   5.4 System Implementation
       5.4.1 FDQ Engine
       5.4.2 ANL Web Portal
   5.5 Evaluation
       5.5.1 Average Function Performance
       5.5.2 Analytical Function Performance and ...
   5.6 Summary

6. An Optimizer for Multi-Level Join Execution of Skewed and Distributed Array Data - ScKeow
   6.1 Background
       6.1.1 Addressing Skew in Processing of Array Data: Open Problems and Our Contributions
       6.1.2 Sparse Matrices Representation
       6.1.3 Array Data Joins and Aggregations
       6.1.4 Query Optimizer
   6.2 Overview - Sckeow Optimization
       6.2.1 Formal Problem Statement
       6.2.2 The Plan Generation Process
   6.3 Cost Model
       6.3.1 MPU - Minimal Profitable Unit
       6.3.2 Cost Evaluation
   6.4 Plan Generation Considerations
       6.4.1 Plan Pruning
   6.5 Evaluation
       6.5.1 Optimization Performance
       6.5.2 Query Execution Times
   6.6 Summary

7. Related Work
   7.1 Scientific Data Management
   7.2 Optimization of Advanced Join Queries over Distributed Data
   7.3 Operators Execution: Joins and Analytical Functions

8. Conclusions and Future Work

Bibliography

List of Tables

2.1 Example Tables and Simple Relational Algebra Queries
2.2 DSDQuery Resultset Size based on selectivity (S) and number of files (F) – value based query
3.1 DistriPlan: Costs of Operations by Operator
3.2 DistriPlan: Expected Resultset Size by Operator
3.3 DistriPlan Experimental Settings: Join selectivities and processed dataset average sizes by query
3.4 DistriPlan Spanning Time: Time required to span plans and number of spanned plans for a three-way join distributed among multiple nodes
3.5 DistriPlan Experiment Setup: Distributed Join Slowdown
3.6 DistriPlan Join Queries Execution Time
3.7 DistriPlan Performance Slowdown Percentage of a Plan Optimized for a Specific Optimizer Setting (Set) but Executed with a Different Actual (Act) Network Setting - Q1 with a setting of 3,5,4
3.8 DistriPlan Performance Slowdown Percentage of a Plan Optimized for a Specific Optimizer Setting (Set) but Executed with a Different Actual (Act) Network Setting - Q3-24 in Table 3.5
4.1 BitJoin Percent of Additional Records Output With Varying 'e' Values for 10M Values
6.1 Node Cost Examples. Join Costs Calculated for Non-Sparse Nested-Loop Equi-Join
6.2 Sckeow Non-Distributed Trees Spanning Performance; Increasing Number of Joins with Different Selectivities
6.3 Sckeow Distributed Tree Spanning Performance; Queries with Different Selectivities
6.4 Sckeow Distributed Plan Spanning Engines Performance Comparison
6.5 Scale of Number of Trees Spanned for Heterogeneous Environments

List of Figures

1.1 Array and Relational Data Storage Compared
1.2 Bitmap Index Example
1.3 Alternative Ways of Evaluating a Query Across Multiple Repositories
1.4 An Example Network with Skewed Resources
2.1 Mapping Between the Structured File, UML, and the Relational Model
2.2 Example of Simple Subsetting and Aggregation Queries
2.3 DSDQuery Repository Structure Example
2.4 DSDQuery System Architecture: Bridger and DSDQuery DSI
2.5 DSDQuery's Metadata Processing Overhead
2.6 DSDQuery Performance for Querying Increasing Number of Files
2.7 DSDQuery's Output Resultset Size, as Data is Queried from Increasing Number of Files
2.8 DSDQuery Response to Querying of Changing Selectivities – Dimension Based Queries
2.9 DSDQuery Selectivity Impact on Performance – Value-based Queries
3.1 Query for Join Walkthrough Example
3.2 A Walkthrough of the Join Process
3.3 Example Query: Simple Join
3.4 Non-Distributed Execution Plan Samples
3.5 Distributed Execution Plan Examples
3.6 Distribution Options for an Array (Using DistriPlan Rule 2) Distributed over 3 Nodes
3.7 DistriPlan Pruning: Number of Pruned Candidate Join Plans as the Number of Hosts Increases for a Join between Two Variables
3.8 DistriPlan Execution Time Slowdown (in %) for the Most Distributed and Median Plans Compared to the Cheapest Plan
4.1 Example Query: Temperature Join
4.2 Steps in the Regular Join Algorithm
4.3 BitJoin: A Walkthrough of the Join Process
4.4 Bitmap Index Types Example
4.5 BitJoin Processing Time for Equi-Joins over Categorical Data with Increasing Number of Bins
4.6 BitJoin Resultset Size of Discrete Data Join with Increasing Number of Bins
4.7 BitJoin Performance Improvement over Small Datasets with Increasing Number of Nodes – Speedup Compared to Regular Join on the Same Number of Nodes
4.8 BitJoin Execution Time Comparison over Large Datasets
4.9 BitJoin Execution Time with Decreasing Throughput
4.10 BitJoin Speedup over Regular Join for Changing 'r' Values
5.1 Analytical Query SQL Example
5.2 Analytical Query Example: Demonstration of Expected Behavior of the MINUS Analytical Function with Data Reset (Highlighted)
5.3 FDQ: A 3 Dimensional Variable
5.4 FDQ Query 1 - Analytical Function Query Example
5.5 FDQ Query 1 Resultset
5.6 ANL Querying Portal, Backed by FDQ
5.7 FDQ Execution Time of the AVG Function over Partitions with Different Selectivities (16% and 100%) [128 days - 90GB]
5.8 Query Runtime Speedups: FDQ over SciDB
5.9 Runtime of Minus Function: Benefits of Memory Optimizations in FDQ [1024 days - 24GB]
5.10 Comparison of Analytical Functions Runtimes Using FDQ [1024 days - 24GB]
5.11 FDQ Scale Up: Execution Time of the Average Function with Increasing Number of Threads
5.12 FDQ Scale Out: Execution Time of the Average Function with Increasing Number of Nodes
6.1 Example Query: Tornado Hypothesis Verification
6.2 Sckeow: Cost Based Model Optimization Engines Comparison
6.3 Sckeow Execution Time (in Seconds) Comparison for Different Array Sizes on Increasing Number of Nodes
6.4 Breakdown of Time Spent per Task by Engine – Sckeow, DSDQuery, Skew-Aware Join, and Similarity Join
6.5 Sckeow Comparison of Execution Time (in Seconds) for Heterogeneous Settings Using Fixed Array Size

List of Algorithms

1  Query Analysis
2  DistriPlan – Build Plans Without Repetitions
3  DistriPlan – Build Arrays to Spread Data Holding Rule 2
4  Naïve Array Data Equi-Join
5  Regular Array Data Join
6  The Naïve Bitmap Index Equi-Join
7  BitJoin Dimensional Join
8  BitJoin Index Restructure
9  Distributed BitJoin Master
10 Distributed BitJoin Process
11 The Minus Analytical Function
12 FDQ Results Accumulation
13 High Level Optimizer

Chapter 1: Introduction

1.1 Introduction

Scientific data collected from instruments and simulations [11, 19, 126, 127, 130, 147] has been extremely valuable for a variety of scientific endeavors. These data are typically disseminated through portals such as the Earth System Grid (ESG) or the Mikulski Archive for Space Telescopes (MAST). These portals typically support processing a collection of files that are stored in low-level formats such as NetCDF [102] or HDF5 [48] (these file formats usually hold two sections: metadata and data, where the metadata holds the data structure, while the data itself is stored sequentially). The key challenge being faced in these scientific efforts is that while dataset sizes continue to grow rapidly and become more distributed, disk speeds and wide-area transfer bandwidths have not kept pace. Thus, software tools for dealing with scientific data must be enhanced to incorporate new approaches if data-driven scientific advances are to be maintained in the future.

While data is collected at large scale and distributed through portals, the scientists using these data often require it to be processed (e.g., aggregated or subsetted) and restructured before they can find it effective. For example, climate researchers often use aggregation of certain attributes (from a large set of attributes) for specific spatial ranges over the geographically distributed arrays they collect. Currently, those scientific data portals provide limited querying functionality to obtain the desired data, and do not provide cross-cluster capabilities. Examples of currently available functionality include simple interfaces for browsing through the set of files and selecting the ones of interest to the user, or, in some cases, simple (predefined and hard-coded) subsetting functionality within one or multiple files. These functionalities do not suffice.

1.1.1 Overview

In this dissertation, we address Distributed Scientific Array Data Querying with the goal of enabling declarative, complex queries over distributed array data, which is expected to substantially reduce the time required for analyzing these data and to advance research wherever scientific array data are used. We have identified two main issues on the path to this goal: querying functionality and performance. Querying functionality refers to the flexibility of the enabled querying, the ease of phrasing such queries, and the features enabled by the expressed queries. Query performance refers to how the phrased query is optimized and prepared for execution (e.g., how should the data be read? How should it be returned? In what order should the different query parts be executed, and should execution use other data structures, such as indexes?).

We begin by developing our framework for query execution, and show how we address advanced distributed subsetting queries (DSDQuery DSI, Chapter 2), which we later use for addressing join queries and advanced analytical queries (FDQ, Chapter 5). We address join queries by introducing query optimizers (DistriPlan, Chapter 3, and Sckeow, Chapter 6) and an execution engine (BitJoin, Chapter 4) for advanced join queries over distributed arrays.

Each chapter begins by introducing the issue we address there. Then, if necessary, specific background is provided. Immediately afterwards, we dive into the matter at hand. We conclude each chapter by evaluating and summarizing the presented work.

1.1.2 High Level Motivation

Querying and analyzing scientific data is critical for data-driven discovery. However, several characteristics of scientific data make such analysis challenging, which is the reason current systems provide limited querying ability over these data. First, the data are typically in the form of arrays and often stored in native system files. Second, both the nature of the desired operations, and the challenges in executing them efficiently, are quite different from those of the relational databases that have been developed over the past four decades by the database community. Last, scientific data is often distributed, due to its size and/or the way it is generated, collected, and/or managed; a point we will discuss thoroughly later.

Querying semantics over scientific array data is different compared to traditional relational data. Currently, querying scientific data entails programming the query execution (the scientist provides code in languages such as C++, Java, R, or Python that is expected to produce the intended results). Two issues arise with this approach: each query takes a long time to develop, and it must go through a rigorous testing phase (which assures the query generates the intended results).
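As an illustration of the hand-coded approach described above, the following is a minimal sketch (in Python, using the netCDF4 and NumPy libraries) of what a scientist might write for a single spatial subsetting task; the file name, variable names, and bounds are hypothetical, not taken from any particular dataset:

    # A hand-written subsetting "query": read a temperature variable from one
    # NetCDF file and keep only a latitude/longitude box. Every new question
    # requires editing and re-testing code like this.
    import numpy as np
    from netCDF4 import Dataset

    ds = Dataset("POP.nc")                  # hypothetical file name
    lat = ds.variables["lat"][:]            # dimension-mapping variables
    lon = ds.variables["lon"][:]
    temp = ds.variables["TEMP"][:]          # data variable (lat x lon)

    # Boolean masks over the dimension values select the region of interest.
    lat_sel = (lat >= 30.0) & (lat <= 40.0)
    lon_sel = (lon >= -100.0) & (lon <= -80.0)

    subset = temp[np.ix_(lat_sel, lon_sel)]  # the subsetted array
    print(subset.shape)

Even this simple task fixes one file, one variable, and one region in code; changing any of them, or querying across files, means rewriting and revalidating the program.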

These issues raise the need to support scientific array data processing using declarative languages or structured operators. Many systems aiming to address this need have been built [23, 114, 121]. Systems that can handle some (but not all) of the characteristics of array data include relational distributed systems [94], Map-Reduce implementations [40], and scientific array databases [25, 141]. These systems simplify the query specification process as compared to the ad-hoc approach and/or using low-level languages. Some of these systems require that the data be loaded into a database [23, 121], whereas others can provide query processing capabilities working directly with low-level data layouts, such as a flat file, or data in formats like NetCDF and HDF5 [114]. However, there is still limited adoption of these systems, particularly because none support all of the desired features, and almost all have a high cost of initially loading the data. Thus, it is important to continue to provide effective data analytics functionality for this data. We will discuss these issues more thoroughly in each chapter, while targeting some of the needs that are not currently addressed by existing systems.

1.2 Background

In this section we describe the background necessary for all chapters. Information that is relevant only to specific chapters will be described in the first relevant chapter.

1.2.1 Array Data

Scientific array data in our context is defined to be a group of variables, each of which is comprised of a set of multidimensional arrays. The dimensional coordinate values are either just the indices (as in the default storage of an array in a programming language implementation), or a value stored in another array and mapped by its index location.

Figure 1.1: Array and Relational Data Storage Compared

In Figure 1.1 we demonstrate a simple array within this model. The data in this array is the availability of services on different platforms. There are three arrays in the diagram. The main array is the boolean variable, whereas the two other arrays are mapped dimensional coordinates. The values of mapped dimensional coordinates are globally unique – e.g., in the example, the combination of "Netflix" and "Apple TV" is unique and can appear at only one location, since otherwise a conflict could occur.

Data stored in the array model can also be mapped to the relational model. In the equivalent relational storage, each tuple has an identifier (usually its row location, commonly known as ROWID) that allows addressing it directly.
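To make the two models concrete, here is a small sketch that holds the same platform/service availability information once as an array with dimension-mapping arrays and once as relational-style tuples; the specific truth values are illustrative, not the exact contents of Figure 1.1:

    # Array model: a dense 2-D boolean array plus one mapping array per dimension.
    services = ["Netflix", "Hulu"]           # maps coordinate -> dimension value
    platforms = ["Apple TV", "TV App"]
    available = [[True, False],              # available[i][j] is one cell
                 [True, True]]

    # Relational model: every cell becomes a (service, platform, value) tuple,
    # and the implicit row position acts as a ROWID.
    rows = [(s, p, available[i][j])
            for i, s in enumerate(services)
            for j, p in enumerate(platforms)]

    print(available[0][1])   # array model: direct offset access
    print(rows)              # relational model: each value carries its full context

The relational form repeats the dimensional context in every tuple, which is one source of the space overhead discussed next.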

However, storing arrays as relational tables significantly increases space requirements and harms the performance of some operations. These size differences, together with the types of querying used over this data, make relational storage inefficient, and many management and storage engines specific to array data have been developed [24, 44, 95, 114, 146]. In addition, the relational model is designed to prevent users from knowing the underlying storage properties, whereas array data is intended to be queried by offsets. This difference results in many mathematical operations (like multiplication, scaling, masking, and others) that are easy to implement and optimize over an array data representation being hard to implement efficiently over relational data.

Although array database systems have been a topic of much research [24, 25, 43, 95, 104], most scientific data are still held as raw files in native formats [48, 102]. For example, climate simulation studies often create a different file for every measurement epoch (time range). Each file holds all available weather data in a different area or section of the globe within that epoch [11]. All these files together can be considered to logically construct one large array, referred to as a variable, with the file properties themselves translating into a virtual dimension, such as time. Many useful operations need access to multiple physical files simultaneously. For example, where climate data accumulates from multiple sensors at different times, calculating the average temperature requires multiple files, since the data for all required time epochs and areas are hosted in different files and at potentially different locations.

Array data properties are different compared to relational data. Within this work we assume that array data is interpreted by its dimensionality – the dimensions give context which assigns meaning to individual array cells. In addition, we assume data is not sparse when collected or generated, yet data may become sparse while being processed. On top of that, we refer to array data that is distributed across multiple files and possibly many machines or clusters. These characterizations result in the need to address array data in a different manner compared to traditional relational data methods.

Figure 1.2: Bitmap Index Example

1.2.2 Bitmap Index Structure

A bitmap index involves a set of bins, each of which represents a value (or a range of values). This value (or range of values) is referred to as the bin key, or simply the key. Corresponding to each key is a set of ordered bits, or a BitArray. In a BitArray, the first bit refers to the first value in the data array, the second bit refers to the second one, and so on. If a specific bit in the BitArray is set (has the value 1), the corresponding value in the array contains the bin key value (or a value within the range the bin represents). All other BitArrays contain 0 for that specific location, since keys and ranges do not normally overlap.

Array data can have discrete or continuous values. If the values are discrete, the keys match the values and the number of bins is equal to the number of distinct values. If the values are continuous, a binning method is used [135].

In Figure 1.2 we show an example of a bitmap index for the data in Figure 1.1. As can be seen, the data provided in the index is the same as in the previous example variable. At the top, there is a map holding the keys. Each key points to a BitArray that contains the index data – for example, the 2nd, 3rd, and 4th values (at indexes 1, 2, and 3) are all Y.

BitArrays are sparse by nature – the larger the number of bins, the higher the frequency of 0 values. This property can be exploited for compression, and compressed bitmap indexes have been extensively researched [10, 31, 133]. These structures also provide efficient implementations of bit-wise AND and OR operations directly over two compressed structures. Although we only use compressed bitmaps, specifically WAH [133], our discussions will not explicitly consider compression.
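To make the structure concrete, the following is a minimal sketch, in Python with plain uncompressed bit lists and illustrative data, of how a bitmap index is built over a small value array and how a bit-wise AND answers a conjunctive condition; it is a toy stand-in for the WAH-compressed structures used in our implementation:

    # Build a simple bitmap index: one BitArray (list of 0/1 bits) per distinct key.
    def build_bitmap_index(values):
        index = {}
        for pos, v in enumerate(values):
            bits = index.setdefault(v, [0] * len(values))
            bits[pos] = 1                      # set the bit for this position
        return index

    data = ["Y", "N", "Y", "Y"]                # illustrative discrete values
    other = ["A", "A", "B", "A"]               # a second attribute over the same cells

    idx1 = build_bitmap_index(data)
    idx2 = build_bitmap_index(other)

    # Conjunction (data == "Y" AND other == "A") is a bit-wise AND of two BitArrays.
    conjunction = [a & b for a, b in zip(idx1["Y"], idx2["A"])]
    print(conjunction)                         # -> [1, 0, 0, 1]

With compressed bitmaps the same AND is performed directly over the compressed representations, which is what makes them attractive for the distributed joins discussed later.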

1.3 Motivation and Contributions

Next, we briefly motivate each chapter and discuss the challenges and contributions introduced there. While currently users query array data by writing proprietary software to do so (investing time and resources in each query), this dissertation enables users to execute advanced (traditional and analytical) queries by using a declarative language over distributed array data stored in native system files that hold the data in different formats.

1.3.1 DSDQuery DSI - Declarative Querying of Distributed Scientific Data

Climate simulations like the Community Earth System Model (CESM) are producing massive datasets. The current output organization involves keeping all the variables for the entire globe, for one time-slice, in a single NetCDF file. In the future, this data organization is likely to change. However, most researchers focus on a specific geographical region (and often certain time-ranges), and a small number of variables. Selecting the data of interest for the user involves spatial or spatio-temporal subsetting of data over a Cartesian (non-rectilinear) grid. In either case, the user's need cannot be met by simply selecting appropriate files. Unfortunately, it means that the user has to download orders of magnitude more data over the Wide Area Network (WAN) than they need, and then manage and filter the data at their end.

If the data were stored in a standard database, extracting the data of interest would be a relatively straightforward subsetting query. However, for various reasons, scientific data is not stored using the popular (and mature) commercial database systems. This gap has been addressed to an extent by new emerging array databases such as SciDB [23]. These systems require loading the data while changing its storage format to the system's proprietary one, imposing significant overheads. Such overheads cannot be justified for massive datasets that are either read-only, append-only, or accessed infrequently.

Challenges and Contributions

DSDQuery DSI addresses three problems that have not been targeted before. The first is locating the data that is relevant for the query without requiring the user to intimately know the datasets, enabling true declarative querying. The second is querying across multiple files and extracting relevant data. The third is combining the output elements into one unified and combined output dataset (NetCDF, for example), which allows the user to apply the rest of the workflow operations in a more convenient and traditional way (e.g., a visualization program that is specific to the output format).

The main challenges are related to querying semantics and to the execution of both metadata extraction and the query itself. We discuss the challenges in depth in Subsection 2.1.2 and present our end-to-end solution, used by our collaborators at Argonne National Laboratory, in Chapter 2.

1.3.2 DistriPlan - Distributed Plan Optimization

One of the issues that remains unaddressed by DSDQuery is providing advanced query capabilities over data distributed across multiple geographically distributed repositories. In supporting structured query operators over distributed data, most common query operators (selections, projections, and aggregations) are relatively straightforward to support. However, the classical join operator and its variants [86] are challenging to support when the data is at geographically disparate locations. DistriPlan focuses on the challenge of executing and optimizing the join operator over geographically distributed array data.

Consider the current status of dissemination of climate data: in the United States, much of the climate data is disseminated by the Earth System Grid Federation (ESGF)1. However, worldwide, climate data is also made available by agencies of other countries, such as those from Japan, Australia, and others. A climate scientist interested in comparing data across datasets collected from different satellites or agencies will need to run multiple queries across these repositories and to bind the data manually.

1See https://www.earthsystemgrid.org/about/overview.htm

[Figure 1.3 sketches two candidate plans for joining array A (stored on Machine1) with array B (distributed over Machine2 and Machine3): Plan 1 ships A to the holders of B, Plan 2 gathers B at Machine1.]

Figure 1.3: Alternative Ways of Evaluating a Query Across Multiple Repositories

Similarly, array data related to other disciplines are spread across multiple repositories as well – e.g., genetic variation data is found in the 1000 (Human) Genomes Project and the 1000 Plant Genomes repository. Scientists are likely to need to pose join queries over different variables collected in these arrays, for example, linking behavioral data to genes or correlating particle behavior.

Challenges and Contributions

To illustrate the challenges in executing join(-like) queries across multiple repositories, we take a specific example. Given a declarative query, the system needs to decide what to do to provide the intended results – a process referred to as building an execution plan. An execution plan is a tree representation of ordered steps to perform for retrieving the expected results. Figure 1.3 shows two simple execution plans for executing a join between two one-dimensional arrays A and B. A is stored entirely on Machine1 and is of length 100, whereas B is distributed between Machine2 and Machine3, each having a sub-array of length 10. The join selectivity, which means the percent of results holding the join criteria out of all possible results generated by a Cartesian multiplication of both joined relations [60], is 1%. In Plan 1, the variable A is sent to both Machine2 and Machine3. A partial resultset of length 10 is produced on each machine, and then the resultset from Machine3 is sent to Machine2 for combining with the local resultset. The final resultset is sent to the user. Plan 2 combines the distributed array B before performing the join on Machine1.
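The trade-off between the two plans can be made concrete with a back-of-the-envelope transfer-volume estimate. The sketch below is a simplification; a real cost model, including ours, also weighs latency, parallelism, and local compute. It only counts the elements shipped between machines under the example's sizes and selectivity:

    # Rough data-movement comparison for the Figure 1.3 example.
    len_a, len_b_part, selectivity = 100, 10, 0.01
    partial_result = int(len_a * len_b_part * selectivity)    # 10 matches per machine

    # Plan 1: ship A to both holders of B, then ship one partial resultset back.
    plan1_transfer = 2 * len_a + partial_result               # 210 elements

    # Plan 2: ship both pieces of B to Machine1 and join locally.
    plan2_transfer = 2 * len_b_part                           # 20 elements

    print(plan1_transfer, plan2_transfer)

Counting elements alone favors Plan 2 here, but as the discussion below notes, link latencies and the opportunity for parallel computation can flip the preference – which is exactly why a cost-based optimizer is needed.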

Multiple challenges have been implicitly introduced here. Translating a query to a plan has been thoroughly researched before [28, 34], yet building execution plans that consider different processing orderings on different nodes, while the data is distributed among multiple nodes and sites, has not. In our example, anticipating which of the two presented plans would execute faster is not trivial. A possible reason for choosing Plan 1 would be that more calculations are performed in parallel. However, with communication latencies taken into account, Plan 2 may be preferable. In addition, when two or more joins need to be performed, producing an execution plan becomes hard, since the number of distribution and evaluation options increases exponentially.

This work presents a methodology for executing and optimizing join queries over geographically distributed array data. We show that Cost-Based Optimizers (CBOs) provide better optimization opportunities for our target setting, where simple heuristics are not sufficient. We develop algorithms for building distributed query execution plans while pruning inefficient and isomorphic plans. We introduce a cost model for queries over distributed data. The cost model considers the physical distribution of the data and the networking properties.

1.3.3 BitJoin - Scientific Data Join

While DistriPlan discusses the optimization of join queries, it does not address their execution. BitJoin focuses on the execution of two distributed join types: the Equi-Join and the Mutual Range Join (MRJ). While our focus is initially on the equi-join operation, a join where the join criterion is equality of values, the techniques developed can be easily adjusted for other types of join criteria. The MRJ is an operator that is similar to "Range Joins" [81], but with an additional notion of sampling noise.

Join algorithms [105, 109] have been extensively researched and implemented in relational DataBase Management Systems (DBMS). However, the state of the art in executing joins over distributed data is limited. Distributed DBMSs often use the same join algorithm implementations, without adjusting them for issues related to limited networking properties. In MapReduce systems [40] the programmer manually implements such joins. DBMS implementations over MapReduce, such as Hive [116], focus on executing queries within a cluster and not when data is geographically dispersed [20]. SciDB [24], a Scientific DBMS (SDBMS) for array data, and similar systems, provide only centralized, not distributed, join implementations.

Furthermore, we posit that the traditional join definition does not fit scientific (numerical) data well. The storage properties of array data are different compared to relational settings, changing the definition of a join. In addition, sensors collecting numerical data often have sampling errors; comparing array cell values based on a precise match does not align with users' intuition. Instead, it is more meaningful to look for close values. While pre-processing (particularly, data discretization, i.e., assigning discrete values to continuous ones) may help, it is desirable to define a high-level operator for this need.
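As a simple illustration of the intuition behind joining on closeness rather than equality, the following sketch pairs cells whose values agree within a tolerance; the tolerance value and the data are hypothetical, and the nested-loop form is for clarity only (Chapter 4 develops the actual bitmap-based algorithms):

    # Pair up measurements that agree within a tolerance 'e' instead of exactly.
    def close_pairs(a, b, e):
        return [(i, j)
                for i, x in enumerate(a)
                for j, y in enumerate(b)
                if abs(x - y) <= e]

    temps_a = [21.97, 25.10, 30.02]     # illustrative noisy samples
    temps_b = [22.03, 24.80, 31.50]

    print(close_pairs(temps_a, temps_b, e=0.1))   # -> [(0, 0)]; exact equality finds nothing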

Contributions

This work addresses challenges in dealing with distributed joins, including the MRJ. We accelerate distributed equi-joins by using bitmaps. Though there has been prior work on accelerating joins by using bitmaps [81], we focus on the case of distributed data, and specifically, on transferring only the compressed bitmaps, which leads to reduced network time. Moreover, we develop algorithms for the case where variable dimensional values do not match. We later motivate and formulate Mutual Range Join (MRJ) queries. For supporting these operations, we define a variant of bitmaps, i.e., we introduce the type B bitmap (index) to complement the original (type A) bitmap.

1.3.4 FDQ - Analytical Querying of Array Data

While DSDQuery and BitJoin address subsettings and joins over raw data, advanced analytical querying is not available through these engines (nor any other engine). An example of a query that cannot be addressed by current systems is "calculate the daily median of the differences between corresponding measurements (same time) of current and previous days". For example, if three samples are collected a day, at 10 am, 5 pm, and 10 pm, and we have 2 days of data – day 1 with [85, 75, 60] and day 2 with [80, 75, 50] – subtracting day 1 from day 2 (current minus previous) produces [-5, 0, -10]. The reported median in this case should be -5. This calculation cannot be phrased using one query (without the usage of "sub-queries" and joins) over array data, mainly since the notion of matching samples is lacking in current querying languages such as the Structured Query Language (SQL) [84] (including SQL/MDA, an in-development SQL extension for Multi-Dimensional Array data).
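The example result can be checked with a few lines of straightforward code; the point of FDQ is precisely that the user should not have to write such per-query programs (the data values below are those of the example above):

    # "Daily median of differences between matching samples of current and previous day"
    day1 = [85, 75, 60]                 # previous day's samples (10 am, 5 pm, 10 pm)
    day2 = [80, 75, 50]                 # current day's samples

    diffs = [c - p for c, p in zip(day2, day1)]   # matching-sample differences: [-5, 0, -10]

    def median(xs):
        s = sorted(xs)
        mid = len(s) // 2
        return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2

    print(diffs, median(diffs))         # -> [-5, 0, -10] -5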

In this work we provide the ability to generically phrase such queries and to provide such functionality efficiently. This work builds on top of our earlier work, where we had developed structured query support on top of scientific data [44, 45]. We focused on querying over native array storage formats (e.g., NetCDF), avoiding the costs other systems require (data duplication or reformatting). We used SQL as the interface to our querying systems – some semantic adjustments to the operators were made to fit the array data context.

We introduce FDQ, the Analytical Functions Distributed Querying engine. FDQ focuses on a class of operations called Analytical Functions, which had not been defined or supported before for array data. In analytical queries, the user provides dimensions that are used to partition the data (similar to the SQL partitions generated by the GROUP BY clause). Analytical functions process the data in each of these partitions (which are referred to as windows). However, unlike the partitions created by the GROUP BY clause, windows in our context are ordered, and therefore, the processing engine can access values of different windows.

Challenges and Contributions

Multiple challenges arise in efficiently calculating analytical functions over scientific array data. First is establishing the syntax and semantics of analytical querying over array data. Second, aggregation requires heavy mathematical calculations, even more so for analytics that cross window boundaries; minimizing the calculation overheads and processing queries efficiently is a challenge. Third, generating arrays requires pre-defining dimensions, yet the dimensions can be determined only after we know the results. Last, processing of such data requires a lot of memory – being memory efficient requires careful design of algorithms.

Overall, we contribute by introducing Analytical Functions (also known as "Window Querying") over scientific array data. We provide syntax and semantics, and extend Analytical Functions to not only allow window ordering, but to also allow ordering each window internally. We optimize our querying engines by using "chunked" in-memory processing of array data. We refer to these memory chunks, or tiles, as sections in this work. Sectioning (or tiling) allows batching calculations and utilizing memory caches to improve performance, decreasing disk accesses. We introduce methods to distribute window calculations. Last, we present approaches to distribute the query execution, based on both sectioning and dimensional distribution approaches, while discussing the benefits of each approach over the other.
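The following sketch illustrates the sectioning idea in a one-dimensional setting: data is processed in fixed-size chunks that fit in memory while a running window aggregate is carried across chunk boundaries. The chunk size, window size, and data are hypothetical; the real engine tiles multidimensional arrays and distributes the sections:

    # Chunked ("sectioned") computation of a moving-window average.
    from collections import deque
    from itertools import islice

    def windowed_average(values, window, section_size):
        it = iter(values)
        buf, results = deque(maxlen=window), []
        while True:
            section = list(islice(it, section_size))   # read one in-memory section
            if not section:
                break
            for v in section:                          # window state survives sections
                buf.append(v)
                if len(buf) == window:
                    results.append(sum(buf) / window)
        return results

    print(windowed_average(range(10), window=3, section_size=4))   # [1.0, 2.0, ..., 8.0]

Because the window buffer survives section boundaries, the chunked computation produces exactly the same answer as a single pass over the whole array, while only one section needs to be resident in memory at a time.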

[Figure 1.4 depicts four clusters, A through D, with 32 to 250 servers each, different CPU models (E5-2680, E5-2670, E5640), internal interconnects ranging from 1 GBPS and 10 GBPS Ethernet to Infiniband, and inter-cluster links of 100-400 MBPS with 20-40 ms latencies.]

Figure 1.4: An Example Network with Skewed Resources

1.3.5 Sckeow - Multi-Level Join Query Optimization over Skewed Resources and Skewed Data

While we addressed the optimization of join queries in DistriPlan, the focus there is limited to environments with homogeneous resources. Complex join queries over heterogeneous resources implicitly introduce multiple challenges. First, the data is stored at different locations and in different formats – e.g., tornado data is provided by the Storm Prediction Center, while temperature data is collected from simulations made locally and in part provided by the National Weather Service. Second, data for specific variables, or different portions of the same variable, may be distributed among multiple clusters due to storage availability or retention policies. Third, some of the data are sparse in nature, for example, storm data. Last, some of the data becomes sparse during processing (for example, a temperature self-join produces a sparse resultset).

In Figure 1.4 we show an example network that holds multiple variables (A, B, C, and D) at different locations – this diagram is based on real environments. Executing join queries over this environment requires carefully planning data accesses and their processing, a step conducted by a Query Optimizer [68]. The state of the art in query optimization over scientific data is limited – it either assumes uniformity in network properties [43, 145] or requires the full network architecture [45] (Chapter 3). For the former, not using the specific network properties of each cluster when generating plans over distributed data will likely not produce efficient plans. For the latter, too many options can exist, which makes it an impractical approach for complex queries.

Challenges and Contributions

These introduce multiple challenges, all are rooted in the exponential number of execution plans generated when all options are considered. First, each cluster’s

18 computational power are different and hard to compare based on technical hard- ware specifications. Second, data transportation needs to be encapsulated within the plans. Evaluating the full network properties is impractical since too many options may be generated. Last, it might be preferable to change the data repre- sentation format (e.g. from traditional array layout to a sparse representation).

In addressing these issues we first introduce the CF - Conversion Factor, a nu- merical value that enables data format change at run time, decreasing the pro- cessed array sizes. We also introduce the MPU, Minimal Profitable Unit. MPU’s contain multiple parameters, measured automatically for each environment, which allow accurate computational power estimation. These properties are used within our introduced cost model, intended for complex queries over highly dimensional scientific arrays.

19 Chapter 2: Structured Querying of Scientific Data - DSDQuery

Here we present our first engine, which targets querying of Scientific Array

Data. We named this system DSDQuery DSI (and we often refer to it in short as

DSDQuery) since it extends SDQuery DSI [114] for distributed datasets.

Specifically, DSDQuery allows the use of a subset of SQL to create a structured

database based on extraction of content from multiple datasets without knowing

the names of the files that host the data and their formats [51]. DSDQuery is built

from two components: Bridger, the component that automatically detects new files and datasets, and extracts its metadata; and DSDQuery DSI, an extension of SD-

Query that allows querying multiple databases and datasets while structuring a unified result dataset based on a user’s query. When Bridger detects a file of a known type (currently, NetCDF or HDF5), Bridger uses its appropriate provider to extract the relevant metadata and updates the repository. Bridger detects new

files and datasets not by scanning the file system periodically, but by using the file system notification interfaces, particularly inotify in linux [80]. After locating all the

files that are relevant for a given query, DSDQuery generates single file query calls for each of the located files. Unlike for SDQuery, the results are padded with their dimensions. Afterwards, DSDQuery reformats, reorganizes, and restructures the results to a valid dataset format based on the user request (currently, the supported

20 formats are the coordinate form and NetCDF). This work can also be viewed as an application of the previously proposed No-DB approach [3, 72, 129].

2.1 Background

2.1.1 SDQuery DSI

DSDQuery is built on top of SDQuery DSI [114]. SDQuery DSI was developed to minimize the amount of data that needs to be transferred over the Wide Area

Network, by subsetting data from within a single file. One way of viewing SD-

Query is that it embeds data management functionality with a data transfer pro- tocol, where the data transfer protocol engine is Globus GridFTP [5]. GridFTP is a

File Transfer Protocol (FTP) server that allows a developer to load external mod- ules (called DSIs) that react to FTP events. SDQuery DSI registers within GridFTP, and once an FTP GET request arrives, SDQuery DSI extracts the query and the file name of the dataset, and executes the query over a single NetCDF or HDF5 file.

FTP supports several different commands, but the most common ones are PUT and GET. SDQuery DSI reacts to a PUT by receiving the sent file, and generating an index over it. The framework responds to a GET by examining the given com- mand, and if it contains a query, processing it. If, for example, the user issues the following ftp command:

GET /home/user/POP.nc(SQL:SELECT TEMP FROM POP WHERE TEMP GE

8.1 AND TEMP LT 10.2)

SDQuery DSI would access the file POP.nc, and based on the query enclosed in the command, will extract the values of the variable TEMP that are greater or equal to

8.1 and smaller than 10.2.

21 2.1.2 DSDQuery - Functionality and Challenges

DSDQuery DSI is based on SDQuery DSI, but provides the following additional functionality.

1. DSDQuery DSI allows querying multiple files.

2. DSDQuery DSI can locate the relevant files and datasets for a given query by

building and maintaining a metadata repository that is populated automati-

cally when file system events occur.

3. In SDQuery DSI, a simple set of records were returned to the user. DSDQuery

DSI generates a full database file in a format chosen by the user (although

currently the only formats available are native coordinate, (x,y,val), and

NetCDF). The generated file contains a remap of the variable dimensions

to a denser representation that contains only the dimension values that are

needed. In some cases, for representing the different data sources that results

were sourced from, we add another dimension for the data source, that by

default is mapped to the source file name.

To further illustrate the points above, we use an example. In SDQuery, a cli- mate researcher accessing a repository with simulation output will state (rephrased in plain English) “I want to see the evaporation data of a specific area from the file

/server/2015/01/21/POP.evap.nc”. Having to specify the file requires the scientist to learn the file system structure of the source repository. In DSDQuery, the query would be phrased as “I want to see the evaporation data of a specific area from the

01/21/2015 simulation”. We believe this declarative statement forms a better com- munication with the data source provider for multiple reasons: the user does not

22 need to know where the files are stored, the data provider can change its directory structure without communicating this change with its clients, and, most impor- tantly, if there are multiple data sources that have the relevant data, they can all be queried.

However, multiple challenges are raised from this declarative query:

Query Meaning: many engines exist that provide querying ability using declarative languages, but nuances and implications of such queries are not always clear. This is particularly important for our system, as we use SQL operators on array data.

In our work, we use relational algebra to assign meaning to operators.

File Detection and Metadata Extraction: we show how using the dataset and ad- ditional metadata allows us to further assign semantics and context to queries.

Specifically, non-unique array and dimension names can form unique meaning by using additional metadata.

Unionizing (and Aggregating) Data: we show how our target operators should be run over multiple files.

Storage Location and Query Planning: locating data sources for queries to execute on is critical for efficient execution of queries using declarative language over large repositories. We developed algorithms for creating master queries and data sources adjusted queries, each of the latter being a subset of the master query optimized for a specific data source.

2.2 Formal Definition of Query Operators

DSDQuery supports a subset of SQL for querying array data. We have chosen

SQL because of its popularity, and particularly, because none of the array query

23 languages [15, 23, 29, 38, 77, 82, 83, 85, 99, 121, 142] have reached level of popu-

larity or maturity comparable to SQL. The subset of SQL we are using is similar

to the AQL used by SciDB, although we chose not to refer to AQL since it is not

standardized.

To make SQL operations applicable to array data, we first formalize a map-

ping between an array dataset to relational algebra[36], something that have not

been done before for two reasons: current systems do not support the various fea-

tures described before and current systems do not support distributed data, which

makes the problem simpler and more intuitive. The subset of SQL we support in-

volves simple operations on a single table or array – complex operations such as

analytical functions and joins are handled by our subsequent systems, FDQ and

BitJoin.

The following relational algebra is described upon one data set. When using

multiple datasets the data from these needs to be unionized (and sometimes ag-

gregated) to form the full data set.

2.2.1 Formalization - Relational Algebra

We first describe the nuances of Scientific Array Data that are relevant here.

Array databases are comprised of dimensions and variables. Each variable is con-

structed of a data type and dimensions – the same dimension(s) can be shared

among variables. The dimensions can have a mapping variable, with the same name, that is used for mapping different values by its coordinate. For example, a dimen- sion longitude would, probably, have a mapped variable that maps the coordinate

0 to a value such as 39.997, 1 to 39.998, and so on.

[Figure 2.1 depicts three variables (Var1, Var2, Var3) defined over shared dimensions (Dim1, Dim2, Dim3) in a scientific database, the corresponding UML view, and the equivalent relational tables.]

Figure 2.1: Mapping Between the Structured File, UML, and the Relational Model

In Figure 2.1, we demonstrate the dissonance between an array database and a relational one. In an array database, each dimensional coordinate in a variable, which can be referred to as a cell, has exactly one value. All of the cells together form a multi-dimensional array. Mapping the array model to the relational one, we can define a tuple to be all the values of the different variables that have the same set of dimensional coordinates. For example, if we have three dimensions – longitude, latitude, and depth – and two variables which are based on these dimensions – temperature and evaporation – we can either look at these as two different relations, temperature and evaporation, or as one relation, climate, which has two properties: temperature and evaporation. The second form provides more convenience while posing queries, and we develop the algebra over the second suggested form. More specifically, in the model we describe here, each array database source is a set of multiple relations, each having different dimensions. All the variables with the same set of dimensions within a specific dataset file are looked at as one relation.

Projection: in relational algebra, a projection is defined as a restriction of the properties, or atoms, of a relation. For example, if we have a relation R that is constructed of four atoms, dim1, dim2, var1, and var2, where dim1 and dim2 are the dimensions of the variables, the result of Π_{dim1,var1}(R) is a relation that includes only dim1 and var1. This definition cannot be directly applied to array data because the data cannot be brought out of its context, i.e., its dimensions, without using aggregation first. Therefore, we modify the definition of projection to apply only to non-dimensional atoms. Using the last example dataset, Π_{var2}(R) is a valid request for the data of the variable var2 alone. We allow querying multiple variables in one query: Π_{var1,var2}(R) is a valid query.

Table 2.1: Example Tables and Simple Relational Algebra Queries. Above each table is its definition; N stands for NULL.

Selection: a selection operation is of the form σ_{φ}(DS), where φ is a list of conditioned atoms that are connected by and operators (∧) or or operators (∨), and DS is the dataset at hand. The atoms in φ may be dimension or variable names; as in traditional relational algebra, each variable name and dimension name can be treated as a column name. For example, the query:

σ_{longitude ≤ 39.885 ∧ depth > 5}(Π_{temp}(DS))

implies: extract all the temperature values from the dataset DS that are located in an area with a longitude smaller than or equal to 39.885 and a depth greater than 5. When applying the selection operation to array datasets, remapping of the dimension values might occur, and each data type has to have a NaN or NULL value (we do not distinguish here between different types of NULL values, such as "not sampled", although these need to be implemented). Each value that is not selected, but for which a corresponding dimensional coordinate exists, is replaced with NULL.

Dimensional coordinates that do not hold any values are omitted. There is one special case: if a dimension does not have a mapped variable (one that maps the coordinates it contains to other values), we have two options. The first is to copy the dimension as is and replace the values of all unselected coordinates with NULL, whereas the second is to create a mapped dimension that remaps the remaining coordinates.

For example, if we have a variable with two dimensions, as DS (shown in Table 2.1 (a)), and we execute a query such as σ_{val=1}(DS), since the second column presented has no value equal to 1, this dimensional value can be dismissed. Assuming a mapped variable for the first dimension does not exist, the output may be the one shown in Table 2.1 (b), for the case where we do not create a mapped variable. Alternatively, the result would be as shown in Table 2.1 (c) and (d), where the former holds the data for the mapped dimension values and the latter holds the resulting variable – we implemented the latter since, although it requires a small amount of additional processing time, it produces a significantly smaller resultset, which leads to shorter file transfer time. If the dimensional coordinates have not been modified, as is the case with the vertical dimension in the example, no additional processing is required.
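To make these selection semantics concrete, the following is a minimal NumPy sketch (not the DSDQuery implementation) of value-based selection with NULL filling and removal of empty dimensional coordinates; the function name select_by_value is hypothetical.

# Minimal sketch of value-based selection over a 2-D array variable.
# Unselected cells become NaN (standing in for NULL), and coordinates
# along each dimension holding no selected values are dropped.
import numpy as np

def select_by_value(var, predicate):
    out = np.where(predicate(var), var, np.nan)
    # Drop rows/columns (dimension coordinates) that contain no values.
    keep_rows = ~np.all(np.isnan(out), axis=1)
    keep_cols = ~np.all(np.isnan(out), axis=0)
    return out[np.ix_(keep_rows, keep_cols)], keep_rows, keep_cols

if __name__ == "__main__":
    DS = np.array([[1.0, 2.0, 3.0],
                   [3.0, 4.0, 1.0]])
    result, rows, cols = select_by_value(DS, lambda v: v == 1.0)
    print(result)  # NaN marks cells that exist dimensionally but were not selected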

Set Operators: we describe, using examples, the way one should think of set operations over array data.

Union: a union combines two different variables to appear as one in the output. In relational algebra, two relations need to be union compatible for the union to have meaning. In our case, union-compatible variables are two variables that share the same dimensions. When the sets of dimension values are not identical, a dimensional merge is performed and the new cells are filled with either NaN or NULL. Moreover, unlike the relational case, a union of two non-union-compatible variables is possible and allows constructing a dataset with multiple variables. For example, A ∪ B for the variables A and B in Table 2.1 (e) and (f), respectively, is presented in Table 2.1 (g), where the first row contains the dimension values.

If the dimensional coordinates do not collide, as in the case where an array is spread over multiple nodes, two options exist – the first is to unionize the data as described above, by adding a dimension for the source. The other option is to not create this dimension – since the values do not collide, there will be no case where duplicate values are targeted at the same dimensional coordinate. This behavior is more appropriate when unionizing results from a distributed array. In our system, if the union is performed over multiple datasets, we use the latter technique, unless it cannot be used because of data collisions. If the union is performed because of a user request in the query, we perform the former union, though we consider extending the query syntax in the future to allow the user to choose which union algorithm to perform.
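As a concrete illustration of the dimensional merge described above, here is a minimal sketch over 1-D variables stored as coordinate-to-value mappings; it is not the DSDQuery code, the helper name union_compatible is hypothetical, and the values loosely mirror the A and B example of Table 2.1.

# Minimal sketch of a union of two union-compatible 1-D variables.
# Coordinates are merged; cells missing in one input are filled with NaN.
import math

def union_compatible(a, b):
    """a, b: dicts mapping a dimension coordinate to a value."""
    coords = sorted(set(a) | set(b))
    merged_a = [a.get(c, math.nan) for c in coords]
    merged_b = [b.get(c, math.nan) for c in coords]
    return coords, merged_a, merged_b

A = {1: 1.0, 2: 1.0, 3: 2.0}   # variable A over coordinates 1, 2, 3
B = {1: 1.0, 3: 1.0, 4: 0.0}   # variable B over coordinates 1, 3, 4
coords, ua, ub = union_compatible(A, B)
print(coords)  # [1, 2, 3, 4]
print(ua)      # [1.0, 1.0, 2.0, nan]
print(ub)      # [1.0, nan, 1.0, 0.0]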

Intersection: intersection reports the values that are the same among the given variables, ignoring dimensions that do not match. The output has the same dimensions as its sources. NULL is reported for dimensions that are created but for which no coordinate matches between the variables. In Table 2.1 (h) we demonstrate the result of the intersection between variable A, presented in (e), and B, presented in (f).

Aggregations: the operator G_{AGGR(dimension)}(Var, DS) executes the aggregation function AGGR over the given dimension on the variable Var that is stored in the dataset DS. For example, when G_{sum()}(A, DS) is run on the data of variable A in Table 2.1 (e), a variable that has the value 4.0 is returned, since no dimensions were given for the operation. G_{avg(dim)}(A, DS) would likewise return a variable holding a single value, since the variable's only dimension is the one the operation aggregates over.
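The aggregation operator can be illustrated with a short NumPy sketch; this assumes an in-memory array rather than the engine's on-disk processing, and the function name aggregate is hypothetical.

# Minimal sketch of the aggregation operator G_AGGR(dimension)(Var, DS).
import numpy as np

def aggregate(var, aggr=np.mean, axes=None):
    """Aggregate `var` over the given dimension indices.
    If no dimensions are given, aggregate over all of them."""
    if axes is None:
        return aggr(var)
    return aggr(var, axis=tuple(axes))

A = np.array([1.0, 1.0, 2.0])
print(aggregate(A, np.sum))        # 4.0 – no dimensions given
print(aggregate(A, np.mean, [0]))  # aggregate over the only dimension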

Mathematical Operators: mathematical operators can be applied to variables or dimensions. In either case, there are two possibilities – the operator may be applied with a singular value or with an array of values. In the case of a singular value, for both dimensions and variables, the mathematical operation is

performed for each value of the relevant dimension or variable. If, for example, the

user specifies var1 × 2 or dim1 × 2, each of the values within var1, or each of dim1's coordinate values, is multiplied by 2. In the case where a variable or a dimension is multiplied by another variable, dimension, or a given array, there are two possible interpretations: one can either perform a matrix operation or, alternatively, a by-place (element-wise) operation. The choice needs to be specified by the user – the ".op" suffix is used for a by-place operation and "op" for a matrix operation, where op is the mathematical operator.
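The distinction between the ".op" (by-place) and "op" (matrix) forms can be sketched as follows with NumPy; the operator-suffix parsing itself is not shown, and the function names are illustrative.

# Minimal sketch of by-place (element-wise) versus matrix operations.
import numpy as np

def by_place(a, b):
    """'.op' style: element-wise operation, cell by cell."""
    return np.multiply(a, b)

def matrix_op(a, b):
    """'op' style: matrix multiplication."""
    return a @ b

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[10.0, 0.0], [0.0, 10.0]])
print(by_place(A, 2))   # every cell multiplied by 2
print(by_place(A, B))   # cell-by-cell product
print(matrix_op(A, B))  # standard matrix product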

2.2.2 Example Queries

As concrete examples, we use two queries – they involve filtering results by value and by dimensions. We consider a climate simulation output stored in a repository across a number of files. Only a small subset of these files contains the variable we are querying, which is TEMP. The variable we query has three dimensions in most files, which are longitude, latitude, and depth. In a fraction of the files that contain TEMP, only a subset of these three dimensions is involved, whereas there are additional dimensions in some of the files.

Figure 2.2: Example of Simple Subsetting and Aggregation Queries

The relational algebra for the first query shown in Figure 2.2 (a) is

⋃_{d ∈ DS} G_{AVG(longitude, latitude, depth)}( σ_{TEMP ≥ 5 ∧ TEMP < 15}( Π_{TEMP}(d) ) )

Here, we union the results of the same query from multiple datasets. The query extracts the variable TEMP from each queried dataset d, then subsets the variable by value, and finally aggregates the results by three of its dimensions. In some of the files an aggregation over the additional dimensions, which are not among the three dimensions mentioned above, is required. The aggregation uses the "AVG" (average) aggregation function. To recap, in order of execution: first, the system finds the relevant datasets in the way described in the next section; then, for each dataset, a selection based on the variable values is performed; afterwards an optional aggregation is executed if needed, e.g., if there are more dimensions than those returned; last, the data is collected from each dataset and unionized (and possibly aggregated) to form the final result.

The second query, presented in Figure 2.2 (b), is distinct in the sense that the condition is not on the value of the variable, but on its dimensions. The relational algebra that represents this query is:

⋃_{d ∈ DS} G_{AVG(longitude, latitude, depth)}( σ_{depth ≥ 15 ∧ latitude < 0}( Π_{TEMP}(d) ) )

Here, we unionize the results of the same query from multiple datasets, as explained before. The query extracts the variable TEMP from each queried dataset d, then subsets the variable by two of its three dimensions, and aggregates the results by three of its dimensions. Afterwards, the data collected from all of the datasets is unionized to form the final result set.
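A minimal end-to-end sketch of the per-dataset pipeline just described (project TEMP, select by value, average away extra dimensions, then union the per-dataset results) is shown below; it assumes toy in-memory arrays rather than NetCDF files and simplifies the final union-with-aggregation step.

# Minimal sketch of the per-dataset pipeline for the first example query.
import numpy as np

def run_on_dataset(temp, extra_axes=()):
    """temp: array over (longitude, latitude, depth [, extra dims...])."""
    selected = np.where((temp >= 5) & (temp < 15), temp, np.nan)  # selection
    if extra_axes:                                                # AVG over extras
        selected = np.nanmean(selected, axis=extra_axes)
    return selected

datasets = [
    np.random.uniform(0, 20, size=(4, 3, 2)),     # lon, lat, depth
    np.random.uniform(0, 20, size=(4, 3, 2, 5)),  # ... plus an extra dimension
]
partials = [run_on_dataset(datasets[0]),
            run_on_dataset(datasets[1], extra_axes=(3,))]
result = np.nanmean(np.stack(partials), axis=0)   # union + aggregate across datasets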

2.3 System Design and Implementation

This section describes key components of the system we have developed for

DSDQuery DSI.

2.3.1 Metadata Extraction

The metadata extraction is divided into five parts: new file detection, atom extraction, statistical analysis, atom unification, and DB enrichment. The first step, new file detection, is done using two methods. First, the system registers a set of inotify file system listeners on the relevant directories. When a file is added or modified, a metadata scan is initiated and the new file is registered, or the modified file's metadata is updated. In addition, on initialization, the system verifies that its repository contains up-to-date data by scanning the file system. Differences between the repository and the file system are processed.

The next step, atom detection, finds the relevant metadata atoms. Currently we support NetCDF and HDF5 files, though other file formats can be added easily in the future. The metadata atoms are mostly the variables and their dimensions. As will be described in Subsection 2.3.3, additional metadata can be added to each dataset and/or atom. Assuming we detected a new file with the variable from the example in Subsection 2.2.2, the output of this process would be variable: TEMP, dimensions: depth, longitude, latitude. Additional atoms can appear in the dataset's metadata or in the additional metadata file.

Next, for each atom, statistical information extraction is activated, which helps execute and/or optimize the queries. A query can be optimized by, for example, removing datasets that are irrelevant, or by allocating resources according to the expected portion of results or the dataset size. This step finds the minimum and the maximum values for each atom we detect – additionally, a histogram of values can be built, and further statistical data can be collected if needed.

Atom unification makes sure all the atoms, manually added metadata enrichment, and statistical information look the same in the unified repository. The unification of the metadata from all the different dataset types (for example, NetCDF or HDF5) ensures the execution engine is agnostic to the dataset type. In our example, TEMP, depth, longitude, and latitude would be sent to the appropriate repository tables that contain variables and dimensions. The final step, DB enrichment, sends the unified atoms to the shared system's repository.
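The atom-extraction and statistics steps can be sketched for a NetCDF file using the netCDF4 and NumPy Python packages, as below; the file path, the record layout, and the insert_into_repository call are illustrative assumptions, not DSDQuery's actual schema or API.

# Minimal sketch of per-file atom extraction with simple statistics.
# Assumes numeric, non-empty variables.
import numpy as np
from netCDF4 import Dataset

def extract_atoms(path):
    atoms = []
    with Dataset(path) as nc:
        for name, var in nc.variables.items():
            data = var[:]
            atoms.append({
                "variable": name,
                "dimensions": list(var.dimensions),
                "min": float(np.nanmin(data)),   # per-atom statistics
                "max": float(np.nanmax(data)),
            })
    return atoms

# for record in extract_atoms("/data/climate/1.nc"):
#     insert_into_repository(record)   # hypothetical DB-enrichment call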

2.3.2 Query Analysis

In Algorithm 1 we describe how we analyze a query. This algorithm generates a map from each variable to the multiple single-file queries that need to be run.

The query performance is dependent on the query execution engine. As mentioned before, the query execution engine comes from the precursor system SDQuery, and is further described in an earlier publication [114]. The query shown in Figure 2.2 (b) is used to demonstrate the algorithm.

The input to the algorithm is an SQL query. Each query, in addition to stating the user's intended results, is a dataset definition corresponding to the output generated at runtime. The output dataset definition is based on the given variables and dimensions, and on the current files' content – if there are three files that contain the queried variable with the dimensions mentioned in the query, the output variable will contain all the dimensions that intersect across these files, even if some of the dimensions were not explicitly queried.

Algorithm 1 Query Analysis
 1: function QUERYANALYZE(query)
 2:     Maps ← ∅
 3:     pt ← ParseTree(query)
 4:     for each v ∈ variables in pt.FROM do
 5:         allq ← ∅
 6:         s ← v.dimensions (in all query sections)
 7:         mdq ← build query for v with all s
 8:         DS ← executeQuery(Repository, mdq)
 9:         cd ← FindCommonDimensions(v, DS)
10:         for each ds ∈ DS do
11:             stat ← queryDataSourceStats(v, ds)
12:             q ← regenerateQuery(query, cd, ds, stat)
13:             allq ← allq ∪ q
14:         end for
15:         Maps ← Maps ∪ (v, allq)
16:     end for
17:     return Maps
18: end function

When a query is submitted, we parse it, build a parsed query tree (line 3), and determine which variables are required for the query execution. These are iterated on (line 4). In the example query the only variable is TEMP. Afterwards, in line 6, for each variable we extract all the dimensions that are mentioned in the original input query. In the case a user did not explicitly relate a dimension to a variable, for example, the user wrote:

... FROM T WHERE A = ...

instead of:

... FROM T WHERE T.A = ...

the system deduces which variable the dimension should be related to, by using the metadata in the repository, the number of variables within the query, and/or query analysis. In our running example, the created map would be (TEMP, [depth, latitude]).

In line 7, we build a repository query for retrieving the datasets that contain the atoms mentioned in the original query. The query is executed in line 8 and all the relevant dataset names and their locations are returned. For the example query mentioned above, the repository query asks for the datasets that have a variable named TEMP with both of the dimensions depth and latitude. Next, in line 9, we find the variable dimensions that intersect across all the detected datasets using the repository. In our example, the output would be [depth, longitude, latitude]: although longitude is not mentioned in the query, it appears in all the datasets that have the variable TEMP and the dimensions latitude and depth.

For each dataset and variable combination, in line 12 we generate an SDQuery query that is based on a subset of the given original query. The new query selects only the portion of the original query that is relevant for the specific variable and dataset, based on the dimensions, aggregations, and statistics that are relevant. The common dimensions are used here for building SDQuery queries that aggregate the necessary data before the merge process begins, which improves performance significantly.
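A minimal sketch of the common-dimension computation (line 9 of Algorithm 1) and of the per-file decision of whether an aggregation is needed is given below; the metadata layout and function names are hypothetical, and the values mirror the repository example discussed next.

# Minimal sketch of FindCommonDimensions-style logic.
def find_common_dimensions(variable, datasets_meta):
    """datasets_meta: list of {'file': str, 'dims': set} entries for `variable`."""
    dims = [set(m["dims"]) for m in datasets_meta]
    return set.intersection(*dims) if dims else set()

def needs_aggregation(file_dims, common_dims):
    # A file with dimensions beyond the common ones must be aggregated
    # (e.g., AVG with GROUP BY over the common dimensions) before the union.
    return bool(set(file_dims) - set(common_dims))

meta = [
    {"file": "1.nc", "dims": {"depth", "longitude", "latitude"}},
    {"file": "2.nc", "dims": {"depth", "longitude", "latitude", "frequency"}},
]
common = find_common_dimensions("TEMP", meta)  # {'depth', 'longitude', 'latitude'}
print([(m["file"], needs_aggregation(m["dims"], common)) for m in meta])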

After the execution of the engine, for each variable we insert the variable that was processed and all the queries that were produced for it into the Map (implemented as such for efficiency) that is returned to the function caller. This allows the function invoker to easily find and separate the queries by the variable given in the query itself.

Figure 2.3: DSDQuery Repository Structure Example

In Figure 2.3, we present an example repository that has three files, two of which have the variable TEMP, while the last one does not. The first demonstrated file, 1.nc, has three dimensions for the variable TEMP: depth, longitude, latitude. The second file has four dimensions for a variable with the same name: depth, longitude, latitude, frequency. For the example query, given the repository described above, the common dimensions are: depth, longitude, latitude. The dimension longitude was added since both files in the repository contain it, while the dimension frequency was not, because it is not contained in all the "TEMP" variables that were retrieved from the metadata repository. Assuming the statistical analysis concluded that all the depths in the first file are lower than 15, the optimized SDQuery request for that file would be:

GET 1.nc(SQL:SELECT TEMP FROM TEMP WHERE latitude < 0)

while the SDQuery request for the second file would be:

GET 2.nc(SELECT AVG(TEMP) FROM TEMP WHERE depth ≥ 15 and

latitude < 0 GROUP BY depth, latitude, longitude).

Notice that an aggregation is required in this query since the dimension frequency is not among the common dimensions, and therefore the variable in this file cannot otherwise be unionized with the variable in the previous file.

The third file is not requested at all since it does not contain the variable TEMP.

A query such as SELECT TEMP FROM TEMP WHERE FREQUENCY = 1 would query only the file 2.nc, while a query such as SELECT TEMP FROM TEMP WHERE NO = 1 would produce an empty dataset. The content of the repository determines the content of the output, and therefore the same query on different repositories may produce not only different content, but also different data structures.

In the process of query generation, as demonstrated in the example above, we use the collected statistics to estimate whether, and how many, results are expected. If no results are expected, a query is not generated at all. If results are expected, the query can be optimized based on the collected statistics. In some cases the data in the repository suffices and a query does not need to execute on the actual dataset at all, since the statistics evaluation phase would return the requested data.

Figure 2.4: DSDQuery System Architecture: Bridger and DSDQuery DSI

2.3.3 System Implementation

In Figure 2.4 we show our system architecture. The system is constructed from two parts: Bridger and DSDQuery. Bridger's responsibility is to detect and extract metadata and add it to the system's repository. DSDQuery's responsibilities are receiving user requests, executing each user's query, and sending the results to the user using the embedded transportation service.

Bridger

Bridger maintains the system's repository. Since array databases are stored in a file system, changes can occur only when a file is created, modified, or deleted. For detecting when a file is added, we add inotify listeners [80] to each directory the system is configured on. The system can be configured to automatically, and recursively, add listeners to subdirectories of the configured directories, either on initialization or as these subdirectories are created. When a file system event occurs, the system determines whether the detected file is relevant by checking if it is a metadata file (Subsection 2.3.3) or if there is a plugin that can be used to load the file.

Once a relevant file is detected, the system uses the file-type plugin to extract the file metadata and to extract statistical information with the file-type-specific statistical engine. All the extracted data is unified and normalized for storage in a relational SQLite [93] database.

Additional Metadata

The in-file dataset metadata is often not sufficient for semantic querying using a declarative query language, since the data is brought out of context when only variable and dimension names are used. The ability to enhance the dataset's metadata is therefore necessary. Since array datasets are physically stored in file systems, the file system hierarchy often has a meaning, and therefore multiple types of metadata enhancement are required:

• Enhancing a specific directory, and all of its subdirectories.

• Enhancing a specific dataset, and all of its variables and dimensions.

• Enhancing a specific variable / dimension within a specific dataset.

Each detected file with the extension .metadata adds metadata to the system's repository. The metadata file name, without the extension, determines which directory or dataset the additional metadata applies to. The metadata file is structured as text lines; each line provides a metadata atom to add to the repository, or instructions for the scanning process (such as which subdirectories should be added and whether it should be done recursively and online).

Since SQL supports hierarchical data structures, we added the capability to sub-select based on the metadata by using the prefix metadata.*, where * is the metadata atom name. For limiting the results of a specific array, the keyword metadata is followed by the array name. For example, to limit the variables to those provided by a specific sensor, while giving a meaningful semantic context to the query, one could use the where clause: metadata.TEMP.sensor = 'sensor name'.

DSI Implementation

DSDQuery DSI is implemented as a plugin to Globus GridFTP [5], in the same way the precursor system SDQuery [114] was. If an incoming request is not a DSDQuery one, DSDQuery DSI tunnels it to the relevant FTP engine. If the request is a DSDQuery request, detected by the string DSQL: immediately preceding a valid SQL statement, the engine extracts the SQL statement and applies the algorithm described in Section 2.3.2 to analyze the query, locate the relevant array datasets, and generate the queries that should run on each individual dataset. The output of Algorithm 1 is a map from each variable that should appear in the resulting array dataset to a set of SDQuery queries whose union, and possibly an aggregation over them, would fill that variable.

For example, the following FTP GET call:

GET /(DSQL:SELECT TEMP FROM POP

WHERE TEMP GE 8.1 AND TEMP LT 10.2) would obtain from the process above a set of SDQuery queries similar to:

GET /home/user/POP.nc(SQL:SELECT TEMP FROM POP WHERE TEMP GE

8.1 AND TEMP LT 10.2)

The generated queries differ in the location and name of the dataset that should be used, and in some cases in their aggregations and subsetting, as described in Section 2.3. Next, for each expected variable and dataset we call the SDQuery engine with the given query. We modified the SDQuery engine to return the variable metadata and, for each returned value, its coordinates – data that was missing in the original SDQuery engine. As mentioned before, the performance of the query execution is dependent on the SDQuery engine – we did not modify its query execution itself.

The results of each SDQuery query are unionized and aggregated into an intermediate result set file. This file holds, for each output variable, its metadata, its source files, and the result set in a coordinate format: ((x,y,z,value)).

A result file for each of the example queries in Section 2.2.2 would look like:

TEMP, float,

3, depth, float, longitude, float, latitude, float,

DS1, 1, (x,y,z,val),

DS2, 2, (x1,y1,z1,val1), (x2,y2,z2,val2).

The first line gives the variable name and type, followed by the number of dimensions and then the dimension names with their matching data types. Afterwards, for each dataset that returned results, we have the dataset name, followed by the number of results and the results themselves. As shown, the results are kept in the coordinate format within the intermediate file. There is one file for each variable; this file includes the data from all the datasets, and each file can be built in parallel. After the intermediate files are generated, we use the requested output format to decide whether and how to aggregate the data.
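The intermediate file layout just described can be sketched with a small writer; the exact on-disk grammar DSDQuery uses is simplified here, and the function name and values are illustrative.

# Illustrative sketch of writing one intermediate result file in coordinate format.
def write_intermediate(path, var_name, var_type, dims, per_dataset_results):
    """dims: list of (dim_name, dim_type); per_dataset_results:
    dict mapping dataset name -> list of (x, y, z, value) tuples."""
    with open(path, "w") as f:
        f.write(f"{var_name}, {var_type},\n")
        f.write(f"{len(dims)}, " + ", ".join(f"{n}, {t}" for n, t in dims) + ",\n")
        for ds_name, rows in per_dataset_results.items():
            coords = ", ".join(str(r) for r in rows)
            f.write(f"{ds_name}, {len(rows)}, {coords}\n")

write_intermediate(
    "TEMP.intermediate", "TEMP", "float",
    [("depth", "float"), ("longitude", "float"), ("latitude", "float")],
    {"DS1": [(0, 1, 2, 9.5)], "DS2": [(0, 0, 1, 8.3), (1, 2, 0, 10.1)]},
)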

While converting and reformatting the intermediate file to the output format, we perform additional aggregations if required. The aggregations over dimensions are performed at the SDQuery level, which creates a unified format for each variable, yet additional aggregations might be needed in the case of joins or SET operators. As discussed before, the default aggregation function, average, can be changed in the SQL statement. The resulting dataset is sent to the user using the

GridFTP standard parallel file sending mechanism.

2.4 Evaluation

We evaluate DSDQuery DSI's overall performance. Our emphasis is on testing the ability of the implementation to "scale". There are several dimensions along which the scalability of the system can be measured. We first consider query execution with an increasing number of files, with queries designed in such a way that only one file contains the queried variable. The goal here is to see how efficient the processing of metadata is.

In the second set of experiments, we consider query execution with an increasing number of files, but unlike the previous experiments, all of the files contain the queried variable. Within this set of experiments, the output variable sizes vary. Further, we consider both dimension-based and value-based subsetting in the queries.

The engine is implemented in such a way that although we experiment on queries with one variable that is distributed over multiple files, the experiments cover queries over multiple variables. The engine that merges the results currently works linearly for the data collection, and therefore the relevant FOR loops behave the same for a variable that is spread over two files and for two different queried variables (even if these reside in the same file).

We use two simple queries – a query that selects data by the variable value; and a query that selects data using two out of the three dimensions. The first query is SELECT TEMP FROM TEMP WHERE TEMP ≤ X – we change the value of X to vary the selectivity. The second query is SELECT TEMP FROM TEMP WHERE depth ≤ 10.0 and longitude > 250.0. Here, the values of the depth and the longitude change based on the datasets, to maintain a small selectivity of 0.5%.

A set of NetCDF files, each containing 500 MB of data of which the queried variable is 100 MB, was used for our experiments.

All experiments were performed on servers with an 8-core, 2.53 GHz Intel(R) Xeon(R) processor and 12 GB of memory. We used The Ohio State University Computer Science and Engineering RI cluster for these experiments. Each experiment was executed three times, after a system warm-up run, and for each experiment the reported results are the averages among these runs. Because we did not witness high standard deviation values in any of the experiments, we do not report them. The reported results are measured at the client end, which ran on the same machine as the server, and include sending the request, the processing by the server process, and the transportation of the results back to the client.

Figure 2.5: DSDQuery's Metadata Processing Overhead

2.4.1 Metadata Processing Overhead

In this experiment we increase the number of files in the system; the added files do not contain the queried variable. The resultset size does not change, and is 8.4 KB for both queries – a selectivity of 0.005%. The results are shown in Figure 2.5, where the X-axis is logarithmic and the Y-axis is linear. The increase in execution time is modest, and in fact almost non-existent until the number of files becomes quite large. This shows that our system is able to process additional metadata efficiently.

2.4.2 Increasing Number of Files and Distributed Variable

In this experiment we increase the number of files that contain the queried variable (this is equivalent to increasing the number of variables queried, as explained before). The resulting file size increases linearly with the number of queried datasets. The selectivity of the first query, which queries by value, is 20%. Although not all the content of the variable is returned, all the dimensions are created, forcing the output variable size to be the same as that of the queried variable before subsetting. For the second query, which queries by dimensions, the selectivity value is very low, i.e., 0.1%. The output size of the first query, including its dimensions, is 100 MB per file. The second query outputs 557 KB for each file.

In Figure 2.6, we show the effect of increasing the number of files on the performance. As a reference, the size of the result file is also shown in Figure 2.7. Both axes are logarithmic in both figures. We can see that the system behaves as expected – the processing time is linear in the number of files touched (when a similar amount of data is extracted from each file). We can also see that the duration of query processing is unrelated to the query selectivity – both queries take nearly the same time, yet the output sizes differ by orders of magnitude: the duration of the operations performed by our engine, which is dependent on the data size, is negligible compared to the duration of the NetCDF interface calls, which is nearly the same in the two different scenarios.


Figure 2.6: DSDQuery Performance for Querying Increasing Number of Files


Figure 2.7: DSDQuery’s Output Resultset Size, as Data is Queried from Increasing Number of Files


Figure 2.8: DSDQuery Response to Querying of Changing Selectivities – Dimension-Based Queries

2.4.3 Impact of Selectivity

We continued the previous experiment (varying the number of files that contain the variable) by further modifying the selectivity of the queries. We executed experiments with varying selectivities for dimension-based queries and for value-based queries.

Using a series of queries where dimensional ranges are varied to control the selectivity, we measure the output file size and the query execution performance. We show in Figure 2.8 the performance for each selectivity level by number of files, in a logarithmically scaled graph. The output sizes are linear in both selectivity and the number of files. The query duration increases linearly within each selectivity, and the difference between the selectivities is linear as well: the query duration for 0.1% is about 10 times shorter than for 1%, and so on.

When a query's subsetting is by value, the output files can have different dimensions. Next, we study the effect of the selectivity on the file size and on the duration of the query execution. As mentioned before, the variable size in each of our files is 100 MB. When selecting 1%, about 1 MB of data can be expected – however, it turns out to be more complicated, because of the coordinates that are added as a result of the dimension remap. For example, when selecting two values that are at the coordinates (0, 0) and (1, 1), the coordinates (0, 1) and (1, 0) have to be created as well (the value for these will be NULL).

  S \ F          1      2      4      8      16
  0.1% (0.02%)   20K    701K   12M    83M    600M
  1%   (0.2%)    17M    123M   373M   797M   1.6G
  10%  (2%)      100M   200M   400M   800M   1.6G

Table 2.2: DSDQuery Resultset Size Based on Selectivity (S) and Number of Files (F) – Value-Based Query

In Table 2.2 we show the output dataset sizes based on the number of datasets being queried and the selectivity. We state the selectivity of the values from the variable, while in parentheses we give the selectivity with respect to the whole dataset file. For a selectivity of 10%, although not all the values are selected, a full dimensional map is created, which leads to a file size of 100 MB – this behavior is consistent with the statistical analysis presented in MR-Cube [89], when adjusted to array data and with the dimension selection process treated as Bernoulli trials. The same happens for a selectivity of 1% when 16 files are touched. The content of each file affects how fast the file size increases: if the queried values are at different coordinates (sparse) in each file, the file size increases faster than if the values are condensed in the same coordinate area.

Figure 2.9: DSDQuery Selectivity Impact on Performance – Value-Based Queries

In Figure 2.9 we show that the growth in execution time across the different selectivities is linear. It is also noticeable that the execution time is nearly the same for all of them. Although a higher selectivity means more values need to be written to the output variable (a very expensive operation in the NetCDF library we use), the time difference is nearly negligible. Overall, this shows that the system can handle growing output sizes efficiently.

2.5 Summary

In this chapter we introduced DSDQuery, a system that allows declarative querying over distributed scientific arrays. We first formalized what a scientific array query is. Afterwards, we defined the expected behavior of querying over distributed data. We presented approaches and algorithms to address distributed array data querying, while describing our system architecture. Last, we evaluated our approaches and showed they are beneficial and efficient.

Chapter 3: Optimization of Join Queries - DistriPlan

In the previous chapter we addressed simple querying (subsetting and simple aggregations) of distributed array data using declarative languages. The introduced engine, DSDQuery, accesses all the files from the server on which it runs and processes data locally. However, scientific data can be stored across geographically dispersed machines, requiring data to be copied to a central machine for processing by systems such as DSDQuery. In addition, scientific data contains many variables, and correlating (joining) the data is essential for advancing scientific research. These raise the need to execute join queries over data that is distributed among multiple machines and repositories. Moreover, unlike simple queries, complex queries (such as join queries) require the execution to be planned carefully before it begins – an operation referred to as query optimization, conducted by a tool called an optimizer.

In this chapter we describe our framework for optimizing join-like operations over multi-dimensional array datasets that are spread across multiple sites. The optimization methodology includes enumeration algorithms for candidate plans, methods for pruning plans before they are enumerated, and a detailed cost model for selecting the best (cheapest) plan. While in this chapter we focus on the performance optimization of join queries, in the following chapter we address their execution.

3.1 Background

3.1.1 Distributed Array Data Joins

Join operations help compare data across multiple relations, and have been extremely common in the relational database world. In scientific array data analysis, joins are also essential for analyzing data and confirming hypotheses. As an example, consider a simple hypothesis such as "when wind speed increases, the temperature drops". Verifying this hypothesis using climate simulation outputs involves multiple joins across different datasets (because scientists commonly program these operations by hand, they often do not directly identify or recognize that a join was used). For detecting the change (increases/drops), each relation has to be compared with a subset of itself, i.e., we perform what is referred to as a self join. The pattern detection, i.e., relating the wind speed to the temperature value, is a regular join across both relations.

Formal Definition

The definition provided here is very similar to the one found in the relational database literature and only differs to match the domain it is imposed on. Here, we address dimensions and values instead of columnar values. In addition, aggregations are an inherent part of scientific array data querying (including joins). In real scientific settings, queries rarely avoid aggregations. Dimension reduction is conducted using aggregations – the collected data includes a higher-dimensional setting than the one used in queries, resulting in the common use of aggregations. For example, when querying the average temperature we reduce multiple dimensions, such as time and sampling depth. Therefore, our engine uses aggregations by default, and our implementation minimizes their cost by integrating aggregations into the join operator.

Formally, the operator ⋈ signifies a join – A ⋈_C^G B joins the relations A and B based on the set of conditions C, using the aggregation function G. C is a concatenation of conditions of the form A′ = B′, connected using ∧ (and) or ∨ (or), where A′ and B′ can be either a set of dimensions or the relation names themselves (the latter being referred to as joining by value). G (in the superscript) controls the aggregation function used for the join; if no aggregation function is mentioned, the default function, used when necessary, is AVG (average).

If C is not mentioned, i.e., the operation is A ⋈ B, the common dimensions of both relations are joined based on their names, while the rest of the dimensions are aggregated using an aggregation function. On the other hand, when the join explicitly states certain dimensions, an aggregation over the non-mentioned dimensions is expected. One can use the rename operator, ρ_{from,to}, which renames a dataset, variable, or dimension, to force a name match when necessary. It may also be needed to join only by values (without limiting the dimensions), an operation we mark as ⋈̄, which is similar to a Cartesian product.
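A minimal pandas sketch of an array join with the default AVG aggregation over a non-joined dimension is given below; it uses pandas rather than our engines, stores each relation in coordinate form, and the data values are illustrative.

# Minimal sketch of A ⋈_{A(d1)=B(d1)} B with AVG over the non-joined dimension d2.
import pandas as pd

# Each relation in coordinate form: one row per cell.
A = pd.DataFrame({"d1": [1, 1, 2, 2, 3, 3], "d2": [0.5, 1.0, 0.5, 1.0, 0.5, 1.0],
                  "val": [1.0, 2.0, 2.0, 3.0, 3.0, 4.0]})
B = pd.DataFrame({"d1": [2, 2, 3, 3, 4, 4], "d2": [1.0, 1.5, 1.0, 1.5, 1.0, 1.5],
                  "val": [3.0, 4.0, 4.0, 4.0, 1.0, 4.0]})

# Aggregate away the non-joined dimension d2 with AVG, then join on d1.
A_agg = A.groupby("d1", as_index=False)["val"].mean()
B_agg = B.groupby("d1", as_index=False)["val"].mean()
joined = A_agg.merge(B_agg, on="d1", suffixes=("_A", "_B"))
print(joined)   # rows only for the d1 values common to A and B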

In Figure 3.1 we translate the relational algebra query A ⋈_{A(d1)=B(d1)} B to SQL [84]. In Figure 3.2 we show a walkthrough of this query's execution. The common values along the joined dimension, d1, are 2 and 3. Since the other dimension in this array, d2, does not appear in the join criteria, aggregation has been used for it (as emphasized in the declarative query). An additional dimension has been added to record the data source, referred to in the figure as d3; in the relational model this dimension is translated to two columns. Another option, instead of a dimension, is to use a structure holding both values as the variable data type.

Figure 3.1: Query for Join Walkthrough Example

In Figure 3.3 we show another SQL example for a subset of the query stated at

the beginning of this section, demonstrating self joins. The SQL shown is equiv-

alent to the request: “Return the temperature difference for each day from its previous day”. In the query, we first rename both relations from TEMP to A and B – this is

done since this query is a self-join and therefore we need to be able to address both

relations separately. We can look at relation A as if it represents a specific day's temperatures, while B represents the previous day's temperatures. Therefore, the output

is, for each distinct latitude and longitude, the value of today’s temperature minus

yesterday’s temperature, as requested.

Figure 3.2: A Walkthrough of the Join Process

Figure 3.3: Example Query: Simple Join
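Since Figure 3.3 itself is not reproduced here, the following is a plausible pandas rendering of the self-join just described; the column names, the day alignment, and the data are illustrative assumptions rather than the dissertation's SQL.

# Plausible sketch: day-over-day temperature difference per (latitude, longitude).
import pandas as pd

TEMP = pd.DataFrame({
    "day":       [1, 1, 2, 2, 3, 3],
    "latitude":  [10.0, 20.0, 10.0, 20.0, 10.0, 20.0],
    "longitude": [50.0, 60.0, 50.0, 60.0, 50.0, 60.0],
    "temp":      [14.0, 18.0, 15.5, 17.0, 13.0, 19.0],
})

A = TEMP.rename(columns={"temp": "temp_today"})             # rename: TEMP -> A
B = TEMP.rename(columns={"temp": "temp_yesterday"}).copy()  # rename: TEMP -> B
B["day"] = B["day"] + 1                                     # align B to the next day

diff = A.merge(B, on=["day", "latitude", "longitude"])      # the self join
diff["delta"] = diff["temp_today"] - diff["temp_yesterday"]
print(diff[["day", "latitude", "longitude", "delta"]])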

3.1.2 Existing Approaches

Today, queries over distributed data are executed in one of several ways. First, scientists often simply copy all relevant data from the repositories in which they are stored to their local environment and process it using existing tools – a solution very unlikely to remain feasible as data sizes increase. Writing a workflow engine can be another option, but details of data movement, partitioning of the work, and its placement are then handled by the developer. Moreover, individual operations in a workflow are still written in low-level languages.

Other approaches use ideas from the database (DB) domain to optimize overall query performance. Query optimization has been thoroughly researched in the domain of relational data [34]; however, these systems work on data at a central location or within a cluster (with all nodes connected by a Local Area Network – LAN). For tightly coupled settings, distributed query plans use Rule Based Optimizers (RBOs) [9, 69, 92, 138]. RBOs build execution plans from parsed queries based on a pre-defined set of rules or heuristics. Heuristics are used to allow taking optimization decisions for each node of the parsed query tree separately, i.e., a local decision. An example of such a heuristic is: "execute each query in the most distributed manner possible until data unionization or aggregation is necessary". Some of the DB query optimization approaches have been extended to geographically distributed data [9, 27, 44, 47, 114, 124, 125]. These extensions often use a heuristic to transfer all data to a central site and process the query there. None of these extensions considers data shuffle as part of the query optimization process, assuming the most-distributed heuristic cannot be improved.

Data processing has increasingly moved towards using implementations of the

Map-Reduce concept [40]. Most systems in this space are limited to processing within a single cluster, where network latency is predictable and low. Note that joins can be supported over such systems using a high-level language, like in

Hive [115]. In such cases, join optimization [2, 123] is based on a set of rules or heuristics that is not sufficient to fully optimize geo-distributed join queries, as will be shown later. MapReduce systems have also been extended to geographically distributed data [62], but we are not aware of any work optimizing the join operation in such settings, which requires careful attention. Stratosphere [4] offers a Cost Based Optimizer (CBO) for join queries. The CBO there optimizes the operator order, and not the distribution of work, which is handled with the simple heuristic mentioned in the previous paragraph – the more the parallelism, the faster a query should execute.

3.1.3 Execution Plans

An execution plan, or simply a plan, is a tree representation that contains processing instructions to an execution engine for producing the correct query results. While the SQL handles the "what", the plan targets the "how". Plans contain a hierarchical ordering of operators, each of which has at most two child nodes. When the plan tree is followed from the bottom to its top, the intended query results are produced. In Figure 3.4, we show two possible trees for the query A ⋈ B ⋈ C (we omitted the (A ⋈ C) ⋈ B option for brevity). Each plan shows a different ordering of operations that produces the intended results.


Figure 3.4: Non-Distributed Execution Plan Samples

Optimizing (or choosing) query plans, and especially data join optimization,

has been extensively studied in the database community. The distinct part of our

work is optimizing data joins when the data and execution are geo-distributed –

for example, a situation where cluster one contains the temperatures in Canada

and the U.S., while cluster two contains the temperatures elsewhere.

Our distributed plan has additional information within each node, as well as additional nodes, which provide the necessary information for parallel and distributed execution of queries. In Figure 3.5 we demonstrate two different distributed execution plans for the simple query A ⋈ B, where A is an array distributed over three nodes, 1, 2, and 3, while B is an array distributed over two nodes, i.e., nodes 3 and 4. Plan 1 utilizes the most parallelism possible in this case – 3 nodes process data in parallel, and afterwards all data are sent to node 1 where they are accumulated and unionized (unionize means combining multiple datasets into one).


Figure 3.5: Distributed Execution Plan Examples

Plan 2 demonstrates the other extreme, in which all the data is copied to one node, which then processes it. Since only one node processed the data, there is no need to unionize data. These are just the two extreme plans, among many possible options.

More broadly, distributed execution plans need to represent parallel execution correctly, including the representation of data communication among the nodes, unionization, and synchronization. There can be many options for such distribution. As one extreme, all the datasets' content is collected at a central node and joined there (as presented in Figure 3.5, Plan 2). After the join is processed, the results are tunneled either to the client, if this is the final plan step, or pipelined to the next step defined by the query execution plan. Another extreme can be as follows: since a join is performed between two relations, we choose one relation that we refer to as the internal relation (either relation can be the internal one; we choose it based on the cost model presented later). The internal relation is kept stationary, while the other (external) relation is sent across the network to all the nodes that contain the internal relation. Subsequently, each receiving node executes the join(s) and the results are sent forward according to the execution plan. We note that some machines may contain many datasets (a common practice in real environments), and therefore one node may communicate with another multiple times in this approach. Avoiding this behavior introduces a third option: we force each machine to transport at most one dataset of each relation it holds, by unionizing the multiple datasets each machine has before processing. The rest of the processing is done similarly to the previous method – the sample in Figure 3.5, Plan 1, demonstrates such a plan. The advantage of this approach is that it decreases the number of times data movement occurs and still allows parallel execution of joins.

3.2 Plan Selection Algorithms

3.2.1 Query Execution Plans

Formally, an execution plan for a given query is a tree representation of the query, where each node has at most two children, left and right. Each tree node represents either a source (relation, array, or dimension) or an operator. Each node outputs a data stream, and each operator node receives up to two incoming data streams. In the case a node represents an operator, the operator applies to the node's inputs.

In extending execution plans to distributed ones, additional operators are introduced (an example plan has already been introduced in Figure 3.5). We introduce three new node types: Sync, SendData, and Union. Sync synchronizes the execution of its dependent nodes by waiting for all related sync nodes to start execution. All machines executing a sync node will start its execution at different times, but will complete it at the same time, resulting in their dependent nodes beginning execution at the same time. In many cases synchronization is implicit and partial. A SendData node implies that machines that execute the child nodes need to send the produced results to a set of nodes. Thus, this operator achieves data distribution from a set of machines that produced a dataset to a (possibly distinct) set of machines that will later execute operations on that data. Union nodes are used to accumulate distributed data received from their children.

Each node has a tag that holds information needed for the operator execution.

The tag includes what type of node it is, what subsetting conditions it executes

(if applicable), statistical information, and operator specific data. For example, a

SendData node’s tag contains which data is sent, where from, and to which node.

3.2.2 Plan Distribution Algorithm

Our goal is to create an efficient Cost Based Optimizer (CBO) to find the optimal distribution of a query. A CBO is an alternative to a Rule (or Heuristic) Based Optimizer (RBO). RBOs are unlikely to build optimal plans in this case due to the complexity of our queries and the diverse set of environments where they may be executed. Note that in a CBO, the optimizer produces a cost value that aligns with execution time, and thus helps to choose the lowest-cost plan. CBOs do not aim to predict the actual execution time.

For implementing a CBO we first need to span, or enumerate, different execution plans and subsequently evaluate these plans' costs. We enumerate the plans using a two-step process: (1) choosing between different orderings of operators, leading to a set of non-distributed plans – these are built using a simple RBO since we span all options here; (2) enumerating all possible distributed plans corresponding to each non-distributed plan, each of which involves different choices for where the data is processed and the required data movements. An advantage of this two-step method is that producing all non-distributed plans has been well researched previously [34]. Therefore, we focus here on the second step.

3.2.3 Pruning Search Space

The key challenge we face is that the number of distribution options for a given (non-distributed) plan can be extremely large. When all options of data sending are considered, a blowup of options occurs. For example, if n nodes are involved, n^n options of data movement exist. However, not all options have different costs or are even sensible candidates. As a simple example, one server can send its data to a neighboring node, accept data from the other node, or even do both (both nodes swap their data in this last setting). Given this, we must be able to prune the search space efficiently.

Our first observation is that many of these options are essentially repeats of each other. For example, assuming homogeneity of nodes for simplicity, and given data distributed on 3 nodes, processing on 2 nodes should cost the same irrespective of which two nodes are used. Although it is not always guaranteed that the nodes' computing power and data sizes are homogeneous, within scientific data environments those assumptions often hold. We form the following two rules:

Rule 1: A node can receive data only if it does not send any data. This rule prevents two nodes from swapping data with each other. The following is an analysis of the reduction in spanning options. Out of the n nodes that have the data, we pick i nodes that will receive data. There are n − i nodes left, which send their data to all options of the chosen i nodes. The summation over all of these reduces the number of options that need to be spanned.
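To make the counting argument concrete – this is our restatement of the sentence above, not a formula appearing elsewhere in the text – if every non-receiving node sends its data to exactly one of the i chosen receivers, the number of candidate send patterns under Rule 1 is

\sum_{i=1}^{n} \binom{n}{i} \, i^{\,n-i},

which is considerably smaller than the n^n unrestricted options.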

Intuitively, a plan in which one node processes another node's data, while the other node processes the first node's data, is more expensive than a plan in which the nodes do not swap the data. Consider the cost of processing two datasets when node 1 and node 2 both process their own data, assuming the data sizes are the same, versus when node 1 processes node 2's data and vice versa. The processing cost is the same, yet the data transportation, which does not exist in the first setting, exists in the second. The full scenario is more complex, yet this consideration helps clarify why swapping data is more expensive than keeping data stationary. Based on the above discussion, the optimal plan cannot be pruned by Rule 1, as shown in Theorem 1.

Theorem 1. Rule 1 does not prune the optimal plan.

Proof. We prove this claim by contradiction. Assume all machines have the same computing power (as justified before) and that the optimal plan contains a node receiving data while sending its own. Two options arise: (1) the number of machines processing the query is the same after the data transportation, or (2) the number of machines has decreased. If the number of machines is the same, data transportation was added but the same computational power is used; a similar plan without the data swap must be cheaper. If the number of machines processing the query has decreased, we have both reduced the available computational power and added data transportation overhead; again, a similar plan without the data transportation must be cheaper. In both cases, we found a cheaper plan that does not include a node sending its data while receiving another node's data; therefore the original plan was not optimal, and a plan with a node that sends its own data while receiving another node's data can safely be pruned.

Rule 2: Isomorphism Removal. Rule 1 prunes options which are obviously more expensive than other plans, yet many of the remaining plans are still isomorphic to one another. Clearly, there is no particular advantage in choosing one plan over another if they are isomorphic.

We avoid the generation of isomorphic plans using the following approach. First, we assume that nodes are ordered and ranked by a unique identification number. With that, we require that a higher-ranked node receives data from at least as many nodes as its lower-ranked neighbor does. The nodes are ranked so that the algorithm provides consistent results and to enable addressing non-homogeneous environments in the future; the ranking can be dynamic, and can be used to address additional challenges (e.g., non-homogeneous data sizes, by assigning larger chunks of data to stronger nodes). For example, if the first node, assumed to have the highest rank, receives data from 4 nodes (including itself), the second node can receive data from at most 4 nodes, and so on. The optimal plan is again not pruned since all isomorphic plans evaluate to the same cost – the cost of isomorphic plans is the same since machines are assumed to be homogeneous (have the same processing power), are connected to the same network, the size of the processed data is the same, and they perform the same calculation.

An analysis of Rule 2 shows it prevents isomorphism. Assume we have n nodes and d datasets, where n ≤ d. Generically, each of the n nodes processes at least one dataset and all the nodes together process the d datasets. Represent this information in an array where the index of the array represents the node id and the content is the number of datasets processed on this node, for example [1, 3, 2, 5, 4] (n = 5, d = 15). Consider the setting after sorting the array ([5, 4, 3, 2, 1] for the example above) – the sorted setting satisfies Rule 2. Therefore, for any distribution setting chosen, there is a setting isomorphic to it that satisfies Rule 2. The approach that issues such plans is referred to as the reverse waterfall, and we discuss it in more detail later.

For example, assume array A is distributed over nodes ranked 1, 2, and 3. In Figure 3.6 we demonstrate all possible distribution options following the given rule. Notice that any other option, given the nodes' ranking, would either not make sense in a non-homogeneous setting, or would repeat an already existing option when nodes are homogeneous. E.g., consider the semi-parallel option of sending the data from node 2 to node 1 instead of from node 3, while node 3 keeps its data. This would either imply that using a weaker node to process the same amount of data is preferable, or be equivalent to the semi-parallel option presented.

Algorithm2 enumerates all distributed plans for a given non-distributed plan node. The input to this algorithm is a list of ranked nodes (ordered in an array),

Figure 3.6: Distribution Options for an Array Distributed over 3 Nodes (Using DistriPlan Rule 2). The figure shows the fully-parallel, semi-parallel, and non-distributed execution options for array A on nodes 1, 2, and 3.

The goal of the algorithm is to return all distribution options for populating each distribution node's tag. The algorithm iteratively builds all the options for sending data, by enumerating all options for processing nodes. At a high level, the algorithm begins by enumerating all options for the number of nodes processing data (from 1 to the number of nodes). For each of these numbers, the algorithm iteratively builds all the options for distributing the data, going from the most distributed approach to the least. More specifically, i represents the number of nodes processing data, i.e., by Rule 1, maintaining their data locally. The algorithm spans all options for i between 1 and the total number of nodes. In lines 5-7, we build a base option for the current i, which is the option where each "free node" (a node that is not processing its own data) sends its data to the first node. In line 8, index j represents the number of nodes that receive data from other nodes (for the base option, j is 1). In line 10 we define k to be the number of nodes not sending data to the first node. In line 15, all options spanned by Algorithm 3 are added (based on the sending/receiving pattern given by the resulting array) – we send data to higher ranked nodes from the lower ranked ones, ensuring an optimal plan will not be pruned.

This algorithm provides all distribution options for a specific SendData node.

Each option is used to fill a SendData node’s tag.

In Algorithm 3 we use a reverse waterfall method to create all sending options.

This algorithm outputs an array of numbers, each of which represents how many free nodes need to send data to the matching node. In general, we first determine the most distributed setting by initializing an array to contain the largest number of nodes each receiving node may receive data from (line 7). Then, in lines 11-30, we bubble up nodes by following Rule 2 – the number of nodes sending data to the lower ranked nodes is systematically decreased while that of their higher ranked neighbors is increased.

For example, if there are 4 nodes, and 2 nodes are set to receive data, the reverse waterfall will create the following arrays: [1, 1, 0, 0] and [2, 0, 0, 0], marking that for the first array both nodes receive data from one node each, while for the second array, one of the nodes sending data to the second node has been bubbled up, resulting in the first node receiving data from both free nodes. We call this method the reverse waterfall since one can view the numbers as flowing upward: whenever a number to the right decreases, the number to its left increases. Notice each array represents a unique option, since we always send data from the "weakest" node to the "strongest" one; an array such as [1, 1, 0, 0] means the lowest ranked node, 4, sends its data to the strongest one, node 1, while node 3 sends its data to node 2.
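To make the reverse-waterfall enumeration concrete, the following is a minimal C++ sketch (not the optimizer's actual code) that lists, for a given number of free nodes and receiving nodes, all non-increasing receive-count arrays such as [2, 0] and [1, 1] from the example above; here the array covers only the receiving nodes (the trailing zeros of [1, 1, 0, 0] are omitted), and names such as enumerateReceiveCounts are illustrative only.

#include <algorithm>
#include <iostream>
#include <vector>

// Recursively enumerate non-increasing receive-count arrays:
// 'remaining' free nodes still to assign, at most 'maxPerSlot' to the
// current slot (Rule 2: a node never receives from more nodes than a
// higher-ranked node does).
void enumerateReceiveCounts(int slot, int remaining, int maxPerSlot,
                            std::vector<int>& counts,
                            std::vector<std::vector<int>>& out) {
    if (slot == (int)counts.size()) {
        if (remaining == 0) out.push_back(counts);
        return;
    }
    for (int c = std::min(remaining, maxPerSlot); c >= 0; --c) {
        counts[slot] = c;
        enumerateReceiveCounts(slot + 1, remaining - c, c, counts, out);
    }
    counts[slot] = 0;
}

int main() {
    int freeNodes = 2, receivingNodes = 2;   // the 4-node example: 2 receivers
    std::vector<int> counts(receivingNodes, 0);
    std::vector<std::vector<int>> options;
    enumerateReceiveCounts(0, freeNodes, freeNodes, counts, options);
    for (const auto& o : options) {          // prints [2,0] and [1,1]
        for (int c : o) std::cout << c << ' ';
        std::cout << '\n';
    }
}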

Algorithm 2 DistriPlan – Build Plans Without Repetitions
 1: function DataSendWithoutRepetitions(nodeList)
 2:     options ← ∅
 3:     n ← number of nodes (receiving data)
 4:     for i ← 1..n do                              ▷ i – number of nodes not sending data
 5:         ∀j: baseOption.from[j] = j               ▷ Initialize senders
 6:         ∀j ≤ i: baseOption.to[j] = j             ▷ Send to itself
 7:         ∀j > i: baseOption.to[j] = 1             ▷ All free nodes send to the first node
 8:         for j ← 1..min(i, n−i) do
 9:             currOption ← duplicate(baseOption)
10:             for k ← 1..(i−1) × ((n−i)/i) do
11:                 hasToGet ← min(k, i−1)
12:                 ∀t = 0..hasToGet−1: currOption.to[i+t] = t+1
13:                 for l ← 1..(k−hasToGet) do
14:                     ar ← BuildArraysOfOptions(n, j, l)
15:                     options ∪= span options based on ar
16:                 end for
17:             end for
18:         end for
19:     end for
20:     return options
21: end function

3.3 Cost Model

Cost models for facilitating query optimization in a non-distributed environment are well researched [17, 34, 50, 61]. Enabling a Cost Based Optimizer (CBO) for distributed settings involves a number of additional challenges, particularly capturing network latency and bandwidth. As mentioned before, the cost is calculated for the query, including all its terms and operators, with the goal that when comparing multiple plans, the plan with the lowest cost will also execute the fastest.

Algorithm 3 DistriPlan – Build arrays to spread data holding Rule 2
 1: function BuildArraysOfOptions(TotalNodes, ReceivingNodes, NodesToAssign)
 2:     ▷ How many nodes will be spread
 3:     for i ← 1..NodesToAssign do
 4:         actualRcvNodes ← min(i, ReceivingNodes)
 5:         ▷ How many nodes receive data
 6:         for j ← 1..actualRcvNodes do
 7:             ar ← mostDistributedOption
 8:             arRet ∪= ar
 9:             lastPopulated ← LargestNonZeroIndex(ar)
10:             ar ← duplicate(ar)
11:             while lastPopulated != 0 do
12:                 while ar[lastPopulated] != 0 do
13:                     t ← lastPopulated − 1
14:                     ar[t+1] ← ar[t+1] − 1
15:                     ar[t] ← ar[t] + 1
16:                     while ar[t] > ar[t−1] do
17:                         t ← t−1
18:                         if t == 0 then
19:                             break
20:                         end if
21:                         ar[t+1] ← ar[t+1] − 1
22:                         ar[t] ← ar[t] + 1
23:                     end while
24:                     if t == 0 then
25:                         break
26:                     end if
27:                     arRet ∪= ar
28:                 end while
29:                 lastPopulated ← lastPopulated − 1
30:             end while
31:         end for
32:     end for
33:     return arRet
34: end function


To motivate the need for a nuanced model, consider the following example.

Suppose there are two Value Joins (⋈̄) with a selectivity of 10%, over three relations (A, B, and C), each containing 100 tuples. We expect the first join, between A and B, to produce 1,000 tuples, and the second one to produce 10,000 tuples. Assume the data is distributed in the following way: array A is distributed on nodes 1 and 2, array B is distributed on nodes 3, 4, 5, and 6, and array C is distributed over nodes 1 and 2. The plan that involves the most parallelism would require copying array A to the nodes that contain array B, and doing the same for array C. This plan's execution involves: sending data to 4 nodes from 2 (50 tuples), [an implicit] synchronization across 4 nodes, sending data to 4 nodes from 2 (50 tuples), [an implicit] synchronization across 4 nodes, and processing of 27,500 tuples on each processing node. If the same join is executed on nodes 1 and 2, its execution would involve: sending data to 2 nodes from 4 (25 tuples), [an implicit] synchronization across 2 nodes, sending data from 2 nodes to 1 (25 tuples), [an implicit] synchronization across 2 nodes, and processing 55,000 tuples on each node. If a relatively slow wide-area network connects these nodes, the synchronizations and data movement operations can be expensive. Thus, it is quite possible that despite more computation on any given node, the second plan would execute faster.
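One plausible per-node accounting that reproduces these figures – assuming the copied arrays are broadcast in full, each join's work is counted as the product of its per-node input sizes, and intermediate results are split evenly across the processing nodes – is:

Parallel plan, per node: 100_A × 25_B = 2,500, producing 1,000 / 4 = 250 intermediate tuples; then 250 × 100_C = 25,000; total 2,500 + 25,000 = 27,500.
Two-node plan, per node: 50_A × 100_B = 5,000, producing 1,000 / 2 = 500 intermediate tuples; then 500 × 100_C = 50,000; total 5,000 + 50,000 = 55,000.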

The goal of our model is to assess these options and choose the best plan. Note that the cost model itself can be very dependent upon the communication and processing modalities used. We present the model at a high level.

Operator               Cost (C(n))
Projection             E(n) / normalizer
Filter                 E(n → left) / normalizer + Σ_{i ∈ dims} (i → length)
Distribute             Penalty^(numOfNodes−1) × E(n) / (PacketSize × normalizer) + E(n) / normalizer
Join over dimensions   (E(n → left) + E(n → right)) / normalizer
Join over values       (E(n → left) × E(n → right)) / normalizer
Union                  Σ_{k ∈ n → children} (E(k) × k → numOfNodes) / normalizer
Sync                   Penalty^(numOfNodes−1)
Source                 E(n) / normalizer

Table 3.1: DistriPlan: Costs of Operations by Operator

3.3.1 Costs:

We define C(n) as the cost of the node n and E(n) as the size of the expected resultset after the node operator is executed. Costs are evaluated recursively, starting with the root, summing all costs together. Each node has up to two children nodes, denoted by n → left and n → right, where the right child node is populated only for operators that accept two inputs, like joins. We also use n → children to denote both children together, left and right. For each operation, we list its cost in Table 3.1 and its expected resultset size in Table 3.2. The cost of an empty node, C(NULL), is defined to be 0. The selectivity, n → selectivity, is evaluated beforehand, within the RBO, by using techniques established in the literature [37, 56].

Operator               Expected Results (E(n))
Projection             E(n → left)
Filter                 E(n → left) × n → selectivity
Distribute             E(n → left)
Join over dimensions   E(n → children) × n → selectivity
Join over values       E(n → children) × n → selectivity
Union                  Σ_{k ∈ n → children} (E(k) × k → numOfNodes) / (n → numOfNodes)
Sync                   E(n → left)
Source                 n → sourceCells

Table 3.2: DistriPlan: Expected Resultset Size by Operator
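To illustrate how C(n) and E(n) are evaluated recursively, the following is a minimal C++ sketch covering a subset of the operators, following the formulas as reconstructed in Tables 3.1 and 3.2; the node layout, field names, and constants are illustrative assumptions, not the DistriPlan implementation, and the Filter dimension-scan term is omitted for brevity.

#include <cmath>

enum class Op { Source, Filter, Distribute, JoinValues, Sync };

struct PlanNode {
    Op op;
    PlanNode* left = nullptr;
    PlanNode* right = nullptr;
    double selectivity = 1.0;   // estimated by the RBO beforehand
    double sourceCells = 0.0;   // for Source nodes
    int numOfNodes = 1;         // nodes participating in the operator
};

// Illustrative constants; in practice they depend on the deployment.
const double kPenalty = 400.0, kPacketSize = 1024.0, kNormalizer = 1e3;

// E(n): expected resultset size (subset of Table 3.2).
double E(const PlanNode* n) {
    if (!n) return 0.0;
    switch (n->op) {
        case Op::Source:     return n->sourceCells;
        case Op::Filter:     return E(n->left) * n->selectivity;
        case Op::Distribute: return E(n->left);
        case Op::JoinValues: return E(n->left) * E(n->right) * n->selectivity;
        case Op::Sync:       return E(n->left);
    }
    return 0.0;
}

// C(n): cost of the subtree rooted at n (subset of Table 3.1),
// summed recursively over the node and its children.
double C(const PlanNode* n) {
    if (!n) return 0.0;
    double local = 0.0;
    switch (n->op) {
        case Op::Source:     local = E(n) / kNormalizer; break;
        case Op::Filter:     local = E(n->left) / kNormalizer; break;
        case Op::Distribute: local = std::pow(kPenalty, n->numOfNodes - 1) * E(n)
                                       / (kPacketSize * kNormalizer)
                                   + E(n) / kNormalizer; break;
        case Op::JoinValues: local = E(n->left) * E(n->right) / kNormalizer; break;
        case Op::Sync:       local = std::pow(kPenalty, n->numOfNodes - 1); break;
    }
    return local + C(n->left) + C(n->right);
}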

Penalty represents the synchronization and communication overheads. We use the generic term normalizer as a tool to normalize the returned values to a reasonable and usable scale, to ensure no overflow occurs and to decrease the side effects of arithmetic over large numbers. For simplicity, we use a fixed set of penalties in our experiments, whose value depends upon the network configuration we use. This penalty value normally averages the networking delays (mainly latency) across the clusters. In a more heterogeneous environment, where averaging the penalties does not suffice, multiple penalty values can be used.

Filtering: multiple filtering operators are supported in our framework: =, !=, <, >, <=, and >=. Each filter has a different volume or fraction of expected results, which can typically be estimated based on data statistics. In addition, there are multiple ways to scan the data and optimize the query, especially when index structures are available. Dimensional array values are unique, and in most cases are also sorted – both properties can be used for estimating selectivities and the number of data scans.

Distributing: in a distributed plan, we assess the cost of data movement among nodes. The cost of distribution has two components – the volume of data sent and the number of packets that need to be sent. Both are calculated for the evaluation.

Joining: joins can only be run between 2 relations or dimensions at a time. Therefore, when multiple join criteria are mentioned in the join clause, they are nested one after another, a common scenario for array data since these often involve multiple dimensions. Consider joining by value, or a non-contextual join, ⋈̄. The cost would simply be the product of the joined array sizes, while the expected resultset size is the same value multiplied by the join selectivity – assuming a Cartesian-like join is used. When the join is of a different type, the value should be adjusted by an equation controlling how many values are touched, evaluating the cost more accurately for that scenario. If the join is over dimensions, a variable reconstruction might be needed. In this case, the dimensions are joined first, and afterwards, the resulting dimensions are used to subset the variable. For efficiency, this process involves communicating the indices of the values that matched from the nodes executing the join to the nodes holding the variable data. We do not focus on this step within our cost calculation here.

For brevity, we assume merge sort is used for the dimensional joins, since dimensions are very rarely unsorted. The total join time is the summation of both – the set of 1-D dimensional joins and the actual data join. An exhaustive discussion of this issue can be found in the existing literature [53, 106].

3.3.2 Summarizing – Choosing a Plan

For a given query, a rule based optimizer builds all options for a non-distributed plan. Since the number of plans built is small, we distribute each of these plans separately by using the two pruning rules given before (Subsection 3.2.2 and Subsection 3.2.3). The cost of each plan is evaluated by using the cost model presented above, and the cheapest plan is selected for execution.

Since all possible plans are evaluated, and only isomorphic or repeating plans are pruned, the cheapest (optimal) plan is found. In case multiple plans have the same cost, we use a simple heuristic: choose the least distributed plan among these plans. Contrary to expectation, we found that the least distributed plan among equal-cost plans runs somewhat faster, due to slight overheads the cost model does not account for; mainly meta-data communication and data broadcast, which are assumed to be parallelized but are partially executed sequentially due to hardware limitations and the frameworks we use.
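A minimal C++ sketch of this selection step (the types and the processingNodes field are illustrative assumptions; they stand in for whatever measure of distribution the optimizer records):

#include <vector>

struct CandidatePlan {
    double cost;          // evaluated by the cost model
    int processingNodes;  // how distributed the plan is
    // ... the actual plan tree would be referenced here
};

// Pick the cheapest plan; among equal-cost plans prefer the least
// distributed one (fewer processing nodes).
const CandidatePlan* choosePlan(const std::vector<CandidatePlan>& plans) {
    const CandidatePlan* best = nullptr;
    for (const auto& p : plans) {
        if (!best || p.cost < best->cost ||
            (p.cost == best->cost && p.processingNodes < best->processingNodes)) {
            best = &p;
        }
    }
    return best;
}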

3.4 Evaluation

This section evaluates both our plan building methods and the performance of plan execution.

System: we built two systems for the experimentation – a query optimizer and an execution engine. The query optimizer was written in C++, for efficiency, while the query execution engine, which executes plans produced by the optimizer, was written in Java, which allows its integration in many available frameworks. Because of the challenges of performing repeatable experiments in wide-area settings, communication latencies were introduced programmatically. The query optimizer accepts a query as input and, based on the provided algorithms, creates an XML file with the cheapest plan. The query executor reads the execution plan from the XML file, follows the plan for executing the query (communication is implemented by using MPI [35]), and sends the query results to the user in the form of a NetCDF [102] file. The join algorithm itself and the order of execution are dynamic and provided by the optimizer. Because of the properties of scientific array data, dimensional joins are implemented by using sort-merge join, while skipping the sort since the data is already sorted. The array data join is implemented as described in Section 1.2.

Queries: all queries executed by the engine in the evaluation are value joins, ⋈̄, which differ in the number of joined arrays, selectivities, and array sizes. We focus on joins since analytical querying requires full data scans and often does not decrease the array sizes – all other operators are cheaper. For the evaluation, we use at most 5 relations (A, B, C, D, E). Query 1 (Q1) joins 3 tables by using 2 joins: (A ⋈̄ B) ⋈̄ C, Query 2 (Q2) joins 4 tables by using 3 joins: ((A ⋈̄ B) ⋈̄ C) ⋈̄ D, and last, Query 3 (Q3) joins 5 tables by using 4 joins: (((A ⋈̄ B) ⋈̄ C) ⋈̄ D) ⋈̄ E. These three simple queries represent a wide range of real world queries – by using a small selectivity we can simulate subsetting conditions as well as joinability. Since the optimizer is mostly challenged by the number of nodes on which a relation is distributed (and not the number of relations), we report results of joins of up to 5 relations. Using more relations in a query does not affect the feasibility of the approaches presented in this work; specifically, the number of trees that need to be spanned depends on the distribution, not on the number of relations.

Q   N   J1     J2     J3     J4     AVG DS
1   3   5.0%   0.5%   –      –      381 MB
2   4   1.0%   1.0%   0.1%   –      762 MB
3   5   0.1%   0.5%   0.1%   1.0%   40 MB

Table 3.3: DistriPlan Experimental Settings: join selectivities and average processed dataset sizes by query. N – number of tables involved in the join (N − 1 joins), Jn – the selectivity of the nth join, AVG DS – average dataset size on each node.


Data Sizes: the experiments were designed so that each node stores an array with a data size between 8 MB and 800 MB. These sizes were chosen based on real datasets available on the ESGF portal. Also, the reported size is the size of an individual array read from a file (files are larger as they typically host multiple arrays). The array size mentioned is the local portion of each array; when we experiment over n nodes, the total array size is the local one multiplied by n (for example, when experimenting over 800 MB arrays which are distributed over 10 nodes, the complete join is conducted over approximately 8 GB, which is the actual array size). The data we used was generated in a pattern designed to enforce the given selectivities and follows the same patterns as real climate data.

In Table 3.3 we present, for each query, the join selectivities and the average size of the processed arrays. The selectivities are between 0.1% and 5%, values which are commonly observed in Data Warehousing queries [101]. Since data is distributed over multiple nodes, the total data sizes vary for each query and for each array; therefore, we present the average dataset size.

Experimental Setting: all experiments in which we execute the queries ran on a cluster where each node has an 8-core, 2.53 GHz Intel(R) Xeon(R) processor with 12 GB of memory. All the experiments where we focus on building plans were executed on a 4-core, 3.3 GHz Intel(R) Core(TM) i5 processor with 2 GB of memory. All machines run Linux kernel version 2.6. The reported results are over 3 different consecutive runs, with no warm-up runs. Standard deviation is not reported since results were largely consistent. Unless mentioned otherwise, we set the optimizer penalty to be 400, equivalent to the network latency of a WAN – this latency is based on data shown in related work [27, 66, 122].

Since no other environment provides the ability to join distributed scientific array data, we could not compare our engine to any other system. For example, SciDB [24] does not allow running generic joins as described here and does not support geographically distributed data. No implementation of (and/or higher level layer on top of) MapReduce can directly support joins over geographically distributed data.

Throughout the experiments, we assume no multi-tenant effect exists. The domain of multi-tenancy in the context of data analytics is being researched separately [12, 90], and considering it is beyond the scope of this work.

Figure 3.7: DistriPlan Pruning: number of candidate join plans as the number of hosts increases, for a join between two variables. Each panel fixes the number of nodes for the first relation (1, 2, 4, 8, 16, or 32) and varies the number of nodes for the second relation along the X-axis; the Y-axis (log scale) shows the number of spanned trees, with and without pruning (DistriPlan vs. Not Pruned Plans).

3.4.1 Pruning of Query Plans

We initially consider a simple join operation between two arrays, where each array is split across a given number of hosts. In Figure 3.7 we show how the number of trees spanned for a join between two variables increases, showing that the two pruning rules we have introduced drastically decrease the number of plans we need to evaluate. In the figure, there are multiple graphs. In each graph, the title includes the number of nodes the first joined relation is distributed over, the X-axis represents the number of nodes the second joined relation is distributed over (between 1 and 32), and the Y-axis holds a logarithmic scale for the number of spanned trees built for each of the settings. The continuous line represents the number of trees generated by DistriPlan, while the other represents the number of trees that need to be built when no pruning technique is used. Notice the difference in the initial values between the two graphs, and the scale changes, as the number of nodes increases between the graphs. For example, when both the arrays being joined are spread among 32 machines, our algorithms generate only 4,100 plans, out of the ~1.4 × 10^71 optional ones. In practice, a dataset is likely to be split across a much smaller number of distributed repositories than 32. The maximum runtime of our optimizer was 0.46 seconds, while the average was 0.08 seconds, showing that query plans can be enumerated quickly with our method.

Next, we consider a join over three arrays, i.e., query Q2. In Table 3.4 we show how long it takes to span plans for the distribution of the non-distributed plan shown in Figure 3.4, "Plan 1". We consider a set of representative distribution options of the three datasets across different numbers of nodes (each between 1 and 32 nodes). Each row considers a specific partitioning of the three datasets and shows the number of plans traversed and the time taken.

A    B    C    Spanning time (s)   Options spanned
1    1    32   0.19                4,104
1    4    16   0.02                1,816
1    16   16   1.76                72,646
2    4    32   0.91                16,427
4    4    32   1.27                24,644
4    8    8    0.01                1,521
4    16   8    0.30                15,034
8    8    8    0.03                2,614
8    8    16   0.37                17,398
16   16   4    0.41                15,762
16   16   8    0.84                29,640

Table 3.4: DistriPlan Spanning Time: time required to span plans and number of spanned plans for a three-way join distributed among multiple nodes. The first columns present, for each relation (A, B, C), how many nodes it is distributed over.

As one sees, all execution times are under 2 seconds. Rules 1 and 2, given in Subsection 3.2.3, limit the increase in the number of distribution options (for comparison, without Rules 1 and 2, the first row would have had to span 2.63 × 10^35 trees – clearly not a feasible option). In all cases we experimented with, the query performance improvement was substantial compared to the other, not optimized, plans. For example, we saved about 20 minutes of execution time compared to the median plan for the setting where the 3 involved datasets are partitioned across 8, 8, and 16 nodes respectively, while spanning the plans for this setting took only 0.37 seconds. The same plan ran 20 seconds faster compared to the plan with the most parallelism. Similar gains were seen in most experiments. Overall, we conclude the conditions provided in Section 3.2 are sufficient for handling cases where the data to be queried is spread across a modest number of repositories.

Q    No   A    B    C    D    E
1    16   4    4    8    –    –
1    20   4    8    8    –    –
1    24   8    8    8    –    –
1    32   8    16   8    –    –
2    8    2    2    2    2    –
2    12   2    2    4    4    –
2    34   12   8    6    8    –
3    20   4    4    4    4    4
3    24   7    5    3    4    5
3    26   4    6    6    4    6
3    30   7    4    4    4    4

Table 3.5: DistriPlan Experiment Setup: Distributed Join Slowdown. The distribution of relations for each query – the specific distribution of each array for the queries presented in Figure 3.8.


3.4.2 Query Execution Performance Improvement

In this experiment, we measure how effective our selection method (and the underlying cost model) is. We execute three different plans for each query: the cheapest and median cost plans generated by our CBO model, and the plan with the most parallelism. The latter is a commonly used heuristic in current systems such as Hive [20, 115] and Stratosphere [4]. For each of the queries presented above, we first build a plan using our CBO for different distributed settings. The latency used in all experiments is that of a WAN, equivalent to 400 ms.

In Figure 3.8 we report the increase in execution time of the median and most parallel plan versions, compared to the cheapest plan our optimizer selected.

Figure 3.8: DistriPlan Execution Time Slowdown (in %) for the Most Distributed and Median Plans Compared to the Cheapest Plan (log-scale Y-axis, percentage slowdown). Each query is listed by its query ID, and the detailed per-array distribution appears in Table 3.5.

              Q1-16   Q1-20   Q1-32   Q2-34    Q3-20   Q3-24   Q3-26
Cheapest      6 s     20 s    57 s    1093 s   31 s    36 s    146 s
Distributed   11 s    29 s    75 s    1191 s   34 s    39 s    161 s
Median        29 s    30 s    183 s   3304 s   36 s    41 s    152 s

Table 3.6: DistriPlan Join Queries Execution Time. The settings given in the column titles are described in Table 3.5.

Along the X-axis, we list the query and the number of nodes that held the data for the query (each array is distributed differently for each query – this information can be seen in Table 3.5). Along the Y-axis, we list the slowdown of the median plan and the most parallel plan compared to the cheapest plan. For example, the Q1-16 cheapest plan ran 83% faster than the most parallel plan. We note that in certain specific cases (depending on the selectivities of the joins, among other factors) the cheapest plan is also the one with the most parallelism. These cases are not shown.

When arrays are distributed unevenly, the optimal plans are rarely the most distributed ones. In fact, the number of processing nodes is often smaller than the number of nodes that originally hold the inner joined array in the cheapest plan, i.e., at least one of the processing nodes processes data of the same array that is copied from another node.

In Table 3.6 we show the actual execution time for some chosen settings from Table 3.5. As can be seen, in all cases the cheapest plan executes the fastest. Furthermore, in nearly all cases the most parallel plan is faster than the median plan. In addition, plans that are mostly similar have a small performance difference (such is the case for Q3-30 in Figure 3.8).

A pattern uncovered here is a decrease in the improvement for some of the more complex queries (queries executing more joins and/or using a larger number of nodes). For example, in the case of Q1-16 (4,4,8), the slowdown of the most distributed query is 83%, while for Q3-20 (4,4,4,4,4) it is ~10%. This behavior is rooted in the parallelism that the more complex plans enable to begin with – for example, a 3-way join forces sequentiality, while for a 4-way join, processing of some of the joins can be performed in parallel for certain plans.

We conclude that the CBO approach is beneficial for improving performance. In all cases observed using our cost model, the fastest query to execute is the one the CBO evaluated to be the cheapest. In addition, the fastest query execution time was always faster than those of the median cost plan and the most distributed plan (when they differed from the cheapest).

3.4.3 Impact of Network Latency

We executed the query optimizer and the resulting query plans to emulate four different cases. Here, we set the optimizer to build plans for a specific value of the network penalty, and execute each plan using latency values different from the one that matches the optimizer setting. We chose the following values for the penalty: 0 (no latency), 40 (Cluster), 400 (WAN), and 4000 (extreme), thus covering a wide variety of possible networks. We built plans optimized for each penalty and executed each plan multiple times using different actual settings.

In Tables 3.7 and 3.8 we show the percentage slowdown in execution time of a plan optimized for one penalty value but executed under a different actual network setting, compared to the plan optimized for that actual setting.

Set. \ Act.      0 (None)   40 (Cluster)   400 (WAN)   4000 (Extreme)
0 (None)         0.00%      75.00%         11.20%      12.42%
40 (Cluster)     0.00%      0.00%          10.37%      0.00%
400 (WAN)        75.00%     75.00%         0.00%       0.18%
4000 (Extreme)   150.00%    150.00%        1.66%       0.00%

Table 3.7: DistriPlan Performance Slowdown Percentage of a Plan Optimized for a Specific Optimizer Setting (Set.) Penalty but Executed with a Different Actual (Act.) Network Setting – Q1 with a setting of 3,5,4.

Set. \ Act.      0 (None)   40 (Cluster)   400 (WAN)   4000 (Extreme)
0 (None)         0.00%      18.18%         31.36%      17.64%
40 (Cluster)     11.11%     0.00%          49.54%      36.96%
400 (WAN)        6122.22%   1436.36%       0.00%       19.90%
4000 (Extreme)   159.26%    0.00%          26.81%      0.00%

Table 3.8: DistriPlan Performance Slowdown Percentage of a Plan Optimized for a Specific Optimizer Setting (Set.) Penalty but Executed with a Different Actual (Act.) Network Setting – Q3-24 in Table 3.5.

For example, the value in the first row, second column of the first table signifies that the plan optimized for a penalty value of 0 executed 75% slower than the plan optimized for 40 when the actual setting was the one intended for 40 – the execution time of the cheapest plan optimized for a penalty of 0 is 7 seconds, while the plan optimized for a penalty of 40 executes in 4 seconds. Similarly, in the second table, for the first row, last column, the execution time of a plan optimized for a penalty of 4000, which also matched the actual setting, is 2,386 seconds, while the plan optimized for a 0 penalty ran in this setting for 2,807 seconds – a slowdown of 17.64%.

Overall, we can conclude that the penalty has to be selected carefully to reflect the actual setting – wrong values might harm performance as significantly as the right values improve it. It also shows that the best plan can vary significantly depending upon the latency, which implies that detailed cost modeling is critical.

3.5 Summary

In this chapter we have presented and evaluated a framework for optimized execution of array-based joins in geo-distributed settings. We developed a query optimizer which prunes plans as it generates them. For our target queries, the number of plans is kept at a manageable level, and subsequently, a cost model we have developed can be used for selecting the cheapest plan. We have shown that our pruning approach makes the plan spanning problem practical to solve. We evaluated our system and have shown that the cost model's cheapest plan executes faster than more expensive plans. We have also shown through experimentation that the penalty parameter introduced in the cost model is a critical one, and should be adjusted carefully to fit the physical system setting.

Chapter 4: Execution of Scientific Data Join – BitJoin

In the previous chapter we discussed the optimization of the join operation, yet we did not target the execution of the join. In this chapter, we focus on the execution of two variants of data joins – the classical equi-join and an operation we refer to as Mutual Range Joins (MRJ). The latter is a new operation we introduce, which is needed because scientific data is not only numerical, but often also contains measurement noise.

Consider the following query over a climate dataset: "find the days that had significant temperature change compared to the daily average". In Figure 4.1 we formally state this query, and quantify the significant change as 5 degrees. The query has 3 join criteria: 2 of them are over dimensions and the last criterion looks for a lack of a match to a range.

Figure 4.1: Example Query: Temperature Join

This notion of range query arises in nearly all contexts within scientific domains that involve numerical data collected by sensors or estimated by simulations [126, 147]. We refer to these range queries as Mutual Range Joins (MRJ).

In this chapter, we first demonstrate how bitmap indexes can be used for accelerating the execution of distributed array joins. The most expensive component in executing a distributed join in this domain is the communication, and we significantly reduce this overhead by transporting compressed bitmap indexes instead of the full data arrays. Afterwards, we focus on mutual range joins, and show how a modification of the bitmap index structure can help execute these operations efficiently.

4.1 Background

In this section we discuss the background for join execution. We first describe MRJ's in greater depth, and then discuss the intuitive join algorithms for array data.

4.1.1 Mutual Range Join

Earlier, we motivated the need for a variant of the traditional join operation, which is the mutual range join, or MRJ. Assuming Val1 and Val2 are the values from two different arrays to be joined, two MRJ variations can be stated as:

• Val1 BETWEEN Val2 − r + e AND Val2 + r − e

• Val1 NOT BETWEEN Val2 − r + e AND Val2 + r − e

Here, r is the range that results should be matched with, and e is an acceptable error range – often derived from the sampling error. The error should be smaller than the range, i.e., e < r.
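Stated as a predicate on a pair of values, the first variant is equivalent to |Val1 − Val2| ≤ r − e; a minimal C++ sketch of both variants follows (the function names are illustrative, not part of our system):

#include <cmath>

// Mutual Range Join predicate: Val1 BETWEEN Val2 - r + e AND Val2 + r - e,
// which is equivalent to |Val1 - Val2| <= r - e (requires e < r).
bool mrjMatch(double val1, double val2, double r, double e) {
    return std::fabs(val1 - val2) <= r - e;
}

// The NOT BETWEEN variant simply negates the condition.
bool mrjNoMatch(double val1, double val2, double r, double e) {
    return !mrjMatch(val1, val2, r, e);
}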

4.1.2 Join execution

In Algorithm 4 we present the basic join algorithm; this is a conceptual algorithm, and is not expected to be implemented. The inputs are two arrays and the output is the result of an equi-join between these arrays. In Line 2 we initiate an empty array. Lines 3-11 iterate over each existing value, and find the matching coordinates across both arrays. If there is a match, in Line 8, we add the value to the resultset. Since we started with an empty array, adding a value may result in reshaping and re-building the entire array – an expensive operation.

Algorithm 4 Naïve Array Data Equi-Join
 1: function NaïveArrayDataJoin(array1, array2)
 2:     results ← ∅
 3:     for value1 ← each cell in array1 do
 4:         dimSetting ← Extract value1 dimensional setting
 5:         if array2[dimSetting] != ∅ then
 6:             value2 ← array2[dimSetting]
 7:             if value1 = value2 then
 8:                 results[dimSetting] ← value1
 9:             end if
10:         end if
11:     end for
12:     return results
13: end function

Algorithm 5 Regular Array Data Join
 1: function ArrayDataJoin(array1, array2)
 2:     dims ← common dimensions
 3:     dimRes ← ∅
 4:     array1m ← aggregate(array1, dims)
 5:     array2m ← aggregate(array2, dims)
 6:     for each dimension dim1 in array1m do
 7:         dim2 ← matching dimension for dim1 in array2m
 8:         dimRes ∪= JoinDims(dim1, dim2)
 9:     end for
10:     array1m ← restructure(array1m, dimRes)
11:     array2m ← restructure(array2m, dimRes)
12:     result ← JoinData(array1m, array2m)
13:     return result
14: end function

In Algorithm 5 we present the Regular Join algorithm, a practical approach to implementing array joins. A walk-through example for this algorithm can be found in Figure 4.2. First, we find the common dimensions across the joined arrays, and aggregate both arrays, based on the specified (or default) aggregation function, so both arrays will only have the common dimensions. This removes dimensions that are not used by the counterpart array, forcing both arrays to have the same dimensional configuration. Another option for reshaping would be to add dimensions and join over a higher set of dimensions – however, our focus is on decreasing the dimensionality since it results in smaller arrays; adding dimensions can be done explicitly, by using sub-queries.

Next, in lines 6-11, we execute the dimensional equi-join. Based on the results, we restructure both arrays, removing values of dimensions that did not match – as described in Algorithm 8. At this point both arrays have the same dimensional setting. Concretely, array1m and array2m are of the same shape and have the same values in the same order. Last, we execute the data join based on the provided join criteria.

The data join is conducted over matching array cells, i.e., the values at index 0 of array1m and array2m are compared to each other, and so on. We may use this approach only because the restructure process guarantees matching values are at matching indexes. This operation is practically a Nested Loop (NL) that iterates through all available cells. Unlike in relational systems, where the NL complexity is O(N^2) with N the number of rows, for scientific array data the NL complexity is O(N), where N is the number of cells – therefore, NL for a value based join over array data is an efficient algorithm.
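A minimal C++ sketch of this aligned-cell data join over two restructured (same-shape, same-order) arrays; the flat-array representation and the NaN sentinel for NULL are illustrative assumptions:

#include <cstddef>
#include <limits>
#include <vector>

// Data join over two restructured arrays: cells at the same flat index
// correspond to the same dimensional setting, so a single O(N) pass suffices.
std::vector<double> joinData(const std::vector<double>& a1,
                             const std::vector<double>& a2) {
    const double kNull = std::numeric_limits<double>::quiet_NaN();
    std::vector<double> result(a1.size(), kNull);
    for (std::size_t i = 0; i < a1.size() && i < a2.size(); ++i) {
        if (a1[i] == a2[i]) {       // equi-join criterion; other criteria plug in here
            result[i] = a1[i];
        }
    }
    return result;
}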

Figure 4.2: Steps in the Regular Join Algorithm – the aggregation and dimensional joins (lines 4-9), the restructuring of both arrays into Array1m and Array2m (lines 10-11), and the final data join (line 12), shown on two small example arrays.

In Figure 4.2 we illustrate the join algorithm. For simplicity, we use arrays with matching dimensions. The Dim1 dimensional join matches values 2 and 4, while Dim2 has 1 match – 2. Array1 and Array2 are restructured based on the dimensional join results. Array1m and Array2m are built and a data join is executed over them, producing the resultset. If the join criterion does not hold, a NULL value is used (e.g., [4,2] in the example). Although dimensions could be reduced when an empty plane is generated, we chose not to reduce dimensions in this scenario – maintaining the dimensional join results.

4.2 Bitmap Based Join

Bitmap Index Join

Algorithm 6 The Naïve Bitmap Index Equi-Join
 1: function JoinData(index1, arr1, index2, arr2)
 2:     FinalResults ← ∅
 3:     for each key in index1 do
 4:         bin1 ← index1[key]
 5:         bin2 ← index2[key]
 6:         if bin2 = ∅ then
 7:             continue
 8:         end if
 9:         result ← bin1 & bin2
10:         for each i ← place of 1 in result do
11:             if arr1[i] != arr2[i] then
12:                 result[i] = 0
13:             end if
14:         end for
15:         if result > 0 then
16:             FinalResults ∪= (key, result)
17:         end if
18:     end for
19:     return FinalResults
20: end function

In Algorithm 6 we present the naïve Bitmap Join algorithm – this corresponds to the call in Line 12 of Algorithm 5. The algorithm's inputs are both arrays (to be joined) and their corresponding bitmap indexes. In phase 1, Lines 4-9, we take all matching bins (by their bin keys) and execute a bitwise AND operation between key-matched BitArrays. Here we assume that the bitwise AND compares keys with the same dimensional values, which is guaranteed because of the restructuring process. Phase 2, Lines 10-14, verifies the results – this is necessary when data is not discrete and bins represent ranges of values.

The performance advantage of this algorithm comes from the efficiency of the bit-wise operations. While the regular join algorithm requires values to be compared one by one, with a bit-wise AND the processor accelerates the calculation by the number of bits it can process simultaneously, when data is not compressed. When data is compressed, the acceleration can be even higher, depending on the sparsity of each BitArray.
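For uncompressed BitArrays, the per-word speedup comes from ANDing machine words instead of comparing values one by one; a minimal sketch (assuming a packed 64-bit-word representation, which is an illustrative choice):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bitwise AND of two uncompressed BitArrays, 64 cells at a time.
std::vector<uint64_t> andBitArrays(const std::vector<uint64_t>& b1,
                                   const std::vector<uint64_t>& b2) {
    std::vector<uint64_t> out(std::min(b1.size(), b2.size()));
    for (std::size_t w = 0; w < out.size(); ++w) {
        out[w] = b1[w] & b2[w];   // one instruction covers 64 candidate matches
    }
    return out;
}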

Dimensional Joins and Restructuring

To use bitwise AND/OR for the data join, matching values must be located at the same offsets within both indexes; if they are not, an index restructure is necessary. We use two algorithms to do that restructure efficiently. The first generates BitMappings that mark which dimensional values should be compared with which across array boundaries – for example, if the two BitMappings for a specific dimension of the joined arrays are 010 and 001, the 2nd dimensional value of the first array corresponds to the 3rd one in the second array. Clearly, the number of lighted bits on both BitMappings matches, and both mapped dimensions must be sorted. The second algorithm uses these BitMappings to perform the array restructure.
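To make the mapping concrete, here is a minimal C++ sketch (illustrative names, not BitJoin's code) that pairs the lighted-bit positions of two BitMappings with the same number of lighted bits; for 010 and 001 it pairs position 1 with position 2, matching the example above.

#include <cstddef>
#include <utility>
#include <vector>

// Pair up the lighted-bit positions of two BitMappings produced by the
// dimensional join; the k-th lighted bit of one maps to the k-th lighted bit
// of the other (both are guaranteed to have the same number of lighted bits).
std::vector<std::pair<std::size_t, std::size_t>>
pairDimensionValues(const std::vector<bool>& m1, const std::vector<bool>& m2) {
    std::vector<std::size_t> pos1, pos2;
    for (std::size_t i = 0; i < m1.size(); ++i) if (m1[i]) pos1.push_back(i);
    for (std::size_t i = 0; i < m2.size(); ++i) if (m2[i]) pos2.push_back(i);
    std::vector<std::pair<std::size_t, std::size_t>> pairs;
    for (std::size_t k = 0; k < pos1.size() && k < pos2.size(); ++k)
        pairs.emplace_back(pos1[k], pos2[k]);   // e.g., {1, 2} for 010 and 001
    return pairs;
}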

In Algorithm 7 we show the dimensions join algorithm, which generates each dimension's BitMappings – this corresponds to the call in Line 8 of Algorithm 5. The algorithm begins with two reset (to 0) BitMappings, each the length of the input dimension. We go through each value, one by one, in a technique very similar to merge sort (since dimensional values are assumed to be sorted). If we find a match, we append a 1 to the bit array. If we do not, we skip the smaller item while keeping track of which items were skipped (by appending 0's to the relevant BitMappings).

Algorithm 7 BitJoin Dimensional Join
 1: procedure JoinDims(dim1, dim2)
 2:     i ← 0
 3:     j ← 0
 4:     BitMappingsDim1 ← 0..0
 5:     BitMappingsDim2 ← 0..0
 6:     while i < size(dim1) && j < size(dim2) do
 7:         if dim1[i] = dim2[j] then
 8:             dimToRet ∪= dim1[i]
 9:             BitMappingsDim1 << 1
10:             BitMappingsDim2 << 1
11:             i ← i+1
12:             j ← j+1
13:         else
14:             if dim1[i] < dim2[j] then
15:                 BitMappingsDim1 << 0
16:                 i ← i+1
17:             else
18:                 BitMappingsDim2 << 0
19:                 j ← j+1
20:             end if
21:         end if
22:     end while
23:     return ((BitMappingsDim1, BitMappingsDim2), dimToRet)
24: end procedure

Algorithm 8 BitJoin Index Restructure
 1: function Restructure(index, bitMappings)
 2:
 3:     ▷ If restructure is not needed, skip restructuring
 4:     if bitMappings has only 1's then
 5:         return index
 6:     end if
 7:
 8:     ▷ Build the whole index bitMapping
 9:     histState ← bitMappings[0]
10:     currState ← ∅
11:     zeros ← bitstring of 0's of length len(histState)
12:     for i ← 1..size(bitMappings)−1 do
13:         for j ← 0..size(bitMappings[i])−1 do
14:             if bitMappings[i][j] then
15:                 currState ← currState.append(histState)
16:             else
17:                 currState ← currState.append(zeros)
18:             end if
19:         end for
20:         histState ← currState
21:         currState ← ∅
22:         zeros ← histState && 0
23:     end for
24:     currState ← compress(histState)
25:
26:     ▷ Rebuild each bin
27:     for each bin b in index do
28:         ▷ The next call removes bits that are marked by 0 in currState
29:         ▷ from the current bin, preserving only correlated values
30:         newBin ← restructure b based on currState
31:         if newBin has 1 in it then
32:             newIndex ∪= newBin
33:         end if
34:     end for
35:     return newIndex
36: end function

In Algorithm 8 we take an index structure and restructure it based on the generated BitMappings. In line 24 we call compress for simplicity; the algorithm can execute directly over compressed structures. This algorithm corresponds to the calls in Lines 10-11 of Algorithm 5. The input for this algorithm is the bitmap index, equivalent to the array data, and the BitMappings, which encapsulate the dimensional join results.

In lines 3-6 we determine if a restructure is needed. If it is, in Lines 8-23, for each dimension index (represented by i) we go through the bitmap that represents the locations we need to keep on restructure (represented by j). If we need to keep the current dimensional value, we shift left the entire set of values which were calculated before (histState); otherwise, we append zeros of the same length instead. The reason we append the entire data that was collected before is that a 1 on the x dimension's BitMappings requires all the dimensions with indexes smaller than x to appear. Lastly, the algorithm iterates through the bins and restructures the bitmap – described graphically in Figure 4.3.

4.2.1 Putting It All Together

In Figure 4.3 we demonstrate the full join process. First, we join the dimensions, and produce the dimension BitMappings for each dimension of each array. Since the value 2 of Dim1 matches, for the left array BitMappingsDim1 is 01 (since 2 is the second value). Similarly, BitMappingsDim2 of the left array is 10. The same process is executed for the right array as well (producing 10 for Dim1 and 01 for Dim2).

Figure 4.3: BitJoin: A Walkthrough of the Join Process – the dimensional joins, the resulting BitMappings, the restructuring of both indexes, and the final bitwise AND between matching bins.

The second part of the restructure uses the BitMappings which were created. First, we multiply (vector multiplication) all the BitMappings to create a global array BitMapping. Then, we modify the indexes by using the mapping, maintaining only the values for lighted bits in the BitMapping – in this case, the BitMapping for the left array is 0100, instructing us to take the 2nd bit of each index bin, while ignoring all others. Similarly, we take the 3rd bit of each index bin of the right array. Last, we join the matching bins by using the bitwise AND operator to generate the join results.

r = 2, e = 0.5

Bins:                         -0.5–0  0–0.5  0.5–1  1–1.5  1.5–2  2–2.5  2.5–3  3–3.5  3.5–4  4–4.5
Index Type A for value 2.2:     0      0      0      0      0      1      0      0      0      0
Index Type B for value 1.7:     1      1      1      1      1      1      1      1      1      0

Figure 4.4: Bitmap Index Types Example

4.3 Mutual Range Join - MRJ

4.3.1 Binning and Mutual Range Joins

For using bitmaps to find values that are within a range, we propose building two different indexes. The first index uses a bin width of e – referred to as Bitmap Index Type A. The second index, Bitmap Index Type B, uses the same bin width, e, but has a different structure. Index Type B has multiple lighted bits in different bins for the same indexed value; specifically, all bins that represent values within the range [val − r, val + r] are lighted. Both indexes are generated in advance.
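A minimal C++ sketch of building the two index entries for a single value; the bin layout (bins of width e starting at a fixed lower bound lo) and the function names are illustrative assumptions:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Type A: exactly one bit is lighted - the bin containing the value.
std::vector<bool> typeABits(double val, double lo, double e, std::size_t numBins) {
    std::vector<bool> bits(numBins, false);
    long b = static_cast<long>(std::floor((val - lo) / e));
    if (b >= 0 && b < static_cast<long>(numBins)) bits[b] = true;
    return bits;
}

// Type B: every bin that intersects [val - r, val + r] is lighted.
std::vector<bool> typeBBits(double val, double r, double lo, double e,
                            std::size_t numBins) {
    std::vector<bool> bits(numBins, false);
    long first = static_cast<long>(std::floor((val - r - lo) / e));
    long last  = static_cast<long>(std::floor((val + r - lo) / e));
    first = std::max(first, 0L);
    last  = std::min(last, static_cast<long>(numBins) - 1);
    for (long b = first; b <= last; ++b) bits[b] = true;
    return bits;
}

// With lo = -0.5, e = 0.5, numBins = 10: typeABits(2.2, ...) lights only the
// bin [2, 2.5), and typeBBits(1.7, 2.0, ...) lights bins 0 through 8,
// matching Figure 4.4.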

In Figure 4.4 we demonstrate the two index types side by side, showing each bin's bit value for the given indexed value. Index Type A has exactly one lighted bit across all bins – since the value 2.2 lies within the range [2, 2.5), this bin holds a 1. Index Type B has 9 lighted bits, representing all the values in the range [−0.3, 3.7] – the result of [1.7 − 2, 1.7 + 2].

MRJ Algorithm: we modify Algorithm 6 to use Index Type A for one of the join operands, while using Index Type B for the other. We also skip "phase 2", the results verification phase, which is not necessary due to the index structure.

The output of the algorithm is slightly different from the definition of the original MRJ operation. Specifically, the range and error boundaries are extended to include the adjacent bins. While a simple check can eliminate these records with low overhead, we believe, based on the usage of our current systems and discussions with research scientists, that it is reasonable from a practical standpoint to report all results.

Theorem 2. Suppose we define relations X and Y with index i_x of type A on X and index i_y of type B on Y, and the parameters r and e as range and error accordingly. Consider the records with matching variable values x and y. In addition, we define b_x to be ⌊x/e⌋ × e, the "leftmost" value of the bin which x is in – b_y is defined similarly. These records are part of the output from the algorithm iff b_y − r ≤ b_x < b_y + r + e.

Proof. Define BitI_x and BitI_y to be the values of the ith bit of the relevant bins of i_x and i_y. More formally, BitI_x = i_x[b_x][i] and BitI_y = i_y[b_y][i]. In this case, by definition, BitI_x is 1 – i_x[b_x] is the bin that is defined to be the one that contains the value x, therefore its ith bit must contain 1. Assume b_y − r ≤ b_x < b_y + r + e. In this case, the bin i_y[b_y] is at a distance smaller than r + e from b_x (when the bins are ordered in increasing order). By definition, BitI_y is 1. A bitwise AND between BitI_x and BitI_y will produce 1, and make this record part of the output. In contrast, if b_y − r > b_x or b_y + r + e ≤ b_x, then BitI_y is 0, based on the same reasoning. The bitwise AND between BitI_x and BitI_y will produce 0, showing the other direction.
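As a quick sanity check against the Figure 4.4 example (x = 2.2, y = 1.7, e = 0.5, r = 2):

b_x = ⌊2.2 / 0.5⌋ × 0.5 = 2.0,  b_y = ⌊1.7 / 0.5⌋ × 0.5 = 1.5,  and indeed  b_y − r = −0.5 ≤ 2.0 < b_y + r + e = 4.0,

so the pair is reported – consistent with the bitwise AND of the two rows in Figure 4.4, which is 1 at the bin [2, 2.5).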

An important property of our indexes is that if the range r is smaller than needed, an in-place Type-B index modification can be done to adjust the index to the requested r value. This allows using the same Type-B index for different r values. However, such a modification is rarely needed, since domain experts often anticipate the needed values with high accuracy.

4.3.2 MRJ with Distributed Data

Joining distributed data requires additional communication. Transporting indexes, instead of the original arrays, can decrease the amount of necessary communication. We suggest executing the entire join by communicating local bitmap indexes (indexes over the local, partitioned, array data).

We assume each machine has access to at least one of the variables' partitions. We choose one variable to stay stationary, while sending the other one to the machines hosting the chosen variable; the join is processed on these machines. The decision of which variable remains stationary is made by a Query Optimizer and communicated through an execution plan (Section 7.2).

Algorithm 9 Distributed BitJoin Master
 1: procedure DistributedJoinMainNode()
 2:     plan ← DistriPlan optimizer
 3:     for each node n do
 4:         Send plan to node n
 5:     end for
 6:     if master node is a processing node then
 7:         DistributedJoinProcessing(plan)
 8:     end if
 9:     for each processing node n do
10:         res ∪= results from node n
11:     end for
12: end procedure

Algorithm 10 Distributed BitJoin Process
 1: procedure DistributedJoinProcessing(plan)
 2:     communicate dimensions within the cluster
 3:     if there are dimensions to join then
 4:         {BitMappings, newDims} ← JoinDims
 5:     end if
 6:     communicate BitMappings
 7:     subIndexesA ← ReformBits(IndexA, BitMappings)
 8:     subIndexesB ← ReformBits(IndexB, BitMappings)
 9:     communicate subIndexA to the relevant nodes
10:     res ← subIndexA & subIndexB
11:     send results (Bitmaps and dimensions) to the main node
12: end procedure

We divide the processing algorithm to consider actions on the master node and each processing node, and similarly, the dimensions join phase and the bit operations phase. The master node orchestrates the full join execution as shown in Algorithm 9. The main node is also a processing node (Lines 6-8), and is likely to process a join in addition to its responsibility to distribute the execution plan and collect all results.

The processing thread execution is presented in Algorithm 10. First, each node determines whether it is a data receiving node, a data sending node, or both. If it is a data sending node, it sends the relevant dimensions to the receiving nodes. The receiving nodes execute the dimensional joins, generating the BitMappings (Section 4.2). These are used to restructure both pre-built indexes (Lines 7-8) – the restructure occurs on the node that holds the array itself, since all the communicated dimensions are needed there. The restructured index is sent back to the receiving node (while avoiding the need to communicate and hold in memory the dimensional data). The receiving nodes join the data by using the bitwise AND over compressed data – this is why we restructured the data earlier. The accumulation of these joined indexes is the query result.

4.4 Evaluation

We now evaluate the performance of our algorithms for equi-joins and MRJ's. The datasets we use contain climate data generated by simulations [126, 127, 147, 148] conducted by the Department Of Energy (DOE) at Argonne National Laboratory (ANL). The variables used have 4 dimensions – longitude, latitude, height, and time. The data distribution is based on the time dimension, i.e., data for different timestamps is stored in different files (and possibly different locations).

Our experiments were conducted on a cluster where each node has two 2.4 GHz Intel(R) Xeon(R) E5-2680 v4 processors, with 128 GB of memory, running Linux kernel version 3.10. Nodes are connected through Infiniband. As we will describe later, some experiments used 1 node of this cluster, and in some, lowered bandwidth is used. We also vary the number of nodes (sites) the data is distributed across.

We compare our implementation to the basic "Regular Join" algorithm, which is a combination of Nested Loop Join (NL) [120] and Merge Join, as described in Section 1.2. The complexity of the NL used is O(N), and not quadratic as in the relational systems domain, as discussed in Subsection 4.1.2.

4.4.1 Implementation

Our system is constructed from two different parts: an optimizer and an execution engine. Our optimizer is based on the ideas presented in DistriPlan [45] (Chapter 3). The execution engine is written in C++ and includes the following components: a network manager, a join engine, and a results collector.

When a query is introduced to the system, the optimizer generates an execution plan. All the information regarding where to read the data from (e.g., disk or network), what to communicate, which node accumulates the results, etc. is encapsulated within the execution plan. The (join) execution engine, networking components, and result collector follow the plan to produce the query results.

4.4.2 Equi-Join Over Discrete Data

In this experiment we join discrete (categorical) data, and compare the bitmap based algorithm against the regular join algorithm. For simplicity, data in this experiment is on a single node. Both joined dataset sizes were 781.2 MB (102,400,000 items in each array). The query executed is a two-relation equi-join whose dimensions were joined upon as well (the join criteria are "WHERE A.A = B.B AND A.dim1 = B.dim1..."). For climate data, this corresponds to a query that helps identify erroneous sensors that produce the exact same readings.

We measured both the join duration and the resultset size. The BitJoin resultset is produced as a set of bitmaps, and is sent to the user in its compressed representation. Therefore, its size is based on both the number of bins and the compression ratio (WAH [133]). It should be noted that this representation does not lose accuracy when the data is discrete (since the bins are structured to match discrete values).

In Figure 4.5 we present the processing duration for 10 to 50 bins. The greater the number of processed bins, the longer it takes to process using BitJoin, while the regular join algorithm is agnostic to this factor, as expected. There is nearly a 5 times speed advantage when the number of bins is 10, but this reduces when the number of bins is 50.

Figure 4.5: BitJoin Processing Time for Equi-Joins Over Categorical Data with Increasing Number of Bins

Figure 4.6: BitJoin Resultset Size of a Discrete Data Join with Increasing Number of Bins

Overall, this shows that even when data is not distributed, there can be advantages to using bitmaps, because of their compact size and because of the efficiency of bit-wise operations.

Another advantage of using bitmaps is that the size of the results (which may be communicated over the network) is small and somewhat predictable. In Figure 4.6 we present the resultset sizes with a varying number of bins. As one can see, the size of the resultset with bitmaps is quite small – in fact, there is nearly a factor 5 improvement when 50 bins are involved. There is a larger advantage with respect to size because of bitmap compression; this is why we use the compressed index not only as an internal format to communicate for the join execution, but also as the output of the system.

When the cardinality increases (the number of distinct values in the index goes up), although the processing takes longer using BitJoin, the average processing duration of each bin decreases. This behavior is caused by the WAH compression algorithm over sparse data. Since the bitmap index bins are usually more dense when there are fewer distinct values, the processing load of the compression is higher, leading to the behavior we observed. When the number of bins is increased from 10 to 100, the average bin processing time is decreased by half, from 0.19 s to 0.09 s. This suggests that using a more current and efficient bitmap compression scheme, such as UpBit [10], may improve BitJoin performance.

4.4.3 MRJ Processing

We now focus on the processing of mutual range joins. Next, we consider datasets of different sizes, and increase the number of nodes over which the data is distributed.

The query executed is a two-relation MRJ whose dimensions were joined upon as well (the join criteria are "WHERE A.A BETWEEN B.B-r AND B.B+r ..."). Using this query, we can identify areas that were more heavily affected by global warming. We experiment over 1, 10, 100, 1,000, and 10,000 million array values, corresponding to local datasets of sizes 7.81 MB, 78.13 MB, 781.25 MB, 7,812.5 MB, and 78,125.0 MB. Each total array size is the local dataset size multiplied by the number of nodes it is distributed over – the largest array size used is 78 GB over 32 nodes, a total of about 2.5 TB. We assume both indexes, Type A and Type B, are available, and were built when the data was generated. All the data distributions mentioned represent the number of nodes that each array is hosted on – when we say "32 nodes", for example, 64 nodes are involved (32 containing array A, and 32 containing array B).

In Figure 4.7 we show the improvement of MRJ execution using our algorithms compared to an implementation that aggressively optimizes distributed joins. We show the improvement only for the 3 smaller datasets (out of 5), since the compared algorithms cannot process the larger datasets and therefore the improvement cannot be calculated. We used a relatively large number of bins, i.e., 100. We chose such a large number since for some datasets the performance of BitJoin is slower on 1 node compared to the regular join when the index contains 100 bins, as shown earlier. Thus, we emphasize the benefits of using BitJoin with an increasing number of nodes over which the data is distributed.

Figure 4.7: BitJoin Performance Improvement over Small Datasets With Increasing Number of Nodes – Speedup Compared to Regular Join on the Same No. of Nodes

Figure 4.8: BitJoin Execution Time Comparison over Large Datasets

We can expect higher speedups if fewer bins are used, as shown in the previous experiment.

In the figure, the scale of the number of nodes (the X-axis) is logarithmic while the improvement is not; the improvement is nearly linear in the number of nodes. As shown, the greater the number of nodes over which data is distributed, the larger the performance improvement of BitJoin over the regular join. For example, for the 781 MB dataset, BitJoin runs 31 times faster than the regular join when each array is distributed over 32 nodes. For the other two datasets, the speedups on 32 nodes are nearly 15 and 24, respectively. Thus, the relative advantage of using bitmaps also increases with increasing dataset sizes. This is primarily because of the high cost of moving larger (source) data in the traditional algorithms.

In Figure 4.8 we present the execution time of the join between the two larger datasets with an increasing number of nodes. Unlike the previous figure, we report absolute execution times here. From the figure we see that for larger datasets using bitmaps is preferable over previous approaches. This is mainly due to the advantage of working over compressed data. Once data needs to be transported, the benefits increase substantially. In addition, the low memory footprint of compressed bitmaps allows the processing of large datasets, which traditional algorithms fail to execute.

4.4.4 Throughput Effect

In this experiment we modified the networking throughput to match different settings (throttling the network bandwidth and latency), and measured the effect

Figure 4.9: BitJoin Execution Time With Decreasing Throughput

it had on join performance. Our goal was to emulate a typical ‘big data’ cluster (which uses Ethernet, not Infiniband), and then a wide-area-like setting.

All experiments ran with 4 nodes, 2 hosting each of the two joined datasets.

In Figure 4.9 we present the execution time for queries running over indexes with a varying number of bins, when each joined dataset is 1.6GB. When the throughput is 16 times slower, the regular join executes 3 times slower than BitJoin. The reason for this dramatic difference is the compression of the bitmaps, which drastically decreases the amount of data that needs to be communicated. We also executed this experiment for other dataset sizes and other numbers of bins, and the same pattern holds – the bigger the dataset, the better BitJoin performs compared to the regular join.

For comparison, when the dataset size is 16 GB and 10 bins are used, the slowdown of the regular join is 11.94 seconds, while for BitJoin it is only 0.66 seconds (compared to 0.52 seconds and 0.1 seconds for a dataset of 1.6 GB). Moreover, as we had shown earlier, BitJoin’s advantage increases when the number of nodes on which the data is distributed is larger. We have distributed each array over only 2 nodes, and we can expect larger gains as this number increases.

We conclude that BitJoin is clearly preferable as datasets get larger, as data is distributed over a larger number of nodes, and/or as the network bandwidth gets slower.

4.4.5 Impact of Changing Error and Range Values

We first experiment with changing error, i.e., e values. As we discussed earlier, our algorithm further stretches the query range limits, reporting more records in the resultset (yet all within the error boundaries). In the following experiment, we measure the percentage of additional results reported using our methods as the e value changes. We execute this experiment for a constant array size, 78.13 MB (10 million values). We experiment with varying e values, which consequently increase the number of bins generated for each index.

In Table 4.1 we report the results. As expected, the smaller the error value, the smaller the percentage of additional, out-of-range, records. However, even for a fairly large e value we have less than 10% additional results, while for commonly used e values we have less than 4%.

e value                0.25     0.025    0.016    0.0125   0.01
Additional records     14.85%   7.53%    5.07%    3.79%    3.04%

Table 4.1: BitJoin Percent of Additional Records Output With Varying ‘e’ values for 10M values

Next we experiment with changing range, i.e., r, values. The r value required for a specific query is only known at runtime, and therefore we cannot normally pre-build an index for this specific value. In this experiment, we change the values of r used in queries over variables that already have existing indexes that were created for a smaller r value, and measure the performance overhead. We used 2 arrays of the same size and executed the joins on 2 nodes. We ran the experiment for varying array sizes – 10MB to 1GB.

In Figure 4.10 we present the speedup of BitJoin compared to the regular join when queries are introduced to the system with a multiplier of the r value used for generating the index. Both engines slow down with increasing r – the regular join slows due to the number of results it needs to emit, while BitJoin slows due to the bitmap index manipulations. As can be seen, if the r value multiplier is only 2, the regular join performs slightly better. Yet, as the multiplier increases, BitJoin performs increasingly better. Overall, we conclude that the performance penalty for increasing r values is reasonable and that BitJoin performs better than the regular join. In addition, caching or storing the modified index that was generated at runtime for the higher r multiplier will result in better performance at a modest storage cost.

Figure 4.10: BitJoin Speedup over Regular Join for Changing ‘r’ values

4.5 Summary

In this chapter we have considered joins and join-like operations over scientific (array) datasets. We developed techniques for improving the execution time of equi-joins by using bitmap indexes. Afterwards, we developed a variant of standard bitmaps, and showed how it can be used for efficiently executing MRJ queries. Last, we presented algorithms for distributed execution of joins. The basis of our techniques is a set of methods to efficiently join array dimensions and restructure bitmap indexes without accessing the raw data – only addressing and transporting compressed, limited-in-size, bitmap indexes.

Our experiments show that BitJoin is preferable over traditional approaches. The bigger, more sparse, and more distributed the data, the better our methods perform. In addition, the resultsets BitJoin produces are significantly smaller than those of other methods.

Chapter 5: Advanced Analytics over Real Scientific Array Datasets - FDQ

So far, we have developed limited querying abilities over array data – specifically, we introduced subsettings and joins over distributed array data. In this chapter we address analytical querying and uncover a new class of analytical queries – queries over windows where the planes that construct these windows are internally ordered. Currently, in analytical [or window] queries, an order is imposed across different windows; we address queries where an ordering within the window is required.

First, we introduce FDQ – an Analytical Functions Distributed Querying Engine for Array Data – our querying engine that addresses this new query type. Then we describe in detail memory management optimizations for efficient processing of analytical (and other structured operator) queries over large datasets. Last, we provide efficient methods to execute these queries in parallel, using a sectioned (tiled) approach.

5.1 Domain

In this section we provide a technical description of the climate data we use and the querying needs of research scientists. This further motivates the need for the new querying class introduced later.

5.1.1 Climate Datasets and Variables

The datasets generated by ANL’s Climate Research Group are saved in NetCDF [102] files, each of which represents a specific epoch the file was sampled at or calculated for. Each file contains about 100 variables, each of which is structured from up to 4 dimensions (longitude, latitude, vertical levels, and time). In the datasets we have used, the time dimension has only 1 value, since the simulations generating the datasets are configured to output one time step [108, 113, 137] per file. Last, not all variables contain vertical levels.

Each variable used can be large – in our case, the spatial resolution was 12 kilometers (about 7.45 miles), and the model covers most of North America. The data used contains in total 2TB per year with a retention of 10 years – a total of 20TB. Each raw file is self-describing and contains the dimensions used, including their values. Usually, all the variables in the same raw file use the same dimension definitions – e.g., the precipitation and temperature variables within each file will often describe the same area.

The two variables we use for most of the queries in this work are temperature and precipitation, simulated by a regional climate model [126, 127, 147, 148]. Temperature outputs are stored as instantaneous absolute values, as measured in observatory stations or reported by weather forecasts. Precipitation amounts are output in an accumulated mode – data is accumulated from the first epoch until the last one, and reported as the value accumulated up to the sampling point. However, to avoid reporting a large number as the precipitation amount, the program (or operators) resets the precipitation amount at some time point and lets the model output accumulated precipitation again from that point. To calculate the amount of precipitation during a certain time interval, if there is no reset, we simply subtract the first reading from the last one. On the other hand, for those data sequences where a reset has been applied, special processing is required. For example, for 5 samples with the values 3, 5, 2, 2, and 4, the amount of rain between the first and last samples is 6 – 2 from sample 1 to 2, 2 from sample 2 to 3 (which included a reset of the “counter”), none between samples 3 and 4, and 2 more between the last two samples. Usually the counter reset is infrequent, which makes it possible to detect when a reset occurred.
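To make the reset handling concrete, the following is a minimal sketch (our own illustration, not code from the climate model or the engine) of accumulating precipitation from such counter readings; the function name and the use of std::vector are ours.

#include <iostream>
#include <vector>

// Illustrative only: total precipitation between the first and last sample,
// treating any drop in the accumulated counter as a reset.
double accumulated_precipitation(const std::vector<double>& samples) {
    double total = 0.0;
    for (size_t i = 1; i < samples.size(); ++i) {
        if (samples[i] >= samples[i - 1])
            total += samples[i] - samples[i - 1];   // normal accumulation
        else
            total += samples[i];                    // counter was reset; count the new reading
    }
    return total;
}

int main() {
    // The example from the text: 3, 5, 2, 2, 4 -> 6
    std::cout << accumulated_precipitation({3, 5, 2, 2, 4}) << std::endl;
    return 0;
}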

5.1.2 Querying Needs and Tools

Currently, the workflow at ANL and its research communities includes manually extracting the variables and storing them locally using tools such as NCO [139] and CDO [107]. The extracted data is used for simple mathematical and algebraic manipulations or loaded into a programming environment. The most common local data calculated are daily means, or seasonal/annual averages. More advanced analysis of the data is conducted by scripts written and executed using NCL [67], Matlab [7, 117], Python [63, 65], or R [30, 103].

When it comes to collaborations with, or assistance provided to, other groups and institutions to address different problems, transporting the more than 20TB of generated raw data is tedious and, most importantly, unnecessary. ANL offers pre-processing analytics to decrease the provided dataset size and better address the needs of collaborators and colleagues. Lowering the overhead on ANL’s personnel in addressing these needs is important, and FDQ, in part, allows ANL to address these generic inquiries over long periods of time efficiently, while providing results faster.

5.2 Analytical/Window Querying

5.2.1 Definition

Analytical Queries are queries over data partitions, each of which is referred to as a window. While analytical queries have some similarities to queries involving aggregations over partitions (the GROUP BY clause), Analytical Queries are unique in allowing access to other windows’ raw data while processing a specific window. For example, subtracting each day’s average value from that of the previous day cannot be expressed directly using the GROUP BY clause (it would require an inefficient use of sub-queries and joins).

Analytical Functions are usually listed in the SELECT clause of the query. As a consequence, Analytical Querying enables a single query to use multiple Analytical Functions – providing the ability to use multiple aggregations over different partitioning schemes within one query. The syntax for Analytical Functions is:

FUNC = FUNCTION(param_list) OVER ([PARTITION BY part_dim_list]
       [INTERNAL ORDER BY part_int_ord [INCOMPLETE|COMPLETE]]
       ORDER BY part_ord)

FUNC is the syntax clause for analytical functions over array data, and is intended to appear in place of a column (or dimension) wherever column lists are used in the SQL standard [84]. Brackets mean the clause within them is optional, a standard in the notation used to describe database querying syntax. FUNCTION is the analytical function used – in our setting the available functions are average AVG, minimum MIN, maximum MAX, lead LEAD, lag LAG, median MEDIAN, and minus MINUS.

The function parameters vary by function; for example, the MINUS function takes 2 parameters: the variable upon which it should operate, and the distance to the anchor window (the window whose values are used for the subtraction).

The OVER clause marks that we are using an analytical function. The PARTITION BY... clause accepts a list of functions or dimensions (parallel to columns in relational settings) that are used to build the partitions upon which the calculations occur. Using a list of functions or dimensions, the ORDER BY clause determines the external window order – only dimensions or functions that are listed in the PARTITION BY clause can be used. Up to here, although discussed in the context of array data, similar functionality exists for relational systems and is standardized under the SQL and SQL/MDA standards.

Next, we discuss the extension we suggest. The INTERNAL ORDER BY clause is provided a list of functions or dimensions as well, but here only dimensions that are not listed in the PARTITION BY section can be used. This clause is used to determine the order of planes internal to the partition, as if there was a secondary partitioning, internal to the generated window, based on the provided list of dimensions. The INCOMPLETE clause instructs the engine to provide (rather than dismiss, as the default COMPLETE behavior does) window values that are calculated based on missing values (providing an advanced and nuanced handling of the case where NULL values are present).

In Figure 5.1 we show an example of an analytical query for finding the differences between subsequent days’ average temperatures. The syntax shown is simplified compared to the ANSI standard for brevity. The query has three different parts: the first is a call to the analytical function


SELECT AVG(TEMP.val) OVER (PARTITION BY DAY_OF_YEAR(TEMP.date)
                           ORDER BY DAY_OF_YEAR(TEMP.date)) AS day_avg,
       LAG(day_avg, 1) OVER (ORDER BY DAY_OF_YEAR(TEMP.date)) AS day_before_avg,
       day_avg - day_before_avg AS average_difference
FROM TEMP

Figure 5.1: Analytical Query SQL Example

Average: the average is calculated over a window that corresponds to each day’s data (the window is built through a function that extracts the day of year); the ORDER BY clause does not impact the results and could be omitted here. Next, we use the function LAG, which gives us access to data that was produced for the previous window, as determined by the provided ORDER BY clause, which is critical here. The function LEAD, demonstrated in Figure 5.4, gives access to the subsequent window’s values. Last, we calculate the difference between the currently calculated day and the day before, showing the difference in temperatures across subsequent days.

Analytical Queries challenge the query optimization process since they require repeated calculations, large memory caches, and in some cases disk usage. Efficient algorithms have been developed to address these issues for relational databases [18, 26, 59, 76, 88, 91, 131, 149]. Yet, to date, no engine for array data provides the full strength of analytical functions; therefore, developing new methods, which utilize the sequential memory layout, for efficient processing of these calculations in that context is crucial.

5.2.2 Minus Function

We now introduce a new analytical function, MINUS, which is similar to what we described in the previous subsection. This function calculates the difference between the maximum value of each data point within each window and the last value of the matching data point within the previous window. Mathematically, it can be described as

MAX2(val) − LAG(val)     (5.1)

where MAX2 calculates the maximum in a manner that compensates for data resets. This analytical function cannot be phrased in SQL without the INTERNAL ORDER BY clause, which was introduced here.

In Figure 5.2 we demonstrate the minus function over precipitation data. Data was collected from 2 locations during 3 days, while the array contains 3 dimensions: day, time, and location. The query’s PARTITION BY clause contains only 2 of the 3 dimensions – the 3rd dimension is aggregated upon. The INTERNAL ORDER BY clause lets the MINUS Analytical Function know the internal partition ordering (for determining the “matching value” to subtract). The traditional ORDER BY clause is used here as well, allowing the aggregation function to access the previous (or next) partition’s values. We highlighted in the figure where a data reset occurred (and marked the specific sample with an asterisk).

In the figure, since the measuring begins at midnight, it should end at midnight of the next day – although we have 3 days of data, we are missing a data point to

SELECT MINUS(Rain, 1d)
  OVER (PARTITION BY Date, Location
        INTERNAL ORDER BY TimeOfDay
        ORDER BY Date)
FROM RAIN

Figure 5.2: Analytical Query Example: Demonstration of expected behavior of the MINUS analytical function with data reset (highlighted)

calculate the full last day. For clarity, we demonstrate the calculation of the parking location for January 2nd:

(89 − 88) + 3 + (20 − 3) = 21     (5.2)

Figure 5.3: FDQ: A 3 Dimensional Variable

SELECT AVG(TEMP) - LEAD(AVG(TEMP))
  OVER (PARTITION BY extract_week(Time) ORDER BY extract_week(Time))
FROM TEMP

Figure 5.4: FDQ Query 1 - Analytical Function Query Example

5.3 Query Evaluation and Optimization Algorithms

5.3.1 Analytical Functions Implementations

Our discussion here considers three analytical functions that together cover all the different needs that arise in implementing any analytical function: MIN – Minimum, AVG – Average, and MINUS – Minus.

Before discussing the algorithms, we explain some of the issues that arise. Figure 5.3 shows one variable with three dimensions – Dim1, Dim2, and date (presented by its value above each partial array). Running the query in Figure 5.4


Figure 5.5: FDQ Query 1 resultset

should produce results similar to those in Figure 5.5. The first question to address is whether “Week 2” should be produced at all. The INCOMPLETE clause allows fine-grained control over this matter.

Had we decided not to emit the second window, the whole Time dimension would have to be removed. Dimensional removals and additions introduce a challenge since, as mentioned before, current frameworks do not allow modifying dimensional sizes and variable definitions after the initial creation – enforcing two scans of the results in some cases.

Next, we describe the implementation of the three functions we are focusing on. Analytical functions require multiple data passes, while simultaneously performing additional calculations. Since many calculations are repeated, it makes sense to pre-allocate memory in a way that allows caching the expected re-used values. For example, for LAG (a function to access previously produced values at a specific window offset), we can hold in memory, for every currently processed window, the previous values, allowing fast access to that data without disk accesses or value recalculations.

However, memory is limited and this approach might require more memory than is available. We limit the memory allocation to the available size, and work section-by-section, i.e., iteratively produce full output for n values at a time. Two options arise – to fill (calculate all values of) each window before calculating the values of the next one, or to calculate a limited set of values for all the windows and do so iteratively, until all sections, or tiles, are calculated. Which method should be used depends on the function; for example, LAG should use the latter approach, as illustrated in the sketch below.
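To make the two orders concrete, the following minimal sketch (our own illustration; compute_tile is a hypothetical stand-in for the per-tile work) contrasts window-major and tile-major iteration.

#include <cstddef>

// Hypothetical per-tile work; stands in for the actual window computation.
void compute_tile(std::size_t window, std::size_t tile) { (void)window; (void)tile; }

// Option 1: fill each window completely before moving to the next one.
void window_major(std::size_t num_windows, std::size_t num_tiles) {
    for (std::size_t w = 0; w < num_windows; ++w)
        for (std::size_t t = 0; t < num_tiles; ++t)
            compute_tile(w, t);
}

// Option 2: compute one tile (section) across all windows before advancing to the next
// tile – the order functions such as LAG prefer, since the previous window's tile is
// still cached in memory when the current window needs it.
void tile_major(std::size_t num_windows, std::size_t num_tiles) {
    for (std::size_t t = 0; t < num_tiles; ++t)
        for (std::size_t w = 0; w < num_windows; ++w)
            compute_tile(w, t);
}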

The Minimum function implementation is straightforward. Each value is initialized to NULL, the empty value, while a data scan updates values as necessary.

The Average function implementation uses two arrays, one for a value counter and one for the data – both reset to 0. Every non-NULL value we process increases the corresponding counter by 1, while adding the value to the data array. After all the data for the currently processed section has been scanned, we divide the second array by the first one (a ‘by place’ division), and process the subsequent windows (after resetting the array values back to 0).
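A minimal sketch of this two-array scheme is shown below; it is an illustration under our own assumptions (NULLs encoded as NaN, one flat vector per raw sample), not the engine's actual code.

#include <cmath>
#include <vector>

// Sketch of the two-array average: one counter array and one sum array per section,
// both initialized to zero. NULLs are assumed to be encoded as NaN; names and the
// "vector of raw samples" input layout are illustrative only.
std::vector<double> section_average(const std::vector<std::vector<double>>& samples,
                                    size_t cells_per_section) {
    std::vector<double> sum(cells_per_section, 0.0);
    std::vector<long>   count(cells_per_section, 0);

    for (const auto& sample : samples)                      // one scan over the raw data
        for (size_t c = 0; c < cells_per_section; ++c)
            if (!std::isnan(sample[c])) {                   // skip NULL (missing) values
                sum[c] += sample[c];
                ++count[c];
            }

    for (size_t c = 0; c < cells_per_section; ++c)          // "by place" division
        sum[c] = count[c] ? sum[c] / count[c] : std::nan("");
    return sum;
}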

Implementation of Minus

The MINUS function was described in Section 5.2.2. A significant complexity of the implementation comes from the data counter resets. This can be addressed by using three pre-allocated section caches – the previous values, the current maximums, and a carefully maintained current output (which is a temporary summation of all the reset values).

In Algorithm 11 we show the implementation of the MINUS analytical function in detail, including memory optimization and proper handling of counter resets. First, we iterate over the sections of the relevant variable. We populate each coordinate with its appropriate value (based on the presented calculations). After each section is populated, the calculated values are cached for the next calculation to take place. For brevity, we assume that once a non-NULL value has been assigned to a coordinate, at least one non-NULL value will appear again for this coordinate in each later section.

As can be seen, there is an additional optimization – we maintain calculated sub-results in place. When a function requires access to the previous window, as the MINUS function does, we alternate pointers to where memory is allocated, switching the reading and writing addresses, instead of moving values to different memory areas. This results in a more complicated memory addressing mechanism, but it minimizes disk accesses while also decreasing in-memory data movement.
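The pointer-alternation idea can be illustrated with the following sketch; the buffer names and the placeholder computation are ours, not the engine's.

#include <utility>
#include <vector>

// Illustration of pointer swapping: instead of copying the current window's results
// into a "previous window" buffer, the two buffers exchange roles each iteration.
void process_windows(size_t num_windows, size_t section_size) {
    std::vector<double> buf_a(section_size), buf_b(section_size);
    std::vector<double>* current  = &buf_a;   // written for the current window
    std::vector<double>* previous = &buf_b;   // read-only values of the previous window

    for (size_t w = 0; w < num_windows; ++w) {
        // Fill the current window's buffer; functions like MINUS or LAG read *previous here.
        for (size_t i = 0; i < section_size; ++i)
            (*current)[i] = static_cast<double>(w) - (*previous)[i];   // placeholder computation
        std::swap(current, previous);         // O(1) role swap: no values are moved in memory
    }
}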

We use multiple functions for handling the calculated results: CopyToOutput, SwapPointers, and SendOutputOfSection. Although for simplicity these are shown as independent function calls, in the implementation we carefully calculate where we should write to, and modify pointer addresses. These allow us to avoid unnecessary data access and movement.

Algorithm 11 The Minus Analytical Function

1: procedure MINUS()
2:   curr ← ALLOCATESECTIONMEMORY
3:   out ← ALLOCATESECTIONMEMORY
4:   lastMax ← ALLOCATESECTIONMEMORY
5:   previousWindow ← ALLOCATESECTIONMEMORY
6:   RESETTONULL(lastMax)
7:   RESETTONULL(previousWindow)
8:   for each section s do
9:     for each ordered window w in s do
10:      RESETTONULL(curr)
11:      RESETTONULL(lastMax)
12:      for each raw data source r in w do
13:        for each coordinate c in r do
14:          sc ← MAPTOSECTIONCOORDINATE(s, c)
15:          DoAddition ← False
16:          HadReset ← False
17:          if r[c] == NULL then
18:            continue
19:          end if
20:          if curr[sc] == NULL then
21:            curr[sc] ← r[c]
22:            lastMax[sc] ← r[c]
23:            if previousWindow[sc] != NULL &&
24:               curr[sc] < previousWindow[sc] then
25:              HadReset ← True
26:            end if
27:          else
28:            if lastMax[sc] == NULL then
29:              curr[sc] ← r[c]
30:            else
31:              if lastMax[sc] ≤ r[c] then
32:                if DoAddition then
33:                  curr[sc] ← curr[sc] + r[c]
34:                           − lastMax[sc]
35:                else
36:                  curr[sc] ← curr[sc]
37:                end if
38:              else
39:                DoAddition ← True
40:                curr[sc] ← curr[sc] + r[c]
41:              end if
42:              lastMax[sc] ← r[c]
43:            end if
44:          end if
45:          if !HadReset &&
46:             previousWindow[sc] != NULL then
47:            out[sc] ← curr[sc] −
48:                      previousWindow[sc]
49:          else
50:            out[sc] ← curr[sc]
51:          end if
52:        end for
53:      end for
54:      COPYTOOUTPUT(curr, s, out)
55:      SWAPPOINTERS(curr, lastMax)
56:    end for
57:    SENDOUTPUTOFSECTION(out, s)
58:  end for
59: end procedure

5.3.2 Optimization: Dimensional Restructuring

Dimensional restructuring is an expensive phase in the scientific query execution process. This process entails removing dimensional values for which all matching variable values are empty (NULL), decreasing the resultset size. Since this process requires not only a scan of the data, but also copying and remapping all existing values to new locations, it is extremely expensive.

The three approaches to restructure dimensions are:

• A la carte – dimensions are built on the go. Once a new dimensional value is detected, it is added.

• Post-building – execute sub-queries, accumulate results, and after all results are retrieved build the dimensions and the array.

• Pre-building – build the dimensional values, and afterwards fill the data; requires executing the queries twice, or caching some query results.

For the first approach listed above, multiple implementations are feasible. Since this process requires completely rebuilding the variable every time a dimensional value is added, we consider this option too expensive. In addition, since the array changes its size while it is formed, memory pre-allocation cannot easily be used. Moreover, current array data interfaces (NetCDF, for example) require the dimensional setting to be known before an array can be addressed, ruling this approach out as a viable option.

Post-building is the approach used in DSDQuery [44] (Chapter 2). In this approach, each sub-query runs once, while caching (to disk, due to the potential size) the results in an intermediate format. Then, the cached results are scanned twice – the first scan is used for building and configuring the dimensions, and the second data scan is used to populate the created variable. This approach does not fit queries with large selectivities well – for example, DSDQuery failed to execute a simple aggregation query because it needed to store more than 200GB of intermediate results to a disk smaller than that, in order to generate a 2MB output.

The last approach, pre-building, also includes a two-step process. We first run queries to extract and build the dimensions (in a streamed manner). Then, we run data-filling queries over the source data to assign the results to their place. This method enjoys the benefits of the previous approach, yet does not store intermediate results. We experienced a performance boost of up to 4,000% using this approach. This method works well mainly for condensed data, where executing the query twice is more beneficial than caching results to disk – the trade-off between the last two presented methods.

Working with climate data resulted in another optimization that is often valid for scientific data. If the data contains thousands of files, but all originate from a specific simulation method or set of sensors, only one file from each source needs to be used for building the dimensions (since simulations/data sources produce the same dimensional output).

5.3.3 Distributed Calculations

The calculation of the resultset can be distributed along two orthogonal options: sections and dimensions. In the first, section-based distribution, each section can be calculated by a different process (either locally or remotely). This results in each process accessing all input data sources, yet reading only smaller portions of them. The performance of this approach depends on the properties of the distributed storage system.

The other option is distribution by dimensions. In this option, each process accesses a set of dimensions from each relevant file. The advantage of using dimensions over sections comes from virtual dimensions. Since virtual dimensions are stored in separate files, if the distribution is based on a virtual dimension, two processes rarely access the same file. Thus, this method allows geographical file distribution, and optimizes file access. However, certain analytical functions need access to previous windows, and it is possible that two processes will need to access the same file.

Another aspect is the output file generation, which can be done using two different options: writing the output file in a distributed manner, or adding a process to accumulate results and generate the output file locally. We experimented with both options, and concluded that the differences in performance are marginal. In settings where the storage system is well connected, like ours, both approaches produce similar performance. In settings where the network is fast but storage is slow, adding a process near the end user that generates the final resultsets can be preferable. We chose the second approach since it works well in both settings. The added process for writing the results will be referred to as the management process and is discussed next.

Results Accumulation

We designate a process to accumulate the resultsets. In Algorithm 11 we showed how we transport results from the calculation nodes to the management process in units of sections (line 57). Next, for completeness, we describe the management process.

Algorithm 12 FDQ Results Accumulation

1: procedure ACCUMULATERESULTS()
2:   curr ← ALLOCATESECTIONMEMORY
3:   output ← INITIALIZEOUTPUT
4:   for each node n do
5:     if WindowCanBeCopied then
6:       address ← CALCULATEOFFSET(n, 0, 0)
7:       address += output
8:       GETRESULTSFROMNODE(n, address)
9:     else
10:      GETRESULTSFROMNODE(n, input)
11:      for each section s do
12:        for each window w do
13:          address ← CALCULATEOFFSET(n, s, w)
14:          address ← output + address
15:          COPYWINDOW(address, input, w)
16:        end for
17:      end for
18:    end if
19:  end for
20: end procedure

In Algorithm 12 we show how results are accumulated, taking sectioning into account. While when building the results we can remain agnostic to the final output layout, when accumulating results we may not. In the algorithm, we distinguish between two scenarios. The first is where each window is sequential and the memory contains the fully calculated window. In this scenario, we may simply copy the input to the output. The more complicated scenario is when the output sent from the calculating node is not sequential (sections are used). In this case, we need to copy section-by-section and window-by-window, reducing efficiency. As mentioned before, this use case is rare due to how scientific data is usually held; yet, for completeness, it must be addressed.
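For concreteness, one way the CALCULATEOFFSET step could be realized, assuming a simple layout in which every node contributes equally sized, window-major slices, is sketched below; the layout, parameters, and names are our assumptions rather than the engine's actual code.

#include <cstddef>

// Hypothetical sketch of an offset computation for results accumulation, assuming the
// global output is laid out node-major, then section-major, then window-major, with
// equally sized slices. All sizes and the layout itself are illustrative assumptions.
std::size_t calculate_offset(std::size_t node, std::size_t section, std::size_t window,
                             std::size_t sections_per_node, std::size_t windows_per_section,
                             std::size_t section_size) {
    std::size_t node_base    = node * sections_per_node * windows_per_section * section_size;
    std::size_t section_base = section * windows_per_section * section_size;
    std::size_t window_base  = window * section_size;
    return node_base + section_base + window_base;   // element offset into the output buffer
}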

5.4 System Implementation

5.4.1 FDQ Engine

We have developed two different versions of FDQ – a non-distributed version and a distributed one. The non-distributed version of FDQ is based on DSDQuery DSI [44] and is currently in use by scientists at ANL. The code base is implemented as a DSI plugin to Globus GridFTP [5, 33].

The distributed version uses MPI [57] for communication. To improve performance, we decrease the number of transported messages by calculating multiple sections and sending all of them to the management process together, within larger, but fewer, messages.

The implementation of both engines is in C++. The current system database is implemented using SQLite [93]. Although it is not the most efficient relational system, it was chosen because it is the easiest to maintain and access, especially when user permissions are limited. This database is used for internal engine metadata queries, such as which variables are stored and where, as explained in DSDQuery [44] (Chapter 2).

5.4.2 ANL Web Portal

A web portal has been implemented at ANL to facilitate scientists’ visual interaction with their data. In Figure 5.6, we show a screenshot of some of the provided functionality. This web portal is hosted on a Laboratory Computing Resource Center (LCRC)2 web application server. The FDQ engine is run from a location accessible to this portal.

2http://www.lcrc.anl.gov

Figure 5.6: ANL Querying Portal, backed by FDQ

The portal page presents scientists with the relevant search parameter values and models that can be queried, and then maps user input back on the application server into an FDQ query. An extended SQL syntax that includes support for Analytical Window Querying is used. The web application uses globus-url-copy3 (an open source GridFTP client) to submit the query to the FDQ-enabled server and retrieves the resulting NetCDF file. The NetCDF file is made available on an anonymous read-only GridFTP server. It can then be downloaded by the user using the Globus transfer service [6], which is a hosted GridFTP client and is more user-friendly than globus-url-copy. We chose this architecture since it is likely a user would like to download the results to their local environment or that different users would run the same query – this architecture eases the implementation of both.

3http://toolkit.globus.org/toolkit/docs/latest-stable/gridftp/user/

5.5 Evaluation

In this section, we evaluate our system. Our goal is to show that using FDQ is feasible with reasonable run times, that it is scalable, and that it outperforms earlier implementations (for supported queries). We consider both group-by queries and analytical queries, using both the non-distributed and distributed versions of our engine.

With the exception of experiments involving comparison with SciDB, experiments were executed on the Blues cluster at ANL. The cluster comprises about 350 nodes, each with 64GB of memory and Intel Xeon E5-2670 2.6GHz processors. All the data used for these experiments is real climate data generated by ANL. The datasets comprise the temperature and precipitation variables (Section 5.1), have 2 common dimensions (latitude and longitude) and 1 virtual dimension (sampling time), and contain data for North America. Each day usually consists of 8 samples, distributed among different files. The daily precipitation data (not including any metadata) is 24MB, while the daily temperature data is 704MB (the latter has an additional dimension – vertical levels). In our experiments we accessed up to 20TB of array data, and processed up to 1TB of it.

5.5.1 Average Function Performance

In this experiment we compared the performance of the Average analytical function against our predecessor engine and against the community-leading Scientific Array DBMS, SciDB [112]. This function does not require cross-window access, and can therefore be executed by using the GROUP BY clause for partitioning. We used an array that varied in size between 100MB and 90GB.

Figure 5.7: FDQ Execution Time of The AVG Function Over Partitions With Different Selectivities (16% and 100%) [128 days - 90GB]

In Figure 5.7 we show the query execution times and compare them to the precursor engine, DSDQuery (DSQL). Each queried day involves about 700MB of data, and thus a 128-day query processes 90GB of data. We consider two queries, one which averages 16% of all data, and another that averages 100% of the data. The baseline system crossed the timeout threshold, 15 minutes, when querying 90GB of data, which is why we report results only up to 128 days (and 32 days for 100% selectivity). While FDQ is comparable for smaller datasets, when larger datasets are used FDQ performs several times faster. For example, for a query over 32 days (22GB), FDQ performs 3.5 times faster for queries with low selectivity and 13.3 times faster when querying all data. FDQ’s improvement increases with dataset size – for 128 days (90GB) the improvement of FDQ over DSDQuery increases to 5.05 times. This stems from the low impact different selectivities have on FDQ query execution – querying 6 times more data slowed DSQL by about 600%, while FDQ slowed by at most 22%. This is because of the more nuanced memory management, which increases efficiency and decreases the amount of data movement.

Next, we compare our engine to SciDB. We used a different environment for this experiment, where we could install SciDB: Linux kernel 4.4 (Debian) on a machine with an i5-4590 processor and 16GB of RAM. Both engines were set to use only 1 process. We used 3 3-dimensional arrays (100MB, 1GB, 10GB). It should be noted that there is a significant cost for loading data into SciDB, which we do not account for here. Our queries calculated averages over different partitioning configurations. Specifically, the queries involve 1 partition (longitude, generating 5KB of data), 2 partitions (longitude and latitude, generating 1.8MB of data), or 3 partitions (which returns the original array).

Figure 5.8: Query Runtime Speedups: FDQ Over SciDB

In some of the queries we used value-based subsettings (WHERE val < x) to control selectivity – in FDQ, subsetting queries trigger the use of bitmap indexes to locate the queried data, and we wanted to measure their cost. For SciDB, we report the fastest execution out of 4 consecutive runs – this allows SciDB’s caches to fill correctly and be utilized. FDQ does not use cross-query caches, therefore warm-up runs are not necessary.

In Figure 5.8 we present the results of the SciDB comparison. Each bar in the figure shows the speedup of FDQ compared to SciDB, while the bars are grouped based on the number of partitions used in the queries. The vertical axis, speedup, is logarithmic. The baseline of the figure is at the value 1; bars that point downwards signal that SciDB performed better. The larger the dataset, the better FDQ is – for an array size of 100MB the execution times of both engines are similar, yet for larger arrays our speedup is between 2 and 21. When we use value-based subsettings, the results are more moderate – an improvement of between 1.2 and 2.586. This is due to the dimensional restructuring, the rebuilding of the array in a more condensed structure, and the generation of a NetCDF file as the query result. In comparison, SciDB emits coordinate-based results, which allows it to skip the restructuring process and handle sparsity better, but prevents users from using their existing tools to visualize and analyze query resultsets. The last presented setting, 3 partitions, has a speedup of 9.4x for the 10GB dataset, lower than the speedups measured for the two other cases, 19.23x and 20.53x. This is because of the use of virtual memory due to the resultset size.

The results for the “No Partition” group show that for most datasets, when no partitioning is used, SciDB performs better, as expected. FDQ is not optimized for this case. Yet, when data does not fit in SciDB’s cache, it performs quite badly – a factor of 6 slowdown was observed for the 10GB array.

Overall, we can see the following: FDQ’s memory optimizations are desirable for array data querying, and outperform previous designs. In addition, FDQ behaves well for queries with changing selectivities and different column orders in the PARTITION BY clause; SciDB slows down by 2 to 4 times if the partitions used for the window generation are not in the optimized order.

5.5.2 Analytical Function Performance and Scalability

We focus on the execution of several analytical functions, including the most expensive analytical function, Minus. While the support for Minus queries was only introduced here, for comparison, we upgraded DSDQuery to support this

Figure 5.9: Runtime of Minus Function: Benefits of Memory Optimizations in FDQ [1024 days - 24GB]

function while using the same methods for memory access and management. No other engine provides the ability to run these queries.

In Figure 5.9 we show a performance comparison of DSDQuery with FDQ. The X-axis scale is logarithmic, while the Y-axis is not. As can be seen, DSDQuery is slower, and its run times are not feasible for a scientific workflow – querying 16 days of data (about 384MB) takes more than 10 minutes using DSDQuery, while it takes 28 seconds using FDQ. Querying a larger amount of data, 64 days (1.5GB), takes more than 30 minutes using DSDQuery while taking only 59 seconds using FDQ.

Figure 5.10: Comparison of Analytical Functions Runtimes using FDQ [1024 days - 24GB]

Figure 5.11: FDQ Scale Up: Execution time of Average Function with Increasing Number of Threads

In Figure 5.10 we show a runtime comparison among the most commonly used functions. The graph shows that the simplest (maximum) and the most complicated (minus) analytical functions perform similarly, with the same pattern, suggesting any needed analytical function will execute efficiently, within these two boundaries.

Next, we evaluated how the system scales on a machine with 8 cores. We ran a query using the Average analytical function over a changing number of days – varying from 64 (1.5GB) to 512 (12GB). As can be seen in Figure 5.11, the system scales well, up to the number of cores available on the machine. The improvement is linear, as expected.

Figure 5.12: FDQ Scale Out: Execution Time of the Average Function With Increasing Number of Nodes

Next, we ran an average query in a fully distributed manner. We executed the same query for different numbers of days, calculating the daily average. All results were consistent – for example, processing 128 days on one node took 40.99s, while processing 256 days on one node took 81.71s; as an additional example, processing 1024 days took 6.57s using 64 nodes, while processing 256 days took 1.66s and processing 128 days took 0.93s on the same number of nodes.

In Figure 5.12 we report the run times of a query using the Average analytical function for a changing number of days. It is noticeable that as the parallelism increases, the performance improves linearly. An interesting pattern that was uncovered is that when an extremely small number of days is processed on one node (1 day in the 128-day experiment over 128 nodes), the performance degrades. This is a result of the overhead of the communication and reduction algorithms.

In conclusion, FDQ performs and scales well. The new memory management architecture, together with the optimizations described before, allows scientists to use structured query operators efficiently. Calculations are distributed in a manner that allows linear improvement with the number of nodes used. A noticeable advantage is the ability to run an analytical function over a large amount of data and windows in only a few seconds while processing a large number of raw files. For example, we processed 1024 days over 64 nodes in 6.57 seconds – run times that are unheard of within these communities.

5.6 Summary

In this work we introduced analytical functions to the domain of scientific array data. We showed that the current tools used by scientists are not sufficient, and introduced new syntax, algorithms, and techniques that enable the use of a structured querying engine to efficiently process complex analytical queries. We have done so by planning memory alignment, dividing calculations and executing them in parallel, and working in chunks that fit in memory. We evaluated our work over real datasets, and showed that we enable near real-time analytical querying, outperforming other options.

Chapter 6: An Optimizer for Multi-Level Join Execution of Skewed and Distributed Array Data - ScKeow

In this chapter we revisit the join optimization problem, which was partially addressed in Chapter 3 by DistriPlan. Our previous coverage of this issue addressed generic environments; yet, real environments (such as the ones described in Figure 1.4) have properties we can utilize to generate fewer evaluation plans.

Complex queries, such as the Tornado Query demonstrated in Figure 6.1, over data distributed in such real environments could not be optimized using DistriPlan, due to the large number of nodes in the environments used for their processing.

Here, we target efficient execution of join queries over scientific array data that is distributed across heterogeneous clusters, where each cluster is internally homogeneous, while considering how data skew changes during processing. We do so by utilizing environment properties (the similarity of machines within each used cluster) to generate an ideal plan (unlike the optimal plan DistriPlan generates).

We first consider the ability to convert data representation formats (from the traditional array representation to a sparse one) during query execution, and suggest integrating this ability within cost models. Then, we introduce the CF, Conversion

Figure 6.1: Example Query: Tornado Hypothesis Verification

Factor, and the MPU, Minimal Profitable Unit – parameters we use in our suggested cost model to prune and evaluate optional plans. We evaluate our methods and show that they are beneficial. In most heterogeneous settings our approach produces an ideal plan directly, while for homogeneous settings, in most cases, our approach produces half the number of plans compared with traditional approaches.

6.1 Background

The commonality of the Digital Sky Survey [71, 118], the Large Hadron Collider (LHC) project at CERN [1, 75], and gas and climate simulations [103, 108, 137] is that all generate increasing amounts of daily array data, starting at a few Megabytes and reaching up to 20TB. Due to the way it is collected, the data is often stored in raw data files in formats such as NetCDF and HDF5 [48, 102], distributed across multiple repositories. Traditionally, scientists wrote low-level programs or scripts to analyze this data [103, 108, 113, 137], but lately Scientific Database Management Systems (SDBMS), which allow querying array data with declarative languages [24, 25, 41, 43, 44, 81, 141, 145], have been emerging.

While advanced querying engines over scientific data exist [25, 41, 43], they are limited in the number of subsequent computation-heavy operators they can process in a single query [43, 145]. To optimize queries, these engines mostly use a set of rules and heuristics – these mainly fit simple queries [64, 78]. A few systems use a cost-based model [43, 45, 145], which can address complex queries, but they do not consider the overall network architecture, which we will show is essential.

In addition, array data can be represented in multiple ways. While the common representation is the traditional flattened and continuous layout, where many cells are empty (contain NULL values), sparse representations that decrease memory but require data reformatting and lookups [16, 96] can also be appropriate. Current systems stick with a specific, pre-chosen data representation, and after an initial conversion to that representation, process queries [43, 44, 111, 112]. The format change may occur either when data is loaded into their systems, e.g. SciDB [23], or after the query begins execution, e.g. DSDQuery [44] (Chapter 2). Scientists often

prefer to store the data in its native format, due to the versatility of the tools available for the native formats and the high cost of data loading – forcing data representation format changes to be applied at runtime.

6.1.1 Addressing Skew in Processing of Array Data: Open Problems and Our Contributions

Figure 6.2: Sckeow: Cost Based Model Optimization Engines Comparison

In Figure 6.2 we graphically compare the current state of the art optimizers for join queries over array data, considering what type of array skew they target. Prior work mainly focused on simple queries containing two variables (arrays) and one join [43, 45, 145]. Moreover, skewed data in Relational DataBase

Data skew optimization for array data is targeted in Skew-Aware Joins [43] – the authors assume data are stored unevenly on multiple nodes, while each node can- not access data of other nodes. A later work, Similarity Join over Array Data [145] widens the definition of joins to include aerial aggregations (similar to masking), but has a similar approach for query optimization (the main difference is that while both assume the most distributed approach is cheaper, the latter adds network con- gestion to the cost evaluation process, aiming to improve the network utilization).

DistriPlan [45] (Chapter3) addresses resource skew – more specifically, DistriPlan minimizes execution time while considering the full network architecture.

This work, Sckeow, considers both data skew and resource skew, while targeting clusters where storage is not local to each node but is centrally accessible to all cluster machines, networks have different capabilities, and each cluster has different hardware. Our work includes an optimizer for advanced join queries over distributed array data, where data becomes skewed while being processed over skewed resources.

Generating efficient query execution plans for complex queries over distributed arrays is challenging. First, each cluster has a different computational power, and it is hard to evaluate, based on the environment specification, which cluster is preferable for a specific calculation (while accounting for data movement as well). Second, adjudicating data transportation, shuffling, in a cost-based model requires spanning many plans for data transportation. Encapsulating and using the full network properties results in too many plans, which is impractical to generate and evaluate in a highly distributed setting. Third, a large number of joins in a single query results in an exponential number of options to process the query. Last, it might be preferable to change the data representation format (e.g., from the traditional array layout to a sparse representation).

Our solution involves the following components. We introduce the CF – Conversion Factor. Using this numerical value, which can be calculated for each array efficiently at run time, we prune plans that will not profit from a data representation change before they are generated. We also introduce the concept of the MPU, Minimal Profitable Unit, an automatic evaluation of cluster performance. The MPU entails multiple properties: a size that is used as a threshold for determining whether data calculation should be distributed within each cluster, the performance of processing this size on that cluster’s machines, and the network properties within each cluster and across clusters. The main innovation in relation to the MPU is that it is collected on each cluster by software that simulates distributed calculations, allowing an accurate computational power estimation. Last, we introduce a cost model for complex queries over highly dimensional scientific arrays.

We evaluate our system by using an execution engine that executes plans generated by different approaches. We show that methods which consider the networking properties are more efficient than those that do not – in most heterogeneous settings our approach directly spans an ideal plan, which executes 508% faster than plans produced by traditional optimization approaches. In addition, we show that although our method does not consider all the networking properties, it performs similarly, and in some cases better (on average, 172% faster), compared to an optimizer that does – resulting from the flexible and nuanced data format change our approach entails.

6.1.2 Sparse Matrices Representation

Arrays can be represented as sequential values that are mapped to different dimensional settings – this is the default array representation, and the one used by programming languages such as C++. This method of storage is wasteful when only a few values are stored in the array, and the rest are empty (NULL values). For example, when only the diagonal values are stored for a 2-dimensional array of size n × n, as is the case for the identity matrix, we store n^2 values while using only n of them. Sparse matrices address the storage and lookup for such arrays.

The most intuitive method of storing sparse arrays is by holding the data in the full setting format (dimensional values, followed by the array cell value). For the example above, the data would be stored as [1, 1, val1], [2, 2, val2]....

More advanced techniques have been developed to hold sparse matrices, most of which are targeted either at a specific problem or at a specific data distribution/structure [16, 96]. In this work, we represent our sparse matrices in a format called hashmaps-of-hashmaps. The first dimension has a hashmap in which each value points to the next-level hashmap, and so forth until the last-level dimension. At the last level we point to the values that were stored in the array. For communication, we build a continuous layout in the dimensional format and transport it – we do so since hashmaps are not sequential, and do not transport efficiently, while sequential structures do.
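The following is a minimal sketch of this layout for a 2-dimensional array, together with the flattening used before communication; the type and function names are illustrative, not the system's actual interfaces.

#include <unordered_map>
#include <vector>

// Sketch of a 2-dimensional hashmap-of-hashmaps sparse array: the outer map is keyed by
// the first dimension, each inner map by the second, pointing at the stored cell value.
using SparseArray2D = std::unordered_map<long, std::unordered_map<long, double>>;

// A flattened cell in the "dimensional format" used for communication:
// the dimension values followed by the array cell value.
struct FlatCell { long d1; long d2; double value; };

// Build a continuous layout for transport, since hashmaps are not sequential in memory.
std::vector<FlatCell> flatten(const SparseArray2D& arr) {
    std::vector<FlatCell> out;
    for (const auto& [d1, inner] : arr)
        for (const auto& [d2, value] : inner)
            out.push_back({d1, d2, value});
    return out;
}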

6.1.3 Array Data Joins and Aggregations

The definition of an array data join is debated, and defined differently in each work. Skew-Aware Join [43] defines a join as a comparison of matching cells or a comparison of all cells of one array to all cells of the other (resembling a Cartesian join). If a comparison condition holds, the value (or a structure containing both values, for inequality conditions) is returned. Similarity Join over Array Data [145] slightly expands this definition to include a wider set of criteria while using a mask on one, or both, of the arrays. The whole processing in these works is conducted in a sparse dimensional format.

In this work (Subsection 3.1.1) we provide an alternative, more generic definition for the join operation, which is based on how arrays are stored in real environments. The join, as defined here, is not based on matching array cells; instead, the match is made over the underlying dimensional mapped values. So if, for example, two arrays have dimensions that describe a location, and the location values are stored in a mapped variable [44], these values are used for the comparison, allowing arrays that are structured differently but describe the same content to be joined.

6.1.4 Query Optimizer

This subsection repeats some of the content that was described earlier in Subsection 3.1.3. It is provided for completeness, as well as to summarize and emphasize the content that is relevant for this chapter.

An execution plan, or simply a plan, is a tree representation that contains processing instructions for an execution engine to produce the correct query results. While the declarative query handles the “What”, the plan targets the “How”. Plans contain a hierarchical ordering of operators, each of which has at most two children nodes. When the plan tree is followed from bottom to top, the intended query results are produced.

Critical to producing an efficient query plan are query optimizers. Two types of optimizers are traditionally used – Rule-Based Optimizers (RBO) and Cost-Based Optimizers (CBO). In RBOs, we use a set of predetermined rules or heuristics. For example, a rule most RBOs have is “push down all subsettings to the lowest possible tree level” – this is a good rule since it decreases the effort of the “upper” part of the plan, which often requires heavier calculations; the less work represented there, the faster results will be produced. RBOs are unlikely to build optimal plans in complex scenarios due to the generality of the rules they use.

CBO's approach the optimization differently. Instead of determining in advance what to do, they generate (span) all viable options, as determined by a set of simple rules. CBO's use a Cost Model, which assigns a cost to each plan node – the cost of a plan is a function over the costs of all the nodes in it. The goal of a CBO is that the cheapest plan will execute more efficiently than more expensive plans.

A common mistake is to assume that the cost function is a time estimation.

While in some cases it is true that we measure efficiency by time (but not always – for example, on mobile devices efficiency would often be linked to energy consumption), the cost model is not intended to predict time directly. Even when efficiency is measured by execution time, a good cost model would guarantee that if the cost of plan 1 is smaller than the cost of plan 2, it would also execute more efficiently (faster). The difference between two plans' costs does not indicate how much better one plan is than the other.

The first phase in a CBO is spanning plans, which are later evaluated by the cost model. An ideal approach is to span and evaluate all possible plans. Yet, for complex queries, this results in an exponential number of plans, which cannot even be held in memory. Therefore, we must prune inviable options and span only relevant plans. The decision of which plans are viable and which are not is performed by a plan spanner. The plan spanner gets as input a parsed query, generated from the user's input, and builds all possible execution plans that satisfy the query. It spans plans with the operators in different orders, as well as different options to satisfy the same user criteria – for example, for a join query requiring sorted results, the plan spanner may consider a raw data sort followed by a merge-join, or a join followed by a sort. In many cases, the plan spanner can prune plans before they are generated; for example, if there is a join and a subsetting, the subsetting should be executed before the join.

6.2 Overview - Sckeow Optimization

6.2.1 Formal Problem Statement

Given n clusters with m_i machines in the i-th cluster, n(n − 1)/2 different networks that connect the n clusters pairwise, n internal networks (one for each cluster), data descriptors (which describe the data locations, distributions, and possibly other information), and a (join) query, our goal is to build an efficient execution plan to correctly execute the query. In other words, the problem we target is how to optimize a query over distributed data, given a resource-skewed (or heterogeneous) environment.

6.2.2 The Plan Generation Process

We suggest using a two-level optimization approach. The first level addresses global processing – e.g., which cluster should process which part of the query. Since the communication is evaluated based on message sizes, the decisions regarding data representation conversions are made at this level. The second level is local to each cluster, i.e., subject to the global plan, we build a local execution plan based on the required tasks each cluster has to perform. In most cases the second-level optimization can be done by an RBO.

Unlike in the internal cluster setting, we assume each cluster has a different configuration – i.e., each cluster is homogeneous but different clusters have different configurations. This allows us to simplify some of the optimizer calculations.

High-Level Plan

This plan spanning stage considers multiple aspects: each cluster’s computing power, data representation, data distribution across clusters, and the cost of data

format conversion. Though individual steps will be described later, the summary of the process is as follows. We begin by finding all the data sources relevant to the query (while analyzing where they are stored), and build all possible join orders for these sources up to the root. Then, we evaluate each plan's cost by using the relevant cost equations.

In Algorithm 13 we present the high-level optimizer. First, the query is parsed, and afterwards all data sources are extracted and feasible join options are spanned.

Then, data distribution is considered (by spanning for each plan all its distribution options, and as a consequence, all relevant data transportations). Last, the cost of each plan is evaluated and the cheapest plan is chosen. We discuss the plan distribution process in Subsection 6.3.1 and the cost evaluation in Subsection 6.3.2.

Algorithm 13 Query Optimization: High Level Optimizer
1: function HIGHLEVELPLAN(query)
2:     pq ← PARSEQUERY(query)
3:     ds ← EXTRACTDATASOURCES(pq)
4:     plans ← BUILDALLPOSSIBLEJOINS(ds, pq)
5:     for each plan p in plans do
6:         dplans ← DISTRIBUTEPLAN(p)
7:         for each plan dp in dplans do
8:             dp.cost ← EVALUATECOST(dp)
9:             DistributedPlans.append(dp)
10:        end for
11:    end for
12:    chosenPlan ← FINDCHEAPEST(DistributedPlans)
13:    return chosenPlan
14: end function

Low-Level Plan

The low-level plan is built locally within each cluster. This plan is built mainly by using traditional methods [49, 61, 78] – intra-cluster communication often has high throughput and low latency, therefore heuristics can be used. Although the low-level plan can be built using traditional methods, we adjust the local rules to use the computing power we measure (Subsection 6.3.1) when making decisions.

6.3 Cost Model

6.3.1 MPU - Minimal Profitable Unit

MPU's are intended to assess the execution environment properties and aid in evaluating each cluster's computational power. We measure two properties: the Minimal Distribution Size (MDS) and the Communication (comm), and use them to extract three MPU's: MPU_MDS, MPU_comm, and MPU_CP. The first, MPU_MDS, is used to decide how to distribute calculations (how many nodes should be involved in an operation). The second, MPU_comm, is used to evaluate which cluster should process operations. The last, MPU_CP, which is itself based on the MPU_MDS evaluation, is used within the cost model to anticipate each operation's cost.

MPU_MDS is a dataset size such that if we process an array smaller than it in a distributed manner, it will perform worse than processing on a single machine. We evaluate this by first building a large array with random data and locally (using one node) calculating aggregations (and joins) over this data. Then, we divide the array into smaller chunks and distribute the same calculation over the cluster. We repeat this process with smaller arrays until there is no performance gap between the local calculation and the distributed one. At this point, we have obtained the MPU_MDS.

The MPU_MDS is calculated for each cluster separately. We assume that the MDS determination will terminate; it might be necessary to set search bounds in some environments. The MDS is mainly used to determine how many nodes should be used within a cluster for executing calculations, preventing the performance degradation observed in Figure 5.12. This is done by ensuring each node gets input data that is at least the size of the MDS, an action that might result in utilizing fewer resources than are available.
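The calibration loop can be summarized as sketched below; run_local and run_distributed stand in for benchmark runs of the same aggregation (or join) on one node and on the whole cluster, and are assumptions rather than real interfaces of our engine.

```python
# Hypothetical sketch of the MPU_MDS calibration described above.
def find_mds(run_local, run_distributed, start_size, min_size, tolerance=0.05):
    """Shrink the benchmark array until distributing it no longer beats one node."""
    size = start_size
    while size >= min_size:                      # search bound, as noted above
        t_local = run_local(size)                # seconds on a single node
        t_distributed = run_distributed(size)    # seconds across the cluster
        if t_distributed >= t_local * (1.0 - tolerance):
            return size                          # no meaningful gap left: this is the MDS
        size //= 2                               # repeat with a smaller array
    return min_size
```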

MPU_CP is a numeric value, where the best performing cluster has the score 1. All other clusters get a score such that dividing their performance by their MPU_CP yields the performance of the same operation on the best performing cluster. Although internally we distinguish the performance measured for each type of calculation (some operations perform better on specific machines), in the following discussion, for brevity, we do not differentiate between them.
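For illustration, normalizing measured benchmark times into MPU_CP scores could look as follows; this is a sketch, and the benchmark times are made up (chosen to reproduce the scores used in the Tornado example later in this section).

```python
# Hypothetical sketch: the fastest cluster gets MPU_CP = 1, others are scaled relative to it,
# so dividing a cluster's measured time by its MPU_CP yields the best cluster's time.
def mpu_cp_scores(benchmark_seconds):
    best = min(benchmark_seconds.values())
    return {cluster: t / best for cluster, t in benchmark_seconds.items()}

print(mpu_cp_scores({"cluster1": 12.0, "cluster2": 36.0, "cluster3": 15.6}))
# {'cluster1': 1.0, 'cluster2': 3.0, 'cluster3': 1.3}
```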

The communication MPU (MPU_comm) is used to assist in evaluating the profitability of transporting data across clusters. For example, a cluster with fewer nodes might be stronger in total than a larger cluster, resulting in the need to evaluate which cluster should process what part of the query. MPU_comm can be viewed as a function that takes two parameters – which cluster sends data and which receives it – and returns a numerical value, or cost, for transporting the MDS between these two clusters in the given direction. We correlate the MPU_comm values with the MPU_CP, so that both represent the same performance units.

To measure MPU_comm, we distribute a join operation within a cluster, where each node holds an array of the size of the MDS – since we use an equi-join, arrays are not transported within the cluster (in contrast to across clusters), but are read from the cluster storage. Then, we distribute the join calculations across two clusters – an additional cluster is used to hold one of the arrays and transport it to the cluster that previously performed the calculation. We calculate the performance gap between these two executions, which uncovers the cost of the communication from the first cluster to the one performing the calculation.

Taking our Tornado query example (presented in this chapter's motivation, Figure 6.1), the data is split across 3 clusters (which we refer to as 1, 2, and 3), where two clusters have similar CPU's (1 and 3), while the other cluster's CPU is weaker. This was measured by our MPU software as: cluster 1 – 1.0, cluster 2 – 3.0, and cluster 3 – 1.3. The networking between clusters 1 and 2 is of LAN speeds (10GBPS), while cluster 3 is at a remote site, with access speeds of less than 1GBPS. After correlation, examples of communication scores are: about 40 for the network between clusters 1 and 3 (38.2 when sending data to cluster 3, and 42 when receiving from it), and about 5 between clusters 1 and 2. Notice that the higher networking numbers suggest the CPU's are strong – this is the result of correlating and scaling the values to the same time units.

6.3.2 Cost Evaluation

EvaluateCost, the call in Line 8 of Algorithm 13, initiates the plan cost evaluation. The equations for calculating the costs are shown in Table 6.1. Here, E(n) is a normalized data size, representing the expected read load. In some cases we use left and right, referring to the current node's inputs (children). When only one child exists, we assume it is the left one.

The cost of a plan is calculated recursively over its sub-plans. Given the root in the initial call, a plan cost is:

PlanCost(n) = Cost(n) + max_{i ∈ n→kids} PlanCost(i)
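A direct transcription of this recursion is sketched below; the plan-node structure is an assumption. Taking the maximum over the children, rather than their sum, presumably reflects that sibling sub-plans can execute concurrently.

```python
# Hypothetical sketch of the recursive plan-cost evaluation.
from dataclasses import dataclass, field
from typing import List

@dataclass
class PlanNode:
    cost: float                                   # Cost(n), from the cost model (e.g., Table 6.1)
    kids: List["PlanNode"] = field(default_factory=list)  # at most two children

def plan_cost(node: PlanNode) -> float:
    """PlanCost(n) = Cost(n) + max over n's children of PlanCost(child)."""
    if not node.kids:
        return node.cost
    return node.cost + max(plan_cost(k) for k in node.kids)

# A join over two scans: only the slower child branch is added to the join's own cost.
root = PlanNode(cost=5.0, kids=[PlanNode(2.0), PlanNode(7.0)])
print(plan_cost(root))   # 12.0
```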

In Table 6.1 we demonstrate a subset of the cost equations to illustrate our overall approach. We use multiple markers: dims is the set of dimensions.

Operator                 Cost(n)
Filter                   mpu_cp[cluster] × ∏_{i∈dims} i→length
Filter of Sparse Data    mpu_cp[cluster] × E(left) × SGL + (E(left) − E(n)) × SGC
Distribute               (mpu_comm[from][to] + mpu_cp[cluster]) × ∏_{i∈dims} i→length + ∏_{i∈dims} i→length
Distribute Sparse Data   mpu_comm[from][to] × E(left) × SGL + (mpu_cp[from] + mpu_cp[to]) × E(left) × SGC
Join, Not Sparse         mpu_cp[cluster] × (E(left) × E(right) + E(n))
Join, One Sparse         mpu_cp[cluster] × (E(left) × SGL + E(right)) + E(n) × SGC
Join, Both Sparse        mpu_cp[cluster] × E(left) × SGL × 2 + E(n) × SGC

Table 6.1: Node Cost Examples. Join Costs Calculated for a Non-Sparse Nested-Loop Equi-Join.

SGC stands for the Sparse data Generation Cost, and refers to the cost involved in adding or removing items from the sparse structure. Similarly, SGL refers to the lookup cost over the sparse data. In all cases, cluster is the id of the cluster that performs the calculations.

As an illustration of some of the factors: the cost of filtering data in the traditional representation equals the cost of scanning it. However, the cost of filtering data in a sparse representation is different – it is the size of its input (in the condensed representation) plus the number of items removed from it multiplied by the cost of removing those items. Another approach to implement subsetting is to create a new sparse matrix to which each value that satisfies the criteria is added (instead of being removed, as in the other approach) – the decision between the two can be made by an optimizer, based on the expected selectivity.
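The sketch below illustrates how an optimizer could choose between the two sparse-subsetting strategies from their modeled costs. The removal cost follows the "Filter of Sparse Data" row of Table 6.1 (without the mpu_cp factor, which cancels in the comparison); the construction cost and the SGL/SGC values are illustrative assumptions.

```python
# Hypothetical sketch: choose how to implement a sparse subsetting from its modeled cost.
def cost_filter_by_removal(n_input, n_output, sgl, sgc):
    # Look up every stored item, pay the removal cost for each dropped item.
    return n_input * sgl + (n_input - n_output) * sgc

def cost_filter_by_construction(n_input, n_output, sgl, sgc):
    # Look up every stored item, pay the insertion cost only for each kept item (assumed).
    return n_input * sgl + n_output * sgc

def choose_sparse_filter(n_input, expected_selectivity, sgl=1.0, sgc=3.0):
    n_output = int(n_input * expected_selectivity)
    remove = cost_filter_by_removal(n_input, n_output, sgl, sgc)
    build = cost_filter_by_construction(n_input, n_output, sgl, sgc)
    return "remove_non_matching" if remove <= build else "build_new_matrix"

print(choose_sparse_filter(1_000_000, expected_selectivity=0.9))  # remove_non_matching
print(choose_sparse_filter(1_000_000, expected_selectivity=0.1))  # build_new_matrix
```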

6.4 Plan Generation Considerations

In our optimization process, generating a large number of plans is not desirable, for several reasons, including the time it takes to evaluate the plans and the memory required for storing them. Here, we introduce several methods to prune plan options before they are even generated.

6.4.1 Plan Pruning Communication

The approach we use to prune communication options is as follows. We first use the MPU's to evaluate the performance of the relevant calculation on each of the candidate clusters, as well as the cost of transporting the produced result set to the cluster that originally distributed its data. For example, if a join involves clusters 1 and 2, we evaluate the performance of calculating the join on each cluster, and the cost of transporting the results to the other involved cluster.

If calculating the results on a specific cluster and sending them to the other one is faster than calculating them directly on the other cluster, we use that configuration, and the other options are pruned. If it is not clearly preferable to use a specific cluster, all options are spanned and the cost model is used to evaluate them all, while considering their global effect.

This method ensures that even if data transportation is later necessary because the results were not produced on the cluster that consumes them, the chosen plan remains globally cheaper, and therefore the cost model is correct.
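A sketch of this pairwise check using the MPU values follows; the function, its parameters, and the cost expressions are simplified assumptions rather than the engine's actual code.

```python
# Hypothetical sketch of pruning the processing location for a two-cluster join.
def choose_join_cluster(work, result_size, mpu_cp, mpu_comm, c1, c2):
    """Return the single preferable cluster, or None if the cost model must decide."""
    # Compute on c1 and ship the results to c2 -- and the symmetric option.
    cost_on_c1 = work * mpu_cp[c1] + result_size * mpu_comm[(c1, c2)]
    cost_on_c2 = work * mpu_cp[c2] + result_size * mpu_comm[(c2, c1)]
    # Compute on a cluster without shipping (the consumer already holds the result).
    local_c1 = work * mpu_cp[c1]
    local_c2 = work * mpu_cp[c2]
    if cost_on_c1 < local_c2:   # computing on c1 and sending beats computing on c2
        return c1
    if cost_on_c2 < local_c1:
        return c2
    return None                 # not clearly preferable: span both, let the CBO decide
```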

CF – Conversion Factor

We define CF to be a threshold for the profitability of changing data representation formats. For example, if the CF is 5% and our query contains a join of two arrays with an anticipated selectivity of 2%, it might be preferable to build the join results in a sparse representation instead of in the traditional array memory layout. Should the selectivity be 7%, the traditional memory layout should be used. Dimensional subsettings, unlike value-based subsettings, do not generate sparse matrices, since they decrease the total matrix size rather than removing sporadic values within it.

CF is actually structured as a two-tuple (s_u, s_l). The threshold above which changing the data representation format would harm performance is defined as s_u. We also calculate a lower threshold below which not changing the data representation format would harm performance; this selectivity is defined as s_l.

More formally, s_u and s_l are calculated as follows. We use dimCount for the number of dimensions and dim for the dimension data type – for simplicity, we assume all dimensions are of the same type. The value of the CF upper bound is:

s_u = 1 / (dimCount + sizeof(data)/sizeof(dim)) = 1 / (dimCount + 1)    (6.1)

This matches our intuition, since the sparse matrix representation may weigh up to the number of dimensions multiplied by the number of values within the matrix, and any selectivity above s_u would produce a size that is potentially bigger than the original array, and definitely more expensive to transport and process. (The second equality in Equation 6.1 holds when the value and dimension types have the same size.) This equation should be adjusted to each relevant data representation and to the actual data types.

s_l = s_u / cf_v    (6.2)

where cf_v is a factor of the computation required for converting the data format – for any practical purpose, that would be the number of data passes required to convert the data to its new representation multiplied by the number of operations required for it. For the standard sparse matrix representation, the value of cf_v is simply the number of dimensions increased by 1 (each value is read once, while each dimension is looked up in the relevant dimensional storage map once as well).

These two terms can be simply stated as s_u ≈ 1/(dimCount + 1) and s_l ≈ s_u², which allows using simple logic for pruning plans.
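Putting the two thresholds together, the representation decision for a node can be sketched as follows (under the assumptions above; e.g., for a 3-dimensional array whose value and dimension types have the same size, s_u = 1/4 and s_l = 1/16).

```python
# Hypothetical sketch of the CF-based pruning of data-representation choices.
def cf_thresholds(dim_count, sizeof_data=8, sizeof_dim=8):
    s_u = 1.0 / (dim_count + sizeof_data / sizeof_dim)   # Equation 6.1
    cf_v = dim_count + 1                                  # passes needed for the conversion
    s_l = s_u / cf_v                                      # Equation 6.2
    return s_u, s_l

def representation_options(expected_selectivity, dim_count):
    s_u, s_l = cf_thresholds(dim_count)
    if expected_selectivity > s_u:
        return ["dense"]          # conversion would only hurt: prune sparse plans
    if expected_selectivity < s_l:
        return ["sparse"]         # not converting would hurt: prune dense plans
    return ["dense", "sparse"]    # in between: span both and let the cost model decide

print(cf_thresholds(3))                   # (0.25, 0.0625)
print(representation_options(0.9, 3))     # ['dense']
print(representation_options(0.02, 3))    # ['sparse']
print(representation_options(0.1, 3))     # ['dense', 'sparse']
```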

6.5 Evaluation

In this section, we evaluate our approach. Specifically, we show that our optimized plans perform better than plans that were optimized by the approaches described in Skew-Aware Join [43], Similarity Join [145], and DistriPlan [45].

The datasets we use contain real temperature data generated by climate simulations [127, 147] conducted by the Department of Energy (DOE) at Argonne National Laboratory (ANL). We used multiple variables, including temperature, precipitation, and others. The dataset sizes range from a few MB up to 2 TB. In these datasets, each variable is distributed among multiple files, where each file contains data for a specific time; thus, the time dimension is a virtual dimension. Our experiments ran on a cluster where each node has 16 cores, i.e., two 2.6 GHz Intel(R) Xeon(R) E5640 processors, with 24 GB of memory, running Linux kernel version 2.6. Nodes are connected using Infiniband.

6.5.1 Optimization Performance

We first focus on the efficiency of the optimization engine, by executing queries and measuring the number of spanned trees each engine generates. We use queries with between 1 and 255 joins; we note that most real queries contain between 3 and 15 joins. In each query there are log2(N) + 1 involved variables (N is the number of joins). The queries are structured so that all variables join with each other, similarly to the example query. The selectivity of each join is configured to be the same for all joins in a query, and set to 0.06, 0.3, 0.6, or 0.9. We chose these values because a selectivity between 0.06 and 0.3 falls in the range that prevents pruning some of the plans (between the CF lower and upper bounds).

Se     St   1 Join  3 Joins  7 Joins  15 Joins  31 Joins  63 Joins  127 Joins
0.9    T    1       1        1        1         1         1         1
       Sp   1       2        4        9         19        50        107
0.6    T    1       1        1        4         25        676       456,976
       Sp   1       2        4        8         18        60        27,085
0.3    T    1       1        4        16        256       65,536
       Sp   1       2        4        8         22        2,156
0.06   T    1       1        1        1         1         1         1
       Sp   1       2        4        8         17        37        86

Table 6.2: Sckeow Non-Distributed Trees Spanning Performance; Increasing Number of Joins with Different Selectivities. Title abbreviations: Se – Selectivity, St – Settings. Settings abbreviations: T – # of Trees, Sp – Span Time. Spanning time is reported in units of 100µs (1 × 10^-4 seconds is reported as 1).

We limited the duration allowed for spanning plans, as is often done in real optimizers, to 10 seconds.

Homogeneous Configuration

We first evaluate the plans where the MPU's are fixed to be equal for all parameters, and we configure the environment to consist of 2 clusters. With the exception of DistriPlan, all optimizers are agnostic to the size of each cluster. In all experiments where data is distributed, the data is initially distributed in a manner that forces data transportation – if there are 4 arrays, 1, 2, 3, and 4, and the executed joins are between arrays 1-2, 3-4, and the two previous joins' results, then arrays 1 and 2 and arrays 3 and 4 are placed on different clusters. This design forces the worst case for data transportation optimization.

# of Joins   1   3   7     15       31
0.9          1   4   36    2,916    19,131,876
0.6          1   4   36    11,664   28,697,814
0.3          1   4   144   46,656   28,697,814
0.06         1   4   36    2,916    19,131,876

Table 6.3: Sckeow Distributed Tree Spanning Performance; Queries with Different Selectivities

In Table 6.2 we present the tree spanning performance for queries with different selectivities. An empty cell means it took more than 10 seconds to produce plans for the referenced setting. As can be seen, when the selectivity is 0.6 or 0.3, at some point of processing the selectivity falls between the CF thresholds, resulting in the inability to prune these plans and the need to evaluate an increasing number of plans using the cost model. But when selectivities are low or high, and it is clear whether, and where, data conversion should be made, our optimizer prunes the other options and produces a manageable number of trees within a relatively short time. As can also be seen, the more joins in a query, the longer it takes to process it; yet while the number of joins increases exponentially, the processing time increases linearly in most cases, and depends on the number of generated trees – not on the number of joins.

In Table 6.3 we show the number of trees spanned for a distributed setting. An interesting result is that the spanning of the plans is independent of the number of clusters – the number of spanned plans for data distributed over 2, 3, 4, and 5 clusters is the same.

# of Joins        1   3    7       15           31
Sckeow-Best       1   4    36      2,916        19,131,876
Sckeow-Worst      1   4    36      11,664       28,697,814
Skew-Aware        2   6    54      4,374
Similarity Join   2   6    54      4,374
DistriPlan - 2    4   16   256     65,536
DistriPlan - 3    6   36   1,296   1,679,616
DistriPlan - 4    8   64   4,096   16,777,216

Table 6.4: Sckeow Distributed Plan Spanning Engines Performance Comparison. The number mentioned next to DistriPlan is the number of nodes in each cluster.

In Table 6.4 we compare the number of plans spanned by each engine. While our engine adds additional nodes for evaluating data representation changes, other engines do not. Sckeow performs best when the join selectivity is either high or low (“Sckeow-Best”). The other engines generate about twice as many plans, in at least twice the time it takes Sckeow to generate and select its plan. Unlike the other systems, DistriPlan depends on each cluster's size – it considers intra-cluster communication. This results in an extremely high number of options it spans.

Overall, we can conclude that Sckeow performs well. The spanning time and the number of plans are at most the same as those of other engines when the configuration is similar. Adding the ability to convert data representations increases the number of plans issued, but the number is still manageable.

Variable Cluster and Network Setting

Next we evaluate how our optimizer behaves when settings are not uniform, as is the case with many distributed environments. In Table 6.5 we show the scale of the number of plans spanned when environment parameters were modified to the values shown there.

Configuration       Engine   1    3     7        15       31
# of Nodes: 5, 5    Sc       1    1     1        1        1
MCO: 0.1, 0         Si       2    6     54       4×10^3   28×10^6
MCP: 1, 1           D        10   100   10^4     10^8
Se: 0.9

# of Nodes: 7, 5    Sc       1    1     1        1        1
MCO: 0, 0           Si       2    6     54       4×10^3   29×10^6
MCP: 1, 1           D        12   144   21×10^3  43×10^7
Se: 0.9

# of Nodes: 5, 5    Sc       1    1     1        1        1
MCO: 0, 0           Si       2    6     54       4×10^3   29×10^6
MCP: 1, 1.01        D        10   100   10^4     10^8
Se: 0.9

# of Nodes: 5, 5    Sc       1    1     1        4        25
MCO: 0, 0           Si       2    6     54       4×10^3   29×10^6
MCP: 1, 1.01        D        10   100   10^4     10^8
Se: 0.6

# of Nodes: 3, 6    Sc       1    4     25       961      10^6
MCO: 0, 0           Si       2    6     54       4×10^3   29×10^6
MCP: 2, 1           D        9    81    7×10^3   43×10^6
Se: 0.9

Table 6.5: Scale of the Number of Trees Spanned for Heterogeneous Environments. Engines: Sc – Sckeow, Si – Similarity Join (and Skew-Aware Join), D – DistriPlan. Configurations: MCO – MPU_comm, MCP – MPU_CP, Se – Selectivity.

We highlight the modified values in each row. The first column describes the number of nodes in each cluster – the setting 3, 6 means the first cluster had 3 nodes while the second had 6. Except for DistriPlan, the cluster size does not affect the number of spanned plans (we kept this number small so DistriPlan could finish execution).

As can be seen, in nearly all experiments Sckeow pruned most options based on the MPU's. Only in the last row did Sckeow span a large number of plans, and this is because we provided contradicting values – the first cluster has half the nodes of the other one, but each of its machines is exactly twice as strong.

We conclude that Sckeow outperforms other systems when the environment is not uniform, as is the case with most real environments. In most cases Sckeow can issue an ideal plan directly without evaluating costs.

6.5.2 Query Execution Times

Next we execute a generic query (shown earlier, in Figure 6.1) over increasing array sizes and different cluster configurations, used to emulate different realistic settings. Data skew is introduced through the join selectivity setting, and is addressed by the data representation conversions.

In all experiments, we logically configure three clusters (over the one physical cluster we use). Each query was executed 3 times, and since the results are consistent, we report the average performance among all executions. In the heterogeneous setting, the machine and network properties are throttled to the desired setting by using the MPU configuration.

Homogeneous Setting

We run the query in a setting where we do not modify the MPU's, all the machines have the same CP, and the network is not throttled. The MPU parameters used are collected automatically and used as is.

In Figure 6.3 we show the execution time (in seconds). The horizontal axis is logarithmic, and missing bars indicate that the corresponding engine ran out of memory. As can be seen, Sckeow behaves well – it performs on average 3% better than DistriPlan, 554% better than Skew-Aware Join, and 470% better than Similarity Join in the homogeneous setting. The added complexities used for addressing heterogeneous settings do not harm the performance in the homogeneous setting. DistriPlan and Sckeow have similar performance, as do Skew-Aware Join and Similarity Join. In both cases the plans for retrieving the data are quite different (e.g., Sckeow reformats the data to be held in a sparse representation). DistriPlan is faster than Sckeow in some cases because of its different networking evaluation – DistriPlan separates the latency from the throughput, while we measure all networking performance in one unified, simple parameter. However, as we have already shown, DistriPlan needs to enumerate a very large number of plans, making it infeasible in practice.

In Figure 6.4 we present a breakdown of the time spent per task for each engine. As seen, although the execution time is similar for different engines, the chosen plans are quite different. DistriPlan does not convert the data representation, while all other engines do – compared to Sckeow, both spend similar portions of their execution on data transportation, but Sckeow performs better since the transported data is smaller thanks to the nuanced data conversion.

[Figure 6.3: bar chart comparing Similarity, Skew Aware, DistriPlan, and Sckeow; one group of bars per configuration, from 60MB on 3 nodes up to 1.6TB on 105 nodes; execution time axis from 0.25 to 256 seconds.]

Figure 6.3: Sckeow Execution Time (in Seconds) Comparison for Different Array Sizes on Increasing Number of Nodes

[Figure 6.4: four pie charts – Sckeow, Skew-Aware, DistriPlan, and Similarity Joins – breaking down execution time into data send, communication, scan time, calculations, and data conversion.]

Figure 6.4: Breakdown of Time Spent per Task by Engine – Sckeow, DistriPlan, Skew-Aware Join, and Similarity Join

Similarity Join is intended to solve the network contention observed in Skew-Aware Join, and as evident from the “data send” task portion, it does. However, this contention removal results in longer communication waits. DistriPlan and Sckeow avoid the network contention through their cost model.

We conclude that Sckeow behaves well – queries optimized by Sckeow perform on average 468% faster. Compared to each of the other engines, Sckeow's performance is either similar or better.

Heterogeneous Setting

Now, we focus on experimenting with a static number of nodes, 60, while changing resource allocations. In this set of experiments, we allocated 10 nodes for the first cluster, 20 for the second, and 30 for the last. We keep each variable (global array) size constant at 7.2GB (there are 5 variables involved in the query). As before, we use the Tornado query (Figure 6.1).

The network setting is configured to slow the Infiniband from its approximately 40GBPS down to 10GBPS or 1GBPS. When nothing is mentioned, the network was used at its full capacity. When LAN is mentioned, we throttled the speed to around 10GBPS across all clusters. The WAN configuration refers to cluster 1 being remote (throttled to 1GBPS) from clusters 2 and 3, which are themselves on a LAN connection (10GBPS). These values were chosen based on the actual network architecture where the original data is held.

The values presented as “MPU” refer to the enforced hogging of the Computational Power (MPU_CP) – a value of 1 means no hogging for that cluster, 2 means the CPU is twice as slow for arithmetic calculations, etc. We show the hogging configuration for each cluster in the order of the clusters – a setting of 1,2,3 means the first cluster has an MPU_CP value of 1, the second of 2, while the third has a value of 3. These values were chosen based on the MPU_CP values observed on the several clusters we use.

In Figure 6.5 we show the results of executing the query in a heterogeneous environment. Overall, Sckeow is faster by 9% to 668% (an average improvement of 72% compared to DistriPlan, 508% compared to Skew-Aware Join, and 509% compared to Similarity Join). Here, as in the previous experiments, although some engines perform similarly, their execution plans are different. Skew-Aware Join and Similarity Join perform poorly because of the initial data conversion – our analysis of the results shows that removing the conversion from these engines results in Similarity Join performing similarly to DistriPlan, while Skew-Aware Join performs about 3-4 times slower than Sckeow.

We conclude that the combination of the MPU's and the CF is essential for spanning pruned, yet ideal, plans. The combination of data conversion on demand, together with relevant estimation of networking overheads and execution times, allows Sckeow to outperform other approaches.

[Figure 6.5: bar chart comparing Similarity, Skew Aware, DistriPlan, and Sckeow across settings – No Hogging, MPU=(3,2,1), MPU=(1,2,3), MPU=(1,5,10), LAN, WAN, and their combinations; execution time axis from 1 to 256 seconds.]

Figure 6.5: Sckeow Comparison of Execution Time (in Seconds) for Heterogeneous Settings Using Fixed Array Size

6.6 Summary

In this chapter we addressed the optimization of array data joins, considering skewed data in heterogeneous environments. These are motivated by real-world scenarios, including the processing of climate data, which is collected by numerous agencies and stored across multiple repositories. We have developed a Cost Based Optimization approach. We addressed data skew by adding an option to change the data representation to one that fits the current data state. We have introduced new parameters that allow measuring costs more effectively – the CF (Conversion Factor) and the MPU (Minimal Profitable Unit). The CF allows pruning execution plans before they are generated if the need for data conversion can be definitively determined in advance, whereas the MPU's allow doing the same based on data processing and transportation performance estimations. These factors have been used to develop a cost model.

We evaluated our engine and compared it to state-of-the-art approaches. We show not only that our plans perform better (396% faster in heterogeneous settings and 368% faster in homogeneous ones), but also that the aggressive pruning results in fewer plans (often a single plan is generated for heterogeneous settings, and half the number of plans spanned by other engines for homogeneous settings).

Chapter 7: Related Work

In this chapter we describe prior and related work. We discuss the different works by the domains they are related to: scientific data management, performance optimization of distributed queries, and execution of analytical and join queries.

7.1 Scientific Data Management

Scientific arrays are often stored in low-level formats, such as NetCDF [102] and HDF5 [48], and distributed using portals, such as ESGF [19, 130]. Such storage is preferable for several reasons: it has low storage overheads and no data ingestion costs despite the rate at which these data are produced, the data is often very large and queried infrequently, and existing tools and scripts need such formats [127, 147].

The portals disseminating these data use multiple methods in their backend to store and retrieve data (by using certain predefined, fixed capabilities for searching within datasets). For example, OPeNDAP [39] provides the ability to share and access remote data. Yet, it does not provide any querying capability directly – and in fact requires data to be converted to its internal format.

The most common data transportation technology used is FTP [98], and querying operators have been integrated with one implementation of FTP, GridFTP [114]. However, querying using this method is limited in the complexity of queries and may only use local resources (preventing the usage of distributed data or distributed processing).

Database-like approaches for viewing array data have slowly developed over time. UFI [14] is a tool that allows viewing a semi-structured local file as if it were a database – including files in NetCDF and HDF5 formats. However, it does not directly support a high-level query language nor distributed data.

Database-like querying functionality over native formats is not a new idea either. Prior to the precursor work on SDQuery, similar functionality was supported on flat files containing array data [129]. More recently, NoDB [3] systems allow querying over raw data files, though their focus is on record-based, relational data. The authors further extended their original work to an adaptive query engine [72]. The key novel aspect of our work is that we focus on a collection of files, and specifically work with distributed scientific data stored in native formats like NetCDF.

A more structured approach to querying scientific data involves array databases, and there is a large body of work in this area [15, 23, 29, 38, 77, 82, 83, 85, 99, 121, 142]. These systems require that the data be ingested by a central system before it can be queried. Thus, they cannot support queries across multiple repositories, nor can they directly operate on low-level scientific data. Our engines are intended to provide querying support without having to load the data inside a traditional system – an approach that is especially suitable for read-only or append-only data that is queried relatively infrequently and generated at a high rate, as scientific data are.

SciDB [24] is considered the state-of-the-art centralized system that addresses array data (and, as most other systems, requires data to be pre-loaded into it). It uses database-like caches to process array queries, but it provides limited querying abilities and requires manual array storage configuration. Storage engines such as TileDB [95] and ArrayDB [83] provide different approaches for configuring array storage, yet all are similar in requiring the data to be loaded and reformatted into a proprietary format.

MapReduce [40, 58, 143] systems use HDFS [22], or a variant, to store data and allow users to query it by either programming an application to do so or using a high-level querying framework like Hive [115]. MapReduce frameworks for processing array data, e.g., SciHadoop [25] and MERRA [104], provide native querying abilities over array data; yet, they do not provide generic declarative querying support, do not support geographically distributed data, and do not provide the join operator or analytical function support. Finally, WANalytics [124], although somewhat addressing joins in distributed settings, does not handle window querying nor consider array data.

A few querying languages have been suggested for querying array data. AQL [79] is the most referenced and used one. In an attempt to standardize multi-dimensional array querying syntax, there is an effort to extend SQL [84] to address these data – referred to as SQL/MDA (Multi-Dimensional Array); neither considers advanced analytical querying. In addition, both implicitly assume array data is not distributed through virtual dimensions.

7.2 Optimization of Advanced Join Queries over Distributed Data

Optimization of queries over distributed data (outside a cluster) was considered in the Volcano project [52, 54, 55]. However, this work did not include a cost-based optimizer that considered different options for distributing the processing and data movement, and it uses implicit heuristics which cannot address complex queries in highly distributed settings. WANalytics [124] is a recent proposal from Microsoft for developing analytics on geographically distributed datasets, but it does not target scientific data nor largely distributed environments. Stratosphere [4] offers a CBO for optimizing joins (among other operations). However, the CBO is used only for join ordering, and choices among options are made using a simple heuristic.

Query optimizers such as the one in Hive [64, 115] optimize distributed querying over data within a cluster. Since the focus is on data within a single cluster, these systems assume low latencies and high throughput – assumptions that do not hold for geo-distributed arrays. This requires different approaches to query optimization for our target environments.

Another approach for joining distributed data is to not evenly transport, or shuffle, the data. For example, instead of holding the machines that store Array 2 idle while the machines that hold Array 1 process all the data, both arrays can be sent to all candidate processing nodes, utilizing the machines that held the outer joined array as well (increasing the number of machines used) [97, 128]. While these approaches minimize the communication traffic (assuming less communication leads to faster execution times), the same assumptions of low latency and a local cluster are made. Our approaches complement these methods and can be adjusted to fit them. In addition, these approaches do not currently handle arrays stored in distributed patterns, and it is unclear what performance hit adjusting these algorithms for that setting would entail.

Relational equi-join optimizations have been researched thoroughly in the literature. Current work focuses on the optimization of equi-joins in different settings based on the usage of indexes and a cost model [21, 78]; our work has focused on the use of bitmaps for the acceleration of distributed joins and MRJ's over scientific data.

Another research avenue, referred to as Shared Servers [46, 73, 119], improves performance by merging the execution of similar operators locally. Yet, scientific data is rarely accessed simultaneously by multiple users – it is queried infrequently; therefore, although this approach can be used, its benefits are expected to be limited.

Optimization of join queries by using Compressed Bitmaps has been looked at before [81]. The main differences are that we target distributed settings (including arrays with non-conforming dimensional attributes), and we consider optimizing MRJ's over scientific data. The WAH compression [134] is used for efficiently processing over compressed data in both works; more recent bitmap compression schemes, such as Upbit and Roaring Bitmaps [10, 31], may improve our performance. More broadly, a large body of work exists on bitmap index utilization [8, 13, 110, 132, 136]; these do not focus on join queries nor distributed settings. Since the Type-B index is not generic like advanced encodings for bitmap indexes [32], it is smaller in size and more efficient for MRJ's.

More advanced querying systems over scientific array data have recently been developed. As stated earlier, Skew-Aware Join Optimization for Array Data [43], Similarity Join over Array Data [145], and DistriPlan [45] (Chapter 3) are the current state of the art in this domain. We have extensively compared our work against these and demonstrated the advantage of our latest approach.

7.3 Operator Execution: Joins and Analytical Functions

Join Execution: the usage of Bloom Filters has been suggested for executing distributed joins [100, 140]. Bloom Filters are not beneficial for numerical, floating-point values. Once the data goes through bucketing, as required for unrestricted continuous numerical values, the advantages of simple bitmaps are higher. In addition, the Bloom Filter structure cannot be used for MRJ's: Bloom Filters are intended for item lookups; they are not ordered by items and are not intended to allow the fast information crossing that MRJ's require. Second, distributed Bloom Filters inherently assume small selectivities (since they require result verification, which becomes more expensive if very large portions of the data need to be verified). Therefore, due to both reasons – the assumption of low selectivity and, more importantly, the lack of ability to use the structure for MRJ's – we could not use Bloom Filters in this work. Bloom Filters can be used for equi-joins, while our approaches can be extended to other join types.

Materializing query results [144] was suggested to improve query performance as well. Since we target exploratory queries, materialized views are not expected to assist, and therefore using indexes, which target generic queries, is preferable.

Another popular idea is to control the data shuffle so that each node processes a similar amount of data [43, 145]. The join execution algorithm there is what we refer to as the Naïve Join Algorithm, and therefore this work is unique in targeting the join operation itself. Yet, the assumption that data is skewed at the storage level rarely holds for data collected by sensors, and does not need to be addressed for such data. We suggest allowing a CBO to plan the data communication based on its global effect, rather than using simple heuristics. In addition, in this dissertation we define the join differently. While Similarity Join uses aerial aggregations to match values (by using masks), we join values based on mapped dimensions [44]. Therefore, our systems do not impose constraints on the dimensions of the queried variables/arrays. Although we do not target similarity join queries, our model allows their execution by using aggregations followed by joins.

Analytical/Window functions: analytical functions have been available in relational systems for a while [26, 59, 76]. The introduction of these functions to scientific array DBMS's is still limited. Specifically, SciDB [70] supports a limited syntax for analytical functions. It does not provide cross-window data access, but does allow a window to be defined. With the SciDB window, joining two different window queries is a possible mechanism to manually implement functionality such as lead and lag. However, because the functionality is broken across different queries, efficiency is limited – memory and execution optimizations cannot be used, since both queries need to execute before the join can be executed.

ArrayUDF [42] addresses aggregations over adjacent array cells. Some of the functionality provided by ArrayUDF intersects with the functionality provided by DSDQuery [44] (which preceded this system). While DSDQuery allows using functions in the query GROUP BY clause, it does not provide a framework for writing these functions. FDQ provides more advanced analytical abilities – ArrayUDF targets aerial aggregations (similar to those executed on one of the join sides in Similarity Joins [145]), and not the “cross windows” querying abilities that we target in this work.

Chapter 8: Conclusions and Future Work

In this dissertation we have developed multiple systems using innovative, state-of-the-art approaches, which enable end-to-end data management and declarative querying over scientific array data. We address all facets of scientific array data querying, starting with indexing and metadata extraction of incepted data, through advanced analytics over this data, to parallel and distributed execution of complex queries that generate new datasets. Using the techniques we develop, we enable users to use declarative languages to query scientific array data stored in their native formats on the file system, saving time and resources for scientists (who until now had to develop their own software for executing advanced analytics over these data).

We started by presenting a framework that can process structured queries on scientific data repositories. In the process, we have addressed several challenges, including: 1) extending the relational algebra to apply over array datasets, giving concise meaning to queries; 2) locating files for processing a user query, allowing us to develop a truly ‘declarative querying’ approach; 3) querying over multiple distributed array datasets; and 4) producing an output dataset that contains an aggregated union of the user-requested data.

Then, we presented a framework for the optimized execution of array-based joins in geo-distributed settings. We first developed a query optimizer which prunes plans as it generates them. For our target queries, the number of plans is kept at a manageable level, and subsequently, a cost model we have developed can be used for selecting the cheapest plan. We have shown that our pruning approach makes the plan spanning problem practical to solve.

Afterwards, we have considered the execution of joins and join-like operations over scientific datasets. For equi-joins, we developed techniques for improving their performance by using bitmaps and for executing them in distributed settings.

For the Mutual Range Join (MRJ), we have developed a variant of standard bitmaps and showed how they can be used for efficient execution. We have also discussed distributed algorithms for join execution. The basis for our techniques are methods that efficiently join array dimensions and restructure bitmap indexes to new dimensions without rebuilding the index – a property that allows transporting only the indexes, which saves communication time.

We continue by unveiling a new type of analytical query, relevant mainly to scientific domains, and address it by developing the FDQ engine. The new class of queries is analytical querying where the internal window ordering needs to be explicit, since queries may access a specific, ordered plane of a different window. We motivate these queries (required by our collaborators) and relate them to the context of array data. Then, we discuss memory planning, caching, and algorithms for analytical querying. We continue by suggesting the usage of sectioning (tiling) for ensuring data remains in memory and for enabling parallelism in window calculations.

We conclude this dissertation by revisiting the join optimization problem. In the revisit, we address skew of both data and resources. While traditional methods address data skew in the stored data, we focus on data that becomes skewed while being processed – we address this change in data skew by introducing the ability to change the data representation format as part of the query execution, and the ability to process the data in all supported representations. Resource skew is addressed in a manner that targets real settings (where a few heterogeneous clusters are used, yet each cluster's internal resources are homogeneous) by incorporating each cluster's “Computational Power” into our CBO's cost model. Unlike previous approaches, we measure each cluster's computational power automatically. Last, we developed methods to efficiently represent and prune plans, so that their number remains low and manageable by our evaluation engine.

We evaluated our engines and have shown, through experimentation, that our methods are profitable compared to existing approaches. We have shown that our execution engines' performance improves linearly with decreasing dataset sizes and that they scale well (linearly with the available resources). In addition, we have shown that our DistriPlan (Chapter 3) optimizer finds the optimal (DistriPlan) plan. Later, we have shown that Sckeow (Chapter 6) finds an ideal plan with lower overheads compared to DistriPlan. Both engines generate plans that, when executed, outperform plans generated using the current state-of-the-art approaches.

This work can be extended in multiple directions. Here, we focus on two different aspects of scientific data querying: query planning and distributed query execution. While our work was generic in nature, we focused on joins and analytical querying over distributed data. We used indexes built on raw data to allow efficient query execution.

Possible directions for developing this work further may include targeting individual query executions – for example, using different indexing methods for query execution. This would result in any single sub-query executing faster, accelerating the total process. On top of that, more approaches for merging and restructuring array data should be developed, since these two operations are the most expensive ones. These might require developing new techniques for storing array data in a manner which allows changing the array structure efficiently. These, in turn, will require new methods to optimize queries (by pruning more spanned plan options, while considering the newly developed indexing approaches and execution methods) to achieve acceptable performance.

Last, extending the query types beyond analytical and join queries would be beneficial for the scientific community. For example, preparing data for learning solely by using SQL would require many additional features for querying over scientific array data. Allowing the use of declarative languages for most, if not all, of the scientific workflow will enable fast technological advances in all domains where these data are used.

Bibliography

[1] Georges Aad, E Abat, et al. The atlas experiment at the cern large hadron

collider. Journal of instrumentation, 3(8), 2008.

[2] Foto N Afrati and Jeffrey D Ullman. Optimizing joins in a map-reduce en-

vironment. In Proceedings of the 13th International Conference on Extending

Database Technology, pages 99–110. ACM, 2010.

[3] Ioannis Alagiannis, Renata Borovica, Miguel Branco, Stratos Idreos, and

Anastasia Ailamaki. Nodb: efficient query execution on raw data files. In

Proceedings of the 2012 ACM SIGMOD International Conference on Management

of Data, pages 241–252. ACM, 2012.

[4] Alexander Alexandrov, Rico Bergmann, et al. The stratosphere platform for

big data analytics. The VLDB Journal, 2014.

[5] William Allcock, John Bresnahan, Rajkumar Kettimuthu, Michael Link,

Catalin Dumitrescu, Ioan Raicu, and Ian Foster. The globus striped gridftp

framework and server. In Proceedings of the 2005 ACM/IEEE conference on

Supercomputing, page 54. IEEE Computer Society, 2005.

[6] Bryce Allen, John Bresnahan, Lisa Childers, et al. Software as a service for

data scientists. CACM, 55(2):81–88, 2012.

[7] Claus A Andersson and Rasmus Bro. The n-way toolbox for matlab. Chemo-

metrics and intelligent laboratory systems, 52(1):1–4, 2000.

[8] Gennady Antoshenkov. Byte-aligned bitmap compression. In DCC, page

476. IEEE, 1995.

[9] Peter MG Apers, AR Hevner, et al. Optimization algorithms for distributed

queries. Software Engineering, IEEE Transactions, (1):57–68, 1983.

[10] Manos Athanassoulis, Zheng Yan, and Stratos Idreos. Upbit: Scalable in-

memory updatable bitmap indexing. In SIGMOD, pages 1319–1332. ACM,

2016.

[11] Maximilian Auffhammer, Solomon M Hsiang, Wolfram Schlenker, and

Adam Sobel. Using weather data and climate model output in economic

analyses of climate change. Review of Environmental Economics and Policy,

page ret016, 2013.

[12] Stefan Aulbach, Torsten Grust, Dean Jacobs, Alfons Kemper, and Jan Rit-

tinger. Multi-tenant databases for software as a service: schema-mapping

techniques. In Proceedings of the 2008 ACM SIGMOD international conference

on Management of data, pages 1195–1206. ACM, 2008.

[13] Jay Ayres, Jason Flannick, Johannes Gehrke, and Tomi Yiu. Sequential pat-

tern mining using a bitmap representation. In SIGKDD, pages 429–435.

ACM, 2002.

[14] Barrodale Computing Services (BCS). Universal File Interface (UFI). http:

//www.barrodale.com/universal-file-interface-ufi, 2010.

[15] P. Baumann, A. Dehmel, et al. The Multidimensional Database System Ras-

DaMan. In SIGMOD, pages 575–577, 1998.

[16] Mehmet Belgin, Godmar Back, et al. Pattern-based sparse matrix represen-

tation for memory-efficient smvm kernels. In SC. ACM, 2009.

[17] Alberto Belussi and Christos Faloutsos. Self-spacial join selectivity estima-

tion using fractal concepts. ACM Transactions on Information Systems (TOIS),

16(2):161–201, 1998.

[18] Itzik Ben-Gan. Microsoft SQL Server 2012 High-performance T-SQL Using Win-

dow Functions. Pearson Education, 2012.

[19] David Bernholdt, Shishir Bharathi, et al. The earth system grid: Supporting

the next generation of climate modeling research. Proceedings of the IEEE,

93(3):485–495, 2005.

[20] Spyros Blanas, Jignesh M Patel, et al. A comparison of join algorithms for

log processing in mapreduce. In SIGMOD, pages 975–986. ACM, 2010.

[21] Roger Bolsius and Kevin Malaney. Methods and systems for optimizing

queries through dynamic and autonomous database schema analysis, 2003.

US Patent App. 10/448,962.

[22] Dhruba Borthakur et al. Hdfs architecture guide. Hadoop Apache Project, 53,

2008.

[23] Paul G. Brown. Overview of SciDB: large scale array storage, processing and

analysis. In SIGMOD, pages 963–968, 2010.

[24] Paul G Brown. Overview of scidb: large scale array storage, processing and

analysis. In Proceedings of the 2010 ACM SIGMOD International Conference on

Management of data. ACM, 2010.

[25] Joe B Buck, Noah Watkins, et al. Scihadoop: array-based query processing

in hadoop. In SC, pages 1–11. IEEE, 2011.

[26] Yu Cao, Chee-Yong Chan, et al. Optimization of analytic window functions.

VLDB, 5(11):1244–1255, 2012.

[27] Robert L Carter and Mark E Crovella. Server selection using dynamic path

characterization in wide-area networks. In INFOCOM’97. Sixteenth Annual

Joint Conference of the IEEE Computer and Communications Societies. Driving the

Information Revolution., Proceedings IEEE, volume 3, pages 1014–1021. IEEE,

1997.

[28] Stefano Ceri and Georg Gottlob. Translating sql into relational algebra: Op-

timization, semantics, and equivalence of sql queries. IEEE Transactions on

software engineering, 11(4):324, 1985.

[29] João Pedro Cerveira Cordeiro, Gilberto Câmara, et al. Yet Another Map Al-

gebra. Geoinformatica, 13(2):183–202, June 2009.

[30] John Chambers. Software for data analysis: programming with R. Springer

Science & Business Media, 2008.

[31] Samy Chambi, Daniel Lemire, et al. Better bitmap performance with roaring

bitmaps. Software: practice and experience, 46(5):709–719, 2016.

[32] C. Chan and Y. Ioannidis. An efficient bitmap encoding scheme for selection

queries. In SIGMOD Record, volume 28. ACM, 1999.

[33] Kyle Chard, Ian Foster, et al. Globus: Research data management as service

and platform. In PEARC, pages 1–5. ACM, 2017.

[34] Surajit Chaudhuri. An overview of query optimization in relational systems.

In Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium

on Principles of database systems, pages 34–43. ACM, 1998.

[35] Lyndon Clarke, Ian Glendinning, and Rolf Hempel. The mpi message pass-

ing interface standard. In Programming environments for massively parallel dis-

tributed systems, pages 213–218. Springer, 1994.

[36] E. F. Codd. A relational model of data for large shared data banks. Commun.

ACM, 13(6):377–387, June 1970.

[37] Richard L. Cole and Goetz Graefe. Optimization of dynamic query evalua-

tion plans. SIGMOD Rec., 23(2):150–160, May 1994.

[38] Roberto Cornacchia, Sándor Héman, et al. Flexible and efficient IR using

array databases. VLDB J., 17(1):151–168, 2008.

[39] Peter Cornillon, James Gallagher, and Tom Sgouros. Opendap: Access-

ing data in a distributed, heterogeneous environment. Data Science Journal,

2(5):164–174, 2003.

[40] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: simplified data processing

on large clusters. Communications of the ACM, 51(1):107–113, 2008.

[41] Bin Dong, Surendra Byna, and Kesheng Wu. Spatially clustered join on het-

erogeneous scientific data sets. In Big Data (Big Data), 2015 IEEE International

Conference on, pages 371–380. IEEE, 2015.

[42] Bin Dong, Kesheng Wu, Surendra Byna, Jialin Liu, Weijie Zhao, and Florin

Rusu. Arrayudf: User-defined scientific data analysis on arrays. In Pro-

ceedings of the 26th International Symposium on High-Performance Parallel and

Distributed Computing, pages 53–64. ACM, 2017.

[43] Jennie Duggan, Olga Papaemmanouil, Leilani Battle, and Michael Stone-

braker. Skew-aware join optimization for array databases. In Proceedings of

the 2015 ACM SIGMOD International Conference on Management of Data, pages

123–135. ACM, 2015.

[44] Roee Ebenstein and Gagan Agrawal. Dsdquery dsi-querying scientific data

repositories with structured operators. In Big Data (Big Data), 2015 IEEE

International Conference on, pages 485–492. IEEE, 2015.

[45] Roee Ebenstein and Gagan Agrawal. Distriplan - an optimized join execution

framework for geo-distributed scientific data. In SSDBM. ACM, 2017.

[46] Roee Ebenstein, Niranjan Kamat, and Arnab Nandi. Fluxquery: An exe-

cution framework for highly interactive query workloads. In Proceedings of

the 2016 International Conference on Management of Data, SIGMOD ’16, pages

1333–1345, New York, NY, USA, 2016. ACM.

[47] Robert Epstein, Michael Stonebraker, and Eugene Wong. Distributed query processing in a relational data base system. In Proceedings of the 1978 ACM SIGMOD International Conference on Management of Data, SIGMOD '78, pages 169–180, New York, NY, USA, 1978. ACM.

[48] Mike Folk, Gerd Heber, Quincey Koziol, Elena Pourmal, and Dana Robinson. An overview of the hdf5 technology suite and its applications. In Proceedings of the EDBT/ICDT 2011 Workshop on Array Databases, AD '11, pages 36–47, New York, NY, USA, 2011. ACM.

[49] Johann Christoph Freytag. A rule-based view of query optimization, volume 16. ACM, 1987.

[50] Lise Getoor, Benjamin Taskar, and Daphne Koller. Selectivity estimation using probabilistic models. In ACM SIGMOD Record, volume 30, pages 461–472. ACM, 2001.

[51] Roy Goldman and Jennifer Widom. Dataguides: Enabling query formulation and optimization in semistructured databases. Technical report, Stanford, 1997.

[52] Goetz Graefe. Parallelizing the volcano database query processor. In Compcon IEEE Computer Society International Conference, pages 490–493. IEEE, 1990.

[53] Goetz Graefe. Query evaluation techniques for large databases. ACM Comput. Surv., 25(2):73–169, June 1993.

[54] Goetz Graefe. Volcano - an extensible and parallel query evaluation system. IEEE Transactions on Knowledge and Data Engineering, 1994.

[55] Goetz Graefe and William J McKenna. The volcano optimizer generator: Extensibility and efficient search. In Data Engineering, pages 209–218. IEEE, 1993.

[56] Goetz Graefe and Karen Ward. Dynamic query evaluation plans. In ACM SIGMOD Record, volume 18, pages 358–366. ACM, 1989.

[57] William Gropp, Ewing Lusk, et al. A high-performance, portable implementation of the mpi message passing interface standard. Parallel Computing, 22(6):789–828, 1996.

[58] Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, and Geoffrey Fox. Mapreduce in the clouds for science. In CloudCom, pages 565–572. IEEE, 2010.

[59] Abhinav Gupta. Method for minimizing the number of sorts required for a query block containing window functions, May 14 2002. US Patent 6,389,410.

[60] Peter J Haas, Jeffrey F Naughton, et al. Fixed-precision estimation of join selectivity. In ACM SIGACT-SIGMOD-SIGART, pages 190–201. ACM, 1993.

[61] Peter J Haas, Jeffrey F Naughton, and Arun N Swami. On the relative cost of sampling for join selectivity estimation. In ACM SIGACT-SIGMOD-SIGART, pages 14–24. ACM, 1994.

[62] Benjamin Heintz, Abhishek Chandra, and Jon Weissman. Cloud Computing for Data-Intensive Applications, chapter Cross-Phase Optimization in MapReduce, pages 277–302. Springer New York, New York, NY, 2014.

[63] Jonathan Helmus and Scott Collis. The python arm radar toolkit (py-art), a library for working with weather radar data in the python programming language. Journal of Open Research Software, 4(1), 2016.

[64] Herodotos Herodotou and Shivnath Babu. Profiling, what-if analysis, and cost-based optimization of mapreduce programs. Proceedings of the VLDB Endowment, 4(11):1111–1122, 2011.

[65] Stephan Hoyer and Joe Hamman. xarray: N-D labeled arrays and datasets in python. Journal of Open Research Software, 5(1), 2017.

[66] Chengdu Huang and Tarek Abdelzaher. Towards content distribution networks with latency guarantees. In Quality of Service, 2004. IWQOS 2004. Twelfth IEEE International Workshop on, pages 181–192. IEEE, 2004.

[67] W Huang. Using ncl to visualize and analyse nasa/noaa satellite data in format of netcdf, hdf, hdf-eos. In AGU Fall Meeting Abstracts, 2014.

[68] Yannis E Ioannidis. Query optimization. ACM Computing Surveys (CSUR), 28(1):121–123, 1996.

[69] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Operating Systems Review, 41(3):59–72, 2007.

[70] Li Jiang, Hideyuki Kawashima, and Osamu Tatebe. Incremental window aggregates over array database. In Big Data, pages 183–188. IEEE, 2014.

[71] Nicholas Kaiser, Hervé Aussel, et al. Pan-starrs: a large synoptic survey telescope array. In Survey and Other Telescope Technologies and Discoveries, volume 4836. International Society for Optics and Photonics, 2002.

[72] Manos Karpathiotakis, Miguel Branco, Ioannis Alagiannis, and Anastasia Ailamaki. Adaptive query processing on raw data. Proceedings of the VLDB Endowment, 7(12):1119–1130, 2014.

[73] Meichen Lai, Tony Kuen Lee, et al. System and procedure for concurrent database access by multiple user applications through shared connection processes, 1997. US Patent 5,596,745.

[74] Per-Åke Larson, Eric N Hanson, and Susan L Price. Columnar storage in sql server 2012. IEEE Data Eng. Bull., 35(1):15–20, 2012.

[75] Christiane Lefevre. The cern accelerator complex. Technical report, 2008.

[76] Viktor Leis, Kan Kundhikanjana, Alfons Kemper, et al. Efficient processing of window functions in analytical sql queries. VLDB, 8(10):1058–1069, 2015.

[77] Alberto Lerner and Dennis Shasha. AQuery: query language for ordered data, optimization techniques, and experiments. In VLDB, pages 345–356, 2003.

[78] Jonathan Lewis. Cost-Based Oracle Fundamentals. Apress, 2006.

[79] Leonid Libkin, Rona Machlin, and Limsoon Wong. A query language for multidimensional arrays: design, implementation, and optimization techniques. In ACM SIGMOD Record, volume 25, pages 228–239. ACM, 1996.

[80] Robert Love. Kernel korner: Intro to inotify. Linux J., 2005(139):8–, November 2005.

[81] Kamesh Madduri and Kesheng Wu. Efficient joins with compressed bitmap indexes. In Information and Knowledge Management. ACM, 2009.

[82] Arunprasad P. Marathe and Kenneth Salem. A Language for Manipulating Arrays. In VLDB, pages 46–55, 1997.

[83] Arunprasad P Marathe and Kenneth Salem. Query processing techniques for arrays. VLDB J., 11(1):68–91, 2002.

[84] Jim Melton. Iso/ansi: Database language sql. ISO/IEC SQL Revision. New York: American National Standards Institute, 1992.

[85] Jeremy Mennis and C. Dana Tomlin. Cubic map algebra functions for spatio-temporal analysis. CaGIS, 32:17–32, 2005.

[86] Priti Mishra and Margaret H Eich. Join processing in relational databases. ACM Computing Surveys (CSUR), 1992.

[87] Brian Robert Muras. Building database statistics across a join network using skew values, April 19 2011. US Patent 7,930,296.

[88] Arnab Nandi, Cong Yu, Philip Bohannon, and Raghu Ramakrishnan. Distributed cube materialization on holistic measures. In Data Engineering (ICDE), 2011 IEEE 27th International Conference on, pages 183–194. IEEE, 2011.

[89] Arnab Nandi, Cong Yu, Philip Bohannon, and Raghu Ramakrishnan. Data cube materialization and mining over mapreduce. IEEE Transactions on Knowledge and Data Engineering, 24(10):1747–1759, 2012.

[90] Vivek Narasayya, Sudipto Das, Manoj Syamala, Badrish Chandramouli, and Surajit Chaudhuri. Sqlvm: Performance isolation in multi-tenant relational database-as-a-service. 2013.

[91] Raymond T. Ng, Alan Wagner, and Yu Yin. Iceberg-cube computation with pc clusters. SIGMOD Rec., 30(2):25–36, May 2001.

[92] Chris Olston, Jing Jiang, and Jennifer Widom. Adaptive filters for continuous queries over distributed data streams. In SIGMOD, pages 563–574. ACM, 2003.

[93] Michael Owens. Embedding an sql database with sqlite. Linux J., 2003(110):2–, June 2003.

[94] M. Tamer Özsü and Patrick Valduriez. Principles of Distributed Database Systems. Springer Science & Business Media, 2011.

[95] Stavros Papadopoulos, Kushal Datta, et al. The tiledb array data storage manager. VLDB, 10(4):349–360, 2016.

[96] Sergio Pissanetzky. Sparse Matrix Technology - electronic edition. Academic Press, 1984.

[97] Orestis Polychroniou, Rajkumar Sen, et al. Track join: distributed joins with minimal network traffic. In SIGMOD. ACM, 2014.

[98] Jon Postel and Joyce Reynolds. File transfer protocol. RFC, 1985.

[99] David Pullar. MapScript: A Map Algebra Programming Language Incorporating Neighborhood Analysis. Geoinformatica, 5(2):145–163, June 2001.

[100] Sukriti Ramesh, Odysseas Papapetrou, and Wolf Siberski. Optimizing distributed joins with bloom filters. In ICDCIT, pages 145–156. Springer, 2008.

[101] Naveen Reddy and Jayant R Haritsa. Analyzing plan diagrams of database query optimizers. In Conference on Very Large Data Bases, pages 1228–1239. VLDB Endowment, 2005.

[102] Russ Rew and Glenn Davis. Netcdf: an interface for scientific data access. Computer Graphics and Applications, IEEE, 10(4):76–82, 1990.

[103] David R Roberts and Andreas Hamann. Predicting potential climate change impacts with bioclimate envelope models: a palaeoecological perspective. Global Ecology and Biogeography, 21(2):121–133, 2012.

[104] John L Schnase, Daniel Q Duffy, et al. Merra analytic services: Meeting the big data challenges of climate science through cloud-enabled climate analytics-as-a-service. Computers, Environment and Urban Systems, 61:198–211, 2017.

[105] Donovan A Schneider and David J DeWitt. A performance evaluation of four parallel join algorithms in a shared-nothing multiprocessor environment, volume 18. ACM, 1989.

[106] Donovan A. Schneider and David J. DeWitt. A performance evaluation of four parallel join algorithms in a shared-nothing multiprocessor environment. SIGMOD Rec., 18(2):110–121, June 1989.

[107] Uwe Schulzweida, Luis Kornblueh, and Ralf Quast. Cdo users guide. Climate Data Operators, Version, 1(6), 2006.

[108] MJ Shaffer, BK Wylie, et al. Using climate/weather data with the nleap model to manage soil nitrogen. Agricultural and Forest Meteorology, 69(1-2):111–123, 1994.

[109] Leonard D Shapiro. Join processing in database systems with large main memories. ACM Transactions on Database Systems (TODS), 11(3):239–264, 1986.

[110] Rishi Rakesh Sinha, Soumyadeb Mitra, and Marianne Winslett. Bitmap indexes for large scientific data sets: A case study. In IPDPS, 10 pp. IEEE, 2006.

[111] Michael Stonebraker. Sql databases v. nosql databases. Communications of the ACM, 53(4):10–11, 2010.

[112] Michael Stonebraker, Paul Brown, Alex Poliakov, and Suchi Raman. The architecture of scidb. In International Conference on Scientific and Statistical Database Management, pages 1–16. Springer, 2011.

[113] RA Stratton. A high resolution amip integration using the hadley centre model hadam2b. Climate Dynamics, 15(1):9–28, 1999.

[114] Yu Su, Yi Wang, Gagan Agrawal, and Rajkumar Kettimuthu. SDQuery DSI: integrating data management support with a wide area data transfer protocol. In SC, page 47. ACM, 2013.

[115] Ashish Thusoo, Joydeep Sen Sarma, et al. Hive: a warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment, 2(2):1626–1629, 2009.

[116] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. Hive - A Warehousing Solution Over a Map-Reduce Framework. PVLDB, 2(2):1626–1629, 2009.

[117] Martin H Trauth, Robin Gebbers, Norbert Marwan, and Elisabeth Sillmann. MATLAB recipes for earth sciences. Springer, 2006.

[118] J Anthony Tyson. Large synoptic survey telescope: overview. In Survey and Other Telescope Technologies and Discoveries, volume 4836, pages 10–21. International Society for Optics and Photonics, 2002.

[119] Philipp Unterbrunner, Georgios Giannikis, et al. Predictable performance for unpredictable workloads. VLDB, 2(1), 2009.

[120] Patrick Valduriez and Georges Gardarin. Join and semijoin algorithms for a multiprocessor database machine. TODS, 9(1):133–161, 1984.

[121] Alex R. van Ballegooij. RAM: a multidimensional array DBMS. In EDBT 2004 Workshops, pages 154–165, 2005.

[122] Shanmugam Veeramani, Muhammad Nasir Masood, and Amandeep S Sidhu. A pacs alternative for transmitting dicom images in a high latency environment. In Biomedical Engineering and Sciences (IECBES), 2014 IEEE Conference on, pages 975–978. IEEE, 2014.

[123] Stratis D Viglas, Jeffrey F Naughton, and Josef Burger. Maximizing the output rate of multi-way join queries over streaming information sources. In Proceedings of the 29th International Conference on Very Large Data Bases - Volume 29, pages 285–296. VLDB Endowment, 2003.

[124] Ashish Vulimiri, Carlo Curino, et al. Wanalytics: Analytics for a geo-distributed data-intensive world. In CIDR, 2015.

[125] Fangju Wang. Query optimization for a distributed geographic information system. Photogrammetric Engineering and Remote Sensing, 65:1427–1438, 1999.

[126] Jiali Wang and Veerabhadra R Kotamarthi. Downscaling with a nested regional climate model in near-surface fields over the contiguous united states. JGR Atmospheres, 119(14):8778–8797, 2014.

[127] Jiali Wang and Veerabhadra R Kotamarthi. High-resolution dynamically downscaled projections of precipitation in the mid and late 21st century over north america. Earth's Future, 3(7):268–288, 2015.

[128] Daniel Warneke and Odej Kao. Exploiting dynamic resource allocation for efficient parallel data processing in the cloud. IEEE Transactions on Parallel and Distributed Systems, 22(6):985–997, 2011.

[129] Li Weng, Gagan Agrawal, Umit Catalyurek, T. Kurc, Sivaramakrishnan Narayanan, and Joel Saltz. An Approach for Automatic Data Virtualization. In HPDC, pages 24–33, 2004.

[130] Dean N Williams, Gavin Bell, Luca Cinquini, Peter Fox, John Harney, and Robin Goldstone. Earth system grid federation: Federated and integrated climate data from multiple sources. In Earth System Modelling - Volume 6, pages 61–77. Springer, 2013.

[131] Andrew Witkowski, Srikanth Bellamkonda, Tolga Bozkaya, Nathan Folkert, Abhinav Gupta, John Haydu, Lei Sheng, and Sankar Subramanian. Advanced sql modeling in rdbms. ACM Transactions on Database Systems (TODS), 30(1):83–121, 2005.

[132] Kesheng Wu, Ekow Otoo, and Arie Shoshani. On the performance of bitmap indices for high cardinality attributes. In Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30, pages 24–35. VLDB Endowment, 2004.

[133] Kesheng Wu, Ekow J Otoo, and Arie Shoshani. Compressing bitmap indexes for faster search operations. In SSDBM, pages 99–108. IEEE, 2002.

[134] Kesheng Wu, Ekow J Otoo, and Arie Shoshani. Optimizing bitmap indices with efficient compression. TODS, 31(1), 2006.

[135] Kesheng Wu, Kurt Stockinger, and Arie Shoshani. Breaking the curse of cardinality on bitmap indexes. In SSDBM, pages 348–365. Springer, 2008.

[136] Ming-Chuan Wu. Query optimization for selections using bitmaps. In SIGMOD Rec., volume 28, pages 227–238. ACM, 1999.

[137] Cenk Yavuzturk and Jeffrey D Spitler. Comparative study of operating and control strategies for hybrid ground-source heat pump systems using a short time step simulation model. Ashrae Transactions, 106:192, 2000.

[138] Yuan Yu, Pradeep Kumar Gunda, and Michael Isard. Distributed aggregation for data-parallel computing: interfaces and implementations. In SIGOPS, pages 247–260. ACM, 2009.

[139] Charlie Zender. Nco users guide, 2004.

[140] Changchun Zhang, Lei Wu, and Jing Li. Optimizing distributed joins with bloom filters using mapreduce. In Computer Applications for Graphics, Grid Computing, and Industrial Environment, pages 88–95. Springer, 2012.

[141] Ying Zhang, Martin Kersten, Milena Ivanova, and Niels Nes. Sciql: bridging the gap between science and relational dbms. In Proceedings of the 15th Symposium on International Database Engineering & Applications, pages 124–133. ACM, 2011.

[142] Ying Zhang, Martin Kersten, Milena Ivanova, and Niels Nes. SciQL: Bridging the Gap Between Science and Relational DBMS. In IDEAS, pages 124–133, September 2011.

[143] Hui Zhao, SiYun Ai, ZhenHua Lv, and Bo Li. Parallel Accessing Massive NetCDF Data Based on MapReduce. In WISM, pages 425–431, Berlin, Heidelberg, 2010. Springer-Verlag.

[144] W. Zhao, F. Rusu, et al. Incremental view maintenance over array data. In SIGMOD. ACM, 2017.

[145] Weijie Zhao, Florin Rusu, et al. Similarity join over array data. In SIGMOD. ACM, 2016.

[146] Y. Zhu et al. Geometadb: powerful alternative search engine for the gene expression omnibus. Bioinformatics, 24(23), 2008.

[147] Zachary Zobel, Jiali Wang, et al. Evaluations of high-resolution dynamically downscaled ensembles over the contiguous united states. Climate Dynamics, pages 1–22, 2017.

[148] Zachary Zobel, Jiali Wang, et al. High-resolution dynamical downscaling ensemble projections of future extreme temperature distributions for the united states. Earth's Future, 2017.

[149] Calisto Zuzarte, Hamid Pirahesh, et al. Winmagic: Subquery elimination using window aggregation. In SIGMOD, pages 652–656. ACM, 2003.
