Accelerating Main Memory Query Processing for Data Analytics

by

Puya Memarzia

Master of Science, Shiraz University, 2014

A DISSERTATION SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Doctor of Philosophy

In the Graduate Academic Unit of Faculty of Computer Science

Supervisor(s): Virendrakumar C. Bhavsar, PhD, Computer Science
               Suprio Ray, PhD, Computer Science

Examining Board: Kenneth Kent, PhD, Computer Science
                 Patricia Evans, PhD, Computer Science
                 Eduardo Castillo Guerra, PhD, Electrical and Computer Engineering

External Examiner: Khuzaima Daudjee, PhD, Computer Science, University of Waterloo

This dissertation is accepted by the Dean of Graduate Studies

THE UNIVERSITY OF NEW BRUNSWICK

April, 2020

© Puya Memarzia, 2020

Abstract

Data analytics provides a way to understand and extract value from an ever-growing volume of data. The runtime of analytical queries is of critical importance, as fast results enhance decision making and improve user experience. Data analytics systems commonly utilize in-memory query processing techniques to achieve better throughput and lower latency. Although processing data that is already in the main memory is decidedly speedier than disk-based query processing, this approach is hindered by limited memory bandwidth and cache capacity, resulting in the under-utilization of processing resources. Furthermore, the characteristics of the hardware, data, and workload can all play a major role in determining execution time, and the best approach for a given application is not always clear. In this thesis, we address these issues by investigating ways to design more efficient algorithms and data structures. Our approach involves the systematic application of application-level and system-level refinements that improve algorithm efficiency and hardware utilization. In particular, we conduct a comprehensive study on the effects of dataset skew and shuffling on hash join algorithms. We significantly improve join runtimes on skewed datasets by modifying the algorithm's underlying hash table. We then further improve performance by designing a novel hash table based on the concept of cuckoo hashing. Next, we present a six-dimensional analysis of in-memory aggregation that breaks down the variables that affect query runtime. As part of our evaluation, we investigate 13 different algorithms and data structures, including one that we specifically developed to excel at a popular query category. Based on our results, we produce a decision tree to help practitioners select the best approach based on aggregation workload characteristics. After that, we dissect the runtime impact of NUMA architectures on a wide variety of query workloads and present a methodology that can greatly improve query performance with minimal modifications to the source code. This approach involves systematically modifying the application's thread placement, memory placement, and memory allocation, and reconfiguring the operating system. Lastly, we design a scalable query processing system that uses distributed in-memory data structures to store, index, and query spatio-temporal data, and demonstrate the efficiency of our system by comparing it against other data systems.

Dedication

I dedicate this thesis to all the people that supported and encouraged me along the way.

Acknowledgements

First of all, I would like to thank my supervisors, Dr. Virendra C. Bhavsar and Dr. Suprio Ray, for their guidance, patience, and encouragement. I would also like to thank the members of my examining committee, Dr. Eric Aubanel, Dr. Kenneth Kent, Dr. Patricia Evans, Dr. Eduardo Castillo Guerra, and Dr. Khuzaima Daudjee, for their time, effort, support, and constructive comments. Additionally, I would like to thank Dr. Kenneth Kent and Aaron Graham from IBM CASA and Serguei Vassiliev and Kaizaad Bilimorya from Compute Canada, for providing access to some of the machines used for my experiments.

Table of Contents

Abstract ii

Dedication iv

Acknowledgements v

Table of Contents vi

List of Tables xiii

List of Figures xv

Abbreviations xix

1 Introduction 1
   1.1 Thesis Objectives and Methodology ...... 2
   1.2 Thesis Outline ...... 4

2 Background 6
   2.1 Trends in Modern Hardware Architecture ...... 6
      2.1.1 The Power Wall ...... 7
      2.1.2 The ILP Wall ...... 7
      2.1.3 The Memory Wall ...... 8
      2.1.4 CPU Cache and Memory Hierarchy ...... 9
      2.1.5 Non-Uniform Memory Access (NUMA) Architectures ...... 10
   2.2 Main Memory Query Processing ...... 11
      2.2.1 Data Persistence ...... 12
   2.3 Query Data Operations ...... 13
      2.3.1 Join Queries ...... 13
      2.3.2 Aggregation Queries ...... 14

3 Main Memory Join Processing 18
   3.1 Motivation ...... 19
   3.2 Related Work ...... 21
      3.2.1 Hash Joins ...... 22
      3.2.2 Hash Join Configurations ...... 23
      3.2.3 Hash Tables ...... 24
   3.3 The Impact of Data Skew ...... 24
   3.4 Separate Chaining with Value-Vectors (SCVV) ...... 25
      3.4.1 Lookup and materialization costs ...... 26
      3.4.2 Evaluation and Overhead ...... 27
   3.5 Maple Hash Table (MH) ...... 28
      3.5.1 Data Structure ...... 29
      3.5.2 Hash Functions ...... 30
      3.5.3 Concurrent Implementation ...... 31
      3.5.4 Performance Evaluation ...... 33
   3.6 Evaluation ...... 34
      3.6.1 Platform Specifications ...... 35
      3.6.2 Datasets ...... 36
      3.6.3 Results and Discussion ...... 37
      3.6.4 Build Table Skew ...... 38
         3.6.4.1 Sequential 1:N dataset ...... 38
         3.6.4.2 Random near 1:N dataset ...... 38
         3.6.4.3 Gaussian M:N dataset ...... 39
         3.6.4.4 Zipf M:N dataset ...... 41
      3.6.5 Probe Table Skew ...... 41
      3.6.6 Effect of Dataset Shuffling on Hash Tables ...... 43
      3.6.7 CPU Architecture ...... 45
   3.7 Chapter Summary ...... 46

4 Main Memory Aggregation Processing 48
   4.1 Motivation ...... 49
   4.2 Related Work ...... 52
   4.3 Queries ...... 55
   4.4 Data Structures and Algorithms ...... 56
      4.4.1 Sort-based Aggregation Algorithms ...... 57
         4.4.1.1 ...... 57
         4.4.1.2 ...... 58
         4.4.1.3 Radix Sort (MSB and LSB) ...... 58
         4.4.1.4 Spreadsort ...... 58
         4.4.1.5 Sorting Microbenchmarks ...... 59
      4.4.2 Hash-based Aggregation Algorithms ...... 59
         4.4.2.1 Linear probing ...... 60
         4.4.2.2 Quadratic probing ...... 61
         4.4.2.3 Separate chaining ...... 62
         4.4.2.4 Cuckoo Hashing ...... 62
      4.4.3 Tree-based Aggregation Algorithms ...... 63
         4.4.3.1 Btree ...... 63
         4.4.3.2 Ttree ...... 64
         4.4.3.3 ART ...... 64
         4.4.3.4 Judy Arrays ...... 64
      4.4.4 Data Structure Microbenchmark ...... 65
      4.4.5 Time Complexity ...... 66
   4.5 Datasets ...... 66
   4.6 Results and Analysis ...... 68
      4.6.1 Platform Specifications ...... 70
      4.6.2 Vector Aggregation ...... 71
      4.6.3 Cache and TLB misses ...... 72
      4.6.4 Memory Efficiency ...... 74
      4.6.5 Dataset Distributions ...... 75
      4.6.6 Range Search Query ...... 77
      4.6.7 Scalar Aggregation ...... 78
      4.6.8 Multi-threaded Scalability ...... 79
   4.7 Summary and Discussion ...... 82
   4.8 Chapter Summary ...... 83

5 Query Processing on NUMA Systems 85
   5.1 Motivation ...... 86
   5.2 NUMA Topologies ...... 89
      5.2.1 Experiment Workloads ...... 90
   5.3 Related Work ...... 91
   5.4 Methodology ...... 93
      5.4.1 Dynamic Memory Allocators ...... 94
         5.4.1.1 ptmalloc ...... 95
         5.4.1.2 jemalloc ...... 95
         5.4.1.3 tcmalloc ...... 96
         5.4.1.4 Hoard ...... 96
         5.4.1.5 tbbmalloc ...... 97
         5.4.1.6 supermalloc ...... 97
         5.4.1.7 mcmalloc ...... 97
         5.4.1.8 Overriding The Memory Allocator ...... 98
         5.4.1.9 Memory Allocator Microbenchmark ...... 99
      5.4.2 Thread Placement and Scheduling ...... 100
      5.4.3 Memory Placement Policies ...... 103
      5.4.4 Operating System Configuration ...... 103
         5.4.4.1 Virtual Memory Page Management ...... 104
         5.4.4.2 Automatic NUMA Load Balancing ...... 104
   5.5 Evaluation ...... 105
      5.5.1 Experimental Setup ...... 105
      5.5.2 Datasets and Implementation Details ...... 107
      5.5.3 Operating System Configuration Experiments ...... 110
         5.5.3.1 AutoNUMA Load Balancing Experiments ...... 110
         5.5.3.2 Transparent Hugepages Experiments ...... 112
         5.5.3.3 Hardware Architecture Experiments ...... 112
      5.5.4 Memory Allocator Experiments ...... 113
         5.5.4.1 Hashtable-based Experimental Workloads ...... 113
         5.5.4.2 Impact of Dataset Distribution ...... 115
         5.5.4.3 Effect on In-memory Indexing ...... 116
      5.5.5 Database Engine Experiments ...... 117
      5.5.6 Results Summary ...... 120
   5.6 Chapter Summary ...... 121

6 Distributed In-memory kNN Query Processing 123
   6.1 Motivation ...... 124
      6.1.1 APGAS and the X10 Language ...... 126
   6.2 Related Work ...... 127
      6.2.1 Big Spatial Data Processing Systems ...... 127
      6.2.2 Scalable Spatio-Temporal Indexes ...... 128
      6.2.3 Distributed kNN Query Processing ...... 131
   6.3 Problem Statement and Design Considerations ...... 132
   6.4 DISTIL+ System Overview ...... 134
      6.4.1 Data Partitioning ...... 135
      6.4.2 Data Persistence ...... 137
         6.4.2.1 Local Store ...... 138
         6.4.2.2 Global Store ...... 138
      6.4.3 Indexing Techniques ...... 138
         6.4.3.1 DISTIL+ Distributed In-Memory Spatio-Temporal Index ...... 140
   6.5 Spatio-temporal kNN Query Processing ...... 143
      6.5.1 kNN Pruning ...... 145
   6.6 Evaluation ...... 146
      6.6.1 Experimental Setup ...... 146
      6.6.2 Workloads and Experimental Parameters ...... 147
         6.6.2.1 kNN Query (STkNNQ) Dataset Size ...... 148
         6.6.2.2 kNN Query (STkNNQ) k parameter ...... 149
      6.6.3 Comparison with Other Systems ...... 149
         6.6.3.1 Update Performance ...... 150
         6.6.3.2 Query Performance ...... 151
   6.7 Chapter Summary ...... 152

7 Conclusion 154
   7.1 Future Work ...... 156

References 180

A Main Memory Joins - Additional Results 181

B Main Memory Aggregation - Additional Results 183

C Query Processing on NUMA Systems - Additional Results 185

D DISTIL+ - Additional System Details and Results 186
   D.1 Update Processing ...... 186
      D.1.1 Update Performance ...... 189
         D.1.1.1 Update Batch Size per Worker Thread ...... 189
         D.1.1.2 Tile placement policy and dataset size ...... 189
         D.1.1.3 Node Scaling ...... 190
   D.2 Range query (STRQ) processing ...... 191
      D.2.1 Range Query Performance ...... 193
         D.2.1.1 Range Query (STRQ) Timeframe ...... 193
         D.2.1.2 Range Query (STRQ) Dataset Size and Spatial Extent ...... 194
         D.2.1.3 Tile Placement Policy and Number of Query Threads ...... 194

E Source Code Statistics 197

Vita

List of Tables

2.1 Aggregation Query Categories ...... 16
3.1 Experimental setup specifications ...... 35
3.2 Hash Join Configuration Key ...... 36
3.3 Dataset details ...... 37
4.1 Aggregation Queries ...... 55
4.2 Data Structure Time Complexity ...... 66
4.3 Algorithms and Data Structures ...... 67
4.4 Dataset Distributions ...... 68
4.5 Experiment Parameters ...... 69
4.6 Peak Memory Usage (MB) - Q1 on Rseq 10^3 Groups ...... 74
4.7 Peak Memory Usage (MB) - Q3 on Rseq 10^3 Groups ...... 75
4.8 Concurrent Algorithms and Data Structures ...... 80
5.1 Experiment Workloads ...... 90
5.2 Profiling thread placement - W1 on Machine A - Default (managed by OS) vs Modified (Sparse policy) ...... 102
5.3 NUMA Machine Specifications ...... 106
5.4 Experiment Parameters (bolded values are system defaults) ...... 108
6.1 Index Data Structure Comparison ...... 140
6.2 Experiment Parameters (default settings bolded) ...... 146
6.3 Data System Attributes and Features ...... 149
6.4 Update Throughput (records/s) Dataset Scaling Comparison ...... 151
E.1 Source Code Summary and Statistics ...... 197

List of Figures

1.1 Thesis Overview ...... 3
2.1 Processor-Memory Performance Gap (relative performance) [68] ...... 8
2.2 TLB and Cache Hierarchy ...... 10
2.3 Memory Access on a NUMA System ...... 10
2.4 Overview of Query Processing Approaches: disk-based (left) vs in-memory (right) ...... 11
2.5 Equijoin Example ...... 14
2.6 Hash Join Overview ...... 14
2.7 Comparison of aggregation categories - distributive and algebraic (top) vs holistic (bottom) ...... 15
2.8 Aggregation Workload Overview ...... 16
3.1 Hash join experiments using configurations from [29] - Skylake - 8 threads ...... 22
3.2 Example: Comparison of a bucket chain with and without value vectors (modulo hash function) ...... 26
3.3 Maple Hash Table Data Structure - a key is found without the need to probe all the elements in the bucket ...... 29
3.4 Example of finding a location for a key using multi-stage hashing to determine table, bucket, and array index ...... 29
3.5 Comparing Maple hash table and Intel Libcuckoo - uniform random dataset - Skylake - 8 threads ...... 34
3.6 Hash Join run time with Variable Build Skew on Ordered Datasets - Skylake - 8 threads ...... 39
3.7 Hash Join run time with Variable Build Skew on Shuffled Datasets - Skylake - 8 threads ...... 40
3.8 Hash Join Runtime with Variable Probe Skew on 1:N Datasets - Skylake - 8 threads ...... 42
3.9 Performance impact of shuffling on MH and SCVV - Zipf 1:N dataset - Skylake - 8 threads ...... 44
3.10 Total join time on variable CPU architectures - 1:N Datasets - 8 threads ...... 45
4.1 An overview of the analysis dimensions ...... 50
4.2 Sort Algorithm Microbenchmark ...... 59
4.3 Data Structure Microbenchmark ...... 65
4.4 Vector Aggregation Q1 - 100M Records ...... 70
4.5 Vector Aggregation Q3 - 100M records ...... 72
4.6 TLB misses - Rseq 100M Dataset ...... 73
4.7 TLB misses - Rseq 100M Dataset ...... 73
4.8 Vector Q1 - Variable Key Distributions - 100M records ...... 76
4.9 Range Search Aggregation Q7 - 100M records ...... 77
4.10 Scalar Aggregation Q6 - 100M records ...... 79
4.11 Parallel Sort Algorithm Microbenchmark ...... 80
4.12 Multi-threaded Scaling - Rseq 100M ...... 81
4.13 Decision Flow Chart ...... 83
5.1 Machine NUMA Topologies (machine specifications in Table 5.3) ...... 89
5.2 Memory Allocator Microbenchmark - Machine A ...... 99
5.3 OS thread scheduler behavior vs thread affinity strategy - Consecutive runs of W1 - Machine A ...... 101
5.4 Comparison of Sparse and Dense thread affinitization strategies - W1 - Machine A ...... 101
5.5 Evaluating and profiling AutoNUMA load balancing - W1 ...... 111
5.6 Evaluating effect of AutoNUMA and THP on memory placement policies and allocators - W1 ...... 112
5.7 Comparison of memory allocators - variable workload, memory placement policy, and machine ...... 114
5.8 W1 - Machine A - Effect of dataset distribution ...... 115
5.9 Index nested loop join workload (W4) - variable memory allocators and memory placements - Machine A ...... 116
5.10 22 TPC-H queries (W5) scale factor 20 - Query latency reduction - Variable database systems - Machine A ...... 118
5.11 Effect of memory allocator on TPC-H query latency - MonetDB - Machine A ...... 119
5.12 Application-agnostic Decision Flowchart ...... 121
6.1 APGAS Programming Model ...... 126
6.2 DISTIL+ System Architecture ...... 134
6.3 DISTIL+ Update Processing Overview ...... 135
6.4 DISTIL+ Query Processing Overview ...... 136
6.5 DISTIL+ Index Architecture ...... 141
6.6 Illustration of kNN query (STkNNQ) processing ...... 142
6.7 kNN Query Latency: variable dataset size ...... 148
6.8 kNN Query Latency: variable k parameter ...... 148
6.9 Insert/Update Throughput (records/s) Comparison (GeoSpark and ST-Hadoop do not support location updates) ...... 150
6.10 Spatio-temporal query latency comparison ...... 152
A.1 Hash Join run time with Variable Build Skew on Gaussian near M:K Dataset - Skylake - 8 threads ...... 182
A.2 Hash Join run time with Variable Build Skew on Zipf near M:K Dataset - Skylake - 8 threads ...... 182
B.1 Vector Aggregation Q1 - 100M Records - Intel Harpertown (Penryn) Machine - Variable dataset distribution ...... 184
C.1 Head-to-head DBMS Comparison - All 22 TPC-H queries - Machine A - Best Configuration ...... 185
D.1 Update throughput: batch size/worker threads ...... 189
D.2 Tile placement policy comparison: variable dataset size ...... 190
D.3 Update throughput scalability: variable cluster size ...... 190
D.4 Illustration of range query (STRQ) processing ...... 191
D.5 Range query throughput: variable temporal range ...... 193
D.6 Range query throughput scalability: variable dataset size ...... 194
D.7 Range query throughput comparison: RRR vs. MDR ...... 196

Symbols and Abbreviations

←      Assignment
∀      For All
∑      Summation
∈      Set Membership
O      Big-O Notation
CPU    Central Processing Unit
DBMS   Database Management System
DRAM   Dynamic Random Access Memory
GPGPU  General-Purpose Computing on Graphics Processing Units
GPU    Graphics Processing Unit
HDD    Hard Disk Drive
HEDT   High-end Desktop
HT     Hyper-threading
HTM    Hardware Transactional Memory
I/O    Input/Output
ILP    Instruction-level Parallelism
kNN    k-Nearest Neighbors
LAR    Local Access Ratio
LBS    Location-Based Service
LOC    Lines of Code
NUMA   Non-Uniform Memory Access
OS     Operating System
RSS    Resident Set Size
SIMD   Single Instruction Multiple Data
SMT    Simultaneous Multithreading
SMP    Symmetric Multiprocessing
SSD    Solid State Drive
SQL    Structured Query Language
THP    Transparent Hugepages
TLB    Translation Lookaside Buffer

Chapter 1

Introduction

The digital world is producing large volumes of data at increasingly higher rates. There is a widespread consensus that the rate of data generation will continue to grow exponentially [158, 172, 75]. This data contains value that can be extracted in various forms, such as insights, trends, recommendations, and predictions. We require machine assistance in order to process and interpret this data in a timely manner. Data analytics provides this assistance, and it is a key technology that powers the information age. However, coping with the challenges of big data is an ongoing battle that requires continuous development and innovation on both the software and hardware fronts. The need to store, retrieve, and process data efficiently has become vital in all spheres of human activity. Furthermore, the breadth of applications that depend on efficient data operations has grown dramatically [75]. Data analytics provides support to applications such as business intelligence, decision support systems, machine learning, data warehousing, and scientific applications. Users desire results in a timely fashion, and the industry places a great emphasis on performance metrics such as throughput, latency, and scalability. Main memory query processing systems have been increasingly adopted due to continuous improvements in DRAM capacity and speed, and the growing requirements of the data analytics industry. However, the speed and performance of memory are increasing at a rate that is greatly outmatched by advances in CPU performance [145]. As the hardware landscape shifts toward new architectures, achieving good performance and scalability in these applications is a key challenge.

As part of our study of data analytics, we put two fundamental and ubiquitous data operations under the spotlight: joins and aggregations. They are arguably the two most expensive data operations, and are used in a wide range of technologies, ranging from SQL-based relational databases [73, 29] to NoSQL systems, such as graph stores [122] and MapReduce applications [40]. These data operations can be implemented using a wide variety of different algorithms. However, efficient implementations must leverage the ever-evolving features of modern hardware and adapt to the varying characteristics of the data and workload. In recent years, there has been a major push toward main memory (or in-memory) query processing. Using main memory query processing relieves the burden of I/O bottlenecking, but shifts the bottleneck to the memory hierarchy and the application's memory access patterns. As a result, our study of data analytics is performed in the context of in-memory query processing. Our work is fundamentally focused on the core data structures and algorithms that form the basis of query processing. Next, we outline our objectives.

1.1 Thesis Objectives and Methodology

The main objective of this research is to find ways to boost performance (latency, throughput, and scalability) in main memory query processing. To do so, we examine this topic from the perspectives of the software and workloads, the data characteristics, and the evolution of hardware. The scope of this research is primarily focused on algorithms and data structures used for main memory query processing. Data operations, such as joins and aggregations, are ubiquitous and time consuming, and carry significant importance in the data analytics industry. Applications include traditional relational databases, NoSQL data systems, and machine learning and data science applications. These applications are typically run on modern hardware, such as Non-Uniform Memory Access (NUMA) architectures. The performance of in-memory query processing is notably affected by NUMA, as the memory is partitioned into separate nodes and represented as a single address space. This has implications for in-memory query performance, as the speed of accessing data in memory varies depending on its physical address. We also study the important role of distributed systems in big data analytics. Specifically, we investigate spatio-temporal kNN queries in the context of a distributed data system that employs in-memory query processing. Figure 1.1 provides a general overview of the thesis topics and their relationship with main memory query processing.

Figure 1.1: Thesis Overview

The methodology used in this research is as follows. First, we identify and analyze the problems and performance bottlenecks in main memory query processing, and the limitations in existing approaches. This includes a comprehensive review of existing work in this area, as well as performance evaluations of state-of-the-art implementations. To do so, we build on prior work by creating new synthetic datasets that address issues encountered in real-world scenarios, such as data skew and locality. We develop new data structures, algorithms, and techniques that leverage algorithmic improvements and new trends in modern hardware architecture to achieve significant speedups over existing approaches. Lastly, we support our findings with in-depth empirical and theoretical analysis, including algorithm complexity, CPU cache and memory profiling, microbenchmarks, and concurrency analysis.

1.2 Thesis Outline

The key contributions of this thesis are:

• A systems-oriented approach for accelerating query processing workloads by utilizing new/modified data structures and algorithms on modern hardware

• A novel hash table based on cuckoo hashing that excels at accelerating join operations on skewed/shuffled data

• New and extended synthetic datasets designed for studying in-memory query processing

• A methodology to help practitioners speed up query processing on NUMA systems with minimal code modifications

• A scalable spatio-temporal data system that accelerates kNN queries using a multi-level distributed in-memory index

• Extensive experimental evaluation, involving different query workloads, data structures, and database systems, on different machine architectures, with profiling and performance counters, time complexity analysis, and microbenchmarks

The thesis is organized as follows: in Chapter 2, we outline the core background concepts that are relevant to in-memory query processing. Chapter 3 covers our work on in-memory hash joins, including our novel hash table design. In Chapter 4, we delve into a six-dimensional analysis of in-memory aggregation. Chapter 5 presents our approach to systematically accelerate in-memory query processing on NUMA systems. In Chapter 6, we outline our research on adapting in-memory query processing to a distributed data system for spatio-temporal queries. Chapter 7 concludes the thesis and outlines the future work. Lastly, in Appendix A and the following appendices, we include additional system details, experimental results, and statistics that were omitted for brevity.

Chapter 2

Background

As mentioned earlier, performance metrics that affect speed, such as throughput, latency, and scalability, are important factors in data analytics. Developing efficient algorithms requires an intimate understanding of the traits and characteristics of the underlying hardware, software, operating system, and data. In this chapter, we explore the fundamental background concepts surrounding this subject.

2.1 Trends in Modern Hardware Architecture

Moore's law predicts that transistor density will double every two years. However, the development of commodity CPU architectures continues to be influenced by three major limiting factors: the power wall, the ILP wall, and the memory wall [8]. CPU manufacturers have been struggling to keep Moore's law alive by working around these limitations, but there is evidence that Moore's law is slowing down [43]. A fourth "wall" may be encountered in the relatively near future, when shrinking transistor sizes further becomes unfeasible due to physical and electrical limitations, as well as economic factors, such as research and development costs and manufacturing costs, which are heavily impacted by yields. As such, transistor miniaturization is projected to end within the next decade [36], potentially leading to an era of sluggish hardware growth amidst an accelerating torrent of data. These developments will force the hardware industry to search for new pathways toward scalability, and the software industry will be increasingly motivated to take application performance and algorithmic efficiency more seriously [107].

2.1.1 The Power Wall

The power wall was encountered at a time when increasing CPU frequency to achieve greater performance was the industry norm. The push to increase CPU frequency led to CPUs that produced too much heat to be cooled by conventional means (heatsink and fan). High heat can permanently damage the processor. Modern CPUs contain various mechanisms to limit power output and will shut down if temperatures reach unsafe levels. This issue puts a damper on processor frequency scaling, and has motivated the production of multi-core CPUs, which have become the de facto industry standard. Multi-core CPUs exploit the fact that CPU power consumption grows much faster than linearly with clock frequency. As a result, multiple lower-clocked cores can outperform a single higher-clocked CPU while producing the same amount of heat. Nowadays, multi-core CPUs are used everywhere from smartphones and IoT devices to desktops and supercomputers.

2.1.2 The ILP Wall

The ILP (Instruction-level Parallelism) wall limits the rate at which a single processor can simultaneously execute instructions. CPUs use pipelining and redundant circuitry to extract additional performance by running multiple instructions at the same time. This provides a way to mitigate the power wall. However, pipelining eventually leads to diminishing returns due to various hazards, such as data dependencies and branch mispredictions. The ILP wall led to the development of technologies, such as simultaneous multithreading (SMT), which improve CPU resource utilization by scheduling two threads on the same core. As the demands of the industry continue to grow, hardware manufacturers have placed a greater emphasis on data and task parallelism.

Figure 2.1: Processor-Memory Performance Gap (relative performance) [68]

2.1.3 The Memory Wall

The memory wall is attributed to the growing gap between CPU and memory performance [109, 145]. Memory performance is increasing at a much slower pace compared to CPU performance, and memory bandwidth per CPU core is declining. Applications can spend a considerable portion of their runtime waiting for data to process [76]. Figure 2.1 illustrates the disproportionate growth of CPU and memory performance. The demand for greater memory capacity and bandwidth has led the industry to migrate from uniform memory access (UMA) to non-uniform memory access (NUMA) architectures. Although these changes provide a path toward greater scalability, the burden of leveraging the hardware falls to the software developers.

Software has generally been slow to adapt to advances in hardware architecture. There is a widening divide between software projects that try to leverage modern hardware and those that piggyback on incremental hardware improvements, such as the IPC (instructions per cycle) increases that we have seen for the past several CPU generations [49]. For example, the popular MySQL database system [127] continues to lack intra-query parallelism. This issue has been easy to overlook in some fields due to a greater focus on multi-tasking, task parallelism, and multi-process software environments. High hardware utilization is achieved by giving each task a slice of the hardware resources proportional to its needs. This approach is not suitable for all applications. For example, processing large datasets in a large main memory data processing system requires advances in intra-query parallelism.

2.1.4 CPU Cache and Memory Hierarchy

The CPU cache is a type of memory that sits between the main memory (DRAM) and the CPU registers. Relative to main memory, the cache stores small amounts of data, but is significantly faster. Modern CPU caches are designed to take advantage of two types of data locality: temporal locality and spatial locality. Temporal locality refers to data that is likely to be reused in the near future. Spatial locality benefits from data that is near other data in terms of their physical memory address. Data that is not in the CPU cache (cache miss) must be loaded from the memory. The TLB (Translation Lookaside Buffer) accelerates virtual address lookups by storing recently accessed memory addresses. The TLB can only store a limited number of entries, leading to the eviction of older entries. When a requested memory address is not found within the TLB (TLB miss), a much slower operation finds the address in the page table. Figure 2.2 shows how the data from a virtual address is accessed by the CPU. Cache and TLB misses incur moderate performance penalties, and generally have a greater impact on memory-intensive applications, such as data analytics. Although successive CPU architectures have increased cache and TLB sizes, mitigating these issues on the software side is a major challenge.
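As a simple, self-contained illustration of spatial locality (our own example, not an experiment from this thesis), the two functions below sum the same row-major matrix; the first walks consecutive addresses and mostly hits in the cache, while the second strides across cache lines and typically incurs far more cache and TLB misses on large matrices.

```cpp
#include <cstddef>
#include <vector>

// Sum a row-major n x n matrix stored as a flat vector.
double sum_row_major(const std::vector<double>& m, std::size_t n) {
    double s = 0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            s += m[i * n + j];   // consecutive addresses: good spatial locality
    return s;
}

double sum_column_major(const std::vector<double>& m, std::size_t n) {
    double s = 0;
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i)
            s += m[i * n + j];   // stride of n elements per access: poor locality
    return s;
}
```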


Figure 2.2: TLB and Cache Hierarchy


Figure 2.3: Memory Access on a NUMA System

2.1.5 Non-Uniform Memory Access (NUMA) Architectures

The demand for greater processing power has pushed the adoption of various decentralized memory controller layouts, which are collectively known as non-uniform memory access (NUMA) architectures. In a NUMA system, CPU and memory resources are partitioned into groups and linked with each other using an interconnect. Each group is referred to as a NUMA node, and the way these nodes are connected to each other is called a NUMA topology. Memory access times within a node are faster (local access), whereas access to other nodes is typically slower due to latency and bandwidth constraints (remote access). A remote access may need to hop through one or more nodes to reach its destination. Figure 2.3 depicts different memory access routes in an example NUMA system.

The first NUMA system was developed by Honeywell Information Systems in 1985 and was limited to two processors. The first commodity NUMA system was released to the market much later with the AMD Opteron K8 in 2003, which could support up to eight NUMA nodes. Today's NUMA systems typically employ a balance of multiple sockets and multicore CPUs, with a greater emphasis on per-socket performance and power efficiency. NUMA architectures are also pervasive in in-memory rack-scale systems, as well as a growing range of manycore CPUs. Recent developments have led to On-Chip NUMA Architectures (OCNA) that partition a processor's cores into multiple NUMA regions, each with their own dedicated memory controller [116, 156]. Based on current trends, future NUMA systems will incorporate even larger many-core processor designs and faster, potentially unified interconnects (such as GEN-Z) with tighter coupling between system components [23]. The adoption of NUMA architectures may even expand beyond server and high-end desktop (HEDT) CPUs to products such as GPUs and high-end smartphones.

2.2 Main Memory Query Processing

A query represents a request to store, retrieve, or manipulate data in a database or data system. Query processing techniques can be broadly divided into two categories, depicted in Figure 2.4: disk-based query processing and main memory (or in-memory) query processing. In disk-based query processing approaches, the data typically resides on one or more hard disk drives. Queries are typically calculated on-demand, by loading parts of the disk-resident data into main memory. Dependence on disk access has implications for query latency, as disks are orders of magnitude slower compared to RAM. However, their affordability and storage capacity continues to factor into the popularity of disk-based systems. In a disk-based query processing system, the cost of a query is primarily determined by the number of disk accesses required to process it.

Figure 2.4: Overview of Query Processing Approaches: disk-based (left) vs in-memory (right)

In main memory query processing, queries are evaluated on data that resides in memory. This approach can produce significant speedups over disk-based query processing, as it avoids the I/O bottleneck completely. However, memory is a relatively scarce resource compared to disk storage, and extracting efficient performance is impeded by factors such as the memory and cache hierarchy, access patterns, and allocation. The push for main memory query processing stems from the demands of the industry, such as decision support systems and real-time data analytics, as well as the falling costs and rising capacities of RAM modules. Disks continue to play an important role as they are often used to provide a means for high capacity data persistence.

2.2.1 Data Persistence

In-memory data processing systems achieve high performance by operating on memory-resident data. However, solely storing the data in volatile main memory entails the risk of data loss. Various events, such as power outages, hardware failures, and application crashes, or even intentional activities, such as system reboots, maintenance, and upgrades, could destroy valuable data at worst, and require time-consuming recovery operations at best. In this context, data persistence reduces the risk of data loss by storing data in non-volatile storage, such as hard drives and solid-state drives. A key challenge in data persistence is determining the format, frequency, and location of storage operations, and finding a trade-off that does not significantly hinder application performance. In distributed data systems, data persistence can be achieved using disk-based local stores at each node, and/or by using a distributed file system. In Chapter 6, we outline a practical example of this in Section 6.4.2.

2.3 Query Data Operations

As mentioned in Chapter 1, we study main memory query processing on two categories of data operations: joins and aggregations. In business intelligence and data analytics workloads, joins are frequently used in conjunction with aggregations. While joins can generate results that are larger than all of the inputs, aggregation typically produces an output that is smaller than the input. Together they help to create human-readable information out of large amounts of data. In this section, we describe the fundamental concepts of joins and aggregation, and their corresponding algorithms and data structures.

2.3.1 Join Queries

In relational databases, a join is described as an operation that combines columns from different tables. This is done by evaluating a predicate (such as equal to or greater than) on the shared columns (known as the join keys) from the participating tables. Joins are a fundamental building block in relational algebra and SQL databases. However, in a broader sense, the same concepts apply to any data that can be represented as key-value pairs. Joins are broadly divided into three categories based on their implementation: sort merge join, nested loop join (or index nested loop join), and hash join. Merge join (also known as sort-merge join) is efficient if the input is sorted on the join columns. The performance of nested loop join depends on the availability of suitable indexes and statistics. Hash joins are typically chosen for their good performance on large and/or non-indexed data. Due to these characteristics, hash joins are frequently used to accelerate joins in main memory data systems [29, 58]. Their main downside is that they can only be used for equality predicates, as they do not support inequality predicates, such as <, >, ≤, or ≥. Hash joins are named as such because they use hash table data structures to accelerate the process of checking for matching join keys.

Figure 2.5: Equijoin Example

Figure 2.6: Hash Join Overview

Figure 2.6 depicts a hash join linking employee names to department names via the department number (join key). The hash join is divided into three phases: partition, build, and probe. The partition phase is optional, and consists of splitting up the input relations, typically on the join keys. During the build phase, one of the join inputs (typically the smaller relation) is designated as the build relation, and inserted into a hash table. The probe phase involves iterating through the other relation (known as the probe relation), checking the hash table for matches, and materializing matching pairs in the output.
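To make the build and probe phases concrete, the following is a minimal single-threaded sketch using C++ standard containers. It mirrors the example in Figure 2.6, but it is only an illustration of the general technique; the row types and data are hypothetical, and this is not the hash join implementation evaluated later in this thesis.

```cpp
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical row types, matching the example in Figure 2.6.
struct Department { int deptno; std::string desc; };
struct Employee   { std::string ename; int deptno; };

int main() {
    std::vector<Department> build_rel = {{10, "Marketing"}, {20, "IT"}, {30, "HR"}, {40, "Financial"}};
    std::vector<Employee>   probe_rel = {{"Gary", 10}, {"Dulley", 10}, {"Reiter", 30}, {"Taylor", 20}};

    // Build phase: insert the (smaller) build relation into a hash table keyed on the join key.
    // unordered_multimap tolerates duplicate join keys, which matters for M:N joins.
    std::unordered_multimap<int, std::string> ht;
    for (const auto& d : build_rel) ht.emplace(d.deptno, d.desc);

    // Probe phase: look up each probe tuple's join key and materialize matching pairs.
    for (const auto& e : probe_rel) {
        auto range = ht.equal_range(e.deptno);
        for (auto it = range.first; it != range.second; ++it)
            std::cout << e.ename << ", " << it->second << "\n";   // e.g. "Gary, Marketing"
    }
    return 0;
}
```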

2.3.2 Aggregation Queries

Aggregation is an essential tool used to consolidate and summarize data. An aggregation query involves grouping the input tuples by their key and then applying an aggregate function to each group. There are three main categories of aggregate functions: distributive, algebraic, and holistic. Distributive functions, such as Count and Sum, can be processed in a distributed manner. This means that the input can be split and processed in multiple partitions, and then the intermediate results can be combined to produce the final result. Algebraic functions consist of two or more distributive functions. For instance, Average can be broken down into two distributive functions: Count and Sum. Holistic aggregate functions cannot be decomposed into multiple distributive functions, and require all of the input data to be processed together. For example, if an input is split into two partitions and the Mode is calculated for each partition, it is impossible to accurately determine the Mode for the total dataset.

Figure 2.7: Comparison of aggregation categories - distributive and algebraic (top) vs holistic (bottom)

Figure 2.7 provides an overview illustrating how these aggregate functions are calculated differently. Distributive and algebraic aggregates can be split up and merged, but holistic aggregates must be applied to each group in a single step. Other examples of holistic functions include Rank, Median, Percentile, and Quantile. Table 2.1 shows SQL examples for each function category. The output of an aggregation can be either in Vector or Scalar format. In Vector aggregates, a row is returned in the output for each unique key in the designated column(s). These columns are commonly specified using the group-by or having keywords.
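A small sketch (our own illustration, with hypothetical helper names) of why this distinction matters for partitioned processing: an algebraic Average can be computed from mergeable per-partition partials, while a holistic Median needs every value of the group in one place.

```cpp
#include <algorithm>
#include <vector>

// Partial state for the algebraic function AVG: two distributive parts (Sum and Count).
struct AvgPartial { double sum = 0; long count = 0; };

AvgPartial partial_avg(const std::vector<double>& partition) {
    AvgPartial p;
    for (double v : partition) { p.sum += v; ++p.count; }
    return p;
}

// Partials produced from independently processed partitions merge into the final result.
double merge_avg(const AvgPartial& a, const AvgPartial& b) {
    return (a.sum + b.sum) / static_cast<double>(a.count + b.count);
}

// A holistic function such as MEDIAN has no such compact partial state:
// all values of a group must be gathered (or sorted) before the result is known.
double median(std::vector<double> all_values) {
    std::sort(all_values.begin(), all_values.end());
    std::size_t n = all_values.size();
    return n % 2 ? all_values[n / 2]
                 : (all_values[n / 2 - 1] + all_values[n / 2]) / 2.0;
}
```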

Table 2.1: Aggregation Query Categories

Query  Category      SQL Representation

Q1     Distributive  SELECT groupkey, COUNT(val) FROM records GROUP BY groupkey;

Q2     Algebraic     SELECT groupkey, AVG(val) FROM records GROUP BY groupkey;

Q3     Holistic      SELECT groupkey, MEDIAN(val) FROM records GROUP BY groupkey;

Figure 2.8: Aggregation Workload Overview

The output value is returned as a new column next to the group-by column. Scalar aggregates process all the input rows and produce a single scalar value as the result. Figure 2.8 depicts an overview of an aggregation query (Q1 from Table 2.1) being applied to an example input table. The input contains sales records of customers purchasing different items, and the output contains each unique customer ID alongside the aggregate value, which counts the number of items each customer purchased.

In recent years, there have been many studies on in-memory sort algorithms and data structures, such as tree-based indexes and hash tables. Many of these data structures can be used for in-memory aggregation. Aggregation algorithms can be categorized by the data structure used to store the data. Based on this approach, the algorithms can be divided into three main categories: sort-based, hash-based, and tree-based algorithms. Thus, different aggregation implementations will inherit the performance characteristics of their underlying algorithms.

The implementation of an aggregate operator can be broken down into two main phases: a build phase and an iterate phase. Consider the following example using a hash table and a vector aggregate function (Q1 from Table 2.1). During the build phase, each key (created from the group-by attribute or attributes) is looked up in the hash table. If it does not exist, it is inserted with a starting value of one. Otherwise, the value for the existing key is incremented. Once the build phase is complete, the iterate phase reads the key-value pairs from the hash table and writes the resulting items to the output. A similar procedure is used for tree data structures. The calculation of the aggregate value during the build phase (early aggregation) is only possible when the aggregate function is distributive or algebraic. As a result, holistic aggregate values cannot be calculated until all records have been inserted. Sort-based approaches "build" a sorted array using the group-by attributes. As a result, all the values for each group are placed in consecutive locations. The aggregate values are then calculated by iterating through the groups.
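A minimal sketch of this build/iterate pattern for the distributive query Q1 (COUNT per group-by key) is shown below. It uses a standard-library hash table and hypothetical types, rather than the specific data structures studied in Chapter 4.

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Build phase: probe the hash table with each group-by key; insert with a count of
// one if absent, otherwise increment the existing value (early aggregation).
std::unordered_map<std::uint64_t, std::uint64_t>
build_counts(const std::vector<std::uint64_t>& group_keys) {
    std::unordered_map<std::uint64_t, std::uint64_t> groups;
    for (std::uint64_t key : group_keys)
        ++groups[key];   // operator[] inserts a zero-initialized value on first sight
    return groups;
}

// Iterate phase: read the key-value pairs out of the hash table and emit the result rows.
std::vector<std::pair<std::uint64_t, std::uint64_t>>
emit_result(const std::unordered_map<std::uint64_t, std::uint64_t>& groups) {
    return {groups.begin(), groups.end()};
}
```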

Chapter 3

Main Memory Join Processing

Joins are a vital and ubiquitous operation in most database systems, and are used to combine sets of data based on a predicate. Hash joins are a type of join that utilizes hash tables to speed up the process. Main memory database systems have long used hash joins to improve performance [58]. Recent advances in data systems and computer hardware have spurred further research in this area. As memory becomes cheaper and denser, main-memory hash joins are being increasingly prescribed to speed up existing systems. In this chapter, we conduct a comprehensive study on the effects of dataset skew and shuffling on four hash join algorithms, and present and evaluate two approaches based on modifying or replacing the underlying hash table to further improve join runtime performance. The chapter is organized as follows: we discuss our motivation in Section 3.1. In Section 3.2 we discuss the related work. In Section 3.3 we describe the issues that arise from skewed datasets. We present two solutions to solving this problem in Sections 3.4 and 3.5. The experimental setup and results are presented in Section 3.6. We conclude and summarize the chapter in Section 3.7.

Note: Parts of this chapter were previously published in [112].

3.1 Motivation

One of the key challenges facing database developers is how to consistently achieve good performance. In their in-depth analysis of in-memory hash join algorithms, Blanas et al. [29] acknowledged the importance of data skew. They showed that a simple no partitioning hash join algorithm can outperform other join algorithms. However, they only considered a basic 1:N join case that involves matching a primary key (unique and ordered) with a foreign key from another table. Joins can involve non-key columns that do not enforce uniqueness or any particular ordering. Popular relational database benchmarks such as TPC-H [35] generally focus on querying data that is uniformly distributed and non-skewed. However, it has been shown that these cases are not necessarily representative of real-world applications [37, 77]. For example, the size of cities and the length and frequency of words can be modeled with Zipfian distributions, and measurement errors often follow Gaus- sian distributions [61]. Furthermore, skewed build keys can be encountered as a result of parallel multi-way joins [21], joins on non-primary key columns, and com- plex queries. It is quite common to observe dataset skew in the output of a join operation, and this result-set may need to be joined with several other intermediate results or tables. The performance impact of dataset skew is frequently overlooked in favor of heavily optimizing other aspects, such as the effects of cache and TLBs [81], NUMA characteristics [89], architecture awareness [29, 15], memory-efficiency [16], and transactional memory [152]. Figure 3.1 shows two experiments which we use to highlight the problem and the potential for improvement. To demonstrate the acuteness of the data skew problem, we conduct an experiment comparing the hash join performance of a non-skewed dataset that is similar to that used in [29] and a skewed dataset that we generated (build table skew is the only variable). The experiment uses the hash join configu- rations and code provided by [29], which we describe in Section 3.2.2. We run the

experiment on a modern processor based on the Skylake architecture (for specifications, see Section 3.6.1). The results in Figure 3.1a show that all hash join configurations slow down by nearly four orders of magnitude when the dataset is skewed. As we show in Section 3.4, this issue is caused by cache misses incurred from traversing long pointer-linked chains of key-value pairs. Thus we illustrate how dataset skew can severely hinder hash join performance if it is not mitigated.

To further study the effect of data skew on hash joins, we propose a set of sixteen datasets that vary in terms of distribution, skew, and shuffling. The details of these datasets are presented in Table 3.3. To our knowledge, no prior work has used such a comprehensive series of datasets to evaluate in-memory hash joins. We explore different variants of dataset skew, vary the correlation between the build and probe tables, and examine how the order of the keys can affect the join algorithm. Query optimizers generally rely on various statistics about the tables to reliably predict the cost of different parts of a query. Choosing the right tool for the job can be challenging. In typical real-world applications, we cannot control how and where our data will be skewed.

To address the impact of dataset skew on hash joins, we focus our attention on the design of the hash table. Separate chaining hash tables remain one of the most popular choices due to their ease of use and flexibility. They have been used in many existing hash join implementations such as [29, 14, 15, 152, 89]. We demonstrate that the hash tables used in their implementations are not ideal for joins on skewed datasets. In order to improve performance, we modify the existing hash table (based on separate chaining) from [29]. We show how this modification results in shorter chain length and simpler result materialization, and significantly improves total join time. We demonstrate an example of this in Figure 3.1b. The modified hash table improves performance by storing the values associated with each key in a contiguous in-memory vector. This significantly reduces the number of lookups required to find a key because there are fewer elements in each bucket chain (this is covered in detail in Section 3.4). The hash join configurations are covered in more detail in Section 3.2.2.

To further improve hash join performance, we propose Maple hash table, a novel concurrent hash table based on cuckoo hashing. Maple hash table uses a unique hashing technique to determine a key's destination table, bucket, and index within the bucket. This multi-stage hashing approach reduces contention and collisions, without increasing the number of required lookups. We show that Maple hash table outperforms a state-of-the-art concurrent hash table implementation from Intel (see Section 3.5). We demonstrate that non-partitioning hash joins based on Maple hash table can significantly improve performance, particularly when the data is not ordered. Join time is faster by up to 17.3× in the best case, and slower by less than 0.2× in the worst case, compared to partitioning hash joins using our improved separate chaining hash table. This presents an opportunity for a query planner to choose a better hash join method and hash table, based on the data characteristics. To show the effectiveness of our approach on different processor architectures, we conduct experimental evaluations on three distinct hardware platforms. The hardware details are presented in Table 3.1. Our approaches achieve consistent performance gains, without the need to manually tune the parameters for each hardware architecture.
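To give a concrete picture of the kind of skew discussed in this section, the sketch below draws join keys from a Zipf distribution via inverse-CDF sampling. It is only an illustrative generator with assumed parameters; the exact generators and parameters behind the datasets in Table 3.3 are described in Section 3.6.2.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

// Draw 'count' keys in [1, n] following a Zipf distribution with exponent 'alpha'.
// Larger alpha produces heavier skew toward the smallest key values.
std::vector<std::uint64_t> zipf_keys(std::size_t count, std::uint64_t n,
                                     double alpha, std::uint64_t seed) {
    // Precompute the cumulative distribution function over the key domain.
    std::vector<double> cdf(n);
    double norm = 0;
    for (std::uint64_t k = 1; k <= n; ++k) norm += 1.0 / std::pow(double(k), alpha);
    double running = 0;
    for (std::uint64_t k = 1; k <= n; ++k) {
        running += (1.0 / std::pow(double(k), alpha)) / norm;
        cdf[k - 1] = running;
    }
    // Sample keys by inverting the CDF with a uniform random draw.
    std::mt19937_64 rng(seed);
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    std::vector<std::uint64_t> keys(count);
    for (auto& key : keys) {
        auto it = std::lower_bound(cdf.begin(), cdf.end(), uni(rng));
        key = std::min<std::uint64_t>(std::uint64_t(it - cdf.begin()), n - 1) + 1;
    }
    return keys;
}
```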

3.2 Related Work

Research on main-memory hash joins has flourished in recent years. Numerous publications have explored different algorithms, workloads, and architectures. In this section, we summarize recent works on hash joins and hash tables.

(a) Highlighting the performance impact of dataset skew and shuffling - comparing a non-skewed ordered dataset, a non-skewed shuffled dataset, and a shuffled Gaussian skewed dataset
(b) Comparing the separate chaining hash table (SC) from [29] against our modified version (SCVV) - our approach is over three orders of magnitude faster on the skewed dataset (details in Section 3.4)

Figure 3.1: Hash join experiments using configurations from [29] - Skylake - 8 threads

3.2.1 Hash Joins

A key inspiration for our work is the in-depth analysis on main-memory hash joins presented by Blanas et al. [29]. The authors implement a family of hash join algo- rithms (summarized in Section 3.2.2), which we adopt for our work. Their experi- ments evaluate datasets with skew on the probe relation. The results indicate that a simple non-partitioned join algorithm using a separate chaining hash table frequently outperforms other approaches, particularly when probe skew is introduced. Using the framework implemented by [29] as a base, Balkesen et al. [15] make the case for fine-tuning radix hash join to the hardware. The authors extensively analyze the performance of Radix and non-partitioning in-memory hash joins, using workloads adapted from [81] and [29]. Their results provide valuable insight on the role CPU architecture plays in hash join performance. Our work also builds on the framework from [29], but instead focuses on addressing dataset skew on hardware-oblivious hash joins. Interestingly, the authors predict that hardware advancements will eventually eliminate the need for fine-tuning in the future.

In [16], Barber et al. present two new hash tables for in-memory hash joins. The authors focus on improving hash join memory efficiency. They note that their approach cannot handle M:N joins. Shanbhag et al. [152] describe a hash join implementation that uses hardware transactional memory and takes advantage of spatial locality in the data. They also note the importance of evaluating both ordered and shuffled datasets.

3.2.2 Hash Join Configurations

We employ the same hash join algorithms proposed by Blanas et al. in [29]. These consist of one non-partitioning hash join and three partitioning hash join variants. The following is a short description for each hash join variant along with the shortened names used in our charts.

1. No partitioning join (Nopart): all threads create a shared hash table from the build relation. This hash table is then probed concurrently.

2. Shared partitioning join (Part-Share): both relations are divided into partitions shared by all threads. Locks are used to facilitate concurrent access.

3. Independent partitioning join (Part-Indep): all threads participate in partitioning both relations, but the partitions are private and cannot be accessed by other threads. Consequently, locks are not needed.

4. Radix partitioning join (Part-Radix): parallel Radix Join with dynamic load balancing, as described by Kim et al. [81]. Each input relation is split into a hierarchy of partitions using their least significant bits (LSB). The resulting partitions are then joined using a hash table.
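As a small illustration of the least-significant-bit partitioning step that the radix join relies on, the sketch below performs a single, sequential partitioning pass. The real algorithm described by Kim et al. [81] partitions hierarchically, in parallel, and with cache- and TLB-conscious pass sizes; this is only a simplified example with hypothetical types.

```cpp
#include <cstdint>
#include <vector>

struct Tuple { std::uint64_t key; std::uint64_t payload; };

// Scatter tuples into 2^radix_bits partitions using the least significant bits of the key.
std::vector<std::vector<Tuple>> radix_partition(const std::vector<Tuple>& input,
                                                unsigned radix_bits) {
    const std::uint64_t mask = (1ull << radix_bits) - 1;
    std::vector<std::vector<Tuple>> partitions(1ull << radix_bits);
    for (const Tuple& t : input)
        partitions[t.key & mask].push_back(t);   // partition id = low bits of the join key
    return partitions;
}
```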

23 3.2.3 Hash Tables

There are many variants of hash tables, but not all hash tables are suitable for hash joins. Cuckoo hashing was first proposed by Pagh et al. [129]. It guarantees constant lookup, update, and deletion times, but insertion times are amortized. Kirsch et al. [83] use a stash to store elements that could not be inserted normally. In [87], Kumar et al. employ a hierarchy of hash tables to provide deterministic insertions for cuckoo hashing. In [102], researchers from Intel Labs present a concurrent cuckoo hashing technique which utilizes hardware transactional memory (HTM). The authors leverage this hardware feature to implement atomic modification of shared data structures. In their experiments they show that this approach outperforms Google's dense hash, Intel TBB, and the C++ unordered map. These hash tables are not designed with hash joins in mind, because the "insert" operation typically overwrites existing values. In a hash join, this may result in incomplete results.

3.3 The Impact of Data Skew

To support hash joins, the hash table implementation must be modified to support duplicate keys. In related work by Blanas et al. [29], the authors used a hash table implementation based on separate chaining, and presented an in-depth analysis on the impact of data skew on hash join performance. However, this analysis was limited to data skew on the probe table. The authors concluded that partitioning hash joins are less resilient to dataset skew. In order to examine this problem further, we developed an extensive set of datasets, and evaluated these datasets with the code provided by [29]. We discovered that joins with highly skewed build relations were significantly more time consuming than joins with similar probe skew. Separate chaining hash tables are used for a wide variety of applications due to their

simplicity and flexibility. The main concept behind separate chaining is to store items in an array of buckets, using a hash function (often a fast and simple solution such as modulo) to determine the bucket number, and resolve collisions by chaining items together. Separate chaining is often used for hash join applications because insert operations are very fast (we do not need to traverse the whole chain to add a new item), and the insert operations do not fail as long as there is available memory. If the input data is not skewed (as is the case in [29]), non-partitioned hash joins with separate chaining generally perform well. However, duplicate keys in the build table result in long chains of items in the hash table, regardless of the hash function used. This can significantly hinder performance. We first observed this behavior when testing the code from [29] with skewed datasets. Initially, we solve this problem by modifying the hash table data structure and data operations.

3.4 Separate Chaining with Value-Vectors (SCVV)

As we discussed in the previous section, dataset skew can pose a significant obstacle for hash join performance. We initially address this issue by modifying the separate chaining hash table from [29] to store/read multiple values per key in contiguous value vectors. This simple solution is very effective against dataset skew and can also be extended to other data structures. Figure 3.2 depicts an example comparing our modified separate chaining hash table to the original version. Consider a chained hash table with 10 buckets and a modulo hash function. During the hash join probe phase, we query key 7. We have to traverse all five elements in the chain to look for matching keys. In the modified version with value vectors, we find a match on the first lookup, and we find all other values by traversing the value vector. In this example, finding all the matches for key 7 will take 2 reads, versus 5 reads in the original version. This issue does not

Figure 3.2: Example: Comparison of a bucket chain with and without value vectors (modulo hash function).
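To make the layout concrete, the following is a minimal single-threaded sketch of separate chaining with value vectors. It is illustrative only: the names (Node, SCVVTable) are ours rather than taken from the benchmark code, and the concurrency and sizing details of the actual implementation are omitted.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Sketch: one chain node per distinct key; duplicate keys append to the node's
    // value vector instead of growing the chain.
    struct Node {
        uint64_t key;
        std::vector<uint64_t> values;  // all tuples that share this key
        Node* next;
    };

    struct SCVVTable {
        std::vector<Node*> buckets;
        explicit SCVVTable(std::size_t nbuckets) : buckets(nbuckets, nullptr) {}

        void insert(uint64_t key, uint64_t value) {
            Node*& head = buckets[key % buckets.size()];      // simple modulo hash
            for (Node* n = head; n != nullptr; n = n->next)
                if (n->key == key) { n->values.push_back(value); return; }
            Node* n = new Node;                               // first time we see this key
            n->key = key;
            n->values.push_back(value);
            n->next = head;
            head = n;
        }

        // Probe: stop at the first matching node; its vector holds every matching tuple.
        const std::vector<uint64_t>* probe(uint64_t key) const {
            for (Node* n = buckets[key % buckets.size()]; n != nullptr; n = n->next)
                if (n->key == key) return &n->values;
            return nullptr;
        }
    };

The important property is visible in probe: the chain traversal stops at the first node with a matching key, and all remaining matches are read from a single contiguous vector.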

3.4.1 Lookup and materialization costs

We do an analysis on separate chaining lookup costs to provide some intuition on how SCVV can improve performance. Let's assume K keys have been inserted into the hash table. The worst case scenario occurs when all the keys end up in the same bucket. In this situation, we have to perform K lookups for K keys in the build table for every key that exists in the probe table. In order to simplify this analysis, we focus on what happens while probing the keys in a bucket chain. In the case of regular separate chaining, we have to check every single key in the bucket, because we cannot guarantee that another matching key does not exist in the bucket. With a separate chaining hash table (SC), if a bucket has K keys, the number of key lookups needed until we find a match is C_SC = K. In a modified separate chaining hash table with value vectors (SCVV), we only need to traverse the chain until we find a match. For a bucket with K chained unique keys, the average number of lookups is half the number of lookups required by basic separate chaining: C_SCVV(unique) = K/2. However, when the keys contain duplicates due to dataset skew, the maximum chain length is determined by the number of unique keys that hash to a bucket. If we assume that D is the average degree of key duplication, the lookup cost is K/(2D), and by factoring in the average cost of reading the value vector once we have found a match, we get the cost of materialization: C_SCVV(skewed) = K/(2D) + D. In conclusion, value vectors reduce the number of key lookups needed, and once we find a matching key, we can stop traversing the chain, and output all the resulting tuples. Henceforth, we use SCVV as our new baseline.
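As an illustrative calculation (the numbers are chosen purely for exposition and are not taken from our experiments): for a bucket chain holding K = 16 keys with an average duplication degree of D = 4, basic separate chaining inspects C_SC = 16 keys, whereas SCVV needs on average C_SCVV(skewed) = 16/(2*4) + 4 = 6 reads to locate the matching node and materialize its value vector.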

3.4.2 Evaluation and Overhead

We compare the original implementation by Blanas et al. [29] with our modified version using value vectors. In order to provide an “apples-to-apples” comparison, we only modify the hash table used, and leave the rest of the code unchanged for this experiment. A performance comparison of hash joins using SCVV and the original hash table implementation (SC) with skewed datasets is shown in Figure 3.1b. Our approach is several orders of magnitude faster than the original code when the dataset is skewed (up to 5726× faster). As discussed in Section 3.4.1, SCVV produces shorter chains compared to SC. This significantly lowers the number of lookups for both the build and probe phases. We have demonstrated that SCVV is faster than SC on skewed datasets, but what about datasets that are not skewed at all? The main advantage of using a value vector is to store and retrieve multiple values per key. When there are no repeating keys in the dataset, that advantage is lost, and the size of the hash function modulus becomes an important factor. Furthermore, despite the specialized nature of the value vector, it bears additional overhead compared to a much simpler value variable. SCVV incurs a performance penalty of 16% when the build table keys are unique and fully ordered (a sequence of the numbers 1 to N). As a result, we would still choose SC in such rare cases. We have designed a new baseline implementation that can handle skewed datasets. Our next goal is to tackle shuffled (non-ordered) datasets.

3.5 Maple Hash Table (MH)

As part of our research on hash tables for hash joins, we explored the possibility of using cuckoo hashing for this purpose. As we mentioned in Section 3.2, cuckoo hashing was originally proposed by Pagh et al. [129]. Its core concept is to store items in one of two tables, each with a corresponding hash function (this can be extended to N tables). If a bucket is occupied by another item, the existing item is displaced and reinserted into the other table. This process continues until all items stabilize, or an arbitrary threshold is crossed. Cuckoo hashing provides a guarantee that reads take no more than two lookups. Its main drawbacks are relatively slower and less predictable insert operations, a lack of concurrency, and the possibility of failed insertions. Consequently, several variants of cuckoo hashing have been proposed to resolve these drawbacks. A concurrent hash table based on cuckoo hashing was proposed by Intel in [102] (henceforth called "Intel Libcuckoo"). It supports high-performance concurrent operations with multiple readers and writers. By leveraging the hardware transactional memory (HTM) introduced with Intel's Haswell architecture, as well as several carefully engineered optimizations, it achieves the best insert performance amongst all hash tables. However, the availability of special hardware features such as HTM cannot be taken for granted. Furthermore, the performance of Intel Libcuckoo suffers significantly when the distribution of the dataset is skewed. In their experimental evaluation of Intel Libcuckoo [102], the authors used a uniform distribution of keys. As we know, real world datasets may be non-uniform or skewed [71]. We introduce the Maple hash table, a novel extension to the bucketized cuckoo hash table approach that is designed for concurrency and insert efficiency. In our approach, we use a fast multi-stage hashing technique to determine an incoming item's location, effective load-balancing to decrease collisions and lock contention, a semi-optimistic locking strategy that acquires fewer locks without incurring race conditions, and optimizations to improve concurrency and cache usage.

Figure 3.3: Maple Hash Table Data Structure.

Figure 3.4: Example of finding a location for a key using multi-stage hashing to determine the table, bucket, and array index without probing all the elements in the bucket.

In the next section (Section 3.5.1) we describe the Maple hash table data structure. We then elaborate on the hash functions in Section 3.5.2. Finally, we describe how concurrency control is implemented in Section 3.5.3.

3.5.1 Data Structure

Figure 3.3 depicts a general overview of our cuckoo hash table data structure. The hash table uses 4-way set-associative buckets, which are implemented as contiguous arrays of key-value pairs. Other configurations are possible, but we found 4-way to provide a suitable balance of performance and memory overhead, while also ensuring cache alignment. As we show in Section 3.5.2, an element can be found without reading through all the elements in the entire bucket. The combination of the table, bucket, and index numbers gives us the precise location of a key-value pair. For each key, the number of possible locations is limited by the number of tables within the hash table. That means that unlike other methods that combine cuckoo hashing with linear probing, such as [102], we can retrieve a key-value pair by looking up a maximum of two slots as opposed to 2N (where N is the number of slots per bucket). This feature further reduces the cost of lookups in buckets that are larger than the cache line, because parts of the bucket that cannot contain the key are skipped. When the table is first created, all keys and values are initialized to zero. We reserve zero to indicate that a key-value pair is empty or deleted.
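As a rough illustration of this layout, the two tables can be declared as contiguous arrays of 4-way buckets. This is a sketch under our own naming assumptions (KVP, Bucket, and so on are illustrative), not the actual implementation.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Sketch: a two-table, 4-way set-associative cuckoo layout.
    // Zero is reserved to mean "empty or deleted", as described above.
    constexpr int kSlotsPerBucket = 4;

    struct KVP {
        uint64_t key = 0;     // 0 = empty or deleted
        uint64_t value = 0;
    };

    struct Bucket {
        KVP slots[kSlotsPerBucket];   // contiguous and cache-line friendly
    };

    struct TwoTableLayout {
        std::vector<Bucket> table0;
        std::vector<Bucket> table1;
        explicit TwoTableLayout(std::size_t buckets_per_table)
            : table0(buckets_per_table), table1(buckets_per_table) {}
    };

Storing the slots of a bucket contiguously is what makes the "skip the parts of the bucket that cannot contain the key" optimization cheap: the candidate slot is a fixed offset inside one array element.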

3.5.2 Hash Functions

A good hash function strikes a balance between computational complexity and its effectiveness at reducing collisions. As noted in [144] and [4], Murmurhash [7] is widely used due to its speed and acceptable hash value quality in most situations. Our own experiments with a wide variety of hash functions confirm that Murmurhash is a suitable choice. Murmurhash provides the avalanche effect [183], meaning that at least half of the input key's bits are flipped in the output. This results in hash values that are less predictable and better distributed. In addition to this collision property, Murmurhash is also fast, as it mainly relies on multiply and rotate (hence the name Murmur) operations. However, in some test cases (skewed datasets) its performance was not satisfactory. Due to this, we implemented a multi-stage hashing approach that is effective and computationally cheap. Figure 3.4 depicts an example of how each index is computed. An incoming key is passed to the hash function (Murmurhash2 [7] with a seed) to produce the hash value. To determine the bucket number, we calculate the modulo of the hash value to the table size. Since the table size is a power of two, we calculate this much faster by using the well-known technique of replacing the modulo with a bitwise AND operation. A function called indexgen is used to determine the index within that bucket. In practice, this approach works well when the indexgen function can be computed much faster than the hash function. We use multiplicative hashing to implement indexgen.

Algorithm 2: Maple Hash Table Get Algorithm
Input: the key provided to search the hash table
Output: return the associated value if the key is found, otherwise return NULL

    tableno ← key bitwise & 1;
    hashval ← hashfunc0(key);
    bucketno ← hashval & (tablesize - 1);
    index ← indexgen(hashval);
    if key exists in table0[bucketno][index].key then
        return table0[bucketno][index].value;
    else
        // Repeat the same steps with hashfunc1 for table 1
    return NULL;  // key was not found
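The address computation can be sketched as follows. The mixing function below is only a stand-in for Murmurhash2 with a seed, and the multiplicative constant in the indexgen step is a placeholder; both are assumptions made to keep the sketch self-contained rather than values from the actual implementation.

    #include <cstdint>

    // Stand-in for Murmurhash2 with a seed: a 64-bit finalizer-style mixer,
    // used here only to keep the sketch self-contained.
    uint64_t hash_with_seed(uint64_t key, uint64_t seed) {
        uint64_t h = key ^ seed;
        h ^= h >> 33; h *= 0xff51afd7ed558ccdULL;
        h ^= h >> 33; h *= 0xc4ceb9fe1a85ec53ULL;
        h ^= h >> 33;
        return h;
    }

    struct Location {
        unsigned table;    // 0 or 1
        uint64_t bucket;   // bucket number within the chosen table
        unsigned index;    // slot index within the 4-way bucket
    };

    // Multi-stage location: table selector, bucket number, then slot index.
    // table_size must be a power of two so the modulo becomes a bitwise AND.
    Location locate(uint64_t key, uint64_t table_size, uint64_t seed) {
        Location loc;
        loc.table  = static_cast<unsigned>(key & 1);            // least significant bit
        uint64_t h = hash_with_seed(key, seed);
        loc.bucket = h & (table_size - 1);                      // hash mod table_size
        // indexgen: cheap multiplicative hashing on the already computed hash value,
        // so the slot index costs far less than a second full hash.
        loc.index  = static_cast<unsigned>((h * 0x9E3779B97F4A7C15ULL) >> 62);  // 0..3
        return loc;
    }

For the second table, the same steps would be repeated with the second hash function (hashfunc1).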

3.5.3 Concurrent Implementation

Cuckoo hashing is relatively easy to implement in serial applications. However, the original design by Pagh et al. [129] did not include an efficient parallel design; the authors only proposed calculating the two hash values in parallel. This approach does not scale well on modern processors, which generally have more than four cores. Our implementation leverages data parallelism to scale up on multicore processors. As noted in [63], there is no stable hierarchy between locks, and variables such as workload, available resources, and hardware specifications need to be taken into account. For concurrency control, we opt to use light-weight spinlocks at a per-bucket granularity. To ensure valid and consistent results, all threads must finish building the hash table before it can be probed for matching keys. This allows us to employ a locking strategy that is less stringent, as concurrent read and write operations are not required. Our semi-optimistic locking strategy reduces lock acquisition and contention. A listing of our insert algorithm is depicted in Algorithm 1.

Algorithm 1: Maple Hash Table Insert Algorithm
Input: newKVP is the key-value pair being inserted
Output: return true if the insertion is successful, otherwise return false (rehash)

    if contains(newKVP) then
        append(newKVP);  // lock and append value to value vector
    else
        stepcounter ← 0;
        maxsteps ← log(tablesize);
        tableno ← newKVP.key bitwise & 1;
        while stepcounter < maxsteps do
            switch tableno do
                case 0 do
                    hashval ← hashfunc0(newKVP.key);
                    bucketno ← hashval & (tablesize - 1);
                    index ← indexgen(hashval);
                    lock lockarray[bucketno];
                    if table0[bucketno][index] is empty then
                        insert newKVP in table0;
                        unlock lockarray[bucketno];
                        return true;
                    else
                        if key exists then
                            append(newKVP);
                            unlock lockarray[bucketno];
                            return true;
                        evict existing KVP as oldKVP and insert newKVP;
                        unlock lockarray[bucketno];
                        set newKVP ← oldKVP;
                        tableno ← 1;
                case 1 do
                    // Repeat case 0 steps with hashfunc1 for table 1
            increment stepcounter by 1;
    return false;

If a key already exists in the hash table, the value of the incoming key-value pair is appended to the value vector instead. Our preliminary experiments revealed that having all threads begin inserts from one table could result in a less even key distribution and higher lock contention, due to always starting insertions from the same pool of cells. The table selector selects the initial table for inserting a new item, based on the key's least significant bit. We use the least significant bit because it is fast and effective. In a worst case scenario (all keys are odd or all keys are even), the algorithm behaves like most cuckoo hashing approaches that start insertions from the same table. It is possible to further randomize this approach, but we found it to be adequate. Rather than locking every bucket along the cuckoo path, we only lock the bucket that will be modified. The insert algorithm checks if the key already exists after locking the corresponding bucket. This is due to the possibility of one or more threads processing items with the same key. This protects against a race condition in which multiple threads inserting the same key could evict the item instead of appending to the value vector. Successfully inserting a new item into the hash table will store the item in only one of two possible locations. This ensures consistent read performance when probing the table.
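A per-bucket spinlock of the kind described above can be sketched with std::atomic_flag; this is a generic illustration of the locking granularity, not the thesis implementation.

    #include <atomic>

    // Sketch: a lightweight spinlock, one instance per bucket. A thread locks only
    // lockarray[bucketno] before modifying that bucket, re-checks whether the key
    // already exists, and then inserts, appends to the value vector, or evicts.
    class SpinLock {
        std::atomic_flag flag;
    public:
        SpinLock() { flag.clear(); }
        void lock()   { while (flag.test_and_set(std::memory_order_acquire)) { /* spin */ } }
        void unlock() { flag.clear(std::memory_order_release); }
    };

    // Usage sketch (hypothetical names):
    //   std::vector<SpinLock> lockarray(num_buckets);
    //   lockarray[bucketno].lock();
    //   ... modify the bucket ...
    //   lockarray[bucketno].unlock();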

3.5.4 Performance Evaluation

We use a microbenchmark to compare the performance of Maple hash table against the state-of-the-art concurrent hash table, Intel Libcuckoo. The microbenchmark, which is based on open-source code from Intel, begins by generating a series of unique key-value pairs which are then shuffled using a seeded uniform random number generator. The insert throughput is calculated by measuring the time it takes for eight threads to finish inserting all key-value pairs into the hash table. Once the table has been populated, the read throughput is evaluated using eight threads to perform lookup operations on the data. The datasets consist of 16M, 33M and 67M random records generated from a uniform distribution between 1 and N, where N is the dataset size.
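The overall structure of such a microbenchmark can be sketched as follows. This is a simplified stand-in rather than the Intel benchmark code itself: the table type, key count, thread count, and all names are illustrative, and the table is assumed to expose an insert(key, value) operation.

    #include <algorithm>
    #include <chrono>
    #include <cstddef>
    #include <cstdint>
    #include <numeric>
    #include <random>
    #include <thread>
    #include <vector>

    // Sketch: generate unique keys, shuffle them with a seeded RNG, then measure how
    // long a fixed number of threads take to insert them into a concurrent table.
    template <typename Table>
    double insert_throughput_mops(Table& table, std::size_t nkeys,
                                  unsigned nthreads, uint64_t seed) {
        std::vector<uint64_t> keys(nkeys);
        std::iota(keys.begin(), keys.end(), uint64_t{1});        // unique keys 1..N
        std::shuffle(keys.begin(), keys.end(), std::mt19937_64(seed));

        auto start = std::chrono::steady_clock::now();
        std::vector<std::thread> workers;
        std::size_t chunk = nkeys / nthreads;
        for (unsigned t = 0; t < nthreads; ++t) {
            std::size_t begin = t * chunk;
            std::size_t end = (t + 1 == nthreads) ? nkeys : begin + chunk;
            workers.emplace_back([&table, &keys, begin, end] {
                for (std::size_t i = begin; i < end; ++i)
                    table.insert(keys[i], keys[i]);              // assumed interface
            });
        }
        for (auto& w : workers) w.join();
        std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
        return nkeys / elapsed.count() / 1e6;                    // million inserts per second
    }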

Figure 3.5: Comparing Maple hash table and Intel Libcuckoo - uniform random dataset - Skylake - 8 threads. (a) Insert throughput, (b) Read throughput.

Figures 3.5a and 3.5b depict multi-threaded insert and read throughput respectively. The results show that Maple hash table is faster than Intel Libcuckoo by up to 17% in both insert and read throughput.

3.6 Evaluation

We now evaluate the effectiveness of the hash join approaches with our hash tables. Our experiments build on the same code base as [29], which was also adopted by [14]. Our contributions include implementing and integrating the SCVV and MH hash tables into the benchmark and investigating their performance on a broad range of datasets. It was shown in previous work that hardware architecture can play an intricate role in hash join performance [15]. Inspired by this work, we evaluate our experiments on three different hardware architectures. For convenience, we summarize the hash join configurations in Table 3.2. We present the platform specifications of each machine in the next section (Section 3.6.1), and describe the characteristics of our synthetic datasets in Section 3.6.2.

Table 3.1: Experimental setup specifications

CPU Model                                Cores/HT   Cache            RAM          TLB (4KB pages)
1× Intel Core i7 6700HQ (Skylake)        8/16       1MB L2, 6MB L3   16GB DDR4    L1 DTLB: 64 entries; L2 STLB: 1536 entries
2× Intel Xeon E5472 (Harpertown)         8/8        12MB L2          16GB DDR2    L1 DTLB: 16 entries; L2 DTLB: 256 entries
8× AMD Opteron 8220 (K8 or Santa Rosa)   16/16      2MB L2           128GB DDR2   L1 TLB: 32 entries; L2 TLB: 512 entries

3.6.1 Platform Specifications

We evaluate the experiments on three different processor architectures: Intel Skylake, Intel Harpertown (based on the Penryn architecture), and AMD K8 (Santa Rosa architecture). Our intention is to ensure that the results are reproducible and not tuned for a particular set of hardware. The Harpertown and AMD machines run Ubuntu 14.04 LTS, and the Skylake machine uses Ubuntu 16.04 LTS. To ensure consistency in code compilation, we configure all machines to use version 6.3.0 of the G++ compiler and enable the -O3 optimization flag. The -O3 flag instructs the compiler to use the highest level of optimizations, and is widely used throughout the industry as well as the related literature [15, 144, 151, 175]. The first machine contains an Intel Skylake quad core processor with hyper-threading and 16GB of RAM. The CPU contains 256KB of L2 cache per core, and 6MB of L3 cache that is shared by all cores. The second machine is based on a pair of Intel Harpertown (Penryn architecture) quad core CPUs and contains 16GB of RAM. Each quad core CPU consists of two pairs of cores that share 6MB of L2 cache, for a total of 12MB. Lastly, we test an AMD machine with eight K8 CPUs with 2MB of shared cache each, and a total of 128GB of RAM. We did not observe swap space usage on any of the machines. The hardware specifications are summarized in Table 3.1.

Table 3.2: Hash Join Configuration Key

Short Form         Hash Join Variant          Hash Table
Nopart SCVV        No partitioning            SCVV
Nopart MH          No partitioning            Maple
Part-Share SCVV    Shared partitions          SCVV
Part-Indep SCVV    Independent partitions     SCVV
Part-Radix SCVV    Radix partitioning join    SCVV

3.6.2 Datasets

In [29], the authors use dataset cardinalities of 16M and 256M tuples to mimic common decision support settings (1:16). Subsequent works by other authors have adopted this workload [14, 15, 152]. We adopt this cardinality as the baseline for our datasets. We extensively expand the range of datasets to include distributions not covered by prior work, noting that even popular benchmarks such as TPC-H are not representative of all real-world workloads [37, 77]. In most databases, the primary key is automatically generated in ascending order. However, there is no guarantee that the join keys will be ordered and/or uniformly distributed. Although some recent works examine dataset skew on the probe relation, to our knowledge none of the literature explores such an extensive set of datasets on main-memory hash joins. Another aspect to consider is the way the keys are ordered. Prior work by Shanbhag et al. [152] highlights the importance of studying dataset shuffling. Table 3.3 provides details on how the datasets are designed. All build tables contain 16M records, and the probe table size depends on the correlation. For each dataset, we also generate a shuffled version that is reordered using a uniform random function. The datasets are otherwise denoted as ordered. It is worth noting that if the build keys are drawn from a skewed distribution (Zipf or Gaussian), they will not be in sorted order. However, on ordered datasets the correlating keys in the probe table remain in the order in which they were generated.

Table 3.3: Dataset details

Dataset Name        Build Key Generation           Probe Key Correlation                          Parameters
Sequential 1:N      sequential unique keys         keys have exactly N matches in probe table     N=16, key range 1 to 16M
Random near 1:N     randomly repeating sequence    keys have exactly N matches in probe table     N=16, uniform random between 1 and 5
Gaussian 1:N        sequential unique keys         relationship follows a Gaussian distribution   mean=0.015, stdev=0.3, multi=10
Gaussian M:N        Gaussian skewed keys           relationship follows a Gaussian distribution   mean=0.015, stdev=0.6, multi=10,000
Gaussian near M:K   Gaussian skewed keys           random K matches where 1 ≤ K ≤ N               N=16, mean=0.015, ...
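As an illustration of how such a dataset pair can be produced, the following sketch generates the Sequential 1:N build and probe tables and optionally shuffles them with a seeded uniform random generator. The function name and structure are ours; the actual generators cover the full set of distributions in Table 3.3.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <random>
    #include <vector>

    // Sketch: Sequential 1:N dataset. Build keys are unique and sequential; each build
    // key appears exactly N times in the probe table. The shuffled variant reorders
    // both tables with a seeded uniform random generator.
    struct Dataset {
        std::vector<uint64_t> build;
        std::vector<uint64_t> probe;
    };

    Dataset make_sequential_1_to_n(std::size_t build_size, std::size_t n,
                                   bool shuffled, uint64_t seed) {
        Dataset d;
        d.build.resize(build_size);
        for (std::size_t i = 0; i < build_size; ++i) d.build[i] = i + 1;   // keys 1..build_size
        d.probe.reserve(build_size * n);
        for (std::size_t i = 0; i < build_size; ++i)
            for (std::size_t j = 0; j < n; ++j) d.probe.push_back(i + 1);  // exactly N matches
        if (shuffled) {
            std::mt19937_64 gen(seed);
            std::shuffle(d.build.begin(), d.build.end(), gen);
            std::shuffle(d.probe.begin(), d.probe.end(), gen);
        }
        return d;
    }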

3.6.3 Results and Discussion

We conduct comprehensive experiments to analyze how dataset skew and data shuffling affect hash join performance. These datasets are designed to focus on the effects that we are interested in studying. To this end, we process each of the datasets with the hash join configurations mentioned in Section 3.2.2. The join operation concludes when the resulting tuples have been written to memory. As our main focus is in-memory performance, we omit the final step of writing the result to the disk. This is consistent in all the experiments to ensure "apples-to-apples" comparisons. We present our performance evaluations in Sections 3.6.4 to 3.6.7. In each section, we focus on one parameter (such as build skew), and keep all other parameters constant. Throughout the results we refer to the non-partitioning join using the Maple hash table as MH, and Separate Chaining with Value Vectors as SCVV. Finally, in Section 3.6.7, we examine the effects of different CPU architectures on the experiments.

We measure the CPU cycles for each hash join phase, using the timers from [29]. These timers provide high precision measurements reported in CPU cycles, allowing results from different datasets to be accurately compared. We report the average of five runs, and note that the variation between runs is less than 2%.

3.6.4 Build Table Skew

We start off by investigating how build skew affects performance. We evaluate four different variations of build table skew, while maintaining the 1:16 probe cardinality used in prior work [29]. In Figure 3.6 we present the average runtimes for each ordered dataset, and in Figure 3.7 we present the results for the shuffled versions of these datasets. The charts are arranged in order of increasing build skew, from left to right. We now examine the behavior of each dataset in this category.

3.6.4.1 Sequential 1:N dataset

A basic scenario is to consider a build table with unique sequential keys. We note that the non-partitioning configuration with SCVV is the fastest when the dataset is ordered, and that MH outperforms all other configurations when the dataset is shuffled. This is a trend that we frequently observe throughout our experiments. We take a closer look at the effect of shuffling in Section 3.6.6. Overall, the non-partitioning hash join configurations provide the fastest runtimes.

3.6.4.2 Random near 1:N dataset

This dataset contains mildly skewed build keys. This is achieved by randomly repeating sequential keys between 1 and N times until 16M records are created. This is the only example in all our experiments where Radix hash join outperforms all other configurations.

Figure 3.6: Hash Join run time with Variable Build Skew on Ordered Datasets - Skylake - 8 threads. (a) Sequential - 1:N, (b) Random - near 1:N, (c) Gaussian - M:N, (d) Zipf - M:N.

When the keys are ordered, MH and SCVV offer very similar performance and are the fastest configurations, and Part-Radix is the worst configuration. When the keys are shuffled, all configurations take a performance hit, but Part-Radix ends up as the fastest configuration, followed by MH. Previous studies have noted that Radix join is prone to load imbalance [29], and shuffling may partially alleviate this.

3.6.4.3 Gaussian M:N dataset

For this dataset, the build keys approximate a half-normal Gaussian distribution with a standard deviation of 6000. The keys cover a broader range of numbers and frequencies compared to the Zipf skewed keys.

Figure 3.7: Hash Join run time with Variable Build Skew on Shuffled Datasets - Skylake - 8 threads. (a) Sequential - 1:N, (b) Random - near 1:N, (c) Gaussian - M:N, (d) Zipf - M:N.

The data locality in the probe table is lost when the keys are shuffled. The results show that SCVV provides the fastest runtimes, followed by MH. The ordered results follow the same trend as the sequential 1:N dataset. When the dataset is shuffled, all configurations take longer to complete the join. The Part-Indep configuration suffers the greatest slowdown during the partitioning phase. To understand why this happens, consider that Part-Indep creates per-thread partitions. This eliminates the overhead of synchronization, but also results in a much larger total number of partitions that may no longer fit in the cache and TLB. These private partitions are then merged after the partition phase. These characteristics delay the availability of data.

The shuffled dataset further increases the number of partitions created per thread, which magnifies the problem.

3.6.4.4 Zipf M:N dataset

In order to study a case of extreme data skew on the build table, we use a Zipf distribution with a high skew parameter of s= 2.00. The runtime results show the Part-Share configuration performing worst by a considerable margin. If we look at the join breakdown by phase, we see that the partition phase is the main bottleneck. The Part-Share configuration creates partitions that are shared among all threads and protects concurrent inserts using a latch. This particular Zipf distribution con- tains keys that cover a relatively narrow range of values. Consequently, most keys belong to the same partition. This results in high lock contention and serialization of the partitioning phase, as all threads will try to insert into the same partition. The same trend is observed after shuffling the data because the keys will be inserted into the same shared partition regardless of how they are distributed. Other configura- tions such as MH and SCVV offer finer lock granularity or avoid locks altogether in the case of Part-Indep. We can conclude from this that the Part-Share configuration is unsuitable when the majority of build keys occupy a narrow range of numbers.

3.6.5 Probe Table Skew

We now examine probe table skew. In these experiments, we keep the build skew constant, and vary the probe skew. We examine a total of four different probe skew distributions, that are divided into two categories in order to keep the build skew constant when comparing the results. We generate the build table using non-skewed keys, and vary the correlation with the probe table to either be exactly 1:N, Zipf skewed, or Gaussian skewed. The results are shown in Figure 3.8, with each dataset paired next to its shuffled variant.

Figure 3.8: Hash Join Runtime with Variable Probe Skew on 1:N Datasets - Skylake - 8 threads. (a) Sequential - Ordered, (b) Sequential - Shuffled, (c) Zipf - Ordered, (d) Zipf - Shuffled, (e) Gaussian - Ordered, (f) Gaussian - Shuffled.

The results continue earlier trends, with SCVV leading on ordered datasets and MH providing the fastest times on shuffled datasets. The sequential 1:N dataset produces exactly N = 16 keys in the probe table for every key in the build table. The Zipf and Gaussian 1:N datasets pick the number of matches out of their respective skewed distributions, and are capped at 16 matches. As a result, they generate fewer total probe keys. Prior works have shown that non-partitioning hash join variants have an advantage over partitioning hash join variants when there is skew on the probe table [29, 15]. We also observe this in our results, but only when the datasets are ordered.

We also explore datasets that are generated with a near M:K relationship. In these datasets, the number of correlating matches for each build key is K, which is a number randomly chosen between 1 and N. The random K is chosen using a uniformly random distribution. Using a near M:K distribution results in a smaller probe table, and thus relatively faster join times compared to M:N. We omit the charts for this category, as there are no noteworthy changes to the relative performance of the hash join configurations.

3.6.6 Effect of Dataset Shuffling on Hash Tables

Every dataset has an ordered version and a shuffled version which is shuffled using a uniformly random distribution. It is observed throughout the results that all configurations take a performance hit going from an ordered dataset to a shuffled dataset. This trend applies even when the build keys are randomly drawn from a skewed distribution, and repeated in the probe table N times. This occurs because repeated keys in the probe table would no longer be probed in the order that they were generated. As a result, we are significantly less likely to find the data in the cache or TLB. There is also a clear trend of MH performing better when the data is shuffled, and SCVV taking the lead when it is ordered. This trend is of particular interest to us, as it presents an opportunity for a query planner to choose a hash table based on whether or not the data is shuffled. In order to gain better insight, we measure the TLB and cache misses. In Figure 3.9a we compare runtimes and in Figure 3.9b, we show the corresponding d-TLB misses. When the dataset is ordered, the SCVV data structure is also accessed in a sequential manner. As a result, entries that are evicted from the dTLB will not be accessed again during the course of the current phase. Due to the fact that the size of the dataset far exceeds the number of entries that the TLB can hold (64 4KB pages), the chance of incurring a TLB miss for every lookup is greatly increased on a shuffled dataset.

Figure 3.9: Performance impact of shuffling on MH and SCVV - Zipf 1:N dataset - Skylake - 8 threads. (a) Runtime, (b) Data-TLB Misses.

SCVV greatly benefits from ordered data, as it uses a modulo hash function, and the buckets are accessed in the same order as the keys. This benefits lookup times due to increased memory locality and fewer cache misses. When the keys are shuffled, the threads access the hash table in a random and unpredictable fashion. As a result, SCVV suffers from increased TLB and cache misses, and potentially more lock contention. Shanbhag et al. reported that shuffled data increased transaction abort rate as more cache lines were shared among the threads. Their experimental results show a big performance gap between sorted and shuffled datasets [152], which we also observe in our results. MH provides more consistent results as is indicated by the fact that the build times are relatively similar for shuffled and ordered datasets. As MH is based on Cuckoo hashing (which uses two different hash functions), its access pattern is not sequential. MH does not benefit from build table locality, but it guarantees that it will find a key in either one or two lookups.

Figure 3.10: Total join time on variable CPU architectures - 1:N Datasets - 8 threads. (a) Sequential - Ordered, (b) Sequential - Shuffled.

3.6.7 CPU Architecture

Figure 3.10 depicts the total hash join runtimes we measured on three different hardware platforms. On all three hardware platforms, we run the experiments using identical source code and datasets. That is, we do not fine-tune the algorithm to any particular architecture. Our results indicate similar trends among the different architectures. Although we observe some minor variations, the fastest and slowest configurations are consistent. Despite large differences in CPU cache size (shown in Table 3.1), non-partitioning hash joins provide the fastest performance. A query planner with some knowledge of the data would be able to choose the fastest join variant. The Part-Radix configuration performs relatively well on the shuffled dataset, but gradually loses ground going from Skylake to Harpertown and then AMD. As noted in [15] Radix hash join benefits from fine-tuning the parameters to the hardware. This trend demonstrates that without parameter tuning, Radix hash join does not perform consistently. Choosing the wrong join configuration can even result in modern hardware perform- ing worse compared to old hardware. In Figure 3.10b, the Part-Indep configuration performs similarly on Skylake and Harpertown. As noted in Section 3.6.4.3, Part- Indep may create an excessive number of private partitions. The combination of poor

load balancing and unordered data access results in a performance penalty that is so severe that it undermines eight years of architectural improvements. Shuffled datasets have the potential to cause cache and TLB thrashing. They have a greater impact on the partitioned hash join variants, as the non-partitioning variants provide faster and more consistent performance on all tested CPU architectures.

3.7 Chapter Summary

Hash joins are among the key techniques that enable efficient data access. However, their performance can be hindered when data is skewed and/or shuffled. We explored this issue on a set of 16 datasets that we have developed. We highlighted the importance of using a broad variety of synthetic datasets to mimic real-world applications. To our knowledge, no previous work has generated such a variety of datasets and analyzed their performance impact on hash joins. The key contributions of this chapter are:

• We show how dataset skew can severely hinder hash join performance if it is not mitigated.

• We design a series of datasets that extends the variety of data distributions and relationships that are evaluated.

• We implement a modified version of the hash table used by Blanas et al. [29] that significantly improves join performance with skewed datasets.

• We present joins using Maple hash table, a novel hashing technique based on cuckoo hashing that further improves performance on shuffled workloads.

We proposed modifications to the separate chaining based hash table (used in [29]) to deal with skewed data, and incorporated this into the hash join implementations used in their benchmark. We showed how the choice of hash table can improve

performance with skewed datasets by more than three orders of magnitude when using our modified hash table compared to the prior implementation. With extensive experiments, we presented a performance evaluation of five hash join configurations. In order to further improve performance, we introduced a novel hash table called Maple hash table. We have elaborated on how our hash table guarantees constant lookup cost, and described the many mechanisms we use to improve its insertion performance. We have shown that Maple hash table can further improve performance on shuffled datasets, and demonstrated speed-ups of up to 17.3×. Our study reinforces the case for more research in the area of hash joins on skewed or shuffled datasets, and the need for improved query optimizers that can choose a performant join configuration based on the data characteristics.

Chapter 4

Main Memory Aggregation Processing

Aggregation is an essential and ubiquitous data operation used in many database and query processing systems to summarize data. It is considered to be the most expensive operation after joins, and is an essential component in analytical queries. Following our research on in-memory joins in Chapter 3, in this chapter, we conduct a comprehensive analysis of in-memory aggregation. Our analysis covers baseline and state-of-the-art algorithms and data structures from the programming and research communities, as well as a custom implementation that we developed from scratch. Our study revolves around six analysis dimensions which represent the independent variables in our experiments: (1) algorithm and data structure, (2) query and aggregate function, (3) key distribution and skew, (4) group-by cardinality, (5) dataset size and memory usage, and (6) concurrency and multi-threaded scaling. We conduct extensive evaluations with the goal of identifying the trade-offs of each algorithm and offering insights to practitioners. The remainder of this chapter is organized as follows: we discuss our motivation in Section 4.1. In Section 4.2 we categorize and explore the related work. We describe the queries in Section 4.3.

Note: Parts of this chapter were previously published in [113].

We elaborate on the algorithms and data structures in Section 4.4. In Section 4.5 we specify the dataset characteristics. We present and discuss our experimental setup and evaluation results in Section 4.6, and summarize our findings in Section 4.7. Finally, we conclude the chapter in Section 4.8.

4.1 Motivation

Many prior studies on in-memory aggregation limited the scope of their research to a narrow set of algorithms, datasets, and queries. For example, many studies do not evaluate holistic aggregate functions [120, 185, 32]. The datasets used in most studies are based on a methodology proposed by Gray et al. [61]. These datasets do not evaluate the impact of data shuffling, or enforce deterministic group-by cardinality where possible. Some studies only evaluate a proposed algorithm against a naive implementation, rather than comparing it with other state-of-the-art implementations [174, 73]. Other studies have focused on secondary aspects, such as optimizing query planning for aggregations [184], distributed and parallel algorithms [32, 185, 153, 189, 73], and iceberg queries [174]. Additionally, some data structures have been proposed for in-memory query processing or as drop-in replacements for other popular data structures, but have not been extensively studied in the context of aggregation workloads [95, 17, 102]. Real-world applications cover a much more diverse set of scenarios, and understanding them requires a broader and more fundamental view. Due to these limitations, it is difficult to gauge the usefulness of these studies in other scenarios. Different combinations of methodologies and evaluation parameters can produce very different conclusions. Applying the results from an isolated study to a general case may result in poor performance. For example, methods and optimizations for distributive aggregation are not necessarily ideal for holistic workloads. Our goal is to conduct a comprehensive study on the fundamental algorithms and data structures used for aggregation.

Figure 4.1: An overview of the analysis dimensions

This chapter examines six fundamental dimensions that affect main memory aggregation. These dimensions represent well-known parameters which can be used as independent variables in our experiments. Figure 4.1 depicts an overview of the analysis dimensions. Figure 5.12 depicts a decision flow chart that summarizes our observations.

Dimension 1: Algorithm and Data Structure. In recent years, there have been many studies on main-memory data structures, such as tree-based indexes and hash tables. Many of these data structures can be used for in-memory aggregation. Aggregation algorithms can be categorized by the data structure used to store the data. Based on this we divide the algorithms into three main categories: sort-based, hash-based, and tree-based algorithms. We propose a framework that aims to cover many of the scenarios that could be encountered in real workloads. Over the course of this chapter, we evaluate and discuss implementations from all three categories.

Dimension 2: Query and Aggregate Function. An aggregation query is primarily defined by its aggregate function. These functions are typically organized into three categories: distributive, algebraic, and holistic [60].

Distributive aggregate functions, such as Count, can be independently computed in a distributed manner. Algebraic aggregates are constructed by combining several distributive aggregates. For example, the Average function is a combination of Sum and Count. Holistic aggregate functions, such as Median, cannot be distributed in the same way as the two previous categories because they are sensitive to the sort order of the data. Aggregation queries are also categorized based on whether their output is a single value (scalar), or a series of rows (vector). We evaluate a set of queries that cover both distributive and holistic, and vector and scalar categories.

Dimension 3: Key Distribution and Skew. The skew and distribution of the data can have a major impact on algorithm performance. Popular relational database benchmarks, such as TPC-H [35], generally focus on querying data that is non-skewed and uniformly distributed. However, it has been shown that these cases are not necessarily representative of real-world applications [37, 77]. Recently, researchers have proposed a skewed variant of the TPC-H benchmark [30]. The sizes of cities and the length and frequency of words can be modeled with Zipfian distributions, and measurement errors often follow Gaussian distributions [61]. Furthermore, skewed keys can be encountered as a result of joins [21] and composite queries. Our datasets are based on the specifications defined by [61] with a few additions. We cover the impact of both skew and ordering.

Dimension 4: Group-by Cardinality. Group-by cardinality is related to skew in the sense that both dimensions affect the number of duplicate keys. However, the group-by cardinality of a dataset directly determines the size (number of groups) of the aggregation result set. Prior studies have indicated that group-by cardinality has a major impact on the relative performance of different aggregation methods [120, 81, 13, 73]. These studies claim that hashing performs faster than sorting when the group-by cardinality is low relative to the dataset size, and that this performance advantage is reversed when the cardinality is high. We find that the accuracy of this claim depends on the implementation.

We evaluate the performance impact of group-by cardinality, as well as its relationship with CPU cache and TLB misses.

Dimension 5: Dataset Size and Memory Usage. Recent advances in computer hardware have encouraged the use of main-memory database systems. These systems often focus on analytical queries, where aggregation is a key operation. Although memory has become cheaper and denser, this is offset by the increasing demands of the industry. Our goal is to shed some light on the trade-off between memory efficiency and performance.

Dimension 6: Concurrency and Multi-threaded Scaling. Nowadays, query processing systems are expected to support intraquery parallelism in addition to interquery parallelism. Concurrency imposes additional challenges, such as reducing synchronization overhead, eliminating race conditions, and multi-threaded scaling. We explore the viability and scalability of several multi-threaded implementations.

4.2 Related Work

There have been a broad range of studies on the topic of aggregation. With the growing popularity of in-memory analytics in recent years, memory-based algorithms have gained a lot of attention. We explore some of the work that is thematically close to our research. Some studies have proposed novel index structures for database operations. Notably, recent studies have looked into replacing comparison trees with radix trees. In [95] Leis et al. proposed an adaptive radix tree (ART) designed for in-memory query processing. The authors evaluated their data structure with the TPC-C benchmark, which does not focus on analytical queries or aggregation. Based on a similar concept Binna et al. propose HOT [27] (Height Optimized Trie). The core concepts behind this approach are reducing the height of the tree on sparsely distributed keys, and the

use of AVX2 SIMD instructions for intra-cycle parallelism. The authors demonstrate that HOT significantly outperforms other indexes, such as ART [95], STX B+tree [25], and Masstree [108], on insert/read workloads with string keys. However, for integer keys, ART maintains a notable performance advantage in insert performance, and is competitive in read performance. The duality of hashing and sorting for database operations is a topic that continues to generate interest. The preferred approach has changed many times as hardware, algorithms, and data have evolved over the years. Early database systems relied heavily on sort-based algorithms. As memory capacity increased, hash-based algorithms started to gain traction [59]. In 2009 Kim et al. [81] compared cache-optimized sorting and hashing algorithms and concluded that hashing is still superior. In [148] Satish et al. compared several sorting algorithms on CPUs and GPUs. Their experiments found that Radix Sort is faster on small keys, and Merge Sort with SIMD optimizations is faster on large keys. They predicted that Merge Sort would become the preferred sorting method in future database systems due to the large SIMD widths of future architectures. At present, the growth of SIMD register width in commodity hardware has stagnated at 256 bits for AMD CPUs [150] and 512 bits for Intel CPUs [178]. Müller et al. proposed an approach to hashing and sorting with an algorithm that can switch between them in real time [120]. The authors modeled their algorithm based on their observation that hashing performs better on datasets with low cardinality, but sorting is faster when there is high data cardinality. This holds true with some basic hashing or sorting algorithms, but there are algorithms for which this model does not apply. Their approach adjusts to the cardinality and skew of the dataset by setting a threshold on the number of groups found in each cache-sized chunk of the data. This approach cannot be used for holistic aggregation queries, as the data is divided into chunks.

Balkesen et al. [13] compared the performance of highly optimized sort-merge and radix-hashing algorithms for joins. Their implementations leveraged the extra SIMD width and processing cores found in modern processors. They found that hashing outperformed sorting, although the gap got much smaller for very large datasets. The authors predicted that sorting may eventually outperform hashing in the future, if the SIMD registers and data key sizes continue to expand. Iceberg queries combine joining, aggregation, and filtering of the results based on a threshold on the aggregate value. The name iceberg refers to the fact that the final number of groups is generally much smaller than the initial groups, similar to how the tip of an iceberg is considerably smaller than the whole. In [174] Walenz et al. propose several optimization techniques for iceberg queries. The authors extended nested loop join to support iceberg queries, but did not explore other (arguably better) join algorithms such as hash join or sort merge join. Parallel aggregation algorithms focus on determining efficient concurrent designs for shared data structures. A key question in parallel aggregation is whether threads should be allowed to work independently, or to work on a shared data structure. Cieslewicz et al. [32] present a framework to select a parallel strategy based on a sample from the dataset. Surprisingly, the authors claim that sort-based aggregation can only be faster than hash-based aggregation if the input is presorted. We found that in the context of single-threaded algorithms, sort-based aggregation is quite competitive with hash-based aggregation. In [185], the authors examined several previously proposed parallel algorithms, and proposed a new algorithm called PLAT based on the concept of partitioning and a combination of local and global hash tables. Most of these algorithms do not support holistic aggregation, because they split the data into multiple hash tables in order to reduce contention. Furthermore, none of the algorithms are ideal for scalar aggregation as they do not guarantee lexicographical ordering of the keys.

Table 4.1: Aggregation Queries

Query  SQL Representation (example)                       Aggregate Function           Aggregate Category  Output Format
Q1     SELECT product_id, COUNT(*) FROM sales             Count                        Distributive        Vector
       GROUP BY product_id
Q2     SELECT student_id, AVG(grade) FROM grades          Average                      Algebraic           Vector
       GROUP BY student_id
Q3     SELECT product_id, MEDIAN(amount) FROM products    Median                       Holistic            Vector
       GROUP BY product_id
Q4     SELECT COUNT(sale_id) FROM sales                   Count                        Distributive        Scalar
Q5     SELECT AVG(grade) FROM grades                      Average                      Algebraic           Scalar
Q6     SELECT MEDIAN(part_id) FROM parts                  Median                       Holistic            Scalar
Q7     SELECT product_id, COUNT(*) FROM sales             Count with Range Condition   Distributive        Vector
       WHERE product_id BETWEEN 500 AND 1000
       GROUP BY product_id

4.3 Queries

In this section, we describe the queries used for our experiments. In Table 4.1 we describe each query along with a simple example. Our goal is to evaluate and compare different aggregation variants. There are three main categories of aggregate functions: distributive, algebraic, and holistic. Distributive functions, such as Count and Sum, can be processed in a distributed manner. This means that the input can be split and processed in multiple partitions, and then the intermediate results can be combined to produce the final result. Algebraic functions consist of two or more Distributive functions. For instance, Average can be broken down into two distributive functions: Count and Sum. Holistic aggregate functions cannot be decomposed into multiple distributive functions, and require all of the input data to be processed together. For example, if an input is split into two partitions and the Median is calculated for each partition, it is not possible to accurately determine the Median for the entire dataset solely based on the Median values for each partition.
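As a concrete illustration (the values are chosen purely for exposition): if one partition holds {1, 2, 9} with a Median of 2, and another holds {3, 8, 10} with a Median of 8, the Median of the combined data {1, 2, 3, 8, 9, 10} is 5.5, which cannot be derived from the two partition Medians alone.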

Other examples of Holistic functions include Rank, Median, and Quantile. The output of an aggregation can be either in Vector or Scalar format. In Vector aggregates, a row is returned in the output for each unique key in the designated column(s). These columns are commonly specified using the group-by or having keywords. The output value is returned as a new column next to the group-by column. Scalar aggregates process all the input rows and produce a single scalar value as the result. Sometimes it is desirable to filter the aggregation output based on user defined thresholds or ranges. We study an example of a range search combined with a vector aggregate function in Q7. In a real-world environment, it may be possible to push the range conditions to an earlier point in the query plan, but if several different range scans are desired, early filtering may not be possible. The main purpose of this query is to evaluate each data structure's efficiency at performing a range search in addition to the aggregation.

4.4 Data Structures and Algorithms

In this section, we introduce the data structures and algorithms that we use to implement aggregate queries. We divide these algorithms into three categories: sort- based, hash-based, and tree-based. In order to facilitate reproducibility, we have selected open-source data structures and sort algorithms where possible. We also consider several state-of-the-art data structures, such as ART[95], HOT[27], and Libcuckoo[102]. Since the performance of algorithms can shift with hardware ar- chitectures, we also consider some of the more fundamental algorithms and data structures, such as a B+Tree [25]. Throughout this section we will state theoretical time complexities using n as the number of elements and k as the number of bits per key. The implementation of an aggregate operator can be broken down into two main

phases: the build phase and the iterate phase. Consider this example using a hash table and a vector aggregate function (refer to Q1 in Table 4.1). During the build phase, each key (created from the group-by attribute or attributes) is looked up in the hash table. If it does not exist, it is inserted with a starting value of one. Otherwise, the value for the existing key is incremented. Once the build phase is complete, the iterate phase reads the key-value pairs from the hash table and writes the resulting items to the output. A similar procedure is used for tree data structures. The calculation of the aggregate value during the build phase (early aggregation) is only possible when the aggregate function is distributive or algebraic. As a result, holistic aggregate values cannot be calculated until all records have been inserted. Sort-based approaches “build” a sorted array using the group-by attributes. As a result, all the values for each group are placed in consecutive locations. The aggregate values are calculated by iterating through the groups.
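For instance, a count-style vector aggregation in the spirit of Q1 can be sketched with an ordinary hash map; this illustrates the two phases rather than any specific implementation evaluated in this chapter.

    #include <cstdint>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    // Sketch of the two phases for a COUNT(*) ... GROUP BY query (cf. Q1).
    std::vector<std::pair<uint64_t, uint64_t>>
    count_group_by(const std::vector<uint64_t>& group_keys) {
        // Build phase: early aggregation is possible because Count is distributive.
        std::unordered_map<uint64_t, uint64_t> counts;
        counts.reserve(group_keys.size());
        for (uint64_t key : group_keys) ++counts[key];   // insert with one, or increment

        // Iterate phase: read the key-value pairs and emit one output row per group.
        std::vector<std::pair<uint64_t, uint64_t>> result;
        result.reserve(counts.size());
        for (const auto& kv : counts) result.emplace_back(kv.first, kv.second);
        return result;
    }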

4.4.1 Sort-based Aggregation Algorithms

Sorting algorithms are a crucial building block in any query processing system. Many popular database systems, such as Microsoft SQL and Oracle, employ both sort- based and hash-based algorithms. We examine several algorithms designed for sort- ing arrays of fixed length integers, although some of the approaches could be adapted to variable length strings.

4.4.1.1 Quicksort

Quicksort is a sorting method based on the concept of divide and conquer that was invented by Tony Hoare [69] and remains very popular to this day. The average time complexity of Quicksort is O(n log(n)). The worst case time complexity is considerably worse at O(n^2), but this is rare, and is mitigated in modern implementations [74, 134].

57 4.4.1.2 Introsort

Introspective sort (Introsort) is a hybrid sorting algorithm that was proposed by David Musser [121]. Introsort can be regarded as an algorithm that builds on Quicksort and improves its worst case performance. This sorting algorithm starts by sorting the dataset with Quicksort. When the recursion depth passes a certain threshold, the algorithm switches to Heapsort. This threshold is defined as the logarithm of the number of elements being sorted. This algorithm guarantees a worst case time complexity of O(n log(n)). The GCC variant of Introsort [74] differs from the original algorithm in two ways. Firstly, the recursion depth limit is set to 2 * log(n). Secondly, the algorithm switches to Insertion Sort for small partitions, which is faster on small data chunks even though it has a time complexity of O(n^2).

4.4.1.3 Radix Sort (MSB and LSB)

Radix sorting works by sorting the data one bit (binary digit) at a time. There are two variants of Radix Sort, based on the order in which the bits are processed: Most Significant Bit (MSB) Radix Sort, and Least Significant Bit (LSB) Radix Sort. As the names suggest, MSB sorts the data starting from the top (leftmost) bits, and works its way down. In comparison, LSB starts from the bottom bits. The time complexity of Radix sort is O(k ∗ n) where k is the key width (number of bits in the key), and n is the number of elements.
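As an illustration, the following is a sketch of an LSB radix sort on 32-bit unsigned keys. It processes eight bits per pass rather than one bit at a time, which is the byte-at-a-time variant commonly used in practice; it is not the specific implementation benchmarked below.

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Sketch: LSB radix sort, one byte (8 bits) per pass from least to most significant.
    void lsb_radix_sort(std::vector<uint32_t>& a) {
        std::vector<uint32_t> buf(a.size());
        for (int shift = 0; shift < 32; shift += 8) {
            std::array<std::size_t, 257> count{};                   // counting sort on one byte
            for (uint32_t x : a) ++count[((x >> shift) & 0xFF) + 1];
            for (int i = 0; i < 256; ++i) count[i + 1] += count[i]; // prefix sums -> offsets
            for (uint32_t x : a) buf[count[(x >> shift) & 0xFF]++] = x;
            a.swap(buf);                                            // stable pass completed
        }
    }

Each pass is a stable counting sort on one byte, which is what allows the passes to be applied from the least significant digits upward.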

4.4.1.4 Spreadsort

Spreadsort is a hybrid sorting algorithm that combines the best traits of Radix and comparison-based sorting. This algorithm was invented by Steven J. Ross in 2002 [146]. Spreadsort uses MSB Radix partitioning until the size of the partitions reaches a predefined threshold, at which point it switches to comparison-based sorting using Introsort.

Figure 4.2: Sort Algorithm Microbenchmark - time to sort 10M keys (ms) for the Random (1-5), Random (1-1M), Random (1k-1M), Pre-sorted Sequential, and Reversed Sequential distributions.

Comparison-based sorting is more efficient on small sequences of data compared to Radix partitioning. The time complexity of the MSB Radix phase is O(n log(k/s + s)), where k is the key width, and s is the maximum number of splits (default is 11 for 32 bit integers). As mentioned, the time complexity of Introsort is O(n log(n)).

4.4.1.5 Sorting Microbenchmarks

In order to obtain a basic understanding of the performance of these algorithms and how they compare, we evaluate five algorithms on a variety of datasets. The tested algorithms are: Quicksort, Introsort, MSB Radix Sort, LSB Radix Sort, and Spreadsort. We test each algorithm on five data distributions: random integers between one and five, random integers between one and one million, random integers between one thousand and one million, presorted sequential integers, and reverse sorted sequential integers. We measure the time to sort ten million integers from each distribution. The results, depicted in Figure 4.2, show that Introsort and Spreadsort generally outperform the other sorting algorithms.

4.4.2 Hash-based Aggregation Algorithms

Hash tables are particularly efficient in workloads that require fast random lookups, which they perform in constant time. A hash function transforms a key into an

address within the table. However, hash tables do not generally guarantee any ordering of the keys (lexicographical or chronological). It is possible to pre-sort the data and construct a hash function that guarantees ordered keys (minimal perfect hashing [51, 105]). However, this would defeat the purpose of constructing a hash table for aggregation, as the impact on query execution time would be quite severe. Hash tables are not well-suited to gradual dynamic growth, as growing the table may entail rehashing all existing elements as well. In principle, a hash table’s size could be tuned to anticipate the dataset group-by cardinality. However, in practice it is difficult to estimate the cardinality, particularly when there are several group-by columns. Overestimating the cardinality results in excessive memory usage, while underestimating it leads to costly rehash operations. In our experiments we assume that only the size of the dataset is known, hence we set the initial size of the hash tables accordingly.

Hash tables can be categorized based on their collision resolution scheme. Collision resolution defines how a hash table resolves conflicts caused by multiple keys hashing to the same location. We now describe four collision resolution schemes and the implementations that use them: linear probing, quadratic probing, separate chaining, and cuckoo hashing.

4.4.2.1 Linear probing

Linear probing is part of the family of collision resolution techniques called open addressing. Open addressing hash tables typically store all the items in one contiguous array. They do not use pointers to link data items. Linear probing specifies the method used to search the hash table. An insertion begins from the hash index and probes forward in increments of one until the first empty bucket is found. Linear probing hash tables do not need to allocate extra memory to store new items as long as the table has empty slots. However, they may encounter an issue called primary clustering, where colliding records form long sequences of occupied slots. These sequences displace incoming keys, and grow each time they do so, resulting in a high number of record displacements.

We implement a custom linear probing hash table using several industry best practices, such as maintaining a power-of-two table size. If the desired size is not a power of two, then the nearest greater power of two is chosen. This is a popular optimization that allows the table modulo operation to be replaced with a much faster bitwise AND. The downside to this policy is that it is easier to overshoot the available memory. To resolve this, our implementation falls back to the modulo operation: the table size is set to the nearest prime number if possible, and the exact size parameter is used as the final fallback.
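A minimal sketch of a linear probing insert that uses a power-of-two capacity, so the modulo becomes a bitwise AND, is shown below. The structure, hash constant, and field names are illustrative and simplified (no resizing, no prime fallback); this is not our exact implementation.

```cpp
#include <cstdint>
#include <vector>

// Illustrative open-addressing table with linear probing and a power-of-two
// capacity, so (hash & mask) replaces the slower (hash % capacity).
struct LinearProbingTable {
    struct Slot { uint64_t key; uint64_t value; bool occupied; };
    std::vector<Slot> slots;
    uint64_t mask;

    explicit LinearProbingTable(size_t min_capacity) {
        size_t cap = 1;
        while (cap < min_capacity) cap <<= 1;    // round up to a power of two
        slots.assign(cap, Slot{0, 0, false});
        mask = cap - 1;
    }

    // Insert key, or add to the value of an existing key (early aggregation).
    // Assumes the table never becomes completely full (sized to the dataset).
    void upsert_add(uint64_t key, uint64_t value) {
        uint64_t h = key * 0x9E3779B97F4A7C15ULL;           // multiplicative hash (illustrative)
        for (uint64_t i = h & mask; ; i = (i + 1) & mask) { // probe forward in increments of one
            if (!slots[i].occupied) { slots[i] = Slot{key, value, true}; return; }
            if (slots[i].key == key) { slots[i].value += value; return; }
        }
    }
};
```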

4.4.2.2 Quadratic probing

Quadratic probing is an open addressing scheme that is very similar to linear probing. Like linear probing, it calculates a hash index and searches the table until a match is found. Rather than probing in increments of one, a quadratic function is used to determine each successive probe index. For example, with an arbitrary hash function h(x) and quadratic function f(x) = x², the algorithm probes h(x), h(x)+1, h(x)+4, h(x)+9 instead of the linear probe sequence h(x), h(x)+1, h(x)+2, h(x)+3. This approach greatly reduces the likelihood of clustering, but it does so at the cost of reducing data locality. Google Sparse Hash and Dense Hash [155] are based on open addressing with quadratic probing. Sparse Hash favors memory efficiency over speed, whereas Dense Hash targets faster speed at the expense of higher memory usage.

4.4.2.3 Separate chaining

Separate chaining is a way of resolving collisions by chaining key-value pairs to each other with pointers. Buckets with colliding items resemble a singly linked list. The main advantages of separate chaining include fast insert performance and relatively versatile growth. The use of pointer-linked buckets reduces data locality, which hurts lookup and update performance. However, unlike linear probing, separate chaining hash tables do not suffer from primary clustering. Separate chaining hash tables remain popular in recent works [13, 152, 15, 29]. Templated separate chaining hash tables are included as part of the Boost and standard C++ libraries. Additionally, the Intel TBB library provides versatile hash tables that support concurrent insertion and iteration.

4.4.2.4 Cuckoo Hashing

Cuckoo hashing was originally proposed by Pagh et al. [129]. Its core concept is to store items in one of two tables, each with a corresponding hash function (this can be extended to additional tables). If a bucket is occupied by another item, the existing item is displaced and reinserted into the other table. This process continues until all items stabilize, or the number of displacements exceeds an arbitrary threshold. Cuckoo hashing provides a guarantee that reads take no more than two lookups. Its main drawbacks are relatively slower and less predictable insert operations, and the possibility of failed insertions. In [102], researchers from Intel Labs presented a concurrent cuckoo hashing technique called Libcuckoo. Libcuckoo introduces improvements to the insertion algorithm by leveraging hardware transactional memory (HTM). This hardware feature allows concurrent modifications of shared data structures to be atomic. Their experimental results indicate that Libcuckoo outperforms MemC3 [46] and Intel TBB [134].
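The displacement loop can be sketched as follows, assuming two tables and a fixed displacement budget. Duplicate-key handling, the rehash-on-failure path, and the hash functions are simplified placeholders; this is only an illustration of the scheme, not Libcuckoo.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Illustrative two-table cuckoo hashing: each key has one candidate slot per
// table; occupants are displaced back and forth until everything settles or a
// displacement budget is exceeded (a real table would then rehash or grow).
struct CuckooTable {
    struct Slot { uint64_t key; uint64_t value; bool occupied; };
    std::vector<Slot> t0, t1;
    size_t size;   // number of slots per table, must be > 0

    explicit CuckooTable(size_t n)
        : t0(n, Slot{0, 0, false}), t1(n, Slot{0, 0, false}), size(n) {}

    size_t h0(uint64_t k) const { return (k * 0x9E3779B97F4A7C15ULL) % size; }
    size_t h1(uint64_t k) const { return (k * 0xC2B2AE3D27D4EB4FULL) % size; }

    // A lookup only ever checks t0[h0(k)] and t1[h1(k)], i.e. at most two probes.
    // Returns false if the insert fails after too many displacements.
    bool insert(uint64_t key, uint64_t value) {
        const int kMaxDisplacements = 32;        // arbitrary threshold (illustrative)
        Slot item{key, value, true};
        for (int i = 0; i < kMaxDisplacements; ++i) {
            std::swap(item, t0[h0(item.key)]);   // claim the slot in table 0, evicting any occupant
            if (!item.occupied) return true;
            std::swap(item, t1[h1(item.key)]);   // the evicted item moves to table 1
            if (!item.occupied) return true;
        }
        return false;                            // caller must rehash/grow and retry
    }
};
```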

4.4.3 Tree-based Aggregation Algorithms

Hash-based and sort-based aggregation approaches are very popular, mainly due to a heavy focus of past studies on “write once read once” (WORO) aggregation workloads, as opposed to “write once read many” (WORM). We consider several tree data structures, and assess their viability for aggregation. Trees are commonly used to evaluate range conditions. However, aggregation benchmarks, such as TPC-H, do not include range queries. Tree data structures are well suited to incremental dynamic growth. The trade-off is higher time complexities for both insert and lookup operations, compared to hash tables. We divide the tree data structures into comparison trees and radix trees. Comparison trees have traditionally served as indexing structures, but Radix trees are being increasingly adopted in recent main memory databases, such as HyPer [78] and Silo [170]. We explore two examples from each category: the Btree family and Ttree are comparison trees, and ART and Judy are Radix trees.

4.4.3.1 Btree

The B-tree is a popular tree data structure that was initially invented in 1971 by Bayer et al. [20], and forms the basis for many modern variants [99, 25]. A B-tree is a balanced m-way tree where m is the maximum number of children per node. The B-tree is perhaps best recognized as a popular disk-based data structure used for database indexing, although they are also used for in-memory indexes. The defining characteristics of B-trees are that they are shallow and wide due to using a high fanout. This reduces the number of node lookups as each node contains multiple data items. B-trees may also include pointers between leaf nodes to facilitate more efficient range scans. The time complexity for inserting n items into a B-tree is O(n log(n)). We use a cache-optimized implementation based on the STX B+tree [25] which we will henceforth refer to as Btree.

4.4.3.2 Ttree

Ttree (spelled “T-tree” in the literature) was originally proposed in 1986 by Lehman et al. [93]. Its intended purpose was to provide an index that could outperform and replace the disk-oriented B-tree for in-memory operations. Although the Ttree showed a lot of promise when it was first introduced, we show in Section 4.4.4 that advancements in hardware design have rendered it obsolete on modern processors.

4.4.3.3 ART

ART (Adaptive Radix Tree) [95] is a Radix tree variant with a variable fan-out. Its inventors present it as a data structure that is as fast as a hash table, with the added bonus of sorted output, range queries, and prefix queries. ART uses SIMD instructions to compare multiple keys in parallel. ART saves on memory consumption by using dynamic node sizes and merging inner nodes when possible. Radix trees have several key advantages compared to comparison trees. The height of a radix tree depends on the length of the keys, rather than the number of keys. Additionally, in contrast with comparison trees, they do not need to perform re-balancing operations. We also considered HOT [27], which builds on the same principles as ART. However, we found its performance with integer keys to be noticeably worse, as its main focus is string keys.

4.4.3.4 Judy Arrays

Judy Arrays (henceforth referred to as Judy) were developed by Doug Baskins [17], and are defined as a type of sparse dynamic array designed for sorting, counting, and searching. They are intended to replace common data structures such as hash tables and trees. Judy is implemented as a 256-way Radix tree that uses a variable fan-out and a total of 20 compression techniques to reduce memory consumption and improve cache efficiency [5]. Judy is fine-tuned to minimize cache misses on 64-byte cache lines. Like many other tree data structures, the size of a Judy array dynamically grows with the data and does not need to be pre-allocated.

Figure 4.3: Data Structure Microbenchmark (build and iterate time, in millions of CPU cycles, for each data structure)

4.4.4 Data Structure Microbenchmark

We use a microbenchmark to evaluate each data structure’s efficiency in a store and lookup workload. The dataset consists of 10 million key-value pairs from a uniform distribution. We separately measure the time it takes to build the data structure (build phase), and the time to read all the items in the data structure (iterate phase). All hash tables are sized to the number of elements. The results are depicted in Figure 4.3, using the abbreviations outlined in Table 4.3. With the exception of Hash LC, the hash tables build faster than the tree structures due to their O(1) insert complexity. Hash LC performs poorly in the build phase because it is designed as a concurrent data structure; we evaluate its concurrent scalability in Section 4.6.8. Hash LP and Hash Dense provide the fastest overall times. Btree is noticeably faster in the iterate phase, but it takes a relatively long time to build due to the cost of balancing the tree. Due to the relatively poor performance exhibited by Ttree in both phases, we opt to omit it from subsequent experiments.

Table 4.2: Data Structure Time Complexity

Data Structure       Insert (Average)    Insert (Worst)   Search (Average)   Search (Worst)
ART                  O(k)                O(k)             O(k)               O(k)
Judy                 O(k)                O(k)             O(k)               O(k)
Btree                O(log(n))           O(log(n))        O(log(n))          O(log(n))
Ttree                O(log(n))           O(log(n))        O(log(n))          O(log(n))
Separate Chaining    O(1)                O(n)             O(1)               O(n)
Linear Probing       O(1)                O(n)             O(1)               O(n)
Quadratic Probing    O(1)                O(log(n))        O(1)               O(log(n))
Cuckoo Hashing       O(1) (amortized)    O(n)             O(1)               O(1)

4.4.5 Time Complexity

It is well known that time complexities are not always the best predictors of real-world performance. This is due to a number of factors, including hidden constants and overheads arising from the implementation, hardware characteristics such as the CPU architecture, cache, and TLB, compiler optimizations, and the operating system. On modern systems, cache misses are particularly expensive. Nevertheless, time complexity is widely used as a means to understand and compare the relative performance of different algorithms. Table 4.2 provides an overview of the known time complexities for each of the data structures that we evaluate. Here n denotes the number of elements, and k the number of bits in the key.

4.5 Datasets

In order to effectively evaluate the algorithms, we generate a set of synthetic datasets that vary in terms of input size, group-by cardinality, key distribution, and key range. Our datasets are based on the highly popular input distributions described in prior works [61, 32, 73]. We employ several modifications to these datasets, with the goal of expanding the data characteristics that we evaluate. Some datasets, such as the sequential dataset, produce very predictable patterns. For such datasets, we generate an additional variant with uniform random shuffling. In [32] it is mentioned that the group-by cardinality is often probabilistic. We enforce deterministic group-by cardinality in cases where the intended dataset distribution can be maintained. The dataset distributions are summarized in Table 4.4. Throughout this chapter we use random to refer to a uniform random function with a fixed seed, and shuffling refers to the use of the aforementioned function to shuffle all the records in a dataset. The number of records in the dataset is denoted as n, and the group-by cardinality as c.

Table 4.3: Algorithms and Data Structures

Label         Type   Description
ART           Tree   Adaptive Radix Tree [95]
Judy          Tree   Judy Array [17]
Btree         Tree   STX B+Tree [25]
Hash SC       Hash   std::unordered_map [74] (Separate Chaining)
Hash LP       Hash   Linear Probing (Custom)
Hash Sparse   Hash   Google Sparse Hash [155]
Hash Dense    Hash   Google Dense Hash [155]
Hash LC       Hash   Intel libcuckoo [102]
Introsort     Sort   std::sort (Introsort) [74]
Spreadsort    Sort   Boost Spreadsort [163]

Table 4.4: Dataset Distributions

Abbreviation   Description                 Cardinality
Rseq           Repeating Sequential        Deterministic
Rseq-Shf       Rseq, Uniformly Shuffled    Deterministic
Hhit           Heavy Hitter                Deterministic
Hhit-Shf       Hhit, Uniformly Shuffled    Deterministic
Zipf           Zipfian                     Probabilistic
MovC           Moving Cluster              Probabilistic

In the repeating sequential dataset (RSeq), we generate a series of segments that contain multiple number sequences. The number of segments is equal to the group-by cardinality, and the number of records in each segment is equal to the dataset size divided by the cardinality. A shuffled variant of the repeating sequential dataset (RSeq-Shf) is also generated. This dataset mimics transactional data where the key incrementally increases.

In the heavy hitter dataset (HHit), a random key from the key range accounts for 50% of the total keys. The remaining keys are produced at least once to satisfy the group-by cardinality, and then chosen on a random basis. In a variant of this dataset, the resulting records are shuffled so that the heavy hitters are not concentrated in the first half of the dataset. Real-world examples of heavy hitters include top selling products, and network nodes with the highest traffic.

In the Zipfian dataset (Zipf), the distribution of the keys is skewed using Zipf’s law [138]. According to Zipf’s law, the frequency of each key is inversely proportional to its rank. We first generate a Zipfian sequence with the desired cardinality c and Zipf exponent e = 0.5. Then we take n random samples from this sequence to build n records. The final group-by cardinality is non-deterministic and may drift away from the target cardinality as c approaches n. The Zipf distribution is used to model many big data phenomena, such as word frequency, website traffic, and city population.

In the moving cluster dataset (MovC), the keys are chosen from a window that gradually slides. The i-th key is randomly selected from the range ⌊(c − W)i/n⌋ to ⌊(c − W)i/n + W⌋, where the window size W = 64 and the cardinality c is at least the window size (c ≥ W). The moving cluster dataset provides a gradual shift in data locality and is similar to workloads encountered in streaming or spatial applications.
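For illustration, a minimal generator for the moving cluster dataset is sketched below, following the formula above with a fixed seed. The function name and parameters are illustrative, not our exact generator.

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Moving cluster generator: the i-th key is drawn uniformly from a window of
// width W that slides upward as i grows. Assumes c >= W and n > 0.
std::vector<uint64_t> make_moving_cluster(size_t n, uint64_t c, uint64_t W = 64,
                                          uint64_t seed = 42) {
    std::mt19937_64 gen(seed);                   // fixed seed, as in our other datasets
    std::vector<uint64_t> keys(n);
    for (size_t i = 0; i < n; ++i) {
        uint64_t lo = (c - W) * i / n;           // floor((c - W) * i / n)
        std::uniform_int_distribution<uint64_t> dist(lo, lo + W);
        keys[i] = dist(gen);
    }
    return keys;
}
```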

4.6 Results and Analysis

In this section we evaluate the efficiency of the aggregation algorithms. We examine and compare the performance impact of dataset size, group-by cardinality, key skew and distribution, data structures and algorithms, and the query and aggregation functions. We also evaluate memory usage, which we measure as the peak Resident Set Size (RSS), as an indication of each algorithm’s memory efficiency. The experimental parameters are outlined in Table 4.5. For each experiment the input dataset is preloaded into main memory. Similar to Chapter 3, we use the timers from [29] to measure execution time with high precision. We report the average of five runs, noting that the observed variation is less than 2%. We also note that these measurements do not include the time to read the input data from disk.

Table 4.5: Experiment Parameters

Dataset: Repeating Sequential, Heavy Hitter, Moving Cluster, Zipfian
Dataset Size (Records): 100,000,000, 10,000,000, 1,000,000, 100,000
Group-by Cardinality: 100, 1,000, 10,000, 100,000, 1,000,000, 10,000,000
Algorithm: Hash LP, Hash SC, Hash LC, Hash Sparse, Hash Dense, ART, Judy, Btree, Introsort, Spreadsort, Hash TBBSC, Sort BI, Sort QSLB
Thread Count: 1, 2, 3, 4, 5, 6, 7, 8 (logical core count)
Query Workload: Q1 (Vector Distributive), Q3 (Vector Holistic), Q6 (Scalar Distributive), Q7 (Vector Distributive with Range)

Throughout this chapter we aim to understand how each of these dimensions can affect main memory aggregation workloads. Due to space constraints we only show the results for Q1, Q3, Q6, and Q7. We start the experiments with two common vector aggregation queries (Q1 and Q3) in Section 4.6.2. Due to the popularity of these queries, we further analyze them by evaluating cache and TLB misses in Section 4.6.3, memory usage for different dataset sizes in Section 4.6.4, and data distributions in Section 4.6.5. Additionally, we evaluate range searches (Q7) in Section 4.6.6 and scalar aggregation queries (Q6) in Section 4.6.7. We examine multi-threaded scaling in Section 4.6.8. Finally, we summarize our findings in Section 4.7. We now outline the experimental setup.

Figure 4.4: Vector Aggregation Q1 - 100M Records (query execution time in CPU cycles vs. group-by cardinality, for panels (a) Rseq, (b) Rseq-shf, (c) Hhit, (d) Hhit-shf, (e) MovC, (f) Zipf)

4.6.1 Platform Specifications

The experiments are evaluated on a machine with an Intel Core i7-6700HQ processor at 3.5GHz, 16GB of DDR4 RAM at 2133MHz, and a 512GB SSD. The CPU is a quad core based on the Skylake microarchitecture, with hyper-threading (8 logical cores), 256KB of L1 cache, 1MB of L2 cache, and 6MB of L3 cache. The TLB can hold 64 entries in the L1 data TLB, and 1536 entries in the L2 shared TLB (4KB pages). The code is compiled and run on Ubuntu Linux 16.04 LTS, using the GCC 7.2.0 compiler with the -O3 and -march=native optimization flags. These flags enable the compiler’s highest optimization level and target the native architecture of our machine. We now present and discuss the experimental results.

4.6.2 Vector Aggregation

We begin our experiments by evaluating Q1 and Q3 (see Table 4.1), which are based on commonly used aggregate functions. Due to space constraints and the similarity between Algebraic and Distributive functions, we do not show results for Q2. In these experiments, we keep the dataset size at a constant 100M records and vary the group-by cardinality from 10² to 10⁷. In each chart we measure the query execution time for a given query and dataset distribution, and the group-by cardinality increases from left to right. The results for Q1 (Vector COUNT) and Q3 (Vector MEDIAN) are shown in Figures 4.4 and 4.5 respectively.

A larger group-by cardinality means more unique keys and fewer duplicates. In tree-based algorithms the data structure dynamically grows to accommodate the group-by cardinality. This is reflected in the gradual increase in query execution time. The insert performance of ART and Judy depends on the length of the keys, which increases with cardinality. Additionally, the compression employed by ART and Judy is used more heavily at high cardinality.

The results for Q3 show that Spreadsort is the fastest algorithm across the board. The overall trend shows that hash-based algorithms, such as Hash SC and Hash LP, are competitive with Spreadsort until the group-by cardinality exceeds 10⁴. The execution times for both Spreadsort and Introsort show considerably less variance, whereas the worsening of data locality results in sharp declines in performance for the hash-based and tree-based implementations. The performance of Hash Sparse dramatically worsens at 10⁷ groups, suggesting that the combination of Hash Sparse’s gradual growth policy and the extra space needed by this query results in a much steeper decline in performance compared to Q1.

In order to understand why Hash LP outperforms all the other algorithms in Q1, we need to consider several factors. First, its average insert time complexity (as shown in Table 4.2) is unaffected by group-by cardinality. Secondly, the cache-friendly layout of Hash LP takes greater advantage of data locality than the other hash tables. Lastly, compared to Q3, Q1 does not require additional memory to store the values associated with each key. This reduces the pressure on the cache and TLB, and allows Hash LP to compete with memory efficient approaches such as Spreadsort. We further explore cache and TLB behavior in Section 4.6.3 and memory consumption in Section 4.6.4.

Figure 4.5: Vector Aggregation Q3 - 100M records (query execution time in CPU cycles vs. group-by cardinality, for panels (a) Rseq, (b) Rseq-shf, (c) Hhit, (d) Hhit-shf, (e) MovC, (f) Zipf)

4.6.3 Cache and TLB misses

Cache and TLB behavior are useful metrics of algorithm efficiency. Together with runtime and memory efficiency, they paint a picture of how different algorithms compare with each other. Processing large volumes of data in main memory often leads to many cache and TLB misses, which can hinder performance. A cache miss can sometimes be satisfied by a TLB hit, but a TLB miss incurs a page table lookup, which is considerably more expensive. Using the perf tool, we measure the CPU cache misses and data-TLB (D-TLB) misses of Q1 and Q3 with low cardinality (10³ groups) and high cardinality (10⁶ groups) datasets. The results are depicted in Figures 4.6 and 4.7.

Figure 4.6: Cache Misses - Rseq 100M Dataset ((a) cache misses - Q1, (b) cache misses - Q3, in millions, for 1k and 1M groups)

Figure 4.7: TLB Misses - Rseq 100M Dataset ((a) D-TLB misses - Q1, (b) D-TLB misses - Q3, in millions, for 1k and 1M groups)

It is interesting to compare the results in Figure 4.7a with the performance discrepancy between Hash LP and Spreadsort in Q1. At low cardinality, the number of TLB misses is relatively close between the two algorithms. However, at high cardinality, Spreadsort exhibits considerably higher TLB misses. Similarly, in Figure 4.4a we see the runtime gap between the two algorithms widen in Hash LP’s favor as the cardinality increases to 10⁷. Although this metric is not a guaranteed way to predict the relative performance of the algorithms, it is a fairly reliable measure of scalability and overall efficiency. The cache behavior of Spreadsort is consistently good. Other algorithms, such as ART, exhibit large jumps in both cache and TLB misses. This correlates with similar gaps in the query runtimes, and it is noted in [27] that ART’s efficiency degrades if the dataset distribution necessitates the creation of many tree nodes.

Table 4.6: Peak Memory Usage (MB) - Q1 on Rseq 10³ Groups

                          Dataset Size
Type  Algorithm      10⁵      10⁶      10⁷       10⁸
Tree  ART            4.45     11.61    131.61    1,027.44
Tree  Judy           4.31     11.53    131.49    1,027.45
Tree  Btree          4.54     11.79    131.66    1,027.60
Hash  Hash SC        5.45     19.41    159.07    1,540.95
Hash  Hash LP        5.23     18.67    156.29    1,529.44
Hash  Hash Sparse    4.61     11.94    131.68    1,027.58
Hash  Hash Dense     6.42     27.33    336.02    2,814.70
Hash  Hash LC        26.95    44.44    263.14    2,069.90
Sort  Introsort      4.50     11.74    131.66    1,027.59
Sort  Spreadsort     4.53     11.55    131.65    1,027.44

4.6.4 Memory Efficiency

Memory efficiency is a performance metric that is arguably of equal importance to runtime speed. In this section, we study how each method’s memory usage grows as the dataset size is increased. We measure the peak memory consumption at various dataset sizes. To do so, we lock the group-by cardinality at 10³ and vary the dataset size from 10⁵ up to 10⁸. These measurements are taken by using the Linux /usr/bin/time -v tool to acquire the peak RSS (resident set size) for each configuration. We verified these numbers by taking additional measurements using cgmemtime [149] and Valgrind [123] and found the results to be consistent.

The results for Q1 are depicted in Table 4.6 and the memory usage of Q3 is shown in Table 4.7. The results show that the hash tables consume the most memory, followed by the tree data structures. The sort algorithms are the most memory efficient because they sort the data in-place. In order to maintain good insert performance, most hash tables consume more memory than they need to store the items, and some will only resize to powers of two. Hash Dense’s memory usage is particularly high because it uses 6× the size of the entries in the hash table when performing a resize. After the resize is completed, the memory usage shrinks down to 4× the previous size. Comparing Tables 4.6 and 4.7, we see a large increase in memory usage from Q1 to Q3. This is due to the fact that Q3 requires the data structures to store the keys and all associated values in main memory, whereas Q1 only requires storage of the keys and a running count value. Consequently, holistic queries like Q3 will generally consume more memory.

Table 4.7: Peak Memory Usage (MB) - Q3 on Rseq 10³ Groups

                          Dataset Size
Type  Algorithm      10⁵      10⁶      10⁷         10⁸
Tree  ART            5.07     15.26    132.57      1,212.46
Tree  Judy           4.87     15.37    132.68      1,212.82
Tree  Btree          5.14     15.33    132.78      1,212.64
Hash  Hash SC        5.88     23.28    211.39      1,986.78
Hash  Hash LP        7.80     45.55    437.76      4,264.07
Hash  Hash Sparse    5.11     15.92    137.78      1,255.51
Hash  Hash Dense     12.74    79.16    1,156.55    9,404.48
Hash  Hash LC        30.01    68.44    686.95      5,575.09
Sort  Introsort      4.45     11.61    131.57      1,027.50
Sort  Spreadsort     4.31     11.66    131.59      1,027.48

4.6.5 Dataset Distributions

These experiments show the performance impact of the data key distribution. The results are presented in Figures 4.8a and 4.8b. In each figure, we vary the key distribution while keeping the dataset size at a constant 100 million records. To get a better understanding of how this factor ties in with cardinality, we show results for both low and high cardinality (10³ and 10⁶ groups).

Figure 4.8: Vector Q1 - Variable Key Distributions - 100M records ((a) 10³ groups (low cardinality), (b) 10⁶ groups (high cardinality))

The results point out that Zipf and Rseq-Shf are generally the most performance-sensitive datasets. The shuffled variants of Rseq and HHit take longer to run due to a loss of memory locality. By comparing the two figures we can see that this effect is amplified by group-by cardinality, as it increases the range of keys that must be looked up in the cache. In the low cardinality dataset, the number of unique keys is small compared to the dataset size. Introsort is the overall slowest algorithm at low cardinality and its performance is around the middle of the pack at high cardinality. This is in line with prior works suggesting that sort-based aggregation is faster when the group-by cardinality is high [120]. However, as we can see in the results produced by Spreadsort, the algorithm also performs well at low cardinality, which contradicts earlier claims. Due to this, it may be worth revisiting hybrid sort-hash aggregation algorithms in the future.

The results also highlight an interesting trend when it comes to shuffled/unordered data. Observe that ART’s performance in Figure 4.8b significantly worsens when going from Rseq to Rseq-Shf, or indeed any unordered distribution. The combination of high cardinality and unordered data increases pressure on the cache and TLB. If we consider how well Spreadsort performs in these situations, then the results indicate that presorting the data before invoking the ART-based aggregate operator could significantly improve performance. However, careful consideration of the algorithm and dataset is required to avoid increasing the runtime.

Figure 4.9: Range Search Aggregation Q7 - 100M records ((a) range search time at 10³ groups (low cardinality), (b) range search time at 10⁶ groups (high cardinality), (c) build time by dataset cardinality; search ranges of 25%, 50%, and 75%)

4.6.6 Range Search Query

The goal of this experiment is to evaluate algorithms that provide a native range search feature, and combine this with a typical aggregation query. Although it is possible to implement an integer range search on a hash table, this would not work for strings and other keys with non-discrete domains. Consequently, we focus on the tree-based aggregation algorithms. Q7 calculates the vector count aggregates for a range of keys. The tuples that do not satisfy the range condition could be filtered out before building the index (if the range is known in advance). Tree-based data structures are effective in this case if we assume that (a) the data has already been loaded into the data structure, and (b) this is a Write Once Read Many (WORM) workload, and multiple range searches will be satisfied by the same index. We evaluate the time it takes to perform a range search on each of the tree-based data structures for ranges that cover 25%, 50% and 75% of the group-by cardinality (the smaller ranges are run first). The results are shown in Figure 4.9. In Figure 4.9c, we see that the time to build the tree dominates the runtime. The search

times shown in Figures 4.9a and 4.9b indicate that Btree significantly outperforms the other algorithms if the tree is prebuilt. This is mostly due to the pointers that link each leaf node in Btree, resulting in one O(log(n)) lookup operation to find the lower bound of the search, and a series of pointer lookups to complete the search. At low cardinality (10³ groups), the range search time for ART is 12% lower than Judy, but it is 94% higher at high cardinality (10⁶ groups). If we factor in the build time and consider the workload WORO, then ART is the fastest algorithm.

4.6.7 Scalar Aggregation

Unlike Vector aggregate functions, which output a row for each unique key, Scalar aggregate functions return a single row. We evaluate Q6 on the tree-based and sort-based aggregation algorithms. Hash tables are unsuitable for this query because the keys need to be in lexicographical order to calculate the median. Figure 4.10 shows the query execution time for Q6 with different datasets. The overall winner of this workload is the Spreadsort algorithm. In the case of a dynamic or WORM workload, a tree-based algorithm would have two advantages: faster lookups, and considerably less computation for new inserts. A good candidate for tree-based scalar aggregation is Judy, as it outperforms Introsort on all the datasets, and comes close to the performance of Spreadsort in three out of the six datasets. Although ART wins over Judy in some cases, it is inconsistent and its worst-case performance is significantly worse, rendering it a poor candidate for this workload. This conclusion is in line with our expectations. To calculate the scalar median of a set of keys, Spreadsort is the fastest algorithm. If an index has already been built, then Judy is usually the quickest in producing the answer.
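A minimal sketch of computing the scalar median of a key column is shown below. For brevity it uses std::nth_element (a selection algorithm) instead of the full sorting approach evaluated in this section; it only illustrates the shape of the computation, not our evaluated operators.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Scalar MEDIAN over a key column. A selection algorithm suffices because only
// a single output row is produced; our Q6 operators instead sort the column
// (Spreadsort/Introsort) or walk a prebuilt tree index.
double scalar_median(std::vector<uint64_t> keys) {           // taken by value: we reorder it
    if (keys.empty()) return 0.0;
    const size_t n = keys.size();
    auto mid = keys.begin() + n / 2;
    std::nth_element(keys.begin(), mid, keys.end());          // element at index n/2 is in place
    if (n % 2 == 1) return static_cast<double>(*mid);
    auto lower = std::max_element(keys.begin(), mid);         // largest element of the lower half
    return (static_cast<double>(*lower) + static_cast<double>(*mid)) / 2.0;
}
```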

Figure 4.10: Scalar Aggregation Q6 - 100M records (query execution time in CPU cycles vs. group-by cardinality, for panels (a) Rseq, (b) Rseq-shf, (c) Hhit, (d) Hhit-shf, (e) MovC, (f) Zipf)

4.6.8 Multi-threaded Scalability

A concurrent algorithm’s ability to provide a performance advantage over a serial implementation depends on two main factors: the problem size, and the algorithmic efficiency. Considerations pertaining to algorithmic efficiency include various overheads associated with concurrency, such as contention and synchronization. In order to implement concurrent aggregate operators, a suitable data structure must fulfill three requirements. First, it must be designed for data-level parallelism that can scale with an increasing number of threads. Secondly, it must support thread-safe insert and update operations. It is not uncommon to encounter data structures that support concurrent put and get operations, but provide no way to safely modify existing values. Lastly, it must provide a means to iterate through its contents, preferably without requiring prior knowledge of the range of values. In this section, we evaluate the performance and scalability of concurrent data structures and algorithms which fulfill all three criteria.

We considered and ultimately rejected several candidate tree data structures. HOT [27] does not support concurrent incrementing of values (needed by Q1) or multiple values per key (needed by Q3). BwTree [99, 176] is a concurrent B+Tree originally proposed by Microsoft. However, our preliminary experiments found its performance to be very poor, as limitations in its API prevent efficient update operations. These characteristics have been observed by other researchers as well [182]. The concurrent variant of ART [96] currently lacks any form of iterator, which is essential to our workloads.

Table 4.8: Concurrent Algorithms and Data Structures

Label        Type   Description
Hash TBBSC   Hash   TBB Separate Chaining (Concurrent Unordered Map [134])
Hash LC      Hash   Intel Libcuckoo [102]
Sort BI      Sort   Block Indirect Sort [163]
Sort QSLB    Sort   Quicksort with Load Balancing (GCC Parallel Sort [166])

We use a microbenchmark to select two parallel sorting algorithms from among four candidates. We vary the number of threads from one to eight, and include the two fastest single-threaded sorting algorithms for comparison. The workload consists of sorting random integers between 1 and 1M, similar to the microbenchmark presented in Section 4.4. The results are shown in Figure 4.11. Sort BI is a novel sorting algorithm, based on the concept of dividing the data into many parts, sorting them in parallel, and then merging them [163]. Sort TBB is a Quicksort variant that uses TBB task groups to create worker threads as needed (up to the number of threads specified).

Figure 4.11: Parallel Sort Algorithm Microbenchmark (sort time in ms vs. number of threads, for Introsort, Spreadsort, Sort SS, Sort TBB, Sort QSLB, and Sort BI)

Sort SS (Samplesort) [163] is a generalization of Quicksort that splits the data into multiple buckets, instead of dividing the data into two partitions using a pivot. Lastly, Sort QSLB [166] is a parallel Quicksort with load balancing. Considering the performance and scalability at 8 threads, we select the Sort BI and Sort QSLB algorithms to implement sort-based aggregate operators.

We selected four concurrent algorithms and data structures, listed in Table 4.8, all of which are actively maintained open-source projects. We introduced Hash LC in Section 4.4, and Hash TBBSC is a concurrent separate chaining hash table that is similar to Hash SC. We evaluate the multi-threaded scaling for Q1 and Q3, on both low and high dataset cardinality. The results are depicted in Figure 4.12.

Figure 4.12: Multi-threaded Scaling - Rseq 100M (query execution time in CPU cycles vs. number of threads, for panels (a) Q1 - 10³ groups, (b) Q3 - 10³ groups, (c) Q1 - 10⁶ groups, (d) Q3 - 10⁶ groups)

We observe that both hash tables are faster in Q1, and Hash TBBSC outperforms Hash LC regardless of the cardinality. Sort-based approaches take the lead in Q3. The gap between sorting and hashing increases at higher cardinalities. This echoes our previous single-threaded results. The performance of Hash TBBSC degrades significantly in Q3, because storing the values requires the use of a concurrent data structure (in this case a concurrent vector) as the value type. This is a limitation of the hash table implementation, which results in additional overhead due to synchronization and fragmentation [134]. We also considered implementing Q3 using TBB’s concurrent multimap, but the performance was significantly worse. Hash LC does not suffer from these issues, as it provides a thread-safe upsert (insert or update) function. upsert searches the hash table for the key and inserts the item if the key doesn’t exist. If the key exists, upsert can replace or mutate its corresponding value. We observe similar trends with other data distributions, which we omit here for brevity.
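As a sketch of how upsert is used for a distributive aggregate such as Q1, the snippet below assumes libcuckoo's cuckoohash_map interface; the include path and namespace vary between library versions, and the function and parameter names here are illustrative.

```cpp
#include <cstdint>
#include <vector>
#include <libcuckoo/cuckoohash_map.hh>   // header path/namespace depend on the libcuckoo version

// Concurrent COUNT aggregation (Q1-style) using upsert: if the key is absent it
// is inserted with the trailing value, otherwise the functor mutates the
// existing value under the table's internal synchronization.
void concurrent_count(const std::vector<uint64_t>& chunk,
                      cuckoohash_map<uint64_t, uint64_t>& table) {
    for (uint64_t key : chunk) {
        table.upsert(key, [](uint64_t& count) { ++count; }, 1);
    }
}
```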

4.7 Summary and Discussion

Based on the insights we gained from our experiments, we present a decision flow chart in Figure 4.13 that summarizes our main observations with regards to the algorithms and data structures. We acknowledge that our experiments do not cover all possible situations and configurations, and our conclusions are based on these computational results and observations.

Figure 4.13: Decision Flow Chart (branches on aggregation output format, workload type, aggregate category, range search, and prebuilt index; leaves: Spreadsort, Judy, Hash LP, Hash TBBSC, Sort BI, Btree, ART)

We start by picking a branch depending on the output format of the aggregation query. If the query is scalar, the workload determines the best algorithm. If the query workload is “Write Once Read Once” (WORO), then the Spreadsort algorithm provides the fastest overall runtimes. If we require a reusable data structure that can satisfy multiple queries of this category, then Judy is a more suitable option.

Going back to the start node, if the aggregation query is vector, our decision is determined by the aggregate function category. Holistic aggregates (such as Q3) are considerably faster, and more memory efficient, with the sorting algorithms, particularly Spreadsort (single-threaded) and Sort BI (multi-threaded). This advantage is more noticeable at high group-by cardinality. If the query is distributive (such as Q1), then our experiments show that Hash LP (single-threaded) and Hash TBBSC (multi-threaded) are the fastest algorithms. For aggregate queries that include a range condition, we found that Btree greatly outperformed the other algorithms in terms of search times. This advantage is only relevant if we assume that the tree has been prebuilt. Otherwise, ART is the best performer in this category, due to its advantage in build times.

4.8 Chapter Summary

Aggregation is an integral aspect of big data analytics. With rising RAM capacities, in-memory aggregation is growing in importance. There are many different factors that can affect the performance of an aggregation workload. Knowing and understanding these factors is essential for making better design and implementation decisions. The key contributions of this chapter are:

• Evaluation of aggregation queries using sort-based, hash-based, and tree-based implementations

• Methodology to generate synthetic datasets that expands on prior work

• Extensive experiments that include comparison of distributive and holistic aggregate functions, vector and scalar aggregates, range searches, evaluation of memory efficiency and TLB and cache misses, and multi-threaded scaling

• Insights on performance trends and suggestions for practitioners

We presented a six-dimensional analysis of in-memory aggregation. We used microbenchmarks to assess the viability of 20 different algorithms, and implemented aggregation operators using 14 of those algorithms. Our extensive experimental framework covered a wide range of data structures and algorithms, including serial and concurrent implementations. We also varied the query workloads, datasets, and the number of threads. We gained many useful insights from these experiments. Our results show that some persisting notions about aggregation do not necessarily apply to modern hardware and algorithms, and that certain combinations perform better than conventional wisdom would suggest (see Figure 4.13). To our knowledge, this is the first performance evaluation that conducted such a comprehensive study of aggregation. We demonstrated with extensive experimental evaluation that the ideal approach in a given situation depends on the input and the workload. For instance, sorting-based approaches are faster in holistic aggregation queries, whereas hash-based approaches (our custom Hash LP implementation in particular) perform better in distributive aggregation queries.

Chapter 5

Query Processing on NUMA Systems

In Chapters 3 and 4, we delved into the algorithms, data structures, and dataset characteristics that are relevant to in-memory join and aggregation queries. In this chapter, we explore this topic in the context of Non-Uniform Memory Access (NUMA) architectures. As mentioned earlier, data analytics systems commonly utilize in-memory query processing techniques to achieve better throughput and lower latency. Modern computers increasingly rely on NUMA architectures to achieve scalability. This has implications for in-memory query performance, as NUMA architectures have a significant influence on how data is accessed. This can result in sub-optimal query performance and under-utilization of the hardware, if not correctly managed. In this chapter, we outline and evaluate an organized set of strategies that aim to accelerate memory-intensive data analytics workloads on NUMA systems. As part of this study, we evaluate the join and aggregation platforms outlined in Chapters 3 and 4, as well as several relational database systems. This chapter is organized as follows: we outline our motivation in Section 5.1, and provide some background on the problem and the workloads in Section 5.2.

Note: Parts of this chapter were previously published in [114, 115]

In Section 5.4 we discuss the strategies for improving query performance on NUMA systems. We present our setup and experimental results in Section 5.5. Finally, we discuss related work in Section 5.3 and conclude the chapter in Section 5.6.

5.1 Motivation

NUMA systems include a wide range of CPU architectures, topologies, and interconnect technologies. As such, there is no standard for what a NUMA system’s topology looks like. Due to the variety of NUMA topologies and applications, fine-tuning an algorithm to a single machine will not necessarily deliver better performance for other machines. Furthermore, achieving optimal performance on different system configurations can be costly and time-consuming. As a result, we pursue strategies to improve performance with minimal code modification.

NUMA architectures are pervasive in multi-socket and in-memory rack-scale systems, as well as a growing range of CPUs with on-chip NUMA. It is clear that NUMA is ubiquitous and is here to stay, and that software needs to evolve and keep pace with these changes. Although these advances have opened a path toward greater performance, the burden of efficiently leveraging the hardware mostly falls on developers.

In an effort to provide a general solution that speeds up applications on NUMA systems, some researchers have proposed using NUMA schedulers that co-exist with the operating system (OS). These schedulers monitor running applications in real time and attempt to improve performance by migrating threads and memory pages to address load balancing issues [28, 39, 98]. However, some of these approaches are not architecture or OS independent. For instance, Carrefour [34] requires an AMD CPU that is based on the K10 architecture, in addition to a modified OS kernel. Moreover, researchers have argued that these schedulers may not be beneficial for

87 key parameters (shown in Table 5.4) that aim to achieve this. We demonstrate that significant performance gains can be achieved by managing dynamic memory allo- cators, thread placement and scheduling, memory placement policies, indexing, and the OS configuration. In this context, the impact and role of memory allocators have been under-appreciated and overlooked by researchers. We center our investigation around five different memory-intensive query workloads (shown in Table 5.1) that prominently feature joins and aggregations, arguably two of the most popular and computationally expensive workloads used in data analytics. We selected the open- source MonetDB, PostgreSQL, MySQL, and Quickstep database systems, as well as a commercial database system DBMSx for evaluation. These systems were selected due to their significantly divergent architectures, as well as their popularity. An important finding from our research is that the default (out-of-the-box) OS envi- ronment can be surprisingly sub-optimal for high-performance query processing. For instance, the default Linux memory allocator ptmalloc can significantly lag behind other alternatives. Furthermore, with extensive experimental evaluation, we demon- strate that it is possible to systematically utilize application-agnostic (or black-box) approaches to obtain speedups on a variety of data analytics workloads. We show that a hash join workload can achieve a 3× speedup on Machine C (see machine topologies in Figure 5.1 and specifications in Table 5.3), by replacing the memory allocator. This speedup can be further improved to 20× by optimizing the memory placement policy and modifying the OS configuration. We also show that our find- ings apply to other hardware configurations, by evaluating the experiments on three machines with different hardware architectures and NUMA topologies. Lastly, we show how database system performance can be improved by systematically modifying the default OS configuration and overriding the memory allocator. For example, we demonstrate that MonetDB’s query latency in the TPC-H workload can be reduced by up to 43%.

(a) Machine A (b) Machine B (c) Machine C

Figure 5.1: Machine NUMA Topologies (machine specifications in Table 5.3)

5.2 NUMA Topologies

The topology of our machines is shown in Figure 5.1. Each NUMA system is divided into several NUMA nodes. Each node consists of one or more processors (denoted by CPU#) and their local memory resources (denoted by their memory capacity). The NUMA nodes are linked together using a network of interconnect links to form a NUMA topology. A local memory access involves data that resides on the same node, whereas accessing data on any other node is considered a remote access. Remote data travels over the interconnect, and may need to hop through one or more nodes to reach its destination. Consequently, remote memory access is slower and prone to interconnect congestion.

In addition to remote memory access, contention is another possible cause of sub-optimal performance on NUMA systems. Due to the memory wall [8], modern CPUs are capable of generating memory requests at a very high rate, which can easily saturate the interconnect or memory controller bandwidth [39]. Lastly, the abundance of hardware threads in NUMA systems presents a challenge in terms of scalability, particularly in scenarios involving many concurrent memory allocation or memory access requests. In Section 5.4, we outline strategies which can be used to mitigate these issues.
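To make the local vs. remote distinction concrete, the following minimal sketch uses libnuma to place an allocation on a specific node. It assumes libnuma is installed and the program is linked with -lnuma; the node number and buffer size are illustrative.

```cpp
#include <cstddef>
#include <cstdio>
#include <numa.h>    // link with -lnuma

// Place a buffer on a chosen NUMA node: threads running on that node access it
// locally, while threads on other nodes pay the remote-access cost over the
// interconnect.
int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "libnuma: NUMA is not available on this system\n");
        return 1;
    }
    const int node = 0;                             // illustrative node choice
    const size_t bytes = 1ull << 30;                // 1 GiB
    void* buf = numa_alloc_onnode(bytes, node);     // pages backed by node 0's local memory
    if (buf == nullptr) return 1;

    numa_run_on_node(node);                         // keep this thread on the same node (local access)
    // ... fill and process buf ...

    numa_free(buf, bytes);
    return 0;
}
```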

Table 5.1: Experiment Workloads

Workload                                    SQL Equivalent
W1) Holistic Aggregation                    SELECT groupkey, MEDIAN(val)
    (Hash-table-based) [113]                FROM records GROUP BY groupkey;
W2) Distributive Aggregation                SELECT groupkey, COUNT(val)
    (Hash-table-based) [113]                FROM records GROUP BY groupkey;
W3) Hash Join [29]                          SELECT *
W4) Index Nested Loop Join                  FROM table1
    (ART [96], Masstree [108],              INNER JOIN table2
    B+tree [26], Skip List [173])           ON table1.pk = table2.fk;
W5) TPC-H [35]                              22 analytical queries (Q1, Q2, ..., Q22)

5.2.1 Experiment Workloads

Our goal is to analyze the effects of NUMA on query processing workloads, and show effective strategies to gain speedups in these workloads. We have selected five workloads, shown in Table 5.1, to represent a variety of data operations that are common in data analytics and decision support systems. The implementation of these workloads is described in more detail in Section 5.5.2. We now provide some background on the experiment workloads. Joins and aggregations are ubiquitous, essential data processing primitives used in many different applications. When used for in-memory query processing, they are notably demanding on cache and memory. Joins and aggregations are integral components in analytical queries and are frequently used in popular database benchmarks, such as TPC-H [35]. W1 and W2 represent holistic and distributive aggregation queries respectively. We covered the characteristics of different aggregate function categories in Chapter 4. The type of aggregate function plays a large role in determining the workload's sensitivity to memory performance. We selected these workloads in order to highlight these differences and their consequent performance impact due to NUMA effects.

W3 represents a hash join query. As described in [29], the query joins two tables with a size ratio of 1:16, which is designed to mimic common decision support systems. The join is performed by building a hash table on the smaller table and probing the larger table for matching keys. W4 is an index nested loop join using the same dataset as W3. The main difference between W3 and W4 is that W3 builds an ad hoc hash table to perform the join, whereas W4 uses a pre-built in-memory index that accelerates lookups to one of the relations. W5 is a database system workload, using the well-known queries and datasets from the TPC-H benchmark [35]. We evaluate W5 on four database systems: MonetDB [117], PostgreSQL [137], MySQL [127], and DBMSx. In order to analyze query performance under memory-bound (rather than I/O-bound) situations, we configure the databases to use large buffer caches where applicable. Furthermore, we measure multiple warm runs for each query.
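To make the build/probe pattern behind W3 concrete, the following is a minimal, single-threaded C++ sketch of a non-partitioning hash join. It is illustrative only; it is not the parallel implementation from [29] that we evaluate, and the Tuple layout and function names are assumptions made for the example.

#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct Tuple { uint64_t key; uint64_t payload; };

// Build a hash table on the smaller relation, then probe it with the
// larger relation, emitting pairs of matching payloads.
std::vector<std::pair<uint64_t, uint64_t>>
hash_join(const std::vector<Tuple>& build_rel, const std::vector<Tuple>& probe_rel) {
    std::unordered_multimap<uint64_t, uint64_t> ht;
    ht.reserve(build_rel.size());
    for (const Tuple& t : build_rel)            // build phase (smaller table)
        ht.emplace(t.key, t.payload);

    std::vector<std::pair<uint64_t, uint64_t>> result;
    for (const Tuple& t : probe_rel) {          // probe phase (larger table)
        auto range = ht.equal_range(t.key);
        for (auto it = range.first; it != range.second; ++it)
            result.emplace_back(it->second, t.payload);
    }
    return result;
}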

5.3 Related Work

The rising demand for high performance parallel computing has motivated many works on leveraging NUMA architectures. We now outline existing research that is relevant to our work. In [80], Kiefer et al. evaluated the performance impact of NUMA effects on multiple independent instances of the MySQL database system. Popov et al. [135] explored the combined effect of thread and page placement using supercomputing benchmarks running on NUMA systems. They observed that co-optimizing thread and memory page placement can provide significant speedups. Their work is thematically similar to this chapter, but follows an application optimization approach using codelets extracted from two supercomputing benchmarks, and does not cover OS configuration or memory allocation. Durner et al. [42] explored the performance impact of dynamic memory allocators on a database system running TPC-DS. The authors

obtained significant speedups utilizing jemalloc and tbbmalloc, which agrees with our findings. In this chapter, we evaluate a broader and newer range of allocators, and additional NUMA parameters, indexes, datasets, databases, and workloads. Several works have pursued automatic load balancing approaches that can improve NUMA system performance in an application-agnostic manner. These approaches generally focus on improving performance by altering the process and/or memory placement. Some examples include Dino [28], Carrefour [39], AsymSched [98], Numad [143], and AutoNUMA [143]. These schedulers have been shown to improve performance in some cases, particularly on systems running multiple independent processes. However, some researchers have claimed that these schedulers do not provide much benefit for multi-threaded query processing applications [140, 151]. The effects of operating system behavior on data processing workloads have led some researchers to pursue the creation of custom-tailored operating systems for database applications [54, 57, 55, 56]. For example, Giceva et al. [57] developed a light-weight kernel for the Barrelfish [19] operating system. This modified operating system is designed to provide the minimal requirements to run a database system. The authors propose the option for task-based scheduling. Unlike threads, tasks are given dedicated access to a processor, and they will not be interrupted or preempted by the operating system. The authors demonstrated runtime improvements for three graph processing queries, when run inside a noisy multi-programming environment. Another approach is to integrate NUMA-oriented features into data structures, but leave the application of these features up to the developer. This solution is not automatic, but enables developers to adapt their application to different target systems. Psaroudakis et al. [139] propose a smart array that has several NUMA-oriented features baked into the data structure. The smart array can be configured to replicate, interleave, or relocate across the NUMA nodes. A different approach involves either extensively modifying or completely replacing

the OS. This is done with the goal of providing a custom-tailored environment for the application. Some researchers have pursued this direction with the goal of providing an OS that is more suitable for large database applications [54, 56, 57]. Custom operating systems aim to reduce the burden on developers, but their adoption has been limited. In the past, researchers in the systems community proposed a few new OSes for multicore architectures, including Corey [31], Barrelfish [18] and fos [177]. Although none were widely adopted by the industry, we believe these efforts underscore the need to investigate the impact of system and architectural aspects on query performance. Some researchers have favored an application-oriented approach that fine-tunes query processing algorithms to the hardware. Wang et al. [175] proposed an aggregation algorithm for NUMA systems, based on radix partitioning. The authors also developed a load balancing algorithm that focuses on inter-socket task stealing, and prohibits task stealing until a socket's local tasks have been completed. Leis et al. [94] presented a NUMA-aware parallel scheduling algorithm for hash joins, which uses dynamic task stealing in order to deal with dataset skew. Schuh et al. [151] conducted an in-depth comparison of thirteen main memory join algorithms on a NUMA system. Researchers have also investigated data partitioning in the context of NUMA-aware in-memory storage [84, 136]. Psaroudakis et al. [141] developed techniques for adaptive data placement and work-stealing to fix imbalance in resource utilization. Our work is orthogonal to these approaches and they can benefit from applying the application-agnostic strategies that we have suggested.

5.4 Methodology

Achieving good performance on NUMA systems involves careful consideration of thread placement, memory management, and load balancing. We explore application-

agnostic strategies that can be applied to the data analytics application in either a black box manner, or with minimal tweaks to the code. Some strategies are exclusive to NUMA systems, whereas others may also yield benefits on uniform memory access (UMA) systems. These strategies consist of: overriding the memory allocator, defining a thread placement and affinity scheme, using a memory placement policy, and changing the operating system configuration. In this section, we describe these strategies and outline the options used for each one.

5.4.1 Dynamic Memory Allocators

Dynamic memory allocators track and manage dynamic memory during the lifetime of an application. The performance impact of memory allocators is often overlooked in favor of exploring ways to tweak the application’s algorithms. It can be argued that this makes them one of the most under-appreciated system components. Both UMA and NUMA systems can benefit from faster or more efficient memory allocators. However, the potential is greater on NUMA systems, as the performance penalties caused by inefficient memory or cache behavior can be significantly higher. Key allocator attributes include allocation speed, fragmentation, and concurrency. Most developers use the default memory allocation functions to allocate or deallocate memory (malloc/new and free/delete) and trust that their library will perform these operations efficiently. In recent years, with the growing popularity of multi-threaded applications, there has been a renewed interest in memory allocators, and several alternative allocators have been proposed. Earlier iterations of malloc used a single lock resulting in serialized access to the global memory pool. Although recent malloc implementations provide support for multi-threaded scalability, there are now several competing memory allocators that aim for faster performance and reduced contention and memory consumption overhead. We evaluate the following allocators: ptmalloc, jemalloc, tcmalloc, Hoard, tbbmalloc, mcmalloc, and supermalloc.

5.4.1.1 ptmalloc

ptmalloc (pthreads malloc) is the standard memory allocator that ships with most Linux distributions as part of the GNU C Library [168] (glibc). It is based on dlmalloc [91] (Doug Lea's Malloc). This allocator aims to attain a balance between speed, portability, and space-efficiency. ptmalloc supports multi-threaded applications by employing multiple mutexes to synchronize and protect access to its data structures. The downside of this approach is the possibility of lock contention on the mutexes. In order to mitigate this issue, ptmalloc creates additional regions of memory (arenas) whenever contention is detected. A key limitation of ptmalloc's arena allocation is that memory can never move between arenas. ptmalloc employs a per-thread cache for small allocations. This helps to further reduce lock contention by skipping access to the memory arenas when possible.

5.4.1.2 jemalloc

jemalloc (Jason Evans malloc) [45] first appeared as an SMP-aware memory allocator for the FreeBSD operating system, and was later expanded and adapted for use as a general purpose memory allocator. When a thread requests memory from jemalloc for the first time, it is assigned a memory allocation arena. Arena assignments for multi-threaded applications follow a round-robin order. In order to further improve performance, this allocator also uses thread-specific caches, which allows some allocation operations to completely avoid arena synchronization. Lock-free radix trees track allocations across all arenas. jemalloc attempts to reduce memory fragmentation by packing allocations into contiguous blocks of memory and re-using the first available low address. This allocator maintains allocation arenas on a per-CPU basis and associates threads with their parent CPU's arena. We use jemalloc version 5.1.0 for our experiments.

5.4.1.3 tcmalloc

tcmalloc (thread-caching malloc) [53] was developed by Google and is included as part of the gperftools library. Its goal is to provide faster memory allocations in memory-intensive multi-threaded applications. tcmalloc divides allocations into two categories: large allocations and small allocations. Small allocations are served by private thread-local caches and do not require any locking. Large allocations use a central heap that is organized into contiguous groups of pages called "spans". Each span is designed to fit multiple allocations (regions) of a particular size class. Since all the regions in a span are of the same size, only one metadata header is maintained for each span. However, applications that use many different size classes may waste memory due to inefficient utilization of the memory spans. The central heap uses fine-grained locking on a per-span basis. As a result, two threads requesting memory from the central heap can do so concurrently, as long as their requests fall in different class categories. We use tcmalloc from gperftools release 2.7.

5.4.1.4 Hoard

Hoard [24] is a standalone cross-platform allocator replacement designed specifically for multi-threaded applications. Hoard’s main design goals are to provide memory efficiency, reduce allocation contention, and prevent false sharing. At its core, Hoard consists of a global heap (the “hoard”) that is protected by a lock and accessible by all threads, as well as per-thread heaps that are mapped to each thread using a hash function. The allocator counts the number of times a thread has acquired the global heap lock in order to decide if contention is occurring. Hoard uses heuristics to detect temporal locality and fill cache lines with objects that were allocated by the same thread, thus reducing false sharing. We evaluate Hoard version 3.13 in our experiments.

5.4.1.5 tbbmalloc

The tbbmalloc [86] allocator is included as part of the Intel Thread Building Blocks (TBB) library [82]. It is based on some of the concepts and ideas outlined in their prior work on McRT-Malloc [72]. This allocator pursues better performance and scalability for multi-threaded applications, and generally considers increased memory consumption as an acceptable trade-off. Allocations in tbbmalloc are supported by per-thread memory pools. If the allocating thread is the owner of the target memory pool, no locking is required. If the target pool belongs to a different thread then the request is placed in a synchronized linked list, and the owner of the pool will allocate the object. We used version 2019 Update 4 of the TBB library for our experiments.

5.4.1.6 supermalloc

supermalloc [88] is a malloc replacement that synchronizes concurrent memory allocation requests using hardware transactional memory (HTM) if available, and falls back to pthread mutexes if HTM is not available. It prefetches all necessary data while waiting to acquire a lock in order to minimize the amount of time spent in the critical section. supermalloc uses homogeneous chunks of objects for allocations smaller than 1MB, and supports larger objects using operating system primitives. Given a pointer to an object, its corresponding chunk is tracked using a lookup table. This lookup table is implemented as a large 512MB array, which takes advantage of the fact that most of its virtual memory will not be committed to physical memory by the OS. For our experiments, we use the latest publicly released source code, which was last updated in October 2017.

5.4.1.7 mcmalloc

mcmalloc (many-core malloc) [171] focuses on mitigating multi-threaded lock contention by reducing calls to kernel space, dynamically adjusting the memory pool

structures, and using fine-grained locking. Similar to other allocators, it uses a global and local (per-thread) memory pool layout. mcmalloc monitors allocation requests, and dynamically splits its global memory pool into two categories: frequently used memory chunk sizes, and infrequently used memory chunk sizes. Dedicated homogeneous memory pools are created to support frequently used chunk sizes. Infrequent memory chunk sizes are handled using size-segregated memory pools. mcmalloc reduces system calls by batching multiple chunk allocations together. We use the latest mcmalloc source code, which was updated in March 2018.

5.4.1.8 Overriding The Memory Allocator

In this section, we describe three methods to change an application's memory allocator. As mentioned earlier, developers typically use the default memory allocation functions to allocate or deallocate memory (malloc/new and free/delete). Alternatively, they may use other functions that wrap around or subsequently call the default functions. In such cases, the LD_PRELOAD environment variable in Linux can be used to point to a pre-compiled memory allocator library. Doing so will ensure that the memory allocator is loaded before any other library and that all calls to functions such as malloc and new will be handled by the preloaded library. This technique can be used even if the application source code is unavailable, as long as the application developers did not use non-default function calls. A second method is to create a custom library that passes calls from non-default memory allocation functions to the desired memory allocator. Similar to the first method, this method uses LD_PRELOAD to preload the compiled library. Finally, the third method is to modify the application source and carefully replace all instances of memory allocation/deallocation with the desired functions, and then link the desired memory allocator library at compile time. The first two methods can be used without access to the source code, whereas the third method is more versatile, but requires source code access, and considerably more time and effort to implement.
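As an illustration of the second method, the hedged C++ sketch below routes the global new and delete operators through malloc and free, so that whichever allocator is preloaded or linked ends up servicing them. The LD_PRELOAD path in the comment is a hypothetical install location, not one used in our experiments.

// Hypothetical usage with a preloaded allocator library:
//   LD_PRELOAD=/usr/lib/libjemalloc.so ./query_workload
#include <cstdlib>
#include <new>

void* operator new(std::size_t size) {
    if (void* p = std::malloc(size)) return p;   // forwarded to the active allocator
    throw std::bad_alloc{};
}

void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }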


Figure 5.2: Memory Allocator Microbenchmark - Machine A - (a) Multi-threaded Scalability, (b) Memory Consumption Overhead

5.4.1.9 Memory Allocator Microbenchmark

We now describe a multi-threaded microbenchmark that we use to gain insight on the relative performance of these memory allocators. Our goal is to answer the question: how well do these allocators scale up on a NUMA machine? The microbenchmark simulates a memory-intensive workload with multiple threads utilizing the allocator at the same time. Each thread completes 100 million memory operations, consisting of allocating memory and writing to it, or reading an existing item and then deallocating it. The distribution of allocation sizes is inversely proportional to the size class (smaller allocations are more frequent). We use two metrics to compare the allocators: execution time and memory allocation overhead. The execution time

gives an idea of how fast an allocator is, as well as its efficiency when being used in a NUMA system by concurrent threads. In Figure 5.2a, we vary the number of threads in order to see how each allocator behaves under contention. The results show that tcmalloc provides the fastest single-threaded performance, but immediately falls behind the competition once the number of threads is increased. Hoard and tbbmalloc show good scalability and outperform the other allocators by a considerable margin. In Figure 5.2b, we show each allocator's overhead. This is calculated by measuring the amount of memory allocated by the OS (as maximum resident set size), and dividing it by the amount of memory that was requested by the microbenchmark. This experiment shows considerably higher memory overhead for mcmalloc as the number of threads increases. Hoard and tbbmalloc are slightly more memory-hungry than the other allocators, which leaves jemalloc as the fastest memory allocator with low overhead. Based on these results, we omit supermalloc and mcmalloc from subsequent experiments, due to their poor performance in terms of scalability and memory overhead respectively.
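A simplified version of this microbenchmark can be sketched as follows. The real benchmark performs 100 million operations per thread, mixes reads of existing items with allocations, and additionally records the maximum resident set size; this sketch only illustrates the threading structure and the skew towards small size classes, and its constants are arbitrary.

#include <chrono>
#include <cstdlib>
#include <iostream>
#include <random>
#include <thread>
#include <vector>

// Each thread repeatedly allocates a block, touches it, and frees it.
// Smaller size classes are drawn more frequently than larger ones.
void worker(std::size_t ops) {
    std::mt19937 gen(std::random_device{}());
    std::discrete_distribution<int> size_class({64, 32, 16, 8, 4, 2, 1});
    for (std::size_t i = 0; i < ops; ++i) {
        std::size_t bytes = std::size_t(16) << size_class(gen);   // 16 B .. 1 KB
        char* p = static_cast<char*>(std::malloc(bytes));
        p[0] = static_cast<char>(i);                               // touch the allocation
        std::free(p);
    }
}

int main() {
    const unsigned threads = std::thread::hardware_concurrency();
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back(worker, 1000000);
    for (auto& th : pool) th.join();
    double secs = std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
    std::cout << threads << " threads: " << secs << " s\n";
}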

5.4.2 Thread Placement and Scheduling

Defining an efficient thread placement strategy is a well-known and essential step toward obtaining better performance on NUMA systems. By default, the kernel thread scheduler is free to migrate threads created by the program between all available processors. The reasons for doing so include power efficiency and balancing the heat output of different processors. This behavior is not ideal for large data analytics applications and may result in significantly reduced query throughput. The thread migrations slow down the program due to cache invalidation, as well as a likelihood of moving threads away from their local data. The combination of cache invalidation, loss of locality, and non-deterministic behavior of the OS scheduler can result in fluctuating runtimes (as shown in Figure 5.3 with 16 threads). Binding threads to processor cores can solve this issue by preventing the OS from migrating threads. However, deciding how to place the threads requires careful consideration of the topology and workload.

Figure 5.3: OS thread scheduler behavior vs thread affinity strategy - Consecutive runs of W1 - Machine A
Figure 5.4: Comparison of Sparse and Dense thread affinitization strategies - W1 - Machine A

A thread placement strategy details the manner in which threads are assigned to processors. We explore two strategies for assigning thread affinity: Dense and Sparse. A Dense thread placement involves packing threads in as few processors as possible. The idea behind this approach is to minimize remote access distance and maximize resource sharing. In contrast, the Sparse strategy attempts to maximize memory bandwidth utilization by spreading the threads out among the processors. There are a variety of ways to implement and manage thread placement, depending on the level of access to the source code and the library used to provide multithreading. Applications built on OpenMP can use the OMP_PROC_BIND and OMP_PLACES environment variables in order to set a thread placement strategy.

To demonstrate the impact of affinitization, we evaluate workload W1 from Table 5.1 using the Moving Cluster dataset described in Section 5.5.2. Figure 5.3 depicts 10 consecutive runs of this workload. The runtime of the default configuration (no affinity) is expressed in relation to the affinitized configuration.

Table 5.2: Profiling thread placement - W1 on Machine A - Default (managed by OS) vs Modified (Sparse policy)

Performance Metric | Default | Modified | Percent Change
Thread Migrations | 33,196 | 16 | −99.95%
Cache Misses | 1,450M | 972M | −32.95%
Local Memory Accesses | 367M | 374M | +2.06%
Remote Memory Accesses | 159M | 108M | −31.95%
Local Access Ratio | 0.70 | 0.78 | +10.77%

The results highlight the inconsistency of the default OS behavior. In the best case, the affinitized configuration is several orders of magnitude faster, and the worst case runtime is still around 27% faster. In order to gain a better understanding of how each configuration affects the workload, we use the perf tool to measure several key metrics. The results, depicted in Table 5.2, show that the operating system migrates the worker threads many times during the course of the workload. The Sparse affinity configuration prevents migration-induced cache invalidation, which in turn reduces cache misses. Furthermore, a stable thread placement increases the ratio of local memory accesses, resulting in more bandwidth. In Figure 5.4, we evaluate the Sparse and Dense thread affinity strategies on workload W1, and vary the number of threads. We also vary the dataset (see Section 5.5.2) in order to ensure that the distribution of the data records is not the defining factor. The goal of this experiment is to determine whether threads benefit more from being placed on the same NUMA node or from utilizing a greater number of the system's memory controllers. The Sparse policy achieves better performance when the workload is not using all available hardware threads. This is due to the threads having access to additional memory bandwidth, which plays a major role in memory-intensive workloads. When all hardware threads are occupied, the two policies perform almost identically. Henceforth, we use the Sparse configuration (when applicable) for all our experiments.
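The Sparse and Dense placements described above can also be set programmatically. The sketch below pins each worker thread with pthread_setaffinity_np; it assumes, purely for illustration, that hardware threads are numbered contiguously per NUMA node, which does not hold on every machine. For OpenMP programs the same effect can be obtained with the OMP_PROC_BIND and OMP_PLACES environment variables.

#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

// Pin the calling thread to a core chosen by the placement strategy.
// Sparse: round-robin over NUMA nodes to spread memory bandwidth demand.
// Dense: fill consecutive cores of one node before using the next.
void pin_self(unsigned tid, unsigned total_cores, unsigned cores_per_node, bool sparse) {
    unsigned nodes = total_cores / cores_per_node;
    unsigned core = sparse ? (tid % nodes) * cores_per_node + (tid / nodes)
                           : tid;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core % total_cores, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main() {
    // Example: 8 worker threads on a 16-core machine with 4 NUMA nodes.
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < 8; ++t)
        pool.emplace_back([t] { pin_self(t, 16, 4, /*sparse=*/true); /* ...workload... */ });
    for (auto& th : pool) th.join();
}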

5.4.3 Memory Placement Policies

Memory pages are not always accessed from the same threads that allocated them. Memory placement policies are used to control the location of memory pages in relation to the NUMA topology. As a general rule of thumb, data should be on the same node as the thread that processes it and sharing should be kept to a minimum. However, too much consolidation can lead to congestion of the interconnects and contention on the memory controllers. The numactl tool applies a memory placement policy to a process, which is then inherited by all its children (threads). We evaluate the following policies: First Touch, Interleave, Localalloc, and Preferred. We also use hardware counters to measure the ratio of local to total (local + remote) memory accesses. Modern Linux systems employ a memory placement policy called First Touch. In First Touch, each memory page is allocated to the first node that performs a read or write operation on it. If the selected node does not have sufficient free memory, an adjacent node is used. This is the most popular memory placement policy and represents the default configuration for most Linux distributions. Interleave places memory pages on all NUMA nodes in a round-robin fashion. In some prior works, memory interleaving was used to spread a shared hash table across all available NUMA nodes [14, 89, 94]. In Localalloc, the memory pages are placed on the same NUMA node as the thread performing the allocation. The Preferred policy places all newly allocated memory pages on a single node that is selected by the user. This policy will fall back to using other nodes for allocation when the selected node has run out of free space and cannot fulfill the allocation.
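For code we control, the same policies can be requested directly through the libnuma API rather than through numactl. The following is a small sketch (link with -lnuma); the 1 GiB allocation size and the node number 0 are arbitrary values chosen for the example.

#include <numa.h>     // libnuma; link with -lnuma
#include <cstddef>
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "NUMA support is not available\n");
        return 1;
    }
    const std::size_t bytes = std::size_t(1) << 30;     // 1 GiB, arbitrary example size

    void* interleaved = numa_alloc_interleaved(bytes);  // pages spread round-robin over all nodes
    void* local       = numa_alloc_local(bytes);        // pages on the allocating thread's node
    void* on_node     = numa_alloc_onnode(bytes, 0);    // pages on a specific node (node 0)

    /* ... run the workload on these buffers ... */

    numa_free(interleaved, bytes);
    numa_free(local, bytes);
    numa_free(on_node, bytes);
    return 0;
}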

5.4.4 Operating System Configuration

In this section, we outline two key operating system mechanisms that affect NUMA applications: Virtual Memory Page Management (Transparent Hugepages), and

Load Balancing Schedulers (AutoNUMA). These mechanisms are enabled out-of-the-box on most Linux distributions.

5.4.4.1 Virtual Memory Page Management

OS memory management works at the virtual page level. Pages represent chunks of memory, and their size determines the granularity at which memory is tracked and managed. Most Linux systems use a default memory page size of 4KB in order to minimize wasted space. The CPU's TLB caches can only hold a limited number of page entries. When the page size is larger, each TLB entry spans a greater memory area. Although the TLB capacity is even smaller for large entries, the total volume of cached memory space is increased. As a result, larger page sizes may reduce the occurrence of TLB misses. Transparent Hugepages (THP) is an abstraction layer that automates the process of creating large memory pages from smaller pages. THP is not to be confused with Hugepages, which depends on the application explicitly interfacing with it and is usually disabled by default. We use the global THP toggles on our Linux machines to configure its behavior.
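Our experiments toggle THP globally through sysfs, but an application can also opt a specific mapping in or out of THP with madvise. The snippet below is an illustrative sketch of that per-mapping alternative rather than the mechanism used in our evaluation.

#include <sys/mman.h>
#include <cstddef>

// Map an anonymous region and advise the kernel whether it should be
// backed by transparent hugepages.
void* map_region(std::size_t bytes, bool want_hugepages) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return nullptr;
    madvise(p, bytes, want_hugepages ? MADV_HUGEPAGE : MADV_NOHUGEPAGE);
    return p;
}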

5.4.4.2 Automatic NUMA Load Balancing

There have been several projects to develop NUMA-aware schedulers that facilitate automatic load balancing. Among these projects, Dino [28] and AsymSched [98] do not provide any source code, and Numad [143] is designed for multi-process load balancing. Carrefour [39] provides public source code, but requires an AMD CPU based on the K10 architecture (with instruction-based sampling), as well as a modified operating system kernel. Consequently, we opted to evaluate the AutoNUMA scheduler, which is open-source and supports all hardware architectures. AutoNUMA was initially developed by Red Hat and later on merged with the Linux kernel. It attempts to maximize data and thread co-location by migrating memory pages and

threads. AutoNUMA has two key limitations: 1) workloads that utilize data sharing can be mishandled due to the unnecessary migration of memory pages between nodes, and 2) it does not factor in the cost of migration or contention, and thus aims to improve locality at any cost. AutoNUMA has received continuous updates, and is considered to be one of the most well-rounded kernel-based NUMA schedulers. We use the numa_balancing kernel parameter to toggle the scheduler.
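Toggling the balancer amounts to writing the kernel.numa_balancing sysctl. For completeness, a minimal C++ sketch that writes the parameter directly is shown below; it requires superuser privileges to succeed, and the function name is our own.

#include <fstream>
#include <iostream>

// Equivalent to: sysctl kernel.numa_balancing=0 (requires root).
bool set_autonuma(bool enabled) {
    std::ofstream f("/proc/sys/kernel/numa_balancing");
    if (!f) return false;   // no permission, or kernel built without AutoNUMA
    f << (enabled ? 1 : 0);
    return static_cast<bool>(f);
}

int main() {
    if (!set_autonuma(false))
        std::cerr << "could not disable AutoNUMA load balancing\n";
}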

5.5 Evaluation

In this section we describe our setup and evaluate the effectiveness of our strategies. In Section 5.5.1 we outline the hardware/software specifications of our machines. Section 5.5.2 describes the datasets, implementations, and systems used. We analyze the impact of the OS configuration in Section 5.5.3. In Section 5.5.5 we evaluate these techniques on database engines running TPC-H queries. We explore the effects of overriding the default system memory allocator in Section 5.5.4. Finally, we summarize our findings in Section 5.5.6.

5.5.1 Experimental Setup

We run our experiments on three different machines based on different architectures. This is done to ensure that the applicability of our findings is not biased to a particular system's characteristics. The NUMA topologies of these machines are depicted in Figure 5.1 and their specifications are outlined in Table 5.3. We used LIKWID [66] to measure each system's relative memory access latencies, and the remainder of the specifications were obtained using product pages, spec sheets, and Linux system queries. Now we outline some of the key hardware specifications for each machine. Machine A is an eight-socket AMD-based server with a total of 128GB of memory. As the only machine with eight NUMA nodes, Machine A provides us

Table 5.3: NUMA Machine Specifications

System | Machine A | Machine B | Machine C
CPUs | 8×Opteron 8220 | 4×Xeon E7520 | 4×Xeon E7-4850 v4
CPU Frequency | 2.8GHz | 2.1GHz | 2.1GHz
Architecture | AMD Santa Rosa | Intel Nehalem | Intel Broadwell
Physical/Logical Cores | 16/16 | 16/32 | 32/64
Last Level Cache | 2MB | 18MB | 40MB
4KB TLB Capacity | L1: 32×4KB, L2: 512×4KB | L1: 64×4KB, L2: 512×4KB | L1: 64×4KB, L2: 1536×4KB
2MB TLB Capacity | L1: 8×2MB | L1: 32×2MB | L1: 32×2MB, L2: 1536×2MB
NUMA Nodes | 8 | 4 | 4
NUMA Topology | Enhanced Twisted Ladder | Fully Connected | Fully Connected
Relative NUMA Node Memory Latency | Local: 1.0, 1 hop: 1.2, 2 hop: 1.4, 3 hop: 1.6 | Local: 1.0, 1 hop: 1.1 | Local: 1.0, 1 hop: 2.1
Interconnect Bandwidth | 2GT/s | 4.8GT/s | 8GT/s
Memory Capacity | 16GB/node, 128GB total | 16GB/node, 64GB total | 768GB/node, 3TB total
Memory Clock | 800MHz | 1600MHz | 2400MHz
Operating System | Ubuntu 16.04 | Ubuntu 18.04 | CentOS 7.5
Linux Kernel | 4.4 | 4.15 | 3.10

with an opportunity to study NUMA effects on a larger scale. The twisted ladder topology shown in Figure 5.1a is designed to minimize inter-node latency with three HyperTransport interconnect links per node. As a result, Machine A has three remote memory access latencies, depending on the number of hops between the source and the destination. Each node contains an AMD Opteron 8220 CPU running at 2.8GHz and 16GB of memory. Machine B is a quad-socket Intel server with four NUMA nodes and a total memory capacity of 64GB. The NUMA nodes are fully connected, and each node consists of an Intel Xeon E7520 CPU running at 1.87GHz with 16GB of memory. Lastly, Machine C contains four sockets populated with Intel Xeon E7-4850 v4 processors. Each processor constitutes a NUMA node with 768GB of memory, providing a total system memory capacity of 3TB. The NUMA nodes of this machine are fully connected. Our experiments are coded in C++ and compiled using GCC 7.3.0 with the -O3 and -march=native flags. As mentioned in Chapter 4, the flags enable the compiler's highest optimization level and target the native architecture of our machines. Likewise, all dynamic memory allocators and database systems are compiled from source. Machines B and C are owned and maintained by external parties and are based on different Linux distributions. The experiments are configured to utilize all available hardware threads on each machine.

5.5.2 Datasets and Implementation Details

In this section, we outline the datasets and code used for the experiments. We use well-known synthetic datasets outlined in prior work as the basis for all of our experiments [29, 35, 32]. Unless otherwise noted, all workloads operate on datasets that are stored in memory-resident data structures, hence avoiding any I/O bottlenecks. The aggregation workloads (W1 and W2) evaluate a typical hash-based aggregation query, based on a state-of-the-art concurrent hash table [102], which is implemented as a shared global hash table [113].

Table 5.4: Experiment Parameters (bolded values are system defaults)

Parameter | Values
Experiment Workload | W1) Holistic Aggregation [113]; W2) Distributive Aggregation [113]; W3) Hash Join [29]; W4) Index Nested Loop Join using: 1) ART [96], 2) Masstree [108], 3) B+tree [26], 4) Skip List [173]; W5) TPC-H Queries (Q1 to Q22) [35]
Thread Placement Strategy | None (OS scheduler is free to migrate threads), Sparse, Dense
Memory Placement Policy | First Touch, Interleave, Localalloc, Preferred
Memory Allocator | ptmalloc, jemalloc, tcmalloc, Hoard, tbbmalloc
Dataset Distribution | Moving Cluster (default for W1), Sequential (default for W3 and W4), Zipfian (default for W2), TPC-H (W5)
Database System (W5) | MonetDB [117], PostgreSQL [137], MySQL [127], DBMSx, Quickstep [131]
OS Configuration | AutoNUMA on/off, Transparent Hugepages (THP) on/off
Hardware System | Machine A, Machine B, Machine C

The datasets used for the aggregation workloads are based on three different data distributions: Moving Cluster (default), Sequential, and Zipfian. Each dataset consists of 100 million records with a group-by cardinality of one million. In the Moving Cluster dataset, the keys are chosen from a window that gradually slides. The Moving Cluster dataset provides a gradual shift in data locality that is similar to workloads encountered in streaming or spatial applications. In the Sequential dataset, we generate a series of segments that contain multiple number sequences. The number of segments is equal to the group-by cardinality, and the number of records in each segment is equal to the dataset size divided by the cardinality. This dataset mimics transactional data where the key incrementally increases. In the Zipfian dataset, the distribution of the keys approximates Zipf's law [138]. We first generate a Zipfian sequence with the desired cardinality c and Zipf exponent e = 0.5. Then we take n random samples from this sequence to build n records. The Zipfian distribution is used to model many big data phenomena, such as word frequency, website traffic, and city population.
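One way to realize the Zipfian key generation described above is sketched below; the fixed seed and the helper name are arbitrary choices for the example, and the generator actually used in our experiments may differ in its details.

#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

// Generate n record keys whose frequencies approximate Zipf's law with
// group-by cardinality c and exponent e: P(rank k) is proportional to 1 / k^e.
std::vector<uint64_t> zipf_keys(std::size_t n, std::size_t c, double e) {
    std::vector<double> weights(c);
    for (std::size_t k = 0; k < c; ++k)
        weights[k] = 1.0 / std::pow(static_cast<double>(k + 1), e);

    std::mt19937_64 gen(42);  // arbitrary fixed seed for reproducibility
    std::discrete_distribution<std::size_t> rank(weights.begin(), weights.end());

    std::vector<uint64_t> keys(n);
    for (auto& key : keys)
        key = static_cast<uint64_t>(rank(gen));   // sampled group key
    return keys;
}

// Example: 100 million records with a cardinality of one million and e = 0.5.
// auto keys = zipf_keys(100000000, 1000000, 0.5);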

The join workloads (W3 and W4) evaluate a typical join query involving two tables. W3 is a non-partitioning hash join based on the code and dataset from [29]. The dataset contains two tables sized at 16M and 256M tuples, and is designed to simulate a decision support system. W4 is an index nested loop join that uses the same dataset as W3. We evaluated several in-memory indexes for this workload: ART [96], Masstree [108], B+tree [26], and Skip List [173]. ART [96] is based on the concept of a Radix tree. Masstree [108] is a hybrid index that uses a trie of B+trees to store keys. B+tree [26] is a cache-optimized in-memory B+tree. Skip List is a canonical implementation of a Skip List [173]. We use the TPC-H workload (W5) to investigate how our strategies can benefit database systems. This entails some limitations, as databases are complex systems with less flexibility compared to microbenchmarks and codelets. Although there are many available database systems that are TPC-H compliant, we note that comparing an extensive variety of systems is beyond the scope of this work. We evaluate W5 on the MonetDB [117] (version 11.33.3), PostgreSQL [137] (version 11.4), MySQL [127] (version 8.0.17), DBMSx, and Quickstep [131] (latest GitHub version as of October 2019) database systems. MonetDB is an open-source columnar store that uses memory-mapped files with demand paging and multiple worker threads for its query processing. PostgreSQL is a widely-used open-source row store that supports intra-query parallelism using multiple worker processes and a shared memory pool for communication. We configured PostgreSQL with a 42GB buffer pool based on recommendations in the documentation [62]. MySQL is an open-source row store that remains highly popular. DBMSx is a commercial hybrid row/column-store with a parallel in-memory query execution engine. Quickstep is an open-source hybrid store with a focus on single-node parallelism and in-memory analytical workloads.

W5 uses version 2.18 of the TPC-H dataset specifications. The dataset is designed to mimic a decision support system with eight tables, and is paired with a set of 22 queries which answer typical business questions. Our experiments involve evaluating the impact of the OS configuration on each system, using all the queries on a dataset scale factor of 20. We then explore the effect of different memory placement policies. Finally, we use Queries 5 and 18 to show the impact of utilizing different memory allocators, as both queries involve a combination of joins and aggregations. The experimental parameters are shown in Table 5.4. We use the maximum number of hardware threads supported by each machine. In W1-W4, we measure the average workload execution time of five runs, using the timer from [29], which we selected for its high precision. We also note that the variation between runs is less than 3%. In W5, we use the built-in query timing features of each database system. We prefer using the built-in timers because it allows us to precisely measure the runtime for individual queries rather than just the set of queries as a whole.

5.5.3 Operating System Configuration Experiments

In this section, we evaluate three key OS mechanisms that affect NUMA behavior: NUMA Load Balancing (AutoNUMA), Transparent Hugepages (THP), and the system's memory placement policy. The experiments demonstrate each parameter's effect on query performance. We also examine how these variables are affected by other experiment parameters, such as hardware architecture, and the interaction between THP and memory allocators.

5.5.3.1 AutoNUMA Load Balancing Experiments

In Figures 5.5a and 5.5b, we evaluate W1 and toggle the state of AutoNUMA Load Balancing between On (the system default) and Off. The results in Figure 5.5a show that AutoNUMA slows down the runtime for the First Touch, Interleave, and Localalloc memory placement policies. In most cases, AutoNUMA's overhead dominates any performance gained by migrating threads and memory pages. The best runtime is obtained by applying the Interleave policy and disabling AutoNUMA. If AutoNUMA is enabled, the best approach is to apply the Interleave policy, which may be useful for scenarios where superuser access is unavailable. We observed similar behavior for the other workloads and machines. AutoNUMA had a significantly detrimental effect on runtime. The best overall approach is to use memory interleaving and disable AutoNUMA. We also measure the Local Access Ratio (LAR) [39], which is shown in Figure 5.5b and specifies the ratio of memory accesses that were satisfied with local memory compared to all memory accesses. AutoNUMA attempts to increase LAR without considering other costs, such as moving threads and memory, or memory controller contention. Due to this, the First Touch policy with AutoNUMA enabled (system default) is 86% slower than Interleave without AutoNUMA, despite a higher LAR measurement. In summary, we obtain significant speedups using a modified OS configuration, and note that LAR is not an accurate predictor of performance on NUMA systems.

Figure 5.5: Evaluating and profiling AutoNUMA load balancing - W1 - (a) AutoNUMA effect on execution time - Machine A, (b) AutoNUMA effect on Local Access Ratio - Machine A

Figure 5.6: Evaluating effect of AutoNUMA and THP on memory placement policies and allocators - W1 - (a) Impact of THP on memory allocators - Machine A, (b) Combined effect of AutoNUMA and THP on different memory placement policies - variable machine

5.5.3.2 Transparent Hugepages Experiments

Next, we evaluate the effect of the Transparent Hugepages (THP) configuration, which automatically merges groups of 4KB memory pages into 2MB memory pages. As shown in Figure 5.6a, THP's impact on the workload execution time ranges from detrimental in most cases to negligible in other cases. As THP alters the composition of the operating system's memory pages, support for THP within the memory allocators is the defining factor in whether it is detrimental to performance. tcmalloc, jemalloc, and tbbmalloc do not currently handle THP well. We hope that future versions of these memory allocators will rectify this issue out-of-the-box. Although most Linux distributions enable THP by default, our results indicate that it is better to disable THP for high performance data analytics.

5.5.3.3 Hardware Architecture Experiments

Here we show how the performance of data analytics applications running on different machines with different hardware architectures is affected by the memory placement strategies. For all machines, the default configuration uses the First Touch memory placement, and both AutoNUMA and THP are enabled. The results depicted in

Figure 5.6b show that Machine A is slower than Machine B when both machines are using the default configuration. However, using the Interleave memory placement policy and disabling the operating system switches allows Machine A to outperform Machine B by up to 15%. Machine A shows the most significant improvement from operating system and memory placement policy changes, and the workload runtime is reduced by up to 46%. The runtime for Machine C is reduced by up to 21%. The performance improvement on Machine B is around 7%, which is fairly modest compared to the other machines. Although Machines B and C have a similar inter-socket topology, the relative local and remote memory access latencies are much closer in Machine B (see Table 5.3). Henceforth, we keep AutoNUMA and THP disabled for our experiments, unless otherwise noted.

5.5.4 Memory Allocator Experiments

In Section 5.4.1.9, we used a memory allocator microbenchmark to show that there are significant differences in both multi-threaded scalability and memory consumption overhead. In this section, we explore the performance impact of overriding the system default memory allocator on four in-memory data analytics workloads.

5.5.4.1 Hashtable-based Experimental Workloads

In Figure 5.7, we show our results for the holistic aggregation (W1), distributive aggregation (W2), and hash join (W3) workloads running on each of the machines. In addition to the memory allocators, we vary the memory placement policies for each workload. The results show significant runtime reductions on all three machines, particularly when using tbbmalloc in conjunction with the Interleave memory placement policy. The holistic aggregation workload (W1), shown in Figures 5.7a to 5.7c, extensively uses memory allocation during its runtime to store the tuples for each group and calculate their aggregate value. Utilizing tbbmalloc reduced the runtime of W1 by up to 62% on Machine A, 83% on Machine B, and 72% on Machine C, compared to the default allocator (ptmalloc).

Figure 5.7: Comparison of memory allocators - variable workload, memory placement policy, and machine - (a)-(c) W1 on Machines A, B, C; (d)-(f) W2 on Machines A, B, C; (g)-(i) W3 on Machines A, B, C

114 5 25 25

4 20 20

3 15 15

2 10 10

1 5 5

0 0 0 Runtime CPU CPU Cycles Runtime(Billions) Runtime CPU CPU Cycles Runtime(Billions) CPU Cycles Runtime(Billions)

Memory Allocator Memory Allocator Memory Allocator (a) Moving Cluster (b) Sequential (c) Zipf

Figure 5.8: W1 - Machine A - Effect of dataset distribution - (a) Moving Cluster, (b) Sequential, (c) Zipf

The results for the join query (W3), depicted in Figures 5.7g to 5.7i, also show significant improvements, with tbbmalloc reducing workload execution time by 70% on Machine A, 94% on Machine B, and 92% on Machine C. The distributive aggregation query (W2), shown in Figures 5.7d to 5.7f, speeds up by 44%, 27%, and 28% on Machines A, B, and C respectively. This speedup is almost entirely due to the Interleave memory placement policy. Although W2 is not allocation-heavy and does not gain much benefit from a faster memory allocator, it can still be accelerated using a more efficient memory placement policy.

5.5.4.2 Impact of Dataset Distribution

The performance of query workloads and memory allocators can be sensitive to the access patterns induced by the dataset distribution. The three tested datasets have the same number of records, but differ in the way the record keys are distributed (see Section 5.5.2 for more information). In our previous experiments, we used the Moving Cluster dataset as the default dataset for W1. In Figure 5.8, we vary the dataset distribution to investigate its impact on different memory allocators. The results show that tbbmalloc continues to produce the largest speedups on both the


Zipf and Sequential datasets. We also observe this trend on Machines B and C, but omit them due to space constraints.

Figure 5.9: Index nested loop join workload (W4) - variable memory allocators and memory placements - Machine A - (a) ART join times, (b) Masstree join times, (c) B+tree join times, (d) Skip List join times, (e) index build times (best configuration), (f) index join times (best configuration)

5.5.4.3 Effect on In-memory Indexing

In W4, we investigate index nested loop join query processing with different in-memory indexes. The type of index used to accelerate the nested loop join workload (W4) plays a key role in determining its speed. We evaluate four in-memory indexes: ART [96], Masstree [108], B+tree [26], and Skip List [173]. As the index is pre-built, the workload is relatively light in terms of the number of memory allocations during the join phase, hence factors such as scan/lookup times, materialization, and locality

play a greater role. For each index, we vary the memory allocator and memory placement policy and measure the join time. The results, depicted in Figures 5.9a to 5.9c, show that runtime can be significantly improved for most of the tested indexes. In Figure 5.9a, we show that ART's join time can be substantially improved using the jemalloc or tbbmalloc allocators. A key characteristic of ART is that it uses variable node sizes and a variety of compression techniques for its trie, thus requesting a greater variety of size classes from the memory allocator, compared to the other indexes. In Figures 5.9b and 5.9c, Masstree and B+tree show a notable improvement with the Hoard allocator. Both indexes rely on grouping many keys per node, which is favorable for Hoard's straightforward global heap approach. Skip List breaks the trend as the only index that runs fastest with ptmalloc. Finally, we summarize the results in Figures 5.9e and 5.9f, which respectively depict each index's build and join times using their fastest configuration. The results show that our techniques were effective in speeding up the two fastest indexes (ART and B+tree) despite their inherent lack of NUMA-awareness. We were able to reduce W4 runtimes by 18.5% and 77% for B+tree and ART respectively.

5.5.5 Database Engine Experiments

In this section, we evaluate the TPC-H workload (W5) on five database systems: MonetDB, PostgreSQL, MySQL, DBMSx, and Quickstep. We use these five systems to explore how database system performance can be improved using our strategies. Investigating NUMA strategies on database systems is more challenging compared to in-memory microbenchmark workloads, as there is considerably more complexity involved in storing and loading the data, and great care must be taken to ensure that disk I/O and caching do not skew the results. To ensure fair and consistent results, we clear the page cache using the /proc/sys/vm/drop_caches command before running each query, disregard the first (cold) run, and measure the mean runtime for five additional runs.

117 4.3 14.1 30.1 7.2 11 10.3 4.5 22 Q -1.7 -1.0 Q 6.1 2.2

2.6 9.8 20.5 3.2 disabled, 10 2.3 5.6 21 Q 27.6 2.4 Q Quickstep Quickstep 17.5 24.0 cy reduction - 4.5 8.1

41.7 9 9.2 20

49.2 Q -0.1 Q used the following 1.8 -0.2 TPC-H queries for er already handling

10.5 18.6 olicies. We observed hich use joins instead to the standalone in- AutoNUMA nd memory allocators.

7.6 8.6 the original versions of ments, we evaluate the ptimal plans for queries

19.3 3.8 ned that the parameters 8 19

18.4 Q 19.9 Q memory allocator. DBMSx DBMSx 5.6 4.2 15.5 8.1

1.1 4.0 43.1 7.7 7 18

37.2 Q 0.6 Q ) yielded the best performance for 22

3.6 11 -3.4 26.9 42.6 tbbmalloc

40.6 0.2 to Q 33.3 to Q 7.0 17 6 1 MySQL MySQL 12 Q 2.8 Q -2.3 6.5 -4.5

3.1 31.1 memory placement, Query Number Query Query Query Number first touch

-0.7 0.9 118 42.9 24.8 16 5

19.4 Q Q 11.8 2.3 -1.0 10.5 12.1 (a) TPC-H Q

0.7 25.1 (b) TPC-H Q 43.4 25.0 4 15

5.8 Q 2.3 Q First Touch PostgreSQL PostgreSQL 4.3 3.1 9.0 4.6

5.0 1.8 33.0 18.7 3 14

19.4 Q 1.9 Q 2.2 3.4 5.2 8.8

13.8 2.6 14.3 4.9 2 13 Q

34.7 17.2 Q MonetDB MonetDB 0.3 -1.1 35.7 10.0

1.0 3.0 23.5 5.1 1 12 1 2 3 4 5 6 7 8 91011 12 13 14 15 16 17 18 19 20 21 22 0.1 Q -1.7 Q 1.5 4.3

14.0 2.7

Query Latency Reduction Latency Query 50% 40% 30% Reduction 20% Latency 10% Query 50% 40% 30% 20% 10% -10.% -10.% In Figure 5.10, we present the speedups obtained across all 22 THP disabled (for all except DBMSx), and the the database systems. This is due tomemory the placement database between Buffer the Manag workerparameters threads/processes. to speed We up W5: Figure 5.10: 22Variable database TPC-H systems queries - (W5) Machine A scale factor 20five - additional runs. Query laten In a similarimpact vein of to the our OS configuration, previous memory experi placementDue policies, to a an issue causing PostgreSQL to produce severely17 sub-o and 22, we evaluate modified versions ofof these the two regular queries nested w queries. All otherqueries database 17 systems and run 22. Through empirical evaluation, wethat determi work best for the databasememory systems queries, with are the mostly exception identical of thethat memory the placement system’s p default memory policy ( 2.5 12 10 2 8 1.5 6 1 4 Query Latency(s)Query Query Latency (s) LatencyQuery 0.5 2


Figure 5.11: Effect of memory allocator on TPC-H query latency - MonetDB - Machine A - (a) TPC-H Q5, (b) TPC-H Q18

The results are expressed as the percentage reduction in query latency compared to the default configuration, with positive values indicating a speedup and negative values indicating a slowdown. The results show that MonetDB's query latencies improved by up to 43%, with an average improvement of 14.5%. In comparison, the gains for PostgreSQL are less consistent. Query latency improved by up to 27.6%, but the average improvement is 3% and seven queries take slightly longer to complete. We believe these variances are due to PostgreSQL's rigid multi-process query processing approach, which sometimes opts to use only one worker process and thus fails to fully utilize the hardware. MySQL's query latency is reduced by up to 49% with an average reduction of 12%. We observe that DBMSx query latency is improved by up to 43% with an average of 21%. Lastly, Quickstep query latency speeds up by up to 40% with an average of 7%. Overall, all five database systems obtained noteworthy speedups from modifying the default OS configuration. Next, we investigate the effect of memory allocator overriding on MonetDB. To do so, we select queries 5 and 18 due to their usage of both joins and aggregation.

The results, shown in Figure 5.11, indicate that tbbmalloc can provide an average query latency reduction of up to 11% for Query 5, and 20% for Query 18, compared to ptmalloc. These results once again reflect that the default and most commonly used memory allocator can be replaced to obtain speedups, even in software as complex as a database system.

5.5.6 Results Summary

The strategies explored in this chapter, when carefully applied, can significantly speed up query processing workloads without the need for source code modification. The effectiveness and applicability of these strategies to a workload depend on several factors. Figure 5.12 shows a strategic plan for practitioners. The flowchart outlines a systematic guide to improving performance on NUMA systems, along with some general recommendations. We base these recommendations on our extensive experimental evaluation using multiple machine architectures and workloads. Starting with thread management, we showed that thread affinitization can be critical for NUMA systems, but more importantly how a Sparse placement approach can maximize performance in situations that are memory-bandwidth-bound. We then showed that the default OS configuration can have a significant detrimental effect on query performance. The overhead of AutoNUMA and THP was demonstrated to be too costly for high performance data analytics workloads. Although superuser privileges are required to modify AutoNUMA and THP, we observed that optimizing the memory placement policy (such as using Interleave) can mostly mitigate their negative impact. We also investigated dynamic memory allocators using a microbenchmark. The microbenchmark results showed that there are considerable differences between the allocators, both in terms of scalability and efficiency. In our evaluation, we demonstrated that these differences translate into real gains in analytical query processing workloads, although the performance gains depend on the way the workloads


allocate memory. For example, allocation-heavy workloads, such as the hash join (W3), benefited the most, whereas the index nested loop join (W4) exhibited smaller gains due to the prebuilt index. Although we have shown that tbbmalloc frequently outperformed its competitors on different machines and workloads, we recommend experimentation with new/updated memory allocators before selecting a solution.

Figure 5.12: Application-agnostic Decision Flowchart

5.6 Chapter Summary

In this chapter, we have outlined and investigated several application-agnostic strategies to speed up query processing on NUMA machines. Our experiments on five analytics workloads have shown that it is possible to obtain significant speedups by utilizing these strategies.

We also demonstrated that current operating system default configurations are generally sub-optimal for in-memory data analytics. Our results, surprisingly, indicate that many elements of the default OS environment, such as AutoNUMA, Transparent Hugepages, the default memory allocator (e.g., ptmalloc), and the OS thread scheduler, should be disabled or customized for high-performance analytical query processing, regardless of the hardware generation. We have also demonstrated that memory allocator performance on NUMA systems can be a major bottleneck and that this under-appreciated topic is ripe for investigation. We obtained large speedups for our query processing workloads by overriding the default dynamic memory allocator with alternatives such as tbbmalloc. The main contributions of this chapter are as follows:

• Categorization and analysis of strategies to improve application performance on NUMA systems

• The first study on NUMA systems (to our knowledge) that explores the combined impact of different memory allocators, thread and memory placement policies, and OS-level configurations, on different analytics workloads

• Extensive experimental evaluation, involving different workloads, indexes and database systems on different machine architectures and topologies, with profiling and performance counters, and microbenchmarks

• A decision flowchart (Figure 5.12) to help practitioners speed up query processing on NUMA systems with minimal code modifications

As our approach does not target a specific NUMA topology, we have shown that our findings can be applied to systems with different architectures. As hardware architectures continue to advance towards greater parallelism and greater levels of memory access partitioning, we hope our results and decision flowchart can help practitioners to accelerate data analytics.

Chapter 6

Distributed In-memory kNN Query Processing

In previous chapters, we explored in-memory query processing in the context of multi-threaded (single machine) applications. In this chapter, we present the design and implementation of a distributed in-memory data system that we developed to perform spatio-temporal queries using a cluster of machines. Location-based services (LBS) and other geospatial applications produce large volumes of spatio-temporal data at high velocity. A common example is an application that tracks the location of vehicles over time. In addition to the growing number of vehicles, every measurement of the spatial locations (at regular time intervals) generates a large number of new records. Efficiently storing and querying this data is challenging due to the data rate and volume, as well as the multi-dimensional nature of both the data and the queries. As part of a collaborative research effort on high performance query processing, we developed a distributed spatio-temporal data system called DISTIL [132], which supported concurrent location updates and spatio-temporal range queries. In this chapter, we present DISTIL+, an extension of the DISTIL system that we developed in order to support fast spatio-temporal kNN queries.

Note: Parts of this chapter were previously published in [132, 111].

We show how DISTIL+ uses a multi-level in-memory index to efficiently execute kNN queries in a distributed manner. The remainder of this chapter is organized as follows. In Section 6.1, we outline our key motivations for building DISTIL+. Section 6.2 presents the related work. In Section 6.3, we present the problem statement and design considerations. We describe the design and implementation of our system in Section 6.4 and provide details on the system's kNN query processing in Section 6.5. In Section 6.6, we present our experimental evaluation of DISTIL+ and compare it with other systems. Finally, we summarize and conclude the chapter in Section 6.7.

6.1 Motivation

With recent advances in microprocessors, remote sensing and sensor technologies, and the rise of geo-spatial web applications, the growth of spatial data is rapidly accelerating. A significant portion of spatial data is spatio-temporal in nature, the growth of which is driven by the proliferation of GPS-enabled mobile devices and sensors. As a result, Location-Based Services (LBS) have become more pervasive. A wide range of LBS applications, including location-aware search, location-based games, advertising, and weather services, have become a part of our daily lives. These applications are contributing to the generation of vast quantities of spatio-temporal data from a variety of sources. The key characteristics of LBS applications include high-throughput location updates and low-latency processing of different types of location-oriented queries. With the continuous rise of big spatial data volume, the impetus for scalable processing of such data is growing. Distributed data systems have been proposed to provide horizontal scalability on large multi-dimensional datasets. In an effort to support spatial data processing, a number of systems have been developed.

Among these, SpatialHadoop [44], Hadoop-GIS [79], and ST-Hadoop [3] are based on the open-source MapReduce framework Hadoop [65]. Hadoop focuses on fault tolerance and I/O operations. Due to technological advances in computing hardware, large main-memory machines with multiple cores and low-latency networking are becoming quite common in modern data centers. These trends have made distributed in-memory processing of big data attractive. As a result, the in-memory cluster computing framework Spark [190] has become very popular. Recently, several Spark-based spatial data processing systems have been proposed. These include GeoSpark [188], SIMBA [181], SpatialSpark [186], and LocationSpark [162]. However, none of these systems support a high rate of updates, as most of them assume immutable spatial data. As a result, these distributed in-memory spatial big data frameworks are not suitable for LBS applications. Some researchers have proposed distributed spatio-temporal systems built on top of the HBase [52] data store. MD-HBase [125] and LC-Index (Local and Clustering Index) [47] are two notable examples. The main drawback of these systems is the lack of a distributed memory-based architecture, as HBase relies on a disk-based architecture. This has implications for performance. Due to the rapidly rising volume and high velocity of spatio-temporal data, horizontal (rather than vertical) scaling is needed to achieve scalability. However, achieving good throughput and latency in a distributed data processing system is challenging. Moreover, efficient distributed processing of spatio-temporal queries, such as range and k-nearest neighbor (kNN) queries, can be hard, as they involve large volumes of data, multiple dimensions, and typically entail data skew. In this chapter, we aim to address these challenges with a distributed in-memory system. Currently, Spark is the most popular distributed in-memory framework. Despite this, researchers [6] have reported that applications developed with Spark can be an order of magnitude or more slower than native implementations written with high-performance computing tools and programming models.

Similar observations have been made by other researchers, such as [110]. Spark's overheads, associated with distributed coordination and data movement, are partly responsible for this. To avoid these issues, we investigated several programming frameworks, and among them the APGAS (Asynchronous Partitioned Global Address Space) paradigm [147] was the most promising. In APGAS, a computation and its related data are logically organized into places, where the computation is structured as lightweight asynchronous tasks called activities [165].

Figure 6.1: APGAS Programming Model

6.1.1 APGAS and the X10 Language

APGAS is a distributed shared memory concurrency model that allows multiple threads of execution to share a common address space spread across a cluster of nodes. It also includes built-in support for fault-tolerant and elastic computing [164]. A place encapsulates the idea of a portion of the address space that contains data and the threads operating on that data. Activities are tasks that can be performed synchronously or asynchronously in the associated places. The concept of an asynchronous activity being performed at a place is denoted by async. APGAS offers several constructs to coordinate activities. For instance, finish enables a parent to wait for the completion of all of its child activities, and conditional atomics can be used to define conditional critical regions. APGAS also includes parallel for loop constructs to support data-parallelism.

For example, foreach and ateach provide local and distributed parallelism, respectively. X10 [180] is an object-oriented programming language built on the APGAS model. An X10 program is translated into either C++ (native X10) or Java (managed X10) source code. Managed X10 transforms the application into Java classes. For the managed runtime, X10 offers a distributed heap appropriate for parallel and distributed computing. The heap is partitioned across multiple places (nodes). Lightweight threads (activities) are spawned asynchronously or synchronously and access objects on their local or remote heap (shown in Figure 6.1). X10 supports distributed arrays that split data across several places, and can facilitate references between places through a Global Reference [180]. Our system uses these features to distribute data across multiple nodes, and to process and synchronize inter-node activities.
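To make the finish/async coordination pattern concrete, the following is a minimal single-process Java sketch that mimics its semantics with a standard thread pool. It is an analogy only: it does not use the X10 runtime or any APGAS library, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FinishAsyncSketch {
    // A fixed pool of worker threads stands in for the activities running at one place.
    private static final ExecutorService pool = Executors.newFixedThreadPool(4);

    // Analogue of "finish { async S1; async S2; ... }": spawn tasks, then wait for all of them.
    static void finishAll(List<Runnable> activities) throws Exception {
        List<Future<?>> pending = new ArrayList<>();
        for (Runnable activity : activities) {
            pending.add(pool.submit(activity));   // async: schedule a child activity
        }
        for (Future<?> child : pending) {
            child.get();                          // finish: the parent blocks until every child completes
        }
    }

    public static void main(String[] args) throws Exception {
        finishAll(List.of(
                () -> System.out.println("activity 1 on " + Thread.currentThread().getName()),
                () -> System.out.println("activity 2 on " + Thread.currentThread().getName())));
        System.out.println("all child activities completed");
        pool.shutdown();
    }
}
```

In X10 itself the same pattern is written directly as finish { async S; }, and the runtime, rather than an explicit pool, schedules the activities across places.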

6.2 Related Work

With the continuous rise of big spatial data volume, the impetus for scalable processing of such data is growing. In response to this, a number of spatial systems have been developed. Similarly, the growing popularity of Location-based services (LBS) has spurred research on spatio-temporal indexing and queries. We now explore some of the work that is relevant to our research. We divide the related work into three main categories: big spatial data processing systems, distributed spatio-temporal indexes, and distributed kNN query processing.

6.2.1 Big Spatial Data Processing Systems

The need for high performance spatial data processing has motivated many projects to add support for spatial queries in existing distributed frameworks. This category consists of spatial data processing systems built on top of existing parallel and distributed platforms.

Some researchers have presented comparative studies on the features and performance characteristics of some of these systems [130, 2]. SpatialHadoop [44], HadoopGIS [79], Hadoop-GIS [1], and HadoopTrajectory [12] are based on the Hadoop MapReduce [40] framework. ST-HBase [106] is based on HBase [52], a distributed database that runs on top of the Hadoop Distributed File System (HDFS). These systems provide a viable path toward horizontal scaling for spatial data processing, but can be hindered by the performance limitations of Hadoop's disk-oriented data processing paradigm. Another family of systems is based on Apache Spark, an in-memory cluster computing framework [190]. These systems include GeoSpark [188], SIMBA [181], SpatialSpark [186], Magellan [160], and LocationSpark [162]. They are typically faster than the Hadoop-based systems due to their in-memory data processing, but will fall back to disk-based processing or fail if the job size exceeds the available memory. In response to this issue, Baig et al. presented SparkGIS [11], a Spark-based spatial query processing system. Their system uses space-efficient global and local in-memory indexes based on R-trees to accelerate its queries. A key feature of SparkGIS is that it can dynamically rewrite queries in order to handle large workloads that exceed the available memory resources. Both families of systems are designed with spatial queries in mind, and do not natively support spatio-temporal queries. Consequently, researchers have devised spatio-temporal indexes to overcome this hurdle and improve query performance.

6.2.2 Scalable Spatio-Temporal Indexes

Spatio-temporal data are part of spatial big data, and their usage is found in many LBS applications. Researchers have explored ways to improve query and storage efficiency. MD-HBase [125] is a multi-level index for multi-dimensional data that uses the HBase database for storage.

The indexing data, such as the geographical coordinates and timestamp, are combined and transformed into a single dimension using the Z-ordering linearization technique. The generated value is then used in the index layer to point to equal-sized buckets, which hold the data (storage layer). Pyro [101] uses a geometrical translation module based on the Moore encoding linearization technique [10] to translate geometry queries into a series of one-dimensional range scans. The implementation is based on HBase and HDFS for its index and storage layers. A temporal key is appended to the end of the spatial key and used as each record's row key in HBase. The HDFS storage layer uses a data block algorithm that preserves data locality while splitting regions. PASTIS [142] is an in-memory grid-based index. The spatial domain is split into grid cells that maintain information about the moving objects using compressed bitmaps. The cells are kept in memory, and inserts and updates are performed concurrently using multiple threads. DISTIL+ is inspired by PASTIS. However, PASTIS is not horizontally scalable, as it is a single-node approach, whereas DISTIL+ is a distributed in-memory system. In [50], Fox et al. presented a spatio-temporal index based on Niemeyer's geohashing [124] technique, which converts a spatial rectangle into a binary string. The authors implemented their index on Apache Accumulo [167], which was inspired by Google's BigTable and is similar to HBase. The spatio-temporal attributes of a point are represented by a 35-bit hash that contains the location and date-time information of that record. The records are spread across multiple servers in a distributed file system. In [47], Feng et al. present the Local and Clustering Index (LCIndex), an indexing technique for the Distributed Ordered Tables (DOT) used in HBase. The authors claim their clustering index approach is better suited to handling data skew and updates compared to tree-based approaches. The index is designed to provide high-performance multi-dimensional range queries and updates.

LCIndex stores indexing data (IFiles) using HBase's HFile storage format, which is in the form of key-value pairs. Each IFile represents an indexed column, and is stored on the local file system to accelerate queries. The raw data records are stored at different region servers using HDFS. Each server stores horizontal partitions for one or more tables, and replication is used to enhance reliability and performance. Xiangyu Zhang et al. developed a multi-dimensional index for the cloud, called EMINIC (Efficient Multi-dimensional Index with Node Cube) [194]. They used the master-slave paradigm to build local k-d tree indexes on the slave nodes and global R-tree indexes on the master nodes. Information about each slave's local index is stored within a node cube, which is then indexed in the master nodes. Each node cube consists of a set of intervals that indicate the range of each indexed attribute in the slaves. The node cubes are used to prune nodes that do not intersect with the query, thus improving query processing. Liao et al. [104] built a hierarchical R-tree-based index for multi-dimensional data on distributed file systems. Several buffer pools are used to cache the R-tree's inner nodes, because they are frequently accessed during tree traversal. Another series of buffer pools caches the leaf nodes. The leaf node buffer pools use a Least Recently Used (LRU) replacement policy due to the considerably larger number of leaf nodes. The authors used Hadoop HDFS and range partitioning to distribute the index across multiple nodes. Zhang et al. [195] developed TrajSpark, a distributed Spark-based in-memory system for trajectory data. The authors use a global index on the master node as well as per-node local indexing to speed up trajectory queries. The raw trajectory data is indexed first by time ranges, then by a spatial index, and lastly using a B+tree that directly maps to trajectory points. TrajSpark monitors the data distribution using a time-decay model to avoid unnecessary repartitioning when new data arrives.

Recently, Alarabi et al. [3] presented ST-Hadoop, a full-fledged spatio-temporal system built on Hadoop. ST-Hadoop extends SpatialHadoop [44] by improving support for spatio-temporal querying and indexing. ST-Hadoop utilizes HDFS to store a distributed spatio-temporal index that is used to accelerate spatio-temporal queries. It leverages an efficient implementation that directly extends several Hadoop functions to support spatio-temporal data. This results in considerably less overhead compared to adding a new layer on top of Hadoop. The index is built by first taking samples from the data, and partitioning the samples into time intervals with various granularities (days, weeks, months). Each temporal interval is then spatially indexed using a common index, such as an R-tree, Quadtree, or k-d tree.

6.2.3 Distributed kNN Query Processing

kNN queries are a fundamental tool in many domains, including LBS. Researchers have explored ways to accelerate these queries in distributed computing environments. Some kNN implementations have employed a brute-force exhaustive search over the entire dataset in single-node shared-memory multicore systems [157], single-node GPGPU [103], and distributed systems [192, 187]. Such approaches are not suitable for LBS and spatio-temporal kNN query processing, as scalability and low latency are essential to handling very large datasets. Some researchers have explored the use of pre-existing indexes or Voronoi diagrams to accelerate kNN queries. Kolahdouzan et al. presented a method to process kNN queries using pre-computed Voronoi diagrams [85]. Voronoi diagrams can be expensive to compute, and do not support efficient updates, as the arrival of new data can significantly alter large portions of the diagram. Zhang et al. used a pre-built R-tree to process kNN queries, and proposed a validity region in which the query result could be re-used without recalculation [193]. Although updating an R-tree is faster than updating a Voronoi diagram, this approach is also unsuitable for LBS due to the lack of scalability.

Patwary et al. [133] used a distributed global k-d tree index to determine the node that contains the incoming query point. The node then calculates a local kNN query by traversing its local k-d tree index. Each node then communicates with nearby nodes in order to prune the search space. The pruning is based on the fact that an object in a remote node that is farther than the farthest local kNN is not a kNN candidate. The authors leverage multiple nodes, multiple cores, and SIMD instructions to build and query the index in parallel. Ding et al. [41] propose a system for metric kNN queries called the Asynchronous Metric Distributed System (AMDS). The authors opted to use pivot-mapping to distribute the data between the nodes. The system achieves asynchronous query processing by using a publish-subscribe messaging pattern to reduce communication overhead.

6.3 Problem Statement and Design Considerations

The goals of DISTIL+ are to support high-throughput location updates and many concurrent spatio-temporal range queries (STRQ) and kNN queries (STkNNQ). Formally, let us assume that a continuously updated set of location updates O_t at time t is being generated by M moving objects. Here, each record o = (o.id, o.l, o.t, o.v, o.d) in the set O_t represents a location update with object ID o.id, location o.l, timestamp o.t, velocity o.v, and direction o.d. Further, o.l is a tuple (o.l.x, o.l.y), x and y being the latitude and longitude, and 0 ≤ o.id ≤ M − 1.

Definition 6.3.1. A spatio-temporal range query (STRQ) is given by q = (q.sw, q.tw), where q.sw is the spatial window [(x1, x2), (y1, y2)] and q.tw is the temporal window

[t1, t2]. Here, q finds the set Z of objects such that

Z := {∀z ∈ O | z.l ∈ q.sw, z.t ∈ q.tw}

Definition 6.3.2. A spatio-temporal kNN query (STkNNQ) is given by q = (q.l, q.tw, q.k), where q.l is the query location (x, y), q.tw is the temporal window [t1, t2], and q.k is the number of nearest neighbors requested. Here, q finds the set Z of the k best matching objects, such that

Z := {∀z ∈ O | d(z1.l, q.l) ≤ d(z2.l, q.l) ≤ ... ≤ d(zk.l, q.l), z.t ∈ q.tw, |Z| ≤ q.k}

Here, d(z.l,q.l) is the distance between query and object locations.
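The two definitions map directly onto simple containment and distance checks on individual records. The following Java sketch expresses them for a single record; the class and field names are illustrative and are not taken from DISTIL+'s source code.

```java
// Minimal sketch of a location record and the predicates from Definitions 6.3.1 and 6.3.2.
// Requires Java 16+ for records; names are hypothetical, chosen only for illustration.
record LocationRecord(long objId, double x, double y, long t, double v, double d) {}

record RangeQuery(double x1, double x2, double y1, double y2, long t1, long t2) {
    // STRQ membership: z.l inside q.sw and z.t inside q.tw
    boolean matches(LocationRecord z) {
        return z.x() >= x1 && z.x() <= x2
            && z.y() >= y1 && z.y() <= y2
            && z.t() >= t1 && z.t() <= t2;
    }
}

record KnnQuery(double qx, double qy, long t1, long t2, int k) {
    // Temporal filter applied before candidates are ranked
    boolean inTimeWindow(LocationRecord z) {
        return z.t() >= t1 && z.t() <= t2;
    }
    // Euclidean distance d(z.l, q.l) used to order the k best matches
    double distance(LocationRecord z) {
        double dx = z.x() - qx, dy = z.y() - qy;
        return Math.sqrt(dx * dx + dy * dy);
    }
}
```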

Due to the pervasiveness of LBS, the volume of spatio-temporal data is rising at a rate that outpaces the rate at which memory capacities are improving. Therefore, even though it may be possible to obtain a machine with enough memory to process LBS data for a given application, eventually a single machine's RAM will not be able to cope with the data growth. As a result, vertical scaling is challenging and costly. On the other hand, a cluster of commodity servers, connected by a high-speed network, can be easily scaled horizontally by adding new nodes. The aggregate memory capacity of such a cluster can be utilized effectively if the data management system is designed to operate as a distributed in-memory data processing engine. However, this requires efficient partitioning, distribution, and sharing of data across the cluster, while minimizing network traffic that would otherwise cause performance degradation. Under these circumstances, APGAS offers a high-performance framework for designing distributed in-memory systems. Another important consideration with regard to LBS applications is that recent data is usually accessed more frequently than older data. Researchers have noted this request skew in their LBS system [179], and pointed out that more than 50% of all queries only accessed data from the last day in their system. This is not an isolated phenomenon; in fact, many companies providing LBS typically maintain the past 4 to 6 weeks of data online in relational databases, and use database archiving techniques to store older data.

Figure 6.2: DISTIL+ System Architecture

Although the data volume for an LBS application can be significant even for 6 weeks, it can still be practical to keep the entire online dataset in a distributed in-memory big data system. Due to these considerations, our system maintains the past D (configurable) days of data in the distributed in-memory store. Note that all data is persisted in stable storage. Therefore, we can keep ‘hot’ data in main memory, and the data can be deleted from the main memory store when it gets ‘cold’.

6.4 DISTIL+ System Overview

In this section, we describe the design and architecture of our system. The DISTIL+ system architecture (shown in Figure 6.2) is designed to exploit the aggregate main memory capacities of multiple nodes in a cluster using the APGAS model. The master node coordinates the various activities of the worker nodes, and is responsible for the data partitioning process. It provides the network interfaces that clients can connect to, and queues the records received from GPS-enabled mobile devices and sensors. The Coordinator redirects the records to the Data Updater component in the nodes, based on the data placement. It is assumed that a mobile device can send periodic or event-based location updates to our system. Managing a location update involves first inserting it into the in-memory store (Memstore) and the persistent stores (Global Store and Local Store), and then updating the in-memory indexes (Global and Local Indexes).

Figure 6.3: DISTIL+ Update Processing Overview

Location-based queries, submitted by clients, are handled by the Scheduler component. It is responsible for generating the query plan, distributing the plan fragments to individual Query Processors, and producing the final query result by aggregating the partial results from each Query Processor. A concise overview of DISTIL+'s update processing is provided in Figure 6.3, and its spatio-temporal range and kNN query processing is summarized in Figure 6.4. The remainder of this section is organized as follows. In Section 6.4.1, we discuss how our system partitions data. Section 6.4.2 covers local and distributed data persistence. In Section 6.4.3, we outline and compare several indexing techniques which we considered for our system, and present our indexing technique in Section 6.4.3.1. Finally, in Section 6.5, we elaborate on DISTIL+'s kNN query processing algorithm. More detailed discussions of DISTIL+'s update processing and range query processing are beyond the scope of this chapter and are provided in Appendix D.

6.4.1 Data Partitioning

Data partitioning enables managing data that is larger than a single node's memory. In our system, the spatial domain is partitioned into grid cells (tiles) that are distributed using a tile placement policy (discussed later) among the nodes in a cluster.

Figure 6.4: DISTIL+ Query Processing Overview. (a) Spatio-temporal Range Query Processing; (b) Spatio-temporal kNN Query Processing

The master node maintains the tile information using the SpatialGridPlace structure, which includes a hashtable for mapping each tile (tileId) to a node (nodeId). Furthermore, each APGAS place corresponds to a physical node within the cluster. Each place (including place 0) hosts a subset of tiles (SpatialGridTile) mapped into a SpatialGrid object. Here, SpatialGrid is a hashtable with tileId as the key and SpatialGridTile as the value. The role of the SpatialGrid object is to maintain information regarding the tiles it is responsible for. The SpatialGrid objects are stored in a distributed array. In cases where per-node data structures are required, we create distributed arrays with the same length as the number of places. Each array element corresponds to a node. In this case, we created the distributed array SpatialGrids to hold each node's SpatialGrid object. Each tile contains information about the location updates of the moving objects that are received at different timestamps.

The placement of the spatial tiles can impact system performance. We outline two tile placement policies: Row-wise Round-Robin partitioning (RRR) and Multi-Dimensional Range partitioning (MDR). Row-wise Round-Robin partitioning (RRR): In this case, each row of tiles is traversed and assigned to a node. The first row is assigned to node 0, the second to node 1, and so on, following a round-robin ordering. The row-wise round-robin approach aims to achieve a balance between parallelism, load balancing overhead, and data skew. It preserves some of the locality. Note that we do not evaluate a basic round-robin assignment of tiles to nodes, as it does not preserve locality. Multi-Dimensional Range partitioning (MDR): Here, the aim is to assign nearby tiles to nodes so that locality is improved. The MDR policy is implemented by linearizing the tiles using a space-filling curve (SFC), such as the Z-order or Hilbert curve, and then following this SFC order and mapping each tile in a given range to a particular node. In our system, the tile assignment approximates a Z-order distribution.
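To make the two policies concrete, the following is a small Java sketch of how tiles in an L × L grid could be mapped to places under RRR and MDR. It is a simplified illustration, not DISTIL+'s actual implementation: the method names are hypothetical, and the MDR variant assumes L is a power of two so that the Z-order curve is dense.

```java
import java.util.HashMap;
import java.util.Map;

public class TilePlacement {

    // Row-wise Round-Robin (RRR): each row of tiles is assigned to one place,
    // and rows are handed out to places in round-robin order.
    static Map<Integer, Integer> rrr(int L, int nPlaces) {
        Map<Integer, Integer> tileToPlace = new HashMap<>();
        for (int row = 0; row < L; row++)
            for (int col = 0; col < L; col++)
                tileToPlace.put(row * L + col, row % nPlaces);
        return tileToPlace;
    }

    // Multi-Dimensional Range (MDR): linearize tiles with a Z-order curve and map
    // contiguous ranges of the curve to places, so nearby tiles tend to share a node.
    // Assumes L is a power of two; the min() guards against uneven division.
    static Map<Integer, Integer> mdr(int L, int nPlaces) {
        int tilesPerPlace = (L * L + nPlaces - 1) / nPlaces;
        Map<Integer, Integer> tileToPlace = new HashMap<>();
        for (int row = 0; row < L; row++)
            for (int col = 0; col < L; col++) {
                int place = (int) Math.min(nPlaces - 1, zOrder(row, col) / tilesPerPlace);
                tileToPlace.put(row * L + col, place);
            }
        return tileToPlace;
    }

    // Interleave the bits of (row, col) to form the Z-order (Morton) index.
    static long zOrder(int row, int col) {
        long z = 0;
        for (int b = 0; b < 16; b++) {
            z |= ((long) (col >> b) & 1L) << (2 * b);
            z |= ((long) (row >> b) & 1L) << (2 * b + 1);
        }
        return z;
    }
}
```

For example, with L = 8 and four places, rrr cycles the rows through places 0, 1, 2, 3, 0, ..., whereas mdr gives each place one 4 × 4 quadrant of the grid, which is the locality-preserving behavior described above.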

6.4.2 Data Persistence

As mentioned earlier, in our system, data (i.e., location records) is stored in an in-memory store (Memstore) to support low-latency updates and fast query processing. To ensure that no data is lost, two persistent stores are used (shown in Figure 6.2): a Local Store hosted at each node, and a Global Store, as described below. Memstore serves as the de facto in-memory cache by storing the most recent and active location data, whereas the persistent stores permanently keep historical data. As ‘hot’ data in memory becomes ‘cold’, it can be removed from Memstore to make room for new data.

6.4.2.1 Local Store

Each node stores the location records corresponding to the spatial tiles it hosts. A new location record is serialized and stored in the Local Store. A vital property of this store is support for fast insert operations. Our implementation is based on LevelDB [100], which is a fast key-value store developed by Google. Records are sorted by key and stored in byte-array form. LevelDB is based on the idea of the Log-Structured Merge-Tree (LSM-tree) [126]. The LSM-tree index is appropriate for applications that require support for high update rates. It splits a logical tree into several physical components so that the most-recently-updated portion of the data is in a component that fits entirely in memory. Once the in-memory tree component reaches a threshold, a new in-memory component is created, and the old component is merged with the disk components. To minimize the cost of reads, the disk component is merged with other on-disk components in the background.
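Because LevelDB sorts records by their byte keys, the layout of the serialized key determines how records are clustered on disk. The Java sketch below shows one plausible way to encode a location record into a sortable key and value; the (tileId, timestamp, objectId) layout is a hypothetical format chosen for illustration, it is not DISTIL+'s actual on-disk schema, and no LevelDB API is used.

```java
import java.nio.ByteBuffer;

public class RecordKey {
    // Big-endian, fixed-width fields keep the byte-wise sort order equal to the
    // (tileId, timestamp, objectId) lexicographic order for non-negative values,
    // so records for a tile are clustered and time-ordered inside the LSM-tree.
    static byte[] encodeKey(int tileId, long timestamp, long objectId) {
        return ByteBuffer.allocate(4 + 8 + 8)
                .putInt(tileId)
                .putLong(timestamp)
                .putLong(objectId)
                .array();
    }

    // The value holds the remaining record fields as a flat byte array.
    static byte[] encodeValue(double x, double y, double velocity, double direction) {
        return ByteBuffer.allocate(4 * 8)
                .putDouble(x).putDouble(y)
                .putDouble(velocity).putDouble(direction)
                .array();
    }
}
```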

6.4.2.2 Global Store

Since the Local Store at each node persists a subset of the data, there is a possibility of data loss if that node fails. Therefore, the data is transferred from the Local Store to the Global Store by using an offline process, which runs periodically. As it is an offline process, it does not affect the system performance. The Global Store is based on Hadoop Distributed File System (HDFS), in which the content of a file is split into blocks and replicated across the nodes [154].

6.4.3 Indexing Techniques

As part of our work in developing DISTIL+, we explored the features and usability of several indexing techniques. Location-aware applications commonly store data that describes the geographical location of moving objects at different timestamps. This type of application generates a large amount of data, creating the need for efficient indexing techniques that can assist with query processing.

Spatio-temporal indexes combine information from space and time to accelerate the process of locating and identifying objects, thus avoiding expensive linear scans of the whole database. A common approach, known as the Grid, partitions space into a uniform grid of cells that consist of tessellated shapes (such as squares or cubes). Objects are allocated to grid cells based on their location and are typically replicated if they span multiple cell boundaries [12, 159]. Grid indexes are ideal for supporting fast updates, as adding or removing objects is trivial. An alternative approach is to map spatio-temporal data into a one-dimensional locality-preserving representation, which can then be used to build an index and query the data. Space-filling curves, such as the Z-Order Curve [118] and the Hilbert Curve [90], provide a means to linearize spatial data. Implementations, such as Geohash [124], use space-filling curves to map locations to a binary string. The mapping preserves locality such that two strings with identical prefixes refer to locations that are situated close to each other in the multidimensional space, making it suitable for range queries. The string can then be combined with a temporal hash to produce a record key, which is ready to be stored in a common index data structure, such as a B-tree or hash table. Tree-based approaches are suitable for situations where query performance is the main priority. Tree data structures, such as the R-tree [64], multidimensional binary tree (k-d tree) [22], Quadtree [48, 161], B-tree [33], and their many variants, are used for building multidimensional indexes in a wide variety of applications. Table 6.1 presents a brief summary of these approaches. The key idea behind these tree-based data structures is to organize data into nodes and levels using different techniques that allow search algorithms to quickly find requested objects, ranges, or nearby neighbors. In the following section, we describe our index architecture, which utilizes a combination of Grid and Quadtree.

Table 6.1: Index Data Structure Comparison

R-tree. Basic concept: a balanced search tree that groups objects (in leaf nodes) using minimum bounding rectangles. Ideal usage: search queries and bulk loading.
k-d tree. Basic concept: a binary tree that cycles through each dimension and splits into two subtrees using the median (actual or sample). Ideal usage: high-dimensional data and bulk loading.
Quadtree. Basic concept: each node recursively splits space into four quadrants until each quadrant contains the desired number of objects. Ideal usage: high update rates and 2D data.
B-tree. Basic concept: a balanced search tree built using one-dimensional keys, where each node is a block of multiple entries. Ideal usage: pre-linearized data, spatial indexing in relational databases.
Grid. Basic concept: partition space into a uniform grid of cells; objects on cell borders are handled with techniques such as replication. Ideal usage: high update rates, large spatial domains.

6.4.3.1 DISTIL+ Distributed In-Memory Spatio-Temporal Index

The main idea of our indexing system is to discretize the spatial and temporal dimensions. The index is organized as a multi-level hierarchical index. As can be seen in Figure 6.5, there are three levels in this hierarchy. Level 1 constitutes the global index, whereas Level 2 and Level 3 make up the per-node local index. The local index is composed of a spatial index (SI) and a collection of partial temporal indexes (PTIs).

There are N nodes such that each node {N_i | i = 0...N − 1} is mapped to an APGAS place {P_i | i = 0...N − 1}, where place P_0 is mapped to the master node N_0. The spatial domain is organized as an L × L grid, {G_{r,s} | r = 1...L, s = 1...L}. Each grid cell (tile) maintains a PTI (partial temporal index) for a set of discrete time intervals {T_t | t = 1...W}, where W is a configurable parameter. Essentially, a PTI identifies the activities of each moving object with ID o.id that was inside a grid cell G_{r,s}. To that end, each PTI maintains an interval table, Ib_{r,s,t}.

Figure 6.5: DISTIL+ Index Architecture

Each entry of Ib_{r,s,t} is a tuple, i.e., Ib_{r,s,t} := (Bv_{r,s,t}, Ht_{r,s,t}). Here, Bv_{r,s,t} is a bit-vector such that Bv_{r,s,t} = <B_u | u = 0...M − 1>, and bit B_u is set to 1 if object u was inside grid cell (r, s) during time interval t; otherwise, bit B_u is set to 0. The key of the hash table Ht_{r,s,t} is the ID of the moving object m, and the value, Rl_{r,s,t,m}, is a record ID list Rl_{r,s,t,m} = <C_v | v = 0...D − 1>. Here, D is the maximum record ID, which could be a long integer. Rl_{r,s,t,m} contains the record ID of each location update for moving objects that were inside grid cell G_{r,s} during time interval T_t. These record IDs can be used to retrieve the actual location record that is in the Location table (mentioned in Section 6.4.2) and to inspect individual fields of a record.

A function f is used to map each grid cell (tile) G_{r,s} to a place P_i, where f : G_{r,s} → P_i.

Figure 6.6: Illustration of kNN query (STkNNQ) processing

Therefore, a place (i.e., its corresponding node) is responsible for a particular set of tiles, and there is no need for synchronization among different places while processing a query. On system start-up, the tiles are assigned to the available nodes following a placement policy, as described previously. The above concepts are implemented by the following data structures and are represented in Figure 6.5. The global index (Level 1) is denoted by GIndex. It is a Quadtree index that is based on a partitioning of the spatial domain and is used to quickly determine the cells covered by a node. The STIndex object represents the per-node local index (Levels 2 and 3). It maintains information about each tile in the spatial domain (i.e., G_{r,s} above). For each tile, a partial temporal index (Level 3) is represented by the PTIndex object (i.e., Ib_{r,s,t} above). Each entry in PTIndex is a tuple represented by TileIndexDatetimeObjectInfo. It consists of a bit-vector Bv_{r,s,t}, represented by the Bit-vector object, and a hash table, Ht_{r,s,t}, denoted by the RIDListMap object. The keys in the hash table object RIDListMap are object IDs, ObjId, corresponding to the objects that were inside a tile during a time interval. Each value in RIDListMap is a list, Rl_{r,s,t,m}, represented by RIDList, containing the record identifiers (rid). These are the location records for a particular moving object.
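The per-tile, per-interval entry just described boils down to a bit-vector plus a map from object IDs to record-ID lists. The following Java sketch captures that shape; it is illustrative only, and the class and method names do not correspond to DISTIL+'s actual X10 source.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of one partial temporal index (PTI) entry: the bit-vector marks which moving
// objects appeared in this tile during one time interval, and the map (RIDListMap-like)
// lists the record IDs of their location updates.
public class PtiEntry {
    private final BitSet presentObjects = new BitSet();                   // one bit per object ID
    private final Map<Integer, List<Long>> ridListMap = new HashMap<>();  // objId -> record IDs

    void addUpdate(int objId, long recordId) {
        presentObjects.set(objId);
        ridListMap.computeIfAbsent(objId, id -> new ArrayList<>()).add(recordId);
    }

    // Returns the record IDs of all objects seen in this tile during the interval;
    // iterating the bit-vector lets callers skip absent objects (and empty intervals) cheaply.
    List<Long> recordIdsInInterval() {
        List<Long> result = new ArrayList<>();
        for (int objId = presentObjects.nextSetBit(0); objId >= 0;
                 objId = presentObjects.nextSetBit(objId + 1)) {
            result.addAll(ridListMap.get(objId));
        }
        return result;
    }
}
```

The bit-vector makes the common "which objects were in this tile during this interval?" check cheap, while the map defers touching the much larger record-ID lists until a query actually needs them.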

6.5 Spatio-temporal kNN Query Processing

In this section, we discuss our spatio-temporal kNN query (STkNNQ) processing algorithm. We support two variants of this query: Present and Historical. The Present STkNNQ algorithm utilizes the Historical STkNNQ algorithm, but specifies the end timestamp as "now" and the start timestamp a few minutes earlier (a configurable parameter). Next, we describe the STkNNQ processing. A spatio-temporal kNN query (STkNNQ), q, is specified by (k, q.c, q.tw), where k is the number of nearest neighbors we wish to find, q.c specifies the coordinates from which to begin the search, and q.tw the time window. It returns the top k objects that were nearest to q.c during q.tw, sorted by distance. The search area begins around q.c and expands in each iteration. Figure 6.6 illustrates the distributed execution of an STkNNQ query. Local pruning is used to reduce the amount of inter-node communication. Algorithm 3 describes how kNN queries are processed. The global index is used to find the tile containing the query point (line 2). The algorithm repeats from line 3 until the top k nearest neighbors are found, or the number of search area expansions exceeds its threshold. By default, the upper limit on expansions is set so that the entire dataset is eventually covered. In each iteration, the search area is expanded using the global index. As the search expands, additional neighboring tiles are selected to be processed (line 5). Tiles that were fully processed in previous iterations are excluded (line 6). For each tile, its corresponding place is identified. Each participating place builds a list of local tiles to be processed (lines 7 to 10). Each place executes the query on its list of tiles in parallel. If the place finds more than k tuples matching the query criteria, the set is locally pruned according to an adaptive pruning technique (described in Section 6.5.1). The resulting list of kNN candidate tuples is then merged (lines 12 to 16). Once all participating places have returned their results, the termination criteria are evaluated (line 17). If either criterion is met, the result list is sorted and the top-k tuples are returned as qryResult (lines 18 and 19). Otherwise, the search continues (line 21).


Algorithm 3: Spatio-temporal kNN Query
Input: A kNN query, (k, q.c, q.tw), where k is the number of nearest neighbors to find, q.c is the spatial coordinates of the kNN point, and q.tw is the query time window. GIndex is the global index, and STIndex is the local spatio-temporal index. The final result is stored in qryResult. expansionLimit is the maximum number of kNN iterations (search area expansions).
1  // Global Processing
2  t.id ← GIndex.tid(q.c)  // find the ID of the tile containing q.c
3  i ← 0  // counter used to keep track of kNN iterations
4  // Repeat and expand the spatial search area until k neighbors are found or the expansionLimit threshold has been exceeded
5  C_i ← GIndex(t.id, k)  // iteratively expand the search area using the global index, store tiles in set C_i
6  C_i ← C_i − Σ_{n=0}^{i−1} C_n  // exclude previously processed tiles when i > 0
7  for c ∈ C_i do
8      plc ← find place from tile c
9      at Place plc: qryTiles(plc).add(c)
10     queryPlaces(plc).add(plc)
11 // Local Processing
12 for plc in queryPlaces do
13     query ← new STkNNQuery(qryTiles(plc), STIndex(plc), q.c, q.tw, k)
14     queries(plc) ← execute query  // process and prune results early if greater than k
15     result ← fetch result at plc from queries(plc)
16     qryResult ← merge result with qryResult
17 if qryResult.size() ≥ k or i > expansionLimit then
18     qryResult.kNNsort(q.c)  // sort tuples by distance to q.c, and prune to the k nearest tuples
19     return qryResult
20 else
21     Increment i and repeat from line 5

6.5.1 kNN Pruning

Communication overhead and load imbalance are two major obstacles to achieving efficient scaling in distributed kNN queries. When a distributed kNN query is processed, it involves multiple nodes participating in finding objects within an expanding circle until the query constraints are satisfied. These kNN query fragments may return records that are not in the final result set, either due to their temporal attributes or simply because they are not part of the top-k. If all the query fragment results are naively collected, this can incur unnecessarily large communication overhead, as well as a sorting overhead at the master node. Sorting the results can be particularly time-consuming, as it also involves calculating the distance of each object from the kNN query coordinates. However, pruning this data can also hinder query latency if the local processing is too slow. This issue motivated us to pursue a method to effectively prune results at the local node level. The following describes how each place prunes its local result set before submitting it for collection. Let us assume that a node is instructed to process a set of tiles. The tiles are first filtered on the temporal attribute, creating a partial query result of size n. If n is less than k, then no pruning takes place and the results are submitted as-is. If n is greater than or equal to k, the results are locally pruned and only the top k are submitted. When the kNN query requires multiple iterations to complete, each place keeps track of tiles that were fully processed, as well as tiles that were partially processed in previous steps. During each iteration of the kNN algorithm, the results of fully processed tiles are stored, and these tiles are omitted from being re-processed in future kNN iterations.
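A common way to realize this "keep only the k nearest" step is a bounded max-heap, as in the Java sketch below. It is a simplified illustration of the local pruning idea, not DISTIL+'s implementation, and the names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of the local pruning step: each place retains at most its k nearest candidates
// (by distance to the query point) before shipping results to the master node.
public class LocalKnnPruner {
    record Candidate(long objId, double distance) {}

    // Max-heap of size k ordered by distance: the farthest retained candidate sits on top,
    // so any candidate farther than it can be discarded immediately.
    static List<Candidate> pruneTopK(Iterable<Candidate> partialResult, int k) {
        PriorityQueue<Candidate> heap =
                new PriorityQueue<>(Comparator.comparingDouble(Candidate::distance).reversed());
        for (Candidate c : partialResult) {
            if (heap.size() < k) {
                heap.add(c);
            } else if (c.distance() < heap.peek().distance()) {
                heap.poll();
                heap.add(c);
            }
        }
        // At most k candidates, unsorted; the master sorts the merged set globally.
        return new ArrayList<>(heap);
    }
}
```

Each candidate is compared against the farthest retained one, so a place never holds more than k candidates regardless of how many tiles it processes, which bounds both the local sorting work and the data shipped to the master.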

Table 6.2: Experiment Parameters (default settings bolded)

Cluster size (num. of nodes): 2, 4, 8
Data system: DISTIL+, MD-HBase, GeoSpark, LCIndex, ST-Hadoop
Space domain: Texas, 1251 km × 1183 km
Num. of road segments: 56,832,846
Dataset size (num. of records): 10M, 20M, 40M, 60M, 80M, 100M
Time duration (total timestamps): 1000
Tile placement policies: RRR, MDR
Update worker threads: 4, 8
Update batch size (num. of records): 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000
Query worker threads: 1, 2, 4, 8
Range query area (spatial window), km²: 20 (smallR), 137 (medR), 254 (largeR)
Present query temporal interval (% of total timestamps): 1.5
Historical query temporal interval (% of total timestamps): 5, 10, 15, 20, 100
kNN query k parameter: 250, 500, 1000, 2000, 4000, 8000

6.6 Evaluation

In this section we evaluate our system, DISTIL+, in various settings. We first describe the experimental setup, datasets and configuration parameters used in our experiments.

6.6.1 Experimental Setup

To conduct our experimental evaluation, we used a dataset that is based on real-world spatio-temporal data. It consists of 100 million records that track the location and movement of objects across time.

In addition to the 100 million record dataset, we also generated five smaller datasets that contain 80M, 60M, 40M, 20M, and 10M records. Each of these smaller datasets contains a random subset of objects from the initial one (100M). The dataset generation process is described next. The polyline (edges) shapefiles of Texas from the TIGER dataset [169] were fed into the mobility trace generator MOTO [119] to generate the traces. We modified MOTO to generate multiple trace files, each file for a different table fragment. Mobility traces were generated for each object for 1000 timestamps (equivalent to 1000 minutes). The experiments were conducted on a cluster of 8 machines that are connected together using a Gigabit Ethernet switch. Each machine contains two Intel(R) Xeon(R) E5472 CPUs @ 3.00GHz with a total of 4 × 2 processor cores, 12 × 2MB L2 cache, 16GB RAM, and a 500GB HDD. This gives the cluster a total of 64 processing cores, 128GB of RAM, and 4TB of storage. The machines are configured with the Ubuntu 16.04 64-bit operating system, Oracle JDK 1.8.0_77, and the X10 2.6.0 compiler.

6.6.2 Workloads and Experimental Parameters

DISTIL+ supports several parameters that can be used to adjust the update and query workloads. Range queries and kNN queries each have their own parameters. Additionally, the query's temporal range determines whether it is a Present or Historical query. We use these parameters to generate batches of queries with different settings; for instance, we vary the spatial window (smallR, medR, and largeR), temporal interval, and number of worker threads. Note that we report query latency in order to indicate the speed at which a system can perform a single query. Table 6.2 shows the experiment parameter settings. The default parameters are shown in bold.


Table 6.3: Data System Attributes and Features

DISTIL+. Framework: APGAS. Location updates: yes. Temporal indexing: yes. Range query: yes. kNN query: yes. Index used: Quadtree and Grid.
MD-HBase. Framework: HBase. Location updates: yes. Temporal indexing: yes. Range query: yes. kNN query: yes. Index used: Quadtree on KV Store.
GeoSpark. Framework: Spark. Location updates: no. Temporal indexing: no. Range query: yes. kNN query: yes. Index used: R-tree.
LCIndex. Framework: HBase. Location updates: yes. Temporal indexing: yes. Range query: yes. kNN query: no. Index used: Clustering Index.
ST-Hadoop. Framework: Hadoop. Location updates: no. Temporal indexing: yes. Range query: yes. kNN query: yes. Index used: k-d tree.

6.6.2.2 kNN Query (STkNNQ) k parameter

Figure 6.8 depicts the effect of varying the k parameter on kNN query latency. The results show that queries that can be satisfied by expanding the same number of tiles achieve similar query latency. As the k parameter is increased, the query eventually requires more expansions to find the desired number of nearest neighbors. The results are in line with our expectations, as we observe an increase in query latency corresponding to processing more records and additional tile lookups. Thus we demonstrate that DISTIL+ continues to scale well for larger values of k, with a 32× increase in k (from 250 to 8000) only taking about 3.2× longer to process.

6.6.3 Comparison with Other Systems

For our comparison experiments, we compare DISTIL+'s update and query performance against several other systems. We selected representatives from different system categories: GeoSpark from the spatial big data system category, MD-HBase and LCIndex representing HBase-based NoSQL systems, and ST-Hadoop, which is based on SpatialHadoop, representing the Hadoop-based systems. The data systems and their features are summarized in Table 6.3.

Figure 6.9: Insert/Update Throughput (records/s) Comparison (GeoSpark and ST-Hadoop do not support location updates)

6.6.3.1 Update Performance

In Table 6.4, we show the update throughput of our system, DISTIL+, compared to MD-HBase and LCIndex. DISTIL+ achieves significantly higher throughput than both MD-HBase and LCIndex. Note that we extended the publicly available MD-HBase code [125] to support multi-threaded updates. We could not measure the update throughput of GeoSpark or ST-Hadoop, as they do not support location updates. Instead, we measured a bulk insertion workload consisting of the time to copy the data into HDFS, and partition and index the data. In all cases, the startup overhead of Spark or Hadoop (which is non-existent for DISTIL+) is not factored into the throughput. We measured an average insertion throughput of around 288,358 records/s for GeoSpark, and 17,528 records/s for ST-Hadoop. We summarize the average update/insertion throughput of all the tested systems in Figure 6.9. Although GeoSpark can achieve a somewhat higher bulk insertion throughput compared to DISTIL+'s update throughput, it cannot update the records after insertion, and must instead rebuild its index from scratch. DISTIL+ is the only tested system that achieves high-throughput location updates.

Table 6.4: Update Throughput (records/s) Dataset Scaling Comparison

System      10M       20M       40M
DISTIL+     265,455   272,591   273,706
MD-HBase    3,210     2,673     2,659
LCIndex     4,548     3,912     3,860

6.6.3.2 Query Performance

GeoSpark does not support temporal data. To perform a fair comparison between all five systems, we evaluated a spatio-temporal range query over the entire time window of the 10M dataset. In Figure 6.10a, we present the range query latencies for this dataset, comparing DISTIL+, MD-HBase, GeoSpark, LCIndex, and ST-Hadoop. GeoSpark is faster than ST-Hadoop by around 3× due to its memory-based processing. DISTIL+, MD-HBase, and LCIndex are an order of magnitude faster than GeoSpark and ST-Hadoop because they are not hindered by the various overheads associated with Hadoop MapReduce and Spark. Our system achieves the lowest latency while executing a batch of 64 queries. In Figure 6.10b, we evaluate a kNN query on the same dataset with k = 500. Here, we note that the average latencies of MD-HBase and DISTIL+ are almost identical, while GeoSpark is approximately 35× slower, and ST-Hadoop is about 47× slower. This workload is not applicable to LCIndex because it does not support kNN queries. It is also worth noting that MD-HBase does not perform temporal filtering for its kNN query processing, even though it possesses a temporal index. Our system's results are notable in light of its efficient temporal filtering as well as its considerable advantage in update throughput.

Figure 6.10: Spatio-temporal query latency comparison. (a) Range query comparison; (b) kNN query comparison

6.7 Chapter Summary

Location-Based Services (LBS) are contributing to a massive growth of spatio-temporal data. It is becoming ever more important to develop scalable data systems that can handle this torrent of data. The key requirements of LBS are support for high location update throughput and efficient concurrent location-oriented queries. We presented DISTIL+, a distributed in-memory data processing system for spatio-temporal data. Our system is based on the APGAS model and implemented with the X10 language. DISTIL+ utilizes distributed in-memory data structures to leverage the aggregate memory and processing power of a cluster of machines, and supports fast updates and spatio-temporal queries on large datasets. Through extensive evaluation, we demonstrated that DISTIL+ can support scalable location-based services with high insertion throughput and low query latency. Our system performs significantly better than the four other systems that we evaluated: GeoSpark, MD-HBase, LCIndex, and ST-Hadoop. For example, we showed that DISTIL+'s update throughput was up to two orders of magnitude faster than MD-HBase and LCIndex, and over an order of magnitude faster than ST-Hadoop.

We also demonstrated that DISTIL+ is faster in range and kNN query performance, by up to 46× compared to GeoSpark and 47× compared to ST-Hadoop. DISTIL+ is the only tested system that was able to produce both high update throughput and low query latency, with full support for spatio-temporal kNN and range queries. In summary, DISTIL+ provides a rich set of features that are important for LBS, such as concurrent location updates, multilevel spatial and temporal indexing, data persistence, and scalable spatio-temporal range and kNN query processing. The key contributions of this chapter were as follows:

• We designed a scalable spatio-temporal big data system, called DISTIL+, which can be deployed on a cluster of machines, is based on the APGAS model, and provides high velocity concurrent updates and low latency query processing.

• We proposed a novel distributed in-memory spatio-temporal index that supports efficient concurrent location-oriented kNN queries, and demonstrated that query latency scales favorably even for large values of k and large datasets.

• We conducted extensive experiments with analysis to demonstrate the scalability and efficiency of our system, and to compare our system with four other distributed data systems.

Chapter 7

Conclusion

In this thesis, we have presented techniques to speed up main memory query processing by leveraging algorithmic improvements, novel data structures, modern hardware, and emerging programming frameworks, and by modifying or overriding some of the operating system's behaviors and components. The main focus of our contributions has been to provide insights and ideas toward faster and more efficient data analytics applications. In Chapter 3, we showed that skewed or shuffled data can significantly hinder in-memory hash join performance. We demonstrated that it is possible to obtain substantial speedups in these joins by modifying or replacing the underlying hash table. To that end, we proposed the Maple Hash Table, a novel hash table designed for in-memory hash joins that further improved runtime by up to 13.7×. In Chapter 4, we conducted a multi-dimensional analysis of in-memory aggregation, including an extensive experimental evaluation of our implementation and 12 other algorithms and data structures. We discovered that the best approach can be selected based on specific criteria that are derived from the workload characteristics. We then devised a decision flowchart that provides guidance on which approaches work best, based on the workload.

Lastly, in Chapter 6, we presented DISTIL+, a scalable spatio-temporal data system that uses a distributed in-memory programming framework and a novel multi-level index to accelerate kNN queries on a cluster of machines. We demonstrated that DISTIL+ can support scalable location-based services with high insertion throughput and low query latency. We also showed that our system performs significantly better than four other data systems that we evaluated, providing orders of magnitude faster update processing, as well as up to 47× faster kNN query processing.

Throughout this thesis, we have provided novel solutions and fresh insights, and explored the factors that influence main memory query performance in the software, hardware, and data domains. We have supplemented our findings with experimental comparisons against popular and state-of-the-art implementations, accurate runtime and memory efficiency measurements, performance counters such as cache and TLB misses, and in-depth analysis of the experimental results. Additionally, we have developed improvements to dataset generation methods that supported our research on main memory query processing, and constructed decision diagrams to guide the application of our approaches. We have made substantial contributions on multiple fronts in main memory query processing and data analytics, and we expect that these contributions will help researchers and practitioners advance the field. In the following section, we outline some future research directions that would be interesting to explore.

7.1 Future Work

As mentioned earlier, low latency is critical to the role of data analytics applications in decision making. Traditionally, database system architectures have been designed with a primary focus on either Online Analytical Processing (OLAP) or Online Transactional Processing (OLTP). This duality can create an increase in latency that is undesirable for applications that are required to handle a high rate of incoming data and simultaneously provide real-time insights. Hybrid Transactional and Analytical Processing (HTAP) is a system architecture that combines real-time analytics and high throughput transactional processing [128]. Main memory query processing is a key ingredient in achieving the low latency required for HTAP systems. Ongoing challenges within this domain include handling concurrent mixed queries that combine transactional and analytical processing, efficient indexing techniques that support high update rates as well as low latency lookups, and handling the fallback to persistent storage when free memory is exhausted.

Future research for our hash table (Maple Hash Table) would include the utilization of new SIMD instructions to reduce conditional branches and contention, and to leverage greater instruction-level parallelism. It is expected that new and upcoming CPU architectures will go beyond merely increasing the vector widths of older SIMD instructions, and will support additional features that are in high demand. For example, AVX2 adds support for gathering data from non-contiguous memory locations, an operation that has utility in main memory query processing. Finding ways to unlock more performance by utilizing these instructions is an avenue that is ripe for exploration. Efficiently utilizing SIMD is arguably one of the most challenging tasks in application development. As a result, there is an ongoing effort to streamline or automate this task using techniques such as machine learning to facilitate smarter compiler optimizations [92].
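As a concrete illustration of the gather operation mentioned above, the short C++ sketch below (ours, illustrative only; the flat table layout and mask-based hash are hypothetical simplifications, not Maple Hash Table's actual probe path) uses the AVX2 _mm256_i32gather_epi32 intrinsic to fetch eight hash table slots from non-contiguous locations with a single instruction:

// Illustrative sketch only: probing eight hash table slots at once with an AVX2
// gather. The flat table and the mask-based hash are hypothetical simplifications.
// Build with: g++ -O2 -mavx2 gather_sketch.cpp
#include <immintrin.h>
#include <cstdio>

int main() {
    alignas(32) int table[256];
    for (int i = 0; i < 256; ++i) table[i] = i * 10;          // dummy payloads

    alignas(32) int keys[8] = {3, 17, 42, 5, 99, 200, 7, 64};
    __m256i vkeys = _mm256_load_si256(reinterpret_cast<const __m256i*>(keys));

    // A toy hash function: index = key & (tableSize - 1).
    __m256i mask = _mm256_set1_epi32(255);
    __m256i vidx = _mm256_and_si256(vkeys, mask);

    // Gather eight 32-bit slots from non-contiguous positions (scale = 4 bytes).
    __m256i slots = _mm256_i32gather_epi32(table, vidx, 4);

    alignas(32) int out[8];
    _mm256_store_si256(reinterpret_cast<__m256i*>(out), slots);
    for (int i = 0; i < 8; ++i) std::printf("%d ", out[i]);
    std::printf("\n");
    return 0;
}

Whether such vectorized probing pays off depends on how well branches, collisions, and memory latency can be hidden, which is precisely the kind of question this future work would need to answer.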

Another interesting research avenue would be to provide better support for NUMA architectures by integrating "smart" data structure features that facilitate replication, horizontal partitioning, and interleaving of the hash table. In this context, it would be interesting to explore alternative concurrency control techniques, such as flat combining [67], lock-free data structures, cooperative multitasking, and hardware transactional memory.

In recent years, emerging memory technologies that provide Non-volatile Random Access Memory (NVRAM) have gained traction due to improved affordability and density, and a greater demand for fast persistent storage. NVRAM provides byte-addressable persistent storage. This creates many exciting opportunities for efficient main memory query processing by bridging the performance gap between DRAM and block-based persistent storage (HDDs and SSDs). Wider adoption of this technology is motivating the development of new algorithms and data structures that are specifically designed to work in a hybrid NVRAM-DRAM environment, thus utilizing the capabilities of both technologies [191]. Although it may be early to predict the future of this technology, NVRAM has the potential to revolutionize the memory hierarchy by replacing both persistent storage and main memory. In such a scenario, the implications for query processing and database systems would be significant.

On the topic of query processing on NUMA systems, potential future work would include delving deeper into memory allocator fine-tuning and adapting the allocator to the target architecture by adjusting parameters such as thread-local cache size, chunk alignment, and the size and number of arenas. It is conceivable that memory allocators may need to become NVRAM-aware in addition to becoming NUMA-aware. New NUMA topologies, such as multi-socket multi-chip-module [97] systems, entail an even larger hierarchy of memory access latencies and bandwidths. This presents interesting new challenges and opportunities for researchers, as existing approaches may need to be revised for optimal performance. Lastly, it would be useful to develop a tool that can automatically set good system defaults (memory allocator, placement policy, and schedulers) without requiring much tuning by the developer.

This tool could scan several system parameters and use a model-based approach to determine the configuration. Alternatively, the tool could run a compact suite of microbenchmarks to empirically determine a good configuration.

As the next step for the distributed spatio-temporal data system, it would be interesting to explore supporting NUMA-aware memory placement and allocation strategies, and implementing additional data operations, such as spatial joins, trajectory analytics, semantic reasoning, and machine learning algorithms. Semantic reasoners enrich the spatio-temporal data system with information from additional sources, which allows queries and query results to be expressed in terms that are more natural for humans [70]. For example, "fast-food restaurants near the UNB campus" or "bus moving south on Regent street" is easier to understand than a sequence of raw object IDs, coordinates, and timestamps. Further refinements could be made to the query processing engine, such as faster concurrent sorting algorithms to assist with calculating the k-nearest neighbors, as well as using inter-node communication to assist with result pruning. Lastly, given the promising results we obtained from IBM's X10 language, it would be interesting to explore the efficiency and utility of other languages that provide distributed in-memory computing, such as UPC++ [9].
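To make the result-pruning idea more tangible, the following C++ sketch (ours, purely illustrative; the per-node candidate lists and distances are hypothetical placeholders, and DISTIL+'s actual query engine is written in X10) merges per-node kNN candidates with a partial sort and derives the k-th best distance, which could then be broadcast back to the nodes as a pruning bound:

// Illustrative sketch: merging per-node kNN candidates and deriving a pruning
// bound. The candidate lists and distances are hypothetical placeholders.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

struct Candidate { int64_t objectId; double distance; };

// Merge the candidate lists received from each place and keep the k best.
std::vector<Candidate> mergeTopK(std::vector<std::vector<Candidate>> perNode, std::size_t k) {
    std::vector<Candidate> all;
    for (auto& v : perNode) all.insert(all.end(), v.begin(), v.end());
    auto byDistance = [](const Candidate& a, const Candidate& b) { return a.distance < b.distance; };
    if (all.size() > k) {
        std::partial_sort(all.begin(), all.begin() + k, all.end(), byDistance);
        all.resize(k);
    } else {
        std::sort(all.begin(), all.end(), byDistance);
    }
    return all;
}

int main() {
    std::vector<std::vector<Candidate>> perNode = {
        {{1, 0.9}, {2, 2.5}}, {{3, 0.4}, {4, 3.1}}, {{5, 1.7}}};
    std::vector<Candidate> top = mergeTopK(perNode, 3);
    double bound = top.back().distance;   // k-th best distance, usable as a pruning bound
    for (const Candidate& c : top)
        std::printf("obj %lld dist %.2f\n", static_cast<long long>(c.objectId), c.distance);
    std::printf("prune candidates with distance >= %.2f\n", bound);
    return 0;
}

Candidates whose distance already exceeds this bound could be discarded locally, reducing both the sorting work and the amount of data sent to the master node.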

References

[1] Ablimit Aji, Fusheng Wang, Hoang Vo, Rubao Lee, Qiaoling Liu, Xiaodong Zhang, and Joel Saltz, Hadoop-gis: A high performance spatial data warehous- ing system over mapreduce, VLDB Endowment 6 (2013), 1009–1020.

[2] Md Mahbub Alam, Suprio Ray, and Virendra C Bhavsar, A performance study of big spatial data systems, ACM SIGSPATIAL, ACM, 2018, pp. 1–9.

[3] Louai Alarabi, Mohamed F Mokbel, and Mashaal Musleh, St-hadoop: A mapreduce framework for spatio-temporal data, GeoInformatica 22 (2018), no. 4, 785–813.

[4] Dan Anthony Feliciano Alcantara, Efficient hash tables on the gpu, Ph.D. the- sis, University of California, Davis, 2011.

[5] Victor Alvarez, Stefan Richter, Xiao Chen, and Jens Dittrich, A comparison of adaptive radix trees and hash tables, ICDE, IEEE, 2015, pp. 1227–1238.

[6] Michael Anderson, Shaden Smith, Narayanan Sundaram, Mihai Capot˘a, Zheguang Zhao, Subramanya Dulloor, Nadathur Satish, and Theodore L. Willke, Bridging the gap between hpc and big data frameworks, VLDB En- dowment 10 (2017), 901–912.

[7] Austin Appleby, Smhasher suite, github.com/aappleby/smhasher [On- line. Last accessed: 2020-Jan-19], 2016.

159 [8] Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, et al., The landscape of paral- lel computing research: A view from berkeley, Tech. report, Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, 2006.

[9] John Bachan, Scott B Baden, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Dan Bonachea, Paul H Hargrove, and Hadia Ahmed, Upc++: A high- performance communication framework for asynchronous computation, IPDPS, IEEE, 2019, pp. 963–973.

[10] Michael Bader, Space-filling curves: an introduction with applications in sci- entific computing, vol. 9, Springer Science & Business Media, 2012.

[11] Furqan Baig, Hoang Vo, Tahsin Kurc, Joel Saltz, and Fusheng Wang, Sparkgis: Resource aware efficient in-memory spatial query processing, ACM SIGSPA- TIAL, ACM, 2017, p. 28.

[12] Mohamed Bakli, Mahmoud Sakr, and Taysir Hassan A Soliman, Hadooptrajec- tory: a hadoop spatiotemporal data processing extension, Journal of geograph- ical systems 21 (2019), no. 2, 211–235.

[13] Cagri Balkesen, Gustavo Alonso, Jens Teubner, and M Tamer Özsu, Multi-core, main-memory joins: Sort vs. hash revisited, VLDBJ 7 (2013), 85–96.

[14] Cagri Balkesen, Jens Teubner, Gustavo Alonso, and M Tamer Özsu, Main-memory hash joins on multi-core cpus: Tuning to the underlying hardware, ICDE, IEEE, 2013, pp. 362–373.

[15] Çağrı Balkesen, Jens Teubner, Gustavo Alonso, and M Tamer Özsu, Main-memory hash joins on modern processor architectures, TKDE 27 (2014), 1754–1766.

[16] Ronald Barber, Guy Lohman, Ippokratis Pandis, Vijayshankar Raman, Richard Sidle, G Attaluri, Naresh Chainani, Sam Lightstone, and David Sharpe, Memory-efficient hash joins, VLDBJ (2014), 353–364.

[17] Doug Baskins, General purpose dynamic array - judy, judy.sourceforge. net/index.html [Online. Last accessed: 2020-Jan-18], 2002.

[18] Andrew Baumann, Paul Barham, Pierre-Evariste Dagand, Tim Harris, Re- becca Isaacs, Simon Peter, Timothy Roscoe, Adrian Sch¨upbach, and Akhilesh Singhania, The multikernel: A new os architecture for scalable multicore sys- tems, SIGOPS SOSP, 2009, pp. 29–44.

[19] Andrew Baumann, Paul Barham, Pierre-Evariste Dagand, Tim Harris, Re- becca Isaacs, Simon Peter, Timothy Roscoe, Adrian Sch¨upbach, and Akhilesh Singhania, The multikernel: a new OS architecture for scalable multicore sys- tems, SOSP, ACM, 2009, pp. 29–44.

[20] R Bayer and E Mccreight, Organization and maintenance of large ordered in- dexes, Acta Informatica 1 (1972), 173–189.

[21] Paul Beame, Paraschos Koutris, and Dan Suciu, Skew in parallel query pro- cessing, PODS, ACM, 2014, pp. 212–223.

[22] Jon Louis Bentley, Multidimensional binary search trees used for associative searching, Communications of the ACM 18 (1975), no. 9, 509–517.

[23] Brad Benton, Ccix, gen-z, opencapi: Overview and comparison, OpenFabrics Alliance 13th Workshop, 2017.

161 [24] Emery D Berger et al., Hoard: A scalable memory allocator for multithreaded applications, ACM SIGARCH 28 (2000), no. 5, 117–128.

[25] Timo Bingmann, Stx b+ tree c++ template classes, github.com/ bingmann/stx-btree [Online. Last accessed: 2020-Jan-18], 2013.

[26] Timo Bingmann, STX B+ Tree, panthema.net/2007/stx-btree [On- line. Last accessed: 2020-Jan-18], 2019.

[27] Robert Binna, Eva Zangerle, Martin Pichl, G¨unther Specht, and Viktor Leis, Hot: A height optimized trie index for main-memory database systems, SIG- MOD, ACM, 2018, pp. 521–534.

[28] Sergey Blagodurov, Sergey Zhuravlev, Alexandra Fedorova, and Ali Kamali, A case for numa-aware contention management on multicore systems, PACT, ACM, 2010, pp. 557–558.

[29] Spyros Blanas, Yinan Li, and Jignesh M Patel, Design and evaluation of main memory hash join algorithms for multi-core cpus, SIGMOD, ACM, 2011.

[30] Peter Boncz, Angelos-Christos Anatiotis, and Steffen Kl¨abe, Jcc-h: Adding join crossing correlations with skew to tpc-h, TPCTC, Springer, 2017, pp. 103–119.

[31] Silas Boyd-Wickizer, Haibo Chen, Rong Chen, Yandong Mao, Frans Kaashoek, Robert Morris, Aleksey Pesterev, Lex Stein, Ming Wu, Yuehua Dai, Yang Zhang, and Zheng Zhang, Corey: An operating system for many cores, USENIX OSDI, OSDI’08, 2008, pp. 43–57.

[32] John Cieslewicz and Kenneth A Ross, Adaptive aggregation on chip multipro- cessors, VLDBJ, 2007, pp. 339–350.

[33] Douglas Comer, Ubiquitous b-tree, ACM Computing Surveys (CSUR) 11 (1979), no. 2, 121–137.

162 [34] Jonathan Corbet, AutoNUMA: the other approach to NUMA scheduling, lwn. net/Articles/488709 [Online. Last accessed: 2020-Jan-20], 2012.

[35] Transaction Processing Performance Council, Tpc-h benchmark specification 2.18.0 rc2, 2017.

[36] Rachel Courtland, Transistors could stop shrinking in 2021, IEEE Spectrum 53 (2016), 9–11.

[37] Alain Crolotte and Ahmad Ghazal, Introducing Skew into the TPC-H Bench- mark, TPCTC, 2012, pp. 137–145.

[38] Al Danial, CLOC, github.com/AlDanial/cloc [Online. Last accessed: 2019-Dec-04].

[39] Mohammad Dashti, Alexandra Fedorova, Justin Funston, Fabien Gaud, Re- naud Lachaize, Baptiste Lepers, Vivien Quema, and Mark Roth, Traffic man- agement: A holistic approach to memory placement on numa systems, ASP- LOS, ASPLOS ’13, ACM, 2013, pp. 381–394.

[40] Jeffrey Dean and Sanjay Ghemawat, MapReduce: simplified data processing on large clusters, Commun. of the ACM 51 (2008), 107–113.

[41] Xin Ding, Yuanliang Zhang, Lu Chen, Yunjun Gao, and Baihua Zheng, Distributed k-nearest neighbor queries in metric spaces, APWeb and WAIM, Springer, 2018, pp. 236–252.

[42] Dominik Durner, Viktor Leis, and Thomas Neumann, On the impact of mem- ory allocation on high-performance query processing, DaMoN, DaMoN’19, ACM, 2019, pp. 21:1–21:3.

[43] Lieven Eeckhout, Is moores law slowing down? whats next?, IEEE Micro 37 (2017), no. 4, 4–5.

163 [44] Ahmed Eldawy and Mohamed F Mokbel, SpatialHadoop: A mapreduce frame- work for spatial data, ICDE, 2015.

[45] Jason Evans, A scalable concurrent malloc (3) implementation for FreeBSD, BSDCan, 2006.

[46] Bin Fan, David G Andersen, and Michael Kaminsky, Memc3: Compact and concurrent memcache with dumber caching and smarter hashing, NSDI, vol. 13, 2013, pp. 371–384.

[47] Chen Feng, Xi Yang, Fan Liang, Xian-He Sun, and Zhiwei Xu, LCIndex: a local and clustering index on distributed ordered tables for flexible multi-dimensional range queries, ICPP, 2015, pp. 719–728.

[48] Raphael A. Finkel and Jon Louis Bentley, Quad trees a data structure for retrieval on composite keys, Acta informatica 4 (1974), 1–9.

[49] Kenneth Flamm, Measuring moores law: Evidence from price, cost, and quality indexes, Tech. report, National Bureau of Economic Research, 2018.

[50] Anthony Fox, Chris Eichelberger, James Hughes, and Skylar Lyon, Spatio- temporal indexing in non-relational distributed databases, IEEE Big Data, 2013, pp. 291–299.

[51] Edward A Fox, Qi Fan Chen, Amjad M Daoud, and Lenwood S Heath, Order- preserving minimal perfect hash functions and information retrieval, TOIS 9 (1991), 281–308.

[52] Lars George, Hbase: the definitive guide: random access to your planet-size data, “ O’Reilly Media, Inc.”, 2011.

164 [53] Sanjay Ghemawat and Paul Menage, Tcmalloc: Thread-caching malloc, github.com/gperftools/gperftools [Online. Last accessed: 2020- Jan-19], 2015.

[54] Jana Giceva, Operating Systems Support for Data Management on Mod- ern Hardware, sites.computer.org/debull/A19mar/p36.pdf [On- line. Last accessed: 2020-Jan-18], 2019.

[55] Jana Giceva, Gustavo Alonso, Timothy Roscoe, and Tim Harris, Deployment of query plans on multicores, VLDB Endowment 8 (2014), no. 3, 233–244.

[56] Jana Giceva, Adrian Sch¨upbach, Gustavo Alonso, and Timothy Roscoe, To- wards database/operating system co-design, SFMA, vol. 12, Citeseer, 2012.

[57] Jana Giceva, Gerd Zellweger, Gustavo Alonso, and Timothy Rosco, Customized OS support for data-processing, DaMoN, ACM, 2016, p. 2.

[58] Goetz Graefe, Ross Bunker, and Shaun Cooper, Hash joins and hash teams in microsoft sql server, VLDB, 1998, pp. 86–97.

[59] Goetz Graefe, Ann Linville, and Leonard D. Shapiro, Sort vs. hash revisited, TKDE 6 (1994), 934–944.

[60] Jim Gray, Surajit Chaudhuri, Adam Bosworth, Andrew Layman, Don Re- ichart, Murali Venkatrao, Frank Pellow, and Hamid Pirahesh, Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals, Data mining and knowledge discovery 1 (1997), 29–53.

[61] Jim Gray, Prakash Sundaresan, Susanne Englert, Ken Baclawski, and Peter J Weinberger, Quickly generating billion-record synthetic databases, SIGMOD, ACM, 1994, pp. 243–252.

165 [62] Robert Treat Greg Smith and Christopher Browne, Tuning your postgresql server, wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_ Server [Online. Last accessed: 2020-April-28].

[63] Hugo Guiroux, Renaud Lachaize, and Vivien Qu´ema, Multicore locks: The case is not closed yet, USENIX ATC, 2016, pp. 649–662.

[64] Antonin Guttman, R trees: A dynamic index structure for spatial searching, SIGMOD, vol. 14, 01 1984, pp. 47–57.

[65] Apache Hadoop, hadoop.apache.org/ [Online. Last accessed: 2020-Jan-19].

[66] G. Hager, G. Wellein, and J. Treibig, LIKWID: A Lightweight Performance- Oriented Tool Suite for x86 Multicore Environments, ICPP, IEEE Computer Society, 2010, pp. 207–216.

[67] Danny Hendler, Itai Incze, Nir Shavit, and Moran Tzafrir, Flat combining and the synchronization-parallelism tradeoff, SPAA, 2010, pp. 355–364.

[68] John L Hennessy and David A Patterson, Computer architecture: a quantita- tive approach, Elsevier, 2011.

[69] C. A. R. Hoare, Algorithm 64: Quicksort, Commun. ACM (1961), 321.

[70] Yingjie Hu, Krzysztof Janowicz, David Carral, Simon Scheider, Werner Kuhn, Gary Berg-Cross, Pascal Hitzler, Mike Dean, and Dave Kolas, A geo-ontology design pattern for semantic trajectories, COSIT, Springer, 2013, pp. 438–456.

[71] Qi Huang, Helga Gudmundsdottir, Ymir Vigfusson, Daniel A Freedman, Ken Birman, and Robbert van Renesse, Characterizing load imbalance in real-world networked caches, HotNets, ACM, 2014, p. 8.

166 [72] Richard L Hudson, Bratin Saha, Ali-Reza Adl-Tabatabai, and Benjamin C Hertzberg, Mcrt-malloc: a scalable transactional memory allocator, ISMM, ACM, 2006, pp. 74–83.

[73] Peng Jiang and Gagan Agrawal, Efficient simd and mimd parallelization of hash-based aggregation by conflict mitigation, ICS, ACM, 2017, p. 24.

[74] Nicolai M Josuttis, The c++ standard library: a tutorial and reference, Addison-Wesley, 2012.

[75] Karthik Kambatla, Giorgos Kollias, Vipin Kumar, and Ananth Grama, Trends in big data analytics, Journal of Parallel and Distributed Computing 74 (2014), no. 7, 2561–2573.

[76] Svilen Kanev, Juan Pablo Darago, Kim Hazelwood, Parthasarathy Ran- ganathan, Tipp Moseley, Gu-Yeon Wei, and David Brooks, Profiling a warehouse-scale computer, ACM SIGARCH Computer Architecture News 43 (2016), no. 3, 158–169.

[77] Thomas Kejser, TPC-H Schema and Indexes, kejser.org/ tpc-h-schema-and-indexes/, Jun 2014 (accessed June 16, 2017).

[78] Alfons Kemper and Thomas Neumann, Hyper: A hybrid oltp&olap main mem- ory database system based on virtual memory snapshots, ICDE, IEEE, 2011, pp. 195–206.

[79] Nathan Thomas Kerr, Alternative Approaches to Parallel GIS Processing, Mas- ter’s thesis, Arizona State University, 2009.

[80] Tim Kiefer, Benjamin Schlegel, and Wolfgang Lehner, Experimental evaluation of numa effects on database management systems, 2013.

167 [81] Changkyu Kim, Tim Kaldewey, Victor W Lee, Eric Sedlar, Anthony D Nguyen, Nadathur Satish, Jatin Chhugani, Andrea Di Blas, and Pradeep Dubey, Sort vs. hash revisited: fast join implementation on modern multi-core cpus, VLDBJ (2009), 1378–1389.

[82] Wooyoung Kim and Michael Voss, Multicore desktop programming with intel threading building blocks, IEEE software 28 (2011), no. 1, 23–31.

[83] Adam Kirsch, Michael Mitzenmacher, and Udi Wieder, More robust hashing: Cuckoo hashing with a stash, ESA, 2008, pp. 611–622.

[84] Thomas Kissinger, Tim Kiefer, Benjamin Schlegel, Dirk Habich, Daniel Molka, and Wolfgang Lehner, ERIS: A NUMA-aware in-memory storage engine for analytical workloads, VLDB Endowment 7 (2014), no. 14, 1–12.

[85] Mohammad Kolahdouzan and Cyrus Shahabi, Voronoi-based k nearest neighbor search for spatial network databases, VLDB, VLDB Endowment, 2004, pp. 840– 851.

[86] Alexey Kukanov and Michael J Voss, The Foundations for Scalable Multi-core Software in Intel Threading Building Blocks, 2007.

[87] Sailesh Kumar, Jonathan Turner, and Patrick Crowley, Peacock hashing: De- terministic and updatable hashing for high performance networking, INFO- COM, IEEE, 2008, pp. 101–105.

[88] Bradley C Kuszmaul, SuperMalloc: a super fast multithreaded malloc for 64-bit machines, SIGPLAN Notices, vol. 50, ACM, 2015, pp. 41–55.

[89] Harald Lang, Viktor Leis, Martina-Cezara Albutiu, Thomas Neumann, and Alfons Kemper, Massively parallel numa-aware hash joins, IMDM, 2013, pp. 3– 14.

168 [90] Jonathan K. Lawder and Peter J. H. King, Querying multi-dimensional data indexed using the hilbert space-filling curve, ACM Sigmod Record 30 (2001), 19–24.

[91] Doug Lea, Doug Lea's malloc (dlmalloc), gee.cs.oswego.edu/dl/html/malloc.html [Online. Last accessed: 2019-May-05], 2000.

[92] Hugh Leather, Edwin Bonilla, and Michael O’boyle, Automatic feature gener- ation for machine learning–based optimising compilation, ACM Transactions on Architecture and Code Optimization (TACO) 11 (2014), no. 1, 1–32.

[93] Tobin J Lehman and Michael J Carey, A study of index structures for main memory database management systems, Proc. VLDB, vol. 1, 1986.

[94] Viktor Leis, Peter Boncz, Alfons Kemper, and Thomas Neumann, Morsel- driven parallelism: a numa-aware query evaluation framework for the many- core age, SIGMOD, ACM, 2014, pp. 743–754.

[95] Viktor Leis, Alfons Kemper, and Thomas Neumann, The adaptive radix tree: Artful indexing for main-memory databases, ICDE, IEEE, 2013, pp. 38–49.

[96] Viktor Leis, Florian Scheibner, Alfons Kemper, and Thomas Neumann, The art of practical synchronization, DaMoN, ACM, 2016, p. 3.

[97] Kevin Lepak, Gerry Talbot, Sean White, Noah Beck, Sam Naffziger, et al., The next generation amd enterprise server product architecture, 2017.

[98] Baptiste Lepers, Vivien Qu´ema, and Alexandra Fedorova, Thread and mem- ory placement on numa systems: Asymmetry matters., USENIX ATC, 2015, pp. 277–289.

[99] Justin J Levandoski, David B Lomet, and Sudipta Sengupta, The bw-tree: A b-tree for new hardware platforms, ICDE, IEEE, 2013, pp. 302–313.

[100] LevelDB Java Version, github.com/dain/leveldb [Online. Last accessed: 2019-Nov-10].

[101] Shen Li, Shaohan Hu, Raghu K Ganti, Mudhakar Srivatsa, and Tarek F Ab- delzaher, Pyro: A spatial-temporal big-data storage system., USENIX Annual Technical Conference, 2015, pp. 97–109.

[102] Xiaozhou Li, David G Andersen, Michael Kaminsky, and Michael J Freedman, Algorithmic improvements for fast concurrent cuckoo hashing, EuroSys, ACM, 2014, pp. 27:1–27:14.

[103] Shenshen Liang, Ying Liu, Cheng Wang, and Liheng Jian, A cuda-based parallel implementation of k-nearest neighbor algorithm, CyberC, IEEE, 2009, pp. 291– 296.

[104] Haojun Liao, Jizhong Han, and Jinyun Fang, Multi-dimensional index on hadoop distributed file system, Networking, Architecture and Storage (NAS), 2010, pp. 240–249.

[105] Antoine Limasset, Guillaume Rizk, Rayan Chikhi, and Pierre Peterlongo, Fast and scalable minimal perfect hashing for massive key sets, 2017.

[106] Youzhong Ma, Yu Zhang, and Xiaofeng Meng, St-hbase: a scalable data management system for massive geo-tagged objects, WAIM, Springer, 2013, pp. 155–166.

[107] Chris Mack, The multiple lives of moore’s law, IEEE Spectrum 52 (2015), 31–31.

[108] Yandong Mao, Eddie Kohler, and Robert Tappan Morris, Cache craftiness for fast multicore key-value storage, Eurosys, ACM, 2012, pp. 183–196.

170 [109] Sally A McKee et al., Reflections on the memory wall, Conf. Computing Fron- tiers, 2004, p. 162.

[110] Frank McSherry, Michael Isard, and Derek G. Murray, Scalability! but at what COST?, HOTOS, USENIX Association, 2015, p. 14.

[111] Puya Memarzia, Maria Patrou, Md Mahbub Alam, Suprio Ray, Virendra C Bhavsar, and Kenneth B Kent, Toward efficient processing of spatio-temporal workloads in a distributed in-memory system, MDM, IEEE, 2019, pp. 118–127.

[112] Puya Memarzia, Suprio Ray, and Virendra C Bhavsar, On improving data skew resilience in main-memory hash joins, IDEAS, ACM, 2018, pp. 226–235.

[113] Puya Memarzia, Suprio Ray, and Virendra C Bhavsar, A Six-dimensional Analysis of In-memory Aggregation, EDBT, 2019, pp. 289–300.

[114] Puya Memarzia, Suprio Ray, and Virendra C Bhavsar, Toward efficient in- memory data analytics on numa systems, arXiv preprint arXiv:1908.01860 (2019).

[115] Puya Memarzia, Suprio Ray, and Virendra C Bhavsar, The art of efficient in-memory query processing on numa systems: a systematic approach, ICDE, 2020.

[116] Daniel Molka, Daniel Hackenberg, Robert Sch¨one, and Wolfgang E Nagel, Cache coherence protocol and memory performance of the intel haswell-ep ar- chitecture, ICPP, IEEE, 2015, pp. 739–748.

[117] MonetDB B.V., MonetDB, monetdb.org [Online. Last accessed: 2019-June- 01], 2018.

171 [118] Morton, G.M., A computer oriented geodetic data base; and a new technique in file sequencing, Tech. report, Technical Report, Ottawa. IBM Ltd, Canada, 1966.

[119] MOTO (Moving Objects Trace generatOr), moto.sourceforge.net [Online. Last accessed: 2020-Jan-19].

[120] Ingo Müller, Peter Sanders, Arnaud Lacurie, Wolfgang Lehner, and Franz Färber, Cache-efficient aggregation: Hashing is sorting, SIGMOD, 2015.

[121] David R Musser, Introspective sorting and selection algorithms, Software: prac- tice and experience 27 (1997), 983–993.

[122] Neo4j, Inc., Neo4j Graph Database, neo4j.com [Online. Last accessed: 2020- Jan-20].

[123] Nicholas Nethercote and Julian Seward, Valgrind: a framework for heavyweight dynamic binary instrumentation, ACM Sigplan notices, vol. 42, ACM, 2007, pp. 89–100.

[124] Gustavo Niemeyer, Geohash, en.wikipedia.org/wiki/Geohash [On- line. Last accessed: 2019-Nov-12].

[125] Shoji Nishimura, Sudipto Das, Divyakant Agrawal, and Amr El Abbadi, MD- HBase: A scalable multi-dimensional data infrastructure for location aware services, MDM, 2011.

[126] Patrick O’Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O’Neil, The log- structured merge-tree (LSM-tree), Acta Informatica 33 (1996), no. 4, 351–385.

[127] Oracle Corporation, MySQL, mysql.com [Online. Last accessed: 2019-July- 05], 2019.

[128] Fatma Özcan, Yuanyuan Tian, and Pınar Tözün, Hybrid transactional/analytical processing: A survey, SIGMOD, 2017, pp. 1771–1775.

[129] Rasmus Pagh and Flemming Friche Rodler, Cuckoo hashing, ESA, Springer, 2001, pp. 121–133.

[130] Varun Pandey, Andreas Kipf, Thomas Neumann, and Alfons Kemper, How good are modern spatial analytics systems?, VLDB Endowment 11 (2018), no. 11, 1661–1673.

[131] Jignesh M Patel, Harshad Deshmukh, Jianqiao Zhu, Navneet Potti, Zuyu Zhang, Marc Spehlmann, Hakan Memisoglu, and Saket Saurabh, Quickstep: A data platform based on the scaling-up approach, VLDB Endowment 11 (2018), no. 6, 663–676.

[132] Maria Patrou, Md Mahbub Alam, Puya Memarzia, Suprio Ray, Virendra C Bhavsar, Kenneth B Kent, and Gerhard W Dueck, Distil: a distributed in- memory data processing system for location-based services, ACM SIGSPA- TIAL, ACM, 2018, pp. 496–499.

[133] Md Mostofa Ali Patwary, Nadathur Rajagopalan Satish, Narayanan Sundaram, Jialin Liu, Peter Sadowski, Evan Racah, Suren Byna, Craig Tull, Wahid Bhimji, Pradeep Dubey, et al., Panda: Extreme scale parallel k-nearest neigh- bor on distributed architectures, IPDPS, 2016, pp. 494–503.

[134] Chuck Pheatt, Intel R threading building blocks, Journal of Computing Sciences in Colleges 23 (2008), 298–298.

[135] Mihail Popov, Alexandra Jimborean, and David Black-Schaffer, Efficient thread/page/parallelism autotuning for numa systems, ICS, ACM, 2019, pp. 342–353.

173 [136] Danica Porobic, Erietta Liarou, Pinar Tozun, and Anastasia Ailamaki, Atra- pos: Adaptive transaction processing on hardware islands, ICDE, 03 2014, pp. 688–699.

[137] PostgreSQL Global Development Group, PostgreSQL, postgresql.org [Online. Last accessed: 2019-July-01], 2019.

[138] David MW Powers, Applications and explanations of zipf’s law, NeM- LaP3/CoNLL98, Association for Computational Linguistics, 1998, pp. 151– 160.

[139] Iraklis Psaroudakis, Stefan Kaestle, Matthias Grimmer, Daniel Goodman, Jean-Pierre Lozi, and Tim Harris, Analytics with smart arrays: adaptive and efficient language-independent data, EuroSys, ACM, 2018, p. 17.

[140] Iraklis Psaroudakis, Tobias Scheuer, Norman May, Abdelkader Sellami, and Anastasia Ailamaki, Scaling up concurrent main-memory column-store scans: towards adaptive numa-aware data and task placement, VLDBJ 8 (2015), no. 12, 1442–1453.

[141] Iraklis Psaroudakis, Tobias Scheuer, Norman May, Abdelkader Sellami, and Anastasia Ailamaki, Adaptive NUMA-aware Data Placement and Task Scheduling for Analytical Workloads in Main-memory Column-stores, VLDB (2016), 37–48.

[142] Suprio Ray, Rolando Blanco, and Anil K. Goel, High performance location- based services in a main-memory database, Geoinformatica 21 (2017), no. 2, 293–322.

[143] Red Hat Inc, Red Hat Enterprise Linux Product Documentation, 2018.

174 [144] Stefan Richter, Victor Alvarez, and Jens Dittrich, A seven-dimensional analy- sis of hashing methods and its implications on query processing, VLDBJ (2015), 96–107.

[145] Brian M Rogers, Anil Krishna, Gordon B Bell, Ken Vu, Xiaowei Jiang, and Yan Solihin, Scaling the bandwidth wall: challenges in and avenues for cmp scaling, ACM SIGARCH 37 (2009), no. 3, 371–382.

[146] Steven J Ross, The spreadsort high-performance general-case sorting algo- rithm., PDPTA, 2002, pp. 1100–1106.

[147] Vijay Saraswat, George Almasi, Ganesh Bikshandi, Calin Cascaval, David Cunningham, David Grove, Sreedhar Kodali, Igor Peshansky, and Olivier Tardieu, The asynchronous partitioned global address space model, The First Workshop on Advances in Message Passing, 2010, pp. 1–8.

[148] Nadathur Satish, Changkyu Kim, Jatin Chhugani, Anthony D Nguyen, Vic- tor W Lee, Daehyun Kim, and Pradeep Dubey, Fast sort on cpus and gpus: a case for bandwidth oblivious simd sort, SIGMOD, ACM, 2010, pp. 351–362.

[149] Georg Sauthoff, cgmemtime, github.com/gsauthof/cgmemtime [Online. Last accessed: 2019-Nov-25].

[150] Matt Scarpino, Crunching numbers with avx and avx2, codeproject.com/Articles/874396/Crunching-Numbers-with-AVX-and-AVX [Online. Last accessed: 2020-Jan-20], 2016.

[151] Stefan Schuh, Xiao Chen, and Jens Dittrich, An experimental comparison of thirteen relational equi-joins in main memory, SIGMOD, ACM, 2016, pp. 1961–1976.

[152] Anil Shanbhag, Holger Pirk, and Sam Madden, Locality-adaptive parallel hash joins using hardware transactional memory, IMDM, 2016.

175 [153] Ambuj Shatdal and Jeffrey F Naughton, Adaptive parallel aggregation algo- rithms, SIGMOD, vol. 24, ACM, 1995, pp. 104–114.

[154] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler, The hadoop distributed file system, Mass storage systems and technologies (MSST), 2010, pp. 1–10.

[155] Craig Silverstein, Google sparsehash, 2015.

[156] Teja Singh, Sundar Rangarajan, Deepesh John, Carson Henrion, Shane Southard, Hugh McIntyre, Amy Novak, Stephen Kosonocky, Ravi Jotwani, Alex Schaefer, et al., 3.2 zen: A next-generation high-performance× 86 core, ISSCC, IEEE, 2017, pp. 52–53.

[157] Nikos Sismanis, Nikos Pitsianis, and Xiaobai Sun, Parallel search of k-nearest neighbors with synchronous operations, 2012 IEEE Conference on High Perfor- mance Extreme Computing, IEEE, 2012, pp. 1–6.

[158] Uthayasankar Sivarajah, Muhammad Mustafa Kamal, Zahir Irani, and Vis- hanth Weerakkody, Critical analysis of big data challenges and analytical meth- ods, Journal of Business Research 70 (2017), 263–286.

[159] Anders Skovsgaard, Darius Sidlauskas, and Christian S Jensen, Scalable top-k spatio-temporal term querying, ICDE, IEEE, 2014, pp. 148–159.

[160] Ram Sriharsha, Magellan: geospatial analytics on spark, github.com/ harsha2010/magellan [Online. Last accessed: 2019-Oct-17], 2018.

[161] Gary J Sullivan and Richard L Baker, Efficient quadtree coding of images and video, IEEE Transactions on image processing 3 (1994), no. 3, 327–331.

176 [162] Mingjie Tang, Yongyang Yu, Qutaibah M Malluhi, Mourad Ouzzani, and Walid G Aref, LocationSpark: A distributed in-memory data management sys- tem for big spatial data, VLDBJ 9 (2016), no. 13, 1565–1568.

[163] Steven Ross, Francisco Tapia, and Orson Peters, Boost c++ library 1.67, boost.org [Online. Last accessed: 2020-Jan-20], April 2018.

[164] Olivier Tardieu, The APGAS library: resilient parallel and distributed pro- gramming in Java 8, ACM SIGPLAN Workshop, ACM, 2015, pp. 25–26.

[165] Olivier Tardieu, Benjamin Herta, David Cunningham, David Grove, Prabhan- jan Kambadur, Vijay Saraswat, Avraham Shinnar, Mikio Takeuchi, and Man- dana Vaziri, X10 and APGAS at petascale, ACM SIGPLAN Notices, vol. 49, ACM, 2014, pp. 53–66.

[166] GCC Team et al., Gcc, the gnu compiler collection, 2018.

[167] The Apache Software Foundation, Apache accumulo, accumulo.apache. org [Online. Last accessed: 2019-Oct-16].

[168] The glibc project developers, The GNU C Library (glibc), gnu.org/ software/libc/ [Online. Last accessed: 2020-Jan-17], 2019.

[169] Tiger dataset, developers.google.com/earth-engine/datasets/catalog/TIGER_2016_States [Online. Last accessed: 2020-Jan-20].

[170] Stephen Tu, Wenting Zheng, Eddie Kohler, Barbara Liskov, and Samuel Mad- den, Speedy transactions in multicore in-memory databases, SOSP, ACM, 2013, pp. 18–32.

177 [171] Akira Umayabara and Hayato Yamana, MCMalloc: A scalable memory allo- cator for multithreaded applications on a many-core shared-memory machine, IEEE Big Data, IEEE, 2017, pp. 4846–4848.

[172] Richard L Villars, Carl W Olofson, and Matthew Eastwood, Big data: What it is and why you should care, White Paper, IDC 14 (2011), 1–14.

[173] Scott Vokes, skiplist, github.com/silentbicycle/skiplist [Online. Last accessed: 2019-June-01], 2016.

[174] Brett Walenz, Sudeepa Roy, and Jun Yang, Optimizing iceberg queries with complex joins, SIGMOD, ACM, 2017, pp. 1243–1258.

[175] Li Wang, Minqi Zhou, Zhenjie Zhang, Ming-Chien Shan, and Aoying Zhou, Numa-aware scalable and efficient in-memory aggregation on large domains, TKDE 27 (2015), no. 4, 1071–1084.

[176] Ziqi Wang, Andrew Pavlo, Hyeontaek Lim, Viktor Leis, Huanchen Zhang, Michael Kaminsky, and David G Andersen, Building a bw-tree takes more than just buzz words, SIGMOD, ACM, 2018, pp. 473–488.

[177] David Wentzlaff and Anant Agarwal, Factored operating systems (fos): The case for a scalable operating system for multicores, SIGOPS Oper. Syst. Rev. 43 (2009), no. 2, 76–85.

[178] Wikichip, Advanced vector extensions 512 (avx-512), en.wikichip.org/wiki/x86/avx-512 [Online. Last accessed: 2020-Jan-20].

[179] Eugene Wu and Samuel Madden, Partitioning techniques for fine-grained in- dexing, ICDE, 2011, pp. 1127–1138.

[180] X10 Language Specification - Version 2.6.2, x10.sourceforge.net/documentation/languagespec/x10-latest.pdf [Online. Last accessed: 2019-Oct-21].

[181] Dong Xie, Feifei Li, Bin Yao, Gefei Li, Liang Zhou, and Minyi Guo, Simba: Efficient in-memory spatial analytics, SIGMOD, 2016.

[182] Zhongle Xie, Qingchao Cai, Gang Chen, Rui Mao, and Meihui Zhang, A com- prehensive performance evaluation of modern in-memory indices, ICDE, 2018.

[183] Fumito Yamaguchi and Hiroaki Nishi, Hardware-based hash functions for net- work applications, ICON, IEEE, 2013, pp. 1–6.

[184] Weipeng P Yan and Per-Ake Larson, Eager aggregation and lazy aggregation, VLDB, vol. 95, 1995, pp. 345–357.

[185] Yang Ye, Kenneth A Ross, and Norases Vesdapunt, Scalable aggregation on multicore processors, DMSN, 2011, pp. 1–9.

[186] Simin You, Jianting Zhang, and Le Gruenwald, Large-scale spatial join query processing in cloud, ICDEW, 2015.

[187] Chenhan D Yu, Jianyu Huang, Woody Austin, Bo Xiao, and George Biros, Performance optimization for the k-nearest neighbors kernel on x86 architec- tures, SC, ACM, 2015, p. 7.

[188] Jia Yu, Jinxuan Wu, and Mohamed Sarwat, GeoSpark: A cluster computing framework for processing large-scale spatial data, ACM SIGSPATIAL, 2015, p. 70.

[189] Yuan Yu, Pradeep Kumar Gunda, and Michael Isard, Distributed aggrega- tion for data-parallel computing: interfaces and implementations, SOSP, ACM, 2009, pp. 247–260.

179 [190] Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion Stoica, Spark: Cluster computing with working sets., HotCloud 10 (2010), no. 10-10, 95.

[191] Mikhail Zarubin, Patrick Damme, Thomas Kissinger, Dirk Habich, Wolfgang Lehner, and Thomas Willhalm, Integer compression in nvram-centric data stores: Comparative experimental analysis to dram, DaMoN, DaMoN19, Asso- ciation for Computing Machinery, 2019.

[192] Chi Zhang, Feifei Li, and Jeffrey Jestes, Efficient parallel knn joins for large data in mapreduce, EDBT, ACM, 2012, pp. 38–49.

[193] Jun Zhang, Manli Zhu, Dimitris Papadias, Yufei Tao, and Dik Lun Lee, Location-based spatial queries, SIGMOD, ACM, 2003, pp. 443–454.

[194] Xiangyu Zhang, Jing Ai, Zhongyuan Wang, Jiaheng Lu, and Xiaofeng Meng, An efficient multi-dimensional index for cloud data management, CloudDB, ACM, 2009, pp. 17–24.

[195] Zhigang Zhang, Cheqing Jin, Jiali Mao, Xiaolin Yang, and Aoying Zhou, Trajs- park: A scalable and efficient in-memory management system for big trajectory data, APWeb/WAIM, 2017.

Appendix A

Main Memory Joins - Additional Results

The join runtimes for five hash join configurations on the Gaussian and Zipf near M:K datasets (defined in Chapter 3) are given in Figures A.1 and A.2, respectively. The results are in line with our previous conclusions. Specifically, the non-partitioning configurations provide the best performance, and MH can further improve runtime when the dataset is shuffled.

Figure A.1: Hash Join run time with Variable Build Skew on Gaussian near M:K Dataset - Skylake - 8 threads. [Two panels: (a) Ordered, (b) Shuffled; y-axis: runtime in CPU cycles (billions), broken down into Part, Build, and Probe; x-axis: configuration.]

Figure A.2: Hash Join run time with Variable Build Skew on Zipf near M:K Dataset - Skylake - 8 threads. [Two panels: (a) Ordered, (b) Shuffled; y-axis: runtime in CPU cycles (billions), broken down into Part, Build, and Probe; x-axis: configuration.]

Appendix B

Main Memory Aggregation - Additional Results

In Chapter 4 we performed our experimental evaluation on a machine based on the Intel Skylake architecture. In order to verify that our conclusions are valid for other machines, we also ran our experiments on a dual-socket machine based on the Intel Penryn architecture. The machine contains a pair of Intel Xeon E5472 quad core CPUs and 16GB of main memory. The results depicted in Figure B.1 indicate that Hash LP consistently produces the fastest runtimes and that Spreadsort can narrow the gap with Hash LP at high group-by cardinality. The results agree with our summarized decision flowchart in Chapter 4 and confirm that the efficiency of our Hash LP implementation can apply to other hardware architectures.

Figure B.1: Vector Aggregation Q1 - 100M Records - Intel Harpertown (Penryn) Machine - Variable dataset distribution. [Six panels: (a) Rseq-ord, (b) Rseq-shf, (c) Hhit-ord, (d) Hhit-shf, (e) MovC, (f) Zipf; y-axis: query execution time in CPU cycles (billions); x-axis: group-by cardinality (10^2 to 10^7); series: ART, Judy, Btree, Hash_SC, Hash_LP, Hash_Sparse, Hash_Dense, Hash_LC, Introsort, Spreadsort.]

Appendix C

Query Processing on NUMA Systems - Additional Results

In Figure C.1 we depict a direct head-to-head comparison of five database systems running on a NUMA machine: MonetDB, PostgreSQL, MySQL, DBMSx, and Quickstep. Each database system uses its best configuration with our NUMA strategies applied (as defined in Chapter 5), and we measure the total time to complete all 22 queries of the TPC-H benchmark. The results show that the commercial DBMSx and open-source Quickstep database systems have a clear performance advantage in analytical queries due to their in-memory query execution engines, and MySQL lags far behind due to its lack of support for intra-query parallelism in its query engine.

Figure C.1: Head-to-head DBMS Comparison - All 22 TPC-H queries - Machine A - Best Configuration. [Bar chart of total runtime in seconds (log scale) for MonetDB, PostgreSQL, MySQL, DBMSx, and Quickstep.]

Appendix D

DISTIL+ - Additional System Details and Results

In this appendix, we present additional system details and experimental results for our distributed spatio-temporal data system, DISTIL+, which we introduced in Chapter 6.

D.1 Update Processing

The location update processing algorithm is shown in Algorithm 4. Update processing includes inserting a new record into the in-memory and persistent stores, as well as updating the spatio-temporal index. When a new location update is received from a moving object, the corresponding location record object LRec is inserted into a concurrent queue by the Coordinator component. Further processing is carried out using the producer-consumer paradigm to enable parallel processing. One of the producer activities retrieves a particular location record from the queue (line 4, Algo. 4) and, based on the coordinates of that record, determines the tile ID tileId and the place pl into which the record needs to be inserted (lines 5 and 6). Each producer keeps a fixed-size array, arrayPerPlace, for each place.

Algorithm 4: Update Processing
Input: A new location update, represented by object LRec, is inserted into a concurrent queue RecordQueue by the Coordinator component in Figure 6.2. insertBatchSize is the record insertion batch size. STIndex is the spatio-temporal index.
 1  // Producers
 2  for ProducerId in Producers do
 3      while RecordQueue has more records do
 4          LRec ← RecordQueue.pop();
 5          tileId ← determine tileId from LRec coordinates;
 6          pl ← determine Place from tileId;
 7          add LRec to arrayPerPlace(pl);
 8          if arrayPerPlace(pl).size() == insertBatchSize then
 9              at Place placeId {
10                  add arrayPerPlace(pl) in insertionMaps(pl) };
11              clear arrayPerPlace(pl);
12              l++;
13      for pl in Places do
14          at Place pl {
15              add arrayPerPlace(pl) in insertionMaps(pl) };
16          clear arrayPerPlace(pl);
17      producersDone++;
18  // Consumers
19  for pl in Places do
20      at Place pl for ConsumerId in Consumers do
21          while Producers not done && insertionMaps(pl).size() do
22              keys ← insertionMaps(pl).poll();
23              for key in keys do
24                  LRec ← new LocationRecord(key);
25                  LTable(pl).insert(LRec);
26                  SpatialGridIndex(pl).updateIndex(LRec, STIndex(pl));
27                  serializedRec ← serialize(LRec);
28                  Insert serializedRec into Localstore(pl);

Every time a new record is processed, it is inserted into the array of the corresponding place. For example, if a record should move to place 0, then the producer appends the record to arrayPerPlace[0]. When the size of that arrayPerPlace reaches a specific threshold (insertBatchSize), it is sent to the node as one item (a record-batch), and received by the consumer activities. Finally, the item is inserted into a distributed array of concurrent queues called insertionMaps (lines 8 to 11). This allows batches of records to be processed at each place in a thread-safe manner. When a consumer receives a record-batch item from the queue, it processes the records by inserting them into the in-memory table LTable (line 25) and the persistent local store (lines 27 to 28), and updates the spatio-temporal index (line 26).
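A minimal C++ sketch of this per-place batching pattern is shown below. It is purely illustrative: DISTIL+ implements the pattern in X10, and the tile-to-place mapping, record fields, and shipBatchToPlace function here are hypothetical placeholders.

// Illustrative sketch of the producer-side batching in Algorithm 4: records are
// staged in a per-place buffer and shipped as one batch once the threshold is hit.
#include <cstddef>
#include <cstdio>
#include <vector>

struct LocationRecord { long objectId; double lat, lon; long timestamp; };

constexpr std::size_t kInsertBatchSize = 4;   // tiny value for the example; DISTIL+ uses 6,000

// Stand-in for the X10 "at (place)" transfer into insertionMaps(place).
void shipBatchToPlace(int place, const std::vector<LocationRecord>& batch) {
    std::printf("place %d receives a batch of %zu records\n", place, batch.size());
}

class Producer {
public:
    explicit Producer(int numPlaces) : buffers_(numPlaces) {}

    void process(const LocationRecord& rec, int numTiles) {
        int tileId = static_cast<int>(rec.objectId) % numTiles;   // placeholder for the coordinate-based mapping
        int place  = tileId % static_cast<int>(buffers_.size());  // tile -> place
        std::vector<LocationRecord>& buf = buffers_[place];
        buf.push_back(rec);
        if (buf.size() == kInsertBatchSize) {                     // flush a full record-batch
            shipBatchToPlace(place, buf);
            buf.clear();
        }
    }

    void flushAll() {                                             // drain partial batches at the end
        for (std::size_t p = 0; p < buffers_.size(); ++p)
            if (!buffers_[p].empty()) { shipBatchToPlace(static_cast<int>(p), buffers_[p]); buffers_[p].clear(); }
    }

private:
    std::vector<std::vector<LocationRecord>> buffers_;            // one buffer per place (arrayPerPlace)
};

int main() {
    Producer producer(2);
    for (long id = 0; id < 10; ++id)
        producer.process({id, 45.9, -66.6, 1560000000000L + id}, 16);
    producer.flushAll();
    return 0;
}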

Algorithm 5: SpatialGridIndex UpdateIndex
Input: LRec is the location record to update the index with. STIndex is the per-node local index component at a place.
1  timestampHashCode ← determine hashcode with LRec.getTimestamp();
2  PTIndex ← STIndex.getPartialTemporalIndex();
3  if PTIndex.get(timestampHashCode) is null then
4      create a new interval in PTIndex;
5      intervalSlotEntry ← new TileIndexDatetimeObjectInfo();
6  else
7      intervalSlotEntry ← PTIndex.getTileIndexDatetimeObjectInfo();
8  intervalSlotEntry.addToBitmap(LRec.getObjectId());
9  intervalSlotEntry.addToRidListMap(LRec.getObjectId(), LRec.getRID());

Algorithm 5 outlines the steps to update the spatio-temporal index with a location record. First, the location record's timestamp is hashed to generate timestampHashCode, which will be used as a key (line 1). Then the partial temporal index is checked to see if an entry for the key exists (lines 2 to 3). If there are no existing entries for a timestamp hash, a new interval slot is created in the temporal index (lines 4 to 5). Otherwise, the existing interval is selected (line 7). Lastly, the interval slot is updated with data from the location record (lines 8 to 9).

188 300,000

Figure D.1: Update throughput: batch size/worker threads. [Insertion throughput vs. batch size (1,000-8,000 records) for 4 and 8 worker threads.]
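The C++ sketch below mirrors the per-node structure that Algorithm 5 above updates: a partial temporal index that maps a timestamp hash to an interval slot holding a bitmap of object IDs and a per-object record ID list. It is illustrative only; the concrete types in DISTIL+ are X10 classes, and the interval hashing here is a hypothetical simplification.

// Illustrative sketch of the per-node structure updated by Algorithm 5.
// The types and the interval hashing are hypothetical stand-ins for the X10 code.
#include <cstdint>
#include <unordered_map>
#include <set>
#include <vector>

struct LocationRecord { int64_t objectId; int64_t rid; int64_t timestamp; };

struct IntervalSlot {
    std::set<int64_t> objectBitmap;                                 // stand-in for a bitmap of object IDs
    std::unordered_map<int64_t, std::vector<int64_t>> ridListMap;   // object ID -> record IDs
};

struct PartialTemporalIndex {
    std::unordered_map<int64_t, IntervalSlot> intervals;            // timestamp hash -> interval slot

    void updateIndex(const LocationRecord& rec, int64_t intervalLengthMs) {
        int64_t key = rec.timestamp / intervalLengthMs;             // hypothetical timestamp hash
        IntervalSlot& slot = intervals[key];                        // creates the interval slot if missing
        slot.objectBitmap.insert(rec.objectId);                     // addToBitmap
        slot.ridListMap[rec.objectId].push_back(rec.rid);           // addToRidListMap
    }
};

int main() {
    PartialTemporalIndex idx;
    idx.updateIndex({42, 7, 1560000000000}, 60000);                 // one-minute interval slots
    return 0;
}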

D.1.1 Update Performance

In this section, we evaluate the parameters that affect DISTIL+'s update performance. To do so, we vary several key parameters to determine the best configuration for our system. These experiments also help us examine important factors that affect system performance, such as locality and scalability.

D.1.1.1 Update Batch Size per Worker Thread

In this experiment, we maintain the dataset size at 100M records, and vary the number of update workers, as well as the number of records that are sent to each place per batch. Figure D.1 shows the update throughput for each configuration. Peak throughput is achieved by spawning four workers, and creating batches of 6000 records. The batch size only needs to be large enough to cover the overhead of scheduling. This experiment shows that the number of workers has a greater impact than the batch size.

D.1.1.2 Tile placement policy and dataset size

In this section, we compare the update throughput of our system with two tile placement policies (previously mentioned in Section 6.4.1): the Row-wise Round-Robin (RRR) policy and the Multi-Dimensional Range partitioning (MDR) policy, for different dataset sizes.

Figure D.2: Tile placement policy comparison: variable dataset size. [Update throughput for the RRR and MDR policies with dataset sizes from 10M to 100M records.]

Figure D.3: Update throughput scalability: variable cluster size. [Update throughput for clusters of 2, 4, and 8 nodes.]

As shown in Figure D.2, the RRR policy achieves higher throughput than MDR. Also, the results of RRR are the same regardless of the dataset size. On the other hand, MDR produces lower throughput than RRR, and the throughput decreases as the dataset size increases. The tile placement policies affect the location of the tiles, and therefore the location of the records. The RRR policy distributes the tiles in a way that places nearby tiles on different nodes, whereas MDR aims to place tiles of nearby regions together. In short, RRR attempts to maximize parallelism, whereas MDR aims for more two-dimensional spatial locality.
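The contrast between the two policies can be sketched with the following hypothetical tile-to-place mapping functions (ours, for illustration only; the actual DISTIL+ partitioner is more involved):

// Illustrative sketch of the two tile placement policies compared in Figure D.2.
// The grid dimensions and mapping details are hypothetical simplifications.
#include <cstdio>

// Row-wise Round-Robin (RRR): walk the grid row by row and deal tiles out to
// places in turn, so neighbouring tiles usually land on different nodes.
int placeRRR(int row, int col, int gridCols, int numPlaces) {
    int tileId = row * gridCols + col;
    return tileId % numPlaces;
}

// Multi-Dimensional Range (MDR): keep contiguous blocks of the grid on one
// place (a simplified range partitioning), so nearby tiles stay together.
int placeMDR(int row, int /*col*/, int gridRows, int numPlaces) {
    int rowsPerPlace = (gridRows + numPlaces - 1) / numPlaces;
    return row / rowsPerPlace;
}

int main() {
    const int gridRows = 8, gridCols = 8, places = 4;
    for (int r = 0; r < 2; ++r)
        for (int c = 0; c < 4; ++c)
            std::printf("tile(%d,%d): RRR->place %d, MDR->place %d\n",
                        r, c, placeRRR(r, c, gridCols, places),
                        placeMDR(r, c, gridRows, places));
    return 0;
}

Dealing tiles out round-robin spreads a spatially clustered update workload over every node, which is why RRR sustains higher throughput, while the range-style mapping keeps neighbouring tiles together and favors spatial locality over parallelism.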

D.1.1.3 Node Scaling

Figure D.3 shows DISTIL+'s update throughput scalability with the RRR placement policy and a variable number of nodes. The throughput increases as the number of nodes increases. As more nodes are added, more update workers are used, and therefore the system manages to scale out and overcome the communication and scheduling overheads. We used a batch size of 6,000 records and four update workers per node.

Figure D.4: Illustration of range query (STRQ) processing. [Diagram: in the global processing step, the master node (Node 0) determines the spatial tile IDs for the range query using the global index (GIndex) and sends the query information to each place; in the local processing step, each node finds the objects matching the query using its local STIndex; the master then collects the object IDs from all places and outputs the merged results.]

D.2 Range query (STRQ) processing

An STRQ query, q, is specified by (q.sw, q.tw), where q.sw is the spatial window and q.tw the timeframe. It returns all objects that were inside the specified q.sw during q.tw. All the tiles or grid cells in Gr,s (as mentioned in Section 6.4.3.1) that are fully or partially overlapped by q.sw, i.e. C := {c | c ∈ Gr,s and |c ∩ q.sw| > 0}, need to be processed. We can support intra-query parallelism by processing the tiles c ∈ C in a distributed manner, such that each tile is processed by the node to which it is mapped. Figure D.4 illustrates the distributed execution of an STRQ query. Inter-node communication consists of tile IDs and query results. The algorithm for historical STRQ processing is presented in Algorithm 6. The idea is to limit communication and computation to the nodes that contain the relevant tiles. In line 2 of the algorithm, we use the global index to find the tiles that overlap the query's spatial window. These tiles are then added to the corresponding places, using the distributed array structure for each place (lines 3 to 6). For each relevant place, a query object STRangeQuery is instantiated with the required parameters (lines 8 to 9). The query objects are then executed at each place (line 10), and the results are collected and merged to produce the output (lines 11 to 13).

Algorithm 6: Historical Range Query
Input: An STRQ query, (q.sw, q.tw), where q.sw is the spatial window and q.tw the timeframe. queryPlaces is a query plan fragment to be executed at each place. Query result is qryResult. qryTiles are the tiles per place related to a query that need to be processed. STIndex is the spatio-temporal index.
 1  // Global Processing
 2  // Find the tiles that overlap q.sw from the global index: C ← {c | c ∈ Gr,s and |c ∩ q.sw| > 0};
 3  for c ∈ C do
 4      plc ← find place from tile c;
 5      at Place plc qryTiles(plc).add(c);
 6      queryPlaces(plc).add(plc);
 7  // Local Processing
 8  for plc in queryPlaces do
 9      query ← new STRangeQuery(qryTiles(plc), STIndex(plc), q.sw, q.tw);
10      queries(plc) ← execute query;
11      result ← fetch result at plc from queries(plc);
12      qryResult ← merge result with qryResult;
13  return qryResult;

The details of the per-place query execution are described next. At a given place, we iterate over each tile and check if it fully or partially overlaps the query spatial window. In the first case, we perform bitwise OR operations on the bit-vectors corresponding to interval table entries that are inside the query timeframe. The bits that are set in the resultant bit-vector represent the object IDs that match the query criteria. In the second case of partially overlapped tiles, we need to check for the exact location of the related objects, as well as the actual timeframe of the objects. This is done because the objects might be outside the query window boundaries, and involves creating a union of the record ID list RIDList for the interval table entries that are inside the query timeframe. Then, for each rid entry in the resultant list, the corresponding location record fields of coordinates (latitude, longitude) and timestamp are retrieved from the location table. If the object location is inside the

query spatial window and the timestamp is within the query timeframe, it is included in the partial result. Finally, the object IDs from each tile are gathered and concatenated with other results obtained from the same place. In the end, they are all gathered and sent to the master node, with no need for communication during the execution of the place-specific queries.

Figure D.5: Range query throughput: variable temporal range. [Query throughput vs. timeframe percentage (5% to 20%) for the smallR, medR, and largeR query radii.]
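The fully-overlapped-tile case described above amounts to OR-ing the bit-vectors of the interval slots that fall inside the query timeframe and reading off the set bits. The following C++ sketch shows the idea; it is illustrative only, with std::bitset standing in for DISTIL+'s bitmap type and a simplified interval key.

// Illustrative sketch of the fully-overlapped-tile case: OR the bit-vectors of
// every interval slot inside the query timeframe, then read off the set bits.
// The fixed bitmap width and interval keys are hypothetical simplifications.
#include <bitset>
#include <cstdio>
#include <map>
#include <vector>

constexpr std::size_t kMaxObjects = 1024;            // hypothetical object-ID domain
using ObjectBitmap = std::bitset<kMaxObjects>;

std::vector<std::size_t> matchFullTile(const std::map<long, ObjectBitmap>& intervalTable,
                                       long tBegin, long tEnd) {
    ObjectBitmap result;
    // Visit only interval slots whose key lies in [tBegin, tEnd).
    for (auto it = intervalTable.lower_bound(tBegin);
         it != intervalTable.end() && it->first < tEnd; ++it)
        result |= it->second;                        // bitwise OR accumulates matches
    std::vector<std::size_t> ids;
    for (std::size_t i = 0; i < kMaxObjects; ++i)
        if (result.test(i)) ids.push_back(i);        // set bits are matching object IDs
    return ids;
}

int main() {
    std::map<long, ObjectBitmap> intervalTable;
    intervalTable[100].set(7);
    intervalTable[101].set(42);
    intervalTable[200].set(99);                      // outside the query timeframe below
    for (std::size_t id : matchFullTile(intervalTable, 100, 150))
        std::printf("object %zu\n", id);
    return 0;
}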

D.2.1 Range Query Performance

In this section, we evaluate and analyze the range query performance of DISTIL+ using different experiment variables.

D.2.1.1 Range Query (STRQ) Timeframe

Figure D.5 presents the query throughput while varying the timeframe (query time window) of the queries. As queries use longer time intervals, more data is processed, and the query throughput gradually drops.

Figure D.6: Range query throughput scalability: variable dataset size. [Query throughput vs. dataset size (10M to 100M records) for the smallR, medR, and largeR query radii.]

D.2.1.2 Range Query (STRQ) Dataset Size and Spatial Extent

In this section, we present additional experimental results that explore the performance characteristics of DISTIL+'s range query processing. Here, we evaluate the range query throughput while varying the dataset size (from 10M to 100M records) and the spatial window size (smallR, medR, largeR). Figure D.6 presents the throughput results. Note that there is no significant degradation of query throughput with the larger datasets. The spatial extent has a greater impact on throughput than the dataset size. This is due to the fact that covering a larger query area results in less data locality, as more spatial tiles are processed.

D.2.1.3 Tile Placement Policy and Number of Query Threads

In this experiment, we show how query throughput is affected by the location of the spatial tiles within the cluster. We evaluate range query throughput with the RRR and MDR tile placement policies. The RRR policy outperforms MDR for smaller range queries (Figure D.7a). In Figures D.7b and D.7c we observe that there is a smaller throughput difference between these policies, but the RRR policy continues to achieve higher throughput. Also, we note that query parallelism can be a key factor in improving throughput.

It can be better to have multiple query workers executing the queries in parallel. For example, in Figure D.7a the throughput is typically higher when using multiple worker threads.

Figure D.7: Range query throughput comparison: RRR vs. MDR. [Three panels: (a) Small Query Radius (smallR), (b) Medium Query Radius (medR), (c) Large Query Radius (largeR); query throughput vs. number of worker threads (1, 2, 4, 8) for the RRR and MDR policies.]

Appendix E

Source Code Statistics

In this appendix, we provide statistics on some of the code that we developed for this thesis. We use CLOC [38] to count the lines of code (LOC) in each codebase. We exclude blank lines from the LOC count, and use the --exclude-dir and --include-lang options to exclude imported dependencies (such as libraries and open-source data structures) and to ignore any files that do not match the codebase's primary programming language (such as bash scripts, documentation, and datasets), respectively. A summary of the source codes is provided in Table E.1.

Table E.1: Source Code Summary and Statistics

Name          Language   Description                                  LOC
HashJoin      C/C++      Framework for hash joins, based on [29]      24,689
Aggregation   C/C++      Framework for aggregation workloads          48,844
IndexJoin     C/C++      Framework for index nested loop joins         1,797
DISTIL+       X10        Distributed spatio-temporal data system      11,267
DatagenJoin   Python     Dataset generator for joins                      900
DatagenAgg    Python     Dataset generator for aggregations               340

Vita

Candidate’s full name: Puya Memarzia

University attended (with dates and degrees obtained):
• MSc in Computer Engineering - Software Engineering, Shiraz University, 2014
• BSc in Computer Engineering - Software Engineering, Shiraz Azad University, 2011

Publications:
• The Art of Efficient In-memory Query Processing on NUMA Systems: a Systematic Approach, 36th IEEE International Conference on Data Engineering (ICDE 2020), April 2020
• DISTIL+: A Scalable Spatio-temporal Distributed In-memory Data System, Distributed and Parallel Databases (DAPD), Submitted Dec 2019
• Toward Efficient In-memory Data Analytics on NUMA Systems, arXiv, August 2019 (preprint)
• Toward Efficient Processing of Spatio-temporal Workloads in a Distributed In-memory System, 20th IEEE International Conference on Mobile Data Management (MDM), June 2019
• A Six-dimensional Analysis of In-memory Aggregation, 22nd International Conference on Extending Database Technology (EDBT), March 2019
• DISTIL: A Distributed In-Memory Data Processing System for Location-Based Services, 26th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, November 2018
• On Improving Data Skew Resilience in Main-memory Hash Joins, 22nd International Database Engineering & Applications Symposium (IDEAS), May 2018
• In-depth Study on the Performance Impact of CUDA, OpenCL, and PTX Code, Journal of Information and Computing Science (JIC), Volume 10 No 2, pp. 124-136, February 2015
• Exploring GPU Memory Performance Using Digital Image Processing Algorithms, Indian Journal of Computer Science and Engineering (IJCSE), Volume 5 Issue 6, pp. 221-232, December 2014

Conference Presentations:
• In-memory Aggregation for Big Data Analytics, UNB 2019 Research Expo, April 2019

• The Art of Efficient In-memory Query Processing on NUMA Systems: a Systematic Approach, 36th IEEE International Conference on Data Engineering (ICDE 2020), Dallas, Texas, April 2020