Accelerating Main Memory Query Processing for Data Analytics

by

Puya Memarzia

Master of Science, Shiraz University, 2014

A DISSERTATION SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Doctor of Philosophy

In the Graduate Academic Unit of Faculty of Computer Science

Supervisor(s): Virendrakumar C. Bhavsar, PhD, Computer Science
               Suprio Ray, PhD, Computer Science

Examining Board: Kenneth Kent, PhD, Computer Science
                 Patricia Evans, PhD, Computer Science
                 Eduardo Castillo Guerra, PhD, Electrical and Computer Engineering

External Examiner: Khuzaima Daudjee, PhD, Computer Science, University of Waterloo

This dissertation is accepted by the Dean of Graduate Studies

THE UNIVERSITY OF NEW BRUNSWICK

April, 2020

© Puya Memarzia, 2020

Abstract

Data analytics provides a way to understand and extract value from an ever-growing volume of data. The runtime of analytical queries is of critical importance, as fast results enhance decision making and improve user experience. Data analytics systems commonly utilize in-memory query processing techniques to achieve better throughput and lower latency. Although processing data that is already in the main memory is decidedly speedier than disk-based query processing, this approach is hindered by limited memory bandwidth and cache capacity, resulting in the under-utilization of processing resources. Furthermore, the characteristics of the hardware, data, and workload can all play a major role in determining execution time, and the best approach for a given application is not always clear. In this thesis, we address these issues by investigating ways to design more efficient algorithms and data structures. Our approach involves the systematic application of application-level and system-level refinements that improve algorithm efficiency and hardware utilization. In particular, we conduct a comprehensive study on the effects of dataset skew and shuffling on hash join algorithms. We significantly improve join runtimes on skewed datasets by modifying the algorithm's underlying hash table. We then further improve performance by designing a novel hash table based on the concept of cuckoo hashing. Next, we present a six-dimensional analysis of in-memory aggregation that breaks down the variables that affect query runtime. As part of our evaluation, we investigate 13 different algorithms and data structures, including one that we specifically developed to excel at a popular query category. Based on our results, we produce a decision tree to help practitioners select the best approach based on aggregation workload characteristics. After that, we dissect the runtime impact of NUMA architectures on a wide variety of query workloads and present a methodology that can greatly improve query performance with minimal modifications to the source code. This approach involves systematically modifying the application's thread placement, memory placement, and memory allocation, and reconfiguring the operating system. Lastly, we design a scalable query processing system that uses distributed in-memory data structures to store, index, and query spatio-temporal data, and demonstrate the efficiency of our system by comparing it against other data systems.

Dedication

I dedicate this thesis to all the people that supported and encouraged me along the way.

Acknowledgements

First of all, I would like to thank my supervisors, Dr. Virendra C. Bhavsar and Dr. Suprio Ray, for their guidance, patience, and encouragement. I would also like to thank the members of my examining committee, Dr. Eric Aubanel, Dr. Kenneth Kent, Dr. Patricia Evans, Dr. Eduardo Castillo Guerra, and Dr. Khuzaima Daudjee, for their time, effort, support, and constructive comments. Additionally, I would like to thank Dr. Kenneth Kent and Aaron Graham from IBM CASA and Serguei Vassiliev and Kaizaad Bilimorya from Compute Canada, for providing access to some of the machines used for my experiments.

Table of Contents

Abstract ii

Dedication iv

Acknowledgements v

Table of Contents vi

List of Tables xiii

List of Figures xv

Abbreviations xix

1 Introduction 1
   1.1 Thesis Objectives and Methodology ...... 2
   1.2 Thesis Outline ...... 4

2 Background 6
   2.1 Trends in Modern Hardware Architecture ...... 6
      2.1.1 The Power Wall ...... 7
      2.1.2 The ILP Wall ...... 7
      2.1.3 The Memory Wall ...... 8
      2.1.4 CPU Cache and Memory Hierarchy ...... 9
      2.1.5 Non-Uniform Memory Access (NUMA) Architectures ...... 10
   2.2 Main Memory Query Processing ...... 11
      2.2.1 Data Persistence ...... 12
   2.3 Query Data Operations ...... 13
      2.3.1 Join Queries ...... 13
      2.3.2 Aggregation Queries ...... 14

3 Main Memory Join Processing 18
   3.1 Motivation ...... 19
   3.2 Related Work ...... 21
      3.2.1 Hash Joins ...... 22
      3.2.2 Hash Join Configurations ...... 23
      3.2.3 Hash Tables ...... 24
   3.3 The Impact of Data Skew ...... 24
   3.4 Separate Chaining with Value-Vectors (SCVV) ...... 25
      3.4.1 Lookup and materialization costs ...... 26
      3.4.2 Evaluation and Overhead ...... 27
   3.5 Maple Hash Table (MH) ...... 28
      3.5.1 Data Structure ...... 29
      3.5.2 Hash Functions ...... 30
      3.5.3 Concurrent Implementation ...... 31
      3.5.4 Performance Evaluation ...... 33
   3.6 Evaluation ...... 34
      3.6.1 Platform Specifications ...... 35
      3.6.2 Datasets ...... 36
      3.6.3 Results and Discussion ...... 37
      3.6.4 Build Table Skew ...... 38
         3.6.4.1 Sequential 1:N dataset ...... 38
         3.6.4.2 Random near 1:N dataset ...... 38
         3.6.4.3 Gaussian M:N dataset ...... 39
         3.6.4.4 Zipf M:N dataset ...... 41
      3.6.5 Probe Table Skew ...... 41
      3.6.6 Effect of Dataset Shuffling on Hash Tables ...... 43
      3.6.7 CPU Architecture ...... 45
   3.7 Chapter Summary ...... 46

4 Main Memory Aggregation Processing 48
   4.1 Motivation ...... 49
   4.2 Related Work ...... 52
   4.3 Queries ...... 55
   4.4 Data Structures and Algorithms ...... 56
      4.4.1 Sort-based Aggregation Algorithms ...... 57
         4.4.1.1 ...... 57
         4.4.1.2 ...... 58
         4.4.1.3 Radix Sort (MSB and LSB) ...... 58
         4.4.1.4 Spreadsort ...... 58
         4.4.1.5 Sorting Microbenchmarks ...... 59
      4.4.2 Hash-based Aggregation Algorithms ...... 59
         4.4.2.1 Linear probing ...... 60
         4.4.2.2 Quadratic probing ...... 61
         4.4.2.3 Separate chaining ...... 62
         4.4.2.4 Cuckoo Hashing ...... 62
      4.4.3 Tree-based Aggregation Algorithms ...... 63
         4.4.3.1 Btree ...... 63
         4.4.3.2 Ttree ...... 64
         4.4.3.3 ART ...... 64
         4.4.3.4 Judy Arrays ...... 64
      4.4.4 Data Structure Microbenchmark ...... 65
      4.4.5 Time Complexity ...... 66
   4.5 Datasets ...... 66
   4.6 Results and Analysis ...... 68
      4.6.1 Platform Specifications ...... 70
      4.6.2 Vector Aggregation ...... 71
      4.6.3 Cache and TLB misses ...... 72
      4.6.4 Memory Efficiency ...... 74
      4.6.5 Dataset Distributions ...... 75
      4.6.6 Range Search Query ...... 77
      4.6.7 Scalar Aggregation ...... 78
      4.6.8 Multi-threaded Scalability ...... 79
   4.7 Summary and Discussion ...... 82
   4.8 Chapter Summary ...... 83

5 Query Processing on NUMA Systems 85
   5.1 Motivation ...... 86
   5.2 NUMA Topologies ...... 89
      5.2.1 Experiment Workloads ...... 90
   5.3 Related Work ...... 91
   5.4 Methodology ...... 93
      5.4.1 Dynamic Memory Allocators ...... 94
         5.4.1.1 ptmalloc ...... 95
         5.4.1.2 jemalloc ...... 95
         5.4.1.3 tcmalloc ...... 96
         5.4.1.4 Hoard ...... 96
         5.4.1.5 tbbmalloc ...... 97
         5.4.1.6 supermalloc ...... 97
         5.4.1.7 mcmalloc ...... 97
         5.4.1.8 Overriding The Memory Allocator ...... 98
         5.4.1.9 Memory Allocator Microbenchmark ...... 99
      5.4.2 Thread Placement and Scheduling ...... 100
      5.4.3 Memory Placement Policies ...... 103
      5.4.4 Operating System Configuration ...... 103
         5.4.4.1 Virtual Memory Page Management ...... 104
         5.4.4.2 Automatic NUMA Load Balancing ...... 104
   5.5 Evaluation ...... 105
      5.5.1 Experimental Setup ...... 105
      5.5.2 Datasets and Implementation Details ...... 107
      5.5.3 Operating System Configuration Experiments ...... 110
         5.5.3.1 AutoNUMA Load Balancing Experiments ...... 110
         5.5.3.2 Transparent Hugepages Experiments ...... 112
         5.5.3.3 Hardware Architecture Experiments ...... 112
      5.5.4 Memory Allocator Experiments ...... 113
         5.5.4.1 Hashtable-based Experimental Workloads ...... 113
         5.5.4.2 Impact of Dataset Distribution ...... 115
         5.5.4.3 Effect on In-memory Indexing ...... 116
      5.5.5 Database Engine Experiments ...... 117
      5.5.6 Results Summary ...... 120
   5.6 Chapter Summary ...... 121

6 Distributed In-memory kNN Query Processing 123
   6.1 Motivation ...... 124
      6.1.1 APGAS and the X10 Language ...... 126
   6.2 Related Work ...... 127
      6.2.1 Big Spatial Data Processing Systems ...... 127
      6.2.2 Scalable Spatio-Temporal Indexes ...... 128
      6.2.3 Distributed kNN Query Processing ...... 131
   6.3 Problem Statement and Design Considerations ...... 132
   6.4 DISTIL+ System Overview ...... 134
      6.4.1 Data Partitioning ...... 135
      6.4.2 Data Persistence ...... 137
         6.4.2.1 Local Store ...... 138
         6.4.2.2 Global Store ...... 138
      6.4.3 Indexing Techniques ...... 138
         6.4.3.1 DISTIL+ Distributed In-Memory Spatio-Temporal Index ...... 140
   6.5 Spatio-temporal kNN Query Processing ...... 143
      6.5.1 kNN Pruning ...... 145
   6.6 Evaluation ...... 146
      6.6.1 Experimental Setup ...... 146
      6.6.2 Workloads and Experimental Parameters ...... 147
         6.6.2.1 kNN Query (STkNNQ) Dataset Size ...... 148
         6.6.2.2 kNN Query (STkNNQ) k parameter ...... 149
      6.6.3 Comparison with Other Systems ...... 149
         6.6.3.1 Update Performance ...... 150
         6.6.3.2 Query Performance ...... 151
   6.7 Chapter Summary ...... 152

7 Conclusion 154
   7.1 Future Work ...... 156

References 180

A Main Memory Joins - Additional Results 181

B Main Memory Aggregation - Additional Results 183

C Query Processing on NUMA Systems - Additional Results 185

D DISTIL+ - Additional System Details and Results 186
   D.1 Update Processing ...... 186
      D.1.1 Update Performance ...... 189
         D.1.1.1 Update Batch Size per Worker Thread ...... 189
         D.1.1.2 Tile placement policy and dataset size ...... 189
         D.1.1.3 Node Scaling ...... 190
   D.2 Range query (STRQ) processing ...... 191
      D.2.1 Range Query Performance ...... 193
         D.2.1.1 Range Query (STRQ) Timeframe ...... 193
         D.2.1.2 Range Query (STRQ) Dataset Size and Spatial Extent ...... 194
         D.2.1.3 Tile Placement Policy and Number of Query Threads ...... 194

E Source Code Statistics 197

Vita

List of Tables

2.1 Aggregation Query Categories ...... 16
3.1 Experimental setup specifications ...... 35
3.2 Hash Join Configuration Key ...... 36
3.3 Dataset details ...... 37
4.1 Aggregation Queries ...... 55
4.2 Data Structure Time Complexity ...... 66
4.3 Algorithms and Data Structures ...... 67
4.4 Dataset Distributions ...... 68
4.5 Experiment Parameters ...... 69
4.6 Peak Memory Usage (MB) - Q1 on Rseq 10^3 Groups ...... 74
4.7 Peak Memory Usage (MB) - Q3 on Rseq 10^3 Groups ...... 75
4.8 Concurrent Algorithms and Data Structures ...... 80
5.1 Experiment Workloads ...... 90
5.2 Profiling thread placement - W1 on Machine A - Default (managed by OS) vs Modified (Sparse policy) ...... 102
5.3 NUMA Machine Specifications ...... 106
5.4 Experiment Parameters (bolded values are system defaults) ...... 108
6.1 Index Data Structure Comparison ...... 140
6.2 Experiment Parameters (default settings bolded) ...... 146
6.3 Data System Attributes and Features ...... 149
6.4 Update Throughput (records/s) Dataset Scaling Comparison ...... 151
E.1 Source Code Summary and Statistics ...... 197

List of Figures

1.1 Thesis Overview ...... 3
2.1 Processor-Memory Performance Gap (relative performance) [68] ...... 8
2.2 TLB and Cache Hierarchy ...... 10
2.3 Memory Access on a NUMA System ...... 10
2.4 Overview of Query Processing Approaches: disk-based (left) vs in-memory (right) ...... 11
2.5 Equijoin Example ...... 14
2.6 Hash Join Overview ...... 14
2.7 Comparison of aggregation categories - distributive and algebraic (top) vs holistic (bottom) ...... 15
2.8 Aggregation Workload Overview ...... 16
3.1 Hash join experiments using configurations from [29] - Skylake - 8 threads ...... 22
3.2 Example: Comparison of a bucket chain with and without value vectors (modulo hash function) ...... 26
3.3 Maple Hash Table Data Structure - a key is found without the need to probe all the elements in the bucket ...... 29
3.4 Example of finding a location for a key using multi-stage hashing to determine table, bucket, and array index ...... 29
3.5 Comparing Maple hash table and Intel Libcuckoo - uniform random dataset - Skylake - 8 threads ...... 34
3.6 Hash Join run time with Variable Build Skew on Ordered Datasets - Skylake - 8 threads ...... 39
3.7 Hash Join run time with Variable Build Skew on Shuffled Datasets - Skylake - 8 threads ...... 40
3.8 Hash Join Runtime with Variable Probe Skew on 1:N Datasets - Skylake - 8 threads ...... 42
3.9 Performance impact of shuffling on MH and SCVV - Zipf 1:N dataset - Skylake - 8 threads ...... 44
3.10 Total join time on variable CPU architectures - 1:N Datasets - 8 threads ...... 45
4.1 An overview of the analysis dimensions ...... 50
4.2 Sort Algorithm Microbenchmark ...... 59
4.3 Data Structure Microbenchmark ...... 65
4.4 Vector Aggregation Q1 - 100M Records ...... 70
4.5 Vector Aggregation Q3 - 100M records ...... 72
4.6 TLB misses - Rseq 100M Dataset ...... 73
4.7 TLB misses - Rseq 100M Dataset ...... 73
4.8 Vector Q1 - Variable Key Distributions - 100M records ...... 76
4.9 Range Search Aggregation Q7 - 100M records ...... 77
4.10 Scalar Aggregation Q6 - 100M records ...... 79
4.11 Parallel Sort Algorithm Microbenchmark ...... 80
4.12 Multi-threaded Scaling - Rseq 100M ...... 81
4.13 Decision Flow Chart ...... 83
5.1 Machine NUMA Topologies (machine specifications in Table 5.3) ...... 89
5.2 Memory Allocator Microbenchmark - Machine A ...... 99
5.3 OS thread scheduler behavior vs thread affinity strategy - Consecutive runs of W1 - Machine A ...... 101
5.4 Comparison of Sparse and Dense thread affinitization strategies - W1 - Machine A ...... 101
5.5 Evaluating and profiling AutoNUMA load balancing - W1 ...... 111
5.6 Evaluating effect of AutoNUMA and THP on memory placement policies and allocators - W1 ...... 112
5.7 Comparison of memory allocators - variable workload, memory placement policy, and machine ...... 114
5.8 W1 - Machine A - Effect of dataset distribution ...... 115
5.9 Index nested loop join workload (W4) - variable memory allocators and memory placements - Machine A ...... 116
5.10 22 TPC-H queries (W5) scale factor 20 - Query latency reduction - Variable database systems - Machine A ...... 118
5.11 Effect of memory allocator on TPC-H query latency - MonetDB - Machine A ...... 119
5.12 Application-agnostic Decision Flowchart ...... 121
6.1 APGAS Programming Model ...... 126
6.2 DISTIL+ System Architecture ...... 134
6.3 DISTIL+ Update Processing Overview ...... 135
6.4 DISTIL+ Query Processing Overview ...... 136
6.5 DISTIL+ Index Architecture ...... 141
6.6 Illustration of kNN query (STkNNQ) processing ...... 142
6.7 kNN Query Latency: variable dataset size ...... 148
6.8 kNN Query Latency: variable k parameter ...... 148
6.9 Insert/Update Throughput (records/s) Comparison (GeoSpark and ST-Hadoop do not support location updates) ...... 150
6.10 Spatio-temporal query latency comparison ...... 152
A.1 Hash Join run time with Variable Build Skew on Gaussian near M:K Dataset - Skylake - 8 threads ...... 182
A.2 Hash Join run time with Variable Build Skew on Zipf near M:K Dataset - Skylake - 8 threads ...... 182
B.1 Vector Aggregation Q1 - 100M Records - Intel Harpertown (Penryn) Machine - Variable dataset distribution ...... 184
C.1 Head-to-head DBMS Comparison - All 22 TPC-H queries - Machine A - Best Configuration ...... 185
D.1 Update throughput: batch size/worker threads ...... 189
D.2 Tile placement policy comparison: variable dataset size ...... 190
D.3 Update throughput scalability: variable cluster size ...... 190
D.4 Illustration of range query (STRQ) processing ...... 191
D.5 Range query throughput: variable temporal range ...... 193
D.6 Range query throughput scalability: variable dataset size ...... 194
D.7 Range query throughput comparison: RRR vs. MDR ...... 196

Symbols and Abbreviations

←      Assignment
∀      For All
∑      Summation
∈      Set Membership
O      Big-O Notation
CPU    Central Processing Unit
DBMS   Database Management System
DRAM   Dynamic Random Access Memory
GPGPU  General-Purpose Computing on Graphics Processing Units
GPU    Graphics Processing Unit
HDD    Hard Disk Drive
HEDT   High-end Desktop
HT     Hyper-threading
HTM    Hardware Transactional Memory
I/O    Input/Output
ILP    Instruction-level Parallelism
kNN    k-Nearest Neighbors
LAR    Local Access Ratio
LBS    Location-Based Service
LOC    Lines of Code
NUMA   Non-Uniform Memory Access
OS     Operating System
RSS    Resident Set Size
SIMD   Single Instruction Multiple Data
SMT    Simultaneous Multithreading
SMP    Symmetric Multiprocessing
SSD    Solid State Drive
SQL    Structured Query Language
THP    Transparent Hugepages
TLB    Translation Lookaside Buffer

Chapter 1

Introduction

The digital world is producing large volumes of data at increasingly higher rates. There is a widespread consensus that the rate of data generation will continue to grow exponentially [158, 172, 75]. This data contains value that can be extracted in various forms, such as insights, trends, recommendations, and predictions. We require machine assistance in order to process and interpret this data in a timely manner. Data analytics provides this assistance, and it is a key technology that powers the information age. However, coping with the challenges of big data is an ongoing battle that requires continuous development and innovation on both the software and hardware fronts. The need to store, retrieve, and process data efficiently has become vital in all spheres of human activity. Furthermore, the breadth of applications that depend on efficient data operations has grown dramatically [75]. Data analytics provides support to applications such as business intelligence, decision support systems, machine learning, data warehousing, and scientific applications. Users desire results in a timely fashion, and the industry places a great emphasis on performance metrics such as throughput, latency, and scalability. Main memory query processing systems have been increasingly adopted due to continuous improvements in DRAM capacity and speed, and the growing requirements of the data analytics industry. However, the speed and performance of memory are increasing at a rate that is greatly outmatched by advances in CPU performance [145]. As the hardware landscape shifts toward new architectures, achieving good performance and scalability in these applications is a key challenge.

As part of our study of data analytics, we put two fundamental and ubiquitous data operations under the spotlight: joins and aggregations. They are arguably the two most expensive data operations, and are used in a wide range of technologies, ranging from SQL-based relational databases [73, 29] to NoSQL systems, such as graph stores [122] and MapReduce applications [40]. These data operations can be implemented using a wide variety of different algorithms. However, efficient implementations must leverage the ever-evolving features of modern hardware and adapt to the varying characteristics of the data and workload. In recent years, there has been a major push toward main memory (or in-memory) query processing. Using main memory query processing relieves the burden of I/O bottlenecking, but shifts the bottleneck to the memory hierarchy and the application's memory access patterns. As a result, our study of data analytics is performed in the context of in-memory query processing. Our work is fundamentally focused on the core data structures and algorithms that form the basis of query processing. Next, we outline our objectives.

1.1 Thesis Objectives and Methodology

The main objective of this research is to find ways to boost performance (latency, throughput, and scalability) in main memory query processing. To do so, we examine this topic from the perspectives of the software and workloads, the data characteristics, and the evolution of hardware. The scope of this research is primarily focused on algorithms and data structures used for main memory query processing. Data operations, such as joins and aggregations, are ubiquitous and time consuming, and carry significant importance in the data analytics industry. Applications include traditional relational databases, NoSQL data systems, and machine learning and data science applications. These applications are typically run on modern hardware, such as Non-Uniform Memory Access (NUMA) architectures. The performance of in-memory query processing is notably affected by NUMA, as the memory is partitioned into separate nodes and represented as a single address space. This has implications for in-memory query performance, as the speed of accessing data in memory varies depending on its physical address. We also study the important role of distributed systems in big data analytics. Specifically, we investigate spatio-temporal kNN queries in the context of a distributed data system that employs in-memory query processing. Figure 1.1 provides a general overview of the thesis topics and their relationship with main memory query processing.

Figure 1.1: Thesis Overview

The methodology used in this research is as follows. First, we identify and analyze the problems and performance bottlenecks in main memory query processing, and the limitations in existing approaches. This includes a comprehensive review of existing work in this area, as well as performance evaluations of state-of-the-art implementations. To do so, we build on prior work by creating new synthetic datasets that address issues encountered in real-world scenarios, such as data skew and locality. We develop new data structures, algorithms, and techniques that leverage algorithmic improvements and new trends in modern hardware architecture to achieve significant speedups over existing approaches. Lastly, we support our findings with in-depth empirical and theoretical analysis, including algorithm complexity, CPU cache and memory profiling, microbenchmarks, and concurrency analysis.

1.2 Thesis Outline

The key contributions of this thesis are:

• A systems-oriented approach for accelerating query processing workloads by utilizing new/modified data structures and algorithms on modern hardware

• A novel hash table based on cuckoo hashing that excels at accelerating join operations on skewed/shuffled data

• New and extended synthetic datasets designed for studying in-memory query processing

• A methodology to help practitioners speed up query processing on NUMA systems with minimal code modifications

• A scalable spatio-temporal data system that accelerates kNN queries using a multi-level distributed in-memory index

• Extensive experimental evaluation, involving different query workloads, data structures, and database systems, on different machine architectures, with profiling and performance counters, time complexity analysis, and microbenchmarks

The thesis is organized as follows: in Chapter 2, we outline the core background concepts that are relevant to in-memory query processing. Chapter 3 covers our work on in-memory hash joins, including our novel hash table design. In Chapter 4, we delve into a six-dimensional analysis of in-memory aggregation. Chapter 5 presents our approach to systematically accelerate in-memory query processing on NUMA systems. In Chapter 6, we outline our research on adapting in-memory query processing to a distributed data system for spatio-temporal queries. Chapter 7 concludes the thesis and outlines the future work. Lastly, in Appendix A and the following appendices, we include additional system details, experimental results, and statistics that were omitted for brevity.

Chapter 2

Background

As mentioned earlier, performance metrics that affect speed, such as throughput, latency, and scalability, are important factors in data analytics. Developing efficient algorithms requires an intimate understanding of the traits and characteristics of the underlying hardware, software, operating system, and data. In this chapter, we explore the fundamental background concepts surrounding this subject.

2.1 Trends in Modern Hardware Architecture

Moore's law predicts that transistor density will double every two years. However, the development of commodity CPU architectures continues to be influenced by three major limiting factors: the power wall, the ILP wall, and the memory wall [8]. CPU manufacturers have been struggling to keep Moore's law alive by working around these limitations, but there is evidence that Moore's law is slowing down [43]. A fourth "wall" may be encountered in the relatively near future, when shrinking transistor sizes further becomes unfeasible due to physical and electrical limitations, as well as economic factors, such as research and development costs and manufacturing costs, which are heavily impacted by yields. As such, transistor miniaturization is projected to end within the next decade [36], potentially leading to an era of sluggish hardware growth amidst an accelerating torrent of data. These developments will force the hardware industry to search for new pathways toward scalability, and the software industry will be increasingly motivated to take application performance and algorithmic efficiency more seriously [107].

2.1.1 The Power Wall

The power wall was encountered at a time when increasing CPU frequency to achieve greater performance was the industry norm. The push to increase CPU frequency led to CPUs that produced too much heat to be cooled by conventional means (heatsink and fan). High heat can permanently damage the processor. Modern CPUs contain various mechanisms to limit power output and will shut down if temperatures reach unsafe levels. This issue puts a damper on processor frequency scaling, and has motivated the production of multi-core CPUs, which have become the de facto industry standard. Multi-core CPUs exploit the fact that CPU power consumption grows much faster than linearly with clock frequency. As a result, multiple lower-clocked cores can outperform a single higher-clocked CPU while producing the same amount of heat. Nowadays, multi-core CPUs are used everywhere from smartphones and IoT devices to desktops and supercomputers.

2.1.2 The ILP Wall

The ILP (Instruction-level Parallelism) wall limits the rate at which a single processor can simultaneously execute instructions. CPUs use pipelining and redundant circuitry to extract additional performance by running multiple instructions at the same time. This provides a way to mitigate the power wall. However, pipelining eventually leads to diminishing returns due to various hazards, such as data dependencies and branch mispredictions. The ILP wall led to the development of technologies, such as simultaneous multithreading (SMT), which improve CPU resource utilization by scheduling two threads on the same core. As the demands of the industry continue to grow, hardware manufacturers have placed a greater emphasis on data and task parallelism.

Figure 2.1: Processor-Memory Performance Gap (relative performance) [68]

2.1.3 The Memory Wall

The memory wall is attributed to the growing gap between CPU and memory performance [109, 145]. Memory performance is increasing at a much slower pace compared to CPU performance, and memory bandwidth per CPU core is declining. Applications can spend a considerable portion of their runtime waiting for data to process [76]. Figure 2.1 illustrates the disproportionate growth of CPU and memory performance. The demand for greater memory capacity and bandwidth has led the industry to migrate from uniform memory access (UMA) to non-uniform memory access (NUMA) architectures. Although these changes provide a path toward greater scalability, the burden of leveraging the hardware falls to the software developers.

Software has generally been slow to adapt to advances in hardware architecture. There is a widening divide between software projects that try to leverage modern hardware and those that piggyback on incremental hardware improvements, such as the IPC (instructions per cycle) increases that we have seen for the past several CPU generations [49]. For example, the popular MySQL database system [127] continues to lack intra-query parallelism. This issue has been easy to overlook in some fields due to a greater focus on multi-tasking, task parallelism, and multi-process software environments. High hardware utilization is achieved by giving each task a slice of the hardware resources proportional to its needs. This approach is not suitable for all applications. For example, processing large datasets in a large main memory data processing system requires advances in intra-query parallelism.

2.1.4 CPU Cache and Memory Hierarchy

The CPU cache is a type of memory that sits between the main memory (DRAM) and the CPU registers. Relative to main memory, the cache stores small amounts of data, but is significantly faster. Modern CPU caches are designed to take advantage of two types of data locality: temporal locality and spatial locality. Temporal locality refers to data that is likely to be reused in the near future. Spatial locality benefits from data that is near other data in terms of their physical memory address. Data that is not in the CPU cache (cache miss) must be loaded from the memory. The TLB (Translation Lookaside Buffer) accelerates virtual address lookups by storing recently accessed memory addresses. The TLB can only store a limited number of entries, leading to the eviction of older entries. When a requested memory address is not found within the TLB (TLB miss), a much slower operation finds the address in the page table. Figure 2.2 shows how the data from a virtual address is accessed by the CPU. Cache and TLB misses incur moderate performance penalties, and generally have a greater impact on memory-intensive applications, such as data analytics. Although successive CPU architectures have increased cache and TLB sizes, mitigating these issues on the software side is a major challenge.
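As a simple, self-contained illustration of spatial locality (our own example, not an experiment from this thesis), the two functions below sum the same row-major matrix; the first walks consecutive addresses and mostly hits in the cache, while the second strides across cache lines and typically incurs far more cache and TLB misses on large matrices.

```cpp
#include <cstddef>
#include <vector>

// Sum a row-major n x n matrix stored as a flat vector.
double sum_row_major(const std::vector<double>& m, std::size_t n) {
    double s = 0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            s += m[i * n + j];   // consecutive addresses: good spatial locality
    return s;
}

double sum_column_major(const std::vector<double>& m, std::size_t n) {
    double s = 0;
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i)
            s += m[i * n + j];   // stride of n elements per access: poor locality
    return s;
}
```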


Figure 2.2: TLB and Cache Hierarchy


Figure 2.3: Memory Access on a NUMA System

2.1.5 Non-Uniform Memory Access (NUMA) Architectures

The demand for greater processing power has pushed the adoption of various decentralized memory controller layouts, which are collectively known as non-uniform memory access (NUMA) architectures. In a NUMA system, CPU and memory resources are partitioned into groups and linked with each other using an interconnect. Each group is referred to as a NUMA node, and the way these nodes are connected to each other is called a NUMA topology. Memory access times within a node are faster (local access), whereas access to other nodes is typically slower due to latency and bandwidth constraints (remote access). A remote access may need to hop through one or more nodes to reach its destination. Figure 2.3 depicts different memory access routes in an example NUMA system.

The first NUMA system was developed by Honeywell Information Systems in 1985 and was limited to two processors. The first commodity NUMA system was released to the market much later with the AMD Opteron K8 in 2003, which could support up to eight NUMA nodes. Today's NUMA systems typically employ a balance of multiple sockets and multicore CPUs, with a greater emphasis on per-socket performance and power efficiency. NUMA architectures are also pervasive in in-memory rack-scale systems, as well as a growing range of manycore CPUs. Recent developments have led to On-Chip NUMA Architectures (OCNA) that partition a processor's cores into multiple NUMA regions, each with their own dedicated memory controller [116, 156]. Based on current trends, future NUMA systems will incorporate even larger many-core processor designs and faster, potentially unified interconnects (such as GEN-Z) with tighter coupling between system components [23]. The adoption of NUMA architectures may even expand beyond server and high-end desktop (HEDT) CPUs to products such as GPUs and high-end smartphones.

2.2 Main Memory Query Processing

A query represents a request to store, retrieve, or manipulate data in a database or data system. Query processing techniques can be broadly divided into two categories, depicted in Figure 2.4: disk-based query processing and main memory (or in-memory) query processing. In disk-based query processing approaches, the data typically resides on one or more hard disk drives. Queries are typically calculated on-demand, by loading parts of the disk-resident data into main memory. Dependence on disk access has implications for query latency, as disks are orders of magnitude slower compared to RAM. However, their affordability and storage capacity continues to factor into the popularity of disk-based systems. In a disk-based query processing system, the cost of a query is primarily determined by the number of disk accesses required to process it.

Figure 2.4: Overview of Query Processing Approaches: disk-based (left) vs in-memory (right)

In main memory query processing, queries are evaluated on data that resides in memory. This approach can produce significant speedups over disk-based query processing, as it avoids the I/O bottleneck completely. However, memory is a relatively scarce resource compared to disk storage, and extracting efficient performance is impeded by factors such as the memory and cache hierarchy, access patterns, and allocation. The push for main memory query processing stems from the demands of the industry, such as decision support systems and real-time data analytics, as well as the falling costs and rising capacities of RAM modules. Disks continue to play an important role as they are often used to provide a means for high capacity data persistence.

2.2.1 Data Persistence

In-memory data processing systems achieve high performance by operating on memory-resident data. However, solely storing the data in volatile main memory entails the risk of data loss. Various events, such as power outages, hardware failures, and application crashes, or even intentional activities, such as system reboots, maintenance, and upgrades, could destroy valuable data at worst, and require time-consuming recovery operations at best. In this context, data persistence reduces the risk of data loss by storing data in non-volatile storage, such as hard drives and solid-state drives. A key challenge in data persistence is determining the format, frequency, and location of storage operations, and finding a trade-off that does not significantly hinder application performance. In distributed data systems, data persistence can be achieved using disk-based local stores at each node, and/or by using a distributed file system. In Chapter 6, we outline a practical example of this in Section 6.4.2.

2.3 Query Data Operations

As mentioned in Chapter 1, we study main memory query processing on two categories of data operations: joins and aggregations. In business intelligence and data analytics workloads, joins are frequently used in conjunction with aggregations. While joins can generate results that are larger than all of the inputs, aggregation typically produces an output that is smaller than the input. Together they help to create human-readable information out of large amounts of data. In this section, we describe the fundamental concepts of joins and aggregation, and their corresponding algorithms and data structures.

2.3.1 Join Queries

In relational databases, a join is described as an operation that combines columns from different tables. This is done by evaluating a predicate (such as equal to or greater than) on the shared columns (known as the join keys) from the participating tables. Joins are a fundamental building block in relational algebra and SQL databases. However, in a broader sense, the same concepts apply to any data that can be represented as key-value pairs. Joins are broadly divided into three categories based on their implementation: sort merge join, nested loop join (or index nested loop join), and hash join. Merge join (also known as sort-merge join) is efficient if the input is sorted on the join columns. The performance of nested loop join depends on the availability of suitable indexes and statistics. Hash joins are typically chosen for their good performance on large and/or non-indexed data. Due to these characteristics, hash joins are frequently used to accelerate joins in main memory data systems [29, 58]. Their main downside is that they can only be used for equality predicates, as they do not support inequality predicates, such as <, >, ≤, or ≥. Hash joins are named as such because they use hash table data structures to accelerate the process of checking for matching join keys.

Figure 2.5: Equijoin Example

Figure 2.6: Hash Join Overview

Figure 2.6 depicts a hash join linking employee names to department names via the department number (join key). The hash join is divided into three phases: partition, build, and probe. The partition phase is optional, and consists of splitting up the input relations, typically on the join keys. During the build phase, one of the join inputs (typically the smaller relation) is designated as the build relation, and inserted into a hash table. The probe phase involves iterating through the other relation (known as the probe relation), checking the hash table for matches, and materializing matching pairs in the output.
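To make the build and probe phases concrete, the following is a minimal single-threaded sketch using C++ standard containers. It mirrors the example in Figure 2.6, but it is only an illustration of the general technique; the row types and data are hypothetical, and this is not the hash join implementation evaluated later in this thesis.

```cpp
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical row types, matching the example in Figure 2.6.
struct Department { int deptno; std::string desc; };
struct Employee   { std::string ename; int deptno; };

int main() {
    std::vector<Department> build_rel = {{10, "Marketing"}, {20, "IT"}, {30, "HR"}, {40, "Financial"}};
    std::vector<Employee>   probe_rel = {{"Gary", 10}, {"Dulley", 10}, {"Reiter", 30}, {"Taylor", 20}};

    // Build phase: insert the (smaller) build relation into a hash table keyed on the join key.
    // unordered_multimap tolerates duplicate join keys, which matters for M:N joins.
    std::unordered_multimap<int, std::string> ht;
    for (const auto& d : build_rel) ht.emplace(d.deptno, d.desc);

    // Probe phase: look up each probe tuple's join key and materialize matching pairs.
    for (const auto& e : probe_rel) {
        auto range = ht.equal_range(e.deptno);
        for (auto it = range.first; it != range.second; ++it)
            std::cout << e.ename << ", " << it->second << "\n";   // e.g. "Gary, Marketing"
    }
    return 0;
}
```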

2.3.2 Aggregation Queries

Aggregation is an essential tool used to consolidate and summarize data. An aggregation query involves grouping the input tuples by their key and then applying an aggregate function to each group. There are three main categories of aggregate functions: distributive, algebraic, and holistic. Distributive functions, such as Count and Sum, can be processed in a distributed manner. This means that the input can be split and processed in multiple partitions, and then the intermediate results can be combined to produce the final result. Algebraic functions consist of two or more distributive functions. For instance, Average can be broken down into two distributive functions: Count and Sum. Holistic aggregate functions cannot be decomposed into multiple distributive functions, and require all of the input data to be processed together. For example, if an input is split into two partitions and the Mode is calculated for each partition, it is impossible to accurately determine the Mode for the total dataset.

Figure 2.7: Comparison of aggregation categories - distributive and algebraic (top) vs holistic (bottom)

Figure 2.7 provides an overview illustrating how these aggregate functions are calculated differently. Distributive and algebraic aggregates can be split up and merged, but holistic aggregates must be applied to each group in a single step. Other examples of holistic functions include Rank, Median, Percentile, and Quantile. Table 2.1 shows SQL examples for each function category. The output of an aggregation can be either in Vector or Scalar format. In Vector aggregates, a row is returned in the output for each unique key in the designated column(s). These columns are commonly specified using the group-by or having keywords.
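A small sketch (our own illustration, with hypothetical helper names) of why this distinction matters for partitioned processing: an algebraic Average can be computed from mergeable per-partition partials, while a holistic Median needs every value of the group in one place.

```cpp
#include <algorithm>
#include <vector>

// Partial state for the algebraic function AVG: two distributive parts (Sum and Count).
struct AvgPartial { double sum = 0; long count = 0; };

AvgPartial partial_avg(const std::vector<double>& partition) {
    AvgPartial p;
    for (double v : partition) { p.sum += v; ++p.count; }
    return p;
}

// Partials produced from independently processed partitions merge into the final result.
double merge_avg(const AvgPartial& a, const AvgPartial& b) {
    return (a.sum + b.sum) / static_cast<double>(a.count + b.count);
}

// A holistic function such as MEDIAN has no such compact partial state:
// all values of a group must be gathered (or sorted) before the result is known.
double median(std::vector<double> all_values) {
    std::sort(all_values.begin(), all_values.end());
    std::size_t n = all_values.size();
    return n % 2 ? all_values[n / 2]
                 : (all_values[n / 2 - 1] + all_values[n / 2]) / 2.0;
}
```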

Table 2.1: Aggregation Query Categories

Query  Category      SQL Representation

Q1     Distributive  SELECT groupkey, COUNT(val) FROM records GROUP BY groupkey;

Q2     Algebraic     SELECT groupkey, AVG(val) FROM records GROUP BY groupkey;

Q3     Holistic      SELECT groupkey, MEDIAN(val) FROM records GROUP BY groupkey;

Figure 2.8: Aggregation Workload Overview

The output value is returned as a new column next to the group-by column. Scalar aggregates process all the input rows and produce a single scalar value as the result. Figure 2.8 depicts an overview of an aggregation query (Q1 from Table 2.1) being applied to an example input table. The input contains sales records of customers purchasing different items, and the output contains each unique customer ID alongside the aggregate value, which counts the number of items each customer purchased.

In recent years, there have been many studies on in-memory sort algorithms and data structures, such as tree-based indexes and hash tables. Many of these data structures can be used for in-memory aggregation. Aggregation algorithms can be categorized by the data structure used to store the data. Based on this approach, the algorithms can be divided into three main categories: sort-based, hash-based, and tree-based algorithms. Thus, different aggregation implementations will inherit the performance characteristics of their underlying algorithms.

The implementation of an aggregate operator can be broken down into two main phases: a build phase and an iterate phase. Consider the following example using a hash table and a vector aggregate function (Q1 from Table 2.1). During the build phase, each key (created from the group-by attribute or attributes) is looked up in the hash table. If it does not exist, it is inserted with a starting value of one. Otherwise, the value for the existing key is incremented. Once the build phase is complete, the iterate phase reads the key-value pairs from the hash table and writes the resulting items to the output. A similar procedure is used for tree data structures. The calculation of the aggregate value during the build phase (early aggregation) is only possible when the aggregate function is distributive or algebraic. As a result, holistic aggregate values cannot be calculated until all records have been inserted. Sort-based approaches "build" a sorted array using the group-by attributes. As a result, all the values for each group are placed in consecutive locations. The aggregate values are then calculated by iterating through the groups.
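A minimal sketch of this build/iterate pattern for the distributive query Q1 (COUNT per group-by key) is shown below. It uses a standard-library hash table and hypothetical types, rather than the specific data structures studied in Chapter 4.

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Build phase: probe the hash table with each group-by key; insert with a count of
// one if absent, otherwise increment the existing value (early aggregation).
std::unordered_map<std::uint64_t, std::uint64_t>
build_counts(const std::vector<std::uint64_t>& group_keys) {
    std::unordered_map<std::uint64_t, std::uint64_t> groups;
    for (std::uint64_t key : group_keys)
        ++groups[key];   // operator[] inserts a zero-initialized value on first sight
    return groups;
}

// Iterate phase: read the key-value pairs out of the hash table and emit the result rows.
std::vector<std::pair<std::uint64_t, std::uint64_t>>
emit_result(const std::unordered_map<std::uint64_t, std::uint64_t>& groups) {
    return {groups.begin(), groups.end()};
}
```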

Chapter 3

Main Memory Join Processing

Joins are a vital and ubiquitous operation in most database systems, and are used to combine sets of data based on a predicate. Hash joins are a type of join that utilizes hash tables to speed up the process. Main memory database systems have long used hash joins to improve performance [58]. Recent advances in data systems and computer hardware have spurred further research in this area. As memory becomes cheaper and denser, main-memory hash joins are being increasingly prescribed to speed up existing systems. In this chapter, we conduct a comprehensive study on the effects of dataset skew and shuffling on four hash join algorithms, and present and evaluate two approaches based on modifying or replacing the underlying hash table to further improve join runtime performance. The chapter is organized as follows: we discuss our motivation in Section 3.1. In Section 3.2 we discuss the related work. In Section 3.3 we describe the issues that arise from skewed datasets. We present two solutions to solving this problem in Sections 3.4 and 3.5. The experimental setup and results are presented in Section 3.6. We conclude and summarize the chapter in Section 3.7.

Note: Parts of this chapter were previously published in [112].

3.1 Motivation

One of the key challenges facing database developers is how to consistently achieve good performance. In their in-depth analysis of in-memory hash join algorithms, Blanas et al. [29] acknowledged the importance of data skew. They showed that a simple no partitioning hash join algorithm can outperform other join algorithms. However, they only considered a basic 1:N join case that involves matching a primary key (unique and ordered) with a foreign key from another table. Joins can involve non-key columns that do not enforce uniqueness or any particular ordering. Popular relational database benchmarks such as TPC-H [35] generally focus on querying data that is uniformly distributed and non-skewed. However, it has been shown that these cases are not necessarily representative of real-world applications [37, 77]. For example, the size of cities and the length and frequency of words can be modeled with Zipfian distributions, and measurement errors often follow Gaus- sian distributions [61]. Furthermore, skewed build keys can be encountered as a result of parallel multi-way joins [21], joins on non-primary key columns, and com- plex queries. It is quite common to observe dataset skew in the output of a join operation, and this result-set may need to be joined with several other intermediate results or tables. The performance impact of dataset skew is frequently overlooked in favor of heavily optimizing other aspects, such as the effects of cache and TLBs [81], NUMA characteristics [89], architecture awareness [29, 15], memory-efficiency [16], and transactional memory [152]. Figure 3.1 shows two experiments which we use to highlight the problem and the potential for improvement. To demonstrate the acuteness of the data skew problem, we conduct an experiment comparing the hash join performance of a non-skewed dataset that is similar to that used in [29] and a skewed dataset that we generated (build table skew is the only variable). The experiment uses the hash join configu- rations and code provided by [29], which we describe in Section 3.2.2. We run the

experiment on a modern processor based on the Skylake architecture (for specifications, see Section 3.6.1). The results in Figure 3.1a show that all hash join configurations slow down by nearly four orders of magnitude when the dataset is skewed. As we show in Section 3.4, this issue is caused by cache misses incurred from traversing long pointer-linked chains of key-value pairs. Thus we illustrate how dataset skew can severely hinder hash join performance if it is not mitigated.

To further study the effect of data skew on hash joins, we propose a set of sixteen datasets that vary in terms of distribution, skew, and shuffling. The details of these datasets are presented in Table 3.3. To our knowledge, no prior work has used such a comprehensive series of datasets to evaluate in-memory hash joins. We explore different variants of dataset skew, vary the correlation between the build and probe tables, and examine how the order of the keys can affect the join algorithm. Query optimizers generally rely on various statistics about the tables to reliably predict the cost of different parts of a query. Choosing the right tool for the job can be challenging. In typical real-world applications, we cannot control how and where our data will be skewed.

To address the impact of dataset skew on hash joins, we focus our attention on the design of the hash table. Separate chaining hash tables remain one of the most popular choices due to their ease of use and flexibility. They have been used in many existing hash join implementations such as [29, 14, 15, 152, 89]. We demonstrate that the hash tables used in their implementations are not ideal for joins on skewed datasets. In order to improve performance, we modify the existing hash table (based on separate chaining) from [29]. We show how this modification results in shorter chain length and simpler result materialization, and significantly improves total join time. We demonstrate an example of this in Figure 3.1b. The modified hash table improves performance by storing the values associated with each key in a contiguous in-memory vector. This significantly reduces the number of lookups required to find a key because there are fewer elements in each bucket chain (this is covered in detail in Section 3.4). The hash join configurations are covered in more detail in Section 3.2.2.

To further improve hash join performance, we propose Maple hash table, a novel concurrent hash table based on cuckoo hashing. Maple hash table uses a unique hashing technique to determine a key's destination table, bucket, and index within the bucket. This multi-stage hashing approach reduces contention and collisions, without increasing the number of required lookups. We show that Maple hash table outperforms a state-of-the-art concurrent hash table implementation from Intel (see Section 3.5). We demonstrate that non-partitioning hash joins based on Maple hash table can significantly improve performance, particularly when the data is not ordered. Join time is faster by up to 17.3× in the best case, and slower by less than 0.2× in the worst case, compared to partitioning hash joins using our improved separate chaining hash table. This presents an opportunity for a query planner to choose a better hash join method and hash table, based on the data characteristics. To show the effectiveness of our approach on different processor architectures, we conduct experimental evaluations on three distinct hardware platforms. The hardware details are presented in Table 3.1. Our approaches achieve consistent performance gains, without the need to manually tune the parameters for each hardware architecture.
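To give a concrete picture of the kind of skew discussed in this section, the sketch below draws join keys from a Zipf distribution via inverse-CDF sampling. It is only an illustrative generator with assumed parameters; the exact generators and parameters behind the datasets in Table 3.3 are described in Section 3.6.2.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

// Draw 'count' keys in [1, n] following a Zipf distribution with exponent 'alpha'.
// Larger alpha produces heavier skew toward the smallest key values.
std::vector<std::uint64_t> zipf_keys(std::size_t count, std::uint64_t n,
                                     double alpha, std::uint64_t seed) {
    // Precompute the cumulative distribution function over the key domain.
    std::vector<double> cdf(n);
    double norm = 0;
    for (std::uint64_t k = 1; k <= n; ++k) norm += 1.0 / std::pow(double(k), alpha);
    double running = 0;
    for (std::uint64_t k = 1; k <= n; ++k) {
        running += (1.0 / std::pow(double(k), alpha)) / norm;
        cdf[k - 1] = running;
    }
    // Sample keys by inverting the CDF with a uniform random draw.
    std::mt19937_64 rng(seed);
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    std::vector<std::uint64_t> keys(count);
    for (auto& key : keys) {
        auto it = std::lower_bound(cdf.begin(), cdf.end(), uni(rng));
        key = std::min<std::uint64_t>(std::uint64_t(it - cdf.begin()), n - 1) + 1;
    }
    return keys;
}
```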

3.2 Related Work

Research on main-memory hash joins has flourished in recent years. Numerous publications have explored different algorithms, workloads, and architectures. In this section, we summarize recent works on hash joins and hash tables.

(a) Highlighting the performance impact of dataset skew and shuffling - comparing a non-skewed ordered dataset, a non-skewed shuffled dataset, and a shuffled Gaussian skewed dataset
(b) Comparing the separate chaining hash table (SC) from [29] against our modified version (SCVV) - our approach is over three orders of magnitude faster on the skewed dataset (details in Section 3.4)

Figure 3.1: Hash join experiments using configurations from [29] - Skylake - 8 threads

3.2.1 Hash Joins

A key inspiration for our work is the in-depth analysis on main-memory hash joins presented by Blanas et al. [29]. The authors implement a family of hash join algo- rithms (summarized in Section 3.2.2), which we adopt for our work. Their experi- ments evaluate datasets with skew on the probe relation. The results indicate that a simple non-partitioned join algorithm using a separate chaining hash table frequently outperforms other approaches, particularly when probe skew is introduced. Using the framework implemented by [29] as a base, Balkesen et al. [15] make the case for fine-tuning radix hash join to the hardware. The authors extensively analyze the performance of Radix and non-partitioning in-memory hash joins, using workloads adapted from [81] and [29]. Their results provide valuable insight on the role CPU architecture plays in hash join performance. Our work also builds on the framework from [29], but instead focuses on addressing dataset skew on hardware-oblivious hash joins. Interestingly, the authors predict that hardware advancements will eventually eliminate the need for fine-tuning in the future.

In [16], Barber et al. present two new hash tables for in-memory hash joins. The authors focus on improving hash join memory efficiency. They note that their approach cannot handle M:N joins. Shanbhag et al. [152] describe a hash join implementation that uses hardware transactional memory and takes advantage of spatial locality in the data. They also note the importance of evaluating both ordered and shuffled datasets.

3.2.2 Hash Join Configurations

We employ the same hash join algorithms proposed by Blanas et al. in [29]. These consist of one non-partitioning hash join and three partitioning hash join variants. The following is a short description for each hash join variant along with the shortened names used in our charts.

1. No partitioning join (Nopart): all threads create a shared hash table from the build relation. This hash table is then probed concurrently.

2. Shared partitioning join (Part-Share): both relations are divided into partitions shared by all threads. Locks are used to facilitate concurrent access.

3. Independent partitioning join (Part-Indep): all threads participate in partitioning both relations, but the partitions are private and cannot be accessed by other threads. Consequently, locks are not needed.

4. Radix partitioning join (Part-Radix): parallel Radix Join with dynamic load balancing, as described by Kim et al. [81]. Each input relation is split into a hierarchy of partitions using their least significant bits (LSB). The resulting partitions are then joined using a hash table.
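As a small illustration of the least-significant-bit partitioning step that the radix join relies on, the sketch below performs a single, sequential partitioning pass. The real algorithm described by Kim et al. [81] partitions hierarchically, in parallel, and with cache- and TLB-conscious pass sizes; this is only a simplified example with hypothetical types.

```cpp
#include <cstdint>
#include <vector>

struct Tuple { std::uint64_t key; std::uint64_t payload; };

// Scatter tuples into 2^radix_bits partitions using the least significant bits of the key.
std::vector<std::vector<Tuple>> radix_partition(const std::vector<Tuple>& input,
                                                unsigned radix_bits) {
    const std::uint64_t mask = (1ull << radix_bits) - 1;
    std::vector<std::vector<Tuple>> partitions(1ull << radix_bits);
    for (const Tuple& t : input)
        partitions[t.key & mask].push_back(t);   // partition id = low bits of the join key
    return partitions;
}
```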

23 3.2.3 Hash Tables

There are many variants of hash tables, but not all hash tables are suitable for hash joins. Cuckoo hashing was first proposed by Pagh et al. [129]. It guarantees constant lookup, update, and deletion times, but insertion times are amortized. Kirsch et al. [83] use a stash to store elements that could not be inserted normally. In [87], Kumar et al. employ a hierarchy of hash tables to provide deterministic insertions for cuckoo hashing. In [102], researchers from Intel Labs present a concurrent cuckoo hashing technique which utilizes hardware transactional memory (HTM). The authors leverage this hardware feature to implement atomic modification of shared data structures. In their experiments they show that this approach outperforms Google's dense hash, Intel TBB, and the C++ unordered map. These hash tables are not designed with hash joins in mind, because the "insert" operation typically overwrites existing values. In a hash join, this may result in incomplete results.

3.3 The Impact of Data Skew

To support hash joins, the hash table implementation must be modified to support duplicate keys. In related work by Blanas et al. [29], the authors used a hash table implementation based on separate chaining, and presented an in-depth analysis on the impact of data skew on hash join performance. However, this analysis was limited to data skew on the probe table. The authors concluded that partitioning hash joins are less resilient to dataset skew. In order to examine this problem further, we developed an extensive set of datasets, and evaluated these datasets with the code provided by [29]. We discovered that joins with highly skewed build relations were significantly more time consuming than joins with similar probe skew. Separate chaining hash tables are used for a wide variety of applications due to their

simplicity and flexibility. The main concept behind separate chaining is to store items in an array of buckets, using a hash function (often a fast and simple solution such as modulo) to determine the bucket number, and resolve collisions by chaining items together. Separate chaining is often used for hash join applications because insert operations are very fast (we do not need to traverse the whole chain to add a new item), and the insert operations do not fail as long as there is available memory. If the input data is not skewed (as is the case in [29]), non-partitioned hash joins with separate chaining generally perform well. However, duplicate keys in the build table result in long chains of items in the hash table, regardless of the hash function used. This can significantly hinder performance. We first observed this behavior when testing the code from [29] with skewed datasets. Initially, we solve this problem by modifying the hash table data structure and data operations.

3.4 Separate Chaining with Value-Vectors (SCVV)

As we discussed in the previous section, dataset skew can pose a significant obstacle for hash join performance. We initially address this issue by modifying the separate chaining hash table from [29] to store/read multiple values per key in contiguous value vectors. This simple solution is very effective against dataset skew and can also be extended to other data structures. Figure 3.2 depicts an example comparing our modified separate chaining hash table to the original version. Consider a chained hash table with 10 buckets and a modulo hash function. During the hash join probe phase, we query key 7. We have to traverse all five elements in the chain to look for matching keys. In the modified version with value vectors, we find a match on the first lookup, and we find all other values by traversing the value vector. In this example, finding all the matches for key 7 will take 2 reads, versus 5 reads in the original version. This issue does not

Figure 3.2: Example: Comparison of a bucket chain with and without value vectors (modulo hash function).
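To make the layout concrete, the following is a minimal single-threaded sketch of separate chaining with value vectors. It is illustrative only: the names (Node, SCVVTable) are ours rather than taken from the benchmark code, and the concurrency and sizing details of the actual implementation are omitted.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Sketch: one chain node per distinct key; duplicate keys append to the node's
    // value vector instead of growing the chain.
    struct Node {
        uint64_t key;
        std::vector<uint64_t> values;  // all tuples that share this key
        Node* next;
    };

    struct SCVVTable {
        std::vector<Node*> buckets;
        explicit SCVVTable(std::size_t nbuckets) : buckets(nbuckets, nullptr) {}

        void insert(uint64_t key, uint64_t value) {
            Node*& head = buckets[key % buckets.size()];      // simple modulo hash
            for (Node* n = head; n != nullptr; n = n->next)
                if (n->key == key) { n->values.push_back(value); return; }
            Node* n = new Node;                               // first time we see this key
            n->key = key;
            n->values.push_back(value);
            n->next = head;
            head = n;
        }

        // Probe: stop at the first matching node; its vector holds every matching tuple.
        const std::vector<uint64_t>* probe(uint64_t key) const {
            for (Node* n = buckets[key % buckets.size()]; n != nullptr; n = n->next)
                if (n->key == key) return &n->values;
            return nullptr;
        }
    };

The important property is visible in probe: the chain traversal stops at the first node with a matching key, and all remaining matches are read from a single contiguous vector.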

3.4.1 Lookup and materialization costs

We do an analysis on separate chaining lookup costs to provide some intuition on how SCVV can improve performance. Let's assume K keys have been inserted into the hash table. The worst case scenario occurs when all the keys end up in the same bucket. In this situation, we have to perform K lookups for K keys in the build table for every key that exists in the probe table. In order to simplify this analysis, we focus on what happens while probing the keys in a bucket chain. In the case of regular separate chaining, we have to check every single key in the bucket, because we cannot guarantee that another matching key does not exist in the bucket. With a separate chaining hash table (SC), if a bucket has K keys, the number of key lookups needed until we find a match is C_SC = K. In a modified separate chaining hash table with value vectors (SCVV), we only need to traverse the chain until we find a match. For a bucket with K chained unique keys, the average number of lookups is half the number of lookups required by basic separate chaining: C_SCVV(unique) = K/2. However, when the keys contain duplicates due to dataset skew, the maximum chain length is determined by the number of unique keys that hash to a bucket. If we assume that D is the average degree of key duplication, the lookup cost is K/(2D), and by factoring in the average cost of reading the value vector once we have found a match, we get the cost of materialization: C_SCVV(skewed) = K/(2D) + D. In conclusion, value vectors reduce the number of key lookups needed, and once we find a matching key, we can stop traversing the chain, and output all the resulting tuples. Henceforth, we use SCVV as our new baseline.
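As an illustrative calculation (the numbers are chosen purely for exposition and are not taken from our experiments): for a bucket chain holding K = 16 keys with an average duplication degree of D = 4, basic separate chaining inspects C_SC = 16 keys, whereas SCVV needs on average C_SCVV(skewed) = 16/(2*4) + 4 = 6 reads to locate the matching node and materialize its value vector.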

3.4.2 Evaluation and Overhead

We compare the original implementation by Blanas et al. [29] with our modified version using value vectors. In order to provide an “apples-to-apples” comparison, we only modify the hash table used, and leave the rest of the code unchanged for this experiment. A performance comparison of hash joins using SCVV and the original hash table implementation (SC) with skewed datasets is shown in Figure 3.1b. Our approach is several orders of magnitude faster than the original code when the dataset is skewed (up to 5726× faster). As discussed in Section 3.4.1, SCVV produces shorter chains compared to SC. This significantly lowers the number of lookups for both the build and probe phases. We have demonstrated that SCVV is faster than SC on skewed datasets, but what about datasets that are not skewed at all? The main advantage of using a value vector is to store and retrieve multiple values per key. When there are no repeating keys in the dataset, that advantage is lost, and the size of the hash function modulus becomes an important factor. Furthermore, despite the specialized nature of the value vector, it bears additional overhead compared to a much simpler value variable. SCVV incurs a performance penalty of 16% when the build table keys are unique and fully ordered (a sequence of the numbers 1 to N). As a result, we would still choose SC in such rare cases. We have designed a new baseline implementation that can handle skewed datasets. Our next goal is to tackle shuffled (non-ordered) datasets.

3.5 Maple Hash Table (MH)

As part of our research on hash tables for hash joins, we explored the possibility of using cuckoo hashing for this purpose. As we mentioned in Section 3.2, cuckoo hashing was originally proposed by Pagh et al. [129]. Its core concept is to store items in one of two tables, each with a corresponding hash function (this can be extended to N tables). If a bucket is occupied by another item, the existing item is displaced and reinserted into the other table. This process continues until all items stabilize, or an arbitrary threshold is crossed. Cuckoo hashing provides a guarantee that reads take no more than two lookups. Its main drawbacks are relatively slower and less predictable insert operations, a lack of concurrency, and the possibility of failed insertions. Consequently, several variants of cuckoo hashing have been proposed to resolve these drawbacks. A concurrent hash table based on cuckoo hashing was proposed by Intel in [102] (henceforth called "Intel Libcuckoo"). It supports high-performance concurrent operations with multiple readers and writers. By leveraging the hardware transactional memory (HTM) introduced with Intel's Haswell architecture, as well as several carefully engineered optimizations, it achieves the best insert performance amongst all hash tables. However, the availability of special hardware features such as HTM cannot be taken for granted. Furthermore, the performance of Intel Libcuckoo suffers significantly when the distribution of the dataset is skewed. In their experimental evaluation of Intel Libcuckoo [102], the authors used a uniform distribution of keys. As we know, real world datasets may be non-uniform or skewed [71]. We introduce the Maple hash table, a novel extension to the bucketized cuckoo hash table approach that is designed for concurrency and insert efficiency. In our approach, we use a fast multi-stage hashing technique to determine an incoming item's location, effective load-balancing to decrease collisions and lock contention, a semi-optimistic locking strategy that acquires fewer locks without incurring race conditions, and optimizations to improve concurrency and cache usage.

Figure 3.3: Maple Hash Table Data Structure.

Figure 3.4: Example of finding a location for a key using multi-stage hashing to determine the table, bucket, and array index without probing all the elements in the bucket.

In the next section (Section 3.5.1) we describe the Maple hash table data structure. We then elaborate on the hash functions in Section 3.5.2. Finally, we describe how concurrency control is implemented in Section 3.5.3.

3.5.1 Data Structure

Figure 3.3 depicts a general overview of our cuckoo hash table data structure. The hash table uses 4-way set-associative buckets, which are implemented as contiguous arrays of key-value pairs. Other configurations are possible, but we found 4-way to provide a suitable balance of performance and memory overhead, while also ensuring cache alignment. As we show in Section 3.5.2, an element can be found without reading through all the elements in the entire bucket. The combination of the table, bucket, and index numbers gives us the precise location of a key-value pair. For each key, the number of possible locations is limited by the number of tables within the hash table. That means that unlike other methods that combine cuckoo hashing with linear probing, such as [102], we can retrieve a key-value pair by looking up a maximum of two slots as opposed to 2N (where N is the number of slots per bucket). This feature further reduces the cost of lookups in buckets that are larger than the cache line, because parts of the bucket that cannot contain the key are skipped. When the table is first created, all keys and values are initialized to zero. We reserve zero to indicate that a key-value pair is empty or deleted.
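As a rough illustration of this layout, the two tables can be declared as contiguous arrays of 4-way buckets. This is a sketch under our own naming assumptions (KVP, Bucket, and so on are illustrative), not the actual implementation.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Sketch: a two-table, 4-way set-associative cuckoo layout.
    // Zero is reserved to mean "empty or deleted", as described above.
    constexpr int kSlotsPerBucket = 4;

    struct KVP {
        uint64_t key = 0;     // 0 = empty or deleted
        uint64_t value = 0;
    };

    struct Bucket {
        KVP slots[kSlotsPerBucket];   // contiguous and cache-line friendly
    };

    struct TwoTableLayout {
        std::vector<Bucket> table0;
        std::vector<Bucket> table1;
        explicit TwoTableLayout(std::size_t buckets_per_table)
            : table0(buckets_per_table), table1(buckets_per_table) {}
    };

Storing the slots of a bucket contiguously is what makes the "skip the parts of the bucket that cannot contain the key" optimization cheap: the candidate slot is a fixed offset inside one array element.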

3.5.2 Hash Functions

A good hash function strikes a balance between computational complexity and its effectiveness at reducing collisions. As noted in [144] and [4], Murmurhash [7] is widely used due to its speed and acceptable hash value quality in most situations. Our own experiments with a wide variety of hash functions confirm that Murmurhash is a suitable choice. Murmurhash provides the avalanche effect [183], meaning that at least half of the input key's bits are flipped in the output. This results in hash values that are less predictable and better distributed. In addition to this collision property, Murmurhash is also fast, as it mainly relies on multiply and rotate (hence the name Murmur) operations. However, in some test cases (skewed datasets) its performance was not satisfactory. Due to this, we implemented a multi-stage hashing approach that is effective and computationally cheap. Figure 3.4 depicts an example of how each index is computed. An incoming key is passed to the hash function (Murmurhash2 [7] with a seed) to produce the hash value. To determine the bucket number, we calculate the modulo of the hash value to the table size. Since the table size is a power of two, we calculate this much faster by using the well-known technique of replacing the modulo with a bitwise AND operation. A function called indexgen is used to determine the index within that bucket. In practice, this approach works well when the indexgen function can be computed much faster than the hash function. We use multiplicative hashing to implement indexgen.

Algorithm 2: Maple Hash Table Get Algorithm
Input: the key provided to search the hash table
Output: return the associated value if the key is found, otherwise return NULL

    tableno ← key bitwise & 1;
    hashval ← hashfunc0(key);
    bucketno ← hashval & (tablesize - 1);
    index ← indexgen(hashval);
    if key exists in table0[bucketno][index].key then
        return table0[bucketno][index].value;
    else
        // Repeat the same steps with hashfunc1 for table 1
    return NULL;  // key was not found
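The address computation can be sketched as follows. The mixing function below is only a stand-in for Murmurhash2 with a seed, and the multiplicative constant in the indexgen step is a placeholder; both are assumptions made to keep the sketch self-contained rather than values from the actual implementation.

    #include <cstdint>

    // Stand-in for Murmurhash2 with a seed: a 64-bit finalizer-style mixer,
    // used here only to keep the sketch self-contained.
    uint64_t hash_with_seed(uint64_t key, uint64_t seed) {
        uint64_t h = key ^ seed;
        h ^= h >> 33; h *= 0xff51afd7ed558ccdULL;
        h ^= h >> 33; h *= 0xc4ceb9fe1a85ec53ULL;
        h ^= h >> 33;
        return h;
    }

    struct Location {
        unsigned table;    // 0 or 1
        uint64_t bucket;   // bucket number within the chosen table
        unsigned index;    // slot index within the 4-way bucket
    };

    // Multi-stage location: table selector, bucket number, then slot index.
    // table_size must be a power of two so the modulo becomes a bitwise AND.
    Location locate(uint64_t key, uint64_t table_size, uint64_t seed) {
        Location loc;
        loc.table  = static_cast<unsigned>(key & 1);            // least significant bit
        uint64_t h = hash_with_seed(key, seed);
        loc.bucket = h & (table_size - 1);                      // hash mod table_size
        // indexgen: cheap multiplicative hashing on the already computed hash value,
        // so the slot index costs far less than a second full hash.
        loc.index  = static_cast<unsigned>((h * 0x9E3779B97F4A7C15ULL) >> 62);  // 0..3
        return loc;
    }

For the second table, the same steps would be repeated with the second hash function (hashfunc1).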

3.5.3 Concurrent Implementation

Cuckoo hashing is relatively easy to implement in serial applications. However, the original design by Pagh et al. [129] did not include an efficient parallel design; the authors only proposed calculating the two hash values in parallel. This approach does not scale well on modern processors, which generally have more than four cores. Our implementation leverages data parallelism to scale up on multicore processors. As noted in [63], there is no stable hierarchy between locks, and variables such as workload, available resources, and hardware specifications need to be taken into account. For concurrency control, we opt to use light-weight spinlocks at a per-bucket granularity. To ensure valid and consistent results, all threads must finish building the hash table before it can be probed for matching keys. This allows us to employ a locking strategy that is less stringent, as concurrent read and write operations are not required. Our semi-optimistic locking strategy reduces lock acquisition and contention. A listing of our insert algorithm is depicted in Algorithm 1.

Algorithm 1: Maple Hash Table Insert Algorithm
Input: newKVP is the key-value pair being inserted
Output: return true if the insertion is successful, otherwise return false (rehash)

    if contains(newKVP) then
        append(newKVP);  // lock and append value to value vector
    else
        stepcounter ← 0;
        maxsteps ← log(tablesize);
        tableno ← newKVP.key bitwise & 1;
        while stepcounter < maxsteps do
            switch tableno do
                case 0 do
                    hashval ← hashfunc0(newKVP.key);
                    bucketno ← hashval & (tablesize - 1);
                    index ← indexgen(hashval);
                    lock lockarray[bucketno];
                    if table0[bucketno][index] is empty then
                        insert newKVP in table0;
                        unlock lockarray[bucketno];
                        return true;
                    else
                        if key exists then
                            append(newKVP);
                            unlock lockarray[bucketno];
                            return true;
                        evict existing KVP as oldKVP and insert newKVP;
                        unlock lockarray[bucketno];
                        set newKVP ← oldKVP;
                        tableno ← 1;
                case 1 do
                    // Repeat case 0 steps with hashfunc1 for table 1
            increment stepcounter by 1;
    return false;

If a key already exists in the hash table, the value of the incoming key-value pair is appended to the value vector instead. Our preliminary experiments revealed that having all threads begin inserts from one table could result in a less even key distribution and higher lock contention, due to always starting insertions from the same pool of cells. The table selector selects the initial table for inserting a new item, based on the key's least significant bit. We use the least significant bit because it is fast and effective. In a worst case scenario (all keys are odd or all keys are even), the algorithm behaves like most cuckoo hashing approaches that start insertions from the same table. It is possible to further randomize this approach, but we found it to be adequate. Rather than locking every bucket along the cuckoo path, we only lock the bucket that will be modified. The insert algorithm checks if the key already exists after locking the corresponding bucket. This is due to the possibility of one or more threads processing items with the same key. This protects against a race condition in which multiple threads inserting the same key could evict the item instead of appending to the value vector. Successfully inserting a new item into the hash table will store the item in only one of two possible locations. This ensures consistent read performance when probing the table.
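A per-bucket spinlock of the kind described above can be sketched with std::atomic_flag; this is a generic illustration of the locking granularity, not the thesis implementation.

    #include <atomic>

    // Sketch: a lightweight spinlock, one instance per bucket. A thread locks only
    // lockarray[bucketno] before modifying that bucket, re-checks whether the key
    // already exists, and then inserts, appends to the value vector, or evicts.
    class SpinLock {
        std::atomic_flag flag;
    public:
        SpinLock() { flag.clear(); }
        void lock()   { while (flag.test_and_set(std::memory_order_acquire)) { /* spin */ } }
        void unlock() { flag.clear(std::memory_order_release); }
    };

    // Usage sketch (hypothetical names):
    //   std::vector<SpinLock> lockarray(num_buckets);
    //   lockarray[bucketno].lock();
    //   ... modify the bucket ...
    //   lockarray[bucketno].unlock();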

3.5.4 Performance Evaluation

We use a microbenchmark to compare the performance of Maple hash table against the state-of-the-art concurrent hash table, Intel Libcuckoo. The microbenchmark, which is based on open-source code from Intel, begins by generating a series of unique key-value pairs which are then shuffled using a seeded uniform random number generator. The insert throughput is calculated by measuring the time it takes for eight threads to finish inserting all key-value pairs into the hash table. Once the table has been populated, the read throughput is evaluated using eight threads to perform lookup operations on the data. The datasets consist of 16M, 33M and 67M random records generated from a uniform distribution between 1 and N, where N is the dataset size.
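The overall structure of such a microbenchmark can be sketched as follows. This is a simplified stand-in rather than the Intel benchmark code itself: the table type, key count, thread count, and all names are illustrative, and the table is assumed to expose an insert(key, value) operation.

    #include <algorithm>
    #include <chrono>
    #include <cstddef>
    #include <cstdint>
    #include <numeric>
    #include <random>
    #include <thread>
    #include <vector>

    // Sketch: generate unique keys, shuffle them with a seeded RNG, then measure how
    // long a fixed number of threads take to insert them into a concurrent table.
    template <typename Table>
    double insert_throughput_mops(Table& table, std::size_t nkeys,
                                  unsigned nthreads, uint64_t seed) {
        std::vector<uint64_t> keys(nkeys);
        std::iota(keys.begin(), keys.end(), uint64_t{1});        // unique keys 1..N
        std::shuffle(keys.begin(), keys.end(), std::mt19937_64(seed));

        auto start = std::chrono::steady_clock::now();
        std::vector<std::thread> workers;
        std::size_t chunk = nkeys / nthreads;
        for (unsigned t = 0; t < nthreads; ++t) {
            std::size_t begin = t * chunk;
            std::size_t end = (t + 1 == nthreads) ? nkeys : begin + chunk;
            workers.emplace_back([&table, &keys, begin, end] {
                for (std::size_t i = begin; i < end; ++i)
                    table.insert(keys[i], keys[i]);              // assumed interface
            });
        }
        for (auto& w : workers) w.join();
        std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
        return nkeys / elapsed.count() / 1e6;                    // million inserts per second
    }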

Figure 3.5: Comparing Maple hash table and Intel Libcuckoo - uniform random dataset - Skylake - 8 threads. (a) Insert throughput, (b) Read throughput.

Figures 3.5a and 3.5b depict multi-threaded insert and read throughput respectively. The results show that Maple hash table is faster than Intel Libcuckoo by up to 17% in both insert and read throughput.

3.6 Evaluation

We now evaluate the effectiveness of the hash join approaches with our hash tables. Our experiments build on the same code base as [29], which was also adopted by [14]. Our contributions include implementing and integrating the SCVV and MH hash tables into the benchmark and investigating their performance on a broad range of datasets. It was shown in previous work that hardware architecture can play an intricate role in hash join performance [15]. Inspired by this work, we evaluate our experiments on three different hardware architectures. For convenience, we summarize the hash join configurations in Table 3.2. We present the platform specifications of each machine in the next section (Section 3.6.1), and describe the characteristics of our synthetic datasets in Section 3.6.2.

Table 3.1: Experimental setup specifications

CPU Model                                Cores/HT   Cache            RAM          TLB (4KB pages)
1× Intel Core i7 6700HQ (Skylake)        8/16       1MB L2, 6MB L3   16GB DDR4    L1 DTLB: 64 entries; L2 STLB: 1536 entries
2× Intel Xeon E5472 (Harpertown)         8/8        12MB L2          16GB DDR2    L1 DTLB: 16 entries; L2 DTLB: 256 entries
8× AMD Opteron 8220 (K8 or Santa Rosa)   16/16      2MB L2           128GB DDR2   L1 TLB: 32 entries; L2 TLB: 512 entries

3.6.1 Platform Specifications

We evaluate the experiments on three different processor architectures: Intel Skylake, Intel Harpertown (based on the Penryn architecture), and AMD K8 (Santa Rosa architecture). Our intention is to ensure that the results are reproducible and not tuned for a particular set of hardware. The Harpertown and AMD machines run Ubuntu 14.04 LTS, and the Skylake machine uses Ubuntu 16.04 LTS. To ensure consistency in code compilation, we configure all machines to use version 6.3.0 of the G++ compiler and enable the -O3 optimization flag. The -O3 flag instructs the compiler to use the highest level of optimizations, and is widely used throughout the industry as well as the related literature [15, 144, 151, 175]. The first machine contains an Intel Skylake quad core processor with hyper-threading and 16GB of RAM. The CPU contains 256KB of L2 cache per core, and 6MB of L3 cache that is shared by all cores. The second machine is based on a pair of Intel Harpertown (Penryn architecture) quad core CPUs and contains 16GB of RAM. Each quad core CPU consists of two pairs of cores that share 6MB of L2 cache, for a total of 12MB. Lastly, we test an AMD machine with eight K8 CPUs with 2MB of shared cache each, and a total of 128GB of RAM. We did not observe swap space usage on any of the machines. The hardware specifications are summarized in Table 3.1.

Table 3.2: Hash Join Configuration Key

Short Form         Hash Join Variant          Hash Table
Nopart SCVV        No partitioning            SCVV
Nopart MH          No partitioning            Maple
Part-Share SCVV    Shared partitions          SCVV
Part-Indep SCVV    Independent partitions     SCVV
Part-Radix SCVV    Radix partitioning join    SCVV

3.6.2 Datasets

In [29], the authors use dataset cardinalities of 16M and 256M tuples to mimic common decision support settings (1:16). Subsequent works by other authors have adopted this workload [14, 15, 152]. We adopt this cardinality as the baseline for our datasets. We extensively expand the range of datasets to include distributions not covered by prior work, noting that even popular benchmarks such as TPC-H are not representative of all real-world workloads [37, 77]. In most databases, the primary key is automatically generated in ascending order. However, there is no guarantee that the join keys will be ordered and/or uniformly distributed. Although some recent works examine dataset skew on the probe relation, to our knowledge none of the literature explores such an extensive set of datasets on main-memory hash joins. Another aspect to consider is the way the keys are ordered. Prior work by Shanbhag et al. [152] highlights the importance of studying dataset shuffling. Table 3.3 provides details on how the datasets are designed. All build tables contain 16M records, and the probe table size depends on the correlation. For each dataset, we also generate a shuffled version that is reordered using a uniform random function. The datasets are otherwise denoted as ordered. It is worth noting that if the build keys are drawn from a skewed distribution (Zipf or Gaussian), they will not be in sorted order. However, on ordered datasets the correlating keys in the probe table remain in the order in which they were generated.

Table 3.3: Dataset details

Dataset Name        Build Key Generation           Probe Key Correlation                          Parameters
Sequential 1:N      sequential unique keys         keys have exactly N matches in probe table     N=16, key range 1 to 16M
Random near 1:N     randomly repeating sequence    keys have exactly N matches in probe table     N=16, uniform random between 1 and 5
Gaussian 1:N        sequential unique keys         relationship follows a Gaussian distribution   mean=0.015, stdev=0.3, multi=10
Gaussian M:N        Gaussian skewed keys           relationship follows a Gaussian distribution   mean=0.015, stdev=0.6, multi=10,000
Gaussian near M:K   Gaussian skewed keys           random K matches where 1 ≤ K ≤ N               N=16, mean=0.015, ...
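As an illustration of how such a dataset pair can be produced, the following sketch generates the Sequential 1:N build and probe tables and optionally shuffles them with a seeded uniform random generator. The function name and structure are ours; the actual generators cover the full set of distributions in Table 3.3.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <random>
    #include <vector>

    // Sketch: Sequential 1:N dataset. Build keys are unique and sequential; each build
    // key appears exactly N times in the probe table. The shuffled variant reorders
    // both tables with a seeded uniform random generator.
    struct Dataset {
        std::vector<uint64_t> build;
        std::vector<uint64_t> probe;
    };

    Dataset make_sequential_1_to_n(std::size_t build_size, std::size_t n,
                                   bool shuffled, uint64_t seed) {
        Dataset d;
        d.build.resize(build_size);
        for (std::size_t i = 0; i < build_size; ++i) d.build[i] = i + 1;   // keys 1..build_size
        d.probe.reserve(build_size * n);
        for (std::size_t i = 0; i < build_size; ++i)
            for (std::size_t j = 0; j < n; ++j) d.probe.push_back(i + 1);  // exactly N matches
        if (shuffled) {
            std::mt19937_64 gen(seed);
            std::shuffle(d.build.begin(), d.build.end(), gen);
            std::shuffle(d.probe.begin(), d.probe.end(), gen);
        }
        return d;
    }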

3.6.3 Results and Discussion

We conduct comprehensive experiments to analyze how dataset skew and data shuffling affect hash join performance. These datasets are designed to focus on the effects that we are interested in studying. To this end, we process each of the datasets with the hash join configurations mentioned in Section 3.2.2. The join operation concludes when the resulting tuples have been written to memory. As our main focus is in-memory performance, we omit the final step of writing the result to the disk. This is consistent in all the experiments to ensure "apples-to-apples" comparisons. We present our performance evaluations in Sections 3.6.4 to 3.6.7. In each section, we focus on one parameter (such as build skew), and keep all other parameters constant. Throughout the results we refer to the non-partitioning join using the Maple hash table as MH, and Separate Chaining with Value Vectors as SCVV. Finally, in Section 3.6.7, we examine the effects of different CPU architectures on the experiments.

We measure the CPU cycles for each hash join phase, using the timers from [29]. These timers provide high precision measurements reported in CPU cycles, allowing results from different datasets to be accurately compared. We report the average of five runs, and note that the variation between runs is less than 2%.

3.6.4 Build Table Skew

We start off by investigating how build skew affects performance. We evaluate four different variations of build table skew, while maintaining the 1:16 probe cardinality used in prior work [29]. In Figure 3.6 we present the average runtimes for each ordered dataset, and in Figure 3.7 we present the results for the shuffled versions of these datasets. The charts are arranged in order of increasing build skew, from left to right. We now examine the behavior of each dataset in this category.

3.6.4.1 Sequential 1:N dataset

A basic scenario is to consider a build table with unique sequential keys. We note that the non-partitioning configuration with SCVV is the fastest when the dataset is ordered, and that MH outperforms all other configurations when the dataset is shuffled. This is a trend that we frequently observe throughout our experiments. We take a closer look at the effect of shuffling in Section 3.6.6. Overall, the non-partitioning hash join configurations provide the fastest runtimes.

3.6.4.2 Random near 1:N dataset

This dataset contains mildly skewed build keys. This is achieved by randomly repeating sequential keys between 1 and N times until 16M records are created. This is the only example in all our experiments where Radix hash join outperforms all other configurations.

Figure 3.6: Hash Join run time with Variable Build Skew on Ordered Datasets - Skylake - 8 threads. (a) Sequential - 1:N, (b) Random - near 1:N, (c) Gaussian - M:N, (d) Zipf - M:N.

When the keys are ordered, MH and SCVV offer very similar performance and are the fastest configurations, and Part-Radix is the worst configuration. When the keys are shuffled, all configurations take a performance hit, but Part-Radix ends up as the fastest configuration, followed by MH. Previous studies have noted that Radix join is prone to load imbalance [29], and shuffling may partially alleviate this.

3.6.4.3 Gaussian M:N dataset

For this dataset, the build keys approximate a half-normal Gaussian distribution with a standard deviation of 6000. The keys cover a broader range of numbers and frequencies compared to the Zipf skewed keys.

Figure 3.7: Hash Join run time with Variable Build Skew on Shuffled Datasets - Skylake - 8 threads. (a) Sequential - 1:N, (b) Random - near 1:N, (c) Gaussian - M:N, (d) Zipf - M:N.

The data locality in the probe table is lost when the keys are shuffled. The results show that SCVV provides the fastest runtimes, followed by MH. The ordered results follow the same trend as the sequential 1:N dataset. When the dataset is shuffled, all configurations take longer to complete the join. The Part-Indep configuration suffers the greatest slowdown during the partitioning phase. To understand why this happens, consider that Part-Indep creates per-thread partitions. This eliminates the overhead of synchronization, but also results in a much larger total number of partitions that may no longer fit in the cache and TLB. These private partitions are then merged after the partition phase. These characteristics delay the availability of data.

The shuffled dataset further increases the number of partitions created per thread, which magnifies the problem.

3.6.4.4 Zipf M:N dataset

In order to study a case of extreme data skew on the build table, we use a Zipf distribution with a high skew parameter of s= 2.00. The runtime results show the Part-Share configuration performing worst by a considerable margin. If we look at the join breakdown by phase, we see that the partition phase is the main bottleneck. The Part-Share configuration creates partitions that are shared among all threads and protects concurrent inserts using a latch. This particular Zipf distribution con- tains keys that cover a relatively narrow range of values. Consequently, most keys belong to the same partition. This results in high lock contention and serialization of the partitioning phase, as all threads will try to insert into the same partition. The same trend is observed after shuffling the data because the keys will be inserted into the same shared partition regardless of how they are distributed. Other configura- tions such as MH and SCVV offer finer lock granularity or avoid locks altogether in the case of Part-Indep. We can conclude from this that the Part-Share configuration is unsuitable when the majority of build keys occupy a narrow range of numbers.

3.6.5 Probe Table Skew

We now examine probe table skew. In these experiments, we keep the build skew constant, and vary the probe skew. We examine a total of four different probe skew distributions, that are divided into two categories in order to keep the build skew constant when comparing the results. We generate the build table using non-skewed keys, and vary the correlation with the probe table to either be exactly 1:N, Zipf skewed, or Gaussian skewed. The results are shown in Figure 3.8, with each dataset paired next to its shuffled variant.

Figure 3.8: Hash Join Runtime with Variable Probe Skew on 1:N Datasets - Skylake - 8 threads. (a) Sequential - Ordered, (b) Sequential - Shuffled, (c) Zipf - Ordered, (d) Zipf - Shuffled, (e) Gaussian - Ordered, (f) Gaussian - Shuffled.

The results continue earlier trends, with SCVV leading on ordered datasets and MH providing the fastest times on shuffled datasets. The sequential 1:N dataset produces exactly N = 16 keys in the probe table for every key in the build table. The Zipf and Gaussian 1:N datasets pick the number of matches out of their respective skewed distributions, and are capped at 16 matches. As a result, they generate fewer total probe keys. Prior works have shown that non-partitioning hash join variants have an advantage over partitioning hash join variants when there is skew on the probe table [29, 15]. We also observe this in our results, but only when the datasets are ordered.

We also explore datasets that are generated with a near M:K relationship. In these datasets, the number of correlating matches for each build key is K, which is a number randomly chosen between 1 and N. The random K is chosen using a uniformly random distribution. Using a near M:K distribution results in a smaller probe table, and thus relatively faster join times compared to M:N. We omit the charts for this category, as there are no noteworthy changes to the relative performance of the hash join configurations.

3.6.6 Effect of Dataset Shuffling on Hash Tables

Every dataset has an ordered version and a shuffled version which is shuffled using a uniformly random distribution. It is observed throughout the results that all configurations take a performance hit going from an ordered dataset to a shuffled dataset. This trend applies even when the build keys are randomly drawn from a skewed distribution, and repeated in the probe table N times. This occurs because repeated keys in the probe table would no longer be probed in the order that they were generated. As a result, we are significantly less likely to find the data in the cache or TLB. There is also a clear trend of MH performing better when the data is shuffled, and SCVV taking the lead when it is ordered. This trend is of particular interest to us, as it presents an opportunity for a query planner to choose a hash table based on whether or not the data is shuffled. In order to gain better insight, we measure the TLB and cache misses. In Figure 3.9a we compare runtimes and in Figure 3.9b, we show the corresponding d-TLB misses. When the dataset is ordered, the SCVV data structure is also accessed in a sequential manner. As a result, entries that are evicted from the dTLB will not be accessed again during the course of the current phase. Due to the fact that the size of the dataset far exceeds the number of entries that the TLB can hold (64 4KB pages), the chance of incurring a TLB miss for every lookup is greatly increased on a shuffled dataset.

Figure 3.9: Performance impact of shuffling on MH and SCVV - Zipf 1:N dataset - Skylake - 8 threads. (a) Runtime, (b) Data-TLB Misses.

SCVV greatly benefits from ordered data, as it uses a modulo hash function, and the buckets are accessed in the same order as the keys. This benefits lookup times due to increased memory locality and fewer cache misses. When the keys are shuffled, the threads access the hash table in a random and unpredictable fashion. As a result, SCVV suffers from increased TLB and cache misses, and potentially more lock contention. Shanbhag et al. reported that shuffled data increased transaction abort rate as more cache lines were shared among the threads. Their experimental results show a big performance gap between sorted and shuffled datasets [152], which we also observe in our results. MH provides more consistent results as is indicated by the fact that the build times are relatively similar for shuffled and ordered datasets. As MH is based on Cuckoo hashing (which uses two different hash functions), its access pattern is not sequential. MH does not benefit from build table locality, but it guarantees that it will find a key in either one or two lookups.

Figure 3.10: Total join time on variable CPU architectures - 1:N Datasets - 8 threads. (a) Sequential - Ordered, (b) Sequential - Shuffled.

3.6.7 CPU Architecture

Figure 3.10 depicts the total hash join runtimes we measured on three different hardware platforms. On all three hardware platforms, we run the experiments using identical source code and datasets. That is, we do not fine-tune the algorithm to any particular architecture. Our results indicate similar trends among the different architectures. Although we observe some minor variations, the fastest and slowest configurations are consistent. Despite large differences in CPU cache size (shown in Table 3.1), non-partitioning hash joins provide the fastest performance. A query planner with some knowledge of the data would be able to choose the fastest join variant. The Part-Radix configuration performs relatively well on the shuffled dataset, but gradually loses ground going from Skylake to Harpertown and then AMD. As noted in [15] Radix hash join benefits from fine-tuning the parameters to the hardware. This trend demonstrates that without parameter tuning, Radix hash join does not perform consistently. Choosing the wrong join configuration can even result in modern hardware perform- ing worse compared to old hardware. In Figure 3.10b, the Part-Indep configuration performs similarly on Skylake and Harpertown. As noted in Section 3.6.4.3, Part- Indep may create an excessive number of private partitions. The combination of poor

load balancing and unordered data access results in a performance penalty that is so severe that it undermines eight years of architectural improvements. Shuffled datasets have the potential to cause cache and TLB thrashing. They have a greater impact on the partitioned hash join variants, as the non-partitioning variants provide faster and more consistent performance on all tested CPU architectures.

3.7 Chapter Summary

Hash joins are among the key techniques that enable efficient data access. However, their performance can be hindered when data is skewed and/or shuffled. We explored this issue on a set of 16 datasets that we have developed. We highlighted the importance of using a broad variety of synthetic datasets to mimic real-world applications. To our knowledge, no previous work has generated such a variety of datasets and analyzed their performance impact on hash joins. The key contributions of this chapter are:

• We show how dataset skew can severely hinder hash join performance if it is not mitigated.

• We design a series of datasets that extends the variety of data distributions and relationships that are evaluated.

• We implement a modified version of the hash table used by Blanas et al. [29] that significantly improves join performance with skewed datasets.

• We present joins using Maple hash table, a novel hashing technique based on cuckoo hashing that further improves performance on shuffled workloads.

We proposed modifications to the separate chaining based hash table (used in [29]) to deal with skewed data, and incorporated this into the hash join implementations used in their benchmark. We showed how the choice of hash table can improve

performance with skewed datasets by more than three orders of magnitude when using our modified hash table compared to the prior implementation. With extensive experiments, we presented a performance evaluation of five hash join configurations. In order to further improve performance, we introduced a novel hash table called Maple hash table. We have elaborated on how our hash table guarantees constant lookup cost, and described the many mechanisms we use to improve its insertion performance. We have shown that Maple hash table can further improve performance on shuffled datasets, and demonstrated speed-ups of up to 17.3×. Our study reinforces the case for more research in the area of hash joins on skewed or shuffled datasets, and the need for improved query optimizers that can choose a performant join configuration based on the data characteristics.

Chapter 4

Main Memory Aggregation Processing

Aggregation is an essential and ubiquitous data operation used in many database and query processing systems to summarize data. It is considered to be the most expensive operation after joins, and is an essential component in analytical queries. Following our research on in-memory joins in Chapter 3, in this chapter, we conduct a comprehensive analysis of in-memory aggregation. Our analysis covers baseline and state-of-the-art algorithms and data structures from the programming and research communities, as well as a custom implementation that we developed from scratch. Our study revolves around six analysis dimensions which represent the independent variables in our experiments: (1) algorithm and data structure, (2) query and aggregate function, (3) key distribution and skew, (4) group-by cardinality, (5) dataset size and memory usage, and (6) concurrency and multi-threaded scaling. We conduct extensive evaluations with the goal of identifying the trade-offs of each algorithm and offering insights to practitioners. The remainder of this chapter is organized as follows: we discuss our motivation in Section 4.1. In Section 4.2 we categorize and explore the related work. We describe the queries in Section 4.3.

Note: Parts of this chapter were previously published in [113].

We elaborate on the algorithms and data structures in Section 4.4. In Section 4.5 we specify the dataset characteristics. We present and discuss our experimental setup and evaluation results in Section 4.6, and summarize our findings in Section 4.7. Finally, we conclude the chapter in Section 4.8.

4.1 Motivation

Many prior studies on in-memory aggregation limited the scope of their research to a narrow set of algorithms, datasets, and queries. For example, many studies do not evaluate holistic aggregate functions [120, 185, 32]. The datasets used in most studies are based on a methodology proposed by Gray et al. [61]. These datasets do not evaluate the impact of data shuffling, or enforce deterministic group-by cardinality where possible. Some studies only evaluate a proposed algorithm against a naive implementation, rather than comparing it with other state-of-the-art implementations [174, 73]. Other studies have focused on secondary aspects, such as optimizing query planning for aggregations [184], distributed and parallel algorithms [32, 185, 153, 189, 73], and iceberg queries [174]. Additionally, some data structures have been proposed for in-memory query processing or as drop-in replacements for other popular data structures, but have not been extensively studied in the context of aggregation workloads [95, 17, 102]. Real-world applications cover a much more diverse set of scenarios, and understanding them requires a broader and more fundamental view. Due to these limitations, it is difficult to gauge the usefulness of these studies in other scenarios. Different combinations of methodologies and evaluation parameters can produce very different conclusions. Applying the results from an isolated study to a general case may result in poor performance. For example, methods and optimizations for distributive aggregation are not necessarily ideal for holistic workloads. Our goal is to conduct a comprehensive study on the fundamental algorithms and data structures used for aggregation.

Figure 4.1: An overview of the analysis dimensions

This chapter examines six fundamental dimensions that affect main memory aggregation. These dimensions represent well-known parameters which can be used as independent variables in our experiments. Figure 4.1 depicts an overview of the analysis dimensions. Figure 5.12 depicts a decision flow chart that summarizes our observations.

Dimension 1: Algorithm and Data Structure. In recent years, there have been many studies on main-memory data structures, such as tree-based indexes and hash tables. Many of these data structures can be used for in-memory aggregation. Aggregation algorithms can be categorized by the data structure used to store the data. Based on this we divide the algorithms into three main categories: sort-based, hash-based, and tree-based algorithms. We propose a framework that aims to cover many of the scenarios that could be encountered in real workloads. Over the course of this chapter, we evaluate and discuss implementations from all three categories.

Dimension 2: Query and Aggregate Function. An aggregation query is primarily defined by its aggregate function. These functions are typically organized into three categories: distributive, algebraic, and holistic [60].

Distributive aggregate functions, such as Count, can be independently computed in a distributed manner. Algebraic aggregates are constructed by combining several distributive aggregates. For example, the Average function is a combination of Sum and Count. Holistic aggregate functions, such as Median, cannot be distributed in the same way as the two previous categories because they are sensitive to the sort order of the data. Aggregation queries are also categorized based on whether their output is a single value (scalar), or a series of rows (vector). We evaluate a set of queries that cover both distributive and holistic, and vector and scalar categories.

Dimension 3: Key Distribution and Skew. The skew and distribution of the data can have a major impact on algorithm performance. Popular relational database benchmarks, such as TPC-H [35], generally focus on querying data that is non-skewed and uniformly distributed. However, it has been shown that these cases are not necessarily representative of real-world applications [37, 77]. Recently, researchers have proposed a skewed variant of the TPC-H benchmark [30]. The sizes of cities and the length and frequency of words can be modeled with Zipfian distributions, and measurement errors often follow Gaussian distributions [61]. Furthermore, skewed keys can be encountered as a result of joins [21] and composite queries. Our datasets are based on the specifications defined by [61] with a few additions. We cover the impact of both skew and ordering.

Dimension 4: Group-by Cardinality. Group-by cardinality is related to skew in the sense that both dimensions affect the number of duplicate keys. However, the group-by cardinality of a dataset directly determines the size (number of groups) of the aggregation result set. Prior studies have indicated that group-by cardinality has a major impact on the relative performance of different aggregation methods [120, 81, 13, 73]. These studies claim that hashing performs faster than sorting when the group-by cardinality is low relative to the dataset size, and that this performance advantage is reversed when the cardinality is high. We find that the accuracy of this claim depends on the implementation.

We evaluate the performance impact of group-by cardinality, as well as its relationship with CPU cache and TLB misses.

Dimension 5: Dataset Size and Memory Usage. Recent advances in computer hardware have encouraged the use of main-memory database systems. These systems often focus on analytical queries, where aggregation is a key operation. Although memory has become cheaper and denser, this is offset by the increasing demands of the industry. Our goal is to shed some light on the trade-off between memory efficiency and performance.

Dimension 6: Concurrency and Multi-threaded Scaling. Nowadays, query processing systems are expected to support intraquery parallelism in addition to interquery parallelism. Concurrency imposes additional challenges, such as reducing synchronization overhead, eliminating race conditions, and multi-threaded scaling. We explore the viability and scalability of several multi-threaded implementations.

4.2 Related Work

There have been a broad range of studies on the topic of aggregation. With the growing popularity of in-memory analytics in recent years, memory-based algorithms have gained a lot of attention. We explore some of the work that is thematically close to our research. Some studies have proposed novel index structures for database operations. Notably, recent studies have looked into replacing comparison trees with radix trees. In [95] Leis et al. proposed an adaptive radix tree (ART) designed for in-memory query processing. The authors evaluated their data structure with the TPC-C benchmark, which does not focus on analytical queries or aggregation. Based on a similar concept Binna et al. propose HOT [27] (Height Optimized Trie). The core concepts behind this approach are reducing the height of the tree on sparsely distributed keys, and the

use of AVX2 SIMD instructions for intra-cycle parallelism. The authors demonstrate that HOT significantly outperforms other indexes, such as ART [95], STX B+tree [25], and Masstree [108], on insert/read workloads with string keys. However, for integer keys, ART maintains a notable performance advantage in insert performance, and is competitive in read performance. The duality of hashing and sorting for database operations is a topic that continues to generate interest. The preferred approach has changed many times as hardware, algorithms, and data have evolved over the years. Early database systems relied heavily on sort-based algorithms. As memory capacity increased, hash-based algorithms started to gain traction [59]. In 2009 Kim et al. [81] compared cache-optimized sorting and hashing algorithms and concluded that hashing is still superior. In [148] Satish et al. compared several sorting algorithms on CPUs and GPUs. Their experiments found that Radix Sort is faster on small keys, and Merge Sort with SIMD optimizations is faster on large keys. They predicted that Merge Sort would become the preferred sorting method in future database systems due to the large SIMD widths of future architectures. At present, the growth of SIMD register width in commodity hardware has stagnated at 256 bits for AMD CPUs [150] and 512 bits for Intel CPUs [178]. Müller et al. proposed an approach to hashing and sorting with an algorithm that can switch between them in real time [120]. The authors modeled their algorithm based on their observation that hashing performs better on datasets with low cardinality, but sorting is faster when there is high data cardinality. This holds true with some basic hashing or sorting algorithms, but there are algorithms for which this model does not apply. Their approach adjusts to the cardinality and skew of the dataset by setting a threshold on the number of groups found in each cache-sized chunk of the data. This approach cannot be used for holistic aggregation queries, as the data is divided into chunks.

Balkesen et al. [13] compared the performance of highly optimized sort-merge and radix-hashing algorithms for joins. Their implementations leveraged the extra SIMD width and processing cores found in modern processors. They found that hashing outperformed sorting, although the gap got much smaller for very large datasets. The authors predicted that sorting may eventually outperform hashing in the future, if the SIMD registers and data key sizes continue to expand. Iceberg queries combine joining, aggregation, and filtering of the results based on a threshold on the aggregate value. The name iceberg refers to the fact that the final number of groups is generally much smaller than the initial groups, similar to how the tip of an iceberg is considerably smaller than the whole. In [174] Walenz et al. propose several optimization techniques for iceberg queries. The authors extended nested loop join to support iceberg queries, but did not explore other (arguably better) join algorithms such as hash join or sort merge join. Parallel aggregation algorithms focus on determining efficient concurrent designs for shared data structures. A key question in parallel aggregation is whether threads should be allowed to work independently, or to work on a shared data structure. Cieslewicz et al. [32] present a framework to select a parallel strategy based on a sample from the dataset. Surprisingly, the authors claim that sort-based aggregation can only be faster than hash-based aggregation if the input is presorted. We found that in the context of single-threaded algorithms, sort-based aggregation is quite competitive with hash-based aggregation. In [185], the authors examined several previously proposed parallel algorithms, and proposed a new algorithm called PLAT based on the concept of partitioning and a combination of local and global hash tables. Most of these algorithms do not support holistic aggregation, because they split the data into multiple hash tables in order to reduce contention. Furthermore, none of the algorithms are ideal for scalar aggregation as they do not guarantee lexicographical ordering of the keys.

Table 4.1: Aggregation Queries

Query  SQL Representation (example)                       Aggregate Function           Aggregate Category  Output Format
Q1     SELECT product_id, COUNT(*) FROM sales             Count                        Distributive        Vector
       GROUP BY product_id
Q2     SELECT student_id, AVG(grade) FROM grades          Average                      Algebraic           Vector
       GROUP BY student_id
Q3     SELECT product_id, MEDIAN(amount) FROM products    Median                       Holistic            Vector
       GROUP BY product_id
Q4     SELECT COUNT(sale_id) FROM sales                   Count                        Distributive        Scalar
Q5     SELECT AVG(grade) FROM grades                      Average                      Algebraic           Scalar
Q6     SELECT MEDIAN(part_id) FROM parts                  Median                       Holistic            Scalar
Q7     SELECT product_id, COUNT(*) FROM sales             Count with Range Condition   Distributive        Vector
       WHERE product_id BETWEEN 500 AND 1000
       GROUP BY product_id

4.3 Queries

In this section, we describe the queries used for our experiments. In Table 4.1 we describe each query along with a simple example. Our goal is to evaluate and compare different aggregation variants. There are three main categories of aggregate functions: distributive, algebraic, and holistic. Distributive functions, such as Count and Sum, can be processed in a distributed manner. This means that the input can be split and processed in multiple partitions, and then the intermediate results can be combined to produce the final result. Algebraic functions consist of two or more Distributive functions. For instance, Average can be broken down into two distributive functions: Count and Sum. Holistic aggregate functions cannot be decomposed into multiple distributive functions, and require all of the input data to be processed together. For example, if an input is split into two partitions and the Median is calculated for each partition, it is not possible to accurately determine the Median for the entire dataset solely based on the Median values for each partition.
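As a concrete illustration (the values are chosen purely for exposition): if one partition holds {1, 2, 9} with a Median of 2, and another holds {3, 8, 10} with a Median of 8, the Median of the combined data {1, 2, 3, 8, 9, 10} is 5.5, which cannot be derived from the two partition Medians alone.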

Other examples of Holistic functions include Rank, Median, and Quantile. The output of an aggregation can be either in Vector or Scalar format. In Vector aggregates, a row is returned in the output for each unique key in the designated column(s). These columns are commonly specified using the group-by or having keywords. The output value is returned as a new column next to the group-by column. Scalar aggregates process all the input rows and produce a single scalar value as the result. Sometimes it is desirable to filter the aggregation output based on user defined thresholds or ranges. We study an example of a range search combined with a vector aggregate function in Q7. In a real-world environment, it may be possible to push the range conditions to an earlier point in the query plan, but if several different range scans are desired, early filtering may not be possible. The main purpose of this query is to evaluate each data structure's efficiency at performing a range search in addition to the aggregation.

4.4 Data Structures and Algorithms

In this section, we introduce the data structures and algorithms that we use to implement aggregate queries. We divide these algorithms into three categories: sort- based, hash-based, and tree-based. In order to facilitate reproducibility, we have selected open-source data structures and sort algorithms where possible. We also consider several state-of-the-art data structures, such as ART[95], HOT[27], and Libcuckoo[102]. Since the performance of algorithms can shift with hardware ar- chitectures, we also consider some of the more fundamental algorithms and data structures, such as a B+Tree [25]. Throughout this section we will state theoretical time complexities using n as the number of elements and k as the number of bits per key. The implementation of an aggregate operator can be broken down into two main

phases: the build phase and the iterate phase. Consider this example using a hash table and a vector aggregate function (refer to Q1 in Table 4.1). During the build phase, each key (created from the group-by attribute or attributes) is looked up in the hash table. If it does not exist, it is inserted with a starting value of one. Otherwise, the value for the existing key is incremented. Once the build phase is complete, the iterate phase reads the key-value pairs from the hash table and writes the resulting items to the output. A similar procedure is used for tree data structures. The calculation of the aggregate value during the build phase (early aggregation) is only possible when the aggregate function is distributive or algebraic. As a result, holistic aggregate values cannot be calculated until all records have been inserted. Sort-based approaches “build” a sorted array using the group-by attributes. As a result, all the values for each group are placed in consecutive locations. The aggregate values are calculated by iterating through the groups.
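For instance, a count-style vector aggregation in the spirit of Q1 can be sketched with an ordinary hash map; this illustrates the two phases rather than any specific implementation evaluated in this chapter.

    #include <cstdint>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    // Sketch of the two phases for a COUNT(*) ... GROUP BY query (cf. Q1).
    std::vector<std::pair<uint64_t, uint64_t>>
    count_group_by(const std::vector<uint64_t>& group_keys) {
        // Build phase: early aggregation is possible because Count is distributive.
        std::unordered_map<uint64_t, uint64_t> counts;
        counts.reserve(group_keys.size());
        for (uint64_t key : group_keys) ++counts[key];   // insert with one, or increment

        // Iterate phase: read the key-value pairs and emit one output row per group.
        std::vector<std::pair<uint64_t, uint64_t>> result;
        result.reserve(counts.size());
        for (const auto& kv : counts) result.emplace_back(kv.first, kv.second);
        return result;
    }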

4.4.1 Sort-based Aggregation Algorithms

Sorting algorithms are a crucial building block in any query processing system. Many popular database systems, such as Microsoft SQL and Oracle, employ both sort- based and hash-based algorithms. We examine several algorithms designed for sort- ing arrays of fixed length integers, although some of the approaches could be adapted to variable length strings.

4.4.1.1 Quicksort

Quicksort is a sorting method based on the concept of divide and conquer that was invented by Tony Hoare [69] and remains very popular to this day. The average time complexity of Quicksort is O(n log(n)). The worst case time complexity is considerably worse at O(n^2), but this is rare, and is mitigated in modern implementations [74, 134].

57 4.4.1.2 Introsort

Introspective sort (Introsort) is a hybrid sorting algorithm that was proposed by David Musser [121]. Introsort can be regarded as an algorithm that builds on Quicksort and improves its worst case performance. This sorting algorithm starts by sorting the dataset with Quicksort. When the recursion depth passes a certain threshold, the algorithm switches to Heapsort. This threshold is defined as the logarithm of the number of elements being sorted. This algorithm guarantees a worst case time complexity of O(n log(n)). The GCC variant of Introsort [74] differs from the original algorithm in two ways. Firstly, the recursion depth limit is set to 2 * log(n). Secondly, the algorithm switches to Insertion Sort for small partitions, which is faster on small data chunks even though it has a time complexity of O(n^2).

4.4.1.3 Radix Sort (MSB and LSB)

Radix sorting works by sorting the data one bit (binary digit) at a time. There are two variants of Radix Sort, based on the order in which the bits are processed: Most Significant Bit (MSB) Radix Sort, and Least Significant Bit (LSB) Radix Sort. As the names suggest, MSB sorts the data starting from the top (leftmost) bits, and works its way down. In comparison, LSB starts from the bottom bits. The time complexity of Radix sort is O(k ∗ n) where k is the key width (number of bits in the key), and n is the number of elements.
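As an illustration, the following is a sketch of an LSB radix sort on 32-bit unsigned keys. It processes eight bits per pass rather than one bit at a time, which is the byte-at-a-time variant commonly used in practice; it is not the specific implementation benchmarked below.

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Sketch: LSB radix sort, one byte (8 bits) per pass from least to most significant.
    void lsb_radix_sort(std::vector<uint32_t>& a) {
        std::vector<uint32_t> buf(a.size());
        for (int shift = 0; shift < 32; shift += 8) {
            std::array<std::size_t, 257> count{};                   // counting sort on one byte
            for (uint32_t x : a) ++count[((x >> shift) & 0xFF) + 1];
            for (int i = 0; i < 256; ++i) count[i + 1] += count[i]; // prefix sums -> offsets
            for (uint32_t x : a) buf[count[(x >> shift) & 0xFF]++] = x;
            a.swap(buf);                                            // stable pass completed
        }
    }

Each pass is a stable counting sort on one byte, which is what allows the passes to be applied from the least significant digits upward.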

4.4.1.4 Spreadsort

Spreadsort is a hybrid sorting algorithm that combines the best traits of Radix and comparison-based sorting. This algorithm was invented by Steven J. Ross in 2002 [146]. Spreadsort uses MSB Radix partitioning until the size of the partitions reaches a predefined threshold, at which point it switches to comparison-based sorting using Introsort.

Figure 4.2: Sort Algorithm Microbenchmark - time to sort 10M keys (ms) for the Random (1-5), Random (1-1M), Random (1k-1M), Pre-sorted Sequential, and Reversed Sequential distributions.

Comparison-based sorting is more efficient on small sequences of data compared to Radix partitioning. The time complexity of the MSB Radix phase is O(n log(k/s + s)), where k is the key width, and s is the maximum number of splits (default is 11 for 32 bit integers). As mentioned, the time complexity of Introsort is O(n log(n)).

4.4.1.5 Sorting Microbenchmarks

In order to obtain a basic understanding of the performance of these algorithms and how they compare, we evaluate five algorithms on a variety of datasets. The tested algorithms are: Quicksort, Introsort, MSB Radix Sort, LSB Radix Sort, and Spreadsort. We test each algorithm on five data distributions: random integers between one and five, random integers between one and one million, random integers between one thousand and one million, presorted sequential integers, and reverse sorted sequential integers. We measure the time to sort ten million integers from each distribution. The results, depicted in Figure 4.2, show that Introsort and Spreadsort generally outperform the other sorting algorithms.

4.4.2 Hash-based Aggregation Algorithms

Hash tables are particularly efficient in workloads that require fast random lookups, which they perform in constant time. A hash function transforms a key into an

address within the table. However, hash tables do not generally guarantee any ordering of the keys (lexicographical or chronological). It is possible to pre-sort the data and construct a hash function that guarantees ordered keys (minimal perfect hashing [51, 105]). However, this would defeat the purpose of constructing a hash table for aggregation, as the impact on query execution time would be quite severe. Hash tables are not well-suited to gradual dynamic growth, as growing the table may entail rehashing all existing elements as well. In principle, a hash table’s size could be tuned to anticipate the dataset group-by cardinality. However, in practice it is difficult to estimate the cardinality, particularly when there are several group-by columns. Overestimating the cardinality results in excessive memory usage, while underestimating it leads to costly rehash operations. In our experiments we assume that only the size of the dataset is known, hence we set the initial size of the hash tables accordingly.

Hash tables can be categorized based on their collision resolution scheme. Collision resolution defines how a hash table resolves conflicts caused by multiple keys hashing to the same location. We now describe four collision resolution schemes and the implementations that use them: linear probing, quadratic probing, separate chaining, and cuckoo hashing.

4.4.2.1 Linear probing

Linear probing is part of the family of collision resolution techniques called open addressing. Open addressing hash tables typically store all the items in one contiguous array. They do not use pointers to link data items. Linear probing specifies the method used to search the hash table. An insertion begins from the hash index and probes forward in increments of one until the first empty bucket is found. Linear probing hash tables do not need to allocate extra memory to store new items as long as the table has empty slots. However, they may encounter an issue called primary clustering, where colliding records form long sequences of occupied slots. These sequences displace incoming keys, and grow each time they do so, resulting in a high number of record displacements.

We implement a custom linear probing hash table using several industry best practices, such as maintaining a power-of-two table size. If the desired size is not a power of two, then the nearest greater power of two is chosen. This is a popular optimization that allows the table modulo operation to be replaced with a much faster bitwise AND. The downside to this policy is that it is easier to overshoot the available memory. To resolve this, our implementation falls back to the modulo operation: the table size is set to the nearest prime number if possible, and the exact size parameter is used as the final fallback.
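A minimal sketch of a linear probing insert that uses a power-of-two capacity, so the modulo becomes a bitwise AND, is shown below. The structure, hash constant, and field names are illustrative and simplified (no resizing, no prime fallback); this is not our exact implementation.

```cpp
#include <cstdint>
#include <vector>

// Illustrative open-addressing table with linear probing and a power-of-two
// capacity, so (hash & mask) replaces the slower (hash % capacity).
struct LinearProbingTable {
    struct Slot { uint64_t key; uint64_t value; bool occupied; };
    std::vector<Slot> slots;
    uint64_t mask;

    explicit LinearProbingTable(size_t min_capacity) {
        size_t cap = 1;
        while (cap < min_capacity) cap <<= 1;    // round up to a power of two
        slots.assign(cap, Slot{0, 0, false});
        mask = cap - 1;
    }

    // Insert key, or add to the value of an existing key (early aggregation).
    // Assumes the table never becomes completely full (sized to the dataset).
    void upsert_add(uint64_t key, uint64_t value) {
        uint64_t h = key * 0x9E3779B97F4A7C15ULL;           // multiplicative hash (illustrative)
        for (uint64_t i = h & mask; ; i = (i + 1) & mask) { // probe forward in increments of one
            if (!slots[i].occupied) { slots[i] = Slot{key, value, true}; return; }
            if (slots[i].key == key) { slots[i].value += value; return; }
        }
    }
};
```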

4.4.2.2 Quadratic probing

Quadratic probing is an open addressing scheme that is very similar to linear probing. Like linear probing, it calculates a hash index and searches the table until a match is found. Rather than probing in increments of one, a quadratic function is used to determine each successive probe index. For example, with an arbitrary hash function h(x) and quadratic function f(x) = x², the algorithm probes h(x), h(x)+1, h(x)+4, h(x)+9 instead of the linear probe sequence h(x), h(x)+1, h(x)+2, h(x)+3. This approach greatly reduces the likelihood of clustering, but it does so at the cost of reducing data locality. Google Sparse Hash and Dense Hash [155] are based on open addressing with quadratic probing. Sparse Hash favors memory efficiency over speed, whereas Dense Hash targets faster speed at the expense of higher memory usage.

4.4.2.3 Separate chaining

Separate chaining is a way of resolving collisions by chaining key-value pairs to each other with pointers. Buckets with colliding items resemble a singly linked list. The main advantages of separate chaining include fast insert performance and relatively versatile growth. The use of pointer-linked buckets reduces data locality, which hurts lookup and update performance. However, unlike linear probing, separate chaining hash tables do not suffer from primary clustering. Separate chaining hash tables remain popular in recent works [13, 152, 15, 29]. Templated separate chaining hash tables are included as part of the Boost and standard C++ libraries. Additionally, the Intel TBB library provides versatile hash tables that support concurrent insertion and iteration.

4.4.2.4 Cuckoo Hashing

Cuckoo hashing was originally proposed by Pagh et al. [129]. Its core concept is to store items in one of two tables, each with a corresponding hash function (this can be extended to additional tables). If a bucket is occupied by another item, the existing item is displaced and reinserted into the other table. This process continues until all items stabilize, or the number of displacements exceeds an arbitrary threshold. Cuckoo hashing provides a guarantee that reads take no more than two lookups. Its main drawbacks are relatively slower and less predictable insert operations, and the possibility of failed insertions. In [102], researchers from Intel Labs presented a concurrent cuckoo hashing technique called Libcuckoo. Libcuckoo introduces improvements to the insertion algorithm by leveraging hardware transactional memory (HTM). This hardware feature allows concurrent modifications of shared data structures to be atomic. Their experimental results indicate that Libcuckoo outperforms MemC3 [46] and Intel TBB [134].
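The displacement loop can be sketched as follows, assuming two tables and a fixed displacement budget. Duplicate-key handling, the rehash-on-failure path, and the hash functions are simplified placeholders; this is only an illustration of the scheme, not Libcuckoo.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Illustrative two-table cuckoo hashing: each key has one candidate slot per
// table; occupants are displaced back and forth until everything settles or a
// displacement budget is exceeded (a real table would then rehash or grow).
struct CuckooTable {
    struct Slot { uint64_t key; uint64_t value; bool occupied; };
    std::vector<Slot> t0, t1;
    size_t size;   // number of slots per table, must be > 0

    explicit CuckooTable(size_t n)
        : t0(n, Slot{0, 0, false}), t1(n, Slot{0, 0, false}), size(n) {}

    size_t h0(uint64_t k) const { return (k * 0x9E3779B97F4A7C15ULL) % size; }
    size_t h1(uint64_t k) const { return (k * 0xC2B2AE3D27D4EB4FULL) % size; }

    // A lookup only ever checks t0[h0(k)] and t1[h1(k)], i.e. at most two probes.
    // Returns false if the insert fails after too many displacements.
    bool insert(uint64_t key, uint64_t value) {
        const int kMaxDisplacements = 32;        // arbitrary threshold (illustrative)
        Slot item{key, value, true};
        for (int i = 0; i < kMaxDisplacements; ++i) {
            std::swap(item, t0[h0(item.key)]);   // claim the slot in table 0, evicting any occupant
            if (!item.occupied) return true;
            std::swap(item, t1[h1(item.key)]);   // the evicted item moves to table 1
            if (!item.occupied) return true;
        }
        return false;                            // caller must rehash/grow and retry
    }
};
```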

4.4.3 Tree-based Aggregation Algorithms

Hash-based and sort-based aggregation approaches are very popular, mainly due to a heavy focus of past studies on “write once read once” (WORO) aggregation workloads, as opposed to “write once read many” (WORM). We consider several tree data structures, and assess their viability for aggregation. Trees are commonly used to evaluate range conditions. However, aggregation benchmarks, such as TPC-H, do not include range queries. Tree data structures are well suited to incremental dynamic growth. The trade-off is higher time complexities for both insert and lookup operations, compared to hash tables. We divide the tree data structures into comparison trees and radix trees. Comparison trees have traditionally served as indexing structures, but Radix trees are being increasingly adopted in recent main memory databases, such as HyPer [78] and Silo [170]. We explore two examples from each category: the Btree family and Ttree are comparison trees, and ART and Judy are Radix trees.

4.4.3.1 Btree

The B-tree is a popular tree data structure that was initially invented in 1971 by Bayer et al. [20], and forms the basis for many modern variants [99, 25]. A B-tree is a balanced m-way tree where m is the maximum number of children per node. The B-tree is perhaps best recognized as a popular disk-based data structure used for database indexing, although they are also used for in-memory indexes. The defining characteristics of B-trees are that they are shallow and wide due to using a high fanout. This reduces the number of node lookups as each node contains multiple data items. B-trees may also include pointers between leaf nodes to facilitate more efficient range scans. The time complexity for inserting n items into a B-tree is O(n log(n)). We use a cache-optimized implementation based on the STX B+tree [25] which we will henceforth refer to as Btree.

4.4.3.2 Ttree

Ttree (spelled “T-tree” in the literature) was originally proposed in 1986 by Lehman et al. [93]. Its intended purpose was to provide an index that could outperform and replace the disk-oriented B-tree for in-memory operations. Although the Ttree showed a lot of promise when it was first introduced, we show in Section 4.4.4 that advancements in hardware design have rendered it obsolete on modern processors.

4.4.3.3 ART

ART (Adaptive Radix Tree) [95] is a Radix tree variant with a variable fan-out. Its inventors present it as a data structure that is as fast as a hash table, with the added bonus of sorted output, range queries, and prefix queries. ART uses SIMD instructions to compare multiple keys in parallel. ART saves on memory consumption by using dynamic node sizes and merging inner nodes when possible. Radix trees have several key advantages compared to comparison trees. The height of a radix tree depends on the length of the keys, rather than the number of keys. Additionally, in contrast with comparison trees, they do not need to perform re-balancing operations. We also considered HOT [27], which builds on the same principles as ART. However, we found its performance with integer keys to be noticeably worse, as its main focus is string keys.

4.4.3.4 Judy Arrays

Judy Arrays (henceforth referred to as Judy) were developed by Doug Baskins [17], and are defined as a type of sparse dynamic array designed for sorting, counting, and searching. They are intended to replace common data structures such as hash tables and trees. Judy is implemented as a 256-way Radix tree that uses a variable fan-out and a total of 20 compression techniques to reduce memory consumption and improve cache efficiency [5]. Judy is fine-tuned to minimize cache misses on 64-byte cache lines. Like many other tree data structures, the size of a Judy array dynamically grows with the data and does not need to be pre-allocated.

Figure 4.3: Data Structure Microbenchmark (build and iterate time, in millions of CPU cycles, for each data structure)

4.4.4 Data Structure Microbenchmark

We use a microbenchmark to evaluate each data structure’s efficiency in a store and lookup workload. The dataset consists of 10 million key-value pairs from a uniform distribution. We separately measure the time it takes to build the data structure (build phase), and the time to read all the items in the data structure (iterate phase). All hash tables are sized to the number of elements. The results are depicted in Figure 4.3, using the abbreviations outlined in Table 4.3. With the exception of Hash LC, the hash tables build faster than the tree structures due to their O(1) insert complexity. Hash LC performs poorly in the build phase because it is designed as a concurrent data structure; we evaluate its concurrent scalability in Section 4.6.8. Hash LP and Hash Dense provide the fastest overall times. Btree is noticeably faster in the iterate phase, but it takes a relatively long time to build due to the cost of balancing the tree. Due to the relatively poor performance exhibited by Ttree in both phases, we opt to omit it from subsequent experiments.

Table 4.2: Data Structure Time Complexity

Data Structure       Insert (Average)    Insert (Worst)   Search (Average)   Search (Worst)
ART                  O(k)                O(k)             O(k)               O(k)
Judy                 O(k)                O(k)             O(k)               O(k)
Btree                O(log(n))           O(log(n))        O(log(n))          O(log(n))
Ttree                O(log(n))           O(log(n))        O(log(n))          O(log(n))
Separate Chaining    O(1)                O(n)             O(1)               O(n)
Linear Probing       O(1)                O(n)             O(1)               O(n)
Quadratic Probing    O(1)                O(log(n))        O(1)               O(log(n))
Cuckoo Hashing       O(1) (amortized)    O(n)             O(1)               O(1)

4.4.5 Time Complexity

It is well known that time complexities are not always the best predictors of real-world performance. This is due to a number of factors, including hidden constants and overheads arising from the implementation, hardware characteristics such as the CPU architecture, cache, and TLB, compiler optimizations, and the operating system. On modern systems, cache misses are particularly expensive. Nevertheless, time complexity is widely used as a means to understand and compare the relative performance of different algorithms. Table 4.2 provides an overview of the known time complexities for each of the data structures that we evaluate. Here n denotes the number of elements, and k the number of bits in the key.

4.5 Datasets

In order to effectively evaluate the algorithms, we generate a set of synthetic datasets that vary in terms of input size, group-by cardinality, key distribution, and key range. Our datasets are based on the highly popular input distributions described in prior works [61, 32, 73]. We employ several modifications to these datasets, with the goal of expanding the data characteristics that we evaluate. Some datasets, such as the sequential dataset, produce very predictable patterns. For such datasets, we generate an additional variant with uniform random shuffling. In [32] it is mentioned that the group-by cardinality is often probabilistic. We enforce deterministic group-by cardinality in cases where the intended dataset distribution can be maintained. The dataset distributions are summarized in Table 4.4. Throughout this chapter we use random to refer to a uniform random function with a fixed seed, and shuffling refers to the use of the aforementioned function to shuffle all the records in a dataset. The number of records in the dataset is denoted as n, and the group-by cardinality as c.

Table 4.3: Algorithms and Data Structures

Label         Type   Description
ART           Tree   Adaptive Radix Tree [95]
Judy          Tree   Judy Array [17]
Btree         Tree   STX B+Tree [25]
Hash SC       Hash   std::unordered_map [74] (Separate Chaining)
Hash LP       Hash   Linear Probing (Custom)
Hash Sparse   Hash   Google Sparse Hash [155]
Hash Dense    Hash   Google Dense Hash [155]
Hash LC       Hash   Intel libcuckoo [102]
Introsort     Sort   std::sort (Introsort) [74]
Spreadsort    Sort   Boost Spreadsort [163]

Table 4.4: Dataset Distributions

Abbreviation   Description                 Cardinality
Rseq           Repeating Sequential        Deterministic
Rseq-Shf       Rseq, Uniformly Shuffled    Deterministic
Hhit           Heavy Hitter                Deterministic
Hhit-Shf       Hhit, Uniformly Shuffled    Deterministic
Zipf           Zipfian                     Probabilistic
MovC           Moving Cluster              Probabilistic

In the repeating sequential dataset (RSeq), we generate a series of segments that contain multiple number sequences. The number of segments is equal to the group-by cardinality, and the number of records in each segment is equal to the dataset size divided by the cardinality. A shuffled variant of the repeating sequential dataset (RSeq-Shf) is also generated. This dataset mimics transactional data where the key incrementally increases.

In the heavy hitter dataset (HHit), a random key from the key range accounts for 50% of the total keys. The remaining keys are produced at least once to satisfy the group-by cardinality, and then chosen on a random basis. In a variant of this dataset, the resulting records are shuffled so that the heavy hitters are not concentrated in the first half of the dataset. Real-world examples of heavy hitters include top selling products, and network nodes with the highest traffic.

In the Zipfian dataset (Zipf), the distribution of the keys is skewed using Zipf’s law [138]. According to Zipf’s law, the frequency of each key is inversely proportional to its rank. We first generate a Zipfian sequence with the desired cardinality c and Zipf exponent e = 0.5. Then we take n random samples from this sequence to build n records. The final group-by cardinality is non-deterministic and may drift away from the target cardinality as c approaches n. The Zipf distribution is used to model many big data phenomena, such as word frequency, website traffic, and city population.

In the moving cluster dataset (MovC), the keys are chosen from a window that gradually slides. The i-th key is randomly selected from the range ⌊(c − W)i/n⌋ to ⌊(c − W)i/n + W⌋, where the window size W = 64 and the cardinality c is at least the window size (c ≥ W). The moving cluster dataset provides a gradual shift in data locality and is similar to workloads encountered in streaming or spatial applications.
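For illustration, a minimal generator for the moving cluster dataset is sketched below, following the formula above with a fixed seed. The function name and parameters are illustrative, not our exact generator.

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Moving cluster generator: the i-th key is drawn uniformly from a window of
// width W that slides upward as i grows. Assumes c >= W and n > 0.
std::vector<uint64_t> make_moving_cluster(size_t n, uint64_t c, uint64_t W = 64,
                                          uint64_t seed = 42) {
    std::mt19937_64 gen(seed);                   // fixed seed, as in our other datasets
    std::vector<uint64_t> keys(n);
    for (size_t i = 0; i < n; ++i) {
        uint64_t lo = (c - W) * i / n;           // floor((c - W) * i / n)
        std::uniform_int_distribution<uint64_t> dist(lo, lo + W);
        keys[i] = dist(gen);
    }
    return keys;
}
```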

4.6 Results and Analysis

In this section we evaluate the efficiency of the aggregation algorithms. We examine and compare the performance impact of dataset size, group-by cardinality, key skew and distribution, data structures and algorithms, and the query and aggregation functions. We also evaluate memory usage, which we measure as the peak Resident Set Size (RSS), as an indication of each algorithm’s memory efficiency. The experimental parameters are outlined in Table 4.5. For each experiment the input dataset is preloaded into main memory. Similar to Chapter 3, we use the timers from [29] to measure execution time with high precision. We report the average of five runs, noting that the observed variation is less than 2%. We also note that these measurements do not include the time to read the input data from disk.

Table 4.5: Experiment Parameters

Dataset: Repeating Sequential, Heavy Hitter, Moving Cluster, Zipfian
Dataset Size (Records): 100,000,000, 10,000,000, 1,000,000, 100,000
Group-by Cardinality: 100, 1,000, 10,000, 100,000, 1,000,000, 10,000,000
Algorithm: Hash LP, Hash SC, Hash LC, Hash Sparse, Hash Dense, ART, Judy, Btree, Introsort, Spreadsort, Hash TBBSC, Sort BI, Sort QSLB
Thread Count: 1, 2, 3, 4, 5, 6, 7, 8 (logical core count)
Query Workload: Q1 (Vector Distributive), Q3 (Vector Holistic), Q6 (Scalar Distributive), Q7 (Vector Distributive with Range)

Throughout this chapter we aim to understand how each of these dimensions can affect main memory aggregation workloads. Due to space constraints we only show the results for Q1, Q3, Q6, and Q7. We start the experiments with two common vector aggregation queries (Q1 and Q3) in Section 4.6.2. Due to the popularity of these queries, we further analyze them by evaluating cache and TLB misses in Section 4.6.3, memory usage for different dataset sizes in Section 4.6.4, and data distributions in Section 4.6.5. Additionally, we evaluate range searches (Q7) in Section 4.6.6 and scalar aggregation queries (Q6) in Section 4.6.7. We examine multi-threaded scaling in Section 4.6.8. Finally, we summarize our findings in Section 4.7. We now outline the experimental setup.

Figure 4.4: Vector Aggregation Q1 - 100M Records (query execution time in CPU cycles vs. group-by cardinality, for panels (a) Rseq, (b) Rseq-shf, (c) Hhit, (d) Hhit-shf, (e) MovC, (f) Zipf)

4.6.1 Platform Specifications

The experiments are evaluated on a machine with an Intel Core i7-6700HQ processor at 3.5GHz, 16GB of DDR4 RAM at 2133MHz, and a 512GB SSD. The CPU is a quad core based on the Skylake microarchitecture, with hyper-threading (8 logical cores), 256KB of L1 cache, 1MB of L2 cache, and 6MB of L3 cache. The TLB can hold 64 entries in the L1 data TLB, and 1536 entries in the L2 shared TLB (4KB pages). The code is compiled and run on Ubuntu Linux 16.04 LTS, using the GCC 7.2.0 compiler with the -O3 and -march=native optimization flags. These flags enable the compiler’s highest optimization level and target the native architecture of our machine. We now present and discuss the experimental results.

4.6.2 Vector Aggregation

We begin our experiments by evaluating Q1 and Q3 (see Table 4.1), which are based on commonly used aggregate functions. Due to space constraints and the similarity between Algebraic and Distributive functions, we do not show results for Q2. In these experiments, we keep the dataset size at a constant 100M records and vary the group-by cardinality from 10² to 10⁷. In each chart we measure the query execution time for a given query and dataset distribution, and the group-by cardinality increases from left to right. The results for Q1 (Vector COUNT) and Q3 (Vector MEDIAN) are shown in Figures 4.4 and 4.5 respectively.

A larger group-by cardinality means more unique keys and fewer duplicates. In tree-based algorithms the data structure dynamically grows to accommodate the group-by cardinality. This is reflected in the gradual increase in query execution time. The insert performance of ART and Judy depends on the length of the keys, which increases with cardinality. Additionally, the compression employed by ART and Judy is used more heavily at high cardinality.

The results for Q3 show that Spreadsort is the fastest algorithm across the board. The overall trend shows that hash-based algorithms, such as Hash SC and Hash LP, are competitive with Spreadsort until the group-by cardinality exceeds 10⁴. The execution times for both Spreadsort and Introsort show considerably less variance, whereas the worsening of data locality results in sharp declines in performance for the hash-based and tree-based implementations. The performance of Hash Sparse dramatically worsens at 10⁷ groups, suggesting that the combination of Hash Sparse’s gradual growth policy and the extra space needed by this query results in a much steeper decline in performance compared to Q1.

In order to understand why Hash LP outperforms all the other algorithms in Q1, we need to consider several factors. First, its average insert time complexity (as shown in Table 4.2) is unaffected by group-by cardinality. Secondly, the cache-friendly layout of Hash LP takes greater advantage of data locality than the other hash tables. Lastly, compared to Q3, Q1 does not require additional memory to store the values associated with each key. This reduces the pressure on the cache and TLB, and allows Hash LP to compete with memory efficient approaches such as Spreadsort. We further explore cache and TLB behavior in Section 4.6.3 and memory consumption in Section 4.6.4.

Figure 4.5: Vector Aggregation Q3 - 100M records (query execution time in CPU cycles vs. group-by cardinality, for panels (a) Rseq, (b) Rseq-shf, (c) Hhit, (d) Hhit-shf, (e) MovC, (f) Zipf)

4.6.3 Cache and TLB misses

Cache and TLB behavior are useful metrics of algorithm efficiency. Together with runtime and memory efficiency, they paint a picture of how different algorithms compare with each other. Processing large volumes of data in main memory often leads to many cache and TLB misses, which can hinder performance. A cache miss can sometimes be satisfied by a TLB hit, but a TLB miss incurs a page table lookup, which is considerably more expensive. Using the perf tool, we measure the CPU cache misses and data-TLB (D-TLB) misses of Q1 and Q3 with low cardinality (10³ groups) and high cardinality (10⁶ groups) datasets. The results are depicted in Figures 4.6 and 4.7.

Figure 4.6: Cache Misses - Rseq 100M Dataset ((a) cache misses - Q1, (b) cache misses - Q3, in millions, for 1k and 1M groups)

Figure 4.7: TLB Misses - Rseq 100M Dataset ((a) D-TLB misses - Q1, (b) D-TLB misses - Q3, in millions, for 1k and 1M groups)

It is interesting to compare the results in Figure 4.7a with the performance discrepancy between Hash LP and Spreadsort in Q1. At low cardinality, the number of TLB misses is relatively close between the two algorithms. However, at high cardinality, Spreadsort exhibits considerably higher TLB misses. Similarly, in Figure 4.4a we see the runtime gap between the two algorithms widen in Hash LP’s favor as the cardinality increases to 10⁷. Although this metric is not a guaranteed way to predict the relative performance of the algorithms, it is a fairly reliable measure of scalability and overall efficiency. The cache behavior of Spreadsort is consistently good. Other algorithms, such as ART, exhibit large jumps in both cache and TLB misses. This correlates with similar gaps in the query runtimes, and it is noted in [27] that ART’s efficiency degrades if the dataset distribution necessitates the creation of many tree nodes.

Table 4.6: Peak Memory Usage (MB) - Q1 on Rseq 10³ Groups

                          Dataset Size
Type  Algorithm      10⁵      10⁶      10⁷       10⁸
Tree  ART            4.45     11.61    131.61    1,027.44
Tree  Judy           4.31     11.53    131.49    1,027.45
Tree  Btree          4.54     11.79    131.66    1,027.60
Hash  Hash SC        5.45     19.41    159.07    1,540.95
Hash  Hash LP        5.23     18.67    156.29    1,529.44
Hash  Hash Sparse    4.61     11.94    131.68    1,027.58
Hash  Hash Dense     6.42     27.33    336.02    2,814.70
Hash  Hash LC        26.95    44.44    263.14    2,069.90
Sort  Introsort      4.50     11.74    131.66    1,027.59
Sort  Spreadsort     4.53     11.55    131.65    1,027.44

4.6.4 Memory Efficiency

Memory efficiency is a performance metric that is arguably of equal importance to runtime speed. In this section, we study how each method’s memory usage grows as the dataset size is increased. We measure the peak memory consumption at various dataset sizes. To do so, we lock the group-by cardinality at 10³ and vary the dataset size from 10⁵ up to 10⁸. These measurements are taken by using the Linux /usr/bin/time -v tool to acquire the peak RSS (resident set size) for each configuration. We verified these numbers by taking additional measurements using cgmemtime [149] and Valgrind [123] and found the results to be consistent.

The results for Q1 are depicted in Table 4.6 and the memory usage of Q3 is shown in Table 4.7. The results show that the hash tables consume the most memory, followed by the tree data structures. The sort algorithms are the most memory efficient because they sort the data in-place. In order to maintain good insert performance, most hash tables consume more memory than they need to store the items, and some will only resize to powers of two. Hash Dense’s memory usage is particularly high because it uses 6× the size of the entries in the hash table when performing a resize. After the resize is completed, the memory usage shrinks down to 4× the previous size. Comparing Tables 4.6 and 4.7, we see a large increase in memory usage from Q1 to Q3. This is due to the fact that Q3 requires the data structures to store the keys and all associated values in main memory, whereas Q1 only requires storage of the keys and a running count value. Consequently, holistic queries like Q3 will generally consume more memory.

Table 4.7: Peak Memory Usage (MB) - Q3 on Rseq 10³ Groups

                          Dataset Size
Type  Algorithm      10⁵      10⁶      10⁷         10⁸
Tree  ART            5.07     15.26    132.57      1,212.46
Tree  Judy           4.87     15.37    132.68      1,212.82
Tree  Btree          5.14     15.33    132.78      1,212.64
Hash  Hash SC        5.88     23.28    211.39      1,986.78
Hash  Hash LP        7.80     45.55    437.76      4,264.07
Hash  Hash Sparse    5.11     15.92    137.78      1,255.51
Hash  Hash Dense     12.74    79.16    1,156.55    9,404.48
Hash  Hash LC        30.01    68.44    686.95      5,575.09
Sort  Introsort      4.45     11.61    131.57      1,027.50
Sort  Spreadsort     4.31     11.66    131.59      1,027.48

4.6.5 Dataset Distributions

These experiments show the performance impact of the data key distribution. The results are presented in Figures 4.8a and 4.8b. In each figure, we vary the key distribution while keeping the dataset size at a constant 100 million records. To get a better understanding of how this factor ties in with cardinality, we show results for both low and high cardinality (10³ and 10⁶ groups).

Figure 4.8: Vector Q1 - Variable Key Distributions - 100M records ((a) 10³ groups (low cardinality), (b) 10⁶ groups (high cardinality))

The results point out that Zipf and Rseq-Shf are generally the most performance-sensitive datasets. The shuffled variants of Rseq and HHit take longer to run due to a loss of memory locality. By comparing the two figures we can see that this effect is amplified by group-by cardinality, as it increases the range of keys that must be looked up in the cache. In the low cardinality dataset, the number of unique keys is small compared to the dataset size. Introsort is the overall slowest algorithm at low cardinality and its performance is around the middle of the pack at high cardinality. This is in line with prior works suggesting that sort-based aggregation is faster when the group-by cardinality is high [120]. However, as we can see in the results produced by Spreadsort, the algorithm also performs well at low cardinality, which contradicts earlier claims. Due to this, it may be worth revisiting hybrid sort-hash aggregation algorithms in the future.

The results also highlight an interesting trend when it comes to shuffled/unordered data. Observe that ART’s performance in Figure 4.8b significantly worsens when going from Rseq to Rseq-Shf, or indeed any unordered distribution. The combination of high cardinality and unordered data increases pressure on the cache and TLB. If we consider how well Spreadsort performs in these situations, then the results indicate that presorting the data before invoking the ART-based aggregate operator could significantly improve performance. However, careful consideration of the algorithm and dataset is required to avoid increasing the runtime.

Figure 4.9: Range Search Aggregation Q7 - 100M records ((a) range search time at 10³ groups (low cardinality), (b) range search time at 10⁶ groups (high cardinality), (c) build time by dataset cardinality; search ranges of 25%, 50%, and 75%)

4.6.6 Range Search Query

The goal of this experiment is to evaluate algorithms that provide a native range search feature, and combine this with a typical aggregation query. Although it is possible to implement an integer range search on a hash table, this would not work for strings and other keys with non-discrete domains. Consequently, we focus on the tree-based aggregation algorithms. Q7 calculates the vector count aggregates for a range of keys. The tuples that do not satisfy the range condition could be filtered out before building the index (if the range is known in advance). Tree-based data structures are effective in this case if we assume that (a) the data has already been loaded into the data structure, and (b) this is a Write Once Read Many (WORM) workload, and multiple range searches will be satisfied by the same index. We evaluate the time it takes to perform a range search on each of the tree-based data structures for ranges that cover 25%, 50% and 75% of the group-by cardinality (the smaller ranges are run first). The results are shown in Figure 4.9. In Figure 4.9c, we see that the time to build the tree dominates the runtime. The search

times shown in Figures 4.9a and 4.9b indicate that Btree significantly outperforms the other algorithms if the tree is prebuilt. This is mostly due to the pointers that link each leaf node in Btree, resulting in one O(log(n)) lookup operation to find the lower bound of the search, and a series of pointer lookups to complete the search. At low cardinality (10³ groups), the range search time for ART is 12% lower than Judy, but it is 94% higher at high cardinality (10⁶ groups). If we factor in the build time and consider the workload WORO, then ART is the fastest algorithm.

4.6.7 Scalar Aggregation

Unlike Vector aggregate functions, which output a row for each unique key, Scalar aggregate functions return a single row. We evaluate Q6 on the tree-based and sort-based aggregation algorithms. Hash tables are unsuitable for this query because the keys need to be in lexicographical order to calculate the median. Figure 4.10 shows the query execution time for Q6 with different datasets. The overall winner of this workload is the Spreadsort algorithm. In the case of a dynamic or WORM workload, a tree-based algorithm would have two advantages: faster lookups, and considerably less computation for new inserts. A good candidate for tree-based scalar aggregation is Judy, as it outperforms Introsort on all the datasets, and comes close to the performance of Spreadsort in three out of the six datasets. Although ART wins over Judy in some cases, it is inconsistent and its worst-case performance is significantly worse, rendering it a poor candidate for this workload. This conclusion is in line with our expectations. To calculate the scalar median of a set of keys, Spreadsort is the fastest algorithm. If an index has already been built, then Judy is usually the quickest in producing the answer.
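A minimal sketch of computing the scalar median of a key column is shown below. For brevity it uses std::nth_element (a selection algorithm) instead of the full sorting approach evaluated in this section; it only illustrates the shape of the computation, not our evaluated operators.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Scalar MEDIAN over a key column. A selection algorithm suffices because only
// a single output row is produced; our Q6 operators instead sort the column
// (Spreadsort/Introsort) or walk a prebuilt tree index.
double scalar_median(std::vector<uint64_t> keys) {           // taken by value: we reorder it
    if (keys.empty()) return 0.0;
    const size_t n = keys.size();
    auto mid = keys.begin() + n / 2;
    std::nth_element(keys.begin(), mid, keys.end());          // element at index n/2 is in place
    if (n % 2 == 1) return static_cast<double>(*mid);
    auto lower = std::max_element(keys.begin(), mid);         // largest element of the lower half
    return (static_cast<double>(*lower) + static_cast<double>(*mid)) / 2.0;
}
```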

Figure 4.10: Scalar Aggregation Q6 - 100M records (query execution time in CPU cycles vs. group-by cardinality, for panels (a) Rseq, (b) Rseq-shf, (c) Hhit, (d) Hhit-shf, (e) MovC, (f) Zipf)

4.6.8 Multi-threaded Scalability

A concurrent algorithm’s ability to provide a performance advantage over a serial implementation depends on two main factors: the problem size, and the algorithmic efficiency. Considerations pertaining to algorithmic efficiency include various overheads associated with concurrency, such as contention and synchronization. In order to implement concurrent aggregate operators, a suitable data structure must fulfill three requirements. First, it must be designed for data-level parallelism that can scale with an increasing number of threads. Secondly, it must support thread-safe insert and update operations. It is not uncommon to encounter data structures that support concurrent put and get operations, but provide no way to safely modify existing values. Lastly, it must provide a means to iterate through its contents, preferably without requiring prior knowledge of the range of values. In this section, we evaluate the performance and scalability of concurrent data structures and algorithms which fulfill all three criteria.

We considered and ultimately rejected several candidate tree data structures. HOT [27] does not support concurrent incrementing of values (needed by Q1) or multiple values per key (needed by Q3). BwTree [99, 176] is a concurrent B+Tree originally proposed by Microsoft. However, our preliminary experiments found its performance to be very poor, as limitations in its API prevent efficient update operations. These characteristics have been observed by other researchers as well [182]. The concurrent variant of ART [96] currently lacks any form of iterator, which is essential to our workloads.

Table 4.8: Concurrent Algorithms and Data Structures

Label        Type   Description
Hash TBBSC   Hash   TBB Separate Chaining (Concurrent Unordered Map [134])
Hash LC      Hash   Intel Libcuckoo [102]
Sort BI      Sort   Block Indirect Sort [163]
Sort QSLB    Sort   Quicksort with Load Balancing (GCC Parallel Sort [166])

We use a microbenchmark to select two parallel sorting algorithms from among four candidates. We vary the number of threads from one to eight, and include the two fastest single-threaded sorting algorithms for comparison. The workload consists of sorting random integers between 1 and 1M, similar to the microbenchmark presented in Section 4.4. The results are shown in Figure 4.11. Sort BI is a novel sorting algorithm, based on the concept of dividing the data into many parts, sorting them in parallel, and then merging them [163]. Sort TBB is a Quicksort variant that uses TBB task groups to create worker threads as needed (up to the number of threads specified).

Figure 4.11: Parallel Sort Algorithm Microbenchmark (sort time in ms vs. number of threads, for Introsort, Spreadsort, Sort SS, Sort TBB, Sort QSLB, and Sort BI)

Sort SS (Samplesort) [163] is a generalization of Quicksort that splits the data into multiple buckets, instead of dividing the data into two partitions using a pivot. Lastly, Sort QSLB [166] is a parallel Quicksort with load balancing. Considering the performance and scalability at 8 threads, we select the Sort BI and Sort QSLB algorithms to implement sort-based aggregate operators.

We selected four concurrent algorithms and data structures, listed in Table 4.8, all of which are actively maintained open-source projects. We introduced Hash LC in Section 4.4, and Hash TBBSC is a concurrent separate chaining hash table that is similar to Hash SC. We evaluate the multi-threaded scaling for Q1 and Q3, on both low and high dataset cardinality. The results are depicted in Figure 4.12.

Figure 4.12: Multi-threaded Scaling - Rseq 100M (query execution time in CPU cycles vs. number of threads, for panels (a) Q1 - 10³ groups, (b) Q3 - 10³ groups, (c) Q1 - 10⁶ groups, (d) Q3 - 10⁶ groups)

We observe that both hash tables are faster in Q1, and Hash TBBSC outperforms Hash LC regardless of the cardinality. Sort-based approaches take the lead in Q3. The gap between sorting and hashing increases at higher cardinalities. This echoes our previous single-threaded results. The performance of Hash TBBSC degrades significantly in Q3, because storing the values requires the use of a concurrent data structure (in this case a concurrent vector) as the value type. This is a limitation of the hash table implementation, which results in additional overhead due to synchronization and fragmentation [134]. We also considered implementing Q3 using TBB’s concurrent multimap, but the performance was significantly worse. Hash LC does not suffer from these issues, as it provides a thread-safe upsert (insert or update) function. upsert searches the hash table for the key and inserts the item if the key doesn’t exist. If the key exists, upsert can replace or mutate its corresponding value. We observe similar trends with other data distributions, which we omit here for brevity.
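As a sketch of how upsert is used for a distributive aggregate such as Q1, the snippet below assumes libcuckoo's cuckoohash_map interface; the include path and namespace vary between library versions, and the function and parameter names here are illustrative.

```cpp
#include <cstdint>
#include <vector>
#include <libcuckoo/cuckoohash_map.hh>   // header path/namespace depend on the libcuckoo version

// Concurrent COUNT aggregation (Q1-style) using upsert: if the key is absent it
// is inserted with the trailing value, otherwise the functor mutates the
// existing value under the table's internal synchronization.
void concurrent_count(const std::vector<uint64_t>& chunk,
                      cuckoohash_map<uint64_t, uint64_t>& table) {
    for (uint64_t key : chunk) {
        table.upsert(key, [](uint64_t& count) { ++count; }, 1);
    }
}
```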

4.7 Summary and Discussion

Based on the insights we gained from our experiments, we present a decision flow chart in Figure 4.13 that summarizes our main observations with regards to the algorithms and data structures. We acknowledge that our experiments do not cover all possible situations and configurations, and our conclusions are based on these computational results and observations.

Figure 4.13: Decision Flow Chart (branches on aggregation output format, workload type, aggregate category, range search, and prebuilt index; leaves: Spreadsort, Judy, Hash LP, Hash TBBSC, Sort BI, Btree, ART)

We start by picking a branch depending on the output format of the aggregation query. If the query is scalar, the workload determines the best algorithm. If the query workload is “Write Once Read Once” (WORO), then the Spreadsort algorithm provides the fastest overall runtimes. If we require a reusable data structure that can satisfy multiple queries of this category, then Judy is a more suitable option.

Going back to the start node, if the aggregation query is vector, our decision is determined by the aggregate function category. Holistic aggregates (such as Q3) are considerably faster, and more memory efficient, with the sorting algorithms, particularly Spreadsort (single-threaded) and Sort BI (multi-threaded). This advantage is more noticeable at high group-by cardinality. If the query is distributive (such as Q1), then our experiments show that Hash LP (single-threaded) and Hash TBBSC (multi-threaded) are the fastest algorithms. For aggregate queries that include a range condition, we found that Btree greatly outperformed the other algorithms in terms of search times. This advantage is only relevant if we assume that the tree has been prebuilt. Otherwise, ART is the best performer in this category, due to its advantage in build times.

4.8 Chapter Summary

Aggregation is an integral aspect of big data analytics. With rising RAM capacities, in-memory aggregation is growing in importance. There are many different factors that can affect the performance of an aggregation workload. Knowing and understanding these factors is essential for making better design and implementation decisions. The key contributions of this chapter are:

• Evaluation of aggregation queries using sort-based, hash-based, and tree-based implementations

• Methodology to generate synthetic datasets that expands on prior work

• Extensive experiments that include comparison of distributive and holistic aggregate functions, vector and scalar aggregates, range searches, evaluation of memory efficiency and TLB and cache misses, and multi-threaded scaling

• Insights on performance trends and suggestions for practitioners

We presented a six-dimensional analysis of in-memory aggregation. We used microbenchmarks to assess the viability of 20 different algorithms, and implemented aggregation operators using 14 of those algorithms. Our extensive experimental framework covered a wide range of data structures and algorithms, including serial and concurrent implementations. We also varied the query workloads, datasets, and the number of threads. We gained many useful insights from these experiments. Our results show that some persisting notions about aggregation do not necessarily apply to modern hardware and algorithms, and that certain combinations perform better than conventional wisdom would suggest (see Figure 4.13). To our knowledge, this is the first performance evaluation that conducted such a comprehensive study of aggregation. We demonstrated with extensive experimental evaluation that the ideal approach in a given situation depends on the input and the workload. For instance, sorting-based approaches are faster in holistic aggregation queries, whereas hash-based approaches (our custom Hash LP implementation in particular) perform better in distributive aggregation queries.

Chapter 5

Query Processing on NUMA Systems

In Chapters 3 and 4, we delved into the algorithms, data structures, and dataset characteristics that are relevant to in-memory join and aggregation queries. In this chapter, we explore this topic in the context of Non-Uniform Memory Access (NUMA) architectures. As mentioned earlier, data analytics systems commonly utilize in-memory query processing techniques to achieve better throughput and lower latency. Modern computers increasingly rely on NUMA architectures to achieve scalability. This has implications for in-memory query performance, as NUMA architectures have a significant influence on how data is accessed. This can result in sub-optimal query performance and under-utilization of the hardware, if not correctly managed. In this chapter, we outline and evaluate an organized set of strategies that aim to accelerate memory-intensive data analytics workloads on NUMA systems. As part of this study, we evaluate the join and aggregation platforms outlined in Chapters 3 and 4, as well as several relational database systems. This chapter is organized as follows: we outline our motivation in Section 5.1, and provide some background on the problem and the workloads in Section 5.2.

Note: Parts of this chapter were previously published in [114, 115]

In Section 5.4 we discuss the strategies for improving query performance on NUMA systems. We present our setup and experimental results in Section 5.5. Finally, we discuss related work in Section 5.3 and conclude the chapter in Section 5.6.

5.1 Motivation

NUMA systems include a wide range of CPU architectures, topologies, and interconnect technologies. As such, there is no standard for what a NUMA system’s topology looks like. Due to the variety of NUMA topologies and applications, fine-tuning an algorithm to a single machine will not necessarily deliver better performance for other machines. Furthermore, achieving optimal performance on different system configurations can be costly and time-consuming. As a result, we pursue strategies to improve performance with minimal code modification.

NUMA architectures are pervasive in multi-socket and in-memory rack-scale systems, as well as a growing range of CPUs with on-chip NUMA. It is clear that NUMA is ubiquitous and is here to stay, and that software needs to evolve and keep pace with these changes. Although these advances have opened a path toward greater performance, the burden of efficiently leveraging the hardware mostly falls on developers.

In an effort to provide a general solution that speeds up applications on NUMA systems, some researchers have proposed using NUMA schedulers that co-exist with the operating system (OS). These schedulers monitor running applications in real time and attempt to improve performance by migrating threads and memory pages to address load balancing issues [28, 39, 98]. However, some of these approaches are not architecture or OS independent. For instance, Carrefour [34] requires an AMD CPU that is based on the K10 architecture, in addition to a modified OS kernel. Moreover, researchers have argued that these schedulers may not be beneficial for

87 key parameters (shown in Table 5.4) that aim to achieve this. We demonstrate that significant performance gains can be achieved by managing dynamic memory allo- cators, thread placement and scheduling, memory placement policies, indexing, and the OS configuration. In this context, the impact and role of memory allocators have been under-appreciated and overlooked by researchers. We center our investigation around five different memory-intensive query workloads (shown in Table 5.1) that prominently feature joins and aggregations, arguably two of the most popular and computationally expensive workloads used in data analytics. We selected the open- source MonetDB, PostgreSQL, MySQL, and Quickstep database systems, as well as a commercial database system DBMSx for evaluation. These systems were selected due to their significantly divergent architectures, as well as their popularity. An important finding from our research is that the default (out-of-the-box) OS envi- ronment can be surprisingly sub-optimal for high-performance query processing. For instance, the default Linux memory allocator ptmalloc can significantly lag behind other alternatives. Furthermore, with extensive experimental evaluation, we demon- strate that it is possible to systematically utilize application-agnostic (or black-box) approaches to obtain speedups on a variety of data analytics workloads. We show that a hash join workload can achieve a 3× speedup on Machine C (see machine topologies in Figure 5.1 and specifications in Table 5.3), by replacing the memory allocator. This speedup can be further improved to 20× by optimizing the memory placement policy and modifying the OS configuration. We also show that our find- ings apply to other hardware configurations, by evaluating the experiments on three machines with different hardware architectures and NUMA topologies. Lastly, we show how database system performance can be improved by systematically modifying the default OS configuration and overriding the memory allocator. For example, we demonstrate that MonetDB’s query latency in the TPC-H workload can be reduced by up to 43%.

(a) Machine A (b) Machine B (c) Machine C

Figure 5.1: Machine NUMA Topologies (machine specifications in Table 5.3)

5.2 NUMA Topologies

The topology of our machines is shown in Figure 5.1. Each NUMA system is divided into several NUMA nodes. Each node consists of one or more processors (denoted by CPU#) and their local memory resources (denoted by their memory capacity). The NUMA nodes are linked together using a network of interconnect links to form a NUMA topology. A local memory access involves data that resides on the same node, whereas accessing data on any other node is considered a remote access. Remote data travels over the interconnect, and may need to hop through one or more nodes to reach its destination. Consequently, remote memory access is slower and prone to interconnect congestion.

In addition to remote memory access, contention is another possible cause of sub-optimal performance on NUMA systems. Due to the memory wall [8], modern CPUs are capable of generating memory requests at a very high rate, which can easily saturate the interconnect or memory controller bandwidth [39]. Lastly, the abundance of hardware threads in NUMA systems presents a challenge in terms of scalability, particularly in scenarios involving many concurrent memory allocation or memory access requests. In Section 5.4, we outline strategies which can be used to mitigate these issues.
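To make the local vs. remote distinction concrete, the following minimal sketch uses libnuma to place an allocation on a specific node. It assumes libnuma is installed and the program is linked with -lnuma; the node number and buffer size are illustrative.

```cpp
#include <cstddef>
#include <cstdio>
#include <numa.h>    // link with -lnuma

// Place a buffer on a chosen NUMA node: threads running on that node access it
// locally, while threads on other nodes pay the remote-access cost over the
// interconnect.
int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "libnuma: NUMA is not available on this system\n");
        return 1;
    }
    const int node = 0;                             // illustrative node choice
    const size_t bytes = 1ull << 30;                // 1 GiB
    void* buf = numa_alloc_onnode(bytes, node);     // pages backed by node 0's local memory
    if (buf == nullptr) return 1;

    numa_run_on_node(node);                         // keep this thread on the same node (local access)
    // ... fill and process buf ...

    numa_free(buf, bytes);
    return 0;
}
```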

Table 5.1: Experiment Workloads

Workload                                    SQL Equivalent
W1) Holistic Aggregation                    SELECT groupkey, MEDIAN(val)
    (Hash-table-based) [113]                FROM records GROUP BY groupkey;
W2) Distributive Aggregation                SELECT groupkey, COUNT(val)
    (Hash-table-based) [113]                FROM records GROUP BY groupkey;
W3) Hash Join [29]                          SELECT *
W4) Index Nested Loop Join                  FROM table1
    (ART [96], Masstree [108],              INNER JOIN table2
    B+tree [26], Skip List [173])           ON table1.pk = table2.fk;
W5) TPC-H [35]                              22 analytical queries (Q1, Q2, ..., Q22)

5.2.1 Experiment Workloads

Our goal is to analyze the effects of NUMA on query processing workloads, and show effective strategies to gain speedups in these workloads. We have selected five workloads, shown in Table 5.1, to represent a variety of data operations that are common in data analytics and decision support systems. The implementation of these workloads is described in more detail in Section 5.5.2. We now provide some background on the experiment workloads. Joins and aggregations are ubiquitous, essential data processing primitives used in many different applications. When used for in-memory query processing, they are notably demanding on cache and memory. Joins and aggregations are integral components in analytical queries and are frequently used in popular database benchmarks, such as TPC-H [35]. W1 and W2 represent holistic and distributive aggregation queries respectively. We covered the characteristics of different aggregate function categories in Chapter 4. The type of aggregate function plays a large role in determining the workload's sensitivity to memory performance. We selected these workloads in order to highlight these differences and their consequent performance impact due to NUMA effects.

W3 represents a hash join query. As described in [29], the query joins two tables with a size ratio of 1:16, which is designed to mimic common decision support systems. The join is performed by building a hash table on the smaller table and probing the larger table for matching keys. W4 is an index nested loop join using the same dataset as W3. The main difference between W3 and W4 is that W3 builds an ad hoc hash table to perform the join, whereas W4 uses a pre-built in-memory index that accelerates lookups to one of the relations. W5 is a database system workload, using the well-known queries and datasets from the TPC-H benchmark [35]. We evaluate W5 on four database systems: MonetDB [117], PostgreSQL [137], MySQL [127], and DBMSx. In order to analyze query performance under memory-bound (rather than I/O-bound) situations, we configure the databases to use large buffer caches where applicable. Furthermore, we measure multiple warm runs for each query.
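To make the build/probe pattern behind W3 concrete, the following is a minimal, single-threaded C++ sketch of a non-partitioning hash join. It is illustrative only; it is not the parallel implementation from [29] that we evaluate, and the Tuple layout and function names are assumptions made for the example.

#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

struct Tuple { uint64_t key; uint64_t payload; };

// Build a hash table on the smaller relation, then probe it with the
// larger relation, emitting pairs of matching payloads.
std::vector<std::pair<uint64_t, uint64_t>>
hash_join(const std::vector<Tuple>& build_rel, const std::vector<Tuple>& probe_rel) {
    std::unordered_multimap<uint64_t, uint64_t> ht;
    ht.reserve(build_rel.size());
    for (const Tuple& t : build_rel)            // build phase (smaller table)
        ht.emplace(t.key, t.payload);

    std::vector<std::pair<uint64_t, uint64_t>> result;
    for (const Tuple& t : probe_rel) {          // probe phase (larger table)
        auto range = ht.equal_range(t.key);
        for (auto it = range.first; it != range.second; ++it)
            result.emplace_back(it->second, t.payload);
    }
    return result;
}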

5.3 Related Work

The rising demand for high performance parallel computing has motivated many works on leveraging NUMA architectures. We now outline existing research that is relevant to our work. In [80], Kiefer et al. evaluated the performance impact of NUMA effects on multiple independent instances of the MySQL database system. Popov et al. [135] explored the combined effect of thread and page placement using supercomputing benchmarks running on NUMA systems. They observed that co-optimizing thread and memory page placement can provide significant speedups. Their work is thematically similar to this chapter, but follows an application optimization approach using codelets extracted from two supercomputing benchmarks, and does not cover OS configuration or memory allocation. Durner et al. [42] explored the performance impact of dynamic memory allocators on a database system running TPC-DS. The authors

obtained significant speedups utilizing jemalloc and tbbmalloc, which agrees with our findings. In this chapter, we evaluate a broader and newer range of allocators, and additional NUMA parameters, indexes, datasets, databases, and workloads. Several works have pursued automatic load balancing approaches that can improve NUMA system performance in an application-agnostic manner. These approaches generally focus on improving performance by altering the process and/or memory placement. Some examples include Dino [28], Carrefour [39], AsymSched [98], Numad [143], and AutoNUMA [143]. These schedulers have been shown to improve performance in some cases, particularly on systems running multiple independent processes. However, some researchers have claimed that these schedulers do not provide much benefit for multi-threaded query processing applications [140, 151]. The effects of operating system behavior on data processing workloads have led some researchers to pursue the creation of custom-tailored operating systems for database applications [54, 57, 55, 56]. For example, Giceva et al. [57] developed a light-weight kernel for the Barrelfish [19] operating system. This modified operating system is designed to provide the minimal requirements to run a database system. The authors propose the option for task-based scheduling. Unlike threads, tasks are given dedicated access to a processor, and they will not be interrupted or preempted by the operating system. The authors demonstrated runtime improvements for three graph processing queries, when run inside a noisy multi-programming environment. Another approach is to integrate NUMA-oriented features into data structures, but leave the application of these features up to the developer. This solution is not automatic, but enables developers to adapt their application to different target systems. Psaroudakis et al. [139] propose a smart array that has several NUMA-oriented features baked into the data structure. The smart array can be configured to replicate, interleave, or relocate across the NUMA nodes. A different approach involves either extensively modifying or completely replacing

the OS. This is done with the goal of providing a custom-tailored environment for the application. Some researchers have pursued this direction with the goal of providing an OS that is more suitable for large database applications [54, 56, 57]. Custom operating systems aim to reduce the burden on developers, but their adoption has been limited. In the past, researchers in the systems community proposed a few new OSes for multicore architectures, including Corey [31], Barrelfish [18] and fos [177]. Although none were widely adopted by the industry, we believe these efforts underscore the need to investigate the impact of system and architectural aspects on query performance. Some researchers have favored an application-oriented approach that fine-tunes query processing algorithms to the hardware. Wang et al. [175] proposed an aggregation algorithm for NUMA systems, based on radix partitioning. The authors also developed a load balancing algorithm that focuses on inter-socket task stealing, and prohibits task stealing until a socket's local tasks have been completed. Leis et al. [94] presented a NUMA-aware parallel scheduling algorithm for hash joins, which uses dynamic task stealing in order to deal with dataset skew. Schuh et al. [151] conducted an in-depth comparison of thirteen main memory join algorithms on a NUMA system. Researchers have also investigated data partitioning in the context of NUMA-aware in-memory storage [84, 136]. Psaroudakis et al. [141] developed techniques for adaptive data placement and work-stealing to fix imbalance in resource utilization. Our work is orthogonal to these approaches and they can benefit from applying the application-agnostic strategies that we have suggested.

5.4 Methodology

Achieving good performance on NUMA systems involves careful consideration of thread placement, memory management, and load balancing. We explore application-

agnostic strategies that can be applied to the data analytics application in either a black box manner, or with minimal tweaks to the code. Some strategies are exclusive to NUMA systems, whereas others may also yield benefits on uniform memory access (UMA) systems. These strategies consist of: overriding the memory allocator, defining a thread placement and affinity scheme, using a memory placement policy, and changing the operating system configuration. In this section, we describe these strategies and outline the options used for each one.

5.4.1 Dynamic Memory Allocators

Dynamic memory allocators track and manage dynamic memory during the lifetime of an application. The performance impact of memory allocators is often overlooked in favor of exploring ways to tweak the application’s algorithms. It can be argued that this makes them one of the most under-appreciated system components. Both UMA and NUMA systems can benefit from faster or more efficient memory allocators. However, the potential is greater on NUMA systems, as the performance penalties caused by inefficient memory or cache behavior can be significantly higher. Key allocator attributes include allocation speed, fragmentation, and concurrency. Most developers use the default memory allocation functions to allocate or deallocate memory (malloc/new and free/delete) and trust that their library will perform these operations efficiently. In recent years, with the growing popularity of multi-threaded applications, there has been a renewed interest in memory allocators, and several alternative allocators have been proposed. Earlier iterations of malloc used a single lock resulting in serialized access to the global memory pool. Although recent malloc implementations provide support for multi-threaded scalability, there are now several competing memory allocators that aim for faster performance and reduced contention and memory consumption overhead. We evaluate the following allocators: ptmalloc, jemalloc, tcmalloc, Hoard, tbbmalloc, mcmalloc, and supermalloc.

5.4.1.1 ptmalloc

ptmalloc (pthreads malloc) is the standard memory allocator that ships with most Linux distributions as part of the GNU C Library [168] (glibc). It is based on dlmalloc [91] (Doug Lea's Malloc). This allocator aims to attain a balance between speed, portability, and space-efficiency. ptmalloc supports multi-threaded applications by employing multiple mutexes to synchronize and protect access to its data structures. The downside of this approach is the possibility of lock contention on the mutexes. In order to mitigate this issue, ptmalloc creates additional regions of memory (arenas) whenever contention is detected. A key limitation of ptmalloc's arena allocation is that memory can never move between arenas. ptmalloc employs a per-thread cache for small allocations. This helps to further reduce lock contention by skipping access to the memory arenas when possible.

5.4.1.2 jemalloc

jemalloc (Jason Evans malloc) [45] first appeared as an SMP-aware memory allocator for the FreeBSD operating system, and was later expanded and adapted for use as a general purpose memory allocator. When a thread requests memory from jemalloc for the first time, it is assigned a memory allocation arena. Arena assignments for multi-threaded applications follow a round-robin order. In order to further improve performance, this allocator also uses thread-specific caches, which allows some allocation operations to completely avoid arena synchronization. Lock-free radix trees track allocations across all arenas. jemalloc attempts to reduce memory fragmentation by packing allocations into contiguous blocks of memory and re-using the first available low address. This allocator maintains allocation arenas on a per-CPU basis and associates threads with their parent CPU's arena. We use jemalloc version 5.1.0 for our experiments.

5.4.1.3 tcmalloc

tcmalloc (thread-caching malloc) [53] was developed by Google and is included as part of the gperftools library. Its goal is to provide faster memory allocations in memory-intensive multi-threaded applications. tcmalloc divides allocations into two categories: large allocations and small allocations. Small allocations are served by private thread-local caches and do not require any locking. Large allocations use a central heap that is organized into contiguous groups of pages called "spans". Each span is designed to fit multiple allocations (regions) of a particular size class. Since all the regions in a span are of the same size, only one metadata header is maintained for each span. However, applications that use many different size classes may waste memory due to inefficient utilization of the memory spans. The central heap uses fine-grained locking on a per-span basis. As a result, two threads requesting memory from the central heap can do so concurrently, as long as their requests fall in different class categories. We use tcmalloc from gperftools release 2.7.

5.4.1.4 Hoard

Hoard [24] is a standalone cross-platform allocator replacement designed specifically for multi-threaded applications. Hoard’s main design goals are to provide memory efficiency, reduce allocation contention, and prevent false sharing. At its core, Hoard consists of a global heap (the “hoard”) that is protected by a lock and accessible by all threads, as well as per-thread heaps that are mapped to each thread using a hash function. The allocator counts the number of times a thread has acquired the global heap lock in order to decide if contention is occurring. Hoard uses heuristics to detect temporal locality and fill cache lines with objects that were allocated by the same thread, thus reducing false sharing. We evaluate Hoard version 3.13 in our experiments.

5.4.1.5 tbbmalloc

The tbbmalloc [86] allocator is included as part of the Intel Thread Building Blocks (TBB) library [82]. It is based on some of the concepts and ideas outlined in their prior work on McRT-Malloc [72]. This allocator pursues better performance and scalability for multi-threaded applications, and generally considers increased memory consumption as an acceptable trade-off. Allocations in tbbmalloc are supported by per-thread memory pools. If the allocating thread is the owner of the target memory pool, no locking is required. If the target pool belongs to a different thread then the request is placed in a synchronized linked list, and the owner of the pool will allocate the object. We used version 2019 Update 4 of the TBB library for our experiments.

5.4.1.6 supermalloc

supermalloc [88] is a malloc replacement that synchronizes concurrent memory allocation requests using hardware transactional memory (HTM) if available, and falls back to pthread mutexes if HTM is not available. It prefetches all necessary data while waiting to acquire a lock in order to minimize the amount of time spent in the critical section. supermalloc uses homogeneous chunks of objects for allocations smaller than 1MB, and supports larger objects using operating system primitives. Given a pointer to an object, its corresponding chunk is tracked using a lookup table. This lookup table is implemented as a large 512MB array, which takes advantage of the fact that most of its virtual memory will not be committed to physical memory by the OS. For our experiments, we use the latest publicly released source code, which was last updated in October 2017.

5.4.1.7 mcmalloc

mcmalloc (many-core malloc) [171] focuses on mitigating multi-threaded lock contention by reducing calls to kernel space, dynamically adjusting the memory pool

structures, and using fine-grained locking. Similar to other allocators, it uses a global and local (per-thread) memory pool layout. mcmalloc monitors allocation requests, and dynamically splits its global memory pool into two categories: frequently used memory chunk sizes, and infrequently used memory chunk sizes. Dedicated homogeneous memory pools are created to support frequently used chunk sizes. Infrequent memory chunk sizes are handled using size-segregated memory pools. mcmalloc reduces system calls by batching multiple chunk allocations together. We use the latest mcmalloc source code, which was updated in March 2018.

5.4.1.8 Overriding The Memory Allocator

In this section, we describe three methods to change an application's memory allocator. As mentioned earlier, developers typically use the default memory allocation functions to allocate or deallocate memory (malloc/new and free/delete). Alternatively, they may use other functions that wrap around or subsequently call the default functions. In such cases, the LD_PRELOAD environment variable in Linux can be used to point to a pre-compiled memory allocator library. Doing so will ensure that the memory allocator is loaded before any other library and that all calls to functions such as malloc and new will be handled by the preloaded library. This technique can be used even if the application source code is unavailable, as long as the application developers did not use non-default function calls. A second method is to create a custom library that passes calls from non-default memory allocation functions to the desired memory allocator. Similar to the first method, this method uses LD_PRELOAD to preload the compiled library. Finally, the third method is to modify the application source and carefully replace all instances of memory allocation/deallocation with the desired functions, and then link the desired memory allocator library at compile time. The first two methods can be used without access to the source code, whereas the third method is more versatile, but requires source code access, and considerably more time and effort to implement.
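As an illustration of the second method, the hedged C++ sketch below routes the global new and delete operators through malloc and free, so that whichever allocator is preloaded or linked ends up servicing them. The LD_PRELOAD path in the comment is a hypothetical install location, not one used in our experiments.

// Hypothetical usage with a preloaded allocator library:
//   LD_PRELOAD=/usr/lib/libjemalloc.so ./query_workload
#include <cstdlib>
#include <new>

void* operator new(std::size_t size) {
    if (void* p = std::malloc(size)) return p;   // forwarded to the active allocator
    throw std::bad_alloc{};
}

void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }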


Figure 5.2: Memory Allocator Microbenchmark - Machine A - (a) Multi-threaded Scalability, (b) Memory Consumption Overhead

5.4.1.9 Memory Allocator Microbenchmark

We now describe a multi-threaded microbenchmark that we use to gain insight on the relative performance of these memory allocators. Our goal is to answer the question: how well do these allocators scale up on a NUMA machine? The microbenchmark simulates a memory-intensive workload with multiple threads utilizing the allocator at the same time. Each thread completes 100 million memory operations, consisting of allocating memory and writing to it, or reading an existing item and then deallocating it. The distribution of allocation sizes is inversely proportional to the size class (smaller allocations are more frequent). We use two metrics to compare the allocators: execution time and memory allocation overhead. The execution time

gives an idea of how fast an allocator is, as well as its efficiency when being used in a NUMA system by concurrent threads. In Figure 5.2a, we vary the number of threads in order to see how each allocator behaves under contention. The results show that tcmalloc provides the fastest single-threaded performance, but immediately falls behind the competition once the number of threads is increased. Hoard and tbbmalloc show good scalability and outperform the other allocators by a considerable margin. In Figure 5.2b, we show each allocator's overhead. This is calculated by measuring the amount of memory allocated by the OS (as maximum resident set size), and dividing it by the amount of memory that was requested by the microbenchmark. This experiment shows considerably higher memory overhead for mcmalloc as the number of threads increases. Hoard and tbbmalloc are slightly more memory-hungry than the other allocators, which leaves jemalloc as the fastest memory allocator with low overhead. Based on these results, we omit supermalloc and mcmalloc from subsequent experiments, due to their poor performance in terms of scalability and memory overhead respectively.
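A simplified version of this microbenchmark can be sketched as follows. The real benchmark performs 100 million operations per thread, mixes reads of existing items with allocations, and additionally records the maximum resident set size; this sketch only illustrates the threading structure and the skew towards small size classes, and its constants are arbitrary.

#include <chrono>
#include <cstdlib>
#include <iostream>
#include <random>
#include <thread>
#include <vector>

// Each thread repeatedly allocates a block, touches it, and frees it.
// Smaller size classes are drawn more frequently than larger ones.
void worker(std::size_t ops) {
    std::mt19937 gen(std::random_device{}());
    std::discrete_distribution<int> size_class({64, 32, 16, 8, 4, 2, 1});
    for (std::size_t i = 0; i < ops; ++i) {
        std::size_t bytes = std::size_t(16) << size_class(gen);   // 16 B .. 1 KB
        char* p = static_cast<char*>(std::malloc(bytes));
        p[0] = static_cast<char>(i);                               // touch the allocation
        std::free(p);
    }
}

int main() {
    const unsigned threads = std::thread::hardware_concurrency();
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back(worker, 1000000);
    for (auto& th : pool) th.join();
    double secs = std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
    std::cout << threads << " threads: " << secs << " s\n";
}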

5.4.2 Thread Placement and Scheduling

Defining an efficient thread placement strategy is a well-known and essential step toward obtaining better performance on NUMA systems. By default, the kernel thread scheduler is free to migrate threads created by the program between all available processors. The reasons for doing so include power efficiency and balancing the heat output of different processors. This behavior is not ideal for large data analytics applications and may result in significantly reduced query throughput. The thread migrations slow down the program due to cache invalidation, as well as a likelihood of moving threads away from their local data. The combination of cache invalidation, loss of locality, and non-deterministic behavior of the OS scheduler can result in fluctuating runtimes (as shown in Figure 5.3 with 16 threads). Binding threads to processor cores can solve this issue by preventing the OS from migrating threads. However, deciding how to place the threads requires careful consideration of the topology and workload.

Figure 5.3: OS thread scheduler behavior vs thread affinity strategy - Consecutive runs of W1 - Machine A
Figure 5.4: Comparison of Sparse and Dense thread affinitization strategies - W1 - Machine A

A thread placement strategy details the manner in which threads are assigned to processors. We explore two strategies for assigning thread affinity: Dense and Sparse. A Dense thread placement involves packing threads in as few processors as possible. The idea behind this approach is to minimize remote access distance and maximize resource sharing. In contrast, the Sparse strategy attempts to maximize memory bandwidth utilization by spreading the threads out among the processors. There are a variety of ways to implement and manage thread placement, depending on the level of access to the source code and the library used to provide multithreading. Applications built on OpenMP can use the OMP_PROC_BIND and OMP_PLACES environment variables in order to set a thread placement strategy.

To demonstrate the impact of affinitization, we evaluate workload W1 from Table 5.1 using the Moving Cluster dataset described in Section 5.5.2. Figure 5.3 depicts 10 consecutive runs of this workload. The runtime of the default configuration (no affinity) is expressed in relation to the affinitized configuration.

Table 5.2: Profiling thread placement - W1 on Machine A - Default (managed by OS) vs Modified (Sparse policy)

Performance Metric | Default | Modified | Percent Change
Thread Migrations | 33,196 | 16 | −99.95%
Cache Misses | 1,450M | 972M | −32.95%
Local Memory Accesses | 367M | 374M | +2.06%
Remote Memory Accesses | 159M | 108M | −31.95%
Local Access Ratio | 0.70 | 0.78 | +10.77%

The results highlight the inconsistency of the default OS behavior. In the best case, the affinitized configuration is several orders of magnitude faster, and the worst case runtime is still around 27% faster. In order to gain a better understanding of how each configuration affects the workload, we use the perf tool to measure several key metrics. The results, depicted in Table 5.2, show that the operating system migrates the worker threads many times during the course of the workload. The Sparse affinity configuration prevents migration-induced cache invalidation, which in turn reduces cache misses. Furthermore, a stable thread placement increases the ratio of local memory accesses, resulting in more bandwidth. In Figure 5.4, we evaluate the Sparse and Dense thread affinity strategies on workload W1, and vary the number of threads. We also vary the dataset (see Section 5.5.2) in order to ensure that the distribution of the data records is not the defining factor. The goal of this experiment is to determine whether threads benefit more from being placed on the same NUMA node or from utilizing a greater number of the system's memory controllers. The Sparse policy achieves better performance when the workload is not using all available hardware threads. This is due to the threads having access to additional memory bandwidth, which plays a major role in memory-intensive workloads. When all hardware threads are occupied, the two policies perform almost identically. Henceforth, we use the Sparse configuration (when applicable) for all our experiments.
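The Sparse and Dense placements described above can also be set programmatically. The sketch below pins each worker thread with pthread_setaffinity_np; it assumes, purely for illustration, that hardware threads are numbered contiguously per NUMA node, which does not hold on every machine. For OpenMP programs the same effect can be obtained with the OMP_PROC_BIND and OMP_PLACES environment variables.

#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

// Pin the calling thread to a core chosen by the placement strategy.
// Sparse: round-robin over NUMA nodes to spread memory bandwidth demand.
// Dense: fill consecutive cores of one node before using the next.
void pin_self(unsigned tid, unsigned total_cores, unsigned cores_per_node, bool sparse) {
    unsigned nodes = total_cores / cores_per_node;
    unsigned core = sparse ? (tid % nodes) * cores_per_node + (tid / nodes)
                           : tid;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core % total_cores, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main() {
    // Example: 8 worker threads on a 16-core machine with 4 NUMA nodes.
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < 8; ++t)
        pool.emplace_back([t] { pin_self(t, 16, 4, /*sparse=*/true); /* ...workload... */ });
    for (auto& th : pool) th.join();
}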

5.4.3 Memory Placement Policies

Memory pages are not always accessed from the same threads that allocated them. Memory placement policies are used to control the location of memory pages in relation to the NUMA topology. As a general rule of thumb, data should be on the same node as the thread that processes it and sharing should be kept to a minimum. However, too much consolidation can lead to congestion of the interconnects and contention on the memory controllers. The numactl tool applies a memory placement policy to a process, which is then inherited by all its children (threads). We evaluate the following policies: First Touch, Interleave, Localalloc, and Preferred. We also use hardware counters to measure the ratio of local to total (local + remote) memory accesses. Modern Linux systems employ a memory placement policy called First Touch. In First Touch, each memory page is allocated to the first node that performs a read or write operation on it. If the selected node does not have sufficient free memory, an adjacent node is used. This is the most popular memory placement policy and represents the default configuration for most Linux distributions. Interleave places memory pages on all NUMA nodes in a round-robin fashion. In some prior works, memory interleaving was used to spread a shared hash table across all available NUMA nodes [14, 89, 94]. In Localalloc, the memory pages are placed on the same NUMA node as the thread performing the allocation. The Preferred policy places all newly allocated memory pages on a single node that is selected by the user. This policy will fall back to using other nodes for allocation when the selected node has run out of free space and cannot fulfill the allocation.
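For code we control, the same policies can be requested directly through the libnuma API rather than through numactl. The following is a small sketch (link with -lnuma); the 1 GiB allocation size and the node number 0 are arbitrary values chosen for the example.

#include <numa.h>     // libnuma; link with -lnuma
#include <cstddef>
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "NUMA support is not available\n");
        return 1;
    }
    const std::size_t bytes = std::size_t(1) << 30;     // 1 GiB, arbitrary example size

    void* interleaved = numa_alloc_interleaved(bytes);  // pages spread round-robin over all nodes
    void* local       = numa_alloc_local(bytes);        // pages on the allocating thread's node
    void* on_node     = numa_alloc_onnode(bytes, 0);    // pages on a specific node (node 0)

    /* ... run the workload on these buffers ... */

    numa_free(interleaved, bytes);
    numa_free(local, bytes);
    numa_free(on_node, bytes);
    return 0;
}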

5.4.4 Operating System Configuration

In this section, we outline two key operating system mechanisms that affect NUMA applications: Virtual Memory Page Management (Transparent Hugepages), and

Load Balancing Schedulers (AutoNUMA). These mechanisms are enabled out-of-the-box on most Linux distributions.

5.4.4.1 Virtual Memory Page Management

OS memory management works at the virtual page level. Pages represent chunks of memory, and their size determines the granularity at which memory is tracked and managed. Most Linux systems use a default memory page size of 4KB in order to minimize wasted space. The CPU's TLB caches can only hold a limited number of page entries. When the page size is larger, each TLB entry spans a greater memory area. Although the TLB capacity is even smaller for large entries, the total volume of cached memory space is increased. As a result, larger page sizes may reduce the occurrence of TLB misses. Transparent Hugepages (THP) is an abstraction layer that automates the process of creating large memory pages from smaller pages. THP is not to be confused with Hugepages, which depends on the application explicitly interfacing with it and is usually disabled by default. We use the global THP toggles on our Linux machines to configure its behavior.
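Our experiments toggle THP globally through sysfs, but an application can also opt a specific mapping in or out of THP with madvise. The snippet below is an illustrative sketch of that per-mapping alternative rather than the mechanism used in our evaluation.

#include <sys/mman.h>
#include <cstddef>

// Map an anonymous region and advise the kernel whether it should be
// backed by transparent hugepages.
void* map_region(std::size_t bytes, bool want_hugepages) {
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return nullptr;
    madvise(p, bytes, want_hugepages ? MADV_HUGEPAGE : MADV_NOHUGEPAGE);
    return p;
}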

5.4.4.2 Automatic NUMA Load Balancing

There have been several projects to develop NUMA-aware schedulers that facilitate automatic load balancing. Among these projects, Dino [28] and AsymSched [98] do not provide any source code, and Numad [143] is designed for multi-process load balancing. Carrefour [39] provides public source code, but requires an AMD CPU based on the K10 architecture (with instruction-based sampling), as well as a modified operating system kernel. Consequently, we opted to evaluate the AutoNUMA scheduler, which is open-source and supports all hardware architectures. AutoNUMA was initially developed by Red Hat and later on merged with the Linux kernel. It attempts to maximize data and thread co-location by migrating memory pages and

threads. AutoNUMA has two key limitations: 1) workloads that utilize data sharing can be mishandled due to the unnecessary migration of memory pages between nodes, and 2) it does not factor in the cost of migration or contention, and thus aims to improve locality at any cost. AutoNUMA has received continuous updates, and is considered to be one of the most well-rounded kernel-based NUMA schedulers. We use the numa_balancing kernel parameter to toggle the scheduler.
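Toggling the balancer amounts to writing the kernel.numa_balancing sysctl. For completeness, a minimal C++ sketch that writes the parameter directly is shown below; it requires superuser privileges to succeed, and the function name is our own.

#include <fstream>
#include <iostream>

// Equivalent to: sysctl kernel.numa_balancing=0 (requires root).
bool set_autonuma(bool enabled) {
    std::ofstream f("/proc/sys/kernel/numa_balancing");
    if (!f) return false;   // no permission, or kernel built without AutoNUMA
    f << (enabled ? 1 : 0);
    return static_cast<bool>(f);
}

int main() {
    if (!set_autonuma(false))
        std::cerr << "could not disable AutoNUMA load balancing\n";
}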

5.5 Evaluation

In this section we describe our setup and evaluate the effectiveness of our strategies. In Section 5.5.1 we outline the hardware/software specifications of our machines. Section 5.5.2 describes the datasets, implementations, and systems used. We analyze the impact of the OS configuration in Section 5.5.3. In Section 5.5.5 we evaluate these techniques on database engines running TPC-H queries. We explore the effects of overriding the default system memory allocator in Section 5.5.4. Finally, we summarize our findings in Section 5.5.6.

5.5.1 Experimental Setup

We run our experiments on three different machines based on different architectures. This is done to ensure that the applicability of our findings is not biased to a particular system's characteristics. The NUMA topologies of these machines are depicted in Figure 5.1 and their specifications are outlined in Table 5.3. We used LIKWID [66] to measure each system's relative memory access latencies, and the remainder of the specifications were obtained using product pages, spec sheets, and Linux system queries. Now we outline some of the key hardware specifications for each machine. Machine A is an eight-socket AMD-based server with a total of 128GB of memory. As the only machine with eight NUMA nodes, Machine A provides us

Table 5.3: NUMA Machine Specifications

System | Machine A | Machine B | Machine C
CPUs | 8×Opteron 8220 | 4×Xeon E7520 | 4×Xeon E7-4850 v4
CPU Frequency | 2.8GHz | 2.1GHz | 2.1GHz
Architecture | AMD Santa Rosa | Intel Nehalem | Intel Broadwell
Physical/Logical Cores | 16/16 | 16/32 | 32/64
Last Level Cache | 2MB | 18MB | 40MB
4KB TLB Capacity | L1: 32×4KB, L2: 512×4KB | L1: 64×4KB, L2: 512×4KB | L1: 64×4KB, L2: 1536×4KB
2MB TLB Capacity | L1: 8×2MB | L1: 32×2MB | L1: 32×2MB, L2: 1536×2MB
NUMA Nodes | 8 | 4 | 4
NUMA Topology | Enhanced Twisted Ladder | Fully Connected | Fully Connected
Relative NUMA Node Memory Latency | Local: 1.0, 1 hop: 1.2, 2 hop: 1.4, 3 hop: 1.6 | Local: 1.0, 1 hop: 1.1 | Local: 1.0, 1 hop: 2.1
Interconnect Bandwidth | 2GT/s | 4.8GT/s | 8GT/s
Memory Capacity | 16GB/node, 128GB total | 16GB/node, 64GB total | 768GB/node, 3TB total
Memory Clock | 800MHz | 1600MHz | 2400MHz
Operating System | Ubuntu 16.04 | Ubuntu 18.04 | CentOS 7.5
Linux Kernel | 4.4 | 4.15 | 3.10

with an opportunity to study NUMA effects on a larger scale. The twisted ladder topology shown in Figure 5.1a is designed to minimize inter-node latency with three HyperTransport interconnect links per node. As a result, Machine A has three remote memory access latencies, depending on the number of hops between the source and the destination. Each node contains an AMD Opteron 8220 CPU running at 2.8GHz and 16GB of memory. Machine B is a quad-socket Intel server with four NUMA nodes and a total memory capacity of 64GB. The NUMA nodes are fully connected, and each node consists of an Intel Xeon E7520 CPU running at 1.87GHz with 16GB of memory. Lastly, Machine C contains four sockets populated with Intel Xeon E7-4850 v4 processors. Each processor constitutes a NUMA node with 768GB of memory, providing a total system memory capacity of 3TB. The NUMA nodes of this machine are fully connected. Our experiments are coded in C++ and compiled using GCC 7.3.0 with the -O3 and -march=native flags. As mentioned in Chapter 4, the flags enable the compiler's highest optimization level and target the native architecture of our machines. Likewise, all dynamic memory allocators and database systems are compiled from source. Machines B and C are owned and maintained by external parties and are based on different Linux distributions. The experiments are configured to utilize all available hardware threads on each machine.

5.5.2 Datasets and Implementation Details

In this section, we outline the datasets and code used for the experiments. We use well-known synthetic datasets outlined in prior work as the basis for all of our experiments [29, 35, 32]. Unless otherwise noted, all workloads operate on datasets that are stored in memory-resident data structures, hence avoiding any I/O bottlenecks. The aggregation workloads (W1 and W2) evaluate a typical hash-based aggregation query, based on a state-of-the-art concurrent hash table [102], which is implemented as a shared global hash table [113].

Table 5.4: Experiment Parameters (bolded values are system defaults)

Parameter | Values
Experiment Workload | W1) Holistic Aggregation [113]; W2) Distributive Aggregation [113]; W3) Hash Join [29]; W4) Index Nested Loop Join using: 1) ART [96], 2) Masstree [108], 3) B+tree [26], 4) Skip List [173]; W5) TPC-H Queries (Q1 to Q22) [35]
Thread Placement Strategy | None (OS scheduler is free to migrate threads), Sparse, Dense
Memory Placement Policy | First Touch, Interleave, Localalloc, Preferred
Memory Allocator | ptmalloc, jemalloc, tcmalloc, Hoard, tbbmalloc
Dataset Distribution | Moving Cluster (default for W1), Sequential (default for W3 and W4), Zipfian (default for W2), TPC-H (W5)
Database System (W5) | MonetDB [117], PostgreSQL [137], MySQL [127], DBMSx, Quickstep [131]
OS Configuration | AutoNUMA on/off, Transparent Hugepages (THP) on/off
Hardware System | Machine A, Machine B, Machine C

The datasets used for the aggregation workloads are based on three different data distributions: Moving Cluster (default), Sequential, and Zipfian. Each dataset consists of 100 million records with a group-by cardinality of one million. In the Moving Cluster dataset, the keys are chosen from a window that gradually slides. The Moving Cluster dataset provides a gradual shift in data locality that is similar to workloads encountered in streaming or spatial applications. In the Sequential dataset, we generate a series of segments that contain multiple number sequences. The number of segments is equal to the group-by cardinality, and the number of records in each segment is equal to the dataset size divided by the cardinality. This dataset mimics transactional data where the key incrementally increases. In the Zipfian dataset, the distribution of the keys approximates Zipf's law [138]. We first generate a Zipfian sequence with the desired cardinality c and Zipf exponent e = 0.5. Then we take n random samples from this sequence to build n records. The Zipfian distribution is used to model many big data phenomena, such as word frequency, website traffic, and city population.
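One way to realize the Zipfian key generation described above is sketched below; the fixed seed and the helper name are arbitrary choices for the example, and the generator actually used in our experiments may differ in its details.

#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

// Generate n record keys whose frequencies approximate Zipf's law with
// group-by cardinality c and exponent e: P(rank k) is proportional to 1 / k^e.
std::vector<uint64_t> zipf_keys(std::size_t n, std::size_t c, double e) {
    std::vector<double> weights(c);
    for (std::size_t k = 0; k < c; ++k)
        weights[k] = 1.0 / std::pow(static_cast<double>(k + 1), e);

    std::mt19937_64 gen(42);  // arbitrary fixed seed for reproducibility
    std::discrete_distribution<std::size_t> rank(weights.begin(), weights.end());

    std::vector<uint64_t> keys(n);
    for (auto& key : keys)
        key = static_cast<uint64_t>(rank(gen));   // sampled group key
    return keys;
}

// Example: 100 million records with a cardinality of one million and e = 0.5.
// auto keys = zipf_keys(100000000, 1000000, 0.5);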

The join workloads (W3 and W4) evaluate a typical join query involving two tables. W3 is a non-partitioning hash join based on the code and dataset from [29]. The dataset contains two tables sized at 16M and 256M tuples, and is designed to simulate a decision support system. W4 is an index nested loop join that uses the same dataset as W3. We evaluated several in-memory indexes for this workload: ART [96], Masstree [108], B+tree [26], and Skip List [173]. ART [96] is based on the concept of a Radix tree. Masstree [108] is a hybrid index that uses a trie of B+trees to store keys. B+tree [26] is a cache-optimized in-memory B+tree. Skip List is a canonical implementation of a Skip List [173]. We use the TPC-H workload (W5) to investigate how our strategies can benefit database systems. This entails some limitations, as databases are complex systems with less flexibility compared to microbenchmarks and codelets. Although there are many available database systems that are TPC-H compliant, we note that comparing an extensive variety of systems is beyond the scope of this work. We evaluate W5 on the MonetDB [117] (version 11.33.3), PostgreSQL [137] (version 11.4), MySQL [127] (version 8.0.17), DBMSx, and Quickstep [131] (latest GitHub version as of October 2019) database systems. MonetDB is an open-source columnar store that uses memory-mapped files with demand paging and multiple worker threads for its query processing. PostgreSQL is a widely-used open-source row store that supports intra-query parallelism using multiple worker processes and a shared memory pool for communication. We configured PostgreSQL with a 42GB buffer pool based on recommendations in the documentation [62]. MySQL is an open-source row store that remains highly popular. DBMSx is a commercial hybrid row/column-store with a parallel in-memory query execution engine. Quickstep is an open-source hybrid store with a focus on single-node parallelism and in-memory analytical workloads.

W5 uses version 2.18 of the TPC-H dataset specifications. The dataset is designed to mimic a decision support system with eight tables, and is paired with a set of 22 queries which answer typical business questions. Our experiments involve evaluating the impact of the OS configuration on each system, using all the queries on a dataset scale factor of 20. We then explore the effect of different memory placement policies. Finally, we use Queries 5 and 18 to show the impact of utilizing different memory allocators, as both queries involve a combination of joins and aggregations. The experimental parameters are shown in Table 5.4. We use the maximum number of hardware threads supported by each machine. In W1-W4, we measure the average workload execution time of five runs, using the timer from [29], which we selected for its high precision. We also note that the variation between runs is less than 3%. In W5, we use the built-in query timing features of each database system. We prefer using the built-in timers because it allows us to precisely measure the runtime for individual queries rather than just the set of queries as a whole.

5.5.3 Operating System Configuration Experiments

In this section, we evaluate three key OS mechanisms that affect NUMA behavior: NUMA Load Balancing (AutoNUMA), Transparent Hugepages (THP), and the system's memory placement policy. The experiments demonstrate each parameter's effect on query performance. We also examine how these variables are affected by other experiment parameters, such as hardware architecture, and the interaction between THP and memory allocators.

5.5.3.1 AutoNUMA Load Balancing Experiments

In Figures 5.5a and 5.5b, we evaluate W1 and toggle the state of AutoNUMA Load Balancing between On (the system default) and Off. The results in Figure 5.5a show that AutoNUMA slows down the runtime for the First Touch, Interleave, and Localalloc memory placement policies. In most cases, AutoNUMA's overhead dominates any performance gained by migrating threads and memory pages. The best runtime is obtained by applying the Interleave policy and disabling AutoNUMA. If AutoNUMA is enabled, the best approach is to apply the Interleave policy, which may be useful for scenarios where superuser access is unavailable. We observed similar behavior for the other workloads and machines. AutoNUMA had a significantly detrimental effect on runtime. The best overall approach is to use memory interleaving and disable AutoNUMA. We also measure the Local Access Ratio (LAR) [39], which is shown in Figure 5.5b and specifies the ratio of memory accesses that were satisfied with local memory compared to all memory accesses. AutoNUMA attempts to increase LAR without considering other costs, such as moving threads and memory, or memory controller contention. Due to this, the First Touch policy with AutoNUMA enabled (system default) is 86% slower than Interleave without AutoNUMA, despite a higher LAR measurement. In summary, we obtain significant speedups using a modified OS configuration, and note that LAR is not an accurate predictor of performance on NUMA systems.

Figure 5.5: Evaluating and profiling AutoNUMA load balancing - W1 - (a) AutoNUMA effect on execution time - Machine A, (b) AutoNUMA effect on Local Access Ratio - Machine A

Figure 5.6: Evaluating effect of AutoNUMA and THP on memory placement policies and allocators - W1 - (a) Impact of THP on memory allocators - Machine A, (b) Combined effect of AutoNUMA and THP on different memory placement policies - variable machine

5.5.3.2 Transparent Hugepages Experiments

Next, we evaluate the effect of the Transparent Hugepages (THP) configuration, which automatically merges groups of 4KB memory pages into 2MB memory pages. As shown in Figure 5.6a, THP's impact on the workload execution time ranges from detrimental in most cases to negligible in other cases. As THP alters the composition of the operating system's memory pages, support for THP within the memory allocators is the defining factor in whether it is detrimental to performance. tcmalloc, jemalloc, and tbbmalloc do not currently handle THP well. We hope that future versions of these memory allocators will rectify this issue out-of-the-box. Although most Linux distributions enable THP by default, our results indicate that it is better to disable THP for high performance data analytics.

5.5.3.3 Hardware Architecture Experiments

Here we show how the performance of data analytics applications running on different machines with different hardware architectures is affected by the memory placement strategies. For all machines, the default configuration uses the First Touch memory placement, and both AutoNUMA and THP are enabled. The results depicted in

Figure 5.6b show that Machine A is slower than Machine B when both machines are using the default configuration. However, using the Interleave memory placement policy and disabling the operating system switches allows Machine A to outperform Machine B by up to 15%. Machine A shows the most significant improvement from operating system and memory placement policy changes, and the workload runtime is reduced by up to 46%. The runtime for Machine C is reduced by up to 21%. The performance improvement on Machine B is around 7%, which is fairly modest compared to the other machines. Although Machines B and C have a similar inter-socket topology, the relative local and remote memory access latencies are much closer in Machine B (see Table 5.3). Henceforth, we keep AutoNUMA and THP disabled for our experiments, unless otherwise noted.

5.5.4 Memory Allocator Experiments

In Section 5.4.1.9, we used a memory allocator microbenchmark to show that there are significant differences in both multi-threaded scalability and memory consumption overhead. In this section, we explore the performance impact of overriding the system default memory allocator on four in-memory data analytics workloads.

5.5.4.1 Hashtable-based Experimental Workloads

In Figure 5.7, we show our results for the holistic aggregation (W1), distributive aggregation (W2), and hash join (W3) workloads running on each of the machines. In addition to the memory allocators, we vary the memory placement policies for each workload. The results show significant runtime reductions on all three machines, particularly when using tbbmalloc in conjunction with the Interleave memory placement policy. The holistic aggregation workload (W1), shown in Figures 5.7a to 5.7c, extensively uses memory allocation during its runtime to store the tuples for each group and calculate their aggregate value. Utilizing tbbmalloc reduced the runtime of W1 by up to 62% on Machine A, 83% on Machine B, and 72% on Machine C, compared to the default allocator (ptmalloc).

Figure 5.7: Comparison of memory allocators - variable workload, memory placement policy, and machine - (a)-(c) W1 on Machines A, B, C; (d)-(f) W2 on Machines A, B, C; (g)-(i) W3 on Machines A, B, C

114 5 25 25

4 20 20

3 15 15

2 10 10

1 5 5

0 0 0 Runtime CPU CPU Cycles Runtime(Billions) Runtime CPU CPU Cycles Runtime(Billions) CPU Cycles Runtime(Billions)

Memory Allocator Memory Allocator Memory Allocator (a) Moving Cluster (b) Sequential (c) Zipf

Figure 5.8: W1 - Machine A - Effect of dataset distribution - (a) Moving Cluster, (b) Sequential, (c) Zipf

The results for the join query (W3), depicted in Figures 5.7g to 5.7i, also show significant improvements, with tbbmalloc reducing workload execution time by 70% on Machine A, 94% on Machine B, and 92% on Machine C. The distributive aggregation query (W2), shown in Figures 5.7d to 5.7f, speeds up by 44%, 27%, and 28% on Machines A, B, and C respectively. This speedup is almost entirely due to the Interleave memory placement policy. Although W2 is not allocation-heavy and does not gain much benefit from a faster memory allocator, it can still be accelerated using a more efficient memory placement policy.

5.5.4.2 Impact of Dataset Distribution

The performance of query workloads and memory allocators can be sensitive to the access patterns induced by the dataset distribution. The three tested datasets have the same number of records, but differ in the way the record keys are distributed (see Section 5.5.2 for more information). In our previous experiments, we used the Moving Cluster dataset as the default dataset for W1. In Figure 5.8, we vary the dataset distribution to investigate its impact on different memory allocators. The results show that tbbmalloc continues to produce the largest speedups on both the


Zipf and Sequential datasets. We also observe this trend on Machines B and C, but omit them due to space constraints.

Figure 5.9: Index nested loop join workload (W4) - variable memory allocators and memory placements - Machine A - (a) ART join times, (b) Masstree join times, (c) B+tree join times, (d) Skip List join times, (e) index build times (best configuration), (f) index join times (best configuration)

5.5.4.3 Effect on In-memory Indexing

In W4, we investigate index nested loop join query processing with different in-memory indexes. The type of index used to accelerate the nested loop join workload (W4) plays a key role in determining its speed. We evaluate four in-memory indexes: ART [96], Masstree [108], B+tree [26], and Skip List [173]. As the index is pre-built, the workload is relatively light in terms of the number of memory allocations during the join phase, hence factors such as scan/lookup times, materialization, and locality

play a greater role. For each index, we vary the memory allocator and memory placement policy and measure the join time. The results, depicted in Figures 5.9a to 5.9c, show that runtime can be significantly improved for most of the tested indexes. In Figure 5.9a, we show that ART's join time can be substantially improved using the jemalloc or tbbmalloc allocators. A key characteristic of ART is that it uses variable node sizes and a variety of compression techniques for its trie, thus requesting a greater variety of size classes from the memory allocator, compared to the other indexes. In Figures 5.9b and 5.9c, Masstree and B+tree show a notable improvement with the Hoard allocator. Both indexes rely on grouping many keys per node, which is favorable for Hoard's straightforward global heap approach. Skip List breaks the trend as the only index that runs fastest with ptmalloc. Finally, we summarize the results in Figures 5.9e and 5.9f, which respectively depict each index's build and join times using their fastest configuration. The results show that our techniques were effective in speeding up the two fastest indexes (ART and B+tree) despite their inherent lack of NUMA-awareness. We were able to reduce W4 runtimes by 18.5% and 77% for B+tree and ART respectively.

5.5.5 Database Engine Experiments

In this section, we evaluate the TPC-H workload (W5) on five database systems: MonetDB, PostgreSQL, MySQL, DBMSx, and Quickstep. We use these five systems to explore how database system performance can be improved using our strategies. Investigating NUMA strategies on database systems is more challenging compared to in-memory microbenchmark workloads, as there is considerably more complexity involved in storing and loading the data, and great care must be taken to ensure that disk I/O and caching do not skew the results. To ensure fair and consistent results, we clear the page cache using the /proc/sys/vm/drop_caches command before running each query, disregard the first (cold) run, and measure the mean runtime for five additional runs.

117 4.3 14.1 30.1 7.2 11 10.3 4.5 22 Q -1.7 -1.0 Q 6.1 2.2

2.6 9.8 20.5 3.2 disabled, 10 2.3 5.6 21 Q 27.6 2.4 Q Quickstep Quickstep 17.5 24.0 cy reduction - 4.5 8.1

41.7 9 9.2 20

49.2 Q -0.1 Q used the following 1.8 -0.2 TPC-H queries for er already handling

10.5 18.6 olicies. We observed hich use joins instead to the standalone in- AutoNUMA nd memory allocators.

7.6 8.6 the original versions of ments, we evaluate the ptimal plans for queries

19.3 3.8 ned that the parameters 8 19

18.4 Q 19.9 Q memory allocator. DBMSx DBMSx 5.6 4.2 15.5 8.1

1.1 4.0 43.1 7.7 7 18

37.2 Q 0.6 Q ) yielded the best performance for 22

3.6 11 -3.4 26.9 42.6 tbbmalloc

40.6 0.2 to Q 33.3 to Q 7.0 17 6 1 MySQL MySQL 12 Q 2.8 Q -2.3 6.5 -4.5

3.1 31.1 memory placement, Query Number Query Query Query Number first touch

-0.7 0.9 118 42.9 24.8 16 5

19.4 Q Q 11.8 2.3 -1.0 10.5 12.1 (a) TPC-H Q

0.7 25.1 (b) TPC-H Q 43.4 25.0 4 15

5.8 Q 2.3 Q First Touch PostgreSQL PostgreSQL 4.3 3.1 9.0 4.6

5.0 1.8 33.0 18.7 3 14

19.4 Q 1.9 Q 2.2 3.4 5.2 8.8

13.8 2.6 14.3 4.9 2 13 Q

34.7 17.2 Q MonetDB MonetDB 0.3 -1.1 35.7 10.0

1.0 3.0 23.5 5.1 1 12 1 2 3 4 5 6 7 8 91011 12 13 14 15 16 17 18 19 20 21 22 0.1 Q -1.7 Q 1.5 4.3

14.0 2.7

Query Latency Reduction Latency Query 50% 40% 30% Reduction 20% Latency 10% Query 50% 40% 30% 20% 10% -10.% -10.% In Figure 5.10, we present the speedups obtained across all 22 THP disabled (for all except DBMSx), and the the database systems. This is due tomemory the placement database between Buffer the Manag workerparameters threads/processes. to speed We up W5: Figure 5.10: 22Variable database TPC-H systems queries - (W5) Machine A scale factor 20five - additional runs. Query laten In a similarimpact vein of to the our OS configuration, previous memory experi placementDue policies, to a an issue causing PostgreSQL to produce severely17 sub-o and 22, we evaluate modified versions ofof these the two regular queries nested w queries. All otherqueries database 17 systems and run 22. Through empirical evaluation, wethat determi work best for the databasememory systems queries, with are the mostly exception identical of thethat memory the placement system’s p default memory policy ( 2.5 12 10 2 8 1.5 6 1 4 Query Latency(s)Query Query Latency (s) LatencyQuery 0.5 2


Figure 5.11: Effect of memory allocator on TPC-H query latency - MonetDB - Machine A - (a) TPC-H Q5, (b) TPC-H Q18

The results are expressed as the percentage reduction in query latency compared to the default configuration, with positive values indicating a speedup and negative values indicating a slowdown. The results show that MonetDB's query latencies improved by up to 43%, with an average improvement of 14.5%. In comparison, the gains for PostgreSQL are less consistent. Query latency improved by up to 27.6%, but the average improvement is 3% and seven queries take slightly longer to complete. We believe these variances are due to PostgreSQL's rigid multi-process query processing approach, which sometimes opts to use only one worker process and thus fails to fully utilize the hardware. MySQL's query latency is reduced by up to 49% with an average reduction of 12%. We observe that DBMSx query latency is improved by up to 43% with an average of 21%. Lastly, Quickstep query latency speeds up by up to 40% with an average of 7%. Overall, all five database systems obtained noteworthy speedups from modifying the default OS configuration. Next, we investigate the effect of memory allocator overriding on MonetDB. To do so, we select queries 5 and 18 due to their usage of both joins and aggregation.

The results, shown in Figure 5.11, indicate that tbbmalloc can provide an average query latency reduction of up to 11% for Query 5, and 20% for Query 18, compared to ptmalloc. These results once again reflect that the default and most commonly used memory allocator can be replaced to obtain speedups, even in software as complex as a database system.

5.5.6 Results Summary

The strategies explored in this chapter, when carefully applied, can significantly speed up query processing workloads without the need for source code modification. The effectiveness and applicability of these strategies to a workload depend on several factors. Figure 5.12 shows a strategic plan for practitioners. The flowchart outlines a systematic guide to improving performance on NUMA systems, along with some general recommendations. We base these recommendations on our extensive experimental evaluation using multiple machine architectures and workloads. Starting with thread management, we showed that thread affinitization can be critical for NUMA systems, but more importantly how a Sparse placement approach can maximize performance in situations that are memory-bandwidth-bound. We then showed that the default OS configuration can have a significant detrimental effect on query performance. The overhead of AutoNUMA and THP was demonstrated to be too costly for high performance data analytics workloads. Although superuser privileges are required to modify AutoNUMA and THP, we observed that optimizing the memory placement policy (such as using Interleave) can mostly mitigate their negative impact. We also investigated dynamic memory allocators using a microbenchmark. The microbenchmark results showed that there are considerable differences between the allocators, both in terms of scalability and efficiency. In our evaluation, we demonstrated that these differences translate into real gains in analytical query processing workloads, although the performance gains depend on the way the workloads


allocate memory. For example, allocation-heavy workloads, such as the hash join (W3), benefited the most, whereas the index nested loop join (W4) exhibited smaller gains due to the prebuilt index. Although we have shown that tbbmalloc frequently outperformed its competitors on different machines and workloads, we recommend experimentation with new/updated memory allocators before selecting a solution.

Figure 5.12: Application-agnostic Decision Flowchart

5.6 Chapter Summary

In this chapter, we have outlined and investigated several application-agnostic strategies to speed up query processing on NUMA machines. Our experiments on five analytics workloads have shown that it is possible to obtain significant speedups by utilizing these strategies.

We also demonstrated that current operating system default configurations are generally sub-optimal for in-memory data analytics. Our results, surprisingly, indicate that many elements of the default OS environment, such as AutoNUMA, Transparent Hugepages, the default memory allocator (e.g., ptmalloc), and the OS thread scheduler, should be disabled or customized for high-performance analytical query processing, regardless of the hardware generation. We have also demonstrated that memory allocator performance on NUMA systems can be a major bottleneck and that this under-appreciated topic is ripe for investigation. We obtained large speedups for our query processing workloads by overriding the default dynamic memory allocator with alternatives such as tbbmalloc. The main contributions of this chapter are as follows:

• Categorization and analysis of strategies to improve application performance on NUMA systems

• The first study on NUMA systems (to our knowledge) that explores the combined impact of different memory allocators, thread and memory placement policies, and OS-level configurations, on different analytics workloads

• Extensive experimental evaluation, involving different workloads, indexes and database systems on different machine architectures and topologies, with profiling and performance counters, and microbenchmarks

• A decision flowchart (Figure 5.12) to help practitioners speed up query processing on NUMA systems with minimal code modifications

As our approach does not target a specific NUMA topology, we have shown that our findings can be applied to systems with different architectures. As hardware architectures continue to advance towards greater parallelism and greater levels of memory access partitioning, we hope our results and decision flowchart can help practitioners to accelerate data analytics.

Chapter 6

Distributed In-memory kNN Query Processing

In previous chapters, we explored in-memory query processing in the context of multi-threaded (single machine) applications. In this chapter, we present the design and implementation of a distributed in-memory data system that we developed to perform spatio-temporal queries using a cluster of machines. Location-based services (LBS) and other geospatial applications produce large volumes of spatio-temporal data at high velocity. A common example is an application that tracks the location of vehicles over time. In addition to the growing number of vehicles, every measurement of the spatial locations (at regular time intervals) generates a large number of new records. Efficiently storing and querying this data is challenging due to the data rate and volume, as well as the multi-dimensional nature of both the data and the queries. As part of a collaborative research effort on high performance query processing, we developed a distributed spatio-temporal data system called DISTIL [132], which supported concurrent location updates and spatio-temporal range queries. In this chapter, we present DISTIL+, an extension of the DISTIL system that we developed in order to support fast spatio-temporal kNN queries.

Note: Parts of this chapter were previously published in [132, 111].

We show how DISTIL+ uses a multi-level in-memory index to efficiently execute kNN queries in a distributed manner. The remainder of this chapter is organized as follows. In Section 6.1, we outline our key motivations for building DISTIL+. Section 6.2 presents the related work. In Section 6.3, we present the problem statement and design considerations. We describe the design and implementation of our system in Section 6.4 and provide details on the system's kNN query processing in Section 6.5. In Section 6.6, we present our experimental evaluation of DISTIL+ and compare it with other systems. Finally, we summarize and conclude the chapter in Section 6.7.

6.1 Motivation

With recent advances in microprocessors, remote sensing and sensor technologies, and the rise of geo-spatial web applications, the growth of spatial data is rapidly accelerating. A significant portion of spatial data is spatio-temporal in nature, the growth of which is driven by the proliferation of GPS-enabled mobile devices and sensors. As a result, Location-Based Services (LBS) have become more pervasive. A wide range of LBS applications, including location-aware search, location-based games, advertising, and weather services, have become a part of our daily lives. These applications are contributing to the generation of vast quantities of spatio-temporal data from a variety of sources. The key characteristics of LBS applications include high-throughput location updates and low-latency processing of different types of location-oriented queries. With the continuous rise of big spatial data volume, the impetus for scalable processing of such data is growing. Distributed data systems have been proposed to provide horizontal scalability on large multi-dimensional datasets. In an effort to support spatial data processing, a number of systems have been developed.

Among these, SpatialHadoop [44], Hadoop-GIS [79], and ST-Hadoop [3] are based on the open-source MapReduce framework Hadoop [65]. Hadoop focuses on fault tolerance and I/O operations. Due to technological advances in computing hardware, large main-memory machines with multiple cores and low-latency networking are becoming quite common in modern data centers. These trends have made distributed in-memory processing of big data attractive. As a result, the in-memory cluster computing framework Spark [190] has become very popular. Recently, several Spark-based spatial data processing systems have been proposed. These include GeoSpark [188], SIMBA [181], SpatialSpark [186], and LocationSpark [162]. However, none of these systems support a high rate of updates, as most of them assume immutable spatial data. As a result, these distributed in-memory spatial big data frameworks are not suitable for LBS applications. Some researchers have proposed distributed spatio-temporal systems built on top of the HBase [52] data store. MD-HBase [125] and LC-Index (Local and Clustering Index) [47] are two notable examples. The main drawback of these systems is the lack of a distributed memory-based architecture, as HBase relies on a disk-based architecture. This has implications for performance. Due to the rapidly rising volume and high velocity of spatio-temporal data, horizontal (rather than vertical) scaling is needed to achieve scalability. However, achieving good throughput and latency in a distributed data processing system is challenging. Moreover, efficient distributed processing of spatio-temporal queries, such as range and k-nearest neighbor (kNN) queries, can be hard, as they involve large volumes of data, multiple dimensions, and typically entail data skew. In this chapter, we aim to address these challenges with a distributed in-memory system. Currently, Spark is the most popular distributed in-memory framework. Despite this, researchers [6] have reported that applications developed with Spark can be an order of magnitude or more slower than native implementations written with high-performance computing tools and programming models.

Similar observations have been made by other researchers, such as [110]. Spark's overheads, associated with distributed coordination and data movement, are partly responsible for this. To avoid these issues, we investigated several programming frameworks, and among them the APGAS (Asynchronous Partitioned Global Address Space) paradigm [147] was the most promising. In APGAS, a computation and its related data are logically organized into places, where the computation is structured as lightweight asynchronous tasks called activities [165].

Figure 6.1: APGAS Programming Model

6.1.1 APGAS and the X10 Language

APGAS is a distributed shared memory concurrency model that allows multiple threads of execution to share a common address space spread across a cluster of nodes. It also includes built-in support for fault-tolerant and elastic computing [164]. A place encapsulates the idea of a portion of the address space that contains data and the threads operating on that data. Activities are tasks that can be performed synchronously or asynchronously in the associated places. The concept of an asynchronous activity being performed at a place is denoted by async. APGAS offers several constructs to coordinate activities. For instance, finish enables a parent to wait for the completion of all of its child activities, and conditional atomics can be used to define conditional critical regions. APGAS also includes parallel for loop constructs to support data-parallelism.

For example, foreach and ateach provide local and distributed parallelism, respectively. X10 [180] is an object-oriented programming language built on the APGAS model. An X10 program is translated into either C++ (native X10) or Java (managed X10) source code. Managed X10 transforms the application into Java classes. For the managed runtime, X10 offers a distributed heap appropriate for parallel and distributed computing. The heap is partitioned across multiple places (nodes). Lightweight threads (activities) are spawned asynchronously or synchronously and access objects on their local or remote heap (shown in Figure 6.1). X10 supports distributed arrays that split data across several places, and can facilitate references between places through a Global Reference [180]. Our system uses these features to distribute data across multiple nodes, and to process and synchronize inter-node activities.
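To make the finish/async coordination pattern concrete, the following is a minimal single-process Java sketch that mimics its semantics with a standard thread pool. It is an analogy only: it does not use the X10 runtime or any APGAS library, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FinishAsyncSketch {
    // A fixed pool of worker threads stands in for the activities running at one place.
    private static final ExecutorService pool = Executors.newFixedThreadPool(4);

    // Analogue of "finish { async S1; async S2; ... }": spawn tasks, then wait for all of them.
    static void finishAll(List<Runnable> activities) throws Exception {
        List<Future<?>> pending = new ArrayList<>();
        for (Runnable activity : activities) {
            pending.add(pool.submit(activity));   // async: schedule a child activity
        }
        for (Future<?> child : pending) {
            child.get();                          // finish: the parent blocks until every child completes
        }
    }

    public static void main(String[] args) throws Exception {
        finishAll(List.of(
                () -> System.out.println("activity 1 on " + Thread.currentThread().getName()),
                () -> System.out.println("activity 2 on " + Thread.currentThread().getName())));
        System.out.println("all child activities completed");
        pool.shutdown();
    }
}
```

In X10 itself the same pattern is written directly as finish { async S; }, and the runtime, rather than an explicit pool, schedules the activities across places.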

6.2 Related Work

With the continuous rise of big spatial data volume, the impetus for scalable processing of such data is growing. In response to this, a number of spatial systems have been developed. Similarly, the growing popularity of Location-based services (LBS) has spurred research on spatio-temporal indexing and queries. We now explore some of the work that is relevant to our research. We divide the related work into three main categories: big spatial data processing systems, distributed spatio-temporal indexes, and distributed kNN query processing.

6.2.1 Big Spatial Data Processing Systems

The need for high performance spatial data processing has motivated many projects to add support for spatial queries in existing distributed frameworks. This category consists of spatial data processing systems built on top of existing parallel and distributed platforms.

Some researchers have presented comparative studies on the features and performance characteristics of some of these systems [130, 2]. SpatialHadoop [44], HadoopGIS [79], Hadoop-GIS [1], and HadoopTrajectory [12] are based on the Hadoop MapReduce [40] framework. ST-HBase [106] is based on HBase [52], a distributed database that runs on top of the Hadoop Distributed File System (HDFS). These systems provide a viable path toward horizontal scaling for spatial data processing, but can be hindered by the performance limitations of Hadoop's disk-oriented data processing paradigm. Another family of systems is based on Apache Spark, an in-memory cluster computing framework [190]. These systems include GeoSpark [188], SIMBA [181], SpatialSpark [186], Magellan [160], and LocationSpark [162]. They are typically faster than the Hadoop-based systems due to their in-memory data processing, but will fall back to disk-based processing or fail if the job size exceeds the available memory. In response to this issue, Baig et al. presented SparkGIS [11], a Spark-based spatial query processing system. Their system uses space-efficient global and local in-memory indexes based on R-trees to accelerate its queries. A key feature of SparkGIS is that it can dynamically rewrite queries in order to handle large workloads that exceed the available memory resources. Both families of systems are designed with spatial queries in mind, and do not natively support spatio-temporal queries. Consequently, researchers have devised spatio-temporal indexes to overcome this hurdle and improve query performance.

6.2.2 Scalable Spatio-Temporal Indexes

Spatio-temporal data are part of spatial big data, and their usage is found in many LBS applications. Researchers have explored ways to improve query and storage efficiency. MD-HBase [125] is a multi-level index for multi-dimensional data that uses the HBase database for storage.

The indexing data, such as the geographical coordinates and timestamp, are combined and transformed into a single dimension using the Z-ordering linearization technique. The generated value is then used in the index layer to point to equal-sized buckets, which hold the data (storage layer). Pyro [101] uses a geometrical translation module based on the Moore encoding linearization technique [10] to translate geometry queries into a series of one-dimensional range scans. The implementation is based on HBase and HDFS for its index and storage layers. A temporal key is appended to the end of the spatial key and used as each record's row key in HBase. The HDFS storage layer uses a data block algorithm that preserves data locality while splitting regions. PASTIS [142] is an in-memory grid-based index. The spatial domain is split into grid cells that maintain information about the moving objects using compressed bitmaps. The cells are kept in memory, and inserts and updates are performed concurrently using multiple threads. DISTIL+ is inspired by PASTIS. However, PASTIS is not horizontally scalable, as it is a single-node approach, whereas DISTIL+ is a distributed in-memory system. In [50], Fox et al. presented a spatio-temporal index based on Niemeyer's geohashing [124] technique, which converts a spatial rectangle into a binary string. The authors implemented their index on Apache Accumulo [167], which was inspired by Google's BigTable and is similar to HBase. The spatio-temporal attributes of a point are represented by a 35-bit hash that contains the location and date-time information of that record. The records are spread across multiple servers in a distributed file system. In [47], Feng et al. present the Local and Clustering Index (LCIndex), an indexing technique for the Distributed Ordered Tables (DOT) used in HBase. The authors claim their clustering index approach is better suited to handling data skew and updates compared to tree-based approaches. The index is designed to provide high-performance multi-dimensional range queries and updates.

LCIndex stores indexing data (IFiles) using HBase's HFile storage format, which is in the form of key-value pairs. Each IFile represents an indexed column, and is stored on the local file system to accelerate queries. The raw data records are stored at different region servers using HDFS. Each server stores horizontal partitions for one or more tables, and replication is used to enhance reliability and performance. Xiangyu Zhang et al. developed a multi-dimensional index for the cloud, called EMINIC (Efficient Multi-dimensional Index with Node Cube) [194]. They used the master-slave paradigm to build local k-d tree indexes on the slave nodes and global R-tree indexes on the master nodes. Information about each slave's local index is stored within a node cube, which is then indexed in the master nodes. Each node cube consists of a set of intervals that indicate the range of each indexed attribute in the slaves. The node cubes are used to prune nodes that do not intersect with the query, thus improving query processing. Liao et al. [104] built a hierarchical R-tree-based index for multi-dimensional data on distributed file systems. Several buffer pools are used to cache the R-tree's inner nodes, because they are frequently accessed during tree traversal. Another series of buffer pools caches the leaf nodes. The leaf node buffer pools use a Least Recently Used (LRU) replacement policy due to the considerably larger number of leaf nodes. The authors used Hadoop HDFS and range partitioning to distribute the index across multiple nodes. Zhang et al. [195] developed TrajSpark, a distributed Spark-based in-memory system for trajectory data. The authors use a global index on the master node as well as per-node local indexing to speed up trajectory queries. The raw trajectory data is indexed first by time ranges, then by a spatial index, and lastly using a B+tree that directly maps to trajectory points. TrajSpark monitors the data distribution using a time-decay model to avoid unnecessary repartitioning when new data arrives.

Recently, Alarabi et al. [3] presented ST-Hadoop, a full-fledged spatio-temporal system built on Hadoop. ST-Hadoop extends SpatialHadoop [44] by improving support for spatio-temporal querying and indexing. ST-Hadoop utilizes HDFS to store a distributed spatio-temporal index that is used to accelerate spatio-temporal queries. It leverages an efficient implementation that directly extends several Hadoop functions to support spatio-temporal data. This results in considerably less overhead compared to adding a new layer on top of Hadoop. The index is built by first taking samples from the data, and partitioning the samples into time intervals with various granularities (days, weeks, months). Each temporal interval is then spatially indexed using a common index, such as an R-tree, Quadtree, or k-d tree.

6.2.3 Distributed kNN Query Processing

kNN queries are a fundamental tool in many domains, including LBS. Researchers have explored ways to accelerate these queries in distributed computing environments. Some kNN implementations have employed a brute-force exhaustive search over the entire dataset in single-node shared-memory multicore systems [157], single-node GPGPU [103], and distributed systems [192, 187]. Such approaches are not suitable for LBS and spatio-temporal kNN query processing, as scalability and low latency are essential to handling very large datasets. Some researchers have explored the use of pre-existing indexes or Voronoi diagrams to accelerate kNN queries. Kolahdouzan et al. presented a method to process kNN queries using pre-computed Voronoi diagrams [85]. Voronoi diagrams can be expensive to compute, and do not support efficient updates, as the arrival of new data can significantly alter large portions of the diagram. Zhang et al. used a pre-built R-tree to process kNN queries, and proposed a validity region in which the query result could be re-used without recalculation [193]. Although updating an R-tree is faster than updating a Voronoi diagram, this approach is also unsuitable for LBS due to the lack of scalability.

Patwary et al. [133] used a distributed global k-d tree index to determine the node that contains the incoming query point. The node then calculates a local kNN query by traversing its local k-d tree index. Each node then communicates with nearby nodes in order to prune the search space. The pruning is based on the fact that an object in a remote node that is farther than the farthest local kNN is not a kNN candidate. The authors leverage multiple nodes, multiple cores, and SIMD instructions to build and query the index in parallel. Ding et al. [41] propose a system for metric kNN queries called the Asynchronous Metric Distributed System (AMDS). The authors opted to use pivot-mapping to distribute the data between the nodes. The system achieves asynchronous query processing by using a publish-subscribe messaging pattern to reduce communication overhead.

6.3 Problem Statement and Design Considerations

The goals of DISTIL+ are to support high-throughput location updates and many concurrent spatio-temporal range queries (STRQ) and kNN queries (STkNNQ). Formally, let us assume that a continuously updated set of location updates O_t at time t is being generated by M moving objects. Here, each record o = (o.id, o.l, o.t, o.v, o.d) in the set O_t represents a location update with object ID o.id, location o.l, timestamp o.t, velocity o.v, and direction o.d. Further, o.l is a tuple (o.l.x, o.l.y), x and y being the latitude and longitude, and 0 ≤ o.id ≤ M − 1.

Definition 6.3.1. A spatio-temporal range query (STRQ) is given by q = (q.sw, q.tw), where q.sw is the spatial window [(x1, x2), (y1, y2)] and q.tw is the temporal window

[t1, t2]. Here, q finds the set Z of objects such that

Z := {∀z ∈ O | z.l ∈ q.sw, z.t ∈ q.tw}

Definition 6.3.2. A spatio-temporal kNN query (STkNNQ) is given by q = (q.l, q.tw, q.k), where q.l is the query location (x, y), q.tw is the temporal window [t1, t2], and q.k is the number of nearest neighbors requested. Here, q finds the set Z of the k best matching objects, such that

Z := {∀z ∈ O | d(z1.l, q.l) ≤ d(z2.l, q.l) ≤ ... ≤ d(zk.l, q.l), z.t ∈ q.tw, |Z| ≤ q.k}

Here, d(z.l,q.l) is the distance between query and object locations.
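The two definitions map directly onto simple containment and distance checks on individual records. The following Java sketch expresses them for a single record; the class and field names are illustrative and are not taken from DISTIL+'s source code.

```java
// Minimal sketch of a location record and the predicates from Definitions 6.3.1 and 6.3.2.
// Requires Java 16+ for records; names are hypothetical, chosen only for illustration.
record LocationRecord(long objId, double x, double y, long t, double v, double d) {}

record RangeQuery(double x1, double x2, double y1, double y2, long t1, long t2) {
    // STRQ membership: z.l inside q.sw and z.t inside q.tw
    boolean matches(LocationRecord z) {
        return z.x() >= x1 && z.x() <= x2
            && z.y() >= y1 && z.y() <= y2
            && z.t() >= t1 && z.t() <= t2;
    }
}

record KnnQuery(double qx, double qy, long t1, long t2, int k) {
    // Temporal filter applied before candidates are ranked
    boolean inTimeWindow(LocationRecord z) {
        return z.t() >= t1 && z.t() <= t2;
    }
    // Euclidean distance d(z.l, q.l) used to order the k best matches
    double distance(LocationRecord z) {
        double dx = z.x() - qx, dy = z.y() - qy;
        return Math.sqrt(dx * dx + dy * dy);
    }
}
```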

Due to the pervasiveness of LBS, the volume of spatio-temporal data is rising at a rate that outpaces the rate at which memory capacities are improving. Therefore, even though it may be possible to obtain a machine with enough memory to process LBS data for a given application, eventually a single machine's RAM will not be able to cope with the data growth. As a result, vertical scaling is challenging and costly. On the other hand, a cluster of commodity servers, connected by a high-speed network, can be easily scaled horizontally by adding new nodes. The aggregate memory capacity of such a cluster can be utilized effectively if the data management system is designed to operate as a distributed in-memory data processing engine. However, this requires efficient partitioning, distribution, and sharing of data across the cluster, while minimizing network traffic that would otherwise cause performance degradation. Under these circumstances, APGAS offers a high-performance framework for designing distributed in-memory systems. Another important consideration with regard to LBS applications is that recent data is usually accessed more frequently than older data. Researchers have noted this request skew in their LBS system [179], and pointed out that more than 50% of all queries only accessed data from the last day in their system. This is not an isolated phenomenon; in fact, many companies providing LBS typically maintain the past 4 to 6 weeks of data online in relational databases, and use database archiving techniques to store older data.

Figure 6.2: DISTIL+ System Architecture

Although the data volume for an LBS application can be significant even for 6 weeks, it can still be practical to keep the entire online dataset in a distributed in-memory big data system. Due to these considerations, our system maintains the past D (configurable) days of data in the distributed in-memory store. Note that all data is persisted in stable storage. Therefore, we can keep ‘hot’ data in main memory, and the data can be deleted from the main memory store when it gets ‘cold’.

6.4 DISTIL+ System Overview

In this section, we describe the design and architecture of our system. The DISTIL+ system architecture (shown in Figure 6.2) is designed to exploit the aggregate main memory capacities of multiple nodes in a cluster using the APGAS model. The master node coordinates the various activities of the worker nodes, and is responsible for the data partitioning process. It provides the network interfaces that clients can connect to, and queues the records received from GPS-enabled mobile devices and sensors. The Coordinator redirects the records to the Data Updater component in the nodes, based on the data placement. It is assumed that a mobile device can send periodic or event-based location updates to our system. Managing a location update involves first inserting it into the in-memory store (Memstore) and the persistent stores (Global Store and Local Store), and then updating the in-memory indexes (Global and Local Indexes).

Figure 6.3: DISTIL+ Update Processing Overview

Location-based queries, submitted by clients, are handled by the Scheduler component. It is responsible for generating the query plan, distributing the plan fragments to individual Query Processors, and producing the final query result by aggregating the partial results from each Query Processor. A concise overview of DISTIL+'s update processing is provided in Figure 6.3, and its spatio-temporal range and kNN query processing is summarized in Figure 6.4. The remainder of this section is organized as follows. In Section 6.4.1, we discuss how our system partitions data. Section 6.4.2 covers local and distributed data persistence. In Section 6.4.3, we outline and compare several indexing techniques which we considered for our system, and present our indexing technique in Section 6.4.3.1. Finally, in Section 6.5, we elaborate on DISTIL+'s kNN query processing algorithm. More detailed discussions of DISTIL+'s update processing and range query processing are beyond the scope of this chapter and are provided in Appendix D.

6.4.1 Data Partitioning

Data partitioning enables managing data that is larger than a single node's memory. In our system, the spatial domain is partitioned into grid cells (tiles) that are distributed using a tile placement policy (discussed later) among the nodes in a cluster.

Figure 6.4: DISTIL+ Query Processing Overview. (a) Spatio-temporal Range Query Processing; (b) Spatio-temporal kNN Query Processing

The master node maintains the tile information using the SpatialGridPlace structure, which includes a hashtable for mapping each tile (tileId) to a node (nodeId). Furthermore, each APGAS place corresponds to a physical node within the cluster. Each place (including place 0) hosts a subset of tiles (SpatialGridTile) mapped into a SpatialGrid object. Here, SpatialGrid is a hashtable with tileId as the key and SpatialGridTile as the value. The role of the SpatialGrid object is to maintain information regarding the tiles it is responsible for. The SpatialGrid objects are stored in a distributed array. In cases where per-node data structures are required, we create distributed arrays with the same length as the number of places. Each array element corresponds to a node. In this case, we created the distributed array SpatialGrids to hold each node's SpatialGrid object. Each tile contains information about the location updates of the moving objects that are received at different timestamps.

The placement of the spatial tiles can impact system performance. We outline two tile placement policies: Row-wise Round-Robin partitioning (RRR) and Multi-Dimensional Range partitioning (MDR). Row-wise Round-Robin partitioning (RRR): In this case, each row of tiles is traversed and assigned to a node. The first row is assigned to node 0, the second to node 1, and so on, following a round-robin ordering. The row-wise round-robin approach aims to achieve a balance between parallelism, load balancing overhead, and data skew. It preserves some of the locality. Note that we do not evaluate a basic round-robin assignment of tiles to nodes, as it does not preserve locality. Multi-Dimensional Range partitioning (MDR): Here, the aim is to assign nearby tiles to nodes so that locality is improved. The MDR policy is implemented by linearizing the tiles using a space-filling curve (SFC), such as the Z-order or Hilbert curve, and then following this SFC order and mapping each tile in a given range to a particular node. In our system, the tile assignment approximates a Z-order distribution.
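To make the two policies concrete, the following is a small Java sketch of how tiles in an L × L grid could be mapped to places under RRR and MDR. It is a simplified illustration, not DISTIL+'s actual implementation: the method names are hypothetical, and the MDR variant assumes L is a power of two so that the Z-order curve is dense.

```java
import java.util.HashMap;
import java.util.Map;

public class TilePlacement {

    // Row-wise Round-Robin (RRR): each row of tiles is assigned to one place,
    // and rows are handed out to places in round-robin order.
    static Map<Integer, Integer> rrr(int L, int nPlaces) {
        Map<Integer, Integer> tileToPlace = new HashMap<>();
        for (int row = 0; row < L; row++)
            for (int col = 0; col < L; col++)
                tileToPlace.put(row * L + col, row % nPlaces);
        return tileToPlace;
    }

    // Multi-Dimensional Range (MDR): linearize tiles with a Z-order curve and map
    // contiguous ranges of the curve to places, so nearby tiles tend to share a node.
    // Assumes L is a power of two; the min() guards against uneven division.
    static Map<Integer, Integer> mdr(int L, int nPlaces) {
        int tilesPerPlace = (L * L + nPlaces - 1) / nPlaces;
        Map<Integer, Integer> tileToPlace = new HashMap<>();
        for (int row = 0; row < L; row++)
            for (int col = 0; col < L; col++) {
                int place = (int) Math.min(nPlaces - 1, zOrder(row, col) / tilesPerPlace);
                tileToPlace.put(row * L + col, place);
            }
        return tileToPlace;
    }

    // Interleave the bits of (row, col) to form the Z-order (Morton) index.
    static long zOrder(int row, int col) {
        long z = 0;
        for (int b = 0; b < 16; b++) {
            z |= ((long) (col >> b) & 1L) << (2 * b);
            z |= ((long) (row >> b) & 1L) << (2 * b + 1);
        }
        return z;
    }
}
```

For example, with L = 8 and four places, rrr cycles the rows through places 0, 1, 2, 3, 0, ..., whereas mdr gives each place one 4 × 4 quadrant of the grid, which is the locality-preserving behavior described above.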

6.4.2 Data Persistence

As mentioned earlier, in our system, data (i.e., location records) is stored in an in-memory store (Memstore) to support low-latency updates and fast query processing. To ensure that no data is lost, two persistent stores are used (shown in Figure 6.2): a Local Store hosted at each node, and a Global Store, as described below. Memstore serves as the de facto in-memory cache by storing the most recent and active location data, whereas the persistent stores permanently keep historical data. As ‘hot’ data in memory becomes ‘cold’, it can be removed from Memstore to make room for new data.

6.4.2.1 Local Store

Each node stores the location records corresponding to the spatial tiles it hosts. A new location record is serialized and stored in the Local Store. A vital property of this store is support for fast insert operations. Our implementation is based on LevelDB [100], which is a fast key-value store developed by Google. Records are sorted by key and stored in byte-array form. LevelDB is based on the idea of the Log-Structured Merge-Tree (LSM-tree) [126]. The LSM-tree index is appropriate for applications that require support for high update rates. It splits a logical tree into several physical components so that the most-recently-updated portion of the data is in a component that fits entirely in memory. Once the in-memory tree component reaches a threshold, a new in-memory component is created, and the old component is merged with the disk components. To minimize the cost of reads, the disk component is merged with other on-disk components in the background.
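Because LevelDB sorts records by their byte keys, the layout of the serialized key determines how records are clustered on disk. The Java sketch below shows one plausible way to encode a location record into a sortable key and value; the (tileId, timestamp, objectId) layout is a hypothetical format chosen for illustration, it is not DISTIL+'s actual on-disk schema, and no LevelDB API is used.

```java
import java.nio.ByteBuffer;

public class RecordKey {
    // Big-endian, fixed-width fields keep the byte-wise sort order equal to the
    // (tileId, timestamp, objectId) lexicographic order for non-negative values,
    // so records for a tile are clustered and time-ordered inside the LSM-tree.
    static byte[] encodeKey(int tileId, long timestamp, long objectId) {
        return ByteBuffer.allocate(4 + 8 + 8)
                .putInt(tileId)
                .putLong(timestamp)
                .putLong(objectId)
                .array();
    }

    // The value holds the remaining record fields as a flat byte array.
    static byte[] encodeValue(double x, double y, double velocity, double direction) {
        return ByteBuffer.allocate(4 * 8)
                .putDouble(x).putDouble(y)
                .putDouble(velocity).putDouble(direction)
                .array();
    }
}
```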

6.4.2.2 Global Store

Since the Local Store at each node persists a subset of the data, there is a possibility of data loss if that node fails. Therefore, the data is transferred from the Local Store to the Global Store by using an offline process, which runs periodically. As it is an offline process, it does not affect the system performance. The Global Store is based on Hadoop Distributed File System (HDFS), in which the content of a file is split into blocks and replicated across the nodes [154].

6.4.3 Indexing Techniques

As part of our work in developing DISTIL+, we explored the features and usability of several indexing techniques. Location-aware applications commonly store data that describes the geographical location of moving objects at different timestamps. This type of application generates a large amount of data, creating the need for efficient indexing techniques that can assist with query processing.

Spatio-temporal indexes combine information from space and time to accelerate the process of locating and identifying objects, thus avoiding expensive linear scans of the whole database. A common approach, known as the Grid, partitions space into a uniform grid of cells that consist of tessellated shapes (such as squares or cubes). Objects are allocated to grid cells based on their location and are typically replicated if they span multiple cell boundaries [12, 159]. Grid indexes are ideal for supporting fast updates, as adding or removing objects is trivial. An alternative approach is to map spatio-temporal data into a one-dimensional locality-preserving representation, which can then be used to build an index and query the data. Space-filling curves, such as the Z-Order Curve [118] and the Hilbert Curve [90], provide a means to linearize spatial data. Implementations, such as Geohash [124], use space-filling curves to map locations to a binary string. The mapping preserves locality such that two strings with identical prefixes refer to locations that are situated close to each other in the multidimensional space, making it suitable for range queries. The string can then be combined with a temporal hash to produce a record key, which is ready to be stored in a common index data structure, such as a B-tree or hash table. Tree-based approaches are suitable for situations where query performance is the main priority. Tree data structures, such as the R-tree [64], multidimensional binary tree (k-d tree) [22], Quadtree [48, 161], B-tree [33], and their many variants, are used for building multidimensional indexes in a wide variety of applications. Table 6.1 presents a brief summary of these approaches. The key idea behind these tree-based data structures is to organize data into nodes and levels using different techniques that allow search algorithms to quickly find requested objects, ranges, or nearby neighbors. In the following section, we describe our index architecture, which utilizes a combination of Grid and Quadtree.

Table 6.1: Index Data Structure Comparison

R-tree. Basic concept: a balanced search tree that groups objects (in leaf nodes) using minimum bounding rectangles. Ideal usage: search queries and bulk loading.
k-d tree. Basic concept: a binary tree that cycles through each dimension and splits into two subtrees using the median (actual or sample). Ideal usage: high-dimensional data and bulk loading.
Quadtree. Basic concept: each node recursively splits space into four quadrants until each quadrant contains the desired number of objects. Ideal usage: high update rates and 2D data.
B-tree. Basic concept: a balanced search tree built using one-dimensional keys, where each node is a block of multiple entries. Ideal usage: pre-linearized data, spatial indexing in relational databases.
Grid. Basic concept: partition space into a uniform grid of cells; objects on cell borders are handled with techniques such as replication. Ideal usage: high update rates, large spatial domains.

6.4.3.1 DISTIL+ Distributed In-Memory Spatio-Temporal Index

The main idea of our indexing system is to discretize the spatial and temporal dimensions. The index is organized as a multi-level hierarchical index. As can be seen in Figure 6.5, there are three levels in this hierarchy. Level 1 constitutes the global index, whereas Level 2 and Level 3 make up the per-node local index. The local index is composed of a spatial index (SI) and a collection of partial temporal indexes (PTIs).

There are N nodes such that each node {N_i | i = 0...N − 1} is mapped to an APGAS place {P_i | i = 0...N − 1}, where place P_0 is mapped to the master node N_0. The spatial domain is organized as an L × L grid, {G_{r,s} | r = 1...L, s = 1...L}. Each grid cell (tile) maintains a PTI (partial temporal index) for a set of discrete time intervals {T_t | t = 1...W}, where W is a configurable parameter. Essentially, a PTI identifies the activities of each moving object with ID o.id that was inside a grid cell G_{r,s}. To that end, each PTI maintains an interval table, Ib_{r,s,t}.

Figure 6.5: DISTIL+ Index Architecture

Each entry of Ib_{r,s,t} is a tuple, i.e., Ib_{r,s,t} := (Bv_{r,s,t}, Ht_{r,s,t}). Here, Bv_{r,s,t} is a bit-vector such that Bv_{r,s,t} = <B_u | u = 0...M − 1>, and bit B_u is set to 1 if object u was inside grid cell (r, s) during time interval t; otherwise, bit B_u is set to 0. The key of the hash table Ht_{r,s,t} is the ID of the moving object m, and the value, Rl_{r,s,t,m}, is a record ID list Rl_{r,s,t,m} = <C_v | v = 0...D − 1>. Here, D is the maximum record ID, which could be a long integer. Rl_{r,s,t,m} contains the record ID of each location update for moving objects that were inside grid cell G_{r,s} during time interval T_t. These record IDs can be used to retrieve the actual location record that is in the Location table (mentioned in Section 6.4.2) and to inspect individual fields of a record.

A function f is used to map each grid cell (tile) G_{r,s} to a place P_i, where f : G_{r,s} → P_i.

Figure 6.6: Illustration of kNN query (STkNNQ) processing

Therefore, a place (i.e., its corresponding node) is responsible for a particular set of tiles, and there is no need for synchronization among different places while processing a query. On system start-up, the tiles are assigned to the available nodes following a placement policy, as described previously. The above concepts are implemented by the following data structures and are represented in Figure 6.5. The global index (Level 1) is denoted by GIndex. It is a Quadtree index that is based on a partitioning of the spatial domain and is used to quickly determine the cells covered by a node. The STIndex object represents the per-node local index (Levels 2 and 3). It maintains information about each tile in the spatial domain (i.e., G_{r,s} above). For each tile, a partial temporal index (Level 3) is represented by the PTIndex object (i.e., Ib_{r,s,t} above). Each entry in PTIndex is a tuple represented by TileIndexDatetimeObjectInfo. It consists of a bit-vector Bv_{r,s,t}, represented by the Bit-vector object, and a hash table, Ht_{r,s,t}, denoted by the RIDListMap object. The keys in the hash table object RIDListMap are object IDs, ObjId, corresponding to the objects that were inside a tile during a time interval. Each value in RIDListMap is a list, Rl_{r,s,t,m}, represented by RIDList, containing the record identifiers (rid). These are the location records for a particular moving object.
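The per-tile, per-interval entry just described boils down to a bit-vector plus a map from object IDs to record-ID lists. The following Java sketch captures that shape; it is illustrative only, and the class and method names do not correspond to DISTIL+'s actual X10 source.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of one partial temporal index (PTI) entry: the bit-vector marks which moving
// objects appeared in this tile during one time interval, and the map (RIDListMap-like)
// lists the record IDs of their location updates.
public class PtiEntry {
    private final BitSet presentObjects = new BitSet();                   // one bit per object ID
    private final Map<Integer, List<Long>> ridListMap = new HashMap<>();  // objId -> record IDs

    void addUpdate(int objId, long recordId) {
        presentObjects.set(objId);
        ridListMap.computeIfAbsent(objId, id -> new ArrayList<>()).add(recordId);
    }

    // Returns the record IDs of all objects seen in this tile during the interval;
    // iterating the bit-vector lets callers skip absent objects (and empty intervals) cheaply.
    List<Long> recordIdsInInterval() {
        List<Long> result = new ArrayList<>();
        for (int objId = presentObjects.nextSetBit(0); objId >= 0;
                 objId = presentObjects.nextSetBit(objId + 1)) {
            result.addAll(ridListMap.get(objId));
        }
        return result;
    }
}
```

The bit-vector makes the common "which objects were in this tile during this interval?" check cheap, while the map defers touching the much larger record-ID lists until a query actually needs them.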

6.5 Spatio-temporal kNN Query Processing

In this section, we discuss our spatio-temporal kNN query (STkNNQ) processing algorithm. We support two variants of this query: Present and Historical. The Present STkNNQ algorithm utilizes the Historical STkNNQ algorithm, but specifies the end timestamp as "now" and the start timestamp a few minutes earlier (a configurable parameter). Next, we describe the STkNNQ processing. A spatio-temporal kNN query (STkNNQ), q, is specified by (k, q.c, q.tw), where k is the number of nearest neighbors we wish to find, q.c specifies the coordinates from which to begin the search, and q.tw the time window. It returns the top k objects that were nearest to q.c during q.tw, sorted by distance. The search area begins around q.c and expands in each iteration. Figure 6.6 illustrates the distributed execution of an STkNNQ query. Local pruning is used to reduce the amount of inter-node communication. Algorithm 3 describes how kNN queries are processed. The global index is used to find the tile containing the query point (line 2). The algorithm repeats from line 3 until the top k nearest neighbors are found, or the number of search area expansions exceeds its threshold. By default, the upper limit on expansions is set so that the entire dataset is eventually covered. In each iteration, the search area is expanded using the global index. As the search expands, additional neighboring tiles are selected to be processed (line 5). Tiles that were fully processed in previous iterations are excluded (line 6). For each tile, its corresponding place is identified. Each participating place builds a list of local tiles to be processed (lines 7 to 10). Each place executes the query on its list of tiles in parallel. If the place finds more than k tuples matching the query criteria, the set is locally pruned according to an adaptive pruning technique (described in Section 6.5.1). The resulting list of kNN candidate tuples is then merged (lines 12 to 16). Once all participating places have returned their results, the termination criteria are evaluated (line 17). If either criterion is met, the result list is sorted and the top-k tuples are returned as qryResult (lines 18 and 19). Otherwise, the search continues (line 21).


Algorithm 3: Spatio-temporal kNN Query
Input: A kNN query, (k, q.c, q.tw), where k is the number of nearest neighbors to find, q.c is the spatial coordinates of the kNN point, and q.tw is the query time window. GIndex is the global index, and STIndex is the local spatio-temporal index. The final result is stored in qryResult. expansionLimit is the maximum number of kNN iterations (search area expansions).
1  // Global Processing
2  t.id ← GIndex.tid(q.c)  // find the ID of the tile containing q.c
3  i ← 0  // counter used to keep track of kNN iterations
4  // Repeat and expand the spatial search area until k neighbors are found or the expansionLimit threshold has been exceeded
5  C_i ← GIndex(t.id, k)  // iteratively expand the search area using the global index, store tiles in set C_i
6  C_i ← C_i − Σ_{n=0}^{i−1} C_n  // exclude previously processed tiles when i > 0
7  for c ∈ C_i do
8      plc ← find place from tile c
9      at Place plc: qryTiles(plc).add(c)
10     queryPlaces(plc).add(plc)
11 // Local Processing
12 for plc in queryPlaces do
13     query ← new STkNNQuery(qryTiles(plc), STIndex(plc), q.c, q.tw, k)
14     queries(plc) ← execute query  // process and prune results early if greater than k
15     result ← fetch result at plc from queries(plc)
16     qryResult ← merge result with qryResult
17 if qryResult.size() ≥ k or i > expansionLimit then
18     qryResult.kNNsort(q.c)  // sort tuples by distance to q.c, and prune to the k nearest tuples
19     return qryResult
20 else
21     Increment i and repeat from line 5

6.5.1 kNN Pruning

Communication overhead and load imbalance are two major obstacles to achieving efficient scaling in distributed kNN queries. When a distributed kNN query is processed, it involves multiple nodes participating in finding objects within an expanding circle until the query constraints are satisfied. These kNN query fragments may return records that are not in the final result set, either due to their temporal attributes or simply because they are not part of the top-k. If all the query fragment results are naively collected, this can incur unnecessarily large communication overhead, as well as a sorting overhead at the master node. Sorting the results can be particularly time-consuming, as it also involves calculating the distance of each object from the kNN query coordinates. However, pruning this data can also hinder query latency if the local processing is too slow. This issue motivated us to pursue a method to effectively prune results at the local node level. The following describes how each place prunes its local result set before submitting it for collection. Let us assume that a node is instructed to process a set of tiles. The tiles are first filtered on the temporal attribute, creating a partial query result of size n. If n is less than k, then no pruning takes place and the results are submitted as-is. If n is greater than or equal to k, the results are locally pruned and only the top k are submitted. When the kNN query requires multiple iterations to complete, each place keeps track of tiles that were fully processed, as well as tiles that were partially processed in previous steps. During each iteration of the kNN algorithm, the results of fully processed tiles are stored, and these tiles are omitted from being re-processed in future kNN iterations.
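A common way to realize this "keep only the k nearest" step is a bounded max-heap, as in the Java sketch below. It is a simplified illustration of the local pruning idea, not DISTIL+'s implementation, and the names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of the local pruning step: each place retains at most its k nearest candidates
// (by distance to the query point) before shipping results to the master node.
public class LocalKnnPruner {
    record Candidate(long objId, double distance) {}

    // Max-heap of size k ordered by distance: the farthest retained candidate sits on top,
    // so any candidate farther than it can be discarded immediately.
    static List<Candidate> pruneTopK(Iterable<Candidate> partialResult, int k) {
        PriorityQueue<Candidate> heap =
                new PriorityQueue<>(Comparator.comparingDouble(Candidate::distance).reversed());
        for (Candidate c : partialResult) {
            if (heap.size() < k) {
                heap.add(c);
            } else if (c.distance() < heap.peek().distance()) {
                heap.poll();
                heap.add(c);
            }
        }
        // At most k candidates, unsorted; the master sorts the merged set globally.
        return new ArrayList<>(heap);
    }
}
```

Each candidate is compared against the farthest retained one, so a place never holds more than k candidates regardless of how many tiles it processes, which bounds both the local sorting work and the data shipped to the master.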

Table 6.2: Experiment Parameters (default settings bolded)

Cluster size (num. of nodes): 2, 4, 8
Data system: DISTIL+, MD-HBase, GeoSpark, LCIndex, ST-Hadoop
Space domain: Texas, 1251 km × 1183 km
Num. of road segments: 56,832,846
Dataset size (num. of records): 10M, 20M, 40M, 60M, 80M, 100M
Time duration (total timestamps): 1000
Tile placement policies: RRR, MDR
Update worker threads: 4, 8
Update batch size (num. of records): 1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000
Query worker threads: 1, 2, 4, 8
Range query area (spatial window), km²: 20 (smallR), 137 (medR), 254 (largeR)
Present query temporal interval (% of total timestamps): 1.5
Historical query temporal interval (% of total timestamps): 5, 10, 15, 20, 100
kNN query k parameter: 250, 500, 1000, 2000, 4000, 8000

6.6 Evaluation

In this section we evaluate our system, DISTIL+, in various settings. We first describe the experimental setup, datasets and configuration parameters used in our experiments.

6.6.1 Experimental Setup

To conduct our experimental evaluation, we used a dataset that is based on real-world spatio-temporal data. It consists of 100 million records that track the location and movement of objects across time.

In addition to the 100 million record dataset, we also generated five smaller datasets that contain 80M, 60M, 40M, 20M, and 10M records. Each of these smaller datasets contains a random subset of objects from the initial one (100M). The dataset generation process is described next. The polyline (edges) shapefiles of Texas from the TIGER dataset [169] were fed into the mobility trace generator MOTO [119] to generate the traces. We modified MOTO to generate multiple trace files, each file for a different table fragment. Mobility traces were generated for each object for 1000 timestamps (equivalent to 1000 minutes). The experiments were conducted on a cluster of 8 machines that are connected together using a Gigabit Ethernet switch. Each machine contains two Intel(R) Xeon(R) E5472 CPUs @ 3.00GHz with a total of 4 × 2 processor cores, 12 × 2MB L2 cache, 16GB RAM, and a 500GB HDD. This gives the cluster a total of 64 processing cores, 128GB of RAM, and 4TB of storage. The machines are configured with the Ubuntu 16.04 64-bit operating system, Oracle JDK 1.8.0_77, and the X10 2.6.0 compiler.

6.6.2 Workloads and Experimental Parameters

DISTIL+ supports several parameters that can be used to adjust the update and query workloads. Range queries and kNN queries each have their own parameters. Additionally, the query's temporal range determines whether it is a Present or Historical query. We use these parameters to generate batches of queries with different settings; for instance, we vary the spatial window (smallR, medR, and largeR), temporal interval, and number of worker threads. Note that we report query latency in order to indicate the speed at which a system can perform a single query. Table 6.2 shows the experiment parameter settings. The default parameters are shown in bold.


Table 6.3: Data System Attributes and Features

DISTIL+. Framework: APGAS. Location updates: yes. Temporal indexing: yes. Range query: yes. kNN query: yes. Index used: Quadtree and Grid.
MD-HBase. Framework: HBase. Location updates: yes. Temporal indexing: yes. Range query: yes. kNN query: yes. Index used: Quadtree on KV Store.
GeoSpark. Framework: Spark. Location updates: no. Temporal indexing: no. Range query: yes. kNN query: yes. Index used: R-tree.
LCIndex. Framework: HBase. Location updates: yes. Temporal indexing: yes. Range query: yes. kNN query: no. Index used: Clustering Index.
ST-Hadoop. Framework: Hadoop. Location updates: no. Temporal indexing: yes. Range query: yes. kNN query: yes. Index used: k-d tree.

6.6.2.2 kNN Query (STkNNQ) k parameter

Figure 6.8 depicts the effect of varying the k parameter on kNN query latency. The results show that queries that can be satisfied by expanding the same number of tiles achieve similar query latency. As the k parameter is increased, the query eventually requires more expansions to find the desired number of nearest neighbors. The results are in line with our expectations, as we observe an increase in query latency corresponding to processing more records and additional tile lookups. Thus we demonstrate that DISTIL+ continues to scale well for larger values of k, with a 32× increase in k (from 250 to 8000) only taking about 3.2× longer to process.

6.6.3 Comparison with Other Systems

For our comparison experiments, we compare DISTIL+'s update and query performance against several other systems. We selected representatives from different system categories: GeoSpark from the spatial big data system category, MD-HBase and LCIndex representing HBase-based NoSQL systems, and ST-Hadoop, which is based on SpatialHadoop, representing the Hadoop-based systems. The data systems and their features are summarized in Table 6.3.

Figure 6.9: Insert/Update Throughput (records/s) Comparison (GeoSpark and ST-Hadoop do not support location updates)

6.6.3.1 Update Performance

In Table 6.4, we show the update throughput of our system, DISTIL+, compared to MD-HBase and LCIndex. DISTIL+ achieves significantly higher throughput than both MD-HBase and LCIndex. Note that we extended the publicly available MD-HBase code [125] to support multi-threaded updates. We could not measure the update throughput of GeoSpark or ST-Hadoop, as they do not support location updates. Instead, we measured a bulk insertion workload consisting of the time to copy the data into HDFS, and partition and index the data. In all cases, the startup overhead of Spark or Hadoop (which is non-existent for DISTIL+) is not factored into the throughput. We measured an average insertion throughput of around 288,358 records/s for GeoSpark, and 17,528 records/s for ST-Hadoop. We summarize the average update/insertion throughput of all the tested systems in Figure 6.9. Although GeoSpark can achieve a somewhat higher bulk insertion throughput compared to DISTIL+'s update throughput, it cannot update the records after insertion, and must instead rebuild its index from scratch. DISTIL+ is the only tested system that achieves high-throughput location updates.

Table 6.4: Update Throughput (records/s) Dataset Scaling Comparison

System      10M       20M       40M
DISTIL+     265,455   272,591   273,706
MD-HBase    3,210     2,673     2,659
LCIndex     4,548     3,912     3,860

6.6.3.2 Query Performance

GeoSpark does not support temporal data. To perform a fair comparison between all five systems, we evaluated a spatio-temporal range query over the entire time window of the 10M dataset. In Figure 6.10a, we present the range query latencies for this dataset, comparing DISTIL+, MD-HBase, GeoSpark, LCIndex, and ST-Hadoop. GeoSpark is faster than ST-Hadoop by around 3× due to its memory-based processing. DISTIL+, MD-HBase, and LCIndex are an order of magnitude faster than GeoSpark and ST-Hadoop because they are not hindered by the various overheads associated with Hadoop MapReduce and Spark. Our system achieves the lowest latency while executing a batch of 64 queries. In Figure 6.10b, we evaluate a kNN query on the same dataset with k = 500. Here, we note that the average latencies of MD-HBase and DISTIL+ are almost identical, while GeoSpark is approximately 35× slower, and ST-Hadoop is about 47× slower. This workload is not applicable to LCIndex because it does not support kNN queries. It is also worth noting that MD-HBase does not perform temporal filtering for its kNN query processing, even though it possesses a temporal index. Our system's results are notable in light of its efficient temporal filtering as well as its considerable advantage in update throughput.

Figure 6.10: Spatio-temporal query latency comparison. (a) Range query comparison; (b) kNN query comparison

6.7 Chapter Summary

Location-Based Services (LBS) are contributing to a massive growth of spatio-temporal data. It is becoming ever more important to develop scalable data systems that can handle this torrent of data. The key requirements of LBS are support for high location update throughput and efficient concurrent location-oriented queries. We presented DISTIL+, a distributed in-memory data processing system for spatio-temporal data. Our system is based on the APGAS model and implemented with the X10 language. DISTIL+ utilizes distributed in-memory data structures to leverage the aggregate memory and processing power of a cluster of machines, and supports fast updates and spatio-temporal queries on large datasets. Through extensive evaluation, we demonstrated that DISTIL+ can support scalable location-based services with high insertion throughput and low query latency. Our system performs significantly better than the four other systems that we evaluated: GeoSpark, MD-HBase, LCIndex, and ST-Hadoop. For example, we showed that DISTIL+'s update throughput was up to two orders of magnitude faster than MD-HBase and LCIndex, and over an order of magnitude faster than ST-Hadoop.

We also demonstrated that DISTIL+ is faster in range and kNN query performance, by up to 46× compared to GeoSpark and 47× compared to ST-Hadoop. DISTIL+ is the only tested system that was able to produce both high update throughput and low query latency, with full support for spatio-temporal kNN and range queries. In summary, DISTIL+ provides a rich set of features that are important for LBS, such as concurrent location updates, multilevel spatial and temporal indexing, data persistence, and scalable spatio-temporal range and kNN query processing. The key contributions of this chapter were as follows:

• We designed a scalable spatio-temporal big data system, called DISTIL+, which can be deployed on a cluster of machines, is based on the APGAS model, and provides high velocity concurrent updates and low latency query processing.

• We proposed a novel distributed in-memory spatio-temporal index that supports efficient concurrent location-oriented kNN queries, and demonstrated that query latency scales favorably even for large values of k and large datasets.

• We conducted extensive experiments with analysis to demonstrate the scalability and efficiency of our system, and to compare our system with four other distributed data systems.

Chapter 7

Conclusion

In this thesis, we have presented techniques to speed up main memory query processing by leveraging algorithmic improvements, novel data structures, modern hardware, and emerging programming frameworks, and by modifying or overriding some of the operating system's behaviors and components. The main focus of our contributions has been to provide insights and ideas toward faster and more efficient data analytics applications. In Chapter 3, we showed that skewed or shuffled data can significantly hinder in-memory hash join performance. We demonstrated that it is possible to obtain substantial speedups in these joins by modifying or replacing the underlying hash table. To that end, we proposed the Maple Hash Table, a novel hash table designed for in-memory hash joins that further improved runtime by up to 13.7×. In Chapter 4, we conducted a multi-dimensional analysis of in-memory aggregation, including an extensive experimental evaluation of our implementation and 12 other algorithms and data structures. We discovered that the best approach can be selected based on specific criteria that are derived from the workload characteristics. We then devised a decision flowchart that provides guidance on which approaches work best, based on the workload.

Lastly, in Chapter 6, we presented DISTIL+, a scalable spatio-temporal data system that uses a distributed in-memory programming framework and a novel multi-level index to accelerate kNN queries on a cluster of machines. We demonstrated that DISTIL+ can support scalable location-based services with high insertion throughput and low query latency. We also showed that our system performs significantly better than four other data systems that we evaluated, providing orders of magnitude faster update processing, as well as up to 47× faster kNN query processing.

Throughout this thesis, we have provided novel solutions and fresh insights, and explored the factors that influence main memory query performance in the software, hardware, and data domains. We have supplemented our findings with experimental comparisons against popular and state-of-the-art implementations, accurate runtime and memory efficiency measurements, performance counters such as cache and TLB misses, and in-depth analysis of the experimental results. Additionally, we have developed improvements to dataset generation methods that supported our research on main memory query processing, and constructed decision diagrams to guide the application of our approaches. We have made substantial contributions on multiple fronts in main memory query processing and data analytics, and we expect that these contributions will help researchers and practitioners advance the field. In the following section, we outline some future research directions that would be interesting to explore.

7.1 Future Work

As mentioned earlier, low latency is critical to the role of data analytics applications in decision making. Traditionally, database system architectures have been designed with a primary focus on either Online Analytical Processing (OLAP) or Online Transactional Processing (OLTP). This duality can create an increase in latency that is undesirable for applications that are required to handle a high rate of incoming data and simultaneously provide real-time insights. Hybrid Transactional and Analytical Processing (HTAP) is a system architecture that combines real-time analytics and high throughput transactional processing [128]. Main memory query processing is a key ingredient in achieving the low latency required for HTAP systems. Ongoing challenges within this domain include handling concurrent mixed queries that combine transactional and analytical processing, efficient indexing techniques that support high update rates as well as low latency lookups, and handling the fallback to persistent storage when free memory is exhausted.

Future research for our hash table (Maple Hash Table) would include the utilization of new SIMD instructions to reduce conditional branches and contention, and to leverage greater instruction-level parallelism. It is expected that new and upcoming CPU architectures will go beyond merely increasing the vector widths of older SIMD instructions, and will support additional features that are in high demand. For example, AVX2 adds support for gathering data from non-contiguous memory locations, an operation that has utility in main memory query processing. Finding ways to unlock more performance by utilizing these instructions is an avenue that is ripe for exploration. Efficiently utilizing SIMD is arguably one of the most challenging tasks in application development. As a result, there is an ongoing effort to streamline or automate this task using techniques such as machine learning to facilitate smarter compiler optimizations [92].
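As a concrete illustration of the gather operation mentioned above, the short C++ sketch below (ours, illustrative only; the flat table layout and mask-based hash are hypothetical simplifications, not Maple Hash Table's actual probe path) uses the AVX2 _mm256_i32gather_epi32 intrinsic to fetch eight hash table slots from non-contiguous locations with a single instruction:

// Illustrative sketch only: probing eight hash table slots at once with an AVX2
// gather. The flat table and the mask-based hash are hypothetical simplifications.
// Build with: g++ -O2 -mavx2 gather_sketch.cpp
#include <immintrin.h>
#include <cstdio>

int main() {
    alignas(32) int table[256];
    for (int i = 0; i < 256; ++i) table[i] = i * 10;          // dummy payloads

    alignas(32) int keys[8] = {3, 17, 42, 5, 99, 200, 7, 64};
    __m256i vkeys = _mm256_load_si256(reinterpret_cast<const __m256i*>(keys));

    // A toy hash function: index = key & (tableSize - 1).
    __m256i mask = _mm256_set1_epi32(255);
    __m256i vidx = _mm256_and_si256(vkeys, mask);

    // Gather eight 32-bit slots from non-contiguous positions (scale = 4 bytes).
    __m256i slots = _mm256_i32gather_epi32(table, vidx, 4);

    alignas(32) int out[8];
    _mm256_store_si256(reinterpret_cast<__m256i*>(out), slots);
    for (int i = 0; i < 8; ++i) std::printf("%d ", out[i]);
    std::printf("\n");
    return 0;
}

Whether such vectorized probing pays off depends on how well branches, collisions, and memory latency can be hidden, which is precisely the kind of question this future work would need to answer.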

Another interesting research avenue would be to provide better support for NUMA architectures by integrating "smart" data structure features that facilitate replication, horizontal partitioning, and interleaving of the hash table. In this context, it would be interesting to explore alternative concurrency control techniques, such as flat combining [67], lock-free data structures, cooperative multitasking, and hardware transactional memory.

In recent years, emerging memory technologies that provide Non-volatile Random Access Memory (NVRAM) have gained traction due to improved affordability and density, and a greater demand for fast persistent storage. NVRAM provides byte-addressable persistent storage. This creates many exciting opportunities for efficient main memory query processing by bridging the performance gap between DRAM and block-based persistent storage (HDDs and SSDs). Wider adoption of this technology is motivating the development of new algorithms and data structures that are specifically designed to work in a hybrid NVRAM-DRAM environment, thus utilizing the capabilities of both technologies [191]. Although it may be early to predict the future of this technology, NVRAM has the potential to revolutionize the memory hierarchy by replacing both persistent storage and main memory. In such a scenario, the implications for query processing and database systems would be significant.

On the topic of query processing on NUMA systems, potential future work would include delving deeper into memory allocator fine-tuning and adapting the allocator to the target architecture by adjusting parameters such as thread-local cache size, chunk alignment, and the size and number of arenas. It is conceivable that memory allocators may need to become NVRAM-aware in addition to becoming NUMA-aware. New NUMA topologies, such as multi-socket multi-chip-module [97] systems, entail an even larger hierarchy of memory access latencies and bandwidths. This presents interesting new challenges and opportunities for researchers, as existing approaches may need to be revised for optimal performance. Lastly, it would be useful to develop a tool that can automatically set good system defaults (memory allocator, placement policy, and schedulers) without requiring much tuning by the developer.

This tool could scan several system parameters and use a model-based approach to determine the configuration. Alternatively, the tool could run a compact suite of microbenchmarks to empirically determine a good configuration.

As the next step for the distributed spatio-temporal data system, it would be interesting to explore supporting NUMA-aware memory placement and allocation strategies, and implementing additional data operations, such as spatial joins, trajectory analytics, semantic reasoning, and machine learning algorithms. Semantic reasoners enrich the spatio-temporal data system with information from additional sources, which allows queries and query results to be expressed in terms that are more natural for humans [70]. For example, "fast-food restaurants near the UNB campus" or "bus moving south on Regent street" is easier to understand than a sequence of raw object IDs, coordinates, and timestamps. Further refinements could be made to the query processing engine, such as faster concurrent sorting algorithms to assist with calculating the k-nearest neighbors, as well as using inter-node communication to assist with result pruning. Lastly, given the promising results we obtained from IBM's X10 language, it would be interesting to explore the efficiency and utility of other languages that provide distributed in-memory computing, such as UPC++ [9].
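To make the result-pruning idea more tangible, the following C++ sketch (ours, purely illustrative; the per-node candidate lists and distances are hypothetical placeholders, and DISTIL+'s actual query engine is written in X10) merges per-node kNN candidates with a partial sort and derives the k-th best distance, which could then be broadcast back to the nodes as a pruning bound:

// Illustrative sketch: merging per-node kNN candidates and deriving a pruning
// bound. The candidate lists and distances are hypothetical placeholders.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

struct Candidate { int64_t objectId; double distance; };

// Merge the candidate lists received from each place and keep the k best.
std::vector<Candidate> mergeTopK(std::vector<std::vector<Candidate>> perNode, std::size_t k) {
    std::vector<Candidate> all;
    for (auto& v : perNode) all.insert(all.end(), v.begin(), v.end());
    auto byDistance = [](const Candidate& a, const Candidate& b) { return a.distance < b.distance; };
    if (all.size() > k) {
        std::partial_sort(all.begin(), all.begin() + k, all.end(), byDistance);
        all.resize(k);
    } else {
        std::sort(all.begin(), all.end(), byDistance);
    }
    return all;
}

int main() {
    std::vector<std::vector<Candidate>> perNode = {
        {{1, 0.9}, {2, 2.5}}, {{3, 0.4}, {4, 3.1}}, {{5, 1.7}}};
    std::vector<Candidate> top = mergeTopK(perNode, 3);
    double bound = top.back().distance;   // k-th best distance, usable as a pruning bound
    for (const Candidate& c : top)
        std::printf("obj %lld dist %.2f\n", static_cast<long long>(c.objectId), c.distance);
    std::printf("prune candidates with distance >= %.2f\n", bound);
    return 0;
}

Candidates whose distance already exceeds this bound could be discarded locally, reducing both the sorting work and the amount of data sent to the master node.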

References

[1] Ablimit Aji, Fusheng Wang, Hoang Vo, Rubao Lee, Qiaoling Liu, Xiaodong Zhang, and Joel Saltz, Hadoop-gis: A high performance spatial data warehous- ing system over mapreduce, VLDB Endowment 6 (2013), 1009–1020.

[2] Md Mahbub Alam, Suprio Ray, and Virendra C Bhavsar, A performance study of big spatial data systems, ACM SIGSPATIAL, ACM, 2018, pp. 1–9.

[3] Louai Alarabi, Mohamed F Mokbel, and Mashaal Musleh, St-hadoop: A mapreduce framework for spatio-temporal data, GeoInformatica 22 (2018), no. 4, 785–813.

[4] Dan Anthony Feliciano Alcantara, Efficient hash tables on the gpu, Ph.D. the- sis, University of California, Davis, 2011.

[5] Victor Alvarez, Stefan Richter, Xiao Chen, and Jens Dittrich, A comparison of adaptive radix trees and hash tables, ICDE, IEEE, 2015, pp. 1227–1238.

[6] Michael Anderson, Shaden Smith, Narayanan Sundaram, Mihai Capot˘a, Zheguang Zhao, Subramanya Dulloor, Nadathur Satish, and Theodore L. Willke, Bridging the gap between hpc and big data frameworks, VLDB En- dowment 10 (2017), 901–912.

[7] Austin Appleby, Smhasher suite, github.com/aappleby/smhasher [On- line. Last accessed: 2020-Jan-19], 2016.

159 [8] Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, et al., The landscape of paral- lel computing research: A view from berkeley, Tech. report, Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, 2006.

[9] John Bachan, Scott B Baden, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Dan Bonachea, Paul H Hargrove, and Hadia Ahmed, Upc++: A high- performance communication framework for asynchronous computation, IPDPS, IEEE, 2019, pp. 963–973.

[10] Michael Bader, Space-filling curves: an introduction with applications in sci- entific computing, vol. 9, Springer Science & Business Media, 2012.

[11] Furqan Baig, Hoang Vo, Tahsin Kurc, Joel Saltz, and Fusheng Wang, Sparkgis: Resource aware efficient in-memory spatial query processing, ACM SIGSPA- TIAL, ACM, 2017, p. 28.

[12] Mohamed Bakli, Mahmoud Sakr, and Taysir Hassan A Soliman, Hadooptrajec- tory: a hadoop spatiotemporal data processing extension, Journal of geograph- ical systems 21 (2019), no. 2, 211–235.

[13] Cagri Balkesen, Gustavo Alonso, Jens Teubner, and M Tamer Özsu, Multi-core, main-memory joins: Sort vs. hash revisited, VLDBJ 7 (2013), 85–96.

[14] Cagri Balkesen, Jens Teubner, Gustavo Alonso, and M Tamer Özsu, Main-memory hash joins on multi-core cpus: Tuning to the underlying hardware, ICDE, IEEE, 2013, pp. 362–373.

[15] Çağrı Balkesen, Jens Teubner, Gustavo Alonso, and M Tamer Özsu, Main-memory hash joins on modern processor architectures, TKDE 27 (2014), 1754–1766.

[16] Ronald Barber, Guy Lohman, Ippokratis Pandis, Vijayshankar Raman, Richard Sidle, G Attaluri, Naresh Chainani, Sam Lightstone, and David Sharpe, Memory-efficient hash joins, VLDBJ (2014), 353–364.

[17] Doug Baskins, General purpose dynamic array - judy, judy.sourceforge. net/index.html [Online. Last accessed: 2020-Jan-18], 2002.

[18] Andrew Baumann, Paul Barham, Pierre-Evariste Dagand, Tim Harris, Re- becca Isaacs, Simon Peter, Timothy Roscoe, Adrian Sch¨upbach, and Akhilesh Singhania, The multikernel: A new os architecture for scalable multicore sys- tems, SIGOPS SOSP, 2009, pp. 29–44.

[19] Andrew Baumann, Paul Barham, Pierre-Evariste Dagand, Tim Harris, Re- becca Isaacs, Simon Peter, Timothy Roscoe, Adrian Sch¨upbach, and Akhilesh Singhania, The multikernel: a new OS architecture for scalable multicore sys- tems, SOSP, ACM, 2009, pp. 29–44.

[20] R Bayer and E Mccreight, Organization and maintenance of large ordered in- dexes, Acta Informatica 1 (1972), 173–189.

[21] Paul Beame, Paraschos Koutris, and Dan Suciu, Skew in parallel query pro- cessing, PODS, ACM, 2014, pp. 212–223.

[22] Jon Louis Bentley, Multidimensional binary search trees used for associative searching, Communications of the ACM 18 (1975), no. 9, 509–517.

[23] Brad Benton, Ccix, gen-z, opencapi: Overview and comparison, OpenFabrics Alliance 13th Workshop, 2017.

161 [24] Emery D Berger et al., Hoard: A scalable memory allocator for multithreaded applications, ACM SIGARCH 28 (2000), no. 5, 117–128.

[25] Timo Bingmann, Stx b+ tree c++ template classes, github.com/ bingmann/stx-btree [Online. Last accessed: 2020-Jan-18], 2013.

[26] Timo Bingmann, STX B+ Tree, panthema.net/2007/stx-btree [On- line. Last accessed: 2020-Jan-18], 2019.

[27] Robert Binna, Eva Zangerle, Martin Pichl, G¨unther Specht, and Viktor Leis, Hot: A height optimized trie index for main-memory database systems, SIG- MOD, ACM, 2018, pp. 521–534.

[28] Sergey Blagodurov, Sergey Zhuravlev, Alexandra Fedorova, and Ali Kamali, A case for numa-aware contention management on multicore systems, PACT, ACM, 2010, pp. 557–558.

[29] Spyros Blanas, Yinan Li, and Jignesh M Patel, Design and evaluation of main memory hash join algorithms for multi-core cpus, SIGMOD, ACM, 2011.

[30] Peter Boncz, Angelos-Christos Anatiotis, and Steffen Kl¨abe, Jcc-h: Adding join crossing correlations with skew to tpc-h, TPCTC, Springer, 2017, pp. 103–119.

[31] Silas Boyd-Wickizer, Haibo Chen, Rong Chen, Yandong Mao, Frans Kaashoek, Robert Morris, Aleksey Pesterev, Lex Stein, Ming Wu, Yuehua Dai, Yang Zhang, and Zheng Zhang, Corey: An operating system for many cores, USENIX OSDI, OSDI’08, 2008, pp. 43–57.

[32] John Cieslewicz and Kenneth A Ross, Adaptive aggregation on chip multipro- cessors, VLDBJ, 2007, pp. 339–350.

[33] Douglas Comer, Ubiquitous b-tree, ACM Computing Surveys (CSUR) 11 (1979), no. 2, 121–137.

162 [34] Jonathan Corbet, AutoNUMA: the other approach to NUMA scheduling, lwn. net/Articles/488709 [Online. Last accessed: 2020-Jan-20], 2012.

[35] Transaction Processing Performance Council, Tpc-h benchmark specification 2.18.0 rc2, 2017.

[36] Rachel Courtland, Transistors could stop shrinking in 2021, IEEE Spectrum 53 (2016), 9–11.

[37] Alain Crolotte and Ahmad Ghazal, Introducing Skew into the TPC-H Bench- mark, TPCTC, 2012, pp. 137–145.

[38] Al Danial, CLOC, github.com/AlDanial/cloc [Online. Last accessed: 2019-Dec-04].

[39] Mohammad Dashti, Alexandra Fedorova, Justin Funston, Fabien Gaud, Re- naud Lachaize, Baptiste Lepers, Vivien Quema, and Mark Roth, Traffic man- agement: A holistic approach to memory placement on numa systems, ASP- LOS, ASPLOS ’13, ACM, 2013, pp. 381–394.

[40] Jeffrey Dean and Sanjay Ghemawat, MapReduce: simplified data processing on large clusters, Commun. of the ACM 51 (2008), 107–113.

[41] Xin Ding, Yuanliang Zhang, Lu Chen, Yunjun Gao, and Baihua Zheng, Distributed k-nearest neighbor queries in metric spaces, APWeb and WAIM, Springer, 2018, pp. 236–252.

[42] Dominik Durner, Viktor Leis, and Thomas Neumann, On the impact of mem- ory allocation on high-performance query processing, DaMoN, DaMoN’19, ACM, 2019, pp. 21:1–21:3.

[43] Lieven Eeckhout, Is moores law slowing down? whats next?, IEEE Micro 37 (2017), no. 4, 4–5.

163 [44] Ahmed Eldawy and Mohamed F Mokbel, SpatialHadoop: A mapreduce frame- work for spatial data, ICDE, 2015.

[45] Jason Evans, A scalable concurrent malloc (3) implementation for FreeBSD, BSDCan, 2006.

[46] Bin Fan, David G Andersen, and Michael Kaminsky, Memc3: Compact and concurrent memcache with dumber caching and smarter hashing, NSDI, vol. 13, 2013, pp. 371–384.

[47] Chen Feng, Xi Yang, Fan Liang, Xian-He Sun, and Zhiwei Xu, LCIndex: a local and clustering index on distributed ordered tables for flexible multi-dimensional range queries, ICPP, 2015, pp. 719–728.

[48] Raphael A. Finkel and Jon Louis Bentley, Quad trees a data structure for retrieval on composite keys, Acta informatica 4 (1974), 1–9.

[49] Kenneth Flamm, Measuring moores law: Evidence from price, cost, and quality indexes, Tech. report, National Bureau of Economic Research, 2018.

[50] Anthony Fox, Chris Eichelberger, James Hughes, and Skylar Lyon, Spatio- temporal indexing in non-relational distributed databases, IEEE Big Data, 2013, pp. 291–299.

[51] Edward A Fox, Qi Fan Chen, Amjad M Daoud, and Lenwood S Heath, Order- preserving minimal perfect hash functions and information retrieval, TOIS 9 (1991), 281–308.

[52] Lars George, Hbase: the definitive guide: random access to your planet-size data, “ O’Reilly Media, Inc.”, 2011.

164 [53] Sanjay Ghemawat and Paul Menage, Tcmalloc: Thread-caching malloc, github.com/gperftools/gperftools [Online. Last accessed: 2020- Jan-19], 2015.

[54] Jana Giceva, Operating Systems Support for Data Management on Mod- ern Hardware, sites.computer.org/debull/A19mar/p36.pdf [On- line. Last accessed: 2020-Jan-18], 2019.

[55] Jana Giceva, Gustavo Alonso, Timothy Roscoe, and Tim Harris, Deployment of query plans on multicores, VLDB Endowment 8 (2014), no. 3, 233–244.

[56] Jana Giceva, Adrian Sch¨upbach, Gustavo Alonso, and Timothy Roscoe, To- wards database/operating system co-design, SFMA, vol. 12, Citeseer, 2012.

[57] Jana Giceva, Gerd Zellweger, Gustavo Alonso, and Timothy Rosco, Customized OS support for data-processing, DaMoN, ACM, 2016, p. 2.

[58] Goetz Graefe, Ross Bunker, and Shaun Cooper, Hash joins and hash teams in microsoft sql server, VLDB, 1998, pp. 86–97.

[59] Goetz Graefe, Ann Linville, and Leonard D. Shapiro, Sort vs. hash revisited, TKDE 6 (1994), 934–944.

[60] Jim Gray, Surajit Chaudhuri, Adam Bosworth, Andrew Layman, Don Re- ichart, Murali Venkatrao, Frank Pellow, and Hamid Pirahesh, Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals, Data mining and knowledge discovery 1 (1997), 29–53.

[61] Jim Gray, Prakash Sundaresan, Susanne Englert, Ken Baclawski, and Peter J Weinberger, Quickly generating billion-record synthetic databases, SIGMOD, ACM, 1994, pp. 243–252.

165 [62] Robert Treat Greg Smith and Christopher Browne, Tuning your postgresql server, wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_ Server [Online. Last accessed: 2020-April-28].

[63] Hugo Guiroux, Renaud Lachaize, and Vivien Qu´ema, Multicore locks: The case is not closed yet, USENIX ATC, 2016, pp. 649–662.

[64] Antonin Guttman, R trees: A dynamic index structure for spatial searching, SIGMOD, vol. 14, 01 1984, pp. 47–57.

[65] Apache Hadoop, hadoop.apache.org/ [Online. Last accessed: 2020-Jan-19].

[66] G. Hager, G. Wellein, and J. Treibig, LIKWID: A Lightweight Performance- Oriented Tool Suite for x86 Multicore Environments, ICPP, IEEE Computer Society, 2010, pp. 207–216.

[67] Danny Hendler, Itai Incze, Nir Shavit, and Moran Tzafrir, Flat combining and the synchronization-parallelism tradeoff, SPAA, 2010, pp. 355–364.

[68] John L Hennessy and David A Patterson, Computer architecture: a quantita- tive approach, Elsevier, 2011.

[69] C. A. R. Hoare, Algorithm 64: Quicksort, Commun. ACM (1961), 321.

[70] Yingjie Hu, Krzysztof Janowicz, David Carral, Simon Scheider, Werner Kuhn, Gary Berg-Cross, Pascal Hitzler, Mike Dean, and Dave Kolas, A geo-ontology design pattern for semantic trajectories, COSIT, Springer, 2013, pp. 438–456.

[71] Qi Huang, Helga Gudmundsdottir, Ymir Vigfusson, Daniel A Freedman, Ken Birman, and Robbert van Renesse, Characterizing load imbalance in real-world networked caches, HotNets, ACM, 2014, p. 8.

166 [72] Richard L Hudson, Bratin Saha, Ali-Reza Adl-Tabatabai, and Benjamin C Hertzberg, Mcrt-malloc: a scalable transactional memory allocator, ISMM, ACM, 2006, pp. 74–83.

[73] Peng Jiang and Gagan Agrawal, Efficient simd and mimd parallelization of hash-based aggregation by conflict mitigation, ICS, ACM, 2017, p. 24.

[74] Nicolai M Josuttis, The c++ standard library: a tutorial and reference, Addison-Wesley, 2012.

[75] Karthik Kambatla, Giorgos Kollias, Vipin Kumar, and Ananth Grama, Trends in big data analytics, Journal of Parallel and Distributed Computing 74 (2014), no. 7, 2561–2573.

[76] Svilen Kanev, Juan Pablo Darago, Kim Hazelwood, Parthasarathy Ran- ganathan, Tipp Moseley, Gu-Yeon Wei, and David Brooks, Profiling a warehouse-scale computer, ACM SIGARCH Computer Architecture News 43 (2016), no. 3, 158–169.

[77] Thomas Kejser, TPC-H Schema and Indexes, kejser.org/ tpc-h-schema-and-indexes/, Jun 2014 (accessed June 16, 2017).

[78] Alfons Kemper and Thomas Neumann, Hyper: A hybrid oltp&olap main mem- ory database system based on virtual memory snapshots, ICDE, IEEE, 2011, pp. 195–206.

[79] Nathan Thomas Kerr, Alternative Approaches to Parallel GIS Processing, Mas- ter’s thesis, Arizona State University, 2009.

[80] Tim Kiefer, Benjamin Schlegel, and Wolfgang Lehner, Experimental evaluation of numa effects on database management systems, 2013.

167 [81] Changkyu Kim, Tim Kaldewey, Victor W Lee, Eric Sedlar, Anthony D Nguyen, Nadathur Satish, Jatin Chhugani, Andrea Di Blas, and Pradeep Dubey, Sort vs. hash revisited: fast join implementation on modern multi-core cpus, VLDBJ (2009), 1378–1389.

[82] Wooyoung Kim and Michael Voss, Multicore desktop programming with intel threading building blocks, IEEE software 28 (2011), no. 1, 23–31.

[83] Adam Kirsch, Michael Mitzenmacher, and Udi Wieder, More robust hashing: Cuckoo hashing with a stash, ESA, 2008, pp. 611–622.

[84] Thomas Kissinger, Tim Kiefer, Benjamin Schlegel, Dirk Habich, Daniel Molka, and Wolfgang Lehner, ERIS: A NUMA-aware in-memory storage engine for analytical workloads, VLDB Endowment 7 (2014), no. 14, 1–12.

[85] Mohammad Kolahdouzan and Cyrus Shahabi, Voronoi-based k nearest neighbor search for spatial network databases, VLDB, VLDB Endowment, 2004, pp. 840– 851.

[86] Alexey Kukanov and Michael J Voss, The Foundations for Scalable Multi-core Software in Intel Threading Building Blocks, 2007.

[87] Sailesh Kumar, Jonathan Turner, and Patrick Crowley, Peacock hashing: De- terministic and updatable hashing for high performance networking, INFO- COM, IEEE, 2008, pp. 101–105.

[88] Bradley C Kuszmaul, SuperMalloc: a super fast multithreaded malloc for 64-bit machines, SIGPLAN Notices, vol. 50, ACM, 2015, pp. 41–55.

[89] Harald Lang, Viktor Leis, Martina-Cezara Albutiu, Thomas Neumann, and Alfons Kemper, Massively parallel numa-aware hash joins, IMDM, 2013, pp. 3– 14.

168 [90] Jonathan K. Lawder and Peter J. H. King, Querying multi-dimensional data indexed using the hilbert space-filling curve, ACM Sigmod Record 30 (2001), 19–24.

[91] Doug Lea, Doug Lea's malloc (dlmalloc), gee.cs.oswego.edu/dl/html/malloc.html [Online. Last accessed: 2019-May-05], 2000.

[92] Hugh Leather, Edwin Bonilla, and Michael O’boyle, Automatic feature gener- ation for machine learning–based optimising compilation, ACM Transactions on Architecture and Code Optimization (TACO) 11 (2014), no. 1, 1–32.

[93] Tobin J Lehman and Michael J Carey, A study of index structures for main memory database management systems, Proc. VLDB, vol. 1, 1986.

[94] Viktor Leis, Peter Boncz, Alfons Kemper, and Thomas Neumann, Morsel- driven parallelism: a numa-aware query evaluation framework for the many- core age, SIGMOD, ACM, 2014, pp. 743–754.

[95] Viktor Leis, Alfons Kemper, and Thomas Neumann, The adaptive radix tree: Artful indexing for main-memory databases, ICDE, IEEE, 2013, pp. 38–49.

[96] Viktor Leis, Florian Scheibner, Alfons Kemper, and Thomas Neumann, The art of practical synchronization, DaMoN, ACM, 2016, p. 3.

[97] Kevin Lepak, Gerry Talbot, Sean White, Noah Beck, Sam Naffziger, et al., The next generation amd enterprise server product architecture, 2017.

[98] Baptiste Lepers, Vivien Qu´ema, and Alexandra Fedorova, Thread and mem- ory placement on numa systems: Asymmetry matters., USENIX ATC, 2015, pp. 277–289.

[99] Justin J Levandoski, David B Lomet, and Sudipta Sengupta, The bw-tree: A b-tree for new hardware platforms, ICDE, IEEE, 2013, pp. 302–313.

[100] LevelDB Java Version, github.com/dain/leveldb [Online. Last accessed: 2019-Nov-10].

[101] Shen Li, Shaohan Hu, Raghu K Ganti, Mudhakar Srivatsa, and Tarek F Ab- delzaher, Pyro: A spatial-temporal big-data storage system., USENIX Annual Technical Conference, 2015, pp. 97–109.

[102] Xiaozhou Li, David G Andersen, Michael Kaminsky, and Michael J Freedman, Algorithmic improvements for fast concurrent cuckoo hashing, EuroSys, ACM, 2014, pp. 27:1–27:14.

[103] Shenshen Liang, Ying Liu, Cheng Wang, and Liheng Jian, A cuda-based parallel implementation of k-nearest neighbor algorithm, CyberC, IEEE, 2009, pp. 291– 296.

[104] Haojun Liao, Jizhong Han, and Jinyun Fang, Multi-dimensional index on hadoop distributed file system, Networking, Architecture and Storage (NAS), 2010, pp. 240–249.

[105] Antoine Limasset, Guillaume Rizk, Rayan Chikhi, and Pierre Peterlongo, Fast and scalable minimal perfect hashing for massive key sets, 2017.

[106] Youzhong Ma, Yu Zhang, and Xiaofeng Meng, St-hbase: a scalable data management system for massive geo-tagged objects, WAIM, Springer, 2013, pp. 155–166.

[107] Chris Mack, The multiple lives of moore’s law, IEEE Spectrum 52 (2015), 31–31.

[108] Yandong Mao, Eddie Kohler, and Robert Tappan Morris, Cache craftiness for fast multicore key-value storage, Eurosys, ACM, 2012, pp. 183–196.

170 [109] Sally A McKee et al., Reflections on the memory wall, Conf. Computing Fron- tiers, 2004, p. 162.

[110] Frank McSherry, Michael Isard, and Derek G. Murray, Scalability! but at what COST?, HOTOS, USENIX Association, 2015, p. 14.

[111] Puya Memarzia, Maria Patrou, Md Mahbub Alam, Suprio Ray, Virendra C Bhavsar, and Kenneth B Kent, Toward efficient processing of spatio-temporal workloads in a distributed in-memory system, MDM, IEEE, 2019, pp. 118–127.

[112] Puya Memarzia, Suprio Ray, and Virendra C Bhavsar, On improving data skew resilience in main-memory hash joins, IDEAS, ACM, 2018, pp. 226–235.

[113] Puya Memarzia, Suprio Ray, and Virendra C Bhavsar, A Six-dimensional Analysis of In-memory Aggregation, EDBT, 2019, pp. 289–300.

[114] Puya Memarzia, Suprio Ray, and Virendra C Bhavsar, Toward efficient in- memory data analytics on numa systems, arXiv preprint arXiv:1908.01860 (2019).

[115] Puya Memarzia, Suprio Ray, and Virendra C Bhavsar, The art of efficient in-memory query processing on numa systems: a systematic approach, ICDE, 2020.

[116] Daniel Molka, Daniel Hackenberg, Robert Sch¨one, and Wolfgang E Nagel, Cache coherence protocol and memory performance of the intel haswell-ep ar- chitecture, ICPP, IEEE, 2015, pp. 739–748.

[117] MonetDB B.V., MonetDB, monetdb.org [Online. Last accessed: 2019-June- 01], 2018.

171 [118] Morton, G.M., A computer oriented geodetic data base; and a new technique in file sequencing, Tech. report, Technical Report, Ottawa. IBM Ltd, Canada, 1966.

[119] MOTO (Moving Objects Trace generatOr), moto.sourceforge.net [Online. Last accessed: 2020-Jan-19].

[120] Ingo Müller, Peter Sanders, Arnaud Lacurie, Wolfgang Lehner, and Franz Färber, Cache-efficient aggregation: Hashing is sorting, SIGMOD, 2015.

[121] David R Musser, Introspective sorting and selection algorithms, Software: prac- tice and experience 27 (1997), 983–993.

[122] Neo4j, Inc., Neo4j Graph Database, neo4j.com [Online. Last accessed: 2020- Jan-20].

[123] Nicholas Nethercote and Julian Seward, Valgrind: a framework for heavyweight dynamic binary instrumentation, ACM Sigplan notices, vol. 42, ACM, 2007, pp. 89–100.

[124] Gustavo Niemeyer, Geohash, en.wikipedia.org/wiki/Geohash [On- line. Last accessed: 2019-Nov-12].

[125] Shoji Nishimura, Sudipto Das, Divyakant Agrawal, and Amr El Abbadi, MD- HBase: A scalable multi-dimensional data infrastructure for location aware services, MDM, 2011.

[126] Patrick O’Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O’Neil, The log- structured merge-tree (LSM-tree), Acta Informatica 33 (1996), no. 4, 351–385.

[127] Oracle Corporation, MySQL, mysql.com [Online. Last accessed: 2019-July- 05], 2019.

[128] Fatma Özcan, Yuanyuan Tian, and Pınar Tözün, Hybrid transactional/analytical processing: A survey, SIGMOD, 2017, pp. 1771–1775.

[129] Rasmus Pagh and Flemming Friche Rodler, Cuckoo hashing, ESA, Springer, 2001, pp. 121–133.

[130] Varun Pandey, Andreas Kipf, Thomas Neumann, and Alfons Kemper, How good are modern spatial analytics systems?, VLDB Endowment 11 (2018), no. 11, 1661–1673.

[131] Jignesh M Patel, Harshad Deshmukh, Jianqiao Zhu, Navneet Potti, Zuyu Zhang, Marc Spehlmann, Hakan Memisoglu, and Saket Saurabh, Quickstep: A data platform based on the scaling-up approach, VLDB Endowment 11 (2018), no. 6, 663–676.

[132] Maria Patrou, Md Mahbub Alam, Puya Memarzia, Suprio Ray, Virendra C Bhavsar, Kenneth B Kent, and Gerhard W Dueck, Distil: a distributed in- memory data processing system for location-based services, ACM SIGSPA- TIAL, ACM, 2018, pp. 496–499.

[133] Md Mostofa Ali Patwary, Nadathur Rajagopalan Satish, Narayanan Sundaram, Jialin Liu, Peter Sadowski, Evan Racah, Suren Byna, Craig Tull, Wahid Bhimji, Pradeep Dubey, et al., Panda: Extreme scale parallel k-nearest neigh- bor on distributed architectures, IPDPS, 2016, pp. 494–503.

[134] Chuck Pheatt, Intel R threading building blocks, Journal of Computing Sciences in Colleges 23 (2008), 298–298.

[135] Mihail Popov, Alexandra Jimborean, and David Black-Schaffer, Efficient thread/page/parallelism autotuning for numa systems, ICS, ACM, 2019, pp. 342–353.

173 [136] Danica Porobic, Erietta Liarou, Pinar Tozun, and Anastasia Ailamaki, Atra- pos: Adaptive transaction processing on hardware islands, ICDE, 03 2014, pp. 688–699.

[137] PostgreSQL Global Development Group, PostgreSQL, postgresql.org [Online. Last accessed: 2019-July-01], 2019.

[138] David MW Powers, Applications and explanations of zipf’s law, NeM- LaP3/CoNLL98, Association for Computational Linguistics, 1998, pp. 151– 160.

[139] Iraklis Psaroudakis, Stefan Kaestle, Matthias Grimmer, Daniel Goodman, Jean-Pierre Lozi, and Tim Harris, Analytics with smart arrays: adaptive and efficient language-independent data, EuroSys, ACM, 2018, p. 17.

[140] Iraklis Psaroudakis, Tobias Scheuer, Norman May, Abdelkader Sellami, and Anastasia Ailamaki, Scaling up concurrent main-memory column-store scans: towards adaptive numa-aware data and task placement, VLDBJ 8 (2015), no. 12, 1442–1453.

[141] Iraklis Psaroudakis, Tobias Scheuer, Norman May, Abdelkader Sellami, and Anastasia Ailamaki, Adaptive NUMA-aware Data Placement and Task Scheduling for Analytical Workloads in Main-memory Column-stores, VLDB (2016), 37–48.

[142] Suprio Ray, Rolando Blanco, and Anil K. Goel, High performance location- based services in a main-memory database, Geoinformatica 21 (2017), no. 2, 293–322.

[143] Red Hat Inc, Red Hat Enterprise Linux Product Documentation, 2018.

174 [144] Stefan Richter, Victor Alvarez, and Jens Dittrich, A seven-dimensional analy- sis of hashing methods and its implications on query processing, VLDBJ (2015), 96–107.

[145] Brian M Rogers, Anil Krishna, Gordon B Bell, Ken Vu, Xiaowei Jiang, and Yan Solihin, Scaling the bandwidth wall: challenges in and avenues for cmp scaling, ACM SIGARCH 37 (2009), no. 3, 371–382.

[146] Steven J Ross, The spreadsort high-performance general-case sorting algo- rithm., PDPTA, 2002, pp. 1100–1106.

[147] Vijay Saraswat, George Almasi, Ganesh Bikshandi, Calin Cascaval, David Cunningham, David Grove, Sreedhar Kodali, Igor Peshansky, and Olivier Tardieu, The asynchronous partitioned global address space model, The First Workshop on Advances in Message Passing, 2010, pp. 1–8.

[148] Nadathur Satish, Changkyu Kim, Jatin Chhugani, Anthony D Nguyen, Vic- tor W Lee, Daehyun Kim, and Pradeep Dubey, Fast sort on cpus and gpus: a case for bandwidth oblivious simd sort, SIGMOD, ACM, 2010, pp. 351–362.

[149] Georg Sauthoff, cgmemtime, github.com/gsauthof/cgmemtime [Online. Last accessed: 2019-Nov-25].

[150] Matt Scarpino, Crunching numbers with avx and avx2, codeproject.com/Articles/874396/Crunching-Numbers-with-AVX-and-AVX [Online. Last accessed: 2020-Jan-20], 2016.

[151] Stefan Schuh, Xiao Chen, and Jens Dittrich, An experimental comparison of thirteen relational equi-joins in main memory, SIGMOD, ACM, 2016, pp. 1961–1976.

[152] Anil Shanbhag, Holger Pirk, and Sam Madden, Locality-adaptive parallel hash joins using hardware transactional memory, IMDM, 2016.

175 [153] Ambuj Shatdal and Jeffrey F Naughton, Adaptive parallel aggregation algo- rithms, SIGMOD, vol. 24, ACM, 1995, pp. 104–114.

[154] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler, The hadoop distributed file system, Mass storage systems and technologies (MSST), 2010, pp. 1–10.

[155] Craig Silverstein, Google sparsehash, 2015.

[156] Teja Singh, Sundar Rangarajan, Deepesh John, Carson Henrion, Shane Southard, Hugh McIntyre, Amy Novak, Stephen Kosonocky, Ravi Jotwani, Alex Schaefer, et al., 3.2 zen: A next-generation high-performance× 86 core, ISSCC, IEEE, 2017, pp. 52–53.

[157] Nikos Sismanis, Nikos Pitsianis, and Xiaobai Sun, Parallel search of k-nearest neighbors with synchronous operations, 2012 IEEE Conference on High Perfor- mance Extreme Computing, IEEE, 2012, pp. 1–6.

[158] Uthayasankar Sivarajah, Muhammad Mustafa Kamal, Zahir Irani, and Vis- hanth Weerakkody, Critical analysis of big data challenges and analytical meth- ods, Journal of Business Research 70 (2017), 263–286.

[159] Anders Skovsgaard, Darius Sidlauskas, and Christian S Jensen, Scalable top-k spatio-temporal term querying, ICDE, IEEE, 2014, pp. 148–159.

[160] Ram Sriharsha, Magellan: geospatial analytics on spark, github.com/ harsha2010/magellan [Online. Last accessed: 2019-Oct-17], 2018.

[161] Gary J Sullivan and Richard L Baker, Efficient quadtree coding of images and video, IEEE Transactions on image processing 3 (1994), no. 3, 327–331.

176 [162] Mingjie Tang, Yongyang Yu, Qutaibah M Malluhi, Mourad Ouzzani, and Walid G Aref, LocationSpark: A distributed in-memory data management sys- tem for big spatial data, VLDBJ 9 (2016), no. 13, 1565–1568.

[163] Steven Ross, Francisco Tapia, and Orson Peters, Boost c++ library 1.67, boost.org [Online. Last accessed: 2020-Jan-20], April 2018.

[164] Olivier Tardieu, The APGAS library: resilient parallel and distributed pro- gramming in Java 8, ACM SIGPLAN Workshop, ACM, 2015, pp. 25–26.

[165] Olivier Tardieu, Benjamin Herta, David Cunningham, David Grove, Prabhan- jan Kambadur, Vijay Saraswat, Avraham Shinnar, Mikio Takeuchi, and Man- dana Vaziri, X10 and APGAS at petascale, ACM SIGPLAN Notices, vol. 49, ACM, 2014, pp. 53–66.

[166] GCC Team et al., Gcc, the gnu compiler collection, 2018.

[167] The Apache Software Foundation, Apache accumulo, accumulo.apache. org [Online. Last accessed: 2019-Oct-16].

[168] The glibc project developers, The GNU C Library (glibc), gnu.org/ software/libc/ [Online. Last accessed: 2020-Jan-17], 2019.

[169] Tiger dataset, developers.google.com/earth-engine/datasets/catalog/TIGER_2016_States [Online. Last accessed: 2020-Jan-20].

[170] Stephen Tu, Wenting Zheng, Eddie Kohler, Barbara Liskov, and Samuel Mad- den, Speedy transactions in multicore in-memory databases, SOSP, ACM, 2013, pp. 18–32.

177 [171] Akira Umayabara and Hayato Yamana, MCMalloc: A scalable memory allo- cator for multithreaded applications on a many-core shared-memory machine, IEEE Big Data, IEEE, 2017, pp. 4846–4848.

[172] Richard L Villars, Carl W Olofson, and Matthew Eastwood, Big data: What it is and why you should care, White Paper, IDC 14 (2011), 1–14.

[173] Scott Vokes, skiplist, github.com/silentbicycle/skiplist [Online. Last accessed: 2019-June-01], 2016.

[174] Brett Walenz, Sudeepa Roy, and Jun Yang, Optimizing iceberg queries with complex joins, SIGMOD, ACM, 2017, pp. 1243–1258.

[175] Li Wang, Minqi Zhou, Zhenjie Zhang, Ming-Chien Shan, and Aoying Zhou, Numa-aware scalable and efficient in-memory aggregation on large domains, TKDE 27 (2015), no. 4, 1071–1084.

[176] Ziqi Wang, Andrew Pavlo, Hyeontaek Lim, Viktor Leis, Huanchen Zhang, Michael Kaminsky, and David G Andersen, Building a bw-tree takes more than just buzz words, SIGMOD, ACM, 2018, pp. 473–488.

[177] David Wentzlaff and Anant Agarwal, Factored operating systems (fos): The case for a scalable operating system for multicores, SIGOPS Oper. Syst. Rev. 43 (2009), no. 2, 76–85.

[178] Wikichip, Advanced vector extensions 512 (avx-512), en.wikichip.org/wiki/x86/avx-512 [Online. Last accessed: 2020-Jan-20].

[179] Eugene Wu and Samuel Madden, Partitioning techniques for fine-grained in- dexing, ICDE, 2011, pp. 1127–1138.

[180] X10 Language Specification - Version 2.6.2, x10.sourceforge.net/documentation/languagespec/x10-latest.pdf [Online. Last accessed: 2019-Oct-21].

[181] Dong Xie, Feifei Li, Bin Yao, Gefei Li, Liang Zhou, and Minyi Guo, Simba: Efficient in-memory spatial analytics, SIGMOD, 2016.

[182] Zhongle Xie, Qingchao Cai, Gang Chen, Rui Mao, and Meihui Zhang, A com- prehensive performance evaluation of modern in-memory indices, ICDE, 2018.

[183] Fumito Yamaguchi and Hiroaki Nishi, Hardware-based hash functions for net- work applications, ICON, IEEE, 2013, pp. 1–6.

[184] Weipeng P Yan and Per-Ake Larson, Eager aggregation and lazy aggregation, VLDB, vol. 95, 1995, pp. 345–357.

[185] Yang Ye, Kenneth A Ross, and Norases Vesdapunt, Scalable aggregation on multicore processors, DMSN, 2011, pp. 1–9.

[186] Simin You, Jianting Zhang, and Le Gruenwald, Large-scale spatial join query processing in cloud, ICDEW, 2015.

[187] Chenhan D Yu, Jianyu Huang, Woody Austin, Bo Xiao, and George Biros, Performance optimization for the k-nearest neighbors kernel on x86 architec- tures, SC, ACM, 2015, p. 7.

[188] Jia Yu, Jinxuan Wu, and Mohamed Sarwat, GeoSpark: A cluster computing framework for processing large-scale spatial data, ACM SIGSPATIAL, 2015, p. 70.

[189] Yuan Yu, Pradeep Kumar Gunda, and Michael Isard, Distributed aggrega- tion for data-parallel computing: interfaces and implementations, SOSP, ACM, 2009, pp. 247–260.

179 [190] Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion Stoica, Spark: Cluster computing with working sets., HotCloud 10 (2010), no. 10-10, 95.

[191] Mikhail Zarubin, Patrick Damme, Thomas Kissinger, Dirk Habich, Wolfgang Lehner, and Thomas Willhalm, Integer compression in nvram-centric data stores: Comparative experimental analysis to dram, DaMoN, DaMoN19, Asso- ciation for Computing Machinery, 2019.

[192] Chi Zhang, Feifei Li, and Jeffrey Jestes, Efficient parallel knn joins for large data in mapreduce, EDBT, ACM, 2012, pp. 38–49.

[193] Jun Zhang, Manli Zhu, Dimitris Papadias, Yufei Tao, and Dik Lun Lee, Location-based spatial queries, SIGMOD, ACM, 2003, pp. 443–454.

[194] Xiangyu Zhang, Jing Ai, Zhongyuan Wang, Jiaheng Lu, and Xiaofeng Meng, An efficient multi-dimensional index for cloud data management, CloudDB, ACM, 2009, pp. 17–24.

[195] Zhigang Zhang, Cheqing Jin, Jiali Mao, Xiaolin Yang, and Aoying Zhou, Trajs- park: A scalable and efficient in-memory management system for big trajectory data, APWeb/WAIM, 2017.

Appendix A

Main Memory Joins - Additional Results

The join runtimes for five hash join configurations on the Gaussian and Zipf near M:K datasets (defined in Chapter 3) are given in Figures A.1 and A.2, respectively. The results are in line with our previous conclusions. Specifically, the non-partitioning configurations provide the best performance, and MH can further improve runtime when the dataset is shuffled.

Figure A.1: Hash Join run time with Variable Build Skew on Gaussian near M:K Dataset - Skylake - 8 threads. [Two panels: (a) Ordered, (b) Shuffled; y-axis: runtime in CPU cycles (billions), broken down into Part, Build, and Probe; x-axis: configuration.]

Figure A.2: Hash Join run time with Variable Build Skew on Zipf near M:K Dataset - Skylake - 8 threads. [Two panels: (a) Ordered, (b) Shuffled; y-axis: runtime in CPU cycles (billions), broken down into Part, Build, and Probe; x-axis: configuration.]

Appendix B

Main Memory Aggregation - Additional Results

In Chapter 4 we performed our experimental evaluation on a machine based on the Intel Skylake architecture. In order to verify that our conclusions are valid for other machines, we also ran our experiments on a dual-socket machine based on the Intel Penryn architecture. The machine contains a pair of Intel Xeon E5472 quad core CPUs and 16GB of main memory. The results depicted in Figure B.1 indicate that Hash LP consistently produces the fastest runtimes and that Spreadsort can narrow the gap with Hash LP at high group-by cardinality. The results agree with our summarized decision flowchart in Chapter 4 and confirm that the efficiency of our Hash LP implementation can apply to other hardware architectures.

Figure B.1: Vector Aggregation Q1 - 100M Records - Intel Harpertown (Penryn) Machine - Variable dataset distribution. [Six panels: (a) Rseq-ord, (b) Rseq-shf, (c) Hhit-ord, (d) Hhit-shf, (e) MovC, (f) Zipf; y-axis: query execution time in CPU cycles (billions); x-axis: group-by cardinality (10^2 to 10^7); series: ART, Judy, Btree, Hash_SC, Hash_LP, Hash_Sparse, Hash_Dense, Hash_LC, Introsort, Spreadsort.]

Appendix C

Query Processing on NUMA Systems - Additional Results

In Figure C.1 we depict a direct head-to-head comparison of five database systems running on a NUMA machine: MonetDB, PostgreSQL, MySQL, DBMSx, and Quickstep. Each database system uses its best configuration with our NUMA strategies applied (as defined in Chapter 5), and we measure the total time to complete all 22 queries of the TPC-H benchmark. The results show that the commercial DBMSx and open-source Quickstep database systems have a clear performance advantage in analytical queries due to their in-memory query execution engines, and MySQL lags far behind due to its lack of support for intra-query parallelism in its query engine.

Figure C.1: Head-to-head DBMS Comparison - All 22 TPC-H queries - Machine A - Best Configuration. [Bar chart of total runtime in seconds (log scale) for MonetDB, PostgreSQL, MySQL, DBMSx, and Quickstep.]

Appendix D

DISTIL+ - Additional System Details and Results

In this appendix, we present additional system details and experimental results for our distributed spatio-temporal data system, DISTIL+, which we introduced in Chapter 6.

D.1 Update Processing

The location update processing algorithm is shown in Algorithm 4. Update processing includes inserting a new record into the in-memory and persistent stores, as well as updating the spatio-temporal index. When a new location update is received from a moving object, the corresponding location record object LRec is inserted into a concurrent queue by the Coordinator component. Further processing is carried out using the producer-consumer paradigm to enable parallel processing. One of the producer activities retrieves a particular location record from the queue (line 4, Algo. 4) and, based on the coordinates of that record, determines the tile ID tileId and the place pl into which the record needs to be inserted (lines 5 and 6). Each producer keeps a fixed-size array, arrayPerPlace, for each place.

Algorithm 4: Update Processing
Input: A new location update, represented by object LRec, is inserted into a concurrent queue RecordQueue by the Coordinator component in Figure 6.2. insertBatchSize is the record insertion batch size. STIndex is the spatio-temporal index.
 1  // Producers
 2  for ProducerId in Producers do
 3      while RecordQueue has more records do
 4          LRec ← RecordQueue.pop();
 5          tileId ← determine tileId from LRec coordinates;
 6          pl ← determine Place from tileId;
 7          add LRec to arrayPerPlace(pl);
 8          if arrayPerPlace(pl).size() == insertBatchSize then
 9              at Place placeId {
10                  add arrayPerPlace(pl) in insertionMaps(pl) };
11              clear arrayPerPlace(pl);
12              l++;
13      for pl in Places do
14          at Place pl {
15              add arrayPerPlace(pl) in insertionMaps(pl) };
16          clear arrayPerPlace(pl);
17      producersDone++;
18  // Consumers
19  for pl in Places do
20      at Place pl for ConsumerId in Consumers do
21          while Producers not done && insertionMaps(pl).size() do
22              keys ← insertionMaps(pl).poll();
23              for key in keys do
24                  LRec ← new LocationRecord(key);
25                  LTable(pl).insert(LRec);
26                  SpatialGridIndex(pl).updateIndex(LRec, STIndex(pl));
27                  serializedRec ← serialize(LRec);
28                  Insert serializedRec into Localstore(pl);

Every time a new record is processed, it is inserted into the array of the corresponding place. For example, if a record should move to place 0, then the producer appends the record to arrayPerPlace[0]. When the size of that arrayPerPlace reaches a specific threshold (insertBatchSize), it is sent to the node as one item (a record-batch), and received by the consumer activities. Finally, the item is inserted into a distributed array of concurrent queues called insertionMaps (lines 8 to 11). This allows batches of records to be processed at each place in a thread-safe manner. When a consumer receives a record-batch item from the queue, it processes the records by inserting them into the in-memory table LTable (line 25) and the persistent local store (lines 27 to 28), and updates the spatio-temporal index (line 26).
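A minimal C++ sketch of this per-place batching pattern is shown below. It is purely illustrative: DISTIL+ implements the pattern in X10, and the tile-to-place mapping, record fields, and shipBatchToPlace function here are hypothetical placeholders.

// Illustrative sketch of the producer-side batching in Algorithm 4: records are
// staged in a per-place buffer and shipped as one batch once the threshold is hit.
#include <cstddef>
#include <cstdio>
#include <vector>

struct LocationRecord { long objectId; double lat, lon; long timestamp; };

constexpr std::size_t kInsertBatchSize = 4;   // tiny value for the example; DISTIL+ uses 6,000

// Stand-in for the X10 "at (place)" transfer into insertionMaps(place).
void shipBatchToPlace(int place, const std::vector<LocationRecord>& batch) {
    std::printf("place %d receives a batch of %zu records\n", place, batch.size());
}

class Producer {
public:
    explicit Producer(int numPlaces) : buffers_(numPlaces) {}

    void process(const LocationRecord& rec, int numTiles) {
        int tileId = static_cast<int>(rec.objectId) % numTiles;   // placeholder for the coordinate-based mapping
        int place  = tileId % static_cast<int>(buffers_.size());  // tile -> place
        std::vector<LocationRecord>& buf = buffers_[place];
        buf.push_back(rec);
        if (buf.size() == kInsertBatchSize) {                     // flush a full record-batch
            shipBatchToPlace(place, buf);
            buf.clear();
        }
    }

    void flushAll() {                                             // drain partial batches at the end
        for (std::size_t p = 0; p < buffers_.size(); ++p)
            if (!buffers_[p].empty()) { shipBatchToPlace(static_cast<int>(p), buffers_[p]); buffers_[p].clear(); }
    }

private:
    std::vector<std::vector<LocationRecord>> buffers_;            // one buffer per place (arrayPerPlace)
};

int main() {
    Producer producer(2);
    for (long id = 0; id < 10; ++id)
        producer.process({id, 45.9, -66.6, 1560000000000L + id}, 16);
    producer.flushAll();
    return 0;
}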

Algorithm 5: SpatialGridIndex UpdateIndex
Input: LRec is the location record to update the index with. STIndex is the per-node local index component at a place.
1  timestampHashCode ← determine hashcode with LRec.getTimestamp();
2  PTIndex ← STIndex.getPartialTemporalIndex();
3  if PTIndex.get(timestampHashCode) is null then
4      create a new interval in PTIndex;
5      intervalSlotEntry ← new TileIndexDatetimeObjectInfo();
6  else
7      intervalSlotEntry ← PTIndex.getTileIndexDatetimeObjectInfo();
8  intervalSlotEntry.addToBitmap(LRec.getObjectId());
9  intervalSlotEntry.addToRidListMap(LRec.getObjectId(), LRec.getRID());

Algorithm 5 outlines the steps to update the spatio-temporal index with a location record. First, the location record's timestamp is hashed to generate timestampHashCode, which will be used as a key (line 1). Then the partial temporal index is checked to see if an entry for the key exists (lines 2 to 3). If there are no existing entries for a timestamp hash, a new interval slot is created in the temporal index (lines 4 to 5). Otherwise, the existing interval is selected (line 7). Lastly, the interval slot is updated with data from the location record (lines 8 to 9).

188 300,000

Figure D.1: Update throughput: batch size/worker threads. [Insertion throughput vs. batch size (1,000-8,000 records) for 4 and 8 worker threads.]
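The C++ sketch below mirrors the per-node structure that Algorithm 5 above updates: a partial temporal index that maps a timestamp hash to an interval slot holding a bitmap of object IDs and a per-object record ID list. It is illustrative only; the concrete types in DISTIL+ are X10 classes, and the interval hashing here is a hypothetical simplification.

// Illustrative sketch of the per-node structure updated by Algorithm 5.
// The types and the interval hashing are hypothetical stand-ins for the X10 code.
#include <cstdint>
#include <unordered_map>
#include <set>
#include <vector>

struct LocationRecord { int64_t objectId; int64_t rid; int64_t timestamp; };

struct IntervalSlot {
    std::set<int64_t> objectBitmap;                                 // stand-in for a bitmap of object IDs
    std::unordered_map<int64_t, std::vector<int64_t>> ridListMap;   // object ID -> record IDs
};

struct PartialTemporalIndex {
    std::unordered_map<int64_t, IntervalSlot> intervals;            // timestamp hash -> interval slot

    void updateIndex(const LocationRecord& rec, int64_t intervalLengthMs) {
        int64_t key = rec.timestamp / intervalLengthMs;             // hypothetical timestamp hash
        IntervalSlot& slot = intervals[key];                        // creates the interval slot if missing
        slot.objectBitmap.insert(rec.objectId);                     // addToBitmap
        slot.ridListMap[rec.objectId].push_back(rec.rid);           // addToRidListMap
    }
};

int main() {
    PartialTemporalIndex idx;
    idx.updateIndex({42, 7, 1560000000000}, 60000);                 // one-minute interval slots
    return 0;
}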

D.1.1 Update Performance

In this section, we evaluate the parameters that affect DISTIL+'s update performance. To do so, we vary several key parameters to determine the best configuration for our system. These experiments also help us examine important factors that affect system performance, such as locality and scalability.

D.1.1.1 Update Batch Size per Worker Thread

In this experiment, we maintain the dataset size at 100M records, and vary the number of update workers, as well as the number of records that are sent to each place per batch. Figure D.1 shows the update throughput for each configuration. Peak throughput is achieved by spawning four workers, and creating batches of 6000 records. The batch size only needs to be large enough to cover the overhead of scheduling. This experiment shows that the number of workers has a greater impact than the batch size.

D.1.1.2 Tile placement policy and dataset size

In this section, we compare the update throughput of our system with two tile placement policies (previously mentioned in Section 6.4.1): the Row-wise Round-Robin (RRR) policy and the Multi-Dimensional Range partitioning (MDR) policy, for different dataset sizes.

Figure D.2: Tile placement policy comparison: variable dataset size. [Update throughput for the RRR and MDR policies with dataset sizes from 10M to 100M records.]

Figure D.3: Update throughput scalability: variable cluster size. [Update throughput for clusters of 2, 4, and 8 nodes.]

As shown in Figure D.2, the RRR policy achieves higher throughput than MDR. Also, the results of RRR are the same regardless of the dataset size. On the other hand, MDR produces lower throughput than RRR, and the throughput decreases as the dataset size increases. The tile placement policies affect the location of the tiles, and therefore the location of the records. The RRR policy distributes the tiles in a way that places nearby tiles on different nodes, whereas MDR aims to place tiles of nearby regions together. In short, RRR attempts to maximize parallelism, whereas MDR aims for more two-dimensional spatial locality.
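The contrast between the two policies can be sketched with the following hypothetical tile-to-place mapping functions (ours, for illustration only; the actual DISTIL+ partitioner is more involved):

// Illustrative sketch of the two tile placement policies compared in Figure D.2.
// The grid dimensions and mapping details are hypothetical simplifications.
#include <cstdio>

// Row-wise Round-Robin (RRR): walk the grid row by row and deal tiles out to
// places in turn, so neighbouring tiles usually land on different nodes.
int placeRRR(int row, int col, int gridCols, int numPlaces) {
    int tileId = row * gridCols + col;
    return tileId % numPlaces;
}

// Multi-Dimensional Range (MDR): keep contiguous blocks of the grid on one
// place (a simplified range partitioning), so nearby tiles stay together.
int placeMDR(int row, int /*col*/, int gridRows, int numPlaces) {
    int rowsPerPlace = (gridRows + numPlaces - 1) / numPlaces;
    return row / rowsPerPlace;
}

int main() {
    const int gridRows = 8, gridCols = 8, places = 4;
    for (int r = 0; r < 2; ++r)
        for (int c = 0; c < 4; ++c)
            std::printf("tile(%d,%d): RRR->place %d, MDR->place %d\n",
                        r, c, placeRRR(r, c, gridCols, places),
                        placeMDR(r, c, gridRows, places));
    return 0;
}

Dealing tiles out round-robin spreads a spatially clustered update workload over every node, which is why RRR sustains higher throughput, while the range-style mapping keeps neighbouring tiles together and favors spatial locality over parallelism.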

D.1.1.3 Node Scaling

Figure D.3 shows DISTIL+'s update throughput scalability with the RRR placement policy and a variable number of nodes. The throughput increases as the number of nodes increases. As more nodes are added, more update workers are used, and therefore the system manages to scale out and overcome the communication and scheduling overheads. We used a batch size of 6,000 records and four update workers per node.

Figure D.4: Illustration of range query (STRQ) processing. [Diagram: in the global processing step, the master node (Node 0) determines the spatial tile IDs for the range query using the global index (GIndex) and sends the query information to each place; in the local processing step, each node finds the objects matching the query using its local STIndex; the master then collects the object IDs from all places and outputs the merged results.]

D.2 Range query (STRQ) processing

An STRQ query, q, is specified by (q.sw, q.tw), where q.sw is the spatial window and q.tw the timeframe. It returns all objects that were inside the specified q.sw during q.tw. All the tiles or grid cells in Gr,s (as mentioned in Section 6.4.3.1) that are fully or partially overlapped by q.sw, i.e. C := {c | c ∈ Gr,s and |c ∩ q.sw| > 0}, need to be processed. We can support intra-query parallelism by processing the tiles c ∈ C in a distributed manner, such that each tile is processed by the node to which it is mapped. Figure D.4 illustrates the distributed execution of an STRQ query. Inter-node communication consists of tile IDs and query results. The algorithm for historical STRQ processing is presented in Algorithm 6. The idea is to limit communication and computation to the nodes that contain the relevant tiles. In line 2 of the algorithm, we use the global index to find the tiles that overlap the query's spatial window. These tiles are then added to the corresponding places, using the distributed array structure for each place (lines 3 to 6). For each relevant place, a query object STRangeQuery is instantiated with the required parameters (lines 8 to 9). The query objects are then executed at each place (line 10), and the results are collected and merged to produce the output (lines 11 to 13).

Algorithm 6: Historical Range Query
Input: An STRQ query, (q.sw, q.tw), where q.sw is the spatial window and q.tw the timeframe. queryPlaces is a query plan fragment to be executed at each place. Query result is qryResult. qryTiles are the tiles per place related to a query that need to be processed. STIndex is the spatio-temporal index.
 1  // Global Processing
 2  // Find the tiles that overlap q.sw from the global index: C ← {c | c ∈ Gr,s and |c ∩ q.sw| > 0};
 3  for c ∈ C do
 4      plc ← find place from tile c;
 5      at Place plc qryTiles(plc).add(c);
 6      queryPlaces(plc).add(plc);
 7  // Local Processing
 8  for plc in queryPlaces do
 9      query ← new STRangeQuery(qryTiles(plc), STIndex(plc), q.sw, q.tw);
10      queries(plc) ← execute query;
11      result ← fetch result at plc from queries(plc);
12      qryResult ← merge result with qryResult;
13  return qryResult;

The details of the per-place query execution are described next. At a given place, we iterate over each tile and check if it fully or partially overlaps the query spatial window. In the first case, we perform bitwise OR operations on the bit-vectors corresponding to interval table entries that are inside the query timeframe. The bits that are set in the resultant bit-vector represent the object IDs that match the query criteria. In the second case of partially overlapped tiles, we need to check for the exact location of the related objects, as well as the actual timeframe of the objects. This is done because the objects might be outside the query window boundaries, and involves creating a union of the record ID list RIDList for the interval table entries that are inside the query timeframe. Then, for each rid entry in the resultant list, the corresponding location record fields of coordinates (latitude, longitude) and timestamp are retrieved from the location table. If the object location is inside the

query spatial window and the timestamp is within the query timeframe, it is included in the partial result. Finally, the object IDs from each tile are gathered and concatenated with other results obtained from the same place. In the end, they are all gathered and sent to the master node, with no need for communication during the execution of the place-specific queries.

Figure D.5: Range query throughput: variable temporal range. [Query throughput vs. timeframe percentage (5% to 20%) for the smallR, medR, and largeR query radii.]
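The fully-overlapped-tile case described above amounts to OR-ing the bit-vectors of the interval slots that fall inside the query timeframe and reading off the set bits. The following C++ sketch shows the idea; it is illustrative only, with std::bitset standing in for DISTIL+'s bitmap type and a simplified interval key.

// Illustrative sketch of the fully-overlapped-tile case: OR the bit-vectors of
// every interval slot inside the query timeframe, then read off the set bits.
// The fixed bitmap width and interval keys are hypothetical simplifications.
#include <bitset>
#include <cstdio>
#include <map>
#include <vector>

constexpr std::size_t kMaxObjects = 1024;            // hypothetical object-ID domain
using ObjectBitmap = std::bitset<kMaxObjects>;

std::vector<std::size_t> matchFullTile(const std::map<long, ObjectBitmap>& intervalTable,
                                       long tBegin, long tEnd) {
    ObjectBitmap result;
    // Visit only interval slots whose key lies in [tBegin, tEnd).
    for (auto it = intervalTable.lower_bound(tBegin);
         it != intervalTable.end() && it->first < tEnd; ++it)
        result |= it->second;                        // bitwise OR accumulates matches
    std::vector<std::size_t> ids;
    for (std::size_t i = 0; i < kMaxObjects; ++i)
        if (result.test(i)) ids.push_back(i);        // set bits are matching object IDs
    return ids;
}

int main() {
    std::map<long, ObjectBitmap> intervalTable;
    intervalTable[100].set(7);
    intervalTable[101].set(42);
    intervalTable[200].set(99);                      // outside the query timeframe below
    for (std::size_t id : matchFullTile(intervalTable, 100, 150))
        std::printf("object %zu\n", id);
    return 0;
}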

D.2.1 Range Query Performance

In this section, we evaluate and analyze the range query performance of DISTIL+ using different experiment variables.

D.2.1.1 Range Query (STRQ) Timeframe

Figure D.5 presents the query throughput while varying the timeframe (query time window) of the queries. As queries use longer time intervals, more data is processed, and the query throughput gradually drops.

Figure D.6: Range query throughput scalability: variable dataset size. [Query throughput vs. dataset size (10M to 100M records) for the smallR, medR, and largeR query radii.]

D.2.1.2 Range Query (STRQ) Dataset Size and Spatial Extent

In this section, we present additional experimental results that explore the performance characteristics of DISTIL+'s range query processing. Here, we evaluate the range query throughput while varying the dataset size (from 10M to 100M records) and the spatial window size (smallR, medR, largeR). Figure D.6 presents the throughput results. Note that there is no significant degradation of query throughput with the larger datasets. The spatial extent has a greater impact on throughput than the dataset size. This is due to the fact that covering a larger query area results in less data locality, as more spatial tiles are processed.

D.2.1.3 Tile Placement Policy and Number of Query Threads

In this experiment, we show how query throughput is affected by the location of the spatial tiles within the cluster. We evaluate range query throughput with the RRR and MDR tile placement policies. The RRR policy outperforms MDR for smaller range queries (Figure D.7a). In Figures D.7b and D.7c we observe that there is a smaller throughput difference between these policies, but the RRR policy continues to achieve higher throughput. Also, we note that query parallelism can be a key factor in improving throughput.

It can be better to have multiple query workers executing the queries in parallel. For example, in Figure D.7a the throughput is typically higher when using multiple worker threads.

Figure D.7: Range query throughput comparison: RRR vs. MDR. [Three panels: (a) Small Query Radius (smallR), (b) Medium Query Radius (medR), (c) Large Query Radius (largeR); query throughput vs. number of worker threads (1, 2, 4, 8) for the RRR and MDR policies.]

Appendix E

Source Code Statistics

In this appendix, we provide statistics on some of the code that we developed for this thesis. We use CLOC [38] to count the lines of code (LOC) in each codebase. We exclude blank lines from the LOC count, and use the --exclude-dir and --include-lang options to exclude imported dependencies (such as libraries and open-source data structures) and to ignore any files that do not match the codebase's primary programming language (such as bash scripts, documentation, and datasets), respectively. A summary of the source codes is provided in Table E.1.

Table E.1: Source Code Summary and Statistics

Name          Language   Description                                  LOC
HashJoin      C/C++      Framework for hash joins, based on [29]      24,689
Aggregation   C/C++      Framework for aggregation workloads          48,844
IndexJoin     C/C++      Framework for index nested loop joins         1,797
DISTIL+       X10        Distributed spatio-temporal data system      11,267
DatagenJoin   Python     Dataset generator for joins                      900
DatagenAgg    Python     Dataset generator for aggregations               340

Vita

Candidate’s full name: Puya Memarzia

University attended (with dates and degrees obtained):
• MSc in Computer Engineering - Software Engineering, Shiraz University, 2014
• BSc in Computer Engineering - Software Engineering, Shiraz Azad University, 2011

Publications:
• The Art of Efficient In-memory Query Processing on NUMA Systems: a Systematic Approach, 36th IEEE International Conference on Data Engineering (ICDE 2020), April 2020
• DISTIL+: A Scalable Spatio-temporal Distributed In-memory Data System, Distributed and Parallel Databases (DAPD), Submitted Dec 2019
• Toward Efficient In-memory Data Analytics on NUMA Systems, arXiv, August 2019 (preprint)
• Toward Efficient Processing of Spatio-temporal Workloads in a Distributed In-memory System, 20th IEEE International Conference on Mobile Data Management (MDM), June 2019
• A Six-dimensional Analysis of In-memory Aggregation, 22nd International Conference on Extending Database Technology (EDBT), March 2019
• DISTIL: A Distributed In-Memory Data Processing System for Location-Based Services, 26th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, November 2018
• On Improving Data Skew Resilience in Main-memory Hash Joins, 22nd International Database Engineering & Applications Symposium (IDEAS), May 2018
• In-depth Study on the Performance Impact of CUDA, OpenCL, and PTX Code, Journal of Information and Computing Science (JIC), Volume 10 No 2, pp. 124-136, February 2015
• Exploring GPU Memory Performance Using Digital Image Processing Algorithms, Indian Journal of Computer Science and Engineering (IJCSE), Volume 5 Issue 6, pp. 221-232, December 2014

Conference Presentations:
• In-memory Aggregation for Big Data Analytics, UNB 2019 Research Expo, April 2019

• The Art of Efficient In-memory Query Processing on NUMA Systems: a Systematic Approach, 36th IEEE International Conference on Data Engineering (ICDE 2020), Dallas, Texas, April 2020