Redesigning Database Systems in Light of CPU Cache Prefetching

Redesigning Database Systems in Light of CPU Cache Prefetching Shimin Chen CMU-CS-05-192 December 2005 School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 Thesis Committee: Anastassia Ailamaki, Co-Chair Todd C. Mowry, Co-Chair Christos Faloutsos Phillip B. Gibbons David J. DeWitt, University of Wisconsin at Madison Submitted in partial ful¯llment of the requirements for the degree of Doctor of Philosophy. Copyright °c 2005 Shimin Chen This research was sponsored by the National Science Foundation (NSF) under grant nos. CCR-0205544 and CCR-0085938, the National Aeronautics and Space Administration (NASA) under grant no. NAG2-1230, and the Microsoft Corporation through a generous fellowship. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of any sponsoring institution, the U.S. government or any other entity. Keywords: Cache Prefetching, Database Systems, CPU Cache Performance, Data Locality Opti- mizations, B+-Trees, Hash Joins. To my parents Guanhu and Aili, my wife Qin, and my daughter Ada. iv Abstract Computer systems have enjoyed an exponential growth in processor speed for the past 20 years, while main memory speed has improved only moderately. Today a cache miss to main memory takes hundreds of processor cycles. Recent studies have demonstrated that on commercial databases, about 50% or more of execution time in memory is often wasted due to cache misses. In light of this problem, a number of recent studies focused on reducing the number of cache misses of database algorithms. In this thesis, we investigate a different approach: reducing the impact of cache misses through a technique called cache prefetching. Since prefetching for sequential array accesses has been well studied, we are interested in studying non-contiguous access patterns found in two classes of database algorithms: the B+-Tree index algorithm and the hash join algorithm. We re-examine their designs with cache prefetching in mind, and combine prefetching and data locality optimizations to achieve good cache performance. For B+-Trees, we first propose and evaluate a novel main memory index structure, Prefetching B+- Trees, which uses prefetching to accelerate two major access patterns of B+-Tree indices: searches and range scans. We then apply our findings in the development of a novel index structure, Fractal Prefetch- ing B+-Trees, that optimizes index operations both for CPU cache performance and for disk performance in commercial database systems by intelligently embedding cache-optimized trees into disk pages. For hash joins, we first exploit cache prefetching separately for the I/O partition phase and the join phase of the algorithm. We propose and evaluate two techniques, Group Prefetching and Software- Pipelined Prefetching, that exploit inter-tuple parallelism to overlap cache misses across the processing of multiple tuples. Then we present a novel algorithm, Inspector Joins, that exploits the free information v obtained from one pass of the hash join algorithm to improve the performance of a later pass. This new algorithm addresses the memory bandwidth sharing problem in shared-bus multiprocessor systems. We compare our techniques against state-of-the-art cache-friendly algorithms for B+-Trees and hash joins through both simulation studies and real machine experiments. Our experimental results demon- strate dramatic performance benefits of our cache prefetching enabled techniques. vi Acknowledgments I must thank my advisors, Todd Mowry and Anastassia Ailamaki, for the tremendous time, energy, and wisdom they invested in my Ph.D. education. Todd and Natassa taught me everything from choosing research topics, to performing high-quality studies, to writing papers and giving talks. Their guidance and support throughout the years are invaluable. I am indebted to Phillip Gibbons for collaborating on my entire thesis research, for participating in our weekly discussions, and for contributing his knowledge, sharpness, and efforts to all the four publications that form the basis of this thesis. I would like to thank the other members of my Ph.D. thesis committee, David DeWitt and Christos Faloutsos, for their thoughtful comments and invaluable suggestions that have improved the quality of the experimental results and the completeness of this thesis. I thank Gary Valentin for collaborating on the fpB+-Tree paper. Gary implemented jump pointer array prefetching in IBM DB2 and collected DB2 results on B+-Tree range scan I/O performance for the paper. I am grateful to Bruce Lindsay for giving a CMU DB Seminar talk on the DB2 hash join implementation, which inspired my hash join work. My thesis work also benefitted from insightful discussions with David Lomet and Per-Ak˚ e Larson. I would like to acknowledge Angela Brown, Kun Gao, John Griffin, Stratos Papadomanolakis, Jiri Schindler, Steve Schlosser, and Minglong Shao, for helping me set up machines for running real-machine experiments in my publications and in this thesis; Chris Colohan and Gregory Steffan, for helping me un- derstand the internals of the “cello” simulator for my simulation studies; members of the CMU database group and STAMPede group, for giving comments on my practice talks; staff of CMU SCS facilities, vii for allowing me to reboot machines in the machine room after working hours; and Sharon Burks, for her administrative helps in arranging my thesis proposal and defense talks. My summer internships expanded my research experience. I would like to thank David Lomet and Phillip Gibbons for mentoring me during my internships. I also want to thank all the people in IBM Toronto Lab, Microsoft Research Redmond, and Intel Research Pittsburgh, for making my summers productive and enjoyable. I thank my friends in Pittsburgh, Toronto, and Seattle for their helps in my CMU graduate life and in my summer internships, and for adding a lot of funs to my life. Finally, I must express my deepest gratitude to my family. I owe a great deal to my parents, Guanhu and Aili, who gave a life, endless love, and persistent encouragement to me. My sister, Yan, and brother- in-law, Gang, are always helping and supportive. I am deeply indebted to my dear wife, Qin, for sharing every moment of the CMU graduate life with me. Without her love, patience, encouragement, and support, it is impossible for me to complete this six and half years of long journey. Last but not least, my one-year-old daughter, Ada, motivated me to finish my thesis with her sweet voice of “baba”. viii Contents Abstract v Acknowledgments vii Contents ix List of Figures xv List of Tables xxi 1 Introduction 1 1.1 Can We Simply Adapt Memory-to-Disk Techniques? . 2 1.2 Cache Optimization Strategies . 5 1.2.1 Reducing the Number of Cache Misses . 5 1.2.2 Reducing the Impact of Cache Misses . 6 1.3 Our Approach: Redesigning Database Systems in Light of CPU Cache Prefetching . 9 1.4 Related Work . 10 1.4.1 Related Work on B+-Trees . 11 1.4.2 Related Work on Hash Joins . 12 1.5 Contributions . 13 1.6 Thesis Organization . 15 2 Exploiting Cache Prefetching for Main Memory B+-Trees 17 ix 2.1 Introduction . 17 2.1.1 Previous Work on Improving the Cache Performance of Indices . 18 2.1.2 Our Approach: Prefetching B+-Trees . 19 2.2 Index Searches: Using Prefetching to Create Wider Nodes . 20 2.2.1 Modifications to the B+-Tree Algorithm . 22 2.2.2 Qualitative Analysis . 24 2.3 Index Scans: Prefetching Ahead Using Jump-Pointer Arrays . 27 2.3.1 Solving the Pointer-Chasing Problem . 28 2.3.2 Implementing Jump-Pointer Arrays to Support Efficient Updates . 30 2.3.3 Prefetching Algorithm . 32 2.3.4 Qualitative Analysis . 33 2.3.5 Internal Jump-Pointer Arrays . 35 2.4 Experimental Results . 36 2.4.1 Itanium 2 Machine Configuration . 37 2.4.2 Simulation Machine Model . 39 2.4.3 B+-Trees Studied and Implementation Details . 41 2.4.4 A Simple Cache Prefetching Experiment: Measuring Memory Bandwidth on the Itanium 2 Machine . 43 2.4.5 Search Performance . 45 2.4.6 Range Scan Performance . 53 2.4.7 Update Performance . 55 2.4.8 Operations on Mature Trees . 61 2.4.9 Sensitivity Analysis . 63 2.4.10 Cache Performance Breakdowns . 64 2.4.11 Impact of Larger Memory Latency . 66 2.5 Discussion and Related Work . 67 2.6 Chapter Summary . 69 x 3 Optimizing Both Cache and Disk Performance for B+-Trees 71 3.1 Introduction . 71 3.1.1 Our Approach: Fractal Prefetching B+-Trees . 72 3.2 Optimizing I/O Performance . 74 3.2.1 Searches: Prefetching and Node Sizes . 75 3.2.2 Range Scans: Prefetching via Jump-Pointer Arrays . 76 3.3 Optimizing CPU Cache Performance . 77 3.3.1 Why Traditional B+-Trees Suffer from Poor Cache Performance? . 77 3.3.2 Previous Approach to Improving B+-Tree Cache Performance . 80 3.3.3 Disk-First fpB+-Trees . 81 3.3.4 Cache-First fpB+-Trees . 85 3.3.5 Improving Range Scan Cache Performance . 88 3.4 Cache Performance through Simulations . 89 3.4.1 Experimental Setup . 90 3.4.2 Search Cache Performance through Simulations . 92 3.4.3 Insertion Cache Performance through Simulations . 95 3.4.4 Deletion Cache Performance through Simulations . 97 3.4.5 Range Scan Cache Performance through Simulations . 98 3.4.6 Mature Tree Cache Performance . 99 3.4.7 Results with Larger Key Size . 100 3.5 Cache Performance on an Itanium 2 Machine . 101 3.5.1 Experimental Setup . 102 3.5.2 Search Performance on Itanium 2 . 104 3.5.3 Insertion Performance on Itanium 2 . 104 3.5.4 Deletion Performance on Itanium 2 .

Redesigning Database Systems in Light of CPU Cache Prefetching

Memory Hierarchy Memory Hierarchy

Make the Most out of Last Level Cache in Intel Processors In: Proceedings of the Fourteenth Eurosys Conference (Eurosys'19), Dresden, Germany, 25-28 March 2019

Migration from IBM 750FX to MPC7447A by Douglas Hamilton European Applications Engineering Networking and Computing Systems Group Freescale Semiconductor, Inc

Caches & Memory

Stealing the Shared Cache for Fun and Profit

IBM Power Systems Performance Report Apr 13, 2021

Cache & Memory System

Cache-Fair Thread Scheduling for Multicore Processors

A Cache Line Fill Circuit for a Micropipelined, Asynchronous Microprocessor

Quickspecs HP Integrity Rx7640 Server Overview

Exploiting Cache Side Channels on CPU-FPGA Cloud Platforms

The Tradeoffs of Fused Memory Hierarchies in Heterogeneous Computing Architectures