Saturation in Lock-Based Concurrent Data Structures (Ph.D. Dissertation)
SATURATION IN LOCK-BASED CONCURRENT DATA STRUCTURES

by Kenneth Joseph Platz

APPROVED BY SUPERVISORY COMMITTEE:
S. Venkatesan, Co-Chair
Neeraj Mittal, Co-Chair
Ivor Page
Cong Liu

Copyright © 2017 Kenneth Joseph Platz. All rights reserved.

This dissertation is dedicated to my wife and my parents. They have always believed in me even when I did not.

SATURATION IN LOCK-BASED CONCURRENT DATA STRUCTURES

by KENNETH JOSEPH PLATZ, BS, MS

DISSERTATION
Presented to the Faculty of The University of Texas at Dallas in Partial Fulfillment of the Requirements for the Degree of DOCTOR OF PHILOSOPHY IN COMPUTER SCIENCE

THE UNIVERSITY OF TEXAS AT DALLAS
December 2017

ACKNOWLEDGMENTS

I would first of all like to thank my wife Tracy for having the patience and fortitude to stand with me through this journey. Without her support, I could never have started this path, much less completed it. Second, I would like to thank my supervising professors, Drs. "Venky" Venkatesan and Neeraj Mittal. They both provided me with frequent guidance and support throughout this entire journey. Finally, I would like to thank the rest of my dissertation committee, Drs. Ivor Page and Cong Liu. Dr. Page in particular asked many pointed (and difficult) questions during my proposal defense, which helped improve the quality of this work.

November 2017

ABSTRACT

SATURATION IN LOCK-BASED CONCURRENT DATA STRUCTURES

Kenneth Joseph Platz, PhD
The University of Texas at Dallas, 2017

Supervising Professors: S. Venkatesan, Co-Chair
                         Neeraj Mittal, Co-Chair

For over three decades, computer scientists enjoyed a "free lunch" inasmuch as they could depend on processor speeds doubling every three years. This came to an end in the mid-2000s, when manufacturers ceased increasing processor speeds and instead focused on designing processors with multiple independent execution units on each chip. The demand for ever-increasing performance continues to grow in this era of multicore and manycore processors. One way to satisfy this demand is to continue developing efficient data structures that permit multiple concurrent readers and writers while guaranteeing correct behavior.

Concurrent data structures synchronize via either locks or atomic read-modify-write instructions (such as Compare-and-Swap). Lock-based data structures are typically less challenging to design, but lock-free data structures can provide stronger progress guarantees.

We first develop two variants of existing lock-based concurrent data structures, a linked list and a skiplist. We demonstrate how we can unroll these data structures to support multiple keys per node, which substantially improves their performance compared to other similar data structures. We next demonstrate how lock-based data structures can saturate, or plateau in performance, at sufficiently high thread counts, depending on the percentage of write operations applied to the data structure. We then discuss how a new technique involving group mutual exclusion can provide a lock-based data structure that is resilient to saturation, and we demonstrate how this technique can be applied to our implementations of linked lists and skiplists to provide scalable performance to 250 threads and beyond.

Our implementations provide excellent throughput for a wide variety of workloads, outperforming many similar lock-based and lock-free data structures. We further discuss how these techniques might apply to other data structures and provide several avenues for future research.
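To make the unrolling idea concrete, below is a minimal C++ sketch of a node that packs multiple keys, together with a membership scan. It is purely illustrative and not taken from the dissertation: the language choice, the capacity constant K, and the names (Node, contains, marked, and so on) are assumptions, and the sketch omits the lazy-synchronization protocol (per-node locking, validation, logical deletion) that the dissertation actually develops in Chapter 4.

    #include <atomic>
    #include <mutex>

    // Illustrative unrolled-list node: up to K keys packed into one node,
    // guarded by a single per-node lock. Fewer nodes means fewer pointer
    // dereferences and better cache behavior than one key per node.
    struct Node {
        static constexpr int K = 8;       // keys per node (illustrative choice)
        std::mutex lock;                  // per-node lock used by writers
        std::atomic<bool> marked{false};  // set when the node is logically deleted
        int count{0};                     // number of keys currently stored
        long keys[K];                     // keys kept in sorted order
        std::atomic<Node*> next{nullptr}; // link to the node holding larger keys
    };

    // Single-threaded membership scan over the unrolled list; a concurrent
    // version would add the validation steps a lazy-synchronization design requires.
    bool contains(Node* head, long key) {
        for (Node* curr = head; curr != nullptr; curr = curr->next.load()) {
            for (int i = 0; i < curr->count; ++i) {
                if (curr->keys[i] == key)
                    return !curr->marked.load();
                if (curr->keys[i] > key)  // keys are sorted, so we can stop early
                    return false;
            }
        }
        return false;
    }

With eight keys per node, a lookup traverses roughly one-eighth as many nodes as a conventional list would, which is the intuition behind the performance improvements the abstract describes.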
TABLE OF CONTENTS

ACKNOWLEDGMENTS
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1  INTRODUCTION
    1.1 Our Contributions
    1.2 Dissertation Roadmap
CHAPTER 2  SYSTEM MODEL
    2.1 Shared Memory System Architecture
    2.2 Synchronization Primitives
    2.3 Correctness Conditions
        2.3.1 Linearizability
        2.3.2 Deadlock-Freedom
CHAPTER 3  BACKGROUND AND RELATED WORK
    3.1 Synchronization in Concurrent Data Structures
        3.1.1 Blocking Techniques
        3.1.2 Non-Blocking Techniques
    3.2 Concurrent Linked Lists
    3.3 Concurrent Skiplists
    3.4 Group Mutual Exclusion
CHAPTER 4  A CONCURRENT UNROLLED LINKED LIST WITH LAZY SYNCHRONIZATION
    4.1 Algorithm Overview
    4.2 Algorithm Detail
    4.3 Correctness Proof
    4.4 Experimental Evaluation
        4.4.1 Experiment Setup
        4.4.2 Experimental Results
        4.4.3 Expansion of Key and Data Sizes
    4.5 Introducing a Per-Thread Shortcut Cache
        4.5.1 Overview of Shortcut Cache
        4.5.2 Detail of Shortcut Cache
        4.5.3 Evaluation of Shortcut Cache
    4.6 Conclusions and Future Work
CHAPTER 5  UNROLLING THE OPTIMISTIC SKIPLIST
    5.1 Introduction
    5.2 Algorithm Overview
    5.3 Detail of an Unrolled Skiplist
        5.3.1 Scan
        5.3.2 Lookup
        5.3.3 Insert
        5.3.4 Remove
    5.4 Correctness Proof
    5.5 Experiment Setup
        5.5.1 Concurrent Skip List Implementations
        5.5.2 Simulation Parameters
        5.5.3 Test System
        5.5.4 Experimental Methodology and Measurements
    5.6 Experimental Results
        5.6.1 Results for 1 Million Keys
        5.6.2 Results for 10 Million Keys
        5.6.3 Discussion
    5.7 Conclusions and Future Work
CHAPTER 6  THE SATURATION PROBLEM
    6.1 Introduction
    6.2 Demonstrating Saturation on Manycore Systems
    6.3 Saturation in Unrolled Linked Lists
    6.4 Saturation in Unrolled Skiplists
    6.5 Further Exploration of Saturation
CHAPTER 7  INCREASING CONCURRENCY WITH GROUP MUTUAL EXCLUSION
    7.1 About Intra-Node Concurrency
    7.2 Evaluation of Group Mutual Exclusion Algorithms
        7.2.1 Survey of Potential GME Algorithms
        7.2.2 Selection of GME Algorithm
    7.3 Introducing Intra-Node Concurrency to the Unrolled Linked List
        7.3.1 Algorithm Overview
        7.3.2 Algorithm Detail
        7.3.3 Experimental Evaluation
        7.3.4 Conclusions
    7.4 Introducing Intra-Node Concurrency to the Unrolled Skiplist
        7.4.1 Algorithm Detail
        7.4.2 Experiment Setup
        7.4.3 Intel Xeon System
        7.4.4 Intel Xeon Phi System
    7.5 An In-Depth Evaluation of the GME-Enabled Skiplist
    7.6 Analysis and Conclusions
CHAPTER 8  CONCLUSION
REFERENCES
BIOGRAPHICAL SKETCH
CURRICULUM VITAE

LIST OF FIGURES

2.1 Example History of a Concurrent Memory Location
3.1 Layout of a skiplist
4.1 Layout of the unrolled linked list
4.2 Experimental Results on System A in Operations per Microsecond
4.3 Experimental Results on System B in Operations per Microsecond
4.4 Effect of Compiler Optimizations on System A
4.5 Effect of Compiler Optimizations on System B
4.6 Impact of Node Size on Throughput in Operations per Microsecond
4.7 Impact of Shortcut Cache
5.1 Layout of the unrolled skiplist
5.2 Experimental Results on Intel Xeon System for one million keys
5.3 Results on Intel Xeon System for 10 million keys
6.1 Performance of Unrolled Linked Lists on Intel Xeon Phi System
6.2 Performance of Unrolled Skiplist on Intel Xeon Phi System
6.3 Performance of Unrolled Skiplist on Intel Xeon Phi System with 10 million keys
7.1 Performance of Unrolled Linked Lists on Intel Xeon Phi System
7.2 Results on Intel Xeon System for uniform distribution and one million keys. Throughput is reported in operations per microsecond.
7.3 Results on Intel Xeon System for Zipfian distribution and one million keys. Throughput is reported in completed operations per microsecond.
7.4 Results on Intel Xeon System for 10 million keys and uniform distribution. Throughput is reported in operations per microsecond.
7.5 Results on Intel Xeon System for 10 million keys and Zipfian distribution. Throughput is reported in operations per microsecond.