Composable Operations on High-Performance Concurrent Collections

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Nathan G. Bronson
October 2011

Abstract

Consider the task of designing the API for a data structure or reusable software component. There is a tension between simplicity and completeness: a simple interface is easier to learn and implement, but a complex interface is likely to include more of the functionality desired by the end user. The fundamental tool for managing this tension is composition. Composition allows complex operations to be built up from a few primitive ones. Much of the art of API design comes in choosing a set of primitives that is both simple and complete. Unfortunately, existing efficient concurrent data structures do not provide a way to safely compose operations. The lack of a generic mechanism for composition leads their APIs to be both more complex and less complete.

This thesis presents new algorithms and techniques that allow composability for thread-safe sets and maps, while retaining excellent performance and scalability. First, we use optimistic techniques inspired by software transactional memory (STM) to add lazy copy-on-write to an efficient concurrent binary tree. This enables the collection to provide a linearizable clone operation whose running time is independent of the size of the tree. Second, we introduce transactional predication, a technique that allows STM integration of a concurrent map while bypassing the STM for most memory accesses. This improves the performance and scalability of the resulting transactional map, making it more suitable as a replacement for existing non-composable thread-safe maps. Finally, we explore the coexistence of lock-free algorithms and STM, producing an efficient transactional hash trie that includes lazy copy-on-write. We find that by tailoring the data structure for the STM environment and separately considering the case of non-transactional accesses, we provide both full composability and excellent performance and scalability.

Acknowledgements

Firstly, I am indebted to my wife Paola Grasso Bronson for her patience and support. She will probably be the only epidemiologist to read this thesis completely.

I would like to thank my advisor Kunle Olukotun. He has always challenged me to focus on the bigger picture, and he has always taken time to guide me even when he was busy. Thanks also to Darlene Hadding for her rock-solid administrative support.

I would like to thank Christos Kozyrakis and Dawson Engler for serving on my orals and reading committees, and thank Alex Aiken and Chuck Eesley for serving on my orals committee. Christos provided me with much hands-on feedback when I was learning how to do research. Dawson’s CS240 seems destined to be my favorite class ever. Special thanks to Alex for organizing a long string of interesting talks at the software lunch.

I would like to thank Woongki Baek, Brian Carlstrom, Jared Casper, Hassan Chafi, JaeWoong Chung, Sungpack Hong, Austen McDonald, Chi Cao Minh, Tayo Oguntebi, Martin Trautmann, and the rest of the TCC group that initially attracted me to Stanford. Special thanks to Brian and Leslie Wu for helping me transition from professional programming back to school, and to JaeWoong for including me in my first conference paper push.

I would like to thank Anand Atreya, Kevin Brown, HyoukJoong Lee, Sang Kyun Kim, Lawrence McAfee, and Arvind Sujeeth, who joined me with some other TCC stragglers to become Kunle’s students in the PPL. The cultural, technical and moral discussions (sometimes all three at once) at our weekly lunch with Kunle were always stimulating. Special thanks to Hassan, who was instrumental to the research in this thesis and to my ability to complete it. I hope that I was as useful and as understanding an office mate for him as he was for me. Thanks to Jacob Leverich for his generous help with cluster and LDAP wrangling.

Thanks to Mike Bauer and Megan Wachs for impressing me with their work ethic and fairness as we shared course assistant loads.

Martin Odersky’s team at EPFL have produced something amazing in Scala. It is a testament to their great work that carefully engineered Scala data structures can be much more compact and beautiful than Java ones, yet compete head-to-head on performance. Thanks especially to Aleksandar Prokopec for his detailed and insightful feedback on my work.

I would like to thank Mooly Sagiv for pressing me to try to formalize my intuitions, and Juan Tamayo, Guy Gueta and Ohad Shacham for letting me be part of an exciting collaboration with Mooly and Alex. Thanks to Peter Hawkins, who seems to understand every idea, no matter how complicated, after at most one discussion.

I would like to thank Harold Reiter, who opened many doors for me when I was in high school. It would be difficult to overstate the effect that his intervention has had on my education. Thanks to John Board, my undergraduate advisor, who helped lay a foundation for my doctoral work long before I even considered graduate school.

Finally, thanks to my parents Karen and Dick, who instilled in me a confidence in my ability to reason and to learn. When I asked questions as a child their response was often, "Let’s think about that." The wide-ranging discussions that followed are a template for all of the problem solving that I’ve done since. I would also like to specifically thank my dad for insisting that I learn Pascal instead of Basic.

Contents

Abstract

Acknowledgements

1 Introduction
  1.1 The Parallelism Explosion – Multi-Core
  1.2 Surveying Solutions for Parallel Programming
    1.2.1 Structured communication
    1.2.2 Partitioned address spaces
    1.2.3 Structured scheduling
    1.2.4 Conflict avoidance or repair
  1.3 Picking a Programming Model
    1.3.1 Is the application a deterministic computation?
    1.3.2 How big is the problem?
    1.3.3 Do data accesses have high locality?
    1.3.4 Is the problem size limited by available memory?
    1.3.5 Is the application regular?
    1.3.6 What is the ratio of reads to writes?
    1.3.7 Are communication patterns statically known?
    1.3.8 Is a computation defined in terms of phases?
    1.3.9 What kind of failure handling is required?
    1.3.10 Successful uses of restricted models
  1.4 Shared Memory Multi-Threading

    1.4.1 Flexible communication
    1.4.2 Evolution of existing languages
    1.4.3 Intra-chip coherence
    1.4.4 A generic substrate
    1.4.5 Platform and language choice
  1.5 Our for Multi-Threaded Programming
    1.5.1 Immutable data structures
    1.5.2 Snapshots for mutable data structures
    1.5.3 Composable operations
    1.5.4 Help from the type system
    1.5.5 Strong-by-default semantics
  1.6 Compound Atomic Operations
  1.7 Our Contributions

2 SnapTree
  2.1 Introduction
  2.2 Background
  2.3 Our Algorithm
    2.3.1 The data structure: Node
    2.3.2 Version numbers
    2.3.3 Hand-over-hand optimistic validation: get(k)
    2.3.4 Insertion and update: put(k,v)
    2.3.5 Partially external trees: remove(k)
    2.3.6 Local tree improvements: fixHeightAndRebalance
    2.3.7 Link update order during rotation
    2.3.8 Iteration: firstNode() and succNode(n)
    2.3.9 Blocking readers: waitUntilNotChanging
    2.3.10 Supporting fast clone
  2.4 Correctness
  2.5 Performance
  2.6 Conclusion

3 Transactional Predication
  3.1 Introduction
  3.2 Background
  3.3 Transactional Predication
    3.3.1 Atomicity and isolation
    3.3.2 Direct STM vs. transactional predication
    3.3.3 Extending predication to maps
    3.3.4 Sharing of uncommitted data
  3.4 Garbage Collection
    3.4.1 Predicate life-cycle in a timestamp-based STM
    3.4.2 Reference counting
    3.4.3 Soft references
    3.4.4 Optimizing non-transactional access
  3.5 Iteration and Range Searches
    3.5.1 Transactional iteration
    3.5.2 Iteration and search in an ordered map
  3.6 Experimental Evaluation
    3.6.1 Garbage collection strategy
    3.6.2 Comparison to other transactional maps
    3.6.3 Ordered maps
  3.7 Related Work
    3.7.1 Avoiding structural conflicts
    3.7.2 Semantic conflict detection
    3.7.3 Serializing conflicting transactions
  3.8 Conclusion

4 Transactional Maps with Snapshots
  4.1 Introduction
  4.2 A Lock-Free Copy-On-Write Hash Trie
    4.2.1 Hash tries
    4.2.2 Generation counts

    4.2.3 Hash trie get(k)
    4.2.4 Hash trie put(k, v)
    4.2.5 Hash trie remove(k)
    4.2.6 Hash trie size
  4.3 Lock-free Algorithms Inside Transactions
    4.3.1 Semantics of non-transactional memory accesses
    4.3.2 Forward progress
    4.3.3 Correctness after composition
    4.3.4 Atomicity violation – what is to blame?
    4.3.5 Internal linearization points – a sufficient condition
  4.4 Transactional Access to the Hash Trie
    4.4.1 clone inside a transaction
    4.4.2 Optimizations for transactional operations
    4.4.3 Adaptively reducing false conflicts
  4.5 TMap Recipes
    4.5.1 ScalaSTM’s types
    4.5.2 TMap as a normal concurrent map
    4.5.3 Consistent iteration and immutable snapshots
    4.5.4 Compound operations on a single map
    4.5.5 Compound operations across multiple maps
    4.5.6 Synchronized forward and reverse indices
  4.6 Experimental Evaluation
    4.6.1 Differences between ScalaSTM and CCSTM
    4.6.2 Microbenchmarks
    4.6.3 STMBench7
    4.6.4 In-situ evaluation inside Apache Derby
    4.6.5 Experimental evaluation of STM inside Derby
  4.7 Conclusion

5 Conclusion
  5.1 Loss of Composability

  5.2 Our Solution

Bibliography

A Code for Transactional Predication
  A.1 Reference Counting Code
  A.2 Soft Reference Code
  A.3 Unordered Iteration Code

List of Figures

2.1 Finding the successor for deletion of an internal node.
2.2 The fields for a node with key type K and value type V.
2.3 Version bit manipulation constants.
2.4 Interaction between concurrent search and rotation.
2.5 Finding the value associated with a key k.
2.6 Hand-over-hand optimistic validation.
2.7 Inserting or updating the value associated with k.
2.8 A sequence of two deletions in different types of trees.
2.9 Two ways to remove an association.
2.10 Local tree improvements.
2.11 Performing a right rotation.
2.12 Code to wait for an obstruction to clear.
2.13 Code for lazy marking and lazy copy-on-write.
2.14 Single thread overheads.
2.15 Routing node frequencies.
2.16 Scalability by tree size and operation mix.
2.17 Scalability with concurrent iteration.
2.18 Scalability with concurrent consistent iteration.

3.1 Code for a minimal but complete transactionally predicated set.
3.2 Concurrent execution of contains(10) and add(10).
3.3 Lookups in two transactional set implementations.
3.4 Throughput for three predicate reclamation strategies.
3.5 Scalability by map size and operation mix.

3.6 On-chip versus off-chip scalability.
3.7 Scalability for ordered transactional maps.

4.1 Features for the data structures developed in Chapters 2, 3 and 4.
4.2 The hash trie data structure.
4.3 Class signatures for the lock-free hash trie.
4.4 The lifecycle of each child reference in a trie branch.
4.5 Code for lock-free hash trie clone.
4.6 Code for lock-free hash trie get.
4.7 Code for lock-free hash trie put.
4.8 Code for lock-free hash trie get with an external LP.
4.9 Code to track the rate of write-write conflicts.
4.10 Scalability by map size and operation mix.
4.11 STMBench7 performance with a read dominated workload.
4.12 STMBench7 performance with a write dominated workload.
4.13 STMBench7 performance with a workload that is an even mix of those from Figures 4.11 and 4.12.
4.14 GC reclamation rate during STMBench7 for the mixed workload from Figure 4.13.
4.15 GC wall time, as a fraction of the total wall time for the mixed STMBench7 workload from Figure 4.13.
4.16 Code to acquire and then immediately release a Derby lock.
4.17 Derby performance for a 10,000 × 1,000 row join.
4.18 Derby performance for single-row updates in a 100,000 row table.

Chapter 1

Introduction

Modern software systems are enormously complicated; that complexity is only possible due to the separation of software into components or abstractions. If the interface provided by a component is simpler than its encapsulated details, then it reduces the complexity that must be tackled at any one time. The ideal abstraction is one that takes responsibility for complex behavior while presenting a simple API. Completeness and simplicity are in tension, however, since adding functionality to a component eventually requires increasing the size of its interface.

Composition is the fundamental tool for managing the simplicity–completeness trade-off. Rather than providing a method or entry point for every potential use case, a composable API provides a small set of primitive operations that are composed to cover more complex scenarios.

In imperative programming, the primary mechanism for composition is sequencing. To compose two operations on a mutable object or data structure, merely call them one after the other. For example, when designing the API of Java’s Map interface there was no need to provide a function to change the key associated with a value (rename(k0,k1)). The user can easily implement this functionality by following v = remove(k0) with put(k1,v).

Unfortunately, composition by sequencing doesn’t work for shared state in multi-threaded programs. Without some form of concurrency control there is no guarantee that a thread’s operations won’t get interleaved with another’s. The lack of generic composability is a large problem when designing the API of a concurrent data structure, because we can no longer rely on sequencing to cover all use cases from a small set of primitives.
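To make the rename example concrete, the sketch below (in Scala, not taken from the thesis; the map and helper names are illustrative) builds rename from remove and put on a thread-safe map. Each primitive call is atomic on its own, but the sequence is not: another thread can observe the window between the two steps in which neither key is present.

    import java.util.concurrent.ConcurrentHashMap

    object RenameSketch {
      val map = new ConcurrentHashMap[String, String]()

      // rename(k0, k1) composed by sequencing two primitives. Each call is
      // atomic, but the composition is not: between the remove and the put a
      // concurrent reader can see a state in which neither key is mapped.
      def rename(k0: String, k1: String): Unit = {
        val v = map.remove(k0)        // atomic step 1
        if (v != null) map.put(k1, v) // atomic step 2
      }

      def main(args: Array[String]): Unit = {
        map.put("a", "payload")
        rename("a", "b")
        println(map) // {b=payload} when no other thread interferes
      }
    }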

1.1 The Parallelism Explosion – Multi-Core

Until recently, the lack of composability of concurrent data structures was less important. Parallel programming was restricted to a few domains in which the increased complexity and expense was justified by the total computational power required. Most CPUs had only a single thread of control, most computers had only a single CPU, and most programs were only written to take advantage of a single processor at a time. Parallel programs are more expensive to develop, and the cost was difficult to justify in an era during which the sequential performance of processors grew exponentially.

The enormous gains in microprocessor speed during the past several decades were driven by higher clock frequencies, larger caches, and more aggressive instruction-level parallelism (ILP). Power constraints have stopped frequency scaling, and most applications have reached a point of diminishing returns from increases in cache size or ILP logic complexity. Moore’s law still describes the growth of available transistors, but these transistors can no longer be efficiently translated into sequential performance. All major processor vendors have switched to multi-core strategies, in which the bulk of the continuing performance gains will be available only for explicitly parallel programs. Herb Sutter succinctly describes this transition: “The free lunch is over” [86].

Parallelism is the simultaneous execution of multiple instructions or operations, which involves multiple control contexts. Multi-core processors explicitly expose these contexts as threads accessing a shared memory. Parallel programming models provide an abstraction that maps the execution of a program to resources of the underlying hardware.

By allowing more processors to be applied to a single task, parallelism improves performance. Better performance makes larger tasks feasible and reduces the time needed to complete existing tasks. Parallelization can also be used to reduce the power needed to perform a computation within a fixed time window.

Once the sequential programming model has been abandoned, parallelism can be used to improve power efficiency by moving to a larger number of cores where each core is slower. Lower clock frequencies yield a super-linear reduction in required power. Cores with a lower throughput also have less need for the inter-instruction speculation currently used to achieve peak single-threaded performance, which means that they can avoid wasting power on incorrect predictions. The Sun Niagara processor is perhaps the best example of this approach that retains the instruction set of a high-performance sequential processor [52]. GPUs take this trend further by providing an enormous number of threads that each operate in a restricted environment.

1.2 Surveying Solutions for Parallel Programming

The set of application domains that can benefit from parallelism is very large, as is the set of parallel programming models. Each parallel programming model and environment imposes a set of rules or restrictions on the communication and synchronization of the execution contexts. When the model fits the application domain, these rules make it simpler to reason about the results of an execution. For example, in the producer–consumer model we can immediately ignore the consumer when looking for a bug in the producer, because the producer is completely isolated from the consumer.

It is the general opinion of the author that an application should use the most restrictive parallel programming model that yields an efficient solution, where efficiency includes execution costs and software engineering costs. The extra structure will allow more up-front analysis, which can be used to improve scheduling, arrange data to reduce the need for communication, and perform less synchronization. Pre-built structure moves functionality into code that is more widely reused, which leverages the engineering costs applied to that code. Models that guarantee deterministic output dramatically ease the application of the testing techniques developed for sequential programs.

Restrictive parallel programming models don’t simplify reasoning if they are not a good fit for the problem domain. If many actors must coordinate to complete a task, for example, then the messages that are exchanged must include many pieces of intermediate state that would be more naturally expressed as local variables in a threaded program. To paraphrase Einstein’s Razor,1 everything should be as restrictive as possible, but no more.

1. Everything should be made as simple as possible, but no simpler.

To motivate our contribution we will attempt to make a coarse map of parallel programming models and relevant application characteristics. While shared memory multi-threading is one of the least restrictive of the models, it is both widely used and useful as a generic substrate for models with more structure.

1.2.1 Structured communication

If concurrently executing contexts are isolated from one another, then variations in their scheduling do not change a computation’s result. Many of the difficulties in parallel programming stem from non-determinism, so we will first examine some systems that automatically provide deterministic outputs under parallel execution. We note that the term deterministic execution is too strong; many sources of non-determinism exist even for single-threaded executions2.

2. Thanks to Doug Lea for pointing this out.

Independent parallelism

Many parallel problems can be partitioned into pieces of work that involve no shared state or synchronization, except to aggregate their output. These are sometimes referred to as “embarrassingly parallel” or “trivially parallel.” While these cases may be embarrassing or trivial for researchers studying parallelism, they are a best case scenario for the user. Independent parallelism for batch jobs is supported directly by job queuing systems like Torque, or may be as simple as arranging for a shell script to launch several copies of a program in the background. For systems that respond to external requests independent parallelism is often referred to as sharding, and uses a load balancing layer to randomly or deterministically distribute requests.

Relational queries

Concurrent execution involves multiple control contexts, but these do not have to be exposed to the programmer. SQL queries, for example, are expressed as a mathematical relation over a set of mutable data. The query’s definition does not include the notion of an instruction pointer, so there are many execution strategies that can compute the results of the relation. In practice databases may use multiple threads, processes or nodes to compute a result. Microsoft’s LINQ system provides a similar way to express in-memory queries that are independent of control context, and that can be executed sequentially or in parallel [61].

Functional languages

Relational queries can be executed in parallel because their result is defined independently of their execution order. Functional languages provide this guarantee for their pure subset, which often accounts for the bulk of computationally intensive work. Pure computations can be run in parallel with no changes to a program’s semantics, because referential transparency guarantees that each statement will produce the same answer regardless of when it is executed. Data Parallel Haskell is a recent system targeted at parallel execution of functional programs [51]. It provides convenient bulk pure operations on nested vectors using compilation techniques originally pioneered by NESL [8]. Parallel execution of these vector operations is an implementation detail, and it is not reflected in the semantics.
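As a small illustration of the same guarantee in Scala (an illustrative sketch, not from the text, using the parallel collections that shipped with Scala at the time): because the mapped function is pure, sequential and parallel evaluation always produce the same result.

    object PureParallelSketch {
      def main(args: Array[String]): Unit = {
        val xs = (1 to 1000000).toVector
        def square(x: Int): Long = x.toLong * x // pure: no side effects

        val sequential = xs.map(square)
        val parallel   = xs.par.map(square).seq // evaluated by multiple threads

        // Scheduling cannot be observed through a pure computation.
        println(sequential == parallel) // true
      }
    }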

Producer–consumer parallelism

Producer–consumer parallelism arranges a set of isolated sequential processes into a directed acyclic graph, where the outputs from one process are inputs to others. Unlike more general forms of message passing communication, this structure can be rephrased as a functional computation over infinite lazy streams, and therefore produces the same answer on every execution. Unix pipes are perhaps the most widely used implementation of producer–consumer parallelism. This form of parallelism is also used in a fixed form inside the rendering pipeline of a modern GPU, or more flexibly in the GRAMPS programming model [84].

Replicated updates

Read-only operations in a request processing system can be considered pure functions, so they can be executed in parallel with each other. Read-only operations can be concurrent with updates if the updates are isolated, or if the reads can be satisfied from a snapshot or replica of mutable state. This can be implemented in a shared memory system using immutable data structures or data structures that support lazy copy-on-write (such as those presented in later chapters of this thesis). In a multi-node system, updates can be propagated from a master to replicas by log-shipping or with a replicated finite state machine. A distributed consensus protocol can be used to elect a master in a fault-tolerant fashion.

Sequential semantics with parallel execution

Imperative languages may take advantage of parallel execution while presenting a single instruction pointer to the user. This can be done opportunistically for individual operations on compound data, as when a single call to the BLAS API executes a large matrix operation using parallel threads.

Automatic parallelization of fine-grained sequential operations is a holy grail of static analysis and compiler research. Historically Fortran has been the language of choice for this work, due to its limited aliasing and wide use in numerical computations. Unfortunately, this approach is hindered by the incidental dependencies and memory reuse prevalent in an imperative language without garbage collection. If the programmer or a static analysis can deduce that an operation should be repeated across contiguous memory elements then vector instructions allow for more efficient execution than explicit loops.

As an alternative to static analysis of intra-thread dependencies, architectural researchers have suggested speculative techniques that execute code in parallel while providing sequential semantics. Out-of-order execution that provides instruction level parallelism (ILP) is present in most desktop, laptop and server processors. Thread level speculation (TLS) proposals substantially extend the boundary at which speculation is performed and checked, typically speculating at loop boundaries in numeric workloads [38]. So far no commercial processor has been released that contains TLS support. Sun’s cancelled Rock processor incorporated the similar technique of scout threads. Scout threads never commit their work, but serve instead to preload the L1 cache so that the main thread will encounter fewer memory stalls [88].

Dynamic resolution of dependencies and anti-dependencies can also be done in software, so long as enough work is done during each operation to amortize the dynamic costs. Stanford’s DELITE toolkit enables the creation of Domain Specific Languages (DSLs) that provide parallel execution while presenting sequential semantics to the user [21]. This system takes advantage of the Scala language’s support for Language Virtualization to perform some scheduling work at compile time [20].

Another possibility is to restrict an imperative sequential language so that parallel execution is always possible. The Liszt domain-specific language for partial differential equations prohibits accesses to shared data that cannot be statically analyzed. Liszt users perceive that they are working within a restricted sequential language, but the restrictions allow the compiler to generate efficient parallel code operating over an unstructured mesh.

Isolated explicit parallelism

Concurrent contexts do not need to be hidden in order to be isolated from each other. The Sequoia programming language, for example, explicitly models threads of control on each node of a parallel machine [28]. The compiler statically prohibits these threads from communicating with any concurrent contexts; communication occurs with the parent only when the child is not running. (Recent extensions to Sequoia relax this restriction, allowing some message passing patterns to be expressed [5].)

Isolation can also be enforced by convention. The Hadoop implementation of Google’s MapReduce model does not statically prevent node instances from observing their identity or their position in the network, or from maintaining state across invocations [22]. So long as this state is transient and reproducible (such as memoized computations), however, the MapReduce model will still provide scalable execution of a deterministic calculation. MapReduce and Hadoop explicitly tolerate node failures by persisting the inputs and outputs of subcomputations.

1.2.2 Partitioned address spaces

One way that we can impose structure on inter-context communication is by allowing arbitrary communication, but requiring that it be explicit.

Message passing

The largest supercomputers are programmed almost exclusively using explicit message passing, with the MPI standard being the most common. In this environment the underlying processors have partitioned address spaces, and the explicit sends and receives of MPI correspond to the inter-node communication primitives implemented by the networking hardware. Point-to-point and collective communication are available. Originally MPI communications required the sender and receiver to rendezvous; the MPI-2 extensions add one-sided operations. While the individual processes in an MPI system are typically sequential, the overall system is not automatically repeatable or free of logic races. MPI focuses on execution efficiency, rather than programmability.

Erlang is a programming language that integrates the model of communicating sequential processes. Erlang processes are both units of execution and of isolation, and it is typical to have many processes inside each VM. Isolation between processes is enforced by the language. Unlike MPI, the default behavior of messages is asynchronous delivery via named queues. Erlang’s original focus was on reliable distributed systems, but it has recently been applied also to situations whose primary focus is scalability.

Actors

If communicating sequential processes are actually run inside a shared address space then immutable data may be shared between sender and receiver as an optimization. The Scala programming language provides an actor library that allows threads to communicate with typed messages by passing references to shared immutable data within a single virtual machine [36]. While this does not provide the same scalability as distributed message passing systems, the overheads of messages can be substantially lower. Systems that use actors with a shared address space can also benefit from incremental use of direct access to shared mutable state for efficiency or convenience purposes.
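A minimal sketch of this style, using the scala.actors library that shipped with Scala when this thesis was written (it has since been deprecated in favor of Akka); the actor handles a single message whose payload is an immutable list, so sender and receiver share the data by reference without copying.

    import scala.actors.Actor._

    object ActorSketch {
      case class Sum(xs: List[Int]) // immutable payload: safe to share by reference

      def main(args: Array[String]): Unit = {
        val summer = actor {
          react {
            case Sum(xs) => println("sum = " + xs.sum)
          }
        }
        summer ! Sum(List(1, 2, 3, 4)) // sends a reference, not a copy
      }
    }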

1.2.3 Structured scheduling

The proliferation of multi-core processors has dramatically increased the number of systems which have the capability to run multiple threads or processes with shared memory. Whereas true parallel execution was once limited to high-end servers with multiple processor chips, it now includes virtually all commodity microprocessors and extends even to embedded devices such as smart phones.

If multiple parallel contexts of a shared-memory architecture read and write the same memory location, then they have communicated. One way of structuring this communication is to explicitly enumerate the parallel contexts that may be active at any point in time. This explicit set of concurrent contexts can be used to reason about potential memory access interleavings and the resulting implicit communication.

Declarative loop-level parallelism

OpenMP provides declarative parallelism, with a focus on parallel execution of loops. The user is responsible for identifying or avoiding conflicting memory accesses; data races are not automatically detected. OpenMP performs implicit barrier synchronization at many locations, as well as providing explicit synchronization operations. Static analysis can allow OpenMP programs to be compiled for systems that do not implement shared memory in hardware.

Fork-join

Cilk++ adds language primitives for forking and joining lightweight threads to C++, and executes them using an efficient work-stealing runtime. It is a commercially supported descendant of the Cilk research language [9]. Like OpenMP, Cilk++ gives threads enough structure so that data races can typically be avoided statically and without locks. The fork-join model handles irregular and imbalanced parallelism better than the loop-centric model of OpenMP, but it is intimately bound to shared memory hardware.
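The same fork-join pattern can be expressed against Java's work-stealing framework (the JSR166Y classes mentioned later in this chapter). The Scala sketch below is illustrative rather than taken from Cilk++ itself; fork() and join() play roughly the roles of Cilk's spawn and sync.

    import java.util.concurrent.{ForkJoinPool, RecursiveTask}

    class Fib(n: Int) extends RecursiveTask[Long] {
      override def compute(): Long =
        if (n < 2) n
        else {
          val left = new Fib(n - 1)
          left.fork()                          // "spawn" the left subtask
          val right = new Fib(n - 2).compute() // compute the right half locally
          left.join() + right                  // "sync": wait for possibly stolen work
        }
    }

    object ForkJoinSketch {
      def main(args: Array[String]): Unit = {
        val pool = new ForkJoinPool()
        println(pool.invoke(new Fib(30))) // 832040
      }
    }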

Phased execution using barriers

While OpenMP and Cilk++ bound parallel threads using statically scoped boundaries, shared memory programs can dynamically impose a high degree of regularity on their execution using barriers. A barrier serves as a rendezvous between all of the members of a group, guaranteeing that all execution in a phase before the barrier precedes any execution in the following phase. Recently the Phaser abstraction has been introduced in the Fortress language and the Java concurrency libraries; this subsumes the functionality of barriers while allowing group membership to vary dynamically [82].
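A minimal sketch of phased execution in Scala using java.util.concurrent.Phaser, the abstraction cited above (the worker and phase counts are arbitrary): no thread begins phase i + 1 until every registered party has arrived at the end of phase i.

    import java.util.concurrent.Phaser

    object PhaserSketch {
      def main(args: Array[String]): Unit = {
        val workers = 4
        val phaser = new Phaser(workers) // each worker is a registered party

        val threads = (0 until workers).map { id =>
          new Thread(new Runnable {
            def run(): Unit = {
              for (phase <- 0 until 3) {
                // ... compute worker `id`'s share of this phase ...
                phaser.arriveAndAwaitAdvance() // all arrive before any advance
              }
            }
          })
        }
        threads.foreach(_.start())
        threads.foreach(_.join())
        println("all phases complete")
      }
    }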

1.2.4 Conflict avoidance or repair

In the most general case, threads can be created and destroyed arbitrarily; when reasoning about a particular algorithm or operation in this case one must assume that any conflicting operation might occur. This model provides safety by detecting potential conflicts and reactively avoiding them.

Locks and condition variables

Mutual exclusion locks provide a strong ordering between threads, but with much less structure than barriers. It is known that if two threads both protect access to some data with the same lock, then one of them must have completed the protected access before the other started. It is not known ahead of time, however, which thread will go first. This means that unless updates are restricted to associative functions (or some other application-specific knowledge is applied) locks can produce non-deterministic execution. Effective use of locks requires a protocol that guarantees that conflicting operations hold the same lock and a protocol that prevents deadlock. Locks can be associated either with data or with the code that accesses the data; the former typically scales better. Condition variables cooperate with locks to allow threads to efficiently wait for arbitrary conditions.
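The sketch below (illustrative, not from the thesis) shows the lock-plus-condition-variable idiom applied to a tiny bounded buffer: the lock is associated with the data it protects, and each condition lets threads wait efficiently for "not full" or "not empty".

    import java.util.concurrent.locks.ReentrantLock
    import scala.collection.mutable.Queue

    class BoundedBuffer[A](capacity: Int) {
      private val lock     = new ReentrantLock()
      private val notFull  = lock.newCondition()
      private val notEmpty = lock.newCondition()
      private val items    = Queue.empty[A]

      def put(x: A): Unit = {
        lock.lock()
        try {
          while (items.size == capacity) notFull.await() // wait for space
          items.enqueue(x)
          notEmpty.signal()                              // wake a consumer
        } finally lock.unlock()
      }

      def take(): A = {
        lock.lock()
        try {
          while (items.isEmpty) notEmpty.await()         // wait for data
          val x = items.dequeue()
          notFull.signal()                               // wake a producer
          x
        } finally lock.unlock()
      }
    }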

Atomic instructions

All modern multi-threaded architectures provide either the atomic compare-and-swap (CAS) or load-linked store-conditional (LL-SC). When combined with a retry loop, these allow a single memory location to be atomically sampled and transformed. These instructions form the foundation for implementing higher-level synchronization operations. They can also be used directly to construct concurrent data structures. The resulting algorithms are typically both specific to the implemented operations and not composable.
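The retry-loop idiom looks like the following Scala sketch (the AtomicLong counter and the transformAndGet helper are illustrative names, not from the thesis): read the current value, compute its replacement, and publish it only if no other thread has intervened, retrying otherwise.

    import java.util.concurrent.atomic.AtomicLong

    object CasSketch {
      private val counter = new AtomicLong(0)

      // Atomically apply an arbitrary transformation to one memory location.
      def transformAndGet(f: Long => Long): Long = {
        var done = false
        var next = 0L
        while (!done) {
          val seen = counter.get()
          next = f(seen)
          done = counter.compareAndSet(seen, next) // CAS succeeds only if unchanged
        }
        next
      }

      def main(args: Array[String]): Unit = {
        println(transformAndGet(_ + 1))       // 1
        println(transformAndGet(x => x * 10)) // 10
      }
    }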

Transactional memory

Transactional memories (TMs) reduce the number of observable thread interleavings by providing the illusion that all of the reads and writes inside a transaction occur instantaneously. This is implemented using optimistic concurrency control. Transactional reads and writes are tracked. If a transaction attempt is discovered to be non-atomic or if a deadlock is found, the attempt is rolled back and retried. Tracking can be implemented directly in hardware [45, 37, 65, 81], purely in software [44], or in various combinations thereof [78, 55].

TM’s promise of simple semantics is complicated by the possibility of rollback and by the limitations of practical implementations. Hardware TMs that can support unbounded transaction size and duration are prohibitively complex. Software TMs that can access all shared mutable state have struggled to simultaneously provide strong semantics and good performance [80, 17]. The most successful uses of TM so far have been in functional languages where most data is immutable [39, 48].
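As a flavor of the programming model, the sketch below uses ScalaSTM (the STM that appears again in Chapter 4); the account names and amounts are illustrative. The reads and writes inside each atomic block are tracked, so no thread can observe a state in which the amount has left one Ref but not yet arrived in the other.

    import scala.concurrent.stm._

    object StmSketch {
      val checking = Ref(100)
      val savings  = Ref(0)

      def transfer(from: Ref[Int], to: Ref[Int], amount: Int): Unit =
        atomic { implicit txn =>
          from() = from() - amount // tracked write; rolled back and retried on conflict
          to()   = to()   + amount
        }

      def main(args: Array[String]): Unit = {
        transfer(checking, savings, 30)
        atomic { implicit txn => println((checking(), savings())) } // (70,30)
      }
    }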

1.3 Picking a Programming Model

In this section we discuss some questions that partition the problem domain into subspaces where different parallel programming models are suitable.

1.3.1 Is the application a deterministic computation?

Shared memory multi-threading adds non-determinism to a program’s execution. This makes reasoning, debugging and testing more difficult. If an application domain is already non-deterministic, however, these costs have already been paid. It is therefore natural to first divide applications by whether or not they are deterministic.

There is an important distinction between natural determinism in the application specification and deterministic execution of an implementing program. Pedantically we can say that all executions of a multi-threaded program are deterministic by claiming that interrupts, clock values, arriving network requests, memory bus transactions, instruction interleavings, and cosmic rays are inputs to the program. Practically, we can say that only those inputs and outputs modelled in the application should be considered. Even modelled inputs may be a source of practical nondeterminism if there is no means to replay them. We will therefore say: a deterministic application is one that has a single correct output per input, where the input and output are part of the specification and the input can be replayed precisely.

Separating determinism in the specification from determinism in the implementing program has an additional subtlety: a program can be considered correct even if it does not implement the intuitive specification exactly. For example, $\sum_{i=1}^{n} x_i$ is clearly a deterministic specification, but an implementation that uses floating point arithmetic will produce only an approximation of the true sum. Also, the specification’s addition is associative but floating point additions are not, so the order in which floating point additions are performed will affect the result. These gaps between high-level specification and implementation might be resolved if the specification is given in terms of the exact sequence of floating point arithmetic operations that should be performed. The insightful reader will notice that this specification is a sequential executable program, and may suggest defining determinism by an equivalence between sequential and parallel executions of the parallel program. Because we are trying to categorize applications rather than implementations, however, we will instead allow a specification to be deterministic even if the implementing programs actually perform a different computation.
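A short worked example of the floating point caveat (in Scala, purely illustrative): reducing the same three values in different orders gives different answers, so an implementation that reorders the sum cannot reproduce the sequential result bit-for-bit.

    object FloatOrderSketch {
      def main(args: Array[String]): Unit = {
        val xs = List(0.1, 0.2, 0.3)
        val leftToRight = xs.foldLeft(0.0)(_ + _)         // ((0 + 0.1) + 0.2) + 0.3
        val rightToLeft = xs.reverse.foldLeft(0.0)(_ + _) // ((0 + 0.3) + 0.2) + 0.1
        println(leftToRight)                // 0.6000000000000001
        println(rightToLeft)                // 0.6
        println(leftToRight == rightToLeft) // false
      }
    }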

1.3.2 How big is the problem?

We can use parallel programming to pool the compute, memory, disk and I/O capacities of a group of machines. This allows us to tackle problems that don’t fit on any available shared memory computers, or to build a system from nodes that are less expensive in aggregate than a similarly-sized individual instance. If parallel programming is used to pool disk and I/O then the model must include those accesses in its model of locality. Because moving persisted data is very expensive, this is usually done by arranging for computations to occur near to the disk or I/O resources that they will access. MPI and the MapReduce model both explicitly model data locality for parallel I/O. Relational databases and high-level DSLs such as Liszt have sufficient information about locality that they can use parallel I/O in their implementations.

1.3.3 Do data accesses have high locality?

Partitioned address spaces require that mutable data be assigned to a particular process or agent. If any other entity needs to access that data it must communicate with the owner, which is cumbersome and expensive if it is frequent. Shared memory makes access equally easy from each thread, at the expense of requiring some sort of concurrency control to prevent conflicts.

1.3.4 Is the problem size limited by available memory?

Models that do not expose destructive writes must fall back on a dynamic garbage collector when static analysis cannot prove that the types are used in a linear fashion. In the extreme case, dynamic garbage collection in a functional programming language can be considered the dual of the alias analysis required for automatic parallelization of imperative code. In practice this results in heavy use of the garbage collector.

Garbage collectors are a classic time-space tradeoff; they have good execution efficiency as long as they are given extra heap space. If a problem’s dynamic working set occupies the bulk of the RAM of a parallel system, then programming models that provide in-place destructive updates will be most suitable.

1.3.5 Is the application regular?

We can subdivide deterministic problems by whether their computation space is regular, by which we mean there is a statically known subdivision of the problem space into evenly-sized pieces of work whose memory access patterns are known ahead of time. Regular computation spaces can arise from regular data, as in a dense matrix computation, or from an evenly divided enumeration of a numeric range, as in a grid search optimization. Irregular computation spaces might represent irregular data, recursive traversals, or imbalanced computations over a regular data space.

Irregular applications may require that tasks be moved between parallel contexts to avoid load imbalance. This is more expensive in systems that do not have shared memory. On the other hand regular applications can be statically partitioned, which is often more efficient.

1.3.6 What is the ratio of reads to writes?

In many cases it is easier to isolate readers from writers than to provide effective scalable concurrency control for concurrent writers. This can take the form of scalable coarse-grained reader-writer locks, snapshot isolation or optimistic concurrency control.

1.3.7 Are communication patterns statically known?

Communication is required for non-local memory accesses and inter-processor synchronization. Improvements in semiconductor processes have decreased the power required for computation more quickly than they have decreased the power required for non-local communication. For many applications memory accesses take more power than instruction execution.

Static communication patterns have both usability and execution advantages. At compile time accesses can be analyzed to prove the absence of data races, or to guarantee execution with deterministic outputs. At runtime memory can be prefetched to hide latency, reads and writes can be batched together to improve efficiency, and cache flushes and updates can be performed in software on systems that do not have hardware coherence across all memory accesses. Coherent memory systems have a scalability limit, and are generally assumed to require more power than incoherent ones.

1.3.8 Is a computation defined in terms of phases?

Computations that model time as a sequence of steps or discrete intervals map easily to barriers, and to programming models that contain implicit barriers. Many phased computations can be formulated so that each memory location can have at most one valid value per phase, which can be used to deterministically resolve all memory accesses. Such formulations often allow inter-node communication to be performed exclusively at phase boundaries.

1.3.9 What kind of failure handling is required?

Large scale parallelism dramatically increases the probability of hardware failure, because it dramatically increases the quantity of hardware that can be brought to bear on a single problem. Deterministic parallel computations can be restarted if they encounter a failure, but this strategy becomes less useful as the underlying system becomes larger. At extreme scales the mean time to component failure may be significantly less than the time needed to complete a run.

Non-deterministic applications may also desire tolerance of or recovery from software bugs. Software faults may themselves be non-deterministic (such as a lock cycle leading to a potential deadlock), in which case they may go away if a portion of the program is retried. The system state that led to a problem might not be modelled by the application (such as fragmentation of a garbage collector’s heap), in which case migrating or restarting the application might resolve the problem. Behavior in response to failures is a whole-system property; it must be deeply integrated in the design.

1.3.10 Successful uses of restricted models

Let’s go over some examples where unrestrained threads are not the best solution.

Sharding in Snort

The Snort application is an intrusion detection system (IDS) for IP networks. It forwards packets unless they are identified to be malicious. Unlike a stateless firewall, an IDS reassembles packet fragments to perform checks of application-level protocols. This allows it to filter HTTP requests, for example. Reassembly is stateful.

Snort’s initial implementation is single-threaded. For some configurations this is not sufficient to saturate the forwarding machine’s networking hardware, so as part of a major rewrite of Snort (3.0) parallelism support was added.

An early parallel version used the producer–consumer model, with stream assembly and filter application being applied in separate phases. The developers reported that this actually reduced total throughput due to an increase in cache misses. The final implementation uses sharding, where traffic is deterministically assigned to a thread and all processing is performed on that thread. This allows stateful processing of each stream and minimizes inter-thread communication, while allowing threads to share configuration information and a single management interface [76].

Replicated updates in a foreign exchange trading platform

A foreign exchange trading platform accepts bids (requests to buy) and offers (requests to sell) from participants, screens the requests according to credit availability, matches buyers and sellers, and provides streaming updates of the book (the set of outstanding bids and offers). Parallelization of order matching is particularly demanding due to high data contention (most of the activity occurs for just a few instruments) and strong fault-tolerance requirements. As an additional constraint trading rules are often written to imply sequential consistency.

The author is personally familiar with the use of replicated updates for this problem domain. We implemented a system that captures the trading history for a pair of currencies as a list of immutable data structures. Matching for a particular financial instrument is performed by a single thread that constructs and appends a new "state of the world" to this list. Most of the work in the system is performed by tasks that read the contents of the list.

In the described system updates are propagated throughout a cluster. This allows the read-only computation processes to be spread across many nodes, and it allows any node in the system to take over the writer’s role if there is a failure.

This foreign exchange system also makes extensive use of speculation to hide I/O latency. Updates performed by in-memory transactions are speculatively forwarded to subsequent transactions, as in TLS. Notification to end users is delayed until after operations have been durably written to multiple disks.

Dense matrix multiplication

Dense matrix multiplication forms the core of many practical algorithms. Its specification is deterministic3. Its data accesses have good locality if the computation is blocked, and those accesses are completely regular and can be separated into phases. Each parallel context can perform many reads and arithmetic operations for each write. Naive dense matrix multiplication is $O(n^3)$, and the best algorithmic complexity for a practical algorithm (once the constant factor has been included for feasible problem sizes) is $O(n^{2.807})$, so the working set for any dense matrix multiplication that can be tackled today is small enough to comfortably fit in memory.

Dense matrix multiplication can be efficiently solved by most of the parallel programming models covered in Section 1.2. The exceptions are SQL4, in which the language’s inability to iterate over column names is problematic; producer–consumer parallelism with static stages, in which all-to-all communication can only be performed at a stage transition but the number of required communications is dynamically determined by the input; and replicated updates, because all workers need to perform writes. Unstructured thread techniques may be used, but are not necessary. The most efficient solutions may combine several techniques, using vector instructions within the multiplication’s inner loop, deterministic phased execution among multiple threads of a processor, and message passing between nodes. These techniques may be exposed directly to the programmer, or they may be targets for a unified model such as Sequoia.

3. The order that floating point arithmetic is performed can affect the result, but this is often ignored for domains whose data is well conditioned.
4. We consider here only solutions in which tuple elements are primitive values, not vectors or matrices.

Physical simulations over a static space

Many physical simulations represent a numeric approximation to partial differential equations defined over a continuous space. Solution methods must sample the space and the time domains. When space is sampled regularly in a grid and time is advanced in fixed intervals then the communication patterns are regular and easy to characterize; the resulting computation has a lower arithmetic intensity than dense matrix multiplication, which increases the problem sizes that can be feasibly tackled. This means that the largest simulations scale beyond shared-memory processors. Communication patterns may still be statically determined when space is subdivided into a static mesh, although care must be taken to avoid load imbalance. Large simulations over a static irregular mesh require either a programming model that explicitly models locality and communication (such as MPI, X10 or Sequoia) or a system that can analyze the irregular communication pattern. This is a very difficult problem in general; it is made easier by using a domain-specific language in which the mesh is a first-class entity (such as Liszt).

Embarrassing parallelism in Monte Carlo analysis

Monte Carlo analysis computes an approximation of a probability distribution by evaluating a model at randomly chosen points in the input space. This requires sampling many points of the input space. Each evaluation is independent, and because input points are chosen randomly, parallel contexts need only communicate their output. Typically the number of samples is very large, so it can easily fill the parallel machine with uniform work.
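A sketch of the pattern (estimating pi in Scala with the parallel collections; the worker and sample counts are arbitrary): each worker owns its random number generator, and only the per-worker hit counts are combined at the end.

    import scala.util.Random

    object MonteCarloSketch {
      def main(args: Array[String]): Unit = {
        val workers = 8
        val samplesPerWorker = 1000000

        val hits = (0 until workers).par.map { id =>
          val rng = new Random(id) // per-worker state: nothing is shared
          var inside = 0
          for (_ <- 0 until samplesPerWorker) {
            val x = rng.nextDouble()
            val y = rng.nextDouble()
            if (x * x + y * y <= 1.0) inside += 1
          }
          inside.toLong
        }.sum

        println(4.0 * hits / (workers.toLong * samplesPerWorker)) // roughly 3.1415
      }
    }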

Intra-operation parallelism with sequential semantics in MatLab

MatLab’s Parallel Computing Toolbox provides parallel execution using threads, MPI or CUDA, while maintaining source-level compatibility with legacy MatLab code. It does this by providing alternate vector and matrix constructors that distribute the underlying data across locations, and then providing parallel execution of large individual operations. (Functions are also provided for using the lower-level constructs directly, but these do not provide a higher-level parallelism model.)

This intra-operation parallelism is effective for speeding up programs that spend most of their time in computational kernels. Code optimization in MatLab has historically involved replacing loops and conditionals with bulk vector and matrix operations, which is likely to improve scalability as well. This approach works best for regular applications.

Inter-operation parallelism in DSLs

Automatic parallelization by the compiler of fine-grained sequential code has only proved useful in limited contexts, mainly due to the difficulty of separating the fundamental data dependencies and anti-dependencies from incidental ones. One way to make this problem simpler is to perform the analysis on coarse-grained operations. Bulk matrix and vector operations are a natural start; this approach can be generalized by providing tools that let library designers provide information to the compiler. We refer to the result as implicitly parallel Domain Specific Languages (DSLs).

Stanford’s Pervasive Parallelism Lab is actively researching tools to help library writers build implicitly parallel DSLs. Their DELITE toolkit uses compile-time analysis to build blocks that are then dynamically scheduled [21].

Hadoop for business analytics

Internet advertising produces enormous volumes of data. Each time that an ad is presented there is an opportunity to record many pieces of information that can be used to identify the user and the associated page. This data can be combined with click-through rates to identify successful ad or product strategies, but the quantity of data gathered by a large advertiser necessitates parallel evaluation of business analytics queries. Analysis of large volumes of read-only data is a good fit for the MapReduce model. The most important performance concern is intelligent parallel I/O; the computation performed for each data element is not expensive. A complete solution for this problem might consist of a cluster of inexpensive Linux servers running Hadoop, with disk and node failures handled in software rather than via expensive hardware.

Parallel chess using Cilk

Exploring potential moves in chess provides ample opportunity for parallelism. The structure is naturally recursive and imbalanced, making it a good fit for fork-join execution with work stealing. StarTech is a parallel chess program written in Cilk. In 1993 it tied for third place at the ACM International Computer Chess Championship while running on a 512-processor CM-5. Because competitive chess is played with a time limit, StarTech has a unique scalability measure: chess rating as a function of processor count [56]!

Telecommunications switches in Erlang

The Erlang language was originally developed inside Ericsson for use in soft-real-time environments with extreme reliability requirements. Programs are written as a collection of actors communicating via messages. Each actor is written in a functional single-assignment style, and the language has built-in support for error recovery and code swapping. In Ericsson’s telecommunication switches, Erlang’s failure handling is used in a hierarchical style so that the failure of one service does not degrade other operations. The AXD301 switch was reported to provide 9-nines ($1 - 10^{-9}$) uptime, which corresponds to 31 milliseconds of downtime per year.

1.4 Shared Memory Multi-Threading

Shared memory multi-threading imposes the fewest restrictions on the programmer. It exposes the parallel execution contexts directly to the programmer, it allows all threads to read and write all memory, and it allows participants to synchronize or communicate with each other in arbitrary groups at any time. How do we reconcile the seeming contradiction that we recommend using the most restrictive model possible, while our research focus is on the least restrictive model? There are practical reasons why shared memory multi-threading will continue to be ubiquitous, and hence important:

• Mutable state allows unstructured non-local communication between components, which provides flexibility and agility to software;

• Threads add parallelism to imperative programming languages without changing existing component structuring techniques;

• The recent push for parallelism is driven by multi-core parallel processors, which can implement shared memory more easily than multiprocessors; and

• Threads with shared memory are a good generic substrate for higher level models.

1.4.1 Flexible communication

There are three ways in which components in an imperative sequential program can communicate. They can pass data as arguments to a function5 call, they can return values from a function, or the sender can write data to shared memory and the receiver can read it. The first two mechanisms allow data to be passed only between caller and callee. Non-exceptional multi-hop communication via parameters and return values requires the cooperation of all of the intervening parties, which complicates the code. If the initial design did not envision the need for communication between two components, then this style of communication may require changes to many call sites. Reads and writes to shared state, however, do not appear in the signature of a component, so additional communication pathways can be added more easily. Flexibility reduces the need for up-front design, and it reduces the cost of system iteration.

5. We use the term function here loosely to also include methods, procedures and impure functions.

1.4.2 Evolution of existing languages

Consider the problem of a language designer who wishes to add support for concurrent execution to their imperative sequential language: if parallel contexts are not able to communicate via the existing well known and well used pathway, the entire character of the language and its libraries will be changed. Multi-threaded access to shared memory has the favorable property that it can be made to work within existing paradigms.

When threads were originally introduced they were primarily a mechanism for tolerating I/O latency on systems that actually could only execute one thread at a time. In this context they are easy to use effectively; coarse-grained locking has low overheads and does not limit parallelism when there is none. Once multiple execution contexts became available in hardware, however, scalability became a concern. This thesis tackles the complexity that arises when implementing scalable programs.

1.4.3 Intra-chip coherence

The resurgence of parallel programming is being driven by the switch to multi-core processors. Multi-core processors have essentially replaced sequential processors in the desktop and server space, and are common even in embedded applications. Intra-chip communication has higher bandwidth, lower latency and lower power requirements than inter-chip communication, so the costs of maintaining a single coherent address space are lower than they would be for a multiprocessor with the same number of threads. In addition, by keeping cores coherent with each other they can share a last-level cache, maximizing cache utilization and providing efficient inter-core communication.

1.4.4 A generic substrate

One compelling reason to study shared memory multi-threading is that it can serve as a greatest common denominator for other parallel programming models. Threads communicating through shared memory can be used to implement all of the models from Section 1.2. Databases, P-Linq, Haskell, Unix pipes⁶, GRAMPS, BLAS, DELITE, Sequoia, Phoenix, MPI, Erlang, and Scala Actors all have at least one implementation that uses threads and that communicates (inside the implementation) using shared memory.

Scala's actors are a particularly interesting case. They are implemented purely as a library inside a language that allows unrestricted threads and shared mutable state. While this means that there is no compile-time checking that actors are not also using other forms of communication, Scala's powerful language features make the syntax feel native. Programs that use Scala actors can also harness parallel execution using lazy functional evaluation, fork-join parallelism using the JSR166Y framework [58], implicit intra-operation parallelism for operations on bulk data [71], and arbitrary threading using Java's concurrent collection classes or protected by locks or STM [16].

While a multi-paradigm approach such as Scala's allows each component of an application to choose the parallelization model that fits best, it prevents or complicates optimization techniques that rely on a specific parallelism structure. For example, the Erlang language provides no way to construct a pointer from one thread's data to another, so garbage collection can be performed separately for each thread's heap. Scala's actors all operate inside a single heap, making incremental garbage collection much more difficult. Polymorphic staging is an active research area that addresses this challenge by allowing type-safe libraries to participate in code generation [77].

⁶While Unix pipes use processes and message passing at the user level, inside the kernel they use hardware threads and shared memory buffers.

1.4.5 Platform and language choice

What threading platform and language should we use as a foundation for our research? One of the most important practical distinctions in imperative programming languages is whether they have automatic garbage collection. The idiomatic style used in languages with manual memory allocation largely uses patterns that allow object lifetimes to be computed using local information. While restrictive, one useful side effect of this programming style is that deterministic lifetime information can be used for resource management. Garbage collected languages ease sharing of immutable data and free the user from the reasoning required for manual deallocation, but don’t provide a bound on the lifetime of unreachable resources.

POSIX threads

Virtually all desktop computers and servers start with an abstraction of hardware processing contexts and a single coherent memory space. The operating system then virtualizes these primitives by subdividing memory (typically using hardware features to limit inter-program memory accesses) and multiplexing software threads among the available processors. The POSIX threads library (also known as pthreads) is a minimal C API for this abstraction [60]. It provides the ability to create new threads, and it integrates with the operating system’s scheduler so that threads that are waiting for an event do not need to occupy a physical execution context. POSIX threads are supported either natively or through an efficient translation layer on many operating systems, including Windows, OpenVMS, Linux, and the Unix dialects FreeBSD, NetBSD, Mac OS X, AIX, HP/UX and Solaris. As a result, programs written using pthreads are highly portable.

JVM and CLR threads

Oracle's JVM and Microsoft's CLR are sophisticated virtual machines that perform just-in-time compilation of bytecodes. Both the JVM and the CLR are targeted by many languages, and their threading models are similar. The high level conclusions of parallelism research using one VM should apply directly to the other.

JVM threads have the same semantics as POSIX threads, and support the same basic operations. In the HotSpot JVM each java.lang.Thread is actually executed on exactly one pthread_t. The monitors provided by Java bytecode are just ordinary reentrant mutexes with the additional constraint that entry and exit must be statically paired.

Although the JVM's natively supported threads and coordination primitives closely correspond to those in pthreads, it provides two powerful features that are not yet widely available in C or C++: a well-defined memory model and automatic garbage collection. Java's memory model defines the allowed behaviors when multiple threads access the same memory location, striking a balance between simple semantics and optimization opportunities [50]. Garbage collection is essential for optimistic concurrency and lock-free algorithms. If it is not provided natively then it must be implemented manually. Even though an algorithm-specific solution may be simpler than a generic collector this is still a substantial engineering effort. Hazard pointers for lock-free data structures [64] and delayed memory deallocation in software transactional memory [24] are both algorithm-specific collectors.

Garbage collection is also useful for implementing language features such as closures and continuations that make library-based parallel programming models appear to use native syntax. For example CCSTM uses features of the Scala programming language to make it appear that atomic is a keyword, but it is actually implemented as a function whose argument is an anonymous function that uses a block syntax.
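The same trick can be imitated in plain Java, though less cleanly. The sketch below is purely illustrative and is not CCSTM's actual API: the Stm.atomic method and RetryException are hypothetical names, and all transaction bookkeeping is elided; the point is only that an atomic block can be an ordinary method that receives a closure.

import java.util.concurrent.Callable;

// Hypothetical sketch of an atomic block as a library call. A real STM
// would log reads and writes inside block.call() and validate them at
// commit; here only the control flow around the user's closure is shown.
final class Stm {
    static final class RetryException extends RuntimeException {}

    static <T> T atomic(Callable<T> block) {
        while (true) {
            try {
                return block.call();        // run the closure; a commit would go here
            } catch (RetryException e) {
                // conflict detected: loop and re-execute the block
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
    }
}

// Usage (heavier than Scala's block syntax, but the same shape):
//   long total = Stm.atomic(() -> { /* reads and writes of shared state */ return 0L; });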

Language

Statically typed languages are more likely to achieve peak performance from the JVM or CLR, because types are an integral part of those VMs' bytecodes and optimization strategies. We have chosen to use a combination of Java and Scala in our research. Java provides maximum performance on the JVM, as its statements map almost directly onto JVM bytecodes. Scala uses a combination of type inference and syntactic sugar to improve on Java's verbosity while still generating efficient bytecode for the careful programmer. While our techniques could be adapted to unmanaged languages, our work focuses almost exclusively on managed languages. We have no reason to expect that our results would be different if the host language was C# or F#, or if the Scala code had been written in (more verbose) Java.

1.5 Our Vision for Multi-Threaded Programming

Our vision of multi-threaded programming consists of five parts: immutable data structures that can be used as values; efficient snapshots of mutable data structures; composable atomicity for operations on shared mutable state; type system involvement; and strong-by-default semantics. We are inspired by Tim Sweeney's layering of purely functional data structures and STM [87].

1.5.1 Immutable data structures

Good support for efficient immutable data structures allows much of the complexity of a program to be located in functions that map an unchanging input to an unchanging output. The most obvious concurrency benefit to performing operations on unchanging inputs is that thread scheduling doesn't affect the result. This isolation simplifies the implementation, at least for the many cases where the output is still useful when it is completed. Unchanging data also provides the advantage of referential integrity, which allows the programmer to assume that the same name (in the same scope) always refers to the same value. Without referential integrity, the intervening code may alter data, reducing opportunities for localized correctness reasoning.

One way to guarantee referential integrity is to use immutable (or persistent) data structures, in which every operation constructs a new value without destroying the old. Although the use of a single-assignment language makes it easier to code in this style (partly by making it harder to do anything else), it is also applicable to imperative languages that are founded on mutation. Java's String class is perhaps the most widely used immutable data structure. Internally it is usually implemented using a mutable char array, but it makes use of encapsulation to provide value semantics.
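A minimal Java sketch of this style, using an illustrative class rather than code from this thesis: every field is final and every "update" returns a fresh object, so a reference, once obtained, always denotes the same value.

// Illustrative immutable value type. Instances can be shared freely
// between threads because nothing about them can change after construction.
final class Interval {
    final int lo;
    final int hi;

    Interval(int lo, int hi) { this.lo = lo; this.hi = hi; }

    // "Updates" build new values; existing references are unaffected.
    Interval withHi(int newHi) { return new Interval(lo, newHi); }

    boolean contains(int x) { return lo < x && x < hi; }
}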

1.5.2 Snapshots for mutable data structures

Snapshots of mutable data structures provide most of the advantages of isolation while still supporting imperative programming's unstructured communication patterns. This allows the benefits of isolation to be introduced to existing code and existing languages without requiring a conversion to purely functional programming.

Snapshots often allow a more efficient implementation than a fully persistent data structure, because the explicit snapshot operation gives the system information about which versions of the data structure must be preserved. In a system using lazy copy-on-write, for example, objects that are unshared can be efficiently identified and updated in place. This allows many updates to be performed without memory allocation or deallocation. Snapshots are never less efficient than an immutable data structure, because they can be trivially implemented as a mutable reference to the immutable implementation.
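The last observation can be made concrete with a short sketch (the names are illustrative): a mutable collection backed by an immutable linked structure supports snapshot() as a single reference read, and later additions share the old nodes rather than copying them.

import java.util.concurrent.atomic.AtomicReference;

// Illustrative sketch: a mutable log whose snapshot is just the current
// reference to an immutable list. Structural sharing means add() never
// disturbs a snapshot that was taken earlier.
final class SnapshottableLog {
    static final class Cell {
        final String value;
        final Cell next;
        Cell(String value, Cell next) { this.value = value; this.next = next; }
    }

    private final AtomicReference<Cell> head = new AtomicReference<>(null);

    void add(String s) {
        while (true) {                        // standard lock-free push
            Cell h = head.get();
            if (head.compareAndSet(h, new Cell(s, h)))
                return;
        }
    }

    Cell snapshot() {                         // O(1), independent of size
        return head.get();
    }
}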

1.5.3 Composable operations

Snapshot isolation is not sufficient for structuring all concurrent operations, because the result of an isolated computation may be out-of-date before it is completed. This write skew is especially problematic for operations that perform updates to shared data.

Composability is in general desirable; the question is whether the advantages of a particular approach outweigh the disadvantages. In this thesis we will demonstrate that for an important class of data structures (sets and maps) the performance overheads are low and any additional complexity can be centralized in a shared library. Importantly, we also will demonstrate that composable operations can be introduced incrementally, with the resulting data structures providing competitive performance when implementing the interface of their non-composable counterpart.

1.5.4 Help from the type system

Java's final and C/C++'s const are already widely used to simplify reasoning about concurrent mutation. If a field is marked final then the program may assume that no concurrent modification is possible. const does not directly guarantee that another part of the program can't modify a memory location, because a non-const reference may also exist or may be created with a typecast, but it provides the same practical benefits when combined with a programming methodology that limits explicit casts and avoids sharing of mutable pointers and references.

This thesis is not a user study on the productivity impact of enforcing concurrency safety using the type system, nor a religious document that prescribes the "correct" amount of compile-time versus run-time checking. The author's personal experience, however, has made him wary of attempts to provide automatic concurrency control for every mutable memory location in a program. We have previously studied the use of run-time profiling and dynamic code rewriting to reduce the overhead of software transactional memory [17]. Our result was mostly successful, but its opaque performance model and high complexity convinced us that it was better to mark transactional data explicitly. Explicit annotation of data whose mutation must be composable allows strong semantics to be reliably provided with no performance overhead for non-transactional data.

Leveraging the type system does not necessarily require changes to the language. The STM provided by Haskell uses mutable boxes and a monad to statically separate transactionally-modifiable data from normal Haskell constructs. CCSTM and ScalaSTM use a similar strategy for Scala, albeit with fewer guarantees proved by the compiler.
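In Java the same strategy can be approximated with an explicit box type, so that only data the programmer wraps participates in concurrency control and everything else pays no overhead. The Ref class below is a hypothetical illustration of the idea, not the API of CCSTM, ScalaSTM, or Haskell's STM; a coarse lock stands in for the real transactional machinery.

// Hypothetical illustration: transactional data is marked explicitly by its
// type. Code holding a plain field cannot accidentally join a transaction,
// and code holding a Ref cannot bypass the concurrency control.
final class Ref<T> {
    private T value;

    Ref(T initial) { this.value = initial; }

    // Stand-ins for STM-managed reads and writes.
    synchronized T get() { return value; }
    synchronized void set(T v) { value = v; }
}

final class Account {
    final Ref<Long> balance = new Ref<>(0L);   // composable, transactional
    final String owner;                        // ordinary data, zero overhead
    Account(String owner) { this.owner = owner; }
}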

1.5.5 Strong-by-default semantics

In general, strong correctness models such as linearizability or strong atomicity require more communication than relaxed consistency models. This can make strong semantics less performant or less scalable than weak ones⁷. Nevertheless, it is our view that strong semantics should be the default.

Our preference for strong semantics is rooted in our experience in testing and debugging software systems. We have observed that when an interface allows two behaviors whose frequency differs dramatically, the infrequent behavior is more likely to be a source of bugs. In sequential code we have learned to look carefully at error handling code and at conditions such as buffer overflows, hash collisions, or memory allocation failures. While extra care is required, these rare conditions are repeatable and hence amenable to testing.

Rare interleavings in a concurrent system, however, are notoriously difficult to enumerate, reproduce, and test. The author's initial interest in STM arose after he spent three

⁷In a distributed setting strong correctness guarantees must also be traded against availability, but currently multi-threaded systems are usually considered to be a single failure domain.

weeks over the course of a year to debug a single race condition in a lock-based concurrent system; the bug never manifested in a development or QA environment but appeared about once a month in production servers.

Our view is that the default behavior for an operation's access to shared memory is that it should appear to be atomic, and that this atomicity should hold independently of the accesses performed by other code. Only after profiling has demonstrated that these strong semantics actually lead to an observable reduction of performance or scalability should localized correctness criteria be abandoned. In this thesis we demonstrate that strong semantics and efficiency are not mutually exclusive.

1.6 Compound Atomic Operations

Concurrent collection classes have emerged as one of the core abstractions of parallel programming using threads. They provide the programmer with the simple mental model that most method calls are linearizable, while admitting efficient and scalable implementations. This efficiency and scalability, however, comes from the use of non-composable concurrency control schemes that are tailored to the specific data structure and the operations provided. Full composability can be provided by utilizing a software transactional memory (STM), but the performance overheads make this unattractive.

In this thesis we improve on the limited primitives provided by existing concurrent sets and maps both by adding new primitives and by providing full automatic composability. In Chapter 2 we add a powerful new operation (clone) to the operations previously supported by efficient thread-safe maps. In Chapter 3 we implement efficient thread-safe maps whose operations can be composed with STM. In Chapter 4 we introduce and evaluate a data structure that provides both clone and full STM integration, while retaining good performance. Chapter 4 introduces the idea of combining lock-free algorithms with TM.

1.7 Our Contributions

This thesis presents new algorithms and techniques that make thread-safe sets and maps easier to use, while retaining excellent performance and scalability. At a high level we contribute:

• SnapTree – the first thread-safe ordered map data structure that provides both lazy snapshots and invisible readers. The snapshots allow consistent iteration and multiple atomic reads without blocking subsequent writes. Invisible readers are important for scalability on modern cache-coherent architectures (Chapter 2) [14].

• Transactional Predication – software transactional memory (STM) integration for existing non-composable thread-safe maps, without the performance penalty usually imposed by STM. Transactionally predicated collections used outside a software transaction have performance competitive with mature non-composable implementations, and better performance than existing transactional solutions when accessed inside an atomic block (Chapter 3) [15].

• Predicated SnapTrie – a hash trie that provides both lazy snapshots and full transactional memory integration. This data structure provides both the efficient atomic bulk reads of lazy copy-on-write and the efficient atomic compound updates of transactionally predicated concurrent maps. In developing this data structure we give conditions under which a lock-free algorithm can be simultaneously used inside and outside transactions, and we show how the lock-free algorithm can be made simpler by falling back to TM for exceptional cases (Chapter 4).

Chapter 2

SnapTree

2.1 Introduction

The widespread adoption of multi-core processors places an increased focus on data structures that provide efficient and scalable multi-threaded access. These data structures are a fundamental building block of many parallel programs; even small improvements in these underlying algorithms can provide a large performance impact. One widely used data structure is an ordered map, which adds ordered iteration and range queries to the key-value association of a map. In-memory ordered maps are usually implemented as either skip lists [73] or self-balancing binary search trees.

Research on concurrent ordered maps for multi-threaded programming has focused on skip lists, or on leveraging software transactional memory (STM) to manage concurrent access to trees [44, 4]. Concurrent trees using STM are easy to implement and scale well, but STM introduces substantial baseline overheads and performance under high contention is still an active research topic [2]. Concurrent skip lists are more complex, but have dependable performance under many conditions [30].

In this chapter we present a concurrent relaxed balance AVL tree. We use optimistic concurrency control, but carefully manage the tree in such a way that all atomic regions have fixed read and write sets that are known ahead of time. This allows us to reduce practical overheads by embedding the concurrency control directly. It also allows us to take advantage of algorithm-specific knowledge to avoid deadlock and minimize optimistic retries. To perform tree operations using only fixed sized atomic regions we use the following mechanisms: search operations overlap atomic blocks as in the hand-over-hand locking technique [7]; mutations perform rebalancing separately; and deletions occasionally leave a routing node in the tree. We also present a variation of our concurrent tree that uses lazy copy-on-write to provide a linearizable clone operation, which can be used for strongly consistent iteration.

Our specific contributions:

• We describe hand-over-hand optimistic validation, a concurrency control mechanism for searching and navigating a binary search tree. This mechanism minimizes spurious retries when concurrent structural changes cannot affect the correctness of the search or navigation result (Section 2.3.3).

• We describe partially external trees, a simple scheme that simplifies deletions by leaving a routing node in the tree when deleting a node that has two children, then opportunistically unlinking routing nodes during rebalancing. As in external trees, which store values only in leaf nodes, deletions can be performed locally while holding a fixed number of locks. Partially external trees, however, require far fewer routing nodes than an external tree for most sequences of insertions and deletions (Section 2.3.5).

• We describe a concurrent partially external relaxed balance AVL tree algorithm that uses hand-over-hand optimistic validation, and that performs all updates in fixed size critical regions (Section 2.3).

• We add copy-on-write to our optimistic tree algorithm to provide support for an atomic clone operation and snapshot isolation during iteration (Section 2.3.10).

• We show that our optimistic tree outperforms a highly-tuned concurrent skip list across many thread counts, contention levels, and operation mixes, and that our algorithm is much faster than a concurrent tree implemented using an STM. Our algorithm's throughput ranges from 13% worse to 98% better than the skip list's on a variety of simulated read and write workloads, with an average multi-threaded performance improvement of 32%. We also find that support for fast cloning and consistent iteration adds an average overhead of only 9% to our algorithm (Section 2.5).

2.2 Background

An AVL tree [1] is a self-balancing binary search tree in which the heights of the left and right child branches of a node differ by no more than one. If an insertion to or deletion from the tree causes this balance condition to be violated then one or more rotations are performed to restore the AVL invariant. In the classic presentation, nodes store only the difference between the left and right heights, which reduces storage and update costs. Balancing can also be performed if each node stores its own height.

The process of restoring the tree invariant becomes a bottleneck for concurrent tree implementations, because mutating operations must acquire not only locks to guarantee the atomicity of their change, but locks to guarantee that no other mutation affects the balance condition of any nodes that will be rotated before proper balance is restored. This difficulty led to the idea of relaxed balance trees, in which the balancing condition is violated by mutating operations and then eventually restored by separate rebalancing operations [35, 69, 57]. These rebalancing operations involve only local changes. Bougé et al. proved that any sequence of localized application of the AVL balancing rules will eventually produce a strict AVL tree, even if the local decisions are made with stale height information [11].

Binary search trees can be broadly classified as either internal or external. Internal trees store a key-value association at every node, while external trees only store values in leaf nodes. The non-leaf nodes in an external tree are referred to as routing nodes, each of which has two children. Internal trees have no routing nodes, while an external tree containing n values requires n leaves and n − 1 routing nodes.

Deleting a node from an internal tree is more complicated than inserting a node, because if a node has two children a replacement must be found to take its place in the tree. This replacement is the successor node, which may be many links away (y is x's successor in Figure 2.1). This complication is particularly troublesome for concurrent trees, because this means that the critical section of a deletion may encompass an arbitrary number of nodes.

Figure 2.1: Finding the successor for deletion of an internal node.

The original delayed rebalancing tree side-stepped this problem entirely, supporting only insert and search [1]. Subsequent research on delayed rebalancing algorithms considered only external trees. In an external tree, a leaf node may always be deleted by changing the link from its grandparent to point to its sibling, thus splicing out its parent routing node (see Figure 2.8.b).

While concurrent relaxed balance tree implementations based on fine-grained read-write locks achieve good scalability for disk based trees, they are not a good choice for a purely in-memory concurrent tree. Acquiring read access to a lock requires a store to a globally visible memory location, which requires exclusive access to the underlying cache line. Scalable locks must therefore be striped across multiple cache lines to avoid contention in the coherence fabric [62], making it prohibitively expensive to store a separate lock per node.

Optimistic concurrency control (OCC) schemes using version numbers are attractive because they naturally allow invisible readers, which avoid the coherence contention inherent in read-write locks. Invisible readers do not record their existence in any globally visible data structure, rather they read version numbers updated by writers to detect concurrent mutation. Readers 'optimistically' assume that no mutation will occur during a critical region, and then retry if that assumption fails. Despite the potential for wasted work, OCC can provide for better performance and scaling than pessimistic concurrency control [79].

Software transactional memory (STM) provides a generic implementation of optimistic concurrency control, which may be used to implement concurrent trees [44] or concurrent relaxed balance trees [4]. STM aims to deliver the valuable combination of simple parallel programming and acceptable performance, but internal simplicity is not the most important goal of a data structure library.¹

For a widely used component it is justified to expend a larger engineering effort to achieve the best possible performance, because the benefits will be multiplied by a large number of users.

STMs perform conflict detection by tracking all of a transaction's accesses to shared data. This structural validation can reject transaction executions that are semantically correct. Herlihy et al. used early release to reduce the impact of this problem [44]. Early release allows the STM user to manually remove entries from a transaction's read set. When searching in a binary tree, early release can mimic the effect of hand-over-hand locking for successful transactions. Failed transactions, however, require that the entire tree search be repeated. Elastic transactions require rollback in fewer situations than early release and do not require that the programmer explicitly enumerate entries in the read set, but rollback still requires that the entire transaction be reexecuted [29].

Skip lists are probabilistic data structures that on average provide the same time bounds as a balanced tree, and have good practical performance [73]. They are composed of multiple levels of linked lists, where the nodes of a level are composed of a random subset of the nodes from the next lower level. Higher lists are used as hints to speed up searching, but membership in the skip list is determined only by the bottom-most linked list. This means that a concurrent linked list algorithm may be augmented by lazy updates of the higher lists to produce a concurrent ordered data structure [72, 30]. Skip lists do not support structural sharing, so copy-on-write cannot be used to implement fast cloning or consistent iteration. They can form the foundation of an efficient concurrent priority queue [85].

2.3 Our Algorithm

We present our concurrent tree algorithm as a map object that supports five methods: get, put, remove, firstNode, and succNode. For space reasons we omit practical details such as user-specified Comparators and handling of null values. The get(k) operation returns either v, where v is the value currently associated with k, or null; put(k, v) associates k and a non-null v and returns either v₀, where v₀ is the previous value associated with k, or null; remove(k) removes any association between k and a value v₀, and returns either v₀ or null; firstNode() returns a reference to the node with the minimal key; and succNode(n) returns a reference to the node with the smallest key larger than n.key.

¹Here we are considering STM as an internal technique for implementing a concurrent tree with a non-transactional interface, not as a programming model that provides atomicity across multiple operations on the tree.

1 class Node {
2   volatile int height;
3   volatile long version;
4   final K key;
5   volatile V value;
6   volatile Node parent;
7   volatile Node left;
8   volatile Node right;
9   ...
10 }

Figure 2.2: The fields for a node with key type K and value type V.

The firstNode and succNode operations can be used trivially to build an ordered iterator (a sketch appears below). In Section 2.4 we will show that get, put, and remove are linearizable [47]. We will discuss optimistic hand-over-hand locking in the context of get (Section 2.3.3) and partially external trees in the context of remove (Section 2.3.5).

Our algorithm is based on an AVL tree, rather than the more popular red-black tree [6], because relaxed balance AVL trees are less complex than relaxed balance red-black trees. The AVL balance condition is more strict, resulting in more rebalancing work but smaller average path lengths. Pfaff [70] characterizes the workloads for which one tree performs better than the other, finding no clear winner. Our contributions of hand-over-hand optimistic validation and local deletions using partially external trees are also applicable to relaxed balance red-black trees. Lookup, insertion, update, and removal are the same for both varieties of tree. Only the post-mutation rebalancing (Section 2.3.6) is affected by the choice.
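As an illustration of the composition mentioned above, an ordered iterator can be layered on firstNode and succNode roughly as follows. This is a sketch written against the interface just described; the enclosing tree type is given the illustrative name SnapTreeMap, and the skipping of null-valued routing nodes anticipates Sections 2.3.5 and 2.3.8.

// Sketch of an ordered iterator built from firstNode() and succNode(n).
// Nodes with a null value are routing nodes and are skipped.
final class OrderedIterator {
    private final SnapTreeMap map;   // illustrative name for the enclosing tree
    private Node cur;

    OrderedIterator(SnapTreeMap map) {
        this.map = map;
        this.cur = skipRoutingNodes(map.firstNode());
    }

    boolean hasNext() { return cur != null; }

    Node next() {
        Node result = cur;
        cur = skipRoutingNodes(map.succNode(cur));
        return result;
    }

    private Node skipRoutingNodes(Node n) {
        while (n != null && n.value == null)
            n = map.succNode(n);
        return n;
    }
}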

2.3.1 The data structure: Node

The nodes that compose our tree have a couple of variations from those of a normal AVL tree: nodes store their own height rather than the difference of the heights of the children; nodes for a removed association may remain in the tree with a value of null; and nodes contain a version number used for optimistic concurrency control.

11 static long Unlinked = 0x1L;
12 static long Growing = 0x2L;
13 static long GrowCountIncr = 1L << 3;
14 static long GrowCountMask = 0xffL << 3;
15 static long Shrinking = 0x4L;
16 static long ShrinkCountIncr = 1L << 11;
17 static long IgnoreGrow = ~(Growing | GrowCountMask);

Figure 2.3: Version bit manipulation constants.

Figure 2.2 shows the fields of a node. All fields except for key are mutable. height, version, value, left, and right may only be changed while their enclosing node is locked. parent may only be changed while the parent node is locked (both the old and the new parent must be locked). The delayed balancing of our tree is a property of the algorithm, and is not visible in the type signature.

For convenience, the map stores a pointer to a root holder instead of the root node itself. The root holder is a Node with no key or value, whose right child is the root. The root holder allows the implementation to be substantially simplified because it never undergoes rebalancing, never triggers optimistic retries (its version is always zero), and allows all mutable nodes to have a non-null parent. The map consists entirely of the rootHolder reference.

2.3.2 Version numbers

The version numbers used in our algorithm are similar to those in McRT-STM, in which a reserved 'changing' bit indicates that a write is in progress and the remainder of the bits form a counter [79]. (Our algorithm separates the locks that guard node update from the version numbers, so the changing bit is not overloaded to be a mutex as in many STMs.) To perform a read at time t₁ and verify that the read is still valid at t₂: at t₁ read the associated version number v₁, blocking until the change bit is not set; read the protected value x; then at t₂ reread the version number v₂. If v₁ = v₂ then x was still valid at t₂.

Our algorithm benefits from being able to differentiate between the structural change to a node that occurs when it is moved down the tree (shrunk) and up the tree (grown). Some operations are invalidated by either shrinks or grows, while others are only invalidated by shrinks.


Figure 2.4: Two searching threads whose current pointer is involved in a concurrent rotation. The node 18 grew, so T1's search is not invalidated. The node 20 shrank, so T2 must backtrack.

We use a single 64-bit value to encode all of the version information, as well as to record if a node has been unlinked from the tree. There is little harm in occasionally misclassifying a grow as a shrink, because no operation will incorrectly fail to invalidate as a result. We therefore overlap the shrink counter and the grow counter. We use the most significant 53 bits to count shrinks, and the most significant 61 bits to count grows. This layout causes a grow to be misclassified as a shrink once every 256 changes, but it never causes a shrink to be misclassified as a grow. The bottom three bits are used to implement an unlink bit and two change bits, one for growing and one for shrinking (Figure 2.3).
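Putting the protocol and the bit layout together, a validated read of a single protected field looks roughly like the sketch below. readField is a stand-in for whichever field the version protects; Growing, Shrinking, and waitUntilNotChanging are the constants of Figure 2.3 and the helper of Figure 2.12.

// Sketch of the optimistic read protocol: read the version, perform the
// protected read, then re-read the version. Equal version numbers with no
// change bit set mean that no writer intervened.
Object readProtected(Node n) {
    while (true) {
        long v1 = n.version;
        if ((v1 & (Growing | Shrinking)) != 0) {
            waitUntilNotChanging(n);   // a change is in progress: wait, then retry
            continue;
        }
        Object x = readField(n);       // the read we want to validate
        long v2 = n.version;
        if (v1 == v2)
            return x;                  // x was still valid when v2 was read
        // otherwise a writer changed n between the two reads; retry
    }
}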

18 static Object Retry = new Object();
19
20 V get(K k) {
21   return (V)attemptGet(k, rootHolder, 1, 0);
22 }
23
24 Object attemptGet(
25     K k, Node node, int dir, long nodeV) {
26   while (true) {
27     Node child = node.child(dir);
28     if (((node.version^nodeV) & IgnoreGrow) != 0)
29       return Retry;
30     if (child == null)
31       return null;
32     int nextD = k.compareTo(child.key);
33     if (nextD == 0)
34       return child.value;
35     long chV = child.version;
36     if ((chV & Shrinking) != 0) {
37       waitUntilNotChanging(child);
38     } else if (chV != Unlinked &&
39         child == node.child(dir)) {
40       if (((node.version^nodeV) & IgnoreGrow) != 0)
41         return Retry;
42       Object p = attemptGet(k, child, nextD, chV);
43       if (p != Retry)
44         return p;
45 }}}

Figure 2.5: Finding the value associated with a k.

2.3.3 Hand-over-hand optimistic validation: get(k)

If k is present in the map then get(k) must navigate from the root holder to the node that holds k. If k is not present in the tree then get must navigate to the node that would be k's parent if it were inserted. If no concurrency control is performed, a search may be led astray by a concurrent rotation. The well-known lock-based technique for handling this is hand-over-hand locking (also known as spider locks, lock chaining, chain locking, ...), which decreases the duration over which locks are held by releasing locks on nodes whose rotation can no longer affect the correctness of the search [7]. With either exclusive locks or read-write locks the root lock must be acquired by each accessor, making it a point of contention. We use optimistic validation to guard critical regions, chaining in a manner similar to hand-over-hand locking. This avoids the scalability problems of the root lock while using only fixed size critical regions (see Figure 2.4).


Figure 2.6: Hand-over-hand optimistic validation.

The key to hand-over-hand optimistic validation is to reason explicitly about the implicit state of a search, which consists of an open interval of keys that must either be absent from the entire tree or present in the current subtree. Each time that the search process performs a comparison and navigates downward, the interval is reduced. At all times the interval includes the target key, so if the subtree ever becomes empty we can conclude that no node with that key is present.

The optimistic validation scheme only needs to invalidate searches whose state is no longer valid. If a search that was valid when it was at node n has since traversed to a child node c of n, and the pointer from n to c has not been modified, then the search is still valid, because the subtree has not been changed. To prevent false invalidations we use separate version numbers to track changes to n.left and n.right. We reduce storage overheads by actually storing the version number that protects the link from n to c in c.version. (This requires us to traverse the link twice, once to locate the version number and once as a read that is actually protected by OCC.) To successfully navigate through a node, there must be a point where both the inbound and outbound link are valid. This is accomplished by validating the version number that protects the inbound link after both the outbound link and its version number have been read.

A search may still be valid despite a change to a child link, if every node in the tree within the computed bounds must still be contained in the subtree. Consider the scenario in Figure 2.4, in which a mutating thread T3 performs a rotation while two searches are in progress. (We defer a discussion of the visible intermediate states during rotation to Section 2.3.7.)

T1's search points to the node that is raised by the rotation, which means that its implicit state is not invalidated. Intuitively if a rotation 'grows' the branch rooted at a node, then a search currently examining the node will not fail to find its target. Conversely, if a rotation 'shrinks' the branch under a node, then a search pointing to that node may falsely conclude that a node is not present in the tree. In the latter case the implicitly computed subtree bounds are incorrect. T2's search points to a shrinking node, which means that it must backtrack to node 14, the previous level of the search. Changes to a child pointer may also be 'neutral' if they preserve the range of the keys contained in the subtree, as can occur during deletion.

Figure 2.5 shows the code for the get operation. The bulk of the work is accomplished by attemptGet. attemptGet assumes that the pointer to node was read under version number nodeV, and is responsible for revalidating the read at Line 40 after performing a validated read of the child pointer at Line 39. If there is no child then the final validation of the traversal to node occurs on Line 28. Note that Line 27's access to the child is not protected, it is merely used to gain access to the version number that protects the inbound pointer. The hand-off of atomically executed regions occurs between the child read and the final validation (Line 28). attemptGet returns the special value Retry on optimistic failure, which triggers a retry in the outer call. get emulates the first half of a hand-off by pretending to have followed a pointer to rootHolder under version 0. get does not need a retry loop because the outer-most invocation cannot fail optimistic validation, as rootHolder.version is always 0. IgnoreGrow is used to ignore changes that grow the subtree, since they cannot affect the correctness of the search. We discuss the waitUntilNotChanging method in Section 2.3.9.

Conceptually, the execution interval between reading the version number at Line 35 and the validation that occurs in the recursive invocation at Line 40 constitutes a read-only transaction. Each level of the attemptGet recursion begins a new transaction and commits the transaction begun by its caller. Interestingly, we can only determine retroactively whether any particular validation at Line 40 was the final validation (and hence the commit) of the atomic region. If a recursive call returns Retry the enclosing loop will attempt another validation, which has the effect of extending the duration of the transaction begun by the caller.

46 V put(K k, V v) {
47   return (V)attemptPut(k, v, rootHolder, 1, 0);
48 }
49
50 Object attemptPut(
51     K k, V v, Node node, int dir, long nodeV) {
52   Object p = Retry;
53   do {
54     Node child = node.child(dir);
55     if (((node.version^nodeV) & IgnoreGrow) != 0)
56       return Retry;
57     if (child == null) {
58       p = attemptInsert(k, v, node, dir, nodeV);
59     } else {
60       int nextD = k.compareTo(child.key);
61       if (nextD == 0) {
62         p = attemptUpdate(child, v);
63       } else {
64         long chV = child.version;
65         if ((chV & Shrinking) != 0) {
66           waitUntilNotChanging(child);
67         } else if (chV != Unlinked &&
68             child == node.child(dir)) {
69           if (((node.version^nodeV) & IgnoreGrow)!=0)
70             return Retry;
71           p = attemptPut(k, v, child, nextD, chV);
72   }}}
73   } while (p == Retry);
74   return p;
75 }
76 Object attemptInsert(
77     K k, V v, Node node, int dir, long nodeV) {
78   synchronized (node) {
79     if (((node.version^nodeV) & IgnoreGrow) != 0 ||
80         node.child(dir) != null)
81       return Retry;
82     node.setChild(dir, new Node(
83       1, k, v, node, 0, null, null));
84   }
85   fixHeightAndRebalance(node);
86   return null;
87 }
88 Object attemptUpdate(Node node, V v) {
89   synchronized (node) {
90     if (node.version == Unlinked) return Retry;
91     Object prev = node.value;
92     node.value = v;
93     return prev;
94   }
95 }

Figure 2.7: Inserting or updating the value associated with k.

This 'resurrection' of a transaction that was previously considered committed is only possible because these atomic regions perform no writes.

2.3.4 Insertion and update: put(k,v)

The put operation may result in either an insertion or an update, depending on whether or not an association for the key is already present in the tree. It starts in exactly the same manner as get, because its first task is to identify the location in the tree where the change should occur. We do not extract this common functionality into a separate function, because during insertion we must perform the last check of the hand-over-hand optimistic validation after a lock on the parent has been acquired. Figure 2.7 shows the implementation of put.

If we discover that node.key matches the target key then we can be trivially certain that any concurrent tree rotations will not affect the ability of the search to find a matching node. This means that on Line 90 of the attemptUpdate helper function we do not need to examine node.version for evidence of shrinks, rather we only verify that the node has not been unlinked from the tree. The lock on a node must be acquired before it is unlinked, so for the duration of attemptUpdate's critical region we can be certain that the node is still linked.

To safely insert a node into the tree we must acquire a lock on the future parent of the new leaf, and we must also guarantee that no other inserting thread may decide to perform an insertion of the same key into a different parent. It is not sufficient to merely check that the expected child link is still null after acquiring the parent lock. Consider the tree b(a,d(·,e)), in which p(l,r) indicates that l is the left child of p and r is the right child of p, · represents a null child link, and the keys have the same ordering as the name of their node. A put of c by a thread T1 will conclude that the appropriate parent is d. If some other thread performs a left-rotation at b then the tree will be d(b(a,·),e), which may cause a second thread T2 to conclude that the insertion should be performed under b.

If T2 proceeds, and a double rotation (left at b, right at d) is later performed, the resulting tree will be c(b(a,·),d(·,e)). To guard against sequences such as this, Lines 79 and 80 of attemptInsert perform the same closing validation as in get. Any rotation that could change the parent into which k should be inserted will invalidate the implicit range of the traversal that arrived at the parent, and hence will be detected by the final validation.

Figure 2.8: A sequence of two deletions in different types of trees: a) deletion in an internal tree; b) deletion in an external tree; c) deletion in a partially external tree.

2.3.5 Partially external trees: remove(k)

Up until now we have only considered operations that require locking a single node; removals are more difficult. In an internal tree with no routing nodes, deletion of a node n with two children requires that n's successor s be unlinked from n.right and linked into n's place in the tree. Locating s may require the traversal of as many as n.height − 1 links. In a concurrent tree the unlink and relink of s must be performed atomically, and any concurrent searches for s must be invalidated. Every node along the path from n to s's original location must be considered to have shrunk, and hence must be locked. This excessive locking negatively impacts both performance and scalability.

96 V remove(K k) {
97   return (V) attemptRemove(k, rootHolder, 1, 0);
98 }
99 ... // attemptRemove is similar to attemptPut
100 boolean canUnlink(Node n) {
101   return n.left == null || n.right == null;
102 }
103 Object attemptRmNode(Node par, Node n) {
104   if (n.value == null) return null;
105   Object prev;
106   if (!canUnlink(n)) {
107     synchronized (n) {
108       if (n.version == Unlinked || canUnlink(n))
109         return Retry;
110       prev = n.value;
111       n.value = null;
112     }
113   } else {
114     synchronized (par) {
115       if (par.version == Unlinked || n.parent != par
116           || n.version == Unlinked)
117         return Retry;
118       synchronized (n) {
119         prev = n.value;
120         n.value = null;
121         if (canUnlink(n)) {
122           Node c = n.left == null ? n.right : n.left;
123           if (par.left == n)
124             par.left = c;
125           else
126             par.right = c;
127           if (c != null) c.parent = par;
128           n.version = Unlinked;
129     }}}
130     fixHeightAndRebalance(par);
131   }
132   return prev;
133 }

Figure 2.9: Removing k's association, either by unlinking the node or by converting it to a routing node.

Figure 2.10: Local tree improvements: a. A right rotation of d. b. A right rotation of f over a left rotation of b. c. Unlink of the routing node b.

Previous research on concurrent relaxed balance trees handles this problem by using external trees [69] (or by prohibiting deletion entirely). In an external tree all key-value associations are held in leaf nodes, so there is never a deletion request that cannot be satisfied by a local operation. An external tree of size N requires N − 1 routing nodes, increasing the storage overhead and the average search path length. We would like to avoid these penalties while still taking advantage of an external tree's simple deletion semantics.

Our solution is to use what we refer to as partially external trees, in which routing nodes are only created during the removal of a node with two children. Routing nodes are never created during insertion, and routing nodes with fewer than two children are unlinked during rebalancing. Removal of a node with fewer than two children is handled by immediately unlinking it. In the worst case, partially external trees may have the same number of routing nodes as an external tree, but we observe that in practice, the number of routing nodes is much smaller (see Figure 2.15). To illustrate, Figure 2.8 shows a sequence of deletions in an internal tree, an external tree, and a partially external tree.

Our tree algorithm uses the same Node data type to represent both key-value associations and routing nodes. This allows a value node to be converted to a routing node by modifying a field in the node; no changes to inter-node links are required. Specifically, value nodes are those with non-null values and routing nodes are those that have a null value². A routing node for k is converted back to a value node by a call to put(k,v).

Figure 2.9 gives some of the code for implementing remove. The process of removal follows the same pattern as put. Where Line 58 of attemptPut performs an insertion, attemptRemove merely returns null, and where Line 62 of attemptPut calls

²In the benchmarked implementation we support user-supplied null values by encoding and decoding them as they cross the tree's public interface.

attemptUpdate, attemptRemove calls attemptRmNode.

attemptRmNode repeats our algorithm's motif of check, lock, recheck, then act. Combined with a retry loop in the caller, the motif implements optimistic concurrency control. attemptRmNode, however, is complicated enough to illustrate a way in which OCC tailored to an application can reduce optimistic failure rates. Line 106 performs a check to see if the node may be unlinked or if it should be converted to a routing node. If unlinking is possible, locks are acquired on both the parent p and the node n, and then Line 121 verifies that unlinking is still possible. If unlinking is no longer possible, a generic OCC algorithm would have to roll back and retry, but this is not actually necessary. All of the locks required to convert n to a routing node are already held, so regardless of the outcome of this check the critical section may complete its work. In contrast, optimistic retry is required if the recheck of canUnlink on Line 108 shows that unlinking has become possible, because the locks held are only a subset of those required for unlinking.
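Stripped of the details of unlinking, the motif has the following shape. This is a generic sketch rather than code from the tree: looksApplicable and act are hypothetical stand-ins, Retry is the sentinel used throughout this chapter, and the caller is assumed to loop while Retry is returned.

// Generic shape of the check / lock / recheck / act motif. The unlocked
// check chooses which locks to take; the recheck under those locks decides
// whether the action is still valid or the operation must be retried.
Object attemptSomething(Node n) {
    if (!looksApplicable(n))            // 1. cheap check, no locks held
        return null;
    synchronized (n) {                  // 2. lock
        if (n.version == Unlinked ||    // 3. recheck under the lock
            !looksApplicable(n))
            return Retry;               //    the unlocked check went stale
        return act(n);                  // 4. act while still holding the lock
    }
}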

2.3.6 Local tree improvements: fixHeightAndRebalance

The implementations of get, put, and remove require only a binary search tree, but to achieve good performance the tree must be approximately balanced. Our algorithm performs local improvements to the tree using the fixHeightAndRebalance method, which is called when a node is inserted or unlinked from the tree. This method recomputes the height field from the heights of the children, unlinks routing nodes with fewer than two children, and performs a single or double rotation to reduce imbalance.

Our algorithm applies the same set of rotations as in a sequential AVL tree. Figure 2.10 shows a right rotation of d, a double rotation of f (we refer to this as a right-over-left rotation), and an unlink of b. The remaining possible local improvements are mirror images of these.

In a strict AVL tree the balance factor is never smaller than −2 or larger than +2, but in a relaxed balance AVL tree this is not the case. Multiple insertions or deletions may have accumulated before rebalancing, leading to a balance factor outside the normal range. We use the same rotation selection criteria as Bougé et al. [11]. If the apparent balance of a node n is n.left.height − n.right.height, then we apply a right rotation if a node's apparent balance is ≥ +2 and a left rotation if a node's apparent balance is ≤ −2. Prior to a right rotation of a node n, if the apparent balance of n.left is ≤ 0 a left rotation is first performed on n.left. A similar rule is applied prior to a left rotation of n (these selection rules are sketched in code below).

In a concurrent relaxed balance tree there is an important distinction between the actual height and the value stored in Node.height. A node's height field records only the height as was apparent at a previous point in time, not the height that would be computed by a traversal of the tree in its current state. Bougé et al. establish the important theoretical result that, even if rebalancing is performed using only the apparent height, a sequence of localized improvements to the tree eventually results in a strict AVL tree [11]. The difficulty lies in efficiently identifying the set of nodes which can be improved.

Our algorithm guarantees that the tree will be a strict AVL tree whenever it is quiescent. Each mutating operation (insertion, removal, or rotation) is careful to guarantee that repairs will only be required for a node n or one of its ancestors, and that if no repair to n is required, no repair to any of n's ancestors is required. The node n is the one that is passed to fixHeightAndRebalance. The only mutation that may require repairs that don't fit this model is a double rotation in which the lower (first) rotation is not sufficient to restore balance. In this situation we do not attempt to merge the two rotations. We instead perform the lower rotation and then reapply the balance rules to the same node n.

Repair of a node requires that the children's heights be read, but performance and scalability would be heavily impacted if locks were required to perform these reads. Version numbers could be used to build a read-only atomic region, but this is not necessary. When considering if a node n should have its height field updated or should be rotated (both of which must be done under a lock), it is correct to perform the read of n.left.height and n.right.height without locks or optimistic retry loops. If no other thread is concurrently modifying those fields, then the check is atomic without locks or version numbers. If one of the reads is incorrect, then the thread that is performing a concurrent change is responsible for repairing n. If another thread is responsible for the repair, then it is okay if the current fixHeightAndRebalance incorrectly decides that no repair of n is necessary.

We omit the code for fixHeightAndRebalance due to space constraints (a download link for complete code is given in Appendix A), but it uses the same concurrency control structure as get and remove. An outer loop performs unlocked reads to determine whether the height should be adjusted, a node should be unlinked, a rotation should be performed, or if the current node needs no repair.
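Written out directly from the rules above, the selection logic looks roughly like the following sketch. It is not the omitted fixHeightAndRebalance method: lock acquisition, the unlinking of routing nodes, and the concurrent rechecks are all elided; height is the helper used in Figure 2.11, rotateRight is the code of Figure 2.11, and rotateLeft is its mirror image.

// Sketch of the rotation selection rules, using apparent (possibly stale)
// heights. Balances of +2 or more trigger a right rotation, -2 or less a
// left rotation, with the child rotated first when a double
// (right-over-left or left-over-right) rotation is called for.
void applyBalanceRules(Node n) {
    int bal = height(n.left) - height(n.right);   // apparent balance of n
    if (bal >= 2) {
        if (height(n.left.left) - height(n.left.right) <= 0)
            rotateLeft(n.left);    // first half of a right-over-left rotation
        rotateRight(n);
    } else if (bal <= -2) {
        if (height(n.right.left) - height(n.right.right) >= 0)
            rotateRight(n.right);  // first half of a left-over-right rotation
        rotateLeft(n);
    }
}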

134 // n.parent, n, and n.left are locked on entry
135 void rotateRight(Node n) {
136   Node nP = n.parent;
137   Node nL = n.left;
138   Node nLR = nL.right;
139
140   n.version |= Shrinking;
141   nL.version |= Growing;
142
143   n.left = nLR;
144   nL.right = n;
145   if (nP.left == n) nP.left = nL; else nP.right = nL;
146
147   nL.parent = nP;
148   n.parent = nL;
149   if (nLR != null) nLR.parent = n;
150
151   int h = 1 + Math.max(height(nLR), height(n.right));
152   n.height = h;
153   nL.height = 1 + Math.max(height(nL.left), h);
154
155   nL.version += GrowCountIncr;
156   n.version += ShrinkCountIncr;
157 }

Figure 2.11: Performing a right rotation. Link update order is important for interacting with concurrent searches.

If a change to the node is indicated, the required locks are acquired, and then the appropriate action is recomputed. If the current locks are sufficient to perform the newly computed action, or if the missing locks can be acquired without violating the lock order, then the newly computed action is performed. Otherwise, the locks are released and the outer loop restarted. If no local changes to the tree are required then control is returned to the caller, otherwise the process is repeated on the parent.

The critical section of a right rotation is shown in Figure 2.11. This method requires that the parent, node, and left child be locked on entry. Java monitors are used for mutual exclusion between concurrent writers, while optimistic version numbers are used for concurrency control between readers and writers. This separation allows the critical region to acquire permission to perform the rotation separately from reporting to readers that a change is in progress.

This means that readers are only obstructed from Line 140 to Line 156. This code performs no allocation, has no backward branches, and all function calls are easily inlined.

2.3.7 Link update order during rotation

The order in which links are updated is important. A concurrent search may observe the tree in any of the intermediate states, and must not fail to be invalidated if it performs a traversal that leads to a branch smaller than expected. If the update on Line 145 was performed before the updates on Lines 143 and 144, then a concurrent search for n.key that observed only the first link change could follow a path from n.parent to n.left to n.left.right (none of these nodes are marked as shrinking), incorrectly failing to find n. In general, downward links originally pointing to a shrinking node must be changed last and downward links from a shrinking node must be changed first. A similar logic can be applied to the ordering of parent updates.

2.3.8 Iteration: firstNode() and succNode(n)

firstNode() and succNode(n) are the internal building blocks of an iterator interface. Because they return a reference to a Node, rather than a value, the caller is responsible for checking later that the node is still present in the tree. In an iterator this can be done by internally advancing until a non-null value is found.

firstNode returns the left-most node in the tree. It walks down the left spine using hand-over-hand optimistic validation, always choosing the left branch. Optimistic retry is only required if a node has shrunk.

succNode uses hand-over-hand optimistic validation to traverse the tree, but unlike searches that only move down the tree it must retry if either a shrink or grow is encountered. A complex implementation is possible that would tolerate grows while following parent links and shrinks while following child links, but it would have to perform key comparisons to determine the correct link to follow. We instead apply optimistic validation to the normal tree traversal algorithm, which is able to find the successor based entirely on the structure of the tree. If n is deleted during iteration then succNode(n) searches from the root using n.key.

158 static int SpinCount = 100;
159
160 void waitUntilNotChanging(Node n) {
161     long v = n.version;
162     if ((v & (Growing | Shrinking)) != 0) {
163         int i = 0;
164         while (n.version == v && i < SpinCount) ++i;
165         if (i == SpinCount) synchronized (n) { };
166     }
167 }

Figure 2.12: Code to wait for an obstruction to clear.

2.3.9 Blocking readers: waitUntilNotChanging

Prior to changing a link that may invalidate a concurrent search or iteration, the writer sets either the Growing or Shrinking bit in the version number protecting the link, as described in Section 2.3.2. After the change is completed, a new version number is installed that does not have either of these bits set. During this interval a reader that wishes to traverse the link will be obstructed. Our algorithm is careful to minimize the duration of the code that executes while the version has a value that can obstruct a reader. No system calls are made, no memory is allocated, and no backward branches are taken. This means that it is very likely that a small spin loop is sufficient for a reader to wait out the obstruction.

Figure 2.12 shows the implementation of the waiter. If the spin loop is not sufficient to wait out the obstruction, Line 165 acquires and then releases the changing node’s monitor. The obstructing thread must hold the monitor to change node.version. Thus after the empty synchronized block has completed, the version number is guaranteed to have changed. The effect of a properly tuned spin loop is that readers will only fall back to the synchronization option if the obstructing thread has been suspended, which is precisely the situation in which the reader should block itself. Tolerance of high multi-threading levels requires that threads that are unable to make progress quickly block themselves using the JVM’s builtin mechanisms, rather than wasting resources with fruitless retries.

2.3.10 Supporting fast clone

We extend our concurrent tree data structure to support clone, an operation that creates a new mutable concurrent tree containing the same key-value associations as the original. After a clone, changes made to either map do not affect the other. clone can be used to checkpoint the state of the map, or to provide snapshot isolation during iteration or bulk read.

We support fast cloning by sharing nodes between the original tree and the clone, lazily copying shared nodes prior to modifying them. This copy-on-write scheme requires that we be able to mark all nodes of a tree as shared without visiting them individually. This is accomplished by delaying the marking of a node until its parent is copied. All nodes in the tree may be safely shared once the root node has been explicitly marked and no mutating operations that might not have observed the root’s mark are still active. The clone method marks the root as shared, and then returns a new enclosing tree object with a new root holder pointing to the shared root.

Nodes are explicitly marked as shared by setting their parent pointer to null. Clearing this link also prevents a Java reference chain from forming between unshared nodes under different root holders, which would prevent garbage collection of the entire original tree. Lazy copying is performed during the downward traversal of a put or remove, and during rebalancing. The first access to a child link in a mutating operation is replaced by a call to unsharedLeft or unsharedRight (see Figure 2.13). Both children are copied at once to minimize the number of times that the parent node must be locked.

Mutating operations that are already under way must be completed before the root can be marked, because they may perform updates without copying, and because they need to traverse parent pointers to rebalance the tree. To track pending operations, we separate updates into epochs. clone marks the current epoch as closed, after which new mutating operations must await the installation of a new epoch. Once all updates in the current epoch have completed, the root is marked shared and updates may resume. We implement epochs as objects that contain a count of the number of pending mutating operations, a flag that indicates when an epoch has been closed, and a condition variable used to wake up threads blocked pending the completion of a close. The count is striped across multiple cache lines to avoid contention. Each snap-tree instance has its own epoch instance.

168 Node unsharedLeft(Node p) {
169     Node n = p.left;
170     if (n.parent != null)
171         return n;
172     lazyCopyChildren(p);
173     return p.left;
174 }
175 Node unsharedRight(Node p) { ... }
176
177 void lazyCopyChildren(Node n) {
178     synchronized (n) {
179         Node cl = n.left;
180         if (cl != null && cl.parent == null)
181             n.left = lazyCopy(cl, n);
182         Node cr = n.right;
183         if (cr != null && cr.parent == null)
184             n.right = lazyCopy(cr, n);
185     }
186 }
187 Node lazyCopy(Node c, Node newPar) {
188     return new Node(c.key, c.height, c.value, newPar,
189                     0L, markShared(c.left), markShared(c.right));
190 }
191 Node markShared(Node node) {
192     if (node != null) node.parent = null;
193     return node;
194 }

Figure 2.13: Code for lazily marking nodes as shared and performing lazy copy-on-write. Nodes are marked as shared while copying the parent.
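For concreteness, the epoch object described above can be sketched as follows. This is only a minimal illustration using a single Java monitor and an unstriped count; the actual implementation stripes the pending-operation count across multiple cache lines, and the method names here are hypothetical.

class Epoch {
  private var pending = 0          // mutating operations active in this epoch
  private var closed = false       // set by clone() when the epoch is shut

  // Called at the start of put/remove; returns false if the caller must
  // wait for the next epoch to be installed.
  def enter(): Boolean = synchronized {
    if (closed) false
    else { pending += 1; true }
  }

  // Called when a mutating operation completes.
  def exit(): Unit = synchronized {
    pending -= 1
    if (closed && pending == 0) notifyAll()   // wake the closing thread
  }

  // Called by clone(): close the epoch, then wait until pending operations
  // have drained so that the root can be marked shared.
  def closeAndAwaitDrain(): Unit = synchronized {
    closed = true
    while (pending > 0) wait()
  }
}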

2.4 Correctness

Deadlock freedom: Our algorithm uses the tree to define allowed lock orders. A thread that holds no locks may request a lock on any node, and a thread that has already acquired one or more locks may only request a lock on one of the children of the node most recently locked. Each critical region preserves the binary search tree property, and each critical region only changes child and parent links after acquiring all of the required locks. A change of p.left or p.right to point to n requires locks on p, n, and the old parent of n, if any exists. This means that it is not possible for two threads T1 and T2 to hold locks on nodes p1 and p2, respectively, and for T1 to observe that n is a child of p1 while T2 observes that n is a child of p2.

This protocol is deadlock free despite concurrent changes to the tree structure. Consider threads Ti that hold at least one lock (the only threads that may participate in a deadlock cycle). Let ai be the least recently acquired lock held by Ti, and let zi be the most recently acquired lock held by Ti. The node zi is equal to ai or is a descendant of ai, because locks are acquired only on children of the previous zi and each child traversal is protected by a lock held by Ti. If Ti is blocked by a lock held by Tj, the unavailable lock must be aj. If the unavailable lock were not the first acquired by Tj then both Ti and Tj would agree on the parent and hold the parent lock, which is not possible. This means that if a deadlock cycle were possible it must consist of two or more threads T1 · · · Tn where zi is the parent of a(i mod n)+1. Because no such loop exists in the tree structure, and all parent-child relationships in the loop are protected by the lock required to make them consistent, no deadlock cycle can exist.

Linearizability: To demonstrate linearizability [47] we will define the linearization point for each operation and then show that operations for a particular key produce results consistent with sequential operations on an abstract map structure.

We define the linearization point for put(k,v) to be the last execution of Line 82 or 92 prior to the operation’s completion. This corresponds to a successful attemptInsert or attemptUpdate. We define the linearization point for get(k) to be the last execution of Line 27 if Line 39 is executed, Line 34 if that line is executed and child.value ≠ null or child.version ≠ Unlinked, or otherwise Line 128 during the successful attemptRmNode that removed child. If remove(k) results in the execution of either Line 111 or 120 we define that to be the linearization point. We omit the details for the linearization point of a remove operation that does not modify the map, but it is defined analogously to get’s linearization point.

Atomicity and ordering are trivially provided between puts that linearize at Line 92 (updates) and removals that change the tree (both the introduction of routing nodes and unlinking of nodes) by their acquisition of a lock on the node and their check that node.version ≠ Unlinked. Nodes are marked unlinked while locked, so it is not possible for separate threads to simultaneously lock nodes n1 and n2 for k, and observe that neither is unlinked. This means that any operations that operate on a locked node for k must be operating on the same node instance, so they are serialized.

The only mutating operation that does not hold a lock on the node for k while linearizing is insertion, which instead holds a lock on the parent. The final hand-over-hand optimistic validation during insert (Line 80) occurs after a lock on the parent has been acquired. The validation guarantees that if a node for k is present in the map it must be in the branch rooted at node.child(dir), which is observed to be empty. This means that no concurrent update or remove operation can observe a node for k to exist, and that no concurrent insert can disagree about the parent node into which the child should be inserted. Since concurrent inserts agree on the parent node, their lock on the parent serializes them (and causes the second insert to discover the node on Line 80, triggering retry).

Linearization for get is a bit trickier, because in some cases the last validation (Line 28) precedes the read of child.value (Line 34); during this interval child may be unlinked from the tree. If a node for k is present in the tree at Line 27 then get will not fail to find it, because the validation at Line 28 guarantees that the binary search invariant held while reading node.child(dir). This means that when returning at Line 39 we may correctly linearize at Line 27. If a child is discovered that has not been unlinked prior to the read of its value, then the volatile read of this field is a correct linearization point with any concurrent mutating operations. If child ≠ null but it has been unlinked prior to Line 34 then we will definitely observe a value of null. In that case attemptRmNode had not cleared the child link of node (Line 124 or 126) when we read child, but it has since set child.version to Unlinked (Line 128). We therefore declare that get linearizes at the moment when k was definitely absent, before a potential concurrent insert of k.

2.5 Performance

In this section we evaluate the performance of our algorithm. We compare its performance to Doug Lea’s lock-free ConcurrentSkipListMap, the fastest concurrent ordered map implementation for Java VMs of which the author is aware. We also evaluate our performance relative to two red-black tree implementations, one of which uses a single lock to guard all accesses and one of which is made concurrent by an STM.

The benchmarked implementation of our algorithm is written in Java. For clarity this chapter describes put, remove, and fixHeightAndRebalance as separate methods, but the complete code does not have this clean separation. The ConcurrentMap operations of put, putIfAbsent, replace, and remove are all implemented using the same routine, with a case statement to determine behavior once the matching node has been found. In addition, an attempt is made to opportunistically fix the parent’s height during insertion or removal while the parent lock is still held, which reduces the number of times that fixHeightAndRebalance must reacquire a lock that was just released. The benchmarked code also uses a distinguished object to stand in for a user-supplied null value, encoding and decoding at the boundary of the tree’s interface.

Experiments were run on a Dell Precision T7500n with two quad-core 2.66GHz Intel Xeon X5550 processors and 24GB of RAM. Hyper-Threading was enabled, yielding a total of 16 hardware thread contexts. We ran our experiments in Sun’s Java SE Runtime Environment, build 1.6.0_16-b01, using the HotSpot 64-Bit Server VM with default options. The operating system was Ubuntu 9.04 Server, with the x86_64 Linux kernel version 2.6.28-11-server.

Our experiments emulate the methodology used by Herlihy et al. [43]. Each pass of the test program consists of each thread performing one million randomly chosen operations on a shared concurrent map; a new map is used for each pass. To simulate a variety of workload environments, two main parameters are varied: the proportion of put, remove, and get operations, and the range from which the keys are selected (the “key range”). Increasing the number of mutating operations increases contention; experiments with 90% get operations have low contention, while those with 0% get operations have high contention. The key range affects both the size of the tree and the amount of contention. A larger key range results in a bigger tree, which reduces contention. To ensure consistent and accurate results, each experiment consists of eight passes; the first four warm up the VM and the second four are timed. Throughput results are reported as operations per millisecond. Each experiment was run five times and the arithmetic average is reported as the final result.
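To make the test methodology concrete, a single benchmark pass has roughly the shape sketched below. This is only an illustration with hypothetical parameter names; it is not the harness used to produce the reported numbers.

import java.util.Random
import java.util.concurrent.ConcurrentMap

def runPass(map: ConcurrentMap[Integer, Integer], threads: Int,
            putPct: Int, removePct: Int, keyRange: Int): Long = {
  val opsPerThread = 1000000
  val start = System.nanoTime()
  val workers = (0 until threads).map { _ =>
    new Thread(new Runnable {
      def run(): Unit = {
        val rnd = new Random()
        var i = 0
        while (i < opsPerThread) {
          val key = Integer.valueOf(rnd.nextInt(keyRange))   // uniform key in the key range
          val op = rnd.nextInt(100)                          // choose put/remove/get by percentage
          if (op < putPct) map.put(key, Integer.valueOf(i))
          else if (op < putPct + removePct) map.remove(key)
          else map.get(key)
          i += 1
        }
      }
    })
  }
  workers.foreach(_.start())
  workers.foreach(_.join())
  val elapsedMs = (System.nanoTime() - start) / 1000000
  threads.toLong * opsPerThread / elapsedMs                  // operations per millisecond
}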

Figure 2.14: Single thread overheads imposed by support for concurrent access. Workload labels are ⟨put%⟩-⟨remove%⟩-⟨get%⟩. A key range of 2×10⁵ was used for all experiments. (Chart not reproduced; y-axis: overhead vs. sequential seq-tree, series: skip-list, opt-tree, snap-tree, lock-tree, stm-tree.)

We compare five implementations of thread-safe ordered maps:

• skip-list - Doug Lea’s ConcurrentSkipListMap. This skip list is based on the work of Fraser and Harris [30]. It was first included in version 1.6 of the Java™ standard library.

• opt-tree - our optimistic tree algorithm.

• snap-tree - the extension of our algorithm that provides support for fast cloning and snapshots.

• lock-tree - a standard java.util.TreeMap wrapped by Collections.synchronizedSortedMap(). Iteration is protected by an explicit lock on the map.

• stm-tree - a red-black tree implemented in Scala³ using CCSTM [12]. STM read and write barriers were minimized manually via common subexpression elimination. To minimize contention no size or modification count is maintained.

We first examine the single-threaded impacts of supporting concurrent execution. This is important for a data structure suitable for a wide range of uses, and it places a lower bound on the amount of parallelism required before scaling can lead to an overall performance improvement.

³The Scala compiler emits Java bytecodes directly, which are then run on the Java VM. Scala code that does not use closures has performance almost identical to the more verbose Java equivalent.

We compare the sequential throughput of the five maps to that of an unsynchronized java.util.TreeMap, labeled “seq-tree”. Values are calculated by dividing the throughput of seq-tree by that of the concurrent map. Figure 2.14 shows that on average our algorithm adds an overhead of 28%, significantly lower than the 83% overhead of skip-list, but more than the 6% imposed by an uncontended lock. The STM’s performance penalty averages 443%. As expected, snap-tree is slower than opt-tree, but the difference is less than 3% for this single-threaded configuration.

Our second experiment evaluates the number of nodes present in a partially external tree compared to a fully external tree. Internal trees are the baseline, as they contain no routing nodes. To simulate a range of workloads we perform a million put or remove operations, varying the fraction of puts from 0% to 100%. In this experiment we use a key range of 2×10⁵. The results, presented in Figure 2.15, show that partially external trees require far fewer routing nodes (on average 80% fewer) than external trees. A key range of 2×10⁶ with 10 million operations yields a similar curve.

Figure 2.15: Node count as tree size increases. One million operations were performed with a varying ratio of put and remove operations, and a key range of 2×10⁵; the number of nodes in the resulting tree is shown. (Chart not reproduced; x-axis: put percentage, y-axis: node count in thousands, series: internal, partial, external.)

Figure 2.16 shows how throughput scales as the number of threads is swept from 1 to 64, for a range of operation mixes and various levels of contention. Moving left to right in the figure, there are fewer mutating operations and thus lower contention. Moving bottom to top, the range of keys gets larger, resulting in bigger trees and lower contention.

Figure 2.16: Each graph shows the throughput of the maps as thread count ranges from 1 to 64. “skip-list” is ConcurrentSkipListMap, “opt-tree” is our basic optimistic tree algorithm, “snap-tree” is our extension that supports fast snapshots and cloning, “lock-tree” is a synchronized java.util.TreeMap, and “stm-tree” is a red-black tree using STM. Moving left to right, there are fewer mutating operations and thus lower contention. Moving bottom to top, the key ranges get larger, resulting in bigger trees and lower contention. 16 hardware threads were available. (Charts not reproduced; the grid plots throughput in ops/ms against thread count for key ranges of 2×10³, 2×10⁴, 2×10⁵, and 2×10⁶, under the workloads 50% put / 50% remove / 0% get, 20% put / 10% remove / 70% get, and 9% put / 1% remove / 90% get.)

Thus the lower left graph is the workload with the highest contention and the upper right is the workload with the lowest contention.

As expected, the throughput of each map, with the exception of lock-tree, generally increases as more of the system’s 16 hardware thread contexts are utilized. At multiprogramming levels of 2 and 4 (32 and 64 threads) throughput flattens out. Higher numbers of threads increase the chances that a single fixHeightAndRebalance call can clean up for multiple mutating operations, reducing the amortized cost of rebalancing and allowing scaling to continue past the number of available hardware threads in some cases. As the key range gets smaller the absolute throughput increases, despite the higher contention, showing that both Lea’s and our algorithms are tolerant of high contention scenarios. The absolute throughput also increases as the number of mutating operations decreases (going from left to right), as would be expected if reads are faster than writes. The coarse-grained locking of lock-tree imposes a performance penalty under any amount of contention, preventing it from scaling in any scenario. Stm-tree exhibits good scaling, especially for read-dominated configurations, but its poor single-threaded performance prevents it from being competitive with skip-list or either of our tree algorithms.

With a large key range of 2×10⁶, our algorithm outperforms skip-list by up to 98%, with an average increase in throughput of 62% without fast clone and snapshot support, and 55% with such support. Both of our tree implementations continue to exhibit higher throughput with a key range of 2×10⁵, but as the key range decreases and contention rises, the advantage becomes less pronounced. Opt-tree performs on par with skip-list for a key range of 2×10⁴, but fails to maintain its performance advantage in multiprogramming workloads with a key range of 2×10³, the only workload in which skip-list has noticeably higher performance. In the worst case for opt-tree (64 threads, 20-10-70 workload, and a key range of 2×10³) it was 13% slower than skip-list. In the worst case for snap-tree (64 threads, 50-50-0 workload, and a key range of 2×10³) it was 32% slower than skip-list. Averaged over all workloads and thread counts, opt-tree was 32% faster than skip-list and snap-tree was 24% faster than skip-list.

The primary difference between opt-tree and snap-tree is snap-tree’s epoch tracking. This imposes a constant amount of extra work on each put or remove. As expected, the overhead of supporting snapshots decreases when moving right in the table, to configurations with fewer mutating operations. The relative cost of epoch tracking is also reduced as the tree size increases, because more work is done per operation. Across the board, snap-tree imposes a 9% overhead when compared to opt-tree, with a worst-case penalty of 31%. Snap-tree’s overhead for read operations is negligible.

We next examine the performance of iterating sequentially through the map while concurrent mutating operations are being performed. Our standard per-thread workload of 20% puts, 10% removes, and 70% gets, and a key range of 2×10⁵ is interleaved at regular intervals with a complete iteration of the map. On average only one thread is iterating at a time. We calculate throughput as the total number of nodes visited, divided by the portion of total running time spent in iteration; the results are presented in Figure 2.17. Our experimental setup did not allow us to accurately measure the execution breakdown for multiprogramming levels greater than one, so we only show results up to 16 threads.

Figure 2.17: Iteration throughput of a single thread, with contending threads performing the 20-10-70 workload over 2×10⁵ keys. Skip-list and opt-tree perform inconsistent iteration, snap-tree performs an iteration with snapshot isolation. (Chart not reproduced; x-axis: contending threads 0 to 15, y-axis: nodes visited per ms, series: skip-list, opt-tree, snap-tree.)

At its core, ConcurrentSkipListMap contains a singly-linked list, so we expect it to support very fast iteration. Iteration in opt-tree is much more complex; nevertheless, its average performance is 48% that of skip-list. No hand-over-hand optimistic validation is required to iterate the snapshot in a snap-tree, so its performance is intermediate between skip-list and opt-tree, even though it is providing snapshot consistency to the iterators.

Snap-tree provides snapshot isolation during iteration by traversing a clone of the original tree.

This means that once the epoch transition triggered by clone has completed, puts and removes may operate concurrently with the iteration. To evaluate the performance impact of the lazy copies required by subsequent writes, Figure 2.18 plots the throughput of non-iteration operations during the same workload as Figure 2.17, along with the throughput with no concurrent iterations. Only snap-tree and lock-tree are shown, because they are the only implementations that allow consistent iteration. On average, concurrent iterations lower the throughput of other operations by 19% in snap-tree.

Figure 2.18: Throughput performing the 20-10-70 workload over 2×10⁵ keys with concurrent consistent iterations. (Chart not reproduced; x-axis: threads 1 to 16, y-axis: non-iteration throughput in ops/ms, series: snap-tree, snap-tree w/iters, lock-tree, lock-tree w/iters.)

2.6 Conclusion

In this chapter we use optimistic concurrency techniques adapted from software transactional memory to develop a concurrent tree data structure. By carefully controlling the size of critical regions and taking advantage of algorithm-specific validation logic, our tree delivers high performance and good scalability while being tolerant of contention. We also explore a variation of the design that adds support for a fast clone operation and that provides snapshot isolation during iteration.

We compare our optimistic tree against a highly tuned concurrent skip list, the best performing concurrent ordered map of which we are aware. Experiments show that our algorithm outperforms the skip list for many access patterns, with an average of 39% higher single-threaded throughput and 32% higher multi-threaded throughput. We also demonstrate that a linearizable clone operation can be provided with low overhead.

Chapter 3

Transactional Predication

3.1 Introduction

Concurrent set and map classes have emerged as one of the core abstractions of multi-threaded programming. They provide the programmer with the simple mental model that most method calls are linearizable, while admitting efficient and scalable implementations. Concurrent hash tables are part of the standard library of Java and C#, and are part of Intel’s Threading Building Blocks for C++. Concurrent skip lists are also widely available. The efficiency and scalability of these data structures, however, come from the use of non-composable concurrency control schemes. None of the standard concurrent hash table or skip list implementations provide composable atomicity.

Software transactional memory (STM) provides a natural model for expressing and implementing compound queries and updates of concurrent data structures. It can atomically compose multiple operations on a single collection, operations on multiple collections, and reads and writes to other shared data. Unlike lock-based synchronization, composition does not lead to deadlock or priority inversion.

If all of the loads and stores performed by a hash table, tree, or skip list are managed by an STM, then the resulting data structure automatically has linearizable methods that may be arbitrarily composed into larger transactions. The STM implementation may also provide useful properties such as: optimistic conflict detection with invisible readers (provides the best scalability for concurrent readers on cache-coherent shared memory architectures) [79]; lock- or obstruction-freedom (limits the extent to which one thread can interfere with another) [81]; intelligent contention management (prevents starvation of individual transactions) [32]; and modular blocking using retry and orElse (allows composition of code that performs conditional waiting) [39].

We apply the adjective ‘transactional’ to a data structure if its operations may participate in a transaction, regardless of the underlying implementation. The most straightforward way of implementing such an algorithm is to execute all shared memory accesses through the STM; the result will automatically be transactional, but it will suffer from high single-thread overheads and false conflicts. For many applications, trading some speed for improved programmability can be a good decision. Maps and sets are such fundamental data structures, however, that the additional internal complexity and engineering effort of bypassing the STM is justified if it leads to improvements in performance and scalability for all users.

• We introduce transactional predication, the first method for performing semantic conflict detection for transactional maps and sets using an STM’s structural conflict detection mechanism. This method leverages the existing research on STM implementation techniques and features, while avoiding structural conflicts and reducing the constant overheads that have plagued STM data structures (Section 3.3).

• We use transactional predication to implement transactional sets and maps on top of linearizable concurrent maps (Section 3.3). We add support for iteration in unordered maps (Section 3.5.1), and describe how to perform iteration and range-based search in ordered maps (Section 3.5.2).

• We describe two schemes for garbage collecting predicates from the underlying map: one based on reference counting (Section 3.4.2), and one using soft references (Section 3.4.3).

• We experimentally evaluate the performance and scalability of maps implemented with transactional predication, comparing them to best-of-breed non-transactional concurrent maps, data structures implemented directly in an STM, and concurrent maps that have been transactionally boosted. We find that predicated maps outperform existing transactional maps, often significantly (Section 3.6).

3.2 Background

Sets and associative maps are fundamental data structures; they are even afforded their own syntax and semantics in many programming languages. Intuitively, concurrent sets and maps should allow accesses to disjoint elements to proceed in parallel. There is a surprising diversity in the techniques developed to deliver this parallelism. They can be roughly grouped into those that use fine-grained locking and those that use concurrency control schemes tailored to the specific data structure and its operations. Transactional predication is independent of the details of the underlying map implementation, so we omit a complete survey. We refer the reader to [46] for step-by-step derivation of several concurrent hash table and skip list algorithms.

Concurrent collection classes are widely used, but they do not provide a means to compose their atomic operations. This poses a difficulty for applications that need to simultaneously update multiple elements of a map, or coordinate updates to two maps. Consider an application that needs to concurrently maintain both a forward and reverse association between keys and values, such as a map from names to phone numbers and from phone numbers to names. If the forward and reverse maps are implemented using hash tables with fine-grained locks, then changing a phone number while maintaining data structure consistency requires acquiring one lock in the forward map (to change the value that records the phone number), and two locks in the reverse map (to remove the name from one number and add it to another). This would require breaking the clean interface to the concurrent map by exposing its internal locks, because it is not sufficient to perform each of the three updates separately. This example also leads to deadlock if the locks are not acquired following a global lock order, which will further complicate the user’s code. Lock-free hash tables don’t even have the option of exposing their locks to the caller. Transactional memory, however, provides a clean model for composing the three updates required to change a phone number.

While there has been much progress in efficient execution of STM’s high-level programming model, simply wrapping existing map implementations in atomic blocks will not match the performance achievable by algorithm-specific concurrency control. Data structures implemented on top of an STM face two problems:

• False conflicts – STMs perform conflict detection on the concrete representation of a data structure, not on its abstract state. This means that operations that happen to touch the same memory location may trigger conflict and rollback, despite the operations being semantically independent.

• Sequential overheads – STMs instrument all accesses to shared mutable state, which imposes a performance penalty even when only one thread is used. This penalty is a ‘hole’ that scalability must climb out of before a parallel speedup is observed. Sequential overheads for STM are higher than those of traditional shared-memory programming [19] and hand-rolled optimistic concurrency [14].

False conflicts between operations on a transactional data structure can be reduced or eliminated by performing semantic conflict detection at the level of operations. Rather than computing conflicts based on the reads from and writes to individual memory locations, higher-level knowledge is used to determine whether operations conflict. For example, adding k1 to a set does not semantically conflict with adding k2 if k1 ≠ k2, regardless of whether those operations write to the same chain of hash buckets or rotate the same tree nodes. Because semantically independent transactions may have structural conflicts, some other concurrency control mechanism must be used to protect accesses to the underlying data structure. This means that a system that provides semantic conflict detection must break transaction isolation to communicate between active transactions. Isolation can be relaxed for accesses to the underlying structure by performing them in open nested transactions [18, 68], or by performing them outside transactions, using a linearizable algorithm that provides its own concurrency control. The latter approach is used by transactional boosting [42].

Although semantic conflict detection using open nested transactions reduces the number of false conflicts, it exacerbates sequential overheads. Accesses still go through the STM, but additional information about the semantic operations must be recorded and shared. Semantic conflict detection using transactional boosting reduces sequential overheads by allowing loads and stores to the underlying data structure to bypass the STM entirely, but it accomplishes this by adding a layer of pessimistic two-phase locking. These locks interfere with optimistic STMs, voiding useful properties such as opacity [33], obstruction- or lock-freedom, and modular blocking [39]. In addition, boosting must be tightly integrated with the STM’s contention manager to prevent starvation and livelock.

The simplicity of the underlying implementation of core library data structures is less important than their performance. Extra engineering effort expended on a transactional set or map can be amortized across many users; it is worth moving beyond the basic STM model for the internal details of the collection, so long as the simple interface is preserved.

An important additional concern that is not addressed by previous research is performance outside a transaction. Many accesses to a transactional set or map may occur outside an atomic block. Existing implementation techniques require the creation and commit of a separate transaction for each of these accesses, but this further increases the overheads of STM.

The goal of our research into transactional collections is to produce data structures whose non-transactional performance and scalability is equal to the best-of-breed concurrent collections, but that provide all of the composability and declarative concurrency benefits of STM. Transactional predication is a step in that direction.

3.3 Transactional Predication

Consider a minimal transactional set that provides only the functions contains(e) and add(e). Semantically, these operations conflict only when they are applied to equal elements, and at least one operation is an add¹:

conflict?       contains(e1)   add(e1)
contains(e2)    no             e1 = e2
add(e2)         e1 = e2        e1 = e2

This conflict relation has the same structure as the basic reads and writes in an STM: two accesses conflict if they reference the same location and at least one of them is a write:

conflict?           stmRead(p1)   stmWrite(p1,v1)
stmRead(p2)         no            p1 = p2
stmWrite(p2,v2)     p1 = p2       p1 = p2

The correspondence between the conflict relations means that we can perform semantic conflict detection in our transactional set by mapping each element e to a location p, performing a read from p during contains(e), and performing a write to p during add(e). Of course, conflict detection is not enough; operations must also query and update the abstract state of the set, and these accesses must be done in a transactional manner. Perhaps surprisingly, the reads and writes of p can also be used to manage the abstract state.

Transactional predication is based on the observation that membership in a finite set S can be expressed as a predicate f : U → {0,1} over a universe U ⊇ S of possible elements, where e ∈ S ⇐⇒ f(e), and that f can be represented in memory by storing f(e) in the location p associated with each e. We refer to the p associated with e as that element’s predicate. To determine if an e is in the abstract state of the set, as viewed from the current transactional context, we perform an STM-managed read of p to see if f(e) is true. To add e to the set, we perform an STM-managed write of p to change the encoding for f(e). The set operations are trivial as the complexity has been moved to the e → p mapping.

The final piece of TSet is the mapping from element to predicate, which we record using a hash table.

¹We assume a non-idempotent add that reports set changes.

195 class TSet[A] {
196     def contains(elem: A): Boolean =
197         predForElem(elem).stmRead()
198     def add(elem: A): Boolean =
199         predForElem(elem).stmReadAndWrite(true)
200     def remove(elem: A): Boolean =
201         predForElem(elem).stmReadAndWrite(false)
202
203     private val predicates =
204         new ConcurrentHashMap[A,TVar[Boolean]]
205     private def predForElem(elem: A) = {
206         var pred = predicates.get(elem)
207         if (pred == null) {
208             val fresh = new TVar(false)
209             pred = predicates.putIfAbsent(elem, fresh)
210             if (pred == null) pred = fresh
211         }
212         return pred
213     }
214 }

Figure 3.1: A minimal but complete transactionally predicated set in Scala. Read and write barriers are explicit. TVar is provided natively by the STM.
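Because the predicate accesses are ordinary STM reads and writes, TSet operations compose like any other transactional code. The following sketch assumes an atomic { ... } block combinator from the underlying STM (the exact syntax of the STM used in our evaluation differs slightly); it is illustrative only.

def moveElem[A](from: TSet[A], to: TSet[A], elem: A): Boolean =
  atomic {
    // both predicate accesses commit or roll back together
    if (from.remove(elem)) { to.add(elem); true } else false
  }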

[Trace not reproduced. Thread 1 runs contains(10): it begins T1, finds no predicate (preds.get(10) → null), creates p = new TVar(false), installs it with preds.putIfAbsent(10, p) → null, reads p.stmRead() → false, and commits with read set R1 = {p}. Thread 2 concurrently runs add(10): it begins T2, finds the existing predicate (preds.get(10) → p), performs p.stmReadAndWrite(true) → false, and commits with read set R2 = {p} and write set W2 recording the new value true for p.]

Figure 3.2: A simultaneous execution of contains(10) and add(10) using the code from Figure 3.1. Ri and Wi are the read and write sets. Thread 1 lazily initializes the predicate for element 10.

Precomputing the entire relation is not feasible, so we populate it lazily. The mapping for any particular e never changes, so predForElem(elem) is referentially transparent; its implementation can bypass the STM entirely.

Although the mapping for each element is fixed, reads and lazy initializations of the underlying hash table must be thread-safe. Any concurrent hash table implementation may be used, as long as it provides a way for threads to reach a consensus on the lazily installed key-value associations. Figure 3.1 shows the complete Scala code for a minimal transactionally predicated set, including an implementation of predForElem that uses putIfAbsent to perform the lazy initialization. putIfAbsent(e, p) associates p with e only if no previous association for e was present. It returns null on success, or the existing p′ on failure. In Figure 3.1, Line 209 proposes a newly allocated predicate to be associated with elem, and Line 210 uses the value returned from putIfAbsent to compute the consensus decision.

3.3.1 Atomicity and isolation

Transactional predication factors the work of TSet operations into two parts: lookup of the appropriate predicate, and an STM-managed access to that predicate. Because the lookup is referentially transparent, atomicity and isolation are not needed. The lookup can bypass the STM completely. The read or write to the predicate requires STM-provided atomicity and isolation, but only a single access is performed and no false conflicts can result. Bypassing the STM for the predicate map is similar to Moss’ use of open nesting for String.intern(s), which internally uses a concurrent set to merge duplicate strings [67]. Like strings interned by a failed transaction, lazily installed predicates do not need to be removed during rollback.

Figure 3.2 shows a simultaneous execution of add(10) and contains(10) using the code from Figure 3.1. Time proceeds from the top of the figure to the bottom. Because no predicate was previously present for this key, thread 1 performs the lazy initialization of the 10 → p mapping. An association is present by the time that thread 2 queries the mapping, so it doesn’t need to call putIfAbsent. At commit time, T1’s read set contains only the element p. This means that there is no conflict with a transaction that accesses any other key of S, and optimistic concurrency control can be used to improve the scalability of parallel reads.

The abstract state of the set is completely encoded in STM-managed memory locations, so the STM provides atomicity and isolation for the data structure. Unlike previous approaches to semantic conflict detection, no write buffer or undo log separate from the STM’s are required, and no additional data structures are required to detect conflicts. This has efficiency benefits, because the STM’s version management and conflict detection are highly optimized. It also has semantic benefits, because opacity, closed nesting, modular blocking, and sophisticated conflict management schemes continue to work unaffected.

There are two subtleties that deserve emphasis: 1) A predicate must be inserted into the underlying map even if the key is absent. This guarantees that a semantic conflict will be generated if another transaction adds a key and commits. 2) When inserting a new predicate during add, the initial state of the predicate must be false and a transactional write must be used to set it to true. This guarantees that contexts that observe the predicate before the adding transaction’s commit will not see the speculative add.

3.3.2 Direct STM vs. transactional predication

Figure 3.3 shows possible transactional executions of contains(10). In part (a) the set is represented by a hash table with chaining. To locate the element, a transactional read must be performed to locate the current hash array, then a transactional read of the array is used to begin a search through the bucket chain. Each access through the STM incurs a performance penalty, because it must be recorded in the read set and validated during commit. In addition, reads that occur to portions of the data structure that are not specific to a particular key may lead to false conflicts. In this example, remove(27) will conflict with contains(10), even though at a semantic level those operations are independent.

Figure 3.3b shows a predicated set executing contains(10). A concurrent hash map lookup is performed outside the STM to locate the predicate. A single transactional read of the predicate is then used to answer the query. The abstract state is encoded entirely in these STM-managed memory locations; the mapping from key to predicate has no side effects and requires no atomicity or isolation. Thus no scheduling constraints are placed on the STM, and no separate undo log, write buffer or conflict information is needed.

Figure 3.3: Execution of contains(10) in: a) a hash table performing all accesses to shared mutable state via STM; and b) a transactionally predicated set. Marked accesses are STM-managed reads.

3.3.3 Extending predication to maps

The predicate stores the abstract state for its associated element. The per-element state for a set consists only of presence or absence, but for a map we must also store a value. We encode this using Scala’s Option algebraic data type, which is Some(v) for the presence of value v, or None for absence. TMap[K,V] predicates have type TVar[Option[V]]:

class TMap[K,V] {
    def get(key: K): Option[V] =
        predForKey(key).stmRead()
    def put(key: K, value: V): Option[V] =
        predForKey(key).stmReadAndWrite(Some(value))
    def remove(key: K): Option[V] =
        predForKey(key).stmReadAndWrite(None)
    private def predForKey(k: K): TVar[Option[V]] = ...
}
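Returning to the phone-book example of Section 3.2, a predicated map lets the three updates commit as a single transaction without exposing any locks. The sketch below again assumes an atomic { ... } combinator and hypothetical map names; it is illustrative only.

def changeNumber(name: String, newNumber: String,
                 byName: TMap[String, String],
                 byNumber: TMap[String, String]): Unit =
  atomic {
    val oldNumber = byName.put(name, newNumber)   // forward map: one predicate write
    oldNumber.foreach(n => byNumber.remove(n))    // reverse map: drop the stale entry
    byNumber.put(newNumber, name)                 // reverse map: add the new entry
  }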

3.3.4 Sharing of uncommitted data

Like other forms of semantic conflict detection, transactional predication must make the keys of the predicate map public before the calling atomic block has committed. Carlstrom et al. [18] propose addressing this problem by using Java’s Serializable interface to reduce keys to byte arrays before passing them across an isolation boundary in their hardware transactional memory (HTM). The version management and conflict detection in most STMs does not span multiple objects; for these systems Moss [67] shows that immutable keys can be shared across isolation boundaries. In our presentation and in our experimental evaluation we assume that keys are immutable.

3.4 Garbage Collection

The minimal implementation presented in the previous section never garbage collects its TVar predicates. The underlying concurrent map will contain entries for keys that were removed or that were queried but found absent. While information about absence must be kept for the duration of the accessing transaction to guarantee serializability, for a general purpose data structure it should not be retained indefinitely. Some sort of garbage collection is needed.

Predicates serve two purposes: they encode the abstract state of the set or map, and they guarantee that semantically conflicting operations will have a structural conflict. The abstract state will be unaffected if we remove a predicate that records absence, so to determine if such a predicate can be reclaimed we only need to reason about conflict detection. Semantic conflict detection requires that any two active transactions that perform a conflicting operation on the predicated collection must agree on the predicate, because the predicate’s structural conflict stands in for the semantic conflict. If transaction T1 calls get(k) and a simultaneous T2 calls put(k,v), then they must agree on k’s predicate so that T1 will be rolled back if T2 commits.

For STMs that linearize during commit, it is sufficient that transactions agree on the predicate for k during the interval between the transactional map operation that uses k and their commit. To see that this is sufficient, let (ai, ei) be the interval that includes Ti’s access to k and Ti’s commit. If the intervals overlap, then T1 and T2 agree on k’s predicate. If the intervals don’t overlap, then assume WLOG that e1 < a2 and that there is no intervening transaction. The predicate could not have been garbage collected unless T1’s committed state implies k is absent, so at a2 a new empty predicate will be created. T2’s commit occurs after a2, so T2 linearizes after T1. T1’s final state for k is equal to the initial abstract state for T2, so the execution is serializable.

Algorithms that guarantee opacity can optimize read-only transactions by linearizing them before their commit, because consistency was guaranteed at the last transactional read (and all earlier ones). The TL2 [24] algorithm, for example, performs this optimization. We can provide correct execution for TL2 despite using the weaker agreement property by arranging for newly created predicates to have a larger timestamp than any reclaimed predicate. This guarantees that if a predicate modified by T1 has been reclaimed, any successful transaction that installs a new predicate must linearize after T1. We expect that this technique will generalize to other timestamp-based STMs. We leave for future work a formal treatment of object creation by escape actions in a timestamp-based STM. The code used in the experimental evaluation (Section 3.6) includes the TL2-specific mechanism for handling this issue.

Predicate reclamation can be easily extended to include retry and orElse, Harris et al.’s modular blocking operators [39]. All predicates accessed by a transaction awaiting retry are considered live.

3.4.1 Predicate life-cycle in a timestamp-based STM

Transactional predication constructs new TVar objects and makes them visible before the creating transaction has committed. The metadata used by a timestamp-based STM provides information about the last time at which an object was written, so we are faced with a sticky question of how to initialize a new predicate’s timestamp. If we create the predicate as part of the transaction, then we introduce a conflict even if all accesses to a predicate are reads. If we initialize the timestamp using the global commit timestamp, then the use of the predicate will immediately trigger either rollback or revalidation in the transaction for which it was created. If we initialize the timestamp to an older time, then the metadata is inaccurate, which can break serializability.

Timestamp-based STM algorithms such as TL2 [24] use a per-transaction read version to efficiently revalidate after each transactional load. Reads are effectively projected backward in time to the moment that the read version was initialized, guaranteeing that user code is never exposed to an inconsistent state. If a memory location is encountered that has been updated since this virtual snapshot was established, the reading transaction can either roll back or advance its read version and then perform an explicit revalidation. This continuous validation has two benefits: it allows read-only transactions to skip validation during commit, and it protects the user from being exposed to inconsistent program state. The property that all transactions are consistent (although potentially out-of-date) at each point in their execution is known as opacity.

Consider a transaction A that begins execution with a read version of 1, and that then reads the predicate px associated with key x. A verifies that the timestamp associated with px is ≤ 1, and then proceeds. A transaction B then commits changes to px and py, resulting in each of those predicates having a timestamp of 2. As the last step in our problem setup, assume that py is garbage collected by any of the schemes that follow. At this point, if A accesses y in the transactionally predicated set, it must create a new predicate p′y that will be immediately shared. There are three possibilities: (1) p′y can be added to A’s write set before it is shared. Writes produce contention, have overhead, and prevent the read-only commit optimization from being used, so we would rather choose an option in which A is not considered to have written p′y. (2) p′y can be created with the timestamp of the most recently committed transaction. A will then include p′y in its read set. Unfortunately, however, this strategy will require that to preserve opacity A must either roll back or revalidate itself each time that it creates a new predicate. (3) p′y can be created with an older timestamp. In this example, however, rollback or revalidation is required, so an optimization that avoids it will result in an incorrect execution.

Our solution is to propagate the timestamp from py to p′y. While it is not okay to pretend that p′y was created at some point in the distant past, it is okay to pretend that it was created as part of the transaction that last wrote py. Clearly it would be impractical to retain exact metadata for each removed predicate, but for the case of timestamps we can merge the historical metadata of removed predicates by taking their maximum. We extend

the underlying STM with two methods:

def embalm(identity: Any, v: TVar[_]): Unit def resurrect(identity: Any, v: TVar[_]): Unit

The embalm method updates the historical maximum (striped across multiple values using a hash of identity), and the resurrect method back-dates v to the timestamp from the most recent TVar embalmed with the same identity.

Transactional boosting manually performs pessimistic conflict detection and version management for semantic operations in the boosted data structure. While both boosting’s two-phase locking and TL2’s optimistic timestamp-based validation provide opacity in isolation, they do not provide opacity when combined. When a boosted transaction releases a lock during commit, it does not force the next transaction that acquires that lock to advance its read version. At each moment in a transaction, the timestamp-based reads are only guaranteed to be consistent at the moment that the read version was initialized, while the semantic operations protected by the two-phase locks are only guaranteed to be consistent at some time after the most recent lock acquisition. This also means that the read-only transaction optimization cannot be performed when boosting is used, because otherwise the transaction would not have a single linearization point.
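As a sketch of how the embalm and resurrect hooks might be used by the predicate map of Figure 3.1, the reclaiming path records the dead predicate’s timestamp under the element’s identity and the creating path back-dates the replacement before publishing it. The helper names are hypothetical and the methods are assumed to live inside the TSet[A] of Figure 3.1.

private def reclaim(elem: A, dead: TVar[Boolean]): Unit = {
  embalm(elem, dead)             // remember the largest timestamp seen for elem
  predicates.remove(elem, dead)  // remove only if elem is still mapped to this predicate
}

private def freshPredicate(elem: A): TVar[Boolean] = {
  val fresh = new TVar(false)
  resurrect(elem, fresh)         // back-date to the last embalmed timestamp for elem
  fresh                          // now safe to publish with putIfAbsent
}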

3.4.2 Reference counting

One option for reclaiming predicates once they are no longer in use is reference counting. There is only a single level of indirection, so there are no cycles that would require a backup collector. Reference counts are incremented on access to a predicate, and decremented when the enclosing transaction commits or rolls back. Whenever the reference count drops to zero and the committed state of a predicate records absence, it may be reclaimed.

Reference counting is slightly complicated by an inability to decrement the reference count and check the value of the predicate in a single step. We solve this problem by giving present predicates a reference count bonus, so that a zero reference count guarantees that the predicate’s committed state records absence. Transactions that perform a put that results in an insertion add the bonus during commit (actually, they just skip the normal end-of-transaction decrement), and transactions that perform a remove of a present key subtract the bonus during commit. To prevent a race between a subsequent increment of the reference count and the predicate’s removal from the underlying map, we never reuse a predicate after its reference count has become 0.

This mechanism is appealing because it keeps the predicate map as small as possible. It requires writes to a shared memory location, however, which can limit scalability if many transactions read the same key. Reference counting is suitable for use in an unmanaged environment if coupled with a memory reclamation strategy such as hazard pointers [64].
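The life-cycle described above could be realized along the following lines. The Predicate shape and helper names are hypothetical; the real implementation folds this bookkeeping into the predicate itself.

import java.util.concurrent.atomic.AtomicInteger

class Predicate(val refCount: AtomicInteger = new AtomicInteger(1))

// First access by a transaction: fail if the predicate is already dead,
// since a predicate is never reused once its count reaches zero.
def tryAcquire(p: Predicate): Boolean = {
  var c = p.refCount.get
  while (c != 0) {
    if (p.refCount.compareAndSet(c, c + 1)) return true
    c = p.refCount.get
  }
  false
}

// Called at commit or rollback. A committed insert keeps the bonus by
// skipping its own decrement; a committed remove gives the bonus back.
def release(p: Predicate, insertCommitted: Boolean, removeCommitted: Boolean): Unit = {
  val delta =
    if (insertCommitted) 0
    else if (removeCommitted) -2
    else -1
  if (delta != 0 && p.refCount.addAndGet(delta) == 0) {
    // count is zero, so the committed state records absence: the predicate
    // may now be removed from the underlying map (omitted in this sketch)
  }
}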

3.4.3 Soft references

When running in a managed environment, we can take advantage of weak references to reclaim unused predicates. Weak references can be traversed to retrieve their referent if it is still available, but do not prevent the language's GC from reclaiming the referenced object. Some platforms have multiple types of weak references, giving the programmer an opportunity to provide a hint to the GC about expected reuse. On the JVM there is a distinction between WeakReferences, which are garbage collected at the first opportunity, and SoftReferences, which survive collection if there is no memory pressure. Reclamations require mutation of the underlying predicate map, so to maximize scalability we use soft references.

Soft references to the predicates themselves are not correct, because the underlying map may hold the only reference (via the predicate) to a valid key-value association. Instead, we use a soft reference from the predicate to a discardable token object. Collection of the token triggers cleanup, so we include a strong reference to the token inside the predicate's TVar if the predicate indicates presence. For sets, we replace the TVar[Boolean] with TVar[Token], representing a present entry as a Token and an absent entry as null. For maps, we replace the TVar[Option[V]] with a TVar[(Token,V)], encoding a present key-value association as (Token,v) and an absent association as (null,*). If an element or association is present in any transaction context then a strong reference to the token exists. If a transactional read indicates absence, then a strong reference to the token is added to the transaction object itself, to guarantee that the token will survive at least until the end of the transaction. A predicate whose token has been garbage collected

is stale and no longer usable, the same as a predicate with a zero reference count. If a predicate is not stale, contexts may disagree about whether the entry is present or absent, but they will all agree on the transition into the stale state.
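For a predicated set, the token indirection can be sketched as follows. Token, TokenRef, the stand-in TVar and the cleanup helper are names invented for this example; the real implementation differs in detail.

import java.lang.ref.{ReferenceQueue, SoftReference}

// Sketch only.  A predicate's liveness is tied to a discardable Token:
// present entries hold the token strongly through the TVar, so the GC can
// clear the soft reference only when no context can observe the entry as
// present and memory pressure makes reclamation worthwhile.
final class Token
class TVar[A](var value: A)                       // placeholder for the STM's TVar

class TokenRef[A](val key: A, token: Token, q: ReferenceQueue[Token])
  extends SoftReference[Token](token, q)

class SoftPredicate[A](key: A, q: ReferenceQueue[Token]) {
  private val token = new Token
  val ref = new TokenRef(key, token, q)           // only soft reachability from the predicate
  val state = new TVar[Token](null)               // present = token, absent = null

  def isStale: Boolean = ref.get == null          // once the token is collected, stale forever
  def markPresent(): Unit = state.value = token   // the TVar's strong reference keeps it alive
}

object PredicateCleanup {
  // Drain the reference queue and unlink the corresponding keys from the
  // underlying predicate map (the removal callback is supplied by the map).
  def drain[A](q: ReferenceQueue[Token])(remove: A => Unit): Unit = {
    var r = q.poll()
    while (r != null) {
      remove(r.asInstanceOf[TokenRef[A]].key)
      r = q.poll()
    }
  }
}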

3.4.4 Optimizing non-transactional access

Ideally, transactional sets and maps would be as efficient as best-of-breed linearizable collections when used outside a transaction. If code that doesn't need STM integration doesn't pay a penalty for the existence of those features, then each portion of the program can locally make the best feature/performance tradeoff. Transactionally predicated maps do not completely match the performance of non-composable concurrent maps, but we can keep the gap small by carefully optimizing non-transactional operations.

Avoiding the overhead of a transaction: The transactionally predicated sets and maps presented so far perform exactly one access to an STM-managed memory location per operation. If an operation is called outside an atomic block, we can use an isolation barrier, an optimized code sequence that has the effect of performing a single-access transaction [49]. Our scheme for unordered enumeration (Section 3.5.1) requires two accesses for operations that change the size of the collection, but both locations are known ahead of time. Saha et al. [79] show that STMs can support a multi-word compare-and-swap with lower overheads than the equivalent dynamic transaction.

Reading without creating a predicate: While non-transactional accesses to the predicated set or map must be linearizable, the implementation is free to choose its own linearization point independent of the STM. This means that get(k) and remove(k) do not need to create a predicate for k if one does not already exist. A predicate is present whenever a key is in the committed state of the map, so if no predicate is found then get and remove can linearize at the read of the underlying map, reporting absence to the caller.

Reading from a stale predicate: get and remove can skip removal and replacement if they discover a stale predicate, by linearizing at the later of the lookup time and the time at which the predicate became stale.

Inserting a pre-populated predicate: We can linearize a put(k,v) that must insert a new predicate at the moment of insertion. Therefore we can place v in the predicate during

creation, rather than via an isolation barrier.
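The first three optimizations combine into a simple decision procedure for a non-transactional get. The sketch below is only illustrative: Predicate and readCommitted stand in for the real predicate type and the STM's single-access isolation barrier.

import java.util.concurrent.ConcurrentHashMap

trait Predicate[B] {
  def isStale: Boolean
  def readCommitted(): Option[B]     // one isolation barrier in the real code
}

class NonTxnReadPath[A, B](predicates: ConcurrentHashMap[A, Predicate[B]]) {
  def get(key: A): Option[B] = {
    val p = predicates.get(key)
    if (p == null)
      None                 // no predicate: linearize at the lookup in the underlying map
    else if (p.isStale)
      None                 // linearize at the later of the lookup and the staleness transition
    else
      p.readCommitted()    // a single isolation barrier, no dynamically-sized transaction
  }
}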

3.5 Iteration and Range Searches

So far we have considered only transactional operations on entries identified by a user-specified key. Maps also support useful operations that act on multiple elements, such as iteration, or that locate an entry without an exact key match, such as finding the smallest entry larger than a particular key in an ordered map. Transactionally predicated maps can implement these operations using the iteration or search functionality of the underlying predicate map.

3.5.1 Transactional iteration

For a transactionally predicated map M, every key present in the committed state or a speculative state is part of the underlying predicate map P. If a transaction T visits all of the keys of P, it will visit all of the keys of its transactional perspective of M, except keys added to M by an operation that starts after the iteration. If T can guarantee that no puts that commit before T were executed after the iteration started, it can be certain that it has visited every key that might be in M. The exact set of keys in M (and their values) can be determined by get(k) for keys k in P. In Section 3.3 we perform semantic conflict detection for per-element operations by arranging for those operations to make conflicting accesses to STM-managed memory. We use the same strategy for detecting conflicts between insertions and iterations, by adding an insertion counter. Iterations of M read this STM-managed TVar[Int], and insertions that create a new predicate increment it. Iterations that miss a key will therefore be invalidated. Unfortunately, a shared insertion counter introduces a false conflict between a call to put(k1,v1) and a call to put(k2,v2). Rollbacks from this conflict could be avoided by Harris et al.'s abstract nested transactions [40], but we use a simpler scheme that stripes the counter across multiple transactionally-managed memory locations. Insertions increment only the value of their particular stripe, and iterations perform a read of all stripes. By fixing a pseudo-random binding from thread to stripe, the probability of a false conflict is

kept independent of transaction size. There is some flexibility as to when changes to the insertion count are committed. Let tP+ be the time at which the key was inserted into the predicate map P and tM+ be the linearization time of k’s insertion into M. No conflict is required for iterations that linearize before tM+, because k is not part of M in their context. No conflict is required for iterations that start after tP+, because they will include k. This means that any iteration that conflicts with the insertion must have read the insertion counter before tP+ and linearized after tM+. The increment can be performed either in a transaction or via an isolation barrier, so long as it linearizes in the interval (tP+,tM+]. Incrementing the insertion counter at tM+, as part of the transaction that adds k to M, allows a transactionally-consistent insertion count to be computed by summing the stripes. If a removal counter is also maintained, then we can provide a transactional size() as the difference. Note that optimistic iteration is likely to produce the starving elder pathology for large maps with concurrent mutating transactions [10]. We assume that the STM’s contention manager guarantees eventual completion for the iterating transaction.
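The striped counter might look like the following sketch, written with ScalaSTM syntax for brevity (the evaluated implementation uses CCSTM). The stripe count and the thread-to-stripe binding are illustrative choices rather than the actual constants.

import scala.concurrent.stm._

class InsertionCounter(stripes: Int = 16) {
  private val counts = Array.fill(stripes)(Ref(0))

  // Fixed pseudo-random binding from thread to stripe.
  private def myStripe: Int = (Thread.currentThread.getId % stripes).toInt

  // Called by a put that inserts a new predicate: conflicts only with
  // iterations, and with other inserters that happen to share the stripe.
  def recordInsert()(implicit txn: InTxn): Unit = {
    val c = counts(myStripe)
    c() = c() + 1
  }

  // Called at the start of an iteration: reading every stripe means that any
  // insertion that commits during the iteration invalidates this transaction.
  // The returned sum is also a transactionally consistent insertion count.
  def readAll()(implicit txn: InTxn): Int =
    counts.map(c => c()).sum
}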

3.5.2 Iteration and search in an ordered map

In an ordered map, it is more likely that an iterator will be used to access only a fraction of the elements, for example to retrieve the m smallest keys. For this use case, the insertion counter strategy is too conservative, detecting conflict even when an insertion is performed that does not conflict with the partially-consumed iterator. Ordered maps and sets also often provide operations that return the smallest or largest entry whose key falls in a range. Tracking insertions only at the collection level will lead to many false conflicts. We solve this problem by storing an insertion count in each entry of P, as well as one additional per-collection count. An entry's counter is incremented when a predicate is added to P that becomes that entry's successor, and the per-collection counter is incremented when a new minimal entry is inserted. (Alternately a sentinel entry with a key of −∞ could be used.) Because these counters only protect forward traversals, a search for the smallest key > k first finds the largest key ≤ k and then performs a protected traversal. The successor-insertion counters for the ordered map are not useful for computing size(), so

we increment them using a non-transactional isolation barrier. Despite the navigation of the underlying map required when inserting a new predicate, our scheme results in no false conflicts for get, put and remove, since these operations neither read nor write any of the insertion counters. Transactional iteration and range queries may experience false conflicts if a concurrent operation inserts a predicate into an interval already traversed by the transaction, but unlike false conflicts in an STM-based tree or skip list at most one interval is affected by each insertion.
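A protected search step could be sketched roughly as follows (ScalaSTM syntax again; OrderedPred and the field names are invented, and this is a simplification of the real traversal). The counter reads return values that are discarded; their only purpose is to enter the transaction's read set so that an insertion immediately after the predecessor invalidates the transaction.

import java.util.concurrent.ConcurrentSkipListMap
import scala.concurrent.stm._

class OrderedPred(val succInserts: Ref[Int])      // plus the usual predicate state

class ProtectedSearch[A](preds: ConcurrentSkipListMap[A, OrderedPred],
                         minInserts: Ref[Int]) {
  // Find the smallest predicate key strictly greater than k, protected by the
  // successor-insertion counter of the greatest key <= k (or by the
  // per-collection counter if the traversal begins before the first entry).
  def protectedHigherKey(k: A)(implicit txn: InTxn): Option[A] = {
    val floor = preds.floorEntry(k)
    if (floor == null) minInserts() else floor.getValue.succInserts()
    Option(preds.higherKey(k))
  }
}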

3.6 Experimental Evaluation

In this section we evaluate the performance of an unordered and ordered map implemented using transactional predication. We first evaluate the predicate reclamation schemes from Section 3.4, concluding that soft references are the best all-around choice. We then show that predicated hash maps have better performance and scalability than either an STM-based hash table or a boosted concurrent hash table. Finally, we evaluate a predicated concurrent skip list.

Experiments were run on a Dell Precision T7500n with two quad-core 2.66 GHz Intel Xeon X5550 processors, and 24GB of RAM. We used the Linux kernel version 2.6.28-16-server. Hyper-Threading was enabled, yielding a total of 16 hardware thread contexts. Code was compiled with Scala version 2.7.7. We ran our experiments in Sun's Java SE Runtime Environment, build 1.6.0_16-b01, using the HotSpot 64-Bit Server VM with compressed object pointers. We use CCSTM, a reference-based STM for Scala [16]. CCSTM uses the SwissTM algorithm [25], which is a variant of TL2 [24] that detects write-write conflicts eagerly.

Our experiments emulate the methodology used by Herlihy et al. [43]. Each pass consists of each thread performing 10^6 randomly chosen operations on a shared map; a new map is used for each pass. To simulate a variety of workloads, two parameters are varied: the proportion of get, put and remove operations, and the range from which the keys are uniformly selected. A smaller fraction of gets and a smaller key range both increase contention. Because put and remove are equally likely in our tests, the map size converges

to half the key range. To allow for HotSpot's dynamic compilation, each experiment consists of twenty passes; the first ten warm up the VM and the second ten are timed. Each experiment was run five times and the arithmetic average is reported as the final result.

3.6.1 Garbage collection strategy

To evaluate predicate reclamation strategies, Figure 3.4 shows experiments using the following map implementations:

• conc-hash – Lea’s ConcurrentHashMap [59], as included in the JRE’s standard library;

• txn-pred-none – a transactionally predicated ConcurrentHashMap, with no reclamation of stale predicates;

• txn-pred-rc – a predicated hash map that uses the reference counting scheme of Section 3.4.2; and

• txn-pred-soft – a predicated hash map that uses the soft reference mechanism of Section 3.4.3.

Txn-pred-rc performs the most foreground work, but its aggressive reclamation yields the smallest memory footprint. Txn-pred-soft delays predicate cleanup and has larger predicate objects than txn-pred-rc, reducing locality. Because the performance effect of locality depends on the working set size, we show a sweep of the key range for a fixed instruction mix (80% get, 10% put and 10% remove), at minimum and maximum thread counts. The optimizations from Section 3.4.4 also have a large impact, so we show both non-transactional access and access in transactions that perform 64 operations (txn2's curves are similar to txn64's).

Except for conc-hash, the non-txn experiments represent the performance of a map that supports transactional access, but is currently being accessed outside an atomic block. Conc-hash is faster than any of the transactional maps, at least for 1 thread. For some multi-thread non-txn experiments, however, txn-pred-none and txn-pred-soft perform better than conc-hash, despite using a ConcurrentHashMap in their implementation. This is because they allow a predicate to remain after its key is removed from the abstract state of the

map, replacing a use of conc-hash's contended segment lock with an uncontended write to a TVar. If the key is re-added, the savings are doubled. This effect appears even more prominently in Figures 3.5 and 3.6, discussed below.

Figure 3.4: Throughput for three predicate reclamation strategies (none, reference counting and soft references), with 80% reads. Each panel plots throughput (ops/us) against key range (log2) for 1 thread and 16 threads, in non-transactional and 64-ops/txn configurations; Lea's non-composable ConcurrentHashMap is included for reference.

For most transactional configurations (that cannot use the non-transactional optimizations) txn-pred-soft is both faster and more scalable than txn-pred-rc. The exception is uncontended (1 thread) access to a large map, where reference counting's smaller memory footprint has a locality advantage that eventually compensates for its extra work. The largest difference in memory usage between txn-pred-soft and txn-pred-rc occurs for workloads that perform transactional get on an empty map for many different keys. In this case txn-pred-rc's predicate map will contain only a few entries, while txn-pred-soft's may grow quite large. For single-threaded access and a 2^21 key range, this 0% hit rate scenario yields

73% higher throughput for reference counting. Once multiple threads are involved, however, reference counting's locality advantage is negated by shared writes to the underlying predicate map, and txn-pred-soft performs better across all key ranges and hit rates. Txn-pred-soft shows better overall performance than txn-pred-rc, so it is our default choice. For the rest of the performance evaluation we focus only on txn-pred-soft.

3.6.2 Comparison to other transactional maps

Figure 3.5 compares the performance of txn-pred-soft to a hash table implemented via STM, and to a transactionally boosted map. Conc-hash is included in the non-txn configurations for reference:

• stm-hash – a hash map with 16 segments, each of which is a resizeable transactional hash table.

• boosting-soft – a transactionally boosted ConcurrentHashMap. Soft references are used to reclaim the locks.

An obvious feature of most of the graphs is decreasing or constant throughput when moving from 1 to 2 threads. This is a consequence of the Linux scheduling policy, which prefers to spread threads across chips. This policy maximizes the cache and memory bandwidth available to an application, but it increases coherence costs for writes to shared memory locations. We verified this by repeating experiments from Figure 3.5 using a single processor. For very high-contention experiments such as ⟨non-txn, 2^11, 0% get⟩, off-chip coherence costs outweigh the benefits of additional threads, yielding higher throughput for 8 threads on 1 chip than 16 threads on 2 chips.

Stm-hash includes several optimizations over the hash table example used in Section 3.3.2. To reduce conflicts from maintaining load factor information, stm-hash distributes its entries over 16 segments. Each segment is an independently resizeable hash table. In addition, segments avoid unnecessary rollbacks by updating their load factor information in an abstract nested transaction (ANT) [40]. To reduce the number of transactional reads and writes, bucket chains are immutable. This requires extra object allocation during put and remove, but improves both performance and scalability. Finally, we optimized non-transactional get by performing its reads in a hand-rolled optimistic retry loop,

avoiding the overheads of transaction setup and commit. The optimizations applied to stm-hash help it to achieve good performance and scalability for read-dominated non-txn workloads, and read- or write-dominated workloads with few accesses. Non-txn writes have good scalability, but their single-thread performance is poor; the constant overheads of the required transaction can't be amortized across multiple operations. Stm-hash has good single-thread performance when used by transactions that perform many accesses, but does not scale well in that situation. Each transaction updates several segments' load factors, making conflicts likely. Although the ANTs avoid rollback when this occurs, conflicting transactions cannot commit in parallel.

Txn-pred-soft is faster than boosting-soft for every configuration we tested. For non-txn workloads, predication has two advantages over boosting: 1) The optimizations of Section 3.4.4 mean that txn-pred's non-transactional get(k) never needs to insert a predicate, while boosting must insert a lock for k even if k is not in the map. This effect is visible in the non-txn 80% get configurations across all thread counts. 2) Boosting's scalability is bounded by the underlying ConcurrentHashMap. For write-heavy workloads conc-hash's 16 segment locks ping-pong from core to core during each insertion or removal. Txn-pred-soft's predicates are often retained (and possibly reused) after a key is removed from the map, moving writes to lightly-contended TVars. Conc-hash's bound on boosting can be clearly seen in ⟨non-txn, 2^11, 0% get⟩, but applies to all workloads.

Some of predication's performance advantage across all thread counts comes from a reduction in single-thread overheads. Boosting's implementation incurs a per-transaction cost because each transaction that accesses a boosted map must allocate a side data structure to record locks and undo information, and a commit and rollback handler must be registered and invoked. Txn-pred-soft uses neither per-transaction data nor transaction lifecycle callbacks. Small transactions have less opportunity to amortize boosting's overhead, so the single-thread performance advantage of predication is higher for txn2 configurations than for txn64.

The remainder of predication's performance advantage is from better scaling, a result of its use of optimistic reads and its lack of interference with the STM's contention management strategy. The scaling advantage of optimistic reads is largest for small key ranges and long transactions, both of which increase the chance that multiple transactions

will be accessing a map entry; see ⟨64 ops/txn, 2^11 keys, 80% get⟩. Boosting's locks are not visible to or revocable by the STM's contention manager, so they negate its ability to prioritize transactions. This is most detrimental under high contention, such as ⟨64 ops/txn, 2^11 keys, 0% get⟩. In this experiment boosting achieves its best throughput at 1 thread, while CCSTM's contention manager is able to provide some scalability for predication despite a large number of conflicts.

3.6.3 Ordered maps

Finally, we evaluate the performance of a transactionally predicated ordered map, which adds optimistic ordered iteration and range searches to the key-equality operations. The underlying predicate map is Lea’s ConcurrentSkipListMap. Figure 3.7 compares the performance of the predicated skip list to a red-black tree implemented using STM. (We also evaluated an STM skip list, but it was slower than the red-black tree.) The predicated ordered map outperforms the STM-based ordered map for both configurations and across all thread counts.

3.7 Related Work

3.7.1 Avoiding structural conflicts

Herlihy et al. introduced early release as a method to reduce the chance of structural conflict during tree searches in their seminal paper on dynamically-sized STM [44]. Early release allows the programmer to remove elements from a transaction's read set if it can be proved that the results of the transaction will be correct regardless of whether that read was consistent. This reasoning is subtle, especially when STM is used as a means for composing operations rather than as an internal data structure mechanism for implementing linearizability. Felber et al.'s elastic transactions provide the conflict reduction benefits of early release with a more disciplined model [29]. Neither of these techniques reduces the number of transactional barriers.

Moss [67] describes using open nested transactions to eliminate conflicts arising from String.intern(s), which uses a globally shared hash table to merge duplicates. Calls to

intern never semantically conflict and need not be rolled back.

Harris et al.'s abstract nested transactions (ANT) [40] allow portions of a transaction to be retried, increasing the number of transactions that can commit. ANTs could be used to insulate the caller's transaction from false conflicts that occur inside data structure operations. However, they do not avoid the need to roll back and retry the nested transaction, and add extra overheads to the base sequential case.

Another way to reduce structural conflicts is to use an algorithm that allows bookkeeping work such as tree rebalancing to be performed separately from semantic changes. Ballard showed that the use of a relaxed balance tree algorithm can improve scalability in an STM [4]. The total amount of work is not reduced, however, so single-thread overheads remain high.

3.7.2 Semantic conflict detection

Semantic conflict detection using open nested transactions was described concurrently by Ni et al. [68] and Carlstrom et al. [18]. Ni et al. use open nested transactions in an STM to commit updates to transactionally managed data structures before their enclosing transaction commits, by tracking semantic conflicts with pessimistic locks. Their locks support shared, exclusive, and intention exclusive access, which enables them to support concurrent iteration or concurrent mutation, while correctly preventing simultaneous mutation and iteration. Carlstrom et al. use open nested transactions in a hardware transactional memory (HTM) to manage both the shared collection class and information about the operations performed by active transactions. This side information allows optimistic conflict detection. It is more general in form than abstract locks, and provides better fidelity than locks for range queries and partial iteration. Approaches that use open nesting still use transactions to perform all accesses to the underlying data structure, and incur additional overhead due to the side data structures and deeper nesting. This means that although they reduce false conflicts, they don't reduce STM's constant factors.

Kulkarni et al. associate lists of uncommitted operations with the objects in their Galois system [54], allowing additional operations to be added to a list only if there is no semantic conflict with the previous ones. Several application-specific optimizations are used to

reduce overheads.

Herlihy et al. described transactional boosting [42], which addresses both false conflicts and STM constant factors. Boosting uses two-phase locking to prohibit conflicting accesses to an underlying linearizable data structure. These locks essentially implement a pessimistic visible-reader STM on top of the base STM, requiring a separate undo log and deadlock avoidance strategy. The resulting hybrid provides atomicity and isolation, but loses useful properties and features of the underlying STM, including starvation freedom for individual transactions, obstruction- or lock-freedom, modular blocking, and timestamp-based opacity. In addition, boosting requires that the STM linearize during commit, which eliminates the read-only transaction optimization possible in STMs such as TL2 [24] and SwissTM [25].

3.7.3 Serializing conflicting transactions

Ramadan et al. adapt ideas from thread-level speculation in their dependence-aware transactional memory (DATM) [75]. This technique constrains the commit order when conflicts are detected, and then speculatively forwards values from earlier transactions to later ones. This reduces the penalty normally imposed by false conflicts by allowing the transactions to commit anyway. Transactional predication relies only on serializability, so predicated data structures executed in an STM with DATM would allow even transactions with true semantic conflicts to be successfully serialized. In contrast, transactional boosting requires that forwarding be reimplemented at the boosting level, as in Koskinen et al.'s concurrent non-commutative boosted transactions [53].

3.8 Conclusion

This chapter has introduced transactional predication, a technique for implementing high performance concurrent collections whose operations may be composed using STM. We have shown that for sets and maps we can choose a representation that allows a portion of the transactional work to safely bypass the STM. The resulting data structures approximate semantic conflict detection using the STM's structural conflict detection mechanism, while

leaving the STM completely responsible for atomicity and isolation. Predication is applicable to unordered and ordered sets and maps, and can support optimistic iteration and range queries.

Users currently face a tradeoff between the performance of non-composable concurrent collections and the programmability of STM's atomic blocks; transactional predication can provide both. Predicated collections are faster than existing transactional implementations across a wide range of workloads, offer good performance when used outside a transaction, and do not interfere with the underlying STM's opacity, modular blocking or contention management.

Figure 3.5: Throughput of transactional map implementations across a range of configurations. Each graph plots operations per microsecond for thread counts from 1 to 16. Series: stm-hash, txn-pred-soft, boosting-soft and conc-hash; workloads: 10% put, 10% remove, 80% get and 50% put, 50% remove, 0% get; access modes: non-txn, 2 ops/txn and 64 ops/txn.


Figure 3.6: Throughput for Figure 3.5's ⟨non-txn, 2^11 keys, 0% get⟩ and ⟨txn64, 2^11 keys, 0% get⟩ experiments, with a non-default scheduling policy that uses one chip for thread counts ≤ 8.

Figure 3.7: Throughput for ordered transactional maps (txn-pred-skip and stm-redblack) performing 80% reads, either all get or half get and higherEntry, for key ranges of 2^11 and 2^18. SortedMap.higherEntry(k) returns the entry with the smallest key > k.

Chapter 4

Transactional Maps with Snapshots

4.1 Introduction

In Chapter 2 we used optimistic concurrency control and lazy copy-on-write to efficiently provide snapshots and consistent iteration of a linearizable map. SnapTree, the resulting data structure, allows readers to traverse a consistent version of the tree without blocking subsequent writers and without any large-latency operations. While SnapTree's clone operation is powerful, SnapTree does not provide arbitrary composability. In particular, there is no provision for composing mutating operations or for performing an atomic snapshot across multiple trees.

In Chapter 3 we introduced transactional predication, which reduces the performance overhead of STM by bypassing the STM for a substantial fraction of the work required in a transactional map. By reducing the number of STM-managed accesses, transactional operations are accelerated and non-transactional operations can use highly-optimized isolation barriers rather than dynamically-sized transactions. Transactionally predicated maps have strengths that complement those of SnapTrees: predicated maps support composition of reads and writes, but they do not support an efficient clone or consistent non-transactional iteration.

In this chapter we combine the disparate strategies from Chapters 2 and 3. We start


with a lock-free concurrent hash trie that supports snapshots, and then implement the algorithm by using STM-managed references rather than primitive compare-and-swap operations. The hash trie can be used as a linearizable map outside a transaction, but when it is executed inside a transaction its linearizable operations compose. Figure 4.1 shows the features provided by the transactional hash trie, compared to SnapTree and a transactionally predicated map.

                                             SnapTree   predicated    transactional
                                                        hash table    hash trie
  linearizable read of m(k)                     X           X              X
  linearizable write of m(k)                    X           X              X
  compare-and-swap of m(k)                      X           X              X
  atomically read m(k1) and m(k2)               X           X              X
  atomically read m1(k) and m2(k)                           X              X
  fast atomic m.clone                           X                          X
  consistent iteration inside transaction       X           X              X
  consistent iteration outside transaction      X                          X
  atomically compose reads and writes                       X              X
  full STM integration                                      X              X

Figure 4.1: Features for the data structures developed in Chapters 2, 3 and 4.

Surprisingly, we found that executing a linearizable lock-free algorithm inside a transaction is not sufficient to guarantee atomicity and isolation if the algorithm is also used outside a transaction, even for strongly atomic TMs. Lock-free operations whose linearization point may be another thread's memory access can appear to be interleaved with a transaction; serializability of the underlying memory accesses is not sufficient to guarantee atomicity and isolation of the operations on the abstract state of the linearizable object. Our lock-free hash trie can be correctly composed using memory transactions, because all of its linearization points occur at memory accesses of the current thread. In Section 4.3 we discuss the interaction of lock-freedom and memory transactions, and we briefly introduce internally linearizable objects that can safely be accessed directly and composed using transactions.

The result of tailoring our concurrent map and set implementation to the STM environment is an unordered concurrent map that adds snapshots and arbitrary composability,

with a low performance penalty. We explored the usefulness of the resulting model by rewriting the lock manager of the Apache Derby database. The lock manager is a complicated piece of concurrent code that is critical to the performance and scalability of the database. In 2007 scalability limitations forced a rewrite using fine-grained locking and Lea's ConcurrentHashMap. We reimplemented the lock manager and deadlock detection by directly encoding the underlying relations as a set of transactional maps. Our implementation scales as well as the fine-grained locking solution and delivers 98% of its peak performance, while using dramatically less code.

Our specific contributions:

• We describe a novel lock-free mutable hash trie that supports efficient clone via copy-on-write. This algorithm can be used directly on hardware that supports DCAS, or it can use a software DCAS construction (Section 4.2).

• We examine the conditions under which a linearizable data structure may be concurrently accessed from both inside and outside a transaction when its shared memory accesses are made through an STM. We show that, surprisingly, lock-free algorithms can fail to behave as expected even for strongly atomic STMs (Section 4.3).

• We use the mutable hash trie to implement an unordered transactional map that supports a copy-on-write clone operation. This data structure provides the snapshot iteration benefits of SnapTree while approaching the peak performance of a transactionally predicated concurrent hash map (Section 4.4).

• We directly compare the performance of our transactional hash trie to the transactional maps from Chapter 3, using the same microbenchmarks used to evaluate transactional predication. We find that neither implementation has a clear performance advantage and neither implementation has a performance pathology, despite the trie’s extra functionality (Section 4.6.2).

• We evaluate the transactional hash trie in a large application by rewriting Apache Derby's database lock manager, a sophisticated concurrent component that uses fine-grained

locking. Our code using STM is substantially shorter and simpler than the original, yet it scales as well and achieves 98% of the original’s peak performance (Section 4.6.4).

4.2 A Lock-Free Copy-On-Write Hash Trie

In this section we will describe a lock-free algorithm for implementing an unordered map data structure that also provides an O(1) linearizable clone operation. The algorithm makes use of a restricted double-compare single-swap (RDCSS) primitive, which is not generally available in hardware¹, but Harris et al.'s lock-free RDCSS construction can be used for machines that support only CAS [41]. Our final goal is an algorithm that provides STM integration, so we will implement RDCSS using STM primitives.

4.2.1 Hash tries

A tree-like structure is required to implement snapshots that perform an incremental copy-on-write, but there is no natural ordering available for the keys of an unordered map. The keys provide only equality comparison (equals) and hash code computation (hashCode). This means that we must construct the tree using the key's hash code.

A trie is a tree that performs lookups on keys made from a sequence of symbols [31]. The branching factor of each level of the tree matches the number of possible symbols. To locate a key starting with the symbol si, the i-th child of the root is chosen, and then the process is repeated with the remainder of the key. Hash codes are (assumed to be) uniformly randomly distributed², so Devroye's results prove that the expected leaf depth of a bitwise trie of hash codes is O(log n) [23]. A hash trie with branching factor BF breaks a key's hash code into groups of log₂ BF bits, and uses the groups as the symbols in a trie lookup. The original key–value associations are stored in the leaves of the hash trie; key equality is checked only after the hash code of the key has been used to locate an association.

¹ RDCSS is equivalent to DCAS if the old and new values of the first address are restricted to be identical and the address spaces of the two arguments are distinct.
² We scramble and XOR the bits of the hash function so that this assumption holds even if only some of the bits are well distributed.

Figure 4.2: The hash trie data structure; the Ref cells are mutable references.

A practical efficiency concern with tries is space wasted by unused children. Hash array mapped tries address this inefficiency by compacting the child pointers, and then using a bitmap to locate the packed index that corresponds to a symbol [3]. This poses complications for a concurrent data structure, as it leads to conflicts for operations that access separate children. We notice that the vast majority of the wasted space occurs only near the leaves, so it is not necessary to compact the upper levels of the tree.

Our data structure must deal with hash collisions, which result in multiple key–value associations in the same leaf. Our solution to this problem also helps minimize wasted space due to unused branches. We let each leaf instance hold an array of up to LC triples of ⟨h, k, v⟩, where LC is the leaf capacity, h is the key's hash code, k is the key and v is the associated value. (Leaves can become larger than LC if there is an exceptionally large number of collisions, because every key with the same hash code must reside in the same leaf.) The triples are stored by increasing h. Maintaining this array is fast in practice because it has good cache locality and minimizes per-object overheads.

We avoid the concurrency problems of packed arrays in the leaves by making them immutable. The leaf is copied and repacked during insertion and deletion of key–value associations. The leaf is also copied during updates, so that all leaf changes are reflected in the same memory location (the leaf's parent's child reference). The unordered map exposed

to the outside world contains a reference to either a branch or a leaf. The multi-element leaf means that small maps consist of just the map instance and a leaf. Figure 4.2 shows a hash trie, and Figure 4.3 shows the signatures of the map, branch and leaf types.
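One plausible realization of Leaf.get over these packed arrays, consistent with the signatures shown in Figure 4.3 but not taken from the actual implementation, is the following sketch:

class Leaf[A, B](val hashes: Array[Int],     // sorted by increasing hash code
                 val keys: Array[A],
                 val values: Array[B]) {
  def get(key: A, hash: Int): Option[B] = {
    var i = java.util.Arrays.binarySearch(hashes, hash)
    if (i < 0) return None                          // no association with this hash code
    while (i > 0 && hashes(i - 1) == hash) i -= 1   // rewind to the first entry with this hash
    while (i < hashes.length && hashes(i) == hash) {
      if (keys(i) == key) return Some(values(i))    // key equality checked only after the hash matches
      i += 1
    }
    None
  }
}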

4.2.2 Generation counts

After a call to clone, the existing nodes of the hash trie must no longer be modified. We can't mark each of the nodes immediately, because that would require that the entire tree be traversed. Instead, we record a generation number in each mutable node, and then we detect frozen nodes by checking if a node's generation number differs from that of the root branch. While the following function is never actually used in the trie's implementation, we could compute whether a node is frozen with

def isFrozen(b: Node[_, _]) = b.gen != rootRef.get.gen

We freeze a node when it is accessible from multiple LockFreeHashTrie instances. Unfrozen nodes may be accessed from multiple threads. Using generations to detect sharing is different from the lazy marking scheme we employed in Chapter 2's SnapTree. Generations have the benefit that clone can run concurrently with attempts to modify the tree, and there is no need to separate updates into epochs (Section 2.3.10). SnapTree needs the lazy marking scheme to clear its nodes' parent pointers, which are required for bottom-up relaxed rebalancing. Because no rebalancing is necessary for hash tries there is no need for parent pointers, so we can use the simpler and more cache friendly generation mechanism.

Each child reference from a branch b can point to a leaf, a branch with a generation older than b's, or a branch from the same generation as b. Because leaves are immutable they don't need to store a generation. To simplify the code we assign all leaves a generation of -1. The lifecycle of a reference is shown in Figure 4.4. Note that the child branch cannot be replaced once it has the same generation as the parent, and a child branch from an old generation can be replaced at most once. Freezing occurs by increasing the root generation rather than by decreasing the generation of any of the child branches. We use a 64-bit generation value to avoid any risk of overflow.

215 class Ref[A](init: A) {
216   def get: A
217   def set(v: A)
218   def cas(v0: A, v1: A): Boolean
219 }
220 object Ref {
221   def rdcss[A, B](a: Ref[A], a0: A, b: Ref[B], b0: B, b1: B): Boolean
222 }
223
224 class LockFreeHashTrie[A, B](contents: Node[A, B]) {
225   val rootRef = new Ref(contents)
226   def get(key: A): Option[B]
227   def put(key: A, value: B): Option[B]
228   def remove(key: A): Option[B]
229   def clone: LockFreeHashTrie[A, B]
230 }
231
232 abstract class Node[A, B] {
233   def gen: Long
234 }
235
236 class Leaf[A, B](
237     val hashes: Array[Int],
238     val keys: Array[A],
239     val values: Array[B]) extends Node[A, B] {
240   def gen = -1L
241   def get(key: A, hash: Int): Option[B]
242   def withPut(gen: Long, key: A, hash: Int, value: B): Node[A, B]
243   def withRemove(key: A, hash: Int): Leaf[A, B]
244 }
245 object Leaf {
246   def empty[A, B]: Leaf[A, B]
247 }
248
249 class Branch[A, B](
250     val gen: Long,
251     val childRefs: Array[Ref[Node[A, B]]]) extends Node[A, B] {
252   def copy(newGen: Long): Branch[A, B]
253 }

Figure 4.3: Class signatures for the lock-free hash trie. Ref is a mutable cell that supports CAS and RDCSS. Members of a Scala object behave like static members of the corresponding class.


Figure 4.4: The lifecycle of each child reference in a branch.

254 class LockFreeHashTrie[A, B](contents: Node[A, B]) {
255
256   def clone = new LockFreeHashTrie(cloneContents())
257
258   @tailrec private def cloneContents(): Node[A, B] = {
259     rootRef.get match {
260       case leaf: Leaf[A, B] => leaf // easy
261       case branch: Branch[A, B] => {
262         if (rootRef.cas(branch, branch.copy(branch.gen + 1)))
263           branch.copy(branch.gen + 1) // CAS was successful
264         else
265           cloneContents() // CAS failed, retry
266       }
267     }
268   }
269 }

Figure 4.5: Code for lock-free hash trie clone.

The Branch instances of a trie that are referenced by exactly one LockFreeHashTrie instance are exactly those that have the same generation number as the root branch, if any. This means that we can cause the branches to be considered shared by installing a new root branch using CAS. The cloned trie uses a separate copy of the root branch with the same new generation number, but because the only nodes that are shared between tries are immutable or have old generations, there is no problem with the cloned trie using the same new generation. If a trie consists of only a single Leaf then nothing needs to be done, because leaves are immutable. Figure 4.5 shows the resulting implementation of clone.

270 class LockFreeHashTrie[A, B] ... {
271   private def LogBF = 4
272   private def BF = 1 << LogBF
273   private def chIndex(shift: Int, hash: Int) = (hash >>> shift) & (BF - 1)
274
275   def get(key: A): Option[B] = get(rootRef.get, rootRef, 0, key, key.##)
276
277   @tailrec private def get(root: Node[A, B],
278                            nodeRef: Ref[Node[A, B]],
279                            shift: Int,
280                            key: A,
281                            hash: Int): Option[B] = {
282     (if (shift == 0) root else nodeRef.get) match {
283       case leaf: Leaf[A, B] => {
284         if (shift != 0 && (rootRef.get != root))
285           get(rootRef.get, rootRef, 0, key, hash) // retry from root
286         else
287           leaf.get(key, hash)
288       }
289       case branch: Branch[A, B] => get(
290         root, branch.childRefs(chIndex(shift, hash)), shift + LogBF,
291         key, hash)
292     }
293   }
294 }

Figure 4.6: Code for lock-free hash trie get.

The linearization point of clone is a successful CAS of rootRef at Line 262. In fact, the linearization point of all mutating LockFreeHashTrie methods will include rootRef in a CAS or RDCSS.

4.2.3 Hash trie get(k)

Figure 4.6 shows code for walking the trie to locate the value associated with a key. get uses Scala's Option type to indicate whether or not the key is present in the map, returning Some(v) if the value v is found and returning None if the key is not present. Line 284 checks that rootRef hasn't changed since the original read (Line 275). If the root has been changed then there must have been a concurrent clone, in which case a tail recursive call is made to retry the search (Line 285). The linearization point of get is the first Ref.get that returns a frozen Node, either on Line 275 or 282. Once the search's current position is a Leaf (always frozen) or frozen

Branch then the outcome is certain. Because references to an unfrozen branch cannot change except during clone, it is guaranteed that delaying all of the previous Ref reads to the linearization point does not affect the execution.

The observant reader will note that get would still be linearizable even if it ignored concurrent calls to clone. If any of the Ref reads were performed on a frozen Branch we could define the linearization point to be the first clone to occur after the last non-frozen Ref access, without requiring that the search be restarted. The linearization point for get would then be the CAS performed by the concurrent cloning thread. We will see in Section 4.3, however, that we would not be able to correctly execute the resulting algorithm simultaneously inside and outside a transaction. The version of get presented here always linearizes on an access performed by the current thread, which will allow us to compose the algorithm's operations by running them in an STM.

4.2.4 Hash trie put(k, v)

Figure 4.7 shows the trie code for inserting or updating a key–value association. Care is taken to avoid calling unfreeze on the root. Leaves are split into a branch inside Leaf.withPut, which can return either a Leaf or a Branch depending on whether the leaf capacity LC has been exceeded. put is written in a tail-recursive style. If it were written in a more imperative style there would be two loops, one for walking down the tree (corresponding to the recursion on Line 317) and one for retrying from the root (Line 312).

An interesting feature of this code is that the return value from the CAS on Line 328 is ignored. There is only one update possible for a Ref that points to a frozen Branch, which is to make a copy that homogenizes the generations of the parent and the child. The only way that the CAS can fail is if a concurrent thread successfully performs this transition, so regardless of success or failure the following Ref.get will obtain the Branch with the correct generation. Any writes performed by unfreeze do not affect the abstract state of the trie. It is guaranteed that the branch that is copied on Line 328 is frozen; the root generation increases monotonically, so once a branch's generation doesn't equal the root generation it

295 class LockFreeHashTrie[A, B] ... {
296
297   def put(key: A, value: B): Option[B] =
298     put(rootRef.get, rootRef, 0, key, key.##, value)
299
300   @tailrec private def put(root: Node[A, B],
301                            nodeRef: Ref[Node[A, B]],
302                            shift: Int,
303                            key: A,
304                            hash: Int,
305                            value: B): Option[B] = {
306     (if (shift == 0) root else nodeRef.get) match {
307       case leaf: Leaf[A, B] => {
308         val after = leaf.withPut(root.gen, key, hash, value)
309         if (leaf == after || Ref.rdcss(rootRef, root, nodeRef, leaf, after))
310           leaf.get(key, hash) // no change or successful RDCSS
311         else
312           put(rootRef.get, rootRef, 0, key, hash, value) // retry from root
313       }
314       case branch: Branch[A, B] => {
315         val b = unfreeze(root.gen, nodeRef, branch)
316         put(root, b.childRefs(chIndex(shift, hash)), shift + LogBF,
317             key, hash, value)
318       }
319     }
320   }
321
322   private def unfreeze(rootGen: Long,
323                        nodeRef: Ref[Node[A, B]],
324                        branch: Branch[A, B]): Branch[A, B] = {
325     if (branch.gen == rootGen)
326       branch
327     else {
328       nodeRef.cas(branch, branch.copy(rootGen))
329       nodeRef.get.asInstanceOf[Branch[A, B]]
330     }
331   }
332 }

Figure 4.7: Code for lock-free hash trie put.

can never again equal the root generation. It is possible that unfreeze's CAS is an update to a Branch that is itself frozen (because rootRef is not rechecked until a Leaf is encountered), but instance identity is irrelevant for frozen nodes so this is okay. It is only important that parallel contexts have a consensus on mutable instances. This is guaranteed because the CAS on Line 328 can only succeed once.

put linearizes at a successful RDCSS on Line 309. The extra compare in this operation is used to check that there has not been a call to clone since the traversal from the root was begun. If there has been a concurrent clone then Line 312 performs a tail recursive call that causes a retry from the root.

4.2.5 Hash trie remove(k)

The implementation of LockFreeHashTrie.remove follows the same outline as put, so we omit the code³. In the current implementation branches are never removed, even if all of their leaves become empty. This is less of a concern than it would be in an ordered tree because the uniform distribution of hash codes allows branches to be reused even when keys are always unique. Waste due to unused branches will only occur if the trie is large and then becomes permanently small.

Our ultimate execution environment for this algorithm is a system with TM, so we could fall back to a transaction to perform an atomic merging of a branch's children. The difficulty lies in tracking a branch's occupancy without reducing performance or introducing false conflicts. One possibility would be to opportunistically perform collapsing during copy-on-write. Extra snapshots could be triggered automatically by tracking trie statistics in a probabilistic or racy manner. We leave this for future work, and note that because leaves hold multiple elements the wasted space is only a small fraction of the nodes of a true tree.

³ All of the code developed in this thesis is available under a BSD license. The transactional hash trie code is part of ScalaSTM's github repository [13].
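A plausible reconstruction of the omitted remove, mirroring put in Figure 4.7 and reusing its helpers, might look like the sketch below. This is not the thesis code; it assumes that Leaf.withRemove returns the same leaf when the key is absent.

class LockFreeHashTrie[A, B] ... {

  def remove(key: A): Option[B] =
    remove(rootRef.get, rootRef, 0, key, key.##)

  @tailrec private def remove(root: Node[A, B],
                              nodeRef: Ref[Node[A, B]],
                              shift: Int,
                              key: A,
                              hash: Int): Option[B] = {
    (if (shift == 0) root else nodeRef.get) match {
      case leaf: Leaf[A, B] => {
        val after = leaf.withRemove(key, hash)
        if (leaf == after || Ref.rdcss(rootRef, root, nodeRef, leaf, after))
          leaf.get(key, hash)     // previous value, or None if the key was absent
        else
          remove(rootRef.get, rootRef, 0, key, hash)   // concurrent clone: retry from root
      }
      case branch: Branch[A, B] => {
        val b = unfreeze(root.gen, nodeRef, branch)
        remove(root, b.childRefs(chIndex(shift, hash)), shift + LogBF, key, hash)
      }
    }
  }
}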

4.2.6 Hash trie size

Implementing an efficient linearizable size operation is challenging for concurrent collections. ConcurrentSkipListMap, for example, falls back on an O(n) algorithm that traverses all of its elements. The resulting size is not even linearizable, because the traversal is not consistent. We implement size on the hash trie by taking a snapshot of the trie and then visiting all of the leaves. Because the snapshot is frozen the result is linearizable. This computation is still O(n), but because Leaf.size is simply a lookup the constant factor is better than for a skip list.

If there are multiple calls to size, however, we can do better. We can cache the size of each subtree in frozen Branch instances. If the next call to size is made after w updates to the trie then only those branches that have been unfrozen must be rechecked, so size's running time will be O(min(w,n)).
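A sketch of the snapshot-based size, without the subtree caching, might be:

class LockFreeHashTrie[A, B] ... {

  // O(n) size computed over a private snapshot.  The clone is never published
  // and every node reachable from it is frozen or private, so the traversal is
  // consistent; the result linearizes at the clone's CAS of rootRef.
  def size: Int = {
    def visit(n: Node[A, B]): Int = n match {
      case leaf: Leaf[A, B]     => leaf.hashes.length
      case branch: Branch[A, B] => branch.childRefs.map(r => visit(r.get)).sum
    }
    visit(clone.rootRef.get)
  }
}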

4.3 Lock-free Algorithms Inside Transactions

One of the original motivations for STM was to mechanically generate lock-free implementations of arbitrary data structures where no manually constructed lock-free algorithms were known, such as Herlihy et al.'s early example of binary search trees [44]. Composability is an added benefit of the resulting construction.

To provide composability for the lock-free trie of Section 4.2 we will perform a related transformation using TM, but instead of starting with a sequential algorithm we will start with the non-composable concurrent one. Our TM integration will enable hash trie methods to be called either inside or outside transactions. This will allow composition of the hash trie's linearizable operations when desired, while avoiding the overheads of TM when methods are called individually.

4.3.1 Semantics of non-transactional memory accesses

We assume that the TM is strongly atomic, also known as strongly isolated. Strong atomicity is included in all hardware TM proposals of which the author is aware. For software TMs strong atomicity can be provided by instrumenting all non-transactional accesses, but

the overhead is prohibitively high [49]. There have been several proposals for reducing this performance penalty using static analysis [83] or dynamic analysis [17, 80], but since LockFreeHashTrie has isolated all shared mutable state inside Ref instances we can rely on the type system as in Haskell’s STM [39], CCSTM [16] or ScalaSTM [13]. Most lock-free algorithms require a means to atomically perform a read and a write, either the LL/SC (load-linked and store-conditional) pair or CAS (compare-and-swap). These can be implemented using a small atomic block, but it is not difficult to extend an STM to natively support something similar. ScalaSTM provides compareAndSet, which provides slightly less information than CAS but is sufficient.
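As an illustration of the type-system-based approach, the trie's Ref and RDCSS primitives from Figure 4.3 could be layered on ScalaSTM roughly as follows. This is a sketch, not the actual implementation; in particular, the choice of == for the comparisons is an assumption.

import scala.concurrent.stm.{Ref => StmRef, atomic}

class Ref[A](init: A) {
  private val underlying = StmRef(init)

  // Outside an atomic block these run as single-access transactions (cheap
  // isolation barriers); inside an atomic block they nest in the caller's txn.
  def get: A = underlying.single.get
  def set(v: A): Unit = underlying.single() = v
  def cas(v0: A, v1: A): Boolean = underlying.single.compareAndSet(v0, v1)
}

object Ref {
  // Restricted double-compare single-swap: write b1 into b only if a still
  // holds a0 and b still holds b0.
  def rdcss[A, B](a: Ref[A], a0: A, b: Ref[B], b0: B, b1: B): Boolean =
    atomic { implicit txn =>
      if (a.underlying() == a0 && b.underlying() == b0) { b.underlying() = b1; true }
      else false
    }
}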

4.3.2 Forward progress

Intuitively, transactions are just a limitation on the possible thread schedules. The essence of lock-free algorithms is that they are robust to the system's scheduling decisions, so there should be no problems mixing transactional and non-transactional invocations of a linearizable lock-free object's methods.

The intuitive forward progress argument is correct if memory accesses are individual steps in the execution. Moore et al. formalized this with a small-step operational semantics that limits state transitions when a transaction is active [66]. If the non-transactional memory accesses are implemented by an STM's isolation barriers, however, multiple steps are actually required to implement an individual memory access. Many recent STMs are not even obstruction free, as it appears that the practical overheads of STM can be lower in a blocking implementation [27]. We leave a formal treatment of this complication for future work. We note that it is similar in spirit to the accepted practice of coding lock-free algorithms for a virtual machine such as the HotSpot JVM whose garbage collector is not lock free, or of analyzing algorithms by treating all memory accesses as single steps, despite the seven orders of magnitude difference between the latency of an L1 cache hit and a hard page fault⁴.

⁴ The author cautions the gentle reader not to calculate the latency that results when a full garbage collection triggers hard page faults in random order for every rarely-used page of a program's heap.

333 class LockFreeHashTrie[A, B] ... {
334
335   @tailrec private def externalLPGet(nodeRef: Ref[Node[A, B]],
336                                      shift: Int,
337                                      key: A,
338                                      hash: Int): Option[B] = {
339     nodeRef.get match {
340       case leaf: Leaf[A, B] => leaf.get(key, hash)
341       case branch: Branch[A, B] => externalLPGet(
342         branch.childRefs(chIndex(shift, hash)), shift + LogBF,
343         key, hash)
344     }
345   }
346 }

Figure 4.8: Code for lock-free hash trie get whose linearization point may be inside a concurrent thread’s clone.

4.3.3 Correctness after composition

Linearizability is often proved by defining the linearization points of each operation, the moment at which the operation appears to have executed atomically. It is not necessary that the linearization point be an instruction executed by the thread that called the operation; it is only required that the linearization point occurs after the operation was begun and before it is completed. Lock-free algorithms often have these external linearization points. They can arise when helping is used to implement compound mutations, or when a reading operation finds a stale value that must have been fresh at some time after the operation was begun.

Executing a series of linearizable operations inside a transaction is not sufficient to guarantee that they will appear to occur atomically. Let ai and b be invocations of a linearizable object's operations, where a1, a2, ..., an are performed inside a single transaction by thread T and b is performed outside a transaction by thread U. If b is begun before the transaction starts and completed after the transaction ends, then linearizability allows b's linearization point to be a memory access performed by an aj. In that case b will have violated the atomicity (if b is a read) or isolation (if b is a write) of the transaction.

To demonstrate this, we will describe a specific example of this problem that could occur if the hash trie's get did not check for concurrent calls to clone. Figure 4.8 shows the code for this version. In Section 4.2.3 we defined the linearization point as the first Ref.get to return a frozen Node, relying on the absence of a concurrent clone to avoid

the possibility that a node could become frozen after it had been located but before the search had traversed to its child. We can define externalLPGet's linearization point to be the first memory access (by any thread) that causes the search's current position to be a frozen Node. This access can be either a Ref.get that moves the position to a frozen Node or a concurrent clone that causes the current Node instance to become frozen. The code in Figure 4.8 is simpler, faster and has better progress guarantees (wait freedom) than that from Figure 4.6. Except for the problems introduced in this section, it would be a better implementation of get. Linearizing get(k) at a concurrent clone will be a problem if the value associated with k at the time of the clone should not otherwise be visible. This can occur if a transaction calls put(k,v1), then clone, then put(k,v2). If a concurrent call to get(k) observes v1, then the transaction's atomicity has been violated. This apparent violation of atomicity is in fact possible. Consider a trie that consists of a single Branch containing Leaf instances. If the non-transactional externalLPGet locates the branch before the transaction runs but does not read the contained Leaf reference until after the transaction, the branch will contain the updated leaf from put(k,v1) but not the updated leaf from put(k,v2). The second put must copy the branch before updating it, because it was frozen.
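The problematic schedule can be sketched as a small driver. This is not thesis code: the driver, its parameter names, and the use of snapshot as a stand-in for clone are ours, and the anomaly arises only for the unchecked get of Figure 4.8, not for the real implementation.

import scala.concurrent.stm._

// Thread T runs this transaction while thread U runs the unchecked get.
def writerTxn[A, B](m: TMap[A, B], k: A, v1: B, v2: B): Unit =
  atomic { implicit txn =>
    m.put(k, v1)   // updates the Leaf inside the Branch already located by U
    m.snapshot     // freezes that Branch: the candidate external LP for U's get
    m.put(k, v2)   // copy-on-write of the frozen Branch; the old copy keeps v1
  }

// If thread U's externalLPGet paused after locating the Branch but before
// reading its Leaf reference, it resumes against the stale Branch and
// returns v1, a value that should never be visible outside the transaction.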

4.3.4 Atomicity violation – what is to blame?

The atomicity violation occurs only at the level of linearizable operations; atomicity and isolation are still preserved for the individual loads and stores. Something has gone wrong, but who or what is to blame?

• Serializability – Perhaps the problem is that TM provides guarantees only at the structural level, while programs are free to assign meaning to inferred virtual events that have no structural analogue. In this view the problem is a fundamental limitation of automatic concurrency control.

• Linearizability – An alternative is to view the problem as a limitation on the composability of linearizability. Speculative lock elision allows transactional execution of lock-based concurrent algorithms to compose with non-transactional execution [74], so perhaps linearizability gives too much freedom to the implementer? This view is not likely to be very attractive, because implementation freedom leads to more performant and scalable solutions.

• Strong atomicity – Under single global lock atomicity (SGLA) we would expect a transaction executed by T to constrain U's non-transactional execution [63]. In this model the observed behavior is allowed, and an object's operations can only be composed using transactions if transactions are always used. This view would be more compelling if there were not a subset of lock-free algorithms that could be correctly composed under strong atomicity but not under SGLA.

• Proof composition – One way to resolve this problem is to allow linearizable operations to be composed in a transaction, but to consider the composition to be a new operation whose linearizability must be proved separately. Here the blame is placed not on linearizability or serializability, but on the expectation that the composable atomicity and isolation of memory transactions extends to composition of linearizability proofs. Of course, proving linearizability of the resulting compound operations will be very difficult.

• Premature optimization – Perhaps the fault lies with the author of this thesis, for attempting to remove some transactions to get only a constant factor improvement in performance.

Regardless of the cause, it is important to characterize the problem so that we may tame or avoid it.

4.3.5 Internal linearization points – a sufficient condition

We say that an operation is internally linearizable if the linearization point (LP) can only be a memory access performed by the thread executing the operation.5 Strong atomicity guarantees that a memory transaction is isolated from accesses by other threads, so a transaction cannot contain the LP of an internally linearizable operation performed by a concurrent thread. This is sufficient to guarantee that a non-transactional internally linearizable operation cannot appear to be interleaved with the transaction's operations. Internal linearizability is not the most general property that guarantees that non-transactional operations cannot be linearized inside a concurrent transaction. A devil's advocate algorithm might try for a fixed number of instructions to locate and communicate with concurrent invocations, falling back to a lock-free internally linearizable implementation if no partner could be found. Operations on this byzantine object would not be able to conclusively detect that they were inside a transaction, but they would sometimes be able to prove that they were outside. They could then arrange that an external LP could only occur outside a transaction, preventing the transaction anomaly without being internally linearizable. The transactional hash trie with snapshots always linearizes on a Ref access performed by the current thread, so it is internally linearizable. All of its methods may be invoked either inside or outside a transaction with the expected atomicity and isolation.

4.4 Transactional Access to the Hash Trie

Transactional integration of the lock-free hash trie is as simple as executing the algorithm inside a transaction when composition is required. All shared mutable state is encapsulated in Ref instances, making it easy to use a library-based STM.

5The LP can always be defined to be the completion of some memory access, because an LP at an instruction that does not affect the shared state of the machine is equivalent to an LP at the preceding instruction.

4.4.1 clone inside a transaction

The hash trie's snapshot-based clone works fine when called from inside a transaction; the resulting clone acts as if the entire trie was copied but requires only O(1) work. clone writes to the root to ensure that the shared content is marked frozen, however, which means that a transaction that calls clone will conflict with any other transaction that accesses the map. Multiple calls to clone will also conflict with each other, because each will attempt to install a new root to create a new snapshot. Snapshots remain valid until a subsequent write, so we can improve the situation by advancing the root generation only if some of the children are not frozen. If we add a bit to Branch that allows the root branch to be considered frozen, then we can go even further and allow duplicate snapshots to reuse the entire trie. With this extension clone does not increment the generation number, but rather installs a frozen root Branch. Copy-on-write must then be performed on the root branch prior to the next write. This scheme has the advantage that if there are no mutating operations performed on the map then all map operations will become read-only.
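The frozen-root idea can be sketched as follows. The types and helper names here are simplified stand-ins (the real Branch carries generation numbers and child arrays with richer types), so treat this as an illustration of the scheme rather than the implementation.

import scala.concurrent.stm._

// Simplified stand-in for the real node type (illustration only).
case class Branch(frozen: Boolean, children: Array[Ref[AnyRef]])

// clone installs a frozen root; if the root is already frozen nothing is
// written, so back-to-back clones of an unchanged trie stay read-only.
def shareForClone(rootRef: Ref[Branch])(implicit txn: InTxn): Unit = {
  val root = rootRef()
  if (!root.frozen)
    rootRef() = root.copy(frozen = true)
}

// The next mutating operation performs copy-on-write on the frozen root
// before descending into the trie.
def unsharedRoot(rootRef: Ref[Branch])(implicit txn: InTxn): Branch = {
  val root = rootRef()
  if (!root.frozen) root
  else {
    val fresh = root.copy(frozen = false, children = root.children.clone())
    rootRef() = fresh
    fresh
  }
}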

4.4.2 Optimizations for transactional operations

While the lock-free code works correctly inside a transaction, the transaction's isolation allows the code to be simplified. There is no need for get to retest rootRef (Line 285), because it can't have changed after its original read. Similarly there is no need to use CAS or RDCSS in any operation, because a Ref.get followed by Ref.set will have the same effect. Scala's collection classes are both iterable and traversable. Iterable collections can return an Iterator instance that will produce the elements of the collection one at a time. Traversable collections define a foreach method that will invoke a user-supplied method for each element. The hash trie's iterator must always take a snapshot because the iterator may escape the scope of a transaction. Transactional foreach, however, may skip the snapshot and rely entirely on the transaction's isolation.
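The CAS simplification can be illustrated with a small pair of helpers. These are ours, not the thesis code: they only show why a read followed by a plain set is sufficient once the transaction's isolation pins the value.

import scala.concurrent.stm._

// Outside a transaction the lock-free path must retry a failed CAS.
def replaceOutsideTxn[T](ref: Ref.View[T], f: T => T): Unit = {
  var done = false
  while (!done) {
    val before = ref()
    done = ref.compareAndSet(before, f(before))
  }
}

// Inside a transaction, isolation guarantees the value cannot change
// before commit, so no CAS or RDCSS is needed.
def replaceInsideTxn[T](ref: Ref[T], f: T => T)(implicit txn: InTxn): Unit = {
  val before = ref()
  ref() = f(before)
}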

4.4.3 Adaptively reducing false conflicts

The hash trie's packed leaves help to improve cache locality and reduce memory use, but they can introduce false conflicts. One of the compelling benefits of transactional predication (Chapter 3) was a reduction in false conflicts, and we would like transactional hash tries to be similarly robust. False conflicts can be avoided by splitting contended leaves. Most STMs use invisible readers, which makes it difficult for writing transactions to detect that they are causing conflicts with readers. Maps that are experiencing read-write conflicts are likely to experience some write-write conflicts, which can be efficiently detected in several high performance STMs including McRT-STM [79] and SwissTM [25]. Our experimental evaluation is performed using ScalaSTM, which is based on the SwissTM algorithm. ScalaSTM provides a weak form of Ref.set called trySet, which returns false and does nothing if another transaction already has acquired ownership of a Ref. This detects both write-write conflicts and read-write conflicts where the reader has fallen back to ScalaSTM's pessimistic read mode to guarantee forward progress. We use failed trySet to detect false conflicts. The thread that detects the false conflict is not in a position to fix the problem, unfortunately. If it tries to split the leaf it is guaranteed to trigger a rollback, and if it rolls itself back there will be no improvement. Also, write-write conflicts are evidence that undetected read-write conflicts are likely. Rather than using the result of a failed trySet locally, we use it to maintain a per-map exponential moving average of the rate of write-write conflicts. Each call to put then uses a more aggressive splitting cutoff if this estimate exceeds a fixed threshold. The parameters of the contention estimate are dependent on the details of the STM. For ScalaSTM reasonable behavior is observed on all of the microbenchmarks from Chapter 3 with an exponential moving average that retains 1 − 1/512 of its previous value during each put and that aggressively splits leaves whenever the average indicates that more than 1% of calls to trySet fail. Of course, even a racy update of a global contention estimate during each call to put would limit the scalability of the trie. Because trySet failures are rare we wish to count them exactly, but successes are common so we can update the estimate probabilistically.

class LockFreeHashTrie[A, B] ... {
  private def pct = 10000
  private def conflictThreshold = 1 * pct

  // Ranges from 0 (no conflicts) to 100 * pct (conflict every time)
  private var conflictEstimate = 0

  private def recordNoconflict() {
    if (ThreadLocalRandom.nextInt(1 << 5) == 0) {
      val e = conflictEstimate
      conflictEstimate = e - (e >> 4)
    }
  }

  private def recordconflict() {
    val e = conflictEstimate
    conflictEstimate = e + ((100 * pct - e) >> 9)
  }

  private def adaptiveLeafCapacity: Int = {
    if (conflictEstimate > conflictThreshold) 1 else LC
  }

  private def txnSet(ref: Ref[Node[A, B]], node: Node[A, B]) {
    if (!ref.trySet(node)) {
      recordconflict()
      ref.set(node)
    } else
      recordNoconflict()
  }
}

Figure 4.9: Code to track the rate of write-write conflicts using a probabilistically-updated exponential moving average with α = 2^-9.

We use a thread-local random number generator to record 32 successes 1/32 of the time. Figure 4.9 shows the code used inside a transaction to update a Leaf. Fractional values are stored out of 1,000,000 to allow us to use integer arithmetic. We use the approximation that 1 − (1 − 1/512)^32 ≈ 32/512 = 1/16, which is correct within a few percent.
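Written out, the step behind the sampled decay is just the first-order binomial expansion (this derivation is ours, matching the constants in Figure 4.9):

\left(1 - \tfrac{1}{512}\right)^{32} \;\approx\; 1 - \tfrac{32}{512} \;=\; 1 - \tfrac{1}{16}

so subtracting e/16 (that is, e >> 4) from the estimate once per 32 successes decays it at nearly the same rate as subtracting e/512 on every success.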

4.5 TMap Recipes

In this section we will illustrate the composability features of the transactional map by giving recipes for useful compound operations. These examples use ScalaSTM's API,

which includes our transactional hash trie as the default implementation of its TMap[A,B] interface [13].

4.5.1 ScalaSTM’s types

The two most useful container types in ScalaSTM are Ref[A] and TMap[A,B]. Ref is a mutable transactional reference that holds an instance of A. TMap is a transactional map from instances of A to B, where each instance of A passed to the map must not be changed. ScalaSTM’s TMap factory instance returns transactional hash tries that use the algorithm introduced in this chapter. ScalaSTM uses the type system to statically check that transactional operations are only called inside an atomic block, as in Haskell’s STM [39] and CCSTM [16]. This is accomplished by requiring an implicit InTxn parameter on each transactional method. The Scala language automatically connects implicit declarations to implicit parameters, so the implicit InTxn instance passed into the atomic block need not explicitly appear at the call site. The following three snippets are translated identically by the Scala compiler:

val m: TMap[Int, String] = ...

// the InTxn witness is passed explicitly
atomic { (t: InTxn) =>
  if (m.contains(0)(t))
    m.put(10, "ten")(t)
}

// the InTxn witness is passed explicitly, with type inference
atomic { t =>
  if (m.contains(0)(t))
    m.put(10, "ten")(t)
}

// the InTxn witness is passed implicitly
atomic { implicit t =>
  if (m.contains(0))
    m.put(10, "ten")
}

When only a single operation would be performed by an atomic block, ScalaSTM provides an alternate set of types whose methods do not require an implicit InTxn parameter. Access to this type is obtained via a .single method on the transactional instances. Ref's single returns an instance of Ref.View, and TMap's single returns an instance of type TMap.View, which is integrated into Scala's mutable collection hierarchy as a subclass of scala.collection.mutable.Map. The returned type still supports optional composition using transactions, with the surrounding transaction scope located by a dynamic check.
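For example (illustrative names, using the .single views just described):

val hits: Ref.View[Int] = Ref(0).single
hits.transform(_ + 1)          // atomic increment, no atomic block required
val current = hits()           // plain read outside any transaction

val cache: TMap.View[String, Int] = TMap.empty[String, Int].single
cache("answer") = 42           // behaves like scala.collection.mutable.Map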

4.5.2 TMap as a normal concurrent map

The following constructs a transactional map from integers to strings that can be used either inside or outside a transaction. A dynamic search will be made at each call (using a ThreadLocal) for an enclosing atomic block:

val m = TMap.empty[Int, String].single

The returned instance m may be used directly as a concurrent map, with no atomic blocks:

val existing = m(key)         // access an existing element
val maybe = m.get(key)        // Some(v) if present, None otherwise
m += (k -> v)                 // add or update an association
val maybePrev = m.put(k, v)   // Some(prev) on update, None on insert

Some methods of Scala's mutable Map perform multiple accesses, such as getOrElseUpdate and transform. TMap.View's implementation of these is atomic, using transactions underneath as necessary.
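For example, a hypothetical memoization helper built on a TMap.View (parse stands in for an arbitrary expensive computation; the names are ours):

import scala.concurrent.stm._

// getOrElseUpdate on TMap.View runs atomically: the lookup and the
// conditional insert cannot be interleaved with other threads' updates.
def cachedParse(m: TMap.View[Int, String], key: Int,
                parse: Int => String): String =
  m.getOrElseUpdate(key, parse(key))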

4.5.3 Consistent iteration and immutable snapshots

Most concurrent data structures (including all of those in java.util.concurrent) do not provide consistent iteration. TMap.View, however, always takes a snapshot prior to iteration. This means that no extra work is required, and all of the powerful functional features of Scala’s maps inherit the increased safety of snapshot isolation:

for ((k, v) <- m) { ... }            // snapshot isolation
val atomicSum = m.reduceLeft(_+_)    // reduction is linearizable

If an explicit snapshot is required, m.snapshot uses the copy-on-write mechanism to return an instance of scala.collection.immutable.Map.

4.5.4 Compound operations on a single map

Concurrent maps often support atomic modification of the value associated with a key with a CAS-like operation. If a map’s declared value type is B, then the actual range of existing and new values that may be passed to CAS is B+1, where the extra inhabitant encodes the absence of an association. In Scala this type is referred to as Option[B]. Java’s ConcurrentHashMap encodes the extra inhabitant of the value type by separating the functionality of CAS over four methods:

expected    new         CAS(k, expected, new)
None        None        !containsKey(k)
None        Some(v1)    putIfAbsent(k, v1) == null
Some(v0)    Some(v1)    replace(k, v0, v1)
Some(v0)    None        remove(k, v0)

The CAS-like operations can be combined with an optimistic retry loop to code arbitrary transformations of an immutable value, but they can't be used to modify the key. The transactional map makes this trivial. In the following code remove returns an Option[B], and the for comprehension over the Option traverses either 0 or 1 value. This makes sure that k1's association is only updated if k0 is present:

def renameKey(k0: A, k1: A) {
  atomic { implicit txn =>
    for (v <- m.remove(k0))
      m(k1) = v
  }
}
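An atomic block also collapses the four-method encoding from the table above into a single operation. The helper below is our sketch (the name and signature are not part of the TMap API), where expected == None and next == None encode absence:

import scala.concurrent.stm._

def cas[A, B](m: TMap[A, B], k: A,
              expected: Option[B], next: Option[B]): Boolean =
  atomic { implicit txn =>
    if (m.get(k) != expected) {
      false
    } else {
      next match {
        case Some(v) => m.put(k, v)
        case None    => m.remove(k)
      }
      true
    }
  }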

Scala’s mutable Map interface provides some methods that perform bulk mutation of the map, including ++=, transform and retain. TMap.View provides atomicity and isolation at the method level for all of Map’s methods, so all of these bulk methods execute without visible concurrency interleavings:

m.retain { (k,v) => v.isValid }   // atomically remove invalid entries

4.5.5 Compound operations across multiple maps

Composability inside an atomic block extends across all STM-managed data, including multiple Refs, Ref.Views, TMaps, and TMap.Views. This makes it easy to perform multi-step operations like moving elements between maps, or checking that a map is empty and then removing a reference to it:

// move entries from one generation to another, without any
// entry being visible in two generations
atomic { implicit txn =>
  oldGen ++= eden
  eden.clear
}

4.5.6 Synchronized forward and reverse indices

Maps are suitable for both primary data storage and for implementing indices. When used as a primary store, the map key is the primary identifier of the data and the value holds a single record. When used as an index, the map associates the indexed value with a set of records. As a simple example, consider a phone system that uses the phone number as the primary key, and that also has an index that allows looking up numbers by name. We'll let each phone number have multiple associated names. In Scala we can code an immutable record concisely using a case class, which acts like a tuple with named fields:

type Number = String
type Name = String
case class Record(number: Number, names: immutable.Set[Name], ...)

The primary storage for records is a map from number to Record:

val records = TMap.empty[Number, Record]

The index by name is not unique, because each name might be associated with more than one phone. Because we are operating in memory, the index can link directly to the target records without any wasted space:

val byName = TMap.empty[Name, immutable.Set[Record]]

Inserting

To insert a Record in a thread-safe fashion, we must atomically insert an entry into the records map and insert or update the affected entries in the byName map. The hash trie’s transactional integration makes this as easy as the equivalent sequential code:

def insert(rec: Record) {
  atomic { implicit txn =>
    records(rec.number) = rec
    for (n <- rec.names)
      byName(n) = byName.getOrElse(n, immutable.Set.empty[Record]) + rec
  }
}

Removing

Removing a Record also requires composing a modification of records with modifications of byName. Notice that because the atomic block is isolated from outside interactions, we can use sequential reasoning to detect when the last record is removed from the index for a name:

def remove(rec: Record) {
  atomic { implicit txn =>
    records -= rec.number
    for (n <- rec.names) {
      val prev = byName(n)
      if (prev.size == 1)
        byName -= n
      else
        byName(n) = prev - rec
    }
  }
}

Updates

Updating a record is equivalent to atomically removing and then inserting it. We can simplify this interface for the caller if we fetch the previous value (the one to be removed) using

the new record’s number. This also allows us to combine the functionality of insertion and update into a single put method, as is common in map implementations. Atomic blocks nest, so we can reuse insert and remove:

def put(rec: Record) {
  atomic { implicit txn =>
    for (prev <- records.get(rec.number))
      remove(prev)
    insert(rec)
  }
}

Lookup

Queries by a single name or by phone number are likely to constitute the vast majority of the operations on our phone number storage. Because the values stored in the transactional maps are immutable, and only a single map needs to be accessed to perform a lookup, no atomic block is required:

// returns None if not found
def lookupByNumber(num: Number): Option[Record] =
  records.single.get(num)

// returns an empty Set if not found
def lookupByName(name: Name): Set[Record] =
  byName.single.getOrElse(name, Set.empty[Record])

Snapshots

Suppose that we would like to capture the current state of the phone directory, perhaps to render an HTML by-name and by-number listing in a web interface. We can guarantee that the two portions of the result will be in sync by atomically capturing a snapshot of both transactional hash tries:

def snapshot = atomic { implicit txn =>
  (records.snapshot, byName.snapshot)
}

Scala will infer a return type for this method of:

(immutable.Map[Number, Record], immutable.Map[Name, immutable.Set[Record]])

4.6 Experimental Evaluation

In this section we evaluate the performance of the transactional hash trie. First we reproduce the throughput micro-benchmarks from Chapter 3, finding that the transactional hash trie's average performance is close to that of the predicated ConcurrentHashMap despite the hash trie's extra features. Second, we use the synthetic STMBench7 benchmark to compare transactional use of the hash trie to sophisticated non-composable pessimistic locking. Finally, we examine the use of transactional hash tries and STM in a large real-world application by rewriting Apache Derby's lock manager. Our code using memory transactions is simpler, avoids subtle correctness arguments, scales, and achieves 98% of the peak performance of the original fine-grained locking version. Experiments were run on a Dell Precision T7500n with two quad-core 2.66 GHz Intel Xeon X5550 processors and 24 GB of RAM. We used Linux kernel version 2.6.28-18-server. Hyper-Threading was enabled, yielding a total of 16 hardware thread contexts. Scala code was compiled with Scala version 2.9.0-1. We ran our experiments in Sun's Java SE Runtime Environment 1.6.0_26, using the HotSpot 64-Bit Server VM with default JVM options (the JVM memory flags usually added by the scala launcher script were not used). We used ScalaSTM 0.4-SNAPSHOT (git version 10535a48) in its default configuration [13]. ScalaSTM's reference implementation was originally derived from CCSTM [16]. The STM uses the SwissTM algorithm [25], which is a variant of TL2 [24] that detects write-write conflicts eagerly.

4.6.1 Differences between ScalaSTM and CCSTM

While the version of ScalaSTM used in the experimental evaluation of this chapter is derived from the CCSTM used in Chapter 3, it includes an important improvement in its contention management handling. When atomic blocks in CCSTM have experienced several rollbacks due to inconsistent reads they enter a pessimistic read mode in which write locks are acquired for all memory accesses. Combined with transaction priorities, the pessimistic read mode guarantees that every transaction will eventually succeed. The implementation of pessimistic reads in CCSTM uses the write buffer, which causes a memory location's

version to be incremented on commit. This causes a cascading failure for tree-like structures, because pessimistic reads of the root trigger optimistic failures in concurrent readers, which will in turn result in more pessimistic readers. ScalaSTM uses a separate data structure to track pessimistic reads. This means that a transaction that is using the pessimistic read mode does not interfere with transactions that are still performing their reads optimistically. ScalaSTM also differs from CCSTM by providing true nesting with partial rollback, which is required by its support for Harris et al.'s retry and orElse primitives [39]. A novel high-water mark is used inside the write buffer to optimize for the case when all writes to a Ref occur in the same nesting level. Despite its support for partial rollback, the additional engineering effort that has gone into ScalaSTM, Scala 2.9.0-1 and JRE 1.6.0_26 means that transactions are in general faster than in our previous experiments.

4.6.2 Microbenchmarks

The microbenchmarks emulate the methodology used by Herlihy et al. [43]. Each pass consists of each thread performing 10^6 randomly chosen operations on a shared map; a new map is used for each pass. To simulate a variety of workloads, two parameters are varied: the proportion of get, put and remove operations, and the range from which the keys are uniformly selected. A smaller fraction of gets and a smaller key range both increase contention. Because put and remove are equally likely in our tests, the map size converges to half the key range. To allow for HotSpot's dynamic compilation, each experiment consists of twenty passes; the first ten warm up the VM and the second ten are timed. Throughput results are reported as operations per microsecond. Each experiment is run five times and the arithmetic average is reported as the final result. Figure 4.10 replicates all of the experiments from Figure 3.5, adding results for 'snap-trie', the transactional hash trie with snapshots. All of the map implementations evaluated in Section 3.6 are included, updated to use ScalaSTM and Scala 2.9.0-1.

• conc-hash – Lea’s ConcurrentHashMap [59], as included in the JRE’s standard library;

• boosting-soft – a transactionally boosted ConcurrentHashMap. Soft references are used to reclaim the locks.

Figure 4.10: Throughput of transactional map implementations (conc-hash, boosting-soft, txn-pred-soft, stm-hash, and snap-trie) across a range of configurations, in operations per microsecond for thread counts from 1 to 16. Each graph plots one combination of workload (10% put, 10% remove, 80% get or 50% put, 50% remove, 0% get), key range, and transaction size (non-txn, 4 ops/txn, or 64 ops/txn).

• txn-pred-soft – a predicated ConcurrentHashMap that uses the soft reference mechanism of Section 3.4.3.

• stm-hash – a hash map with 16 segments, each of which is a resizeable transactional hash table.

• snap-trie – a transactional hash trie with snapshots, from this chapter.

The general relationship between transactional predication and the other transactional alternatives remains the same, despite changes in the STM, Scala language, and JVM. Although few rankings have changed, we note that the absolute performance is higher across the board, so the y axes are not all the same. Snap-trie is on average 18% slower than txn-pred-soft across the small key range. A small key range is a best case scenario for the soft reference reclamation mechanism, because there is no memory pressure. The performance difference between the hash trie and predication is reduced for the large key range; snap-trie's excellent performance for low thread counts means that the geometric mean of its throughput is actually 0.1% higher than transactional predication's for the large key range, despite being lower for most data points. Snap-trie's advantage is largest for transactional read-heavy access at low thread counts, where 45% of the calls read a non-existent key (half of the reads and half of the removes). Txn-pred-soft must lazily recreate predicates to handle read misses that occur inside a transaction. The optimizations of Section 3.4.4 minimize this penalty for access outside a transaction. The hash trie's performance almost completely dominates both the transactional hash table and transactional boosting, and it also outperforms ConcurrentHashMap for contended configurations.6 Conc-hash has very good single-threaded performance because the JIT can remove almost all concurrency control code from the hot path, but we don't expect this to be a typical scenario. Snap-trie offers comparable performance and scalability to transactional predication, while providing extra functionality. The microbenchmarks did not reveal any performance pathologies. In addition, snap-trie does not require cleanup threads, has a much smaller memory

6The JDK7 release includes a ConcurrentHashMap implementation with better performance under write contention, but JIT bugs currently make JDK7 problematic.

footprint when it contains only a few elements, and supports a linearizable size operation without requiring extra work for all operations, so we conclude that snap-trie is the best unordered transactional map for general purpose use.

4.6.3 STMBench7

STMBench7 is a synthetic STM benchmark that performs a mix of reads, writes, bulk reads, and bulk writes over an in-memory representation of a CAD model [34, 26]. Versions are available for C++ and for Java. This benchmark is especially well suited for measuring the overheads of generic composability because it includes reference implementations that use hand-coded non-composable pessimistic locking. Two locking strategies are included: coarse and medium. The medium-grained locking implementation is representative of the approach that an experienced parallel programmer might take. It is complicated enough that early versions of the benchmark had concurrency bugs, despite being created by parallel programming researchers. A majority of the transactional work performed by STMBench7 involves the creation, reading and updating of unordered transactional sets and maps. Since ScalaSTM doesn't perform bytecode rewriting, we implemented a straightforward STMBench7 adapter, in 365 lines of Scala. All unordered sets and maps were implemented using the transactional hash trie introduced in this chapter, as included in the ScalaSTM distribution. We used the Java STMBench7 version 1.2 (25.02.2011), disabled long traversals, and did not count operations in which OperationFailedException was thrown (this is the configuration used previously by STM researchers). Each data point is the average of 5 executions of the benchmark, each of which lasted for 60 seconds.

Performance

Let's start with the read-dominated workload, which should benefit most from the STM's optimistic locking. The test machine has 8 real cores plus Hyper-Threading, so it can run 16 threads at once. We also tested at higher thread counts to verify the robustness of the contention management in scala-stm and the transactional hash trie, and because pessimistic locking strategies may achieve their peak performance with more threads than cores (if cores would otherwise be idle during blocking).

Figure 4.11: STMBench7 performance with a read dominated workload.

Figure 4.11 shows that for the read dominated workload, our STM implementation has 16% higher peak performance than even the complex medium-grained locks. The single-thread overhead incurred by the STM's automatic concurrency control and the extra features of the transactional hash trie (including composability and support for snapshots and consistent iteration) is overcome by the superior scalability of optimistic concurrency control. This is an excellent result, since the STM implementation is as easy to use for the programmer as the coarse-grained lock. Despite being much more complicated (which means expensive to write, test, and maintain) the medium-grained locks have a lower peak performance than the simple STM code. We would not expect optimistic concurrency control to scale as well in a write-dominated workload with high contention. Figure 4.12 shows that none of the implementations is able to find much scalability. There is too much writing for readers to get a benefit from optimistic concurrency or reader/writer locks, and the writers touch the same data so often that any benefits from parallelism are lost to cache misses and synchronization overheads.

Figure 4.12: STMBench7 performance with a write dominated workload.

The STM can't use scalability to compensate for its single-thread overheads, but at least it is handling the contention without any performance pathologies. The STM's peak performance for the write-dominated workload is 31% worse than the medium-grained locks. Throughput results for a mixed workload are shown in Figure 4.13. At first glance this looks more like the write-dominated performance than the read-dominated one, for all of the implementations. In fact, this result is as expected. We are seeing the real-world effect of Amdahl's law. We can think of the write-dominated component as the sequential part, since it doesn't benefit from parallelism, and the read-dominated component as the parallel part, since it gets faster when we add threads. 50% of the work isn't scalable, so we will be limited to a speedup of 2. Even if reads were completely free, the mixed workload's ops/sec would only be double the write-dominated workload's ops/sec. We can compute an expected sec/op of the mix by averaging the sec/op of the two component workloads. Armed with this formula we see that the mixed workload's performance is exactly as would be expected from the previous two experiments.

Figure 4.13: STMBench7 performance with a workload that is an even mix of those from Figures 4.11 and 4.12.

Garbage collector load

There are several potential sources of extra GC load in the scala-stm version of STMBench7:

• allocation from the closures that help make the Scala code aesthetic;

• allocation in the STM itself (although for small- and medium-sized transactions the ScalaSTM reference implementation allocates only one object per transaction);

• objects discarded when a transaction is rolled back and retried;

• short-lived wrapper instances needed to use the generic Ref interface on the JVM (these may eventually be targeted with Scala’s @specialized annotation); and

• copying performed during updates to the immutable.TreeSet, TSet and TMap used by the Scala version of the benchmark.

Figure 4.14: GC reclamation rate during STMBench7 for the mixed workload from Figure 4.13.

Figure 4.15: GC wall time, as a fraction of the total wall time for the mixed STMBench7 workload from Figure 4.13.

Figure 4.14 shows the steady-state reclamation performed by the GC during the benchmark, in gigabytes per second. The peak GC throughput is 3.2 times higher for STM than for the medium-grained lock implementation. Figure 4.15 shows a similar (2.9×) relative increase in the wall time used by the garbage collector. The results from other workloads are similar, with a peak GC load less than 11% for all cases. The benchmark represents a worst-case scenario in which all threads spend all of their time in transactions.

STMBench7 conclusion

This set of experiments shows that taking advantage of the transactional hash trie's composability can yield good performance and scalability even when the baseline is hand-coded pessimistic locking by parallelism experts (or at least parallelism researchers), and even when the underlying STM does not have the advantage of language, compiler, or VM integration. The carefully engineered lock-based implementation yielded higher performance for contended scenarios, but had less scalability and was substantially more complex. The extra GC load of the transactional hash trie could be reduced by an STM that was not implemented as a pure library. Since our experiments used a stop-the-world garbage collector, any time recovered from GC overhead would contribute directly to benchmark throughput; in an execution in which the GC overhead was reduced from 8% to 4% we would expect a (1 − 0.04)/(1 − 0.08) − 1 ≈ 4% improvement in the STMBench7 score.

4.6.4 In-situ evaluation inside Apache Derby

The promise of memory transactions is good performance and scalability with a simpler programming model. Our work on transactional maps is an effort to deliver that promise by optimizing the data structure libraries that fit between the program and the STM. To fully demonstrate the success of this approach we must tackle a problem at scale.

Derby’s lock manager

The lock manager inside Apache's Derby SQL database is a complicated concurrent component that is critical to the performance and scalability of the database. It is responsible for: granting row-level locks to database transactions; tracking the locks that must be

released after a database transaction is completed; managing named subgroups of locks; detecting and breaking deadlock cycles; fair queuing of lock waiters with opportunistic piggybacking; and timeouts. The original implementation of Derby's LockFactory interface used coarse-grained Java monitors. It consisted of 2,102 non-comment lines of Java. Database lock acquisition and release was serialized, and all lock activities were blocked during deadlock detection. In 2007 this code was identified as a scaling problem (issue DERBY-2327) and it was replaced with a more complicated implementation that scales better. The replacement uses fine-grained locks and ConcurrentHashMap; it is only 102 lines longer than the coarse-locked version, but its logic is much more subtle. Ironically, 128 lines of mailing list discussion were devoted to informally proving that the deadlock detector is not itself subject to deadlock! The switch to fine-grained locks was successful; a test client designed to stress scalability shows bad scaling before the rewrite and excellent scaling afterward, with a negligible difference in single-thread performance. The set of currently granted locks can be described as a multiset of 4-tuples ⟨r, t, g, q⟩, where r is a row, t is a transaction, g is a thread-local group of rows within a transaction and q is the lock type, such as shared or exclusive. The lock manager is generalized to support locking other types of objects as well, but the heaviest use is of these instances. The generic type that includes rows is Lockable. Database transactions are generalized as members of a CompatibilitySpace, and the lock type is referred to as a qualifier. Each Lockable r is also associated with a fair queue of waiting tuples. Individual acquires and releases are specified using the entire 4-tuple. Bulk release and existential queries are provided by t and ⟨t, g⟩. Both the coarse- and fine-grained lock manager implementations make extensive use of mutable state. Indices to support efficient lookup by r, t and g are constructed in an ad-hoc manner, and state transitions such as handing a lock to the next waiter require multiple steps and the participation of both threads. Extra logic is required to handle the interaction of concurrent multi-step transitions, but this code is rarely executed and was the cause of at least one production concurrency bug (DERBY-4711). The following is a list of the corner cases we found while trying to understand the code; there are almost certainly more:

• Lock availability is a function of both the lock status and the wait queue, because multiple steps are required to hand the lock to the next waiter;

• Multi-step lock hand off has a complex interaction with timeout of a lock request (this is what caused DERBY-4711);

• Multi-step lock hand off has a complex interaction with piggybacking of lock requests from the same transaction, both when piggybacking becomes newly possible and when it becomes newly impossible;

• Lock requests can time out after they have been cancelled by the deadlock detector but before the cancel has been noticed by the waiting thread; and

• Deadlock detection acquires locks in encounter order, which necessitates an additional level of locking to restore a global lock order.

Scala implementation using the transactional hash trie

We completely reimplemented the LockFactory interface in Scala using transactional maps and immutable data structures. The only mutable shared instances in our implementation were transactional hash tries, which have type TMap. The Scala reimplementation of the lock manager totals 418 non-comment lines of code. 67 of those lines are immutable implementations of Java interfaces; the remaining 351 lines implement LockFactory's 14 methods and all of its functionality, including deadlock detection. We found compelling uses for all of the transactional hash trie's innovations.
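The shared state of the reimplementation can be sketched as follows. Only grantsByRef and LockIdentity appear in Figure 4.16; the other names and field types here are simplified guesses at the bookkeeping, not the actual code.

import scala.concurrent.stm._
import scala.collection.immutable

// Simplified stand-ins; the real code also tracks qualifiers, lock counts,
// and fairness information.
case class LockIdentity(ref: AnyRef, space: AnyRef, qual: AnyRef)
case class Waiter(id: LockIdentity)

class STMLockState {
  // every granted lock, indexed by the locked object (as in Figure 4.16)
  val grantsByRef = TMap.empty[AnyRef, immutable.Set[LockIdentity]]
  // pending requests for each locked object, in FIFO order
  val waitersByRef = TMap.empty[AnyRef, immutable.Queue[Waiter]]
  // per-(space, group) index supporting bulk release
  val grantsByGroup = TMap.empty[(AnyRef, AnyRef), immutable.Set[LockIdentity]]
}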

Use case – Fast non-transactional accesses

One of the most heavily used LockFactory methods is zeroDurationlockObject, which has the same semantics as acquiring and then immediately releasing a database lock. This method is called for each row visited by an SQL join or select in Derby's default isolation level of TRANSACTION_READ_COMMITTED. The read-committed isolation level requires only that all observed data has been committed, but it does not require that a query return results consistent with any single point in time.

class STMLockFactory {
  ...
  override def zeroDurationlockObject(
      space: CompatibilitySpace, ref: Lockable,
      qual: AnyRef, timeout: Int): Boolean = {

    // try to get by without an atomic block
    grantsByRef.single.get(ref) match {
      case None => return true
      case Some(grants0) => {
        val id = LockIdentity(ref, space, qual)
        if (compatibleWithGrants(grants0, id))
          return true
        if (timeout == 0)
          return false

        // enqueuing a waiter needs an atomic block, so try again
        atomic { implicit txn =>
          grantsByRef.get(ref) match {
            case None => true
            case Some(grants) if ((grants ne grants0) &&
                compatibleWithGrants(grants, id)) => true
            case _ => enqueueWaiter(id, null, true)
          }
        } match {
          case z: Boolean => z
          case w: Waiter => tryAwait(w, timeout) == null
        }
      }
    }
  }
}

Figure 4.16: Code to acquire and then immediately release a Derby lock.

Our STM implementation has a mutable transactional map that associates rows to the set of granted locks. This grant set is immutable, so we can test whether a database lock may be immediately granted by fetching the grant set and then performing a local computation on the returned immutable reference. This means that in the common case zeroDurationlockObject can complete with only a single access to the TMap, so it can use the non-transactional fast path. If zeroDurationlockObject needs to wait then our implementation starts a transaction so that it can atomically check that it hasn't been canceled and enqueue itself onto the list of waiters. Figure 4.16 shows our implementation of this functionality.

Use case – STM integration

Our STMLockFactory relies heavily on the STM. We use transactions to atomically enqueue a waiter after detecting that a lock acquisition cannot be granted immediately, to atomically update both the global and the per-group multi-sets that record granted locks, to atomically process the waiter queue when releasing a lock, and to atomically check that a waiter hasn't been cancelled when granting it a lock. The atomic block in Figure 4.16 is an example of atomically checking if waiting is necessary and enqueuing a record describing the waiter.

Use case – Snapshots

Deadlock detection involves searching for cycles in which each of the members of the cycle is obstructed by its predecessor. This can be solved by a simple depth-first search that takes as input the set of active locks and the set of waiters. In Derby’s fine-grained locking implementation, however, those sets are encoded in mutable data structures that are concurrently updated. It is not desirable to block the entire system when performing deadlock detection, so the existing implementation uses a two-phase locking strategy and a non-trivial proof that the DFS will converge. Despite these complications any database locks that are examined by the deadlock detector cannot be acquired or released until the cycle detector has completed. STMLockFactory takes advantage of the transactional hash trie’s efficient support for snapshots, and the ability to compose snapshot operations inside an atomic block. Prior to deadlock detection our implementation uses an atomic block to take a consistent snapshot of both the set of granted locks and the set of waiters. This makes the cycle detection code much simpler, because it is isolated from concurrent updates. It minimizes the impact on the rest of the system as well, because the cycle detector does not block other LockManager operations.
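A sketch of that snapshot step, reusing the hypothetical state from the sketch above (detectCycle is a placeholder for the pure depth-first search, which is not shown):

import scala.concurrent.stm._
import scala.collection.immutable

def findDeadlockVictim(
    state: STMLockState,
    detectCycle: (immutable.Map[AnyRef, immutable.Set[LockIdentity]],
                  immutable.Map[AnyRef, immutable.Queue[Waiter]]) => Option[Waiter])
    : Option[Waiter] = {
  // One atomic block yields a mutually consistent snapshot of both maps;
  // afterwards the DFS runs over plain immutable data and blocks nobody.
  val (grants, waiters) = atomic { implicit txn =>
    (state.grantsByRef.snapshot, state.waitersByRef.snapshot)
  }
  detectCycle(grants, waiters)
}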

4.6.5 Experimental evaluation of STM inside Derby

Derby's fine-grained LockManager implementation represents our performance and scalability target. This code has been tuned and exercised in a wide variety of production environments, and the additional engineering costs of concurrency have already been paid. Our STM-based implementation is substantially simpler; if we can match the fine-grained solution's execution behavior then we will consider ourselves successful. Our in-situ experimental evaluation uses the test driver that the Derby developers used to isolate and evaluate the scalability limitations of Derby's original coarse-grained lock manager. This consists of a client and embedded database executing in a single JVM, with multiple threads executing SQL statements.

Figure 4.17: Derby performance for a 10,000 × 1,000 row join.

SQL joins

The most problematic case for the coarse-grained lock manager was heavy read access to the database. Figure 4.17 shows Derby's throughput with concurrent SQL joins. Each thread executes transactions that consist of a single inner join between a table with 10,000 rows and a table with 1,000 rows. The query result contains 1,000 rows.

The most striking result from Figure 4.17 is that the original lock manager implementation severely limits the end-to-end scalability of the database. Even though the underlying machine has 8 cores and 16 hardware thread contexts, parallelism provides at most a 47% throughput boost over a single-threaded test configuration. This scalability limit is especially disappointing because this workload consists entirely of reads, so there is no conceptual reason why concurrent queries should interfere with each other. Derby's fine-grained LockManager implementation provides much better scalability. Its peak parallel throughput is 5.6 times its sequential throughput, occurring at 8 threads. It achieves this scalability without compromising single-threaded performance; its throughput when queried sequentially is essentially identical to that of the coarse-grained implementation. Figure 4.17 does not, however, reflect the additional complexity and engineering costs of this solution. The STM implementation of LockManager also solves the scalability problem of the original implementation. Our implementation using the transactional hash trie provides a peak throughput that is 102% of the fine-grained version, and averages 95% of the performance of the fine-grained lock manager across all thread counts. These results are produced by code that is less than 1/5 as long, that performs deadlock detection without obstructing other threads, and that doesn't require sophisticated reasoning to tolerate inconsistent views of the system.

SQL updates

During the SQL join, most of the lock operations are zero-duration locks used to implement the read-committed SQL isolation level. In the common case these operations don't need to update any shared state or take advantage of the hash trie's composability. To test a scenario that stresses hash trie mutations and composition, we configured the Derby test client to measure the throughput of transactions that perform an SQL update. Each update acquires a lock from the LockFactory and releases it at the end of the database transaction. Figure 4.18 shows the throughput of the test client as a varying number of threads perform concurrent database updates. The SQL update test stresses the database's I/O and durable logging subsystems. The test machine was configured with a pair of 1 terabyte 7,200 RPM drives configured with hardware RAID-0, controlled by an LSI Logic SAS1068E PCI-Express SCSI controller, and the test database was created on an ext3 filesystem.

Figure 4.18: Derby performance for single-row updates in a 100,000 row table.

Figure 4.18 shows that the coarse-grained lock manager is not the scalability limit for SQL updates. Fewer rows are locked and more work is performed per row. While reads can be served without I/O if they hit in the database's buffer cache, writes must always be passed through to the disks. For the update-heavy scenario this set of experiments shows that our STM lock manager provides essentially the same throughput (0.5% more) as the fine-grained locking implementation.

4.7 Conclusion

The transactional hash trie introduced in this chapter combines the copy-on-write snapshots from Chapter 2 with the performant transactional integration of Chapter 3. It adds these powerful features while retaining good performance when they are not needed, as demonstrated by both microbenchmarks and an in-situ experimental evaluation. We achieve these

results with a novel strategy: adding transactional support to a linearizable lock-free algorithm.

Chapter 5

Conclusion

Data structures are defined by the primitive operations they support. Good design involves carefully selecting these primitives to balance several competing goals:

• The set of primitives must be complete for the problem domain;
• A smaller set of primitives is easier to understand; and
• The set of primitives must leave enough flexibility for an efficient implementation.

5.1 Loss of Composability

So long as execution contexts are isolated from each other, new high-level operations can be built by calling multiple primitives in sequence. This allows a small set of native operations to handle many novel application needs. For example, it is easy to implement an incrementValue method for a map by reading the value v associated with a key and then writing v + 1. The available primitives of get and put are sufficient to efficiently implement the new functionality. Unfortunately, the fundamental ability to compose primitives is lost when we move to shared memory multi-threading. Without external concurrency control, the native operations of a sequential mutable data structure can no longer be used to build new operations. Existing concurrent sets and maps partially address this problem by adding the ability to compare-and-set the value associated with a key. This primitive allows the caller to build


an optimistic retry loop that performs an arbitrary transformation to the value associated with a key. Existing concurrent collections also add weakly consistent iteration. (Confusingly, the weakly consistent iteration primitive reuses the name of the strongly consistent iteration primitive that is available only in the non-thread-safe collection.) Despite limited composability, these data structures have seen wide use in multi-threaded programs. STM provides automatic composability for the primitive operations of a data structure, restoring our ability to use a small set of primitives to cover a large problem domain. Unfortunately, simply adding STM to an existing algorithm results in a substantial performance penalty for all operations. This makes our job as designer much more difficult, because there is such a large tradeoff between efficiency and functionality. Consistent iteration is especially problematic for an STM-based set or map. Long read-only transactions limit throughput except for multi-version STMs, but multi-version STMs have even higher constant overheads for all operations.
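As a concrete illustration of the retry-loop style described at the start of this section, here is an incrementValue built on java.util.concurrent's map; the loop details are ours, not drawn from any particular library example.

import java.util.concurrent.ConcurrentHashMap

// Built from get / putIfAbsent / replace: correct, but limited to
// transforming a single key's value, and the retry logic falls on the caller.
def incrementValue(m: ConcurrentHashMap[String, Integer], k: String): Unit = {
  var done = false
  while (!done) {
    val v = m.get(k)
    done =
      if (v == null) m.putIfAbsent(k, 1) == null
      else m.replace(k, v, v + 1)
  }
}

With the transactional map the same operation is a one-line atomic block over get and put, and it can grow to touch other keys or other maps without changing shape.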

5.2 Our Solution

Our thesis is that we can provide concurrent sets and maps whose primitives are both efficient and composable, that we can have our cake and eat it too. We accomplish this with three contributions:

• A linearizable clone primitive can be added to concurrent tree-based data structures by using copy-on-write, and this primitive can be used to efficiently support consistent iteration;

• STM-specific algorithms minimize the performance penalty normally imposed by generic optimistic concurrency control; and

• STM can add composability to a lock-free algorithm, without requiring the overhead of transactions when no composition is desired.

Our proof by demonstration is ScalaSTM's TMap. This linearizable unordered map provides consistent iteration, composability for all of its operations, and excellent performance and scalability for a broad range of use cases.

Bibliography

[1] G. Adel'son-Vel'skiĭ and E. M. Landis. An algorithm for the organization of information. In Proceedings of the USSR Academy of Sciences, volume 145, pages 263–266, 1962. In Russian, English translation by Myron J. Ricci in Soviet Doklady, 3:1259–1263, 1962.

[2] M. Ansari, C. Kotselidis, M. Lujan, C. Kirkham, and I. Watson. On the performance of contention managers for complex transactional memory benchmarks. In ISPDC ’09: Proceedings of the 8th International Symposium on Parallel and Distributed Computing, pages 83–90, 2009.

[3] P. Bagwell. Ideal hash trees. Es Grands Champs, 1195, 2001.

[4] L. Ballard. Conflict avoidance: Data structures in transactional memory. Brown University Undergraduate Thesis, 2006.

[5] M. Bauer, J. Clark, E. Schkufza, and A. Aiken. Programming the memory hierarchy revisited: Supporting irregular parallelism in Sequoia. In PPoPP '11: Proceedings of the 16th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, February 2011.

[6] R. Bayer. Symmetric binary B–Trees: Data structure and maintenance algorithms. Acta Inf., 1:290–306, 1972.

[7] R. Bayer and M. Schkolnick. Concurrency of operations on B–Trees. In Readings in Database Systems (2nd ed.), pages 216–226, San Francisco, CA, USA, 1994. Morgan Kaufmann Publishers Inc.


[8] G. E. Blelloch. Programming parallel algorithms. Commun. ACM, 39(3):85–97, 1996.

[9] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: an efficient multithreaded runtime system. In PPOPP '95: Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 207–216, New York, NY, USA, 1995. ACM.

[10] J. Bobba, K. E. Moore, H. Volos, L. Yen, M. D. Hill, M. M. Swift, and D. A. Wood. Performance pathologies in hardware transactional memory. In ISCA '07: Proceedings of the 34th Annual International Symposium on Computer Architecture, pages 81–91, New York, NY, USA, 2007. ACM.

[11] L. Bougé, J. Gabarró, X. Messeguer, and N. Schabanel. Height-relaxed AVL rebalancing: A unified, fine-grained approach to concurrent dictionaries. Research Report RR1998-18, LIP, ENS Lyon, March 1998.

[12] N. G. Bronson. CCSTM. http://github.com/nbronson/ccstm.

[13] N. G. Bronson. ScalaSTM. http://nbronson.github.com/scala-stm.

[14] N. G. Bronson, J. Casper, H. Chafi, and K. Olukotun. A practical concurrent binary search tree. In PPoPP ’10: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 257–268, New York, NY, USA, January 2010. ACM.

[15] N. G. Bronson, J. Casper, H. Chafi, and K. Olukotun. Transactional predication: High-performance concurrent sets and maps for STM. In A. W. Richa and R. Guerraoui, editors, PODC, pages 6–15. ACM, 2010.

[16] N. G. Bronson, H. Chafi, and K. Olukotun. CCSTM: A library-based STM for Scala. In The First Annual Scala Workshop at Scala Days 2010, April 2010.

[17] N. G. Bronson, C. Kozyrakis, and K. Olukotun. Feedback-directed barrier optimization in a strongly isolated STM. In POPL '09: Proceedings of the 36th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 213–225, New York, NY, USA, 2009. ACM.

[18] B. D. Carlstrom, A. McDonald, M. Carbin, C. Kozyrakis, and K. Olukotun. Transactional collection classes. In PPoPP '07: Proceedings of the 12th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 56–67, New York, NY, USA, 2007. ACM.

[19] C. Cascaval, C. Blundell, M. Michael, H. W. Cain, P. Wu, S. Chiras, and S. Chatterjee. Software transactional memory: Why is it only a research toy? Queue, 6(5):46–58, 2008.

[20] H. Chafi, Z. DeVito, A. Moors, T. Rompf, A. K. Sujeeth, P. Hanrahan, M. Odersky, and K. Olukotun. Language virtualization for heterogeneous parallel computing. In W. R. Cook, S. Clarke, and M. C. Rinard, editors, OOPSLA, pages 835–847. ACM, 2010.

[21] H. Chafi, A. K. Sujeeth, K. J. Brown, H. Lee, A. R. Atreya, and K. Olukotun. A domain-specific approach to domain parallelism. In PPoPP '11: Proceedings of the 16th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, February 2011.

[22] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, pages 137–150, 2004.

[23] L. Devroye. A note on the average depth of tries. Computing, 28:367–371, 1982. 10.1007/BF02279819.

[24] D. Dice, O. Shalev, and N. Shavit. Transactional locking II. In DISC '06: Proceedings of the 20th International Symposium on Distributed Computing, pages 194–208. Springer, March 2006.

[25] A. Dragojević, R. Guerraoui, and M. Kapalka. Stretching transactional memory. In PLDI '09: Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation, pages 155–165, New York, NY, USA, 2009. ACM.

[26] A. Dragojević, R. Guerraoui, and M. Kapalka. Dividing transactional memories by zero. In TRANSACT '08: 3rd ACM SIGPLAN Workshop on Transactional Computing, 2008.

[27] R. Ennals. Software transactional memory should not be obstruction-free. Technical report, 2006.

[28] K. Fatahalian, D. R. Horn, T. J. Knight, L. Leem, M. Houston, J. Y. Park, M. Erez, M. Ren, A. Aiken, W. J. Dally, and P. Hanrahan. Memory - sequoia: programming the memory hierarchy. In SC, page 83. ACM Press, 2006.

[29] P. Felber, V. Gramoli, and R. Guerraoui. Elastic transactions. In DISC '09: Proceedings of the 23rd International Symposium on Distributed Computing, pages 93–107. Springer, 2009.

[30] K. Fraser. Practical Lock Freedom. PhD thesis, University of Cambridge, 2003.

[31] E. Fredkin. Trie memory. Commun. ACM, 3:490–499, September 1960.

[32] R. Guerraoui, M. Herlihy, and B. Pochon. Toward a theory of transactional contention managers. In PODC '05: Proceedings of the 24th Annual ACM Symposium on Principles of Distributed Computing, pages 258–264, New York, NY, USA, 2005. ACM.

[33] R. Guerraoui and M. Kapalka. On the correctness of transactional memory. In PPoPP ’08: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 175–184, New York, NY, USA, 2008. ACM.

[34] R. Guerraoui, M. Kapalka, and J. Vitek. STMBench7: a benchmark for software transactional memory. SIGOPS Oper. Syst. Rev., 41(3):315–324, 2007.

[35] L. J. Guibas and R. Sedgewick. A dichromatic framework for balanced trees. In 19th Annual Symposium on Foundations of Computer Science, pages 8–21, Oct. 1978.

[36] P. Haller and M. Odersky. Scala actors: Unifying thread-based and event-based programming. Theor. Comput. Sci., 410(2-3):202–220, 2009.

[37] L. Hammond, B. D. Carlstrom, V. Wong, B. Hertzberg, M. Chen, C. Kozyrakis, and K. Olukotun. Programming with transactional coherence and consistency (TCC). In ASPLOS-XI: Proceedings of the 11th international conference on Architectural support for programming languages and operating systems, pages 1–13, New York, NY, USA, October 2004. ACM Press.

[38] L. Hammond, M. Willey, and K. Olukotun. Data speculation support for a chip multiprocessor. In Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems, San Jose, CA, October 1998.

[39] T. Harris, S. Marlow, S. Peyton-Jones, and M. Herlihy. Composable memory transactions. In PPoPP '05: Proceedings of the tenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 48–60, New York, NY, USA, 2005. ACM.

[40] T. Harris and S. Stipic. Abstract nested transactions. In TRANSACT '07: 2nd Workshop on Transactional Computing, August 2007.

[41] T. L. Harris, K. Fraser, and I. A. Pratt. A practical multi-word compare-and-swap operation. In D. Malkhi, editor, DISC, volume 2508 of Lecture Notes in Computer Science, pages 265–279. Springer, 2002.

[42] M. Herlihy and E. Koskinen. Transactional Boosting: A methodology for highly-concurrent transactional objects. In PPoPP '08: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 207–216, New York, NY, USA, 2008. ACM.

[43] M. Herlihy, Y. Lev, V. Luchangco, and N. Shavit. A provably correct scalable concurrent skip list. In OPODIS '06: Proceedings of the 10th International Conference On Principles Of Distributed Systems, December 2006.

[44] M. Herlihy, V. Luchangco, M. Moir, and W. N. Scherer, III. Software transactional memory for dynamic-sized data structures. In PODC '03: Proceedings of the 22nd Annual Symposium on Principles of Distributed Computing, pages 92–101, New York, NY, USA, 2003. ACM.

[45] M. Herlihy and J. E. B. Moss. Transactional memory: Architectural support for lock-free data structures. In Proceedings of the 20th International Symposium on Computer Architecture, pages 289–300, 1993.

[46] M. Herlihy and N. Shavit. The Art of Multiprocessor Programming. Morgan Kaufmann, San Francisco, CA, USA, 2008.

[47] M. P. Herlihy and J. M. Wing. Linearizability: A correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems, 12(3):463–492, 1990.

[48] R. Hickey. The Clojure programming language. In Proceedings of the 2008 symposium on Dynamic languages. ACM, New York, NY, USA, 2008.

[49] B. Hindman and D. Grossman. Atomicity via source-to-source translation. In MSPC '06: Proceedings of the 2006 Workshop on Memory System Performance and Correctness, pages 82–91, New York, NY, USA, 2006. ACM.

[50] Java Specification Request (JSR) 133: Java Memory Model and Thread Specification, September 2004.

[51] S. L. P. Jones. Harnessing the multicores: Nested data parallelism in Haskell. In G. Ramalingam, editor, APLAS, volume 5356 of Lecture Notes in Computer Science, page 138. Springer, 2008.

[52] P. Kongetira, K. Aingaran, and K. Olukotun. Niagara: A 32-way multithreaded Sparc processor. IEEE Micro, 25(2):21–29, March–April 2005.

[53] E. Koskinen and M. Herlihy. Concurrent non-commutative boosted transactions. In TRANSACT '09: 4th Workshop on Transactional Computing, February 2009.

[54] M. Kulkarni, K. Pingali, B. Walter, G. Ramanarayanan, K. Bala, and L. P. Chew. Optimistic parallelism requires abstractions. In PLDI '07: Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 211–222, New York, NY, USA, 2007. ACM.

[55] S. Kumar, M. Chu, C. J. Hughes, P. Kundu, and A. Nguyen. Hybrid transactional memory. In PPoPP ’06: Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming, New York, NY, USA, March 2006. ACM Press.

[56] B. C. Kuszmaul. The StarTech massively parallel chess program. Journal of the International Computer Chess Association, 18:117–140, 1995.

[57] K. S. Larsen. AVL trees with relaxed balance. In IPPT ’94: Proceedings of the 8th International Symposium on Parallel Processing, pages 888–893, Washington, DC, USA, 1994. IEEE Computer Society.

[58] D. Lea. A java fork/join framework. In Java Grande, pages 36–43, 2000.

[59] D. Lea. Concurrent hash map in JSR 166 concurrency utilities. http://gee.cs.oswego.edu/dl/concurrency-interest/index.html, December 2007.

[60] B. Lewis and D. J. Berg. Multithreaded Programming with Pthreads. Prentice Hall, 1998.

[61] E. Meijer, B. Beckman, and G. Bierman. LINQ: Reconciling object, relations and XML in the .NET framework. In SIGMOD '06: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, pages 706–706, New York, NY, USA, 2006. ACM.

[62] J. M. Mellor-Crummey and M. L. Scott. Scalable reader-writer synchronization for shared-memory multiprocessors. In PPOPP ’91: Proceedings of the Third ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 106–113, New York, NY, USA, 1991. ACM.

[63] V. Menon, S. Balensiefer, T. Shpeisman, A.-R. Adl-Tabatabai, R. L. Hudson, B. Saha, and A. Welc. Practical Weak-Atomicity Semantics for Java STM. In SPAA '08: Proceedings of the 20th ACM Symposium on Parallel Algorithms and Architectures, 2008.

[64] M. M. Michael. Hazard Pointers: Safe memory reclamation for lock-free objects. IEEE Transactions on Parallel Distributed Systems, 15(6):491–504, 2004.

[65] K. E. Moore, J. Bobba, M. J. Moravan, M. D. Hill, and D. A. Wood. LogTM: Log-Based Transactional Memory. In 12th International Conference on High-Performance Computer Architecture, February 2006.

[66] K. F. Moore and D. Grossman. High-level small-step operational semantics for transactions. In POPL '08: Proceedings of the 35th Annual ACM SIGPLAN-SIGACT symposium on Principles of Programming Languages, pages 51–62, New York, NY, USA, 2008. ACM.

[67] J. E. B. Moss. Open nested transactions: Semantics and support. In Poster at the 4th Workshop on Memory Performance Issues (WMPI-2006). February 2006.

[68] Y. Ni, V. S. Menon, A.-R. Adl-Tabatabai, A. L. Hosking, R. L. Hudson, J. E. B. Moss, B. Saha, and T. Shpeisman. Open nesting in software transactional memory. In PPoPP ’07: Proceedings of the 12th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 68–78, New York, NY, USA, 2007. ACM Press.

[69] O. Nurmi, E. Soisalon-Soininen, and D. Wood. Concurrency control in database structures with relaxed balance. In PODS '87: Proceedings of the Sixth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 170–176, New York, NY, USA, 1987. ACM.

[70] B. Pfaff. Performance analysis of BSTs in system software. SIGMETRICS Performance Evaluation Review, 32(1):410–411, 2004.

[71] A. Prokopec, T. Rompf, P. Bagwell, and M. Odersky. A Generic Parallel Collection Framework. Technical report, 2010.

[72] W. Pugh. Concurrent maintenance of skip lists. Technical Report CS-TR-2222.1, 1989.

[73] W. Pugh. Skip lists: A probabilistic alternative to balanced trees. Communications of the ACM, 33(6):668–676, 1990.

[74] R. Rajwar and J. R. Goodman. Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution. In MICRO 34: Proceedings of the 34th annual ACM/IEEE International Symposium on Microarchitecture, pages 294–305. IEEE Computer Society, 2001.

[75] H. E. Ramadan, C. J. Rossbach, and E. Witchel. Dependence-aware transactional memory for increased concurrency. In MICRO '08: Proceedings of the 2008 41st IEEE/ACM International Symposium on Microarchitecture, pages 246–257, Washington, DC, USA, 2008. IEEE Computer Society.

[76] M. Roesch. Re: Pipelining and flowpinning. http://seclists.org/snort/2010/q2/708.

[77] T. Rompf and M. Odersky. Lightweight Modular Staging: A Pragmatic Approach to Runtime Code Generation and Compiled DSLs. Technical report, 2010.

[78] B. Saha, A. Adl-Tabatabai, and Q. Jacobson. Architectural support for software transactional memory. In MICRO '06: Proceedings of the International Symposium on Microarchitecture, 2006.

[79] B. Saha, A.-R. Adl-Tabatabai, R. L. Hudson, C. Cao Minh, and B. Hertzberg. McRT-STM: A high performance software transactional memory system for a multi-core runtime. In PPoPP '06: Proceedings of the 11th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 187–197, New York, NY, USA, March 2006. ACM Press.

[80] F. T. Schneider, V. Menon, T. Shpeisman, and A.-R. Adl-Tabatabai. Dynamic optimization for efficient strong atomicity. In OOPSLA '08: Proceedings of the 23rd ACM SIGPLAN conference on Object-Oriented Programming Systems, Languages, and Applications, New York, NY, USA, October 2008. ACM.

[81] N. Shavit and D. Touitou. Software transactional memory. In PODC ’95: Proceedings of the 14th Annual ACM Symposium on Principles of Distributed Computing, pages 204–213, New York, NY, USA, 1995. ACM.

[82] J. Shirako, D. M. Peixotto, V. Sarkar, and W. N. Scherer III. Phasers: a unified deadlock-free construct for collective and point-to-point synchronization. In P. Zhou, editor, ICS, pages 277–288. ACM, 2008.

[83] T. Shpeisman, V. Menon, A.-R. Adl-Tabatabai, S. Balensiefer, D. Grossman, R. L. Hudson, K. Moore, and B. Saha. Enforcing isolation and ordering in STM. In PLDI '07: Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation, June 2007.

[84] J. Sugerman, K. Fatahalian, S. Boulos, K. Akeley, and P. Hanrahan. GRAMPS: A programming model for graphics pipelines. ACM Trans. Graph., 28(1), 2009.

[85] H. Sundell and P. Tsigas. Fast and lock-free concurrent priority queues for multi-thread systems. Journal of Parallel Distributed Computing, 65(5):609–627, 2005.

[86] H. Sutter. The free lunch is over: A fundamental turn toward concurrency in software. Dr. Dobb’s Journal, 30(3), 2005.

[87] T. Sweeney. The next mainstream programming language: A game developer's perspective. Keynote at POPL '06, 2006.

[88] M. Tremblay and S. Chaudhry. A Third-Generation 65nm 16-Core 32-Thread Plus 32-Scout-Thread CMT SPARC Processor. In IEEE International Solid-State Circuits Conference (ISSCC 2008), San Francisco, CA, February 2008.

Appendices

Appendix A

Code for Transactional Predication

A.1 Reference Counting Code

Reference counting is used to reclaim unused predicates in a transactionally predicated map, as described in Section 3.4.2.

class Pred[V](@volatile var refCount: Int
    ) extends TVar[Option[V]](None) {
  def refCountCAS(v0: Int, v1: Int): Boolean = { ... }
}

class THashMap_RC[K,V] {
  def get(k: K): Option[V] = {
    val p = enter(k)
    txn.afterCompletion({ _ => exit(k, p, 1) })
    p.stmRead()
  }

  def put(k: K, v: V): Option[V] = {
    val p = enter(k)                    // enqueue exit after delta known
    val prev = try {
      p.stmReadAndWrite(Some(v))
    } catch {
      // STM rollback is encoded as an exception
      case x => exit(k, p, 1); throw x
    }
    if (prev.isEmpty)                   // no exit on successful insert
      txn.afterRollback({ _ => exit(k, p, 1) })
    else
      txn.afterCompletion({ _ => exit(k, p, 1) })
    prev
  }

  def removeKey(k: K): Option[V] = {
    val p = enter(k)
    val prev = try {
      p.stmReadAndWrite(None)
    } catch {
      case x => exit(k, p, 1); throw x
    }
    if (!prev.isEmpty)
      txn.afterCompletion({ t =>
        exit(k, p, if (t.isCommitted) 2 else 1) })
    else
      txn.afterCompletion({ _ => exit(k, p, 1) })
    prev
  }

  private val preds = new ConcurrentHashMap[K,Pred[V]]

  private def enter(k: K): Pred[V] = enter(k, preds.get(k))

  private def enter(k: K, p: Pred[V]): Pred[V] = {
    if (p != null) {
      val rc = p.refCount               // p is stale if rc == 0
      if (rc > 0) {
        if (p.refCountCAS(rc, rc + 1)) return p
        return enter(k, p)              // try again on p
      }
      preds.remove(k, p)
    }
    val fresh = new Pred[V](1)
    val repl = preds.putIfAbsent(k, fresh)
    if (repl == null) return fresh
    return enter(k, repl)               // try again on repl
  }

  private def exit(k: K, p: Pred[V], delta: Int) {
    var rc = 0
    do { rc = p.refCount
    } while (!p.refCountCAS(rc, rc - delta))
    if (rc - delta == 0) preds.remove(k, p)
  }
}
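A brief usage sketch (ours, not part of the thesis listing; it assumes the enclosing STM provides an atomic block, which the code above elides) shows why the reference-counted predicates are worth the bookkeeping: the predicated primitives compose into larger transactions.

// Hypothetical composition: `atomic` is assumed to be supplied by the
// surrounding STM, and moveValue is our own example, not thesis code.
def moveValue[K, V](m: THashMap_RC[K, V], from: K, to: K): Unit =
  atomic {
    m.get(from) match {
      case Some(v) =>
        m.removeKey(from)   // both updates commit together,
        m.put(to, v)        // or neither takes effect
        ()
      case None => ()
    }
  }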

A.2 Soft Reference Code

Code for managing predicates using soft references, as described in Section 3.4.3. TokRef.cleanup() is called from a daemon thread.

class Token {}

class TokRef[K,V](preds: ConcurrentMap[K,Pred[V]],
    key: K, t: Token) extends CleanableRef[Token](t) {
  var pred: Pred[V] = null
  def cleanup(): Unit = preds.remove(key, pred)
}

class Pred[V](val softRef: TokRef[_,V]) extends
    TVar[(Token,V)]((null, nullV)) {
  softRef.pred = this
}

class THashMap_Soft[K,V] {
  def get(key: K) = predForKey(key).stmRead()
  def put(key: K, v: V): Option[V] =
    predForKey(key).stmReadAndWrite(Some(v))
  def remove(key: K): Option[V] =
    predForKey(key).stmReadAndWrite(None)

  private val preds = new ConcurrentHashMap[K,Pred[V]]

  private def predForKey(k: K) = {
    val p = preds.get(k)
    if (p == null) createPred(k) else enterPred(k, p)
  }

  private def enterPred(k: K, p: Pred[V]) = {
    val t = p.softRef.get
    if (t != null) {
      Txn.current().addStrongRef(t)
      (p, t)
    } else {
      preds.remove(k, p)
      createPred(k)
    }
  }

  private def createPred(k: K): (Pred[V],Token) = {
    val t = new Token
    val fresh = new Pred(new TokRef(preds, k, t))
    val p = preds.putIfAbsent(k, fresh)
    if (p == null) (fresh, t) else enterPred(k, p)
  }
}

A.3 Unordered Iteration Code

Adding optimistic iteration to a transactionally predicated unordered map by counting insertions and removals, as described in Section 3.5.1. To minimize conflicts between writers, the counts are striped across multiple memory locations.

class TMap[K,V] {
  private def stripes = 16
  private val insertCts = Array.fromFunction(_ => { new TVar(0) })(stripes)
  private val removeCts = Array.fromFunction(_ => { new TVar(0) })(stripes)
  ...
  def size: Int = {
    insertCts.map(_.stmRead()).reduceLeft(_ + _) -
      removeCts.map(_.stmRead()).reduceLeft(_ + _)
  }

  private def stripeIncr(a: Array[TVar[Int]]): Unit = {
    val c = a(Thread.currentThread.hashCode() & (stripes - 1))
    c.stmWrite(c.stmRead() + 1)
  }

  def put(key: K, value: V): Option[V] = {
    val prev = predForKey(key).stmReadAndWrite(Some(value))
    if (prev == None) stripeIncr(insertCts)
    prev
  }

  def remove(key: K): Option[V] = {
    val prev = predForKey(key).stmReadAndWrite(None)
    if (prev != None) stripeIncr(removeCts)
    prev
  }

  def elements = new Iterator[(K,V)] {
    private val iter = {
      for (c <- insertCts) c.stmRead()
      predicates.entrySet().iterator
    }
    private var avail = step()

    def hasNext: Boolean = avail != null
    def next(): (K,V) = { val z = avail; avail = step(); z }

    private def step(): (K,V) = {
      if (!iter.hasNext()) {
        null
      } else {
        val e = iter.next()
        e.getValue.stmRead() match {
          case Some(v) => (e.getKey, v)
          case None => step()
        }
      }
    }
  }
}