Locks, Lock Based Data Structures

TCSS 422: OPERATING SYSTEMS
Three Easy Pieces: Locks, Lock Based Data Structures
Wes J. Lloyd
Institute of Technology, University of Washington - Tacoma
TCSS 422 A – Spring 2018, 4/24/2018
TCSS422: Operating Systems [Spring 2018], April 23, 2018
Slides by Wes J. Lloyd

OBJECTIVES
• Quiz 2 – Scheduling Review
• Assignment 1 – MASH Shell
• Review: Proportional Share Scheduler – Ch. 9
• Review: Concurrency: Introduction – Ch. 26
• Review: Linux Thread API – Ch. 27
• Locks – Ch. 28
• Lock Based Data Structures – Ch. 29

CHAPTER 28 – LOCKS

LOCKS
• Ensure critical section(s) are executed atomically (as a unit)
  - Only one thread is allowed to execute a critical section at any given time
  - Ensures the code snippets are “mutually exclusive”
• Protect a global counter: [listing not recovered]
• A “critical section”: [listing not recovered]

LOCKS - 2
• Lock variables are called “MUTEX”
  - Short for mutual exclusion (that’s what they guarantee)
• Lock variables store the state of the lock
• States
  - Locked (acquired or held)
  - Unlocked (available or free)
• Only 1 thread can hold a lock

LOCKS - 3
• pthread_mutex_lock(&lock)
  - Try to acquire lock
  - If lock is free, calling thread will acquire the lock
  - Thread with lock enters critical section
  - Thread “owns” the lock
• No other thread can acquire the lock before the owner releases it.

LOCKS - 4
• Program can have many mutex (lock) variables to “serialize” many critical sections
• Locks are also used to protect data structures
  - Prevent multiple threads from changing the same data simultaneously
  - Programmer can make sections of code “granular”
  - Fine grained means just one grain of sand at a time through an hourglass
  - Similar to relational database transactions
  - DB transactions prevent multiple users from modifying a table, row, field

FINE GRAINED?
• Is this code a good example of “fine grained parallelism”?

    pthread_mutex_lock(&lock);
    a = b++;
    b = a * c;
    *d = a + b + c;
    FILE *fp = fopen("file.txt", "r");
    fscanf(fp, "%s %s %s %d", str1, str2, str3, &e);
    ListNode *node = mylist->head;
    int i = 0;
    while (node) {
        node->title = str1;
        node->subheading = str2;
        node->desc = str3;
        node->end = e;
        node = node->next;
        i++;
    }
    e = e - i;
    pthread_mutex_unlock(&lock);

FINE GRAINED PARALLELISM

    pthread_mutex_lock(&lock_a);
    pthread_mutex_lock(&lock_b);
    a = b++;
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
    pthread_mutex_lock(&lock_b);
    b = a * c;
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_lock(&lock_d);
    *d = a + b + c;
    pthread_mutex_unlock(&lock_d);
    FILE *fp = fopen("file.txt", "r");
    pthread_mutex_lock(&lock_e);
    fscanf(fp, "%s %s %s %d", str1, str2, str3, &e);
    pthread_mutex_unlock(&lock_e);
    ListNode *node = mylist->head;
    int i = 0;
    …

EVALUATING LOCK IMPLEMENTATIONS
• Correctness
  - Does the lock work?
  - Are critical sections mutually exclusive? (atomic, as a unit?)
• Fairness
  - Do threads competing for a lock have a fair chance of acquiring it?
• Overhead

BUILDING LOCKS
• Locks require hardware support
  - To minimize overhead, ensure fairness and correctness
  - Special “atomic (as a unit)” instructions to support lock implementation
  - Atomic (as a unit) exchange instruction
    - XCHG
  - Compare and exchange instruction
    - CMPXCHG
    - CMPXCHG8B
    - CMPXCHG16B

HISTORICAL IMPLEMENTATION
• To implement mutual exclusion
  - Disable interrupts upon entering critical sections
• Any thread could disable system-wide interrupts
  - What if lock is never released?
• On a multiprocessor, each CPU has its own interrupts
  - Do we disable interrupts for all cores simultaneously?
• While interrupts are disabled, they could be lost
  - If not queued…

SPIN LOCK IMPLEMENTATION
• Operate without atomic (as a unit) assembly instructions
• “Do-it-yourself” Locks
• Is this lock implementation: Correct? Fair? Performant? [listing not recovered]

DIY: CORRECT?
• Correctness requires luck… (e.g. DIY lock is incorrect)
• Here both threads have “acquired” the lock simultaneously

DIY: PERFORMANT?

    void lock(lock_t *mutex)
    {
        while (mutex->flag == 1); // while lock is unavailable, wait…
        mutex->flag = 1;
    }

• What is wrong with while(<cond>); ?
• Spin-waiting wastes time actively waiting for another thread
• while (1); will “peg” a CPU core at 100%
  - Continuously loops, and evaluates mutex->flag value…
  - Generates heat…

TEST-AND-SET INSTRUCTION
• C implementation: not atomic
  - Adds a simple check to basic spin lock
  - On a single-core CPU system with preemptive scheduler:
  - Try this…
• lock() method checks that TestAndSet doesn’t return 1
• Comparison is in the caller
• Single core systems are becoming scarce
• Try on a one-core VM

DIY: TEST-AND-SET - 2
• Requires a preemptive scheduler on single CPU core system
• Lock is never released without a context switch
• 1-core VM: occasionally will deadlock, doesn’t miscount

SPIN LOCK EVALUATION
• Correctness:
  - Spin locks guarantee: critical sections won’t be executed simultaneously by (2) threads
• Fairness:
  - No fairness guarantee. Once a thread has a lock, nothing forces it to relinquish it…
• Performance:
  - Spin locks perform “busy waiting”
  - Spin locks are best for short periods of waiting
  - Performance is slow when multiple threads share a CPU
  - Especially for long periods

COMPARE AND SWAP
• Checks that the lock variable has the expected value FIRST, before changing its value
  - If so, make assignment
  - Return value at location
• Adds a comparison to TestAndSet
• Useful for wait-free synchronization
  - Supports implementation of shared data structures which can be updated atomically (as a unit) using the HW support CompareAndSwap instruction
  - Shared data structure updates become “wait-free”
  - Upcoming in Chapter 32

COMPARE AND SWAP
• Compare and Swap: [listing not recovered]
• Spin lock usage: [listing not recovered]
• 1-core VM: Count is correct, no deadlock
• X86 provides “cmpxchgl” compare-and-exchange instruction
  - cmpxchg8b
  - cmpxchg16b

TWO MORE “LOCK BUILDING” CPU INSTRUCTIONS
• Cooperative instructions used together to support synchronization on RISC systems
• No support on x86 processors
  - Supported by RISC: Alpha, PowerPC, ARM
• Load-linked (LL)
  - Loads value into register
  - Same as typical load
  - Used as a mechanism to track competition
• Store-conditional (SC)
  - Performs “mutually exclusive” store
  - Allows only one thread to store value

LL/SC LOCK
• LL instruction loads pointer value (ptr)
• SC only stores if the load link pointer has not changed
• Requires HW support
  - C code is pseudo code

LL/SC LOCK - 2
• Two instruction lock [listing not recovered]

HARDWARE SPIN LOCKS - SUMMARY
• Simple, correct
• Slow
• With long locks, waiting threads spin for entire timeslice
  - Repeat comparison continuously
  - Busy waiting
• How To Avoid Spinning?
• Need both HW & OS Support!
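The test-and-set and compare-and-swap spin locks discussed in the slides can be sketched with C11 atomics. This is a minimal illustration and not the code shown in the lecture: the names `spin_t`, `tas_lock`, `cas_lock`, `spin_unlock`, and `run_counter_demo` are hypothetical, and `atomic_exchange` and `atomic_compare_exchange_strong` from `<stdatomic.h>` stand in for the hardware instructions (on x86, compilers typically lower them to `xchg` and `lock cmpxchg`).

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical lock type for illustration: flag is 0 when free, 1 when held. */
typedef struct { atomic_int flag; } spin_t;

/* Test-and-set: atomically store 1 and fetch the old value.
   If the old value was already 1, another thread holds the lock, so spin. */
void tas_lock(spin_t *l) {
    while (atomic_exchange(&l->flag, 1) == 1)
        ; /* busy wait */
}

/* Compare-and-swap: store 1 only if the flag currently equals 0. */
void cas_lock(spin_t *l) {
    int expected = 0;
    while (!atomic_compare_exchange_strong(&l->flag, &expected, 1))
        expected = 0; /* a failed CAS overwrote expected with the observed value */
}

void spin_unlock(spin_t *l) {
    atomic_store(&l->flag, 0);
}

/* Arguments shared by both worker threads. */
typedef struct {
    spin_t *lock;
    void (*acquire)(spin_t *);
    long *counter;
    long iterations;
} worker_arg_t;

static void *worker(void *p) {
    worker_arg_t *a = p;
    for (long i = 0; i < a->iterations; i++) {
        a->acquire(a->lock);
        (*a->counter)++;      /* critical section: protected global-style counter */
        spin_unlock(a->lock);
    }
    return NULL;
}

/* Runs two threads that each increment a shared counter `iterations` times
   using the given acquire function. Returns the final count, or -1 if a
   thread could not be created. */
long run_counter_demo(void (*acquire)(spin_t *), long iterations) {
    assert(iterations >= 0);
    spin_t lock;
    atomic_init(&lock.flag, 0);
    long counter = 0;
    worker_arg_t arg = { &lock, acquire, &counter, iterations };
    pthread_t t1, t2;
    if (pthread_create(&t1, NULL, worker, &arg) != 0)
        return -1;
    if (pthread_create(&t2, NULL, worker, &arg) != 0) {
        pthread_join(t1, NULL);
        return -1;
    }
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return counter;
}
```

Calling `run_counter_demo(tas_lock, 100000)` or `run_counter_demo(cas_lock, 100000)` from `main` should yield 200000 when both threads start, since each increment happens while the lock is held. Both variants still busy-wait, so they share the performance caveats listed in the spin lock evaluation; replacing the atomic call in `tas_lock` with a plain load followed by a plain store would reproduce the incorrect DIY lock.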