Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask

Total Pages: 16

File Type: pdf, Size: 1020 KB

Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask

Tudor David, Rachid Guerraoui, Vasileios Trigonakis∗
School of Computer and Communication Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
{tudor.david, rachid.guerraoui, vasileios.trigonakis}@epfl.ch

∗ Authors appear in alphabetical order.

SOSP'13, Nov. 3–6, 2013, Farmington, Pennsylvania, USA. Copyright is held by the Owner/Author(s). ACM 978-1-4503-2388-8/13/11. http://dx.doi.org/10.1145/2517349.2522714

Abstract

This paper presents the most exhaustive study of synchronization to date. We span multiple layers, from hardware cache-coherence protocols up to high-level concurrent software. We do so on different types of architectures, from single-socket – uniform and non-uniform – to multi-socket – directory and broadcast-based – many-cores. We draw a set of observations that, roughly speaking, imply that scalability of synchronization is mainly a property of the hardware.

Figure 1: Analysis method.

1 Introduction

Scaling software systems to many-core architectures is one of the most important challenges in computing today. A major impediment to scalability is synchronization. From the Greek "syn", i.e., with, and "khronos", i.e., time, synchronization denotes the act of coordinating the timeline of a set of processes. Synchronization basically translates into cores slowing each other, sometimes affecting performance to the point of annihilating the overall purpose of increasing their number [3, 21].

A synchronization scheme is said to scale if its performance does not degrade as the number of cores increases. Ideally, acquiring a lock should, for example, take the same time regardless of the number of cores sharing that lock. In the last few decades, a large body of work has been devoted to the design, implementation, evaluation, and application of synchronization schemes [1, 4, 6, 7, 10, 11, 14, 15, 26–29, 38–40, 43–45]. Yet, the designer of a concurrent system still has little indication, a priori, of whether a given synchronization scheme will scale on a given modern many-core architecture and, a posteriori, about exactly why a given scheme did, or did not, scale.

One of the reasons for this state of affairs is that there are no results on modern architectures connecting the low-level details of the underlying hardware, e.g., the cache-coherence protocol, with synchronization at the software level. Recent work that evaluated the scalability of synchronization on modern hardware, e.g., [11, 26], was typically put in a specific application and architecture context, making the evaluation hard to generalize. In most cases, when scalability issues are faced, it is not clear if they are due to the underlying hardware, to the synchronization algorithm itself, to its usage of specific atomic operations, to the application context, or to the workload.

Of course, getting the complete picture of how synchronization schemes behave, in every single context, is very difficult. Nevertheless, in an attempt to shed some light on such a picture, we present the most exhaustive study of synchronization on many-cores to date. Our analysis seeks completeness in two directions (Figure 1).

1. We consider multiple synchronization layers, from basic many-core hardware up to complex concurrent software. First, we dissect the latencies of cache-coherence protocols. Then, we study the performance of various atomic operations, e.g., compare-and-swap, test-and-set, fetch-and-increment. Next, we proceed with locking and message-passing techniques. Finally, we examine a concurrent hash table, an in-memory key-value store, and a software transactional memory.

2. We vary a set of important architectural attributes to better understand their effect on synchronization. We explore both single-socket (chip multi-processor) and multi-socket (multi-processor) many-cores. In the former category, we consider uniform (e.g., Sun Niagara 2) and non-uniform (e.g., Tilera TILE-Gx36) designs. In the latter category, we consider platforms that implement coherence based on a directory (e.g., AMD Opteron) or broadcast (e.g., Intel Xeon).

Our set of experiments, of what we believe constitute the most commonly used synchronization schemes and hardware architectures today, induces the following set of observations.

Crossing sockets is a killer. The latency of performing any operation on a cache line, e.g., a store or a compare-and-swap, simply does not scale across sockets. Our results indicate an increase from 2 to 7.5 times compared to intra-socket latencies, even under no contention. These differences amplify with contention at all synchronization layers and suggest that cross-socket sharing should be avoided.

Sharing within a socket is necessary but not sufficient. If threads are not explicitly placed on the same socket, the operating system might try to load-balance them across sockets, inducing expensive communication. But, surprisingly, even with explicit placement within the same socket, an incomplete cache directory, combined with a non-inclusive last-level cache (LLC), might still induce cross-socket communication. On the Opteron, for instance, this phenomenon entails a 3-fold increase compared to the actual intra-socket latencies. We discuss one way to alleviate this problem by circumventing certain access patterns.

Intra-socket (non-)uniformity does matter. Within a socket, the fact that the distance from the cores to the LLC is the same, or differs among cores, even only slightly, impacts the scalability of synchronization. For instance, under high contention, the Niagara (uniform) enables approximately 1.7 times higher scalability than the Tilera (non-uniform) for all locking schemes. The developer of a concurrent system should thus be aware that highly contended data pose a higher threat in the presence of even the slightest non-uniformity, e.g., non-uniformity inside a socket.

Loads and stores can be as expensive as atomic operations. In the context of synchronization, where memory operations are often accompanied with memory fences, loads and stores are generally not significantly cheaper than atomic operations with higher consensus numbers [19]. Even without fences, on data that are not locally cached, a compare-and-swap is roughly only 1.35 (on the Opteron) and 1.15 (on the Xeon) times more expensive than a load.

Message passing shines when contention is very high. Structuring an application with message passing reduces sharing and proves beneficial when a large number of threads contend for a few data. However, under low contention and/or a small number of cores, locks perform better on higher-layer concurrent testbeds, e.g., a hash table and a software transactional memory, even when message passing is provided in hardware (e.g., Tilera). This suggests the exclusive use of message passing for optimizing certain highly contended parts of a system.

Every locking scheme has its fifteen minutes of fame. None of the nine locking schemes we consider consistently outperforms any other one, on all target architectures or workloads. Strictly speaking, to seek optimality, a lock algorithm should thus be selected based on the hardware platform and the expected workload.

Simple locks are powerful. Overall, an efficient implementation of a ticket lock is the best performing synchronization scheme in most low-contention workloads. Even under rather high contention, the ticket lock performs comparably to more complex locks, in particular within a socket. Consequently, given their small memory footprint, ticket locks should be preferred, unless it is sure that a specific lock will be very highly contended.

A high-level ramification of many of these observations is that the scalability of synchronization appears, first and above all, to be a property of the hardware, in the following sense. Basically, in order to be able to scale, synchronization should better be confined to a single socket, ideally a uniform one. On certain platforms (e.g., Opteron), this is simply impossible. Within a socket, sophisticated synchronization schemes are generally not worthwhile. Even if, strictly speaking, no size fits all, a proper implementation of a simple ticket lock seems enough.

In summary, our main contribution is the most exhaustive study of synchronization to date. Results of this study can be used to help predict the cost of a synchronization scheme, explain its behavior, design better schemes, as well as possibly improve future hardware design. SSYNC, the cross-platform synchronization suite we built to perform the study, is, we believe, a contribution of independent interest. SSYNC abstracts various lock algorithms behind a common interface: it not only includes most state-of-the-art algorithms, but also provides platform-specific optimizations with substantial performance improvements. SSYNC also contains a library that abstracts message passing on various platforms, and a set of microbenchmarks for measuring the latencies of the cache-coherence protocols, the locks, and the message passing. In addition, we
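The latency microbenchmarks mentioned above time individual memory and synchronization operations. As a rough illustration of the idea (and only that; this is not SSYNC code, and it omits the thread pinning that makes the real measurements meaningful), the following C sketch times an uncontended fetch-and-increment from a single thread:

    #include <stdatomic.h>
    #include <stdio.h>
    #include <time.h>

    #define ITERS 1000000L

    int main(void) {
        /* Keep the counter on its own cache line, as a latency benchmark would. */
        static _Alignas(64) atomic_ulong shared = 0;
        struct timespec a, b;

        clock_gettime(CLOCK_MONOTONIC, &a);
        for (long i = 0; i < ITERS; i++)
            atomic_fetch_add_explicit(&shared, 1, memory_order_relaxed);
        clock_gettime(CLOCK_MONOTONIC, &b);

        double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
        printf("uncontended fetch-and-increment: %.1f ns/op\n", ns / ITERS);
        return 0;
    }

The interesting numbers in the study come from variants of this kind of loop in which the cache line was last written by a core on the same socket versus a different socket; that is where the reported increase of 2 to 7.5 times across sockets shows up.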
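The atomic operations named above (compare-and-swap, test-and-set, fetch-and-increment) correspond directly to C11 primitives. The sketch below simply demonstrates the three operations so the terminology is concrete; it is illustrative only and not taken from the paper's benchmark suite.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    int main(void) {
        atomic_int counter = 0;
        atomic_flag flag = ATOMIC_FLAG_INIT;

        /* fetch-and-increment: atomically add 1 and return the previous value */
        int prev = atomic_fetch_add(&counter, 1);

        /* compare-and-swap: install 42 only if counter still holds the expected value */
        int expected = prev + 1;
        bool swapped = atomic_compare_exchange_strong(&counter, &expected, 42);

        /* test-and-set: raise the flag and report whether it was already raised */
        bool was_set = atomic_flag_test_and_set(&flag);
        atomic_flag_clear(&flag);

        printf("prev=%d swapped=%d was_set=%d final=%d\n",
               prev, swapped, was_set, atomic_load(&counter));
        return 0;
    }

This is also the context for the observation that loads and stores can be as expensive as atomic operations: once a plain access has to be paired with a memory fence for ordering, its cost is close to that of these read-modify-write primitives.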
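Because the ticket lock is the scheme singled out above, a minimal C11 sketch of it follows. This is the textbook two-counter formulation, given under the assumption that it matches the evaluated algorithm in spirit; the implementations the authors benchmark add platform-specific touches such as proportional back-off and cache-line padding.

    #include <stdatomic.h>

    typedef struct {
        atomic_uint next;   /* next ticket to hand out */
        atomic_uint owner;  /* ticket currently allowed in */
    } ticket_lock_t;

    #define TICKET_LOCK_INIT { 0, 0 }

    static inline void ticket_lock(ticket_lock_t *l) {
        /* draw a ticket, then spin until called */
        unsigned int me = atomic_fetch_add_explicit(&l->next, 1, memory_order_relaxed);
        while (atomic_load_explicit(&l->owner, memory_order_acquire) != me)
            ;   /* tuned versions back off in proportion to (me - owner) */
    }

    static inline void ticket_unlock(ticket_lock_t *l) {
        /* only the holder writes owner, so a separate load and store suffice */
        atomic_store_explicit(&l->owner,
                              atomic_load_explicit(&l->owner, memory_order_relaxed) + 1,
                              memory_order_release);
    }

    /* usage:
     *   static ticket_lock_t lock = TICKET_LOCK_INIT;
     *   ticket_lock(&lock);  ... critical section ...  ticket_unlock(&lock);
     */

The two-word footprint is what the text means by a small memory footprint, and the FIFO hand-off is what keeps the ticket lock competitive under moderate contention within a socket.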
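Acting on the placement advice means pinning threads to cores explicitly instead of letting the operating system migrate them across sockets. A minimal Linux sketch using pthread_setaffinity_np follows; the assumption that cores 0 to 3 share a socket is purely illustrative and must be checked against the machine's topology (for example with lscpu or hwloc).

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    /* Pin the calling thread to a single core. */
    static int pin_to_core(int core_id) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core_id, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    static void *worker(void *arg) {
        int core = *(int *)arg;
        if (pin_to_core(core) != 0)
            fprintf(stderr, "could not pin to core %d\n", core);
        /* ... run the contended workload here ... */
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        int cores[4] = { 0, 1, 2, 3 };   /* assumed to lie on one socket */
        for (int i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, worker, &cores[i]);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        return 0;
    }

Compile with gcc -O2 -pthread. As noted above for the Opteron, pinning alone is not always sufficient, but it does remove the cross-socket traffic that the load balancer would otherwise introduce.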
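One concrete way to apply the message-passing observation is to give a heavily contended piece of data to a single server thread and have the other threads send it requests instead of taking a lock. The sketch below expresses that pattern with per-client mailboxes built from C11 atomics; it is an illustration of the idea only, not the SSYNC message-passing library, and the mailbox layout is an assumption.

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    #define CLIENTS 3
    #define REQS    100000

    /* One cache-line-aligned mailbox per client: the client posts a request,
     * the server applies it to data it alone owns and posts the reply. */
    typedef struct {
        _Alignas(64) atomic_int pending;   /* 0 = empty, 1 = request, 2 = reply ready */
        long reply;
    } mailbox_t;

    static mailbox_t box[CLIENTS];
    static long counter;                   /* owned exclusively by the server */
    static atomic_int finished;

    static void *server(void *arg) {
        (void)arg;
        while (atomic_load_explicit(&finished, memory_order_acquire) < CLIENTS)
            for (int i = 0; i < CLIENTS; i++)
                if (atomic_load_explicit(&box[i].pending, memory_order_acquire) == 1) {
                    box[i].reply = ++counter;                /* the "critical section" */
                    atomic_store_explicit(&box[i].pending, 2, memory_order_release);
                }
        return NULL;
    }

    static void *client(void *arg) {
        int id = (int)(long)arg;
        long last = 0;
        for (int r = 0; r < REQS; r++) {
            atomic_store_explicit(&box[id].pending, 1, memory_order_release);
            while (atomic_load_explicit(&box[id].pending, memory_order_acquire) != 2)
                ;                                            /* spin on the reply */
            last = box[id].reply;
            atomic_store_explicit(&box[id].pending, 0, memory_order_relaxed);
        }
        printf("client %d: last reply %ld\n", id, last);
        atomic_fetch_add_explicit(&finished, 1, memory_order_release);
        return NULL;
    }

    int main(void) {
        pthread_t s, c[CLIENTS];
        pthread_create(&s, NULL, server, NULL);
        for (long i = 0; i < CLIENTS; i++)
            pthread_create(&c[i], NULL, client, (void *)i);
        for (int i = 0; i < CLIENTS; i++)
            pthread_join(c[i], NULL);
        pthread_join(s, NULL);
        printf("counter = %ld (expected %d)\n", counter, CLIENTS * REQS);
        return 0;
    }

Under very high contention each update becomes a single round trip into the server's cache instead of a fight over a lock's cache line; under low contention the round trip is pure overhead, which matches the advice above to reserve message passing for the hottest parts of a system.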
Recommended publications
  • Synchronization Techniques for DOCSIS® Technology Specification
  • Lecture 8: More Pipelining
  • Performance Analysis of Audio and Video Synchronization Using Spreaded Code Delay Measurement Technique
  • TimeCoder Pro Is inData's Popular Video and Transcript Synchronization Software
  • The Use of Music in the Cinematic Experience
  • Perception of Cuts in Different Editing Styles, by Celia Andreu-Sánchez and Miguel-Ángel Martín-Pascual
  • Motion-Based Video Synchronization
  • A Fast Concurrent and Resizable Robin Hood Hash Table, by Endrias Kahssay
  • Intel Threading Building Blocks
  • Scalable NUMA-Aware Blocking Synchronization Primitives
  • Concurrent Hash Tables: Fast and General(?)!
  • Non-Scalable Locks Are Dangerous