
Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask

Tudor David, Rachid Guerraoui, Vasileios Trigonakis*
School of Computer and Communication Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
{tudor.david, rachid.guerraoui, vasileios.trigonakis}@epfl.ch

* Authors appear in alphabetical order.

Abstract

This paper presents the most exhaustive study of synchronization to date. We span multiple layers, from hardware cache-coherence protocols up to high-level concurrent software. We do so on different types of architectures, from single-socket -- uniform and non-uniform -- to multi-socket -- directory and broadcast-based -- many-cores. We draw a set of observations that, roughly speaking, imply that scalability of synchronization is mainly a property of the hardware.

1 Introduction

Scaling software systems to many-core architectures is one of the most important challenges in computing today. A major impediment to scalability is synchronization. From the Greek "syn", i.e., with, and "khronos", i.e., time, synchronization denotes the act of coordinating the timeline of a set of processes. Synchronization basically translates into cores slowing each other, sometimes affecting performance to the point of annihilating the overall purpose of increasing their number [3, 21].

A synchronization scheme is said to scale if its performance does not degrade as the number of cores increases. Ideally, acquiring a lock should, for example, take the same time regardless of the number of cores sharing that lock. In the last few decades, a large body of work has been devoted to the design, implementation, evaluation, and application of synchronization schemes [1, 4, 6, 7, 10, 11, 14, 15, 26-29, 38-40, 43-45]. Yet, the designer of a concurrent system still has little indication, a priori, of whether a given synchronization scheme will scale on a given modern many-core architecture and, a posteriori, about exactly why a given scheme did, or did not, scale.

One of the reasons for this state of affairs is that there are no results on modern architectures connecting the low-level details of the underlying hardware, e.g., the cache-coherence protocol, with synchronization at the software level. Recent work that evaluated the scalability of synchronization on modern hardware, e.g., [11, 26], was typically put in a specific application and architecture context, making the evaluation hard to generalize. In most cases, when scalability issues are faced, it is not clear if they are due to the underlying hardware, to the synchronization algorithm itself, to its usage of specific atomic operations, to the application context, or to the workload.

Of course, getting the complete picture of how synchronization schemes behave, in every single context, is very difficult. Nevertheless, in an attempt to shed some light on such a picture, we present the most exhaustive study of synchronization on many-cores to date. Our analysis seeks completeness in two directions (Figure 1).

Figure 1: Analysis method.

1. We consider multiple synchronization layers, from basic many-core hardware up to complex concurrent software. First, we dissect the latencies of cache-coherence protocols. Then, we study the performance of various atomic operations, e.g., compare-and-swap, test-and-set, fetch-and-increment. Next, we proceed with locking and message passing techniques. Finally, we examine a concurrent hash table, an in-memory key-value store, and a software transactional memory.

2. We vary a set of important architectural attributes to better understand their effect on synchronization. We explore both single-socket (chip multi-processor) and multi-socket (multi-processor) many-cores. In the former category, we consider uniform (e.g., Sun Niagara 2) and non-uniform (e.g., Tilera TILE-Gx36) designs. In the latter category, we consider platforms that implement coherence based on a directory (e.g., AMD Opteron) or broadcast (e.g., Intel Xeon).

Our set of experiments, on what we believe constitute the most commonly used synchronization schemes and hardware architectures today, induces the following set of observations.

Crossing sockets is a killer. The latency of performing any operation on a cache line, e.g., a store or a compare-and-swap, simply does not scale across sockets. Our results indicate an increase from 2 to 7.5 times compared to intra-socket latencies, even under no contention. These differences amplify with contention at all synchronization layers and suggest that cross-socket sharing should be avoided.

Sharing within a socket is necessary but not sufficient. If threads are not explicitly placed on the same socket, the operating system might try to load balance them across sockets, inducing expensive communication. But, surprisingly, even with explicit placement within the same socket, an incomplete cache directory, combined with a non-inclusive last-level cache (LLC), might still induce cross-socket communication. On the Opteron for instance, this phenomenon entails a 3-fold increase compared to the actual intra-socket latencies. We discuss one way to alleviate this problem by circumventing certain access patterns.

Intra-socket (non-)uniformity does matter. Within a socket, the fact that the distance from the cores to the LLC is the same, or differs among cores, even only slightly, impacts the scalability of synchronization. For instance, under high contention, the Niagara (uniform) enables approximately 1.7 times higher scalability than the Tilera (non-uniform) for all locking schemes. The developer of a concurrent system should thus be aware that highly contended data pose a higher threat in the presence of even the slightest non-uniformity, e.g., non-uniformity inside a socket.

Loads and stores can be as expensive as atomic operations. In the context of synchronization, where memory operations are often accompanied with memory fences, loads and stores are generally not significantly cheaper than atomic operations with higher consensus numbers [19]. Even without fences, on data that are not locally cached, a compare-and-swap is roughly only 1.35 (on the Opteron) and 1.15 (on the Xeon) times more expensive than a load.

Message passing shines when contention is very high. Structuring an application with message passing reduces sharing and proves beneficial when a large number of threads contend for few data. However, under low contention and/or a small number of cores, locks perform better on higher-layer concurrent testbeds, e.g., a hash table and a software transactional memory, even when message passing is provided in hardware (e.g., Tilera). This suggests the exclusive use of message passing for optimizing certain highly contended parts of a system.

Every locking scheme has its fifteen minutes of fame. None of the nine locking schemes we consider consistently outperforms any other one, on all target architectures or workloads. Strictly speaking, to seek optimality, a lock algorithm should thus be selected based on the hardware platform and the expected workload.

Simple locks are powerful. Overall, an efficient implementation of a ticket lock is the best performing synchronization scheme in most low contention workloads. Even under rather high contention, the ticket lock performs comparably to more complex locks, in particular within a socket. Consequently, given their small memory footprint, ticket locks should be preferred, unless it is sure that a specific lock will be very highly contended.

A high-level ramification of many of these observations is that the scalability of synchronization appears, first and above all, to be a property of the hardware, in the following sense. Basically, in order to be able to scale, synchronization should better be confined to a single socket, ideally a uniform one. On certain platforms (e.g., Opteron), this is simply impossible. Within a socket, sophisticated synchronization schemes are generally not worthwhile. Even if, strictly speaking, no size fits all, a proper implementation of a simple ticket lock seems enough.

In summary, our main contribution is the most exhaustive study of synchronization to date. Results of this study can be used to help predict the cost of a synchronization scheme, explain its behavior, design better schemes, as well as possibly improve future hardware design. SSYNC, the cross-platform synchronization suite we built to perform the study is, we believe, a contribution of independent interest. SSYNC abstracts various lock algorithms behind a common interface: it not only includes most state-of-the-art algorithms, but also provides platform specific optimizations with substantial performance improvements. SSYNC also contains a library that abstracts message passing on various platforms, and a set of microbenchmarks for measuring the latencies of the cache-coherence protocols, the locks, and the message passing. In addition, we provide implementations of a portable software transactional memory, a concurrent hash table, and a key-value store (Memcached [30]), all built using the libraries of SSYNC. SSYNC is available at: http://lpd.epfl.ch/site/ssync

The rest of the paper is organized as follows. We recall in Section 2 some background notions and present in Section 3 our target platforms. We describe SSYNC in Section 4. We present our analyses of synchronization from the hardware and software perspectives, in Sections 5 and 6, respectively. We discuss related work in Section 7. In Section 8, we summarize our contributions, discuss some experimental results that we omit from the paper because of space limitations, and highlight some opportunities for future work.

2 Background and Context

Hardware-based Synchronization. Cache coherence is the hardware protocol responsible for maintaining the consistency of data in the caches. The cache-coherence protocol implements the two fundamental operations of an architecture: load (read) and store (write). In addition to these two operations, other, more sophisticated operations, i.e., atomic, are typically also provided: compare-and-swap, fetch-and-increment, etc.

Most contemporary processors use the MESI [37] cache-coherence protocol, or a variant of it. MESI supports the following states for a cache line: Modified: the data are stale in the memory and no other cache has a copy of this line; Exclusive: the data are up-to-date in the memory and no other cache has a copy of this line; Shared: the data are up-to-date in the memory and other caches might have copies of the line; Invalid: the data are invalid.

Coherence is usually implemented by either snooping or using a directory. By snooping, the individual caches monitor any traffic on the addresses they hold in order to ensure coherence. A directory keeps approximate or precise information of which caches hold copies of a memory location. An operation has to consult the directory, enforce coherence, and then update the directory.

Software-based Synchronization. The most popular software abstractions are locks, used to ensure mutual exclusion. Locks can be implemented using various techniques. The simplest are spin locks [4, 20], in which processes spin on a common memory location until they acquire the lock. Spin locks are generally considered to scale poorly because they involve high contention on a single cache line [4], an issue which is addressed by queue locks [29, 43]. To acquire a queue lock, a thread adds an entry to a queue and spins until the previous holder hands it the lock. Hierarchical locks [14, 27] are tailored to today's non-uniform architectures by using node-local data structures and minimizing accesses to remote data. Whereas the aforementioned locks employ busy-waiting techniques, other implementations are cooperative. For example, in case of contention, the commonly used Pthread Mutex adds the core to a wait queue and is then suspended until it can acquire the lock.

An alternative to locks we consider in our study is to partition the system resources between processes. In this view, synchronization is achieved through message passing, which is either provided by the hardware or implemented in software [5]. Software implementations are generally built over cache-coherence protocols and impose a single writer and a single reader for the used cache lines.

Table 1: The hardware and the OS characteristics of the target platforms.

                      Opteron                          Xeon                                  Niagara                   Tilera
System                AMD Magny Cours                  Intel Westmere-EX                     SUN SPARC-T5120           Tilera TILE-Gx36
Processors            4x AMD Opteron 6172              8x Intel Xeon E7-8867L                SUN UltraSPARC-T2         TILE-Gx CPU
# Cores               48                               80 (no hyper-threading)               8 (64 hardware threads)   36
Core clock            2.1 GHz                          2.13 GHz                              1.2 GHz                   1.2 GHz
L1 Cache              64/64 KiB I/D                    32/32 KiB I/D                         16/8 KiB I/D              32/32 KiB I/D
L2 Cache              512 KiB                          256 KiB                               -                         256 KiB
Last-level Cache      2x6 MiB (shared per die)         30 MiB (shared)                       4 MiB (shared)            9 MiB (distributed)
Interconnect          6.4 GT/s HyperTransport (HT) 3.0 6.4 GT/s QuickPath Interconnect (QPI) Niagara2 Crossbar         Tilera iMesh
Memory                128 GiB DDR3-1333                192 GiB Sync DDR3-1067                32 GiB FB-DIMM-400        16 GiB DDR3-800
#Channels / #Nodes    4 per socket / 8                 4 per socket / 8                      8 / 1                     4 / 2
OS                    Ubuntu 12.04.2 / 3.4.2           Red Hat EL 6.3 / 2.6.32               Solaris 10 u7             Tilera EL 6.3 / 2.6.40

3 Target Platforms

This section describes the four platforms considered in our experiments. Each is representative of a specific type of many-core architecture. We consider two large-scale multi-socket multi-cores, henceforth called the multi-sockets (Footnote 1), and two large-scale chip multi-processors (CMPs), henceforth called the single-sockets. The multi-sockets are a 4-socket AMD Opteron (Opteron) and an 8-socket Intel Xeon (Xeon), whereas the CMPs are an 8-core Sun Niagara 2 (Niagara) and a 36-core Tilera TILE-Gx36 (Tilera). The characteristics of the four platforms are detailed in Table 1.

All platforms have a single die per socket, aside from the Opteron, which has two. Given that these two dies are actually organized in a 2-socket topology, we use the term socket to refer to a single die for simplicity.

Footnote 1: We also test a 2-socket Intel Xeon and a 2-socket AMD Opteron.
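As a concrete reference for the busy-waiting pattern discussed above, the following is a minimal test-and-set spin lock sketch written by us with C11 atomics; it is an illustration, not the code evaluated in this paper. All waiting threads spin on the same flag, which is exactly the single-cache-line contention that queue locks are designed to avoid.

    #include <stdatomic.h>

    typedef struct {
        atomic_flag flag;   /* one shared byte: the single contended cache line */
    } tas_lock_t;

    #define TAS_LOCK_INIT { ATOMIC_FLAG_INIT }

    static void tas_lock(tas_lock_t *l) {
        /* test-and-set until we observe the flag clear; every failed attempt
         * may still invalidate the line in the other spinners' caches */
        while (atomic_flag_test_and_set_explicit(&l->flag, memory_order_acquire))
            ;  /* busy-wait */
    }

    static void tas_unlock(tas_lock_t *l) {
        atomic_flag_clear_explicit(&l->flag, memory_order_release);
    }

A test-and-test-and-set variant first spins with plain loads and only attempts the atomic operation when the flag appears free, which reduces the coherence traffic; this is the TTAS lock evaluated later in the paper.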

3.1 Multi-socket - Directory-based: Opteron

The 48-core AMD Opteron contains four multi-chip modules (MCMs). Each MCM has two 6-core dies with independent memory controllers. Hence, the system comprises, overall, eight memory nodes. The topology of the system is depicted in Figure 2(a). The maximum distance between two dies is two hops. The dies of an MCM are situated at a 1-hop distance, but they share more bandwidth than two dies of different MCMs.

The caches of the Opteron are write-back and non-inclusive [13]. Nevertheless, the hierarchy is not strictly exclusive; on an LLC hit the data are pulled in the L1 but may or may not be removed from the LLC (decided by the hardware [2]). The Opteron uses the MOESI protocol for cache coherence. The 'O' stands for the owned state, which indicates that this cache line has been modified (by the owner) but there might be more shared copies on other cores. This state allows a core to load a modified line of another core without the need to invalidate the modified line. The modified cache line simply changes to owned and the new core receives the line in shared state. Cache coherence is implemented with a broadcast-based protocol, assisted by what is called the HyperTransport Assist (also known as the probe filter) [13]. The probe filter is, essentially, a directory residing in the LLC (Footnote 2). An entry in the directory holds the owner of the cache line, which can be used to directly probe or invalidate the copy in the local caches of the owner core.

Footnote 2: Typically, the probe filter occupies 1 MiB of the LLC.

3.2 Multi-socket - Broadcast-based: Xeon

The 80-core Intel Xeon consists of eight sockets of 10 cores. These form a twisted hypercube, as depicted in Figure 2(b), maximizing the distance between two nodes to two hops. The Xeon uses inclusive caches [23], i.e., every new cache-line fill occurs in all the three levels of the hierarchy. The LLC is write-back; the data are written to the memory only upon an eviction of a modified line due to space or coherence. Within a socket, the Xeon implements coherence by snooping. Across sockets, it broadcasts snoop requests to the other sockets. Within the socket, the LLC keeps track of which cores might have a copy of a cache line. Additionally, the Xeon extends the MESI protocol with the forward state [22]. This state is a special form of the shared state and indicates the only cache that will respond to a load request for that line (thus reducing bandwidth usage).

Figure 2: The system topologies. (a) AMD Opteron. (b) Intel Xeon.

3.3 Single-socket - Uniform: Niagara

The Sun Niagara 2 is a single-die processor that incorporates 8 cores. It is based on the chip multi-threading architecture; it provides 8 hardware threads per core, totaling 64 hardware threads. Each L1 cache is shared among the 8 hardware threads of a core and is write-through to the LLC. The 8 cores communicate with the shared LLC through a crossbar [36], which means that each core is equidistant from the LLC (uniform). The cache-coherence implementation is directory-based and uses duplicate tags [32], i.e., the LLC cache holds a directory of all the L1 lines.

3.4 Single-socket - Non-uniform: Tilera

The Tilera TILE-Gx36 [41] is a 36-core chip multi-processor. The cores, also called tiles, are allocated on a 2-dimensional mesh and are connected with Tilera's iMesh on-chip network. iMesh handles the coherence of the data and also provides hardware message passing to the applications. The Tilera implements the Dynamic Distributed Cache technology [41]. All L2 caches are accessible by every core on the chip, thus the L2s are used as a 9 MiB coherent LLC. The hardware uses a distributed directory to implement coherence. Each cache line has a home tile, i.e., the actual L2 cache where the data reside if cached by the distributed LLC. Consider, for example, the case of core x loading an address homed on tile y. If the data are not cached in the local L1 and L2 caches of x, a request for the data is sent to the L2 cache of y, which plays the role of the LLC for this address. Clearly, the latency of accessing the LLC depends on the distance between x and y, hence the Tilera is a non-uniform cache architecture.

4 SSYNC

SSYNC is our cross-platform synchronization suite; it works on x86_64, SPARC, and Tilera processors. SSYNC contains libslock, a library that abstracts lock algorithms behind a common interface, and libssmp, a library with fine-tuned implementations of message passing for each of the four platforms. SSYNC also includes microbenchmarks for measuring the latencies of the cache coherence, the locks, and the message passing, as well as ssht, i.e., a cache efficient hash table, and TM2C, i.e., a highly portable transactional memory.

4.1 Libraries

libslock. This library contains a common interface and optimized implementations of a number of widely used locks. libslock includes three spin locks, namely the test-and-set lock, the test-and-test-and-set lock with exponential back-off [4, 20], and the ticket lock [29]. The queue locks are the MCS lock [29] and the CLH lock [43]. We also employ an array-based lock [20]. libslock also contains hierarchical locks, such as the hierarchical CLH lock [27] and the hierarchical ticket lock (hticket) [14] (Footnote 3). Finally, libslock abstracts the Pthread Mutex interface. libslock also contains a cross-platform interface for atomic instructions and other architecture dependent operations, such as fences, thread and memory placement functions.

Footnote 3: In fact, based on the results of Section 5 and without being aware of [14], we designed and implemented the hticket algorithm.

libssmp. libssmp is our implementation of message passing over cache coherence (Footnote 4) (similar to the one in Barrelfish [6]). It uses cache-line-sized buffers (messages) in order to complete message transmissions with single cache-line transfers. Each buffer is one-directional and includes a byte flag to designate whether the buffer is empty or contains a message. For client-server communication, libssmp implements functions for receiving from any other, or from a specific subset of, the threads. Even though the design of libssmp is identical on all platforms, we leverage the results of Section 5 to tailor libssmp to the specifics of each platform individually.

Footnote 4: On the Tilera, it is an interface to the hardware message passing.

4.2 Microbenchmarks

ccbench. ccbench is a tool for measuring the cost of operations on a cache line, depending on the line's MESI state and placement in the system. ccbench brings the cache line in the desired state and then accesses it from either a local or a remote core. ccbench supports 30 cases, such as store on modified and test-and-set on shared lines.

stress tests. SSYNC provides tests for the primitives in libslock and libssmp. These tests can be used to measure the primitives' latency or throughput under various conditions, e.g., number and placement of threads, level of contention.

4.3 Concurrent Software

Hash Table (ssht). ssht is a concurrent hash table that exports three operations: put, get, and remove. It is designed to place the data as efficiently as possible in the caches in order to (i) allow for efficient prefetching and (ii) avoid false sharing. ssht can be configured to use any of the locks of libslock or the message passing of libssmp.

Transactional Memory (TM2C). TM2C [16] is a message passing-based software transactional memory system for many-cores. TM2C is implemented using libssmp. TM2C also has a shared memory version built with the spin locks of libslock.
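To make the message-passing design concrete, here is a minimal sketch, written by us as an illustration and not taken from the libssmp code, of a one-directional, cache-line-sized buffer in which a byte flag marks whether the buffer holds a message; all names are ours.

    #include <stdint.h>
    #include <string.h>
    #include <stdatomic.h>

    #define CACHE_LINE 64
    #define PAYLOAD    (CACHE_LINE - 1)

    /* One-directional channel: exactly one cache line, one writer, one reader. */
    typedef struct __attribute__((aligned(CACHE_LINE))) {
        uint8_t data[PAYLOAD];
        _Atomic uint8_t full;   /* 0: empty, 1: contains a message */
    } channel_t;

    /* Sender: wait until the buffer is empty, write the payload, set the flag. */
    static void channel_send(channel_t *c, const void *msg, size_t len) {
        while (atomic_load_explicit(&c->full, memory_order_acquire))
            ;  /* busy-wait: the receiver has not consumed the previous message */
        memcpy(c->data, msg, len < PAYLOAD ? len : PAYLOAD);
        atomic_store_explicit(&c->full, 1, memory_order_release);
    }

    /* Receiver: wait until a message is present, copy it out, clear the flag. */
    static void channel_recv(channel_t *c, void *msg, size_t len) {
        while (!atomic_load_explicit(&c->full, memory_order_acquire))
            ;  /* busy-wait: fetching the line here is one cache-line transfer */
        memcpy(msg, c->data, len < PAYLOAD ? len : PAYLOAD);
        atomic_store_explicit(&c->full, 0, memory_order_release);
    }

With such a design, a round trip between two cores boils down to a handful of cache-line transfers, which is consistent with the message-passing latencies analyzed in Section 6.2.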

5 Hardware-Level Analysis

In this section, we report on the latencies incurred by the hardware cache-coherence protocols and discuss how to reduce them in certain cases. These latencies constitute a good estimation of the cost of sharing a cache line in a many-core platform and have a significant impact on the scalability of any synchronization scheme. We use ccbench to measure basic operations such as load and store, as well as compare-and-swap (CAS), fetch-and-increment (FAI), test-and-set (TAS), and swap (SWAP).

5.1 Local Accesses

Table 3 contains the latencies for accessing the local caches of a core. In the context of synchronization, the values for the LLCs are worth highlighting. On the Xeon, the 44 cycles is the local latency to the LLC, but also corresponds to the fastest communication between two cores on the same socket. The LLC plays the same role for the single-socket platforms, however, it is directly accessible by all the cores of the system. On the Opteron, the non-inclusive LLC holds both data and the cache directory, so the 40 cycles is the latency to both. However, the LLC is filled with data only upon an eviction from the L2, hence the access to the directory is more relevant to synchronization.

Table 3: Local caches and memory latencies (cycles).

        Opteron   Xeon   Niagara   Tilera
L1      3         5      3         2
L2      15        11     11        3
LLC     40        44     24        45
RAM     136       355    176       118

5.2 Remote Accesses

Table 2 contains the latencies to load, store, or perform an atomic operation on a cache line based on its previous state and location. Notice that the accesses to an invalid line are accesses to the main memory. In the following, we discuss all cache-coherence states, except for the invalid. We do not explicitly measure the effects of the forward state of the Xeon: there is no direct way to bring a cache line to this state. Its effects are included in the load from shared case.

Table 2: Latencies (cycles) of the cache coherence to load/store/CAS/FAI/TAS/SWAP a cache line depending on the MESI state and the distance. The values are the average of 10000 repetitions with < 3% standard deviation. On the multi-sockets, the four atomic operations have essentially the same latency, hence a single value per distance; on the single-sockets, the values are reported as CAS/FAI/TAS/SWAP.

                Opteron                       Xeon                    Niagara                     Tilera
State           same   same   one    two     same   one    two       same          other         one            max
                die    MCM    hop    hops    die    hop    hops      core          core          hop            hops
loads
  Modified      81     161    172    252     109    289    400       3             24            45             65
  Owned         83     163    175    254     -      -      -         -             -             -              -
  Exclusive     83     163    175    253     92     273    383       3             24            45             65
  Shared        83     164    176    254     44     223    334       3             24            45             65
  Invalid       136    237    247    327     355    492    601       176           176           118            162
stores
  Modified      83     172    191    273     115    320    431       24            24            57             77
  Owned         244    255    286    291     -      -      -         -             -             -              -
  Exclusive     83     171    191    271     115    315    425       24            24            57             77
  Shared        246    255    286    296     116    318    428       24            24            86             106
atomic operations: CAS (C), FAI (F), TAS (T), SWAP (S)
  Modified      110    197    216    296     120    324    430       71/108/64/95  66/99/55/90   77/51/70/63    98/71/89/84
  Shared        272    283    312    332     113    312    423       76/99/67/93   66/99/55/90   124/82/121/95  142/102/141/115

Loads. On the Opteron, a load has basically the same latency regardless of the previous state of the line; essentially, the steps taken by the cache-coherence protocol are always the same. Interestingly, although the two dies of an MCM are tightly coupled, the benefits are rather small. The latencies between two dies in an MCM and two dies that are simply directly connected differ by roughly 12 cycles. One extra hop adds an additional overhead of 80 cycles. Overall, an access over two hops is approximately 3 times more expensive than an access within a die.

The results in Table 2 represent the best-case scenario for the Opteron: at least one of the involved cores resides on the memory node of the directory. If the directory is remote to both cores, the latencies increase proportionally to the distance. In the worst case, where two cores are on different nodes, both 2 hops away from the directory, the latencies are 312 cycles. Even worse, even if both cores reside on the same node, they still have to access the remote directory, wasting any locality benefits.

In contrast, the Xeon does not have the locality issues of the Opteron. If the data are present within the socket, a load can be completed locally due to the inclusive LLC. Loading from the shared state is particularly interesting, because the LLC can directly serve the data without needing to probe the local caches of the previous holder (unlike the modified and exclusive states). However, the overhead of going off-socket on the Xeon is very high. For instance, loading from the shared state is 7.5 times more expensive over two hops than loading within the socket.

Unlike the large variability of the multi-sockets, the results are more stable on the single-sockets. On the Niagara, a load costs either an L1 or an L2 access, depending on whether the two threads reside on the same core. On the Tilera, the LLC is distributed, hence the latencies depend on the distance of the requesting core from the home tile of the cache line. The cost for two adjacent cores is 45 cycles, whereas for the two most remote cores (Footnote 5), it is 20 cycles higher (2 cycles per hop).

Stores. On the Opteron, both loads and stores on a modified or an exclusive cache line have similar latencies (no write-back to memory). However, a store on a shared line is different (Footnote 6). Every store on a shared or owned cache line incurs a broadcast invalidation to all nodes. This happens because the cache directory is incomplete (it does not keep track of the sharers) and does not in any way detect whether sharing is limited within the node (Footnote 7). Therefore, even if all sharers reside on the same node, a store needs to pay the overhead of a broadcast, thus increasing the cost from around 83 to 244 cycles. Obviously, the problem is aggravated if the directory is not local to any of the cores involved in the store. Finally, the scenario of storing on a cache line shared by all 48 cores costs 296 cycles.

Again, the Xeon has the advantage of being able to locally complete an operation that involves solely cores of a single node. In general, stores behave similarly regardless of the previous state of the cache line. Finally, storing on a cache line shared by all 80 cores on the Xeon costs 445 cycles.

Similarly to a load, the results for a store exhibit much lower variability on the single-sockets. A store on the Niagara has essentially the latency of the L2, regardless of the previous state of the cache line and the number of sharers. On the Tilera, stores on a shared line are a bit more expensive due to the invalidation of the cache lines of the sharers. The cost of a store reaches a maximum of 200 cycles when all 36 cores share that line.

Atomic Operations. On the multi-sockets, CAS, TAS, FAI, and SWAP have essentially the same latencies. These latencies are similar to a store followed by a memory barrier. On the single-sockets, some operations clearly have different hardware implementations. For instance, on the Tilera, the FAI operation is faster than the others. Another interesting point is the latencies for performing an operation when the line is shared by all the cores of the system. On all platforms, the latencies follow the exact same trends as a store in the same scenario.

Footnote 5: 10 hops distance on the 6-by-6 2-dimensional mesh of the Tilera.
Footnote 6: For the store on shared test, we place two different sharers on the indicated distance from a third core that performs the store.
Footnote 7: On the Xeon, the inclusive LLC is able to detect if there is sharing solely within the socket.
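The latencies above are obtained with ccbench; the tool itself is not reproduced here, but the general measurement pattern can be sketched as follows (our own simplified x86 sketch, not the actual benchmark): first put the cache line into the desired state, e.g., by having a remote core write it, then time a single access with serialized timestamp counters.

    #include <stdint.h>
    #include <x86intrin.h>   /* __rdtsc, __rdtscp, _mm_mfence, _mm_lfence */

    /* Time a single load of *addr in cycles. The caller is responsible for
     * first bringing the cache line into the desired MESI state, e.g., by
     * having a remote core write it (modified there, invalid here). */
    static inline uint64_t timed_load(volatile uint64_t *addr, uint64_t *sink) {
        unsigned aux;
        _mm_mfence();                    /* drain earlier stores                 */
        uint64_t start = __rdtsc();      /* begin timestamp                      */
        _mm_lfence();                    /* keep the load after the timestamp    */
        uint64_t v = *addr;              /* the access being measured            */
        uint64_t end = __rdtscp(&aux);   /* rdtscp waits for the load to retire  */
        _mm_lfence();
        *sink = v;                       /* keep the compiler from dropping it   */
        return end - start;
    }

Averaging many repetitions of such a measurement, with the state reset between repetitions, yields numbers of the kind reported in Table 2.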

Implications. The latencies of cache coherence reveal some important issues that should be addressed in order to implement efficient synchronization. Cross-socket communication is 2 to 7.5 times more expensive than intra-socket communication. The problem on the broadcast-based design of the Xeon is larger than on the Opteron. However, within the socket, the inclusive LLC of the Xeon provides strong locality, which in turn translates into efficient intra-socket synchronization. In terms of locality, the incomplete directory of the Opteron is problematic in two ways. First, a read-write pattern of sharing will cause stores on owned and shared cache lines to exhibit the latency of a cross-socket operation, even if all sharers reside on the same socket. We thus expect intra-socket synchronization to behave similarly to the cross-socket. Second, the location of the directory is crucial: if the cores that use some memory are remote to the directory, they pay the remote access overhead. To achieve good synchronization performance, the data have to originate from the local memory node (or to be migrated to the local one). Overall, an Opteron MCM should be treated as a two-node platform.

The single-sockets exhibit quite a different behavior: they both use their LLCs for sharing. The latencies (to the LLC) on the Niagara are uniform, i.e., they are affected by neither the distance nor the number of the involved cores. We expect this uniformity to translate to synchronization that is not prone to contention. The non-uniform Tilera is affected both by the distance and the number of involved cores, therefore we expect scalability to be affected by contention. Regarding the atomic operations, both single-sockets have faster implementations for some of the operations (see Table 2). These should be preferred to achieve the best performance.

5.3 Enforcing Locality

A store to a shared or owned cache line on the Opteron induces an unnecessary broadcast of invalidations, even if all the involved cores reside on the same node (see Table 2). This results in a 3-fold increase of the latency of the store operation. In fact, to avoid this issue, we propose to explicitly maintain the cache line in the modified state. This can be easily achieved by calling the prefetchw x86 instruction before any load reference to that line. Of course, this optimization should be used with care because it disallows two cores to simultaneously hold a copy of the line.

To illustrate the potential of this optimization, we engineer an efficient implementation of a ticket lock. A ticket lock consists of two counters: the next and the current. To acquire the lock, a thread atomically fetches and increases the next counter, i.e., it obtains a ticket. If the ticket equals the current, the thread has acquired the lock, otherwise, it spins until this becomes true. To release the lock, the thread increases the value of the current counter.

A particularly appealing characteristic of the ticket lock is the fact that the ticket, subtracted by the current counter, is the number of threads queued before the current thread. Accordingly, it is intuitive to spin with a back-off proportional to the number of threads queued in front [29]. We use this back-off technique with and without the prefetchw optimization and compare the results with a non-optimized implementation of the ticket lock. Figure 3 depicts the latencies for acquiring and immediately releasing a single lock. Obviously, the non-optimized version scales terribly, delivering a latency of 720K cycles on 48 cores. In contrast, the versions with the proportional back-off scale significantly better. The prefetchw gives an extra performance boost, performing up to 2 times better on 48 cores.

Figure 3: Latency of acquire and release using different implementations of a ticket lock on the Opteron.

SSYNC uses the aforementioned optimization wherever possible. For example, the message passing implementation on the Opteron with this technique is up to 2.5 times faster than without it.
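The following sketch, written by us and not taken from SSYNC, shows a ticket lock with the proportional back-off and the write-prefetch hint described above; the pause and back-off constants are illustrative assumptions, and the prefetch builtin only emits prefetchw on targets that support it.

    #include <stdint.h>
    #include <stdatomic.h>
    #include <immintrin.h>   /* _mm_pause */

    typedef struct {
        _Atomic uint32_t next;     /* ticket dispenser                    */
        _Atomic uint32_t current;  /* ticket currently holding the lock   */
    } ticket_lock_t;

    #define BACKOFF_BASE 128u   /* pause iterations per waiting thread (tuning assumption) */

    static void ticket_lock(ticket_lock_t *l) {
        uint32_t ticket = atomic_fetch_add_explicit(&l->next, 1, memory_order_relaxed);
        for (;;) {
            /* Hint the hardware to bring the line holding 'current' in for writing,
             * so that on the Opteron the release store avoids a broadcast; maps to
             * prefetchw where the target supports it. */
            __builtin_prefetch((const void *)&l->current, 1 /* write */, 3);
            uint32_t cur = atomic_load_explicit(&l->current, memory_order_acquire);
            if (cur == ticket)
                return;                      /* lock acquired */
            uint32_t waiters = ticket - cur; /* threads queued in front of us */
            for (uint32_t i = 0; i < waiters * BACKOFF_BASE; i++)
                _mm_pause();                 /* proportional back-off */
        }
    }

    static void ticket_unlock(ticket_lock_t *l) {
        atomic_fetch_add_explicit(&l->current, 1, memory_order_release);
    }

Whether the write prefetch pays off is platform dependent: as noted above, it helps on the Opteron but prevents two cores from simultaneously holding a read-only copy of the line.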

5.4 Stressing Atomic Operations

In this test, we stress the atomic operations. Each thread repeatedly tries to perform an atomic operation on a single shared location. For FAI, SWAP, and CAS FAI these calls are always eventually successful, i.e., they write to the target memory, whereas for TAS and CAS they are not. CAS FAI implements a FAI operation based on CAS. This enables us to highlight both the costs of spinning until the CAS is successful and the benefits of having a FAI instruction supported by the hardware. After completing a call, the thread pauses for a sufficient number of cycles to prevent the same thread from completing consecutive operations locally (long runs [31]) (Footnote 8).

On the multi-sockets, we allocate threads on the same socket and continue on the next socket once all cores of the previous one have been used. On the Niagara, we divide the threads evenly among the eight physical cores. On all platforms, we ensure that each thread allocates its local data from the local memory node. We repeat each experiment five times and show the average value.

Footnote 8: The delay is proportional to the maximum latency across the involved cores and does not affect the total throughput in a way other than the intended.
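As an illustration of this methodology, here is a simplified sketch under our own assumptions, not the SSYNC benchmark itself: each thread is pinned to an explicitly chosen core and then repeatedly applies the stressed atomic operation to a shared, cache-line-aligned counter, with a short local pause between calls.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdatomic.h>
    #include <stdint.h>

    static _Atomic uint64_t shared_counter __attribute__((aligned(64)));

    /* Pin the calling thread to a given core so that placement (same socket,
     * next socket, ...) is controlled explicitly rather than by the OS. */
    static int pin_to_core(int core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    struct worker_arg { int core; uint64_t iters; uint64_t pause_iters; };

    static void *worker(void *p) {
        struct worker_arg *a = p;
        pin_to_core(a->core);
        for (uint64_t i = 0; i < a->iters; i++) {
            /* the stressed atomic operation: fetch-and-increment (FAI) */
            atomic_fetch_add_explicit(&shared_counter, 1, memory_order_relaxed);
            /* brief local pause so no thread completes long local runs */
            for (uint64_t d = 0; d < a->pause_iters; d++)
                __asm__ __volatile__("" ::: "memory");
        }
        return 0;
    }

The aggregate number of completed operations per second across all workers gives the throughput curves of the kind shown in Figure 4.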

Figure 4: Throughput of different atomic operations (CAS, TAS, CAS-based FAI, SWAP, FAI) on a single memory location, on the Opteron, Xeon, Niagara, and Tilera (throughput in Mops/s vs. number of threads).

Figure 4 shows the results of this experiment. The multi-sockets exhibit a very steep decrease in the throughput once the location is accessed by more than one core. The latency of the operations increases from approximately 20 to 120 cycles. In contrast, the single-sockets generally show an increase in the throughput on the first few cores. This can be attributed to the cost of the local parts of the benchmark (e.g., a while loop) that consume time comparable to the latency of the operations. For more than six cores, however, the results stabilize (with a few exceptions).

Both the Opteron and the Xeon exhibit a stable throughput close to 20 Mops/s within a socket, which drops once there are cores on a second socket. Not surprisingly (see Table 2), the drop on the Xeon is larger than on the Opteron. The throughput on these platforms is dictated by the cache-coherence latencies, given that an atomic operation actually brings the data in its local cache. In contrast, on the single-sockets the throughput converges to a maximum value and exhibits no subsequent decrease. Some further interesting points worth highlighting are as follows. First, the Niagara (SPARC architecture) does not provide an atomic increment or swap instruction. Their implementations are based on CAS, therefore the behavior of FAI and CAS FAI are practically identical. SWAP shows some fluctuations on the Niagara, which we believe are caused by the scheduling of the hardware threads. However, SPARC provides a hardware TAS implementation that proves to be highly efficient. Likewise, the FAI implementation on the Tilera slightly outperforms the other operations.

Implications. Both multi-sockets have a very fast single-thread performance, that drops on two or more cores and decreases further when there is cross-socket communication. Contrarily, both single-sockets have a lower single-thread throughput, but scale to a maximum value, that is subsequently maintained regardless of the number of cores. This behavior indicates that globally stressing a cache line with atomic operations will introduce performance bottlenecks on the multi-sockets, while being somewhat less of a problem on the single-sockets. Finally, a system designer should take advantage of the best performing atomic operations available on each platform, like the TAS on the Niagara.

6 Software-Level Analysis

This section describes the software-oriented part of this study. We start by analyzing the behavior of locks under different levels of contention and continue with message passing. We use the same methodology as in Section 5.4. In addition, the globally shared data are allocated from the first participating memory node. We finally report on our findings on higher-level concurrent software.

6.1 Locks

We evaluate the locks in SSYNC under various degrees of contention on our platforms.

6.1.1 Uncontested Locking

In this experiment we measure the latency to acquire a lock based on the location of the previous holder. Although in a number of cases acquiring a lock does involve contention, a large portion of acquisitions in applications are uncontested, hence they have a similar behavior to this experiment.

Figure 6: Uncontested lock acquisition latency based on the location of the previous owner of the lock (latency in cycles; previous-owner locations: single thread, same core, same die, same MCM, one hop, two hops, max hops; locks: TAS, TTAS, TICKET, ARRAY, MUTEX, MCS, CLH, HCLH, HTICKET; platforms: Opteron, Xeon, Niagara, Tilera).

Figure 5: Throughput of different lock algorithms using a single lock (TAS, TTAS, TICKET, ARRAY, MUTEX, MCS, CLH, HCLH, HTICKET on the Opteron, Xeon, Niagara, and Tilera; throughput in Mops/s vs. number of threads; the "Best single thread" annotations read 22 Mops/s and 34 Mops/s).

Initially, we place a single thread that repeatedly acquires and releases the lock. We then add a second thread, as close as possible to the first one, and pin it further in subsequent runs. Figure 6 contains the latencies of the different locks when the previous holder is at various distances. Latencies suffer important increases on the multi-sockets as the second thread moves further from the first. In general, acquisitions that need to transfer data across sockets have a high cost. Remote acquisitions can be up to 12.5 and 11 times more expensive than local ones on the Opteron and the Xeon respectively. In contrast, due to the shared and distributed LLCs, the Niagara and the Tilera suffer no and slight performance decrease, respectively, as the location of the second thread changes. The latencies of the locks are in accordance with the cache-coherence latencies presented in Table 2. Moreover, the differences in the latencies are significantly larger between locks on the multi-sockets than on the single-sockets, making the choice of the lock algorithm in an uncontested scenario paramount to performance. More precisely, while spin locks closely follow the cache-coherence latencies, more complex locks generally introduce some additional overhead.

Implications. Using a lock, even if no contention is involved, is up to one order of magnitude more expensive when crossing sockets. The 350-450 cycles on a multi-socket, and the roughly 200 cycles on a single-socket, are not negligible, especially if the critical sections are short. Moreover, the penalties induced when crossing sockets in terms of latency tend to be higher for complex locks than for simple locks. Therefore, regardless of the platform, simple locks should be preferred, when contention is very low.

6.1.2 Lock Algorithm Behavior

We study the behavior of locks under extreme and very low contention. On the one hand, highly contended locks are often the main scalability bottleneck. On the other hand, a large number of systems use locking strategies, such as fine-grained locks, that induce low contention. Therefore, good performance in these two scenarios is essential. We measure the total throughput of lock acquisitions that can be performed using each of the locks. Each thread acquires a random lock, reads and writes one corresponding cache line of data, and releases the lock. Similarly to the atomic operations stress test (Section 5.4), in the extreme contention experiment (one lock), a thread pauses after it releases the lock, in order to ensure that the release becomes visible to the other cores before retrying to acquire the lock. Given the uniform structure of the platforms, we do not use hierarchical locks on the single-socket machines.

Extreme contention. The results of the maximum contention experiment (one lock) are depicted in Figure 5. As we described in Section 5, the Xeon exhibits very strong locality within a socket. Accordingly, the hierarchical locks, i.e., hticket and HCLH, perform the best by taking advantage of that. Although there is a very big drop from one to two cores on the multi-sockets, within the socket both the Opteron and the Xeon manage to keep a rather stable performance. However, once a second socket is involved, the throughput decreases again.

Not surprisingly, the CLH and the MCS locks are the most resilient to contention. They both guarantee that a single thread is spinning on each cache line and use the globally shared data only to enqueue for acquiring the lock. The ticket lock proves to be the best spin lock on this workload. Overall, the throughput on two or more cores on the multi-sockets is an order of magnitude lower than the single-core performance. In contrast, the single-sockets maintain a comparable performance on multiple cores.

Very low contention. The very low contention results (512 locks) are shown in Figure 7. Once again, one can observe the strong intra-socket locality of the Xeon. In general, simple locks match or even outperform the more complex queue locks. While on the Xeon the differences between locks become insignificant for a large number of cores, it is generally the ticket lock that performs the best on the Opteron, the Niagara, and the Tilera. On a low-contention scenario it is thus difficult to justify the memory requirements that complex lock algorithms have. It should be noted that, aside from the acquisitions and releases, the load and the store on the protected data also contribute to the lack of scalability of multi-sockets, for the reasons pointed out in Section 5.
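As a reminder of why the queue locks localize spinning, here is a minimal MCS-style sketch (our illustration, not the libslock implementation): each thread spins on a flag in its own queue node, so only the lock handoff touches a cache line owned by another core.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdatomic.h>

    typedef struct mcs_node {
        _Atomic(struct mcs_node *) next;
        _Atomic bool               locked;   /* each waiter spins on its own node */
    } mcs_node_t;

    typedef struct {
        _Atomic(mcs_node_t *) tail;          /* the only globally shared word */
    } mcs_lock_t;

    static void mcs_lock(mcs_lock_t *l, mcs_node_t *me) {
        atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
        atomic_store_explicit(&me->locked, true, memory_order_relaxed);
        /* enqueue ourselves at the tail of the waiting queue */
        mcs_node_t *prev = atomic_exchange_explicit(&l->tail, me, memory_order_acq_rel);
        if (prev == NULL)
            return;                           /* lock was free: we hold it */
        atomic_store_explicit(&prev->next, me, memory_order_release);
        while (atomic_load_explicit(&me->locked, memory_order_acquire))
            ;                                 /* spin locally on our own cache line */
    }

    static void mcs_unlock(mcs_lock_t *l, mcs_node_t *me) {
        mcs_node_t *succ = atomic_load_explicit(&me->next, memory_order_acquire);
        if (succ == NULL) {
            mcs_node_t *expected = me;
            /* no known successor: try to reset the tail and release the lock */
            if (atomic_compare_exchange_strong_explicit(&l->tail, &expected, NULL,
                    memory_order_acq_rel, memory_order_acquire))
                return;
            /* a successor is enqueueing: wait until it announces itself */
            while ((succ = atomic_load_explicit(&me->next, memory_order_acquire)) == NULL)
                ;
        }
        atomic_store_explicit(&succ->locked, false, memory_order_release);  /* hand off */
    }

The CLH lock follows the same principle with an implicit queue of predecessor nodes; in both cases the price is one queue node per thread, which is the memory overhead the low-contention results call into question.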

Figure 7: Throughput of different lock algorithms using 512 locks (TAS, TTAS, TICKET, ARRAY, MUTEX, MCS, CLH, HCLH, HTICKET on the Opteron, Xeon, Niagara, and Tilera; throughput in Mops/s vs. number of threads).

Figure 8: Throughput and scalability of locks depending on the number of locks (4, 16, 32, and 128 locks; highest throughput in Mops/s for up to 36 threads on the Opteron, Xeon, Niagara, and Tilera). The "X : Y" labels on top of each bar indicate the best-performing lock (Y) and the scalability over the single-thread execution (X).

Implications. None of the locks is consistently the best on all platforms. Moreover, no lock is consistently the best within a platform. While complex locks are generally the best under extreme contention, simple locks perform better under low contention. Under high contention, hierarchical locks should be used on multi-sockets with strong intra-socket locality, such as the Xeon. The Opteron, due to the previously discussed locality issues, and the single-sockets favor queue locks. In case of low contention, simple locks are better than complex implementations within a socket. Under extreme contention, while not as good as more complex locks, a ticket lock can avoid performance collapse within a socket. On the Xeon, the best performance is achieved when all threads run on the same socket, both for high and for low contention. Therefore, synchronization between sockets should be limited to the absolute minimum on such platforms. Finally, we observe that when each core is dedicated to a single thread there is no scenario in which Pthread Mutexes perform the best. Mutexes are however useful when threads contend for a core. Therefore, unless multiple threads run on the same core, alternative implementations should be preferred.

6.1.3 Cross-Platform Lock Behavior

In this experiment, we compare lock behavior under various degrees of contention across architectures. In order to have a straightforward cross-platform comparison, we run the tests on up to 36 cores. Having already explored the lock behavior of different algorithms, we only report the highest throughput achieved by any of the locks on each platform. We vary the contention by running experiments with 4, 16, 32, and 128 locks, thus examining high, intermediate, and low degrees of contention.

The results are shown in Figure 8. In all cases, the differences between the single and multi-sockets are noticeable. Under high contention, single-sockets prevent performance collapse from one thread to two or more, whereas in the lower contention cases these platforms scale well. As noted in Section 5, stores and atomic operations are affected by contention on the Tilera, resulting in slightly less scalability than on the Niagara: on high contention workloads, the uniformity of the Niagara delivers up to 1.7 times more scalability than the Tilera, i.e., the rate at which performance increases. In contrast, multi-sockets exhibit a significantly lower throughput for high contention, when compared to single-thread performance. Multi-sockets provide limited scalability even on the low contention scenarios. The direct cause of this contrasting behavior is the higher latencies for the cache-coherence transitions on multi-sockets, as well as the differences in the throughput of the atomic operations.
It is worth noticing In this experiment, we compare lock behavior under var- that the Xeon scales well when all the threads are within ious degrees of contention across architectures. In order a socket. Performance, however, severely degrades even

42 nteTlr ls nteohrpafrs.I hsbench- this In platforms). other 35 the on (round-trip) With (less Mops/s Tilera 16 the best. to on up the delivers performs server mes- one Tilera hardware clients, the the of Again, passing server. client-server sage single a a for with throughput pattern round-trip and one-way Communication. Client-Server receive. then to and applies send but ways: is messages, both reasoning one-way The with same transfer. the cache-line exactly a of four cost approximately the takes times case Ac- round-trip message. the the cordingly, get to Afterwards, transfer) (second buffer message. the new fetch a write to order eevsamsae tbig h eev ufr(.. a second the (i.e., Consequently, buffer caches. core local receive core its the a to line) brings Once cache it line. message, cache a a receives transferring twice of roughly latency costs the coherence one- cache a over surprisingly, message Not way best. mes- the hardware performs Tilera’s passing the sage round- expected, and As one-way messages. exchange trip that cores two of tencies line). cache (a Communication. bytes One-to-One 64 is message The a evaluate of communication. size we client-server application and passing one-to-one both message a of of patterns implementations passing message the evaluate Passing We Message 6.2 communication. cross-socket can they prevent i.e., locality, provide that platforms favor should systems architectures. intensive synchronization all that on argue we throughput Overall, high achieve used be to should order scala- locks in simple on drops, impact contention an As have small bility. can a non-uniformity even of Moreover, degree across contention. single-sockets of and degrees various multi between trends ability 5. Section limited in Implications. the discussed is we difference Opteron the the this of for contrast, locality reason In The number threads. socket. the of of remote regardless scalability a poor shows on Opteron thread one with between cores. of distance two the the latencies on depending communication passing One-to-one message 9: Figure SSYNC Latency (cycles)! 1000! 1500! 2000! 2500! 3000! 500! y a ofthtebfe fis ah-ietase)in transfer) cache-line (first buffer the fetch to has 0! ocpuetems rmnn communication prominent most the capture To .

same! 262!

die! 519! ! one-way

! Opteron same! 472! hr sasgicn ifrnei scal- in difference significant a is There mcm! 887! one! 506! hop! 959!

two! 660! ! round-trip hops! 1567! same! 214! die! 564! ! Xeon one! 914! iue9dpcstela- the depicts 9 Figure hop! 1968! iue1 eit the depicts 10 Figure two! 1167! hops! 2660!

! Niagara same! 181! core! 337!

x other 249!

a ore- to has core! 471!

Tilera! one! 61! hop! 120! max! 64!

mark, the server does not perform any computation between messages, the 16 Mops therefore constitutes an upper bound on the performance of a single server. It is interesting to note that if we reduce the size of the message to a single word, the throughput on the Tilera is more than 100 and 27 Mops/s for one-way and round-trip messages, respectively.

Two additional observations are worth mentioning. First, the Xeon performs very well within a socket, especially for one-way messages; the inclusive LLC plays the role of a buffer for exchanging messages. However, even with a single client on a remote socket, the throughput drops from 25 to 8 Mops/s. The second point is that, as the number of cores increases, the round-trip throughput becomes higher than the one-way throughput on the Xeon. We also observe this effect on the Opteron once the length of the local computation increases (not shown in the graph). This happens because the request-response model enables the server to efficiently prefetch the incoming messages. On one-way messages, the clients keep trying to send messages, saturating the incoming queues of the server. This leads to the clients busy-waiting on cache lines that already contain a message. Therefore, even if the server prefetches a message, the client will soon bring the cache line back to its own caches (or transform it to shared), making the consequent operations of the server more expensive.

Implications. Message passing can achieve latencies similar to transferring a cache line from one core to another. This behavior is only slightly affected by contention, because each pair of cores uses individual cache lines for communication. The previous applies to both one-to-one and client-server communication. However, a single server has a rather low upper bound on the throughput it can achieve, even when not executing any computation. In a sense, we have to trade performance for scalability.

Figure 10: Total throughput of client-server communication (one-way and round-trip throughput, in Mops/s, as a function of the number of clients, for the Opteron, Xeon, Niagara, and Tilera).
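As an illustration of the client-server pattern measured above, a round-trip channel over shared memory can be sketched as follows. This is a simplified example, not the SSYNC message-passing code; the slot layout, the client count, and the memory-ordering choices are assumptions made only for the sketch.

    /* Minimal sketch of a round-trip client-server channel over shared memory.
     * Each client owns one cache-line-sized slot; the server scans all slots. */
    #include <stdatomic.h>
    #include <stdint.h>

    #define CACHE_LINE  64
    #define NUM_CLIENTS 35                 /* example value, one slot per client */

    typedef struct __attribute__((aligned(CACHE_LINE))) {
        _Atomic uint64_t flag;             /* 0: empty, 1: request, 2: response */
        uint64_t payload[7];               /* fills the rest of the cache line */
    } msg_slot_t;

    static msg_slot_t slots[NUM_CLIENTS];

    /* Client: write a request, then spin locally until the server replies. */
    uint64_t client_roundtrip(int id, uint64_t req) {
        slots[id].payload[0] = req;
        atomic_store_explicit(&slots[id].flag, 1, memory_order_release);
        while (atomic_load_explicit(&slots[id].flag, memory_order_acquire) != 2)
            ;                              /* busy-wait on the client's own line */
        return slots[id].payload[0];
    }

    /* Server: scan the slots and serve any pending request in place. */
    void server_loop(void) {
        for (;;)
            for (int c = 0; c < NUM_CLIENTS; c++)
                if (atomic_load_explicit(&slots[c].flag, memory_order_acquire) == 1) {
                    slots[c].payload[0] += 1;          /* the "critical section" */
                    atomic_store_explicit(&slots[c].flag, 2, memory_order_release);
                }
    }

Because each client spins on its own cache line, clients barely disturb one another; the server, however, serializes all requests, which is precisely the single-server throughput bound discussed above.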

6.3 Hash Table (ssht)

We evaluate ssht, i.e., the concurrent hash table implementation of SSYNC, under low (512 buckets) and high (12 buckets) contention, as well as with short (12 elements) and long (48 elements) buckets. We use 80% get, 10% put, and 10% remove operations, so as to keep the size of the hash table constant. We configure ssht so that each bucket is protected by a single lock, the keys are 64-bit integers, and the payload size is 64 bytes; the trends on scalability pertain to other configurations as well. Finally, we configure the message passing (mp) version to use (i) one server per three cores (the configuration that achieves the highest throughput) and (ii) round-trip operations, i.e., all operations, including get, block, waiting for a response from the server. It should be noted that dedicating some threads to act as servers reduces the contention induced on the shared data. Figure 11 depicts the results of the target application on the four platforms for the aforementioned scenarios.

Figure 11: Throughput and scalability of the hash table (ssht) on different configurations (12 or 512 buckets, 12 or 48 entries per bucket; lock-based and mp versions). The "X : Y" labels on top of each bar indicate the best-performing lock (Y) and the scalability of ssht over the single-thread execution (X); the single-thread throughput for message passing is the result of a one server / one client execution.
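To make the structure of the benchmark concrete, a per-bucket-lock table of this kind can be sketched as follows. This is a simplified stand-in, not the ssht code: it assumes a Pthread spinlock per bucket, fixed-size buckets, and non-zero keys, none of which is implied by the actual implementation.

    /* Simplified hash table with one lock per bucket (illustrative only). */
    #include <pthread.h>
    #include <stdint.h>
    #include <string.h>

    #define NUM_BUCKETS 512                /* 12 in the high-contention setting */
    #define BUCKET_SIZE 12                 /* 48 in the long-bucket setting */

    typedef struct {
        pthread_spinlock_t lock;           /* one lock protects the whole bucket */
        uint64_t keys[BUCKET_SIZE];        /* 0 marks an empty slot */
        char     vals[BUCKET_SIZE][64];    /* 64-byte payload, as in the experiments */
    } bucket_t;

    static bucket_t table[NUM_BUCKETS];

    void ht_init(void) {
        for (int i = 0; i < NUM_BUCKETS; i++)
            pthread_spin_init(&table[i].lock, PTHREAD_PROCESS_PRIVATE);
    }

    int ht_get(uint64_t key, char *out) {
        bucket_t *b = &table[key % NUM_BUCKETS];
        int found = 0;
        pthread_spin_lock(&b->lock);
        for (int i = 0; i < BUCKET_SIZE; i++)
            if (b->keys[i] == key) {       /* sequential scan: prefetches well */
                memcpy(out, b->vals[i], sizeof b->vals[i]);
                found = 1;
                break;
            }
        pthread_spin_unlock(&b->lock);
        return found;
    }

With many buckets the sequential scan inside the critical section dominates and prefetches well; with few buckets the per-bucket lock itself becomes the hotspot, which is exactly the regime in which the mp version pays off.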

Low Contention. Increasing the length of the critical sections increases the scalability of the lock-based ssht on all platforms, except for the Tilera. The multi-sockets benefit from the efficient prefetching of the data of a bucket. All three systems benefit from the higher single-thread performance, which leads to lower scalability ratios. On the Tilera, the shared data contend with the local data for the L2 cache space, reducing scalability. On this workload, the message passing implementation is strictly slower than the lock-based ones, even on the Tilera. Finally, it is interesting to note that the Xeon scales slightly even outside the 10 cores of a socket, delivering the highest throughput among all platforms; the best performance in this scenario is thus achieved by simple spin locks.

High Contention. The situation is radically different for high contention. First, the message passing version not only outperforms the lock-based ones on three out of the four platforms (for high core counts), but it also delivers by far the highest throughput. The hardware threads of the Niagara do not favor the client-server solutions; the servers are delayed due to the sharing of the core's resources with the other threads. However, the Niagara achieves a 10-fold performance increase on 36 threads, which is the best scalability among the lock-based versions and approaches the optimal 12-fold scalability. It is worth mentioning that if we do not explicitly pin the threads on cores, the multi-sockets deliver 4 to 6 times lower maximum throughput on this workload.
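Pinning matters because, without it, the scheduler is free to migrate threads across sockets and cores. A minimal way to pin a thread on Linux is sketched below; the mapping from thread to core is an assumption of the example and has to follow the machine's actual socket topology.

    /* Minimal thread-pinning sketch (Linux-specific). The caller chooses core_id
     * according to the socket topology, e.g., filling one socket before the next. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    int pin_to_core(int core_id) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core_id, &set);
        /* restrict the calling thread to the given hardware context */
        return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
    }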

Summary. These experiments illustrate two major points. First, increasing the length of a critical section can partially hide the costs of synchronization under low contention. This, of course, assumes that the data accessed in the critical section are mostly being read (so they can be shared) and follow a specific access pattern (so they can be prefetched). Second, the results illustrate how message passing can provide better performance and scalability than locking under extreme contention.

6.4 Key-Value Store (Memcached)

Memcached (v. 1.4.15) [30] is an in-memory key-value store, based on a hash table. The hash table has a large number of buckets and is protected by fine-grain locks. However, during certain rebalancing and maintenance tasks, it dynamically switches to a global lock for short periods of time. Since we are interested in the effect of synchronization on performance and scalability, we replace the default Pthread Mutexes that protect the hash table, as well as the global locks, with the interface provided by libslock. In order to stress Memcached, we use the memslap tool from the libmemcached library [25] (v. 1.0.15) on a remote server. We deploy 500 client threads and run memslap with its default configuration. We use a get-only and a set-only test.
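The replacement itself amounts to routing Memcached's locking calls through a thin shim so that different lock algorithms can be swapped in without touching the surrounding code. The sketch below only shows the general shape of such a shim; the my_lock_* names, the lock variable name, and the spinlock used as a stand-in are placeholders for this example, not the actual libslock interface.

    /* Hypothetical shim around the application's lock calls (illustrative only;
     * the my_lock_* names are placeholders, not libslock's API). */
    #include <pthread.h>

    typedef struct { pthread_spinlock_t impl; } my_lock_t;   /* swap in any lock */

    static inline void my_lock_init(my_lock_t *l) {
        pthread_spin_init(&l->impl, PTHREAD_PROCESS_PRIVATE);
    }
    static inline void my_lock_acquire(my_lock_t *l) { pthread_spin_lock(&l->impl); }
    static inline void my_lock_release(my_lock_t *l) { pthread_spin_unlock(&l->impl); }

    /* In the application, calls of the form
     *     pthread_mutex_lock(&table_lock);   ... critical section ...
     *     pthread_mutex_unlock(&table_lock);
     * become
     *     my_lock_acquire(&table_lock);      ... critical section ...
     *     my_lock_release(&table_lock);
     * where table_lock is now a my_lock_t (table_lock is a placeholder name). */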

Get. Due to the essentially non-existent contention, the lock algorithm has little effect in the get-only test. In fact, completely removing the locks of the hash table does not even result in any performance difference. This indicates that there are bottlenecks other than synchronization. Note that the get test does not cause any switches to the global locks.

Set. A write-intensive workload, however, stresses a number of the global locks, which introduces contention. In the set test, the differences in lock behavior translate into differences in the performance of the application as well. Figure 12 shows the throughput of the application on various platforms using different locks. We do not present the results on more than 18 cores, since none of the platforms scales further. Changing the Mutexes to ticket, MCS, or TAS locks achieves speedups between 29% and 50% on three of the four platforms (the bottleneck on the Niagara is due to network and OS issues). Moreover, the cache-coherence implementation of the Opteron proves again problematic: due to the periodic accesses to the global locks, the previously presented issues strongly manifest, resulting in a maximum speedup of 3.9. On the Xeon, the throughput increases while all threads are running within a socket, after which it starts to decrease. Finally, thread scheduling has an important impact on performance; not allocating threads to the appropriate cores decreases performance by 20% on the multi-sockets.

Figure 12: Throughput of Memcached (in Kops/s) using a set-only test with the MUTEX, TAS, TICKET, and MCS locks on 1 to 18 threads. The maximum speed-up vs. a single thread is indicated under the platform names: Opteron (3.9x), Xeon (6x), Niagara (6.03x), and Tilera (5.9x).

Summary. Even in an application where the main limitations to performance are networking and the main memory, when contention is involved, the impact of the cache coherence and synchronization primitives is still important. When there is no contention and the data is either prefetched or read from the main memory, synchronization is less of an issue.

7 Related Work

Our study of synchronization is, to date, the most exhaustive in that it covers a wide range of schemes, spanning different layers, and different many-core architectures. In the following, we discuss specific related work that concerns the main layers of our study.

Leveraging Cache Coherence. The characteristics of cache-coherence protocols on x86 multi-sockets have been investigated by a number of studies [17, 33], whose focus is on bandwidth limitations. The cache-coherence latencies are measured from the point of view of loading data. We extend these studies by also measuring stores and atomic operations and by analyzing the impact on higher-level concurrent software. Molka et al. [33] consider the effect of the AMD and Intel memory hierarchy characteristics on various workloads from SPEC OMPM2001 and conclude that throughput is mainly dictated by memory limitations. The results are thus of limited relevance for systems involving high contention. Moses et al. [35] use simulations to show that increasing non-uniformity entails a decrease in the performance of the TTAS lock under high contention. However, the conclusions are limited to spin locks and one specific hardware model. We generalize and quantify such observations on commonly used architectures and synchronization schemes, while also analyzing their implications.

Scaling Locks. Mellor-Crummey et al. [29] and Anderson [4] point out the lack of scalability with traditional spin locks. They introduce and test several alternatives, such as queue locks. Their evaluation is performed on large-scale multi-processors, on which the memory latency distribution is significantly different than on today's many-cores. Luchangco et al. [27] study a NUMA-aware hierarchical CLH lock and compare its performance with a number of well-known locks. Aside from a Niagara processor, no other modern many-core is used. Our analysis extends their evaluation with more locks, more hardware platforms, and more layers.

Other studies focus on the Linux kernel [12] and conclude that the default ticket lock implementation causes important performance bottlenecks in the OS on a multi-core. Performance is improved in a number of different scenarios by replacing the ticket locks with complex locks. We confirm that plain spin locks do not scale across sockets and present some optimizations that alleviate the issue.

Various techniques have been proposed in order to improve the performance of highly contended locks, especially on multi-sockets. For example, combining [18] is an approach in which a thread can execute critical sections on behalf of others. In particular, RCL [26] replaces the "lock, execute, and unlock" pattern with remote procedure calls to a dedicated server core. For highly contended critical sections, this approach hides the contention behind messages and enables the server to locally access the protected data. However, as the RCL paper mentions, the scope of this solution is limited to high contention and a large number of cores, as our results on message passing confirm.
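The pattern behind combining and RCL, namely shipping the critical section to the core that holds the data instead of shipping the data to the contending cores, can be sketched as follows. This is only an illustration of the general idea, not the RCL implementation; the request layout and the function names are assumptions of the example.

    /* Sketch of "critical section as a remote call" (illustrative, not RCL).
     * A client publishes a function pointer and argument; a dedicated server
     * core executes it, keeping the protected data hot in its own cache. */
    #include <stdatomic.h>
    #include <stddef.h>

    typedef struct __attribute__((aligned(64))) {
        void *(*func)(void *);             /* the critical section to execute */
        void *arg;
        void *ret;
        _Atomic int state;                 /* 0: free, 1: posted, 2: done */
    } cs_request_t;

    static cs_request_t reqs[64];          /* one slot per client core (example size) */

    void *remote_critical_section(int id, void *(*cs)(void *), void *arg) {
        reqs[id].func = cs;
        reqs[id].arg  = arg;
        atomic_store_explicit(&reqs[id].state, 1, memory_order_release);
        while (atomic_load_explicit(&reqs[id].state, memory_order_acquire) != 2)
            ;                              /* spin on the local request slot */
        atomic_store_explicit(&reqs[id].state, 0, memory_order_relaxed);
        return reqs[id].ret;
    }

    void serve_loop(int nclients) {        /* runs on the dedicated server core */
        for (;;)
            for (int i = 0; i < nclients; i++)
                if (atomic_load_explicit(&reqs[i].state, memory_order_acquire) == 1) {
                    reqs[i].ret = reqs[i].func(reqs[i].arg);
                    atomic_store_explicit(&reqs[i].state, 2, memory_order_release);
                }
    }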
Scaling Systems on Many-Cores. In order to improve OS scalability on many-cores, a number of approaches deviate from traditional kernel designs. The OS is typically restructured to either improve locality (e.g., Tornado [15]), limit sharing (e.g., Corey [10]), or avoid resource sharing altogether by using message passing (e.g., Barrelfish [6], fos [45]). Boyd-Wickizer et al. [11] aim at verifying whether these scalability issues are indeed inherent to the Linux kernel design. The authors show how optimizing, using various concurrent programming techniques, removes several scalability issues from both the kernel and the applications. By doing so, they conclude that it is not necessary to give up the traditional kernel structure just yet. Our study confirms these papers' observation that synchronization can be an important bottleneck. We also go a step further: we study the roots of the problem and observe that many of the detected issues are in fact hardware-related and would not necessarily manifest on different platforms.

8 Concluding Remarks

Summary. This paper dissects the cost of synchronization and studies its scalability along different directions. Our analysis extends from basic hardware synchronization protocols and primitives all the way to complex concurrent software. We also consider different representative hardware architectures. The results of our experiments and our cross-platform synchronization suite, SSYNC, can be used to evaluate the potential for scaling synchronization on different platforms and to develop concurrent applications and systems.

Observations & Ramifications. Our experimentation induces various observations about synchronization on many-cores. The first obvious one is that crossing sockets significantly impacts synchronization, regardless of the layer, e.g., cache coherence, atomic operations, locks. Synchronization scales much better within a single socket, irrespective of the contention level. Systems with heavy sharing should reduce cross-socket synchronization to the minimum. As we pointed out, this is not always possible (e.g., on a multi-socket AMD Opteron), for hardware can still induce cross-socket traffic, even if sharing is explicitly restricted within a socket. Message passing can be viewed as a way to reduce sharing as it enforces partitioning of the shared data. However, it comes at the cost of lower performance (than locks) on a few cores or under low contention.

Another observation is that non-uniformity affects scalability even within a single-socket many-core, i.e., synchronization on a Sun Niagara 2 scales better than on a Tilera TILE-Gx36. Consequently, even on a single-socket many-core such as the TILE-Gx36, a system should reduce the amount of highly contended data to avoid performance degradation (due to the hardware).

We also notice that each of the nine state-of-the-art lock algorithms we evaluate performs the best on at least one workload/platform combination. Nevertheless, if we reduce the context of synchronization to a single socket (either one socket of a multi-socket, or a single-socket many-core), then our results indicate that spin locks should be preferred over more complex locks. Complex locks have a lower uncontested performance, a larger memory footprint, and only outperform spin locks under relatively high contention.
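For reference, the kind of simple spin lock this recommendation refers to can be as small as a test-and-set loop with backoff, as sketched below. The backoff bounds are arbitrary placeholders, not tuned values from this study.

    /* Minimal test-and-set spin lock with exponential backoff (illustrative). */
    #include <stdatomic.h>

    typedef struct { atomic_flag locked; } tas_lock_t;
    #define TAS_LOCK_INIT { ATOMIC_FLAG_INIT }

    static inline void tas_lock(tas_lock_t *l) {
        unsigned delay = 1;
        while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire)) {
            for (volatile unsigned i = 0; i < delay; i++)
                ;                          /* back off before retrying */
            if (delay < 1024)              /* cap is a placeholder value */
                delay *= 2;
        }
    }

    static inline void tas_unlock(tas_lock_t *l) {
        atomic_flag_clear_explicit(&l->locked, memory_order_release);
    }

Its uncontested path is a single atomic instruction and its state fits in one cache line, which is why such locks win at low contention and within a socket.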
Finally, regarding hardware, we highlight the fact that implementing multi-socket coherence using broadcast or an incomplete directory (as on the Opteron) is not favorable to synchronization. One way to cope with this is to employ a directory combined with an inclusive LLC, both for intra-socket sharing and for detecting whether a cache line is shared outside the socket.

Miscellaneous. Due to the lack of space, we had to omit some rather interesting results. In particular, we conducted our analysis on small-scale multi-sockets, i.e., a 2-socket AMD Opteron 2384 and a 2-socket Intel Xeon X5660. Overall, the scalability trends on these platforms are practically identical to those of the large-scale multi-sockets. For instance, the cross-socket cache-coherence latencies are roughly 1.6 and 2.7 times higher than the intra-socket ones on the 2-socket Opteron and Xeon, respectively. In addition, we did not include the results of the software transactional memory. In brief, these results are in accordance with the results of the hash table (Section 6.3), both for locks and message passing. Furthermore, we also considered the MonetDB [34] column-store. In short, the behavior of the TPC-H benchmarks [42] on MonetDB is similar to the get workload of Memcached (Section 6.4): synchronization is not a bottleneck.

Limitations & Future Work. We cover what we believe to be the "standard" synchronization schemes and many-core architectures used today. Nevertheless, we do not study lock-free [19] techniques, an appealing way of designing mutual exclusion-free data structures. There is also a recent body of work focusing on serializing critical sections over message passing by sending requests to a single server [18, 26]. It would be interesting to extend our experiments to those schemes as well.

Moreover, since the beginning of the multi-core revolution, power consumption has become increasingly important [8, 9]. It would be interesting to compare different platforms and synchronization schemes in terms of performance per watt. Finally, to facilitate synchronization, Intel has introduced the Transactional Synchronization Extensions (TSX) [24] in its Haswell micro-architecture. We will experiment with TSX in SSYNC.

Acknowledgements

We wish to thank our shepherd, Sam King, and the anonymous reviewers for their helpful comments. We would also like to thank Christos Kozyrakis for the fruitful discussions at the incipient stages of this work, as well as Babak Falsafi, Tim Harris, Luis Ceze, Rodrigo Rodrigues, Edouard Bugnion, and Mark Moir for their comments on an early draft of this paper. Finally, we wish to thank the EPFL DIAS lab and EcoCloud for providing us with some of the platforms employed in this study. Part of this work was funded from the European Union Seventh Framework Programme (FP7) under grant agreement 248465, the S(o)OS project, and the Swiss National Science Foundation project 200021 143915, Reusable Concurrent Data Types.

References

[1] J. Abellan, J. Fernandez, and M. Acacio. GLocks: Efficient support for highly-contended locks in Many-Core CMPs. IPDPS 2011, pages 893–905.
[2] AMD. Software optimization guide for AMD family 10h and 12h processors. 2011.
[3] G. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. AFIPS 1967 (Spring), pages 483–485.
[4] T. E. Anderson. The performance of spin lock alternatives for shared-memory multiprocessors. IEEE TPDS, 1(1):6–16, 1990.
[5] H. Attiya, A. Bar-Noy, and D. Dolev. Sharing memory robustly in message-passing systems. PODC 1990, pages 363–375.
[6] A. Baumann, P. Barham, P. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schüpbach, and A. Singhania. The multikernel: a new OS architecture for scalable multicore systems. SOSP 2009, pages 29–44.
[7] L. Boguslavsky, K. Harzallah, A. Kreinen, K. Sevcik, and A. Vainshtein. Optimal strategies for spinning and blocking. J. Parallel Distrib. Comput., 21(2):246–254, 1994.
[8] S. Borkar. Design challenges of technology scaling. Micro, IEEE, 19(4):23–29, 1999.
[9] S. Borkar and A. Chien. The future of microprocessors. Communications of the ACM, 54(5):67–77, 2011.
[10] S. Boyd-Wickizer, H. Chen, R. Chen, Y. Mao, F. Kaashoek, R. Morris, A. Pesterev, L. Stein, M. Wu, Y. Dai, Y. Zhang, and Z. Zhang. Corey: an operating system for many cores. OSDI 2008, pages 43–57.
[11] S. Boyd-Wickizer, A. Clements, Y. Mao, A. Pesterev, M. Kaashoek, R. Morris, and N. Zeldovich. An analysis of Linux scalability to many cores. OSDI 2010, pages 1–16.
[12] S. Boyd-Wickizer, M. Kaashoek, R. Morris, and N. Zeldovich. Non-scalable locks are dangerous. Proceedings of the Linux Symposium, 2012.
[13] P. Conway, N. Kalyanasundharam, G. Donley, K. Lepak, and B. Hughes. Cache hierarchy and memory subsystem of the AMD Opteron processor. Micro, IEEE, 30(2):16–29, 2010.
[14] D. Dice, V. Marathe, and N. Shavit. Lock cohorting: a general technique for designing NUMA locks. PPoPP 2012, pages 247–256.
[15] B. Gamsa, O. Krieger, J. Appavoo, and M. Stumm. Tornado: Maximizing locality and concurrency in a shared memory multiprocessor operating system. OSDI 1999, pages 87–100.
[16] V. Gramoli, R. Guerraoui, and V. Trigonakis. TM2C: a software transactional memory for many-cores. EuroSys 2012, pages 351–364.
[17] D. Hackenberg, D. Molka, and W. Nagel. Comparing cache architectures and coherency protocols on x86-64 multicore SMP systems. MICRO 2009, pages 413–422.
[18] D. Hendler, I. Incze, N. Shavit, and M. Tzafrir. Flat combining and the synchronization-parallelism tradeoff. SPAA 2010, pages 355–364.
[19] M. Herlihy. Wait-free synchronization. ACM Transactions on Programming Languages and Systems, 13(1):124–149, 1991.
[20] M. Herlihy and N. Shavit. The art of multiprocessor programming, revised first edition. 2012.
[21] M. Hill and M. Marty. Amdahl's law in the multi-core era. Computer, 41(7):33–38, 2008.
[22] Intel. An introduction to the Intel QuickPath interconnect. 2009.
[23] Intel. Intel 64 and IA-32 architectures software developer's manual. 2013.
[24] Intel. Transactional Synchronization Extensions Overview. 2013.
[25] libmemcached. http://libmemcached.org/libMemcached.html.
[26] J. Lozi, F. David, G. Thomas, J. Lawall, and G. Muller. Remote core locking: migrating critical-section execution to improve the performance of multithreaded applications. USENIX ATC 2012.
[27] V. Luchangco, D. Nussbaum, and N. Shavit. A hierarchical CLH queue lock. ICPP 2006, pages 801–810.
[28] J. Mellor-Crummey and M. Scott. Synchronization without contention. ASPLOS 1991, pages 269–278.
[29] J. Mellor-Crummey and M. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM TOCS, 1991.
[30] Memcached. http://www.memcached.org.
[31] M. Michael and M. Scott. Simple, fast, and practical non-blocking and blocking concurrent queue algorithms. PODC 1996, pages 267–275.
[32] S. Microsystems. UltraSPARC T2 supplement to the UltraSPARC architecture. 2007.
[33] D. Molka, R. Schöne, D. Hackenberg, and M. Müller. Memory performance and SPEC OpenMP scalability on quad-socket x86_64 systems. ICA3PP 2011, pages 170–181.
[34] MonetDB. http://www.monetdb.org/.
[35] J. Moses, R. Illikkal, L. Zhao, S. Makineni, and D. Newell. Effects of locking and synchronization on future large scale CMP platforms. CAECW 2006.
[36] U. Nawathe, M. Hassan, K. Yen, A. Kumar, A. Ramachandran, and D. Greenhill. Implementation of an 8-Core, 64-thread, power-efficient SPARC server on a chip. Solid-State Circuits, IEEE Journal of, 43(1):6–20, 2008.
[37] M. Papamarcos and J. Patel. A low-overhead coherence solution for multiprocessors with private cache memories. ISCA 1984, pages 348–354.
[38] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating MapReduce for multi-core and multiprocessor systems. HPCA 2007, pages 13–24.
[39] M. Schroeder and M. Burrows. Performance of Firefly RPC. SOSP 1989, pages 83–90.
[40] M. Scott and W. Scherer. Scalable queue-based spin locks with timeout. PPoPP 2001, pages 44–52.
[41] Tilera TILE-Gx. http://www.tilera.com/products/processors/TILE-Gx_Family.
[42] TPC-H. http://www.tpc.org/tpch/.
[43] T. S. Craig. Building FIFO and priority-queuing spin locks from atomic swap. Technical report, 1993.
[44] J. Tseng, H. Yu, S. Nagar, N. Dubey, H. Franke, P. Pattnaik, H. Inoue, and T. Nakatani. Performance studies of commercial workloads on a multi-core system. IISWC 2007, pages 57–65.
[45] D. Wentzlaff and A. Agarwal. Factored operating systems (fos): the case for a scalable operating system for multicores. OSR, 43(2):76–85, 2009.