Advances Towards Data-Race-Free Cache Coherence Through Data Classification

Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 1521

MAHDAD DAVARI

ACTA UNIVERSITATIS UPSALIENSIS
UPPSALA 2017
ISSN 1651-6214
ISBN 978-91-554-9925-9
urn:nbn:se:uu:diva-320595

Dissertation presented at Uppsala University to be publicly examined in 2446, Lägerhyddsvägen 2, Hus 2, Uppsala, Thursday, 8 June 2017 at 13:15 for the degree of Doctor of Philosophy. The examination will be conducted in English. Faculty examiner: Professor Manuel Eugenio Acacio Sánchez (Universidad de Murcia).

Abstract

Davari, M. 2017. Advances Towards Data-Race-Free Cache Coherence Through Data Classification. Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology 1521. 64 pp. Uppsala: Acta Universitatis Upsaliensis. ISBN 978-91-554-9925-9.

Providing a consistent view of the shared memory based on precise and well-defined semantics (the memory consistency model) has been an enabling factor in the widespread acceptance and commercial success of shared-memory architectures. Moreover, cache coherence protocols have been employed by the hardware to relieve programmers of the burden of dealing with the memory inconsistency that emerges in the presence of private caches. The principle behind all such cache coherence protocols is to guarantee that consistent values are read from the private caches at all times. In its most stringent form, a cache coherence protocol eagerly enforces two invariants before each data modification: i) no other core has a copy of the data in its private caches, and ii) all other cores know where to receive the consistent data should they need the data later.

Nevertheless, by partly transferring the responsibility for maintaining those invariants to the programmers, commercial multicores have adopted weaker memory consistency models, namely the Total Store Order (TSO), in order to optimize the performance for more common cases. Moreover, memory models with more relaxed invariants have been proposed based on the observation that more and more software is written in compliance with the Data-Race-Free (DRF) semantics. The semantics of DRF software can be leveraged by the hardware to infer when data in the private caches might be inconsistent. As a result, the hardware ignores the inconsistent data and retrieves the consistent data from the shared memory. DRF semantics therefore relieves the hardware of the burden of eagerly enforcing the strong consistency invariants before each data modification. Instead, consistency is guaranteed only when needed. This results in manifold optimizations, such as reducing the energy consumption and improving the performance and scalability.

The efficiency of detecting and discarding the inconsistent data is an important factor affecting the efficiency of such coherence protocols. For instance, discarding consistent data does not affect correctness, but results in performance loss and increased energy consumption. In this thesis we show how data classification can be leveraged as an effective tool to simplify cache coherence based on the DRF semantics. In particular, we introduce simple but efficient hardware-based private/shared data classification techniques that can be used to efficiently detect the inconsistent data, thus enabling low-overhead and scalable cache coherence solutions based on the DRF semantics.

Keywords: Shared Memory Architectures, Multicore, Memory Hierarchy, Cache Coherence, Data Classification

Mahdad Davari, Department of Information Technology, Division of Computer Systems, Box 337, Uppsala University, SE-75105 Uppsala, Sweden.
Department of Information Technology, Computer Architecture and Computer Communication, Box 337, Uppsala University, SE-75105 Uppsala, Sweden.

© Mahdad Davari 2017
ISSN 1651-6214
ISBN 978-91-554-9925-9
urn:nbn:se:uu:diva-320595 (http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-320595)

To my parents, who always valued knowledge, prioritized my education, and supported me throughout.

List of Papers

This thesis is based on the following papers, which are referred to in the text by their Roman numerals.

I Ros, A., Davari, M., Kaxiras, S. (2015) Hierarchical Private/Shared Classification. The Key to Simple and Efficient Coherence for Clustered Cache Hierarchies. International Symposium on High Performance Computer Architecture (HPCA)

II Davari, M., Ros, A., Hagersten, E., Kaxiras, S. (2015) The Effects of Granularity and Adaptivity on Private/Shared Classification for Coherence. ACM Transactions on Architecture and Code Optimization (TACO)

III Davari, M., Ros, A., Hagersten, E., Kaxiras, S. (2015) An Efficient, Self-Contained, On-Chip Directory: DIR1-SISD. International Symposium on Parallel Architectures and Compilation Techniques (PACT)

IV Davari, M., Hagersten, E., Kaxiras, S. (2017) Scope-Aware Classification: Taking the Hierarchical Private/Shared Data Classification to the Next Level. (Under submission) Technical Report 2017-008, Department of Information Technology, Uppsala University

V Davari, M., Hagersten, E., Kaxiras, S. (2017) The Best of Both Works: A Hybrid Data-Race-Free Cache Coherence Scheme. (Accepted with revision) ACM Transactions on Architecture and Code Optimization (TACO)

Reprints were made with permission from the respective publishers. All the papers have been reformatted to the one-column format of this book.

Other publications not included in this thesis:

• Davari, M., Ros, A., Hagersten, E., Kaxiras, S. (2014) The Effects of Granularity and Adaptivity on Private/Shared Classification for Coherence. Seventh Swedish Workshop on Multicore Computing (MCC’14)

• Davari, M., Ros, A., Hagersten, E., Kaxiras, S. (2015) An Efficient, Self-Contained, On-Chip Directory. DIR1-SISD. Eighth Swedish Workshop on Multicore Computing (MCC’15)

Contents

1. Introduction
2. Generational Classification: Data Classification for DRF Semantics
    2.1 Simplifying the Coherence for DRF Semantics
    2.2 Generational Classification
        2.2.1 Classification storage and mechanisms
        2.2.2 Classification Adaptivity
        2.2.3 Adaptivity using Dead-Block Prediction
    2.3 Generational Coherence
    2.4 Results
    2.5 Summary
3. Dir1-SISD: a Minimal Directory for DRF Semantics
    3.1 The Directory for DRF Semantics
        3.1.1 Classification-Aware Inclusion Policy
        3.1.2 Self-Correcting Classification
        3.1.3 Implicit Classification Adaptivity
    3.2 Directory Compression
    3.3 Results
    3.4 Summary
4. Hierarchical Classification and Coherence for DRF Semantics
    4.1 Private/Shared Classification in Hierarchies
        4.1.1 Topology and Terminology
        4.1.2 Plain Hierarchical Classification
        4.1.3 Scope-Aware Classification
    4.2 Scoped Synchronization
    4.3 Dir1-H: The Hierarchical Coherence
        4.3.1 Hierarchical Inclusion Policies
    4.4 Results
    4.5 Summary
5. The Best of Both Schemes: A Hybrid Classification Approach
    5.1 The Penalty of Self-Invalidation and Self-Downgrade
        5.1.1 Impact of Shared Classification on Self-Invalidations
        5.1.2 Impact of Shared Classification on Self-Downgrades
    5.2 Self-Downgrade vs. Demand-Downgrade
    5.3 Hybrid Data Classification
        5.3.1 Hybrid Approach using Speculation
        5.3.2 Hybrid Approach Optimized for Migratory Sharing
        5.3.3 Hybrid Approach Optimized for Producer-Consumer Sharing
    5.4 Results
    5.5 Summary
6. Future Directions
