Abstract of “Improving performance, energy-efficiency and error-resilience of multicore embedded systems through speculative synchronization mechanisms” by Dimitra Papagiannopoulou, Ph.D., Brown University, May 2016.

Embedded systems are becoming ubiquitous and, like their general-purpose counterparts, they have embraced the multicore design paradigm. However, embedded systems need to satisfy specific requirements in performance, energy-efficiency and error-resilience. This thesis proposes design techniques based on speculative synchronization mechanisms such as Hardware Transactional Memory (HTM), Speculative Lock Elision (SLE) and Transactional Lock Removal (TLR) to address these issues.

The first part of the thesis introduces Embedded-Spec, an energy-efficient and lightweight implementation for transparent speculation on a shared-bus multicore embedded architecture. A major advantage of Embedded-Spec is that it can be transparently used with lock-based, non-speculative legacy code. An extensive set of experiments over a wide range of parameters shows that compared to traditional locking, Embedded-Spec can improve the energy-delay product to different degrees based on the chosen configuration.

In order to overcome scalability limitations and achieve better performance, high-end embedded systems are turning to many-core cluster-based NUMA architectures that employ simple scratchpad memories instead of area- and power-hungry data caches. For these types of architectures without caches and cache-coherence support, no speculative synchronization design exists. The second part of this thesis introduces the first implementation of HTM for a coherence-free many-core embedded architecture. The design employs distributed conflict management and resolution for increased scalability. Experiments show that the proposed HTM design can achieve significant performance improvement over traditional locking.

The final part of this thesis explores how HTM can be used beyond data synchronization and specifically as an error-recovery mechanism from variability-induced errors. Two integrated HW/SW schemes are introduced that adaptively scale the supply voltage in order to save energy. These schemes use lightweight checkpointing and roll-back mechanisms adopted from HTM to recover both from intermittent timing errors and catastrophic failures that may occur due to scaling beyond a safe supply voltage. Experiments over a range of operating parameters show that both techniques can achieve significant energy savings at low overhead compared to using conservative voltage guardbands, while guaranteeing forward progress and reliability.

Improving performance, energy-efficiency and error-resilience of multicore embedded systems through speculative synchronization mechanisms

by

Dimitra Papagiannopoulou

M.Sc, Brown University, 2013

M.Sc, University of Patras, Greece 2014

BSE, University of Patras, Greece 2008

A dissertation submitted in partial fulfillment of the

requirements for the Degree of Doctor of Philosophy

in the School of Engineering at Brown University

Providence, Rhode Island

May 2016

© Copyright 2016 by Dimitra Papagiannopoulou

This dissertation by Dimitra Papagiannopoulou is accepted in its present form by

the School of Engineering as satisfying the dissertation requirement

for the degree of Doctor of Philosophy.

Date R. Iris Bahar, Director

Recommended to the Graduate Council

Date Maurice Herlihy, Reader

Date Sherief Reda, Reader

Approved by the Graduate Council

Date Peter M. Weber, Dean of the Graduate School

Vita

Dimitra Papagiannopoulou was born in 1985 in Athens, Greece and grew up in Patras, Greece.

She holds a Bachelor of Science in Engineering (Dipl.-Ing.) from the department of Electrical and Computer Engineering of the University of Patras, a Master of Science degree on “Integrated Software and Hardware Systems” from the department of Computer Science and Engineering of the University of Patras and a Master of Science degree from the department of Electrical Sciences and Computer Engineering of Brown University. Her research interests span the areas of embedded systems, low-power design, multiprocessor synchronization, reliability and variability-aware design.

Acknowledgements

I would like to express my sincere gratitude to my advisor, Prof. Iris Bahar for her continuous support, encouragement and guidance throughout my PhD studies. Prof. Bahar was the reason I chose to attend Brown University. She has been a great mentor to me all these years and I would like to thank her for her active participation in developing me as a researcher.

I would also like to thank Prof. Maurice Herlihy for working with me throughout my Ph.D., for his invaluable feedback, support and mentorship. I am grateful to Prof. Sherief Reda for being on my thesis committee and for his constructive feedback concerning this thesis manuscript. I would also like to express my gratitude to my research collaborators, Prof. Tali Moreshet, Prof. Luca Benini and Dr. Andrea Marongiu, for their great insight, help and feedback. It has been a pleasure working with them.

Many thanks to my present and past colleagues for making the experience at Brown so special.

I would like to thank Cesare Ferri, Thomas Carle, Onur Ulusel, Marco Donato, Kumud Nepal, Christopher Picardo, Christopher Harris, Octavian Biris, Kapil Dev, Monami Nowroz and many more.

I would also like to thank my friends for always being there for me and for the great times we had together. Last but not least, I would like to thank my mother Ioanna, my father Angelos, my sister Katerina and my fiance Sotiris for their love and their continuous support, encouragement and motivation. Without them, I would not be where I am today.

Contents

List of Tables ix

List of Figures x

1 Introduction 1

2 Background and Previous Work 9

2.1 Traditional Locking ...... 9

2.2 Speculative Synchronization Mechanisms ...... 11

2.2.1 Transactional Memory ...... 12

2.2.2 Speculative Lock Elision ...... 20

2.2.3 Transactional Lock Removal ...... 22

2.2.4 Speculation for Embedded Systems ...... 23

2.2.5 Error-resilient and energy-efficient execution on embedded systems ...... 27

3 Energy-efficient and transparent speculation on embedded MPSoC 33

3.1 Embedded-Spec: Speculative Memory Design ...... 34

3.2 Architecture ...... 37

3.2.1 The Bloom Module Hardware ...... 38

3.3 The Embedded-Spec Algorithms ...... 40

3.3.1 Embedded-LE ...... 42

3.3.2 Embedded-LR ...... 43

3.4 Experimental Results ...... 44

3.4.1 Benchmarks ...... 44

3.4.2 Embedded-LE Parameter Exploration ...... 46

3.4.3 Embedded-LR Parameter Exploration ...... 57

3.4.4 Embedded-Spec vs. Locks ...... 59

3.5 Summary and Discussion ...... 63

4 Speculative Synchronization on Coherence-free Many-core Embedded Architectures 65

4.1 Target Architecture ...... 66

4.2 Transactional Memory Design ...... 69

4.2.1 Transactional Bookkeeping ...... 70

4.2.2 Data Versioning ...... 72

4.2.3 Transaction Control Flow ...... 76

4.3 Experimental Results ...... 79

4.3.1 Overhead Characterization ...... 80

4.3.2 Performance Characterization ...... 81

4.3.3 EigenBench ...... 87

4.4 Summary and Discussion ...... 90

5 Transactional Memory Revisited for Error-Resilient and Energy-Efficient MPSoC Execution 92

5.1 Motivation ...... 93

5.2 Target Architecture ...... 95

5.3 Implementation ...... 97

5.3.1 Checkpointing and Rollback ...... 97

5.3.2 Data Versioning ...... 98

5.3.3 Error-Resilient Transactions ...... 100

5.3.4 Programming model ...... 101

5.4 Experimental Results ...... 102

5.4.1 Overhead characterization ...... 103

5.4.2 Energy characterization ...... 103

5.5 Summary and Discussion ...... 106

6 Adaptive voltage scaling policies for improving energy savings at near-edge operation 107

6.1 Addressing critical and non-critical errors ...... 108

6.2 Error policy design ...... 109

6.3 The Thrifty uncle/Reckless nephew policy ...... 113

6.4 Experimental Results ...... 116

6.4.1 Energy consumption ...... 116

6.4.2 Overhead characterization ...... 119

6.4.3 Energy savings vs. transaction size ...... 119

6.5 Summary and Discussion ...... 121

7 Conclusions and future directions 123

List of Tables

3.1 EMBEDDED-SPEC — All Configurations...... 42

3.2 Hardware configurations...... 45

3.3 EMBEDDED-SPEC – Best two configurations when considering performance only, energy only, or energy-delay product...... 63

4.1 Per-core transactional write footprint for each application...... 81

4.2 Experimental setup for VSoC platform...... 82

List of Figures

2.1 The lock interface...... 10

2.2 Example of transactional events handling (based on the implementation proposed in [1])...... 15

2.3 Classification of TM designs...... 20

2.4 Percentage error rate versus supply voltage for intermittent timing errors and the Critical Operating Point...... 28

2.5 Pipeline augmented with Razor latches and control lines (taken from [2])...... 30

3.1 Logic for Transactional Management used in Embedded-Spec. The architectural configuration is taken from [3]. The dark blocks show the additional hardware required. That is, the Tx bit for each line of the data cache to indicate if the data is transactional, the Tx logic in the cache controller to handle transactional accesses, and the Bloom module to detect and resolve conflicts...... 35

3.2 Modifications to the cache coherence protocol for transactional accesses. The gray block indicates the added operations. Note: The TX decision diamond denotes whether the Tx bit is already set or not...... 36

3.3 Architecture overview, as proposed in [3]...... 38

3.4 (a) Overview of the Bloom Module. (b) Internal details of a core Bloom Filter Unit (BFU). Taken from [3] ...... 39

3.5 The flowchart of the Embedded-LE algorithm...... 43

3.6 Execution time for Embedded-LE and Embedded-LE-Sleep modes...... 46

3.7 Energy Consumption for Embedded-LE and Embedded-LE-Sleep modes. . . . . 47

3.8 Energy Delay Product for Embedded-LE and Embedded-LE-Sleep modes. . . . . 48

3.9 Performance of Embedded-LE and varying maximum number of retries...... 49

3.10 Energy Consumption of Embedded-LE and varying maximum number of retries. . 51

3.11 Energy Delay Product of Embedded-LE and varying maximum number of retries. . 51

3.12 Performance of Embedded-LE-Sleep and varying maximum number of retries. . . 52

3.13 Energy Consumption of Embedded-LE-Sleep and varying maximum number of retries...... 52

3.14 Energy Delay Product of Embedded-LE-Sleep and varying maximum number of retries...... 53

3.15 Energy delay product for Embedded-LE using different abort policies and maximum number of allowed retries set to 1...... 55

3.16 Energy delay product for Embedded-LE using different abort policies and maximum number of allowed retries set to 2...... 55

3.17 Energy delay product for Embedded-LE using different abort policies and maximum number of allowed retries set to infinity...... 56

3.18 Energy delay product for Embedded-LE-Sleep using different abort policies and maximum number of allowed retries set to 1...... 57

3.19 Energy delay product for Embedded-LE-Sleep using different abort policies and maximum number of allowed retries set to 2...... 57

3.20 Energy delay product for Embedded-LE-Sleep using different abort policies and maximum number of allowed retries set to infinity...... 58

3.21 Energy delay product for Embedded-LR using different abort policies...... 59

3.22 Execution time of Embedded-Spec vs. standard locks. Showing results for best configurations for each benchmark...... 60

3.23 Energy Consumption of Embedded-Spec vs. standard locks. Showing results for best configurations for each benchmark...... 61

3.24 Energy Delay Product of Embedded-Spec vs. standard locks. Showing results for best configurations for each benchmark...... 62

4.1 Hierarchical design of our cluster-based embedded system...... 67

4.2 Single cluster architecture of target platform...... 68

4.3 A 4X8 Mesh of Trees. Circles represent routing and arbitration switches. Taken from [4]...... 69

4.4 Bookkeeping example. At time t1 address location A had not been read or written. By time t2, cores 1, 7, 8, and 13 have read the address. At time t3 core 13 writes the address and generates a conflict. So, core 13 will be aborted and its read flag will be cleared. Since core 13 was also the writer of address location A, the Writer ID bits and the Owner bit will be cleared as well...... 71

4.5 Modified single cluster architecture. Notice that the PIC refers to the off-cluster peripheral interconnect...... 74

4.6 Distributed per-address log scheme for M banks and N cores...... 75

4.7 Transactional control flow...... 77

4.8 Redblack: Performance comparison between locks and transactions for different number of cores...... 83

4.9 Skiplist: Performance comparison between locks and transactions for different number of cores...... 83

4.10 Genome: Performance comparison between locks and transactions for different number of cores...... 84

4.11 Vacation: Performance comparison between locks and transactions for different number of cores...... 85

4.12 Kmeans: Performance comparison between locks and transactions for different number of cores...... 85

4.13 Results for the eigenbench evaluation methodology. Eigen-characteristics considered are working-set size (top), contention (middle) and Predominance (bottom)...... 89

5.1 Target platform high level view...... 95

5.2 Control Flow of an error-resilient transaction...... 100

5.3 Transformed OpenMP dynamic loop ...... 101

5.4 Energy consumption at -40°C. Steady voltage (SV) versus transactional memory (TM). 104

5.5 Energy consumption at 25°C. Steady voltage (SV) versus transactional memory (TM). 104

5.6 Energy consumption at 125°C. Steady voltage (SV) versus transactional memory (TM). 105

6.1 Example of an error policy decision flow based on: expected error rate, number of consecutive aborts and number of consecutive commits...... 113

6.2 The ‘Thrifty uncle/Reckless nephew’ policy ...... 115

6.3 Single-core energy consumption normalized to the baseline SV configuration using a 20mV voltage scaling step. SV: Steady voltage configuration, TM: Transactional Memory-based technique of Chapter 5, UN: Thrifty uncle/Reckless nephew policy. . 117

6.4 Single-core energy consumption normalized to the baseline SV configuration using a 25mV voltage scaling step. SV: Steady voltage configuration, TM: Transactional Memory-based technique of Chapter 5, UN: Thrifty uncle/Reckless nephew policy. . 118

6.5 Single-core energy consumption for different transaction sizes...... 120

Chapter 1

Introduction

Embedded systems are becoming increasingly common in everyday life. Found in a wide range of applications, from consumer electronics (smart phones, tablets, video game consoles, digital cameras) to home appliances (microwave ovens, dishwashers, home security systems) to automotive systems (cruise control, antilock braking system) to medical equipment (medical imaging), embedded systems are becoming pervasive and will eventually displace many general-purpose systems.

As with general-purpose systems, embedded systems have embraced the multicore design paradigm in order to meet the increasing demand for performance within tight energy constraints. Thus, instead of increasing the clock speed of uniprocessor systems, technology has shifted its focus towards multiprocessor system-on-chip (MPSoC) architectures, where multiple simple, low-power cores are integrated on the same chip, communicating through multi-level on-chip shared memory hierarchies.

Programming shared memory MPSoCs has introduced new challenges, since cores need to be synchronized to prevent memory inconsistencies when data races arise. Many designs have adopted the single address-space shared-memory model, since it provides programmers a familiar and easy-to-understand abstraction of the memory resources. However, the power and performance improvement promised by embedded MPSoCs can be realized only if a fast and energy-efficient synchronization mechanism is in place. Designing such a mechanism in the resource-constrained environment of embedded MPSoCs is a major challenge.

Locks are a synchronization mechanism typically used to guarantee memory consistency in shared memory programs by enforcing serialization of access to shared data. Locks, however, can slow performance and can consume excessive energy, as they typically require time- and energy-expensive read-modify-write operations that traverse the memory hierarchy. In addition, locks must be deployed conservatively whenever conflicts are possible, even if they are very unlikely, thus causing serialization and limiting concurrency. While fine-grained locking techniques can be used to minimize serialization, they require deep understanding of the target application and complicated debug processes. Coarse-grained locking is easier to use, but it cannot extract high degrees of parallelism.
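To make the trade-off concrete, the following sketch (an illustrative Java example only, not code from this thesis) protects a two-field structure either with one coarse-grained lock for the whole object or with one fine-grained lock per field; the finer granularity lets independent updates proceed in parallel, at the cost of more locks for the programmer to reason about.

import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch only: one structure protected either by a single
// coarse-grained lock or by one fine-grained lock per field.
class AccountPair {
    private long checking = 0;
    private long savings  = 0;

    // Coarse-grained: one lock serializes every operation on the object.
    private final ReentrantLock coarse = new ReentrantLock();

    void depositCheckingCoarse(long amount) {
        coarse.lock();
        try { checking += amount; } finally { coarse.unlock(); }
    }

    void depositSavingsCoarse(long amount) {
        coarse.lock();
        try { savings += amount; } finally { coarse.unlock(); }
    }

    // Fine-grained: independent fields get independent locks, so the two
    // deposits below can run concurrently -- but the programmer must now
    // reason about lock ordering if an operation ever needs both locks.
    private final ReentrantLock checkingLock = new ReentrantLock();
    private final ReentrantLock savingsLock  = new ReentrantLock();

    void depositCheckingFine(long amount) {
        checkingLock.lock();
        try { checking += amount; } finally { checkingLock.unlock(); }
    }

    void depositSavingsFine(long amount) {
        savingsLock.lock();
        try { savings += amount; } finally { savingsLock.unlock(); }
    }
}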

To address the limitations of locks, several speculative synchronization approaches have been proposed that are based on dynamic conflict detection and promise to improve performance and save energy. Examples include Transactional Memory [5], Speculative Lock Elision [14] and Transactional Lock Removal [6]. With speculative synchronization, a core does not need to wait for a lock to be released in order to execute a critical section. Instead, it proceeds by speculatively executing the critical section in the presence of potential data conflicts with concurrent computations of other cores on the same critical section. If a data conflict does actually take place, it is detected and one or more of the conflicting threads is rolled back and restarted to ensure correctness and consistency. Otherwise, the computation is committed. Thus, if conflicts are not frequent, speculative synchronization can yield higher throughput than conventional lock-based synchronization.

Since speculative techniques can improve throughput, they can often improve energy efficiency as well, simply because the computation finishes earlier. Besides that, while synchronization through locks requires energy-expensive read-modify-write operations that traverse the memory hierarchy, speculative computations are typically restricted to the more energy-efficient L1 (or L2) caches, often relying on native cache-coherence mechanisms to detect conflicts. However, without carefully designing an energy-efficient speculation mechanism, the extra components required to manage speculation may not provide enough of a performance per Watt improvement to justify its adoption, especially in embedded systems.

Among the most prominent speculative synchronization techniques, Transactional Memory (TM) [5] simplifies concurrent programming by allowing a group of instructions to execute atomically as one single transaction. Transactions are as easy to use as coarse-grained locks because programmers have only to identify the boundaries of critical code regions, but they promise to provide the same level of concurrency as fine-grained locks. Transactional threads are executed in parallel. If data conflicts occur, one or more of the conflicting transactions is rolled back and re-executed. Several TM systems have been proposed, based on hardware, software, or hybrid techniques [5], [6], [7]. Recently, Intel [8] and IBM [9] announced new processors with direct hardware support for speculative transactions, and it seems likely that others will soon follow.
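As a sketch of this programming model (illustrative only), the same critical section is shown below written with a conventional lock and in transactional style. Transaction.atomic() is a hypothetical entry point, not the interface of any system discussed in this thesis; the stub simply runs the block under a global lock so the example compiles, whereas a real TM system would execute the block speculatively and retry it on conflict.

import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch only: Transaction.atomic() is a hypothetical TM entry
// point. The stub runs the block under one global lock so the example
// compiles; a real HTM/STM would run it speculatively and retry on conflict.
final class Transaction {
    private static final ReentrantLock globalLock = new ReentrantLock();

    static void atomic(Runnable criticalSection) {
        globalLock.lock();          // stand-in for "begin transaction"
        try {
            criticalSection.run();  // speculative body in a real TM system
        } finally {
            globalLock.unlock();    // stand-in for "commit"
        }
    }
}

class Counter {
    private long value = 0;
    private final ReentrantLock lock = new ReentrantLock();

    // Conventional locking: every increment serializes on the lock.
    void incrementLocked() {
        lock.lock();
        try {
            value++;
        } finally {
            lock.unlock();
        }
    }

    // Transactional style: the programmer only marks the critical section;
    // conflict detection and rollback are left to the TM system.
    void incrementTransactional() {
        Transaction.atomic(() -> value++);
    }
}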

While hardware and software speculation has been extensively studied for the general-purpose computing domain, it has received less attention in the embedded domain [3], [10], [11], [12], [13].

However, embedded systems have very different demands compared to general-purpose systems.

Even though improving throughput and ease of programming remain important for embedded systems, any practical design for such systems must emphasize energy-efficiency as well. Energy efficiency is important for systems of all levels, but for embedded systems that are energy-constrained and often run on batteries, energy efficiency is a critical aspect. In this thesis we propose various designs for speculative execution with energy-efficiency being a primary concern.

There exist some works on speculation that have proposed Transactional Memory designs for embedded systems targeting a reduction in energy consumption [3], [10], [11], but these works have not focused on transparency. Speculative mechanisms must be easy to use, either by requiring little or no changes to legacy code or by integrating them into practical and familiar programming environments. Even though the existing speculative synchronization designs for embedded systems are energy-efficient and light-weight in terms of hardware implementation, they are not necessarily transparent to the programmer, thus imposing an additional burden.

In the first part of this thesis, we explore methods of applying speculation on legacy code without the requirement of special supporting instructions in software. We present Embedded-Spec, an energy-efficient embedded architecture that supports Speculative Lock Elision (SLE) [14]. SLE is a hardware speculation technique that combines attractive properties of both locking and speculative synchronization. With SLE, a core detects blocks of code protected by locks and executes them speculatively by eliding the lock. In case conflicts with other threads occur, it rolls back and retries the block, this time by explicitly acquiring the lock. Lock elision is transparent: applications need not be written explicitly for lock elision use. Instead, annotations are automatically added during compile time to facilitate prediction and tracking. It is an appealing choice since it promises to increase concurrency without the need to retrofit code.

Unlike most prior works on speculative synchronization, Embedded-Spec focuses on energy efficiency and ease of programming, as well as throughput, since all are key constraints for embedded systems. Specifically, the energy-delay product (EDP) is evaluated, as a figure of merit that captures the trade-off between energy and performance. In order to achieve energy efficiency we emphasize design simplicity. Complex hardware designs that require extensive changes to established protocols (such as cache coherence protocols) will likely be too costly to adopt. We propose the addition of simple hardware structures that avoid changes to the underlying cache coherency protocol but leave us the flexibility to vary how synchronization conflicts are detected, how they are resolved (contention management) and which policy to use for switching between speculative and non-speculative executions. We test the proposed scheme on a multi-core embedded architecture that features a shared bus. Results show energy and performance benefits compared to conventional locking, especially when several cores are employed.

Embedded-Spec is targeted to a bus-based architecture. For small-scale systems, shared-bus architectures are attractive for their simplicity, but they do not scale. If we attempt to increase the number of cores to extract more parallelism, a shared bus-based system will likely suffer from high contention in the main bus, which can limit performance significantly. For large-scale systems, architects have embraced hierarchical structures, where processing elements are grouped into clusters interconnected by a scalable medium such as a network-on-chip (NoC). In such systems [15], [16], [17], cores within a cluster communicate efficiently through common on-chip memory structures, while cores in different clusters communicate less efficiently through bandwidth-limited higher-latency links. Architectures that provide the programmer with a shared-memory abstraction, but where memory references across clusters are significantly slower than references within the clusters, are commonly referred to as non-uniform memory access (NUMA) architectures. High-end embedded systems, like their general-purpose counterparts, are turning to many-core cluster-based shared-memory NUMA architectures. Driven by the need for scalability, in the next parts of this thesis we shift our focus towards speculative execution on these types of architectures.

For many-core embedded architectures, memory organization is the single most far-reaching design decision, both in terms of raw performance and (more importantly) in terms of programmer productivity. In order to meet stringent area and power constraints, the cores and memory hierarchy must be kept simple. In particular, simple scratchpad memories (SPM) are typically preferred to hardware-managed data caches that require some form of cache-coherence management and are far more area- (40%) and power-hungry (34%) [9]. Several many-core embedded systems have been designed without the use of caches and cache-coherence (e.g., ST Microelectronics p2012/STHORM [18], the Epiphany IV from Adapteva [19]). These kinds of platforms are becoming increasingly common. However, implementing speculative synchronization in embedded systems that lack cache-coherence support is particularly challenging, since hardware speculative techniques traditionally rely on the underlying cache-coherence protocol to synchronize memory accesses among the cores. For these cacheless systems, a completely new approach is necessary for handling speculative synchronization.

While there have been some works that have looked into Transactional Memory in embedded NUMA systems [13], [7], [12], no solution exists for speculative execution in a cache-less embedded system. Therefore, next we present the first ever implementation of Hardware Transactional Memory (HTM) support within a cluster-based embedded system that lacks an underlying cache-coherence protocol. Prior works relied on the cache-coherence protocol for detecting data conflicts during transaction execution. The lack of cache-coherence introduces major challenges in the design of Transactional Memory support, which now needs to be designed from scratch. Any implementation without cache-coherence support requires explicit data management and implies a fully-custom design of the transactional memory support. At the same time, any design for embedded systems must emphasize simplicity and energy-efficiency, so the underlying software and hardware interface must be kept simple enough to be of practical use.

We design from scratch a unique HTM scheme for a many-core embedded architecture without caches and cache-coherence support. We provide full speculative synchronization support for multiple transactions accessing data within a single cluster. While the current implementation is limited to single-cluster accesses, the proposed scheme is designed so that it is scalable and can be extended to multiple clusters. We introduce the idea of distributing synchronization management, which makes it inherently scalable. We show that the proposed HTM scheme can achieve significant performance improvements over traditional lock-based schemes.

As the need for energy-efficiency in high-end embedded systems persists, designers are turning to voltage scaling techniques in order to save energy. At the same time, the continuous scaling of semiconductor device dimensions is increasingly raising the concern of static and dynamic variability. Spatial die-to-die and within-die static variations ultimately induce performance and power mismatches between cores in multi-core systems, introducing heterogeneity into nominally homogeneous components. Dynamic variations depend on the operating conditions of the chip, and include aging, supply voltage drops and temperature fluctuations. These static and dynamic variations can ultimately lead to errors. To ensure safe system operation, circuit designers often use conservative guardbands on the operating frequency or voltage. If guardbands are too conservative, they lead to loss of operational efficiency since they limit performance and waste energy. This is particularly concerning for embedded systems that are highly constrained. On the other hand, if these guardbands are reduced using techniques such as voltage scaling, the system might face intermittent errors or, even worse, reach a Critical Operating Point (COP) [20]. For a CMOS device, the COP is a voltage and frequency pair beyond which, either by decreasing the voltage or increasing the frequency, the system will face massive instruction failures. This leads us to ask: how should the operating parameters of an embedded device be set for it to function correctly?

So far, researchers have approached this problem by proposing circuit level error detection and correction (EDAC) techniques [21], [2] as a safety net that allows designers to scale the voltage in order to save energy. However, these techniques incur high execution time and energy overheads and are not suitable for many-core environments. In addition, while EDAC techniques are suitable for dealing with sporadic timing errors, they cannot handle the effects of a Critical Operating Point [20].

In principle the COP can be determined for a particular chip after its production, and the most efficient yet safe voltage/frequency pair for the chip could be configured at that time. However, due to static and dynamic variations, the COP may actually change over space and time. As a result, the safe operating point may differ from one core to another and suddenly become unsafe due to aging, temperature fluctuations or voltage drops.

Having extensively studied transactional memory for data synchronization in embedded multicore environments and seeing how it can benefit performance and energy-efficiency, we realized that transactions could be used in an alternative way: as an efficient checkpointing and error-recovery mechanism for both timing errors and the COP, which allows operation at highly reduced supply voltage margins in order to save energy. In the next and final part of this thesis, we propose an integrated HW/SW scheme that is based on HTM and addresses both types of variation phenomena.

In particular, our scheme dynamically monitors the platform and adaptively adjusts to the evolving COP among multiple cores, using lightweight checkpointing and roll-back mechanisms adopted from HTM for error recovery. Our scheme enables the system to operate at highly reduced margins without sacrificing performance, while at the same time guaranteeing forward progress at reduced energy levels. Experiments demonstrate that our technique is particularly effective in saving energy while also offering safe execution guarantees. To the best of our knowledge, this work is the first to describe a full-fledged HTM implementation for error-resilient and energy-efficient MPSoC execution.

This thesis is organized as follows. Chapter 2 provides a background and related work discussion on multiprocessor synchronization and speculative execution. It also presents the basic components of speculative mechanism design and discusses the main challenges that embedded systems face in terms of energy, scalability and reliability. Chapter 3 presents Embedded-Spec. The proposed speculative execution mechanism is evaluated both in terms of energy and performance over a range of speculative execution policies ([22]). Chapter 4 describes the design of a novel HTM scheme for many-core embedded architectures without caches and cache-coherence support. Results show that the proposed design can outperform traditional lock-based schemes ([23–25]). In Chapter 5 the focus shifts towards reliability. The chapter presents an HTM-based design for error-resilient and energy-efficient MPSoC execution that allows operation at highly reduced supply voltage margins and can address intermittent timing errors and the COP. The design is evaluated in terms of energy improvements compared to the use of conservative guardbands ([26]). In Chapter 6 a new error policy is proposed to increase the flexibility in addressing intermittent timing errors and allow for further energy savings. This new technique covers a broader range of error types and offers energy improvements compared to the initial design of Chapter 5. Chapter 7 concludes this thesis and discusses possible future directions.

Chapter 2

Background and Previous Work

2.1 Traditional Locking

In a multiprocessor environment, the memory model describes how threads interact through memory and how they use shared data. It also defines the programmer’s challenges with respect to data management. Among the parallel programming models, the shared memory paradigm is widely adopted in embedded MPSoC designs, since it provides an easy-to-understand abstraction of memory resources that programmers are accustomed to: a single address space that parallel processes share and can access asynchronously.

Concurrent accesses to shared memory can lead to race conditions and memory inconsistencies.

To avoid such scenarios, programmers have adopted the concept of a critical section: a block of code that can be executed by only one thread at a time. We call this property mutual exclusion. The standard way to approach mutual exclusion is through a Lock object satisfying the interface shown in Fig. 2.1 [27]. Each critical section is associated with a Lock object. When a thread tries to enter a critical section it must execute the lock() method call to acquire the lock associated with that section. When it leaves the critical section it must call the unlock() method to release the lock.

public interface Lock {
    public void lock();   // Called before entering critical section.
    public void unlock(); // Called before leaving critical section.
}

Figure 2.1: The lock interface.

Any mutual exclusion protocol poses the question: what will a thread do if it cannot acquire the lock? There are two alternatives. It could keep trying and repeatedly test the lock until it becomes available (‘spin’ or ‘busy-wait’), in which case the lock is called a ‘spinlock’, or it could suspend itself and ask the operating system’s scheduler to schedule another thread on the same processor until the lock is released (‘blocking’). Blocking is a good choice only if the lock delay is expected to be long, while spinning is sensible when the lock delay is expected to be short, since it avoids the overhead of operating system process re-scheduling.
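For illustration, a thread would bracket its critical section with this interface as follows (a minimal sketch; the counter class is hypothetical, and whether lock() spins or blocks is left to the particular Lock implementation).

// Illustrative sketch of using the Lock interface of Fig. 2.1. The counter
// class is a placeholder; any Lock implementation can be plugged in.
class SharedCounter {
    private long value = 0;
    private final Lock lock;   // any implementation of the Fig. 2.1 interface

    SharedCounter(Lock lock) {
        this.lock = lock;
    }

    void increment() {
        lock.lock();            // called before entering the critical section
        try {
            value++;            // critical section: at most one thread at a time
        } finally {
            lock.unlock();      // called before leaving the critical section
        }
    }
}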

Locks typically require hardware support for efficient implementation. This support usually takes the form of one or more atomic instructions such as ‘Test-And-Set’ or ‘Compare-And-Swap’ that allow a thread to test if the lock is free and, if it is, acquire it in a single atomic operation. Spinlocks [28] are based on hardware Test-And-Set (TAS) operations that each require a bus access. When multiple threads are concurrently spinning on the same lock, the system’s bus may be overloaded and its bandwidth may be saturated by the synchronization traffic, even with a small number of cores. Test-And-Test-And-Set (TTAS) locks have been proposed to mitigate this problem. With TTAS locks, threads can store local copies of the lock into their private caches, which must be kept coherent. This way, the threads can repeatedly read the lock from their local caches until it appears to be free and only then call Test-And-Set.
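The following sketch, loosely modeled on the textbook spinlocks discussed in [27], [28], contrasts the two approaches using the Lock interface of Fig. 2.1; AtomicBoolean.getAndSet() stands in for the hardware Test-And-Set instruction, and the code is illustrative rather than an implementation used in this thesis.

import java.util.concurrent.atomic.AtomicBoolean;

// Simplified sketch of TAS vs. TTAS spinlocks implementing the Fig. 2.1
// interface. AtomicBoolean.getAndSet() plays the role of Test-And-Set.
class TASLock implements Lock {
    private final AtomicBoolean state = new AtomicBoolean(false);

    public void lock() {
        // Every failed attempt is a Test-And-Set, i.e., a bus transaction.
        while (state.getAndSet(true)) { /* spin */ }
    }

    public void unlock() {
        state.set(false);
    }
}

class TTASLock implements Lock {
    private final AtomicBoolean state = new AtomicBoolean(false);

    public void lock() {
        while (true) {
            // First spin on a locally cached copy of the lock...
            while (state.get()) { /* spin without generating bus traffic */ }
            // ...and only call Test-And-Set once the lock appears free.
            if (!state.getAndSet(true)) {
                return;
            }
        }
    }

    public void unlock() {
        state.set(false);
    }
}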

Multiple lock-based implementations for embedded multi-processor architectures exist in the literature ([29–36]). Many lock-free alternatives have been proposed as well ([10], [37–40]). But while most of these works target lightweight solutions for improving performance, only a few have focused on energy-efficiency ([10], [29], [32], [36], [38], [39]). Next we investigate lock-free speculative mechanisms focusing not only on improving performance but also on reducing energy consumption.

While locks can guarantee memory consistency in shared memory programs and are easy to use, they can slow performance and waste energy since they rely on time- and energy-expensive read- modify-write operations that traverse the memory hierarchy. Speculative execution is an attractive alternative choice that can improve performance and save energy. A speculative execution in a multiprocessor system is one in which a processor executes a block of code in the presence of potential conflicts with concurrent computations. After the execution is complete, the process checks for conflicts and if none are found, the block commits. Otherwise, the speculative execution is rolled back. Speculation is attractive in situations where conflict avoidance is expensive, but actual conflicts are rare. Unlike locks that use time- and energy-expensive read-modify-write operations, speculative computations are typically restricted to the more energy-efficient L1 (or sometimes L2) caches, relying on native cache-coherence mechanisms to detect conflicts.

The concept of speculation is employed in other areas, such as branch prediction in modern pipelined processors to predict the outcome of branch instructions based on the history of branch executions. The branch predicted as the most likely is fetched and speculatively executed. If later the decision is proven wrong, the speculative execution is discarded and the pipeline is restarted with the correct branch.

While speculative synchronization mechanisms have been the subject of prior work, those mech- anisms have focused more on performance, i.e. completing a task in a minimal amount of time.

In this thesis, we realize that performance is still important, but energy consumption should be minimized in the process, since energy is of central importance to embedded systems. In the follow- ing sections (2.2.1 - 2.2.3), some of the most well-known speculative synchronization mechanisms that this thesis builds upon are presented: Transactional Memory, Speculative Lock Elision and 12

Transactional Lock Removal.

2.2.1 Transactional Memory

Transactional memory (TM)[5] is a mechanism for synchronizing concurrent threads, analogous to database transactions. It simplifies concurrent programming by allowing a group of instructions to execute atomically as one single transaction. A transaction, is a finite sequence of machine instructions executed by a single thread, satisfying the properties of serializability and atomicity.

Serializability, means that transactions must appear to execute sequentially, in one-at-a-time order

(the order is not guaranteed though). The steps of a transaction should never interleave with the steps of another transaction. Atomicity, means that if the transaction commits, its speculative changes must become visible to other threads instantaneously as a single atomic operation. If a conflict is detected, none of the speculative changes takes effect (all or nothing). TM guarantees that all transactions run in isolation, i.e., no other thread can see the speculative changes before they are committed.

Transactions are implemented as follows. Critical sections are enclosed within transactions and are speculatively executed, making tentative changes to objects. If a transaction completes without encountering a synchronization conflict, then it commits (i.e., the tentative changes become perma- nent), otherwise it aborts (i.e., the tentative changes are discarded) and the transaction is restarted.

To satisfy the properties of serializability and atomicity, until a transaction commits its effects should not be visible outside the transaction itself. For that purpose, tentative memory updates should be kept in specific locations (depending on the implementation that could be a thread-local cache, or a software data structure). If the transaction commits, tentative changes can be written back to memory and become permanent. Data conflicts with other transactions are detected by the TM system tracking memory accesses in hardware or software structures. For Hardware Transactional

Memory, conflict detection is typically done by extensions to the native cache coherence protocol.

When a conflict is detected, at least one of the conflicting transactions must abort. An aborted 13 transaction must then discard its tentative changes. To facilitate this abort process, transactional memory requires checkpointing and rollback mechanisms to allow transactions to re-start and re- execute. The checkpointing mechanism is in charge of saving the internal state of the core when a transaction starts on that core (i.e., the core’s , stack pointer, internal registers, stack contents) in order to enable roll-back if the transaction aborts. The roll-back mechanism restores the internal core state so that the transaction can be re-started.

Transactional memory has some major advantages.

1. It is relatively easy to program and use, by enclosing critical sections within transactions.

2. It increases performance, since it provides a higher level of concurrency than locks (at least equivalent to fine-grained locks) and eliminates the overhead of lock acquisition.

3. As shown by research [41], TM reduces energy consumption compared to locks, since it reduces contention and eliminates busy-waiting.

4. TM eliminates all possible sources of race conditions, since it guarantees that all transactions run in isolation. TM also avoids deadlocks.

Transactional Memory has been adopted in current designs. Sun’s Rock [42] is an example of a multicore processor implementation that supports some form of best-effort Hardware Transactional Memory. Intel [8] and IBM [9] announced new processors with direct hardware support for speculative transactions and it seems likely that others will follow suit.

Sections 2.2.1.1-2.2.1.3 present the basic components of TM and describe the choices that must be made in TM design.

2.2.1.1 Transactional Memory environments

Transactional memory can be implemented in software, in hardware and through a hybrid approach.

Herlihy and Moss [5] first introduced Hardware Transactional Memory (HTM) as a new mul- tiprocessor architecture intended to make lock-free synchronization as efficient and easy to use as 14 conventional techniques based on mutual exclusion. The basic idea behind this design is that any pro- tocol capable of detecting accessibility conflicts can also detect transaction conflicts at no extra cost, so TM can be implemented by straightforward extensions to any multi-processor cache-coherence protocol. Each processor maintains two primary caches: a regular cache for non-transactional op- erations and a transactional cache for transactional operations. The contents of these caches are exclusive, meaning that an entry can reside in either one or the other, but not both. The transac- tional cache is a small, fully-associative cache with additional logic to facilitate commit and abort.

This cache holds all the speculative data changes without propagating them to the processors or the main memory unless the transaction commits. If the transaction aborts, the cache lines that hold speculative data are invalidated. If it commits, the speculative changes are committed and the lines may be snooped by other processors, written back to memory upon replacement etc. according to the native cache coherence protocol.

To implement the necessary transactional extensions to the cache coherence protocol, each trans- actional cache line is augmented with separate transactional tags indicating a transactional state, in addition to the existing coherence states (INVALID, VALID, DIRTY and RESERVED). These transactional states are: EMPTY (i.e., no data are contained), NORMAL (i.e., the data contained are committed) and XCOMMIT (i.e., the data contained must be discarded upon commit) and

XABORT (i.e., the data contained must be discarded upon abort). Then, transactions put two entries in the cache, one with tag XCOMMIT and one with XABORT. All speculative changes are made to the XABORT entry and when the transaction commits, the entries that are marked

XCOMMIT are set to EMPTY and the entries that are marked XABORT are set to NORMAL.

Upon abort, entries marked as XABORT are set to EMPTY and entries marked as XCOMMIT to

NORMAL.

Figure 2.2 shows an example of how transactional events would be handled for an HTM imple- mentation on top of the MESI cache coherence protocol (as described in [1]). This implementation includes a dedicated Transactional Cache (TC) that holds transactional data and a L1 cache that 15

Figure 2.2: Example of transactional events handling (based on the implementation proposed in [1]). holds non-transactional data. The contents of TC and L1 are mutually exclusive. When a cache line is accessed by a transaction, the cache line in the L1 cache is invalidated and two new copies of the line are created in the TC (as shown in Figure 2.2). One copy stores a backup of the data

(marked as T COMMIT, equivalent to XCOMMIT) and the other contains the speculative changes of the transaction (marked as T ABORT, equivalent to XABORT). The cache line marked with

T ABORT can be modified multiple times within a single transaction (subsequently marking the line as Modified). If during transaction execution no data conflict occurs (i.e., no other core tries to read or write that line), the transaction commits successfully. The transactional cache invalidates the backup copy and keeps the modified copy. The T ABORT bits are reset so now the modified cache line is visible to the rest of the system and can be shared, modified or invalidated as the rest of the data. If during transaction execution, a data conflict occurs (i.e., another core tries to read or write that line) the execution is halted and the copy marked as T ABORT is invalidated. The backup copy is restored and the T COMMIT bits are reset.

Building on the hardware based transactional synchronization methodology proposed by Herlihy and Moss, Nir Shavit and Dan Touitou introduced Software Transactional Memory (STM) [43], a novel design that supports flexible transactional programming of synchronization operations in 16 software. In this design, transactions acquire exclusive ownership of the accessed locations and track the original memory values using software data structures and Load-Linked/Store-Conditional operations. If a transaction succeeds on acquiring the ownership of a memory location, it updates the memory values and then releases the ownership. If it fails, it changes its status to failure and retries later. The decision on which transactions fail and which succeed is made in software, allowing

flexible and fair policies. While HTM systems use data caches to buffer the speculative data changes of transactions, a STM implementation must provide its own mechanism for concurrent transactions to maintain their own views of the heap. Such a mechanism allows a transaction to see its own write set as it runs and allows memory updates to be discarded if the transaction ultimately aborts [44].

2.2.1.2 Choosing between HTM and STM

Multiple TM implementations have been proposed in literature, both in hardware (e.g., [6], [7],

[45–49]), software (e.g., [50–53]) and a combination of both (hybrid solution) (e.g., [54–57]). HTM designs mainly use caches to keep track of the transactional state. However, the limited size of on-chip caches bounds the size of the transactions that can fit inside. This adds a burden to the programmer, to create small enough transactions that can fit the on-chip caches. Usually, transactions are small.

Hammond et al. [6] report that 90% of transactions fit in less than 8 Kbytes for most applications they ran and the rest fit in 64 Kbytes [44]. However, some applications have a very large transaction footprint[44]. Moreover, the design of transactions to fit certain hardware resources reduces the portability of some applications. Some works have proposed unbounded TM systems ([45–47]) that allow transactions of any size and duration to survive context switches, page faults and overflows of resources. While some of these proposals may be attractive for general-purpose systems, they are too complex for today’s embedded systems. Ferri et al. [10] have proposed a design suitable for embedded systems that overcomes this problem without virtualization. This design consists of a single L1 cache structure with limited associativity for storing both transactional and non- transactional data and a small, fully-associative victim cache that handles overflowed and evicted 17 transactional blocks and is powered down when not in use.

In STM on the other hand, the transactional state is tracked in software. As a result STM does not suffer from the resource overflowing problem of HTM. STM can handle transactions of any size and duration and it can survive context switching and page faults. Moreover, because the trans- actional state is not tracked in hardware, STM increases the portability of transactional programs.

Moreover, STM offers more flexibility since it allows the implementation of complex algorithms that would be impractical to implement in hardware. However, STM suffers from two significant issues.

First, even though it has been improved, it is still much slower than hardware approaches [58], because all accesses to the Read and Write sets of the transactions are done atomically in software.

Having a software barrier for bookkeeping operations increases the access latency. The second major problem of STM is that it is not energy efficient. In [59], the authors provide an analysis of the energy costs of a typical STM system. Klein et al. have shown that a state-of-the-art implementa- tion consumes in average more than twice the energy of a locking counterpart [60]. For embedded systems, the need for energy-efficiency, simplicity and good performance makes STM unattractive.

For these reasons, for embedded systems, a hardware-based TM design is more appropriate.

2.2.1.3 Transactional Memory design

Regardless of the specific implementation (hardware, software or hybrid), every Transactional Mem- ory designer has to make some key decisions on the following aspects:

- Conflict Detection, i.e., When and how should a conflict be detected?

- Conflict Resolution, i.e., When and how should a conflict be resolved?

- Data Versioning, i.e., Where should original and speculative data changes be stored?

Regarding Conflict Detection and Conflict Resolution, there are two main policies: eager and lazy conflict detection or resolution. With eager conflict detection, conflicts are detected when they occur (i.e., at the time of the data access). There are several examples of works that use eager conflict 18 detection (e.g., [7], [47] and [49]. The potential problem of this approach is that after a conflict the restarted transactions may abort committing transactions. Consequent conflicts can hurt progress.

On the other hand, in a lazy conflict detection scheme, conflict detection is performed at commit time.

Potential existing conflicts with other transactions are detected only when a transaction attempts to commit. This scheme does not have the progress guarantee problem of the eager detection scheme but it faces a different problem. Since transactions are fully executed until commit time and only then conflicts are detected, this scheme has the drawback of wasted cycles compared to eager conflict detection, since some transactions will continue executing after the conflict actually occurs, only to be aborted when they attempt to commit. The execution of this extra useless work wastes time and power resources.

Similarly to conflict detection, conflict resolution can happen eagerly or lazily. For eager conflict resolution, the decision of which transaction to abort is made immediately when the conflict is detected. In lazy conflict resolution the decision is postponed until commit time. It is obvious that lazy conflict detection cannot co-exist with eager conflict resolution. Lazy conflict resolution has a throughput advantage [61] but the drawback of additional complexity. Hence it can improve the performance of high-conflict workloads. Previous works used eager conflict detection through the cache coherence protocol and lazy resolution via special hardware and software structures (e.g., [48],

[62]). Overall, eager detection/recovery is easier to implement in hardware through standard cache coherence protocols, but tends to favor short transactions over longer ones.

Another important aspect of conflict resolution is the abort policy in deciding which transactions should be aborted upon a conflict. The requestor-abort policy aborts the transaction that requested the data access that caused the conflict. The rationale behind this is that, since all transactions have made some progress before the requestor caused the conflict, the requestor should be the one to abort so that the other transactions can continue to make progress. Another option is to let the requestor proceed and abort all other conflicting cores (i.e., the requestor-wins policy). This choice is more natural to the way cache coherence works. An alternative choice would be to abort all the 19 transactions that conflicted and let them retry speculation again.

Regardless of the chosen abort policy, it is critical that, after a conflict not all transactions are allowed to retry at the same time since this would inevitably result in consecutive aborts. Instead, transactions should delay retrying (or “backoff”) for different randomly chosen times. Many works have used exponential-backoff strategies that increase the backoff time exponentially based on the number of consecutive aborts experienced by each transaction.
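A minimal sketch of such a randomized exponential backoff is shown below (illustrative only; the base delay, the cap and the busy-wait loop are arbitrary choices rather than values used by any design in this thesis).

import java.util.concurrent.ThreadLocalRandom;

// Illustrative randomized exponential backoff: the delay window doubles with
// every consecutive abort, and each transaction waits a random time inside
// that window before retrying, so conflicting transactions spread out.
class Backoff {
    private static final long BASE_DELAY_NS = 1_000;       // arbitrary base
    private static final long MAX_DELAY_NS  = 1_000_000;   // arbitrary cap

    static void backoff(int consecutiveAborts) {
        long window = Math.min(MAX_DELAY_NS,
                               BASE_DELAY_NS << Math.min(consecutiveAborts, 20));
        long delay = ThreadLocalRandom.current().nextLong(1, window + 1);
        long deadline = System.nanoTime() + delay;
        while (System.nanoTime() < deadline) {
            Thread.onSpinWait();   // busy-wait; a real system might sleep instead
        }
    }
}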

As mentioned in Section 2.2.1, Transactional memory requires a means for storing modifications to speculative data while simultaneously keeping copies of the original data, to be able to restore original values in case of conflict. Data Versioning mechanisms determine how this is done. Eager data versioning stores and modifies the speculative data in-place and keeps original data values elsewhere. In this case, the speculative memory system must guarantee a rollback mechanism, usually implemented by means of log structures, to restore the original contents of the memory.

This technique has been used in [45], [7] and variants such as [47] and [49]. Keeping speculative data in-place makes commits faster. Since data are updated in-place, no data broadcast is required upon commit. However, it has two drawbacks. First, upon data writes, an extra overhead has to be paid for the original data to be saved into the log. Second, recovery time during aborts is increased, since a complex roll-back mechanism has to be followed in order to read the logged values and restore them, while other transactions are stalled. Hence, an eager versioning scheme should be avoided when high contention is experienced.
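To illustrate the idea, the following simplified software model of eager versioning (not the hardware mechanism of any cited design) updates values in place while saving the old values to an undo log: commit just discards the log, while abort replays it in reverse.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Simplified software model of eager data versioning: writes go to the
// "memory" in place, and the old value is appended to a per-transaction
// undo log so it can be restored on abort.
class EagerVersioning {
    private final Map<Integer, Integer> memory = new HashMap<>();

    static final class UndoEntry {
        final int address;
        final Integer oldValue;   // null means the address did not exist before
        UndoEntry(int address, Integer oldValue) {
            this.address = address;
            this.oldValue = oldValue;
        }
    }

    private final Deque<UndoEntry> undoLog = new ArrayDeque<>();

    void transactionalWrite(int address, int value) {
        undoLog.push(new UndoEntry(address, memory.get(address))); // save old value first
        memory.put(address, value);                                // then update in place
    }

    void commit() {
        undoLog.clear();          // fast: data are already in place
    }

    void abort() {
        while (!undoLog.isEmpty()) {          // slower: replay the log in reverse
            UndoEntry e = undoLog.pop();
            if (e.oldValue == null) {
                memory.remove(e.address);
            } else {
                memory.put(e.address, e.oldValue);
            }
        }
    }
}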

In contrast, lazy data versioning leaves old copies of transactional data in-place and creates a copy for speculative modifications in other memory locations or transactional buffers. Examples of designs that have used this policy are [5], [6], [45] and [46]. These designs mostly use their caches to store the speculative data and in some cases extra buffers or software structures to handle overflows.

Keeping the original data in their initial location makes the abort scenario very fast, but has the disadvantage of increasing the transaction execution time since extra time is necessary at commit to write the speculative data back to memory. Some of these lazy versioning schemes ([5], [10], [45], [46]) can also efficiently handle commits, by using the cache coherence protocol to keep data consistent at the end of the transaction.

Figure 2.3: Classification of TM designs.

Figure 2.3 shows the classification of various TM implementations based on the key design components discussed above. The speculative execution designs presented in this thesis, are primarily focused on eager conflict detection schemes, offering either lazy or eager data versioning depending on the goals and needs of each specific design.
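For comparison with the eager scheme, lazy versioning can be modeled (again purely as an illustration) with a per-transaction write buffer: old data stay in place, reads inside the transaction consult the buffer first, abort simply drops the buffer, and commit pays the cost of writing the buffer back to memory.

import java.util.HashMap;
import java.util.Map;

// Simplified software model of lazy data versioning: speculative writes go to
// a per-transaction write buffer, so old data stay in place until commit.
class LazyVersioning {
    private final Map<Integer, Integer> memory = new HashMap<>();
    private final Map<Integer, Integer> writeBuffer = new HashMap<>();

    void transactionalWrite(int address, int value) {
        writeBuffer.put(address, value);        // memory itself is untouched
    }

    int transactionalRead(int address) {
        // A transaction must see its own speculative writes first.
        Integer buffered = writeBuffer.get(address);
        return (buffered != null) ? buffered : memory.getOrDefault(address, 0);
    }

    void commit() {
        memory.putAll(writeBuffer);             // slower: copy the buffer back
        writeBuffer.clear();
    }

    void abort() {
        writeBuffer.clear();                    // fast: just drop the buffer
    }
}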

2.2.2 Speculative Lock Elision

Besides Transactional Memory, another researched speculative execution mechanism, is Speculative

Lock Elision (SLE). SLE was introduced in [14] by Rajwar and Goodman, as a micro-architectural hardware speculation technique that dynamically removes unnecessary lock-induced serialization and enables highly concurrent multithreaded execution. In SLE, a processor speculates (perhaps based on annotations) that a block delimited by an atomic read-modify-write operation (such as a test-and-set) and a subsequent store to the same location is a critical section protected by a lock.

The processor then executes that block speculatively, buffering updates. If speculation succeeds, the updates are committed; otherwise they are discarded and the critical section is retried non-speculatively, by actually acquiring and releasing the lock.

SLE can be implemented entirely in the microarchitecture and requires only trivial hardware support. It does not require instruction set changes, coherence protocol extensions or programmer support and, very importantly, it is transparent to programmers. Existing synchronization instructions are identified dynamically. Programmers do not have to learn a new programming methodology and can continue to use well-understood synchronization routines. As a result, legacy (even binary) code can run speculatively without modification.

Even though lock elision was proposed years ago, its main idea entered the mainstream only recently via Intel's Haswell [8], a new processor microarchitecture with direct hardware support for speculative transactions. Using special constructs in software, programmers can specify regions of the code for either transactional memory or speculative lock elision. Haswell is the first Intel processor to feature hardware transactional memory, through its Transactional Synchronization Extensions (TSX). Intel's TSX specification describes how Transactional Memory is exposed to programmers, but the details of the actual TM implementation are not made public. The TSX specification provides two interfaces to programmers: the first is Restricted Transactional Memory (RTM), which is similar to standard TM proposals; the second is Hardware Lock Elision (HLE), whose functionality is very close to the initial SLE proposal of [14]. Both utilize new instructions to take advantage of the underlying TM hardware.

Although the exact details of the lock elision implementation in Haswell have not been released, best-effort speculation on its implementation has been discussed ([63], [64]). HLE uses two new instruction hint prefixes (XACQUIRE and XRELEASE) to denote the region in the code where lock elision can be applied. When a lock acquisition is encountered in the code, the XACQUIRE prefix is inserted to indicate the start of the lock elision region; the lock instruction is added to the read set of a transaction, but the lock is not acquired (i.e., the thread does not write new data to the lock address). This means that other threads can also enter the lock elision region simultaneously and transactionally access shared data. Writing to the lock address during execution of the HLE region will cause an abort. Reads and writes to shared memory that happen within the lock elision region are added to the read and write sets of the corresponding transaction. When the XRELEASE prefix is encountered, the end of the lock elision region has been reached and the transaction attempts to commit. In the event of a conflict, the core restores the internal register state that was saved prior to XACQUIRE and discards any writes to shared memory that happened within the HLE region. The thread will then retry the HLE region, but this time by acquiring the lock normally. This means that once aborted, no speculation retries are allowed right away. Moreover, there is a limit on the number of simultaneous elisions; if this limit is exceeded, additional regions will be executed through standard locking.
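As a concrete (if platform-specific) illustration of the HLE interface, the sketch below elides a simple spin lock using GCC's HLE-annotated atomic builtins, which emit the XACQUIRE/XRELEASE prefixes on TSX-capable x86 processors when compiled with -mhle; it is unrelated to the embedded platforms considered in this thesis, and the lock variable and function names are illustrative.

```c
#include <immintrin.h>   /* for _mm_pause() */

static int lock_var = 0;   /* 0 = free, 1 = held */

void hle_lock(void)
{
    /* XACQUIRE-prefixed exchange: the write to the lock may be elided and the
     * lock address is tracked transactionally instead. */
    while (__atomic_exchange_n(&lock_var, 1,
                               __ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE)) {
        /* Elision failed or the lock is genuinely held: wait politely. */
        while (lock_var)
            _mm_pause();
    }
}

void hle_unlock(void)
{
    /* XRELEASE-prefixed store: ends the elided region and tries to commit. */
    __atomic_store_n(&lock_var, 0, __ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE);
}
```

If the elided region aborts, the hardware transparently re-executes it while actually acquiring the lock, which matches the single-failover behavior described above.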

Pohlack and Diestelhorst [65] evaluated the results of applying lock elision to the Memcached caching system and found that it has great potential for improving throughput. Their lock elision implementation was based on AMD's Advanced Synchronization Facility (ASF) [66], a speculative synchronization architecture similar to transactional memory.

Noting the adoption of lock elision in recent and future processor designs and its potential, this thesis presents a new architecture, Embedded-Spec, that extends the capabilities of these existing techniques by introducing an extra degree of flexibility and with energy efficiency as an additional primary criterion (Chapter 3).

2.2.3 Transactional Lock Removal

Rajwar and Goodman [67] later proposed another transparent speculative synchronization mechanism, called Transactional Lock Removal (TLR). Here, conflicts are resolved using timestamps. When a conflict occurs, the conflicting core with the oldest running transaction wins and proceeds with its transaction, while the others are rolled back and suspended until the winning core commits. At that point, the suspended cores resume and re-execute the critical section speculatively. This way there is no need to transition from speculative to non-speculative execution, improving performance while maintaining transparency.
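A minimal sketch of timestamp-based arbitration in the spirit of TLR is shown below; the data structures and the notion of a global transaction-start clock are illustrative assumptions, not details of [67].

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative timestamp arbitration: on a conflict, the core whose
 * transaction started earliest wins; everyone else rolls back and is
 * suspended until the winner's commit. */
typedef struct {
    uint64_t start_time;   /* taken from a global/logical clock at tx begin */
    bool     suspended;
} tx_state_t;

/* Returns the id of the winning core among the conflicting ones. */
int resolve_conflict(tx_state_t tx[], const int conflicting[], int n)
{
    int winner = conflicting[0];
    for (int i = 1; i < n; i++)
        if (tx[conflicting[i]].start_time < tx[winner].start_time)
            winner = conflicting[i];

    for (int i = 0; i < n; i++)
        if (conflicting[i] != winner)
            tx[conflicting[i]].suspended = true;   /* rolled back and held */
    return winner;
}
```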

Embedded-Spec goes beyond TLR, exploring alternative conflict resolution policies and showing that, for some applications, a timestamp-based priority is not the best choice.

2.2.4 Speculation for Embedded Systems

While hardware and software speculation has been extensively studied for the general-purpose computing domain, it has received less attention in the embedded domain [3, 10–13]. Even though improving throughput remains important for embedded systems, any practical design for such systems must emphasize energy efficiency and low complexity. This means that complex hardware designs that require extensive changes to established protocols, such as cache coherence protocols, will likely be too costly to adopt. Similarly, transactional memory must be easy to use, either by requiring little or no change to 'legacy code' or by being integrated into practical and familiar programming environments.

Ferri et al. proposed Embedded-TM, an energy-efficient HTM design suitable for embedded systems [10]. As mentioned in Section 2.2.1.2, that design consists of a single L1 cache structure with limited associativity for storing both transactional and non-transactional data, plus a small, fully-associative victim cache. A lazy conflict resolution scheme, though complex to implement, improved the performance of high-conflict workloads, while an eager scheme was a better fit for low-conflict workloads.

In a later paper by some of the same authors, an integrated hardware-software transactional memory design for embedded systems was proposed [3]. This scheme includes a hardware transactional memory (HTM) architecture with a dedicated hardware module, the Bloom Module, which handles conflict management and is programmed through low-level primitives. The Bloom Module manages a centralized collection of Bloom filters. It is in charge of snooping transactional data traffic on the bus and detecting conflicts that arise during transactions. The Bloom Module is a departure from much prior work in that it decouples the transactional memory system from the cache coherence hardware. Using a single hardware device to snoop the shared bus reduces design complexity and enhances portability, since this design does not change the CPU hardware.

While existing works on speculative execution for embedded systems offer energy-efficient and low-complexity solutions, there are some issues that still need to be addressed. We describe these limitations next.

a) Transparency

While the described designs for speculative execution on embedded systems are energy-efficient and simple, they are not transparent to the programmer, who must program using special transactional instructions that enable speculation. Transparency is attractive because it means that the energy efficiency of legacy code can be improved, including code whose structure may be poorly understood. Moreover, any speculative execution mechanism can be applied to code written by non-specialist programmers. Finally, transparency implies that the same code will work correctly, if less efficiently, on platforms that do not support speculation. While it might not be reasonable to ask programmers to restructure code to exploit speculation, it is reasonable to allow programmers to annotate code (for example, "this block is a good candidate for speculation"), perhaps with the help of a profiler or static analyzer.

In the first part of this thesis (Chapter 3), new methods are explored for applying speculation to legacy code without requiring special supporting instructions in software. The Bloom Module hardware introduced in [3] is utilized for that purpose, to support data conflict detection and resolution without altering the cache coherence protocol. The proposed architecture (Embedded-Spec) is based on this hardware and is completely transparent to the programmer.

Apart from offering transparency, Embedded-Spec also goes beyond other existing transparent speculative mechanisms such as SLE, TLR and the Haswell HLE design in several ways. In Haswell and the original SLE proposal, a failed speculation immediately restarts non-speculatively; there are no alternative failover mechanisms or policies. Embedded-Spec offers flexible contention management (conflict resolution) alternatives, including alternatives to TLR's timestamps. Moreover, [14] and [65], like most work in this area, are concerned with general-purpose platforms, not with embedded systems. Hence, they target improved throughput, not energy efficiency. Since Embedded-Spec targets embedded systems, the energy-delay product (EDP) becomes the principal figure of merit.

b) Scalability and Cache-Coherence

As the demand for more compute-intensive capabilities in embedded systems increases, multi-core embedded systems are evolving into many-core systems in order to achieve improved performance and energy efficiency, similar to what has happened in the high-performance computing (HPC) domain. The memory is then shared across these multiple cores; however, the specific memory organization of these many-core systems has a significant impact on their potential performance. For small-scale systems, shared-bus architectures are attractive for their simplicity, but they do not scale. For large-scale systems, architects have embraced hierarchical structures, where processing elements are grouped into clusters, interconnected by a scalable medium such as a network-on-chip (NoC). In such systems [15–17], cores within a cluster communicate efficiently through common on-chip memory structures, while cores in different clusters communicate less efficiently through bandwidth-limited, higher-latency links. Architectures that provide the programmer with a shared-memory abstraction, but where memory references across clusters are significantly slower than references within a cluster, are commonly referred to as non-uniform memory access (NUMA) architectures.

When designing speculative synchronization mechanisms for embedded devices, it is essential to keep both the underlying hardware and the software interface simple and scalable. Of the works mentioned above on speculative execution for embedded systems, some considered only shared-bus, single-cluster architectures [3, 10]. While popular for their simplicity, such bus-based architectures are inherently not scalable, because the bus becomes overloaded when shared by more than a handful of processors. As a result, it is necessary to rethink the design of speculative mechanisms for scalable, cluster-based embedded systems where inter-cluster communication is restricted.

Some designs exist for hardware transactional memory based on network-on-chip (NoC) communication. The IBM transactional memory mechanism [9] is intended for a clustered architecture. Kunz et al. [13] have proposed a LogTM [7] implementation of HTM on a NoC architecture, and Meunier and Petrot [12] have described a novel embedded HTM implementation based on a write-through cache coherence policy. While [13] and [12] propose speculation mechanisms for many-core embedded NoC systems, they build on top of an underlying cache coherence protocol and rely on it for detecting read/write data conflicts. Based on the belief that cache coherence will become more and more unwieldy as cluster sizes grow, the focus of this thesis shifts towards designing speculative synchronization mechanisms that do not rest on inherently unscalable foundations.

For high-end many-core cluster-based embedded systems that are subject to NUMA costs, memory organization is the single most far-reaching design decision, both in terms of raw performance and (more importantly) in terms of programmer productivity. In order to meet stringent area and power constraints, the cores and memory hierarchy must be kept simple. In particular, scratchpad memories (SPMs) are typically preferred to hardware-managed data caches, which are far more area- (40%) and power-hungry (34%) [1]. Several many-core embedded systems have been designed without caches and cache coherence (the Epiphany IV processor from Adapteva [19], STMicroelectronics p2012/STHORM [18] and the Plurality Hypercore Architecture Line (HAL) [68] are some examples). These kinds of platforms are becoming increasingly common.

Implementing speculative synchronization in embedded systems that lack cache-coherence support is particularly challenging, since hardware speculative techniques traditionally rely on the underlying cache-coherence protocol to synchronize memory accesses among the cores and manage read/write data conflicts. For these cacheless systems, a completely new approach is necessary for handling speculative synchronization. The lack of cache coherence brings major challenges in the design of HTM support, which needs to be designed from scratch. At the same time, though, the lack of cache coherence provides a simpler environment. Building on such an environment, a self-contained HTM design that does not rely on an underlying cache coherence protocol can be created from scratch, resulting in a more lightweight solution compared to existing ones. The second part of this thesis (Chapter 4) presents the first implementation of such HTM support within a many-core embedded system that lacks an underlying cache-coherence protocol. As described later, this implementation requires explicit data management and implies a fully-custom design of the transactional memory support. While this implementation is limited to single-cluster accesses, the proposed scheme is designed to be scalable and can be extended to multiple clusters.

2.2.5 Error-resilient and energy-efficient execution on embedded systems

As discussed so far, in embedded systems design it is important to aim for energy efficiency, high performance and low complexity. While these factors remain very important, one crucial issue is becoming increasingly pressing: reliability.

Scaling of physical dimensions in semiconductor devices has opened the way to heterogeneous embedded SoCs integrating host processors and many-core accelerators in the same chip [69], but at the price of ever-increasing static and dynamic hardware variability [70]. Spatial die-to-die (D2D) and within-die (WID) static variations ultimately induce performance and power mismatches between the cores in a many-core array, introducing heterogeneity in a nominally homogeneous system (i.e., between formally identical processing resources). On the other hand, dynamic variations depend on the operating conditions of the chip, and include aging, supply voltage drops and temperature fluctuations. The most common consequence of variations is path delay uncertainty. Circuit designers typically use conservative guardbands on the operating frequency and/or voltage to ensure safe system operation. These guardbands lead to loss of operational efficiency, since they waste energy and limit performance. When the guardbands are reduced (either by increasing the frequency or decreasing the voltage), or when the system is aggressively operated far from a safe point, the delay uncertainty manifests itself either as intermittent timing errors [2], [21] or, even worse, as a critical operating point (COP) [20].

Figure 2.4: Percentage error rate versus supply voltage for intermittent timing errors and the Critical Operating Point.

Intermittent timing errors may ultimately cause erroneous instructions with wrong outputs being stored or, worse, incorrect control flow. The COP defines a voltage and frequency pair at which a core is error-free. If the voltage is decreased below (or the frequency is increased beyond) the COP, the core will face a massive number of errors [20]. Specifically, according to the 'COP Hypothesis', in large CMOS circuits there exists a critical operating frequency Fc and a critical voltage Vc for a fixed ambient temperature T such that:

• Any frequency above Fc causes massive errors.

• Any voltage below Vc causes massive errors.

• For any frequency below Fc and voltage above Vc, no process-related errors occur.

In practice, Fc and Vc are not single points, but are confined to an extremely narrow range for a given ambient temperature Tc. In principle, the COP could be determined for a particular chip after its production, and the most efficient yet safe voltage/frequency pair for the chip could be configured at that time. However, due to static and dynamic variations, the COP may actually change over space and time. As a result, the "safe" operating point may i) differ from one core to another (forcing the entire chip to be conservatively tuned to meet the requirements of the most critical core) and ii) suddenly become unsafe due to aging, temperature fluctuations or voltage drops. The COP effect is highly pronounced in well-optimized designs [71], [72]. According to the COP error model, once the critical voltage is crossed (for a given frequency level), massive errors emerge and the voltage needs to be immediately increased back to a safe level. Intermittent timing errors, however, follow a different trend (Figure 2.4): as the voltage is scaled down, they start emerging gradually at a very low error rate, which then increases exponentially as the voltage is scaled down further. There is thus a range of usable voltage levels before the point of massive instruction failures is reached.

Timing errors can also be classified into critical and non-critical errors. Non-critical errors are those that originate from timing delays along the datapath (e.g., the multiplier) of the processor pipeline and can ultimately result in incorrect data being stored in memory. On the other hand, critical errors are errors that take place in the control part of the processor pipeline (e.g., instruction fetch/decode). These types of errors can break the original control flow of the program and prevent any software-based solution from taking control. While non-critical errors can be detected and corrected in a "lazy" manner, critical errors need to be detected and corrected immediately, since they can lead to catastrophic failures. For some applications that can tolerate approximate computations, non-critical errors can be ignored, since it may be more cost-effective to proceed with inexact data instead of correcting them. For most applications, which need accurate results, timing errors need to be detected and corrected.

Noting the energy efficiency and throughput that transactional memory brings to data synchronization for embedded systems, this thesis explores an alternative use of transactional memory: as an energy-efficient recovery mechanism from all types of timing errors. The third and last part of this thesis proposes an integrated HW/SW scheme that is based on HTM and addresses both intermittent timing errors and the COP. This scheme enables the system to operate at highly reduced supply voltage margins in order to save energy. The following section (Section 2.2.5.1) presents existing work on error detection and correction and discusses how this thesis contributes to this field.
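To make the idea concrete, the following conceptual sketch shows how HTM-style checkpoint/rollback could wrap a block of work while the supply voltage is scaled; every function and constant in it is a hypothetical placeholder and it is not the interface actually proposed in Chapter 5.

```c
#include <stdbool.h>

/* Conceptual sketch only. tx_begin(), tx_commit_ok(), errors_detected(),
 * run_block() and set_voltage() are hypothetical hooks. */
extern void tx_begin(void);          /* checkpoint register/memory state      */
extern bool tx_commit_ok(void);      /* try to commit; false if aborted       */
extern bool errors_detected(void);   /* HW error-detection flag for the block */
extern void run_block(void);         /* the work to execute speculatively     */
extern void set_voltage(int mv);

#define V_SAFE_MV 1100               /* conservative, guardbanded level (assumed) */
#define V_MIN_MV   800
#define V_STEP_MV   25

static int v_mv = V_SAFE_MV;         /* current operating voltage (persists) */

void resilient_run_block(void)
{
    for (;;) {
        set_voltage(v_mv);
        tx_begin();                  /* lightweight checkpoint               */
        run_block();
        if (!errors_detected() && tx_commit_ok()) {
            if (v_mv > V_MIN_MV)     /* error-free: probe a lower voltage    */
                v_mv -= V_STEP_MV;
            return;
        }
        /* Errors or an abort: the speculative state is discarded; retry the
         * same block at a higher (safer) voltage to guarantee progress. */
        v_mv += V_STEP_MV;
        if (v_mv > V_SAFE_MV)
            v_mv = V_SAFE_MV;
    }
}
```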

Figure 2.5: Pipeline augmented with Razor latches and control lines (taken from [2]).

2.2.5.1 Existing work on error detection and correction

Many circuit-level error detection and correction (EDAC) techniques that continuously monitor path delay variations [2, 21] have been proposed. When an error is detected, a recovery technique is enabled that prevents the erroneous instruction from corrupting the architectural state. While these techniques can ensure correct system behavior, they impose substantial error-recovery costs both in terms of energy and execution time. For example, recovery techniques such as instruction replay at half clock frequency or multiple-issue instruction replay are used to correct an errant instruction. Multiple-issue instruction replay incurs the cost of flushing the pipeline and executing N+1 replicas of the instruction (N being the number of pipeline stages). These substantial error-recovery costs make such solutions unsuitable for many-core chips operating at near-threshold voltage [73] to save power. In addition, while EDAC techniques can handle sporadic errors, they cannot deal with the "all-or-nothing" effect of the COP.

Razor [2] is one of the most popular approaches to dynamic voltage scaling (DVS) based on dynamic detection and correction of circuit timing errors. Razor uses delay-fault detection circuitry for detecting errors. Specifically, a Razor flip-flop is introduced in the processor's pipeline stages that double-samples the pipeline stage values, once with a fast clock and once with a time-borrowing delayed clock, as shown in Figure 2.5. The latch values sampled with the fast clock are then validated through a metastability-tolerant comparator. If the values differ, a timing error is detected and a modified pipeline mispeculation recovery mechanism is activated to restore the correct program state. Razor has been incorporated into several processor designs (e.g., [74–76]). Other works, such as [77], have also proposed Error Detection Sequential (EDS) circuitry for delay-fault detection. An alternative to EDS circuits are tunable replica circuits (TRCs) [78]. TRCs have the advantage of being completely separate from the processor pipeline and offer a less intrusive error detection technique that does not affect critical-path timing. Bowman et al. propose a 45 nm resilient design that includes both EDS- and TRC-based error detection mechanisms [21]. This design also supports the two error-recovery mechanisms mentioned previously, instruction replay at half clock frequency and multiple-issue instruction replay, enabling correction of timing errors from fast-changing variations such as high-frequency voltage droops. Other prior works have focused on sensor-based circuits, which are less appropriate for fast-changing variations [79–81]. These circuits can be used to determine when conditions may be right for reducing the voltage with a low risk of incurring failure.

Software techniques can be effective at providing energy-efficient robustness to errors by exposing variability at lower levels of the software stack. However, early approaches focus on coarse-grained tasks [82], [83], lack generality (as they call for custom programming methodologies) or are only suitable for a specific class of approximate-computing programs, in addition to imposing high recovery-cycle overhead. A more recent approach based on OpenMP extensions by Rahimi et al. [84] has shown good potential for reducing the recovery cost incurred by HW-based error-correction techniques. The approach proposed in this thesis (Chapter 5) has some key differences. First, it can deal both with sporadic timing errors (like [84]) and with systematic, COP-like error models. To the best of our knowledge, this approach is the first to combine SW and HW techniques for dealing with the COP. Second, the approach by Rahimi et al. [84] requires error detection and correction (multiple-issue instruction replay) in the HW, as the SW technique alone cannot guarantee complete reliability.

Other works have utilized transactional memory for error recovery. The IBM Power8 architecture [85] provides support for recovery-only transactions in hardware, but does not target energy savings.

The authors of [86] proposed transaction encoding, a software implementation that combines encoded processing for error detection and TM for error recovery. While this design uses TM for checkpointing and rollback as we do, it offers a pure software solution, uses encoded processing for error detection, and does not address energy efficiency. FaulTM-multi [87] is an HTM-based fault detection and recovery scheme for multi-threaded applications, with relatively low performance overhead and good error coverage. However, it does not target reduced energy consumption, which is central to the implementation proposed here. Yalcin et al. studied how combining different error detection mechanisms and TM could potentially improve energy efficiency, but they did not provide an actual implementation [88]. To the best of our knowledge, the mechanism proposed in this thesis (Chapter 5) is the first to provide a full-fledged HTM implementation for error-resilient execution that specifically targets energy savings.

Chapter 3

Energy-efficient and transparent speculation on embedded MPSoC

The transition of embedded systems towards multicore architectures promises an improvement in power-performance scalability. However, this promise can be realized only if applications can extract a high enough level of concurrency at low energy cost. Locks are typically used to guarantee memory consistency in shared-memory programs. However, locks can limit concurrency and therefore hurt performance. They can also be costly in terms of energy. By contrast, speculative execution approaches such as Transactional Memory, Speculative Lock Elision (SLE), and Transactional Lock Removal (TLR) [5], [89], [14], [67], which detect conflicts dynamically, promise both to improve performance and to save energy.

While speculative execution has been extensively studied for the general-purpose computing domain, it has attracted less attention in the embedded domain (e.g., [10], [3], [13], [12]). However, any practical design for embedded systems must emphasize transparency, low complexity and energy efficiency.


This chapter describes Embedded-Spec, an energy-efficient embedded architecture that supports transparent speculation ("lock elision") at the software level through an underlying hardware transactional memory. Embedded-Spec makes the following contributions. First, unlike most existing works on speculative synchronization, it focuses on energy efficiency as well as throughput, since both are key constraints for embedded systems. Specifically, energy-delay product (EDP) is evaluated as a figure of merit that captures the trade-off between these two properties. Second, since it targets embedded platforms that are highly resource-constrained, Embedded-Spec focuses on simplicity. It proposes the addition of simple hardware structures that avoid changes to the underlying cache coherence protocol but leave the flexibility to vary how synchronization conflicts are detected, how they are resolved (contention management) and which policy to use for switching between speculative and non-speculative execution. Last, Embedded-Spec offers a fully transparent solution for speculative execution of locks. This means programmers can take full advantage of the underlying speculative hardware support even when running code written using traditional locks.

The proposed architecture is presented and evaluated through a range of benchmarks written with standard locks, and a range of contention management and retry policies is explored. Experiments demonstrate that for resource-constrained platforms, lock speculation can provide real benefits in terms of improved concurrency and energy efficiency, as long as the underlying hardware support is carefully configured.

3.1 Embedded-Spec: Speculative Memory Design

As mentioned in Chapter 2, when designing speculative memory we need to make some key decisions on conflict detection, conflict resolution and data versioning. The original HTM design introduced by Herlihy and Moss [5], and later HTM works for embedded systems such as [10], utilized the cache coherence protocol to assist in managing data consistency and detecting conflicts. These works required extensive changes to the cache coherence protocol to guarantee conflict detection and resolution.


Figure 3.1: Logic for Transactional Management used in Embedded-Spec. The architectural con- figuration is taken from [3]. The dark blocks show the additional hardware required. That is, the Tx bit for each line of the data cache to indicate if the data is transactional, the Tx logic in the cache controller to handle transactional accesses, and the Bloom module to detect and resolve conflicts.

Later works such as [3] and [47] proposed solutions that decouple the HTM design from the cache coherence protocol. In particular, Ferri et al. [3] proposed an HTM design that requires only a few modifications to the cache system plus a dedicated hardware module, the Bloom Module, for conflict detection and resolution. The use of a separate external module for conflict detection and resolution alleviates the need for extensive changes to the cache coherence protocol, thus simplifying cache coherence logic and reducing the number of tag bits required in the caches. Moreover, multiple conflict recovery schemes can be implemented more easily, since the conflict management decisions are no longer made by each core individually but by a single separate module. Using the Bloom Module, we can enable dynamic selection of conflict resolution policies during execution, based on the characteristics of our applications and the observed abort rate. Motivated by the flexibility and low complexity offered by the Bloom Module design of [3], we adopt the same design for conflict detection and management in Embedded-Spec.

Fig. 3.1 shows the transactional management architecture with the use of the Bloom Module. The dark blue blocks indicate the three additional hardware components that are necessary for transactional management:

1. A new state bit (called the Tx bit) for each line of the data cache, which defines whether the data contained in the line is transactional or not.

2. New logic in the cache controller that handles the new transactional accesses.

3. The external Bloom Module.

By borrowing this design and using it for speculative execution in our different versions of Embedded-Spec, we keep the hardware design simple.

As implemented in [3], transactional events are triggered through regular read/write operations on memory-mapped registers. For example, a transaction is started by writing to a special register in the Bloom Module. When the cache controller detects that write, it sets an internal bit to enable the transactional logic. The transactional logic of the cache controller has to carry out some extra operations while a transaction is executing, as shown in Figure 3.2. Note that the controller has to handle two special cases.

• The first one occurs when a line that is accessed transactionally is already in the cache before starting the transaction. In this case, the data would not be retrieved from the L2 memory, hence the Bloom Module would not have the chance to snoop this access on the bus to include it in the transaction's read and write sets. The cache controller therefore carries the extra responsibility of issuing a bus access to notify the Bloom Module that the cache line is being accessed transactionally.

Figure 3.2: Modifications to the cache coherence protocol for transactional accesses. The gray block indicates the added operations. Note: The TX decision diamond denotes whether the Tx bit is already set or not.

• The second special case occurs when a line that is accessed transactionally is being replaced inside the cache. In this case, the cache controller will perform a transaction overflow so that the transaction is able to complete as an overflowing transaction, while the rest of the cores are stalled until it completes.

Note that these small additions do not significantly change the cache coherence protocol, and the transactional logic of the cache controller is responsible for handling them. Moreover, they do not affect the bus protocol itself. Modifications to this design specific to Embedded-Spec are discussed in Section 3.2.1.
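As a concrete illustration of the memory-mapped programming style described above, the sketch below wraps transaction start/end and abort polling in plain register accesses; the base address, register offsets and the convention of writing the core ID are hypothetical, not those of the actual Bloom Module.

```c
#include <stdint.h>

/* Illustrative only: transactional events are triggered by regular loads and
 * stores to memory-mapped registers. All addresses below are hypothetical. */
#define BLOOM_BASE        0x9000F000u
#define BLOOM_TX_START    (*(volatile uint32_t *)(BLOOM_BASE + 0x00))
#define BLOOM_TX_END      (*(volatile uint32_t *)(BLOOM_BASE + 0x04))
#define BLOOM_ABORT_FLAG  (*(volatile uint32_t *)(BLOOM_BASE + 0x08))

/* Starting a transaction is just a store; the cache controller snoops the
 * write and enables its transactional logic. */
static inline void tx_start(uint32_t core_id) { BLOOM_TX_START = core_id; }
static inline void tx_end(uint32_t core_id)   { BLOOM_TX_END   = core_id; }
static inline int  tx_aborted(void)           { return BLOOM_ABORT_FLAG != 0; }
```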

For Embedded-Spec we chose to adopt eager conflict detection and resolution (i.e., conflicts are detected and resolved immediately after they occur) and a requestor-wins abort policy (i.e., when a conflict occurs, the requestor core proceeds and all other conflicting cores are aborted), since these choices map most naturally onto the cache coherence protocol. We also use a lazy data versioning scheme that takes as its baseline the design proposed in [3]. The next section presents the basic components of the target architecture for Embedded-Spec.

3.2 Architecture

Embedded-Spec is based on the same target architecture as [3], illustrated in Figure 3.3. We use MPARM [90], a cycle-accurate multi-processor simulator developed for embedded system design space exploration, to model and simulate our architecture. It features a configurable number (up to 8) of RISC-like cores, interconnected through a shared bus (AMBA). Each core has private L1 instruction and data caches, kept coherent through a MESI coherence protocol by per-core snoop devices.

The shared memory is a two-level, partitioned global address space (PGAS) hierarchy. Specifically, MPARM simulates an architecture that encompasses distinct physical memory banks, globally visible throughout the system. Each core has a small L1 local scratchpad (SPM), accessible without traversing the system interconnect. Remote SPMs can also be accessed directly through the bus, but at the cost of higher latency. The overall L1 shared memory is the union of the SPMs, and it is globally non-coherent: its addresses are not cacheable, and it is explicitly managed by software. L2 shared memory physically consists of a single device, logically partitioned into a large shared segment, plus small "private" segments for each core. Addresses belonging to the logically shared chunk are cacheable and globally coherent. The private segments are also cacheable, but their addresses are not involved in coherence traffic.

Figure 3.3: Architecture overview, as proposed in [3].

Non-speculative synchronization is supported by a fixed set of architectural hardware locks drawn from a pre-allocated section of memory, the semaphore memory, and accessible by standard synchronization calls such as Test(), TestAndSet(), and Release().

3.2.1 The Bloom Module Hardware

The Bloom Module [3] is an external, signature-based hardware component that is in charge of conflict detection and resolution. It monitors all transactional accesses, records them as per-core signatures, and notifies the CPUs when data conflicts occur. As explained in more detail below, the Bloom Module used here departs from prior designs by also snooping on the semaphore memory.

To support Embedded-Spec, we extended the Bloom module's control logic as well as its individual Bloom filters to make it aware of the architecture-supported hardware locks.

Figure 3.4: (a) Overview of the Bloom Module. (b) Internal details of a core Bloom Filter Unit (BFU). Taken from [3].

Fig. 3.4 shows an overview of the Bloom Module and the internal details of a core Bloom Filter Unit (BFU). The Bloom module has the following functional blocks:

• Snooping Shared Memory Address: Snoops the shared memory address space to keep track of the addresses accessed during speculative execution.

• Bloom Filters: Per-core signatures corresponding to the read and write addresses accessed during speculative execution.

• Control Logic: Implements the features needed to manage communication with the cores (i.e., the abort and hold signals). It also manages the abort policies and handles cache overflow.

• Snooping Semaphore Memory Address: Snoops traffic to and from the hardware locks. It detects Test(), TestAndSet() and Release() calls and their responses.

• SLE Registers: Per-core registers to keep track of the core status (i.e., which core is in speculative mode on which hardware lock and which core has ownership of a specific lock). These registers are kept updated by the Snooping Semaphore Memory Address block.

• Hold Queue List Registers: Per-core registers that keep the list of aborted cores that need to be released at commit time. These registers are used only with Embedded-LR.

Communication between the cores and the Bloom Module is handled via interrupts and read/write memory operations (no extra wires are required). A small memory space (approximately 256 bytes) is reserved for programming registers, used to program specific functionalities of the Bloom Module at run time. For example, the commit priority is set using reads and writes to specific registers in this set. As seen in Fig. 3.4, each core has a Bloom Filter Unit (BFU) consisting of K read-write pairs of simple Bloom filters. Instead of setting multiple bits in one large filter, our design sets a single bit in each of K small Bloom filters. This parallel Bloom filter design limits the required hardware. Empirical experimentation in [3] showed that K=4 provides the best power/performance tradeoff. Moreover, the hash functions were designed with delay and power as the main criteria, and it was thus decided to implement the hash function as a single level of two-input XORing of lower-order address bits.1
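For concreteness, the following C sketch mimics the behavior of a per-core Bloom Filter Unit with K=4 parallel filters; the filter size and the exact hash wiring are illustrative assumptions rather than the parameters used in [3].

```c
#include <stdint.h>
#include <stdbool.h>

#define K            4        /* parallel filters per set, as in [3]    */
#define FILTER_BITS  512      /* illustrative size, not from the thesis */

/* One read set and one write set, each made of K small Bloom filters. */
typedef struct {
    uint64_t rd[K][FILTER_BITS / 64];
    uint64_t wr[K][FILTER_BITS / 64];
} bfu_t;

/* Rough stand-in for a single level of two-input XORs on low-order bits. */
static unsigned bfu_hash(uint32_t addr, int k)
{
    uint32_t low = (addr >> 2) ^ (addr >> (3 + k));   /* word address, skewed per filter */
    return low % FILTER_BITS;
}

static void set_bit(uint64_t *f, unsigned bit)       { f[bit / 64] |= 1ULL << (bit % 64); }
static bool get_bit(const uint64_t *f, unsigned bit) { return (f[bit / 64] >> (bit % 64)) & 1; }

/* Record a transactional access in this core's signature: one bit per filter. */
void bfu_insert(bfu_t *b, uint32_t addr, bool is_write)
{
    for (int k = 0; k < K; k++)
        set_bit(is_write ? b->wr[k] : b->rd[k], bfu_hash(addr, k));
}

/* A signature hit requires all K filters to agree (Bloom-filter semantics).
 * A remote write conflicts with our read or write set; a remote read
 * conflicts only with our write set. */
bool bfu_conflicts(const bfu_t *b, uint32_t addr, bool remote_is_write)
{
    bool rd_hit = true, wr_hit = true;
    for (int k = 0; k < K; k++) {
        unsigned bit = bfu_hash(addr, k);
        rd_hit &= get_bit(b->rd[k], bit);
        wr_hit &= get_bit(b->wr[k], bit);
    }
    return remote_is_write ? (rd_hit || wr_hit) : wr_hit;
}
```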

3.3 The Embedded-Spec Algorithms

This section describes the two variations of the Embedded-Spec architecture:

1. Embedded-LE (Embedded Transparent Lock Elision): The critical section is executed speculatively by eliding the lock. The Bloom module monitors memory accesses, and if there is a data conflict, it directs the conflicting cores to roll back their speculative executions and contend for the lock. One will succeed, and the rest will spin until the winner releases the lock. When the lock is released, the waiting cores retry their speculative executions. If the number of retries for a specific transaction (due to repeated conflicts) exceeds a threshold, the cores revert to non-speculative execution for that instance of the transaction. When the end of the critical section is successfully reached, the number of retries is reset to zero.

1 This was based on a previous finding that the lower-order bits of an address are characterized by more randomness than the higher-order bits.

2. Embedded-LR (Embedded Transparent Lock Removal): As with Embedded-LE, the critical section is executed speculatively by eliding the lock, but in case of a data conflict, the Bloom module directs all conflicting cores but one (the winning core) to roll back and suspend execution until the active core completes the critical section. When the winner completes, the suspended cores resume speculative execution, so a lock never needs to be explicitly acquired.

Embedded-LE supports two contention management policies. The requestor-abort policy aborts only the core requesting the conflicting address, and the abort-all policy aborts all cores executing the same critical section. The second policy is motivated by the observation that once a core abandons speculation and tries to acquire the lock, it is highly likely to eventually force the other cores in the same critical section to abort.

Another variation of Embedded-LE is also examined, in which cores suspend execution in a low-power idle mode instead of spinning when waiting for a lock. This approach (called Embedded-LE-Sleep) saves power but increases latency (by 2 ms). This option is similarly available for Embedded-LR, which we call Embedded-LR-Sleep. Finally, we explore the effects of allowing aborted cores to attempt to elide the lock more than once before resorting to lock mode, by setting a parameter max number of retries.

In Embedded-LR, cores never switch from speculative to non-speculative execution. Unlike "best-effort" HTMs, Embedded-LR guarantees that every transaction eventually commits, so Embedded-LR is not subject to starvation. Embedded-LR supports two abort policies: timestamp and priority-abort. For the timestamp policy, the core with the earliest timestamp is allowed to proceed, whereas in the priority-abort policy, each core has a priority that is increased when it is rolled back and, in case of conflict, the higher-priority transaction proceeds. Table 3.1 summarizes all the possible configurations for Embedded-LE and Embedded-LR. The two algorithms are discussed in more detail next.

Configuration    Abort Policy                         # retries          Sleep
EMBEDDED-LE      1) requestor-abort  2) abort-all     0, 1, 2, ..., ∞    Yes/No
EMBEDDED-LR      1) timestamp  2) priority-abort      N/A                Yes/No

Table 3.1: EMBEDDED-SPEC — All Configurations.

3.3.1 Embedded-LE

Figure 3.5 shows the flowchart of the Embedded-LE algorithm, which is implemented in middleware using API function calls and hardware lock instructions (i.e., Test(), TestAndSet()).

This algorithm (Figure 3.5) is called when a core tries to enter a critical section protected by a lock. For example, when core X tries to enter a critical section, it first checks whether the maximum number of retries has been exceeded. If it has, the lock cannot be elided and the critical section must be executed in the usual way, by acquiring the lock with a call to TestAndSet(). If the number of retries has not been exceeded, the algorithm can elide the lock, but must first check that the lock is actually free (with a call to Test()). If Test() indicates that the lock is busy, the core spins until Test() indicates that the lock is free. Once the lock is free, it is elided and the critical section is executed speculatively.

During speculative execution, the Bloom module detects and resolves data conflicts by aborting one or more cores. When an abort occurs, the Tx abort register is updated to indicate the abort. Each core X calls the Check_Abort() function to determine whether it was aborted. If not, it proceeds speculatively. Otherwise, the core terminates the speculative execution and calls TestAndSet() once to try to acquire the lock. If it succeeds, the core proceeds non-speculatively. Otherwise, the core returns to the mispeculation-count check to determine whether it should continue to speculate or fall back to locking. Either way, the core spins until the lock is freed. If core X eventually reaches the end of its critical section, it checks its own execution mode by calling Check_In_Transaction(), to determine whether it has been running in locking mode or speculative mode, in order to either release the lock or end the speculative execution.
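The following simplified C sketch summarizes this middleware path. The primitives follow the names used in the text, but their signatures, the retry counter and the constants are illustrative assumptions, not the actual Embedded-Spec middleware.

```c
/* Primitives named in the text; exact signatures are assumed here. */
extern int  Test(int lock_id);              /* nonzero if the lock is free      */
extern int  TestAndSet(int lock_id);        /* nonzero if the lock was acquired */
extern void Release(int lock_id);
extern void Start_Transaction(void);        /* write to a Bloom Module register */
extern void End_Transaction(void);
extern int  Check_Abort(void);              /* nonzero if this core was aborted */
extern int  Check_In_Transaction(void);     /* nonzero if running speculatively */

#define MAX_RETRIES 4                       /* illustrative threshold           */
#define NUM_CORES   8
static int retries[NUM_CORES];

/* Entry: either elide the lock and run speculatively, or fall back to locking. */
void le_enter(int core, int lock_id)
{
    if (retries[core] > MAX_RETRIES) {       /* too many mispeculations          */
        while (!TestAndSet(lock_id))         /* spin until the lock is acquired  */
            ;
        return;                              /* LOCKING MODE                     */
    }
    while (!Test(lock_id))                   /* wait until the lock is free      */
        ;
    Start_Transaction();                     /* elide the lock: TRANSACTIONAL MODE */
}

/* Called after Check_Abort() reports that the Bloom Module aborted this core. */
void le_on_abort(int core, int lock_id)
{
    retries[core]++;
    if (TestAndSet(lock_id))                 /* one attempt to grab the lock     */
        return;                              /* proceed non-speculatively        */
    le_enter(core, lock_id);                 /* otherwise retry elision or lock  */
}

/* Exit: commit the speculation or release the lock, then reset the counter. */
void le_exit(int core, int lock_id)
{
    if (Check_In_Transaction())
        End_Transaction();
    else
        Release(lock_id);
    retries[core] = 0;
}
```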


Figure 3.5: The flowchart of the Embedded-LE algorithm.

3.3.2 Embedded-LR

Embedded-LR requires extensions to the Bloom module, and small changes to the middleware, replacing each lock acquisition with a start transaction instruction. Once a core starts speculation, it will never try to acquire the lock. When a conflict is detected, the losing cores are suspended.

We note that the Embedded-LR algorithm does not require much support at the middleware level, since the Bloom Module is already present at the hardware level. The only required feature at the middleware level is starting a new transaction instead of acquiring the lock. Even in the event of a mispeculation, the lock will not be acquired. The idea behind this is to allow at least one core to complete the critical section. In case of a conflict, the core or cores that have been selected to stop (i.e., the "losing" cores) are aborted and put in a hold state by the Bloom module. When the "winning" core completes execution of its critical section, the core or cores kept in the hold state are released and allowed to retry the critical section speculatively. To track suspended cores, the Bloom module is extended with a per-core hold queue list. When a core is aborted, its CoreID is added to the winning core's hold queue list register. We also append to the winning core's hold queue list register the hold queue lists of the aborted cores. When the winning core commits, every CoreID in its hold queue list register is released and the list is cleared.
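A minimal sketch of this hold-queue bookkeeping is given below; the data layout and hook functions are illustrative, since in Embedded-LR this state lives in the Bloom Module's registers rather than in software.

```c
#include <stdint.h>

#define NUM_CORES 8

/* Hypothetical hook into the hardware hold/release signal. */
extern void release_core(int core);

typedef struct {
    uint8_t held[NUM_CORES];   /* core IDs waiting on this core's commit */
    int     count;
} hold_queue_t;

static hold_queue_t hq[NUM_CORES];

/* On a conflict, the loser is suspended and queued behind the winner,
 * together with every core already queued behind the loser. */
void lr_on_abort(int winner, int loser)
{
    hq[winner].held[hq[winner].count++] = (uint8_t)loser;
    for (int i = 0; i < hq[loser].count; i++)
        hq[winner].held[hq[winner].count++] = hq[loser].held[i];
    hq[loser].count = 0;
}

/* On commit, release every queued core so it can retry speculatively. */
void lr_on_commit(int winner)
{
    while (hq[winner].count > 0)
        release_core(hq[winner].held[--hq[winner].count]);
}
```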

3.4 Experimental Results

This section presents an evaluation of the proposed Embedded-Spec design. The architecture was tested under several configurations. The first part of the evaluation is devoted to finding the optimal parameters in terms of energy consumption and execution time; our target metric is therefore the energy-delay product (EDP), a standard, commonly used evaluation metric in computer architecture [91]. As in previous work using the MPARM simulator [11], the performance and power models are based mostly on data obtained from a 0.13 µm technology provided by STMicroelectronics for their Platform [92], and the energy model for the fully associative caches is based on the approach of Efthymiou et al. [93]. The second part of the evaluation focuses on the advantages of the optimal configuration over the baseline lock approach. The hardware parameters are reported in Table 3.2.
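For clarity, EDP is simply the product of the two measured quantities, EDP = Energy × Execution Time (lower is better); a configuration that saves energy but lengthens execution can therefore still lose on EDP.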

3.4.1 Benchmarks

To test the design, several benchmarks were chosen and adapted to the simulation platform, which does not include operating system support. The benchmarks belong to the following suites:

Parameter    Configuration(s)
CPU          ARMv7, 3-stage in-order pipeline, 200 MHz
L1 cache     8 KB 1-way Icache, 16 KB 4-way Dcache
Cores        {1, 2, 4, 8}
Policies     Locking, Embedded-LE, Embedded-LR
Signature    2 Kbits, 4-way, Read and Write Bloom filters

Table 3.2: Hardware configurations.

• The STAMP benchmark suite [58]. The selected workloads represent the following synchronization patterns and critical section sizes: 1) large non-conflicting critical sections (vacation); 2) barrier-based synchronization with small critical sections (kmeans); 3) large critical sections that may conflict (genome); 4) a mix of large and small critical sections (labyrinth).

• The MiBench suite [94]: patricia. A Patricia trie is a data structure used in place of full trees with very sparse leaf nodes. Patricia is characterized by a high percentage of time spent in critical sections, and a high abort rate.

• Datastructures: redblack, skiplist. Applications operating on special data structures. The workload is composed of a certain number of atomic operations (i.e., inserts, deletes and lookups) to be performed on these two data structures. Red-black trees and skip lists constitute the fundamental building blocks of many memory management applications found in embedded software.

We begin with a design space exploration using the set of benchmarks described above. From this design space exploration, we determine the best combination of abort and retry policies for the two Embedded-Spec algorithms. Next, we compare the best configurations against standard locks.2

2 Note that here, we are not focusing on showing the EDP improvement achieved by speculative techniques as we increase parallelism by scaling the number of cores, since this has already been demonstrated by previous work (e.g., [10]). Instead, our emphasis is on providing a detailed exploration of a range of contention management techniques and retry policies and comparing the EDP improvement achieved specifically based on those choices. Hence, the results are not normalized to the single-thread execution, but are rather compared to the base synchronization approaches.


Figure 3.6: Execution time for Embedded-LE and Embedded-LE-Sleep modes.

3.4.2 Embedded-LE Parameter Exploration

3.4.2.1 Sleep Mode

As described in Section 3.3, the Embedded-LE implementation can be executed in conjunction with sleep mode, where if a thread is unable to acquire a lock immediately, it is switched to an IDLE state to reduce energy consumption. The energy savings, however, come at the expense of an increased execution time required to switch the cores back from sleep mode to normal operation. Fig. 3.6 shows how the execution time is affected by including sleep mode execution along with lock elision (noted as Embedded-LE-Sleep) as the number of cores is varied.3

To explain the differences observed in the results of the aforementioned benchmarks, we need to bear in mind what is special about each one of them. The genome, patricia and vacation benchmarks have large critical sections and spend a significant portion of time executing critical sections. While patricia experiences high abort rates, vacation generally has non-conflicting transactions. The redblack and skiplist benchmarks are very similar in the sense that they both work on special data structures and have very low abort rates. The labyrinth benchmark includes a mix of large and small critical sections. Finally, kmeans is a benchmark that spends a very small portion of its execution time in critical sections, which are themselves very small. That is why kmeans often does not show significant changes in behavior when fine-tuning some parameters.

3 The labyrinth benchmark triggers software-generated transaction aborts, which are not currently supported. Therefore, the simulations where these are triggered are omitted (the 4- and 8-core configurations for Embedded-LE).


Figure 3.7: Energy Consumption for Embedded-LE and Embedded-LE-Sleep modes.

As seen in Fig. 3.6, all benchmarks except for patricia show an increase in execution time.4 This increase is usually negligible and below 5%, but for vacation and redblack it reaches up to 6% and 10%, respectively. This increase is expected, since switching to/from sleep mode imposes a small time overhead (0.2 µs, i.e., 40 cycles). Only patricia shows a decrease in execution time, of 4%. This most probably happens because the small latency introduced by switching to sleep mode can shift timing in such a way that by the time sleeping cores wake up and retry speculation, the cores they previously conflicted with have completed their critical sections, so they don't conflict again. For benchmarks such as patricia, which have relatively high abort rates, a timing shift can have a big impact on the resulting abort rate and hence on performance. Indeed, in this experiment the abort rate for patricia decreased from 42% to 37% when using sleep mode.

Moreover, Fig. 3.7, which reports the energy consumption for the same set of experiments, shows that for benchmarks that spend a considerable amount of time in critical sections, there is a significant reduction in energy consumption due to sleep mode (e.g., 18% for genome, reaching 48% for patricia). Only redblack shows a slight increase (3%), while kmeans, skiplist and labyrinth are not affected at all by sleep mode. Since redblack has a very low abort rate, sleep mode only adds extra energy overhead.

Fig. 3.8 shows the energy-delay product for the same set of experiments, to measure the combined effect of sleep modality on both performance and energy consumption. Even though execution time is increased for some benchmarks when sleep mode is used, the effect is largely compensated by the reduction in energy consumption, resulting in a significant EDP improvement in most cases, reaching 14% for genome, 50% for patricia and 20% for vacation. The overall effect of sleep mode on EDP is insignificant for skiplist and non-existent for kmeans and labyrinth. Only redblack shows a clear EDP degradation with sleep modality, for the same reasons mentioned before.

4 Note that for most of the figures shown, the y-axis is not 0-based, in order to make the observed trends more readable.

Figure 3.8: Energy Delay Product for Embedded-LE and Embedded-LE-Sleep modes.

The conclusion we draw from this set of experiments is that if we care only about performance, we should use Embedded-LE-Sleep modality instead of Embedded-LE for patricia, while we should avoid it for all other benchmarks. If we care only about energy consumption, then Embedded-LE-Sleep modality is overall a better choice. Similarly, if we care about both performance and energy consumption, then, overall, Embedded-LE-Sleep is the better way to go.5

The results so far have shown that for many benchmarks it is better to sleep instead of spin. However, in order to get a better understanding of the design space, the parameter exploration is continued in the following sections, testing both sleeping and spinning versions of each configuration.

5 Note that in this results description, the focus is mainly on describing the trends for the 8-core execution, since that is where we experience the most parallelism. The trends for every core-count configuration can be observed in detail in the figures included.


Figure 3.9: Performance of Embedded-LE and varying maximum number of retries.

3.4.2.2 Max Number of Retries

Note that in all experiments for Embedded-LE so far, once a thread failed to elide a lock, it would then try to acquire it. We next extend Embedded-LE to allow a thread that had a conflict during lock elision to retry eliding the lock rather than immediately trying to acquire it. Therefore, the next parameter we investigate for Embedded-LE is the max number of retries, which allows us to evaluate how many times it is worthwhile to retry a failed speculation on a high-conflict critical section before switching back to lock mode.

Fig. 3.9 shows the performance with a varying number of retries allowed before reverting to locks, in Embedded-LE mode. Note that by setting this value to 0, Embedded-LE behaves as in the prior experiments, acquiring the lock after a single abort.6 Most benchmarks benefit in terms of performance from retrying the speculation several times, instead of not retrying at all (i.e., having the number of retries set to 0). In particular, when the maximum number of retries is 0, performance generally tends to degrade as the number of cores increases. A limit of 4 is optimal for patricia and genome, but the rest of the benchmarks do not show a significant change in performance based on which of the non-zero values we choose (vacation being the only exception, showing clearly worse performance if we restrict the number of retries to 1 instead of allowing more than one retry).

6 [14] effectively implemented their version of SLE with the maximum number of mispeculations set to 1, i.e., a maximum number of retries of 0.

Because they both experience high contention, patricia and genome do not benefit from many retries. In benchmarks with a high abort rate, switching to locking is preferable after a few retries, since speculation is likely to fail again. Indeed, for patricia, as the number of cores increases, we have to make sure the abort rate does not grow to the point where it is counterproductive for performance. When we restrict the number of retries to 0, the abort rate is reduced to nearly 0, but very little thread parallelism is exploited for 4 or 8 cores. If we allow one retry, the abort rate reaches 42% (for 8 cores), but it is still tolerable when it comes to improving performance. The same trend is observed in genome as well: restricting the number of retries to 0 gives a nearly zero abort rate, while allowing one retry yields an abort rate of 17%. For these two benchmarks with highly contended critical sections, it is better to limit the number of retries, in order to prevent the abort rate from increasing to the point where it hurts performance. Note that the exact same phenomenon is observed in sleep modality.

For energy consumption, shown in Fig. 3.10, 0 again becomes the worst choice as we increase the number of cores, but choosing between 2, 4 or an infinite number of retries does not make much difference for most benchmarks. Retrying 4 times again seems to be slightly better for genome and patricia. For the energy-delay product, Fig. 3.11 shows that picking any non-zero number of retries yields similar benefits for most benchmarks, except for patricia and genome, where restricting the maximum number of retries to 4 is clearly better (a 10% EDP improvement). If we had to choose a single maximum retry value for all benchmarks, we conclude that retrying up to 4 times would be overall the best choice.

Figure 3.10: Energy Consumption of Embedded-LE with varying maximum number of retries (normalized to 1 retry).

Figure 3.11: Energy Delay Product of Embedded-LE with varying maximum number of retries (normalized to 1 retry).

The results for Embedded-LE-Sleep appear in Fig. 3.12 and Fig. 3.13. In this case the results are similar, with a few notable differences. As in the non-sleep case, for most benchmarks any non-zero number of retries yields similar results in terms of performance. Especially for genome and patricia, retrying at most 2 times is better for performance than not restricting the number of retries. As in the Embedded-LE case, retrying speculation instead of switching back to locks immediately after an abort is always beneficial for performance. When looking at energy, though, things change significantly, as we are now able to save considerable amounts of energy by waiting on the lock in sleep mode instead of directly retrying speculative execution. In contrast, the more we allow retrying speculation, the more we risk wasting energy, as Fig. 3.13 shows.

For all benchmarks (except kmeans, which does not spend enough time executing critical sections to matter), restricting the maximum number of retries to 0 yields considerable energy savings.

Figure 3.12: Performance of Embedded-LE-Sleep with varying maximum number of retries (normalized to 1 retry).

Figure 3.13: Energy Consumption of Embedded-LE-Sleep with varying maximum number of retries (normalized to 1 retry).

To determine the overall best choice, we have to look at the energy-delay product, as shown in Fig. 3.14. Benchmarks such as redblack, skiplist, kmeans and labyrinth show better results when choosing any non-zero number of retries, while vacation shows considerable improvement (23%) for an infinite number of allowed retries compared to just 1. Genome shows better EDP when we restrict the number of retries to 2. On the other hand, patricia benefits greatly both in performance and energy when we do not allow any retries at all. This is expected, since benchmarks with high abort rates, such as patricia, benefit from switching to locks after a single misspeculation, while benchmarks with lower conflict levels benefit from retrying the speculation several times.

Figure 3.14: Energy Delay Product of Embedded-LE-Sleep with varying maximum number of retries (normalized to 1 retry).

We conclude that if we want to increase performance and at the same time decrease energy consumption, then for most benchmarks (except for patricia and genome) we should allow retrying speculation an unlimited number of times until it is successful, instead of switching back to locks. Slight variations in the best non-zero value are observed, leading us to pick small finite values, especially for genome (4 in Embedded-LE mode and 2 in Embedded-LE-Sleep mode) and patricia (4 in Embedded-LE mode). The only exception to these observations is patricia in Embedded-LE-Sleep mode. In this case, we see a significant improvement when we do not allow any retries and immediately switch back to locks after an unsuccessful speculation attempt. Again, this is due to the relatively high contention rate for this benchmark.

To summarize, if our primary goal is to improve performance, allowing an infinite number of retries is best for all benchmarks, except for genome and patricia, which show better performance for 4 maximum retries in Embedded-LE mode and 2 in Embedded-LE-Sleep mode. If our primary goal is to decrease energy consumption, then not restricting the number of retries is again best for most benchmarks, except for patricia and genome, which yield better results if we restrict the number of retries to 4. Finally, if we want to decrease energy consumption but we are in Embedded-LE-Sleep mode, then we should not allow any retries for any of the benchmarks.

3.4.2.3 Abort Policy

In this section the parameter exploration is continued by experimenting with the abort policy, which is set within the Bloom module abort manager. The requestor-abort policy, which aborts only the requesting core when a conflict occurs, is compared to the all-cores-on-the-same-lock-ID policy (or abort-all policy), which aborts all cores conflicting on the same lock-protected critical section.

Note that for either abort policy, the aborted cores will have to explicitly try to acquire a lock once they have rolled back and restored their previous states. Since multiple cores attempting to execute critical sections on the same lock ID must be consistent (i.e., cores must all be executing either in speculative (LE) or non-speculative (lock) mode), in the case of the requestor-abort policy, the other cores will have to abort as well if the requestor core manages to acquire the lock before they commit. However, since the process of rollback can take several cycles, in many instances the non-aborted cores will commit before the lock is acquired, and therefore it would have been wasteful to abort all the cores immediately when the conflict was first detected.
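For concreteness, the two policies can be sketched as follows in a hypothetical model of the Bloom-module abort manager; the structure, core count and function names are illustrative assumptions rather than the actual hardware interface.

#define NUM_CORES 8

typedef struct {
    int      requestor;         /* core whose access raised the conflict    */
    unsigned speculating_mask;  /* cores currently eliding the same lock ID */
} conflict_t;

extern void abort_core(int core);   /* roll back and notify the given core */

void handle_conflict(const conflict_t *c, int abort_all)
{
    if (abort_all) {
        /* abort-all: every core speculating on the same lock ID is aborted */
        for (int core = 0; core < NUM_CORES; core++)
            if (c->speculating_mask & (1u << core))
                abort_core(core);
    } else {
        /* requestor-abort: only the core that triggered the conflict aborts */
        abort_core(c->requestor);
    }
}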

We set the maximum number of allowed retries to different values, in order to see if the abort policy plays a different role in each case. We try four different values for the maximum number of allowed retries (0, 1, 2, and infinity) and, for each of these values, we test the two abort policies mentioned before. In these experiments, the observed trends in performance, energy and EDP were the same, hence only the EDP results are presented here, since EDP combines both metrics.

Experiments showed that if no retries are allowed on a failed speculation, then both abort policies yield exactly the same results in performance, energy and EDP for all benchmarks, whether sleep mode is enabled or not. This is expected, since the abort policy makes little difference if we immediately switch back to locks after a failed speculation.

Figure 3.15: Energy delay product for Embedded-LE using the two abort policies, with the maximum number of allowed retries set to 1 (normalized to abort-all).

Figure 3.16: Energy delay product for Embedded-LE using the two abort policies, with the maximum number of allowed retries set to 2 (normalized to abort-all).

Fig. 3.15 shows EDP results for the two abort policies in Embedded-LE mode when we allow at most one speculation retry. For all benchmarks, both abort policies show similar results, except for genome and patricia, which show considerable benefits for the requestor-abort policy compared to the abort-all policy as the number of cores increases (18% and 10% improvement, respectively). If we allow 2 retries in Embedded-LE mode, as seen in Fig. 3.16, patricia is consistently better (21%) with requestor-abort, while genome does not show any difference in this case. Vacation also shows a large benefit (19%) for 8 cores. We note, though, that as the maximum allowed number of retries increases, the benefits of choosing the requestor-abort policy become more prominent. Fig. 3.17 shows the corresponding results for an infinite number of allowed retries in Embedded-LE mode. In this case, we observe a dramatic drop in the EDP for specific benchmarks, namely genome, patricia and vacation (47%, 75% and 63%, respectively), as the number of cores is increased. Our conclusion from this set of experiments is that for most benchmarks the abort policy does not affect the overall EDP, but for genome, patricia and vacation, the requestor-abort policy shows significant benefits that become more prominent as the maximum number of allowed speculation retries is increased. Thus, we conclude that the requestor-abort policy can be safely chosen whenever Embedded-LE mode is activated.

Figure 3.17: Energy delay product for Embedded-LE using the two abort policies, with the maximum number of allowed retries set to infinity (normalized to abort-all).

Next, the same set of experiments is repeated, but this time with sleep mode enabled. Figures 3.18, 3.19 and 3.20 show the corresponding results. We generally observe similar trends as in the non-sleep modality, with the following differences. When the maximum number of retries is limited to 1, as shown in Figure 3.18, the requestor-abort policy is slightly worse for vacation as the number of cores is increased, but still slightly better for genome and patricia. The differences observed in this case, though, are too small to draw a conclusion on which policy is better. As we move to higher numbers of allowed speculation retries, as shown in Figures 3.19 and 3.20, the benefits of the requestor-abort policy become more visible in specific benchmarks. In particular, the EDP reduction in the Embedded-LE-Sleep experiment, when setting the number of allowed retries to infinity and using the requestor-abort policy, is 43%, 80% and 76% for genome, patricia and vacation respectively, compared to 47%, 75% and 63% for the Embedded-LE experiment set.

Figure 3.18: Energy delay product for Embedded-LE-Sleep using the two abort policies, with the maximum number of allowed retries set to 1 (normalized to abort-all).

Figure 3.19: Energy delay product for Embedded-LE-Sleep using the two abort policies, with the maximum number of allowed retries set to 2 (normalized to abort-all).

We conclude that it is never disadvantageous to choose the requestor-abort policy over the abort-all policy. In fact, for some benchmarks like genome, patricia and vacation, the requestor-abort policy is beneficial both in terms of performance and energy consumption, especially when a higher number of maximum allowed retries is set and sleep modality is chosen.

3.4.3 Embedded-LR Parameter Exploration

This section evaluates the abort policies of the Embedded-LR implementation. As described in Section 3.3, this approach is distinct from Embedded-LE because the architecture does not use locks for mutual exclusion.

Figure 3.20: Energy delay product for Embedded-LE-Sleep using the two abort policies, with the maximum number of allowed retries set to infinity (normalized to abort-all).

3.4.3.1 Abort Policy

The abort policies evaluated are timestamp, which aborts the core with the latest timestamp (i.e., the last core to start executing this critical section), and priority-abort, which favors the core that has been aborted the largest number of times on this particular critical section. To implement the timestamp configuration without increasing the hardware complexity, at the start of a new transactional execution the Bloom module increments a global counter and stores its value in the related Bloom module core register. In this way, each core that is working in speculative mode keeps information about its starting order. When a conflict is detected, the Bloom module aborts the core with the highest value. To implement the priority-abort configuration, the Bloom module increments a per-core register every time the core aborts. The register is cleared on commit. Note that in both cases, the aborted cores are switched into sleep mode for energy-saving reasons.
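A minimal sketch of the two victim-selection rules just described is given below; the register arrays, the core count and the helper names model the Bloom module's bookkeeping and are assumptions for illustration only.

#include <stdint.h>

#define NUM_CORES 8

static uint32_t global_counter;
static uint32_t start_stamp[NUM_CORES];   /* global-counter value at transaction start */
static uint32_t abort_count[NUM_CORES];   /* aborts on the current critical section    */

void on_transaction_start(int core) { start_stamp[core] = ++global_counter; }
void on_abort(int core)             { abort_count[core]++; }
void on_commit(int core)            { abort_count[core] = 0; }

/* timestamp policy: abort the last core to have started (largest stamp). */
int pick_victim_timestamp(const int *cores, int n)
{
    int victim = cores[0];
    for (int i = 1; i < n; i++)
        if (start_stamp[cores[i]] > start_stamp[victim])
            victim = cores[i];
    return victim;
}

/* priority-abort policy: favor the core aborted most often, i.e. abort the
 * conflicting core with the fewest past aborts on this critical section. */
int pick_victim_priority(const int *cores, int n)
{
    int victim = cores[0];
    for (int i = 1; i < n; i++)
        if (abort_count[cores[i]] < abort_count[victim])
            victim = cores[i];
    return victim;
}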

Fig. 3.21 shows that the timestamp approach provides similar EDP to priority-abort for genome, kmeans, redblack, skiplist and vacation. However, for patricia and labyrinth we observe that the timestamp approach provides a significant EDP improvement compared to priority-abort (up to 18% and 31%, respectively).

The results for performance and energy consumption show the exact same trends as EDP, so these graphs are not included; the results are summarized here instead. For performance, timestamp is clearly better than priority-abort for patricia (10%) and labyrinth (23%). For energy consumption, timestamp shows similar improvements for patricia (<10%) and labyrinth (18%). These benchmarks all have some longer-running critical sections and therefore tend to benefit from letting them run to completion, as the timestamp configuration allows. For all other benchmarks, no significant difference is observed between the two policies. Based on these observations, we conclude that for execution in Embedded-LR mode the timestamp approach can be safely chosen.

Figure 3.21: Energy delay product for Embedded-LR using the two abort policies (normalized to timestamp).

3.4.4 Speculative Execution vs. Locks

Having determined the optimal set of parameters for each benchmark, we can now compare Embedded-Spec with standard lock approaches (lock and lock-sleep). Using the best parameter configurations for each execution mode and each benchmark presented so far, we perform a set of experiments in which we compare the performance, energy consumption and EDP of each applied technique (locking, lock-sleep, Embedded-LE, Embedded-LE-Sleep and Embedded-LR).

Fig. 3.22 shows the execution time of each technique, normalized to the execution time of standard locks. As can be seen, for the 1-core configuration locks provide better performance than any kind of speculation. This is expected, and is due to the additional hardware and software support necessary to enable the speculation. As the number of cores is increased, though, the speculative approaches begin to show an advantage for all but the kmeans benchmark. As mentioned earlier, in kmeans the critical sections are rare and small (i.e., less than 5% of the time is spent in critical sections), and the results show that Embedded-Spec does not provide benefits. At the same time, Embedded-Spec does not hurt performance when the benchmark does not include large speculative sections.

Figure 3.22: Execution time of Embedded-Spec vs. standard locks, showing results for the best configuration of each benchmark (normalized to locking).

We also observe that Embedded-LR yields the best performance for an increased number of cores, except for kmeans. Embedded-LR yields a performance improvement of at least 47% for patricia and up to 80% for genome, compared to standard locks. The next best configurations after Embedded-LR are Embedded-LE and Embedded-LE-Sleep, both yielding performance improvements of 10%, 31%, 45%, 50% and 70% for vacation, redblack, skiplist, labyrinth and genome, respectively. The only exceptions are kmeans, for the reasons mentioned, and patricia, which shows better performance for the lock-sleep and locking techniques than for Embedded-LE and Embedded-LE-Sleep. This is expected for patricia, since it suffers from a relatively high abort rate, hence using locking instead of speculation is preferable. Regarding performance, Embedded-LE and Embedded-LE-Sleep show a very small difference, with Embedded-LE being slightly better, apart from patricia and redblack where the difference is more pronounced (23% and 6%, respectively).

Figure 3.23: Energy Consumption of Embedded-Spec vs. standard locks, showing results for the best configuration of each benchmark (normalized to locking).

Regarding locking compared to lock-sleep, the difference in performance is again insignificant, apart from patricia, where lock-sleep is clearly better (11%).

Fig. 3.23 shows the energy consumption for the same set of experiments. Here, lock-sleep is clearly preferable, showing energy benefits starting from 15% for kmeans and reaching up to 73% for genome and labyrinth. When we focus only on energy, locking with sleep mode enabled is clearly better than speculation since it does not encounter aborts. On the other hand, locking without sleep mode enabled becomes the worst choice for energy consumption, as we can see in Fig. 3.23. So, with the best choice being the lock-sleep technique, the second-best choice in terms of energy consumption is Embedded-LR (except for patricia, where Embedded-LE-Sleep is 31% better than Embedded-LR). Embedded-LE-Sleep comes very close to Embedded-LR in terms of energy consumption, with Embedded-LE following next in most cases. A common observation is that all sleep techniques yield better energy results, which is generally expected.

Fig. 3.24 shows the combined performance-energy results. We observe that Embedded-LR and lock-sleep are the two best techniques when we care about both performance and energy consumption, with Embedded-LR being better than lock-sleep for genome, redblack and vacation (up to 19%) and very similar for skiplist, labyrinth and kmeans. Only for patricia is lock-sleep better than Embedded-LR (up to 12%), again because of its high abort rate. The next best configuration for EDP for most benchmarks is Embedded-LE-Sleep, with Embedded-LE being very close. The worst choice for EDP is again locking. There are exceptions: first, kmeans does not show any significant difference for any of the applied techniques, which is expected as explained earlier. Second, patricia is the only benchmark that shows a clear improvement in EDP for Embedded-LE-Sleep compared to Embedded-LE (73%). Overall, with respect to energy-delay product, Embedded-LR is the best choice, with lock-sleep following next.

Figure 3.24: Energy Delay Product of Embedded-Spec vs. standard locks, showing results for the best configuration of each benchmark (normalized to locking).

We draw the following conclusions: if reducing energy consumption is our primary goal, then we should use sleep-enabling techniques. Moreover, we should not bother using speculation, but choose the lock-sleep technique instead; speculation is encouraged only in cases where we encounter increased parallelism. On the other hand, if performance is our primary goal, then Embedded-LR is clearly the winner. Finally, to improve the energy-delay product, we should generally pick Embedded-LR or lock-sleep and avoid Embedded-LE or traditional locking.

Table 3.3 summarizes the best and second-best configuration modes for each of the benchmarks considered in the experiments. We observe that the best configurations may vary depending on whether our goal is to improve performance, energy, or both.

Table 3.3: Embedded-Spec – top two best configurations when considering performance only, energy only, or energy-delay product.

Genome:    Performance: 1. TLR, timestamp;  2. SLE-sleep, 2 retries, requestor-abort.
           Energy:      1. lock-sleep;      2. TLR, timestamp.
           EDP:         1. TLR, timestamp;  2. SLE-sleep, 2 retries, requestor-abort.
Kmeans:    Performance: 1. TLR, timestamp;  2. no difference on type, #retries or abort policy.
           Energy:      1. lock-sleep;      2. TLR, timestamp.
           EDP:         1. lock-sleep;      2. TLR, timestamp.
Patricia:  Performance: 1. TLR, timestamp;  2. SLE, 4 retries, requestor-abort.
           Energy:      1. lock-sleep;      2. SLE-sleep, 0 retries, requestor-abort.
           EDP:         1. lock-sleep;      2. TLR, timestamp.
Redblack:  Performance: 1. TLR, timestamp;  2. SLE, no difference on #retries or abort policy.
           Energy:      1. lock-sleep;      2. TLR, timestamp.
           EDP:         1. TLR, timestamp;  2. lock-sleep.
Skiplist:  Performance: 1. TLR, timestamp;  2. SLE, no difference on #retries or abort policy.
           Energy:      1. lock-sleep;      2. locking.
           EDP:         1. lock-sleep;      2. TLR, timestamp.
Vacation:  Performance: 1. TLR, timestamp;  2. SLE, no difference on #retries, requestor-abort.
           Energy:      1. lock-sleep;      2. TLR, timestamp.
           EDP:         1. TLR, timestamp;  2. SLE, no difference on #retries, requestor-abort.
Labyrinth: Performance: 1. TLR, timestamp;  2. no difference on type, #retries or abort policy.
           Energy:      1. lock-sleep;      2. TLR, timestamp.
           EDP:         1. lock-sleep;      2. TLR, timestamp.

3.5 Summary and Discussion

This chapter presented Embedded-Spec, an energy-efficient and lightweight implementation for transparent speculation on an embedded architecture. Embedded-Spec can operate in two speculative execution modes: Embedded-LE, which is based on lock elision, and Embedded-LR, which is based on lock removal. Through an extensive set of experiments, the proposed scheme was shown to improve the energy-delay product (EDP) for most of the benchmarks and configurations that were considered. However, the benefits of speculation are sensitive to the critical section size, the degree of lock contention, the retry policy and the contention management policy. Results showed energy and performance benefits especially for larger numbers of cores (e.g., 4–8 cores). When comparing the two proposed speculative execution mechanisms, Embedded-LR provides better performance and energy characteristics than Embedded-LE. However, it was observed that standard locks with sleep mode enabled may still be the best choice if minimizing energy consumption is more critical than improving performance. We conclude that for platforms where energy efficiency matters, Embedded-Spec can provide real benefits, but that the underlying hardware architecture must be configured with care.

While the speculative execution mechanism described in this chapter is energy-efficient and appropriate for embedded systems, it targets a shared bus-based architecture with cache coherence support. This architecture can accommodate up to 8 cores. Increasing the number of cores to extract more parallelism in such a system will likely result in flooding the shared bus and hurting performance. Driven by the need for scalability, in the next chapter of this thesis we present a speculative execution mechanism that targets a far more scalable system: a many-core cluster-based embedded architecture.

Chapter 4

Speculative Synchronization on Coherence-free Many-core Embedded Architectures

High-end embedded systems, like their general-purpose counterparts, are turning to many-core cluster-based shared-memory architectures that are subject to non-uniform memory access (NUMA) costs. Memory organization is the single most far-reaching design decision for such architectures, both in terms of raw performance and in terms of programmer productivity. For many-core embedded systems, in order to meet stringent area and power constraints, the cores and memory hierarchy must be kept simple. In particular, scratchpad memories (SPM) are typically preferred to hardware-managed data caches, which are far more area- (40%) and power-hungry (34%) [1]. Several many-core embedded systems have been designed without the use of caches and cache coherence. These kinds of platforms are becoming increasingly common.

As embedded systems move to many-core and cluster-based architectures, the design of high-performance, energy-efficient synchronization mechanisms becomes more and more important. Yet, speculative synchronization for such embedded systems has received little attention. Moreover, implementing speculative synchronization in embedded systems that lack cache-coherence support is particularly challenging, since hardware speculative techniques traditionally rely on the underlying cache-coherence protocol to synchronize memory accesses among the cores. For these cacheless systems, a completely new approach is necessary for handling speculative synchronization.

The lack of cache coherence brings major challenges to the design of HTM support, which needs to be designed from scratch. At the same time, though, it provides a significant benefit: a more lightweight and simple environment to build upon, which could be more appropriate for the embedded systems domain. Building on such an environment, in this chapter we create from scratch a Hardware Transactional Memory design that is self-contained and does not rely on an underlying cache coherence protocol to provide synchronization and safety guarantees. To the best of our knowledge, this is the first design for speculative synchronization in this type of architecture. As will be described later, this implementation requires explicit data management and implies a fully-custom design of the transactional memory support.

4.1 Target Architecture

Before presenting the HTM design, it is essential to describe the target architecture. This work is based on a virtual platform environment called Virtual SoC (VSoC), a SystemC simulator which models a cluster-based many-core architecture at a cycle-accurate level [4]. Like recent many-core chips such as the Kalray MPPA256 [15], ST Microelectronics p2012/STHORM [18], and even GPGPUs such as NVIDIA Fermi [16], the VSoC platform encompasses multiple computing clusters and is highly modular. These systems achieve scalability through a hierarchical design. Simple processing elements (PE) are grouped into small-to-medium sized subsystems (clusters) sharing a high-performance local interconnect and memory. In turn, clusters are replicated and interconnected with a scalable network-on-chip (NoC) medium, as depicted in Figure 4.1.

Figure 4.1: Hierarchical design of our cluster-based embedded system: clusters, each with a network interface (NI), connected through NoC routers (R).

Figure 4.2 shows the basic cluster architecture. Each cluster consists of a configurable number, N, of 32-bit ARMv6 RISC processors (the original simulator was designed with ARMv6 processors; using a later processor version would not change the expected observations), one private L1 instruction cache for each of the N processors, and a shared multi-ported and multi-banked tightly coupled data memory (TCDM). Note that the TCDM is not a cache, but a first-level scratchpad (SPM) structure. As such, it is managed in software rather than hardware and it lacks cache coherence support. The TCDM is partitioned into M banks, where all banks have the same memory capacity. For the ARMv6 processor models, the Instruction Set Simulator by Helmstetter and Jolobo [95] is used, wrapped in a SystemC module. A logarithmic interconnect supports communication between the processors and the TCDM banks. The TCDM can handle simultaneous requests from each processor in the cluster. An off-chip main memory is also available. This is not part of the cluster, but each cluster's processors can access it through an off-cluster main memory bus (see Figure 4.2).

The logarithmic interconnect is a mesh-of-trees (Figure 4.3) and provides fine-grained address interleaving on the memory banks to reduce banking conflicts. The latency of traversing the interconnect is normally one clock cycle. If multiple processors are requesting data that reside within different TCDM banks, then the data routing is done in parallel. This allows the cluster to maintain full bandwidth for the communication between the processors and the memories. In addition, when multiple processors are reading the same data address simultaneously, the network can broadcast the requested data to all readers within a single cycle. However, if multiple processors are requesting different data that reside within the same TCDM bank, conflicting requests will occur, which will trigger a round-robin scheduler to arbitrate access for fairness. In this case, additional cycles will be needed to service all data requests. Specifically, the conflicting requests will be serialized, but with no additional latency between consecutive requests.

Figure 4.2: Single cluster architecture of the target platform: N processing elements with private instruction caches, a logarithmic interconnect (MoT), a multi-banked shared level 1 data memory (TCDM) with a test-and-set semaphore region, and a bus to the off-cluster main memory.

Regardless of whether the processors within the cluster have issued conflicting requests or not, when a memory access request arrives at a bank interface, the data is available on the negative edge of the same clock cycle. Hence the latency for a TCDM access that has not experienced conflicts is two clock cycles [4]: one cycle for traversing the interconnect in each direction.

A stage of demultiplexers between the logarithmic interconnect and the processors selects whether the memory access requests coming from the processors are for the main memory or for the TCDM. Accesses to memory external to a cluster go through a peripheral interconnection. The off-cluster (main memory) bus coordinates these accesses and services requests in round-robin fashion.

Figure 4.3: A 4x8 Mesh of Trees connecting four cores (PE0–PE3) to eight memory banks. Circles represent routing and arbitration switches. Taken from [4].

The basic mechanism for processors to synchronize is standard read/write operations to a dedicated memory space which provides test-and-set semantics (a single atomic operation returns the content of the target memory location and updates it). We use this memory for locking, the baseline against which the transactional memory design is compared.
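As a rough sketch, and assuming a base address and a "0 means free" convention that are purely illustrative (the actual VSoC memory map is not shown here), locking over the test-and-set region could look as follows.

#include <stdint.h>

#define TAS_BASE 0x10400000u                         /* hypothetical base address */

static volatile uint32_t *tas_location(int lock_id)
{
    return (volatile uint32_t *)(TAS_BASE + 4u * (uint32_t)lock_id);
}

void lock_acquire(int lock_id)
{
    /* A read atomically returns the old content and sets the location,
     * so a returned 0 means the lock was free and is now ours. */
    while (*tas_location(lock_id) != 0)
        ;                                            /* spin until acquired */
}

void lock_release(int lock_id)
{
    *tas_location(lock_id) = 0;                      /* an ordinary write frees it */
}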

Note that cores within the cluster do not have private data caches or memories (just private per-core instruction caches). Instead, all data accesses go through the TCDM. The absence of coherent data caches implies a completely custom design of the transactional memory support.

Indeed, speculative synchronization through HTM generally relies on the underlying cache coherency protocol to manage conflict detection. Instead, we have to employ a different mechanism to achieve this. This mechanism will be explained in more detail in the next section.

4.2 Transactional Memory Design

Since our system does not have caches, instead of buffering tentative updates in an L1 cache like prior designs, we choose to integrate the HTM mechanism with the TCDM memory, meaning that the TCDM holds both speculative and non-speculative data. In prior designs, including the design described in Chapter 3, that were intended for small-scale embedded devices, a centralized unit (such as the Bloom module) would snoop on bus transactions, detect data conflicts, and resolve them (i.e., decide which of the conflicting transactions should be aborted). Monitoring all ongoing traffic in a shared-bus environment is fairly easy, since only one transaction can traverse the shared bus medium in each cycle. However, in the VSoC system, the logarithmic interconnect permits multiple transactions to traverse the interconnect in the same cycle. Since interconnect access is concurrent, not serial, snooping on the cluster interconnect is not feasible. Serializing and routing transactional memory traffic through a centralized module would create a substantial sequential bottleneck and drastically limit scalability.

For these reasons, we conclude that transactional management must be distributed if it is to be scalable. Thus, we divide conflict detection and resolution responsibilities across the TCDM memory banks, into multiple Transaction Support Modules (TSM). By placing a transactional support module at each bank of the TCDM, we allow conflict detection and resolution mechanisms to be decentralized. In this way, transactional management bandwidth should scale naturally with the number of banks. The proposed design consists of three parts: Transactional Bookkeeping, Data Versioning and Control Flow. Each of these is described in more detail next.

4.2.1 Transactional Bookkeeping

Transactional bookkeeping (also known as conflict detection management) is the mechanism that keeps track of read and write data accesses in order to detect shared-data conflicts. In a conventional HTM system, this is usually done through extensions to the cache coherence protocol. Since the current target system has no cache coherence protocol, transactional bookkeeping must be implemented in an alternative way. This section describes the proposed design for our VSoC platform.

The TSM of each TCDM bank intercepts all memory traffic to that bank and is aware of which cores are executing transactions. This process keeps track of transactional readers and writers. When a TSM detects a conflict, it decides which transaction to abort and notifies the appropriate processor. For each data line, there can be multiple transactions reading that line, or a single transaction writing it, which we call the line's owner. Each bank keeps track of which processors have transactionally accessed each data line through a per-bank array of k r-bit vectors, where k is the number of data lines at that bank and r = 1 + N + log2(N), where N is the number of cores in the cluster. The first bit indicates whether the line has been written transactionally, and if so, the next log2(N) bits identify the owner. The remaining N bits indicate which processors are transactionally reading that line. For example, for N = 16 cores, a 21-bit vector is needed, as shown in Figure 4.4.

Figure 4.4: Bookkeeping example (owner bit, 4-bit writer ID, and one read bit per core for a 16-core cluster). At time t1, address location A has not been read or written. By time t2, cores 1, 7, 8, and 13 have read the address. At time t3, core 13 writes the address and generates a conflict, so core 13 is aborted and its read flag is cleared. Since core 13 was also the writer of address location A, the Writer ID bits and the Owner bit are cleared as well.

The transactional support mechanism is integrated within each bank of the TCDM memory. When a transaction accesses a bank, the bank's TSM checks the corresponding vector to determine whether there is a conflict. A transactional write to a memory location that is currently being read by another core will trigger a conflict (and vice versa). A transactional read to a memory location concurrently being read by other cores does not trigger a conflict.
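The per-line bookkeeping vector and these conflict rules can be illustrated with the following sketch for a 16-core cluster; the field layout and function names are illustrative, not the TSM's actual hardware implementation.

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint8_t  owned;        /* 1 if some transaction has written this line     */
    uint8_t  writer_id;    /* owner core (valid only when owned == 1)         */
    uint16_t readers;      /* bit i set if core i has read it transactionally */
} line_vec_t;

/* Returns true if an access by 'core' conflicts with existing accessors. */
bool check_conflict(const line_vec_t *v, int core, bool is_write)
{
    if (v->owned && v->writer_id != core)
        return true;                        /* another transaction wrote it   */
    if (is_write && (v->readers & ~(1u << core)))
        return true;                        /* write vs. concurrent readers   */
    return false;                           /* concurrent reads are allowed   */
}

/* Records the access after a conflict-free check (update_flags in the text). */
void update_flags(line_vec_t *v, int core, bool is_write)
{
    if (is_write) { v->owned = 1; v->writer_id = (uint8_t)core; }
    else          { v->readers |= (uint16_t)(1u << core); }
}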

4.2.2 Data Versioning

As previously described in Section 2.2.1.3, any transactional memory design must manage Data Versioning, i.e., keeping track of speculative and non-speculative versions of data. Like conflict detection and resolution, data versioning can be eager or lazy. Lazy versioning leaves the old data in place and buffers speculative updates in different locations, while eager versioning makes speculative updates in place and stores backup copies of the old data elsewhere. Keeping the original data in place makes the abort scenario very fast, but it delays commits, since extra time is necessary to write the speculative data back to memory during commits. Eager data versioning, on the other hand, makes commits faster but increases the abort recovery time, since the original data need to be restored to memory. For applications with low contention, hence low abort rates, eager data versioning is more attractive. However, care must be taken to make sure that the abort recovery time does not become a major bottleneck, even with lower abort rates.

For this design, we choose eager versioning and use the TCDM's banks for storing the speculative versions of data. We borrow from the LogTM idea of Moore et al. [7], in which the original values are stored in a software log structured as a stack. In LogTM, a per-thread transaction log is created in cacheable virtual memory, which holds original data values and their virtual addresses for all data modified by a transaction. In our platform, keeping per-thread transaction logs would have some significant drawbacks. First, the transaction log could reside anywhere in the memory, across multiple banks. This would imply that during every transactional memory access, the log would first need to be located and then traversed in order to find out whether the memory location has been accessed by that transaction. Moreover, the abort process would require traversing the log once more and restoring data back to their original locations, which would create data exchanges across different banks through the interconnect and hence significant delays. For these reasons, we propose two alternative data versioning designs that avoid this costly cross-bank data exchange: a Full-Mirroring and a Distributed Logging design. Both designs utilize the local per-bank TSMs for performing the transactional data saving and restoration processes. Next, each of these designs is described in detail.

4.2.2.1 Full-Mirroring

This design is based on the idea that, for every address in the memory space, we create a mirror address in the same TCDM bank that holds the original data, to be recovered in case of an abort. In this way, the restore process does not involve any exchange of data between different banks. Instead, the data saving and restoring process is triggered internally by the TSM of each bank, and it is completed simply by performing an internal bank access to the mirror address, without requiring interaction with other banks' TSMs. Although this solution consumes more space than keeping a dedicated per-transaction log, it yields a very simple and fast design.

When a transaction first writes an address, the TSM sends a request to the log space to record that address's original data. As shown in Figure 4.5, the log space is designed so that each address's mirror address is in the same bank. When a log entry needs to be saved, the TSM triggers a write to the corresponding mirror address in the bank. Note that we only need to log an address's original value the first time it is written within a specific transaction. Hence, we pay the cost of writing to the log space only once for each address that is written during a transaction. If a transaction aborts, the data it overwrote is restored from the log.

Because each address and its mirror reside at the same bank, the latency overhead of recording the address’s original value and restoring it on a transaction abort is quite modest (two extra cycles).

No additional cost is paid to search for the location where the address is logged, since each address's mirror lies at a location that can be found by a simple calculation and does not require traversing a log. Moreover, the use of an eager versioning scheme makes commits fast, since no data need to be moved. Thus, while full-mirroring has a significant area overhead (i.e., we need to dedicate half of the TCDM memory to the log space), it has an advantage when it comes to simplicity. However, the fact that it uses memory less efficiently may require extra delay overhead to move data to/from the TCDM and main memory. Section 4.3.1 presents a detailed overhead comparison of full-mirroring with the alternative distributed logging scheme in terms of space and time.

Figure 4.5: Modified single cluster architecture, with a log region and a TSM at each TCDM bank. Notice that the PIC refers to the off-cluster peripheral interconnect.
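One possible way to compute a mirror address is sketched below, assuming the upper half of the TCDM address range is reserved for mirrors; the offset constant and layout are assumptions for illustration, not the actual VSoC memory map. The same-bank property holds as long as the offset is a multiple of the bank-interleaving granularity times the number of banks.

#include <stdint.h>

#define TCDM_SIZE     (256u * 1024u)      /* total TCDM size                  */
#define MIRROR_OFFSET (TCDM_SIZE / 2u)    /* second half reserved for mirrors */

static inline uint32_t mirror_of(uint32_t tcdm_addr)
{
    /* With word-level interleaving over M banks, adding a multiple of 4*M
     * keeps the mirror in the same bank as the original address. */
    return tcdm_addr + MIRROR_OFFSET;
}

/* On the first transactional write to an address, the TSM copies the original
 * word to mirror_of(addr); on abort, it copies it back (two cycles each way). */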

4.2.2.2 Distributed Logging

As just noted in the previous section, half of the available on-chip TCDM must be reserved for the mirror in our full-mirroring scheme, yet the amount of this memory space that will actually be used to save log data depends on the write footprint of the transactions, which typically covers only a small subset of the available memory space. The second data versioning design proposed here, Distributed Logging, offers a solution to the space inefficiency of the Full-Mirroring design by using distributed per-address logs instead of mirrors.

Figure 4.6: Distributed per-address log scheme for M banks and N cores. Each TCDM bank holds its data, its TSM, and one log region per core (core 0 through core N-1) for addresses residing in that bank.

Figure 4.6 depicts how the Distributed Logging design works. In this design, distributed per-address logs are used to save backups of the original values of data that are written during transactions, so that they can be recovered in case of aborts. Again, as in Full-Mirroring, the transactional handling and log managing responsibilities are divided across the TSMs of the banks. Each bank's TSM monitors transactional accesses to the bank and manages the cores' logs that reside in that bank. It is also responsible for restoring the log data of the cores that abort their transactions and cleaning the logs of the cores that commit their transactions. Again, all banks' TSMs work in parallel and independently of one another.

At every bank of the TCDM, we keep a fixed-size log space for each core in the system. Each core's log holds the addresses that belong to that bank and are written transactionally by that core. In this way, we keep log space only for the addresses of the bank that are actually written transactionally and not for all of them, as in Full-Mirroring. At the same time, with this distributed logging design we still avoid cross-bank data exchange when saving and restoring the log, since each address's log falls within the same bank. Thus the log saving and restoration process is triggered internally by the TSM of each bank and does not require interaction with the TSMs of other banks. This would not be feasible if we used per-thread transaction logs as proposed in [7].

When a core writes transactionally to an address of a bank, its log is traversed to check whether it already holds an entry for that address. If not, a new log entry is created to store the original data of the address. Note that the data only need to be logged the first time the address is written within a specific transaction. Therefore, the log size depends on the write footprint of each transaction.

Since the log of each core is distributed among all the TCDM banks, we expect that the log writes will also be divided among the banks. The size of each core's log space per bank is a parameter in our design, so it can be easily adjusted to the needs of different application domains. In case of an overflow, our technique resorts to software-managed logging into the main L2 memory. The capability of tuning the log size is intuitively key to reducing the number of overflows. If a conflict is detected and a transaction must abort, each bank's log is traversed to restore the original data back to its proper address. If a transaction commits, the logs associated with that transaction are all discarded and the speculative data becomes non-speculative. In Section 4.3.1, we further detail the overhead analysis of each of the proposed data versioning schemes in terms of space and time.
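The write path of the distributed logging scheme might look roughly as follows; the entry format, the per-core log size and the function names are illustrative assumptions, not the actual TSM design.

#include <stdint.h>
#include <stdbool.h>

#define LOG_ENTRIES_PER_CORE 8          /* tunable per-core log size per bank */

typedef struct {
    uint32_t addr;                      /* address within this bank           */
    uint32_t old_value;                 /* pre-transaction value              */
} log_entry_t;

typedef struct {
    log_entry_t entries[LOG_ENTRIES_PER_CORE];
    int         count;
} core_log_t;

/* Called by a bank's TSM on a transactional write by a core. Returns false on
 * overflow, in which case the design falls back to software logging in L2.   */
bool log_on_write(core_log_t *log, uint32_t addr, uint32_t old_value)
{
    for (int i = 0; i < log->count; i++)
        if (log->entries[i].addr == addr)
            return true;                /* already logged in this transaction  */
    if (log->count == LOG_ENTRIES_PER_CORE)
        return false;                   /* overflow: spill to main memory      */
    log->entries[log->count++] = (log_entry_t){ addr, old_value };
    return true;
}

/* On abort, the TSM walks the log and writes each old_value back to addr;
 * on commit, it simply resets count to zero.                                  */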

4.2.3 Transaction Control Flow

This section describes the transaction control flow, i.e., how the proposed transactional memory design operates. Before a transaction starts, it reads a special memory-mapped transactional base register. When this request reaches the memory, the bit corresponding to the core that made the request is set internally, to indicate that this processor is executing a transaction. When the transaction starts, its core saves its internal state (program counter, stack pointer and other internal registers), to be able to roll back if the transaction aborts. All transactionally executed memory accesses are marked with a special transactional bit set when the memory accesses are issued to the system. When a transaction ends, it triggers another access to a memory-mapped transactional commit register, which activates a special process at the memory bank level that cleans all the transactional flags and saved logs associated with that core's transactional accesses. Note that the access to these special transactional registers is a read access, hence it is non-blocking, meaning that multiple cores may access those registers simultaneously without serialization. These special registers do not impose the serialization and contention costs associated with traditional semaphores.

Figure 4.7: Transactional control flow at each bank's TSM. Non-transactional requests proceed as usual; transactional reads and writes invoke check_conflict() and either update_flags() and perform the access (reading or writing the log as needed), or, on a conflict, resolve_conflict(), abort_transaction(), restore_data() and clean_flags().
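From the core's perspective, starting and committing a transaction as described above could be sketched as follows; the register addresses and the context-saving helper are hypothetical placeholders, not the platform's actual interface.

#include <stdint.h>

#define TM_BEGIN_REG  ((volatile uint32_t *)0x10500000u)   /* hypothetical address */
#define TM_COMMIT_REG ((volatile uint32_t *)0x10500004u)   /* hypothetical address */

extern void save_core_context(void);   /* save PC, SP and other registers */

void tm_begin(void)
{
    save_core_context();               /* needed for rollback on abort           */
    (void)*TM_BEGIN_REG;               /* read marks this core as transactional  */
}

void tm_commit(void)
{
    (void)*TM_COMMIT_REG;              /* read clears this core's flags and logs */
}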

Figure 4.7 depicts the transactional memory control flow. Each TSM has a process that, in each cycle, checks whether the request received by the corresponding bank is transactional. If so, and it is a request to read a saved log value, then the process returns the data from the log.

If the request is a transactional data read but not for the log space, then the check_conflict() function checks the address's flag vector to determine whether this request triggers a conflict. If all concurrent transactions are also reading, then there is no conflict, and the update_flags() function adjusts the location's read flags before performing the read.

On the other hand, if some core has written that address, a conflict is triggered. A resolve_conflict() function decides which of the transactions currently accessing that address will have to abort. This decision depends on the current conflict resolution policy. As a starting point, we chose to abort the requester (i.e., the core which issued the access that triggered the conflict). When the resolve_conflict() function returns, it calls the abort_transaction() function, which notifies the cores that need to abort. These cores then restore their internal saved state and respond with an acknowledgment. Control is passed back to the abort_transaction() process, which now has to call the restore_data() and clean_flags() functions. The first function is responsible for restoring the original saved data from the logs back to their original address locations, and the second function cleans the read/write flags of the aborted core. It is important to mention that, when an abort occurs in the system, all banks' TSMs call restore_data() and clean_flags() simultaneously and the banks stall normal operation until these functions have completed, in order to avoid intermediate reads of invalid data. Once the data restoration process has been completed by all banks' TSMs, the aborted core's internal state is restored. Thus, the aborted core is ready to retry the transaction. This will not happen right away, but after waiting for a random backoff period, in order to avoid consecutive conflicts with other cores that might also be retrying their aborted transactions simultaneously. More details on how this backoff period is implemented follow in Section 4.3.

The control flow for a write is similar, but there is a difference in the conflict detection criteria: if the memory location is currently being either read or written by other transactions, then a conflict will be triggered. If no transaction is reading the location, then the update_flags() function sets the owner flag and the new owner's ID for that address. The first time the owner writes, the address's original data must be saved to the log.
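Putting the pieces together, the per-bank TSM decision logic of Figure 4.7 can be summarized by the following sketch; the request structure and the helper signatures are illustrative, while the function names follow those used in the text.

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    int      core;
    bool     transactional;
    bool     is_write;
    bool     is_log_access;    /* read of a previously saved log value */
    uint32_t addr;
} mem_req_t;

/* Hooks into the bank's bookkeeping and logging machinery (assumed). */
extern bool check_conflict(uint32_t addr, int core, bool is_write);
extern int  resolve_conflict(const mem_req_t *req);   /* requester is aborted by default */
extern void abort_transaction(int core);
extern void restore_data(int core);
extern void clean_flags(int core);
extern void update_flags(uint32_t addr, int core, bool is_write);
extern void log_if_first_write(uint32_t addr, int core);
extern void perform_access(const mem_req_t *req);
extern void read_log(const mem_req_t *req);

void tsm_handle_request(const mem_req_t *req)
{
    if (!req->transactional) { perform_access(req); return; }   /* proceed as usual */

    if (!req->is_write && req->is_log_access) { read_log(req); return; }

    if (check_conflict(req->addr, req->core, req->is_write)) {
        int victim = resolve_conflict(req);
        abort_transaction(victim);     /* notify the core; it restores its state */
        restore_data(victim);          /* copy logged values back in place       */
        clean_flags(victim);           /* clear the victim's read/write flags    */
        return;
    }

    update_flags(req->addr, req->core, req->is_write);
    if (req->is_write)
        log_if_first_write(req->addr, req->core);   /* save original value once */
    perform_access(req);
}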

4.3 Experimental Results

In this section the proposed transactional memory design is evaluated and compared with the use of a conventional locking scheme. Moreover, the overhead of the proposed data versioning schemes is analyzed. The benchmarks chosen to test the design require some data synchronization and are representative of real embedded systems applications. Since the target simulation platform does not include operating system support, all OS calls within the benchmarks are eliminated. The evaluation starts with the following data structure benchmarks, as well as benchmarks from the STAMP benchmark suite. Later, Section 4.3.3 presents results from the Eigenbench exploration tool.

• Redblack, Skiplist: These are applications operating on special data structures. The workload is composed of a certain number of atomic operations (i.e., inserts, deletes and lookups) to be performed on these two data structures. Redblack trees and skip-lists constitute the fundamental blocks of many memory management schemes found in embedded applications.

• Genome: This is a gene sequencing program from the STAMP benchmark suite [58]. A gene is reconstructed by matching DNA segments of a larger gene. The application has been parallelized through barriers and large critical sections.

• Vacation: This application also comes from the STAMP benchmark suite [58] and implements a non-distributed travel reservation system. Each thread interacts with the database via the system's transaction manager. Vacation features large critical sections.

• Kmeans: This is a program from the STAMP benchmark suite [58] that groups objects into K clusters. It uses barrier-based synchronization and features small critical sections.

4.3.1 Overhead Characterization

In this section, we further detail the overhead analysis of the proposed data versioning schemes in terms of space and time. As described in Section 4.2.2, the full-mirroring design requires half of the TCDM memory space to be reserved for the mirror addresses, even though not all of them will actually be used. The distributed logging design, on the other hand, employs distributed per-address logs instead of mirrors. We can fine-tune the size of those logs based on the actual write footprint of the transactions. As a result, the distributed logging scheme provides better utilization of the available memory, since it reserves for the logs only the space the transactions need, leaving the rest to the application, while full-mirroring reserves half of the available memory for mirrors that will not be entirely used.

For each application that was run, the maximum per-core transactional write footprint was measured (i.e., the maximum size of data that is written within a single transaction by a core), when running applications with the maximum number of cores that the cluster can accommodate (16 cores). The results are reported in Table 4.1. The second column shows how many bytes are actually written within a single transaction of a core. The third column shows how many bytes need to be reserved in total for all cores' transactions, which is the amount of space we need to keep for the logs. We observe that in the worst case we need 5 Kbytes for all the logs in the system. For a TCDM size of 256 Kbytes, that is roughly 2% of the total TCDM memory, which means that we can use the remaining 98% of the memory for the actual application data. If we use full-mirroring, we are able to utilize only 50% of the TCDM memory space for the application, which is a considerably less space-efficient solution.

The Distributed Logging scheme has its own cost as well. Since per-address logs are used and not mirrors, the position of each address in the log is not straightforward as in full-mirroring. As a result, every time an address is saved in the log, the log has to be traversed in order to find whether the address already exists there and if not, a new entry has to be added for that address. In 81

Application    Per-core trans. write footprint (bytes)    Total log space (bytes)
Redblack       256                                         4096
Skiplist        64                                         1024
Vacation       320                                         3072
Genome         192                                         5120
Kmeans         256                                         4096

Table 4.1: Per-core transactional write footprint for each application.

In full-mirroring, the location of each address's mirror can be computed very simply, by adding the mirror offset (i.e., the base address of the mirror region) to the address. As a result, for full-mirroring, each time an address is saved, two extra cycles are necessary (one for reading the original content of the address and one for writing it to the mirroring address). For distributed logging, each time an address is saved, the log has to be traversed first. Based on benchmark profiling, a core's log in a bank can hold up to 5 entries at a time, so 5 extra cycles are necessary to traverse the log in the worst case.
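To make the bookkeeping difference concrete, the following C sketch contrasts the two address-saving paths. It is purely illustrative: the mirror offset, the log capacity and all names (MIRROR_OFFSET, core_log_t, log_save) are hypothetical and are not part of the actual hardware design.

    #include <stdint.h>
    #include <stddef.h>

    #define MIRROR_OFFSET 0x20000u  /* hypothetical base of the mirror region          */
    #define LOG_CAPACITY  5         /* worst-case entries per core per bank (profiled) */

    typedef struct {
        uint32_t addr;              /* original TCDM address                  */
        uint32_t old_value;         /* content before the transactional write */
    } log_entry_t;

    typedef struct {
        log_entry_t entries[LOG_CAPACITY];
        size_t      count;
    } core_log_t;

    /* Full-mirroring: the backup location is one addition away, so saving an
     * address always costs two extra cycles (read original, write mirror). */
    static inline uint32_t mirror_address(uint32_t addr)
    {
        return addr + MIRROR_OFFSET;
    }

    /* Distributed logging: the per-core log must be traversed first; a new
     * entry is appended only if the address has not been logged yet. */
    static int log_save(core_log_t *log, uint32_t addr, uint32_t old_value)
    {
        for (size_t i = 0; i < log->count; i++)
            if (log->entries[i].addr == addr)
                return 0;           /* already logged, nothing to do          */
        if (log->count == LOG_CAPACITY)
            return -1;              /* overflow: would need software handling */
        log->entries[log->count].addr      = addr;
        log->entries[log->count].old_value = old_value;
        log->count++;
        return 1;
    }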

In case of an abort, the total restoration time clearly depends on the write footprint of the target application: the higher the number of writes within a transaction, the bigger the size of the logs or the number of saved mirrors that need to be restored. For each address that needs to be restored, 2 cycles are spent, one for reading the original value from the log space and one for writing it back to its original address.

In case of commit, no data need to be restored: both full-mirroring and distributed logging are eager versioning mechanisms, so the transactional data are already in place upon commit. This makes commits very fast.

4.3.2 Performance Characterization

In this section the proposed transactional memory design is evaluated, both for the full-mirroring and the distributed logging scheme, and compared with the use of a conventional locking scheme.

For each benchmark, experiments were run using 1, 2, 4, 8 and 16 cores and the total execution time was measured in cycles. As mentioned in Section 4.2.3, a requester-abort policy was chosen for managing conflicts. This is a basic approach also chosen in previous works on transactional memory.

Parameter              Value
Main memory latency    200 ns
Main memory size       128 MB
Core frequency         200 MHz
# Cores                1, 2, 4, 8, 16
TCDM size              256 KB
# TCDM banks           16
I$ size                4 KB
I$ t_hit               1 cycle
I$ t_miss              ≥ 50 cycles

Table 4.2: Experimental setup for the VSoC platform.

Exponential backoff was also incorporated in the transactional retry process: When a core aborts, it does not retry the transaction immediately. Instead it halts the execution of the transaction, restores the original register values and then waits for a random backoff period, after which it begins re-executing the transaction. The range of the backoff period is tuned according to the conflict rate.

The first time a conflict occurs in a particular transaction, the core waits for an initial random period (< 100 cycles) before restarting. If the transaction conflicts again, the backoff period is doubled, and it continues to double each time until the transaction completes successfully. This way, the scenario of a sequence of conflicts happening repeatedly between cores that retry the same aborted transaction is avoided.
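The retry loop with exponential backoff could be sketched in C as follows. The helpers tx_execute(), restore_registers() and delay() are hypothetical placeholders for the platform's transaction and timing facilities; only the sub-100-cycle initial window and the doubling behavior come from the description above.

    #include <stdlib.h>
    #include <stdbool.h>

    #define INITIAL_BACKOFF 100          /* initial random window, < 100 cycles */

    /* Hypothetical helpers provided by the runtime/simulator. */
    extern bool tx_execute(void);        /* returns true on commit, false on abort */
    extern void restore_registers(void); /* roll back the core's register state    */
    extern void delay(unsigned cycles);  /* busy-wait for the given cycle count    */

    void run_transaction(void)
    {
        unsigned window = INITIAL_BACKOFF;

        while (!tx_execute()) {          /* abort: the transaction conflicted      */
            restore_registers();         /* restore the original register values   */
            delay(rand() % window);      /* wait a random period within the window */
            window *= 2;                 /* double the window after every conflict */
        }
        /* tx_execute() returned true: the transaction committed. */
    }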

Table 4.2 shows the experimental setup of the VSoC platform that was used in the experiments.

The t_hit and t_miss values represent the instruction cache access times in case of a hit or a miss, respectively. Accesses to the off-cluster main memory take 200 ns, significantly longer than accesses to the on-cluster TCDM, which take only 4 ns in total. Access to the off-cluster main memory is assisted through a DMA with 0.5 Gbytes/sec bandwidth.2

We first run experiments for redblack, skiplist, genome, kmeans and vacation. The results are shown in Figures 4.8, 4.9, 4.10, 4.12 and 4.11, which compare running the applications with spin-locks against the proposed transactional scheme using either the full-mirroring or the distributed logging design, for different numbers of cores.

2Since we are assuming a DMA to assist in the data transfer, access time per word will not take the full 200ns.

Redblack (execution time normalized to single-core locks, for 1/2/4/8/16 cores):
Locks            100%   68%   63%   65%   66%
Full Mirroring   108%   62%   39%   28%   23%
Logging          105%   59%   37%   25%   20%

Figure 4.8: Redblack: Performance comparison between locks and transactions for different number of cores.

Skiplist (execution time normalized to single-core locks, for 1/2/4/8/16 cores):
Locks            100%   73%   67%   68%   68%
Full Mirroring   106%   59%   36%   25%   19%
Logging          105%   58%   35%   23%   17%

Figure 4.9: Skiplist: Performance comparison between locks and transactions for different number of cores.

Genome (execution time normalized to single-core locks, for 1/2/4/8/16 cores):
Locks            100%   94%   95%   98%  104%
Full Mirroring    98%   51%   29%   19%   21%
Logging           98%   50%   27%   18%   19%

Figure 4.10: Genome: Performance comparison between locks and transactions for different number of cores.

For each benchmark, we show the percentage change in execution time relative to a baseline execution time of a single core with locks. We make the following observations. First, for the locking scheme, even though performance improves as we scale from 1 to 2 cores, it does not show significant improvement and in most cases gets worse as we move above 4 cores. This means that the performance scaling we hope to achieve through parallel execution does not follow the scaling of the number of cores. This is expected, since in a standard locking scheme the cores spend a lot of time spinning on the locks before entering the critical section. As a result, execution of the critical sections is serialized and this effect becomes worse as lock contention increases.

The second observation we make from the figures is that the transactional memory configurations always achieve better performance than the standard locking scheme. This is because, in the transactional memory scheme, the cores execute critical sections speculatively, assuming a real data conflict will not occur; locks, on the other hand, conservatively assume a conflict will occur and thus effectively serialize all accesses to critical sections.

Vacation (execution time normalized to single-core locks, for 1/2/4/8/16 cores):
Locks            100%   93%   99%  128%  188%
Full Mirroring    99%   52%   30%   26%   28%
Logging          106%   56%   31%   25%   28%

Figure 4.11: Vacation: Performance comparison between locks and transactions for different number of cores.

Kmeans (execution time normalized to single-core locks, for 1/2/4/8/16 cores):
Locks            100%   86%  102%  123%  250%
Full Mirroring    97%   83%   88%  107%  173%
Logging           97%   83%   91%  108%  175%

Figure 4.12: Kmeans: Performance comparison between locks and transactions for different number of cores.

If the abort rate is not significant, the overall performance improves tremendously. However, even in the event of aborts, transactions are restarted after an exponential backoff period, ensuring that the retrying cores will not conflict repeatedly.

Third, we observe that the proposed transactional scheme, both in the full-mirroring and the distributed logging configuration, achieves the intended scaling that we expect from using multiple cores instead of a single core (the only exception to this is kmeans, which we explain later). In most cases, performance keeps doubling as we continue to double the number of cores in the cluster. As the number of cores increases beyond 4 cores though, the performance scaling is slightly reduced and in some cases it levels off (genome, vacation). This is due to the main memory accesses that generate large delays and end up masking the benefits of running on a larger number of cores. These accesses exist in all runs, independent of the number of cores we use, but their effect becomes more pronounced for the 8 and 16 cores configurations since the total execution time is further reduced as we increase the number of cores. In addition, not all benchmarks exhibit similar benefit from parallelism. However, in most cases performance is still improved compared to using fewer cores and in all cases it is better than the performance of the locking scheme.

Next, we examine how the full-mirroring design compares with the distributed logging design in terms of performance. We observe that for redblack, skiplist and genome, the distributed logging scheme always outperforms the full-mirroring scheme, while for vacation it is worse for a small number of cores and gets better as the number of cores is increased. The kmeans application is the only one in which the distributed logging scheme shows worse performance than the full-mirroring configuration. As we discussed in Section 4.3.1, the distributed logging scheme incurs a slightly bigger overhead in saving the logs than the full-mirroring design. This would normally result in the distributed logging scheme being worse in performance compared to the full-mirroring scheme. At the same time, the distributed logging scheme uses memory more efficiently, allowing a larger quantity of data to be stored in the TCDM. In contrast, for full-mirroring a larger number of main memory accesses may be required for refilling data, which masks the savings gained from a simpler data versioning scheme. Redblack, skiplist and genome have a large number of refill accesses for the full-mirroring scheme because of its inefficient use of the TCDM. As a result, in those benchmarks we see worse performance for the full-mirroring design. On the other hand, kmeans and vacation have smaller data footprints, hence only a small number of refill accesses to the main memory are necessary. As a result, the performance difference caused by the log saving process in the distributed logging scheme now becomes visible. It is even more pronounced for a smaller number of cores, since fewer cores mean fewer but larger logs, hence a bigger log traversal overhead.

Overall, we see that the transactional memory scheme achieves significant performance improvement compared to standard locks. Specifically, this improvement ranges from 38% to 80% for redblack, 41% to 83% for skiplist, 49% to 82% for genome, 9% to 17% for kmeans and 44% to 72% for vacation over the baseline single-core lock configuration, depending on the number of cores used. Locks, in comparison, cannot effectively exploit the extra parallelism offered by adding additional cores, so their performance improvement lags far behind. We conclude that a transactional memory support scheme, when designed carefully based on the needs of the target architecture, can achieve significant performance improvements.

4.3.3 EigenBench

To further evaluate and compare the two proposed TM variants, we use the EigenBench exploration tool. EigenBench [96] is a simple microbenchmark for evaluating TM systems that allows for exploration of several eigen-characteristics, i.e., a set of orthogonal characteristics of TM applications that form a basis for all TM applications (similar to how a basis in linear algebra spans a vector space). The benchmark makes it possible to decouple the eigen-characteristics from each other and vary them independently, enabling the evaluation of corners of the application space not easily reached by real programs. Specifically, the focus here is on three characteristics that are relevant to the proposed

TM designs:

• Working-set size: This parameter represents the size of the memory accessed within transactions. Since our design requires explicit DMA transfers for TCDM management, when the transaction's memory footprint increases beyond the size of the TCDM we will experience a performance drop due to the DMA transfers;

• Contention: This parameter represents the probability of conflicts for a transaction. Since the roll-back mechanism that takes place upon conflict is different for the two TM implementations, we expect this to have an impact on performance when the conflict rate is high;

• Predominance: This parameter represents the fraction of cycles spent in memory operations within transactions to cycles spent in memory operations outside transactions. It thus represents a measure of the overhead for handling transactional reads/writes (i.e., for data versioning). When the predominance factor is low, we expect this overhead to be negligible, while for high predominance it could be important.

Note that there are no other instructions inside or outside transactions besides the memory accesses described above. This is the default setting for EigenBench when measuring the worst-case overhead of the TM system being evaluated.

Figure 4.13 shows three plots that report the results for the above described eigen-characteristics.

We measure the execution cycles for four configurations:

1. FM: Transactions are handled with the Full-Mirroring scheme;

2. LOGGING: Transactions are handled with the Logging scheme;

3. LOCKS: Transactions are protected with locks;

4. UNPROTECTED: Transactions are not protected and instead are allowed to run fully in parallel. While this is functionally not correct, it provides an upper bound on the achievable performance.

[Figure 4.13 plots: speedup of FM and LOGGING versus LOCKS as a function of working-set size (100 to 300 KB) and of contention (0% to 90% conflict rate), and speedup of FM, LOGGING and LOCKS versus UNPROTECTED as a function of predominance (9% to 100%).]

Figure 4.13: Results for the EigenBench evaluation methodology. Eigen-characteristics considered are working-set size (top), contention (middle) and predominance (bottom).

Next, we analyze the results shown in Figure 4.13.

Working-set size: We observe the speedup of the two TM systems versus locks. On the X-axis we see the transaction's working-set size (i.e., the footprint of transactional accesses) in KB. The first thing to emphasize is that for a transactional data footprint smaller than 128KB (the size of the TCDM that a program can use in the FM scheme) both TM systems perform very closely to the ideal (UNPROTECTED) case. Beyond 128KB the FM system starts suffering from DMA transfers. The same happens for the LOGGING scheme when the transactional data footprint grows beyond 226KB.

Contention: We observe again the speedup of the two TM systems versus locks. On the X-axis we see the transaction conflict rate in percent. For very low conflict rates both TM systems perform very closely to the UNPROTECTED case. As the conflict rate increases, their performance starts dropping, and at around a 30% conflict rate the LOGGING scheme starts behaving slightly worse than FM. This, as expected, is due to its slightly costlier rollback. It is also relevant to notice that around a 75% conflict rate both schemes start performing worse than locks.

Predominance: We see the slowdown of the two TM systems versus UNPROTECTED, as we assess how the overhead for transactional read/write logging causes the TM schemes to depart from the ideal performance. On the X-axis we see the percent predominance. This plot reveals that data versioning has a very low overhead in both designs, as the performance is consistently very close to the UNPROTECTED case.

4.4 Summary and Discussion

In this chapter, a novel HTM scheme was proposed that is targeted to a many-core cluster-based embedded architecture without caches or a cache-coherence protocol. To the best of our knowledge, this is the first design for speculative synchronization in this type of architecture. A transactional support mechanism was designed from scratch for handling transactions without relying on an underlying cache coherence protocol to manage read and write memory conflicts. This mechanism is based on the idea of distributing conflict detection and resolution across multiple Transaction

Support Modules (TSMs) that keep track of read and write memory accesses and guarantee execution correctness. Distributing synchronization management makes the design inherently scalable.

Two alternative data versioning designs were proposed: full-mirroring and distributed logging. A memory overhead and performance comparison between the two designs over a range of benchmarks showed that while full-mirroring is a very simple and fast design, it is wasteful and generally impractical in terms of memory, using 50% of the TCDM for mirrors when only about 2% is required in the distributed logging scheme. Furthermore, simulations showed that the full-mirroring design requires more main memory accesses for data refilling, which can hurt performance and (depending on the number of cores and the data footprint) makes distributed logging the better choice in most cases. Simulations on data structure applications, benchmarks from the STAMP benchmark suite and the EigenBench microbenchmark showed that both proposed transactional memory designs achieve a significant performance improvement over traditional lock-based schemes, ranging from 9% to 83% depending on the number of cores.

This base transactional support scheme gives us a good understanding of how speculation can provide performance benefits on a cluster-based framework of this particular cache-free structure.

In the future it would be interesting to consider alternative schemes that may provide even better efficiency, such as different conflict resolution policies or bookkeeping designs. As a first step, transactions were restricted within a single cluster and the focus was on simple and fast transactional handling schemes. While the current implementation is limited to single-cluster accesses, the proposed scheme is designed so that it is scalable and can be extended to multiple clusters. It would be interesting to study how this can be done using inter-cluster transactional support. A different direction would be to explore how this design can be used for alternative purposes other than synchronization. The next chapter examines this possibility by looking into adopting Transactional

Memory mechanisms for reliability.

Chapter 5

Transactional Memory Revisited for Error-Resilient and

Energy-Efficient MPSoC Execution

In Chapters 3 and 4 we saw how TM-based speculation can be used for data synchronization in embedded multicore systems, to improve performance and energy-efficiency. Having observed how much transactions can benefit synchronization in embedded multi-core environments, it would be interesting to study how transactions could be used for alternative purposes, other than traditional data synchronization. In this chapter we explore the use of hardware transactional memory as a recovery mechanism from timing errors or the Critical Operating Point (COP) in multi-core embedded systems operating far beyond the safe nominal supply voltage. Specifically, we propose a scheme that dynamically monitors the platform and adaptively adjusts to the COP among multiple cores, using lightweight checkpointing and roll-back mechanisms adopted from Hardware Transactional Memory

(HTM) for error recovery. Experiments demonstrate that this technique is particularly effective in saving energy while also offering safe execution guarantees. To the best of our knowledge, this work

is the first to describe a full-fledged HTM implementation for error-resilient and energy-efficient

MPSoC execution.

5.1 Motivation

Scaling of physical dimensions in semiconductor devices has opened the way for heterogeneous embedded SoCs integrating host processors and many-core accelerators in the same chip [69], but at a price of ever-increasing static and dynamic hardware variability [70]. Spatial die-to-die and within-die static variations ultimately induce performance and power mismatches between the cores in a many-core array, introducing heterogeneity in a nominally homogeneous system (formally identical processing resources). Dynamic variations depend on the operating conditions of the chip, and include aging, supply voltage drops and temperature fluctuations. The most common consequence of variations is path delay uncertainty. Circuit designers typically use conservative guardbands on the operating frequency or voltage to ensure safe system operation, with the obvious consequent loss of operational efficiency. When the guardbands are reduced, or when the system is aggressively operated far from a safe point, the delay uncertainty manifests itself either as an intermittent timing error [2] [21] or a critical operating point (COP) [20]. As we saw in Section 2.2.5, timing errors violate the setup or the hold time constraints of the sequential element connected at the end of the path, which in turn can cause erroneous instructions with wrong outputs being stored or, worse, incorrect control flow. The COP defines a voltage and frequency pair at which a core is error-free. If the voltage is decreased below (or the frequency is increased beyond) the COP, the core will face a massive number of errors [20]. The COP effect is highly pronounced in well-optimized designs [71] [72].

Circuit level error detection and correction (EDAC) techniques [2] [21] can transparently detect and correct timing errors, with the side-effect of increased execution time and energy. In addition, while EDAC techniques are suitable for handling sporadic errors, they are obviously not a good solution for the “all-or-nothing” effect of the COP. In principle the COP can be determined for a particular chip after its production, and the most efficient yet safe voltage/frequency pair for the chip could be configured at that time. However, due to static and dynamic variations, the COP may actually change over space and time. As a result, the “safe” operating point may i) differ from one core to another (forcing the entire chip to be conservatively tuned to meet the requirements of the most critical core) and ii) suddenly become unsafe due to aging, temperature fluctuations or voltage drops.

This chapter proposes an integrated HW/SW scheme that can address both types of variation phenomena. In particular, the proposed scheme dynamically adjusts to an evolving COP, thus enabling the system to operate at highly reduced margins without sacrificing performance, while at the same time guaranteeing forward progress at reduced energy levels. To achieve that, it monitors the platform and adaptively adjusts to the COP among multiple cores, using lightweight checkpointing and roll-back mechanisms adapted from Hardware Transactional Memory (HTM) for error recovery.

The platform is initially configured to operate at a safe, reference operating voltage (i.e., with safe margins to hide all variability effects). Every time a new transaction is started, the proposed technique optimistically lowers the voltage in small steps, individually on each core. If sporadic or non-critical errors take place, the HTM-inspired techniques intervene and ensure correct program behavior and progress. If systematic or critical errors take place, then the system reverts to the previous stable operating point. If over time the COP changes, the technique is re-activated and the system is re-calibrated. Next, the details of the target architecture and the proposed design are presented.

[Figure 5.1 block diagram: a host system and a cluster-based programmable many-core accelerator (PMCA) with critical path monitors (EDS), connected through the system interconnect to a memory controller, DRAM and the shared-memory cluster.]

Figure 5.1: Target platform high level view.

5.2 Target Architecture

The proposed HW/SW design is driven to a large extent by the target architecture (Fig. 5.1). This is basically the same architecture that we considered in Chapter 4 for implementing hardware transactional memory for data conflict speculation and management. A general-purpose host processor is coupled with a programmable many-core accelerator (PMCA) composed of several tens of simple cores, where critical computation kernels of an application can be offloaded to improve overall performance/Watt. We assume that the host core is operated with safe margins. This work focuses on the PMCA, and in particular on a design that leverages a multi-cluster configuration to overcome scalability limitations [69] [17] [15]. The goal is to improve energy efficiency by operating the PMCA

“dangerously close” to the COP, while exploiting the HTM to avoid failures.

In this multi-cluster configuration, simple processing elements are grouped into clusters sharing high-performance local interconnect and memory. Several clusters are replicated and interconnected through a scalable medium such as a network-on-chip (NoC), while within a cluster a limited number of simple processors (typically 4 to 16) share an L1 tightly-coupled data memory (TCDM). Also, as in Chapter 4, this work focuses on a single computation cluster. Here, the cluster is configured with 8 cores with private instruction cache (1 KByte) and 16 TCDM banks (256KB), plus external

(main) L2 memory (2MB). The TCDM is implemented using two different technologies: 6-transistor

SRAM and Standard Cell Memory (SCM). The SCM achieves lower density (∼3X) than SRAM, but can reliably operate at the same voltage ranges as the rest of the logic. SRAM requires higher voltages to operate reliably and thus consumes ∼4X the energy [97]. The SCM is used to implement storage that needs to always be reliable (e.g., for function calls, control-flow data and the instruction cache), while program data is stored in SRAM and the HTM techniques are used to recover from errors. The HTM extensions for error-tolerance are designed on top of this baseline cluster. More specifically, existing checkpointing and rollback mechanisms that have been employed for HTM in Chapter 4 are now revisited to be used as a lightweight mechanism for fast and efficient error recovery.

All the base performance/energy/area numbers used in this work are derived from a silicon implementation of the platform in 28nm STMicroelectronics UTB FD-SOI technology, and integrated in the VSoC simulator. The cluster is able to operate over a wide range of frequencies (from 20MHz

@ 0.5V up to 450MHz @ 1.2V). The target frequency is 200MHz, with a nominal voltage of 0.84V.

Due to process variation the required Vdd for a safe operating condition may actually vary among cores (up to 0.04V increase is observed). Different sources of dynamic variations also increase the minimum voltage level required for safe operation. The baseline platform considers safe margins to compensate for all sources of variability, and is thus conservatively operated at a reference voltage of 1V. Any errors caused by dynamic variation need to be detected at runtime. We assume each core is equipped with error-detection circuitry such as error-detection sequential (EDS) [21].

5.3 Implementation

The proposed scheme borrows key concepts from Hardware Transactional Memory (HTM) to provide a mechanism for error recovery. Recall that HTM requires three key components: i) some form of transactional bookkeeping, for keeping track of read/write data conflicts, ii) a data versioning mechanism, for keeping track of speculative and non-speculative versions of data in case it is necessary to roll back and recover from a data conflict, and iii) a rollback mechanism in order to recover in case of conflicts. Since transactional memory is not used here in its traditional context of conflict detection, a bookkeeping mechanism is not necessary, which makes the design considerably more lightweight. The only mechanisms that need to be adopted from HTM are data versioning and rollback, in order to recover from variability-induced errors. Next, these two key parts of the design, as well as the control flow, are described in detail.

5.3.1 Checkpointing and Rollback

Checkpointing is the mechanism that saves the system's state for retrieval in case of errors. All parallel parts of the program are protected from errors by enclosing them within transactions (see Section 5.3.4). At the beginning of each transaction (Transaction Start) the internal state of the core is saved (i.e., program counter, stack pointer, internal registers, stack contents) to be able to roll back in case of errors. As with conflict resolution (Section 2.2.1.3), error resolution can be eager or lazy, meaning that we can resolve the error by aborting the transaction and rolling back right away, or wait until the end of the transactional region to do so. In this design, a lazy error resolution scheme is chosen to avoid the cost of frequent error checking. The transactional regions' sizes are fine-tuned to be small enough so that if errors start occurring, it won't be long before they get detected and the core's voltage is adjusted back to safer levels. Thus, when a transaction completes execution (Transaction End) the system checks whether errors have been encountered by the core executing the transaction. If no errors are detected the transaction commits, the checkpointing information is discarded and speculative changes to the data become permanent. If errors are detected the transaction aborts and a rollback mechanism restores the internal core state.
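A possible shape for the per-core checkpoint taken at Transaction Start is sketched below; the structure layout, the sizes and the helper name are assumptions made for illustration and simply mirror the state listed above (program counter, stack pointer, registers, stack contents).

    #include <stdint.h>
    #include <string.h>

    #define NUM_REGS   32
    #define STACK_SAVE 256               /* bytes of stack preserved; illustrative size */

    typedef struct {
        uint32_t pc;                     /* program counter at Transaction Start */
        uint32_t sp;                     /* stack pointer                        */
        uint32_t regs[NUM_REGS];         /* general-purpose register file        */
        uint8_t  stack[STACK_SAVE];      /* copy of the top of the stack         */
    } checkpoint_t;

    /* Save the core state at Transaction Start so it can be rolled back later;
     * on commit the checkpoint is simply discarded. */
    static void checkpoint_save(checkpoint_t *cp, uint32_t pc, uint32_t sp,
                                const uint32_t *regs, const uint8_t *stack_top)
    {
        cp->pc = pc;
        cp->sp = sp;
        memcpy(cp->regs, regs, sizeof(cp->regs));
        memcpy(cp->stack, stack_top, sizeof(cp->stack));
    }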

5.3.2 Data Versioning

As discussed in section 2.2.1.3, data versioning can also be either eager or lazy. Lazy data versioning keeps the original data in place and buffers the speculative data updates in different locations

(allowing for fast error recovery). Eager data versioning makes speculative changes in place and stores back-up copies of the original data in separate places (allowing for fast commits but slow abort handling). In this case, aborts due to errors are expected to be infrequent, since the voltage is increased as soon as the COP is reached or a timing error is encountered. Hence, an eager data versioning mechanism is the better choice.

As was done in Chapter 4 for our HTM design, our new design also uses the TCDM memory to hold both speculative and non-speculative data. For data versioning, the distributed per-address log scheme described in Section 4.2.2.2 is used, since that scheme is simple, fast and more space-efficient than the alternative full-mirroring design. Distributed per-address logs are used to save backups of the original values of data that are written during transactions, so that they can be recovered in case of errors. Data logs are distributed across the TCDM memory banks, so that each bank is responsible for handling recovery only for its associated data.

Since memory is distributed across multiple memory banks that accept and serve access requests in parallel, having a central control logic to manage the distributed logs would not be efficient. For this reason, transactional handling and log managing responsibilities are divided across multiple control modules, one for each bank of the TCDM, called Data Versioning Modules (DVMs). Each bank's DVM is a control block that monitors transactional accesses to the bank and manages the cores' logs that reside in that bank. It is also responsible for restoring the log data of the cores that abort their transactions and cleaning the logs of the cores that commit their transactions. All banks' DVMs work in parallel and independently of each other. DVMs are the equivalent of the TSMs described in Section 4.2, with the difference that they are not responsible for handling transactional bookkeeping, since it is not necessary in this case. In every bank of the TCDM, a fixed-size log space is kept for each core in the system. Each core's log holds the addresses that belong to that bank and are written transactionally by that core. In this way, a log space is kept only for the addresses of the bank that are actually written transactionally. At the same time, with this distributed log design cross-bank data exchange is avoided when saving and restoring the log, since each address's log falls within the same bank. Thus the log saving and restoration process is triggered internally by the DVM of each bank and it does not require interaction with the DVMs of other banks.

When a core writes transactionally to an address of a bank, its log is traversed to check whether it already holds an entry for that address. If not, a new log entry is created to store the original data of the address. Note that the data only need to be logged the first time the address is written within a specific transaction. Therefore, the log size depends on the write footprint of each transaction.

Since the log of each core is distributed among all the TCDM banks, we expect that the log writes will also be divided among the banks. The size of each core's log space per bank is a parameter in our design, so it can be easily adjusted to the needs of different application domains. In case of an overflow, the system resorts to software-managed logging into the main L2 memory. The capability of tuning the transactions' granularity is intuitively key to reducing the number of overflows. Using the technique described in Section 5.3.4, it was found that 1KB total log size per core (64B in each

TCDM bank) is adequate for the target applications. Overall, the logs for all the cores occupy roughly 3% of the total TCDM space.

If an error is detected and a transaction must abort, each bank's log is traversed to restore the original data back to its proper address. If a transaction commits, the logs associated with that transaction are all discarded and the speculative data now become non-speculative.
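As an illustration of the DVM's role, the C sketch below models the per-bank, per-core logs and the two operations triggered on abort and commit. The sizes, the types and the tcdm_write() hook are hypothetical; only the per-address restore on abort and the discard-on-commit behavior follow the description above.

    #include <stdint.h>
    #include <stddef.h>

    #define NUM_CORES   8
    #define LOG_ENTRIES 8                /* 64 B per core per bank, 8 B per entry (illustrative) */

    typedef struct {
        uint32_t addr;                   /* address within this bank             */
        uint32_t old_value;              /* original value logged on first write */
    } dvm_entry_t;

    /* Per-bank state managed by this bank's Data Versioning Module (DVM). */
    typedef struct {
        dvm_entry_t log[NUM_CORES][LOG_ENTRIES];
        size_t      count[NUM_CORES];
    } dvm_bank_t;

    /* Hypothetical accessor for a word in this bank's portion of the TCDM. */
    extern void tcdm_write(uint32_t addr, uint32_t value);

    /* Abort: walk the aborting core's log and restore every original value
     * (in hardware: one read from the log and one write back per address). */
    void dvm_restore_logs(dvm_bank_t *bank, int core)
    {
        for (size_t i = 0; i < bank->count[core]; i++)
            tcdm_write(bank->log[core][i].addr, bank->log[core][i].old_value);
        bank->count[core] = 0;
    }

    /* Commit: the speculative data is already in place (eager versioning),
     * so the log is simply discarded. */
    void dvm_clean_logs(dvm_bank_t *bank, int core)
    {
        bank->count[core] = 0;
    }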

[Figure 5.2 flowchart: Start_Transaction(); if the COP has not yet been found, Lower_Voltage(); execute the transaction; if no errors are detected the transaction commits (Clean_Logs(), discard checkpoint), otherwise it aborts (Restore_Logs(), restore state, Increase_Voltage()); the transaction then ends.]

Figure 5.2: Control Flow of an error-resilient transaction.

5.3.3 Error-Resilient Transactions

The flowchart in Figure 5.2 describes the semantics of the error-resilient transactions (ERT). The execution starts with all platform components set at the safe reference voltage level (1.0V). Each time a core encounters a new transaction it saves its internal state and current stack and checks whether the self-calibration procedure was previously completed and the COP for this core is known.

If the COP is still unknown, the executing core optimistically lowers its voltage level by a pre-defined step (0.02V). If the COP has already been reached, then no voltage adjustment is made.

If the transaction end is reached without errors being detected, the transaction commits. A clean_logs() process is activated at each bank's DVM to clean up the saved log of the committing core in the respective bank. Note that all these processes are triggered simultaneously by the DVMs of all memory banks. If errors are detected, then the transaction aborts. A restore_logs() process is activated simultaneously at each bank's DVM to restore all the saved log values of the aborted core.

The internal state of the aborted core is restored, its voltage is adjusted back to the previous safe level (increase_voltage(), a +0.02V voltage increase beyond the recently found COP), and the core is ready to retry the transaction. From this point on, the voltage level is no longer reduced when starting a new transaction1.
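Putting the pieces together, the per-transaction control flow of Figure 5.2 could be sketched in C as follows. The platform hooks (set_voltage(), errors_detected(), etc.) are hypothetical names; only the 0.02V step and the 1.0V reference come from the values given in the text.

    #include <stdbool.h>

    #define V_REF  1.00    /* safe reference voltage (V)       */
    #define V_STEP 0.02    /* per-transaction scaling step (V) */

    /* Hypothetical platform hooks. */
    extern void checkpoint_state(void);
    extern void restore_state(void);
    extern void run_transaction_body(void);
    extern bool errors_detected(void);
    extern void set_voltage(double v);
    extern void clean_logs(void);
    extern void restore_logs(void);

    static double vdd       = V_REF;
    static bool   cop_found = false;

    void error_resilient_transaction(void)
    {
        for (;;) {
            checkpoint_state();          /* save PC, SP, registers, stack         */
            if (!cop_found) {            /* COP not reached yet: scale down       */
                vdd -= V_STEP;
                set_voltage(vdd);
            }
            run_transaction_body();
            if (!errors_detected()) {    /* commit path                           */
                clean_logs();            /* discard backups and the checkpoint    */
                return;
            }
            restore_logs();              /* abort path: undo speculative writes   */
            restore_state();
            cop_found = true;            /* the COP has just been crossed         */
            vdd += V_STEP;               /* go back to the last safe level        */
            set_voltage(vdd);            /* and retry the transaction             */
        }
    }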

OpenMP code:

    #pragma omp for schedule(dynamic, CHUNK)
    for (i = LB; i < UB; i++)
    { /* LOOP_BODY */ }

Transformed code:

    int start, end, work_left;
    work_left = loop_dynamic_start(LB, UB, 1, CHUNK, &start, &end);
    while (work_left) {
        ...
        for (i = start; i < end; i++)
        { /* LOOP_BODY */ }             /* TRANSACTION BODY */
        ...                             /* ERROR-RESILIENT TRANSACTION */
        work_left = loop_dynamic_next(&start, &end);
    }

Figure 5.3: Transformed OpenMP dynamic loop.

5.3.4 Programming model

Similar to prior approaches [98] [3], in this work transactional memory is integrated into OpenMP [99], a widespread and easy-to-use programming model. An OpenMP program starts on a single thread of execution (the master). Once the parallel directive is encountered, additional threads are created, and execute the code enclosed within the syntactic boundaries of the construct. The work is parallelized among threads using worksharing directives. For illustration purposes we describe here one of the most used among such directives: dynamic loops. Figure 5.3 shows a code snippet with a #pragma omp for directive, used to distribute loop iterations among threads. The schedule(dynamic, CHUNK) clause is used to specify that iterations should be grouped in smaller sets of size CHUNK, and distributed in a dynamic (first come, first served) fashion.

The bottom part of Figure 5.3 shows how this is achieved once the code is transformed by an

1In case a temperature reduction is detected, the voltage can be further decreased, as the COP has “moved” downwards.

OpenMP compiler. Runtime library calls are inserted to interact with an iteration scheduler. First, the scheduler is initialized (loop_dynamic_start), passing as parameters the original loop bounds (LB, UB), the stride and CHUNK. If there are iterations available, the function returns a positive integer (stored in work_left) and initializes the output parameters start and end with the lower and upper boundaries for the current chunk of iterations. The original loop body is then executed for these iteration instances and a new call to the runtime library (loop_dynamic_next) repeats the process until there are no iterations left.
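For illustration, a minimal single-cluster iteration scheduler behind these two calls might look like the sketch below. This is not the actual OpenMP runtime implementation: the fetch_and_add() hook is assumed, and initialization is simplified to a single caller.

    /* Shared scheduler state for one dynamic loop (one loop at a time). */
    static int g_next, g_ub, g_stride, g_chunk;

    /* Hypothetical hardware-assisted atomic fetch-and-add on shared memory. */
    extern int fetch_and_add(int *addr, int value);

    int loop_dynamic_next(int *start, int *end);

    int loop_dynamic_start(int lb, int ub, int stride, int chunk, int *start, int *end)
    {
        /* Simplified: in a real runtime only one thread would initialize this state. */
        g_next = lb; g_ub = ub; g_stride = stride; g_chunk = chunk;
        return loop_dynamic_next(start, end);
    }

    int loop_dynamic_next(int *start, int *end)
    {
        int step = g_chunk * g_stride;
        int s = fetch_and_add(&g_next, step);      /* claim the next chunk of iterations */
        if (s >= g_ub)
            return 0;                              /* no iterations left                 */
        *start = s;
        *end   = (s + step < g_ub) ? s + step : g_ub;
        return 1;                                  /* there is work left                 */
    }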

This mechanism can be easily augmented to wrap each CHUNK of loop iterations within an error-resilient transaction (ERT). Thus, transaction granularity at the application level may be adjusted by modifying the CHUNK parameter or with OpenMP loop scheduling clauses. This is important for performance as well as energy efficiency, since transaction granularity can impact the error rate in our context.

The same scheme can be easily applied to other OpenMP constructs (sections, task, etc.).

Moreover, to ensure robust execution at every point in program execution, we silently define ERT boundaries wherever an OpenMP construct is encountered. The sequential execution in the master thread is also wrapped in an ERT. Additional ERTs can be manually outlined in the code if necessary.

5.4 Experimental Results

The proposed architecture has been modeled in the VSoC simulator. As previously described in Chapter 4, VSoC is a SystemC-based cycle-accurate virtual platform for heterogeneous

System-On-Chip simulation, with back-annotated energy numbers for every system component.

The performance, energy, and area numbers are derived from an implementation of the platform in

STMicroelectronics 28nm UTB FD-SOI technology. This approach couples the advantages of very accurate power models with the simulation speed of the SystemC models. On average, the virtual platform shows a maximum error in timing accuracy below 6% with respect to a complete RTL 103 simulation of the same benchmark. Evaluations were conducted using real-life benchmarks from the computer vision domain: Rotate (image rotation), Strassen (matrix multiplication), Fast (corner detection) and Mahalanobis-Distance (cluster analysis and classification).

5.4.1 Overhead characterization

As a first experiment, we measure the overhead of the proposed HW/SW support for error-resilience in terms of energy and execution delay. The energy overhead for the proposed technique is quite modest: on average, only 1.7% across all benchmarks and never more than 5% of the total system energy. Similarly, execution time overhead is a reasonable 6.6% maximum.

Further detailing the analysis for distributed logs, on average, transactional writes took 0.7 to

1.5 extra cycles to complete and increased total execution time by only 0.5%. The log restoration time clearly depends on the data footprint of the target application: the higher the number of writes within a transaction, the bigger the size of the logs. For each address that needs to be restored, 2 cycles are spent, one for reading the original value from the log space and one for writing it back to its original address. The worst case restoration time per core in isolation is 32 cycles, which may be up to 8 times slower in the unlikely event that all cores were rolling back at the same time. Log restoration time never accounted for more than 3% of the total benchmark execution time. The area overhead of the proposed scheme is quite small. In particular, the distributed per-address logging scheme is space-efficient; the total log space occupies only 3% of the TCDM space. Moreover, based on [21], the area overhead introduced by EDS is nominal.

5.4.2 Energy characterization

Next, we conduct a set of measurements to assess the energy saving capabilities of the technique.

The effect of static within-die variations is modeled in the platform by considering different nominal voltages for the target frequency (200MHz) among cores, with a maximum variation of 0.04V (0.84V to 0.88V). To explore how the lowest safe voltage level changes due to temperature variations, we run experiments at three different temperature corners (25°C, −40°C and 125°C).

[Figure 5.4 data: total system energy of SV and TM at reference voltages 1V, 0.98V and 0.96V, normalized to SV at 1V, for ROTATE, STRASSEN, FAST, MD and their average; the average TM savings are 43%, 41% and 38% at the three reference voltages.]

Figure 5.4: Energy consumption at −40°C. Steady voltage (SV) versus transactional memory (TM).

[Figure 5.5 data: total system energy of SV and TM at reference voltages 1V, 0.98V and 0.96V, normalized to SV at 1V, for ROTATE, STRASSEN, FAST, MD and their average; the average TM savings are 30%, 25% and 22%.]

Figure 5.5: Energy consumption at 25°C. Steady voltage (SV) versus transactional memory (TM).

We compare the transactional memory inspired techniques (TM) to a conservative steady-voltage

(SV) technique, which uses voltage margins (guardbands) to absorb the effects of static and dynamic variations. From the measurements on silicon we observe that in the worst case, a 0.96V operating voltage would be sufficient to compensate for static and temperature variations. In practice, to also account for other sources of dynamic variations (e.g., aging, voltage drops) even more conservative voltage margins would be necessary. Thus, for each temperature corner we consider three reference voltage levels for the SV configurations (1.0V, 0.98V and 0.96V) that remain unchanged throughout execution.

[Figure 5.6 data: total system energy of SV and TM at reference voltages 1V, 0.98V and 0.96V, normalized to SV at 1V, for ROTATE, STRASSEN, FAST, MD and their average; the average TM savings are 17%, 12% and 6%.]

Figure 5.6: Energy consumption at 125°C. Steady voltage (SV) versus transactional memory (TM).

Figures 5.4, 5.5 and 5.6 show for each temperature corner the total system energy consumption of each configuration, normalized to the baseline SV configuration at reference voltage 1.0V. For each application we see two groups of three bars. The three leftmost bars correspond to the SV technique, for the three reference voltage levels (1 V, 0.98 V and 0.96 V). The three rightmost bars correspond to the TM technique, starting at the three different reference voltages. The bars at the end of each figure show the average energy improvement over all benchmarks.

We observe that the TM technique achieves significant energy savings compared to conservative execution at a steady reference voltage, for each temperature corner. Intuitively, we observe that at lower temperatures the energy improvement is significantly better compared to higher temperature corners; at lower temperatures the COP moves toward lower voltage levels, leading to larger energy savings. For example, when using a reference voltage of 1.0V and operating at −40°C, on average the TM technique can save 43% of the energy consumed by the conservative SV technique. Even when the reference voltage is lower, the TM configurations still achieve better energy savings (e.g., at −40°C, on average TM-0.98V is 41% more energy efficient than SV-0.98V and TM-0.96V is 38% more efficient than SV-0.96V). At ambient temperatures, energy savings diminish, but are still quite substantial (i.e., 30%, 25%, and 22% on average relative to the 3 different reference voltages). Even at 125°C energy savings can still be realized (i.e., 17%, 12%, and 6% relative to the 3 respective reference voltages). Overall, results show that the proposed technique is a robust, versatile, and cost-effective technique for saving energy while guaranteeing safe execution.

5.5 Summary and Discussion

This chapter introduced a novel HW/SW scheme, adapted from Hardware Transactional Memory, that dynamically adjusts the operating voltage to an evolving COP in order to operate at highly reduced margins. The scheme was integrated into the OpenMP model, making it easy to program and adjust transaction granularity. Experimental results demonstrate that the proposed technique is particularly effective at saving energy while also offering safe execution guarantees. In particular, energy improvements vary from 6% up to 43% depending on the chosen reference voltage and temperature corner, while the energy and execution time overhead is relatively low. Based on these findings we draw the conclusion that operating dangerously close to the COP instead of using conservative guardbands pays off when the proposed lightweight HTM mechanism is used. To the best of our knowledge, this is the first full-fledged implementation of HTM for error-resilient execution that targets reducing energy consumption.

This work could be extended in several directions. First, we could consider a broader range of strategies for adjusting the operating point due to COP variations. Specifically, we could design more complex strategies that adjust not only the voltage but also the frequency based on variations.

Second, it would be interesting to explore more flexible solutions for adjusting the voltage, since the current solution is rather strict, increasing the voltage immediately after the first failure is encountered.

Chapter 6 addresses this issue by introducing a new adaptive error policy that not only is more flexible and promises better energy savings, but can also handle a broader range of error types.

Chapter 6

Adaptive voltage scaling policies for improving energy savings at near-edge operation

In Chapter 5, a novel HTM-based scheme was presented that dynamically adjusts the operating voltage to an evolving COP to operate at highly reduced margins. While this scheme proved to have great potential in saving energy, it has some key limitations that need to be addressed. First, it is based on a lazy error resolution scheme, meaning that the policy checks whether errors have occurred only when the end of each transaction is reached, and only then activates the roll-back mechanism.

While a lazy error recovery mechanism is appropriate for non-critical errors, it is not appropriate for critical errors. Non-critical errors are non-systematic errors that take place in the data path of the processor pipeline and can result in incorrect data being stored in memory. Critical errors take place in the control part of the processor pipeline and can break the original control flow of the program and prevent any software-based solution from taking control. For these types of errors a prompt reaction is necessary. For this purpose, a new design is proposed that employs eager error resolution

to address both critical and non-critical errors. By choosing eager resolution, no time is wasted from the moment of error occurrence to the moment of error resolution.

Second, while the proposed scheme is capable of recovering both from intermittent timing errors and the COP, it is not very flexible in how it addresses intermittent timing errors, in the sense that when the first failure occurs, the voltage level is immediately increased back to a safe level in order to avoid massive instruction failures. Thus, if sporadic timing errors occur, they are essentially treated as if they were caused by crossing a critical operating point, while in fact, as long as the error itself was corrected, it would still be safe and reasonable to continue operating at the same voltage level. As we saw in Section 2.2.5, according to the COP error model, after the critical voltage is surpassed (for a given frequency level), massive errors emerge and the voltage needs to be immediately increased back to a safe level. But intermittent timing errors follow a different trend (Figure 2.4). They start emerging with a very low error rate, which later increases exponentially as the voltage is further scaled down. So, there is a range of voltage levels before the point of massive instruction failures.

With the previous algorithm we would stop decreasing the voltage level immediately after the point of first failure (POFF). This would not allow us to reach the best energy saving potential.

In this chapter, new software-directed adaptive error policies are explored that work on top of the underlying HTM support. These policies optimistically lower the voltage beyond the POFF, allowing multiple intermittent timing errors to occur, and make voltage adjustment decisions based on the number of consecutive commits and aborts in order to achieve better energy savings while still correctly executing code. The challenge here is that there is a fine trade-off between the energy that is saved from scaling the voltage and the energy that is lost due to error recovery and re-execution.

So the chosen error policy must be carefully tuned taking energy savings into account.

6.1 Addressing critical and non-critical errors

As mentioned before, there are two distinct types of errors that can occur:

1. Non-critical errors are those that originate from timing delays along the datapath (e.g., the multiplier) and ultimately lead to writing a bad value to memory.

2. Critical errors are those that occur in the control part of the processor pipeline (instruction fetch/decode) and ultimately lead to catastrophic failures.

While non-critical errors can result in incorrect data being stored in memory, critical errors can break the original control flow of the program and prevent any software-based solution from taking control. Thus, while the lazy error resolution scheme proposed in Chapter 5 can deal with non-critical errors, it cannot handle critical errors. For this reason, in the new design proposed here, an eager error resolution mechanism is employed for both types of errors. EDS [21] can be used to monitor all paths of the processor pipeline. When an error occurs in the control part of the pipeline, indicating a critical error, the EDS generates an interrupt to the core. Since the core's state is corrupted and the core cannot serve an interrupt, the interrupt line can be intercepted by a special hardware block (or a safe core that operates at a safe voltage level). The hardware block will apply a temporary voltage boost to the core so that the core can resume operation. This can be done by applying a forward body bias (FBB) [80, 100, 101], a technique that reduces the threshold voltage and lowers gate delay, thus temporarily increasing performance. The core's pipeline will be flushed and the core will be ready to jump to an interrupt service routine (ISR) that activates the rollback mechanism. Non-critical errors are also resolved in an eager manner, but an FBB process is not necessary, since the affected core should be able to handle the recovery alone.
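The recovery sequence for a critical error could be organized as in the following C sketch; the interrupt plumbing and all function names (apply_fbb_boost(), restore_logs_all_banks(), ...) are hypothetical and only illustrate the order of steps described above.

    /* Hypothetical platform hooks; names are illustrative only. */
    extern void apply_fbb_boost(int core);        /* temporary forward body bias (FBB)   */
    extern void flush_pipeline(int core);
    extern void restore_logs_all_banks(int core); /* DVM-driven rollback of written data */
    extern void restore_checkpoint(int core);     /* PC, SP, registers, stack            */

    /* Executed by the safe hardware block (or safe core) that intercepts the EDS
     * interrupt of a core whose control path has failed, followed by the ISR the
     * recovered core jumps into. */
    void handle_critical_error(int core)
    {
        apply_fbb_boost(core);         /* make the failing core able to execute again     */
        flush_pipeline(core);          /* discard the corrupted in-flight instructions    */
        restore_logs_all_banks(core);  /* ISR: undo the transaction's speculative writes  */
        restore_checkpoint(core);      /* ISR: return to the state at Transaction Start   */
        /* The error policy (Section 6.2) then decides how the voltage is adjusted
         * before the transaction is retried. */
    }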

6.2 Error policy design

As mentioned earlier, our goal is to design a new adaptive error policy that optimistically lowers the voltage beyond the point of first failure (POFF) allowing multiple intermittent timing errors to occur. Unlike the previously proposed policy that increases the voltage immediately after the

POFF, the new policy will tolerate sporadic errors and make voltage adjustment decisions based on different criteria (e.g., the expected error rate, the number of commits and aborts) in order to achieve better energy savings. Since there is a fine tradeoff between the energy saved due to voltage scaling and the energy expended due to error recovery, choosing the factors that will determine how voltage should be adjusted is not a straightforward process.

When an error occurs, the error policy should be able to determine whether that error was sporadic or not and if retrying the failed transaction at the same voltage level will likely be successful.

One way to decide that is by doing online error monitoring and estimating the number of errors that are expected to occur if the transaction is retried at the same voltage level. For example, we could monitor the errors experienced in the recent time window (TW cycles) and estimate the experienced error rate as:

Experienced Error Rate = ErrorCount / TW,    (6.1)

where ErrorCount is the number of errors detected over the last time window, TW. At the same time, we can estimate the size of the currently running transaction (e.g., S cycles) by monitoring the sizes of the most recent transactions and using their average value. Using this information, we can estimate the number of errors that are expected to occur if the failed transaction is repeated:

Expected Errors = Experienced Error Rate · S, (6.2)

If the number of Expected Errors (EE) is less than some preset threshold, we can assume that it is worthwhile re-executing the transaction at the same voltage level. Otherwise, we can assume that repeating the transaction at the same voltage level will likely lead to new errors and increase the voltage before re-executing.
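Expressed in code, the retry decision based on Equations 6.1 and 6.2 could look like the sketch below; the counters and the use of an averaged recent transaction size are assumptions of the sketch, not part of the proposed hardware.

    #include <stdbool.h>

    /* Decide whether a failed transaction should be retried at the current
     * voltage level, following Equations 6.1 and 6.2. */
    bool retry_at_same_voltage(unsigned error_count,   /* errors seen in the last window   */
                               unsigned tw_cycles,     /* size of the time window (cycles) */
                               unsigned avg_tx_cycles) /* estimated transaction size S     */
    {
        double experienced_error_rate = (double)error_count / (double)tw_cycles; /* Eq. 6.1 */
        double expected_errors        = experienced_error_rate * avg_tx_cycles;  /* Eq. 6.2 */
        return expected_errors < 1.0;  /* retry only if another error is unlikely */
    }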

Since EE is just an estimate, another decision factor for adjusting the voltage could be the number of aborts that are encountered in a particular transaction. For example, we could say that after a certain number of consecutive aborts the voltage should be increased. Deciding this abort threshold would require empirical evaluation. An alternative way to decide it would be as follows:

The energy consumed to complete a transaction can be expressed as:

Etrans(V,S) = Ecomp(V,S) + Nabort · (Ecomp(V,S) + Erecov(V,S)),    (6.3)

where Etrans(V,S) is a function of the supply voltage V and the size S of the transaction. Ecomp(V,S) is the energy consumed during transaction execution, Nabort is the number of times the transaction was rolled back, and Erecov(V,S) is the energy consumed by rolling back the transaction.

In order to save energy, we want Etrans(V,S) < Etrans(Vref ,S), where Vref is the reference voltage (i.e., the safe voltage when guardbands are considered). Therefore,

Nabort < (Etrans(Vref,S) − Ecomp(V,S)) / (Ecomp(V,S) + Erecov(V,S)).    (6.4)

By running simulations for various voltage levels and transaction sizes, we can estimate the average Erecov and Ecomp for different values of S and V , and use these estimations to restrict the number of aborts. This evaluation process guarantees that energy is saved, but it does not guarantee that the energy savings are maximized. Maximizing energy savings would require extensive computation over varying values of S, V , Nabort, which is likely to be impractical for real-time analysis in hardware. If our goal is not just to save energy compared to the reference voltage level, but also restrict the energy consumption within a certain percentage (x%) of the reference voltage level, we could consider that in the above equation.
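A small helper that turns Equation 6.4 into an abort budget might look as follows, assuming the energy terms come from such offline pre-characterization for the given S and V; the function name and interface are illustrative.

    /* Maximum number of aborts that still saves energy relative to the
     * reference voltage, following Equation 6.4. */
    int max_aborts(double e_trans_ref,   /* Etrans(Vref,S): energy at the reference voltage */
                   double e_comp,        /* Ecomp(V,S): one execution at the scaled voltage */
                   double e_recov)       /* Erecov(V,S): one rollback at the scaled voltage */
    {
        double bound = (e_trans_ref - e_comp) / (e_comp + e_recov);
        return bound > 0.0 ? (int)bound : 0;   /* otherwise no aborts can be afforded */
    }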

Alternatively, instead of defining the maximum abort threshold a priori through energy characterization experiments, we could accumulate the energy consumed during execution and increase the voltage back to a safe reference voltage level when the energy that has been consumed approaches the energy that would be consumed at the nominal voltage level. This approach is reactive rather than proactive.

Even if the voltage is increased in response to errors, we might still consider lowering it again later for different reasons. For example, the error rate might drop due to temperature fluctuations or the size of transactions might become smaller, allowing more transactions to succeed at the same error rate. That is why another factor could be employed: the number of consecutive commits. After

The transaction size plays an important role as well. It should be small enough that the transaction is likely to finish without encountering an error but large enough so that the accumulated time and energy overhead for checkpointing and error recovery is reasonable. Through experiments we can get useful feedback on the average energy and recovery time overhead and use this information to statically tune the transaction size.

Considering all the above decision factors, an effective error policy may follow the flowchart shown in Figure 6.1. This policy works as follows: The supply voltage is continuously scaled down until the POFF is reached. When the first error occurs, the error policy estimates the number of errors that are expected to occur when the transaction is repeated, based on Equations 6.1 and 6.2. If this number is greater than or equal to 1, indicating that an error will likely occur when the transaction is re-executed, the voltage is increased. If not, then the policy decides to optimistically stay at the same voltage level and retry the transaction. The procedure is repeated every time an error occurs. If at any voltage level new errors emerge, they can be tolerated up to a certain point. If the pre-defined maximum abort threshold is reached, then the voltage must be increased again to avoid further energy loss due to aborts and re-execution. On the other hand, if at a certain voltage level a transaction reaches a number of consecutive commits, then the policy decides to optimistically lower the voltage again for further energy savings. The maximum abort and consecutive commit thresholds can be empirically chosen or they can be defined a priori through energy pre-characterization experiments.

Figure 6.1: Example of an error policy decision flow based on: expected error rate, number of consecutive aborts and number of consecutive commits.

While this policy could effectively tolerate errors and increase energy savings, it requires online monitoring of the transaction size and the error rate at each voltage level. It might be complicated to acquire such knowledge especially when the voltage is frequently changed and aborts occur.

For this reason, an alternative, much simpler design is chosen that does not require online error monitoring and makes decisions based on the experienced commits and aborts. This alternative design is described in the next section.

6.3 The Thrifty uncle/Reckless nephew policy

As mentioned earlier, scaling the voltage to achieve energy savings has a caveat: while scaling the voltage reduces energy consumption, it also increases the transaction abort rate due to errors, which in turn leads to energy loss from transaction recovery and re-execution. To balance these two opposing effects, we create an adaptive approach we call the Thrifty uncle/Reckless nephew policy¹.

The policy has two parts:

• the reckless nephew scales the voltage down for better energy savings, while

• the thrifty uncle tries to moderate the energy losses due to aborts that increase as a consequence of voltage scaling, by setting up a threshold for voltage scaling.

The reckless nephew decides whether to reduce the voltage based on the number of consecutive successful transactions. When transactions fail consecutively, the thrifty uncle intervenes and increases the voltage to avoid further failures. The uncle influences the nephew's decisions by setting up the threshold of successful transactions that are required for voltage scaling. This threshold is determined by the number of consecutive commits and aborts. Essentially, the thrifty uncle tries to determine the operating voltage level that will allow for energy savings and guarantee forward progress while reducing the energy and time overhead due to transaction failures. Choosing an ideal voltage level without knowledge of the error rate and the actual transaction size is a challenging task. Luckily, the number of experienced consecutive aborts and commits can be a very good indicator of how the voltage should change.

Figure 6.2 shows a flowchart of how the proposed policy works. Starting from a safe reference level, the voltage is scaled in small steps, creating multiple operating voltage levels. When a transaction starts, the policy checks whether this is a failed transaction that is re-starting or a new one. If it is a new transaction, then the nephew must decide whether the voltage should be decreased first. This decision is based on the number of consecutive successful transactions that have preceded this one, i.e., the number of consecutive commits. If this number is greater than a pre-defined threshold C, then the voltage can be safely reduced by one step. Otherwise, no voltage change is allowed.

If it is a failed transaction that is being re-executed, then the thrifty uncle must decide whether the transaction should be re-executed at the same voltage or the voltage should be increased. If this transaction has failed consecutively and the number of consecutive aborts is greater than a pre-defined threshold A, then the uncle increases the voltage by one step. Otherwise, the transaction is re-executed at the same voltage level.

¹A reference to the thrifty uncle Scrooge McDuck and reckless nephew Donald Duck of Disney's Duck family.

Figure 6.2: The ‘Thrifty uncle/Reckless nephew’ policy.

Deciding the threshold C that will allow the voltage to be scaled again in the future is a tricky process, since the error rate and the transaction size are not known. In our policy this threshold is initially set to 1 and is later adapted based on the experienced consecutive aborts and commits. Thus, every time the consecutive abort threshold (A) is exceeded, the uncle realizes that the current voltage level is likely dangerous and not only increases the voltage but also doubles the C threshold, to make it more difficult for the nephew to later come back to that level. However, if the error rate is reduced in the future (e.g., due to a temperature drop) or transactions become smaller, then a lower voltage level might be sustainable. In that case, C must be reduced again to allow an easier transition to lower voltage levels. In this policy, if the voltage is reduced twice in a row without any aborts in-between, then the uncle divides the threshold C by 2, making it easier for the nephew to further reduce the voltage in the future. Threshold A must also be chosen wisely to allow forward progress. Having consecutive aborts means that the error rate is so high that the transaction is not able to complete without encountering an error. It could also mean that the transaction size is so large that even sporadic errors do not allow it to complete. Ideally, A should be determined based on the accumulated energy, as described in Section 6.2. Since we are not monitoring the energy for this policy, we pick A empirically. Having two aborts in a row can be coincidental in the case of sporadic errors, but more than three aborts in a row could mean that the error rate is elevated or the transaction size is too big for the error rate. Thus, for this implementation A is set to 3. In the next section, the Thrifty uncle/Reckless nephew policy is evaluated in terms of energy and execution time.
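To make the decision flow of Figure 6.2 concrete, the following C sketch renders the policy as described above. It is an illustration written for this summary, not the actual implementation; in particular the platform hooks (step_voltage_down/up) and the exact points at which the counters are reset are assumptions:

#include <stdbool.h>

/* Counters and thresholds; A = 3 as chosen in the text, C starts at 1 and
 * is adapted at run time.                                                  */
static int A = 3;                     /* consecutive-abort threshold         */
static int C = 1;                     /* consecutive-commit threshold        */
static int consec_commits = 0;
static int consec_aborts  = 0;
static int drops_without_abort = 0;   /* voltage reductions since last abort */

extern void step_voltage_down(void);  /* hypothetical platform hooks */
extern void step_voltage_up(void);

/* Called when a transaction starts; 'restarting' is true if this is a failed
 * transaction being re-executed.                                            */
void un_policy_on_start(bool restarting)
{
    if (!restarting) {
        /* Reckless nephew: lower the voltage after enough consecutive commits. */
        if (consec_commits > C) {
            step_voltage_down();
            consec_commits = 0;
            if (++drops_without_abort >= 2) {   /* two reductions, no aborts between */
                C = (C / 2 > 0) ? C / 2 : 1;    /* make further reductions easier    */
                drops_without_abort = 0;
            }
        }
    } else {
        /* Thrifty uncle: back off after too many consecutive aborts. */
        if (consec_aborts > A) {
            step_voltage_up();
            C *= 2;                  /* make it harder to come back to this level */
            consec_aborts = 0;
            drops_without_abort = 0;
        }
    }
}

void un_policy_on_commit(void) { consec_commits++; consec_aborts = 0; }
void un_policy_on_abort(void)  { consec_aborts++;  consec_commits = 0; drops_without_abort = 0; }

In this rendering, un_policy_on_start() would be invoked by the transaction runtime at every start or re-start, mirroring the "New?" decision at the top of the flowchart.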

6.4 Experimental Results

This section presents an evaluation of the Thrifty uncle/Reckless nephew policy in terms of energy consumption and execution time overhead. We use the same benchmarks as in Section 5.4 and the VSoC simulator. An extra synthetic benchmark has been added that executes a set of reads and writes on a shared vector. First, we evaluate the policy in terms of energy consumption and then we analyze the overhead in time and energy. Lastly, we study how transaction size can affect the obtained energy savings.

6.4.1 Energy consumption

We compare the Thrifty uncle/Reckless nephew policy (UN) to a conservative steady-voltage (SV) technique which uses voltage margins (guardbands) to absorb the effects of static and dynamic variations. We also test whether the UN policy can actually achieve better energy savings compared to the previously proposed TM technique of Chapter 5, which increases the voltage immediately after the first failure. In Chapter 5 we experimented with three reference voltage levels (i.e., guardbands): 1.0 V, 0.98 V and 0.96 V. We saw that as the reference voltage decreases, smaller energy savings are seen compared to the SV configuration.

Figure 6.3: Single-core energy consumption normalized to the baseline SV configuration using a 20mV voltage scaling step. SV: Steady voltage configuration, TM: Transactional Memory-based technique of Chapter 5, UN: Thrifty uncle/Reckless nephew policy.

For that purpose, here we test only using 0.96V as a reference voltage level to show the potential energy savings and expect even more savings if the guardbands become more strict. As in Chapter 5, in this work the target frequency is 200MHz, which corresponds to a nominal voltage of 0.84V. If no guardbands are used to account for variations, that is the point where the first failure is expected to occur. Our error model follows the curves reported by Fojtik et al. in [75].

Based on the intermittent timing error model (Figure 2.4), errors start emerging at a low error rate that increases exponentially as the voltage is further scaled down. Choosing a small voltage scaling step gives the opportunity for more experimentation around the lowest error rates, thus allowing the policy to more finely determine the optimal voltage level and achieve better energy savings. For this reason, we choose a step size of 20mV for our first experiment. Since such fine-tuning of the voltage level might not be feasible in practice, we also test the policy at a larger step of 25mV to see how energy savings are affected by this choice.

Figure 6.4: Single-core energy consumption normalized to the baseline SV configuration using a 25mV voltage scaling step. SV: Steady voltage configuration, TM: Transactional Memory-based technique of Chapter 5, UN: Thrifty uncle/Reckless nephew policy.

Figures 6.3 and 6.4 show the total energy consumption of a single-core run for each of the SV, TM and UN configurations normalized to the baseline SV configuration at reference voltage 0.96V, for a 20mV and 25mV step respectively. At the right end of each graph we see the average results over all benchmarks. As we can see, in both cases TM and UN achieve significantly better results than SV. However, the UN configuration yields even better energy savings than the TM configuration.

Specifically, for a 20mV step, TM maintains a 41% improvement in energy over SV, similar to what we expected based on the results shown in Chapter 5. However, the UN policy reaches an even better improvement of 53% over SV and a 20% improvement over TM. This means that when carefully done, decreasing the voltage beyond the point of first failure and allowing errors to happen can bring even better energy savings compared to safely staying above the point of first failure. Even when the step is increased to 25mV, energy savings can still be realized, even though they are smaller compared to using a finer step. UN achieves a 50% improvement over SV and a 13% improvement over TM. Next, we discuss the overhead of the proposed technique.
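As a side note, the two improvement figures are mutually consistent (our own arithmetic, not additional data): if UN consumes 1 − 0.53 = 0.47 of the SV energy and TM consumes 1 − 0.41 = 0.59 of it, then UN relative to TM is 1 − 0.47/0.59 ≈ 0.20, i.e., the reported 20% improvement.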

6.4.2 Overhead characterization

Apart from energy consumption, we also measure the execution time overhead introduced by the transactions and the error policy. The UN policy runs every time a transaction starts or re-starts.

Each voltage adjustment takes 10 clock cycles. Based on measurements, the UN configuration has a 5% execution time overhead compared to the SV configuration. This overhead is due to the extra time needed to set up transactions (i.e., checkpointing and writing to the logs), the extra time introduced by the UN policy (i.e., time to execute the policy and adjust the voltage) and the delays associated with recovery and re-execution of failed transactions. The UN policy is 4% slower on average than the TM policy. This is because the UN configuration experiences more transaction aborts and re-executions compared to the TM configuration, since it operates at lower voltage levels. Moreover, the UN policy makes more voltage adjustments than the TM policy, since it adaptively sets the voltage based on the experienced number of commits and aborts. However, according to the statistics we gathered from experiments, the UN policy makes voltage adjustments in less than 2% of the times it runs, which means that it quickly learns the most sustainable voltage level.

6.4.3 Energy savings vs. transaction size

The transaction size is an important parameter for the UN policy. Depending on the experienced error rate, a transaction of smaller size might complete before errors emerge, while a bigger one might not get a chance to complete. For example, for a per-cycle error rate of 0.1% (that is, 1 error every 1000 cycles), a 200-cycle transaction has a very good chance of completing without seeing an error. In fact, 1 in every 5 transactions is expected to fail, but forward progress is guaranteed. However, a 1000-cycle transaction will almost certainly fail every time, not allowing for forward progress. The UN policy does not change the transaction size based on the error rate (even though that is a possibility for future work). However, it adaptively changes the voltage level based on the number of consecutive aborts encountered. Thus, in the case of the 1000-cycle transaction, where multiple consecutive aborts would occur, the UN policy would increase the voltage by one step so that the error rate is reduced and the voltage becomes sustainable. That would result in losing some potential energy savings, since the policy would have to keep the voltage at higher levels to guarantee forward progress.
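The 1-in-5 figure follows from a quick back-of-the-envelope calculation (ours, assuming independent per-cycle errors): the probability that a 200-cycle transaction encounters at least one error at a per-cycle rate of 0.001 is 1 − (1 − 0.001)^200 ≈ 1 − e^−0.2 ≈ 0.18, i.e., roughly one failed attempt in every five.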

Figure 6.5: Single-core energy consumption for different transaction sizes.

To test the effect of different transaction sizes on the obtained energy savings, we run the synthetic benchmark with four different transaction sizes: small (500 cycles), medium (5,000 cycles), large

(50,000 cycles) and extra large (500,000 cycles). Figure 6.5 shows the energy consumption for the

SV, TM and UN configurations normalized to SV for each transaction size. The UN policy achieves

58%, 53%, 46% and 37% improvement over the SV policy for the small, medium, large and extra large transaction sizes respectively. The energy improvement over the TM policy is 30%, 20% and 8% for small, medium and large transactions respectively. As expected, as the transaction size increases the energy savings decrease, but they are still significant. However, we observe that for the extra large transaction size, UN becomes worse than TM (energy consumption is increased by 3%). This is because at this point, the transaction size is so large that transactions do not get a chance to complete without encountering an error, even with the lowest error rate. In this case, it is better to increase the voltage and operate at a safe level above the point of first failure, as the TM policy does, instead of choosing the UN policy.

From the results presented here, we draw the conclusion that our Thrifty uncle/Reckless nephew policy can achieve significant energy savings compared to using guardbands. Moreover, even though our policy allows errors to occur and transactions to fail by operating at more dangerous voltage levels, it still yields better energy savings than a policy that immediately increases the voltage after the first failure, provided the transaction size is kept small enough. However, we should also consider that there is an overhead associated with transaction checkpointing, and making transactions too small could render this overhead overwhelming compared to the actual computation of the transaction.

Hence, there is a tradeoff between the transaction size and the overhead associated with transaction checkpointing, which we should take into account when setting the size of our transactions.

6.5 Summary and Discussion

In this chapter, we discussed new error policies that can improve the energy savings from voltage scaling by increasing the flexibility of addressing intermittent timing errors. A new, simple error policy was presented, the ‘Thrifty uncle/Reckless nephew’ (UN) policy, that can address both intermittent timing errors and the COP. This policy optimistically lowers the voltage beyond the point of first failure, allowing multiple intermittent timing errors to occur, and makes voltage adjustment decisions based on the experienced number of consecutive commits and aborts, in order to achieve better energy savings. Compared to the TM policy presented in Chapter 5, the UN policy achieves a 20% improvement in energy consumption for a fine voltage scaling step of 20mV (13% if the step is increased to 25mV) while being 4% slower. Compared to a policy that uses guardbands, the UN policy yields a 50-53% improvement in energy. Moreover, the UN policy can deal with a broader range of errors. While the TM policy addresses errors in a lazy manner and can handle only non-critical errors, the UN policy addresses errors eagerly and can deal not only with critical but also with non-critical errors. Overall, we conclude that even though the UN policy allows errors to occur by operating at more dangerous voltage levels, it still yields better energy savings than a policy that immediately increases the voltage after the first failure.

There are multiple ways in which this work could be extended. The current implementation of the UN policy does not take into account how energy is consumed during execution in order to restrict energy consumption within a certain range of the safe reference voltage level. If our goal is to achieve the maximum possible energy savings, we could take into account the energy accumulated during transaction re-execution using online energy monitoring. Another possible direction is to monitor the error rate and adapt the transaction size during execution to increase the transaction commit rate and guarantee forward progress without necessarily increasing the voltage. Finally, it would be interesting to see how the system behaves and what kind of energy savings we can obtain if we not only scale the voltage but also change the frequency or the temperature.

Chapter 7

Conclusions and future directions

In this thesis we proposed techniques inspired by speculative synchronization to improve the performance, energy-efficiency and error-resilience of multicore embedded systems.

In Chapter 3, we presented Embedded-Spec, an energy-efficient and lightweight implementation for transparent speculation on a shared-bus embedded multicore architecture. Embedded-Spec can operate in two speculative execution modes, the Embedded-LE mode that is based on lock elision and the Embedded-LR mode that is based on lock removal. Unlike most existing works on speculative synchronization, Embedded-Spec focuses not only on performance but also on energy-efficiency, since both are key constraints for embedded systems. The energy-delay product (EDP) was evaluated as a figure of merit that captures the trade-off between these two properties. Embedded-Spec also targets simplicity by proposing the addition of simple hardware structures that avoid complex changes to the underlying cache coherence protocol. Moreover, it offers a fully transparent solution, making it applicable to legacy code. Through an extensive set of experiments over various parameters such as number of cores, abort policy, sleep-modality, critical section size and retry policy,

Embedded-Spec showed that compared to traditional locking it can improve EDP to different degrees based on the chosen configuration.

Following the technology trend towards more scalable systems, in Chapter 4 we proposed a novel HTM scheme targeted to a cluster-based many-core embedded architecture. Specifically, driven by the need for simplicity and power-efficiency in modern embedded systems, we turned our focus to a system without caches and cache-coherence support. To the best of our knowledge, a speculative synchronization mechanism for this type of architecture had not been proposed before. Implementing an HTM scheme without caches and cache-coherence support presented many challenges, since

HTM traditionally relies on caches for data versioning and the cache coherence protocol for conflict management. This new HTM scheme required explicit data management and a fully-custom design of the transactional memory support. The design was based on the idea of distributing conflict detection and resolution across multiple Transaction Support Modules (TSMs) to make it scalable.

These modules work at the memory level, keeping track of read and write memory accesses to guarantee correctness. Two alternative data versioning management designs were proposed: a simple full-mirroring design and a more complex but memory-savvy distributed logging design. Results showed that both versions of the HTM scheme can achieve significant performance improvements over traditional lock-based schemes, ranging from 9% to 83% depending on the number of cores. While the current implementation is limited to single-cluster accesses, the proposed scheme is designed so that it is scalable and can be extended to multiple clusters.

In Chapter 5 we focused on another critical aspect for embedded systems, error-resilience. We introduced the first HTM-based design for error-resilient and energy-efficient MPSoC execution that allows operation at highly reduced supply voltage margins to save energy and can address not only intermittent timing errors but also the COP. The proposed scheme dynamically monitors the platform and adaptively adjusts the operating voltage to the evolving COP, using lightweight checkpointing and roll-back mechanisms adopted from Hardware Transactional Memory (HTM) for error recovery.

The scheme was compared to a conservative steady-voltage technique that uses voltage guardbands, at different temperatures and reference voltages. Results showed that it achieves significant energy improvements varying from 6% up to 43% at relatively low execution time overhead. The energy improvements were bigger at lower temperatures and higher reference voltages. We conclude that if the proposed HTM-based technique is used, then operating close to the COP instead of using conservative guardbands can not only guarantee error-resilience and forward progress but also yield significant energy savings.

In Chapter 6, we proposed new error recovery policies that improve the flexibility of addressing intermittent timing errors, thus yielding better energy savings compared to the technique presented in Chapter 5. Moreover, these techniques can address a broader range of error types by using eager error resolution instead of the previously used lazy error-resolution scheme. Thus, apart from non-critical errors they can also deal with critical errors that need prompt reaction. We proposed the ‘Thrifty uncle/Reckless nephew’ (UN) policy, a new and simple error policy for addressing intermittent timing errors and the COP. This policy optimistically lowers the voltage beyond the point of first failure, allowing multiple intermittent timing errors to occur, and makes voltage adjustment decisions based on the experienced number of consecutive commits and aborts, in order to achieve better energy savings. We compared this policy with the one proposed in Chapter 5 and found that it can improve energy by up to 20% while being 4% slower. Comparing it with a steady-voltage technique that uses guardbands, the UN technique can yield up to 53% energy improvement. We conclude that choosing a policy such as the UN policy, which allows errors to occur and transactions to fail by operating at more aggressive voltage levels below the point of first failure, pays off compared to increasing the voltage immediately after the first failure.

The work proposed in this thesis can be extended in multiple directions. By implementing for the first time an HTM scheme for a many-core embedded architecture without caches and cache-coherence support, we got a better understanding of how such an architecture can benefit from speculative synchronization when transactions are restricted within a single cluster. Now, it would be interesting to explore how this implementation can be extended to multiple clusters, providing inter-cluster transactional support. For that purpose, we need to build on the distributed nature of the transactional management scheme we designed and rethink the transactional bookkeeping mechanism to make it more scalable.

Regarding our work on error-resilient and energy-efficient execution, the proposed UN policy has shown great potential in saving energy and it can be improved in many ways. First, if we need to restrict energy consumption within a certain range, we can take into account how energy is accumulated during the execution with online energy monitoring. Second, we could monitor the error rate and adapt the transaction size accordingly in order to increase the commit rate of transactions and guarantee forward progress without necessarily increasing the voltage, as is currently done in the UN policy.

In the future we should also consider experimenting not only with voltage but also with frequency.

It would be interesting to see the combined effect of frequency and voltage scaling on both energy and performance. Since our work so far was focused on maximizing energy savings, we chose to approach the problem by keeping the frequency constant and scaling only the supply voltage. However, it may be advantageous to adjust both. For example, for applications with certain run-time constraints we could proactively increase the frequency and later adjust the voltage accordingly, to improve both performance and energy consumption. Moreover, for applications that have real-time constraints, the run-time degradation should be considered as well. Thus, for example, after raising the voltage due to consecutive transaction aborts, we could temporarily apply a frequency boost to compensate for the time lost during recovery.

Another interesting direction is to investigate how the proposed technique can be used in the approximate computing domain. Approximate and inexact computation has been studied before, but not in the context of HTM recovery mechanisms. For approximate computing applications, it may be more cost-effective to have inexact data than to pay the time and energy cost of roll-back and recovery. Our scheme offers applications that can tolerate approximate computations the possibility to intentionally ignore some non-critical errors, if we can determine that the error will still allow for approximately correct operation.

Overall, the vast range of needs and challenges facing embedded systems today in performance, energy-efficiency and error-resilience remains well worth pursuing.

Bibliography

[1] Cesare Ferri, Tali Moreshet, R. Iris Bahar, Luca Benini, and Maurice Herlihy. A hardware/-

software framework for supporting transactional memory in a mpsoc environment. SIGARCH

Comput. Archit. News, 35(1):47–54, March 2007.

[2] Dan Ernst, Nam Sung Kim, Shidhartha Das, Sanjay Pant, Rajeev Rao, Toan Pham, Conrad

Ziesler, David Blaauw, Todd Austin, Krisztian Flautner, and Trevor Mudge. Razor: A low-

power pipeline based on circuit-level timing speculation. In IEEE/ACM MICRO, pages 7–,

2003.

[3] C. Ferri, A. Marongiu, B. Lipton, T. Moreshet, R. I. Bahar, M. Herlihy, and L. Benini.

SoC-TM: Integrated HW/SW support for transactional memory programming on embedded

mpsocs. In CODES’11, pages 39–48, Taipei, Taiwan, October 2011.

[4] Daniele Bortolotti, Christian Pinto, Andrea Marongiu, Martino Ruggiero, and Luca Benini.

Virtualsoc: A full-system simulation environment for heterogeneous system-

on-chip. 2013 IEEE International Symposium on Parallel and Distributed Processing, 0:2182–

2187, 2013.

[5] Maurice Herlihy and J. Eliot B. Moss. Transactional memory: architectural support for lock-

free data structures. In Proceedings of the 20th Annual International Symposium on Computer

Architecture, pages 289–300. ACM Press, 1993. http://doi.acm.org/10.1145/165123.165164.


[6] Lance Hammond, Vicky Wong, Mike Chen, Brian D. Carlstrom, John D. Davis, Ben Hertzberg,

Manohar K. Prabhu, Honggo Wijaya, Christos Kozyrakis, and Kunle Olukotun. Transactional

memory coherence and consistency. In Proceedings of the 31st annual international symposium

on Computer architecture, ISCA ’04, pages 102–, Washington, DC, USA, 2004. IEEE Computer

Society. http://dl.acm.org/citation.cfm?id=998680.1006711.

[7] Kevin E. Moore, Jayaram Bobba, Michelle J. Moravan, Mark D. Hill, and David A. Wood.

LogTM: Log-based transactional memory. In HPCA, pages 254–265, 2006.

[8] Intel Corporation. Transactional Synchronization in Haswell. Retrieved from

software.intel.com/en-us/blogs/2012/02/07/transactional-synchronization-in-haswell/, 8 Sep

2012.

[9] Bit-tech.net. IBM releases ”world’s most powerful” 5.5GHz processor. Retrieved from www.bit-

tech.net/news/hardware/2012/08/29/ibm-zec12/1, 8 Sep 2012.

[10] C. Ferri, S. Wood, T. Moreshet, R. I. Bahar, and M. Herlihy. Embedded-TM: Energy and

complexity-effective hardware transactional memory for embedded multicore systems. Journal

of Parallel and Distributed Computing, 70(10):1042–1052, October 2010.

[11] C. Ferri, S. Wood, T. Moreshet, R. I. Bahar, and M. Herlihy. Energy and througput efficient

transactional memory for embedded multicore systems. In HiPEAC’10, Pisa, Italy, January

2010.

[12] Q. Meunier and F. Petrot. Lightweight transactional memory systems for NoCs based ar-

chitectures: Design, implementation and comparison of two policies. Journal of Parallel and

Distributed Computing, 70(10):1024–1041, October 2010.

[13] L. Kunz, G. Girão, and F. Wagner. Evaluation of a hardware transactional memory model in an NoC-based embedded MPSoC. In SBCCI, pages 85–90, São Paulo, Brazil, 2010.

[14] Ravi Rajwar and James R. Goodman. Speculative lock elision: enabling highly concurrent

multithreaded execution. In MICRO, pages 294–305, 2001.

[15] Kalray. MPPA 256 - Programmable . www.kalray.eu/products/mppa-

manycore/mppa-256/.

[16] NVIDIA. NVIDIA’s next generation CUDA compute architecture: Fermi. white paper,

NVIDIA, 2009.

[17] Plurality Ltd. The hypercore architecture, white paper. Technical Report version 1.7, April

2010.

[18] D. Melpignano, L. Benini, E. Flamand, B. Jego, T. Lepley, G. Haugou, F. Clermidy, and

D. Dutoit. Platform 2012, a many-core computing accelerator for embedded SoCs: performance

evaluation of visual analytics applications. In DAC, pages 1137–1142. ACM, June 2012.

[19] Adapteva. Epiphany-IV 64-core 28nm Microprocessor (E64G401). Retrieved from

http://www.adapteva.com/epiphanyiv/, 2013.

[20] J. Patel. CMOS process variations: A critical operation point hypothesis. web.stanford.

edu/class/ee380/Abstracts/080402-jhpatel.pdf, 2008.

[21] K.A. Bowman, J.W. Tschanz, S.L. Lu, P.A. Aseron, M.M. Khellah, A. Raychowdhury, B.M.

Geuskens, C. Tokunaga, C.B. Wilkerson, T. Karnik, and V.K. De. A 45nm resilient micropro-

cessor core for dynamic variation tolerance. JSSC, 46(1):194–208, Jan 2011.

[22] Dimitra Papagiannopoulou, Giuseppe Capodanno, Tali Moreshet, Maurice Herlihy, and R. Iris

Bahar. Energy-efficient and high-performance lock speculation hardware for embedded mul-

ticore systems. ACM Transactions on Embedded Computing Systems, 14(3):51:1–51:27, May

2015. 130

[23] D. Papagiannopoulou, T. Moreshet, A. Marongiu, L. Benini, M. Herlihy, and R. Iris Bahar.

Speculative synchronization for coherence-free embedded numa architectures. In Embedded

Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIV), 2014 Interna-

tional Conference on, pages 99–106, July 2014.

[24] D. Papagiannopoulou, A. Marongiu, T. Moreshet, L. Benini, M. Herlihy, and R. Iris Bahar.

Hardware transactional memory exploration in coherence-free many-core architectures. (Under

Review) International Journal of Parallel Programming, 2016.

[25] Dimitra Papagiannopoulou, R. Iris Bahar, Tali Moreshet, Maurice Herlihy, Andrea Marongiu,

and Luca Benini. Transparent and energy-efficient speculation on NUMA architectures for

embedded mpsocs. In Proceedings of the 1st International Workshop on Many-core Embedded

Systems 2013, MES’2013, Held in conjunction with the 40th Annual IEEE/ACM International

Symposium on Computer Architecture, ISCA 2013, June 24, 2013., pages 58–61, 2013.

[26] Dimitra Papagiannopoulou, Andrea Marongiu, Tali Moreshet, Luca Benini, Maurice Herlihy,

and Iris Bahar. Playing with fire: Transactional memory revisited for error-resilient and energy-

efficient mpsoc execution. In Proceedings of the 25th Edition on Great Lakes Symposium on

VLSI, GLSVLSI ’15, pages 9–14, New York, NY, USA, 2015. ACM.

[27] Maurice Herlihy and Nir Shavit. The Art of Multiprocessor Programming. Morgan Kaufmann,

1 edition, March 2008.

[28] T. E. Anderson. The performance of spin lock alternatives for shared-memory multiprocessors.

1(1):6–16, January 1990.

[29] Chenjie Yu and Peter Petrov. Distributed and low-power synchronization architecture for em-

bedded multiprocessors. In Proceedings of the 6th IEEE/ACM/IFIP International Conference

on Hardware/Software Codesign and System Synthesis, CODES+ISSS ’08, pages 73–78, New

York, NY, USA, 2008. ACM. 131

[30] Chengmo Yang and Alex Orailoglu. Light-weight synchronization for inter-processor communi-

cation acceleration on embedded mpsocs. In Proceedings of the 2007 International Conference

on Compilers, Architecture, and Synthesis for Embedded Systems, CASES ’07, pages 150–154,

New York, NY, USA, 2007. ACM.

[31] Antonino Tumeo, Christian Pilato, Gianluca Palermo, Fabrizio Ferrandi, and Donatella Sci-

uto. Hw/sw methodologies for synchronization in fpga multiprocessors. In Proceedings of

the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA ’09,

pages 265–268, New York, NY, USA, 2009. ACM.

[32] C. Yu and P. Petrov. Low-cost and energy-efficient distributed synchronization for embed-

ded multiprocessors. IEEE Transactions on Very Large Scale Integration (VLSI) Systems,

18(8):1257–1261, Aug 2010.

[33] Hao Xiao, Ning Wu, Fen Ge, T. Isshiki, H. Kunieda, Jun Xu, and Yuangang Wang. Efficient

synchronization for distributed embedded multiprocessors. IEEE Transactions on Very Large

Scale Integration (VLSI) Systems, 24(2):779–783, Feb 2016.

[34] M. Monchiero, G. Palermo, C. Silvano, and O. Villa. Efficient synchronization for embedded

on-chip multiprocessors. IEEE Transactions on Very Large Scale Integration (VLSI) Systems,

14(10):1049–1062, Oct 2006.

[35] J. H. Rutgers, M. J. G. Bekooij, and G. J. M. Smit. An efficient asymmetric distributed

lock for embedded multiprocessor systems. In Embedded Computer Systems (SAMOS), 2012

International Conference on, pages 176–182, July 2012.

[36] Olga Golubeva, Mirko Loghi, and Massimo Poncino. On the energy efficiency of synchroniza-

tion primitives for shared-memory single-chip multiprocessors. In Proceedings of the 17th ACM

Great Lakes Symposium on VLSI, GLSVLSI ’07, pages 489–492, New York, NY, USA, 2007.

ACM. 132

[37] Hyeonjoong Cho, B. Ravindran, and E. D. Jensen. Lock-free synchronization for dynamic

embedded real-time systems. In Design, Automation and Test in Europe, 2006. DATE ’06.

Proceedings, volume 1, pages 1–6, March 2006.

[38] Seung Hun Kim, Sang Hyong Lee, Minje Jun, Byunghoon Lee, Won Woo Ro, Eui-Young

Chung, and J. L. Gaudiot. C-Lock: Energy efficient synchronization for embedded multicore

systems. IEEE Transactions on Computers, 63(8):1962–1974, Aug 2014.

[39] J. Li, J. F. Martinez, and M. C. Huang. The thrifty barrier: energy-aware synchronization in

shared-memory multiprocessors. In Software, IEE Proceedings-, pages 14–23, Feb 2004.

[40] C. Liu, A. Sivasubramaniam, M. Kandemir, and M. J. Irwin. Exploiting barriers to opti-

mize power consumption of cmps. In Parallel and Distributed Processing Symposium, 2005.

Proceedings. 19th IEEE International, pages 5a–5a, April 2005.

[41] Tali Moreshet, R. Iris Bahar, and Maurice Herlihy. Energy-aware microprocessor synchroniza-

tion: Transactional memory vs. locks. In Workshop on Memory Performance Issues, February

2006. in conjunction with the International Symposium on High-Performance Computer Ar-

chitecture.

[42] Dave Dice, Yossi Lev, Mark Moir, and Daniel Nussbaum. Early experience with a commercial

hardware transactional memory implementation. SIGPLAN Not., pages 157–168, 2009.

[43] Nir Shavit and Dan Touitou. Software transactional memory. In Proceedings of the 14th ACM

Symposium on Principles of Distributed Computing, pages 204–213. Aug 1995.

[44] T. Harris, A. Cristal, O. S. Unsal, E. Ayguade, F. Gagliardi, B. Smith, and M. Valero. Trans-

actional memory: An overview. IEEE Micro, 27(3):8–29, May 2007.

[45] C. Scott Ananian, Krste Asanovic, Bradley C. Kuszmaul, Charles E. Leiserson, and Sean

Lie. Unbounded transactional memory. In ACM/IEEE International Symposium on High-

Performance Computer Architecture, February 2005. 133

[46] Ravi Rajwar, Maurice Herlihy, and Konrad Lai. Virtualizing Transactional Memory. In

ACM/IEEE International Symposium on Computer Architecture, June 2005.

[47] Luke Yen, Jayaram Bobba, Michael R. Marty, Kevin E. Moore, Haris Volos, Mark D. Hill,

Michael M. Swift, and David A. Wood. LogTM-SE: Decoupling hardware transactional mem-

ory from caches. In HPCA, pages 261–272, 2007.

[48] Arrvindh Shriraman, Sandhya Dwarkadas, and Michael L. Scott. Flexible decoupled

transactional memory support. SIGARCH Comput. Archit. News, 36(3):139–150, 2008.

http://doi.acm.org/10.1145/1394608.1382134.

[49] Jayaram Bobba, Neelam Goyal, Mark D. Hill, Michael M. Swift, and David A. Wood. To-

kenTM: Efficient execution of large transactions with hardware transactional memory. In

Proceedings of the 35th Annual International Symposium on Computer Architecture, ISCA

’08, pages 127–138, Washington, DC, USA, 2008. IEEE Computer Society.

[50] Ananian Scott and Rinard Martin. Efficient object-based software transactions. In In Pro-

ceedings, Workshop on Synchronization and Concurrency in Object-Oriented Languages, OOP-

SLA’05, San Diego, CA, USA, 2005.

[51] Maurice Herlihy, Victor Luchangco, Mark Moir, and William N. Scherer. Software transac-

tional memory for dynamic-sized data structures. In Proceedings of the twenty-second annual

symposium on Principles of distributed computing, PODC ’03, pages 92–101, New York, NY,

USA, 2003. ACM.

[52] Maurice Herlihy and Eric Koskinen. Transactional boosting: a methodology for highly-

concurrent transactional objects. In Proceedings of the 13th ACM SIGPLAN Symposium on

Principles and practice of parallel programming, PPoPP ’08, pages 207–216, New York, NY,

USA, 2008. ACM. 134

[53] V. Marathe, W. Scherer, and M. Scott. Adaptive software transactional memory. Technical

Report TR 868, Computer Science Department, University of Rochester, May 2005.

[54] Peter Damron, Alexandra Fedorova, Yossi Lev, Victor Luchangco, Mark Moir, and Daniel

Nussbaum. Hybrid transactional memory. SIGOPS Oper. Syst. Rev., 40:336–346, October

2006. http://doi.acm.org/10.1145/1168917.1168900.

[55] Sanjeev Kumar, Michael Chu, Christopher J. Hughes, Partha Kundu, and Anthony Nguyen.

Hybrid transactional memory. In Proceedings of the Eleventh ACM SIGPLAN Symposium on

Principles and Practice of Parallel Programming, PPoPP ’06, pages 209–220, New York, NY,

USA, 2006. ACM.

[56] Moir Yossi, Lev Mark and Nussbaum Dan. Phtm: Phased transactional memory. In Workshop

on Transactional Computing, TRANSACT’07, 2007.

[57] Arrvindh Shriraman, Sandhya Dwarkadas, and Michael L. Scott. Flexible decoupled

transactional memory support. SIGARCH Comput. Archit. News, 36(3):139–150, 2008.

http://doi.acm.org/10.1145/1394608.1382134.

[58] Chi Cao Minh, JaeWoong Chung, Christos Kozyrakis, and Kunle Olukotun. STAMP: Stan-

ford transactional applications for multi-processing. In International Symposium on Workload

Characterization, Sept. 2008.

[59] F. Klein, A. Baldassin, G. Araujo, P. Centoducatte, and R. Azevedo. On the energy-efficiency

of software transactional memory. In Proceedings of the 22Nd Annual Symposium on Integrated

Circuits and System Design: Chip on the Dunes, SBCCI ’09, pages 33:1–33:6, New York, NY,

USA, 2009. ACM.

[60] Dave Dice, Ori Shalev, and Nir Shavit. Transactional Locking II. In In Proc. of the 20th Intl.

Symp. on Distributed Computing, 2006. 135

[61] Colin Blundell, E Christopher Lewis, and Milo M. K. Martin. Subtleties of transactional

memory atomicity semantics. Computer Architecture Letters, 5(2), Nov 2006.

[62] Saša Tomić, Cristian Perfumo, Chinmay Kulkarni, Adrià Armejach, Adrián Cristal, Osman

Unsal, Tim Harris, and Mateo Valero. Eazyhtm: Eager-lazy hardware transactional memory.

In Proceedings of the 42Nd Annual IEEE/ACM International Symposium on Microarchitecture,

MICRO 42, pages 145–155, New York, NY, USA, 2009. ACM.

[63] David Kanter. Analysis of Haswell’s Transactional Memory, February 2012. Retrieved from

http://www.realworldtech.com/haswell-tm/.

[64] Andi Kleen. Scaling Existing Lock-based Applications with Lock Elision, February 2014.

Retrieved from http://queue.acm.org/detail.cfm?id=2579227.

[65] M. Pohlack and S. Diestelhorst. From lightweight hardware transactional memory to

lightweight lock elision. In TRANSACT, 2011.

[66] Dave Christie, Jae-Woong Chung, Stephan Diestelhorst, Michael Hohmuth, Martin Pohlack,

Christof Fetzer, Martin Nowack, Torvald Riegel, Pascal Felber, Patrick Marlier, and Etienne

Rivi`ere. Evaluation of AMD’s advanced synchronization facility within a complete transac-

tional memory stack. In EuroSys ’10, pages 27–40, New York, NY, USA, 2010. ACM.

[67] Ravi Rajwar and James R. Goodman. Transactional lock-free execution of lock-based pro-

grams. In ASPLOS, pages 5–17, 2002.

[68] Plurality Ltd. The hypercore processor. www.plurality.com/hypercore.html.

[69] D. Melpignano, et al. Platform 2012, a many-core computing accelerator for embedded SoCs:

Performance evaluation of visual analytics applications. In DAC, pages 1137–1142, 2012.

[70] S. Borkar, T. Karnik, S. Narendra, J. Tschanz, A. Keshavarzi, and V. De. Parameter variations

and impact on circuits and microarchitecture. In DAC, pages 338–342, June 2003. 136

[71] Veit B. Kleeberger, Petra R. Maier, and Ulf Schlichtmann. Workload- and instruction-aware

timing analysis: The missing link between technology and system-level resilience. In DAC,

pages 49:1–49:6, 2014.

[72] S. Narayanan, G. Lyle, R. Kumar, and D. Jones. Testing the critical operating point (COP)

hypothesis using FPGA emulation of timing errors in over-scaled soft-processors. In SELSE,

2009.

[73] M.R. Kakoee, I. Loi, and L. Benini. Variation-tolerant architecture for ultra low power shared-

l1 processor clusters. TCAS II, 59(12):927–931, Dec 2012.

[74] D. Bull, S. Das, K. Shivashankar, G.S. Dasika, K. Flautner, and D. Blaauw. A power-efficient

32 bit arm processor using timing-error detection and correction for transient-error tolerance

and adaptation to pvt variation. Solid-State Circuits, IEEE Journal of, 46(1):18–31, Jan 2011.

[75] M. Fojtik, D. Fick, Y. Kim, N. Pinckney, D.M. Harris, D. Blaauw, and D. Sylvester. Bubble

razor: Eliminating timing margins in an arm cortex-m3 processor in 45 nm cmos using archi-

tecturally independent error detection and correction. Solid-State Circuits, IEEE Journal of,

48(1):66–81, Jan 2013.

[76] S. Das, D. Roberts, Seokwoo Lee, S. Pant, D. Blaauw, T. Austin, K. Flautner, and T. Mudge.

A self-tuning dvs processor using delay-error detection and correction. IEEE Journal of Solid-

State Circuits, 41(4):792–804, April 2006.

[77] K.A. Bowman, J.W. Tschanz, Nam Sung Kim, J.C. Lee, C.B. Wilkerson, S.L. Lu, T. Karnik,

and V.K. De. Energy-efficient and metastability-immune resilient circuits for dynamic variation

tolerance. Solid-State Circuits, IEEE Journal of, 44(1):49–63, Jan 2009.

[78] J. Tschanz, K. Bowman, S. Walstra, M. Agostinelli, T. Karnik, and Vivek De. Tunable

replica circuits and adaptive voltage-frequency techniques for dynamic voltage, temperature, 137

and aging variation tolerance. In VLSI Circuits, 2009 Symposium on, pages 112–113, June

2009.

[79] Mridul Agarwal, Bipul C. Paul, Ming Zhang, and Subhasish Mitra. Circuit failure prediction

and its application to transistor aging. In Proceedings of the 25th IEEE VLSI Test Symmpo-

sium, VTS ’07, pages 277–286, Washington, DC, USA, 2007. IEEE Computer Society.

[80] J. Tschanz, Nam Sung Kim, S. Dighe, J. Howard, G. Ruhl, S. Vangal, S. Narendra, Y. Hoskote,

H. Wilson, C. Lam, M. Shuman, C. Tokunaga, D. Somasekhar, S. Tang, D. Finan, T. Karnik,

N. Borkar, N. Kurd, and V. De. Adaptive frequency and biasing techniques for tolerance to

dynamic temperature-voltage variations and aging. In Solid-State Circuits Conference, 2007.

ISSCC 2007. Digest of Technical Papers. IEEE International, pages 292–604, Feb 2007.

[81] Ming Zhang, T.M. Mak, J. Tschanz, Kee Sup Kim, N. Seifert, and D. Lu. Design for resilience

to soft errors and variations. In On-Line Testing Symposium, 2007. IOLTS 07. 13th IEEE

International, pages 23–28, July 2007.

[82] L. Leem, Hyungmin Cho, J. Bau, Q.A. Jacobson, and S. Mitra. ERSA: Error resilient system

architecture for probabilistic applications. In DATE, pages 1560–1565, March 2010.

[83] S. Dighe, S.R. Vangal, P. Aseron, S. Kumar, T. Jacob, K.A. Bowman, J. Howard, J. Tschanz,

V. Erraguntla, N. Borkar, V.K. De, and S. Borkar. Within-die variation-aware dynamic-

voltage-frequency-scaling with optimal core allocation and thread hopping for the 80-core

teraflops processor. JSSC, 46(1):184–193, Jan 2011.

[84] Abbas Rahimi, Daniele Cesarini, Andrea Marongiu, Rajesh K. Gupta, and Luca Benini. Im-

proving resilience to timing errors by exposing variability effects to software in tightly-coupled

processor clusters. JETCAS, 4(2):216–229, 2014.

[85] H. Q. Le, G. L. Guthrie, D. E. Williams, M. M. Michael, B. G. Frey, W. J. Starke, C. May, 138

R. Odaira, and T. Nakaike. Transactional memory support in the IBM POWER8 processor.

IBM Journal of Research and Development, 59(1):8:1–8:14, January 2015.

[86] Jons-Tobias Wamhoff, Mario Schwalbe, Rasha Faqeh, Christof Fetzer, and Pascal Felber.

Transactional encoding for tolerating transient hardware errors. In Stabilization, Safety, and

Security of Distributed Systems, volume 8255 of LNCS, pages 1–16. Springer Intl. Pub., 2013.

[87] Gulay Yalcin, Osman Sabri Unsal, and Adrian Cristal. Fault tolerance for multi-threaded

applications by leveraging hardware transactional memory. In Computing Frontiers, pages

4:1–4:9, 2013.

[88] G. Yalcin, A. Cristal, O. Unsal, A. Sobe, D. Harmanci, P. Felber, A. Voronin, J.-T. Wamhoff,

and C. Fetzer. Combining error detection and transactional memory for energy-efficient com-

puting below safe operation margins. In PDP 2014, pages 248–255, Feb 2014.

[89] Tim Harris, James R. Larus, and Ravi Rajwar. Transactional memory,

2nd edition. Synthesis Lectures on Computer Architecture, 5(1):1–263, 2010.

http://www.morganclaypool.com/doi/abs/10.2200/S00272ED1V01Y201006CAC011.

[90] Federico Angiolini, Jianjiang Ceng, Rainer Leupers, Federico Ferrari, Cesare Ferri, and Luca

Benini. An integrated open framework for heterogeneous MPSoC design space exploration. In

DATE ’06, pages 1145–1150. European Design and Automation Association.

[91] M. Horowitz, T. Indermaur, and R. Gonzalez. Low-power digital design. In Low Power

Electronics, 1994. Digest of Technical Papers., IEEE Symposium, pages 8–11, Oct 1994.

[92] STMicroelectronics. Nomadik platform. www.st.com, 2008.

[93] A Efthymiou and J.D. Garside. An adaptive serial-parallel cam architecture for low-power

cache blocks. In Low Power Electronics and Design, 2002. ISLPED ’02. Proceedings of the

2002 International Symposium on, pages 136–141, 2002. 139

[94] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown.

MiBench: A free, commercially representative embedded benchmark suite. In IEEE Interna-

tional Workshop on Workload Characterization, pages 3–14, 2001.

[95] C. Helmstetter and V. Joloboff. Simsoc: A systemc tlm integrated iss for full system simulation.

In IEEE Asia Pacific Conference, 2008.

[96] Sungpack Hong, T. Oguntebi, J. Casper, N. Bronson, C. Kozyrakis, and K. Olukotun. Eigen-

bench: A simple exploration tool for orthogonal tm characteristics. In Workload Characteri-

zation (IISWC), 2010 IEEE International Symposium on, pages 1–11, Dec 2010.

[97] P. Meinerzhagen, S. M. Y. Sherazi, A. Burg, and J. N. Rodrigues. Benchmarking of standard-

cell based memories in the sub-domain in 65-nm cmos technology. IEEE Journal on Emerging

and Selected Topics in Circuits and Systems, 1(2):173–182, June 2011.

[98] Woongki Baek, Chi Cao Minh, Martin Trautmann, Christos Kozyrakis, and Kunle Olukotun.

The OpenTM transactional application programming interface. In PACT, pages 376–387,

2007.

[99] www.openmp.org. Openmp application program interface v.3.0.

[100] J.T. Kao, M. Miyazaki, and A.P. Chandrakasan. A 175-mv multiply-accumulate unit using

an adaptive supply voltage and body bias architecture. Solid-State Circuits, IEEE Journal of,

37(11):1545–1554, Nov 2002.

[101] S. Narendra, A. Keshavarzi, B.A. Bloechel, S. Borkar, and V. De. Forward body bias for

in 130-nm technology generation and beyond. Solid-State Circuits, IEEE

Journal of, 38(5):696–701, May 2003.