Abstract of “Improving performance, energy-efficiency and error-resilience of multicore embedded systems through speculative synchronization mechanisms” by Dimitra Papagiannopoulou, Ph.D., Brown University, May 2016.

Embedded systems are becoming ubiquitous and, like their general-purpose counterparts, they have embraced the multicore design paradigm. However, embedded systems need to satisfy specific requirements in performance, energy-efficiency and error-resilience. This thesis proposes design techniques based on speculative synchronization mechanisms such as Hardware Transactional Memory (HTM), Speculative Lock Elision (SLE) and Transactional Lock Removal (TLR) to address these issues.

The first part of the thesis introduces Embedded-Spec, an energy-efficient and lightweight implementation for transparent speculation on a shared-bus multicore embedded architecture. A major advantage of Embedded-Spec is that it can be transparently used with lock-based, non-speculative legacy code. An extensive set of experiments over a wide range of parameters shows that compared to traditional locking, Embedded-Spec can improve the energy-delay product to different degrees based on the chosen configuration.

In order to overcome scalability limitations and achieve better performance, high-end embedded systems are turning to many-core cluster-based NUMA architectures that employ simple scratchpad memories instead of area- and power-hungry data caches. For these types of architectures without caches and cache-coherence support, no speculative synchronization design exists. The second part of this thesis introduces the first implementation of HTM for a coherence-free many-core embedded architecture. The design employs distributed conflict management and resolution for increased scalability. Experiments show that the proposed HTM design can achieve significant performance improvement over traditional locking.

The final part of this thesis explores how HTM can be used beyond data synchronization and specifically as an error-recovery mechanism from variability-induced errors. Two integrated HW/SW schemes are introduced that adaptively scale the supply voltage in order to save energy. These schemes use lightweight checkpointing and roll-back mechanisms adopted from HTM to recover both from intermittent timing errors and catastrophic failures that may occur due to scaling beyond a safe supply voltage. Experiments over a range of operating parameters show that both techniques can achieve significant energy savings at low overhead compared to using conservative voltage guardbands, while guaranteeing forward progress and reliability.

Improving performance, energy-efficiency and error-resilience of multicore embedded systems through speculative synchronization mechanisms

by

Dimitra Papagiannopoulou

M.Sc, Brown University, 2013

M.Sc, University of Patras, Greece 2014

BSE, University of Patras, Greece 2008

A dissertation submitted in partial fulfillment of the

requirements for the Degree of Doctor of Philosophy

in the School of Engineering at Brown University

Providence, Rhode Island

May 2016

© Copyright 2016 by Dimitra Papagiannopoulou

This dissertation by Dimitra Papagiannopoulou is accepted in its present form by

the School of Engineering as satisfying the dissertation requirement

for the degree of Doctor of Philosophy.

Date R. Iris Bahar, Director

Recommended to the Graduate Council

Date Maurice Herlihy, Reader

Date Sherief Reda, Reader

Approved by the Graduate Council

Date Peter M. Weber, Dean of the Graduate School

Vita

Dimitra Papagiannopoulou was born in 1985 in Athens, Greece and grew up in Patras, Greece.

She holds a Bachelor of Science in Engineering (Dipl.-Ing.) from the department of Electrical and Computer Engineering of the University of Patras, a Master of Science degree on “Integrated Software and Hardware Systems” from the department of Computer Science and Engineering of the University of Patras and a Master of Science degree from the department of Electrical Sciences and Computer Engineering of Brown University. Her research interests span the areas of embedded systems, low-power design, multiprocessor synchronization, reliability and variability-aware design.

Acknowledgements

I would like to express my sincere gratitude to my advisor, Prof. Iris Bahar for her continuous support, encouragement and guidance throughout my PhD studies. Prof. Bahar was the reason I chose to attend Brown University. She has been a great mentor to me all these years and I would like to thank her for her active participation in developing me as a researcher.

I would also like to thank Prof. Maurice Herlihy for working with me throughout my Ph.D., for his invaluable feedback, support and mentorship. I am grateful to Prof. Sherief Reda for being on my thesis committee and for his constructive feedback concerning this thesis manuscript. I would also like to express my gratitude to my research collaborators, Prof. Tali Moreshet, Prof. Luca Benini and Dr. Andrea Marongiu, for their great insight, help and feedback. It has been a pleasure working with them.

Many thanks to my present and past colleagues for making the experience at Brown so special.

I would like to thank Cesare Ferri, Thomas Carle, Onur Ulusel, Marco Donato, Kumud Nepal, Christopher Picardo, Christopher Harris, Octavian Biris, Kapil Dev, Monami Nowroz and many more.

I would also like to thank my friends for always being there for me and for the great times we had together. Last but not least, I would like to thank my mother Ioanna, my father Angelos, my sister Katerina and my fiance Sotiris for their love and their continuous support, encouragement and motivation. Without them, I would not be where I am today.

Contents

List of Tables ix

List of Figures x

1 Introduction 1

2 Background and Previous Work 9

2.1 Traditional Locking ...... 9

2.2 Speculative Synchronization Mechanisms ...... 11

2.2.1 Transactional Memory ...... 12

2.2.2 Speculative Lock Elision ...... 20

2.2.3 Transactional Lock Removal ...... 22

2.2.4 Speculation for Embedded Systems ...... 23

2.2.5 Error-resilient and energy-efficient execution on embedded systems ...... 27

3 Energy-efficient and transparent speculation on embedded MPSoC 33

3.1 Embedded-Spec: Speculative Memory Design ...... 34

3.2 Architecture ...... 37

3.2.1 The Bloom Module Hardware ...... 38

3.3 The Embedded-Spec Algorithms ...... 40

3.3.1 Embedded-LE ...... 42

3.3.2 Embedded-LR ...... 43

3.4 Experimental Results ...... 44

3.4.1 Benchmarks ...... 44

3.4.2 Embedded-LE Parameter Exploration ...... 46

3.4.3 Embedded-LR Parameter Exploration ...... 57

3.4.4 Embedded-Spec vs. Locks ...... 59

3.5 Summary and Discussion ...... 63

4 Speculative Synchronization on Coherence-free Many-core Embedded Architectures 65

4.1 Target Architecture ...... 66

4.2 Transactional Memory Design ...... 69

4.2.1 Transactional Bookkeeping ...... 70

4.2.2 Data Versioning ...... 72

4.2.3 Transaction Control Flow ...... 76

4.3 Experimental Results ...... 79

4.3.1 Overhead Characterization ...... 80

4.3.2 Performance Characterization ...... 81

4.3.3 EigenBench ...... 87

4.4 Summary and Discussion ...... 90

5 Transactional Memory Revisited for Error-Resilient and Energy-Efficient MPSoC Execution 92

5.1 Motivation ...... 93

5.2 Target Architecture ...... 95

5.3 Implementation ...... 97

5.3.1 Checkpointing and Rollback ...... 97

5.3.2 Data Versioning ...... 98

5.3.3 Error-Resilient Transactions ...... 100

5.3.4 Programming model ...... 101

5.4 Experimental Results ...... 102

5.4.1 Overhead characterization ...... 103

5.4.2 Energy characterization ...... 103

5.5 Summary and Discussion ...... 106

6 Adaptive voltage scaling policies for improving energy savings at near-edge operation 107

6.1 Addressing critical and non-critical errors ...... 108

6.2 Error policy design ...... 109

6.3 The Thrifty uncle/Reckless nephew policy ...... 113

6.4 Experimental Results ...... 116

6.4.1 Energy consumption ...... 116

6.4.2 Overhead characterization ...... 119

6.4.3 Energy savings vs. transaction size ...... 119

6.5 Summary and Discussion ...... 121

7 Conclusions and future directions 123

List of Tables

3.1 EMBEDDED-SPEC — All Configurations...... 42

3.2 Hardware configurations...... 45

3.3 EMBEDDED-SPEC – Best two configurations when considering performance only, energy only, or energy-delay product...... 63

4.1 Per-core transactional write footprint for each application...... 81

4.2 Experimental setup for VSoC platform...... 82

List of Figures

2.1 The lock interface...... 10

2.2 Example of transactional events handling (based on the implementation proposed in [1])...... 15

2.3 Classification of TM designs...... 20

2.4 Percentage error rate versus supply voltage for intermittent timing errors and the Critical Operating Point...... 28

2.5 Pipeline augmented with Razor latches and control lines (taken from [2])...... 30

3.1 Logic for Transactional Management used in Embedded-Spec. The architectural configuration is taken from [3]. The dark blocks show the additional hardware required. That is, the Tx bit for each line of the data cache to indicate if the data is transactional, the Tx logic in the cache controller to handle transactional accesses, and the Bloom module to detect and resolve conflicts...... 35

3.2 Modifications to the cache coherence protocol for transactional accesses. The gray block indicates the added operations. Note: The TX decision diamond denotes whether the Tx bit is already set or not...... 36

3.3 Architecture overview, as proposed in [3]...... 38

3.4 (a) Overview of the Bloom Module. (b) Internal details of a core Bloom Filter Unit (BFU). Taken from [3] ...... 39

3.5 The flowchart of the Embedded-LE algorithm...... 43

3.6 Execution time for Embedded-LE and Embedded-LE-Sleep modes...... 46

3.7 Energy Consumption for Embedded-LE and Embedded-LE-Sleep modes. . . . . 47

3.8 Energy Delay Product for Embedded-LE and Embedded-LE-Sleep modes. . . . . 48

3.9 Performance of Embedded-LE and varying maximum number of retries...... 49

3.10 Energy Consumption of Embedded-LE and varying maximum number of retries. . 51

3.11 Energy Delay Product of Embedded-LE and varying maximum number of retries. . 51

3.12 Performance of Embedded-LE-Sleep and varying maximum number of retries. . . 52

3.13 Energy Consumption of Embedded-LE-Sleep and varying maximum number of retries...... 52

3.14 Energy Delay Product of Embedded-LE-Sleep and varying maximum number of retries...... 53

3.15 Energy delay product for Embedded-LE using different abort policies and maximum number of allowed retries set to 1...... 55

3.16 Energy delay product for Embedded-LE using different abort policies and maximum number of allowed retries set to 2...... 55

3.17 Energy delay product for Embedded-LE using different abort policies and maximum number of allowed retries set to infinity...... 56

3.18 Energy delay product for Embedded-LE-Sleep using different abort policies and maximum number of allowed retries set to 1...... 57

3.19 Energy delay product for Embedded-LE-Sleep using different abort policies and maximum number of allowed retries set to 2...... 57

3.20 Energy delay product for Embedded-LE-Sleep using different abort policies and maximum number of allowed retries set to infinity...... 58

3.21 Energy delay product for Embedded-LR using different abort policies...... 59

3.22 Execution time of Embedded-Spec vs. standard locks. Showing results for best configurations for each benchmark...... 60

3.23 Energy Consumption of Embedded-Spec vs. standard locks. Showing results for best configurations for each benchmark...... 61

3.24 Energy Delay Product of Embedded-Spec vs. standard locks. Showing results for best configurations for each benchmark...... 62

4.1 Hierarchical design of our cluster-based embedded system...... 67

4.2 Single cluster architecture of target platform...... 68

4.3 A 4X8 Mesh of Trees. Circles represent routing and arbitration switches. Taken from [4]...... 69

4.4 Bookkeeping example. At time t1 address location A had not been read or written. By time t2, cores 1, 7, 8, and 13 have read the address. At time t3 core 13 writes the address and generates a conflict. So, core 13 will be aborted and its read flag will be cleared. Since core 13 was also the writer of address location A, the Writer ID bits and the Owner bit will be cleared as well...... 71

4.5 Modified single cluster architecture. Notice that the PIC refers to the off-cluster peripheral interconnect...... 74

4.6 Distributed per-address log scheme for M banks and N cores...... 75

4.7 Transactional control flow...... 77

4.8 Redblack: Performance comparison between locks and transactions for different number of cores...... 83

4.9 Skiplist: Performance comparison between locks and transactions for different number of cores...... 83

4.10 Genome: Performance comparison between locks and transactions for different number of cores...... 84

4.11 Vacation: Performance comparison between locks and transactions for different number of cores...... 85

4.12 Kmeans: Performance comparison between locks and transactions for different number of cores...... 85

4.13 Results for the eigenbench evaluation methodology. Eigen-characteristics considered are working-set size (top), contention (middle) and Predominance (bottom)...... 89

5.1 Target platform high level view...... 95

5.2 Control Flow of an error-resilient transaction...... 100

5.3 Transformed OpenMP dynamic loop ...... 101

5.4 Energy consumption at -40°C. Steady voltage (SV) versus transactional memory (TM). 104

5.5 Energy consumption at 25°C. Steady voltage (SV) versus transactional memory (TM). 104

5.6 Energy consumption at 125°C. Steady voltage (SV) versus transactional memory (TM). 105

6.1 Example of an error policy decision flow based on: expected error rate, number of consecutive aborts and number of consecutive commits...... 113

6.2 The ‘Thrifty uncle/Reckless nephew’ policy ...... 115

6.3 Single-core energy consumption normalized to the baseline SV configuration using a 20mV voltage scaling step. SV: Steady voltage configuration, TM: Transactional Memory-based technique of Chapter 5, UN: Thrifty uncle/Reckless nephew policy. . 117

6.4 Single-core energy consumption normalized to the baseline SV configuration using a 25mV voltage scaling step. SV: Steady voltage configuration, TM: Transactional Memory-based technique of Chapter 5, UN: Thrifty uncle/Reckless nephew policy. . 118

6.5 Single-core energy consumption for different transaction sizes...... 120

Chapter 1

Introduction

Embedded systems are becoming increasingly common in everyday life. Found in a wide range of applications, from consumer electronics (smart phones, tablets, video game consoles, digital cameras) to home appliances (microwave ovens, dishwashers, home security systems) to automotive systems (cruise control, antilock braking system) to medical equipment (medical imaging), embedded systems are becoming pervasive and will eventually displace many general-purpose systems.

As with general-purpose systems, embedded systems have embraced the multicore design paradigm in order to meet the increasing demand for performance within tight energy constraints. Thus, instead of increasing the clock speed of uniprocessor systems, technology has shifted its focus towards multiprocessor system-on-chip (MPSoC) architectures, where multiple simple, low-power cores are integrated on the same chip, communicating through multi-level on-chip shared memory hierarchies.

Programming shared memory MPSoCs has introduced new challenges, since cores need to be synchronized to prevent memory inconsistencies when data races arise. Many designs have adopted the single address-space shared-memory model, since it provides programmers a familiar and easy-to-understand abstraction of the memory resources. However, the power and performance improvement promised by embedded MPSoCs can be realized only if a fast and energy-efficient synchronization mechanism is in place. Designing such a mechanism in the resource-constrained environment of embedded MPSoCs is a major challenge.

Locks are a synchronization mechanism typically used to guarantee memory consistency in shared memory programs by enforcing serialization of access to shared data. Locks, however, can slow performance and can consume excessive energy, as they typically require time- and energy-expensive read-modify-write operations that traverse the memory hierarchy. In addition, locks must be deployed conservatively whenever conflicts are possible, even if they are very unlikely, thus causing serialization and limiting concurrency. While fine-grained locking techniques can be used to minimize serialization, they require deep understanding of the target application and complicated debug processes. Coarse-grained locking is easier to use, but it cannot extract high degrees of parallelism.
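To make the trade-off concrete, the following sketch (an illustrative Java example only, not code from this thesis) protects a two-field structure either with one coarse-grained lock for the whole object or with one fine-grained lock per field; the finer granularity lets independent updates proceed in parallel, at the cost of more locks for the programmer to reason about.

import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch only: one structure protected either by a single
// coarse-grained lock or by one fine-grained lock per field.
class AccountPair {
    private long checking = 0;
    private long savings  = 0;

    // Coarse-grained: one lock serializes every operation on the object.
    private final ReentrantLock coarse = new ReentrantLock();

    void depositCheckingCoarse(long amount) {
        coarse.lock();
        try { checking += amount; } finally { coarse.unlock(); }
    }

    void depositSavingsCoarse(long amount) {
        coarse.lock();
        try { savings += amount; } finally { coarse.unlock(); }
    }

    // Fine-grained: independent fields get independent locks, so the two
    // deposits below can run concurrently -- but the programmer must now
    // reason about lock ordering if an operation ever needs both locks.
    private final ReentrantLock checkingLock = new ReentrantLock();
    private final ReentrantLock savingsLock  = new ReentrantLock();

    void depositCheckingFine(long amount) {
        checkingLock.lock();
        try { checking += amount; } finally { checkingLock.unlock(); }
    }

    void depositSavingsFine(long amount) {
        savingsLock.lock();
        try { savings += amount; } finally { savingsLock.unlock(); }
    }
}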

To address the limitations of locks, several speculative synchronization approaches have been proposed that are based on dynamic conflict detection and promise to improve performance and save energy. Examples include Transactional Memory [5], Speculative Lock Elision [14] and Transactional Lock Removal [6]. With speculative synchronization, a core does not need to wait for a lock to be released in order to execute a critical section. Instead, it proceeds by speculatively executing the critical section in the presence of potential data conflicts with concurrent computations of other cores on the same critical section. If a data conflict does actually take place, it is detected and one or more of the conflicting threads is rolled back and restarted to ensure correctness and consistency. Otherwise, the computation is committed. Thus, if conflicts are not frequent, speculative synchronization can yield higher throughput than conventional lock-based synchronization.

Since speculative techniques can improve throughput, they can often improve energy efficiency as well, simply because the computation finishes earlier. Besides that, while synchronization through locks requires energy-expensive read-modify-write operations that traverse the memory hierarchy, speculative computations are typically restricted to the more energy-efficient L1 (or L2) caches, often relying on native cache-coherence mechanisms to detect conflicts. However, without carefully designing an energy-efficient speculation mechanism, the extra components required to manage speculation may not provide enough of a performance per Watt improvement to justify its adoption, especially in embedded systems.

Among the most prominent speculative synchronization techniques, Transactional Memory (TM) [5] simplifies concurrent programming by allowing a group of instructions to execute atomically as one single transaction. Transactions are as easy to use as coarse-grained locks because programmers have only to identify the boundaries of critical code regions, but they promise to provide the same level of concurrency as fine-grained locks. Transactional threads are executed in parallel. If data conflicts occur, one or more of the conflicting transactions is rolled back and re-executed. Several TM systems have been proposed, based on hardware, software, or hybrid techniques [5], [6], [7]. Recently, Intel [8] and IBM [9] announced new processors with direct hardware support for speculative transactions, and it seems likely that others will soon follow.
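As a sketch of this programming model (illustrative only), the same critical section is shown below written with a conventional lock and in transactional style. Transaction.atomic() is a hypothetical entry point, not the interface of any system discussed in this thesis; the stub simply runs the block under a global lock so the example compiles, whereas a real TM system would execute the block speculatively and retry it on conflict.

import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch only: Transaction.atomic() is a hypothetical TM entry
// point. The stub runs the block under one global lock so the example
// compiles; a real HTM/STM would run it speculatively and retry on conflict.
final class Transaction {
    private static final ReentrantLock globalLock = new ReentrantLock();

    static void atomic(Runnable criticalSection) {
        globalLock.lock();          // stand-in for "begin transaction"
        try {
            criticalSection.run();  // speculative body in a real TM system
        } finally {
            globalLock.unlock();    // stand-in for "commit"
        }
    }
}

class Counter {
    private long value = 0;
    private final ReentrantLock lock = new ReentrantLock();

    // Conventional locking: every increment serializes on the lock.
    void incrementLocked() {
        lock.lock();
        try {
            value++;
        } finally {
            lock.unlock();
        }
    }

    // Transactional style: the programmer only marks the critical section;
    // conflict detection and rollback are left to the TM system.
    void incrementTransactional() {
        Transaction.atomic(() -> value++);
    }
}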

While hardware and software speculation has been extensively studied for the general-purpose computing domain, it has received less attention in the embedded domain [3], [10], [11], [12], [13].

However, embedded systems have very different demands compared to general-purpose systems.

Even though improving throughput and ease of programming remain important for embedded systems, any practical design for such systems must emphasize energy-efficiency as well. Energy efficiency is important for systems of all levels, but for embedded systems that are energy-constrained and often run on batteries, energy efficiency is a critical aspect. In this thesis we propose various designs for speculative execution with energy-efficiency being a primary concern.

There exist some works on speculation that have proposed Transactional Memory designs for embedded systems targeting a reduction in energy consumption [3], [10], [11], but these works have not focused on transparency. Speculative mechanisms must be easy to use, either by requiring little or no changes to legacy code or by integrating them into practical and familiar programming environments. Even though the existing speculative synchronization designs for embedded systems are energy-efficient and light-weight in terms of hardware implementation, they are not necessarily transparent to the programmer, thus imposing an additional burden.

In the first part of this thesis, we explore methods of applying speculation on legacy code without the requirement of special supporting instructions in software. We present Embedded-Spec, an energy-efficient embedded architecture that supports Speculative Lock Elision (SLE) [14]. SLE is a hardware speculation technique that combines attractive properties of both locking and speculative synchronization. With SLE, a core detects blocks of code protected by locks and executes them speculatively by eliding the lock. In case conflicts with other threads occur, it rolls back and retries the block, this time by explicitly acquiring the lock. Lock elision is transparent: applications need not be written explicitly for lock elision use. Instead, annotations are automatically added during compile time to facilitate prediction and tracking. It is an appealing choice since it promises to increase concurrency without the need to retrofit code.

Unlike most prior works on speculative synchronization, Embedded-Spec focuses on energy efficiency and ease of programming, as well as throughput, since all are key constraints for embedded systems. Specifically, the energy-delay product (EDP) is evaluated, as a figure of merit that captures the trade-off between energy and performance. In order to achieve energy efficiency we emphasize design simplicity. Complex hardware designs that require extensive changes to established protocols (such as cache coherence protocols) will likely be too costly to adopt. We propose the addition of simple hardware structures that avoid changes to the underlying cache coherency protocol but leave us the flexibility to vary how synchronization conflicts are detected, how they are resolved (contention management) and which policy to use for switching between speculative and non-speculative executions. We test the proposed scheme on a multi-core embedded architecture that features a shared bus. Results show energy and performance benefits compared to conventional locking, especially when several cores are employed.

Embedded-Spec is targeted to a bus-based architecture. For small-scale systems, shared-bus architectures are attractive for their simplicity, but they do not scale. If we attempt to increase the number of cores to extract more parallelism, a shared bus-based system will likely suffer from high contention in the main bus, which can limit performance significantly. For large-scale systems, architects have embraced hierarchical structures, where processing elements are grouped into clusters interconnected by a scalable medium such as a network-on-chip (NoC). In such systems [15], [16], [17], cores within a cluster communicate efficiently through common on-chip memory structures, while cores in different clusters communicate less efficiently through bandwidth-limited higher-latency links. Architectures that provide the programmer with a shared-memory abstraction, but where memory references across clusters are significantly slower than references within the clusters, are commonly referred to as non-uniform memory access (NUMA) architectures. High-end embedded systems, like their general-purpose counterparts, are turning to many-core cluster-based shared-memory NUMA architectures. Driven by the need for scalability, in the next parts of this thesis we shift our focus towards speculative execution on these types of architectures.

For many-core embedded architectures, memory organization is the single most far-reaching design decision, both in terms of raw performance and (more importantly) in terms of programmer productivity. In order to meet stringent area and power constraints, the cores and memory hierarchy must be kept simple. In particular, simple scratchpad memories (SPM) are typically preferred to hardware-managed data caches that require some form of cache-coherence management and are far more area- (40%) and power-hungry (34%) [9]. Several many-core embedded systems have been designed without the use of caches and cache-coherence (e.g., ST Microelectronics p2012/STHORM [18], the Epiphany IV from Adapteva [19]). These kinds of platforms are becoming increasingly common. However, implementing speculative synchronization in embedded systems that lack cache-coherence support is particularly challenging, since hardware speculative techniques traditionally rely on the underlying cache-coherence protocol to synchronize memory accesses among the cores. For these cacheless systems, a completely new approach is necessary for handling speculative synchronization.

While there have been some works that have looked into Transactional Memory in embedded NUMA systems [13], [7], [12], no solution exists for speculative execution in a cache-less embedded system. Therefore, next we present the first ever implementation of Hardware Transactional Memory (HTM) support within a cluster-based embedded system that lacks an underlying cache-coherence protocol. Prior works relied on the cache-coherence protocol for detecting data conflicts during transaction execution. The lack of cache-coherence introduces major challenges in the design of Transactional Memory support, which now needs to be designed from scratch. Any implementation without cache-coherence support requires explicit data management and implies a fully-custom design of the transactional memory support. At the same time, any design for embedded systems must emphasize simplicity and energy-efficiency, so the underlying software and hardware interface must be kept simple enough to be of practical use.

We design from scratch a unique HTM scheme for a many-core embedded architecture without caches and cache-coherence support. We provide full speculative synchronization support for multiple transactions accessing data within a single cluster. While the current implementation is limited to single-cluster accesses, the proposed scheme is designed so that it is scalable and can be extended to multiple clusters. We introduce the idea of distributing synchronization management, which makes it inherently scalable. We show that the proposed HTM scheme can achieve significant performance improvements over traditional lock-based schemes.

As the need for energy-efficiency in high-end embedded systems persists, designers are turning to voltage scaling techniques in order to save energy. At the same time, the continuous scaling of semiconductor device dimensions is increasingly raising the concern of static and dynamic variability. Spatial die-to-die and within-die static variations ultimately induce performance and power mismatches between cores in multi-core systems, introducing heterogeneity into nominally homogeneous components. Dynamic variations depend on the operating conditions of the chip, and include aging, supply voltage drops and temperature fluctuations. These static and dynamic variations can ultimately lead to errors. To ensure safe system operation, circuit designers often use conservative guardbands on the operating frequency or voltage. If guardbands are too conservative, they lead to loss of operational efficiency since they limit performance and waste energy. This is particularly concerning for embedded systems that are highly constrained. On the other hand, if these guardbands are reduced using techniques such as voltage scaling, the system might face intermittent errors or, even worse, reach a Critical Operating Point (COP) [20]. For a CMOS device, the COP is a voltage and frequency pair beyond which, either by decreasing the voltage or increasing the frequency, the system will face massive instruction failures. This leads us to ask: how should the operating parameters of an embedded device be set for it to function correctly?

So far, researchers have approached this problem by proposing circuit level error detection and correction (EDAC) techniques [21], [2] as a safety net that allows designers to scale the voltage in order to save energy. However, these techniques incur high execution time and energy overheads and are not suitable for many-core environments. In addition, while EDAC techniques are suitable for dealing with sporadic timing errors, they cannot handle the effects of a Critical Operating Point [20].

In principle the COP can be determined for a particular chip after its production, and the most efficient yet safe voltage/frequency pair for the chip could be configured at that time. However, due to static and dynamic variations, the COP may actually change over space and time. As a result, the safe operating point may differ from one core to another and suddenly become unsafe due to aging, temperature fluctuations or voltage drops.

Having extensively studied transactional memory for data synchronization in embedded multicore environments and seeing how it can benefit performance and energy-efficiency, we realized that transactions could be used in an alternative way: as an efficient checkpointing and error-recovery mechanism for both timing errors and the COP, which allows operation at highly reduced supply voltage margins in order to save energy. In the next and final part of this thesis, we propose an integrated HW/SW scheme that is based on HTM and addresses both types of variation phenomena.

In particular, our scheme dynamically monitors the platform and adaptively adjusts to the evolving COP among multiple cores, using lightweight checkpointing and roll-back mechanisms adopted from HTM for error recovery. Our scheme enables the system to operate at highly reduced margins without sacrificing performance, while at the same time guaranteeing forward progress at reduced energy levels. Experiments demonstrate that our technique is particularly effective in saving energy while also offering safe execution guarantees. To the best of our knowledge, this work is the first to describe a full-fledged HTM implementation for error-resilient and energy-efficient MPSoC execution.

This thesis is organized as follows. Chapter 2 provides a background and related work discussion on multiprocessor synchronization and speculative execution. It also presents the basic components of speculative mechanism design and discusses the main challenges that embedded systems face in terms of energy, scalability and reliability. Chapter 3 presents Embedded-Spec. The proposed speculative execution mechanism is evaluated both in terms of energy and performance over a range of speculative execution policies ([22]). Chapter 4 describes the design of a novel HTM scheme for many-core embedded architectures without caches and cache-coherence support. Results show that the proposed design can outperform traditional lock-based schemes ([23–25]). In Chapter 5 the focus shifts towards reliability. The chapter presents an HTM-based design for error-resilient and energy-efficient MPSoC execution that allows operation at highly reduced supply voltage margins and can address intermittent timing errors and the COP. The design is evaluated in terms of energy improvements compared to the use of conservative guardbands ([26]). In Chapter 6 a new error policy is proposed to increase the flexibility in addressing intermittent timing errors and allow for further energy savings. This new technique covers a broader range of error types and offers energy improvements compared to the initial design of Chapter 5. Chapter 7 concludes this thesis and discusses possible future directions.

Chapter 2

Background and Previous Work

2.1 Traditional Locking

In a multiprocessor environment, the memory model describes how threads interact through memory and how they use shared data. It also defines the programmer’s challenges with respect to data management. Among the parallel programming models, the shared memory paradigm is widely adopted in embedded MPSoC designs, since it provides an easy-to-understand abstraction of memory resources that programmers are accustomed to: a single address space that parallel processes share and can access asynchronously.

Concurrent accesses to shared memory can lead to race conditions and memory inconsistencies.

To avoid such scenarios, programmers have adopted the concept of a critical section: a block of code that can be executed by only one thread at a time. We call this property mutual exclusion. The standard way to approach mutual exclusion is through a Lock object satisfying the interface shown in Fig. 2.1 [27]. Each critical section is associated with a Lock object. When a thread tries to enter a critical section it must execute the lock() method call to acquire the lock associated with that section. When it leaves the critical section it must call the unlock() method to release the lock.

public interface Lock {
    public void lock();   // Called before entering critical section.
    public void unlock(); // Called before leaving critical section.
}

Figure 2.1: The lock interface.

Any mutual exclusion protocol poses the question: what will a thread do if it cannot acquire the lock? There are two alternatives. It could keep trying and repeatedly test the lock until it becomes available (‘spin’ or ‘busy-wait’), in which case the lock is called a ‘spinlock’, or it could suspend itself and ask the operating system’s scheduler to schedule another thread on the same processor until the lock is released (‘blocking’). Blocking is a good choice only if the lock delay is expected to be long, while spinning is sensible when the lock delay is expected to be short, since it avoids the overhead of operating system process re-scheduling.
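For illustration, a thread would bracket its critical section with this interface as follows (a minimal sketch; the counter class is hypothetical, and whether lock() spins or blocks is left to the particular Lock implementation).

// Illustrative sketch of using the Lock interface of Fig. 2.1. The counter
// class is a placeholder; any Lock implementation can be plugged in.
class SharedCounter {
    private long value = 0;
    private final Lock lock;   // any implementation of the Fig. 2.1 interface

    SharedCounter(Lock lock) {
        this.lock = lock;
    }

    void increment() {
        lock.lock();            // called before entering the critical section
        try {
            value++;            // critical section: at most one thread at a time
        } finally {
            lock.unlock();      // called before leaving the critical section
        }
    }
}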

Locks typically require hardware support for efficient implementation. This support usually takes the form of one or more atomic instructions such as ‘Test-And-Set’ or ‘Compare-And-Swap’ that allow a thread to test if the lock is free and, if it is, acquire it in a single atomic operation. Spinlocks [28] are based on hardware Test-And-Set (TAS) operations that each require a bus access. When multiple threads are concurrently spinning on the same lock, the system’s bus may be overloaded and its bandwidth may be saturated by the synchronization traffic, even with a small number of cores. Test-And-Test-And-Set (TTAS) locks have been proposed to mitigate this problem. With TTAS locks, threads can store local copies of the lock into their private caches, which must be kept coherent. This way, the threads can repeatedly read the lock from their local caches until it appears to be free and only then call Test-And-Set.
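The following sketch, loosely modeled on the textbook spinlocks discussed in [27], [28], contrasts the two approaches using the Lock interface of Fig. 2.1; AtomicBoolean.getAndSet() stands in for the hardware Test-And-Set instruction, and the code is illustrative rather than an implementation used in this thesis.

import java.util.concurrent.atomic.AtomicBoolean;

// Simplified sketch of TAS vs. TTAS spinlocks implementing the Fig. 2.1
// interface. AtomicBoolean.getAndSet() plays the role of Test-And-Set.
class TASLock implements Lock {
    private final AtomicBoolean state = new AtomicBoolean(false);

    public void lock() {
        // Every failed attempt is a Test-And-Set, i.e., a bus transaction.
        while (state.getAndSet(true)) { /* spin */ }
    }

    public void unlock() {
        state.set(false);
    }
}

class TTASLock implements Lock {
    private final AtomicBoolean state = new AtomicBoolean(false);

    public void lock() {
        while (true) {
            // First spin on a locally cached copy of the lock...
            while (state.get()) { /* spin without generating bus traffic */ }
            // ...and only call Test-And-Set once the lock appears free.
            if (!state.getAndSet(true)) {
                return;
            }
        }
    }

    public void unlock() {
        state.set(false);
    }
}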

Multiple lock-based implementations for embedded multi-processor architectures exist in the literature ([29–36]). Many lock-free alternatives have been proposed as well ([10], [37–40]). But while most of these works target lightweight solutions for improving performance, only a few have focused on energy-efficiency ([10], [29], [32], [36], [38], [39]). Next we investigate lock-free speculative mechanisms focusing not only on improving performance but also on reducing energy consumption.

While locks can guarantee memory consistency in shared memory programs and are easy to use, they can slow performance and waste energy since they rely on time- and energy-expensive read- modify-write operations that traverse the memory hierarchy. Speculative execution is an attractive alternative choice that can improve performance and save energy. A speculative execution in a multiprocessor system is one in which a processor executes a block of code in the presence of potential conflicts with concurrent computations. After the execution is complete, the process checks for conflicts and if none are found, the block commits. Otherwise, the speculative execution is rolled back. Speculation is attractive in situations where conflict avoidance is expensive, but actual conflicts are rare. Unlike locks that use time- and energy-expensive read-modify-write operations, speculative computations are typically restricted to the more energy-efficient L1 (or sometimes L2) caches, relying on native cache-coherence mechanisms to detect conflicts.

The concept of speculation is employed in other areas, such as branch prediction in modern pipelined processors to predict the outcome of branch instructions based on the history of branch executions. The branch predicted as the most likely is fetched and speculatively executed. If later the decision is proven wrong, the speculative execution is discarded and the pipeline is restarted with the correct branch.

While speculative synchronization mechanisms have been the subject of prior work, those mech- anisms have focused more on performance, i.e. completing a task in a minimal amount of time.

In this thesis, we realize that performance is still important, but energy consumption should be minimized in the process, since energy is of central importance to embedded systems. In the follow- ing sections (2.2.1 - 2.2.3), some of the most well-known speculative synchronization mechanisms that this thesis builds upon are presented: Transactional Memory, Speculative Lock Elision and 12

Transactional Lock Removal.

2.2.1 Transactional Memory

Transactional memory (TM)[5] is a mechanism for synchronizing concurrent threads, analogous to database transactions. It simplifies concurrent programming by allowing a group of instructions to execute atomically as one single transaction. A transaction, is a finite sequence of machine instructions executed by a single thread, satisfying the properties of serializability and atomicity.

Serializability, means that transactions must appear to execute sequentially, in one-at-a-time order

(the order is not guaranteed though). The steps of a transaction should never interleave with the steps of another transaction. Atomicity, means that if the transaction commits, its speculative changes must become visible to other threads instantaneously as a single atomic operation. If a conflict is detected, none of the speculative changes takes effect (all or nothing). TM guarantees that all transactions run in isolation, i.e., no other thread can see the speculative changes before they are committed.

Transactions are implemented as follows. Critical sections are enclosed within transactions and are speculatively executed, making tentative changes to objects. If a transaction completes without encountering a synchronization conflict, then it commits (i.e., the tentative changes become perma- nent), otherwise it aborts (i.e., the tentative changes are discarded) and the transaction is restarted.

To satisfy the properties of serializability and atomicity, until a transaction commits its effects should not be visible outside the transaction itself. For that purpose, tentative memory updates should be kept in specific locations (depending on the implementation that could be a thread-local cache, or a software data structure). If the transaction commits, tentative changes can be written back to memory and become permanent. Data conflicts with other transactions are detected by the TM system tracking memory accesses in hardware or software structures. For Hardware Transactional

Memory, conflict detection is typically done by extensions to the native cache coherence protocol.

When a conflict is detected, at least one of the conflicting transactions must abort. An aborted 13 transaction must then discard its tentative changes. To facilitate this abort process, transactional memory requires checkpointing and rollback mechanisms to allow transactions to re-start and re- execute. The checkpointing mechanism is in charge of saving the internal state of the core when a transaction starts on that core (i.e., the core’s , stack pointer, internal registers, stack contents) in order to enable roll-back if the transaction aborts. The roll-back mechanism restores the internal core state so that the transaction can be re-started.

Transactional memory has some major advantages.

1. It is relatively easy to program and use, by enclosing critical sections within transactions.

2. It increases performance, since it provides a higher level of concurrency than locks (at least equivalent to fine-grained locks) and eliminates the overhead of lock acquisition.

3. As shown by research [41], TM reduces energy consumption compared to locks, since it reduces contention and eliminates busy-waiting.

4. TM eliminates all possible sources of race conditions, since it guarantees that all transactions run in isolation. TM also avoids deadlocks.

Transactional Memory has been adopted in current designs. Sun’s Rock [42] is an example of a multicore processor implementation that supports some form of best-effort Hardware Transactional Memory. Intel [8] and IBM [9] announced new processors with direct hardware support for speculative transactions and it seems likely that others will follow suit.

Sections 2.2.1.1-2.2.1.3 present the basic components of TM and describe the choices that must be made in TM design.

2.2.1.1 Transactional Memory environments

Transactional memory can be implemented in software, in hardware and through a hybrid approach.

Herlihy and Moss [5] first introduced Hardware Transactional Memory (HTM) as a new mul- tiprocessor architecture intended to make lock-free synchronization as efficient and easy to use as 14 conventional techniques based on mutual exclusion. The basic idea behind this design is that any pro- tocol capable of detecting accessibility conflicts can also detect transaction conflicts at no extra cost, so TM can be implemented by straightforward extensions to any multi-processor cache-coherence protocol. Each processor maintains two primary caches: a regular cache for non-transactional op- erations and a transactional cache for transactional operations. The contents of these caches are exclusive, meaning that an entry can reside in either one or the other, but not both. The transac- tional cache is a small, fully-associative cache with additional logic to facilitate commit and abort.

This cache holds all the speculative data changes without propagating them to the processors or the main memory unless the transaction commits. If the transaction aborts, the cache lines that hold speculative data are invalidated. If it commits, the speculative changes are committed and the lines may be snooped by other processors, written back to memory upon replacement etc. according to the native cache coherence protocol.

To implement the necessary transactional extensions to the cache coherence protocol, each trans- actional cache line is augmented with separate transactional tags indicating a transactional state, in addition to the existing coherence states (INVALID, VALID, DIRTY and RESERVED). These transactional states are: EMPTY (i.e., no data are contained), NORMAL (i.e., the data contained are committed) and XCOMMIT (i.e., the data contained must be discarded upon commit) and

XABORT (i.e., the data contained must be discarded upon abort). Then, transactions put two entries in the cache, one with tag XCOMMIT and one with XABORT. All speculative changes are made to the XABORT entry and when the transaction commits, the entries that are marked

XCOMMIT are set to EMPTY and the entries that are marked XABORT are set to NORMAL.

Upon abort, entries marked as XABORT are set to EMPTY and entries marked as XCOMMIT to

NORMAL.

Figure 2.2 shows an example of how transactional events would be handled for an HTM imple- mentation on top of the MESI cache coherence protocol (as described in [1]). This implementation includes a dedicated Transactional Cache (TC) that holds transactional data and a L1 cache that 15

Figure 2.2: Example of transactional events handling (based on the implementation proposed in [1]). holds non-transactional data. The contents of TC and L1 are mutually exclusive. When a cache line is accessed by a transaction, the cache line in the L1 cache is invalidated and two new copies of the line are created in the TC (as shown in Figure 2.2). One copy stores a backup of the data

(marked as T COMMIT, equivalent to XCOMMIT) and the other contains the speculative changes of the transaction (marked as T ABORT, equivalent to XABORT). The cache line marked with

T ABORT can be modified multiple times within a single transaction (subsequently marking the line as Modified). If during transaction execution no data conflict occurs (i.e., no other core tries to read or write that line), the transaction commits successfully. The transactional cache invalidates the backup copy and keeps the modified copy. The T ABORT bits are reset so now the modified cache line is visible to the rest of the system and can be shared, modified or invalidated as the rest of the data. If during transaction execution, a data conflict occurs (i.e., another core tries to read or write that line) the execution is halted and the copy marked as T ABORT is invalidated. The backup copy is restored and the T COMMIT bits are reset.

Building on the hardware based transactional synchronization methodology proposed by Herlihy and Moss, Nir Shavit and Dan Touitou introduced Software Transactional Memory (STM) [43], a novel design that supports flexible transactional programming of synchronization operations in 16 software. In this design, transactions acquire exclusive ownership of the accessed locations and track the original memory values using software data structures and Load-Linked/Store-Conditional operations. If a transaction succeeds on acquiring the ownership of a memory location, it updates the memory values and then releases the ownership. If it fails, it changes its status to failure and retries later. The decision on which transactions fail and which succeed is made in software, allowing

flexible and fair policies. While HTM systems use data caches to buffer the speculative data changes of transactions, a STM implementation must provide its own mechanism for concurrent transactions to maintain their own views of the heap. Such a mechanism allows a transaction to see its own write set as it runs and allows memory updates to be discarded if the transaction ultimately aborts [44].

2.2.1.2 Choosing between HTM and STM

Multiple TM implementations have been proposed in literature, both in hardware (e.g., [6], [7],

[45–49]), software (e.g., [50–53]) and a combination of both (hybrid solution) (e.g., [54–57]). HTM designs mainly use caches to keep track of the transactional state. However, the limited size of on-chip caches bounds the size of the transactions that can fit inside. This adds a burden to the programmer, to create small enough transactions that can fit the on-chip caches. Usually, transactions are small.

Hammond et al. [6] report that 90% of transactions fit in less than 8 Kbytes for most applications they ran and the rest fit in 64 Kbytes [44]. However, some applications have a very large transaction footprint[44]. Moreover, the design of transactions to fit certain hardware resources reduces the portability of some applications. Some works have proposed unbounded TM systems ([45–47]) that allow transactions of any size and duration to survive context switches, page faults and overflows of resources. While some of these proposals may be attractive for general-purpose systems, they are too complex for today’s embedded systems. Ferri et al. [10] have proposed a design suitable for embedded systems that overcomes this problem without virtualization. This design consists of a single L1 cache structure with limited associativity for storing both transactional and non- transactional data and a small, fully-associative victim cache that handles overflowed and evicted 17 transactional blocks and is powered down when not in use.

In STM on the other hand, the transactional state is tracked in software. As a result STM does not suffer from the resource overflowing problem of HTM. STM can handle transactions of any size and duration and it can survive context switching and page faults. Moreover, because the trans- actional state is not tracked in hardware, STM increases the portability of transactional programs.

Moreover, STM offers more flexibility since it allows the implementation of complex algorithms that would be impractical to implement in hardware. However, STM suffers from two significant issues.

First, even though it has been improved, it is still much slower than hardware approaches [58], because all accesses to the Read and Write sets of the transactions are done atomically in software.

Having a software barrier for bookkeeping operations increases the access latency. The second major problem of STM is that it is not energy efficient. In [59], the authors provide an analysis of the energy costs of a typical STM system. Klein et al. have shown that a state-of-the-art implementa- tion consumes in average more than twice the energy of a locking counterpart [60]. For embedded systems, the need for energy-efficiency, simplicity and good performance makes STM unattractive.

For these reasons, for embedded systems, a hardware-based TM design is more appropriate.

2.2.1.3 Transactional Memory design

Regardless of the specific implementation (hardware, software or hybrid), every Transactional Mem- ory designer has to make some key decisions on the following aspects:

- Conflict Detection, i.e., When and how should a conflict be detected?

- Conflict Resolution, i.e., When and how should a conflict be resolved?

- Data Versioning, i.e., Where should original and speculative data changes be stored?

Regarding Conflict Detection and Conflict Resolution, there are two main policies: eager and lazy conflict detection or resolution. With eager conflict detection, conflicts are detected when they occur (i.e., at the time of the data access). There are several examples of works that use eager conflict 18 detection (e.g., [7], [47] and [49]. The potential problem of this approach is that after a conflict the restarted transactions may abort committing transactions. Consequent conflicts can hurt progress.

On the other hand, in a lazy conflict detection scheme, conflict detection is performed at commit time.

Potential existing conflicts with other transactions are detected only when a transaction attempts to commit. This scheme does not have the progress guarantee problem of the eager detection scheme but it faces a different problem. Since transactions are fully executed until commit time and only then conflicts are detected, this scheme has the drawback of wasted cycles compared to eager conflict detection, since some transactions will continue executing after the conflict actually occurs, only to be aborted when they attempt to commit. The execution of this extra useless work wastes time and power resources.

Similarly to conflict detection, conflict resolution can happen eagerly or lazily. For eager conflict resolution, the decision of which transaction to abort is made immediately when the conflict is detected. In lazy conflict resolution the decision is postponed until commit time. It is obvious that lazy conflict detection cannot co-exist with eager conflict resolution. Lazy conflict resolution has a throughput advantage [61] but the drawback of additional complexity. Hence it can improve the performance of high-conflict workloads. Previous works used eager conflict detection through the cache coherence protocol and lazy resolution via special hardware and software structures (e.g., [48],

[62]). Overall, eager detection/recovery is easier to implement in hardware through standard cache coherence protocols, but tends to favor short transactions over longer ones.

Another important aspect of conflict resolution is the abort policy in deciding which transactions should be aborted upon a conflict. The requestor-abort policy aborts the transaction that requested the data access that caused the conflict. The rationale behind this is that, since all transactions have made some progress before the requestor caused the conflict, the requestor should be the one to abort so that the other transactions can continue to make progress. Another option is to let the requestor proceed and abort all other conflicting cores (i.e., the requestor-wins policy). This choice is more natural to the way cache coherence works. An alternative choice would be to abort all the 19 transactions that conflicted and let them retry speculation again.

Regardless of the chosen abort policy, it is critical that, after a conflict not all transactions are allowed to retry at the same time since this would inevitably result in consecutive aborts. Instead, transactions should delay retrying (or “backoff”) for different randomly chosen times. Many works have used exponential-backoff strategies that increase the backoff time exponentially based on the number of consecutive aborts experienced by each transaction.
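A minimal sketch of such a randomized exponential backoff is shown below (illustrative only; the base delay, the cap and the busy-wait loop are arbitrary choices rather than values used by any design in this thesis).

import java.util.concurrent.ThreadLocalRandom;

// Illustrative randomized exponential backoff: the delay window doubles with
// every consecutive abort, and each transaction waits a random time inside
// that window before retrying, so conflicting transactions spread out.
class Backoff {
    private static final long BASE_DELAY_NS = 1_000;       // arbitrary base
    private static final long MAX_DELAY_NS  = 1_000_000;   // arbitrary cap

    static void backoff(int consecutiveAborts) {
        long window = Math.min(MAX_DELAY_NS,
                               BASE_DELAY_NS << Math.min(consecutiveAborts, 20));
        long delay = ThreadLocalRandom.current().nextLong(1, window + 1);
        long deadline = System.nanoTime() + delay;
        while (System.nanoTime() < deadline) {
            Thread.onSpinWait();   // busy-wait; a real system might sleep instead
        }
    }
}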

As mentioned in Section 2.2.1, Transactional memory requires a means for storing modifications to speculative data while simultaneously keeping copies of the original data, to be able to restore original values in case of conflict. Data Versioning mechanisms determine how this is done. Eager data versioning stores and modifies the speculative data in-place and keeps original data values elsewhere. In this case, the speculative memory system must guarantee a rollback mechanism, usually implemented by means of log structures, to restore the original contents of the memory.

This technique has been used in [45], [7] and variants such as [47] and [49]. Keeping speculative data in-place makes commits faster. Since data are updated in-place, no data broadcast is required upon commit. However, it has two drawbacks. First, upon data writes, an extra overhead has to be paid for the original data to be saved into the log. Second, recovery time during aborts is increased, since a complex roll-back mechanism has to be followed in order to read the logged values and restore them, while other transactions are stalled. Hence, an eager versioning scheme should be avoided when high contention is experienced.
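To illustrate the idea, the following simplified software model of eager versioning (not the hardware mechanism of any cited design) updates values in place while saving the old values to an undo log: commit just discards the log, while abort replays it in reverse.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Simplified software model of eager data versioning: writes go to the
// "memory" in place, and the old value is appended to a per-transaction
// undo log so it can be restored on abort.
class EagerVersioning {
    private final Map<Integer, Integer> memory = new HashMap<>();

    static final class UndoEntry {
        final int address;
        final Integer oldValue;   // null means the address did not exist before
        UndoEntry(int address, Integer oldValue) {
            this.address = address;
            this.oldValue = oldValue;
        }
    }

    private final Deque<UndoEntry> undoLog = new ArrayDeque<>();

    void transactionalWrite(int address, int value) {
        undoLog.push(new UndoEntry(address, memory.get(address))); // save old value first
        memory.put(address, value);                                // then update in place
    }

    void commit() {
        undoLog.clear();          // fast: data are already in place
    }

    void abort() {
        while (!undoLog.isEmpty()) {          // slower: replay the log in reverse
            UndoEntry e = undoLog.pop();
            if (e.oldValue == null) {
                memory.remove(e.address);
            } else {
                memory.put(e.address, e.oldValue);
            }
        }
    }
}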

In contrast, lazy data versioning leaves old copies of transactional data in-place and creates a copy for speculative modifications in other memory locations or transactional buffers. Examples of designs that have used this policy are [5], [6], [45] and [46]. These designs mostly use their caches to store the speculative data and in some cases extra buffers or software structures to handle overflows.

Keeping the original data in their initial location makes the abort scenario very fast, but has the disadvantage of increasing the transaction execution time since extra time is necessary at commit to write the speculative data back to memory. Some of these lazy versioning schemes ([5], [10], [45], [46]) can also efficiently handle commits, by using the cache coherence protocol to keep data consistent at the end of the transaction.

Figure 2.3: Classification of TM designs.

Figure 2.3 shows the classification of various TM implementations based on the key design components discussed above. The speculative execution designs presented in this thesis, are primarily focused on eager conflict detection schemes, offering either lazy or eager data versioning depending on the goals and needs of each specific design.
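For comparison with the eager scheme, lazy versioning can be modeled (again purely as an illustration) with a per-transaction write buffer: old data stay in place, reads inside the transaction consult the buffer first, abort simply drops the buffer, and commit pays the cost of writing the buffer back to memory.

import java.util.HashMap;
import java.util.Map;

// Simplified software model of lazy data versioning: speculative writes go to
// a per-transaction write buffer, so old data stay in place until commit.
class LazyVersioning {
    private final Map<Integer, Integer> memory = new HashMap<>();
    private final Map<Integer, Integer> writeBuffer = new HashMap<>();

    void transactionalWrite(int address, int value) {
        writeBuffer.put(address, value);        // memory itself is untouched
    }

    int transactionalRead(int address) {
        // A transaction must see its own speculative writes first.
        Integer buffered = writeBuffer.get(address);
        return (buffered != null) ? buffered : memory.getOrDefault(address, 0);
    }

    void commit() {
        memory.putAll(writeBuffer);             // slower: copy the buffer back
        writeBuffer.clear();
    }

    void abort() {
        writeBuffer.clear();                    // fast: just drop the buffer
    }
}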

2.2.2 Speculative Lock Elision

Besides Transactional Memory, another researched speculative execution mechanism, is Speculative

Lock Elision (SLE). SLE was introduced in [14] by Rajwar and Goodman, as a micro-architectural hardware speculation technique that dynamically removes unnecessary lock-induced serialization and enables highly concurrent multithreaded execution. In SLE, a processor speculates (perhaps based on annotations) that a block delimited by an atomic read-modify-write operation (such as a test-and-set) and a subsequent store to the same location is a critical section protected by a lock.

The processor then executes that block speculatively, buffering updates. If speculation succeeds, the updates are committed; otherwise they are discarded and the critical section is retried non-speculatively, by actually acquiring and releasing the lock.

SLE can be implemented entirely in the microarchitecture and requires only trivial hardware support. It does not require instruction set changes, coherence protocol extensions or programmer support and, very importantly, it is transparent to programmers. Existing synchronization instructions are identified dynamically. Programmers do not have to learn a new programming methodology and can continue to use well-understood synchronization routines. As a result, legacy (even binary) code can run speculatively without modification.

Even though lock elision was proposed years ago, its main idea entered the mainstream only recently via Intel's Haswell [8], a new processor microarchitecture with direct hardware support for speculative transactions. Using special constructs in software, programmers can specify regions of the code for either transactional memory or speculative lock elision. Haswell is the first Intel processor to feature hardware transactional memory, through its Transactional Synchronization Extensions (TSX). Intel's TSX specification describes how Transactional Memory is exposed to programmers, but the details of the actual TM implementation are not made public. The TSX specification provides two interfaces to programmers: the first is Restricted Transactional Memory (RTM), which is similar to standard TM proposals; the second is Hardware Lock Elision (HLE), whose functionality is very close to the initial SLE proposal of [14]. Both utilize new instructions to take advantage of the underlying TM hardware.

Although the exact details of the lock elision implementation in Haswell have not been released, best-effort speculation on its implementation has been discussed ([63], [64]). HLE uses two new instruction hint prefixes (XACQUIRE and XRELEASE) to denote the region in the code where lock elision can be applied. When a lock acquisition is encountered in the code, the XACQUIRE prefix is inserted to indicate the start of the lock elision region; the lock instruction is added to the read set of a transaction, but the lock is not acquired (i.e., the thread does not write new data to the lock address). This means that other threads can also enter the lock elision region simultaneously and transactionally access shared data. Writing to the lock address during execution of the HLE region will cause an abort. Reads and writes to shared memory that happen within the lock elision region are added to the read and write sets of the corresponding transaction. When the XRELEASE prefix is encountered, the end of the lock elision region has been reached and the transaction attempts to commit. In the event of a conflict, the core restores the internal register state that was saved prior to XACQUIRE and discards any writes to shared memory that happened within the HLE region. The thread will then retry the HLE region, but this time by acquiring the lock normally. This means that once aborted, no speculation retries are allowed right away. Moreover, there is a limit on the number of simultaneous elisions; if this limit is exceeded, additional regions will be executed through standard locking.
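As a concrete (if platform-specific) illustration of the HLE interface, the sketch below elides a simple spin lock using GCC's HLE-annotated atomic builtins, which emit the XACQUIRE/XRELEASE prefixes on TSX-capable x86 processors when compiled with -mhle; it is unrelated to the embedded platforms considered in this thesis, and the lock variable and function names are illustrative.

```c
#include <immintrin.h>   /* for _mm_pause() */

static int lock_var = 0;   /* 0 = free, 1 = held */

void hle_lock(void)
{
    /* XACQUIRE-prefixed exchange: the write to the lock may be elided and the
     * lock address is tracked transactionally instead. */
    while (__atomic_exchange_n(&lock_var, 1,
                               __ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE)) {
        /* Elision failed or the lock is genuinely held: wait politely. */
        while (lock_var)
            _mm_pause();
    }
}

void hle_unlock(void)
{
    /* XRELEASE-prefixed store: ends the elided region and tries to commit. */
    __atomic_store_n(&lock_var, 0, __ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE);
}
```

If the elided region aborts, the hardware transparently re-executes it while actually acquiring the lock, which matches the single-failover behavior described above.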

Pohlack and Diestelhorst [65] evaluated the results of applying lock elision to the Memcached caching system and found that it has great potential for improving throughput. Their lock elision implementation was based on AMD's Advanced Synchronization Facility (ASF) [66], a speculative synchronization architecture similar to transactional memory.

Noting the adoption of lock elision in recent and future processor designs and its potential, this thesis presents a new architecture, Embedded-Spec, that extends the capabilities of these existing techniques by introducing an extra degree of flexibility and with energy efficiency as an additional primary criterion (Chapter 3).

2.2.3 Transactional Lock Removal

Rajwar and Goodman [67] later proposed another transparent speculative synchronization mechanism, called Transactional Lock Removal (TLR). Here, conflicts are resolved using timestamps. When a conflict occurs, the conflicting core with the oldest running transaction wins and proceeds with its transaction, while the others are rolled back and suspended until the winning core commits. At that point, the suspended cores resume and re-execute the critical section speculatively. This way there is no need to transition from speculative to non-speculative execution, improving performance while maintaining transparency.
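A minimal sketch of timestamp-based arbitration in the spirit of TLR is shown below; the data structures and the notion of a global transaction-start clock are illustrative assumptions, not details of [67].

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative timestamp arbitration: on a conflict, the core whose
 * transaction started earliest wins; everyone else rolls back and is
 * suspended until the winner's commit. */
typedef struct {
    uint64_t start_time;   /* taken from a global/logical clock at tx begin */
    bool     suspended;
} tx_state_t;

/* Returns the id of the winning core among the conflicting ones. */
int resolve_conflict(tx_state_t tx[], const int conflicting[], int n)
{
    int winner = conflicting[0];
    for (int i = 1; i < n; i++)
        if (tx[conflicting[i]].start_time < tx[winner].start_time)
            winner = conflicting[i];

    for (int i = 0; i < n; i++)
        if (conflicting[i] != winner)
            tx[conflicting[i]].suspended = true;   /* rolled back and held */
    return winner;
}
```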

Embedded-Spec goes beyond TLR, exploring alternative conflict resolution policies and showing that, for some applications, a timestamp-based priority is not the best choice.

2.2.4 Speculation for Embedded Systems

While hardware and software speculation has been extensively studied for the general-purpose computing domain, it has received less attention in the embedded domain [3, 10–13]. Even though improving throughput remains important for embedded systems, any practical design for such systems must emphasize energy efficiency and low complexity. This means that complex hardware designs that require extensive changes to established protocols, such as cache coherence protocols, will likely be too costly to adopt. Similarly, transactional memory must be easy to use, either by requiring little or no change to 'legacy code' or by being integrated into practical and familiar programming environments.

Ferri et al. proposed Embedded-TM, an energy-efficient HTM design suitable for embedded systems [10]. As mentioned in Section 2.2.1.2, that design consists of a single L1 cache structure with limited associativity for storing both transactional and non-transactional data, plus a small, fully-associative victim cache. A lazy conflict resolution scheme, though complex to implement, improved the performance of high-conflict workloads, while an eager scheme was a better fit for low-conflict workloads.

In a later paper by some of the same authors, an integrated hardware-software transactional memory design for embedded systems was proposed [3]. This scheme includes a hardware transactional memory (HTM) architecture with a dedicated hardware module, the Bloom Module, which handles conflict management and is programmed through low-level primitives. The Bloom Module manages a centralized collection of Bloom filters. It is in charge of snooping transactional data traffic on the bus and detecting conflicts that arise during transactions. The Bloom Module is a departure from much prior work in that it decouples the transactional memory system from the cache coherence hardware. Using a single hardware device to snoop the shared bus reduces design complexity and enhances portability, since this design does not change the CPU hardware.

While existing works on speculative execution for embedded systems offer energy-efficient and low-complexity solutions, there are some issues that still need to be addressed. We describe these limitations next.

a) Transparency

While the described designs for speculative execution on embedded systems are energy-efficient and simple, they are not transparent to the programmer, who must program using special transactional instructions that enable speculation. Transparency is attractive because it means that the energy efficiency of legacy code can be improved, including code whose structure may be poorly understood. Moreover, any speculative execution mechanism can be applied to code written by non-specialist programmers. Finally, transparency implies that the same code will work correctly, if less efficiently, on platforms that do not support speculation. While it might not be reasonable to ask programmers to restructure code to exploit speculation, it is reasonable to allow programmers to annotate code (for example, "this block is a good candidate for speculation"), perhaps with the help of a profiler or static analyzer.

In the first part of this thesis (Chapter 3), new methods are explored for applying speculation to legacy code without requiring special supporting instructions in software. The Bloom Module hardware introduced in [3] is utilized for that purpose, to support data conflict detection and resolution without altering the cache coherence protocol. The proposed architecture (Embedded-Spec) is based on this hardware and is completely transparent to the programmer.

Apart from offering transparency, Embedded-Spec also goes beyond other existing transparent speculative mechanisms such as SLE, TLR and the Haswell HLE design in several ways. In Haswell and the original SLE proposal, a failed speculation immediately restarts non-speculatively; there are no alternative failover mechanisms or policies. Embedded-Spec offers flexible contention management (conflict resolution) alternatives, including alternatives to TLR's timestamps. Moreover, [14] and [65], like most work in this area, are concerned with general-purpose platforms, not with embedded systems. Hence, they target improved throughput, not energy efficiency. Since Embedded-Spec targets embedded systems, the energy-delay product (EDP) becomes the principal figure of merit.

b) Scalability and Cache-Coherence

As the demand for more compute-intensive capabilities in embedded systems increases, multi-core embedded systems are evolving into many-core systems in order to achieve improved performance and energy efficiency, similar to what has happened in the high-performance computing (HPC) domain. The memory is then shared across these multiple cores; however, the specific memory organization of these many-core systems has a significant impact on their potential performance. For small-scale systems, shared-bus architectures are attractive for their simplicity, but they do not scale. For large-scale systems, architects have embraced hierarchical structures, where processing elements are grouped into clusters, interconnected by a scalable medium such as a network-on-chip (NoC). In such systems [15–17], cores within a cluster communicate efficiently through common on-chip memory structures, while cores in different clusters communicate less efficiently through bandwidth-limited, higher-latency links. Architectures that provide the programmer with a shared-memory abstraction, but where memory references across clusters are significantly slower than references within a cluster, are commonly referred to as non-uniform memory access (NUMA) architectures.

When designing speculative synchronization mechanisms for embedded devices, it is essential to keep both the underlying hardware and the software interface simple and scalable. Of the works mentioned above on speculative execution for embedded systems, some considered only shared-bus, single-cluster architectures [3, 10]. While popular for their simplicity, such bus-based architectures are inherently not scalable, because the bus becomes overloaded when shared by more than a handful of processors. As a result, it is necessary to rethink the design of speculative mechanisms for scalable, cluster-based embedded systems where inter-cluster communication is restricted.

Some designs exist for hardware transactional memory based on network-on-chip (NoC) communication. The IBM transactional memory mechanism [9] is intended for a clustered architecture. Kunz et al. [13] have proposed a LogTM [7] implementation of HTM on a NoC architecture, and Meunier and Petrot [12] have described a novel embedded HTM implementation based on a write-through cache coherence policy. While [13] and [12] propose speculation mechanisms for many-core embedded NoC systems, they build on top of an underlying cache coherence protocol and rely on it for detecting read/write data conflicts. Based on the belief that cache coherence will become more and more unwieldy as cluster sizes grow, the focus of this thesis shifts towards designing speculative synchronization mechanisms that do not rest on inherently unscalable foundations.

For high-end many-core cluster-based embedded systems that are subject to NUMA costs, memory organization is the single most far-reaching design decision, both in terms of raw performance and (more importantly) in terms of programmer productivity. In order to meet stringent area and power constraints, the cores and memory hierarchy must be kept simple. In particular, scratchpad memories (SPMs) are typically preferred to hardware-managed data caches, which are far more area- (40%) and power-hungry (34%) [1]. Several many-core embedded systems have been designed without caches and cache coherence (the Epiphany IV processor from Adapteva [19], STMicroelectronics p2012/STHORM [18] and the Plurality Hypercore Architecture Line (HAL) [68] are some examples). These kinds of platforms are becoming increasingly common.

Implementing speculative synchronization in embedded systems that lack cache-coherence support is particularly challenging, since hardware speculative techniques traditionally rely on the underlying cache-coherence protocol to synchronize memory accesses among the cores and manage read/write data conflicts. For these cacheless systems, a completely new approach is necessary for handling speculative synchronization. The lack of cache coherence brings major challenges in the design of HTM support, which needs to be designed from scratch. At the same time, though, the lack of cache coherence provides a simpler environment. Building on such an environment, a self-contained HTM design that does not rely on an underlying cache coherence protocol can be created from scratch, resulting in a more lightweight solution compared to existing ones. The second part of this thesis (Chapter 4) presents the first implementation of such HTM support within a many-core embedded system that lacks an underlying cache-coherence protocol. As described later, this implementation requires explicit data management and implies a fully-custom design of the transactional memory support. While this implementation is limited to single-cluster accesses, the proposed scheme is designed to be scalable and can be extended to multiple clusters.

2.2.5 Error-resilient and energy-efficient execution on embedded systems

As discussed so far, in embedded systems design it is important to aim for energy efficiency, high performance and low complexity. While these factors remain very important, one crucial issue is becoming increasingly pressing: reliability.

Scaling of physical dimensions in semiconductor devices has opened the way to heterogeneous embedded SoCs integrating host processors and many-core accelerators in the same chip [69], but at the price of ever-increasing static and dynamic hardware variability [70]. Spatial die-to-die (D2D) and within-die (WID) static variations ultimately induce performance and power mismatches between the cores in a many-core array, introducing heterogeneity in a nominally homogeneous system (i.e., between formally identical processing resources). On the other hand, dynamic variations depend on the operating conditions of the chip, and include aging, supply voltage drops and temperature fluctuations. The most common consequence of variations is path delay uncertainty. Circuit designers typically use conservative guardbands on the operating frequency and/or voltage to ensure safe system operation. These guardbands lead to loss of operational efficiency, since they waste energy and limit performance. When the guardbands are reduced (either by increasing the frequency or decreasing the voltage), or when the system is aggressively operated far from a safe point, the delay uncertainty manifests itself either as intermittent timing errors [2], [21] or, even worse, as a critical operating point (COP) [20].

Figure 2.4: Percentage error rate versus supply voltage for intermittent timing errors and the Critical Operating Point.

Intermittent timing errors may ultimately cause erroneous instructions with wrong outputs being stored or, worse, incorrect control flow. The COP defines a voltage and frequency pair at which a core is error-free. If the voltage is decreased below (or the frequency is increased beyond) the COP, the core will face a massive number of errors [20]. Specifically, according to the 'COP Hypothesis', in large CMOS circuits there exists a critical operating frequency Fc and a critical voltage Vc for a fixed ambient temperature T such that:

• Any frequency above Fc causes massive errors.

• Any voltage below Vc causes massive errors.

• For any frequency below Fc and voltage above Vc, no process-related errors occur.

In practice, Fc and Vc are not single points, but are confined to an extremely narrow range for a given ambient temperature Tc. In principle, the COP could be determined for a particular chip after its production, and the most efficient yet safe voltage/frequency pair for the chip could be configured at that time. However, due to static and dynamic variations, the COP may actually change over space and time. As a result, the "safe" operating point may i) differ from one core to another (forcing the entire chip to be conservatively tuned to meet the requirements of the most critical core) and ii) suddenly become unsafe due to aging, temperature fluctuations or voltage drops. The COP effect is highly pronounced in well-optimized designs [71], [72]. According to the COP error model, once the critical voltage is crossed (for a given frequency level), massive errors emerge and the voltage needs to be immediately increased back to a safe level. Intermittent timing errors, however, follow a different trend (Figure 2.4): as the voltage is scaled down, they start emerging gradually at a very low error rate, which then increases exponentially as the voltage is scaled down further. There is thus a range of usable voltage levels before the point of massive instruction failures is reached.

Timing errors can also be classified into critical and non-critical errors. Non-critical errors are those that originate from timing delays along the datapath (e.g., the multiplier) of the processor pipeline and can ultimately result in incorrect data being stored in memory. On the other hand, critical errors are errors that take place in the control part of the processor pipeline (e.g., instruction fetch/decode). These types of errors can break the original control flow of the program and prevent any software-based solution from taking control. While non-critical errors can be detected and corrected in a "lazy" manner, critical errors need to be detected and corrected immediately, since they can lead to catastrophic failures. For some applications that can tolerate approximate computations, non-critical errors can be ignored, since it may be more cost-effective to proceed with inexact data instead of correcting them. For most applications, which need accurate results, timing errors need to be detected and corrected.

Noting the energy efficiency and throughput that transactional memory brings to data synchronization for embedded systems, this thesis explores an alternative use of transactional memory: as an energy-efficient recovery mechanism from all types of timing errors. The third and last part of this thesis proposes an integrated HW/SW scheme that is based on HTM and addresses both intermittent timing errors and the COP. This scheme enables the system to operate at highly reduced supply voltage margins in order to save energy. The following section (Section 2.2.5.1) presents existing work on error detection and correction and discusses how this thesis contributes to this field.
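To make the idea concrete, the following conceptual sketch shows how HTM-style checkpoint/rollback could wrap a block of work while the supply voltage is scaled; every function and constant in it is a hypothetical placeholder and it is not the interface actually proposed in Chapter 5.

```c
#include <stdbool.h>

/* Conceptual sketch only. tx_begin(), tx_commit_ok(), errors_detected(),
 * run_block() and set_voltage() are hypothetical hooks. */
extern void tx_begin(void);          /* checkpoint register/memory state      */
extern bool tx_commit_ok(void);      /* try to commit; false if aborted       */
extern bool errors_detected(void);   /* HW error-detection flag for the block */
extern void run_block(void);         /* the work to execute speculatively     */
extern void set_voltage(int mv);

#define V_SAFE_MV 1100               /* conservative, guardbanded level (assumed) */
#define V_MIN_MV   800
#define V_STEP_MV   25

static int v_mv = V_SAFE_MV;         /* current operating voltage (persists) */

void resilient_run_block(void)
{
    for (;;) {
        set_voltage(v_mv);
        tx_begin();                  /* lightweight checkpoint               */
        run_block();
        if (!errors_detected() && tx_commit_ok()) {
            if (v_mv > V_MIN_MV)     /* error-free: probe a lower voltage    */
                v_mv -= V_STEP_MV;
            return;
        }
        /* Errors or an abort: the speculative state is discarded; retry the
         * same block at a higher (safer) voltage to guarantee progress. */
        v_mv += V_STEP_MV;
        if (v_mv > V_SAFE_MV)
            v_mv = V_SAFE_MV;
    }
}
```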

Figure 2.5: Pipeline augmented with Razor latches and control lines (taken from [2]).

2.2.5.1 Existing work on error detection and correction

Many circuit-level error detection and correction (EDAC) techniques that continuously monitor path delay variations [2, 21] have been proposed. When an error is detected, a recovery technique is enabled that prevents the erroneous instruction from corrupting the architectural state. While these techniques can ensure correct system behavior, they impose substantial error-recovery costs both in terms of energy and execution time. For example, recovery techniques such as instruction replay at half clock frequency or multiple-issue instruction replay are used to correct an errant instruction. Multiple-issue instruction replay incurs the cost of flushing the pipeline and executing N+1 replicas of the instruction (N being the number of pipeline stages). These substantial error-recovery costs make such solutions unsuitable for many-core chips operating at near-threshold voltage [73] to save power. In addition, while EDAC techniques can handle sporadic errors, they cannot deal with the "all-or-nothing" effect of the COP.

Razor [2] is one of the most popular approaches to dynamic voltage scaling (DVS) based on dynamic detection and correction of circuit timing errors. Razor uses delay-fault detection circuitry for detecting errors. Specifically, a Razor flip-flop is introduced in the processor's pipeline stages that double-samples the pipeline stage values, once with a fast clock and once with a time-borrowing delayed clock, as shown in Figure 2.5. The latch values sampled with the fast clock are then validated through a metastability-tolerant comparator. If the values differ, a timing error is detected and a modified pipeline mispeculation recovery mechanism is activated to restore the correct program state. Razor has been incorporated into several processor designs (e.g., [74–76]). Other works, such as [77], have also proposed Error Detection Sequential (EDS) circuitry for delay-fault detection. An alternative to EDS circuits are tunable replica circuits (TRCs) [78]. TRCs have the advantage of being completely separate from the processor pipeline and offer a less intrusive error detection technique that does not affect critical-path timing. Bowman et al. propose a 45 nm resilient design that includes both EDS- and TRC-based error detection mechanisms [21]. This design also supports the two error-recovery mechanisms mentioned previously, instruction replay at half clock frequency and multiple-issue instruction replay, enabling correction of timing errors from fast-changing variations such as high-frequency voltage droops. Other prior works have focused on sensor-based circuits, which are less appropriate for fast-changing variations [79–81]. These circuits can be used to determine when conditions may be right for reducing the voltage with a low risk of incurring failure.

Software techniques can be effective at providing energy-efficient robustness to errors by exposing variability at lower levels of the software stack. However, early approaches focus on coarse-grained tasks [82], [83], lack generality (as they call for custom programming methodologies) or are only suitable for a specific class of approximate-computing programs, in addition to imposing high recovery-cycle overhead. A more recent approach based on OpenMP extensions by Rahimi et al. [84] has shown good potential for reducing the recovery cost incurred by HW-based error-correction techniques. The approach proposed in this thesis (Chapter 5) has some key differences. First, it can deal both with sporadic timing errors (like [84]) and with systematic, COP-like error models. To the best of our knowledge, this approach is the first to combine SW and HW techniques for dealing with the COP. Second, the approach by Rahimi et al. [84] requires error detection and correction (multiple-issue instruction replay) in the HW, as the SW technique alone cannot guarantee complete reliability.

Other works have utilized transactional memory for error recovery. The IBM Power8 architecture [85] provides support for recovery-only transactions in hardware, but does not target energy savings.

The authors of [86] proposed transaction encoding, a software implementation that combines encoded processing for error detection and TM for error recovery. While this design uses TM for checkpointing and rollback as we do, it offers a pure software solution, uses encoded processing for error detection, and does not address energy efficiency. FaulTM-multi [87] is an HTM-based fault detection and recovery scheme for multi-threaded applications, with relatively low performance overhead and good error coverage. However, it does not target reduced energy consumption, which is central to the implementation proposed here. Yalcin et al. studied how combining different error detection mechanisms and TM could potentially improve energy efficiency, but they did not provide an actual implementation [88]. To the best of our knowledge, the mechanism proposed in this thesis (Chapter 5) is the first to provide a full-fledged HTM implementation for error-resilient execution that specifically targets energy savings.

Chapter 3

Energy-efficient and transparent speculation on embedded MPSoC

The transition of embedded systems towards multicore architectures promises an improvement in power-performance scalability. However, this promise can be realized only if applications can extract a high enough level of concurrency at low energy cost. Locks are typically used to guarantee memory consistency in shared-memory programs. However, locks can limit concurrency and therefore hurt performance. They can also be costly in terms of energy. By contrast, speculative execution approaches such as Transactional Memory, Speculative Lock Elision (SLE), and Transactional Lock Removal (TLR) [5], [89], [14], [67], which detect conflicts dynamically, promise both to improve performance and to save energy.

While speculative execution has been extensively studied for the general-purpose computing domain, it has attracted less attention in the embedded domain (e.g., [10], [3], [13], [12]). However, any practical design for embedded systems must emphasize transparency, low complexity and energy efficiency.


This chapter describes Embedded-Spec, an energy-efficient embedded architecture that supports transparent speculation ("lock elision") at the software level through an underlying hardware transactional memory. Embedded-Spec makes the following contributions. First, unlike most existing works on speculative synchronization, it focuses on energy efficiency as well as throughput, since both are key constraints for embedded systems. Specifically, energy-delay product (EDP) is evaluated as a figure of merit that captures the trade-off between these two properties. Second, since it targets embedded platforms that are highly resource-constrained, Embedded-Spec focuses on simplicity. It proposes the addition of simple hardware structures that avoid changes to the underlying cache coherence protocol but leave the flexibility to vary how synchronization conflicts are detected, how they are resolved (contention management) and which policy to use for switching between speculative and non-speculative execution. Last, Embedded-Spec offers a fully transparent solution for speculative execution of locks. This means programmers can take full advantage of the underlying speculative hardware support even when running code written using traditional locks.

The proposed architecture is presented and evaluated through a range of benchmarks written with standard locks, and a range of contention management and retry policies is explored. Experiments demonstrate that for resource-constrained platforms, lock speculation can provide real benefits in terms of improved concurrency and energy efficiency, as long as the underlying hardware support is carefully configured.

3.1 Embedded-Spec: Speculative Memory Design

As mentioned in Chapter 2, when designing speculative memory we need to make some key decisions on conflict detection, conflict resolution and data versioning. The original HTM design introduced by Herlihy and Moss [5], and later HTM works for embedded systems such as [10], utilized the cache coherence protocol to assist in managing data consistency and detecting conflicts. These works required extensive changes to the cache coherence protocol to guarantee conflict detection and resolution.


Figure 3.1: Logic for Transactional Management used in Embedded-Spec. The architectural con- figuration is taken from [3]. The dark blocks show the additional hardware required. That is, the Tx bit for each line of the data cache to indicate if the data is transactional, the Tx logic in the cache controller to handle transactional accesses, and the Bloom module to detect and resolve conflicts.

Later works such as [3] and [47] proposed solutions that decouple the HTM design from the cache coherence protocol. In particular, Ferri et al. [3] proposed an HTM design that requires only a few modifications to the cache system plus a dedicated hardware module, the Bloom Module, for conflict detection and resolution. The use of a separate external module for conflict detection and resolution alleviates the need for extensive changes to the cache coherence protocol, thus simplifying cache coherence logic and reducing the number of tag bits required in the caches. Moreover, multiple conflict recovery schemes can be implemented more easily, since the conflict management decisions are no longer made by each core individually but by a single separate module. Using the Bloom Module, we can enable dynamic selection of conflict resolution policies during execution, based on the characteristics of our applications and the observed abort rate. Motivated by the flexibility and low complexity offered by the Bloom Module design of [3], we adopt the same design for conflict detection and management in Embedded-Spec.

Fig. 3.1 shows the transactional management architecture with the use of the Bloom Module. The dark blue blocks indicate the three additional hardware components that are necessary for transactional management:

1. A new state bit (called the Tx bit) for each line of the data cache, which defines whether the data contained in the line is transactional or not.

2. New logic in the cache controller that handles the new transactional accesses.

3. The external Bloom Module.

By borrowing this design and using it for speculative execution in our different versions of Embedded-Spec, we keep the hardware design simple.

As implemented in [3], transactional events are triggered through regular read/write operations on memory-mapped registers. For example, a transaction is started by writing to a special register in the Bloom Module. When the cache controller detects that write, it sets an internal bit to enable the transactional logic. The transactional logic of the cache controller has to carry out some extra operations while a transaction is executing, as shown in Figure 3.2. Note that the controller has to handle two special cases.

• The first one occurs when a line that is accessed transactionally is already in the cache before starting the transaction. In this case, the data would not be retrieved from the L2 memory, hence the Bloom Module would not have the chance to snoop this access on the bus to include it in the transaction's read and write sets. The cache controller therefore carries the extra responsibility of issuing a bus access to notify the Bloom Module that the cache line is being accessed transactionally.

Figure 3.2: Modifications to the cache coherence protocol for transactional accesses. The gray block indicates the added operations. Note: The TX decision diamond denotes whether the Tx bit is already set or not.

• The second special case occurs when a line that is accessed transactionally is being replaced inside the cache. In this case, the cache controller will perform a transaction overflow so that the transaction is able to complete as an overflowing transaction, while the rest of the cores are stalled until it completes.

Note that these small additions do not significantly change the cache coherence protocol, and the transactional logic of the cache controller is responsible for handling them. Moreover, they do not affect the bus protocol itself. Modifications to this design specific to Embedded-Spec are discussed in Section 3.2.1.
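As a concrete illustration of the memory-mapped programming style described above, the sketch below wraps transaction start/end and abort polling in plain register accesses; the base address, register offsets and the convention of writing the core ID are hypothetical, not those of the actual Bloom Module.

```c
#include <stdint.h>

/* Illustrative only: transactional events are triggered by regular loads and
 * stores to memory-mapped registers. All addresses below are hypothetical. */
#define BLOOM_BASE        0x9000F000u
#define BLOOM_TX_START    (*(volatile uint32_t *)(BLOOM_BASE + 0x00))
#define BLOOM_TX_END      (*(volatile uint32_t *)(BLOOM_BASE + 0x04))
#define BLOOM_ABORT_FLAG  (*(volatile uint32_t *)(BLOOM_BASE + 0x08))

/* Starting a transaction is just a store; the cache controller snoops the
 * write and enables its transactional logic. */
static inline void tx_start(uint32_t core_id) { BLOOM_TX_START = core_id; }
static inline void tx_end(uint32_t core_id)   { BLOOM_TX_END   = core_id; }
static inline int  tx_aborted(void)           { return BLOOM_ABORT_FLAG != 0; }
```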

For Embedded-Spec we chose to adopt eager conflict detection and resolution (i.e., conflicts are detected and resolved immediately after they occur) and a requestor-wins abort policy (i.e., when a conflict occurs, the requestor core proceeds and all other conflicting cores are aborted), since these choices map most naturally onto the cache coherence protocol. We also use a lazy data versioning scheme that takes as its baseline the design proposed in [3]. The next section presents the basic components of the target architecture for Embedded-Spec.

3.2 Architecture

Embedded-Spec is based on the same target architecture as [3], illustrated in Figure 3.3. We use MPARM [90], a cycle-accurate multi-processor simulator developed for embedded system design space exploration, to model and simulate our architecture. It features a configurable number (up to 8) of RISC-like cores, interconnected through a shared bus (AMBA). Each core has private L1 instruction and data caches, kept coherent through a MESI coherence protocol by per-core snoop devices.

The shared memory is a two-level, partitioned global address space (PGAS) hierarchy. Specifically, MPARM simulates an architecture that encompasses distinct physical memory banks, globally visible throughout the system. Each core has a small L1 local scratchpad (SPM), accessible without traversing the system interconnect. Remote SPMs can also be accessed directly through the bus, but at the cost of higher latency. The overall L1 shared memory is the union of the SPMs, and it is globally non-coherent: its addresses are not cacheable, and it is explicitly managed by software. L2 shared memory physically consists of a single device, logically partitioned into a large shared segment, plus small "private" segments for each core. Addresses belonging to the logically shared chunk are cacheable and globally coherent. The private segments are also cacheable, but their addresses are not involved in coherence traffic.

Figure 3.3: Architecture overview, as proposed in [3].

Non-speculative synchronization is supported by a fixed set of architectural hardware locks drawn from a pre-allocated section of memory, the semaphore memory, and accessible by standard synchronization calls such as Test(), TestAndSet(), and Release().

3.2.1 The Bloom Module Hardware

The Bloom Module [3] is an external, signature-based hardware component that is in charge of conflict detection and resolution. It monitors all transactional accesses, records them as per-core signatures, and notifies the CPUs when data conflicts occur. As explained in more detail below, the Bloom Module used here departs from prior designs by also snooping on the semaphore memory.

To support Embedded-Spec, we extended the Bloom module's control logic as well as its individual Bloom filters to make it aware of the architecture-supported hardware locks.

Figure 3.4: (a) Overview of the Bloom Module. (b) Internal details of a core Bloom Filter Unit (BFU). Taken from [3].

Fig. 3.4 shows an overview of the Bloom Module and the internal details of a core Bloom Filter Unit (BFU). The Bloom module has the following functional blocks:

• Snooping Shared Memory Address: Snoops the shared memory address space to keep track of the addresses accessed during speculative execution.

• Bloom Filters: Per-core signatures corresponding to the read and write addresses accessed during speculative execution.

• Control Logic: Implements the features needed to manage communication with the cores (i.e., the abort and hold signals). It also manages the abort policies and handles cache overflow.

• Snooping Semaphore Memory Address: Snoops traffic to and from the hardware locks. It detects Test(), TestAndSet() and Release() calls and their responses.

• SLE Registers: Per-core registers to keep track of the core status (i.e., which core is in speculative mode on which hardware lock and which core has ownership of a specific lock). These registers are kept updated by the Snooping Semaphore Memory Address block.

• Hold Queue List Registers: Per-core registers that keep the list of aborted cores that need to be released at commit time. These registers are used only with Embedded-LR.

Communication between the cores and the Bloom Module is handled via interrupts and read/write memory operations (no extra wires are required). A small memory space (approximately 256 bytes) is reserved for programming registers, used to program specific functionalities of the Bloom Module at run time. For example, the commit priority is set using reads and writes to specific registers in this set. As seen in Fig. 3.4, each core has a Bloom Filter Unit (BFU) consisting of K read-write pairs of simple Bloom filters. Instead of setting multiple bits in one large filter, our design sets a single bit in each of K small Bloom filters. This parallel Bloom filter design limits the required hardware. Empirical experimentation in [3] showed that K=4 provides the best power/performance tradeoff. Moreover, the hash functions were designed with delay and power as the main criteria, and it was thus decided to implement the hash function as a single level of two-input XORing of lower-order address bits.1
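For concreteness, the following C sketch mimics the behavior of a per-core Bloom Filter Unit with K=4 parallel filters; the filter size and the exact hash wiring are illustrative assumptions rather than the parameters used in [3].

```c
#include <stdint.h>
#include <stdbool.h>

#define K            4        /* parallel filters per set, as in [3]    */
#define FILTER_BITS  512      /* illustrative size, not from the thesis */

/* One read set and one write set, each made of K small Bloom filters. */
typedef struct {
    uint64_t rd[K][FILTER_BITS / 64];
    uint64_t wr[K][FILTER_BITS / 64];
} bfu_t;

/* Rough stand-in for a single level of two-input XORs on low-order bits. */
static unsigned bfu_hash(uint32_t addr, int k)
{
    uint32_t low = (addr >> 2) ^ (addr >> (3 + k));   /* word address, skewed per filter */
    return low % FILTER_BITS;
}

static void set_bit(uint64_t *f, unsigned bit)       { f[bit / 64] |= 1ULL << (bit % 64); }
static bool get_bit(const uint64_t *f, unsigned bit) { return (f[bit / 64] >> (bit % 64)) & 1; }

/* Record a transactional access in this core's signature: one bit per filter. */
void bfu_insert(bfu_t *b, uint32_t addr, bool is_write)
{
    for (int k = 0; k < K; k++)
        set_bit(is_write ? b->wr[k] : b->rd[k], bfu_hash(addr, k));
}

/* A signature hit requires all K filters to agree (Bloom-filter semantics).
 * A remote write conflicts with our read or write set; a remote read
 * conflicts only with our write set. */
bool bfu_conflicts(const bfu_t *b, uint32_t addr, bool remote_is_write)
{
    bool rd_hit = true, wr_hit = true;
    for (int k = 0; k < K; k++) {
        unsigned bit = bfu_hash(addr, k);
        rd_hit &= get_bit(b->rd[k], bit);
        wr_hit &= get_bit(b->wr[k], bit);
    }
    return remote_is_write ? (rd_hit || wr_hit) : wr_hit;
}
```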

3.3 The Embedded-Spec Algorithms

This section describes the two variations of the Embedded-Spec architecture:

1. Embedded-LE (Embedded Transparent Lock Elision): The critical section is executed speculatively by eliding the lock. The Bloom module monitors memory accesses, and if there is a data conflict, it directs the conflicting cores to roll back their speculative executions and contend for the lock. One will succeed, and the rest will spin until the winner releases the lock. When the lock is released, the waiting cores retry their speculative executions. If the number of retries for a specific transaction (due to repeated conflicts) exceeds a threshold, the cores revert to non-speculative execution for that instance of the transaction. When the end of the critical section is successfully reached, the number of retries is reset to zero.

1 This was based on a previous finding that the lower-order bits of an address are characterized by more randomness than the higher-order bits.

2. Embedded-LR (Embedded Transparent Lock Removal): As with Embedded-LE, the critical section is executed speculatively by eliding the lock, but in case of a data conflict, the Bloom module directs all conflicting cores but one (the winning core) to roll back and suspend execution until the active core completes the critical section. When the winner completes, the suspended cores resume speculative execution, so a lock never needs to be explicitly acquired.

Embedded-LE supports two contention management policies. The requestor-abort policy aborts only the core requesting the conflicting address, and the abort-all policy aborts all cores executing the same critical section. The second policy is motivated by the observation that once a core abandons speculation and tries to acquire the lock, it is highly likely to eventually force the other cores in the same critical section to abort.

Another variation of Embedded-LE is also examined, in which cores suspend execution in a low-power idle mode instead of spinning when waiting for a lock. This approach (called Embedded-LE-Sleep) saves power but increases latency (by 2 ms). This option is similarly available for Embedded-LR, which we call Embedded-LR-Sleep. Finally, we explore the effects of allowing aborted cores to attempt to elide the lock more than once before resorting to lock mode, by setting a parameter max number of retries.

In Embedded-LR, cores never switch from speculative to non-speculative execution. Unlike "best-effort" HTMs, Embedded-LR guarantees that every transaction eventually commits, so Embedded-LR is not subject to starvation. Embedded-LR supports two abort policies: timestamp and priority-abort. For the timestamp policy, the core with the earliest timestamp is allowed to proceed, whereas in the priority-abort policy, each core has a priority that is increased when it is rolled back and, in case of conflict, the higher-priority transaction proceeds. Table 3.1 summarizes all the possible configurations for Embedded-LE and Embedded-LR. The two algorithms are discussed in more detail next.

Configuration    Abort Policy                         # retries          Sleep
EMBEDDED-LE      1) requestor-abort  2) abort-all     0, 1, 2, ..., ∞    Yes/No
EMBEDDED-LR      1) timestamp  2) priority-abort      N/A                Yes/No

Table 3.1: EMBEDDED-SPEC — All Configurations.

3.3.1 Embedded-LE

Figure 3.5 shows the flowchart of the Embedded-LE algorithm, which is implemented in middleware using API function calls and hardware lock instructions (i.e., Test(), TestAndSet()).

This algorithm (Figure 3.5) is called when a core tries to enter a critical section protected by a lock. For example, when core X tries to enter a critical section, it first checks whether the maximum number of retries has been exceeded. If it has, the lock cannot be elided and the critical section must be executed in the usual way, by acquiring the lock with a call to TestAndSet(). If the number of retries has not been exceeded, the algorithm can elide the lock, but must first check that the lock is actually free (with a call to Test()). If Test() indicates that the lock is busy, the core spins until Test() indicates that the lock is free. Once the lock is free, it is elided and the critical section is executed speculatively.

During speculative execution, the Bloom module detects and resolves data conflicts by aborting one or more cores. When an abort occurs, the Tx abort register is updated to indicate the abort. Each core X calls the Check_Abort() function to determine whether it was aborted. If not, it proceeds speculatively. Otherwise, the core terminates the speculative execution and calls TestAndSet() once to try to acquire the lock. If it succeeds, the core proceeds non-speculatively. Otherwise, the core returns to the mispeculation-count check to determine whether it should continue to speculate or fall back to locking. Either way, the core spins until the lock is freed. If core X eventually reaches the end of its critical section, it checks its own execution mode by calling Check_In_Transaction(), to determine whether it has been running in locking mode or speculative mode, in order to either release the lock or end the speculative execution.
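The following simplified C sketch summarizes this middleware path. The primitives follow the names used in the text, but their signatures, the retry counter and the constants are illustrative assumptions, not the actual Embedded-Spec middleware.

```c
/* Primitives named in the text; exact signatures are assumed here. */
extern int  Test(int lock_id);              /* nonzero if the lock is free      */
extern int  TestAndSet(int lock_id);        /* nonzero if the lock was acquired */
extern void Release(int lock_id);
extern void Start_Transaction(void);        /* write to a Bloom Module register */
extern void End_Transaction(void);
extern int  Check_Abort(void);              /* nonzero if this core was aborted */
extern int  Check_In_Transaction(void);     /* nonzero if running speculatively */

#define MAX_RETRIES 4                       /* illustrative threshold           */
#define NUM_CORES   8
static int retries[NUM_CORES];

/* Entry: either elide the lock and run speculatively, or fall back to locking. */
void le_enter(int core, int lock_id)
{
    if (retries[core] > MAX_RETRIES) {       /* too many mispeculations          */
        while (!TestAndSet(lock_id))         /* spin until the lock is acquired  */
            ;
        return;                              /* LOCKING MODE                     */
    }
    while (!Test(lock_id))                   /* wait until the lock is free      */
        ;
    Start_Transaction();                     /* elide the lock: TRANSACTIONAL MODE */
}

/* Called after Check_Abort() reports that the Bloom Module aborted this core. */
void le_on_abort(int core, int lock_id)
{
    retries[core]++;
    if (TestAndSet(lock_id))                 /* one attempt to grab the lock     */
        return;                              /* proceed non-speculatively        */
    le_enter(core, lock_id);                 /* otherwise retry elision or lock  */
}

/* Exit: commit the speculation or release the lock, then reset the counter. */
void le_exit(int core, int lock_id)
{
    if (Check_In_Transaction())
        End_Transaction();
    else
        Release(lock_id);
    retries[core] = 0;
}
```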


Figure 3.5: The flowchart of the Embedded-LE algorithm.

3.3.2 Embedded-LR

Embedded-LR requires extensions to the Bloom module, and small changes to the middleware, replacing each lock acquisition with a start transaction instruction. Once a core starts speculation, it will never try to acquire the lock. When a conflict is detected, the losing cores are suspended.

We note that the Embedded-LR algorithm does not require much support at the middleware level, since the Bloom Module is already present at the hardware level. The only required feature at the middleware level is starting a new transaction instead of acquiring the lock. Even in the event of a mispeculation, the lock will not be acquired. The idea behind this is to allow at least one core to complete the critical section. In case of a conflict, the core or cores that have been selected to stop (i.e., the "losing" cores) are aborted and put in a hold state by the Bloom module. When the "winning" core completes execution of its critical section, the core or cores kept in the hold state are released and allowed to retry the critical section speculatively. To track suspended cores, the Bloom module is extended with a per-core hold queue list. When a core is aborted, its CoreID is added to the winning core's hold queue list register. We also append to the winning core's hold queue list register the hold queue lists of the aborted cores. When the winning core commits, every CoreID in its hold queue list register is released and the list is cleared.
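A minimal sketch of this hold-queue bookkeeping is given below; the data layout and hook functions are illustrative, since in Embedded-LR this state lives in the Bloom Module's registers rather than in software.

```c
#include <stdint.h>

#define NUM_CORES 8

/* Hypothetical hook into the hardware hold/release signal. */
extern void release_core(int core);

typedef struct {
    uint8_t held[NUM_CORES];   /* core IDs waiting on this core's commit */
    int     count;
} hold_queue_t;

static hold_queue_t hq[NUM_CORES];

/* On a conflict, the loser is suspended and queued behind the winner,
 * together with every core already queued behind the loser. */
void lr_on_abort(int winner, int loser)
{
    hq[winner].held[hq[winner].count++] = (uint8_t)loser;
    for (int i = 0; i < hq[loser].count; i++)
        hq[winner].held[hq[winner].count++] = hq[loser].held[i];
    hq[loser].count = 0;
}

/* On commit, release every queued core so it can retry speculatively. */
void lr_on_commit(int winner)
{
    while (hq[winner].count > 0)
        release_core(hq[winner].held[--hq[winner].count]);
}
```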

3.4 Experimental Results

This section presents an evaluation of the proposed Embedded-Spec design. The architecture was tested under several configurations. The first part of the evaluation is devoted to finding the optimal parameters in terms of energy consumption and execution time; our target metric is therefore the energy-delay product (EDP), a standard, commonly used evaluation metric in computer architecture [91]. As in previous work using the MPARM simulator [11], the performance and power models are based mostly on data obtained from a 0.13 µm technology provided by STMicroelectronics for their Platform [92], and the energy model for the fully associative caches is based on the approach of Efthymiou et al. [93]. The second part of the evaluation focuses on the advantages of the optimal configuration over the baseline lock approach. The hardware parameters are reported in Table 3.2.
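For clarity, EDP is simply the product of the two measured quantities, EDP = Energy × Execution Time (lower is better); a configuration that saves energy but lengthens execution can therefore still lose on EDP.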

3.4.1 Benchmarks

To test the design, several benchmarks were chosen and adapted to the simulation platform, which does not include operating system support. The benchmarks belong to the following suites:

Parameter    Configuration(s)
CPU          ARMv7, 3-stage in-order pipeline, 200 MHz
L1 cache     8 KB 1-way Icache, 16 KB 4-way Dcache
Cores        {1, 2, 4, 8}
Policies     Locking, Embedded-LE, Embedded-LR
Signature    2 Kbits, 4-way, Read and Write Bloom filters

Table 3.2: Hardware configurations.

• The STAMP benchmark suite [58]. The selected workloads represent the following synchronization patterns and critical section sizes: 1) large non-conflicting critical sections (vacation); 2) barrier-based synchronization with small critical sections (kmeans); 3) large critical sections that may conflict (genome); 4) a mix of large and small critical sections (labyrinth).

• The MiBench suite [94]: patricia. A Patricia trie is a data structure used in place of full trees with very sparse leaf nodes. Patricia is characterized by a high percentage of time spent in critical sections, and a high abort rate.

• Datastructures: redblack, skiplist. Applications operating on special data structures. The workload is composed of a certain number of atomic operations (i.e., inserts, deletes and lookups) to be performed on these two data structures. Red-black trees and skip lists constitute the fundamental building blocks of many memory management applications found in embedded software.

We begin with a design space exploration using the set of benchmarks described above. From this design space exploration, we determine the best combination of abort and retry policies for the two Embedded-Spec algorithms. Next, we compare the best configurations against standard locks.2

2 Note that here, we are not focusing on showing the EDP improvement achieved by speculative techniques as we increase parallelism by scaling the number of cores, since this has already been demonstrated by previous work (e.g., [10]). Instead, our emphasis is on providing a detailed exploration of a range of contention management techniques and retry policies and comparing the EDP improvement achieved specifically based on those choices. Hence, the results are not normalized to the single-thread execution, but are rather compared to the base synchronization approaches.


Figure 3.6: Execution time for Embedded-LE and Embedded-LE-Sleep modes.

3.4.2 Embedded-LE Parameter Exploration

3.4.2.1 Sleep Mode

As described in Section 3.3, the Embedded-LE implementation can be executed in conjunction with sleep mode, where if a thread is unable to acquire a lock immediately, it is switched to an IDLE state to reduce energy consumption. The energy savings, however, come at the expense of an increased execution time required to switch the cores back from sleep mode to normal operation. Fig. 3.6 shows how the execution time is affected by including sleep mode execution along with lock elision (noted as Embedded-LE-Sleep) as the number of cores is varied.3

To explain the differences observed in the results of the aforementioned benchmarks, we need to bear in mind what is special about each one of them. The genome, patricia and vacation benchmarks have large critical sections and spend a significant portion of time executing critical sections. While patricia experiences high abort rates, vacation generally has non-conflicting transactions. The redblack and skiplist benchmarks are very similar in the sense that they both work on special data structures and have very low abort rates. The labyrinth benchmark includes a mix of large and small critical sections. Finally, kmeans is a benchmark that spends a very small portion of its execution time in critical sections, which are themselves very small. That is why kmeans often does not show significant changes in behavior when fine-tuning some parameters.

3 The labyrinth benchmark triggers software-generated transaction aborts, which are not currently supported. Therefore, the simulations where these are triggered are omitted (the 4- and 8-core configurations for Embedded-LE).


Figure 3.7: Energy Consumption for Embedded-LE and Embedded-LE-Sleep modes.

As seen in Fig. 3.6, all benchmarks except for patricia show an increase in execution time.4 This increase is usually negligible and below 5%, but for vacation and redblack it reaches up to 6% and 10%, respectively. This increase is expected, since switching to/from sleep mode imposes a small time overhead (0.2 µs, i.e., 40 cycles). Only patricia shows a decrease in execution time, of 4%. This most probably happens because the small latency introduced by switching to sleep mode can shift timing in such a way that by the time sleeping cores wake up and retry speculation, the cores they previously conflicted with have completed their critical sections, so they don't conflict again. For benchmarks such as patricia, which have relatively high abort rates, a timing shift can have a big impact on the resulting abort rate and hence on performance. Indeed, in this experiment the abort rate for patricia decreased from 42% to 37% when using sleep mode.

Moreover, Fig. 3.7, which reports the energy consumption for the same set of experiments, shows that for benchmarks that spend a considerable amount of time in critical sections, there is a significant reduction in energy consumption due to sleep mode (e.g., 18% for genome, reaching 48% for patricia). Only redblack shows a slight increase (3%), while kmeans, skiplist and labyrinth are not affected at all by sleep mode. Since redblack has a very low abort rate, sleep mode only adds extra energy overhead.

Fig. 3.8 shows the energy-delay product for the same set of experiments, to measure the combined effect of sleep modality on both performance and energy consumption. Even though execution time is increased for some benchmarks when sleep mode is used, the effect is largely compensated by the reduction in energy consumption, resulting in a significant EDP improvement in most cases, reaching 14% for genome, 50% for patricia and 20% for vacation. The overall effect of sleep mode on EDP is insignificant for skiplist and non-existent for kmeans and labyrinth. Only redblack shows a clear EDP degradation with sleep modality, for the same reasons mentioned before.

4 Note that for most of the figures shown, the y-axis is not 0-based, in order to make the observed trends more readable.

Figure 3.8: Energy Delay Product for Embedded-LE and Embedded-LE-Sleep modes.

The conclusion we draw from this set of experiments is that if we care only about performance, we should use Embedded-LE-Sleep modality instead of Embedded-LE for patricia, while we should avoid it for all other benchmarks. If we care only about energy consumption, then Embedded-LE-Sleep modality is overall a better choice. Similarly, if we care about both performance and energy consumption, then, overall, Embedded-LE-Sleep is the better way to go.5

The results so far have shown that for many benchmarks it is better to sleep instead of spin. However, in order to get a better understanding of the design space, the parameter exploration is continued in the following sections, testing both sleeping and spinning versions of each configuration.

5 Note that in this results description, the focus is mainly on describing the trends for the 8-core execution, since that is where we experience the most parallelism. The trends for every core-count configuration can be observed in detail in the figures included.


Figure 3.9: Performance of Embedded-LE and varying maximum number of retries.

3.4.2.2 Max Number of Retries

Note that in all experiments for Embedded-LE so far, once a thread failed to elide a lock, it would then try to acquire it. We next extend Embedded-LE to allow a thread that had a conflict during lock elision to retry eliding the lock rather than immediately trying to acquire it. Therefore, the next parameter we investigate for Embedded-LE is the max number of retries, which allows us to evaluate how many times it is worthwhile to retry a failed speculation on a high-conflict critical section before switching back to lock mode.

Fig. 3.9 shows the performance with a varying number of retries allowed before reverting to locks, in Embedded-LE mode. Note that by setting this value to 0, Embedded-LE behaves as in the prior experiments, acquiring the lock after a single abort.6 Most benchmarks benefit in terms of performance from retrying the speculation several times, instead of not retrying at all (i.e., having the number of retries set to 0). In particular, when the maximum number of retries is 0, performance generally tends to degrade as the number of cores increases. A limit of 4 is optimal for patricia and genome, but the rest of the benchmarks do not show a significant change in performance based on which of the non-zero values we choose (vacation being the only exception, showing clearly worse performance if we restrict the number of retries to 1 instead of allowing more than one retry).

6 [14] effectively implemented their version of SLE with the maximum number of mispeculations set to 1, i.e., a maximum number of retries of 0.

Because they both experience high contention, patricia and genome do not benefit from many retries. In benchmarks with a high abort rate, switching to locking is preferable after a few retries, since speculation is likely to fail again. Indeed, for patricia, as the number of cores increases, we have to make sure the abort rate does not grow to the point where it is counterproductive for performance. When we restrict the number of retries to 0, the abort rate is reduced to nearly 0, but very little thread parallelism is exploited for 4 or 8 cores. If we allow one retry, the abort rate reaches 42% (for 8 cores), but it is still tolerable when it comes to improving performance. The same trend is observed in genome as well: restricting the number of retries to 0 gives a nearly zero abort rate, while allowing one retry yields an abort rate of 17%. For these two benchmarks with highly contended critical sections, it is better to limit the number of retries, in order to prevent the abort rate from increasing to the point where it hurts performance. Note that the exact same phenomenon is observed in sleep modality.

For energy consumption, shown in Fig. 3.10, 0 again becomes the worst choice as we increase the number of cores, but choosing between 2, 4 or an infinite number of retries does not make much difference for most benchmarks. Retrying 4 times again seems to be slightly better for genome and patricia. For the energy-delay product, Fig. 3.11 shows that picking any non-zero number of retries yields similar benefits for most benchmarks, except for patricia and genome, where restricting the maximum number of retries to 4 is clearly better (a 10% EDP improvement). If we had to choose a single maximum retry value for all benchmarks, we conclude that retrying up to 4 times would be overall the best choice.

Figure 3.10: Energy Consumption of Embedded-LE with varying maximum number of retries (normalized to 1 retry).

Figure 3.11: Energy Delay Product of Embedded-LE with varying maximum number of retries (normalized to 1 retry).

The results for Embedded-LE-Sleep appear in Fig. 3.12 and Fig. 3.13. In this case the results are similar, with a few notable differences. As in the non-sleep case, for most benchmarks any non-zero number of retries yields similar results in terms of performance. Especially for genome and patricia, retrying at most 2 times is better for performance than not restricting the number of retries. As in the Embedded-LE case, retrying speculation instead of switching back to locks immediately after an abort is always beneficial for performance. When looking at energy, though, things change significantly, as we are now able to save considerable amounts of energy by waiting on the lock in sleep mode instead of directly retrying speculative execution. In contrast, the more we allow retrying speculation, the more we risk wasting energy, as Fig. 3.13 shows.

For all benchmarks (except kmeans, which does not spend enough time executing critical sections to matter), restricting the maximum number of retries to 0 yields considerable energy savings.

Figure 3.12: Performance of Embedded-LE-Sleep with varying maximum number of retries (normalized to 1 retry).

Figure 3.13: Energy Consumption of Embedded-LE-Sleep with varying maximum number of retries (normalized to 1 retry).

To determine the overall best choice, we have to look at the energy-delay product, as shown in Fig. 3.14. Benchmarks such as redblack, skiplist, kmeans and labyrinth show better results when choosing any non-zero number of retries, while vacation shows considerable improvement (23%) for an infinite number of allowed retries compared to just 1. Genome shows better EDP when we restrict the number of retries to 2. On the other hand, patricia benefits greatly both in performance and energy when we do not allow any retries at all. This is expected, since benchmarks with high abort rates, such as patricia, benefit from switching to locks after a single misspeculation, while benchmarks with lower conflict levels benefit from retrying the speculation several times.

Figure 3.14: Energy Delay Product of Embedded-LE-Sleep with varying maximum number of retries (normalized to 1 retry).

We conclude that if we want to increase performance and at the same time decrease energy consumption, then for most benchmarks (except for patricia and genome) we should allow retrying speculation an unlimited number of times until it is successful, instead of switching back to locks. Slight variations in the best non-zero value are observed, leading us to pick small finite values, especially for genome (4 in Embedded-LE mode and 2 in Embedded-LE-Sleep mode) and patricia (4 in Embedded-LE mode). The only exception to these observations is patricia in Embedded-LE-Sleep mode. In this case, we see a significant improvement when we do not allow any retries and immediately switch back to locks after an unsuccessful speculation attempt. Again, this is due to the relatively high contention rate for this benchmark.

To summarize, if our primary goal is to improve performance, allowing an infinite number of retries is best for all benchmarks, except for genome and patricia, which show better performance for 4 maximum retries in Embedded-LE mode and 2 in Embedded-LE-Sleep mode. If our primary goal is to decrease energy consumption, then not restricting the number of retries is again best for most benchmarks, except for patricia and genome, which yield better results if we restrict the number of retries to 4. Finally, if we want to decrease energy consumption but we are in Embedded-LE-Sleep mode, then we should not allow any retries for any of the benchmarks.

3.4.2.3 Abort Policy

In this section the parameter exploration is continued by experimenting with the abort policy, which is set within the Bloom module abort manager. The requestor-abort policy, which aborts only the requesting core when a conflict occurs, is compared to the all-cores-on-the-same-lock-ID policy (or abort-all policy), which aborts all cores conflicting on the same lock-protected critical section.

Note that for either abort policy, the aborted cores will have to explicitly try to acquire a lock once they have rolled back and restored their previous states. Since multiple cores attempting to execute critical sections on the same lock ID must be consistent (i.e., cores must all be executing either in speculative (LE) or non-speculative (lock) mode), in the case of the requestor-abort policy, the other cores will have to abort as well if the requestor core manages to acquire the lock before they commit. However, since the process of rollback can take several cycles, in many instances the non-aborted cores will commit before the lock is acquired, and therefore it would have been wasteful to abort all the cores immediately when the conflict was first detected.
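For concreteness, the two policies can be sketched as follows in a hypothetical model of the Bloom-module abort manager; the structure, core count and function names are illustrative assumptions rather than the actual hardware interface.

#define NUM_CORES 8

typedef struct {
    int      requestor;         /* core whose access raised the conflict    */
    unsigned speculating_mask;  /* cores currently eliding the same lock ID */
} conflict_t;

extern void abort_core(int core);   /* roll back and notify the given core */

void handle_conflict(const conflict_t *c, int abort_all)
{
    if (abort_all) {
        /* abort-all: every core speculating on the same lock ID is aborted */
        for (int core = 0; core < NUM_CORES; core++)
            if (c->speculating_mask & (1u << core))
                abort_core(core);
    } else {
        /* requestor-abort: only the core that triggered the conflict aborts */
        abort_core(c->requestor);
    }
}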

We set the maximum number of allowed retries to different values, in order to see if the abort policy plays a different role in each case. We try four different values for the maximum number of allowed retries (0, 1, 2, and infinity) and, for each of these values, we test the two abort policies mentioned before. In these experiments, the observed trends in performance, energy and EDP were the same, hence only the EDP results are presented here, since EDP combines both metrics.

Experiments showed that if no retries are allowed on a failed speculation, then both abort policies yield exactly the same results in performance, energy and EDP for all benchmarks, whether sleep mode is enabled or not. This is expected, since the abort policy makes little difference if we immediately switch back to locks after a failed speculation.

Figure 3.15: Energy delay product for Embedded-LE using the two abort policies, with the maximum number of allowed retries set to 1 (normalized to abort-all).

Figure 3.16: Energy delay product for Embedded-LE using the two abort policies, with the maximum number of allowed retries set to 2 (normalized to abort-all).

Fig. 3.15 shows EDP results for the two abort policies in Embedded-LE mode when we allow at most one speculation retry. For all benchmarks, both abort policies show similar results, except for genome and patricia, which show considerable benefits for the requestor-abort policy compared to the abort-all policy as the number of cores increases (18% and 10% improvement, respectively). If we allow 2 retries in Embedded-LE mode, as seen in Fig. 3.16, patricia is consistently better (21%) with requestor-abort, while genome does not show any difference in this case. Vacation also shows a large benefit (19%) for 8 cores. We note, though, that as the maximum allowed number of retries increases, the benefits of choosing the requestor-abort policy become more prominent. Fig. 3.17 shows the corresponding results for an infinite number of allowed retries in Embedded-LE mode. In this case, we observe a dramatic drop in the EDP for specific benchmarks, namely genome, patricia and vacation (47%, 75% and 63%, respectively), as the number of cores is increased. Our conclusion from this set of experiments is that for most benchmarks the abort policy does not affect the overall EDP, but for genome, patricia and vacation, the requestor-abort policy shows significant benefits that become more prominent as the maximum number of allowed speculation retries is increased. Thus, we conclude that the requestor-abort policy can be safely chosen whenever Embedded-LE mode is activated.

Figure 3.17: Energy delay product for Embedded-LE using the two abort policies, with the maximum number of allowed retries set to infinity (normalized to abort-all).

Next, the same set of experiments is repeated, but this time with sleep mode enabled. Figures 3.18, 3.19 and 3.20 show the corresponding results. We generally observe similar trends as in the non-sleep modality, with the following differences. When the maximum number of retries is limited to 1, as shown in Figure 3.18, the requestor-abort policy is slightly worse for vacation as the number of cores is increased, but still slightly better for genome and patricia. The differences observed in this case, though, are too small to draw a conclusion on which policy is better. As we move to higher numbers of allowed speculation retries, as shown in Figures 3.19 and 3.20, the benefits of the requestor-abort policy become more visible in specific benchmarks. In particular, the EDP reduction in the Embedded-LE-Sleep experiment, when setting the number of allowed retries to infinity and using the requestor-abort policy, is 43%, 80% and 76% for genome, patricia and vacation respectively, compared to 47%, 75% and 63% for the Embedded-LE experiment set.

Figure 3.18: Energy delay product for Embedded-LE-Sleep using the two abort policies, with the maximum number of allowed retries set to 1 (normalized to abort-all).

Figure 3.19: Energy delay product for Embedded-LE-Sleep using the two abort policies, with the maximum number of allowed retries set to 2 (normalized to abort-all).

We conclude that it is never disadvantageous to choose the requestor-abort policy over the abort-all policy. In fact, for some benchmarks like genome, patricia and vacation, the requestor-abort policy is beneficial both in terms of performance and energy consumption, especially when a higher number of maximum allowed retries is set and sleep modality is chosen.

3.4.3 Embedded-LR Parameter Exploration

This section evaluates the abort policies of the Embedded-LR implementation. As described in Section 3.3, this approach is distinct from Embedded-LE because the architecture does not use locks for mutual exclusion.

Figure 3.20: Energy delay product for Embedded-LE-Sleep using the two abort policies, with the maximum number of allowed retries set to infinity (normalized to abort-all).

3.4.3.1 Abort Policy

The abort policies evaluated are timestamp, which aborts the core with the latest timestamp (i.e., the last core to start executing this critical section), and priority-abort, which favors the core that has been aborted the largest number of times on this particular critical section. To implement the timestamp configuration without increasing the hardware complexity, at the start of a new transactional execution the Bloom module increments a global counter and stores its value in the related Bloom module core register. In this way, each core that is working in speculative mode keeps information about its starting order. When a conflict is detected, the Bloom module aborts the core with the highest value. To implement the priority-abort configuration, the Bloom module increments a per-core register every time the core aborts. The register is cleared on commit. Note that in both cases, the aborted cores are switched into sleep mode for energy-saving reasons.
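A minimal sketch of the two victim-selection rules just described is given below; the register arrays, the core count and the helper names model the Bloom module's bookkeeping and are assumptions for illustration only.

#include <stdint.h>

#define NUM_CORES 8

static uint32_t global_counter;
static uint32_t start_stamp[NUM_CORES];   /* global-counter value at transaction start */
static uint32_t abort_count[NUM_CORES];   /* aborts on the current critical section    */

void on_transaction_start(int core) { start_stamp[core] = ++global_counter; }
void on_abort(int core)             { abort_count[core]++; }
void on_commit(int core)            { abort_count[core] = 0; }

/* timestamp policy: abort the last core to have started (largest stamp). */
int pick_victim_timestamp(const int *cores, int n)
{
    int victim = cores[0];
    for (int i = 1; i < n; i++)
        if (start_stamp[cores[i]] > start_stamp[victim])
            victim = cores[i];
    return victim;
}

/* priority-abort policy: favor the core aborted most often, i.e. abort the
 * conflicting core with the fewest past aborts on this critical section. */
int pick_victim_priority(const int *cores, int n)
{
    int victim = cores[0];
    for (int i = 1; i < n; i++)
        if (abort_count[cores[i]] < abort_count[victim])
            victim = cores[i];
    return victim;
}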

Fig. 3.21 shows that the timestamp approach provides similar EDP to priority-abort for genome, kmeans, redblack, skiplist and vacation. However, for patricia and labyrinth we observe that the timestamp approach provides a significant EDP improvement compared to priority-abort (up to 18% and 31%, respectively).

The results for performance and energy consumption show the exact same trends as EDP, so these graphs are not included; the results are summarized here instead. For performance, timestamp is clearly better than priority-abort for patricia (10%) and labyrinth (23%). For energy consumption, timestamp shows similar improvements for patricia (<10%) and labyrinth (18%). These benchmarks all have some longer-running critical sections and therefore tend to benefit from letting them run to completion, as the timestamp configuration allows. For all other benchmarks, no significant difference is observed between the two policies. Based on these observations, we conclude that for execution in Embedded-LR mode the timestamp approach can be safely chosen.

Figure 3.21: Energy delay product for Embedded-LR using the two abort policies (normalized to timestamp).

3.4.4 Speculative Execution vs. Locks

Having determined the optimal set of parameters for each benchmark, we can now compare Embedded-Spec with standard lock approaches (lock and lock-sleep). Using the best parameter configurations for each execution mode and each benchmark presented so far, we perform a set of experiments in which we compare the performance, energy consumption and EDP of each applied technique (locking, lock-sleep, Embedded-LE, Embedded-LE-Sleep and Embedded-LR).

Fig. 3.22 shows the execution time of each technique, normalized to the execution time of standard locks. As can be seen, for the 1-core configuration locks provide better performance than any kind of speculation. This is expected, and is due to the additional hardware and software support necessary to enable the speculation. As the number of cores is increased, though, the speculative approaches begin to show an advantage for all but the kmeans benchmark. As mentioned earlier, in kmeans the critical sections are rare and small (i.e., less than 5% of the time is spent in critical sections), and the results show that Embedded-Spec does not provide benefits. At the same time, Embedded-Spec does not hurt performance when the benchmark does not include large speculative sections.

Figure 3.22: Execution time of Embedded-Spec vs. standard locks, showing results for the best configuration of each benchmark (normalized to locking).

We also observe that Embedded-LR yields the best performance for an increased number of cores, except for kmeans. Embedded-LR yields a performance improvement of at least 47% for patricia and up to 80% for genome, compared to standard locks. The next best configurations after Embedded-LR are Embedded-LE and Embedded-LE-Sleep, both yielding performance improvements of 10%, 31%, 45%, 50% and 70% for vacation, redblack, skiplist, labyrinth and genome, respectively. The only exceptions are kmeans, for the reasons mentioned, and patricia, which shows better performance for the lock-sleep and locking techniques than for Embedded-LE and Embedded-LE-Sleep. This is expected for patricia, since it suffers from a relatively high abort rate, hence using locking instead of speculation is preferable. Regarding performance, Embedded-LE and Embedded-LE-Sleep show a very small difference, with Embedded-LE being slightly better, apart from patricia and redblack where the difference is more pronounced (23% and 6%, respectively).

Figure 3.23: Energy Consumption of Embedded-Spec vs. standard locks, showing results for the best configuration of each benchmark (normalized to locking).

Regarding locking compared to lock-sleep, the difference in performance is again insignificant, apart from patricia, where lock-sleep is clearly better (11%).

Fig. 3.23 shows the energy consumption for the same set of experiments. Here, lock-sleep is clearly preferable, showing energy benefits starting from 15% for kmeans and reaching up to 73% for genome and labyrinth. When we focus only on energy, locking with sleep mode enabled is clearly better than speculation since it does not encounter aborts. On the other hand, locking without sleep mode enabled becomes the worst choice for energy consumption, as we can see in Fig. 3.23. So, with the best choice being the lock-sleep technique, the second-best choice in terms of energy consumption is Embedded-LR (except for patricia, where Embedded-LE-Sleep is 31% better than Embedded-LR). Embedded-LE-Sleep comes very close to Embedded-LR in terms of energy consumption, with Embedded-LE following next in most cases. A common observation is that all sleep techniques yield better energy results, which is generally expected.

Fig. 3.24 shows the combined performance-energy results. We observe that Embedded-LR and lock-sleep are the two best techniques when we care about both performance and energy consumption, with Embedded-LR being better than lock-sleep for genome, redblack and vacation (up to 19%) and very similar for skiplist, labyrinth and kmeans. Only for patricia is lock-sleep better than Embedded-LR (up to 12%), again because of its high abort rate. The next best configuration for EDP for most benchmarks is Embedded-LE-Sleep, with Embedded-LE being very close. The worst choice for EDP is again locking. There are exceptions: first, kmeans does not show any significant difference for any of the applied techniques, which is expected as explained earlier. Second, patricia is the only benchmark that shows a clear improvement in EDP for Embedded-LE-Sleep compared to Embedded-LE (73%). Overall, with respect to energy-delay product, Embedded-LR is the best choice, with lock-sleep following next.

Figure 3.24: Energy Delay Product of Embedded-Spec vs. standard locks, showing results for the best configuration of each benchmark (normalized to locking).

We draw the following conclusions: if reducing energy consumption is our primary goal, then we should use sleep-enabling techniques. Moreover, we should not bother using speculation, but choose the lock-sleep technique instead; speculation is encouraged only in cases where we encounter increased parallelism. On the other hand, if performance is our primary goal, then Embedded-LR is clearly the winner. Finally, to improve the energy-delay product, we should generally pick Embedded-LR or lock-sleep and avoid Embedded-LE or traditional locking.

Table 3.3 summarizes the best and second-best configuration modes for each of the benchmarks considered in the experiments. We observe that the best configurations may vary depending on whether our goal is to improve performance, energy, or both.

Table 3.3: Embedded-Spec – top two best configurations when considering performance only, energy only, or energy-delay product.

Genome:    Performance: 1. TLR, timestamp;  2. SLE-sleep, 2 retries, requestor-abort.
           Energy:      1. lock-sleep;      2. TLR, timestamp.
           EDP:         1. TLR, timestamp;  2. SLE-sleep, 2 retries, requestor-abort.
Kmeans:    Performance: 1. TLR, timestamp;  2. no difference on type, #retries or abort policy.
           Energy:      1. lock-sleep;      2. TLR, timestamp.
           EDP:         1. lock-sleep;      2. TLR, timestamp.
Patricia:  Performance: 1. TLR, timestamp;  2. SLE, 4 retries, requestor-abort.
           Energy:      1. lock-sleep;      2. SLE-sleep, 0 retries, requestor-abort.
           EDP:         1. lock-sleep;      2. TLR, timestamp.
Redblack:  Performance: 1. TLR, timestamp;  2. SLE, no difference on #retries or abort policy.
           Energy:      1. lock-sleep;      2. TLR, timestamp.
           EDP:         1. TLR, timestamp;  2. lock-sleep.
Skiplist:  Performance: 1. TLR, timestamp;  2. SLE, no difference on #retries or abort policy.
           Energy:      1. lock-sleep;      2. locking.
           EDP:         1. lock-sleep;      2. TLR, timestamp.
Vacation:  Performance: 1. TLR, timestamp;  2. SLE, no difference on #retries, requestor-abort.
           Energy:      1. lock-sleep;      2. TLR, timestamp.
           EDP:         1. TLR, timestamp;  2. SLE, no difference on #retries, requestor-abort.
Labyrinth: Performance: 1. TLR, timestamp;  2. no difference on type, #retries or abort policy.
           Energy:      1. lock-sleep;      2. TLR, timestamp.
           EDP:         1. lock-sleep;      2. TLR, timestamp.

3.5 Summary and Discussion

This chapter presented Embedded-Spec, an energy-efficient and lightweight implementation for transparent speculation on an embedded architecture. Embedded-Spec can operate in two speculative execution modes: Embedded-LE, which is based on lock elision, and Embedded-LR, which is based on lock removal. Through an extensive set of experiments, the proposed scheme was shown to improve the energy-delay product (EDP) for most of the benchmarks and configurations that were considered. However, the benefits of speculation are sensitive to the critical section size, the degree of lock contention, the retry policy and the contention management policy. Results showed energy and performance benefits especially for larger numbers of cores (e.g., 4–8 cores). When comparing the two proposed speculative execution mechanisms, Embedded-LR provides better performance and energy characteristics than Embedded-LE. However, it was observed that standard locks with sleep mode enabled may still be the best choice if minimizing energy consumption is more critical than improving performance. We conclude that for platforms where energy efficiency matters, Embedded-Spec can provide real benefits, but that the underlying hardware architecture must be configured with care.

While the speculative execution mechanism described in this chapter is energy-efficient and appropriate for embedded systems, it targets a shared bus-based architecture with cache coherence support. This architecture can accommodate up to 8 cores. Increasing the number of cores to extract more parallelism in such a system will likely result in flooding the shared bus and hurting performance. Driven by the need for scalability, in the next chapter of this thesis we present a speculative execution mechanism that targets a far more scalable system: a many-core cluster-based embedded architecture.

Chapter 4

Speculative Synchronization on Coherence-free Many-core Embedded Architectures

High-end embedded systems, like their general-purpose counterparts, are turning to many-core cluster-based shared-memory architectures that are subject to non-uniform memory access (NUMA) costs. Memory organization is the single most far-reaching design decision for such architectures, both in terms of raw performance and in terms of programmer productivity. For many-core embedded systems, in order to meet stringent area and power constraints, the cores and memory hierarchy must be kept simple. In particular, scratchpad memories (SPM) are typically preferred to hardware-managed data caches, which are far more area- (40%) and power-hungry (34%) [1]. Several many-core embedded systems have been designed without the use of caches and cache coherence. These kinds of platforms are becoming increasingly common.

As embedded systems move to many-core and cluster-based architectures, the design of high-performance, energy-efficient synchronization mechanisms becomes more and more important. Yet, speculative synchronization for such embedded systems has received little attention. Moreover, implementing speculative synchronization in embedded systems that lack cache-coherence support is particularly challenging, since hardware speculative techniques traditionally rely on the underlying cache-coherence protocol to synchronize memory accesses among the cores. For these cacheless systems, a completely new approach is necessary for handling speculative synchronization.

The lack of cache coherence brings major challenges to the design of HTM support, which needs to be designed from scratch. At the same time, though, it provides a significant benefit: a more lightweight and simple environment to build upon, which could be more appropriate for the embedded systems domain. Building on such an environment, in this chapter we create from scratch a Hardware Transactional Memory design that is self-contained and does not rely on an underlying cache coherence protocol to provide synchronization and safety guarantees. To the best of our knowledge, this is the first design for speculative synchronization in this type of architecture. As will be described later, this implementation requires explicit data management and implies a fully-custom design of the transactional memory support.

4.1 Target Architecture

Before presenting the HTM design, it is essential to describe the target architecture. This work is based on a virtual platform environment called Virtual SoC (VSoC), a SystemC simulator which models a cluster-based many-core architecture at a cycle-accurate level [4]. Like recent many-core chips such as the Kalray MPPA256 [15], ST Microelectronics p2012/STHORM [18], and even GPGPUs such as NVIDIA Fermi [16], the VSoC platform encompasses multiple computing clusters and is highly modular. These systems achieve scalability through a hierarchical design. Simple processing elements (PE) are grouped into small-to-medium sized subsystems (clusters) sharing a high-performance local interconnect and memory. In turn, clusters are replicated and interconnected with a scalable network-on-chip (NoC) medium, as depicted in Figure 4.1.

Figure 4.1: Hierarchical design of our cluster-based embedded system: clusters, each with a network interface (NI), connected through NoC routers (R).

Figure 4.2 shows the basic cluster architecture. Each cluster consists of a configurable number, N, of 32-bit ARMv6 RISC processors (the original simulator was designed with ARMv6 processors; using a later processor version would not change the expected observations), one private L1 instruction cache for each of the N processors, and a shared multi-ported and multi-banked tightly coupled data memory (TCDM). Note that the TCDM is not a cache, but a first-level scratchpad (SPM) structure. As such, it is managed in software rather than hardware and it lacks cache coherence support. The TCDM is partitioned into M banks, where all banks have the same memory capacity. For the ARMv6 processor models, the Instruction Set Simulator by Helmstetter and Jolobo [95] is used, wrapped in a SystemC module. A logarithmic interconnect supports communication between the processors and the TCDM banks. The TCDM can handle simultaneous requests from each processor in the cluster. An off-chip main memory is also available. This is not part of the cluster, but each cluster's processors can access it through an off-cluster main memory bus (see Figure 4.2).

The logarithmic interconnect is a mesh-of-trees (Figure 4.3) and provides fine-grained address interleaving on the memory banks to reduce banking conflicts. The latency of traversing the interconnect is normally one clock cycle. If multiple processors are requesting data that reside within different TCDM banks, then the data routing is done in parallel. This allows the cluster to maintain full bandwidth for the communication between the processors and the memories. In addition, when multiple processors are reading the same data address simultaneously, the network can broadcast the requested data to all readers within a single cycle. However, if multiple processors are requesting different data that reside within the same TCDM bank, conflicting requests will occur, which will trigger a round-robin scheduler to arbitrate access for fairness. In this case, additional cycles will be needed to service all data requests. Specifically, the conflicting requests will be serialized, but with no additional latency between consecutive requests.

Figure 4.2: Single cluster architecture of the target platform: N processing elements with private instruction caches, a logarithmic interconnect (MoT), a multi-banked shared level 1 data memory (TCDM) with a test-and-set semaphore region, and a bus to the off-cluster main memory.

Regardless of whether the processors within the cluster have issued conflicting requests or not, when a memory access request arrives at a bank interface, the data is available on the negative edge of the same clock cycle. Hence the latency for a TCDM access that has not experienced conflicts is two clock cycles [4]: one cycle for traversing the interconnect in each direction.

A stage of demultiplexers between the logarithmic interconnect and the processors selects whether the memory access requests coming from the processors are for the main memory or for the TCDM. Accesses to memory external to a cluster go through a peripheral interconnection. The off-cluster (main memory) bus coordinates these accesses and services requests in round-robin fashion.

Figure 4.3: A 4x8 Mesh of Trees connecting four cores (PE0–PE3) to eight memory banks. Circles represent routing and arbitration switches. Taken from [4].

The basic mechanism for processors to synchronize is standard read/write operations to a dedicated memory space which provides test-and-set semantics (a single atomic operation returns the content of the target memory location and updates it). We use this memory for locking, the baseline against which the transactional memory design is compared.
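As a rough sketch, and assuming a base address and a "0 means free" convention that are purely illustrative (the actual VSoC memory map is not shown here), locking over the test-and-set region could look as follows.

#include <stdint.h>

#define TAS_BASE 0x10400000u                         /* hypothetical base address */

static volatile uint32_t *tas_location(int lock_id)
{
    return (volatile uint32_t *)(TAS_BASE + 4u * (uint32_t)lock_id);
}

void lock_acquire(int lock_id)
{
    /* A read atomically returns the old content and sets the location,
     * so a returned 0 means the lock was free and is now ours. */
    while (*tas_location(lock_id) != 0)
        ;                                            /* spin until acquired */
}

void lock_release(int lock_id)
{
    *tas_location(lock_id) = 0;                      /* an ordinary write frees it */
}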

Note that cores within the cluster do not have private data caches or memories (just private per-core instruction caches). Instead, all data accesses go through the TCDM. The absence of coherent data caches implies a completely custom design of the transactional memory support.

Indeed, speculative synchronization through HTM generally relies on the underlying cache coherency protocol to manage conflict detection. Instead, we have to employ a different mechanism to achieve this. This mechanism will be explained in more detail in the next section.

4.2 Transactional Memory Design

Since our system does not have caches, instead of buffering tentative updates in an L1 cache like prior designs, we choose to integrate the HTM mechanism with the TCDM memory, meaning that the TCDM holds both speculative and non-speculative data. In prior designs, including the design described in Chapter 3, that were intended for small-scale embedded devices, a centralized unit (such as the Bloom module) would snoop on bus transactions, detect data conflicts, and resolve them (i.e., decide which of the conflicting transactions should be aborted). Monitoring all ongoing traffic in a shared-bus environment is fairly easy, since only one transaction can traverse the shared bus medium in each cycle. However, in the VSoC system, the logarithmic interconnect permits multiple transactions to traverse the interconnect in the same cycle. Since interconnect access is concurrent, not serial, snooping on the cluster interconnect is not feasible. Serializing and routing transactional memory traffic through a centralized module would create a substantial sequential bottleneck and drastically limit scalability.

For these reasons, we conclude that transactional management must be distributed if it is to be scalable. Thus, we divide conflict detection and resolution responsibilities across the TCDM memory banks, into multiple Transaction Support Modules (TSM). By placing a transactional support module at each bank of the TCDM, we allow conflict detection and resolution mechanisms to be decentralized. In this way, transactional management bandwidth should scale naturally with the number of banks. The proposed design consists of three parts: Transactional Bookkeeping, Data Versioning and Control Flow. Each of these is described in more detail next.

4.2.1 Transactional Bookkeeping

Transactional bookkeeping (also known as conflict detection management) is the mechanism that keeps track of read and write data accesses in order to detect shared-data conflicts. In a conventional HTM system, this is usually done through extensions to the cache coherence protocol. Since the current target system has no cache coherence protocol, transactional bookkeeping must be implemented in an alternative way. This section describes the proposed design for our VSoC platform.

The TSM of each TCDM bank intercepts all memory traffic to that bank and is aware of which cores are executing transactions. This process keeps track of transactional readers and writers. When a TSM detects a conflict, it decides which transaction to abort and notifies the appropriate processor. For each data line, there can be multiple transactions reading that line, or a single transaction writing it, which we call the line's owner. Each bank keeps track of which processors have transactionally accessed each data line through a per-bank array of k r-bit vectors, where k is the number of data lines at that bank and r = 1 + N + log2(N), where N is the number of cores in the cluster. The first bit indicates whether the line has been written transactionally, and if so, the next log2(N) bits identify the owner. The remaining N bits indicate which processors are transactionally reading that line. For example, for N = 16 cores, a 21-bit vector is needed, as shown in Figure 4.4.

Figure 4.4: Bookkeeping example (owner bit, 4-bit writer ID, and one read bit per core for a 16-core cluster). At time t1, address location A has not been read or written. By time t2, cores 1, 7, 8, and 13 have read the address. At time t3, core 13 writes the address and generates a conflict, so core 13 is aborted and its read flag is cleared. Since core 13 was also the writer of address location A, the Writer ID bits and the Owner bit are cleared as well.

The transactional support mechanism is integrated within each bank of the TCDM memory. When a transaction accesses a bank, the bank's TSM checks the corresponding vector to determine whether there is a conflict. A transactional write to a memory location that is currently being read by another core will trigger a conflict (and vice versa). A transactional read to a memory location concurrently being read by other cores does not trigger a conflict.
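The per-line bookkeeping vector and these conflict rules can be illustrated with the following sketch for a 16-core cluster; the field layout and function names are illustrative, not the TSM's actual hardware implementation.

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint8_t  owned;        /* 1 if some transaction has written this line     */
    uint8_t  writer_id;    /* owner core (valid only when owned == 1)         */
    uint16_t readers;      /* bit i set if core i has read it transactionally */
} line_vec_t;

/* Returns true if an access by 'core' conflicts with existing accessors. */
bool check_conflict(const line_vec_t *v, int core, bool is_write)
{
    if (v->owned && v->writer_id != core)
        return true;                        /* another transaction wrote it   */
    if (is_write && (v->readers & ~(1u << core)))
        return true;                        /* write vs. concurrent readers   */
    return false;                           /* concurrent reads are allowed   */
}

/* Records the access after a conflict-free check (update_flags in the text). */
void update_flags(line_vec_t *v, int core, bool is_write)
{
    if (is_write) { v->owned = 1; v->writer_id = (uint8_t)core; }
    else          { v->readers |= (uint16_t)(1u << core); }
}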

4.2.2 Data Versioning

As previously described in Section 2.2.1.3, any transactional memory design must manage Data Versioning, i.e., keeping track of speculative and non-speculative versions of data. Like conflict detection and resolution, data versioning can be eager or lazy. Lazy versioning leaves the old data in place and buffers speculative updates in different locations, while eager versioning makes speculative updates in place and stores backup copies of the old data elsewhere. Keeping the original data in place makes the abort scenario very fast, but it delays commits, since extra time is necessary to write the speculative data back to memory during commits. Eager data versioning, on the other hand, makes commits faster but increases the abort recovery time, since the original data need to be restored to memory. For applications with low contention, hence low abort rates, eager data versioning is more attractive. However, care must be taken to make sure that the abort recovery time does not become a major bottleneck, even with lower abort rates.

For this design, we choose eager versioning and use the TCDM's banks for storing the speculative versions of data. We borrow from the LogTM idea of Moore et al. [7], in which the original values are stored in a software log structured as a stack. In LogTM, a per-thread transaction log is created in cacheable virtual memory, which holds original data values and their virtual addresses for all data modified by a transaction. In our platform, keeping per-thread transaction logs would have some significant drawbacks. First, the transaction log could reside anywhere in the memory, across multiple banks. This would imply that during every transactional memory access, the log would first need to be located and then traversed in order to find out whether the memory location has been accessed by that transaction. Moreover, the abort process would require traversing the log once more and restoring data back to their original locations, which would create data exchanges across different banks through the interconnect and hence significant delays. For these reasons, we propose two alternative data versioning designs that avoid this costly cross-bank data exchange: a Full-Mirroring and a Distributed Logging design. Both designs utilize the local per-bank TSMs for performing the transactional data saving and restoration processes. Next, each of these designs is described in detail.

4.2.2.1 Full-Mirroring

This design is based on the idea that, for every address in the memory space, we create a mirror address in the same TCDM bank that holds the original data, to be recovered in case of an abort. In this way, the restore process does not involve any exchange of data between different banks. Instead, the data saving and restoring process is triggered internally by the TSM of each bank, and it is completed simply by performing an internal bank access to the mirror address, without requiring interaction with other banks' TSMs. Although this solution consumes more space than keeping a dedicated per-transaction log, it yields a very simple and fast design.

When a transaction first writes an address, the TSM sends a request to the log space to record that address's original data. As shown in Figure 4.5, the log space is designed so that each address's mirror address is in the same bank. When a log entry needs to be saved, the TSM triggers a write to the corresponding mirror address in the bank. Note that we only need to log an address's original value the first time it is written within a specific transaction. Hence, we pay the cost of writing to the log space only once for each address that is written during a transaction. If a transaction aborts, the data it overwrote is restored from the log.

Because each address and its mirror reside at the same bank, the latency overhead of recording the address’s original value and restoring it on a transaction abort is quite modest (two extra cycles).

No additional cost is paid to search for the location where the address is logged, since each address's mirror lies at a location that can be found by a simple calculation and does not require traversing a log. Moreover, the use of an eager versioning scheme makes commits fast, since no data need to be moved. Thus, while full-mirroring has a significant area overhead (i.e., we need to dedicate half of the TCDM memory to the log space), it has an advantage when it comes to simplicity. However, the fact that it uses memory less efficiently may require extra delay overhead to move data to/from the TCDM and main memory. Section 4.3.1 presents a detailed overhead comparison of full-mirroring with the alternative distributed logging scheme in terms of space and time.

Figure 4.5: Modified single cluster architecture, with a log region and a TSM at each TCDM bank. Notice that the PIC refers to the off-cluster peripheral interconnect.
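One possible way to compute a mirror address is sketched below, assuming the upper half of the TCDM address range is reserved for mirrors; the offset constant and layout are assumptions for illustration, not the actual VSoC memory map. The same-bank property holds as long as the offset is a multiple of the bank-interleaving granularity times the number of banks.

#include <stdint.h>

#define TCDM_SIZE     (256u * 1024u)      /* total TCDM size                  */
#define MIRROR_OFFSET (TCDM_SIZE / 2u)    /* second half reserved for mirrors */

static inline uint32_t mirror_of(uint32_t tcdm_addr)
{
    /* With word-level interleaving over M banks, adding a multiple of 4*M
     * keeps the mirror in the same bank as the original address. */
    return tcdm_addr + MIRROR_OFFSET;
}

/* On the first transactional write to an address, the TSM copies the original
 * word to mirror_of(addr); on abort, it copies it back (two cycles each way). */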

4.2.2.2 Distributed Logging

As just noted in the previous section, half of the available on-chip TCDM must be reserved for the mirror in our full-mirroring scheme, yet the amount of this memory space that will actually be used to save log data depends on the write footprint of the transactions, which typically covers only a small subset of the available memory space. The second data versioning design proposed here, Distributed Logging, offers a solution to the space inefficiency of the Full-Mirroring design by using distributed per-address logs instead of mirrors.

Figure 4.6: Distributed per-address log scheme for M banks and N cores. Each TCDM bank holds its data, its TSM, and one log region per core (core 0 through core N-1) for addresses residing in that bank.

Figure 4.6 depicts how the Distributed Logging design works. In this design, distributed per-address logs are used to save backups of the original values of data that are written during transactions, so that they can be recovered in case of aborts. Again, as in Full-Mirroring, the transactional handling and log managing responsibilities are divided across the TSMs of the banks. Each bank's TSM monitors transactional accesses to the bank and manages the cores' logs that reside in that bank. It is also responsible for restoring the log data of the cores that abort their transactions and cleaning the logs of the cores that commit their transactions. Again, all banks' TSMs work in parallel and independently of one another.

At every bank of the TCDM, we keep a fixed-size log space for each core in the system. Each core's log holds the addresses that belong to that bank and are written transactionally by that core. In this way, we keep log space only for the addresses of the bank that are actually written transactionally and not for all of them, as in Full-Mirroring. At the same time, with this distributed logging design we still avoid cross-bank data exchange when saving and restoring the log, since each address's log falls within the same bank. Thus the log saving and restoration process is triggered internally by the TSM of each bank and does not require interaction with the TSMs of other banks. This would not be feasible if we used per-thread transaction logs as proposed in [7].

When a core writes transactionally to an address of a bank, its log is traversed to check whether it already holds an entry for that address. If not, a new log entry is created to store the original data of the address. Note that the data only need to be logged the first time the address is written within a specific transaction. Therefore, the log size depends on the write footprint of each transaction.

Since the log of each core is distributed among all the TCDM banks, we expect that the log writes will also be divided among the banks. The size of each core's log space per bank is a parameter in our design, so it can be easily adjusted to the needs of different application domains. In case of an overflow, our technique resorts to software-managed logging into the main L2 memory. The capability of tuning the log size is intuitively key to reducing the number of overflows. If a conflict is detected and a transaction must abort, each bank's log is traversed to restore the original data back to its proper address. If a transaction commits, the logs associated with that transaction are all discarded and the speculative data becomes non-speculative. In Section 4.3.1, we further detail the overhead analysis of each of the proposed data versioning schemes in terms of space and time.
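The write path of the distributed logging scheme might look roughly as follows; the entry format, the per-core log size and the function names are illustrative assumptions, not the actual TSM design.

#include <stdint.h>
#include <stdbool.h>

#define LOG_ENTRIES_PER_CORE 8          /* tunable per-core log size per bank */

typedef struct {
    uint32_t addr;                      /* address within this bank           */
    uint32_t old_value;                 /* pre-transaction value              */
} log_entry_t;

typedef struct {
    log_entry_t entries[LOG_ENTRIES_PER_CORE];
    int         count;
} core_log_t;

/* Called by a bank's TSM on a transactional write by a core. Returns false on
 * overflow, in which case the design falls back to software logging in L2.   */
bool log_on_write(core_log_t *log, uint32_t addr, uint32_t old_value)
{
    for (int i = 0; i < log->count; i++)
        if (log->entries[i].addr == addr)
            return true;                /* already logged in this transaction  */
    if (log->count == LOG_ENTRIES_PER_CORE)
        return false;                   /* overflow: spill to main memory      */
    log->entries[log->count++] = (log_entry_t){ addr, old_value };
    return true;
}

/* On abort, the TSM walks the log and writes each old_value back to addr;
 * on commit, it simply resets count to zero.                                  */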

4.2.3 Transaction Control Flow

This section describes the transaction control flow, i.e., how the proposed transactional memory design operates. Before a transaction starts, it reads a special memory-mapped transactional base register. When this request reaches the memory, the bit corresponding to the core that made the request is set internally, to indicate that this processor is executing a transaction. When the transaction starts, its core saves its internal state (program counter, stack pointer and other internal registers), to be able to roll back if the transaction aborts. All transactionally executed memory accesses are marked with a special transactional bit set when the memory accesses are issued to the system. When a transaction ends, it triggers another access to a memory-mapped transactional commit register, which activates a special process at the memory bank level that cleans all the transactional flags and saved logs associated with that core's transactional accesses. Note that the access to these special transactional registers is a read access, hence it is non-blocking, meaning that multiple cores may access those registers simultaneously without serialization. These special registers do not impose the serialization and contention costs associated with traditional semaphores.

Figure 4.7: Transactional control flow at each bank's TSM. Non-transactional requests proceed as usual; transactional reads and writes invoke check_conflict() and either update_flags() and perform the access (reading or writing the log as needed), or, on a conflict, resolve_conflict(), abort_transaction(), restore_data() and clean_flags().
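From the core's perspective, starting and committing a transaction as described above could be sketched as follows; the register addresses and the context-saving helper are hypothetical placeholders, not the platform's actual interface.

#include <stdint.h>

#define TM_BEGIN_REG  ((volatile uint32_t *)0x10500000u)   /* hypothetical address */
#define TM_COMMIT_REG ((volatile uint32_t *)0x10500004u)   /* hypothetical address */

extern void save_core_context(void);   /* save PC, SP and other registers */

void tm_begin(void)
{
    save_core_context();               /* needed for rollback on abort           */
    (void)*TM_BEGIN_REG;               /* read marks this core as transactional  */
}

void tm_commit(void)
{
    (void)*TM_COMMIT_REG;              /* read clears this core's flags and logs */
}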

Figure 4.7 depicts the transactional memory control flow. Each TSM has a process that, in each cycle, checks whether the request received by the corresponding bank is transactional. If so, and it is a request to read a saved log value, then the process returns the data from the log.

If the request is a transactional data read but not for the log space, then the check_conflict() function checks the address's flag vector to determine whether this request triggers a conflict. If all concurrent transactions are also reading, then there is no conflict, and the update_flags() function adjusts the location's read flags before performing the read.

On the other hand, if some core has written that address, a conflict is triggered. A resolve_conflict() function decides which of the transactions currently accessing that address will have to abort. This decision depends on the current conflict resolution policy. As a starting point, we chose to abort the requester (i.e., the core which issued the access that triggered the conflict). When the resolve_conflict() function returns, it calls the abort_transaction() function, which notifies the cores that need to abort. These cores then restore their internal saved state and respond with an acknowledgment. Control is passed back to the abort_transaction() process, which now has to call the restore_data() and clean_flags() functions. The first function is responsible for restoring the original saved data from the logs back to their original address locations, and the second function cleans the read/write flags of the aborted core. It is important to mention that, when an abort occurs in the system, all banks' TSMs call restore_data() and clean_flags() simultaneously and the banks stall normal operation until these functions have completed, in order to avoid intermediate reads of invalid data. Once the data restoration process has been completed by all banks' TSMs, the aborted core's internal state is restored. Thus, the aborted core is ready to retry the transaction. This will not happen right away, but after waiting for a random backoff period, in order to avoid consecutive conflicts with other cores that might also be retrying their aborted transactions simultaneously. More details on how this backoff period is implemented follow in Section 4.3.

The control flow for a write is similar, but there is a difference in the conflict detection criteria: if the memory location is currently being either read or written by other transactions, then a conflict will be triggered. If no transaction is reading the location, then the update_flags() function sets the owner flag and the new owner's ID for that address. The first time the owner writes, the address's original data must be saved to the log.
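Putting the pieces together, the per-bank TSM decision logic of Figure 4.7 can be summarized by the following sketch; the request structure and the helper signatures are illustrative, while the function names follow those used in the text.

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    int      core;
    bool     transactional;
    bool     is_write;
    bool     is_log_access;    /* read of a previously saved log value */
    uint32_t addr;
} mem_req_t;

/* Hooks into the bank's bookkeeping and logging machinery (assumed). */
extern bool check_conflict(uint32_t addr, int core, bool is_write);
extern int  resolve_conflict(const mem_req_t *req);   /* requester is aborted by default */
extern void abort_transaction(int core);
extern void restore_data(int core);
extern void clean_flags(int core);
extern void update_flags(uint32_t addr, int core, bool is_write);
extern void log_if_first_write(uint32_t addr, int core);
extern void perform_access(const mem_req_t *req);
extern void read_log(const mem_req_t *req);

void tsm_handle_request(const mem_req_t *req)
{
    if (!req->transactional) { perform_access(req); return; }   /* proceed as usual */

    if (!req->is_write && req->is_log_access) { read_log(req); return; }

    if (check_conflict(req->addr, req->core, req->is_write)) {
        int victim = resolve_conflict(req);
        abort_transaction(victim);     /* notify the core; it restores its state */
        restore_data(victim);          /* copy logged values back in place       */
        clean_flags(victim);           /* clear the victim's read/write flags    */
        return;
    }

    update_flags(req->addr, req->core, req->is_write);
    if (req->is_write)
        log_if_first_write(req->addr, req->core);   /* save original value once */
    perform_access(req);
}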

4.3 Experimental Results

In this section the proposed transactional memory design is evaluated and compared with the use of a conventional locking scheme. Moreover, the overhead of the proposed data versioning schemes is analyzed. The benchmarks chosen to test the design require some data synchronization and are representative of real embedded systems applications. Since the target simulation platform does not include operating system support, all OS calls within the benchmarks are eliminated. The evaluation starts with the following data structure benchmarks, as well as benchmarks from the STAMP benchmark suite. Later, Section 4.3.3 presents results from the Eigenbench exploration tool.

• Redblack, Skiplist: These are applications operating on special data structures. The workload is composed of a certain number of atomic operations (i.e., inserts, deletes and lookups) to be performed on these two data structures. Redblack trees and skip-lists constitute the fundamental blocks of many memory management schemes found in embedded applications.

• Genome: This is a gene sequencing program from the STAMP benchmark suite [58]. A gene is reconstructed by matching DNA segments of a larger gene. The application has been parallelized through barriers and large critical sections.

• Vacation: This application also comes from the STAMP benchmark suite [58] and implements a non-distributed travel reservation system. Each thread interacts with the database via the system's transaction manager. Vacation features large critical sections.

• Kmeans: This is a program from the STAMP benchmark suite [58] that groups objects into K clusters. It uses barrier-based synchronization and features small critical sections.

4.3.1 Overhead Characterization

In this section, we further detail the overhead analysis of the proposed data versioning schemes in terms of space and time. As described in Section 4.2.2, the full-mirroring design requires half of the TCDM memory space to be reserved for the mirror addresses, even though not all of them will actually be used. The distributed logging design, on the other hand, employs distributed per-address logs instead of mirrors. We can fine-tune the size of those logs based on the actual write footprint of the transactions. As a result, the distributed logging scheme provides better utilization of the available memory, since it reserves for the logs only the space the transactions need, leaving the rest to the application, while full-mirroring reserves half of the available memory for mirrors that will not be entirely used.

For each application that was run, the maximum per-core transactional write footprint was measured (i.e., the maximum size of data that is written within a single transaction by a core), when running applications with the maximum number of cores that the cluster can accommodate (16 cores). The results are reported in Table 4.1. The second column shows how many bytes are actually written within a single transaction of a core. The third column shows how many bytes need to be reserved in total for all cores' transactions, which is the amount of space we need to keep for the logs. We observe that in the worst case we need 5 Kbytes for all the logs in the system. For a TCDM size of 256 Kbytes, that is roughly 2% of the total TCDM memory, which means that we can use the remaining 98% of the memory for the actual application data. If we use full-mirroring, we are able to utilize only 50% of the TCDM memory space for the application, which is a considerably less space-efficient solution.

The Distributed Logging scheme has its own cost as well. Since per-address logs are used and not mirrors, the position of each address in the log is not straightforward as in full-mirroring. As a result, every time an address is saved in the log, the log has to be traversed in order to find whether the address already exists there and if not, a new entry has to be added for that address. In 81

Application    Per-core trans. write footprint (bytes)    Total log space (bytes)
Redblack       256                                         4096
Skiplist        64                                         1024
Vacation       320                                         3072
Genome         192                                         5120
Kmeans         256                                         4096

Table 4.1: Per-core transactional write footprint for each application.

In full-mirroring, the location of each address's mirror can be computed very simply, by adding the mirror offset (i.e., the base address of the mirror region) to the address. As a result, for full-mirroring, each time an address is saved, two extra cycles are necessary (one for reading the original content of the address and one for writing it to the mirroring address). For distributed logging, each time an address is saved, the log has to be traversed first. Based on benchmark profiling, a core's log in a bank can hold up to 5 entries at a time, so 5 extra cycles are necessary to traverse the log in the worst case.
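To make the bookkeeping difference concrete, the following C sketch contrasts the two address-saving paths. It is purely illustrative: the mirror offset, the log capacity and all names (MIRROR_OFFSET, core_log_t, log_save) are hypothetical and are not part of the actual hardware design.

    #include <stdint.h>
    #include <stddef.h>

    #define MIRROR_OFFSET 0x20000u  /* hypothetical base of the mirror region          */
    #define LOG_CAPACITY  5         /* worst-case entries per core per bank (profiled) */

    typedef struct {
        uint32_t addr;              /* original TCDM address                  */
        uint32_t old_value;         /* content before the transactional write */
    } log_entry_t;

    typedef struct {
        log_entry_t entries[LOG_CAPACITY];
        size_t      count;
    } core_log_t;

    /* Full-mirroring: the backup location is one addition away, so saving an
     * address always costs two extra cycles (read original, write mirror). */
    static inline uint32_t mirror_address(uint32_t addr)
    {
        return addr + MIRROR_OFFSET;
    }

    /* Distributed logging: the per-core log must be traversed first; a new
     * entry is appended only if the address has not been logged yet. */
    static int log_save(core_log_t *log, uint32_t addr, uint32_t old_value)
    {
        for (size_t i = 0; i < log->count; i++)
            if (log->entries[i].addr == addr)
                return 0;           /* already logged, nothing to do          */
        if (log->count == LOG_CAPACITY)
            return -1;              /* overflow: would need software handling */
        log->entries[log->count].addr      = addr;
        log->entries[log->count].old_value = old_value;
        log->count++;
        return 1;
    }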

In case of an abort, the total restoration time clearly depends on the write footprint of the target application: the higher the number of writes within a transaction, the bigger the size of the logs or the number of saved mirrors that need to be restored. For each address that needs to be restored, 2 cycles are spent, one for reading the original value from the log space and one for writing it back to its original address.

In case of commit, no data need to be restored: both full-mirroring and distributed logging are eager versioning mechanisms, so the transactional data are already in place upon commit. This makes commits very fast.

4.3.2 Performance Characterization

In this section the proposed transactional memory design is evaluated, both for the full-mirroring and the distributed logging scheme, and compared with the use of a conventional locking scheme.

For each benchmark, experiments were run using 1, 2, 4, 8 and 16 cores and the total execution time was measured in cycles. As mentioned in Section 4.2.3, a requester-abort policy was chosen for managing conflicts. This is a basic approach also chosen in previous works on transactional memory.

Parameter              Value
Main memory latency    200 ns
Main memory size       128 MB
Core frequency         200 MHz
# Cores                1, 2, 4, 8, 16
TCDM size              256 KB
# TCDM banks           16
I$ size                4 KB
I$ t_hit               1 cycle
I$ t_miss              ≥ 50 cycles

Table 4.2: Experimental setup for the VSoC platform.

Exponential backoff was also incorporated in the transactional retry process: When a core aborts, it does not retry the transaction immediately. Instead it halts the execution of the transaction, restores the original register values and then waits for a random backoff period, after which it begins re-executing the transaction. The range of the backoff period is tuned according to the conflict rate.

The first time a conflict occurs in a particular transaction, the core waits for an initial random period (< 100 cycles) before restarting. If the transaction conflicts again, the backoff period is doubled, and it continues to double each time until the transaction completes successfully. This way, the scenario of a sequence of conflicts happening repeatedly between cores that retry the same aborted transaction is avoided.
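The retry loop with exponential backoff could be sketched in C as follows. The helpers tx_execute(), restore_registers() and delay() are hypothetical placeholders for the platform's transaction and timing facilities; only the sub-100-cycle initial window and the doubling behavior come from the description above.

    #include <stdlib.h>
    #include <stdbool.h>

    #define INITIAL_BACKOFF 100          /* initial random window, < 100 cycles */

    /* Hypothetical helpers provided by the runtime/simulator. */
    extern bool tx_execute(void);        /* returns true on commit, false on abort */
    extern void restore_registers(void); /* roll back the core's register state    */
    extern void delay(unsigned cycles);  /* busy-wait for the given cycle count    */

    void run_transaction(void)
    {
        unsigned window = INITIAL_BACKOFF;

        while (!tx_execute()) {          /* abort: the transaction conflicted      */
            restore_registers();         /* restore the original register values   */
            delay(rand() % window);      /* wait a random period within the window */
            window *= 2;                 /* double the window after every conflict */
        }
        /* tx_execute() returned true: the transaction committed. */
    }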

Table 4.2 shows the experimental setup of the VSoC platform that was used in the experiments.

The t_hit and t_miss values represent the instruction cache access times in case of a hit or a miss, respectively. Accesses to the off-cluster main memory take 200 ns, significantly longer than accesses to the on-cluster TCDM, which take only 4 ns in total. Access to the off-cluster main memory is assisted through a DMA with 0.5 Gbytes/sec bandwidth.2

We first run experiments for redblack, skiplist, genome, kmeans and vacation. The results are shown in Figures 4.8, 4.9, 4.10, 4.12 and 4.11, which compare running the applications with spin-locks against the proposed transactional scheme using either the full-mirroring or the distributed logging design, for different numbers of cores.

2Since we are assuming a DMA to assist in the data transfer, access time per word will not take the full 200ns.

Redblack (execution time normalized to single-core locks, for 1/2/4/8/16 cores):
Locks            100%   68%   63%   65%   66%
Full Mirroring   108%   62%   39%   28%   23%
Logging          105%   59%   37%   25%   20%

Figure 4.8: Redblack: Performance comparison between locks and transactions for different number of cores.

Skiplist (execution time normalized to single-core locks, for 1/2/4/8/16 cores):
Locks            100%   73%   67%   68%   68%
Full Mirroring   106%   59%   36%   25%   19%
Logging          105%   58%   35%   23%   17%

Figure 4.9: Skiplist: Performance comparison between locks and transactions for different number of cores.

Genome (execution time normalized to single-core locks, for 1/2/4/8/16 cores):
Locks            100%   94%   95%   98%  104%
Full Mirroring    98%   51%   29%   19%   21%
Logging           98%   50%   27%   18%   19%

Figure 4.10: Genome: Performance comparison between locks and transactions for different number of cores.

For each benchmark, we show the percentage change in execution time relative to a baseline execution time of a single core with locks. We make the following observations. First, for the locking scheme, even though performance improves as we scale from 1 to 2 cores, it does not show significant improvement and in most cases gets worse as we move above 4 cores. This means that the performance scaling we hope to achieve through parallel execution does not follow the scaling of the number of cores. This is expected, since in a standard locking scheme the cores spend a lot of time spinning on the locks before entering the critical section. As a result, execution of the critical sections is serialized and this effect becomes worse as lock contention increases.

The second observation we make from the figures is that the transactional memory configurations always achieve better performance than the standard locking scheme. This is because, in the transactional memory scheme, the cores execute critical sections speculatively, assuming a real data conflict will not occur; locks, on the other hand, conservatively assume a conflict will occur and thus effectively serialize all accesses to critical sections.

Vacation (execution time normalized to single-core locks, for 1/2/4/8/16 cores):
Locks            100%   93%   99%  128%  188%
Full Mirroring    99%   52%   30%   26%   28%
Logging          106%   56%   31%   25%   28%

Figure 4.11: Vacation: Performance comparison between locks and transactions for different number of cores.

Kmeans (execution time normalized to single-core locks, for 1/2/4/8/16 cores):
Locks            100%   86%  102%  123%  250%
Full Mirroring    97%   83%   88%  107%  173%
Logging           97%   83%   91%  108%  175%

Figure 4.12: Kmeans: Performance comparison between locks and transactions for different number of cores.

If the abort rate is not significant, the overall performance improves tremendously. However, even in the event of aborts, transactions are restarted after an exponential backoff period, ensuring that the retrying cores will not conflict repeatedly.

Third, we observe that the proposed transactional scheme, both in the full-mirroring and the distributed logging configuration, achieves the intended scaling that we expect from using multiple cores instead of a single core (the only exception to this is kmeans, which we explain later). In most cases, performance keeps doubling as we continue to double the number of cores in the cluster. As the number of cores increases beyond 4 cores though, the performance scaling is slightly reduced and in some cases it levels off (genome, vacation). This is due to the main memory accesses that generate large delays and end up masking the benefits of running on a larger number of cores. These accesses exist in all runs, independent of the number of cores we use, but their effect becomes more pronounced for the 8 and 16 cores configurations since the total execution time is further reduced as we increase the number of cores. In addition, not all benchmarks exhibit similar benefit from parallelism. However, in most cases performance is still improved compared to using fewer cores and in all cases it is better than the performance of the locking scheme.

Next, we examine how the full-mirroring design compares with the distributed logging design in terms of performance. We observe that for redblack, skiplist and genome, the distributed logging scheme always outperforms the full-mirroring scheme, while for vacation it is worse for a small number of cores and gets better as the number of cores is increased. The kmeans application is the only one in which the distributed logging scheme shows worse performance than the full-mirroring configuration. As we discussed in Section 4.3.1, the distributed logging scheme incurs a slightly bigger overhead in saving the logs than the full-mirroring design. This would normally result in the distributed logging scheme being worse in performance compared to the full-mirroring scheme. At the same time, the distributed logging scheme uses memory more efficiently, allowing a larger quantity of data to be stored in the TCDM. In contrast, for full-mirroring a larger number of main memory accesses may be required for refilling data, which masks the savings gained from a simpler data versioning scheme. Redblack, skiplist and genome have a large number of refill accesses for the full-mirroring scheme because of its inefficient use of the TCDM. As a result, in those benchmarks we see worse performance for the full-mirroring design. On the other hand, kmeans and vacation have smaller data footprints, hence only a small number of refill accesses to the main memory are necessary. As a result, the performance difference caused by the log saving process in the distributed logging scheme now becomes visible. It is even more pronounced for a smaller number of cores, since fewer cores mean fewer but larger logs, hence a bigger log traversal overhead.

Overall, we see that the transactional memory scheme achieves significant performance improvement compared to standard locks. Specifically, this improvement ranges from 38% to 80% for redblack, 41% to 83% for skiplist, 49% to 82% for genome, 9% to 17% for kmeans and 44% to 72% for vacation over the baseline single-core lock configuration, depending on the number of cores used. Locks, in comparison, cannot effectively exploit the extra parallelism offered by adding additional cores, so their performance improvement lags far behind. We conclude that a transactional memory support scheme, when designed carefully based on the needs of the target architecture, can achieve significant performance improvements.

4.3.3 EigenBench

To further evaluate and compare the two proposed TM variants, we use the EigenBench exploration tool. EigenBench [96] is a simple microbenchmark for evaluating TM systems that allows for exploration of several eigen-characteristics, i.e., a set of orthogonal characteristics of TM applications that form a basis for all TM applications (similar to how a basis in linear algebra spans a vector space). The benchmark makes it possible to decouple the eigen-characteristics from each other and vary them independently, enabling the evaluation of corners of the application space not easily reached by real programs. Specifically, the focus here is on three characteristics that are relevant to the proposed

TM designs:

• Working-set size: This parameter represents the size of the memory accessed within transactions. Since our design requires explicit DMA transfers for TCDM management, when the transaction's memory footprint increases beyond the size of the TCDM we will experience a performance drop due to the DMA transfers;

• Contention: This parameter represents the probability of conflicts for a transaction. Since the roll-back mechanism that takes place upon conflict is different for the two TM implementations, we expect this to have an impact on performance when the conflict rate is high;

• Predominance: This parameter represents the fraction of cycles spent in memory operations within transactions to cycles spent in memory operations outside transactions. It thus represents a measure of the overhead for handling transactional reads/writes (i.e., for data versioning). When the predominance factor is low, we expect this overhead to be negligible, while for high predominance it could be important.

Note that there are no other instructions inside or outside transactions besides the memory accesses described above. This is the default setting for EigenBench when measuring the worst-case overhead of the TM system being evaluated.

Figure 4.13 shows three plots that report the results for the above described eigen-characteristics.

We measure the execution cycles for four configurations:

1. FM: Transactions are handled with the Full-Mirroring scheme;

2. LOGGING: Transactions are handled with the Logging scheme;

3. LOCKS: Transactions are protected with locks;

4. UNPROTECTED: Transactions are not protected and instead are allowed to run fully in parallel. While this is functionally not correct, it provides an upper bound on the achievable performance.

[Figure 4.13 plots: speedup of FM and LOGGING versus LOCKS as a function of working-set size (100 to 300 KB) and of contention (0% to 90% conflict rate), and speedup of FM, LOGGING and LOCKS versus UNPROTECTED as a function of predominance (9% to 100%).]

Figure 4.13: Results for the EigenBench evaluation methodology. Eigen-characteristics considered are working-set size (top), contention (middle) and predominance (bottom).

Next, we analyze the results shown in Figure 4.13.

Working-set size: We observe the speedup of the two TM systems versus locks. On the X-axis we see the transaction's working-set size (i.e., the footprint of transactional accesses) in KB. The first thing to emphasize is that for a transactional data footprint smaller than 128KB (the size of the TCDM that a program can use in the FM scheme) both TM systems perform very closely to the ideal (UNPROTECTED) case. Beyond 128KB the FM system starts suffering from DMA transfers. The same happens for the LOGGING scheme when the transactional data footprint grows beyond 226KB.

Contention: We observe again the speedup of the two TM systems versus locks. On the X-axis we see the transaction conflict rate in percent. For very low conflict rates both TM systems perform very closely to the UNPROTECTED case. As the conflict rate increases, their performance starts dropping, and at around a 30% conflict rate the LOGGING scheme starts behaving slightly worse than FM. This, as expected, is due to its slightly costlier rollback. It is also relevant to notice that around a 75% conflict rate both schemes start performing worse than locks.

Predominance: We see the slowdown of the two TM systems versus UNPROTECTED, as we assess how the overhead for transactional read/write logging causes the TM schemes to depart from the ideal performance. On the X-axis we see the percent predominance. This plot reveals that data versioning has a very low overhead in both designs, as the performance is consistently very close to the UNPROTECTED case.

4.4 Summary and Discussion

In this chapter, a novel HTM scheme was proposed that is targeted to a many-core cluster-based embedded architecture without caches or a cache-coherence protocol. To the best of our knowledge, this is the first design for speculative synchronization in this type of architecture. A transactional support mechanism was designed from scratch for handling transactions without relying on an underlying cache coherence protocol to manage read and write memory conflicts. This mechanism is based on the idea of distributing conflict detection and resolution across multiple Transaction

Support Modules (TSMs) that keep track of read and write memory accesses and guarantee execution correctness. Distributing synchronization management makes the design inherently scalable.

Two alternative data versioning designs were proposed: full-mirroring and distributed logging. A memory overhead and performance comparison between the two designs over a range of benchmarks showed that while full-mirroring is a very simple and fast design, it is wasteful and generally impractical in terms of memory, using 50% of the TCDM for mirrors when only about 2% is required in the distributed logging scheme. Furthermore, simulations showed that the full-mirroring design requires more main memory accesses for data refilling, which can hurt performance and (depending on the number of cores and the data footprint) makes distributed logging the better choice in most cases. Simulations on data structure applications, benchmarks from the STAMP benchmark suite and the EigenBench microbenchmark showed that both proposed transactional memory designs achieve a significant performance improvement over traditional lock-based schemes, ranging from 9% to 83% depending on the number of cores.

This base transactional support scheme gives us a good understanding of how speculation can provide performance benefits on a cluster-based framework of this particular cache-free structure.

In the future it would be interesting to consider alternative schemes that may provide even better efficiency, such as different conflict resolution policies or bookkeeping designs. As a first step, transactions were restricted within a single cluster and the focus was on simple and fast transactional handling schemes. While the current implementation is limited to single-cluster accesses, the proposed scheme is designed so that it is scalable and can be extended to multiple clusters. It would be interesting to study how this can be done using inter-cluster transactional support. A different direction would be to explore how this design can be used for alternative purposes other than synchronization. The next chapter examines this possibility by looking into adopting Transactional

Memory mechanisms for reliability.

Chapter 5

Transactional Memory Revisited for Error-Resilient and

Energy-Efficient MPSoC Execution

In Chapters 3 and 4 we saw how TM-based speculation can be used for data synchronization in embedded multicore systems, to improve performance and energy-efficiency. Having observed how much transactions can benefit synchronization in embedded multi-core environments, it would be interesting to study how transactions could be used for alternative purposes, other than traditional data synchronization. In this chapter we explore the use of hardware transactional memory as a recovery mechanism from timing errors or the Critical Operating Point (COP) in multi-core embedded systems operating far beyond the safe nominal supply voltage. Specifically, we propose a scheme that dynamically monitors the platform and adaptively adjusts to the COP among multiple cores, using lightweight checkpointing and roll-back mechanisms adopted from Hardware Transactional Memory

(HTM) for error recovery. Experiments demonstrate that this technique is particularly effective in saving energy while also offering safe execution guarantees. To the best of our knowledge, this work

is the first to describe a full-fledged HTM implementation for error-resilient and energy-efficient

MPSoC execution.

5.1 Motivation

Scaling of physical dimensions in semiconductor devices has opened the way for heterogeneous embedded SoCs integrating host processors and many-core accelerators in the same chip [69], but at a price of ever-increasing static and dynamic hardware variability [70]. Spatial die-to-die and within-die static variations ultimately induce performance and power mismatches between the cores in a many-core array, introducing heterogeneity in a nominally homogeneous system (formally identical processing resources). Dynamic variations depend on the operating conditions of the chip, and include aging, supply voltage drops and temperature fluctuations. The most common consequence of variations is path delay uncertainty. Circuit designers typically use conservative guardbands on the operating frequency or voltage to ensure safe system operation, with the obvious consequent loss of operational efficiency. When the guardbands are reduced, or when the system is aggressively operated far from a safe point, the delay uncertainty manifests itself either as an intermittent timing error [2] [21] or a critical operating point (COP) [20]. As we saw in Section 2.2.5, timing errors violate the setup or the hold time constraints of the sequential element connected at the end of the path, which in turn can cause erroneous instructions with wrong outputs being stored or, worse, incorrect control flow. The COP defines a voltage and frequency pair at which a core is error-free. If the voltage is decreased below (or the frequency is increased beyond) the COP, the core will face a massive number of errors [20]. The COP effect is highly pronounced in well-optimized designs [71] [72].

Circuit level error detection and correction (EDAC) techniques [2] [21] can transparently detect and correct timing errors, with the side-effect of increased execution time and energy. In addition, while EDAC techniques are suitable for handling sporadic errors, they are obviously not a good solution for the “all-or-nothing” effect of the COP. In principle the COP can be determined for a particular chip after its production, and the most efficient yet safe voltage/frequency pair for the chip could be configured at that time. However, due to static and dynamic variations, the COP may actually change over space and time. As a result, the “safe” operating point may i) differ from one core to another (forcing the entire chip to be conservatively tuned to meet the requirements of the most critical core) and ii) suddenly become unsafe due to aging, temperature fluctuations or voltage drops.

This chapter proposes an integrated HW/SW scheme that can address both types of variation phenomena. In particular, the proposed scheme dynamically adjusts to an evolving COP, thus enabling the system to operate at highly reduced margins without sacrificing performance, while at the same time guaranteeing forward progress at reduced energy levels. To achieve that, it monitors the platform and adaptively adjusts to the COP among multiple cores, using lightweight checkpointing and roll-back mechanisms adapted from Hardware Transactional Memory (HTM) for error recovery.

The platform is initially configured to operate at a safe, reference operating voltage (i.e., with safe margins to hide all variability effects). Every time a new transaction is started, the proposed technique optimistically lowers the voltage in small steps, individually on each core. If sporadic or non-critical errors take place, the HTM-inspired techniques intervene and ensure correct program behavior and progress. If systematic or critical errors take place, then the system reverts to the previous stable operating point. If over time the COP changes, the technique is re-activated and the system is re-calibrated. Next, the details of the target architecture and the proposed design are presented.

[Figure 5.1 block diagram: a host system and a cluster-based programmable many-core accelerator (PMCA) with critical path monitors (EDS), connected through the system interconnect to a memory controller, DRAM and the shared-memory cluster.]

Figure 5.1: Target platform high level view.

5.2 Target Architecture

The proposed HW/SW design is driven to a large extent by the target architecture (Fig. 5.1). This is basically the same architecture that we considered in Chapter 4 for implementing hardware transactional memory for data conflict speculation and management. A general-purpose host processor is coupled with a programmable many-core accelerator (PMCA) composed of several tens of simple cores, where critical computation kernels of an application can be offloaded to improve overall performance/Watt. We assume that the host core is operated with safe margins. This work focuses on the PMCA, and in particular on a design that leverages a multi-cluster configuration to overcome scalability limitations [69] [17] [15]. The goal is to improve energy efficiency by operating the PMCA

“dangerously close” to the COP, while exploiting the HTM to avoid failures.

In this multi-cluster configuration, simple processing elements are grouped into clusters sharing high-performance local interconnect and memory. Several clusters are replicated and interconnected through a scalable medium such as a network-on-chip (NoC), while within a cluster a limited number of simple processors (typically 4 to 16) share an L1 tightly-coupled data memory (TCDM). Also, as in Chapter 4, this work focuses on a single computation cluster. Here, the cluster is configured with 8 cores with private instruction cache (1 KByte) and 16 TCDM banks (256KB), plus external

(main) L2 memory (2MB). The TCDM is implemented using two different technologies: 6-transistor

SRAM and Standard Cell Memory (SCM). The SCM achieves lower density (∼3X) than SRAM, but can reliably operate at the same voltage ranges as the rest of the logic. SRAM requires higher voltages to operate reliably and thus consumes ∼4X the energy [97]. The SCM is used to implement storage that needs to always be reliable (e.g., for function calls, control-flow data and the instruction cache), while program data is stored in SRAM and the HTM techniques are used to recover from errors. The HTM extensions for error-tolerance are designed on top of this baseline cluster. More specifically, existing checkpointing and rollback mechanisms that have been employed for HTM in Chapter 4 are now revisited to be used as a lightweight mechanism for fast and efficient error recovery.

All the base performance/energy/area numbers used in this work are derived from a silicon implementation of the platform in 28nm STMicroelectronics UTB FD-SOI technology, and integrated in the VSoC simulator. The cluster is able to operate over a wide range of frequencies (from 20MHz

@ 0.5V up to 450MHz @ 1.2V). The target frequency is 200MHz, with a nominal voltage of 0.84V.

Due to process variation the required Vdd for a safe operating condition may actually vary among cores (up to 0.04V increase is observed). Different sources of dynamic variations also increase the minimum voltage level required for safe operation. The baseline platform considers safe margins to compensate for all sources of variability, and is thus conservatively operated at a reference voltage of 1V. Any errors caused by dynamic variation need to be detected at runtime. We assume each core is equipped with error-detection circuitry such as error-detection sequential (EDS) [21].

5.3 Implementation

The proposed scheme borrows key concepts from Hardware Transactional Memory (HTM) to provide a mechanism for error recovery. Recall that HTM requires three key components: i) some form of transactional bookkeeping, for keeping track of read/write data conflicts, ii) a data versioning mechanism, for keeping track of speculative and non-speculative versions of data in case it is necessary to roll back and recover from a data conflict, and iii) a rollback mechanism in order to recover in case of conflicts. Since transactional memory is not used here in its traditional context of conflict detection, a bookkeeping mechanism is not necessary, which makes the design considerably more lightweight. The only mechanisms that need to be adopted from HTM are data versioning and rollback, in order to recover from variability-induced errors. Next, these two key parts of the design, as well as the control flow, are described in detail.

5.3.1 Checkpointing and Rollback

Checkpointing is the mechanism that saves the system's state for retrieval in case of errors. All parallel parts of the program are protected from errors by enclosing them within transactions (see Section 5.3.4). At the beginning of each transaction (Transaction Start) the internal state of the core is saved (i.e., program counter, stack pointer, internal registers, stack contents) to be able to roll back in case of errors. As with conflict resolution (Section 2.2.1.3), error resolution can be eager or lazy, meaning that we can resolve the error by aborting the transaction and rolling back right away, or wait until the end of the transactional region to do so. In this design, a lazy error resolution scheme is chosen to avoid the cost of frequent error checking. The transactional regions' sizes are fine-tuned to be small enough so that if errors start occurring, it won't be long before they get detected and the core's voltage is adjusted back to safer levels. Thus, when a transaction completes execution (Transaction End) the system checks whether errors have been encountered by the core executing the transaction. If no errors are detected the transaction commits, the checkpointing information is discarded and speculative changes to the data become permanent. If errors are detected the transaction aborts and a rollback mechanism restores the internal core state.
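A possible shape for the per-core checkpoint taken at Transaction Start is sketched below; the structure layout, the sizes and the helper name are assumptions made for illustration and simply mirror the state listed above (program counter, stack pointer, registers, stack contents).

    #include <stdint.h>
    #include <string.h>

    #define NUM_REGS   32
    #define STACK_SAVE 256               /* bytes of stack preserved; illustrative size */

    typedef struct {
        uint32_t pc;                     /* program counter at Transaction Start */
        uint32_t sp;                     /* stack pointer                        */
        uint32_t regs[NUM_REGS];         /* general-purpose register file        */
        uint8_t  stack[STACK_SAVE];      /* copy of the top of the stack         */
    } checkpoint_t;

    /* Save the core state at Transaction Start so it can be rolled back later;
     * on commit the checkpoint is simply discarded. */
    static void checkpoint_save(checkpoint_t *cp, uint32_t pc, uint32_t sp,
                                const uint32_t *regs, const uint8_t *stack_top)
    {
        cp->pc = pc;
        cp->sp = sp;
        memcpy(cp->regs, regs, sizeof(cp->regs));
        memcpy(cp->stack, stack_top, sizeof(cp->stack));
    }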

5.3.2 Data Versioning

As discussed in section 2.2.1.3, data versioning can also be either eager or lazy. Lazy data versioning keeps the original data in place and buffers the speculative data updates in different locations

(allowing for fast error recovery). Eager data versioning makes speculative changes in place and stores back-up copies of the original data in separate places (allowing for fast commits but slow abort handling). In this case, aborts due to errors are expected to be infrequent, since the voltage is increased as soon as the COP is reached or a timing error is encountered. Hence, an eager data versioning mechanism is the better choice.

As was done in Chapter 4 for our HTM design, our new design also uses the TCDM memory to hold both speculative and non-speculative data. For data versioning, the distributed per-address log scheme described in Section 4.2.2.2 is used, since that scheme is simple, fast and more space-efficient than the alternative full-mirroring design. Distributed per-address logs are used to save backups of the original values of data that are written during transactions, so that they can be recovered in case of errors. Data logs are distributed across the TCDM memory banks, so that each bank is responsible for handling recovery only for its associated data.

Since memory is distributed across multiple memory banks that accept and serve access requests in parallel, having a central control logic to manage the distributed logs would not be efficient. For this reason, transactional handling and log managing responsibilities are divided across multiple control modules, one for each bank of the TCDM, called Data Versioning Modules (DVMs). Each bank's DVM is a control block that monitors transactional accesses to the bank and manages the cores' logs that reside in that bank. It is also responsible for restoring the log data of the cores that abort their transactions and cleaning the logs of the cores that commit their transactions. All banks' DVMs work in parallel and independently of each other. DVMs are the equivalent of the TSMs described in Section 4.2, with the difference that they are not responsible for handling transactional bookkeeping, since it is not necessary in this case. In every bank of the TCDM, a fixed-size log space is kept for each core in the system. Each core's log holds the addresses that belong to that bank and are written transactionally by that core. In this way, a log space is kept only for the addresses of the bank that are actually written transactionally. At the same time, with this distributed log design cross-bank data exchange is avoided when saving and restoring the log, since each address's log falls within the same bank. Thus the log saving and restoration process is triggered internally by the DVM of each bank and it does not require interaction with the DVMs of other banks.

When a core writes transactionally to an address of a bank, its log is traversed to check whether it already holds an entry for that address. If not, a new log entry is created to store the original data of the address. Note that the data only need to be logged the first time the address is written within a specific transaction. Therefore, the log size depends on the write footprint of each transaction.

Since the log of each core is distributed among all the TCDM banks, we expect that the log writes will also be divided among the banks. The size of each core's log space per bank is a parameter in our design, so it can be easily adjusted to the needs of different application domains. In case of an overflow, the system resorts to software-managed logging into the main L2 memory. The capability of tuning the transactions' granularity is intuitively key to reducing the number of overflows. Using the technique described in Section 5.3.4, it was found that 1KB total log size per core (64B in each

TCDM bank) is adequate for the target applications. Overall, the logs for all the cores occupy roughly 3% of the total TCDM space.

If an error is detected and a transaction must abort, each bank's log is traversed to restore the original data back to its proper address. If a transaction commits, the logs associated with that transaction are all discarded and the speculative data now become non-speculative.
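As an illustration of the DVM's role, the C sketch below models the per-bank, per-core logs and the two operations triggered on abort and commit. The sizes, the types and the tcdm_write() hook are hypothetical; only the per-address restore on abort and the discard-on-commit behavior follow the description above.

    #include <stdint.h>
    #include <stddef.h>

    #define NUM_CORES   8
    #define LOG_ENTRIES 8                /* 64 B per core per bank, 8 B per entry (illustrative) */

    typedef struct {
        uint32_t addr;                   /* address within this bank             */
        uint32_t old_value;              /* original value logged on first write */
    } dvm_entry_t;

    /* Per-bank state managed by this bank's Data Versioning Module (DVM). */
    typedef struct {
        dvm_entry_t log[NUM_CORES][LOG_ENTRIES];
        size_t      count[NUM_CORES];
    } dvm_bank_t;

    /* Hypothetical accessor for a word in this bank's portion of the TCDM. */
    extern void tcdm_write(uint32_t addr, uint32_t value);

    /* Abort: walk the aborting core's log and restore every original value
     * (in hardware: one read from the log and one write back per address). */
    void dvm_restore_logs(dvm_bank_t *bank, int core)
    {
        for (size_t i = 0; i < bank->count[core]; i++)
            tcdm_write(bank->log[core][i].addr, bank->log[core][i].old_value);
        bank->count[core] = 0;
    }

    /* Commit: the speculative data is already in place (eager versioning),
     * so the log is simply discarded. */
    void dvm_clean_logs(dvm_bank_t *bank, int core)
    {
        bank->count[core] = 0;
    }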

[Figure 5.2 flowchart: Start_Transaction(); if the COP has not yet been found, Lower_Voltage(); execute the transaction; if no errors are detected the transaction commits (Clean_Logs(), discard checkpoint), otherwise it aborts (Restore_Logs(), restore state, Increase_Voltage()); the transaction then ends.]

Figure 5.2: Control Flow of an error-resilient transaction.

5.3.3 Error-Resilient Transactions

The flowchart in Figure 5.2 describes the semantics of the error-resilient transactions (ERT). The execution starts with all platform components set at the safe reference voltage level (1.0V). Each time a core encounters a new transaction it saves its internal state and current stack and checks whether the self-calibration procedure was previously completed and the COP for this core is known.

If the COP is still unknown, the executing core optimistically lowers its voltage level by a pre-defined step (0.02V). If the COP has already been reached, then no voltage adjustment is made.

If the transaction end is reached without errors being detected, the transaction commits. A clean_logs() process is activated at each bank's DVM to clean up the saved log of the committing core in the respective bank. Note that all these processes are triggered simultaneously by the DVMs of all memory banks. If errors are detected, then the transaction aborts. A restore_logs() process is activated simultaneously at each bank's DVM to restore all the saved log values of the aborted core.

The internal state of the aborted core is restored, its voltage is adjusted back to the previous safe level (increase_voltage(), a +0.02V voltage increase beyond the recently found COP), and the core is ready to retry the transaction. From this point on, the voltage level is no longer reduced when starting a new transaction1.
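Putting the pieces together, the per-transaction control flow of Figure 5.2 could be sketched in C as follows. The platform hooks (set_voltage(), errors_detected(), etc.) are hypothetical names; only the 0.02V step and the 1.0V reference come from the values given in the text.

    #include <stdbool.h>

    #define V_REF  1.00    /* safe reference voltage (V)       */
    #define V_STEP 0.02    /* per-transaction scaling step (V) */

    /* Hypothetical platform hooks. */
    extern void checkpoint_state(void);
    extern void restore_state(void);
    extern void run_transaction_body(void);
    extern bool errors_detected(void);
    extern void set_voltage(double v);
    extern void clean_logs(void);
    extern void restore_logs(void);

    static double vdd       = V_REF;
    static bool   cop_found = false;

    void error_resilient_transaction(void)
    {
        for (;;) {
            checkpoint_state();          /* save PC, SP, registers, stack         */
            if (!cop_found) {            /* COP not reached yet: scale down       */
                vdd -= V_STEP;
                set_voltage(vdd);
            }
            run_transaction_body();
            if (!errors_detected()) {    /* commit path                           */
                clean_logs();            /* discard backups and the checkpoint    */
                return;
            }
            restore_logs();              /* abort path: undo speculative writes   */
            restore_state();
            cop_found = true;            /* the COP has just been crossed         */
            vdd += V_STEP;               /* go back to the last safe level        */
            set_voltage(vdd);            /* and retry the transaction             */
        }
    }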

OpenMP code:

    #pragma omp for schedule(dynamic, CHUNK)
    for (i = LB; i < UB; i++)
    { /* LOOP_BODY */ }

Transformed code:

    int start, end, work_left;
    work_left = loop_dynamic_start(LB, UB, 1, CHUNK, &start, &end);
    while (work_left) {
        ...
        for (i = start; i < end; i++)
        { /* LOOP_BODY */ }             /* TRANSACTION BODY */
        ...                             /* ERROR-RESILIENT TRANSACTION */
        work_left = loop_dynamic_next(&start, &end);
    }

Figure 5.3: Transformed OpenMP dynamic loop.

5.3.4 Programming model

Similar to prior approaches [98] [3], in this work transactional memory is integrated into OpenMP [99], a widespread and easy-to-use programming model. An OpenMP program starts on a single thread of execution (the master). Once the parallel directive is encountered, additional threads are created, and execute the code enclosed within the syntactic boundaries of the construct. The work is parallelized among threads using worksharing directives. For illustration purposes we describe here one of the most used among such directives: dynamic loops. Figure 5.3 shows a code snippet with a #pragma omp for directive, used to distribute loop iterations among threads. The schedule(dynamic, CHUNK) clause is used to specify that iterations should be grouped in smaller sets of size CHUNK, and distributed in a dynamic (first come, first served) fashion.

The bottom part of Figure 5.3 shows how this is achieved once the code is transformed by an

1In case a temperature reduction is detected, the voltage can be further decreased, as the COP has “moved” downwards.

OpenMP compiler. Runtime library calls are inserted to interact with an iteration scheduler. First, the scheduler is initialized (loop_dynamic_start), passing as parameters the original loop bounds (LB, UB), the stride and CHUNK. If there are iterations available, the function returns a positive integer (stored in work_left) and initializes the output parameters start and end with the lower and upper boundaries for the current chunk of iterations. The original loop body is then executed for these iteration instances and a new call to the runtime library (loop_dynamic_next) repeats the process until there are no iterations left.
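For illustration, a minimal single-cluster iteration scheduler behind these two calls might look like the sketch below. This is not the actual OpenMP runtime implementation: the fetch_and_add() hook is assumed, and initialization is simplified to a single caller.

    /* Shared scheduler state for one dynamic loop (one loop at a time). */
    static int g_next, g_ub, g_stride, g_chunk;

    /* Hypothetical hardware-assisted atomic fetch-and-add on shared memory. */
    extern int fetch_and_add(int *addr, int value);

    int loop_dynamic_next(int *start, int *end);

    int loop_dynamic_start(int lb, int ub, int stride, int chunk, int *start, int *end)
    {
        /* Simplified: in a real runtime only one thread would initialize this state. */
        g_next = lb; g_ub = ub; g_stride = stride; g_chunk = chunk;
        return loop_dynamic_next(start, end);
    }

    int loop_dynamic_next(int *start, int *end)
    {
        int step = g_chunk * g_stride;
        int s = fetch_and_add(&g_next, step);      /* claim the next chunk of iterations */
        if (s >= g_ub)
            return 0;                              /* no iterations left                 */
        *start = s;
        *end   = (s + step < g_ub) ? s + step : g_ub;
        return 1;                                  /* there is work left                 */
    }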

This mechanism can be easily augmented to wrap each CHUNK of loop iterations within an error-resilient transaction (ERT). Thus, transaction granularity at the application level may be adjusted by modifying the CHUNK parameter or with OpenMP loop scheduling clauses. This is important for performance as well as energy efficiency, since transaction granularity can impact the error rate in our context.

The same scheme can be easily applied to other OpenMP constructs (sections, task, etc.).

Moreover, to ensure robust execution at every point in program execution, we silently define ERT boundaries wherever an OpenMP construct is encountered. The sequential execution in the master thread is also wrapped in an ERT. Additional ERTs can be manually outlined in the code if necessary.

5.4 Experimental Results

The proposed architecture has been modeled in the VSoC simulator. As previously described in Chapter 4, VSoC is a SystemC-based cycle-accurate virtual platform for heterogeneous

System-On-Chip simulation, with back-annotated energy numbers for every system component.

The performance, energy, and area numbers are derived from an implementation of the platform in

STMicroelectronics 28nm UTB FD-SOI technology. This approach couples the advantages of very accurate power models with the simulation speed of the SystemC models. On average, the virtual platform shows a maximum error in timing accuracy below 6% with respect to a complete RTL 103 simulation of the same benchmark. Evaluations were conducted using real-life benchmarks from the computer vision domain: Rotate (image rotation), Strassen (matrix multiplication), Fast (corner detection) and Mahalanobis-Distance (cluster analysis and classification).

5.4.1 Overhead characterization

As a first experiment, we measure the overhead of the proposed HW/SW support for error-resilience in terms of energy and execution delay. The energy overhead for the proposed technique is quite modest: on average, only 1.7% across all benchmarks and never more than 5% of the total system energy. Similarly, execution time overhead is a reasonable 6.6% maximum.

Further detailing the analysis for distributed logs, on average, transactional writes took 0.7 to

1.5 extra cycles to complete and increased total execution time by only 0.5%. The log restoration time clearly depends on the data footprint of the target application: the higher the number of writes within a transaction, the bigger the size of the logs. For each address that needs to be restored, 2 cycles are spent, one for reading the original value from the log space and one for writing it back to its original address. The worst case restoration time per core in isolation is 32 cycles, which may be up to 8 times slower in the unlikely event that all cores were rolling back at the same time. Log restoration time never accounted for more than 3% of the total benchmark execution time. The area overhead of the proposed scheme is quite small. In particular, the distributed per-address logging scheme is space-efficient; the total log space occupies only 3% of the TCDM space. Moreover, based on [21], the area overhead introduced by EDS is nominal.

5.4.2 Energy characterization

Next, we conduct a set of measurements to assess the energy saving capabilities of the technique.

The effect of static within-die variations is modeled in the platform by considering different nominal voltages for the target frequency (200MHz) among cores, with a maximum variation of 0.04V (0.84V to 0.88V). To explore how the lowest safe voltage level changes due to temperature variations, we run experiments at three different temperature corners (25°C, −40°C and 125°C).

[Figure 5.4 data: total system energy of SV and TM at reference voltages 1V, 0.98V and 0.96V, normalized to SV at 1V, for ROTATE, STRASSEN, FAST, MD and their average; the average TM savings are 43%, 41% and 38% at the three reference voltages.]

Figure 5.4: Energy consumption at −40°C. Steady voltage (SV) versus transactional memory (TM).

[Figure 5.5 data: total system energy of SV and TM at reference voltages 1V, 0.98V and 0.96V, normalized to SV at 1V, for ROTATE, STRASSEN, FAST, MD and their average; the average TM savings are 30%, 25% and 22%.]

Figure 5.5: Energy consumption at 25°C. Steady voltage (SV) versus transactional memory (TM).

We compare the transactional memory inspired techniques (TM) to a conservative steady-voltage

(SV) technique, which uses voltage margins (guardbands) to absorb the effects of static and dynamic variations. From the measurements on silicon we observe that in the worst case, a 0.96V operating voltage would be sufficient to compensate for static and temperature variations. In practice, to also account for other sources of dynamic variations (e.g., aging, voltage drops) even more conservative voltage margins would be necessary. Thus, for each temperature corner we consider three reference voltage levels for the SV configurations (1.0V, 0.98V and 0.96V) that remain unchanged throughout execution.

[Figure 5.6 data: total system energy of SV and TM at reference voltages 1V, 0.98V and 0.96V, normalized to SV at 1V, for ROTATE, STRASSEN, FAST, MD and their average; the average TM savings are 17%, 12% and 6%.]

Figure 5.6: Energy consumption at 125°C. Steady voltage (SV) versus transactional memory (TM).

Figures 5.4, 5.5 and 5.6 show for each temperature corner the total system energy consumption of each configuration, normalized to the baseline SV configuration at reference voltage 1.0V. For each application we see two groups of three bars. The three leftmost bars correspond to the SV technique, for the three reference voltage levels (1 V, 0.98 V and 0.96 V). The three rightmost bars correspond to the TM technique, starting at the three different reference voltages. The bars at the end of each figure show the average energy improvement over all benchmarks.

We observe that the TM technique achieves significant energy savings compared to conservative execution at a steady reference voltage, for each temperature corner. Intuitively, we observe that at lower temperatures the energy improvement is significantly better compared to higher temperature corners; at lower temperatures the COP moves toward lower voltage levels, leading to larger energy savings. For example, when using a reference voltage of 1.0V and operating at −40°C, on average the TM technique can save 43% of the energy consumed by the conservative SV technique. Even when the reference voltage is lower, the TM configurations still achieve better energy savings (e.g., at −40°C, on average TM-0.98V is 41% more energy efficient than SV-0.98V and TM-0.96V is 38% more efficient than SV-0.96V). At ambient temperatures, energy savings diminish, but are still quite substantial (i.e., 30%, 25%, and 22% on average relative to the 3 different reference voltages). Even at 125°C energy savings can still be realized (i.e., 17%, 12%, and 6% relative to the 3 respective reference voltages). Overall, results show that the proposed technique is a robust, versatile, and cost-effective technique for saving energy while guaranteeing safe execution.

5.5 Summary and Discussion

This chapter introduced a novel HW/SW scheme, adapted from Hardware Transactional Memory, that dynamically adjusts the operating voltage to an evolving COP in order to operate at highly reduced margins. The scheme was integrated into the OpenMP model, making it easy to program and adjust transaction granularity. Experimental results demonstrate that the proposed technique is particularly effective at saving energy while also offering safe execution guarantees. In particular, energy improvements vary from 6% up to 43% depending on the chosen reference voltage and temperature corner, while the energy and execution time overhead is relatively low. Based on these findings we draw the conclusion that operating dangerously close to the COP instead of using conservative guardbands pays off when the proposed lightweight HTM mechanism is used. To the best of our knowledge, this is the first full-fledged implementation of HTM for error-resilient execution that targets reducing energy consumption.

This work could be extended in several directions. First, we could consider a broader range of strategies for adjusting the operating point due to COP variations. Specifically, we could design more complex strategies that adjust not only the voltage but also the frequency based on variations.

Second, it would be interesting to explore more flexible solutions for adjusting the voltage, since the current solution is rather strict, increasing the voltage immediately after the first failure is encountered.

Chapter 6 addresses this issue by introducing a new adaptive error policy that not only is more flexible and promises better energy savings, but can also handle a broader range of error types.

Chapter 6

Adaptive voltage scaling policies for improving energy savings at near-edge operation

In Chapter 5, a novel HTM-based scheme was presented that dynamically adjusts the operating voltage to an evolving COP to operate at highly reduced margins. While this scheme proved to have great potential in saving energy, it has some key limitations that need to be addressed. First, it is based on a lazy error resolution scheme, meaning that the policy checks whether errors have occurred only when the end of each transaction is reached, and only then activates the roll-back mechanism.

While a lazy error recovery mechanism is appropriate for non-critical errors, it is not appropriate for critical errors. Non-critical errors are non-systematic errors that take place in the data path of the processor pipeline and can result in incorrect data being stored in memory. Critical errors take place in the control part of the processor pipeline and can break the original control flow of the program and prevent any software-based solution from taking control. For these types of errors a prompt reaction is necessary. For this purpose, a new design is proposed that employs eager error resolution

to address both critical and non-critical errors. By choosing eager resolution, no time is wasted from the moment of error occurrence to the moment of error resolution.

Second, while the proposed scheme is capable of recovering both from intermittent timing errors and the COP, it is not very flexible in how it addresses intermittent timing errors, in the sense that when the first failure occurs, the voltage level is immediately increased back to a safe level in order to avoid massive instruction failures. Thus, if sporadic timing errors occur, they are essentially treated as if they were caused by crossing a critical operating point, while in fact, as long as the error itself was corrected, it would still be safe and reasonable to continue operating at the same voltage level. As we saw in Section 2.2.5, according to the COP error model, after the critical voltage is surpassed (for a given frequency level), massive errors emerge and the voltage needs to be immediately increased back to a safe level. But intermittent timing errors follow a different trend (Figure 2.4). They start emerging with a very low error rate, which later increases exponentially as the voltage is further scaled down. So, there is a range of voltage levels before the point of massive instruction failures.

With the previous algorithm we would stop decreasing the voltage level immediately after the point of first failure (POFF). This would not allow us to reach the best energy saving potential.

In this chapter, new software-directed adaptive error policies are explored that work on top of the underlying HTM support. These policies optimistically lower the voltage beyond the POFF, allowing multiple intermittent timing errors to occur, and make voltage adjustment decisions based on the number of consecutive commits and aborts in order to achieve better energy savings while still correctly executing code. The challenge here is that there is a fine trade-off between the energy that is saved from scaling the voltage and the energy that is lost due to error recovery and re-execution.

So the chosen error policy must be carefully tuned taking energy savings into account.

6.1 Addressing critical and non-critical errors

As mentioned before, there are two distinct types of errors that can occur:

1. Non-critical errors are those that originate from timing delays along the datapath (e.g., the multiplier) and ultimately lead to writing a bad value to memory.

2. Critical errors are those that occur in the control part of the processor pipeline (instruction fetch/decode) and ultimately lead to catastrophic failures.

While non-critical errors can result in incorrect data being stored in memory, critical errors can break the original control flow of the program and prevent any software-based solution from taking control. Thus, while the lazy error resolution scheme proposed in Chapter 5 can deal with non-critical errors, it cannot handle critical errors. For this reason, in the new design proposed here, an eager error resolution mechanism is employed for both types of errors. EDS [21] can be used to monitor all paths of the processor pipeline. When an error occurs in the control part of the pipeline, indicating a critical error, the EDS generates an interrupt to the core. Since the core's state is corrupted and the core cannot serve an interrupt, the interrupt line can be intercepted by a special hardware block (or a safe core that operates at a safe voltage level). The hardware block will apply a temporary voltage boost to the core so that the core can resume operation. This can be done by applying a forward body bias (FBB) [80, 100, 101], a technique that reduces the threshold voltage and lowers gate delay, thus temporarily increasing performance. The core's pipeline will be flushed and the core will be ready to jump to an interrupt service routine (ISR) that activates the rollback mechanism. Non-critical errors are also resolved in an eager manner, but an FBB process is not necessary, since the affected core should be able to handle the recovery alone.
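The recovery sequence for a critical error could be organized as in the following C sketch; the interrupt plumbing and all function names (apply_fbb_boost(), restore_logs_all_banks(), ...) are hypothetical and only illustrate the order of steps described above.

    /* Hypothetical platform hooks; names are illustrative only. */
    extern void apply_fbb_boost(int core);        /* temporary forward body bias (FBB)   */
    extern void flush_pipeline(int core);
    extern void restore_logs_all_banks(int core); /* DVM-driven rollback of written data */
    extern void restore_checkpoint(int core);     /* PC, SP, registers, stack            */

    /* Executed by the safe hardware block (or safe core) that intercepts the EDS
     * interrupt of a core whose control path has failed, followed by the ISR the
     * recovered core jumps into. */
    void handle_critical_error(int core)
    {
        apply_fbb_boost(core);         /* make the failing core able to execute again     */
        flush_pipeline(core);          /* discard the corrupted in-flight instructions    */
        restore_logs_all_banks(core);  /* ISR: undo the transaction's speculative writes  */
        restore_checkpoint(core);      /* ISR: return to the state at Transaction Start   */
        /* The error policy (Section 6.2) then decides how the voltage is adjusted
         * before the transaction is retried. */
    }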

6.2 Error policy design

As mentioned earlier, our goal is to design a new adaptive error policy that optimistically lowers the voltage beyond the point of first failure (POFF) allowing multiple intermittent timing errors to occur. Unlike the previously proposed policy that increases the voltage immediately after the

POFF, the new policy will tolerate sporadic errors and make voltage adjustment decisions based on different criteria (e.g., the expected error rate, the number of commits and aborts) in order to achieve better energy savings. Since there is a fine tradeoff between the energy saved due to voltage scaling and the energy expended due to error recovery, choosing the factors that will determine how voltage should be adjusted is not a straightforward process.

When an error occurs, the error policy should be able to determine whether that error was sporadic or not and if retrying the failed transaction at the same voltage level will likely be successful.

One way to decide that is by doing online error monitoring and estimating the number of errors that are expected to occur if the transaction is retried at the same voltage level. For example, we could monitor the errors experienced in the recent time window (TW cycles) and estimate the experienced error rate as:

Experienced Error Rate = ErrorCount / TW,    (6.1)

where ErrorCount is the number of errors detected over the last time window, TW. At the same time, we can estimate the size of the currently running transaction (e.g., S cycles) by monitoring the sizes of the most recent transactions and using their average value. Using this information, we can estimate the number of errors that are expected to occur if the failed transaction is repeated:

Expected Errors = Experienced Error Rate · S, (6.2)

If the number of Expected Errors (EE) is less than some preset threshold, we can assume that it is worthwhile re-executing the transaction at the same voltage level. Otherwise, we can assume that repeating the transaction at the same voltage level will likely lead to new errors and increase the voltage before re-executing.
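Expressed in code, the retry decision based on Equations 6.1 and 6.2 could look like the sketch below; the counters and the use of an averaged recent transaction size are assumptions of the sketch, not part of the proposed hardware.

    #include <stdbool.h>

    /* Decide whether a failed transaction should be retried at the current
     * voltage level, following Equations 6.1 and 6.2. */
    bool retry_at_same_voltage(unsigned error_count,   /* errors seen in the last window   */
                               unsigned tw_cycles,     /* size of the time window (cycles) */
                               unsigned avg_tx_cycles) /* estimated transaction size S     */
    {
        double experienced_error_rate = (double)error_count / (double)tw_cycles; /* Eq. 6.1 */
        double expected_errors        = experienced_error_rate * avg_tx_cycles;  /* Eq. 6.2 */
        return expected_errors < 1.0;  /* retry only if another error is unlikely */
    }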

Since EE is just an estimate, another decision factor for adjusting the voltage could be the number of aborts that are encountered in a particular transaction. For example, we could say that after a certain number of consecutive aborts the voltage should be increased. Deciding this abort threshold would require empirical evaluation. An alternative way to decide it would be as follows:

The energy consumed to complete a transaction can be expressed as:

Etrans(V,S) = Ecomp(V,S) + Nabort · (Ecomp(V,S) + Erecov(V,S)),    (6.3)

where Etrans(V,S) is a function of the supply voltage V and the size S of the transaction. Ecomp(V,S) is the energy consumed during transaction execution, Nabort is the number of times the transaction was rolled back, and Erecov(V,S) is the energy consumed by rolling back the transaction.

In order to save energy, we want Etrans(V,S) < Etrans(Vref ,S), where Vref is the reference voltage (i.e., the safe voltage when guardbands are considered). Therefore,

Nabort < (Etrans(Vref,S) − Ecomp(V,S)) / (Ecomp(V,S) + Erecov(V,S)).    (6.4)

By running simulations for various voltage levels and transaction sizes, we can estimate the average Erecov and Ecomp for different values of S and V , and use these estimations to restrict the number of aborts. This evaluation process guarantees that energy is saved, but it does not guarantee that the energy savings are maximized. Maximizing energy savings would require extensive computation over varying values of S, V , Nabort, which is likely to be impractical for real-time analysis in hardware. If our goal is not just to save energy compared to the reference voltage level, but also restrict the energy consumption within a certain percentage (x%) of the reference voltage level, we could consider that in the above equation.
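A small helper that turns Equation 6.4 into an abort budget might look as follows, assuming the energy terms come from such offline pre-characterization for the given S and V; the function name and interface are illustrative.

    /* Maximum number of aborts that still saves energy relative to the
     * reference voltage, following Equation 6.4. */
    int max_aborts(double e_trans_ref,   /* Etrans(Vref,S): energy at the reference voltage */
                   double e_comp,        /* Ecomp(V,S): one execution at the scaled voltage */
                   double e_recov)       /* Erecov(V,S): one rollback at the scaled voltage */
    {
        double bound = (e_trans_ref - e_comp) / (e_comp + e_recov);
        return bound > 0.0 ? (int)bound : 0;   /* otherwise no aborts can be afforded */
    }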

Alternatively, instead of defining the maximum abort threshold a priori through energy characterization experiments, we could accumulate the energy consumed during execution and increase the voltage back to a safe reference voltage level when the energy that has been consumed approaches the energy that would be consumed at the nominal voltage level. This approach is reactive rather than proactive.

Even if the voltage is increased in response to errors, we might still consider lowering it again later for different reasons. For example, the error rate might drop due to temperature fluctuations or the size of transactions might become smaller, allowing more transactions to succeed at the same error rate. That is why another factor could be employed: the number of consecutive commits. After

The transaction size plays an important role as well. It should be small enough that the transaction is likely to finish without encountering an error but large enough so that the accumulated time and energy overhead for checkpointing and error recovery is reasonable. Through experiments we can get useful feedback on the average energy and recovery time overhead and use this information to statically tune the transaction size.

Considering all the above decision factors, an effective error policy may follow the flowchart shown in Figure 6.1. This policy works as follows: The supply voltage is continuously scaled down until the POFF is reached. When the first error occurs, the error policy estimates the number of errors that are expected to occur when the transaction is repeated, based on Equations 6.1 and 6.2. If this number is greater than or equal to 1, indicating that an error will likely occur when the transaction is re-executed, the voltage is increased. If not, then the policy decides to optimistically stay at the same voltage level and retry the transaction. The procedure is repeated every time an error occurs. If at any voltage level new errors emerge, they can be tolerated up to a certain point. If the pre-defined maximum abort threshold is reached, then the voltage must be increased again to avoid further energy loss due to aborts and re-execution. On the other hand, if at a certain voltage level a transaction reaches a number of consecutive commits, then the policy decides to optimistically lower the voltage again for further energy savings. The maximum abort and consecutive commit thresholds can be empirically chosen or they can be defined a priori through energy pre-characterization experiments.

Figure 6.1: Example of an error policy decision flow based on: expected error rate, number of consecutive aborts and number of consecutive commits.

While this policy could effectively tolerate errors and increase energy savings, it requires online monitoring of the transaction size and the error rate at each voltage level. It might be complicated to acquire such knowledge especially when the voltage is frequently changed and aborts occur.

For this reason, an alternative, much simpler design is chosen that does not require online error monitoring and makes decisions based on the experienced commits and aborts. This alternative design is described in the next section.

6.3 The Thrifty uncle/Reckless nephew policy

As mentioned earlier, scaling the voltage to achieve energy savings has a caveat: while scaling the voltage reduces energy consumption, it also increases the transaction abort rate due to errors, which in turn leads to energy loss from transaction recovery and re-execution. To balance these two opposing effects, we create an adaptive approach we call the Thrifty uncle/Reckless nephew policy¹.

The policy has two parts:

• the reckless nephew scales the voltage down for better energy savings, while

• the thrifty uncle tries to moderate the energy losses due to aborts that increase as a consequence of voltage scaling, by setting up a threshold for voltage scaling.

The reckless nephew decides whether to reduce the voltage based on the number of consecutive successful transactions. When transactions fail consecutively, the thrifty uncle intervenes and increases the voltage to avoid further failures. The uncle influences the nephew's decisions by setting up the threshold of successful transactions that are required for voltage scaling. This threshold is determined by the number of consecutive commits and aborts. Essentially, the thrifty uncle tries to determine the operating voltage level that will allow for energy savings and guarantee forward progress while reducing the energy and time overhead due to transaction failures. Choosing an ideal voltage level without knowledge of the error rate and the actual transaction size is a challenging task. Luckily, the number of experienced consecutive aborts and commits can be a very good indicator of how the voltage should change.

Figure 6.2 shows a flowchart of how the proposed policy works. Starting from a safe reference level, the voltage is scaled in small steps, creating multiple operating voltage levels. When a transaction starts, the policy checks whether this is a failed transaction that is re-starting or a new one. If it is a new transaction, then the nephew must decide whether the voltage should be decreased first. This decision is based on the number of consecutive successful transactions that have preceded this one, i.e., the number of consecutive commits. If this number is greater than a pre-defined threshold C, then the voltage can be safely reduced by one step. Otherwise, no voltage change is allowed.

If it is a failed transaction that is being re-executed, then the thrifty uncle must decide whether the transaction should be re-executed at the same voltage or the voltage should be increased. If this transaction has failed consecutively and the number of consecutive aborts is greater than a pre-defined threshold A, then the uncle increases the voltage by one step. Otherwise, the transaction is re-executed at the same voltage level.

¹A reference to the thrifty uncle Scrooge McDuck and reckless nephew Donald Duck of Disney's Duck family.

Figure 6.2: The ‘Thrifty uncle/Reckless nephew’ policy.

Deciding the threshold C that will allow the voltage to be scaled again in the future is a tricky process, since the error rate and the transaction size are not known. In our policy this threshold is initially set to 1 and is later adapted based on the experienced consecutive aborts and commits. Thus, every time the consecutive abort threshold (A) is exceeded, the uncle realizes that the current voltage level is likely dangerous and not only increases the voltage but also doubles the C threshold, to make it more difficult for the nephew to later come back to that level. However, if the error rate is reduced in the future (e.g., due to a temperature drop) or transactions become smaller, then a lower voltage level might be sustainable. In that case, C must be reduced again to allow an easier transition to lower voltage levels. In this policy, if the voltage is reduced twice in a row without any aborts in-between, then the uncle divides the threshold C by 2, making it easier for the nephew to further reduce the voltage in the future. Threshold A must also be chosen wisely to allow forward progress. Having consecutive aborts means that the error rate is so high that the transaction is not able to complete without encountering an error. It could also mean that the transaction size is so large that even sporadic errors do not allow it to complete. Ideally, A should be determined based on the accumulated energy, as described in Section 6.2. Since we are not monitoring the energy for this policy, we pick A empirically. Having two aborts in a row can be coincidental in the case of sporadic errors, but more than three aborts in a row could mean that the error rate is elevated or the transaction size is too big for the error rate. Thus, for this implementation A is set to 3. In the next section, the Thrifty uncle/Reckless nephew policy is evaluated in terms of energy and execution time.
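To make the decision flow of Figure 6.2 concrete, the following C sketch renders the policy as described above. It is an illustration written for this summary, not the actual implementation; in particular the platform hooks (step_voltage_down/up) and the exact points at which the counters are reset are assumptions:

#include <stdbool.h>

/* Counters and thresholds; A = 3 as chosen in the text, C starts at 1 and
 * is adapted at run time.                                                  */
static int A = 3;                     /* consecutive-abort threshold         */
static int C = 1;                     /* consecutive-commit threshold        */
static int consec_commits = 0;
static int consec_aborts  = 0;
static int drops_without_abort = 0;   /* voltage reductions since last abort */

extern void step_voltage_down(void);  /* hypothetical platform hooks */
extern void step_voltage_up(void);

/* Called when a transaction starts; 'restarting' is true if this is a failed
 * transaction being re-executed.                                            */
void un_policy_on_start(bool restarting)
{
    if (!restarting) {
        /* Reckless nephew: lower the voltage after enough consecutive commits. */
        if (consec_commits > C) {
            step_voltage_down();
            consec_commits = 0;
            if (++drops_without_abort >= 2) {   /* two reductions, no aborts between */
                C = (C / 2 > 0) ? C / 2 : 1;    /* make further reductions easier    */
                drops_without_abort = 0;
            }
        }
    } else {
        /* Thrifty uncle: back off after too many consecutive aborts. */
        if (consec_aborts > A) {
            step_voltage_up();
            C *= 2;                  /* make it harder to come back to this level */
            consec_aborts = 0;
            drops_without_abort = 0;
        }
    }
}

void un_policy_on_commit(void) { consec_commits++; consec_aborts = 0; }
void un_policy_on_abort(void)  { consec_aborts++;  consec_commits = 0; drops_without_abort = 0; }

In this rendering, un_policy_on_start() would be invoked by the transaction runtime at every start or re-start, mirroring the "New?" decision at the top of the flowchart.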

6.4 Experimental Results

This section presents an evaluation of the Thrifty uncle/Reckless nephew policy in terms of energy consumption and execution time overhead. We use the same benchmarks as in Section 5.4 and the VSoC simulator. An extra synthetic benchmark has been added that executes a set of reads and writes on a shared vector. First, we evaluate the policy in terms of energy consumption and then we analyze the overhead in time and energy. Lastly, we study how transaction size can affect the obtained energy savings.

6.4.1 Energy consumption

We compare the Thrifty uncle/Reckless nephew policy (UN) to a conservative steady-voltage (SV) technique which uses voltage margins (guardbands) to absorb the effects of static and dynamic variations. We also test whether the UN policy can actually achieve better energy savings compared to the previously proposed TM technique of Chapter 5, which increases the voltage immediately after the first failure. In Chapter 5 we experimented with three reference voltage levels (i.e., guardbands): 1.0 V, 0.98 V and 0.96 V. We saw that as the reference voltage decreases, smaller energy savings are seen compared to the SV configuration.

Figure 6.3: Single-core energy consumption normalized to the baseline SV configuration using a 20mV voltage scaling step. SV: Steady voltage configuration, TM: Transactional Memory-based technique of Chapter 5, UN: Thrifty uncle/Reckless nephew policy.

For that purpose, here we test only using 0.96V as a reference voltage level to show the potential energy savings and expect even more savings if the guardbands become more strict. As in Chapter 5, in this work the target frequency is 200MHz, which corresponds to a nominal voltage of 0.84V. If no guardbands are used to account for variations, that is the point where the first failure is expected to occur. Our error model follows the curves reported by Fojtik et al. in [75].

Based on the intermittent timing error model (Figure 2.4), errors start emerging at a low error rate that increases exponentially as the voltage is further scaled down. Choosing a small voltage scaling step gives the opportunity for more experimentation around the lowest error rates, thus allowing the policy to more finely determine the optimal voltage level and achieve better energy savings. For this reason, we choose a step size of 20mV for our first experiment. Since such fine-tuning of the voltage level might not be feasible in practice, we also test the policy at a larger step of 25mV to see how energy savings are affected by this choice.

Figure 6.4: Single-core energy consumption normalized to the baseline SV configuration using a 25mV voltage scaling step. SV: Steady voltage configuration, TM: Transactional Memory-based technique of Chapter 5, UN: Thrifty uncle/Reckless nephew policy.

Figures 6.3 and 6.4 show the total energy consumption of a single-core run for each of the SV, TM and UN configurations normalized to the baseline SV configuration at reference voltage 0.96V, for a 20mV and 25mV step respectively. At the right end of each graph we see the average results over all benchmarks. As we can see, in both cases TM and UN achieve significantly better results than SV. However, the UN configuration yields even better energy savings than the TM configuration.

Specifically, for a 20mV step, TM maintains a 41% improvement in energy over SV, similar to what we expected based on the results shown in Chapter 5. However, the UN policy reaches an even better improvement of 53% over SV and a 20% improvement over TM. This means that when carefully done, decreasing the voltage beyond the point of first failure and allowing errors to happen can bring even better energy savings compared to safely staying above the point of first failure. Even when the step is increased to 25mV, energy savings can still be realized, even though they are smaller compared to using a finer step. UN achieves a 50% improvement over SV and a 13% improvement over TM. Next, we discuss the overhead of the proposed technique.
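As a side note, the two improvement figures are mutually consistent (our own arithmetic, not additional data): if UN consumes 1 − 0.53 = 0.47 of the SV energy and TM consumes 1 − 0.41 = 0.59 of it, then UN relative to TM is 1 − 0.47/0.59 ≈ 0.20, i.e., the reported 20% improvement.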

6.4.2 Overhead characterization

Apart from energy consumption, we also measure the execution time overhead introduced by the transactions and the error policy. The UN policy runs every time a transaction starts or re-starts.

Each voltage adjustment takes 10 clock cycles. Based on measurements, the UN configuration has a 5% execution time overhead compared to the SV configuration. This overhead is due to the extra time needed to set up transactions (i.e., checkpointing and writing to the logs), the extra time introduced by the UN policy (i.e., time to execute the policy and adjust the voltage) and the delays associated with recovery and re-execution of failed transactions. The UN policy is 4% slower on average than the TM policy. This is because the UN configuration experiences more transaction aborts and re-executions compared to the TM configuration, since it operates at lower voltage levels. Moreover, the UN policy makes more voltage adjustments than the TM policy, since it adaptively sets the voltage based on the experienced number of commits and aborts. However, according to the statistics we gathered from experiments, the UN policy makes voltage adjustments in less than 2% of the times it runs, which means that it quickly learns the most sustainable voltage level.

6.4.3 Energy savings vs. transaction size

The transaction size is an important parameter for the UN policy. Depending on the experienced error rate, a transaction of smaller size might complete before errors emerge, while a bigger one might not get a chance to complete. For example, for a per-cycle error rate of 0.1% (that is, 1 error every 1000 cycles), a 200-cycle transaction has a very good chance of completing without seeing an error. In fact, 1 in every 5 transactions is expected to fail, but forward progress is guaranteed. However, a 1000-cycle transaction will almost certainly fail every time, not allowing for forward progress. The UN policy does not change the transaction size based on the error rate (even though that is a possibility for future work). However, it adaptively changes the voltage level based on the number of consecutive aborts encountered. Thus, in the case of the 1000-cycle transaction, where multiple consecutive aborts would occur, the UN policy would increase the voltage by one step so that the error rate is reduced and the voltage becomes sustainable. That would result in losing some potential energy savings, since the policy would have to keep the voltage at higher levels to guarantee forward progress.
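The 1-in-5 figure follows from a quick back-of-the-envelope calculation (ours, assuming independent per-cycle errors): the probability that a 200-cycle transaction encounters at least one error at a per-cycle rate of 0.001 is 1 − (1 − 0.001)^200 ≈ 1 − e^−0.2 ≈ 0.18, i.e., roughly one failed attempt in every five.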

Figure 6.5: Single-core energy consumption for different transaction sizes.

To test the effect of different transaction sizes on the obtained energy savings, we run the synthetic benchmark with four different transaction sizes: small (500 cycles), medium (5,000 cycles), large

(50,000 cycles) and extra large (500,000 cycles). Figure 6.5 shows the energy consumption for the

SV, TM and UN configurations normalized to SV for each transaction size. The UN policy achieves

58%, 53%, 46% and 37% improvement over the SV policy for the small, medium, large and extra large transaction sizes respectively. The energy improvement over the TM policy is 30%, 20% and 8% for small, medium and large transactions respectively. As expected, as the transaction size increases the energy savings decrease, but they are still significant. However, we observe that for the extra large transaction size, UN becomes worse than TM (energy consumption is increased by 3%). This is because at this point, the transaction size is so large that transactions do not get a chance to complete without encountering an error, even with the lowest error rate. In this case, it is better to increase the voltage and operate at a safe level above the point of first failure, as the TM policy does, instead of choosing the UN policy.

From the results presented here, we draw the conclusion that our Thrifty uncle/Reckless nephew policy can achieve significant energy savings compared to using guardbands. Moreover, even though our policy allows errors to occur and transactions to fail by operating at more dangerous voltage levels, it still yields better energy savings than a policy that immediately increases the voltage after the first failure, provided the transaction size is kept small enough. However, we should also consider that there is an overhead associated with transaction checkpointing, and making transactions too small could render this overhead overwhelming compared to the actual computation of the transaction.

Hence, there is a tradeoff between the transaction size and the overhead associated with transaction checkpointing, which we should take into account when setting the size of our transactions.

6.5 Summary and Discussion

In this chapter, we discussed new error policies that can improve the energy savings from voltage scaling by increasing the flexibility of addressing intermittent timing errors. A new, simple error policy was presented, the ‘Thrifty uncle/Reckless nephew’ (UN) policy, that can address both intermittent timing errors and the COP. This policy optimistically lowers the voltage beyond the point of first failure, allowing multiple intermittent timing errors to occur, and makes voltage adjustment decisions based on the experienced number of consecutive commits and aborts, in order to achieve better energy savings. Compared to the TM policy presented in Chapter 5, the UN policy achieves a 20% improvement in energy consumption for a fine voltage scaling step of 20mV (13% if the step is increased to 25mV) while being 4% slower. Compared to a policy that uses guardbands, the UN policy yields a 50-53% improvement in energy. Moreover, the UN policy can deal with a broader range of errors. While the TM policy addresses errors in a lazy manner and can handle only non-critical errors, the UN policy addresses errors eagerly and can deal not only with critical but also with non-critical errors. Overall, we conclude that even though the UN policy allows errors to occur by operating at more dangerous voltage levels, it still yields better energy savings than a policy that immediately increases the voltage after the first failure.

There are multiple ways in which this work could be extended. The current implementation of the UN policy does not take into account how energy is consumed during execution in order to restrict energy consumption within a certain range of the safe reference voltage level. If our goal is to achieve the maximum possible energy savings, we could take into account the energy accumulated during transaction re-execution using online energy monitoring. Another possible direction is to monitor the error rate and adapt the transaction size during execution to increase the transaction commit rate and guarantee forward progress without necessarily increasing the voltage. Finally, it would be interesting to see how the system behaves and what kind of energy savings we can obtain if we not only scale the voltage but also change the frequency or the temperature.

Chapter 7

Conclusions and future directions

In this thesis we proposed techniques inspired by speculative synchronization to improve the performance, energy-efficiency and error-resilience of multicore embedded systems.

In Chapter 3, we presented Embedded-Spec, an energy-efficient and lightweight implementation for transparent speculation on a shared-bus embedded multicore architecture. Embedded-Spec can operate in two speculative execution modes, the Embedded-LE mode that is based on lock elision and the Embedded-LR mode that is based on lock removal. Unlike most existing works on speculative synchronization, Embedded-Spec focuses not only on performance but also on energy-efficiency, since both are key constraints for embedded systems. The energy-delay product (EDP) was evaluated as a figure of merit that captures the trade-off between these two properties. Embedded-Spec also targets simplicity by proposing the addition of simple hardware structures that avoid complex changes to the underlying cache coherence protocol. Moreover, it offers a fully transparent solution, making it applicable to legacy code. Through an extensive set of experiments over various parameters such as number of cores, abort policy, sleep-modality, critical section size and retry policy,

Embedded-Spec showed that compared to traditional locking it can improve EDP to different degrees based on the chosen configuration.

Following the technology trend towards more scalable systems, in Chapter 4 we proposed a novel HTM scheme targeted to a cluster-based many-core embedded architecture. Specifically, driven by the need for simplicity and power-efficiency in modern embedded systems, we turned our focus to a system without caches and cache-coherence support. To the best of our knowledge, a speculative synchronization mechanism for this type of architecture had not been proposed before. Implementing an HTM scheme without caches and cache-coherence support presented many challenges, since

HTM traditionally relies on caches for data versioning and the cache coherence protocol for conflict management. This new HTM scheme required explicit data management and a fully-custom design of the transactional memory support. The design was based on the idea of distributing conflict detection and resolution across multiple Transaction Support Modules (TSMs) to make it scalable.

These modules work at the memory level, keeping track of read and write memory accesses to guarantee correctness. Two alternative data versioning management designs were proposed: a simple full-mirroring design and a more complex but memory-savvy distributed logging design. Results showed that both versions of the HTM scheme can achieve significant performance improvements over traditional lock-based schemes, ranging from 9% to 83% depending on the number of cores. While the current implementation is limited to single-cluster accesses, the proposed scheme is designed so that it is scalable and can be extended to multiple clusters.

In Chapter 5 we focused on another critical aspect for embedded systems, error-resilience. We introduced the first HTM-based design for error-resilient and energy-efficient MPSoC execution that allows operation at highly reduced supply voltage margins to save energy and can address not only intermittent timing errors but also the COP. The proposed scheme dynamically monitors the platform and adaptively adjusts the operating voltage to the evolving COP, using lightweight checkpointing and roll-back mechanisms adopted from Hardware Transactional Memory (HTM) for error recovery.

The scheme was compared to a conservative steady-voltage technique that uses voltage guardbands, at different temperatures and reference voltages. Results showed that it achieves significant energy improvements varying from 6% up to 43% at relatively low execution time overhead. The energy improvements were bigger at lower temperatures and higher reference voltages. We conclude that if the proposed HTM-based technique is used, then operating close to the COP instead of using conservative guardbands can not only guarantee error-resilience and forward progress but also yield significant energy savings.

In Chapter 6, we proposed new error recovery policies that improve the flexibility of addressing intermittent timing errors, thus yielding better energy savings compared to the technique presented in Chapter 5. Moreover, these techniques can address a broader range of error types by using eager error resolution instead of the previously used lazy error-resolution scheme. Thus, apart from non-critical errors they can also deal with critical errors that need prompt reaction. We proposed the ‘Thrifty uncle/Reckless nephew’ (UN) policy, a new and simple error policy for addressing intermittent timing errors and the COP. This policy optimistically lowers the voltage beyond the point of first failure, allowing multiple intermittent timing errors to occur, and makes voltage adjustment decisions based on the experienced number of consecutive commits and aborts, in order to achieve better energy savings. We compared this policy with the one proposed in Chapter 5 and found that it can improve energy by up to 20% while being 4% slower. Comparing it with a steady-voltage technique that uses guardbands, the UN technique can yield up to 53% energy improvement. We conclude that choosing a policy such as the UN policy, which allows errors to occur and transactions to fail by operating at more aggressive voltage levels below the point of first failure, pays off compared to increasing the voltage immediately after the first failure.

The work proposed in this thesis can be extended in multiple directions. By implementing for the first time an HTM scheme for a many-core embedded architecture without caches and cache-coherence support, we got a better understanding of how such an architecture can benefit from speculative synchronization when transactions are restricted within a single cluster. Now, it would be interesting to explore how this implementation can be extended to multiple clusters, providing inter-cluster transactional support. For that purpose, we need to build on the distributed nature of the transactional management scheme we designed and rethink the transactional bookkeeping mechanism to make it more scalable.

Regarding our work on error-resilient and energy-efficient execution, the proposed UN policy has shown great potential in saving energy and it can be improved in many ways. First, if we need to restrict energy consumption within a certain range, we can take into account how energy is accumulated during the execution with online energy monitoring. Second, we could monitor the error rate and adapt the transaction size accordingly in order to increase the commit rate of transactions and guarantee forward progress without necessarily increasing the voltage, as is currently done in the UN policy.

In the future we should also consider experimenting not only with voltage but also with frequency.

It would be interesting to see the combined effect of frequency and voltage scaling on both energy and performance. Since our work so far was focused on maximizing energy savings, we chose to approach the problem by keeping the frequency constant and scaling only the supply voltage. However, it may be advantageous to adjust both. For example, for applications with certain run-time constraints we could proactively increase the frequency and later adjust the voltage accordingly, to improve both performance and energy consumption. Moreover, for applications that have real-time constraints, the run-time degradation should be considered as well. Thus, for example, after raising the voltage due to consecutive transaction aborts, we could temporarily apply a frequency boost to compensate for the time lost during recovery.

Another interesting direction is to investigate how the proposed technique can be used in the approximate computing domain. Approximate and inexact computation has been studied before, but not in the context of HTM recovery mechanisms. For approximate computing applications, it may be more cost-effective to have inexact data than to pay the time and energy cost of roll-back and recovery. Our scheme offers applications that can tolerate approximate computations the possibility to intentionally ignore some non-critical errors, if we can determine that the error will still allow for approximately correct operation.

Overall, the vast range of needs and challenges facing embedded systems today in performance, energy-efficiency and error-resilience remains well worth pursuing.

Bibliography

[1] Cesare Ferri, Tali Moreshet, R. Iris Bahar, Luca Benini, and Maurice Herlihy. A hardware/-

software framework for supporting transactional memory in a mpsoc environment. SIGARCH

Comput. Archit. News, 35(1):47–54, March 2007.

[2] Dan Ernst, Nam Sung Kim, Shidhartha Das, Sanjay Pant, Rajeev Rao, Toan Pham, Conrad

Ziesler, David Blaauw, Todd Austin, Krisztian Flautner, and Trevor Mudge. Razor: A low-

power pipeline based on circuit-level timing speculation. In IEEE/ACM MICRO, pages 7–,

2003.

[3] C. Ferri, A. Marongiu, B. Lipton, T. Moreshet, R. I. Bahar, M. Herlihy, and L. Benini.

SoC-TM: Integrated HW/SW support for transactional memory programming on embedded

mpsocs. In CODES’11, pages 39–48, Taipei, Taiwan, October 2011.

[4] Daniele Bortolotti, Christian Pinto, Andrea Marongiu, Martino Ruggiero, and Luca Benini.

Virtualsoc: A full-system simulation environment for heterogeneous system-

on-chip. 2013 IEEE International Symposium on Parallel and Distributed Processing, 0:2182–

2187, 2013.

[5] Maurice Herlihy and J. Eliot B. Moss. Transactional memory: architectural support for lock-

free data structures. In Proceedings of the 20th Annual International Symposium on Computer

Architecture, pages 289–300. ACM Press, 1993. http://doi.acm.org/10.1145/165123.165164.


[6] Lance Hammond, Vicky Wong, Mike Chen, Brian D. Carlstrom, John D. Davis, Ben Hertzberg,

Manohar K. Prabhu, Honggo Wijaya, Christos Kozyrakis, and Kunle Olukotun. Transactional

memory coherence and consistency. In Proceedings of the 31st annual international symposium

on Computer architecture, ISCA ’04, pages 102–, Washington, DC, USA, 2004. IEEE Computer

Society. http://dl.acm.org/citation.cfm?id=998680.1006711.

[7] Kevin E. Moore, Jayaram Bobba, Michelle J. Moravan, Mark D. Hill, and David A. Wood.

LogTM: Log-based transactional memory. In HPCA, pages 254–265, 2006.

[8] Intel Corporation. Transactional Synchronization in Haswell. Retrieved from

software.intel.com/en-us/blogs/2012/02/07/transactional-synchronization-in-haswell/, 8 Sep

2012.

[9] Bit-tech.net. IBM releases ”world’s most powerful” 5.5GHz processor. Retrieved from www.bit-

tech.net/news/hardware/2012/08/29/ibm-zec12/1, 8 Sep 2012.

[10] C. Ferri, S. Wood, T. Moreshet, R. I. Bahar, and M. Herlihy. Embedded-TM: Energy and

complexity-effective hardware transactional memory for embedded multicore systems. Journal

of Parallel and Distributed Computing, 70(10):1042–1052, October 2010.

[11] C. Ferri, S. Wood, T. Moreshet, R. I. Bahar, and M. Herlihy. Energy and througput efficient

transactional memory for embedded multicore systems. In HiPEAC’10, Pisa, Italy, January

2010.

[12] Q. Meunier and F. Petrot. Lightweight transactional memory systems for NoCs based ar-

chitectures: Design, implementation and comparison of two policies. Journal of Parallel and

Distributed Computing, 70(10):1024–1041, October 2010.

[13] L. Kunz, G. Girão, and F. Wagner. Evaluation of a hardware transactional memory model in an NoC-based embedded MPSoC. In SBCCI, pages 85–90, São Paulo, Brazil, 2010.

[14] Ravi Rajwar and James R. Goodman. Speculative lock elision: enabling highly concurrent

multithreaded execution. In MICRO, pages 294–305, 2001.

[15] Kalray. MPPA 256 - Programmable . www.kalray.eu/products/mppa-

manycore/mppa-256/.

[16] NVIDIA. NVIDIA’s next generation CUDA compute architecture: Fermi. white paper,

NVIDIA, 2009.

[17] Plurality Ltd. The hypercore architecture, white paper. Technical Report version 1.7, April

2010.

[18] D. Melpignano, L. Benini, E. Flamand, B. Jego, T. Lepley, G. Haugou, F. Clermidy, and

D. Dutoit. Platform 2012, a many-core computing accelerator for embedded SoCs: performance

evaluation of visual analytics applications. In DAC, pages 1137–1142. ACM, June 2012.

[19] Adapteva. Epiphany-IV 64-core 28nm Microprocessor (E64G401). Retrieved from

http://www.adapteva.com/epiphanyiv/, 2013.

[20] J. Patel. CMOS process variations: A critical operation point hypothesis. web.stanford.

edu/class/ee380/Abstracts/080402-jhpatel.pdf, 2008.

[21] K.A. Bowman, J.W. Tschanz, S.L. Lu, P.A. Aseron, M.M. Khellah, A. Raychowdhury, B.M.

Geuskens, C. Tokunaga, C.B. Wilkerson, T. Karnik, and V.K. De. A 45nm resilient micropro-

cessor core for dynamic variation tolerance. JSSC, 46(1):194–208, Jan 2011.

[22] Dimitra Papagiannopoulou, Giuseppe Capodanno, Tali Moreshet, Maurice Herlihy, and R. Iris

Bahar. Energy-efficient and high-performance lock speculation hardware for embedded mul-

ticore systems. ACM Transactions on Embedded Computing Systems, 14(3):51:1–51:27, May

2015. 130

[23] D. Papagiannopoulou, T. Moreshet, A. Marongiu, L. Benini, M. Herlihy, and R. Iris Bahar.

Speculative synchronization for coherence-free embedded numa architectures. In Embedded

Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIV), 2014 Interna-

tional Conference on, pages 99–106, July 2014.

[24] D. Papagiannopoulou, A. Marongiu, T. Moreshet, L. Benini, M. Herlihy, and R. Iris Bahar.

Hardware transactional memory exploration in coherence-free many-core architectures. (Under

Review) International Journal of Parallel Programming, 2016.

[25] Dimitra Papagiannopoulou, R. Iris Bahar, Tali Moreshet, Maurice Herlihy, Andrea Marongiu,

and Luca Benini. Transparent and energy-efficient speculation on NUMA architectures for

embedded mpsocs. In Proceedings of the 1st International Workshop on Many-core Embedded

Systems 2013, MES’2013, Held in conjunction with the 40th Annual IEEE/ACM International

Symposium on Computer Architecture, ISCA 2013, June 24, 2013., pages 58–61, 2013.

[26] Dimitra Papagiannopoulou, Andrea Marongiu, Tali Moreshet, Luca Benini, Maurice Herlihy,

and Iris Bahar. Playing with fire: Transactional memory revisited for error-resilient and energy-

efficient mpsoc execution. In Proceedings of the 25th Edition on Great Lakes Symposium on

VLSI, GLSVLSI ’15, pages 9–14, New York, NY, USA, 2015. ACM.

[27] Maurice Herlihy and Nir Shavit. The Art of Multiprocessor Programming. Morgan Kaufmann,

1 edition, March 2008.

[28] T. E. Anderson. The performance of spin lock alternatives for shared-memory multiprocessors.

1(1):6–16, January 1990.

[29] Chenjie Yu and Peter Petrov. Distributed and low-power synchronization architecture for em-

bedded multiprocessors. In Proceedings of the 6th IEEE/ACM/IFIP International Conference

on Hardware/Software Codesign and System Synthesis, CODES+ISSS ’08, pages 73–78, New

York, NY, USA, 2008. ACM. 131

[30] Chengmo Yang and Alex Orailoglu. Light-weight synchronization for inter-processor communi-

cation acceleration on embedded mpsocs. In Proceedings of the 2007 International Conference

on Compilers, Architecture, and Synthesis for Embedded Systems, CASES ’07, pages 150–154,

New York, NY, USA, 2007. ACM.

[31] Antonino Tumeo, Christian Pilato, Gianluca Palermo, Fabrizio Ferrandi, and Donatella Sci-

uto. Hw/sw methodologies for synchronization in fpga multiprocessors. In Proceedings of

the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA ’09,

pages 265–268, New York, NY, USA, 2009. ACM.

[32] C. Yu and P. Petrov. Low-cost and energy-efficient distributed synchronization for embed-

ded multiprocessors. IEEE Transactions on Very Large Scale Integration (VLSI) Systems,

18(8):1257–1261, Aug 2010.

[33] Hao Xiao, Ning Wu, Fen Ge, T. Isshiki, H. Kunieda, Jun Xu, and Yuangang Wang. Efficient

synchronization for distributed embedded multiprocessors. IEEE Transactions on Very Large

Scale Integration (VLSI) Systems, 24(2):779–783, Feb 2016.

[34] M. Monchiero, G. Palermo, C. Silvano, and O. Villa. Efficient synchronization for embedded

on-chip multiprocessors. IEEE Transactions on Very Large Scale Integration (VLSI) Systems,

14(10):1049–1062, Oct 2006.

[35] J. H. Rutgers, M. J. G. Bekooij, and G. J. M. Smit. An efficient asymmetric distributed

lock for embedded multiprocessor systems. In Embedded Computer Systems (SAMOS), 2012

International Conference on, pages 176–182, July 2012.

[36] Olga Golubeva, Mirko Loghi, and Massimo Poncino. On the energy efficiency of synchroniza-

tion primitives for shared-memory single-chip multiprocessors. In Proceedings of the 17th ACM

Great Lakes Symposium on VLSI, GLSVLSI ’07, pages 489–492, New York, NY, USA, 2007.

ACM. 132

[37] Hyeonjoong Cho, B. Ravindran, and E. D. Jensen. Lock-free synchronization for dynamic

embedded real-time systems. In Design, Automation and Test in Europe, 2006. DATE ’06.

Proceedings, volume 1, pages 1–6, March 2006.

[38] Seung Hun Kim, Sang Hyong Lee, Minje Jun, Byunghoon Lee, Won Woo Ro, Eui-Young

Chung, and J. L. Gaudiot. C-Lock: Energy efficient synchronization for embedded multicore

systems. IEEE Transactions on Computers, 63(8):1962–1974, Aug 2014.

[39] J. Li, J. F. Martinez, and M. C. Huang. The thrifty barrier: energy-aware synchronization in

shared-memory multiprocessors. In Software, IEE Proceedings-, pages 14–23, Feb 2004.

[40] C. Liu, A. Sivasubramaniam, M. Kandemir, and M. J. Irwin. Exploiting barriers to opti-

mize power consumption of cmps. In Parallel and Distributed Processing Symposium, 2005.

Proceedings. 19th IEEE International, pages 5a–5a, April 2005.

[41] Tali Moreshet, R. Iris Bahar, and Maurice Herlihy. Energy-aware microprocessor synchroniza-

tion: Transactional memory vs. locks. In Workshop on Memory Performance Issues, February

2006. in conjunction with the International Symposium on High-Performance Computer Ar-

chitecture.

[42] Dave Dice, Yossi Lev, Mark Moir, and Daniel Nussbaum. Early experience with a commercial

hardware transactional memory implementation. SIGPLAN Not., pages 157–168, 2009.

[43] Nir Shavit and Dan Touitou. Software transactional memory. In Proceedings of the 14th ACM

Symposium on Principles of Distributed Computing, pages 204–213. Aug 1995.

[44] T. Harris, A. Cristal, O. S. Unsal, E. Ayguade, F. Gagliardi, B. Smith, and M. Valero. Trans-

actional memory: An overview. IEEE Micro, 27(3):8–29, May 2007.

[45] C. Scott Ananian, Krste Asanovic, Bradley C. Kuszmaul, Charles E. Leiserson, and Sean

Lie. Unbounded transactional memory. In ACM/IEEE International Symposium on High-

Performance Computer Architecture, February 2005. 133

[46] Ravi Rajwar, Maurice Herlihy, and Konrad Lai. Virtualizing Transactional Memory. In

ACM/IEEE International Symposium on Computer Architecture, June 2005.

[47] Luke Yen, Jayaram Bobba, Michael R. Marty, Kevin E. Moore, Haris Volos, Mark D. Hill,

Michael M. Swift, and David A. Wood. LogTM-SE: Decoupling hardware transactional mem-

ory from caches. In HPCA, pages 261–272, 2007.

[48] Arrvindh Shriraman, Sandhya Dwarkadas, and Michael L. Scott. Flexible decoupled

transactional memory support. SIGARCH Comput. Archit. News, 36(3):139–150, 2008.

http://doi.acm.org/10.1145/1394608.1382134.

[49] Jayaram Bobba, Neelam Goyal, Mark D. Hill, Michael M. Swift, and David A. Wood. To-

kenTM: Efficient execution of large transactions with hardware transactional memory. In

Proceedings of the 35th Annual International Symposium on Computer Architecture, ISCA

’08, pages 127–138, Washington, DC, USA, 2008. IEEE Computer Society.

[50] Ananian Scott and Rinard Martin. Efficient object-based software transactions. In In Pro-

ceedings, Workshop on Synchronization and Concurrency in Object-Oriented Languages, OOP-

SLA’05, San Diego, CA, USA, 2005.

[51] Maurice Herlihy, Victor Luchangco, Mark Moir, and William N. Scherer. Software transac-

tional memory for dynamic-sized data structures. In Proceedings of the twenty-second annual

symposium on Principles of distributed computing, PODC ’03, pages 92–101, New York, NY,

USA, 2003. ACM.

[52] Maurice Herlihy and Eric Koskinen. Transactional boosting: a methodology for highly-

concurrent transactional objects. In Proceedings of the 13th ACM SIGPLAN Symposium on

Principles and practice of parallel programming, PPoPP ’08, pages 207–216, New York, NY,

USA, 2008. ACM. 134

[53] V. Marathe, W. Scherer, and M. Scott. Adaptive software transactional memory. Technical

Report TR 868, Computer Science Department, University of Rochester, May 2005.

[54] Peter Damron, Alexandra Fedorova, Yossi Lev, Victor Luchangco, Mark Moir, and Daniel

Nussbaum. Hybrid transactional memory. SIGOPS Oper. Syst. Rev., 40:336–346, October

2006. http://doi.acm.org/10.1145/1168917.1168900.

[55] Sanjeev Kumar, Michael Chu, Christopher J. Hughes, Partha Kundu, and Anthony Nguyen.

Hybrid transactional memory. In Proceedings of the Eleventh ACM SIGPLAN Symposium on

Principles and Practice of Parallel Programming, PPoPP ’06, pages 209–220, New York, NY,

USA, 2006. ACM.

[56] Moir Yossi, Lev Mark and Nussbaum Dan. Phtm: Phased transactional memory. In Workshop

on Transactional Computing, TRANSACT’07, 2007.

[57] Arrvindh Shriraman, Sandhya Dwarkadas, and Michael L. Scott. Flexible decoupled

transactional memory support. SIGARCH Comput. Archit. News, 36(3):139–150, 2008.

http://doi.acm.org/10.1145/1394608.1382134.

[58] Chi Cao Minh, JaeWoong Chung, Christos Kozyrakis, and Kunle Olukotun. STAMP: Stan-

ford transactional applications for multi-processing. In International Symposium on Workload

Characterization, Sept. 2008.

[59] F. Klein, A. Baldassin, G. Araujo, P. Centoducatte, and R. Azevedo. On the energy-efficiency

of software transactional memory. In Proceedings of the 22Nd Annual Symposium on Integrated

Circuits and System Design: Chip on the Dunes, SBCCI ’09, pages 33:1–33:6, New York, NY,

USA, 2009. ACM.

[60] Dave Dice, Ori Shalev, and Nir Shavit. Transactional Locking II. In In Proc. of the 20th Intl.

Symp. on Distributed Computing, 2006. 135

[61] Colin Blundell, E Christopher Lewis, and Milo M. K. Martin. Subtleties of transactional

memory atomicity semantics. Computer Architecture Letters, 5(2), Nov 2006.

[62] Saša Tomić, Cristian Perfumo, Chinmay Kulkarni, Adrià Armejach, Adrián Cristal, Osman

Unsal, Tim Harris, and Mateo Valero. Eazyhtm: Eager-lazy hardware transactional memory.

In Proceedings of the 42Nd Annual IEEE/ACM International Symposium on Microarchitecture,

MICRO 42, pages 145–155, New York, NY, USA, 2009. ACM.

[63] David Kanter. Analysis of Haswell’s Transactional Memory, February 2012. Retrieved from

http://www.realworldtech.com/haswell-tm/.

[64] Andi Kleen. Scaling Existing Lock-based Applications with Lock Elision, February 2014.

Retrieved from http://queue.acm.org/detail.cfm?id=2579227.

[65] M. Pohlack and S. Diestelhorst. From lightweight hardware transactional memory to

lightweight lock elision. In TRANSACT, 2011.

[66] Dave Christie, Jae-Woong Chung, Stephan Diestelhorst, Michael Hohmuth, Martin Pohlack,

Christof Fetzer, Martin Nowack, Torvald Riegel, Pascal Felber, Patrick Marlier, and Etienne

Rivi`ere. Evaluation of AMD’s advanced synchronization facility within a complete transac-

tional memory stack. In EuroSys ’10, pages 27–40, New York, NY, USA, 2010. ACM.

[67] Ravi Rajwar and James R. Goodman. Transactional lock-free execution of lock-based pro-

grams. In ASPLOS, pages 5–17, 2002.

[68] Plurality Ltd. The hypercore processor. www.plurality.com/hypercore.html.

[69] D. Melpignano, et al. Platform 2012, a many-core computing accelerator for embedded SoCs:

Performance evaluation of visual analytics applications. In DAC, pages 1137–1142, 2012.

[70] S. Borkar, T. Karnik, S. Narendra, J. Tschanz, A. Keshavarzi, and V. De. Parameter variations

and impact on circuits and microarchitecture. In DAC, pages 338–342, June 2003. 136

[71] Veit B. Kleeberger, Petra R. Maier, and Ulf Schlichtmann. Workload- and instruction-aware

timing analysis: The missing link between technology and system-level resilience. In DAC,

pages 49:1–49:6, 2014.

[72] S. Narayanan, G. Lyle, R. Kumar, and D. Jones. Testing the critical operating point (COP)

hypothesis using FPGA emulation of timing errors in over-scaled soft-processors. In SELSE,

2009.

[73] M.R. Kakoee, I. Loi, and L. Benini. Variation-tolerant architecture for ultra low power shared-

l1 processor clusters. TCAS II, 59(12):927–931, Dec 2012.

[74] D. Bull, S. Das, K. Shivashankar, G.S. Dasika, K. Flautner, and D. Blaauw. A power-efficient

32 bit arm processor using timing-error detection and correction for transient-error tolerance

and adaptation to pvt variation. Solid-State Circuits, IEEE Journal of, 46(1):18–31, Jan 2011.

[75] M. Fojtik, D. Fick, Y. Kim, N. Pinckney, D.M. Harris, D. Blaauw, and D. Sylvester. Bubble

razor: Eliminating timing margins in an arm cortex-m3 processor in 45 nm cmos using archi-

tecturally independent error detection and correction. Solid-State Circuits, IEEE Journal of,

48(1):66–81, Jan 2013.

[76] S. Das, D. Roberts, Seokwoo Lee, S. Pant, D. Blaauw, T. Austin, K. Flautner, and T. Mudge.

A self-tuning dvs processor using delay-error detection and correction. IEEE Journal of Solid-

State Circuits, 41(4):792–804, April 2006.

[77] K.A. Bowman, J.W. Tschanz, Nam Sung Kim, J.C. Lee, C.B. Wilkerson, S.L. Lu, T. Karnik,

and V.K. De. Energy-efficient and metastability-immune resilient circuits for dynamic variation

tolerance. Solid-State Circuits, IEEE Journal of, 44(1):49–63, Jan 2009.

[78] J. Tschanz, K. Bowman, S. Walstra, M. Agostinelli, T. Karnik, and Vivek De. Tunable

replica circuits and adaptive voltage-frequency techniques for dynamic voltage, temperature, 137

and aging variation tolerance. In VLSI Circuits, 2009 Symposium on, pages 112–113, June

2009.

[79] Mridul Agarwal, Bipul C. Paul, Ming Zhang, and Subhasish Mitra. Circuit failure prediction

and its application to transistor aging. In Proceedings of the 25th IEEE VLSI Test Symmpo-

sium, VTS ’07, pages 277–286, Washington, DC, USA, 2007. IEEE Computer Society.

[80] J. Tschanz, Nam Sung Kim, S. Dighe, J. Howard, G. Ruhl, S. Vangal, S. Narendra, Y. Hoskote,

H. Wilson, C. Lam, M. Shuman, C. Tokunaga, D. Somasekhar, S. Tang, D. Finan, T. Karnik,

N. Borkar, N. Kurd, and V. De. Adaptive frequency and biasing techniques for tolerance to

dynamic temperature-voltage variations and aging. In Solid-State Circuits Conference, 2007.

ISSCC 2007. Digest of Technical Papers. IEEE International, pages 292–604, Feb 2007.

[81] Ming Zhang, T.M. Mak, J. Tschanz, Kee Sup Kim, N. Seifert, and D. Lu. Design for resilience

to soft errors and variations. In On-Line Testing Symposium, 2007. IOLTS 07. 13th IEEE

International, pages 23–28, July 2007.

[82] L. Leem, Hyungmin Cho, J. Bau, Q.A. Jacobson, and S. Mitra. ERSA: Error resilient system

architecture for probabilistic applications. In DATE, pages 1560–1565, March 2010.

[83] S. Dighe, S.R. Vangal, P. Aseron, S. Kumar, T. Jacob, K.A. Bowman, J. Howard, J. Tschanz,

V. Erraguntla, N. Borkar, V.K. De, and S. Borkar. Within-die variation-aware dynamic-

voltage-frequency-scaling with optimal core allocation and thread hopping for the 80-core

teraflops processor. JSSC, 46(1):184–193, Jan 2011.

[84] Abbas Rahimi, Daniele Cesarini, Andrea Marongiu, Rajesh K. Gupta, and Luca Benini. Im-

proving resilience to timing errors by exposing variability effects to software in tightly-coupled

processor clusters. JETCAS, 4(2):216–229, 2014.

[85] H. Q. Le, G. L. Guthrie, D. E. Williams, M. M. Michael, B. G. Frey, W. J. Starke, C. May, 138

R. Odaira, and T. Nakaike. Transactional memory support in the IBM POWER8 processor.

IBM Journal of Research and Development, 59(1):8:1–8:14, January 2015.

[86] Jons-Tobias Wamhoff, Mario Schwalbe, Rasha Faqeh, Christof Fetzer, and Pascal Felber.

Transactional encoding for tolerating transient hardware errors. In Stabilization, Safety, and

Security of Distributed Systems, volume 8255 of LNCS, pages 1–16. Springer Intl. Pub., 2013.

[87] Gulay Yalcin, Osman Sabri Unsal, and Adrian Cristal. Fault tolerance for multi-threaded

applications by leveraging hardware transactional memory. In Computing Frontiers, pages

4:1–4:9, 2013.

[88] G. Yalcin, A. Cristal, O. Unsal, A. Sobe, D. Harmanci, P. Felber, A. Voronin, J.-T. Wamhoff,

and C. Fetzer. Combining error detection and transactional memory for energy-efficient com-

puting below safe operation margins. In PDP 2014, pages 248–255, Feb 2014.

[89] Tim Harris, James R. Larus, and Ravi Rajwar. Transactional memory,

2nd edition. Synthesis Lectures on Computer Architecture, 5(1):1–263, 2010.

http://www.morganclaypool.com/doi/abs/10.2200/S00272ED1V01Y201006CAC011.

[90] Federico Angiolini, Jianjiang Ceng, Rainer Leupers, Federico Ferrari, Cesare Ferri, and Luca

Benini. An integrated open framework for heterogeneous MPSoC design space exploration. In

DATE ’06, pages 1145–1150. European Design and Automation Association.

[91] M. Horowitz, T. Indermaur, and R. Gonzalez. Low-power digital design. In Low Power

Electronics, 1994. Digest of Technical Papers., IEEE Symposium, pages 8–11, Oct 1994.

[92] STMicroelectronics. Nomadik platform. www.st.com, 2008.

[93] A Efthymiou and J.D. Garside. An adaptive serial-parallel cam architecture for low-power

cache blocks. In Low Power Electronics and Design, 2002. ISLPED ’02. Proceedings of the

2002 International Symposium on, pages 136–141, 2002. 139

[94] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown.

MiBench: A free, commercially representative embedded benchmark suite. In IEEE Interna-

tional Workshop on Workload Characterization, pages 3–14, 2001.

[95] C. Helmstetter and V. Joloboff. Simsoc: A systemc tlm integrated iss for full system simulation.

In IEEE Asia Pacific Conference, 2008.

[96] Sungpack Hong, T. Oguntebi, J. Casper, N. Bronson, C. Kozyrakis, and K. Olukotun. Eigen-

bench: A simple exploration tool for orthogonal tm characteristics. In Workload Characteri-

zation (IISWC), 2010 IEEE International Symposium on, pages 1–11, Dec 2010.

[97] P. Meinerzhagen, S. M. Y. Sherazi, A. Burg, and J. N. Rodrigues. Benchmarking of standard-

cell based memories in the sub-domain in 65-nm cmos technology. IEEE Journal on Emerging

and Selected Topics in Circuits and Systems, 1(2):173–182, June 2011.

[98] Woongki Baek, Chi Cao Minh, Martin Trautmann, Christos Kozyrakis, and Kunle Olukotun.

The OpenTM transactional application programming interface. In PACT, pages 376–387,

2007.

[99] www.openmp.org. Openmp application program interface v.3.0.

[100] J.T. Kao, M. Miyazaki, and A.P. Chandrakasan. A 175-mv multiply-accumulate unit using

an adaptive supply voltage and body bias architecture. Solid-State Circuits, IEEE Journal of,

37(11):1545–1554, Nov 2002.

[101] S. Narendra, A. Keshavarzi, B.A. Bloechel, S. Borkar, and V. De. Forward body bias for

in 130-nm technology generation and beyond. Solid-State Circuits, IEEE

Journal of, 38(5):696–701, May 2003.