RETUNES: Reliable and Energy-Efficient Network-on-Chip Architecture using Adaptive Routing and Approximate Communication

A thesis presented to the faculty of the Russ College of Engineering and Technology of Ohio University

In partial fulfillment of the requirements for the degree Master of Science

Padmaja Bhamidipati May 2019

© 2019 Padmaja Bhamidipati. All Rights Reserved. 2

This thesis titled RETUNES: Reliable and Energy-Efficient Network-on-Chip Architecture using Adaptive Routing and Approximate Communication

by PADMAJA BHAMIDIPATI

has been approved for the School of Electrical Engineering and Computer Science and the Russ College of Engineering and Technology by

Avinash Karanth Professor of Electrical Engineering and Computer Science

Dennis Irwin Dean, Russ College of Engineering and Technology 3 Abstract

BHAMIDIPATI, PADMAJA, M.S., May 2019, Electrical Engineering RETUNES: Reliable and Energy-Efficient Network-on-Chip Architecture using Adaptive Routing and Approximate Communication (92 pp.) Director of Thesis: Avinash Karanth As the number of processing cores are increasing in a chip multiprocessor (CMP), demand for an energy-efficient and reliable Network-on-Chip (NoC) architecture is increasing. However, energy consumption of NoC continues to increase with the exponential growth in CMPs. Voltage scaling techniques such as Dynamic Voltage and Frequency Scaling (DVFS) and Near Threshold Voltage (NTV) scaling have been proposed to reduce the energy consumption of NoC by scaling the operating voltage and frequency in proportion to the application demand. Apart from DVFS and NTV scaling, recently, approximate communication has been proposed to boost the power savings and reduce latency in NoC for the applications that are not sensitive to imprecise results within an acceptable variance. As transistor technology is scaling down to a few nanometers, aging effects such as Hot Carrier Injection (HCI) and Negative Bias Temperature Instability (NBTI) are increasing which worsens the reliability. Scaling down the transistor size along with the supply voltage increases the susceptibility of NoC to soft errors. Faults and disturbances due to aging and voltage scaling causes serious degradation in reliability of NoC. In this thesis, I propose RETUNES - Reliable and energy-efficient NoC design, where power efficient and fault tolerant architecture is modeled without compromising on the performance of NoC. The energy-efficient part of RETUNES is a five voltage/frequency design that includes NTV for high energy gains. The five voltage modes are switched according to the workload for high energy-efficiency and minimum network congestion in NoC. Energy efficiency of RETUNES is further improved by employing approximate 4 communication throughout the execution of application within tolerable error range. The reliability part of RETUNES introduces a hybrid error correction model to handle the faults observed due to aging, voltage scaling, and temperature. In addition to error correction and detection, RETUNES handles uneven aging in NoC which is caused by uneven distribution of traffic. Adaptive routing algorithm is modeled to even out the non-uniform device wear-out and thereby, minimize the impact of aging in NoC. RETUNES decreases power consumption and threshold voltage variation (∆Vth) during low network load with high reliability and increases the network performance during high network load with reduced reliability. Simulation results of RETUNES demonstrated nearly 2.5 × total power savings and 3 × improvement in Energy-Delay Product (EDP) of NoC for Splash-2 and PARSEC benchmarks on a 4 × 4 concentrated mesh architecture. Simulation results also showed 13% decrease in the energy consumption of NoC, 10% decrease in latency, and 19% EDP improvement by incorporating approximate communication technique. 5 Dedication

I would like to dedicate this thesis to my wonderful husband and my family, this would not have been possible without their support. 6 Acknowledgments

This work was partially supported by National Science Foundation (NSF) grants CCF- 1420718, CCF-1513606 and CCF-1703013. Firstly, I would like to thank Dr. Avinash Karanth for his support and direction. I would also like to thank my committee members Dr. Kaya, Dr. Stinaff, and Dr. Chenji for their feedback and information. 7 Table of Contents

Page

Abstract...... 3

Dedication...... 5

Acknowledgments...... 6

List of Tables...... 9

List of Figures...... 10

List of Acronyms...... 13

1 Introduction...... 15 1.1 Network-on-Chip...... 17 1.2 Energy Efficiency...... 18 1.2.1 Voltage Scaling...... 20 1.2.2 Approximate Computing...... 22 1.3 Reliability...... 24 1.3.1 Effects of Voltage Scaling and Temperature...... 24 1.3.2 Aging Effects...... 25 1.3.3 Error Mitigation...... 29 1.4 Major Contributions...... 32 1.5 Organization of Thesis...... 34

2 RETUNES: Reliable and Energy-Efficient Network-on-Chip...... 35 2.1 Prior Work...... 35 2.2 RETUNES Architecture...... 40 2.2.1 Energy Efficiency (EE-Layer)...... 41 2.2.1.1 Voltage Scaling...... 41 2.2.1.2 Approximate Communication...... 45 2.2.2 Reliability (R-Layer)...... 49 2.2.2.1 Unified Reliability Model...... 49 2.2.2.2 Encoding Framework...... 50 2.2.2.3 Adaptive Routing...... 55 2.3 Centralized Control Unit...... 57 8

3 Performance Evaluation...... 62 3.1 RETUNES Evaluation Approach...... 64 3.2 RETUNES Results...... 64 3.2.1 Power and Area Overhead Analysis...... 65 3.2.2 Packet Latency Analysis...... 66 3.2.3 Lifetime Evaluation...... 67 3.2.4 Reliability Analysis...... 70 3.2.5 Energy-Delay Product...... 71 3.3 Approximate Communication Evaluation...... 72 3.3.1 Packet Latency Analysis...... 73 3.3.2 Power and Energy Analysis...... 75 3.3.3 Energy-Delay Product (EDP) Analysis...... 76

4 Conclusions and Future Work...... 79

References...... 81 9 List of Tables

Table Page

2.1 Traffic load (Flits/cycle), temperature (Celsius) and delay overhead (cycles) calculated for the corresponding voltage modes of RETUNES...... 45

3.1 Applications used in the design...... 63 10 List of Figures

Figure Page

1.1 Microprocessor trend for the past four decades where frequency gains and single-thread performance no longer provide sufficient gains [Rup18]...... 16 1.2 Common NoC topologies...... 18 1.3 Router microarchitecture with cross bar and router pipeline stages (left) and Network Interface which serves as a connection between router and its cores (right)...... 19 1.4 Maximum Energy Point (MEP) that is observed in the NTV region[Yu]..... 22 1.5 Variation of delay and energy with operating voltage at super, near, and sub threshold voltage regions [Mit15]...... 23 1.6 HotSpot thermal map of the traffic flow in NoC where utilization is shown as temperature raise...... 26 1.7 Threshold voltage shift (∆Vth) due to NBTI and HCI effect at different temperatures (a) and different supply voltages (b)...... 28 1.8 Transmission and Re-transmission in communication network between source and destination...... 30

2.1 Reconfigurable NoC architecture based on the network traffic [CPK+13]..... 37 2.2 Control device architecture with router and layer controllers to switch NoC voltage levels [RJCR16]...... 39 2.3 Percentage of buffer utilization at different simulation time (cycles) for blacksholes (left) and LU (right) applications...... 42 2.4 Traffic pattern of blackscholes application at different epochs to determine epoch size for RETUNES...... 43 2.5 Figure shows the flow of original image read from the Memory Control Unit (MCU) and approximated JPEG image sent back to the Memory Control Unit (MCU)...... 47 2.6 Shows the approximation performed on 10bit data, where ’d’ represents the number of duplicates following a digit...... 48 2.7 JPEG encoder, Memory Control Unit (MCU) and approximating core mapped on NoC...... 48 2.8 Unified fault model showing error range separately for threshold voltage variation (∆Vth) and bit errors observed in RETUNES...... 51 2.9 Flowchart shows appropriate encoding layer (e2e or s2s) used in RETUNES for different error ranges (Ne,Fe,Me)...... 52 2.10 RETUNES switch-to-switch encoding layer microarchitecture showing encoder and decoder of R-layer along with the router pipeline stages...... 53 2.11 RETUNES end-to-end encoding layer microarchitecure showing encoder and decoder of R-layer at the Network Interface (NI)...... 54 11

2.12 map of the single router explaining five directions of the router: four links (x, -x, y, -y) of the router connecting adjacent routers and a link for the core.... 56 2.13 RETUNES routing algorithm (adaptive routing algorithm) to determine the path of the packet...... 57 2.14 Graph of threshold voltage change for different supply voltage which shows that lowering the supply voltage slows down the threshold voltage (Vth) change 58 2.15 RETUNES Centralized Control Unit (CCU) showing voltage regulator, CCU micro architecture, and control sequence between CCU and a core...... 59 2.16 Design of global on-chip voltage regulator for NoC in RETUNES...... 60 2.17 RETUNES mode control algorithm...... 61

3.1 Methodology for evaluating RETUNES performance. Showing evaluation flow of the approximate communication (orange), all others (blue), and end results (green and gray)...... 65 3.2 Total dynamic power cost for Splash-2 and PARSEC benchmarks of 64 core NoC when operated in four proposed schemes. Lower is better...... 66 3.3 Area overhead of the decoder, encoder and router for CRC and Hamming code used in s2s and e2e encoding designs...... 67 3.4 Normalized average packet latency (normalized to baseline model - Always- NTV (XY)) for Splash-2 and PARSEC benchmarks of 64 core NoC when operated in four proposed schemes. Blue shows latency cost without reliability. Orange shows reliability cost. Lower is better...... 68 3.5 Threshold voltage change (∆Vth) due to voltage scaling , elevated temperature, and aging at 5 different supply voltages. Lower is better...... 69 3.6 Comparing HotSpot thermal map of Always-STV under xy-routing and RETUNES (V5 scheme) under adaptive routing. RETUNES shows uniform and lower device temperatures when compared to Always-STV under XY- routing...... 70 3.7 Bit error rate observed in RETUNES due to voltage scaling and aging...... 71 3.8 Normalized Energy Delay Product (EDP) (normalized to baseline model - Always-NTV (XY)) for Splash-2 and PARSEC bench-marks for four proposed schemes. Blue shows EDP without reliability. Orange shows reliability cost. Lower is better...... 72 3.9 Comparing the original image (left) with compressed NN image at different error percentage (right)...... 74 3.10 Comparing the original image (left) with compressed NN image at different error percentage (right.)...... 75 3.11 Normalized average packet latency (normalized to baseline model - Always- NTV) for both original and approximated image for AxBench benchmarks of 64 core NoC when operated in all the proposed schemes. Lower is better.... 76 3.12 Normalized Dynamic power(normalized to - Always-STV (XY)) for both original and approximated image for AxBench benchmarks of 64 core NoC when operated in all the proposed schemes. Lower is better...... 77 12

3.13 Normalized Dynamic energy (normalized to Always-NTV scheme original image of 2.6% error rate) for both original and approximated image for AxBench benchmarks of 64 core NoC when operated in all the proposed schemes. Lower is better...... 78 3.14 Normalized EDP (normalized to Always-NTV scheme) for both original and approximated image for AxBench benchmarks of 64 core NoC when operated in all the proposed schemes.Lower is better...... 78 13 List of Acronyms

ADP - Adaptive routing BCH - Bose, Chaudhuri, and Hocquenghem BW - Buffer Write CHE - Channel Hot Electron CMesh - Concentrated Mesh CMPs - chip multiprocessors CRC-32 - 32-bit Cyclic Redundancy Check DAHC - Drain Avalanche Hot Carrier DBAR - Destination Based Adaptive Routing DCT - Discrete Cosine Transform DFT - Discrete Fourier Transform DLP - Data Level Parallelism DOR - Dimensional Order Routing DVFS - Dynamic Voltage and Frequency Scaling DVS - Dynamic Voltage Scaling e2e - end-to-end ECC - Error-Correcting Code EE-layer - Energy Efficiency layer FCS - Frame Check Sequence Fe - Few errors GPUs - Graphic Processor Units HCI - Hot Carrier Injection ILP - Instruction Level Parallelism MBUs - Multiple Bit Upsets MCU - Memory Control Unit Me - More errors MEP - Maximum Energy Point - Metal-Oxide-Semiconductor Field-Effect Transistors NBTI - Negative Bias Temperature Instability Ne - No errors NI - Network Interface NN - Neural Network NoC - Network-on-Chip NPUs - Neural Processing Units NTV - Near Threshold Voltage PTM - Predictive Technology Model RC - Routing Computation R-D - Reaction-Diffusion s2s - switch-to-switch SA - Switch Allocation 14

SBUs - Single Bit Upsets SECDED - Single Error Correction and Double Error Detection SEUs - Single Event Upsets SGHE - Secondary Generated Hot Electron SHE - Substrate Hot Electron SIMD - Single Instruction Multiple Data SIMT - Single Instruction Multiple Threads SMT - Simultaneous Multithreading ST - Switch Traversal STV - Super Threshold Voltage TLP - Thread Level Parallelism TMR - Triple Modular Redundancy VA - Virtual Channel Allocation WMS - Wear-out Monitoring System XY - XY-routing 15 1 Introduction

1 Growing need for faster and more power-efficient computing systems has increased the demand for chip multiprocessors (CMPs) [ONH+96][NO97]. Researchers have focused on improving the clock speed of the processor in order to improve the throughput and execution time [Pen17]. As clock speed is improved by increasing the switching speed of the transistor, supply voltage to the transistor is also scaled up proportionally. However, leakage power and excess heat due to higher supply voltage limited the clock speed of the processors to 2-4 GHz in early 2000s. To continue the improvements in performance without increasing supply voltage of the processor, parallel processing techniques such as Instruction Level Parallelism (ILP), Data Level Parallelism (DLP), Thread Level Parallelism (TLP), etc. have been proposed [YS06], [SDM10]. ILP executes independent instructions in parallel by increasing the depth of the pipeline to fit multiple instructions or by duplicating the components so that multiple instructions can be executed in a single cycle. In a basic block of 10 instructions long, parallelism is typically limited to 3 to 4 instructions on an average [Wal91]. Hence, ILP is limited by the amount of parallelism that can be exploited in any portion of the program. DLP performs similar functions/operations on different data simultaneously. Single Instruction Multiple Data (SIMD), Single Instruction Multiple Threads (SIMT) machines, and most recently Graphic Processor Units (GPUs) exploit DLP. TLP executes multiple threads or instructions concurrently such as Simultaneous Multithreading (SMT) [[LEL+97], [[EEL+97]. However, single core processor with multiple threads is limited by high energy consumption and low performance gains even with SMT. Figure 1.1 shows the microprocessor trends for the past four decades where frequency gains and single-thread performance no longer provide sufficient gains. Hence, chip multiprocessors (CMPs)

1 Some material including figures, sentences, and paragraphs are used verbatim from prior publications [BK18] with permission ©2018 IEEE 16

Figure 1.1: Microprocessor trend for the past four decades where frequency gains and single-thread performance no longer provide sufficient gains [Rup18].

have evolved to resolve energy-performance tradeoff of the uni-processor architecture. In CMPs, multiple processing cores are integrated on a single chip where all the cores work simultaneously to improve energy efficiency and execution speed of the application. With recent advances in silicon technology, multiple cores (with greater number of transistors) are being housed on a chip; for example, Intel Knight’s Landing has 72-cores [Sod15], PEZY Super Computer has 1024 cores [TKT+16] and TILE-Gx has 100-cores [Ram11]. To handle this continuous increase in the number of cores, a highly scalable and flexible interconnection network is crucial. Most common inter-core communication topologies such as buses, rings or crossbars have been proposed for multicore chips to connect its cores together as shown in Figure 1.2. Bus topology is the basic network topology which interconnects all its cores to share a single medium (bus) to broadcast all information as shown in the Figure 1.2.b . Although bus topology is simple and linearly scales with respect to the number of cores (N), the 17 performance (speed, bandwidth) of the topology decreases with increase in the number of cores due to serial access of the bus. In the ring topology all the cores are connected in the form of a ring network where each core is connected to its two neighboring cores as shown in the Figure 1.2.c. During high network load, the ring topology performs better than the bus topology due to several simultaneous communication occurring on several links. However, ring topologies suffer from high number of hops when the size of the network is scaled. Crossbar topology has grid of switching points where data from one core reaches the other by traversing a single switch point. Crossbar with N cores has N2 switching points, which allows N simultaneous communication. Cost and complexity of the crossbar is high as the number of switching points scales to O(N2).

1.1 Network-on-Chip

Network-on-Chip (NoC) is a modular switching fabric that interconnects multiple processing cores in CMPs using routers and links. Different topologies are determined in NoC depending on the connectivity pattern of the routers as shown in Figure 1.2. The most popular topology is the 2D mesh where routers are connected using links forming a grid as shown in 1.2.d. 2D mesh topology is power and area efficient due its low radix count and shorter links. 2D mesh has large number of link traversals that increases energy consumption in the large-scale systems. Figure 1.2.e shows the torus topology where each router has four links connected to it. Even though torus topology decreases the number of link traversals, the wrap around links consume extra power and area [BD14a]. Figure 1.2.f shows a concentrated mesh topology (CMesh) where each router has multiple cores connected to it (4 in this case) [BD14b]. CMesh is preferred over mesh topology as the former reduces the latency and average hop count due to higher concentration [BD14a]. With the increase in the concentration of CMesh topology, factors such as latency and contention might also increase. 18

Figure 1.2: Common NoC topologies.

NoC consists of 3 main components- (1) router- the component that handles the communication protocol, (2) link - a physical connection between routers for communication, and (3) Network Interface (NI) - the component that makes logical connections to the core. In NoC, data is communicated in the form of packets where each packet is divided into flits and is transmitted from one router to another using links. Figure 1.3.a shows router microarchitecture with cross bar and router pipeline stages where the 5 router pipeline stages are elaborated in chapter 2. Figure 1.3.b shows the Network Interface which serves as a connection between router and its cores.

1.2 Energy Efficiency

Dynamic and static power are the two main sources for power dissipation in NoC. Dynamic power is dissipated when the transistor is active and switches the bit from logic 1 to 0 or 0 to 1 whereas, static power is dissipated due to leakage current when the 19

Figure 1.3: Router microarchitecture with cross bar and router pipeline stages (left) and Network Interface which serves as a connection between router and its cores (right).

transistor is not active. Total system power is the sum of the dynamic power and the static power dissipated in a system. Links and routers of an interconnection network are the most power hungry components, where links alone consume nearly 50% of overall chip power on an average [SP03b][FAA08][AAKL+10]. Hence, researchers have focused on improving the energy efficiency of the interconnection networks using energy proportional computing designs. Energy proportionality is a technique that regulates the power consumption of the circuit proportional to the utilization. For example, in NoC when there is low communication demand, the overall power consumption of the network is minimum, whereas at high communication demand, the overall power consumption of the network is maximum. Recently, energy meters are introduced to monitor the power consumed by the subsystems such as processors and memory at the system-level 20

[RRS+14][WJK+12]. Intel Sandy Bridge processor uses Running Average Power Limit (RAPL) interface to limit the power consumption of the subsystems with a time resolution of 1ms [RRS+14]. RAPL allows users to set the power limit and time frame, which reports the performance impact according to the power limit set by the user. Hence, implementing a system that tracks utilization levels to trigger the resource availability might reduce the inefficiencies due to over availability of the resources. Voltage Scaling, Dynamic Voltage and Frequency Scaling (DVFS) [Iru15][BGL12][LND+05][MD+09a], Near Threshold Voltage (NTV) Scaling [MS+16][KJ13][Mit15], Data approximation [Mit16][MBJ14a][BHM+17], routing algorithms [BCR12b][BCR12a], data encoding and decoding techniques [PFAC09][JPKAK14], and power gating [BJS+14][NSB16] [CYZ13] are some of the advanced and popular energy proportionality designs.

1.2.1 Voltage Scaling

Technology scaling down to sub-nanometer combined with an exponential increase in the number of transistors that can be integrated at the on-chip level, has resulted in drastic increase in power density of multicore architectures. As supply voltage and operating frequency directly influence the dynamic power of the transistor according to the equation [1.1], low voltage operations achieve high energy-efficiency [EE11].

2 Pdynamic = CVdd f α (1.1)

Power management techniques such as Dynamic Voltage Scaling (DVS) and Dynamic Voltage and Frequency Scaling (DVFS) have been employed to improve energy efficiency of NoC by scaling down the operating voltage. In DVS, supply voltage (Vdd) of NoC is scaled at runtime while setting the operating frequency low enough to execute the application. During low NoC workload, supply voltage is decreased to eliminate the unnecessary power consumption and at high workload, supply voltage is increased to avoid 21 congestion [PLS01]. Although, DVS saves power by scaling the voltage, it uses single operating frequency that increases the latency of NoC at high workloads. DVFS is another approach to scale the supply voltage along with the operating frequency to improve latency in NoC. In DVFS, supply voltage and frequency are scaled at runtime while reducing power consumption of NoC up to 20-26% [MD+09a][EE11]. DVFS can be applied to different components of NoC individually at different levels of granularity making it more energy- efficient. However, voltage region close to transistor threshold voltage is unexplored as standard DVFS technique can scale the supply voltage only up to 70% of the normal voltage [Iru15][BGL12][LND+05][MD+09a]. Near Threshold Voltage (NTV) Scaling: Near Threshold Voltage (NTV) scaling is an advance power management technique that operates devices with supply voltage close to

the transistor threshold voltage (Vth). NTV region exhibits minimum energy consumption at tolerable latency as shown in the Figure 1.4[GT18]. The minimum energy point shown in the Figure 1.4 is in the optimal supply voltage range that is observed in the NTV region where static and dynamic energy consumption are low. As the supply voltage of NoC is reduced below NTV region, static power consumed by the network dominates the energy savings. Similarly, if supply voltage of NoC is above NTV region, dynamic power consumption increases which in turn increases the overall power consumption of NoC. NTV scaling increases the energy efficiency of the network by more than 5x when the operating voltage scaled by more than 25% of the normal supply voltage [DWB+10] [KJ13]. Supply voltage in DVFS is scaled less than 75% which makes it less power efficient than NTV [Mit15]. Figure 1.5 shows the variation of delay and energy with operating voltage. Clearly, as the supply voltage increases from subthreshold region to near threshold region, delay is reduced by ≈ 50 to 100× and energy consumption is increased by just ≈2×. As we move away from near threshold region to super threshold region, delay is reduced by ≈10× at the expense of huge energy consumption (≈10×). 22

Figure 1.4: Maximum Energy Point (MEP) that is observed in the NTV region[Yu].

Although highly energy-efficient, NTV scaling has several performance challenges. Lower supply voltages lead to low operating frequencies which stall the flow of traffic in NoC increasing the critical path delay. Increased delay develops congestion in the network which might eventually lead to packet loss reducing the performance of NoC. Recently, researchers have focused on mitigating the performance and delay penalties of NTV scaling while improving energy-efficiency for low-throughput applications.

1.2.2 Approximate Computing

Approximate computing is an approximation technique to improve the energy efficiency and performance of NoC which is gaining popularity among the industries. Recently, researches have implemented approximate computing technique in various fields such as machine learning, fluid dynamics, video processing, image recognition, financial analysis, database search, and many more where, applications can compromise on the quality of the computed result [BHM+17][MBJ14a][Mit16]. This technique 23

Figure 1.5: Variation of delay and energy with operating voltage at super, near, and sub threshold voltage regions [Mit15].

makes use of low-sensitive nature of the application while balancing power and latency tradeoff. Approximate computing allows applications to use real world data, producing imprecise results for high throughput and power savings. Previous works have explored approximate computing in designing approximate circuits, software and architectural modifications [Mit16] which are inherently resilient to output errors. More recently, approximate communication has been proposed in which data between two processing cores is approximated to further reduce the cost of communication [BKS+18a][Mit16]. There are two main methods by which the communication cost can be reduced: 24

• Reducing the size of the packet to be transmitted.

• Reducing the number of packets to be transmitted to communicate a message.

Previous works also showed that the communication efficiency can be improved by incorporating current approximate computing techniques that have a potential to reduce the communication overhead [BKS+18a].

1.3 Reliability

Reliability is a main concern in NoC apart from energy efficiency and is degraded when NoC is susceptible to faults and disturbances. A fault can be determined as a cause of deviation from the desired operation of the system (error). The faults are mainly categorized into two types - transient faults and permanent faults [PNK+06][FLJ+13] [MG13]. Transient faults are non-catastrophic and occur due to soft errors. A soft error is noise that is induced due to radiation, electromagnetic interference/noise corrupting the data bits. Soft errors such as timing errors are more prominent at low supply voltage and operating frequencies. Single Event Upsets (SEUs) such as Single Bit Upsets (SBUs) and Multiple Bit Upsets (MBUs) are the soft errors occurred due to corruption of single or multiple bits [LD+], disturbing logic by flipping bits (0 or 1). Permanent faults are catastrophic and occur due to hard errors. Hard errors effect the device functionality causing an irreversible damage. Aging effect is the main cause of hard errors that leads to failures in link and router of NoC. Reliability of NoC is affected by voltage scaling, elevated temperatures, and aging which will be explained in the subsections below.

1.3.1 Effects of Voltage Scaling and Temperature

Aggressively scaling the supply voltage with transistor technology increases Single Event Upsets (SEUs) in transistor. Supply voltage and operating frequency are the two important parameters effecting the probability of SEUs [LD+]. Single Event Upsets 25

(SEU) are soft errors occurred due to alpha particles, cosmic rays, and thermal neurons, flipping binary bits (0 or 1) that results in logic errors in the transmitting data [MT03a] [MT03b]. Operating transistor at low supply voltage further increases the probability of logic failures, making it unreliable. Memory cells are highly vulnerable to SEU due to their low voltage margins. As the capacitance inside the memory cells decreases with the transistor technology, the minimum capacitance charge necessary to hold/retain the information decreases. Therefore, the charge required to switch the bit (0 to 1 or 1 to 0) decreases causing soft errors. On the other hand, elevated temperatures also have adverse effects on reliability of NoC. Temperature and utilization are directly proportional to each other and uneven utilization caused due to elevated temperatures increases aging effects in NoC [SA+14] [KZBH13]. Uneven utilization of the router or link increases if more packets take the same route overusing links and routers in that path. Higher concentration of traffic for an extended period in an area on a chip generates hotspots (high temperature region) and eventually creates open circuits in devices. Figure [1.6] shows the thermal map of the packet transmissions where utilization is shown as temperature raise.

1.3.2 Aging Effects

Aging is a physical phenomenon where performance of the transistor is degraded over time due to high usage. In any network, certain transistors age faster than the others due to increase in their usage and eventually fail to work as intended. This uneven aging causes serious communication failures in the network reducing lifetime of the transistor. The most common factors effecting the lifetime of the transistor are fabrication methods, properties

of the materials, and parameters such as supply voltage (Vdd) and temperature. Aging cannot be mitigated but can be slowed down by controlling the device utilization. In order to increase the lifetime of the circuit, it is important to find the root cause of age degradation 26

Figure 1.6: HotSpot thermal map of the traffic flow in NoC where utilization is shown as temperature raise.

and to design and develop a model which can tune the parameters that cause aging such as operating voltage and temperature. Negative Bias Temperature Instability (NBTI) and Hot Carrier Injection (HCI) are the temporal unreliability issues causing circuit failures [YYHC11]. NBTI and HCI are time dependent effects caused due to conditions such as switching activity, operating voltage, and temperature [MG13][OS95][KVGS13]. Negative Bias Temperature Instability (NBTI): Metal-Oxide-Semiconductor Field- Effect Transistors (MOSFETs) suffer from reliability degradation due to Negative Bias Temperature Instability (NBTI) effect which changes the threshold voltage and decreases drain current. NBTI causes silicon-hydrogen bonds to dissociate giving raise to interface traps in PMOS transistors. Subatomic charge particles such as electrons and holes occupy these electrically active interface traps contributing to the change in threshold voltage. Interface traps decrease mobility and drain current which in turn increases the threshold voltage effecting critical path delay. Reaction-Diffusion (R-D) model is a technique that analytically models the effect of stress and relaxation phase in NBTI [OS95]. There are two 27 different phases in capturing the interface traps using R-D model: Phase1- In this phase, the reaction caused due to release of hydrogen because of the interface traps occurred at the SiO2 and Si interface is linearly dependent on the stress time. Phase2- in this diffusion phase, the hydrogen which is released from phase 1 diffuses for a short period time.This diffusion is time dependent as shown below:

tn , for neutral hydrogen n= 0.25 [BWV+06]

As the supply voltage increases, electric field across the junction of the transistor increases which in turn increases the negative bias of the PMOS transistor. This elevates the stress levels causing degradation in NBTI [OS95]. Similarly, high temperatures disassociate Si-Hydrogen bonds causing NBTI [KZBH13][YYHC11] citeaging-error- main. Hence, the transistor lifetime is interrelated with the operating voltage and device temperature. Researchers focused on implementing new techniques to reduce the timing errors caused due to NBTI [CSGK11][KBW+14]. Hot Carrier Injection (HCI): Hot Carrier Injection is a phenomenon caused due to the hot carriers in a transistor. Hot carriers are the subatomic particles that acquire very high kinetic energy due to high electric fields. These carriers get injected into an unintended region (gate oxide) due to change in their trajectory caused by the high electric fields. When these charged particles enter a region such as gate oxide, they get trapped creating defects [KVGS13]. HCI causes change in threshold voltage, current factor, and conductance. As the electrons are hotter than the holes, HCI effect is more prominent in NMOS transistors. Drain Avalanche Hot Carrier (DAHC), Substrate Hot Electron (SHE), Channel Hot Electron (CHE), and Secondary Generated Hot Electron (SGHE) are 4 different injection mechanisms where HCI effect is observed [MG13]. Traps are higher when gate voltage and drain voltage are approximately equal in CHE mechanism. SHE is caused due to high positive/negative bias of the transistor. DAHC and SGHE are 28 caused due to the ionization impact triggering SGHE after DAHC. Figure [1.7] shows the

(a) ∆Vth at different temperatures.

(b) ∆Vth at different supply voltages.

Figure 1.7: Threshold voltage shift (∆Vth) due to NBTI and HCI effect at different temperatures (a) and different supply voltages (b).

threshold voltage shift (∆Vth) due to NBTI and HCI effects for a 45nm technology node at different temperatures (Figure [1.7.a]) and operating voltages (Figure [1.7.b]) for over 10 years. Hence, from the Figure [1.7.b], we can observe that reducing the supply voltage reduces the threshold voltage shift. However, aggressively scaling down the supply voltage 29 close to sub-threshold region shows adverse effect on the performance and the switching energy in NoC. Non-uniform aging and generation of hotspots leads to faults effecting the performance of NoC. In order to handle the errors due to low supply voltage and aging, a good fault handling model is essential

1.3.3 Error Mitigation

According to ITRS, reliability has become a major concern as the lifetime of the transistor decreases with the increase in the temperature [ITR15]. As the transistor ages, change in threshold voltage increases due to interface traps and fluctuations in the charge density where, Vth variation above 10% leads to permanent faults in the circuit. In addition to uneven aging, faults due to voltage scaling is another concern in NoC. As proposed in [LD+], decrease in the supply voltage increases the bit error rate due to decrease in capacitance of the memory cell. As a result, low operating voltages increase the susceptibility of the device to faults which in turn compromises the reliability [LD+] [Mit15]. As NoCs are the only means for data transfer between the cores, faults and disturbances such as Single Bit Upsets (SEUs), Multiple Bit Upsets (MBUs), process variations and so on have become major challenges with NTV scaling. To prevent serious reliability degradation, strong error correction techniques and packet routing algorithms are crucial for NoC to continue to operate reliably. This increase in transient and permanent faults due to low supply voltage and uneven aging provides motivation to incorporate dynamic error handling techniques in NoC. In this subsection, fault handling techniques such as retransmission and error correction and detection schemes (for faults due to low supply voltage/frequency) and routing algorithms (for faults due to uneven aging) are discussed. Retransmission scheme: Every communication network has a source node and a destination node where source transmits data to the destination as shown in the Figure 1.8. 30

Figure 1.8: Transmission and Re-transmission in communication network between source and destination.

In case of error, destination node requests retransmission of data such that the source node then repeats the transmission. Various error detection schemes such as hamming code and Cyclic Redundancy Check (CRC) are used to detect error at the destination node [GR09]. Retransmission scheme is highly energy-efficient with low area overhead. However, this scheme is limited by increased delay and network load. In order to handle permanent faults in retransmission scheme, data should be rerouted to avoid failed nodes, which further increases the delay and the network load. Error correction scheme is another way of handling faults, with less latency and network load than that of retransmission scheme. Error Detection and Correction Scheme: Hamming distance is the difference in number of bits between two code-words in a block code. Minimum hamming distance is used to detect N errors if hamming distance between the code words is at least N+1. Error detection part of the hamming code is highly energy-efficient when compared to error correction part, as the minimum hamming distance of error detection capability is less than the error correction capability. 32-bit Cyclic Redundancy Check (CRC-32) is an advanced error detection scheme used in digital network and storage devices. Error detection in CRC is implemented by comparing the Frame Check Sequence (FCS) between the received data and the original data stored. FCS is an error detection check value obtained by performing division between the binary data to be the transmitted and the generator polynomial. The 31 reminder of this division is the error detection value (FCS). If the data is corrupted, FCS of received data and FCS of the original data are different. Similarly, if the data is not corrupted, FCS of received data and FCS of the original data are equal. Hamming code, BCH (Bose, Chaudhuri, and Hocquenghem) code, and Reed- Solomon code are the most common error-correcting codes (ECC). In NoC, Hamming code is the popular and energy-efficient error correcting code [SS05]. Hamming code can detect and correct one error bit from the received binary data if the hamming distance is 3, whereas, with hamming distance of 4, it can perform single error correction and double error detection. In order to avoid data loss in hamming code, extra bits called redundant bits are added to the data while transmission. Parity bit is another type of extra bit added to the data in hamming code that plays an important role in detecting error. If the parity value of the transmitted data changes, then it is considered to be corrupted. Bandwidth and power consumption increase while using hamming code for error correction due to the extra bits (parity bit and redundant bits) that are appended to the original binary message for error correction and detection. Routing: NoC experience traffic fluctuations with increase in the size of the chip to accommodate multiple processing cores. This traffic fluctuations causes uneven distribution of load which results in uneven aging of NoC. Uneven aging in the network is one of the main causes of rapid degradation in the lifetime of the network. Distributing stress/load across the network supports uniform aging and improves the lifetime of the transistor. In order to symmetrically distribute traffic in NoC, an efficient routing algorithm is crucial. A packet is routed from a source to its destination in a dedicated path by means of routing algorithm. There are two main categories of routing types, deterministic and adaptive routing. 
In deterministic routing, packets follow same path from a given source to its destination, for example, Dimensional Order Routing (DOR) algorithm. In a 2D mesh 32 topology, DOR is determined for X and Y coordinates where, packets are routed along one coordinate (x or y) first and then routed along the other coordinate. For example, packets are routed along x-axis until x-coordinate of the current router is equal to the x-coordinate of the destination router. Later, the packets are routed along y-axis until the y-coordinate current router is equal to the y-coordinate destination router. DOR XY- routing ensures smooth flow of traffic avoiding deadlock/live lock. In adaptive routing, packets change their routing paths depending on the traffic conditions of NoC, such as Destination Based Adaptive Routing (DBAR) [RL10] and Regional Congestion Awareness (RCA) routing [BCR12c]. Adaptive routing determines optimal transmission path for the packets from source to destination, decreasing latency of the network. Adaptive routing algorithm ensures uniform distribution of traffic throughout NoC while taking a different route avoiding congested paths.

1.4 Major Contributions

In this thesis, I propose RETUNES: Reliable and Energy-efficient NoC, a unified model that includes low-power design and fault tolerant techniques to explore and analyze critical factors such as reliability, energy efficiency, and latency of NoC. RETUNES uses five voltage modes which are carefully chosen such that the supply voltage is scaled according to the incoming traffic to reduce congestion and improve energy efficiency. Power-performance tradeoff due to multiple voltage modes in NoC is analyzed based on the buffer utilization and application traffic load. At low network load, voltage is

scaled down ensuring maximum power savings and minimum ∆Vth. At high network load, voltage is scaled up ensuring lower bit-error rate and minimum latency. In order to further improve energy efficiency of NoC approximate communication technique is implemented. RETUNES uses data approximation in its design to reduce number of packet transmissions. Data from the cores is captured and annotated to mark the region for approximation using 33

Loop Perforation and Neural Processing Units (NPUs) to enhance performance (latency, throughput) and energy savings. RETUNES enhances reliability by slowing down the wear out at the inter-router level. Adaptive routing algorithm is introduced to even out the link wear out by distributing the incoming traffic. The reliability model of RETUNES is a hybrid error correction and detection scheme with two-layered architecture to mitigate soft errors caused due to lower voltage modes and aging. When operating the NoC in high voltage/frequency modes and at lower ∆Vth, error rates are typically low, and therefore, end-to-end (e2e) error correction is enabled. Similarly, when operating under low voltage modes ( NTV) and at higher

∆Vth, error rates are higher, and therefore, switch-to-switch (s2s) error correction scheme is enabled. This multi-layered hybrid scheme handles SBUs and MBUs that are encountered at lower supply voltages, thus achieving a fine balance between power consumption and network performance. The following are the major contributions of this thesis:

• RETUNES uses multiple voltage modes and approximate communication to manage congestion and energy efficiency of the network while maintaining upper bound on energy-delay-product (EDP).

• RETUNES incorporates adaptive routing algorithm to symmetrically distribute workload on NoC and support uniform aging process. As faults manifest due to NTV scaling, RETUNES combines error handling schemes with voltage scaling to improve the reliability of NoC.

• RETUNES implements a hybrid error correction and detection scheme with two- layered architecture to handle all SBUs and MBUs that are encountered at different supply voltages., thus achieving a fine balance between power consumption and network performance. 34

1.5 Organization of Thesis

The thesis is organized as follows: In Chapter 2, RETUNES architecture is discussed while first section provides an idea of previous works on NoC energy efficiency and reliability. In section 2.2, RETUNES energy efficiency and reliability models are discussed. Energy efficiency techniques such as voltage scaling, NTV scaling, and approximate communication are discussed in the first part of the section 2.2. The second part of the section focuses on techniques to improve reliability of RETUNES which includes, adaptive routing technique in order to improve lifetime of NoC, unified reliability model which handles faults due to aging, voltage scaling, and temperature, and Centralized Control Unit (CCU) which handles critical decisions regarding error correction and voltage switching. In Chapter 3, NoC performance is evaluated beginning with four evaluation schemes used in RETUNES. Section 3.1 describes RETUNES evaluation approach. Next, in Chapter 3, section 3.2, RETUNES power, delay, lifetime, reliability, Energy-Delay product (EDP) is analyzed. Finally, conclusion to the thesis is provided in Chapter 4 which includes thesis contribution and future work. 35 2 RETUNES: Reliable and Energy-Efficient

Network-on-Chip

In this thesis, I propose RETUNES - a reliable and energy-efficient NoC. RETUNES evolves into a multi-layered hybrid model by combining energy-efficiency layer (also known as EE-layer) and reliability layer (also known as R-layer). The EE-layer of RETUNES uses voltage scaling and approximate communication to ensure maximum power savings. The most crucial aspect in voltage scaling is determining supply voltage selection and voltage mode switching. In approximate communication, calculating the tolerable error threshold for approximation is important. The R-layer in the architecture is a hybrid error correction and detection design that handles faults due to transistor aging and voltage scaling. The EE-layer and R-layer work together resolving critical timing errors efficiently at various stages in the design. This chapter is focused on the RETUNES architecture where in section 2.1 prior work on energy efficiency and reliability is explained, and in section 2.2 RETUNES architecture is elaborated.

2.1 Prior Work

Energy efficiency and reliability are the two important factors to be considered in industries such as automobile, chip manufacturing, health, telecommunication and so on. While using NTV scaling in NoC, researchers focused either on energy-efficiency techniques or on different reliability improvement methods but not on both. Survey on energy-efficient methods provided a good insight on state-of-art research in minimizing the power consumption in NoCs [AAF+14][Mit15][RSG03][Mit16]. Approximate computing is an attractive approach to improve the energy efficiency and performance at the cost of accuracy [SDF+11][MAMJ15]. Previous woks have proposed data approximation and compression techniques [DMN+08][JYK08] to improve 36 energy savings and reduce latency even at higher traffic loads. Frameworks such as Enerj performs data approximate in any part of the execution process by annotation [SDF+11][MAMJ15][DPL+14][ESCB12]. Approximate communication is a part of approximate computing where the data transmitting from source to destination is deliberately approximated by reducing the amount of transmissions between the sender and receiver [BKS+18b]. Approximation can be done at the hardware level (memory and computation) or the software level. In [CPK+13] the author proposed a reconfigurable NoC architecture based on the network traffic as shown in the Figure 2.1. This method reduced the amount of data that is transferred between source and destination using file compression and recovery techniques that improved the communication speed by more than 50%. However, this technique leads to high energy consumption due to the continuous compression and recovery performed at the flit level. Previous works also proposed several approximation techniques such as, lossy data compression [DMN+08][SLJ+13], data sampling [AWC+11b], loop perforation [SDMHR11], load value approximation [MBJ14b] and so on [ACV05][KKS15], to implement approximate communication technique. Recently, researchers focused on resolving the low throughput issue caused due to high network loads by identifying and approximating the non-critical data in the application improving bandwidth [AHY+16][MBJ14b][AYMC15]. Dynamic Voltage and Frequency Scaling (DVFS) is another effective technique to reduce the dynamic power consumption of NoC [BMM07][BMJG12][MD+09a][UKK13] [HJ15]. Typically, DVFS designers have two important decisions to make determine the circuit level (granularity) to apply DVFS and to determine the appropriate voltage mode to apply. In fine-grain voltage/frequency domains, NoC routers and links operate independent of each other using multiple supply voltages to improve performance. However, since the current is drawn from different domains, the supply voltage guard-bands increases which results in a decrease in the power-efficiency in NTV environment. In [JR+07] the 37

Figure 2.1: Reconfigurable NoC architecture based on the network traffic [CPK+13].

author discussed the similar problem of fine-grain domain approach for IBM processor. Instead of fine-grain voltage modes, coarse-grain on-chip multiple voltage mode approach in which NoC is controlled globally increases the power efficiency of the network, albeit at a cost of performance. The other crucial aspect of DVFS is the voltage mode selection. Operating voltage of the network can be scaled down if the buffer utilization/traffic within the NoC is low and similarly, supply voltage can be increased if the buffer utilization is high. Prior work has proposed several different metrics to measure traffic, such as buffer usage [MD+09b], predicted link usage algorithm [SP+03a], threshold-controlled algorithm [HM07], temperature aware voltage switching [KET16], congestion aware routing algorithm [EDL+12] and DVFS algorithm [BC+12]. NTV scaling has been proposed for processors, cores, and memory and is more recently applied to NoCs. From the past few decades NTV and NTV computing techniques have been proposing to improve the energy efficiency [AWC+11a][AFGM11]. In [RJCR16] the authors proposed a multi-layered NoC architecture that uses near threshold 38 voltage technique which improved the energy efficiency of the NoC. In their work, it was shown that based on application demand, switching the operating voltage between NTV and normal voltage improves the performance of the NoC. Figure 2.2 shows the proposed architecture for control device that monitors traffic and responds to traffic changes in NoC. The router controller responds to the traffic variations and transfers the information to traffic monitor where the controller makes decision on scaling the supply voltage. BoostNoC implemented a process of safe and efficient packet transfer and improved the system performance and energy efficiency. In this approach the authors did not discuss the congestion of the network due to low frequency of operation at low supply voltage (200MHz at 0.35v in this case). The main drawback of this design is the hardware overhead used to switch operating voltage that is supplied to the routers. Even though all the routers in BoostNoC is supplied with single operating voltage, routers are equipped with an individual control unit that switches the supply voltage increasing the hardware overhead. In [ZDB+07] the authors improved the performance lost due to NTV by application parallelism. Operating NoC at lower voltage will increase the susceptibility of devices to faults due to timing errors [LD+], whereas, operating NoC at high supply voltage accelerates aging [vSA+16]. Previous works on controlling the aging process proposed two different approaches- 1) by modifying architecture/hardware of the existing design. In [ACC+09][BHK+13][CYLA11], the authors modified the hardware by implementing a technique to disable the blocks that shows faults due to NTV scaling, in order to regain reliability and improve performance. 2) by modeling the routing algorithm. Routing algorithm in NoC is important to improve performance (throughput and latency) and to minimize aging effect by selecting most optimal path. Aging process in transistors cannot be stopped, however, controlled aging (voltage scaling) and symmetrical distribution of workload will increase the lifetime of the device [MVD11]. In [BC+12] the authors used aging-aware oblivious routing algorithm that dynamically chooses the routing 39

Figure 2.2: Control device architecture with router and layer controllers to switch NoC voltage levels [RJCR16].

scheme to distribute traffic symmetrically along NoC. However, voltage scaling technique to reduce the threshold voltage variation is not considered in this design. In [vSA+16] the authors proposed a model that interprets aging degradation by comparing the increase in the runtime-delay (threshold voltage change) with the analyzed offline delay. In this design, the authors showed that lower supply voltage slows down the threshold voltage variation due to aging. However, the need for offline calculation is the drawback of the design which loses the ability to dynamically tune parameters (voltage and frequency) according to the wear-out levels. Further there has been different proposed ways to handle wear out dynamically at the cost of hardware overhead [ACR13][WWM14]. In [ACR13] the author proposed Wear-out Monitoring System (WMS) that allows the algorithm to decide between the buffered or buffer-less routers depending on the packet type. The hardware overhead and complexity in this proposed design is due to the technique used for monitoring the aging process (WMS). In [WWM14] the author proposed a routing algorithm based on 40 dynamic programming (DP), where wear-out level of each router is communicated using a parallel network. The process to find and communicate the wear-out level is complex in this design which induces hardware overhead. Along with maintaining a uniform traffic it is also important to handle the stress observed in NoC due to huge data transmissions. Memory-based and video/data processing applications transfer significant amount of data that creates high traffic loads. Prior work on error recovery schemes have shown the impact of encoding techniques on reliability of the network at low operating voltages [LD+]. In those experiments it was demonstrated that Single Bit Upsets (SBUs), Multiple Bit Upsets (MBUs), and hard errors have a higher probability to occur at low supply voltage. As the capacitance inside the memory cells decreases with the transistor technology, the minimum capacitance charge necessary to hold/retain the information decreases. This decrease in ability to retain data leads to an increase in the susceptibility of the memory devices to SBUs. Fortunately, error correction and detection techniques can be proposed to handle soft errors. A 2- layered error management method for NoC to manage both permanent and transient error was proposed in [YA11]. Error-correcting code (ECC) techniques such as s2s and e2e error control mechanism are integrated in the data link layer, physical layer, and network layer depending on the noise conditions of NoC. However, none of the prior works have combined voltage scaling, reliability, and aging of NoC architecture. In what follows, I will describe the RETUNES architecture, voltage scaling (different voltage modes), adaptive routing (aging) and reliability (different error correcting codes) model.

2.2 RETUNES Architecture

RETUNES is an energy-efficient and reliable architecture evaluated on a 4 × 4 concentrated mesh topology with 64 cores using a 45nm transistor technology node. This section elaborates technologies used for energy efficiency and fault tolerance to improve 41 power savings and reliability in NoC. In the first part of this section I explain RETUNES energy efficiency layer (EE-layer) followed by RETUNES reliability layer (R-layer). In the later part of the section I explain the Centralized Control Unit (CCU) design which combines EE-layer and R-layer activity.

2.2.1 Energy Efficiency (EE-Layer)

Energy Efficiency layer (EE-layer) of RETUNES provides maximum energy savings using two prominent techniques voltage scaling combined with energy proportionality and approximate communication. RETUNES monitors and calculates the traffic load (Flits/cycle), temperature (Celsius) and delay overhead (cycles) of the proposed voltage modes to enable supply voltage switching mechanism as explained in the following subsections.

2.2.1.1 Voltage Scaling

The RETUNES EE-layer consists of five voltage modes which are carefully chosen to effectively capture traffic variations and to minimize rapid switching of the supply voltage. These five voltage modes are scaled from the nominal voltage/Super Threshold Voltage (STV) to Near Threshold Voltage (NTV), where STV and NTV are set to 1.0 volts and 0.35 volts respectively. The EE-layer closely monitors buffer utilization from all the routers for every chosen epoch size. The methodology used to choose an epoch size is explained later in this subsection. The buffer utilization captured at each epoch serves as a metric for switching between voltage modes to manage congestion in the network. Figure 2.3 shows the buffer utilization of the blackscholes and LU applications along with the proposed 5 voltage modes at different utilization levels. When buffer utilization is below 15%, the NTV voltage mode is activated. Similarly, between 15% and 45% of buffer utilization, the V1 voltage mode is activated; between 45% and 58-60% of buffer utilization, the V2 voltage mode is activated; between 58-60% and 75% of buffer utilization, the V3 voltage mode is activated; and at buffer

Figure 2.3: Percentage of buffer utilization at different simulation times (cycles) for blackscholes (left) and LU (right) applications.

utilization of 75% and above, the STV voltage mode is activated. These empirical values were determined by running Splash-2 and PARSEC suite benchmarks at different epoch sizes and under various traffic conditions. The RETUNES 5 voltage modes (NTV, V1, V2, V3, STV) and their corresponding frequencies are calculated using the voltage-frequency relation from [EE11]. Equation 2.1 shows the relation that is used to determine the operating frequencies for the proposed voltages from [EE11].

f ∝ (Vdd − Vth)^β / Vdd    (2.1)

where f is the operating frequency, Vdd is the supply voltage, Vth is the transistor threshold voltage, and β is a technology-dependent constant which is approximately equal to 1.5 in this case. Choosing the epoch is a critical task, as the energy consumed for switching the operating voltage dominates the energy savings of the network when the epoch size is small (epoch

Figure 2.4: Traffic pattern of blackscholes application at different epochs to determine epoch size for RETUNES.

≤ cycles) and congestion builds in the network during heavy traffic loads at low operating frequency if the epoch size is large (epoch ≥ 500 cycles). Figure 2.4 shows the blackscholes traffic pattern at 50, 100, 300, and 500 cycle epoch sizes. The RETUNES optimum epoch size is chosen to be 100 cycles by carefully monitoring the power and performance trade-off for several executions of application data. Determining Traffic Load and Device Temperature: Load values for the voltage modes are assigned based on buffer utilization and average link utilization patterns. Average link utilization is modeled to vary from 0.01 to 0.4 flits/cycle, considering 0.4 flits/cycle as the maximum utilization, as most networks saturate after that point [ACP11]. The link utilization model [DBKL16] is used to calculate the temperature range for the corresponding link utilization. Initially, the network is operated at 0.01 flits/cycle, where the temperature is considered to be 75 to 77 degrees Celsius. After every 30 to 35 epochs, the temperature for the corresponding link utilization is captured. Determining Overhead Delay: There are two different types of delays that are observed due to voltage scaling:

• On-chip communication delay: This delay is experienced by the flit due to a decrease in operating frequency; the delay is inversely proportional to the operating frequency.

• Wakeup delay/Overhead delay: This delay is the cost of switching the operating voltage (Vdd) of the transistor. As the transistor voltage/frequency is switched, the transistor demands a few cycles to wake up, or to switch to the new operating voltage/frequency.

The overhead delay and the on-chip communication delay together determine the overall packet delay. The temperature values obtained at every voltage mode or network load are used to calculate the overhead delay of the network using a bias generator from [AD+06]. The first input to calculate the overhead delay is the buffer utilization, and the range of buffer utilization determines the supply voltage. The next input parameter is link utilization, where an increase in link utilization indicates an increase in traffic intensity, which in turn leads to increased power consumption, thus raising the device temperature. Table 2.1 shows the traffic load (flits/cycle), temperature (Celsius), and delay overhead (cycles) calculated for the corresponding voltage modes of RETUNES. Routers operating in any voltage mode during a step-down quickly ramp down the frequency and then wait for an overhead delay before ramping down the supply voltage. However, for a voltage step-up, the voltage is increased initially and then the router waits for the overhead delay before increasing the frequency. On receiving a signal to change the voltage mode of NoC, all the routers are instructed to complete the buffer transfers before ramping up/down the voltage to prevent any loss in communication. In RETUNES, voltage/frequency changes affect all links and routers simultaneously and the entire NoC operates at the same voltage/frequency (coarse-grain).

Table 2.1: Traffic load (Flits/cycle), temperature (Celsius) and delay overhead (cycles) calculated for the corresponding voltage modes of RETUNES.

Mode   Volt (V)   Freq (GHz)   Load (flits/cycle)   Temp Range (Celsius)   Delay (cycles)
NTV    0.35       0.2          0.01                 75-77                  8
V1     0.55       0.8          0.1                  76-82                  5
V2     0.6        1.5          0.2                  80-93                  4
V3     0.8        2            0.3                  90-101                 2
STV    1          2.3          0.4                  97-104                 1
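
To make the mode-selection policy concrete, the following Python sketch combines the buffer-utilization thresholds described above with the voltage/frequency pairs of Table 2.1 and the relation of Equation 2.1. This is an illustrative model rather than the RETUNES hardware; the 60% boundary between V2 and V3, and the threshold voltage Vth = 0.3 V used to evaluate Equation 2.1, are assumptions of the sketch.

```python
# Minimal sketch (not the thesis RTL): selecting a RETUNES voltage mode from
# the average buffer utilization of an epoch, using the thresholds described
# in the text and the voltage/frequency pairs of Table 2.1. The V2/V3 boundary
# is quoted as 58-60% in the text; 60% is assumed here.

# (mode, supply voltage [V], frequency [GHz]) from Table 2.1
MODES = [
    ("NTV", 0.35, 0.2),
    ("V1",  0.55, 0.8),
    ("V2",  0.60, 1.5),
    ("V3",  0.80, 2.0),
    ("STV", 1.00, 2.3),
]

# Upper buffer-utilization bound (fraction) for each mode, in the same order.
UTIL_BOUNDS = [0.15, 0.45, 0.60, 0.75, 1.00]

def select_mode(buffer_utilization: float):
    """Return (mode, Vdd, freq) for the measured buffer utilization of an epoch."""
    for (mode, vdd, freq), bound in zip(MODES, UTIL_BOUNDS):
        if buffer_utilization < bound:
            return mode, vdd, freq
    return MODES[-1]

def relative_frequency(vdd: float, vth: float = 0.3, beta: float = 1.5) -> float:
    """Equation 2.1: f proportional to (Vdd - Vth)^beta / Vdd (vth is an assumed value)."""
    return (vdd - vth) ** beta / vdd

if __name__ == "__main__":
    print(select_mode(0.52))   # -> ('V2', 0.6, 1.5) for 52% buffer utilization
    print(relative_frequency(1.0) / relative_frequency(0.35))  # STV vs. NTV ratio
```

In this sketch the mode check runs once per epoch (100 cycles in RETUNES), mirroring the per-epoch buffer-utilization monitoring described above.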

2.2.1.2 Approximate Communication

Approximate communication is a technique to improve the energy efficiency and performance of NoC. The approximate communication technique makes use of the error-tolerant nature of the application, where identification of duplicate data, data accuracy, and error threshold are the three requirements for data approximation. Prior work has proposed three main techniques to approximate transmitted data in NoC - compression, relaxed synchronization, and value prediction [BKS+18a][MBJ14a]. In compression, data with repetitive patterns is compressed while being transmitted across NoC to decrease energy consumption and bandwidth usage. There are two different compression techniques based on the end result: loss-less compression and lossy compression. Loss-less compression assures full reconstruction of data at the destination core without compromising on the quality of the output. Lossy compression eliminates redundant data during transmission, achieving higher compression than the loss-less compression technique [BKS+18c]. In relaxed synchronization, irrelevant synchronization points are approximated to improve the scalability of parallel tasks/benchmarks [MRCB10][BMR+10]. The relaxed synchronization technique is limited by the selection of the synchronization points that are relevant for the execution of the benchmarks. In value prediction, the inputs of the dependent instructions are predicted to reduce the latency of the instructions in the pipeline of the processor [MBJ14b][TPE+14]. RETUNES Approximation Procedure: Figure 2.5 represents the proposed RETUNES approximation procedure, which consists of a JPEG encoder, data approximation stages, and a Memory Control Unit (MCU). The RETUNES approximation (encoding and decoding) of a JPEG image is performed in 3 stages. In the first stage of approximation, the compressed JPEG image is read pixel by pixel in order to detect the duplicate data. In this stage, frequent repetitive patterns or similar pixel values are observed in the data and marked as duplicates. In the second stage of approximation (encoding), the duplicate data or pixels that are marked as duplicates are encoded as shown in Figure 2.6. In this example, Nd represents the duplicates of a bit, where N is the number of duplicates observed in the data. This approximated data reduces the number of packet transmissions that are required to transmit the compressed JPEG image from a source router to a destination router (MCU in this case). In the final stage of approximation (decoding), the approximated data is decoded at the destination router (MCU) using the reference pattern +Nd that is transmitted along with the packet. The JPEG encoder stages shown in the RETUNES approximation procedure are taken from the approximate computing benchmark suite [YMEL17]. Figure 2.7 shows an NoC with

Figure 2.5: Flow of the original image read from the Memory Control Unit (MCU) and the approximated JPEG image sent back to the MCU.

3 types of cores - JPEG encoder cores (highlighted in green), a Memory Control Unit (MCU) core (highlighted in yellow), both adopted from AxBench [YMEL17], and an approximating core (highlighted in red). The JPEG encoder is a lossy compression method generally used for compressing digital images. In this encoder, operations such as level shifting, encoding, quantization, Discrete Fourier Transform (DFT), and Discrete Cosine Transform

Figure 2.6: Shows the approximation performed on 10-bit data, where 'd' represents the number of duplicates following a digit.

Figure 2.7: JPEG encoder, Memory Control Unit (MCU) and approximating core mapped on NoC.

(DCT) are performed to compress an image. The path from MCU to the approximating core (green arrows) shows the JPEG compression that is performed in AxBench, where this compressed JPEG image is transmitted to the RETUNES approximating core to approximate the JPEG image. As RETUNES focuses on approximating the compressed image from the JPEG encoder, the five cores of the JPEG operation are mapped on NoC to replicate the traffic flow from MCU to the approximating core. The path from the approximating core to MCU (red arrow) shows the RETUNES approximation, where the approximated image (encoded) is sent back to MCU. At the MCU, the decoded image and the approximated image (encoded) are compared to determine the power and quality trade-off of the RETUNES approximation procedure. RETUNES only deals with the approximation procedure, which includes reading a compressed JPEG image, observing duplicate data, eliminating duplicates by encoding the bits, decoding the data at MCU, and comparing the decoded image with the JPEG image. The JPEG application (JPEG image) and the image compression rate used in the RETUNES approximation are taken from AxBench, whereas errors due to lossy compression are not included in this work.
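
The following Python sketch illustrates the encoding and decoding steps described above. It is not the RETUNES hardware: identical (rather than merely similar) values are treated as duplicates, and the string-based "+Nd" token format is an assumption used only for illustration.

```python
# Minimal sketch of the duplicate-elimination step described above: runs of a
# repeated symbol in the compressed JPEG stream are replaced by the symbol
# followed by a "+Nd" marker, where N is the number of duplicates that follow.
# The on-wire packet format is an assumption of this sketch.

def approx_encode(symbols):
    """Encode e.g. [7, 7, 7, 7, 2, 5, 5] -> ['7', '+3d', '2', '5', '+1d']."""
    encoded = []
    i = 0
    while i < len(symbols):
        run = 1
        while i + run < len(symbols) and symbols[i + run] == symbols[i]:
            run += 1
        encoded.append(str(symbols[i]))
        if run > 1:
            encoded.append(f"+{run - 1}d")   # N duplicates follow the symbol
        i += run
    return encoded

def approx_decode(encoded):
    """Reverse the encoding at the destination (the MCU in the thesis)."""
    decoded = []
    for token in encoded:
        if token.startswith("+") and token.endswith("d"):
            decoded.extend([decoded[-1]] * int(token[1:-1]))
        else:
            decoded.append(int(token))
    return decoded

if __name__ == "__main__":
    pixels = [7, 7, 7, 7, 2, 5, 5]
    assert approx_decode(approx_encode(pixels)) == pixels
```

Because a run of N+1 identical values is carried as only two tokens, fewer flits are needed per packet, which is the source of the transmission savings reported later in the evaluation.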

2.2.2 Reliability (R-Layer)

The Reliability layer of RETUNES handles all Single Bit Upsets (SBUs) and Multiple Bit Upsets (MBUs) due to aging, temperature, and voltage scaling. In this subsection, the reliability model that monitors faults, the encoding framework to handle these faults, and the adaptive routing algorithm modeled to handle uneven aging are discussed.

2.2.2.1 Unified Reliability Model

The Unified reliability model of R-layer captures faults observed due to aging, voltage scaling, and temperature, ensuring maximum reliability of NoC. Figure 2.8 shows the unified reliability model that monitors bit error rate at every voltage mode and at every threshold voltage variation (∆Vth) range. Reliability degradation is measured as the sum of

∆Vth due to voltage scaling, aging, and temperature, as shown in equation 2.2.

Reliability degradation (Rd) = ∆Vth,Temperature + ∆Vth,Aging + ∆Vth,VoltageScaling    (2.2)

where ∆Vth,Temperature is the threshold voltage change due to temperature variations, ∆Vth,Aging is the threshold voltage change due to aging, and ∆Vth,VoltageScaling is the threshold voltage change due to voltage scaling.

The threshold voltage variation range (∆Vth range) is divided into three levels along with the error types (More errors (Me), Few errors (Fe), and No errors (Ne)), as shown in Figure

2.8. As the ∆Vth range increases, the error type shifts to a higher error type (Ne is the lowest and Me is the highest) depending on the change in threshold voltage. The ∆Vth range is less than or equal to 3.3% for the Ne and Fe (1 error) error types, greater than 3.3% and less than or equal to 6.6% for the Fe (2 errors) error type, and greater than 6.6% for the Me error type. If the variation of threshold voltage is greater than 10%, it is considered to be a permanent fault [WCF11]. Similarly, the fault model shown in Figure 2.8 keeps track of the error type for all the voltage modes used in the design. For example, if the supply voltage of the router is 0.65 volts and the probability of error (pe) is greater than 6.6%, then the error type is Fe (2 errors). The RETUNES unified reliability model checks the ∆Vth range for every epoch, where the error range at every ∆Vth range is integrated with the fault model to generate the unified fault model for the voltage mode (NTV, V1, V2, V3, STV) that is active in NoC.
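
The mapping from ∆Vth range to error type described above can be summarized with a small illustrative function; the boundary handling at exactly 3.3% and 6.6% and the treatment of variations above 10% as permanent faults follow the text, while everything else is a simplification of this sketch.

```python
# Minimal sketch of the error-type classification of the unified fault model:
# the threshold-voltage variation observed in an epoch (in percent) is mapped
# to one of the error types described above. Variations above 10% are treated
# as permanent faults [WCF11].

def classify_error_type(delta_vth_percent: float) -> str:
    if delta_vth_percent > 10.0:
        return "Permanent fault"
    if delta_vth_percent > 6.6:
        return "Me"             # more errors
    if delta_vth_percent > 3.3:
        return "Fe (2 errors)"  # few errors
    return "Ne / Fe (1 error)"  # no errors or a single error

if __name__ == "__main__":
    for v in (1.0, 4.5, 8.0, 12.0):
        print(f"delta Vth = {v}% -> {classify_error_type(v)}")
```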

2.2.2.2 Encoding Framework

In order to improve network performance while handling the faults observed by the unified reliability model, an error handling design that adjusts its error correction strength depending on the error range is effective. When the probability of bit error is high, fault coverage should be increased to ensure reliability; and when the probability of error is low, fault coverage should be reduced to save power. To improve error resilience, a two-layer encoding framework is proposed based on the error range of NoC. When NoC is at a high

Figure 2.8: Unified fault model showing the error range separately for threshold voltage variation (∆Vth) and bit errors observed in RETUNES.

error range, the switch-to-switch (s2s) encoding layer is activated. In this layer, every router uses strong ECCs to increase fault coverage at every router for the input traffic. Routers at a low error range employ a weak ECC, activating the end-to-end (e2e) encoding layer, where ECC is applied only at the source and destination routers as the probability of bit error is low. Flowchart 2.9 shows the error range and its appropriate encoding layer (e2e or s2s). The NoC reliability design switches to the e2e encoding layer if the error range is Ne or Fe (1 error) and switches to the s2s encoding layer if the error range is Fe (more than 1 error) or Me. Triple modular redundancy (TMR) control lines are used to signal routers to switch between encoding layers. Encoding Layer Microarchitecture: Figures 2.10 and 2.11 show the proposed microarchitecture for the e2e and s2s encoding layers. The proposed e2e encoding layer has a 256-bit

Figure 2.9: Flowchart shows appropriate encoding layer (e2e or s2s) used in RETUNES for different error ranges (Ne,Fe,Me).

CRC-32 encoded packet with 224 data bits and a 32-bit check value. Each packet is encoded while entering and decoded while exiting the core at the network interface, as shown in Figure 2.11. A router with 3 pipeline stages would add stall cycles and increase flit delay at the lower voltage/frequency modes (NTV, V1, V2). In order to decrease overall packet latency, the routers in the s2s encoding layer consist of five pipeline stages, as shown in Figure 2.10. The 5 router pipeline stages are Buffer Write (BW), Routing Computation (RC), Virtual Channel Allocation (VA), Switch Allocation (SA), and Switch Traversal (ST) [JKP17].

• BW stage: In this first stage of the router pipeline, the head flit is written to a virtual channel (buffer) after entering the router.

Figure 2.10: RETUNES switch-to-switch encoding layer microarchitecture showing encoder and decoder of R-layer along with the router pipeline stages.

• RC stage: In this second stage, destination information is read from the head flit to compute the output port. Routing protocol also plays an important role along with the destination information to determine the output port of the flit.

• VA stage: In this third stage, a Virtual Channel (VC) is allocated for the whole packet. If multiple packets (head, body, and tail flits) contend for the same VC, the head flits compete to use the VC at the downstream routers. The winner of the VA stage is allocated the VC, whereas the loser can compete during the next cycle.

• SA stage: In this fourth stage, packets compete to access the crossbar. Multiple VCs poll to gain access to the output port of the crossbar.

• ST stage: In this fifth stage, the winner flit from the SA stage is allowed to traverse the switch, whereas the loser can compete during the next cycle. Finally, flits are

Figure 2.11: RETUNES end-to-end encoding layer microarchitecture showing the encoder and decoder of the R-layer at the Network Interface (NI).

transmitted to either the downstream router or the destination router after the ST stage. If flits are transmitted to a downstream router, all these 5 pipeline stages are repeated at every router until the packet reaches the destination router.

The proposed s2s encoding layer is implemented using the Hamming code H(72,64), which is a Single Error Correction And Double Error Detection (SECDED) code. Using SECDED Hamming codes, all 1-bit errors are recovered and all 2-bit errors are detected. The received codeword vector and the transpose of the parity-check matrix of the Hamming code are multiplied to detect errors in the syndrome generator. The decoder design forwards the data bits along with the parity bits to the encoder in the upstream router. The syndrome generator compares the forwarded parity bits with the new parity bits in the encoder design to detect faults.

However, faults cannot be corrected in the encoder, as the correcting hardware is not present in the encoder design. Finally, the erroneous bit is corrected and transmitted to the next router/core. The ECC scheme of RETUNES detects all two-bit errors and corrects all single-bit errors obtained from the unified reliability model at each epoch. If a fault is detected and cannot be corrected, the flit is dropped and a request for retransmission is sent. When the number of faults in a packet increases, the entire packet is dropped and retransmitted to prevent communication loss.
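
As a software illustration of the e2e check described above, the following sketch appends a 32-bit check value to a 224-bit payload and verifies it at the destination. Python's standard zlib CRC-32 is used only as a stand-in; the thesis does not specify the generator polynomial of its CRC-32 hardware, and the Hamming H(72,64) s2s path is not modeled here.

```python
import zlib

# Minimal sketch of the end-to-end (e2e) check: a 224-bit payload is protected
# by a 32-bit check value computed at the source network interface and verified
# at the destination. A failed check would trigger a retransmission request.

PAYLOAD_BYTES = 224 // 8   # 224 data bits per 256-bit packet

def e2e_encode(payload: bytes) -> bytes:
    assert len(payload) == PAYLOAD_BYTES
    check = zlib.crc32(payload)
    return payload + check.to_bytes(4, "big")   # 256-bit packet

def e2e_check(packet: bytes) -> bool:
    payload, check = packet[:PAYLOAD_BYTES], packet[PAYLOAD_BYTES:]
    return zlib.crc32(payload) == int.from_bytes(check, "big")

if __name__ == "__main__":
    pkt = e2e_encode(bytes(range(PAYLOAD_BYTES)))
    print(e2e_check(pkt))                        # True
    corrupted = bytes([pkt[0] ^ 0x01]) + pkt[1:] # flip one payload bit
    print(e2e_check(corrupted))                  # False -> request retransmission
```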

2.2.2.3 Adaptive Routing

The lifetime of a device is a measure of the wear-out (aging) of the device over a period of time. The asymmetrical distribution of traffic leads to uneven aging, which in turn degrades the reliability of the network. RETUNES determines the aging process as a measure of threshold voltage change, which varies for all the proposed voltage modes, and constantly monitors link utilization dynamically to understand the stress levels of the links. In order to symmetrically distribute traffic in NoC, an efficient routing algorithm is crucial. RETUNES uses an adaptive routing algorithm to distribute its packets uniformly throughout NoC to improve the lifetime of the transistors. For every epoch, the routing algorithm collects the average link utilization for the current router at runtime. Figure 2.12 shows the map of a single router with all four links (x, -x, y, -y) of the router connecting adjacent routers and a link for the core. A packet is adaptively routed along the least utilized link among the available four links (directions) of the router. Algorithm 2.13 shows the proposed routing algorithm (adaptive routing algorithm), which determines the path of the packet. If the average link utilization along the x-axis is greater than that along the y-axis, provided the x/y-coordinates of the current and destination router are not equal, the packet is routed along the y-coordinate. Similarly, if the average link utilization along the x-axis is less than that along the y-axis, provided the x/y-coordinates of the current and destination router are not equal, the packet is routed along the

Figure 2.12: Map of a single router showing the five directions of the router: four links (x, -x, y, -y) connecting adjacent routers and a link for the core.

x-coordinate. If the x-coordinate of the current router is equal to that of the destination router, the routing algorithm routes the packet along the y-direction, and if the y-coordinate of the current router is equal to that of the destination router, the routing algorithm routes the packet along the x-direction, ignoring link utilization values. RETUNES effectively adapts to runtime changes and makes in-flight routing decisions, eliminating offline calculations and lookup tables to improve the performance (reduce the latency) of NoC. Temperature is another factor that leads to uneven aging. As the stress level of a link increases, the rate of aging increases with the elevated temperatures, eventually decreasing the lifetime of NoC. RETUNES experiences tolerable device temperatures as the supply voltage of NoC is not constantly high (STV). Figure 2.14 shows that lowering the supply voltage slows down the aging process due to reduced temperatures and slower threshold voltage variations; thus, RETUNES is reliable even at lower voltage modes.

Figure 2.13: RETUNES routing algorithm (adaptive routing algorithm) to determine the path of the packet
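
The routing decision of Figure 2.13 can be summarized by the small sketch below. The coordinate and link-utilization bookkeeping are assumptions of the sketch; only the decision rule follows the algorithm described above.

```python
# Minimal sketch of the adaptive routing decision: when both coordinates still
# differ from the destination, the packet takes the axis whose links are
# currently less utilized; otherwise it must follow the remaining axis.

def route(cur, dst, util_x, util_y):
    """Return 'x' or 'y': the axis along which the packet leaves this router.

    cur, dst -- (x, y) coordinates of the current and destination routers
    util_x   -- average utilization of this router's x/-x links (flits/cycle)
    util_y   -- average utilization of this router's y/-y links (flits/cycle)
    """
    if cur[0] == dst[0]:        # same column: only y progress remains
        return "y"
    if cur[1] == dst[1]:        # same row: only x progress remains
        return "x"
    return "y" if util_x > util_y else "x"

if __name__ == "__main__":
    print(route((1, 1), (3, 2), util_x=0.30, util_y=0.10))  # 'y' (x links busier)
    print(route((1, 2), (3, 2), util_x=0.30, util_y=0.10))  # 'x' (same row)
```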

2.3 Centralized Control Unit

The Centralized Control Unit (CCU) is the heart of RETUNES, which handles critical decisions such as voltage mode switching and Error-Correcting Code (ECC) strength. Figure 2.15 shows the CCU and the on-chip linear voltage regulator for a concentrated 4 × 4 2D mesh topology. The Mode Control Unit (MCU) is the part of the CCU that decides the operating voltage mode of NoC depending on the buffer utilization value at every epoch. The ECC strength of NoC is decided by the Layer Control Unit (LCU) part of the CCU. While it is possible to make switching decisions locally on a per-router basis, NoC is controlled globally to reduce the cost and complexity of the controller. Recently, voltage regulators and NoCs have been integrated on the same chip to minimize the power and area cost of off-chip voltage regulators [Gja08]. The linear on-chip voltage regulator used in the design is shown

Figure 2.14: Graph of threshold voltage change for different supply voltages, showing that lowering the supply voltage slows down the threshold voltage (Vth) change.

in Figure 2.15, which changes the voltage at a rate of 30 mV/ns with a minimum of 5% power loss. A single on-chip voltage regulator is used to regulate the voltage globally in NoC, as shown in Figure 2.16. All the routers are instructed to wait an additional cycle to settle in the new voltage mode to prevent communication loss. RETUNES selects the appropriate voltage mode of operation based on the average buffer utilization of NoC for every epoch. The lowest voltage mode is the most power-efficient mode, while the highest voltage mode boosts the performance of the network with minimum latency. I consider four stages of operation for the reliable and energy-efficient RETUNES architecture as follows: Step 1: Initially, all the packets are encoded with CRC-32. The buffer utilization of the active layer is constantly monitored, which serves as the information to the MCU. For every epoch, the MCU updates the network utilization level. Traffic information is gathered from the buffer utilization trends of the Splash-2 [WO+95], PARSEC [BL09], and AxBench [YMEL17] benchmarks, with the epoch size carefully chosen to be 100 cycles to avoid

Figure 2.15: RETUNES Centralized Control Unit (CCU) showing voltage regulator, CCU micro architecture, and control sequence between CCU and a core

power loss due to frequent voltage mode switching. Step 2: Once the voltage mode is switched, information regarding the current and previous voltage modes is passed to the LCU and the MCU. The MCU senses the local changes in the network in order to send a voltage change request to the voltage regulator. The voltage regulator responds to the request sent by the MCU and scales the supply voltage. The output voltage of the voltage regulator is used as the supply voltage to NoC. The corresponding overhead delay and frequency are applied to NoC according to the mode control algorithm, as shown in Figure 2.17. At the same time, all the routers are instructed to complete the in-flight flit transmissions to avoid data loss before switching the voltage modes. Step 3: The LCU plays a crucial role in deciding to switch between the encoding layers, e2e or s2s. The current voltage mode of the network (from the MCU) and the probability of error obtained from the unified reliability model are passed to the LCU, where the decision is made

Figure 2.16: Design of global on-chip voltage regulator for NoC in RETUNES

to upgrade or downgrade the ECC mode. The ECC of RETUNES uses a SECDED Hamming code at every switch to correct all 1-bit errors and detect all 2-bit errors. If an error cannot be corrected, a request for retransmission is sent to the source router. The counters at each router constantly keep track of the fault rate. Step 4: On receiving the retransmission signal, the source router resends the requested flit to the destination router. Data is stored in the retransmission buffers until an acknowledgement (ACK) is received. Data and ACK lines are assumed to be separate, and fault coverage is not considered for the retransmitted data. Once the retransmission is successful, an acknowledgement is sent to the source router to terminate the process.
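
The following sketch summarizes the per-epoch decision sequence of Steps 1-4: the ramp ordering for a voltage step-up versus step-down (as described in Section 2.2.1.1) and the LCU's choice of encoding layer from the error range (Figure 2.9). The `regulator` interface is hypothetical and only stands in for the on-chip voltage regulator; it is not an API defined by the thesis.

```python
# Minimal sketch of the CCU decision sequence. `old` and `new` are
# (Vdd [V], frequency [GHz], overhead_delay [cycles]) tuples for the previous
# and newly selected voltage modes (e.g. rows of Table 2.1).

def apply_mode_change(regulator, old, new):
    """Step-down: frequency first, wait, then voltage.
       Step-up:   voltage first, wait, then frequency."""
    if new[0] < old[0]:
        regulator.set_frequency(new[1])
        regulator.wait_cycles(new[2])   # overhead delay
        regulator.set_voltage(new[0])
    elif new[0] > old[0]:
        regulator.set_voltage(new[0])
        regulator.wait_cycles(new[2])
        regulator.set_frequency(new[1])

def select_encoding_layer(error_range: str) -> str:
    """LCU decision (Figure 2.9): weak e2e coverage vs. strong s2s coverage."""
    return "e2e" if error_range in ("Ne", "Fe (1 error)") else "s2s"
```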

Figure 2.17: RETUNES mode control algorithm

3 Performance Evaluation

RETUNES is evaluated on a 4 × 4 concentrated mesh topology with 64 cores and unidirectional links. Each router has four VCs for every input port and 4 buffer slots per VC. Each packet has 256 bits and is split into four equal 64-bit flits before being injected into the network. In this chapter I evaluate the performance of RETUNES to compare the V5 scheme with the other voltage schemes. The four evaluation schemes, Always-STV, Always-NTV, V2, and V5, are described below:

STV: In the STV/Always-STV scheme, NoC is operated in the nominal voltage mode (1 volt) where power consumption is maximum. NoC shows the best performance, with high application speedup and a low bit error rate (bit errors due to reduced frequency/voltage), in the Always-STV scheme. The performance of the Always-STV scheme using XY routing (Always-STV (XY)) under the Dimension Order Routing algorithm (DOR) and under the Adaptive Routing Algorithm (ADP) (Always-STV (ADP)) is also compared. NTV: In the NTV/Always-NTV scheme, NoC and its cores are operated at a low operating voltage (close to the threshold voltage). This scheme shows high energy efficiency at the cost of latency. NoC and its cores suffer from performance loss due to the increase in the error rate in the Always-NTV scheme. However, this low-voltage scheme (Always-NTV) has less impact

on ∆Vth, providing lower error probabilities when compared to the other schemes. The Always-NTV scheme under DOR-XY routing (Always-NTV (XY)) is considered the baseline model for RETUNES. V2 (2 voltage scheme): In the V2 scheme, the operating voltage applied to NoC and its cores is switched between 2 voltages (STV and NTV). In this scheme, NoC is operated in NTV mode for 25-30% buffer utilization depending on the traffic to avoid congestion, whereas STV mode is used for buffer utilization higher than 25-30%. V5 (5 voltage scheme): V5 is the proposed scheme of RETUNES with the 5-level voltage scaling design. This energy-efficient design includes the NTV, V1, V2, V3, and STV voltage modes. These voltage modes are switched based on the communication demand, providing better performance at the cost of chip area. The power consumption of the network is low when it uses the baseline model, whereas the bit error rate is low when it uses the STV voltage mode. Since the bit error rate is also lower in the Always-NTV scheme due to lower ∆Vth when compared to the STV scheme, the V5 scheme provides a fine balance between bit errors due to voltage scaling and aging. NoC optimizes energy consumption when it uses the NTV voltage mode and has lower latency when operated in the STV voltage mode. The energy efficiency and reliability of the four evaluation schemes are analyzed with real traffic traces from various applications in the Splash-2 [WO+95], PARSEC [BL09], and AxBench [YMEL17] workloads; the applications and their domains used in the design are listed in Table 3.1.

Table 3.1: Applications used in the design.

Applications Domains

JPEG Image processing

BLACKSCHOLES Financial Analysis

RAYTRACE Graphics

LU High-Performance Computing

RADIOSITY Graphics

FLUIDANIMATE Animation

FACESIM Animation

3.1 RETUNES Evaluation Approach

The evaluation model of the RETUNES architecture for the four evaluation schemes is explained in this section. Figure 3.1 shows the methodology for evaluating transistor lifetime (aging), latency (delay), area, and power. A network simulator (Netsim) is used to evaluate traffic patterns and performance in NoC using the real traffic traces obtained from Multi2Sim. The NoC link utilization used to find the device temperature with the HotSpot thermal model is obtained from Netsim. The average dynamic power from Netsim is provided as an input file to the HotSpot thermal model and the router fault model to calculate ∆Vth due to temperature variations in NoC. Synopsys HSPICE and Netsim are used to calculate ∆Vth due to aging and voltage scaling. The ∆Vth values due to supply voltage, temperature, and aging are used to calculate the reliability degradation (Rd) of the transistor. AxBench is a multiplatform benchmark suite for approximate computing. In RETUNES, JPEG, a lossy image compression application, is approximated by applying Neural Processing Units (NPUs) using AxBench. The Randomness Calculator calculates the randomness of the compressed original image and the compressed approximated image and generates trace files for both the original and approximated images. These trace files are used to evaluate the power, latency, and Energy-Delay Product (EDP) of RETUNES. The power and area cost of the network is obtained from the Synopsys Design Compiler tool using the TSMC 45nm technology libraries [PAM+07] and the DSENT NoC modeling tool [SCK+12].

3.2 RETUNES Results

In this section I discuss RETUNES performance (power, delay), reliability, area overhead, lifetime, and Energy-Delay Product, without considering the approximate communication technique.

Figure 3.1: Methodology for evaluating RETUNES performance. Showing evaluation flow of the approximate communication (orange), all others (blue), and end results (green and gray)

3.2.1 Power and Area Overhead Analysis

Figure 3.2 shows the dynamic power consumed (mW) by NoC to transmit packets from source to destination in all the evaluation schemes. The results include the total dynamic power consumed due to transmission and retransmission (MBUs) of the packets, due to the EE-layer hardware, and due to the R-layer hardware. With the unified reliability model, the V5 scheme shows an average of 60-61% savings in power across multiple applications when compared to the Always-STV scheme and 23% savings when compared to the V2 scheme. Since Always-NTV operates in the lowest mode irrespective of network load, it consumes the least power among the four schemes. Analyzing the area overhead of the fault handling hardware is crucial to design an efficient reliability model. An area-efficient fault model will decrease the overall area

Figure 3.2: Total dynamic power cost for Splash-2 and PARSEC benchmarks of 64 core NoC when operated in four proposed schemes. Lower is better.

consumed by the fault handling hardware in NoC. Figure 3.3 shows the area cost of the e2e (CRC-32) and s2s (Hamming) encoding layers proposed in RETUNES. The reliability design and the control unit occupy 2.4% and 0.39% of the overall chip area respectively.

3.2.2 Packet Latency Analysis

This subsection evaluates the average latency of NoC for different applications while operating in the proposed evaluation schemes of RETUNES. The overhead delay and on-chip communication delay due to voltage scaling, and the transistor delay due to aging, are included in the analysis to find the overall delay of NoC. Delay due to aging is calculated based on

∆Vth, where ∆Vth is directly proportional to the transistor gate delay. According to the alpha power law [SN90], ∆Vth for a given supply voltage results in a transistor delay as shown in equation 3.1.

dg = ϕ Vdd / (µ (Vdd − Vth)^α)    (3.1)
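
As a numerical illustration of Equation 3.1, the sketch below evaluates the relative increase in gate delay caused by an aging-induced ∆Vth at a fixed supply voltage; ϕ and µ cancel in the ratio, and the values of α and Vth used here are assumptions of this sketch, not parameters taken from the thesis.

```python
# Small numerical illustration of Equation 3.1: an increase in threshold
# voltage (aging) increases the gate delay. phi and mu cancel when comparing
# aged vs. fresh delay at the same supply voltage; alpha = 1.3 and
# Vth = 0.3 V are assumed values for illustration only.

def relative_gate_delay(vdd, vth_fresh, delta_vth, alpha=1.3):
    """Return d_aged / d_fresh from Eq. 3.1 at a fixed supply voltage."""
    fresh = vdd / (vdd - vth_fresh) ** alpha
    aged = vdd / (vdd - (vth_fresh + delta_vth)) ** alpha
    return aged / fresh

if __name__ == "__main__":
    # e.g. a 30 mV threshold-voltage shift at STV (1.0 V)
    print(relative_gate_delay(1.0, 0.3, 0.03))   # ~1.06 -> roughly 6% slower gates
```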

Figure 3.3: Area overhead of the decoder, encoder and router for CRC and Hamming code used in s2s and e2e encoding designs.

Figure 3.4 explicitly shows the breakdown of packet latency in the four evaluation schemes when no reliability model is considered (shown in blue) and when reliability costs are included (shown in orange). With no reliability model, the average packet latency is approximately 1.2× higher in the V2 scheme and 10.8× higher in the baseline model when compared to the V5 scheme. The average packet latency is 1.6× higher in V5 when compared to the Always-STV scheme, as the operating frequency of NoC in the Always-STV scheme is high (2.3 GHz in this case) throughout the application runtime. Similarly, when reliability delays are included, the average packet latency is 1.35× higher in V2 and approximately 12× higher in the baseline model when compared to the V5 scheme. Moreover, the retransmission delays vary with the operating frequency and therefore the results reflect the delay cost accordingly.

3.2.3 Lifetime Evaluation

Transistor aging caused by HCI and NBTI is modeled as the threshold voltage variation using the Synopsys HSPICE tool and the Predictive Technology Model (PTM) [ptm]

Figure 3.4: Normalized average packet latency (normalized to baseline model - Always-NTV (XY)) for Splash-2 and PARSEC benchmarks of 64 core NoC when operated in four proposed schemes. Blue shows latency cost without reliability. Orange shows reliability cost. Lower is better.

for a 45nm transistor technology node. According to the results shown in Figure 3.5 for the five voltage modes (NTV, V1, V2, V3, and STV) over a degradation period of 10 years, it is evident that the threshold voltage variations (∆Vth) decrease with the operating voltage, mitigating the aging process. The ∆Vth in the transistor due to elevated temperature (∆Vth,Temperature) is approximately 1.15× higher than the ∆Vth due to voltage scaling (∆Vth,VoltageScaling). Similarly, the ∆Vth in the transistor due to aging (∆Vth,Aging) is 1.52× higher when compared to ∆Vth,VoltageScaling. The unified reliability model of RETUNES collects errors due to aging and voltage scaling even though the bit errors observed due to ∆Vth,VoltageScaling are negligible when compared to the bit errors observed due to ∆Vth,Aging. RETUNES at low voltage modes

Figure 3.5: Threshold voltage change (∆Vth) due to voltage scaling, elevated temperature, and aging at 5 different supply voltages. Lower is better.

experiences low stress levels due to the HCI/NBTI effect, which in turn slows the rate of the aging process. Apart from aging, uneven wear-out also affects the lifetime of the device. As explained in previous chapters, uniform distribution of traffic using routing algorithms is one of the efficient ways to decrease uneven aging in NoC. Traffic in NoC is correlated with the temperature variations, which are modeled using the HotSpot thermal model. Figure 3.6 shows the HotSpot thermal map for the Always-STV scheme under XY-routing and the V5 scheme of RETUNES under the adaptive routing algorithm. The routing algorithm modeled in RETUNES showed a more uniform traffic distribution when compared to XY routing for the Always-STV scheme.

Figure 3.6: Comparing HotSpot thermal map of Always-STV under xy-routing and RETUNES (V5 scheme) under adaptive routing. RETUNES shows uniform and lower device temperatures when compared to Always-STV under XY-routing.

3.2.4 Reliability Analysis

The RETUNES hybrid encoding scheme consumes 6% power to improve the resiliency of NoC by tuning fault coverage. On average, the mean bit error rate of the V5 scheme is 0.45× that of the Always-NTV scheme and 2.5× that of the Always-STV scheme. As expected, Always-STV exhibited a lower error rate and Always-NTV exhibited a higher error rate when compared to the other schemes, whereas the V5 scheme of RETUNES displayed an error rate in between Always-STV and Always-NTV in order to balance the power-reliability tradeoff. Single Error Correction and Double Error Detection (SECDED) can prevent retransmissions by correcting all single-bit errors and detecting all two-bit errors. However, a full retransmission is needed if an error cannot be corrected. Figure 3.7 shows the bit error rate due to voltage scaling and aging, where the mean error rate due to voltage scaling is nearly 28% and the mean error rate due to aging is approximately 72% of

Figure 3.7: Bit error rate observed in RETUNES due to voltage scaling and aging.

the overall error rate observed in NoC. The reliability analysis in this thesis considers only soft errors (SBUs, MBUs); permanent faults and faults during retransmission are not considered.

3.2.5 Energy-Delay Product

In order to provide a meaningful insight into the performance of RETUNES, the energy and delay of NoC are combined into a single plot to analyze the advantages among the proposed schemes. When analyzing the Energy Delay Product (EDP), lower is considered better. Figure 3.8 shows the normalized EDP plot comparing all four proposed schemes (NTV, V2, V5, STV) under adaptive routing (ADP) and XY-routing (XY). NoC in the Always-STV scheme shows a decrease in packet latency and an increase in overall power consumption, whereas the Always-NTV scheme shows a decrease in overall power consumption and an increase in packet latency. RETUNES under the V5 scheme improved the EDP of NoC by

Figure 3.8: Normalized Energy Delay Product (EDP) (normalized to baseline model - Always-NTV (XY)) for Splash-2 and PARSEC benchmarks for four proposed schemes. Blue shows EDP without reliability. Orange shows reliability cost. Lower is better.

7.5×, 2×, 1.6×, and 1.3× when compared to baseline scheme (Always-NTV), Always-STV scheme under XY-routing (Always-STV(XY)), Always-STV scheme under ADP routing (Always-STV (ADP)), and V2 scheme under XY-routing, respectively.

3.3 Approximate Communication Evaluation

This section provides latency, power, and Energy-Delay Product analysis of the RETUNES evaluation schemes using the approximate communication technique. AxBench is a data approximation computing benchmark suite used to approximate communication data in four stages. Stage 1: In this stage, a block of code in the application (JPEG) is annotated to apply data approximation. Parrot transformation is used to annotate the required block of the code. Stage 2: In this stage, compilation parameters such as learning rate, number of epochs, sampling rate, test data fraction, maximum number of layers, and maximum number of neurons per layer are given as input to AxBench. Stage 3: In the first step of the third stage, the AxBench simulation is performed. During simulation, training data is collected and then the compilation parameters are taken as input. In the next step, AxBench explores different Neural Network (NN) topologies to find the topology that fits the application best. In the second step of this stage, the original code block is replaced with the NN code and then compiled. In the final step, the NN code is tested on different images at different approximation levels. Stage 4: The output of the simulation process is the original image and the approximated NN image along with the error rate (due to approximation). The randomness of the original image and the approximated image from the JPEG application is calculated using the Matlab programming platform [Mat96]. Observations showed that the original image is more random than the approximated image, so the original image requires a greater number of packet transmissions than the NN image. Routing algorithms such as XY-routing and adaptive routing are used to test the performance (power, latency) of the application. Figures 3.9 and 3.10 compare the compressed original image (shown on the left) with the compressed NN image and its error rate (shown on the right). Approximate communication further improves energy efficiency, as explained in the following subsections.

3.3.1 Packet Latency Analysis

Figure 3.11 shows the normalized latency of the JPEG encoder for the compressed original image and the compressed approximated image under two different routing algorithms. First, I applied the approximate communication technique from AxBench on the JPEG application to find the number of packets needed to transmit data from source to destination. After approximation, the JPEG application was observed to have a decrease in packet

Figure 3.9: Comparing the original image (left) with the compressed NN image at different error percentages (right).

count by 8.5%. This decrease in packet count drastically decreased the number of flit transmissions and retransmissions in NoC. The approximated image, routed using the adaptive routing algorithm, shows approximately a 50% decrease in packet latency when compared to the XY-routed original image. As expected, the Always-STV scheme experienced lower packet latency than the V5, V2, and Always-NTV schemes. The approximated image with adaptive routing in the V5 scheme showed 40% to 49% lower latency than the V2 scheme and 64% to 82% lower latency than the Always-NTV scheme. The approximate communication technique showed an additional 10% decrease in the packet latency of RETUNES.

Figure 3.10: Comparing the original image (left) with the compressed NN image at different error percentages (right).

3.3.2 Power and Energy Analysis

RETUNES shows additional dynamic power savings using the approximate communication technique apart from voltage scaling. The approximated image shows nearly a 9% decrease in dynamic power consumption when compared to the original image, with a 2.6% error rate, as shown in Figure 3.12. Similarly, RETUNES shows an additional 59% and 24% decrease in dynamic power consumption when compared to the Always-STV scheme and the V2 scheme respectively. Figure 3.13 shows the normalized dynamic energy of the original and the approximated image of the JPEG encoder application at different error rates (due to compression). Approximating an image in the V5 scheme additionally saves nearly 13% of energy. The energy consumption of NoC decreases drastically, by 32%, when the adaptive routing technique is combined with approximate communication. As the error rate

Figure 3.11: Normalized average packet latency (normalized to baseline model - Always-NTV) for both original and approximated image for AxBench benchmarks of 64 core NoC when operated in all the proposed schemes. Lower is better.

is increased, the energy savings increase and the quality of the image decreases. An approximated image at 9.96% error saves nearly 64% of the energy when compared to the original image with a 2.6% error. So, the compressed and approximated image with the maximum error rate (9.96% in this case) using the adaptive routing technique consumes the least energy compared to the others in Figure 3.13.

3.3.3 Energy-Delay Product (EDP) Analysis

Finally, I analyzed the EDP of all four evaluation schemes (Always-NTV, V2, V5, Always-STV), where the EDP values are normalized to the EDP of the original image for the Always-NTV scheme. Figure 3.14 shows the EDP of the approximated and original images under XY-routing and adaptive routing, where the EDP of the Always-NTV scheme is higher than the other schemes. The EDP of the V5 scheme is 20% less than that of the V2 scheme and approximately 52% and 80% less than that of the Always-STV and Always-NTV schemes

Figure 3.12: Normalized dynamic power (normalized to Always-STV (XY)) for both original and approximated image for AxBench benchmarks of 64 core NoC when operated in all the proposed schemes. Lower is better.

respectively. Applying the data approximation technique to RETUNES additionally decreased EDP by approximately 19%.

Figure 3.13: Normalized Dynamic energy (normalized to Always-NTV scheme original image of 2.6% error rate) for both original and approximated image for AxBench benchmarks of 64 core NoC when operated in all the proposed schemes. Lower is better.

Figure 3.14: Normalized EDP (normalized to Always-NTV scheme) for both original and approximated image for AxBench benchmarks of 64 core NoC when operated in all the proposed schemes. Lower is better.

4 Conclusions and Future Work

In this thesis, I proposed a reliable and energy-efficient Network-on-Chip architecture implementing a five voltage mode (V5) scheme. The V5 scheme of RETUNES showed promising results while implementing voltage scaling (including NTV), adaptive routing, and approximate communication techniques in the design. RETUNES showed power savings of nearly 2.5 × by carefully choosing the appropriate voltage mode (including NTV scaling) for the varying traffic in NoC. Symmetrical distribution of traffic using the dynamic adaptive routing algorithm showed balanced wear-out of links, thus increasing the lifetime of NoC and making it more reliable. I demonstrated that ∆Vth due to voltage scaling is less when

compared to ∆Vth due to elevated temperature and the aging effect in NoC. I then evaluated the combined effects of the five voltage mode design with adaptive routing, which decreased NoC latency by 10-12 ×, and improved EDP by 1.3-7.5 × (including reliability) when compared to traditional NTV designs. I also observed that the error rate increases as the operating voltage of NoC decreases. The hybrid encoding scheme of RETUNES handles all the bit errors due to low supply voltage and aging, with a minimum area overhead of 2.79% of chip area (reliability design and control unit) and a power cost of 6%. Results showed that the unified reliability model and the encoding scheme of RETUNES work together to improve NoC resiliency by tuning fault coverage. The approximate communication technique implemented in the design showed additional power savings of 13%, while further reducing latency and EDP by 10% and 19% respectively. For future work, RETUNES can be extended to predict the NoC workload and to proactively change voltage modes according to the incoming traffic using various machine learning techniques. It is an interesting idea to implement frequency islands at a single supply voltage in RETUNES, where the operating frequency of the links can be scaled according to the network workloads. RETUNES evaluated aging, energy efficiency, and reliability of the routers and links of NoC, leaving the performance of the cores unexplored.

This thesis can be extended by applying NTV and voltage scaling techniques to the cores (heterogeneous) of NoC to observe the power-performance tradeoff. The RETUNES reliability layer applies fault tolerant techniques to react to faults that have already occurred due to low supply voltage, high device temperature, and aging. I believe that using a reliability model that proactively predicts faults might improve the latency and reliability of RETUNES. Machine learning can be used to predict faults and to mitigate errors at runtime. As memory cells are more vulnerable to bit errors at low supply voltage, applying voltage scaling to the memory cells would be an interesting approach.

References

[AAF+14] Assad Abbas, Mazhar Ali, Ahmad Fayyaz, Ankan Ghosh, Anshul Kalra, Samee U Khan, Muhammad Usman Shahid Khan, Thiago De Menezes, Sayantica Pattanayak, Alarka Sanyal, et al. A survey on energy-efficient methodologies and architectures of network-on-chip. Computers & Electrical Engineering, 40(8):333–347, 2014.

[AAKL+10] Masud Al Aziz, Samee Ullah Khan, Thanasis Loukopoulos, Pascal Bouvry, Hongxiang Li, and Juan Li. An overview of achieving energy efficiency in on-chip networks. International Journal of Communication Networks and Distributed Systems, 5(4):444–458, 2010.

[ACC+09] Jaume Abella, Javier Carretero, Pedro Chaparro, Xavier Vera, and Antonio González. Low vccmin fault-tolerant cache with highly predictable performance. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 111–121. ACM, 2009.

[ACP11] Konstantinos Aisopos, Chia-Hsin Owen Chen, and Li-Shiuan Peh. Enabling system-level modeling of variation-induced faults in networks-on-chips. In Proceedings of the 48th Design Automation Conference, pages 930–935. ACM, 2011.

[ACR13] Dean Michael Ancajas, Koushik Chakraborty, and Sanghamitra Roy. Proactive aging management in heterogeneous nocs through a criticality-driven routing approach. In Proceedings of the Conference on Design, Automation and Test in Europe, pages 1032–1037. EDA Consortium, 2013.

[ACV05] Carlos Alvarez, Jesus Corbal, and Mateo Valero. Fuzzy memoization for floating-point multimedia applications. IEEE Transactions on Computers, 54(7):922–927, 2005.

[AD+06] K. Agarwal, H. Deogun, et al. Power gating with multiple sleep modes. In 7th International Symposium on Quality Electronic Design (ISQED’06), 2006.

[AFGM11] Amin Ansari, Shuguang Feng, Shantanu Gupta, and Scott Mahlke. Archipelago: A polymorphic cache design for enabling robust near-threshold operation. 2011.

[AHY+16] Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. A scalable processing-in-memory accelerator for parallel graph processing. ACM SIGARCH Computer Architecture News, 43(3):105–117, 2016.

[AWC+11a] Alaa R Alameldeen, Ilya Wagner, Zeshan Chishti, Wei Wu, Chris Wilkerson, and Shih-Lien Lu. Energy-efficient cache design using variable-strength

error-correcting codes. In ACM SIGARCH Computer Architecture News, volume 39, pages 461–472. ACM, 2011.

[AWC+11b] Jason Ansel, Yee Lok Wong, Cy Chan, Marek Olszewski, Alan Edelman, and Saman Amarasinghe. Language and compiler support for auto-tuning variable-accuracy algorithms. In Code Generation and Optimization (CGO), 2011 9th Annual IEEE/ACM International Symposium on, pages 85–96. IEEE, 2011.

[AYMC15] Junwhan Ahn, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. Pim-enabled instructions: a low-overhead, locality-aware processing-in-memory architecture. In Computer Architecture (ISCA), 2015 ACM/IEEE 42nd Annual International Symposium on, pages 336–348. IEEE, 2015.

[BC+12] K. Bhardwaj, K. Chakraborty, et al. An milp-based aging-aware routing algorithm for nocs. In 2012 Design, Automation Test in Europe Conference Exhibition (DATE), 2012.

[BCR12a] Kshitij Bhardwaj, Koushik Chakraborty, and Sanghamitra Roy. An milp-based aging-aware routing algorithm for nocs. In Design, Automation & Test in Europe Conference & Exhibition (DATE), 2012, pages 326–331. IEEE, 2012.

[BCR12b] Kshitij Bhardwaj, Koushik Chakraborty, and Sanghamitra Roy. Towards graceful aging degradation in nocs through an adaptive routing algorithm. In Proceedings of the 49th Annual Design Automation Conference, pages 382– 391. ACM, 2012.

[BCR12c] Kshitij Bhardwaj, Koushik Chakraborty, and Sanghamitra Roy. Towards graceful aging degradation in nocs through an adaptive routing algorithm. In Proceedings of the 49th Annual Design Automation Conference, pages 382– 391. ACM, 2012.

[BD14a] James Balfour and William J Dally. Design tradeoffs for tiled cmp on- chip networks. In ACM International Conference on Supercomputing 25th Anniversary Volume, pages 390–401. ACM, 2014.

[BD14b] James Balfour and William J Dally. Design tradeoffs for tiled cmp on- chip networks. In ACM International Conference on Supercomputing 25th Anniversary Volume, pages 390–401. ACM, 2014.

[BGL12] Andrea Bianco, Paolo Giaccone, and Nanfang Li. Exploiting dynamic voltage and frequency scaling in networks on chip. In High Performance Switching and Routing (HPSR), 2012 IEEE 13th International Conference on, pages 229–234. IEEE, 2012.

[BHK+13] Abbas BanaiyanMofrad, Houman Homayoun, Vasileios Kontorinis, Dean Tullsen, and Nikil Dutt. Remediate: A scalable fault-tolerant architecture for low-power nuca cache in tiled cmps. In Green Computing Conference (IGCC), 2013 International, pages 1–10. IEEE, 2013.

[BHM+17] Rahul Boyapati, Jiayi Huang, Pritam Majumder, Ki Hwan Yum, and Eun Jung Kim. Approx-noc: A data approximation framework for network-on-chip architectures. In ACM SIGARCH Computer Architecture News, volume 45, pages 666–677. ACM, 2017.

[BJS+14] Haseeb Bokhari, Haris Javaid, Muhammad Shafique, Jörg Henkel, and Sri Parameswaran. darknoc: Designing energy-efficient network-on-chip with multi-vt cells for dark silicon. In Proceedings of the 51st Annual Design Automation Conference, pages 1–6. ACM, 2014.

[BK18] P. Bhamidipati and A. Karanth. Retunes: Reliable and energy- efficient network-on-chip architecture. In 2018 IEEE 36th International Conference on Computer Design (ICCD), pages 488–495, Oct 2018. doi:10.1109/ICCD.2018.00079

[BKS+18a] Filipe Betzel, Karen Khatamifard, Harini Suresh, David J Lilja, John Sartori, and Ulya Karpuzcu. Approximate communication: Techniques for reducing communication bottlenecks in large-scale parallel systems. ACM Computing Surveys (CSUR), 51(1):1, 2018.

[BKS+18b] Filipe Betzel, Karen Khatamifard, Harini Suresh, David J Lilja, John Sartori, and Ulya Karpuzcu. Approximate communication: Techniques for reducing communication bottlenecks in large-scale parallel systems. ACM Computing Surveys (CSUR), 51(1):1, 2018.

[BKS+18c] Filipe Betzel, Karen Khatamifard, Harini Suresh, David J Lilja, John Sartori, and Ulya Karpuzcu. Approximate communication: Techniques for reducing communication bottlenecks in large-scale parallel systems. ACM Computing Surveys (CSUR), 51(1):1, 2018.

[BL09] Christian Bienia and Kai Li. Parsec 2.0: A new benchmark suite for chip- multiprocessors. In Proceedings of the 5th Annual Workshop on Modeling, Benchmarking and Simulation, volume 2011, 2009.

[BMJG12] Paul Bogdan, Radu Marculescu, Siddharth Jain, and Rafael Tornero Gavila. An optimal control approach to power management for multi-voltage and frequency islands multiprocessor platforms under highly variable workloads. In Networks on Chip (NoCS), 2012 Sixth IEEE/ACM International Symposium on, pages 35–42. IEEE, 2012.

[BMM07] Arnab Banerjee, Robert Mullins, and Simon Moore. A power and energy exploration of network-on-chip architectures. In Networks-on-Chip, 2007. NOCS 2007. First International Symposium on, pages 163–172. IEEE, 2007.

[BMR+10] Surendra Byna, Jiayuan Meng, Anand Raghunathan, Srimat Chakradhar, and Srihari Cadambi. Best-effort semantic document search on gpus. In Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, pages 86–93. ACM, 2010.

[BWV+06] Sarvesh Bhardwaj, Wenping Wang, Rakesh Vattikonda, Yu Cao, and Sarma Vrudhula. Predictive modeling of the nbti effect for reliable design. In Custom Integrated Circuits Conference, 2006. CICC’06. IEEE, pages 189–192. IEEE, 2006.

[CPK+13] Chia-Hsin Owen Chen, Sunghyun Park, Tushar Krishna, Suvinay Subramanian, Anantha P Chandrakasan, and Li-Shiuan Peh. Smart: a single-cycle reconfigurable noc for soc applications. In Design, Automation & Test in Europe Conference & Exhibition (DATE), 2013, pages 338–343. IEEE, 2013.

[CSGK11] Tuck-Boon Chan, John Sartori, Puneet Gupta, and Rakesh Kumar. On the efficacy of nbti mitigation techniques. In Design, Automation & Test in Europe Conference & Exhibition (DATE), 2011, pages 1–6. IEEE, 2011.

[CYLA11] Young Geun Choi, Sungjoo Yoo, Sunggu Lee, and Jung Ho Ahn. Matching cache access behavior and bit error pattern for high performance low vcc l1 cache. In Design Automation Conference (DAC), 2011 48th ACM/EDAC/IEEE, pages 978–983. IEEE, 2011.

[CYZ13] MR Casu, MK Yadav, and M Zamboni. Power-gating technique for network- on-chip buffers. Electronics Letters, 49(23):1438–1440, 2013.

[DBKL16] Dominic DiTomaso, Travis Boraten, Avinash Kodi, and Ahmed Louri. Dynamic error mitigation in nocs using intelligent prediction techniques. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture, page 31. IEEE Press, 2016.

[DMN+08] Reetuparna Das, Asit K Mishra, Chrysostomos Nicopoulos, Dongkook Park, Vijaykrishnan Narayanan, Ravishankar Iyer, Mazin S Yousif, and Chita R Das. Performance and power optimization through data compression in network-on-chip architectures. In High Performance Computer Architecture, 2008. HPCA 2008. IEEE 14th International Symposium on, pages 215–225. IEEE, 2008.

[DPL+14] Zidong Du, Krishna Palem, Avinash Lingamneni, Olivier Temam, Yunji Chen, and Chengyong Wu. Leveraging the error resilience of machine-learning applications for designing highly energy efficient accelerators. In

2014 19th Asia and South Pacific Design Automation Conference (ASP-DAC), pages 201–206. IEEE, 2014.

[DWB+10] Ronald G Dreslinski, Michael Wieckowski, David Blaauw, Dennis Sylvester, and Trevor Mudge. Near-threshold computing: Reclaiming moore’s law through energy efficient integrated circuits. Proceedings of the IEEE, 98(2):253–266, 2010.

[EDL+12] Masoumeh Ebrahimi, Masoud Daneshtalab, Pasi Liljeberg, Juha Plosila, and Hannu Tenhunen. Lear-a low-weight and highly adaptive routing method for distributing congestions in on-chip networks. In Parallel, Distributed and Network-Based Processing (PDP), 2012 20th Euromicro International Conference on, pages 520–524. IEEE, 2012.

[EE11] Stijn Eyerman and Lieven Eeckhout. Fine-grained dvfs using on-chip regulators. ACM Transactions on Architecture and Code Optimization (TACO), 8(1):1, 2011.

[EEL+97] Susan J Eggers, Joel S Emer, Henry M Levy, Jack L Lo, Rebecca L Stamm, and Dean M Tullsen. Simultaneous multithreading: A platform for next-generation processors. IEEE micro, 17(5):12–19, 1997.

[ESCB12] Hadi Esmaeilzadeh, Adrian Sampson, Luis Ceze, and Doug Burger. Architecture support for disciplined approximate programming. In ACM SIGPLAN Notices, volume 47, pages 301–312. ACM, 2012.

[FAA08] Antonio Flores, Juan L Aragón, and Manuel E Acacio. An energy consumption characterization of on-chip interconnection networks for tiled cmp architectures. The Journal of Supercomputing, 45(3):341–364, 2008.

[FLJ+13] C. Feng, Z. Lu, A. Jantsch, M. Zhang, and Z. Xing. Addressing transient and permanent faults in noc with efficient fault-tolerant deflection router. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 21(6):1053– 1066, June 2013. doi:10.1109/TVLSI.2012.2204909

[Gja08] Juliana Gjanci. On-chip voltage regulation for power management in system- on-chip. 2008.

[GR09] Ahmed Garamoun and M Radetzki. Error correction techniques on noc protocol layers. In Haupt-Seminar on Reliable Network-on-Chip in the Many-Core Era, volume 23, 2009.

[GT18] Mohammad Saber Golanbari and Mehdi B Tahoori. Runtime adjustment of iot system-on-chips for minimum energy operation. In 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), pages 1–6. IEEE, 2018.

[HJ15] Robert Hesse and Natalie Enright Jerger. Improving dvfs in nocs with coherence prediction. In Proceedings of the 9th International Symposium on Networks-on-Chip, page 24. ACM, 2015.

[HM07] S. Herbert and D. Marculescu. Analysis of dynamic voltage/frequency scaling in chip-multiprocessors. In Low Power Electronics and Design (ISLPED), 2007 ACM/IEEE International Symposium on, 2007.

[Iru15] Pratheep Joe Siluvai Iruthayaraj. Dynamic voltage and frequency scaling for wireless network-on-chip. 2015.

[ITR15] ITRS international technology roadmap for semiconductors 2.0. 2015.

[JKP17] Natalie Enright Jerger, Tushar Krishna, and Li-Shiuan Peh. On-chip networks. Synthesis Lectures on Computer Architecture, 12(3):1–210, 2017.

[JPKAK14] Nima Jafarzadeh, Maurizio Palesi, Ahmad Khademzadeh, and Ali Afzali-Kusha. Data encoding techniques for reducing energy consumption in network-on-chip. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 22(3):675–685, 2014.

[JR+07] N. James, P. Restle, et al. Comparison of split-versus connected-core supplies in the power6 microprocessor. In 2007 IEEE International Solid-State Circuits Conference. Digest of Technical Papers, 2007.

[JYK08] Yuho Jin, Ki Hwan Yum, and Eun Jung Kim. Adaptive data compression for high-performance low-power on-chip networks. In Microarchitecture, 2008. MICRO-41. 2008 41st IEEE/ACM International Symposium on, pages 354–363. IEEE, 2008.

[KBW+14] Veit B Kleeberger, Martin Barke, Christoph Werner, Doris Schmitt-Landsiedel, and Ulf Schlichtmann. A compact model for nbti degradation and recovery under use-profile variations and its application to aging analysis of digital integrated circuits. Microelectronics Reliability, 54(6-7):1083–1089, 2014.

[KET16] Saman Kiamehr, Mojtaba Ebrahimi, and Mehdi Tahoori. Temperature-aware dynamic voltage scaling for near-threshold computing. In Great Lakes Symposium on VLSI, 2016 International, pages 361–364. IEEE, 2016.

[KJ13] S. Khare and S. Jain. Prospects of near-threshold voltage design for green computing. In 2013 26th International Conference on VLSI Design and 2013 12th International Conference on Embedded Systems, 2013.

[KKS15] Georgios Keramidas, Chrysa Kokkala, and Iakovos Stamoulis. Clumsy value cache: An approximate memoization technique for mobile gpu fragment shaders. In Workshop on Approximate Computing (WAPCO15), 2015.

[KVGS13] Hyungjun Kim, Arseniy Vitkovskiy, Paul V Gratz, and Vassos Soteriou. Use it or lose it: Wear-out and lifetime in future chip multiprocessors. In Microarchitecture (MICRO), 2013 46th Annual IEEE/ACM International Symposium on, pages 136–147. IEEE, 2013.

[KZBH13] Megan A Kelly, Adam P Zieba, William A Buttemer, and Anthony J Hulbert. Effect of temperature on the rate of ageing: an experimental study of the blowfly calliphora stygia. PloS one, 8(9):e73781, 2013.

[LD+] Kyoungwoo Lee, Dutt, et al. Towards soft errors.

[LEL+97] Jack L Lo, Joel S Emer, Henry M Levy, Rebecca L Stamm, Dean M Tullsen, and Susan J Eggers. Converting thread-level parallelism to instruction-level parallelism via simultaneous multithreading. ACM Transactions on Computer Systems (TOCS), 15(3):322–354, 1997.

[LND+05] Ben Lee, Eriko Nurvitadhi, Reshma Dixit, Chansu Yu, and Myungchul Kim. Dynamic voltage scaling techniques for power efficient video decoding. Journal of Systems Architecture, 51(10-11):633–652, 2005.

[MAMJ15] J. S. Miguel, J. Albericio, A. Moshovos, and N. E. Jerger. Doppelgänger: A cache for approximate computing. In 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 50–61, Dec 2015.

[Mat96] MathWorks, Inc. MATLAB: Application program interface guide, volume 5. MathWorks, 1996.

[MBJ14a] Joshua San Miguel, Mario Badr, and Natalie Enright Jerger. Load value approximation. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, pages 127–139. IEEE Computer Society, 2014.

[MBJ14b] Joshua San Miguel, Mario Badr, and Natalie Enright Jerger. Load value approximation. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, pages 127–139. IEEE Computer Society, 2014.

[MD+09a] A. K. Mishra, R. Das, et al. A case for dynamic frequency tuning in on-chip networks. In 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2009.

[MD+09b] A. K. Mishra, R. Das, et al. A case for dynamic frequency tuning in on-chip networks. In 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2009.

[MG13] Elie Maricau and Georges Gielen. Cmos reliability overview. In Analog IC Reliability in Nanometer CMOS, pages 15–35. Springer, 2013.

[Mit15] Sparsh Mittal. A survey of architectural techniques for near-threshold computing. 2015.

[Mit16] Sparsh Mittal. A survey of techniques for approximate computing. ACM Computing Surveys (CSUR), 48(4):62, 2016.

[MRCB10] Jiayuan Meng, Anand Raghunathan, Srimat Chakradhar, and Surendra Byna. Exploiting the forgiving nature of applications for scalable parallel execution. In 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), pages 1–12. IEEE, 2010.

[MS+16] J. Myers, A. Savanth, et al. A subthreshold arm cortex-m0+ subsystem in 65 nm for wsn applications with 14 power domains, 10t sram, and integrated voltage regulator. IEEE Journal of Solid-State Circuits, 2016.

[MT03a] Kartik Mohanram and Nur A Touba. Cost-effective approach for reducing soft error failure rate in logic circuits. In Proceedings of the International Test Conference (ITC), page 893. IEEE, 2003.

[MT03b] Kartik Mohanram and Nur A Touba. Partial error masking to reduce soft error failure rate in logic circuits. In Defect and Fault Tolerance in VLSI Systems, 2003. Proceedings. 18th IEEE International Symposium on, pages 433–440. IEEE, 2003.

[MVD11] A. K. Mishra, N. Vijaykrishnan, and C. R. Das. A case for heterogeneous on-chip interconnects for cmps. In 2011 38th Annual International Symposium on Computer Architecture (ISCA), pages 389–399, June 2011.

[NO97] BA Nayfeh and K Olukotun. A single-chip multiprocessor. Computer, 30(9):79–85, 1997.

[NSB16] Nasim Nasirian, Reza Soosahabi, and Magdy Bayoumi. Traffic-aware power-gating scheme for network-on-chip routers. In Circuits and Systems Conference (DCAS), 2016 IEEE Dallas, pages 1–4. IEEE, 2016.

[ONH+96] Kunle Olukotun, Basem A Nayfeh, Lance Hammond, Ken Wilson, and Kunyung Chang. The case for a single-chip multiprocessor. In ACM Sigplan Notices, volume 31, pages 2–11. ACM, 1996.

[OS95] Shigeo Ogawa and Noboru Shiono. Generalized diffusion-reaction model for the low-field charge-buildup instability at the Si-SiO2 interface. Physical Review B, 51(7):4218, 1995.

[PAM+07] Antonio Pullini, Federico Angiolini, Paolo Meloni, David Atienza, Srinivasan Murali, Luigi Raffo, Giovanni De Micheli, and Luca Benini. Noc design and implementation in 65nm technology. In Proceedings of the First International Symposium on Networks-on-Chip, pages 273–282. IEEE Computer Society, 2007.

[Pen17] David R Penas. Optimization in computational systems biology via high performance computing techniques. 2017.

[PFAC09] Maurizio Palesi, Fabrizio Fazzino, Giuseppe Ascia, and Vincenzo Catania. Data encoding for low-power in wormhole-switched networks-on-chip. In Digital System Design, Architectures, Methods and Tools, 2009. DSD'09. 12th Euromicro Conference on, pages 119–126. IEEE, 2009.

[PLS01] Johan Pouwelse, Koen Langendoen, and Henk Sips. Dynamic voltage scaling on a low-power microprocessor. In Proceedings of the 7th annual international conference on Mobile computing and networking, pages 251–259. ACM, 2001.

[PNK+06] Dongkook Park, Chrysostomos Nicopoulos, Jongman Kim, Narayanan Vijaykrishnan, and Chita R Das. Exploring fault-tolerant network-on-chip architectures. In Dependable Systems and Networks, 2006. DSN 2006. International Conference on, pages 93–104. IEEE, 2006.

[ptm] Predictive technology model. URL: http://ptm.asu.edu/

[Ram11] Carl Ramey. Tile-gx100 manycore processor: Acceleration interfaces and architecture. In Hot Chips 23 Symposium (HCS), 2011 IEEE, pages 1–21. IEEE, 2011.

[RJCR16] Chidhambaranathan Rajamanikkam, Rajesh JS, Koushik Chakraborty, and Sanghamitra Roy. Boostnoc: power efficient network-on-chip architecture for near threshold computing. In Proceedings of the 35th International Conference on Computer-Aided Design, page 124. ACM, 2016.

[RL10] Rohit Sunkam Ramanujam and Bill Lin. Destination-based adaptive routing on 2d mesh networks. In Proceedings of the 6th ACM/IEEE Symposium on Architectures for Networking and Communications Systems, page 19. ACM, 2010.

[RRS+14] Thomas Rauber, Gudula Rünger, Michael Schwind, Haibin Xu, and Simon Melzner. Energy measurement, modeling, and prediction for processors with frequency scaling. The Journal of Supercomputing, 70(3):1451–1476, 2014.

[RSG03] Vijay Raghunathan, Mani B Srivastava, and Rajesh K Gupta. A survey of techniques for energy efficient on-chip communication. In Proceedings of the 40th annual Design Automation Conference, pages 900–905. ACM, 2003.

[Rup18] Karl Rupp. 42 years of microprocessor trend data. 2018. URL: https://www.karlrupp.net/2018/02/42-years-of-microprocessor-trend-data/

[SA+14] Mohamed M Sabry, David Atienza, et al. Temperature-aware design and management for 3d multi-core architectures. Foundations and Trends® in Electronic Design Automation, 8(2):117–197, 2014.

[SCK+12] Chen Sun, Chia-Hsin Owen Chen, George Kurian, Lan Wei, Jason Miller, Anant Agarwal, Li-Shiuan Peh, and Vladimir Stojanovic. Dsent - a tool connecting emerging photonics with electronics for opto-electronic networks-on-chip modeling. In Networks on Chip (NoCS), 2012 Sixth IEEE/ACM International Symposium on, pages 201–210. IEEE, 2012.

[SDF+11] Adrian Sampson, Werner Dietl, Emily Fortuna, Danushen Gnanapragasam, Luis Ceze, and Dan Grossman. Enerj: Approximate data types for safe and general low-power computation. In ACM SIGPLAN Notices, volume 46, pages 164–174. ACM, 2011.

[SDM10] John Shalf, Sudip Dosanjh, and John Morrison. Exascale computing technology challenges. In International Conference on High Performance Computing for Computational Science, pages 1–25. Springer, 2010.

[SDMHR11] Stelios Sidiroglou-Douskos, Sasa Misailovic, Henry Hoffmann, and Martin Rinard. Managing performance vs. accuracy trade-offs with loop perforation. In Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering, pages 124–134. ACM, 2011.

[SLJ+13] Mehrzad Samadi, Janghaeng Lee, D Anoushe Jamshidi, Amir Hormati, and Scott Mahlke. Sage: Self-tuning approximation for graphics engines. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, pages 13–24. ACM, 2013.

[SN90] T. Sakurai and A. R. Newton. Alpha-power law model and its applications to cmos inverter delay and other formulas. IEEE Journal of Solid-State Circuits, 25(2):584–594, April 1990.

[Sod15] Avinash Sodani. Knights landing (knl): 2nd generation intel® xeon phi processor. In Hot Chips 27 Symposium (HCS), 2015 IEEE, pages 1–24. IEEE, 2015.

[SP+03a] Li Shang, Li-Shiuan Peh, et al. Dynamic voltage scaling with links for power optimization of interconnection networks. In The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings., 2003.

[SP03b] Vassos Soteriou and Li-Shiuan Peh. Dynamic power management for power optimization of interconnection networks using on/off links. In High Performance Interconnects, 2003. Proceedings. 11th Symposium on, pages 15–20. IEEE, 2003.

[SS05] Srinivasa R Sridhara and Naresh R Shanbhag. Coding for system-on-chip networks: a unified framework. IEEE transactions on very large scale integration (VLSI) systems, 13(6):655–667, 2005.

[TKT+16] Akihiro Tabuchi, Yasuyuki Kimura, Sunao Torii, Hideo Matsufuru, Tadashi Ishikawa, Taisuke Boku, and Mitsuhisa Sato. Design and preliminary evaluation of omni openacc compiler for massive mimd processor pezy-sc. In International Workshop on OpenMP, pages 293–305. Springer, 2016.

[TPE+14] Bradley Thwaites, Gennady Pekhimenko, Hadi Esmaeilzadeh, Amir Yazdanbakhsh, Jongse Park, Girish Mururu, Onur Mutlu, and Todd Mowry. Rollback-free value prediction with approximate loads. In 2014 23rd International Conference on Parallel Architecture and Compilation Techniques (PACT), pages 493–494. IEEE, 2014.

[UKK13] Saeeda Usman, Samee U Khan, and Sikandar Khan. A comparative study of voltage/frequency scaling in noc. In Electro/Information Technology (EIT), 2013 IEEE International Conference on, pages 1–5. IEEE, 2013.

[vSA+16] V. M. van Santen, H. Amrouch, et al. Aging-aware voltage scaling. In 2016 Design, Automation Test in Europe Conference Exhibition (DATE), 2016.

[Wal91] David W Wall. Limits of instruction-level parallelism, volume 19. ACM, 1991.

[WCF11] Yao Wang, Sorin Cotofana, and Liang Fang. A unified aging model of nbti and hci degradation towards lifetime reliability management for nanoscale mosfet circuits. In Proceedings of the 2011 IEEE/ACM International Symposium on Nanoscale Architectures, pages 175–180. IEEE Computer Society, 2011.

[WJK+12] Vincent M Weaver, Matt Johnson, Kiran Kasichayanula, James Ralph, Piotr Luszczek, Dan Terpstra, and Shirley Moore. Measuring energy and power with papi. In Parallel Processing Workshops (ICPPW), 2012 41st International Conference on, pages 262–268. IEEE, 2012.

[WO+95] S.C. Woo, M. Ohara, et al. The SPLASH-2 Programs: Characterization and Methodological Considerations. In Proc. of the 22nd International Symposium on Computer Architecture, June 1995.

[WWM14] Liang Wang, Xiaohang Wang, and Terrence Mak. Dynamic programming- based lifetime aware adaptive routing algorithm for network-on-chip. In Very Large Scale Integration (VLSI-SoC), 2014 22nd International Conference on, pages 1–6. IEEE, 2014.

[YA11] Qiaoyan Yu and Paul Ampadu. A dual-layer method for transient and permanent error co-management in noc links. IEEE Transactions on Circuits and Systems II: Express Briefs, 58(1):36–40, 2011.

[YMEL17] A. Yazdanbakhsh, D. Mahajan, H. Esmaeilzadeh, and P. Lotfi-Kamran. Axbench: A multiplatform benchmark suite for approximate computing. IEEE Design Test, 34(2):60–68, April 2017.

[YS06] Ziad Youssfi and Michael Shanblatt. A new technique to exploit instruction- level parallelism for reducing microprocessor power consumption. In Electro/information Technology, 2006 IEEE International Conference on, pages 119–124. IEEE, 2006.

[Yu] Qian Yu. Opportunities and challenges for near-threshold technology in end-point socs for the internet of things. In Design And Reuse. URL: https://www.design-reuse.com/articles/39186/near-threshold-technology-end-point-socs-iot.html

[YYHC11] Hao-I Yang, Shyh-Chyi Yang, Wei Hwang, and Ching-Te Chuang. Impacts of nbti/pbti on timing control circuits and degradation tolerant design in nanoscale cmos sram. IEEE Transactions on Circuits and Systems I: Regular Papers, 58(6):1239–1251, 2011.

[ZDB+07] Bo Zhai, Ronald G Dreslinski, David Blaauw, Trevor Mudge, and Dennis Sylvester. Energy efficient near-threshold chip multi-processing. In Proceedings of the 2007 international symposium on Low power electronics and design, pages 32–37. ACM, 2007.
