Robust Low Power Computing in the Nanoscale Era

Todd Austin

University of Michigan

[email protected]

Thanks

Slide/concept contributions by: • David Blaauw, University of Michigan • Kypros Constantinides, University of Michigan • Kris Flautner, ARM Ltd. • Nam Sung Kim, Intel Corporation • Trevor Mudge, University of Michigan • Leyla Nazhandali, Virginia Tech • Dennis Sylvester, University of Michigan • Chris Weaver, Intel Corporation

Evolution of a 90’s High-End Processor

Processor      Power (W)   Freq. (MHz)   Die Size (mm2)   Vdd (V)
Alpha 21064    30          200           234              3.3
Alpha 21164    50          300           299              3.3
Alpha 21264    72          667           302              2.0
Alpha 21364    100         1000          350              1.5

Compaq’s Alpha: 67 A @ 100 W
Power density: 30 W/cm2

High 90’s Digital Signal Processor

Analog Devices 21160 SHARC
• 600 Mflops @ 2 W
• 100 MHz SIMD with 6 computational units
Recognized that parallelism saves power
Had the right workload to exploit this fact

[We will see that the story has become more complicated]

Why does power matter?

“… left unchecked, power consumption will reach 1200 Watts for high-end processors in 2018. … power consumption [is] a major showstopper with off-state current leakage ‘a limiter of integration’.”

Intel chairman Andrew Grove Int. Electron Devices Meeting keynote Dec. 2002

Total Power of CPUs in PCs

Early ’90’s – 100M CPUs @ 1.8W = 180MW
Early 21st century – 500M CPUs @ 18W = 10,000MW
Exponential growth
Recent comment in a Financial Times article: 10% of US energy use is for computers
• exponential growth implies it will overtake cars/homes/manufacturing
NOT! – why we’re here

What hasn’t followed Moore’s Law

Batteries have improved their energy capacity by only about 5% every two years

Also Important For Server Systems

Internet Service Provider’s Data Center
Heavy-duty factory – 25,000 sq. ft.
~8,000 servers, ~2,000,000 Watts
Want lowest cost/server/sq. ft.
Cost a function of:
• cooling air flow
• power delivery
• racking height
• maintenance cost
• lead cost driver is power, ~25%

Why does robustness matter?

… the ability to consistently resolve critical dimensions of 30nm is severely compromised creating substantial uncertainty in device performance. ... at 30nm design will enter an era of “probabilistic computing,” with the behavior of logic gates no longer deterministic… susceptibility to single event upsets from radiation particle strikes will grow due to supply voltage scaling while power supply integrity (IR drop, inductive noise, electromigration failure) will be exacerbated by rapidly increasing current demand new approaches to robust and low power design will be crucial to the successful continuation of process scaling ...

Intel chairman Andrew Grove Int. Electron Devices Meeting keynote Dec. 2002

Why does robustness matter?

Grove’s comments
• SEUs
• IR drop
• inductive noise
• electromigration, etc.
Increase in variability as feature sizes decrease
Likely to be the next major challenge
• strengthen interest in fault tolerance
• renew interest in self-healing

How are they related?

The move to smaller features can help with power – with qualifications
Smaller features increase design margins, which:
• reduce power savings
• reduce performance gains
• reduce area benefits

Challenges
Power density is growing
Systems are becoming less robust
Can architecture help?
• Lower-power organizations – quick estimates of power
• Robust organizations – quick estimates of robustness
By one account we need a 2x reduction in power per generation from architecture

Question: where will the solution come from?
• process
• circuits
• architecture
• OS
• language

Tutorial Schedule

Power Issues: Dynamic and Static Power
• Dynamic Power Overview
• Static Power Overview
• Power Trends
Low Power Design Techniques
Reliability Issues: SER, Variability and Defects
Break
Fault Tolerant Design Techniques
Robust Low Power Design Techniques

Power Sources

Total Power = Dynamic Power + Static Power + Short Circuit Power

Dynamic Power Consumption

Inverter initial state: Input = 1, Output = 0. In this static state no dynamic power is consumed.

Dynamic Power Consumption

Input 1→0 (output rises 0→Vdd)
• Energy drawn from the power supply:
  E_supply = ∫P(t)·dt = ∫Vdd·i(t)·dt = Vdd·∫₀^Vdd C·dV_O = C·Vdd²
• Energy consumed by the PMOS:
  E_PMOS = ∫P(t)·dt = ∫(Vdd − V_O)·i(t)·dt = ½·C·Vdd²
• Power:
  P_PMOS = f·E_PMOS = ½·f·C·Vdd²

Dynamic Power Consumption

Input 0→1 (output falls Vdd→0)
• Energy drawn from supply: 0
• Energy consumed by the NMOS equals the energy stored on the capacitance:
  E_NMOS = ∫V_O·i(t)·dt = ½·C·Vdd²
• Power:
  P_NMOS = f·E_NMOS = ½·f·C·Vdd²
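The two transitions above can be checked numerically. A minimal sketch (all component values are assumed for illustration):

```python
# Energy per switching cycle of a CMOS node (illustrative sketch).
C = 20e-15      # node capacitance: 20 fF (assumed)
Vdd = 1.2       # supply voltage in volts (assumed)
f = 1e9         # clock frequency: 1 GHz (assumed)

E_supply = C * Vdd**2        # energy drawn from supply while the output rises
E_pmos = 0.5 * C * Vdd**2    # half is dissipated in the PMOS while charging
E_cap = 0.5 * C * Vdd**2     # the other half is stored on the load capacitance
E_nmos = E_cap               # later dissipated in the NMOS when the output falls

P_dyn = f * C * Vdd**2       # power if the node switches every cycle
print(E_supply, P_dyn)
```

Half the supply energy is burned in the PMOS while charging; the other half, stored on the capacitor, is later burned in the NMOS while discharging.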

Leakage Current Components

Subthreshold leakage (I_sub)
• Dominant when the device is OFF
• Enhanced by reduced V_t due to process scaling
Gate tunneling leakage (I_gate)
• Due to aggressive scaling of the gate oxide thickness (T_ox)
• A super-exponential function of T_ox
• Comparable to I_sub at 90nm technology

Dynamic and Leakage Power Trends

ITRS 2002 projections with doubling # of transistors every two years

Temperature Dependence

Temperature across a chip varies significantly
Subthreshold leakage is a strong function of temperature
• greater than 10% variation per 10°C
Gate leakage is less sensitive to temperature
Source: R. Rao

Tutorial Schedule

Power Issues: Dynamic and Static Power
Low Power Design Techniques
• Dynamic Power Reduction Techniques
• Static Power Reduction Techniques
• Research Topic: Subthreshold Sensor Processors
Reliability Issues: SER, Variability and Defects
Break
Fault Tolerant Design Techniques
Robust Low Power Design Techniques

How to Reduce Dynamic Power

More generally

P_dyn = ½·α·f·C·Vdd², where α is the switching activity

To reduce dynamic power, we can reduce:
• α – clock gating
• C – sizing down
• f – lower frequency
• Vdd – lower voltage

Dynamic Power Reduction - Parallel Computation

Single unit at Vdd, f  vs.  two parallel units at Vdd/2, f/2

  Energy = C·Vdd²  →  Energy = 2·C·(Vdd/2)² = ½·C·Vdd²

Energy reduced by 50%, but at the cost of double the area and more leakage
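The 50% figure follows directly from the quadratic voltage dependence; a quick sketch with normalized (assumed) units:

```python
# Compare one unit at (Vdd, f) vs two parallel units at (Vdd/2, f/2);
# same throughput, same switched capacitance C per unit (illustrative).
C, Vdd = 1.0, 1.0                     # normalized units (assumed)
E_serial = C * Vdd**2                  # energy per operation, single fast unit
E_parallel = 2 * C * (Vdd / 2)**2      # two units, each at half the voltage
print(E_parallel / E_serial)           # -> 0.5: dynamic energy halved
```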

Dynamic Power Reduction - DVS

Given a dynamic workload, scale frequency or voltage:
• Clock/power gating – linear energy saving with duty cycle:
  Energy = P_Vdd·t_on = P_Vdd·t_task·(duty cycle)
• Just-in-time Dynamic Voltage Scaling (DVS) – cubic energy saving with duty cycle:
  Energy = P_Vscaled·t_task = (f_scaled·C·V_scaled²)·t_task ∝ f_scaled³ ∝ (f_Vdd·duty cycle)³
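The linear-vs-cubic contrast can be made concrete. A sketch assuming voltage scales linearly with frequency (so power scales as f³):

```python
# Energy vs duty cycle: clock gating (linear) vs just-in-time DVS (cubic).
# Assumes V scales linearly with f, so P ~ f*V^2 ~ f^3 (simplified sketch).
duty = 0.5                       # workload needs half the peak throughput
E_full = 1.0                     # normalized energy to run flat-out for t_task

E_gating = E_full * duty         # run at full speed, then gate the clock
E_dvs = E_full * duty**3         # run slower at reduced voltage the whole time
print(E_gating, E_dvs)           # 0.5 vs 0.125: DVS saves 4x more here
```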

How Far Can We Scale Down the Voltage?

Traditional DVS (Dynamic Voltage Scaling)
• Scaling range limited to less than Vdd/2

Processor                  Voltage Range   Frequency Range
IBM PowerPC 405LP          1.0V–1.8V       153MHz–333MHz
Transmeta Crusoe TM5800    0.8V–1.3V       300MHz–1GHz
Intel XScale 80200         0.95V–1.55V     333MHz–733MHz

Minimum functional voltage
• For a CMOS inverter [Meindl, JSSC 2000]:

  V_dd,limit = 2·V_T·ln(1 + SS/(ln10·V_T)) ≈ 48mV for a typical 0.18μm technology

  (V_T is the thermal voltage kT/q; SS is the subthreshold swing)
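Plugging in typical (assumed) values for the thermal voltage and subthreshold swing reproduces the ~48 mV limit:

```python
import math

# Meindl's minimum functional supply voltage for a CMOS inverter
# (JSSC 2000); the parameter values below are assumed for illustration.
V_T = 0.0259          # thermal voltage kT/q at room temperature, volts
SS = 0.090            # subthreshold swing: 90 mV/decade (assumed)

Vdd_limit = 2 * V_T * math.log(1 + SS / (math.log(10) * V_T))
print(round(Vdd_limit * 1000, 1))   # ~48 mV, matching the slide
```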

LongRun Power Management [Transmeta]

Source: Crusoe™ LongRun™ Power Management White Paper

SpeedStep Technology [Intel]

Next-generation SpeedStep supports more V, F settings
10ms performance switch time
Software algorithms dynamically change settings based on performance statistics
Pentium M 1.6 GHz:

Frequency        Voltage
1.6 GHz (HFM)    1.484 V
1.4 GHz          1.420 V
1.2 GHz          1.276 V
1.0 GHz          1.164 V
800 MHz          1.036 V
600 MHz (LFM)    0.956 V

Reducing Static Power with Dual Vt Assignments

Each transistor is assigned either a high or low Vt

• Low-Vt transistor has reduced delay and increased leakage

                Low-Vt, 0.9V   High-Vt, 0.9V   Low-Vt, 1.8V   High-Vt, 1.8V
Leakage (norm)  1              0.06            1              0.07
Delay (norm)    1              1.30            1              1.20

Trade-off degrades for lower supply voltage

Dual Vt Example

Dual Vt assignment approach

• Transistor on critical path: low Vt

• Non-critical transistor: high Vt

[Bar chart: normalized leakage current – all-low-Vt design (1x) vs. dual-Vt design (~2x leakage reduction)]

State Dependence (I_sub)

Simulation results of a 0.13um process (transistor stack with inputs A, B, C):

Input ABC   Output   Subthreshold Leakage (pA)
000         1        8.0836
100         1        15.1873
010         1        13.5167
110         1        55.2532
001         1        13.4401
101         1        54.5532
011         1        64.259
111         0        191.2692

Three OFF transistors in the stack (000) vs. one OFF transistor (011): an 8X increase in leakage.

Source: F. Najm

Approach: rework the FSM state assignment or logic

Balloon Latch [Shigematsu]

On-Chip Cache Leakage Power

Leakage is 57% of total cache power

[Chart: relative power breakdown for cache designs with 70-nm BPTM and a sub-banking technique]

On-Chip Cache Leakage Power

Large and fast caches
• Improve memory system performance
• Consume a sizeable fraction of total chip power
  – StrongARM: ~60% for on-chip L1 caches
More caches integrated on chip
• 2x64KB L1 / 1.5MB L2 in the Alpha 21464
• 256KB L2 / 3MB (6MB) L3 in the Itanium 2
Increasing on-chip cache leakage power
• Proportional to exp(1/V_TH) × # of bits
• 1MB L2 cache leakage power – 87% in 70nm technology

Drowsy Caches [Mudge]

Put cache lines into a low-power drowsy mode when idle: Vdd (1V) → VddLow (0.3V)

[Figure: drowsy SRAM cell with drowsy/!drowsy control; bar chart comparing dynamic and leakage energy of a regular D$ vs. a drowsy D$]

Cache energy reductions of 54% to 58%
Run-time increase of 0.41% with awake tags
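The drowsy-mode saving can be sketched with a back-of-the-envelope model; all numbers below are illustrative assumptions, not measurements from the drowsy-cache work:

```python
# Drowsy-cache leakage sketch with illustrative (assumed) numbers: idle
# lines drop to a low retention voltage, where leakage is much smaller.
lines = 512
drowsy_fraction = 0.9     # fraction of lines asleep at any time (assumed)
leak_awake = 1.0          # normalized leakage per awake line at 1 V
leak_drowsy = 0.1         # normalized leakage per drowsy line at 0.3 V (assumed)

leak = (1 - drowsy_fraction) * leak_awake + drowsy_fraction * leak_drowsy
print(round(leak, 2))     # -> 0.19: ~81% leakage reduction in this sketch
```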

Research Topic: Subthreshold Sensor Processors [Austin/Blaauw]

[System diagram: MEMS-based proximity sensors (sensing, communication), novel subthreshold low-leakage processors and memories (computation, storage), thin-film batteries (power supply)]

Energy Efficiency: A Key Requirement

They live on a limited amount of energy generated from a small battery or scavenged from the environment. Traditionally the communication component is the most power-hungry element of the system. However, new trends are emerging:

Passive telemetry Self-powered RF Proximity comm.

Performance Demands are LOW

[Log-scale bar chart; bar values: 8036.77, 8296.37, 3943.47, 2965.01, 2253.56, 183.25, 4.10]

Platform      ARM 720T   ARM 7TDMI   ARM 920T   ARM 1020T   1st-gen   1st-gen   1st-gen
Voltage (V)   1.2        1.2         1.2        1.2         1.2       0.5       0.232
Speed (Hz)    100M       133M        250M       325M        114M      9M        168k

The Basics of Subthreshold Circuit Operation

A Short Animation ☺

Episode 1: Inverter operation in superthreshold domain

[Animation frames: a CMOS inverter (PMOS P, NMOS N) with a 1.2V supply. As IN swings between 0V and 1.2V, one transistor turns strongly ON while the other turns OFF, driving OUT rail-to-rail in the opposite direction.]

Episode 2: Inverter operation in subthreshold domain

[Animation frames: the same inverter with a 0.2V supply, below the threshold voltage. Neither transistor turns strongly ON; the output is driven by weak subthreshold currents, so the inverter still switches, only much more slowly.]

Energy Per Instruction Analysis

EPI (Energy per Instruction) = Cycles per Instruction × Energy per Cycle

Energy per cycle is determined by: the activity factor (average number of transistor switches per transistor), the clock period, the supply voltage, the leakage current per cycle, and the total circuit capacitance.

Energy Per Instruction Analysis

Effect of reducing the voltage:

                 I_leak    t_clk    E_leak    E_dyn     E_cycle
Superthreshold   ⇓ lin.    ⇑ lin.   ~const.   ⇓ quad.   ⇓ quad.
Subthreshold     ⇓ lin.    ⇑ exp.   ⇑ ~exp.   ⇓ quad.   ???

Tension: in subthreshold, the exponentially longer clock period makes leakage energy per cycle grow while dynamic energy still shrinks, so total energy per cycle reaches a minimum at some voltage.

1st-gen General Microarchitecture Overview and Exploration Options

[Block diagram: IF-stage (prefetch buffer, I-Mem), ID-stage (register file, ROM, external-event/interrupt scheduler), EX/MEM-stage (ALU, shifter, accumulator, D-Mem), with 8/16/32-bit datapath options and control logic]

Exploration options:
• Number of stages
• Harvard vs. Von Neumann architecture
• ALU width
• Presence of an instruction prefetch buffer
• Presence of an explicit register file

First Subliminal Chip

[Die photo with labeled regions: a large solar cell plus solar cells for the processor, discrete cells, and adders; custom memories; mux-based memories; discrete cells and adders; a level converter array; a test memory array; a test module; and the subliminal processors]

Pareto Analysis for Several Processors

[Scatter plot: energy (J/inst.) vs. instruction latency (s/inst.) for design points named <stages>s_<h|v>_<width>w[_r] (pipeline stages, Harvard vs. Von Neumann, ALU width, explicit register file). Annotations quantify the effects of an explicit register file, a third pipeline stage, Von Neumann (vs. Harvard) organization, and ALU width; the implemented design (Area = 2.14, CPI = 2.88) is highlighted.]

Pareto Analysis of Sensor Network Processors

[Scatter plot: energy/inst (pJ) vs. MIPS for SNAP/LE (Cornell), Hempstead (Harvard), cleverDust (Berkeley), and Subliminal (Michigan); Subliminal reaches 2.25pJ/inst @ 1 MIPS and 0.85pJ/inst @ 0.1 MIPS]

Lessons from 1st-generation Study (ISCA 2005)

To minimize energy at subthreshold voltages, architects must:
• Minimize area – to reduce leakage energy per cycle
• Maximize transistor utility – to reduce V_min and energy per cycle
• Minimize CPI – to reduce energy per instruction

As such, winning designs tend to be compromise designs that balance area, transistor utility, and CPI. The memory is the single largest contributor to leakage energy; therefore, efficient designs must reduce memory storage requirements.

2nd Generation Sensor Network Processor

[Block diagram: IF/ID stage (prefetch buffer, Imem, μoperation decoder, fetch/jump control), EX/MEM stage (register file, ALU with carry/zero flags, 32-bit timer, 128x8 Dmem, external-interrupt scheduler, page control), WB stage (register write control)]

Ongoing Work

To be deployed in an intra-ocular pressure sensor

Provides measurement of internal eye pressure

Integrated with a MEMS pressure sensor, wireless communication, and energy scavenging facilities

Tutorial Schedule

Power Issues: Dynamic and Static Power
Low Power Design Techniques
Reliability Issues: SER, Variability and Defects
• Soft Error Radiation Overview
• Variability Sources and Effects
• Silicon Defect Trends
Break
Fault Tolerant Design Techniques
Robust Low Power Design Techniques

Fault Classes

Permanent faults (hard faults)
• Irreversible physical change
• Latent manufacturing defects, electromigration
Intermittent faults
• Hard to differentiate from transient faults
• Repeatedly occur at the same location
• Occur in bursts when the fault is activated
• Replacing the offending circuit removes the fault
Transient faults (soft errors)
• Neutron/alpha particle strikes
• Power supply and interconnect noise
• Electromagnetic interference
• Electrostatic discharge

Introduction – Soft Errors

Soft errors are also called transient faults and single-event upsets (SEUs)
• Processor execution errors caused by high-energy neutrons from cosmic radiation and by alpha-particle radiation
• A growing reliability threat for future technology processors

When a particle strikes a circuit element, a small amount of charge is deposited:
• Combinational logic node: a very short current pulse forms at the circuit node
• State-holding element (FF/SRAM cell): the stored value can flip

Unlike permanent faults, the effects of soft errors are transient

Soft Errors (SER)

Alpha particles stem from radioactive decay of packaging materials
Neutrons (cosmic rays) are always present in the atmosphere
Soft errors are transient, non-recurring faults (also called single-event upsets, SEUs) in which added/deleted charge on a node results in a functional error
• Charge is added/removed by electron/hole pairs absorbed by source/drain diffusion areas

Source: S. Mukherjee, Intel

Soft Error Masking

Logic Masking: the fault is blocked by a following gate whose output is completely determined by its other inputs

Timing Masking: the fault reaches the input of a latch only during the period in which the latch is not sensitive to its input (outside the t_setup + t_hold window)

[Timing diagram: a pulse arriving outside the latching window is masked; one arriving within it is latched]

Soft Error Masking

Electrical Masking: the fault’s pulse is attenuated by subsequent logic gates due to their electrical properties and does not affect any latch’s input

Microarchitectural Masking: the fault alters a value of at least one flip-flop, but the incorrect values get overwritten without being used in any computation affecting the design’s output

Software Masking: the fault propagates to the design’s output but is subsequently masked by software without affecting the application’s correct execution
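Logic masking, the first mechanism above, is easy to illustrate (a sketch with an AND gate standing in for the blocking gate):

```python
# Logic-masking sketch: a transient glitch on one gate input is blocked when
# another input already determines the gate's output (here, an AND with a 0).
def and_gate(a, b):
    return a & b

steady = 0                       # the other input is a controlling 0
no_glitch = and_gate(0, steady)  # output without the fault
glitch = and_gate(1, steady)     # output while the glitch (0->1) is present
print(no_glitch, glitch)         # 0 0: the fault never reaches the output
```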

How To Measure Reliability: Soft Error Rate (FIT)

Failure In Time (FIT): failures per 10⁹ device-hours
• 114 FIT means 1 failure every ~1000 years for a single device
It sounds good, but – if 100,000 units are shipped, roughly one end-user per week will experience a failure

Mean Time To Failure: MTTF = 10⁹ / FIT hours
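The FIT arithmetic above can be reproduced directly:

```python
# Converting a FIT rate (failures per 10^9 device-hours) to MTTF and to an
# expected fleet failure rate, using the slide's example numbers.
fit = 114                       # failures per 10^9 hours per device
mttf_hours = 1e9 / fit          # mean time to failure of one device
mttf_years = mttf_hours / (24 * 365)

units = 100_000                 # shipped units
fleet_fails_per_week = units * fit * (24 * 7) / 1e9

print(round(mttf_years))               # ~1001 years per device
print(round(fleet_fails_per_week, 1))  # ~1.9 failures per week in the fleet
```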

Soft Error Considerations

Highly elevation dependent (3-5X higher in Denver vs. sea-level, or 100X higher in airplane)

Critical charge of a node (Qcrit) is an important value

• Node requires Qcrit to be collected before an error will result

• The more charge stored on a node, the larger Q_crit is (Q_crit must be an appreciable fraction of the stored Q)
• This implies scaling problems: capacitances and voltages both reduce with scaling, so the stored Q reduces as S² (~2X) per generation
  – Ameliorated somewhat by smaller collection nodes (S/D junctions)
  – But exacerbated again by 2X more devices per generation
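The S² argument for stored charge, as a one-liner sketch (the classic 0.7x linear shrink per generation is assumed):

```python
# Stored-charge scaling sketch: both node capacitance and supply voltage
# shrink roughly by the linear scaling factor S per generation.
S = 0.7                   # assumed ~0.7x shrink per generation
Q_scale = S * S           # Q = C*V, so stored charge shrinks as S^2
print(round(Q_scale, 2))  # -> 0.49: roughly 2x less charge per generation
```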

Soft Error Rate Trends, ITRS03

Impact of Soft Errors in Processors [Iyer]

How do soft errors in processors propagate and impact applications?

Approach

Fault injections (with ii-Measure, hardware level fault injection framework) in combinational logic and flip-flops of MIPS and Alpha-like processors Study fault propagation to the application level

Major findings:
• Nearly 5% of faults in combinational logic propagate to processor state
• Errors in control logic contribute to 79% of application hangs
• Errors in execution blocks are a major factor in application crashes (45%) and silent data corruption (40%)
• Faults in combinational logic can cause double and multiple bit errors

[Pie chart: bit-flip distribution in the Alpha-like processor – single bit-flip errors 83.11%, double bit-flip errors 15.10%, multiple bit-flip errors 1.79%]

SERA SER Analysis Tool [Shanbhag]

[Tool flow: gate-level Verilog netlist and process files → circuit parser → logic simulator (stimulus vectors) → path analyzer → SER engine, with a one-time inverter-chain peaking characterization per process]

Gate-level SER analysis point tool (available from the GSRC web-site)
• Fast: speed-up ≥ 10⁶ over Monte Carlo
• Accurate: < 5% error vs. Monte Carlo
• Captures SER dependence on process, circuit, and input vectors
• Example (32x32 array multiplier): ΔVdd = 20% → SER = 1.28X; Δt = 20% → SER = 50X

Effects Of Variability

High-performance processors are speed-binned
• Faster == more $$$
• These parts have small L_eff
Exponential dependence of leakage on V_th
• And on L_eff, through V_th

[Figure: distribution of parts across the process spread – smaller-L_eff parts are fast but high-leakage, larger-L_eff parts are slow but low-leakage; a frequency constraint rejects parts that are too slow and a power constraint rejects parts that are too leaky]

Since leakage is now appreciable, parametric yield is being squeezed on both sides

Printing in the Subwavelength Regime

[Figures: drawn layout vs. printed silicon at the 0.25µ, 0.18µ, 0.13µ, 90-nm, and 65-nm nodes]

Figures courtesy Synopsys Inc.

Variation: Across-Wafer Frequency

Figure courtesy S. Nassif, IBM

Random Dopant Fluctuations, Intel’s View

[Log-scale chart: mean number of dopant atoms per device vs. technology node, falling from ~10,000 at 1000nm toward ~10 at 32nm; insets contrast uniform and non-uniform dopant distributions]

Inter-die vs. Intra-die Variation

Inter-die variation is not always larger than intra-die (ILD)

Design/EDA for Highly Variable Technologies

Critical need: move away from deterministic CAD flows and worst-case corner approaches
Examples:
• Probabilistic dual-Vth insertion
  – Low-Vth devices exhibit large process spreads; speed improvements and leakage penalties are thus highly variable
• Parametric yield optimization
  – Making design decisions (in sizing, circuit topology, etc.) that quantitatively target meeting a delay spec AND a power spec with given confidence
• Avoid designing to unrealistic worst-case specs
• Use other design tweaks such as gate-length biasing

Noise Immune Layout Fabric

This layout style trades off area for:
• Noise immunity (both C and L)
• Minimized variation (CMP)
• Predictability
• Easy layout
• Simplified power distribution
Major area penalty (>60%)

Ref: Khatri, DAC99

Defects: The (Bumpy) Road Ahead for Silicon

What is the failure model of silicon 2-3 generations out?
What the literature says…
• “Expected failure rate of 10¹² hours/device” – this would give a high-end NVidia graphics part an expected lifetime of less than 1 year
• “Failure rates higher than 10²⁰ hours/device” – which eliminates the problem
What the experts say…
• Intel [Borkar] and IBM [Bernstein]: a critical problem for future silicon

Key failure modes:
• Transistor wear-out (aggravated by scaling)
• SER-related upsets (especially in logic)
• Early transistor failures (due to ineffective burn-in)
• Untestable defects (compounded by complexity)

Silicon Defects: Sources and Trajectory

Sources: gate wearout, NBTI, hot electrons, electromigration, etc.

Model parameters:
• F_G: grace-period wear-out rate
• λ_L: average latent manufacturing defects
• m: maturing rate
• b: breakdown rate
• t_B: breakdown start point

Failure rate (FIT) over time follows a bathtub curve – F_G + 10⁹·λ_L/t·(1 − (t+1)^(−m)) before breakdown, F_G + (t − t_B)^b afterwards:
• Infant period: failures occur very soon and the failure rate declines rapidly; failures are caused by latent manufacturing defects (reduced with burn-in)
• Grace period: the failure rate falls to a small constant value where failures occur sporadically due to the occasional breakdown of weak transistors or interconnect (graceful degradation)
• Breakdown period: failures occur with increasing frequency over time due to age-related wear-out

Tutorial Schedule

Power Issues: Dynamic and Static Power
Low Power Design Techniques
Reliability Issues: SER, Variability and Defects
Break
Fault Tolerant Design Techniques
• Classical Techniques
• SER Specific Techniques
• Full-Spectrum Techniques
• Research Topic: Self-Healing Systems
Robust Low Power Design Techniques

Techniques For Improving Reliability

Fault avoidance (process / circuit)
• Improved materials: low-alpha-emission interconnect and packaging materials
• Manufacturing process: Silicon On Insulator (SOI); triple-well process to protect SRAM
Fault tolerance (robust design in the presence of soft errors): circuit / architecture
• Error detection & correction relies mostly on “redundancy”
  – Space: DMR, TMR
  – Time: temporal redundant sampling (Razor-like)
  – Information: error coding (ECC)

DMR Error Detection

Context: dual-modular redundancy for computation
Problem: error detection across blades

[Figure: two CPU blades computing redundantly; their outputs must be compared to detect an error]

Triple Modular Redundancy (von Neumann)

[Figure: three copies compute f(x, y); a small majority voter produces z. The voter is assumed reliable! This is coarse-grained redundancy.]
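A majority voter is indeed small; a minimal sketch (the voter itself is assumed fault-free, as in the figure):

```python
# Majority voter for triple-modular redundancy (von Neumann style).
# Any single faulty module is outvoted.
def tmr_vote(a, b, c):
    """Return the majority of three redundant results."""
    return a if a == b or a == c else b

print(tmr_vote(42, 42, 7))   # one faulty copy -> correct result 42
```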

Error Coding : Information Redundancy

Coding: representation of information
• Sequence of code words or symbols
• Shannon’s theorem (1948): in noisy channels, errors can be reduced to an arbitrary degree
• Golay (1949), Hamming (1950), Slepian (1956), Prange (1957), Huffman
Overheads
• Spatial overhead: additional bits required
• Temporal overhead: time to encode and decode
Terminology
• Distance of a code: minimum Hamming distance between any two valid codewords
• Code separability (e.g., parity code): a code is separable if it has separate code and data fields
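The distance terminology can be illustrated with a tiny separable code; this sketch builds an even-parity code over 3 data bits (an assumed example, not one from the slides) and computes its minimum distance:

```python
from itertools import combinations

# A code of minimum Hamming distance d detects d-1 bit errors. The even-
# parity code is separable: the data bits are followed by one check bit.
def parity_encode(bits):
    return bits + [sum(bits) % 2]          # append even-parity check bit

codewords = [parity_encode([a, b, c])
             for a in (0, 1) for b in (0, 1) for c in (0, 1)]
dist = min(sum(x != y for x, y in zip(u, v))
           for u, v in combinations(codewords, 2))
print(dist)   # -> 2: detects any single-bit error, corrects none
```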

SER-Tolerant Circuit Design [Shanbhag]

[Figure: dual-sampling skewed CMOS style with a dual-sampling flip-flop (DSFF)]

Employs skewed CMOS for logic and a dual-sampling flip-flop (DSFF)
• Both 0→1 and 1→0 errors are eliminated if the skewing factor ≥ 4
• The speed penalty depends on Δ (the maximum SET width), which can be made a design parameter; Δ = 300ps (for a 0.18um process) if zero SER is wanted
• Power penalty: 17% (DSFF) + 20% (skewed CMOS)

Fingerprinting [Falsafi/Hoe]

Hash updates to architectural state
Fingerprints are compared across the DMR pair
• Bounded error detection latency
• Reduced comparison bandwidth

[Figure: an instruction stream (R1 ← R2 + R3; R2 ← M[10]; M[20] ← R1; R1 ← R2; M[20] ← …) produces a stream of state updates that is hashed into a compact fingerprint, e.g. 0xC3C9]
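A toy version of the idea, with illustrative names (a real fingerprint unit hashes in hardware and truncates to a few bits, e.g. the 16-bit 0xC3C9 in the figure):

```python
import hashlib

# Fingerprinting sketch (names illustrative): hash the stream of
# architectural-state updates; a DMR pair agrees iff fingerprints match.
def fingerprint(updates):
    h = hashlib.sha256()
    for target, value in updates:          # e.g. ("R1", 5) or ("M[20]", 9)
        h.update(f"{target}={value};".encode())
    return h.hexdigest()

primary = [("R1", 5), ("M[20]", 9)]
shadow  = [("R1", 5), ("M[20]", 9)]
print(fingerprint(primary) == fingerprint(shadow))   # True: executions agree
```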

Recovery Model

[Timeline: checkpoint n is taken; a soft error occurs and remains undetected for some time; upon detection, execution rolls back to checkpoint n]

Rollback-recovery to the last checkpoint upon detection

Simultaneous Redundant Multithreading [Reinhardt]

Logical boundary of redundant execution within a system
• Trade-off between information, time, & space redundancy

[Figure: the Sphere of Replication contains Thread 1 and Thread 2; inputs from the rest of the system are replicated into the sphere, and outputs are compared as they leave it]

Compare & validate output before sending it outside the SoR

Full-Spectrum Fault Tolerance: DIVA Checker [Austin]

[Pipeline figure: a speculative core (IF, ID, REN, REG, scheduler, EX/MEM) delivers instructions in order – with PC, instruction, inputs, and address – to a simple checker (CHK) before commit (CT). The core provides performance; the checker provides correctness.]

All core function is validated by the checker
• The simple checker detects and corrects faulty results, then restarts the core
The checker relaxes the burden of correctness on the core processor
• Tolerates design errors, electrical faults, defects, and failures
• The core has the burden of accurate prediction, as the checker is 15x slower; the core does the heavy lifting, removing hazards that would slow the checker

Checker Processor Architecture

[Figure: checker pipeline (IF, ID, EX, MEM, WT/CT) with its own I-cache, register file, and D-cache. At each stage the checker recomputes the fetched instruction, register values, result/address, and next PC, and compares them against the core’s prediction stream; matches assert OK.]

Check Mode

[Figure: check mode – all comparators are active; the core’s prediction stream is verified stage by stage]

Recovery Mode

[Figure: recovery mode – the checker executes the instruction by itself through IF, ID, EX, MEM, and CT using its own PC, register file, and D-cache, then restarts the core]

How Can the Simple Checker Keep Up?

[Figure: two race cars drafting – the advance core leads, the redundant core follows in its slipstream]

Slipstream effects reduce the power requirements of the trailing car
• The checker processor executes in the core processor’s slipstream
• Fast-moving air ⇒ branch/value predictions and cache prefetches
• The core processor’s slipstream reduces the complexity requirements of the checker
Symbiotic effects produce a higher combined speed

How Can the Simple Checker Keep Up?

[Same slipstream figure, relabeled: the complex core leads, the simple checker trails in its slipstream]

Checker Performance Impacts

Checker throughput bounds core IPC
• Only cache misses stall the checker pipeline
• The core warms the cache, leaving few stalls
Checker latency stalls retirement
• Stalls decode when speculative state buffers fill (LSQ, ROB)
• Stalled instructions are mostly nuked!
Storage hazards stall core progress
• The checker may stall the core if it lacks resources
Faults flush the core to recover state
• Small impact if faults are infrequent

[Bar chart: relative CPI (0.97–1.05) across checker configurations, cache sizes, and fault rates]

Research Topic: Self-Repairing Systems

Defect-tolerant self-repairing systems need to support:
• Error detection
• System diagnosis (locating the origin of the error)
• System repair
• System recovery

Key idea:
• Error detection must be performance-efficient, since it continuously checks execution for errors
• Diagnosis, repair, and recovery are insensitive to performance: they are invoked only when an error is detected (a rare scenario), so they can trade performance for more cost-efficient techniques

Fault Modeling & Analysis Infrastructure

High-performance, high-fidelity fault-modeling simulation infrastructure

Transient errors: a statistical fault model (time, location, duration) drives asynchronous fault injection at the gate level; a structural analyzer compares a fault-exposed model against a golden (fault-free) model on realistic stimuli (TRIPS traces), fully modeling all the ways a fault can be masked – logic masked, timing masked, architecturally masked, or manifesting as an error.

Permanent errors: a defect model (time, location) is evaluated with a full-coverage function test; a defect analyzer classifies each defect as protected, unprotected-but-masked, or exposed.

Monte Carlo simulation loop (1000x) with realistic workloads.

Self-Repairing BulletProof Silicon [Austin, Bertacco]

Goal: single-defect tolerance for 5% area overhead

Key ideas:
• No expensive computation checking
• Protect the computation and test hardware
• Repair by disabling redundant parts

Approach:
1. Execute and protect state: speculative state within an epoch, non-speculative state plus checker at epoch boundaries (architectural envelope: checkpointing and epoch restore)
2. Test concurrently with BIST when hardware is idle (circuit envelope: logic-level testing and reconfiguration)
3. If a test fails → roll back state, disable the failing component, and restart

Tutorial Schedule

Power Issues: Dynamic and Static Power
Low Power Design Techniques
Reliability Issues: SER, Variability and Defects
Break
Fault Tolerant Design Techniques
Robust Low Power Design Techniques
• Better-Than-Worst-Case Design Concepts
• Example BTWC Designs
• Research Topic: Razor Pipeline

Power and Reliability: How are they related?

The move to smaller features can help with power – with qualifications

Smaller features increase design margins, which:
• reduce power savings
• reduce performance gains
• reduce area benefits

Traditional Worst-Case Design

[Figure: all verification and optimization happen at design time against worst-case margins, setting the time-to-market and performance]

Better-Than-Worst-Case (BTWC) Design

[Figure: an online hardware checker moves verification to run time, enabling typical-case optimization and improving both time-to-market and performance]

Algorithmic SER-Tolerance [Shanbhag]

[Figure: input x[n] feeds a voltage-overscaled main block (output y_a[n]) and a low-power estimator (output y_e[n]); an error-control block compares |y_a[n] − y_e[n]| against a threshold Th and muxes out ŷ[n]. Energy savings come from overscaling the main block’s supply voltage.]

Estimators: prediction, reduced-precision replica, MAP, error canceller, and others
Employ two estimators in the SEU/MEU scenario
Robust to error frequencies up to:
• 1 in 100 samples for SEU
• 1 in 1000 samples for MEU

Timing Error Tolerant Links [De Micheli]

[Figure: sender → pipeline buffers → receiver, where a main flip-flop and a delayed flip-flop both sample the link; an XOR compares the two samples and signals Error? to a frequency/voltage controller]

Aggressively clock on-chip links at high frequency / low voltage
Double-sample the link output
• Once speculatively, then again with reliable timing
Stall the receiver for recovery data if the samples disagree
• Non-speculative if the receiver incurs the additional delay
• Otherwise, the receiver must perform internal recovery

57 Research Topic: Razor Error Resilient Circuits [Austin/Blaauw]

• In-situ detection/correction of timing errors
• Tune processor voltage based on error rate
• Eliminate process, temperature, and noise margins (tune for near-zero errors)
• Purposely run below the critical voltage to capture data-dependent latency margins
• Implemented with architecture and circuit support:
  • Double-sampling, metastability-tolerant Razor flip-flops validate pipeline results (a main flip-flop clocked by clk and a shadow latch clocked by delayed clk_del feed a comparator that raises Error_L)
  • The pipeline initiates recovery after timing errors; forward progress is guaranteed

[Figure: Razor FF schematic (main flip-flop, shadow latch, comparator, stabilizer) and a five-stage pipeline (IF, ID, EX, MEM, WB) with Razor FFs between stages and bubble/flush recovery control]
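The Razor flip-flop's behavior can be sketched as a toy model: the main FF takes the speculative sample, the shadow latch takes the always-correct late sample, and a mismatch triggers correction. Class and method names are illustrative:

```python
class RazorFF:
    """Toy Razor flip-flop: the main FF samples on the nominal clock edge,
    the shadow latch on a delayed edge that always meets timing. A mismatch
    means the main FF caught the data mid-transition (timing error)."""

    def clock(self, d_at_main_edge, d_at_delayed_edge):
        self.q = d_at_main_edge            # speculative pipeline value
        self.shadow = d_at_delayed_edge    # always-correct value
        self.error = self.q != self.shadow
        if self.error:
            self.q = self.shadow           # recovery: restore the correct value
        return self.q, self.error

# Usage: with timing met the samples agree; undervolted, the data arrives late.
ff = RazorFF()
print(ff.clock(1, 1))  # → (1, False): timing met, no error
print(ff.clock(0, 1))  # → (1, True): late data caught by the shadow latch
```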

Razor Prototype Chip

• 4-stage 64-bit Alpha pipeline, 3.3mm × 3.0mm die
• 120–160 MHz operation, 0.18μm process
• Percentage of FFs Razorized: 9%
• Error-free Razor overhead ~3%
• 54% energy reduction

[Figure: die photo (Icache, RF, Dcache, IF/ID/EX/MEM/WB) and a plot of voltage at 0.1% error rate vs. voltage at first failure point across chips, with linear fit y = 0.78685x + 0.22117]

58 Configuration of the Razor Voltage Controller

The controller samples the pipeline error signals to form an error rate Esample, computes the difference Ediff = Eref − Esample against a reference error rate (with a reset input), and feeds Ediff through the voltage control function that sets the voltage regulator's Vdd.

[Figure: Configuration of the Razor Voltage Control System]
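The control loop above is a simple feedback system. A proportional-control sketch, where the error model, reference rate, and gain are illustrative assumptions rather than the prototype's actual parameters:

```python
def razor_voltage_controller(measure_error_rate, e_ref, v_init, k_p, steps):
    """Proportional feedback sketch of the Razor voltage controller:
    Ediff = Eref - Esample; too few errors (positive Ediff) lowers Vdd
    to save energy, too many errors raises it."""
    vdd = v_init
    for _ in range(steps):
        e_diff = e_ref - measure_error_rate(vdd)
        vdd -= k_p * e_diff   # positive Ediff (too few errors) lowers Vdd
    return vdd

# Usage: toy error model where timing errors appear below ~1.5 V.
err_model = lambda v: max(0.0, 0.05 * (1.5 - v))
v = razor_voltage_controller(err_model, e_ref=0.001, v_init=1.8, k_p=2.0, steps=200)
print(round(v, 2))  # → 1.48: settles just below the point of first failure
```

The loop converges to the voltage where the observed error rate equals the reference, mirroring the measured behavior of tuning for a near-zero (but nonzero) error rate.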

Run-Time Response of Razor Voltage Controller

[Plot: run-time response at 120 MHz, 27°C — controller output voltage (1.35–1.80 V) and percentage error rate (0–16%) over 600 runtime samples]

59 Energy/Performance Characteristics

[Figure: as supply voltage decreases, the energy of operations Eproc falls while the energy of recovery Erecovery rises; total energy Etotal = Eproc + Erecovery reaches its optimum below the point of 1% pipeline throughput (IPC) loss, yielding 30–50% savings versus a processor pipeline without Razor support.]
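The energy trade-off in the figure can be made concrete with a small sweep. The energy models below are illustrative (quadratic Eproc from CV² scaling, recovery energy growing sharply below an assumed error-onset voltage), not measured data:

```python
def optimal_supply_voltage(voltages, e_proc, e_recovery):
    """Find the supply voltage minimizing Etotal = Eproc + Erecovery:
    Eproc falls with voltage while Erecovery grows as timing errors
    become frequent, so the sum has an interior minimum."""
    return min(voltages, key=lambda v: e_proc(v) + e_recovery(v))

# Illustrative models: CV^2 operation energy; recovery energy exploding
# below an assumed ~1.4 V error-onset point.
e_proc = lambda v: v * v
e_rec  = lambda v: 50.0 * max(0.0, 1.4 - v) ** 2
vs = [1.0 + 0.01 * i for i in range(81)]   # sweep 1.0 V .. 1.8 V
v_opt = optimal_supply_voltage(vs, e_proc, e_rec)
print(round(v_opt, 2))  # → 1.37: the optimum sits below the error-free voltage
```

The key point matches the slide: the minimum-energy operating point lies *below* the first-failure voltage, so some recovery energy is worth paying.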

Conclusions

• Power Issues: Dynamic and Static Power
• Low Power Design Techniques
• Reliability Issues: SER, Variability and Defects
• Fault Tolerant Design Techniques
• Robust Low Power Design Techniques

60 References

1. C. Constantinescu, 'Trends and Challenges in VLSI Circuit Reliability', Intel
2. H. T. Nguyen, 'A Systematic Approach to Processor SER Estimation and Solutions'
3. P. Shivakumar et al., 'Modeling the Effect of Technology Trends on the Soft Error Rate of Combinational Logic'
4. P. Shivakumar, 'Fault-Tolerant Computing for Radiation Environments', Ph.D. Thesis, Stanford University
5. M. Nicolaidis, 'Time Redundancy Based Soft-Error Tolerance to Rescue Nanometer Technologies'
6. L. Anghel et al., 'Cost Reduction and Evaluation of a Temporary Faults Detecting Technique'
7. L. Anghel et al., 'Evaluation of Soft Error Tolerance Technique Based on Time and/or Space Redundancy', ICSD
8. I. Koren, University of Massachusetts ECE 655 Lecture Notes 4-5, 'Coding'
9. ITRS 2003 Report
10. J. von Neumann, 'Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components'
11. R. E. Lyons et al., 'The Use of Triple-Modular Redundancy to Improve Computer Reliability'
12. D. G. Mavis et al., 'Soft Error Rate Mitigation Techniques for Modern Microcircuits', IEEE 40th Annual International Reliability Physics Symposium, 2002
13. C. Weaver et al., 'A Fault Tolerant Approach to Microprocessor Design', DSN'01
14. J. Ray et al., 'Dual Use of Superscalar Datapath for Transient-Fault Detection and Recovery', Proceedings of the 34th Annual Symposium on Microarchitecture (MICRO'01)
15. J. B. Nickel et al., 'REESE: A Method of Soft Error Detection in Microprocessors', Proceedings of the International Conference on Dependable Systems and Networks (DSN'01)
16. S. Reinhardt et al., 'Transient Fault Detection via Simultaneous Multithreading'

References

1. D. Siewiorek, 'Fault Tolerance in Commercial Computers', CMU
2. W. Bartlett et al., 'Commercial Fault Tolerance: A Tale of Two Systems', IEEE Trans. Dependable and Secure Computing, 2004
3. T. Slegel et al., 'IBM's S/390 G5 Microprocessor Design'
4. L. Spainhower et al., 'IBM S/390 Parallel Enterprise Server G5 Fault Tolerance: A Historical Perspective'
5. D. Bossen et al., 'Fault Tolerant Design of the IBM pSeries 690 System Using POWER4 Processor Technology'
6. 'Tandem HP Himalaya' White Paper
7. 'Fujitsu SPARC64 V Microprocessor Provides Foundation for PRIMEPOWER Performance and Reliability Leadership'
8. D. J. Sorin et al., 'SafetyNet: Improving the Availability of Shared-Memory Multiprocessors with Global Checkpoint/Recovery'
9. M. Prvulovic et al., 'ReVive: Cost-Effective Architectural Support for Rollback Recovery in Shared-Memory Multiprocessors'
10. J. Smolens et al., 'Fingerprinting: Bounding Soft-Error Detection Latency and Bandwidth'
11. D. Sorin et al., 'Dynamic Verification of End-to-End Multiprocessor Invariants'

61 Questions?
