Soft Error Rate Reduction Using Redundancy Addition and Removal

Kai-Chiang Wu and Diana Marculescu ECE Department Carnegie Mellon University Soft Errors

• A soft error occurs when – a radiation-induced transient event causes a charge disturbance that flips the state of a storage element. – Such a bit-flip is called a single-event transient (SET) or a glitch.

• Definitions – A soft error is often referred to as a single-event upset (SEU). – The rate at which soft errors occur is called soft error rate (SER).

• Current technology scaling trends – shrinking feature sizes, etc. – Circuits become more susceptible to radiation-induced particle hits. – Particles with less energy could flip the states of storage elements.

2 Soft Errors in Combinational Logic LogicalElectricalLatching-window maskingmaskingmasking

L L A A 0 T 1 T C C H H E E S S

3 Scaling Trends of Masking Factors

• Logical masking is decreased due to – decreasing logic depth

• Electrical masking is decreased due to – faster logic gates – lower supply voltages – smaller node capacitances

• Latching-window masking is decreased due to – increasing clock frequencies

4 Outline

• Background and motivation • Related work • Proposed framework – Redundancy addition and removal (RAR) – Metrics for gate characterization – Constraints on RAR • Results and conclusion

5 Related Work

• Triple Modular Redundancy (TMR) – consists of three identical copies and a majority voter. – incurs more than 200% overhead in terms of area and power.

• Partial duplication [Mohanram et al., 2003] / gate sizing [Zhou et al., 2004] – targets gates with high error impact. – incurs potentially large area overhead.

• Flip-flop selection [Joshi et al., 2006] – increases the length of latching windows. – focuses only on latching-window masking.

6 Proposed Framework

• Using Redundancy Addition and Removal (RAR) – Iteratively add and remove redundant wires to minimize a circuit in terms of literal count.

• RAR for SER reduction – Estimate the effects of redundancy manipulations. – Accept only those with positive impact on SER.

• Advantages over other techniques – Very little area overhead – Unified treatment of three masking factors via decision diagrams – Precise estimation of SER impact of added and removed wires

RAR: Combinational and Sequential Logic Optimization by 7 Redundancy Addition and Removal [Entrena and Cheng, 1995] Outline

• Background and motivation • Related work • Proposed framework – Redundancy addition and removal (RAR) – Metrics for gate characterization – Constraints on RAR • Results and conclusion

8 Framework Overview

Define metrics to Apply RAR to identify characterize wires/gates redundant wires/gates in terms of SER impact

Set up constraints to accept only beneficial redundancy manipulations

9 Redundancy Identification RAR – An Introduction [Entrena and Cheng, 1995]

• Add a redundant wire found by mandatory assignments during automatic test pattern generation (ATPG). – The newly added wire could cause one or more originally irredundant wires/gates to become redundant (removable).

• Remove those redundant wires due to the added wire. – Delete gates with only one fanin and gates without any fanout.

• Repeat until no further improvement can be done. – The circuit will become smaller if the removed redundancies are more than the added redundancies.

10 Redundancy Identification RAR – For SER Reduction

• Apply RAR for SER reduction with little area penalty.

• Unsystematic RAR may increase SER by reducing the number of gates or the depth of circuits.

• Solution –Use MARS-C, based on BDDs and ADDs, to quantify the error impact and the masking impact of each gate. – Keep wires/gates with higher masking impact. – Remove wires/gates with higher error impact.

MARS-C: Modeling and Reduction of Soft Errors in 11 Combinational Circuits [Miskov and Marculescu, 2006] Gate Characteriztn Mean Error Impact

Mean error impact (MEI) of each internal gate Gi:

n f nF ∑∑Ρ(Fj fails Gi fails ∩ init _ glitch = (d,a)) d ,a k ==11j ΜΕΙ(Gi ) = nF ⋅ n f

• MEI quantifies the probability that at least one primary output is affected by a glitch originating at the gate. •The larger MEI a gate has, the higher the probability that a glitch occurring at this gate will be latched.

12 Gate Characteriztn Mean Masking Impact

Mean masking impact on duration (MMID) of each internal gate Gi:

n f nG d ,a ∑∑ΜΙ D (G j → Gi ) d ,a k==11j ΜΜΙ D (Gi ) = nG ⋅ n f ⋅ d

• MMID denotes the normalized expected attenuation on the duration of all glitches passing through it.

•The larger MMID a gate has, the more capable of masking glitches this gate is.

13 Redundancy Manipulation Wire Addition Constraint Checking MEI(s), MEI(a), and MEI(b) ↗ a b s e w c t f d g MMID(t) ↗ MEI(c) and MEI(d) ↘

• ΔMEI(s) = MEI(t) × [1-MMID(t)] • Wire w (s Æ t) can be added only if

–MEI(t) < T1 – MMID(t) > T2

14 Redundancy Manipulation Wire Removal Constraint 1 Checking MEI(u), MEI(a), and MEI(b) ↘ a b u e w’ c v f d g MMID(v) ↘ MEI(c) and MEI(d) ↗

• ΔMEI(u) = MEI(v) × [1-MMID(v)] • Wire w’ (u Æ v) can be removed only if

–MEI(v) > T3 ≧ T1 – MMID(v) < T4 ≦ T2

15 Redundancy Manipulation Wire Removal Constraint 2 Checking MEI(u), MEI(a), and MEI(b) ↘ a b u e w’ c v f d g MMID(v) ↘ MEI(c) and MEI(d) ↗ • Wire w’(u Æ v) can be removed only if –Wire w’ is crucial in logical masking at gate v. – The probability that gate u goes to the controlling value of gate v is sufficiently low.

16 Outline

• Background and motivation • Related work • Proposed framework – Redundancy addition and removal (RAR) – Metrics for gate characterization – Constraints on RAR • Results and conclusion

17 Practical Considerations

Mean error susceptibility (MES) of each primary output Fj:

n f nG ∑∑Ρ(Fj fails Gi fails ∩ init _ glitch = (d,a)) d ,a k==11i MES(Fj ) = nG ⋅ n f

Output failure probability of each primary output Fj:

Δd ⋅ Δa dm ,an Ρ(Fj ) = ∑∑MES(Fj ) (d max − d min ) ⋅ (amax − amin ) nm

Soft error rate of each primary output Fj:

SER(Fj ) = Ρ(Fj ) ⋅ R PH ⋅ R EFF ⋅ A CIRCUIT

18 Experimental Setup

• Technology: 70nm, BPTM • Clock period: 250ps • Setup time/hold time: 10/10ps • Supply voltage: 1.0V

•(dmin, dmax)/(amin, amax): (60, 120)ps/(0.8, 1.0)V • Δd/Δa: 20ps/0.1V -2 -1 •RPH: 56.5 m s -5 •REFF: 2.2×10

19 Experimental Results

21.86%

15.91%

7.11%

3.45%

20 Average Mean Error Susceptibility

Avg. MES Avg. MES

alu2 Duration size (ps) x4 Duration size (ps)

16-20% reduction in average mean error susceptibility

21 Output Failure Probability

Output failure probability Output failure probability

alu2 Primary output index x4 Primary output index

30-70% maximum reduction in output failure probability

22 Conclusion

• We propose a SER reduction framework – based on redundancy addition and removal (RAR) – using symbolic SER analysis (MARS-C) – for combinational logic

• Two metrics and three constraints are introduced to guide this framework towards SER reduction.

• Experiments on a subset of standard benchmarks reveal the effectiveness of our framework.

23 Thank you! Backup Slides Soft Errors – A New Great Concern in Logic Circuits

• Soft errors would significantly degrade the robustness of logic circuits, while the nominal SER of SRAMs tends to be nearly constant from the 130nm to 65nm technologies. – Source: Mitra et al., “Robust System Design with Built-in Soft-Error Resilience,” IEEE Computer Magazine, Feb. 2005

• The SER of combinational logic is predicted to be comparable to that of memory elements by 2011. – Source: Shivakumar et al., “Modeling the Effect of Technology Trends on the Soft Error Rate of Combinational Logic,” Proc. Int’l Conference on Dependable Systems and Networks, Jun. 2002

26 Soft Error Generation

Radioactive decay of 232Th, 235U, Cosmic rays react with atmosphere and 238U in packaging materials

Cascades of secondary particles (neutrons, protons, etc.)

Low-energy neutrons High-energy neutrons interact with 10B interact with 28Si

Alpha particles (He2+)

27 Soft Error Modeling

•A(Amplitude condition):

A > Vth (if the correct output is “0”) or A < V (if the correct output is “1”) th dinit = t2 – t1 (Duration condition): •D T = t1’–t1 D > tsetup + thold D = t2’–t1’ •T(Timing condition): t1’ ≦ Tclk – tsetup t ∈ [Tclk + thold – T – D, Tclk – tsetup – T] t2’ ≧ Tclk + thold

•P(A ∩ D ∩ T) = P(D ∩ T) = P(T | D)⋅P(D)

= ∑(Ρ(t ∈[Tclk + thold −T − D,Tclk − tsetup −T ] D = Dk ) ⋅Ρ(D = Dk )) k ⎛ D − (t + t ) ⎞ ⎜ k setup hold ⎟ = ∑⎜ ⋅Ρ(D = Dk )⎟ k ⎝ Tclk − dinit ⎠

28 Sensitization BDDs

L L A A • Sensitization BDDs T T include information C C about logical masking. H H E E S S

∂G3/∂G2 ∂G5/∂G3 ∂G5/∂G1 • Sensitization BDD of Gi Æ Gj is Boolean difference of Gj w.r.t. Gi 2 1 2 01 1 0 1 0 • G2 Æ G3: Bool. diff. of G3 w.r.t. G2 0 1 3 3 • G3 Æ G5: Bool. diff. of G5 w.r.t. G3 1 0 0 1 • G Æ G : Bool. diff. of G w.r.t. G 1 5 5 1 0 1 4 01

0 1 29 Duration ADDs

L L A A • Duration ADDs are created T T with respect to sensitization C C BDDs (logical masking) and H H E E attenuation model (electrical S S masking). ADD Topological order! G2ÆG3ÆG5

1 ADD 1 0 G2ÆG3 ADD G2 2 2 2 0 1 0 1 01 3 D2 1 0 0 D23

0 D235 30 Attenuation Model

τ ⎧if D ≤ p , the glitch is masked. V ⎪ τ dd ⇒ if < D ≤ 2 , the glitch is attenuated . Vmin = Vdd − ⋅ D ⎨ p τ p 2τ p ⎪ ⎩if D > 2τ p , the glitch remains the same.

31 Reconvergent Glitches

• Glitches on reconvergent paths arriving to inputs of a gate – can be merged into a new glitch. – can be masked by each other.

L L A A T T C C H H E E S S

32 MMI Computation – An Example

L L ADDG ÆG ADDG ÆG ÆG A A 1 5 2 3 5 T T ADD ADD C C G1 2 G2ÆG3 1 H H 1 0 1 0 E E S S d 3 2 2 2 0 1 010 1 0 1 Masking Impact of gate G5 w.r.t. gate G1: 4 0 2d/3 3 d ,a 01 1 0 ΜΙ D (G1 → G5 ) 2 2 = Ρ(ADD → 0) ⋅ (d − 0) + Ρ(ADD → d) ⋅ (d − d) 0 2d/3 0 4d/9 G1 →G5 G1 →G5 3 3 3 5 2 7 − Ρ(ADD → d) ⋅ (d − d) = (d − 0) + (d − d) = d G1 8 8 3 12

ADD Masking Impact of gate G w.r.t. gate G : G3ÆG5 5 2 d ,a ΜΙ D (G2 → G5 ) ADD 1 4 4 G3 = Ρ(ADD → 0) ⋅ (d − 0) + Ρ(ADD → d) ⋅ (d − d) 1 0 G2 →G3 →G5 G2 →G3 →G5 9 9 3 2 2 d − Ρ(ADDG →G → 0) ⋅ (d − 0) − Ρ(ADDG →G → d) ⋅ (d − d) 1 0 2 3 2 3 3 3 5 3 4 1 1 2 5 2 1 0 2d/3 = (d − 0) + (d − d) − (d − 0) − (d − d) = d − d = d 8 8 9 2 2 3 6 3 6 MMI Computation – An Example

Mean masking impact on duration (MMID) of gate G5: d ,a ΜΜΙ D (G5 ) ΜΙ (G d ,a → G ) + ΜΙ (G d ,a → G ) + ΜΙ (G d ,a → G ) = D 1 5 D 2 5 D 3 5 3d 7d 12 + d 6 + d 2 5 = = 3d 12

• The duration of a glitch is proportional to the probability of a soft error being latched, but the amplitude is not.

–Use only mean masking impact on duration (MMID) as a guideline for SER reduction.

34 Gate Characteriztn Why MEI & MMI?

•MEI • MMI – Glitch generation – Glitch propagation – Probability of – Capability of filtering generated glitches propagated glitches being registered

35 Constraints on RAR

• ΔMEI(s) = MEI(t) × [1-MMID(t)] – Worst-case estimation

• MMID(t) increases after adding wire s Æ t – More logical masking due to the new connection – More electrical masking due to larger gate delay

• ΔMEI(u) = MEI(v) × [1-MMID(v)] – Average-case estimation

36 Experimental Results (# PIs, Dur. Ori. Opt. Area SER # Add. # Rem. Circuit # POs, size Avg. Avg. over- reduc- wires wires # Gates) (ps) MES MES head tion (36, 60 3.57e-3 2.74e-3 37 29 C432 7, 100 1.87e-2 1.35e-2 24 12 3.45% 21.86% 156) 120 2.95e-2 2.43e-2 24 12 (41, 60 1.65e-3 1.40e-3 78 41 C499 32, 100 7.12e-3 5.77e-3 47 21 4.67% 18.64% 458) 120 10.9e-3 8.99e-3 52 27 (10, 60 2.67e-3 2.22e-3 58 46 alu2 6, 100 1.71e-2 1.38e-2 36 28 2.67% 18.27% 339) 120 2.74e-2 2.26e-2 28 23 (14, 60 9.26e-4 8.10e-4 82 60 alu4 8, 100 8.70e-3 7.57e-3 77 42 3.69% 13.54% 660) 120 1.46e-2 1.26e-2 72 42 (16, 60 10.5e-4 7.80e-4 162 76 t481 1, 100 7.30e-2 6.21e-2 136 57 7.11% 15.91% 566) 120 1.78e-1 1.59e-1 84 23 (24, 60 4.11e-3 3.42e-3 28 20 ttt2 21, 100 1.34e-2 1.13e-2 13 9 1.30% 14.88% 166 120 2.01e-2 1.71e-2 14 11 (94, 60 2.21e-3 1.80e-3 34 17 x4 71, 100 5.89e-3 4.76e-3 18 10 1.79% 18.75% 288) 120 8.72e-3 7.18e-3 19 7 Avg. 3.53% 17.41%

37