PRESENTATION ON Soft Error Trends and Mitigation Techniques in Memory Devices

Charles Slayman
Senior Reliability Consultant
Ops A La Carte
Santa Clara, CA

January, 2011

Soft Errors Page 1 © Ops A La Carte LLC 2010 RAMS 2011

Outline

• Introduction – Source of Soft Errors
• Soft Error Rates in SRAM and DRAM
• Mitigation Techniques for SRAM and DRAM
• Comparison with Soft Errors in FLASH and Logic
• Conclusions

Background – Source of Soft Errors

• Alpha particles
• High Energy Neutrons
• Thermal Neutrons

Alpha Particles

• Interaction of a high energy 4He nucleus with IC material generates charge that can upset circuits
  – Energies up to 10 MeV
  – Penetration ranges up to 70 µm
  – 1 MeV of energy loss can generate 44 fC of charge
• Major sources of alpha particles
  – Trace impurities (U, Th, Po, etc.)
  – Natural isotopes (Pb, Pt, Ha, etc.)
  – Accidental process contamination
• Classification of material
  – < 2 cm^-2 khr^-1: Ultralow alpha
  – < 50 cm^-2 khr^-1: Low alpha
  – > 50 cm^-2 khr^-1: Standard material

[Figure: alpha track, charge collection, and resulting bit flip "1" → "0" in a memory cell; from May & Woods, IRPS 1978]
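The 44 fC per MeV figure can be reproduced with a short calculation. The sketch below assumes roughly 3.6 eV of energy loss per electron-hole pair in silicon; that value is a commonly quoted figure and is not stated on the slide, and the function name is illustrative.

```python
# Sketch: charge generated in silicon by an ionizing particle.
# Assumption (not from the slides): ~3.6 eV is needed per electron-hole pair in Si.
ELECTRON_CHARGE_C = 1.602e-19  # coulombs per electron
EV_PER_PAIR_SI = 3.6           # assumed mean energy per electron-hole pair


def deposited_charge_fc(energy_loss_mev: float) -> float:
    """Return the charge in femtocoulombs generated by an energy loss in MeV."""
    pairs = energy_loss_mev * 1e6 / EV_PER_PAIR_SI
    return pairs * ELECTRON_CHARGE_C * 1e15


if __name__ == "__main__":
    for mev in (1, 5, 10):
        print(f"{mev:>2} MeV -> {deposited_charge_fc(mev):.1f} fC")
```

Under this assumption, 1 MeV comes out between 44 and 45 fC, consistent with the slide.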

High Energy and Thermal Neutrons

• High energy neutrons are a by-product of cosmic rays hitting the earth's atmosphere
• Broad energy spectra
• Flux is dependent on altitude and geomagnetic location
• Reference level is New York City: 14 neutrons/cm^2-hr above 10 MeV

[Figure: neutron flux spectrum showing thermal neutrons and high energy neutrons (1 MeV to >1 GeV); from JEDEC JESD89A]

Thermal Neutrons Create Alpha Particles

• Thermal neutrons are a result of high energy neutrons losing energy to material in the surrounding environment
• Characteristic energy ~ 25 meV
• Flux is dependent on high energy neutron background and surrounding environment – typically 0.1 to 0.5x of high energy neutron flux
• Large capture cross-section by 10B generates alpha particles inside the IC


Soft Errors in SRAM

• If the charge generated by an alpha particle or neutron is large enough (exceeds the critical charge, Qcrit), the cell is flipped

SRAM Soft Error Trend

[Figure: SRAM bit upset trend is flat vs. design rule; 1 FIT = 1 upset per 10^9 hr]

• Decrease in SRAM cell Qcrit with process shrinks is balanced by decrease in cell area, leading to a flat soft error rate trend

Example of SRAM Soft Error Rate

7500 Series Xeon Processor with 24 MB (= 192 Mb) of Level 3 SRAM cache (L3$)
• Assume an SRAM soft error rate between 10^-4 and 10^-3 FIT/bit from the previous slide (1 FIT = 1 fail per 10^9 hr)
• This translates to 20,000 to 200,000 FIT, or 0.2 to 2 errors/year per CPU (sea level, NYC)
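The arithmetic behind this example can be sketched as follows. The function names are illustrative, a year is taken as 8760 hours, and megabits are treated as decimal (10^6 bits), which is an assumption about how the slide rounds.

```python
# Sketch: convert a per-bit soft error rate into a device FIT and errors per year.
HOURS_PER_YEAR = 8760  # 365 days; leap years ignored


def total_fit(bits: float, fit_per_bit: float) -> float:
    """Device-level FIT for a memory of `bits` cells at `fit_per_bit`."""
    return bits * fit_per_bit


def errors_per_year(fit: float) -> float:
    """1 FIT = 1 failure per 10^9 device-hours."""
    return fit / 1e9 * HOURS_PER_YEAR


if __name__ == "__main__":
    l3_bits = 24 * 8 * 1e6  # 24 MB = 192 Mb (decimal megabits assumed)
    for rate in (1e-4, 1e-3):
        fit = total_fit(l3_bits, rate)
        print(f"{rate:.0e} FIT/bit -> {fit:,.0f} FIT, "
              f"{errors_per_year(fit):.2f} errors/year")
```

The unrounded results are about 19,200 and 192,000 FIT, which the slide rounds to 20,000 and 200,000 FIT.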

SRAM Multi-cell Upset

[Figure: Probability That Multiple Cells Are Upset; from Ibe et al., IEEE Trans. Elec. Dev., Jul. 2010]

• If enough charge is generated, multiple cells can be upset
• When cells are closely pitched, >10% of upsets can involve multiple cells

Soft Errors in DRAM

• Excess charge generated by alpha particle or neutron can
  – discharge cell capacitor
  – upset sense amp during read/write operation
  – upset various logic/control circuits

DRAM Soft Error Trend

[Figure: DRAM cell upset trending down with process technology]

• DRAM cell upset soft error rate is trending downwards because Qcrit is flat but cell area is shrinking
• As a result, upset of control logic in DRAM is becoming more significant

Example of DRAM Soft Error Rate

• Up to 250 GB (= 2 Tb) of main memory can be supported by an Intel 7500 Xeon CPU socket
• DRAM error rates are dropping to the range of 10^-9 to 10^-8 FIT/bit
• This translates to 2,000 to 20,000 FIT for main memory, or 0.02 to 0.2 errors/year
• About 10x less than the L3$ SRAM example
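The "about 10x less" comparison can be checked with a short calculation. This sketch uses only the capacities and per-bit rates quoted in the two examples (24 MB of L3 SRAM at 10^-4 FIT/bit, 250 GB of DRAM at 10^-9 FIT/bit); the function names and the decimal unit convention are assumptions.

```python
# Sketch: compare main-memory DRAM FIT with the L3 SRAM example.
HOURS_PER_YEAR = 8760


def memory_fit(capacity_bytes: float, fit_per_bit: float) -> float:
    """Total FIT for a memory of the given size (8 bits per byte)."""
    return capacity_bytes * 8 * fit_per_bit


def errors_per_year(fit: float) -> float:
    """1 FIT = 1 failure per 10^9 device-hours."""
    return fit / 1e9 * HOURS_PER_YEAR


def sram_to_dram_ratio() -> float:
    """Ratio of the low-end SRAM FIT to the low-end DRAM FIT."""
    sram = memory_fit(24e6, 1e-4)   # 24 MB of L3 SRAM
    dram = memory_fit(250e9, 1e-9)  # 250 GB of DRAM
    return sram / dram


if __name__ == "__main__":
    for rate in (1e-9, 1e-8):
        fit = memory_fit(250e9, rate)
        print(f"{rate:.0e} FIT/bit -> {fit:,.0f} FIT, "
              f"{errors_per_year(fit):.3f} errors/year")
    print(f"SRAM/DRAM FIT ratio: {sram_to_dram_ratio():.1f}x")
```

Even though main memory holds roughly ten thousand times more bits than the cache, its much lower per-bit rate leaves it with the smaller total.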

DRAM Cell Upset Distribution

Rseu = single event upset rate (1 FIT = 1 event/10^9 hr)

[Figure legend: single cell upset (1 bit); multi-cell upset (2-16 bits); logic upset (1028-8192 bits); from Borucki et al., IRPS 2008]

DRAM Bit Error Rate vs. Design Rule

• Bit Error Rate = (Event Rate) x (Bit Upsets per Event)
• Logic soft errors dominate cell upsets from a bit error rate perspective

[Figure: logic bit error rate and cell bit error rate vs. design rule; from Borucki et al., IRPS 2008]
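The relation Bit Error Rate = (Event Rate) x (Bit Upsets per Event) shows why rare logic upsets can still dominate. In the sketch below the event rates are hypothetical placeholders chosen only for illustration; the bits-per-event values come from the legend of the cell upset distribution slide.

```python
# Sketch: bit error rate = event rate x bits upset per event.
# The event rates below are HYPOTHETICAL, not measured values from the talk.


def bit_error_rate(event_rate_fit: float, bits_per_event: float) -> float:
    """Return the bit error rate in bit-FIT per chip."""
    return event_rate_fit * bits_per_event


if __name__ == "__main__":
    # Hypothetical: single cell upsets occur 100x more often than logic upsets.
    cell = bit_error_rate(event_rate_fit=100.0, bits_per_event=1)
    # Logic upsets corrupt 1028 to 8192 bits per event; lower bound used here.
    logic = bit_error_rate(event_rate_fit=1.0, bits_per_event=1028)
    print(f"cell:  {cell:,.0f} bit-FIT")
    print(f"logic: {logic:,.0f} bit-FIT")
```

With these placeholder rates, the logic contribution is about ten times the cell contribution despite being one hundred times less frequent.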


Interleaving

[Figure: neutron or alpha multi-cell hit, without and with bit interleaving]

Without interleaving:
• If nearest neighbor cells are in the same word line, multi-cell upset = multi-bit error

With interleaving:
• Bit interleave distance physically separates cells in the same word line
• Multi-cell upset = multiple single bit errors
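The effect of interleaving can be illustrated with a toy model. The sketch assumes a simple round-robin layout in which physical column c holds a bit of logical word c mod D, where D is the interleave distance; this mapping and the function names are illustrative and not taken from any specific memory design.

```python
# Sketch: how bit interleaving spreads a multi-cell hit across logical words.
# Assumed layout: physical column c stores one bit of logical word (c % interleave).
from collections import Counter


def upsets_per_word(start: int, width: int, interleave: int) -> Counter:
    """Count upset bits per logical word for a hit on `width` adjacent cells."""
    return Counter(col % interleave for col in range(start, start + width))


def worst_case_bits(width: int, interleave: int) -> int:
    """Largest number of upset bits that land in any single logical word."""
    return max(upsets_per_word(0, width, interleave).values())


if __name__ == "__main__":
    hit_width = 3  # three physically adjacent cells upset by one strike
    for distance in (1, 2, 4):
        print(f"interleave {distance}: worst case "
              f"{worst_case_bits(hit_width, distance)} bit(s) per word")
```

With an interleave distance at least as large as the hit width, each word sees at most one flipped bit, which a single bit correcting code can handle.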

Error Correction Codes (ECC)

• IF A CLEAN COPY OF DATA EXISTS (Cache ↔ Main Memory)
  – Parity detection + interleave is most efficient: only one extra bit is required for protection
• IF A CLEAN COPY OF DATA DOES NOT EXIST (Main Memory)
  – Single Bit Correct-Double Bit Detect (SBC-DBD) + interleave is most effective
• IF THERE IS A PROBABILITY OF MULTI-BIT ERRORS (No Interleave or DRAM Logic)
  – Multi-bit codes require greater overhead

* A symbol is a grouping of bits
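The difference between parity detection and single bit correct, double bit detect codes can be shown on a toy word. The sketch below uses an extended Hamming (8,4) code as a small stand-in; the slides do not name a specific code, and production memories typically protect much wider words, so treat the code choice and all function names as assumptions.

```python
# Sketch: even parity (detect only) vs. extended Hamming (8,4) (correct 1, detect 2).
# The (8,4) code is an illustrative stand-in, not the code used in any real DIMM.


def add_parity(bits):
    """Append one even-parity bit."""
    return bits + [sum(bits) % 2]


def parity_ok(word):
    """True if the word still has even parity (no odd number of flips)."""
    return sum(word) % 2 == 0


def encode(data):
    """Encode 4 data bits; index 0 is overall parity, 1..7 are Hamming positions."""
    d1, d2, d3, d4 = data
    word = [0, d1 ^ d2 ^ d4, d1 ^ d3 ^ d4, d1, d2 ^ d3 ^ d4, d2, d3, d4]
    word[0] = sum(word[1:]) % 2
    return word


def decode(word):
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    word = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    odd_flips = sum(word) % 2 == 1
    if syndrome == 0 and not odd_flips:
        status = "ok"
    elif odd_flips:
        word[syndrome] ^= 1  # syndrome 0 means the overall parity bit flipped
        status = "corrected"
    else:
        status = "uncorrectable"  # even number of flips with nonzero syndrome
    return [word[3], word[5], word[6], word[7]], status


if __name__ == "__main__":
    codeword = encode([1, 0, 1, 1])
    single = list(codeword)
    single[5] ^= 1
    double = list(single)
    double[2] ^= 1
    print(decode(codeword), decode(single), decode(double))
```

Parity costs one bit per word but can only flag an error, so it suits caches where a clean copy can be re-fetched; the Hamming variant spends more check bits to repair the data in place.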

DRAM SER Mitigation

• Single Bit Correction + interleave is effective at dealing with single cell and multi-cell upset
• 4-bit Symbol Correction codes (aka Chipkill or Single Device Data Correction, SDDC) are effective at correcting logic errors from x4 I/O DRAM (or 8-bit Symbol Correction with x8 I/O)
• However, detection and reset are required for static logic errors; otherwise they will masquerade as hard errors and swamp the ECC circuitry

Example of DRAM Chipkill ECC

• JEDEC Standard ECC Dual Inline Memory Module (DIMM) with eighteen x4 DRAM = 72 bit wide bus
• Only 2/18 DRAM for ECC = 11% overhead
• Chipkill/SDDC capability handles all DRAM multi-cell and logic errors, since an alpha or neutron strike only affects a single chip (or symbol)
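The numbers on this slide, and the reason a single-chip failure stays within one symbol, can be sketched as below. The contiguous mapping of bus bits to chips (bit b belongs to chip b // 4) is a hypothetical layout chosen for illustration, as are the function names.

```python
# Sketch: bus width, ECC overhead, and symbol containment for a x4 ECC DIMM.
# Assumed mapping (hypothetical): bus bit b is driven by chip b // io_width.


def bus_width(chips: int, io_width: int) -> int:
    """Total data bus width in bits."""
    return chips * io_width


def ecc_overhead(ecc_chips: int, total_chips: int) -> float:
    """Fraction of the DRAM devices used for check bits."""
    return ecc_chips / total_chips


def symbols_hit(failed_bits, io_width: int) -> set:
    """Set of symbol (chip) indices touched by the given bus bit positions."""
    return {bit // io_width for bit in failed_bits}


if __name__ == "__main__":
    print(f"bus width: {bus_width(18, 4)} bits")
    print(f"ECC overhead: {ecc_overhead(2, 18):.0%}")
    # All four outputs of chip 5 corrupted by one strike: still one symbol.
    print(f"symbols hit: {symbols_hit(range(20, 24), 4)}")
```

Because every bit a chip contributes falls in the same 4-bit symbol, a code that corrects one symbol covers any error pattern confined to that chip.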


Soft Errors in MLC NAND Flash

• Upset of logic operations in FLASH is well known in avionics/aerospace applications as single event functional interrupts (SEFI)
• Neutron induced threshold shift of multi-level cell (MLC) NAND Flash recently reported

[Figure: threshold voltage shift of higher bits from neutron radiation; from Gerardin et al., IRPS 2010]

MLC NAND Flash Soft Error Trend

• Neutron cell upset rates are trending upwards but can be handled by embedded ECC

[Figure: data derived from Gerardin et al., IRPS 2010]

Logic Soft Errors

• FIT per flop or gate will probably trend below that of an SRAM cell because logic gates do not follow design rules as aggressive as SRAM
• Not all logic flips result in an error: architectural vulnerability factor (AVF) << 1
• From sheer count, DRAM and SRAM soft errors will dominate logic soft errors
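The combined effect of a lower per-element rate, a small AVF, and a smaller element count can be sketched as a simple derating product. The flop count, per-flop rate, and AVF below are hypothetical placeholders; only the 192 Mb SRAM size and its 10^-4 FIT/bit rate come from the earlier SRAM example.

```python
# Sketch: effective FIT = element count x raw FIT per element x AVF.
# Flop count, per-flop FIT, and AVF are HYPOTHETICAL illustration values.


def effective_fit(count: float, fit_per_element: float, avf: float = 1.0) -> float:
    """Derate the raw soft error rate by the architectural vulnerability factor."""
    return count * fit_per_element * avf


if __name__ == "__main__":
    # SRAM: 192 Mb at 1e-4 FIT/bit, no derating applied.
    sram = effective_fit(192e6, 1e-4)
    # Logic (hypothetical): 1e6 flops, 10x lower rate than SRAM, AVF of 0.1.
    logic = effective_fit(1e6, 1e-5, avf=0.1)
    print(f"SRAM:  {sram:,.0f} FIT")
    print(f"logic: {logic:,.1f} FIT")
```

Under these placeholder values the logic contribution is several orders of magnitude below the SRAM contribution, which is the qualitative point of the slide.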

Summary

DEVICE         SOFT ERROR RATE           COMMENTS
SRAM           10^-4 to 10^-2 FIT/bit    • Flat trend with design rule.
                                         • Single bit ECC protection and interleave.
DRAM – Cell    10^-10 to 10^-5 FIT/bit   • Fixed cell capacitance causes downward trend as design rule shrinks.
                                         • ECC protection and interleave.
DRAM – Logic   0.1 to 10 FIT/chip        • Dominates cell upset in newer technologies on a bit error rate basis.
                                         • Requires multi-bit ECC.
MLC Flash      10^-8 to 10^-5 FIT/bit    • Trending up as process technology shrinks.
                                         • Embedded ECC protection.
LOGIC          ~10x less than SRAM       • Will probably trend flat as process technology shrinks.
                                         • Difficult to protect.

Conclusions

• Memory soft errors occur at observable rates
  – SRAM trend is flat as feature size shrinks
  – DRAM cell soft errors trend down; logic soft errors are becoming more important
  – MLC FLASH is trending upwards as feature size shrinks
• The most effective mitigation techniques for memory are parity, SBC-DBD, or chipkill/SDDC ECC codes
• Logic soft errors occur at rates far below memory soft errors; mitigation techniques are much more difficult and can be employed only on a limited scale
