(BEER): Determining DRAM On-Die ECC Functions by Exploiting DRAM Data Retention Characteristics
Total Page:16
File Type:pdf, Size:1020Kb
Bit-Exact ECC Recovery (BEER): Determining DRAM On-Die ECC Functions by Exploiting DRAM Data Retention Characteristics Minesh Pately Jeremie S. Kimzy Taha Shahroodiy Hasan Hassany Onur Mutluyz yETH Zurich¨ zCarnegie Mellon University Increasing single-cell DRAM error rates have pushed DRAM entirely within the DRAM chip [39, 76, 120, 129, 138]. On-die manufacturers to adopt on-die error-correction coding (ECC), ECC is completely invisible outside of the DRAM chip, so ECC which operates entirely within a DRAM chip to improve factory metadata (i.e., parity-check bits, error syndromes) that is used yield. e on-die ECC function and its eects on DRAM relia- to correct errors is hidden from the rest of the system. bility are considered trade secrets, so only the manufacturer Prior works [60, 97, 98, 120, 129, 133, 138, 147] indicate that knows precisely how on-die ECC alters the externally-visible existing on-die ECC codes are 64- or 128-bit single-error cor- reliability characteristics. Consequently, on-die ECC obstructs rection (SEC) Hamming codes [44]. However, each DRAM third-party DRAM customers (e.g., test engineers, experimental manufacturer considers their on-die ECC mechanism’s design researchers), who typically design, test, and validate systems and implementation to be highly proprietary and ensures not to based on these characteristics. reveal its details in any public documentation, including DRAM To give third parties insight into precisely how on-die ECC standards [68,69], DRAM datasheets [63,121,149,158], publi- transforms DRAM error paerns during error correction, we cations [76, 97, 98, 133], and industry whitepapers [120, 147]. introduce Bit-Exact ECC Recovery (BEER), a new methodol- Because the unknown on-die ECC function is encapsulated ogy for determining the full DRAM on-die ECC function (i.e., within the DRAM chip, it obfuscates raw bit errors (i.e., pre- its parity-check matrix) without hardware tools, prerequisite correction errors)1 in an ECC-function-specic manner. ere- knowledge about the DRAM chip or on-die ECC mechanism, fore, the locations of soware-visible uncorrectable errors (i.e., or access to ECC metadata (e.g., error syndromes, parity infor- post-correction errors) oen no longer match those of the pre- mation). BEER exploits the key insight that non-intrusively correction errors that were caused by physical DRAM error inducing data-retention errors with carefully-craed test pat- mechanisms. While this behavior appears desirable from a terns reveals behavior that is unique to a specic ECC function. black-box perspective, it poses serious problems for third-party We use BEER to identify the ECC functions of 80 real DRAM customers who study, test and validate, and/or design LPDDR4 DRAM chips with on-die ECC from three major systems based on the reliability characteristics of the DRAM DRAM manufacturers. We evaluate BEER’s correctness in chips that they buy and use. Section 2.2 describes these cus- simulation and performance on a real system to show that tomers and the problems they face in detail, including, but not BEER is eective and practical across a wide range of on-die limited to, three important groups: (1) system designers who ECC functions. To demonstrate BEER’s value, we propose and need to ensure that supplementary error-mitigation mecha- discuss several ways that third parties can use BEER to improve nisms (e.g., rank-level ECC within the DRAM controller) are their design and testing practices. As a concrete example, we carefully designed to cooperate with the on-die ECC func- introduce and evaluate BEEP, the rst error proling method- tion [40, 129, 160], (2) large-scale industries (e.g., computing ology that uses the known on-die ECC function to recover the system providers such as Microso [33], HP [47], and Intel [59], number and bit-exact locations of unobservable raw bit errors DRAM module manufacturers [4, 92, 159]) or government enti- responsible for observable post-correction errors. ties (e.g., national labs [131,150]) who must understand DRAM 1. Introduction reliability characteristics when validating DRAM chips they buy and use, and (3) researchers who need full visibility into arXiv:2009.07985v1 [cs.AR] 17 Sep 2020 Dynamic random access memory (DRAM) is the predomi- physical device characteristics to study and model DRAM reli- nant choice for system main memory across a wide variety of ability [17, 20, 31, 42, 43, 46, 72, 78–86, 109, 138, 139, 172, 178]. computing platforms due to its favorable cost-per-bit relative to other memory technologies. DRAM manufacturers main- For each of these third parties, merely knowing or reverse- tain a competitive advantage by improving raw storage densi- engineering the type of ECC code (e.g., n-bit Hamming code) ties across device generations. Unfortunately, these improve- based on existing industry [60, 97, 98, 120, 133, 147] and aca- ments largely rely on process technology scaling, which causes demic [129,138] publications is not enough to determine exactly serious reliability issues that reduce factory yield. DRAM how the ECC mechanism obfuscates specic error paerns. manufacturers traditionally mitigate yield loss using post- is is because an ECC code of a given type can have many manufacturing repair techniques such as row/column spar- dierent implementations based on how its ECC function (i.e., ing [51]. However, continued technology scaling in mod- its parity-check matrix) is designed, and dierent designs lead ern DRAM chips requires stronger error-mitigation mecha- to dierent reliability characteristics. For example, Figure1 nisms to remain viable because of random single-bit errors shows the relative probability of observing errors in dierent that are increasingly frequent at smaller process technology bit positions for three dierent ECC codes of the same type (i.e., nodes [39,76,89,99,109,119,120,124,127,129,133,160]. erefore, single-error correction Hamming code with 32 data bits and DRAM manufacturers have begun to use on-die error correction 1We use the term “error” to refer to any bit-ip event, whether observed coding (on-die ECC), which silently corrects single-bit errors (e.g., uncorrectable bit-ips) or unobserved (e.g., corrected by ECC). 1 6 parity-check bits) but that use dierent ECC functions. We BEER exploits the key insight that forcing the ECC function to obtain this data by simulating 109 ECC words using the EINSim act upon carefully-craed uncorrectable error paerns reveals simulator [2, 138] and show medians and 95% condence inter- ECC-function-specic behavior that disambiguates dierent vals calculated via statistical bootstrapping [32] over 1000 sam- ECC functions. BEER comprises three key steps: (1) deliber- ples. We simulate a 0xFF test paern2 with uniform-random ately inducing uncorrectable data-retention errors by pausing pre-correction errors at a raw bit error rate of 10–4 (e.g., as oen DRAM refresh while using carefully-craed test paerns to seen in experimental studies [17,20,43,46,76,102,109,139,157]). control the errors’ bit-locations, which is done by leveraging data-retention errors’ intrinsic data-paern asymmetry (dis- 0.04 cussed in Section 3.2), (2) enumerating the bit positions where the ECC mechanism causes miscorrections, and (3) using a SAT solver [28] to solve for the unique parity-check matrix that 0.02 causes the observed set of miscorrections. Probability Pre-Correction Post-Correction (ECC Function 1) Relative Error Post-Correction (ECC Function 0) Post-Correction (ECC Function 2) We experimentally apply BEER to 80 real LPDDR4 DRAM 0.00 chips with on-die ECC from three major DRAM manufacturers 0 5 10 15 20 25 30 to determine the chips’ on-die ECC functions. We describe Bit Index in Dataword the experimental steps required to apply BEER to any DRAM Figure 1: Relative error probabilities in dierent bit posi- chip with on-die ECC and show that BEER tolerates observed tions for dierent ECC functions with uniform-randomly dis- experimental noise. We show that dierent manufacturers ap- tributed pre-correction (i.e., raw) bit errors. pear to use dierent on-die ECC functions while chips from e data demonstrates that ECC codes of the same type can the same manufacturer and model number appear to use the have vastly dierent post-correction error characteristics. is same on-die ECC function (Section 5.1.3). Unfortunately, our is because each ECC mechanism acts dierently when faced experimental studies with real DRAM chips have two limita- with more errors than it can correct (i.e., uncorrectable errors), tions against further validation: (1) because the on-die ECC causing it to mistakenly perform ECC-function-specic “correc- function is considered trade secret for each manufacturer, we tions” to bits that did not experience errors (i.e., miscorrections, are unable to obtain a groundtruth to compare BEER’s results which Section 3.3 expands upon). erefore, a researcher or en- against, even when considering non-disclosure agreements gineer who studies two DRAM chips that use the same type of with DRAM manufacturers and (2) we are unable to publish ECC code but dierent ECC functions may nd that the chips’ the nal ECC functions that we uncover using BEER for con- soware-visible reliability characteristics are quite dierent dentiality reasons (discussed in Section 2.1). even if the physical DRAM cells’ reliability characteristics are To overcome the limitations of experimental studies with identical. On the other hand, if we know the full ECC function real DRAM chips, we rigorously evaluate BEER’s correctness in (i.e., its parity-check matrix), we can calculate exactly which simulation (Section6). We show that BEER correctly recovers pre-correction error paern(s) result in a set of observed er- the on-die ECC function for 115300 single-error correction rors.