Flipping Bits in Memory Without Accessing Them: an Experimental Study of DRAM Disturbance Errors
Total Page:16
File Type:pdf, Size:1020Kb
Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors 1 1 1 Yoongu Kim Ross Daly Jeremie Kim Chris Fallin Ji Hye Lee Donghyuk Lee1 Chris Wilkerson2 Konrad Lai Onur Mutlu1 1Carnegie Mellon University 2Intel Labs Abstract. Memory isolation is a key property of a reliable disturbance errors, DRAM manufacturers have been employ- and secure computing system — an access to one memory ad- ing a two-pronged approach: (i) improving inter-cell isola- dress should not have unintended side effects on data stored tion through circuit-level techniques [22, 32, 49, 61, 73] and in other addresses. However, as DRAM process technology (ii) screening for disturbance errors during post-production scales down to smaller dimensions, it becomes more difficult testing [3, 4, 64]. We demonstrate that their efforts to contain to prevent DRAM cells from electrically interacting with each disturbance errors have not always been successful, and that 1 other. In this paper, we expose the vulnerability of commodity erroneous DRAM chips have been slipping into the field. DRAM chips to disturbance errors. By reading from the same In this paper, we expose the existence and the widespread address in DRAM, we show that it is possible to corrupt data nature of disturbance errors in commodity DRAM chips sold in nearby addresses. More specifically, activating the same and used today. Among 129 DRAM modules we analyzed row in DRAM corrupts data in nearby rows. We demonstrate (comprising 972 DRAM chips), we discovered disturbance this phenomenon on Intel and AMD systems using a malicious errors in 110 modules (836 chips). In particular, all modules program that generates many DRAM accesses. We induce manufactured in the past two years (2012 and 2013) were vul- errors in most DRAM modules (110 out of 129) from three nerable, which implies that the appearance of disturbance er- major DRAM manufacturers. From this we conclude that rors in the field is a relatively recent phenomenon affecting many deployed systems are likely to be at risk. We identify more advanced generations of process technology. We show the root cause of disturbance errors as the repeated toggling that it takes as few as 139K reads to a DRAM address (more of a DRAM row’s wordline, which stresses inter-cell coupling generally, to a DRAM row) to induce a disturbance error. As effects that accelerate charge leakage from nearby rows. We a proof of concept, we construct a user-level program that provide an extensive characterization study of disturbance er- continuously accesses DRAM by issuing many loads to the rors and their behavior using an FPGA-based testing plat- same address while flushing the cache-line in between. We form. Among our key findings, we show that (i) it takes as demonstrate that such a program induces many disturbance few as 139K accesses to induce an error and (ii) up to one in errors when executed on Intel or AMD machines. every 1.7K cells is susceptible to errors. After examining var- We identify the root cause of DRAM disturbance errors as ious potential ways of addressing the problem, we propose a voltage fluctuations on an internal wire called the wordline. low-overhead solution to prevent the errors. DRAM comprises a two-dimensional array of cells, where each row of cells has its own wordline. To access a cell within 1. Introduction a particular row, the row’s wordline must be enabled by rais- The continued scaling of DRAM process technology has ing its voltage — i.e., the row must be activated. When there enabled smaller cells to be placed closer to each other. Cram- are many activations to the same row, they force the word- ming more DRAM cells into the same area has the well- line to toggle on and off repeatedly. According to our obser- known advantage of reducing the cost-per-bit of memory. vations, such voltage fluctuations on a row’s wordline have Increasing the cell density, however, also has a negative a disturbance effect on nearby rows, inducing some of their impact on memory reliability due to three reasons. First, cells to leak charge at an accelerated rate. If such a cell loses a small cell can hold only a limited amount of charge, too much charge before it is restored to its original value (i.e., which reduces its noise margin and renders it more vulner- refreshed), it experiences a disturbance error. able to data loss [14, 47, 72]. Second, the close proximity We comprehensively characterize DRAM disturbance er- of cells introduces electromagnetic coupling effects between rors on an FPGA-based testing platform to understand their them, causing them to interact with each other in undesirable behavior and symptoms. Based on our findings, we exam- ways [14, 42, 47, 55]. Third, higher variation in process tech- ine a number of potential solutions (e.g., error-correction and nology increases the number of outlier cells that are excep- frequent refreshes), which all have some limitations. We pro- tionally susceptible to inter-cell crosstalk, exacerbating the pose an effective and low-overhead solution, called PARA, two effects described above. that prevents disturbance errors by probabilistically refresh- As a result, high-density DRAM is more likely to suffer ing only those rows that are likely to be at risk. In contrast to from disturbance, a phenomenon in which different cells in- other solutions, PARA does not require expensive hardware terfere with each other’s operation. If a cell is disturbed structures or incur large performance penalties. This paper beyond its noise margin, it malfunctions and experiences a makes the following contributions. disturbance error. Historically, DRAM manufacturers have 1 been aware of disturbance errors since as early as the Intel The industry has been aware of this problem since at least 2012, which 1103, the first commercialized DRAM chip [58]. To mitigate is when a number of patent applications were filed by Intel regarding the problem of “row hammer” [6, 7, 8, 9, 23, 24]. Our paper was under review Work done while at Carnegie Mellon University. when the earliest of these patents was released to the public. 978-1-4799-4394-4/14/$31.00 c 2014 IEEE 1 cell To our knowledge, this is the first paper to expose the row 4 wordline widespread existence of disturbance errors in commodity DRAM chips from recent years. row 3 row 2 We construct a user-level program that induces disturbance row 1 errors on real systems (Intel/AMD). Simply by reading row 0 from DRAM, we show that such a program could poten- bitline tially breach memory protection and corrupt data stored in row-buffer pages that it should not be allowed to access. a. Rows of cells b. A single cell We provide an extensive characterization of DRAM dis- turbance errors using an FPGA-based testing platform and Figure 1. DRAM consists of cells 129 DRAM modules. We identify the root cause of distur- the cells — and immediately writes the charge back into the bance errors as the repeated toggling of a row’s wordline. cells [38, 41, 43]. Subsequently, all accesses to the row are We observe that the resulting voltage fluctuation could dis- served by the row-buffer on behalf of the row. When there turb cells in nearby rows, inducing them to lose charge at are no more accesses to the row, the wordline is lowered to an accelerated rate. Among our key findings, we show that a low voltage, disconnecting the capacitors from the bitlines. (i) disturbable cells exist in 110 out of 129 modules, (ii) A group of rows is called a bank, each of which has its own up to one in 1.7K cells is disturbable, and (iii) toggling the dedicated row-buffer. (The organization of a bank is simi- wordline as few as 139K times causes a disturbance error. lar to what was shown in Figure 1a.) Finally, multiple banks After examining a number of possible solutions, we pro- come together to form a rank. For example, Figure 2 shows pose PARA (probabilistic adjacent row activation), a low- a 2GB rank whose 256K rows are vertically partitioned into overhead way of preventing disturbance errors. Every time eight banks of 32K rows, where each row is 8KB ( 64Kb) a wordline is toggled, PARA refreshes the nearby rows in size [34]. Having multiple banks increases parallelismD be- with a very small probability (p 1). As a wordline is tog- cause accesses to different banks can be served concurrently. gled many times, the increasing disturbance effects are off- 64K cells set by the higher likelihood of refreshing the nearby rows. 2. DRAM Background Bank7 7 data 0 In this section, we provide the necessary background on cmd ••• K 256 Chip DRAM organization and operation to understand the cause addr Chip and symptoms of disturbance errors. Processor Bank0 2.1. High-Level Organization MemCtrl Rank DRAM chips are manufactured in a variety of configura- tions [34], currently ranging in capacities of 1–8 Gbit and in Figure 2. Memory controller, buses, rank, and banks data-bus widths of 4–16 pins. (A particular capacity does not 2.3. Accessing DRAM imply a particular data-bus width.) By itself, an individual DRAM chip has only a small capacity and a narrow data-bus. An access to a rank occurs in three steps: (i) “opening” the That is why multiple DRAM chips are commonly ganged to- desired row within a desired bank, (ii) accessing the desired gether to provide a large capacity and a wide data-bus (typi- columns from the row-buffer, and (iii) “closing” the row.