
PHASE-CHANGE TECHNOLOGY AND THE FUTURE OF MAIN MEMORY

Phase-change memory may enable continued scaling of main memories, but PCM has higher access latencies, incurs higher power costs, and wears out more quickly than DRAM. This article discusses how to mitigate these limitations through buffer sizing, row caching, write reduction, and wear leveling, to make PCM a viable DRAM alternative for scalable main memories.

Benjamin C. Lee, Stanford University
Ping Zhou, Jun Yang, Youtao Zhang, and Bo Zhao, University of Pittsburgh
Engin Ipek, University of Rochester
Onur Mutlu, Carnegie Mellon University
Doug Burger, Microsoft Research

Over the past few decades, memory technology scaling has provided many benefits, including increased density and capacity and reduced cost. Scaling has provided these benefits for conventional technologies, such as DRAM and flash, but now scaling is in jeopardy. For continued scaling, systems might need to transition from conventional charge memory to emerging resistive memory. Charge memories require discrete amounts of charge to induce a voltage, which is detected during reads. In the nonvolatile space, flash memories must precisely control the discrete charge placed on a floating gate. In volatile main memory, DRAM must not only place charge in a storage capacitor but also mitigate subthreshold charge leakage through the access device. Capacitors must be sufficiently large to store charge for reliable sensing, and transistors must be sufficiently large to exert effective control over the channel. Given these challenges, scaling DRAM beyond 40 nanometers will be increasingly difficult.1

In contrast, resistive memories use electrical current to induce a change in atomic structure, which impacts the resistance detected during reads. Resistive memories are amenable to scaling because they don't require precise charge placement and control. Programming mechanisms such as current injection scale with cell size. Phase-change memory (PCM), spin-torque transfer (STT) magnetoresistive RAM (MRAM), and ferroelectric RAM (FRAM) are examples of resistive memories. Of these, PCM is closest to realization and imminent deployment as a NOR flash competitor. In fact, various researchers and manufacturers have prototyped PCM arrays in the past decade.2

PCM provides a nonvolatile storage mechanism that is amenable to process scaling. During writes, an access transistor injects current into the storage material and thermally induces phase change, which is detected during reads. Because PCM relies on analog current and thermal effects, it doesn't require control over discrete electrons. As technologies scale and heating contact areas shrink, programming current scales linearly. Researchers project that this PCM scaling mechanism will be more robust than that of DRAM beyond 40 nm, and it has already been demonstrated in a 32-nm device prototype.1,3 As a scalable DRAM alternative, PCM could provide a clear road map for increasing main-memory density and capacity.

Providing a path for main-memory scaling, however, will require surmounting PCM's disadvantages relative to DRAM.




PCM access latencies, although only tens of nanoseconds, are still several times slower than those of DRAM. At present technology nodes, PCM writes require energy-intensive current injection. Moreover, the resulting thermal stress within the storage element degrades current-injection contacts and limits endurance to hundreds of millions of writes per cell for present process technologies. Because of these significant limitations, PCM is presently positioned mainly as a flash replacement. As a DRAM alternative, therefore, PCM must be architected for feasibility in main memory for general-purpose systems.

Today's prototype designs do not mitigate PCM latencies, energy costs, and finite endurance. This article describes a range of PCM design alternatives aimed at making PCM systems competitive with DRAM systems. These alternatives focus on four areas: row buffer design, row caching, wear reduction, and wear leveling.2,4 Relative to DRAM, these optimizations collectively provide competitive performance, comparable energy, and feasible lifetimes, thus making PCM a viable replacement for main-memory technology.

Technology and challenges
As Figure 1 shows, the PCM storage element consists of two electrodes separated by a resistive heater and a chalcogenide (the phase-change material). Ge2Sb2Te5 (GST) is the most commonly used chalcogenide, but others offer higher resistivity and improve the device's electrical characteristics. Nitrogen doping increases resistivity and lowers programming current, whereas GS offers lower-latency phase changes.5,6 (GS contains the first two elements of GST, germanium and antimony, and does not include tellurium.)

Figure 1. Storage element with heater and chalcogenide between electrodes (a), and cell structure with storage element and bipolar junction transistor (BJT) access device (b).

Phase changes are induced by injecting current into the resistor junction and heating the chalcogenide. The current and voltage characteristics of the chalcogenide are identical regardless of its initial phase, thereby lowering programming complexity and latency.7 The amplitude and width of the injected current pulse determine the programmed state.

PCM cells are one-transistor (1T), one-resistor (1R) devices comprising a resistive storage element and an access transistor (Figure 1). One of three devices typically controls access: a field-effect transistor (FET), a bipolar junction transistor (BJT), or a diode. In the future, FET scaling and large voltage drops across the cell will adversely affect gate-oxide reliability for unselected wordlines.8 BJTs are faster and can scale more robustly without this vulnerability.8,9 Diodes occupy smaller areas and potentially enable greater cell densities but require higher operating voltages.10

Writes
The access transistor injects current into the storage material and thermally induces phase change, which is detected during reads. The chalcogenide's resistivity captures logical values. A high, short current pulse (reset) increases resistivity by abruptly discontinuing current, quickly quenching heat generation, and freezing the chalcogenide into an amorphous state. A moderate, long current pulse (set) reduces resistivity by ramping down current, gradually cooling the chalcogenide, and inducing crystal growth. Set latency, which requires the longer current pulse, determines write performance. Reset energy, which requires the higher current pulse, determines write power.
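To make the set/reset distinction concrete, the following sketch models a single cell's programming behavior. It is an illustrative behavioral model only, not the authors' circuit; the pulse thresholds are loose assumptions inspired by the prototype parameters in Table 1 (roughly 300-microampere reset pulses of tens of nanoseconds versus roughly 150-microampere set pulses of 150 ns or more), and the mapping of states to logical values is merely a convention.

```python
# Behavioral sketch of PCM programming: a high, short pulse (reset) leaves the
# chalcogenide amorphous (high resistance); a moderate, long pulse (set) lets it
# crystallize (low resistance). Thresholds are illustrative assumptions.
RESET_MIN_CURRENT_UA = 300   # assumed: high-amplitude pulse melts the material
SET_MIN_WIDTH_NS = 150       # assumed: long pulse gives time for crystal growth

def program(cell, current_ua, width_ns):
    """Apply one current pulse and update the cell's phase."""
    if current_ua >= RESET_MIN_CURRENT_UA and width_ns < SET_MIN_WIDTH_NS:
        cell["state"] = "amorphous"     # reset: abrupt quench freezes the material
    elif current_ua < RESET_MIN_CURRENT_UA and width_ns >= SET_MIN_WIDTH_NS:
        cell["state"] = "crystalline"   # set: gradual ramp-down induces crystal growth
    else:
        raise ValueError("pulse is neither a clean set nor a clean reset")
    return cell

def read(cell):
    """Resistivity encodes the value; here crystalline (low resistance) is read as 1."""
    return 1 if cell["state"] == "crystalline" else 0

cell = {"state": "amorphous"}
program(cell, current_ua=150, width_ns=200)   # moderate, long pulse: set
assert read(cell) == 1
program(cell, current_ua=400, width_ns=40)    # high, short pulse: reset
assert read(cell) == 0
```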


Cells that store multiple resistance levels could be implemented by leveraging intermediate states, in which the chalcogenide is partially crystalline and partially amorphous.9,11 Smaller current slopes (slow ramp-down) produce lower resistances, and larger slopes (fast ramp-down) produce higher resistances. Varying slopes induce partial phase transitions and/or change the size and shape of the amorphous material produced at the contact area, generating resistances between those observed from fully amorphous or fully crystalline chalcogenides. The difficulty and high latency cost of differentiating between many resistances could constrain such multilevel cells to a few bits per cell.

Wear and endurance
Writes are the primary wear mechanism in PCM. When current is injected into a volume of phase-change material, thermal expansion and contraction degrades the electrode-storage contact, such that programming currents are no longer reliably injected into the cell. Because material resistivity depends highly on current injection, current variability causes resistance variability. This greater variability degrades the read window, which is the difference between programmed minimum and maximum resistances.

Write endurance, the number of writes performed before the cell cannot be programmed reliably, ranges from 10⁴ to 10⁹. Write endurance depends on process and differs across manufacturers. PCM will likely exhibit greater write endurance than flash memory by several orders of magnitude (for example, 10⁷ to 10⁸). The 2007 International Technology Roadmap for Semiconductors (ITRS) projects an improved endurance of 10¹² writes at 32 nm.1 Wear reduction and leveling techniques could prevent write limits from being exposed to the system during a memory's lifetime.

Reads
Before the cell is read, the bitline is precharged to the read voltage. The wordline is active-low when using a BJT access transistor (see Figure 1). If a selected cell is in a crystalline state, the bitline is discharged, with current flowing through the storage element and the access transistor. Otherwise, the cell is in an amorphous state, preventing or limiting bitline current.

Scalability
As contact area decreases with feature size, thermal resistivity increases, and the volume of phase-change material that must be melted to completely block current flow decreases. Specifically, as feature size scales down (1/k), contact area decreases quadratically (1/k²). Reduced contact area causes resistivity to increase linearly (k), which in turn causes programming current to decrease linearly (1/k). These effects enable not only smaller storage elements but also smaller access devices for current injection. At the system level, scaling translates into lower memory-subsystem energy. Researchers have demonstrated this PCM scaling mechanism in a 32-nm device prototype.1,3
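These first-order scaling rules can be summarized in a few lines. The sketch below is a back-of-the-envelope calculation rather than a device model; the baseline current value is a placeholder, and only the stated relations (contact area shrinks as 1/k², resistivity grows as k, programming current shrinks as 1/k) come from the text.

```python
# First-order PCM scaling rules: for a feature-size scaling factor k, contact area
# shrinks as 1/k^2, thermal resistivity grows as k, and programming current shrinks
# as 1/k. The baseline programming current below is an illustrative placeholder.
def scale_pcm(feature_old_nm, feature_new_nm, current_old_ua):
    k = feature_old_nm / feature_new_nm           # e.g., 80 nm -> 40 nm gives k = 2
    return {
        "contact_area_factor": 1 / k**2,          # quadratically smaller contact
        "resistivity_factor": k,                  # linearly higher resistivity
        "programming_current_ua": current_old_ua / k,  # linearly lower current
    }

print(scale_pcm(80, 40, current_old_ua=300))
# -> contact area x0.25, resistivity x2, programming current 150 uA
```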
PCM characteristics
Realizing the vision of PCM as a scalable memory requires understanding and overcoming PCM's disadvantages relative to DRAM. Table 1 shows technology parameters derived from nine prototypes published in the past five years by multiple semiconductor manufacturers.2 Access latencies of up to 150 ns are several times slower than those of DRAM. At current 90-nm technology nodes, PCM writes require energy-intensive current injection. Moreover, writes induce thermal expansion and contraction within storage elements, degrading injection contacts and limiting endurance to hundreds of millions of writes per cell at current processes. Prototypes implement 9F² to 12F² PCM cells using BJT access devices (where F is the feature size), up to 50 percent larger than 6F² to 8F² DRAM cells.

These limitations position PCM as a replacement for flash memory; in this market, PCM properties are drastic improvements. Making PCM a viable alternative to DRAM, however, will require architecting PCM for feasibility in main memory for general-purpose systems.

Architecting a DRAM alternative
With area-neutral buffer reorganizations, Lee et al. show that PCM systems are within the competitive range of DRAM systems.2 Effective buffering hides long PCM latencies and reduces PCM energy costs. Scalability trends further favor PCM over DRAM.




Table 1. Technology survey of published prototypes.

Parameter* | Horii6 | Ahn12 | Bedeschi13 | Oh14 | Pellizzer15 | Chen5 | Kang16 | Bedeschi9 | Lee10 | Lee2
Year | 2003 | 2004 | 2004 | 2005 | 2006 | 2006 | 2006 | 2008 | 2008 | **
Process, F (nm) | ** | 120 | 180 | 120 | 90 | ** | 100 | 90 | 90 | 90
Array size (Mbytes) | ** | 64 | 8 | 64 | ** | ** | 256 | 256 | 512 | **
Material | GST, N-d | GST, N-d | GST | GST | GST | GS, N-d | GST | GST | GST | GST, N-d
Cell size (µm²) | ** | 0.290 | 0.290 | ** | 0.097 | 60 nm² | 0.166 | 0.097 | 0.047 | 0.065 to 0.097
Cell size (F²) | ** | 20.1 | 9.0 | ** | 12.0 | ** | 16.6 | 12.0 | 5.8 | 9.0 to 12.0
Access device | ** | ** | BJT | FET | BJT | ** | FET | BJT | Diode | BJT
Read time (ns) | ** | 70 | 48 | 68 | ** | ** | 62 | ** | 55 | 48
Read current (µA) | ** | ** | 40 | ** | ** | ** | ** | ** | ** | 40
Read voltage (V) | ** | 3.0 | 1.0 | 1.8 | 1.6 | ** | 1.8 | ** | 1.8 | 1.0
Read power (µW) | ** | ** | 40 | ** | ** | ** | ** | ** | ** | 40
Read energy (pJ) | ** | ** | 2.0 | ** | ** | ** | ** | ** | ** | 2.0
Set time (ns) | 100 | 150 | 150 | 180 | ** | 80 | 300 | ** | 400 | 150
Set current (µA) | 200 | ** | 300 | 200 | ** | 55 | ** | ** | ** | 150
Set voltage (V) | ** | ** | 2.0 | ** | ** | 1.25 | ** | ** | ** | 1.2
Set power (µW) | ** | ** | 300 | ** | ** | 34.4 | ** | ** | ** | 90
Set energy (pJ) | ** | ** | 45 | ** | ** | 2.8 | ** | ** | ** | 13.5
Reset time (ns) | 50 | 10 | 40 | 10 | ** | 60 | 50 | ** | 50 | 40
Reset current (µA) | 600 | 600 | 600 | 600 | 400 | 90 | 600 | 300 | 600 | 300
Reset voltage (V) | ** | ** | 2.7 | ** | 1.8 | 1.6 | ** | 1.6 | ** | 1.6
Reset power (µW) | ** | ** | 1620 | ** | ** | 80.4 | ** | ** | ** | 480
Reset energy (pJ) | ** | ** | 64.8 | ** | ** | 4.8 | ** | ** | ** | 19.2
Write endurance | 10⁷ | 10⁹ | 10⁶ | ** | 10⁸ | 10⁴ | ** | 10⁵ | 10⁵ | 10⁸ (MLC)

* BJT: bipolar junction transistor; FET: field-effect transistor; GST: Ge2Sb2Te5; MLC: multilevel cells; N-d: nitrogen doped.
** This information is not available in the publication cited.

Array architecture
PCM cells can be hierarchically organized into banks, blocks, and subblocks. Despite similarities to conventional memory array architectures, PCM has specific design issues that must be addressed. For example, PCM reads are nondestructive.

Choosing bitline sense amplifiers affects array read-access time. Voltage-based sense amplifiers are cross-coupled inverters that require differential discharging of bitline capacitances. In contrast, current-based sense amplifiers rely on current differences to create differential voltages at the amplifiers' output nodes. Current sensing is faster but requires larger circuits.17

In DRAM, sense amplifiers both detect and buffer data using cross-coupled inverters. In contrast, we explore PCM architectures in which sensing and buffering are separate. In such architectures, sense amplifiers drive banks of explicit latches. These latches provide greater flexibility in row buffer organization by enabling multiple buffered rows, but they also incur area overheads. Separate sensing and buffering enables multiplexed sense amplifiers. Multiplexing also enables buffer widths narrower than array widths (defined by the total number of bitlines). Buffer width is a critical design parameter that determines the required number of expensive current-based sense amplifiers.
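A behavioral sketch of such an organization, multiple narrow row buffers backed by latches, follows. It is our illustration of the idea, not the evaluated design: the row-tag lookup, least-recently-used eviction, and dirty-bit tracking are assumptions made to keep the example self-contained.

```python
from collections import OrderedDict

class PCMBufferedBank:
    """Behavioral sketch: a bank with several narrow row buffers held in latches.
    Buffer width sets how many current-based sense amplifiers are exercised per fill;
    multiple buffered rows exploit temporal locality and coalesce writes."""

    def __init__(self, width_bytes=512, num_buffers=4):
        self.width = width_bytes
        self.num_buffers = num_buffers
        self.buffers = OrderedDict()      # row address -> {"dirty": bool}
        self.array_writes = 0             # high-energy writes that reach the PCM array

    def access(self, row, is_write):
        if row not in self.buffers:
            self._evict_if_full()
            self.buffers[row] = {"dirty": False}   # buffer fill: array read of `width` bytes
        self.buffers.move_to_end(row)              # mark as most recently used
        if is_write:
            self.buffers[row]["dirty"] = True      # coalesced in the latch, not yet written back

    def _evict_if_full(self):
        if len(self.buffers) >= self.num_buffers:
            _, victim = self.buffers.popitem(last=False)   # evict least recently used row
            if victim["dirty"]:
                self.array_writes += 1              # one array write per dirty eviction
```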


Buffer organizations
Another way to make PCM more competitive with DRAM is to use area-neutral buffer organizations, which have several benefits. First, area neutrality enables a competitive DRAM alternative in an industry where area and density directly impact cost and profit. Second, to mitigate fundamental PCM constraints and achieve competitive performance and energy relative to DRAM-based systems, narrow buffers reduce the number of high-energy PCM writes, and multiple rows exploit temporal locality. This locality not only improves performance but also reduces energy by exposing additional opportunities for write coalescing. Third, as PCM technology matures, baseline PCM latencies will likely improve. Finally, process technology scaling will drive linear reductions in PCM energy.

Area neutrality. Buffer organizations achieve area neutrality through narrower buffers and additional buffer rows. The number of sense amplifiers decreases linearly with buffer width, significantly reducing area because fewer of these large circuits are required. We take advantage of these area savings by implementing multiple rows with latches far smaller than the removed sense amplifiers. Narrow widths reduce PCM write energy but negatively impact spatial locality, opportunities for write coalescing, and application performance. However, the additional buffer rows can mitigate these penalties. We examine these fundamental trade-offs by constructing area models and identifying designs that meet a DRAM-imposed area budget before optimizing delay and energy.2

Buffer design space. Figure 2 illustrates the delay and energy characteristics of the buffer design space for representative benchmarks from memory-intensive scientific-computing applications.18-20 The triangles represent PCM and DRAM baselines implementing a single 2,048-byte buffer. Circles represent various buffer organizations. Open circles indicate organizations requiring less area than the DRAM baseline when using 12F² cells. Closed circles indicate additional designs that become viable when considering smaller 9F² cells. By default, the PCM baseline (the triangle labeled "PCM baseline" in the figure) does not satisfy the area budget because of its larger current-based sense amplifiers and explicit latches.

Figure 2. Phase-change memory (PCM) buffer organization, showing delay and energy averaged across benchmarks to illustrate the Pareto frontier. Open circles indicate designs satisfying area constraints, assuming 12F² PCM multilevel cells. Closed circles indicate additional designs satisfying area constraints, assuming smaller 9F² PCM multilevel cells.

As Figure 2 shows, reorganizing a single, wide buffer into multiple, narrow buffers reduces both energy costs and delay. Pareto optima shift PCM delay and energy into the neighborhood of the DRAM baseline. Furthermore, among these Pareto optima, we observe a knee that minimizes both energy and delay: four buffers that are 512 bytes wide. Such an organization reduces the PCM delay and energy disadvantages from 1.6× and 2.2× to 1.1× and 1.0×, respectively. Although smaller 9F² PCM cells provide the area for wider buffers and additional rows, the associated energy costs are not justified. In general, diminishing marginal reductions in delay suggest that area savings from 9F² cells should go toward improving density, not additional buffering.
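The area-neutrality constraint behind this design space can be explored with a crude spreadsheet-style model. The sketch below enumerates (width, rows) organizations whose sensing and latch area fit under a DRAM-imposed budget; the relative area coefficients are placeholders chosen for illustration, not the published area model.

```python
# Crude area model: current-based sense amplifiers scale with buffer width, and
# explicit latches scale with width x rows. Coefficients are illustrative only.
SENSE_AMP_AREA = 10.0   # assumed area units per buffered bit of sensing
LATCH_AREA = 1.0        # assumed area units per buffered bit of latching

def buffer_area(width_bytes, rows):
    bits = width_bytes * 8
    return bits * SENSE_AMP_AREA + bits * rows * LATCH_AREA

# Budget: a single 2,048-byte DRAM-style buffer, whose sense amplifiers also buffer
# data (so it pays no separate latch cost). A 2,048-byte PCM buffer with explicit
# latches exceeds this budget, consistent with the PCM baseline in Figure 2.
budget = 2048 * 8 * SENSE_AMP_AREA

feasible = [(w, r) for w in (64, 128, 256, 512, 1024, 2048)
            for r in (1, 2, 4, 8, 16, 32)
            if buffer_area(w, r) <= budget]
print(feasible)   # four 512-byte buffers, the knee discussed above, fits easily
```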



...... TOP PICKS

Delay and energy optimization. Using four 512-byte buffers is the most effective way to optimize average delay and energy across workloads. Figure 3 illustrates the impact of reorganized PCM buffers. Delay penalties are reduced from the original 1.60× to 1.16×. The delay impact ranges from 0.88× (swim benchmark) to 1.56× (fft benchmark) relative to a DRAM-based system. When executing benchmarks on an effectively buffered PCM, more than half of the benchmarks are within 5 percent of their DRAM performance. Benchmarks that perform less effectively exhibit low write-coalescing rates. For example, buffers can't coalesce any writes in the fft workload.

Figure 3. Application delay and energy when using PCM with optimized buffering (four 512-byte buffers) as a DRAM replacement.

Buffering and write coalescing also reduce memory-subsystem energy from a baseline of 2.2× down to 1.0×, parity with DRAM. Although each PCM array write requires 43.1× more energy than a DRAM array write, these energy costs are mitigated by narrow buffer widths and additional rows, which reduce the granularity of buffer evictions and expose opportunities for write coalescing, respectively.

Scaling and implications
DRAM scaling faces many significant technical challenges because scaling exposes weaknesses in both components of the one-transistor, one-capacitor cell. Capacitor scaling is constrained by the DRAM storage mechanism: manufacturing small capacitors that store sufficient charge for reliable sensing, despite large parasitic capacitances on the bitline, becomes increasingly difficult. Scaling scenarios are also bleak for access transistors. As these transistors scale down, increasing subthreshold leakage makes it increasingly difficult to ensure DRAM retention times. Not only is less charge stored in the capacitor, but that charge is also stored less reliably. These trends will impact DRAM's reliability and energy efficiency in future process technologies. According to the ITRS, "manufacturable solutions are not known" for DRAM beyond 40 nm.1

In contrast, the ITRS projects PCM scaling mechanisms will extend to 32 nm, after which other scaling mechanisms could apply.1 PCM scaling mechanisms have already been demonstrated at 32 nm with a novel device structure designed by Raoux et al.3 Although both DRAM and PCM are expected to be viable at 40 nm, energy-scaling trends strongly favor PCM. Lai and Pirovano et al. have separately projected a 2.4× reduction in PCM energy from 80 nm to 40 nm.7,8 In contrast, the ITRS projects that DRAM energy will fall by only 1.5× across the same technology nodes, reflecting the technical challenges of DRAM scaling.

Because PCM energy scales down 1.6× more quickly than DRAM energy, PCM systems will significantly outperform DRAM systems at future technology nodes. At 40 nm, PCM system energy is 61.3 percent that of DRAM, averaged across workloads. Switching from DRAM to PCM reduces energy costs by at least 22.1 percent (art benchmark) and by as much as 68.7 percent (swim benchmark). This analysis does not account for refresh energy, which could further increase DRAM energy costs. Although the ITRS projects a constant retention time of 64 ms as DRAM scales to 40 nm,3 less-effective access-transistor control might reduce retention times. If retention times fall, DRAM refresh energy will increase as a fraction of total DRAM energy costs.

Mitigating wear and energy
In addition to architecting PCM to offer competitive delay and energy relative to DRAM, we must also consider PCM wear mechanisms. With only 10⁷ to 10⁸ writes over each cell's lifetime, solutions are needed to reduce and level writes coming from the lowest-level processor cache.


Zhou et al. show that write reduction and leveling can improve PCM endurance with light circuitry overheads.4 These schemes level wear across memory elements, remove redundant bit writes, and collectively achieve an average lifetime of 22 years. Moreover, an energy study shows that PCM with low-operating-power (LOP) peripheral logic is energy efficient.

Improving PCM lifetimes
An evaluation on a set of memory-intensive workloads shows that the unprotected lifetime of PCM-based main memory can last only an average of 171 days. Although Lee et al. track written cache lines and written cache words to implement partial writes and reduce wear,2 fine-grained schemes at the bit level might be more effective. Moreover, combining wear reduction with wear leveling can address low lifetimes arising from write locality. Here, we introduce a hierarchical set of techniques that both reduce and level wear to improve the lifetime of PCM-based main memory to more than 20 years on average.
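As a rough illustration of where such lifetime figures come from, the sketch below estimates memory lifetime from per-cell endurance and the write rate seen by the most heavily written cell; the endurance value and write rate are placeholder assumptions chosen only to land in the same order of magnitude as the unprotected-lifetime number quoted above, not the paper's measured workload data.

```python
# Lifetime is limited by the first cell to wear out, so the hottest cell's write
# rate is what matters. All numbers below are illustrative assumptions.
SECONDS_PER_DAY = 86_400

def lifetime_days(cell_endurance, hottest_cell_writes_per_sec):
    return cell_endurance / (hottest_cell_writes_per_sec * SECONDS_PER_DAY)

# Example: 10**8 write endurance and a hot cell rewritten about 7 times per second
# yields a lifetime on the order of 170 days.
print(round(lifetime_days(10**8, 6.8)))
```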

Eliminating redundant bit writes. In conventional memory access, a write updates an entire row of memory cells. However, many of these bit writes are redundant: in most cases, writing a cell does not change what is already stored within it. In a study with various workloads, 85, 77, and 71 percent of bit writes were redundant for single-level-cell (SLC), multilevel-cell with 2 bits per cell (MLC-2), and multilevel-cell with 4 bits per cell (MLC-4) memories, respectively. Removing these redundant bit writes improves the lifetimes of SLC, MLC-2, and MLC-4 PCM-based main memory to 2.1 years, 1.6 years, and 1.4 years, respectively. We implement the scheme by preceding a write with a read and a comparison. After the old value is read, an XNOR gate filters out redundant bit writes. Because a PCM read is considerably faster than a PCM write, and write operations are typically less latency critical, the negative performance impact of adding a read before a write is relatively small.

Although removing redundant bit writes extends lifetime by approximately a factor of 4, the resulting lifetime is still insufficient for practical purposes. Memory updates tend to exhibit strong locality, such that hot cells fail far sooner than cold cells. Because a memory row or segment's lifetime is determined by the first cell to fail, leveling schemes must distribute writes and avoid creating hot memory regions that limit system lifetime.

Row shifting. After redundant bit writes are removed, the bits that are written most in a row tend to be localized. Hence, a simple shifting scheme can distribute writes within a row more evenly. Experimental studies show that the optimal shift granularity is 1 byte, and the optimal shift interval is 256 writes per row. As Figure 4 shows, the scheme is implemented through an additional row shifter and a shift-offset register. On a read access, data is shifted back before being passed to the processor. The delay and energy overheads are counted in the final performance and energy results.

Figure 4. Implementation of redundant bit-write removal and row shifting. Added hardware is circled.
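The following sketch captures both mechanisms in software form. It mirrors the data path of Figure 4 only loosely: the XOR-based comparison that isolates changed bits (the hardware uses XNOR to mark unchanged ones) and the byte-granularity rotation applied every 256 writes come from the text, while the function names, row representation, and bookkeeping are our own assumptions.

```python
SHIFT_GRANULARITY_BYTES = 1     # from the text: optimal shift granularity
SHIFT_INTERVAL_WRITES = 256     # from the text: shift offset advances every 256 writes

def changed_bit_positions(old_row: bytes, new_row: bytes):
    """Redundant bit-write removal: only bits where old and new data differ
    need to be programmed into the PCM array."""
    diffs = []
    for i, (o, n) in enumerate(zip(old_row, new_row)):
        x = o ^ n                      # set bits mark positions that actually change
        for b in range(8):
            if (x >> b) & 1:
                diffs.append(i * 8 + b)
    return diffs

class ShiftedRow:
    """Row shifting: rotate the stored image by a byte offset so hot bit positions
    land on different physical cells over time; reads undo the rotation."""
    def __init__(self, size_bytes):
        self.physical = bytes(size_bytes)
        self.writes = 0
        self.offset = 0

    def write(self, new_row: bytes):
        self.writes += 1
        if self.writes % SHIFT_INTERVAL_WRITES == 0:
            self.offset = (self.offset + SHIFT_GRANULARITY_BYTES) % len(new_row)
        rotated = (new_row[-self.offset:] + new_row[:-self.offset]
                   if self.offset else new_row)
        to_program = changed_bit_positions(self.physical, rotated)
        self.physical = rotated
        return len(to_program)          # number of cells actually programmed

    def read(self) -> bytes:
        if not self.offset:
            return self.physical
        return self.physical[self.offset:] + self.physical[:self.offset]
```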




Row shifting extends the average lifetimes of SLC, MLC-2, and MLC-4 PCM-based main memories to 5.9 years, 4.4 years, and 3.8 years, respectively.

Segment swapping. The next step considers wear leveling at a coarser granularity: memory segments. Periodically, memory segments with high and low write counts are swapped. This scheme is implemented in the memory controller, which keeps track of each segment's write count and a mapping table between virtual and true segment numbers. The optimal segment size is 1 Mbyte, and the optimal swap interval is every 2 × 10⁶ writes per segment. Memory is unavailable in the middle of a swap, which amounts to a 0.036 percent performance degradation in the worst case.

Applying the sequence of three techniques4 extends the average lifetime of SLC, MLC-2, and MLC-4 PCM-based main memory to 22 years, 17 years, and 13 years, respectively, as Figure 5 shows.

Figure 5. PCM lifetime after eliminating redundant bit writes, row shifting, and segment swapping. SLC refers to single-level cells; MLC-2 and MLC-4 refer to multilevel cells with 2 and 4 bits per cell, respectively. Some benchmarks (mcf, oce, sph, and sweb-s) exhibit lifetimes greater than 40 years.

Analyzing energy implications
Because PCM uses an array structure similar to that of DRAM, we use Cacti-D to model energy and delay for peripheral circuits (interconnections and decoders).21 We use HSpice simulations to model PCM read operations.10,22 We derive parameters for PCM write operations from recent results on PCM cells.3

Because of its low-leakage and high-density features, we evaluate PCM integrated on top of a multicore architecture using 3D stacking. For the baseline DRAM memory, we integrate low-standby-power peripheral logic because of its better energy-delay (ED²) reduction.21 For PCM, we use low-operating-power (LOP) peripheral logic because of its low dynamic power, to avoid compounding PCM's already high dynamic energy. Because PCM is nonvolatile, we can power-gate idle memory banks to save leakage. Thus, LOP's higher leakage energy is not a concern for PCM-based main memory.

Energy model. With redundant bit-write removal, the energy of a PCM write is no longer a fixed value. We calculate per-access write energy as

E_pcm,write = E_fixed + E_read + E_bitchange

where E_fixed is the fixed portion of energy for each PCM write (row selection, decode, XNOR gates, and so on), and E_read is the energy to read out the old data for comparison. The variable part, E_bitchange, depends on the number of bit writes actually performed:

E_bitchange = E_1→0 · N_1→0 + E_0→1 · N_0→1

Performance and energy. Although PCM has slower read and write operations, experimental results show that the performance impact is quite mild, with an average penalty of 5.7 percent. As Figure 6 shows, dynamic energy is reduced by an average of 47 percent relative to DRAM. The savings come from two sources: redundant bit-write removal and PCM's LOP peripheral circuitry, which is particularly power efficient during burst reads. Because of PCM's nonvolatility, we can safely power-gate the idle memory banks without losing data. This, along with PCM's zero cell leakage, results in a 70 percent leakage-energy reduction over an already low-leakage DRAM memory.
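In code form, the per-access write energy model above amounts to counting the two kinds of bit transitions that survive redundant-write filtering. The sketch below is an illustration under stated assumptions: the energy constants are placeholders on the order of Table 1's set and reset energies rather than calibrated values, and it adopts the convention that a stored 1 is the crystalline (set) state.

```python
# Per-access PCM write energy after redundant bit-write removal:
#   E_write = E_fixed + E_read + E_1to0 * N_1to0 + E_0to1 * N_0to1
# Energy constants (picojoules) are illustrative placeholders, not measured values.
E_FIXED_PJ = 2.0      # row selection, decode, XNOR comparison, and so on
E_READ_PJ = 2.0       # reading the old row for comparison
E_SET_PJ = 13.5       # assumed 0 -> 1 transition (set), order of Table 1's set energy
E_RESET_PJ = 19.2     # assumed 1 -> 0 transition (reset), order of Table 1's reset energy

def write_energy_pj(old_row: bytes, new_row: bytes) -> float:
    n_1to0 = n_0to1 = 0
    for o, n in zip(old_row, new_row):
        changed = o ^ n
        n_1to0 += bin(changed & o).count("1")   # bits going from 1 to 0
        n_0to1 += bin(changed & n).count("1")   # bits going from 0 to 1
    return E_FIXED_PJ + E_READ_PJ + E_RESET_PJ * n_1to0 + E_SET_PJ * n_0to1

print(write_energy_pj(b"\xff\x00", b"\x0f\x0f"))   # 4 resets + 4 sets plus fixed costs
```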


Figure 6. Breakdown of dynamic-energy savings, normalized to DRAM. For each benchmark, the left bar shows the dynamic energy for DRAM and the right bar shows the dynamic energy for PCM; bars are broken down into row buffer miss, row buffer hit, write, and DRAM refresh energy.

Combining dynamic- and leakage-energy savings, we find that the total energy savings is 65 percent, as Figure 7 shows. Because of the significant energy savings and mild performance losses, an ED² reduction of 96 percent is achieved for the ammp benchmark. The average ED² reduction across all benchmarks is 60 percent.

Figure 7. Total energy and energy-delay (ED²) reduction, normalized to DRAM.

This article has provided a rigorous survey of phase-change technology to drive architectural studies and enhancements. PCM's long latencies, high energy, and finite endurance can be effectively mitigated. Effective buffer organizations, combined with wear reduction and leveling, can make PCM competitive with DRAM at present technology nodes. (Related work also supports this effort.23,24)

The proposed memory architecture lays the foundation for exploiting PCM scalability and nonvolatility in main memory. Scalability implies lower main-memory energy and greater write endurance. Furthermore, nonvolatile main memories will fundamentally change the landscape of computing. Software designed to exploit the nonvolatility of PCM-based main memories can provide qualitatively new capabilities.




For example, system boot or hibernation could be perceived as instantaneous; application checkpointing could be less expensive;25 and file systems could provide stronger safety guarantees.26 Thus, this work is a step toward a fundamentally new memory hierarchy with implications across the hardware-software interface. MICRO

References
1. Int'l Technology Roadmap for Semiconductors: Process Integration, Devices, and Structures, Semiconductor Industry Assoc., 2007; http://www.itrs.net/Links/2007ITRS/2007_Chapters/2007_PIDS.pdf.
2. B. Lee et al., "Architecting Phase Change Memory as a Scalable DRAM Alternative," Proc. 36th Ann. Int'l Symp. Computer Architecture (ISCA 09), ACM Press, 2009, pp. 2-13.
3. S. Raoux et al., "Phase-Change Random Access Memory: A Scalable Technology," IBM J. Research and Development, vol. 52, no. 4, 2008, pp. 465-479.
4. P. Zhou et al., "A Durable and Energy Efficient Main Memory Using Phase Change Memory Technology," Proc. 36th Ann. Int'l Symp. Computer Architecture (ISCA 09), ACM Press, 2009, pp. 14-23.
5. Y. Chen et al., "Ultra-thin Phase-Change Bridge Memory Device Using GeSb," Proc. Int'l Electron Devices Meeting (IEDM 06), IEEE Press, 2006, pp. 30.3.1-30.3.4.
6. H. Horii et al., "A Novel Cell Technology Using N-Doped GeSbTe Films for Phase Change RAM," Proc. Symp. VLSI Technology, IEEE Press, 2003, pp. 177-178.
7. S. Lai, "Current Status of the Phase Change Memory and Its Future," Proc. Int'l Electron Devices Meeting (IEDM 03), IEEE Press, 2003, pp. 10.1.1-10.1.4.
8. A. Pirovano et al., "Scaling Analysis of Phase-Change Memory Technology," Proc. Int'l Electron Devices Meeting (IEDM 03), IEEE Press, 2003, pp. 29.6.1-29.6.4.
9. F. Bedeschi et al., "A Multi-level-Cell Bipolar-Selected Phase-Change Memory," Proc. Int'l Solid-State Circuits Conf. (ISSCC 08), IEEE Press, 2008, pp. 428-429, 625.
10. K.-J. Lee et al., "A 90 nm 1.8 V 512 Mb Diode-Switch PRAM with 266 MB/s Read Throughput," J. Solid-State Circuits, vol. 43, no. 1, 2008, pp. 150-162.
11. T. Nirschl et al., "Write Strategies for 2 and 4-Bit Multi-level Phase-Change Memory," Proc. Int'l Electron Devices Meeting (IEDM 08), IEEE Press, 2008, pp. 461-464.
12. S. Ahn et al., "Highly Manufacturable High Density Phase Change Memory of 64Mb and Beyond," Proc. Int'l Electron Devices Meeting (IEDM 04), IEEE Press, 2004, pp. 907-910.
13. F. Bedeschi et al., "An 8 Mb Demonstrator for High-Density 1.8 V Phase-Change Memories," Proc. Symp. VLSI Circuits, 2004, pp. 442-445.
14. H. Oh et al., "Enhanced Write Performance of a 64 Mb Phase-Change Random Access Memory," IEEE J. Solid-State Circuits, vol. 41, no. 1, 2006, pp. 122-126.
15. F. Pellizzer et al., "A 90 nm Phase Change Memory Technology for Stand-Alone Non-Volatile Memory Applications," Proc. Symp. VLSI Circuits, IEEE Press, 2006, pp. 122-123.
16. S. Kang et al., "A 0.1-µm 1.8-V 256-Mb Phase-Change Random Access Memory (PRAM) with 66-MHz Synchronous Burst-Read Operation," IEEE J. Solid-State Circuits, vol. 42, no. 1, 2007, pp. 210-218.
17. M. Sinha et al., "High-Performance and Low-Voltage Sense-Amplifier Techniques for Sub-90 nm SRAM," Proc. Int'l Systems-on-Chip Conf. (SOC 03), 2003, pp. 113-116.
18. V. Aslot and R. Eigenmann, "Quantitative Performance Analysis of the SPEC OMPM2001 Benchmarks," Scientific Programming, vol. 11, no. 2, 2003, pp. 105-124.
19. D.H. Bailey et al., NAS Parallel Benchmarks, tech. report RNR-94-007, NASA Ames Research Center, 1994.
20. S. Woo et al., "The SPLASH-2 Programs: Characterization and Methodological Considerations," Proc. 22nd Ann. Int'l Symp. Computer Architecture (ISCA 1995), ACM Press, 1995, pp. 24-36.
21. S. Thoziyoor et al., "A Comprehensive Memory Modeling Tool and Its Application to the Design and Analysis of Future Memory Hierarchies," Proc. 35th Int'l Symp. Computer Architecture (ISCA 08), IEEE CS Press, 2008, pp. 51-62.
22. W.Y. Cho et al., "A 0.18-µm 3.0-V 64-Mb Nonvolatile Phase-Transition Random Access Memory (PRAM)," J. Solid-State Circuits, vol. 40, no. 1, 2005, pp. 293-300.
23. M.K. Qureshi, V. Srinivasan, and J.A. Rivers, "Scalable High Performance Main Memory System Using Phase-Change Memory Technology," Proc. 36th Ann. Int'l Symp. Computer Architecture (ISCA 09), ACM Press, 2009, pp. 24-33.
24. X. Wu et al., "Hybrid Cache Architecture with Disparate Memory Technologies," Proc. 36th Ann. Int'l Symp. Computer Architecture (ISCA 09), ACM Press, 2009, pp. 34-45.
25. X. Dong et al., "Leveraging 3D PCRAM Technologies to Reduce Checkpoint Overhead for Future Exascale Systems," Proc. Int'l Conf. High Performance Computing Networking, Storage, and Analysis (Supercomputing 09), 2009, article 57.
26. J. Condit et al., "Better I/O through Byte-Addressable, Persistent Memory," Proc. ACM SIGOPS 22nd Symp. Operating Systems Principles (SOSP 09), ACM Press, 2009, pp. 133-146.


Benjamin C. Lee is a Computing Innovation Fellow in electrical engineering and a member of the VLSI Research Group at Stanford University. His research focuses on scalable technologies, power-efficient computer architectures, and high-performance applications. Lee has a PhD in computer science from Harvard University.

Ping Zhou is pursuing his PhD in electrical and computer engineering at the University of Pittsburgh. His research interests include new memory technologies, 3D architecture, and chip multiprocessors. Zhou has an MS in computer science from Shanghai Jiao Tong University in China.

Jun Yang is an associate professor in the Electrical and Computer Engineering Department at the University of Pittsburgh. Her research interests include computer architecture, microarchitecture, energy efficiency, and memory hierarchy. Yang has a PhD in computer science from the University of Arizona.

Youtao Zhang is an assistant professor of computer science at the University of Pittsburgh. His research interests include computer architecture, compilers, and system security. Zhang has a PhD in computer science from the University of Arizona.

Bo Zhao is pursuing a PhD in electrical and computer engineering at the University of Pittsburgh. His research interests include VLSI circuits and microarchitectures, memory systems, and modern processor architectures. Zhao has a BS in electrical engineering from Beihang University, Beijing.

Engin Ipek is an assistant professor of computer science and of electrical and computer engineering at the University of Rochester. His research focuses on computer architecture, especially multicore architectures, hardware-software interaction, and high-performance memory systems. Ipek has a PhD in electrical and computer engineering from Cornell University.

Onur Mutlu is an assistant professor of electrical and computer engineering at Carnegie Mellon University. His research focuses on computer architecture and systems. Mutlu has a PhD in electrical and computer engineering from the University of Texas at Austin.

Doug Burger is a principal researcher at Microsoft Research, where he manages the Computer Architecture Group. His research focuses on computer architecture, memory systems, and mobile computing. Burger has a PhD in computer science from the University of Wisconsin, Madison. He is a Distinguished Scientist of the ACM and a Fellow of the IEEE.

Direct questions and comments to Benjamin Lee, Stanford Univ., Electrical Engineering, 353 Serra Mall, Gates Bldg 452, Stanford, CA 94305; bcclee@stanford.edu.


