Quantitative Modeling of Racetrack Memory, a Tradeoff Among Area, Performance, and Power
1C-1

Chao Zhang†, Guangyu Sun†, Weiqi Zhang†, Fan Mi‡, Hai Li‡, Weisheng Zhao§
†CECA, Peking University, China  ‡University of Pittsburgh, U.S.A.  §Spintronics Interdisciplinary Center, Beihang University, China
Email: †{zhang.chao, gsun, zhangweiqi}@pku.edu.cn, ‡[email protected], ‡[email protected], §[email protected]

Abstract—Recently, an emerging non-volatile memory called Racetrack Memory (RM) has become a promising candidate to satisfy the demand for increasing on-chip memory capacity. RM achieves ultra-high storage density by integrating many bits in a tape-like racetrack, while providing read/write speed comparable to SRAM. However, the lack of circuit-level modeling has limited the design exploration of RM, especially at the system level. To overcome this limitation, we develop a circuit-level model of RM, based on a careful study of device configurations and circuit layouts. The model introduces the Macro Unit (MU) as the building block of RM and analyzes the interaction of its attributes. Moreover, we integrate the model into NVSim to enable automatic exploration of RM's huge design space. Our case study of RM caches demonstrates significant variance under different optimization targets with respect to area, performance, and energy. In addition, we show that cross-layer optimization is critical for the adoption of RM as on-chip memory.

I. INTRODUCTION

With the rapid development of computing systems, there is an urgent demand for increased on-chip memory capacity. Memory researchers are seeking alternatives to static random access memory (SRAM), such as embedded DRAM (eDRAM) and spin-transfer torque random access memory (STT-RAM), which achieve 2∼4 times the storage density of SRAM. Recently, a new type of non-volatile memory (NVM), racetrack memory (RM), has drawn the attention of researchers [1], [2], [3], [4] by achieving an even higher storage density: compared with STT-RAM, RM provides about 12 times the storage density while keeping similar read/write speed.

Parkin et al. [1] first proposed racetrack memory in 2008 and pointed out its tremendous application potential. In 2011, Annunziata et al. [5] demonstrated the first 200mm RM wafer with real fabricated chips. It was fabricated in IBM 90nm CMOS technology, and each die contained 256 racetrack cells, which proves the feasibility of RM fabrication. Venkatesan et al. first explored the application of RM as on-chip CPU caches [2] and on-chip GPGPU caches [4]. They found that RM could achieve about 8× storage density and about 7× energy reduction as on-chip CPU caches, and that an RM-based on-chip cache could improve GPGPU performance by 12.1% and reduce energy by about 3.3× over SRAM. Sun et al. [6] further proposed a hierarchical and dense architecture for racetrack (HDART); their RM-based cache achieved about 6× chip area reduction, 25% performance improvement, and 62% energy reduction over STT-RAM.

However, system-level analysis cannot fully explore the RM circuit design space, because of the lack of a circuit-level model. Several works [2], [4] seem to ignore the shrink potential of the racetrack width, and Sun et al. estimated system-level performance with a fixed RM configuration [6]. Without a circuit-level model and its simulation tool, further research will remain limited. To fully explore system-level design and optimization, we need a quantitative and automatic simulation tool. Simply treating a whole racetrack as the memory cell of an existing tool raises three problems. First, different RM "cells" would overlap and affect each other, which induces significant layout inefficiency. Second, to access the different bits stored in a single racetrack cell, a cell needs more than one access port. Third, a multi-bit cell may require different shift effort (latency and energy) to access different bits. Because multi-port cells and shift operations are not modeled in NVSim, we need a new tool to quantitatively model the RM circuit design.

Thus, we propose a circuit-level RM model, based on a careful study of device configurations and circuit layouts. To enable automatic exploration of the huge RM design space, we propose a macro unit (MU) design to integrate RM into NVSim. We also analyze the interacting effects of the MU parameters, and then perform a cross-layer optimization for area, latency, and energy.

This paper is organized as follows. We first introduce preliminaries on racetrack memory and NVSim in Section II. Racetrack memory modeling is presented in Section III. RM cross-layer optimization is conducted in Section IV. We conclude our work after a cross-layer case study in Section V.

II. PRELIMINARY

A. Racetrack Memory

Racetrack Memory (RM) is a new variant of magnetic random access memory (MRAM). It stores many bits in tape-like stripes called racetracks, which achieves ultra-high storage density. It inherits various distinctive attributes of MRAM, including non-volatility, high density, low leakage power, and fast read access [8], [9], [2], [6], [10], [11]. Its compatibility with CMOS technology and its distinguished scalability make RM a promising candidate to replace SRAM as on-chip memory in the future [2], [6], [12].

Racetrack memory, also known as Domain Wall Memory (DWM), takes a magnetic nanowire stripe, called a racetrack (RT), as its basic building block, as illustrated in Figure 1. The physical layout of the racetrack contains successive notches, or pinning sites, where domain walls (DWs) can rest.

Fig. 1. An illustration of racetrack (RT) structure. (Labels: bit-line, source-line, domain, domain walls, pinned layer, MTJ, access MOS, shift MOS, VDD, GND.)
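Functionally, a racetrack behaves like a shift register without mechanical motion: bits sit in magnetic domains, and shift pulses move the whole domain-wall train past a small number of fixed access ports. The toy model below is our own illustrative abstraction, not the paper's model (the class name, the single-port assumption, and the pulse-count cost are all invented); it captures the key consequence that reading bit i first costs a number of shift pulses equal to its distance from the port, so different bits in one racetrack see different shift latency and energy.

```python
class Racetrack:
    """Toy behavioral model of one racetrack with a single fixed access port.

    Illustrative only: a real racetrack is accessed through MTJ-based ports
    and current-driven domain-wall shifts; here we simply count shift pulses
    as a proxy for shift latency/energy.
    """

    def __init__(self, bits, port_position=0):
        self.bits = list(bits)          # magnetization of each domain (1 = antiparallel)
        self.position = port_position   # index of the domain currently under the port
        self.shift_pulses = 0           # accumulated shift cost

    def shift(self, steps):
        """Move the domain train by `steps` positions (sign gives the direction)."""
        self.shift_pulses += abs(steps)
        self.position += steps

    def read(self, i):
        """Align domain i with the access port, then sense its magneto-resistance."""
        self.shift(i - self.position)
        return self.bits[self.position]


rt = Racetrack([1, 0, 0])                    # domains "up, down, down" -> bits 1, 0, 0
sequence = [rt.read(i) for i in range(3)]    # sequential shift-and-read
print(sequence, rt.shift_pulses)             # [1, 0, 0] 2
rt.read(0)                                   # going back costs 2 more pulses
print(rt.shift_pulses)                       # 4
```

The uneven cost of `read` is exactly the per-bit "shift effort" that a circuit-level model has to account for, and that a conventional one-bit-per-cell modeling tool cannot express.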
A bit in an RT is represented by the magnetization direction of a domain. A phenomenon called spin-momentum transfer can shift the domain walls in lock step; in most cases the DWs move opposite to the current direction. Note that a DW shift causes no mechanical movement of the material, only a change of the domains' magnetization directions.

Basic operations, including read, write, and shift, are performed through ports. A read is done via an access port by measuring the magneto-resistance. A write is performed by driving a high write current through the access port to flip the magnetization direction of a specific domain. A "shift" operation drives the domain walls along the entire RT stripe with a spin-polarized current pulse, controlled by the shift ports.

An access port, consisting of a magnetic tunneling junction (MTJ) and a word-line-controlled transistor, executes reads and writes, as illustrated in Figure 1. The MTJ is a sandwich structure: a pinned layer with a fixed magnetization direction lies at the bottom, an MgO barrier sits in the middle, and a domain in the RT works as the top layer. The bit-line (BL), word-line (WL), and source-line (SL) are connected to the domain of the MTJ, to the gate of the access transistor, and to the source of the access transistor, respectively. The transistor controls the current density through the MTJ, which determines the read and write latency of a port. The magnetization direction of the domain represents the stored value: when the domain is antiparallel to the pinned layer of the MTJ, a high resistance is observed from source-line to bit-line, indicating logic "1"; a parallel direction stores logic "0". As shown in Figure 1, domains with magnetization directions "up, down, down" are formed successively.

The shift port consists of a pair of transistors connected at the two ends of the RT. After the shift control lines turn on the corresponding pair of transistors in the shift ports, the current through the RT shifts all domain walls opposite to the current direction. If a shift pulse pushes the domain walls to the right, the data sequence read out from the port will be "100"; similarly, it will be "001" if the domain walls move to the left.

NVSim [7] is the most popular NVM modeling tool, but it cannot support RM efficiently by taking a whole racetrack as its cell.

Fig. 2. An illustration of a racetrack memory bank. (a) Overview of the bank: row decoder, word-line driver, column multiplexer, sense amplifier, and output driver around the mats. (b) Detailed view of an array: pre-charger, RT row/column decoders, shift drivers, and the RT array built from macro units. White and gray components are existing parts, and yellow ones are new for RM.

978-1-4799-7792-5/15/$31.00 ©2015 IEEE

A. Macro Unit

Figure 3 (a) and (b) show the side and top views of the layout of one RM cell. There are one access transistor and two shift transistors at each end of the racetrack, and the racetrack is stacked on top of these transistors. The insulator (MgO) and the pinned magnetic layer are sandwiched between the racetrack and the access transistor, which together make up one access port. Apparently, there is a lot of unused space in the layout of one RM cell, because a transistor is normally wider than a racetrack. If such RM cells were organized into an array like a traditional memory, the area efficiency would be quite low due to the blank space in each cell [2].

In order to utilize the layout efficiently, multiple RM cells can be overlapped with each other [6]. An example of two overlapped RM cells is shown in Figure 3 (c) and (d). All shift transistors and access transistors of both cells are aligned vertically, as illustrated in Figure 3 (d).
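The area argument for overlapping cells can be made concrete with a back-of-envelope sketch. The function and the pitch values (in feature sizes F) below are invented for illustration, not taken from the paper: a lone cell is transistor-limited, so only a fraction of the footprint width actually stores bits, and stacking k racetracks over the same transistor footprint raises that fraction until the racetracks themselves fill the pitch.

```python
def mu_area_efficiency(cells_per_mu, track_pitch_f=2.0, transistor_pitch_f=10.0):
    """Fraction of the layout width occupied by racetracks (hypothetical
    first-order model): the footprint is set by the wider of the transistor
    pitch and the stacked racetrack pitches."""
    used = cells_per_mu * track_pitch_f         # width that actually stores bits
    footprint = max(used, transistor_pitch_f)   # transistor-limited for few cells
    return used / footprint

for k in (1, 2, 5):
    print(k, mu_area_efficiency(k))   # 0.2, 0.4, then 1.0 once tracks fill the pitch
```

Under these made-up numbers, overlapping five cells recovers the space a single cell wastes; the paper's model additionally trades this area gain against port count and shift latency/energy.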
In this work, such a layout with overlapped RM cells is called a Macro Unit (MU). The MU serves as the basic building block of the RM array. Obviously, we can increase the number of RM cells in one MU to further improve area efficiency.

B. NVSim Modeling Framework

NVSim is a circuit-level model, which facilitates the NVM system-