
Cache Memory Design with Magnetic Skyrmions in a Long Nanotrack

Mei-Chin Chen, Ashish Ranjan, Anand Raghunathan, Fellow, IEEE, and Kaushik Roy, Fellow, IEEE

School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47906, USA

Abstract: Magnetic skyrmion (MS), a vortex-like region with reversed magnetization in nanomagnets, has recently emerged as an exciting development in the field of spintronics. It has a number of beneficial features, including remarkably high stability, ultra-low depinning current, and extremely compact size. Due to these benefits, skyrmions have generated great interest in the design of spintronic memory. In this work, we evaluate the use of skyrmion-based memory as a last-level cache for general purpose processors. In the skyrmion-based memory structure, data can be densely packed as multiple bits in a long magnetic nanotrack. Write operations are performed by injecting a spin-polarized current in the nanotrack. Since multiple skyrmions (each representing a bit) are packed into a single nanotrack, they need to be accessed by shifting them along the nanotrack with a charge current passing through a spin-Hall metal (SHM). We identify the following key challenges associated with MS-based cache design: (i) the high current requirements for skyrmion nucleation limit the density benefits offered by these structures, since the transistor supplying write currents is the limiting factor that determines the bit-cell area; (ii) the proposed nanotrack structure results in significant performance overheads due to the latency arising from the shift operations; (iii) the skyrmions move toward the edge of the nanotrack during shift operations owing to the Magnus force, so an additional idle operation time is required to relax skyrmions back through the repulsive force from the edge; (iv) to avoid annihilation of skyrmions at the edge, the duration and the current density of the shift operation have to be well controlled. To overcome these challenges, a multi-bit skyrmion cell with appropriate peripheral circuits is proposed, considering the heterogeneity in the read/write characteristics. The density benefits are explored by performing layout of different multi-bit cells. We perform a systematic device-circuit-architecture co-design to evaluate the feasibility of our proposal. Our experiments demonstrate the potential of, and the challenges involved in, using skyrmion-based memory as last-level caches.

Index Terms: Magnetic Skyrmion (MS), Spin-Hall metal (SHM), Magnus force, Dzyaloshinskii-Moriya Interaction (DMI)

Digital Object Identifier: 10.1109/TMAG.2019.2909188

I. INTRODUCTION

Increased leakage current and process variations are major challenges to memories realized using deeply-scaled CMOS devices. The need for non-volatility (zero off-state leakage), higher density, and robustness has consequently led researchers to explore alternative technologies to replace traditional CMOS-based on-chip memories. Several emerging technologies such as phase change memory (PCM), resistive random-access memory (RRAM), spin-transfer torque magnetic RAM (STT-MRAM), and domain wall motion (DWM) based memory have been proposed as potential substitutes for SRAM and DRAM. One such promising high-density memory technology, domain wall motion based racetrack memory, was proposed by IBM [1]. In a racetrack memory, multiple bits can be coded in a sequence of magnetic domains, separated by domain walls, within a nanowire. DWM-based caches [2]–[5] have shown significant improvement in performance (with higher packing density and better energy efficiency) over other spintronic memory devices. However, the motion of domain walls might be pinned by the presence of defects [6], raising concerns about the feasibility of DWM-based memory.

Magnetic skyrmions have recently emerged as a promising alternative for future memories [7]–[10]. They can be observed in non-centrosymmetric bulk magnetic materials or ultra-thin magnetic systems with broken inversion symmetry and large spin-orbit coupling. The state of a magnetic skyrmion can be explained by the presence of the Dzyaloshinskii-Moriya Interaction (DMI) [11], [12]. The DMI between two atomic spins $\mathbf{S}_1$ and $\mathbf{S}_2$ with a neighboring atom can be expressed as $H_{DM} = -\mathbf{D}_{1,2} \cdot (\mathbf{S}_1 \times \mathbf{S}_2)$, where $\mathbf{D}_{1,2}$ is the Dzyaloshinskii-Moriya (DM) vector [7], [8], [13]–[16]. Magnetic skyrmions have been shown to possess several benefits over domain wall motion based racetrack memory in terms of stability and density, and they are less limited by material imperfections. Specifically, topological properties prevent the motion of skyrmions from being pinned at defect sites in a magnetic layer, and thus skyrmions are more robust information carriers.

Magnetic skyrmions (MS) can be stored as multiple bits in a long nanotrack to realize highly dense memory. Ref. [17] first demonstrated the use of magnetic skyrmions to realize on-chip caches. That work proposed the use of a shift-based write mechanism [18] for the creation of skyrmions. However, such an approach is considered to be applicable to domain wall motion based devices. No experimental (or simulation) results to date have demonstrated the creation of skyrmions using the shift-based mechanism. In our work, a magnetic skyrmion is written (or nucleated) by injecting a local spin-polarized current in the nanotrack, whereas the read operation is performed by sensing the change in resistance arising from the presence (or absence) of a skyrmion at a specific location in the nanotrack. In order to read or write a bit stored in the nanotrack, a variable number of shift operations is required depending on the location of the bit relative to the read/write port. The noticeably high density and non-volatility offered by MS-based memory are key positives for last-level on-chip cache applications.

We explore the use of magnetic skyrmions as last-level on-chip caches in general purpose processors. We propose a multi-port skyrmion-based cell and evaluate its potential

in realizing an on-chip memory array. Despite possessing a number of beneficial attributes such as high stability, non-volatility, high density¹, and low leakage, magnetic skyrmions pose certain challenges: (i) the current density required for skyrmion nucleation [19] is substantially higher, necessitating large access transistors for writing a skyrmion, which in turn limits the density benefits; (ii) the variable access latency arising from packing multiple bits in a single nanotrack leads to energy and performance overheads; (iii) the motion of skyrmions drifts away from the direction of electron flow owing to the Magnus force [20], so an idle operation time is needed to relax the skyrmions back to the center region of the nanotrack, which leads to additional shift latency; (iv) skyrmions might suffer annihilation at the edges due to the large drive current density required for high-speed operation. To address these challenges, we perform a design-space exploration for the multi-bit skyrmion cell while considering the peripheral circuits required to perform these operations. We also perform layout to estimate the density benefits of the proposed multi-bit cell. To keep skyrmions enclosed in the nanotrack under high current injection, it is essential to analyze the various design choices possible and their impacts on system energy and performance. We developed a device-circuit-architecture framework to understand these design points for the proposed multi-bit cell.

¹ The sizes of skyrmions and the spacing between them can be potentially shrunk down to the nanometer scale.

The key contributions of this work are as follows:
• We explore the feasibility of last-level cache design for general purpose processors with magnetic skyrmion-based memory.
• We propose a magnetic skyrmion-based multi-bit cell and utilize suitable circuit and architecture optimizations that mitigate the unique challenges posed by the skyrmion structure.
• We develop a systematic device-to-architecture co-design framework and perform an in-depth analysis of the density benefits, along with the energy and performance trade-offs associated with the proposed skyrmion-based cache. Our experiments on the PARSEC benchmark suite [21] demonstrate 2.41× improvement in cache energy with 2% average degradation in cache performance over an iso-area traditional SRAM-based L2 cache.

The rest of this paper is organized as follows. Section II presents the fundamentals of the skyrmion-based device and bit-cell. The design and optimization of a multi-bit skyrmion cell is described in Section III. Section IV describes the memory array organization. Section V presents the experimental methodology, and the results are presented in Section VI. Finally, we conclude the paper in Section VII.

1941-0069 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

II. SKYRMION-BASED MEMORY

Fig. 1 shows the proposed MS-based device structure in which skyrmions are stored in a ferromagnetic nanotrack adjacent to an SHM. To realize a bit-cell using this structure, we need to perform three different operations: (i) a write operation, (ii) a shift and an idle operation, and (iii) a read operation. In the following paragraphs we describe these operations in detail along with the peripheral circuits required to perform them.

Fig. 1: Schematic of MS-based device and bit-cell. The proposed device structure can perform read/write/shift operations. A skyrmion can be nucleated in the nanotrack (yellow layer) by injecting a spin-polarized current through the left MTJ. The motion of skyrmions can be driven by vertical injection of a spin current generated from a charge current flowing through the Spin-Hall Metal (SHM) layer (blue layer). The reference MTJ is used to form a voltage divider on the read port, and the presence of a skyrmion can be detected by sensing the voltage at the output of the inverter.

Nucleation of a skyrmion (Write operation). A skyrmion is nucleated in the nanotrack by injecting a local spin-polarized current through the MTJ on the left (write MTJ). This is performed by charging the bitline (BL) to VWRITE, driving the sourceline (SL) to ground (GND), and turning ON the write access transistors by driving the write wordlines (WWL) to VDD. Nucleating a skyrmion requires that the injected spin-polarized current exceed a certain threshold Jth [19]. We exploit the spin-polarized current generated from the electrical current through a 20 nm-diameter write MTJ to create a skyrmion. The proposed device structure consists of a 0.4 nm-thick ferromagnetic nanotrack adjacent to a 3 nm-thick SHM. The material parameters used in our simulations correspond to Co/Pt multilayers [19], and are shown in Table IV. In our simulation, a stable skyrmion can be nucleated in a 60 nm nanotrack by injecting a spin-polarized current through the write MTJ for 50 ps with a current density of 6.8 × 10¹² A/m². Note that the presence of DMI necessitates a high current density for nucleation.

Motion of skyrmions (Shift operation). Skyrmions are packed as multiple bits in a long nanotrack. Hence, in order to access a specific bit stored in a long nanotrack, the corresponding skyrmion needs to be placed underneath the write (read) port via shift operations. A shift operation is accomplished by connecting the shift wordlines (SWL) to VDD, and precharging BL and SL to the appropriate voltages. The motion of skyrmions can be controlled by an in-plane spin-polarized current flowing through the nanotrack directly, or by a vertical injection of a spin-polarized current perpendicular to the plane (CPP), which is obtained by injecting a charge current through the SHM layer. We choose the CPP method in our proposed device

structure: since a skyrmion undergoes a larger Slonczewski in-plane torque instead of a smaller field-like out-of-plane torque, higher velocities can be obtained with lower current densities. The motion of skyrmions can be well explained by the Thiele equation [20],

$$\mathbf{G} \times \mathbf{v}_d - D\alpha\,\mathbf{v}_d + \mathbf{j}_{spin} = 0 \tag{1}$$

where $\mathbf{j}_{spin}$ represents the vertical spin current generated from the charge current flowing through the SHM (in blue) underlayer. The longitudinal and the transverse velocity can be written as

$$v_d^x = \frac{\alpha D}{G^2 + \alpha^2 D^2}\, j_{spin}, \qquad v_d^y = \frac{G}{G^2 + \alpha^2 D^2}\, j_{spin} \tag{2}$$

Hence, for $G \neq 0$, the motion of skyrmions deviates from the intended direction. The transverse motion of a skyrmion stops at a certain distance from the edge owing to the skyrmion-edge interaction. The final displacement with respect to the edge decreases as the SHM current density increases. Skyrmions are annihilated if the applied charge current density is larger than a certain value (Jani), which is a function of the operation time. Fig. 2(a) shows that the critical annihilation current density can be significantly increased by reducing the shift operation time. Moreover, a high energy barrier is induced on the boundaries by adhering high-K materials at the edges, allowing skyrmions to be well confined in the nanotrack with larger current injection [22], [23]. Fig. 2(b) compares the critical annihilation current density (Jani) for three different high-K materials (FePt, Nd2Fe14B, SmCo5) with the edge width ranging from 1 nm to 5 nm for a 1 ns shift duration. The corresponding material parameters, adopted from [19], [24], are shown in Table I. Utilizing high-K materials at the edges makes switching the spin direction much harder when a skyrmion approaches the edge due to the Magnus force, thereby keeping the skyrmion in the nanotrack. The velocity of skyrmions, which increases with increased current density, is therefore enhanced during shifts, achieving faster shift operations. Note that the induced energy barrier from the high-K materials is known to depend on the width of the adhered high-K material and on the material properties [22]. Note also that the skyrmion Hall effect could be addressed by using notches, pinning effects, or ratchet geometries. However, these effects are not considered in this work. Interested readers are referred to Refs. [25]–[30] for more detail.

TABLE I: Comparison of high-K materials used in the present simulations

Material                          FePt    Nd2Fe14B    SmCo5
Anisotropy constant (MJ/m³)       2.0     4.3         17.1
Exchange constant (pJ/m)          8       7.7         12
Saturation magnetization (MA/m)   1.1     1.28        0.84
DMI strength (mJ/m²)              0.1     0.1         0.1

Fig. 2: (a) Critical annihilation current density versus shift operation time. (b) Critical annihilation current density in a 1 ns shift operation for various high-K materials with different widths. The annihilation current density is less sensitive to the width of the high-K materials with higher anisotropy constant.

Detection of the presence of a skyrmion (Read operation). Electrical detection of skyrmions at room temperature through the magnetoresistance effect has been proposed and recently demonstrated in experiments [31], [32]. In this work, we use this mechanism to perform a read operation. Specifically, we introduce a read port that includes a read MTJ, a reference MTJ, and two access transistors. A read operation is performed by connecting the read wordlines (RWL) to VDD, and driving BL to Vread and SL to GND. Here, the ferromagnetic nanotrack at the read region serves as the free layer of the read MTJ, and the resistance of the read MTJ is denoted as Rsk (Rap) with the presence (absence) of a skyrmion under the read MTJ. The voltage divider, consisting of the reference MTJ with resistance Rap in series with the read MTJ, drives the output of the inverter high in the presence of a skyrmion, and vice versa. Note that the trip-point of the inverter is selected between the maximum voltage (with the absence of a skyrmion, high resistance state) and the minimum voltage (with the presence of a skyrmion, low resistance state) at node "A". However, since the average magnetization of a skyrmion is not fully parallel (mz = −1) to the fixed layer (mz = −1), the resistance change of the read MTJ is lower here compared with a full parallel-to-antiparallel resistance switching of an MTJ. To achieve sufficient resistance change for the read operation, we use an MTJ of diameter 20 nm and ∼200% magnetoresistance ratio. We also match the size of the skyrmion to the size of the read MTJ to ensure that the region captured by the read MTJ is closer to mz = −1 (anti-parallel to the fixed layer), which in turn leads to a higher magnetoresistance change. Table II compares the voltage swing (ΔV) at node "A" in Fig. 1 between the presence and the absence of a skyrmion under a read current of ∼1.25 × 10⁻⁵ A, obtained by pulling up the BL voltage to Vread (0.8 V) and driving SL to GND. As shown in Table II, increasing the width of the nanotrack increases the skyrmion dimension (the region captured by the read MTJ is closer to mz = −1), which in turn leads to a greater voltage swing as a higher magnetoresistance change can be achieved. However, this also increases the reliable spacing required between consecutive skyrmions to free them from the repulsive force between neighboring skyrmions [33]. Moreover, since the fixed layer of the read MTJ is located at the center region of the nanotrack, any deviation between the position of a skyrmion and a read port degrades the resistance change between the absence and presence of a skyrmion. Therefore, a read operation to a specific skyrmion bit requires an idle

operation after each shift operation, i.e., the total number of idle operations is equal to the number of shift operations required for reading a skyrmion bit. This operation relaxes the skyrmions back to the center region through edge repulsion. We achieve this by turning all access transistors OFF, which stabilizes the magnetization of the nanotrack.

TABLE II: Read voltage swing (ΔV) under a read current of ∼1.25 × 10⁻⁵ A between the presence and the absence of a skyrmion for different nanotrack widths and the corresponding skyrmion radius. The reliable spacing between consecutive skyrmions is also compared.

Width (nm)   Radius (nm)   ΔV (V)   Reliable spacing (nm)
60           13.5          0.125    ∼74
70           15.5          0.133    ∼77
80           18.5          0.137    ∼81

TABLE III: Bias voltage conditions for various operations

Operation     RWL   WWL   SWL   BL       SL
Read          VDD   0     0     VREAD    0
Shift Left    0     0     VDD   0        VSHIFT
Shift Right   0     0     VDD   VSHIFT   0
Write         0     VDD   0     VWRITE   0
Clear         0     VDD   0     0        VWRITE
Idle          0     0     0     0        0
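As a rough numerical illustration of Eqs. (1) and (2), the following Python sketch evaluates the longitudinal and transverse velocity components and the resulting deflection angle, then checks that they satisfy Eq. (1). It assumes that G points along z (so that G × v_d = G(−v_y, v_x)) and that j_spin is applied along x. The function names and the dimensionless input values are illustrative assumptions, not quantities taken from this work.

```python
import math

def thiele_velocity(G, alpha, D, j_spin):
    """Return (v_x, v_y) from Eq. (2) for a drive j_spin along x."""
    denom = G**2 + (alpha * D)**2
    v_x = alpha * D / denom * j_spin   # longitudinal component
    v_y = G / denom * j_spin           # transverse (Magnus) component
    return v_x, v_y

def thiele_residual(G, alpha, D, j_spin, v_x, v_y):
    """Components of G x v - D*alpha*v + j, assuming G along z and j along x."""
    r_x = -G * v_y - D * alpha * v_x + j_spin
    r_y = G * v_x - D * alpha * v_y
    return r_x, r_y

# Illustrative, dimensionless inputs (assumed values, not from the paper).
G, alpha, D, j_spin = 1.0, 0.1, 1.5, 2.0
v_x, v_y = thiele_velocity(G, alpha, D, j_spin)
r_x, r_y = thiele_residual(G, alpha, D, j_spin, v_x, v_y)

# Deflection of the trajectory away from the drive direction, in degrees.
hall_angle_deg = math.degrees(math.atan2(v_y, v_x))
```

Under these assumptions the deflection angle depends only on the ratio G/(αD), so a small damping constant implies a large transverse drift, which is consistent with the need for the idle (relaxation) operation described above.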

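The rule that every shift is followed by an idle operation before a read can be combined with the bias conditions of Table III into a simple operation schedule. The sketch below is only an illustration: the dictionary transcribes Table III, while the function name `read_sequence` and the symbolic voltage labels as strings are assumptions introduced here, not part of the proposed controller.

```python
# Bias conditions transcribed from Table III, ordered as (RWL, WWL, SWL, BL, SL).
LINES = ("RWL", "WWL", "SWL", "BL", "SL")
BIAS = {
    "Read":        ("VDD", "0", "0", "VREAD", "0"),
    "Shift Left":  ("0", "0", "VDD", "0", "VSHIFT"),
    "Shift Right": ("0", "0", "VDD", "VSHIFT", "0"),
    "Write":       ("0", "VDD", "0", "VWRITE", "0"),
    "Clear":       ("0", "VDD", "0", "0", "VWRITE"),
    "Idle":        ("0", "0", "0", "0", "0"),
}

def read_sequence(n_shifts, direction="Shift Right"):
    """Hypothetical helper: list the operations (with line biases) for one read.

    Each shift is followed by one idle operation, then the read is issued.
    """
    if n_shifts < 0:
        raise ValueError("n_shifts must be non-negative")
    ops = []
    for _ in range(n_shifts):
        ops += [direction, "Idle"]
    ops.append("Read")
    return [(op, dict(zip(LINES, BIAS[op]))) for op in ops]

# Example: a bit that is two positions away from the read port.
seq = read_sequence(2)
op_names = [op for op, _ in seq]
```

For a bit two positions from the read port, the schedule contains two shift/idle pairs followed by a single read, so the number of idle operations equals the number of shifts.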
III. MULTI-BIT SKYRMION CELL DESIGN

Fig. 3 shows the logical representation of data stored along the nanotrack. Depending on the existence of a skyrmion, different logic values can be stored along the nanotrack as multiple bits. The presence of a skyrmion represents logic "1", while its absence represents logic "0". A current injected into the SHM (blue layer) from the right can shift skyrmions to the right-hand side of the nanotrack, and vice versa. The logical views of a multi-bit cell with a single write/read port and a cell with a single write port and multiple read ports are shown in Fig. 3(a) and Fig. 3(b), respectively. Note that the read ports can be placed at any location along the nanotrack; however, the write port is placed at the end of the long nanotrack to keep the write operation simple.

Fig. 3: Logical view of a multi-bit MS-based cell with (a) single write/read port or (b) single write, multiple read ports. A sequence of bits is stored in the nanotrack.

Consider Fig. 3(a) as an example. It shows a write port at address "0x0" and a read MTJ at address "0x7", with a sequence of 0's and 1's stored in the cell. In the first write cycle, "0" is written into address "0x0", and subsequently shifted right to the next address "0x1". "1" is written into address "0x0" during the next write cycle, and then the data in the nanotrack is again shifted to the right. By repeatedly writing data into address "0x0" and subsequently right shifting all stored data to the next address, a sequence of bits can be written to the nanotrack. To read the stored data at, say, address "0x5", the bit is shifted right by two positions to reach the location under the read MTJ. Similarly, to write data at a specific address, we first shift the bit to the position where the write port is located. Before writing new data into the address, the previously stored data is cleared by injecting a current with spin polarization in the opposite direction to the magnetization of the skyrmion center. To prevent stored data in the nanotrack from overflowing during shift operations, we extend the nanotrack with extra data bits (light yellow part in Fig. 3). In the worst-case scenario for this example, to access the stored data at address "0x0", the bit is required to be shifted right by seven positions. Thus, seven extra bits are required to avoid the loss of stored data from address "0x1" to "0x7". The write/read latency is dependent on the location where a bit is stored. However, the average read latency can be alleviated by introducing multiple read ports, as shown in Fig. 3(b). The current location of the read port is referred to as the current port status. In order to access a bit from a multi-bit cell, a shift controller determines the appropriate read port and calculates the number of shift operations required by comparing the input address bits with the current port status. This also results in a reduction of the number of extra bits required to avoid data loss. Table III lists the bias voltage conditions for write/shift/read/clear/idle operations.

Density of the skyrmion-based multi-bit MS cell. Fig. 4 shows the layout of an 8/16/32-bit MS cell with a single write/read port. As discussed in Section II, the current requirement for the write operation is considerably higher than that for the read and shift operations. Hence, as shown in Fig. 4(a), for an 8-bit MS cell with a single write/read port, the cell area is dominated by the peripheral write transistors since the dimensions of the write transistors are much larger than the nanotrack. Note that the length of the nanotrack is determined by the number of stored bits and the read ports. The total length of the nanotrack can be reduced by having multiple read/write ports, as fewer extra bits are required to prevent the stored data from being destroyed during shift operations (light yellow part in Fig. 3). For the 8-bit MS cell with a single read port, the write transistors dominate the total cell area. Thus, the density (i.e., cell area per bit) of the 8-bit MS cell does not improve further by introducing more read ports (as presented in Fig. 5). On the other hand, for the 16/32-bit MS cell, since the nanotrack dominates the cell area with one read port, the density can be improved by packing more bits within a smaller area. Hence, as shown in Fig. 5, at the 45nm technology node (F), for a 16-bit cell with one/two read ports, the CMOS transistors require 135.59 F²/bit and 112.67 F²/bit,

respectively. Fig. 5 compares the cell size of the proposed MS cells, i.e., 8/16-bit MS cells with one write port and 32-bit MS cells with either a single write port or two write ports, while varying the number of read ports. We also show the cell size of SRAM (triangle) and 1T-1R STT-MRAM (star) on the figure for reference. The total area of a multi-bit MS cell is determined by the number of read and write transistors, as well as the length of the nanotrack. For an 8-bit MS cell, the cell size is dominated by the write transistors when the total number of read ports is less than 3. Although having more than 3 read ports shortens the nanotrack because fewer extra bits are required, the area of the read peripheral transistors inevitably increases too. Similarly, for the 16-bit MS cell with fewer than 5 read ports, the bit-cell area is mainly determined by the nanotrack itself. In the case of the 32-bit MS cell, the multi-bit cell area is further dominated by the nanotrack; therefore, having an extra write port helps reduce the 15 extra bits required, thereby improving the density (i.e., cell area per bit).

Fig. 4: Layout of an 8/16/32-bit MS cell with single write/read port.

Fig. 5: Bit-cell area comparison for different multi-bit designs at the 45nm technology node (F). [Plot: cell size (F²/bit) versus number of read ports (0 to 8) for 8-bit, 16-bit, 32-bit (1 write port), and 32-bit (2 write ports) MS cells, with SRAM and STT-MRAM reference markers; annotated values 144 and 46.]

IV. ARRAY ORGANIZATION

Fig. 6 shows the memory array organization with the proposed multi-bit cell. The wordlines for performing read, write and shift operations (i.e., RWLs, WWLs, SWLs) are shared among all the multi-bit cells placed in a row. The BL and the SL are shared among all the multi-bit cells placed in a column. In this architecture, multiple words can be placed on the same row and accessed independently. The address decoder is used to select a multi-bit skyrmion cell in the array, with the shift control logic selecting the appropriate word. Note that the sense amplifier shared across the entire column detects the output signal as logic '0' or '1'.

Fig. 6: The memory array organization of skyrmion-based multi-bit cells.

Skyrmion-based cache design. To evaluate the benefits of the proposed memory array at the application level, we integrate it as a last-level cache in the memory hierarchy of a general purpose processor. Towards this end, we follow the DWM-based hybrid cache organization presented in TapeCache [2], i.e., the tag array is designed with SRAM to avoid variable access latency during performance-critical tag lookup operations, and the data array is realized using the proposed multi-bit skyrmion array. The data array is further composed of randomly addressable clusters, each of which stores multiple cache blocks. We assume a bit-interleaved mapping of the cache blocks in each cluster, such that a given cache block can be accessed in parallel after performing an appropriate number of shift operations on all the nanotracks within a cluster. The addressing policy and the cache management policies are also assumed to be similar to those of TapeCache.

V. EXPERIMENTAL METHODOLOGY

In this section, we briefly describe the simulation framework and the experimental setup used to evaluate our proposal.

Simulation Framework. Micromagnetic simulations of the skyrmion device are performed using the tool Mumax3 [34],

[35]. The magnetization dynamics of magnetic skyrmions driven by vertical current can be expressed by

$$\tau = \frac{\gamma}{1+\alpha^2}\left(\mathbf{m} \times \mathbf{H}_{eff} + \alpha\,(\mathbf{m} \times (\mathbf{m} \times \mathbf{H}_{eff}))\right) + \tau_{SL}$$

$$\tau_{SL} = \beta\,\frac{\epsilon - \alpha\epsilon'}{1+\alpha^2}\,(\mathbf{m} \times (\mathbf{m}_p \times \mathbf{m})) - \beta\,\frac{\epsilon' - \alpha\epsilon}{1+\alpha^2}\,\mathbf{m} \times \mathbf{m}_p$$

$$\beta = \frac{j_z \hbar}{M_{sat}\, e\, d}, \qquad \epsilon = \frac{P \Lambda^2}{(\Lambda^2 + 1) + (\Lambda^2 - 1)(\mathbf{m} \cdot \mathbf{m}_p)} \tag{3}$$

where $\mathbf{m}$ is the normalized magnetization vector, $\mathbf{m}_p$ is the fixed layer polarization, $\gamma$ is the Gilbert gyromagnetic ratio, $\alpha$ is the Gilbert damping parameter, $\mathbf{H}_{eff}$ is the effective field, $j_z$ is the current density along the z axis, $\hbar$ is the reduced Planck constant, $M_{sat}$ is the saturation magnetization, $e$ is the elementary charge, $d$ is the skyrmion layer thickness, $P$ is the polarization of the conduction electrons, the Slonczewski $\Lambda$ parameter characterizes the spacer layer, and $\epsilon'$ is the secondary spin transfer term. The material parameters used in our simulations correspond to Co/Pt multilayers [36], and are shown in Table IV. We consider a 0.4 nm thick Co nanotrack with perpendicular magnetic anisotropy on a 3 nm Pt substrate inducing DMI. The sample is discretized into elements of size 1 × 1 × 0.4 nm³. Non-equilibrium Green's Function (NEGF) based spin transport simulation is used to obtain the resistance of the MTJ [37]. The charge current ($I_e$) flowing through the SHM and the corresponding spin current ($I_s$) are related by [38]

$$I_s = \theta_{sh}\,\frac{A_{MTJ}}{A_{SHM}}\,I_e \tag{4}$$

where $A_{MTJ}$ and $A_{SHM}$ are the cross-sectional areas of the MTJ and SHM, respectively, and $\theta_{sh}$ is the spin-Hall angle. The spin current from Eq. (4) is used to analyze the magnetization dynamics with the generalized LLGS equation. Magnetization dynamics simulations are performed using the Mumax3 platform [34], [35].

TABLE IV: Material parameters used for simulation

Parameter                                               Value
Saturation magnetization (Msat)                         580 kA/m
PMA anisotropy constant (Ku)                            0.8 MJ/m³
Exchange constant (A)                                   15 pJ/m
Dzyaloshinskii-Moriya interaction (DMI) strength (D)    3 mJ/m²
Gilbert damping constant (α)                            0.1
Spin polarization (P)                                   0.4
Spin Hall angle (θsh)                                   0.07
Nanotrack width and thickness                           60 nm × 0.4 nm
SHM width and thickness                                 60 nm × 3 nm
MTJ diameter                                            20 nm

System-level Evaluation Framework. The device parameters obtained with the proposed simulation framework are used as technology parameters in a modified version of CACTI [39] to evaluate the read/write characteristics of the skyrmion-based cache. CACTI is an integrated tool that is commonly used by computer architects for modeling dynamic power, access latency, area, and leakage power of caches. It takes as inputs the cache parameters (e.g., capacity, block size, associativity), the number of read/write ports in the cache, and bitcell-level technology parameters to produce the array-level characteristics mentioned above. These array-level characteristics are then reflected in GEM5 [40], a cycle-accurate architectural simulator that models a wide range of Instruction Set Architectures (ISAs) along with a detailed and flexible memory system. Specifically, we model the skyrmion cache architecture in GEM5 to evaluate the proposed design as an L2 cache. In our experiments, we perform an iso-area replacement of the L2 cache and compare the energy and performance of the proposed design with that of SRAM-based and STT-MRAM-based caches. All the memory technologies considered in the evaluation are based on a 45nm technology node². The CMOS baseline system configuration used in our analysis is shown in Table V. We perform a full-system simulation for 1 billion instructions in the regions of interest for caches across a suite of multi-threaded benchmarks from PARSEC [21], a benchmark suite typically used for studies of chip-multiprocessors.

² We used a commercial 45nm technology that was readily available to us, rather than predictive technology models, for our simulations. We expect the energy to further scale by ∼0.15× from the 45nm technology node to 15nm, a state-of-the-art technology node [41].

TABLE V: System configuration

Processor core      Alpha, out-of-order processor, 4 cores at 2 GHz
L1 I/D-cache        16KB per core, 2-way set associative, 64B line size
L2 unified cache    2MB shared, 16-way set associative, 64B line size
Cache latency       L1 cache: 2-cycle, L2 cache: 9-cycle

VI. RESULTS AND DISCUSSION

A. Device and circuit level results

As discussed in Section III, shift operations are involved in both write and read operations. However, during shift operations, the trajectory of skyrmions in the nanotrack bends away from the center as a result of the Magnus force. Thus, an idle operation is required to relax skyrmions back to the center region through edge repulsion after every shift operation. Fig. 7 compares the required relaxation time and the longitudinal shift distance for various operation times under currents of 1.44 × 10⁻⁵ A and 5.76 × 10⁻⁵ A, respectively, without and with high-K materials adhered at both edges. High-K materials are adhered to prevent skyrmions from annihilation under a current of 5.76 × 10⁻⁵ A. The longitudinal velocity is proportional to the injection current density, and the required relaxation time is related to the transverse shift distance, which increases with increasing drive current or operation time. Since skyrmions in the nanotrack stop at a certain distance from the edge owing to the skyrmion-edge interaction, the required relaxation time no longer changes after 1.2 ns and 0.8 ns of operation time under currents of 1.44 × 10⁻⁵ A and 5.76 × 10⁻⁵ A, respectively. With the aid of adhered high-K materials, skyrmions can be operated under a higher current density, and thus a higher transverse velocity can be reached. However, a longer relaxation time is also required, which increases the shift latency. Since the reliable spacing in our case is ∼74 nm, a current of 1.44 × 10⁻⁵ A for 1 ns or 5.76 × 10⁻⁵ A for 0.2 ns is required during the shift operation, leading to a 0.9 ns and 1.3 ns relaxation time, respectively. We compare the

We compare the performance evaluation and energy consumption for 8/16/32 bit-MS with either high-K materials at the two edges (for a current of $5.76 \times 10^{-5}$ A) or no such material at the edges (for a current of $1.44 \times 10^{-5}$ A).

Fig. 7: Comparison of relaxation time and final position of skyrmion under a current of $1.44 \times 10^{-5}$ A and $5.76 \times 10^{-5}$ A, subjected to the nanotrack without (solid) and with (dashed) high-K, respectively. High-K materials are adhered under the current of $5.76 \times 10^{-5}$ A to prevent skyrmions from annihilating at the edges.

B. System-level results

In this section we present the array-level analysis of the proposed MS-based cache design and then evaluate the impact on system performance and cache energy.

Array-level results. Fig. 8 compares the different energy/latency components of the proposed MS-based array with an STT-MRAM and SRAM array. The MS-based array is realized using an 8-bit multi-bit cell. As shown in the figure, the write energy for the MS-based array is 1.78× higher than STT-MRAM, and 5.5× higher than SRAM. This is because of the high current requirements for skyrmion nucleation. The write latency for the MS-based array is slightly (4%) lower than STT-MRAM but 2.4× higher than SRAM. On the other hand, the read energy (and latency) is identical for both STT-MRAM and the proposed MS-based array due to similar read mechanisms, and 1.1× higher than SRAM. Furthermore, we observe the read latency is identical for the MS-based array, STT-MRAM, and SRAM. Apart from the read and write energies, the MS array also consumes shift energy during read/write operations. The shift energy is a function of the length of the nanotrack, the number of read/write ports in the bit-cell, and the high-K material on the edges of the nanotrack. In our evaluations, the shift energy per operation was found to be $9.51 \times 10^{-4}$ pJ and $8.68 \times 10^{-5}$ pJ for the 8-bit based memory array with and without high-K material on the nanotrack edges, respectively.

Fig. 8: Array-level comparison of read and write characteristics with iso-area SRAM and STT-MRAM. [Bar chart of normalized values of read energy, write energy, read latency, and write latency for SRAM, STT-MRAM, and MS.]

Performance evaluation. Fig. 9 compares the IPC (instructions per cycle) for six different skyrmion-based cache configurations with SRAM and STT-MRAM caches. We consider eight different L2 cache designs under iso-area conditions: (i) a 2MB SRAM cache with 1 read/write port, (ii) an 8MB STT-MRAM cache with 1 read/write port, (iii) two 2MB MS-based cache designs with 3 read ports and 1 write port, storing 8 bits in the nanotrack, with and without high-K material at the two edges (8bit-MS high-K and 8bit-MS nohigh-K), (iv) two 4MB MS-based cache designs with 3 read ports and 1 write port, storing 16 bits in the nanotrack, either having high-K material on the edges or no high-K material (16bit-MS high-K and 16bit-MS nohigh-K), and (v) two 4MB MS-based cache designs with 8 read ports and 2 write ports, storing 32 bits in the nanotrack with either a high-K material at the edges or no such material (32bit-MS high-K and 32bit-MS nohigh-K). The IPC is normalized to the 2MB SRAM-based cache design. Across all benchmarks, the 8bit-MS high-K design leads to an average degradation of 2.0% and 4.3% in performance compared to the SRAM and STT-MRAM designs. This degradation is primarily due to two factors: (a) reduced cache capacity (iso-capacity w.r.t. SRAM and 0.25× capacity w.r.t. STT-MRAM), and (b) shift overhead arising from the memory structure. In contrast, for the 8bit-MS nohigh-K design, the system performance further reduces by 3.3% and 5.6% compared to the SRAM and the STT-MRAM cache as a result of additional shift latency incurred for each cache access in the absence of high-K material at the edges. On the other hand, the two 16-bit MS configurations (16bit-MS high-K and 16bit-MS nohigh-K) degrade the performance by 0.1% and 2.7% compared to the SRAM cache on average, respectively. Furthermore, we observe a 2.4% and 5.0% reduction in performance for the two designs when compared with the STT-MRAM cache design. The smaller performance degradation over the SRAM and STT-MRAM cache is attributed to the 2× higher cache capacity offered by the 16-bit configuration. For the 32bit-MS designs, the performance improves by 0.4% with the high-K cache design, and degrades by 0.6% for the design with no high-K material, over the SRAM cache.

The improvement for the 32bit-MS high-K design is mainly because of a reduced number of shift operations performed on average with the higher number of read and write ports in the 32-bit design. Note that the performance reduces by 0.6% and 3.0%, respectively, over the STT-MRAM cache design, since the overall cache capacity does not increase with the 32bit-MS design as discussed earlier.

Fig. 9: L2 cache performance comparison across different memory technologies. [Bar chart of normalized IPC for SRAM, 8bit-MS high-K, 8bit-MS nohigh-K, 16bit-MS high-K, 16bit-MS nohigh-K, 32bit-MS high-K, 32bit-MS nohigh-K, and STT-MRAM across the benchmarks blacksch, bodytrack, canneal, dedup, facesim, ferret, fluidanim, freqmine, streamclust, swaptions, vips, x264, and rtview, plus the geomean.]

Energy comparison. Fig. 10 illustrates the L2 cache energy consumed by the proposed cache designs compared to the iso-area SRAM and STT-MRAM caches. The cache energy is normalized to the energy consumed by the STT-MRAM design. On average, we observe a 2.41× and 2.45× reduction in cache energy for the 8bit-MS high-K and 8bit-MS nohigh-K designs over the SRAM cache. This is due to the reduced leakage energy consumption with non-volatile magnetic skyrmions. The energy benefits are slightly higher for the 8bit-MS nohigh-K design because of lower shift energy consumed with no high-K material on the nanotrack edges. For the 16-bit designs, the energy benefits were found to be 2.37× and 2.41× for the 16bit-MS high-K and 16bit-MS nohigh-K designs, respectively. The energy benefits are moderately lower for the 16-bit configurations than for the 8-bit designs due to the higher energy consumed by the shift operations. Note that, for a subset of benchmarks (canneal, ferret, streamclust and vips), the 16-bit configurations have a lower energy compared to the 8-bit designs. This is because the 16-bit designs incur fewer capacity misses, which eventually leads to lower write energy. In these benchmarks, the benefits in write energy outweigh the increase in shift energy, thereby leading to improved cache energy. The energy benefits over SRAM reduce to 2.27× and 2.31× with the 32bit-MS high-K and 32bit-MS nohigh-K designs, respectively. The benefits in energy are lower than for the other two designs (8-bit and 16-bit configurations) since the resistance offered by the nanotrack increases, which in turn increases the energy consumed for each shift operation.

Fig. 10: Energy trends across different memory technologies. [Bar chart of normalized energy for the same eight cache designs and benchmarks as Fig. 9; bars that exceed the plotted range are labeled 3.8, 2.7, 3.0, 3.4, 24.7, 2.8, 3.3, 2.2, 3.9, and 3.1.]

In contrast, the energy consumed by the MS-based cache designs is higher than the baseline iso-area STT-MRAM cache in all cases. Specifically, the 8-bit designs consume 1.29× and 1.27×, and the 16-bit designs 1.30× and 1.28×, the energy of the STT-MRAM cache. Similarly, the 32-bit designs consume 1.37× and 1.34× the energy of the STT-MRAM cache. This increase in energy is because of the additional shift energy overheads and the reduced cache capacity arising from the larger write transistor requirements for the multi-bit MS cell.

In summary, our results show that skyrmion-based caches offer small improvements in performance with substantial energy reduction over an iso-area SRAM-based cache. They also point to key avenues for improvement in skyrmion-based memory: the high nucleation energy for skyrmions leads to large write transistors, curtailing density benefits, while the latency due to shift operations limits performance.

VII. CONCLUSION

In this work, we explored magnetic skyrmions to design last-level caches. We propose a multi-bit skyrmion-based cell design that packs multiple bits in a nanotrack. Since the size and spacing of skyrmions can be down to nanometer scale, the skyrmion-based nanotrack has the potential to provide significant density benefits compared to other memory technologies. However, the high current requirement for skyrmion nucleation is a bottleneck to achieving significant density benefits. We analyzed different device tuning and design tradeoffs associated with the proposed bit-cell and evaluated the area, performance and energy benefits while accounting for the peripheral circuit requirements. We designed a device-circuit-architecture framework to evaluate the system-level benefits of the proposed design. Our experiments reveal considerable benefits over an iso-area SRAM cache. However, the energy efficiency and performance are lower than those of an iso-area STT-MRAM cache, suggesting the need for mechanisms to lower the current density requirements for skyrmion nucleation.

ACKNOWLEDGMENT

This work is funded in part by the Center for Spintronics, funded by SRC and MARCO, by the National Science Foundation, and by the Vannevar Bush Faculty Fellowship.

REFERENCES

[1] S. Parkin, M. Hayashi, and L. Thomas, “Magnetic domain-wall racetrack memory,” Science, vol. 320, no. 5873, pp. 190–194, 2008.
[2] R. Venkatesan, V. Kozhikkottu, C. Augustine, A. Raychowdhury, K. Roy, and A. Raghunathan, “Tapecache: a high density, energy efficient cache based on domain wall memory,” in Proceedings of the 2012 ACM/IEEE international symposium on Low power electronics and design. ACM, 2012, pp. 185–190.
[3] Z. Sun, W. Wu, and H.H. Li, “Cross-layer racetrack memory design for ultra high density and low power consumption,” in Proceedings of the 50th Annual Design Automation Conference. ACM, 2013, p. 53.
[4] A. Ranjan, S.G. Ramasubramanian, R. Venkatesan, V. Pai, K. Roy, and A. Raghunathan, “Dyrectape: A dynamically reconfigurable cache using domain wall memory tapes,” in Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition. EDA Consortium, 2015, pp. 181–186.

[5] R. Venkatesan, S.G. Ramasubramanian, S. Venkataramani, K. Roy, and A. Raghunathan, “Stag: Spintronic-tape architecture for gpgpu cache hierarchies,” in Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on. IEEE, 2014, pp. 253–264.
[6] A. Thiaville, Y. Nakatani, J. Miltat, and Y. Suzuki, “Micromagnetic understanding of current-driven domain wall motion in patterned nanowires,” EPL (Europhysics Letters), vol. 69, no. 6, pp. 990, 2005.
[7] A. Fert, V. Cros, and J. Sampaio, “Skyrmions on the track,” Nature nanotechnology, vol. 8, no. 3, pp. 152, 2013.
[8] N. Nagaosa and Y. Tokura, “Topological properties and dynamics of magnetic skyrmions,” Nature nanotechnology, vol. 8, no. 12, pp. 899, 2013.
[9] R. Wiesendanger, “Nanoscale magnetic skyrmions in metallic films and multilayers: a new twist for spintronics,” Nature Reviews Materials, vol. 1, no. 7, pp. 16044, 2016.
[10] R. Tomasello, E. Martinez, R. Zivieri, L. Torres, M. Carpentieri, and G. Finocchio, “A strategy for the design of skyrmion racetrack memories,” Scientific reports, vol. 4, pp. 6784, 2014.
[11] I. Dzyaloshinsky, “A thermodynamic theory of weak ferromagnetism of antiferromagnetics,” Journal of Physics and Chemistry of Solids, vol. 4, no. 4, pp. 241–255, 1958.
[12] T. Moriya, “New mechanism of anisotropic superexchange interaction,” Physical Review Letters, vol. 4, no. 5, pp. 228, 1960.
[13] S. Mühlbauer, B. Binz, F. Jonietz, C. Pfleiderer, A. Rosch, A. Neubauer, R. Georgii, and P. Böni, “Skyrmion lattice in a chiral magnet,” Science, vol. 323, no. 5916, pp. 915–919, 2009.
[14] X.Z. Yu, Y. Onose, N. Kanazawa, J.H. Park, J.H. Han, Y. Matsui, N. Nagaosa, and Y. Tokura, “Real-space observation of a two-dimensional skyrmion crystal,” Nature, vol. 465, no. 7300, pp. 901, 2010.
[15] X.Z. Yu, N. Kanazawa, Y. Onose, K. Kimoto, W.Z. Zhang, S. Ishiwata, Y. Matsui, and Y. Tokura, “Near room-temperature formation of a skyrmion crystal in thin-films of the helimagnet fege,” Nature materials, vol. 10, no. 2, pp. 106, 2011.
[16] S. Heinze, K. Von Bergmann, M. Menzel, J. Brede, A. Kubetzka, R. Wiesendanger, G. Bihlmayer, and S. Blügel, “Spontaneous atomic-scale magnetic skyrmion lattice in two dimensions,” Nature Physics, vol. 7, no. 9, pp. 713, 2011.
[17] F. Chen, Z. Li, W. Kang, W. Zhao, H. Li, and Y. Chen, “Process variation aware data management for magnetic skyrmions racetrack memory,” in Design Automation Conference (ASP-DAC), 2018 23rd Asia and South Pacific. IEEE, 2018, pp. 221–226.
[18] R. Venkatesan, M. Sharad, K. Roy, and A. Raghunathan, “Dwm-tapestri-an energy efficient all-spin cache using domain wall shift based writes,” in Proceedings of the Conference on Design, Automation and Test in Europe. EDA Consortium, 2013, pp. 1825–1830.
[19] J. Sampaio, V. Cros, S. Rohart, A. Thiaville, and A. Fert, “Nucleation, stability and current-induced motion of isolated magnetic skyrmions in nanostructures,” Nature nanotechnology, vol. 8, no. 11, pp. 839, 2013.
[20] A.A. Thiele, “Steady-state motion of magnetic domains,” Physical Review Letters, vol. 30, no. 6, pp. 230, 1973.
[21] C. Bienia, S. Kumar, J.P. Singh, and K. Li, “The PARSEC Benchmark Suite: Characterization and Architectural Implications,” in Proc. PACT, 2008, pp. 72–81.
[22] P. Lai, G.P. Zhao, H. Tang, N. Ran, S.Q. Wu, J. Xia, X. Zhang, and Y. Zhou, “An improved racetrack structure for transporting a skyrmion,” Scientific Reports, vol. 7, pp. 45330, 2017.
[23] H.T. Fook, W.L. Gan, I. Purnama, and W.S. Lew, “Mitigation of magnus force in current-induced skyrmion dynamics,” IEEE Transactions on Magnetics, vol. 51, no. 11, pp. 1–4, 2015.
[24] G. Zhao, X.F. Zhang, and F. Morvan, “Theory for the coercivity and its mechanisms in nanostructured permanent magnetic materials,” Reviews in Nanoscience and Nanotechnology, vol. 4, no. 1, pp. 1–25, 2015.
[25] C. Reichhardt, D. Ray, and C.J. Olson Reichhardt, “Quantized transport for a skyrmion moving on a two-dimensional periodic substrate,” Physical Review B, vol. 91, no. 10, pp. 104426, 2015.
[26] C. Navau, N. Del-Valle, and A. Sanchez, “Interaction of isolated skyrmions with point and linear defects,” Journal of Magnetism and Magnetic Materials, 2018.
[27] D. Stosic, T.B. Ludermir, and M.V. Milošević, “Pinning of magnetic skyrmions in a monolayer co film on pt (111): Theoretical characterization and exemplified utilization,” Physical Review B, vol. 96, no. 21, pp. 214403, 2017.
[28] X. Ma, C.J. Olson Reichhardt, and C. Reichhardt, “Reversible vector ratchets for skyrmion systems,” Physical Review B, vol. 95, no. 10, pp. 104401, 2017.
[29] C. Reichhardt, D. Ray, and C.J. Olson Reichhardt, “Magnus-induced ratchet effects for skyrmions interacting with asymmetric substrates,” New Journal of Physics, vol. 17, no. 7, pp. 073034, 2015.
[30] C. Reichhardt and C.J. Olson Reichhardt, “Noise fluctuations and drive dependence of the skyrmion hall effect in disordered systems,” New Journal of Physics, vol. 18, no. 9, pp. 095005, 2016.
[31] C. Hanneken, F. Otte, A. Kubetzka, B. Dupé, N. Romming, K. Von Bergmann, R. Wiesendanger, and S. Heinze, “Electrical detection of magnetic skyrmions by tunnelling non-collinear magnetoresistance,” Nature nanotechnology, vol. 10, no. 12, pp. 1039, 2015.
[32] D. Maccariello, W. Legrand, N. Reyren, K. Garcia, K. Bouzehouane, S. Collin, V. Cros, and A. Fert, “Electrical detection of single magnetic skyrmions in metallic multilayers at room temperature,” Nature nanotechnology, vol. 13, no. 3, pp. 233, 2018.
[33] X. Zhang, G.P. Zhao, H. Fangohr, J.P. Liu, W.X. Xia, J. Xia, and F.J. Morvan, “Skyrmion-skyrmion and skyrmion-edge repulsions in skyrmion-based racetrack memory,” Scientific reports, vol. 5, pp. 7643, 2015.
[34] M. Najafi, B. Krüger, S. Bohlens, M. Franchin, H. Fangohr, A. Vanhaverbeke, R. Allenspach, M. Bolte, U. Merkt, D. Pfannkuche, et al., “Proposal for a standard problem for micromagnetic simulations including spin-transfer torque,” Journal of Applied Physics, vol. 105, no. 11, pp. 113914, 2009.
[35] A. Vansteenkiste, J. Leliaert, M. Dvornik, M. Helsen, F. Garcia-Sanchez, and B. Van Waeyenberge, “The design and verification of mumax3,” AIP advances, vol. 4, no. 10, pp. 107133, 2014.
[36] P.J. Metaxas, J.P. Jamet, A. Mougin, M. Cormier, J. Ferré, V. Baltz, B. Rodmacq, B. Dieny, and R.L. Stamps, “Creep and flow regimes of magnetic domain-wall motion in ultrathin pt/co/pt films with perpendicular anisotropy,” Physical review letters, vol. 99, no. 21, pp. 217208, 2007.
[37] X. Fong, S.K. Gupta, N.N. Mojumder, S.H. Choday, C. Augustine, and K. Roy, “Knack: A hybrid spin-charge mixed-mode simulator for evaluating different genres of spin-transfer torque mram bit-cells,” in Simulation of Semiconductor Processes and Devices (SISPAD), 2011 International Conference on. IEEE, 2011, pp. 51–54.
[38] L. Liu, T. Moriyama, D.C. Ralph, and R.A. Buhrman, “Spin-torque ferromagnetic resonance induced by the spin hall effect,” Physical review letters, vol. 106, no. 3, pp. 036601, 2011.
[39] “CACTI, www.hpl.hp.com/research/cacti.”
[40] N. Binkert, B. Beckmann, G. Black, S.K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D.R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M.D. Hill, and D.A. Wood, “The gem5 simulator,” SIGARCH Computer Arch. News, vol. 39, no. 2, pp. 1–7, Aug. 2011.
[41] R. Perricone, I. Ahmed, Z. Liang, M.G. Mankalale, X.S. Hu, C.H. Kim, M. Niemier, S.S. Sapatnekar, and J.-P. Wang, “Advanced spintronic memory and logic for non-volatile processors,” in Proceedings of the Conference on Design, Automation & Test in Europe. European Design and Automation Association, 2017, pp. 972–977.

Mei-Chin Chen received the B.S. degree in electrophysics from National Chiao Tung University, Taiwan, in 2012. She was a Research Assistant with the Department of Electronics Engineering, National Chiao Tung University from 2012 to 2013. She is currently pursuing the Ph.D. degree in electrical and computer engineering with Purdue University, West Lafayette, IN, USA.
She is currently a Research Assistant with the Nanoelectronics Research Laboratory, Purdue University. Her current research interests include simulation of spin devices.

Ashish Ranjan received the B.Tech. degree in electronics engineering from the Indian Institute of Technology (BHU), Varanasi, India, in 2009. He received his PhD degree in Electrical and Computer Engineering from Purdue University, West Lafayette, IN, USA. He is currently a Research Staff Member at IBM T. J. Watson Research Center, Yorktown Heights, NY.
His prior industry experience includes three years as a senior member technical staff in the Design Creation Division, Mentor Graphics Corporation, Noida, India. His primary research interests include circuit-architecture co-design for [...], domain-specific accelerators, and approximate computing. He was awarded the University Gold Medal for his academic performance by IIT (BHU), Varanasi in 2009. He also received the Andrews Fellowship from Purdue University in 2012.

Anand Raghunathan is a Professor of Electrical and Computer Engineering and Chair of the VLSI area at Purdue University, where he directs research in the Integrated Systems Laboratory. His current areas of research include domain-specific architecture, system-on-chip design, computing with post-CMOS devices, and heterogeneous parallel computing. Previously, he was a Senior Research Staff Member at NEC Laboratories America, where he led projects on system-on-chip architecture and design methodology. He has also held the Gopalakrishnan Visiting Chair in the Department of Computer Science and Engineering at the Indian Institute of Technology, Madras.
Prof. Raghunathan has co-authored a book, eight book chapters, and over 200 refereed journal and conference papers, and holds 21 U.S. patents. His publications received eight best paper awards and five best paper nominations. He received a Patent of the Year Award and two Technology Commercialization Awards from NEC, and was chosen among the MIT TR35 (top 35 innovators under 35 years across various disciplines of science and technology) in 2006.
Prof. Raghunathan has been a member of the technical program and organizing committees of several leading conferences and workshops, chaired premier IEEE/ACM conferences (CASES, ISLPED, VTS, and VLSI Design), and served on the editorial boards of various IEEE and ACM journals in his areas of interest. He received the IEEE Meritorious Service Award and Outstanding Service Award. He is a Fellow of the IEEE and Golden Core Member of the IEEE Computer Society.
Prof. Raghunathan received the B. Tech. degree from the Indian Institute of Technology, Madras, and the M.A. and Ph.D. degrees from Princeton University.

Kaushik Roy received the B.Tech. degree in electronics and electrical communications engineering from the Indian Institute of Technology, Kharagpur, India, and the Ph.D. degree from the electrical and computer engineering department of the University of Illinois at Urbana-Champaign in 1990. He was with the Semiconductor Process and Design Center of Texas Instruments, Dallas, where he worked on FPGA architecture development and low-power circuit design. He joined the electrical and computer engineering faculty at Purdue University, West Lafayette, IN, in 1993, where he is currently Edward G. Tiedemann Jr. Distinguished Professor. He is also the director of the center for brain-inspired computing (C-BRIC) funded by SRC/DARPA. His research interests include neuromorphic and emerging computing models, neuro-mimetic devices, spintronics, device-circuit-algorithm co-design for nano-scale Silicon and non-Silicon technologies, and low-power electronics. Dr. Roy has published more than 700 papers in refereed journals and conferences, holds 18 patents, supervised 75 PhD dissertations, and is co-author of two books on Low Power CMOS VLSI Design (John Wiley & McGraw Hill).
Dr. Roy received the National Science Foundation Career Development Award in 1995, IBM faculty partnership award, ATT/Lucent Foundation award, 2005 SRC Technical Excellence Award, SRC Inventors Award, Purdue College of Engineering Research Excellence Award, Humboldt Research Award in 2010, 2010 IEEE Circuits and Systems Society Technical Achievement Award (Charles Desoer Award), Distinguished Alumnus Award from Indian Institute of Technology (IIT), Kharagpur, Fulbright-Nehru Distinguished Chair, DoD Vannevar Bush Faculty Fellow (2014-2019), Semiconductor Research Corporation Aristotle award in 2015, and best paper awards at 1997 International Test Conference, IEEE 2000 International Symposium on Quality of IC Design, 2003 IEEE Latin American Test Workshop, 2003 IEEE Nano, 2004 IEEE International Conference on Computer Design, 2006 IEEE/ACM International Symposium on Low Power Electronics & Design, and 2005 IEEE Circuits and Systems Society Outstanding Young Author Award (Chris Kim), 2006 IEEE Transactions on VLSI Systems best paper award, 2012 ACM/IEEE International Symposium on Low Power Electronics and Design best paper award, 2013 IEEE Transactions on VLSI Best paper award. Dr. Roy was a Purdue University Faculty Scholar (1998-2003). He was a Research Visionary Board Member of Motorola Labs (2002) and held the M. Gandhi Distinguished Visiting faculty at Indian Institute of Technology (Bombay) and Global Foundries visiting Chair at National University of Singapore. He has been on the editorial board of IEEE Design and Test, IEEE Transactions on Circuits and Systems, IEEE Transactions on VLSI Systems, and IEEE Transactions on Electron Devices. He was Guest Editor for Special Issue on Low-Power VLSI in the IEEE Design and Test (1994) and IEEE Transactions on VLSI Systems (June 2000), IEE Proceedings – Computers and Digital Techniques (July 2002), and IEEE Journal on Emerging and Selected Topics in Circuits and Systems (2011). Dr. Roy is a fellow of IEEE.