
Non-Volatile Memories: Concepts and Research Challenges

Lizandro Oliveira1, Adenauer Yamin1 and Mauricio Pilla1 Universidade Federal de Pelotas - UFPel Pelotas, Brasil {lsoliveira, adenauer, pilla}@inf.ufpel.edu.br

Abstract - Memory systems have been identified as a key factor in performance and power consumption. Many techniques have been developed to reduce power and energy consumption while improving performance, or at least mitigating its degradation. Non-volatile memories offer several advantages over traditional DRAM and SRAM memories, such as non-volatility, lower power consumption, and higher density. This work reviews and discusses emerging memory technologies, focusing on non-volatile memories, and points out research challenges.

Index Terms - Non-volatile memory (NVM), scratchpad memory (SPM), main memory.

I. INTRODUCTION

U.S. data centers consumed about 70 billion kWh in 2014, representing 2% of the country's total energy consumption [1]. In this scenario, the importance of memory hierarchies has increased with advances in processor performance [2]. Computer architects must therefore try to close the processor-memory gap. This gap becomes more significant because the memory system is one of the main factors in performance and energy consumption.

Temporary and permanent data storage is required in any functional information processing system, and this need has so far been fulfilled by CMOS-based memories, i.e., SRAM, DRAM, and Flash memory. The speed gap between logic and memory has become a critical system performance bottleneck, i.e., the "memory wall" [3].

Various techniques use non-volatile memories (NVMs) as an alternative because of their advantages over traditional DRAM and SRAM memories. Although the emerging memory market is still smaller than the traditional ones, it is expected to grow by 2021, reaching rates of around 110% per year, when these new technologies will be used in many products [4].

Although high-performance computing is still an important driver for semiconductor technology innovation, consumer electronics is shifting toward mobile, pervasive connectivity, and data-centric applications. The changing market trend imposes different requirements on hardware, e.g., ultra-low power computing, high-density and low-cost storage, novel functions, etc. Emerging NVMs with high performance, good scalability, and new functionalities may become important technology enablers to fulfill these requirements [3].

This paper provides a survey of NVMs and their characteristics. The paper is organized as follows. Section II presents background on NVMs, their characteristics, and comparisons among different NVMs. Section III presents related works on NVMs, which we also classify. Section IV discusses research challenges. Finally, Section V presents our conclusions.

II. NON-VOLATILE MEMORIES

The specification of nonvolatile memory is based on the floating gate configuration, which is the feature of an erased gate put into many cells to facilitate erasure [5].

NVMs, including Phase Change Memory (PCM) and Magnetic RAM (MRAM), have many advantages over DRAM when applied as main memory for embedded systems, due to attractive characteristics such as power economy, low cost, high density, non-volatility, and shock resistance [6].

However, NVMs have two drawbacks that need to be overcome when they are used as main memory [6]: write operations take longer than read operations, and each NVM cell supports only a limited number of write operations.

NVMs are classified according to their functional properties with respect to programming and erasing operations [5]. Fig. 1 shows the classification flow chart for semiconductor memories [5].

There are mainly five types of NVM technology: flash memory, FeRAM, MRAM, PCM, and RRAM [5]. In this Section, we present the characteristics and limitations of emerging NVMs.

A. Magnetoresistive Random-Access Memory (MRAM)

An MRAM is based on memory cells having two magnetic elements, one with a fixed magnetic polarity and the other with a switchable polarity [5]. These magnetic elements are positioned on top of each other but separated by a thin insulating tunnel barrier, as shown in the cell structure in Fig. 2. MRAM can be directly coupled with processors and used as both volatile and non-volatile storage. This characteristic differentiates MRAM from resistive RAM and PCM, which are limited by slow write speed and inadequate endurance and are pursued primarily for storage applications [7].

B. Spin-Transfer Torque Magnetic RAM (STT-RAM)

An STT-RAM memory uses the magnetic tunnel junction (MTJ) as the storage element, which reduces the energy required to write the cell. STT-RAM is a form of MRAM that uses spin-transfer torque to reorient the free layer by passing a large, directional write current through the MTJ [8].

These non-volatile devices can be built directly in the CMOS circuit. Because the logic circuit and the memory are more closely connected than in a conventional logic circuit with SRAM, MTJ-based non-volatile logic achieves not only a drastic reduction in area and I/O power but also improved data transfer speeds [9].

Although STT-RAM shows a lower density than PCM and RRAM, and higher latency and energy for write operations than SRAM, this technology is widely used in cache design due to its high endurance. Huai [10] presents a review of the progress in intrinsic switching current density reduction and STT-MRAM prototype chip demonstration.

Smullen et al. [8] reduced retention time by decreasing the planar area of the cell, thereby reducing the write current. The retention time of an MTJ is primarily determined by the thermal stability of its free layer. Jog et al. [11] decreased the thickness of the free layer and lowered the saturation magnetization to obtain a lower thermal barrier.

According to Meena et al. [5], for STT-RAM to be adopted as a universal mainstream semiconductor memory, one key challenge must be resolved: the simultaneous achievement of low switching current and high thermal stability. It must be dense (approximately 10 F²), fast (read and write times below 10 ns), and operate at low power [12]. The basic cell structure of STT-RAM is depicted in Fig. 3.

C. Ferroelectric RAM (FeRAM)

This early emerging nonvolatile memory uses ferroelectric capacitors as the nonvolatile device, combined with a CMOS circuit [9], as shown in Fig. 4. However, the technology behind FeRAM is difficult to scale down, because it is not straightforward to process the ferroelectric materials and the electrode materials without a reaction between them. As a result, the most suitable applications for FeRAM are somewhat limited, namely smart IC cards and small-scale microcontrollers with a small storage capacity [9].

FeRAM maintains data without any external power supply by using a ferroelectric material in place of a conventional dielectric material between the plates of the capacitor. One disadvantage of FeRAM is that the read cycle is destructive [5]. FeRAMs are expected to have many applications in small consumer devices such as personal digital assistants (PDAs), smartphones, power meters, and smart cards, and in security systems. Although FeRAM has achieved a level of commercial success, current FeRAM chips offer performance that is comparable to or better than current Flash memories, but still slower than DRAMs [5].
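The link between thermal stability and retention time noted for STT-RAM in Section II-B can be made concrete with a small calculation. The sketch below assumes the standard Néel-Arrhenius relation, t = τ0 · exp(Δ), where Δ = E_b / (k_B · T) is the thermal stability factor of the free layer. The attempt period τ0 of about 1 ns and all function names are illustrative assumptions and are not taken from the surveyed papers.

```python
import math

# Assumed attempt period tau_0 (about 1 ns); an illustrative constant,
# not a value reported in the surveyed papers.
ATTEMPT_PERIOD_S = 1e-9
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def retention_seconds(delta: float, tau0: float = ATTEMPT_PERIOD_S) -> float:
    """Mean retention time for a thermal stability factor delta = E_b / (k_B * T)."""
    return tau0 * math.exp(delta)


def delta_for_retention(seconds: float, tau0: float = ATTEMPT_PERIOD_S) -> float:
    """Thermal stability factor needed to reach a target mean retention time."""
    return math.log(seconds / tau0)


if __name__ == "__main__":
    # Lowering delta (a lower thermal barrier) shortens retention exponentially,
    # which is the lever that relaxed-retention STT-RAM designs trade against
    # write current.
    for delta in (20, 30, 40, 60):
        t = retention_seconds(delta)
        print(f"delta={delta}: {t:.3e} s ({t / SECONDS_PER_YEAR:.3e} years)")
```

Because the dependence is exponential, a modest reduction of the thermal barrier moves retention from years to seconds or less, which is why relaxed-retention designs typically need some form of refresh.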

[Figure 1 node labels: Volatile; Non-Volatile; RAM; Floating Gate; Nitride; Flash; ROM & Fuse; Emerging; Others; New; DRAM; SRAM; SDRAM; Mobile DRAM; EPROM; EEPROM; NAND; NOR; Racetrack; RRAM; STT-RAM; FeRAM; MRAM; PCM; Polymer; Transparent/Flexible; Zero-RAM; Nanobridge; CMO; Q-dots; Millipede; 3-D; Molecular; Nano-Mech; CMORRAM; MTM; SPBMM.]

Fig. 1. Flow chart for the semiconductor memory classification according to their functional criteria (adapted from [5]).

Fig. 4. Basic structure of a FeRAM cell (adapted from [5]).
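The destructive read of FeRAM mentioned in Section II-C means that every read must be followed by a write-back of the sensed value. The toy model below is a minimal sketch of that behavior. The class name, the choice of 0 as the post-sensing state, and the write counter are hypothetical and serve only to illustrate the idea; they do not describe any specific device.

```python
class FeRAMCell:
    """Toy model of one FeRAM bit with a destructive read (illustrative only)."""

    def __init__(self, value: int = 0) -> None:
        self._polarization = value  # stored bit, 0 or 1
        self.writes = 0             # counts every polarization write, restores included

    def write(self, value: int) -> None:
        if value not in (0, 1):
            raise ValueError("bit must be 0 or 1")
        self._polarization = value
        self.writes += 1

    def read(self) -> int:
        # Sensing drives the capacitor to a known state (modeled here as 0),
        # which erases the stored bit.
        value = self._polarization
        self._polarization = 0
        # The controller therefore writes the sensed value back.
        self.write(value)
        return value


if __name__ == "__main__":
    cell = FeRAMCell()
    cell.write(1)
    print(cell.read(), cell.read(), cell.writes)
```

In this model each read costs one extra write, so read-heavy workloads also consume write endurance.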

Fig. 2. Basic MRAM cell structure (adapted from [5]).

Fig. 3. Basic STT-RAM cell structure (adapted from [5]).

D. Resistive RAM (ReRAM)

ReRAM is a simple, two-terminal metal-insulator-metal (MIM) bistable device, as shown in the basic configuration in Fig. 5. It can exist in two distinct conductivity states, with each state being induced by applying different voltages across the device terminals [5]. A typical unipolar switching ReRAM uses a dielectric insulator. When a sufficiently high voltage is applied, a filament is formed in the insulating dielectric. The cell may then be set or reset (leading to a low or high resistance, respectively) by appropriate voltages [13]. This memory has the potential to be a simple and cost-effective replacement for current memory technologies, such as hard disk drives, RAM, and flash memories [14].

Fig. 5. Basic RRAM cell structure [5].

Xu et al. [15] emphasized that ReRAM was considered the most promising technology, as it operates faster than PCM and has a simpler and smaller cell structure than magnetic memory (MRAM or STT-RAM). Despite these advantages, ReRAM presents some critical issues such as data retention, low endurance [16] of about 10^11 cycles [17], and high latency and high power consumption for write operations [18]. Goux et al. [19] presented the use of a stacked RRAM memory to improve some characteristics of these memories. A long-term retention time (> 10 years) and endurance (> 10^6 write/erase cycles) could be achieved [20].

According to Endoh et al. [9], there are several options for the storage principle of ReRAM, which can be categorized into two types. One is conductive bridge RAM (CBRAM), and the other is oxide RAM (OxRAM). Both CBRAM and OxRAM use a redox reaction in their operational principle [21], [22]. ReRAM may cover the storage application area occupied by NAND flash memories [23], [24].

E. Phase Change Memory (PCM)

PCM employs a material called GeSbTe (GST), an alloy of germanium, antimony, and tellurium. GST has two phases, an amorphous phase and a crystalline phase, which represent high and low electrical resistivity, respectively [25]. The basic PCM cell structure is depicted in Fig. 6.

Fig. 6. Basic PCM cell structure [5].

Compared to SRAM and eDRAM (embedded DRAM), these NVM technologies have higher density, lower standby power, and better scalability, and are non-volatile [26]. However, PCM has some problems, such as write latency, write energy, and endurance [25].

F. Polymer memory

In a polymer memory, a layer consisting of molecules and/or nanoparticles in an organic polymer matrix is sandwiched between an array of top and bottom electrodes, as illustrated in Fig. 7. Polymer memory has the advantage of a simple fabrication process and good controllability of materials [5].

Fig. 7. Structure of a polymer memory device [5].

G. Racetrack memory (RM)

An RM, also known as Domain Wall Memory (DWM), can achieve ultra-high storage density, fast access speed, and non-volatility. Previous research has demonstrated that RM has the potential to serve as on-chip cache or main memory. However, RM has more flexibility and difficulty in the design space of main memory because it has more device-level design parameters [27].

In an RM, information is stored on a U-shaped nanowire as a pattern of magnetic regions with different polarities. The U-shaped magnetic nanowire is an array of keys, which are arranged vertically like trees in a forest, as shown in Fig. 8.

Fig. 8. Racetrack memory diagram showing an array of U-shaped magnetic nanowires [5].

A DWM is a spin-based memory technology in which several bits of data are densely packed into the domains of a ferromagnetic wire. The ferromagnetic wire can have multiple domains separated by domain walls. Each domain can be separately programmed to a certain magnetization direction, and therefore stores a single bit. Hence, a DWM macro-cell is capable of storing multiple bits of data [28].

H. Comparison of memory technologies

The static leakage power of SRAM has increased with CMOS scaling and has become a major part of the power consumed in various semiconductor chips [29]. Emerging memories such as STT-RAM and PCM provide lower leakage power and higher density than traditional SRAM memories.

According to Banakar et al. [30], on-chip SRAM caches account for 25% to 45% of the total chip power. They proposed an SPM as an alternative to cache. The results showed an average energy reduction of 40% and an average area-time reduction of 46%. Furthermore, SPM can provide better timing predictability, which is especially desirable in hard real-time embedded systems [31].

Unlike charge-based memories such as DRAM, NVMs store data in the form of a change in physical state. Since a write to NVM involves changing its physical state, a write operation consumes more time and energy than a read operation, leading to read-write asymmetry [32]. Similarly, the write latency and energy of a logical 1 → 0 transition are higher than those of a 0 → 1 transition, leading to 0/1 write asymmetry [33].

STT-RAM is widely used in last-level memories [34], [35], while PCM is a DRAM alternative [36].

New STT-RAM models have been proposed to reduce write operation issues [37], [38].

Xu et al. [39] proposed a methodology to design STT-RAM for different optimization goals, such as read/write performance and write energy, by leveraging the trade-off between the write current and the write time of the MTJ. The simulation results indicate that, by utilizing PMTJ, the optimized STT-RAM can compete against SRAM and DRAM as a universal memory replacement in low-power embedded systems.

The property common to all NVMs is that their write latency and energy are significantly higher than those of read operations. Under normal conditions, they also retain data for several years without the need for any standby power [40].

Tables I and II show various characteristics of memory technologies, while a proposal to classify previous works that use some type of NVM is shown in Table III.

TABLE I
CHARACTERISTICS OF DIFFERENT MEMORY TECHNOLOGIES [41], [42] AND [43]

Technology       | eDRAM     | SRAM      | ReRAM         | PCM              | STT-RAM     | DWM
Cell size (F²)   | 20-100    | 50-200    | 4-10          | 4-12             | 6-50        | 1-50
Speed (R/W)      | Fast      | Very fast | Fast/slow     | Fast/very slow   | Fast/slow   | Fast
Energy (R/W)     | Medium    | Low       | Medium/high   | Medium/very high | Medium/high | Medium
Leakage power    | Low       | High      | ∼0            | ∼0               | ∼0          | ∼0
Retention period | 30-100 µs | N/A       | N/A           | N/A              | N/A         | N/A
Endurance        | > 10^15   | > 10^15   | 10^8 to 10^12 | 10^8 to 10^9     | > 10^15     | > 10^15
Maturity         | Product   | Product   | Prototype     | Prototype        | Product     | Prototype
Main challenge   | Refresh   | Leakage   | Writes        | Writes           | Writes      | Shifts

TABLE II
APPROXIMATE DEVICE-LEVEL PROPERTIES OF MEMORY TECHNOLOGIES - ADAPTED FROM MITTAL AND VETTER [40], AND CARGNINI ET AL. [44]

Technology | Minimal Cell Size (F²) | Endurance (Cycles) | Read Latency | Write Latency | Erase Latency | Access Granularity | Standby Power
HDD        | N/A  | 10^15        | 5 ms    | 5 ms    | -    | 512 B | 1 W
SLC Flash  | 4-6  | 10^4 to 10^5 | 25 µs   | 500 µs  | 2 ms | 4 KB  | 0
DRAM       | 4-10 | 10^15        | 50 ns   | 50 ns   | -    | 64 B  | Refresh power
NAND       | 4    | 10^4         | 10^4 ns | 10^6 ns | -    | -     | -
NOR        | 10   | 10^5         | 15 ns   | 10^3 ns | -    | -     | -
SRAM       | 150  | -            | 2 ns    | 2 ns    | -    | -     | -
FeRAM      | 22   | 10^12        | 40 ns   | 65 ns   | -    | -     | -
ReRAM      | 30   | 10^5         | 100 ns  | 100 ns  | -    | 64 B  | 0
PCM        | 4    | 10^12        | 12 ns   | 100 ns  | -    | 64 B  | 0
STT-MRAM   | 20   | 10^16        | 5 ns    | 5-30 ns | -    | 64 B  | 0
pSTT-MRAM  | -    | -            | 3 ns    | 3 ns    | -    | -     | -
TAS-MRAM   | -    | 10^12        | 30 ns   | 30 ns   | -    | -     | -

TABLE III
A CLASSIFICATION OF TECHNIQUES

Classification          | References
SPM                     | [45], [46], [47], [48], [49]
Cache                   | [50], [51], [52], [53], [54], [55], [56], [57], [58]
Main memory             | [58], [59], [60], [61], [62], [63], [64], [65], [66], [67], [68]
Energy efficiency       | [45], [46], [47], [48], [52], [53], [55], [56], [57], [69], [70], [71], [72]
Performance improvement | [45], [46], [48], [56], [57]
STT-RAM                 | [45], [48], [49], [52], [53], [54], [55], [56], [57], [58]
PCM                     | [46], [60], [61], [62], [63], [64], [65], [66]
MRAM                    | [47]
Z-RAM                   | [47]
Proposed algorithms     | [45], [46], [47], [48], [49], [52], [55], [56]

III. RELATED WORKS

NVMs have been investigated by several works at different levels of the memory hierarchy. Mittal et al. [40], [41] present several techniques to exploit the advantages and mitigate the disadvantages of emerging NVMs. However, the authors do not address any use of these memories in SPMs. In this Section, we present works that improve the usage of NVMs in main memories, SPMs, and caches. These works were chosen from the research domain.

A. Main Memory

Zhao et al. [58] proposed a persistent memory system with performance very close to that of a native system. They propose Kiln, a persistent memory design that adopts a nonvolatile cache (NV cache) and a nonvolatile main memory, forming a multi-version persistent memory hierarchy that enables atomic in-place updates without logging or copy-on-write. Experiments were conducted using McSim [73], a Pin-based multi- and many-core cycle-accurate simulation infrastructure. The authors note that the persistence interface of most existing software applications is optimized for accesses to disk-based storage devices. According to the authors, no public benchmark suite available at the time could be used to evaluate the Kiln design, so they constructed a set of benchmarks for this purpose. Kiln provides persistence support with only a 9% performance overhead relative to the native system, and up to a 2× performance improvement over log-based NVRAM persistent memory.

Kim et al. [59], in order to adopt a new memory technology, discussed a new file system based on tmpfs with three aspects that need to be included, and evaluated its performance against popular file systems such as ext4. In this work, they evaluated a file system built mainly on non-volatile main memory. To evaluate the performance of an in-memory file system on NVM, they ran sysbench on three different file system environments. They measured the performance of ext4 on two different devices, a block device and a RamDisk. Then, they compared the read/write latencies of tmpfs against ext4 on RamDisk.

Moraru et al. [60] introduced techniques for robust wear-aware memory allocation, prevention of erroneous writes, and consistency-preserving updates that are cache-efficient. The authors show that these techniques can be implemented efficiently and are effective.

Gao et al. [61] proposed hardware support that clusters failed lines at one end of a memory region to reduce fragmentation and improve performance under failures. Experimental results showed that hardware and software cooperation can greatly extend the life of wearable memories.

Kannan et al. [62] explored the use of node-local NVMs to provide frequent, low-overhead checkpoints. By adapting existing multi-level checkpoint techniques, the authors devised new methods, termed NVM-checkpoints, that efficiently store checkpoints on both local and remote NVMs. The pre-copy method can reduce peak interconnect usage by up to 46%. Since the approach treats NVM as memory rather than as a RAM disk, pre-copying can be generalized to move data directly to remote NVMs.

Jung et al. [63] proposed Memorage, a system architecture that virtually manages all available physical resources for memory and storage in an integrated manner. The performance of the studied memory-intensive multiprogramming workloads was improved by up to 40.5%, with an average of 16.7%.

Several other works take advantage of new memory technologies to replace DRAM as main memory and improve lifetime [64], performance [65], or energy and performance [66], [67], and other works turn their attention to energy optimization [68].

B. Scratchpad Memory (SPM)

SPMs that employ NVMs are investigated in works [45] to [49], which compare them with traditional memory technologies.

Wang et al. [45] explore and evaluate a low-power SPM architecture consisting of STT-RAM. They also conducted a sensitivity analysis of various area ratios of a hybrid SPM architecture (SRAM + STT-RAM). In this hybrid architecture, they use a dynamic data allocation scheme that allocates the most-written data to SRAM and the most-read data to STT-RAM optimally in a region. Experimental results were obtained on a simulation platform built upon Simics [74] with GEMS [75]. The authors used Udayakumaran's algorithm [76] to manage the data allocation for the pure SRAM SPM. Experimental results show that the hybrid SPM with a 2:1 (SRAM:STT-RAM) area ratio achieves the best performance and energy-delay product, while the pure STT-RAM configuration outperforms the others in energy.

Hu et al. [46] propose a novel hybrid SPM consisting of SRAM and NVM to take advantage of the ultralow leakage power and high density of the latter. They also propose a dynamic data management algorithm and compare Udayakumaran's algorithm with the proposed algorithms. According to the results, the hybrid SPM architecture could reduce the memory access time by 18.17%, the dynamic energy by 24.29%, and the leakage power by 37.34% compared with a baseline pure SRAM SPM of the same area.

In the cited works [45], [46], the authors used NVSim [18] to estimate the read/write latencies and energy consumption for a given size of PCM, SRAM, and STT-RAM. In these works, the benchmarks were selected from MiBench [77].

Qiu et al. [47] propose a heterogeneous SPM architecture configured with SRAM, MRAM, and Z-RAM (Zero-capacitor RAM) for multicore embedded systems. The authors propose two algorithms, a dynamic programming algorithm (MDPDA) and a genetic algorithm (AGADA), to allocate data to different memory banks, thereby reducing memory access cost in terms of power consumption and latency. They evaluate their algorithms across a host of benchmarks selected from PARSEC [78]. They ran these workloads on the M5 simulator [79] to obtain the memory traces. The memory parameters were obtained with a modified version of CACTI [80]. Experimental results show that the MDPDA algorithm can reduce the dynamic power consumption of a quad-core system by 24.18% compared to the greedy algorithm (Udayakumaran). From other results, it can be observed that the reduction in energy consumption is proportional to the reduction in memory access latency. The AGADA algorithm consumes, on average, 2.21% more dynamic power than MDPDA. However, the AGADA algorithm is more competitive in overall performance.

Rodríguez et al. [48] explore the use of an STT-RAM-based SPM. They propose implementing on-chip SPMs using STT-RAMs with relaxed volatility as a means to take further advantage of the area and energy characteristics of this technology. An experimental evaluation shows that the proposed design offers potential energy savings of over 60% when compared with different iso-area on-chip memory designs. For the memory design, they used the STeTSiMS [81] simulation and modeling system. Their data memory architecture consists of three components: a cache memory, an SPM, and the main memory. The cache and the SPM are located on-chip; the main memory can be assumed to be off-chip DRAM. Experiments were performed using the GEM5 simulator [82] in System-call Emulation (SE) mode. The test applications are computational kernels extracted from the SPEC CPU2006 benchmarks [83], the Mantevo benchmarks [84], the Mediabench suite [85], and the PARSEC benchmarks. The experimental evaluation shows that the proposed system has significant benefits for embedded applications. It provides savings of 63% in dynamic power consumption compared to a nonvolatile STT-RAM-based SPM, and executes up to 28.5% faster, saving 53% of the energy consumed by a system featuring an iso-area hardware-managed SRAM cache.

Wang et al. [49] used an SPM based on STT-RAM in real-time embedded systems. They present algorithms that allocate data variables to the SPM and distribute write activity evenly in the SPM address space, to achieve wear-leveling and prolong the lifetime of the NVM. In this work, two optimization algorithms are used: one is an optimal algorithm based on ILP, and the other is an efficient heuristic algorithm that can obtain close-to-optimal solutions. The memory system parameters are obtained from Monazzah et al. [86]. Since STT-RAM can be made much denser than SRAM, they compare target systems with a 2 KB STT-RAM-based SPM versus a 1 KB SRAM-based SPM. The benchmark programs were selected from Mälardalen WCET [87]. The Chronos WCET tool [88] was used to obtain each program's WCET and the number of read/write memory accesses of each variable from the memory access trace. Experimental results show that the total CPU utilization increases with each program's WCET, ranging from 0.22 for a 1-year lifetime to 0.38 for a 32-year lifetime.

C. Cache

STT-RAM is a promising candidate for SRAM replacement because of its excellent features. However, wide adoption of STT-RAM as cache memory is impeded by its long write latency and high write power.

Li et al. [50] proposed a Cache-Coherence-Enabled Adaptive Refresh (CCear) to minimize the number of refresh operations for volatile STT-RAM adopted as the LLC for CMP systems. A set of PARSEC applications with different intensities of read/write operations was simulated, and CCear performed close to an ideal refresh policy with low overhead.

Li et al. [51] developed two compilation-based approaches, MDL and MCL, to improve the energy efficiency and performance of STT-RAM-based hybrid caches by reducing the migration overheads. The first approach reduces migrations by rearranging the data layout, and the second reduces migrations by locking migration-intensive memory blocks into the SRAM part of the hybrid cache. The proposed compilation technique is implemented on LLVM [89]. The benchmarks were selected from the LLVM test suites. Experimental results show that, by combining these methods, on average, the number of write operations on STT-RAM is reduced by 17.6%, the number of migrations is reduced by 38.9%, the total dynamic energy is reduced by 15.6%, and the total access latency is reduced by 13.8%.

Qiu et al. [52] proposed a loop retiming framework applied during compilation to reduce the migration overhead by changing the interleaved memory access pattern. This work adopts the migration policy in [90]. The benchmarks are selected from Livermore [91], BLITZ++ [92], and DSP programs. A Pin-based [93] simulator of an STT-RAM-based hybrid cache with the migration policy was designed for evaluation. With the proposed loop retiming technique, the interleaved memory accesses can be significantly reduced, so that the migration overhead is mitigated and the energy efficiency of the hybrid cache is significantly improved. The experimental results have shown that, with the proposed methods, the number of migrations is reduced by up to 27.1% and the cache dynamic energy is reduced by up to 14.0%.

Komalan et al. [53] presented a modification of the traditional MSHR (Miss Status Handling Register) organization for the effective exploitation of STT-MRAM-based L1 I-caches. This work uses a modified version of Kroft's MSHR organization [94], in which a small intermediate buffer is used to hold instructions arriving from L2 before they actually update the IL1. This version is called enhanced MSHR (EMSHR). For simulations, the authors included a wide range of representative benchmarks from the DSPstone, MediaBench, and SPEC CPU2006 benchmark suites. To evaluate the effectiveness of the proposed modifications, they used the GEM5 simulator [82] in SE mode. According to the simulations, appropriate tuning of selected architecture parameters can reduce the performance penalty introduced by the NVM (∼45%) to extremely tolerable levels (∼1%) and show energy gains of up to 35%. Furthermore, when the modified NVM-based system is configured to occupy an area comparable to the original SRAM-based configuration, it outperforms the SRAM baseline and leads to even more energy savings.

Komalan et al. [54] explored the read penalty issues in an NVM-based L1 data cache (D-cache) for an ARM-like single-core general-purpose system. They propose a design method for the STT-MRAM-based D-cache in such a platform. This design addresses the adverse effects of the STT-MRAM read penalty by means of micro-architectural modifications along with code transformations. They use the micro-architecture simulator GEM5 and a subset of the PolyBench benchmark suite for simulations. According to the simulations, the appropriate tuning of selected architecture parameters in their proposal and suitable optimizations can reduce the performance penalty introduced by the NVM (initially ∼54%) to extremely tolerable levels (∼8%).

Li et al. [55] proposed a significant reduction in the number of refresh operations by rearranging the program data layout at compilation time. An N-refresh scheme is also proposed to further reduce the number of refreshes. They implemented a Pin-based [95] cache simulator to evaluate the proposed methods. The architecture parameters are based on the TI DM3x Video SOC [96]. The experiments evaluate the Powerstone benchmark suite [97], which contains typical benchmarks for embedded systems. The ILP-based methods are implemented using a commercial ILP solver, LINGO [98]. Experimental results show that, on average, the proposed methods can reduce the number of refresh operations by 84.2% and the dynamic energy consumption by 38.0% for volatile STT-RAM caches, while incurring only 4.1% performance degradation.

Lin et al. [56] propose a hybrid cache design that includes SRAM cache, STT-RAM cache, and STT-RAM/SRAM hybrid cache banks. They used the GEM5 simulator to simulate a 2-GHz processor with four cores. The classic memory model in GEM5 is modified to implement the hybrid L2 cache with non-uniform access times, the access-aware policies, and the dynamic cache partitioning scheme. The PARSEC 2.1 [99] benchmark suite is used as the parallel workload to evaluate the proposed scheme. The latency and energy values of the SRAM cache were obtained from CACTI, while those of the STT-RAM cache were obtained from a modified NVSim [18] and the model reported in [35], with parameters for a 65 nm technology. The experimental results show that the proposed scheme and policies can achieve an average 89-fold improvement in cache lifetime and can reduce energy consumption by 58% compared with an SRAM cache.

Ahn et al. [57] propose an STT-RAM-based hybrid cache architecture called prediction hybrid cache. Its key concept is write intensity prediction, which predicts the write intensity of cache blocks on their misses and determines block placement based on it. The proposed hybrid cache architecture is compared against Read-Write Aware Hybrid Caches (RWHCA) [100]. The hybrid caches are configured to have 4 SRAM ways and 12 STT-RAM ways. The energy and delay characteristics of the L2 cache at a 45 nm technology are modeled by CACTI 6.5 [101] and NVSim [18]. They use 16 write-intensive benchmarks from SPEC CPU2006 [81]. Experimental results show that the architecture achieves 28% (31%) energy reduction in hybrid caches, 3% (4%) energy reduction in main memories, and 1.4% (4.7%) speedup in the single-core (quad-core) system compared to the existing hybrid cache architecture.

IV. RESEARCH CHALLENGES

Emerging NVMs have been widely studied to enable more efficient, intelligent, and secure computing systems. Recent works try to minimize some limitations of NVMs by using hybrid architectures.

Zhang et al. [102] present a page replacement method based on an NVM-DRAM hybrid main memory system for low power and consistency guarantee. Choi and Park [103] propose a hybrid cache architecture containing both SRAM and NVM.

Chen [3] reviews emerging NVM technologies and evaluates them in terms of their advantages, challenges, and applications.

PCM cell size is limited mainly by access devices (or selector devices) due to the relatively large current required for switching. Bipolar junction transistors (BJTs), vertical transistors, and even diodes have been tried as access devices to reduce PCM cell size [3].

STT-RAM still faces some critical challenges. With a small on/off ratio, STT-RAM needs well-designed reading schemes and is sensitive to increasing MTJ variability with scaling. Reducing MTJ size and switching current while maintaining sufficient thermal stability (i.e., retention) requires device and material innovations, e.g., PMA, dual-MgO structure, composite free layer, etc. [3].

Tradeoffs exist among key RRAM parameters, e.g., speed-retention, power-speed, endurance-retention, etc. The major challenges of RRAM are reliability, variability, and failure mechanisms [3].

Fig. 9. Summary of industry test chips of PCM, STT-RAM, and RRAM, plotted as the total chip capacity vs. years [3].

Fig. 9 summarizes major industry test chips of PCM, STT-RAM, and RRAM reported in conferences. They are fully functional test chips with peripheral circuitry. The size of the symbols is proportional to the technology node of the CMOS process used for fabrication. The company name, array capacity, and technology node of every demonstrated chip are labeled near each symbol. Among the three NVM technologies, PCM test chips (blue symbols) were demonstrated the earliest, and vary from tens of Mb to several Gb in capacity. The capacity of STT-RAM test chips (red symbols) is typically smaller, because STT-RAM targets performance-driven applications instead of high-density data storage. RRAM (green symbols) varies over the widest range in terms of test chip capacity and technology nodes, because different companies are targeting different applications for this technology, e.g., eNVM for micro-controller units (MCUs), standalone data storage, etc. [3].

V. CONCLUSIONS

The increasing demand for data storage and processing in computer systems requires memory systems with desirable characteristics, such as low power, high density, and non-volatility. NVMs offer several advantages and challenges when compared to conventional memory technologies. Several NVMs have been studied in the literature. In this paper, we presented a survey and comparison of NVM technologies and of techniques to improve their performance and endurance. We also presented a proposal for classifying previous works on NVM, and discussed research challenges.

REFERENCES

[1] Yevgeniy Sverdlik. Here's how much energy all us data centers consume. [Online]. Available: http://www.datacenterknowledge.com/archives/2016/06/27/heres-how-much-energy-all-us-data-centers-consume, 2016.
[2] John L. Hennessy and David A. Patterson. Computer architecture: a quantitative approach. Elsevier, 2011.
[3] An Chen. A review of emerging non-volatile memory (nvm) technologies and applications. Solid-State Electronics, 125:25–38, 2016. ISSN 0038-1101. URL http://www.sciencedirect.com/science/article/pii/S0038110116300867. Extended papers selected from ESSDERC 2015.
[4] Yann de Charentenay. Storage-class memory will be the clear go-to market for emerging non-volatile memory in 2021. [Online]. Available: http://www.yole.fr, 28 July 2016.
[5] J. S. Meena, S. M. Sze, U. Chand, and T.-Y. Tseng. Overview of emerging nonvolatile memory technologies. Nanoscale research letters, 9(1):526, 2014.
[6] Y. Wang, J. Du, J. Hu, Q. Zhuge, and E. H. . Sha. Loop
[12] D. Apalkov, A. Khvalkovskiy, S. Watts, V. Nikitin, X. Tang, D. Lottis, K. Moon, X. Luo, E. Chen, A. Ong, et al. Spin-transfer torque magnetic random access memory (stt-mram). ACM Journal on Emerging Technologies in Computing Syst., 9(2):13, 2013.
[13] H. Li and Y. Chen. An overview of non-volatile memory technology and the implication for tools and architectures. In Design, Autom. & Test in Eur. Conf. & Exhibition, pages 731–736. IEEE, 2009.
[14] S. Kim. Resistive ram (reram) technology for high density memory applications. In 4th Workshop Innovative Memory Technol MINATEC; June 21-24, 2012.
[15] C. Xu, X. Dong, N. P. Jouppi, and Y. Xie. Design implications of memristor-based rram cross-point structures. In Design, Autom. & Test in Europe Conf. & Exhibition, pages 1–6. IEEE, 2011.
[16] A. Sawa. Resistive switching in transition metal oxides. Materials today, 11(6):28–36, 2008.
[17] Y.-B. Kim, S. R. Lee, D. Lee, C. B. Lee, M. Chang, J. H. Hur, M.-J. Lee, G.-S. Park, C. J. Kim, U.-I. Chung, et al. Bi-layered rram with unlimited endurance and extremely uniform switching. In Symp. on VLSI Technology, pages 52–53. IEEE, 2011.
[18] X. Dong, C. Xu, N. Jouppi, and Y. Xie. Nvsim: A circuit-level performance, energy, and area model for emerging non-volatile memory. In Emerging Memory Tech., pages 15–50. Springer, 2014.
[19] L. Goux, A. Fantini, G. Kar, Y.-Y. Chen, N. Jossart, R. Degraeve, S. Clima, B. Govoreanu, G. Lorenzo, G. Pourtois, et al. Ultralow sub-500nA operating current high-performance TiN\Al2O3\HfO2\Hf\TiN bipolar rram achieved through understanding-based stack-engineering. In Symp. on VLSI Technology, pages 159–160. IEEE, 2012.
[20] Y. J. Seo, H. M. An, H. D. Kim, and T. G. Kim. Improved performance in charge-trap-type flash memories with an Al2O3 dielectric by using bandgap engineering of charge-trapping layers. J Korean Phys Soc, 55(6):2679–2692, 2009.
[21] H.-S. P. Wong, H.-Y. Lee, S. Yu, Y.-S. Chen, Y. Wu, P.-S. Chen, B. Lee, F. T. Chen, and M.-J. Tsai. Metal–oxide rram. Proc. of the IEEE, 100(6):1951–1970, 2012.
[22] R. Waser, R. Dittmann, G. Staikov, and K. Szot.
Redox- scheduling optimization for chip-multiprocessors with non- based resistive switching memories–nanoionic mechanisms, volatile main memory. In 2012 IEEE International Confe- prospects, and challenges. Advanced materials, 21(25-26): rence on Acoustics, Speech and Signal Processing (ICASSP), 2632–2663, 2009. pages 1553–1556, March 2012. ISSN 2379-190X. [23] A. Kawahara, R. Azuma, Y. Ikeda, K. Kawai, Y. Katoh, [7] S. H. Kang and C. Park. Mram: Enabling a sustainable Y. Hayakawa, K. Tsuji, S. Yoneda, A. Himeno, K. Shima- device for pervasive system architectures and applications. kawa, et al. An 8 mb multi-layered cross-point reram macro In IEEE Intl. Electron Devices Meeting, pages 38.2.1–38.2.4, with 443 mb/s write throughput. IEEE Journal of Solid-State Dec 2017. Circuits, 48(1):178–185, 2013. [8] C. W. Smullen, V. Mohan, A. Nigam, S. Gurumurthi, and [24] T. Liu, T. H. Yan, R. Scheuerlein, Y. Chen, J. K. Lee, M. R. Stan. Relaxing non-volatility for fast and energy- G. Balakrishnan, G. Yee, H. Zhang, A. Yap, J. Ouyang, et al. efficient stt-ram caches. In Intl. Symp. on High Perf. Comput. A 130.7-mm2 2-layer 32-gb reram memory device in 24- Arch., pages 50–61. IEEE, 2011. nm technology. IEEE Journal of Solid-State Circuits, 49(1): [9] T. Endoh, H. Koike, S. Ikeda, T. Hanyu, and H. Ohno. An 140–153, 2014. overview of nonvolatile emerging memories— for [25] Y. Joo, D. Niu, X. Dong, G. Sun, N. Chang, and Y. Xie. working memories. IEEE Journal on Emerging and Selected Energy-and endurance-aware design of phase change me- Topics in Circuits and Syst., 6(2):109–119, 2016. mory caches. In Design, Autom. & Test in Eur. Conf. & [10] Y. Huai. Spin-transfer torque mram (stt-mram): Challenges Exhibition, 2010, pages 136–141. IEEE, 2010. and prospects. AAPPS bulletin, 18(6):33–40, 2008. [26] J. Wang, X. Dong, Y. Xie, and N. P. Jouppi. i2wap: [11] A. Jog, A. K. Mishra, C. Xu, Y. Xie, V. Narayanan, R. Iyer, Improving non-volatile cache lifetime by reducing inter-and and C. R. 
Das. Cache revive: architecting volatile stt-ram intra-set write variations. In IEEE 19th Intl. Symp. on High caches for enhanced performance in cmps. In Proc. of the Perf. Comp. Arch., pages 234–245. IEEE, 2013. Design Autom. Conf., pages 243–252. ACM, 2012. [27] H. Zhang, C. Zhang, Q. Hu, C. Yang, and J. Shu. Perf. anal. on structure of racetrack memory. In 23rd Asia and South [42] Z. Sun, X. Bi, W. Wu, S. Yoo, and H. H. Li. Array Pacific Design Autom. Conf., pages 367–374, Jan 2018. organization and data management exploration in racetrack [28] R. Venkatesan, V. Kozhikkottu, C. Augustine, A. Ray- memory. IEEE Trans. on Computers, 65(4):1041–1054, chowdhury, K. Roy, and A. Raghunathan. Tapecache: a 2016. high density, energy efficient cache based on domain wall [43] S. Mittal. A survey of techniques for architecting processor memory. In Proc. of the ACM/IEEE Intl. Symp. on Low components using domain-wall memory. ACM Journal on power Electron. and Design, pages 185–190. ACM, 2012. Emerging Technologies in Computing Syst., 13(2):29, 2016. [29] S. Fujita, H. Noguchi, K. Ikegami, S. Takeda, K. Nomura, [44] L. V. Cargnini, L. Torres, R. M. Brum, S. Senni, and and K. Abe. Technology trends and near-future applications G. Sassatelli. Embedded memory hierarchy exploration of embedded stt-mram. In IEEE Intl. Memory Workshop, based on magnetic random access memory. Journal of Low pages 1–5. IEEE, 2015. Power Electron. and Applications, 4(3):214–230, 2014. [30] R. Banakar, S. Steinke, B.-S. Lee, M. Balakrishnan, and [45] P. Wang, G. Sun, T. Wang, Y. Xie, and J. Cong. Designing P. Marwedel. Scratchpad memory: design alternative for scratchpad memory architecture with emerging stt-ram me- cache on-chip memory in embedded systems. In Proc. of the mory technologies. In IEEE Intl. Symp. on Circuits Syst., Intl. Symp. on Hardw./softw. Codesign, pages 73–78. ACM, pages 1244–1247. IEEE, 2013. 2002. [46] J. Hu, C. J. Xue, Q. Zhuge, W.-C. Tseng, and E. H.-M. Sha. [31] L. Li, L. 
Gao, and J. Xue. Memory coloring: A compiler Data allocation optimization for hybrid scratch pad memory approach for scratchpad memory management. In 14th with sram and nonvolatile memory. IEEE Trans. on VLSI Intl. Conf. on Parallel Arch. and Compilation Techniques Syst., 21(6):1094–1102, 2013. (PACT)., pages 329–338. IEEE, 2005. [47] M. Qiu, Z. Chen, and M. Liu. Low-power low-latency data [32] R. Bishnoi, F. Oboril, M. Ebrahimi, and M. B. Tahoori. allocation for hybrid scratch-pad memory. IEEE Embedded Avoiding unnecessary write operations in stt-mram for low Systems Letters, 6(4):69–72, 2014. power implementation. In Intl. Symp. on Quality Electron. [48] G. Rodríguez, J. Tourino, and M. T. Kandemir. Volatile stt- Design, pages 548–553. IEEE, 2014. ram scratchpad design and data allocation for low energy. [33] R. Bishnoi, M. Ebrahimi, F. Oboril, and M. B. Tahoori. ACM Trans. on Arch. and Code Optimization, 11(4):38, Asynchronous asymmetrical write termination (aawt) for a 2015. low power stt-mram. In Proc. of the Conf. on Design, Autom. [49] Z. Wang, Z. Gu, M. Yao, and Z. Shao. Endurance-aware al- & Test in Eur., page 180. European Design and Automation location of data variables on nvm-based scratchpad memory Association, 2014. in real-time embedded systems. IEEE Trans. on Computer- [34] X. Wu, J. Li, L. Zhang, E. Speight, R. Rajamony, and Aided Design of Integr. Circuits and Syst., 34(10):1600– Y. Xie. Hybrid cache architecture with disparate memory 1612, 2015. technologies. In ACM SIGARCH computer arch. news, [50] J. Li, L. Shi, Q. Li, C. J. Xue, Y. Chen, Y. Xu, and 37, pages 34–45. ACM, 2009. W. Wang. Low-energy volatile stt-ram cache design using [35] G. Sun, X. Dong, Y. Xie, J. Li, and Y. Chen. A novel cache-coherence-enabled adaptive refresh. ACM Trans. on architecture of the 3d stacked mram l2 cache for cmps. In Design Autom. of Electron. Syst., 19(1):5, 2013. IEEE Intl. Symp. on High Perf. Comput. Arch., pages 239– [51] Q. Li, J. Li, L. Shi, M. Zhao, C. 
J. Xue, and Y. He. 249. IEEE, 2009. Compiler-assisted stt-ram-based hybrid cache for energy [36] B. C Lee, E. Ipek, O. Mutlu, and D. Burger. Architecting efficient embedded systems. IEEE Trans. on VLSI Systems, phase change memory as a scalable dram alternative. In 22(8):1829–1840, 2014. ACM SIGARCH Comput. Arch. News, volume 37, pages 2– [52] K. Qiu, M. Zhao, Q. Li, C. Fu, and C. J. Xue. Migration- 13. ACM, 2009. aware loop retiming for stt-ram-based hybrid cache in em- [37] P. K. Amiri, Z. M. Zeng, J. Langer, H. Zhao, G. Rowlands, bedded systems. IEEE Trans. on Computer-Aided Design of Y.-J. Chen, I. N. Krivorotov, J.-P. Wang, H. W. Jiang, J. A. Integr. Circuits Syst., 33(3):329–342, 2014. Katine, et al. Switching current reduction using perpendi- [53] M. Komalan, J. I. G. Pérez, C. Tenllado, P. Raghavan, cular anisotropy in cofeb–mgo magnetic tunnel junctions. M. Hartmann, and F. Catthoor. Feasibility exploration of Applied Physics Letters, 98(11):112507, 2011. nvm based i-cache through mshr enhancements. In Proc. [38] Z. R. Tadisina, A. Natarajarathinam, B. D. Clark, A. L. of the Conf. on Design, Autom. & Test in Europe, page 21. Highsmith, T. Mewes, S. Gupta, E. Chen, and S. Wang. European Design and Automation Association, 2014. Perpendicular magnetic tunnel junctions using co-based mul- [54] M. P. Komalan, C. Tenllado, J. I. G. Pérez, F. T. Fernández, tilayers. Journal of Applied Physics, 107(9):09C703, 2010. and F. Catthoor. System level exploration of a stt-mram [39] C. Xu, D. Niu, X. Zhu, S. H Kang, M. Nowak, and based level 1 data-cache. In Proc. of the Design, Autom. Y. Xie. Device-architecture co-optimization of stt-ram based & Test in Eur. Conf. & Exhibition, pages 1311–1316. EDA memory for low power embedded systems. In Proc. of the Consortium, 2015. Intl. Conf. on Computer-Aided Design, pages 463–470. IEEE [55] Q. Li, Y. He, J. Li, L. Shi, Y. Chen, and C. J. Xue. Press, 2011. Compiler-assisted refresh minimization for volatile stt-ram [40] S. 
Mittal and J. S. Vetter. A survey of software techniques for cache. IEEE Trans. on Computers, 64(8):2169–2181, 2015. using non-volatile memories for storage and main memory [56] C. Lin and J.-N. Chiou. High-endurance hybrid cache design systems. IEEE Trans. on Parallel and Distributed Syst., 27 in cmp architecture with cache partitioning and access-aware (5):1537–1550, 2016. policies. IEEE Trans. on VLSI Syst., 23(10):2149–2161, [41] S. Mittal, J. S. Vetter, and D. Li. A survey of architectural 2015. approaches for managing embedded dram and non-volatile [57] J. Ahn, S. Yoo, and K. Choi. Prediction hybrid cache: An on-chip caches. IEEE Trans. on Parallel and Distributed energy-efficient stt-ram cache architecture. IEEE Trans. on Syst., 26(6):1524–1537, 2015. Computers, 65(3):940–951, 2016. [58] J. Zhao, S. Li, D. H. Yoon, Y. Xie, and N. P. Jouppi. Kiln: simulation and detailed microarchitecture modeling. IEEE Closing the performance gap between systems with and Intl. Symp. on Perf. Analysis of Syst. and Softw. (ISPASS), without persistence support. In IEEE/ACM Intl. Symp. on pages 74–85, 2013. Microarchitecture, pages 421–432. IEEE, 2013. [74] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, [59] H. Kim, J. Ahn, S. Ryu, J. Choi, and H. Han. In-memory file G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and system for non-volatile memory. In Proc. of the Research B. Werner. Simics: A full system simulation platform. in Adaptive and Convergent Systems, pages 479–484. ACM, Computer, 35(2):50–58, 2002. 2013. [75] Milo M. K. Martin, Daniel J. Sorin, Bradford M. Beckmann, [60] I. Moraru, D. G. Andersen, M. Kaminsky, N. Tolia, P. Ran- Michael R. Marty, Min Xu, Alaa R. Alameldeen, Kevin E. ganathan, and N. Binkert. Consistent, durable, and safe Moore, Mark D. Hill, and David A. Wood. Multifacet’s memory management for byte-addressable non volatile main general execution-driven multiprocessor simulator (gems) memory. In Proc. of the ACM SIGOPS Conf. 
on Timely toolset. SIGARCH Comput. Archit. News, 33(4):92–99, Results in Operating Systems, page 1. ACM, 2013. November 2005. ISSN 0163-5964. URL http://doi. [61] T. Gao, K. Strauss, S. M. Blackburn, K. S. McKinley, acm.org/10.1145/1105734.1105747. D. Burger, and J. Larus. Using managed runtime systems [76] S. Udayakumaran and R. Barua. Compiler-decided dynamic to tolerate holes in wearable memories. In ACM SIGPLAN memory allocation for scratch-pad based embedded systems. Notices, volume 48, pages 297–308. ACM, 2013. In Proc. of the Intl. Conf. on Compilers, arch. and synthesis [62] S. Kannan, A. Gavrilovska, K. Schwan, and D. Milojicic. for embedded systems, pages 276–286. ACM, 2003. Optimizing checkpoints using nvm as . In [77] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, IEEE 27th Intl. Symp. on Parallel & Distributed Processing, T. Mudge, and R. B. Brown. Mibench: A free, commercially pages 29–40. IEEE, 2013. representative embedded benchmark suite. In IEEE Intl. [63] J.-Y. Jung and S. Cho. Memorage: Emerging persistent ram Workshop on Workload Characterization, pages 3–14. IEEE, based malleable main memory and storage architecture. In 2001. Proc. of the Intl. ACM Conf. on Supercomputing, pages 115– [78] Christian Bienia. Benchmarking modern multiprocessors. 126. ACM, 2013. Princeton University, 2011. [64] A. Sampson, J. Nelson, K. Strauss, and L. Ceze. Appro- [79] N. L. Binkert, R. G. Dreslinski, L. R. Hsu, K. T. Lim, A. G. ximate storage in solid-state memories. ACM Trans. on Saidi, and S. K. Reinhardt. The m5 simulator: Modeling Comput. Syst., 32(3):9, 2014. networked systems. IEEE Micro, 26(4):52–60, 2006. [65] B. Li, S.C. Shan, Y. Hu, and X. Li. Partial-set: write speedup [80] N. Muralimanohar, R. Balasubramonian, and N. Jouppi. Op- of pcm main memory. In Design, Autom. Test Eur. Conf. and timizing nuca organizations and wiring alternatives for large Exhibition, 2014, pages 1–4. IEEE, 2014. caches with cacti 6.0. In Proc. 
of the Annual IEEE/ACM Intl. [66] Z. Zhang, Z. Jia, P. Liu, and L. Ju. Energy efficient real- Symp. on Microarchitecture, pages 3–14. IEEE Computer time task scheduling for embedded systems with hybrid main Society, 2007. memory. Journal of Signal Processing Syst., 84(1):69–89, [81] C. W. Smullen, A. Nigam, S. Gurumurthi, and M. R. Stan. 2016. The stetsims stt-ram simulation and modeling system. In [67] Q. Hu, G. Sun, J. Shu, and C. Zhang. Exploring main Proc. of the Intl. Conf. on Computer-Aided Design, pages memory design based on racetrack memory technology. In 318–325. IEEE Press, 2011. Intl. Great Lakes Symp. on VLSI (GLSVLSI), pages 397–402, [82] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, May 2016. A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, [68] G. Wang, Y. Guan, Y. Wang, and Z. Shao. Energy-aware S. Sardashti, et al. The gem5 simulator. ACM SIGARCH assignment and scheduling for hybrid main memory in Comput. Arch. News, 39(2):1–7, 2011. embedded systems. Computing, 98(3):279–301, 2016. [83] J. L. Henning. Spec cpu2006 benchmark descriptions. ACM [69] Sparsh Mittal, Jeffrey S. Vetter, and Dong Li. Improving SIGARCH Computer Arch. News, 34(4):1–17, 2006. energy efficiency of embedded dram caches for high-end [84] M. A. Heroux, D. W. Doerfler, P. S. Crozier, J. M. Willen- computing systems. In Proc. of the 23rd Intl. Symp. on bring, H. C. Edwards, A. Williams, M. Rajan, E. R. Keiter, High-perf. Parallel and Distributed Computing, HPDC ’14, H. K. Thornquist, and R. W. Numrich. Improving perfor- pages 99–110. ACM, New York, NY, USA, 2014. ISBN mance via mini-applications. Sandia National Laboratories, 978-1-4503-2749-7. URL http://doi.acm.org/10. Tech. Rep. SAND2009-5574, 3, 2009. 1145/2600212.2600216. [85] Chunho Lee, Miodrag Potkonjak, and William H Mangione- [70] A. Agrawal, A. Ansari, and J. Torrellas. Mosaic: Exploiting Smith. 
Mediabench: a tool for evaluating and synthesizing the spatial locality of process variation to reduce refresh multimedia and communicatons systems. In Proc. of the energy in on-chip modules. In IEEE Intl Symp. on ACM/IEEE Intl. symposium on Microarchitecture, pages High Perf. Comput. Arch., pages 84–95, Feb 2014. ISSN 330–335. IEEE CS, 1997. 1530-0897. [86] A. M. H. Monazzah, H. Farbeh, S. G. Miremadi, M. Fazeli, [71] J. Ahn and K. Choi. Lasic: Loop-aware sleepy instruction and H. Asadi. Ftspm: A fault-tolerant scratchpad memory. In caches based on stt-ram technology. IEEE Trans. on VLSI 43rd Annual IEEE/IFIP Intl. Conf. on Dependable Systems Systems, 22(5):1197–1201, May 2014. ISSN 1063-8210. and Networks, pages 1–10. IEEE, 2013. [72] Y. Wang, L. Ni, C. H. Chang, and H. Yu. Dw-aes: A domain- [87] J. Gustafsson, A. Betts, A. Ermedahl, and B. Lisper. The wall nanowire-based aes for high throughput and energy- mälardalen wcet benchmarks: Past, present and future. In efficient data encryption in non-volatile memory. IEEE OASIcs-OpenAccess Series in Informatics, volume 15. Sch- Trans. on Information Forensics and Security, 11(11):2426– loss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2010. 2440, Nov 2016. ISSN 1556-6013. [88] X. Li, Y. Liang, T. Mitra, and A. Roychoudhury. Chronos: A [73] Jung Ho Ahn, Sheng Li, Seongil O, and Norman P. Jouppi. timing analyzer for embedded software. Science of Computer Mcsima+: A manycore simulator with application-level+ Programming, 69(1):56–67, 2007. [89] C. Lattner and V. Adve. Llvm: A compilation framework for lifelong program analysis & transformation. In Proc. of the Intl. Symp. on Code Generation and Optimization: feedback-directed and runtime optimization, page 75. IEEE CS, 2004. [90] J. Li, C. J. Xue, and Y. Xu. Stt-ram based energy-efficiency hybrid cache for cmps. In IEEE/IFIP 19th Intl. Conf. on VLSI and System-on-Chip, pages 31–36. IEEE, 2011. [91] Livermore. [Online]. 
Available: http://www.netlib.org/benchmark/livermorec, 2013. [92] Blitz++. [Online]. Available: http://blitzplus- pplusp.sourcearchive.com, 2013. [93] S. Berkowits. Pin-a dynamic binary instrumentation tool. Intel Developer Zone, 2012. [94] D. Kroft. Lockup-free instruction fetch/prefetch cache or- ganization. In Proc. of the Symp. on Comput. Arch., pages 81–87. IEEE CS Press, 1981. [95] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Low- ney, S. Wallace, V. J. Reddi, and K. Hazelwood. Pin: building customized program analysis tools with dynamic instrumentation. In ACM sigplan notices, volume 40, pages 190–200. ACM, 2005. [96] Texas Instruments. Tms320dm368 digital media system-on- chip data sheet, 2011. [97] J. Scott, L. H. Lee, J. Arends, and B. Moyer. Designing the low-power m•coreTM architecture. In Power driven microarchitecture workshop, pages 145–150, 1998. [98] Lingo-optimization modeling software for linear, nonli- near, and integer programming. [Online]. Available: http://www.lindo.com, 2013. [99] C. Bienia, S. Kumar, J. P. Singh, and K. Li. The parsec benchmark suite: Characterization and architectural impli- cations. In Proc. of the Intl. Conf. on Parallel Arch. and Compilation techniques, pages 72–81. ACM, 2008. [100] X. Wu, J. Li, L. Zhang, E. Speight, and Y. Xie. Power and performance of read-write aware hybrid caches with non- volatile memories. In Design, Autom. & Test in Eur. Conf. & Exhibition, 2009, pages 737–742. IEEE, 2009. [101] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi. Cacti 6.0: A tool to model large caches. HP Laboratories, pages 22–31, 2009. [102] Y. Zhang, J. Zhan, J. Yang, W. Jiang, L. Li, and Y. Li. Energy-aware page replacement for nvm based hybrid main memory system. In IEEE 23rd International Conference on Embedded and Real-Time Computing Systems and Applica- tions (RTCSA), pages 1–6, Aug 2017. ISSN 2325-1301. [103] J. Choi and G. Park. 
Nvm way allocation scheme to reduce nvm writes for hybrid cache architecture in chip- multiprocessors. IEEE Transactions on Parallel and Dis- tributed Systems, 28(10):2896–2910, Oct 2017. ISSN 1045- 9219.