ROSS: a Design of Read-Oriented STT-MRAM Storage for Energy
Total Page:16
File Type:pdf, Size:1020Kb
ROSS: A Design of Read-Oriented STT-MRAM Storage for Energy-Efficient Non-Uniform Cache Architecture Jie Zhang, Miryeong Kwon, Chanyoung Park, Myoungsoo Jung, and Songkuk Kim School of Integrated Technology, Yonsei Institute Convergence Technology, Yonsei University [email protected], [email protected], [email protected], [email protected], [email protected] Abstract high leakage power. With increasing demand for larger caches, SRAM is struggling to keep up with the density Spin-Transfer Torque Magnetoresistive RAM (STT- and energy-efficiency requirements set by state-of-the- MRAM) is being intensively explored as a promis- art system designs. ing on-chip last-level cache (LLC) replacement for Thanks to the device-level advantages of STT-MRAM SRAM, thanks to its low leakage power and high stor- such as high-density structure, zero leakage current, and age capacity. However, the write penalties imposed by very high endurance, it comes out as an excellent candi- STT-MRAM challenges its incarnation as a successful date to replace age-old SRAM for LLC design. However, LLC by deteriorating its performance and energy effi- the performance of STT-MRAM is critically sensitive to ciency. This write performance characteristic unfortu- write frequency due to its high write latency and energy nately makes STT-MRAM unable to straightforwardly values. Therefore, an impulsive replacement of SRAM substitute SRAM in many computing systems. with high-density STT-MRAM, simply for increasing In this paper, we propose a hybrid non-uniform cache LLC capacity, can deteriorate cache performance and in- architecture (NUCA) by employing STT-MRAM as a troduce poor energy consumption behavior. read-oriented on-chip storage. The key observation here Extensive research is taking place to address these is that many cache lines in LLC are only touched by challenges. Guangyu et al [8] proposed a 3D-stacked read operations without any further write updates. These STT-MRAM, which intends to hide the long-latency of cache lines, referred to as singular-writes, can be in- STT-MRAM on writes by adding a small SRAM write ternally migrated from SRAM to STT-MRAM in our buffer to each STT-MRAM bank. Asit K. et al [9] dis- hybrid NUCA. Our approach can significantly improve cussed a memory request scheduling scheme on a 3D the system performance by avoiding many cache read multicore environment integrating STT-MRAM, which misses with the larger STT-MRAM cache blocks, while aims to resolve the STT-MRAM write issues at the on- it maintains the cache lines requiring write updates in chip network level. This approach promoted an idea the SRAM cache. Our evaluation results show that, by of re-routing subsequent requests to idle cache banks, utilizing the read-oriented STT-MRAM storage, our hy- instead of the write-busy bank. While these proposed brid NUCA can better the performance of a conven- schemes can partially alleviate the write overhead prob- tional SRAM-only NUCA and a dead block aware STT- lem imposed by STT-MRAM, the quest for an energy- MRAM NUCA by 30% and 60% with 45% and 8% efficient cache still remains unconquered. In addition, lower energy values, respectively. the hybrid cache using SRAM-based write buffer is un- aware of application behaviors, and its benefits are re- 1 Introduction stricted by the amount of available buffer at runtime. We observe that performance and energy-efficiency Last-level cache (LLC) is utilized as a shared resource of a hybrid cache critically depend on its internal data- in most commercial multicore products and co-processor movement trend, and by studying such trends we can architectures [6], making it a pivotal design component modify the cache architecture accordingly to get the best in determining both system performance and energy- out of it. Specifically, we observe that nearly 90% of the efficiency. While many emerging applications enjoy the data in a LLC can actually be written only once during massive parallelism and high computational power of its lifetime, which has been reported as “deadwrites” in multicore systems, they are required to manage large [13]. One of the potential approaches to eliminate STT- datasets presenting challenge of scale, which in turn MRAM’s disadvantages, being aware of this deadwrite leads to the demand for integrating larger LLCs in mod- characteristics, is to bypass the writes associated to the ern computing system [2]. deadwrites to the underlying main memory. However, Static Random Access Memory (SRAM) has been we also observe that, about 60% of total read misses re- the prevailing technology for on-chip cache, success- sults from such bypassed write data. fully catering to the latency demands of the high perfor- In this paper, we propose a Read-Oriented STT- mance processors. However, SRAM’s cell design consti- MRAM Storage (ROSS) based non-uniform cache ar- tutes a large number of transistors (usually six, at least chitecture (NUCA), that can avail combined benefits of four), making it a low-density device with considerably a hybrid design and the data movement/usage trends in a LLC. We built upon the standard NUCA architec- SRAM STT-MRAM ture, and modified it to develop a high-capacity hybrid cache where designated SRAM banks are replaced with STT-MRAM banks. This inclusion of high-density STT- Cell Structure MRAM blocks enables the cache to attain larger storage capacity, as well as higher energy efficiency. In our pro- posed architecture, cache data with singular-write char- Cell Area (F2) 50-120 6-40 acteristics (i.e., data is written only once and is not re- Leakage (mW) 75.7 6.6 referenced by any latter write operations) are migrated from the SRAM cache blocks to the STT-MRAM blocks Read Latency (ps) 397 238 inside the LLC, rather than bypassing them to the main Write Latency (ps) 397 6932 memory. As a result, our proposed scheme allows the Read Energy (pJ) 35 13 cache controller to free up sufficient SRAM blocks for Write Energy (pJ) 35 90 write intensive data, without sustaining the read oper- ation overheads imposed by off-chip memory accesses. Table 1: Features of 32KB SRAM and STT-MRAM Our comprehensive evaluation results show that, our caches [14]. ROSS cache significantly improves the overall LLC per- STT-MRAM Cache. Compared to SRAM, STT- formance, incurring only a minor overhead from the data MRAM is comprised of a magnetic tunnel junction migration process. The main contributions of this work (MTJ) and a single access transistor. As a result, STT- can be summarized as follows: MRAM caches have higher density (3x ∼ 4x) and lower • Evaluation of data movement trend in the LLC. We ex- leakage power. However, for the write operation, STT- tensively evaluate data read and write trend in the LLC MRAM involves physical rotation of the MTJ free layer, for a wide range of memory-intensive workloads. This which requires high write latency and write power con- motivational evaluation allows us to better understand sumption. High write overhead becomes a serious obsta- the actual trend of data movement in the LLC, and to cle for STT-MRAM to completely replace SRAM. modify our architecture accordingly, for optimum per- NUCA. In modern cache architectures, large LLC is em- formance and energy-efficiency. ployed to relieve the burden of frequent accesses to un- • Development of ROSS based NUCA. Motivated from derlying main memory. However, global wire delay fol- the observations, we propose two novel hybrid ROSS lowed by increased cache area size, has emerged as a based non-uniform cache architectures. The first one dominant factor for increasing cache access time and is hybrid ROSS (HB-ROSS) that takes advantage of more access contentions. The non-uniform cache ar- SRAM’s low write latency and small write power as chitecture (NUCA) is proposed to reduce the penalty of well as STT-MRAM’s high density and energy efficiency long wire delay by dividing the cache into smaller banks. by detecting and retiring data with singular-write from Each bank has non-uniform latency based on bank loca- SRAM to STT-MRAM. Second, to further optimize en- tions, smaller than what it would be if the whole cache ergy efficiency, early retirement ROSS (ER-ROSS) is de- was a uniform cache. signed to reduce SRAM capacity, while maintaining high performance. Differing from HB-ROSS, ER-ROSS re- tires all potential singular-write data sets at early stage. 3 Design and Implementation of ROSS • Comprehensive system-level evaluation. We evalu- ate and compare our proposed ROSS-based NUCA with In this section, we first describe the motivation behind a SRAM-only NUCA and a dead block aware STT- our proposed ROSS based NUCA architecture, and then MRAM based NUCA, and show that our proposed ROSS discuss about the LLC topology, NUCA baseline and caches show significant improvement in terms of both ROSS scheme. At last, we provide the implementation cache performance and energy-efficiency. Specifically, details of ROSS. compared to the prior work, our singular-write aware ROSS cache improves LLC performance by upto 60%, and consumes upto 50% less energy at system level. 3.1 Motivation Figure 1 shows cache eviction trend for the tested work- loads and “1” ∼ “>5” denote the number of writes on 2 Background cache blocks. One can observe from this figure that, nearly 90% of the written data are actually never re- SRAM Cache. Table 1 lists the technology features of written before their eviction, referred to singular-writes. SRAM and STT-MRAM. In practice, SRAM has a six- Singular-writes can be generated by different situations transistor structure and has high static power due to sub- such as data fill by a single read-miss, a redundant write threshold and gate leakage currents.