Using MRAM in an Intelligent Memory Hierarchy (IMH)
Shinobu Fujita, Senior Fellow, Toshiba Corporation
MRAM Developer Day 2018, Santa Clara, CA

Our previous works (1): Embedded memory integration (31nm MTJ in 65nm CMOS)
[Figure: 1Mbit and 4Mb (65nm) test chips and a 1Gb (28nm) design; the MTJ is the last process step, fabricated above the embedded logic.]
2T-2MTJ cell: K. Ikegami (Toshiba), IEDM 2014, 2015. Access time < 4ns: H. Noguchi (Toshiba), VLSI Circuits Symposium, 2013, 2014.

Our previous works (2): Demonstration of last-level-cache active energy reduction (measured cache energy)
Measured cache energy is reduced by over 90% compared with the SRAM-based cache.
STT-MRAM last level cache demonstration with an ARM CPU: H. Noguchi et al. (Toshiba), ISSCC 2016.

Outline
. Introduction: key points of the Intelligent Memory Hierarchy (IMH) for near-future applications, from IoT edge to cloud
. IMH case study 1: last level cache (LLC) memories with eMRAM
. Rethinking the eMRAM requirements for the LLC
. IMH case study 2: persistent memories with eMRAM
. Summary
STT-MRAM potential in the memory hierarchy
[Figure: memory hierarchy from the CPU core down to storage. L1, L2/3, and the LLC are SRAM; storage-class STT-MRAM, with its fast write speed and high endurance, targets the near-memory tier; DRAM serves as far memory; SCM and SSD/HDD sit at the storage level.]
https://www.i-micronews.com/manufacturing-report/product/emerging-non-volatile-memory-2017.html
Before 2019, MRAM was low-speed (>40ns), with a TAM of ~$1B; from 2020, high speed (<20ns) and high density expand the TAM to ~$4B or more. But simple NVM replacement is not an Intelligent Memory Hierarchy…
The traditional memory hierarchy is volatile from the registers (~ns) through the caches to main memory, with nonvolatile file storage (~s) at the bottom. The new, HW/SW co-designed hierarchy is a paradigm shift: an intelligent on/off judgment in the OS/API and power supply makes the main memory (~ms) a nonvolatile/volatile hybrid, with interfaces to sensors and networks, and application-oriented NVM adaptation (FeRAM, MRAM, ultra-fast STT-RAM).

An Intelligent Memory Hierarchy (IMH) should satisfy three points:
Point 1: a nonvolatile/volatile hybrid with intelligent (breakeven-time-aware) power management.
Point 2: the IMH should change from application to application.
Point 3: there should be only one storage (long-retention NVM) at the bottom of the IMH.
S. Fujita, SSDM 2016
Trend of cache memory for processors
The capacity of cache memory in CPUs keeps increasing, which increases the standby power of processors: from L1/L2 SRAM, to L1/L2/L3 SRAM, to L1/L2/L3/L4 SRAM (+eDRAM).
Mature processes and technologies have been presented:
・1Mb full function @40nm CMOS (Qualcomm)
・8Mb full function @28nm CMOS (Samsung)
・4Gb full function @29nm DRAM transistor (SK hynix & Toshiba)
・256Mb @40nm DDR3/4 shipping, 1Gb sample (Everspin)
・30nm MTJ demonstration for last-level-cache MRAM (TDK-Headway)
・TSMC, GF, Samsung…: eMRAM manufacturing on FDSOI
http://semimd.com/blog/2018/07/18/emerging-memory-types-headed-for-volumes/

Breakeven time for utilizing nonvolatile memory
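The breakeven-time idea can be sketched numerically: a nonvolatile cache pays extra write-back energy up front but saves SRAM leakage while powered off, so power-gating into NVM only pays when the idle interval exceeds the breakeven time. The numbers below are illustrative assumptions, not measured Toshiba values.

```python
# Breakeven-time sketch: extra NVM write energy vs leakage power saved.
# Idle periods longer than t_be favor writing state into NVM and gating power.

def breakeven_time_s(extra_write_energy_j: float, leakage_power_w: float) -> float:
    """Idle time after which the NVM write-back energy is repaid by leakage savings."""
    return extra_write_energy_j / leakage_power_w

# Hypothetical example: 1 uJ of extra write-back energy, 1 mW of SRAM leakage saved.
t_be = breakeven_time_s(1e-6, 1e-3)
print(t_be)  # 0.001 s: idle periods longer than ~1 ms favor the NVM
```

With shorter write pulses or lower write current the breakeven time shrinks, which is why breakeven-time-aware power management (Point 1 above) depends on the MRAM write parameters.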
S. Fujita, IMW 2015

Intelligent Memory Hierarchy (IMH) for high-performance processors: neither the conventional all-volatile hierarchy nor an all-nonvolatile one is optimal. The IMH is a volatile/nonvolatile hybrid!
Example of an SRAM LLC (last level cache), several MB:
・Read/Write random access time: ~2ns
・Power (write power): ~10fJ/bit
・Retention: N.A.
・Area: 200 to 400 F²
MRAM should not compete with SRAM on these terms. Let's RETHINK the requirements of MRAM for the last level cache:
・Read/Write speed
・Power (write power)
・Endurance
・Retention
・Area
・Error rate

Access frequency of cache memory and main memory
CPU core (base clock 1GHz) → L1 cache (3ns) → L2 cache (20ns) → last level cache (30~40ns; the tier that should be MRAM) → main memory DRAM (100~500ns). Each cache level has a miss rate of roughly 10%, so each level is accessed about once per ten accesses to the level above it.
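The hierarchy above can be sketched as an average-memory-access-time (AMAT) recurrence, using the slide's latencies and a uniform ~10% miss rate per level (an illustrative simplification):

```python
# AMAT sketch: each level costs its hit latency plus miss_rate times the
# cost of the next level down. The last entry (main memory) always hits.

def amat(latencies_ns, miss_rate=0.10):
    t = latencies_ns[-1]
    for latency in reversed(latencies_ns[:-1]):
        t = latency + miss_rate * t
    return t

# L1 3ns, L2 20ns, LLC 30ns, DRAM 100ns (from the slide):
print(amat([3, 20, 30, 100]))  # ~5.4 ns average
```

Because only ~1% of accesses reach the LLC and ~0.1% reach DRAM, the LLC's latency and error rate are strongly diluted, which underpins the relaxed-requirement argument that follows.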
A higher write error rate (WER) than that of conventional SRAM is acceptable for an MRAM LLC, since the main memory DRAM can cover it.

CPU performance simulation comparing SRAM-LLC and MRAM-LLC:
[Figure: relative CPU performance (instructions per second, a.u.; higher is better) for gcc, lbm, mcf, milc, omnetpp, soplex, sphinx3, xalancbmk, GemsFDTD, libquantum, and their average, over 1MB-L3 configurations with read/write latencies from 2ns/2ns to 10ns/35ns and write buffers of 0, 10, or 100 entries.]
CPU performance is affected by the read latency of the last level cache. It is not largely affected by write latency (< ~20ns), since the write process is not on the CPU pipeline and write latencies below ~20ns are also covered by the write buffer.

Influence of MRAM read access time on CPU performance
[Figure: average CPU time per instruction (ns) vs average read access time (0–30ns); at a 5ns read access time, MRAM is within 3% of SRAM.] S. Fujita et al., ASP-DAC 2015

Comparison of read latency between SRAM and MRAM at Mb capacities
[Figure: read latency (0–10ns) vs memory capacity (1–1000Mb); SRAM latency grows steeply with capacity while MRAM's grows more slowly, so MRAM becomes competitive at large capacities.] S. Fujita et al., NVMSA 2017
How to reduce write energy with STT-MRAM? (1)
[Figure: power vs time. (1) Conventional SRAM/eDRAM: static power flows continuously on top of the active power. (2) eSTT-MRAM: power is consumed only during the write pulse.]
Write energy = write pulse time × write power, so this product should be decreased.

How to reduce write energy with STT-MRAM? (2)
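The energy relation above can be made concrete: per-bit write energy is E = Iw × Vdd × tw, so halving either the current or the pulse width halves the energy. The currents and voltage below are illustrative assumptions (the 50uA figure echoes the 2T-2MTJ cell quoted later), not guaranteed device values.

```python
# Write-energy sketch: E = Iw * Vdd * tw, reported in femtojoules per bit.

def write_energy_fj(i_w_a: float, v_dd_v: float, t_w_ns: float) -> float:
    return i_w_a * v_dd_v * (t_w_ns * 1e-9) * 1e15  # J -> fJ

print(write_energy_fj(50e-6, 1.0, 10))  # ~500 fJ: too high for an LLC
print(write_energy_fj(10e-6, 1.0, 5))   # ~50 fJ: the LLC target quoted later
```

This is why the later requirement table couples low write energy with short pulses and low Iw: the three are the same product.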
[Figure: relative average power for a 128MB L2 cache (SRAM = 1) vs MRAM access time, at static/active duty ratios of 1:50 and 1:100; although MRAM's active power is higher than SRAM's, its average power drops below SRAM's at realistic duty ratios.] S. Fujita, NVMSA 2017

From conventional power gating to normally-off / instant-on with an STT-MRAM LLC
[Figure: power vs time. Conventional power gating (SRAM): with frequent CPU stops, leakage remains during power gating, and the deep-power-down state cannot be used freely because SRAM state is lost. Power gating with an STT-MRAM LLC: the cache is normally-off with instant-on wakeup, so frequent CPU stops yield reduced leakage power and better performance.]

Real measurement of cache active/static states: CPU cores STOP briefly and very frequently, even while the application is running!
[Figure: real-time monitoring of 8 CPU cores while a 3D graphics application is running; the cores repeatedly enter short STOP states.] S. Fujita, ICICDT 2016

[Figure: real-time CPU/cache measurement during a mobile benchmark (movie software); the sum of time spent in each CPU state: 1: CPU active, 2: CPU standby, 3: CPU sleep/deep-sleep.]
CPU simulations (total write counts): max ~3 × 10^10 writes/year (4-core CPU, 3GHz, 8MB LLC, gem5 CPU simulator).

Trade-off: high endurance (> 10^12) vs high-speed write (< 20ns)
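The simulated write count translates directly into an endurance requirement; a minimal sketch of the arithmetic:

```python
# Endurance sketch: worst-case LLC writes per year (from the gem5 simulation
# above) against candidate endurance specs.

writes_per_year = 3e10           # simulated worst case
endurance_cycles = 1e12          # the LLC requirement quoted later

lifetime_years = endurance_cycles / writes_per_year
print(lifetime_years)  # ~33 years: comfortably beyond product lifetime

# An endurance of only 1e10 cycles would be exhausted within the first year:
print(1e10 / writes_per_year)
```

Hence endurance > 10^12 is the target, even though shorter write pulses tend to degrade it, as the next figure shows.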
[Figure: endurance (10^3–10^18 cycles) vs write pulse width (0–40ns) for eSTT-MRAM; H. Noguchi, ISSCC 2015.]
Write pulse > 40ns: practically unlimited endurance. Write pulse < 20ns: limited endurance, which affects long-term LLC/CPU reliability.
Retention is controllable for each application.
[Figure: thermal stability factor Δ vs MTJ diameter (0–100nm), with data from Toshiba and refs [1]–[3]; large MTJs give long retention, while small MTJs give short retention and a lower write current Iw.]

Retention requirement for LLC:
[Figure: measured average cache-data lifetime for various workloads vs required data retention (10^-1 to 10^9 s) at a chip failure rate of 1000 FIT, for a 10MB L3 cache at 80°C and a 256KB L2 cache at 80°C with and without SECDED; the red reference curve shows the data-lifetime distribution for a 1MB L2 cache, A. Jog et al., DAC 2012, pp. 243-252.]
A relatively low retention (and hence low Δ) is acceptable for the LLC.
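The link between Δ and retention can be sketched with the standard thermal-activation model t ≈ τ0·exp(Δ), with attempt time τ0 ~ 1ns; the Δ values below are illustrative, not the slide's measured data:

```python
# Retention sketch: thermal-activation model for a single MTJ.
# t_retention = tau0 * exp(delta), tau0 ~ 1 ns (common assumption).
import math

def retention_seconds(delta: float, tau0_s: float = 1e-9) -> float:
    return tau0_s * math.exp(delta)

print(retention_seconds(25))  # ~72 s: minutes-class retention, enough for an LLC
print(retention_seconds(60))  # ~1e17 s: storage-class, effectively permanent
```

Since Δ shrinks with MTJ diameter and a smaller Δ also lowers Iw, relaxing retention to "order of minutes" is what lets the LLC MTJ be small and write-efficient.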
Expected cell area reduction with eSTT-MRAM in 22nm CMOS and beyond
eRAM cell area comparison (22nm/16nm CMOS):
・6T SRAM @22nm: 0.092um² = 190F² at F = 22nm (Intel, ISSCC 2012)
・2T-2MTJ STT-MRAM (fast) @22nm: 0.037um² = 76F² at F = 22nm (MTJ φ = 35nm, Iw = 50uA, RA = 6Ω·um²)
・eDRAM @22nm: 0.029um² (Intel, ISSCC 2013)
Expected area reduction relative to SRAM: ×0.5 or less with 2T-2MTJ (fast), ×0.25 or less with 1T-1MTJ (slow, write > 5ns). A low write current is needed for a small memory cell (16nm LP-CMOS: K. Ikegami, K. Abe, S. Fujita et al., IEDM 2015).

MTJ scaling for eMRAM across CMOS technology nodes: the MTJ should be scaled with the metal pitch.
[Figure: M1 pitch and MTJ size (0–100nm) at the 28nm, 20nm, 16nm, 7nm, and 5nm nodes; a ~30nm MTJ suffices even for 5nm CMOS, so an x-nm MTJ is not needed. TSMC, VLSI Symposium 2018; Samsung, IEDM 2016; GF, VLSI Symposium 2018.]
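The F² figures above are just the absolute cell areas normalized by the squared feature size; a quick check of the slide's numbers:

```python
# Cell-area sketch: convert absolute cell area (um^2) to F^2 units at a
# given feature size F (nm). Reproduces the slide's values at F = 22nm.

def area_in_f2(area_um2: float, f_nm: float) -> float:
    return area_um2 * 1e6 / (f_nm ** 2)  # um^2 -> nm^2, divided by F^2

print(round(area_in_f2(0.092, 22)))  # 190: 6T SRAM (Intel, ISSCC 2012)
print(round(area_in_f2(0.037, 22)))  # 76: 2T-2MTJ STT-MRAM
```

The normalized comparison is what makes the "less than half of SRAM" claim node-independent.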
Improving the error tolerance of the L2 cache (K. Ikegami, H. Noguchi, S. Fujita et al., IEDM 2015): the tag array is SRAM and the data array is eMRAM protected by ECC (e.g. SECDED). On a cache hit, a correctable error is fixed and the data is sent to the CPU; an uncorrectable error is treated as a cache miss, and the line is refetched from the main memory DRAM. An RER/WER of ~1e-9 in the eMRAM is acceptable in this mobile-processor case study.

Requirements for eMRAM as an LLC
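The correctable/uncorrectable handling described above can be sketched as a read path; the names and decode interface here are hypothetical, not Toshiba's implementation:

```python
# Read-path sketch: SRAM tags, eMRAM data with SECDED. A correctable error
# is served after correction; an uncorrectable one is demoted to a miss so
# DRAM covers it.
from enum import Enum

class EccResult(Enum):
    CLEAN = 0
    CORRECTED = 1      # single-bit error fixed by SECDED
    UNCORRECTABLE = 2  # double-bit error detected, not correctable

def read_line(tag_hit: bool, ecc_result: EccResult, mram_data, fetch_from_dram):
    if not tag_hit:
        return fetch_from_dram()          # ordinary cache miss
    if ecc_result is EccResult.UNCORRECTABLE:
        return fetch_from_dram()          # demote hit to miss; DRAM covers the error
    return mram_data                      # clean or corrected: serve the CPU

print(read_line(True, EccResult.CORRECTED, "cached", lambda: "dram"))      # cached
print(read_line(True, EccResult.UNCORRECTABLE, "cached", lambda: "dram"))  # dram
```

Demoting uncorrectable errors to misses is the mechanism that makes a ~1e-9 raw error rate tolerable: it costs only a DRAM access, not correctness.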
Item / Requirement:
・Read/Write speed: < 10ns / 20ns (write pulse)
・Endurance: > 1 × 10^12
・Active energy (write energy): < 50fJ
・Retention: > order of minutes @ 1ppm
・Area: less than half of the SRAM area
・Error rate: < 1 × 10^-9
The first three must be realized at the same time: that is the most challenging part.
How to meet the requirements for LLC? (1)
[Figure: write current (10^-5 to 10^-2 A) vs write time (10^-10 to 10^-7 s). General MTJs and p-MTJs [1]–[7] need higher energy than SRAM; Toshiba's advanced p-MTJs (Kitagawa, IEDM 2012; Saida, Intermag 2014; Saida, VLSI 2016, 2017) approach the SRAM-compatible zone as the MTJ scales from 28–35nm to 22nm and toward 1x nm. Qualcomm, TDK-Headway & TSMC, Samsung, and GF have matured MTJ technology for small MTJs (30–40nm).]
[1] Sony Corp., IEDM (2005). [2] New York Univ., Appl. Phys. Lett. 97, 242510 (2010). [3] Cornell Univ., Appl. Phys. Lett. 95, 012506 (2009). [4] Univ. of Minnesota, J. Phys. D: Appl. Phys. 45, 025001 (2012). [6] IBM Corp., Appl. Phys. Lett. 98, 022501 (2011). [7] TDK-Headway, Appl. Phys. Express 5, 093008 (2012).
Key device improvements: MgO breakdown voltage, reduction in the MTJ resistance (RA), improvement of Iw/Δ, and small MTJs to reduce Iw at low WER.

How to meet the requirements for LLC? (2)
New architecture: a NAND-flash-like MRAM, the voltage-control spintronics MRAM (VoCSM).
[Figure: endurance (10^3–10^18 cycles) vs write pulse (0–40ns); VoCSM achieves high-speed write and high endurance at the same time. H. Yoda et al., IEDM 2016.]

VoCSM can also reduce the write energy (= Iw × tw × Vdd).
[Figure: programming current (mA) vs programming time (ns); the measured Iw of VoCSM sits in the SRAM-compatible zone, with lower power and higher speed than general MTJs and p-MTJs [1]–[7].]
Y. Ohsawa et al., EDTMC 2018
Conventional persistent memory with NVDIMM
[Figure: persistent-memory organizations, from NVDIMM (DRAM capacity = NVM capacity) to 3D XPoint variants (DRAM < NVM, or NVM only).] The conventional persistent memory, an NVDIMM with battery backup, suffers from high cost and large volume.

A new persistent memory architecture with eSTT-MRAM
The processor's last level cache writes back to a 64GB DRAM main memory; a 1Gb eMRAM is attached via PCIe alongside a 1TB NAND SSD, and the OS/file system manages the memory/disk addresses.
<Control 1> Write back: a replica of ALL dirty lines (pages) written back into main memory is also written into the MRAM.
<Control 2> Backup from the MRAM into the SSD (in the background, over PCIe).
<Control 3> On an unexpected power-down, the data in the MRAM can still be used.

All write-back data, simulated:
・Single core, 3.2GHz, 3-way out-of-order
・Caches: 32KB L1D / 32KB L1I, 64B lines, 1MB LLC
・Main memory: 16GB DDR4
[Figure: total write-back data volume per workload; a 1Gb eMRAM can persist the dirty data of a 16GB DRAM.]
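The three controls above can be sketched as a small state machine; the class and method names are illustrative, not the actual controller firmware:

```python
# Persistence sketch: every dirty line written back to DRAM is replicated
# into a small nonvolatile MRAM (<Control 1>), drained to the SSD in the
# background (<Control 2>), and recovered after power loss (<Control 3>).

class PersistentWritePath:
    def __init__(self):
        self.dram = {}   # volatile main memory (lost on power-down)
        self.mram = {}   # small nonvolatile replica of not-yet-backed-up data
        self.ssd = {}    # bulk nonvolatile backup

    def write_back(self, addr, line):        # <Control 1>
        self.dram[addr] = line
        self.mram[addr] = line               # replica of every dirty line

    def background_backup(self):             # <Control 2>
        self.ssd.update(self.mram)
        self.mram.clear()                    # MRAM now free for new dirty data

    def recover_after_power_loss(self):      # <Control 3>
        self.dram = {}                       # DRAM contents are gone
        return {**self.ssd, **self.mram}     # MRAM covers the un-backed-up tail

p = PersistentWritePath()
p.write_back(0x10, "a")
p.background_backup()
p.write_back(0x20, "b")                      # power fails before the next backup
print(p.recover_after_power_loss())          # both lines survive
```

Because the MRAM only has to hold the dirty data between backups, a 1Gb part can persist a much larger DRAM, which is the cost advantage over a full-capacity NVDIMM.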
Interval of write back, simulated:
[Figure: ratio (%) of write-back intervals vs interval (10–90ns) for GemsFDTD, gcc, lbm, libquantum, mcf, milc, omnetpp, soplex, sphinx3, and xalancbmk.]
A write latency of less than 30ns is required to keep up; MRAM is the only NVM solution that achieves it.

A write buffer can compensate for write latency (~100ns)
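The required buffer size follows from Little's law: the buffer must hold roughly (write-back rate) × (MRAM write latency). A minimal sketch, using the 64B line size and worst-case ~30ns interval from the simulation above as assumptions:

```python
# Write-buffer sizing sketch (Little's law): bytes in flight =
# line size * (write latency / write-back interval).

def buffer_bytes(line_bytes: int, interval_ns: float, latency_ns: float) -> float:
    return line_bytes * (latency_ns / interval_ns)

print(buffer_bytes(64, 30, 100))  # ~213 B in steady state at worst-case rate
```

A steady-state estimate of a few hundred bytes explains why the ~2KB buffer on the slide is enough: the extra capacity absorbs write-back bursts on top of the average rate.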
[Figure: required write-buffer capacity (0–2500B) per workload; a buffer of about 2KB absorbs a write latency of ~100ns.]

eMRAM requirements for persistent memory
Items / Requirement:
・Write latency: < 100ns
・Read latency: < 1us
・Endurance: > 1 × 10^10
・Retention: ~1 week
・Memory capacity: > x Gb (challenging!)

Comparison of persistent memories
For a workload such as an in-memory database:
・Proposed (eMRAM / 16GB DRAM / 1TB NAND): low cost; 1TB virtual memory capacity; ~100ns latency for persistency; DDR-compatible processor performance.
・Conventional 1 (16GB DRAM / 1TB enhanced 3D XPoint): very high cost; 1TB capacity; ~1us latency for persistency.
・Conventional 2 (16GB DRAM / 1TB NAND): high cost; 16GB capacity; ~ms latency for persistency; DDR-compatible.

Summary
3 key points of the Intelligent Memory Hierarchy (IMH): a nonvolatile/volatile hybrid with intelligent power management; an IMH that changes from application to application; and only one storage (long-retention NVM) at the bottom of the IMH.
IMH case study 1: last-level-cache memories with eMRAM. An MRAM/SRAM hybrid cache hierarchy with breakeven-time-aware designs. Rethinking the eMRAM requirements for the LLC: based on our detailed analysis, the requirements have been clarified; high-speed write, high endurance, and low Iw are the most difficult to realize together, so a largely improved STT-MRAM or the new VoCSM architecture is expected.
IMH case study 2: persistent memories with eMRAM. A several-Gb eMRAM is proposed as a novel persistent memory; STT-MRAM can cover the requirements, though Gb density is challenging. Lower cost and higher performance compared with conventional NVDIMM-based solutions.

Thank you for your kind attention!
Acknowledgements This work was partly supported by the ImPACT Program of the Council for Science, Technology and Innovation (Cabinet Office, Government of Japan).
Co-workers (Toshiba R&D center) Kazutaka Ikegami, Susumu Takeda, Satoshi Shirotori, Naoharu Shimomura, Hiroaki Yoda, Tomoaki Inokuchi, Katsuhiko Koi, Hideyuki Sugiyama, Yuichi Ohsawa, and Atsushi Kurobe