Expanding the World of Heterogenous Memory Hierarchies The Evolving Non-Volatile Memory Story

Bill Gervasi Principal Systems Architect

16 May 2019 2 Checkpointing Memory Processing Tiers Challenges Persistence Sharing Time Agenda Current Solutions Security Seeking Mixed the Ideal Distributed A New Mode Processing Standard Solutions 3

Data processing is great 4

Data processing is great

Until something goes wrong The Cost of Power Failure 5 6 7

DRAM Checkpointing Run degrades performance Checkpoint Storage Checkpointing Run burns power Checkpoint Storage

Run Checkpointing sucks Checkpoint Storage 8

DRAM Run But checkpointing avoids data loss from failure Checkpoint Storage Checkpoint

Run FAIL! RESTART Run 9

System failure is a key factor in server software design

Data persistence is essential

Storage access time impacts transaction granularity 10 The game we play to trade off performance, capacity, and cost 11

To reduce the penalties from checkpointing…

…move non-volatile storage closer to the CPU Traditional Server Architecture Review 12

Network

Memory I/O $ CPU Control

… … Memory Memory Memory Memory Memory Memory Memory Memory … … …

Faster, lower latency 13

The Search for The Holy Grail 14

When we no longer fear power failure…

DATA PERSISTENCE What if you could 15 replace DRAM with a non-volatile memory?

You’d call it Memory Class Storage 16

When was the last time you read about a new volatile memory?

NRAM™

MRAM PCM

3DXP ReRAM

The non-volatile memory revolution is under way 17 To NVRAM To DRAM

To core memory From vacuum tubes 18

THIS is why the term “Persistent Memory” is insufficient

The industry must distinguish between deterministic and non-deterministic persistent memory

Only “Memory Class Storage” is fully deterministic AND persistent 19

NRAM FeRAM 3DXpoint MRAM Flash ReRAM DRAM SRAM Not all “persistence” is created equal 20

“Write endurance” determines HOW persistent Wear leveling needed if writes are limited 21 Temperature sensitivity impacts long term retention

Weeks of Data Retention 22 READ WRITE WRITE READ WRITE

DATA DATA DATA DATA DATA

DRAM interface is deterministic Data latency is FIXED

READ WRITE WRITE READ WRITE DATA DATA DATA X X HOUSEKEEPING Any endurance limit breaks determinism 23 Memory Class Storage

Full DRAM Speed

No endurance limits

Fully deterministic 24 NVRAM is a Memory Class Storage 25 In the future?

Memory Class Storage

Memory Class Storage = NVRAM NVRAM

For now… 26 Storage Class Memory Is NOT a Memory Class Storage 27

Hard Disk Storage Memory Class Storage Class SSD Memory Storage Flash NVMe ≥ DRAM performance = DRAM endurance Phase Change ≥ DRAM capacity 3DXpoint Resistive RAM Wasteland Magnetic RAM 3D NOR

DDR DRAM DDR NVRAM 28

Non-Deterministic Non-Deterministic Non-Deterministic

Deterministic Deterministic Deterministic 29

DRAM

NVDIMM-N

Optane

NVRAM Memory Class Storage

NVDIMM-P 30 Itty bitty leaky capacitors lose charge DRAM On power fail, you lose

ACT ACT ACT ACT

RD RD RD RD

WR WR WR WR

PRE PRE PRE PRE REFRESH Refresh time consumes up to 15% of bandwidth 31

Run DRAM

FAIL! Energy Source 32

Flash Voltage NVDIMM-N NVM Regulator DRAM Array Control

Use DRAM normally Isolation Buffers -N NVDIMM

On Power Fail, copy to Flash Power Fail

Power restored, copy to DRAM Voltage CPU Regulator Host System 33 Run

NVDIMM-N FAIL! RESTORE

Switch to Battery Copy Flash to Power DRAM

Copy DRAM to Flash Run 34

NVDIMM-N One power fail cycle pays for 1-2 MINUTES Copy Flash to DRAM a LOT of protection Copy DRAM to Flash 1-2 MINUTES Faster than Flash!!! 35 3DXpoint Array But vs DRAM? Meh

NVM Optane Decent capacity, though Control Reads are slow Writes are deathly slow

RD WR CPU Data Host System

Could be used as a very slow DRAM but more common as expansion Data 36 512GB = 512GB 3DXpoint Array

3DXpoint Array

App Direct NVM Control NVM Control Memory Mode 512GB + 64GB = 512GB

DRAM as

CPU Optane Host System CPU Host System Small Energy Source Non-Volatile Memory 37 Array – Any Kind Voltage Regulator NVDIMM-P DRAM NVM Cache Control NVDIMM-P Protocol

Read A Read B Send Read C Send Send

RSP Data A RSP RSP Data C Data B

CPU New non-deterministic protocol Host System Not backward compatible with DDR

Requires NVDIMM-P aware CPU 38

Volatile Mode Battery Backup No Persistence ala NVDIMM-N

NVDIMM-P Persistence Options

Reduced Explicit FLUSH Energy, Command Cacheless NVRAM 39

DRAM speed Non-volatility Unlimited write endurance Wide temperature range Scalable beyond DRAM Flexible fabrication & application Low power Low cost 40

Drop in replacement for DRAM NVRAM MemoryDRAM Class Storage

Fully Deterministic

Permanently persistent

Always available Host System 41

NRAM™ PCM *

DDR5 * Future generation devices NVRAM

ReRAM * MRAM * 42 43 Comparing DRAM & NVRAM

No refresh is required

“Self refresh” can be power OFF

Some timing differences (but deterministic!)

Data persistence definitions

Greater per-die capacity 44

NRAM™ PCM

ReRAM ≠ MRAM

Precharge Persistence Timings requirement definition

DDR5 NVRAM Specification brings coherence 45 DRAM NVRAM

IDLE IDLE

“0 ns” REFRESH

REFRESH “350 ns” = NOP

Refresh command is not needed Decoded as NOP for compatibility 46 DRAM NVRAM

IDLE IDLE

SELF REFRESH SELF REFRESH

FREQUENCY REFRESH CHANGE FREQUENCY CHANGE Power burned “No” power burned DRAM NVRAM 47

IDLE READ

ACTIVATE IDLE WRITE READ

PRECHARGE WRITE Precharge command is not needed Decoded as NOP for compatibility 48

Persistence Definitions*

Intrinsic: Power Fail: Immediately On After NVRAM Extrinsic: WRITE RESET After FLUSH Command * Discussions on-going Data is persistent 49

Intrinsic WR WR WR Persistence

Extrinsic WR WR FLUSH WR WR FLUSH Persistence

Power Fail WR WR WR WR WR RESET Persistence

* Discussions on-going 50

DDR5 DRAM DDR5 NVRAM is limited enables up to to 32Gb per die 128Tb per die 51 DDR5 SDRAM

ACT RD WR ACT RD WR ACT RD WR

DDR5 NVRAM

REXT ACT RD WR ACT RD WR REXT ACT RD WR

Row Extension adds up to 12 more bits of addressing

Backward compatible with DDR5 – Acts like REXT = 0 until needed 52 ROW Bank buffer 0 DDR5

ROW Bank buffer 31 … SDRAM COLUMNS

“ROW” includes bank group & bank…

REXT ROW Bank buffer 0 DDR5

REXT ROW Bank buffer 31 … NVRAM

COLUMNS 53

REXT A REXT C ACT BANK W, ROW K READ BANK Y  Row A + M ACT BANK X ,ROW L READ BANK Z  Row B + N ACT BANK Y, ROW M ACT BANK W, ROW P

REXT B READ BANK W  Row C + P

ACT BANK Z, ROW N WRITE BANK X  Row A + L READ BANK W  Row A + K ACT BANK X, ROW L

WRITE BANK X  Row C + L READ BANK X  Row A + L REXT A READ BANK Y  Row A + M ACT BANK X, ROW R READ BANK Z  Row B + N READ BANK X  Row A+R

Row Extension Example 54

REXT A REXT C ACT BANK W, ROW K READ BANK Y  Row A + M ACT BANK X ,ROW L READ BANK Z  Row B + N ACT BANK Y, ROW M ACT BANK W, ROW P

REXT B READ BANK W  Row C + P

ACT BANK Z, ROW N WRITE BANK X  Row A + L READ BANK W  Row A + K ACT BANK X, ROW L

WRITE BANK X  Row C + L READ BANK X  Row A + L REXT A READ BANK Y  Row A + M ACT BANK X, ROW R READ BANK Z  Row B + N READ BANK X  Row A+R

Row Extension Replacement Example 55

NVRAM Memory Class Storage 56

DRAM NVRAM Run Checkpointing Run can be made Checkpoint Storage to persistent memory Run Run OR

Checkpoint Storage Checkpointing Run can be turned Run off completely Run

Checkpoint Storage 57 Phase 1 Phase 2

DRAM NVRAM NVRAM Run Run Run

Checkpoint Storage Checkpoint NVRAM No checkpoint

Run Run Run

Checkpoint Storage Checkpoint NVRAM No checkpoint

Run Run Run

Checkpoint Storage Checkpoint NVRAM 58

Keep in mind…

Checkpoints may include system failure

Knowing when a task may resume is complicated Power failure is not the only thing to fear 59 Remember Those Persistence Definitions Immediately On After NVRAM WRITE RESET After FLUSH Command Tasks may not be safe until system Tasks may be safe in stability confirmed nanoseconds Tasks may be safe in microseconds 60

Persistence Capacity Performance

System designers have a lot of options to balance Homogenous 61 Main Memory

DRAM

NVDIMM-N MCS

NVDIMM-P Optane Heterogenous 62 Main Memory

DRAM + Optane MCS + NVDIMM-P

MCS + Optane 512GB 63

When capacity NVDIMM-P Optane meets persistence

64GB

DRAM NVRAM 32GB Memory Class Storage

NVDIMM-N Homogenous 64 Main Memory Combinations

Data Safe Performance Capacity

DRAM No Best 1.0 X

NVDIMM-N Yes Best 0.5 X

Optane Yes Worst 10 X

NVDIMM-P Yes Mid 10 X

MCS Yes Best+ 1 X+ Heterogeneous 65 Main Memory Combinations

Data Safe Performance Capacity

DRAM + Optane No High 6 X

DRAM + NVDIMM-P No High 6 X

MCS + Optane Yes High 6 X

MCS + NVDIMM-P Yes High 6 X Homogenous Heterogeneous 66 Main Memory Combinations Main Memory Combinations

Software need not care Software encouraged to put critical functions All functions take the in faster memory same time Often mount slower memory as RAM drive 67

Software support via DAX assists in moving…

from mounted drives…

…to RAM drive…

…to direct access mode 68

The Power

of

Zero Power 69 Putting a Node to Sleep

Operating Self Refresh Mode Mode

Instant On means power must stay alive

Refresh operations burn significant power 70 Memory Class Storage can be turned off entirely

Operating Power 33 Mode Off 71 Memory Module (DIMM)

DDR5 memory modules Module Power have on-DIMM voltage Memory Media regulation (PMIC) PMIC Data Buffers DIMM power may be shut off independently of system power System Power

System Motherboard DIMM1 DIMM2 72

Module Module Power Power Memory Media Memory Media

PMIC PMIC Data Buffers Data Buffers

System Power

System power off; both DIMMs off Multiple power System power on & both DIMMs off management options System power on & DIMM1 on, DIMM2 off 73

Nantero NRAM™

My favorite NVRAM

Full presentation on Wednesday… ELECTRODE 74

ELECTRODE

Van der Waals energy barrier keeps CNTs apart or together

Data retention >300 years @ 300 ֯C, >12,000 years @ 105 ֯C

Stochastic array of hundreds nanotubes per each cell 75

No temperature sensitivity

5 ns balanced read/write performance 76

NRAM Data Retention = 12,000 Years

10,000 years ago 4,500 years ago

2,500 years ago X 77 Y NRAM LAYER 64 Kb tile X Z 256 K tiles = 16 Gb Drivers Receivers Array size tuned to the size of drivers & receivers Chip-level timing is a function of bit line flight times Replicate this “tile” as needed for device capacity Add I/O drivers to emulate any PHY needed

I/O PHY 78

Carbon Nanotube Arrays

DDR4, DDR5 72 bits NRAM Row Column Bank Decode Decode Decode SECDED ECC Engine

Address 64 bits

FIFO FIFO Chip ID Die Selector x4/x8

Data Strobe Strobe Data 79

15-20% DDR4/DDR5

DDR4/DDR5 NRAM Architectural improvements improve data throughput Base throughput 15% or greater at the same Elimination of refresh clock frequency Elimination of tFAW restrictions Elimination of bank group restrictions

Elimination of power states

Elimination of inter-die delays

Bandwidth: larger is better NVRAM 80 Memory Class Storage NRAM NRAM NRAM NRAM NRAM NRAM NRAM NRAM NRAM

NRAM NRAM NRAM NRAM NRAM NRAM NRAM NRAM NRAM

Plugs into an RDIMM slot

Appears to the CPU as DRAM

Memory controller may optionally be tuned for NVRAM 81 One less layer of marshmallows to deal with Persistence

Persistence Non- deterministic Fully deterministic 82 Would you rather… 83 Step on broken glass? Know Your

Enemy A LEGO? Or some jacks? 84

…about those energy stores…

Batteries Supercapacitors Tantalums (etc.) 85

Batteries Supercapacitors Tantalums (etc.)

High capacity Medium capacity Low capacity High energy density Low energy density Low energy density Low reliability Degrade over time …but stable 86

Energy needed for backup of DRAM cache

Flash or Storage Storage Class DRAM Controller Energy Memory

I/O More room 87 for storage

Eliminate need X for backup energy

Flash or Storage Storage Class Controller NVRAM Energy Memory X I/O 88 NVRAM Changes the Math DRAM cache limited by 1GB/TB energy available No DRAM? Cache size dictated by cost/performance 89

Switching gears again…

…to Systems Evolution 90

Pop quiz How many CPUs in a 1980s PC? 91

Modem

One?

Graphics Network Adapter Adapter Sound Blaster They were killed by 92 “Native Signal Processing” They were called “DSPs” Digital Signal Processors

Drivers

They put processing next to the data Analog front end devices 93

With NSP… $ $ $ W W So why do it? W Now We Are 94 Trending Back FPGA

CPU Memory

Storage Fabric Storage

Memory CPU

AI 95 Processor 96 Elements Processor Processor Elements Elements Bridge

I/O Low I/O Elements Latency Elements Fabric Bridge

Storage Storage Elements Elements Storage Elements

Bridge 97

Distributed resources

Application-specific computing

In-memory computing

Artificial intelligence and deep learning

Security 98

Graphics Accelerator Search Engine Human interface Standard CPU

HTML processing Artificial Intelligence Accelerator Human interface management Low Latency Network Adapter Fabric

Memory Array

Filesystem Aware Storage Example AI accelerator 99 Tbps links

I/O NNP Control

Exec HBM SRAM Unit SIMD architectures Matrix interconnections Exec HBM SRAM Unit Fast pipes still limit load/save time

HBM … Challenges: • Model checkpointing Exec HBM SRAM • Data loss on power fail Unit • Temperature sensitivity I/O 100

Back propagation algorithms complicate things

Data loss problems are amplified

Checkpointing highly time and bandwidth consuming 101

The more distributed memory gets, the harder to load and unload

MEM MEM MEM MEM

MEM MEM MEM MEM

MEM MEM MEM MEM 102

NVRAM TO THE RESCUE!

Replacing dynamic memory with persistent memory resolves the data loss issues Just leave the data in place 103 as long as you want

I/O NNP Control Replace DRAM with NVRAM

MCS Exec Replace eRAM with NVRAM HBM SRAM HBM Unit

MCS Exec MCS MCS MCS MCS HBM SRAM HBM Unit MCS MCS MCS MCS

MCS HBM HBM … MCS MCS MCS MCS

MCS Exec HBM SRAM HBM Unit 104

SRAM & The final frontier… Registers 105 Continuing to look for ways to bring Memory Class Storage down under 1ns Voltage adjustment Faster edge rates Better error check Shadow registers Getting smarter

It will happen 106

When we no longer fear power failure…

DATA PERSISTENCE

Full END TO END persistence 107

Are we getting near the day when we look back at volatile memory… …and LAUGH? 108

……b…bu…but…but… 109

Persistent data introduces challenges, too 110 Data is ALWAYS there!

Data security is a growing concern 111 Application opens data from previous application

Memory moved from one system to another

Spy devices on memory buses

So many potential breaches 112

Infection via hack Infection via spy devices 113

Password: X2.Hd44**3#jj0%

General trend is to encrypt data before transmission or storage 114

X2.Hd44**3#jj0%

Keep the bad guys out

X2.Hd44**3#jj0% Some are adding in-memory 115 Small Energy Source Non-Volatile Memory compute functions Array – Any Kind including encryption Voltage X2.Hd44**3#jj0% Regulator Works as long as the bus Smart DRAM NVM Cache is secure Control Encryption quality may be Password: limited by transfer size

CPU Management of many keys Host System can get complicated quickly 116 ISO/IEC 11889 Saving Data is Need tiers 117 Power Fail a Pain of memory Sucks & storage Persistence Sharing is Essential Time Today’s Summary Solutions Persistence Help Complications But We Data Can Do Mix & DDR5 Better Distribution Match NVRAM Spec Challenges Memories in Progress 118

Thank you for your time

Bill Gervasi [email protected] 119

I’m here to learn too

What do you deal with?