Rick Coulson, Senior Fellow, NVM Solutions Group

Outline

• The history of storage latency and where we stand today
• The promise of Storage Class Memory (SCM) and 3D XPoint™ memory
• Extensive compute platform changes driven by the quest for lower storage latency with SCM / 3D XPoint™ memory
  – As traditional storage
  – As persistent memory
• Innovation opportunities abound

1956: IBM RAMAC 350

5 MBytes • $57,000 • $15,200/MByte • ~1.5 random IOPS* • 600 ms latency

* IOPS depends on the workload and is a range.
*Other names and brands may be claimed as the property of others.

1980: IBM 3350

300 MBytes • $60,000 • $200/MByte • 30 random IOPS • 33 ms latency

*Other names and brands may be claimed as the property of others.

1983: IBM 3380

2.52 GBytes • $82,000 • $36/MByte • ~160 IOPS total • 25 ms latency

Two hard disk assemblies, each with two independent actuators, each actuator accessing 630 MB, within one chassis.

*Other names and brands may be claimed as the property of others.

2007: 15K RPM HDD

About 200 random IOPS* • ~5 ms latency

The IOPS scaling problem was addressed by deploying HDDs in parallel in the enterprise.

* IOPS depends on the workload and is a range.

2016: 10K RPM HDD

1.8 TB • ~150 IOPS • 6.6 ms latency

2016: NVMe NAND SSD

2 TB • 500,000+ IOPS • ~60 µs latency

The Continuing Need For Lower Latency

Over 59 years:
• RAMAC 350 HDD (600 ms) to 10K RPM HDD (~6 ms access): ~100x access time reduction
• RAMAC 350 HDD to NAND SSD (~60 µs access): ~10,000x access time reduction
• Meanwhile, from the RAMAC 305's ~100 Hz best-case "clock" to the ~4 GHz Core™ i7: ~40,000,000x clock speed increase

Source: Wikipedia
*Other names and brands may be claimed as the property of others.

Lower Storage Latency Requires Media and Platform Improvements

[Diagram: the drive for lower latency must overcome both media bottlenecks and platform HW/SW bottlenecks; 3D XPoint™ memory (SCM) serves both as an ultra-fast SSD and as persistent memory]

Addressing Media Latency: Next-Gen NVM / SCM

Resistive RAM NVM Options

Scalable resistive memory element: wordlines, memory element, selector device; cross-point array in backend layers, ~4λ² cell.

Family: Defining Switching Characteristics
• Phase Change: energy (heat) converts material between crystalline (conductive) and amorphous (resistive) phases
• Magnetic Tunnel Junction (MTJ): switching of magnetic resistive layer by spin-polarized electrons
• Electrochemical Cells (ECM): formation / dissolution of "nano-bridge" by electrochemistry
• Binary Oxide Filament Cells: reversible filament formation by oxidation-reduction
• Interfacial Switching: oxygen vacancy drift/diffusion induced barrier modulation

Scalable, with potential for near-DRAM access times.

3D XPoint™ Technology

• Crosspoint structure: selectors allow dense packing and individual access to bits
• Breakthrough material advances: compatible switch and memory materials
• Scalable: memory layers can be stacked in a 3D manner
• High performance: cell and array architecture that can switch states 1000x faster than NAND

1000X faster than NAND • 1000X the endurance of NAND • 10X denser than DRAM

*Results have been estimated or simulated using internal analysis, architecture simulation, or modeling, and are provided for informational purposes. Any differences in your system hardware, software, or configuration may affect your actual performance.

3D XPoint™ Technology Instantiation

Intel® Optane™ SSDs based on 3D XPoint™ technology

3D XPoint™ Technology Video

Please excuse the marketing

3D XPoint™ Technology Video: demonstration of a 3D XPoint™ SSD prototype

Need to Address System Architecture To Go Lower

[Chart: latency (µsecs): NAND MLC NVMe SSD (4 kB read), 3D XPoint™ NVMe SSD (4 kB read), DIMM memory (64 B read)]

Storage Platform Changes

Intel® Optane™ SSDs

Addressing Interface Efficiency With NVMe / PCIe

[Chart: latency (µs) for HDD + SAS/SATA, NAND SSD + SAS/SATA, NAND SSD + NVMe™, 3D XPoint™ SSD + NVMe™, and 3D XPoint™ persistent memory; each bar split into drive latency, controller latency (i.e., SAS HBA), and software latency]

• SSD NAND technology offers ~500X reduction in media latency over HDD
• NVMe™ eliminates 20 µs of controller latency
• 3D XPoint™ SSD delivers < 10 µs latency (~7X lower)

Source: Storage Technologies Group, Intel

NVMe Delivers Superior Latency

[Chart: platform HW/SW average latency excluding media (µs) vs. IOPS for 4 kB transfers, comparing SAS (LSI) and SATA (AHCI) configurations with NVMe; PCIe NVMe approaches the theoretical max of 800K IOPS at 18 µs]

Source: Storage Technologies Group, Intel

NVMe/PCIe Provides More Bandwidth

[Chart: bandwidth (GB/sec): SATA 0.55; 4x PCIe G3/NVMe 3.2; 8x PCIe G3/NVMe 6.4]

PCIe/NVMe provides more than 10X the bandwidth of SATA, and even more with Gen 4.

Source: Storage Technologies Group, Intel

Storage SW Stack Optimizations

Much of the storage stack was designed with HDD latencies in mind
• No point in optimizing until now
• Example: paging algorithms with seek optimization and grouping

Synchronous Completion for Queue Depth 1?

[Diagram, from Yang, FAST '12 (10th USENIX Conference on File and Storage Technologies)]
• Async (interrupt-driven): system call, context switch away while the device works (4.1 µs), interrupt, return to user; OS cost = Ta + Tb = 4.9 + 1.4 = 6.3 µs
• Sync (polling): system call, CPU polls while the device works (2.9 µs), return to user; OS cost = 4.4 µs

Standards for Low Latency

In most datacenter usage models, a storage write does not "count" until it is replicated. High replication overhead diminishes the performance differentiation of 3D XPoint™ technology. NVMe over Fabrics is a developing NVM Express specification enabling low-overhead replication.

Summary: Block Storage Platform Changes

• Move to PCIe-based storage
• Streamlined command set: NVM Express
• OS / SW stack optimizations
• Fast replication standards

Intel® Optane™ SSDs

Persistent Memory Oriented Platform Changes

DIMMs based on 3D XPoint™


Why Persistent Memory?

[Chart: latency (µsecs): NAND MLC NVMe SSD (4 kB read), 3D XPoint™ NVMe SSD (4 kB read), 3D XPoint™ DIMM memory (64 B read)]

Open NVM Programming Model

• 50+ member companies
• SNIA Technical Working Group
• Initially defined the 4 programming modes required by developers
• Spec 1.0 developed, approved by SNIA voting members, and published

The four modes:
• Kernel support for block NVM extensions
• Interfaces for legacy applications to access block NVM extensions
• Interfaces for applications accessing kernel PM support
• Interfaces for PM-aware applications accessing a PM-aware file system

NVM Library: pmem.io (64-bit Linux initially)

[Diagram: the application uses the standard file API or load/store through a user-space NVM library; a pmem-aware file system in the kernel sets up MMU mappings directly to the Intel 3D XPoint™ DIMM]

• Open source: http://pmem.io
• libpmem
• libpmemobj (transactional)
• libpmemblk
• libpmemlog
• libvmem
• libvmmalloc

Write I/O Replaced with Persist Points

[Diagram: the application maps persistent memory through a pmem-aware file system; no page cache, MMU mappings give direct load/store access]

Traditional APIs: msync(), FlushViewOfFile()
NVML API: pmem_persist()

NVDIMM Support for Persistent Memory: The Path

[Diagram: a MOV store travels from a core through its L1 and L2 caches and the shared L3, to a memory controller, and finally to the NVDIMM]

New Instructions For Flushing Writes

[Diagram: CLFLUSH, CLFLUSHOPT, and CLWB flush data out of the core caches (L1/L2/L3); PCOMMIT flushes stores accepted by the memory controller down to the NVDIMM]

Flushing Writes from Caches

Instruction: Meaning
• CLFLUSH addr: Cache Line Flush; available for a long time
• CLFLUSHOPT addr: Optimized Cache Line Flush; new, allows concurrency
• CLWB addr: Cache Line Write Back; leaves the value in the cache for performance of the next access

Flushing Writes from the Memory Controller

Instruction: Meaning
• PCOMMIT (Persistent Commit): flush stores accepted by the memory subsystem
• Asynchronous DRAM Refresh (ADR): flushes outstanding writes on power failure; a platform-specific feature

Example Code

    MOV X1, 10        ; X1 and X2 are in pmem
    MOV X2, 20        ; stores to X1 and X2 are globally visible,
    MOV R1, X1        ; but may not yet be persistent
    CLFLUSHOPT X1
    CLFLUSHOPT X2     ; X1 and X2 moved from caches to memory
    SFENCE            ; order the flushes before PCOMMIT
    PCOMMIT
    SFENCE            ; ensures PCOMMIT has completed

Join the Discussion about Persistent Memory

Learn about the persistent memory programming model:
• http://www.snia.org/forums/sssi/nvmp
Join the pmem NVM Libraries open source project:
• http://pmem.io
Read the documents and code supporting ACPI 6.0 and Linux NFIT drivers:
• http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
• https://git.kernel.org/cgit/linux/kernel/git/djbw/nvdimm.git/log/?h=nd
• https://github.com/pmem/ndctl
• http://pmem.io/documents/
• https://github.com/01org/prd
Intel Architecture Instruction Set Extensions Programming Reference:
• https://software.intel.com/en-us/intel-isa-extensions
Intel 3D XPoint™ Memory:
• http://www.intel.com/content/www/us/en/architecture-and-technology/non-volatile-memory.html

Persistent Memory Summary

• New storage model for low latency
• DIMMs based on 3D XPoint™
• New instructions to support persistence
• OS support
• Lots of innovation opportunity

Low Latency Ahead

3D XPoint™ memory:
• Persistent memory: <1 µsec
• Ultra-fast NVMe SSD: <10 µsec
