BRKDCN-2494: Deep Dive into NVMe™ and NVMe™ over Fabrics (NVMe-oF™)
J Metz, Ph.D., R&D Engineer, Advanced Storage
Board of Directors, NVM Express, Inc.; Board of Directors, SNIA; Board of Directors, FCIA

@drjmetz

Agenda
• Introduction
• Who is NVM Express?
• Storage Concepts
• What is NVMe™
• NVMe Operations
• The Anatomy of NVM Subsystems
• Queuing and Queue Pairs
• NVMe over Fabrics (NVMe-oF™)
• How NVMe-oF Works
• NVMe-oF™/RDMA
• NVMe-oF™/FC
• NVMe-oF™/TCP (Future)
• What’s New in NVMe 1.3
• Additional Resources

What This Presentation Is… and Is Not

• What it is:

• A technology conversation

• Deep dive (We’re going in, Jim!)

• What it is not:

• A product conversation

• Comprehensive and exhaustive

Goals
• At the end of this presentation you should know:
  • What NVMe is and why it is important
  • How NVMe is extended for remote access over a network (i.e., “Fabrics”)
  • The different types of fabrics and their differences
  • Some of the differences between traditional SCSI-based storage solutions and NVMe-based solutions
  • What’s new in NVMe and NVMe over Fabrics

Prerequisites

• You really should know…

• Basics of Block-based storage

• Basic terminology (initiator, target)

• Helpful to know…

• Basic PCIe semantics

• Some storage networking

Note: Screenshot Warning!

• Get your screenshots ready when you see this symbol:

• Useful for getting more information about a topic, URLs, etc.

What We Will Not Cover (In Detail)

• NVMe-MI (Management)

• NVMe-KV (Key Value)

• Protocol Data Protection features

• Advances in NVMe features

• RDMA verbs exchanges

• New form factor designs

Who is NVM Express?

NVM Express, Inc.: 125+ companies defining NVMe together
• Board of Directors: 13 elected companies, stewards of the technology & driving processes. Chair: Amber Huffman
• Marketing Subcommittee: NVMexpress.org, webcasts, tradeshows, social media, and press. Co-Chairs: Janene Ellefson and Jonmichael Hands
• Technical Workgroup: Base specification and NVMe over Fabrics. Chair: Amber Huffman
• Management I/F Workgroup: Out-of-band management over PCIe® VDM and SMBus. Chair: Peter Onufryk; Vice-Chair: Austin Bolen
• Interop (ICC) Workgroup: Interop & Conformance Testing in collaboration with UNH-IOL. Chair: Ryan Holmqvist

About NVM Express (The Technology)
• NVM Express (NVMe™) is an open collection of standards and information to fully expose the benefits of non-volatile memory in all types of computing environments, from mobile to data center

• NVMe™ is designed from the ground up to deliver high-bandwidth and low-latency storage access for current and future NVM technologies

NVM Express Base Specification
The register and command set for PCI Express attached storage, with industry-standard software available for numerous operating systems. NVMe™ is widely considered the de facto industry standard for PCIe SSDs.

NVM Express Management Interface (NVMe-MI™) Specification

The command set and architecture for out-of-band management of NVM Express storage (i.e., discovering, monitoring, and updating NVMe™ devices using a BMC).

NVM Express over Fabrics (NVMe-oF™) Specification

The extension to NVM Express that enables tunneling the NVM Express command set over additional transports beyond PCIe. NVMe over Fabrics™ extends the benefits of efficient storage architecture at scale in the world’s largest data centers by allowing the same protocol to extend over various networked interfaces.

NVMe™ Adoption – Industry
• NVMe™ displacing SAS and SATA SSDs in server/PC markets
  • PCIe NAND Flash SSDs primarily inside servers
  • Lower-latency Storage Class Memory (i.e., 3D XPoint™) SSDs: NVMe™-only
  • Extensive client (i.e., tablet) use of smaller form factor SSDs: M.2 and BGA
• NVMe™ ecosystem and recognition growing quickly
  • Many servers offer NVMe™ slots, in different server configurations and form factors
  • Startups already shipping NVMe™ and NVMe over Fabrics™ (NVMe-oF™) solutions
  • Storage-class NVMe™ SSDs emerging, enabling high availability (HA) in storage arrays
  • NVMe-oF™ emerging as a solution to the limited scale of PCIe as a fabric
  • Expanding ecosystem (i.e., analyzers, NVMe-oF™ adapters)

The Basics of Storage and Memory

The Anatomy of Storage
• There is a “sweet spot” for storage
  • Depends on the workload and application type
  • No “one-size-fits-all”
• Understanding “where” the solution fits is critical to understanding “how” to put it together
• Trade-offs between 3 specific forces
  • “You get, at best, 2 out of 3”
• NVMe goes here (in the trade-off diagram)

Storage Solutions - Where Does NVMe Fit?
• Different types of storage apply in different places
• NVMe is PCIe-based
  • Local storage, in-server
• PCIe Extensions (HBAs and switches) extend NVMe outside the server

Storage Solutions - Where Does NVMe-oF Fit?
• NVMe-oF inherits the characteristics of the underlying transport

But First… SCSI
• SCSI is the command set used in traditional storage
  • It is the basis for most storage used in the data center
  • It is the most obvious starting point for working with Flash storage
• These commands are transported via:
  • Fibre Channel
  • InfiniBand
  • iSCSI (duh!)
  • SAS, SATA
• Works great for data that can’t be accessed in parallel (like disk drives that rotate)
  • Any latency in protocol acknowledgement is far less than rotational head seek time

Evolution from Disk Drives to SSDs: The Flash Conundrum
• Flash
  • Requires far fewer commands than SCSI provides
  • Does not rotate (no rotational latency, which exposes the latency of a one-command/one-queue system)
  • Thrives on random (non-linear) access, for both reads and writes
• Nearly all Flash storage systems use SCSI for access
  • But they don’t have to!

So, NVMe…
• Specification for SSD access via PCI Express (PCIe), initially for flash media
• Designed to scale to any type of Non-Volatile Memory, including Storage Class Memory
• Design target: high parallelism and low latency SSD access
• Does not rely on SCSI (SAS/FC) or ATA (SATA) interfaces: new host drivers & I/O stacks
• Common interface for Enterprise & Client drives/systems: reuse & leverage engineering investments
• New, modern command set
  • 64-byte commands (vs. a typical 16 bytes for SCSI)
  • Administrative vs. I/O command separation (control path vs. data path)
  • Small set of commands: small, fast host and storage implementations
• Standards development by the NVMe working group
• NVMe is the de facto future solution for NAND and post-NAND SSDs from all SSD suppliers
• Full support for NVMe in all major OSes (Windows, ESX, etc.)
• Learn more at nvmexpress.org

What is NVMe?

NVMe Operations

What’s Special about NVMe?

(From slower to faster:)
• HDD: varying-speed conveyor belts carrying data blocks (faster belts = lower seek time & latency); SCSI/SAS is a pick & place robot with a tracked, single arm executing 1 command at a time, with 1 queue
• Flash SSD (SCSI/SAS): all data blocks available at the same seek time & latency; SCSI/SAS is still a pick & place robot with a single arm executing 1 command at a time, with 1 queue
• Flash SSD (NVMe/PCIe): all data blocks available at the same seek time & latency; NVMe/PCIe is a pick & place robot with 1000s of arms, all processing & executing commands simultaneously, with high queue depth

Technical Basics

• 2 key components: Host and NVM Subsystem (a.k.a. storage target)

• “Host” is best thought of as the CPU and NVMe I/O driver for our purposes

• “NVM Subsystem” has a component - the NVMe Controller - which does the communication work

• Memory-based deep queues (up to 64K commands per queue, up to 64K queues)
• Streamlined and simple command set (13; only 3 required commands; the mandatory I/O commands are sketched below)
• Command completion interface optimized for success (the common case)
• NVMe Controller: the SSD element that processes NVMe commands
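To make "small command set" concrete, here is a minimal C sketch of the three mandatory NVM I/O opcodes from the NVMe base specification; the enum name is illustrative, and the many optional commands (Write Zeroes, Dataset Management, and so on) are omitted:

```c
/* The three mandatory I/O opcodes of the NVM command set (NVMe base spec). */
enum nvme_io_opcode {
    NVME_CMD_FLUSH = 0x00,  /* commit the volatile write cache to media */
    NVME_CMD_WRITE = 0x01,  /* write logical blocks to a namespace      */
    NVME_CMD_READ  = 0x02,  /* read logical blocks from a namespace     */
};
```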

The Anatomy of NVM Subsystems

What You Need - NVM Subsystem
• Architectural Elements
  • Fabric Ports
  • NVMe Controllers
  • NVMe Namespaces
• Implementation-Dependent Elements
  • NVM Media and NVM Interface (I/F)
(Diagram: an NVMe subsystem with two fabric ports, two NVMe controllers, namespaces identified by NSIDs, an NVM interface, and NVM media)

Fabric Ports
• Subsystem Ports are associated with physical Fabric Ports
  • In PCIe it’s a PCIe command
• Multiple NVMe Controllers may be accessed through a single port
• NVMe Controllers are associated with one port
• Fabric types: PCIe, RDMA (RoCE/iWARP, InfiniBand™), Fibre Channel

NVMe Controller
• NVMe Command Processing
  • On PCIe systems, it’s associated with a single PCIe function
• Access to NVMe Namespaces
  • A Namespace ID (NSID) associates a Controller with Namespace(s)
• May have multiple Controllers per NVM Subsystem
  • Used in multi-host and multi-path configurations
• NVMe Queue Host Interface
  • Paired Command Submission and Completion Queues
  • Admin Queue for configuration; scalable number of I/O Queues

NVMe Namespaces and NVM Media
• Defines the mapping of NVM Media to a formatted LBA range
  • Multiple formats supported, with/without end-to-end protection
• An NVM Subsystem may have multiple Namespaces
• Private or Shared Namespaces
  • Private is accessible by one Controller; Shared is accessible by multiple Controllers
• Namespace Reservations

NVMe Subsystem Implementations
• NVMe PCIe SSD Implementation (single Subsystem/Controller)
• All-NVM Storage Appliance Implementation (1000s of Subsystems/Controllers): an all-NVM appliance with PCIe NVMe SSDs

Queuing and Queue Pairs

In This Section…
• NVMe Host/Controller Communications
• Command Submission and Completion
• NVMe Multi-Queue Model
• Command Data Transfers
• NVMe communications over multiple fabric transports

NVMe Host/Controller Communications
• NVMe Multi-Queue Interface Model
  • A single Administrative Queue and multiple I/O Queues
  • The Host sends NVMe Commands over the Submission Queue (SQ)
  • The Controller sends NVMe Completions over a paired Completion Queue (CQ)
• Transport-type-dependent interfaces facilitate the queue operations and NVMe Command Data transfers
  • Memory transport: PCIe registers; Fabric transport: capsule operations

NVMe Multi-Queue Interface
• I/O Submission and Completion Queue Pairs are aligned to Host CPU Cores
• Independent per-queue operations
  • No inter-CPU locks on command Submission or Completion
  • Per-Completion-Queue interrupts enable source core steering

Queues Scale With Controllers

• Each Host/Controller pair has an independent set of NVMe queues
• Controllers and queues operate autonomously
• NVMe Controllers may be local (PCIe) or remote (Fabric)
  • They use a common NVMe Queuing Model

NVMe Commands and Completions
• NVMe Commands are sent by the Host to the Controller in Submission Queue Entries (SQEs; a struct sketch follows below)
  • Separate Admin and I/O Commands
  • Three mandatory I/O Commands
  • Two added fabric-only Commands
  • Commands may complete out of order
• NVMe Completions are sent by the Controller to the Host in Completion Queue Entries (CQEs)
  • The Command ID identifies the completed command
  • The SQ Head Pointer indicates the consumed SQE slots that are available for posting new SQEs
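For reference, a C sketch of the fixed 64-byte SQE and 16-byte CQE layouts defined in the NVMe base specification. Field names here are shorthand rather than the spec's exact mnemonics, and the command-specific dwords are left generic:

```c
#include <stdint.h>

/* 64-byte Submission Queue Entry (SQE). */
struct nvme_sqe {
    uint8_t  opcode;        /* command opcode (CDW0 bits 07:00)               */
    uint8_t  flags;         /* fused-operation and PRP/SGL selector bits      */
    uint16_t command_id;    /* unique within the submission queue             */
    uint32_t nsid;          /* namespace identifier                           */
    uint64_t reserved;      /* CDW2-3: reserved / command-specific            */
    uint64_t metadata_ptr;  /* MPTR                                           */
    uint64_t dptr[2];       /* DPTR: PRP1/PRP2 or a single SGL descriptor     */
    uint32_t cdw10_15[6];   /* command-specific dwords (e.g., SLBA, NLB)      */
};

/* 16-byte Completion Queue Entry (CQE). */
struct nvme_cqe {
    uint32_t result;        /* DW0: command-specific result                   */
    uint32_t reserved;      /* DW1                                            */
    uint16_t sq_head;       /* consumed SQE slots now available for reuse     */
    uint16_t sq_id;         /* submission queue the command came from         */
    uint16_t command_id;    /* identifies the completed command               */
    uint16_t status;        /* phase tag (bit 0) + status field               */
};
```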

NVMe Generic Queuing Operational Model
1. Host Driver enqueues the SQE into the SQ
2. NVMe Controller dequeues the SQE
   (A) Data transfer, if applicable, goes here
3. NVMe Controller enqueues the CQE into the CQ
4. Host Driver dequeues the CQE

This queuing functionality is always present… but where it takes place can differ (see the transport sketch below).
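One way to picture that split in code: the generic enqueue/dequeue steps stay fixed while the delivery mechanism is pluggable per transport. This is a purely illustrative C sketch; none of these names come from the spec or from any real driver:

```c
struct nvme_sqe;   /* 64-byte submission queue entry (sketched earlier) */
struct nvme_cqe;   /* 16-byte completion queue entry (sketched earlier) */

/*
 * Transport binding: how an SQE reaches the controller and how a CQE
 * reaches the host differs per transport (PCIe doorbells and memory
 * reads/writes vs. fabric capsules), but steps 1-4 above never change.
 */
struct nvme_transport_ops {
    int (*deliver_sqe)(void *sq, const struct nvme_sqe *sqe);  /* steps 1-2 */
    int (*deliver_cqe)(void *cq, const struct nvme_cqe *cqe);  /* steps 3-4 */
};

/* The host driver's view of command submission is transport-independent. */
static int nvme_submit_command(const struct nvme_transport_ops *ops,
                               void *sq, const struct nvme_sqe *sqe)
{
    /* PCIe: write the SQE to host memory and ring a doorbell;
       fabrics: wrap the SQE in a capsule and send it over the network. */
    return ops->deliver_sqe(sq, sqe);
}
```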

NVMe Queuing on Memory (PCIe)
1. Host Driver enqueues the SQE in the host-memory-resident SQ
2. Host Driver notifies the Controller about the new SQE by writing the doorbell register (a PCIe write to the SQ doorbell)
3. NVMe Controller dequeues the SQE by reading it from the host-memory SQ (a PCIe read of the SQE)
   (A) Data transfer, if applicable, goes here
4. NVMe Controller enqueues the CQE by writing it to the host-resident CQ (a PCIe write of the CQE)
5. Host Driver dequeues the CQE
(A host-side code sketch of this flow follows.)
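A minimal host-side C sketch of the flow above, reusing the struct nvme_sqe/nvme_cqe sketches from earlier. It assumes the queues and the controller's BAR0 registers are already mapped, a doorbell stride of 0, and an initial CQ phase of 1; all variable and helper names are illustrative, not taken from any real driver:

```c
#include <stdint.h>

#define NVME_DB_BASE 0x1000                 /* first doorbell offset in BAR0  */

struct nvme_queue {
    volatile uint8_t *bar;                  /* mapped controller registers     */
    struct nvme_sqe *sq;                    /* host-memory submission ring     */
    volatile struct nvme_cqe *cq;           /* host-memory completion ring     */
    uint16_t qid, depth;
    uint16_t sq_tail, cq_head;
    uint16_t cq_phase;                      /* expected phase tag, starts at 1 */
};

/* Steps 1-2: copy the SQE into the host-resident SQ, then ring the SQ tail doorbell. */
static void nvme_submit(struct nvme_queue *q, const struct nvme_sqe *sqe)
{
    q->sq[q->sq_tail] = *sqe;                               /* 1: enqueue SQE   */
    q->sq_tail = (uint16_t)((q->sq_tail + 1) % q->depth);
    volatile uint32_t *sq_db =
        (volatile uint32_t *)(q->bar + NVME_DB_BASE + 8 * q->qid);
    *sq_db = q->sq_tail;                                    /* 2: doorbell write */
}

/* Steps 4-5: poll the host-resident CQ for a new entry, then release the slot. */
static int nvme_poll_completion(struct nvme_queue *q, struct nvme_cqe *out)
{
    struct nvme_cqe cqe = q->cq[q->cq_head];                /* snapshot the entry */
    if ((cqe.status & 1) != q->cq_phase)                    /* phase tag not flipped */
        return 0;                                           /* nothing new yet    */
    *out = cqe;                                             /* 5: dequeue CQE     */
    if (++q->cq_head == q->depth) { q->cq_head = 0; q->cq_phase ^= 1; }
    volatile uint32_t *cq_db =
        (volatile uint32_t *)(q->bar + NVME_DB_BASE + 8 * q->qid + 4);
    *cq_db = q->cq_head;                                    /* CQ head doorbell   */
    return 1;
}
```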

What is NVMe over Fabrics (NVMe-oF™)?
• Built on the common NVMe architecture, with additional definitions to support message-based NVMe operations
• Expansion of NVMe semantics to remote storage
  • Retains NVMe efficiency and performance over network fabrics
  • Eliminates unnecessary protocol translations
• Standardization of NVMe over a range of fabric types
  • Initial fabrics: RDMA (RoCE, iWARP, InfiniBand™) and Fibre Channel
  • NVMe-TCP standard completion expected end of CY18
• NVMe-oF 1.0 was released in 2016

NVMe-oF: Server Access to External Storage (Spoiled for choice!)
• IP/Ethernet RDMA: RoCEv2, iWARP (standardized RDMA transport: NVMe over Fabrics 1.0)
  • RoCEv2: UDP/IP-based, requires DCB (“lossless”) Ethernet, strong industry support
  • iWARP: TCP/IP-based, better network loss tolerance, weaker industry support
  • Hardware (RNIC) implementation preferred; RNICs support other protocols, e.g., SMB Direct, iSCSI (via iSER)
• IP/Ethernet non-RDMA: TCP/IP (transport design in progress for NVMe over Fabrics 1.1)
  • Software-based implementation, leverages TCP offloads in high-volume NICs
• Fibre Channel (FC-NVMe standard)
  • Largely compatible with current/future FC hardware (e.g., data transfer), but new firmware/drivers needed
• InfiniBand: specialized, e.g., high-performance computing

What’s Special About NVMe over Fabrics? Maintaining Consistency
• Recall:
  • Multi-queue model
  • Multipathing capabilities built in
  • Optimized NVMe system
• Architecture is the same, regardless of transport
• Extends the efficiencies across the fabric

NVMe Multi-Queue Scaling
• Queue pairs scale
• Maintain consistency to multiple Subsystems
  • Each controller provides a separate set of queues, versus other models where a single set of queues is used for multiple controllers
• Efficiency remains

NVMe and NVMe-oF Models

• NVMe is a Memory-Mapped, PCIe Model

• Fabrics is a message-based transport; no shared memory

NVMe Transports

• Memory: Data & Commands/Responses use shared memory. Example: PCI Express
• Message: Data & Commands/Responses use capsules. Example: Fibre Channel
• Message & Memory: Commands/Responses use capsules; Data uses a fabric-specific data transfer mechanism. Example: RDMA (InfiniBand, RoCE, iWARP)
(Message and Message & Memory are the fabric message-based transports)

Capsule = Encapsulated NVMe Command/Completion within a transport message Data = Transport data exchange mechanism (if any)

What’s Special About NVMe-oF: Bindings
• What is a Binding?
  • “A specification of reliable delivery of data, commands, and responses between a host and an NVM subsystem for an NVMe Transport. The binding may exclude or restrict functionality based on the NVMe Transport’s capabilities”
• I.e., it’s the “glue” that links all the pieces above and below. Examples:
  • SGL descriptions
  • Data placement restrictions
  • Data transport capabilities
  • Authentication capabilities

Key Differences Between NVMe and NVMe-oF

• One-to-one mapping between I/O Submission Queues and I/O Completion Queues

• NVMe-oF does not support multiple I/O Submission Queues being mapped to a single I/O Completion Queue

• A controller is associated with only one host at a time

• NVMe over Fabrics allows multiple hosts to connect to different controllers in the NVM subsystem through the same port

• NVMe over Fabrics does not define an interrupt mechanism that allows a controller to generate a host interrupt

• It is the responsibility of the host fabric interface (e.g., Host Adapter) to generate host interrupts

• NVMe over Fabrics Queues are created using the Fabrics Connect command

• Replaces PCIe queue creation commands

• If metadata is supported, it may only be transferred as a contiguous part of the logical block

• NVMe over Fabrics does not support transferring metadata from a separate buffer

• NVMe over Fabrics does not support PRPs but requires use of SGLs for Admin, I/O, and Fabrics commands

• This differs from NVMe over PCIe where SGLs are not supported for Admin commands and are optional for I/O commands

• NVMe over Fabrics does not support Completion Queue flow control

• This requires that the host ensures there are available Completion Queue slots before submitting new commands

NVMe over Fabrics Capsules
• NVMe over Fabrics Command Capsule (see the struct sketch below)
  • Encapsulated NVMe SQE
  • May contain additional Scatter Gather Lists (SGLs) or NVMe Command Data
  • Transport-agnostic Capsule format
  • (Fields: Command ID, OpCode, NSID, Buffer Address (PRP/SGL), Command Parameters, optional additional SGL(s) or Command Data)
• NVMe over Fabrics Response Capsule
  • Encapsulated NVMe CQE
  • May contain NVMe Command Data
  • Transport-agnostic Capsule format
  • (Fields: Command Parameter, SQ Head Pointer, Command Status + Phase, Command ID, optional Command Data)
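Building on the SQE/CQE structs sketched earlier, here is a C sketch of the capsule layout just described: a fixed SQE or CQE followed by an optional, transport-negotiated region for SGLs or in-capsule data. The struct names and flexible payload members are illustrative:

```c
#include <stdint.h>

/* Command capsule: 64-byte SQE, then optional additional SGLs and/or data. */
struct nvmeof_cmd_capsule {
    struct nvme_sqe sqe;      /* encapsulated submission queue entry            */
    uint8_t payload[];        /* optional SGL descriptors / in-capsule data;
                                 maximum size is negotiated per transport        */
};

/* Response capsule: 16-byte CQE, then optional in-capsule data. */
struct nvmeof_rsp_capsule {
    struct nvme_cqe cqe;      /* encapsulated completion queue entry            */
    uint8_t payload[];        /* optional command data                          */
};
```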

NVMe Queuing on Capsule Transports
1. Host Driver encapsulates the SQE into an NVMe Command Capsule
2. The Fabric enqueues the SQE into the remote SQ by sending the Capsule as a transport-dependent message; the Controller receives the Capsule and dequeues the SQE
   (A) Data transfer, if applicable, goes here
3. The Controller encapsulates the CQE into an NVMe Response Capsule
4. The Fabric enqueues the CQE into the remote CQ by sending the Capsule; the Host receives the Capsule and decapsulates the CQE

NVMe Command Data Transfers
• The SQE contains the NVMe Command Data buffer address (the Buffer Address field holds a PRP or an SGL)
  • Physical Region Page (PRP): a physical memory page address; used only by the Memory transport (PCIe)
  • Scatter Gather List (SGL): used by both PCIe and Capsule transports (see the descriptor sketch below)
    • SGL = [Address, Length]
    • The Address may be physical, logical with a key, or capsule-offset based
    • Supports SGL lists: { [Address, Length] … [Address, Length] }
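A C sketch of the 16-byte SGL Data Block descriptor behind that "[Address, Length]" notation; the descriptor-type codes are from the NVMe base specification, while the struct and enum names are illustrative:

```c
#include <stdint.h>

/* Generic 16-byte SGL descriptor: address + length + type information. */
struct nvme_sgl_desc {
    uint64_t addr;      /* physical, keyed (logical), or capsule-offset address */
    uint32_t length;    /* byte count described by this descriptor              */
    uint8_t  rsvd[3];
    uint8_t  type;      /* SGL identifier: descriptor type + subtype            */
};

enum nvme_sgl_type {
    NVME_SGL_DATA_BLOCK   = 0x0,  /* plain [address, length] data block         */
    NVME_SGL_SEGMENT      = 0x2,  /* points to the next list of descriptors     */
    NVME_SGL_LAST_SEGMENT = 0x3,  /* final segment of a chained SGL             */
    NVME_SGL_KEYED_DATA   = 0x4,  /* keyed data block, used by RDMA transports  */
};
```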

NVMe Command Data Transfers (Controller-Initiated)
• The Controller initiates the Read or Write of the NVMe Command Data to/from the Host Memory Buffer
• Data transfer operations are transport-specific; examples:
  • PCIe Transport: PCIe Read / PCIe Write operations
  • RDMA Transport: RDMA_READ / RDMA_WRITE operations

NVMe Command Data Transfers (In-Capsule Data)
• The NVMe Command and Command Data are sent together in the Command Capsule
• Reduces latency by avoiding the Controller having to fetch the data from the Host
• The SQE SGL Entry will indicate a Capsule Offset type address

RDMA-based NVMe-oF

What is Remote Direct Memory Access (RDMA)?
• RDMA is a host-offload, host-bypass technology that allows an application (including storage) to make data transfers directly to/from another application’s memory space
• The RDMA-capable Ethernet NICs (RNICs), not the host, manage reliable connections between source and destination
• Applications communicate with the RDMA NIC using dedicated Queue Pairs (QPs) and Completion Queues (CQs)
  • Each application can have many QPs and CQs
  • Each QP has a Send Queue (SQ) and Receive Queue (RQ)

• Each CQ can be associated with multiple SQs or RQs
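To make the QP/CQ model concrete, here is a hedged libibverbs sketch that posts a one-sided RDMA_WRITE and busy-polls for its completion. It assumes a connected reliable-connection (RC) queue pair, an already-registered memory region, and a remote address/rkey obtained out of band; all of that setup is omitted:

```c
#include <stdint.h>
#include <infiniband/verbs.h>   /* libibverbs */

/* Post an RDMA_WRITE of a local, registered buffer to a remote address,
 * then busy-poll the completion queue for the work completion. */
static int rdma_write_sync(struct ibv_qp *qp, struct ibv_cq *cq,
                           struct ibv_mr *mr, void *buf, uint32_t len,
                           uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,
        .lkey   = mr->lkey,                 /* local key from ibv_reg_mr()   */
    };
    struct ibv_send_wr wr = {
        .wr_id      = 0x1234,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,    /* one-sided write, no peer CPU  */
        .send_flags = IBV_SEND_SIGNALED,    /* generate a completion entry   */
        .wr         = { .rdma = { .remote_addr = remote_addr, .rkey = rkey } },
    };
    struct ibv_send_wr *bad_wr = NULL;

    if (ibv_post_send(qp, &wr, &bad_wr))    /* enqueue onto the send queue   */
        return -1;

    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)    /* wait for the work completion  */
        ;
    return wc.status == IBV_WC_SUCCESS ? 0 : -1;
}
```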

Benefits of RDMA
• Bypass of the system software stack components that process network traffic
  • For user applications, RDMA bypasses the kernel altogether
  • For kernel applications, RDMA bypasses the OS stack and the system drivers
• Direct data placement of data from one machine (real or virtual) to another machine, without copies
• Increased bandwidth while lowering latency, jitter, and CPU utilization
• Great for networked storage!
(Diagram: standard NIC flow through the kernel stack vs. RDMA NIC flow bypassing it)

Queues, Capsules, and More Queues: Example of a Host Write to a Remote Target
• The NVMe Host Driver encapsulates the NVMe Submission Queue Entry (including data) into a fabric-neutral Command Capsule and passes it to the NVMe RDMA Transport
• Capsules are placed in the Host RNIC RDMA Send Queue and become an RDMA_SEND payload
• The Target RNIC at a Fabric Port receives the Capsule in an RDMA Receive Queue
• The RNIC places the Capsule SQE and data into target host memory
• The RNIC signals the RDMA Receive Completion to the target’s NVMe RDMA Transport
• The Target processes the NVMe Command and Data
• The Target encapsulates the NVMe Completion Entry into a fabric-neutral Response Capsule and passes it to the NVMe RDMA Transport

Source: SNIA

NVMe Multi-Queue Host Interface Map to RDMA Queue-Pair Model

Standard (local) NVMe

• NVMe Submission and Completion Queues are aligned to CPU cores • No inter-CPU software locks • Per CQ MSI-X interrupts enable source core interrupt steering

NVMe Over RDMA Fabric

• Retains NVMe SQ/CQ CPU alignment • No inter-CPU software locks • Source core interrupt steering retained by using RDMA Event Queue MSI-X interrupts

Varieties of RDMA-based NVMe-oF
• RoCE is based on the InfiniBand transport over Ethernet
• RoCEv2
  • Enhances RoCE with a UDP header (not TCP) and Internet routability
  • Uses the InfiniBand transport on top of Ethernet
• iWARP is layered on top of TCP/IP
  • Offloaded TCP/IP flow control and management
• Both iWARP and RoCE (and InfiniBand) support verbs
  • NVMe-oF using verbs can run on top of either transport

NVMe over Fabrics and RDMA

• InfiniBand

• RoCE v1 (generally not used)

• RoCE v2 (most popular vendor option)

• iWARP

Compatibility Considerations: Choose One
• iWARP and RoCE are software-compatible if written to the RDMA Verbs
• iWARP and RoCE both require RNICs
• iWARP and RoCE cannot talk RDMA to each other because of L3/L4 differences
  • iWARP adapters can talk RDMA only to iWARP adapters
  • RoCE adapters can talk RDMA only to RoCE adapters

No mix and match!

Key Takeaways: Things to Remember About RDMA-based NVMe-oF
• NVMe-oF requires the low network latency that RDMA can provide

• RDMA reduces latency, improves CPU utilization

• NVMe-oF supports RDMA verbs transparently • No changes to applications required

• NVMe-oF maps NVMe queues to RDMA queue pairs

• RoCE and iWARP are software compatible (via Verbs) but do not interoperate because their transports are different

• RoCE and iWARP

• Different vendors and ecosystem • Different network infrastructure requirements

FC-NVMe: Fibre Channel-Based NVMe-oF

Fibre Channel Protocol
• Fibre Channel has layers, just like OSI and TCP
• At the top level is the Fibre Channel Protocol (FCP)
  • Integrates with upper layer protocols, such as SCSI, FICON, and NVMe
(FC layers: FC-4 Upper Layer Protocol Interface, FC-3 Common Services, FC-2 Framing and Flow Control, FC-1 Byte Encoding, FC-0 Physical Interface)

What Is FCP?
• What’s the difference between FCP and “FCP”?
  • FCP is a data transfer protocol that carries other upper-level transport protocols (e.g., FICON, SCSI, NVMe)

• Historically FCP meant SCSI FCP, but other protocols exist now • NVMe “hooks” into FCP

• Seamless transport of NVMe traffic

• Allows high-performance HBAs to work with FC-NVMe

FCP Mapping

• The NVMe Command/Response capsules, and for some commands, data transfer, are directly mapped into FCP Information Units (IUs)

• An NVMe I/O operation is directly mapped to a Fibre Channel Exchange

FC-NVMe Information Units (IUs)

Zero Copy
• Zero-copy allows data to be sent to the user application with minimal copies
• RDMA is a semantic which encourages more efficient data handling, but you don’t need it to get efficiency
• FC had zero-copy years before there was RDMA
  • Data is DMA’d straight from the HBA to buffers passed to the user
• The difference between RDMA and FC:
  • RDMA does a lot more to enforce a zero-copy mechanism, but RDMA is not required to get zero-copy

FCP Transactions
• NVMe-oF transactions using Fibre Channel FCP look similar to RDMA
• For a Read: FCP_DATA from the Target
• For a Write: Transfer Ready, and then DATA to the Target

RDMA Transactions
• NVMe-oF over RDMA protocol transactions:
  • RDMA Write
  • RDMA Read with RDMA Read Response

FC-NVMe: More Than The Protocol

• Dedicated Storage Network

• Run NVMe and SCSI Side-by-Side

• Robust and battle-hardened discovery and name service

• Zoning and Security

• Integrated Qualification and Support

TCP-based NVMe-oF (Future)

NVMe-TCP
• NVMe™ block storage protocol over standard TCP/IP transport
• Enables disaggregation of NVMe™ SSDs without compromising latency and without requiring changes to networking infrastructure
• Independently scale storage & compute to maximize resource utilization and optimize for specific workload requirements
• Maintains the NVMe™ model: sub-systems, controllers, namespaces, admin queues, data queues
(Diagram: NVMe™ host software and an NVMe controller joined by host-side and controller-side transport abstractions spanning TCP, RDMA (RoCE, iWARP, InfiniBand), Fibre Channel, and next-generation fabrics)

NVMe-TCP Data Path Usage

• Enables NVMe-oF I/O operations in existing IP Datacenter environments

• Software-only NVMe Host Driver with NVMe-TCP transport

• Provides an NVMe-oF alternative to iSCSI for Storage Systems with PCIe NVMe SSDs
  • More efficient end-to-end NVMe operations by eliminating SCSI-to-NVMe translations
• Co-exists with other NVMe-oF transports
  • Transport selection may be based on h/w support and/or policy

Source: Dave Minturn (Intel)
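For orientation, here is a C sketch of the 8-byte common header that each NVMe/TCP PDU carries, as the TCP transport binding was ultimately specified; the field names are paraphrased, and this is background rather than part of the original deck:

```c
#include <stdint.h>

/* Common 8-byte header at the start of every NVMe/TCP PDU. */
struct nvme_tcp_hdr {
    uint8_t  pdu_type;   /* e.g., connection init, command/response capsule, data PDUs */
    uint8_t  flags;      /* header/data digest present, and similar per-PDU flags      */
    uint8_t  hlen;       /* length of the PDU header                                   */
    uint8_t  pdo;        /* PDU data offset: where in-PDU data begins                  */
    uint32_t plen;       /* total PDU length: header + data (+ digests)                */
};
```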

NVMe-TCP Control Path Usage
• Enables use of NVMe-oF on control-path networks (example: 1G Ethernet)
• Discovery Service usage
  • Discovery controllers residing on a common control network that is separate from the data-path networks
• NVMe-MI usage
  • NVMe-MI endpoints on control processors (BMC, …) with simple IP network stacks (1G Ethernet)
  • NVMe-MI on a separate control network

Source: Dave Minturn (Intel)

NVMe-TCP Standardization

• Expect NVMe over TCP standard to be ratified in 1H 2018

• The NVMe-oF 1.1 TCP ballot passed in April 2017

• NVMe Workgroup adding TCP to spec alongside RDMA

Source: Kam Eshghi (Lightbits Labs)

NVMe-oF Key Takeaways

• NVMe over Fabrics – expansion of NVMe to allow scaling of ecosystem to larger disaggregated systems • Majority of capacity is shipped in external storage

• Keep same semantics as NVMe • NVMe-oF technology is still in early stages of deployment • Primarily supported in latest Linux distros

• Proper sizing – optimize deployment to use case • Customers with RDMA experience will lean towards NVMe over RDMA

• Customers with large FC infrastructure may find transition to NVMe over FC easier • Multiple deployment options

• “Front end” or “Back end” connectivity, disaggregated/top of rack flash deployment

What’s New in NVMe 1.3 (Source: NVM Express, Inc.)

Client/Mobile
  • Boot Partitions: Enables bootstrapping of an SSD in a low-resource environment
  • Host Controlled Thermal Management: Host control to better regulate system thermals and device throttling
Data Center/Enterprise
  • Directives: Enables exchange of metadata between device and host; the first use is Streams, to increase SSD endurance and performance
  • Virtualization: Provides more flexibility with shared storage use cases and resource assignment, enabling developers to flexibly assign SSD resources to specific virtual machines
  • Emulated Controller Optimization: Better performance for software-defined NVMe controllers
Debug
  • Timestamp: Start a timer and record time from host to controller via Set and Get Features
  • Error Log Updates: Error logging and debug; root-cause problems faster
  • Telemetry: Standard command to drop telemetry data, logs
  • Device Self-Test: Internal check of SSD health; ensure devices are operating as expected
Management
  • Sanitize: Simple, fast, native way to completely erase data in an SSD, allowing more options for secure SSD reuse or decommissioning
  • Management Enhancements: Allows the same management commands in-band or out-of-band
Storage
  • SGL Dword Simplification: Simpler implementation

NVMe Status and Feature Roadmap (2014–2019)

NVMe Base
• NVMe 1.2 (Nov ’14): Namespace Management, Controller Memory Buffer, Host Memory Buffer, Live Firmware Update
• NVMe 1.2.1 (May ’16)
• NVMe 1.3: Sanitize, Streams, Virtualization
• NVMe 1.4*: IO Determinism, Persistent Event Log, Multipathing

NVMe over Fabrics
• NVMe-oF 1.0 (May ’16): Transport and protocol, RDMA binding
• NVMe-oF 1.1*: Enhanced Discovery, In-band Authentication, TCP Transport Binding

NVMe-MI
• NVMe-MI 1.0 (Nov ’15): Out-of-band management, Device discovery, Health & temperature monitoring, Firmware Update
• NVMe-MI 1.1*: SES, NVMe-MI In-band, Native Enclosure Management

(* indicates specifications still in development at the time of this session)

Additional Resources

Additional NVMe-Related Sessions
• BRKDCN-2010 - Designing Storage Networks for next decade in an All Flash Data Center; Mark Allen; Thursday, Jun 14, 10:30 a.m. - 12:00 p.m.
• BRKSAN-2883 - Advanced Design; Ed Mazurek; Tuesday, Jun 12, 01:30 p.m. - 03:30 p.m.
• BRKINI-3009 - The New HyperFlex All NVMe Node Innovations - Moving beyond All Flash; Vikas Ratna; Thursday, Jun 14, 08:30 a.m. - 10:00 a.m.

Additional Resources
• NVMe For Absolute Beginners: https://blogs.cisco.com/datacenter/nvme-for-absolute-beginners
• NVMe Program of Study: https://jmetz.com/2016/08/learning-nvme-a-program-of-study/
• A NVMe Bibliography: https://jmetz.com/2016/02/a-nvme-bibliography/
• NVMe and Fibre Channel: https://jmetz.com/2016/09/fibre-channel-and-nvme-over-fabrics/
• Other Storage Topics, Webinars, Articles, Podcasts: jmetz.com/bibliography

Summary, or, How I Learned to Stop Worrying and Love the Key Takeaways

Summary
• NVMe and NVMe-oF:
  • Treat storage like memory, just with permanence
  • Built from the ground up to support a consistent model for NVM interfaces, even across network fabrics
  • No translation to or from another protocol like SCSI (in firmware/software)
  • The inherent parallelism of NVMe’s multiple I/O Queues is exposed to the host
  • NVMe commands and structures are transferred end-to-end, and the architecture is maintained across a range of fabric types

Cisco Webex Teams
Questions? Use Cisco Webex Teams (formerly Cisco Spark) to chat with the speaker after the session.
How:
1. Find this session in the Cisco Events App
2. Click “Join the Discussion”
3. Install Webex Teams or go directly to the team space
4. Enter messages/questions in the team space
Webex Teams will be moderated by the speaker until June 18, 2018.
cs.co/ciscolivebot#BRKDCN-2690

Complete your online session evaluation

Give us your feedback to be entered into a Daily Survey Drawing. Complete your session surveys through the Cisco Live mobile app or on www.CiscoLive.com/us.

Don’t forget: Cisco Live sessions will be available for viewing on demand after the event at www.CiscoLive.com/Online.

Continue your education: demos in the Cisco campus, walk-in self-paced labs, meet-the-engineer 1:1 meetings, related sessions

Thank you
