S7281: Device Lending: Dynamic Sharing of GPUs in a PCIe Cluster

Jonas Markussen
PhD student, Simula Research Laboratory

Outline

• Motivation

• PCIe Overview

• Non-Transparent Bridges

• Device Lending

Distributed applications may need to access and use IO resources that are physically located inside remote hosts

[Figure: a front-end and several compute nodes connected by an interconnect that carries control, signaling, and data]

Software abstractions simplify the use and allocation of resources in a cluster and facilitate development of distributed applications

[Figure: the same cluster, with control, signaling, and data handled in software]

• rCUDA

• CUDA-aware Open MPI

• Custom GPUDirect RDMA implementation

• . . .

Logical view of resources

[Figure: two software stacks compared. Local resource: application, CUDA library + driver, PCIe IO bus. Remote resource using middleware: application, CUDA-middleware integration, middleware service, and interconnect transport (RDMA) on the local side; the interconnect; interconnect transport (RDMA), middleware service/daemon, CUDA driver, and PCIe IO bus on the remote side]

In PCIe clusters, the same fabric is used both as local IO bus within a single node and as the interconnect between separate nodes

[Figure: inside a node, the CPU and chipset connect to RAM over the memory bus and to a PCIe bus switch with PCIe IO devices; a PCIe interconnect host adapter links the node over an external PCIe cable to a PCIe interconnect switch]

[Figure: two software stacks compared. Local resource: application, CUDA library + driver, local PCIe IO bus. Remote resource over native fabric: application, CUDA library + driver, local PCIe IO bus, PCIe-based interconnect, remote PCIe IO bus]

PCIe Overview

PCIe is the dominant IO bus technology today, and can also be used as a high-bandwidth, low-latency interconnect
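The bandwidth claim can be checked with a back-of-the-envelope calculation. The sketch below multiplies the per-lane signalling rate by the line-encoding efficiency and the lane count. The function name is made up for illustration, and the result is a theoretical one-direction ceiling that ignores packet and protocol overhead, so measured numbers will be lower.

```python
# Per-lane signalling rate in gigatransfers per second and line-encoding
# efficiency for each PCIe generation.
GENERATIONS = {
    2: (5.0, 8 / 10),      # 8b/10b encoding
    3: (8.0, 128 / 130),   # 128b/130b encoding
    4: (16.0, 128 / 130),  # 128b/130b encoding
}


def link_bandwidth_gb_per_s(generation, lanes):
    """Theoretical one-direction link bandwidth in GB/s (10^9 bytes per second)."""
    rate_gt_per_s, efficiency = GENERATIONS[generation]
    bits_per_second = rate_gt_per_s * 1e9 * efficiency * lanes
    return bits_per_second / 8 / 1e9


if __name__ == "__main__":
    for generation in (2, 3, 4):
        row = ", ".join(
            f"x{lanes}: {link_bandwidth_gb_per_s(generation, lanes):5.2f} GB/s"
            for lanes in (4, 8, 16)
        )
        print(f"Gen {generation}  {row}")
```

Under these assumptions a Gen 3 x16 link tops out just under 16 GB/s and a Gen 4 x16 link just under 32 GB/s.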

[Figure: bar chart of bandwidth in gigabytes per second (GB/s, axis from 0 to 35) for PCIe x4, x8, and x16 links at Gen 2, Gen 3, and Gen 4]

PCI-SIG. PCI Express 3.1 Base Specification, 2010. http://www.eetimes.com/document.asp?doc_id=1259778

Memory reads and writes are handled by PCIe as transactions that are packet-switched through the fabric depending on the address

[Figure: CPU and chipset with RAM, and three PCIe devices; transactions travel in one of three ways]

• Upstream

• Downstream

• Peer-to-peer (shortest path)
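Address-based routing can be modelled as a lookup over the address windows behind each switch port. The sketch below is a simplified model, not how any particular switch is implemented; `Port`, `route`, and the address windows in the usage example are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    """A downstream switch port and the address window routed through it."""
    name: str
    base: int
    limit: int  # inclusive upper bound of the window

    def claims(self, address: int) -> bool:
        return self.base <= address <= self.limit


def route(ports: list[Port], ingress: str, address: int) -> str:
    """Return where the switch forwards a memory transaction.

    An address inside another downstream port's window is forwarded to that
    port: downstream if it arrived from the upstream side, peer-to-peer if it
    arrived from a sibling port. Every other address is forwarded upstream.
    """
    for port in ports:
        if port.name != ingress and port.claims(address):
            return port.name
    return "upstream"


# Hypothetical address windows for two devices behind one switch.
PORTS = [
    Port("gpu", 0xD000_0000, 0xDFFF_FFFF),
    Port("nic", 0xE000_0000, 0xE00F_FFFF),
]

if __name__ == "__main__":
    print(route(PORTS, "upstream", 0xD000_1000))  # CPU write to the GPU: downstream
    print(route(PORTS, "nic", 0xD000_1000))       # NIC write to the GPU: peer-to-peer
    print(route(PORTS, "gpu", 0x0000_1000))       # GPU DMA to RAM: upstream
```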

IO devices and the CPU share the same physical address space, allowing devices to access system memory and other devices

[Figure: a single address space running from 0x00000… to 0xFFFFF… that contains RAM regions, IO device regions, and the interrupt vectors at 0xfee00xxx, shown next to the CPU and chipset, RAM, and three PCIe devices]

• Memory-mapped IO (MMIO / PIO)

• Direct Memory Access (DMA)

• Message-Signaled Interrupts (MSI-X)

Non-Transparent Bridges

Remote address space can be mapped into local address space by using PCIe Non-Transparent Bridges (NTBs)

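[Figure: a local host and a remote host, each with CPU and chipset, RAM, and a PCIe NTB adapter; the local address space contains local RAM and an NTB region, and the NTB address mapping table maps local 0xf000 to remote 0x9000]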
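The translation an NTB performs can be sketched as a base-and-offset rewrite: an access that falls inside the local window is forwarded to the same offset from a base address in the remote host. The example reuses the 0xf000 to 0x9000 mapping from the slide; the class name and the 4 KiB window size are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NtbWindow:
    local_base: int   # start of the window in the local address space
    remote_base: int  # address the window points to in the remote host
    size: int         # window length in bytes (assumed value below)

    def translate(self, local_address: int) -> int:
        """Rewrite a local address inside the window into a remote address."""
        offset = local_address - self.local_base
        if not 0 <= offset < self.size:
            raise ValueError(f"{local_address:#x} is outside the NTB window")
        return self.remote_base + offset


# Mapping from the slide: local 0xf000 -> remote 0x9000.
WINDOW = NtbWindow(local_base=0xF000, remote_base=0x9000, size=0x1000)

if __name__ == "__main__":
    print(hex(WINDOW.translate(0xF010)))
```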

Using NTBs, each node in the cluster takes part in a shared address space and has its own “window” into the global address space

[Figure: nodes A, B, and C on an NTB-based interconnect; each node's address space holds its local RAM, its local IO devices, and a window into the global address space, which is made up of the address ranges exported by A, B, and C]

Device Lending

A remote IO device can be “borrowed” by mapping it into local address space, making it appear locally installed in the system

[Figure: the device driver runs on the borrower while the physical device sits in the owner; each host has CPU and chipset, RAM, and an NTB adapter. The NTB address mapping table lists remote 0xb000 and local 0x2000, and the device shows up on the borrower as an inserted device through PCIe hot-plug. Address labels under the devices: 0xb000, 0xe000, 0x1000, 0x2000]

By intercepting DMA API calls to set up IOMMU mappings and inject reverse NTB mappings, the physical location of the device is made completely transparent

[Figure: the device driver on the borrower calls dma_addr = dma_map_page(0x9000); the NTB address mapping table maps local 0x5000 to remote 0x9000, the IOMMU table maps IO virtual address 0xf000 to physical address 0x5000, and the address to use is 0xf000]
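The chain of mappings in the figure can be traced with a small model. This is a sketch under stated assumptions, not the actual kernel implementation: it assumes 4 KiB pages and assumes that both the reverse NTB mapping and the IOMMU mapping live on the owner's side. `PageTable`, `LentDevice`, and the `ntb_slot` and `iova` parameters are hypothetical names, and the addresses are the ones shown on the slide.

```python
PAGE_SIZE = 0x1000  # assumed 4 KiB pages


class PageTable:
    """Page-granular translation table, used here for both the NTB and the IOMMU."""

    def __init__(self):
        self._pages = {}

    def map(self, source, target):
        self._pages[source // PAGE_SIZE] = target // PAGE_SIZE

    def translate(self, address):
        page, offset = divmod(address, PAGE_SIZE)
        return self._pages[page] * PAGE_SIZE + offset


class LentDevice:
    """Models the mappings set up when the borrower's driver maps a page for DMA."""

    def __init__(self):
        self.ntb = PageTable()    # owner-side address -> borrower physical address
        self.iommu = PageTable()  # IO virtual address -> owner-side address

    def dma_map_page(self, borrower_page, ntb_slot, iova):
        """Intercepted mapping call: returns the address the device should use."""
        self.ntb.map(ntb_slot, borrower_page)  # reverse NTB mapping
        self.iommu.map(iova, ntb_slot)         # IOMMU mapping in front of the device
        return iova

    def device_access(self, dma_address):
        """Follow a device-issued address through the IOMMU and then the NTB."""
        return self.ntb.translate(self.iommu.translate(dma_address))


if __name__ == "__main__":
    device = LentDevice()
    dma_addr = device.dma_map_page(0x9000, ntb_slot=0x5000, iova=0xF000)
    print(hex(dma_addr), hex(device.device_access(dma_addr + 0x10)))
```

The driver only ever sees the returned address (0xf000 here), which is why it can stay unaware that the page it mapped sits in another host's RAM.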

Borrowed remote resource

• Resource appears local to OS, driver, and app

• Unmodified local driver (with hot-plug support)

• Hardware mappings ensure fast data path

• Works with any PCIe device (even individual SR-IOV functions)

[Figure: software stack of a borrowed remote resource: application, CUDA library + driver, local PCIe IO bus, PCIe NTB interconnect, remote PCIe IO bus]

[Figure: borrowed remote resource compared with remote resource using middleware. Borrowed: application, CUDA library + driver, local PCIe IO bus, PCIe NTB interconnect, remote PCIe IO bus. Middleware: application, CUDA-middleware integration, middleware service, interconnect transport (RDMA), interconnect, interconnect transport (RDMA), middleware service/daemon, CUDA driver, remote PCIe IO bus]

[Figure: borrowed remote resource compared with local resource. Borrowed: application, CUDA library + driver, local PCIe IO bus, PCIe NTB interconnect, remote PCIe IO bus. Local: application, CUDA library + driver, local PCIe IO bus]

Device-to-host memory transfer

[Figure: bandwidth in gigabytes per second (GB/s, axis from 0 to 14) against transfer size (4 KB to 16 MB) for three series: bandwidthTest (Local), bandwidthTest (Borrowed), and PXH830 DMA (GPUDirect RDMA)]

1. Nvidia CUDA 8.0 Samples bandwidthTest
2. GPUDirect RDMA benchmark using Dolphin NTB DMA https://github.com/Dolphinics/cuda-rdma-bench

GPU: P400
Driver: version 375.26 (CentOS 7)
CPU: E5-1630 3.7 GHz
Memory: DDR4 2133 MHz

Using Device Lending, nodes in a PCIe cluster can share resources through a process of borrowing and giving back devices

[Figure: three hosts, each with RAM, CPU + chipset, and an NTB, run tasks A, B, and C; their GPUs, SSDs, NICs, and FPGAs form a shared device pool]
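The borrow and give-back cycle can be pictured as bookkeeping over a shared pool. The sketch below is purely illustrative: `DevicePool` and its methods are hypothetical and do not correspond to the Device Lending API, and it ignores the NTB mapping and hot-plug steps that make a borrowed device usable.

```python
class DevicePool:
    """Tracks which node currently borrows each device in the cluster."""

    def __init__(self, devices):
        # Maps device name -> borrowing node, or None while the device is free.
        self._borrower = {name: None for name in devices}

    def borrow(self, kind, node):
        """Lend the first free device whose name starts with `kind` to `node`."""
        for name, borrower in self._borrower.items():
            if borrower is None and name.startswith(kind):
                self._borrower[name] = node
                return name
        raise LookupError(f"no free {kind} in the pool")

    def give_back(self, name, node):
        """Return a device; only the node that borrowed it may do so."""
        if self._borrower.get(name) != node:
            raise PermissionError(f"{node} does not hold {name}")
        self._borrower[name] = None

    def free(self):
        """Names of the devices that are not lent out right now."""
        return [name for name, borrower in self._borrower.items() if borrower is None]


if __name__ == "__main__":
    pool = DevicePool(["gpu0", "gpu1", "ssd0"])
    held = pool.borrow("gpu", "task-a")
    print(held, pool.free())
    pool.give_back(held, "task-a")
    print(pool.free())
```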

EIR – Efficient aided diagnosis framework for gastrointestinal examination

http://mlab.no/blog/2016/12/eir/

[Figure: a server room and two examination rooms]

Moving forward

• Strategy-based management

• Fail-over mechanisms

• VFIO and other API integration (“SmartIO”)

• Borrowing vGPU functions

Thank you!

Selected publications

“Device Lending in PCI Express Networks”
ACM NOSSDAV 2016

“Efficient Processing of Video in a Multi Auditory Environment using Device Lending of GPUs”
ACM Multimedia Systems 2016 (MMSys’16)

“PCIe Device Lending”
University of Oslo 2015

My email address: [email protected]

Device Lending demo and more: visit Dolphin in the exhibition area (booth 625)