S7281: Device Lending: Dynamic Sharing of GPUs in a PCIe Cluster
Jonas Markussen, PhD student, Simula Research Laboratory

Outline
• Motivation
• PCIe Overview
• Non-Transparent Bridges
• Device Lending

Motivation

Distributed applications may need to access and use IO resources that are physically located inside remote hosts.

[Figure: a front-end connected over an interconnect (control + signaling + data) to multiple compute nodes]

Software abstractions simplify the use and allocation of resources in a cluster and facilitate development of distributed applications:
Handled in software:
• rCUDA
• CUDA-aware Open MPI
• Custom GPUDirect RDMA implementation
• …
[Figure: logical view of resources. A local resource: application → CUDA library + driver → local PCIe IO bus. A remote resource using middleware: application → CUDA–middleware integration → middleware service → interconnect transport (RDMA) → interconnect → interconnect transport (RDMA) → middleware service/daemon → remote CUDA driver → remote PCIe IO bus]

In PCIe clusters, the same fabric is used both as the local IO bus within a single node and as the interconnect between separate nodes.
[Figure: two hosts, each with CPU and chipset, RAM on the memory bus, and a PCIe NTB host adapter, joined by an external PCIe cable through a PCIe interconnect switch; IO devices on both hosts sit on the same fabric]

[Figure: over the native PCIe fabric, a remote resource uses the same software stack as a local one: application → CUDA library + driver → PCIe IO bus → PCIe-based interconnect → remote PCIe IO bus]

PCIe Overview

PCIe is the dominant IO bus technology in computers today, and can also be used as a high-bandwidth, low-latency interconnect.
[Chart: theoretical PCIe bandwidth in gigabytes per second (GB/s) for x4, x8, and x16 links across Gen 2, Gen 3, and Gen 4; a Gen 4 x16 link reaches roughly 31.5 GB/s]
Source: PCI-SIG, PCI Express 3.1 Base Specification, 2010; http://www.eetimes.com/document.asp?doc_id=1259778

Memory reads and writes are handled by PCIe as transactions that are packet-switched through the fabric based on the address:
• Upstream
• Downstream
• Peer-to-peer (shortest path)

[Figure: a PCIe tree with the CPU, chipset, and RAM at the root and PCIe devices below, illustrating upstream, downstream, and peer-to-peer routing]
IO devices and the CPU share the same physical address space, allowing devices to access system memory and other devices.

[Figure: the physical address space from 0x00000… to 0xFFFFF…, with RAM and IO device regions mapped side by side and interrupt vectors at 0xfee00xxx]

• Memory-mapped IO (MMIO / PIO)
• Direct Memory Access (DMA)
• Message-Signaled Interrupts (MSI-X)

Non-Transparent Bridges

Remote address space can be mapped into local address space by using PCIe Non-Transparent Bridges (NTBs).
[Figure: local and remote hosts connected through PCIe NTB adapters; the NTB address mapping translates local address 0xf000 to remote address 0x9000, so part of the remote host's RAM appears in the local address space]

Using NTBs, each node in the cluster takes part in a shared address space and has its own “window” into the global address space.
[Figure: nodes A, B, and C each export an address range into a shared global address space over the NTB-based interconnect; each node's own address space contains its local RAM and IO devices plus windows onto the ranges exported by the other nodes]

Device Lending

A remote IO device can be “borrowed” by mapping it into local address space, making it appear locally installed in the system.
[Figure: the owner's physical device is mapped through the NTB adapters into the borrower's address space (e.g. owner address 0xb000 appearing at borrower address 0x2000) and announced to the borrower's OS with PCIe hot-plug; the borrower's device driver binds to the “inserted” device]

By intercepting DMA API calls to set up IOMMU mappings and inject reverse NTB mappings, the physical location of the device is made completely transparent.
[Figure: the borrower's driver calls dma_addr = dma_map_page(0x9000) for a page in borrower RAM; the lending layer installs an owner-side IOMMU entry mapping device-visible IOVA 0x5000 to physical address 0xf000 (the NTB window), and the NTB maps 0xf000 back to borrower address 0x9000]
Borrowed remote resource:
• Resource appears local to OS, driver, and application
• Unmodified local driver (with hot-plug support)
• Hardware mappings ensure a fast data path
• Works with any PCIe device (even individual SR-IOV functions)

[Figure: software stacks compared. Borrowed remote resource: application → CUDA library + driver → local PCIe IO bus → PCIe NTB interconnect → remote PCIe IO bus. Remote resource using middleware: application → CUDA–middleware integration → middleware service → interconnect transport (RDMA) → interconnect → middleware service/daemon → remote CUDA driver → remote PCIe IO bus. Local resource: application → CUDA library + driver → local PCIe IO bus]

Device-to-host memory transfer
[Chart: device-to-host memory transfer bandwidth in GB/s versus transfer size (4 KB to 16 MB), comparing bandwidthTest (Local), bandwidthTest (Borrowed), and PXH830 DMA (GPUDirect RDMA)]
Test setup: Nvidia Quadro P400 GPU, Nvidia driver 375.26 (CentOS 7), Xeon E5-1630 3.7 GHz CPU, DDR4 2133 MHz memory. Benchmarks: (1) Nvidia CUDA 8.0 Samples bandwidthTest; (2) GPUDirect RDMA benchmark using Dolphin NTB DMA (https://github.com/Dolphinics/cuda-rdma-bench).

Using Device Lending, nodes in a PCIe cluster can share resources through a process of borrowing and giving back devices.
[Figure: three nodes running tasks A, B, and C, each with its own CPU + chipset and RAM, connected through NTBs to a shared device pool of GPUs, SSDs, NICs, and FPGAs; devices are borrowed from the pool as each task needs them]
[Figure: the device pool hosted in a server room, serving multiple examination rooms]

Use case: EIR – efficient computer-aided diagnosis framework for gastrointestinal examination (http://mlab.no/blog/2016/12/eir/)

Moving forward
• Strategy-based management
• Fail-over mechanisms
• VFIO and other API integration (“SmartIO”)
• Borrowing vGPU functions

Thank you!
Selected publications:
• “Device Lending in PCI Express Networks”, ACM NOSSDAV 2016
• “Efficient Processing of Videos in a Multi Auditorium Environment using Device Lending of GPUs”, ACM Multimedia Systems 2016 (MMSys’16)
• “PCIe Device Lending”, University of Oslo, 2015

Contact: [email protected]

Device Lending demo and more: visit Dolphin in the exhibition area (booth 625).