Farewell to Servers: Hardware, Software, and Network Approaches towards Datacenter Resource Disaggregation
Yiying Zhang

Monolithic Computer
[Figure: applications run on an OS / hypervisor that manages one machine's processors, memory, and devices.]

Can monolithic servers continue to meet datacenter needs?
- Application demands: heterogeneity, flexibility, perf/$
- New hardware keeps arriving: TPU, GPU, FPGA, HBM, ASIC, NVM, NVMe, DNA storage
Making new hardware work with existing servers is like fitting puzzle pieces.

Can monolithic servers continue to meet datacenter needs?
Poor Hardware Elasticity
• Hard to change hardware components
- Add (hotplug), remove, reconfigure, restart
• No fine-grained failure handling
- The failure of one device can crash a whole machine
Can monolithic servers continue to meet datacenter needs?
Poor Resource Utilization
• Whole VM/container has to run on one physical machine
- Move current applications to make room for new ones
[Figure: a new job cannot fit on either Server 1 or Server 2 alone, even though enough CPU and memory are available across the two servers; the leftover resources are wasted.]

Resource Utilization in Production Clusters
[Figure: unused resources and waiting/killed jobs caused by physical-node constraints, from the Google and Alibaba production cluster traces: "https://github.com/google/cluster-data", "https://github.com/alibaba/clusterdata".]
Can monolithic servers continue to meet datacenter needs?
How to achieve better heterogeneity, flexibility, and perf/$?
Go beyond the physical node boundary.
Resource Disaggregation:
Breaking monolithic servers into network- attached, independent hardware components
[Figure: applications run on a pool of independent hardware components connected by the network, improving flexibility, heterogeneity, and perf/$.]
Why Possible Now?
• Network is faster
  - InfiniBand (200Gbps, 600ns), optical fabric (400Gbps, 100ns)
• More processing power at the device
  - SmartNIC, SmartSSD, PIM
• Network interface closer to the device
  - Omni-Path, Innova-2
• Related proposals and prototypes: Berkeley Firebox, Intel Rack-Scale System, HP The Machine, IBM Composable System

Disaggregated Datacenter End-to-End Solution
[Figure: the end-to-end stack of application, distributed systems, OS, network, and hardware, and its goals: unmodified applications, performance, heterogeneity, flexibility, reliability, and low cost.]

Disaggregated Datacenter End-to-End Solution
Physically Disaggregated Resources
Disaggregated Operating System (OSDI’18)
New Processor and Memory Architecture
Networking for Disaggregated Resources
Kernel-Level RDMA Virtualization (SOSP’17)
RDMA Network

Can Existing Kernels Fit?
[Figure: a monolithic or micro-kernel (e.g., Linux, L4) manages a single server through shared main memory; a multikernel (e.g., Barrelfish, Helios, fos) runs a kernel per core or device but still within one server; neither is designed for hardware components spread across a network.]
Existing Kernels Don't Fit
• Access to remote resources across the network
• Distributed resource management
• Fine-grained failure handling
When hardware is disaggregated, the OS should be disaggregated too.
[Figure: a monolithic OS bundles the virtual memory system, the file & storage system, process management, and the network stack on one machine; disaggregation splits these components apart and connects them over the network.]
The Splitkernel Architecture
• Split OS functions into monitors
• Run each monitor at its hardware device: a process monitor at the CPU, a GPU monitor at the GPU, a memory monitor at the memory component, storage monitors at NVM/HDD/SSD, and new monitors for new hardware (XPU)
• Network messaging across non-coherent components (a minimal message-format sketch follows below)
• Distributed resource management and failure handling
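Monitors run on non-coherent components, so they can only coordinate through explicit network messages. As an illustration only (the talk does not show LegoOS's actual wire format), a splitkernel-style request could carry a small fixed header like the one below; every name here is an assumption.

```c
/*
 * Illustrative sketch only: the message layout and names (lego_msg_hdr,
 * MSG_PAGE_FETCH, ...) are invented to convey the flavor of splitkernel
 * messaging, not LegoOS's real protocol.
 */
#include <stdint.h>

enum lego_msg_op {
    MSG_PAGE_FETCH,    /* processor monitor -> memory monitor  */
    MSG_PAGE_FLUSH,    /* write back a dirty page               */
    MSG_STORAGE_READ,  /* read from a storage monitor           */
    MSG_PROC_EXIT,     /* tell managers a process has exited    */
};

/* Fixed header prepended to every inter-monitor message. */
struct lego_msg_hdr {
    uint32_t op;        /* one of enum lego_msg_op               */
    uint32_t src_node;  /* sending hardware component id         */
    uint32_t dst_node;  /* receiving hardware component id       */
    uint32_t pid;       /* global process id                     */
    uint64_t addr;      /* virtual address the request refers to */
    uint64_t len;       /* length of the payload that follows    */
};
```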
LegoOS: The First Disaggregated OS
[Figure: LegoOS-managed processor, memory, and storage/NVM components.]
How Should LegoOS Appear to Users?
As a set of hardware devices? As a giant machine?
• Our answer: as a set of virtual nodes (vNodes)
  - Similar semantics to virtual machines
  - Unique vID, vIP, and storage mount point
  - Can run on multiple processor, memory, and storage components (see the sketch below)
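To make the vNode abstraction concrete, here is a hypothetical descriptor that a global manager might keep per vNode. The fields mirror the bullets above (vID, vIP, storage mount point, component lists), but the struct itself is an assumption, not LegoOS code.

```c
/* Hypothetical vNode descriptor; all field names are illustrative. */
#include <stdint.h>

#define MAX_COMPONENTS 16

struct vnode {
    uint32_t vid;                          /* unique vNode id               */
    uint32_t vip;                          /* virtual IP (IPv4 for brevity) */
    char     storage_mount[64];            /* vNode's storage mount point   */
    uint32_t pcomponents[MAX_COMPONENTS];  /* processor component ids       */
    uint32_t mcomponents[MAX_COMPONENTS];  /* memory component ids          */
    uint32_t scomponents[MAX_COMPONENTS];  /* storage component ids         */
};
```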
Abstraction: vNode
[Figure: two vNodes mapped onto the pool of processor, memory, NVM, and storage components; vNodes can overlap on the same components.]
One vNode can run on multiple hardware components.
One hardware component can run multiple vNodes.

Abstraction
• Appears to users as vNodes
• Linux ABI compatible
  - Supports the unmodified Linux system call interface (the common calls)
  - A level of indirection translates the Linux interface into the LegoOS interface (see the sketch below)
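As an illustration of that indirection layer, the sketch below shows how a common call such as read() could be marshaled into a request for the storage monitor. The helper monitor_rpc() and the request layout are assumptions for the example, not LegoOS's real interface.

```c
/*
 * Illustrative only: one possible shape of the syscall-forwarding layer.
 * monitor_rpc() and storage_read_req are invented names.
 */
#include <stdint.h>
#include <stddef.h>
#include <sys/types.h>

struct storage_read_req {
    uint32_t pid;      /* global process id */
    int      fd;       /* file descriptor   */
    uint64_t count;    /* bytes requested   */
};

/* Assumed helper: synchronous RPC to a monitor over the RDMA network. */
ssize_t monitor_rpc(uint32_t dst_component, const void *req, size_t req_len,
                    void *reply, size_t reply_len);

/* read() as issued by an unmodified application, forwarded to storage. */
ssize_t lego_sys_read(uint32_t pid, uint32_t storage_component,
                      int fd, void *buf, size_t count)
{
    struct storage_read_req req = { .pid = pid, .fd = fd, .count = count };
    return monitor_rpc(storage_component, &req, sizeof(req), buf, count);
}
```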
LegoOS Design
1. Clean separation of OS and hardware functionalities
2. Build monitor with hardware constraints
3. RDMA-based message passing for both kernel and applications
4. Two-level distributed resource management
5. Memory failure tolerance through replication
Separate Processor and Memory
[Figure: in today's server, the CPUs, all levels of cache, the TLB/MMU, and DRAM holding the page tables sit together in one box. LegoOS separates them and moves the virtual memory system across the network into the memory component.]
• Processor components only see virtual memory addresses
  - All levels of cache are virtual caches
• Memory components manage both virtual and physical memory
  (A sketch of how a memory component could serve a virtual-address access follows below.)
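Because the processor side never sees physical addresses, an access that misses all processor-side caches reaches the memory component tagged with a virtual address, and the memory monitor does the translation itself. Below is a minimal sketch under that assumption; vma_translate() and local_dram are invented names.

```c
/*
 * Illustrative only: memory-monitor side of a virtual-address read.
 * Assumes the access does not cross a page boundary, for brevity.
 */
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

/* Assumed lookup in the monitor's per-process virtual memory metadata. */
bool vma_translate(uint32_t pid, uint64_t vaddr, uint64_t *paddr_out);

/* Base of this memory component's local DRAM. */
extern uint8_t *local_dram;

int mem_monitor_read(uint32_t pid, uint64_t vaddr, void *reply, uint64_t len)
{
    uint64_t paddr;

    if (!vma_translate(pid, vaddr, &paddr))
        return -1;                      /* would kick off fault handling */

    memcpy(reply, local_dram + paddr, len);
    return 0;
}
```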
Challenge: the network is 2x-4x slower than the memory bus.
Add Extended Cache at the Processor
• Add a small amount of DRAM/HBM at the processor component
• Use it as an Extended Cache, or ExCache, in front of remote memory
• Software and hardware co-managed
• Inclusive
• Virtual cache
(A sketch of the software-managed miss path follows below.)
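The slides say ExCache uses 4KB pages as cache lines, with hardware on the hit path and software on the miss path. Below is a hedged sketch of what that software miss path could look like; excache_line, mem_fetch_page, and mem_writeback_page are invented for the example.

```c
/* Illustrative ExCache miss handler; all names are assumptions. */
#include <stdint.h>

#define EXCACHE_LINE_SIZE 4096ULL   /* 4KB page as cache line */

struct excache_line {
    uint64_t tag;                     /* virtual page number cached here */
    uint8_t  valid;
    uint8_t  dirty;
    uint8_t  data[EXCACHE_LINE_SIZE];
};

/* Assumed helpers: RDMA-based transfer to/from the memory component. */
void mem_fetch_page(uint32_t pid, uint64_t vpage, void *dst);
void mem_writeback_page(uint32_t pid, uint64_t vpage, const void *src);

/* On a miss: write back the victim if dirty, then fill it with the new page. */
void excache_miss(uint32_t pid, uint64_t vaddr, struct excache_line *victim)
{
    uint64_t vpage = vaddr / EXCACHE_LINE_SIZE;

    if (victim->valid && victim->dirty)
        mem_writeback_page(pid, victim->tag, victim->data);

    mem_fetch_page(pid, vpage, victim->data);
    victim->tag   = vpage;
    victim->valid = 1;
    victim->dirty = 0;
}
```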
LegoOS Design
1. Clean separation of OS and hardware functionalities
2. Build monitor with hardware constraints
3. RDMA-based message passing for both kernel and applications
4. Two-level distributed resource management
5. Memory failure tolerance through replication
Distributed Resource Management
• Global resource managers: a Global Process Manager (GPM), a Global Memory Manager (GMM), and a Global Storage Manager (GSM)
• Per-component monitors make fine-grained, local decisions (the second level of the two-level scheme)
• The global managers perform:
  1. Coarse-grain allocation
  2. Load balancing
  3. Failure handling
(A sketch of one possible allocation policy follows below.)
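As one concrete (and invented) example of coarse-grain allocation, a Global Memory Manager could place each new region on the memory component with the most free space, leaving fine-grained virtual memory management to that component's monitor.

```c
/* Illustrative placement policy for a Global Memory Manager; not LegoOS code. */
#include <stdint.h>
#include <stddef.h>

struct mem_component {
    uint32_t id;
    uint64_t free_bytes;
};

/* Return the index of the best-fitting component, or -1 if nothing fits. */
int gmm_pick_component(const struct mem_component *comps, size_t n,
                       uint64_t size)
{
    int best = -1;
    uint64_t best_free = 0;

    for (size_t i = 0; i < n; i++) {
        if (comps[i].free_bytes >= size && comps[i].free_bytes > best_free) {
            best = (int)i;
            best_free = comps[i].free_bytes;
        }
    }
    return best;
}
```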
Implementation and Emulation
• Processor
  - Reserve DRAM as ExCache (4KB page as cache line)
  - Hardware only on the hit path; software-managed miss path
  - Indirection layer stores state for 113 Linux syscalls
• Memory
  - Limited number of cores, kernel space only
• Storage / global resource monitors
  - Implemented as kernel modules on Linux
• Network
  - RDMA RPC stack based on LITE [SOSP'17]

Performance Evaluation
• Unmodified TensorFlow, running CIFAR-10
  - Working set: 0.9GB, 4 threads
• Systems in comparison
  - Baseline: Linux with unlimited memory
  - Linux swapping to SSD and to ramdisk
  - InfiniSwap [NSDI'17]
  - LegoOS (1 processor, 1 memory, 1 storage component)
[Figure: slowdown over the baseline as ExCache/memory size varies from 128MB to 512MB.]
Only 1.3x to 1.7x slowdown when disaggregating devices with LegoOS, in exchange for better resource packing, elasticity, and fault tolerance!

LegoOS Summary
• Resource disaggregation calls for a new system
• LegoOS: a new OS designed and built from scratch for datacenter resource disaggregation
• Splits the OS into distributed micro-OS services, each running at a hardware device
• Many challenges and much potential
Disaggregated Datacenter
flexible, heterogeneous, elastic, perf/$, resilient, scalable, easy-to-use
Physically Disaggregated Resources
Disaggregated Operating System (OSDI'18)
New Processor and Memory Architecture
Networking for Disaggregated Resources
Kernel-Level RDMA Virtualization (SOSP’17)
RDMA Network

Network Requirements for Resource Disaggregation
• Low latency
• High bandwidth
• Scalable
• Reliable
RDMA (Remote Direct Memory Access)
• Directly read/write remote memory
• Bypass the kernel
• Zero-copy memory access
• Benefits over sockets on Ethernet: low latency, high throughput, low CPU utilization
[Figure: with sockets over Ethernet, data crosses the kernel and CPU on both hosts; with RDMA, the NICs move data directly between the two hosts' application memory. A minimal verbs sketch follows below.]
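For concreteness, here is a minimal sketch of posting a one-sided RDMA write with the standard libibverbs API, assuming a connected queue pair, a registered local buffer, and a remote address and rkey already exchanged out of band; it is a generic verbs example, not code from the talk.

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* Post a one-sided RDMA write; returns 0 on success. */
int rdma_write_example(struct ibv_qp *qp, struct ibv_mr *mr,
                       void *local_buf, uint32_t len,
                       uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = mr->lkey,              /* local key from ibv_reg_mr()   */
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE, /* one-sided: no remote CPU work */
        .send_flags = IBV_SEND_SIGNALED,
    };
    wr.wr.rdma.remote_addr = remote_addr;  /* peer's virtual address */
    wr.wr.rdma.rkey        = rkey;         /* peer's remote key      */

    struct ibv_send_wr *bad_wr = NULL;
    return ibv_post_send(qp, &wr, &bad_wr);
}
```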
Things have worked well in HPC:
• Special hardware
• Few applications
• Cheaper developers
RDMA-Based Datacenter Applications
Pilaf [ATC '13], HERD-RPC [ATC '16], Cell [ATC '16], FaRM [NSDI '14], Wukong [OSDI '16], FaSST [OSDI '16], Hotpot [SoCC '17], NAM-DB [VLDB '17], HERD [SIGCOMM '14], RSI [VLDB '16], APUS [SoCC '17], Octopus [ATC '17], DrTM [SOSP '15], DrTM+R [EuroSys '16], FaRM+Xact [SOSP '15], Mojim [ASPLOS '15]

Things have worked well in HPC:
• Special hardware
• Few applications
• Cheaper developers
What about datacenters?
• Commodity, cheaper hardware
• Many (changing) applications
• Resource sharing and isolation
Native RDMA
[Figure: the user-level RDMA application itself manages connections, memory regions, queues, and keys (node IDs, lkeys, rkeys, addresses); the user-space library bypasses the OS kernel; the RNIC performs permission checks and address mapping using on-NIC caches of keys and page-table entries. A registration sketch follows below.]
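The bookkeeping burden starts at memory registration: the application must register every exported buffer with ibv_reg_mr() and ship the resulting rkey (plus the buffer address) to each peer itself. A minimal sketch follows; the struct and helper names around the verbs calls are our own.

```c
#include <stdint.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

struct remote_buf_info {    /* what each peer must learn about this buffer */
    uint64_t addr;
    uint32_t rkey;
};

/* Allocate and register a buffer that remote peers may read and write. */
struct ibv_mr *register_exported_buffer(struct ibv_pd *pd, size_t len,
                                        struct remote_buf_info *out)
{
    void *buf = malloc(len);
    if (!buf)
        return NULL;

    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        free(buf);
        return NULL;
    }
    out->addr = (uint64_t)(uintptr_t)buf;
    out->rkey = mr->rkey;   /* must be sent to peers out of band */
    return mr;
}
```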
[Figure: native RDMA pushes functionality into user space (fat applications, no resource sharing) and into hardware.]

Abstraction Mismatch
• Developers want a high-level abstraction (like sockets) that is easy to use and easy to share, with isolation
• RDMA offers a low-level abstraction that is difficult to use and difficult to share
Expensive, unscalable hardware: on-NIC SRAM stores and caches metadata.
[Figure: native RDMA write throughput (requests/us) drops as the total registered memory size grows from 1MB to 1024MB, for both 64B and 1KB writes.]
Fat applications, no resource sharing, and expensive, unscalable hardware:
are we removing too much from the kernel?
LITE: Local Indirection TiEr, in the kernel
• High-level abstraction
• Resource sharing
• Performance isolation
• Protection
[Figure: with LITE, applications become simpler: connection, memory-region, and key management move out of user space. Applications call LITE's memory, RPC/messaging, and synchronization APIs; LITE, a kernel-space layer, performs permission checks and address mapping and exposes only a global lkey/rkey to the RNIC, so the NIC no longer caches per-region keys and page-table entries. The result: cheaper hardware and scalable performance.]

Implementing Remote memset: Native RDMA vs. LITE
[Figure: side-by-side code for a remote memset, written against native RDMA verbs and against LITE. A hypothetical sketch of the LITE-style version follows below.]
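Since only the slide title survives here, the sketch below is a purely hypothetical rendering of the LITE-style side of that comparison: the application works on an opaque handle and never touches QPs, memory regions, or keys. The function names are invented to convey the style, not LITE's actual API.

```c
/* Hypothetical LITE-style remote memset; all names are invented. */
#include <stdint.h>
#include <stddef.h>

typedef uint64_t lite_handle_t;   /* opaque handle to remote memory */

/* Assumed LITE-style calls (illustrative, not the real interface). */
lite_handle_t lite_alloc_remote(int node, size_t size);
int lite_write(lite_handle_t h, uint64_t offset, const void *buf, size_t len);

/* "memset" a remote region to zero, one 4KB chunk at a time. */
int remote_memset_example(int node, size_t size)
{
    static const char zeros[4096];   /* zero-initialized */
    lite_handle_t h = lite_alloc_remote(node, size);
    if (!h)
        return -1;

    for (size_t off = 0; off < size; off += sizeof(zeros)) {
        size_t chunk = size - off < sizeof(zeros) ? size - off : sizeof(zeros);
        if (lite_write(h, off, zeros, chunk) != 0)
            return -1;
    }
    return 0;
}
```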
Main Challenge: How to preserve the performance benefits of RDMA?
LITE Design Principles
1. Indirection only at the local node
2. Avoid hardware-level indirection
3. Hide the kernel-space crossing cost

"All problems in computer science can be solved by another level of indirection ... except for the problem of too many layers of indirection." - David Wheeler

Great Performance and Scalability
Scalability with MR Size
LITE RDMA scales much better than native RDMA with respect to MR size and number.
[Figure: write throughput (requests/us) vs. total registered size (1MB to 1024MB) for LITE_write-1K, Write-1K, LITE_write-64B, and Write-64B.]

LITE Application Effort
Application        LOC    LOC using LITE   Student Days
LITE-Log           330    36               1
LITE-MapReduce     600*   49               4
LITE-Graph         1400   20               7
LITE-Kernel-DSM    3000   45               26
* LITE-MapReduce ports the 3000-LOC Phoenix with 600 lines of change or addition.

• Simple to use
• Needs no expert knowledge
• Flexible, powerful abstraction
• Easy to achieve optimized performance

MapReduce Results
• LITE-MapReduce adapted from Phoenix [1]
[Figure: runtime (sec) of Hadoop, Phoenix, and LITE-MapReduce on 2, 4, and 8 nodes.]
• LITE-MapReduce outperforms Hadoop and Phoenix, by 4.3x (2-node) to 5.3x (8-node)
[1] Ranger et al., Evaluating MapReduce for Multi-core and Multiprocessor Systems. (HPCA '07)

LITE Summary
• Virtualizes RDMA into a flexible, easy-to-use abstraction
• Preserves RDMA's performance benefits
• Indirection does not always degrade performance!
• Careful division of functionality across user space, the kernel, and hardware
Disaggregated Datacenter
flexible, heterogeneous, elastic, perf/$, resilient, scalable, easy-to-use
Physically Disaggregated Resources
Disaggregated Operating System (OSDI'18)
New Processor and Memory Architecture
Networking for Disaggregated Resources
Kernel-Level RDMA Virtualization (SOSP’17)
RDMA Network

Disaggregated Datacenter
flexible, heterogeneous, elastic, perf/$, resilient, scalable, easy-to-use
Physically Disaggregated Resources:
• Disaggregated OS (OSDI'18)
• New Processor and Memory Architecture
• Networking for Disaggregated Resources: Kernel-Level RDMA Virtualization (SOSP'17)
• RDMA Network

Virtually Disaggregated Resources:
• Disaggregated Persistent Memory: Distributed Shared Persistent Memory (SoCC '17)
• Network-Attached NVM, Distributed Non-Volatile Memory
• New Network Topology, Routing, Congestion Control
• InfiniBand

Conclusion
• New hardware and software trends point to resource disaggregation
• My research pioneers an end-to-end solution for the disaggregated datacenter
• It opens up new research opportunities in hardware, software, networking, security, and programming languages
Thank you! Questions?
wuklab.io