Farewell to Servers: Hardware, Software, and Network Approaches towards Datacenter Resource Disaggregation
Yiying Zhang

Monolithic Computer

(Figure: a monolithic computer, with all hardware components behind a single OS / hypervisor.)

Can monolithic servers continue to meet datacenter needs: heterogeneity, flexibility, perf / $?

New hardware heterogeneity: TPU, GPU, FPGA, HBM, ASIC, NVM, NVMe, DNA storage

Making new hardware work with existing servers is like fitting puzzle pieces.

Poor Hardware Elasticity

• Hard to change hardware components

- Add (hotplug), remove, reconfigure, restart

• No fine-grained failure handling

- The failure of one device can crash a whole machine

Poor Resource Utilization

• Whole VM/container has to run on one physical machine

- Move current applications to make room for new ones

(Figure: Job 1 and Job 2 packed onto Server 1 and Server 2; CPU and memory are left wasted because the space available on each server does not match the space any job requires.)

Resource Utilization in Production Clusters

Unused resources and waiting/killed jobs arise because of physical-node constraints.

* Google production cluster trace data: https://github.com/google/cluster-data
* Alibaba production cluster trace data: https://github.com/alibaba/clusterdata

How to achieve better heterogeneity, flexibility, and perf/$?

Go beyond physical node boundary

Resource Disaggregation:

Breaking monolithic servers into network-attached, independent hardware components

(Figure: applications run on top of disaggregated hardware components connected by the network, gaining heterogeneity, flexibility, and perf / $.)

Why Possible Now?

• Network is faster
  - InfiniBand (200Gbps, 600ns)
  - Optical fabric (400Gbps, 100ns)

• More processing power at the device
  - SmartNIC, SmartSSD, PIM

• Network interface closer to the device
  - Omni-Path, Innova-2

Related industry and academic efforts: Berkeley Firebox, Intel Rack-Scale System, HP The Machine, IBM Composable System

Disaggregated Datacenter End-to-End Solution

(Figure: the full stack of Application, Distributed Systems, OS, Network, and Hardware, annotated with the goals of unmodified applications, performance, heterogeneity, flexibility, reliability, and low $ cost.)

Disaggregated Datacenter End-to-End Solution

Physically Disaggregated Resources
- Disaggregated Operating System (OSDI'18)
- New Processor and Memory Architecture

Networking for Disaggregated Resources
- Kernel-Level RDMA Virtualization (SOSP'17)
- RDMA Network

Can Existing Kernels Fit?

(Figure: a monolithic or micro-kernel OS (e.g., Linux, L4) runs one kernel per monolithic server, with CPUs, memory, NIC, and disk behind it; a multikernel (e.g., Barrelfish, Helios, fos) runs one kernel per core over shared main memory; servers are connected by the network.)

21 Existing Kernels Don’t Fit

Access remote resources

Network Distributed resource mgmt

Fine-grained failure handling

When hardware is disaggregated, the OS should be too.

(Figure: a traditional OS bundles process management, the virtual memory system, and the file & storage system above the network.)

(Figure: after splitting, each piece, i.e. process management, the virtual memory system, and the file & storage system, sits directly on the network.)

The Splitkernel Architecture

• Split OS functions into monitors
• Run each monitor at a hardware device
• Network messaging across non-coherent components
• Distributed resource mgmt and failure handling

(Figure: process, GPU, and new-XPU monitors run on processor components (CPU, GPU, new XPU hardware); memory, NVM, HDD, and SSD monitors run on their respective devices; all components communicate only via network messaging across non-coherent components.)
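To make "network messaging across non-coherent components" concrete, here is a minimal C sketch of one monitor-to-monitor request, e.g. a process monitor asking a memory monitor to map a virtual region. The message layout and function names are assumptions for illustration, not LegoOS's actual protocol.

/* Minimal sketch of splitkernel-style monitor messaging (hypothetical
 * message layout and function names, not LegoOS's actual wire format).
 * A process monitor asks a memory monitor to map a virtual region; the
 * two monitors share no memory and communicate only by messages. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

enum msg_op { MSG_ALLOC_VREGION = 1 };

struct msg_header {
    uint16_t op;        /* which operation this message carries */
    uint16_t src_node;  /* sending component id */
    uint16_t dst_node;  /* receiving component id */
};

struct alloc_vregion_req {
    struct msg_header hdr;
    uint32_t pid;       /* process the region belongs to */
    uint64_t vaddr;     /* virtual address to map */
    uint64_t len;       /* length of the region in bytes */
};

/* Stand-in for the RDMA-based messaging layer between components. */
static int net_send_msg(uint16_t dst, const void *buf, size_t len)
{
    printf("send %zu bytes to component %u\n", len, (unsigned)dst);
    return 0;
}

/* Process-monitor side: ask memory component `mem_node` for a mapping. */
static int request_vregion(uint16_t self, uint16_t mem_node,
                           uint32_t pid, uint64_t vaddr, uint64_t len)
{
    struct alloc_vregion_req req;
    memset(&req, 0, sizeof(req));
    req.hdr.op = MSG_ALLOC_VREGION;
    req.hdr.src_node = self;
    req.hdr.dst_node = mem_node;
    req.pid = pid;
    req.vaddr = vaddr;
    req.len = len;
    return net_send_msg(mem_node, &req, sizeof(req));
}

int main(void)
{
    /* Processor component 1 asks memory component 7 to map 1MB for pid 42. */
    return request_vregion(1, 7, 42, 0x400000, 1 << 20);
}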

LegoOS: The First Disaggregated OS

(Figure: LegoOS manages disaggregated processor, memory, and storage/NVM components.)

How Should LegoOS Appear to Users?

As a set of hardware devices? As a giant machine?

• Our answer: as a set of virtual Nodes (vNodes)

- Similar semantics to virtual machines

- Unique vID, vIP, storage mount point

- Can run on multiple processor, memory, and storage components
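As a rough illustration of the vNode properties just listed, the sketch below shows what per-vNode metadata might look like. Field names and layout are assumptions for illustration, not LegoOS's actual data structures.

/* Rough sketch of per-vNode metadata based only on the properties named
 * above (vID, vIP, storage mount point, participating components). */
#include <stdint.h>

#define MAX_COMPONENTS 16

enum component_type { COMP_PROCESSOR, COMP_MEMORY, COMP_STORAGE };

struct component_ref {
    enum component_type type;   /* processor, memory, or storage */
    uint16_t node_id;           /* network id of the hardware component */
};

struct vnode {
    uint32_t vid;                     /* unique vNode id */
    uint32_t vip;                     /* virtual IP address (IPv4 here) */
    char storage_mount[64];           /* storage mount point */
    struct component_ref comps[MAX_COMPONENTS];  /* components it spans */
    int ncomps;
};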

Abstraction - vNode

(Figure: vNode1 and vNode2 each span a subset of the processor, memory, and storage monitors; the mapping between vNodes and hardware components is many-to-many.)

• One vNode can run across multiple hardware components
• One hardware component can host multiple vNodes

Abstraction

• Appear as vNodes to users

• Linux ABI compatible

• Support unmodified Linux system call interface (common ones)

• A level of indirection to translate Linux interface to LegoOS interface
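The sketch below illustrates that level of indirection: Linux syscall numbers stay the same, and each call is routed to the component that owns the resource. The forwarding stubs and handler are hypothetical, for illustration only.

/* Illustrative sketch of the indirection layer: keep Linux syscall numbers,
 * route each call to the component that owns the resource. The forwarding
 * stubs are hypothetical, not LegoOS's real interface. */
#include <stdint.h>
#include <stddef.h>
#include <sys/syscall.h>

struct mmap_args { uint64_t addr, len, prot, flags, fd, off; };
struct read_args { int fd; uint64_t buf, count; };

/* Stand-ins for RDMA RPCs to the owning monitors. */
static long forward_to_memory_monitor(long nr, const void *args, size_t len)
{
    (void)nr; (void)args; (void)len;
    return 0;
}
static long forward_to_storage_monitor(long nr, const void *args, size_t len)
{
    (void)nr; (void)args; (void)len;
    return 0;
}

/* mmap touches the virtual memory system, which lives on a memory component;
 * read on a file goes to the storage component holding the file. */
long legoos_syscall(long nr, const void *args)
{
    switch (nr) {
    case SYS_mmap:
        return forward_to_memory_monitor(nr, args, sizeof(struct mmap_args));
    case SYS_read:
        return forward_to_storage_monitor(nr, args, sizeof(struct read_args));
    default:
        return -1;  /* unsupported in this sketch */
    }
}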

LegoOS Design

1. Clean separation of OS and hardware functionalities

2. Build monitor with hardware constraints

3. RDMA-based message passing for both kernel and applications

4. Two-level distributed resource management

5. Memory failure tolerance through replication

Separate Processor and Memory

(Figure: today's processor architecture, with CPUs, private and last-level caches, a TLB and MMU, and DRAM holding page tables.)

• Separate the memory-related hardware units (TLB, MMU, page tables) and the virtual memory system from the processor, and move them to the memory component across the network.
• Processor components only see virtual memory addresses; all levels of cache are virtual caches.
• Memory components manage both virtual and physical memory.

Challenge: the network is 2x-4x slower than the memory bus.

Add Extended Cache at Processor

• Add a small amount of DRAM/HBM at the processor
• Use it as an Extended Cache, or ExCache
• Software and hardware co-managed
• Inclusive
• Virtual cache
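To make the hit/miss split concrete, here is a minimal software-managed ExCache sketch. The direct-mapped layout, sizes, and function names are assumptions for illustration (eviction and write-back are omitted); LegoOS's real ExCache is hardware/software co-managed.

/* Minimal, self-contained sketch of a software-managed ExCache miss path.
 * On a miss, the 4KB page is fetched from the memory component over the
 * network and installed in the cache, replacing the previous occupant. */
#include <stdint.h>
#include <string.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1u << PAGE_SHIFT)   /* 4KB cache line, as in the talk */
#define NSETS      1024                 /* illustrative direct-mapped cache */

struct excache_line {
    uint64_t vpage;   /* virtual page number cached here */
    int      valid;
    uint8_t  data[PAGE_SIZE];
};

static struct excache_line excache[NSETS];

/* Stand-in for the RDMA fetch from the memory component. */
static void fetch_page_from_memory_component(uint64_t vpage, uint8_t *buf)
{
    (void)vpage;
    memset(buf, 0, PAGE_SIZE);  /* placeholder for a remote read */
}

/* Return a pointer to the cached byte for virtual address `vaddr`.
 * Hit: serve locally. Miss: the software path fetches the page remotely. */
uint8_t *excache_access(uint64_t vaddr)
{
    uint64_t vpage = vaddr >> PAGE_SHIFT;
    struct excache_line *line = &excache[vpage % NSETS];

    if (!(line->valid && line->vpage == vpage)) {   /* miss path (software) */
        fetch_page_from_memory_component(vpage, line->data);
        line->vpage = vpage;
        line->valid = 1;
    }
    return &line->data[vaddr & (PAGE_SIZE - 1)];    /* hit path */
}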


Distributed Resource Management

• Global resource managers: Global Process Manager (GPM), Global Memory Manager (GMM), Global Storage Manager (GSM)
  1. Coarse-grain allocation
  2. Load balancing
  3. Failure handling
• Per-component monitors (process, memory, storage, etc.) handle fine-grained resource management locally, exchanging network messages across non-coherent components.
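A minimal sketch of the two-level idea, under assumptions: a global memory manager does coarse-grain placement across memory components, and the chosen component's monitor then allocates at fine granularity inside the region. Names and policies here are illustrative, not LegoOS code.

#include <stdint.h>
#include <stdio.h>

#define NMEM 4
#define REGION_SIZE (1ull << 30)   /* global manager hands out 1GB regions */

struct mem_component {
    uint16_t node_id;
    uint64_t free_bytes;           /* tracked coarsely by the global manager */
};

static struct mem_component mems[NMEM] = {
    {1, 4 * REGION_SIZE}, {2, 2 * REGION_SIZE},
    {3, 6 * REGION_SIZE}, {4, REGION_SIZE}
};

/* Level 1: global memory manager, coarse-grain placement and load balancing.
 * Here: pick the component with the most free space. */
static struct mem_component *gmm_pick_component(void)
{
    struct mem_component *best = &mems[0];
    for (int i = 1; i < NMEM; i++)
        if (mems[i].free_bytes > best->free_bytes)
            best = &mems[i];
    best->free_bytes -= REGION_SIZE;
    return best;
}

/* Level 2: the chosen component's memory monitor manages virtual memory
 * inside the region (a bump allocator stands in for it here). */
static uint64_t monitor_alloc_in_region(uint64_t region_base,
                                        uint64_t *cursor, uint64_t len)
{
    uint64_t addr = region_base + *cursor;
    *cursor += len;
    return addr;
}

int main(void)
{
    struct mem_component *m = gmm_pick_component();
    uint64_t cursor = 0;
    uint64_t va = monitor_alloc_in_region(0x7f0000000000ull, &cursor, 4096);
    printf("placed on memory component %u, virtual address %#lx\n",
           m->node_id, (unsigned long)va);
    return 0;
}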

Implementation and Emulation

• Processor
  - Reserve DRAM as ExCache (4KB page as cache line)
  - Hardware only on the hit path; software-managed miss path
  - Indirection layer stores states for 113 Linux syscalls
• Memory
  - Limit the number of cores; kernel space only
• Storage / global resource monitors
  - Implemented as kernel modules on Linux
• Network
  - RDMA RPC stack based on LITE [SOSP'17]

(Figure: emulation setup, with process, memory, and storage monitors running on commodity servers connected by an RDMA network.)

Performance Evaluation

• Unmodified TensorFlow running CIFAR-10
  - Working set: 0.9GB, 4 threads
• Systems in comparison
  - Baseline: Linux with unlimited memory
  - Linux swapping to SSD and to ramdisk
  - InfiniSwap [NSDI'17]
  - LegoOS (config: 1P, 1M, 1S, i.e. one processor, one memory, one storage component)

(Figure: slowdown of Linux-swap-SSD, Linux-swap-ramdisk, InfiniSwap, and LegoOS relative to the baseline, for ExCache/memory sizes of 128MB, 256MB, and 512MB.)

Only 1.3x to 1.7x slowdown when disaggregating devices with LegoOS, and in exchange we gain better resource packing, elasticity, and fault tolerance!

LegoOS Summary

• Resource disaggregation calls for a new system
• LegoOS: a new OS designed and built from scratch for datacenter resource disaggregation
• Splits the OS into distributed micro-OS services, each running at a hardware device
• Many challenges and much potential

Disaggregated Datacenter: flexible, heterogeneous, elastic, perf/$, resilient, scalable, easy-to-use

Next up is Networking for Disaggregated Resources, with Kernel-Level RDMA Virtualization (SOSP'17) over an RDMA network.

Network Requirements for Resource Disaggregation

• Low latency

• High bandwidth

• Scale

• Reliable

RDMA (Remote Direct Memory Access)

• Directly read/write remote memory
• Bypass the kernel
• Zero memory copies

(Figure: with sockets over Ethernet, data passes through the kernel on both hosts; with RDMA, the NIC reads and writes application memory directly.)

Benefits:
- Low latency
- High throughput
- Low CPU utilization
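As a concrete but minimal illustration of a one-sided RDMA operation, the sketch below posts an RDMA write with libibverbs. It assumes an already-connected queue pair and a registered memory region; queue-pair setup, completion polling, and the out-of-band exchange of the remote address and rkey are omitted.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Write `len` bytes from a registered local buffer into remote memory.
 * The remote CPU is not involved at all. */
int rdma_write_remote(struct ibv_qp *qp, struct ibv_mr *mr,
                      void *local_buf, size_t len,
                      uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,  /* must lie inside mr */
        .length = (uint32_t)len,
        .lkey   = mr->lkey,              /* local key from ibv_reg_mr() */
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;   /* one-sided operation */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;   /* ask for a completion */
    wr.wr.rdma.remote_addr = remote_addr;         /* learned out of band */
    wr.wr.rdma.rkey        = rkey;                /* remote key, ditto */

    return ibv_post_send(qp, &wr, &bad_wr);       /* 0 on success */
}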

Things have worked well in HPC
• Special hardware
• Few applications
• Cheaper developers

RDMA-Based Datacenter Applications

Pilaf [ATC'13], HERD-RPC [ATC'16], Cell [ATC'16], FaRM [NSDI'14], Wukong [OSDI'16], FaSST [OSDI'16], Hotpot [SoCC'17], NAM-DB [VLDB'17], HERD [SIGCOMM'14], RSI [VLDB'16], APUS [SoCC'17], Octopus [ATC'17], DrTM [SOSP'15], DrTM+ [EuroSys'16], FaRM+Xact [SOSP'15], Mojim [ASPLOS'15]

Things have worked well in HPC
• Special hardware
• Few applications
• Cheaper developers

What about datacenters?
• Commodity, cheaper hardware
• Many (changing) applications
• Resource sharing and isolation

Native RDMA

• The user-level application manages everything itself: connections, send/recv queues, memory regions, and lkeys/rkeys, plus the node and address of every remote buffer.
• The userspace library bypasses the OS kernel entirely.
• The RNIC performs permission checks and address mapping, caching PTEs and per-region lkeys/rkeys on the NIC.
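A small sketch of the bookkeeping this pushes onto applications: every buffer must be registered as a memory region with ibv_reg_mr, and its address and rkey shipped to any peer that wants to access it. The helper and struct below are illustrative; only the ibverbs calls are real API, and the protection domain and out-of-band exchange are assumed.

#include <infiniband/verbs.h>
#include <stdint.h>

struct remote_buf_info {
    uint64_t addr;   /* virtual address the peer must use */
    uint32_t rkey;   /* remote key the peer must present */
};

/* Register `len` bytes and produce the (addr, rkey) pair a peer needs. */
struct ibv_mr *register_and_export(struct ibv_pd *pd, void *buf, size_t len,
                                   struct remote_buf_info *out)
{
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr)
        return NULL;
    out->addr = (uint64_t)(uintptr_t)buf;
    out->rkey = mr->rkey;    /* the RNIC checks this key on remote accesses */
    return mr;               /* caller keeps mr to get the lkey for local SGEs */
}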

Userspace: fat applications, no resource sharing.

Abstraction Mismatch
• Developers want (as with sockets): high-level, easy to use, resource sharing, isolation
• RDMA offers: low-level, difficult to use, difficult to share


Hardware: expensive and unscalable, because on-NIC SRAM stores and caches metadata (cached PTEs, per-region lkeys/rkeys).

(Figure: native RDMA write throughput (requests/µs) for 64B and 1KB writes drops as the total registered MR size grows from 1MB to 1024MB.)

Fat applications. No resource sharing. Expensive, unscalable hardware.

Are we removing too much from the kernel?

LITE: Local Indirection TiEr, a kernel-level indirection layer for RDMA

• High-level abstraction
• Resource sharing
• Protection
• Performance isolation


Simpler applications

• Applications call LITE through its APIs: memory APIs, RPC/messaging APIs, and synchronization APIs.
• LITE sits in kernel space and manages memory, connections, queues, and keys on behalf of all applications.
• Permission checks and address mapping move from the RNIC into LITE, which uses global lkeys and rkeys underneath.
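The sketch below shows the kind of kernel-side indirection this implies: a table mapping an application-visible handle to the node, remote address, and key, with permission and bounds checks done in the kernel rather than on the NIC. All names and the toy permission model are assumptions for illustration, not LITE's actual implementation.

#include <stdint.h>
#include <errno.h>

#define MAX_HANDLES 1024

struct lt_region {
    uint16_t node;          /* which remote component holds the memory */
    uint64_t remote_addr;   /* base virtual address on that node */
    uint64_t len;
    uint32_t rkey;          /* single "global" key used underneath */
    uint32_t allowed_pid;   /* toy permission model: one owner process */
};

static struct lt_region regions[MAX_HANDLES];

/* Translate (handle, offset) into what an RDMA op actually needs, after a
 * permission and bounds check done in the kernel instead of on the NIC. */
int lt_resolve(uint32_t pid, int handle, uint64_t offset, uint64_t len,
               uint16_t *node, uint64_t *addr, uint32_t *rkey)
{
    if (handle < 0 || handle >= MAX_HANDLES)
        return -EINVAL;
    struct lt_region *r = &regions[handle];
    if (r->allowed_pid != pid)              /* permission check */
        return -EACCES;
    if (offset + len > r->len)              /* bounds check */
        return -ERANGE;
    *node = r->node;
    *addr = r->remote_addr + offset;        /* address mapping */
    *rkey = r->rkey;
    return 0;
}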

With only global lkeys/rkeys left on the RNIC: cheaper hardware and scalable performance.

Implementing Remote memset: Native RDMA vs. LITE
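This slide contrasts the two programming models; here is a hedged sketch of what each side might look like. The native-RDMA half uses standard libibverbs calls; the one-call LITE-style interface at the end is a hypothetical signature, since LITE's real API names are not shown in this text.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>
#include <stdlib.h>

/* Native RDMA: the application must own a registered scratch buffer, fill it
 * with the byte value, and post a one-sided write using the remote address
 * and rkey it previously exchanged out of band. */
int memset_native(struct ibv_qp *qp, struct ibv_pd *pd,
                  uint64_t remote_addr, uint32_t rkey, int c, size_t len)
{
    void *scratch = malloc(len);
    if (!scratch) return -1;
    memset(scratch, c, len);

    struct ibv_mr *mr = ibv_reg_mr(pd, scratch, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr) { free(scratch); return -1; }

    struct ibv_sge sge = { (uintptr_t)scratch, (uint32_t)len, mr->lkey };
    struct ibv_send_wr wr, *bad = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode = IBV_WR_RDMA_WRITE;
    wr.sg_list = &sge;
    wr.num_sge = 1;
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey = rkey;

    int ret = ibv_post_send(qp, &wr, &bad);
    /* ...poll the completion queue here, then clean up... */
    ibv_dereg_mr(mr);
    free(scratch);
    return ret;
}

/* LITE-style: the kernel tier already tracks nodes, queues, and keys, so the
 * application only names a remote region by handle and offset.
 * (Hypothetical signature, for illustration only.) */
extern int lt_remote_memset(int remote_handle, size_t offset, int c, size_t len);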

Main Challenge: how to preserve the performance benefits of RDMA?

LITE Design Principles

1. Indirection only at the local node
2. Avoid hardware-level indirection
3. Hide the kernel-space crossing cost

"...except for the problem of too many layers of indirection." – David Wheeler

→ Great performance and scalability

Scalability of MR size and numbers: LITE RDMA scales much better than native RDMA w.r.t. MR size and numbers.

(Figure: write throughput (requests/µs) vs. total MR size from 1MB to 1024MB for LITE_write-1K, Write-1K, LITE_write-64B, and Write-64B.)

LITE Application Effort

Application        LOC    LOC using LITE   Student days
LITE-Log           330    36               1
LITE-MapReduce     600*   49               4
LITE-Graph         1400   20               7
LITE-Kernel-DSM    3000   45               26

* LITE-MapReduce ports from the 3000-LOC Phoenix with 600 lines of change or addition.

• Simple to use
• Needs no expert knowledge
• Flexible, powerful abstraction
• Easy to achieve optimized performance

MapReduce Results

• LITE-MapReduce adapted from Phoenix [1]

(Figure: runtime (sec) of Hadoop, Phoenix, and LITE-MapReduce on 2, 4, and 8 nodes.)

LITE-MapReduce outperforms Hadoop and Phoenix by 4.3x to 5.3x (2 to 8 nodes).

[1] Ranger et al., Evaluating MapReduce for Multi-core and Multiprocessor Systems. (HPCA '07)

LITE Summary

• Virtualizes RDMA into a flexible, easy-to-use abstraction
• Preserves RDMA's performance benefits
• Indirection does not always degrade performance!
• Careful division of work across user space, the kernel, and hardware


Disaggregated Datacenter

flexible, heterogeneous, elastic, perf/$, resilient, scalable, easy-to-use

Physically Disaggregated Resources
- Disaggregated OS (OSDI'18)
- New Processor and Memory Architecture

Virtually Disaggregated Resources
- Disaggregated Persistent Memory: Distributed Shared Persistent Memory (SoCC'17)
- Network-Attached NVM: Distributed Non-Volatile Memory

Networking for Disaggregated Resources
- Kernel-Level RDMA Virtualization (SOSP'17) over the RDMA network
- New network topology, routing, and congestion control over InfiniBand

Conclusion

• New hardware and software trends point to resource disaggregation

• My research pioneers an end-to-end solution for the disaggregated datacenter

• It opens up new research opportunities in hardware, software, networking, security, and programming languages

Thank you! Questions?

wuklab.io