LegoOS: A Disseminated, Distributed OS for Hardware Resource Disaggregation
Yizhou Shan, Yutong Huang, Yilun Chen, and Yiying Zhang

Monolithic Server

A monolithic server packages CPU, memory, disk, and NIC behind a single OS/hypervisor.

Problems?

• Resource utilization: a job's required space can exceed the available space on any single server (e.g., Job 2 does not fit on Server 1 or Server 2), even when the cluster as a whole has room

• Heterogeneity: hard to add, remove, or reconfigure devices (e.g., TPU, FPGA, NVM) in a server after deployment; no extra PCIe slots

• Elasticity

• Fault tolerance

How to improve resource utilization, elasticity, heterogeneity, and fault tolerance?

Go beyond the physical server boundary!

Hardware Resource Disaggregation:

Breaking monolithic servers into network-attached, independent hardware components

Disaggregation attacks all four problems at once: applications run on independent hardware components attached to the network, improving resource utilization, fault tolerance, elasticity, and heterogeneity.

Why Possible Now?

• Network is faster

  - InfiniBand (200 Gbps, 600 ns)

  - Optical fabric (400 Gbps, 100 ns)

• More processing power at the device

  - SmartNIC, SmartSSD, PIM

• Network interface closer to the device

  - Omni-Path, Innova-2

• Industry and academia are already building disaggregated hardware: Intel Rack-Scale System, Berkeley Firebox, HP The Machine, IBM Composable System, dReDBox

Outline

• Hardware Resource Disaggregation

• Kernel Architectures for Resource Disaggregation

• LegoOS Design and Implementation

  - Abstraction

  - Design Principles

  - Implementation and Emulation

• Conclusion

Can Existing Kernels Fit?

Monolithic/microkernel (e.g., Linux, L4): one kernel manages an entire server (CPUs, memory, NIC, disk), and communication crosses the network only between servers.

Multikernel (e.g., Barrelfish, Helios, fos): one kernel per core or device (CPU, GPU, programmable NIC), passing messages over a local bus and sharing main memory, still within a single monolithic server.

Existing Kernels Don't Fit

A kernel for disaggregated hardware must:

• Access remote resources across the network

• Perform distributed resource management

• Handle failures at fine granularity

When hardware is disaggregated, the OS should be too.

Split the OS by function: process management, the virtual memory system, and the file & storage system each become an independent, network-attached service.

The Splitkernel Architecture

• Split OS functions into monitors (process monitor, GPU monitor, memory monitor, storage monitor, ...)

• Run each monitor at its hardware device: processor (CPU, GPU), memory, NVM, HDD, SSD, and new hardware (XPU)

• Network messaging across non-coherent components (message format sketched below)

• Distributed resource management and failure handling
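To make "network messaging across non-coherent components" concrete, below is a minimal C sketch of what a monitor-to-monitor message could look like. Every name here (struct sk_msg, sk_send, the opcode values) is a hypothetical illustration, not LegoOS's actual wire format; the point is that components share no memory and no kernel state, so each request must carry all of its context.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical opcodes for cross-monitor requests. */
    enum sk_opcode {
        SK_ALLOC_MEM,    /* processor monitor -> memory monitor  */
        SK_READ_PAGE,    /* fetch a page from a memory component */
        SK_WRITE_PAGE,   /* write a page back                    */
        SK_STORAGE_IO,   /* memory monitor -> storage monitor    */
    };

    /* Self-describing header: requests cross the network between
     * non-coherent components, so they carry all needed context. */
    struct sk_msg {
        uint16_t opcode;
        uint16_t src_monitor;    /* sending component id   */
        uint16_t dst_monitor;    /* receiving component id */
        uint32_t payload_len;
        uint64_t vaddr;          /* virtual address the request refers to */
    };

    /* Stand-in for the send path; a real monitor would post this to
     * an RDMA queue pair instead of printing it. */
    static int sk_send(const struct sk_msg *m)
    {
        printf("msg op=%u %u->%u vaddr=0x%llx\n", m->opcode,
               m->src_monitor, m->dst_monitor,
               (unsigned long long)m->vaddr);
        return 0;
    }

    int main(void)
    {
        struct sk_msg m = { .opcode = SK_READ_PAGE, .src_monitor = 1,
                            .dst_monitor = 2, .vaddr = 0x7f0000001000ULL };
        return sk_send(&m);
    }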

LegoOS: The First Disaggregated OS

Processor, memory, and storage/NVM components, connected by the network.

Outline

• Hardware Resource Disaggregation

• Kernel Architectures for Resource Disaggregation

• LegoOS Design and Implementation

  - Abstraction

  - Design Principles

  - Implementation and Emulation

• Conclusion

How Should LegoOS Appear to Users?

As a set of hardware devices? As a giant machine?

• Our answer: as a set of virtual Nodes (vNodes)

- Similar semantics to virtual machines

- Unique vID, vIP, storage mount point

- Can run on multiple processor, memory, and storage components (a descriptor sketch follows below)
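As a rough illustration of the vNode abstraction, here is what a vNode descriptor might hold, in C. The struct and every field name are assumptions for the sake of the example, not LegoOS's real definitions.

    #include <stdint.h>

    #define MAX_COMPONENTS 16

    /* Hypothetical vNode descriptor: a vNode looks like a virtual
     * machine to users but maps onto sets of disaggregated
     * components instead of one physical host. */
    struct vnode {
        uint32_t vid;                /* unique vNode id     */
        uint32_t vip;                /* virtual IP address  */
        char     storage_mount[64];  /* storage mount point */

        /* One vNode can span several components of each kind; a
         * component can in turn host several vNodes. */
        uint16_t proc_ids[MAX_COMPONENTS];     /* processor components */
        uint16_t mem_ids[MAX_COMPONENTS];      /* memory components    */
        uint16_t storage_ids[MAX_COMPONENTS];  /* storage components   */
        uint8_t  nproc, nmem, nstorage;
    };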

Abstraction: vNode

One vNode can run across multiple hardware components, and one hardware component can host multiple vNodes (e.g., vNode1 and vNode2 each spanning processor, memory, and storage components).

Abstraction

• Appear as vNodes to users

• Linux ABI compatible

  - Supports the unmodified Linux system call interface (the common calls)

  - A level of indirection translates the Linux interface to the LegoOS interface (sketched below)
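One way to picture that level of indirection: a dispatch table keyed by the Linux syscall number, where each entry forwards the request to the monitor that owns the resource. The handler names and the table are illustrative assumptions; only the syscall numbers (__NR_read = 0, __NR_mmap = 9 on x86-64) are real.

    #include <stdio.h>

    /* Hypothetical handlers: each translates a Linux-ABI call into
     * a message to the owning monitor (memory monitor for mmap,
     * storage monitor for read, ...). */
    static long lego_read(long fd, long len)   { return 0; }
    static long lego_mmap(long addr, long len) { return 0; }

    typedef long (*lego_handler)(long, long);

    /* Sparse table indexed by x86-64 Linux syscall number. */
    static lego_handler syscall_table[512] = {
        [0] = lego_read,   /* __NR_read */
        [9] = lego_mmap,   /* __NR_mmap */
    };

    /* Unmodified binaries trap in with Linux syscall numbers;
     * LegoOS dispatches them instead of running Linux code. */
    long lego_syscall(long nr, long a0, long a1)
    {
        if (nr < 0 || nr >= 512 || !syscall_table[nr])
            return -38;  /* -ENOSYS for unsupported calls */
        return syscall_table[nr](a0, a1);
    }

    int main(void)
    {
        printf("mmap -> %ld\n", lego_syscall(9, 0, 4096));
        return 0;
    }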

LegoOS Design

1. Clean separation of OS and hardware functionalities

2. Build monitor with hardware constraints

3. RDMA-based message passing for both kernel and applications

4. Two-level distributed resource management

5. Memory failure tolerance through replication

Separate Processor and Memory

In a monolithic server, the CPUs and their caches sit beside the TLB, MMU, page table, and DRAM. LegoOS disaggregates DRAM into memory components across the memory network, and moves the memory-management hardware units (TLB, MMU, page table) and the virtual memory system with it.

As a result, processor components only see virtual memory addresses, and all levels of cache are virtual caches; memory components manage both virtual and physical memory (a memory-side translation sketch follows).
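To make the memory-side half concrete, here is a toy C sketch of translation at a memory component: the page table lives there, and physical frames are allocated on demand at first touch. All names and the trivial indexing are assumptions for illustration.

    #include <stdint.h>

    #define PT_ENTRIES 4096

    /* Hypothetical per-process page table kept at the memory
     * component; processor components never see physical addresses. */
    struct mem_pt {
        uint64_t vpn_to_pfn[PT_ENTRIES];  /* 0 = not yet backed */
    };

    static uint64_t next_free_pfn = 1;    /* toy bump allocator */

    /* On-demand physical allocation: the first touch of a virtual
     * page allocates a frame here, at the memory component. */
    uint64_t mem_translate(struct mem_pt *pt, uint64_t vpn)
    {
        uint64_t idx = vpn % PT_ENTRIES;  /* toy indexing; real code
                                             would handle collisions */
        if (pt->vpn_to_pfn[idx] == 0)
            pt->vpn_to_pfn[idx] = next_free_pfn++;
        return pt->vpn_to_pfn[idx];
    }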

Challenge: Remote Memory Accesses

• Network is still slower than the local memory bus

  - Bandwidth: 2x to 4x slower, improving fast

  - Latency: ~12x slower, and improving slowly
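A back-of-the-envelope illustration of what ~12x latency means and why caching at the processor helps (the concrete numbers are assumptions for illustration, not measurements from the paper): if local DRAM takes ~100 ns and a remote access ~1,200 ns, a processor-side cache with a 99% hit rate yields an average access time of about 0.99 × 100 ns + 0.01 × 1,200 ns ≈ 111 ns, close to local DRAM. This motivates the ExCache design that follows.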

Add Extended Cache at Processor

• Add a small DRAM/HBM at the processor

• Use it as an Extended Cache, or ExCache

  - Software and hardware co-managed

  - Inclusive

  - Virtual cache (lookup sketched below)
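A minimal C sketch of the co-managed, virtually indexed lookup: the tag check models the hardware hit path, and the refill models the software miss path that fetches the page from a memory component (over RDMA in LegoOS). The names, the direct-mapped organization, and the sizes are all assumptions.

    #include <stdint.h>
    #include <string.h>

    #define EXCACHE_SETS 1024        /* illustrative capacity      */
    #define PAGE_SHIFT   12          /* 4 KB cache line = one page */
    #define PAGE_SIZE    (1 << PAGE_SHIFT)

    struct excache_line {
        uint64_t vpn;                /* virtual page number (tag)  */
        int      valid, dirty;
        char     data[PAGE_SIZE];
    };

    static struct excache_line excache[EXCACHE_SETS];

    /* Stand-in for the software miss path: in LegoOS this would be
     * an RDMA read from the owning memory component. */
    static void fetch_remote_page(uint64_t vpn, char *buf)
    {
        memset(buf, 0, PAGE_SIZE);   /* placeholder for the fetch */
    }

    /* Virtually indexed, virtually tagged: the processor side never
     * needs a physical address. */
    char *excache_access(uint64_t vaddr)
    {
        uint64_t vpn = vaddr >> PAGE_SHIFT;
        struct excache_line *line = &excache[vpn % EXCACHE_SETS];

        if (line->valid && line->vpn == vpn)          /* hit: h/w path */
            return line->data + (vaddr & (PAGE_SIZE - 1));

        /* Miss: software path. A dirty victim would first be written
         * back to its memory component. */
        fetch_remote_page(vpn, line->data);
        line->vpn = vpn;
        line->valid = 1;
        line->dirty = 0;
        return line->data + (vaddr & (PAGE_SIZE - 1));
    }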

LegoOS Design

1. Clean separation of OS and hardware functionalities

2. Build monitor with hardware constraints

3. RDMA-based message passing for both kernel and applications

4. Two-level distributed resource management

5. Memory failure tolerance through replication

Distributed Resource Management

Two levels: global resource managers above per-component monitors.

• Global Process Manager (GPM)

• Global Memory Manager (GMM)

• Global Storage Manager (GSM)

The global managers perform:

1. Coarse-grained allocation

2. Load balancing

3. Failure handling

Distributed Memory Management

• The user virtual address space (0 to max) is split into fixed-size, coarse-grained virtual regions, or vRegions (e.g., 1 GB each)

• GMM assigns vRegions to memory components (assignment sketched below)

  - On virtual memory allocation syscalls (e.g., mmap)

  - Decisions based on global load

• The owner of a vRegion handles:

  - Fine-grained virtual memory allocation

  - On-demand physical memory allocation

  - Memory accesses
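A toy C version of the coarse-grained half of this split: on mmap, the GMM only decides which memory component owns each 1 GB vRegion, using current load; everything finer-grained happens later at the owner. The names and the load metric are assumptions.

    #include <stdint.h>

    #define VREGION_SHIFT      30    /* fixed 1 GB vRegions */
    #define NUM_MEM_COMPONENTS 4
    #define MAX_VREGIONS       256

    /* GMM state: bytes assigned per memory component, used as a
     * simple global load metric. */
    static uint64_t load[NUM_MEM_COMPONENTS];
    static int owner[MAX_VREGIONS];  /* 0 = unassigned, else id + 1 */

    /* Called on virtual memory allocation syscalls (e.g., mmap):
     * assign each newly touched vRegion to the least-loaded
     * memory component. */
    void gmm_assign(uint64_t vaddr, uint64_t len)
    {
        uint64_t first = vaddr >> VREGION_SHIFT;
        uint64_t last  = (vaddr + len - 1) >> VREGION_SHIFT;

        for (uint64_t r = first; r <= last && r < MAX_VREGIONS; r++) {
            if (owner[r])
                continue;            /* vRegion already has an owner */
            int best = 0;
            for (int m = 1; m < NUM_MEM_COMPONENTS; m++)
                if (load[m] < load[best])
                    best = m;
            owner[r] = best + 1;     /* coarse-grained decision only:
                                        fine-grained virtual allocation
                                        and on-demand physical backing
                                        happen at the owner */
            load[best] += 1ULL << VREGION_SHIFT;
        }
    }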

Implementation and Emulation

• Status: ~206K SLOC; runs on x86-64; 113 common Linux syscalls supported

• Processor

  - Reserve DRAM as ExCache (4 KB page as cache line)

  - Hardware only on the hit path; software-managed miss path

• Memory

  - Limited number of cores; kernel space only

• Storage / global resource monitors

  - Implemented as kernel modules on Linux

• Network

  - RDMA RPC stack based on LITE [SOSP'17]

Performance Evaluation

• Unmodified TensorFlow, running CIFAR-10

• Working set: 0.9 GB

• 4 threads

• Systems in comparison

  - Baseline: Linux with unlimited memory

  - Linux swapping to SSD, Linux swapping to ramdisk

  - InfiniSwap [NSDI'17]

  - LegoOS configuration: 1P, 1M, 1S (one processor, one memory, one storage component)

[Figure: slowdown relative to the baseline at ExCache/memory sizes of 128, 256, and 512 MB]

Only 1.3x to 1.7x slowdown when disaggregating devices with LegoOS, in exchange for better resource packing, elasticity, and fault tolerance!

Conclusion

• Hardware resource disaggregation is promising for future datacenters

• The splitkernel architecture and LegoOS demonstrate the feasibility of resource disaggregation

• Great potential, but many unsolved challenges!

Thank you! Questions?

Open source @ LegoOS.io

Poster tonight, number 11.