
Farewell to Servers: Hardware, Software, and Network Approaches towards Datacenter Resource Disaggregation
Yiying Zhang

The Monolithic Computer
[Figure: a monolithic server, with CPUs, memory, NIC, and disk all managed by one OS or hypervisor.]

Can monolithic servers continue to meet datacenter needs?
• Applications demand heterogeneity, flexibility, and performance per dollar
• Hardware keeps diversifying: TPU, GPU, FPGA, HBM, ASIC, NVM, DNA storage, NVMe
• Making new hardware work with existing servers is like fitting puzzle pieces together

Poor Hardware Elasticity
• Hard to change hardware components
  - Add (hotplug), remove, reconfigure, restart
• No fine-grained failure handling
  - The failure of one device can crash a whole machine

Poor Resource Utilization
• A whole VM/container has to run on one physical machine
  - Current applications must be moved to make room for new ones
[Figure: Server 1 (running Job 1) has spare CPU and Server 2 (running Job 2) has spare memory, but neither server alone has the space a new job requires, so the available resources are wasted.]

Resource Utilization in Production Clusters
• Google production cluster trace data: https://github.com/google/cluster-data
• Alibaba production cluster trace data: https://github.com/alibaba/clusterdata
• Both traces show unused resources and waiting/killed jobs because of physical-node constraints
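To make the resource-stranding argument above concrete, here is a minimal sketch; the servers, the job, and all the numbers are invented for illustration and are not taken from the Google or Alibaba traces.

```c
/*
 * A minimal sketch (hypothetical numbers) of why per-server packing strands
 * resources: neither server can host the new job, even though the cluster
 * as a whole has enough free CPU and memory.
 */
#include <stdio.h>
#include <stdbool.h>

struct server { int free_cores; int free_mem_gb; };

static bool fits(struct server s, int cores, int mem_gb) {
    return s.free_cores >= cores && s.free_mem_gb >= mem_gb;
}

int main(void) {
    /* Server 1 has spare CPU, Server 2 has spare memory. */
    struct server s1 = { .free_cores = 10, .free_mem_gb = 4  };
    struct server s2 = { .free_cores = 2,  .free_mem_gb = 40 };
    int need_cores = 8, need_mem_gb = 32;

    bool on_s1 = fits(s1, need_cores, need_mem_gb);
    bool on_s2 = fits(s2, need_cores, need_mem_gb);
    bool aggregate = (s1.free_cores + s2.free_cores) >= need_cores &&
                     (s1.free_mem_gb + s2.free_mem_gb) >= need_mem_gb;

    printf("fits on server 1: %s\n", on_s1 ? "yes" : "no");      /* no  */
    printf("fits on server 2: %s\n", on_s2 ? "yes" : "no");      /* no  */
    printf("fits in aggregate: %s\n", aggregate ? "yes" : "no"); /* yes */
    return 0;
}
```

Disaggregation removes the per-server check entirely: the job only needs enough free capacity in the cluster-wide pools of each resource.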
How to achieve better heterogeneity, flexibility, and perf/$?
• Go beyond the physical node boundary

Resource Disaggregation
• Breaking monolithic servers into network-attached, independent hardware components
[Figure: applications run over a pool of hardware components connected by the network, which is what buys flexibility, heterogeneity, and better perf/$.]

Why Possible Now?
• The network is faster
  - InfiniBand (200 Gbps, 600 ns)
  - Optical fabric (400 Gbps, 100 ns)
• More processing power at the device
  - SmartNIC, SmartSSD, PIM
• The network interface sits closer to the device
  - Omni-Path, Innova-2
• Related proposals: Berkeley Firebox, Intel Rack-Scale System, HP The Machine, IBM Composable System

Disaggregated Datacenter: End-to-End Solution
• Goals: unmodified applications, performance, heterogeneity, flexibility, reliability, and low cost
• Layers: application, distributed systems, OS, network, and hardware
• The stack: physically disaggregated resources; a disaggregated operating system (OSDI '18); a new processor and memory architecture; networking for disaggregated resources; kernel-level RDMA virtualization (SOSP '17); an RDMA network

Can Existing Kernels Fit?
[Figure: a monolithic or micro-kernel (e.g., Linux, L4) runs one kernel per server over shared main memory; a multikernel (e.g., Barrelfish, Helios, fos) runs one kernel per core, GPU, or programmable NIC inside a server, with a network across servers.]

Existing Kernels Don't Fit
• Remote resources must be accessed over the network
• Resource management must be distributed
• Failures must be handled at the granularity of individual devices

When hardware is disaggregated, the OS should be also
[Figure: the OS's virtual memory system, file and storage system, and process management are split apart, each piece paired with the network.]

The Splitkernel Architecture
• Split OS functions into monitors
• Run each monitor at its hardware device
• Network messaging across non-coherent components
• Distributed resource management and failure handling
[Figure: a process monitor on a CPU processor component, a GPU monitor on a GPU, an XPU manager on new hardware, and memory, NVM, HDD, and SSD monitors on their components, all connected by network messaging.]

LegoOS: The First Disaggregated OS
• Built from processor, memory, storage, and NVM components

How Should LegoOS Appear to Users?
• As a set of hardware devices? As a giant machine?
• Our answer: as a set of virtual nodes (vNodes)
  - Similar semantics to virtual machines
  - Unique vID, vIP, and storage mount point
  - Can run on multiple processor, memory, and storage components

Abstraction: vNode
• One vNode can run on multiple hardware components
• One hardware component can run multiple vNodes

Abstraction
• Appears as vNodes to users
• Linux ABI compatible
  - Supports the unmodified Linux system call interface (the common calls)
  - A level of indirection translates the Linux interface to the LegoOS interface

LegoOS Design
1. Clean separation of OS and hardware functionalities
2. Build monitors with hardware constraints
3. RDMA-based message passing for both kernel and applications
4. Two-level distributed resource management
5. Memory failure tolerance through replication
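As a rough illustration of the vNode abstraction above, the sketch below shows what a vNode descriptor might hold. All type and field names are hypothetical: the slides only say that a vNode has a unique vID, a vIP, and a storage mount point, and that it can span multiple processor, memory, and storage components. This is not LegoOS's actual data structure.

```c
/* A hypothetical vNode descriptor, following the slide's description. */
#include <stdio.h>

#define MAX_COMPONENTS 8

enum component_kind { PROCESSOR, MEMORY, STORAGE };

struct component_ref {
    enum component_kind kind;
    int id;                      /* which physical component */
};

struct vnode {
    int vid;                     /* unique vNode ID */
    const char *vip;             /* virtual IP address */
    const char *mount_point;     /* storage mount point */
    struct component_ref comps[MAX_COMPONENTS];
    int ncomps;
};

int main(void) {
    /* One vNode spanning a processor, two memory components, and one
     * storage component; like a VM, applications inside it see a single
     * machine. The same components could also host other vNodes. */
    struct vnode v1 = {
        .vid = 1, .vip = "10.0.0.1", .mount_point = "/",
        .comps = { {PROCESSOR, 0}, {MEMORY, 0}, {MEMORY, 1}, {STORAGE, 0} },
        .ncomps = 4,
    };
    printf("vNode %d (%s) uses %d hardware components\n",
           v1.vid, v1.vip, v1.ncomps);
    return 0;
}
```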
Separate Processor and Memory
• Today's processor keeps CPUs and caches together with the TLB, MMU, page tables, and DRAM
• Separate the memory hardware units and move them to the memory component, across the network
• Move the virtual memory system to the memory component as well
• Processor components only see virtual memory addresses
  - All levels of cache become virtual caches
• Memory components manage both virtual and physical memory

Challenge: the network is 2x-4x slower than the memory bus

Add Extended Cache at Processor
• Add a small DRAM/HBM at the processor
• Use it as an Extended Cache, or ExCache
• Software and hardware co-managed
• Inclusive
• Virtual cache

LegoOS Design (recap)
1. Clean separation of OS and hardware functionalities
2. Build monitors with hardware constraints
3. RDMA-based message passing for both kernel and applications
4. Two-level distributed resource management
5. Memory failure tolerance through replication

Distributed Resource Management
• A Global Process Manager (GPM), Global Memory Manager (GMM), and Global Storage Manager (GSM) perform global resource management above the per-component monitors
• The global layer handles:
  1. Coarse-grain allocation
  2. Load balancing
  3. Failure handling

Implementation and Emulation
• Processor
  - Reserve DRAM as ExCache (a 4 KB page as the cache line)
  - Hardware only on the hit path; software-managed miss path (a rough sketch of this miss path follows the evaluation below)
  - Indirection layer to store states for 113 Linux syscalls
• Memory
  - Limit the number of cores; kernel space only
• Storage and global resource monitors
  - Implemented as kernel modules on Linux
• Network
  - RDMA RPC stack based on LITE [SOSP '17]

Performance Evaluation
• Unmodified TensorFlow running CIFAR-10
  - Working set: 0.9 GB, 4 threads
• Systems in comparison
  - Baseline: Linux with unlimited memory
  - Linux swapping to SSD and to ramdisk
  - InfiniSwap [NSDI '17]
• LegoOS configuration: 1 processor, 1 memory, and 1 storage component (1P, 1M, 1S)
[Figure: slowdown relative to the baseline as the ExCache/memory size varies from 128 MB to 512 MB, for Linux-swap-SSD, Linux-swap-ramdisk, InfiniSwap, and LegoOS.]
• Only 1.3x to 1.7x slowdown when disaggregating devices with LegoOS, in exchange for better resource packing, elasticity, and fault tolerance
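Here is the promised sketch of the software-managed ExCache miss path from the emulation above: 4 KB cache lines, virtually tagged, filled from a memory component over the network on a miss. The direct-mapped layout, all names, and the fetch stub are assumptions made for illustration; the real ExCache also has a hardware-managed hit path, eviction, and dirty write-back, which are omitted here.

```c
/* A minimal, hypothetical model of an ExCache lookup with a software miss
 * path; not LegoOS's actual implementation. */
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE   4096u
#define NUM_LINES   1024u          /* 4 MB of ExCache for the sketch */

struct excache_line {
    uint64_t vtag;                 /* virtual page number: a virtual cache */
    int      valid;
    uint8_t  data[PAGE_SIZE];
};

static struct excache_line excache[NUM_LINES];

/* Stub for the RDMA RPC to the memory component that owns this vNode's
 * virtual memory; here it simply zero-fills the page. */
static void fetch_from_memory_component(uint64_t vpage, uint8_t *buf)
{
    memset(buf, 0, PAGE_SIZE);
}

/* Return a pointer to the cached byte for vaddr, filling the line on a miss. */
static uint8_t *excache_access(uint64_t vaddr)
{
    uint64_t vpage = vaddr / PAGE_SIZE;
    struct excache_line *line = &excache[vpage % NUM_LINES];

    if (!line->valid || line->vtag != vpage) {      /* miss: software path */
        /* A dirty victim would first be written back to the memory component. */
        fetch_from_memory_component(vpage, line->data);
        line->vtag = vpage;
        line->valid = 1;
    }
    return line->data + (vaddr % PAGE_SIZE);        /* hit: hardware path */
}

int main(void)
{
    *excache_access(0x7f00001000) = 42;             /* miss, fill, then write */
    return *excache_access(0x7f00001000) == 42 ? 0 : 1;   /* hit */
}
```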
LegoOS Summary
• Resource disaggregation calls for a new system
• LegoOS: a new OS designed and built from scratch for datacenter resource disaggregation
• Splits the OS into distributed micro-OS services, each running at a device
• Many challenges and many potentials

Disaggregated Datacenter
• Flexible, heterogeneous, elastic, good perf/$, resilient, scalable, easy to use
• The same stack as before: physically disaggregated resources; a disaggregated operating system (OSDI '18); a new processor and memory architecture; networking for disaggregated resources; kernel-level RDMA virtualization (SOSP '17); an RDMA network

Network Requirements for Resource Disaggregation
• Low latency
• High bandwidth
• Scale
• Reliable
RDMA meets these requirements.

RDMA (Remote Direct Memory Access)
• Directly read/write remote memory
• Bypass the kernel
• Zero-copy memory access
• Benefits: low latency, high throughput, low CPU utilization
[Figure: with sockets over Ethernet, data crosses the user/kernel boundary and both CPUs; with RDMA, the NICs move data between the two applications' memory directly.]

Things Have Worked Well in HPC
• Special hardware
• Few applications
• Cheaper developers

RDMA-Based Datacenter Applications
• Pilaf [ATC '13], HERD [SIGCOMM '14], FaRM [NSDI '14], Mojim [ASPLOS '15], DrTM [SOSP '15], FaRM+Xact [SOSP '15], DrTM+R [EuroSys '16], HERD-RPC [ATC '16], Cell [ATC '16], RSI [VLDB '16], Wukong [OSDI '16], FaSST [OSDI '16], Hotpot [SoCC '17], APUS [SoCC '17], Octopus [ATC '17], NAM-DB [VLDB '17]

What About Datacenters?
• In HPC: special hardware, few applications, cheaper developers
• In datacenters: commodity, cheaper hardware; many (changing) applications; resource sharing and isolation

Native RDMA
• User space: the application links a user-level RDMA library and manages its own connections, memory registrations, queues, and keys (lkey/rkey)
• Kernel space: the OS is bypassed on the data path
• Hardware: the RNIC checks permissions with lkeys/rkeys, maps addresses, and caches page table entries

The result in userspace:
• Fat applications
• No resource sharing
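To show what the "fat application" problem looks like in practice, here is a minimal libibverbs fragment for the native user-level RDMA model in the last slide: the application registers memory to obtain an lkey, then posts a one-sided RDMA READ against a remote address and rkey, and the RNIC completes it without involving either kernel on the data path. The protection domain, connected queue pair, completion queue, and the out-of-band exchange of the remote address and rkey are assumed to exist already; setting those up is exactly the connection, memory, and key management that bloats every native-RDMA application.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Read `len` bytes from (remote_addr, rkey) into a freshly registered
 * local buffer, using an already-connected RC queue pair. */
int rdma_read_example(struct ibv_pd *pd, struct ibv_qp *qp, struct ibv_cq *cq,
                      void *local_buf, size_t len,
                      uint64_t remote_addr, uint32_t rkey)
{
    /* Pin and register local memory; the returned lkey names it to the NIC. */
    struct ibv_mr *mr = ibv_reg_mr(pd, local_buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr)
        return -1;

    struct ibv_sge sge = {
        .addr = (uintptr_t)local_buf, .length = (uint32_t)len, .lkey = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_READ;  /* one-sided: no remote CPU */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;       /* learned out of band */
    wr.wr.rdma.rkey        = rkey;              /* remote side's key */

    if (ibv_post_send(qp, &wr, &bad_wr))
        goto err;

    /* Busy-poll the completion queue; the kernel is bypassed throughout. */
    struct ibv_wc wc;
    int n;
    do {
        n = ibv_poll_cq(cq, 1, &wc);
    } while (n == 0);
    if (n < 0 || wc.status != IBV_WC_SUCCESS)
        goto err;

    ibv_dereg_mr(mr);
    return 0;
err:
    ibv_dereg_mr(mr);
    return -1;
}
```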