Bare Metal: Library Abstractions for Modern Hardware
Cyprien Noel

Plan
1. Modern Hardware?
2. New challenges & opportunities
3. Three use cases
   ○ Current solutions
   ○ Leveraging hardware
   ○ Simple abstraction

Myself
● High performance trading systems
  ○ Lock-free algos, distributed systems
● H2O
  ○ Distributed CPU machine learning, async SGD
● Flickr
  ○ Scaling deep learning on GPU ⎼ Multi GPU Caffe
  ○ RDMA, multicast, distributed Hogwild ⎼ CaffeOnSpark
● UC Berkeley
  ○ NCCL Caffe, GPU cluster tooling
  ○ Bare Metal

Modern Hardware?

Device-to-device networks
Moving from ms software to µs hardware
Number crunching       ➔ GPU
FS, block io, virt mem ➔ Pmem
Network stack          ➔ RDMA
RAID, replication      ➔ Erasure codes
Device mem             ➔ Coherent fabrics
And more: Video, crypto, etc.

OS abstractions replaced by
● CUDA
● OFED
● Libpmem
● DPDK
● SPDK
● Libfabric
● UCX
● VMA
● More every week...

More powerful, but also more complex and non-interoperable.

Summary So Far
● Big changes coming!
  ○ At least for high-performance applications
● CPU should orchestrate
  ○ Not in critical path
  ○ Device-to-device networks
● Retrofitting existing architectures difficult
  ○ CPU-centric abstractions
  ○ ms software on µs hardware (e.g. 100s of instructions per packet)
  ○ OK in some cases, e.g. VMA (kernel-bypass sockets), but much lower acceleration, most features inexpressible

What do we do?
● Start from scratch?
  ○ E.g. Google Fuchsia - no fs, block io, network, etc.
  ○ Very interesting, but future work
● Use already accelerated frameworks?
  ○ E.g. PyTorch, BeeGFS
  ○ Not general purpose, no interop, not device-to-device
● Work incrementally from use cases
  ○ Look for the simplest hardware solution
  ○ Hopefully useful abstractions will emerge

Use cases
● Build datasets
  ○ Add, update elements
  ○ Apply functions to sets, map-reduce
  ○ Data versioning
● Training & inference
  ○ Compute graphs, pipelines
  ○ Deployment
  ○ Model versioning

Datasets
● Typical solution
  ○ Protobuf messages
  ○ KV store
  ○ Dist. file system
● Limitations
  ○ Serialization granularity (see sketch below)
  ○ Copies: kv log, kernel1, replication, kernel2, fs (x12)
  ○ Remote CPU involved, stragglers
  ○ Cannot place data in device EC shard
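A minimal sketch of the typical path above (standard protobuf API; the dict-backed kv_store is a stand-in for a real KV client): even a one-field update re-serializes the whole message, and the resulting blob is then copied again by the KV log, kernels, replication and the file system on its way to disk.

    from google.protobuf.struct_pb2 import Struct

    record = Struct()
    record.update({"label": "cat", "pixels": "..."})   # one dataset element

    record["label"] = "dog"                            # tiny update...
    blob = record.SerializeToString()                  # ...still re-serializes everything

    kv_store = {}                                      # stand-in for the KV store client
    kv_store["/dataset/42"] = blob                     # first of many copies (log, kernel, replication, fs)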
Datasets

● Simplest hardware implementation
  ○ Write protobuf in arena, like Flatbuffers
  ○ Pick an offset on disks, e.g. a namespace
  ○ Call ibv_exp_ec_encode_async (toy sketch of the encoding below)
● Comments
  ○ Management, coordination, crash resiliency
  ○ Thin wrapper over HW: line rate perf. (x12)
● User abstraction?
  ○ Simple, familiar
  ○ Efficient, device friendly
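A toy illustration of the erasure-coding step (the real encode is a Reed-Solomon computation offloaded to the NIC via ibv_exp_ec_encode_async; the XOR parity below is a simplified stand-in): the arena buffer is split into k data shards plus a code shard, and each shard lands at the same offset on a different device.

    import functools

    def xor_parity(shards):
        # XOR all data shards together; any single lost shard stays recoverable
        return functools.reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), shards)

    def encode(buf: bytes, k: int):
        size = -(-len(buf) // k)                       # shard size, rounded up
        shards = [buf[i * size:(i + 1) * size].ljust(size, b"\0") for i in range(k)]
        return shards, xor_parity(shards)

    data_shards, code_shard = encode(b"serialized protobuf arena", k=4)
    # Each shard would be written at the chosen offset/namespace on a different device,
    # so reads stay balanced and any one failed shard can be rebuilt.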
mmap

● Extension to classic mmap
  ○ Distributed
  ○ Typed - Protobuf, other formats planned
● Protobuf is amazing
  ○ Forward and backward compatible
  ○ Lattice

mmap
● C++
    const Test& test = mmap
● Python
    test = Test()              # a protobuf message
    bm.mmap("/test", test)     # map it at path "/test"
    i = test.field()           # read a field, zero copy

mmap, recap
● Simple abstraction for data storage
● Fully accelerated, “mechanically friendly”
  ○ Thin wrapper over HW, device-to-device, zero copy
  ○ ~1.5x replication factor
  ○ Network automatically balanced
  ○ Solves straggler problem
  ○ No memory pinning or TLB thrashing, NUMA aware

Use cases
● Compute
  ○ Map-reduce, compute graphs, pipelines
● Typical setup
  ○ Spark, DL frameworks
  ○ Distribution using Akka, gRPC, MPI
  ○ Kubernetes or SLURM scheduling
● Limitations
  ○ No interop
  ○ Placement difficult
  ○ Inefficient resource allocation

Compute
● Simplest hardware implementation
  ○ Define a task, e.g. img. resize, CUDA kernel, PyTorch graph
  ○ Place tasks in a queue
  ○ Work stealing - RDMA atomics (sketch below)
  ○ Device-to-device chaining - GPU Direct Async
● User abstraction?
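A minimal sketch of the work-stealing idea (local threads and a Python counter stand in for cluster nodes and the RDMA fetch-and-add a NIC would issue directly against the queue owner's memory, without involving its CPU):

    import itertools, threading

    tasks = [("resize", i) for i in range(8)]   # queued task descriptors
    head = itertools.count()                    # stand-in for the atomic queue head

    def worker(name):
        while True:
            slot = next(head)                   # atomic fetch-and-add claims a slot
            if slot >= len(tasks):
                return                          # queue drained, nothing left to steal
            kind, arg = tasks[slot]
            print(f"{name} runs {kind}({arg})")

    workers = [threading.Thread(target=worker, args=(f"w{i}",)) for i in range(3)]
    for w in workers: w.start()
    for w in workers: w.join()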
task

● Python
    @bm.task
    def compute(x, y):
        return x * y
    # Runs locally
    compute(1, 2)
    # Might be rebalanced on cluster
    data = bm.list()
    bm.mmap("/data", data)
    compute(data, 2)
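A hedged follow-on sketch building on the @bm.task and bm.mmap calls above (how chaining and placement actually behave is an assumption, not confirmed API): tasks composed into a pipeline so intermediate results can stay in device memory and move device-to-device.

    @bm.task
    def decode(raw):
        ...   # e.g. JPEG decode on a device

    @bm.task
    def resize(img, size):
        ...   # e.g. CUDA kernel

    @bm.task
    def infer(img):
        ...   # e.g. PyTorch graph

    images = bm.list()
    bm.mmap("/images", images)
    # Each stage may be scheduled on a different device; chaining the tasks lets
    # intermediate tensors stay on-device instead of bouncing through host memory.
    results = [infer(resize(decode(raw), 224)) for raw in images]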
task, recap

● Simple abstraction for CPU and device kernels
● Work stealing instead of explicit scheduling
  ○ No GPU hoarding
  ○ Better work balancing
  ○ Dynamic placement, HA
● Device-to-device chaining
  ○ Data placed directly in device memory
  ○ Efficient pipelines, even very short tasks
  ○ E.g. model parallelism, low latency inference

Use cases
● Versioning
  ○ Track datasets and models
  ○ Deploy / rollback models
● Typical setup
  ○ Copy before update
  ○ Symlinks as versions to data
  ○ Staging / production environments split

Versioning
● Simplest hardware implementation
  ○ Keep multiple write-ahead logs
  ○ mmap updates
  ○ task queues
● User abstraction?

branch
● Like a git branch
  ○ But for any size data
  ○ Simplifies collaboration, experimentation
  ○ Generalized staging / production split (sketch below)
● Simplifies HA
  ○ File system fsync, msync (Very hard! Rajimwale et al. DSN ‘11)
  ○ Replaces transactions, e.g. queues, persistent memory
  ○ Allows duplicate work to be merged
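A minimal sketch of the mechanism described above (an assumption about how the pieces fit together, not the Bare Metal implementation): each branch is a write-ahead log of updates; materializing a branch replays its log on top of main, and merging promotes the branch's log.

    wal = {"main": [], "experiment": []}              # one write-ahead log per branch

    def write(branch, path, value):
        wal[branch].append((path, value))             # durable append, then ack

    def view(branch):
        state = {}
        entries = wal["main"] + (wal[branch] if branch != "main" else [])
        for path, value in entries:                   # replay logs to materialize the branch
            state[path] = value
        return state

    write("main", "/model", "v1")
    write("experiment", "/model", "v2")               # staged on a branch, production untouched
    assert view("main") == {"/model": "v1"}
    assert view("experiment") == {"/model": "v2"}
    wal["main"] += wal.pop("experiment")              # merge: promote the branch's updates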
branch

● C++
    Test* test = mutable_mmap
● Similar in Python

Summary
● mmap, task, and branch simplify hardware acceleration
● Helps build pipelines, manage cluster resources, etc.
● Early micro-benchmarks suggest very high performance

Thank You!
● Will be open sourced (BSD)
● Contact me if interested - [email protected]
Thanks to our sponsor