Bare Metal Abstractions for modern hardware
Cyprien Noel

Plan

1. Modern Hardware?
2. New challenges & opportunities
3. Three use cases
   ○ Current solutions
   ○ Leveraging hardware
   ○ Simple abstraction

Myself

● High performance trading systems
  ○ Lock-free algos, distributed systems
● H2O
  ○ Distributed CPU machine learning, async SGD
● Flickr
  ○ Scaling deep learning on GPU
    ⎼ Multi GPU Caffe
  ○ RDMA, multicast, distributed Hogwild
    ⎼ CaffeOnSpark
● UC Berkeley
  ○ NCCL Caffe, GPU cluster tooling
  ○ Bare Metal

Modern Hardware?

Device-to-device networks
Moving from ms software to µs hardware

Number crunching ➔ GPU
FS, block io, virt mem ➔ Pmem
Network stack ➔ RDMA
RAID, replication ➔ Erasure codes
Device mem ➔ Coherent fabrics

And more: video, crypto, etc.

OS abstractions replaced by

● CUDA
● OFED
● Libpmem
● DPDK
● SPDK
● Libfabric
● UCX
● VMA
● More every week...

More powerful, but also more complex and non-interoperable

Summary So Far

● Big changes coming!
  ○ At least for high-performance applications
● CPU should orchestrate
  ○ Not in the critical path
  ○ Device-to-device networks
● Retrofitting existing architectures is difficult
  ○ CPU-centric abstractions
  ○ ms software on µs hardware (e.g. 100s of instructions per packet)
  ○ OK in some cases, e.g. VMA (kernel-bypass sockets), but much lower acceleration, most features inexpressible

What do we do?

● Start from scratch?
  ○ E.g. Fuchsia - no fs, block io, network, etc.
  ○ Very interesting, but future work
● Use already-accelerated frameworks?
  ○ E.g. PyTorch, BeeGFS
  ○ Not general purpose, no interop, not device-to-device
● Work incrementally from use cases
  ○ Look for the simplest hardware solution
  ○ Hopefully useful abstractions will emerge

Use cases

● Build datasets
  ○ Add, update elements
  ○ Apply functions to sets, map-reduce
  ○ Data versioning
● Training & inference
  ○ Compute graphs, pipelines
  ○ Deployment
  ○ Model versioning

Datasets

● Typical solution
  ○ Protobuf
  ○ KV store
  ○ Dist. file system
● Limitations
  ○ Granularity
  ○ Copies: kv log, kernel, replication, kernel, fs (x12)
  ○ Remote CPU involved, stragglers
  ○ Cannot place data in device

Datasets

● Simplest hardware implementation
  ○ Write protobuf in an arena, like Flatbuffers
  ○ Pick an offset on disks, e.g. a namespace
  ○ Call ibv_exp_ec_encode_async (encode sketch below)
● Comments
  ○ Management, coordination, crash resiliency
  ○ Thin wrapper over HW: line-rate perf. (x12)
● User abstraction?
  ○ Simple, familiar
  ○ Efficient, device friendly
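As a rough illustration of the encode step above (the real path offloads a Reed-Solomon encode to the NIC via ibv_exp_ec_encode_async), here is a minimal Python sketch that stripes a serialized buffer into k data shards plus one XOR parity shard and rebuilds a lost shard; the shard count, padding scheme, and helper names are assumptions for this sketch, not the actual implementation.

  # Sketch only: k data shards + 1 XOR parity shard. Real deployments use
  # Reed-Solomon with several parity shards, offloaded to the NIC.
  def encode(buf, k=4):
      shard_len = -(-len(buf) // k)           # ceiling division
      buf = buf.ljust(k * shard_len, b"\0")   # pad to a multiple of k
      shards = [buf[i * shard_len:(i + 1) * shard_len] for i in range(k)]
      parity = bytearray(shard_len)
      for s in shards:
          for i, b in enumerate(s):
              parity[i] ^= b
      return shards, bytes(parity)

  def recover(shards, parity, lost):
      # Rebuild the lost shard: parity XOR all surviving shards.
      out = bytearray(parity)
      for j, s in enumerate(shards):
          if j != lost:
              for i, b in enumerate(s):
                  out[i] ^= b
      return bytes(out)

  shards, parity = encode(b"protobuf arena bytes would go here")
  assert recover(shards, parity, lost=2) == shards[2]

With, for example, 8 data and 4 parity shards the storage overhead is 1.5x, in line with the ~1.5x replication factor quoted on the mmap recap slide.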

mmap

● Extension to classic mmap
  ○ Distributed
  ○ Typed - Protobuf, other formats planned
● Protobuf is amazing
  ○ Forward and backward compatible
  ○ Lattice

mmap

● C++

  const Test& test = mmap("/test");
  int i = test.field();

● Python

  test = Test()
  bm.mmap("/test", test)
  i = test.field()

mmap, recap

● Simple abstraction for data storage
● Fully accelerated, “mechanically friendly”
  ○ Thin wrapper over HW, device-to-device, zero copy
  ○ ~1.5x replication factor
  ○ Network automatically balanced
  ○ Solves the straggler problem
  ○ No memory pinning or TLB thrashing, NUMA aware

Use cases

● Compute
  ○ Map-reduce, compute graphs, pipelines
● Typical setup
  ○ Spark, DL frameworks
  ○ Distribution using Akka, gRPC, MPI
  ○ or SLURM scheduling
● Limitations
  ○ No interop
  ○ Placement difficult
  ○ Inefficient resource allocation

Compute

● Simplest hardware implementation
  ○ Define a task, e.g. img. resize, CUDA kernel, PyTorch graph
  ○ Place tasks in a queue
  ○ Work stealing - RDMA atomics (sketch below)
  ○ Device-to-device chaining - GPU Direct Async
● User abstraction?
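A conceptual sketch of the work-stealing item above, with threads and locks standing in for the RDMA atomics used for remote steals; the worker count, queue layout, and class names are assumptions for illustration, not the actual design.

  # Sketch: each worker pops from its own queue and, when idle, steals from
  # a peer. In the real design the steal is a remote RDMA atomic on another
  # node's queue; here a lock per queue stands in for that.
  import collections, random, threading

  class Worker:
      def __init__(self, idx, peers):
          self.idx, self.peers = idx, peers
          self.queue = collections.deque()
          self.lock = threading.Lock()

      def push(self, task):
          with self.lock:
              self.queue.append(task)

      def next_task(self):
          with self.lock:                      # pop from the local end first
              if self.queue:
                  return self.queue.pop()
          for peer in random.sample(self.peers, len(self.peers)):
              if peer is self:
                  continue
              with peer.lock:                  # steal from the other end
                  if peer.queue:
                      return peer.queue.popleft()
          return None

  workers = []
  for i in range(4):
      workers.append(Worker(i, workers))       # all workers share the peer list
  workers[0].push(lambda: print("resize image"))
  task = workers[3].next_task()                # an idle worker steals from worker 0
  if task:
      task()

Work stealing keeps idle devices busy without a central scheduler, which fits the deck's point about keeping CPUs out of the critical path.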

task

● Python

  @bm.task
  def compute(x, y):
      return x * y

  # Runs locally
  compute(1, 2)

  # Might be rebalanced on cluster
  data = bm.list()
  bm.mmap("/data", data)
  compute(data, 2)

task, recap

● Simple abstraction for CPU and device kernels
● Work stealing instead of explicit scheduling
  ○ No GPU hoarding
  ○ Better work balancing
  ○ Dynamic placement, HA
● Device-to-device chaining
  ○ Data placed directly in device memory
  ○ Efficient pipelines, even for very short tasks
  ○ E.g. model parallelism, low-latency inference

Use cases

● Versioning
  ○ Track datasets and models
  ○ Deploy / rollback models
● Typical setup
  ○ Copy before update
  ○ Symlinks as versions to data
  ○ Staging / production environments split

Versioning

● Simplest hardware implementation
  ○ Keep multiple write-ahead logs
  ○ mmap updates
  ○ task queues
● User abstraction?

branch

● Like a git branch (sketch below)
  ○ But any size data
  ○ Simplifies collaboration, experimentation
  ○ Generalized staging / production split
● Simplifies HA
  ○ File system fsync, msync (Very hard! Rajimwale et al. DSN ’11)
  ○ Replaces transactions, e.g. queues, persistent memory
  ○ Allows duplicate work merge
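A minimal sketch of the branch idea, assuming a branch is simply an extra write-ahead log layered over a base log; the record format, class names, and merge policy are assumptions for illustration, not the actual implementation.

  # Sketch: reads check the branch log first, then the base log; merge
  # replays the branch records onto the base.
  class Log:
      def __init__(self):
          self.records = []                     # append-only (key, value) pairs

      def append(self, key, value):
          self.records.append((key, value))

      def get(self, key):
          for k, v in reversed(self.records):   # last write wins
              if k == key:
                  return v
          return None

  class Branch:
      def __init__(self, base):
          self.base, self.log = base, Log()

      def set(self, key, value):
          self.log.append(key, value)           # visible only in this branch

      def get(self, key):
          v = self.log.get(key)
          return v if v is not None else self.base.get(key)

      def merge(self):
          for k, v in self.log.records:         # replay the branch onto the base
              self.base.append(k, v)

  base = Log()
  base.append("field", 7)
  b = Branch(base)
  b.set("field", 12)
  assert base.get("field") == 7 and b.get("field") == 12
  b.merge()
  assert base.get("field") == 12

Because every write is an append, there is no in-place update to fsync carefully, which is the sense in which branches can stand in for transactions here.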

branch

● C++

  Test* test = mutable_mmap("/test");
  branch b;
  // Only visible in the current branch
  test->set_field(12);

● Similar in Python (guessed sketch below)
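The Python form is not shown on the slides; here is a guess at what it might look like, mirroring the C++ snippet above (bm.branch as a context manager and the set_field setter are assumptions, the actual API may differ).

  test = Test()
  bm.mmap("/test", test)         # as in the earlier Python mmap example
  with bm.branch():              # assumption: branch scope as a context manager
      test.set_field(12)         # only visible in the current branch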

Summary

● mmap, task, and branch simplify hardware acceleration
● Helps build pipelines, manage cluster resources, etc.
● Early micro-benchmarks suggest very high performance

Thank You!

● Will be open sourced (BSD)
● Contact me if interested - [email protected]

Thanks to our sponsor