The Multikernel: A new OS architecture for scalable multicore systems

Andrew Baumann (1), Paul Barham (2), Pierre-Evariste Dagand (3), Tim Harris (2), Rebecca Isaacs (2), Simon Peter (1), Timothy Roscoe (1), Adrian Schüpbach (1), Akhilesh Singhania (1)

(1) Systems Group, ETH Zurich   (2) Microsoft Research, Cambridge   (3) ENS Cachan Bretagne

Systems Group | Department of Computer Science | ETH Zurich
SOSP, 12th October 2009

Introduction

How should we structure an OS for future multicore systems?

- Scalability to many cores
- Heterogeneity and hardware diversity

System diversity

[Figure: processor diagrams illustrating system diversity, 2004 to 2008: UltraSPARC IIIi, UltraSPARC T1 (eight cores, 32 threads), Sun Niagara T2 / UltraSPARC T2 (eight cores, 64 threads), "Victoria Falls" (two sockets, 16 cores, 128 threads), AMD Opteron (Istanbul), and Intel Nehalem (Beckton).]

The interconnect matters: Today's 8-socket Opteron

The interconnect matters: Tomorrow's 8-socket Nehalem

The interconnect matters: On-chip interconnects

[Figure: two on-chip interconnect examples: a many-core processor with in-order, multi-threaded, wide-SIMD cores sharing a coherent L2 cache, memory controllers, and fixed-function/texture logic; and a tiled processor whose tiles each combine a core, caches, and a switch onto several on-chip networks, with DDR2, PCIe, and Gigabit Ethernet controllers at the edges.]

Core diversity

- Within a system:
  - Programmable NICs
  - GPUs
  - FPGAs (in CPU sockets)
- On a single die:
  - Performance asymmetry
  - Streaming instructions (SIMD, SSE, etc.)
  - Virtualisation support

Summary

- Increasing core counts, increasing diversity
- Unlike HPC systems, cannot optimise at design time

The multikernel model

- It's time to rethink the default structure of an OS:
  - Shared-memory kernel on every core
  - Data structures protected by locks
  - Anything else is a device
- Proposal: structure the OS as a distributed system
- Design principles:
  1. Make inter-core communication explicit
  2. Make OS structure hardware-neutral
  3. View state as replicated

Outline

Introduction

Motivation
  Hardware diversity

The multikernel model
  Design principles
  The model

Barrelfish

Evaluation
  Case study: Unmap

1. Make inter-core communication explicit

- All communication with messages (no shared state)
- Decouples system structure from inter-core communication mechanism
  - Communication patterns explicitly expressed
  - Naturally supports heterogeneous cores, non-coherent interconnects (PCIe)
- Better match for future hardware
  - ... with cheap explicit message passing (e.g. Tile64)
  - ... without cache-coherence (e.g. Intel 80-core)
- Allows split-phase operations
  - Decouple requests and responses for concurrency
- We can reason about it

Message passing vs. shared memory: experiment

Shared memory (move the data to the operation; sketched below):
- Each core updates the same memory locations (no locking)
- Cache-coherence protocol migrates modified cache lines
- Processor stalled while line is fetched or invalidated
- Limited by latency of interconnect round-trips
- Performance depends on data size (cache lines) and contention (number of cores)
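The shared-memory side of the experiment can be pictured with a short pthreads program. This is a hedged sketch, not the paper's benchmark code: the values of NCORES, NUM_LINES, and ITERS are arbitrary, and the threads would additionally need to be pinned to distinct cores for a faithful run.

```c
/* Sketch: NCORES threads update the same NUM_LINES cache lines with no
 * locking, so every write drags the modified lines across the interconnect. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define CACHE_LINE 64
#define NUM_LINES  8            /* size of the shared operand (cf. the SHM8 curve) */
#define NCORES     4            /* number of contending threads (hypothetical)     */
#define ITERS      1000000

struct line { volatile uint64_t word; char pad[CACHE_LINE - 8]; };
static struct line shared[NUM_LINES] __attribute__((aligned(CACHE_LINE)));

static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

static void *worker(void *arg) {
    uint64_t t0 = rdtsc();
    for (long i = 0; i < ITERS; i++)
        for (int l = 0; l < NUM_LINES; l++)
            shared[l].word++;   /* core stalls while the line is fetched or invalidated */
    printf("thread %ld: %llu cycles per %d-line update\n", (long)(intptr_t)arg,
           (unsigned long long)((rdtsc() - t0) / ITERS), NUM_LINES);
    return NULL;
}

int main(void) {
    pthread_t t[NCORES];
    for (long i = 0; i < NCORES; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);  /* pin threads to cores for a faithful run */
    for (int i = 0; i < NCORES; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```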

Shared memory results

[Figure: 4 × 4-core AMD system. Update latency (cycles × 1000, 0 to 12) against number of cores (2 to 16), one curve per shared-data size: SHM1, SHM2, SHM4, SHM8 cache lines. These are stalled cycles, with no locking involved.]

Message passing vs. shared memory: experiment

Message passing (move the operation to the data; sketched below):
- A single server core updates the memory locations
- Each client core sends RPCs to the server
- Operation and results described in a single cache line
- Block while waiting for a response (in this experiment)
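For comparison, a minimal sketch of the message-passing side, emulated over cache-coherent shared memory: each client owns a one-cache-line RPC slot, and the server polls the slots and applies each operation to data that only it touches. The rpc_slot layout, MAX_CLIENTS, and the increment "operation" are hypothetical, not the paper's code.

```c
#include <stdint.h>

#define CACHE_LINE  64
#define MAX_CLIENTS 16

/* One request/reply slot per client, each on its own cache line.
 * seq odd = request pending, seq even = reply ready. */
struct rpc_slot {
    volatile uint64_t seq;
    uint64_t payload[6];                    /* operation arguments / results */
    char pad[CACHE_LINE - 7 * 8];
} __attribute__((aligned(CACHE_LINE)));

/* Client: write the request into the line, publish it, spin for the reply. */
void rpc_call(struct rpc_slot *s, uint64_t idx, uint64_t delta, uint64_t *result) {
    s->payload[0] = idx;
    s->payload[1] = delta;
    __sync_synchronize();                   /* arguments visible before the flag */
    uint64_t ticket = s->seq + 1;           /* odd: request pending */
    s->seq = ticket;
    while (s->seq == ticket)                /* block while waiting (as in the experiment) */
        ;
    __sync_synchronize();                   /* re-read the reply payload after the flag */
    *result = s->payload[0];
}

/* Server core: poll all client slots and apply each operation to local data. */
void server_loop(struct rpc_slot slots[MAX_CLIENTS], uint64_t *local_data) {
    for (;;) {
        for (int c = 0; c < MAX_CLIENTS; c++) {
            if (slots[c].seq & 1) {                       /* request pending */
                __sync_synchronize();
                uint64_t idx = slots[c].payload[0];
                local_data[idx] += slots[c].payload[1];   /* the shipped operation */
                slots[c].payload[0] = local_data[idx];    /* reply: new value */
                __sync_synchronize();
                slots[c].seq++;                           /* even again: reply ready */
            }
        }
    }
}
```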

Message passing vs. shared memory: tradeoff

[Figure: the same 4 × 4-core AMD system. Latency (cycles × 1000, 0 to 12) against number of cores (2 to 16) for SHM1/2/4/8 and MSG1/MSG8, plus the actual cost of the update at the server. Annotations: messaging is faster from about 4 cores and 4 cache lines; the gap between client latency and the server's cost of the update would be "spare" cycles if the RPC were split-phase.]

0 2 4 6 8 10 12 14 16 Cores Actual cost of update at server 12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 15 I Adaptability to changing performance characteristics I Late-bind protocol and message transport implementations

2. Make OS structure hardware-neutral

I Separate OS structure from hardware I Only hardware-specific parts: I Message transports (highly optimised / specialised) I CPU / device drivers

2. Make OS structure hardware-neutral

- Separate OS structure from hardware
- Only hardware-specific parts:
  - Message transports (highly optimised / specialised)
  - CPU / device drivers
- Adaptability to changing performance characteristics
- Late-bind protocol and message transport implementations (sketched below)
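As an illustration of late binding, one plausible shape for a hardware-neutral transport interface in C. This is a sketch under assumptions, not Barrelfish's actual interconnect-driver API: transport_ops, channel_bind, and the urpc_ops/pcie_ops stubs are all hypothetical.

```c
struct msg { unsigned char bytes[64]; };          /* one cache-line-sized message */

struct transport_ops {                            /* hypothetical interface */
    int (*send)(void *state, const struct msg *m);
    int (*recv)(void *state, struct msg *m);      /* non-blocking poll */
};

struct channel {
    const struct transport_ops *ops;              /* bound late, per channel */
    void *state;                                  /* transport-private state */
};

/* Hardware-neutral code only ever calls through the ops table. */
static inline int channel_send(struct channel *c, const struct msg *m) {
    return c->ops->send(c->state, m);
}
static inline int channel_recv(struct channel *c, struct msg *m) {
    return c->ops->recv(c->state, m);
}

/* Stub transports, standing in for real implementations. */
static int urpc_send(void *state, const struct msg *m) { (void)state; (void)m; return 0; }
static int urpc_recv(void *state, struct msg *m)       { (void)state; (void)m; return -1; }
static const struct transport_ops urpc_ops = { urpc_send, urpc_recv };

static int pcie_send(void *state, const struct msg *m) { (void)state; (void)m; return 0; }
static int pcie_recv(void *state, struct msg *m)       { (void)state; (void)m; return -1; }
static const struct transport_ops pcie_ops = { pcie_send, pcie_recv };

/* At binding time, choose the cheapest transport that connects the two cores,
 * e.g. based on whether they share a cache-coherent interconnect. */
void channel_bind(struct channel *c, int cache_coherent, void *state) {
    c->ops   = cache_coherent ? &urpc_ops : &pcie_ops;
    c->state = state;
}
```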

12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 16 I Required by message-passing model I Naturally supports domains that do not share memory I Naturally supports changes to the set of running cores I Hotplug, power management

3. View state as replicated

I Potentially-shared state accessed as if it were a local replica I Scheduler queues, control blocks, etc.

3. View state as replicated

- Potentially-shared state accessed as if it were a local replica
  - Scheduler queues, process control blocks, etc.
- Required by message-passing model
- Naturally supports domains that do not share memory
- Naturally supports changes to the set of running cores
  - Hotplug, power management

12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 17 I In a multikernel, sharing is a local optimisation I Shared (locked) replica for threads or closely-coupled cores I Hidden, local I Only when faster, as decided at runtime I Basic model remains split-phase

Replication vs. sharing as default

I Replicas used as an optimisation in previous systems: Tornado, K42 clustered objects Linux read-only data, kernel text

Replication vs. sharing as default

- Replicas used as an optimisation in previous systems:
  - Tornado, K42: clustered objects
  - Linux: read-only data, kernel text
- In a multikernel, sharing is a local optimisation:
  - Shared (locked) replica for threads or closely-coupled cores
  - Hidden, local
  - Only when faster, as decided at runtime
- Basic model remains split-phase

The multikernel model

Outline

Introduction

Motivation
  Hardware diversity

The multikernel model
  Design principles
  The model

Barrelfish

Evaluation
  Case study: Unmap

Barrelfish

- From-scratch implementation of a multikernel
- Supports x86-64 multiprocessors (ARM soon)
- Open source (BSD licensed)

Barrelfish structure: monitors and CPU drivers

- CPU driver serially handles traps and exceptions
- Monitor mediates local operations on global state
- URPC inter-core (shared memory) message transport on current (cache-coherent) x86 HW

Non-original ideas in Barrelfish

Multiprocessor techniques:
- Minimise shared state (Tornado, K42, Corey)
- User-space messaging decoupled from IPIs (URPC)
- Single-threaded non-preemptive kernel per core (K42)

Other ideas we liked:
- Capabilities for all resource management (seL4)
- Upcall processor dispatch (Psyche, Sched. Activations, K42)
- Push policy into application domains (Exokernel, Nemesis)
- Lots of information (Infokernel)
- Run drivers in their own domains (µkernels)
- EDF as per-core CPU scheduler (RBED)
- Specify device registers in a little language (Devil)

Applications running on Barrelfish

- Slide viewer (this one!)
- Webserver (www.barrelfish.org)
- Virtual machine monitor (runs unmodified Linux)
- SPLASH-2, OpenMP (benchmarks)
- SQLite
- ECLiPSe (constraint engine)
- more...

Outline

Introduction

Motivation
  Hardware diversity

The multikernel model
  Design principles
  The model

Barrelfish

Evaluation
  Case study: Unmap

Evaluation goals

How do we evaluate an alternative OS structure?

- Good baseline performance
  - Comparable to existing systems on current hardware
- Scalability with cores
- Adaptability to different hardware
- Ability to exploit message-passing for performance

Case study: Unmap (TLB shootdown)

- Send a message to every core with a mapping; wait for all to be acknowledged
- Linux/Windows:
  1. Kernel sends IPIs
  2. Spins on shared acknowledgement count/event
- Barrelfish:
  1. User request to local monitor domain
  2. Single-phase commit to remote cores (sketched below)
- How to implement the communication?
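A hypothetical sketch of the Barrelfish-style path just described, not the actual monitor code: the chan_send/chan_poll primitives, the msg layout, and MAX_CORES are assumptions supplied for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CORES 32

enum msg_type { UNMAP_REQUEST, UNMAP_ACK };

struct msg { enum msg_type type; uintptr_t vaddr; size_t len; int from; };

/* Assumed per-core message primitives (supplied by the transport layer). */
bool chan_send(int dest_core, const struct msg *m);
bool chan_poll(int src_core, struct msg *m);         /* non-blocking receive */

/* Initiator: single-phase commit: send to all holders, wait for all acks. */
void unmap_initiate(int self, const bool has_mapping[MAX_CORES],
                    uintptr_t vaddr, size_t len)
{
    struct msg req = { UNMAP_REQUEST, vaddr, len, self };
    int outstanding = 0;

    for (int c = 0; c < MAX_CORES; c++)
        if (c != self && has_mapping[c] && chan_send(c, &req))
            outstanding++;

    /* ... remove the local mapping and flush the local TLB here ... */

    struct msg reply;
    while (outstanding > 0)                           /* wait for every remote ack */
        for (int c = 0; c < MAX_CORES; c++)
            if (c != self && chan_poll(c, &reply) && reply.type == UNMAP_ACK)
                outstanding--;
}

/* Remote core: drop the mapping from the local replica of the page tables,
 * flush the local TLB, then acknowledge. */
void unmap_handle(int self, const struct msg *req)
{
    /* remove_local_mapping(req->vaddr, req->len); flush_local_tlb(); */
    struct msg ack = { UNMAP_ACK, req->vaddr, req->len, self };
    chan_send(req->from, &ack);
}
```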

Unmap communication protocols

[Figure: protocol diagrams for unicast and broadcast.]

Unmap communication protocols: raw messaging cost

[Figure: latency (cycles × 1000, 0 to 14) against number of cores for the broadcast and unicast protocols.]

Why use multicast

[Figure: the 8 × 4-core AMD system.]

Multicast communication

- "NUMA-aware" multicast (one possible realisation is sketched below)
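One possible realisation of a NUMA-aware multicast, sketched under assumptions: per-socket aggregation cores, sends ordered by measured latency, an assumed msg_send primitive, and a topology struct standing in for the information that Barrelfish would obtain from the system knowledge base. This is illustrative, not Barrelfish's routing code.

```c
#define MAX_SOCKETS      8
#define CORES_PER_SOCKET 4

struct topology {
    int cores[MAX_SOCKETS][CORES_PER_SOCKET];   /* core IDs per socket         */
    int latency[MAX_SOCKETS];                   /* measured cost from the root */
};

void msg_send(int dest_core, const void *payload);  /* assumed primitive */

/* Root: one send per socket, longest round trips started first.
 * (The root's own socket could be handled locally instead.) */
void multicast_root(const struct topology *t, const void *payload)
{
    int order[MAX_SOCKETS];
    for (int s = 0; s < MAX_SOCKETS; s++)
        order[s] = s;

    /* Sort socket indices by decreasing latency (simple selection sort). */
    for (int i = 0; i < MAX_SOCKETS; i++)
        for (int j = i + 1; j < MAX_SOCKETS; j++)
            if (t->latency[order[j]] > t->latency[order[i]]) {
                int tmp = order[i]; order[i] = order[j]; order[j] = tmp;
            }

    for (int i = 0; i < MAX_SOCKETS; i++)
        msg_send(t->cores[order[i]][0], payload);   /* aggregation core of each socket */
}

/* Aggregation core: forward to the remaining cores on the local socket. */
void multicast_forward(const struct topology *t, int socket, const void *payload)
{
    for (int c = 1; c < CORES_PER_SOCKET; c++)
        msg_send(t->cores[socket][c], payload);
}
```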

Unmap communication protocols: raw messaging cost

[Figure: latency (cycles × 1000, 0 to 14) against number of cores (2 to 32) for broadcast, unicast, multicast, and NUMA-aware multicast.]

System knowledge base

- Constructing the multicast tree requires hardware knowledge:
  - Mapping of cores to sockets (CPUID data)
  - Messaging latency (online measurements)
- More generally, Barrelfish needs a way to reason about diverse system resources
- We tackle this with constraint logic programming [Schüpbach et al., MMCS'08]
- System knowledge base stores a rich, detailed representation of the hardware and performs online reasoning
- Initial implementation: port of the ECLiPSe constraint solver
- Prolog query used to construct the multicast routing tree

Unmap latency

[Figure: end-to-end unmap latency (cycles × 1000, 0 to 60) against number of cores (2 to 32) for Windows, Linux, and Barrelfish.]

Summary of other results

- No penalty for shared-memory (SPLASH, OpenMP)
- Network throughput: 951.7 Mbit/s (same as Linux)
- Pipelined web server:
  - Static: 640 Mbit/s vs. 316 Mbit/s for lighttpd/Linux
  - Dynamic: 3417 requests/s (17.1 Mbit/s), bottlenecked on SQL

12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 35 I Barrelfish: our concrete implementation I Reasonable performance on current hardware I Better scalability/adaptability for future hardware I Promising approach

www.barrelfish.org

Conclusion

I Modern computers are inherently distributed systems I It’s time to rethink OS structure to match I The Multikernel: model of the OS as a distributed system 1. Explicit communication, replicated state 2. Hardware-neutral OS structure

Conclusion

- Modern computers are inherently distributed systems
- It's time to rethink OS structure to match
- The Multikernel: model of the OS as a distributed system
  1. Explicit communication, replicated state
  2. Hardware-neutral OS structure
- Barrelfish: our concrete implementation
  - Reasonable performance on current hardware
  - Better scalability/adaptability for future hardware
- Promising approach

www.barrelfish.org

Backup slides

URPC implementation

- Current hardware provides one communication mechanism: cache-coherent shared memory
- Can we "trick" the cache-coherence protocol to send messages?
- User-level RPC (URPC) [Bershad et al., 1991]

- Channel is a shared ring buffer (sketched below)
- Messages are cache-line sized
- Sender writes message into next line
- Receiver polls on last word
- Marshalling/demarshalling, naming, binding all implemented above
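A minimal sketch of such a channel, matching the description above but simplified: the urpc_ring/urpc_cursor layout, the epoch scheme, RING_SLOTS, and the absence of flow control are all assumptions, and x86-style memory ordering is assumed. It is not Barrelfish's actual URPC implementation.

```c
#include <stdint.h>
#include <string.h>

#define CACHE_LINE    64
#define RING_SLOTS    16
#define PAYLOAD_WORDS 7                       /* 7 payload words + 1 flag word = 64 bytes */

struct urpc_msg {
    uint64_t payload[PAYLOAD_WORDS];
    volatile uint64_t epoch;                  /* last word of the line: written last, polled on */
};

struct urpc_ring {                            /* shared between sender and receiver cores */
    struct urpc_msg slot[RING_SLOTS] __attribute__((aligned(CACHE_LINE)));
};

/* Private to each side; both start with pos = 0, epoch = 1 over a zeroed ring. */
struct urpc_cursor { uint64_t pos, epoch; };

/* Sender: write the payload into the next line, then publish it via the last word. */
void urpc_send(struct urpc_ring *r, struct urpc_cursor *cur,
               const uint64_t payload[PAYLOAD_WORDS])
{
    struct urpc_msg *m = &r->slot[cur->pos];
    memcpy(m->payload, payload, sizeof m->payload);
    __sync_synchronize();                     /* payload visible before the flag word */
    m->epoch = cur->epoch;
    if (++cur->pos == RING_SLOTS) { cur->pos = 0; cur->epoch++; }
}

/* Receiver: poll the flag word of the expected slot; the line stays local to
 * the receiver's cache until the sender's write invalidates it. */
int urpc_try_recv(struct urpc_ring *r, struct urpc_cursor *cur,
                  uint64_t payload[PAYLOAD_WORDS])
{
    struct urpc_msg *m = &r->slot[cur->pos];
    if (m->epoch != cur->epoch)
        return 0;                             /* no new message yet */
    __sync_synchronize();                     /* read payload only after seeing the flag */
    memcpy(payload, m->payload, PAYLOAD_WORDS * sizeof payload[0]);
    if (++cur->pos == RING_SLOTS) { cur->pos = 0; cur->epoch++; }
    return 1;
}
```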

12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 38 I There is a tradeoff here! I IPIs are decoupled from fast-path messaging, used only for: 1. Specific (batches of) operations that require low latency, even when other tasks are executing 2. Awakening cores that have blocked to save power (alternatively, MONITOR/MWAIT)

Polling for receive Tradeoff vs. IPIs

I Polling is cheap: line is local to receiver until message arrives I Hardware-imposed costs for IPI (on 4 4-core AMD): ⇥ I 800 cycles to send (from user-mode) ⇡ I 1200 cycles lost in receive (to user-mode) ⇡

Polling for receive: tradeoff vs. IPIs

- Polling is cheap: line is local to receiver until message arrives
- Hardware-imposed costs for IPI (on the 4 × 4-core AMD system):
  - ≈ 800 cycles to send (from user-mode)
  - ≈ 1200 cycles lost in receive (to user-mode)
- There is a tradeoff here!
- IPIs are decoupled from fast-path messaging, used only for:
  1. Specific (batches of) operations that require low latency, even when other tasks are executing
  2. Awakening cores that have blocked to save power (alternatively, MONITOR/MWAIT)

(A sketch of a poll-then-block receive path follows.)
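A hedged sketch of one way to combine the two: poll for a bounded budget while it is cheap, then fall back to a blocking wait that relies on an IPI (or MONITOR/MWAIT) for wakeup. The try_receive, request_notification, and block_until_notified primitives and the POLL_BUDGET value are assumptions, not a real Barrelfish API.

```c
#include <stdint.h>

#define POLL_BUDGET 2000        /* tuning knob: stop polling once an IPI would have been cheaper */

int  try_receive(void *chan, uint64_t msg[7]);   /* non-blocking poll (assumed)               */
void request_notification(void *chan);           /* ask for an IPI on the next send (assumed) */
void block_until_notified(void);                 /* sleep until woken, e.g. by that IPI       */

void receive_blocking(void *chan, uint64_t msg[7])
{
    for (;;) {
        for (int i = 0; i < POLL_BUDGET; i++)
            if (try_receive(chan, msg))
                return;                  /* fast path: message arrived while polling */

        request_notification(chan);      /* slow path: stop burning cycles ...       */
        if (try_receive(chan, msg))      /* ... but re-check to avoid a lost wakeup  */
            return;
        block_until_notified();
    }
}
```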
