The Multikernel A new OS architecture for scalable multicore systems
Andrew Baumann1 Paul Barham2 Pierre-Evariste Dagand3 Tim Harris2 Rebecca Isaacs2 Simon Peter1 Timothy Roscoe1 Adrian Schüpbach1 Akhilesh Singhania1
1 Systems Group, ETH Zurich 2 Microsoft Research, Cambridge 3 ENS Cachan Bretagne
Systems Group | Department of Computer Science | ETH Zurich SOSP, 12th October 2009 Introduction
How should we structure an OS for future multicore systems?
I Scalability to many cores I Heterogeneity and hardware diversity
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 2 System diversity
“Victoria Falls” 128 threads FB DIMMFB FB DIMM DIMM FB FBDIMM DIMM FB FB DIMM DIMM FB DIMM 16 cores 65x (two sockets) MCU MCU MCU MCU MCU MCU MCU MCU AMD Opteron (Istanbul)
UltraSPARC T2 L2$ L2$ L2$ L2$ L2$ L2$ L2$ L2$ processor L2$ L2$ L2$ L2$ L2$ L2$ L2$ L2$ 64 threads Full Cross Bar eight cores Full Cross Bar 35x C0 C1 C2 C3 C4 C5 C6 C7 C0 C1 C2 C3 C4 C5 C6 C7 FPU FPU FPU FPU FPU FPU FPU FPU FPU FPU FPU FPU FPU FPU FPU FPU SPU SPU SPU SPU SPU SPU SPU SPU UltraSPARC® T1 processor 32 threads eight cores NIU Sun Niagara T2 14x Sys I/F PCIe (E-NET+) Buffer Switch Core UltraSPARC® IIIi NIU Sys I/F PCIe processor (Ethernet+) Buffer Switch Core 1x 2x 10 Power <100 W x8 @2. GHz Gigabit Ethernet 2x 10 Power <95 W x8 @ 2.0 GHz Intel Nehalem (Beckton) Gigabit Ethernet 2004 2005 2006 2007 2008
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 3 The interconnect matters Today’s 8-socket Opteron
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 4 The interconnect matters Tomorrow’s 8-socket Nehalem
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 5 The interconnectGDDR matters GDDR On-chip interconnects
In-Order In-Order Multi-threaded Multi-threaded Wide SIMD Wide SIMD I$ D$ I$ D$ Fixed Function Fixed Fixed Function Fixed Display Interface Display Coherent L2 Cache Interface Display Memory ControllerMemory Memory ControllerMemory ControllerMemory Memory ControllerMemory In-Order In-Order Multi-threaded Multi-threaded Wide SIMD Wide SIMD Texture Logic Texture Texture Logic Texture I$ D$ I$ D$ System Interface System System Interface System
PCIe DDR2 Controller 0 DDR2 Controller 1
SerDes SerDes PROCESSOR CACHE
PCIe 0 Reg File L2 CACHE MAC/ L-1I L-1D P P P I-TLB D-TLB PHY 2 1 0 2D DMA
UART, MDN TDN
HPI, I2C, GbE 0 UDN IDN JTAG,SPI STN SWITCH Flexible GbE 1 I/O
Flexible I/O PCIe 1 MAC/ XAUI 1 PHY MAC/ PHY SerDes SerDes
DDR2 Controller 3 DDR2 Controller 2
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 6 Core diversity
I Within a system: I Programmable NICs I GPUs I FPGAs (in CPU sockets) I On a single die: I Performance asymmetry I Streaming instructions (SIMD, SSE, etc.) I Virtualisation support
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 7 Summary
I Increasing core counts, increasing diversity I Unlike HPC systems, cannot optimise at design time
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 8 I Proposal: structure the OS as a distributed system I Design principles: 1. Make inter-core communication explicit 2. Make OS structure hardware-neutral 3. View state as replicated
The multikernel model
I It’s time to rethink the default structure of an OS I Shared-memory kernel on every core I Data structures protected by locks I Anything else is a device
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 9 The multikernel model
I It’s time to rethink the default structure of an OS I Shared-memory kernel on every core I Data structures protected by locks I Anything else is a device I Proposal: structure the OS as a distributed system I Design principles: 1. Make inter-core communication explicit 2. Make OS structure hardware-neutral 3. View state as replicated
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 9 Outline
Introduction
Motivation Hardware diversity
The multikernel model Design principles The model
Barrelfish
Evaluation Case study: Unmap
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 10 I Decouples system structure from inter-core communication mechanism I Communication patterns explicitly expressed I Naturally supports heterogeneous cores, non-coherent interconnects (PCIe) I Better match for future hardware I . . . with cheap explicit message passing (e.g. Tile64) I . . . without cache-coherence (e.g. Intel 80-core) I Allows split-phase operations I Decouple requests and responses for concurrency I We can reason about it
1. Make inter-core communication explicit
I All communication with messages (no shared state)
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 11 1. Make inter-core communication explicit
I All communication with messages (no shared state) I Decouples system structure from inter-core communication mechanism I Communication patterns explicitly expressed I Naturally supports heterogeneous cores, non-coherent interconnects (PCIe) I Better match for future hardware I . . . with cheap explicit message passing (e.g. Tile64) I . . . without cache-coherence (e.g. Intel 80-core) I Allows split-phase operations I Decouple requests and responses for concurrency I We can reason about it
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 11 Message passing vs. shared memory: experiment Shared memory (move the data to the operation): I Each core updates the same memory locations (no locking) I Cache-coherence protocol migrates modified cache lines I Processor stalled while line is fetched or invalidated I Limited by latency of interconnect round-trips I Performance depends on data size (cache lines) and contention (number of cores)
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 12 Shared memory results 4 4-core AMD system ⇥ 12 SHM1
10
8 1000) ×
6
4 Latency (cycles
2
0 2 4 6 8 10 12 14 16 Cores
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 13 Shared memory results 4 4-core AMD system ⇥ 12 SHM2 SHM1 10
8 1000) ×
6
4 Latency (cycles
2
0 2 4 6 8 10 12 14 16 Cores
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 13 Shared memory results 4 4-core AMD system ⇥ 12 SHM4 SHM2 10 SHM1
8 1000) ×
6
4 Latency (cycles
2
0 2 4 6 8 10 12 14 16 Cores
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 13 Shared memory results 4 4-core AMD system ⇥ 12 SHM8 SHM4 10 SHM2 SHM1
8 1000) ×
6 (no locking!)
4 Latency (cycles
2 Stalled cycles
0 2 4 6 8 10 12 14 16 Cores
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 13 Message passing vs. shared memory: experiment Message passing (move the operation to the data): I A single server core updates the memory locations I Each client core sends RPCs to the server I Operation and results described in a single cache line I Block while waiting for a response (in this experiment)
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 14 Message passing vs. shared memory: tradeoff 4 4-core AMD system ⇥ 12 SHM8 SHM4 10 SHM2 SHM1 MSG1 8 1000) ×
6
4 Latency (cycles
2
0 2 4 6 8 10 12 14 16 Cores
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 15 Message passing vs. shared memory: tradeoff 4 4-core AMD system ⇥ 12 SHM8 SHM4 10 SHM2 SHM1 MSG8 8 MSG1 1000) ×
6
4 Latency (cycles
2
0 2 4 6 8 10 12 14 16 Cores
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 15 Message passing vs. shared memory: tradeoff 4 4-core AMD system ⇥ 12 SHM8 SHM4 10 SHM2 SHM1 MSG8 8 MSG1 1000) ×
6
4 Latency (cycles
Messaging 2 faster for: 4 cores 0 2 4 6 8 10 12 14 16 4 cache lines Cores
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 15 Message passing vs. shared memory: tradeoff 4 4-core AMD system ⇥ 12 SHM8 SHM4 10 SHM2 SHM1 MSG8 8 MSG1 1000)
× Server
6
4 Latency (cycles
2
0 2 4 6 8 10 12 14 16 Cores Actual cost of update at server 12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 15 Message passing vs. shared memory: tradeoff 4 4-core AMD system ⇥ 12 SHM8 SHM4 10 SHM2 SHM1 MSG8 8 MSG1 1000)
× Server
6
4 “spare” cycles
Latency (cycles if RPC was 2 split-phase
0 2 4 6 8 10 12 14 16 Cores Actual cost of update at server 12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 15 I Adaptability to changing performance characteristics I Late-bind protocol and message transport implementations
2. Make OS structure hardware-neutral
I Separate OS structure from hardware I Only hardware-specific parts: I Message transports (highly optimised / specialised) I CPU / device drivers
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 16 2. Make OS structure hardware-neutral
I Separate OS structure from hardware I Only hardware-specific parts: I Message transports (highly optimised / specialised) I CPU / device drivers I Adaptability to changing performance characteristics I Late-bind protocol and message transport implementations
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 16 I Required by message-passing model I Naturally supports domains that do not share memory I Naturally supports changes to the set of running cores I Hotplug, power management
3. View state as replicated
I Potentially-shared state accessed as if it were a local replica I Scheduler queues, process control blocks, etc.
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 17 3. View state as replicated
I Potentially-shared state accessed as if it were a local replica I Scheduler queues, process control blocks, etc. I Required by message-passing model I Naturally supports domains that do not share memory I Naturally supports changes to the set of running cores I Hotplug, power management
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 17 I In a multikernel, sharing is a local optimisation I Shared (locked) replica for threads or closely-coupled cores I Hidden, local I Only when faster, as decided at runtime I Basic model remains split-phase
Replication vs. sharing as default
I Replicas used as an optimisation in previous systems: Tornado, K42 clustered objects Linux read-only data, kernel text
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 18 Replication vs. sharing as default
I Replicas used as an optimisation in previous systems: Tornado, K42 clustered objects Linux read-only data, kernel text
I In a multikernel, sharing is a local optimisation I Shared (locked) replica for threads or closely-coupled cores I Hidden, local I Only when faster, as decided at runtime I Basic model remains split-phase
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 18 The multikernel model
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 19 Outline
Introduction
Motivation Hardware diversity
The multikernel model Design principles The model
Barrelfish
Evaluation Case study: Unmap
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 20 Barrelfish
I From-scratch implementation of a multikernel I Supports x86-64 multiprocessors (ARM soon) I Open source (BSD licensed)
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 21 Barrelfish structure Monitors and CPU drivers
I CPU driver serially handles traps and exceptions I Monitor mediates local operations on global state I URPC inter-core (shared memory) message transport on current (cache-coherent) x86 HW
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 22 Non-original ideas in Barrelfish Multiprocessor techniques: I Minimise shared state (Tornado, K42, Corey) I User-space messaging decoupled from IPIs (URPC) I Single-threaded non-preemptive kernel per core (K42) Other ideas we liked: I Capabilities for all resource management (seL4) I Upcall processor dispatch (Psyche, Sched. Activations, K42) I Push policy into application domains (Exokernel, Nemesis) I Lots of information (Infokernel) I Run drivers in their own domains (µkernels) I EDF as per-core CPU scheduler (RBED) I Specify device registers in a little language (Devil)
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 23 Applications running on Barrelfish
I Slide viewer (this one!) I Webserver (www.barrelfish.org) I Virtual machine monitor (runs unmodified Linux) I SPLASH-2, OpenMP (benchmarks) I SQLite I ECLiPSe (constraint engine) I more. . .
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 24 Outline
Introduction
Motivation Hardware diversity
The multikernel model Design principles The model
Barrelfish
Evaluation Case study: Unmap
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 25 Evaluation goals How do we evaluate an alternative OS structure?
I Good baseline performance I Comparable to existing systems on current hardware I Scalability with cores I Adapability to different hardware I Ability to exploit message-passing for performance
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 26 Case study: Unmap (TLB shootdown)
I Send a message to every core with a mapping, wait for all to be acknowledged I Linux/Windows: 1. Kernel sends IPIs 2. Spins on shared acknowledgement count/event I Barrelfish: 1. User request to local monitor domain 2. Single-phase commit to remote cores I How to implement communication?
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 27 Broadcast
Unmap communication protocols
Unicast
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 28 Unmap communication protocols
Unicast Broadcast
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 28 Unmap communication protocols Raw messaging cost 14 Broadcast Unicast 12
10 1000) × 8
6
4 Latency (cycles
2
0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 Cores 12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 29 Why use multicast 8 4-core AMD system ⇥
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 30 Why use multicast 8 4-core AMD system ⇥
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 30 I “NUMA-aware” multicast
Multicast communication
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 31 Multicast communication
I “NUMA-aware” multicast
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 31 Unmap communication protocols Raw messaging cost 14 Broadcast Unicast 12 Multicast NUMA-Aware Multicast 10 1000) × 8
6
4 Latency (cycles
2
0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 Cores 12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 32 I We tackle this with constraint logic programming [Schüpbach et al., MMCS’08] I System knowledge base stores rich, detailed representation of hardware, performs online reasoning i e I Initial implementation: port of the ECL PS constraint solver I Prolog query used to construct multicast routing tree
System knowledge base
I Constructing multicast tree requires hardware knowledge I Mapping of cores to sockets (CPUID data) I Messaging latency (online measurements) I More generally, Barrelfish needs a way to reason about diverse system resources
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 33 System knowledge base
I Constructing multicast tree requires hardware knowledge I Mapping of cores to sockets (CPUID data) I Messaging latency (online measurements) I More generally, Barrelfish needs a way to reason about diverse system resources I We tackle this with constraint logic programming [Schüpbach et al., MMCS’08] I System knowledge base stores rich, detailed representation of hardware, performs online reasoning i e I Initial implementation: port of the ECL PS constraint solver I Prolog query used to construct multicast routing tree
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 33 Unmap latency
60 Windows Linux 50 Barrelfish
40 1000) ×
30
20 Latency (cycles
10
0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 Cores
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 34 Summary of other results
I No penalty for shared-memory (SPLASH, OpenMP) I Network throughput: 951.7Mbit/s (same as Linux) I Pipelined web server
I Static: 640 Mbit/s vs. 316 Mbit/s for lighttpd/Linux I Dynamic:3417 requests/s (17.1Mbit/s) bottlenecked on SQL
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 35 I Barrelfish: our concrete implementation I Reasonable performance on current hardware I Better scalability/adaptability for future hardware I Promising approach
www.barrelfish.org
Conclusion
I Modern computers are inherently distributed systems I It’s time to rethink OS structure to match I The Multikernel: model of the OS as a distributed system 1. Explicit communication, replicated state 2. Hardware-neutral OS structure
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 36 Conclusion
I Modern computers are inherently distributed systems I It’s time to rethink OS structure to match I The Multikernel: model of the OS as a distributed system 1. Explicit communication, replicated state 2. Hardware-neutral OS structure I Barrelfish: our concrete implementation I Reasonable performance on current hardware I Better scalability/adaptability for future hardware I Promising approach
www.barrelfish.org
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 36 Backup slides
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 37 URPC implementation
I Current hardware provides one communication mechanism: cache-coherent shared memory I Can we “trick” cache-coherence protocol to send messages? I User-level RPC (URPC) [Bershad et al., 1991]
I Channel is shared ring buffer I Messages are cache-line sized I Sender writes message into next line I Receiver polls on last word I Marshalling/demarshalling, naming, binding all implemented above
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 38 I There is a tradeoff here! I IPIs are decoupled from fast-path messaging, used only for: 1. Specific (batches of) operations that require low latency, even when other tasks are executing 2. Awakening cores that have blocked to save power (alternatively, MONITOR/MWAIT)
Polling for receive Tradeoff vs. IPIs
I Polling is cheap: line is local to receiver until message arrives I Hardware-imposed costs for IPI (on 4 4-core AMD): ⇥ I 800 cycles to send (from user-mode) ⇡ I 1200 cycles lost in receive (to user-mode) ⇡
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 39 Polling for receive Tradeoff vs. IPIs
I Polling is cheap: line is local to receiver until message arrives I Hardware-imposed costs for IPI (on 4 4-core AMD): ⇥ I 800 cycles to send (from user-mode) ⇡ I 1200 cycles lost in receive (to user-mode) ⇡ I There is a tradeoff here! I IPIs are decoupled from fast-path messaging, used only for: 1. Specific (batches of) operations that require low latency, even when other tasks are executing 2. Awakening cores that have blocked to save power (alternatively, MONITOR/MWAIT)
12.10.2009 The Multikernel: A new OS architecture for scalable multicore systems 39