IN A BIT MORE DEPTH

COMP9242 2006/S2 Week 4 MOTIVATION

• Early operating systems had very little structure

• A strictly layered approach was promoted by [Dij68]

• Later OS (more or less) followed that approach (e.g., ).

• Such systems are known as monolithic kernels

ADVANTAGES OF MONOLITHIC KERNELS

• Kernel has access to everything:
  ➜ all optimisations possible
  ➜ all techniques/mechanisms/concepts implementable

• Kernel can be extended by adding more code, e.g. for:
  ➜ new services
  ➜ support for new hardware

PROBLEMS WITH LAYERED APPROACH

• Widening range of services and applications ⇒ OS bigger, more complex, slower, more error prone.

• Need to support same OS on different hardware.

• Would like to support various OS environments.

• Distribution ⇒ impossible to provide all services from same (local) kernel.

EVOLUTION OF THE LINUX KERNEL

[Figure: Linux kernel size (.tar.gz, in MB, axis 0 to 35) plotted against release date, 31-Jan-93 to 14-Jan-04. Annotation: Linux 2.4.18: 2.7M LoC.]

APPROACHES TO TACKLING COMPLEXITY

• Classical software-engineering approach: modularity
  ➜ (relatively) small, mostly self-contained components
  ➜ well-defined interfaces between them
  ➜ enforcement of interfaces
  ➜ containment of faults to few modules

• Doesn’t work with monolithic kernels:
  ➜ all kernel code executes in privileged mode
  ➜ faults aren’t contained
  ➜ interfaces cannot be enforced
  ➜ performance takes priority over structure

EVOLUTION OF THE LINUX KERNEL, PART 2

Software-engineering study of Linux kernel [SJW+02]:

• Looked at size and interdependencies of kernel “modules”
  ➜ “common coupling”: interdependency via global variables
• Analysed development over time (represented by linearised version)
• Result 1: Module size grows linearly with version number
• Result 2: Interdependency grows exponentially with version!
• The present Linux model is doomed!
• There is no reason to believe that this problem does not exist for Windows, Mac OS, ...

IDEA: BREAK UP THE OS

[Figure: layers from top to bottom: Applications; User-level servers; Microkernel; Hardware. Side labels: OS, non-privileged, privileged.]

Based on the ideas of Brinch Hansen’s “Nucleus” [BH70]

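MONOLITHIC VS. MICROKERNEL OS STRUCTURE

[Figure: two structure diagrams side by side, each on top of Hardware.
 Monolithic: the Application (user mode) enters the kernel via syscall; kernel mode contains VFS; IPC, file system; Scheduler, virtual memory; Device drivers, Dispatcher, ...
 Microkernel: Application, UNIX Server, Device Driver and File Server run in user mode and interact via IPC; kernel mode contains IPC, virtual memory.]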

MICROKERNEL OS

• Kernel:
  – contains code which must run in supervisor mode;
  – isolates hardware dependence from higher levels;
  – is small and fast ⇒ extensible system;
  – provides mechanisms.

• User-level servers:
  – are hardware independent/portable,
  – provide “OS environment”/“OS personality” (maybe several),
  – may be invoked:
    ➜ from application (via message-passing IPC)
    ➜ from kernel (upcalls);
  – implement policies [BH70].

DOWNCALL VS. UPCALL

[Figure: user code (unprivileged) above the kernel (privileged); a downcall arrow points from user code into the kernel, an upcall arrow from the kernel into user code.]

• Downcall (syscall): unprivileged code enters kernel mode
  ➜ implemented via trap
• Upcall: privileged code enters user mode
  ➜ implemented via IPC

MICROKERNEL-BASED SYSTEMS

[Figure: three configurations, each built on L4 running on hardware (HW): “classic”, “classic + thin”, and “specialized”. Box labels: classic OS, Security, embedded app, native Java, RT, MM, highly-specialized component.]

EARLY EXAMPLE:

• Separation of mechanism from policy
  ➜ e.g. protection vs. security

• No hierarchical layering of kernel.

• Protection, even within OS.
  ➜ Uses (segregated) capabilities.

• Objects, encapsulation, units of protection.

• Unique object name, no ownership.

• Object persistence based on reference counting [WCC+74].

HYDRA ...

• Can be considered the first object-oriented OS

• Has been called the first microkernel OS (by people who ignored Brinch Hansen)

• Has had enormous influence on later operating systems research

• Was never widely used, even at CMU, because of
  ➜ poor performance,
  ➜ lack of a complete environment.

POPULAR EXAMPLE: MACH

• Developed at CMU by Rashid and others [RTY+88] from 1984 on

• Successor of Accent [FR86] and RIG [Ras88].

Goals:

• Tailorability: support different OS interfaces

• Portability: almost all code H/W independent

• Real-time capability

• Multiprocessor and distribution support

• Security

Coined term microkernel.

BASIC FEATURES OF MACH KERNEL

• Task and management

• Interprocess communication (asynchronous message-passing)

• Memory object management

• System call redirection

• Device support

• Multicomputer support

MACH TASKS AND THREADS

• Task consists of one or more threads

• Task provides address space and other environment

• Thread is active entity (basic unit of CPU utilisation)

• Threads have own stacks, are kernel scheduled

• Threads may run in parallel on multiprocessor

• “Privileged user-state program” may be used to control scheduling

• Task created from “blueprint” with empty or inherited address space
  ➜ similar approach adopted by Linux

• Activated by creating a thread in it

MACH IPC: PORTS

• Addressing based on ports:
  ➜ port is a mailbox, allocated/destroyed via a system call
  ➜ has a fixed-size message queue associated with it
  ➜ is protected by (segregated) capabilities
  ➜ has exactly one receiver, but possibly many senders
  ➜ can have “send-once” capability to a port
• Can pass the receive capability for a port to another process
  ➜ give up read access to the port
• Kernel detects ports without senders or receiver
• Processes may have many ports (UNIX server has 2000!)
• Ports can be grouped into port sets
  ➜ Allows listening to many ports (like )
• Send blocks if queue is full
  ➜ except with send-once cap (used for server replies)
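The port rules on this slide can be illustrated with a small toy model. This is only a sketch under stated assumptions: the class `Port`, its method names and the use of strings as task identities are hypothetical and are not Mach's actual interface. Send-once capabilities and port sets are not modelled, and a full queue is reported by a return value instead of blocking the sender.

```python
from collections import deque


class Port:
    """Toy model of a port: a bounded mailbox with exactly one receiver."""

    def __init__(self, receiver, capacity=4):
        self.receiver = receiver   # the single holder of the receive capability
        self.senders = set()       # holders of send capabilities
        self.queue = deque()       # fixed-size message queue
        self.capacity = capacity

    def grant_send(self, task):
        """Hand a send capability to a task (many senders are allowed)."""
        self.senders.add(task)

    def send(self, task, msg):
        """Queue a message; return False where a real sender would block."""
        if task not in self.senders:
            raise PermissionError("task holds no send capability")
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(msg)
        return True

    def receive(self, task):
        """Dequeue the oldest message, or None if the queue is empty."""
        if task != self.receiver:
            raise PermissionError("task holds no receive capability")
        return self.queue.popleft() if self.queue else None

    def transfer_receive(self, old, new):
        """Pass the receive capability on; the old holder loses read access."""
        if old != self.receiver:
            raise PermissionError("task holds no receive capability")
        self.receiver = new
```

Because the receive capability is a single value rather than a set, the "exactly one receiver" rule holds by construction in this model.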

MACH IPC: MESSAGES

• Segregated capabilities:
  ➜ threads refer to them via local indices
  ➜ kernel marshals capabilities in messages
  ➜ message format must identify caps

• Message contents:
  – send capability to destination port (mandatory)
    ➜ used by kernel to validate operation
  – optional send capability to reply port
    ➜ for use by receiver to send reply
  – possibly other capabilities
  – “in-line” (by-value) data
  – “out-of-line” (by-reference) data, using copy-on-write
    ➜ may contain whole address spaces

MACH IPC

[Figure: message layout: message header, port rights (capabilities), in-line data, out-of-line data. Below it, the virtual address spaces of task 1 and task 2 are shown with their mappings to physical memory before and after the IPC.]

MACH VIRTUAL MEMORY MANAGEMENT

• Address space constructed from memory regions
  – initially empty
  – populated by:
    ➜ explicit allocation
    ➜ explicitly mapping a memory object
    ➜ inheriting from “blueprint” (as in Linux clone())
       (inheritance: none, shared or copied)
    ➜ automatic allocation by kernel during IPC
       (when passing by-reference parameters)
    ⇒ sparse virtual memory use (unlike UNIX)
  – 3 page states:
    ➜ unallocated,
    ➜ allocated & unreferenced,
    ➜ allocated & initialised

COPY-ON-WRITE IN MACH

• When data is copied (“blueprint” or passed by-reference):
  ➜ source and destination share single copy,
  ➜ both virtual pages are mapped to the same frame

• Marked as read-only

• When one copy is modified, a fault occurs

• Handling by kernel involves making a physical copy
  ➜ VM mapping is changed to refer to the new copy

• Advantage:
  ➜ efficient way of sharing/passing large amounts of data

• Drawbacks:
  ➜ expensive for small amounts of data (page table manipulations)
  ➜ data must be properly aligned
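The copy-on-write sequence can be traced with a small simulation. It is a sketch only: `Frame`, `AddressSpace` and the reference count used to decide whether a frame is still shared are illustrative assumptions, not Mach data structures.

```python
class Frame:
    """A physical frame with a count of the mappings that refer to it."""

    def __init__(self, data):
        self.data = bytearray(data)
        self.refs = 1


class AddressSpace:
    def __init__(self):
        self.pages = {}  # virtual page number -> [frame, writable]

    def map_new(self, vpn, data):
        self.pages[vpn] = [Frame(data), True]

    def cow_copy_to(self, other, vpn, dst_vpn):
        """'Copy' a page: both sides share the frame, marked read-only."""
        frame = self.pages[vpn][0]
        frame.refs += 1
        self.pages[vpn][1] = False
        other.pages[dst_vpn] = [frame, False]

    def read(self, vpn, offset):
        return self.pages[vpn][0].data[offset]

    def write(self, vpn, offset, value):
        entry = self.pages[vpn]
        if not entry[1]:                      # write to read-only page: fault
            if entry[0].refs > 1:             # still shared: physical copy
                entry[0].refs -= 1
                entry[0] = Frame(entry[0].data)
            entry[1] = True                   # mapping now refers to a private frame
        entry[0].data[offset] = value
```

Note that the physical copy happens only on the first write, and only while the frame is still shared; the last remaining user simply gets write access back.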

MACH ADDRESS MAPS

• Address spaces represented as address maps:

[Figure: an address map drawn as a list of entries with head and tail pointers; each entry refers to a memory object.]

• Any part of AS can be mapped to (part of) a memory object

• Compact representation of sparse address spaces
  ➜ Compare to multi-level page tables?
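To make the compactness argument concrete, here is a minimal sketch of an address map as a sorted list of entries. The names `MapEntry` and `AddressMap`, the string used to name a memory object and the binary search are assumptions made for illustration; the slide only states that a map consists of entries referring to memory objects.

```python
import bisect
from dataclasses import dataclass


@dataclass
class MapEntry:
    start: int    # first virtual address covered
    end: int      # one past the last virtual address covered
    obj: str      # backing memory object (hypothetical name)
    offset: int   # offset within the object that corresponds to start


class AddressMap:
    def __init__(self):
        self.entries = []  # kept sorted by start, never overlapping

    def insert(self, entry):
        starts = [e.start for e in self.entries]
        i = bisect.bisect_left(starts, entry.start)
        if i > 0 and self.entries[i - 1].end > entry.start:
            raise ValueError("overlaps previous entry")
        if i < len(self.entries) and entry.end > self.entries[i].start:
            raise ValueError("overlaps next entry")
        self.entries.insert(i, entry)

    def lookup(self, addr):
        """Return (object, offset) for addr, or None for an unmapped hole."""
        starts = [e.start for e in self.entries]
        i = bisect.bisect_right(starts, addr) - 1
        if i >= 0 and addr < self.entries[i].end:
            e = self.entries[i]
            return e.obj, e.offset + (addr - e.start)
        return None
```

Two regions that lie gigabytes apart cost two entries here, regardless of the size of the hole between them, which is the point of comparison with multi-level page tables.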

MEMORY OBJECTS

• Kernel doesn’t support file system

• Memory objects are an abstraction of secondary storage:
  – can be mapped into virtual memory
  – are cached by the kernel in physical memory
  – pager invoked if uncached page is touched
    ➜ used by file system server to provide data

• Support data sharing
  ➜ by mapping objects into several address spaces

• Physical memory is only a cache for memory objects

USER-LEVEL PAGE FAULT HANDLERS

• All actual I/O performed by pager; can be
  ➜ default pager (provided by kernel), or
  ➜ external pager, running at user level.

[Figure: a Task maps a memory object; the Kernel holds the memory cache object and communicates with the External Pager via IPC.]

• Intrinsic page fault cost: 2 IPCs

HANDLING PAGE FAULTS

➀ Check protection & locate memory object
   ➜ uses address map
➁ Check cache, invoke pager if cache miss
   ➜ uses a hashed page table
➂ Check copy-on-write
   ➜ perform physical copy if write fault
➃ Enter new mapping into H/W page tables
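The four steps can be sketched as a single function. Everything concrete here is an assumption for illustration: plain dictionaries stand in for the address map, the page cache and the hardware page table, `pager` is a hypothetical callback that returns page contents, and the page size is fixed at 4096 bytes.

```python
PAGE = 4096  # assumed page size


def handle_page_fault(vaddr, is_write, address_map, cache, pager, hw_table):
    vpn = vaddr // PAGE

    # Step 1: check protection and locate the memory object (address map).
    entry = address_map.get(vpn)
    if entry is None or (is_write and not entry["writable"]):
        raise MemoryError("protection violation")
    key = (entry["obj"], entry["page"])

    # Step 2: check the cache; invoke the pager on a miss.
    if key not in cache:
        cache[key] = pager(*key)
    frame = cache[key]

    # Step 3: copy-on-write: make a physical copy on a write fault.
    if entry["cow"] and is_write:
        frame = bytearray(frame)
        entry["cow"] = False

    # Step 4: enter the new mapping into the (simulated) H/W page table.
    # A page that is still copy-on-write is mapped read-only.
    hw_table[vpn] = (frame, entry["writable"] and not entry["cow"])
```

A read fault on a copy-on-write page maps the cached frame read-only; a later write fault replaces it with a private, writable copy and does not call the pager again.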

REMOTE COMMUNICATION

• Client A sends message to server B on remote node
  ➀ A sends message to local proxy port for B’s receive port
  ➁ User-level network message server (NMS) receives from proxy port
  ➂ NMS converts proxy port into (global) network port.
  ➃ NMS sends message to NMS on B’s node
     ➜ may need conversion (byte order...)
  ➄ Remote NMS converts network port into local port (B’s).
  ➅ Remote NMS sends message to that port
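The port translation done by the two network message servers can be sketched as a pair of lookup tables. This is an illustration under assumptions: the class `NetMsgServer`, its method names, the string port names and the use of UTF-8 encoding as a stand-in for data conversion are all hypothetical, and the network is replaced by a direct method call.

```python
class NetMsgServer:
    """Toy user-level network message server (NMS) for one node."""

    def __init__(self):
        self.proxies = {}  # local proxy port -> (remote NMS, network port)
        self.exports = {}  # network port -> local receive queue

    def export(self, net_port, local_queue):
        """Make a local port reachable under a global network port name."""
        self.exports[net_port] = local_queue

    def add_proxy(self, proxy_port, remote_nms, net_port):
        """Register a local proxy port that stands for a remote port."""
        self.proxies[proxy_port] = (remote_nms, net_port)

    def forward(self, proxy_port, text):
        """Steps 2 to 4: take a message from a proxy port and ship it."""
        remote_nms, net_port = self.proxies[proxy_port]  # proxy -> network port
        wire = text.encode("utf-8")                      # stands in for conversion
        remote_nms.deliver(net_port, wire)

    def deliver(self, net_port, wire):
        """Steps 5 and 6: map the network port to the local port and send."""
        local_queue = self.exports[net_port]
        local_queue.append(wire.decode("utf-8"))
```

The client only ever names its local proxy port, so in this model it cannot tell whether the server is local or remote.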

MACH UNIX EMULATION

[Figure: Application with its Emulation Library, and the Unix Server, all on top of Mach. Numbered arrows (1 to 5) trace a system call from the application through Mach’s syscall redirect to the emulation library and on to the Unix server.]

• Emulation library in user address space handles IPC

• Invoked by system call redirection (trampoline mechanism)
  ➜ supports binary compatibility

MACH = MICROKERNEL?

• Most OS services implemented at user level
  ➜ using memory objects and external pagers
  ➜ Provides mechanisms, not policies

• Mostly hardware independent

• Big!
  ➜ 140 system calls
  ➜ Size: 200k instructions

• Performance poor
  ➜ tendency to move features into kernel
  ➜ Darwin (base of MacOS X) has complete BSD kernel inside Mach

Further information on Mach: [YTR+87,CDK94,Sin97]

OTHER CLIENT-SERVER SYSTEMS

• Lots!

• Most notable systems:
  Amoeba: FU Amsterdam, early 1980’s [TM81,TM84,MT86]
  Chorus: INRIA (France), from early 1980’s [DA92,RAA+90,RAA+92]
    ➜ Commercialised by Chorus Systèmes in 1988
    ➜ Bought by Sun a number of years back
    ➜ Closed down later, Chorus team spun out to create Jaluna
    ➜ Now market new microkernel-like OSware
  Windows NT: Microsoft (early 1990’s) [Cus93]
    ➜ Early versions (NT 3) were microkernel-ish
    ➜ Now main servers run in kernel mode

CRITIQUE OF MICROKERNEL ARCHITECTURES

“I’m not interested in making devices look like user-level. They aren’t, they shouldn’t, and microkernels are just stupid.”
Linus Torvalds

Is Linus right?

MICROKERNEL PERFORMANCE

• First generation microkernel systems exhibited poor performance when compared to monolithic UNIX implementations
  ➜ particularly Mach, the best-known example

• Reasons are investigated by [Chen & Bershad 93]:
  ➜ instrumented user and system code to collect execution traces
  ➜ run on DECstation 5000/200 (25MHz R3000)
  ➜ run under Ultrix and Mach with Unix server
  ➜ traces fed to memory system simulator
  ➜ analyse MCPI (memory cycles per instruction)
  ➜ baseline MCPI (i.e. excluding idle loops)

ULTRIX VS. MACH MCPI

INTERPRETATION

Observations:

• Mach memory penalty (i.e. cache misses or write stalls) higher

• Mach VM system executes more instructions than Ultrix (but has more functionality)

Claim:

• Degraded performance is (intrinsic?) result of OS structure

• IPC cost (known to be high in Mach) is not a major factor [Ber92]

ASSERTIONS

1 OS has less instruction and data locality than user code
  ➜ System code has higher cache and TLB miss rates
  ➜ Particularly bad for instructions

2 System execution is more dependent on instruction cache behaviour than is user execution
  ➜ MCPIs dominated by system i-cache misses
  Note: most benchmarks were small, i.e. user code fits in cache

3 Competition between user and system code is not a problem
  ➜ Few conflicts between user and system caching
  ➜ TLB misses are not a relevant factor
  Note: the hardware used has direct-mapped physical caches.
  ⇒ Split system/user caches wouldn’t help

SELF INTERFERENCE

➜ Only examine system cache misses
➜ Shaded: System cache misses removed by associativity
➜ MCPI for system-only, using R3000 direct-mapped cache
➜ Reductions due to associativity were obtained by running system on a simulator and using a two-way associative cache of the same size

ASSERTIONS...

4 Self-interference is a problem in system instruction reference streams.
  ➜ High internal conflicts in system code
  ➜ System would benefit from higher cache associativity

5 System block memory operations are responsible for a large percentage of memory system reference costs.
  ➜ Particularly true for I/O system calls

6 Write buffers are less effective for system references.
  ➜ write buffer allows limited asynchronous writes on cache misses

7 Virtual-to-physical mapping strategy can have significant impact on cache performance
  ➜ Unfortunate mapping may increase conflict misses
  ➜ “Random” mappings (Mach) are to be avoided

OTHER EXPERIENCE WITH MICROKERNEL PERFORMANCE

• System call costs are (inherently?) high
  ➜ Typically hundreds of cycles, 900 for Mach/i486

• Context (address-space) switching costs are (inherently?) high
  ➜ Getting worse (in terms of cycles) with increasing CPU/memory speed ratios [Ous90]
  ➜ IPC (involving system calls and context switches) is inherently expensive

SO, WHAT’S WRONG?

• Microkernels heavily depend on IPC

• IPC is expensive
  – Is the microkernel idea flawed?
  – Should some code never leave the kernel?
  – Do we have to buy flexibility with performance?

A CRITIQUE OF THE CRITIQUE

Data presented earlier:
  ➜ are specific to one (or a few) systems,
  ➜ results cannot be generalised without thorough analysis,
  ➜ no such analysis had been done

⇒ Cannot trust the conclusions [Lie95]

RE-ANALYSIS OF CHEN & BERSHAD’S DATA

MCPI for Ultrix (U) and Mach (M). In the original chart each bar is split into “other MCPI” and “system cache miss MCPI”.

Benchmark    U        M
sed          0.227    0.495
egrep        0.035    0.081
yacc         0.067    0.129
gcc          0.434    0.690
compress     0.250    0.418
ab           0.427    0.534
espresso     0.041    0.068

RE-ANALYSIS OF CHEN & BERSHAD’S DATA...

MCPI caused by cache misses for Ultrix (U) and Mach (M). In the original chart each bar is split into conflict misses (black) and capacity misses (white).

Benchmark    U        M
sed          0.170    0.415
egrep        0.024    0.069
yacc         0.039    0.098
gcc          0.130    0.388
compress     0.102    0.258
ab           0.230    0.382
espresso     0.012    0.037

CONCLUSION

• Mach system (kernel + UNIX server + emulation library) is too big!

• UNIX server is essentially the same

• Emulation library is irrelevant (according to Chen & Bershad)

⇒ Mach kernel working set is too big

Can we build microkernels which avoid these problems?

REQUIREMENTS FOR MICROKERNELS:

• Fast (system call costs, IPC costs)

• Small (big ⇒ slow)

⇒ Must be well designed

⇒ Must provide a minimal set of operations.

Can this be done?

ARE HIGH SYSTEM COSTS ESSENTIAL?

• Example: kernel call cost on i486
  – Mach kernel call: 900 cycles
  – Inherent (hardware-dictated) cost: 107 cycles
    ⇒ 800 cycles kernel overhead
  – L4 kernel call: 123–180 cycles (15–73 cycles overhead)
  ⇒ Mach’s performance is a result of design and implementation, not the microkernel concept!
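The overhead figures follow from subtracting the hardware-dictated cost from the measured kernel call cost. The short calculation below redoes that subtraction; the function name `kernel_overhead` is only illustrative. Plain subtraction gives 793 cycles for Mach, which the slide rounds to 800, and 16 cycles for the lower L4 bound, where the slide quotes 15.

```python
HW_COST = 107  # inherent, hardware-dictated cost of a kernel call on i486 (cycles)


def kernel_overhead(total_cycles, hw_cost=HW_COST):
    """Cycles spent in kernel software on top of the hardware cost."""
    return total_cycles - hw_cost


mach_overhead = kernel_overhead(900)                       # 793, roughly 800
l4_overhead = (kernel_overhead(123), kernel_overhead(180)) # (16, 73)
ratio = mach_overhead / l4_overhead[1]                     # Mach vs. worst-case L4
```

Even against L4's worst case, Mach's software overhead comes out more than ten times larger on these numbers.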

MICROKERNEL DESIGN PRINCIPLES [Lie96]

Minimality: If it doesn’t have to be in the kernel, it shouldn’t be in the kernel

Appropriate abstractions which can be made fast and allow efficient implementation of services

Well written: It pays to shave a few cycles off TLB refill handler or the IPC path

Unportable: must be targeted to specific hardware
  ➜ no problem if it’s small, and higher layers are portable
  ➜ Example: Liedtke reports significant rewrite of memory management when porting from 486 to Pentium
  ⇒ “abstract hardware layer” is too costly

NON-PORTABILITY EXAMPLE: I486 VS PENTIUM:

• Size and associativity of TLB

• Size and organisation of cache (larger line size — restructured IPC)

• Segment regs in Pentium used to simulate tagged TLB

⇒ different trade-offs

WHAT MUST A MICROKERNEL PROVIDE?

• Virtual memory/address spaces

• Threads

• Fast IPC

• Unique identifiers (for IPC addressing)

MICROKERNEL DOES NOT HAVE TO PROVIDE:

• File system
  ➜ use user-level server (as in Mach)

• Device drivers
  ➜ user-level driver invoked via interrupt (= IPC)

• Page-fault handler
  ➜ use user-level pager

L4 IMPLEMENTATION TECHNIQUES (LIEDTKE)

• Appropriate system calls to reduce number of kernel invocations
  ➜ e.g., reply & receive next
  ➜ as many syscall args as possible in registers

• Efficient IPC
  ➜ rich message structure
  ➜ value and reference parameters in message
  ➜ copy message only once (i.e. not user→kernel→user)

• Fast thread access
  ➜ Thread UIDs (containing thread ID)
  ➜ TCBs in (mapped) VM, cache-friendly layout
  ➜ Separate kernel stack for each thread (for fast interrupt handling)

• General optimisations
  ➜ “Hottest” kernel code is shortest
  ➜ Kernel IPC code on single page, critical data on single page
  ➜ Many H/W specific optimisations

• Minimising copying saves copy time as well as TLB/cache misses
• TCB in VM:
  ➜ fast access (easy address calc),
  ➜ fewer TLB misses (no table of TCBs),
  ➜ no checking for residency,
  ➜ IPC doesn’t need to worry about memory management
• UIDs:
  ➜ fast access to TCB,
  ➜ no index checking
• msg in regs:
  ➜ only have 2 regs for this on 486 (8 bytes), but 48% faster!
  ➜ Partially due to just treating short messages separately
  ➜ More improvement on machines with more regs
• short kernel code ⇒ fewer cache/TLB misses (single page)
• single page ⇒ fewer TLB misses
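The “easy address calc” for TCBs in virtual memory can be sketched as a mask and a shift. All constants below (base address, TCB size, number of thread-number bits, the version field) are hypothetical values chosen for illustration and are not taken from any L4 implementation.

```python
TCB_BASE = 0xE000_0000      # hypothetical start of the TCB region in kernel VM
TCB_SIZE_LOG2 = 11          # hypothetical TCB size of 2 KiB
THREAD_NO_BITS = 16         # hypothetical width of the thread number in a UID
THREAD_NO_MASK = (1 << THREAD_NO_BITS) - 1


def make_uid(thread_no, version):
    """Build a UID with the thread number in the low bits (assumed layout)."""
    return (version << THREAD_NO_BITS) | (thread_no & THREAD_NO_MASK)


def tcb_address(uid):
    """Locate the TCB with a mask, a shift and an add; no table lookup."""
    return TCB_BASE + ((uid & THREAD_NO_MASK) << TCB_SIZE_LOG2)
```

Because the mask bounds the thread number, every UID maps to an address inside the TCB region, which is one way to read “no index checking”: the address is always valid, and whether a TCB is actually in use can be decided from the TCB itself.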

MICROKERNEL PERFORMANCE

System      CPU           MHz     RPC (µs)   cyc/IPC   semantics
L4          R4600         100     1.7        100       full
L4          Alpha         433     0.2        45        full
L4          Pentium       166     1.5        121       full
L4          486           50      10         250       full
QNX         486           33      76         1254      full
Mach        R2000         16.7    190        1584      full
SCR RPC     CVAX          12.5    464        2900      full
Mach        486           50      230        5750      full
Amoeba      68020         15      800        6000      full
Spin        Alpha 21064   133     102        6783      full
Mach        Alpha 21064   133     104        6916      full
Exo-tlrpc   R2000         116.7   6          53        restricted
Spring      SparcV8       40      11         220       restricted
DP-Mach     486           66      16         528       restricted
LRPC        CVAX          12.5    157        981       restricted
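Many rows of the table are consistent with a simple conversion, assuming that one RPC consists of two IPCs (request and reply): cycles per IPC = RPC time in µs × clock in MHz / 2. That relation is an inference from the numbers, not something the slide states, and a few rows (for example L4 on the R4600) do not match it exactly. A minimal check:

```python
def cycles_per_ipc(rpc_us, mhz):
    """Cycles per IPC, assuming one RPC is two IPCs (an assumption)."""
    return rpc_us * mhz / 2


# (system, clock in MHz, RPC time in µs, cyc/IPC as listed in the table)
rows = [
    ("L4/486", 50, 10, 250),
    ("QNX/486", 33, 76, 1254),
    ("SCR RPC/CVAX", 12.5, 464, 2900),
    ("Mach/486", 50, 230, 5750),
    ("Amoeba/68020", 15, 800, 6000),
]

computed = {name: cycles_per_ipc(us, mhz) for name, mhz, us, _ in rows}
```

For these five rows the computed value equals the listed cyc/IPC exactly, which suggests the column was derived from the measured RPC times in this way.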

PISTACHIO IPC PERFORMANCE

                                  C++                   optimised
Architecture  port/optimisation   intra AS   inter AS   intra AS   inter AS
Pentium-3     UKa                 180        367        113        305
Pentium-4     UKa                 385        983        196        416
Itanium 2     UKa/NICTA           508        508        36         36
  cross CPU   UKa                 7419       7410       N/A        N/A
MIPS64        NICTA/UNSW          276        276        109        109
  cross CPU   NICTA/UNSW          3238       3238       690        690
PowerPC-64    NICTA/UNSW          330        518        200‡       200‡
Alpha 21264   NICTA/UNSW          440        642        ≈70†       ≈70†
ARM/XScale    NICTA/UNSW          340        340        151        151
UltraSPARC    NICTA/UNSW                                100‡       100‡

† “Version 2” assembler kernel
‡ Guestimate!

CASE IN POINT: L4LINUX [HÄRTIG et al. 97]

• Port of Linux kernel to L4 (like Mach Unix server) ➜ single-threaded (for simplicity, not performance) ➜ is pager of all Linux user processes ➜ maps emulation library and signal-handling code into AS ➜ server AS maps physical memory (& Linux runs within) ➜ copying between user and server done on physical memory ➜ use software lookup of page tables for address translation

• Changes to Linux restricted to architecture-dependent part

• Duplication of page tables (L4 and Linux server)

• Binary compatible to native Linux via trampoline mechanism ➜ but also modified with RPC stubs
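The "software lookup of page tables" step can be sketched as follows. The server cannot dereference a user virtual address directly, so it walks the page table in software to find the physical address and then accesses that through its own mapping of physical memory. The sketch assumes a two-level, 32-bit, x86-like table format (10-bit directory index, 10-bit table index, 12-bit offset). All type and function names are hypothetical and are not taken from L4Linux.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT   12u
#define PAGE_SIZE    (1u << PAGE_SHIFT)
#define ENTRIES      1024u
#define PTE_PRESENT  0x1u

typedef uint32_t pte_t;                              /* frame address | flags   */
typedef struct { pte_t *tables[ENTRIES]; } pgdir_t;  /* NULL: no 2nd-level table */

/* Walk the page table in software.
 * Returns 0 and stores the physical address in *paddr on success,
 * or -1 if the virtual address is not mapped. */
static int soft_translate(const pgdir_t *pd, uint32_t vaddr, uint32_t *paddr)
{
    assert(pd != NULL && paddr != NULL);

    const pte_t *pt = pd->tables[vaddr >> 22];       /* top 10 bits */
    if (pt == NULL)
        return -1;

    pte_t pte = pt[(vaddr >> PAGE_SHIFT) & (ENTRIES - 1u)];  /* middle 10 bits */
    if (!(pte & PTE_PRESENT))
        return -1;

    /* Combine the frame address with the offset inside the page. */
    *paddr = (pte & ~(PAGE_SIZE - 1u)) | (vaddr & (PAGE_SIZE - 1u));
    return 0;
}
```

A copy routine built on this would translate page by page, since consecutive virtual pages need not be physically contiguous.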

SIGNAL DELIVERY IN L4LINUX

[Figure: signal delivery between the Linux server (main thread) and a Linux user process (user thread, signal thread): (1) forward signal, (2) manipulate thread, (3) enter Linux, (4) resume]

• Separate signal-handler thread in each user process ➜ server IPCs signal-handler thread ➜ handler thread ex_regs main user thread to save state ➜ user thread IPCs Linux server ➜ server does signal processing ➜ server IPCs user thread to resume
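The four-step sequence above can be written down as a small state machine. This is only a model of the ordering and of which party acts at each step. The state and actor names are assumptions made for illustration, and no real IPC is performed.

```c
#include <assert.h>

/* Protocol states, numbered as in the signal-delivery steps. */
enum sig_step {
    SIG_PENDING,    /* server has a signal to deliver                     */
    SIG_FORWARDED,  /* (1) server IPCed the signal-handler thread         */
    SIG_DIVERTED,   /* (2) handler thread ex_regs'ed the main user thread */
    SIG_IN_SERVER,  /* (3) user thread IPCed the server; signal processed */
    SIG_RESUMED     /* (4) server IPCed the user thread to resume         */
};

enum sig_actor { ACT_SERVER, ACT_SIGNAL_THREAD, ACT_USER_THREAD };

/* Advance one step if `who` is the party that must act next;
 * otherwise leave the state unchanged. */
static enum sig_step sig_advance(enum sig_step s, enum sig_actor who)
{
    switch (s) {
    case SIG_PENDING:   return who == ACT_SERVER        ? SIG_FORWARDED : s;
    case SIG_FORWARDED: return who == ACT_SIGNAL_THREAD ? SIG_DIVERTED  : s;
    case SIG_DIVERTED:  return who == ACT_USER_THREAD   ? SIG_IN_SERVER : s;
    case SIG_IN_SERVER: return who == ACT_SERVER        ? SIG_RESUMED   : s;
    default:
        assert(s == SIG_RESUMED);   /* terminal state */
        return s;
    }
}
```

The model makes one property of the design visible: the server never manipulates the user thread directly. It only sends IPC, and the thread manipulation in step (2) is done by a thread inside the user process.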

L4LINUX PERFORMANCE: MICROBENCHMARKS

System                 Time [µs]   Cycles
Linux                  1.68        223
L4Linux                3.95        526
L4Linux (trampoline)   5.66        753
MkLinux in-kernel      15.41       2050
MkLinux server         110.60      14710

Cycle breakdown:

Client                 Cycles   Server
enter emulation lib    20
send syscall message   168      wait for msg
                       131      Linux kernel
receive reply          188      send reply
leave emulation lib    19

Hardware cost: 82 cycles (133 MHz Pentium)
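The cycle breakdown can be cross-checked against the table: the five steps add up to the 526 cycles listed for L4Linux, of which 131 are spent in the Linux kernel itself. A minimal sketch of that arithmetic, with illustrative names:

```c
#include <assert.h>
#include <stddef.h>

/* Cycle counts from the breakdown, in order of execution. */
static const unsigned breakdown_cycles[] = {
    20,   /* client: enter emulation lib                          */
    168,  /* client: send syscall message / server: wait for msg  */
    131,  /* server: Linux kernel                                 */
    188,  /* server: send reply / client: receive reply           */
    19    /* client: leave emulation lib                          */
};

/* Sum an array of cycle counts. */
static unsigned sum_cycles(const unsigned *cycles, size_t n)
{
    assert(cycles != NULL);
    unsigned total = 0;
    for (size_t i = 0; i < n; i++)
        total += cycles[i];
    return total;
}
```

Subtracting the 131 cycles of Linux kernel work leaves 395 cycles for the emulation library and the two IPCs.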

L4LINUX PERFORMANCE

Microbenchmarks:

[Figure: microbenchmark results for Linux, L4Linux, MkLinux (in-kernel) and MkLinux (user), axis ticks 1 to 14. Benchmarks shown: write /dev/null lat, null process lat, simple process lat, /bin/sh process lat, mmap lat, 2-proc context switch lat, 8-proc context switch lat, pipe lat, UDP lat, RPC/UDP lat, TCP lat, RPC/TCP lat, pipe bw, TCP bw, file reread bw, mmap reread bw]

Macrobenchmarks: Kernel compile:

Linux              476 s
L4Linux            506 s   (+6.3%)
L4Linux (trampo)   509 s   (+6.9%)
MkLinux (kernel)   555 s   (+16.6%)
MkLinux (user)     605 s   (+27.1%)
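The percentages in the kernel-compile results are the slowdown relative to native Linux. A minimal sketch of the calculation, with an illustrative function name:

```c
#include <assert.h>

/* Relative overhead of a measured run time over the native run time,
 * in percent. Both times are in seconds. */
static double overhead_percent(double native_s, double measured_s)
{
    assert(native_s > 0.0);
    return (measured_s - native_s) / native_s * 100.0;
}
```

For L4Linux this gives (506 − 476) / 476 × 100, which rounds to 6.3%.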

CONCLUSIONS

• Mach sux ⇏ microkernels suck

• L4 shows that performance might be deliverable ➜ L4Linux gets close to monolithic kernel performance ➜ need real multi-server system to evaluate microkernel potential

• Recent work (Wombat) gets substantially closer to native performance ⇒ Microkernel-based systems can perform

• Mach has prejudiced community (see Linus...) – It’s still an uphill battle! – ... but recently made significant inroads ➜ Qualcomm deployment ➜ Darbat project

• L4 starting to make significant inroads in embedded-systems market

REFERENCES

[Ber92] Brian N. Bershad. The increasing irrelevance of IPC performance for microkernel-based operating systems. In USENIX WS Microkernels & other Kernel Arch., pages 205–211, Seattle, WA, USA, Apr 1992.

[BH70] Per Brinch Hansen. The nucleus of a multiprogramming system. CACM, 13:238–250, 1970.

[CB93] J. Bradley Chen and Brian N. Bershad. The impact of operating system structure on memory system performance. In 14th SOSP, pages 120–133, Asheville, NC, USA, Dec 1993.

[CDK94] George F. Coulouris, Jean Dollimore, and Tim Kindberg. Distributed Systems: Concepts and Design. Addison-Wesley, 2nd edition, 1994.

[Cus93] Helen Custer. Inside Windows NT. Microsoft Press, 1993.

[DA92] Randall W. Dean and Francois Armand. Data movement in kernelized systems. In USENIX WS Microkernels & other Kernel Arch., pages 243–261, Seattle, WA, USA, Apr 1992.

[Dij68] Edsger W. Dijkstra. The structure of the “THE” multiprogramming system. CACM, 11:341–346, 1968.

[FR86] Robert Fitzgerald and Richard F. Rashid. The integration of virtual memory management and interprocess communication in Accent. Trans. Comp. Syst., 4:147–177, 1986.

[HHL+97] Hermann Härtig, Michael Hohmuth, Jochen Liedtke, Sebastian Schönberg, and Jean Wolter. The performance of µ-kernel-based systems. In 16th SOSP, pages 66–77, St. Malo, France, Oct 1997.

[Lie95] Jochen Liedtke. On µ-kernel construction. In 15th SOSP, pages 237–250, Copper Mountain, CO, USA, Dec 1995.

[Lie96] Jochen Liedtke. Towards real microkernels. CACM, 39(9):70–77, Sep 1996.

[MT86] Sape J. Mullender and Andrew S. Tanenbaum. The design of a capability-based distributed operating system. The Comp. J., 29:289–299, 1986.

[Ous90] J.K. Ousterhout. Why aren’t operating systems getting faster as fast as hardware? In 1990 Summer USENIX Techn. Conf., pages 247–56, Jun 1990.

[RAA+90] M. Rozier, V. Abrossimov, F. Armand, I. Boule, M. Gien, M. Guillemont, F. Herrmann, C. Kaiser, S. Langlois, P. Léonard, and W. Neuhauser. Overview of the CHORUS distributed operating system. Technical report CS/TR-90-25, Chorus systèmes, Montigny-le-Bretonneux (France), Apr 1990.

[RAA+92] M. Rozier, V. Abrossimov, F. Armand, I. Boule, M. Gien, M. Guillemont, F. Herrmann, C. Kaiser, S. Langlois, P. Léonard, and W. Neuhauser. Overview of the Chorus distributed operating system. In USENIX WS Microkernels & other Kernel Arch., pages 39–69, Seattle, WA, USA, Apr 1992.

[Ras88] Richard F. Rashid. From RIG to Accent to Mach: The evolution of a network operating system. In Bernardo A. Huberman, editor, The Ecology of Computation, Studies in Computer Science and Artificial Intelligence, pages 207–230. North-Holland, Amsterdam, 1988.

[RTY+88] Richard Rashid, Avadis Tevanian, Jr., Michael Young, David Golub, Robert Baron, David Black, William J. Bolosky, and Jonathan Chew. Machine-independent virtual memory management for paged uniprocessor and multiprocessor architectures. Trans. Computers, C-37:896–908, 1988.

[Sin97] Pradeep K. Sinha. Distributed Operating Systems: Concepts and Design. Comp. Soc. Press, 1997.

[SJW+02] Stephen R. Schach, Bo Jin, David R. Wright, Gillian Z. Heller, and A. Jefferson Offutt. Maintainability of the Linux kernel. IEE Proc.: Softw., 149:18–23, 2002.

[TM81] Andrew S. Tanenbaum and Sape J. Mullender. An overview of the Amoeba distributed operating system. Operat. Syst. Rev., 15(3):51–64, 1981.

[TM84] Andrew S. Tanenbaum and Sape Mullender. The design of a capability-based distributed operating system. Technical Report IR-88, Vrije Universiteit, Nov 1984.

[WCC+74] W. Wulf, E. Cohen, W. Corwin, A. Jones, R. Levin, C. Pierson, and F. Pollack. HYDRA: The kernel of a multiprocessor operating system. CACM, 17:337–345, 1974.

[YTR+87] Michael Young, Avadis Tevanian, Richard Rashid, David Golub, Jeffrey Eppinger, Jonathan Chew, William Bolosky, David Black, and Robert Baron. The duality of memory and communication in the implementation of a multiprocessor operating system. In 11th SOSP, pages 63–76, 1987.