
Theseus: A State Spill-free Operating System

Kevin Boos, Lin Zhong

Rice Efficient Computing Group, Rice University
PLOS 2017, October 28

Problems with Today’s OSes

• Modern single-node OSes are vast and very complex
• Results in entangled web of components
• Nigh impossible to decouple
• Difficult to maintain, evolve, update safely, and run reliably

Easy Evolution is Crucial

• Computer hardware must endure longer upgrade cycles [1]
• Exacerbated by the (economic) decline of Moore’s Law
• Evolutionary advancements occur mostly in software
• Extreme example: DARPA’s challenge for systems to “remain robust and functional in excess of 100 years” [2]

[1] The PC upgrade cycle slows to every five to six years, Intel’s CEO says. PCWorld article.
[2] DARPA seeks to create software systems that could last 100 years. https://www.darpa.mil/news-events/2015-04-08.

What do we need?

Easier evolution by reducing complexity without reducing size (and features).

We need a disentangled OS that:
• allows every component to evolve independently at runtime
• prevents failures in one component from jeopardizing others

“But surely existing systems have solved this already?”

– the astute audience member

Existing attempts to decouple systems

1. Traditional modularization

2. Encapsulation-based

3. Privilege-level separation

4. Hardware-driven

1. Traditional modularization
• Decompose large monolithic system into smaller entities of related functionality (separation of concerns)
• Achieves some goals: code reuse
• Often causes tight coupling, which inhibits other goals
• Evidence: Linux is highly modular, but requires substantial effort to realize live update [3, 4, 5, 6] and fault isolation [7, 8, 9]

[3] J. Arnold and M. F. Kaashoek. Ksplice: Automatic rebootless kernel updates. EuroSys, 2009.
[4] M. Siniavine and A. Goel. Seamless kernel updates. DSN, 2013.
[5] G. Altekar, I. Bagrak, P. Burstein, and A. Schultz. OPUS: Online patches and updates for security. USENIX Security, 2005.
[6] K. Makris and K. D. Ryu. Dynamic and adaptive updates of non-quiescent subsystems in commodity OS. EuroSys, 2007.
[7] M. M. Swift, M. Annamalai, B. N. Bershad, and H. M. Levy. Recovering device drivers. OSDI, 2004.
[8] M. M. Swift, B. N. Bershad, and H. M. Levy. Improving the reliability of commodity operating systems. SOSP, 2003.
[9] C. Jacobsen, et al. Lightweight capability domains: Towards decomposing the Linux kernel. SIGOPS Oper. Syst. Rev., 2016.

2. Encapsulation-based
• Group related code and data together into a single entity
• Strict boundaries between entities, e.g., classes in OOP
• Achieves better maintainability and adaptability
• Similar problems as traditional modularization, i.e., inextricably coupled entities that are difficult to interchange [10, 11]

[10] C. A. Soules, et al. System support for online reconfiguration. USENIX ATC, 2003.
[11] F. M. David, E. M. Chan, J. C. Carlyle, and R. H. Campbell. CuriOS: Improving reliability through operating system structure. OSDI, 2008.

3. Privilege-level separation
• Aims to decouple entities by forcing them into separate domains with boundaries based on privilege levels
• Microkernels, virtual machines
• Achieves fault isolation
• Coarse spatial granularity [12]
• Evolution remains difficult because userspace servers must still closely collaborate [13]

[12] J. N. Herder, H. Bos, B. Gras, P. Homburg, and A. S. Tanenbaum. MINIX 3: A highly reliable, self-repairing operating system. ACM OS Review, 2006.
[13] C. Giuffrida, A. Kuijsten, and A. S. Tanenbaum. Safe and automatic live update for operating systems. ASPLOS, 2013.

4. Hardware-driven
• Choose entity bounds based on the underlying hardware architecture (cores, coherence domains)
• Barrelfish [14], Helios [15], fos [16], K2 [17]
• Achieves scalable and energy-efficient performance
• Does not facilitate evolution, runtime flexibility, or fault isolation

[14] A. Baumann, et al. The multikernel: A new OS architecture for scalable multicore systems. SOSP, 2009.
[15] E. B. Nightingale, et al. Helios: Heterogeneous multiprocessing with satellite kernels. SOSP, 2009.
[16] D. Wentzlaff, et al. An operating system for multicore and clouds. SoCC, 2010.
[17] Felix Lin, et al. K2: A mobile OS for heterogeneous coherence domains. ASPLOS, 2014.

Key Insight:

state spill is the root cause of entanglement within OSes

overlooked by existing decoupling strategies [EuroSys’17]

What is state spill?

• When one software entity’s state undergoes a lasting change as a result of handling an interaction with another entity.

• Prevalent and deeply ingrained in modern system software
• Causes entanglement
  • Individual entities cannot be easily interchanged
  • Multiple entities share fate
• Hinders other goals as well
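The definition above can be illustrated with a small Rust sketch. All names here (`SpillingServer`, `allocate_id`, `allocate_id_stateless`) are hypothetical and do not come from Theseus; the sketch only contrasts a server whose state changes lastingly after each interaction with a variant that keeps nothing between calls.

```rust
/// Hypothetical server entity that exhibits state spill: every request
/// leaves a lasting change in the server's own state.
pub struct SpillingServer {
    next_id: u64,
}

impl SpillingServer {
    pub fn new() -> Self {
        SpillingServer { next_id: 0 }
    }

    /// Handling this interaction mutates `self`, and the change outlives
    /// the call, so each client's history is now part of the server.
    pub fn allocate_id(&mut self) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        id
    }
}

/// Spill-free variant: the caller supplies the previously issued id (if any)
/// and receives the next one. The function keeps nothing between calls.
pub fn allocate_id_stateless(previous: Option<u64>) -> u64 {
    match previous {
        None => 0,
        Some(p) => p + 1,
    }
}
```

In the first variant, replacing the server at runtime would require migrating `next_id`; in the second, the same request always yields the same answer, so the callee can be swapped without any state transfer.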

Kevin Boos, et al. A Characterization of State Spill in Modern Operating Systems. EuroSys, 2017.

“Why is state spill a useful concept?”

– the skeptical audience member

Modern OSes under the light of state spill

[Figure: modules entangled via state spill in (a) Monolithic Kernel, (b) Microkernel OS, (c) Theseus Kernel]

• Web of interacting modules spill state into each other
• Larger modules contain submodules that cannot be managed independently

Theseus: a Disentangled OS

[Figure: (a) Monolithic Kernel, (b) Microkernel OS, (c) Theseus Kernel]

• Inspired by distributed computing
• Implemented from scratch using Rust
• Runtime composable

Our Namesake: Ship of Theseus

The ship wherein Theseus and the youth of Athens returned from Crete had thirty oars, and was preserved by the Athenians down even to the time of Demetrius Phalereus, for they took away the old planks as they decayed, putting in new and stronger timber in their places, insomuch that this ship became a standing example among the philosophers, for the logical question of things that grow; one side holding that the ship remained the same, and the other contending that it was not the same.

– Plutarch (Theseus)

Theseus Directives

Primary Directive no state spill

eliminate state spill above all else, e.g., above performance and ease of programming

Secondary Directive elementary modules

no submodules; modules as small as possible

Design Principles

Decoupling entities based on state spill (pairwise)
1. No Encapsulation
2. Stateless Communication

Composing a Disentangled OS (multi-entity)
3. Universal, connectionless interfaces
4. Generic pattern reuse

P1: No Encapsulation

• Encapsulation: code & data are bundled together
  • Direct cause of state spill
• Instead, eschew encapsulation
  • Client should maintain its own progress with the server
  • Must preserve information hiding

[Figure: client state and server state across the calls config(c), func1(), func2(r1), under Standard Encapsulation versus Opaque Exportation + Stateless Communication]

P2: Stateless Communication

• An interaction from client to server must be self-sufficient, containing everything that the server requires to handle it
  • No assumption of prior interactions or state
• Implications:
  • Server entity is effectively stateless
  • No hidden global dependencies
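A minimal sketch of P1 and P2 together, written by hand under stated assumptions: the module `server`, the type `Progress`, and the functions `start`, `add`, `finish`, and `sum_via_server` are hypothetical names for illustration and are not Theseus APIs. The client carries its own progress between calls, each call contains everything the server needs, and the private fields keep that progress hidden from the client.

```rust
mod server {
    /// Sealed progress token. Its fields are private to this module, so a
    /// client can hold and return it but cannot inspect or modify it.
    pub struct Progress {
        sum: u64,
        steps: u32,
    }

    /// Starts an interaction; the server hands all progress to the client.
    pub fn start() -> Progress {
        Progress { sum: 0, steps: 0 }
    }

    /// Each call is self-sufficient: it receives the progress by value
    /// (a move, so a stale token cannot be reused) plus the new input.
    pub fn add(progress: Progress, value: u64) -> Progress {
        Progress {
            sum: progress.sum + value,
            steps: progress.steps + 1,
        }
    }

    /// Ends the interaction and reveals only the result.
    pub fn finish(progress: Progress) -> (u64, u32) {
        (progress.sum, progress.steps)
    }
}

/// Client-side driver: the client holds its own progress between calls,
/// so the server module has no state that could change.
pub fn sum_via_server(values: &[u64]) -> (u64, u32) {
    let mut progress = server::start();
    for &v in values {
        progress = server::add(progress, v);
    }
    server::finish(progress)
}
```

Because `Progress` is moved into every call, Rust's ownership rules reject any attempt to reuse an old token, which is one way an affine type system can support this style without runtime checks.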

[Figure: calls fn0(), fn1(), fn2() from client to server]

Select Design and Implementation Decisions

State Management via Opaque Exportation

• P1 + P2 → opaque exportation
• Server returns sealed representation of progress to client
• Preserves information hiding
  • Client cannot inspect/modify state
• Allows stateless communication
• Handled transparently by compiler
• Leverage Rust’s affine type system to avoid high overhead

[Figure: client state and server state across the calls config(c), func1(), func2(r1), under Standard Encapsulation versus Opaque Exportation + Stateless Communication]

“But can you always eliminate state spill?”

– the slightly persnickety audience member

• State spill is not always avoidable, especially in low-level OS code that interfaces directly with hardware
  • e.g., frame allocator, handler
• At some point, some entities must hold some state
• Multi-client states are exported as blobs jointly owned by every client
• Clientless states are owned by the state_db metamodule; the module caches a weak reference

[Figure: multi-client state exported as a data blob jointly owned by all clients; clientless state of a system-wide entity (e.g., hardware resource) owned by state_db, with the server entity caching a weak reference to it]

Theseus: a State Spill-free Operating System
Kevin Boos and Lin Zhong

OSes are complex and entangled
● Existing OSes are a web of entangled entities
  ○ Cannot treat entities independently
● OS components should be easily interchangeable at runtime, for fluid system evolution
  ○ Goal: Runtime Composability
● Prior decoupling strategies are insufficient; entanglement remains between system entities
  ○ Modularization
  ○ Encapsulation (OOP)
  ○ Privilege-level separation (μkernel)
  ○ Hardware-driven
● For disentanglement, we focus only on states and how they propagate throughout entities in the system.

State Spill is the root cause
● Scenario: source entity A (“client” role) communicates with destination B (“server” role).
● State spill occurs when B’s state undergoes a lasting change after handling an interaction from A.

[Figure: caller entity A (“client”) and callee entity B (“server”); initial state pre-interaction, changed state mid-interaction, lasting changes post-interaction (state spill)]

Main goals of Theseus
● Directive 1: no state spill (above all else)
● Directive 2: elementary modules

Theseus design principles
1. No traditional encapsulation
  ○ Client A should maintain the state representing its progress with server B, instead of B
  ○ Preserve information hiding: A cannot inspect or modify state from B
2. Stateless interactions
  ○ An interaction from A → B must include everything B needs to handle it
  ○ Implication: B can be practically stateless
3. Universal, connectionless communication
  ○ All entities are accessible in a uniform way
  ○ Do not assume ongoing existence of interfaces
4. Re-use of generic, spill-free patterns
  ○ Implement common OS design patterns once in a spill-free way, then re-use across system

Design & implementation decisions

Flat Module Architecture
● Submodules contribute to complex entanglement
  ○ Extract submodules into first-order modules
● Simplifies module logic → nano_core manages all
● Permits communication and compositional hierarchy

[Figure: Monolithic / Microkernel OS versus Theseus]

State Management
● Encapsulation causes state spill
● … some states must exist somewhere
● Export multi-client states as data blob jointly owned by all clients
● Clientless states are owned by state_db
  ○ System-wide entity (e.g., hardware resource)
  ○ Entity caches a weak reference to it

Software-only Isolation and Safety
● Modules are separate binaries: namespace isolation
● Augment Rust compiler to permit minimal subset of unsafe code necessary for basic OS functionality
● Error handling is mandatory, using Option & Result
  ○ Panics are disallowed and transformed into errors

Current status and future work
● Done: baseline OS from scratch, all in Rust
● Now: analyze & rethink modules and interfaces to remove state spill
● Far: no user/kernel distinction: “bag of modules”

Software-only Safety & Isolation

• Problem: we need isolation between modules
  • For fault isolation but also for interchangeability
• Easy solution: rely on Rust’s memory safety guarantees
• Why not just use existing techniques? (Singularity, SPIN, Vino)
• Challenge: many modules, modules are in kernel core
  • No support for processes, SIPs, extensions, etc.

Building Modules in Isolation

• Each module is a separate Rust crate
• Compiled into individual binaries, isolated into private “namespaces”

[Figure: Standard OS (compiler/linker): no true distinction between modules, or blurry lines. Theseus (compiler): easy to extricate a single crate due to clear boundaries]

Many modules, all at the core

✓ Naming isolation done
• Problem: Hardware protection or spatial multiplexing is infeasible for 100+ low-level modules
• Solution: shared resources
• Challenge: modules implement core kernel functionality
  • They need to execute unsafe code
  • Unsafe code can do … well, anything

Unsafe code is a necessary evil

• Some unsafe code is okay
  • Port I/O & MMIO
  • Register access & descriptor tables
• Most unsafe code is bad!
  • Dereferencing arbitrary pointers
  • Random type reinterpretations
• Solution: augment Rust compiler to permit minimal subset of necessary unsafe code
  • Principle of least privilege, per-module

    fn keyboard_irq_handler() {
        // read the scan code from the keyboard data port (0x60)
        let scan_code: u8 = unsafe {
            in_byte(0x60)
        };

        log!("scan_code: {}", scan_code);

        // notify end of interrupt
        unsafe {
            out_byte(0x20, 0x20);
        }
    }

    fn bad_boy_bad_boy() {
        // writes through an arbitrary address: the kind of unsafe code to forbid
        unsafe {
            *(0xFFFF1234 as *mut u32) = 0xDEADF00D;
        }
    }

Module Management, flattened

• Submodules contribute to entanglement
• Must be separated out into standalone, first-order modules with spill-free public interfaces
• nano_core can then manage all modules indiscriminately

[Figure: entanglement via state spill in (a) Monolithic Kernel, (b) Microkernel OS, (c) Theseus]

Two important clarifications:
1. State spill freedom does not preclude arbitrary module interaction
2. A module’s flat code structure does not preclude compositional hierarchy amongst modules

Why Rust?

Beneficial Features of Rust

• High-level constructs with low-level flexibility

    fn clear_vga_screen() {
        for i in 0..(80 * 25) {
            // each VGA text cell is 2 bytes wide, starting at 0xb8000
            unsafe {
                *((0xb8000 + i * 2) as *mut u16) = VgaChar::new(Black) << 12;
            }
        }
    }


    let &mut MemoryManagementInfo {
        ref mut page_table,
        ref mut vmas,
        ref mut stack_allocator
    } = current_mmi;

    match page_table {
        &mut PageTable::Active(ref mut active_table) => {
            let mut frame_allocator = FRAME_ALLOCATOR.lock();
            if let Some((stack, stack_vma)) = stack_allocator.alloc_stack( ... ) {
                vmas.push(stack_vma);
                Ok(stack)
            }
        }
        _ => {
            Err("MemoryManagementInfo::alloc_stack: failed to allocate stack!")
        }
    }

• Functional & imperative

    p3.and_then(|p3| p3.next_table(page.p3_index()))
      .and_then(|p2| p2.next_table(page.p2_index()))
      .and_then(|p1| p1[page.p1_index()].pointed_frame())
      .or_else(huge_page)
      .map(|frame| { frame.number * PAGE_SIZE + offset })

    sections.filter(|s| !s.is_allocated())
            .map(|s| s.addr)
            .min()
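The second chain above can be made self-contained as follows. This is only a sketch: the `Section` struct, its fields, and the function name `lowest_free_addr` are assumptions made for illustration, since the slide does not show the actual types.

```rust
/// Hypothetical stand-in for the section type used in the slide's snippet.
pub struct Section {
    pub addr: usize,
    pub allocated: bool,
}

impl Section {
    pub fn is_allocated(&self) -> bool {
        self.allocated
    }
}

/// Returns the lowest address among sections that are not yet allocated,
/// or `None` if there is no such section.
pub fn lowest_free_addr(sections: &[Section]) -> Option<usize> {
    sections
        .iter()
        .filter(|s| !s.is_allocated())
        .map(|s| s.addr)
        .min()
}
```

The same logic written imperatively would need a mutable running minimum and an explicit loop; the iterator chain expresses it as a pipeline and compiles to a comparable loop.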

• Strongly typed + type inference
• Clear ownership/borrowing semantics

• Explicit lifetimes

    let y: &u32 = {
        let x = 5;
        &x
    };
    ...
    println!("{}", y);

    error: `x` does not live long enough
       |
     3 |     &x
       |      - borrow occurs here
     4 | };
       | ^ `x` dropped here while still borrowed
    ...
    10 | println!("{}", y);
       |                - borrowed value needs to live until here
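For contrast, two ways the rejected snippet could be rewritten so that the borrow checker accepts it. The function names are hypothetical, and this is only a sketch of the general rule that a reference must not outlive the value it points to.

```rust
/// Accepted: `x` is declared in the same scope that uses `y`, so the
/// borrow never outlives the value it points to.
pub fn borrow_in_scope() -> u32 {
    let x = 5;
    let y: &u32 = &x;
    *y
}

/// Accepted: the inner block yields the value itself rather than a
/// reference to a local, so nothing is left dangling when `x` is dropped.
pub fn value_from_inner_scope() -> u32 {
    let y: u32 = {
        let x = 5;
        x
    };
    y
}
```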

• Compile-time checking & guarantees

Concluding Remarks

Current & Future Work

• Applying the design described here to transform our existing baseline OS into a state spill-free design
• Evaluate and demonstrate runtime composability

Related projects using Theseus
• Exploring multi-entity resource accounting with state spill
• Massive MU-MIMO LTE/5G Basestation system software
  • Goal: bring reliability, safety, flexibility, scalability to the edge
• Refactoring network stack for network provenance and tolerance of DoS attacks

Theseus in conclusion

• Eliminate state spill above all else
• In pursuit of runtime composability for easy long-term OS evolution
• Implemented from scratch in Rust
• Will be open-sourced soon!

Rice Efficient Computing Group, recg.org

Backup Slides

Rust disadvantages

• Lifetime checking isn’t great

    let tasklist = task::get_tasklist().read();
    let mut curr_task = tasklist.get_current().unwrap().write();
    let curr_mmi = curr_task.mmi.as_ref().unwrap();
    let mut curr_mmi_locked = curr_mmi.lock();
    curr_mmi_locked.map_dma_memory(paddr, 512, PRESENT | WRITABLE);

• Copious overuse of panic
  • Must translate into normal error responses
• Allocations are not as obvious as in C
• Many more in paper
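A sketch of what translating panics into normal error responses can look like in ordinary Rust. Everything here is hypothetical (`Task`, `TaskError`, `mmi_or_panic`, `mmi_or_error`), and it uses `std::collections::HashMap` for brevity, whereas kernel code would be `no_std`.

```rust
use std::collections::HashMap;

/// Hypothetical error type for illustration.
#[derive(Debug, PartialEq)]
pub enum TaskError {
    NoSuchTask(u32),
    NoAddressSpace(u32),
}

/// Hypothetical task record; `mmi` stands in for optional memory info.
pub struct Task {
    pub mmi: Option<u64>,
}

/// Panicking style: any missing value aborts the caller.
pub fn mmi_or_panic(tasks: &HashMap<u32, Task>, id: u32) -> u64 {
    tasks.get(&id).unwrap().mmi.unwrap()
}

/// Error-returning style: each `None` becomes an ordinary error value
/// that the caller has to handle.
pub fn mmi_or_error(tasks: &HashMap<u32, Task>, id: u32) -> Result<u64, TaskError> {
    let task = tasks.get(&id).ok_or(TaskError::NoSuchTask(id))?;
    task.mmi.ok_or(TaskError::NoAddressSpace(id))
}
```

The second form costs a few more characters per call site than `unwrap()`, but a failure in one module then surfaces as a value in its caller instead of taking down the whole kernel.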

P3: Universal, Connectionless Interfaces

• All entities should be easily accessible through a uniform invocation and management interface
  • Promotes easy interchangeability
  • Avoids module-specific logic
• No expectation of an interface’s ongoing availability
  • Interactions cannot be stateful, must be “connectionless”
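One possible reading of P3 as code, under stated assumptions: the trait `Entity`, the types `Echo`, `Upper`, and `Registry`, and the string-based request format are all invented for this sketch and are not the Theseus interface. Every entity is invoked the same way, and the target is resolved on every call rather than through a held connection.

```rust
use std::collections::HashMap;

/// Hypothetical uniform interface: every entity is invoked the same way,
/// with a self-contained request.
pub trait Entity {
    fn invoke(&self, request: &str) -> Result<String, String>;
}

pub struct Echo;
impl Entity for Echo {
    fn invoke(&self, request: &str) -> Result<String, String> {
        Ok(request.to_string())
    }
}

pub struct Upper;
impl Entity for Upper {
    fn invoke(&self, request: &str) -> Result<String, String> {
        Ok(request.to_uppercase())
    }
}

/// Hypothetical registry mapping names to entities.
pub struct Registry {
    entities: HashMap<String, Box<dyn Entity>>,
}

impl Registry {
    pub fn new() -> Self {
        Registry { entities: HashMap::new() }
    }

    /// Installs or replaces the entity behind `name`.
    pub fn install(&mut self, name: &str, entity: Box<dyn Entity>) {
        self.entities.insert(name.to_string(), entity);
    }

    /// Removes the entity behind `name`, if any.
    pub fn remove(&mut self, name: &str) {
        self.entities.remove(name);
    }

    /// Connectionless call: the target is looked up on every invocation,
    /// and its absence is an ordinary error rather than a broken handle.
    pub fn call(&self, name: &str, request: &str) -> Result<String, String> {
        match self.entities.get(name) {
            Some(entity) => entity.invoke(request),
            None => Err(format!("no entity named '{}'", name)),
        }
    }
}
```

Since callers never hold a reference to a particular implementation, swapping `Echo` for `Upper` under the same name takes effect on the next call, and a removed entity yields an error the caller can handle.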

Pattern Reuse

• Common recurring OS design patterns should be implemented only once and reused throughout the OS
  • Examples: multiplexers, dispatchers, indirection layers
• Pattern must enforce the absence of state spill regardless of specialization
• Should be instantiable at compile time
• Lowers development risk of adding new features
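A sketch of how a dispatcher pattern might be written once and specialized at compile time with Rust generics. The trait `Handler`, the function `dispatch`, and the toy `IrqHandler` are hypothetical names for illustration. Handlers take `&self`, so handling a request cannot leave a lasting change in the handler unless it opts into interior mutability; this approximates, but does not fully enforce, the absence of state spill.

```rust
/// Hypothetical handler interface. Taking `&self` means a handler cannot
/// mutate its own fields while handling a request (absent interior
/// mutability such as `Cell` or `Mutex`).
pub trait Handler<Req, Resp> {
    fn handle(&self, request: &Req) -> Resp;
}

/// The pattern, written once: `select` is a routing function that picks a
/// handler index. Specialization happens at compile time through the type
/// parameters, so there is no per-pattern runtime registration.
pub fn dispatch<Req, Resp, H, F>(handlers: &[H], select: F, request: &Req) -> Option<Resp>
where
    H: Handler<Req, Resp>,
    F: Fn(&Req) -> usize,
{
    handlers.get(select(request)).map(|h| h.handle(request))
}

/// Example specialization: a toy IRQ dispatcher.
pub struct IrqHandler {
    pub name: &'static str,
}

impl Handler<u8, String> for IrqHandler {
    fn handle(&self, irq: &u8) -> String {
        format!("{} handled irq {}", self.name, irq)
    }
}

/// Routes an IRQ number to the handler at the same index, if one exists.
pub fn route_irq(irq: u8) -> Option<String> {
    let handlers = [IrqHandler { name: "timer" }, IrqHandler { name: "keyboard" }];
    dispatch(&handlers, |n: &u8| *n as usize, &irq)
}
```

The same `dispatch` function could be instantiated for a syscall dispatcher or an input multiplexer by supplying a different request type and routing function.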

Modules must not jeopardize evolution

Regular Modules
• In general, easy to update because modules are stateless
• The nano-core relieves modules that do contain states from the burden of implementing their own state-saving logic
  • (Tedious approach, commonly used in live updates for reliable systems, e.g., [40])
• In Theseus, the compiler is the sole manager of these exported states; it produces control routines to move them in and out of a module’s bounds, a guaranteed-safe operation because they are not accessible at the source level
• Updates can occur on demand without the modules’ cooperation or knowledge

Metamodules
• Must also be easily updatable
• state_db “recalls” lent-out states and externalizes them temporarily

Related work (abridged)

• Rust OSes with different goals: Tock,
• Static Composability: Flux OSKit, Think,
• Componentized interfaces: OpenCOM, Knit
• Microservices architecture and serverless
