How to Run POSIX Apps in a Minimal Picoprocess
Jon Howell, Bryan Parno, John R. Douceur
Microsoft Research, Redmond, WA

USENIX Association. 2013 USENIX Annual Technical Conference (USENIX ATC '13).

Abstract

We envision a future where Web, mobile, and desktop applications are delivered as isolated, complete software stacks to a minimal, secure client host. This shift imbues app vendors with full autonomy to maintain their apps' integrity. Achieving this goal requires shifting complex behavior out of the client platform and into the vendors' isolated apps. We ported rich, interactive POSIX apps, such as Gimp and Inkscape, to a spartan host platform. We describe this effort in sufficient detail to support reproducibility.

1 Introduction

Numerous academic systems [5, 11, 13, 15, 19, 22, 25–28, 31] and deployed systems [1–3, 23] have started pushing towards a world in which Web, mobile, and desktop applications are strongly isolated by the client kernel. A common theme in this work is that guaranteeing strong isolation requires simplifying the client, since complexity tends to breed vulnerability.

Complexity evicted from the client kernel takes up residence in the apps themselves. This shift is beneficial: It lets each app vendor decide independently which complexity is worth the risk of vulnerability, and one vendor's decision in favor of complexity does not undermine another's decision to favor security. Of course, requiring each app vendor to implement a complete software stack is impractical, so we expect this complexity to migrate to app frameworks that app vendors can choose among, just as web developers choose among an ever evolving set of app frameworks on the server.

The minimality of the client interface must not inhibit the richness required by applications such as desktop productivity apps. New client application models often fail due to the burden of migrating every app (and every library) to run under a new model. Thus, we argue that shifting app delivery to a minimal-client model requires an inexpensive app migration path from complex-host frameworks such as POSIX and Windows.

On the other hand, support for richness should not sacrifice the small size and tight specification of the isolation interface. The web's current client execution interface has repeatedly failed to achieve strong app isolation, due to an interface bloated with HTML, DOM, JPG, PNG, JavaScript, Flash, etc. in pursuit of richness.

The recent Embassies system provides a concrete example of how to achieve both security and richness simultaneously [16]. It pushes the minimal client host interface to an extreme, proposing a client host without TCP, a file system or even storage, and with a UI constrained to simple blitting (i.e., copying pixel arrays to the screen). In support of rich apps, Embassies's minimal interface specifies execution of native binary code.

Native code is an important practical choice, because, we assert, it is the lack of native code that has forced each prior system based on language safety to evolve a complex trusted interface that provides access to native libraries [8, 10, 17, 20]. This complexity undermines the intent to provide strong security.

While native code is a target that every compiler can hit, it seems daunting to port arbitrary POSIX apps to such a minimal interface. Such apps expect to run on a complex host with hundreds of system calls and dozens of system services, reflecting decades of development.

However, our experience suggests this task is far easier than one might expect. Interactive apps use relatively little of the complexity available in modern host platforms. More importantly, rather than alter the app, the functions that are required can often be emulated behind the POSIX interface. This technique works without even recompiling the hundreds of libraries involved. The emulation work can be shared easily across many applications, making the work scalable. The broad selection of rich apps that our system supports (see Table 1) demonstrates the generality of the approach.

Application   Function            # Libraries   Example libraries
Abiword       ...                 63            ..., Freetype
Gimp          raster graphics     55            Gtk, Gdk
Gnucash       personal finances   101           Gnome, ...
...           ...                 54            Gtk, Gdk
Hyperoid      ...                 6             svgalib
Inkscape      vector drawing      96            Magick, Gnome
Marble        3D globe            73            KDE, ...
...           HTML/JS renderer    74            ...

Table 1: A variety of rich, functional apps transplanted to run in a minimal native picoprocess. While these apps are nearly fully functional, plugins that depend on fork() are not yet supported (§3.9).

Contributions. This paper demonstrates the tractability of porting rich POSIX apps to a minimal environment, thus enabling them to run on a multitude of minimal clients [13, 16, 18, 22, 31]. We give a full accounting of the porting task, including which functionality is required and where corners can be cut. This includes low-level details, such as an exhaustive list of syscalls handled, to enable reproducibility and to eliminate any ambiguity about complexity hidden under the hood. Ultimately, we hope that this will expedite other efforts to adopt these techniques and hence achieve rich applications atop minimal, strongly-isolating client kernels.

2 Background: Minimal Client Facilities

In this work, we aim to transplant apps from a rich POSIX interface to a minimal client kernel. To ground the discussion, we target the minimal Embassies picoprocess interface [16], since it takes minimality to an extreme. If we can port an app to Embassies, we can certainly port it to a client with a richer interface.

The Embassies application binary interface (ABI) provides execution primitives that support an app's internal computation, cryptographic primitives to facilitate privacy and integrity, primitives for IPC and network communication, and user interface (UI) primitives for user interaction.

Execution. The execution primitives include:
• Calls to allocate_memory and free_memory. To simplify the specification and to make the ABI portable to most host environments, the app specifies only the amount of memory required; it has no control over the addresses returned by the allocator.
• create_thread accepts only the thread's initial program counter and stack pointer; the application provides the stack and any execution context.
• exit_thread destroys the current thread.
• A simplified futex-like [6] synchronization scheduling primitive, the zutex. zutex_wake is a race-free primitive that supports efficient app-level synchronization primitives. The corresponding zutex_wait is the only blocking call in the ABI; it allows an app to yield the processor.
• clock returns a rough notion of wall-clock time.
• set_timer sets a timer, in clock coordinates, that wakes a zutex on its expiration. Each picoprocess has only one timer; the app must multiplex it.
• get_alarms returns a list of three distinguished zutexes representing external events, one for each of receive_packet, ui_event, and timer_expired. Waiting on these zutexes is how threads block on external activity.
• A call to create a new picoprocess.

Cryptographic Infrastructure.
• random provides a supply of cryptographically strong entropy.
• app_key provides a machine-specific, application-specific secret. Apps use this key, along with cryptographic libraries, to store and recover private information despite starting from a public binary.

Communication. All communication outside the process, whether IPC to another process on the local machine or remote to an Internet host, follows IP semantics: Data is transferred by value (a logical copy), so that the suspicious recipient needn't worry about concurrent modification; addressing is non-authoritative; delivery admits loss and duplication; packet privacy and integrity are not guaranteed. Just like servers on the Internet, apps build up integrity and privacy themselves using cryptography. To underscore these semantics, all communication in Embassies, remote or local, is done via IP.
• get_addresses assigns the process one IPv4 and one IPv6 address.
• allocate_packet allocates memory for an outgoing packet; this allocation is distinguished from allocate_memory to enable zero-copy transfer.
• send_packet delivers a packet, interpreting its argument as an IP header and payload.
• receive_packet returns an allocated and dequeued packet, or NULL if the queue is empty.
• free_packet frees an allocated packet.

User Interface.
• ui_event returns a dequeued UI event (keystroke or pointer motion), or NULL if the queue is empty.
• Some calls that manage viewports, letting them be transferred among applications, or letting one application sublet a region of its viewport to another application. In every case, even where nested, each viewport is owned by a single app; no app can inspect or modify the pixels of another app's viewport. Details can be found elsewhere [16].
• map_canvas allocates a framebuffer to back a viewport. This allocation is distinguished from allocate_memory to enable fast pixel blitting.
• update_canvas informs the client kernel that a region of the framebuffer has been updated, and that its pixels should be blitted to the display.

These calls comprise the entire Embassies ABI; all of the functionality described in the rest of the paper is implemented in terms of these primitives.

3 The POSIX Emulator

A conventional POSIX application employs dozens of libraries, access to a rich system call interface, and by way of those system calls, access to other rich services, such as the X server's graphics functions and the dbus desktop configuration object broker.

To execute applications expecting this rich POSIX environment, our POSIX emulator repurposes existing libraries and programs atop the execution environment's minimal services (§2). Figure 1 gives a structural overview of how the emulator maps the entire POSIX interface down to Embassies's picoprocess interface.

[Figure 1 diagram. Top: the app's stack and heap allocations and its libraries (libcairo.so, libpng12.so, libz, libgthread.so, ...) above a copy of ld-linux.so, libc.so and libpthread.so. Emulator boot path: entry point, relocate self, fetch ROM image. Emulator layers: POSIX ABI (file descriptors, virtual mounts and path resolution, poll, time, misc. functions); VFS interface (VFS hook, ROM file system, tmpfs, pipes); clock interface (timer multiplexer); IP interface (IP hook, IP multiplexer); UI hook. Bottom: Embassies ABI (IP, timer, and UI blitter interfaces).]

Figure 1: The POSIX Emulator. To Embassies, the emulator (the large, L-shaped boundary) is a binary string whose entry point is at its start, and which may call back into a set of low-level interfaces provided by the Embassies ABI. Internally, the emulator loads the app's read-only image, maps it into a virtual filesystem, and calls into a copy of ld-linux.so. That loader, using the emulated POSIX ABI, reads the app executable and additional ELF libraries into memory. The glibc libraries' syscalls are redirected to the emulator's POSIX interface. Non-POSIX hooks provide connections for UI and TCP services implemented outside of the emulator (Figure 2).

Below, we provide a functional exposition of this emulation, starting with application launch.

3.1 Application Launch

Embassies provides minimal support for app launch, merely loading and starting a vendor-specified boot block of code. Specifically, the host (1) maps the application's boot block into an arbitrary region of address space, (2) sets up a minimal stack, and (3) places in a register the address of a dispatch table for the Embassies ABI (§2).

Within its boot block, the POSIX emulator (1) relocates its symbols, using a small piece of position-independent code, (2) allocates an adequate stack, and (3) establishes a dispatch function to emulate the POSIX syscall interface (§3.2) and virtual file system (§3.3).

Next, the emulator must load the app and its libraries into memory. In a full Linux implementation, the kernel would interpret the app's ELF binary format, map the app binary into memory, map the loader ld-linux.so into memory, and then jump to the loader. The loader would then enumerate dynamic library references within the ELF image, map these libraries into memory, link the images together (resolving symbolic references), and then jump to the app's entry point.

Embassies, however, provides neither a file system from which to map files nor a kernel willing to parse ELF binaries. Thus, our emulator must perform these tasks, which it does by invoking ld-linux.so, an image of which is included in the emulator's boot block. The emulator calls ld-linux.so and passes the app's path as an argument, which instructs the loader to map the app (and its libraries) into memory. POSIX calls made by ld-linux.so are serviced by the emulator (§3.2).

To call the loader, the emulator creates a suitable argv (naming the ELF executable), an envp (e.g., pointing DISPLAY at 127.0.0.1:6), and an auxv (some constants to convince libraries they're running on Linux).

3.2 Intercepting System Calls

The loader and the other libraries in the glibc suite sit at the bottom of the library stack; these are the libraries that make actual POSIX syscalls. In principle, other libraries could also include direct syscall instructions, but in practice, we have never observed this; instead, they simply use libc's syscall symbol.
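The loader invocation described in §3.1 can be pictured with a short sketch. It assumes the conventional Linux process-entry layout (argc, argv pointers, NULL, envp pointers, NULL, then auxv key/value pairs ending in AT_NULL); the function name, the example app path, and the choice of a single AT_PAGESZ entry are hypothetical and are not taken from the emulator itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Auxiliary-vector keys; the values mirror AT_NULL and AT_PAGESZ in <elf.h>. */
enum { SKETCH_AT_NULL = 0, SKETCH_AT_PAGESZ = 6 };

/* Hypothetical helper: write the words a loader expects at its initial stack
 * pointer: argc, argv[], NULL, envp[], NULL, auxv pairs, AT_NULL terminator.
 * Returns the number of words written, or 0 if the buffer is too small. */
size_t build_loader_stack(uintptr_t *out, size_t cap,
                          const char *const *argv, const char *const *envp,
                          uintptr_t page_size)
{
    size_t argc = 0, envc = 0, n = 0;
    while (argv[argc] != NULL) argc++;
    while (envp[envc] != NULL) envc++;

    /* argc word + argv and its NULL + envp and its NULL + two auxv pairs */
    size_t need = 1 + (argc + 1) + (envc + 1) + 4;
    if (need > cap) return 0;

    out[n++] = (uintptr_t)argc;
    for (size_t i = 0; i < argc; i++) out[n++] = (uintptr_t)argv[i];
    out[n++] = 0;                                  /* argv terminator */
    for (size_t i = 0; i < envc; i++) out[n++] = (uintptr_t)envp[i];
    out[n++] = 0;                                  /* envp terminator */
    out[n++] = SKETCH_AT_PAGESZ; out[n++] = page_size;
    out[n++] = SKETCH_AT_NULL;   out[n++] = 0;     /* auxv terminator */
    return n;
}
```

A caller would pass an argv naming the ELF executable and an envp such as { "DISPLAY=127.0.0.1:6", NULL }, then hand the filled buffer to the loader's entry point as its stack. A real auxv would likely carry more entries than the single page-size constant shown here.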

We want to exploit the functionality of the glibc suite, but glibc's system calls will fail in an Embassies process; they must be intercepted and replaced with calls to the syscall emulation layer. In principle this can be achieved by creating an alternate "sysdep" personality for glibc. In practice, at least for the x86 architecture, we found it easiest to apply a binary rewriting pass to each of the libraries in the glibc suite, replacing every system call invocation (i.e., each occurrence of int $0x80) with a call to a dispatch function that we inserted at the end of the library.

The dispatch function in each library must, in turn, be patched dynamically to call into the emulator's syscall dispatcher. To identify libraries in need of such dynamic patching, we modified the libraries' ELF headers to label the dispatch function. As libraries are mmapped into the app's address space, a filter file system in the VFS layer (§3.3) detects the modified ELF signature and transparently updates the dispatch function to point at the emulator's syscall dispatcher.

3.3 Virtual File System

Much of the POSIX ABI concerns file naming and file descriptors, which provide access to a variety of functions. Thus, like a Unix POSIX implementation, the emulator contains a virtual file system (VFS) abstraction.

VFS components include a read-only app image, a RAM-based writable temporary filesystem (tmpfs) that implements POSIX scratch directories like /tmp, and named pipes (Unix-domain sockets). The writable tmpfs directories provide the namespace for the Unix-domain sockets. There are also virtual files that emulate POSIX special files. These comprise the /proc files of Section 3.8.1 and an emulated /dev/random, which passes entropy up from the client kernel's random facility.

The emulated VFS contains an overlay mount table to tie these file systems together.

3.3.1 The Read-Only Application Image

The most important VFS component is the read-only binary image, whence libraries and data files are fetched. A Linux app expects to fetch its libraries and read-only data files by name from a (shared) file system via read and mmap. In Embassies, such files come from a private app image whose integrity has been verified.

To support this, the developer packages every file the app requires into a single tar-style image file. The emulator fetches this file from an untrusted cache on the local machine, delegating to the cache the complexity of fetching the image from an upstream cache or origin server and exploiting commonality with other apps [14]. The reply appears in memory as a single (jumbo) IP packet. The emulator ensures integrity by comparing the image's hash to a fixed hash value embedded in the boot block.

The image file transmission protocol supports partial fetches, so that the app can start with only a subset of the image, and then later bring in additional components.

3.3.2 Supported Interfaces

POSIX defines a wide, complex interface for interacting with the file system, so implementing the entire interface would be quite labor intensive. Fortunately, to support the varied applications from Table 1, it suffices for the VFS to support the following functions.

First, there is the core interface open, close, ftruncate, and ftruncate64; and the metadata interface stat, lstat, fstat, and access. VFS file descriptors track file pointers for read, write, writev, and lseek. Directory functions, including getdents, getdents64, (hard) link, and unlink, are implemented only in the tmpfs. The socket calls (Table 2) are routed through the VFS to the Unix pipe and TCP (§4.2) implementations.

accept        recvfrom
bind          recvmsg
connect       send
getpeername   sendmsg
getsockname   sendto
getsockopt    setsockopt
listen        shutdown
recv          socket

Table 2: Socket Calls. These calls are plumbed through the VFS interface to either the named pipes implementation or the TCP stack.

The emulator also implements the file handle functions dup, dup2, pipe, and pipe2. pipe connects two file descriptors with a blocking pipe that has no presence in the VFS namespace. Functions fsync and fdatasync are no-ops. Most of fcntl and fcntl64 are no-ops, except F_DUPFD, which calls the dup implementation.

3.4 Support for mmap

POSIX mmap is versatile, but in practice it is used in only a few idiomatic ways.

First, mmap(MAP_ANONYMOUS) is used to allocate blank memory at an address chosen by the kernel. The emulator transforms these calls into Embassies memory allocations.

Second, apps use mmap explicitly to map in non-executable data files. These calls also give the emulator freedom to choose the target address, so the emulator allocates fresh memory and uses a memcpying read implementation to simulate the effect of the mmap.

Finally, apps use mmap implicitly when they dynamically link executable libraries, either at load time via ld-linux.so or at runtime via dlopen. Some of these calls do expect to control the resulting data placement, a degree of control that Embassies does not provide when allocating memory.
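The first two idioms reduce to a fresh allocation plus, for file-backed mappings, a copy. The sketch below illustrates that shape only. The names are hypothetical: host_allocate_memory stands in for the Embassies allocate_memory call (backed here by calloc so that the memory starts zeroed, as anonymous mmap requires), and vfs_read_at stands in for the emulator's VFS read path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the host allocator: the caller supplies only a length and has
 * no say over the returned address. calloc gives zero-filled memory. */
static void *host_allocate_memory(size_t len) { return calloc(1, len); }

/* Hypothetical in-memory file as a VFS might expose it. */
struct vfs_file { const unsigned char *data; size_t size; };

/* Hypothetical VFS read: copy up to len bytes starting at off into dst and
 * return the number of bytes actually copied. */
static size_t vfs_read_at(const struct vfs_file *f, size_t off,
                          void *dst, size_t len)
{
    if (off >= f->size) return 0;
    if (len > f->size - off) len = f->size - off;
    memcpy(dst, f->data + off, len);
    return len;
}

/* Emulate an mmap call that carries no address hint. file == NULL models
 * MAP_ANONYMOUS. Returns NULL on failure; a real emulator would report an
 * errno value instead. */
void *emu_mmap_nohint(const struct vfs_file *file, size_t off, size_t len)
{
    void *region = host_allocate_memory(len);
    if (region == NULL) return NULL;
    if (file != NULL)
        vfs_read_at(file, off, region, len);  /* bytes past EOF stay zero */
    return region;
}
```

Because the caller never names an address in these two cases, the emulator is free to return whatever region the host allocator hands back.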

Fortunately, the loader does not really care where a given library ends up; it just requires that the data segment of the library appears at the correct offset from the text segment. To this end, the loader's first mmap call does not specify a target address; instead, it specifies a length sufficient to reserve enough address space to cover all the segments in the file. The loader's subsequent mmap calls (e.g., for the data segment) do specify a target address, but the target address is always within the memory range allocated by the initial mmap call.

Thus, the emulator can support this final class of mmap calls by simply using the Embassies interface to allocate the initial memory region (which does not specify a particular address), and then confirming that subsequent mmaps (that do specify an address) fall within the initial memory allocation. As long as they do, the emulator can take the appropriate action, e.g., it can zero-fill the specified region for the binary's .bss section or copy in the contents of the mmapped file.

This approach is clearly "less portable", in the sense that a POSIX app could in theory call mmap with an address outside of any preexisting allocation. Fortunately, we have not yet encountered any applications that rely on this functionality.

3.4.1 Fast mmap

The approach above is adequate for correct POSIX emulation, but for the apps we tested, where the bulk of the image comprises mmap-loaded libraries, it incurs many megabytes of memcpys, adding noticeable delay (150 ms) to the app start time. We corrected this performance problem by page-aligning mmappable libraries in the image tar file (§3.3.1), and servicing mmap requests by yielding the memory region from the VFS to the app.

Of course, this means that the region cannot be read or mmapped later in the program's execution; if a program needs to map a file multiple times, we either store multiple copies in the image file (often worth the space), or mark the region "precious", inhibiting the optimization.

Fast-mmap files must be stored in the image in their in-memory layout, not their on-disk ELF layout, including necessary blank space to position the data and bss segments. The blank spaces are, of course, easy to compress during transmission.

3.4.2 Other Memory Calls

Most POSIX memory allocations appear as anonymous mmap calls. The emulator tracks such requested regions, freeing the underlying Embassies allocation once the entire region has been munmapped.

Embassies provides no read/write/execute memory protections, so the emulator simply ignores mprotect, madvise, and msync. It also rejects mremap.

Unfortunately, ld-linux.so and libc both make initial memory allocations with the ancient brk interface. Why? We cannot say; the best solution would be to eradicate these deprecated calls. Instead, as a workaround, the emulator assumes that virtual memory has no cost, generously over-allocates on the initial brk(0) call, and services each subsequent brk extension by releasing more of the initial allocation.

3.5 Clock and Timers

The emulator provides the various flavors of POSIX time: time, gettimeofday, and clock_gettime. It translates all of these from the nanosecond-precision clock supplied by the client kernel. That clock provides rate but no offset information; hence all of our apps think the current time is 2011. We use ntpdate to acquire a clock offset, although we have not yet attended to the security implications.

Embassies supplies the process with a single timer, which signals the process by firing a zutex, and thus can be reset in a race-free way. The emulator has the responsibility to multiplex this one timer into as many alarms as it needs to implement POSIX timeout interfaces. It does so using a tree of upcoming deadlines, for scalability. We found the clock multiplexer to be surprisingly subtle, with many race conditions that lead to deadlocks. It was helpful to diagram the detailed mapping between the host timer state and the state of the guest timer list.

3.6 Synchronization Primitives

The Embassies client kernel provides a single unified synchronization abstraction, the zutex, that is used both for internal waiting on other threads and for waiting on external events (the network or the clock). This central abstraction is a simplified futex [6]. Like the futex, the zutex is actually a race-free scheduling primitive in support of efficient synchronization.

The POSIX futex maps readily onto the zutex, with the emulator folding in timeout behavior (§3.5). Many extra POSIX behaviors are neutered. For example, highly concurrent servers use FUTEX_CMP_REQUEUE to avoid convoys, but our emulator simply wakes the requested threads and lets them requeue themselves. The emulator rejects FUTEX_WAKE_OP and FUTEX_WAIT_BITSET with an error, alerting libpthread to revert to the basic behavior.

The nanosleep call and the POSIX multiple-wait primitives select, _newselect, and poll are all mapped into zutex_wait operations, again with timeout behavior constructed by the emulator. POSIX blocking operations, like a read on an empty pipe, wait on zutex-signaled events.

3.7 Network Multiplexing

Embassies provides each process with a single zutex to signal the arrival of IP traffic. Thus, the emulator must collect incoming IP packets and multiplex them inside the app. The emulator itself uses IP to fetch its image (§3.3) and to query time servers (§3.5). The emulator's network stack demultiplexes IP and UDP, and delivers TCP packets to the LWIP library (§4.2).

3.8 Threads

POSIX uses clone to express both thread creation and process fork (§3.9). The emulator pattern-matches the thread-creation idiom and sets up the new thread's initial thread-local store (TLS). Because Embassies's create_thread conveys only a stack pointer, the emulator constructs a stub stack to pass the POSIX parameters and the caller's designated stack to the new thread. It records metadata about the new thread to correctly implement CLONE_CHILD_CLEARTID upon POSIX's thread-exit call.

The POSIX process-exit call, exit_group, signals the zone host (§4) that a zone has exited.

3.8.1 Supplying the Stack Address

Several applications rely on garbage collection libraries that need to know the address of the top (but not the bottom) of the current thread's stack. This is exposed in Linux POSIX through pseudofiles in /proc.

At first blush, it appears that the stack bottom address is also needed. For example, Libwebkit's JavaScriptCore garbage collector queries libpthread for the stack bottom address. However, the GC does not use the bottom address directly; instead, it adds RLIMIT_STACK to yield the top address. Since libpthread determines the bottom address by subtracting RLIMIT_STACK from the top address it obtains from /proc/self/maps, any sane value for RLIMIT_STACK will work correctly. We used 8 MB.

The stack top value returned by /proc/self/maps, on the other hand, does matter: It is how a conservative garbage collector learns the extent of the stack. Another garbage collector, libgc, looks for the stack top in /proc/self/stat. We provide special VFS nodes at those names, which return the appropriate stack top value for the current thread.

To identify which thread is querying the interface, the emulator snoops the app's thread-local store (TLS) register; that is, it uses grey-box assumptions about how glibc manages the TLS. For all threads other than the main thread, the emulator records each stack address as its thread is created by the clone syscall. For the main thread, the emulator allocated the stack (§3.1) and thus knows its address.

3.9 Unimplemented: Fork

Some apps employ the fork/exec pattern; e.g., Inkscape uses it for its plug-in modules. This pattern does not translate well to the minimal Embassies environment, since Embassies's memory management facilities are far too simple. The current emulator implementation does not support fork at all, leaving Inkscape's plug-ins inoperative.

An expedient approach, if the code is sufficiently idiomatic, is to emulate the fork with a thread, and perhaps intercept and neuter close calls from the child "process" preparing to exec. The exec call would launch a new zone (§4), or if fault-containment is desired, a new picoprocess (§2).

Alternatively, since the fork/exec pattern is usually implemented in a widely-used library, such as glib's g_spawn functions, one could modify this higher-level library to map fork's semantics cleanly onto the creation of a new zone or picoprocess.

3.10 Neutered System Calls

The remaining syscalls are either unused by interactive apps, or can be simply rejected or neutered. This section identifies such syscalls in the interest of completeness.

Many calls (Table 3) can be rejected, returning ENOSYS or EINVAL, and the libraries that call them either handle the failure gracefully, fall back to an alternate POSIX mechanism, or ignore the result and trundle along obliviously [24].

Other syscalls can be neutered with brazen lies: When the caller actually checks the return code, we may need to return 0 ("success") even if we don't actually emulate the promised semantics (Table 4). Other functions require slightly more credible lies: The emulator fills in some plausible constant values to placate the caller (Table 5). For instance, the clock_getres call should provide some information about clock quality (§3.5), but we just claim a 500 ms resolution. As another example, we found no software that used chdir, so getcwd simply returns "/".

chown                    sched_setparam
fchmod                   sched_setscheduler
rename                   sigaction
sched_get_priority_max   sigprocmask
sched_get_priority_min

Table 3: Failure-oblivious calls return either EINVAL or ENOSYS, which the caller handles gracefully.

fchown                   set_tid_address
flock                    setitimer
fstatfs                  setpriority
inotify_init             setrlimit
inotify_init1            shmget
ipc                      statfs
readlink                 timer_create
sched_getaffinity        times
sched_rr_get_interval
sched_yield
set_robust_list

Table 4: Neutered calls simply return 0 (success).
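The three kinds of cheap response described in §3.10 can be pictured as a small policy table consulted by the syscall dispatcher. The sketch below is illustrative only: the type, table and function names are hypothetical, only a handful of the calls from Tables 3 to 5 are listed, the split between ENOSYS and EINVAL is chosen arbitrarily, the constants returned for getpid and getuid are made up, and the Linux convention of reporting failure as a negative errno is assumed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* The three classes of cheap response. */
enum cheap_policy { POLICY_REJECT, POLICY_NEUTER, POLICY_DELUDE };

struct cheap_rule {
    const char *name;          /* syscall name */
    enum cheap_policy policy;  /* which class the call belongs to */
    long result;               /* value handed back to the caller */
};

/* A few representative entries; the specific values are assumptions. */
static const struct cheap_rule cheap_rules[] = {
    { "chown",       POLICY_REJECT, -ENOSYS },
    { "rename",      POLICY_REJECT, -ENOSYS },
    { "sigaction",   POLICY_REJECT, -EINVAL },
    { "flock",       POLICY_NEUTER, 0 },
    { "sched_yield", POLICY_NEUTER, 0 },
    { "setrlimit",   POLICY_NEUTER, 0 },
    { "getpid",      POLICY_DELUDE, 1000 },
    { "getuid",      POLICY_DELUDE, 1000 },
};

/* Look up a rule by syscall name; returns NULL if the call is not one of the
 * cheaply answered ones and needs real emulation. */
const struct cheap_rule *find_cheap_rule(const char *name)
{
    for (size_t i = 0; i < sizeof cheap_rules / sizeof cheap_rules[0]; i++)
        if (strcmp(cheap_rules[i].name, name) == 0)
            return &cheap_rules[i];
    return NULL;
}

/* Answer a syscall from the table. *handled is set to 1 if the table covered
 * the call and to 0 if the dispatcher must route it elsewhere. */
long emu_cheap_syscall(const char *name, int *handled)
{
    const struct cheap_rule *rule = find_cheap_rule(name);
    *handled = (rule != NULL);
    return rule ? rule->result : -ENOSYS;
}
```

A real dispatcher would key on syscall numbers rather than names, and calls such as clock_getres or getcwd fill caller-supplied buffers instead of returning a bare integer; names and scalar results keep the sketch short.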

6 326 2013 USENIX Annual Technical Conference (USENIX ATC ’13) USENIX Association clock getres getpid stack getcwd getppid X gimp allocations getegid getresgid32 getegid32 getresuid32 heap geteuid getrusage zone host allocations geteuid32 getuid getgid getuid32 LWIP getgid32 sched getparam getpgrp sched getscheduler pipe

Table 5: Deluded calls return slightly fancier lies than 0.

3.11 Additional Program Requirements

Emulating the POSIX ABI is minimally intrusive to the apps, but a few conflicts remain.

3.11.1 Address Freedom

We have already seen that Embassies’s refusal to let apps specify specific locations for allocated memory requires the boot block to relocate itself (§3.1) and requires a delicate hand in servicing mmap (§3.4).

It also means that every executable must be relocatable or position independent. Every Linux shared library is relocatable, but for no discernible reason, executables are not relocatable by default. We address this by rebuilding each app’s top-level executable with the -pie (“position-independent executable”) compiler flag. Although this requires tampering with the app’s build system (§6.1), it is required only for the top-level application, not any libraries; and in most cases, passing CFLAGS=-pie to -buildpackage does the job. The change is nowhere near as invasive as trying to change to static linkage (§6.1).

3.11.2 The TLS Register

On the architecture we experimented on, the arcane x86-32 instruction set architecture, a paucity of general-purpose registers leads POSIX compilers to employ a disused segment register %gs as a thread-local storage (TLS) pointer. This usage gets compiled into every library and application binary. Since the idiom has no security-sensitive semantics, we opted to provide a store-gs call in the x86-32 Embassies ABI; the emulator uses it to implement set_thread_area.

A better solution would be to either recompile or binary-rewrite every binary to eliminate %gs references.

4 Zones: Programs as Libraries

Besides kernel services, POSIX apps often expect access to higher-level services provided by daemon programs like X windows, a window manager (e.g., twm), or a configuration manager, like the dbus desktop bus. We satisfy such apps by including these services inside the apps that need them, rather than in the client kernel’s TCB (which would add them to every app’s TCB).

X, twm, and dbus are designed as independent POSIX processes. Rather than convert them into libraries, we found it expedient to create a general mechanism for loading multiple programs into a single picoprocess. This is easier than it sounds, because each program separately allocates memory and file descriptors, which carves the resource namespaces into interleaving partitions. We call such partitions “zones” (Figure 2).

Figure 2: Multiple POSIX apps coexist in one picoprocess as zones. Each zone comprises a noncontiguous partition of the address space. Each has its own copies of libraries, like libc, and its own stack and heap allocations. Programs that expect POSIX pipe IPC, such as an X session, see the same behavior within the picoprocess. (Diagram labels: POSIX emulator, VFS hook, IP hook, UI ifc.)

Embassies’s refusal to allow memory allocations at specific addresses works to our advantage when implementing zones, since it precludes zones from demanding overlapping allocations. It is zones that use the emulated Unix pipes (§3.3). For example, the X zone listens on /tmp/.X11-unix/X0, and the client library in the main application zone binds to it there.

The vestigial brk interface (§3.4.2), however, presents a hurdle. Two threads in different zones may concurrently extend different brk heaps. The brk interface assumes hidden per-process state, which becomes per-zone state. The good news is that we can infer which zone is making the request, and hence which per-zone state to consult, because each request should appear within the address space set aside for that zone’s brk.

The bad news is that, on 32-bit hardware, virtual address space is scarce enough to warrant preserving, which means allocating only appropriately-sized brk regions for each zone. This is tricky because the initial call from each zone is a stateless brk(0), from which the emulator cannot infer the identity of the calling zone. Our expedient solution forces the zones to start up sequentially. A more elegant solution would identify the calling zone by its TLS or stack pointer, or (better yet) eliminate brk calls from libc.
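The per-zone brk bookkeeping described above can be illustrated with a small model. This is a sketch under stated assumptions, not the emulator's code: the names Zone, BrkEmulator, start_zone, and finish_start are hypothetical, each zone is assumed to own one contiguous reserved brk region, and sequential startup is modeled by a single "currently starting" zone that receives any brk(0).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Zone:
    """Hypothetical per-zone brk state: a reserved region and the current break."""
    name: str
    base: int    # first address of the region reserved for this zone's brk heap
    limit: int   # one past the last address of that region
    cur: int = field(init=False)

    def __post_init__(self) -> None:
        self.cur = self.base  # a fresh heap starts empty


class BrkEmulator:
    """Toy model of brk emulation for several zones in one address space."""

    def __init__(self) -> None:
        self.zones: List[Zone] = []
        self.starting: Optional[Zone] = None  # at most one zone starts at a time

    def start_zone(self, name: str, base: int, size: int) -> Zone:
        if self.starting is not None:
            raise RuntimeError("zones must start up sequentially")
        zone = Zone(name, base, base + size)
        self.zones.append(zone)
        self.starting = zone
        return zone

    def finish_start(self) -> None:
        self.starting = None

    def brk(self, addr: int) -> int:
        if addr == 0:
            # A stateless query carries no address, so it can only be
            # attributed to the one zone that is currently starting.
            # (Simplification: later brk(0) queries are not modeled.)
            if self.starting is None:
                raise RuntimeError("cannot attribute brk(0) to a zone")
            return self.starting.cur
        # Any other request names an address inside exactly one zone's region.
        for zone in self.zones:
            if zone.base <= addr <= zone.limit:
                zone.cur = addr
                return addr
        raise LookupError("address is outside every zone's brk region")
```

A request that would grow a heap past its reserved region falls outside every region and cannot be attributed in this model, which is one reason the size of each region matters.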

4.1 Example Zones

The X server presents a complex security boundary, and this complexity conflicts with Embassies’s goal of a minimal client kernel TCB. Therefore we use X only inside the picoprocess, in a zone, disregarding its security-sensitive multiplexing functions and exploiting only its rasterization function. The rendered frame buffer that X produces is blitted to the user’s display through Embassies’s pixel-level UI interface.

Some apps, like Gimp, use a plethora of palette windows. For expediency, we add a twm window manager zone into such apps, to allow manipulation of the palettes within the surface of the app’s single display region. With more effort, one could coordinate multiple windows via Embassies’s window management, perhaps using a technique like Nitpicker’s [9].

Gnome desktop apps expect to connect to the dbus daemon to find other components and learn configuration settings. This tight coupling among applications has no cost in a trusted-everything system, but is too risky for mutually untrusting apps. Hence we do not reproduce the connected dbus; instead, we link a copy of the daemon into each app to expediently satisfy the client library. With more effort, one could strip the dbus dependencies out of each app.

4.2 Extension Hooks

The emulator sits below libc, and hence cannot exploit libc. Coding without libc is painful; thus where possible, we push functionality out of the emulator into layers above. To facilitate this modularity, the emulator exports four hooks via unused syscall numbers.

Specifically, as alluded to above, we use an X zone to translate app UIs into easily blitted pixel regions. A modified X server within the zone supplies the . It uses one extension hook, ex_get_dispatch_table, to gain access to the raw Embassies UI functions. It uses a second extension hook, ex_open_zutex_as_fd, to wrap the UI notification zutex in a POSIX file descriptor, enabling the extension to smoothly integrate into X’s existing poll loop.

All unhandled IP traffic, including TCP traffic, is handed off to a TCP stack based on lwIP [7] that resides in the zone host. The lwIP stack is a loadable module, attaching to the emulator’s IP multiplexer via ex_add_default_handler and servicing requests for SOCK_STREAM sockets via ex_mount_vfs.

5 Debugging Strategies

The key premise of this work is that most apps use only a fraction of POSIX functionality. This paper catalogs these functions in detail precisely because the challenge is in discovering which functions matter.

Most of the effort in emulating the right subset of POSIX involves figuring out why a segfault occurred in a library dozens of layers below the app. To assist the practitioner who wishes to extend this approach, this section identifies our most valuable debugging strategies.

It is important to plumb error messages out of the picoprocess. Our insecure debug-mode Embassies monitor offers an extended ABI with debug channels that record to files. The emulator routes stdout and stderr to them.

Since most of our changes occur behind the POSIX interface, it is very effective to compare system call traces; divergences often identify root causes. We capture a reference trace in Linux with strace, and add a corresponding debug facility at the emulator’s entry point. It emits a trace file using another debug output channel.

Of course, a debugger is invaluable. Our debug-mode monitor runs apps as Linux processes. It routes Embassies syscalls out through a pipe to a coordinating process, but leaves the conventional POSIX syscall interface intact, enabling gdb to connect to the process.

However, gdb has no access to symbols. The emulator does not use POSIX mmap to map in ELF files, so gdb’s inspection of Linux-provided metadata in /proc/pid/maps is fruitless. To bridge this gap, the emulator records a trace of file open and mmap operations via another debug channel. A script transforms the trace into a gdb add-symbol-file script, solving the symbol problem.

Similarly, gdb’s usual mechanism for discovering new threads fails when thread creation is handled by the emulator. Thus, the debug monitor provides another extension by which the emulator signals thread creation, and the debug monitor generates the appropriate trap (int $0x3) to alert gdb.

We haven’t yet implemented gdb stubs for our secure monitors, because once an app runs correctly in the debug monitor, it rarely fails in the secure monitors. In the rare failure cases, we found it sufficient to study a core file (a snapshot at the moment of failure). Each secure monitor has a debug mode in which a picoprocess exception generates an ELF-format core dump.

The debug monitor also provides an extension to query CPU time (POSIX times()), and a sampling profiler, for diagnosing performance problems. An example discovery was that the emulator was returning bogus stat values, causing a to deem its cache file invalid, causing it to re-scan thousands of individual font files at app start.

Finally, gathering the appropriate file set for the read-only app image is tedious. To expedite, the emulator can start in “gullible mode”, where rather than fetch an image, it passes every open request path out to a lookup server located on the development machine where the original POSIX app is installed. That server hashes the corresponding file, injects the file contents into the cache,
and returns the path to the emulator. By this means, the emulator demand-loads the app’s required files; it also captures a trace of these loads, which serves as a manifest for generating the app image.

6 Discussion

Our goal is to reuse conventional interactive desktop applications in a new minimal runtime environment. Ideal reuse would use unmodified binaries; required modifications can be ranked based on their invasiveness.

Transparently emulating required behavior below the POSIX interface has proven to be very inexpensive; the main cost is discovering which features actually warrant implementation (§5). Our experience suggests that the emulator is asymptotically nearing completeness.

The choice to give apps no control over their memory layout, which makes Embassies implementable on any host, is slightly invasive; it requires relinking the top-level app binary, which is easy in practice (§3.11.1).

The Embassies environment demands some point changes higher in the software stack, including the binding of X to the Embassies UI interface (§4.1) and the replacement of implicit kernel communication with explicit protocols (§6.3). Such changes do require source modification of specific packages, but very few such changes are required, compared with the hundreds of library packages ported.

The Embassies impositions do preclude running some unmodified binaries, such as closed-source apps. Closed-source libraries with Embassies-compatible semantics, such as a PDF-rendering library, may be usable, though.

6.1 Dynamic vs. Static Linking

In our previous experience with the Xax project [13], we found that modifying a package’s build system was frustratingly difficult, generally much harder than modifying the source code and using the package’s build system to remake it. Most packages use common source languages such as C or C++, but it seems every package uses a different build scheme.

Engineering choices in Xax required statically linking each app with all of its libraries. Because that changed how apps and libraries build, the task ranged from difficult to all-but-impossible, and required new work for nearly every package.

Thus, in the present work, we elected instead to keep applications dynamically linked, and to press ld-linux.so into service for runtime linking. We found that this expedient substantially reduces the invasiveness of porting, as essentially every intermediate library is readily usable in binary form.

6.2 Limitations

Our experience iterating the emulator to support several apps suggests that the emulator is asymptotically nearing completeness, ready to support most desktop productivity apps. In most cases where we have introduced a lie or neutered behavior into the emulator, it is because we have examined the corresponding call site in libc, and we were able to conclude that the lie completely satisfies that code path. This approach occasionally backfires when a different call site finds the lie unconvincing, but these occurrences are rare.

System configuration tools are unlikely to port well, since our approach destroys tight application coupling, for example by neutering dbus. We accept this limitation as fundamental to Embassies’s goal of making apps more autonomous.

Embassies presently has only paper designs for audio and GPU facilities. Apps that integrate multiple programs with fork() are not currently supported (§3.9).

6.3 Inter-Application Protocols

This paper focuses on moving apps from a rich, trusting, shared environment to the isolated picoprocess. However, interesting apps still communicate with the outside world. Some inter-app communication is already based on IP: The apps we used discover printers and send jobs with the Internet Printing Protocol [12], so printing works correctly without special support.

However, how should apps replace communication patterns once done locally? For example, suppose one app produces data another app wishes to read. We expect such communications, once supplied by a complex trusted platform (e.g., the OS), to be replaced by IP-based protocols. Just as in the Internet, IP-based protocols are bilateral: Both participants have the opportunity to decide how much of the protocol they are willing to implement, and to select vulnerability-resistant implementations. The Embassies paper [16] addresses this question in greater detail.

7 Evaluation

7.1 Porting Effort

The most salient proof of effectiveness for our techniques is in the results: We are able to run many rich apps without even recompiling them (Figure 3). Instead, we binary-rewrite glibc to redirect the POSIX interface, use libraries as unmodified binaries, and relink the topmost executable to make it relocatable. That such non-invasive techniques are successful with eight interactive apps built on disparate library stacks is strong evidence that they will generalize easily to most interactive apps.

Figure 4 shows lines of code [30] in the components and patches to existing programs. Most of the effort is in the VFS implementation in the emulator.
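One way to confirm that a relinked top-level executable really is relocatable is to read the type field of its ELF header: a position-independent build is recorded as ET_DYN, while a conventional fixed-address executable is ET_EXEC. The following is a minimal sketch of such a check, not a tool from this work; the function names elf_type and is_relocatable are hypothetical.

```python
import struct

ET_EXEC = 2  # fixed-address executable
ET_DYN = 3   # shared object or position-independent executable


def elf_type(header: bytes) -> int:
    """Return the e_type field given at least the first 18 bytes of an ELF file."""
    if len(header) < 18 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF header")
    ei_data = header[5]  # byte order of the file: 1 = little-endian, 2 = big-endian
    if ei_data not in (1, 2):
        raise ValueError("unknown ELF data encoding")
    order = "<" if ei_data == 1 else ">"
    # e_type is a 16-bit field immediately after the 16-byte e_ident array.
    (e_type,) = struct.unpack_from(order + "H", header, 16)
    return e_type


def is_relocatable(path: str) -> bool:
    """True if the ELF file at `path` can be loaded at an arbitrary address."""
    with open(path, "rb") as f:
        return elf_type(f.read(18)) == ET_DYN
```

Shared libraries also report ET_DYN, which matches the observation that libraries need no change and only the top-level executable must be relinked.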

Figure 3: POSIX emulation handles diverse, rich applications, e.g., the Midori Web renderer, Gimp, Marble, Inkscape, and Gnumeric. Not shown are Abiword, Gnucash, and Hyperoid.

component       SLOC
emulator        29156
zone host       1328
lwIP patches    477
X patches       660
twm patches     0

Figure 4: Lines of code in system components.

7.2 Performance

For compute-bound tasks, the emulator is not involved, and apps run at native speeds. We verified this by running, on both Linux and Embassies, image rotations in Gimp and a subset of the SunSpider JavaScript benchmark [29] in Midori. As anticipated, in both cases the difference is negligible, within 2% (cv = 1%).

Informally, we have observed some emulated activities run faster than their Linux equivalents. For example, filesystem interactions with temporary files outperform Linux because they avoid a kernel-mode transition.

The application launch mechanism precludes the use of the OS buffer cache, but we recapture much of that performance in the Embassies environment [14, Fig. 14]. App starts are 50–100 ms slower than Linux; the largest bottleneck is verifying the integrity of fetched content.

7.3 Coverage

We have demonstrated a layer that emulates a small subset of POSIX behavior, and we have shown this to be sufficient to run a diverse set of productivity apps. However, perhaps exercising the apps more aggressively or running additional productivity apps would require substantially more POSIX-level emulation. To bound the degree to which more emulation could be required, we compare the set of syscalls visited dynamically with the set reachable statically. This analysis is approximate, because some syscalls (e.g., ioctl) aggregate multiple behaviors, and our static analysis tool is rather coarse.

Figure 5 shows the results. Columns are syscall numbers (sorted for contiguity); rows are applications. The upper eight apps are those we support (Table 1); the lower eight are other Linux apps to aid extrapolation. System calls in region () are supported as described in this paper: as meaningful, failure-oblivious, neutered, or deluded calls. Syscalls in region (x) are observed dynamically when the app is run on Linux but not when run on our emulator. For example, because shmget is neutered in the emulator, shmat and shmctl never appear dynamically. If the lower eight apps in Figure 5 were run on our emulator, these syscalls might be obviated for analogous reasons, but we do not know this for certain.

The 27 syscalls in region (s) are reachable statically, but not observed dynamically. Five are never called because a better version is emulated (stat64 for stat). Another 7 are trivial variations of existing emulation (ppoll for poll), and 12 (those unique to muse and stella) are neuterable (mlock, setgid32). We saw 3 calls in the lower eight apps that likely require implementation: utimes, symlink, and mknod.

Our static analysis is imperfect, as evidenced by ten calls reached dynamically but not discovered statically.
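The region bookkeeping behind this comparison reduces to set arithmetic over syscall names. The sketch below shows that arithmetic under stated assumptions: the function classify and its region labels are hypothetical, and the sets in the usage note are invented toy data, not the measurements reported in this section.

```python
from typing import Dict, Set


def classify(all_syscalls: Set[str],
             on_emulator: Set[str],
             on_linux: Set[str],
             static: Set[str]) -> Dict[str, Set[str]]:
    """Partition syscalls by how they were observed (illustrative only).

    on_emulator: seen dynamically when the app runs on the emulator
    on_linux:    seen dynamically when the same app runs on Linux
    static:      reachable according to static analysis
    """
    dynamic = on_emulator | on_linux
    return {
        # exercised on the emulator, hence handled in some way
        "supported": set(on_emulator),
        # region (x): seen on Linux but never reached on the emulator
        "x": on_linux - on_emulator,
        # region (s): statically reachable but never observed dynamically
        "s": static - dynamic,
        # evidence that the static analysis is incomplete
        "missed_by_static": dynamic - static,
        # seen by neither technique
        "unobserved": all_syscalls - dynamic - static,
    }
```

For example, with toy inputs in which shmget is seen on both platforms but shmat only on Linux, shmat lands in region "x"; a call such as ppoll that appears only in the static set lands in region "s".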

Nonetheless, the broad agreement between static analysis and dynamic observation suggests that our emulation is largely complete. Of Linux’s 333 syscalls, 203 are observed neither by dynamic tracing nor by static analysis. This lends credence to our claim that a broad range of productivity apps can be supported without emulating the majority of Linux functionality.

Figure 5: An approximate static coverage check validates that emulating most Linux system calls is not necessary.

8 Related Work

8.1 Application Models

Java was offered as an alternative to the clunky mid-1990s web programming interface [10]. Absent native code, Java had to either rewrite every framework an app could want, or import and abstract existing frameworks as native libraries. Practicality demanded applying the latter technique; even the early UI toolkit AWT [33] abstracted over the host UI at a high level. The result was a Java client with a complex implementation that shared the host’s vulnerabilities, and isolation that depended on a complex and growing security interface [21].

As Java largely failed to replace the HTML web app model, HTML thrived, evolving a notion of isolation [15, 32] fundamental to web apps. However, pressure to enhance functionality has progressively grown client complexity, undermining the promise of isolation [16].

The Slinky system proposed distributing POSIX apps as static binaries, enabling app developers to precisely specify their dependencies [4]. They extended the Linux kernel to detect and exploit implicit page sharing while preserving the semantics of static executables. Their approach treats shared libraries as a configuration problem. It inspired our work; we extend the Slinky insight to autonomy-preserving isolation against adversarial neighboring apps. This not only requires avoiding late-bound library sharing, but also demands eliminating the complex shared graphics stack (X or an HTML DOM renderer). Since simplicity is a priority, we eliminate even the shared buffer cache, requiring a sharing implementation different than that used in Slinky [14].

8.2 Porting Applications

Several years ago, our Xax project [13] demonstrated that rich stacks of libraries could be readily transplanted from a conventional environment to provide useful functionality even from inside a picoprocess attached to a . This paper reports on a more thorough implementation that supports complete, rich, interactive applications. Xax gave a high-level overview of the porting effort, enumerating five categories of techniques used to emulate the missing OS or to trigger alternative behavior in the transplanted library. This paper aims to completely demystify the process.

The Drawbridge effort demonstrated that similar techniques could be used for code based on the Windows commodity OS stack [22]; that project required introducing additional techniques, such as hoisting the GDI graphics rasterizing library from the OS kernel to become a library inside the picoprocess. The Drawbridge system assumes a non-minimal host that includes a file system, buffer cache, and TCP stack.

The task at hand is reminiscent of the Exokernel’s motto, “exterminate all operating system abstractions” [18]. Like Exokernel, Embassies minimizes abstractions in the host platform; but where the Exokernel evicted abstractions to expose new performance opportunities, Embassies aims to produce a simple, rarely-changing host with a minimal attack surface. Therefore, Exokernel techniques, such as those for sharing storage, do not translate well to Embassies apps.

Google’s Native Client system [31] includes ports of dozens of libraries, but does not support complete interactive applications. The difference in target assumption (that applications will run as web plug-ins, rather than replacing web apps altogether) has led the project to a different ABI, security model, and execution model. These choices necessitate a modified C compiler, which in turn requires fussing with libraries’ build environment (§6.1), a task we found difficult to scale. However, once those issues are resolved, the approach in the present paper should readily enable the conversion of POSIX apps into NaCl plug-ins.

9 Conclusion

This paper showed how to support rich POSIX applications on top of a minimal picoprocess interface. Such support can be achieved by providing a POSIX emulation layer and by binding existing programs, like lwIP, X, and twm into the application itself. The POSIX emulation layer is not nearly as complicated as a conventional POSIX implementation (e.g., Linux); in fact, this paper exhaustively lists every syscall emulated and every program adaptation required. Such emulation is possible in part because many POSIX functions exist to support scalability and performance more relevant to server applications (e.g., databases and web servers) and hence are unused by interactive apps. Thus, not only is it feasible to adapt POSIX applications to a sparse environment, it is reproducible. We hope these results will encourage others to adapt the existing world of rich POSIX-based applications to even the most minimal of client execution environments.

References

[1] ANDROID OS. http://www.android.com/.
[2] APPLE. iOS6, 2013. http://www.apple.com/iphone/.
[3] BARTH, A., JACKSON, C., REIS, C., AND THE TEAM. The security architecture of the browser. http://www.adambarth.com/papers/2008/barth-jackson-reis., 2008.
[4] COLLBERG, C., HARTMAN, J. H., BABU, S., AND UDUPA, S. K. Slinky: static linking reloaded. In USENIX ATC (2005).
[5] COX, R. S., GRIBBLE, S. D., LEVY, H. M., AND HANSEN, J. G. A safety-oriented platform for Web applications. In IEEE Symp. on Security & Privacy (2006).
[6] DREPPER, U. are tricky. Tech. rep., , Nov. 2011.
[7] DUNKELS, A. lwIP - a lightweight TCP/IP stack. http://savannah.nongnu.org/projects/lwip/, 2013.
[8] ECMA. Standard ECMA-262: ECMAScript language specification. http://www.ecma-international.org/publications/standards/Ecma-262.htm, June 2011.
[9] FESKE, N., AND HELMUTH, C. A Nitpicker’s guide to a minimal-complexity secure GUI. In IEEE ACSAC (2005).
[10] GOSLING, J., JOY, B., AND STEELE, G. Java™ Language Specification. Addison-Wesley, 1996.
[11] GRIER, C., TANG, S., AND KING, S. T. Secure web browsing with the OP web browser. In IEEE Symposium on Security and Privacy (2008).
[12] HASTINGS, T., HERRIOT, R., DEBRY, R., ISAACSON, S., AND POWELL, P. Internet Printing Protocol/1.1: Model and Semantics. RFC 2911 (Proposed Standard), Sept. 2000. Updated by RFCs 3380, 3382, 3996, 3995.
[13] HOWELL, J., DOUCEUR, J. R., ELSON, J., AND LORCH, J. R. Leveraging legacy code to deploy desktop applications on the web. In OSDI (2008).
[14] HOWELL, J., ELSON, J., PARNO, B., AND DOUCEUR, J. R. Missive: Fast appliance launch from an untrusted buffer cache. Tech. Rep. MSR--2013-9, Microsoft Research, Jan. 2013.
[15] HOWELL, J., JACKSON, C., WANG, H. J., AND FAN, X. MashupOS: Operating system abstractions for client mashups. In HotOS (May 2007).
[16] HOWELL, J., PARNO, B., AND DOUCEUR, J. Embassies: Radically refactoring the web. In NSDI (2013).
[17] JANG, D., VENKATARAMAN, A., SAWKA, G. M., AND SHACHAM, H. Analyzing the crossdomain policies of Flash applications. In IEEE Web 2.0 Security and Privacy Workshop (W2SP) (2011).
[18] KAASHOEK, M. F., ENGLER, D. R., GANGER, G. R., BRICEÑO, H. M., HUNT, R., MAZIÈRES, D., PINCKNEY, T., GRIMM, R., JANNOTTI, J., AND MACKENZIE, K. Application performance and flexibility on Exokernel systems. In SOSP (1997).
[19] MICKENS, J., AND DHAWAN, M. Atlantis: Robust, extensible execution environments for Web applications. In SOSP (2011).
[20] MICROSOFT. Silverlight. http://www.microsoft.com/silverlight/.
[21] NEVILLE, P. S. Mastering Java security policies and permissions. http://www2.sys-con.com/itsg/virtualcd/java/archives/0501/neville/index., 2004.
[22] PORTER, D. E., BOYD-WICKIZER, S., HOWELL, J., OLINSKY, R., AND HUNT, G. C. Rethinking the library OS from the top down. In ASPLOS (2011).
[23] REIS, C., AND GRIBBLE, S. D. Isolating Web Programs in Modern Browser Architectures. In ACM EuroSys (2009).
[24] RINARD, M., CADAR, C., DUMITRAN, D., ROY, D. M., LEU, T., AND BEEBEE, JR., W. S. Enhancing server availability and security through failure-oblivious computing. In OSDI (2004).
[25] TANG, S., MAI, H., AND KING, S. T. Trust and Protection in the Illinois Browser Operating System. In OSDI (2010).
[26] WANG, H. J., FAN, X., JACKSON, C., AND HOWELL, J. Protection and communication abstractions for web browsers in MashupOS. In SOSP (Oct. 2007).
[27] WANG, H. J., GRIER, C., MOSHCHUK, A., KING, S. T., CHOUDHURY, P., AND VENTER, H. The multi-principal OS construction of the Gazelle web browser. In USENIX Security Symposium (2009).
[28] WANG, H. J., MOSHCHUK, A., AND BUSH, A. Convergence of desktop and web applications on a multi-service OS. In USENIX HotSec Workshop (2009).
[29] WEBKIT. SunSpider JavaScript Benchmark. Version 0.9.1 at http://www.webkit.org/perf/sunspider/sunspider.html, 2012.
[30] WHEELER, D. A. SLOCCount. Software distribution. http://www.dwheeler.com/sloccount/.
[31] YEE, B., SEHR, D., DARDYK, G., CHEN, J. B., MUTH, R., ORMANDY, T., OKASAKA, S., NARULA, N., AND FULLAGAR, N. Native client: A sandbox for portable, untrusted x86 native code. In IEEE Symposium on Security & Privacy (2009).
[32] ZALEWSKI, M. Browser security handbook: Same-origin policy. Online handbook. http://code.google.com/p/browsersec/wiki/Part2.
[33] ZUKOWSKI, J. Java AWT Reference. O’Reilly, 1997.
