Xen-Cap: A Capability Framework for Xen

Yathindra Naik

University of Utah, Salt Lake City, UT, USA
[email protected]

Abstract

Hypervisors provide strong isolation and can be leveraged to disaggregate the large stack of a traditional, monolithic system. Disaggregation can be achieved by running individual applications and kernel components in separate VMs. Existing hypervisors, however, do not offer a fine grained access control mechanism. Without such a mechanism, individual VMs still run in one of two extremes—complete isolation or excessive authority.

This project extends Xen with the capability access control model—a mechanism for fine grained, dynamic management of rights. Together with strong isolation, capabilities can be used to create least-privilege services—an environment in which individual applications have only the minimal rights that are required to perform their tasks.

We discuss the design and implementation of a capability access framework for Xen. We also demonstrate examples of least-privilege services. Overall, we gained valuable insights for designing a secure system using an industry standard virtualization platform.

1. Introduction

Many of today's computer security problems come from the fact that traditional operating systems are monolithic. A compromise in one of the components of the OS makes it easy to gain control of the entire system.

Virtualization has become a de-facto standard in datacenter and web hosting environments. It is also gaining popularity as a mechanism for constructing novel execution environments on desktop machines [13]. Virtualization can isolate individual applications in separate VMs [3, 15]. But isolation alone cannot guarantee security. In addition to isolation, a secure system needs a flexible way to manage the privileges available to individual VMs. Security models that are implemented with hypervisors do not offer flexible policies or fine grained access control.

Xen is a full-featured, open-source hypervisor which provides support for both para-virtualization and hardware-assisted virtualization [1]. It is adopted in the industry as a default component of the cloud datacenter stack [6]. Xen provides mechanisms to run VMs in isolation, which can further be used to disaggregate individual applications or services of an OS and effectively sandbox their execution.

Capabilities offer a flexible and fine grained access control mechanism and have been used extensively in many research operating systems [7, 9, 17]. Capabilities have been shown to work well for dynamic privilege management [10]. With capabilities, it is possible to construct least-privilege services—an environment in which individual applications have only the minimal rights that are required to perform their tasks. By augmenting Xen with capabilities it is possible to achieve both isolation and fine grained access control.

Though the implementation of capabilities is well understood in object-oriented microkernels [7], it is not clear how to adopt the same model in a virtualized environment. Microkernel systems are designed from scratch to support the capability model, where everything in the system is treated as an object, and the microkernel provides abstractions to operate on these objects. The Xen hypervisor provides more traditional hardware-centric abstractions such as device drivers, DMA, interrupts, etc. It is challenging to map a capability model that relies on object-centric abstractions onto a system that provides a hardware-centric model. Doing so would require redesigning many of the fundamental abstractions in the system. This would require changes that affect many parts of the system, including the application layer, and therefore implies a lot of code changes.

Our contributions are the following:

• A capability framework for fine-grained access control in Xen.

• Examples of constructing more secure services using the capability framework.

2. Capabilities

2.1 Overview

A capability [5] is a special token that uniquely identifies an object (resource) and allows the application to perform certain operations on that object. Any application that possesses a capability can interact with the object in ways permitted by the capability. Those operations might include reading and writing data associated with the object, and executing a function of the object.

In a classic capability object model, capabilities are implemented as references to system objects. Every resource in such a system is an object, e.g., a memory page, an IPC entry point, a hardware device [7, 17], etc. Applications invoke capabilities to perform the operations allowed on an object. The capability invocation is the only IPC mechanism in the system.

Hypervisors do not offer objects as a fundamental abstraction. Neither do hypervisors have a unified IPC mechanism which can provide a meaningful interposition interface for enforcing system-wide access control. In a virtualized system, most system resources are still implemented by the unmodified guest kernels and applications. Some objects, like hypercall invocation points, memory pages, etc., are implemented and can be protected by the hypervisor. But the high-level objects could be anything the application chooses to protect, such as a file or a record in a database. The hypervisor itself cannot make sense of the high-level objects. Hence a capability access check cannot be enforced in the hypervisor.

A pragmatic approach to implementing a capability access control model on a hypervisor is to separate enforcement of capability rules from access checks. Capability rules—integrity of capabilities, exchange of capabilities—are enforced in the hypervisor, but the access checks are made by high-level code. This allows us to extend the capability model into traditional applications and kernel components with only minimal changes to their code.

2.2 Xen-Cap's view of capabilities

In our system, a capability is a record in a hypervisor-protected data structure. Please refer to Figure 1 for more details. We call this data structure the capability address space, or the CSpace. VMs are subjects in our model, i.e., each VM has a private CSpace, and all the code inside it has the same rights.

Each capability is identified by a system-wide unique 64-bit name. The name by itself does not provide access to an object. To obtain access, the capability with this name must be added to the CSpace of the VM.

In order to guarantee the authenticity property of capabilities, i.e., anyone with a capability is authorized to use the resource, we need to ensure capability names are unique. We create capabilities using the multiply-with-carry (MWC) random number generator invented by Marsaglia [11, 12]. We chose this method for two reasons:

• It is fast, since it involves only integer arithmetic.

• It has extremely large periods, ranging from around 2^60 to 2^2000000.

Consider a period of 2^60: if the system allocates a capability every 300ns, it would take approximately 10960.4 years for an overflow to occur.

We also create capabilities at runtime and do not use a persistence store. Using a persistence store to remember capabilities would require securing the persistence store. We would also need to address file system consistency issues. Therefore, we did not have time to consider such a design in this work.

As we mentioned earlier, capabilities by themselves do not provide any security guarantees in our system. To enforce access control, capabilities are bound to high-level objects by the code which manages these objects, e.g., files in a file system, internet connections in the network stack, memory pages, and even virtual machines. The code which manages the objects is responsible for enforcing the access control checks. The hypervisor protects its own objects such as hypercalls, memory pages, IPC end points, etc., and ensures VM-level protection. However, if higher level objects need to be protected, we extend the TCB into the code that implements them. The hypervisor, however, makes sure that VMs cannot forge capabilities and can only acquire them through an authorized grant hypercall. This simplifies the work of the high level code—to check if access is permitted, the high level code can just invoke the hypervisor check call.
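To make the representation concrete, here is a minimal C sketch of a capability record, a per-VM CSpace, and a Marsaglia-style multiply-with-carry generator for 64-bit names. The layout, slot count, and seed constants are illustrative assumptions, not the actual Xen-Cap data structures.

#include <stdint.h>

/* One capability record: a 64-bit, system-wide unique name.
 * (Illustrative layout; the real record may carry more state.) */
struct capability {
    uint64_t name;     /* unique 64-bit capability name */
    int      in_use;   /* slot occupancy flag           */
};

/* A per-VM capability address space (CSpace), kept in hypervisor-
 * protected memory. An array is used here for simplicity. */
#define CSPACE_SLOTS 1024
struct cspace {
    struct capability slots[CSPACE_SLOTS];
};

/* Multiply-with-carry generator (Marsaglia). Two 32-bit lags are
 * combined into a 64-bit capability name. The seeds below are
 * placeholders; a real system would seed from a hardware source. */
static uint32_t mwc_z = 362436069u, mwc_w = 521288629u;

static uint32_t mwc32(void)
{
    mwc_z = 36969u * (mwc_z & 0xffffu) + (mwc_z >> 16);
    mwc_w = 18000u * (mwc_w & 0xffffu) + (mwc_w >> 16);
    return (mwc_z << 16) + mwc_w;
}

/* Generate a fresh 64-bit capability name. */
static uint64_t cap_new_name(void)
{
    return ((uint64_t)mwc32() << 32) | mwc32();
}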

Figure 1. Xen-Cap Capabilities

If the capability is present in the capability address space of the virtual machine which tries to perform the access, then the check returns OK and the high level code can permit the access.

3. Xen-Cap Interface

The main components of the Xen-Cap capability interface are three hypercalls: cap_create, cap_grant and cap_check.¹ The only way to interact with capabilities is through these hypercalls. We explain how capabilities are initialized in Section 5.3. Here we describe how VMs can interact with capabilities.

3.1 Flow of capabilities

• Creating and binding capabilities: A capability is created with the cap_create hypercall. cap_create creates a capability in the calling VM's CSpace and returns the name of the capability to the requesting application. The application then binds the capability to its object. Binding means that the application now knows which capability to check on access to a particular object. The application decides how to store this information.

• Distribution of capabilities: An application can grant capabilities to other VMs by using the cap_grant hypercall. The grant will succeed only if the VM has a capability with that name in its CSpace, and if the VM has the right to grant it according to the rules of the model.

Distribution of capabilities involves another interesting problem. Initially only the application which created a capability knows about its binding. If we want to distribute rights to this object, we need some querying mechanism for applications to get the names of the capabilities bound to these objects.² For example, consider a shared file system where we want to allow access to specific files. Since only the file system knows which capabilities are bound to which files, we need a way to query capability names to grant them to specific applications. In order to query capability names, we introduce a mechanism called the get_cap_names API. get_cap_names takes a high level object, i.e., a file name, and asks the system which manages that object to return the name of the capability bound to that object. Then the granting VM can perform the grant operation, but of course only if it has a corresponding capability in its CSpace.

• Checking capabilities: Any application process in our model will permit access only if the requester has a valid capability. This is achieved by using the cap_check hypercall. The check operation takes a domain identifier of the VM which tries to perform the access and a capability name. It then performs a lookup on the CSpace data structure of the VM which requests the access. If it finds the capability in the CSpace, it returns OK.

3.2 Xen-Cap interface—Libxl and Linux Kernel

The Xen-Cap capability interface is available from two levels in the guest systems: user level applications and the guest kernel. At the user level, applications can use the Xen-specific library which in turn invokes the cap_* hypercalls.

¹ We did not have time to implement cap_revoke, which revokes access to a resource. This can be useful to cut off communication between entities and also to create isolated islands of services. We did not have time to look into the take-grant model [10], which deals with issues of granting capabilities to many entities.

² The grant rule checks are not supported yet—we always grant right now.
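The sketch below illustrates the intended flow of the three hypercalls from a user-level application's point of view for the shared-file scenario above. The wrapper names and stub bodies (xencap_create, xencap_grant, xencap_check) are assumptions for illustration; the real calls go through the Xen-specific library and the cap_* hypercalls.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical wrappers around the cap_create/cap_grant/cap_check
 * hypercalls. The names, signatures, and stub bodies below are
 * illustrative only; real wrappers would issue the hypercalls
 * through the Xen-specific library (Section 3.2). */
static uint64_t xencap_create(void)               { return 0x1122334455667788ULL; }
static int xencap_grant(uint64_t name, int domid) { (void)name; (void)domid; return 0; }
static int xencap_check(uint64_t name, int domid) { (void)name; (void)domid; return 0; }

int main(void)
{
    int client_domid = 4;                 /* VM that should gain access */

    /* 1. Create a capability in our CSpace and bind it to a high-level
     *    object (the application records which capability guards Foo.pdf). */
    uint64_t cap = xencap_create();

    /* 2. Distribute: grant the capability into the client VM's CSpace. */
    if (xencap_grant(cap, client_domid) != 0)
        return 1;

    /* 3. Check on access: permit the request only if the requesting VM's
     *    CSpace holds the capability bound to the object. */
    if (xencap_check(cap, client_domid) == 0)
        printf("access to Foo.pdf permitted for domain %d\n", client_domid);
    else
        printf("access denied\n");
    return 0;
}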

Figure 2. Xen-Cap Architecture

To support capabilities in the guest kernel, we introduce our capability hypercalls inside the Linux guest kernel. They follow the same semantics as in Xen. This helped us patch the frontend drivers in the Linux guest kernel. The frontend driver code can be found in the Linux source tree under drivers/block/ and drivers/net/. We have patched talk_to_netback() for the network frontend driver via xenbus_cap_op(). xenbus_cap_op() creates the necessary capabilities via the cap_create hypercall and does the grant. We also had to patch the connect() routine in drivers/block/xen-blkback/xenbus.c, which handles the virtual disk backend for the guest, via xenbus_cap_op().

3.3 Example of Sandboxing a PDF reader

In Figure 2, the capability is shown as a 64-bit string and it is bound to the object Foo.pdf by the file system code in the File System VM. Capabilities are protected by the hypervisor in a fast lookup data structure called the CSpace. VMs can access capabilities only through the well-defined set of capability hypercalls: cap_create, cap_grant and cap_check. The Application VM hosts a PDF reader and is created with capabilities to access Foo.pdf. The Application VM uses the existing Xen APIs to read the PDF file, and it uses the capability interface to protect those files. Any request to access Foo.pdf results in a capability check on the Application VM's CSpace.

Capabilities flow according to the cap_* APIs, and this defines the life cycle of capabilities in the model we have implemented so far.

3.4 CSpace Organization

We create capabilities in hypervisor-protected memory and hold them in a fast lookup data structure—the CSpace. Designing a fast lookup data structure for capabilities is a research problem in itself. As we mentioned before, in a capability system, access to any resource—every memory page, file, or PCI device in the system—is via capability objects. The problem is to design an efficient data structure that can easily scale to billions or more objects in the system. Consider a system with 4GB of physical memory and a 4kB page size. To protect every memory page on this system, we would need 1048576 capabilities.

Therefore, a practical way to adopt capabilities on systems such as Xen is to design auxiliary data structures to manage capabilities. In our design, we use an array data structure as the CSpace for simplicity. This can easily be replaced with other fast lookup data structures such as a hash table. In order to scale capabilities to a large number, we could design some kind of sparse tree-like data structure where the inner nodes serve to translate the address and the leaves are either data or capability pages, similar to the EROS operating system [17].

Our current implementation, which serves as a prototype, relies on arrays to keep things simple, and we plan to replace this with a faster lookup structure in the future.
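A minimal sketch of the array-based CSpace described above: insertion on create/grant and a linear scan on check. The slot count and function names are assumptions; the point is that the array could later be swapped for a hash table or a sparse, EROS-style tree without changing the interface.

#include <stdint.h>

#define CSPACE_SLOTS 1024   /* illustrative fixed size */

struct cspace {
    uint64_t names[CSPACE_SLOTS];   /* 0 means "slot free" */
};

/* Add a capability name to a VM's CSpace (used by cap_create/cap_grant). */
static int cspace_insert(struct cspace *cs, uint64_t name)
{
    for (int i = 0; i < CSPACE_SLOTS; i++) {
        if (cs->names[i] == 0) {
            cs->names[i] = name;
            return 0;
        }
    }
    return -1;              /* CSpace full */
}

/* Linear lookup used by cap_check: O(n) in the array size, which is
 * why a hash table or tree becomes attractive for larger systems. */
static int cspace_lookup(const struct cspace *cs, uint64_t name)
{
    for (int i = 0; i < CSPACE_SLOTS; i++)
        if (cs->names[i] == name)
            return 1;       /* found: check returns OK */
    return 0;               /* not found: access denied */
}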

4. Xen Overview

Xen is a baremetal hypervisor which provides a platform to run multiple operating systems. It provides a hypervisor, a special virtual machine that helps manage the other guest VMs (Domain 0), and infrastructure for device discovery, virtual machine creation, etc. Please refer to Figure 3 for more details.

A running instance of a virtual machine is called a domain or guest. A special domain, called Domain 0, contains the drivers for all the devices in the system. Domain 0 also contains a control stack to manage virtual machine creation, destruction, and configuration.

The major components of Xen are as follows:

• The Xen hypervisor is a minimal microkernel that runs directly on hardware and is responsible for managing physical CPUs, memory, and interrupts. The hypervisor provides a hardware-like interface sufficient for running traditional operating systems inside isolated virtual containers.

• The Control Domain or Domain 0 is a specialized virtual machine responsible for managing other guest domains. Domain 0 has direct access to physical hardware and typically hosts the backend device drivers. It is the first VM started by the system.

• Toolstack and Console: Domain 0 contains tools to manage virtual machine creation, destruction, and configuration. The toolstack is driven by a command line console, by a graphical interface, or by OpenStack or CloudStack.

• XenStore is a light-weight database that stores VM configuration details. It exposes a file-system-like interface to write to and read from the database. It is also important for the split-device driver architecture that Xen implements to support para-virtualization of device drivers.

The Xen hypervisor implements a small subset of traditional operating system abstractions: virtual machine scheduling, page-level memory management and address spaces, memory sharing across domains, and interrupt-like notification channels. In contrast to many microkernels, the Xen hypervisor does not implement any IPC mechanism besides a single-bit, interrupt-like notification channel. Guest virtual machines are free to implement any IPC mechanism on top of two primitives: the shared memory mechanisms and the event channels.

A typical guest virtual machine, called Domain U, starts with four virtual devices: console, XenStore, disk, and network. A guest virtual machine does not have access to real physical devices. Instead, Xen provides the guest with a notion of a virtual device, which is implemented with two components, a backend and a frontend split device. The backend and frontend devices run in Domain 0 and the guest virtual machines, respectively, and work as a proxy-stub pair which provides access to a physical device running inside a privileged device driver domain.

The frontend device driver is the normal device driver that comes with the guest operating system. Instead of talking to the real device, the frontend driver talks to the backend device driver.

The backend device driver, typically in Domain 0, routes all the requests from the frontend device driver to the actual physical device.

The split-device driver model relies on a shared memory page and event channel communication primitives to establish a high-throughput cross-virtual-machine communication mechanism.

5. Securing Services

A Xen VM interacts with other VMs and the hypervisor through the following interfaces:

• Hypercalls are well-defined and limited functions to access hypervisor services.

• XenStore is a system-wide configuration database with a publish-subscribe interface.

• Event channels and shared memory are two inter-VM communication mechanisms.

Xen comes with the xl toolstack, which provides mechanisms to create, destroy and configure guest virtual machines. Xen-Cap interfaces with the xl toolstack to grant capabilities to guest VMs during their creation.

5.1 Xen objects

Xen objects are hypervisor-level objects, i.e., the hypercall interface, event channels, PCI devices, and shared memory. In order to mediate access to Xen objects, we use the XSM framework. Similar to FLASK [19] and SELinux [18], XSM [4] is a generalized security framework for Xen. XSM has security hook functions throughout the hypervisor. XSM provides simple ways to override the default hook functions with our own. This allows plugging in different security models without having to re-write a lot of code. The security hooks mediate access to Xen-level objects such as hypercalls, event channels, memory pages, etc. When a hook function is triggered, control is transferred to the Xen-Cap interface, i.e., the cap_* interface.
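Schematically, a Xen-Cap security hook only has to map the guarded Xen object to its capability name and call cap_check for the calling domain. The hook shape, the helper signatures, and the stub bodies below are simplified assumptions for illustration, not the actual XSM or Xen-Cap code.

#include <stdint.h>

typedef uint16_t domid_t;

/* Stand-in for the Xen-Cap check: 0 if 'domid' holds capability 'name'. */
static int cap_check(domid_t domid, uint64_t name)
{
    (void)domid; (void)name;
    return 0;                     /* real version walks the VM's CSpace */
}

/* Stand-in lookup of the capability name bound to a hypervisor-level
 * object (e.g., one hypercall, one event channel, one memory page). */
static uint64_t cap_name_for_object(const void *xen_object)
{
    (void)xen_object;
    return 0x1122334455667788ULL;
}

/* The hook itself: invoked by XSM before the guarded operation runs. */
static int xencap_hook_example(domid_t caller, const void *xen_object)
{
    if (cap_check(caller, cap_name_for_object(xen_object)) != 0)
        return -1;   /* deny: capability missing from caller's CSpace */
    return 0;        /* allow: fall through to the normal operation   */
}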

Figure 3. Xen Architecture

5.1.1 Hypercall Interface

The guest virtual machines access Xen services by using the hypercall interface. Hypercalls are implemented like system calls in a traditional operating system. Hypercalls are well-defined and limited in number. In Xen, the hypercalls give virtual machines access to use event channels, map pages, grant page access to other VMs, send interrupts, reserve memory, etc. Securing hypercalls is important in minimizing the authority of a process. By limiting the hypercalls allowed for a VM, we can effectively isolate the system resources that VMs can access.

Creating and binding capabilities: We create capabilities for every hypercall in Xen. For binding capabilities, we use a shared memory page (more about this in Section 5.2).

Distribution of capabilities: The VM config file has an additional parameter we introduced, cap_hypercalls. It includes the names of all the hypercalls that the VM is allowed to invoke. During the creation of VMs, we parse this parameter and grant the capabilities for the hypercalls listed.

Checking capabilities: Every hypercall invocation is intercepted via XSM and cap_check is invoked. If cap_check returns OK, we allow the hypercall, otherwise the hypercall fails.

5.1.2 Event Channels

Unlike traditional microkernels, Xen does not implement a high-level IPC mechanism. Instead, the Xen inter-virtual-machine communication is built on top of two simple primitives: shared memory³ and interrupt-like event channels.

Event channels are used as a signaling mechanism in Xen. An event notification is equivalent to a hardware interrupt.

Securing event channels with capabilities allows us to restrict the communication channels available to VMs. We can selectively allow specific VMs to communicate. For example, we allow a PDF Application VM to communicate only with the File System VM but not others.

Creating and binding capabilities: We create a capability for every VM to establish event channel communication with other VMs. In order to establish event channels with other VMs, there is an exchange of capabilities between the two VMs. In order to bind the capability, we use the VM structure in the Xen hypervisor. We introduce a new hypervisor-level object in the domain structure called est_evtchn. We create a capability with cap_create and bind the capability to the VM structure, i.e., the est_evtchn object. Any VM that wants to communicate with another VM must have the est_evtchn capability. Since Domain 0 is booted first, we create an est_evtchn capability and bind it at boot time. For other guest VMs, we create est_evtchn only if their config file specifies it.

Distribution of the est_evtchn capability: In order to grant capabilities for event channels, we introduce a new parameter to the VM config file called cap_evtchn. It is a list of domain ids for which we grant the est_evtchn capability. During VM creation, the libxl code parses the config file and grants the est_evtchn capability to every domain listed in the config file.

³ We did not have time to work on shared memory in this work.
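The following sketch shows how the toolstack could turn such a space-separated config parameter into grants at domain-creation time; the same pattern applies to cap_hypercalls and cap_evtchn. The helper names and signatures are assumptions for illustration; the real logic lives in libxl and the grant hypercall.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Stand-ins: map a hypercall name to its capability, then grant it. */
static uint64_t lookup_hypercall_cap(const char *hcall) { return (uint64_t)strlen(hcall); }
static int cap_grant(uint64_t cap, int domid) { (void)cap; (void)domid; return 0; }

/* Walk the space-separated list and grant one capability per entry. */
static void grant_cap_hypercalls(const char *cap_hypercalls, int domid)
{
    char buf[256];
    strncpy(buf, cap_hypercalls, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';

    for (char *name = strtok(buf, " "); name; name = strtok(NULL, " ")) {
        uint64_t cap = lookup_hypercall_cap(name);
        if (cap_grant(cap, domid) != 0)
            fprintf(stderr, "grant of %s to domain %d failed\n", name, domid);
    }
}

int main(void)
{
    /* Toy subset of the cap_hypercalls list from Figure 6. */
    grant_cap_hypercalls("evtchn_send grant_mapref console_io", 5);
    return 0;
}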

Checking capabilities: On every event channel operation such as alloc_channel(), bind_interdomain(), evtchn_send(), etc., XSM intercepts the call and transfers control to the Xen-Cap interface. Then we invoke cap_check to verify whether the domain requesting the event channel operation has the est_evtchn capability in its CSpace. If the check returns OK, then the VMs are allowed to use event channels.

5.2 Boot protocol

The Xen hypervisor grants capabilities for all the hypervisor-level objects to the first booting domain. In a default Xen setup, Domain 0, which contains all the tools and backend device drivers, boots first and receives capabilities on all the hypervisor-level objects. We therefore grant it the capabilities required to access hypervisor-level objects.

During system boot, Xen discovers and creates capabilities for resources such as PCI devices, memory page tables, device drivers, etc. In the later stages of boot, Domain 0 is created. XSM intercepts the domain creation function and allows Xen-Cap to allocate memory for the CSpace. Once the CSpace for Domain 0 is allocated, the hypervisor queries the capability names for all the hypervisor-level objects such as hypercalls, PCI devices, memory pages, etc., and grants them to Domain 0.

Next, we need the get_cap_names interface to query capabilities bound to these hypervisor-level objects. We implement this as a shared memory page called BOOT_CAP_INFO during the boot phase. Once the capabilities are created for the hypervisor-level objects discovered during boot, we store them in the BOOT_CAP_INFO memory page.
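A sketch of what the BOOT_CAP_INFO page could contain: an array of (object type, object id, capability name) entries that Domain 0 scans to learn the names of boot-time capabilities before granting them. The entry layout and constants are assumptions for illustration, not the actual page format.

#include <stdint.h>
#include <stddef.h>

/* Illustrative layout of the BOOT_CAP_INFO shared page: one entry per
 * hypervisor-level object discovered during boot. Field names, the
 * object-id encoding, and the page size are assumptions. */
#define PAGE_SIZE 4096

struct boot_cap_entry {
    uint32_t object_type;   /* e.g., hypercall, PCI device, memory page  */
    uint32_t object_id;     /* which one (hypercall number, BDF, pfn...) */
    uint64_t cap_name;      /* capability bound to that object           */
};

#define BOOT_CAP_MAX (PAGE_SIZE / sizeof(struct boot_cap_entry))

struct boot_cap_info {
    struct boot_cap_entry entries[BOOT_CAP_MAX];
};

/* Domain 0 side: find the capability name bound to a given object so it
 * can later be granted (a get_cap_names-style query for boot-time
 * objects). Returns 0 when no entry matches. */
static uint64_t boot_cap_lookup(const struct boot_cap_info *info,
                                uint32_t type, uint32_t id)
{
    for (size_t i = 0; i < BOOT_CAP_MAX; i++)
        if (info->entries[i].object_type == type &&
            info->entries[i].object_id == id)
            return info->entries[i].cap_name;
    return 0;
}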
5.3 XenStore

5.3.1 Overview

XenStore is a light-weight key-value store. It mainly holds configuration information for each virtual machine. XenStore aims to simplify the development of additional drivers by providing higher level abstractions like read, write, list and so on. XenStore offers a hierarchical namespace for keys which gives users the ability to group similar keys together under a single directory. Every key-value pair is also known as a node internally.

XenStore has a particular format for entries and they are grouped under particular directories. There are 3 paths:

• /vm - This holds information related to VM-specific configuration: number of processors, type of kernel, boot parameters, etc.

• /local/domain - This holds information about various device drivers, memory stats, the state of the VM, etc.

• /tools - This holds information related to various tools.

XenStore provides high-level semantics which look like this,

write('/local/domain/4/NFS/1', '/file1')
write('/tool/guest/foo', 'xs')

where the user provides the key-value pair. Note that internally, node names are key names. Please refer to Figure 4 for more details.

5.3.2 XenStore's default permission model

XenStore has a notion of security for nodes: each node is associated with an access control list of VM ids that are allowed to read the node, write the node, or do both.

set_perms('/tool/mytool', { 'dom'   : 0,
                            'read'  : True,
                            'write' : True,
                            'dom'   : 1,
                            'read'  : True,
                            'write' : False })

The first domain id is the owner of the node, and the permissions on the first pair are the default permissions for any of the domain ids not specified in the list. The owner domain has both read and write permissions by default. The other pairs that follow have the permissions that are specified.

Generally, each domain makes entries under /local/domain/<domid>/. Guest domains grant permissions to Domain 0. This is because Domain 0 needs to know the state of the guest domains. A more important reason is that backend drivers are typically hosted in Domain 0 and they need to communicate with the frontend drivers that come with the guest.

5.3.3 XenStore and split-device driver model

XenStore provides a mechanism for the initial handshake process between the frontend and backend drivers. When a guest boots up, the libxl code initializes the entries for the frontend virtual disk driver, virtual network driver, console, etc., depending on the specification provided through the config file. Backend drivers put a watch on these nodes prior to their creation. A watch is a notification mechanism provided by XenStore. Any updates to these nodes trigger the watch callback function on the virtual machine that put the watch. Once the guest frontend entries are written, the backend is notified and both of them change their state accordingly.

5.3.4 Capabilities for XenStore

Creating and binding capabilities: The first thing we needed was a way to bind capabilities to each node in XenStore. XenStore does not have a general security framework similar to XSM or SELinux. We introduce capabilities straight into the XenStore code: when nodes are created in XenStore in construct_node(), we create a capability via cap_create. The capability is bound to the node (inside the node struct) and also stored in Domain 0's CSpace.

Mapping XenStore permissions to Xen-Cap capabilities: We needed to map the permission model used by XenStore to our capability model. We create a read capability and a write capability for each node. We take care to grant the least amount of privilege necessary for the correct functionality of existing tools. We grant both read and write capabilities to owner domains. We converted the existing XenStore permission model by introducing our routine xs_strings_to_caps(). It understands the semantics of XenStore permissions and does the necessary capability creation and grants. XenStore keeps track of the creation of new guest VMs using its own domain structure. We maintain a flag in this structure to indicate whether the domain is in capability mode. This takes care of VMs that are not in capability mode, which can fall back to the default XenStore permission model.

Distribution of capabilities: In the default XenStore permission model, when a new node is created, it inherits the permissions from its parent node. We felt that this kind of permission inheritance limits our reasoning about authority. Therefore, we disable inheritance and change all xl tools to explicitly grant capabilities to each node in XenStore. We introduce the xs_get_caps(path) API to query the capabilities needed to access a particular node. With this in place, we can explicitly grant capabilities rather than follow an inheritance model. We had to patch a number of places in the libxl code path where we call xs_get_caps() and then do cap_grant.

Checking capabilities: Whenever a XenStore node is created, we invoke cap_check on the parent node. This ensures a child node is created only if the requester holds the write capability on the parent node. By default, reading the value of a node in XenStore requires the read capability. The XenStore communication abstractions dictate whether we need to check the read or write capability. For example, looking up the value of a particular node invokes a cap_check for the read capability on that particular node.
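A condensed sketch of the idea behind the construct_node() change and xs_strings_to_caps(): each new node gets a read and a write capability, and the node's permission list decides which domains those capabilities are granted to. The struct layout and helper names are illustrative, not the actual xenstored code.

#include <stdint.h>

typedef uint16_t domid_t;

struct xs_perm {                 /* one ACL entry: domid, read, write */
    domid_t domid;
    int     read, write;
};

struct xs_node {                 /* simplified node with bound capabilities */
    const char *path;
    uint64_t    read_cap;        /* capability checked on reads  */
    uint64_t    write_cap;       /* capability checked on writes */
};

/* Stand-ins for the hypercall wrappers. */
static uint64_t cap_create(void) { static uint64_t n = 1; return n++; }
static int cap_grant(uint64_t cap, domid_t domid) { (void)cap; (void)domid; return 0; }

/* Called when a node is created: bind fresh capabilities to the node,
 * then grant them according to the node's permission list. */
static void xs_bind_node_caps(struct xs_node *node,
                              const struct xs_perm *perms, int nperms)
{
    node->read_cap  = cap_create();
    node->write_cap = cap_create();

    for (int i = 0; i < nperms; i++) {
        if (perms[i].read)
            cap_grant(node->read_cap, perms[i].domid);
        if (perms[i].write)
            cap_grant(node->write_cap, perms[i].domid);
    }
}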
5.4 Network File System

5.4.1 Overview

File systems have a large code base and represent one of the major subsystems in the OS. Users rely on the file system to share files and directories for collaborative work. Sharing often brings the question of what to share with whom. In other words, we need a way to secure the file system even when there is collaboration. The default Linux ACL access control model is too inflexible to implement the principle of least privilege, e.g., to share a single file with another VM as in our PDF reader example. Capabilities provide a flexible and simple mechanism to construct controlled communication flow without requiring complex policies.

In order to isolate the file system, we run the file system in a separate VM and export the directories to application VMs. One of the best ways to export a file system to another VM is via the Network File System (NFS). NFS is a distributed file system protocol used to export file systems over a network connection. It offers the simplest way to share files over the network. Please refer to Figure 5 for more details.

5.4.2 Capabilities for NFS

Creating and binding capabilities: In order to bind capabilities to files, we needed some metadata attached to the files to store the capability names bound to each file. One option we considered was to bind capabilities inside the file's inode. Though this seemed like a good option, we did not want to deal with different filesystems and their representations of inodes. Hence, we introduced our own metadata for binding capabilities to files. We needed something unique about a file that could serve as a binding between the file and the capability. The inode number of a file seemed to be a viable option: the inode number is unique during the lifetime of a file.

Figure 4. XenStore under Xen-Cap. Every node in XenStore is bound to a capability. Access to these nodes results in capability checks.

We created a mapping from the inode number of a file to the capability and maintain this mapping in a hash table. The key to this hash table is the inode number. We use a hash table implementation called uthash, developed by Hanson [20]. It is a very flexible implementation that allows any C structure to be part of the table. We had to make a few changes to it to use it in our kernel modules.

Distribution of capabilities: To grant capabilities for exported directories and files, we explicitly mention their full path names in the VM config file. For each of these files, we grant the capability to the guest VMs. See more details in Section 5.4.3.

Checking capabilities: One issue here is that NFS exports files to client VMs, and the only way we can grant access to files is by verifying the client VMs' capabilities. But our capability interface works with the virtual machine as its subject. Hence, we needed a way to map the client VM's IP address to its domain id.

To resolve VFS requests to domain ids, we maintain an IP address to domain id mapping in a hash table. Once an NFS request for a file comes from the client VM, we can track the client IP address. We then use the mapping table to get its domain id and we can invoke cap_check on the client VM. Once the client has the required capability, we grant access to the file. This is the overall picture of how things flow for the file system in our capability model.

5.4.3 Implementation details

NFS server domain: We modify the config file to enable exporting files from the VM that acts as the NFS server. We call this config parameter cap_files. The NFS server lists the files that it wants to export to the guest VMs through cap_files:

cap_files = "/files/foo /files/bar"

The guest VMs specify the NFS server VM's id along with the files they need to access:

cap_files = "sharedfs backend:/files/foo,/files/bar"

Once we are in the process of creating the VM, the libxl routine initiate_domain_create() invokes libxl_domain_setcapfiles(). This routine parses cap_files from the config file, iterates through the list of files and writes them to XenStore under the NFS server VM's directory, specifically under /local/domain/<domid>/NFS/backend/. As soon as these entries are made in XenStore, a watch is fired that starts the xen-cap-nfs module in the NFS server domain.

The xen-cap-nfs module reads the list of files to be exported from XenStore. For each of the files, we convert the file pathname into an inode number. Once the conversion succeeds we invoke cap_create for each file and add the capability name and the inode number to the hash table. This completes the binding process for the NFS server domain.

Once the NFS server domain comes up, we rely on the Linux initialization script (rc.local) to indicate that the file system is up and running. This is because the routine kern_path() has to convert the file pathnames in the NFS server domain to their inode numbers.
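A sketch of the inode-number-to-capability table using uthash [20]. The structure and function names are our illustration; the real table lives in the xen-cap-nfs kernel module and uses kernel allocation routines rather than malloc.

#include <stdint.h>
#include <stdlib.h>
#include "uthash.h"   /* single-header hash table by Hanson [20] */

/* One mapping entry: inode number -> capability name. */
struct inode_cap {
    unsigned long  ino;        /* hash key: inode number of the file */
    uint64_t       cap_name;   /* capability bound to that file      */
    UT_hash_handle hh;         /* makes this struct hashable         */
};

static struct inode_cap *cap_table = NULL;

/* Binding step: remember which capability guards a given inode. */
static void bind_inode_cap(unsigned long ino, uint64_t cap_name)
{
    struct inode_cap *e = malloc(sizeof(*e));
    if (!e)
        return;
    e->ino = ino;
    e->cap_name = cap_name;
    HASH_ADD(hh, cap_table, ino, sizeof(unsigned long), e);
}

/* Check step: look up the capability for an incoming NFS request. */
static uint64_t lookup_inode_cap(unsigned long ino)
{
    struct inode_cap *e = NULL;
    HASH_FIND(hh, cap_table, &ino, sizeof(unsigned long), e);
    return e ? e->cap_name : 0;   /* 0: no capability bound */
}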

Figure 5. NFS under Xen-Cap. The NFS server VM exports a part of the file system to client VMs. The client VM is created with capabilities to only a part of the file system. Once the client tries to access files from the server, capability checks are made on the incoming request and access is granted.
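To make the flow in Figure 5 concrete, the sketch below shows the per-request check the NFS server VM performs: resolve the client IP address to a domain id, look up the capability bound to the requested inode, and ask the hypervisor whether that domain holds the capability. All helper names, signatures, and stub bodies are illustrative assumptions.

#include <stdint.h>

typedef uint16_t domid_t;

/* Stand-in lookups; the real ones consult the hash tables described
 * in Section 5.4 and issue the cap_check hypercall. */
static int ip_to_domid(uint32_t client_ip, domid_t *domid) { (void)client_ip; *domid = 4; return 0; }
static uint64_t lookup_inode_cap(unsigned long ino) { (void)ino; return 0x42; }
static int cap_check(domid_t domid, uint64_t cap_name) { (void)domid; (void)cap_name; return 0; }

/* Returns 0 if the client VM behind 'client_ip' may access the file
 * identified by 'ino', nonzero otherwise. */
static int nfs_request_allowed(uint32_t client_ip, unsigned long ino)
{
    domid_t domid;
    if (ip_to_domid(client_ip, &domid) != 0)
        return -1;                       /* unknown client: deny          */

    uint64_t cap = lookup_inode_cap(ino);
    if (cap == 0)
        return -1;                       /* file not exported with a cap  */

    return cap_check(domid, cap);        /* OK only if cap is in CSpace   */
}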

The initialization script invokes the xen-cap-nfs module, which starts the process of creating capabilities and binding them to file inode numbers. The module goes through all the files that need to be exported, converts them to their inode numbers, creates capabilities and adds them to the hash table.

NFS client domain: The libxl code, while creating the NFS client domain, writes its IP address and domain id to XenStore. This will later help in converting the NFS client requests to their domain id. We maintain a mapping of the NFS client IP address to its domain id in a hash table.

The libxl code parses the cap_files parameter from the config file. It iterates through the domain id:<file pathname> list, queries the capability names for each file and grants them.

Once the NFS client domain comes up, it mounts the exported file system from the NFS server domain. The client sends NFS file requests, and the NFS server first determines whether it has a mapping for the incoming client request by looking up its IP address in the hash table. If a mapping is found, we get the domain id of the client VM, and then the server looks up the capability for the requested file name by its inode number. After fetching the capability, the server invokes cap_check on the client VM. If it succeeds, then access to the file is granted to the client.

5.5 Xen-Cap changes to the VM config file

We made several changes to Xen's VM config file. We needed some way to specify the hypercalls allowed to a particular VM. Hence, we introduced the cap_hypercalls parameter, which is a list of hypercall names allowed for a particular VM. Please refer to Figure 6 for more details. We introduced a new routine called libxl_domain_setcapfiles() that parses the cap_hypercalls list and grants capabilities for these hypercalls. We have to grant capabilities to the VM early during the boot process, otherwise the VM creation will fail. We have complete control over a VM through the config file, and once we make sure config files can only be edited by the creator of the VM, we can be confident that we have secured the VM.

We also added the cap_files parameter to the config file, which lists the directories and files that are shared by the NFS server VM. We have distinguished the server and the client here. The VM which exports the files uses one format; the client VM which accesses the files lists the server domain names and the files it wants to access from them.

#
# XEN-CAP GUEST CONFIG FILE (SERVES AS BACKEND FOR VIF DRIVER)
#
# This config file has additional parameters to support Xen-Cap.
# The additional parameters are cap_files, cap_hypercalls, cap_evtchn.
#
# cap_hypercalls - List of hypercalls allowed for this VM
# cap_files      - List of file pathnames that are being exported to other domains
# cap_evtchn     - List of domain id's which can establish event channel communication

kernel = "/boot/vmlinuz-3.7.5+"
ramdisk = "/boot/initrd.img-3.2.16emulab1"
memory = 192
name = "backend"
blkif = "yes"
netif = "yes"
vif = ['ip=155.98.36.197']
disk = ['phy:/dev/loop0,xvda1,w']
root = "/dev/xvda1"
extra = 'xen-cap-lsm=y security=xencap earlyprintk=xen console=hvc0'
cap_hypercalls = "memory_pin_page getdomaininfo setvcpucontext pausedomain unpausedomain resumedomain max_vcpus destroydomain vcpuaffinity scheduler getvcpucontext getvcpuinfo domain_settime set_target domctl set_virq_handler setdomainmaxmem setdomainhandle grant_mapref grant_unmapref grant_setup grant_transfer grant_copy grant_query_size memory_adjust_reservation memory_stat_reservation console_io profile schedop_shutdown evtchn_unbound evtchn_interdomain evtchn_send evtchn_status evtchn_reset get_pod_target set_pod_target map_domain_pirq irq_permission iomem_permission pci_config_permission"
cap_files = "/files/foo /files/bar"
cap_evtchn = "0"
on_crash = "preserve"

Figure 6. Xen-Cap VM config file

And finally, cap_evtchn lists the domain ids which can establish event channel communication with one another. Figure 6 shows an example config file.

6. Related Work

Securing large stacks of untrusted software is an ongoing effort. There have been numerous research efforts to find the right security model. As experience has shown, none of the security models have been able to solve all the problems. There are a number of mandatory access control (MAC) frameworks, such as SELinux [18], AppArmor [2], Smack [16], and Solaris Trusted Extensions [8], that help but do not completely solve the problem. MAC frameworks depend on security policies, and it becomes a problem to come up with a policy to deal with every situation.

Capsicum is the latest research work that implements a capability access control framework in FreeBSD at the level of file descriptors [21]. Capsicum helps sandbox applications and disaggregate privileges in user-level applications. However, similar to the MAC frameworks, Capsicum does not protect against a number of attacks on the operating system kernel.

There have been numerous research operating systems [3, 7, 15, 17] that implement capability systems from scratch. However, they offer no solution for traditional desktop and server environments.

There are solutions that leverage hypervisors to provide isolation, like Xen-Cap. Some of them are Qubes [15], Bromium [3], and XenClient [22]. These solutions attempt to sandbox applications by executing them in a separate VM. Complete isolation works well in environments where sharing is not required. But many subsystems in the operating system are inherently designed to provide sharing of resources (e.g., file systems, block storage, the network stack) and require mechanisms for secure sharing to avoid operating with reduced functionality or with excessive privilege. Xen-Cap provides ways to secure environments where sharing is necessary; securing NFS under Xen-Cap was one such way of achieving this.

Disaggregation of core operating system services, device drivers, and hardware devices is important. Sandboxing applications in individual VMs helps to reduce privilege attacks but does not address attacks against devices or device drivers. Xen [1], VMware [14] and Bromium [3] by default run all the device drivers in a single privileged domain. Therefore, a vulnerability in one of the device drivers could lead to a compromise of the entire guest VM.

7. Conclusions

Xen-Cap describes our approach to implementing the capability model on the Xen hypervisor. Xen-Cap serves as a simple prototype to examine the issues one can face while designing a capability model on a virtualized system. Our approach was aimed at reducing re-engineering effort, which resulted in simplifying a lot of details. We were able to use Xen-Cap to secure critical services running on Xen without a significant amount of code changes. The nice feature of Xen-Cap is that it is simple to use (just three hypercalls).

There are some more areas that need work in the future. The capability rules that we have implemented work well for simple scenarios. The grant rule requires more thought, as we do not address issues involving transitive grants. We did not have time to implement the revoke rule; revoking capabilities will require mechanisms to track which domains hold those capabilities in order to revoke them recursively. The CSpace data structure needs to be replaced with a more efficient structure. We could make use of read-only capability caches inside the hypervisor to enhance CSpace lookup. Naming resources with capabilities will require a different design; we may have to adopt a capability table similar to seL4 or EROS. Using a persistence store for capabilities is another area we haven't looked at. To make isolation completely transparent and quick, we need mechanisms to spawn light-weight VMs.

Overall, this is a step towards realizing a production system that offers isolation of services with a flexible and fine-grained access control model.

8. Acknowledgments

The author would like to thank Anton Burtsev, Robert Ricci, Mike Hibler and all the members of the Flux Research Group for their contributions. Anton Burtsev spearheaded the idea and envisioned the entire project. Without his guidance, the project would not have seen the light of day. Robert Ricci, my advisor, would always sit with us, help us break down the problem, and provide valuable suggestions. Mike Hibler was the last stop for any issues to get resolved. His experience with systems helped us in many situations.

This material is based upon work supported by the National Science Foundation under Grant No. 1059440. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

References

[1] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. L. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art of virtualization. In ACM Symposium on Operating Systems Principles, pages 164–177, 2003.

[2] M. Bauer. Paranoid penguin: an introduction to Novell AppArmor. Linux Journal, 2006(148):13, 2006.

[3] Bromium. Bromium Micro-virtualization. Whitepaper, 2010. http://www.bromium.com/misc/BromiumMicrovirtualization.pdf.

[4] G. Coker. Xen security modules (XSM). Xen Summit, pages 1–33, 2006. http://mail.xen.org/files/summit_3/coker-xsm-summit-090706.pdf.

[5] J. B. Dennis and E. C. Van Horn. Programming semantics for multiprogrammed computations. Communications of the ACM, 9:143–155, 1966.

[6] Amazon EC2 underlying architecture. http://openfoo.org/blog/amazon_ec2_underlying_architecture.html.

[7] D. Elkaduwe, G. Klein, and K. Elphinstone. Verified Protection Model of the seL4 Microkernel. 2008.

[8] G. Faden. Solaris Trusted Extensions. Sun Microsystems Whitepaper, April 2006.

[9] N. Hardy. KeyKOS architecture. Operating Systems Review, 19:8–25, 1985.

[10] R. J. Lipton and L. Snyder. A Linear Time Algorithm for Deciding Subject Security. Journal of the ACM, 24:455–464, 1977.

[11] G. Marsaglia, B. Narasimhan, and A. Zaman. A random number generator for PC's. Computer Physics Communications, 60:345–349, 1990.

[12] G. Marsaglia and A. Zaman. A New Class of Random Number Generators. Annals of Applied Probability, 1:462–480, 1991.

[13] D. E. Porter, S. Boyd-Wickizer, J. Howell, R. Olinsky, and G. C. Hunt. Rethinking the library OS from the top down. In Architectural Support for Programming Languages and Operating Systems, pages 291–304, 2011.

[14] M. Rosenblum. VMware's Virtual Platform. In Proceedings of Hot Chips, pages 185–196, 1999.

[15] J. Rutkowska and R. Wojtczuk. Qubes OS architecture. Invisible Things Lab Tech Rep, 2010.

[16] C. Schaufler. Smack in embedded computing. In Proceedings of the 10th Linux Symposium, 2008.

[17] J. S. Shapiro, J. M. Smith, and D. J. Farber. EROS: a fast capability system. In ACM Symposium on Operating Systems Principles, pages 170–185, 1999.

[18] S. Smalley, C. Vance, and W. Salamon. Implementing SELinux as a Linux security module. NAI Labs Report, 1:43, 2001.

[19] R. Spencer, S. Smalley, P. Loscocco, M. Hibler, D. Andersen, and J. Lepreau. The Flask Security Architecture: System Support for Diverse Security Policies. In USENIX Security Symposium, 1999.

[20] uthash: A hash table for C structures. http://troydhanson.github.com/uthash/.

[21] R. N. M. Watson, J. Anderson, B. Laurie, and K. Kennaway. Capsicum: practical capabilities for UNIX. In USENIX Security, 2010.

[22] XenClient. http://www.citrix.com/products/xenclient/how-it-works.html.
