Implementation of Xen PVHVM drivers in OpenBSD

Mike Belopuhov, Esdenera Networks GmbH, [email protected]

Abstract

OpenBSD 5.9 will include a native implementation of Xen PVHVM drivers. It was written from scratch to facilitate simplicity and maintainability. One of the major goals of this effort is to run OpenBSD images in the Amazon cloud.

1 Introduction

The Xen virtual machine monitor provides two types of guest hosting depending on the underlying hardware: paravirtualized, and hardware assisted virtualization when a CPU with virtualization extensions (AMD-V or VT-x) is used.

At the same time, guests running in the hardware assisted virtualization mode are not restricted from accessing the paravirtualized facilities via the hypercall interface normally used by the paravirtualized instances.

We will explore what facilities are provided and how an HVM guest can combine the emulated PCI device tree with interfaces provided via paravirtualization.

2 Guest domain initialization

In order to gain access to the paravirtualized services the guest must take a few steps to identify the hypervisor and set up the hypercall interface.

In OpenBSD a pvbus(4) pseudo bus abstraction takes care of identifying the type of the hypervisor via a CPUID signature and probes for child devices using the standard config(9) framework.

The Xen nexus device that performs domU setup is implemented as a xen(4) device driver that also acts as an attachment point for other virtual devices configured in the virtual machine settings.

3 The hypercall interface

In order to be able to perform different operations like configuring devices, setting up virtual interrupts or simply fetching information from the dom0, Xen makes use of a hypercall interface which works similarly to the syscall(9) interface and produces a VMEXIT event on the hypervisor side.

The guest allocates a page of memory within the kernel's code segment and communicates its location in physical memory to the hypervisor via an MSR write, which causes the hypervisor to fill it with content. Upon inspection the content of the page contains VMCALL instructions at offsets representing different hypercalls.

Since the OpenBSD virtual memory subsystem doesn't, for various reasons, implement a proper way to allocate memory pages that can later be called into, the code segment of the kernel itself had to be extended by one page. And while this is a rather straightforward modification, perhaps a randomized location would suit it better.

Via the established hypercall interface other parameters of the system can be learned, for example the extended version, enabled virtual machine features, etc.

Unlike other implementations, OpenBSD uses a single hypercall function that is defined as a variable argument function and expands the parameter list in order to construct hypercall arguments.
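The single variadic hypercall function can be sketched as follows. This is a userland simulation: the real kernel routine expands the arguments into registers and calls into the hypercall page, whereas here the "hypercall" merely records its operation code and arguments so the argument packing can be demonstrated. All identifiers (xen_hypercall(), struct hcall_frame, XEN_HCALL_MAXARGS) are illustrative, not the actual OpenBSD API.

```c
#include <stdarg.h>
#include <stdint.h>

#define XEN_HCALL_MAXARGS	5

struct hcall_frame {
	uint64_t	op;
	uint64_t	arg[XEN_HCALL_MAXARGS];
};

static struct hcall_frame last_frame;	/* stands in for the VMEXIT */

/*
 * Variadic hypercall wrapper: expand the argument list into a fixed
 * set of hypercall arguments.  The real implementation would load
 * these into registers and call into the hypercall page.
 */
static int
xen_hypercall(uint64_t op, int argc, ...)
{
	va_list ap;
	int i;

	if (argc < 0 || argc > XEN_HCALL_MAXARGS)
		return (-1);

	last_frame.op = op;
	va_start(ap, argc);
	for (i = 0; i < argc; i++)
		last_frame.arg[i] = va_arg(ap, uint64_t);
	va_end(ap);
	for (; i < XEN_HCALL_MAXARGS; i++)
		last_frame.arg[i] = 0;

	return (0);	/* real code returns the hypervisor's value */
}
```

A caller would issue, e.g., `xen_hypercall(2, 2, (uint64_t)10, (uint64_t)20)` and the wrapper takes care of marshalling the two arguments.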
4 Shared Information Page

One of several basic ways of communicating information to the hypervisor and back to the guest system is using shared memory pages.

The Shared Information Page is a specialized page of memory that provides guests access to the bitmaps of masked and pending event channel ports as well as other information, such as the RTC, the TSC, and information about NMIs.

The guest system must allocate a page of memory that has both physical and virtual mappings, for example via malloc(9), and communicate its frame number (a number of page sized increments) to the hypervisor via a memory operation hypercall.

The Shared Information Page also includes a running cycle counter and a wall clock, so it should be possible in the future to turn this into a system timecounter. In fact, the PVCLOCK interface is implemented this way.

5 The interrupt subsystem

There are two ways for a Xen hypervisor to inject an interrupt request into the system: via an Interrupt Descriptor Table vector that has been allocated by the guest solely for these purposes, and via a virtual PCI device, the XenSource Platform Device.

Once triggered, a guest operating system must run an interrupt vector that traverses the pending event channel ports bitmap inside the Shared Information Page to establish which ports have triggered the event.

A xen_intr_establish() method is provided in order to set up a callback that will be executed by the interrupt vector when the associated event port is pending in the event channel ports bitmap. In many cases the event port number is not known in advance and can be allocated by the aforementioned method itself. Likewise, xen_intr_disestablish() can be called to remove the binding.

During system startup and device driver initialization interrupts remain masked and are unmasked after the root filesystem is mounted. Device drivers are required to operate in polling mode until interrupts are enabled. After startup is finished, device drivers can mask and unmask their interrupt sources at will via calls to xen_intr_mask() and xen_intr_unmask().

Unlike other implementations, we have included support for marking Xen upcall interrupts as pending to integrate interrupt processing better with the rest of the system, e.g. to ensure that the interrupt handler is not reentrant.

5.1 Interrupts: the IDT method

When indicated by the virtual machine features, a guest system may communicate an allocated Interrupt Descriptor Table vector to the hypervisor to deliver the interrupt directly into the system without the help of an emulated APIC.

To set up an IDT vector a system must establish a link between an IDT vector number in the range 0-255 and a callback function via an IDT gate descriptor. OpenBSD groups IDT vector numbers according to which Interrupt Priority Level they represent. The IPL_NET priority is used for the Xen interrupt vector and therefore the first vector in that group, 0x70, has been reserved for it.

Because this interrupt vector is not established via a PIC-compatible interface, the low level interrupt stub functions that implement pending interrupt processing cannot be used for it. Instead, a set of new functions akin to those used for the LAPIC timer was rolled to provide this functionality.

5.2 Interrupts: Platform Device

As an alternative to the IDT method, a domU guest implementing PCI bus discovery can implement a driver for the XenSource Platform Device, 0x5853:0x0001. This device provides a level triggered interrupt wired to the emulated APIC, and once it is set up, Xen can be made aware of it.

A driver, xspd(4), has been implemented for this device that configures the Xen interrupt vector to call the Xen upcall when the IDT method is not available.

6 Grant tables

Grant tables represent a mechanism of passing references to pages of memory allocated by the guest across domains. In essence it's similar to the IOMMU mechanism where device visible addresses are translated into physical addresses, but in this case device visible addresses are represented as indexes into the grant table and point to grant table entries.

Grant table entries contain a frame number and access flags that are set up when one domain wants to provide access to its own memory to the other domain. Upon startup the hypervisor sets an upper limit on how many grant table frames can be used by the guest system.

OpenBSD implements a bus_dma(9) [1] abstraction on top of grant tables. It defines a new bus_dma tag that contains methods that wrap the underlying bus_dmamap_* functions in a way that the memory managed by these underlying methods gets accounted for by the grant tables as well. This allows drivers for paravirtualized devices to take advantage of a standard approach to DMA memory management.

The first step is to create a DMA map that records meta information about the mapping that will be performed later. It records the number of segments, their sizes and the total size of the mapping. Due to the limitation of grant tables, only page sized segments are currently supported. The wrapper of bus_dmamap_create() allocates an additional array of entries that will be used to map physical addresses of map segments to grant table references. At the same time grant table entries for all map segments are reserved via xen_grant_table_alloc(). This array of entries is then set as a DMA map cookie. A wrapped bus_dmamap_destroy() method can free those references and destroy the map.

Not all bus_dma(9) methods need to be wrapped. For instance the memory allocation and KVA mapping functions bus_dmamem_alloc() and bus_dmamem_map(), as well as their destructive counterparts, don't need any special handling.

However, in order to establish associations between physical addresses of DMA memory segments and grant table references, wrapped versions of the bus_dmamap_load() family of functions are required. After calling the system bus_dmamap_load() method a wrapper needs to go through all map segments represented by the dm_segs member of the bus_dmamap_t structure, associate them with grant table references and update entries in the grant table via xen_grant_table_enter(). Upon success the physical address of the segment, ds_addr, is replaced with the grant table reference. After this call, the driver can pass this reference to the other domain via a ring descriptor or a similar mechanism.

To remove the mapping, a bus_dmamap_unload() method wrapper calls xen_grant_table_remove() and puts the physical addresses back into ds_addr before calling into the system function.
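The event channel dispatch described in the interrupt subsystem section above can be sketched as a scan of the pending ports bitmap with a callback table. This is a simplified userland model under assumed names (xen_intr_establish() here takes an explicit port, and the bitmap is a plain variable); in the driver the bitmap lives in the Shared Information Page and is manipulated with atomic operations.

```c
#include <stddef.h>

#define XEN_NPORTS	64

static unsigned long	 xen_pending;		/* pending ports bitmap */
static void		(*xen_handler[XEN_NPORTS])(void *);
static void		*xen_arg[XEN_NPORTS];

/* Register a callback to run when the given event port is pending. */
static int
xen_intr_establish(int port, void (*fn)(void *), void *arg)
{
	if (port < 0 || port >= XEN_NPORTS || fn == NULL)
		return (-1);
	xen_handler[port] = fn;
	xen_arg[port] = arg;
	return (0);
}

/* Example handler: count delivered events via its argument. */
static void
xen_count_intr(void *arg)
{
	(*(int *)arg)++;
}

/*
 * Upcall: walk the pending bitmap, clear each pending bit and run
 * the associated callback.  The driver clears the bit atomically.
 */
static void
xen_upcall(void)
{
	int port;

	for (port = 0; port < XEN_NPORTS; port++) {
		if ((xen_pending & (1UL << port)) == 0)
			continue;
		xen_pending &= ~(1UL << port);
		if (xen_handler[port] != NULL)
			xen_handler[port](xen_arg[port]);
	}
}
```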

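The bus_dmamap_load()/bus_dmamap_unload() wrapping described above can be modeled as follows. The grant table, segment structure and function names are simplified stand-ins for the kernel's, kept only to show how a segment's ds_addr is swapped for a grant reference on load and restored from the cookie on unload.

```c
#include <stdint.h>
#include <stddef.h>

#define GNT_NREFS	8

struct gnt_entry {
	uint64_t	frame;		/* page frame shared with the peer */
	int		inuse;
};

struct dma_seg {
	uint64_t	ds_addr;
	size_t		ds_len;
};

static struct gnt_entry gnt_table[GNT_NREFS];

/* Reserve a grant table entry for the given physical address. */
static int
xen_grant_table_enter(uint64_t paddr)
{
	int ref;

	for (ref = 0; ref < GNT_NREFS; ref++) {
		if (!gnt_table[ref].inuse) {
			gnt_table[ref].frame = paddr >> 12;
			gnt_table[ref].inuse = 1;
			return (ref);
		}
	}
	return (-1);
}

static void
xen_grant_table_remove(int ref)
{
	gnt_table[ref].inuse = 0;
}

/* Replace segment addresses with grant references, saving them in a cookie. */
static int
xen_dmamap_load(struct dma_seg *segs, int nsegs, uint64_t *cookie)
{
	int i, ref;

	for (i = 0; i < nsegs; i++) {
		if ((ref = xen_grant_table_enter(segs[i].ds_addr)) == -1)
			return (-1);
		cookie[i] = segs[i].ds_addr;
		segs[i].ds_addr = (uint64_t)ref;
	}
	return (0);
}

/* Drop the grant references and restore the physical addresses. */
static void
xen_dmamap_unload(struct dma_seg *segs, int nsegs, uint64_t *cookie)
{
	int i;

	for (i = 0; i < nsegs; i++) {
		xen_grant_table_remove((int)segs[i].ds_addr);
		segs[i].ds_addr = cookie[i];
	}
}
```

After a load, segs[i].ds_addr holds the grant reference a ring descriptor would carry; the cookie keeps the original address so the unload path can undo the substitution.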
Figure 1: Interfacing Grant Tables with bus_dma(9)

7 XenStore

The XenStore is a hierarchical filesystem-like property storage system that can be accessed via an interrupt driven producer/consumer ring interface in order to learn about configured virtual devices and their properties. This extends the hypercall interface with various information in ASCII string format.

The XenStore ring page is usually allocated by the hypervisor and has to be mapped in by the guest into its kernel virtual memory space via a call to pmap_kenter_pa(). The physical address of the page is fetched with a hypercall. The event channel port for XenStore is also pre-allocated and can be learned via a hypercall as well.

The two most important hierarchies available through the XenStore interface are "device/" and "backend/", which represent available virtual devices and their backend counterparts. Configuring these devices in large part requires setting properties within these hierarchies.

To issue a set of pre-defined commands, a common interface akin to SCSI command submission was implemented. xs_cmd() takes an operation code, a node that the operation is performed upon, as well as a matrix of iovec structures. Depending on whether it's a read or a write operation, the iovec matrix is either allocated and returned by the XenStore driver or, in case of a write operation, provided by the caller. In case the matrix was allocated by the driver it needs to be disposed of by calling xs_resfree().

A simpler property accessor interface, xs_getprop() and xs_setprop(), was introduced for when the complexity of a generic xs_cmd() is not required.

Extensive string parsing in the kernel was avoided by providing a simple iovec based interface, thus multiple independent strings are returned back to the caller as an array of vectors. Other parts of the infrastructure are required to perform data conversions and manipulations on their own, which in most cases is not strictly required as string comparisons cover most of the use cases.

8 Power management facilities

Xen provides a few power management capabilities to the controlling domain, namely it allows it to signal halt, reboot and suspend events to the guest. In order to receive these events a guest is required to install a notification on the "control/shutdown" node via a XenStore watch operation.

Once installed, an asynchronous event message will be delivered to the guest that will require it to read the "control/shutdown" property and act accordingly.

To execute an event callback in a safe environment it is scheduled via task_add(9) to execute in the shared system task queue (systq) under the kernel lock. This allows us to perform graceful shutdown and reboot operations.

9 Virtual device attachment

In order for the guest to probe for virtual devices, a XenStore directory listing operation must be issued for the "device/" node and then, for every subdirectory of it, another listing operation must be performed. This produces nodes like "device/vif/0" and "device/vif/1" that represent different virtual devices that have been configured for this virtual machine instance.

A standard procedure in this case is to utilize the BSD autoconfiguration framework [2], and specifically config_found(9), in order to attach devices configured in the kernel configuration file. For instance, if the "xnf* at xen0" configuration option is included in the kernel config file, a specified match function for the xnf(4) device will be called and the driver can determine whether or not it should proceed with attachment. This decision is based on the provided xen_attach_args structure that records the node name and a few other bits of information to support the device attachment procedure.

The default configuration of virtual networking interfaces employed by Xen allows both paravirtualized and legacy PCI drivers to attach to the same network interface. It's done this way in order to simplify VM configuration and provide a simple fallback method. Thus it's imperative for the virtual network interface to claim ownership of the device and instruct the hypervisor to exclude the legacy PCI device from the PCI device tree. This is done via an I/O port operation once the xnf(4) driver finishes the attachment process. This allows system operators to disable the xnf(4) driver in the kernel via config(8) or User Kernel Config during boot and fall back to the legacy device driver without recompiling the kernel.

10 Virtual network interface

A driver for the virtual network interface is based around the idea that receive and transmit ring descriptors take grant table references to the networking stack buffers instead of the physical addresses that regular hardware counterparts do.

Both receive and transmit rings are allocated as contiguous chunks of memory so that each can be associated with a single grant table reference. These grant table references are passed to the hypervisor as the device properties "rx-ring-ref" and "tx-ring-ref" accordingly.

Each receive and transmit descriptor has its own grant table reference pointing to a single buffer not exceeding a page in size. In order to support fragmented chains in the transmission code path, we need to ensure that the chain is not longer than 18 fragments (when scatter-gather operation is supported by the backend) and then extract each individual memory buffer in order to load it into the transmit descriptor map and associate it with a transmit descriptor. This effectively transposes a tree-like structure of a ring of mbuf chains into the flat ring of buffers that Netfront implements.

In order to tell the hypervisor that the device is ready to receive network traffic, a state property needs to be set to the connected state value.

The driver supports receive and transmit TCP/UDP checksum offloading for both IPv4 and IPv6 packets. One of the quirks when dealing with checksums turns out to be the almost exact emulation of Linux checksum offloading by Xen Netfront, which seems to conflict with how OpenBSD IPv4 checksum offloading is supposed to work in conjunction with protocol checksum offloading.
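The transmit path constraint described above, refusing chains longer than 18 fragments, can be sketched with a toy mbuf-like chain. struct xbuf and the xnf_tx_* names are hypothetical stand-ins for struct mbuf and the driver's routines.

```c
#include <stddef.h>

#define XNF_TX_MAXFRAGS	18	/* backend limit with scatter-gather */

struct xbuf {
	struct xbuf	*next;
	size_t		 len;
};

/* Walk the chain and count its fragments. */
static int
xnf_tx_fragcount(struct xbuf *chain)
{
	int n = 0;

	for (; chain != NULL; chain = chain->next)
		n++;
	return (n);
}

/* Return 0 if the chain fits into the transmit ring, -1 otherwise. */
static int
xnf_tx_check(struct xbuf *chain)
{
	return (xnf_tx_fragcount(chain) <= XNF_TX_MAXFRAGS ? 0 : -1);
}
```

A chain that fails this check would have to be defragmented (or dropped) before its buffers are loaded into the transmit descriptor map one by one.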

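The iovec based string handling from the XenStore section can be illustrated by splitting a reply buffer of NUL-separated strings (the format XenStore uses for directory listings) into an array of vectors. struct xs_iov and xs_split() are illustrative stand-ins, not the driver's actual routines.

```c
#include <stdlib.h>
#include <string.h>

struct xs_iov {
	char	*iov_base;
	size_t	 iov_len;
};

/*
 * Split a buffer of NUL-separated strings into up to maxiov
 * independent vectors.  Returns the number of vectors filled in,
 * or -1 on allocation failure.
 */
static int
xs_split(const char *buf, size_t len, struct xs_iov *iov, int maxiov)
{
	size_t off = 0, sl;
	const char *nul;
	int n = 0;

	while (off < len && n < maxiov) {
		nul = memchr(buf + off, '\0', len - off);
		sl = nul ? (size_t)(nul - (buf + off)) : len - off;
		iov[n].iov_len = sl;
		iov[n].iov_base = malloc(sl + 1);
		if (iov[n].iov_base == NULL)
			return (-1);
		memcpy(iov[n].iov_base, buf + off, sl);
		iov[n].iov_base[sl] = '\0';
		n++;
		off += sl + 1;
	}
	return (n);
}
```

A listing of "device/vif" would come back as one such buffer and end up as one vector per child node, with no further string parsing needed in the caller.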
11 Operation in the Amazon EC2

Antoine Jacoutot and Reyk Flöter did a great job with providing OpenBSD development images in the Amazon cloud. In order to create and upload an OpenBSD image ready to be deployed as an AMI, a script¹ by Antoine can be used. It depends on the "sysutils/ec2-api-tools" package and automates creation of a freshly installed OpenBSD system with an additional rc script that fetches the SSH key and configures the primary networking interface.

12 Future Work

One of the issues exposed by the Netfront is the lack of hardware interrupt moderation for the receive ring, and therefore it becomes somewhat easy to mount DoS attacks against it. Since OpenBSD does not have an infrastructure for a polling mode of operation for networking interfaces, we need to implement additional countermeasures.

To ensure compatibility with existing migration techniques we need to implement full suspend/resume functionality. It's controlled by the same "control/shutdown" node. In order to suspend the system a few steps must be taken, such as disabling interrupts, after which a hypercall is issued that upon return signals either a resume or a cancelled suspend.

A diskfront driver can be implemented to support paravirtualized storage devices. Possible benefits include speed improvements and potentially support for hot-pluggable storage devices such as Amazon EBS volumes.

Support for a native timecounter should be fairly easy to implement given that the Shared Information Page provides most of the data already.

13 Conclusion

This effort has proven that all of the Xen PVHVM infrastructure can be implemented in about 2500 lines of code with comments (excluding device drivers) in just a few C files (/sys/dev/pv/xen.c and /sys/dev/pv/xenstore.c) and a few header files (/sys/dev/pv/xenvar.h and /sys/dev/pv/xenreg.h). Existing and well-known system abstractions such as bus_dma(9) can be used to keep device driver implementation simple and comprehensible.

Acknowledgments

The author would like to thank Esdenera Networks GmbH for funding this work, and OpenBSD developers, especially Reyk Flöter, Mark Kettenis, Martin Pieuchot, Antoine Jacoutot, Mike Larkin and Theo de Raadt, for productive discussions and code reviews. Huge thanks to all our users who took their time to test, report bugs, submit patches and encourage development.

Availability

Scheduled to become available in OpenBSD 5.9. The whole stack will be enabled in the default "GENERIC" kernel as well as the install ramdisk kernel "RAMDISK_CD", not requiring users to build their own kernels or install media.

References

[1] Jason Thorpe, A Machine-Independent DMA Framework for NetBSD, USENIX 1998 Annual Technical Conference.

[2] Chris Torek, Device Configuration in 4.4BSD, Lawrence Berkeley Laboratory, 1992.

¹ https://github.com/ajacoutot/aws-openbsd