Bachelor’s Thesis Nr. 198b

Systems Group, Department of Computer Science, ETH Zurich

Device Queues for USB

by Joël Busch

Supervised by Prof. Timothy Roscoe, Reto Achermann, Roni Häcki Systems Group, ETH Zurich

November 2017–May 2018

Abstract

The motivation behind this thesis is to provide the research operating system Barrelfish with a useful feature and, in the course of implementing it, to demonstrate that some of Barrelfish’s newer systems work well. The goal we chose was therefore to develop a USB mass storage service. The implementation of the service provides an opportunity to use the newly introduced Device Queues for inter-process bulk communication, and demonstrating their functionality and efficiency is part of our aim. Furthermore, in the course of developing the USB mass storage driver we can show how well the system knowledge base and the device manager cooperate to enable event based driver startup. Our approach was to first reorganize the existing USB subsystem to conform to the new driver model for Barrelfish and to change its initialization to be based around events dispatched by the system knowledge base. Next we added a USB mass storage driver to the subsystem and used Device Queues to provide clients with access to the service over a zero-copy channel. Finally Barrelfish’s FAT implementation was extended to include write support and its back-end was modified to use the aforementioned communication channel. The resulting mass storage service achieves better performance than Linux when both are run virtualized on the same host with the same USB hardware being passed through. Some stability issues, particularly around FAT, still need to be addressed and performance can still be improved, but the core functionality has been demonstrated. We further provide evidence that the Device Queues do not add any measurable communication overhead. Finally we show that the device manager and the system knowledge base robustly solve the problem of device initialization on a discoverable bus.

Contents

1 Introduction
  1.1 Thesis Outline

2 Background
  2.1 Barrelfish
    2.1.1 Inter-domain communication
    2.1.2 System initialization
  2.2 Protocols
    2.2.1 USB
    2.2.2 SCSI
    2.2.3 FAT
  2.3 Previous USB subsystem

3 Design
  3.1 New driver model
  3.2 Queue setup
  3.3 Initialization
  3.4 USB mass storage driver
  3.5 FAT

4 Implementation
  4.1 New driver model and Initialization
  4.2 USB mass storage driver
  4.3 Queue setup
  4.4 FAT

5 Evaluation
  5.1 Aim
  5.2 Testing Setup
    5.2.1 Systems and Hardware
    5.2.2 Testing the full stack
    5.2.3 Isolating FAT
    5.2.4 Isolating the Device Queue


    5.2.5 Generating a baseline
  5.3 Test failures
  5.4 Results
  5.5 Interpretation

6 Conclusion and Future Work
  6.1 Conclusion
  6.2 Future Work

Chapter 1

Introduction

The Barrelfish research operating system [16] runs many of its services, including its drivers, in user-space processes. This necessitates efficient communication between processes, especially when facing large amounts of data to transfer. Previously Barrelfish had a variety of small tools to enable bulk transfer of data between processes, but they all had various drawbacks. As a formal solution to the problem Device Queues [6] were recently added to Barrelfish. They allow for sending memory buffers between processes without copying the data. Instead, ownership of a buffer that both processes have access to is passed back and forth. Only the current owner of a buffer is allowed to access it. Since there is no effective safeguard against access by the non-owner, the queue is intended to be used between cooperative processes. When using a hardware-specific back-end, the same functionality can be used to write directly into hardware buffers of the device, enabling zero-copy usage.

File systems communicating with user programs are a common use case that involves shifting large amounts of data between a driver in one process and a user application in another. This project specifically targets a file system on a USB mass storage device [10], because it is a good opportunity to demonstrate the use of Device Queues in a real-world scenario.

Another new addition to Barrelfish is the specification of a more unified driver interface [5] that also allows for drivers to share a process and employ local direct communication within the same virtual memory space. This is of interest to the USB subsystem, because there are various drivers typically running that are tightly coupled and could benefit from sharing a process. At a minimum three drivers must be running and working together if any functionality of a USB device needs to be used.

The situation before the project was that one application called the USB manager [1] was running driver code for the host controller, which is responsible for enabling communication with all other USB devices, and for the hubs, which are port multipliers for USB. It was directly responsible for bringing up the drivers of USB devices that were connected. Startup of the USB manager was hard-coded into the device manager [4] and the USB manager had its own stub mechanism for locating USB driver binaries that worked only for the keyboard driver.

A better solution can be achieved with two mechanisms in tandem: Firstly the common interface in the new driver model [5] allows the device manager to start various drivers in a more unified way. Secondly the publish-subscribe server Octopus [20] can asynchronously inform the device manager of new device records appearing in the system knowledge base [7].

Based on these we can design a system where the device manager reacts to events signifying the appearance of new devices. It can then easily and dynamically load drivers for the device, because they conform to the new interface. This also means that the discovery code running in the hub driver can be made agnostic of driver startup routines. It only needs to put a device record into the system knowledge base.

For these reasons the project is aimed at transforming the existing USB implementation in Barrelfish in multiple ways. First of all the existing driver is ported to the new device driver model, including the addition of new instantiation routines by the device manager. The second change is allowing communication with the USB driver over Device Queues for efficient bulk data transfers. And finally, to enable testing of the queues a USB mass storage driver is added, including supporting infrastructure such as a partial SCSI [15] implementation to encapsulate in the USB transport and changes to the existing FAT file system [14].

The contributions of the thesis are:

• Splitting all the USB code into driver modules conforming to the new driver interface

• Having the device manager perform event based initialization of devices in the USB subsystem

• Producing a USB mass storage driver that allows clients to connect over a Device Queue interface

• Merging write support into the FAT library and making adjustments allowing it to connect to the aforementioned Device Queue

1.1 Thesis Outline

Beyond this introduction the thesis is split into five chapters. Chapter 2 contains explanations of various relevant Barrelfish subsystems and some of the protocols and standards used. Chapter 3 provides the new design of the USB subsystem as it relates to initialization and inter-domain communication, including the reasoning behind design choices. Chapter 4 describes various implementation details that may be relevant primarily in system maintenance. Chapter 5 presents the testing setup used for performance evaluation and the results obtained from it. Finally, chapter 6 draws conclusions from the data and experiences and casts a gaze forward towards possible future improvements.

Chapter 2

Background

2.1 Barrelfish

Barrelfish is a research operating system developed by researchers in the Systems Group of ETH Zürich. The main goal behind Barrelfish is exploring designs for better scalability as well as system heterogeneity. It further serves as a platform on which to base related research and teaching in the area of operating system design [16].

The base design chosen for Barrelfish is that of the multikernel [8], where each core independently runs a small kernel. The kernels do not share memory and do not need to be identical. In fact they are regarded simply as CPU drivers within Barrelfish. They do not enjoy special treatment and are simply started by the device manager like other device drivers. This allows kernels for different architectures to run within a single system. The kernels are minimal by design, much like those of microkernel [12] operating systems. They only provide access to kernel objects and memory, access to the core system hardware, a channel for local communication within a CPU core, and scheduling to user space dispatchers.

Such a dispatcher is, from the kernel’s perspective, akin to a process in other operating systems, but with the limitation that it is fixed to the single core on which its controlling kernel runs. One or multiple dispatchers, interacting over various means of message passing, together form a domain. A domain is the abstraction that is mostly equivalent to a process from the user’s perspective. That is to say it has a common virtual memory space as well as a shared capability space regardless of how many cores its dispatchers run on. Capabilities represent authorization to use various system resources.

All operating system functions in Barrelfish beyond the four kernel functions mentioned above run in user space within such domains. Discussing all the functionality present would be well beyond the scope of this work. Nevertheless, a few parts that are particularly relevant will be explained below.

2.1.1 Inter-domain communication

Inter-domain communication (IDC) [3] allows different user space domains to communicate amongst each other. The methods vary by specific needs, platform capabilities and sometimes by where domains are placed in a system.

Flounder

The most mature and most widely used IDC tool in Barrelfish is Flounder [3]. Flounder is the name of both a stub compiler and the associated domain specific language. The input to the compiler consists of interface descriptions for remote procedure calls (RPCs) expressed in the Flounder language. They are processed into a series of stub and header files that are specific to a given underlying interconnect driver.

Those interconnect drivers vary depending on the hardware support used. There is local message passing for intra-core communication, user-level message passing using shared memory, Rockcreek, which is specific to the Intel Single-Chip Cloud Computer, and an implementation of Beehive message passing. By their nature, Flounder interfaces are better suited to function calls and short messages than to transferring large amounts of data.

Pbuf and bulk_transfer

For bulk transfer various other methods were developed [2]. Two main mechanisms existed, called bulk_transfer and Pbuf. The former simply shares a block of physical memory between two domains in a master-slave setup. Both would map the memory and have read and write access. The concept was not well formalized, and did not track who was in control of which block. Pbuf is built on the same idea, but instead of simply identifying memory locations with offsets into the block, it has a layer of indirection with pbuf-ids, such that an empty pbuf could be returned without actually having to give up the underlying memory location at the same time.

Devif

More recently Device Queues [6], also called Devif, have been introduced to Barrelfish. Like the previous methods they are also based on the same chunk of physical memory being mapped in two places, but the questions of buffer ownership and registered memory are properly formalized.

A queue is always created between two programs, which can be running in different domains or even within the same domain. Endpoints can register memory regions to the queue. A registration by one endpoint causes the other endpoint to also receive the memory capability, enabling it to map the same region in its virtual memory space. Each endpoint has a set of buffers it currently controls; we say that it ’owns’ those buffers.

The endpoint registering memory starts out with ownership over all of it. During usage of the queue each endpoint adds each buffer it dequeues to its ownership set and likewise removes from it each buffer it enqueues. While a buffer is in the queue, the queue owns it. Any memory access to a buffer not owned by the accessing endpoint results in undefined behavior. Users are free to subdivide buffers in any configuration they want. The queues don’t provide any help in allocating buffers within the region, nor do they impose restrictions. In the general case it would be too expensive to have the queue inject checks on each pointer it transmits, as this is the fast path for data transmissions.
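To make the ownership discipline above concrete, the following C sketch shows one endpoint’s fast path. It is a minimal sketch under assumptions: the type and function names (dq_register, dq_enqueue, dq_dequeue, dq_notify) are schematic stand-ins invented for illustration and do not match the actual devif API.

    #include <stddef.h>
    #include <stdint.h>

    /* Schematic stand-ins for a Device Queue endpoint; names and signatures
     * are illustrative only, not the real devif interface. */
    struct dq;                              /* one endpoint of a queue */
    typedef uint32_t region_id_t;

    struct dq_buf {                         /* a buffer is a slice of a registered region */
        region_id_t region;
        size_t offset;
        size_t length;
    };

    int dq_register(struct dq *q, void *base, size_t size, region_id_t *out_region);
    int dq_enqueue(struct dq *q, struct dq_buf buf);  /* gives up ownership of buf */
    int dq_dequeue(struct dq *q, struct dq_buf *out); /* acquires ownership of *out */
    int dq_notify(struct dq *q);                      /* wake the peer endpoint */

    /* Typical usage: an endpoint may only touch memory it owns, i.e. buffers
     * it registered and has not yet enqueued, or buffers it has dequeued. */
    static int send_request(struct dq *q, struct dq_buf req)
    {
        int err = dq_enqueue(q, req);       /* ownership moves into the queue */
        if (err != 0) {
            return err;                     /* still ours, may retry later */
        }
        dq_notify(q);                       /* from here on, req must not be touched */

        struct dq_buf resp;
        while (dq_dequeue(q, &resp) != 0) {
            /* poll (or block on a notification) until the peer returns a buffer */
        }
        /* resp is now owned by this endpoint and may be read or reused */
        return 0;
    }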

The Devif also allows for queue stacking, i.e. encapsulating queues in one another, with the functionality added on top of the inner queue. The most notable use for this is perhaps the debug queue that inserts validity checks. For example, it checks if a buffer is owned by the agent trying to enqueue it. This should assist development by making erroneous actions explicitly fail. It also keeps a partial history of operations on the queue to aid in debugging.

Again, just like Flounder, Devif has multiple back-end implementations. There are device-specific back-ends, for example for network interface cards (NIC), but there is also an IDC back-end that uses Flounder message passing to inform the other actor of queue creation, memory region registration, de-registration, queue teardown and manual user notifications about queue state. Importantly, the buffer data is not sent over Flounder messages, enabling zero-copy data transfer.

2.1.2 System initialization

In figure 2.1 we can see the system knowledge base (SKB), tightly coupled with Octopus, accepting new device information and sending out events based on the new knowledge to the device manager called Kaluga. Kaluga brings up various device drivers, who in turn can explore the system further and add new knowledge back into Octopus. Each component is described in more detail in its own section below.

Figure 2.1: Interaction of the SKB and Octopus, Kaluga and various device drivers in the system initialization

System knowledge base

The system knowledge base (SKB) [7] performs the role of a central repository of knowledge about the state of the system Barrelfish is running on. Part of it is static knowledge, known at compile time; part is dynamic, added at runtime. Knowledge is stored according to schemata that are defined by another domain specific language called Skate. Skate was designed in such a way as to make it easy to extract knowledge from manuals while also making documentation an inherent part of the process. Skate files are compiled using the Skate compiler, which produces C header files that define how to add or query knowledge from the SKB. The SKB also uses a Prolog-based constraint solver called Eclipse CLP [17] to unify incoming knowledge with the system state and find valid configurations, for example for PCI bridge programming, or NUMA information for memory allocation. Static knowledge includes, for example, the device database that maps drivers to known devices, knowledge of PCI quirks that are needed for PCI device setup and knowledge of the devices contained in a well-known SoC. Dynamic knowledge consists of records for devices found on discoverable buses during the boot process, results of online hardware profiling, knowledge of synchronization states, a name registry allowing the SKB to take on the role of a name server and various other pieces of knowledge.

Octopus

Octopus [4] [20] makes heavy use of the SKB. It is a server that provides clients with a publish-subscribe system as well as a searchable key-value store, based on the SKB. For that it uses two channels to its clients: remote procedure call bindings for synchronous communication and event bindings for asynchronous communication. Clients can read entries from Octopus and publish new information through it. Additionally they can install triggers on entries or provide a filter for the kind of new entry they are interested in, in order to receive change events. By using triggers and common state in Octopus additional services like distributed locking, semaphores and barriers are provided. Thus Octopus is sometimes called a lock manager, but beyond that it also serves an important function in Kaluga. The publish-subscribe service can be used for an event based driver startup routine used for devices located on discoverable buses like USB or PCI.

In figure 2.1 we can see the event loop working for the PCI subsystem. Octopus sets off an initial event for the PCI root bridge appearing in the SKB, which is received by Kaluga. Kaluga in turn starts up the PCI root bridge driver, which then does discovery on the bus and adds any devices it finds back into Octopus. This in turn generates new events in Octopus for Kaluga. Kaluga can then query the SKB for a valid PCI configuration and with that it can bring up the PCI device drivers. This project implements a similar setup for the USB subsystem.
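As a rough illustration of this event loop, the following C sketch shows a bus driver publishing a device record and the device manager watching for matching records. It is a sketch under assumptions: octopus_publish_record and octopus_watch are hypothetical stand-ins for the Octopus client API, and the record and filter syntax is abbreviated.

    #include <stdio.h>

    /* Hypothetical stand-ins for the Octopus client API (illustration only). */
    int octopus_publish_record(const char *record);
    int octopus_watch(const char *filter,
                      void (*callback)(const char *record, void *arg), void *arg);

    /* Bus driver side: after discovering a device, publish a record describing it. */
    static void on_device_discovered(unsigned idx, unsigned vendor, unsigned device)
    {
        char record[128];
        snprintf(record, sizeof(record),
                 "hw.pci.device.%u { vendor: %u, device_id: %u }", idx, vendor, device);
        octopus_publish_record(record);
    }

    /* Device manager side: react to every new PCI device record. */
    static void on_new_pci_record(const char *record, void *arg)
    {
        (void)arg;
        (void)record;
        /* query the SKB for a matching driver module and start it ... */
    }

    static void install_pci_watch(void)
    {
        octopus_watch("hw.pci.device.*", on_new_pci_record, NULL);
    }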

Kaluga and driver model

Kaluga is Barrelfish’s device manager [4]. It subscribes to device entry changes in Octopus and uses knowledge from the SKB to load the appropriate driver binaries with the relevant arguments. Thus Kaluga can work independently of the method by which hardware was discovered on the buses. Depending on whether a driver already implements the new driver model [5] Kaluga uses different methods for starting it. There was no unified mechanism for driver startup beforehand, so Kaluga simply had methods for each driver to set up state, acquire capabilities and finally start the driver as a standard program with that state and those capabilities. Either a default start function was used or a custom one could be injected, as was done for PCI and the CPUs.

With the new driver model there are now three concepts specified. First of all there is the driver domain, that is, one domain spawned by Kaluga as a container within which one or multiple drivers are run. It has a basic life-cycle defined by a create and a destroy call. The driver domain can dynamically start driver instances.

Second there are the driver modules. They work similarly to general libraries in Barrelfish, except that they use a different linker to allow for drivers to be dynamically instanced, despite linking in Barrelfish currently being static only. A module contains all the driver logic and is used to provide functionality for initialization and destruction, attachment and detachment and finally for controlling power states. Modules are registered during the build process so that their location is part of the static knowledge in the SKB.

And finally, as the third concept we have driver instances. A driver instance is what we call the state of a specific running instance of a driver module. Each instance has its own driver control interface allowing Kaluga to manage them independently. By decoupling the module and instance from the domain, there is the necessary support in place to run instances of different driver modules or multiple instances of the same module within a single domain, allowing for memory and capability sharing.
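A minimal sketch of how these three concepts might look from a driver author’s point of view follows; the struct layout and the entry point names are assumptions made for illustration and do not correspond to the actual Driverkit definitions.

    /* Hypothetical sketch of a driver module's entry points, mirroring the
     * life-cycle described above. Names and signatures are illustrative only. */
    struct driver_instance;                 /* per-instance state */

    struct driver_module_ops {
        const char *name;                   /* registered at build time, known to the SKB */
        int (*init)(struct driver_instance *inst, const char *args);
        int (*attach)(struct driver_instance *inst);
        int (*detach)(struct driver_instance *inst);
        int (*set_power_state)(struct driver_instance *inst, int level);
        int (*destroy)(struct driver_instance *inst);
    };

    /* The driver domain acts as a container: Kaluga asks it to create or destroy
     * instances of registered modules, several of which may share the domain's
     * address space and capabilities. */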

2.2 Protocols

Various well-known, established protocols were relevant to this project. Although the inclined reader is likely already familiar with them, we will touch upon them briefly to refresh the memory.

2.2.1 USB

The Universal Serial Bus (USB) [9] was intended to be a universal interface to replace the various incompatible preexisting standards that were a burden on system designers and makers of peripheral devices at the time. It has succeeded in becoming ubiquitous, being regularly updated and recently gaining a new physical connector and the ability to encapsulate other protocols as Alternate Modes. This project only worked on the USB 2.0 standard.

Some of the desired properties in the specification of USB were the ability to hot-plug and for one controller chip in the host to administer many devices. This necessitated a discoverable bus and the allocation of unique vendor IDs and device IDs in order for systems to determine what drivers to load for a specific device.

The USB devices connected to a single controller chip in the host form a tree structure. At the root of the tree is the controller chip running what is called the host controller. The host controller manages and communicates with all the devices in the tree. In the case of USB 2.0 the host controller driver must conform to the Enhanced Host Controller Interface (EHCI) [13].

Each internal node of the tree is a device called a hub. It has one upstream and one or more downstream ports. Hubs are responsible for passing on data transparently. They also inform the host controller of new devices being plugged into their downstream ports and have some power management functionality. The hub built into the host controller is called the root hub.

Finally the leaves of the tree are USB devices that can provide various kinds of functionality, ranging from webcams and printers to human interface devices like mice and keyboards. To use their functionality an operating system needs to load an appropriate driver for a specific device class or, depending on the device, even a driver for that specific device. The communication of the loaded driver with the device is facilitated by the host controller driver and the hub drivers.

Since the release of USB version 2.0, which allows for a transfer rate of up to 480 Mbit/s, it has become practical to use the interface for mass storage purposes. For mass storage devices the USB specification defines three main transport protocols, which are described in the following subsections.

Control / Bulk / Interrupt

The Control / Bulk / Interrupt (CBI) [19] protocol uses the three types of endpoints its name consists of. The control endpoint is used to send commands to the device, the bulk-in and bulk-out endpoints are used for reading and writing data, and on the interrupt endpoint command completion is signaled. This specification is only intended for use with floppy disk drives.

Bulk-only transport

The second protocol is the bulk-only transport (BOT) [18]. This one uses the same bulk-in and bulk-out endpoints for all three types of transfers: command, data and status. For this reason it is sometimes also called Bulk / Bulk / Bulk or BBB for short, though in this document we will use BOT. The design of BOT is simple and cheap to implement in the micro-controllers of devices and has found wide usage. It is almost always used for USB flash drives of the USB 2.0 variety and as a fall-back on newer ones.

It is important to note that bulk-only transport does not define which command set is to be used to realize block access over the USB transport. A handful of protocols have been assigned subclass numbers; the most widely used among them is SCSI. However, not all low-level operations can be accessed by this method; for example it can be impossible to read out S.M.A.R.T. values or submit such commands even if the underlying storage supports them.

Figure 2.2: The flow of USB packets in three phases during the execution of each command

In order to send all three kinds of messages (commands, data and status) over the same two bulk endpoints, the three-phase protocol was defined. Its phases are denoted with 1., 2.a), 2.b), 2.c) and 3. in figure 2.2.

• Phase one: Any communication is initialized by the host sending a command in a format called the command block wrapper (CBW), starting the protocol. The CBW is a 31 byte package containing 16 bytes of a command specified by the encapsulated protocol. It is sent over the bulk-out endpoint.

• Phase two: Depending on the command sent before, one of three things happens: either data is sent from the host to the device, using the bulk-out endpoint (2a), or it is sent from the device to the host, using the bulk-in endpoint (2b), or, if the command did not call for data transfer, this phase is skipped (2c).

• Phase three: Finally there is always a transmission of status information in a 13 byte command status wrapper (CSW). It lets the host know if the transfer was completed successfully or, in case of partial success, how much data was processed by the device. This way the host can retry with a new command for the remaining data or react in some other way.

This three-phase design necessitates the device and the host being in agreement over which phase they are currently in, that is to say, it requires synchronization. Synchronization is achieved by the device stalling an endpoint if it encounters unexpected input or writing explicit error codes into a CSW. Upon discovery of such a mismatch the host should either retry reading a CSW or move to perform reset recovery. The latter is initiated by a class-specific USB request and as such is addressed to the standard USB endpoint 0, so the stalls on the bulk endpoints cannot prevent it. Then the other endpoints are cleared and transmission can begin anew with the host sending a CBW.

Bulk-only transport has proven to be a limiting factor as transfer speeds grew. It does not support queuing up requests or out-of-order execution on the device side. This brings us to the introduction of UAS.
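For concreteness, the two wrapper structures of the three-phase protocol can be written down as C structures roughly as follows. The field names and sizes follow the bulk-only transport specification [18]; the packed struct definitions are a sketch for illustration, not the driver’s actual code.

    #include <stdint.h>

    /* Command Block Wrapper (CBW): 31 bytes, sent over the bulk-out endpoint
     * at the start of every command (phase one). */
    struct __attribute__((packed)) cbw {
        uint32_t dCBWSignature;          /* 0x43425355 ("USBC") */
        uint32_t dCBWTag;                /* echoed back in the matching CSW */
        uint32_t dCBWDataTransferLength; /* bytes expected in the data phase */
        uint8_t  bmCBWFlags;             /* bit 7: 1 = device-to-host (IN) */
        uint8_t  bCBWLUN;                /* logical unit number (bits 3:0) */
        uint8_t  bCBWCBLength;           /* valid bytes in CBWCB (1..16) */
        uint8_t  CBWCB[16];              /* encapsulated (e.g. SCSI) command */
    };

    /* Command Status Wrapper (CSW): 13 bytes, read from the bulk-in endpoint
     * after the data phase (phase three). */
    struct __attribute__((packed)) csw {
        uint32_t dCSWSignature;          /* 0x53425355 ("USBS") */
        uint32_t dCSWTag;                /* must match dCBWTag of the command */
        uint32_t dCSWDataResidue;        /* requested minus actually processed bytes */
        uint8_t  bCSWStatus;             /* 0 = passed, 1 = failed, 2 = phase error */
    };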

USB attached SCSI

Because of BOT’s limitations a new transport has been specified in the USB 3.0 specification: USB attached SCSI (UAS) [11]. It can also be implemented for USB 2.0, though of course at the lower speeds of that specification. The UAS protocol (UASP) moves back to using separate USB endpoints for command, data and status. It allows the multiple transfer streams that became available with USB 3.0 to be used, thereby enabling full resource usage of the faster interface. Command queuing and out-of-order processing without software overheads are also available, though the latter feature requires hardware support only present in USB 3.0. For backward compatibility it is suggested that devices also support bulk-only transport.

2.2.2 SCSI

The Small Computer System Interface (SCSI) [15] is a full stack of standards allowing the connection between host computers and peripheral devices, most often those intended for bulk storage of data, but sometimes also I/O devices like printers and scanners. SCSI specifies physical connections, electrical and optical interfaces, protocols, and command sets. It has a long history, its first revision having been published in 1986, and accordingly it has branched out into many different use cases, only a tiny part of which are relevant to this project.

Since the transport used is USB we can forgo all the physical specifications. It bears mentioning that SCSI was originally defined with a parallel interface, but these days it is mostly used over a serial mode of transport. This is also true in the case of using USB as the transport. This project used only the SCSI commands, packaged inside USB requests, and of the SCSI command set only a small subset was of use. In particular:

• The INQUIRY command, to gain insight about the SCSI device, in particular its type, to make sure it is a block device.

• The READ CAPACITY command, used to determine the total number of blocks available on the SCSI device as well as the physical block size.

• The READ command, used to read a number of blocks starting at a block address.

• The WRITE command, used to write a number of blocks starting at a block address.

Of the latter three commands there exist multiple versions. In the evolution of SCSI it has turned out, with growing storage sizes, that the initial 6 byte command versions with their very limited block addressing range were insufficient. The later 10 and 12 byte versions with 4 byte addressing were eventually obsoleted as well. This project used the 16 byte versions of those commands.
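To illustrate what such a 16 byte command looks like, the following C sketch assembles a READ(16) command descriptor block; the layout follows the SCSI specification (operation code 0x88, a 64 bit big-endian logical block address and a 32 bit big-endian transfer length), but the helper function itself is illustrative and not taken from the driver.

    #include <stdint.h>
    #include <string.h>

    /* Build a 16-byte READ(16) CDB; fields not needed here (protection,
     * DPO/FUA, group number) are left at zero. WRITE(16) uses the same
     * layout with operation code 0x8A. */
    static void scsi_build_read16(uint8_t cdb[16], uint64_t lba, uint32_t num_blocks)
    {
        memset(cdb, 0, 16);
        cdb[0] = 0x88;                          /* READ(16) operation code */
        for (int i = 0; i < 8; i++) {           /* bytes 2..9: LBA, big-endian */
            cdb[2 + i] = (uint8_t)(lba >> (56 - 8 * i));
        }
        for (int i = 0; i < 4; i++) {           /* bytes 10..13: length in blocks, big-endian */
            cdb[10 + i] = (uint8_t)(num_blocks >> (24 - 8 * i));
        }
        /* cdb[15] is the CONTROL byte, left at zero. */
    }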

2.2.3 FAT

The File Allocation Table file system (FAT) [14] is a simple file system that has existed for a long time and is well supported by all major operating systems. More complex file systems with journaling for meta-data or even all data are preferred for system drives these days; however, FAT still finds wide usage on USB thumb drives, where portability is paramount and too many writes concentrated in meta-data sections are to be avoided because of flash wear concerns.

In FAT a number of sequential disk sectors is treated as a cluster, the number of sectors being a power of two, typically ranging from one to 128. A cluster is the smallest amount of storage allocated for any file contents. A table holds information about which clusters are allocated to which file, with a direct mapping between the index of an entry in the table and the index of the associated cluster on disk. This table is the file allocation table that lends the file system as a whole its name. Each entry in the table simply contains the number of the cluster that follows it in the file allocation, or alternatively a marker signaling that it was the last. This way a file allocation is represented as a singly linked list of clusters. Directories themselves are realized as simple files. Only the root directory is special in that it is statically allocated in a well-known location. Directories contain the file name, file size, some other meta-data and the starting cluster number for each file they contain.
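The cluster chain described above can be followed with a few lines of C. The sketch below assumes FAT32, where each table entry is 32 bits wide with the top four bits reserved and values of 0x0FFFFFF8 or above marking the end of a chain; the fat array and the visit callback are hypothetical stand-ins for the real table and cluster I/O.

    #include <stdint.h>

    #define FAT32_ENTRY_MASK   0x0FFFFFFFu
    #define FAT32_END_OF_CHAIN 0x0FFFFFF8u

    /* Walk a file's cluster chain starting at start_cluster, calling visit()
     * for every cluster that belongs to the file. Clusters 0 and 1 are
     * reserved, so valid data clusters start at 2. */
    static void fat32_walk_chain(const uint32_t *fat, uint32_t start_cluster,
                                 void (*visit)(uint32_t cluster))
    {
        uint32_t cluster = start_cluster & FAT32_ENTRY_MASK;
        while (cluster >= 2 && cluster < FAT32_END_OF_CHAIN) {
            visit(cluster);                              /* e.g. read its sectors */
            cluster = fat[cluster] & FAT32_ENTRY_MASK;   /* next cluster in the file */
        }
    }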

There are some reserved blocks at the beginning of a FAT formatted volume, containing meta-data of the file system itself, such as the block and cluster size, disk size, the number of FATs (they can be repeated as a backup) and sometimes a bootstrap program if the drive is intended as a boot drive. There are also various complications surrounding long file names, because they were implemented in a backwards compatible fashion and had to use some tricks, but it is of no particular concern for this project.

Barrelfish already had a partial implementation of FAT16 and FAT32, including long file name support. Only reading was supported. The implementation was also specific to an ATA driver in Barrelfish and simply called its methods in remote procedure calls.

2.3 Previous USB subsystem

Before this project began the USB subsystem in Barrelfish was organized into two main parts, the USB manager and a USB library. The USB manager was running driver code for the host controller and hubs, with only device drivers being in separate domains [1]. The USB manager was only used on armv7, within the OMAP44xx device startup routine, which is used to bring up the Pandaboard, a single board computer used in Barrelfish development. It was started by Kaluga directly, without any event based mechanism.

The USB manager contained the drivers for the host controller and for hubs. It initialized the host controller, which would also bring up the root hub directly. The root hub did discovery on all ports and set up devices as it found them. For other hubs the internal hub driver would be called. For other devices an external driver binary would be located. In order to do that the USB manager had a stub mechanism that was only implemented for the keyboard driver and simply held a hard-coded path. That path would be used to spawn the driver in a standard Barrelfish domain. The USB manager and the only other driver, the keyboard driver, used Flounder RPCs as their sole IDC mechanism.

Chapter 3

Design

This chapter is intended to present the design that was finally implemented for the new USB subsystem. It also provides the reasoning behind design decisions and discusses a few alternative approaches.

3.1 New driver model

The first design goal for the new USB subsystem as it relates to the driver model is to remove duplicate system knowledge. The previous situation with a stub resolver for driver paths inside the USB manager was undesirable as it introduced a separate repository of knowledge that would have had to be separately administrated if any changes were ever to be made to default driver binary locations. It would also have been potentially error prone since the hard-coded paths were independent of the build process and could have provided paths that were invalid for a specific compiled Barrelfish system. The new Driverkit already implements a mechanism of registering driver modules, to bring them up later by name. It also provides the translation to path names with a tweak to the build system where driver modules are treated a bit differently from other libraries in order to give them a defined location in the resulting ELF file. Using this mechanism is the clear choice, without any real alternatives, since unification of system knowledge is the motivation.

The second goal is modularization. We want to split the functionality of the EHCI driver from the hub driver and package them into modules separately, with their own initialization procedures, so that they can be started independently by Kaluga and the USB manager is no longer necessary.

Finally the third goal is to keep overheads low between drivers in the USB subsystem that need to interact closely. Any USB driver, be it a hub driver or a device driver, relies on the EHCI driver for its transfers, which can happen very frequently. Previously the device drivers ran in separate domains, requiring IDC for any transfer invocation. The USB manager provided a Flounder interface that was used by the USB library to allow device drivers easier access.

We have three possible choices for designing the interaction between the EHCI driver and the device drivers. We can run them in separate domains and use Flounder interfaces like before, we can run them in the same domain and use direct function calls, or we can have them interact over a Device Queue, keeping the choice of co-locating the modules in one domain open.

To eliminate as many overheads as possible we choose to forgo the Flounder interface and thereby avoid frequent remote procedure calls. This still leaves us with the choice between the queue and direct function calls, which will be discussed in section 3.2.

Either way we want to implement domain sharing to test out the driver domains introduced in the new driver model. As figure 3.1 shows, all USB driver modules are run within the same driver domain in the new design. A point of further research beyond this thesis should be to look into scalability questions regarding this choice. Is it sustainable to run all USB drivers in a single domain or are there bottlenecks we run into? What happens if the USB device is located too far from the CPU core in multi-CPU systems?

3.2 Queue setup

It was clear from the beginning of the project that we wanted to use a Device Queue for the transport of user data between the USB mass storage service and the user domain. The question is which driver should export the queue for the user domain.

The first design option is having the EHCI driver export the queue and letting it pass on the data to and from the USB mass storage driver over a second Device Queue. Queues can be linked with minimal overhead, so the data path is not expected to be impacted much. The advantage is that we can then also use the queue between the EHCI driver and the mass storage driver to handle USB transfers and other driver interactions. This would allow flexibility regarding the question of running the drivers in a common domain.

The second option is letting user domains connect to the USB mass storage driver directly, using only one queue in the data path, keeping it as short as possible. As mentioned in section 3.1 we have to consider whether we want to use a second queue for the interaction of the drivers anyway, to keep the design flexible, or to use direct function calls.

We choose the second design with only one queue because it is much simpler to implement and because it is unclear if flexibility in the domain setup is even all that valuable at this time. The additional complexity of routing all driver interaction through a queue would complicate the development of the mass storage driver. In case future scalability questions demand that the USB driver domain be split up, this design decision will have to be revisited.

In figure 3.1 we can see clients for the mass storage service connecting from a user domain over a queue directly to the USB mass storage driver. One such client we create in this project is the FAT library, but nothing restricts usage to file systems. Any software interested in block access to a USB mass storage device can connect to the queue directly. In the figure this is represented by the left user domain at the bottom.

A question left open for further work is the naming of queues. Each mass storage driver instance needs to present a queue with a unique name so that the client may decide which one it wishes to connect to. Once name assignment is solved, a mechanism to present user software with a set of meaningful choices is also needed. Currently the queue name is hard-coded, meaning multiple instances of the mass storage driver running at once would all present the same name, leading to undefined results when a user tries connecting to a queue of that name.

Figure 3.1: The new design of the USB subsystem, with event based driver initialization, driver modules running in a common driver domain and communication with the mass storage driver over Device Queues.

3.3 Initialization

In figure 3.1 we can see how the new USB subsystem fits into the previous picture of system initialization. This is the order in which events related to the USB subsystem happen during system initialization:

1. Kaluga installs triggers for the PCI root bridge, PCI device and USB device records in Octopus, among others.
2. The PCI root bridge record appears and Octopus triggers an event for Kaluga.
3. Kaluga starts the PCI root bridge driver.
4. The PCI root bridge driver runs discovery on the PCI bus and registers new device records in Octopus.
5. Octopus triggers events about PCI device records for Kaluga.
6. Kaluga queries the SKB to find a valid PCI configuration and starts drivers for the PCI devices.
7. One of the PCI devices is the USB host controller. Kaluga uses the USB start function for it and brings up the driver domain.
8. Inside the driver domain Kaluga brings up the EHCI driver, which also runs the internal USB root hub.
9. The USB root hub discovers devices on the USB bus, probes them and creates device entries in Octopus.
10. Octopus informs Kaluga of a new USB device.
11. Kaluga runs the USB device change event callback function and starts the appropriate USB device driver inside the USB driver domain.

The root hub is normally integrated in the host controller and interacts with it more directly than other hubs, which have to communicate with the host controller over the bus. Therefore the root hub cannot be represented in software by only the standard hub driver. Trying to split it off from the EHCI driver was considered, but since they always appear and work together anyway, we found it acceptable to let the modules stay connected and have the EHCI driver start up the root hub directly.

There can be multiple layers of USB hubs in the whole device tree, up to a depth of five layers. If a hub finds another hub, the procedure is the same as for devices, just that Kaluga starts the hub driver. This hub itself then runs discovery. This way the event loop can take a few iterations until all USB devices are up and running. When a USB hub detects a port change it also informs Octopus of any new devices, so the same event mechanism ensures hot-plugged devices are brought up.

3.4 USB mass storage driver

One main goal of this thesis was to add a driver for USB mass storage devices to the existing USB stack. The USB code already available in Barrelfish was nearly entirely written for the Enhanced Host Controller Interface (EHCI), so this project needed to be focused on USB 2.0 as well.

It was not immediately clear which USB mass storage protocol would be suitable. The USB Mass Storage Class Specification Overview [10] offered three choices: Control / Bulk / Interrupt, bulk-only transport and USB attached SCSI. Those are explained in subsection 2.2.1.

We did not consider floppy disks to be particularly useful in the modern computing environment, so CBI was soon disregarded. UAS would have been possible to support on USB 2.0, however it was questionable how widely the protocol is supported by USB 2.0 devices. The ubiquity and relative ease of implementation of bulk-only transport won out in the end.

In order to communicate over BOT with standard USB flash drives the SCSI transparent command set had to be used. Barrelfish does not yet have a SCSI stack that could be used within the USB mass storage driver. Implementing one would have been beyond the scope of this thesis, so a small component within the USB mass storage driver provides the subset of commands that were needed for reading status information and reading and writing blocks of data.

In the future, if Barrelfish ever gets an independent SCSI stack, this design should be reexamined. Perhaps the mass storage driver could then provide only BOT as the transport and be agnostic of the commands that are sent, in the interest of clean layering. If this proves unrealistic then at least some code reuse should be possible, so repeating the definition of SCSI commands in the USB mass storage driver can be avoided.

The internal design of the driver is fairly simple. During the driver’s initialization phase the device is checked for compatibility with the driver, some important device parameters like its size are collected and the queue for client access is set up. When a client connects to the queue, its setup on the driver side is finished. Then each client request for reads or writes to the device that is received on the queue causes a run of the three-phase protocol with an appropriate SCSI command assembled in the CBW. In case of success the resulting data is transmitted back over the queue.

For now we assumed that only one client at a time may connect to a mass storage driver. This could be expanded with relative ease to allow for multiple connections later, but no overall performance increase is expected as BOT does not provide any underlying parallelism that could be exploited [11].

3.5 FAT

This project did not itself contribute much code to the file system related aspects of the FAT implementation. It only modified it in the course of back-porting code for write support and switching out the back-end. Thus rather than describing the design of the full FAT library this section is limited to describing the changes in design.

The first change is the addition of write support. The source of the functionality is code originating from the Advanced Operating Systems class, which is itself based around Barrelfish. Back-porting it to Barrelfish adds the ability to create and remove files, to write into files and to create and remove directories. To support these functions a way of flushing the caches was also added and the long file name conversion gained some additional functionality.

Secondly, the FAT library is now designed to be a client of a USB mass storage driver. The FAT library now has a code path where it starts a queue connection to the USB mass storage driver. Any previous points of contact with the ATA driver now send requests into the queue and wait for a response from the driver, before returning to the previous control flow.

The FAT implementation needs to be reworked without question in future projects. The underlying access to blocks and clusters of the block device needs to be abstracted out of the file system code. Perhaps this can be achieved by employing Device Queues more widely. If a common command abstraction can be found then it might be as simple as connecting to the correct queue during mounting.

Chapter 4

Implementation

In the following sections we will describe the implementation of the designs explained in the last chapter. The parts about initialization and the new driver model are handled in a common section because they are so closely related.

4.1 New driver model and Initialization

Three new modules based on the Driverkit template form the basis of the new USB subsystem. They are responsible for the EHCI driver including the root hub, the hub driver and the mass storage driver respectively. They are linked together with the custom linker for drivers and built into a single application binary.

Most of the USB driver code is still present within the new modules, but it has been reorganized. We modularized the monolithic USB manager application and gave the EHCI driver and the hub driver separate initialization procedures to be used by Kaluga. All code related to Flounder interface functions was stripped out and so was the USB manager. The Flounder interface is no longer necessary, because the modules run in the same domain and can now simply use direct calls. Some code had to be imported from the USB library too, because it had functionality beyond abstracting IDC communication. There are likely more definitions in there that will need to be imported when one is implementing other features on the USB subsystem.

Kaluga has received a new custom start function for the USB subsystem, which starts a driver domain for USB drivers in case there is none running yet. It then instantiates the EHCI driver module in the domain. In the architecture startup for x86 new Octopus listeners are installed, with a callback function that examines the USB device class and based on that instantiates the fitting driver module in the USB driver domain.

27 28 CHAPTER 4. IMPLEMENTATION

Currently the interface for interrupt routing on PCI is undergoing changes, thus a timer providing periodic calls to the USB interrupt function was added into the EHCI driver as a temporary holdover until the new interface is ready.

The hub driver now writes Octopus records upon discovery of new devices. This way Octopus triggers events for Kaluga to initialize device drivers. The Octopus record is defined in the following scheme:

hw.usb.device.<num> {
    usbrev: %u,
    class: %u,
    subclass: %u,
    protocol: %u,
    vendor: %u,
    product: %u,
    devicerev: %u,
    hostcontroller: %u,
    hub: %u,
    port: %u,
    devicepointer: %u
}

There are a few things to note about this entry. First of all there is no distinctive device name. Since there is not yet a naming scheme for devices on the USB bus established in Barrelfish, we simply let Octopus assign a sequentially growing number upon entry of device information, from "hw.usb.device.0" upwards.

The first seven fields hold data describing some specific capabilities of the device. They are read from the device descriptors upon being found by the hub driver. Then there are three fields that describe the location of the device, in terms of its position in the USB device tree. And finally we have a pointer to the device memory that was allocated by the hub driver, so that the device driver may retrieve it once it’s started by Kaluga.

This could be changed such that the device driver allocates memory for the device state itself and then calls the EHCI driver to initialize the device state as the first thing it does, but it does not really matter. As long as they run in the same virtual memory space, handing over the pointer through Octopus works just as well.

4.2 USB mass storage driver

The USB mass storage driver has its main functionality in the initialization function and the request handler for client requests coming in over the queue. Other than that there are a few callbacks for proper queue setup.

The initialization function starts by determining whether the device is compatible with the driver, i.e. whether it supports BOT and SCSI. It probes the device further and prepares it to be in a state ready to handle incoming client requests. The device queue for communication with the client is also set up.

The request handler reacts to notifications on the queue, reads the client request and injects the buffer coming from the client into the prepared USB transfers. This was done so that the usual transfer setup functions from the USB driver could be used, despite them also allocating pages for the USB transfers. The handler then calls into the SCSI stub, where the SCSI command is assembled and some meta-data is set for the USB transfer. The SCSI stub finally calls the transfer function that runs the three-phase protocol against the device.

The coupling between the SCSI stub and the rest of the mass storage driver became tighter than intended. A layer of abstraction removing the setting of transfer meta-data from the assembly of the command for the command block entry would be desirable in the future.

Since we are not using real interrupts yet, the callbacks signaling completion of the three individual transfers would be delayed, so instead we resorted to using busy waiting while polling the USB driver about the completion status of the previous transfer before starting the next. This delivered much better results in preliminary testing. Once interrupts are routed to the USB driver this may need to be revisited.

The two available class-specific requests in the USB bulk-only transport specification were also implemented. They are used for resetting the device and getting the highest logical unit number (LUN), which is the identifying number given to devices in the SCSI environment. For all the flash drives we used this was only ever 0. If there are USB mass storage devices out there that present multiple SCSI endpoints over the same USB endpoint, the driver will have to be adjusted to support them.

The mass storage driver calls directly into the EHCI driver to execute transfers since they are in the same domain. If they need to be separated the question of using Device Queues between individual driver modules should be revisited.

One thing to mention is that there is a bug that we have failed to locate so far. The buffers are all standard page sized, that is to say they are 4 KiB large. The USB driver also uses 4 KiB buffers for its transfers, so it seems like a good idea to match that size. One such buffer should therefore be able to hold eight standard blocks (512 byte) of a block device. It works as expected for requests of up to seven blocks. However, when we attempt a USB transfer with a SCSI command asking for eight blocks of data, the buffer will return wrong data, without any indication of failure. We have been unable to locate where the wrong data originates. The reported size of the amount of data read is the size of the full block. The buffer address is still the one we set up beforehand. We suspect that it may be an off-by-one error somewhere deeper in the USB stack, but have not been able to track it down yet.

4.3 Queue setup

A single queue is exported by the mass storage driver. Memory for it is allocated by the user who connects, not the driver. The queue is used with fixed-size buffers of 4 KiB because that is the size expected by the USB driver for the individual transfers.

Each client request is defined to be a transfer of two buffers, the first carrying command information, the second either carrying data or being intended to receive it on the way back. The command structure is much smaller than 4 KiB but it was simpler to just use a standard buffer to carry it. A command structure defines only a code for the intended action, the starting block number, a length and a processed length field. It is a very minimal interface that will have to be expanded for further functionality in the future. Once the command has been processed, the results, located in those same buffers, are passed right back in the same order so that the client has ownership of nearly all buffers all the time.

The USB mass storage driver does not currently take requests for more blocks than fit in a page, but we can envision a future expansion where the driver could take larger requests and spread them over multiple USB transfers. The data could be spread over multiple page-sized buffers supplied by the client, resulting in batched queue activity with fewer notifications flowing back and forth. Since the FAT library only requests a cluster at a time this was not yet needed.
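To make the request format concrete, the following C sketch shows a command structure with the four fields named above together with the two-buffer convention. The struct layout, the field types and the names are assumptions made for illustration; the driver’s actual definitions may differ.

    #include <stdint.h>

    /* Illustrative command structure carried in the first of the two buffers of
     * each request; the field names mirror the description above, the types are
     * assumed. */
    enum ms_op { MS_READ, MS_WRITE };

    struct ms_command {
        uint32_t op;              /* code for the intended action (read or write) */
        uint64_t start_block;     /* first 512-byte block of the request */
        uint32_t length;          /* requested length in blocks */
        uint32_t processed;       /* filled in by the driver: blocks actually handled */
    };

    /* A client request is a transfer of two 4 KiB buffers: the first carries the
     * ms_command, the second carries the data (or receives it on a read). After
     * processing, the driver passes the same two buffers back in the same order,
     * returning ownership to the client. */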

4.4 FAT

The same disclaimers as in the Design chapter apply: much of the file system driver is untouched. The specific implementation of the features added by this project is laid out below.

The mounting procedure was modified such that it runs the connection to the device queue when it encounters a URI with the word "usbtesting" behind the protocol specifier. We hard-coded this, because there is no naming scheme available for USB devices in Barrelfish and none for the queues either, so there was no effective way to specify an intended resource to mount. Instead this is based on the assumption that exactly one USB mass storage driver instance offers a device queue. A connection to the hard-coded queue name is then established, some memory for it is allocated and the usual start blocks are read using the queue.

In all the places where calls to ATA for reading or writing were made over Flounder interfaces before, two buffers are now sent to the mass storage driver over the queue instead, one for the command and one for data. This concerns only a few functions, namely those for acquiring or flushing cache contents and some initialization code.

For compatibility with the existing code we could not use the queue buffers to pass the incoming data back through the caches and the rest of the FAT and VFS library into the hands of the end-user utilizing it. Enabling this behavior would have required changing too much of the file system code and was considered beyond the scope of this project. Therefore, the current solution is that data is copied to or from the queue buffers at the point of contact between the FAT library and the device queue. We can currently claim zero-copy transport of the data into the user domain but not down to the level of the testing code.

The memory used by the FAT library to store data coming from the queue is allocated with a call to the Barrelfish version of the well known memory allocation function "malloc" when a buffer is needed. Using malloc on Barrelfish in the fast path like that can be a problem because it requires communication with the memory server. As an intermediate performance improvement one could use the slab allocator until the file system is capable of operating on the queue buffers directly.

The other implementation changes were caused by merging in the FAT library from the Advanced Operating Systems class. It implements the usual back-end functions for VFS, so there is nothing particularly noteworthy to say about the interface. Most of the work was in fixing portability issues, where the code assumed 32 bit pointers.

Some unexpected behavior from the cache was encountered: it sometimes reported being unable to release clusters from the cache. However, initial attempts at debugging were unsuccessful and the data contents read and written were correct, so the issue was put aside in favor of work more directly related to the thesis. This issue warrants further scrutiny in the future.

Chapter 5

Evaluation

This chapter contains the evaluation of the work done on the USB mass storage service in the course of the project. It is split into five sections. Section 5.1 describes what questions the evaluation aims to answer. Section 5.2 explains the setup we used to collect quantitative data. Section 5.3 reports roadblocks encountered in the course of testing. Section 5.4 showcases the collected results. And finally section 5.5 draws concrete insights from the results.

5.1 Aim

The goals of this evaluation are to determine the quality of the implemented USB mass storage service, the overhead of the Device Queue, the applicability of Kaluga, Octopus and the new driver model to driver startup and finally the impact of the remaining bugs we are aware of.

The quality of the mass storage service is determined by three main factors. Most importantly, it has to work correctly; that is, we have to check that no corruption of user data occurs during transport. Another measure of quality is the stability of the service, particularly whether there are any crashes. The last aspect of quality we want to examine is performance, as measured in transfer speed. We also need a point of comparison to evaluate whether the achieved speed is acceptable for the environment we run the tests in.

This thesis was also aimed at demonstrating the abilities of Device Queues, so we need to quantify what kind of overhead they introduce. The goal is to determine whether the queue presents a bottleneck that significantly reduces transfer speeds.

Showing how well Kaluga, Octopus and the new driver model assist in implementing event-based driver startup cannot be determined by any quantifiable measure. Rather, it relies on the observations of the programmer. We will explain our experiences in the section on results.

Two bugs with potential or even likely performance impact were discovered and are not yet fixed: the one that forces us to use fewer than eight blocks per page-sized transfer (section 4.2), and the one related to the FAT cache, which keeps failing to release cache entries (section 4.4). Additionally, there is a malloc call on the fast path in FAT. These three issues need to be investigated and their impact on performance should be determined.

5.2 Testing Setup

5.2.1 Systems and Hardware

First of all, the appropriate point of comparison has to be determined. In the development setup we have Barrelfish running on 64-bit x86 QEMU with KVM virtualization support on an Ubuntu 16.04 host. We would like to use the same setup for testing as well. Therefore, we decided on Linux emulated on the same host to determine a baseline for performance comparison. We chose Ubuntu 16.04 for the emulation guest. The same QEMU command line options were used for both, except of course for the locations of the system images and the Barrelfish menu.lst entries, which are specific to the respective systems.

For both operating systems we passed through the same physical USB mass storage device, a Toshiba Kingston DataTraveler SE9 in the USB 2.0 version and 8 GB variant. The manufacturer does not provide performance data in their data sheet or anywhere else on their public website, so its peak performance was determined using the benchmarking tool CrystalDiskMark. The drive exhibits a sequential read speed of 28.8 MB/s, a sequential write speed of 5.5 MB/s, a random 4 KiB read speed of 3.9 MB/s and a random 4 KiB write speed of only 0.01 MB/s. Since the USB flash drive has much faster read than write access, we chose reading with a sequential access pattern for our measurements of speed, so our testing would not be impacted by the limitations of the hardware.

5.2.2 Testing the full stack

The first test setup, which simply measures overall speed, is designed as follows: files of sizes ranging from 1 KB to 1 GB, filled with random data, are placed in a FAT32 file system on the flash drive. That data is read from a standard user domain through the virtual file system with the FAT back end.

# Ubuntu
qemu-system-x86_64 -machine type=q35 -smp 2 -m 4096 -enable-kvm -usb -device usb-ehci,id=ehci -device usb-host,bus=ehci.0,vendorid=0x0930,productid=0x6545 -hda ubuntu.qcow2

# Barrelfish
qemu-system-x86_64 -machine type=q35 -smp 2 -m 4096 -enable-kvm -usb -device usb-ehci,id=ehci -device usb-host,bus=ehci.0,vendorid=0x0930,productid=0x6545 -nographic -kernel x86_64/sbin/elver -append 'loglevel=3' -initrd x86_64/sbin/cpu loglevel=3,x86_64/sbin/init,x86_64/sbin/mem_serv,x86_64/sbin/monitor,x86_64/sbin/ramfsd boot,x86_64/sbin/skb boot,eclipseclp_ramfs.cpio.gz nospawn,skb_ramfs.cpio.gz nospawn,x86_64/sbin/kaluga boot,x86_64/sbin/acpi boot,x86_64/sbin/spawnd boot,x86_64/sbin/proc_mgmt boot,x86_64/sbin/startd boot,x86_64/sbin/routing_setup boot,x86_64/sbin/pci auto,x86_64/sbin/corectrl auto,x86_64/sbin/usb auto,x86_64/sbin/usb_devq_test

Figure 5.1: Comparison of the command line options used to boot both test systems within QEMU.

Time measurement begins only right before the call to the read operation of the virtual file system and ends right after. Not included in the measurement are mounting the file system and opening the file. We call this the FAT-test in this document. After the timer is stopped we hash the read data with SHA512 and compare the hash to one computed by reliable standard tools directly on the host operating system, to make sure that the data is correct. The bug mentioned in the section on the mass storage implementation (4.2) forced us to use a FAT file system with clusters of four blocks instead of the standard eight, so that a cluster can still be handled in a single request. The same file system with the same data was used on both test systems.
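The FAT-test can be pictured roughly as the following sketch. The vfs_open()/vfs_read() signatures, the timer and the hash helper are placeholders rather than the exact Barrelfish interfaces; only the measurement boundaries follow the setup described above.

/* Sketch of the FAT-test: time only the read through the virtual file system,
 * then verify the data. All external functions are placeholders. */
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

extern int      vfs_open(const char *path, void **handle);
extern int      vfs_read(void *handle, void *buf, size_t bytes, size_t *read);
extern uint64_t timer_now(void);                    /* placeholder timer    */
extern void     sha512(const void *buf, size_t len, uint8_t out[64]);

void fat_test(const char *path, size_t file_size, const uint8_t expected[64])
{
    void *h;
    vfs_open(path, &h);              /* mounting and opening are not timed  */
    uint8_t *buf = malloc(file_size);

    uint64_t start = timer_now();    /* timer starts right before the read  */
    size_t done = 0, n = 0;
    while (done < file_size &&
           vfs_read(h, buf + done, file_size - done, &n) == 0 && n > 0) {
        done += n;
    }
    uint64_t stop = timer_now();     /* timer stops right after the read    */

    uint8_t hash[64];
    sha512(buf, done, hash);         /* compared against the host-side hash */
    /* report (stop - start) and whether hash matches expected ...          */
    (void) expected; (void) stop; (void) start;
    free(buf);
}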

5.2.3 Isolating FAT

In order to isolate the two FAT-related issues mentioned in section 5.1, we also measure the speed of data transfers directly at the endpoint of the queue connecting the FAT library with the USB mass storage service. Since the goal is to measure the impact of the FAT library, we cannot use it at all during that measurement, meaning we have to rely on raw block reads. For that we define non-overlapping ranges of blocks on the flash device. The block ranges are also sized between 1 KB and 1 GB, rounded up to the next multiple of 512 bytes. These ranges were also hashed by the host operating system to get reliable reference values for comparison.

We start a timer, send requests to read blocks in clusters of four over the queue, dequeue the results in a loop, and then stop the timer. The data is left in the buffers of the queue. After the timer is stopped we hash the read data and compare it to the precomputed hashes to ensure correctness. This test is named queue-test in this thesis. Transfers of four blocks were chosen so that we run the same number of transfers for the raw data as the FAT library would, since it operates on the four-block clusters of the file system. In order to have a direct comparison, the same queue is used and no changes are made in the service to accommodate this testing method. The difference measured lies only in traversing the FAT, the cache issue and the memory allocation used in the FAT library.
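The timed loop of the queue-test can be sketched as follows. The queue primitives and the timer are again hypothetical stand-ins; the key points are that requests go out in four-block clusters and that the data stays in the queue buffers during the timed run.

/* Sketch of the queue-test at the FAT-side endpoint of the queue. The
 * queue_enqueue()/queue_dequeue() calls and timer_now() are illustrative,
 * not the real Device Queue interface. */
#include <stddef.h>
#include <stdint.h>

struct msd_command {
    uint32_t code;              /* 1 = read (illustrative value) */
    uint64_t start_block;
    uint32_t length;
    uint32_t processed_length;
};

extern void     queue_enqueue(void *queue, void *buf);
extern void    *queue_dequeue(void *queue);
extern uint64_t timer_now(void);

/* Read num_clusters clusters of four blocks each, starting at first_block.
 * cmd_bufs[] and data_bufs[] are 4 KiB buffers already set up on the queue
 * and are reused round-robin, like a ring buffer. */
uint64_t queue_test(void *queue, void **cmd_bufs, void **data_bufs,
                    size_t num_bufs, uint64_t first_block, size_t num_clusters)
{
    uint64_t start = timer_now();
    for (size_t i = 0; i < num_clusters; i++) {
        struct msd_command *cmd = cmd_bufs[i % num_bufs];
        cmd->code             = 1;                  /* read                   */
        cmd->start_block      = first_block + 4 * i;
        cmd->length           = 4;                  /* four 512-byte blocks   */
        cmd->processed_length = 0;

        queue_enqueue(queue, cmd);
        queue_enqueue(queue, data_bufs[i % num_bufs]);

        (void) queue_dequeue(queue);                /* command buffer returns */
        (void) queue_dequeue(queue);                /* data stays in place    */
    }
    return timer_now() - start;                     /* hashing happens later  */
}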

5.2.4 Isolating the Device Queue

To determine the overhead of the Device Queue separately, we use a third point of measurement at the other endpoint of the queue, within the mass storage driver. More specifically, we put it in the request handler that usually responds to read and write requests on the queue. The test is invoked by a special command code sent over the queue.

The measurement, which we call driver-test, runs very much like the queue-test. We start a timer and invoke the SCSI stub with requests for blocks, just like the driver usually does to handle read requests coming from the queue. The difference is that we run the requests in a loop without enqueuing the resulting data on the Device Queue. Once the loop is finished and we have all data in memory, the timer is stopped. The data is then hashed and checked for correctness, and finally the measurement results are sent over the queue for final handling by our testing program.

We run this test in two configurations. First it is executed with requests of four blocks at a time, to get the direct equivalent of the queue-test data. That is to say, the exact same calls are made into the SCSI stub, so only the queue interaction can account for differences between these two. The second configuration operates with seven-block requests, because this is the largest request size we can use until the bug described in section 4.2 is fixed. Compared to the first, this configuration provides us with a good estimate of the bug's approximate impact on performance, if we assume that seven- and eight-block transfers should be relatively close in performance.
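The driver-test differs from the queue-test only in that it calls the SCSI stub directly and skips the queue. A sketch with a configurable request size (four or seven blocks) is given below; scsi_read_blocks() and timer_now() are placeholders for the actual SCSI stub entry point and timer.

/* Sketch of the driver-test inside the mass storage driver: issue the same
 * block reads the queue handler would, without enqueuing the results. */
#include <stddef.h>
#include <stdint.h>

extern void     scsi_read_blocks(uint64_t start_block, uint32_t num_blocks,
                                 void *dst);        /* placeholder stub call */
extern uint64_t timer_now(void);

/* blocks_per_request is 4 for the queue-test equivalent and 7 for the largest
 * request size usable until the transfer-size bug is fixed. */
uint64_t driver_test(uint64_t first_block, size_t total_blocks,
                     uint32_t blocks_per_request, uint8_t *dst)
{
    uint64_t start = timer_now();
    for (size_t done = 0; done < total_blocks; done += blocks_per_request) {
        uint32_t n = blocks_per_request;
        if (total_blocks - done < n) {
            n = (uint32_t)(total_blocks - done);    /* last, shorter request */
        }
        scsi_read_blocks(first_block + done, n, dst + done * 512);
    }
    uint64_t elapsed = timer_now() - start;
    /* the data in dst is hashed and the timing result is sent over the queue */
    return elapsed;
}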

5.2.5 Generating a baseline

The final test, called Linux-test, works by reading blocks with the "dd" program and relying on its timer output. It reads the same ranges of blocks and pipes the result into sha512sum; we do not suspect dd of producing incorrect data, the hash is computed only for completeness. This test too is run in two configurations, one with four blocks per transfer and one with seven.

sizes=(2 20 196 1954 19532 195313 1953125)
starts=(30000 30002 30022 30218 32172 51704 247017)
cluster=4
path=/dev/sdc

[...]
for i in ${!sizes[*]}
do
    dd if=$path count=$((512*${sizes[i]})) skip=$((512*${starts[i]})) bs=$((512*$cluster)) iflag=skip_bytes,direct,count_bytes | sha512sum >> sums.txt
done

Figure 5.2: Excerpt of the bash script used for gathering performance data on Linux. Not shown is some supporting code for automation of tests.

The timing results are printed to standard error by dd, where they are collected from outside of this script. Also note the use of the "direct" flag in figure 5.2, which circumvents the Linux kernel block cache. This is important because we need values for the real input and output performance of the USB mass storage service, not of the block cache, which is naturally much faster.

5.3 Test failures

Not all data points could be acquired in the way the test setup called for, or at all.

First of all, the 1 GB file could not be successfully read in the FAT-test. Invariably, the FAT library would abort with an error; sometimes the cache was full and at other times it would abort with a missing cache entry error. Increasing the cache size did not work around the problem.

In the queue-test and the driver-test we used the memory allocated for the queue to store the read data. However, it was not possible to allocate more than 2^14 page-sized buffers with the frame allocation method. Rather than copying the data during the timed run or starting to work with multiple memory regions within a queue, we decided to forgo hashing of the 100 MB and 1 GB reads and to use the existing buffers as a ring buffer.

5.4 Results

Stability was good. Apart from the aforementioned 1 GB read through FAT, no crashes were encountered during any of the other 135 tests. Data correctness was also confirmed for all cases where the hash could be calculated, that is to say for the 30 runs through FAT on files between 1 KB and 100 MB in size, as well as the 75 runs on blocks from the two other measurement points, with transfers between 1 KB and 10 MB in size.

The data on transfer speed is presented in the chart of throughput versus transfer size in figure 5.3. Each bar represents the average result of 5 runs of the respective test configuration and the error bars represent the standard deviations within them.

For the test running through FAT we see a speed of 0.5 MB/s on the 1 KB file. With larger file sizes this increases up to a peak of 2 MB/s for the 1 MB file, then performance decreases for the 10 MB and especially the 100 MB file, down to about 0.75 MB/s. The data point for the 1 GB file is missing, as explained above in section 5.3. The test from the queue endpoint on the user side starts out with a similarly slow start, then also shows performance of 2 MB/s and, in contrast to the first data series, is able to sustain that speed up to the 1 GB transfer.

Data from the measuring point on the service side of the queue resembles the previous series closely, as long as we operate with 2 KiB per USB transfer. Apart from small files it also achieves 2 MB/s and sustains it as we move to large files. The 3.5 KiB per USB transfer measurements at the same measurement point show a slow start with the 1 KB transfer at 1 MB/s, reach 2.5 MB/s for the 10 KB transfer, and for the larger ones performance is stable at 3.5 MB/s.

Measurements on the Linux guest with 2 KiB transfers in dd start at about 0.5 MB/s, but with a high standard deviation for the 1 KB transfer.

Figure 5.3: Comparison of transfer speeds measured in three locations in Barrelfish and with dd in Linux. The values in brackets represent the cluster size used.

They improve with size up to a peak of about 1.7 MB/s at the 100 KB transfer, then performance decays to between 0.75 MB/s and 1.25 MB/s for larger transfers.

Data on Linux for the 3.5 KiB reads exhibits a large standard deviation of up to 0.75 MB/s. Performance follows the same path as with the smaller cluster size, only with a higher peak of around 2.5 MB/s and sustained performance just shy of 1.5 MB/s.

Finally, there is a result that is not quantified in the measured data but must still be recorded. It concerns the implementation of the event-based driver startup routines. Working with Kaluga, Octopus and the new driver model, supported by the Driverkit resources, was a positive experience. Having this supporting infrastructure in place made it easy to register new drivers to be brought up by Kaluga automatically when an associated device appears. Reading and writing device records from and to Octopus, and registering triggers within it, are simple operations with a clear interface. The Driverkit templates help one grasp the new driver model more quickly than any description could. These systems work well together, are reliable and easy to use.

5.5 Interpretation

The most direct performance comparison between the testing systems we can make from the data is comparing performance as measured from the FAT side of the device queue in Barrelfish, without involvement of the FAT library, with the performance of Linux block access, both working on 2 KiB blocks. This way, the same number of USB requests need to be dispatched in both systems. For transfer sizes from 1 KB to 100 KB we see similar performance, with roughly a 20 percent advantage for Barrelfish, but then Linux performance drops dramatically until Barrelfish has a lead of a factor of about 2.

The picture is similar when comparing the 3.5 KiB block transfers in Linux with the 3.5 KiB block transfers measured directly inside the USB mass storage driver in Barrelfish. The comparison is marred by the large standard deviation the Linux measurements exhibit, but the general trend is the same: Barrelfish starts with an advantage and it grows more prominent once Linux performance drops. It is unclear why that drop happens in Linux. We did not have the opportunity to explore the issue in much depth. We made sure there was no unexpected system load in the guest or host operating systems, nor any concurrent accesses on the drive; in fact it was not even mounted. The USB mass storage service in Barrelfish is competitive with Linux, at least in the emulated QEMU environment.

The overhead introduced by the FAT library amounts to at most about 10 percent when it is working correctly. Only when the cache errors start appearing, on the 100 MB file, does performance drop off significantly, until the test finally crashes on the 1 GB file. We think the failed cache evictions may result in fewer and fewer cache slots being available until the cache appears to be full and the program aborts. A second possible cause lies with the memory allocation on the fast path. If memory allocation slows down as the amount of allocated memory grows, this may explain part of the slowdown, but probably not the crash.

We compare the two data series collected within the driver to observe the effect of the different transfer sizes on block transfer performance. This gives us an idea of the improvement we can expect from correcting the bug that currently prevents us from using the full buffer in each USB transfer. We see a performance improvement between 40 and 60 percent when moving from four to seven blocks per request. Therefore, we can reasonably expect that once both bugs are fixed, peak performance through FAT improves by about 50 percent, to about 3 MB/s in this emulated environment, on files that are 100 KB or larger.

The data also demonstrates that the Device Queue works as expected: reading the data buffers from another domain over the queue adds no overhead outside the margin of error compared to reading them right before they enter the queue. This is independent of the transfer size scenario.

Chapter 6

Conclusion and Future Work

6.1 Conclusion

First of all, to state the obvious, the new mass storage service works. We can now read from and write to a FAT file system on a USB flash drive while using the Device Queue to talk to the service. To go a bit further, it works reasonably well: the transfer speeds in the emulated environment are comparable to Linux's, and stability is good. Only the FAT library needs more work before it is reliable enough to be used as a dependable tool during Barrelfish development.

Since the emulated Linux system does not come close to achieving the speeds the USB flash drive is capable of, we are not worried that Barrelfish does not reach them either. Instead we attribute this to QEMU, which seems to introduce more overhead than we expected for USB devices that are passed through to guest systems.

The remaining bug in the USB mass storage service that prevents full utilization of transfer bandwidth is troubling, but it seems to be a minor issue and not a grave architectural flaw. The prospective performance after the bug is found and resolved is promising.

The implementation of the new feature is a success overall, but it is not all we set out to do. One important goal of the thesis was to demonstrate the functionality of other Barrelfish features. The Device Queues are more of a focal point, but the interplay of Kaluga, Octopus and the new driver model is of interest as well.

The event-based initialization procedures have proven to be a good fit for the USB subsystem. It is convenient to simply write a device record to Octopus and let it inform Kaluga of the change independently. The USB hub does not need to worry about any startup procedures anymore. This works robustly and quickly, even when hot-plugging a mass storage device long after boot. It also makes certain the device

manager is actually aware of all the devices and their drivers. This may prove important if central power management of devices is introduced.

Finally, the Device Queues are very useful for implementing zero-copy data transfers and add no noticeable overhead in their usage with the USB mass storage driver. It is unfortunate that their usage within a domain could not be assessed, but there is no reason to expect any problems in that space either. They are efficient and simple to work with. In some temporary code we also had a chance to observe that one can easily connect two queues to pass on data, which is quite a useful feature.

6.2 Future Work

Whenever a project ends, one wishes there had been more time to improve various aspects. We will next have a look at further work that presents itself after what has been done.

The most pressing issues in making the USB mass storage service reliable and faster are two bugs that need to be found and addressed. The one resulting in the FAT library throwing cache-related warnings, and sometimes crashing, is probably the more dire, because it is so unpredictable and impedes basic functionality. The second bug seems to be related to USB transfers. Being able to make requests for full pages would allow us to use a FAT file system in its default configuration with 8-block clusters and increase speeds by around 50 percent, as we have seen in the discussion of results.

To make the new USB mass storage capabilities more useful in general system usage, without a priori knowledge of the USB mass storage devices connected to the system, we need to figure out the naming of resources. There are three types of names needed. Most importantly, a USB device needs to be named such that the user can choose which device to access should there be multiple available. If needed, this should include considerations about multiple LUNs (section 4.2) that might be available on a single mass storage device. We also need names for driver instances and finally names for the Device Queues. Those are three distinct namespaces, but since the objects they represent map one-to-one, it may be possible to use the same identifier for all three.

As it stands, there is also a rather wide discrepancy between what the USB BOT standard calls for and what is implemented in terms of error checking and failure recovery. The portion of SCSI commands implemented is also limited to the most commonly used ones. Filling out the missing parts would aid reliability and might add desirable functionality.

And finally, a makeover of the virtual file system may be worth considering, such that it can operate on buffers owned by the user. This would enable a true zero-copy data path from the user code down to the USB host controller, up to the point where data is written to the bus. This step may be especially beneficial if other block device drivers are in the future also expected to use Device Queues, or if anyone is interested in modifying the ATA driver to support them.

Bibliography

[1] ACHERMANN, R. Barrelfish USB Subsystem. Bachelor's thesis, ETH Zurich, August 2013. http://www.barrelfish.org/publications/ba-acreto-usbsubsystem.pdf.

[2] BARRELFISH. Bulk Transfer. Barrelfish Technical Note 014, Systems Group, ETH Zurich, 2011.

[3] BARRELFISH. Inter-dispatcher communication in Barrelfish. Barrelfish Technical Note 011, Systems Group, ETH Zurich, 2011.

[4] BARRELFISH. Barrelfish Architecture Overview. Barrelfish Technical Note 000, Systems Group, ETH Zurich, 2013.

[5] BARRELFISH. Device Drivers in Barrelfish. Barrelfish Technical Note 019, Systems Group, ETH Zurich, May 2017.

[6] BARRELFISH. Device Queues in Barrelfish. Barrelfish Technical Note 026, Systems Group, ETH Zurich, 2017.

[7] BARRELFISH. Skate in Barrelfish. Barrelfish Technical Note 020, Systems Group, ETH Zurich, 2017.

[8] BAUMANN, A., BARHAM, P., DAGAND, P.-E., HARRIS, T., ISAACS, R., PETER, S., ROSCOE, T., SCHÜPBACH, A., AND SINGHANIA, A. The multikernel: A new OS architecture for scalable multicore systems. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (2009), SOSP '09, pp. 29–44.

[9] COMPAQ, HEWLETT-PACKARD, INTEL, LUCENT, MICROSOFT, NEC, AND PHILIPS. Universal Serial Bus Specification. http://www.usb.org/developers/docs/usb20_docs/usb_20_020718.zip, accessed 2018-05-03.

[10] STEVENS, C. E. Universal Serial Bus Mass Storage Class Specification Overview. USB Implementers Forum. http://www.usb.org/developers/docs/devclass_docs/Mass_Storage_Specification_Overview_v1.4_2-19-2010.pdf, accessed 2018-04-30.

[11] STEVENS, C. E. Universal Serial Bus Mass Storage Class USB Attached SCSI Protocol (UASP). USB Implementers Forum. http://www.usb.org/developers/docs/devclass_docs/uasp_1_0.zip, accessed 2018-05-03.

[12] HÄRTIG, H., HOHMUTH, M., LIEDTKE, J., WOLTER, J., AND SCHÖNBERG, S. The performance of µ-kernel-based systems. In ACM SIGOPS Operating Systems Review (1997), vol. 31, ACM, pp. 66–77.

[13] INTEL. Enhanced Host Controller Interface Specification for Universal Serial Bus. https://www.intel.com/content/dam/www/public/us/en/documents/technical-specifications/ehci-specification-for-usb.pdf, accessed 2018-05-04.

[14] MICROSOFT CORPORATION. Microsoft Extensible Firmware Initiative FAT32 File System Specification. https://download.microsoft.com/download/1/6/1/161ba512-40e2-4cc9-843a-923143f3456c/fatgen103.doc, accessed 2018-05-03.

[15] SEAGATE TECHNOLOGY LLC. SCSI Commands Reference Manual. https://www.seagate.com/staticfiles/support/disc/manuals/scsi/100293068a.pdf, accessed 2018-05-03.

[16] THE BARRELFISH PROJECT. Barrelfish Operating System. http://www.barrelfish.org, accessed 2018-05-03.

[17] THE ECLIPSE PROJECT. The ECLiPSe constraint programming system. http://eclipseclp.org/, accessed 2018-05-04.

[18] USB IMPLEMENTERS FORUM. Universal Serial Bus Mass Storage Class Bulk-Only Transport. http://www.usb.org/developers/docs/devclass_docs/usbmassbulk_10.pdf, accessed 2018-05-03.

[19] USB IMPLEMENTERS FORUM. Universal Serial Bus Mass Storage Class Control/Bulk/Interrupt (CBI) Transport. http://www.usb.org/developers/docs/devclass_docs/usb_msc_cbi_1.1.pdf, accessed 2018-05-03.

[20] ZELLWEGER, G., SCHÜPBACH, A., AND ROSCOE, T. Unifying synchronization and events in a multicore OS. In Proceedings of the Asia-Pacific Workshop on Systems (2012), APSYS '12, pp. 16:1–16:6.