University “Politehnica” of Bucharest

Automatic Control and Computers Faculty, Computer Science and Engineering Department

MASTER THESIS

Guest OS Backward Compatibility for the FreeBSD Hypervisor

Scientific Adviser: As.dr.ing. Mihai Carabas

Author: Ionut-Alexandru Teaca

Bucharest, 2017

Contents

1 Introduction
    1.1 Motivation
    1.2 Objectives
    1.3 Outline

2 Background
    2.1 Virtualization
    2.2 Hardware devices
    2.3 Overview of the bhyve hypervisor

3 Related Work

4 The ATA/ATAPI Emulation
    4.1 Overview
    4.2 Hardware Resources
        4.2.1 I/O ports description
        4.2.2 Interrupts
    4.3 Data Transfer Protocols
        4.3.1 PIO command protocol
        4.3.2 DMA command protocol
    4.4 Command descriptions
        4.4.1 ATA commands
        4.4.2 ATA-ATAPI commands
        4.4.3 ATAPI commands
    4.5 Implementation Details
        4.5.1 Initialization
        4.5.2 reset protocol
        4.5.3 Device addressing considerations
        4.5.4 Block device emulation

5 The NE2000 Emulation
    5.1 Overview
    5.2 Hardware Resources
        5.2.1 I/O ports description
        5.2.2 Interrupts
    5.3 Packet Transfer Protocols
        5.3.1 Packet Ring Buffers
        5.3.2 Packet Transfer Emulation
    5.4 Implementation Details
        5.4.1 Initialization
        5.4.2 Software Reset command
        5.4.3 Multithreading environment

6 Device Emulation Evaluation
    6.1 ATA/ATAPI
        6.1.1 Configuration
        6.1.2 Results of ATA/ATAPI emulation
        6.1.3 Validation and Performance
    6.2 NE2000
        6.2.1 Configuration
        6.2.2 Results of NE2000 emulation
        6.2.3 Validation and Performance

7 Conclusion and Further Work

List of Figures

2.1 Virtual Machine Monitor
2.2 Hosted Virtualization
2.3 bhyve structure
2.4 VM state machine
2.5 bhyve device emulation
4.1 ATA/ATAPI overview
5.1 NE2000 overview

List of Tables

4.1 ATA Bus Master Register Offsets
4.2 PCI Compatibility and PCI-Native Mode Bus Master Adapters Configuration Registers
4.3 Physical Region Descriptor Table Entry
5.1 NE2000 NIC Register Offsets
5.2 NE2000 Receive Ring Header
6.1 Hardware configuration

Chapter 1

Introduction

The number of applications that rely on software virtualization has grown steadily in recent years, and virtualization has become fundamental to many technologies (for example, cloud computing). Nowadays the main issue is supporting many different guest operating systems. Many applications run on legacy operating systems (FreeBSD 4, Windows XP), and nobody wants to change their setup or upgrade the operating system. However, these applications need to migrate toward virtualized hosts. Hence, the solution to this problem is to enhance the hypervisor so that it supports such operating systems. This subject concerns all hypervisors, but we focus on the FreeBSD Hypervisor (bhyve).

This chapter starts with the motivation of the project, which explains the reasoning for developing more device emulations in the bhyve hypervisor. It continues with clear objectives regarding the features we implemented and ends with an outline of the structure of the thesis.

1.1 Motivation

We started by analyzing the key reasons why some older operating systems are not supported in bhyve and which critical parts have to be implemented to improve that. We did this by comparing FreeBSD 4.0 with the FreeBSD 8.0 release, which is the oldest version of FreeBSD i386/amd64 supported as a guest, and we noticed that the main differences lie in the supported hardware devices.

At first sight, we observed compatibility issues with media storage devices such as hard disks, floppy drives, and optical disc drives. The FreeBSD Hypervisor (bhyve) provides only the emulation of the Advanced Host Controller Interface (AHCI) used by Serial ATA devices, but FreeBSD 4 has no drivers for it. The class of media storage devices is critical for any hypervisor because it makes it possible to install and boot an operating system. We propose to implement the ATA (AT Attachment) Host Adapter Standard, which is supported in the FreeBSD 4.0 release and in many other operating systems.

Another category of devices for which bhyve does not provide good support for older operating systems is the network card. The bhyve hypervisor uses a virtio net device emulation, but these operating systems do not have virtio device drivers and cannot use this emulation. This class of devices is necessary for any hypervisor because almost all applications require network access. One of the best supported devices, especially in older operating systems, is the NE2000 device. We propose to implement the NE2000 device emulation, which will allow a larger number of unmodified guest operating systems to run under bhyve.


Currently bhyve supports any version of FreeBSD i386/amd64 starting with the FreeBSD 8.0 release. The general project "Guest OS Backward Compatibility for the FreeBSD Hypervisor" aims to support guest operating systems with older versions, such as FreeBSD 4/5. To help with that, we have implemented the ATA/ATAPI-6 emulation and the NE2000 device emulation. In this work we present two different types of emulation: a generic ATA/ATAPI drive controller which runs attached to both the LPC and the PCI (through a Host PCI Adapter) buses, and the NE2000 device under both PCI and LPC attachments.

1.2 Objectives

Based on the motivation presented in the previous section, we state the objectives of this project more precisely. The main objectives are to implement two device emulations in the FreeBSD hypervisor in order to provide better compatibility with older operating systems.

The first objective is to emulate an ATA disk and an ATAPI CD-ROM in order to boot a virtual machine and install it to the emulated disk. To accomplish this objective there are several requirements to implement:
• emulate the I/O port accesses according to the ATA/ATAPI datasheet specification;
• implement the ATA-6 standard and the ATA Packet commands (the ATAPI Packet is used to communicate with the ATAPI CD-ROM device);
• implement the PIO4 and WDMA2 data transfer protocols working at transfer rates of more than 16.7 MB/s;
• work with both primary and secondary channels, where each of them supports master and slave drives at the same time;
• configure and run the ATA/ATAPI emulation under both PCI and LPC attachments.

The second objective is to emulate an NE2000 network card in order to have Internet connectivity in the guest virtual machine. To accomplish this objective, there are several requirements to implement:
• emulate the I/O port accesses according to the NE2000 datasheet specification;
• implement the PIO data transfer protocol;
• implement the management of the Packet Ring Buffers used in packet transfer;
• find a solution to transfer the frames between the NE2000 guest network driver and the host networking stack;
• configure and run the NE2000 emulation under both PCI and LPC attachments.

Besides these two objectives, which concern the development of the device emulations, we also intend to set up a testing process that validates the correctness and evaluates the performance of the ATA/ATAPI and NE2000 implementations.

1.3 Outline

The structure of this thesis is as follows. In Chapter 2 we give a short introduction to the background information that is necessary for understanding our work in the device emulation domain. We present some general concepts about virtualization, the hardware specifications and protocols related to the ATA/ATAPI and NE2000 devices, and an overview of the bhyve hypervisor.

Chapter 3 presents the current situation regarding the ATA/ATAPI and NE2000 emulations in the bhyve hypervisor and in other emulators and hypervisors.

In Chapter 4 we focus on the design and implementation of the ATA/ATAPI emulation. We start with an overview of the ATA/ATAPI emulation and present the hardware resources that we have emulated, such as the IO ports and IRQ lines. After that, we describe the design of the data transfer protocols (PIO and DMA), continue with the ATA and ATAPI commands implemented in our emulation, and finish with some implementation details.

In Chapter 5 we focus on the design and implementation of another type of emulation, the NE2000 network card interface. We start with an overview of the NE2000 emulation and present the hardware resources that we have emulated, such as the IO ports, IRQ lines and Packet Ring Buffers. After that, we describe the design of the data transfer protocols such as the PIO and DMA protocols, and we finish the chapter with some implementation details such as the Packet Reception and Packet Transmission flows, the Software Reset command, and how we synchronized the implementation to run in two different execution contexts.

Chapter 6 presents the results and the features developed in this project and the testing of both the ATA/ATAPI and NE2000 emulations. For each emulation we describe the configuration of the bhyve hypervisor needed to enable these features and the results that prove the emulations are functional. In both sections of the chapter, we also describe the validation and performance evaluation process that we applied to the ATA/ATAPI and NE2000 emulations in order to demonstrate the correctness and efficiency of the implementation.

Finally, Chapter 7 concludes this thesis with a summary of the main features that we implemented in the bhyve hypervisor, discusses the objectives that we accomplished in this project, and indicates some future development regarding the ATA/ATAPI and NE2000 device emulations.

Chapter 2

Background

This chapter presents the background information that is necessary for understanding our work in the device emulation domain. Because we develop modules in the bhyve hypervisor, which is a Virtual Machine Monitor, we start this chapter by presenting some general concepts about virtualization. Then we present general information about technical standards such as PATA, SATA, AHCI, NE2000 and PCI that are related to the ATA/ATAPI and NE2000 emulations, because these device emulations require a very good understanding of the hardware specifications and protocols. At the end of the chapter we introduce the structure of bhyve and the tools necessary for running a virtual machine, and we present an overview of device emulation in the bhyve hypervisor.

2.1 Virtualization

The goal of virtualization is to isolate several operating systems belonging to different users on the same machine. One of the cutting-edge technologies that uses virtualization to share one physical machine among many users is cloud computing. Clients rent bare-metal machines in the cloud and pay per hour of usage. The main advantage is that they can scale the infrastructure depending on the load.

In order to run in a virtualized environment, the guest operating system needs a Virtual Machine Monitor (VMM), which controls the virtualized resources such as CPU, memory, disk and I/O devices. One of the requirements the VMM must meet is that a program running inside a virtual machine behaves the same as a program running on the physical machine. A general architecture that uses a virtualized environment is presented in Figure 2.1, where the VMM runs directly on the hardware.

A different approach is hosted virtualization, where the VMM is implemented as an application inside the host operating system, as shown in Figure 2.2. The FreeBSD Hypervisor (bhyve) implements a hosted virtualization type where the VMM core is implemented as a kernel module inside the FreeBSD operating system. More details about the structure of the bhyve VMM are presented in Section 2.3.

There are two types of virtualization techniques: full virtualization and paravirtualization. The difference between them is that the former supports running guest operating systems with no changes while the latter does not. Full virtualization emulates the hardware completely and allows the guest operating systems to run without any changes. Some VMMs that provide full virtualization are VMware, VirtualBox, KVM (the Kernel-based Virtual Machine) and bhyve.


Figure 2.1: Virtual Machine Monitor

Because full virtualization may sometimes affect performance, paravirtualization provides a slightly different interface to the guest operating systems. The guest operating systems must therefore be changed in order to reduce the overhead and increase the performance of the virtual machines. Still, the applications in the guest operating systems run unchanged. The most popular VMM that implements paravirtualization is the Xen hypervisor. We do not present details about this hypervisor because it is not the goal of our work.

2.2 Hardware devices

Our main topic is device emulation, which involves emulating hardware devices in software. This kind of work requires a very good understanding of the hardware specifications and protocols, so we present some general information about the technical standards such as PATA, SATA, AHCI, NE2000 and PCI that are related to the ATA/ATAPI and NE2000 emulations. In Section 2.3 we present an overview of device emulation in the bhyve hypervisor.

ATA (AT Attachment) defines the physical, electrical, transport, and command protocols for the internal attachment of storage devices to host systems [11]. It comes in two interface standards, Parallel ATA (PATA) and Serial ATA (SATA). Parallel ATA is the legacy AT Attachment, an interface standard for the connection of storage devices such as hard disks, floppy drives, and optical disc drives in computers [16]. Serial ATA takes the place of the legacy AT Attachment standard and offers many advantages over the older interface: reduced cable size and cost (seven conductors instead of 40 or 80), native hot swapping, faster data transfer through higher signalling rates, and more efficient transfer through an (optional) I/O queuing protocol [17].

The Advanced Host Controller Interface (AHCI) is a host adapter for the Serial ATA disk drive controller. Its specification defines the functional behavior and software interface of the Advanced Host Controller Interface, a hardware device that acts as the communication interface between the software and Serial ATA devices. AHCI is a PCI class device that moves data between system host memory and Serial ATA devices [7].

Figure 2.2: Hosted Virtualization

For the Parallel ATA protocol there is a host adapter controller interface too. The ATA/ATAPI Host Adapters Standard specifies the AT Attachment Interface between host systems and storage devices using the Direct Memory Access protocol. The AT Attachment Interface is intended for any host system that has a PCI bus and storage devices connected to the processor [8].

The NE2000 represents a class of low-cost Ethernet network cards initially produced by Novell in 1987. What we have actually implemented is the emulation of the National Semiconductor DS8390/WD83C690 Ethernet adapter, which is used by the NE1000, the NE2000 and a variety of similar clones [18].

2.3 Overview of the bhyve hypervisor

bhyve stands for the BSD hypervisor, and is a hypervisor introduced in the FreeBSD operating system. It runs virtual machines with near-native performance and relies on modern CPU features such as Intel VT-x and Extended Page Tables.

We start by introducing the structure of bhyve and the tools necessary for running a virtual machine. The FreeBSD Hypervisor (bhyve) implements hosted virtualization: the bhyve core runs in the vmm.ko loadable kernel module, while utilities such as the libvmmapi library and the bhyve, bhyveload and bhyvectl tools are needed for the management of a virtual machine, as shown in Figure 2.3. Before using any of these utilities, the vmm.ko module must be loaded.

Figure 2.3: bhyve structure

We use the bhyveload tool to load and prepare the FreeBSD kernel image from the disk image:

    /usr/sbin/bhyveload -m 256 -d ./vm0.img vm0

This starts the FreeBSD loader and prepares a new virtual machine instance, so a new device /dev/vmm/vm0 is created. In order to boot the VM with 2 vCPUs, 256 MB of RAM and access to the network through the tap0 interface, we use the bhyve tool:

    /usr/sbin/bhyve -c 2 -m 256 -A -H -P -s 0:0,hostbridge -s 1:0,virtio-net,tap0 -s 2:0,ahci-hd,./vm0.img -s 31,lpc -l com1,stdio vm0

After the VM has been shut down, its resources can be released with:

    /usr/sbin/bhyvectl --destroy --vm=vm0

Next we present the states of a virtual machine running in the bhyve hypervisor, described in Figure 2.4, because they help to understand how device emulation works.

Figure 2.4: VM state machine

For the management of the guest virtual machine, the bhyve hypervisor uses the hardware virtual machine extensions of the CPU, which are specific to the hardware architecture (Intel or AMD). In the following presentation we chose the Intel virtual-machine extensions (VMX) that are used in the bhyve hypervisor. This virtualization support defines two modes of VMX operation: VMX root operation (the hypervisor mode, where the VMM runs) and VMX non-root operation, where the guest software runs.

We assume that the guest virtual machine has been loaded in memory and is prepared to start. The whole execution of the virtual machine runs in the vm_loop routine of the bhyve application, where it performs two main transitions: the VM entry, which enters VMX non-root operation, and the VM exit, which returns from VMX non-root operation to VMX root operation. Since the vm_loop routine runs in the userspace application and the VMX instructions must be executed in VMX root mode, bhyve transfers control to the VMM by using the ioctl system call.

In order to enter guest mode for the first time, the bhyve VMM uses the vmlaunch instruction, which starts the execution of the virtual machine. The guest continues to run until the first VM exit, when the VMM regains control. Some reasons why the VM may exit are: a page fault or an IO instruction in the guest software, an NMI, an external interrupt, a task switch or a debug exception. Out of all these reasons, we are most concerned with the INOUT reason because it is related to device emulation.

Running an IO instruction in the guest software is treated as a fault by the CPU, and control is transferred to the VMM in order to handle the exception. The bhyve VMM analyses the exit reason and tries to emulate the instruction in kernel space. If the VM exit is not handled there, the VMM returns to userspace, passing the VM exit information in the ioctl argument. The VM exit information contains the description of the IO instruction, such as the port offset, the direction (in / out), the length in bytes and the address of the memory operand, and is read by the VMM from the virtual-machine control data structures (VMCSs) using the vmread and vmwrite instructions. In userspace, bhyve identifies the device responsible for emulating the instruction based on the port offset and calls the handler registered for that port. Once the emulation is completed, control is transferred to the VMM in order to resume the guest software execution using the same procedure. This time, the VMM enters guest mode using the vmresume instruction instead of vmlaunch, because the virtual machine is already launched.

In the remaining part of this section we present an overview of device emulation in the bhyve hypervisor. Most of the devices are emulated in userspace, in usr.sbin/bhyve. The others, such as the PICs and the timers, are implemented in the kernel, in vmm/io. There are two categories of devices emulated in bhyve, the LPC and the PCI devices, as shown in Figure 2.5. Among the LPC devices we mention the UART controller and the RTC (real-time clock) controller. Part of the PCI devices belong to the virtio class, which contains the block, net and rng (random entropy from /dev/random) subclasses. The AHCI controller is also a PCI device emulated in bhyve. Our implementations run either as PCI or as LPC devices, depending on configuration.
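The userspace step described in this section, where the device is identified by the port offset and its registered handler is called, can be illustrated with a minimal sketch. It is not taken from the bhyve sources: the `io_exit` structure, `register_port_handler`, `dispatch_io_exit` and the example `scratch_reg_handler` are hypothetical names chosen for illustration, and the handler table is deliberately simplistic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of an INOUT VM exit as seen by userspace. */
struct io_exit {
    uint16_t port;   /* port accessed by the guest */
    int      in;     /* 1 = IN (guest reads), 0 = OUT (guest writes) */
    int      bytes;  /* access width: 1, 2 or 4 */
    uint32_t value;  /* value written (OUT) or value to return (IN) */
};

typedef int (*port_handler_t)(struct io_exit *vmexit, void *arg);

struct port_range {
    uint16_t       base;
    uint16_t       size;
    port_handler_t handler;
    void          *arg;
};

#define MAX_RANGES 16
static struct port_range ranges[MAX_RANGES];
static size_t nranges;

/* Register a handler for ports [base, base + size). Returns 0 on success. */
int register_port_handler(uint16_t base, uint16_t size,
                          port_handler_t handler, void *arg)
{
    assert(handler != NULL && size > 0);
    if (nranges == MAX_RANGES)
        return -1;
    ranges[nranges++] = (struct port_range){ base, size, handler, arg };
    return 0;
}

/* Find the device that owns the port and let it emulate the access. */
int dispatch_io_exit(struct io_exit *vmexit)
{
    for (size_t i = 0; i < nranges; i++) {
        const struct port_range *r = &ranges[i];
        if (vmexit->port >= r->base && vmexit->port - r->base < r->size)
            return r->handler(vmexit, r->arg);
    }
    return -1; /* no emulated device claims this port */
}

/* Example device: a single byte-wide scratch register behind every port. */
int scratch_reg_handler(struct io_exit *vmexit, void *arg)
{
    uint8_t *reg = arg;
    if (vmexit->in)
        vmexit->value = *reg;
    else
        *reg = (uint8_t)vmexit->value;
    return 0;
}
```

After such a handler returns, the real hypervisor would re-enter the guest; the sketch stops at the point where the emulated value has been produced.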

Figure 2.5: bhyve device emulation

Chapter 3

Related Work

In this chapter we present the current situation regarding the ATA/ATAPI and NE2000 emulations in the bhyve hypervisor and in other emulators and hypervisors.

In the bhyve hypervisor, there is no other implementation related to the ATA/ATAPI or the NE2000 emulations. There is an ATA controller implemented in the GXemul framework, which supports full-system computer architecture emulation, but there is no way to port this software to the bhyve source tree. This is because it uses a different application interface for communicating with the rest of the system and because it is written in C++.

At the moment, the network is emulated using a virtio network driver, and Peter Grehan is working on the e1000 device emulation. The reason to prefer the NE2000 emulation is not performance, which is worse than that of both the virtio and e1000 emulations, but compatibility with legacy operating systems that have drivers for neither of them. Because of this, the NE2000 emulation is supported in almost all hypervisors, such as VMware, KVM or Xen.

The Xen and KVM hypervisors make use of the QEMU open source project, which is one of the best machine emulators in terms of supported devices. It can be used either as a machine emulator, in which case it runs operating systems made for one machine (ARM) on a machine with a different architecture (Intel x86), or as a virtualizer, in which case it runs under the Xen and KVM hypervisors. The QEMU emulator supports both the ATA/ATAPI and NE2000 emulations, but because of licensing (bhyve is distributed under the BSD license, QEMU under the GPL) and the different architectures of QEMU and bhyve, it is not possible to integrate this code. What we can do instead is study and analyse this software in order to better understand the design implemented there, or any other information regarding the ATA and NE2000 specifications.

The module that emulates the NE2000 NIC in the QEMU emulator [3] is a good source of inspiration regarding the register emulation and the transmit/receive protocols. In order to probe as a PCI device, the QEMU implementation uses the Vendor and Device IDs of the RTL8029AS controller, which is a Realtek Ethernet controller [4] compatible with the NE2000 class. We used the same controller for the PCI enumeration.

Perhaps the most important technical help when emulating hardware devices in software comes from the datasheets, because this type of work is very similar to device driver development, where the datasheet specification is absolutely necessary. For the ATA/ATAPI specification we used the AT Attachment with Packet Interface - 6 (ATA/ATAPI-6) datasheet [12]. In order to implement the commands sent to the CD-ROM drive we used the SCSI Commands Reference Manual [5]. There is also a datasheet for the 8390 NIC [6], which we used for a proper understanding of the chip registers, the transmit/receive protocols and the control operations.


Even though there is no other implementation related to the ATA/ATAPI specification in the bhyve hypervisor, there is an implementation of the Advanced Host Controller Interface standard (AHCI is another disk drive controller) that we must analyse and highlight. This implementation helped with the design of the ATA implementation and was used to evaluate the ATA performance by comparing the ATA transfer rates against the AHCI transfer rates. In order to understand the general device emulation mechanisms used in the emulation of the AHCI standard, we carefully read the AHCI code and found useful information about the enumeration and initialization of each device and about how an emulated device in bhyve allocates basic resources such as the IO ports and IRQ lines.

Chapter 4

The ATA/ATAPI Emulation

In this chapter we focus on the design and implementation of the ATA/ATAPI emulation. We start with the overview presentation of the ATA/ATAPI emulation and we present the hardware resources that we have emulated such as the IO ports and IRQ lines. After that, we describe the design of the data transfer protocols such as the PIO and DMA protocols, then we continue by emphasizing the ATA and ATAPI commands implemented in our emulation and we finish with the presentation of some details of implementation.

4.1 Overview

We start this chapter with the overview presentation of the ATA/ATAPI emulation where we emphasize the place of the module implementation inside the bhyve application.

Figure 4.1: ATA/ATAPI overview


In order to achieve this, we zoom in on Figure 2.5, which describes the general architecture of the device emulation, and we add the ATA/ATAPI device and the modules that interact with it. The result is presented in Figure 4.1, where we observe that the ATA/ATAPI module implements two interfaces in order to provide the device emulation.

These interfaces are the means of access to the ATA controller ports. There are two of them because there are two ways of access, depending on the bus to which the device is attached. When the ATA/ATAPI emulation is configured to run under the PCI bus, the IO ports are accessed through the PCI interface implemented by the pci_ata_write and pci_ata_read functions. If the ATA/ATAPI emulation is configured to run under the LPC bus, the IO ports are accessed directly through the basic INOUT interface, which is implemented by the lpc_ata_io_handler and lpc_ata_ioctl_handler functions. In the end, both interfaces reach the same ATA/ATAPI implementation after some translation of port offsets. The number of IO ports and their meaning are presented in detail in the next section.

4.2 Hardware Resources

4.2.1 I/O ports description

The IO ports are the communication interface between the ATA controller and the software drivers. The driver accesses the IO ports using write and read instructions in order to operate the controller and to transfer data between the ATA devices and the system memory. Depending on the attachment under which the controller runs, there are two ways to address the IO ports. Under the PCI attachment, the driver addresses the internal ATA registers through the PCI BAR registers, which are discovered at the PCI enumeration phase. Under the LPC attachment, the ATA IO ports can be accessed directly by the software driver using basic IO read and write instructions, because they are located on the address bus of the CPU. The physical addresses of the ATA IO ports are described in the /boot/device.hints configuration file.

The ATA controller contains a set of registers, the ATA registers. This set is controlled by the guest driver in order to communicate with the ATA controller. The emulator must provide such a register interface and emulate each IO access of the guest driver in order to maintain the internal state of the controller and to emulate the driver commands.

One category of registers is the ATA Channel Registers, which contain two blocks of registers. The Command Block registers are used to send commands to the device and to read the status from the device. These registers include the LBA Low, LBA Mid, LBA High, Device, Sector Count, Command, Status, Features, Error, and Data registers, and they are addressed through BAR0 on the primary channel and BAR2 on the secondary channel. The Control Block registers are used to control the device and to read the alternate status from the device. These registers include the Device Control and Alternate Status registers, and they are addressed through BAR1 on the primary channel and BAR3 on the secondary channel [13].
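The two addressing schemes described above can be reduced to one register identifier before the common emulation code runs. The following is a minimal sketch of such a translation, not the thesis implementation. The function and type names are hypothetical. The register offsets follow the conventional ATA task-file layout; the legacy port bases (0x1F0/0x3F6 and 0x170/0x376) and the offset 2 of the control register inside a control BAR are assumptions based on common PC and PCI IDE conventions, since the actual values come from /boot/device.hints and from the PCI enumeration.

```c
#include <assert.h>
#include <stddef.h>

/* Logical ATA registers; values 0..7 are Command Block offsets. */
enum ata_reg {
    ATA_REG_DATA             = 0,
    ATA_REG_ERROR_FEATURES   = 1, /* read: Error, write: Features */
    ATA_REG_SECTOR_COUNT     = 2,
    ATA_REG_LBA_LOW          = 3,
    ATA_REG_LBA_MID          = 4,
    ATA_REG_LBA_HIGH         = 5,
    ATA_REG_DEVICE           = 6,
    ATA_REG_STATUS_COMMAND   = 7, /* read: Status, write: Command */
    ATA_REG_ALTSTATUS_DEVCTL = 8  /* Control Block register */
};

struct ata_reg_addr {
    int          channel; /* 0 = primary, 1 = secondary */
    enum ata_reg reg;
};

/* PCI attachment: BAR0/BAR1 map the primary channel, BAR2/BAR3 the secondary. */
int ata_decode_pci(int bar, unsigned offset, struct ata_reg_addr *out)
{
    assert(out != NULL);
    if (bar < 0 || bar > 3)
        return -1;
    out->channel = bar / 2;
    if (bar % 2 == 0) {            /* Command Block BAR */
        if (offset > 7)
            return -1;
        out->reg = (enum ata_reg)offset;
    } else {                       /* Control Block BAR */
        if (offset != 2)           /* assumption: control register at offset 2 */
            return -1;
        out->reg = ATA_REG_ALTSTATUS_DEVCTL;
    }
    return 0;
}

/* LPC attachment: fixed legacy ports (assumed conventional PC values). */
int ata_decode_lpc(unsigned port, struct ata_reg_addr *out)
{
    static const unsigned cmd_base[2] = { 0x1f0, 0x170 };
    static const unsigned ctl_port[2] = { 0x3f6, 0x376 };

    assert(out != NULL);
    for (int ch = 0; ch < 2; ch++) {
        if (port >= cmd_base[ch] && port < cmd_base[ch] + 8) {
            out->channel = ch;
            out->reg = (enum ata_reg)(port - cmd_base[ch]);
            return 0;
        }
        if (port == ctl_port[ch]) {
            out->channel = ch;
            out->reg = ATA_REG_ALTSTATUS_DEVCTL;
            return 0;
        }
    }
    return -1;
}
```

With both front ends producing the same `(channel, reg)` pair, the register emulation itself does not need to know which bus the access came from.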
When running under the PCI attachment, the ATA controller provides another category of registers, the ATA Bus Master Registers, which are used to configure the DMA engine and are presented in Table 4.1. These registers are not used under the LPC attachment because there is no DMA channel. The ATA Bus Master Registers occupy 16 bytes in total and can be accessed as byte, word, or dword quantities. The base address for these registers is PCI BAR 4 [9].

Offset    Register                                      Register Access
00h       ATA Bus Master Command register Primary       R/W
01h       Device Specific
02h       ATA Bus Master Status register Primary        RWC
03h       Device Specific
04h-07h   ATA Bus Master PRD Table Address Primary      R/W
08h       ATA Bus Master Command register Secondary     R/W
09h       Device Specific
0Ah       ATA Bus Master Status register Secondary      RWC
0Bh       Device Specific
0Ch-0Fh   ATA Bus Master PRD Table Address Secondary    R/W

Table 4.1: ATA Bus Master Register Offsets
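The layout in Table 4.1 is regular: each channel owns an 8-byte window with the same internal structure. A minimal sketch of how an emulator might decode an offset relative to BAR 4 is shown below; `bm_reg`, `bm_reg_addr` and `bm_decode` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

enum bm_reg {
    BM_COMMAND,          /* offset 0 within the channel window */
    BM_DEVICE_SPECIFIC,  /* offsets 1 and 3 */
    BM_STATUS,           /* offset 2 */
    BM_PRD_TABLE_ADDR    /* offsets 4..7 */
};

struct bm_reg_addr {
    int         channel; /* 0 = primary, 1 = secondary */
    enum bm_reg reg;
};

/* Decode a byte offset (0x00..0x0F) inside the Bus Master register block. */
struct bm_reg_addr bm_decode(uint8_t offset)
{
    struct bm_reg_addr a;

    assert(offset < 16);
    a.channel = offset / 8;      /* 00h-07h primary, 08h-0Fh secondary */
    switch (offset % 8) {
    case 0:
        a.reg = BM_COMMAND;
        break;
    case 2:
        a.reg = BM_STATUS;
        break;
    case 1:
    case 3:
        a.reg = BM_DEVICE_SPECIFIC;
        break;
    default:
        a.reg = BM_PRD_TABLE_ADDR;
        break;
    }
    return a;
}
```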

In order to emulate the ATA controller under the PCI attachment we also need to implement the PCI configuration space registers presented in Table 4.2. These registers are read by the software driver at the PCI enumeration phase in order to find out the PCI BAR registers and other information such as the Device, Class and Vendor IDs.

Offset   Bits 31-24 | Bits 23-16 | Bits 15-8 | Bits 7-0
00h      PCI
04h      PCI
08h      Class Code | PCI
0Ch      PCI
10h      Base Address 0 - Base Address of Cmd-Block Regs, ATA Channel X
14h      Base Address 1 - Base Address of Control Regs, ATA Channel X
18h      Base Address 2 - Base Address of Cmd-Block Regs, ATA Channel Y
1Ch      Base Address 3 - Base Address of Control Regs, ATA Channel Y
20h      Base Address 4 - Base Address of ATA Bus Master Registers
24h      Base Address 5 - Vendor Specific
28h      PCI
2Ch      Subsystem ID | PCI
30h      PCI
34h      PCI
38h      PCI
3Ch      PCI | Interrupt Line

Table 4.2: PCI Compatibility and PCI-Native Mode Bus Master Adapters Configuration Registers

4.2.2 Interrupts

There are two main types of interrupts that a device can use to interrupt the CPU: level-triggered and edge-triggered interrupts. Describing these types of interrupts is not the goal of this work, but their relation with the ATA controller must be presented. The PCI bridge uses level-triggered interrupts while the LPC bridge uses edge-triggered interrupts. So, the ATA emulation has to raise either type of interrupt, depending on whether it is attached to the PCI bus or to the LPC bus.

The interfaces provided by the bhyve library to raise these interrupts are quite different, so we need a general mechanism to assert interrupts efficiently regardless of the attachment under which the ATA emulation runs. The solution we implemented is to register a different callback for each channel depending on the attachment in use. Each time the ATA controller needs to interrupt the host, it calls that callback. This way, we do not need to check each time whether we run under PCI or LPC, so the solution is transparent and efficient.

The design of the ATA emulator is based on an event-driven architecture, where the events are the write operations on the emulator ports, which are in fact the guest driver commands. After a command is interpreted and processed, and the state of the emulator is updated accordingly, the guest driver is notified by asserting an interrupt.

The ATA emulation supports two channels, the Primary and the Secondary channel, with a separate IRQ line for each channel. The emulator reserves these IRQ numbers in the initialization phase of the channel and asserts an IRQ at each command completion.
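The per-channel callback mechanism described in this section can be sketched as follows. This is an illustration under stated assumptions, not the thesis code: the structure and function names are hypothetical, and the two `*_assert_intr` functions merely record what happened instead of calling the hypervisor routines that would raise a level-triggered PCI interrupt or pulse an edge-triggered LPC interrupt.

```c
#include <assert.h>
#include <stddef.h>

struct ata_channel;
typedef void (*ata_intr_fn)(struct ata_channel *ch);

struct ata_channel {
    int         irq;         /* IRQ number reserved at initialization */
    int         level;       /* state of a level-triggered line (PCI) */
    int         pulses;      /* edge pulses delivered so far (LPC) */
    ata_intr_fn assert_intr; /* chosen once, according to the attachment */
};

/* PCI attachment: level-triggered, the line stays asserted until cleared. */
void ata_pci_assert_intr(struct ata_channel *ch)
{
    ch->level = 1;
}

/* LPC attachment: edge-triggered, every completion produces one pulse. */
void ata_lpc_assert_intr(struct ata_channel *ch)
{
    ch->pulses++;
}

/* Select the callback once, so the command path never tests the bus type. */
void ata_channel_init(struct ata_channel *ch, int irq, int on_pci)
{
    assert(ch != NULL);
    ch->irq = irq;
    ch->level = 0;
    ch->pulses = 0;
    ch->assert_intr = on_pci ? ata_pci_assert_intr : ata_lpc_assert_intr;
}

/* Called at the end of every command, regardless of the attachment. */
void ata_command_done(struct ata_channel *ch)
{
    ch->assert_intr(ch);
}
```

The design choice is the usual one for function pointers: the branch on the attachment type is taken once at initialization instead of on every command completion.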

4.3 Data Transfer Protocols

In this section we describe the methods used to transfer data between the host and the ATA/ATAPI devices. The ATA standard defines two general methods: the PIO and the DMA protocols. We implemented both protocols in our emulation and we present them below. The media of the ATA/ATAPI devices is organised in logical blocks distributed linearly. When the host wants to transfer data, it specifies the count and the address of the first logical block to transfer. Logical Block Addressing (LBA) defines the way the data on the media is addressed: the blocks are located by an integer index, with the first block being LBA 0, the second being LBA 1, and so on. Because we implemented the 28-bit LBA addressing mode, the read/write commands use 28 bits to address each logical block, regardless of the data transfer protocol used (PIO or DMA). Hence, since the size of a logical block is 512 bytes, the maximum size supported by the 28-bit addressing is 2^28 × 512 bytes = 128 GiB. In order to send a read/write command using the 28-bit LBA addressing, the host uses the LBA group registers and the low 4 bits of the Device register as follows: LBA Low = LBA(7:0), LBA Mid = LBA(15:8), LBA High = LBA(23:16), Device(3:0) = LBA(27:24).
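The register mapping above can be made concrete with a short sketch. The structure and function names are ours and purely illustrative; only the bit layout comes from the text.

```c
#include <assert.h>
#include <stdint.h>

#define ATA_SECTOR_SIZE 512u
#define LBA28_SECTORS   (1u << 28)   /* number of addressable blocks */

/* Hypothetical container for the four register fields named in the text. */
struct lba28_regs {
    uint8_t lba_low;    /* LBA(7:0)   */
    uint8_t lba_mid;    /* LBA(15:8)  */
    uint8_t lba_high;   /* LBA(23:16) */
    uint8_t device;     /* Device(3:0) = LBA(27:24) */
};

/* Split a 28-bit block address into the register values. */
struct lba28_regs lba28_encode(uint32_t lba)
{
    struct lba28_regs r;

    assert(lba < LBA28_SECTORS);
    r.lba_low  = (uint8_t)(lba & 0xff);
    r.lba_mid  = (uint8_t)((lba >> 8) & 0xff);
    r.lba_high = (uint8_t)((lba >> 16) & 0xff);
    r.device   = (uint8_t)((lba >> 24) & 0x0f);
    return r;
}

/* Rebuild the block address from the register values. */
uint32_t lba28_decode(const struct lba28_regs *r)
{
    return ((uint32_t)(r->device & 0x0f) << 24) |
           ((uint32_t)r->lba_high << 16) |
           ((uint32_t)r->lba_mid << 8) |
           (uint32_t)r->lba_low;
}

/* Byte offset of a logical block inside the backing file. */
uint64_t lba28_byte_offset(uint32_t lba)
{
    return (uint64_t)lba * ATA_SECTOR_SIZE;
}
```

The decode direction is the one an emulator needs, since it receives the register values from the driver and has to compute the offset in the backing file.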

4.3.1 PIO command protocol

The PIO protocol is a mechanism used to transfer data to or from the ATA device by writing or reading the Data register. The commands applicable to the ATA disk drive that use the PIO protocol are the READ and WRITE commands, which read data from or write data to the disk. Once we introduced the ATAPI emulation, we realized that more commands use the PIO mechanism to transfer data. In fact, all the ATAPI commands are performed using the PIO protocol, and even the packet itself is a PIO transfer. We therefore had to cope with PIO transfers of different lengths and different logic, so we designed a general, flexible and easy-to-use mechanism to create PIO transfers. To set up either a read or a write PIO transfer, the ata_pio_do_transfer function is called with the length of the transfer and the callback that is invoked when the transfer ends. This callback is supposed to implement the logic of the specific PIO command. Note that each PIO setup can handle a single transfer in progress, but this is not a problem since the commands for a device are never issued in parallel.

The PIO protocol is used to transfer data between the host and the ATA devices under both the PCI and LPC attachments, and it works in almost the same way in both cases. The only noticeable difference is the word size transferred through the DATA register. Under the LPC attachment the PIO protocol uses 16-bit words, while under the PCI attachment it uses 32-bit words. This difference makes the ATA controller slower under the LPC attachment, because twice as many accesses to the DATA register are needed and the per-access overhead is doubled.
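A minimal sketch of such a PIO setup is given below. It is not the code of ata_pio_do_transfer, whose exact signature is not reproduced here; the names pio_setup, pio_begin and pio_data_access, as well as the completions counter, are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct pio_setup;
typedef void (*pio_done_fn)(struct pio_setup *pio);

/* Hypothetical per-drive PIO state. */
struct pio_setup {
    size_t      length;       /* total bytes in this transfer          */
    size_t      offset;       /* bytes moved through the Data register */
    int         in_progress;  /* only one transfer at a time           */
    pio_done_fn done;         /* command-specific completion logic     */
    int         completions;  /* demo only: how often done() has run   */
};

/* Example completion callback; a real one would finish the command. */
void pio_count_completion(struct pio_setup *pio)
{
    pio->completions++;
}

/* Arm a transfer of `length` bytes with its completion callback. */
void pio_begin(struct pio_setup *pio, size_t length, pio_done_fn done)
{
    assert(!pio->in_progress);
    pio->length = length;
    pio->offset = 0;
    pio->in_progress = 1;
    pio->done = done;
}

/* One Data register access of `width` bytes (2 under LPC, 4 under PCI). */
void pio_data_access(struct pio_setup *pio, size_t width)
{
    assert(pio->in_progress && width > 0);
    pio->offset += width;
    if (pio->offset >= pio->length) {
        pio->in_progress = 0;
        if (pio->done != NULL)
            pio->done(pio);
    }
}

/* Number of Data register accesses needed for a transfer. */
size_t pio_accesses_needed(size_t length, size_t width)
{
    return (length + width - 1) / width;
}
```

For one 512-byte sector, pio_accesses_needed gives 256 accesses with 16-bit words and 128 accesses with 32-bit words, which is the factor of two discussed above.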

PIO data-in

The PIO data-in protocol is used to transfer one or more blocks of data from the media of the device to the host memory. It is closely related to the commands which use this protocol, since they are responsible for preparing the PIO data buffer. The PIO data-in class contains the IDENTIFY DEVICE, READ MULTIPLE and READ SECTOR(S) commands and the ATAPI commands which use this protocol. Once the ATA driver starts the PIO command, the emulator reads the data from the block disk into the PIO buffer and interrupts the host when the data is ready to be transferred. When the host handles the interrupt, it transfers the data by repeatedly reading the DATA register, getting 4 bytes at each read. When the buffer is empty the PIO command is completed. If the transfer spans multiple blocks, the host is interrupted for each transferred block, which is 128 sectors by default. If the total count of sectors is not evenly divisible by the block count, the emulator also interrupts after the last partial block is transferred.

PIO data-out

The PIO data-out protocol is used to transfer one or more blocks of data from the host memory to the media of the device. In contrast to the PIO data-in protocol, where the data buffer is prepared by the emulator before the transfer starts, here the host writes the data into the device buffer. The protocol is closely related to the commands which use it, since they are responsible for the transfer parameters such as the sector count and the offset in the block disk. The PIO data-out class contains commands like WRITE MULTIPLE and WRITE SECTOR(S) and the ATAPI commands which use this protocol. Once the PIO command is issued by the ATA driver, the emulator saves the transfer parameters (sector count and offset) and waits for the data from the host without asserting the interrupt. The ATA driver transfers the data by repeatedly writing the DATA register, putting 4 bytes into the PIO buffer at each write. For every transferred block, the emulator writes the data to the block disk and interrupts the host. When the total number of sectors has been written to the block disk the PIO command is completed. If the total count of sectors is not evenly divisible by the block count, the emulator also interrupts after the last partial block is transferred.
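The rule about the last partial block, which applies to both PIO directions, amounts to a ceiling division. The helper names below are hypothetical and the block size is passed as a parameter rather than assumed.

```c
#include <assert.h>
#include <stdint.h>

/* One interrupt per full block, plus one for a trailing partial block. */
uint32_t pio_interrupt_count(uint32_t sectors, uint32_t sectors_per_block)
{
    assert(sectors_per_block > 0);
    return (sectors + sectors_per_block - 1) / sectors_per_block;
}

/* Number of sectors carried by the last block of the transfer. */
uint32_t pio_last_block_sectors(uint32_t sectors, uint32_t sectors_per_block)
{
    uint32_t rem;

    assert(sectors_per_block > 0);
    rem = sectors % sectors_per_block;
    return (rem != 0 || sectors == 0) ? rem : sectors_per_block;
}
```

For example, a transfer of 300 sectors with blocks of 128 sectors is completed in three interrupts, the last one covering the remaining 44 sectors.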

4.3.2 DMA command protocol

Direct memory access (DMA) is a method by which peripheral devices access the main system memory (RAM) without CPU intervention. It performs much better than the PIO protocol because it brings less overhead and the CPU is available to perform other work while the transfer is running. In contrast with the PIO protocol, where the host polls the DATA register for the whole duration of the transfer, with DMA the host first starts the transfer and receives an interrupt from the DMA controller when the operation is done [19]. For the ATA/ATAPI emulation the host can use the DMA protocol only when it runs under the PCI attachment, because there is no DMA channel for the ATA controller under the LPC attachment. Under the PCI attachment, the DMA channel is provided by the PCI Host adapter and is configured using the ATA Bus Master Registers.

The DMA protocol is used to transfer one or more blocks of data from the host memory to the media of the device or from the media of the device to the host memory. It is closely related to the commands which use this protocol, since they are responsible for the transfer parameters such as the sector count and the offset in the block disk. The DMA class of commands contains the READ DMA and WRITE DMA commands.

First of all we introduce the concept of a DMA transaction in our emulation. A DMA transaction contains 3 different parts: the WRITE/READ DMA command, the start and the stop of the DMA procedure. Once the DMA command is issued by the ATA driver, the host driver starts the DMA transaction by writing the address of the Physical Region Descriptor Table (PRDT) to the emulator and setting the Start/Stop Bus Master bit in the ATA Bus Master Command Register. Because the emulator gets a physical address in the guest memory space, it needs to translate it into the bhyve process memory space. To achieve this, the emulator uses the paddr_guest2host function, which takes the guest address as a parameter and returns a pointer into the host memory space. The PRD Table contains several Physical Region Descriptors (PRDs) which describe the areas of the guest memory that are involved in the data transfer (see Table 4.3). Each PRD entry has 8 bytes and specifies the location where the data will be transferred to or from. The first 4 bytes represent the address of the physical guest memory region and the next 2 bytes specify the number of bytes to be transferred to or from that region. The physical address in the entry is translated into the bhyve address space in the same way as the PRDT address, using paddr_guest2host. When bit 7 (EOT) of the last byte of a PRD is set, it marks the end of the table.

        Byte 3 | Byte 2 | Byte 1 | Byte 0
Dword0  Host memory Region Physical Base Address [31-1] | 0
Dword1  EOT | Vendor Specific | Byte Count [15-1] | 0

Table 4.3: Physical Region Descriptor Table Entry

Once the DMA transaction starts, the emulator iterates through the PRD entries in order to prepare the block request executed by the block instance of the ATA device. A block request contains an array of iovec structures, where each iovec structure describes one entry of the PRD Table. Once the block request has been prepared, it is executed in the context of the block instance using write and read operations on the file descriptor associated with the block device. The ATA driver is notified that the transfer is complete by an interrupt. Once it receives the interrupt, it initiates the last phase of the transaction by clearing the Start/Stop Bus Master bit in the ATA Bus Master Command Register. The emulator then verifies the state of the transaction and sets the status register accordingly, reporting to the ATA driver whether the transfer completed successfully. This is the last part of the DMA transaction.
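The walk over the PRD Table can be sketched as follows. This is a simplified illustration and not the emulator code: guest memory is modelled by a plain buffer, guest2host is a hypothetical stand-in for paddr_guest2host, and the rule that a byte count of zero means 64 KiB is taken from the bus master IDE specification rather than from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

#define PRD_EOT 0x80000000u   /* bit 7 of the last byte of the entry */

/* Stand-in for guest memory; the emulator uses paddr_guest2host instead. */
struct guest_mem {
    uint8_t *base;
    size_t   size;
};

static void *guest2host(struct guest_mem *gm, uint32_t gpa, size_t len)
{
    if (gpa > gm->size || len > gm->size - gpa)
        return NULL;
    return gm->base + gpa;
}

/* PRD entries are stored in little-endian order. */
static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Translate the PRD Table found at guest address prdt_gpa into an iovec
 * array. Returns the number of entries, or -1 on a bad address or when
 * no EOT entry is found within max_iov entries.
 */
int prdt_to_iov(struct guest_mem *gm, uint32_t prdt_gpa,
                struct iovec *iov, int max_iov)
{
    int n;

    assert(iov != NULL);
    for (n = 0; n < max_iov; n++) {
        const uint8_t *e = guest2host(gm, prdt_gpa + (uint32_t)(8 * n), 8);
        uint32_t addr, dw1, count;

        if (e == NULL)
            return -1;
        addr = le32(e) & ~1u;          /* bit 0 is always zero      */
        dw1 = le32(e + 4);
        count = dw1 & 0xfffeu;         /* Byte Count [15-1]         */
        if (count == 0)
            count = 0x10000;           /* zero encodes 64 KiB       */
        iov[n].iov_base = guest2host(gm, addr, count);
        if (iov[n].iov_base == NULL)
            return -1;
        iov[n].iov_len = count;
        if (dw1 & PRD_EOT)
            return n + 1;
    }
    return -1;
}
```

The resulting array has the shape expected by vectored I/O calls such as readv and writev, which is why a block request can be built directly from it.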

4.4 Command descriptions

The ATA driver makes use of the ATA/ATAPI commands in order to control the ATA drives and to transfer data between the host memory and the ATA drives. All the ATA commands are sent through the IO ports interface presented in Section 4.2.1 and use the IRQ system presented in Section 4.2.2. In order to emulate some of the ATA and ATAPI devices we have implemented two categories of commands: the General feature set of the ATA 6 standard, and a subset of the SCSI commands required by the CDROM device. Among the ATA commands we highlight a set of commands for identification, some data transfer commands that use either the PIO or the DMA protocol (see Section 4.3) and two commands used for the implementation of the packet command of the ATAPI devices.

Because all the ATA commands are sent to the ATA drives using the same procedure, we present the mechanism used by the ATA driver to issue a general command. In the FreeBSD operating system the ATA commands are implemented in the ata-lowlevel driver, which communicates directly with our controller emulation. For a better understanding of the structure of the ATA commands we looked at some functions from the ata-lowlevel driver, namely ata_generic_command, ata_tf_write and ata_wait, which implement the general mechanism of command control.

The ATA driver starts to prepare the command by selecting the master or slave drive of the ATA channel (see Section 4.5.3) and waits for the ATA controller to clear the ATA_S_BUSY bit of the status register, to make sure the drive is not busy with another command. When the driver is ready to issue the command, it enables the interrupts by setting the ATA_A_4BIT bit in the ATA_CONTROL register and continues by configuring the parameters of the command in the Command Block registers: ATA_FEATURE, ATA_COUNT, ATA_SECTOR, ATA_CYL_LSB, ATA_CYL_MSB, and ATA_DRIVE. The command processing begins when the ATA_COMMAND register is written with the ATA command code which identifies a specific command [14]. The emulator saves the command parameters into its internal data structures and uses the command code to identify the command. Once the command is fully emulated, the emulator interrupts the ATA driver to notify it of the command completion.
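From the emulator side, this sequence is seen as a series of port writes that end with a write to the command register. The sketch below shows one possible shape of such a handler; the type and function names are hypothetical, while the register offsets and command codes are the standard ATA ones.

```c
#include <assert.h>
#include <stdint.h>

/* Command Block register offsets (standard ATA task file layout). */
enum {
    REG_DATA = 0, REG_FEATURE, REG_COUNT, REG_SECTOR,
    REG_CYL_LSB, REG_CYL_MSB, REG_DRIVE, REG_COMMAND, REG_MAX
};

/* Standard ATA command codes for the commands discussed in this chapter. */
enum {
    CMD_PACKET         = 0xa0,
    CMD_ATAPI_IDENTIFY = 0xa1,
    CMD_READ_MUL       = 0xc4,
    CMD_WRITE_MUL      = 0xc5,
    CMD_READ_DMA       = 0xc8,
    CMD_WRITE_DMA      = 0xca,
    CMD_FLUSHCACHE     = 0xe7,
    CMD_IDENTIFY       = 0xec
};

/* Hypothetical emulator state for one drive. */
struct ata_taskfile {
    uint8_t regs[REG_MAX];  /* parameters latched from the driver        */
    int     cmd_pending;    /* set when a known command code was written */
};

int ata_command_known(uint8_t code)
{
    switch (code) {
    case CMD_PACKET: case CMD_ATAPI_IDENTIFY:
    case CMD_READ_MUL: case CMD_WRITE_MUL:
    case CMD_READ_DMA: case CMD_WRITE_DMA:
    case CMD_FLUSHCACHE: case CMD_IDENTIFY:
        return 1;
    default:
        return 0;
    }
}

/* Parameter writes are only stored; the command write starts processing. */
void ata_port_write(struct ata_taskfile *tf, unsigned off, uint8_t val)
{
    assert(off < REG_MAX);
    tf->regs[off] = val;
    if (off == REG_COMMAND)
        tf->cmd_pending = ata_command_known(val);
}
```

A real handler would emulate the selected command at the point where cmd_pending is set and would assert the interrupt once the emulation is complete.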

4.4.1 ATA commands

We begin by presenting the ATA commands that make up the general feature set of the ATA 6 standard and are implemented in the ATA emulation. In the second part of this section we present the ATA commands that we have implemented in order to support the ATA Packet commands used to communicate with the ATAPI devices.

IDENTIFY DEVICE

This command enables the host to receive the parameter information from the device. The device sets the BSY bit to one, prepares the 256 words of device identification data to transfer to the host, sets the DRQ and READY bits to one, clears the BSY bit to zero, and asserts the INTRQ interrupt. Then the host transfers the data by reading the Data register using the PIO data-in protocol. The data is transferred using 128 successive word transfers. We note that the endianness differs between the driver and the emulator data. In order to correctly emulate the ATA controller we must provide the identification data in big-endian order. For example, in order to send the “FaKe MoDeL IDA diSk” string to the host memory the emulator prepares the “aFeKM DoLeI ADd Si k” string. The host uses this command to find out the CHS parameters of the disk device, namely the number of cylinders, heads and sectors, as well as the model name, serial number, firmware version and the supported capabilities. The capabilities supported by the ATA emulator are the Multi Word DMA2 and PIO4 data transfer protocols working with the 28-bit LBA addressing, and support for the write-read verify and flushcache commands.
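The string preparation in the example above swaps the two bytes of each 16-bit word. A small sketch of this step follows; the function name is hypothetical and the field is padded with spaces up to an even length.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy src into a field of len bytes (len must be even), padding with
 * spaces and swapping the two bytes of every 16-bit word.
 * dst must have room for len + 1 bytes.
 */
void ata_id_string(char *dst, const char *src, size_t len)
{
    size_t i, n = strlen(src);

    assert(len % 2 == 0);
    for (i = 0; i < len; i++) {
        char c = (i < n) ? src[i] : ' ';
        dst[i ^ 1] = c;   /* index 0 <-> 1, 2 <-> 3, ... */
    }
    dst[len] = '\0';
}
```

Called with the 19-character string from the example and a field length of 20, it produces exactly the string quoted in the text. The model number field of the real identification data is 40 bytes long, so the shorter length is used here only to match the example.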

READ MULTIPLE

This command is issued by the ATA driver in order to read data using the PIO data-in protocol. It specifies the number of sectors to be transferred and the offset on the block disk using the LBA 28-bit addressing mode. The emulator reads the data from the disk, prepares the PIO buffer, marks the PIO read command as in progress and interrupts the host to signal that the data is ready. After the host gets the interrupt, it starts to transfer the data.

WRITE MULTIPLE

This command is issued by the ATA driver in order to write data using the PIO data-out protocol. It specifies the number of sectors to be transferred and the offset on the block disk using the LBA 28-bit addressing mode. The emulator saves the command parameters into the PIO setup, marks the PIO write command as in progress and waits for the host to start the transfer of the data.

READ and WRITE DMA

These commands are issued by the ATA driver in order to read or write data using the DMA data transfer protocol, and they represent the first phase of the DMA protocol. They specify the number of sectors to be transferred and the offset on the block disk using the LBA 28-bit addressing mode. The emulator saves the command parameters into the DMA setup, including the direction of the operation (read or write), and marks the DMA transaction as started. Afterwards the host prepares the PRD Table and activates the DMA transaction.

FLUSH CACHE

This command is a non-data ATA command which is used by the host in order to ask the device to flush the write cache. Basically, the emulator creates a flush request to the block device which will flush the file descriptor associated with the disk drive.

4.4.2 ATA-ATAPI commands

In order to support the ATAPI devices, besides the ATA commands that make up the general feature set of the ATA 6 standard we had to implement two more ATA commands: ATA_ATAPI_IDENTIFY and ATA_PACKET_CMD.

ATA_ATAPI_IDENTIFY

This ATA command is used by the host in order to get information about the ATAPI device. The host finds out that there is an ATAPI device after the Software Reset, when each ATA drive places its signature in the device registers. If there is an ATAPI device, the host issues the ATA_ATAPI_IDENTIFY command in order to get extra information such as the model and serial number of the device and its supported capabilities.

ATA_PACKET_CMD

This command is issued by the host in order to send the packet command to the ATAPI device. After this command, the host transmits the packet made up of 12 bytes of data using the PIO protocol to the device. The first byte from the packet represents the command code and it is used to select the ATAPI command.

4.4.3 ATAPI commands

As stated before, we want to emulate an ATAPI CDROM in order to boot a virtual machine using the ATA/ATAPI emulation. Hence, our goal is to implement only the subset of the SCSI commands required by the CDROM device. Each ATAPI command is sent by the host in a packet command of 12 bytes using the PIO protocol, where the first byte of the packet is the command code used to select the ATAPI command.
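The selection on the first byte of the packet can be sketched as below. The enumeration and the function name are hypothetical; the operation codes are the standard SCSI ones for the commands presented in this section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ATAPI_PACKET_LEN 12

/* Standard SCSI operation codes. */
#define SCSI_TEST_UNIT_READY 0x00
#define SCSI_INQUIRY         0x12
#define SCSI_PREVENT_ALLOW   0x1e
#define SCSI_READ_CAPACITY   0x25
#define SCSI_READ_10         0x28
#define SCSI_READ_TOC        0x43

/* Hypothetical internal identifiers for the emulated commands. */
enum atapi_op {
    ATAPI_OP_TEST_UNIT_READY,
    ATAPI_OP_INQUIRY,
    ATAPI_OP_PREVENT_ALLOW,
    ATAPI_OP_READ_CAPACITY,
    ATAPI_OP_READ_10,
    ATAPI_OP_READ_TOC,
    ATAPI_OP_UNSUPPORTED
};

/* Select the ATAPI command from the first byte of the 12-byte packet. */
enum atapi_op atapi_decode(const uint8_t pkt[ATAPI_PACKET_LEN])
{
    assert(pkt != NULL);
    switch (pkt[0]) {
    case SCSI_TEST_UNIT_READY: return ATAPI_OP_TEST_UNIT_READY;
    case SCSI_INQUIRY:         return ATAPI_OP_INQUIRY;
    case SCSI_PREVENT_ALLOW:   return ATAPI_OP_PREVENT_ALLOW;
    case SCSI_READ_CAPACITY:   return ATAPI_OP_READ_CAPACITY;
    case SCSI_READ_10:         return ATAPI_OP_READ_10;
    case SCSI_READ_TOC:        return ATAPI_OP_READ_TOC;
    default:                   return ATAPI_OP_UNSUPPORTED;
    }
}
```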

INQUIRY

Using this command, the host requests information such as the vendor, product and revision ids from the ATAPI device. The device returns the data to the host through a PIO transfer of 36 bytes.

READ_CAPACITY

The host issues this command to read the capacity of the CDROM media. The device returns the number of blocks and the block size of its media through a PIO transfer of 8 bytes.

READ_TOC

The host requests the ATAPI Drive to read the data from the Table of Contents and transfer the result back to the Host. Our ATAPI module emulates the media having one single track so the response to this command is composed of Track1 in the data zone.

READ_10

This command is issued by the host in order to read data from the media using the PIO protocol. The command specifies the LBA address of the starting block and the number of blocks to be read. The ATAPI drive interrupts the host once for each transferred block of 2048 bytes.
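The two parameters of READ_10 are stored in the packet in big-endian order, following the standard SCSI READ(10) layout: bytes 2 to 5 hold the starting LBA and bytes 7 and 8 the number of blocks. A decoding sketch with hypothetical names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SCSI_READ_10     0x28
#define CDROM_BLOCK_SIZE 2048u

struct read10 {
    uint32_t lba;     /* first block to read */
    uint16_t blocks;  /* number of blocks    */
};

/* Decode a 12-byte packet; returns -1 if it is not a READ(10). */
int read10_decode(const uint8_t pkt[12], struct read10 *out)
{
    assert(pkt != NULL && out != NULL);
    if (pkt[0] != SCSI_READ_10)
        return -1;
    out->lba = ((uint32_t)pkt[2] << 24) | ((uint32_t)pkt[3] << 16) |
               ((uint32_t)pkt[4] << 8) | (uint32_t)pkt[5];
    out->blocks = (uint16_t)(((unsigned)pkt[7] << 8) | pkt[8]);
    return 0;
}

/* Offset of the first requested block inside the backing image. */
uint64_t read10_byte_offset(const struct read10 *r)
{
    return (uint64_t)r->lba * CDROM_BLOCK_SIZE;
}
```

With one interrupt per 2048-byte block, the number of interrupts raised for a request equals the decoded block count.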

PREVENT_ALLOW and TEST_UNIT_READY

We do not perform any special processing for these commands. They are implemented only because the driver issues them and the drive must acknowledge them by raising an interrupt.

4.5 Implementation Details

We end the ATA/ATAPI implementation chapter with some implementation details which we consider important to present. We begin with the initialization phase of the ATA emulator, continue with the software reset protocol and the way the master and slave devices are selected by the ATA driver, and finish with the block device emulation in bhyve. We note that the development of the ATA/ATAPI emulation in bhyve has been branched from the FreeBSD 11.0-CURRENT r274023 revision by cloning the source tree from the repository [1].

4.5.1 Initialization

The initialization of the ATA controller begins in the ata_init function, which allocates and initializes the ATA channel data structures. It registers two instances of the block device emulation, one for the master and one for the slave drive of the channel, which manage the backing files used as media storage. The PIO and DMA commands implemented by the emulator result in read and write calls on the backing image file descriptor managed by the block device emulation.

Because the ATA channel emulation can run on both PCI and LPC attachments there are specific initialization routines for each attachment.

PCI Initialization

We handle the PCI specific initialization in the pci_ata_init function. In order to be enumerated on the PCI bus our PCI Adapter has to implement a subset of the PCI standard type configuration header register set. First of all, the PCIR_DEVICE and PCIR_VENDOR registers are set to the values 0x8211 and 0x1283, which means we emulate a Waldo ATA Controller. For the PCI Class registers we set the Base-Class and Sub-Class registers to the following values: 01h (Mass Storage) and 01h (IDE). We set the PCIP_STORAGE_IDE_MASTERDEV bit in the Programming Interface register to indicate that the adapter can do bus master operations, that is, Direct Memory Access. We also set the PCIP_STORAGE_IDE_MODEPRIM and PCIP_STORAGE_IDE_MODESEC bits in the same register to inform the ATA driver that we support both the primary and the secondary channel in the PCI adapter emulation.

Also in the PCI initialization function of the ATA Host emulator we allocate and register the BAR registers. The ATA device uses only five BARs out of six. The PCI BAR0 and BAR2 registers represent the base addresses of the command block registers of the primary and secondary channels, respectively. The PCI BAR1 and BAR3 registers represent the base addresses of the control registers of the primary and secondary channels, respectively. The PCI Base Address Register BAR4 is the base address of the ATA Bus Master I/O registers responsible for the DMA operations. At the end of this function we reserve the IRQ line of the PCI Host Adapter used by the ATA channels to interrupt the host.
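The header values listed above can be illustrated on a plain configuration space array. This sketch does not use the bhyve PCI API; the helper names are hypothetical, the header offsets are the standard PCI ones, and the Programming Interface bit values are assumed to match those defined in the FreeBSD pcireg.h header.

```c
#include <assert.h>
#include <stdint.h>

#define CFG_SIZE     256

/* Standard PCI configuration header offsets. */
#define CFG_VENDOR   0x00
#define CFG_DEVICE   0x02
#define CFG_PROGIF   0x09
#define CFG_SUBCLASS 0x0a
#define CFG_CLASS    0x0b

/* Programming Interface bits (assumed values, see the lead-in). */
#define IDE_MODEPRIM  0x01
#define IDE_MODESEC   0x04
#define IDE_MASTERDEV 0x80

static void cfg_set8(uint8_t *cfg, unsigned off, uint8_t v)
{
    assert(off < CFG_SIZE);
    cfg[off] = v;
}

/* Configuration space is little-endian. */
static void cfg_set16(uint8_t *cfg, unsigned off, uint16_t v)
{
    assert(off + 1 < CFG_SIZE);
    cfg[off] = (uint8_t)(v & 0xff);
    cfg[off + 1] = (uint8_t)(v >> 8);
}

/* Fill the identification part of the header as described in the text. */
void ata_fill_pci_header(uint8_t cfg[CFG_SIZE])
{
    cfg_set16(cfg, CFG_VENDOR, 0x1283);
    cfg_set16(cfg, CFG_DEVICE, 0x8211);
    cfg_set8(cfg, CFG_CLASS, 0x01);      /* Mass Storage */
    cfg_set8(cfg, CFG_SUBCLASS, 0x01);   /* IDE          */
    cfg_set8(cfg, CFG_PROGIF,
             IDE_MASTERDEV | IDE_MODEPRIM | IDE_MODESEC);
}
```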

LPC Initialization

When running under the LPC attachment the initialization is simpler because we do not need to configure any specific registers. We just register the IO ports used by the ATA driver to access the ATA channel registers. Since the base addresses and IRQ lines are fixed and specified by the IBM PC standard, we hardcode these values in our data structures. The same values must also be added to the /boot/device.hints file of the guest virtual machine. See Section 4.2 for more details.

4.5.2 Software reset protocol

One operation that we had to emulate is the software reset. The ATA driver uses this protocol in the probe routine of the ATA channel in order to look for any signs of ATA/ATAPI devices on the channel. In order to reset the controller the driver sets the ATA_A_RESET bit in the ATA_CONTROL register. Note that this operation resets both drives of the ATA channel, so to emulate it we set the ATA_E_ILI bit in the ATA_ERROR register and the ATA_S_READY bit in the ATA_STATUS register of each drive. Furthermore, we have to set either the ATA or the ATAPI signature in the drive registers, depending on which device is configured.
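A minimal sketch of this reset handling is shown below. The structure names are hypothetical. The signature values (0x14 and 0xEB in the LBA Mid and LBA High registers for ATAPI, zero for ATA) are the ones defined by the ATA standard, and the bit values are assumed to match the FreeBSD ATA headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ATA_A_RESET    0x04   /* software reset bit, control register */
#define ATA_E_ILI      0x01   /* assumed value of the error bit       */
#define ATA_S_READY    0x40   /* assumed value of the status bit      */
#define ATAPI_SIG_MID  0x14
#define ATAPI_SIG_HIGH 0xeb

/* Hypothetical per-drive register file. */
struct ata_drive {
    int     is_atapi;
    uint8_t error, status, count, lba_low, lba_mid, lba_high;
};

struct ata_chan {
    struct ata_drive drive[2];   /* master and slave */
};

static void drive_reset(struct ata_drive *d)
{
    d->error = ATA_E_ILI;
    d->status = ATA_S_READY;
    d->count = 1;
    d->lba_low = 1;
    d->lba_mid = d->is_atapi ? ATAPI_SIG_MID : 0x00;
    d->lba_high = d->is_atapi ? ATAPI_SIG_HIGH : 0x00;
}

/* A write to the control register resets both drives of the channel. */
void ata_control_write(struct ata_chan *ch, uint8_t val)
{
    int i;

    assert(ch != NULL);
    if ((val & ATA_A_RESET) == 0)
        return;
    for (i = 0; i < 2; i++)
        drive_reset(&ch->drive[i]);
}
```

After the reset, the driver reads the LBA Mid and LBA High registers of each drive to tell an ATA disk from an ATAPI device.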

4.5.3 Device addressing considerations

First of all, when a register is set, the value is written to the register of both devices. The host discriminates between the two by using the DEV bit in the Device register. The data is transferred in parallel either to or from the host memory to the device’s buffer, with the direction specified by the last command transferred from the host. The device performs all the operations necessary to properly write the data to, or read the data from, the media. When two devices are connected on the cable, the commands are written in parallel to both devices, and for all the commands except the EXECUTE DEVICE DIAGNOSTIC command only the selected device executes the command. When the Device Control register is written, both devices respond regardless of which device is selected. When the DEV bit is cleared to zero, Device 0 is selected. When the DEV bit is set to one, Device 1 is selected. When two devices are connected to the cable, one shall be set as Device 0 and the other as Device 1 [15].

In terms of software data structures, the ATA channel structure has two ATA drive data structures that hold the set of registers of each ATA device. This is mandatory in the case of the Software Reset command, which targets both devices: they shall reset and put their signatures, which are different for an ATA disk and an ATAPI CDROM, in the ATA registers. Furthermore, the PIO mechanism and the state of the interrupts (enabled or disabled) are maintained for each drive.
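The addressing rules above can be sketched with two register files that are written in parallel. The names are hypothetical; the DEV bit is bit 4 of the Device register, as defined by the ATA standard.

```c
#include <assert.h>
#include <stdint.h>

#define ATA_NREGS      8
#define ATA_REG_DEVICE 6
#define ATA_DEV_BIT    0x10   /* DEV: 0 selects Device 0, 1 selects Device 1 */

/* One register file per device on the channel. */
struct ata_pair {
    uint8_t regs[2][ATA_NREGS];
};

/* A register write reaches both devices. */
void ata_pair_write(struct ata_pair *p, unsigned off, uint8_t val)
{
    assert(off < ATA_NREGS);
    p->regs[0][off] = val;
    p->regs[1][off] = val;
}

/* Only the device selected by the DEV bit executes the next command. */
unsigned ata_pair_selected(const struct ata_pair *p)
{
    return (p->regs[0][ATA_REG_DEVICE] & ATA_DEV_BIT) ? 1u : 0u;
}
```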

4.5.4 Block device emulation

In order to fully emulate an ATA/ATAPI device, besides the implementation of the ATA commands, we have to store and retrieve the data transferred by the virtual machine. We achieve this by using a regular file in the FreeBSD file system where we write and read the data blocks.

For a better management of the file operations, such as the read and write calls, we use the implementation of the block device emulation in bhyve. We associate an instance of the blockif_ctxt structure with each ATA drive. Since the ATA channel supports at most 2 drives (the master and the slave drive), each of them is assigned one blockif_ctxt structure. We decided to use this API because the block model creates an extra thread in which the read/write requests are executed. The public API for read/write operations works by submitting block requests to the block device queue, from which they are pulled and executed in its own execution context. The IO operations on the device file descriptor have to be executed in a different context than the main thread of the virtual machine, which runs all the host instructions. Otherwise, the whole virtual machine would get stuck when the ATA driver initiates data transfer commands.

Chapter 5

The NE2000 Emulation

In this chapter we focus on the design and implementation of another type of emulation, the NE2000 network interface card. We start with an overview of the NE2000 emulation and we present the hardware resources that we have emulated, such as the IO ports, IRQ lines and Packet Ring Buffers. After that, we describe the design of the data transfer protocols, such as the PIO and DMA protocols, and we finish the chapter with some implementation details such as the Packet Reception and Packet Transmission flows, the Software Reset command and the way we synchronized the implementation to run in two different contexts of execution.

5.1 Overview

We start this chapter with an overview of the NE2000 emulation where we emphasize the place of the module implementation inside the bhyve application.

Figure 5.1: NE2000 overview

In order to achieve this, we zoom in on Figure 2.4, which describes the general architecture of the device emulation, and we add the NE2000 device and the modules that interact with it. The result is presented in Figure 5.1, where we observe that the NE2000 implements two interfaces in order to implement the device emulation. These interfaces represent the means of access to the NE2000 controller ports, and there are two of them because there are two ways of access depending on the bus to which the device is attached. When the NE2000 emulation is configured to run under the PCI bus, the IO ports are accessed through the PCI interface implemented by the pci_ne2000_write and pci_ne2000_read functions. If the NE2000 emulation is configured to run under the LPC bus, the IO ports are accessed directly through the basic INOUT interface, which is implemented by the lpc_ne2000_io_handler function. In the end, both interfaces access the same NE2000 implementation after some translation of the port offsets. The number of IO ports and their meaning are presented in detail in the next section.

5.2 Hardware Resources

5.2.1 I/O ports description

The IO ports represent the communication interface between the NE2000 controller and the software drivers. The driver accesses the IO ports using write and read instructions in order to control the controller and to transfer the network packets between the Packet Ring Buffers and the system memory. Depending on the attachment under which the emulation runs, there are two ways to address the IO ports. Under the PCI attachment the driver addresses the NE2000 internal registers through the PCI BAR registers, which are discovered in the PCI enumeration phase. Under the LPC attachment the NE2000 IO ports can be accessed directly by the software driver using basic IO read and write instructions because they are located on the address bus of the CPU. The physical addresses of the NE2000 IO ports are described in the /boot/device.hints configuration file.

There are two sets of registers in the NE2000 register address space: the NIC registers and the ASIC registers. They are accessed by the host’s CPU through one single IO port range. The first range, presented in Table 5.1, contains 16 registers, starts at offset zero from the base address port and is used to access the NIC register pages. There are 3 pages of 16 registers each. We emulate only the first two pages, since the third one is not mandatory. The first page contains most of the control and status registers while the second one stores the physical and multicast MAC addresses. The host driver selects a page by programming the PS0 and PS1 bits of the CR register (Command Register). For example, when PS0 = 0 and PS1 = 0, Page 0 is selected.

The second range starts at offset 16 from the base address port and contains only two registers: the Data and the Reset registers. The Data register, located at offset 0 relative to the ASIC range base, is a bidirectional port used by the host to read or write words from or to the controller buffer memory. The Reset register, located at offset 15 relative to the ASIC range base, is read by the host in order to reset the board. This Reset is a hardware reset and is different from the Software Reset command presented in Section 5.4.2.

When the NE2000 emulation runs under the LPC attachment it registers one IO port range of 32 registers using the register_inout function. Under the PCI attachment it allocates one BAR register of 32 ports for both the NIC and ASIC ranges. Even though the ASIC range has only 2 registers, we need to allocate 16 registers for it because the Reset register is located at offset 15.

Offset | Page 0 (PS1=0 PS0=0) RD | Page 0 (PS1=0 PS0=0) WR | Page 1 (PS1=0 PS0=1) RD | Page 1 (PS1=0 PS0=1) WR
00H | Command(CR) | Command(CR) | Command(CR) | Command(CR)
01H | Current Local DMA0 | Page Start Register | PHY Addr 0 | PHY Addr 0
02H | Current Local DMA1 | Page Stop Register | PHY Addr 1 | PHY Addr 1
03H | Boundary Pointer | Boundary Pointer | PHY Addr 2 | PHY Addr 2
04H | Transmit Status | Transmit Page Start | PHY Addr 3 | PHY Addr 3
05H | No. of Collisions | Transmit Byte Count0 | PHY Addr 4 | PHY Addr 4
06H | FIFO | Transmit Byte Count1 | PHY Addr 5 | PHY Addr 5
07H | Interrupt Status | Interrupt Status | Current Page | Current Page
08H | Current Remote DMA0 | Remote Start Reg0 | Multicast Addr 0 | Multicast Addr 0
09H | Current Remote DMA1 | Remote Start Reg1 | Multicast Addr 1 | Multicast Addr 1
0AH | Reserved | Remote Byte Count0 | Multicast Addr 2 | Multicast Addr 2
0BH | Reserved | Remote Byte Count1 | Multicast Addr 3 | Multicast Addr 3
0CH | Receive Status | Receive Configuration | Multicast Addr 4 | Multicast Addr 4
0DH | Tally Counter 0 | Transmit Configuration | Multicast Addr 5 | Multicast Addr 5
0EH | Tally Counter 1 | Data Configuration | Multicast Addr 6 | Multicast Addr 6
0FH | Tally Counter 2 | Interrupt Mask | Multicast Addr 7 | Multicast Addr 7

Table 5.1: NE2000 NIC Register Offsets
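The layout of the 32 ports and the page selection can be sketched as follows. The names are hypothetical and the sketch is simplified: it keeps a single storage slot per page and offset, although Table 5.1 shows that some offsets map to different registers on read and on write. The PS0 and PS1 bits are bits 6 and 7 of the Command Register, as defined for the DP8390 family of controllers.

```c
#include <assert.h>
#include <stdint.h>

#define NE2000_PORTS      32
#define NE2000_NIC_REGS   16     /* offsets 0x00-0x0f        */
#define NE2000_ASIC_DATA  0x10   /* offset 0 in the ASIC range  */
#define NE2000_ASIC_RESET 0x1f   /* offset 15 in the ASIC range */
#define NE2000_PAGES      3

enum ne2000_target { NE_NIC, NE_DATA, NE_RESET, NE_UNUSED };

/* Hypothetical register storage. */
struct ne2000_regs {
    uint8_t cr;                                   /* visible in every page */
    uint8_t page[NE2000_PAGES][NE2000_NIC_REGS];
};

/* Decide which part of the device an offset in the 32-port range hits. */
enum ne2000_target ne2000_classify(unsigned off)
{
    assert(off < NE2000_PORTS);
    if (off < NE2000_NIC_REGS)
        return NE_NIC;
    if (off == NE2000_ASIC_DATA)
        return NE_DATA;
    if (off == NE2000_ASIC_RESET)
        return NE_RESET;
    return NE_UNUSED;
}

/* PS0 is bit 6 and PS1 is bit 7 of the Command Register. */
unsigned ne2000_selected_page(uint8_t cr)
{
    return (cr >> 6) & 0x3;
}

/* Store a NIC register write into the currently selected page. */
int ne2000_nic_write(struct ne2000_regs *r, unsigned off, uint8_t val)
{
    unsigned pg;

    assert(off < NE2000_NIC_REGS);
    if (off == 0) {
        r->cr = val;
        return 0;
    }
    pg = ne2000_selected_page(r->cr);
    if (pg >= NE2000_PAGES)
        return -1;
    r->page[pg][off] = val;
    return 0;
}
```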

5.2.2 Interrupts

There are two main types of interrupts that a device can use to interrupt the CPU: level-triggered and edge-triggered interrupts. Describing these types is not the goal of this work, but their relation to the NE2000 controller needs to be presented. The PCI bridge uses level-triggered interrupts while the LPC bridge uses edge-triggered interrupts, so the NE2000 emulation has to raise either type depending on whether it is attached under the PCI bus or the LPC bus. The interfaces provided by the bhyve library to raise these interrupts are quite different, so we need a general mechanism that asserts interrupts efficiently regardless of the attachment under which the NE2000 runs. The solution we implemented is to register a different callback depending on the attachment in use. Each time the NE2000 controller needs to interrupt the host, it calls that callback. This way, we do not need to check on every interrupt whether we run under PCI or LPC, so the solution is transparent and efficient.

The NE2000 controller uses one single interrupt line, reserved in the initialization phase, in order to notify the host when a network packet is received or transmitted. On the transmission flow, once the packet is transferred from the guest system memory into the internal NE2000 Transmit ring buffer and the effective transmission is emulated, the guest driver is notified by an interrupt for transmission complete. The procedure is similar on the reception flow, where the emulator waits to receive the packets from the host network stack, transfers them to the Reception ring buffer and notifies the guest driver of the packet reception.

5.3 Packet Transfer Protocols

The NE2000 standard specifies two types of DMA channels used for the access to either the local packet buffers or the host system memory.

The Local DMA channel is used during reception to store packets from the receive FIFO in the receive buffer ring, and during transmission to transfer packets from the local transmit buffer memory to the transmit FIFO. This protocol is easy to emulate, since we just copy arrays of data from one location to another.

The second type of DMA channel, known as the Remote DMA channel, is provided so that the NE2000 NIC can transfer data between the system memory on the host side and the local buffers in the NE2000 RAM on the device side. Note that all data transfers between the host system memory and the NE2000 RAM go through the Data IO port. Consequently, a data transfer consists of a sequence of PIO accesses on the Data IO port. To program a data transfer in the NE2000 RAM, two register pairs are used: the Remote Start Address (RSAR0, RSAR1) and the Remote Byte Count (RBCR0, RBCR1) register pairs. The host uses the Start Address Registers to specify the beginning of the block to be transferred and the Byte Count Registers to indicate the number of bytes to be moved.

Note that on the host side the data is not moved by the NE2000 controller itself but by an external DMA controller. The bhyve hypervisor has no such DMA controller, so in our configuration the CPU transfers the data using PIO (Programmed Input Output) accesses to the Data IO port. In any case, this is transparent for the NE2000 emulation, where the management of the data transfer is the same no matter who masters the transfer (CPU or external DMA).
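To make the Remote DMA programming model concrete, here is a small Python sketch of the device side. It is an illustration under simplifying assumptions, not the emulator's code: the class `RemoteDma` and its method names are hypothetical, the RAM is indexed from 0, accesses are byte-wide, and reads past the programmed count return 0xFF by choice.

```python
class RemoteDma:
    """Remote DMA state driven by PIO accesses on the Data IO port."""

    def __init__(self, ram_size: int = 32 * 1024):
        self.ram = bytearray(ram_size)
        self.addr = 0    # current address, loaded from RSAR0/RSAR1
        self.count = 0   # bytes left to move, loaded from RBCR0/RBCR1

    def program(self, rsar0: int, rsar1: int, rbcr0: int, rbcr1: int) -> None:
        # Each register pair forms one 16-bit value (low byte first).
        self.addr = ((rsar1 << 8) | rsar0) % len(self.ram)
        self.count = (rbcr1 << 8) | rbcr0

    def _advance(self) -> None:
        self.addr = (self.addr + 1) % len(self.ram)
        self.count -= 1

    def write_data_port(self, value: int) -> None:
        if self.count == 0:
            return  # transfer already complete: ignore the access
        self.ram[self.addr] = value & 0xFF
        self._advance()

    def read_data_port(self) -> int:
        if self.count == 0:
            return 0xFF  # assumed filler value once the count is exhausted
        value = self.ram[self.addr]
        self._advance()
        return value


dma = RemoteDma()
dma.program(0x00, 0x40, 0x03, 0x00)       # start 0x4000, 3 bytes
for byte in (0xDE, 0xAD, 0xBE, 0xEF):     # the 4th write exceeds the count
    dma.write_data_port(byte)
dma.program(0x00, 0x40, 0x03, 0x00)
readback = [dma.read_data_port() for _ in range(3)]
print([hex(b) for b in readback])
```

Whether the CPU or an external DMA controller issues the port accesses makes no difference to this model, which is the transparency the text refers to.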

5.3.1 Packet Ring Buffers

The NE2000 internal data structures include three main memory buffers: the NIC registers, the Receive Buffer Ring and the Transmit Buffer. The NIC registers are kept in an array of three pages of 16 bytes each, which is used internally to store the value of every register. The Receive Buffer Ring and the Transmit Buffer reside in the 32 KBytes of RAM inside the NE2000 device. In this section we present the management of these buffers from the point of view of the NE2000 emulation.

Receive Buffer Ring

As mentioned above, the Receive Buffer Ring is placed in the RAM, and its location is programmed by the guest driver in two registers, the Page Start and Page Stop Registers, which define where the ring physically starts and ends in the NE2000 RAM. The ring buffer consists of a series of fixed-length pages of 256 bytes used to store the received packets. Besides these two static registers, there are two dynamic registers that control the logical ring: the Current Page Register, which is the head of the ring and points to the first page that will be used to store the next received packet, and the Boundary Pointer Register, which is the tail of the ring and points to the first packet not yet read by the guest driver. At the initialization of the NE2000 emulator, both the Current Page Register and the Boundary Pointer Register are set to the value of the Page Start Register. Each received packet stored in the ring buffer is prefixed by the NE2000 Receive Ring Header, which specifies the receive status of the packet, the size of the packet in bytes and the page containing the next packet. The layout of the NE2000 Receive Ring Header is presented in Table 5.2.
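The ring management can be sketched as follows. This Python model is only an illustration: `ReceiveRing` and its methods are hypothetical names, the RAM is indexed from page 0, one page is kept free to tell a full ring from an empty one, and the byte count stored in the header is assumed to include the 4 header bytes of Table 5.2.

```python
PAGE_SIZE = 256
HEADER_SIZE = 4  # status, next page, byte count low, byte count high


class ReceiveRing:
    def __init__(self, ram: bytearray, page_start: int, page_stop: int):
        self.ram = ram
        self.pstart = page_start      # Page Start Register
        self.pstop = page_stop        # Page Stop Register (exclusive)
        self.curr = page_start        # Current Page Register (head)
        self.bnry = page_start        # Boundary Pointer Register (tail)

    def free_pages(self) -> int:
        total = self.pstop - self.pstart
        used = (self.curr - self.bnry) % total
        return total - used - 1       # assumption: one page always stays free

    def store(self, frame: bytes, status: int = 0x01) -> bool:
        size = HEADER_SIZE + len(frame)
        pages = (size + PAGE_SIZE - 1) // PAGE_SIZE
        if pages > self.free_pages():
            return False              # ring full: the caller drops the packet
        total = self.pstop - self.pstart
        next_page = self.pstart + (self.curr - self.pstart + pages) % total
        data = bytes([status, next_page, size & 0xFF, size >> 8]) + frame
        offset = self.curr * PAGE_SIZE
        for byte in data:
            if offset == self.pstop * PAGE_SIZE:
                offset = self.pstart * PAGE_SIZE   # wrap to the ring start
            self.ram[offset] = byte
            offset += 1
        self.curr = next_page
        return True


ram = bytearray(32 * 1024)
ring = ReceiveRing(ram, 0x46, 0x48)          # a tiny two-page ring
first = ring.store(b"\xaa" * 60)
second = ring.store(b"\xbb" * 60)            # rejected: no free page left
ring.bnry = ring.curr                        # the guest driver consumed packet 1
third = ring.store(b"\xcc" * 60)             # stored in page 0x47, wraps to 0x46
print(first, second, third, hex(ring.curr))
```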

Byte 0    Receive Status
Byte 1    Next Packet Pointer
Byte 2    Receive Byte Count 0
Byte 3    Receive Byte Count 1

Table 5.2: NE2000 Receive Ring Header

Transmit Buffer

Like the Receive Buffer Ring, the Transmit Buffer is placed in the NE2000 RAM, and its location is controlled by the guest driver by programming two registers: the Transmit Page Start Address Register and the Transmit Byte Count Registers (TBCR0,1). In contrast with the Receive Buffer Ring, which contains a circular list of packets, the Transmit Buffer is not a ring. It can contain only one packet at a time, so the guest driver has to wait for the completion of the transmitted packet before starting to transfer a new packet for transmission.

5.3.2 Packet Transfer Emulation

Besides the common mechanisms of any device emulation that we have discussed so far, such as emulating the basic hardware resources, handling the IO register accesses and maintaining the state of the device in software data structures, there is a problem specific to network card emulation that we have to solve: how do we actually transfer the network packets between the bhyve application and the FreeBSD host operating system? In other words, we need a mechanism to pass the packets received from the guest operating system to the host network stack and vice versa. In this section we present the solution we found and how it applies to the reception and transmission of network packets.

The main idea behind the network card emulation is to have an interface that transfers Ethernet frames from the guest network driver to the host networking stack and vice versa. One of the solutions implemented by most hypervisors is the tap interface. The tap interface is a software loopback mechanism provided by the host operating system, which our implementation uses to communicate with the host networking stack. In other words, the tap interface provides a mechanism to insert Ethernet frames into the local network stack in order to transmit them to the network, and a mechanism to receive Ethernet frames from the network. Note that the host operating system is responsible for routing the packets. The tap interface is a character device that we handle with regular system calls such as open, write, read and close. To transmit data from the guest network driver to the host networking stack we use the write system call, whereas to receive data from the host networking stack and send it to the guest network driver we use the read system call.

Even though this mechanism allows us to communicate with the host network stack, we still need a way to transmit and receive frames to and from other hosts on the network. We solved this problem by using a bridge to connect the tap interface with one of the physical network interfaces of the host operating system [10]. The preparation of the host, that is, the creation of the network tap device and of the bridge interface, is described in Section 6.2.1.
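As a rough illustration of how the tap character device is driven with plain system calls, the Python sketch below wraps open, write and read. The helper names `tap_open`, `tap_send` and `tap_recv` are hypothetical, as is the choice of non-blocking mode; `MAX_FRAME` assumes a 14-byte Ethernet header plus a 1500-byte payload. The demonstration uses a pipe instead of a real /dev/tapN device so that it runs without any host preparation.

```python
import os
from typing import Optional

MAX_FRAME = 1514  # 14-byte Ethernet header + 1500-byte payload, no FCS


def tap_open(path: str) -> int:
    # Open the tap character device for reading and writing.
    return os.open(path, os.O_RDWR | os.O_NONBLOCK)


def tap_send(fd: int, frame: bytes) -> int:
    # Guest -> host network stack: one write() per Ethernet frame.
    return os.write(fd, frame)


def tap_recv(fd: int) -> Optional[bytes]:
    # Host network stack -> guest: one read() returns at most one frame.
    try:
        return os.read(fd, MAX_FRAME)
    except BlockingIOError:
        return None  # nothing to read right now


# Demonstration with a pipe standing in for the tap device.
read_end, write_end = os.pipe()
sent = tap_send(write_end, b"\xff" * 6 + b"\x00" * 54)
received = tap_recv(read_end)
os.close(read_end)
os.close(write_end)
print(sent, len(received))
```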

Packet Reception

In order to transfer the packets received from the host network to the guest network driver, we first need a way to extract the packets from the tap interface. To achieve this, we call the read system call on the tap file descriptor; the only problem is knowing when to read. Because we implement an event-driven approach and do not have a context of execution that can wait for a read event on the tap file descriptor, we register the file descriptor of the tap interface with the mevent system implemented in bhyve. The mevent system runs in its own context of execution and invokes a callback each time there is something to read from that file descriptor.

For each packet received from the tap interface, we check whether the Ethernet frame is valid and whether the ring buffer is not full. We decide whether a frame may be transferred to the guest network driver by checking its destination MAC address. First, we check whether the destination MAC matches the NE2000 MAC address. If it does not, the received frame is either a unicast packet for another station or a broadcast or multicast packet, so we check whether the destination MAC is the ff:ff:ff:ff:ff:ff broadcast address or belongs to a multicast group. If the received frame matches none of these conditions, the emulator drops the packet. When the NE2000 network card emulation is configured in promiscuous mode, we ignore all the filters and accept all frames.

If the Ethernet frame is valid and the ring buffer is not full, we prepend the ED header to the received frame and copy both to the Receive Buffer Ring. The ED header gives the guest network driver additional information about the correctness of the packet and its location inside the ring buffer (see Section 5.3.1 for more details about the Receive Buffer Ring). Finally, the guest network driver is notified of the packet reception by asserting the interrupt line.
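The destination address check described above can be written down as a short decision function. The following Python sketch is a simplified illustration with hypothetical names: `accept_frame` treats every address with the group bit set as an accepted multicast address when `accept_multicast` is true, which is coarser than a per-group filter, and it rejects frames shorter than an Ethernet header.

```python
BROADCAST = bytes([0xFF] * 6)
ETH_HEADER_LEN = 14


def accept_frame(frame: bytes, station_mac: bytes,
                 promiscuous: bool = False,
                 accept_multicast: bool = True) -> bool:
    if len(frame) < ETH_HEADER_LEN:
        return False                  # too short to carry an Ethernet header
    if promiscuous:
        return True                   # all filters are bypassed
    dst = frame[:6]
    if dst == station_mac:
        return True                   # unicast frame addressed to this card
    if dst == BROADCAST:
        return True
    if dst[0] & 0x01:                 # group bit set: multicast address
        return accept_multicast       # assumption: no per-group filtering
    return False                      # unicast frame for another station


def make_frame(dst: bytes) -> bytes:
    # Hypothetical helper: destination, source, EtherType, padded payload.
    return dst + bytes.fromhex("020000000001") + b"\x08\x00" + bytes(46)


mac = bytes.fromhex("00a0984a0eee")
other = bytes.fromhex("00a0984a0ede")
multicast = bytes.fromhex("01005e000001")
results = [accept_frame(make_frame(d), mac) for d in (mac, BROADCAST, multicast, other)]
print(results)
```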

Packet Transmission

The transmission mechanism involves the Transmit Buffer presented in Section 5.3.1, where the NE2000 driver stores the packet and then waits for the device to complete the transmission. In other words, the packets are transmitted sequentially: the guest driver waits for the completion of each packet before transmitting a new one. Once the packet has been prepared and placed in the Transmit Buffer by the NE2000 driver, the transmission starts when the TXP bit is set in the Command Register. The packet already contains all the necessary headers (Ethernet, IP, Transport, Application), so it is passed unmodified to the host network stack through the tap interface using the write system call.
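A minimal Python sketch of this transmit path is given below. It is illustrative only: the dictionary-based device state and the function `handle_command_write` are hypothetical, the RAM is indexed from page 0, and the bit values for TXP and for the packet-transmitted flag follow the usual DP8390 register layout and should be checked against the datasheet. The frame is handed to a callback that stands in for the write on the tap file descriptor.

```python
PAGE_SIZE = 256
CR_TXP = 0x04    # Command Register: start transmission (assumed bit value)
ISR_PTX = 0x02   # Interrupt Status Register: packet transmitted (assumed)


def handle_command_write(dev: dict, value: int, send_frame, raise_irq) -> None:
    """Emulate a guest write to the Command Register."""
    if value & CR_TXP:
        start = dev["tpsr"] * PAGE_SIZE           # Transmit Page Start
        length = dev["tbcr"]                      # Transmit Byte Count (0,1)
        send_frame(bytes(dev["ram"][start:start + length]))
        dev["isr"] |= ISR_PTX                     # report completion
        raise_irq()
        value &= ~CR_TXP                          # transmission has finished
    dev["cr"] = value


dev = {"ram": bytearray(32 * 1024), "tpsr": 0x40, "tbcr": 60, "isr": 0, "cr": 0}
dev["ram"][0x40 * PAGE_SIZE:0x40 * PAGE_SIZE + 60] = b"\x11" * 60

sent_frames = []
irq_count = []
handle_command_write(dev, 0x22 | CR_TXP,
                     sent_frames.append, lambda: irq_count.append(1))
print(len(sent_frames[0]), hex(dev["cr"]), hex(dev["isr"]))
```

Because the emulated transmission completes inside the register write, the guest driver sees the completion interrupt before it can queue another packet, which matches the one-packet-at-a-time behaviour of the Transmit Buffer.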

5.4 Implementation Details

We end the NE2000 emulation chapter with some implementation details that we consider important to present. We begin with the initialization phase of the NE2000 emulator, continue with the software reset protocol, and finish by explaining how we protect the emulator data structures in a multithreaded environment. Note that the development of the NE2000 emulation in bhyve was branched from the /mirror/FreeBSD/head:285620 revision by cloning the source tree from the repository [2].

5.4.1 Initialization

The initialization of the NE2000 controller begins in the ne2000_init function, which allocates and initializes the data structures of the NE2000 device. It parses the input configuration options, looking for the MAC address of the NE2000 network interface and for the name of the network tap interface used in the packet transfer emulation. It then opens and configures the tap interface and registers the corresponding file descriptor with the mevent system. The management of the network tap interface and of the mevent system is presented in Section 5.3.2, Packet Transfer Emulation. At the end of the initialization process, the NE2000 emulator initializes some registers in order to configure the speed and the mode of the network medium type as 10BaseT and full-duplex, and stores the network MAC address in the first 6 words of the RAM. This information is read by the NE2000 network driver in its probe routines.

Because the NE2000 network card emulation can run under both PCI and LPC attachments there are specific initialization routines for each attachment that are presented below.

PCI Initialization

We handle the PCI-specific initialization in the pci_ne2000_init function. In order to be enumerated on the PCI bus, our PCI adapter has to implement a subset of the PCI standard type configuration header register set. More precisely, we set the PCIR_DEVICE and PCIR_VENDOR registers to the values 0x8029 and 0x10ec, which means that we emulate the RTL8029 PCI card as a generic NE2000 device. In the PCI initialization function of the NE2000 emulator we also allocate and register the BAR registers. There are two sets of registers in the NE2000 register address space: the NIC registers and the ASIC registers. They are accessed by the CPU through a single contiguous IO port range, so we allocate one BAR register of 32 ports for both the NIC and the ASIC ports. For more details regarding the I/O ports, see Section 5.2.1. At the end of this function we reserve the IRQ line of the PCI slot, which the NE2000 network card uses to raise interrupts. For more details regarding the interrupts, see Section 5.2.2.

LPC Initialization

When running under the LPC attachment the initialization is simpler, because we do not need to configure any specific registers. We just register the IO ports used by the NE2000 network driver to access the device registers. There are two sets of registers in the NE2000 register address space: the NIC registers and the ASIC registers. They are accessed by the CPU through a single contiguous IO port range, so we allocate one range of 32 IO ports for both the NIC and the ASIC registers. For more details regarding the I/O ports, see Section 5.2.1. Since the base addresses and the IRQ lines are fixed and specified by the IBM PC standard, we hardcode these values in our data structures. The same values must also be added to the /boot/device.hints file of the guest virtual machine. See Section 5.2 for more details.

5.4.2 Software Reset command

The NE2000 network driver uses this command to take the controller offline. The Software Reset command is issued by setting the STP bit in the Command Register. When the driver issues this command, the controller enters the Software Reset State, and no packets are received or transmitted. When we receive this command, we reset the Packet Ring Buffer registers, disable the interrupts and set the RST bit in the ISR register.

A hardware characteristic that is somewhat difficult to emulate in software is that the controller does not enter the Software Reset State at the moment the STP bit is set, but about 5 microseconds later. Because of that, the driver polls the ISR register and checks the RST bit in order to find out when the controller has entered the Reset state. To emulate this hardware characteristic, we do not perform the Software Reset protocol as soon as the STP bit is set, but only when the driver polls the ISR register after setting the STP bit. Otherwise we could enter the reset state multiple times. In other words, each time the network driver reads the ISR register while the STP bit is set in the Command Register, the emulator enters the Software Reset State and calls the Software Reset procedure.
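The deferred reset can be sketched in a few lines of Python. This is a hedged illustration rather than the emulator's implementation: the class `ResetModel` and its fields are hypothetical, and the bit values for STP and RST follow the usual DP8390 layout and should be verified against the datasheet.

```python
CR_STP = 0x01    # Command Register: stop (assumed bit value)
ISR_RST = 0x80   # Interrupt Status Register: reset state (assumed bit value)


class ResetModel:
    def __init__(self):
        self.cr = 0x02        # controller started
        self.isr = 0x00
        self.imr = 0x3F       # some interrupts enabled by the driver
        self.pstart = 0x46
        self.curr = 0x50      # ring pointers somewhere inside the ring
        self.bnry = 0x4A

    def write_cr(self, value: int) -> None:
        # Setting STP only records the request; nothing is reset yet.
        self.cr = value

    def read_isr(self) -> int:
        # The reset is carried out when the driver polls the ISR register.
        if self.cr & CR_STP:
            self._software_reset()
        return self.isr

    def _software_reset(self) -> None:
        self.curr = self.bnry = self.pstart   # reset the ring registers
        self.imr = 0x00                       # disable the interrupts
        self.isr |= ISR_RST                   # report the Reset state


nic = ResetModel()
nic.write_cr(CR_STP)
before_poll = nic.isr
after_poll = nic.read_isr()
print(hex(before_poll), hex(after_poll), hex(nic.curr))
```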

5.4.3 Multithreading environment

The NE2000 network card emulation runs in two different contexts of execution. The accesses to the IO registers and the transmission of packets are performed in the context of execution of the guest virtual machine, which is the bhyve main thread. The packet reception flow, on the other hand, runs in a different context, which is the mevent thread. Both contexts access the registers and the NIC memory concurrently, so we have to synchronize their access to the shared data. To achieve this synchronization we use a single mutex that serializes the packet reception flow with the IO register accesses and the packet transmission flow.

Chapter 6

Device Emulation Evaluation

This chapter presents the results and the features developed in this project, together with the testing of both the ATA/ATAPI and the NE2000 emulations. For each emulation we describe the configuration of the bhyve hypervisor needed to enable these features and the results that show the emulation is functional. In both sections of the chapter we also describe the correctness validation and the performance evaluation that we performed for the ATA/ATAPI and NE2000 emulations. Before that, Table 6.1 presents the hardware setup that was used for the development, validation and performance evaluation of the two emulations. This information is useful for interpreting the test results, because the performance of the implementation depends on the hardware on which the hypervisor runs.

Processor             Intel(R) Core(TM) i3-3220 CPU @ 3.30GHz
Processor frequency   3.30GHz
CPU(s)                4
Thread(s) per core    2
L1 cache              32 KB data + 32 KB instruction
L2 cache              512K
L3 cache              3072K
Memory                8GiB System Memory
Memory Frequency      1333 MHz
Storage System        ATA Disk on SATA AHCI Controller
Operating System      FreeBSD 10.0-RELEASE
IDE                   vim, cscope

Table 6.1: Hardware configuration

6.1 ATA/ATAPI

6.1.1 Configuration

Before presenting the configuration of the ATA controller, we explain the meaning of primary/secondary and master/slave, because these terms help in understanding the configuration parameters. In general, motherboards have two IDE interfaces (primary and secondary), also known as channels. Only ATA/ATAPI drives can be connected to the primary and secondary IDE channels. Each interface can support two devices, making a total of up to four ATA/ATAPI drives. The two drives on a channel have to decide for themselves how to share that ATA channel. To accomplish this, one drive on each channel is assigned as the "master" and the other drive (if present) is assigned as the "slave". This leaves us with the following possibilities:

• Primary Master Drive
• Primary Slave Drive
• Secondary Master Drive
• Secondary Slave Drive

PCI configuration

When attached to the PCI bus, the ATA emulation supports both ATA channels emulated at the same time. To configure two ATA channels under the PCI attachment, the configuration string is:

    -s N:M,ata-hd,"0,./MASTER,./SLAVE;1,./MASTER,./SLAVE"

where the semicolon character separates the two channels. Each channel can be configured with one or two drives, and each drive can be either ATA or ATAPI. The configuration string for a single ATA channel is:

    -s N:M,ata-hd,X,./MASTER,./SLAVE

or

    -s N:M,ata-hd,X,./MASTER

where N:M is the PCI slot information and X is the ATA channel, 0 or 1 (Primary or Secondary channel), followed by the names of the drives. When only one drive is given, it represents the master drive.

LPC configuration

The LPC bus has two IDE interfaces (primary and secondary), also known as channels. As with the PCI attachment, each channel can support two devices, making a total of up to four ATA/ATAPI drives. The configuration parameters for the LPC attachment are:

    -l ata-hd,X,./MASTER,./SLAVE

or

    -l ata-hd,X,./MASTER

Unlike the PCI attachment where the driver probes the ATA channel on the fly using the PCI bus enumeration, when running attached on the LPC bus the ATA driver needs some hints about the addresses of the IO ports and the IRQ numbers for each ATA channel. We do that by adding the boot hints from the Listing 6.1 in the /boot/device.hints configuration file:

    hint.ata.0.at="isa"
    hint.ata.0.port="0x1F0"
    hint.ata.0.irq="14"
    hint.ata.1.at="isa"
    hint.ata.1.port="0x170"
    hint.ata.1.irq="15"

Listing 6.1: ATA Boot Device Hints

ATA/ATAPI devices

The ATA/ATAPI emulation works with two types of devices: ATA disks and ATAPI cdroms. To distinguish ATAPI cdroms from ATA disks we use a simple naming convention. ATAPI CDROM devices are configured like ATA disks, except that the name of the image file must have the iso extension. Otherwise the device is considered a regular ATA drive. For example, to add an ATAPI CDROM on ATA channel 0 with the media from the release.iso image, the configuration parameters are:

    -l ata-hd,0,./release.iso

Another example is to boot the virtual machine using an ATA disk and an ATAPI cdrom:

    -l ata-hd,0,./disk.img,./release.iso

In this case the ATA disk is the master drive, while the ATAPI cdrom is the slave drive.
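The configuration syntax and the naming convention can be summarized by a small parser. The Python sketch below is only an illustration of the rules stated in this section: `parse_channel` and `parse_ata_option` are hypothetical names, the extension check is case-sensitive by assumption, and the real option parsing in the emulator may differ.

```python
from typing import List, Tuple

Drive = Tuple[str, str, str]          # (role, image path, device kind)


def parse_channel(spec: str) -> Tuple[int, List[Drive]]:
    """Parse 'X,./MASTER[,./SLAVE]' for one ATA channel."""
    fields = spec.split(",")
    channel = int(fields[0])
    paths = fields[1:]
    if channel not in (0, 1):
        raise ValueError("the ATA channel must be 0 or 1")
    if not 1 <= len(paths) <= 2:
        raise ValueError("a channel takes one or two drives")
    drives = []
    for role, path in zip(("master", "slave"), paths):
        # Naming convention: an image with the iso extension is an ATAPI cdrom.
        kind = "atapi-cdrom" if path.endswith(".iso") else "ata-disk"
        drives.append((role, path, kind))
    return channel, drives


def parse_ata_option(option: str) -> List[Tuple[int, List[Drive]]]:
    """Parse the part after 'ata-hd,'; channels are separated by ';'."""
    return [parse_channel(spec) for spec in option.split(";")]


single = parse_ata_option("0,./disk.img,./release.iso")
double = parse_ata_option("0,./MASTER,./SLAVE;1,./release.iso")
print(single)
print(double)
```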

6.1.2 Results of ATA/ATAPI emulation

We start by presenting some outputs of the ATA drivers from the guest virtual machine. They cover different boot scenarios with ATA disk and ATAPI cdrom drives running under both the PCI and the LPC attachments. Note that in all these test configurations the guest virtual machine runs the FreeBSD operating system. Listing 6.2 shows the output of the ATA pci driver from the guest virtual machine, which probes our PCI ATA Host Adapter. We observe that the ATA Host Adapter is successfully recognized by the atapci driver and that the BAR addresses are properly allocated (BAR1 = 0x2020-0x2027, BAR2 = 0x2028-0x202b, BAR3 = 0x170-0x177, BAR4 = 0x376, BAR5 = 0x2040-0x204f). Both ATA channels of the adapter are handled by the ata0 and ata1 drivers.

    atapci0: (ITE IT8211F UDMA133 controller)
    port 0x2020-0x2027,0x2028-0x202b,0x170-0x177,
    0x376,0x2040-0x204f at device 3.0 on pci0
    ata0: at channel 0 on atapci0
    ata1: at channel 1 on atapci0

Listing 6.2: PCI ATA channels

Once the ATA channels have been probed, the ATA driver looks for the ATA drives. Because we specified both ATA drives as ATA disks, Listing 6.3 shows two instances of the ada driver, ada0 and ada1, where the ada driver is responsible for handling ATA disk drives. In this case the first ATA disk drive is the master drive, whereas the second ATA disk drive is the slave.

    ada0 at ata0 bus 0 scbus0 target 0 lun 0
    ada0: ATA-6 device
    ada0: Serial Number 123456
    ada0: 16.700MB/s transfers (WDMA2, PIO 65536bytes)
    ada0: 8192MB
    ada0: (16777216 512 byte sectors: 16H 63S/T 16644C)
    ada0: Previously was known as ad0
    ada1 at ata0 bus 0 scbus0 target 1 lun 0
    ada1: ATA-6 device
    ada1: Serial Number 123456
    ada1: 16.700MB/s transfers (WDMA2, PIO 65536bytes)
    ada1: 8192MB
    ada1: (16777216 512 byte sectors: 16H 63S/T 16644C)
    ada1: Previously was known as ad1

Listing 6.3: ATA Disk Drive under PCI

Another scenario is to attach the ATA channels under the LPC bus, as we do in Listing 6.4. In contrast with the PCI attachment, where the ATA pci driver allocates the BAR registers, here the ATA driver probes the IO ports and the IRQ line addressed through the LPC/ISA bus. Note that the values of the IO port addresses and IRQ lines are the ones specified in the /boot/device.hints configuration file of the guest virtual machine.

    ata0: at port
    ata0: 0x1f0-0x1f7,0x3f6 irq 14 on isa0
    ata1: at port
    ata1: 0x170-0x177,0x376 irq 15 on isa0

Listing 6.4: LPC ATA channels

As in the case where the ATA channel is attached under the PCI bus, the ada driver probes the ATA disk drive, as we see in Listing 6.5. In contrast with the PCI attachment, the ATA channel provides no DMA channel, so the only available data transfer protocol is the PIO protocol.

    ada1 at ata1 bus 0 scbus2 target 0 lun 0
    ada1: ATA-6 device
    ada1: Serial Number 123456
    ada1: 16.700MB/s transfers (PIO4, PIO 65536bytes)
    ada1: 8192MB
    ada1: (16777216 512 byte sectors: 16H 63S/T 16644C)
    ada1: Previously was known as ad2

Listing 6.5: ATA Disk Drive under LPC

Since we can also attach an ATAPI cdrom drive instead of an ATA disk drive on the ATA channel, let us see how the ATAPI cdrom gets probed. In Listing 6.6 there is an ATAPI cdrom drive on the ata1 ATA channel attached under the LPC bus. The cd driver probes the cdrom device and prints some information about it, obtained using the ATAPI commands.

    cd0 at ata1 bus 0 scbus2 target 0 lun 0
    cd0:
    cd0: Removable CD-ROM SCSI-0 device
    cd0: Serial Number 123456
    cd0: 16.700MB/s
    cd0: transfers (PIO4, ATAPI 12bytes, PIO 65534bytes)
    cd0: cd present [350001 x 2048 byte records]

Listing 6.6: ATAPI cdrom Drive under LPC

If the guest virtual machine is being installed and we reach the Partitioning phase of the FreeBSD Installer, we are able to choose either of the two disks on which to install the operating system, as we see in Listing 6.7. As one can see, the model implemented by the ATA emulator is an ATA-6 disk device which supports both the PIO and the WDMA2 data transfer protocols. Note that the speed of 16.700MB/s in the console log is not the actual speed of the data transfer between the guest operating system and the emulator, since the guest virtual machine has not performed any speed test. It is a hardcoded value reported by the ada driver and represents the speed specified by the WDMA2 standard.

    Select the disk on which to install FreeBSD.
    vtbd0 654 MB Disk
    ada0 8.0 GB ATA Hard Disk (BHYVE ATA IDE DISK)
    ada1 8.0 GB ATA Hard Disk (BHYVE ATA IDE DISK)

Listing 6.7: FreeBSD Installer - Partitioning

At the moment the guest FreeBSD virtual machine can be installed on either the ada0 or the ada1 device and runs with no restrictions on the ATA disk emulation. Note that the size of the ATA disk is limited to a maximum of 128 GB due to the LBA 28-bit addressing. See Section 4.3 for an explanation of why the maximum size is 128 GB when working with LBA 28-bit addressing. For more information about how to configure all these combinations of ATA disk drives and ATAPI cdrom drives in the bhyve hypervisor, see Section 6.1.1, where the configuration parameters are presented in detail.
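The 128 GB limit follows directly from the width of the sector address. As a quick check, the short Python sketch below computes the addressable capacity for a given LBA width, assuming 512-byte sectors as reported in the listings above; the function name `max_capacity_bytes` is ours.

```python
SECTOR_SIZE = 512  # bytes per sector, as reported by the ada driver


def max_capacity_bytes(lba_bits: int, sector_size: int = SECTOR_SIZE) -> int:
    # An n-bit LBA can address 2**n sectors.
    return (1 << lba_bits) * sector_size


lba28_limit = max_capacity_bytes(28)
lba28_limit_gib = lba28_limit // 2**30
emulated_disk_mib = 16777216 * SECTOR_SIZE // 2**20   # sector count from Listing 6.3

print(lba28_limit, lba28_limit_gib, emulated_disk_mib)
```

With 28 bits the limit is 2^28 sectors of 512 bytes, that is 2^37 bytes or 128 GiB, while the 16777216 sectors of the emulated disk give the 8192MB shown in the probe output.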

6.1.3 Validation and Performance

In this section we describe the validation process and the performance measurements that we performed for the ATA/ATAPI emulation. Considering that we managed to boot a guest operating system from an ISO image read through our ATAPI cdrom emulation, and to install it on the disk image through our ATA disk emulation, one can say that the ATA/ATAPI emulation proved its correctness. Even though this statement may be true, we wanted to make sure that there is no bug in the implementation, and we needed a fast, automatic process to test the IO accesses to the ATA disk. Hence, we built an automated solution that performs a set of tests inside the virtual machine to validate the emulated disk. The main test scenario is to write some random data to the disk and, after a reboot, read it back into a different buffer in order to compare the two. We did this by running the dd tool with different parameters from a shell script. One test case uses an approach similar to Listing 6.8, where we varied the BLOCK_SIZE and MAX_SECTORS variables in order to cover the whole disk with different chunk sizes. All the tests we performed passed, which supports the correctness of the ATA/ATAPI implementation.

    # Write one block of random data to the disk at the chosen offset.
    dd bs=$BLOCK_SIZE count=1 if=/dev/random of=tests/testX
    dd bs=$BLOCK_SIZE count=1 if=tests/testX of=/dev/ada1 oseek=$MAX_SECTORS
    reboot
    # After the reboot, read the block back and compare it with the original.
    dd bs=$BLOCK_SIZE count=1 if=/dev/ada1 of=out/testX iseek=$MAX_SECTORS
    diff out/testX tests/testX
    if ($status != 0) then
        echo "Test Failed"
    endif

Listing 6.8: ATA Validation Tests

Regarding the performance of the ATA/ATAPI implementation, we measured the transfer rates between the guest virtual machine and the ATA disk emulation by running the diskinfo tool inside the virtual machine. The complete command, diskinfo -t /dev/ada1, takes as its parameter the ATA device to be measured, in our case the ATA disk emulated by our implementation. The output of this command is shown in Listing 6.9, and we observe transfer rates greater than 100MB/s.

    outside: 102400 kbytes in 0.719681 s = 142285 kB/s
    middle: 102400 kbytes in 0.796385 s = 128581 kB/s
    inside: 102400 kbytes in 0.781721 s = 130993 kB/s

Listing 6.9: ATA Transfer rates

Note that for these data transfers the ATA driver used the READ DMA and WRITE DMA commands, so the ATA implementation executed the transfers using the DMA protocol. To evaluate this result of more than 100MB/s, we compare it against the transfer rate specified by the ATA-6 standard for Multi-word DMA mode 2, which is 16.7MB/s. We get better results because we emulate the data transfers by transferring the data to the disk of the FreeBSD host where bhyve runs. Hence, we depend on the performance of the storage system on which the ATA implementation runs. Since we tested on a hardware configuration with a SATA AHCI Controller, these good results are explained.

We should mention that some ATA disks come with the Ultra DMA feature, which provides data transfer rates of up to 100MB/s. If we implemented that feature, the performance of the ATA emulation would be much better, because there is less overhead in the Ultra DMA protocol. We chose not to implement this feature because the goal of our work is not to offer high transfer rates, but to provide better compatibility with older operating systems.

Since there is another storage emulation in the bhyve hypervisor, we found it interesting to compare it against the ATA/ATAPI emulation. This is the SATA AHCI controller emulation, which provides a fast disk storage solution in bhyve. Even though the AHCI and ATA controllers are quite different and do not share the same data transfer protocols, measuring the performance of the AHCI module and comparing it against our ATA emulator is a good indicator for our evaluation. Applying the same procedure with the diskinfo tool on the partition controlled by the AHCI driver, we get the transfer rates presented in Listing 6.10.

    outside: 102400 kbytes in 0.326701 s = 313436 kB/s
    middle: 102400 kbytes in 0.339824 s = 301332 kB/s
    inside: 102400 kbytes in 0.348496 s = 293834 kB/s

Listing 6.10: AHCI Transfer rates

We observe that the transfer rate of the AHCI emulation is roughly 2.2 to 2.3 times higher than that of the ATA emulation. If we take into consideration that the AHCI emulation presents a SATA interface, for which transfer rates of 300MB/s are specified (the UDMA6 mode by itself is specified at 133MB/s), and that we also tested on a hardware configuration with a SATA AHCI Controller, these results are explicable. Note that the AHCI emulation has an insignificant overhead for the guest virtual machine.

We wanted to dig deeper and find a better explanation for why the ATA emulation seems to introduce such an overhead in the data transfers. Perhaps the most important cause is the LBA 28-bit addressing used by the ATA emulation, whereas the AHCI emulation uses the LBA 48-bit addressing mode. The latter can transfer more data sectors (65536) per DMA transaction, which means fewer IO port accesses and interrupts, and therefore less overhead in the communication between the virtual machine driver and the storage emulation. Again, we chose not to implement the LBA 48-bit addressing mode because the goal of our work is not to offer high transfer rates, but to provide better compatibility with older operating systems.
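The figures reported by diskinfo can be cross-checked with a few lines of arithmetic. The Python sketch below recomputes the transfer rates from the sizes and times printed in Listings 6.9 and 6.10 and derives the AHCI to ATA ratio for each disk zone; the variable names are ours.

```python
KBYTES = 102400  # amount of data moved in each diskinfo transfer test

# Elapsed times in seconds for the outside, middle and inside zones.
ata_times = [0.719681, 0.796385, 0.781721]     # Listing 6.9
ahci_times = [0.326701, 0.339824, 0.348496]    # Listing 6.10

ata_rates = [KBYTES / t for t in ata_times]    # kB/s
ahci_rates = [KBYTES / t for t in ahci_times]  # kB/s
ratios = [ahci / ata for ahci, ata in zip(ahci_rates, ata_rates)]

for zone, ata, ahci, ratio in zip(("outside", "middle", "inside"),
                                  ata_rates, ahci_rates, ratios):
    print(f"{zone}: ATA {ata:.0f} kB/s, AHCI {ahci:.0f} kB/s, ratio {ratio:.2f}")
```

The recomputed rates agree with the values printed by diskinfo, and every ratio falls between 2.2 and 2.35.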

6.2 NE2000

6.2.1 Configuration

The NE2000 network card emulation can run in the bhyve hypervisor under both the PCI and the LPC attachments. In order to use the NE2000 network card, one has to configure the device parameters in the bhyve configuration input. The general configuration string provides the name of the network tap interface used in the packet transfer emulation and the MAC address of the NE2000 emulation device: -ne2k,TAP[,HWaddr].

Before starting the bhyve virtual machine with the NE2000 network card emulation enabled, the host has to be prepared to provide the network tap interface. In the FreeBSD operating system, one can create the network tap interface for the NE2000 network device using the ifconfig tool. If the network card is to participate in the network, one also has to create a bridge interface which contains the network tap interface and the host physical interface as members. This can also be done with the ifconfig tool. The complete preparation of the host is presented in Listing 6.11.

    ifconfig tap0 create
    ifconfig bridge0 create
    ifconfig bridge0 addm igb0 addm tap0
    ifconfig bridge0 up

Listing 6.11: Preparing the Host

The reason why we provide the network tap interface to the NE2000 emulator is explained in Section 5.3.2, which presents the mechanisms of the packet transfer emulation. If the hardware MAC address is missing from the device configuration parameters, the NE2000 emulation uses the default MAC address: 00:a0:98:4a:0e:ee. Below we describe the NE2000 configuration parameters for both the PCI and the LPC attachments.

In order to attach the NE2000 network card to the PCI bus we simply add:

    -s 4:0,ne2k,/dev/tap0

to the bhyve configuration input. The /dev/tap0 device is the network tap interface to which the NE2000 device attaches. The device parameters of the NE2000 network card when running under the LPC bus are very similar:

    -l ne2k0,/dev/tap0

Under the LPC bus, one can add up to two NE2000 network card devices: ne2k0 and ne2k1. To have two NE2000 network cards in the virtual machine simultaneously, both ne2k0 and ne2k1 have to be added:

    -l ne2k0,/dev/tap0 -l ne2k1,/dev/tap1,00:a0:98:4a:0e:de

In this situation, the host has to be prepared to provide two network tap interfaces. Here, the first NE2000 network card uses the default MAC address, while the second device uses the MAC address provided as a parameter.

Unlike the PCI attachment, where the driver probes the NE2000 network card on the fly using the PCI bus enumeration, when running attached to the LPC bus the ED driver needs some hints about the addresses of the IO ports and the IRQ numbers of each NE2000 controller. We provide them by adding the boot hints from Listing 6.12 to the /boot/device.hints configuration file:
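The handling of the optional MAC address can be illustrated with a short parser for the TAP[,HWaddr] part of the device parameters. This Python sketch is an assumption-laden illustration rather than the emulator's parsing code: `parse_ne2k` is a hypothetical name, and only the colon-separated MAC notation used in this section is accepted.

```python
from typing import Tuple

DEFAULT_MAC = "00:a0:98:4a:0e:ee"   # default address stated in the text


def parse_ne2k(options: str) -> Tuple[str, bytes]:
    """Parse 'TAP[,HWaddr]' and return (tap path, 6-byte MAC address)."""
    fields = options.split(",")
    if not 1 <= len(fields) <= 2 or not fields[0]:
        raise ValueError("expected TAP[,HWaddr]")
    mac_text = fields[1] if len(fields) == 2 else DEFAULT_MAC
    parts = mac_text.split(":")
    if len(parts) != 6:
        raise ValueError("a MAC address has six colon-separated bytes")
    mac = bytes(int(part, 16) for part in parts)   # rejects values above 0xff
    return fields[0], mac


first_card = parse_ne2k("/dev/tap0")
second_card = parse_ne2k("/dev/tap1,00:a0:98:4a:0e:de")
print(first_card[0], first_card[1].hex(":"))
print(second_card[0], second_card[1].hex(":"))
```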

    hint.ed.0.at="isa"
    hint.ed.0.port="0x310"
    hint.ed.0.irq="10"
    hint.ed.1.at="isa"
    hint.ed.1.port="0x330"
    hint.ed.1.irq="11"

Listing 6.12: NE2000 Boot Device Hints
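The relation between the LPC device parameters and the boot hints can be summarised in a short Python sketch. It is not the bhyve option parser: parse_lpc_ne2k and device_hints are hypothetical helpers, and the mapping of ne2k0 and ne2k1 to ports 0x310 and 0x330 is an assumption read off the listings in this chapter. The sketch shows the default MAC fallback and the hint lines each card needs.

```python
import re

DEFAULT_MAC = "00:a0:98:4a:0e:ee"
# Assumed mapping of LPC device names to (ed unit, I/O base, IRQ).
LPC_RESOURCES = {"ne2k0": (0, 0x310, 10), "ne2k1": (1, 0x330, 11)}
MAC_RE = re.compile(r"[0-9a-f]{2}(:[0-9a-f]{2}){5}")


def parse_lpc_ne2k(option):
    """Parse 'ne2kN,/dev/tapM[,mac]' into a dict (hypothetical helper)."""
    fields = option.split(",")
    if len(fields) not in (2, 3):
        raise ValueError(f"expected name,tap[,mac]: {option!r}")
    name, tap = fields[0], fields[1]
    if name not in LPC_RESOURCES:
        raise ValueError(f"only ne2k0 and ne2k1 are available: {name!r}")
    # A missing MAC field falls back to the default address.
    mac = fields[2].lower() if len(fields) == 3 else DEFAULT_MAC
    if MAC_RE.fullmatch(mac) is None:
        raise ValueError(f"malformed MAC address: {mac!r}")
    unit, port, irq = LPC_RESOURCES[name]
    return {"unit": unit, "tap": tap, "mac": mac, "port": port, "irq": irq}


def device_hints(unit, port, irq):
    """Return the /boot/device.hints lines for one ISA-attached ed unit."""
    return [
        f'hint.ed.{unit}.at="isa"',
        f'hint.ed.{unit}.port="0x{port:x}"',
        f'hint.ed.{unit}.irq="{irq}"',
    ]


if __name__ == "__main__":
    for option in ("ne2k0,/dev/tap0", "ne2k1,/dev/tap1,00:a0:98:4a:0e:de"):
        card = parse_lpc_ne2k(option)
        print(card["mac"])
        print("\n".join(device_hints(card["unit"], card["port"], card["irq"])))
```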

6.2.2 Results of NE2000 emulation

In this section we present outputs of the NE2000 network drivers in the guest virtual machine as they probe the NE2000 emulation. We cover different boot scenarios, with the NE2000 network card running under both PCI and LPC attachments. Note that in all these test configurations the guest virtual machine ran the FreeBSD operating system.

In the first example we attach the NE2000 network card on the PCI bus. Listing 6.13 shows the output produced when the NE2000 PCI driver in the guest virtual machine probes our NE2000 network card. We observe that the device is successfully recognized by the ED driver and that the hardware resources are properly allocated (port 0x2000-0x201f irq 17).

    ed0: port 0x2000-0x201f irq 17 at device 4.0 on pci0
    ed0: Ethernet address: 00:a0:98:4a:0e:ee

Listing 6.13: PCI NE2000 attachment

Once the NE2000 PCI driver identifies the device, it continues the probing and creates the ed0 network interface presented in Listing 6.14. The driver reads the device capabilities, such as the speed and the mode of the network medium, from the hardware registers and the device RAM. The ed0 network interface works at 10baseT speed in full-duplex mode.

    ed0: flags=8802 metric 0 mtu 1500
        ether 00:a0:98:4a:0e:ee
        nd6 options=29
        media: Ethernet 10baseT/UTP

Listing 6.14: NE2000 under PCI

Another scenario is to attach two NE2000 network cards under the LPC bus, as we do in Listing 6.15. In contrast with the PCI attachment, where the NE2000 PCI driver allocates the BAR registers, here the NE2000 ISA driver probes the IO ports and the IRQ line addressed through the LPC/ISA bus. Note that the values of the IO port addresses and IRQ lines are the ones specified in the /boot/device.hints configuration file of the guest virtual machine.

    ed0 at port 0x310-0x32f irq 10 on isa0
    ed0: Ethernet address: 00:a0:98:4a:0e:ee
    ed1 at port 0x330-0x34f irq 11 on isa0
    ed1: Ethernet address: 00:a0:98:4a:0e:de

Listing 6.15: LPC NE2000 attachment
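A quick consistency check between the probe messages and the boot hints can be scripted. The Python sketch below is illustrative only: the regular expression assumes exactly the message shape shown in Listing 6.15, the expected values are copied from the boot hints, and parse_probe_line and matches_hints are hypothetical helpers rather than part of FreeBSD. It also confirms that each card occupies a window of 32 I/O ports, as the ranges in the listing indicate.

```python
import re

# Expected (I/O base, IRQ) per interface, copied from /boot/device.hints.
HINTS = {"ed0": (0x310, 10), "ed1": (0x330, 11)}
PORTS_PER_CARD = 32  # 0x310-0x32f and 0x330-0x34f both span 32 ports

PROBE_RE = re.compile(
    r"^(ed\d+) at port (0x[0-9a-f]+)-(0x[0-9a-f]+) irq (\d+) on isa0$"
)


def parse_probe_line(line):
    """Return (name, first_port, last_port, irq), or None for other lines."""
    match = PROBE_RE.match(line.strip())
    if match is None:
        return None
    name, first, last, irq = match.groups()
    return name, int(first, 16), int(last, 16), int(irq)


def matches_hints(line):
    """True if a probe line agrees with the boot hints and the window size."""
    parsed = parse_probe_line(line)
    if parsed is None:
        return False
    name, first, last, irq = parsed
    window_ok = last - first + 1 == PORTS_PER_CARD
    return HINTS.get(name) == (first, irq) and window_ok


if __name__ == "__main__":
    for line in (
        "ed0 at port 0x310-0x32f irq 10 on isa0",
        "ed1 at port 0x330-0x34f irq 11 on isa0",
    ):
        print(line, "->", matches_hints(line))
```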

As in the case of the PCI attachment, the ED driver probes the devices, as shown in Listing 6.16, and creates the network interfaces ed0 and ed1.

    ed0: flags=8802 metric 0 mtu 1500
        ether 00:a0:98:4a:0e:ee
        nd6 options=29
        media: Ethernet 10baseT/UTP
    ed1: flags=8802 metric 0 mtu 1500
        ether 00:a0:98:4a:0e:de
        nd6 options=29
        media: Ethernet 10baseT/UTP

Listing 6.16: NE2000 under LPC

For more information about how to configure these combinations of NE2000 network cards in the bhyve hypervisor, see Section 6.2.1, where the configuration parameters are presented in detail.

6.2.3 Validation and Performance

In this section we describe how we validated the NE2000 emulation and measured its performance.

To test the receive and transmit flows with unicast traffic, the first test case checks the reachability of the host physical machine from the guest virtual machine using the ping tool. Even though this test passes, it is not sufficient, because it transfers only a small number of packets, all of fixed size. Furthermore, we wanted to make sure there is no bug in the implementation, and we needed a fast and automatic process to test the receive and transmit flows of the NE2000 emulation. Hence, we developed an automated transfer test that runs inside the virtual machine and validates the emulated network card.

The main test scenario is to transfer a big file containing random data through the network between two virtual machines that use the NE2000 network interface. The test case is presented in Listing 6.17: Guest1 stores 1 GB of random data in the test1.bin file and sends it to Guest2 using the scp tool. After the transmission is complete, Guest1 copies the file back from Guest2 as test2.bin and compares the two files using the diff tool. The scp tool transfers the file over the TCP transport protocol, so the test exercises both the transmission and the reception flows.

    guest1:~# dd bs=1024 count=1048576 if=/dev/random of=test1.bin
    guest1:~# scp test1.bin root@GUEST2_IP:/root/tests
    guest1:~# scp root@GUEST2_IP:/root/tests/test1.bin test2.bin
    guest1:~# diff test1.bin test2.bin
    guest1:~# if ($status != 0) then
    guest1:~#     echo "Files differ"
    guest1:~# else if ($status == 0) then
    guest1:~#     echo "OK"
    guest1:~# endif

Listing 6.17: NE2000 Validation Tests
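The comparison step of this round-trip test can also be expressed with checksums, which is convenient when the two copies are not on the same machine. The Python sketch below is a stand-in and not the procedure used in the tests above: it swaps diff for SHA-256 digests, uses a small file instead of 1 GB, and replaces the two scp transfers with a local copy. The names file_digest, round_trip_ok and demo are hypothetical.

```python
import hashlib
import os
import shutil
import tempfile


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def round_trip_ok(sent_path, returned_path):
    """True if the returned file has the same content as the sent file."""
    return file_digest(sent_path) == file_digest(returned_path)


def demo(size=64 * 1024):
    """Run the check on an intact copy and on a copy with one flipped byte."""
    with tempfile.TemporaryDirectory() as workdir:
        sent = os.path.join(workdir, "test1.bin")
        returned = os.path.join(workdir, "test2.bin")
        with open(sent, "wb") as handle:
            handle.write(os.urandom(size))
        # Local copy standing in for the transfer to Guest2 and back.
        shutil.copyfile(sent, returned)
        intact = round_trip_ok(sent, returned)
        # Simulate a corrupted transfer by inverting the first byte.
        with open(returned, "r+b") as handle:
            first = handle.read(1)
            handle.seek(0)
            handle.write(bytes([first[0] ^ 0xFF]))
        corruption_detected = not round_trip_ok(sent, returned)
        return intact, corruption_detected


if __name__ == "__main__":
    print(demo())
```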

So far, we tested only unicast traffic. The NE2000 network card must also accept all broadcast traffic, as well as the multicast traffic that matches the multicast filter of the network interface. To test these types of traffic we used the ping tool. For broadcast traffic, we send a ping from the host to the network address of the subnet to which the guest virtual machine belongs. This produces an ICMP packet with the destination MAC address ff:ff:ff:ff:ff:ff, which is received by the NE2000 network card. For multicast traffic, we first add the NE2000 network interface of the guest virtual machine to a multicast group and then send traffic to that group from the host machine. To join the multicast group 224.0.0.2 in a FreeBSD guest virtual machine, we use the mtest tool as in Listing 6.18.

    guest1:~# mtest
    multicast membership test program; enter ? for list of commands
    j 224.0.0.2 ed0
    ok
    q

Listing 6.18: NE2000 in multicast group

Once the ed0 network interface has joined the multicast group, we send a ping from the host to the 224.0.0.2 multicast group and Guest1 replies, as shown in Listing 6.19.

    host:~# ping -c 1 -I 172.16.39.1 224.0.0.2
    PING 224.0.0.2 (224.0.0.2) from 172.16.39.1 : 56(84) bytes of data.
    64 bytes from 172.16.39.10: icmp_req=1 ttl=64 time=1.31 ms

    --- 224.0.0.2 ping statistics ---
    1 packets transmitted, 1 received, 0% packet loss, time 0ms

Listing 6.19: PING to the multicast group
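The acceptance rules exercised by the broadcast and multicast tests can be modelled in a few lines of Python. This is a sketch under stated assumptions, not the emulator's code: the mapping of an IPv4 group to an Ethernet address follows the standard rule (prefix 01:00:5e plus the low 23 bits of the group), while the 64-bit hash filter indexed by the top six bits of a big-endian CRC-32 is an assumption about how a DP8390-style controller selects its filter bit. The class ReceiveFilter and all helper names are hypothetical.

```python
import ipaddress

BROADCAST = bytes([0xFF] * 6)


def parse_mac(text):
    """Convert 'aa:bb:cc:dd:ee:ff' into six raw bytes."""
    return bytes(int(part, 16) for part in text.split(":"))


def multicast_mac(group):
    """Map an IPv4 multicast group to its Ethernet destination address."""
    address = ipaddress.IPv4Address(group)
    if not address.is_multicast:
        raise ValueError(f"{group} is not a multicast group")
    low23 = int(address) & 0x7FFFFF
    return bytes([0x01, 0x00, 0x5E]) + low23.to_bytes(3, "big")


def ether_crc32_be(data):
    """Bitwise big-endian CRC-32 over the address bytes (assumed hash)."""
    crc = 0xFFFFFFFF
    for byte in data:
        for _ in range(8):
            carry = (crc >> 31) ^ (byte & 1)
            crc = (crc << 1) & 0xFFFFFFFF
            byte >>= 1
            if carry:
                crc = (crc ^ 0x04C11DB6) | carry
    return crc


def filter_bit(mac):
    """Index (0..63) of the hash filter bit selected by a destination MAC."""
    return ether_crc32_be(mac) >> 26


class ReceiveFilter:
    """Toy model of the receive decision for one station address."""

    def __init__(self, station_mac):
        self.station = parse_mac(station_mac)
        self.hash_filter = 0  # 64-bit multicast hash filter, initially empty

    def join(self, group):
        self.hash_filter |= 1 << filter_bit(multicast_mac(group))

    def accepts(self, destination):
        if destination == BROADCAST:
            return True
        if destination[0] & 1:  # group bit set: multicast frame
            return bool((self.hash_filter >> filter_bit(destination)) & 1)
        return destination == self.station


if __name__ == "__main__":
    rx = ReceiveFilter("00:a0:98:4a:0e:ee")
    rx.join("224.0.0.2")
    print(multicast_mac("224.0.0.2").hex(":"), rx.accepts(multicast_mac("224.0.0.2")))
```

Because a hash filter has only 64 bits, unrelated groups can share a bit; a real driver therefore still checks group membership in software after the card accepts a frame.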

Another feature that we tested for validation is the promiscuous mode of the NE2000 network card. In order to enable this mode, we used the tcpdump tool listening on the ed0 network interface. Once the network card is in this mode, all packets are accepted and passed to the ED driver in the guest virtual machine.

In order to catch synchronization issues in the NE2000 implementation, from time to time we restarted the ed network interface while traffic was running through the NE2000 device. All the tests we performed passed, which supports the correctness of the NE2000 implementation.

Regarding performance, we measured the bandwidth between two virtual machines that use the NE2000 network interface. The test setup consisted of two virtual machines connected through a bridge, with the iperf tool running as server in the Guest1 virtual machine and as client in the Guest2 virtual machine. For UDP we ran the iperf commands from Listing 6.20 and obtained a bandwidth of 0.71 MBytes/sec; for TCP we ran the iperf commands from Listing 6.21 and obtained a bandwidth of 0.67 MBytes/sec.

    Guest1: iperf -s -u
    Guest2: iperf -c 192.168.0.2 -f Mbyte -u -b 7m
    ------
    Client connecting to 192.168.0.2, UDP port 5001
    Sending 1470 byte datagrams
    UDP buffer size: 0.01 MByte (default)
    ------
    [ 3] local 192.168.0.1 port 54052 connected with 192.168.0.2 port 5001
    [ ID] Interval       Transfer     Bandwidth
    [ 3]  0.0-10.0 sec   7.09 MBytes  0.71 MBytes/sec
    [ 3] Sent 5057 datagrams
    [ 3] Server Report:
    [ 3]  0.0-10.0 sec   7.09 MBytes  0.71 MBytes/sec  2.649 ms  0/ 5056 (0%)
    [ 3]  0.0-10.0 sec   1 datagrams received out-of-order

Listing 6.20: UDP bandwidth

    Guest1: iperf -s
    Guest2: iperf -c 192.168.0.2 -f Mbyte
    ------
    Client connecting to 192.168.0.2, TCP port 5001
    TCP window size: 0.03 MByte (default)
    ------
    [ 3] local 192.168.0.1 port 31221 connected with 192.168.0.2 port 5001
    [ ID] Interval       Transfer     Bandwidth
    [ 3]  0.0-10.0 sec   6.75 MBytes  0.67 MBytes/sec

Listing 6.21: TCP bandwidth
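The figures reported by iperf can be cross-checked with a little arithmetic and related to the nominal 10 Mbits/sec line rate of NE2000 hardware. The Python sketch below assumes that iperf counts one MByte as 2^20 bytes, which is consistent with the UDP listing: 5057 datagrams of 1470 bytes give 7.09 MBytes only under that convention. The function names are hypothetical.

```python
MIB = 1 << 20  # assumption: iperf reports MBytes in units of 2**20 bytes


def payload_mib(datagrams, datagram_size):
    """Total payload, in MBytes as iperf counts them."""
    return datagrams * datagram_size / MIB


def mib_per_s_to_mbit_per_s(rate_mib_per_s):
    """Convert an iperf MBytes/sec figure to decimal Mbits/sec."""
    return rate_mib_per_s * MIB * 8 / 1_000_000


def share_of_line_rate(rate_mib_per_s, line_rate_mbit=10.0):
    """Fraction of the nominal line rate reached by a measured rate."""
    return mib_per_s_to_mbit_per_s(rate_mib_per_s) / line_rate_mbit


if __name__ == "__main__":
    print(f"UDP payload: {payload_mib(5057, 1470):.2f} MBytes")
    for label, rate in (("UDP", 0.71), ("TCP", 0.67)):
        mbit = mib_per_s_to_mbit_per_s(rate)
        share = share_of_line_rate(rate)
        print(f"{label}: {mbit:.2f} Mbits/sec, {share:.0%} of 10 Mbits/sec")
```

Under this convention the measured rates correspond to roughly 5.96 and 5.62 Mbits/sec, that is, a little under 60% of the nominal line rate.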

The low bandwidth is explained by the fact that frames are transferred between the guest system memory and the NE2000's Transmit Buffer through the Data IO port using the PIO (Programmed Input/Output) mechanism. Even though the performance is weak, the results are comparable with the bandwidth of the NE2000 hardware device, which is supposed to work at 10 Mbits/sec, that is, 1.25 MBytes/sec. To improve the performance of the NE2000 emulation, one could provide an external DMA controller that the guest driver would use to transfer the frames between the guest system memory and the NE2000's Transmit Buffer. This would result in a smaller number of IO accesses and less overhead in the communication between the virtual machine driver and the NE2000 network card emulation. We did not implement the external DMA controller because the goal of our work is not to offer high transfer rates, but to provide better compatibility with the older operating systems.

Chapter 7

Conclusion and Further Work

The range of applications that rely on software virtualization has grown steadily in recent years, and virtualization is now fundamental to many technologies (for example, cloud computing). A major issue today is supporting many different guest operating systems. Many applications run on legacy operating systems (FreeBSD 4, Windows XP), and their owners do not want to change the setup or upgrade the operating system. However, they need to migrate toward virtualized hosts. Hence, the solution to this problem is to enhance the hypervisors so that they support such operating systems.

This problem concerns all hypervisors, but we focused on the FreeBSD hypervisor (bhyve). We analyzed the key reasons why some older operating systems are not supported in bhyve and which critical parts have to be implemented to improve that, and we found that the main differences lie in the supported hardware devices. Because of that, the main objectives were to implement two device emulations in the FreeBSD hypervisor in order to provide better compatibility with the older operating systems.

First of all, we observed compatibility issues with the media storage devices, such as hard disks, floppy drives, and optical disc drives, where the bhyve hypervisor did not provide enough support for the older operating systems, so we implemented the ATA/ATAPI 6 emulation. We successfully emulated an ATA disk and an ATAPI cdrom, and we managed to boot a virtual machine and install it to the emulated disk. In order to accomplish this objective we implemented the following requirements:

• emulate the I/O port accesses according to the ATA/ATAPI datasheet specification;
• implement the ATA 6 standard and the ATA Packet commands (the ATAPI Packet is used to communicate with the ATAPI cdrom device);
• implement the PIO4 and WDMA2 data transfer protocols working at transfer rates of more than 16.7 MB/s;
• work with both primary and secondary channels, where each of them supports master and slave drives at the same time;
• configure and run the ATA/ATAPI emulation under both PCI and LPC attachments.

Another category of devices where bhyve did not provide good support for the older operating systems is the network card devices, so we implemented the NE2000 device emulation. We successfully emulated an NE2000 network card device and we managed to have Internet connectivity in the guest virtual machine. In order to accomplish this objective we implemented the following requirements:

• emulate the I/O port accesses according to the NE2000 datasheet specification;
• implement the PIO data transfer protocol;
• implement the management of the Packet Ring Buffers used in the packet transfer;
• find a solution to transfer the Ethernet frames between the NE2000 guest network driver and the host networking stack;
• configure and run the NE2000 emulation under both PCI and LPC attachments.

Furthermore, we designed a testing process that helped us with the correctness validation and performance evaluation of the ATA/ATAPI and NE2000 implementations.

Future optimizations of the ATA/ATAPI emulation include the implementation of the Ultra DMA protocol and the migration from the LBA 28-bit addressing mode to the LBA 48-bit addressing mode. The latter makes it possible to transfer more data sectors (65536) per DMA transaction, which means fewer IO port accesses and interrupts, and therefore less overhead in the communication between the virtual machine driver and the storage emulation. We did not implement these features because the goal of our work was not to offer high transfer rates, but to provide better compatibility with the older operating systems.

To improve the performance of the NE2000 emulation, a possible optimization is to provide an external DMA controller that the guest driver would use to transfer the frames between the guest system memory and the NE2000's Transmit Buffer. This would result in a smaller number of IO accesses and less overhead in the communication between the virtual machine driver and the NE2000 network card emulation. We did not implement the external DMA controller for the same reason: the goal of our work was better compatibility with the older operating systems, not high transfer rates.

Bibliography

[1] bhyve ATA repo. https://svn.grid.pub.ro/svn/bhyve-ATA-emul/.
[2] bhyve NE2000 repo. https://socsvn.freebsd.org/socsvn/soc2015/iateaca/bhyve-ne2000-head/.
[3] Qemu ne2k. https://github.com/qemu/qemu/blob/master/hw/net/ne2000.c.
[4] Realtek PCI Full-Duplex Ethernet Controller with built-in SRAM. Industrial Park, Hsinchu 30077, Taiwan, R.O.C. Realtek Semiconductor Corporation.
[5] SCSI Commands Reference Manual. Seagate Technology LLC. All rights reserved, February 2006.
[6] DP8390D/NS32490D NIC Network Interface Controller. 111 West Bardin Road, Arlington, TX 76017, July 1995. National Semiconductor Corporation.
[7] James Boyd. Serial ATA advanced host controller interface. pages 9–9, 2111 NE 25 Avenue, Hillsboro, OR 97124, 2008. Intel Corporation.
[8] Tony Goodfellow. ATA/ATAPI host adapters standard. pages 15–15, 2052 Alton Parkway, Irvine, CA 92602, USA, 2003. Pacific Digital Corporation.
[9] Tony Goodfellow. ATA/ATAPI host adapters standard. pages 11–11, 2052 Alton Parkway, Irvine, CA 92602, USA, 2003. Pacific Digital Corporation.
[10] FreeBSD Documentation Handbook. Bridging. https://www.freebsd.org/doc/handbook/network-bridging.html, 2017. [Online; accessed 16-May-2017].
[11] Peter T. McLean. Information technology - AT attachment with packet interface - 6. pages 16–16, 11 West 42nd Street, New York, New York 10036, 2002. American National Standards Institute.
[12] Peter T. McLean. Information technology - AT attachment with packet interface - 6. 11 West 42nd Street, New York, New York 10036, 2002. American National Standards Institute.
[13] Peter T. McLean. Information technology - AT attachment with packet interface - 6. pages 63–63, 11 West 42nd Street, New York, New York 10036, 2002. American National Standards Institute.
[14] Peter T. McLean. Information technology - AT attachment with packet interface - 6. pages 64–65, 11 West 42nd Street, New York, New York 10036, 2002. American National Standards Institute.
[15] Peter T. McLean. Information technology - AT attachment with packet interface - 6. pages 56–57, 11 West 42nd Street, New York, New York 10036, 2002. American National Standards Institute.
[16] Wikipedia. Parallel ATA - Wikipedia, the free encyclopedia. http://en.wikipedia.org/wiki/Parallel_ATA, 2015. [Online; accessed 14-February-2015].
[17] Wikipedia. Serial ATA - Wikipedia, the free encyclopedia. http://en.wikipedia.org/wiki/Serial_ATA, 2015. [Online; accessed 14-February-2015].
[18] Wikipedia. NE1000 - Wikipedia, the free encyclopedia. https://en.wikipedia.org/wiki/NE1000, 2016. [Online; accessed 2-April-2017].
[19] Wikipedia. Direct memory access. https://en.wikipedia.org/wiki/Direct_memory_access, 2017. [Online; accessed 25-April-2017].