haskus-manual Release 0.1

Sylvain HENRY

Aug 03, 2017

System - Volume 1: building guide

1 haskus-system
    1.1 Introduction
    1.2 Discussing the approach
    1.3 Building systems: the automated way
    1.4 Building systems: the manual way
    1.5 Reference: system.yaml syntax
    1.6 Reference: haskus-system-build
    1.7 The Sys monad
    1.8 Device management
    1.9 Graphics Overview
    1.10 Modules Overview
    1.11 architecture notes
    1.12 X86 encoding

2 haskus-binary
    2.1 Binary modules

3 haskus-utils

CHAPTER 1

haskus-system

haskus-system is a framework written in Haskell that can be used for system programming. Fundamentally it is an experiment into providing an integrated interface leveraging Haskell features (type-safety, STM, etc.) for the whole system: input, display, sound, network, etc.

Introduction

The big picture

A typical operating system can be roughly split into three layers:
• Kernel: device drivers, virtual memory management, process scheduling, etc.
• System: system services and daemons, low-level kernel interfaces, etc.
• Application: end-user applications (web browser, video player, games, etc.)

haskus-system is based directly and exclusively on the Linux kernel. Hence,
• it doesn’t rely on usual user-space kernel interfaces (e.g., libdrm, libinput, X11, wayland, etc.) to communicate with the kernel
• it doesn’t contain low-level kernel code (device drivers, etc.)

Note, however, that programs using haskus-system are compiled with GHC: hence they still depend on the dependencies of GHC’s runtime system (RTS), such as libc. Programs are statically compiled to embed those dependencies.

haskus-system acts at the system level: it provides interfaces to the Linux kernel (hence to the hardware) in Haskell and builds on them to provide higher-level interfaces (described in Volume 2 of this documentation). You can use these interfaces to build custom systems. Then it is up to you to decide whether your system has the concept of “application” or not: you may design domain-specific systems which provide a single “application”.

Discussing the approach

The haskus-system framework aims to help writing systems in Haskell. Writing a new operating system from scratch is obviously a huge task that we won’t undertake. Instead, pragmatically, we build on the Linux kernel to develop haskus-system.

The fact that it is based on the Linux kernel shouldn’t confuse you: we don’t have to let applications directly access it through a UNIX-like interface! This is similar to the approach followed by Google with Android: the Linux kernel is used internally but applications have to be written in Java and they have to use the Android interfaces. The difference is that we are using Haskell.

The haskus-system framework and the systems using it are written in Haskell. We use GHC to compile Haskell code, hence we rely on GHC’s runtime system. This runtime system works on a bare-bones Linux kernel and manages memory (garbage collection), user-space threading, asynchronous I/O, etc.

Portability

Portability is ensured by the Linux kernel. In theory we could use our approach on any architecture supported by the Linux kernel. In practice, we also need to ensure that GHC supports the target architecture.

In addition, haskus-system requires a thin architecture-specific layer because the Linux interface is architecture specific. Differences between architectures include: system call numbers, some structure fields (sizes and orders), the instruction used to enter a system call and the way to pass system call parameters (calling convention).

The following architectures are currently supported by each level of the stack:
• haskus-system: x86-64
• GHC: x86, x86-64, PowerPC, and ARM
• Linux kernel: many architectures
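To illustrate why an architecture-specific layer is needed, the following sketch encodes the Linux system call numbers of four common system calls for x86 and x86-64. The Arch and Syscall types and the syscallNumber function are hypothetical names chosen for this example; they are not taken from haskus-system.

```haskell
-- Sketch of one aspect of an architecture-specific layer: the same system
-- call has a different number on each architecture.
-- All names below are hypothetical and are not part of haskus-system.

data Arch = X86 | X86_64
   deriving (Show, Eq)

data Syscall = SysRead | SysWrite | SysOpen | SysClose
   deriving (Show, Eq, Enum, Bounded)

-- | Linux system call numbers for the two architectures
syscallNumber :: Arch -> Syscall -> Int
syscallNumber X86_64 s = case s of
   SysRead  -> 0
   SysWrite -> 1
   SysOpen  -> 2
   SysClose -> 3
syscallNumber X86 s = case s of
   SysRead  -> 3
   SysWrite -> 4
   SysOpen  -> 5
   SysClose -> 6

-- Print one line per system call with its number on both architectures
main :: IO ()
main = mapM_ printRow [minBound .. maxBound]
   where
      printRow s = putStrLn
         (  show s
         ++ ": x86=" ++ show (syscallNumber X86 s)
         ++ ", x86-64=" ++ show (syscallNumber X86_64 s)
         )
```

A complete layer would also have to cover the other differences listed above (structure layouts, the instruction used to enter the kernel and the calling convention), which this sketch leaves out.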

Performance

Using a high-level language such as Haskell is a trade-off between performance and productivity, just like using the C language instead of plain assembly language is. Moreover, in both cases we expect the compilers to perform optimizations that are not obvious or that would require complicated, hard-to-maintain code if they were to be coded explicitly.

GHC is the Haskell compiler we use. It is a mature compiler that is still actively developed. It performs a lot of optimizations. In particular, it performs inter-module optimizations so that well-organized modular code doesn’t endure performance costs.

Haskell code is compiled into native code for the architecture (i.e., there is no runtime interpretation of the code). In addition, it is possible to use LLVM as a GHC backend to generate the native code.

The generated native code is linked with a runtime system provided by GHC that manages:
• Memory: garbage collection
• Threading: fast and cheap user-space threading
• Software transactional memory (STM): safe memory locking
• Asynchronous I/O: non-blocking I/O interacting with the threading system

Performance-wise, this is a crucial part of the stack we use. It has been carefully optimized and it is tunable for specific needs. It is composed of about 40k lines of C code.

As a last resort, it is still possible to call code written in other languages from Haskell through the Foreign Function Interface (FFI) or by adding a primitive operation (primop). haskus-system uses these mechanisms to interact with the Linux kernel.

It seems to us that this approach is a good trade-off. As comparison points, most UNIX-like systems rely on unsafe interpreted scripts (init systems, etc.); Google’s Android (with Dalvik) used to perform runtime bytecode interpretation and then just-in-time compilation, and currently (with ART) it still uses a garbage collector; Apple’s platforms rely on a garbage collection variant called “automatic reference counting” in the Objective-C and Swift languages (while it might be more efficient, it requires much more care from the programmers); JavaScript-based applications and applets (unsafe language, VM, etc.) tend to become widespread even on the desktop.
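As an illustration of the FFI mentioned above, here is a minimal sketch that binds a C function and calls it from Haskell. It uses getpid from the C library only because it is simple and always available; it is not how haskus-system itself calls into the kernel, and the c_getpid name is an arbitrary choice for this example.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

import System.Posix.Types (CPid)

-- Bind the C library's getpid function.
-- The Haskell-side name c_getpid is an arbitrary choice for this sketch.
foreign import ccall unsafe "getpid"
   c_getpid :: IO CPid

main :: IO ()
main = do
   -- The foreign function is used like any other IO action
   pid <- c_getpid
   putStrLn ("Current process identifier: " ++ show pid)
```

The foreign import declaration gives the C function a Haskell type, so calls to it are type-checked like the rest of the program.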

Productivity

Writing system code in a high-level language such as Haskell should be much more productive than writing it in a low-level language like C.
• Most of the boilerplate code (e.g., error management, logging) can be abstracted away.
• Thanks to the type system, many errors are caught during the compilation, which is especially useful with system programming because programs are harder to debug using standard methods and tools. Moreover it makes code much easier to maintain because the compiler checks many more things during refactoring.
• High-level code is often dense and terse. Hence we can show full working code snippets in the documentation that you can quickly copy.

Moreover, writing system code is much more fun and we can quickly get enjoyable results.

Durability and Evolution

Our approach should be both durable and evolutive.

Durable because we only use mature technology: Linux and GHC developments both started in the early 1990s and are still very active. The only new layer in the stack is the haskus-system framework. All of these are open-source free software, ensuring long-term access to the sources.

The approach is evolutive: the Haskell language is evolving in a controlled way with GHC’s extensions (and a potential future Haskell standard revision); GHC as a compiler and a runtime system is constantly improving and support for new architectures could be added; Linux support for new hardware and new architectures is constantly enhanced and specific developments could be done to add features useful for haskus-system (or your own system on top of it).

The haskus-system framework itself is highly evolutive. First, it is new and not tied to any standard. Moreover, code refactoring in Haskell is much easier than in low-level languages such as C (thanks to the strong typing), hence we can easily enhance the framework interfaces as user code can easily be adapted.

Single Code Base & Integration

In our opinion, a big advantage of our approach is to have an integrated framework whose source is in a single code base. It makes it much easier to evolve at a fast pace without having to maintain interface compatibility between its internal components. Moreover, refactoring is usually safe and relatively easy in Haskell, so we could later split it into several parts if needed.


As a comparison point, usual Linux distributions use several system services and core libraries, most of them in their own repository and independently developed: libc, dbus, libdrm, libinput, Mesa/X11/Wayland, PulseAudio, etc. It is worth noting that the issue has been identified and that an effort has been recently made to reduce the fragmentation and to centralize some of them into a more integrated and coherent framework: systemd.

Having a single codebase written in a high-level language makes it easier to find documentation, to understand how things work (especially the interaction between the different components) and to make contributions.

Standards

haskus-system can only be used on top of the Linux kernel. It doesn’t try to follow standards (UNIX, POSIX, System V, etc.) in order to be portable to other kernels. In our opinion, these standards have been roadblocks to progress in system programming because system services and applications are usually designed to follow the least common denominator of these standards to ensure portability. For instance, useful features specific to the Linux kernel may not be used because some BSD kernels do not support them [See also the heated debates about systemd requiring Linux specific features]. With our approach, we can use every feature of the Linux kernel and develop new ones if needed.

It is often stated that programs should conform to the “UNIX philosophy”: each program should do only one thing and programs must be easily composable. Despite this philosophy, UNIX systems often stand on feet of clay: programs are composed with unsafe shell scripts and data exchanged between programs are usually in weakly structured plain text format.

In our opinion, functional programming with strong typing is much more principled than the “UNIX philosophy”: functions are by nature easily composable and their interfaces are well-described with types. In addition, we are not limited to plain text format and the compiler ensures that we are composing functions in appropriate ways.

As an example, compare this with UNIX standard commands such as ls, which include many result sorting flags while the sort command could be used instead: the weakly structured output of the ls command makes it very inconvenient to indicate which field to sort by (hard to compose). Moreover, the output of the ls command mustn’t ever change, otherwise many tools relying on it may be broken (not evolutive). This is because most commands do two things: compute a result and format it for output, while they should only do the first (according to the UNIX philosophy). We don’t have this issue because we use type-checked data types instead of plain text.

Even if haskus-system is in a single code base, its functions can be used in other Haskell programs just by importing its modules. The compiler statically checks that functions are appropriately called with valid parameters. Compare this with the usual interface between two UNIX programs: parameters from the first program have to be serialized and passed on the command-line (with all the imaginable limitations on their sizes); then the second program has to parse them as well as its standard input, to handle every error case (missing parameter, invalid parameter, etc.), and to write the result; finally the first program has to parse the outputs (both stdout and stderr) of the second one and to react accordingly. For such a fundamental concept, there is a lot of boilerplate code involved and many potential errors lurking in it.
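The comparison with ls and sort can be made concrete with a small sketch. It assumes a hypothetical Entry record standing for one line of a directory listing, with hard-coded sample data; none of these names come from haskus-system. The point is only that, with typed results, choosing the sort field is a matter of composing a generic sorting function with a field accessor, and formatting stays a separate step.

```haskell
import Data.List (sortOn)
import Data.Ord (Down (..))

-- | Hypothetical structured result of a directory listing
data Entry = Entry
   { entryName :: String
   , entrySize :: Integer   -- ^ size in bytes
   , entryTime :: Integer   -- ^ modification time (seconds since the epoch)
   } deriving (Show, Eq)

-- | Hard-coded sample data standing in for a real listing
sampleEntries :: [Entry]
sampleEntries =
   [ Entry "notes.txt"  1200    1500000300
   , Entry "kernel.bin" 7340032 1500000100
   , Entry "init"       2048    1500000200
   ]

-- | Formatting is kept separate from computing and sorting the result
render :: Entry -> String
render e = entryName e ++ " (" ++ show (entrySize e) ++ " bytes)"

main :: IO ()
main = do
   putStrLn "Largest first:"
   mapM_ (putStrLn . render) (sortOn (Down . entrySize) sampleEntries)
   putStrLn "Oldest first:"
   mapM_ (putStrLn . render) (sortOn entryTime sampleEntries)
```

Changing the render function doesn’t affect the code that sorts the entries, and sorting on a field that doesn’t exist is rejected by the compiler.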

Building And Testing

Our approach allows us to quickly have a working prototype that can be tested in an emulated environment (e.g., with QEMU). As a comparison point, building a minimal usual Linux distribution from scratch is very cumbersome, as we can see in the “Linux From Scratch” book. A lot of different packages have to be downloaded from various places, patched, configured, built and installed. Even if our approach is currently far from being on par with a usual Linux distribution, we expect it to stay much simpler to build.


Proprietary Drivers

Some vendors do not provide open-source drivers nor documentation for their hardware. Instead they provide pre-compiled libraries and/or kernel modules. As they presuppose the use of some system libraries and services (OpenGL, X11, etc.), haskus-system doesn’t support them.

Building systems: the automated way

Building and testing your own systems based on haskus-system requires quite a few steps: configuring and building a Linux kernel, etc. Fortunately, we have built a tool called haskus-system-build that performs all these steps automatically (details are explained in the next chapter if you want to understand what it does internally).

Prerequisite: you need to have Stack and git installed.

Installing haskus-system-build

The tool is distributed in the haskus-system-build package. To install the latest version, use:

> git clone https://github.com/haskus/haskus-system-build.git
> cd haskus-system-build
> stack install --install-ghc

It will install the program into ~/.local/bin. Be sure to add this path to your $PATH environment variable.

Getting started

To start a new project, enter a new directory and use the init command:

> mkdir my-project
> cd my-project
> haskus-system-build init

It downloads the default system template into the current directory. It is composed of 4 files:

> find . -type f
./stack.yaml
./src/Main.hs
./my-system.cabal
./system.yaml

• src/Main.hs is the system code
• my-system.cabal is the package configuration file that has been tweaked to build valid executables (use static linking, etc.)
• stack.yaml is the Stack configuration file with the required dependencies on Haskus packages
• system.yaml is the haskus-system-build configuration file. Let’s look into it:

linux:
   source: tarball
   version: 4.11.3
   options:
      enable:
         - CONFIG_DRM_BOCHS
         - CONFIG_DRM_RADEON
         - CONFIG_DRM_NOUVEAU
         - CONFIG_DRM_VIRTIO
      disable:
         - VIRTIO_BALLOON
   make-args: "-j8"

ramdisk:
   init: my-system

qemu:
   # Select a set of options for QEMU:
   #  "default": enable recommended options
   #  "vanilla": only use required settings to make tests work
   profile: vanilla
   options: ""
   kernel-args: ""

As you can see, it contains a Linux kernel configuration, a reference to our system as being the ramdisk “init” program and some QEMU configuration.

Building and Testing

You need to have some programs installed before we continue:
• everything required to build Linux: make, gcc, binutils...
• static libraries (e.g., glibc-static and zlib-static packages on Fedora)
• (un)packing tools: lzip, gzip, tar, cpio
• QEMU
• stack

Now let’s try our system with QEMU!

> haskus-system-build test

On the first execution, this command downloads and builds everything required to test the system so it can take quite some time. Then QEMU’s window should pop up with our system running in it. On following executions building is much faster because the tool reuses previously built artefacts if the configuration hasn’t changed.

Distributing and testing on real computers

If you want to distribute your system, the easiest way is to install it on an empty storage device (e.g., USB stick).

Warning: data on the device will be lost! Don’t do that if you don’t know what you are doing!

To install your system on the device whose device file is /dev/sde:

> haskus-system-build make-device --device /dev/sde

Note that you have to be in the sudoers list.

ISO image

Another distribution method is to create an ISO image that you can distribute online or burn on CD/DVD.

> haskus-system-build make-iso
...
ISO image: .system-work/iso/my-system.iso

Note that you can test the ISO image with QEMU before you ship it:

> haskus-system-build test-iso

This allows you to test the boot-loader configuration.

Building systems: the manual way

This chapter explains how to compile and build a full system from scratch manually. Using the haskus-system-build tool presented in Building systems: the automated way is much easier as it automatically performs many of the steps presented in this chapter.

We will follow these steps:
1. Building your own systems: these are user-space Linux init programs (written in Haskell and using the haskus-system framework) which are compiled statically.
2. Putting your systems into ramdisks.
3. Configuring and building the Linux kernel.
4. Testing with QEMU.
5. Configuring a boot-loader. Building ISO images and bootable devices. Testing the ISO image with QEMU.

Building systems

Suppose we have the following HelloWorld system code and that we want to build it.

import Haskus.System

main :: IO ()
main = runSys' <| do
   term <- defaultTerminal
   writeStrLn term "Hello World!"
   waitForKey term
   powerOff

You can use Cabal as for any other Haskell program. However, we want to build a static executable, hence the .cabal file must contain an executable section such as:

executable HelloWorld
   main-is: HelloWorld.hs
   build-depends:
      base,
      haskus-system
   default-language: Haskell2010
   ghc-options: -Wall -static -threaded
   cc-options: -static
   ld-options: -static -pthread
   #extra-lib-dirs: /path/to/static/libs

If static versions of the libgmp, libffi and glibc libraries (used by GHC’s runtime system) are not available on your system, you have to compile them and to tell the linker where to find them: uncomment the last line in the previous extract of the .cabal file (the extra-lib-dirs entry) and modify it so that the given path points to a directory containing the static libraries.

We also recommend using stack to ensure that you are using appropriate versions of GHC and of haskus-system. Example of stack.yaml contents:

resolver: lts-8.15

packages:
- '.'
- location:
    git: git@github.com:haskus/haskus-system
    commit: 33ba0413f2adae33f66f78e51f7e9e52e63758f1
  extra-dep: true

ghc-options:
   "haskus-system": -freduction-depth=0 -fobject-code

extra-deps:
- haskus-binary-0.6.0.0
- haskus-utils-0.6.0.0

Finally use stack build to compile the program.

Examples

The haskus-system-examples repository contains several examples of such systems (including this HelloWorld program). Refer to its .cabal file and to its stack.yaml file.

Creating ramdisks

To execute a system with the Linux kernel, the easiest way is to create a ramdisk: an image of the file-system containing your system (basically a zip file). To do that, put your system files into a directory /path/to/my/system. Then execute:

(cd /path/to/my/system ; find . | cpio -o -H newc | gzip) > myimage.img

You need to have the cpio and gzip programs installed. It builds a ramdisk file named myimage.img in the current directory.

Building the Linux kernel

The Linux kernel is required to execute systems using haskus-system. Leaving aside modules and firmwares, a compiled Linux kernel is a single binary file that you can use with QEMU to execute your own systems. To build Linux, you first need to download it from http://kernel.org and to unpack it:

wget https://www.kernel.org/pub/linux/kernel/v4.x/linux-4.9.8.tar.xz
tar xf linux-4.9.8.tar.xz

Then you need to configure it. We recommend at least the following:

cd linux-4.9.8

# default configuration for the X86-64 target
make x86_64_defconfig

# enable some DRM (graphics) drivers
./scripts/config -e CONFIG_DRM_BOCHS
./scripts/config -e CONFIG_DRM_RADEON
./scripts/config -e CONFIG_DRM_NOUVEAU
./scripts/config -e CONFIG_DRM_VIRTIO_GPU

# fixup configuration (use default values)
make olddefconfig

If you know what you are doing, you can configure it further with:

make xconfig

Finally, build the kernel with:

make -j8

Copy the resulting kernel binary that you can use with QEMU for instance:

cp arch/x86/boot/bzImage linux-4.9.8.bin

You can also copy built modules and firmwares with:

make modules_install INSTALL_MOD_PATH=/path/where/to/copy/modules
make firmware_install INSTALL_FW_PATH=/path/where/to/copy/firmwares

Testing with QEMU

To test a system with QEMU, we recommend that you first build a ramdisk containing it, say myimage.img. We suppose your system (i.e., the user-space program) is stored in /my/system in the ramdisk. You also need to build a recent Linux kernel, say linux.bin.

To launch QEMU, use the following command line:

qemu-system-x86_64 -kernel linux.bin -initrd myimage.img -append "rdinit=/my/system"

We recommend the following options for QEMU:

# make QEMU faster by using KVM
-enable-kvm

# use newer simulated hardware
-machine q35

# make pointer handling better by simulating a tablet
-usbdevice "tablet"

# redirect the guest on the host terminal
-serial stdio -append "console=ttyS0"

# enable better sound device
-soundhw "hda"

# make the guest Linux output more quiet
-append "quiet"
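The options above can be gathered into a single command line. The following sketch builds the argument list in Haskell and prints the resulting command. It assumes that QEMU takes the kernel command line from a single -append option, which is why the kernel arguments listed separately above are merged into one string. The qemuArgs function is a hypothetical helper written for this example; it is not part of haskus-system-build.

```haskell
-- | Build a QEMU argument list from the options recommended above.
-- Hypothetical helper: kernel arguments are merged into one -append option
-- (assumption: QEMU only uses a single -append value).
qemuArgs :: FilePath -> FilePath -> [String] -> [String]
qemuArgs kernel ramdisk kernelArgs =
   [ "-kernel", kernel
   , "-initrd", ramdisk
   , "-enable-kvm"
   , "-machine", "q35"
   , "-usbdevice", "tablet"
   , "-serial", "stdio"
   , "-soundhw", "hda"
   , "-append", unwords kernelArgs
   ]

-- | Quote an argument containing spaces so that the printed command line
-- can be pasted into a shell
quote :: String -> String
quote a
   | ' ' `elem` a = "\"" ++ a ++ "\""
   | otherwise    = a

main :: IO ()
main = putStrLn (unwords ("qemu-system-x86_64" : map quote args))
   where
      args = qemuArgs "linux.bin" "myimage.img"
               ["rdinit=/my/system", "console=ttyS0", "quiet"]
```

The program only prints the command line; it doesn’t launch QEMU.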

Distributing systems

To distribute your systems, we will create a directory /my/disk containing:
• your system (in a ramdisk)
• the Linux kernel
• the boot-loader files (including its configuration)

A boot-loader is needed as it loads Linux and the ramdisk containing your system. We use the Syslinux boot-loader but you can use others such as GRUB. Note that you don’t need a boot-loader when you test your system with QEMU because QEMU acts as a boot-loader itself.

To distribute your systems, you can install the boot-loader on a device (e.g., USB stick) and copy the files of the /my/disk directory onto it. Or you can also create a .iso image to burn on a CD-ROM (or to distribute online).

Downloading Syslinux

You first need to download and unpack the Syslinux boot-loader:

wget http://www.kernel.org/pub/linux/utils/boot/syslinux/syslinux-6.03.tar.xz
tar xf syslinux-6.03.tar.xz

Creating the disk directory

You need to execute the following steps to create your disk directory.

Create some directories:

mkdir -p /my/disk/boot/syslinux

Copy Syslinux:

find syslinux-6.03/bios -name "*.c32" -exec cp {} /my/disk/boot/syslinux \;
cp syslinux-6.03/bios/core/isolinux.bin /my/disk/boot/syslinux/

Copy the Linux kernel:

cp linux-4.9.8.bin /my/disk/boot/

Copy the system ramdisk:

cp myimage.img /my/disk/boot/

Finally, we need to configure the boot-loader by creating a file /my/disk/boot/syslinux/syslinux.cfg containing:

DEFAULT main
PROMPT 0
TIMEOUT 50
UI vesamenu.c32

LABEL main
MENU LABEL MyOS
LINUX /boot/linux-4.9.8.bin
INITRD /boot/myimage.img
APPEND rdinit="/my/system"

Replace /my/system with the path of your system in the myimage.img ramdisk.

Creating a bootable device

To create a bootable device (e.g., bootable USB stick), you have to know its device path (e.g., /dev/XXX) and the partition that will contain the boot files (e.g., /dev/XXX_N). You can use fdisk and mkfs.ext3 to create an appropriate partition.

You have to install the Syslinux MBR:

sudo dd bs=440 if=syslinux-6.03/bios/mbr/mbr.bin of=/dev/XXX

Then you have to copy the contents of the disk directory on the partition and configure it to be bootable:

sudo mount /dev/XXX_N /mnt/SOMEWHERE
sudo cp -rf /my/disk/* /mnt/SOMEWHERE
sudo syslinux-6.03/bios/extlinux/extlinux --install /mnt/SOMEWHERE/boot/syslinux
sudo umount /mnt/SOMEWHERE

Now your device should be bootable with your system!

Creating a bootable CD-ROM

To create a bootable CD-ROM, you first need to create a .iso disk image with the xorriso utility:

xorriso -as mkisofs
   -R -J                          # use Rock-Ridge/Joliet extensions
   -o mydisk.iso                  # output ISO file
   -c boot/syslinux/boot.cat      # create boot catalog
   -b boot/syslinux/isolinux.bin  # bootable binary file
   -no-emul-boot                  # does not use legacy floppy emulation
   -boot-info-table               # write additional Boot Info Table (required by SysLinux)
   -boot-load-size 4
   -isohybrid-mbr syslinux-6.03/bios/mbr/isohdpfx_c.bin  # hybrid ISO
   /my/disk

It should create a mydisk.iso file that you can burn on a CD or distribute online.

Reference: system.yaml syntax

This is the reference for system.yaml files used by the haskus-system-build tool.

Linux kernel

• linux.source: how to retrieve the Linux kernel
    – tarball (default)
    – git (not yet implemented)
• linux.version: which Linux version to use
    – requires linux.source=tarball
• linux.options:
    – enable: list of Linux configuration options to enable
    – disable: list of Linux configuration options to disable
    – module: list of Linux configuration options to build as module
• linux.make-args: string of arguments passed to make

Ramdisk

• ramdisk.init: name of the program to use as init program

QEMU

• qemu.profile: option profile to use
    – default (default): enable some recommended options
    – vanilla: only set required options (-kernel, etc.)
• qemu.options: string of additional arguments to pass to QEMU
• qemu.kernel-args: string of additional arguments to pass to the kernel

Example

linux:
   source: tarball
   version: 4.11.3
   options:
      enable:
         - CONFIG_DRM_BOCHS
         - CONFIG_DRM_VIRTIO
      disable:
         - CONFIG_DRM_RADEON
      module:
         - CONFIG_DRM_NOUVEAU
   make-args: "-j8"

ramdisk:
   init: my-system

qemu:
   profile: vanilla
   options: "-enable-kvm"
   kernel-args: "quiet"

Reference: haskus-system-build

This is the reference for the haskus-system-build program.

Commands:
• init: create a new project from a template
    – --template or -t (optional): template name
• build: build the project and its dependencies (Linux)
• test: launch the project into QEMU
    – --init (optional): specify an init program (override ramdisk.init in system.yaml)
• make-disk: create a directory containing the whole system
    – --output or -o (mandatory): output directory
• make-iso: create an ISO image of the system
• test-iso: test the ISO image with QEMU
• make-device: install the system on a device
    – --device or -d (mandatory): device path (e.g., /dev/sdd). For now, the first partition is used as a boot partition.

Note that the tool also builds libgmp as it is required to statically link programs produced by GHC. Some distributions (e.g., Archlinux) only provide libgmp.so and not libgmp.a.

The tool requires other programs and commands:
• git
• tar, lzip, gzip, cpio
• make, gcc, binutils...
• stack
• dd, (u)mount, cp
• qemu
• xorriso

In volume 2, we describe the high-level interfaces provided by haskus-system to control and use the computer.

The Sys monad

Many high-level interfaces of the haskus-system use the Sys monad. It is basically a wrapper for the IO monad that adds a logging mechanism. For instance, consider the following HelloWorld code where runSys :: Sys a -> IO a:

import Haskus.System

main :: IO ()
main = runSys <| do

   -- Initialize the default terminal
   term <- defaultTerminal

   -- print a string on the standard output
   writeStrLn term "Hello World!"

   -- wait for a key to be pressed
   waitForKey term

   -- print system log
   sysLogPrint

   -- shutdown the computer
   void powerOff

This code prints the string “Hello World!” in the Linux terminal and waits for a char to be entered in the terminal. Then it prints the system log that is implicitly maintained in the Sys monad. Hence, the output of this program is something like:

Hello World!

---- Log root
--*- FORK: Terminal input handler
  |---- Read bytes from Handle 0 (succeeded with 1)
  |---- readBytes /= 0 (success)
--*- FORK: Terminal output handler

[ 1.818814] reboot: Power down

You can see that the log is hierarchical and that it supports forks: defaultTerminal forks 2 threads to handle asynchronous terminal input/output; the input thread indicates in the log that it has read 1 byte from the terminal input (when the enter key was pressed).

Note that the log entries produced by the framework functions may change in the future, hence the contents of the log may change and you may not get exactly the same output if you try to execute this code.
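To give an idea of what “a wrapper for the IO monad that adds a logging mechanism” can look like, here is a deliberately simplified sketch. It is not the implementation used by haskus-system: the Sys type below is a hypothetical stand-in that keeps a flat list of log entries in a mutable reference, whereas the real log is hierarchical and supports forks. The liftSys, sysLog and sysLogEntries helpers are names chosen for this example.

```haskell
import Data.IORef (IORef, modifyIORef, newIORef, readIORef)

-- | Hypothetical stand-in for Sys: an IO action with access to a shared log
newtype Sys a = Sys { unSys :: IORef [String] -> IO a }

instance Functor Sys where
   fmap f (Sys g) = Sys (fmap f . g)

instance Applicative Sys where
   pure x          = Sys (\_ -> pure x)
   Sys f <*> Sys g = Sys (\ref -> f ref <*> g ref)

instance Monad Sys where
   Sys g >>= k = Sys (\ref -> g ref >>= \a -> unSys (k a) ref)

-- | Lift an IO action into the wrapper
liftSys :: IO a -> Sys a
liftSys io = Sys (const io)

-- | Append an entry to the log (entries are stored newest first)
sysLog :: String -> Sys ()
sysLog msg = Sys (\ref -> modifyIORef ref (msg :))

-- | Return the log entries, oldest first
sysLogEntries :: Sys [String]
sysLogEntries = Sys (fmap reverse . readIORef)

-- | Run an action with an initially empty log
runSys :: Sys a -> IO a
runSys (Sys g) = newIORef [] >>= g

main :: IO ()
main = runSys $ do
   sysLog "Terminal initialized"
   liftSys (putStrLn "Hello World!")
   sysLog "String written"
   entries <- sysLogEntries
   liftSys (mapM_ (putStrLn . ("---- " ++)) entries)
```

Because the log lives in the monad, functions don’t need an explicit log parameter: any action running in the wrapper can add entries to it.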

Device management

Device management is the entry-point of system programming. Programs have to know which devices are available to communicate with the user (graphic cards, input devices, etc.) or with other machines (network cards, etc.). In this chapter, we present the basic concepts and we show examples with simple virtual devices provided by Linux. In the next chapters, we build on these concepts and we show how to use specific device classes: display devices, input devices, etc.

Enumerating Available Devices

haskus-system provides an easy to use interface to list devices as detected by the Linux kernel. To do that, use defaultSystemInit and systemDeviceManager as in the following code:

import Haskus.System

main :: IO ()
main = runSys <| do

   sys  <- defaultSystemInit
   term <- defaultTerminal
   let dm = systemDeviceManager sys

   inputDevs   <- listDevicesWithClass dm "input"
   graphicDevs <- listDevicesWithClass dm "drm"

   let
      showDev dev = writeStrLn term (" - " ++ show (fst dev))
      showDevs    = mapM_ showDev

   writeStrLn term "Input devices:"
   showDevs inputDevs

   writeStrLn term "Display devices:"
   showDevs graphicDevs

   void powerOff

Linux associates a class to each device. The previous code shows how to enumerate devices of two classes: “input” and “drm” (direct rendering manager, i.e., display devices). If you execute it in QEMU you should obtain results similar to:

Input devices:
 - "/virtual/input/mice"
 - "/LNXSYSTM:00/LNXPWRBN:00/input/input0/event0"
 - "/platform/i8042/serio0/input/input1/event1"
Display devices:
 - "/pci0000:00/0000:00:01.0/drm/card0"
 - "/pci0000:00/0000:00:01.0/drm/controlD64"

To be precise, we are not listing devices but event sources: a single device may have multiple event sources; some event sources may be virtual (for instance the mice input device is a virtual device that multiplexes all the mouse device event sources and that is useful if you have more than one connected mouse device).

Plug-And-Play (Hot-Pluggable) Devices

We are now accustomed to (un)plugging devices into computers while they are running and to expecting them to be immediately detected and usable (i.e., without rebooting). This is the case for instance with input devices (keyboards, mice, joysticks, etc.) or mass storage devices. The operating system has to signal when a new device becomes available or unavailable.

Linux loads some drivers asynchronously to speed up the boot. Hence devices handled by these drivers are detected after the boot as if they had just been plugged in.

haskus-system provides an interface to receive events when the state of the device tree changes. The following code shows how to get and print these events:

import Haskus.System

main :: IO ()
main = runSys <| do

   term <- defaultTerminal
   sys  <- defaultSystemInit
   let dm = systemDeviceManager sys

   -- Display kernel events
   onEvent (dmEvents dm) <| \ev ->
      writeStrLn term (show ev)

   waitForKey term
   void powerOff

If you execute this code in QEMU, you should get something similar to:

-- Formatting has been enhanced for readability
KernelEvent
   { kernelEventAction = ActionAdd
   , kernelEventDevPath = "/devices/platform/i8042/serio1/input/input3"
   , kernelEventSubSystem = "input"
   , kernelEventDetails = fromList
      [("EV","7")
      ,("KEY","1f0000 0 0 00")
      ,("MODALIAS","input:b0011v0002p0006e0000-e0,...,8,amlsfw")
      ,("NAME","\"ImExPS/2 Generic Explorer Mouse\"")
      ,("PHYS","\"isa0060/serio1/input0\"")
      ,("PRODUCT","11/2/6/0")
      ,("PROP","1")
      ,("REL","143")
      ,("SEQNUM","850")]}
KernelEvent
   { kernelEventAction = ActionAdd
   , kernelEventDevPath = "/devices/platform/i8042/serio1/input/input3/mouse0"
   , kernelEventSubSystem = "input"
   , kernelEventDetails = fromList
      [("DEVNAME","input/mouse0")
      ,("MAJOR","13")
      ,("MINOR","32")
      ,("SEQNUM","851")]}
KernelEvent
   { kernelEventAction = ActionAdd
   , kernelEventDevPath = "/devices/platform/i8042/serio1/input/input3/event2"
   , kernelEventSubSystem = "input"
   , kernelEventDetails = fromList
      [("DEVNAME","input/event2")
      ,("MAJOR","13")
      ,("MINOR","66")
      ,("SEQNUM","852")]}
KernelEvent
   { kernelEventAction = ActionChange
   , kernelEventDevPath = "/devices/platform/regulatory.0"
   , kernelEventSubSystem = "platform"
   , kernelEventDetails = fromList
      [("COUNTRY","00")
      ,("MODALIAS","platform:regulatory")
      ,("SEQNUM","853")]}

The first three events are due to Linux lazily loading the driver for the mouse. The last event is Linux asking user-space to load the wireless regulatory information.

Using Devices

To use a device, we need to get a handle (i.e., a reference) on it that we will pass to every function applicable to it. The following code shows how to do it.

{-# LANGUAGE TypeApplications #-}

import Haskus.System
import Haskus.Format.Binary.Word

import qualified Haskus.Arch.Linux.Terminal as Raw

main :: IO ()
main = runSys <| do

   sys  <- defaultSystemInit
   term <- defaultTerminal
   let dm = systemDeviceManager sys

   -- Get handle for "zero", "null" and "urandom" virtual devices
   zeroDev <- getDeviceHandleByName dm "/virtual/mem/zero"
               >..~!!> sysErrorShow "Cannot get handle for \"zero\" device"
   nullDev <- getDeviceHandleByName dm "/virtual/mem/null"
               >..~!!> sysErrorShow "Cannot get handle for \"null\" device"
   randDev <- getDeviceHandleByName dm "/virtual/mem/urandom"
               >..~!!> sysErrorShow "Cannot get handle for \"urandom\" device"

   -- Read a 64-bit word from the urandom device
   readStorable @Word64 randDev Nothing
      >.~.> (\a -> writeStrLn term ("From urandom device: " ++ show a))
      >..~!> const (writeStrLn term "Cannot read urandom device")

   -- Read a 64-bit word from the zero device
   readStorable @Word64 zeroDev Nothing
      >.~.> (\a -> writeStrLn term ("From zero device: " ++ show a))
      >..~!> const (writeStrLn term "Cannot read zero device")

   -- Write a string into the null device
   void <| Raw.writeStrLn nullDev "Discarded string"

   -- Release the handles
   releaseDeviceHandle zeroDev
   releaseDeviceHandle nullDev
   releaseDeviceHandle randDev

   waitForKey term
   void powerOff

This code reads a 64-bit word from the urandom device that returns random data and another from the zero device that returns bytes set to 0. Finally, we write a string into the null device that discards what is written into it. These three devices are virtual and are always available with Linux’s default configuration.

Device Specific Interfaces

In the previous code example we have used read and write methods as if the device handle had been a normal file handle. Indeed Linux device drivers define the operational semantics they want to give to each system call applicable to a file handle: read, write, fseek, etc. Some system calls may be invalid with some device handles (e.g., write with the urandom driver).

This gives a weak unified interface to device drivers: the system calls are the same but the operational semantics depends on the driver. Moreover there are a lot of corner cases, such as system call parameters or flags only valid for some drivers. Finally, as there aren’t enough “generic” system calls to cover the whole spectrum of device features, the ioctl system call is used to send device specific commands to drivers. In practice you really have to know which device driver you’re working with to ensure that you use appropriate system calls.

To catch as many errors at compile time as possible, in haskus-system we provide device specific interfaces that hide all this complexity. If you use them, you minimise the risk of accidentally using an invalid system call. Some of these interfaces are presented in the next chapters.

Nevertheless you will have to use the low-level interface presented in this chapter if you want to write your own high-level interface to a device class not supported by haskus-system or if you want to extend an existing one.

Implementation notes

Internally haskus-system mounts a sysfs virtual file system through which the Linux kernel exposes the hardware of the machine. In this file-system each device is exposed as a sub-directory in the /devices directory and the path to the device’s directory uniquely identifies the device in the system.

Directory nesting represents the device hierarchy as the system sees it. Regular files in device directories represent device properties that can be read and sometimes written from user-space. When the tree relationship between devices is not sufficient, relations between devices are represented as symbolic links.

File descriptor vs Handle

Linux allows programs in user-space to have handles on kernel objects. Suppose the kernel has an object A and a reference R_A on A. Instead of directly giving R_A to user-space processes, the kernel maintains a per-process array of kernel object references: D_pid for the process with the pid identifier. To “give” R_A to this process, the kernel finds an empty cell in D_pid, puts R_A in it and gives the index of the cell to the process.

For historical reasons, the cell index is called a file descriptor and D_pid a file descriptor table even if in Linux they can be used for kernel objects that are not files (e.g., clocks, memory). User-space processes can only refer to kernel objects through these indirect references. Note that the file descriptor table is specific to each process: passing the value of a file descriptor to another process does not give that process access to the referred kernel object.

In haskus-system we use the term “handle” instead of “file descriptor” as we find it less misleading.
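The indirection described above can be modelled in a few lines of plain Haskell. This is only an illustrative sketch, not kernel or haskus-system code: the names KernelRef, DescriptorTable, install and resolve are hypothetical, and picking the lowest free cell is a choice made for this sketch.

```haskell
-- | Stand-in for a kernel-side reference to a kernel object (R_A in the text).
newtype KernelRef = KernelRef String
   deriving (Show, Eq)

-- | Per-process descriptor table (D_pid): cell indices paired with references.
type DescriptorTable = [(Int, KernelRef)]

-- | Put a reference in the lowest free cell and return the cell index
-- (the "file descriptor") together with the updated table.
install :: KernelRef -> DescriptorTable -> (Int, DescriptorTable)
install ref table = (fd, (fd, ref) : table)
   where
      used = map fst table
      fd   = until (`notElem` used) (+ 1) 0

-- | Resolve a cell index. The index is only meaningful together with the
-- table of the process that owns it.
resolve :: Int -> DescriptorTable -> Maybe KernelRef
resolve = lookup
```

Resolving the same index against the table of another process yields a different reference or nothing at all, which is why a bare descriptor value cannot be handed over between processes.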

Device special files and /dev

Ideally there would be a system call to get a handle on a device by providing its unique identifier (similarly to the getDeviceHandleByName API provided by haskus-system). Sadly it’s not the case. We have to:

1. Get the unique device triple identifier from its name

   Linux has two ways to uniquely identify devices:
   • a path in /devices in the sysfs file-system
   • a triple: a major number, a minor number and a device type (character or block).

   haskus-system retrieves the triple by reading different files in the sysfs device directory.

2. Create and open a device special file

   With a device triple we can create a special file (using the mknod system call). haskus-system creates the device special file in a virtual file system (tmpfs), then opens it and finally deletes it.

Usual Linux distributions use a virtual file-system mounted in /dev and create device special files in it. They let some applications directly access device special files in /dev (e.g., X11). Access control is ensured by file permissions (user, user groups, etc.). We don’t want to do this in haskus-system: we provide high-level APIs instead.
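As a rough illustration of the first step, the sketch below parses the "major:minor" text that sysfs exposes in a device directory (in a file named dev) into a device triple. The types and the function name are hypothetical and do not reflect the haskus-system implementation; the device type is assumed to be supplied by the caller.

```haskell
import Text.Read (readMaybe)

-- | Device type part of the triple.
data DeviceType = CharDevice | BlockDevice
   deriving (Show, Eq)

-- | Device triple: type, major number, minor number.
data DeviceID = DeviceID DeviceType Word Word
   deriving (Show, Eq)

-- | Parse "major:minor" text (e.g. "13:32\n") into a device triple.
parseDeviceID :: DeviceType -> String -> Maybe DeviceID
parseDeviceID typ str =
   case break (== ':') (takeWhile (/= '\n') str) of
      (major, ':' : minor) -> DeviceID typ <$> readMaybe major <*> readMaybe minor
      _                    -> Nothing
```

For the mouse0 device shown in the kernel events earlier (MAJOR 13, MINOR 32), parseDeviceID CharDevice "13:32\n" evaluates to Just (DeviceID CharDevice 13 32).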

Netlink socket

Linux dynamically adds and removes files and directories in the sysfs file-system when devices are plugged or unplugged. To signal it to user-space, it sends kernel events through a Netlink socket. The Netlink socket is also used to pass some other messages, for instance when the kernel wants to ask something of user-space. haskus-system handles a Netlink socket, parses received kernel events and delivers them through an STM broadcast channel.

In usual Linux distributions, a daemon called udev is responsible for handling these kernel events. Rules can be written to react to specific events. In particular, udev is responsible for creating device special files in the /dev directory. The naming of these special files is a big deal for these distributions as applications use them directly afterwards and don’t use the other unique device identifiers (i.e., the device path in the sysfs file-system). In haskus-system, high-level APIs are provided to avoid direct references to device special files.
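To give an idea of what parsing a kernel event involves, here is a minimal sketch in plain Haskell. It assumes the usual layout of a kernel event message: a header of the form action@devpath followed by NUL-separated KEY=VALUE fields. The Event type and the function names are hypothetical; the actual haskus-system parser and its KernelEvent type are richer than this.

```haskell
-- | Simplified stand-in for a parsed kernel event.
data Event = Event
   { eventAction  :: String
   , eventDevPath :: String
   , eventDetails :: [(String, String)]
   } deriving (Show, Eq)

-- | Split a string on NUL characters, dropping empty fields.
splitNul :: String -> [String]
splitNul s = case break (== '\NUL') s of
   (field, rest) -> [field | not (null field)] ++ case rest of
      []       -> []
      _ : more -> splitNul more

-- | Parse a raw message: "action@devpath" header, then KEY=VALUE fields.
parseEvent :: String -> Maybe Event
parseEvent raw = case splitNul raw of
   (header : fields)
      | (action, '@' : path) <- break (== '@') header
      -> Just (Event action path (map keyValue fields))
   _ -> Nothing
   where
      keyValue f = let (k, v) = break (== '=') f in (k, drop 1 v)
```

A message that does not start with an action@devpath header is rejected with Nothing, which leaves room for the other kinds of messages that travel on the same socket.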

Further reading

In usual Linux distributions, udev (man 7 udev) is responsible for handling devices. It reads sysfs and listens to kernel events to create and remove device nodes in the /dev directory, following customizable rules. It can also execute custom commands (crda, etc.) to respond to kernel requests.

Graphics Overview

In this chapter we present Linux’s display model.

Generalities

From a programmer’s point of view, screens basically display rectangular grids of sample points (or “pixels” standing for “picture elements”) where each point can have a different color. The display resolution is the number of pixels in each dimension: for example, a display resolution of 1920x1080 means that there are 1920 pixels in the horizontal dimension and 1080 in the vertical dimension.

On some hardware, we may also use sub-pixel rendering: pixel colors are generated by blending several sub-pixel colors. If we know the physical layout of these sub-pixels, we may want to precisely select the sub-pixel colors to enhance the overall output.

There can be a lot of different hardware configurations. It is common to have computers with multiple connected display devices: multi-screens, video-projector, TV, etc. Each display supports its own modes of operation (display resolution, refresh rates, etc.). Besides, we may want each device to display something different or some of them to display the same thing (“clone” mode): in the latter case, what if the display resolutions or the other mode settings are different?

It is also possible to have several graphic cards in the same computer:
• each card can provide its own set of output display ports;
• several cards may be physically connected together to accelerate the display on one of their display ports;
• several cards may be connected to the same output connectors and the system can switch between the different graphic cards. Usually there are two cards: a powerful but power-hungry one and a basic one used to save power.

In this chapter we present a simple approach to display rendering: basically the picture to display is generated by the CPU in host memory and then transferred to the GPU memory (implicitly by using memory mapping). Most recent graphic cards offer more efficient approaches: the picture to display is generated and transformed directly by the graphic card. Instead of sending a picture, the host sends commands or programs to be executed by the GPU.

Currently Linux doesn’t offer a unified interface to advanced graphic card capabilities from different vendors (these are usually handled by MESA in user-space and accessed through the OpenGL interface). haskus-system doesn’t provide support for them yet.

In this chapter, we describe the Kernel Mode Setting (KMS) and the Direct Rendering Manager (DRM) interfaces. In usual Linux distributions, some graphic card manufacturers provide closed-source proprietary drivers that do not support these interfaces: they use a kernel module and user-space libraries that communicate together by using a private protocol. The user-space libraries provide implementations of standard high-level interfaces such as OpenGL and can be used by rendering managers such as X.org. haskus-system doesn’t offer a way to use these drivers.

Display Model

Linux’s display model is composed of several entities that interact with each other. These entities are represented on the following graph:

Numbers indicate relationship arities: for instance a Controller can be connected to at most a single FrameBuffer, but a FrameBuffer can be used by a variable number of Controllers. The arrows indicate entities that store references to other entities (at the tip of the arrow).

In order to display something, we have to configure a pipeline that goes from some surfaces (pixel data stored in memory buffers) to some connectors (entities representing physical ports onto which display devices are connected).

Card

The following schema shows a contrived example of a graphic card containing some of these entities.

Entities belong to a single graphic card: they can’t be shared between several graphic cards (if your system has more than one of these).

Connectors

Each physical port where you can plug a display device (a monitor, a video-projector, etc.) corresponds to a Connector entity in the display model.

The following code shows how to retrieve the graphic card objects and how to display information about each connector:

import Haskus.System
import Haskus.Arch.Linux.Graphics.State

main :: IO ()
main = runSys <| do

   sys  <- defaultSystemInit
   term <- defaultTerminal

   -- get graphic card devices
   cards <- loadGraphicCards (systemDeviceManager sys)

   forM_ cards <| \card -> do
      state <- readGraphicsState (graphicCardHandle card)
               >..~!!> assertShow "Cannot read graphics state"

      -- get connector state and info
      let conns = graphicsConnectors state

      -- show connector state and info
      writeStrLn term (show conns)

   void powerOff

When executed in QEMU, this code produces the following output:

-- Formatting has been enhanced for readability
[ Connector
   { connectorID = ConnectorID 21
   , connectorType = Virtual
   , connectorByTypeIndex = 1
   , connectorState = Connected (ConnectedDevice
      { connectedDeviceModes =
         [ Mode
            { ...
            , modeClock = 65000
            , modeHorizontalDisplay = 1024
            , modeVerticalDisplay = 768
            , modeVerticalRefresh = 60
            , modeFlags = fromList [ModeFlagNHSync,ModeFlagNVSync]
            , modeStereo3D = Stereo3DNone
            , modeType = fromList [ModeTypePreferred,ModeTypeDriver]
            , modeName = "1024x768" }
         , ...
         ]
      , connectedDeviceWidth = 0
      , connectedDeviceHeight = 0
      , connectedDeviceSubPixel = SubPixelUnknown
      , connectedDeviceProperties =
         [ Property
            { propertyMeta = PropertyMeta
               { ...
               , propertyName = "DPMS"
               , propertyType = PropEnum
                  [ (0,"On")
                  , (1,"Standby")
                  , (2,"Suspend")
                  , (3,"Off")]
               }
            , propertyValue = 0
            }
         ]
      })
   , connectorPossibleEncoderIDs = [EncoderID 20]
   , connectorEncoderID = Just (EncoderID 20)
   , connectorHandle = Handle ...
   }
]

Each connector reports its type in the connectorType field: in our example it is a virtual port because we use QEMU, but it could have been VGA, HDMI, TV, LVDS, etc.

If there are several connectors of the same type in the same card, you can distinguish them with the connectorByTypeIndex field.

You can check that a display device is actually plugged in a connector with the connectorState property: in our example, there is a (virtual) screen connected. We can get more information about the connected device:
• connectedDeviceModes: modes supported by the connected display device. In particular, a display resolution is associated to each mode. In our example, the display resolution of the first mode is 1024x768; the other modes have been left out for clarity.
• connectedDeviceWidth and connectedDeviceHeight: some display devices report their physical dimensions in millimeters.
• connectedDeviceSubPixel: whether the device uses some kind of sub-pixel technology.
• connectedDeviceProperties: device specific properties. In this example, there is only a single property named “DPMS” which can take 4 different values (“On”, “Standby”, “Suspend”, “Off”) and whose current value is 0 (“On”): this property can be used to switch the power mode of the screen.

A connector gets the data to display from an encoder:
• connectorPossibleEncoderIDs: list of encoders that can be used as sources.
• connectorEncoderID: identifier of the currently connected encoder, if any.

Detecting Plugging/Unplugging

We can adapt what our system displays to the connected screens, but how do we detect when a screen is connected or disconnected? A solution would be to periodically check the value of the connectorState property. But a better method is to use a mechanism explained in the basic device management page: when the state of a connector changes, the kernel sends to the user-space an event similar to the following one:

KernelEvent
   { kernelEventAction    = ActionChange
   , kernelEventDevPath   = "/devices/.../drm/card0"
   , kernelEventSubSystem = "drm"
   , kernelEventDetails   = fromList
      [ ("DEVNAME","drm/card0")
      , ("MAJOR","226")
      , ("MINOR","0")
      , ("HOTPLUG","1")
      , ("SEQNUM","1259")]}

When the system receives this event, it knows it has to check the state of the connectors.

Note that the number of connector entities may change dynamically. For instance a single DisplayPort connector supporting Multi-Stream Transport (MST) allows several monitors to be connected in sequence (daisy-chaining): each monitor receives its own video stream and appears as a different connector entity. It is also possible to connect an MST hub that increases the number of connector entities.
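Recognising such an event boils down to checking the subsystem and the HOTPLUG detail. The sketch below uses a simplified, hypothetical record (SimpleEvent) instead of the real KernelEvent type, whose exact field types are not shown in this chapter:

```haskell
-- | Simplified stand-in for a kernel event: subsystem and key/value details.
data SimpleEvent = SimpleEvent
   { evSubSystem :: String
   , evDetails   :: [(String, String)]
   } deriving (Show)

-- | True for events signalling that the connectors of a card should be
-- checked again (subsystem "drm" with HOTPLUG set to "1").
isDisplayHotplug :: SimpleEvent -> Bool
isDisplayHotplug ev =
      evSubSystem ev == "drm"
   && lookup "HOTPLUG" (evDetails ev) == Just "1"
```

A system would typically filter the received events with such a predicate and read the graphics state of the card again when it matches.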

Encoders

Encoders convert pixel data into signals expected by connectors: for instance DVI and HDMI connectors need a TMDS encoder. Each card provides a set of encoders and each of them can only work with some controllers and some connectors. There may be a 1-1 relationship between an encoder and a connector, in which case the link between them should already be set.

We can display information about encoders using code similar to the code above for connectors. When executed in QEMU, we get the following result:

[ Encoder
   { encoderID = EncoderID 20
   , encoderType = EncoderTypeDAC
   , encoderControllerID = Just (ControllerID 19)
   , encoderPossibleControllers = [ControllerID 19]
   , encoderPossibleClones = []
   , encoderHandle = Handle ...
   }
]

As we can observe, the graphic card emulated by QEMU provides a single DAC encoder.

The encoderPossibleClones field contains the sibling encoders that can be used for cloning: only these encoders can share the same controller as a source.

Controllers

Controllers let you configure:
• The display mode (display resolution, etc.) that will be used by the display devices that are connected to the controller through an encoder and a connector.
• The primary source of the pixel data from a FrameBuffer entity.

We can display information about controllers using code similar to the code above for connectors. When executed in QEMU, we get the following result:

[ Controller
   { controllerID = ControllerID 19
   , controllerMode = Just (Mode { ... })
   , controllerFrameBuffer = Just (FrameBufferPos
      { frameBufferPosID = FrameBufferID 46
      , frameBufferPosX = 0
      , frameBufferPosY = 0
      })
   , controllerGammaTableSize = 256
   , controllerHandle = Handle ...
   }
]

• controllerMode: the display mode that has to be used by the display device(s).
• controllerFrameBuffer: the FrameBuffer entity used as a data source and the coordinates in the FrameBuffer contents.

Planes

Some controllers can blend several layers together from different FrameBuffer entities: these layers are called Planes. Controllers support at least a primary plane and they can support others such as cursor or overlay planes.

TODO:
• List plane resources
• primary plane
• cursor planes
• overlay planes
• example

Framebuffers And Surfaces

Planes take their input data from FrameBuffer entities. FrameBuffer entities describe how pixel data are encoded and where to find them in the GPU memory. Some pixel encoding formats require more than one memory buffer (Surface entities); these buffers are combined to obtain the final pixel colors.

TODO:
• Pixel formats
• FrameBuffer dirty
• Mode
• Generic buffers
• Note on accelerated buffers

If we use an unaccelerated method (“dumb buffers” in Linux terminology) where the graphics data are fully generated by the CPU, applications only have to map the contents of the Surface entities into their memory address spaces and to modify them to change what is displayed.
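Once a surface is mapped, drawing is plain memory writing. The following sketch shows the address arithmetic for a linear surface, assuming 32 bits per pixel and a known pitch (number of bytes per row). The function names are hypothetical, and ordinary host memory stands in for the mapped surface; this is not the haskus-system API.

```haskell
import Data.Word (Word32, Word8)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Ptr (Ptr)
import Foreign.Storable (peekByteOff, pokeByteOff)

-- | Byte offset of pixel (x, y) in a linear surface with 4 bytes per pixel.
-- The pitch is the number of bytes per row (it may exceed width * 4).
pixelOffset :: Int -> Int -> Int -> Int
pixelOffset pitch x y = y * pitch + x * 4

-- | Write one 32-bit pixel value into a mapped surface.
putPixel :: Ptr Word8 -> Int -> (Int, Int) -> Word32 -> IO ()
putPixel surface pitch (x, y) color =
   pokeByteOff surface (pixelOffset pitch x y) color

-- | Demo on a temporary 4x2 pixel buffer standing in for a mapped surface:
-- write a pixel at (1, 1) and read it back.
demo :: IO Word32
demo = allocaBytes (pitch * 2) $ \buf -> do
   putPixel buf pitch (1, 1) 0x00FF0000
   peekByteOff buf (pixelOffset pitch 1 1)
   where
      pitch = 16
```

With a real surface, the pointer would come from the memory mapping and the pitch from the buffer description, but the offset computation stays the same.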

Further Reading

As explained in the basic device management page, device drivers can support the ioctl system call to handle device specific commands from user-space. The display interface is almost entirely based on it. Additionally, mmap is used to map graphic card memory in user-space and read is used to read events (V-Blank and page-flip asynchronous completion).

In usual Linux distributions, the libdrm library provides an interface over these system calls. You can learn about the low-level interface by reading the drm manual (man drm) or its source code.

David Herrmann has written a good tutorial explaining how to use the legacy low-level display interface in the form of C source files with detailed comments. While some details of the interface have changed since he wrote it (e.g., the way to flip frame buffers and the atomic interface), it is still a valuable source of information. The newer atomic interface is described in an article series on LWN called “Atomic mode setting design overview” (August 2015) by Daniel Vetter.

Wayland is the new display system for usual Linux based distributions. It can be a great source of inspiration and of information.

You can also read the Linux kernel code located in drivers/gpu/drm in the kernel sources.

Linux multi-GPU:
• Buffer sharing is supported with DRM Prime
• GPU switching is supported with vga_switcheroo

In volume 3, we describe the internals of haskus-system.

Modules Overview

The code base of haskus-system is becoming quite large. This page gives an overview of the different modules.

Interface with the Linux kernel

haskus-system provides foreign primops to call Linux system calls from Haskell code without going through the libc. In addition to basic system calls, it provides wrappers for some Linux subsystems/features accessible through multiplexing syscalls (e.g., ioctl) or through specific file systems (e.g., sysfs).
• Haskus.Arch.Linux: system calls and low-level interfaces
• Haskus.Arch.Linux.Input: input subsystem
• Haskus.Arch.Linux.Graphics: kms/drm subsystem

Formats

haskus-system provides support for some file formats (e.g., ELF, DWARF, CPIO) and some file system formats (e.g., ISO9660). These can be used to interact with Linux (e.g., to look up functions in the vDSO ELF image), to build initramfs images or bootable disk images, etc.
• Haskus.Format.Compression: some compression algorithms and containers
• Haskus.Format.CPIO: CPIO archive format
• Haskus.Format.Elf: ELF object format
• Haskus.Format.Dwarf: DWARF debugging information format

Architectures

haskus-system provides architecture specific modules (currently only for x86-64), in particular the thin architecture specific layer to call Linux system calls. Additionally, Haskus has a dictionary of x86 instructions; it is currently used to implement a disassembler and could be used to implement assemblers, analyzers, emulators, etc. A wrapper for the x86 cpuid instruction is also provided.
• Haskus.Arch.X86_64: currently only X86-64 is supported.
   – Haskus.Arch.X86_64.ISA: instruction set architecture
   – Haskus.Arch.X86_64.Disassembler
   – Haskus.Arch.X86_64.Linux: arch-specific Linux interface (syscalls)
   – Haskus.Arch.X86_64.Cpuid: CPUID wrapper

System interface

haskus-system provides modules to interact with the system: input devices, display devices, etc. These modules are used to easily build a custom system without dealing directly with the low-level Linux interface. It also provides a custom monad with common features for system programming (logging, etc.).
• Haskus.System

X86 architecture notes

Notes on x86 in haskus-system and in general.

Instruction list

haskus-system aims to support X86 architectures by providing a declarative list of instruction specifications (in Haskus.Arch.X86_64.ISA.Insns) that can be used to write:
• disassemblers
• assemblers
• analyzers
• (micro-)benchmarks
• emulators/simulators

Most instruction lists I have found (e.g., NASM’s insns.dat file, vendor documentation) “flatten” the instruction polymorphism down to operand classes: for each instruction, there is an entry in the list for each tuple of possible operand classes. In haskus-system, I have tried to keep as much of the original semantics as possible in a declarative way. For instance, I describe the semantics of some optional bits in the opcodes:
• bit indicating that operands have to be commuted
• bit indicating that the operand size is 8-bit
• bit indicating that the FPU stack is popped
• etc.

In theory, it should be easy to generate a flattened instruction list from the haskus-system one, while the other way around is much more involved.

Additionally, the list contains for each instruction:
• flags that are read/set/unset/undefined
• implicit operands
• operand access mode: Read/Write/Both/None (used to access metadata)
• constraints: alignment requirement for memory
• supported prefixes
• required X86 extensions
• some documentation

In the future, I would like to add:
• pseudo-code for each instruction (useful for emulators, etc.)
• exceptions (maybe inferred from pseudo-code?)
• required privileges
• maybe precomputed latencies and such as found in Agner’s work for different architectures (useful for cross-compilers) and a way to compute them ourselves on the current architecture

Instruction Operands

The polymorphism of x86 instructions is hard to formalize. For each operand, we have to consider:
• the type of data it represents: an integer, a float, a vector of something
• its size
• where it is stored: in memory or in a register (which depends on the type and size)

There are often functional dependencies between operands: the type of one of the operands may constrain the type of the other ones.

Usually assemblers infer the operand size from the operands, but as memory addresses don’t indicate the type of the data they point to, explicit annotations are sometimes required.

Semantics vs Encoding

There isn’t an isomorphism between an instruction and its encoding. Put another way: sometimes it is possible to encode a given instruction using different encodings. A basic example is to add superfluous prefixes (e.g., overriding the default segment with the same segment).

The semantics of the program stays the same but there are side-effects:
• instruction size vs instruction cache
• instruction alignment vs instruction decoder

An assembler would have to choose whether to use one encoding form or another. In haskus-system, I have tried to define an isomorphism between instruction encodings and a more high-level instruction description by using encoding variants (see EncodingVariant in Haskus.Arch.X86_64.ISA.Insn). The decoder indicates which variant of an instruction encoding it has decoded. The (future) encoder will allow the user to specify which encoding to use (for side-effect considerations).
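Besides superfluous prefixes, a register-to-register addition is another well-known case: with a 32-bit operand size, "add eax, ebx" can be encoded either with opcode 01 (ADD r/m32, r32) or with opcode 03 (ADD r32, r/m32), the register fields of the ModRM byte being swapped accordingly. The sketch below builds both byte sequences in plain Haskell; the names are hypothetical and it does not use the haskus-system instruction tables.

```haskell
import Data.Bits (shiftL, (.|.))
import Data.Word (Word8)

-- | Build a ModRM byte from its mod (2 bits), reg (3 bits) and rm (3 bits) fields.
modRM :: Word8 -> Word8 -> Word8 -> Word8
modRM md reg rm = (md `shiftL` 6) .|. (reg `shiftL` 3) .|. rm

-- | Register numbers used in ModRM fields.
eax, ebx :: Word8
eax = 0
ebx = 3

-- | ADD r/m32, r32 (opcode 01 /r): destination in rm, source in reg.
addVariantA :: [Word8]
addVariantA = [0x01, modRM 3 ebx eax]

-- | ADD r32, r/m32 (opcode 03 /r): destination in reg, source in rm.
addVariantB :: [Word8]
addVariantB = [0x03, modRM 3 eax ebx]
```

The two variants evaluate to the byte sequences 01 D8 and 03 C3: different encodings with the same semantics, which is exactly what a decoder has to remember if the original bytes must be reproduced later.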

Instruction entry point

The x86 ISA uses some prefixes to alter the meaning of some instructions. In the binary form (in the legacy encoding), a prefix is just a byte put before the rest of the instruction bits. It is possible to jump/branch into the instruction after some prefixes in order to avoid them. Hence there is a kind of instruction overlay: two “different” instructions sharing some bits in the binary.

It makes the representation and the analysis of these programs a little bit trickier, especially if we perform a non-linear disassembly by following branches: we get branch labels inside instructions. I have been told this is used in the glibc.

X86 encoding

Mode

An x86-64 architecture can run in 5 different execution modes, excluding System Management Mode (SMM). Different flags from different registers are used to indicate the current mode. We synthesize them into a virtual M flag.

Operation, address and stack sizes

Default size

The default operation size (DOS) and the default address size (DAS) depend on the execution mode (M) and on the D/B flag in the CS segment descriptor.

Similarly, the default stack size (DSS) depends on the execution mode and on the D/B flag in the SS segment descriptor.

Overridden size

The DOS and the DAS can be overridden per instruction with the 66 and 67 prefixes respectively, giving us the overridden operation size (OOS) and the overridden address size (OAS).

Finally, in 64-bit execution mode (M=4), some instructions default to a 64-bit operation size and a new W prefix can be used to enforce 64-bit operation size. This gives us the overridden operation size 64 (OOS64).
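The rules above can be summarised as a small function. The sketch below is a simplification under stated assumptions: the execution context is reduced to "64-bit mode" versus "any other mode with a given D/B flag" (real mode being treated like a cleared flag), W stands for the W bit of the REX prefix, and all type and function names are hypothetical.

```haskell
data OpSize = Size16 | Size32 | Size64
   deriving (Show, Eq)

-- | Simplified execution context: 64-bit mode, or another mode where the
-- Bool tells whether the D/B flag of the CS descriptor selects 32-bit.
data Context = LongMode64 | Legacy Bool

-- | Default operation size (DOS).
defaultOpSize :: Context -> OpSize
defaultOpSize LongMode64     = Size32
defaultOpSize (Legacy True)  = Size32
defaultOpSize (Legacy False) = Size16

-- | Overridden operation size (OOS): the 66 prefix switches between
-- the 16-bit and 32-bit sizes.
overriddenOpSize :: Context -> Bool -> OpSize
overriddenOpSize ctx has66
   | not has66     = dos
   | dos == Size16 = Size32
   | otherwise     = Size16
   where
      dos = defaultOpSize ctx

-- | OOS64: in 64-bit mode, the W prefix enforces 64 bits, and instructions
-- defaulting to 64 bits use 64 bits unless the 66 prefix is present.
overriddenOpSize64 :: Context -> Bool -> Bool -> Bool -> OpSize
overriddenOpSize64 LongMode64 has66 hasW default64
   | hasW                   = Size64
   | default64 && not has66 = Size64
   | otherwise              = overriddenOpSize LongMode64 has66
overriddenOpSize64 ctx has66 _ _ = overriddenOpSize ctx has66
```

As noted in the summary, these sizes only describe the execution context: whether an instruction actually uses them as its operand size depends on the instruction itself.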

Summary

DOS, DAS, DSS, OOS, OAS and OOS64 are sizes that can be inferred from the execution context. They don’t corre- spond directly to actual operation or operand sizes: it depends on each instruction.

Full poster

CHAPTER 2

haskus-binary

Binary modules

haskus-binary is a package containing a set of modules dedicated to the manipulation of binary data. It provides data types mapping those of other languages such as C and even more. All these modules are in Haskus.Format.Binary.

We don’t rely on external tools such as C2HS to provide bindings to C libraries. There are several reasons for that:
• We don’t want to depend on .h files;
• .h files often contain peculiarities that are difficult to handle automatically;
• Documentation and code of the resulting Haskell files are often very bad:
   – No haddock
   – Very low-level (e.g. #define are not transformed into datatypes with Enum instances)

Instead haskus-binary lets you write bindings in pure Haskell code and provides many useful things to make this process easy.

Word, Int

The Word module contains data types representing unsigned words (Word8, Word16, Word32, etc.) and signed integers (Int8, Int16, Int32, etc.). It also contains some C types such as CSize, CShort, CUShort, CLong, CULong, etc.

Endianness

Words and Ints are stored (i.e., read and written) using host endianness (byte ordering). AsBigEndian and AsLittleEndian data types in the Endianness module allow you to force a different endianness. The following example shows a data type containing a field for each endianness variant. We explain how to use this kind of data type as a C structure later in this document.

data Dummy = Dummy
   { fieldX :: Word32                -- ^ 32-bit unsigned word (host endianness)
   , fieldY :: AsBigEndian Word32    -- ^ 32-bit unsigned word (big-endian)
   , fieldZ :: AsLittleEndian Word32 -- ^ 32-bit unsigned word (little-endian)
   } deriving (Generic,Storable)

We can also explicitly change the endianness with the following methods: hostToBigEndian, hostToLittleEndian, bigEndianToHost, littleEndianToHost, reverseBytes. Each of these methods is either equivalent to id or to reverseBytes depending on the host endianness.
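The "either id or reverseBytes" behaviour can be reproduced with the base library alone. This is a sketch of the idea and not the haskus-binary implementation: byteSwap32 (from Data.Word) and targetByteOrder (from GHC.ByteOrder) are base functions, while toBigEndian32 and toLittleEndian32 are hypothetical names.

```haskell
import Data.Word (Word32, byteSwap32)
import GHC.ByteOrder (ByteOrder (..), targetByteOrder)

-- | Convert a host-endian 32-bit word to big-endian byte order.
toBigEndian32 :: Word32 -> Word32
toBigEndian32 = case targetByteOrder of
   BigEndian    -> id
   LittleEndian -> byteSwap32

-- | Convert a host-endian 32-bit word to little-endian byte order.
toLittleEndian32 :: Word32 -> Word32
toLittleEndian32 = case targetByteOrder of
   BigEndian    -> byteSwap32
   LittleEndian -> id
```

On any host exactly one of the two functions is the identity, and since reversing bytes twice restores the original word, the same functions also convert back to host order.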

Bits

The Bits module allows you to perform bitwise operations on data types supporting them.

Buffer

A Buffer is basically a strict ByteString with a better name and a better integration with Storable type class.

Structures

You map C data structures to Haskell data types as follows:

{-# LANGUAGE DeriveAnyClass #-}
{-# LANGUAGE DeriveGeneric #-}

import Haskus.Format.Binary.Storable
import Haskus.Format.Binary.Word
import Haskus.Utils.Types.Generics (Generic)

data StructX = StructX
   { xField0 :: Word8
   , xField1 :: Word64
   } deriving (Show,Generic,Storable)

The Storable instance handles the alignment of the fields as a C non-packed structure would (i.e. there are 7 padding bytes between xField0 and xField1). peek and poke can be used to read and write the data structure in memory.
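The 7 padding bytes follow from the usual C layout rule: each field is placed at the next offset that is a multiple of its alignment. The sketch below recomputes the layout of StructX with the standard Foreign.Storable class from base (not the haskus-binary one). The helper names are hypothetical, and the figures in the usage note assume a platform such as x86-64 where Word64 is 8-byte aligned.

```haskell
import Data.Word (Word64, Word8)
import Foreign.Storable (Storable, alignment, sizeOf)

-- | Round an offset up to the next multiple of an alignment.
alignUp :: Int -> Int -> Int
alignUp align offset = ((offset + align - 1) `div` align) * align

-- | Offset of a field, given the end offset of the previous field.
fieldOffset :: Storable a => a -> Int -> Int
fieldOffset field prevEnd = alignUp (alignment field) prevEnd

-- | Layout of StructX: a Word8 followed by a Word64.
xField0Offset, xField1Offset, structXSize :: Int
xField0Offset = 0
xField1Offset = fieldOffset (0 :: Word64) (xField0Offset + sizeOf (0 :: Word8))
structXSize   = alignUp (alignment (0 :: Word64)) (xField1Offset + sizeOf (0 :: Word64))
```

On x86-64, xField1Offset is 8 (hence 7 padding bytes after the one-byte xField0) and structXSize is 16.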

Nesting

Data structures can be nested:

data StructY = StructY
   { yField0 :: StructX
   , yField1 :: Word64
   } deriving (Show,Generic,Storable)

Arrays (or Vectors)

haskus-binary supports vectors: a fixed amount of Storable data correctly aligned. You can define a vector as follows:

{-# LANGUAGE DataKinds #-}

import Haskus.Format.Binary.Vector as V

v :: Vector 5 Word16

Vectors are storable, so you can peek and poke them from memory. Alternatively, you can create them from a list:

Just v = fromList [1,2,3,4,5]
Just v = fromList [1,2,3,4,5,6] -- this fails dynamically
Just v = fromList [1,2,3,4]     -- this fails dynamically

-- take at most 5 elements then fill with 0: v = [1,2,3,4,5]
v = fromFilledList 0 [1,2,3,4,5,6]

-- take at most 5 elements then fill with 7: v = [1,2,3,7,7]
v = fromFilledList 7 [1,2,3]

-- take at most 4 (!) elements then fill with 0: v = [1,2,3,0,0]
v = fromFilledListZ 0 [1,2,3]

-- useful for zero-terminated strings: s = "too long \NUL"
s :: Vector 10 CChar
s = fromFilledListZ 0 (fmap castCharToCChar "too long string")

You can concatenate several vectors into a single one:

import Haskus.Utils.HList

x = fromFilledList 0 [1,2,3,4] :: Vector 4 Int
y = fromFilledList 0 [5,6]     :: Vector 2 Int
z = fromFilledList 0 [7,8,9]   :: Vector 3 Int
v = V.concat (x `HCons` y `HCons` z `HCons` HNil)

> :t v
v :: Vector 9 Int

> v
fromList [1,2,3,4,5,6,7,8,9]

You can also safely drop or take elements in a vector. You can also index into a vector:

import Haskus.Format.Binary.Vector as V

v :: Vector 5 Int
v = fromFilledList 0 [1,2,3,4,5,6]

-- v2 = [1,2]
v2 = V.take @2 v

-- won't compile (8 > 5)
v2 = V.take @8 v

-- v2 = [3,4,5]
v2 = V.drop @2 v

-- x = 3
x = V.index @2 v

Finally, you can obtain a list of the values:

> V.toList v
[1,2,3,4,5]

Enums

If you have a C enum (or a set of #define’s) with consecutive values and starting from 0, you can do:

{-# LANGUAGE DeriveAnyClass #-}

import Haskus.Format.Binary.Enum

data MyEnum
   = MyEnumX
   | MyEnumY
   | MyEnumZ
   deriving (Show,Eq,Enum,CEnum)

If the values are not consecutive or don’t start from 0, you can write your own CEnum instance:

-- Add 1 to the enum number to get the valid value
instance CEnum MyEnum where
   fromCEnum = (+1) . fromIntegral . fromEnum
   toCEnum   = toEnum . (\x -> x-1) . fromIntegral
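The instance above covers values shifted by a constant. When the C values are not consecutive, an explicit mapping is one possible approach. The sketch below uses only base and shows the mapping itself, outside of any CEnum instance: Signal, signalToC and signalFromC are hypothetical names chosen for illustration, and returning Maybe for unknown values is a design choice of this sketch, not something the CEnum class is stated to do.

```haskell
import Data.Word (Word32)

-- Hypothetical enum whose C values are 1, 2 and 9 (not consecutive).
data Signal = SigHup | SigInt | SigKill
   deriving (Show, Eq, Enum, Bounded)

-- | Haskell constructor to C value.
signalToC :: Signal -> Word32
signalToC SigHup  = 1
signalToC SigInt  = 2
signalToC SigKill = 9

-- | C value to Haskell constructor; Nothing for unknown values.
signalFromC :: Word32 -> Maybe Signal
signalFromC n = lookup n [ (signalToC s, s) | s <- [minBound .. maxBound] ]

main :: IO ()
main = do
   print (signalToC SigKill)   -- 9
   print (signalFromC 9)       -- Just SigKill
   print (signalFromC 3)       -- Nothing
```

Deriving the reverse table from signalToC keeps the two directions consistent when a constructor is added.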

To use an Enum as a field in a structure, use EnumField:

data StructZ = StructZ
   { zField0 :: StructX
   , zField1 :: EnumField Word32 MyEnum
   } deriving (Show,Generic,Storable)

The first type parameter of EnumField indicates the backing word type (i.e. the size of the field in the structure). For instance, you can use Word8, Word16, Word32 and Word64. To create or extract an EnumField, use the methods:

fromEnumField :: CEnum a => EnumField b a -> a
toEnumField   :: CEnum a => a -> EnumField b a

We use a CEnum class that is very similar to Enum because Enum is a special class that has access to data constructor tags. If we redefine Enum, we cannot use fromEnum to get the data constructor tag.

Bit sets (or “flags”)

We often use flags that are combined in a single word. Each flag is associated with a bit of the word: if the bit is set the flag is active, otherwise the flag isn’t active. haskus-binary uses the CBitSet class to get the bit offset of each flag. By default, it uses the Enum instance to get the bit offsets as in the following example:

{-# LANGUAGE DeriveAnyClass #-}

import Haskus.Format.Binary.BitSet

data Flag
   = FlagX  -- bit 0
   | FlagY  -- bit 1
   | FlagZ  -- bit 2
   deriving (Show,Eq,Enum,CBitSet)

If you want to use different bit offsets, you can define your own CBitSet instance:

-- Add 1 to the enum number to get the valid bit offset
instance CBitSet Flag where
   toBitOffset   = (+1) . fromEnum
   fromBitOffset = toEnum . (\x -> x-1)

To use a bit set as a field in a structure, use BitSet:

data StructZ = StructZ
   { zField0 :: ...
   , zField1 :: BitSet Word32 Flag
   } deriving (Show,Generic,Storable)

The first type parameter of BitSet indicates the backing word type (i.e. the size of the field in the structure). For instance, you can use Word8, Word16, Word32 and Word64. Use the following methods to manipulate the BitSet:

fromBits     :: (CBitSet a, FiniteBits b) => b -> BitSet b a
toBits       :: (CBitSet a, FiniteBits b) => BitSet b a -> b
member       :: (CBitSet a, FiniteBits b) => BitSet b a -> a -> Bool
notMember    :: (CBitSet a, FiniteBits b) => BitSet b a -> a -> Bool
toList       :: (CBitSet a, FiniteBits b) => BitSet b a -> [a]
fromList     :: (CBitSet a, FiniteBits b, Foldable m) => m a -> BitSet b a
intersection :: FiniteBits b => BitSet b a -> BitSet b a -> BitSet b a
union        :: FiniteBits b => BitSet b a -> BitSet b a -> BitSet b a

Note that we don’t check if bit offsets are outside of the backing word. You have to choose a backing word that is large enough.
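The underlying idea (one bit per flag, with the bit offset taken from the Enum instance) can be sketched with Data.Bits from base. This is not the haskus-binary implementation: fromFlags, hasFlag and toFlags are hypothetical helpers working directly on a Word8, and Bounded is derived here only so that all flags can be enumerated.

```haskell
import Data.Bits (setBit, testBit, zeroBits)
import Data.Word (Word8)

data Flag = FlagX | FlagY | FlagZ
   deriving (Show, Eq, Enum, Bounded)

-- | Set the bit at offset `fromEnum f` for every flag in the list.
fromFlags :: [Flag] -> Word8
fromFlags = foldl (\w f -> setBit w (fromEnum f)) zeroBits

-- | A flag is active when its bit is set.
hasFlag :: Word8 -> Flag -> Bool
hasFlag w f = testBit w (fromEnum f)

-- | List the active flags.
toFlags :: Word8 -> [Flag]
toFlags w = filter (hasFlag w) [minBound .. maxBound]

main :: IO ()
main = do
   let w = fromFlags [FlagX, FlagZ]
   print w                  -- 5 (bits 0 and 2 are set)
   print (toFlags w)        -- [FlagX,FlagZ]
   print (hasFlag w FlagY)  -- False
```

With a Word8 as backing word, only offsets 0 to 7 are representable, which is why the backing word has to be chosen large enough for the highest bit offset.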

Unions

A union provides several ways to access the same buffer of memory. To use unions with haskus-binary, you need to give the list of available representations in a type as follows:

{-# LANGUAGE DeriveAnyClass #-}
{-# LANGUAGE DataKinds #-}

import Haskus.Format.Binary.Union

u :: Union '[Word8, Word64, Vector 5 Word16]

Unions are storable so you can use them as fields in storable structures or you can directly peek/poke them. You can retrieve a member of the union with fromUnion. The extracted type must be a member of the union otherwise it won’t compile.

fromUnion u :: Word64
fromUnion u :: Word8
fromUnion u :: Vector 5 Word16
fromUnion u :: Word32 -- won't compile!

To create a new union from one of its members, use toUnion or toUnionZero. The latter sets the remaining bytes of the buffer to 0. In the example, the union uses 10 bytes (5 * 2 for Vector 5 Word16) and we write 8 bytes (sizeOf Word64) hence there are two bytes that can be left uninitialized (toUnion) or set to 0 (toUnionZero).

u :: Union '[Word8,Word64,Vector 5 Word16]
u = toUnion (0x1122334455667788 :: Word64)

> print (fromUnion u :: Vector 5 Word16)
fromList [30600,21862,13124,4386,49850]

-- or
u = toUnionZero (0x1122334455667788 :: Word64)

> print (fromUnion u :: Vector 5 Word16)
fromList [30600,21862,13124,4386,0]
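The first four values in these outputs are the four 16-bit parts of the Word64, least-significant part first, which is consistent with a little-endian host (an assumption: the text does not state the byte order). A minimal sketch using only base reproduces them with shifts instead of reading memory; wordsLE is a hypothetical helper, not a haskus-binary function.

```haskell
import Data.Bits (shiftR)
import Data.Word (Word16, Word64)

-- | Split a Word64 into four Word16, least-significant part first, i.e. in
-- the order a little-endian machine stores them in memory.
wordsLE :: Word64 -> [Word16]
wordsLE w = [ fromIntegral (w `shiftR` (16 * i)) | i <- [0 .. 3] ]

main :: IO ()
main =
   -- The fifth element is the part not covered by the Word64; it is 0 here,
   -- as with toUnionZero.
   print (wordsLE 0x1122334455667788 ++ [0])   -- [30600,21862,13124,4386,0]
```

In hexadecimal the four parts are 0x7788, 0x5566, 0x3344 and 0x1122. The fifth value printed after toUnion (49850 above) is whatever happened to be in the two uninitialized bytes.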

Bit fields

You may need to define bit fields over words. For instance, you can have a Word16 split into 3 fields X, Y and Z composed of 5, 9 and 2 bits respectively.

                  X             Y          Z
w :: Word16  |0 0 0 0 0|0 0 0 0 0 0 0 0 0|0 0|

You define it as follows:

{-# LANGUAGE DataKinds #-}
{-# LANGUAGE TypeApplications #-}

import Haskus.Format.Binary.BitField

w :: BitFields Word16 '[ BitField 5 "X" Word8
                       , BitField 9 "Y" Word16
                       , BitField 2 "Z" Word8
                       ]
w = BitFields 0x0102

Note that each field has its own associated type (e.g. Word8 for X and Z) that must be large enough to hold the number of bits for the field. Operations on BitFields expect the cumulative size of the fields to be equal to the whole word size: use a padding field if necessary. You can extract and update the value of a field by its name:

x = extractField @"X" w
z = extractField @"Z" w
w' = updateField @"Y" 0x100 w
-- w' = 0x402

z = extractField @"XXX" w -- won't compile

w'' = withField @"Y" (+2) w
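The value 0x402 can be checked by hand with shifts and masks. The sketch below uses only Data.Bits from base and assumes the layout implied by the example: X occupies the 5 most-significant bits, Y the next 9 bits and Z the 2 least-significant bits. extract, update and mask are hypothetical helpers taking explicit bit offsets, unlike the name-based haskus-binary functions.

```haskell
import Data.Bits (complement, shiftL, shiftR, (.&.), (.|.))
import Data.Word (Word16)

-- | A mask with the `size` least-significant bits set.
mask :: Int -> Word16
mask size = (1 `shiftL` size) - 1

-- | Read the `size`-bit field that starts `offset` bits above bit 0.
extract :: Int -> Int -> Word16 -> Word16
extract offset size w = (w `shiftR` offset) .&. mask size

-- | Replace the `size`-bit field at `offset` with `v`, keeping other bits.
update :: Int -> Int -> Word16 -> Word16 -> Word16
update offset size v w =
   (w .&. complement (mask size `shiftL` offset))
   .|. ((v .&. mask size) `shiftL` offset)

main :: IO ()
main = do
   let w = 0x0102 :: Word16
   print (extract 11 5 w)        -- X = 0
   print (extract 2 9 w)         -- Y = 64
   print (extract 0 2 w)         -- Z = 2
   print (update 2 9 0x100 w)    -- 1026, i.e. 0x402
```

Updating Y clears bits 2 to 10, then writes 0x100 shifted left by 2 (0x400); the Z bits (value 2) are untouched, hence 0x402.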

Fields can also be ‘BitSet’ or ‘EnumField’:

{-# LANGUAGE DataKinds #-}
{-# LANGUAGE DeriveAnyClass #-}

import Haskus.Format.Binary.BitField
import Haskus.Format.Binary.Enum
import Haskus.Format.Binary.BitSet

data A = A0 | A1 | A2 | A3 deriving (Show,Enum,CEnum)
data B = B0 | B1 deriving (Show,Enum,CBitSet)

w :: BitFields Word16 '[ BitField 5 "X" (EnumField Word8 A)
                       , BitField 9 "Y" Word16
                       , BitField 2 "Z" (BitSet Word8 B)
                       ]
w = BitFields 0x1503

BitFields are storable and can be used in storable structures. You can easily pattern-match on all the fields at the same time with matchFields and matchNamedFields. Both create a tuple containing one value per field (paired with the field name for matchNamedFields).

> matchFields w
(EnumField A2,320,fromList [B0,B1])

> matchNamedFields w
(("X",EnumField A2),("Y",320),("Z",fromList [B0,B1]))

CHAPTER 3

haskus-utils

Variant/Flow
