Rights to copy

• © Copyright 2004-2019, Bootlin • License: Creative Commons Attribution - Share Alike 3.0 • https://creativecommons.org/licenses/by-sa/3.0/legalcode • You are free: – to copy, distribute, display, and perform the work – to make derivative works – to make commercial use of the work • Under the following conditions: – Attribution. You must give the original author credit. – Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only • under a license identical to this one. – For any reuse or distribution, you must make clear to others the license terms of this work. – Any of these conditions can be waived if you get permission from the copyright holder. • Your fair use and other rights are in no way affected by the above. • Document sources: https://git.bootlin.com/training-materials/

Dongkun Shin, SKKU 1 6. Kernel Frameworks for Device Drivers

Dongkun Shin, SKKU 2 Kernel and Device Drivers

• In , a driver is always interfacing with: – a framework that allows the driver to expose the hardware features to applications. – a bus infrastructure, part of the device model, to detect/communicate with the hardware.

Dongkun Shin, SKKU 3 Types of devices

• Essentially three types of devices under Linux: – Network devices • represented as network interfaces, visible in userspace using ifconfig. – Block devices • used to provide userspace applications access to raw storage devices (hard disks, USB keys). • visible to the applications as device files in /dev. – Character devices • used to provide userspace applications access to all other types of devices (input, sound, graphics, serial, etc.). • visible to the applications as device files in /dev.

Dongkun Shin, SKKU 4 Major and minor numbers

• Within the kernel, all block and character devices are identified using a major and a minor number. • The major number typically indicates the family of the device. • The minor number typically indicates the number of the device (when they are for example several serial ports) • Major and minor numbers are statically allocated, and identical across all Linux systems. • defined in admin-guide/devices.

Dongkun Shin, SKKU 5 Device Files

• Devices are represented as device files to the applications – allows applications to manipulate all system objects with the normal file API (open, read, write, close, etc.) • All device files are by convention stored in the /dev directory • name is associated to that the kernel understands • Example of device files in a Linux system $ ls -l /dev/ttyS0 /dev/tty1 /dev/sda1 /dev/sda2 /dev/zero block brw-rw---- 1 root disk 8, 1 2011-05-27 08:56 /dev/sda1 brw-rw---- 1 root disk 8, 2 2011-05-27 08:56 /dev/sda2 crw------1 root root 4, 1 2011-05-27 08:57 /dev/tty1 char crw-rw---- 1 root dialout 4, 64 2011-05-27 08:56 /dev/ttyS0 crw-rw-rw- 1 root root 1, 5 2011-05-27 08:56 /dev/zero major minor • Example C code that uses the usual file API to write data to a serial port int fd; fd = open("/dev/ttyS0", O_RDWR); write(fd, "Hello", 5); close(fd);

Dongkun Shin, SKKU 6 Creating Device Files

• On a basic Linux system, the device files have to be created manually using the mknod command – mknod /dev/ [c|b] major minor – Needs root privileges – Coherency between device files and devices handled by the kernel is left to the system developer • On more elaborate Linux systems, mechanisms can be added to create/remove them automatically when devices appear and disappear – devtmpfs: virtual filesystem (devfs 2.0, since Linux 2.6.32) – daemon: executes in userspace and listens to uevents the kernel sends – mdev program: a lighter solution than udev (BusyBox) – ueventd in Android

Dongkun Shin, SKKU 7 Character Driver

• For an application, a character device is essentially a file. • Character device drive must implement file operations: – open, close, read, write, etc. – described in the struct file_operations structure • Linux filesystem layer will ensure that the driver's operations are called when an userspace application makes the corresponding system call.

Dongkun Shin, SKKU 8 open() and release()

• int foo_open(struct inode *i, struct file *f) – Called when user space opens the device file. – Only implement this function when you do something special with the device at open() time. – struct inode is a structure that uniquely represents a file in the system (be it a regular file, a directory, a symbolic link, a character or block device) – struct file is a structure created every time a file is opened. Several file structures can point to the same inode structure. • Contains information like the current position, the opening mode, etc. • Has a void *private_data pointer that one can freely use. • A pointer to the file structure is passed to all other operations • int foo_release(struct inode *i, struct file *f) – Called when user space closes the file. – Only implement this function when you do something special with the device at close() time.

Dongkun Shin, SKKU 9 read()

• ssize_t foo_read(struct file *f, char __user *buf, size_t sz, loff_t *off) – Called when user space uses the read() system call on the device. – Must read data from the device, write at most sz bytes to the user space buffer buf, and update the current position in the file off. f is a pointer to the same file structure that was passed in the open() operation – Must return the number of bytes read. • 0 is usually interpreted by userspace as the end of the file. – On UNIX, read() operations typically block when there isn’t enough data to read from the device

Dongkun Shin, SKKU 10 write()

• ssize_t foo_write(struct file *f, const char __user *buf, size_t sz, loff_t *off) – Called when user space uses the write() system call on the device – The opposite of read, must read at most sz bytes from buf, write it to the device, update off and return the number of bytes written.

Dongkun Shin, SKKU 11 Exchanging data with user-space

• Kernel code isn't allowed to directly access user-space memory, using memcpy() or direct pointer dereferencing • To keep the kernel code portable and have proper error handling, your driver must use special kernel functions to exchange data with user- space.

• A single value – get_user(v, p); • The kernel variable v gets the value pointed by the user-space pointer p – put_user(v, p); • The value pointed by the user-space pointer p is set to the contents of the kernel variable v. • A buffer – unsigned long copy_to_user(void __user *to, const void *from, unsigned long n); – unsigned long copy_from_user(void *to, const void __user *from, unsigned long n); • Return value: Zero on success, non-zero on failure (-EFAULT)

Dongkun Shin, SKKU 12 Exchanging data with user-space

Dongkun Shin, SKKU 13 Zero copy access to user memory

• Having to copy data to or from an intermediate kernel buffer can become expensive when the amount of data to transfer is large • Zero copy options are possible: – mmap() system call to allow user space to directly access memory mapped I/O space. – get_user_pages() to get a mapping to user pages without having to copy them.

Classical Data Transfer mmap Data Transfer

Dongkun Shin, SKKU 14 () vs. unlocked_ioctl()

• ioctl() – Allows to extend the driver capabilities beyond the limited read/write API. • example: changing the speed of a serial port, setting video output format, querying a device serial number... – runs under the Big Kernel Lock (BKL). – BKL invokes long-running ioctl(), long latencies for unrelated processes. • long unlocked_ioctl(struct file *f, unsigned int cmd, unsigned long arg) – Called unlocked because it didn't hold the BKL – cmd is a number identifying the operation to perform – arg is the optional argument passed as third argument of the ioctl() system call. Can be an integer, an address, etc. – The semantic of cmd and arg is driver-specific.

Dongkun Shin, SKKU 15 ioctl() example

Kernel Side Application Side

Dongkun Shin, SKKU 16 Beyond character drivers: kernel frameworks

• Many device drivers are not implemented directly as character drivers • implemented under a framework, specific to a given device type (framebuffer, V4L, serial, etc.) • Allows to factorize the common parts of drivers for the same type of devices • From userspace, they are still seen as character devices by the applications • Allows to provide a coherent userspace interface (ioctl, etc.) for every type of device, regardless of the driver • Each framework defines a structure that a must register to be recognized as a device in this framework – struct uart_port for serial port, struct netdev for network devices, struct fb_info for framebuffers, etc.

Dongkun Shin, SKKU 17 Example: Framebuffer Framework

• Kernel option CONFIG_FB – FB • tristate "Support for frame buffer devices" • Implemented in C files in drivers/video/fbdev/core/ • Defines the user/kernel API – include/uapi/linux/fb.h • Defines the set of operations a framebuffer driver must implement and helper functions for the drivers – struct fb_ops – include/linux/fb.h

Dongkun Shin, SKKU 18 Framebuffer driver operations

• The operations a framebuffer driver can or must implement, and define them in a struct fb_ops structure

Dongkun Shin, SKKU 19 Framebuffer driver code

• In the probe() function, registration of the framebuffer device and operations

• register_framebuffer() will create the character device that can be used by user space applications with the generic framebuffer API. Dongkun Shin, SKKU 20 Device managed allocations

• The probe() function is typically responsible for allocating a significant number of resources: memory, mapping I/O registers, registering interrupt handlers, etc. • These resource allocations have to be properly freed: – In the probe() function, in case of failure – In the remove() function • This required a lot of failure handling code that was rarely tested • To solve this problem, device managed allocations have been introduced. • The idea is to associate resource allocation with the struct device, and automatically release those resources – When the device disappears – When the device is unbound from the driver • Functions prefixed by devm_ • See Documentation/driver-model/devres.txt for details

Dongkun Shin, SKKU 21 Device managed allocations: memory alloc example

• Normally done with kmalloc(size_t, gfp_t), released with kfree(void *) • Device managed with devm_kmalloc(struct device *, size_t, gfp_t)

Dongkun Shin, SKKU 22 Driver-specific Data Structure

• Each framework defines a structure that a device driver must register to be recognized as a device in this framework – struct uart_port for serial ports, – struct net_device for network devices, – struct fb_info for framebuffers, – etc. • In addition to this structure, the driver usually needs to store additional information about its device • This is typically done – By subclassing the appropriate framework structure – By storing a reference to the appropriate framework structure – Or by including your information in the framework structure

Dongkun Shin, SKKU 23 Driver-specific Data Structure Examples

Dongkun Shin, SKKU 24 Links between structures

• The framework typically contains a struct device * pointer that the driver must point to the corresponding struct device – the relation between the logical device (for example a network interface) and the physical device (for example the USB network adapter) • The device structure also contains a void * pointer that the driver can freely use. – often used to link back the device to the higher-level structure from the framework. – allows, for example, from the struct platform_device structure, to find the structure describing the logical device

Dongkun Shin, SKKU 25 Links between structures

Dongkun Shin, SKKU 26 Links between structures

Dongkun Shin, SKKU 27 Links between structures

Dongkun Shin, SKKU 28 Input subsystem

• Takes care of all the input events coming from the human user. • Initially written to support the USB HID (Human Interface Device) devices, it quickly grew up to handle all kind of inputs (using USB or not): keyboards, mice, joysticks, touchscreens, etc. • The input subsystem is split in two parts: – Device drivers: they talk to the hardware (for example via USB), and provide events (keystrokes, mouse movements, touchscreen coordinates) to the input core – Event handlers: they get events from drivers and pass them where needed via various interfaces (most of the time through ) • In user space it is usually used by the graphic stack such as X.Org, Wayland or Android’s InputManager.

Dongkun Shin, SKKU 29 Input subsystem diagram

Dongkun Shin, SKKU 30 Input subsystem overview

• Kernel option CONFIG_INPUT – menuconfig INPUT • tristate "Generic input layer (needed for keyboard, mouse, ...)" • Implemented in drivers/input/ – input.c, input-polldev.c, evbug.c • Implements a single character driver and defines the user/kernel API – include/uapi/linux/input.h • Defines the set of operations a input driver must implement and helper functions for the drivers – struct input_dev for the device driver part – struct input_handler for the event handler part – include/linux/input.h

Dongkun Shin, SKKU 31 Input subsystem API (1/3)

• An input device is described by a very long struct input_dev structure

• Before being used it, this structure must be allocated and initialized, typically with: – struct input_dev *devm_input_allocate_device(struct device *dev);

Dongkun Shin, SKKU 32 Input subsystem API (2/3)

• Depending on the type of events that will be generated, the input bit fields evbit and keybit must be configured: • For example, for a button we only generate EV_KEY type events, and from these only BTN_0 events code: – set_bit(EV_KEY, myinput_dev.evbit); – set_bit(BTN_0, myinput_dev.keybit); • set_bit() is an atomic operation allowing to set a particular bit to 1 • Once the input device is allocated and filled, the function to register it is: – int input_register_device(struct input_dev *); • When the driver is unloaded, the input device will be unregistered using: – void input_unregister_device(struct input_dev *);

Dongkun Shin, SKKU 33 Input subsystem API (3/3)

• The events are sent by the driver to the event handler using – input_event(struct input_dev *dev, unsigned int type, unsigned int code, int value); – The event types are documented in Documentation/input/event- codes.txt – An event is composed by one or several input data changes (packet of input data changes) such as the button state, the relative or absolute position along an axis, etc.. – After submitting potentially multiple events, the input core must be notified by calling: • void input_sync(struct input_dev *dev): – The input subsystem provides other wrappers such as • input_report_key(), input_report_abs(), ...

Dongkun Shin, SKKU 34 Polled input subclass

• The input subsystem provides a subclass supporting simple input devices that do not raise interrupts but have to be periodically scanned or polled to detect changes in their state. • A polled input device is described by a struct input_polled_dev structure:

Dongkun Shin, SKKU 35 Polled input subsystem API

• Allocating the struct input_polled_dev structure is done using – devm_input_allocate_polled_device() • Among the handlers of the struct input_polled_dev only the poll() method is mandatory, this function polls the device and posts input events. • The fields id, name, evkey and keybit of the input field must be initialized too. • If none of the poll_interval fields are filled then the default poll interval is 500ms. • The device registration/unregistration is done with: – input_register_polled_device(struct input_polled_dev *dev) – Unregistration is automatic after using devm_input_allocate_polled_device()

Dongkun Shin, SKKU 36 evdev user space interface

• The main user space interface to input devices is the event interface • Each input device is represented as a /dev/input/event character device • A user space application can use blocking and non-blocking reads, but also select() (to get notified of events) after opening this device. • Each read will return struct input_event structures of the following format:

• A very useful application for input device testing is evtest, from http://cgit.freedesktop.org/evtest/

Dongkun Shin, SKKU 37 Block Device

Dongkun Shin, SKKU 38 From userspace

• From userspace, block devices are accessed using device files, usually stored in /dev/ – Created manually or dynamically by udev • Most of the time, they store filesystems : they are not accessed directly, but rather mounted, so that the contents of the filesystem appears in the global hierarchy of files • Block devices are also visible through the filesystem, in the /sys/block/ directory

Dongkun Shin, SKKU 39 Global architecture

Dongkun Shin, SKKU 40 Global architecture

• An user application can use a block device – Through a filesystem, by reading, writing or mapping files – Directly, by reading, writing or mapping a device file representing a block device in /dev • In both cases, the VFS subsystem in the kernel is the entry point for all accesses – A filesystem driver is involved if a normal file is being accessed • The buffer/page cache of the kernel stores recently read and written portions of block devices – critical component for the performance of the system

Dongkun Shin, SKKU 41 Standard Linux filesystem format: ext2, ext3, ext4

• The standard filesystem used on Linux systems is the series of ext{2,3,4} filesystems – ext2 – ext3, brought journaling compared to ext2 – ext4, mainly brought performance improvements and support for even larger filesystems • It supports all features Linux needs from a filesystem: permissions, ownership, device files, symbolic links, etc.

Dongkun Shin, SKKU 42 Other Linux/Unix filesystems

, intended to become the next standard filesystem for Linux. Integrates numerous features: data checksuming, integrated volume management, snapshots, etc. • XFS, high-performance filesystem inherited from SGI IRIX, still actively developed. • JFS, inherited from IBM AIX. No longer actively developed, provided mainly for compatibility. • reiserFS, used to be a popular filesystem, but its latest version Reiser4 was never merged upstream. • All those filesystems provide the necessary functionalities for Linux systems: symbolic links, permissions, ownership, device files, etc.

Dongkun Shin, SKKU 43 F2FS: filesystem for flash-based storage

• Filesystem that takes into account the characteristics of flash-based storage: eMMC, SD cards, SSD, etc. • Developed and contributed by Samsung • Available in the mainline • For optimal results, need a number of details about the storage internal behavior which may not easy to get • Benchmarks: best performance on flash devices most of the time: • See http://lwn.net/Articles/520003/ • Technical details: http://lwn.net/Articles/518988/ • https://www.usenix.org/system/files/conference/fast15/fast15- paper-lee.pdf F2FS: A New File System for Flash Storage (FAST’15) • Not as widely used as ext3,4, even on flash-based storage.

Dongkun Shin, SKKU 44 Squashfs: read-only filesystem

• Read-only, compressed filesystem for block devices. Fine for parts of a filesystem which can be read-only (kernel, binaries...) • Great compression rate, which generally brings improved read performance • Used in most live CDs and live USB distributions • Supports several compression algorithms (LZO, XZ, etc.) • Benchmarks: roughly 3 times smaller than ext3, and 2-4 times faster (http://elinux.org/Squash_Fs_Comparisons) • Details: http://squashfs.sourceforge.net/

Dongkun Shin, SKKU 45 new EROFS file system (Kernel 4.19)

• stands for Enhanced Read-Only File System, developed at Huawei • lightweight read-only file system • designed for embedded devices with very limited physical memory and lots of memory consumers, such as Android devices. • VLE (variable-length extent) compression, fixed-sized compressed size rather than fixed-sized input size – all data read from block device at once can be utilized, and read amplification can be easier to estimate and control; – generally, it has a better compression ratio than fixed-sized input compression approaches configured with the same size; – aggressively optimized paths such as partial page read can be implemented to gain better performance for page-unaligned read (unimplemented yet, in TODO list).

Dongkun Shin, SKKU 46 Two main types of drivers

• Most of the block device drivers are implemented below the I/O scheduler, to take advantage of the I/O – Hard disk drivers, CDROM drivers, etc. • For some drivers however, it doesn't make sense to use the IO scheduler – RAID and volume manager, like md – The special loop driver – Memory-based block devices

Dongkun Shin, SKKU 47 Linux Virtual Files System (VFS)

• abstraction interface layer between user process and file system implementations. • provides common object models (such as i-node, file object, page cache, directory entry, and so on) and methods to access file system objects.

• hides the differences of each file system implementation from user processes. • user processes do not need to know which file system to use, or which system call should be issued for each file system.

Dongkun Shin, SKKU 48 Linux I/O Subsystem

Writing data to a disk 1. A process requests to write a file through the write() system call. 2. The kernel updates the page cache mapped to the file. 3. A pdflush kernel thread takes care of flushing the sync cmd page cache to disk. 4. The file system layer puts each block buffer together to a bio struct and submits a write request to the block device layer. 5. The block device layer gets requests from upper layers and performs an I/O elevator operation and puts the requests into the I/O request queue. 6. A device driver such as SATA will take care of write operation. 7. A disk device firmware performs hardware operations like seek head, rotation, and data transfer to the sector on the platter.

Dongkun Shin, SKKU 49 mmap

• Possibility to have parts of the virtual address space of a program mapped to the contents of a file • Particularly useful when the file is a device file • Allows to access device I/O memory and ports without having to go through (expensive) read, write or ioctl calls • One can access to current mapped files by two means: – /proc//maps – pmap

Dongkun Shin, SKKU 50 mmap Overview

Dongkun Shin, SKKU 51 VFS Common Objects

• superblock – describes and maintains state for the file system • inode – contains all the metadata to manage objects in the file system (including the operations that are possible on it) • dentry – translate between names and inodes, for which a directory cache exists to keep the most-recently used around. – maintains relationships between directories and files for traversing file systems. • file object – represents an open file – keeps state for the open file such as the write offset, and so on

Dongkun Shin, SKKU 52 Block I/O Layer Structures

• File – virtually contiguous • Segment – virtually contiguous memory pages • Block – Logical storage space

Bio : primitive disk block I/O

Request : one or more bios (contiguous in logical address)

Dongkun Shin, SKKU 53 bio Structure

• struct bio

struct bio

*bi_next struct bio struct bio

*bi_bdev struct block_device Segment *bi_io_vec struct bio_vec struct bio_vec struct bio_vec

bi_vcnt bv_page bv_page bv_page

bi_sector bv_len bv_len bv_len

bi_size bv_offset bv_offset bv_offset bi_vcnt

Dongkun Shin, SKKU 54 request Structure

• struct request

struct request

queue_list

struct request

queue_list

bio struct bio

bio_tail struct bio sector struct bio nr_sectors

struct request

queue_list Storage Memory

Dongkun Shin, SKKU 55 Request example

Dongkun Shin, SKKU 56 I/O Scheduler

• Elevator layer – Generic block layer invokes the I/O scheduler via elevator layer (interface) – Fundamental I/O scheduler functions (Merge/Searching) • I/O Scheduler – Determine that exact position of the new element in the queue – Tries to keep the request queue sorted by sector – Several scheduling policy • Flush daemon – Dispatch requests from I/O scheduler and send them to device

IO system call Flush daemon Dongkun Shin, SKKU 57 I/O Schedulers

Dongkun Shin, SKKU 58 Available I/O schedulers

• Four I/O schedulers in current kernels – Noop, for non-disk based block devices – Deadline, tries to guarantee that an I/O will be served within a deadline – CFQ, the Complete Fairness Queuing, the default scheduler, tries to guarantee fairness between users of a block device • The current scheduler for a device can be get and set in /sys/block//queue/scheduler

Dongkun Shin, SKKU 59 Noop I/O Scheduler

• No data ordering, Simplest • FIFO Schedule

Dongkun Shin, SKKU 60 Deadline Scheduler

• Impose a deadline on all I/O operations to prevent starvation of requests. • Maintains two deadline queues, and two sorted queues (both read and write). – Deadline queues are basically sorted by their deadline (the expiration time) – Sorted queues are sorted by the sector number. • Read queues are given a higher priority • If the first request in the deadline queue has expired, D/L scheduler serves it first. • Otherwise, the scheduler serves a batch of requests from the sorted queue. • Benefits – Nearly a real-time scheduler. – Excels in reducing latency of any given single I/O – Does quite well in benchmarks, most likely the best – Good for solid state/flash drives • Disadvantages – If the phone is overloaded, crashing or unexpected closure of processes can occur Dongkun Shin, SKKU 61 I/O Scheduler

• Complete Fairness Queuing (CFQ) – QoS (Quality of Service) by maintaining per-process I/O queues – For large multi-user systems with a lot of competing processes – Avoid starvation of processes – Default I/O scheduler – Can slowdown a single main application – “ionice” command can set the priority of process Process 1

Process 2 Round Robin (time Process 3 Dispatch Queue Slice)

Process 4

Queue sorted by sector

Dongkun Shin, SKKU 62 File System and Storage Tuning

• vmstat – monitor the movement of blocks in and out of the disk subsystem. • iostat – Shows disk utilization

Dongkun Shin, SKKU 63 File System and Storage Tuning

• IO Scheduler – # echo "noop" > /sys/block//queue/scheduler • nr_requests (NCQ) – the number of requests that can be issued to a storage – # echo 64 > /sys/block/sdb/queue/nr_requests • read_ahead_kb – defines how large read ahead operations can be – For large streaming reads, a large size of the read ahead buffer might increase performance. – # cat /sys/block//queue/read_ahead_kb – # echo 64 > /sys/block//queue/read_ahead_kb • ionice – assign I/O priority at CFQ I/O elevator – Can restrict the storage utilization of a specific process. • Idle: access to the storage if no other processes with a priority of best-effort or higher request access to data. • Best-effort: default. Processes will inherit 8 levels of the priority of their respective CPU nice level to the I/O priority class. • Real time: The highest available I/O priority – options: • -c<#> I/O priority 1 for real time, 2 for best-effort, 3 for idle • -n<#> I/O priority class data 0 to 7 • -p<#> process id of a running task – # ionice -c 3 -p113 (Sets process with PID 113 as an idle io process) – # ionice -c 2 -n 0 bash (Runs 'bash' as a best-effort program with highest priority)

Dongkun Shin, SKKU 64