CS 194-24 Lab 2: fs

Vedant Kumar, Palmer Dabbelt February 27, 2014

Contents

1 Getting Started

2 lpfs Structures and Interfaces

3 The Linux VFS Layer
  3.1 Operation Tables
  3.2 Inode Cache
  3.3 Directory Cache

4 The Linux Block Layer
  4.1 Page Cache
  4.2 Device Mapper

5 Other Useful Kernel Primitives
  5.1 Slab Allocation
  5.2 Work Queues
  5.3 The RCU Subsystem
  5.4 Wait Queues

6 Schedule and Grading
  6.1 Design Document
  6.2 Checkpoint 1
  6.3 Checkpoint 2
  6.4 Checkpoint 3
  6.5 Evaluation


For this lab you will implement a filesystem that supports efficient snapshots, copy-on-write updates, encrypted storage, checksumming, and fast crash recovery. Our goal is to give you a deeper understanding of how real filesystems are designed and implemented. We have provided the on-disk data structures and some support code for a log-structured filesystem (lpfs). You can either build on top of the distributed code or implement a novel, feature-equivalent design. The first rule of kernel programming may well be “don’t mess with the kernel”, which is why we’ve built fsdb. The idea here is to run your kernel code in userspace via a thin compatibility layer. This gives you the chance to debug and test in a relatively forgiving environment. You will extend fsdb to host ramfs as well as your own filesystem.

1 Getting Started

Pull the latest sources from the class project repo. You should see some new directories:

• lpfs: A filesystem skeleton. The compatibility layer also lives in here.

• ramfs: A compact version of linux/fs/ramfs. Note that ramfs/compat.c is symlinked to lpfs/compat.c. You can mount a ramfs by running mount -t ramfs ramfs /mnt.

• userspace: Miscellaneous tools to help build and debug your filesystem.

Run sudo make fsdb. You should see some interesting output. A reduced version follows:

    dd if=/dev/zero of=.lpfs/disk.img bs=1M count=128
    make reset_loop
    .lpfs/mkfs-lp /dev/loop0
    Disk formatted successfully.
    .lpfs/fsdb /dev/loop0 snapshot=0
    (info) lpfs: mount /dev/sda, snapshot=0
    Registered filesystem
    |fsdb>

The build system creates, mounts, and formats a disk image for you. It uses a loop device to accomplish this. Then it invokes fsdb on your new disk, leaving you ready to debug. Since we're relying on the build system to do some interesting work, it's crucial that you thoroughly understand */Makefile.mk. You may occasionally need to extend the build system, so reading through these files early on is worthwhile.

Let's make a small modification to lpfs to see how everything works. Go to the bottom of lpfs/struct.h and uncomment the LPFS_DARRAY_TEST macro. This will cause the filesystem to run sanity checks on its block layer abstraction code ('darray') instead of actually mounting. Now when you run sudo make fsdb, you should see this:

    (info) lpfs: mount /dev/sda, snapshot=0
    (info) lpfs: Starting darray tests.
    (info) lpfs: darray tests passed!
    (info) Note: proceeding to graceful crash...

Looks like the tests pass in userspace. The next step is to run them in the kernel to make sure this wasn't a fluke. Run make linux, then ./boot qemu, and finally mount -t lpfs /dev/sda /mnt in the guest's shell. If you see the same success messages, feel free to do a happy hacker dance. Life is short.


2 lpfs Structures and Interfaces

In an attempt to make this lab manageable, we've designed a set of on-disk structures that define lpfs. These structures are defined in lpfs/lpfs.h. If you need to modify these structures, make sure that you also update the formatting program (lpfs/mkfs-lp.c); failure to do this will result in corrupted images.

The main structure you'll find inside here is struct lp_superblock_fmt, which defines the on-disk format of an lpfs superblock. Superblocks are a concept that exists in most UNIX-derived filesystems. The superblock is the first block in the on-disk filesystem image and contains all the information necessary to initialize the filesystem. This block is loaded when the OS attempts to mount a block device using a particular filesystem implementation and is parsed by that implementation.

As you can probably see from the superblock structure, lpfs is a log-structured filesystem. LFS, the first log-structured filesystem, is described in a research paper available online at http://www.cs.berkeley.edu/~brewer/cs262/LFS.pdf. lpfs largely follows the design of LFS: data is stored in segments that are written serially, the segment usage table (SUT) contains segment utilization information, and garbage collection must be performed to free segments for later use. The one major difference is that lpfs uses a statically-placed journal instead of LFS's dynamic journal. The goal here is to aid crash recovery – if the journal is static then it is easier to find. Another minor difference is that lpfs supports snapshots. Effectively, this means you can take a snapshot, which tells lpfs to keep around an exact copy of the filesystem at some particular point in time. This maps well onto log-structured filesystems.

lpfs/lpfs.h also contains the on-disk structures that describe files, directories, and journal entries. These pretty much mirror the structure of a traditional UNIX filesystem; most of the interesting bits in lpfs are in the log.

lpfs/struct.h summarizes the important interfaces the filesystem relies on. You will notice that much of the code (including the entire transaction system and some of the mount logic) is far from complete. You will need to implement all of this. lpfs/inode.c takes care of loading and filling in batches of inodes. lpfs/inode_map.c tracks inode mappings: these objects define a snapshot by specifying an on-disk byte address for every live inode. lpfs/darray.c implements an abstraction on top of the buffer head API. It presents a picture of a segment as a contiguous array, handles locking, and can sync your buffers to disk. The problem with darray is that the buffer head interface is quite bloated; when the rest of your filesystem is done, you should rewrite darray using the lighter bio interface.
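To make the superblock idea concrete, here is a purely illustrative sketch of what an on-disk superblock format could look like. Every field name below is invented for this handout; the authoritative layout is struct lp_superblock_fmt in lpfs/lpfs.h.

    /*
     * Illustrative only: these field names are made up for this handout.
     * The real layout is struct lp_superblock_fmt in lpfs/lpfs.h.
     */
    #include <linux/types.h>

    struct example_sb_fmt {
        __le64 magic;          /* identifies the filesystem type */
        __le64 checksum;       /* covers the rest of the superblock */
        __le32 block_size;     /* bytes per block */
        __le32 segment_size;   /* bytes per log segment */
        __le64 nr_segments;    /* total segments on the device */
        __le64 journal_addr;   /* statically-placed journal location */
        __le64 sut_addr;       /* segment usage table location */
        __le64 root_snapshot;  /* snapshot mounted by default */
    } __attribute__((packed));

Fixed-width little-endian types and a packed layout keep the structure byte-for-byte identical on disk and in memory, which is why mkfs-lp and the kernel must agree on it exactly.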

3 The Linux VFS Layer

Filesystems are one of the more complicated aspects of an operating system (by lines of code, only drivers/ and arch/ are bigger than fs/). Luckily for you, Linux provides something known as the VFS (Virtual Filesystem Switch) layer that is designed to help manage this complexity. In Linux, all filesystems [1] are implemented using the VFS layer. Because UNIX is designed to map pretty much every operation onto the filesystem, the VFS layer plays a central role in Linux. Figure 1 shows exactly where the VFS layer lives and how it plugs into the rest of Linux.

Linux's VFS documentation is very good and can be found at linux/Documentation/filesystems/vfs.txt. You will need to read this document to complete this lab. In that directory you'll also find documentation for other filesystems which may or may not be useful – VFS was kind of hacked on top of an early UNIX filesystem, which still more-or-less exists as ext2 in Linux today.

[1] There's also FUSE, which maps VFS calls into userspace – but FUSE itself hooks into VFS so I think it still counts.


Figure 1: A map of Linux’s VFS layer


3.1 Operation Tables

The primary means of interfacing your filesystem with the VFS layer is filling out operation tables with callbacks that perform the operations specific to your filesystem. There are three of these tables: struct super_operations, which defines operations that are global across your filesystem; struct inode_operations, which defines methods on inodes; and struct file_operations, which defines operations that are local to a particular open file. This split is largely historical: on the original UNIX, directories and files were both accessed via the same system calls, so a different set of functions was required to differentiate between the two.

The simplest disk-based filesystem I know of is ext2, which you can find in the Linux sources (linux/fs/ext2/). If you look at how ext2 defines these operation tables, you'll notice that a significant fraction of the entries can be filled out using generic mechanisms provided by Linux – you'll want to take advantage of this so you can avoid re-writing a whole bunch of stuff that already works.
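As a rough illustration, the tables below are loosely modeled on ext2 for a kernel of this era. The lpfs_* callbacks are placeholders for functions you would write yourself; the generic_* and do_sync_* helpers are the kind of shared VFS code you should reuse wherever possible.

    /* A sketch, not the required lpfs tables; the lpfs_* names are placeholders. */
    #include <linux/fs.h>

    static const struct file_operations lpfs_file_ops = {
        .llseek    = generic_file_llseek,
        .read      = do_sync_read,
        .aio_read  = generic_file_aio_read,
        .write     = do_sync_write,
        .aio_write = generic_file_aio_write,
        .mmap      = generic_file_mmap,
        .fsync     = generic_file_fsync,
    };

    static const struct inode_operations lpfs_dir_inode_ops = {
        .lookup = lpfs_lookup,   /* your directory lookup callback */
        .mkdir  = lpfs_mkdir,
        .rmdir  = lpfs_rmdir,
        .unlink = lpfs_unlink,
    };

    static const struct super_operations lpfs_super_ops = {
        .alloc_inode   = lpfs_alloc_inode,
        .destroy_inode = lpfs_destroy_inode,
        .write_inode   = lpfs_write_inode,
        .evict_inode   = lpfs_evict_inode,
        .put_super     = lpfs_put_super,
        .statfs        = lpfs_statfs,
    };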

3.2 Inode Cache

Linux's VFS layer was designed with traditional UNIX filesystems in mind, and as such has a number of UNIX filesystem concepts baked into it. You've already seen one of these with the directory/file distinction, but another important one is that the VFS layer talks to your filesystem directly in terms of inodes. While this was probably originally a decision that stemmed from Sun's attempts at hacking NFS into their UNIX, today it has important performance implications: specifically, Linux caches inodes in something (quite sensibly) known as the inode cache.

The VFS layer handles the majority of the inode caching logic for you. The one thing you'll have to be aware of is that these inodes are reference counted to ensure that they're never freed while they can still be accessed from anywhere else within the VFS layer. If you end up manually passing around inodes (for example, your garbage collection layer might do this) then you'll need to keep the reference counts coherent.
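A minimal sketch of an iget-style helper is shown below; lpfs_fill_inode() is a hypothetical function that would read the on-disk inode and populate the VFS fields. iget_locked() consults the inode cache first and only hands back a fresh inode (flagged I_NEW) on a miss; drop any reference you took yourself with iput() when you are done with it.

    /* Sketch only; lpfs_fill_inode() is hypothetical. */
    #include <linux/fs.h>
    #include <linux/err.h>

    struct inode *lpfs_iget(struct super_block *sb, unsigned long ino)
    {
        struct inode *inode = iget_locked(sb, ino); /* checks the inode cache */

        if (!inode)
            return ERR_PTR(-ENOMEM);
        if (!(inode->i_state & I_NEW))
            return inode;            /* cache hit; reference already taken */

        lpfs_fill_inode(inode);      /* hypothetical: read the inode from disk */
        unlock_new_inode(inode);
        return inode;
    }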

3.3 Directory Cache

A number of VFS system calls perform path lookup operations. In order to speed these up, Linux maintains something known as the dcache, a cache of partially resolved directory lookups. What this means to you as a VFS programmer is that you're pretty much isolated from doing any sort of name resolution; you simply need to provide methods that re-populate the dcache on requests from Linux. Since dentries are cached, you're going to have to remove them from the cache on rmdir(), as otherwise Linux won't know that they've disappeared. The VFS documentation describes how to modify the dcache in order to do this.
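For example, a .lookup callback along the following lines would repopulate the dcache on a miss. lpfs_find_entry() is a hypothetical directory-search helper, and lpfs_iget() is the sketch from the previous subsection.

    /* Sketch only; lpfs_find_entry() is hypothetical. */
    static struct dentry *lpfs_lookup(struct inode *dir, struct dentry *dentry,
                                      unsigned int flags)
    {
        struct inode *inode = NULL;
        u64 ino = lpfs_find_entry(dir, &dentry->d_name); /* hypothetical */

        if (ino)
            inode = lpfs_iget(dir->i_sb, ino);

        /* Adds a positive or negative dentry to the dcache as appropriate. */
        return d_splice_alias(inode, dentry);
    }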

4 The Linux Block Layer

The whole purpose of a filesystem is to provide access to block devices in a friendlier manner for users. Thus, one of the primary interfaces that filesystems use is the block layer. Linux's block layer was designed to provide high-performance access to rotating disks, which imparts a significant amount of complexity. Luckily for you, the Linux Device Drivers book contains a great description of the block layer (the multiqueue changes aren't in until 3.13, so you're safe). The section on “Request Processing” contains all the information you'll need to deal with the block layer for this lab.

Note that before you dive into the device drivers book, you'll want to look at the page cache. Linux abstracts the vast majority of the block layer behind an in-RAM cache in order to improve performance, so you're going to want to read up on the page cache first.
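As a taste of how filesystem code usually reaches the block layer (through cached buffers rather than raw requests), here is a minimal sketch of reading a single block with the buffer head API; error handling is pared down, and the function name is invented.

    /* Sketch: read one filesystem block through the buffer cache. */
    #include <linux/buffer_head.h>
    #include <linux/string.h>

    static int example_read_block(struct super_block *sb, sector_t blkno,
                                  void *dst, size_t len)
    {
        struct buffer_head *bh = sb_bread(sb, blkno); /* may sleep for I/O */

        if (!bh)
            return -EIO;
        memcpy(dst, bh->b_data, len);
        brelse(bh);                                   /* drop the buffer reference */
        return 0;
    }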


4.1 Page Cache

Modern systems tend to have significantly more physical memory than would be required just to hold each running process's segments. In order to take advantage of this extra memory, Linux fills otherwise unused memory pages with a cache of disk blocks in the hope that these in-memory copies can be used by programs. This cache turns out to be one of the most important performance considerations on the sorts of machines that are common today: it hides latency and increases bandwidth for both disk reads and writes. On most machines almost all operations hit in the page cache, and disk IO is relegated to simply providing a persistent backing store.

Since the performance impact of the page cache is so large, Linux provides significant shared code to help manage the cache for your filesystem – in fact, the page cache is so ingrained into Linux filesystems that it would be pretty much impossible to write an on-disk filesystem without using it. Luckily, Linux's implementation of the page cache is pretty much transparent to your filesystem. You register your filesystem with the page cache by filling out a struct address_space_operations and attaching it to your inodes. These callbacks will then get called at the appropriate times when the page cache wants to operate on your filesystem – for example, your filesystem will probably need a page cache eviction callback that ensures dirty pages make it back to disk before eviction.
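Here is a sketch of what that registration might look like for a conventional block-mapped filesystem (a log-structured design will need different write-out logic). lpfs_get_block() is a hypothetical helper that maps a file block to a disk block; everything else uses shared kernel code.

    /* Sketch only; lpfs_get_block() is hypothetical. */
    #include <linux/fs.h>
    #include <linux/mpage.h>
    #include <linux/buffer_head.h>
    #include <linux/writeback.h>

    static int lpfs_get_block(struct inode *inode, sector_t iblock,
                              struct buffer_head *bh_result, int create);

    static int lpfs_readpage(struct file *file, struct page *page)
    {
        return mpage_readpage(page, lpfs_get_block);
    }

    static int lpfs_writepage(struct page *page, struct writeback_control *wbc)
    {
        /* Write-back path: make sure the dirty page reaches the disk. */
        return block_write_full_page(page, lpfs_get_block, wbc);
    }

    static int lpfs_write_begin(struct file *file, struct address_space *mapping,
                                loff_t pos, unsigned len, unsigned flags,
                                struct page **pagep, void **fsdata)
    {
        return block_write_begin(mapping, pos, len, flags, pagep, lpfs_get_block);
    }

    static const struct address_space_operations lpfs_aops = {
        .readpage    = lpfs_readpage,
        .writepage   = lpfs_writepage,
        .write_begin = lpfs_write_begin,
        .write_end   = generic_write_end,
    };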

4.2 Device Mapper

So far we have discussed block devices assuming they map to a physical block device such as a hard disk. While this was the original purpose of the block layer, it has since been expanded with what's known as the “Device Mapper” (often just “DM”) framework. DM allows code that targets the block layer to be backed by a virtual block device, an example of which is a software RAID configuration. DM then loops back into the block layer to actually satisfy requests, after performing some arbitrary computation of its own. For this assignment, you will be using the device mapper to provide the cryptographic operations required by this lab. This boils down to invoking cryptsetup on your loop device.
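One possible sequence is sketched below; the mapping name and the point at which this hooks into the lab's build system are up to you.

    cryptsetup luksFormat /dev/loop0           # write a LUKS header onto the loop device
    cryptsetup luksOpen /dev/loop0 lpfs-crypt  # exposes /dev/mapper/lpfs-crypt via dm-crypt
    .lpfs/mkfs-lp /dev/mapper/lpfs-crypt       # format the mapped (encrypted) device
    mount -t lpfs /dev/mapper/lpfs-crypt /mnt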

5 Other Useful Kernel Primitives

The VFS layer depends on a large amount of shared kernel code. While you may be familiar with some of these systems (atomic operations, for example), there are a number of systems you probably haven't seen before.

5.1 Slab Allocation

By this point you've probably noticed that the VFS layer interacts with a number of different caches. All of these caches use slab allocation in order to allocate memory as efficiently as possible. The default Linux slab allocator is known as SLUB (yes, that's not a typo – the old one was SLAB). While the various caches should hide the details of slab allocation from you, it can still be useful to know what's going on behind the scenes.

The general idea behind slab allocation is to speed up memory management when objects of the same kind are allocated and freed many times: the allocator keeps around a cache of already-initialized objects and returns one of those rather than creating a new object, which saves the overhead of re-initializing objects over and over again. Slab caches also provide significant additional advantages in both time and space by coalescing objects of the same kind during allocation. Linux's slab allocator can be accessed through the kmem_cache_* methods.
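A minimal sketch of creating and using a slab cache follows, assuming a hypothetical struct lpfs_inode_info used for per-inode bookkeeping.

    /* Sketch only; struct lpfs_inode_info is hypothetical. */
    #include <linux/slab.h>

    static struct kmem_cache *lpfs_inode_cachep;

    static int lpfs_init_inodecache(void)
    {
        lpfs_inode_cachep = kmem_cache_create("lpfs_inode_cache",
                                              sizeof(struct lpfs_inode_info),
                                              0, SLAB_RECLAIM_ACCOUNT, NULL);
        return lpfs_inode_cachep ? 0 : -ENOMEM;
    }

    /* Allocate and free objects from the cache instead of kmalloc()/kfree(). */
    static struct lpfs_inode_info *lpfs_alloc_info(void)
    {
        return kmem_cache_alloc(lpfs_inode_cachep, GFP_KERNEL);
    }

    static void lpfs_free_info(struct lpfs_inode_info *info)
    {
        kmem_cache_free(lpfs_inode_cachep, info);
    }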


5.2 Work Queues

Log-structured filesystems inherently involve garbage collection: you'll need to clean up unused segments and merge mostly-empty segments when the machine isn't particularly busy. Userspace garbage collection tends to involve either a signal or a background thread, but neither approach is appropriate inside the kernel. You can't use a signal because your filesystem isn't attached to a particular userspace thread, and forking kernel threads is generally frowned upon because of the resource overhead involved (though you'll notice that there's a whole bunch of them anyway...). To implement garbage collection you will instead need to use Linux's work queue functionality to defer work to a later time.

The idea behind work queues is that there is a pool of event-handling kernel threads that exist the entire time the system is running. The work queue mechanism allows you to enqueue an item of work that will later be dequeued and run by one of these threads. This allows work to be performed asynchronously without the overhead of creating a bunch of threads. Linux's work queue mechanisms are documented in linux/Documentation/workqueue.txt. Note that the provided lpfs code currently spawns off placeholder syncer and cleaner threads. We suggest that you get rid of these and use work queues instead.
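For instance, the cleaner could be structured around a delayed work item along these lines; lpfs_clean_segments() is a hypothetical function that does the actual cleaning.

    /* Sketch only; lpfs_clean_segments() is hypothetical. */
    #include <linux/workqueue.h>
    #include <linux/jiffies.h>

    static void lpfs_cleaner_fn(struct work_struct *work)
    {
        lpfs_clean_segments();  /* hypothetical: reclaim mostly-empty segments */
    }

    static DECLARE_DELAYED_WORK(lpfs_cleaner_work, lpfs_cleaner_fn);

    static void lpfs_kick_cleaner(void)
    {
        /* Runs lpfs_cleaner_fn() on a shared worker thread about 5s from now. */
        schedule_delayed_work(&lpfs_cleaner_work, 5 * HZ);
    }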

5.3 The RCU Subsystem

The RCU (Read-Copy-Update) subsystem is a mechanism for synchronizing particular sorts of operations without requiring readers to take locks. RCU is commonly used to manage lists of buffers. While this sort of buffer management is common in filesystems, we believe that the RCU subsystem can remain hidden from you until you need to start using BIO.

The RCU subsystem is somewhat complicated. We'll eventually cover it in sections, but the Linux documentation for it is very good. It can be found at linux/Documentation/RCU/ (my personal favorite is whatisRCU.txt, but Vedant likes one of the other ones so YMMV).
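For reference, the basic reader/writer pattern looks roughly like this; the lpfs_conf structure and pointer are invented for the example.

    /* Sketch only; lpfs_conf is an invented example structure. */
    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    struct lpfs_conf { int cleaner_interval; };
    static struct lpfs_conf __rcu *lpfs_conf_ptr;

    static int lpfs_cleaner_interval(void)
    {
        int val;

        rcu_read_lock();                /* readers never block writers */
        val = rcu_dereference(lpfs_conf_ptr)->cleaner_interval;
        rcu_read_unlock();
        return val;
    }

    static void lpfs_replace_conf(struct lpfs_conf *new_conf)
    {
        struct lpfs_conf *old = rcu_dereference_protected(lpfs_conf_ptr, 1);

        rcu_assign_pointer(lpfs_conf_ptr, new_conf);
        synchronize_rcu();              /* wait for pre-existing readers */
        kfree(old);
    }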

5.4 Wait Queues

Operations that touch block devices have very high latency. This means you're going to have to sleep whenever you submit a request that doesn't hit in the page cache. Linux provides a generic mechanism for sleeping until an event occurs, known as wait_event(). You shouldn't need to use this directly for the majority of your code, as the page cache and BIO code handle sleeping for you, but you will probably want to look at wait queues for putting your cleaner thread to sleep under certain conditions. Later on, when you convert darray over to the bio interface, you may need to use wait_for_completion() to synchronize your I/Os.
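For example, putting the cleaner to sleep until there is work to do could look like the sketch below; the should_run flag is invented for the example.

    /* Sketch only; the should_run flag is an invented example. */
    #include <linux/wait.h>

    static DECLARE_WAIT_QUEUE_HEAD(lpfs_cleaner_waitq);
    static int lpfs_cleaner_should_run;

    static void lpfs_cleaner_sleep(void)
    {
        /* Sleeps until the condition is true; re-checked after every wakeup. */
        wait_event_interruptible(lpfs_cleaner_waitq, lpfs_cleaner_should_run);
    }

    static void lpfs_wake_cleaner(void)
    {
        lpfs_cleaner_should_run = 1;
        wake_up_interruptible(&lpfs_cleaner_waitq);
    }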

6 Schedule and Grading

We’ll be following the same sort of checkpoint system that was used for Lab 1: three checkpoints, once a week on Thursdays at 9pm. You’ll probably notice that this lab is more heavily loaded towards the third checkpoint. This means it will be particularly important to ensure that your early checkpoints are useful for your later checkpoints. The requirements for this assignment are specified in terms of the Linux system calls (or where their names don’t match, libc functions). You’ll have to translate those into the corresponding VFS operations in order to actually implement the lab.

6.1 Design Document

Here's a subset of the items we're looking for:

• Task separation, work distribution, time estimates


• A plan to manage transactions, snapshots, and the cleaner
• A plan to manage the journal and crash recovery
• Remarks on how the on-disk structures will be used
• Proposed changes to lpfs, or a description of your own filesystem
• A brief description of your tests (no mention of ramfs is needed)
• Any questions you may have about the lab

Avoid being vague, since this inhibits our helping you. As a special case of avoiding vagueness, avoid listing ‘Correctness Constraints’ in your documents. Also avoid being too specific (i.e., prefer pseudocode over code, and aim for brevity and clarity).

6.2 Checkpoint 1

For the first checkpoint you will be implementing a userspace compatibility layer for ramfs, an in-memory filesystem that's already been written for Linux. We've written much of this compatibility layer already, so all you really have to do is implement the struct dentry management code and some of the VFS-provided generic functions in userspace. You'll also need to extend userspace/fsdb.c by invoking the appropriate file_operations and inode_operations methods from its command handlers.

• (15 points) Make ramfs work in userspace. All of the fsdb commands should work.
• (15 points) A filesystem test suite, running against ramfs inside of Linux and against your compatibility layer. Your tests may take the form of commands to fsdb (which can be piped in) or shell scripts which can be run in the kernel.
• (10 points) A design document that describes how you will complete the remaining checkpoints.

6.3 Checkpoint 2

For the second checkpoint you'll be implementing a read-only version of lpfs that gets its data from a real block device and runs within Linux (as well as under the userspace compatibility wrapper).

• (5 points) mount()
• (5 points) umount()
• (5 points) open()
• (5 points) close()
• (5 points) readdir()
• (5 points) read()
• (5 points) seek()
• (5 points) stat()
• (5 points) statfs()
• (5 points) Encryption
• (5 points) Checksums


6.4 Checkpoint 3

The final checkpoint involves making the core functionality of the lab work. You'll need to be able to perform read and write operations that target a live, on-disk filesystem image. Be sure not to break anything from Checkpoint 2!

• (5 points) mkdir()
• (5 points) rmdir()
• (5 points) write()
• (5 points) truncate()
• (5 points) link()
• (5 points) unlink()
• (5 points) rename()
• (5 points) sync()
• (5 points) fsync()
• (10 points) Snapshotting
• (10 points) Efficient crash recovery
• (1 point) Updates to the darray interface

6.5 Evaluation

We're going to grade your filesystems by invoking fsdb on your tests and by replacing the default init process with a script that stresses a bunch of filesystem operations. As we get the stress tester for your httpd project up and running, we'll also set up a separate test environment where your webserver's DATA_ROOT is backed by your new filesystem.
