Controlling File System Write Ordering

Nathan C. Burnett, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
Computer Sciences Department, University of Wisconsin – Madison

1 Introduction

Traditional operating system kernels present a relatively narrow interface to applications for I/O. This interface typically consists of open(), close(), read(), write(), lseek() and fsync(). Some applications, however, require more control than this limited interface allows. For example, a database management system (DBMS) such as Oracle or PostgreSQL performs write-ahead logging to provide transactional semantics to users. To ensure that the database can be brought to a consistent state after a crash, the DBMS must ensure that updates to the log are committed to disk strictly before updates to the data itself. For correctness, then, the DBMS needs to control the order in which data is committed to stable storage.

In current operating systems, there are essentially two ways to control disk write ordering. In the first, the application accesses the raw storage device, bypassing the file system and kernel buffer cache. This provides the desired control, but at the cost of application portability, complexity and system manageability. The second way is to use synchronous I/O, such as fsync(). This also provides the required control, but this time at the cost of performance. A third, degenerate, option is to forgo controlling write ordering entirely. This solution is fast, simple and portable, but can be disastrous in the event of a system crash. Any application with complex on-disk data structures faces a similar set of trade-offs.

In this work, we seek to provide a new interface that allows the application to control the order in which data is written to disk. This interface should be fast (i.e., as asynchronous as possible) and simple enough that it will be easy to standardize, which facilitates portability.

2 Approach

Our approach is to allow an application to specify ordering dependencies between write operations. We propose two alternative methods for allowing this type of control: file system barriers and asynchronous graphs.

2.1 File System Barriers

File system barriers entail the addition of a new system call, barrier(). This call guarantees that the data from any write() calls made before the barrier() call will be committed to stable storage strictly before the data from any subsequent calls to write(). barrier() is global in scope; that is, the ordering is imposed on all write operations, without regard to which process requested them. This interface makes it easy to convert existing applications: simply replace calls to fsync() and sync() with calls to barrier(). For example, a DBMS would call write() to update the write-ahead log, call barrier(), and finally call write() to update the data itself. This ensures that the on-disk log will be updated before the on-disk data is updated.

We have implemented file system barriers in FreeBSD 5.4. Unfortunately, we found that if the workload normally issues only a few writes between barrier() calls, the disk scheduler and buffer cache manager become overly constrained and performance suffers.

2.2 Asynchronous Graphs

After our experience with file system barriers, we realized that the application needed to be able to specify ordering constraints at a finer granularity than barriers allowed. Thus, we propose asynchronous graphs. In this scheme, each time an application calls write(), the kernel assigns a unique identifier to that write operation and returns it to the application. When making subsequent calls to write(), the application can use these identifiers to inform the kernel that the data from the latest write() operation should be committed to stable storage only after the data from the specified writes. This allows an application to express ordering constraints only where ordering matters, instead of imposing global constraints on all of the writes in the system. The kernel is free to reorder writes arbitrarily if no ordering constraint was specified. This gives the application the power to specify its ordering requirements, while allowing the operating system to use its normal I/O optimizations for all other I/O.

Using our example of a DBMS again, the DBMS calls write() to update the log, and is given an identifier for that write. It then calls write() a number of times to update the data, each time passing in the identifier for the log write to ensure that the data updates are written to disk after the log updates.

3 Current Status

So far, we have been exploring the asynchronous graph concept within a simple simulator. Our simulator simulates the buffer cache itself and distinguishes between sequential and non-sequential disk accesses: sequential accesses are fast and non-sequential accesses are slow. With a TPC-B-like workload, we have been able to show that asynchronous graphs require fewer writes and, in particular, fewer non-sequential writes than using fsync() or file system barriers for ordering.

In the near future, we will extend our simulator to take into account clustered writes, the buffer cleaning daemon and a more accurate disk model. In the following months, we plan to implement the asynchronous graph interface in FreeBSD and evaluate its performance on real and synthetic workloads.
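
As an illustration of the ordering semantics described in Sections 2.1 and 2.2, the following is a minimal user-space sketch in Python, not the FreeBSD implementation or the authors' simulator. The class OrderedWriteCache, its method names, and the `after` parameter are hypothetical names chosen for this example; the paper does not specify the actual system call signatures. The model assigns an identifier to each buffered write, lets a later write name the earlier writes it must follow, and flushes in a random order that respects those constraints. A barrier() is modeled as making every later write depend on all writes still buffered at the time of the call.

```python
import random


class OrderedWriteCache:
    """Toy buffer cache that tracks ordering dependencies between writes.

    Hypothetical model for illustration only; not a kernel interface.
    """

    def __init__(self):
        self._next_id = 1
        self._pending = {}     # write id -> (label, ids that must reach disk first)
        self._barrier = set()  # ids every later write must follow (barrier model)
        self.on_disk = []      # labels in the order they reached "disk"

    def write(self, label, after=()):
        """Buffer a write and return its id; `after` lists ids to be committed first."""
        for dep in after:
            if not 1 <= dep < self._next_id:
                raise ValueError(f"unknown write id: {dep}")
        # Dependencies that already reached disk are trivially satisfied.
        deps = {d for d in set(after) | self._barrier if d in self._pending}
        wid = self._next_id
        self._next_id += 1
        self._pending[wid] = (label, deps)
        return wid

    def barrier(self):
        """Global barrier: later writes must follow every write buffered so far."""
        self._barrier = set(self._pending)

    def flush(self, rng=None):
        """Commit all buffered writes in a random order that respects dependencies."""
        rng = rng or random.Random()
        while self._pending:
            # A write is ready once none of its dependencies is still buffered.
            ready = [w for w, (_, deps) in self._pending.items() if not deps]
            wid = rng.choice(ready)
            label, _ = self._pending.pop(wid)
            self.on_disk.append(label)
            for _, deps in self._pending.values():
                deps.discard(wid)


# DBMS example from Section 2.2: data pages must follow the log record,
# while an unrelated write may land anywhere in the order.
cache = OrderedWriteCache()
log_id = cache.write("log")
for page in ("data-A", "data-B", "data-C"):
    cache.write(page, after=[log_id])
cache.write("unrelated")
cache.flush()
print(cache.on_disk)
```

Because a write can only name identifiers that were already issued, the dependency graph in this sketch cannot contain a cycle, so flush() always finds at least one ready write. The order of the data pages and the unrelated write varies from run to run; only the position of the log record relative to the data pages is fixed. Calling barrier() instead of passing `after` would also order the unrelated write after the log, which is the kind of over-constraint Section 2.1 reports.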
