Okeanos: Wasteless Journaling for Fast and Reliable Multistream Storage

Andromachi Hatzieleftheriou and Stergios V. Anastasiadis
Department of Computer Science, University of Ioannina, Greece

Abstract

Synchronous small writes play a critical role in the reliability and availability of file systems and applications that use them to safely log recent state modifications and quickly recover from failures. However, storage stacks usually enforce page-sized granularity in their data transfers from memory to disk. We experimentally show that subpage writes may lead to storage bandwidth waste and high disk latencies. To address the issue in a journaled file system, we propose wasteless journaling as a mount mode that coalesces synchronous concurrent small writes of data into full page-sized blocks before transferring them to the journal. Additionally, we propose selective journaling that automatically applies wasteless journaling on data writes whose size lies below a fixed preconfigured threshold. In the Okeanos prototype implementation that we developed, we use microbenchmarks and application-level workloads to show substantial improvements in write latency, transaction throughput and storage bandwidth requirements.

1 Introduction

Synchronous small writes lie in the critical path of several contemporary systems that target fast recovery from failures with low performance overhead during normal operation [1, 4, 5]. Typically, synchronous small writes are applied to a sequential file (write-ahead log) in order to record updates before the actual modification of the system state. In addition, the system periodically copies its entire state (checkpoint) to permanent storage. After a transient failure, recent state can be reconstructed by replaying the logged updates against the latest checkpoint. Write-ahead logging improves system reliability by preserving recent updates from failures; it also increases system availability by substantially reducing the subsequent recovery time. The method is widely applied in general-purpose file systems [6], relational databases [5], distributed key-value stores [4], event processing engines [3], and other mission-critical systems [7]. Furthermore, logging is useful for checkpointing parallel applications to preserve multiple hours or days of processing after an application or system crash [1].

Today, several file systems use a log file (journal) in order to temporarily move data or metadata from memory to disk at sequential throughput. Thus, they postpone the more costly writes to the file system without penalizing the corresponding latency perceived by the applications. A basic component across current operating systems is the page cache, which temporarily stores recently accessed data and metadata in case they are reused soon. It receives byte-range requests from the applications and communicates with the disk through page-sized blocks. The page-sized block granularity of disk accesses is prevalent across all data transfers, including data and metadata updates or the corresponding journaling whenever it is used. Asynchronous small writes improve their efficiency when multiple consecutive requests are batched into page-sized blocks before they are flushed to disk. Instead, each synchronous write is flushed to disk individually, causing data and metadata traffic of multiple full pages, even if the bytes actually modified across the pages collectively occupy much less space.

In Figure 1, we measure the amount of data written to the journal across different mount modes. We use a synthetic workload that consists of 100 concurrent threads with periodic synchronous writes of varying request sizes (one req/s). We include the ordered, writeback, and journal modes of the ext3 file system (Section 4); we refer to the journal mode as data journaling from now on for clarity. As the request size increases up to 4KB, the traffic of data journaling remains almost unchanged at a disproportionately high value. At each write call, the mode appends to the journal the entire modified data and metadata blocks rather than only the corresponding block modifications. Instead, the ordered and writeback modes incur almost linearly increasing traffic, because they only store to the journal the blocks that contain modified metadata.
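To put the waste in rough numbers: 100 streams that each append 128 bytes per second generate about 12.5KB/s of new data, yet page-granular data journaling commits at least one full 4KB data block per stream per second, i.e. at least 400KB/s of journal traffic, whereas coalescing the same updates would fit them into roughly four journal blocks per second, ignoring per-record bookkeeping. The sketch below illustrates the kind of multistream synchronous-append workload we have in mind; it is an illustration with assumed parameters (file names, request size, rate, duration), not the benchmark code we used.

    /* Sketch of a multistream synchronous small-write workload
     * (illustrative only; thread count, rate, sizes and names are assumed). */
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define NSTREAMS 100
    #define REQ_SIZE 128        /* bytes per request, well below one 4KB page */

    static void *stream(void *arg)
    {
        long id = (long)arg;
        char path[64], buf[REQ_SIZE];

        snprintf(path, sizeof(path), "stream-%03ld.log", id);
        memset(buf, 'a' + id % 26, sizeof(buf));

        int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0)
            return NULL;

        for (int n = 0; n < 300; n++) {     /* e.g. a 5min run at one req/s */
            /* Each small append is made synchronous with fdatasync(), so it
             * must reach stable storage before the thread continues. */
            if (write(fd, buf, sizeof(buf)) < 0)
                break;
            fdatasync(fd);
            sleep(1);
        }
        close(fd);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NSTREAMS];

        for (long i = 0; i < NSTREAMS; i++)
            pthread_create(&tid[i], NULL, stream, (void *)i);
        for (long i = 0; i < NSTREAMS; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }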

[Figure 1 appears here: total journal volume (MB) versus request size (KB) for the data journaling, wasteless, writeback, selective, and ordered modes.]

Figure 1: During a 5min interval, we measure the total write traffic to the journal device across different mount modes of the ext3 file system on Linux.

We set as our objective to reduce the journal traffic so that we improve the performance of reliable storage at low cost. Thus, we introduce wasteless journaling and selective journaling as two new mount modes that we propose, design and fully implement in the Linux ext3 file system. We are specifically concerned about highly concurrent multithreaded workloads that synchronously apply small writes over common storage devices [1, 4, 7]. We target to save the disk bandwidth that is currently wasted due to unnecessary writes of unmodified data, or writes with high positioning overhead. The operations in both these cases occupy valuable disk access time that should be used for useful data transfers instead. To achieve our goal we transform multiple random small writes into a single block append to the journal.

We summarize our contributions as follows: (i) we consider the reduction of journal bandwidth in current systems as a means to improve the performance of reliable storage at low cost; (ii) we design and fully implement wasteless and selective journaling as optional mount modes in a widely-used file system; (iii) we discuss the implications of our journaling optimizations for the consistency semantics; (iv) we apply micro-benchmarks, a storage workload and database logging traces over a single journal spindle to demonstrate performance improvements of up to an order of magnitude across several metrics; (v) we use a parallel file system to show that wasteless journaling doubles, at reasonable cost, the throughput of parallel application checkpointing over small writes.

In the remainder of the paper, we summarize the related research in Section 2 and present architectural aspects of our design in Section 3, while in Section 4 we describe the implementation of the Okeanos prototype system. In Section 5 we explain our experimentation environment, in Section 6 we present performance measurements across different workloads, and in Section 7 we outline our conclusions and future work.

2 Related Work

The log-structured file system addresses the synchronous metadata update problem and the small-write problem by batching data writes sequentially to a segmented log [9]. In transaction processing, group commit is a known database logging optimization that periodically flushes to the log multiple outstanding commit requests [5]. The above approaches gather multiple block writes into a single multi-block request instead of fitting multiple subpage modifications into a single block, as we do. Also, subpage journaling of metadata updates is already available in commercial file systems, such as IBM JFS and MS NTFS [8]. Adding extra spindles to improve I/O parallelism or non-volatile RAM to absorb small writes could also reduce latency and raise throughput [6]. However, such solutions carry drawbacks that primarily have to do with increased cost and maintenance concerns.

A structured storage system may maintain numerous independent log files to facilitate load balancing in case of failure [4]. However, concurrent sequential writes to the same device create a random-access workload with low disk throughput. To address this issue, the system may store multiple logs into a single file and separate them by record sorting during recovery. Similarly, for the storage needs of parallel applications in high-performance computing, specialized file formats are used to manage as a single file the data streams generated by multiple processes [1]. Instead, we aim to handle the above cases at low cost through the mount modes that we add to a general-purpose file system.

The Echo distributed file system logged subpage updates for improved performance and availability, but bypassed logging for page-sized or larger writes [2]. However, Echo was discontinued in the early nineties, partly because its hardware lacked fast enough computation relative to communication. Recent research introduced semantic trace playback to rapidly emulate alternative file system designs [8]. In that context, the authors emulated writing block modifications instead of entire blocks to the journal, but did not consider the performance and recovery implications. Due to the obsolete hardware platform or the high emulation level at which they were applied, the above studies leave open the general architectural fit and actual performance benefit of journal bandwidth reduction in current systems.
3 System Design

We set as our objective to safely store recent state updates on disk and ensure their fast recovery in case of failure. We also strive to serve the synchronous small writes and subsequent reads at sequential disk throughput with low bandwidth requirements. We are motivated by the important role that small writes play in reliable storage and the lack of comprehensive studies on subpage data logging in current systems. In order to reduce the storage bandwidth consumed by data journaling, we designed and implemented a new mount mode that we call wasteless journaling. During synchronous writes, we transform partially modified data blocks into descriptor records that we subsequently accumulate into special journal blocks. We synchronously transfer all the data modifications from memory to the journal device. After a timeout expiration, or due to shortage of journal space, we move the partially or fully modified data blocks from memory to their final location in the file system.

With the goal of reducing the journal I/O activity during sequential writes, we further evolved wasteless journaling into selective journaling. In that mount mode, the system automatically differentiates the write requests based on a fixed size threshold that we call the write threshold. Depending on whether the write size is below the write threshold or not, we respectively transfer the synchronous writes either to the journal or directly to the final disk location. Thus, we apply data journaling only in those cases where either multiple small writes can be coalesced into a single journal block according to wasteless journaling, or different data blocks that have been fully modified are scattered across multiple locations in the file system. We anticipate that journaling of the modified blocks will reduce the latency of synchronous writes through the sequential throughput offered by the journal device.

For consistency across system failures, each write operation delays metadata updates on disk until the completion of the corresponding data updates. In wasteless journaling, we log both data and metadata into the journal before we consider a write operation effectively completed. Synchronous writes from the same thread are added to the journal sequentially. In case of failure, a prefix of the operation sequence is recovered through the replay of the data modifications that have been successfully logged into the journal. Instead, selective journaling allows a synchronous write sequence to have a subset of the modified data added to the journal and the rest of the modified data directly transferred to the final location in the file system. Given that a synchronous write from a single thread must be transferred to disk immediately, it only makes sense to accumulate into a journal block the writes from different concurrent threads. As a result, wasteless and selective journaling are mostly beneficial in concurrent environments with multiple writing streams that include frequent small writes.

In selective journaling, we call a series of multiple incoming updates applied to the same block buffer an update sequence of a disk block. The updates do not have to be back-to-back, but there should be no in-between transfer of the respective buffer to the final disk location. If the first update in such a sequence has subpage size, we log to the journal the entire update sequence of this buffer. Thus, we handle consistency in a relatively clean way, because we eliminate the case where we turn off the journaling of a particular buffer halfway through a transaction. On the other hand, if the first update of the buffer is page-sized, we decide to skip journaling for the entire update sequence of the corresponding block. In our experience, the above two transitions in update sizes along a sequence occur infrequently. Therefore, we anticipate low impact on the journaling activity of selective journaling.
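As a rough illustration of this policy (not the kernel code of the prototype; the identifiers and the per-buffer bookkeeping are invented for the example), the routing decision of selective journaling can be pictured as follows, with the write threshold fixed at one page as in our implementation:

    /* Illustrative routing decision of selective journaling for the updates
     * of one block buffer (our sketch; identifiers are invented). */
    #define BLOCK_SIZE   4096
    #define WRITE_THRESH BLOCK_SIZE    /* fixed write threshold: one page */

    struct buffer_policy {
        int in_sequence;   /* an update sequence is open for this buffer    */
        int journaled;     /* decision taken at the first update and kept   */
                           /* until the buffer reaches its final location   */
    };

    /* Called for every incoming update of 'len' bytes to the buffer.
     * Returns 1 if the update goes to the journal, 0 if the buffer is
     * written directly to its final file system location. */
    static int selective_route(struct buffer_policy *bp, unsigned int len)
    {
        if (!bp->in_sequence) {
            /* The first update decides for the whole sequence, so we never
             * stop journaling a buffer halfway through a transaction. */
            bp->journaled = (len < WRITE_THRESH);
            bp->in_sequence = 1;
        }
        return bp->journaled;
    }

    /* The sequence ends when the buffer is written back to its final disk
     * location; a later update may then decide differently. */
    static void close_sequence(struct buffer_policy *bp)
    {
        bp->in_sequence = 0;
    }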
4 The Okeanos Prototype Implementation

We implemented wasteless and selective journaling in the Okeanos prototype that we developed based on Linux ext3. Originally, ext3 first copies the modified blocks into the journal and then transfers them to their final disk location. In the data journaling mount mode, data and metadata blocks are copied to the journal before they update the file system. To reduce the risk of data corruption, the ordered mode only copies the metadata blocks to the journal, after the associated data blocks have updated the file system. The writeback mode also copies only metadata blocks to the journal, but without any constraint on the relative order of data and metadata updates to the file system. It is the weakest mode in terms of consistency and we do not consider it any further in the rest of the paper.

The Linux kernel uses the page cache to keep in memory data and metadata of recently accessed disk files. For every disk block cached in memory, a block buffer stores its data and a buffer head maintains the related bookkeeping information. The page cache manages disk blocks in page-sized groups called buffer pages. We use block and page interchangeably, because they typically have the same size. Ext3 implements the journal as either a hidden file in the file system or a separate disk partition. Each log record in the journal contains an entire modified block instead of the byte range actually affected. However, the system only needs to log the updated part of each modified block and merge it into the original block to get its latest version during a recovery. To achieve that, we introduce a new type of journal block that we call a multiwrite block. We only use multiwrite blocks to accumulate the updates from data writes that partially modify block buffers. When a block buffer contains metadata or is fully modified by a write operation, we can send it directly to the journal without the need to create an extra copy first in the page cache. We call such a journal block a regular block.

[Figures 2 and 3 appear here.]

Figure 2: Alternative execution paths of a write request in the selective journaling mode.

Figure 3: (a) In data journaling, the system sends to the journal the entire blocks modified by write operations. (b) In wasteless journaling, we accumulate multiple data writes into a single multiwrite journal block.

When a write request of arbitrary size enters the kernel, the request is broken into variable-sized updates of individual block buffers. In wasteless journaling, if the size of a buffer update is less than the block size, we first copy the corresponding data modification into a multiwrite block. Otherwise, we point to the entire modified block in the page cache. For selective journaling, we have the write threshold fixed to the page size of 4KB. When a buffer update has size smaller than the write threshold, we mark the corresponding page as journaled and copy the modification to the multiwrite block. We clear the journaled flag after we transfer the corresponding block to its final location on disk. In Figure 2, we use a flowchart to summarize the possible execution paths of a write request through selective journaling; the sketch that follows outlines the same per-update handling.
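The per-update handling can be summarized with the following sketch, which stages subpage modifications in an in-memory multiwrite block; the structures and names are ours and omit the buffer-head and transaction machinery of ext3/JBD:

    /* Per-update handling sketch for wasteless/selective journaling
     * (illustrative only; the real code hooks into ext3/JBD buffer heads). */
    #include <stdint.h>
    #include <string.h>

    #define BLOCK_SIZE   4096
    #define WRITE_THRESH BLOCK_SIZE          /* selective write threshold   */
    #define MAX_TAGS     64

    struct mw_tag {                  /* one coalesced subpage modification  */
        uint32_t final_block;        /* final disk location of the block    */
        uint16_t offset, length;     /* byte range modified in that block   */
    };

    struct mw_fill {                 /* staging area for one multiwrite block */
        uint8_t  payload[BLOCK_SIZE];
        unsigned used;
        struct mw_tag tags[MAX_TAGS];
        unsigned ntags;
    };

    enum route { TO_MULTIWRITE, TO_JOURNAL_WHOLE, TO_FINAL_LOCATION };

    /* Handle one update of 'len' bytes at offset 'off' of block 'blkno'. */
    static enum route handle_update(struct mw_fill *mw, int selective,
                                    uint32_t blkno, const uint8_t *block,
                                    unsigned off, unsigned len)
    {
        if (selective && len >= WRITE_THRESH)
            return TO_FINAL_LOCATION;        /* bypass the journal entirely */

        if (len < BLOCK_SIZE &&
            mw->used + len <= BLOCK_SIZE && mw->ntags < MAX_TAGS) {
            /* Coalesce the subpage modification into the multiwrite block
             * and remember where it came from for commit and recovery. */
            memcpy(mw->payload + mw->used, block + off, len);
            mw->tags[mw->ntags].final_block = blkno;
            mw->tags[mw->ntags].offset = (uint16_t)off;
            mw->tags[mw->ntags].length = (uint16_t)len;
            mw->ntags++;
            mw->used += len;
            return TO_MULTIWRITE;
        }
        /* Metadata or fully modified buffers are journaled whole, by
         * reference to the page cache copy (a regular journal block). */
        return TO_JOURNAL_WHOLE;
    }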

A system call may consist of multiple low-level operations that atomically manipulate disk data structures of the file system. For improved efficiency, the system groups the records of multiple calls into one transaction. Before the transaction moves to the commit state, the kernel allocates a journal descriptor block with a list of tags that map block buffers to their final disk location. For each block buffer that will be written to the journal, the kernel allocates an extra buffer head specifically for the needs of journaling I/O. Additionally, it creates a journal head structure to associate the block buffer with the respective transaction. For writes that only modify part of a block, we expanded the journal head with two extra fields that contain the offset and the length of the modification within the multiwrite block pointed to by the buffer head (Figure 3). When we start a new transaction, we allocate a journal descriptor block that contains multiple fixed-length tags, one per write. In our system, we introduce three new fields in each tag: (i) a flag to indicate the use of a multiwrite block, (ii) the length of the write in the multiwrite block, and (iii) the starting offset of the modification in the final data block.

A transaction is committed if it has flushed all its records to the journal; it is checkpointed if all the blocks of a committed transaction have been moved to their final location on disk and the corresponding log records are removed from the journal. If the journal contains log records after a crash, the system initiates a recovery process during which we retrieve the modified blocks from the journal. In the case of multiwrite blocks, we apply the updates to blocks that we read from the corresponding final disk locations. We read into memory and update the appropriate block, as specified by the final disk location and the starting offset in the tag. However, if the multiwrite flag is not set, then we read the next block of the journal and treat it as a regular block. We write every regular block directly to the final disk location without the need to read its older version from the disk first.

Both data and wasteless journaling guarantee the atomicity of updates, because they replay the modifications of the committed transactions until they fully reach the file system. Instead, selective journaling decides whether or not to journal an update sequence based on the size of its first write. Journaling of an update sequence implies atomicity of the modification for the corresponding block, while direct transfer of the block to the file system implies consistency similar to that of ordered mode.
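As an illustration of how the three new tag fields are produced and consumed, the sketch below shows a possible tag layout and a recovery-time replay loop; the field names, widths and helper routines are ours and do not reproduce the actual ext3/JBD on-disk format, and multiwrite payloads are assumed to be packed in tag order.

    /* Illustrative journal descriptor tag and recovery replay
     * (our sketch; not the actual ext3/JBD on-disk format). */
    #include <stdint.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define BLOCK_SIZE 4096
    #define TAG_FLAG_MULTIWRITE 0x1

    struct journal_tag {
        uint32_t final_block;   /* final disk location of the data block      */
        uint32_t flags;         /* TAG_FLAG_MULTIWRITE marks a subpage update */
        uint16_t mw_length;     /* (ii) length of the write in the multiwrite */
                                /*      block                                 */
        uint16_t final_offset;  /* (iii) starting offset of the modification  */
                                /*       in the final data block              */
    };

    static int read_block(int fd, uint32_t blkno, uint8_t *buf)
    {
        return pread(fd, buf, BLOCK_SIZE, (off_t)blkno * BLOCK_SIZE)
               == BLOCK_SIZE ? 0 : -1;
    }

    static int write_block(int fd, uint32_t blkno, const uint8_t *buf)
    {
        return pwrite(fd, buf, BLOCK_SIZE, (off_t)blkno * BLOCK_SIZE)
               == BLOCK_SIZE ? 0 : -1;
    }

    /* Replay the tags of one committed transaction.  Multiwrite payloads are
     * assumed packed back-to-back, in tag order, inside the journal block at
     * 'mw_blkno'; regular (whole) logged blocks start at 'next_blkno'. */
    int replay_tags(int journal_fd, int fs_fd,
                    uint32_t mw_blkno, uint32_t next_blkno,
                    const struct journal_tag *tags, unsigned ntags)
    {
        uint8_t jbuf[BLOCK_SIZE], fbuf[BLOCK_SIZE];
        unsigned payload_pos = 0;

        for (unsigned i = 0; i < ntags; i++) {
            if (tags[i].flags & TAG_FLAG_MULTIWRITE) {
                /* Read-modify-write: fetch the final block, merge the logged
                 * byte range into it, and write the merged block back. */
                if (read_block(journal_fd, mw_blkno, jbuf) < 0 ||
                    read_block(fs_fd, tags[i].final_block, fbuf) < 0)
                    return -1;
                memcpy(fbuf + tags[i].final_offset, jbuf + payload_pos,
                       tags[i].mw_length);
                payload_pos += tags[i].mw_length;
                if (write_block(fs_fd, tags[i].final_block, fbuf) < 0)
                    return -1;
            } else {
                /* Regular block: copy the next logged block to its final
                 * location as-is, without reading the older version first. */
                if (read_block(journal_fd, next_blkno++, jbuf) < 0 ||
                    write_block(fs_fd, tags[i].final_block, jbuf) < 0)
                    return -1;
            }
        }
        return 0;
    }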
fields in each tag: (i) a flag to indicate the use of a multi- In our experiments we use x86-based servers with one write block, (ii) the length of the write in the multiwrite quad-core 2.66GHz processor, 3GB RAM, two Seagate block, and (iii) the starting offset of the modification in Cheetah SAS 300GB 15KRPM disks and one active gi- the final data block. gabit ethernet port. Unless we specify differently, we as- A transaction is committed,ifithasflushed all its sume synchronous write operations with the journal and records to the journal; it is checkpointed, if all the blocks data partition on two separate disks. Our results have of a committed transaction have been moved to their fi- half-length of 90% confidence interval within 10% of the nal location on disk and the corresponding log records reported average. We flushed the page cache between all are removed from the journal. If the journal contains log the repetitions of our experiments.

[Figure 4 appears here: four panels plot (a) journal throughput and (b) file-system throughput against the number of 1Kbps streams, (c) write latency against the number of 1Mbps streams, and (d) 4KB read latency against the number of concurrently created files, for the ordered, data journaling, wasteless, and selective modes and NILFS.]

Figure 4: (a) At 1Kbps, the journal throughput (lower is better) of selective and wasteless journaling approaches that of ordered, unlike data journaling, which is several factors higher. (b) In comparison to ordered at 1Kbps, the remaining three modes reduce file system throughput by several factors (lower is better). (c) At 1Mbps, the selective and ordered modes incur much higher latency in comparison to the other ext3 modes or NILFS. (d) If we create multiple files concurrently, read requests of 4KB with NILFS take an order of magnitude longer than with ext3.

6 Performance Evaluation

First, we produce random I/O traffic by running a number of concurrent threads directly on the file server. Each thread appends data to a separate file by calling one synchronous write per second. At an increasing number of 1Kbps streams, Figure 4(a) shows that the journal throughput of data journaling is an order of magnitude higher than that of the other modes (up to 27MB/s). On the contrary, selective and wasteless journaling limit the traffic to about 4MB/s. In Figure 4(b), we measure the write throughput of the file system device. The ordered mode wastes disk bandwidth by sending each write to the final location in units of 4KB. Instead, wasteless, selective and data journaling leave dirty pages temporarily in memory before coalescing them into the file system. With controlled system crashes, we additionally found that selective and wasteless journaling tend to reduce the recovery time of data journaling (by more than 20% in some cases).

Next, we examine the average latency of synchronous writes. In Figure 4(c), with 1Mbps streams, ordered and selective incur orders of magnitude higher latency than the other modes. At 1Kbps (not shown), selective tends to become identical to wasteless journaling. In asynchronous writes that we also tried, we found selective and wasteless journaling to reduce the latency of ordered and data journaling by up to two orders of magnitude. In Figure 4(c), we also consider a stable Linux port (NILFS) of the log-structured file system [9]. The write latency of NILFS is comparable to that of wasteless and data journaling. In Figure 4(d), we use a thread to read sequentially, one after the other, different numbers of files that we previously created concurrently at 1Mbps each, using NILFS or ext3. Then, we measure the average time to read a 4KB block. We observe that NILFS is an order of magnitude slower with respect to ext3. In fact, NILFS interleaves the writes from different files on disk, which may lead to poor storage locality during sequential reads.

We use the Postmark benchmark to examine the performance of small writes as seen in electronic mail, netnews and web-based commerce. We apply version 1.5 with synchronous writes added by FSL of Stony Brook University. We assume an initial set of 500 files and use 100 threads for a total workload of 10,000 mixed transactions. We draw the file sizes from the default range, while I/O request sizes lie between 128 bytes and 128KB. In Figure 5(a), we observe that the transaction rate of wasteless journaling gets as high as 738 tps. Across different request sizes, wasteless journaling consistently remains faster than the other modes, including data journaling. Instead, selective journaling lies between data journaling and ordered mode, which are both slower than wasteless journaling.

We also examine the OLTP performance benchmark TPC-C as implemented in Test 2 of the Database Test Suite. We used the MySQL open-source database system with the default InnoDB storage engine. We tested a configuration with 20 warehouses, 20 connections, 10 terminals per warehouse and 500s duration. InnoDB supports three methods for flushing the database transaction log to disk. In the default method 1 (Cmt/Disk), the log is flushed directly to disk at each transaction commit. In method 0 (Prd/Disk), the transaction log is written to the page cache and flushed to disk periodically. Finally, in method 2 (Cmt/Cache), the transaction log is written to the page cache at each transaction commit and periodically flushed to disk. During an execution of TPC-C, we collect a system-call trace of the MySQL transaction log. Subsequently, we replay a varied number of concurrent instances of the log trace over the ordered and wasteless journaling modes. In Figure 5(b), we see that wasteless journaling takes up to tens of seconds to complete each log flush across the three methods of InnoDB at high load. Instead, ordered mode takes hundreds of seconds as the number of instances approaches or exceeds 64.

[Figure 5 appears here: four panels show (a) Postmark transactions per second versus request size (KByte), (b) TPC-C transaction-log flush latency (s) versus the number of trace instances for the three InnoDB methods under ordered and wasteless journaling, (c) MPI-IO throughput (MB/s) versus threads per client for 1024-byte writes, and (d) the write volume (MB) to BerkeleyDB (BDB), the journal, and the final file system location under each mount mode.]

Figure 5: (a) Wasteless journaling consistently achieves the highest transaction rate in Postmark. (b) Across the three flushing methods of MySQL/InnoDB, wasteless journaling substantially reduces the latency to flush the transaction log to disk. (c) Wasteless journaling almost doubles the data throughput (higher is better) of ordered mode. (d) We measure the write traffic (lower is better) to BerkeleyDB (BDB), the journal (Journal) and the file system (Final) at the data server of PVFS2. Selective and wasteless journaling incur less traffic under the MPI-IO benchmark.

Finally, we use our mount modes in the storage server of a PVFS2 multi-tier configuration. In a networked cluster, we use thirteen machines as clients, one machine as the PVFS2 data server and one as the PVFS2 metadata server. By default, each server uses a local BerkeleyDB database to maintain local metadata. At the data server we placed the BerkeleyDB on one partition of the root disk and dedicated the entire second disk to the user data (file system and journal). We fixed the BerkeleyDB partition to ordered mode and tried alternative mount modes at the data disk. We enabled data and metadata synchronization, as suggested to avoid write losses at server failures. We used the LANL MPI-IO Test to generate a synthetic parallel I/O workload on top of PVFS2. In our configuration each process writes to a separate unique file, as suggested for best performance [1]. We varied the number of processes on each of the thirteen quad-core clients between 4 and 40, leading to totals between 52 and 520 processes. We tried 65000 writes of size 1024 bytes. In Figure 5(c), wasteless journaling almost doubles the throughput of ordered mode, while data journaling and selective lie between the other two modes. In Figure 5(d), wasteless journaling reduces the journal traffic of data journaling by 42%, while selective journaling further reduces the write volume of wasteless journaling.
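The workload corresponds to the N-N checkpoint pattern in which every process appends small records to its own file. The sketch below illustrates that pattern with plain POSIX I/O inside an MPI program; the record size and count mirror the experiment, but the code is our illustration, not the LANL MPI-IO Test itself:

    /* Sketch of an N-N checkpoint pattern: each MPI rank appends small
     * synchronous records to its own file (illustrative, not the LANL test).
     * Assumes the ckpt/ directory exists on the PVFS2 mount. */
    #include <fcntl.h>
    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define RECORD_SIZE 1024
    #define NRECORDS    65000

    int main(int argc, char **argv)
    {
        int rank;
        char path[64], rec[RECORD_SIZE];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        snprintf(path, sizeof(path), "ckpt/rank-%05d.chk", rank);
        memset(rec, rank & 0xff, sizeof(rec));

        /* O_SYNC makes every 1024-byte append a synchronous small write,
         * the case that wasteless journaling targets on the data server. */
        int fd = open(path, O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
        if (fd >= 0) {
            for (int i = 0; i < NRECORDS; i++)
                if (write(fd, rec, sizeof(rec)) != RECORD_SIZE)
                    break;
            close(fd);
        }

        MPI_Finalize();
        return 0;
    }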
We design [8] PRABHAKARAN,V.,ARPACI-DUSSEAU,A.C.,AND ARPACI- DUSSEAU, R. H. Analysis and evolution of journaling file sys- and implement a mount mode that we call wasteless jour- tems. In USENIX ATC (Anaheim, CA, 2005), pp. 105–120. naling to merge into page-size blocks concurrent sub- [9] ROSENBLUM,M.,AND OUSTERHOUT, J. K. The design and page writes to the journal. Additionally, we develop the implementation of a log-structured file system. ACM Transactions on Computer Systems 10, 1 (Feb. 1992), 26–52.
