Write Dependency Disentanglement with Horae Xiaojian Liao, Youyou Lu, Erci Xu, and Jiwu Shu, Tsinghua University https://www.usenix.org/conference/osdi20/presentation/liao

This paper is included in the Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation November 4–6, 2020 978-1-939133-19-9

Open access to the Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation is sponsored by USENIX Write Dependency Disentanglement with Horae

Xiaojian Liao, Youyou Lu,∗ Erci Xu, and Jiwu Shu∗ Tsinghua University

Abstract sisted in the storage medium, and further underlies a variety of techniques (e.g., journaling [44], database transaction [13]) Storage systems rely on write dependency to achieve atom- to provide ordering guarantee in the IO stack. Yet, the write icity and consistency. However, enforcing write dependency order is achieved through an expensive approach, referred comes at the expense of performance; it concatenates multi- as exclusive IO processing in this paper. In the exclusive IO ple hardware queues into a single logical queue, disables the processing, the following IO requests can not be processed concurrency of flash storage and serializes the access to iso- until the preceding one has been transferred through PCIe lated devices. Such serialization prevents the storage system bus, then been processed by the device controller and finally from taking full advantage of high-performance drives (e.g., returned with a completion response. NVMe SSD) and storage arrays. Unfortunately, this one-IO-at-a-time fashion of processing orae In this paper, we propose a new IO stack called H to al- conflicts with the high parallelism of the NVMe stack, and leviate the write dependency overhead for high-performance further nullifies the concurrency potentials between multiple orae drives. H separates the dependency control from the devices. First, it concatenates the multiple hardware queues data flow, and uses a dedicated interface to maintain the write of the NVMe SSD, thereby eliminating the concurrent pro- orae dependency. Further, H introduces the joint flush to en- cessing of both host- and device-side cores [50]. Moreover, it FLUSH able parallel commands on individual devices, and serializes the access to physically independent drives, prevent- write redirection to handle dependency loops and parallelize ing the applications from enjoying the benefits of aggregated orae in-place updates. We implement H in kernel and devices. In our motivation study, we observe that with the ff demonstrate its e ectiveness through a wide variety of work- scaling of hardware queues and devices, the performance orae × × loads. Evaluations show H brings up to 1.8 and 2.1 loss introduced by the write dependency can be up to 87%. performance gain in MySQL and BlueStore, respectively. Conversely, orderless writes without dependency can easily 1 Introduction saturate the high bandwidth of NVMe SSDs (§3). Therefore, to leverage the high bandwidth of NVMe SSDs The storage system has been under constant and fast evolution while preserving dependency, we propose the barrier trans- in recent years. At the device level, high-performance drives, lation (§4) to convert the ordered writes into orderless data such as NVMe SSD [15], are pushed onto the market with blocks and ordering metadata that describes the write de- around 5× higher bandwidth and 6× lower latency against pendency. The key idea of barrier translation is shifting the their previous generation (e.g., SATA SSD) [11, 18]. From write dependency maintenance to the ordering metadata dur- the system perspective, developers are also proposing new ing normal execution and crash recovery, while concurrently ways of storage array organization to boost performance. For dispatching the orderless data blocks. example, in BlueStore [23], a state-of-the-art storage backend We incarnate this idea by re-architecting modern IO stack of [45], data, metadata and journal can be separately with Horae (§5). 
In a nutshell, Horae bifurcates the tradi- ff persisted in di erent, or even dedicated devices. tional IO path into two dedicated ones, namely ordered con- With drastic changes from the hardware to the software, trol path and orderless data path. In the control path, Horae maintaining the write dependency without severely impacting flushes ordering metadata directly into the devices’ persistent the performance becomes increasingly challenging. The write controller memory buffer (CMB), a region of general-purpose dependency indicates a certain order of data blocks to be per- read/write memory on the controller of NVMe SSDs [15, 16], o ∗Jiwu Shu and Youyou Lu are the corresponding authors. using memory-mapped IO (MMIO). On the other hand, H - {shujw, luyouyou}@tsinghua.edu.cn rae reuses classic IO stack (i.e., block layer to device driver to

USENIX Association 14th USENIX Symposium on Operating Systems Design and Implementation 549 device) to persist orderless writes. Note that this design is also App. MySQL write-ahead log MySQL scaling-friendly as orderless data blocks can be processed in fsync fsync fsync fbarrier File both an asynchronous (to a single device) and pipelined (to journal Ext4 BarrierFS System multiple devices) manner. Per-core SWQ S S Now, with a bifurcated IO path, we further develop a series Block Order Preserving W W Layer of techniques to ensure both high performance and consis- Q Q Block Layer tency in Horae. First, we design compact ordering metadata and efficiently organize them in the CMB (§5.2). Second, the Device H H H Order Preserving W W W joint flush of Horae performs parallel FLUSH commands on Driver Q Q Q Dispatch dependent devices (§5.3). Next, Horae uses the write redi- R/W Barrier Write rection to break the dependency loops, parallelizing in-place Storage Device A B C Barrier-Enabled Device updates with strong consistency guarantee (§5.4). Finally, (a) Linux IO stack (b) Barrier-Enabled IO stack for crash recovery, Horae reloads the ordering metadata, and Figure 1: Existing IO Stacks with Different Order-Preserving further only commits the valid data blocks but discarding Techniques. SWQ: software queue. In current multi-queue block invalid ones that violate the write order (§5.5). layer, each core has a software queue. HWQ: hardware queue. The To quantify the benefits of Horae, we build a kernel file storage device determines the maximum number of HWQs. Emerging system called HoraeFS to test applications relying on POSIX NVMe drives usually have multiple HWQs. interfaces, and a user-space object store called HoraeStore for distributed storage (§5.6). We compare HoraeFS against queues based on different algorithms (e.g., deadline). While ext4 and BarrierFS [46], resulting in an up to 2.5× and 1.8× in the storage device, the controller may fetch and process speedup at file system and application (e.g., MySQL [13]) arbitrary requests due to timeouts and retries. level, respectively (§6). We also evaluate HoraeStore against As a result of this design, the file system must explicitly BlueStore [23], showing the transaction processing perfor- enforce the storage order. Traditionally, the file system relies mance increases by up to 2.1×. on two important steps: synchronous transfer and cache bar- To sum up, we make the following contributions: rier (e.g., cache FLUSH). First, synchronous transfer requires • We perform a study of the write dependency issue on the file system to process and transfer the dependent data multi-queue drives among both single and multiple de- blocks through each layer to the storage interface serially. vices setup. The results demonstrate considerable over- Then, to further avoid the write reordering by the controller head of ensuring write dependency. in the storage device layer, the file system issues a cache bar- • We propose the barrier translation to decouple the or- rier command (e.g., FLUSH), draining out the data blocks in dering metadata from the dependent writes to enforce the volatile embedded buffer to the persistent storage. After- correct write order. wards, the file system repeats this processing the next request. 
• We present a new IO stack called Horae to disentangle Through interleaving dependent requests with exclusive IO the write dependency of both a single physical device processing, the file system ensures the write requests are made and storage arrays. It introduces a dedicated path to durable with the correct order. control the write order, and uses joint flush and write The basic approach in guaranteeing the storage order is redirection to ensure high performance and consistency. undoubtedly expensive. It exposes DMA transfer latency • We adapt a kernel file system and a user-space object (synchronous transfer) and flash block programming delay store to Horae, and conduct a wide variety of experi- (cache barrier) to the file system. ments, showing significant performance improvement. 2.2 Ordering Guarantee Acceleration 2 Background Many techniques [25,26, 46,48] improve the basic approach This section starts with a brief introduction of enforcing write presented in §2.1, by reducing the overhead of synchronous dependency under current IO stack (§2.1). Then, we illus- transfer and cache barrier. As barrier-enabled IO stack (Barri- trate state-of-the-art techniques that alleviate the overhead of erIO [46]) is the closest competitor, we introduce it briefly. enforcing the write dependency (§2.2). BarrierIO reduces the storage order overhead by preserv- ing the order throughout the entire IO stack. Specifically, 2.1 Basic Ordering Guarantee Approach the BarrierIO enforces the write dependency mainly using In Figure 1(a), we can see that modern IO stack is a combina- two techniques: the order-preserving dispatch to accelerate tion of software (i.e., the block layer, the device driver) and the synchronous transfer, and the barrier write command to hardware (i.e., the storage device) layers. Each layer may improve the cache barrier. First, as shown in Figure 1(b), the reorder the write requests for better performance [15,50] or order-preserving block layer ensures that the IO scheduler fairness [4, 29]. Specifically, in the block layer, the host IO follows the write order specified by the file system. Further, scheduler can schedule the requests in the per-core software the order-preserving dispatch maintains the write order of

550 14th USENIX Symposium on Operating Systems Design and Implementation USENIX Association 1 Device 2 Devices 3 Devices crease the number of threads issuing 4 KB writes to gain the 600 orderless maximum IOPS. ordered 400 For orderless random write, we use libaio [3] engine with iodepth of 512. In this setup, the write requests issued by IOPS (K) IOPS 200 libaio have no ordering constraints and can be freely re- 0 10 20 30 40 50 60 70 80 90 ordered by the storage controller according to NVMe specifi- # of hardware queues cation [15]. The results are shown in Figure 2. As we enable Figure 2: Ordered Write VS. Orderless Write with Varying The more hardware queues, the IOPS of orderless writes increases Number of Hardware Queues and Devices. Each device has up and gradually saturates the storage devices. to 32 hardware queues. Described in Section 3.1. For ordered random write, we use libaio engine but set the iodepth to 1. This setup follows the principle of exclusive requests queueing in the (single) hardware queue. As a col- IO processing in guaranteeing the ordering. As shown in laborator, the storage controller also fetches and serves the Figure 2, the IOPS of ordered write can hardly grow, even if requests in a serialized fashion. Second, BarrierIO replaces we use more devices as RAID 0 to serve the write request. the expensive FLUSH with a lightweight barrier write com- mand for the storage controller to preserve the write order. The gap between orderless and ordered writes (blue area File systems provide the fsync() call for applications to in Figure 2) indicates the overhead of the write dependency. order their write requests. Yet, it is still too expensive to With the increase of hardware queues, the overhead becomes preserve order by fsync(). Thus, like OptFS [26], Barri- more severe, and reaches up to 87%. We conclude that the erIO separates the ordering from the durability, and further Linux IO stack is not efficient in handling ordered writes. introduces a file system interface, i.e., fbarrier(), for order- ing guarantee. The fbarrier() writes the associated data 3.2 Write Dependency Analysis blocks in order, but returns without durability assurance. In this subsection, we analyze the write dependency overhead 3 Motivation via explaining the behaviors of Linux IO stack. Multi-queue. The improvement of storage technologies has For a single device, the physically independent hardware been continuously pushing forward the performance of solid- queues get logically dependent. The application thread firstly state drives. To meet the large internal parallelism of flash puts the write requests in the software queues through the storage, SSDs are often equipped with multiple embedded IO interface (e.g., io submit()). Linux IO stack supports processors and multiple hardware queues [33, 38]. In the various IO schedulers (e.g., BFQ [4]), which perform requests host side, as shown in Figure 1(a), the IO stack employs merging/reordering in the software queues. Next, the requests the multicore friendly design. It statically maps the per-core are dispatched to hardware queues, where IO commands are software queues to the hardware queues for concurrent access. generated. In the hardware queues, out of consideration for Multi-device. 
On the other hand, for higher volume capacity, hardware performance (e.g., device-side scheduling) and com- performance and isolation, applications usually stripe data plexity (e.g., request retries), there are no ordering constraints 1 blocks to multiple devices as in RAID 0, or manually isolate of storage controller processing the commands [15] . There- different types of data into multiple devices. For example, fore, due to the orderless feature of both types of queues, to as shown in Figure 1(a), the ext4 file system uses a dedi- guarantee storage order, the IO stack only processes a single cated device for journaling processing. The MySQL database request or a set of independent requests at a time. As a result, redirects its write-ahead logs to a logging device. the ordered write request keeps most queues idle and leaves The multi-queue and multi-device bring opportunities to the multi-queue drives underutilized. enhance performance for independent write requests. Nev- For multiple devices, physically isolated devices get logi- ertheless, it still remains unknown how much overhead the cally connected. The IO stack employs isolated in-memory write dependency introduces to the multi-queue and multi- data structures for the software environment of multi-device. device design. Here, we first start with a performance study In the hardware side, a device has its private DMA engine and of the write dependency atop multi-queue and multi-device. hardware context. Despite the concurrent execution environ- ment, the ordered writes flow through the multi-device in a 3.1 Write Dependency Overhead serialized fashion; the application can not send a request to a In this subsection, we quantify the overhead of write depen- different device until the on-going device finishes execution. dency atop Linux IO stack by comparing the IOPS of ordered writes with orderless ones. We use an NVMe SSD (spec in 1 Table 2 Intel 750) with up to 32 hardware queues and use In barrier-enabled devices, the host can queue multiple ordered writes and insert a barrier command in between. Such barrier command is available FIO [9] for testing. During the test, we vary the number of in a few eMMC products [22] with usually single hardware queue. Multi- hardware queues and attach more devices. Further, we in- queue-based NVMe does not have similar concept.

USENIX Association 14th USENIX Symposium on Operating Systems Design and Implementation 551 3.3 Limitation of Existing Work metadata as A˜x and the data content as A¯x, and thus we have ˜ ¯ A straightforward solution to remedy aforementioned issues Ax = Ax ∪ Ax. Specifically, if Ax has consecutive data blocks ˜ is to keep the entire IO stack ordered as BarrierIO does, so as of n length to device A from logical block address (lba) m, Ax to allow more ordered writes to stay in the hardware queue. = {A, m, n}. Recall that in Section 2.2, each layer in BarrierIO stack must Given a write stream Ax  Bx+1, the barrier translation ˜ ˜ ¯ ¯ preserve the request order spread by the upper layer. Imple- turns it into Ax  Bx+1  {Ax, Bx+1}. Specifically, the trans- ˜ menting such design is quite simple in old drives with a sole lated write stream guarantees the order in two steps: (1) {Ax, ˜ ¯ ¯ ˜ ˜ command queue (e.g., mobile UFS, SATA SSD): the requests Bx+1}  {Ax, Bx+1} and (2) Ax  Bx+1. First, we must ensure in the hardware queue are serviced in the FIFO way. How- the ordering metadata is made durable no later than data con- ever, it is quite challenging to extend this idea to multi-queue tent. Then, the write dependency of original write stream is drives and multiple devices, without sacrificing the multicore extracted as the ordering metadata. designs of the host IO stack and the core/data parallelism of 4.2 Proof fast storage devices. We explain the reason in detail. Now, we further discuss the correctness of barrier translation The key of BarrierIO is that each layer agrees on a specific via the following proof. Our main point is to show that the order. While a single hardware queue structure itself can de- translated write stream has the same effect on satisfying the scribe the order, for multi-queue and multi-device, the host IO ordering constraints. In other words, the following proves stack must manually specify the order. Maintaining a global that if the ordering metadata is made durable in specific order, order among multiple queues throughout the entire IO stack and is made durable no later than the data content, the write potentially ruins the multi-queue and multi-device concur- dependency of original write stream is maintained. rency, which are vital features to fully exploit the bandwidth We formalize the implication as follows: of high-performance drives, according to our evaluation (§6.2) and a recent study [50]. Further, the firmware design should A˜x  Bx˜+1  {A¯x, Bx¯+1} =⇒ Ax  Bx+1 (1) comply with the host-specified order, which may introduce synchronization among embedded cores and may neutralize The key lies in the non-deterministic order between A¯x and the internal core and data parallelism. Bx¯+1. We thus discuss the proof in two situations: A¯x  Bx¯+1 In this paper, instead of keeping the IO stack ordered, we and Bx¯+1 ≺ A¯x as follows. seek a new approach that keeps most parts of current IO stack The first case A¯x  Bx¯+1. Since we already have A˜x  Bx˜+1, orderless while preserving correct storage order. We now we have ( A˜x ∪ A¯x )  (Bx˜+1 ∪ Bx¯+1). Because Ax = A˜x ∪ A¯x present our design in the following sections. and Bx+1 = Bx˜+1 ∪ Bx¯+1, we have Ax  Bx+1. The second case B¯ ≺ A¯ . This case at first glance may orae x+1 x 4 The H Foundation seem to violate the write order. 
However, since we have Bx˜+1 To efficiently utilize the modern fast storage devices, we wish  Bx¯+1, we can always find and discard the content Bx¯+1 via to keep the orderless and independent property of both the the indexing Bx˜+1. In this situation, the data content remains software and hardware intact. We achieve this goal via barrier empty, i.e., ∅. The empty result obeys any write dependency, translation, which disentangles the write dependency from i.e., ∅ ∈ {Ax  Bx+1} = {∅, {Ax}, {Ax, Bx+1}}. original slow and ordered write requests. Two intuitive concerns may be raised from the second case. First, a long write stream may be at a higher risk of losing 4.1 Design more data due to discard, although it keeps a consistent state Here, we refer a series of ordered write requests issued by of disk status. However, such scenario is common and accept- the file system or applications as a write stream. Commonly, able in storage systems. Similar to roll-back and undo log, the a write stream can have multiple sets of data blocks and the discarding time window (e.g., fsync() delay) is determined inbetween barriers that serve as ordering points between two by the file systems or applications. If applications desire data sets of data blocks. We note a set of data blocks to device durability rather than ordering, they must synchronously wait A with a monotonic set ID x as Ax, and refer a barrier (write for the durability of the write stream, e.g., calling fsync(). dependency) as . Thus, for a write stream Ax  Bx+1, it shall Second, a special case of the write dependency, called be ensured that the data blocks of Ax must be made durable discarding at a dependency loop, may erase the old but valid prior to (≺) Bx+1 or at the same time (=) as Bx+1. data content. For example, consider Ax  Bx+1  Ax+2, where Next, we move on to remove the dependency (i.e., ) be- Ax and Ax+2 operate on the same logical block address. This tween two write requests. We use {} to group a set of indepen- would occur when storage systems perform in-place updates dent write requests that can be processed concurrently. Our (IPU). Simply discarding A¯x in the second case loses the key issue is to translate the Ax  Bx+1 into {Ax, Bx+1}. We valid data. Our solution is to break the dependency loop by ′ decouple the indexing of a write stream from its data content. redirecting IPU to another location, i.e., Ax  Bx+1  A x+2, ′ The indexing (i.e., ordering metadata) keeps a minimum set where A x+2 targets on a different address from Ax. In this of information to retain the dependency. We refer the ordering way, we guarantee the correctness of dependency loop as in

552 14th USENIX Symposium on Operating Systems Design and Implementation USENIX Association : ordered write request i to device A : slow barrier Ai fbarrier() fsync() queue_tx() apply_tx()

Bk : orderless write request k to device B : faster barrier HoraeFS HoraeStore Device A ①Ax ≼Bx+1 B6 B5 B4 A3 A2 A1 Device B Ordering Layer (a) the write stream before the barrier translation ③{Ax, Bx+1} Ordered data writes Block Layer ② Ax ≼ Bx+1 A1 A2 A3 B6 B4 A3 A1 Device A Device Driver Ordered control writes B4 B5 B6 {lba, len, device id} Device B (b) the write stream after the barrier translation Device A Device B ...... Orderless data writes

Figure 3: An Example of The Barrier Translation. The solid Figure 4: The Horae IO Stack Architecture. Horae splits the circle represents the ordering metadata. The subscript denotes the traditional IO path into ordered control path ② and orderless data desired write order of the file system or applications. path ③. Horae persists the ordering metadata A˜x via the control path before submitting the orderless data blocks A¯x to the data path. normal case, preserving high concurrency as well as the old version of data. We show more details in §5.4. trary hardware queues or storage devices without exclusively occupying the hardware resources. 4.3 Example The key of Horae is the ordering layer, to which the order- We now use an example to go through the entire process of bar- ing guarantee of the entire IO stack is completely delegated rier translation. Figure 3(a) presents the original write stream. (§5.2). Note that the ordering layer does not need to handle all As we can see, even for different devices, the block-sized (e.g., block IOs, but instead just need to capture the write dependen- 4 KB) data is still processed in a serialized fashion. Even cies of ordered writes. Specifically, Horae stores the ordering worse, the classic approach uses slow and coarse-grained metadata in the persistent controller memory buffer (CMB in barrier (the cloud shape, e.g., flash FLUSH) to enforce the NVMe spec 1.2 [15], PMR in NVMe spec 1.4 [16]) of the relationship , which may process unrelated data blocks (e.g., storage device using an ordering queue structure. Horae lever- the orderless data blocks). ages epochs to group a set of writes with no intra-dependency, Figure 3(b) shows the output of the barrier translation. and further uses the ordering queue structure itself to reflect Original expensive barriers are replaced with faster and fine- the order of each epoch with inter-dependency. grained barriers (the thunder shape, e.g., memory barrier). Separating the ordering control path from the data path pro- Further, the initial ordered writes are translated to orderless vides numerous benefits; it saves the block layer, the device ones with no barriers. As a result, after processing the order- driver, and the devices from enforcing write order, which can ing metadata, the devices can process data blocks at their full sacrifice performance or particular property (e.g., schedul- throttle concurrently. ing). Further, this design enables Horae to perform parallel FLUSHes despite the dependencies among multiple devices 5 The Horae IO stack (§5.3). Yet, it also faces a challenge, the dependency loop. To demonstrate the advantage of the barrier translation, we As we mentioned in §4.2, the dependency loop occurs implement Horae by modifying the classic Linux IO stack. when multiple in-place updates (IPU) operate on the same ad- The following sections first give an overview of Horae, and dress. As the data path of Horae is totally orderless, multiple then present the techniques at length. ordered in-progress (issued by the file system but the comple- tion response is not returned) IPUs can co-exist and be freely 5.1 Overview reordered. The dependency loops, if not properly handled, As the high level architecture shown in the Figure 4,Horae can introduce data version issue (e.g., the former request over- extends the generic IO stack with an ordering layer target- writes the later one) and even the security issue (e.g., unau- ing ordered writes. Moreover, Horae separates the ordering thorized data access). 
Horae breaks the dependency loops control from the data flow: the ordering layer translates the by write redirection (§5.4). In other words, Horae treats the ordered data writes (①) into ordering metadata (②) plus IPUs as versioned writes, stores their ordering metadata seri- the orderless data writes (③). The ordering metadata passes ally, and concurrently redirects them to a pre-reserved disk through a dedicated control path (MMIO). The orderless data location. In background, Horae writes the redirected data writes flow through the original block layer and device driver blocks back to their original destination. In this way, Horae (block IO) as usual. As a result of separation, the data path parallels the IPUs while retaining their ordering. is no longer bounded by the write dependency, and it thus Atop the ordering layer, Horae exports block device ab- allows the file systems to dispatch the data blocks to arbi- straction and provides ordering control interface to the upper

USENIX Association 14th USENIX Symposium on Operating Systems Design and Implementation 553 MMIO A , A , Z , Z Format: lba len devid etag dr plba rsvd A1, A2, Z3, Z4 5 6 7 8 Block IO Size in bits: 32 32 81 1 32 22 A1 A2 A A Z Z3 Z4 ≼1 2 3 Z4 A5 A6 Z7 Z8 A5 A6 1 1 Z7 Z8 PCIe Figure 6: The Ordering Metadata Format. lba: logical block address. len: length of continuous data blocks. devid: destination Head 2 3 Tail device ID. etag: epoch boundary. dr: is made durable. plba: lba of Epoch[0] Epoch[n-1] Epoch[n] 4 4 prepare write. rsvd: reserved bits. The Ordering A1 A2 Z3 Z A A Z Z Queue 4 5 6 7 8 the unit of epoch. To realize epoch, Horae uses the etag to

A1 :data block indicate the boundary of epochs. The etag implies . A1 A2 A5 A6 ….. Z3 Z4 Z7 Z8 Now, we go through an example presented in Figure 5 Device A A1 :ordering metadata Device Z to show how data blocks reach the storage with ordering Figure 5: The Circular Ordering Queue Organization. The or- constraints. Suppose that the ordering layer receives two dering queue is located in the persistent controller memory buffer sets of ordered write requests from two threads with ordering orae (CMB) of SSD. H groups a set of independent writes to an epoch, constraints {A ,A ,Z ,Z }  {A ,A ,Z ,Z }. The ordering layer and enforces the ordering between epochs. Described in Section 5.2. 1 2 3 4 5 6 7 8 first forms two epochs N and N-1, and constructs the ordering metadata according to the data blocks and write dependencies. layer systems (§5.6). We provide APIs (application program- Next, the two threads store the ordering metadata concurrently ming interfaces) and high level porting guidelines for upper to the ordering queue via MMIO ( 1 ). Then, Horae updates layer systems to adapt to Horae. Moreover, we build a kernel the 8-byte tail pointer sequentially to ensure both the update file system, HoraeFS, for applications that rely on POSIX file atomicity of each epoch and the write order of associated system interfaces, and a user-space object store, HoraeStore, ordering metadata ( 2 , 3 ). Finally, the two threads dispatch for distributed storage backend. the orderless writes concurrently via block IO interface ( 4 ). On top of HoraeFS and HoraeStore, similar to previous works [23,26,46], we provide two interfaces, the ordering and Since the size of available CMB is usually limited (e.g., durability interface, for upper layer systems or applications to 2 MB), the ordering queue may exceed the CMB region. orae organize the ordered data flow. The ordering interfaces (i.e., Thus, H introduces two operations, swap and dequeue, fbarrier(), queue tx()) send the data blocks to the stor- to reclaim free space of the CMB for incoming requests. age with ordering guarantee, but return without ensuring dura- Swap. The ordering layer blocks the threads issuing ordered bility. The durability interfaces (i.e., fsync(), apply tx()) writes, and then invokes swap operation, when the total size deliver the intact semantics as before. of the valid ordering metadata exceeds the queue capacity. It moves the valid ordering metadata to a checkpoint region with 5.2 Ordering Guarantee larger capacity, e.g., a portion of flash storage. It first waits The major role of the ordering layer is guaranteeing the write for the on-going append operations on the ordering queue to complete. Next, it copies the whole valid ordering metadata order. Recall that the write stream Ax  Bx+1 is translated to to the checkpoint region. Finally, it updates the head and A˜x  Bx˜+1  {A¯x, Bx¯+1}, thus the ordering layer enforces the write dependency of the ordering metadata before dispatching tail pointers atomically with a lightweight journal. the translated data blocks. We first present the organization Dequeue. The ordering layer dequeues the expired ordering of the ordering metadata. metadata when its associated data blocks and the preceding Ordering metadata organization. As shown in the center of ones are durable. The dequeue operation moves the 8-byte Figure 5, through the PCIe base address register, Horae uses head pointer. the persistent Controller Memory Buffer (CMB) as the persis- Optimization. 
We observe that the MMIO write latency tent circular ordering queue. The ordering queue is bounded through PCIe is acceptable, but MMIO read can be extremely by the head and tail pointers, and stores the ordering meta- slow (8 B read costs 0.9us, 4 KB read costs 113us). This data of one write stream. For an incoming ordered write re- is because MMIO read is split into small-sized (determined quest, Horae first appends its ordering metadata to the queue by CPU) non-posted read transactions to guarantee atomic- via MMIO. As shown in Figure 6, the ordering metadata is ity [17]. Yet, MMIO reads can be abundant on the CMB. For compact, mainly consisting of range-based destination (i.e., example, Horae allocates free locations from the ordering lba, len, devid). Storing the ordering metadata via control queue before sending the ordering metadata, which requires path is a CPU-to-device transfer with byte-addressability; un- frequent head and tail pointer access. Also, Horae needs like an interrupt-based memory-to-device DMA transfer, it to read the whole ordering queue for a swap operation. Horae does not transfer a full block nor switch context. Thus, per- avoids slow MMIO reads by maintaining an in-memory write- sisting the compact ordering metadata via MMIO is efficient. through cache for the entire CMB. The cache serves all read Horae leverages the epoch to group a set of independent operations in memory. Write operations are performed to the writes. And, Horae only enforces the write dependency in cache, and persisted to the device CMB simultaneously via

554 14th USENIX Symposium on Operating Systems Design and Implementation USENIX Association A: <1,5> B: <2,4> C: <3,6> D: <7,9> A: <0,0> B: <0,0> C: <0,0> D: <7,9> completed via interrupt handler or polling. Head Flushed FLUSH(C) Tail Head Flushed FlushedTail While the legacy FLUSH is always performed in a serialized and synchronous fashion, the joint FLUSH enables Horae to A1 B2 C3 B4 A5 C6 D7 D8 D9 A1 B2 C3 B4 A5 C6 D7 D8 D9 The Ordering Q The Ordering Queue flush concurrently and asynchronously, for devices without PLP. These devices expose extremely long flash programming Device A Device B Device C Device D Device A Device B Device C Device D latency, so async flushings on them exploit the potential par- (a) Pre-FLUSH states (b) Post-FLUSH states allelism. However, Horae remains the sync flushing on the other type of devices with PLP. Since flushing such devices is Figure 7: The Durability and Joint FLUSH of Horae. A FLUSH command to a single device also triggers flushings to other devices returned from the block layer directly, async flushing incurs whose write requests must be made durable in advance. unnecessary context switches from the wakeup mechanism. 5.4 Handling Dependency Loops MMIO. The extra memory consumption (2 MB) is negligible. To understand the motivation for resolving the dependency 5.3 Durability Guarantee loops, we show two examples. First, consider a data block is Applications may also require instant durability. Linux IO overwritten repeatedly. Reordering of two overwrite opera- stack uses the FLUSH command (e.g., calling fsync()) to tions may cause the later one to be overwritten by the former enforce durability and ordering of updates in a storage de- one, and results in a data version issue. Second, consider vice. Further, the FLUSH serves as a barrier between multiple at the file system level. The file system frees a data block devices, so as to ensure that the post-FLUSH requests are not from owner A, and then reallocates it to owner B. Reordering made durable prior to the pre-FLUSH ones. For example, to of reallocating and freeing upon a sudden crash can cause a guarantee the durability of Ax  Bx+1, Linux IO stack issues security issue: owner B can see the data content of owner A. two FLUSHes. The first one on device A ensures the write de- Classic IO stacks handle dependency loops by prohibiting pendency () as well as the durability of Ax, and the second multiple in-progress IPUs on the same address. IPUs on the one on device B for durability of Bx+1. same address are operated exclusively, where the next IPU The FLUSH of Horae no longer serves as a barrier. Thus, can not be submitted until the preceding one is completely Horae eliminates intermediate FLUSHes and invokes an even- durable. However, this approach serializes the access to the tual FLUSH to ensure durability. Unlike the legacy FLUSH same address, leaving the device underutilized. Horae allows targeting on sole device, the FLUSH of Horae, called joint multiple in-progress parallel IPUs on the same address, and FLUSH, automatically flushes related devices whose data resolves dependency loops through IPU detection, prepare blocks should be persisted in advance. and commit write. Figure 7 shows an example of the joint FLUSH in detail. IPU detection. The foremost issue is to detect IPUs. In Ho- The states of the ordering queue are in Figure 7a, and suppose rae, the upper layer systems must specify the IPU explicitly we decide to flush device C. 
The ordering layer firstly finds because of the awareness of IPU. the flush point (6), the last position of the flushed device in the Prepare write. In receiving an IPU, Horae first allocates an ordering queue. Next, it identifies the devices that need to be available location from the preparatory area (p-area), a pre- flushed simultaneously. A device should be flushed if it has reserved area of each device for handling dependency loops. requests prior to the flush point. To quickly locate the flush It stores the location in the plba region (shown in Figure 6) of candidates, Horae keeps the first and last position of each the ordering metadata. Next, it dispatches the IPU to the plba device (device range) in the ordering queue (e.g., h1,5i of position of p-area, via the same routine as the classic orderless device A), as shown in the top of Figure 7a. By checking the write. Further, Horae uses an in-memory IPU radix tree to existence of the intersection of the device range and head-to- record the plba for following read operations to retrieve the flush range (h1,6i), Horae selects the flush candidates. Then, latest data. The IPU tree accepts the logical block address Horae sends FLUSH commands to the candidate devices (A (lba) as the input and outputs the newest plba. Compared and B) simultaneously. When the joint FLUSH completes, Ho- to the IO processing, the modification on the IPU tree is rae moves the flushed pointer and resets the device range, performed in memory, and its overhead is thus negligible. as shown in Figure 7b. Commit write. Horae applies the effects of IPUs via the With the flushed pointer, Horae ensures the durability of commit write, once the durability of the prepare write is sat- the write requests between the head and flushed position. isfied. Horae firstly scans the ordering queue, and merges However, on SSDs with power loss protection (PLP), data the overlapping data blocks of the p-area. Then, it writes the blocks are guaranteed to durable when they reach the storage merged data blocks back to their original destination concur- buffer, prior to a FLUSH command. Therefore, Horae uses rently. When the commit write completes, Horae removes the dr bit (shown in Figure 6) to indicate the durability of associated entries of the IPU tree, and dequeues the entries the requests after the flushed position. For SSD with PLP, between the head and flushed pointer of the ordering queue. Horae sets the dr bit, once the ordered write requests are Horae introduces two commit write policies, lazy and ea-

USENIX Association 14th USENIX Symposium on Operating Systems Design and Implementation 555 API Explanation olayer init stream(sid, param) Register an ordered write stream with ID sid and parameters param olayer submit bio(bio, sid) Submit an ordered block IO bio to stream sid olayer submit bh(bh, op, opflags, sid) Submit a buffer head bh to stream sid with specific flags opflags and op olayer submit bio ipu(bio, sid) Submit an ordered in-place update block IO bio to stream sid olayer blkdev issue flush(bdev, gfp mask, error sector, sid) Issue a joint FLUSH to device bdev and stream sid fbarrier(fd) Write the data blocks and file system metadata of file fd in order fdatabarrier(fd) Write the data blocks of file fd in order io setup(nr events, io ctx, sids) Create an asynchronous I/O context io ctx and a set of streams sids io submit order(io ctx, nr, iocb, sid) Write nr data blocks defined in iocb to stream sid

Table 1: The APIs of Horae. Horae provides the stream (or sequencer) abstraction for upper layer systems. Each stream is identified by a unique sid, and represents a sequence of ordered IO requests. To realize multiple streams, Horae evenly partitions the CMB area and p-area, and assigns a portion to each stream. ger commit write. The lazy commit write is performed in The ordering queue. As stated in §5.2,Horae writes the background when the IO stack is idle. When Horae runs out ordering metadata via MMIO. But MMIO writes are not of p-area space, it triggers eager commit write. guaranteed to be durable because they are posted transactions Read. As the ordered IPUs are redirected to p-area, the without completions. After each MMIO write, Horae issues following read operation retrieves the latest data blocks from a non-posted MMIO read of zero byte length to ensure prior p-area first. Horae searches the IPU tree for the plba, and writes are made durable [24]. reads the data from the plba position of p-area. Upon a power outage, the ordering queue, the head, With write redirection for IPUs, Horae provides higher flushed and tail pointers, are saved to a backup region consistency than the data consistency, which matches the of flash memory, with the assistance of capacitors inside the default ordered mode of ext4. The data consistency requires SSD. When power resumes, Horae loads the backup region that (1) the file system metadata does not point to garbage into the CMB. data, and (2) the old data blocks do not overwrite new ones. The data block. Horae recovers the storage to a correct Horae is able to preserve the order between data and file state with the support of the ordering queue. Horae scans system metadata, and thus the file system metadata always the ordering queue from the head position to the flushed points to valid data. Also, Horae controls the order of parallel position, in case of the data blocks are made durable with the IPUs. Therefore, Horae supports data consistency. FLUSH response but expired entries are not dequeued. Horae However, the data consistency does not provide request commits the data blocks of the p-area that obey the order. atomicity. Under data consistency, the IPUs directly overwrite Further, Horae recovers the durable yet not flushed data valid data blocks, and can sometimes leave the file system blocks (e.g., in SSD with PLP). Starting from the flushed a combination of old and new data, which brings inconve- position, Horae sequentially scans the ordering queue, and nience to maintain application-level consistency [2]. This is commits the prepare writes until it finds a non-durable data because commodity storage does not provide atomic write block (i.e., dr bit is not set). After that, Horae discards the operations. For a write request containing a 4 KB data block, following data blocks through filling the blocks with “zeros”, it may be split into multiple 512 B (determined partially by because they violate the write order. Max Payload Size) PCIe atomic ops (i.e., transactions) [17]. orae A crash may result in partially updated 4 KB data block. 5.6 The API of H and Use Cases Horae enhances data consistency with request atomicity; This subsection first describes the API of Horae, and then continuous data blocks of each write request sent to the Ho- presents two use cases: a file system leveraging a single write rae are made durably visible atomically due to double write. 
stream (i.e., a single ordering queue), and a user-space object Once the durability of the prepare write is satisfied, Horae store running with multiple write streams. ensures request atomicity. Otherwise, the prepare write may 5.6.1 The API of Horae be partially completed and its effect is thus ignored. To enable the developers to leverage the efficient ordering con- 5.5 Crash Consistency trol of Horae, we provide three levels of functionalities/APIs In the face of a sudden crash, Horae must be able to recover shown in Table 1: the kernel block device, the file system and the system to a correct state that the data blocks are persisted the asynchronous IO interface (i.e., libaio [3]). in the correct order and the already durable data blocks with Kernel block device interface. The kernel systems (e.g., a completion response are not lost. This subsection discusses file systems) can use olayer init stream to initialize an the crash consistency of Horae, including the ordering queue ordered write stream for further use. olayer submit bio consistency and the data block consistency. and olayer submit bh deliver the same block IO sub-

556 14th USENIX Symposium on Operating Systems Design and Implementation USENIX Association mission function as classic interfaces (i.e., submit bio CPU IO Context switch Ordering layer and submit bh); but they return when associated order- Ext4 fsync 51.18us ing metadata are persisted in the sid ordering queue App. D Wait D (§5.2). olayer submit bio ipu is for in-place updates 6.30 9.93 0.85 0.85 JBD2 Pre JM Wait JM JC Wait JC Post (§5.4). olayer blkdev issue flush is extended from clas- 1.05 6.12 10.75 2.83 11.78 0.72 sic FLUSH interface (i.e., blkdev issue flush); it keeps the HoraeFS fbarrier 20.82us FLUSH 0.7 same arguments and performs joint on a given stream App. D and target devices (§5.3). 6.30 0.85 0.7 0.7 0.85 interface. We offer two ordering file system Submit Pre JM JC Post 1.05 6.12 2.83 0.72 0.85 0.85 interfaces, the fbarrier and fdatabarrier, for upper layer Flush Wait JMWait JC F Post systems or applications to organize their ordered data flow. HoraeFS fsync 34.82us 5.95 5.26 1.09 The fbarrier bears the same semantics as the osync of OptFS and fbarrier of BarrierFS; it writes the data blocks Figure 8: The fsync() and fbarrier(). 4 KB data size. The and journaled metadata blocks in order but returns without number shows the latency of each operation in microseconds. D: application data blocks. Pre: prepare the journaled metadata. JM: ensuring durability. The fdatabarrier is the same as that journaled file system metadata. JC: journaled commit record. Post: of BarrierFS; it only ensures the ordering of the application change transaction state, calculate journal stats, etc. data blocks but not the journaled blocks. We further show the internals of these interfaces in the following §5.6.2. the BH New states of each write. Although the journal area is Libaio interface. Libaio provides wrapper functions for repeatedly overwritten, we do not treat the journaled writes async IO system calls, which are used for some systems (e.g., as IPUs, as the journal area is always cleaned before reused. BlueStore and KVell [36]) designed on high-performance Figure 8 shows a side-by-side comparison between Ext4 orae drives. We expose the ordering control path of H via and Horae on a NVMe SSD. For each 4 KB allocating write in two new interfaces on libaio. io setup allows developers to most cases, both file systems issue three data blocks, namely allocate a set of streams defined in sids. io submit order the data block (D), the journaled metadata block (JM) and performs ordered IO submission; it bears the same seman- the journaled commit block (JC). Further, both file systems tics of fdatabarrier. We further show the usage of these order the data blocks with {D, JM }  JC. Ext4 enforces the interfaces in boosting BlueStore in §5.6.3. ordering constraints through the exclusive IO processing. It Porting guidelines. We provide three guidelines. First, upper waits for the completion of preceding data blocks (e.g., Wait layer systems can distinguish the ordered writes from the JM) and issues a FLUSH (not shown in the figure because it orderless ones based on the categories of the request, and send is returned by the block layer quickly). HoraeFS eliminates them using Horae’s APIs. The requests that contain metadata, the exclusive IO processing, and preserves the order through the write-ahead log and the data of eager persistence (e.g., the ordering layer, as shown in the black rectangle. 
HoraeFS data specified by the fsync() thread) are treated as ordered waits for the durability of the associated blocks in flush thread, writes. Second, due to the separation of ordering control path, and finally issues a FLUSH to the ordering layer for durability. upper layer system can design and implement the ordering and HoraeFS differs from BarrierFS in the IO dispatching (i.e., durability logic individually. In the ordering logic, they can the white rectangle) and IO waiting (i.e., the gray rectan- use the ordering control interface (e.g., olayer submit bio) gle) phase. During IO dispatching phase, BarrierFS passes to dispatch the following ordered writes immediately after through the entire order-preserving SCSI software stack. Be- the previous one returns from the control path. This allows sides, BarrierFS experiences extra waiting time due to the the ordered data blocks to be transferred in an asynchronous order-preserving hardware write, i.e., the barrier write. manner without waiting for the completion of DMA. Third, they can remove all FLUSHes that serve as ordering points in 5.6.3 The HoraeStore Distributed Storage Backend the ordering logic, and invoke an eventual joint FLUSH in the We build HoraeStore atop the Horae with a set of modifica- durability logic to guarantee durability. tions of BlueStore [23], an object store engine of Ceph. BlueStore directly manages the block device by async IO 5.6.2 The HoraeFS File System interfaces (e.g., libaio), providing transaction interfaces for We build HoraeFS atop the Horae with a set of modifications distributed (i.e., RADOS). Inside each write of BarrierFS [46]. BarrierFS builds on Ext4, and divides the transaction, it first persists the aligned write to data storage, journaling thread (i.e., JBD2 in Figure 8) into submit thread followed by storing the unaligned small writes and metadata and flush thread. in a RocksDB [1] KV transaction (KVTXN). The RocksDB HoraeFS inherits this design, and the major changes are first writes the write-ahead log (WAL), then applies the up- that (1) we submit ordered writes and reads to the ordering dates to the KV pairs. For inter-transaction ordering, it uses layer first, (2) we remove the FLUSH to coordinate the data sequencers; the next transaction can not start a KVTXN until and journal device and (3) we detect IPUs through inspecting the preceding one has made the aligned write durable and

USENIX Association 14th USENIX Symposium on Operating Systems Design and Implementation 557 eipeetH implement We ihaon 0 O change. LOC 200 around with intra- For interfaces. transaction same the exporting while o M)rqieeto H of requirement PMR) (or H driver). SCSI and NVMe layer, block the (i.e., h ocrec fKTN’sbiso.Frt nH in First, submission. KVTXNs’ of improves concurrency and the transactions, dependent of range overlapping  rnato reig H ordering, transaction 558 14th USENIXSymposium onOperating Systems Designand Implementation in CMB the of alternatives the discuss further [ protection loss [ market [ StarBlaze from SSD enabled e for ory LOC H 100 approximately changes. with BarrierFS on based plemented stack IO traditional for needed of are lines changes 1288 no of (LOC); consisting code layer), ordering the (i.e., module Discussion and Details Implementation 5.7 for RocksDB to dispatched KV be processing. the further in can queued and KV be thread, a can submit KVTXNs in more durability Hence, the thread. ensures sub- flush and KV ordering, a for in KVTXNs thread the mit starts it durability; from KVTXN i.e., satisfied, is the them once between concurrently, ordering transactions H dependent words, two other its In process of KVTXN. ordering and the (D) satisfied writes has aligned one preceding the after ately S concurrently. pairs interface, KV new and this H With transaction, pairs. each KV for and WAL data, of order H of interface durable. become waits to and KV KVTXNs blocks in-progress single which for a KVTXNs, perform uses serially BlueStore to thread Besides, KVTXN. its started horae: stack. IO NVMe Linux native Performance. Write Ordered 9: Figure eapn h M eino aaio-akdCMB- capacitor-backed a of region CMB the remapping tore V XN KVT urnl,CBealdSD r led vial nthe in available already are SSDs CMB-enabled Currently, H H ordering, inter-transaction For H Throughput (GB/s) orae orae 0.5 1.0 h olwn rnato trsteKTNimmedi- KVTXN the starts transaction following the , 0 14 S ffi ed eino yeadesbepritn mem- persistent byte-addressable of region a needs tore 2 , in reigcnrlpt.W elz hsby this realize We path. control ordering cient 8 4 orae eod H Second, . 18 orae 1 NVMe SSD B 1 NVMe Write (KB) size .Mroe,mn Sshv nbe power enabled have SSDs many Moreover, ]. ceeae h rnato reigguarantee, ordering transaction the accelerates 10 S orae , tore , 12 io 63 64 32 16 , 18 nLnxkre sapugbekernel pluggable a as kernel Linux in submit orae sipeetdbsdo BlueStore on based implemented is orae orae , 32 orae .Teeoe h essetCMB persistent the Therefore, ]. S S S tore tore 20 tore order() using ] H a eahee aiy We easily. achieved be can orae rcse h aa WAL data, the processes eaae h reigof ordering the separates sstenwaycIO async new the uses orae 0 1 2 h hogpto admywiig1G rv pc y1wiesra ie,1tra) vanilla: thread). 1 (i.e., stream write 1 by space drive GB 1 writing randomly of throughput The { Osak are:BrirealdI tc.Hrzna otdlns aiu eiebandwidth. device maximum lines: dotted Horizontal stack. IO Barrier-enabled barrier: stack. IO D ioremap vanilla ocnrltewrite the control to , S 8 4 1 tore , D 1 NVMe SSD C 1 NVMe Write (KB) size 2 §  } orae 7 xed the extends . 63 64 32 16 wc() V XN KVT orae Si im- is FS orae . 
can 1 - barrier hc nygaate h ipthodrrahn nstorage in reaching order dispatch the guarantees only which dntda F-R n reiggaate(eoe as (denoted guarantee ordering and BFS-DR) as (denoted hsscineautsteH the evaluates section This irS[ rierFS yaseigtefloigquestions: following the answering by bu ( ext4-DR in disable barriers We ext4-DR). the as in (denoted options mode default journaling with ordered ext4 setup We vanilla. upon running of size Systems. The PLP. Compared with are H C by and used B CMB SSD NVMe The NVMe (C). datacenter-grade SSD high-performance the NVMe and consumer-grade (B), the SSD A), as SSDs: (labelled categorized SSD broadly SATA three the use We SSDs. candidate the Table GHZ. 2.50 at running chine Hardware. Setup Experimental 6.1 Evaluation 6 fI tcs ail n areI [ BarrierIO and Vanilla stacks, IO of F-D.Snew onthv are opin storage compliant barrier have not do we Since BFS-OD). al iu Osak x4[ Ext4 stack. IO Linux fault C B A 0 1 2 ff • • • • • CP70NVMe P3700 DC r(o trg eim.Smlrt x4 ets Bar- test we ext4, to Similar medium). storage (not er 6 R SATA PRO 860 di H does How ( H durability? of performance the is What ( H ordering? of performance the is What in?( tions? H does improvement much How ( introduce? recovery does overhead H Can 2 NVMe (B SSDs 2 + NVMe B) 8 4 5 NVMe 750 ff Samsung 46 Write (KB) size rn ossec ee?( level? consistency erent Model Intel Intel pnBrirOsakwt uaiiyguarantee durability with stack BarrierIO upon ] orae ecnutaleprmnswt 2cr ma- 12-core a with experiments all conduct We § 6.6 63 64 32 16 horae § orae eoe orcl fe rs n o much how and crash a after correctly recover al :SDSpecifications. SSD 2: Table , 6.2 § orae § 6.3 5.6 , rt:1900MB Write: ed 2800MB Read: 2200MB Read: e.Bandwidth Seq. rt:530MB Write: rt:950MB Write: s2MB. 2 is § , ed 560MB Read: ) emil opr ihtotypes two with compare mainly We 5.2 § efr ne npaeudtswith updates in-place under perform 5.3 nobarrier ) ) orae 7 0 2 4 sajunln l system file journaling a is ] H , 2 § / 3 NVMe (2B + SSDs 3 C) NVMe / / 6.4 hw h pcfiainof specification the shows / / s orae orae / s s orae 8 4 s s s pin ihext4-OD, with option) 46 , ad OS(G span) (8GB IOPS Rand. orae Write (KB) size § USENIX Association SadH and FS .Vnlai h de- the is Vanilla ]. 5.4 ngaaten the guaranteeing in the guaranteeing in § 6.5 rn oapplica- to bring ) 63 64 32 16 rt:475K Write: 230K Write: ed 640K Read: 430K Read: ed 100k Read: rt:90k Write: , § 5.5 orae ) S tore ext4 BFS HFS-SF HFS-JF 100 60 150 150 Ordering Ordering SATAOrdering + SATA SSDs Ordering Ordering 40 40 100 100 50 20 20 50 50 0 0 0 0 0 4 Durability Durability 4 Durability 5 Durability 100 Durability 50 50

6.3 File System Evaluation

In this subsection, we evaluate the performance of POSIX file systems atop different IO stacks while varying the number and type of storage devices. Moreover, we demonstrate the effectiveness of Horae in guaranteeing the durability (§5.3). We perform allocating writes so that the file system always finds updated metadata in the journal. We set up the file system in two modes: the internal journal, which mixes data and journal blocks in a single device, and the external journal, which uses a dedicated device for the journal. The result is shown in Figure 10, and we make the following observations.

Figure 10: File System Performance under FIO Allocating Write Workload. (a) SATA, (b) NVMe, (c) SATA + SATA, (d) SATA + NVMe, (e) NVMe + NVMe. Ordering: the nobarrier option of ext4, which only guarantees a relaxed ordering of reaching the storage write buffer; fbarrier() of BarrierFS and HoraeFS. Durability: fsync(). SF: serialized FLUSH. JF: joint FLUSH.

Effect of removing flush. As shown in Figure 10(a), on the SATA SSD, HFS achieves 80% higher IOPS on average than ext4, and exhibits comparable performance to BFS. Here, the major overhead is the two FLUSHes: the former is used to control the write order of the data blocks and the commit record, and the latter ensures durability. The FLUSH of the SATA SSD exposes the raw flash programming delay (1 ms). In BFS and HFS, the former FLUSH is eliminated.

Effect of async DMA. As shown in Figure 10(b), on the NVMe SSD, HFS achieves 22% higher IOPS than ext4. Compared to the durability of HFS, the ordering guarantee of HFS can further boost IOPS by 57%. The major overhead here is shifted to the DMA transfer. Figure 8 shows the source of the improvement, and the number next to each rectangle shows the single-thread latency of each operation. Due to the separation of ordering and durability, HFS and BFS can overlap the DMA transfer with CPU processing, and thus partly hide the DMA delay. BFS does not perform well on the NVMe SSD due to the restriction of the single IO command queue, especially when we increase the number of threads.

Effect of joint flush. As shown in Figure 10(c), HFS outperforms ext4 by 88% and 90% on average in durability and ordering, respectively. Comparing HFS-JF to HFS-SF, we find the joint flush improves the overall performance by up to 70%. This is because the joint flush allows physically independent devices to perform flushing concurrently.

Effect of parallel device access. Figure 10(d-e) plots the result when we use an NVMe SSD to accelerate the journal. In ordering, HFS improves over ext4 by 150% and 76% in (d) and (e), respectively. The major contributor here is the parallel device access. In HFS, once the associated ordering metadata is processed serially by the control path, the data blocks can be transferred and processed by individual devices concurrently. While in ext4, this is done in a serialized manner.
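To make the Ordering/Durability distinction in Figure 10 concrete, the sketch below contrasts the two per-operation calls issued by the allocating-write workload. fbarrier() is the ordering primitive exported by BarrierFS/HoraeFS; since it is not a stock libc or syscall interface, the wrapper and the append_block() helper below are hypothetical stand-ins (the stub falls back to fsync() so the sketch compiles on an unmodified system), not the paper's code.

    /* Minimal sketch: ordering vs. durability calls in the allocating-write workload. */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    static int fbarrier(int fd)
    {
        /* Placeholder: on HoraeFS/BarrierFS this would only order the preceding
         * writes against later ones (no FLUSH on the data path); on a stock
         * kernel we conservatively fall back to full durability. */
        return fsync(fd);
    }

    static void append_block(int fd, off_t off, int durability)
    {
        char buf[4096];
        memset(buf, 0xab, sizeof(buf));
        (void)pwrite(fd, buf, sizeof(buf), off);  /* allocating write: journals new metadata */
        if (durability)
            fsync(fd);                            /* "Durability" bars: order and persist */
        else
            fbarrier(fd);                         /* "Ordering" bars: order only */
    }

    int main(void)
    {
        int fd = open("journal.img", O_CREAT | O_WRONLY, 0644);
        append_block(fd, 0, 1);       /* durability-style write */
        append_block(fd, 4096, 0);    /* ordering-style write */
        close(fd);
        return 0;
    }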

6.4 In-Place Update Evaluation

In this subsection, we evaluate the performance of in-place updates under different consistency guarantees and with varying sizes of the preparatory area (p-area) (§5.4).

Figure 11: In-Place Update Performance. (a) IOPS with different consistency (ordered vs. data-journaling); (b) sensitivity to p-area size (16 to 512 MB). OD: nobarrier option of ext4, fbarrier() of HFS. DR: fsync(). Test on SSD C.

As shown on the x-axis of Figure 11(a), we first set up the file system in two modes, the ordered and the data-journaling mode, representing data and version consistency, respectively. Then, we issue 10 GB of overwrites to a 10 GB file. The ordered mode performs metadata journaling. The data-journaling mode performs data journaling to achieve version consistency, i.e., the version of the data matches that of the metadata.

In the ordered mode, the double write of the eager commit write enhances data consistency at the cost of a 10% IOPS loss in durability. In ordering, HFS exhibits 50% higher IOPS compared to ext4, because HFS can emit multiple IPUs simultaneously without interleaving each IPU with a DMA transfer and a FLUSH command.

In the data-journaling mode, both file systems first put the IPU into the journal area. When performing the journal, HFS dispatches the commit record immediately after the journaled data, and thus provides 60% higher IOPS on average.

When Horae runs out of p-area space, Horae blocks incoming requests and triggers the eager commit write. To investigate the performance of Horae in such a situation, we run the same IPU workload while scaling the p-area size. The results are shown in Figure 11(b). We find that HFS-OD performs dramatically better than ext4-OD even with a small p-area. To provide request atomicity, HFS-DR delivers less IOPS than ext4-DR. As we enlarge the p-area, the IOPS gap narrows.
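The behavior measured in Figure 11(b) can be summarized by the control-flow sketch below for one in-place update (IPU) under a fixed-size p-area. This is a simplified illustration of the behavior described in the text (redirect into the p-area, or block and trigger an eager commit write when it is full), with hypothetical names, not the kernel implementation.

    /* Hypothetical sketch of IPU handling with a preparatory area (p-area). */
    #include <stddef.h>

    struct parea {
        size_t used;      /* blocks currently buffered in the p-area */
        size_t capacity;  /* total p-area size, e.g., 16-512 MB in Figure 11(b) */
    };

    static void eager_commit_write(struct parea *p)
    {
        /* read-merge-write: read the redirected blocks, merge them with the
         * home data, write them back in place, then reclaim the p-area. */
        p->used = 0;
    }

    static void submit_ipu(struct parea *p, size_t nblocks)
    {
        while (p->used + nblocks > p->capacity)
            eager_commit_write(p);   /* incoming requests block here when the p-area is full */
        p->used += nblocks;          /* otherwise redirect the IPU into the p-area (orderless write) */
    }

    int main(void)
    {
        struct parea p = { 0, 4096 };
        submit_ipu(&p, 8);
        return 0;
    }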
6.5 Crash Recovery Evaluation

To verify the consistency guarantees of Horae (§5.5), we run workloads and forcibly shut down the machine at random points. We then restart the machine and measure the recovery performance. We choose the Varmail workload of Filebench [8] for its intensive fsync(). Varmail contains two fsync()s in each flow loop, and we replace the first one with fbarrier().

We repeat the test 30 times, and observe that HoraeFS can always recover to a consistent state. The recovery time comes from two main parts: the ordering queue load time and the commit write time. First, Horae loads the pointers and the ordering metadata into host DRAM, which consumes 29.8 ms on average. Next, the commit write requires "read-merge-write", and costs 497.6 ms on average.
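The two recovery costs above (ordering-queue load, then commit write) roughly correspond to the steps sketched below. This is a schematic reconstruction from the description in this section; all names and the stubbed accessors are hypothetical, not Horae's actual recovery code.

    /* Schematic crash-recovery path suggested by the measurements above. */
    #include <stddef.h>

    struct oq_entry { int pending; };  /* one ordering-metadata record */

    /* Hypothetical stubs standing in for the real device/file-system accesses. */
    static size_t load_ordering_queue(struct oq_entry *e, size_t max) { (void)e; (void)max; return 0; }
    static void read_merge_write(struct oq_entry *e) { (void)e; }

    static void horae_recover(void)
    {
        struct oq_entry entries[1024];

        /* Step 1 (~29.8 ms in the test): pull the queue pointers and ordering
         * metadata from the persistent medium (e.g., the CMB) into host DRAM. */
        size_t n = load_ordering_queue(entries, 1024);

        /* Step 2 (~497.6 ms): for entries that were ordered but not yet applied,
         * perform the commit write, a read-merge-write of the affected blocks
         * back to their home locations. */
        for (size_t i = 0; i < n; i++)
            if (entries[i].pending)
                read_merge_write(&entries[i]);
    }

    int main(void) { horae_recover(); return 0; }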

6.6 Application Evaluation

6.6.1 MySQL

We evaluate MySQL with the OLTP-insert workload [21]. The setups are described in the caption of Figure 12.

Figure 12: MySQL under OLTP-insert Workload. (a) Sole: mix data, redo log and undo log in device B. (b) Separate redo: data and undo log to device B, redo log to device C. (c) Separate redo & undo: data to device B, redo log to device C, undo log to another device B. DR: the fsync() used to control the write order of transactions is replaced with fbarrier(). OD: all fsync()s are replaced with fbarrier()s. Sync() the database every 1 second. R0: organize the underlying devices as logical volumes using RAID 0.

In the sole configuration, HFS-DR outperforms ext4-DR and BFS-DR by 15% and 23%, respectively. In ordering, HFS-OD prevails over ext4-OD by 56% and achieves 36% higher TPS than BFS-OD. This evidences that HFS is more efficient in controlling the write order.

When using dedicated devices to store the redo and undo logs (i.e., Figure 12(b)), HFS-DR outperforms ext4-DR by 16%, and HFS-OD performs 76% better than ext4-OD. This is because HFS can parallelize the IOs to individual devices.

Comparing Figure 12(b) with (c), we find that separating undo logs does not bring much improvement in either ext4 or HFS. Undo logs perform logical logging to retain the old version of database tables, which incurs fewer writes compared to physical logging (the redo log). MySQL tightly embeds the undo logs in the table files, thus separating undo logs does not alleviate the write traffic to the data device.

Comparing HFS with HFS-R0, we observe that, from the performance aspect, manually distributing data flows to devices of particular usage is better than the automatic dispersion of logical volumes. A naive implementation of RAID 0 treats the devices equally. However, the data flows of applications usually have different write traffic and locality. Therefore, the uniform distribution potentially bounds the better devices and ruins the data locality.

6.6.2 SQLite

This subsection focuses on the performance of SQLite [19]. The detailed setups are presented in the caption of Figure 13.

Figure 13: SQLite Random Insert Performance. SQLite runs at WAL mode; 1M inserts in random key order; key size 16 bytes, value size 100 bytes. DR: the first three fdatasync()s used to control the storage order of transactions are replaced with fdatabarrier()s, but the last one remains intact. OD: all fdatasync()s are replaced with fdatabarrier()s.

On the SATA SSD, the ordering setups (OD) outperform the durability ones (DR) by an order of magnitude due to the reduction of the prohibitive FLUSH. On the NVMe SSD, as the FLUSH becomes inexpensive, ext4-OD achieves almost the same performance as ext4-DR. HFS separates the control path from the data path, and thus can order the table files and SQLite logs through fdatabarrier(); therefore, more IOs can be processed at the same time. In ordering, both BFS and HFS exhibit over 20% performance gain against ext4, due to the separation of ordering and durability that brings chances of overlapping CPU with IOs.

6.6.3 BlueStore

This subsection evaluates the transaction processing of the object store of Ceph with default options. We use the built-in object store benchmark [5] and vary the number of sequencers. Each sequencer serializes its transactions, and transactions of different sequencers do not have dependency. Each transaction issues a 20 KB write, which is split into a 16 KB aligned write to the data device and a 4 KB small write to RocksDB. Two interfaces are evaluated, queue_transaction() and apply_transaction(), representing ordering and durability guarantee, respectively. Figure 14 shows the results.

Figure 14: Object Store Performance. (a) apply_transaction, (b) queue_transaction. Store-S: mix data, metadata and WAL in device A. Store-M: data to device A, metadata to device B, WAL to device C. apply_transaction: durability guarantee. queue_transaction: ordering guarantee.

As shown in Figure 14(a), HoraeStore outperforms BlueStore by 23% and 83% on average in the S and M setup, respectively. Due to the write dependency over multiple devices, the slower data device burdens the faster metadata and WAL devices. Hence, BlueStore-M delivers similar TPS compared to BlueStore-S. In HoraeStore, as the control path guarantees the ordering, the synchronization between the aligned write and the KV transaction is alleviated. Further, HoraeStore enables more KV transactions to continuously fulfill the metadata and WAL storage.

In Figure 14(b), HoraeStore exhibits 1.4× and 2.1× gain against BlueStore in the S and M setup, respectively. To preserve order, BlueStore does not submit the small write metadata to RocksDB until the aligned write has been completed. While in HoraeStore, once the aligned write and the KV transaction have been processed serially via the control path, the associated data blocks can be processed concurrently, which brings opportunities to apply multiple transactions.

7 Discussion

CMB Alternatives. Recall that Horae persists the ordering metadata in the CMB for efficiency. Nevertheless, several off-the-shelf non-volatile media are capable of storing the ordering metadata: SSD (flash), Intel persistent memory (PM) and capacitor-backed DRAM. We locate the ordering queue in these media and measure the single-threaded throughput of ordered writes, as shown in Figure 15.

Figure 15: Comparison of The Media of The Ordering Queue. (a) Durability, (b) Ordering. flash: block IO to SSD. CMB: MMIO to SSD's memory buffer. PM: Intel Optane persistent memory. DRAM: capacitor-backed memory.

We find that storing the ordering metadata directly through the block-based interface to the SSD (i.e., HFS-flash) significantly decreases the throughput. This is because, even though the ordering metadata is 16 B, it must be padded to 4 KB, where the 4 KB synchronous PCIe transfer masks the concurrency of the translated orderless writes. When the write size increases (over 64 KB), HFS-flash gradually outperforms ext4. We also find that PM and DRAM are satisfiable alternatives of the CMB.
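The gap between HFS-CMB and HFS-flash comes down to how a roughly 16 B ordering record reaches persistence. Below is a minimal sketch of the two paths under stated assumptions; the record layout, function names, and the CMB mapping and persistence details are illustrative assumptions, not taken from the paper.

    /* Two ways to persist one small ordering record (illustrative only). */
    #include <fcntl.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    struct ord_rec { uint64_t lba; uint32_t len; uint32_t seq; };  /* ~16 B */

    /* (a) CMB path: the record is stored into a memory-mapped controller
     * buffer with a plain MMIO write; no 4 KB block command is needed. */
    static void persist_via_cmb(volatile struct ord_rec *cmb_slot, const struct ord_rec *rec)
    {
        memcpy((void *)cmb_slot, rec, sizeof(*rec));
        __sync_synchronize();  /* order the store; real CMB persistence semantics are device-specific */
    }

    /* (b) Flash path: the same 16 B must be padded to a 4 KB block and issued
     * as a synchronous write, which is what drags HFS-flash down for small
     * writes in Figure 15. */
    static void persist_via_block(int fd, off_t off, const struct ord_rec *rec)
    {
        char block[4096] = {0};
        memcpy(block, rec, sizeof(*rec));
        (void)pwrite(fd, block, sizeof(block), off);
        fdatasync(fd);
    }

    int main(void)
    {
        static struct ord_rec cmb_area[64];          /* stand-in for the mapped CMB region */
        struct ord_rec rec = { 123, 4096, 1 };
        persist_via_cmb(&cmb_area[0], &rec);
        int fd = open("meta.img", O_CREAT | O_WRONLY, 0644);
        persist_via_block(fd, 0, &rec);
        close(fd);
        return 0;
    }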
8 Related Work

Storage order. Many researchers [25–28, 40, 46] have studied and mitigated the overhead of storage order guarantee. Among these studies, the closest ones are OptFS [26] and BarrierFS [46], which separate the ordering guarantee from the durability guarantee and provide a similar ordering primitive. OptFS, BarrierFS and HoraeFS are proposed under different storage technologies (HDD, SATA SSD, NVMe SSD and storage arrays), thereby mainly differing in the architecture, the scope of application and the hardware requirements. First, OptFS embeds the transaction checksum in the journal commit block and detects ordering violations during recovery, which reduces the rotational disk FLUSH overhead at runtime. BarrierFS preserves the order layer by layer, thereby aligning with the single-queue structure of the SCSI stack and devices. They are difficult to extend to multiple queues or multiple devices. In contrast, Horae stores the ordering metadata via a dedicated control path to maintain the write order. This design aims to let the ordering bypass the traditional stack to enable high throughput and easy scaling to multiple devices. Second, the checksum-based ordering approach of OptFS is limited to a continuous address space (e.g., file system journaling), because the checksum can only be used to detect ordering violations of data blocks in pre-determined locations. Alternatively, Horae builds a more generic ordering layer which can spread data blocks to arbitrary logical locations of any device. Third, OptFS requires the disk to support asynchronous durability notification. BarrierIO requires a barrier-compliant storage device, which is only available in a few eMMC (embedded multimedia card) products. Horae can run on standard NVMe devices with an exposed CMB. The CMB feature is already defined in the NVMe spec, and is under increasing promotion and recognition by the NVMe and SPDK communities [6].

Storage IO stack. A school of works [30, 31, 34, 35, 37, 40–43, 49, 51] improve the storage IO stack. Xsyncfs [40] uses output-triggered commits to persist a data block only when the result needs to be externally visible. IceFS [37] allocates separate journals for each container for isolation. SpanFS [31] partitions the file system at domain granularity for performance scalability. Built atop F2FS [34], ParaFS [51] co-designs the file system with the SSD's FTL to bridge the semantic gap between the hardware and the software. iJournaling [41] designs fine-grained journaling for each file, and thus mitigates the interference between fsync() threads. CCFS [42] provides a similar stream abstraction at the file system level for applications to implement correct crash consistency. Its stream is designed on individual journals and still relies on exclusive IO processing to preserve the order. Horae exports streams at the block level via the dedicated control path, relying on neither exclusive IO processing nor the journal. TxFS [30] leverages the file system journaling to provide a transactional interface. Son et al. [43] propose a high-performance and parallel journaling scheme. FlashShare [49] punches through the IO stack to the device firmware to optimize the latency for ultra-low latency SSDs. AsyncIO [35] overlaps the CPU execution with IO processing, so as to reduce the fsync() latency. CoinPurse leverages the byte interface and device-assisted logic to expedite non-aligned writes [47]. However, these works still rely on exclusive IO processing to control the internal order (e.g., the order between data blocks and metadata blocks) and the external order (e.g., the order of applications' data).

NoFS [27] introduces backpointer-based consistency to remove the ordering point between two dependent data blocks. Due to the lack of ordering updates, NoFS does not support atomic operations (e.g., rename()).

Dependency tracking. Some works use dependency tracking techniques to handle storage order. Soft updates [39] directly tracks the dependencies of the file system structures on a per-pointer basis. Similarly, Featherstitch [28] introduces the patch to specify how a range of bytes should be changed. Horae also tracks the write dependencies in the ordering queue. The tracking unit of Horae is different from prior works; each entry in Horae describes how a range of data blocks (e.g., 4 KB) should be ordered. The block-aligned tracking introduces less complexity in both dependency tracking and file system modifications, and it is more generic in the context of block devices. In addition, due to the inability to tell data versions, soft updates does not support version consistency. Featherstitch assumes a single in-progress write to the same block address, and treats dependency loops as errors. Thus, the in-place updates of Featherstitch must wait for the completion of the preceding one. Instead, Horae saves the in-place updates from the long DMA transfer through write redirection, with enhanced consistency.

9 Conclusion

In this paper, we revisit the write dependency issue on high-performance storage and storage arrays. Through a performance study, we notice that with the growth of the performance of storage arrays, the performance loss induced by the write dependency becomes more severe. The classic IO stack is not efficient in resolving this issue. We thus propose a new IO stack called Horae. Horae separates the ordering control from the data flow, and uses a range of techniques to ensure both high performance and strong consistency. Evaluations show that Horae outperforms existing IO stacks.

10 Acknowledgement

We sincerely thank our shepherd Vijay Chidambaram and the anonymous reviewers for their valuable feedback. We also thank Qing Wang and Zhe Yang for the discussion on this work. This work is supported by the National Key Research & Development Program of China (Grant No. 2018YFB1003301) and the National Natural Science Foundation of China (Grant No. 61832011, 61772300).

References

[1] A Persistent Key-Value Store for Fast Storage. https://rocksdb.org/.
[2] A way to do atomic writes. https://lwn.net/Articles/789600/.
[3] An async IO implementation for Linux. https://elixir.bootlin.com/linux/v4.18.20/source/fs/aio.c.
[4] BFQ (Budget Fair Queueing). https://www.kernel.org/doc/html/latest/block/bfq-iosched.html.
[5] Ceph Objectstore benchmark. https://github.com/ceph/ceph/blob/master/src/test/objectstore_bench.cc.
[6] Enabling the NVMe CMB and PMR Ecosystem. https://nvmexpress.org/wp-content/uploads/Session-2-Enabling-the-NVMe-CMB-and-PMR-Ecosystem-Eideticom-and-Mell....pdf.
[7] ext4 Data Structures and Algorithms. https://www.kernel.org/doc/html/latest/filesystems/ext4/index.html.
[8] Filebench - A Model Based File System Workload Generator. https://github.com/filebench/filebench.
[9] fio - Flexible I/O tester. https://fio.readthedocs.io/en/latest/fio_doc.html.
[10] Intel Solid State Drive 750 Series Datasheet. https://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-750-spec.pdf.
[16] NVMe SSD with Persistent Memory Region. https://www.flashmemorysummit.com/English/Collaterals/Proceedings/2017/20170810_FM31_Chanda.pdf.
[17] PCI Express Base Specification Revision 3.0. http://www.lttconn.com/res/lttconn/pdres/201402/20140218105502619.pdf.
[18] Product Brief: Intel Optane SSD DC D4800X Series. https://www.intel.com/content/www/us/en/products/docs/memory-storage/solid-state-drives/data-center-ssds/optane-ssd-dc-d4800x-series-brief.html.
[19] SQLite. https://www.sqlite.org/index.html.
[20] Starblaze OC SSD. http://www.starblaze-tech.com/en/lists/content/id/137.html.
[21] SysBench manual. https://imysql.com/wp-content/uploads/2014/10/sysbench-manual.pdf.
[22] Embedded Multimedia Card. http://www.konkurel.ru/delson/pdf/D93C16GM525(3).pdf, 2018.
[23] A. Aghayev, S. Weil, M. Kuchnik, M. Nelson, G. R. Ganger, and G. Amvrosiadis. File systems unfit as distributed storage backends: Lessons from 10 years of Ceph evolution. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP '19), pages 353–369, 2019.
[24] D.-H. Bae, I. Jo, Y. A. Choi, J.-Y. Hwang, S. Cho, D.-G. Lee, and J. Jeong. 2B-SSD: The case for dual, byte- and block-addressable solid-state drives. In Proceedings of the 45th Annual International Symposium on Computer Architecture (ISCA '18), pages 425–438, 2018.

[11] Intel SSD 545s Series. https://www.intel.com/content/www/us/en/products/memory-storage/solid-state-drives/consumer-ssds/5-series/ssd-545s-series/545s-256gb-2-5inch-6gbps-3d2.html.
[12] Intel SSD DC P3700 Series. https://ark.intel.com/content/www/us/en/ark/products/79621/intel-ssd-dc-p3700-series-2-0tb-2-5in-pcie-3-0-20nm-mlc.html.
[13] MySQL reference manual. https://dev.mysql.com/doc/refman/8.0/en/.
[14] NoLoad U.2 Computational Storage Processor. https://www.eideticom.com/uploads/attachments/2019/07/31/noload_csp_u2_product_brief.pdf.
[15] NVMe specifications. https://nvmexpress.org/resources/specifications/.
[25] Y.-S. Chang and R.-S. Liu. OPTR: Order-preserving translation and recovery design for SSDs with a standard block device interface. In Proceedings of the 2019 USENIX Annual Technical Conference (USENIX ATC '19), pages 1009–1023, 2019.
[26] V. Chidambaram, T. S. Pillai, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Optimistic crash consistency. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP '13), pages 228–243, 2013.
[27] V. Chidambaram, T. Sharma, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Consistency without ordering. In Proceedings of the 10th USENIX Conference on File and Storage Technologies (FAST '12), page 9, 2012.
[28] C. Frost, M. Mammarella, E. Kohler, A. de los Reyes, S. Hovsepian, A. Matsuoka, and L. Zhang. Generalized file system dependencies. In Proceedings of the Twenty-First ACM SIGOPS Symposium on Operating Systems Principles (SOSP '07), pages 307–320, 2007.

[29] M. Hedayati, K. Shen, M. L. Scott, and M. Marty. Multi-queue fair queuing. In 2019 USENIX Annual Technical Conference (USENIX ATC '19), pages 301–314, 2019.
[30] Y. Hu, Z. Zhu, I. Neal, Y. Kwon, T. Cheng, V. Chidambaram, and E. Witchel. TxFS: Leveraging file-system crash consistency to provide ACID transactions. In 2018 USENIX Annual Technical Conference (USENIX ATC '18), pages 879–891, 2018.
[31] J. Kang, B. Zhang, T. Wo, W. Yu, L. Du, S. Ma, and J. Huai. SpanFS: A scalable file system on fast storage devices. In Proceedings of the 2015 USENIX Annual Technical Conference (USENIX ATC '15), pages 249–261, 2015.
[32] W.-H. Kang, S.-W. Lee, B. Moon, Y.-S. Kee, and M. Oh. Durable write cache in flash memory SSD for relational and NoSQL databases. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD '14), pages 529–540, 2014.
[33] N. Kirsch. Phison E12 High-Performance SSD Controller. https://www.legitreviews.com/sneak-peek-phison-e12-high-performance-ssd-controller_206361, 2018.
[34] C. Lee, D. Sim, J.-Y. Hwang, and S. Cho. F2FS: A new file system for flash storage. In Proceedings of the 13th USENIX Conference on File and Storage Technologies (FAST '15), pages 273–286, 2015.
[35] G. Lee, S. Shin, W. Song, T. J. Ham, J. W. Lee, and J. Jeong. Asynchronous I/O stack: A low-latency kernel I/O stack for ultra-low latency SSDs. In 2019 USENIX Annual Technical Conference (USENIX ATC '19), pages 603–616, 2019.
[36] B. Lepers, O. Balmau, K. Gupta, and W. Zwaenepoel. KVell: The design and implementation of a fast persistent key-value store. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP '19), pages 447–461, 2019.
[37] L. Lu, Y. Zhang, T. Do, S. Al-Kiswany, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Physical disentanglement in a container-based file system. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI '14), pages 81–96, 2014.
[38] Marvell. Marvell 88SS1093 Controller. https://www.marvell.com/content/dam/marvell/en/public-collateral/storage/marvell-storage-88ss1093-product-brief-2017-03.pdf, 2017.
[39] M. K. McKusick and G. R. Ganger. Soft updates: A technique for eliminating most synchronous writes in the fast filesystem. In Proceedings of the Annual USENIX Technical Conference (ATEC '99), page 24, 1999.
[40] E. B. Nightingale, K. Veeraraghavan, P. M. Chen, and J. Flinn. Rethink the sync. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI '06), pages 1–14, 2006.
[41] D. Park and D. Shin. iJournaling: Fine-grained journaling for improving the latency of fsync system call. In Proceedings of the 2017 USENIX Annual Technical Conference (USENIX ATC '17), pages 787–798, 2017.
[42] T. S. Pillai, R. Alagappan, L. Lu, V. Chidambaram, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Application crash consistency and performance with CCFS. In 15th USENIX Conference on File and Storage Technologies (FAST '17), pages 181–196, 2017.
[43] Y. Son, S. Kim, H. Y. Yeom, and H. Han. High-performance transaction processing in journaling file systems. In Proceedings of the 16th USENIX Conference on File and Storage Technologies (FAST '18), pages 227–240, 2018.
[44] S. C. Tweedie. Journaling the Linux ext2fs filesystem. In LinuxExpo '98: Proceedings of the 4th Annual Linux Expo, 1998.
[45] S. Weil, S. Brandt, E. Miller, D. Long, and C. Maltzahn. Ceph: A scalable, high-performance distributed file system. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI '06), pages 307–320, 2006.
[46] Y. Won, J. Jung, G. Choi, J. Oh, S. Son, J. Hwang, and S. Cho. Barrier-enabled IO stack for flash storage. In Proceedings of the 16th USENIX Conference on File and Storage Technologies (FAST '18), pages 211–226, 2018.
[47] Z. Yang, Y. Lu, E. Xu, and J. Shu. CoinPurse: A device-assisted file system with dual interfaces. In 2020 57th ACM/IEEE Design Automation Conference (DAC), pages 1–6, 2020.

[48] J. Yeon, M. Jeong, S. Lee, and E. Lee. RFLUSH: Rethink the flush. In Proceedings of the 16th USENIX Conference on File and Storage Technologies (FAST '18), pages 201–209, 2018.
[49] J. Zhang, M. Kwon, D. Gouk, S. Koh, C. Lee, M. Alian, M. Chun, M. T. Kandemir, N. S. Kim, J. Kim, and M. Jung. FlashShare: Punching through server storage stack from kernel to firmware for ultra-low latency SSDs. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI '18), pages 477–492, 2018.
[50] J. Zhang, M. Kwon, M. Swift, and M. Jung. Scalable parallel flash firmware for many-core architectures. In 18th USENIX Conference on File and Storage Technologies (FAST '20), pages 121–136, 2020.
[51] J. Zhang, J. Shu, and Y. Lu. ParaFS: A log-structured file system to exploit the internal parallelism of flash devices. In 2016 USENIX Annual Technical Conference (USENIX ATC '16), pages 87–100, 2016.
