arXiv:1711.02258v1 [cs.OS] 7 Nov 2017

Barrier Enabled IO Stack for Flash Storage

Youjip Won (Hanyang University), Jaemin Jung (Texas A&M University), Gyeongyeol Choi (Hanyang University), Joontaek Oh (Hanyang University), Seongbae Son (Hanyang University), Jooyoung Hwang (Samsung Electronics), Sangyeun Cho (Samsung Electronics)

Abstract

This work is dedicated to eliminating the overhead of guaranteeing the storage order in the modern IO stack. The existing block device adopts a prohibitively expensive resort in ensuring the storage order among write requests: interleaving successive write requests with transfer and flush. Exploiting the cache barrier command for the Flash storage, we overhaul the IO scheduler, the dispatch module and the filesystem so that these layers are orchestrated to preserve the ordering condition imposed by the application till they reach the storage surface. Key ingredients of the Barrier Enabled IO stack are Epoch based IO scheduling, Order Preserving Dispatch, and Dual Mode Journaling. The Barrier Enabled IO stack successfully eliminates the root cause of the excessive overhead in enforcing the storage order. Dual Mode Journaling in BarrierFS dedicates separate threads to effectively decouple the control plane and the data plane of the journal commit. We implement the Barrier Enabled IO Stack in server as well as in mobile platform. SQLite performance increases by 270% and 75%, in server and in smartphone, respectively. Relaxing the durability of a transaction, SQLite and MySQL performance increases as much as by 73× and by 43×, respectively, in server storage.

1 Motivation

Modern IO stack is a collection of arbitration layers; IO scheduler, command queue manager, and writeback cache manager. Despite the compound uncertainties from the multiple layers of arbitration, it is essential for the software writers to ensure the order in which the data blocks are reflected to the storage surface, storage order, e.g. in guaranteeing the durability and the atomicity of a database transaction [46, 26, 35], in filesystem journaling [65, 40, 64, 4], in soft-update [41, 61], or in copy-on-write or log-structure filesystems [59, 35, 58, 31]. Preserving the ordering requirement across the layers of the arbitration is being achieved by an extremely expensive resort; dispatching the following request only after the data block associated with the preceding request is completely transferred and is made durable. We call this the transfer-and-flush mechanism. For decades, interleaving the writes with transfer-and-flush has been the fundamental principle to guarantee a storage order in a set of requests [23, 15].

The concurrency and the parallelism in the Flash storage, e.g. multi-channel/way controller [70, 6], large size storage cache [47], and deep command queue [18, 27, 69] have brought phenomenal performance improvement. State of the art NVMe SSD reportedly exhibits up to 750 KIOPS random read performance [69], which is nearly 4,000× of HDD's performance. On the other hand, the time to program a Flash cell has barely improved if it has not deteriorated [21]. This is due to the adoption of the finer process (sub 10 nm) [24, 36], and the higher multi-bits per cell (MLC, TLC, and QLC) [5, 10] in the endless quest for the higher storage density [42]. Despite the splendid performance improvement of the Flash storage claimed by the storage vendors, the service providers have difficulty in fully utilizing the underlying high performance storage.

Fig. 1 alerts us to an important trend. We examine the performance of write with ordering guarantee (write() followed by fdatasync()) against the one without ordering guarantee (write()). We test seven Flash storages with different degrees of parallelism. In a single channel mobile storage for smartphone (SSD A), the performance of ordered write is 20% of that of the buffered write. In a thirty-two channel Flash array (SSD G), this ratio decreases to 1%. In SSD with supercap (SSD E), the ordered write performance is 25% of that of the buffered write. There are two important observations. First, the overhead of transfer-and-flush becomes severe as the degree of parallelism increases. Second, use of Power-Loss Protection (PLP) hardware fails to eliminate the transfer-and-flush overhead. The overhead is going to get worse as the Flash storage employs higher degree of parallelism and denser Flash device.

Figure 1: Ordered write() vs. Orderless write(). HDD; A: mobile/eMMC5.0, B: mobile/UFS2.0, C: server/SATA3.0, D: server/NVMe, E: server/SATA3.0 (supercap), F: server/PCIe, G: Flash array. (x-axis: Buffered IO (IOPS ×10^3); y-axis: Ordered IO / Buffered IO (%).)

A fair amount of work has been dedicated to addressing the overhead of storage order guarantee. The techniques deployed in the production platforms include non-volatile writeback cache at the Flash storage [22], no-barrier mount option at the EXT4 filesystem [14], or transactional checksum [55, 32, 62]. Efforts such as transactional write at the filesystem [49, 17, 53, 35, 66] and transactional block device [30, 71, 43, 67, 51] save the application from the overhead of enforcing the storage order associated with filesystem journaling. A school of works addresses more fundamental aspects in controlling the storage order such as separating the ordering guarantee from durability guarantee [8], providing a programming model to define the ordering dependency among the set of writes [19], and persisting a data block only when the result needs to be externally visible [48]. These works share the same essential principle in controlling the storage order; transfer-and-flush. For example, OptFS [8] checkpoints the data blocks only after the associated journal transaction becomes durable. Featherstitch [19] realizes the ordering dependency between the patchgroups via interleaving them with transfer-and-flush.

In this work, we revisit the issue of eliminating the transfer-and-flush overhead in modern IO stack. We aim at developing an IO stack where the host can dispatch the following command before the data blocks associated with the preceding command become durable and before the preceding command is serviced, and yet the host can enforce the storage order between them.

We develop a Barrier Enabled IO stack which effectively addresses our design objective. Barrier enabled IO stack consists of the cache barrier-aware storage device, the order preserving block device layer and the barrier enabled filesystem. Barrier enabled IO stack is built upon the foundation that the host can control a certain partial order in which the cache contents are flushed, persist order. Different from rotating media, the host can enforce a persist order without the risk of getting anomalous delay in the Flash storage. With reasonable complexity, the storage controller can be made to flush the cache contents satisfying a certain ordering condition from the host [30, 56, 39]. The mobile Flash storage standards already define the "cache barrier" command [28] which precisely serves this purpose. For order preserving block device layer, the command dispatch mechanism and the IO scheduler of the block device layer are overhauled so that they can preserve partial order in the incoming sequence of the requests in scheduling them. For barrier enabled filesystem, we define new interfaces, fbarrier() and fdatabarrier(), to exploit the nature of order preserving block device layer. The fbarrier() and the fdatabarrier() system calls are the ordering guarantee only counterpart of fsync() and fdatasync(), respectively. fbarrier() shares the same semantics as osync() of OptFS [8]; it writes the dirty pages, triggers filesystem journal commit and returns without persisting them. fdatabarrier() ensures the storage order between its preceding writes and the following writes without flushing the writeback cache in between and without waiting for DMA completion of the preceding writes. It is a storage version of the memory barrier, e.g. mfence [52]. OptFS does not provide an equivalent to fdatabarrier(). The order-preserving block device layer is filesystem-agnostic. We can implement fbarrier() and fdatabarrier() in any filesystem. We modify EXT4 to support fbarrier() and fdatabarrier()¹. We only present our result of EXT4 filesystem due to the space limit. We modify the journaling module of EXT4 and develop Dual Mode journaling for order preserving block device. We call the modified version of EXT4 the BarrierFS.

¹ The source codes are currently unavailable to public to abide by the double blind rule of the submission. We plan to open-source it shortly.

Barrier Enabled IO stack not only removes the flush overhead but also the transfer overhead in enforcing the storage order. While a large body of the preceding works successfully eliminates the flush overhead, few works dealt with the overhead of DMA transfer in storage order guarantee. The benefits of Barrier Enabled IO stack include the following;

• The application can control the storage order virtually without any overheads; without being blocked or without stalling the queue.
• The latency of a journal commit decreases significantly. The journaling module can enforce the storage order between the journal logs and the journal commit mark without interleaving them with flush and without interleaving them with DMA transfer.
• Throughput of the filesystem journaling improves significantly. Dual Mode journaling commits multiple transactions concurrently and yet can guarantee the durability of the individual journal commit.

Eliminating all the inefficiencies, the host now can successfully exploit the concurrency and the parallelism in the underlying storage satisfying all ordering constraints. Relaxing the durability of a transaction, SQLite performance and MySQL performance increase as much as by 73× and by 43×, respectively, in server storage.

The rest of the paper is organized as follows. Section 2 introduces the background. Section 3, section 4, and section 5 explain the block device layer, the filesystem layer, and the application of Barrier Enabled IO stack, respectively. Section 6 and section 7 discuss the result of the experiment and survey the related works, respectively. Section 8 concludes the paper.

2 Background

2.1 Orders in IO stack

A write request travels a complicated route until the associated data blocks reach the storage surface. The filesystem puts the request to the IO scheduler queue. The block device driver removes one or more requests from the queue and constructs a command. It probes the device and dispatches the command if the device is available. The device is available if the command queue at the storage device is not full. Arriving at the storage device, the command is inserted into the command queue. The storage controller removes the command from the command queue and services it, i.e. transfers the data block between the host and the storage. When the transfer finishes, the device sends the completion signal to the host. The contents of the writeback cache are committed to storage surface either periodically or by an explicit request from the host.

We define four types of orders in the IO stack; Issue Order, I, Dispatch Order, D, Transfer Order, C, and Persist Order, P. The issue order I = {i1, i2, ..., in} is a set of write requests issued by the application or by the file system. The subscript denotes the order in which the requests enter the IO scheduler. The dispatch order D = {d1, d2, ..., dn} denotes a set of the write requests which are dispatched to the storage device. The subscript denotes the order in which the requests leave the IO scheduler. Transfer order, C = {c1, c2, ..., cn}, is the set of transfer completions. Persist Order P = {p1, p2, ..., pn} is a set of operations which make the associated data blocks durable. Fig. 2 schematically illustrates the layers and the associated orders in the IO stack. We say a certain partial order is preserved if the relative position of the requests against a certain designated request, barrier, is preserved. We use the notation '=' to denote that a certain partial order is preserved. We briefly summarize the source of arbitration at each layer.

Figure 2: Set of queues in the IO stack: the sources of arbitration. (Host: IO Scheduler, Dispatch Queue; Storage: Command Queue, Writeback Cache, Flash; orders I, D, C, P.)

• I ≠ D. IO scheduler reorders and coalesces the IO requests subject to their optimization criteria, e.g. CFQ, DEADLINE, etc. When there is no scheduling mechanism, e.g. NO-OP scheduler [3] or NVMe [12] interface, the dispatch order may be equal to the issue order.
• D ≠ C. Storage controller freely schedules the commands in its command queue. Also, the data blocks can be transferred out of order due to the errors, time-out and retry.
• C ≠ P. The cache replacement algorithm, mapping table update algorithm, and storage controller's policy to schedule Flash operations govern the persist order independent of the order in which the data blocks are transferred.

Due to all these sources of arbitration, the modern IO stack is said to be orderless [7].

2.2 Transfer-and-Flush

Enforcing a storage order corresponds to preserving a partial order between issue order I and persist order P, i.e. satisfying the condition I = P. It is equivalent to collectively enforcing the individual ordering constraints between the layers;

(I = P) ≡ (I = D) ∧ (D = C) ∧ (C = P)    (1)

Modern IO stack has evolved under the assumption that the host cannot control the persist order, i.e. C ≠ P. Persist order specifically denotes the order in which the contents in the writeback cache are persisted whereas storage order denotes an order in which the write requests from the filesystem are persisted. For rotating media such as hard disk drives, the disk scheduling is entirely left to the storage device due to its complicated sector geometry hidden from outside [20]. Blindly enforcing a certain persist order may bring unexpected delay in IO service. Inability to control the persist order, C ≠ P, is a fundamental limitation of the modern IO stack, which makes the condition I = P in Eq. 1 unsatisfiable.

To circumvent this limitation in satisfying a storage order, the host takes the indirect and expensive resort to satisfy each component in Eq. 1. First, after dispatching the write command to the storage device, the caller is blocked until the associated DMA transfer completes, Wait-on-Transfer. This is to prohibit the storage controller from servicing the commands in out-of-order manner and to satisfy the transfer order, D = C. This may stall the command queue. When the DMA transfer completes, the caller issues the flush command and blocks again waiting for its completion. When the flush returns, the caller wakes up and issues the following command; Wait-on-Flush. These two are used in tandem, leaving the caller under a number of context switches. Transfer-and-flush is the unfortunate sole resort in enforcing the storage order in a modern orderless IO stack.

2.3 Analysis: fsync() in EXT4

We examine how the EXT4 filesystem controls the storage order among the data blocks, journal descriptor, journal logs and journal commit block in fsync() in Ordered mode journaling. In Ordered mode, EXT4 ensures that data blocks are persisted before the associated journal transaction is.

Fig. 3 illustrates the behavior of an fsync(). The application dispatches the write requests for the dirty pages, D. After dispatching the write requests, the application blocks and waits for the completion of the associated DMA transfer. When the DMA transfer completes, the application thread resumes and triggers the JBD thread to commit the journal transaction. After triggering the JBD thread, the application thread sleeps again. When the JBD thread makes the journal transaction durable, the fsync() returns, waking up the caller. The JBD thread should be triggered only after D is completely transferred. Otherwise, the storage controller may service the write requests for D, JD and JC in out-of-order manner and the storage controller may persist the journal transaction prematurely before D reaches the writeback cache. If this happens, the filesystem can be recovered incorrectly in case of an unexpected system failure.

Figure 3: DMA, flush and context switches in fsync(). (Rows: File System, Journal, Block Layer, Block Device; timeline from fsync() start to fsync() return: D, JD, Flush, JC, Flush.)

A journal transaction consists of the journal descriptor block, one or more log blocks and the journal commit block. A transaction is usually written to the storage with two requests: one for writing the coalesced chunk of the journal descriptor block and the log blocks and the other for writing the commit block. In the rest of the paper, we will use JD and JC to denote the coalesced chunk of the journal descriptor and the log blocks, and the commit block, respectively. JBD needs to enforce the storage order in two situations. JD needs to be made durable before JC. The journal transactions need to be made durable in the order in which they have been committed. When any of the two conditions is violated, the file system may recover incorrectly in case of unexpected system failure [65, 8]. JBD interleaves the write request for JD and the write request for JC with transfer-and-flush. To control the storage order between the transactions, the JBD thread waits for JC to become durable before it starts committing the next journal transaction.

An fsync() can be represented as a tandem of Wait-on-Transfer and Wait-on-Flush as in Eq. 2. D, JD and JC denote the write request for D, JD and JC, respectively. 'xfer' and 'flush' denote wait-for-transfer and wait-for-flush, respectively.

\[ D \to \mathrm{xfer} \to JD \to \mathrm{xfer} \to \underbrace{\mathrm{flush} \to JC \to \mathrm{xfer} \to \mathrm{flush}}_{\mathrm{FLUSH|FUA}} \tag{2} \]

In early days, the block device layer was responsible for issuing the flush and for waiting for its completion [63, 25]. This approach blocks not only the caller but all the other requests which share the same dispatch queue [14]. Since Linux 2.6.37 kernel, this role has been migrated from the block device layer to the filesystem layer [15]. The filesystem uses the flush option (REQ_FLUSH) and the force-unit-access option (REQ_FUA) in writing JC and the filesystem blocks until it completes. With FLUSH option, the storage device flushes the writeback cache before servicing the command. With FUA option, the storage controller writes a given block directly to the storage surface. The last four steps in Eq. 2 can be compressed into a write request with FLUSH|FUA option. When the filesystem is responsible for waiting for the completion of flush, the other commands in the dispatch queue can progress after JC with FLUSH|FUA is dispatched. In both approaches, the caller is subject to transfer-and-flush overhead to interleave JD and JC.

3 Order Preserving Block Device Layer

3.1 Design

We overhaul the IO scheduler, the dispatch module and the write command to satisfy each of three conditions, I = D, D = C, and C = P, respectively.
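For reference while reading the design, the legacy fsync() sequence of Eq. 2 that these changes target can be written out step by step. The sketch below is a toy model, not kernel code: the function names and step labels are our own, and the count it produces is the number of modelled blocking points, not a measured number of context switches.

```python
def legacy_fsync_steps(use_flush_fua=False):
    """Steps of fsync() in EXT4 Ordered mode, following Eq. 2."""
    steps = [
        ("write", "D"), ("wait", "xfer"),    # data blocks, Wait-on-Transfer
        ("write", "JD"), ("wait", "xfer"),   # journal descriptor and log blocks
    ]
    if use_flush_fua:
        # The last four steps of Eq. 2 folded into one write with FLUSH|FUA;
        # the filesystem still blocks until that write completes.
        steps += [("write", "JC FLUSH|FUA"), ("wait", "completion")]
    else:
        steps += [("wait", "flush"), ("write", "JC"),
                  ("wait", "xfer"), ("wait", "flush")]
    return steps


def count_waits(steps):
    """Number of points at which the caller blocks in the model."""
    return sum(1 for kind, _ in steps if kind == "wait")


print(count_waits(legacy_fsync_steps()))                    # explicit flush steps
print(count_waits(legacy_fsync_steps(use_flush_fua=True)))  # FLUSH|FUA variant
```

Either variant keeps at least one wait between JD and JC, which is the transfer-and-flush overhead described at the end of Section 2.3.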

In the legacy IO stack, the host has been entirely responsible for controlling the storage order; the host postpones sending the following command until it ensures that the result of the preceding command is made durable. In Barrier enabled IO stack, the host and the storage device share the responsibility. The host side block device layer is responsible for dispatching the commands in order. The host and the storage device collaborate with each other to transfer the data blocks (or to service the commands, equivalently) in order. The way in which the host and the storage device collaborate with each other will be detailed shortly. The storage device is responsible for making them durable in order. This effective orchestration between the host and the storage device saves the IO stack from the overhead of transfer-and-flush based storage order guarantee. Fig. 4 illustrates the organization of Barrier Enabled IO stack.

Figure 4: Organization of the Barrier Enabled IO stack. (File System: BarrierFS with Dual Mode Journaling, fbarrier(), fdatabarrier(); Block Layer: Order Preserving Dispatch, Epoch Based IO Scheduler; WRITE with BARRIER flag; Barrier Compliant Storage Device.)

The order preserving block device layer is responsible for dispatching the commands in order and for having them serviced in order. The IO scheduler and the command dispatch module are redesigned to preserve the order. Order preserving block device layer defines two types of write requests: orderless and order-preserving. There exists a special type of order-preserving request called barrier. We introduce two new attributes REQ_ORDERED and REQ_BARRIER for the order-preserving request and the barrier request, respectively. We call a set of order-preserving write requests which can be reordered with each other an epoch [13]. A barrier request is used to delimit an epoch.

3.2 barrier write, the command

The "cache barrier", or "barrier" for short, command is defined in the standard command set for mobile Flash storage [28]. When the storage controller receives the barrier command, the controller guarantees that the data blocks transferred following the barrier command reach the storage surface after the data blocks transferred before the barrier command do, without flushing the cache in between. A few eMMC products in the market support cache barrier command [1, 2]. Via barrier command, the IO stack can satisfy the persist order without cache flush. The essential condition C = P in ensuring the storage order can now be satisfied with the barrier command.

We start our effort with devising a more efficient barrier write command. Implementing a barrier as a separate command occupies one entry in the command queue and costs the host the latency of dispatching a command. To avoid this overhead, we define a barrier as a command flag, REQ_BARRIER, to the write command as in the case of REQ_FUA or REQ_FLUSH. In our implementation, we designate one unused bit in the SCSI command as a barrier flag.

We discuss the implementation aspect of a barrier command. It is a matter of how the storage controller can enforce the persist order imposed by the barrier command. When the Flash storage device has Power Loss Protection (PLP) feature, e.g. supercapacitor, supporting a barrier command is trivial. Thanks to PLP, the writeback cache contents are always guaranteed to be durable. The storage controller can flush the writeback cache in any order fully utilizing its parallelism and yet can guarantee the persist order. There is no performance overhead in enforcing the persist order.

For the devices without PLP, the barrier command can be supported in three ways; in-order write-back, transactional write-back or in-order recovery from crash. In in-order write-back, the storage controller flushes data blocks in epoch basis and inserts some delay in between if necessary. It may fail to fully exploit the underlying parallelism in the storage controller. In transactional write-back, the storage controller flushes the writeback cache contents as a single atomic unit [56, 39]. Since all epochs in the writeback cache are flushed together, the constraint imposed by the barrier command is well satisfied. The performance overhead of transactional flush is 12% in worst case with a traditional commit approach but can be eliminated by maintaining the next page pointer at the spare area of the Flash page [56].

The in-order recovery method guarantees the persist order imposed by the barrier command through the crash recovery routine. When multiple controller cores concurrently write the data blocks to multiple channels, one may have to use a sophisticated crash recovery protocol such as the ARIES protocol [45] to recover the storage to a consistent state. If the entire Flash storage is treated as a single log device, we can use the simple crash recovery algorithm used in LFS [59]. Since the persist order is enforced by the crash recovery logic, the controller is able to flush the writeback cache as if there is no ordering dependency. The controller is saved from performance penalty at the cost of complexity in the recovery routine.

We implement the cache barrier command in a UFS device, which is a commercial product used in the smartphone. We use a simple LFS style recovery routine. The UFS controller treats the entire storage as a single log structured device and maintains an active segment in memory. FTL appends incoming data blocks to the active segment in the order in which they are transferred. It naturally satisfies the ordering constraints between the epochs. When an active segment becomes full, it is striped across the multiple Flash chips in log-structured manner. In crash recovery, the UFS controller locates the beginning of the most recently flushed segment. It scans the pages in the segment from the beginning till it first encounters the page which has not been programmed properly. The storage controller discards the rest of the pages including the incomplete one.

Developing a sophisticated barrier-aware SSD controller is subject to a number of design choices and should be dealt with in detail in a separate context. Through this work, we demonstrate that the performance benefit of using the cache barrier command justifies the complexity of implementing it if the host side IO stack can properly exploit it.

3.3 Epoch Based IO scheduling

There are three scheduling principles in Epoch based IO scheduling. First, it preserves the partial order between the epochs. Second, the requests within an epoch can be freely scheduled with each other. Third, the orderless requests can be scheduled freely across the epochs. It satisfies the I = D condition.

The Epoch Based IO scheduler uses an existing IO scheduler, e.g. CFQ, NO-OP, etc., to schedule the IO requests within an epoch. The key ingredient of the Order Preserving IO scheduler is Epoch based barrier reassignment. When the IO request enters the scheduler queue, the order preserving IO scheduler examines if it is a barrier request. If the request is not a barrier request, it is inserted as a normal request. If the request is a barrier write request, the IO scheduler removes the barrier flag from the request and inserts it to the queue. After the scheduler inserts a barrier write, the scheduler stops accepting more requests. The IO scheduler re-orders and merges the IO requests in the queue based upon its own scheduling discipline, e.g. FIFO, SCAN, CFQ. The requests in the queue either are orderless or belong to the same epoch. Therefore, they can be freely scheduled with each other without violating the ordering condition. The merged request will be order-preserving if one of the constituents is order-preserving. The IO scheduler designates the order-preserving request that leaves the queue last as a new barrier. This mechanism is called Epoch Based Barrier Reassignment. When there are no more order-preserving requests in the queue, the IO scheduler starts accepting the IO requests. When the IO scheduler unblocks the queue, there can be one or more orderless requests in the queue. These orderless requests can be scheduled with the other requests in the following epoch. Differentiating the order-preserving requests from orderless ones, we avoid imposing unnecessary ordering constraint on the requests. Currently, the Epoch based IO scheduler is implemented on top of the existing CFQ scheduler. Each process defines its own scheduler queue.

Figure 5: Epoch Based Barrier Reassignment.

Fig. 5 illustrates how the barrier reassignment works. The circular and the rectangular write request denote the order-preserving attribute and barrier attribute, respectively. In Fig. 5, the application calls fsync() and in the mean time, pdflush daemon flushes the dirty pages. In Fig. 5, fsync() creates three write requests: w1, w2 and w4. The filesystem marks the three requests as order-preserving ones. The filesystem designates the last request, w4, as a barrier write. pdflush creates three write requests w3, w5 and w6. They are all orderless. The requests from the two threads are fed to the IO scheduler as w1, w2, w3, w5, w4 (barrier), w6, in order. When the barrier write, w4, enters the queue, the scheduler stops accepting new requests. There are only five requests in the queue, w1, w2, w3, w4 and w5. w6 cannot be inserted at the queue since the queue is blocked. The IO scheduler reorders them and dispatches them in w2, w3, w4, w5, w1 order. After they are scheduled, w1 leaves the queue last. The IO scheduler puts the barrier flag to w1. In this scenario, the request w6 is going to be scheduled with the requests in the following epoch.

3.4 Order Preserving Dispatch

The order preserving dispatch is a fundamental innovation of this work. In order preserving dispatch, the host dispatches the following write request when the storage device acknowledges that the preceding request has successfully been received (Fig. 6(a)) and yet the transfer order between the two requests is preserved, i.e. D = C. The order preserving dispatch guarantees the transfer order without blocking the caller. Legacy IO stack controls the transfer order with Wait-On-Transfer. Wait-On-Transfer not only exposes the caller to the context switch overhead but also makes the IO latency less predictable. It may stall the storage device since the caller postpones dispatching the following command till the preceding command is serviced. Order preserving dispatch eliminates all these overheads.

For order preserving dispatch, the only thing the host block device driver does is to set the priority of a barrier write command to ordered when dispatching it. Then, the SCSI compliant storage device automatically guarantees the transfer order constraint in serving the requests. SCSI standard defines three command priority levels: head of the queue, ordered, and simple [57], with which the incoming command is put at the head of the command queue, tail of the command queue or at arbitrary position determined by the storage controller. In addition, the simple command cannot be inserted in front of the existing "ordered" or "head of the queue" commands. The head of the queue priority is used when a command requires an immediate service, e.g. flush command. Via setting the priority of barrier write command to ordered, the host ensures that the data blocks associated with the write requests in the preceding epoch are transferred ahead of the data blocks associated with the barrier write. Likewise, the data blocks associated with the following epoch are transferred after the data blocks associated with the barrier write are transferred. The transfer order condition is satisfied.

The caller may be blocked after dispatching the write request. This can happen when the device is unavailable or the caller is switched out involuntarily, e.g. time quantum expires. For both cases, the block device driver of the order preserving dispatch module uses the same error handling routine adopted by the existing block device driver; the kernel daemon inherits the task and retries dispatching the request after a certain time interval, e.g., 3 msec for SCSI device [57] (Fig. 6(b)). The thread resumes once the request is dispatched successfully.

Figure 6: Order Preserving Dispatch. (a) When Device is Available. (b) When Device is Busy.

4 BarrierFS: Barrier Enabled Filesystem

4.1 Programming Model

We propose two new filesystem interfaces, fbarrier() and fdatabarrier(), which are the ordering guarantee only counterparts to fsync() and fdatasync(), respectively. fbarrier() shares the same semantics with osync() in OptFS [8]. The salient feature of BarrierFS is fdatabarrier(). fdatabarrier() returns after dispatching the write requests for dirty pages. With fdatabarrier(), the application can enforce a storage order virtually without any overhead; without flush, without waiting for DMA completion and even without context switch. The following codelet illustrates the usage of the fdatabarrier().

    write(fileA, "Hello");
    fdatabarrier(fileA);
    write(fileA, "World");

It ensures that "Hello" is written to the storage surface ahead of "World". Modern applications have been using expensive fdatasync() to guarantee both durability and ordering. For example, SQLite, which is the default DBMS in mobile platforms such as Android, iOS or Tizen, uses fdatasync() to ensure that the updated database node reaches the disk surface ahead of the updated database header. In SQLite, fdatabarrier() can replace the fdatasync() when it is used for ensuring the storage order, not the durability.

The Barrier Enabled IO stack is filesystem agnostic. fbarrier() and fdatabarrier() can be implemented in any filesystem using the proposed order preserving block device layer. As a seminal work, we modify the EXT4 filesystem for order preserving block device layer. We optimize fsync() and fdatasync() for order preserving block device layer and newly implement fbarrier() and fdatabarrier(). We name the modified EXT4 as BarrierFS. fbarrier() in BarrierFS supports all journal modes in EXT4; WRITEBACK, ORDERED and DATA.

4.2 Dual Mode Journaling

Committing a journal transaction essentially consists of two separate tasks: dispatching write commands for JD and JC to the storage (host side) and making them durable (storage side). In the order preserving block device design, the host (the block device layer) is responsible for controlling the dispatch order and transfer order while the storage controller takes care of handling the persist order. The design of order preserving block device layer naturally supports separation of the control plane (dispatching the write requests) and the data plane

(persisting the associated data blocks and journal transaction) in filesystem journaling. For effective separation, these two planes should work independently with minimum dependency. For filesystem journaling, we allocate separate threads for dispatching the write requests and for making them durable: the commit thread and the flush thread, respectively. This mechanism is called Dual Mode Journaling.

Figure 7: fsync() and fbarrier(), D: DMA for dirty pages, JD: DMA for journal descriptor, JC: DMA for journal commit block. (a) fsync() in EXT4 with FLUSH/FUA, (b) fsync() and fbarrier() in BarrierFS

The commit thread is responsible for dispatching the write requests for JD and JC. In BarrierFS, the commit thread tags both requests with REQ_ORDERED and REQ_BARRIER so that JD and JC are transferred and are guaranteed to be persisted in order. After dispatching the write request for JC, the commit thread inserts the journal transaction into the committing transaction list. Under the ordering guarantee (fbarrier()), the commit thread wakes up the caller. In the legacy IO stack, the JBD thread interleaves the write requests for JC and JD with transfer-and-flush. In BarrierFS, the commit thread dispatches them in order-preserving dispatch discipline without Wait-For-Transfer overhead and with Wait-For-Flush overhead.

The flush thread is responsible for (i) issuing the flush command, (ii) handling errors and retries and (iii) removing the transaction from the committing transaction list. The flush thread is triggered when JC is transferred. If the journaling is triggered by fbarrier(), the flush thread removes the transaction from the committing transaction list and returns. It does not call flush. There is no caller to wake up. If the journaling is initiated by fsync(), the flush thread flushes the cache, removes the associated transaction from the committing transaction list and wakes up the caller. By separating the control plane (commit thread) and the data plane (flush thread), the commit thread can commit the following transaction as soon as it is done with dispatching the write requests for the preceding journal commit. In Dual Mode Journaling, there can be more than one committing transaction in flight.

In fsync() or fbarrier(), BarrierFS dispatches the write request for D as an order-preserving request. Then, the commit thread dispatches the write requests for JD and JC, both as order-preserving barrier writes. As a result, D and JD form a single epoch while JC by itself forms another. A journal commit consists of two epochs: {D, JD} and {JC}. An fsync() in BarrierFS can be represented as in Eq. 3. Eq. 3 also denotes fbarrier().

$$\underbrace{D \rightarrow JD_{BAR} \rightarrow JC_{BAR}}_{\texttt{fbarrier()}} \rightarrow xfer \rightarrow flush \qquad (3)$$

The benefit of Dual Mode Journaling is substantial. In EXT4 (Fig. 7(a)), an fsync() consists of a tandem of three DMAs and two flushes interleaved with context switches. In BarrierFS, an fsync() consists of a single flush, three DMAs (Fig. 7(b)) and a smaller number of context switches. The transfer-and-flush between JD and JC is completely eliminated. fbarrier() returns almost instantly after the commit thread dispatches the write request for JC.

BarrierFS forces a journal commit if fdatasync() or fdatabarrier() does not find any dirty pages. Through this scheme, fdatasync() (or fdatabarrier()) can delimit an epoch despite the absence of dirty pages.

4.3 Multi-Transaction Page Conflict

A buffer page can belong to only one journal transaction at a time [65]. Blindly inserting a buffer page into the running transaction may remove it from the committing transaction before it becomes durable. We call this situation a page conflict. In both EXT4 and BarrierFS, when the application thread inserts a buffer page into the running transaction, it checks if the buffer page is being held by the committing transaction. If so, the application blocks without inserting it into the running transaction. When the JBD thread of EXT4 (or the flush thread in BarrierFS) has made the committing transaction durable, it identifies the conflict pages in the committed transaction and inserts them into the running transaction. In EXT4, there is only one committing transaction at a time. The running transaction is guaranteed to be conflict free when the JBD thread resolves the page conflicts from the committed transaction. In BarrierFS, the running transaction can conflict with more than one

committing transactions, multi-transaction page conflict. When the flush thread resolves the page conflicts from a committed transaction, the running transaction may still conflict with the other committing transactions. If the running transaction is committed prematurely with conflicted pages missing, the storage order can be compromised. Whenever the flush thread resolves the page conflicts and notifies the commit thread about its completion of persisting a transaction, the commit thread has to scan all the pages in the other committing transactions for page conflict. To reduce the overhead of scanning the pages, we introduce the conflict-page list. The application thread inserts the buffer page into the conflict-page list if the buffer page is being held by one of the committing transactions. When the flush thread has made the committing transaction durable, the flush thread inserts the conflict pages into the buffer page list of the running transaction and removes them from the conflict-page list. The commit thread can start committing a running transaction only when the conflict-page list is empty.

4.4 Analysis

We examine how the journaling throughput may vary subject to different methods of journal commit: BarrierFS, EXT4 with the no-barrier option, EXT4 with supercap SSD, and plain EXT4. Fig. 8 schematically illustrates the behaviors. With the no-barrier mount option, the filesystem does not issue a flush command in fsync() or fdatasync(). t_D, t_C and t_F denote the dispatch latency, transfer latency, and flush latency associated with committing a journal transaction, respectively. In particular, t_ε denotes the total flush latency in supercap SSD.

Figure 8: fsync() under different storage order guarantee: BarrierFS, EXT4 (no flush), EXT4 (quick flush), EXT4 (full flush); t_D: dispatch latency, t_C: transfer latency, t_ε: flush latency in supercap SSD, t_F: flush latency

With supercap SSD, EXT4 (quick flush), the journal commits are interleaved by t_D + t_C + t_ε. The host observes the round-trip delay of the flush command and the associated context switch overhead, t_ε. t_ε is not negligible in Flash storage. EXT4 with the no-barrier option, EXT4 (no flush), can commit a new transaction once all the associated blocks are transferred to the storage. The journaling is interleaved by command dispatch and DMA transfer, t_D + t_C. In BarrierFS, the commit thread keeps dispatching the journal commit operations without waiting for the completion of the transfer. The interval between successive journal commits can be as small as t_D.

5 Applications on Barrier Enabled IO stack

fsync() accounts for a dominant fraction of IO in modern applications, e.g. mail server [60] or OLTP. 90% of IOs in the TPC-C workload are created by fsync() for synchronizing the logs to the storage [50]. The order preserving IO stack can significantly improve the performance in these workloads. SQLite can be the application which benefits the most from the Barrier Enabled IO stack. SQLite uses fdatasync() not only to guarantee the durability of a transaction but also to control the storage order on various occasions, e.g. between writing the undo-log and storing the journal header and between writing the updated database node and writing the commit block [37]. In a single insert transaction, SQLite calls fdatasync() four times, three of which are to control the storage order. We can replace them with fdatabarrier() calls without compromising the durability of a transaction. Some applications prefer to trade the durability and freshness of the result for the performance and scalability of the operation [11, 16]. The benefit of BarrierFS can be more than significant in these applications. One can replace all fsync() and fdatasync() calls with their ordering guarantee counterparts, fbarrier() and fdatabarrier(), respectively.

6 Experiment

6.1 Setup

We implement the Barrier Enabled IO stack on three different platforms: smartphone (Galaxy S6, Android 5.0.2, Linux 3.10), PC server (4 cores, Linux 3.10.61) and enterprise server (16 cores, Linux 3.10.61). We test three storage devices: mobile storage (UFS 2.0, queue depth (QD)=16, single channel), 850 PRO for server (SATA 3.0, QD=32, 8 channels), 843TN for server (SATA 3.0, QD=32, 8 channels, supercap). We call these UFS, plain-SSD and supercap-SSD, respectively. We implement the barrier write command in the UFS device. In plain-SSD, we introduce a 5% performance penalty to simulate the barrier overhead. For supercap-SSD, we assume that there is no barrier overhead.

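The conflict-page list of Section 4.3 can be illustrated with a small single-threaded model. This is a sketch under stated assumptions, not BarrierFS code: the class and method names are hypothetical, pages are plain strings, and the locking and thread wake-ups that a kernel implementation needs are left out.

```python
# Toy model of the conflict-page list (Section 4.3). Names are hypothetical;
# there is no concurrency here, only the bookkeeping described in the text.

class Journal:
    def __init__(self):
        self.running = set()         # pages of the running transaction
        self.committing = []         # committing transactions in flight
        self.conflict_pages = set()  # the conflict-page list

    def insert_page(self, page):
        """Application thread: try to add a page to the running transaction."""
        if any(page in tx for tx in self.committing):
            # Page is held by a committing transaction: record the conflict.
            self.conflict_pages.add(page)
            return False
        self.running.add(page)
        return True

    def can_commit(self):
        """Commit thread may start only when the conflict-page list is empty."""
        return not self.conflict_pages

    def commit(self):
        """Commit thread: move the running transaction to the committing list."""
        if not self.can_commit():
            raise RuntimeError("conflict-page list is not empty")
        tx = self.running
        self.committing.append(tx)
        self.running = set()
        return tx

    def make_durable(self, tx):
        """Flush thread: tx is durable, so hand its conflict pages over."""
        self.committing = [t for t in self.committing if t is not tx]
        resolved = self.conflict_pages & tx
        self.running |= resolved
        self.conflict_pages -= resolved


def demo():
    j = Journal()
    j.insert_page("A")
    tx1 = j.commit()                 # tx1 = {A}
    j.insert_page("B")
    tx2 = j.commit()                 # tx2 = {B}, two transactions in flight
    j.insert_page("A")               # conflicts with tx1
    j.insert_page("B")               # conflicts with tx2
    j.make_durable(tx1)              # resolves A only
    still_blocked = not j.can_commit()
    j.make_durable(tx2)              # resolves B
    return still_blocked, j.can_commit(), sorted(j.running)
```

In `demo()`, the running transaction conflicts with two committing transactions at once. Making the first one durable is not enough to allow the next commit, which is the multi-transaction case that the emptiness check on the conflict-page list is meant to cover.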
Figure 9: 4KB Random Write; XnF: write() followed by fdatasync(), X: write() followed by fdatasync() (no-barrier option), B: write() followed by fdatabarrier(), P: Plain Buffered write(). Axes: IOPS (×10^3) and queue depth for UFS, plain-SSD and supercap-SSD

6.2 Order Preserving Block Layer

We examine the performance of 4 KByte random write with different ways of enforcing the storage order. Fig. 9 illustrates the result. In scenario ‘X’, where ‘X’ denotes Wait-On-Transfer, the host sends the following request after the data block associated with the preceding request is completely transferred. Despite the absence of the flush overhead, the storage devices exhibit less than 50% of their plain buffered write performance, the scenario ‘P’. All three devices are severely underutilized. Average queue depths in all three devices are less than one. The Wait-On-Transfer overhead in the modern IO stack prohibits the host from properly exploiting the underlying Flash storage. In scenario ‘B’, where ‘B’ denotes Barrier, the performance increases at least by 2× against scenario ‘X’. The average queue depths reach near the maximum in all three Flash storages. An fdatabarrier() is not entirely free. We observe 1% to 25% performance deficiency when it is compared against the plain buffered write. Plain buffered write exhibits shorter queue depth than barrier write does (Fig. 9). This is because in plain buffered write, the IO scheduler merges multiple requests and the number of commands dispatched to the storage device decreases.

Figure 10: Queue Depth, 4KB Random Write, Wait-For-Transfer: write() followed by fdatasync() with no barrier, Barrier: write() followed by fdatabarrier(). Panels: (a) Wait-For-Transfer, plain SSD, (b) Barrier, plain SSD, (c) Wait-For-Transfer, UFS, (d) Barrier, UFS. Axes: QD vs. time (sec)

Fig. 10 is another manifestation of fdatabarrier(). The storage performance is closely related to the command queue utilization [33]. When the requests are interleaved with DMA transfer, the queue depth never goes beyond one (Fig. 10(a) and Fig. 10(c)). When the write request is followed by fdatabarrier(), the queue depth grows near to its maximum in all three storage devices (Fig. 10(b) and Fig. 10(d)). The order preserving block layer enables the host to fully exploit the concurrency and the parallelism of the underlying Flash storage.

6.3 Filesystem Journaling

Latency: In plain-SSD and supercap-SSD, the average fsync() latency decreases by 40% when we use BarrierFS instead of EXT4 (Table 1). UFS experiences a more significant reduction in fsync() latency than the SSDs do. The smartphone uses transactional checksum in filesystem journaling. With BarrierFS, we can eliminate not only the transfer overhead but also the checksum overhead. The fsync() latency decreases by 60% in BarrierFS. In supercap-SSD and UFS, the fsync() latencies at the 99.99th percentile are 30× the average fsync() latency (Table 1). Using BarrierFS, the tail latencies at the 99.99th percentile decrease by 50%, 20% and 70% in UFS, plain-SSD and supercap-SSD, respectively, against EXT4.

              UFS             plain-SSD       supercap-SSD
(%)         EXT4    BFS     EXT4    BFS     EXT4    BFS
µ           1.29    0.51    5.95    3.52    0.15    0.09
Median      1.20    0.44    5.43    3.01    0.15    0.09
99th        4.15    3.51   11.41    8.96    0.16    0.10
99.9th     22.83    9.02   16.09    9.30    0.28    0.24
99.99th    33.10   17.60   17.26   14.19    4.14    1.35

Table 1: fsync() latency statistics (msec)

Figure 11: Average Number of Context Switches per fsync()/fbarrier(), 4 KByte write() followed by fsync() or fbarrier(), EXT4-DR: fsync(), BFS-DR: fsync(), EXT4-OD: fsync() with no-barrier, BFS-OD: fbarrier()

Context Switches: We examine the number of application level context switches in various modes of

journaling. Fig. 11 illustrates the result. In EXT4-DR, fsync() wakes up the caller twice: after the DMA transfer of D completes and after the journal transaction is made durable. This applies to all three Flash storages. In BarrierFS, fsync() wakes up the caller only once: after the transaction is made durable. In UFS and supercap-SSD, fsync() of BFS-DR wakes up the caller twice for entirely different reasons. In UFS and supercap-SSD, the interval between successive write requests is much smaller than the timer interrupt interval due to the small flush latency. As a result, write() requests rarely update the time fields of the inode and fsync() becomes an fdatasync(). fdatasync() wakes up the caller twice in BarrierFS: after transferring D and after the flush completes. The plain-SSD uses TLC flash. The interval between successive write() calls can be longer than the timer interrupt interval. In plain-SSD, fsync() occasionally commits a journal transaction and the average number of context switches becomes less than two in BFS-DR for plain-SSD.

BFS-OD manifests the benefits of BarrierFS. fbarrier() rarely finds updated metadata since it returns quickly. Most fbarrier() calls are serviced as fdatabarrier(). fdatabarrier() does not block the caller and it does not release the CPU voluntarily. The number of context switches in fbarrier() is much smaller than in EXT4-OD. BarrierFS significantly reduces the context switch overhead against EXT4.

Command Queue Utilization: In BarrierFS, fsync() drives the queue depth up to two (Fig. 12(a)). Theoretically, it can drive the queue depth up to three because the host can dispatch the write requests for D, JD and JC in tandem. According to our instrumentation, there exists a 160 µsec context switch interval between the application thread and the commit thread. It takes approximately 70 µsec to transfer a 4 KByte block from the host to the device cache. The command from the application thread is serviced before the commit thread dispatches the command for writing JD. In fbarrier(), BarrierFS successfully saturates the command queue (Fig. 12(b)). The queue depth increases to fifteen.

Figure 12: Queue Depth Changes in BarrierFS: write() followed by fsync() vs. write() followed by fbarrier(). Panels: (a) Durability Guarantee, (b) Ordering Guarantee. Axes: QD vs. time (msec)

Throughput: We examine the throughput of filesystem journaling under a varying number of CPU cores. We use a modified DWSL workload in fxmark [44]. In the DWSL workload, each thread performs a 4 KByte allocating write followed by fsync(). Each thread operates on its own file. Each thread writes 1 GByte in total. BarrierFS exhibits much more scalable behavior than EXT4 (Fig. 13). In plain-SSD, BarrierFS exhibits 2× the performance of EXT4 at all numbers of cores (Fig. 13(a)). In supercap-SSD, the performance saturates with six cores in both EXT4 and BarrierFS. BarrierFS exhibits 1.3× the journaling throughput of EXT4 at full throttle (Fig. 13(b)).

Figure 13: fxmark: scalability of filesystem journaling. Panels: (a) plain-SSD, (b) supercap-SSD; series: EXT4-DR, BFS-DR. Axes: ops/sec (×10^3) vs. #core

6.4 Mobile Workload: SQLite

In mobile storage, BarrierFS achieves 75% performance improvement against EXT4 in the default PERSIST journal mode under durability guarantee (Fig. 14). We replace the first three of the four fdatasync() calls in a transaction with fdatabarrier(). We keep the last fdatasync() for the durability of a transaction. Under ordering guarantee, we replace all four fdatasync() calls with fdatabarrier(). When we remove the durability requirement, the performance increases by 2.8× in PERSIST mode against the baseline EXT4. In WAL mode, SQLite issues fdatasync() once in every commit and there is not much room for improvement for BarrierFS.

The benefit of eliminating the transfer-and-flush is more significant as the storage has a higher degree of parallelism and a slower Flash device. In plain-SSD, SQLite exhibits 73× performance gain in BFS-OD against the baseline EXT4-DR.

6.5 Server Workload

We run two workloads: the varmail workload in FILEBENCH [68] and the OLTP-insert workload from sysbench [34]. Sysbench is a database workload and uses MySQL [46]. varmail is a metadata intensive workload. We also test OptFS [8]. We use osync() in OptFS. We perform two sets of experiments. First, we leave the application intact and replace EXT4 with BarrierFS (EXT4-DR and BFS-DR). We compare the

fsync() performance between BarrierFS and EXT4. The second set of experiments is for ordering guarantee. In EXT4, we use the nobarrier mount option. In BarrierFS, we replace fsync() with fbarrier(). Fig. 15 illustrates the result.

Figure 14: SQLite Performance: inserts/sec (100,000 inserts). Panels: (a) UFS, (b) plain-SSD; series: EXT4-DR, BFS-DR, OptFS, EXT4-OD, BFS-OD; journal modes: PERSIST, WAL. Axis: Tx/s (×10^3)

Figure 15: Performance for Server Workloads, Filebench: Varmail (ops/s), Sysbench: OLTP-insert (Tx/s). Series: EXT4-DR, BFS-DR, OptFS, EXT4-OD, BFS-OD on plain-SSD and supercap-SSD

In plain-SSD, BFS-DR brings 60% performance gain against EXT4-DR in the varmail workload. This is due to the more efficient implementation of fsync() in BarrierFS. The benefit of BarrierFS manifests itself when we relax the durability guarantee. The varmail workload is known for its heavy fsync() traffic. In EXT4-OD, the journal commit operations are interleaved by DMA transfer latency. In BFS-OD, the journal commit operations are interleaved by the dispatch latency. Dual Mode Journaling can significantly improve the journaling throughput by increasing the concurrency in journal commit. With ordering guarantee, BarrierFS achieves 80% performance gain against EXT4 with the no-barrier option.

In MySQL, BFS-OD outperforms EXT4-OD by 12%. The performance increases 43× when we replace the fsync() of EXT4 with fbarrier().

Notes on OptFS: In SQLite (Fig. 14(b)), varmail and MySQL (Fig. 15), we observe that OptFS does not show as good performance in Flash storage as it does in the rotating media [8]. OptFS is elaborately designed to reduce the seek overhead inherent in Ordered mode journaling of EXT4. OptFS achieves this objective via two innovations: flushing a larger number of transactions together and selectively journaling the data blocks. The benefit of eliminating the seek overhead is marginal for Flash storage. For this reason, in the varmail workload, which rarely entails selective data mode journaling, OptFS and EXT4-OD exhibit similar performance in Flash storage (Fig. 15). The selective data mode journaling increases the number of pages to scan for osync(), only a few of which can be dispatched to the storage. The selective data mode journaling can negatively interfere with osync(), especially when the underlying storage has short latency. In [8], MySQL performance decreases to one third in OptFS against EXT4-OD and the selective data mode journaling has been designated as its prime cause. Our MySQL workload creates an even larger amount of selective data journaling and the performance of OptFS corresponds to one eighth of that of EXT4-OD under the MySQL workload (Fig. 15).

7 Related Work

OptFS [8] is the closest work of our sort; they proposed a new journaling primitive osync() which returns without persisting the journaling transaction and yet guarantees that the write requests associated with journal commits are stored in order. OptFS does not provide the filesystem primitive that corresponds to fdatabarrier() in our Barrier Enabled IO stack. osync() still relies on Wait-On-Transfer in enforcing the storage order. Featherstitch [19] proposes a programming model to specify the set of requests that can be scheduled together, patchgroup, and the ordering dependency between them, pg_depend(). While xsyncfs [48] successfully mitigates the overhead of fsync(), xsyncfs maintains complex causal dependencies among buffered updates. An order preserving block device layer can make the implementation of xsyncfs much simpler. NoFS (no order file system) [9] introduces “backpointer” to entirely eliminate the transfer-and-flush ordering requirement in the file system. However, it does not support atomic transactions.

A few works proposed to use multiple running transactions or multiple committing transactions to circumvent the transfer-and-flush overhead in filesystem journaling [38, 29, 54], to improve journaling performance or to isolate errors. IceFS [38] allocates separate running transactions for each container. SpanFS [29] splits a journal region into multiple partitions and allocates committing transactions for each partition. CCFS [54] allocates separate running transactions for individual threads. These systems, where each journaling session still relies on the transfer-and-flush mechanism in enforcing the intra- and inter-transaction storage orders, are complementary to our work.

A number of file systems provide a multi-block atomic write feature [17, 35, 53, 66] to relieve applications from

the overhead of logging and journaling. These file systems internally use the transfer-and-flush mechanism to enforce the storage order between write requests for data blocks and associated metadata. An order preserving block device can effectively mitigate overheads incurred when enforcing the storage order in these file systems.

8 Conclusion

In this work, we develop a Barrier Enabled IO stack to address the transfer-and-flush overhead inherent in the legacy IO stack. The Barrier Enabled IO stack effectively eliminates the transfer-and-flush overhead associated with controlling the storage order and is successful in fully exploiting the underlying Flash storage. We would like to conclude this paper with two important observations. First, “cache barrier” is a necessity rather than a luxury. “Cache barrier” is an essential tool for the host to control the persist order, which has not been possible before. Currently, the cache barrier command is only available in the standard command set for mobile storage. Given its implication on the IO stack, it should be available in the whole range of storage devices, from mobile storage to high performance Flash storage with supercap. Second, eliminating the “Wait-On-Transfer” overhead is not an option. It blocks the caller and stalls the command queue, leaving the storage device severely underutilized. As the storage latency becomes shorter, the relative cost of “Wait-On-Transfer” can become more significant.

Despite all the preceding sophisticated techniques to optimize the legacy IO stack for Flash storage, we carefully argue that the IO stack is still fundamentally driven by the old legacy that the host cannot control the persist order. This work shows how the IO stack can evolve when the persist order can be controlled, and the substantial benefit of doing so. We hope that this work serves as a possible basis for the future IO stack in the era of Flash storage.

References

[1] emmc5.1 solution in . https://www.skhynix.com/kor/product/nandEMMC.jsp.
[2] Toshiba expands line-up of e-mmc version 5.1 compliant embedded nand flash memory modules. http://toshiba.semicon-storage.com/us/company/taec/news/2015/03/memory-20150323-1.html.
[3] AXBOE, J. Linux block IO present and future. In Proc. of Ottawa Linux Symposium (Ottawa, Ontario, Canada, Jul 2004).
[4] BEST, S. JFS Overview. http://jfs.sourceforge.net/project/pub/jfs.pdf, 2000.
[5] CHANG, Y.-M., CHANG, Y.-H., KUO, T.-W., LI, Y.-C., AND LI, H.-P. Achieving SLC Performance with MLC Flash Memory. In Proc. of DAC 2015 (San Francisco, CA, USA, 2015).
[6] CHEN, F., LEE, R., AND ZHANG, X. Essential roles of exploiting internal parallelism of flash memory based solid state drives in high-speed data processing. In Proc. of IEEE HPCA 2011 (San Antonio, TX, USA, Feb 2011).
[7] CHIDAMBARAM, V. Orderless and Eventually Durable File Systems. PhD thesis, UNIVERSITY OF WISCONSIN–MADISON, 2015.
[8] CHIDAMBARAM, V., PILLAI, T. S., ARPACI-DUSSEAU, A. C., AND ARPACI-DUSSEAU, R. H. Optimistic Crash Consistency. In Proc. of ACM SOSP 2013 (Farmington, PA, USA, Nov 2013).
[9] CHIDAMBARAM, V., SHARMA, T., ARPACI-DUSSEAU, A. C., AND ARPACI-DUSSEAU, R. H. Consistency Without Ordering. In Proc. of USENIX FAST 2012 (San Jose, CA, USA, Feb 2012).
[10] CHO, Y. S., PARK, I. H., YOON, S. Y., LEE, N. H., JOO, S. H., SONG, K.-W., CHOI, K., HAN, J.-M., KYUNG, K. H., AND JUN, Y.-H. Adaptive multi-pulse program scheme based on tunneling speed classification for next generation multi-bit/cell NAND flash. IEEE Journal of Solid-State Circuits (JSSC) 48, 4 (2013), 948–959.
[11] CIPAR, J., GANGER, G., KEETON, K., MORREY III, C. B., SOULES, C. A., AND VEITCH, A. LazyBase: trading freshness for performance in a scalable database. In Proc. of ACM EuroSys 2012 (Bern, Switzerland, Apr 2012).
[12] COBB, D., AND HUFFMAN, A. NVM express and the PCI express SSD Revolution. In Proc. of Intel Developer Forum (San Francisco, CA, USA, 2012).
[13] CONDIT, J., NIGHTINGALE, E. B., FROST, C., IPEK, E., LEE, B., BURGER, D., AND COETZEE, D. Better I/O through byte-addressable, persistent memory. In Proc. of ACM SOSP 2009 (Big Sky, MT, USA, Oct 2009).
[14] CORBET, J. Barriers and journaling filesystems. http://lwn.net/Articles/283161/.
[15] CORBET, J. The end of block barriers. https://lwn.net/Articles/400541/, August 2010.
[16] CUI, H., CIPAR, J., HO, Q., KIM, J. K., LEE, S., KUMAR, A., WEI, J., DAI, W., GANGER, G. R., GIBBONS, P. B., ET AL. Exploiting bounded staleness to speed up big data analytics. In Proc. of USENIX ATC 2014 (Philadelphia, PA, USA, Jun 2014).
[17] DABEK, F., KAASHOEK, M. F., KARGER, D., MORRIS, R., AND STOICA, I. Wide-area Cooperative Storage with CFS. In Proc. of ACM SOSP 2001 (Chateau Lake Louise, Banff, Canada, Oct 2001).
[18] DEES, B. Native command queuing-advanced performance in desktop storage. IEEE Potentials Magazine 24, 4 (2005), 4–7.
[19] FROST, C., MAMMARELLA, M., KOHLER, E., DE LOS REYES, A., HOVSEPIAN, S., MATSUOKA, A., AND ZHANG, L. Generalized File System Dependencies. In Proc. of ACM SOSP 2007 (Stevenson, WA, USA, Oct 2007).
[20] GIM, J., AND WON, Y. Extract and infer quickly: Obtaining sector geometry of modern hard disk drives. ACM Transactions on Storage (TOS) 6, 2 (2010), 6.
[21] GRUPP, L. M., DAVIS, J. D., AND SWANSON, S. The bleak future of nand flash memory. In Proc. of USENIX FAST 2012 (Berkeley, CA, USA, 2012), USENIX Association, pp. 2–2.
[22] GUO, J., YANG, J., ZHANG, Y., AND CHEN, Y. Low cost power failure protection for mlc nand flash storage systems with pram/dram hybrid buffer. In Design, Automation & Test in Europe Conference & Exhibition (DATE), 2013 (2013), IEEE, pp. 859–864.
[23] HELLWIG, C. "block: update documentation for REQ_FLUSH / REQ_FUA". linux-2.6/Documentation/block/barrier.txt.
[24] HELM, M., PARK, J.-K., GHALAM, A., GUO, J., WAN HA, C., HU, C., KIM, H., KAVALIPURAPU, K., LEE, E., MOHAMMADZADEH, A., ET AL. 19.1 A 128Gb MLC NAND-Flash device using 16nm planar cell. In Proc. of IEEE ISSCC 2014 Dig. Tech Papers (San Francisco, CA, USA, Feb 2014).

[25] HEO, T. I/O Barriers. Linux/Documentation/block/barrier.txt, July 2005.
[26] JEONG, S., LEE, K., LEE, S., SON, S., AND WON, Y. I/O Stack Optimization for Smartphones. In Proc. of USENIX ATC 2013 (San Jose, CA, USA, Jun 2013).
[27] JESD220C, J. S. Universal Flash Storage (UFS) Version 2.1.
[28] JESD84-B51, J. S. Embedded Multi-Media Card (eMMC) Electrical Standard (5.1).
[29] KANG, J., ZHANG, B., WO, T., YU, W., DU, L., MA, S., AND HUAI, J. SpanFS: A Scalable File System on Fast Storage Devices. In Proc. of USENIX ATC 2015 (Santa Clara, CA, USA, Jul 2015).
[30] KANG, W.-H., LEE, S.-W., MOON, B., OH, G.-H., AND MIN, C. X-FTL: Transactional FTL for SQLite Databases. In Proc. of ACM SIGMOD 2013 (New York, NY, USA, Jun 2013).
[31] KESAVAN, R., SINGH, R., GRUSECKI, T., AND PATEL, Y. Algorithms and data structures for efficient free space reclamation in wafl. In Proc. of USENIX FAST 2017 (Santa Clara, CA, 2017), USENIX Association, pp. 1–14.
[32] KIM, H.-J., AND KIM, J.-S. Tuning the ext4 filesystem performance for android-based smartphones. In Frontiers in Computer Education. Springer, 2012, pp. 745–752.
[33] KIM, Y. An empirical study of redundant array of independent solid-state drives (RAIS). Springer Cluster Computing 18, 2 (2015), 963–977.
[34] KOPYTOV, A. SysBench manual. http://imysql.com/wp-content/uploads/2014/10/sysbench-manual.pdf, 2004.
[35] LEE, C., SIM, D., HWANG, J., AND CHO, S. F2FS: A New File System for Flash Storage. In Proc. of USENIX FAST 2015 (Santa Clara, CA, USA, Feb 2015).
[36] LEE, S., LEE, J.-Y., PARK, I.-H., PARK, J., YUN, S.-W., KIM, M.-S., LEE, J.-H., KIM, M., LEE, K., KIM, T., ET AL. 7.5 A 128Gb 2b/cell NAND flash memory in 14nm technology with tPROG=640us and 800MB/s I/O rate. In Proc. of IEEE ISSCC 2016 (San Francisco, CA, USA, Feb 2016).
[37] LEE, W., LEE, K., SON, H., KIM, W.-H., NAM, B., AND WON, Y. WALDIO: eliminating the filesystem journaling in resolving the journaling of journal anomaly. In Proc. of USENIX ATC 2015 (Santa Clara, CA, USA, Jul 2015).
[38] LU, L., ZHANG, Y., DO, T., AL-KISWANY, S., ARPACI-DUSSEAU, A. C., AND ARPACI-DUSSEAU, R. H. Physical Disentanglement in a Container-Based File System. In Proc. of USENIX OSDI 2014 (Broomfield, CO, USA, Oct 2014).
[39] LU, Y., SHU, J., GUO, J., LI, S., AND MUTLU, O. Lighttx: A lightweight transactional design in flash-based ssds to support flexible transactions. In Proc. of IEEE ICCD 2013.
[40] MATHUR, A., CAO, M., BHATTACHARYA, S., DILGER, A., TOMAS, A., AND VIVIER, L. The new ext4 filesystem: current status and future plans. In Proc. of Linux symposium 2007 (Ottawa, Ontario, Canada, Jun 2007).
[41] MCKUSICK, M. K., GANGER, G. R., ET AL. Soft Updates: A Technique for Eliminating Most Synchronous Writes in the Fast Filesystem. In Proc. of USENIX ATC 1999 (Monterey, CA, USA, Jun 1999).
[42] MEARIAN, L. Flash memory's density surpasses hard drives for first time. http://www.computerworld.com/article/3030642/data-storage/flash-memorys-density-surpasses-hard-drives-for-first-time.html, Feb 2016.
[43] MIN, C., KANG, W.-H., KIM, T., LEE, S.-W., AND EOM, Y. I. Lightweight application-level crash consistency on transactional flash storage. In Proc. of USENIX ATC 2015 (Santa Clara, CA, USA, Jul 2015).
[44] MIN, C., KASHYAP, S., MAASS, S., AND KIM, T. Understanding manycore scalability of file systems. In Proc. of USENIX ATC 2016 (Denver, CO, 2016), USENIX Association, pp. 71–85.
[45] MOHAN, C., HADERLE, D., LINDSAY, B., PIRAHESH, H., AND SCHWARZ, P. ARIES: a transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging. ACM Transactions on Database Systems (TODS) 17, 1 (1992), 94–162.
[46] MYSQL, A. Mysql 5.1 reference manual. Sun Microsystems (2007).
[47] NARAYANAN, D., DONNELLY, A., AND ROWSTRON, A. Write Off-loading: Practical Power Management for Enterprise Storage. ACM Transactions on Storage (TOS) 4, 3 (2008), 10:1–10:23.
[48] NIGHTINGALE, E. B., VEERARAGHAVAN, K., CHEN, P. M., AND FLINN, J. Rethink the Sync. In Proc. of USENIX OSDI 2006 (Seattle, WA, USA, Nov 2006).
[49] OKUN, M., AND BARAK, A. Atomic writes for data integrity and consistency in shared storage devices for clusters. In Proc. of ICA3PP 2002 (Beijing, China, Oct 2002).
[50] OU, J., SHU, J., AND LU, Y. A high performance file system for non-volatile main memory. In Proc. of ACM EuroSys 2016 (London, UK, Apr 2016).
[51] OUYANG, X., NELLANS, D., WIPFEL, R., FLYNN, D., AND PANDA, D. K. Beyond block I/O: Rethinking traditional storage primitives. In Proc. of IEEE HPCA 2011 (San Antonio, TX, USA, Feb 2011).
[52] PALANCA, S., FISCHER, S. A., MAIYURAN, S., AND QAWAMI, S. Mfence and lfence micro-architectural implementation method and system, July 5 2016. US Patent 9,383,998.
[53] PARK, S., KELLY, T., AND SHEN, K. Failure-atomic Msync(): A Simple and Efficient Mechanism for Preserving the Integrity of Durable Data. In Proc. of ACM EuroSys 2013 (Prague, Czech Republic, Apr 2013).
[54] PILLAI, T. S., ALAGAPPAN, R., LU, L., CHIDAMBARAM, V., ARPACI-DUSSEAU, A. C., AND ARPACI-DUSSEAU, R. H. Application crash consistency and performance with ccfs. In Proc. of USENIX FAST 2017 (Santa Clara, CA, 2017), USENIX Association, pp. 181–196.
[55] PRABHAKARAN, V., BAIRAVASUNDARAM, L. N., AGRAWAL, N., GUNAWI, H. S., ARPACI-DUSSEAU, A. C., AND ARPACI-DUSSEAU, R. H. IRON File Systems. In Proc. of ACM SOSP 2005 (Brighton, UK, Oct 2005).
[56] PRABHAKARAN, V., RODEHEFFER, T. L., AND ZHOU, L. Transactional flash. In Proc. of USENIX OSDI 2008, vol. 8.
[57] REV, H. SCSI Commands Reference Manual. http://www.seagate.com/files/staticfiles/support/docs/manual/Interface%20manuals/100293068h.pdf/, Jul 2014. Seagate.
[58] RODEH, O., BACIK, J., AND MASON, C. Btrfs: The linux b-tree filesystem. ACM Transactions on Storage (TOS) 9, 3 (2013), 9.
[59] ROSENBLUM, M., AND OUSTERHOUT, J. K. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems (TOCS) 10, 1 (Feb. 1992), 26–52.
[60] SEHGAL, P., TARASOV, V., AND ZADOK, E. Evaluating Performance and Energy in File System Server Workloads. In Proc. of USENIX FAST 2010 (San Jose, CA, USA, Feb 2010).

[61] SELTZER, M. I., GANGER, G. R., MCKUSICK, M. K., SMITH, K. A., SOULES, C. A., AND STEIN, C. A. Journaling Versus Soft Updates: Asynchronous Meta-data Protection in File Systems. In Proc. of USENIX ATC 2000 (San Diego, CA, USA, Jun 2000).
[62] SHILAMKAR, G. Journal Checksums. http://wiki.old.lustre.org/images/4/44/Journal-checksums.pdf, May 2007.
[63] STEIGERWALD, M. Imposing Order: Working with write barriers and journaling filesystems. Linux Magazine 78 (2007), 60–64.
[64] SWEENEY, A., DOUCETTE, D., HU, W., ANDERSON, C., NISHIMOTO, M., AND PECK, G. Scalability in the xfs file system. In Proc. of USENIX ATC (1996), vol. 15.
[65] TWEEDIE, S. C. Journaling the linux ext2fs filesystem. In Proc. of The Fourth Annual Linux Expo (Durham, NC, USA, May 1998).
[66] VERMA, R., MENDEZ, A. A., PARK, S., MANNARSWAMY, S., KELLY, T., AND MORREY, C. Failure-Atomic Updates of Application Data in a Linux File System. In Proc. of USENIX FAST 2015 (Santa Clara, CA, USA, Feb 2015).
[67] WEISS, Z., SUBRAMANIAN, S., SUNDARARAMAN, S., TALAGALA, N., ARPACI-DUSSEAU, A., AND ARPACI-DUSSEAU, R. ANViL: Advanced Virtualization for Modern Non-Volatile Memory Devices. In Proc. of USENIX FAST 2015 (Santa Clara, CA, USA, Feb 2015).
[68] WILSON, A. The new and improved FileBench. In Proc. of USENIX FAST 2008 (San Jose, CA, USA, Feb 2008).
[69] XU, Q., SIYAMWALA, H., GHOSH, M., SURI, T., AWASTHI, M., GUZ, Z., SHAYESTEH, A., AND BALAKRISHNAN, V. Performance Analysis of NVMe SSDs and Their Implication on Real World Databases. In Proc. of ACM SYSTOR 2015 (Haifa, Israel, May 2015).
[70] Y. PARK, S., SEO, E., SHIN, J. Y., MAENG, S., AND LEE, J. Exploiting Internal Parallelism of Flash-based SSDs. IEEE Computer Architecture Letters (CAL) 9, 1 (2010), 9–12.
[71] ZHANG, C., WANG, Y., WANG, T., CHEN, R., LIU, D., AND SHAO, Z. Deterministic crash recovery for NAND flash based storage systems. In Proc. of ACM/EDAC/IEEE DAC 2014 (San Francisco, CA, USA, Jun 2014).
