Scalable Persistent Memory with Kernel-Userspace Collaboration

Youmin Chen, Youyou Lu, Bohong Zhu, Andrea C. Arpaci-Dusseau†, Remzi H. Arpaci-Dusseau†, Jiwu Shu∗
Tsinghua University    †University of Wisconsin–Madison
∗Jiwu Shu is the corresponding author ([email protected]).

Abstract

We introduce Kuco, a novel direct-access file system architecture whose main goal is scalability. Kuco utilizes three key techniques – collaborative indexing, two-level locking, and versioned reads – to offload time-consuming tasks, such as pathname resolution and concurrency control, from the kernel to userspace, thus avoiding kernel processing bottlenecks. Upon Kuco, we present the design and implementation of KucoFS, and then experimentally show that KucoFS has excellent performance in a wide range of experiments; importantly, KucoFS scales better than existing file systems by up to an order of magnitude for metadata operations, and fully exploits device bandwidth for data operations.

1 Introduction

Emerging byte-addressable persistent memories (PMs), such as PCM [22, 34, 51], ReRAM [3], and the recently released Intel Optane DCPMM [27], provide performance close to DRAM and data persistence similar to disks. Such high-performance hardware increases the importance of redesigning efficient file systems. In the past decade, the systems community has proposed a number of file systems, such as BPFS [11], PMFS [14], and NOVA [43], to minimize the software overhead caused by a traditional file system architecture. However, these PM-aware file systems are part of the kernel, and applications need to trap into the kernel to access them, where system calls (syscalls) and the virtual file system (VFS) still incur non-negligible overhead. In this regard, recent work [13, 21, 28, 39] proposes to deploy file systems in userspace to access file data directly (i.e., direct access), thus exploiting the high performance of PM.

Despite these efforts, we find that another important performance metric – scalability – still has not been well addressed, especially when multicore processors meet fast PMs. NOVA [43] improves multicore scalability by partitioning internal data structures and avoiding global locks. However, our evaluation shows that it still fails to scale well due to the existence of the VFS layer. Even worse, some userspace file system designs further exacerbate the scalability problem by introducing a centralized component. For example, Aerie [39] ensures the integrity of file system metadata by sending expensive inter-process communications (IPCs) to a trusted process (TFS) that has the authority to update metadata. Strata [21], as another example, avoids the involvement of a centralized component in normal operations by directly recording updates in PM logs, but requires a KernFS to apply them (including both data and metadata) to the file system, which causes one more round of data copying. The trusted process (e.g., TFS or KernFS) in both file systems is also responsible for concurrency control, which inevitably becomes the bottleneck under high concurrency.

In this paper, we revisit the file system design by introducing a kernel-userspace collaboration architecture, or Kuco, to achieve both direct-access performance and high scalability. Kuco follows a classic client/server model with two components, including a userspace library (named Ulib) to provide basic file system interfaces, and a trusted thread (named Kfs) placed in the kernel to process requests sent by Ulib and perform critical updates (e.g., to metadata).

Inspired by distributed file system designs, e.g., AFS [17], that improve scalability by minimizing server loads and reducing client/server interactions, Kuco presents a novel task division and collaboration between Ulib and Kfs, which offloads most tasks to Ulib to avoid a possible Kfs bottleneck. For metadata scalability, we introduce a collaborative indexing technique to allow Ulib to perform pathname resolution before sending requests to Kfs. In this way, Kfs can update metadata items directly with the pre-located addresses provided by Ulib. For data scalability, we first propose a two-level locking mechanism to coordinate concurrent writes to shared files. Specifically, Kfs manages a lease for each file and assigns it to the process that intends to write the file; threads within this process then lock the file with a range lock completely in userspace. Second, we introduce a versioned read protocol to achieve direct reads even without interacting with Kfs, despite the presence of concurrent writers.
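To make the client/server split described above concrete, the following is a minimal C sketch of how a userspace library could post a request into a shared message slot and wait for an in-kernel thread to answer it. The slot layout, field names, and opcodes (kuco_req, kuco_msg_slot, REQ_POSTED, etc.) are illustrative assumptions, not the actual Ulib/Kfs interface.

```c
/* Minimal userspace-side sketch of a Ulib -> Kfs request path, assuming a
 * per-process message buffer that the kernel thread polls.  All names are
 * hypothetical and chosen only for illustration. */
#include <stdatomic.h>
#include <stdint.h>

enum { REQ_FREE = 0, REQ_POSTED = 1, REQ_DONE = 2 };

struct kuco_req {
    uint32_t opcode;          /* e.g., CREAT, UNLINK                        */
    uint64_t prelocated_addr; /* metadata address resolved by Ulib          */
    char     name[64];        /* file name for the new dentry               */
    int32_t  retval;          /* filled in by the kernel thread             */
};

struct kuco_msg_slot {
    _Atomic int     state;    /* FREE -> POSTED (Ulib), POSTED -> DONE (Kfs) */
    struct kuco_req req;
};

/* Ulib side: publish a request, then poll until the kernel thread marks it
 * done.  The release/acquire pair orders the payload with the state flag.  */
static int ulib_call(struct kuco_msg_slot *slot, const struct kuco_req *req)
{
    slot->req = *req;
    atomic_store_explicit(&slot->state, REQ_POSTED, memory_order_release);
    while (atomic_load_explicit(&slot->state, memory_order_acquire) != REQ_DONE)
        ;                     /* Kfs scans the slots and serves posted ones  */
    int ret = slot->req.retval;
    atomic_store_explicit(&slot->state, REQ_FREE, memory_order_release);
    return ret;
}
```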

Kuco also includes techniques to enforce data protection and improve baseline performance. Kuco maps the PM space into userspace in readonly mode to prevent buggy programs from corrupting file data. Userspace direct writes are achieved with a three-phase write protocol: before Ulib writes a file, Kfs switches the related PM pages from readonly to writeable by toggling the permission bits in the page table. A pre-allocation technique is also used to reduce the number of interactions between Ulib and Kfs when writing a file.

With the Kuco architecture, we build a PM file system named KucoFS, which gains userspace direct-access performance and delivers high scalability simultaneously. We evaluate KucoFS with file system benchmarks and real-world applications. The evaluation results show that KucoFS scales better than existing file systems by an order of magnitude under high-contention workloads (e.g., creating files in the same directory or writing data to a shared file), and delivers slightly higher throughput under low contention. It also hits the bandwidth ceiling of PM devices for normal data operations. In summary, we make the following contributions:

• We conduct an in-depth analysis of state-of-the-art PM-aware file systems and summarize their limitations in solving the software overhead and scalability problems.
• We introduce Kuco, a userspace-kernel collaboration architecture with three key techniques, including collaborative indexing, two-level locking, and versioned reads, to achieve high scalability.
• We implement a PM file system named KucoFS based on the Kuco architecture, and experimentally show that KucoFS achieves up to one order of magnitude higher scalability for metadata operations, and fully exploits the PM bandwidth for data operations.

Table 1: Comparison of different NVM-aware file systems (NOVA, Aerie/Strata, ZoFS, SplitFS, and KucoFS) in terms of category (kernel, userspace, or hybrid), scalability of metadata, read, and write operations, software overhead, avoidance of stray writes, read protection model, visibility of updates, and required hardware (e.g., MPK for ZoFS).

2 Motivation

In the past decade, researchers have developed a number of PM file systems, such as BPFS [11], SCMFS [41], PMFS [14], HiNFS [29], NOVA [43], Aerie [39], Strata [21], SplitFS [28], and ZoFS [13]. They are broadly categorized into three types. First, kernel-level file systems: applications access them by trapping into the kernel for both data and metadata operations. Second, userspace file systems (e.g., Aerie [39], Strata [21], and ZoFS [13]). Among them, Aerie [39] relies on a trusted process (TFS) to manage metadata and ensure its integrity; the TFS also coordinates concurrent reads and writes to shared files with a distributed lock service. Strata [21], in contrast, enables applications to append their updates directly to a per-process log, but requires background threads (KernFS) to asynchronously digest logged data to storage devices. ZoFS avoids using a centralized component and allows userspace applications to update metadata directly with the help of a new hardware feature named Intel Memory Protection Keys (MPK). Note that Aerie, Strata, and ZoFS still rely on the kernel to enforce coarse-grained allocation and protection. Third, hybrid file systems (e.g., SplitFS [28] and our proposed Kuco). SplitFS [28] presents a coarse-grained split between a user-space library and an existing kernel file system: it handles data operations entirely in userspace, and processes metadata operations through the Ext4 file system. Table 1 provides a summary of existing PM-aware file systems and how well they behave in various aspects.

❶ Multicore scalability. NOVA [43], a state-of-the-art kernel file system for PMs, is carefully designed to improve scalability by introducing a per-core allocator and per-inode logs. Nevertheless, VFS still limits its scalability for certain operations. We experimentally show this by deploying NOVA on Intel Optane DCPMMs (the detailed experimental setup is described in §5.1) and using multiple threads to create, delete, or rename files in the same directory. As shown in Figure 1a, the throughput of these operations is almost unchanged as we increase the number of threads, since VFS needs to acquire the lock of the parent directory.
Aerie [39] relies on a centralized TFS to handle metadata operations and enforce concurrency control. Although Aerie batches metadata changes to reduce communication with the TFS, our evaluation in §5 shows that the TFS still inevitably becomes the bottleneck under high concurrency. In Strata [21], the KernFS needs to digest logged data and metadata in the background; if an application completely uses up its log, it has to wait for an in-progress digest to complete before it can reclaim log space. As a result, the number of digestion threads limits Strata's overall scalability. Both Aerie and Strata interact with the trusted process (TFS/KernFS) via expensive IPCs, which introduces extra syscall overhead. ZoFS does not require a centralized component, so it achieves much higher scalability. However, ZoFS still fails to scale well when processing operations that require allocating new space from the kernel (e.g., creat and append; see Figures 7d, 7f, and 7g in their paper). Our evaluation shows that SplitFS scales poorly for both data and metadata operations because it 1) does not support sharing between different processes, and 2) relies on Ext4 to update metadata (see Figures 7 and 9).

❷ Software overhead. Placing a file system in the kernel incurs two types of software overhead, i.e., syscall and VFS overhead. We investigate such overhead by again analyzing NOVA, collecting the latency breakdown of common file system operations. Each operation is performed on 1 million files or directories with a single thread. We make two observations from Figure 1b. First, syscalls take up to 21% of the total execution time (e.g., for mkdir and open). Also, after a process traps into the kernel, the OS may schedule other tasks before returning control to the original one; hence, syscalls bring extra uncertainty for latency-sensitive applications [12, 33]. Second, kernel file systems are implemented by overriding VFS functions, and VFS causes non-negligible overhead. Although recent PM file systems bypass the page cache in VFS, we find that the time spent in the VFS layer is still 34% on average for NOVA. ZoFS [13] deploys a file system in userspace to avoid trapping into the kernel; however, it still incurs extra software overhead. ZoFS allows userspace applications to update metadata directly, which may cause a normal program to be terminated when it accesses metadata that has been corrupted by malicious attackers. To achieve a graceful error return, ZoFS invokes a sigsetjmp at the beginning of each syscall, which causes extra delays (∼200 ns). SplitFS requires a kernel file system to handle metadata operations, so it still introduces syscall overhead.

Figure 1: Software overhead and scalability of NOVA. (a) Throughput (Mops/s) of create, rename, and unlink with an increasing number of threads; (b) latency breakdown (VFS vs. syscall, in µs) of common operations such as open, stat, mkdir, mknod, rename, unlink, read-1K, and write-1K.

❸ Other issues. First, misused pointers can lead to writes to incorrect locations that corrupt the data, which is known as stray writes [14]. Strata [21] and SplitFS [28] expose the per-process operation log and the DRAM cache to userspace applications, and Aerie and ZoFS map a subset of the file system image (including both metadata and data) to userspace. Hence, stray writes can easily corrupt data in these areas, and such corruptions are permanent in NVM, even after reboots. Second, Aerie, Strata, and SplitFS improve performance by delaying the visibility of newly written data (e.g., until a batched commit or digestion), forcing applications to make corresponding adjustments. Third, ZoFS relies heavily on the MPK mechanism; if an application also needs to use MPK, they may compete for the limited MPK resources.

To summarize, it is hard to achieve high scalability and low software overhead simultaneously with existing file system designs, and this motivates us to introduce the Kuco architecture to show that a client/server model can be adopted to realize the two goals simultaneously.

3 The Kuco Architecture

3.1 Overview

Figure 2 shows the Kuco architecture. It follows a client/server model with two parts, including a userspace library and a global kernel thread, which are called Ulib and Kfs, respectively. An application accesses KucoFS by linking with Ulib, and Ulib interacts with Kfs via a separate memory message buffer. Like existing userspace file systems [21, 39], Kuco maps the PM space to userspace to support direct read and write accesses. The central idea underlying Kuco is a fine-grained task division and collaboration between the client and the server, where most loads are offloaded to the client part to avoid the server becoming the bottleneck.

Figure 2: The Kuco architecture. Reads are direct accesses via the versioned read protocol; writes are direct accesses based on the three-phase write protocol; metadata requests are posted to Kfs, which performs collaborative indexing and manages leases and page permissions.

To protect file system metadata from being corrupted, Kuco does not allow applications to update metadata directly; instead, such requests are posted to Kfs, which updates metadata on their behalf. Kuco delivers high scalability with a fine-grained task division and collaboration between Ulib and Kfs. For metadata scalability, Kuco incorporates the collaborative indexing mechanism to offload the pathname traversal job from Kfs to userspace (§3.2). Instead of sending metadata operations (e.g., creat or unlink) to Kfs directly, Ulib first finds all the related metadata items in userspace, and then encapsulates such information in the request before sending it out. Therefore, Kfs can perform metadata modifications directly with the given addresses. For data scalability, a two-level locking mechanism is used to handle concurrent writes to shared files (§3.3). Specifically, Kfs uses a lease-based distributed lock to resolve write conflicts between different applications (or processes), while concurrent writes from the same process are serialized using a pure userspace range lock, which can be acquired without the involvement of Kfs. When reading a file, Kuco adopts the versioned read technique to perform file reading in userspace (§3.5). By adding extra version bits in data block mappings (which map file logical addresses to PM physical blocks), Ulib can read a consistent version of data without interacting with Kfs, despite the presence of other concurrent writers.

To further prevent buggy programs from corrupting file data, the PM space is mapped to userspace in readonly mode. Kuco enables userspace direct writes on readonly addresses with a three-phase write protocol (§3.4). Before Ulib writes a file, Kfs first switches the involved data pages from readonly to writable by modifying the permission bits in the page table. To further reduce the number of interactions between Ulib and Kfs when writing a file, Ulib can allocate more free pages from Kfs than desired (pre-allocation). In addition to the write protection mechanism that prevents stray writes, the PM space in Kuco is divided into different partition trees, which act as the minimum unit for read protection. By applying Kuco in a file system named KucoFS and putting all these techniques together, KucoFS gains direct-access performance, delivers high scalability, and ensures kernel-level data protection.
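As a preview of the creat flow that §3.2 details, the sketch below shows the division of labor under stated assumptions: the directory's dentry list is modeled as a plain sorted linked list (the paper uses a skip list), Ulib locates the predecessor in its readonly mapping, and Kfs validates the pre-located address before inserting. All type and function names are hypothetical.

```c
/* Collaborative indexing sketch: Ulib runs find_predecessor(); Kfs runs
 * kfs_creat() with the address shipped in the creat request. */
#include <errno.h>
#include <stdint.h>
#include <string.h>

struct dentry {
    uint64_t       hash;    /* hash of the file name                        */
    char           name[64];
    uint8_t        dirty;   /* invalid/deleted marker                       */
    struct dentry *next;
};

/* Ulib (userspace, readonly mapping): locate the predecessor of hash h.
 * `head` is a sentinel node with the smallest possible hash.               */
static struct dentry *find_predecessor(struct dentry *head, uint64_t h)
{
    struct dentry *p = head;
    while (p->next && p->next->hash < h)
        p = p->next;
    return p;               /* its address is shipped in the creat request  */
}

/* Kfs (kernel): validate the pre-located predecessor, then insert. */
static int kfs_creat(struct dentry *pred, uint64_t h, const char *name,
                     struct dentry *new_dentry)
{
    /* pred must still be live and still the predecessor, and no dentry with
     * the same name may already exist. */
    if (pred->dirty || (pred->next && pred->next->hash < h))
        return -EAGAIN;     /* fall back: Kfs resolves the path itself      */
    if (pred->next && pred->next->hash == h &&
        strcmp(pred->next->name, name) == 0)
        return -EEXIST;

    new_dentry->hash = h;
    strncpy(new_dentry->name, name, sizeof(new_dentry->name) - 1);
    new_dentry->name[sizeof(new_dentry->name) - 1] = '\0';
    new_dentry->dirty = 0;
    new_dentry->next = pred->next;
    pred->next = new_dentry; /* single pointer store keeps readers consistent */
    return 0;
}
```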
3.2 Collaborative Indexing

In a typical client-server model, whenever Kfs receives a metadata request, it needs to find the related metadata (e.g., inodes that describe file attributes, or dentries that map file names to inode numbers) by performing iterative pathname resolution from the root inode to the directory containing this file. Such pathname traversal overhead is a heavy burden for Kfs, especially when a directory contains a large number of sub-files or the directory hierarchy is deep.

To address this issue, we propose to offload the pathname resolution task from Kfs to Ulib. By mapping partition trees to userspace, Ulib can find the related metadata items directly in userspace, and then send a metadata update request to Kfs, encapsulating the metadata addresses in the request as well. In this way, Kfs can update metadata directly with the given addresses, and the pathname resolution overhead is offloaded from Kfs to userspace.

Figure 3: Creating a file (➀-➃) with collaborative indexing.

Figure 3 shows how Kuco creates a file with the pathname "/Bob/a". Ulib first finds the predecessor dentry of file "a" in the dentry list of "Bob" (➀). It then sends a creat request to Kfs, and the address of the predecessor is put in the message too (➁). Kfs then creates the file after receiving the request (➂➃), which includes creating an inode for this file and then inserting a new dentry into the parent directory's dentry list after the given predecessor. To delete a file, both the inode of this file and the dentry in the parent directory should be deleted, so both of their addresses are kept in the unlink request before Ulib sends it. Note that atime is disabled by default, enabling readonly operations (e.g., stat, readdir) to be performed in userspace without posting extra requests to Kfs.

In Kuco, Ulibs produce pointers and Kfs consumes them. This "one-way" pointer sharing paradigm simplifies ensuring the correctness and safety of Kuco. On the one hand, metadata items are placed in a metadata area with a separate address space, and Ulib can only pass the addresses of two types of metadata items (i.e., dentries and inodes). Hence, we add an identifier field at the beginning of each metadata item, which helps Kfs check the metadata type – any address not in the metadata area or not pointing to a dentry/inode is considered invalid. On the other hand, Kfs also performs consistency checking based on the file system's internal logic:

First, Ulib might read an inconsistent directory tree. For example, when Kfs is creating new files in a directory, concurrent Ulibs may read an inconsistent dentry list of this directory. To address this issue, we organize the dentry list of each directory with a skip list [32], and each dentry is indexed by the hash value of the file name. A skip list has multiple layers of linked-list-like structure, where each higher layer acts as an "express lane" for the lower layer. The list-based structure enables lock-free atomic updates by performing pointer manipulations. Besides, only insert and delete operations are performed on the dentry list, and only by the single Kfs; rename operations are performed by first inserting a new node and then deleting the old one. Therefore, a read of a dentry always observes a consistent item, even without acquiring a lock.

Second, with such a lock-free design, userspace applications may read metadata items that are being deleted by Kfs, causing a "read-after-delete" anomaly. To safely reclaim the deleted items, we need to ensure that no threads access them anymore. We address this issue by using an epoch-based reclamation (EBR) mechanism [15]. EBR maintains a global epoch and three reclaim queues, where the execution is divided into epochs and reclaim queues are maintained for the last three epochs. Each thread also owns a private epoch. Items deleted in epoch e are placed into the queue for epoch e. Each time Ulib starts an operation, it reads the global epoch and updates its own epoch to be equal to the global one. It then checks the private epochs of the others. If all Ulibs are active in the current epoch e, then a new epoch begins. At this time, all threads are active either in e or in e+1, and items in the queue related to e-1 can be reclaimed safely. We also add a dirty flag to each inode/dentry. Kfs deletes a metadata item by setting its dirty flag to an invalid state, preventing applications from reading already deleted items.

Third, Kfs needs to handle conflicting metadata operations properly. For example, when multiple Ulibs are performing metadata operations concurrently, the pre-located metadata item of one Ulib might be deleted or renamed by another concurrent Ulib before Kfs accesses it. Hence, this item is no longer valid and its address cannot be used by Kfs anymore. It is also possible that a malicious process attacks Kfs by providing arbitrary addresses. Luckily, only Kfs can update metadata, and it can validate the pre-located metadata before processing the operation. Specifically, Kfs checks whether the pre-located item still exists and is still the predecessor, and avoids creating files with the same name. When the validation fails, Kfs resolves the pathname itself and returns an error code to the Ulib if the operation still fails.

Discussion. First, Kuco ensures that all metadata operations are processed atomically. For creat, Kfs atomically inserts a new dentry into the skip list only after an inode has been created, to make the created file visible; for unlink, it

Figure 4: Direct access range-lock. Each opened file owns a range-lock; ❶-❹ show the steps to acquire a lock.

The lock item is inserted into a specific slot in the ring buffer (❷ and ❸); the location is determined by the fetched version modulo the ring buffer size. The insertion is blocked when this slot overlaps with the head of the ring buffer. After this, Ulib traverses the ring buffer backward to find the first conflicting lock item (i.e., one whose written data overlaps). If such a conflict exists, Ulib verifies its checksum, and then polls on its state until it is released. Ulib also checks its ctime field repeatedly to avoid
the deadlock if a thread aborts before it releases the lock (❹). ff atomically deletes the dentry before deleting other fields. With this design, multiple threads can write di erent data Rename involves updating two dentries (create a new entry in pages in the same file concurrently. the destination path, and then delete the old one), so a program can see two same files on both places at some point in time. 3.4 Three-Phase Write Ulib We leverage the dirty flag in each dentry to prevent such an Once the lock has been required, can actually write inconsistent state. Specifically, the old entry on the source file data. Since PM spaces are mapped to userspace in Ulib path is set to dirty before creating the new entry, and is then set readonly mode, cannot write file data directly. Instead, to invalid after the new entry is created. As a whole, we can we propose a three-phase write protocol to perform direct uco observe that metadata operations always change the directory writes. To ensure the crash consistency, K follows a copy- tree atomically, and Ulib is guaranteed to have a consistent on-write (CoW) approach to write file data, where the newly view of the directory tree even without acquiring the lock. written data is always redirected to new PM pages. Similar Second, Kuco’s scalability is further improved by avoiding to NOVA [43] and PMFS [14], we use 4 KB as the default uco using locks — concurrent metadata updates are all delegated data page size. The write protocol in K consists of three Ulib to the global Kfs, so they can be processed without any locking steps. First, locks the file via two-level locking and sends Kfs overhead (only Kfs can update metadata) [16, 35]. Kuco a request to to allocate new PM pages. Note that, by ensures the crash consistency of metadata via an operation using a CoW way, space allocation is necessary for both Kfs log, which will be discussed in §4.2. overwrite and append operations. also needs to modify the related page table entries to make these allocated PM 3.3 Two-Level Locking pages writable before sending the response message back. Kuco introduces a two-level locking service to coordinate Second, Ulib copies both the unmodified data from the old concurrent writes to shared files, which prevents Kfs from place and new data from the user buffer to the allocated PM being frequently involved in concurrency control. First, Kfs pages, and persists them via flush instructions. Third, Ulib assigns write leases (in the kernel, see Figure 2) on files sends another request to Kfs to update the metadata of this file to enforce coarse-grained coordination between different (i.e., inode, block mapping), switch the newly written pages processes, as in Aerie and Strata [21, 39]. Only the process to readonly, and finally releases the lock. that holds a valid write lease (not yet expired) can write the Furthermore, we introduce the pre-allocation mechanism file. We assume that Ulib applies for leases infrequently, and to avoid allocating new PM pages from Kfs for every write this is based on the fact that it is not the common case for operations. Specifically, we allow Ulib to allocate more multiple processes to frequently and concurrently write the free pages from Kfs than desired (4 MB at a time in our same file. More fine-grained sharing between processes can implementation). In this way, Ulib can use local free PM be achieved via or pipes [21]. Read leases are pages without interacting with Kfs for most write operations. not needed in Kuco (see Section 3.5). 
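The write path of §3.3 and §3.4 can be condensed into the hedged sketch below. The helper functions stand in for the lease/range-lock service, the Kfs allocation and commit requests, and PM persistence; they are illustrative stubs under assumed names, not KucoFS's real API, and error handling is reduced to the bare minimum.

```c
/* Two-level locking + three-phase write, as a compact sketch. */
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

struct range_lock { _Atomic uint64_t version; /* head of the per-file slot ring */ };

/* Step one of the range lock: order this write among threads of the process. */
static uint64_t range_lock_acquire(struct range_lock *rl)
{
    return atomic_fetch_add(&rl->version, 1);
}
static void range_lock_release(struct range_lock *rl, uint64_t v)
{
    (void)rl; (void)v;                  /* would mark the slot as released    */
}

static void *kfs_alloc_pages(size_t n)  /* phase 1: ask Kfs, or hit the       */
{                                       /* pre-allocated pool                 */
    static char pool[PAGE_SIZE];
    (void)n;
    return pool;
}
static void kfs_commit_write(void *pages, size_t off, size_t len)
{                                       /* phase 3: Kfs updates the block     */
    (void)pages; (void)off; (void)len;  /* mapping, re-protects the pages,    */
}                                       /* and appends an operation-log entry */
static void pm_persist(const void *p, size_t n)
{
    (void)p; (void)n;                   /* clwb/clflush + sfence on real PM   */
}

static int kuco_write_page(struct range_lock *rl, const void *old_page,
                           const void *buf, size_t off_in_page, size_t len)
{
    uint64_t v = range_lock_acquire(rl);   /* write lease held by the process */
    char *page = kfs_alloc_pages(1);

    /* Phase 2: copy-on-write entirely in userspace, then persist. */
    memcpy(page, old_page, PAGE_SIZE);     /* keep the unmodified bytes       */
    memcpy(page + off_in_page, buf, len);  /* apply the new data              */
    pm_persist(page, PAGE_SIZE);

    kfs_commit_write(page, off_in_page, len);
    range_lock_release(rl, v);
    return 0;
}
```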
When an application exits, the unused pages are given back to Second, we introduce a direct access range-lock to serialize the Kfs. For an abnormal , these free pages are temporarily concurrent writes between threads within the same process. non-reusable by other applications, but still can be reclaimed Once a Ulib acquires the write lease of a file, it creates a range during the recovery phase (see §4.2). Pre-allocation also lock for this file in userspace, which is actually a DRAM helps with reducing the overhead of updating page table ring buffer (as shown in Figures 4). A thread writes a file entries. When the Kfs updates page table entries after each by acquiring the range-lock first, and it is blocked if a lock allocation, it needs to flush the related TLB entries explicitly conflict occurs. Each slot in the ring buffer has five fields, to make the modifications visible. Pre-allocation allows which are state, offset, size, ctime, and a checksum. The allocating multiple data pages at a time, so the TLB entries checksum is the hash value of the first four fields. We also can be flushed in batch. place a version at the head of each ring buffer to describe the order of each write operation. To acquire the lock of 3.5 Versioned Read a file, Ulib firstly increments the version with an atomic In the write protocol, both old and new versions of data pages fetch_and_add (i.e., ❶). It then inserts a lock item into are temporarily kept due to the CoW way, providing us the Block mapping item inode Direct inode … User: Bob a8r start version end pointer start = 1 table creat(/Bob/a) data 2 0 0 1 55 56 96 page end = 1 inode b Bob 1 … DRAM a Dentry List 2 (a) V1 V1 V1 V2 V2 3 0 Indirect Block LogEntry: … 1 block (b) V1 V1 V3 V3 V3 mapping Op: Create n … 3 inode: 4 Block n mapping mapping (c) V4 V4 V4 V3 V3 checkpoint Tail ptr 1 p_inode: 7 NVM … Append name: a Figure 5: Block mapping format and the versioned read Data pages Metadata pages Opera7on log Attr: … protocol. Mapping items with the same version correspond to the uco same write operation. The above three consistent cases describe how Figure 6: Data layout of a partition tree in K FS. creat the start and end bits can be formatted when the version changes. operation with three steps is also shown. ff opportunity to read file data even without blocking writes. sioned read is slightly di erent from legacy locked read in However, block mappings that map a logical file to physical that it allows concurrent writes. For example, a write starts pages are still updated in place by Kfs. This dirves us to and has not yet been completed, but in-between, there is a read, design the versioned read mechanism to achieve user-level which reads an old snapshot of data. In this case, the execution ➛ ➛ direct reads without any involvement of the Kfs, regardless of still equals to a serializable order (e.g., “read write”, “ ” concurrent writers. indicates happens-before). Versioned read has the same semantic as locked read within each thread, because a read or Versioned Read is designed to allow userspace reads write has to complete before issuing the next one. without locking the file, while ensuring that readers never read uco data from incomplete writes. To achieve this, K uses an 4 KucoFS Implementation -like [6] block mapping to index data pages and embeds a version field in each pointer of the block mapping. 
As In this section, we describe how the Kuco architecture is shown in Figure 5, each 96-bit block mapping item contains applied in a persistent memory file system named KucoFS. four fields, which are start, version, end and pointer. For a write operation, say, writing three data pages, Kfs updates 4.1 Data Layout the related block mapping items with the following format: KucoFS organizes partition trees of Kuco in a hybrid way 1|V1|0|P1 0|V1|0|P2 0|V1|1|P3. In particular, all three items using both DRAM and PM (Figure 6). In DRAM, an array Ulib share the same version (i.e., V1), which is provided by of pointers (inode table) is placed at a predefined location to when it acquires the range lock (in Section 3.3). The start bit point to the actual inodes. The first element in the inode table of the first item and the end bit of the last item are set to 1. points to the root inode of the current partition tree. With this, We only reserve 40-bit for the pointer field since it points to a Ulib can find any files from the root inode in userspace. As 4 KB-aligned page and the lower 12 bits can be discarded. discussed before, the dentry list of a directory is organized With this format, readers can read a consistent snapshot of into a skip list, which is also placed in DRAM. data pages when one of the three cases is met in Figure 5: For efficiency, KucoFS only operates on the DRAM meta- a) No overlapping. When two updates to a file are performed data for normal requests. To ensure the durability and crash on non-overlapping pages, items with the same version consistency of metadata, KucoFS places an append-only should be enclosed with both a start bit and an end bit (V1 persistent operation log in PM for each partition tree. When and V2 in case a). the Kfs updates the metadata, it first atomically appends a b) Overlaps the end part. When a thread overwrites the end log entry, and then actually updates the DRAM metadata part of a former write, a reader should always see a start (see §4.2). When system failures occur, the DRAM metadata bit when the version increases (V1➔V3 in case b). can always be recovered by replaying the log entries in the c) Overlaps the front part. When a thread overwrites the first operation log. In addition to the operation log, the extra PM half of a former write, a reader should always see an end space is cut into 4 KB data pages and metadata pages. Free bit before the version decreases (V4➔V3 in case c). PM pages are managed with both a bitmap in PM and a free If Ulib meets any case other than the above three cases, it list in DRAM (for fast allocation), and the bitmap is lazily indicates that Kfs is updating the block mapping for some persisted by the Kfs during the checkpoint phase. other incomplete writes. In this case, Ulib needs to validate again by re-scanning the sequence of the related versions. 4.2 Crash Consistency and Recovery After Ulib succeeds in the version checking, it then reads Metadata consistency. KucoFS ensures the metadata con- the associated data pages. As a whole, Kuco utilizes the sistency by ordering updates to DRAM and PM. Figure 6 embedded versions to detect incomplete writes and retries shows the steps of how Kfs creates a file when it receives a until reading a consistent snapshot of data. creat request from Ulib. In ➊, Kfs reserves an unused inode Read Semantics. In a multi-thread/process execution, ver- number from the inode table and appends a log entry to the operation log. 
This log entry records the inode number, file we plan to reduce the DRAM footprint by only keeping active name, parent directory inode number, and other attributes. In metadata in DRAM. ➋, it allocates an inode with each field filled, and updates the inode table to point to this inode. In ➌, it then inserts 4.3 Write Protection a dentry into the dentry list with the given address of the KucoFS strictly controls updates to the file system image. predecessor, to make the created file visible. A creation fails Both in-memory metadata and the persistent operation log if the same dentry already exists (avoid creating the same are critical, so the Kfs in the kernel is the only one that is files). To delete a file, Kfs appends a log entry first, deletes allowed to update them. File pages are mapped to userspace the dentry in the parent directory with the given addresses, in readonly mode. Applications can only write data to and finally frees the related spaces (e.g., inode, data pages and newly allocated PM pages and existing data pages cannot block mapping). If a crash happens before the operation is be modified. KucoFS also provides process-level isolation finished, the DRAM metadata updates will be lost, but Kfs can for userspace data structures. The message buffer and range reconstruct them to the newest state by replaying the log after locks are privately owned by each process, so an attacker recovery. For rename oprations, except for system failures, cannot access them in other processes, except that it performs the kernel thread may crash and cause the dirty flag to be a privilege escalation attack. Such security issues are out of in an inconsistent state. However, we consider the whole file the scope of this work. As such, we conclude that KucoFS system crashes if the kernel thread crashes, which requires the achieves the same write protection as kernel file systems. file system to be rebooted, and the above logging technique Preventing stray writes. Unlike many existing userspace ensures that rename operation is also crash-consistent. file systems that are vulnerable to stray writes [21, 28, 39], Data consistency. KucoFS handles file write operations by KucoFS prevents this issue by mapping the PM space in first updating data pages in a CoW way, and then appending readonly mode. Note that there is still a temporary, writable a log entry in the operation log to record the metadata window (less than 1 µs) for the newly-written pages after a modifications. At this point, the write is considered durable. write operation is finished but before the permission bits are Then, KucoFS can safely update DRAM metadata to make changed. This is unavoidable, as same as in existing kernel file this operation visible. when a system failure occurs before systems like PMFS. Fortunately, this rarely happens. Besides, the log entry is persisted, KucoFS can roll back to its last range locks and message buffers in userspace might also be consistent state since old data and metadata are untouched. corrupted by stray writes. For this threat, we add checksum Otherwise, this write operation is made visible by replaying and lease fields at each slot, which can be used to check the operation log after recovery. whether the inserted element has been corrupted or not. Log cleaning and recovery. We introduce a checkpoint mech- anism to avoid the operation log from growing arbitrarily. 
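A minimal sketch of the ordering that this crash-consistency scheme relies on, persisting the operation-log entry before the volatile DRAM metadata is touched, is shown below. The entry layout and the flush helper are assumptions for illustration; a real implementation would prefer clwb or non-temporal stores over clflush.

```c
/* Append-then-update ordering for metadata crash consistency (sketch). */
#include <emmintrin.h>   /* _mm_clflush, _mm_sfence */
#include <stdint.h>
#include <string.h>

#define CACHELINE 64

struct log_entry {           /* 64 bytes: fits one cache line                */
    uint32_t op;             /* e.g., OP_CREATE                              */
    uint32_t inode;          /* inode number reserved for the new file       */
    uint32_t p_inode;        /* parent directory inode number                */
    char     name[52];       /* file name                                    */
};

static void pm_persist(const void *addr, size_t len)
{
    for (size_t i = 0; i < len; i += CACHELINE)
        _mm_clflush((const char *)addr + i);
    _mm_sfence();
}

/* Append to the per-partition operation log in PM; the caller then updates
 * the DRAM metadata (inode table, dentry skip list) only afterwards.        */
static struct log_entry *oplog_append(struct log_entry *log, uint64_t *tail,
                                      const struct log_entry *e)
{
    struct log_entry *slot = &log[*tail];
    *slot = *e;
    pm_persist(slot, sizeof(*slot));   /* the entry is durable ...           */
    *tail += 1;
    pm_persist(tail, sizeof(*tail));   /* ... before the tail pointer moves  */
    return slot;                       /* caller now updates DRAM metadata   */
}
```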
4.4 Read Protection When the Kfs is not busy, or the size of the log exceeds a KucoFS organizes its directory tree with partition trees, which threshold (1MB on our implementation), we use a background act as the minimal unit for access control. Each partition tree kernel thread to trigger a checkpoint, which applies metadata is self-contained, consisting of metadata and data in PM, and modifications in the operation log to PM metadata pages. the related metadata copy in DRAM. KucoFS does not allow The bitmap that is used to manage the PM free pages is file/directory structures to span across different partitions. updated and persisted as well. After that, the operation log When a program accesses KucoFS, only the partition trees is truncated. Background digestion never blocks front-end it has access to are mapped to its , but other operations, and the only impact is that log cleaning consumes partition trees are invisible to it. extra PM bandwidth. However, metadata are typically small- In KucoFS, read access control is strengthened with the sized and bandwidth consumption is not high. following compromises. First, similar to existing userspace Each time KucoFS is rebooted from a crash, Kfs first file system [13,39], KucoFS cannot support “write-only” or replays the un-checkpointed log entries in the operation log, complex permission semantics such as POSIX access control so as to make PM metadata pages up-to-date. It then copies lists (ACLs), since existing page table only has a single bit PM metadata pages to DRAM. The free list of PM data pages to indicate a page is readonly or read-write. Second, KucoFS is also reconstructed according to the bitmap stored in PM. does not support flexible data sharing between users, because Crashing again during the recovery is not a concern since it is hard to change the permission of a specific file (e.g., via the log has not yet been truncated and can be replayed again. ) with the partition tree design [13,21,31]. Yet there are Keeping redundant copies of metadata between DRAM and several practical approaches: ❶ creating a standalone partition PM introduces higher consumption of PM/DRAM space, but that applications with different permissions have access to it; we believe it is worth the efforts. With structured metadata ❷ posting user-level RPCs between different applications to in DRAM, we can perform fast indexing directly in DRAM; acquire the data. We believe such a tradeoff is not likely to be appending log entries in the log saves the number of updates an obstacle, since KucoFS still supports efficient data sharing to PMs, which reduces the persistence overhead. In the future, between applications within the same user, which is the more common case in real-world scenarios [13]. cache. Ulib also needs to post an extra request to the Kfs to allocate more free data pages once the pre-allocated space 4.5 Memory-Mapped I/O is used up. Finally, Ulib sends the write request to the Kfs to Supporting DAX feature in a copy-on-write file system needs finish the rest steps, including changing the permission bits extra efforts, since files are out-of-place updated in normal of the written data pages to readonly, appending a log entry write operations [43]. Besides, DAX leaves great challenges to describe this write operation, and updating the DRAM for programmers to correctly use PM space with atomicity metadata. Ulib finally unlocks the file in the range lock. and crash consistency. Taking these factors into consideration, Read. 
KucoFS enables reading file data without interacting we borrow the idea from NOVA to provide atomic-, with the Kfs. To read the first 4 KB from this file, Ulib finds which has higher consistency guarantees. When an application the inode in userspace and reads the first block mapping item. maps a file into userspace, Ulib copies file data to its privately The version checking is performed to ensure its state satisfies managed data pages, and then sends a request to Kfs to one of the three conditions described in Section 3.5. After map these pages into contiguous address space. When the this, Ulib can safely read the data page pointed by the pointer application issues a msync , Ulib then handles it in the mapping item. as a write operation, so as to atomically make the updates in Close. Ulib also needs to send a close request to the Kfs these data pages visible to other applications. upon closing this file. Kfs then reclaims the write lease since it will not access this file anymore. 4.6 KucoFS’s KucoFS provides a POSIX-like interface, so existing appli- cations are able to access it without any modifications to the 5 Evaluation . We achieves this by setting the LD_PRELOAD In our evaluation, we try to answer the following questions: environment variable. Ulib intercepts all APIs in standard C • Does KucoFS achieve the goal of delivering direct access library that are related to file system operations. Ulib processes performance and high scalability? syscalls directly if the prefix of an accessed file matches • How does each individual technique in KucoFS help with with a predefined string (e.g., /kuco/usr1). Otherwise, the achieving the above goals? syscalls is processed in legacy mode. Note that read or write • How does KucoFS perform under macro-benchmark and operations only pass file descriptors in the parameter list. real-world applications? Ulib distinguishes them from legacy syscalls via a mapping table [23], which tracks files of KucoFS. 5.1 Experimental Setup 4.7 Examples: Putting It All Together Testbed. Our experimental testbed is equipped with 2× Intel Xeon Gold 6240M CPUs (36 physical cores), 384 GB DDR4 Finally, we summarize the design of the Kuco architecture DRAM, and 12 Optane DCPMMs (256GB per module, 3 TB and KucoFS by walking through an example of writing 4 KB in total). We perform all experiments on the Optane DCPMMs of data to a new file and then reading it out. residing on NUMA 0 (1.5 TB), whose read bandwidth peaks Open. Before sending an open request, Ulib pre-locates the at 37.6 GB/s and the write bandwidth is 13.2 GB/s. The server related metadata first. Since this is a new file, Ulib cannot is installed with 19.04 and Linux 5.1, the newest find it directly. Instead, it finds the predecessor in its parent kernel version supported by NOVA. directory’s dentry list for latter creation. The address, as well Compared systems. We evaluate KucoFS against NVM- as other information (e.g., file name, O_CREAT flags, etc.), aware file systems including PMFS [14], NOVA [43], are encapsulated in the open request. When the Kfs receives SplitFS [28], Aerie [39], and Strata [21], as well as traditional the request, it creates this file based on the given address. It file systems with DAX support including Ext4-DAX [2] and also needs to assign a write lease to this process. Then, the XFS-DAX [38]. Strata only supports a few applications and Kfs sends a response message. After this, Ulib creates a file has trouble running multi-threaded workloads. 
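The interception layer sketched below illustrates the LD_PRELOAD approach described in §4.6, assuming a hypothetical /kuco/ prefix and a kuco_open() entry point; only open() is shown, and the real Ulib additionally tracks descriptors in a mapping table to route read/write calls.

```c
/* Sketch of libc-call interception via LD_PRELOAD; names are placeholders. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <string.h>
#include <sys/types.h>

#define KUCO_PREFIX "/kuco/"

static int kuco_open(const char *path, int flags, mode_t mode)
{
    /* Would run Ulib's userspace open path (pathname pre-location, creat
     * request to Kfs, fd bookkeeping in the mapping table). */
    (void)path; (void)flags; (void)mode;
    return -1;
}

int open(const char *path, int flags, ...)
{
    mode_t mode = 0;
    if (flags & O_CREAT) {
        va_list ap;
        va_start(ap, flags);
        mode = (mode_t)va_arg(ap, int);
        va_end(ap);
    }
    if (strncmp(path, KUCO_PREFIX, strlen(KUCO_PREFIX)) == 0)
        return kuco_open(path, flags, mode);

    /* Legacy mode: forward to the real libc implementation. */
    int (*real_open)(const char *, int, ...) =
        (int (*)(const char *, int, ...))dlsym(RTLD_NEXT, "open");
    return real_open(path, flags, mode);
}
```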
Similar to descriptor and a range lock for this opened file, and returns to previous papers [13,49], we only show part of its performance the application. results in § 5.3 and § 5.4. We only evaluate SplitFS in §5.2 Write. The application then uses a write call via Ulib to and § 5.4 since it only supports a subset of APIs. For a write 4 KB of data to this newly created file. First, Ulib tries fair comparison, we deploy SplitFS with strict mode, which to lock the file via the two-stage locking service. Since the ensures both durability and atomicity. ZoFS is not open- write lease is still valid, it acquires the lock directly through sourced so we did not evaluate it. Aerie is based on Linux the range-lock. Ulib blocks the program when there are write 3.2.2, which cannot support Optane DCPMMs. Hence, we conflicts and wait until other concurrent threads have released compare with Aerie [39] by emulating persistent memory the lock. After that, Ulib can acquire the lock successfully. It with DRAM, which injects extra delays. Due to limited space, then allocates a 4 KB-page from the pre-allocated pages, we only describe Aerie’s experimental data verbally without copies the data into it, and flushes them out of the CPU adding extra figures. 1.5 w/o CI 1.5 KucoFS 1 (a) 1 (a) Creat, Collaborative Low indexing (Mops/s) 0.5 (Mops/s) 0.5 Throughput Throughput Create, Medium 0 0 1.5 NOVA 0.9 KucoFS 1 (b) Creat, (b) 0.6 Medium Conflicts handling (Mops/s) 0.5 (Mops/s) 0.3 Throughput Throughput 0 0 1 60 NOVA w/o Lock rw Lock KucoFS XFS-DAX NOVA (c) Creat, EXT4-DAX SplitFS 40 (c) Medium Versioned 0.5 PMFS KucoFS (s) (more files) reads (Mops/s) (Mops/s) 20 Throughput Throughput Execution time 0 0 0 10 20 30 0 4 8 12 16 #. of Threads #. of Threads Figure 7: creat performance with FxMark. Low: in different Figure 8: Benefits of collaborative indexing and versioned folders; medium: in the same folder; more files: each thread creates read. w/o CI: KucoFS without collaborative indexing. one million files. 5.2 Effects of Individual Techniques spaces. Aerie supports synchronizing metadata updates of the created files to TFS with batching (by compromising the We use FxMark [25] to analyze the effects of individual visibility), so it achieves comparable performance to that of techniques, which explores the scalability of basic file system KucoFS. Aerie fails to work properly when the number of operations. FxMark provides 19 micro-benchmarks, which threads increases. The throughput of KucoFS, however, is is categorized based on four criteria: data types (i.e., data or only decreased slightly with the medium sharing level, which metadata), modes (i.e., read or write), operations, and sharing is one order of magnitude higher than other file systems, and levels. We only evaluate the commonly used operations (e.g., 3× higher than that of ZoFS. We explain the high scalability read, write, mknod, etc.) due to the limited space. of KucoFS from the following aspects. First, in KucoFS, all metadata updates are delegated to Kfs, so it can update them 5.2.1 Effects of Collaborative Indexing without any locking overhead. Second, by offloading indexing Kfs Basic performance. In KucoFS, creat operation requires tasks to userspace, only needs to do lightweight work. posting requests to Kfs, so we choose this operation to show Larger workload. Furthermore, we measure the scalability of the effects of collaborative indexing. 
FxMark evaluates creat KucoFS in terms of data capacity by extending the workload operations by letting each client thread create 10 K files in size. Specifically, we let each thread create 1 million files, private directories (i.e., low sharing level) or a shared directory 100× larger than the default size in FxMark, and the results (i.e., medium). As shown in Figures 7a and 7b, KucoFS are shown in Figure 7c. Compared to the results with a smaller exhibits the highest performance among the compared file workload size, the throughput of KucoFS drops by 28.5%. systems and its throughput never collapses, regardless of This is mainly because a file system needs more time to find the sharing level. XFS-DAX, Ext4-DAX and PMFS use a proper slot for insertion in the parent directory when the a global lock to perform metadata journaling in a shared number of files increases. Even so, KucoFS still outperforms log, which leads to their poor scalability. NOVA shows other file systems by an order of magnitude. excellent scalability under low sharing level by avoiding Conflict handling. KucoFS requires Kfs to fall back and global locks (e.g., it uses per-inode log and partitions its retry when a conflict occurs, which may impact overall free spaces). However, all kernel file systems fail to scale performance. In this regard, we also test how KucoFS be- under the medium sharing level since VFS needs to lock haves when handling operations that conflict with each other. the parent directory before creating files. SplitFS relies on Specifically, we use multiple threads to create the same file Ext4 to create files, which accounts for its low scalability. concurrently if it does not exist, or delete it instead when From the ZoFS paper we also find that ZoFS even shows it has already been created. We collect the throughput of lower throughput than NOVA under low sharing level, since these successful creations and deletions and the results are it needs to trap into the kernel frequently to allocate new shown in Figure 8b. As a comparison, the results of NOVA XFS-DAX EXT4-DAX PMFS NOVA SplitFS KucoFS (a) Read, Low (b) Read, Medium (c) Overwrite, Low (d) Append, Low (e) Overwrite, 9 3 3 Medium 6 2 2 3 1 1 0 0 0 0 10 20 30 0 10 20 30 0 10 20 30 0 10 20 30 0 10 20 30 Throughput (Mops/s)Throughput

#. of Threads #. of Threads (Mops/s)Throughput #. of Threads #. of Threads (Mops/s)Throughput #. of Threads Figure 9: Read and Write throughput with FxMark. Low: threads read(write) data from(to) separate files; medium: in the same file but different data blocks; default I/O size: 4 KB; gray area: Optane DCPMMs do not scale on NUMA platform. is also shown in the figure. We can observe that KucoFS to that of NOVA despite its direct-access feature. We find that achieves 2.4× higher throughput than NOVA. In NOVA, a SplitFS needs to map more PM space to userspace whenever thread needs to acquire the lock before creating or deleting it reads a page that has not been mapped yet, which causes files. Worse, if this creation or deletion fails, other concurrent extra overhead. The performance improvement of KucoFS threads will be blocked unnecessarily since the lock does is more obvious for medium sharing level because all the not protect a valid operation. Instead, in KucoFS, threads can compared systems need to lock the file before actually reading send creation or deletion requests to Kfs without been blocked file data. The locking overhead impacts their performance and Kfs is responsible for determining whether this operation significantly, despite they use shared locks [23]. Second, can be processed successfully. Furthermore, since Ulib has the read performance of all evaluated file systems drops already provided related addresses in the request, Kfs can use dramatically when the number of threads keeps increasing these addresses to validate metadata items directly, which (gray area). To get stable results, we first bind threads to introduces insignificant overhead. NUMA 0 (local access), and the cores at NUMA 1 are used Breakdown. We also measure the benefit of collaborative only if the total number of threads is greater than 18. Both indexing by comparing with a variant of KucoFS that disables we and past work [47] observe that cross-socket accessing this optimization (i.e., move the metadata indexing tasks to Optane impacts performance greatly. To confirm that our back to Kfs, denoted as “w/o CI”). Figure 8a shows the software design is scalable, we deploy NOVA and KucoFS results by measuring the throughput of creat with a varying in DRAM, and both of them show scalable read throughput number of clients. We make the following observations. First, again. Therefore, many recent papers [13, 19] only use the in the single thread evaluation, collaborative indexing does cores from the local NUMA node in their evaluation. With our not contribute to improving performance, since moving the emulated persistent memory, Aerie shows almost the same metadata indexing task from Ulib back to the Kfs does not performance as that of KucoFS with the low sharing level, reduce the overall latency of each operation. Second, when the but its throughput falls far behind others at a medium sharing number of client threads increases, we find that collaborative level because Aerie needs to interact with the TFS frequently. indexing improves throughput by up to 55%. Since KucoFS We further demonstrate the efficacy of versioned read by only allows the Kfs to update metadata on behalf of multiple concurrently reading/writing data from/to the same file. In Ulib instances, the theoretical throughput limit is Tmax = 1/L our evaluation, one read thread is selected to sequentially (Ops/s, where L is the latency for Kfs to process one request). 
read a file with an I/O size of 16 KB, and an increasing Therefore, the offloading mechanism improves performance number of threads are launched to overwrite the same file by shortening the execution time of each request (i.e., L). concurrently (4 KB writes to a random offset). We let the read thread issue read operations for 1 million times and 5.2.2 Effects of Versioned Read measure its execution time with a varying number of writers. Figures 9a and 9b show the file read performance of each For comparison, we also implement KucoFS r/w lock that file system with a varying number of threads under different reads file data by acquiring read-write locks in the range-lock sharing levels (i.e., low/medium). We make the following ring buffer, and KucoFS w/o lock that reads file data directly observations. First, KucoFS exhibits the highest throughput without a correctness guarantee. We make the following among the compared file systems, which peaks at 9.4 Mops/s observations from Figure 8b. First, the proposed versioned (hardware bandwidth has been fully utilized). The perfor- read achieves almost the same performance as that of KucoFS mance improvement stems primarily from the design of ver- w/o lock. This proves that the overhead of version checking is sioned read, which empowers userspace direct access without extremely low. We also observe that KucoFS r/w lock needs the involvement of Kfs. These kernel file systems (e.g., XFS, much more time to finish reading (7% to 3.2× more time than Ext4, NOVA and PMFS) have to perform context switches KucoFS for different I/O sizes). This is because it needs to use and walk through the VFS layer, which impact the read atomic operations to acquire the range lock, which severely performance. SplitFS only achieves comparable performance impact read performance when conflicts become frequent. Second, the execution time of NOVA is orders of magnitudes Workload Fileserver Webserver Webproxy Varmail higher than that of KucoFS. NOVA directly uses mutexes R/W Size 16 KB/16 KB 1 MB/8 KB 1 MB/16 KB 1 MB/16 KB / to synchronize the reader and concurrent writers. As a result, R W Ratio 1:2 10:1 5:1 1:1 Total number of files in each workload is 100K. the reader is always blocked by writers. Threads 1 16 1 16 1 16 1 16 XFS-DAX 39K 127K 121K 1.35M 192K 863K 99K 319K 5.2.3 Effects of Three-Phase Writes Ext4-DAX 52K 362K 123K 1.33M 316K 2.50M 57K 135K append overwrite PMFS 72K 317K 110K 1.25M 218K 1.54M 169K 1.06M We evaluate both and operations to NOVA 71K 537K 133K 1.43M 337K 3.02M 220K 2.04M analyze the write protocol (see Figures 9 c-d). For overwrite Strata 75K - 105K - 420K - 283K - operations with low sharing level, some of them exhibit a KucoFS 99K 683K 141K 1.48M 463K 3.22M 320K 2.55M performance curve that increases first and then decreases. ➶ 32% 27% 6% 3% 10% 7% 13% 24% In the rising part, KucoFS shows the highest throughput “➶” indicates the performance improvement over the 2nd-best system. among the compared systems because it is enabled to write Table 2: Filebench throughput with 1 and 16 threads (Ops/s). data in userspace directly. XFS and NOVA also show good scalability. Among them, NOVA partitions free spaces to when using mmap to access kernel file systems, which builds avoid the locking overhead when allocating new data pages, the page table during the syscall. The experimental results are while XFS directly writes data in-place without allocating new as expected (not shown in the figure): when we concurrently pages. 
5.2.3 Effects of Three-Phase Writes

We evaluate both append and overwrite operations to analyze the write protocol (see Figures 9c and 9d). For overwrite operations at the low sharing level, some systems exhibit a performance curve that rises first and then falls. In the rising part, KucoFS shows the highest throughput among the compared systems because it writes data directly in userspace. XFS and NOVA also show good scalability: NOVA partitions the free space to avoid locking overhead when allocating new data pages, while XFS writes data in place without allocating new pages. Both PMFS and Ext4 fail to scale, since they rely on a centralized transaction manager to write data, which introduces extra locking overhead. In the falling part, their throughput is mainly affected by two factors: the cross-NUMA overhead explained above, and the poor scalability of Optane's write performance [19]. SplitFS fails to run properly under this setting. For append operations, XFS-DAX, Ext4-DAX, and PMFS scale poorly as the number of threads increases, because they use a global lock to manage the free data pages and the metadata journal, so lock contention becomes the major overhead. Both NOVA and KucoFS show better scalability, and KucoFS outperforms NOVA by 10% to 2× as the number of threads increases. The throughput of SplitFS lies between NOVA and Ext4-DAX: SplitFS first appends data to a staging file and then re-links it to the original file by trapping into the kernel. On our emulated persistent memory, Aerie shows the worst performance because its trusted service is the bottleneck; clients need to interact with the TFS frequently to acquire locks and allocate space.

Two-level locking. To analyze the effects of the lock design, we also evaluate overwrite operations at the medium sharing level, where threads write data to the same file at different offsets. As shown in Figure 9e, the throughput of KucoFS is one order of magnitude higher than that of the other four file systems when the number of threads is small (SplitFS fails to run properly in this setting). The range-lock design in KucoFS enables parallel updates to different data blocks in the same file. The performance of KucoFS drops again when the number of threads grows beyond 8, which is mainly restricted by the ring buffer size of the range lock (we reserve 8 lock slots in the ring buffer). We also find that ZoFS reports 2×–3× higher throughput than NOVA (Figure 7f in their paper [13]), but it still underperforms KucoFS.
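To illustrate why a fixed number of lock slots caps write concurrency, here is a minimal sketch of a userspace range lock built on a small ring of slots. NR_SLOTS, the struct and function names, and the spin-guarded bookkeeping are illustrative assumptions; KucoFS's actual range-lock layout in the Ulib/Kfs shared region may differ.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define NR_SLOTS 8              /* matches the 8 slots reserved in Section 5.2.3 */

    struct range_slot { bool used; uint64_t start, end; };

    struct range_lock {
        atomic_flag guard;          /* short critical section around slot bookkeeping */
        struct range_slot slots[NR_SLOTS];
    };

    static void guard_lock(struct range_lock *rl)
    { while (atomic_flag_test_and_set_explicit(&rl->guard, memory_order_acquire)) ; }

    static void guard_unlock(struct range_lock *rl)
    { atomic_flag_clear_explicit(&rl->guard, memory_order_release); }

    /* Claim a slot covering [start, end); returns the slot index, or -1 if the
     * range conflicts with a holder or every slot is taken (callers retry).
     * Once a slot is held, the data write proceeds in parallel with other writers. */
    static int range_lock_acquire(struct range_lock *rl, uint64_t start, uint64_t end)
    {
        int free_slot = -1;
        guard_lock(rl);
        for (int i = 0; i < NR_SLOTS; i++) {
            if (!rl->slots[i].used) {
                if (free_slot < 0) free_slot = i;
                continue;
            }
            if (rl->slots[i].start < end && start < rl->slots[i].end) {
                guard_unlock(rl);
                return -1;          /* overlapping range already locked */
            }
        }
        if (free_slot >= 0)
            rl->slots[free_slot] = (struct range_slot){ true, start, end };
        guard_unlock(rl);
        return free_slot;           /* -1 if the ring is full */
    }

    static void range_lock_release(struct range_lock *rl, int slot)
    {
        guard_lock(rl);
        rl->slots[slot].used = false;
        guard_unlock(rl);
    }

Writers to disjoint ranges proceed in parallel, but once more than NR_SLOTS writers are in flight, acquisitions start failing and threads spin on retries, which matches the throughput drop KucoFS shows beyond 8 threads.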
Memory-mapped I/O. Memory-mapped I/O is the most efficient way to access a file system. Kfs in KucoFS constructs all page tables in advance when processing mmap requests. For a fair comparison, we add the MAP_POPULATE flag when using mmap to access the kernel file systems, which builds the page tables during the syscall. The experimental results are as expected (not shown in the figure): when we concurrently issue 4 KB read/write requests, all the evaluated file systems saturate the hardware bandwidth.
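For reference, the following is a minimal sketch of the mmap invocation used for the kernel file systems in this comparison; the mount path /mnt/pmem/testfile and the 1 GB length are placeholders.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t len = 1UL << 30;                /* placeholder: 1 GB mapping */
        int fd = open("/mnt/pmem/testfile", O_RDWR); /* placeholder path on a DAX mount */
        if (fd < 0) { perror("open"); return 1; }

        /* MAP_POPULATE pre-faults the mapping, so the page tables are built
         * during the mmap syscall rather than lazily on first access. */
        void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_POPULATE, fd, 0);
        if (addr == MAP_FAILED) { perror("mmap"); return 1; }

        /* ... issue 4 KB loads/stores directly on addr ... */

        munmap(addr, len);
        close(fd);
        return 0;
    }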
5.3 Filebench: Macro-Benchmarks

We then use Filebench [1] as a macro-benchmark to evaluate the performance of KucoFS. Table 2 shows both the workload settings (similar to those in the NOVA paper) and the experimental results with 1 and 16 threads (adding more threads does not contribute to higher throughput with Filebench [13]).

Table 2: Filebench throughput with 1 and 16 threads (Ops/s).
  Workload    Fileserver       Webserver        Webproxy         Varmail
  R/W size    16 KB/16 KB      1 MB/8 KB        1 MB/16 KB       1 MB/16 KB
  R/W ratio   1:2              10:1             5:1              1:1
  Threads     1       16       1       16       1       16       1       16
  XFS-DAX     39K     127K     121K    1.35M    192K    863K     99K     319K
  Ext4-DAX    52K     362K     123K    1.33M    316K    2.50M    57K     135K
  PMFS        72K     317K     110K    1.25M    218K    1.54M    169K    1.06M
  NOVA        71K     537K     133K    1.43M    337K    3.02M    220K    2.04M
  Strata      75K     -        105K    -        420K    -        283K    -
  KucoFS      99K     683K     141K    1.48M    463K    3.22M    320K    2.55M
  ➶           32%     27%      6%      3%       10%     7%       13%     24%
  The total number of files in each workload is 100K; "➶" indicates the improvement over the 2nd-best system.

We make two observations. First, KucoFS shows the highest performance among all the evaluated workloads. In the single-threaded evaluation with the Fileserver workload, its throughput is 2.5×, 1.9×, 1.38×, 1.39×, and 1.32× that of XFS, Ext4, PMFS, NOVA, and Strata, respectively, and 3.2×, 5.6×, 1.9×, 1.45×, and 1.13× with the Varmail workload. For read-dominated workloads (e.g., Webserver and Webproxy), KucoFS also shows slightly higher throughput. The improvement mainly comes from the direct-access feature of KucoFS; Strata also benefits from direct access and performs second-best in most workloads. We also observe that the design of KucoFS is a good fit for the Varmail workload. This is expected: Varmail frequently creates and deletes files, so it generates more metadata operations and issues system calls more frequently, and, as described before, KucoFS eliminates the OS-side overhead and is better at handling metadata operations. Besides, Strata shows much higher throughput than NOVA because the file I/Os in Varmail are small: Strata only needs to append these small updates to its operation log, which reduces write amplification dramatically. Second, KucoFS is better at handling concurrent workloads. With 16 client threads under the Fileserver workload, KucoFS outperforms XFS-DAX by 4.4×, PMFS by 1.2×, and NOVA by 27%. The improvement is more obvious for the Varmail workload, where KucoFS achieves 10× higher performance than XFS-DAX and Ext4-DAX on average. Two reasons contribute to this good performance: first, KucoFS incorporates techniques like collaborative indexing to let Kfs provide scalable metadata access; second, KucoFS avoids a global lock by letting each client manage private free data pages. NOVA also exhibits good scalability since it uses a per-inode log structure and partitions the free space to avoid global locks.

Figure 10: Redis performance with different file systems (SET throughput in Kops/s for object sizes from 128 B to 8 KB; compared systems: XFS-DAX, Ext4-DAX, PMFS, NOVA, Strata, SplitFS, KucoFS).

5.4 Redis: Real-World Application

Redis exports a set of APIs that allow applications to process and query structured data, and it uses the file system for persistent data storage. Redis has two approaches to persisting its data: one is to log operations to an append-only file (AOF), and the other is to take asynchronous snapshots. We only evaluate Redis in AOF mode in this paper. Figure 10 shows the throughput of SET operations using 12-byte keys with various value sizes. For small values, the throughput of Redis on KucoFS is 53% higher on average than on PMFS, NOVA, and Strata, and 76% higher than on XFS-DAX and Ext4-DAX, which is consistent with the append results in Section 5.2. With larger objects, KucoFS achieves only slightly higher throughput than the other file systems, since most of the time is spent writing data. Note that Redis is a single-threaded application, so it is reasonable for KucoFS to reach about 100 Kops/s with 8 KB objects (around 800 MB/s). SplitFS is good at handling append operations since it processes data-plane operations in userspace. However, it still underperforms KucoFS, because Redis issues an fsync to flush the AOF file each time it appends new data; SplitFS therefore needs to trap into the kernel to update metadata, which again incurs VFS and syscall overhead.

6 Related Work

Kernel-userspace collaboration. The idea of moving I/O operations from the kernel to userspace has been well studied. Belay et al. [4] propose the Dune process abstraction, which leverages the virtualization hardware in modern processors to give applications direct access to privileged CPU features in userspace and to execute syscalls with reduced overhead. Based on Dune, IX [5] goes a step further and improves the performance of data-center applications by separating the management and scheduling functions of the kernel (the control plane) from network processing (the data plane). Arrakis [31] is a network server operating system in which applications access I/O devices directly and the kernel only enforces coarse-grained protection. FLEX [42] avoids kernel overhead by replacing conventional file operations with equivalent DAX-based operations, which shares some similarities with SplitFS. While these systems share the idea of splitting tasks between the kernel and userspace, KucoFS differs in that it enforces a fine-grained split of responsibilities with close collaboration between the two sides.

Persistent memory storage systems. Besides the persistent memory file systems discussed above, we summarize further PM systems here. First, general PM optimizations: Yang et al. [46] explore the performance properties and characteristics of Optane DCPMM at the micro and macro levels and provide a number of guidelines for maximizing its performance; Libnvmmio [10] extends userspace memory-mapped I/O with failure atomicity; and many recent papers design data structures that work correctly and efficiently on persistent memory [7, 18, 26, 30, 48, 52]. Second, PM-aware file systems: BPFS [11] adopts short-circuit shadow paging to guarantee metadata and data consistency; SCMFS [41] simplifies file management by mapping files to contiguous virtual address regions using the virtual memory management (VMM) of the existing OS; NOVA-Fortis [44] goes a step further toward fault tolerance by providing a snapshot mechanism; Ziggurat [49] is a tiered file system that estimates the temperature of file data and migrates cold data from PM to disks; and DevFS [20] pushes the file system implementation into a storage device with compute capability and device-level RAM. Third, distributed PM systems: Hotpot [36] manages the PM devices of different nodes in a cluster with a distributed shared persistent memory architecture; Octopus [24, 37] leverages PM and RDMA to build an efficient distributed file system by reducing software overhead; similarly, Orion [45] is also a distributed persistent memory file system, but it is built in the kernel; and FlatStore [8] is a log-structured key-value storage engine based on RDMA networking that minimizes flush overhead by batching small requests.

7 Conclusion

In this paper, we introduce a kernel-userspace collaborative architecture named Kuco, which exhibits a fine-grained task division between userspace and the kernel. Based on Kuco, we further design and implement a PM file system named KucoFS, and our experiments show that KucoFS delivers both efficient and highly scalable performance.

Acknowledgements

We sincerely thank our shepherd Donald E. Porter and the anonymous reviewers for their insightful feedback. We also thank Qing Wang and Ramnatthan Alagappan for their excellent suggestions. This material is supported by the National Key Research & Development Program of China (Grant No. 2018YFB1003301), the National Natural Science Foundation of China (Grant No. 62022051, 61832011, 61772300, 61877035), and Huawei (Grant No. YBN2019125112).

References

[1] Filebench file system benchmark. http://www.nfsv4bat.org/Documents/nasconf/2004/filebench.pdf, 2004.
[2] Support ext4 on NV-DIMMs. https://lwn.net/Articles/588218, 2014.
[3] I. G. Baek, M. S. Lee, S. Seo, M. J. Lee, D. H. Seo, D.-S. Suh, J. C. Park, S. O. Park, H. S. Kim, I. K. Yoo, et al. Highly scalable nonvolatile resistive memory using simple binary oxide driven by asymmetric unipolar voltage pulses. In IEEE International Electron Devices Meeting (IEDM), 2004.
[4] Adam Belay, Andrea Bittau, Ali Mashtizadeh, David Terei, David Mazières, and Christos Kozyrakis. Dune: Safe user-level access to privileged CPU features. In OSDI '12, 2012.
[5] Adam Belay, George Prekas, Ana Klimovic, Samuel Grossman, Christos Kozyrakis, and Edouard Bugnion. IX: A protected dataplane operating system for high throughput and low latency. In OSDI '14, 2014.
[6] Remy Card, Theodore Ts'o, and Stephen Tweedie. Design and implementation of the second extended filesystem. In Proceedings of the 1st Dutch International Symposium on Linux, 1994.
[7] Shimin Chen and Qin Jin. Persistent B+-trees in non-volatile main memory. Proc. VLDB Endow., 8(7), 2015.
[8] Youmin Chen, Youyou Lu, Fan Yang, Qing Wang, Yang Wang, and Jiwu Shu. FlatStore: An efficient log-structured key-value storage engine for persistent memory. In ASPLOS '20, 2020.
[9] Youmin Chen, Jiwu Shu, Jiaxin Ou, and Youyou Lu. HiNFS: A persistent memory file system with both buffering and direct-access. ACM Trans. Storage, 14(1), 2018.
[10] Jungsik Choi, Jaewan Hong, Youngjin Kwon, and Hwansoo Han. Libnvmmio: Reconstructing software IO path with failure-atomic memory-mapped interface. In USENIX ATC '20, 2020.
[11] Jeremy Condit, Edmund B. Nightingale, Christopher Frost, Engin Ipek, Benjamin Lee, Doug Burger, and Derrick Coetzee. Better I/O through byte-addressable, persistent memory. In SOSP '09, 2009.
[12] Jeffrey Dean and Luiz André Barroso. The tail at scale. Commun. ACM, 56(2), 2013.
[13] Mingkai Dong, Heng Bu, Jiefei Yi, Benchao Dong, and Haibo Chen. Performance and protection in the ZoFS user-space NVM file system. In SOSP '19, 2019.
[14] Subramanya R. Dulloor, Sanjay Kumar, Anil Keshavamurthy, Philip Lantz, Dheeraj Reddy, Rajesh Sankaran, and Jeff Jackson. System software for persistent memory. In EuroSys '14, 2014.
[15] Keir Fraser. Practical lock-freedom. Technical report, University of Cambridge, Computer Laboratory, 2004.
[16] Danny Hendler, Itai Incze, Nir Shavit, and Moran Tzafrir. Flat combining and the synchronization-parallelism tradeoff. In SPAA '10, 2010.
[17] J. Howard, M. Kazar, S. Menees, D. Nichols, M. Satyanarayanan, Robert N. Sidebotham, and M. West. Scale and performance in a distributed file system. In SOSP '87, 1987.
[18] Deukyeon Hwang, Wook-Hee Kim, Youjip Won, and Beomseok Nam. Endurable transient inconsistency in byte-addressable persistent B+-tree. In FAST '18, 2018.
[19] Joseph Izraelevitz, Jian Yang, Lu Zhang, Juno Kim, Xiao Liu, Amirsaman Memaripour, Yun Joon Soh, Zixuan Wang, Yi Xu, Subramanya R. Dulloor, et al. Basic performance measurements of the Intel Optane DC persistent memory module. arXiv preprint arXiv:1903.05714, 2019.
[20] Sudarsun Kannan, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Yuangang Wang, Jun Xu, and Gopinath Palani. Designing a true direct-access file system with DevFS. In FAST '18, 2018.
[21] Youngjin Kwon, Henrique Fingler, Tyler Hunt, Simon Peter, Emmett Witchel, and Thomas Anderson. Strata: A cross media file system. In SOSP '17, 2017.
[22] Benjamin C. Lee, Engin Ipek, Onur Mutlu, and Doug Burger. Architecting phase change memory as a scalable DRAM alternative. In ISCA '09, 2009.
[23] Bojie Li, Tianyi Cui, Zibo Wang, Wei Bai, and Lintao Zhang. SocksDirect: Datacenter sockets can be fast and compatible. In SIGCOMM '19, 2019.
[24] Youyou Lu, Jiwu Shu, Youmin Chen, and Tao Li. Octopus: An RDMA-enabled distributed persistent memory file system. In USENIX ATC '17, 2017.
[25] Changwoo Min, Sanidhya Kashyap, Steffen Maass, Woonhak Kang, and Taesoo Kim. Understanding manycore scalability of file systems. In USENIX ATC '16, 2016.
[26] Moohyeon Nam, Hokeun Cha, Young-ri Choi, Sam H. Noh, and Beomseok Nam. Write-optimized dynamic hashing for persistent memory. In FAST '19, 2019.
[27] Intel Newsroom. Intel Optane DC persistent memory. https://www.intel.com/content/www/us/en/products/memory-storage/optane-dc-persistent-memory.html, 2019.
[28] Rohan Kadekodi, Se Kwon Lee, Sanidhya Kashyap, Taesoo Kim, Aasheesh Kolli, and Vijay Chidambaram. SplitFS: A file system that minimizes software overhead in file systems for persistent memory. In SOSP '19, 2019.
[29] Jiaxin Ou, Jiwu Shu, and Youyou Lu. A high performance file system for non-volatile main memory. In EuroSys '16, 2016.
[30] Ismail Oukid, Johan Lasperas, Anisoara Nica, Thomas Willhalm, and Wolfgang Lehner. FPTree: A hybrid SCM-DRAM persistent and concurrent B-tree for storage class memory. In SIGMOD '16, 2016.
[31] Simon Peter, Jialin Li, Irene Zhang, Dan R. K. Ports, Doug Woos, Arvind Krishnamurthy, Thomas Anderson, and Timothy Roscoe. Arrakis: The operating system is the control plane. In OSDI '14, 2014.
[32] William Pugh. Skip lists: A probabilistic alternative to balanced trees. Commun. ACM, 33(6), 1990.
[33] Henry Qin, Qian Li, Jacqueline Speiser, Peter Kraft, and John Ousterhout. Arachne: Core-aware thread management. In OSDI '18, 2018.
[34] Moinuddin K. Qureshi, Vijayalakshmi Srinivasan, and Jude A. Rivers. Scalable high performance main memory system using phase-change memory technology. In ISCA '09, 2009.
[35] Sepideh Roghanchi, Jakob Eriksson, and Nilanjana Basu. ffwd: Delegation is (much) faster than you think. In SOSP '17, 2017.
[36] Yizhou Shan, Shin-Yeh Tsai, and Yiying Zhang. Distributed shared persistent memory. In SoCC '17, 2017.
[37] Jiwu Shu, Youmin Chen, Qing Wang, Bohong Zhu, Junru Li, and Youyou Lu. TH-DPMS: Design and implementation of an RDMA-enabled distributed persistent memory storage system. ACM Trans. Storage, 16(4), 2020.
[38] Adam Sweeney, Doug Doucette, Wei Hu, Curtis Anderson, Mike Nishimoto, and Geoff Peck. Scalability in the XFS file system. In USENIX Annual Technical Conference, 1996.
[39] Haris Volos, Sanketh Nalli, Sankarlingam Panneerselvam, Venkatanathan Varadarajan, Prashant Saxena, and Michael M. Swift. Aerie: Flexible file-system interfaces to storage-class memory. In EuroSys '14, 2014.
[40] Ying Wang, Dejun Jiang, and Jin Xiong. Caching or not: Rethinking virtual file system for non-volatile main memory. In HotStorage '18, 2018.
[41] Xiaojian Wu and A. L. Narasimha Reddy. SCMFS: A file system for storage class memory. In SC '11, 2011.
[42] Jian Xu, Juno Kim, Amirsaman Memaripour, and Steven Swanson. Finding and fixing performance pathologies in persistent memory software stacks. In ASPLOS '19, 2019.
[43] Jian Xu and Steven Swanson. NOVA: A log-structured file system for hybrid volatile/non-volatile main memories. In FAST '16, 2016.
[44] Jian Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase, Tamires Brito Da Silva, Steven Swanson, and Andy Rudoff. NOVA-Fortis: A fault-tolerant non-volatile main memory file system. In SOSP '17, 2017.
[45] Jian Yang, Joseph Izraelevitz, and Steven Swanson. Orion: A distributed file system for non-volatile main memories and RDMA-capable networks. In FAST '19, 2019.
[46] Jian Yang, Juno Kim, Morteza Hoseinzadeh, Joseph Izraelevitz, and Steven Swanson. An empirical guide to the behavior and use of scalable persistent memory. In FAST '20, 2020.
[47] Jian Yang, Juno Kim, Morteza Hoseinzadeh, Joseph Izraelevitz, and Steven Swanson. An empirical guide to the behavior and use of scalable persistent memory. arXiv preprint arXiv:1908.03583, 2019.
[48] Jun Yang, Qingsong Wei, Cheng Chen, Chundong Wang, Khai Leong Yong, and Bingsheng He. NV-Tree: Reducing consistency cost for NVM-based single level systems. In FAST '15, 2015.
[49] Shengan Zheng, Morteza Hoseinzadeh, and Steven Swanson. Ziggurat: A tiered file system for non-volatile main memories and disks. In FAST '19, 2019.
[50] Deng Zhou, Wen Pan, Tao Xie, and Wei Wang. A file system bypassing volatile main memory: Towards a single-level persistent store. In ACM International Conference on Computing Frontiers (CF '18), 2018.
[51] Ping Zhou, Bo Zhao, Jun Yang, and Youtao Zhang. A durable and energy efficient main memory using phase change memory technology. In ISCA '09, 2009.
[52] Pengfei Zuo, Yu Hua, and Jie Wu. Write-optimized and high-performance hashing index scheme for persistent memory. In OSDI '18, 2018.