Direct-FUSE: Removing the Middleman for High-Performance FUSE Support

Yue Zhu*, Teng Wang*, Kathryn Mohror+, Adam Moody+, Kento Sato+, Muhib Khan*, Weikuan Yu*
Florida State University*   Lawrence Livermore National Laboratory+

Outline

• Background & Motivation
• Design
• Performance Evaluation
• Conclusion

Introduction

• High-performance computing (HPC) systems need efficient file systems to support large-scale scientific applications

  - Different file systems are used for different kinds of data within a single job

  - Both kernel- and user-level file systems can be used by applications

  - Due to the development complexity, reliability, and portability issues of kernel-level file systems, user-level file systems are increasingly leveraged for special-purpose I/O workloads

• Filesystem in Userspace (FUSE)

  - A software interface for Unix-like operating systems

  - It allows non-privileged users to create their own file systems without modifying kernel code

  - The user-defined file system runs as a separate process in user space

  - Examples: SSHFS, GlusterFS client, FusionFS (BigData’14)

How Does FUSE Work?

• Execution path of a function call

  - Send the request to the user-level file system process

    - Application program → VFS → FUSE kernel module → user-level file system process

  - Return the data to the application program

    - User-level file system process → FUSE kernel module → VFS → application program

[Figure: FUSE architecture. The application program and the user-level file system run in user space; the VFS, FUSE kernel module, page cache, and storage device sit in kernel space.]
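To make the forwarding path above concrete, below is a minimal sketch of a user-level file system written against the libfuse high-level API (FUSE 2.x-style signatures assumed). The single /hello file and its contents are invented for illustration, and error handling is kept to a minimum; this is not code from the paper.

    /* Minimal read-only FUSE file system sketch (libfuse 2.x high-level API).
     * Each callback is invoked by the FUSE kernel module on behalf of the
     * application that issued the original VFS call. */
    #define FUSE_USE_VERSION 26
    #include <fuse.h>
    #include <string.h>
    #include <errno.h>
    #include <sys/stat.h>

    static const char *hello_path = "/hello";        /* hypothetical file */
    static const char *hello_data = "hello, FUSE\n";

    static int hello_getattr(const char *path, struct stat *st)
    {
        memset(st, 0, sizeof(*st));
        if (strcmp(path, "/") == 0) {
            st->st_mode  = S_IFDIR | 0755;
            st->st_nlink = 2;
        } else if (strcmp(path, hello_path) == 0) {
            st->st_mode  = S_IFREG | 0444;
            st->st_nlink = 1;
            st->st_size  = strlen(hello_data);
        } else {
            return -ENOENT;
        }
        return 0;
    }

    static int hello_read(const char *path, char *buf, size_t size,
                          off_t off, struct fuse_file_info *fi)
    {
        size_t len = strlen(hello_data);
        (void)fi;
        if (strcmp(path, hello_path) != 0)
            return -ENOENT;
        if ((size_t)off >= len)
            return 0;
        if (off + size > len)
            size = len - off;
        memcpy(buf, hello_data + off, size);
        return (int)size;                  /* bytes returned to the app */
    }

    static struct fuse_operations hello_ops = {
        .getattr = hello_getattr,
        .read    = hello_read,
    };

    int main(int argc, char *argv[])
    {
        /* fuse_main() mounts the file system and runs the request loop that
         * receives forwarded VFS calls from the FUSE kernel module. */
        return fuse_main(argc, argv, &hello_ops, NULL);
    }

Running the resulting binary with a mount point (e.g., ./hello /mnt/point) starts the user-space process that serves the forwarded requests.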

FUSE File System vs. Native File System

                                 FUSE File System   Native File System
  # User-kernel mode switches           4                   2
  # Context switches                    2                   0
  # Memory copies                       2                   1

[Figure: FUSE vs. native I/O paths. The application program and the user-level file system run in user space; the Virtual File System (VFS), FUSE kernel module, Ext4, page cache, and storage device sit in kernel space.]

Number of Context Switches & I/O Bandwidth

• The added complexity in the FUSE file system execution path degrades I/O bandwidth

  - tmpfs: a file system that stores data in memory

  - FUSE-tmpfs: a FUSE file system deployed on top of tmpfs

  - The dd micro-benchmark and the perf profiling tool are used to gather I/O bandwidth and the number of context switches

  - Experiment method: continually issue 1000 writes

  Block Size   Write Bandwidth                    # Context Switches
  (KB)         FUSE-tmpfs (MB/s)   tmpfs (GB/s)   FUSE-tmpfs   tmpfs
  4            163                 1.3            1012         7
  16           372                 1.6            1012         7
  64           519                 1.7            1012         7
  128          549                 2.0            1012         7
  256          569                 2.4            2012         7

Breakdown of Metadata & Data Latency

• The actual file system operations (i.e., metadata or data operations) occupy only a small fraction of the total execution time

  - Tests are run on tmpfs and FUSE-tmpfs

  - Real Operation (metadata): the time spent performing the operation itself

  - Data Movement: the time spent actually writing data within a complete write call

  - Overhead: all remaining cost, e.g., the time spent on context switches

[Fig. 1: Time Expense in Metadata Operations. Latency (ns) of Create and Close on tmpfs and FUSE-tmpfs, broken into Real Operation and Overhead.]

[Fig. 2: Time Expense in Data Operations. Latency (ns) across transfer sizes of 1 KB to 256 KB on tmpfs and FUSE-tmpfs, broken into Data Movement and Overhead.]

Existing Solution and Our Approach

• How can the overheads of FUSE be reduced?

  - Build an independent user-space library to avoid going through the kernel (e.g., IndexFS (SC’14), FusionFS)

  - However, this approach cannot support multiple FUSE libraries with distinct file paths and file descriptors

• We propose Direct-FUSE to provide multiple backend I/O services to an application

  - We adapted libsysio for our purposes in Direct-FUSE

    - libsysio was developed by the Scalability team at Sandia National Laboratories:

      - It provides POSIX-like file I/O and name space support for remote file systems from an application's user-level address space

Outline

• Background & Motivation
• Design
• Performance Evaluation
• Conclusion

Overview of Direct-FUSE

• Direct-FUSE consists of three main components

1. Adapted-libsysio

   - Intercepts file paths and file descriptors to identify the backend service

   - Simplifies the metadata and data execution paths of the original libsysio

2. lightweight-libfuse (not the actual libfuse)

   - Abstracts file system operations from backend services into unified APIs (see the sketch after the architecture figure below)

3. Backend services

   - Provide the defined file system operations (e.g., FusionFS)

[Figure: Direct-FUSE architecture. The application program calls into Direct-FUSE (adapted-libsysio and lightweight-libfuse), which dispatches to backend services such as FUSE-Ext4 and the FusionFS client, backed by Ext4, the FusionFS server, and others.]
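As an illustration of how lightweight-libfuse could present heterogeneous backends behind unified APIs, here is a hypothetical C sketch of a per-backend operations table; the dfuse_* names are invented for this example and are not the actual Direct-FUSE symbols.

    /* Hypothetical per-backend operations table, illustrating how different
     * backends (FUSE-Ext4, FusionFS client, ...) can sit behind one set of
     * unified APIs. */
    #include <sys/types.h>
    #include <sys/stat.h>

    struct dfuse_backend_ops {
        /* Unified file system operations, filled in from each backend's
         * FUSE high-level callbacks. */
        int     (*getattr)(const char *path, struct stat *st);
        int     (*open)   (const char *path, int flags);
        ssize_t (*read)   (int fd, void *buf, size_t size, off_t off);
        ssize_t (*write)  (int fd, const void *buf, size_t size, off_t off);
        int     (*unlink) (const char *path);
    };

    struct dfuse_backend {
        const char                     *prefix; /* mount prefix (illustrative)     */
        const struct dfuse_backend_ops *ops;    /* this backend's unified ops      */
        void                           *priv;   /* backend state set up at init    */
    };

Adapted-libsysio can then select a dfuse_backend by path prefix or file descriptor and call straight through these function pointers, never entering the kernel.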

Path and File Descriptor Operations

• To facilitate interception of file system operations for multiple backends, the operations are divided into two categories (a dispatch sketch follows this list):

1. File path operations

i. Intercept prefix and path (e.g., :/sshfs/test.txt) and return mount information

ii. Look up corresponding inode based on the mount information, and redirect to defined operations

2. File descriptor operations

i. Find open-file record based on given file descriptor

      - The open-file record contains pointers to the inode, the current stream position, etc.

   ii. Redirect to the defined operations based on the inode info in the open-file record
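The dispatch sketch referenced above, again in hypothetical C: it repeats trimmed-down dfuse_backend structures from the earlier sketch so the fragment stands alone, and the mount table and open-file array are illustrative stand-ins for the adapted-libsysio bookkeeping.

    /* Hypothetical dispatch logic for the two operation categories above. */
    #include <errno.h>
    #include <string.h>
    #include <sys/types.h>

    /* Trimmed repeat of the structures from the earlier overview sketch. */
    struct dfuse_backend_ops {
        ssize_t (*read)(int fd, void *buf, size_t size, off_t off);
        /* ... other unified operations ... */
    };
    struct dfuse_backend {
        const char                     *prefix;   /* mount prefix */
        const struct dfuse_backend_ops *ops;
    };

    #define MAX_MOUNTS     16
    #define MAX_OPEN_FILES 1024

    static struct dfuse_backend mounts[MAX_MOUNTS];   /* populated at mount time */
    static int n_mounts;

    /* One record per open file descriptor. */
    struct open_file {
        struct dfuse_backend *backend;    /* stands in for the inode's mount info */
        int                   backend_fd; /* descriptor as known by the backend   */
        off_t                 pos;        /* current stream position              */
    };
    static struct open_file open_files[MAX_OPEN_FILES];

    /* 1. File path operations: match the path prefix to find the mount. */
    static struct dfuse_backend *lookup_mount(const char *path)
    {
        for (int i = 0; i < n_mounts; i++)
            if (strncmp(path, mounts[i].prefix, strlen(mounts[i].prefix)) == 0)
                return &mounts[i];
        return NULL;                       /* unknown prefix */
    }

    /* 2. File descriptor operations: locate the open-file record and call
     *    the operations of the backend it points to, with no kernel involvement. */
    static ssize_t dfuse_read(int fd, void *buf, size_t size)
    {
        if (fd < 0 || fd >= MAX_OPEN_FILES || open_files[fd].backend == NULL)
            return -EBADF;
        struct open_file *of = &open_files[fd];
        ssize_t n = of->backend->ops->read(of->backend_fd, buf, size, of->pos);
        if (n > 0)
            of->pos += n;                  /* advance the stream position */
        return n;
    }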

Requirements for New Backends

• Interact with the FUSE high-level APIs
• Be packaged as an independent user-space library

  - The library contains the FUSE file system operations, an initialization function, and an unmount function

  - If a backend passes specialized data to the FUSE module via fuse_mount(), that data has to be globalized for later file system operations
• Be implemented in C/C++, or be binary compatible with C/C++
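A hedged illustration of a backend that meets these requirements when packaged as an independent library: the mybackend_* names, the remote-host state, and the init/unmount entry points are hypothetical, and only a FUSE 2.x-style getattr callback is stubbed in.

    /* Hypothetical skeleton of a new backend library: FUSE high-level
     * operations, an init entry point, an unmount entry point, and
     * globalized mount-time data. */
    #define FUSE_USE_VERSION 26
    #include <fuse.h>
    #include <sys/stat.h>

    /* Data the backend would normally hand to fuse_mount() is kept in a
     * global so later operations can reach it without the FUSE kernel
     * module in the loop (names are illustrative). */
    static struct mybackend_state {
        const char *remote_host;
        int         connected;
    } g_state;

    static int mybackend_getattr(const char *path, struct stat *st)
    {
        (void)path;
        (void)st;
        /* ... consult g_state and fill in *st ... */
        return 0;
    }

    /* FUSE high-level operations table exported to lightweight-libfuse. */
    struct fuse_operations mybackend_ops = {
        .getattr = mybackend_getattr,
        /* .open, .read, .write, ... */
    };

    /* Initialization entry point: the setup fuse_main()/fuse_mount()
     * would normally trigger. */
    int mybackend_init(const char *remote_host)
    {
        g_state.remote_host = remote_host;
        g_state.connected   = 1;
        return 0;
    }

    /* Unmount/cleanup entry point invoked at teardown. */
    void mybackend_unmount(void)
    {
        g_state.connected = 0;
    }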

Outline

• Background & Motivation
• Design
• Performance Evaluation
• Conclusion

Experimental Methodology

• We compare the bandwidth of Direct-FUSE against a local FUSE file system and the native file system, on both disk and memory, using IOzone

  - Disk

    - Ext4-fuse: a FUSE file system overlying Ext4

    - Ext4-direct: Ext4-fuse with the FUSE kernel module bypassed via Direct-FUSE

    - Ext4-native: the original Ext4 on disk

  - Memory

    - tmpfs-fuse, tmpfs-direct, and tmpfs-native are analogous to the three disk tests

• We also compare the I/O bandwidth of a distributed FUSE file system against Direct-FUSE

  - FusionFS: a distributed file system that supports metadata- and write-intensive operations

Sequential Write Bandwidth

• Direct-FUSE achieves write bandwidth comparable to the native file system

  - Ext4-direct outperforms Ext4-fuse by 16.5% on average

  - tmpfs-direct outperforms tmpfs-fuse by at least 2.15x

[Figure: Sequential write bandwidth (MB/s, log scale) across write transfer sizes of 4 KB to 1024 KB for Ext4-fuse, Ext4-direct, Ext4-native, tmpfs-fuse, tmpfs-direct, and tmpfs-native]

Sequential Read Bandwidth

• As with sequential writes, the read bandwidth of Direct-FUSE is comparable to the native file system

  - Ext4-direct outperforms Ext4-fuse by 2.5% on average

  - tmpfs-direct outperforms tmpfs-fuse by at least 2.26x

[Figure: Sequential read bandwidth (MB/s, log scale) across read transfer sizes of 4 KB to 1024 KB for Ext4-fuse, Ext4-direct, Ext4-native, tmpfs-fuse, tmpfs-direct, and tmpfs-native]

Distributed I/O Bandwidth

• Direct-FUSE outperforms FusionFS in write bandwidth and delivers comparable read bandwidth

  - Writes benefit more from bypassing the FUSE kernel

• Direct-FUSE delivers scalability similar to the original FusionFS

[Figure: Write and read bandwidth (MB/s, log scale) vs. number of nodes (1 to 16) for fusionfs and direct-fusionfs]

Overhead Analysis

• The dummy read/write occupies less than 3% of the complete I/O function time in Direct-FUSE, even when the I/O size is very small

  - Dummy write/read: no actual data movement; returns immediately once it reaches the backend service

  - Real write/read: the actual Direct-FUSE read and write I/O calls

[Figure: Write and read latency (ns, log scale) across transfer sizes of 1 B to 1 KB for dummy and real operations]
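For clarity, the dummy operations can be thought of as backend callbacks that acknowledge the request without moving any data; the sketch below is a hypothetical illustration of that idea, not the measured Direct-FUSE code.

    /* Hypothetical dummy backend callbacks used to isolate framework
     * overhead: they return immediately, so any measured latency comes from
     * path/fd resolution and dispatch rather than data movement. */
    #include <sys/types.h>

    static ssize_t dummy_write(int fd, const void *buf, size_t size, off_t off)
    {
        (void)fd; (void)buf; (void)off;
        return (ssize_t)size;   /* pretend all bytes were written */
    }

    static ssize_t dummy_read(int fd, void *buf, size_t size, off_t off)
    {
        (void)fd; (void)buf; (void)off;
        return (ssize_t)size;   /* pretend all bytes were read */
    }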

Conclusions

• We have revealed and analyzed the context-switch counts and time overheads in FUSE metadata and data operations

• We have designed and implemented Direct-FUSE, which avoids crossing the kernel boundary and supports multiple FUSE backends simultaneously

• Our experimental results indicate that Direct-FUSE achieves significant performance improvements over the original FUSE file systems

Sponsors of This Research
