Int. J. Embedded Systems, Vol. 6, Nos. 2/3, 2014 95

Buffered FUSE: optimising the Android IO stack for user-level filesystem

Sooman Jeong and Youjip Won*
Hanyang University, #507-2, Annex of Engineering Center, 17 Haengdang-dong, Sungdong-gu, Seoul, 133-791, South Korea
E-mail: [email protected]
E-mail: [email protected]
*Corresponding author

Abstract: In this work, we optimise the Android IO stack for the user-level filesystem. Android imposes a user-level filesystem over the native filesystem partition to provide flexibility in managing the internal storage space and to maintain host compatibility. The overhead of the user-level filesystem is prohibitively large and the native storage bandwidth is significantly under-utilised. We overhauled the FUSE layer in the Android platform and propose buffered FUSE (bFUSE) to address the overhead of the user-level filesystem. The key technical ingredients of buffered FUSE are: 1) extended FUSE IO size; 2) internal user-level write buffer; 3) an independent management thread which performs time-driven FUSE buffer synchronisation. With buffered FUSE, we examined the performance of five different filesystems and three disk scheduling algorithms in a combinatorial manner. With bFUSE on the XFS filesystem using deadline scheduling, we achieved IO performance improvements of 470% and 419% in Android ICS and JB, respectively, over the existing smartphone device.

Keywords: Android; storage; user-level filesystem; FUSE; write buffer; embedded systems.

Reference to this paper should be made as follows: Jeong, S. and Won, Y. (2014) ‘Buffered FUSE: optimising the Android IO stack for user-level filesystem’, Int. J. Embedded Systems, Vol. 6, Nos. 2/3, pp.95–107.

Biographical notes: Sooman Jeong received his BS degree in Computer Science from Hanyang University, Seoul, Korea, in 2003. He worked for Samsung Electronics as a Software Engineer. He is currently working toward his MS degree in the Department of Computer Science, Hanyang University, Seoul, Korea. His research interests include optimising the I/O stack of smartphones.

Youjip Won received his BS and MS degrees in Computer Science from the Seoul National University, Korea, in 1990 and 1992, respectively. He received his PhD in Computer Science from the University of Minnesota, Minneapolis, in 1997. After receiving his PhD degree, he joined Intel as a Server Performance Analyst. Since 1999, he has been with the Department of Computer Science, Hanyang University, Seoul, Korea, as a Professor. His research interests include operating systems, file and storage subsystems, multimedia networking, and network traffic modelling and analysis.

1 Introduction

Smart devices have rapidly spread to the public and expanded their territory into all home electronics, including phones, TVs, cameras, and game consoles. As they have become so closely related to our everyday lives, improving the performance of these smart devices will have a significant impact on society. A recent study on Android smartphones suggests that storage is the biggest performance bottleneck (Kim et al., 2012a).

An Android-based device is no longer a simple media player but has become a professional content creator. One piece of clear evidence for this trend is the recent introduction of an Android-based digital camera with a high resolution sensor (up to 17 mega-pixels) (Galaxy Camera, http://www.samsung.com/us/photography/galaxy-camera). In the future, smart devices including phones, TVs, and cameras are projected to acquire and play movies at ultra-HD (3,840 × 2,160 @ 60 frames/sec) resolution.

Copyright © 2014 Inderscience Enterprises Ltd.

Android apps will require more IO bandwidth to handle such high definition multimedia contents.

The Android platform partitions its storage into multiple logical partitions. There are two main storage partitions in Android: data and sdcard. The data partition (/data) is used to harbour text files, e.g., SQLite tables, and the sdcard partition (/sdcard) harbours multimedia files, e.g., pictures, mp3 files and video clips. In recent smartphones, both of these partitions share the same storage device.

Android storage has gone through several generations of evolution. In the early models, as the partition names suggest, the data partition was at the internal NAND storage device (eMMC) and the sdcard partition was at the external sdcard (HTC Dream, http://en.wikipedia.org/wiki/HTC_Dream). In the later generation of Android-based smartphones (Samsung Galaxy S2, http://www.samsung.com/global/microsite/galaxys2/html/), the data partition and the sdcard partition resided at the same physical device, internal NAND-based storage, and each was allocated a 'fixed' size logical partition. Until this generation, the Android IO stack imposed EXT4 (Mathur et al., 2007) over the data partition and the VFAT (Microsoft Corporation, 2000) filesystem over the sdcard partition. The VFAT filesystem is the de facto standard in embedded multimedia devices, e.g., cameras, camcorders, TVs, etc. Therefore, for PC compatibility, it is mandatory that the sdcard partition adopts the VFAT filesystem despite its technical deficiencies (Lim et al., 2013). However, this fixed partition-based storage configuration causes a significant underutilisation of the storage space.

The recent Android platform imposes a user-level filesystem (FUSE) for the sdcard partition instead of VFAT. This allows the storage space to be freely manoeuvred between the two respective partitions. The FUSE filesystem exists as a virtual filesystem layer on top of EXT4, the native filesystem. While this configuration provides flexibility in managing the internal storage space, it causes significant overhead in the FUSE, decreasing the performance of IO accesses to the user partition. For example, in Galaxy S3, the sequential write performance in the FUSE imposed filesystem is one fourth of that in the native EXT4 filesystem partition.

The IO accesses to the Android storage can be categorised into two types: data partition accesses and sdcard partition accesses. While the IO accesses to both partitions suffer from excessive overhead, the sources of inefficiency differ vastly. For data partition accesses, the majority of IOs are created by SQLite operations and the inefficiency lies in the excessive filesystem journaling (Lee and Won, 2012; Jeong et al., 2013b). In this work, we focus on optimising the Android IO stack for handling multimedia files. The sdcard partition is the default storage partition for mp3 files, video files, photographs, and user-downloaded documents, e.g., pdf files. We found that the IO accesses to the sdcard partition suffer from significant overhead and the apps utilise only 25% of the raw NAND storage bandwidth.

We examined the behaviour of FUSE, the native filesystem, the buffer cache, and the IO scheduler for user partition accesses and found that the FUSE layer is the dominant source of overhead. FUSE entails significant overhead mostly due to the following three reasons:

1 command fragmentation
2 excessive context switches
3 loss of access correlation.

We propose the buffered FUSE (bFUSE) framework, which effectively addresses the above mentioned three technical issues. The bFUSE framework consists of:

1 extended FUSE IO unit
2 FUSE buffer managed by an independent thread
3 time driven FUSE buffer synchronisation.

The Android IO stack consists of FUSE, the native filesystem (EXT4), the IO scheduler for the block device (CFQ), and the underlying eMMC storage. We focus on finding the optimal combination among the IO stack design choices. We examined the behaviour of five filesystems (EXT4, XFS, F2FS, BTRFS, and NILFS2) and three IO schedulers provided by the standard kernel (CFQ, deadline, and NOOP) in a combinatorial manner. By using bFUSE with the deadline-based IO scheduler and changing the filesystem to XFS, the IO performance (storage throughput) increases by 470% over the baseline (existing smartphone). With this optimisation, we are able to achieve 38 Mbyte/sec sequential write bandwidth and utilise 99% of the bandwidth in the Android storage stack.

The remainder of this paper is organised as follows. Section 2 presents the background. The behaviour of the Android IO stack is analysed in Section 3. bFUSE is introduced in Section 4. Section 5 provides optimisations for the Android IO stack. Section 6 presents the results of our integration of the proposed schemes. Section 7 describes other works related to the study, and Section 8 concludes this paper.

2 Background

2.1 Storage partitions for Android-based smartphones

Android manages several filesystem partitions: /recovery, /boot, /cache, /system, /data and /sdcard. The /system partition contains Android-executable files and pre-installed applications. The /cache partition is used for updating firmware. The /boot and /recovery partitions are used for maintaining the boot image. There are two types of main partition: '/data' and '/sdcard'. The /data partition

harbours SQLite tables, SQLite journals, and apps. The /sdcard partition usually harbours pictures, video clips, and mp3 files, which are mostly user created contents. We call /sdcard an 'sdcard partition' or a 'user partition'.

Typical accesses to the data partition (/data) and the user partition (/sdcard) are small (4 Kbyte) synchronous writes and large (> 64 Kbyte) buffered writes, respectively (Lee and Won, 2012). The IOs to the /data partition are mostly caused by SQLite database operations, and an excessive number of IO operations are generated due to the uncoordinated interactions between SQLite and EXT4 (Lee and Won, 2012). The sdcard partition does not suffer from the journaling of journal anomaly (Jeong et al., 2013b), but its performance is severely degraded by excessive overhead in the FUSE.

Figure 1 Storage configuration for Android-based smartphones

There are two different approaches to configuring the data and user partitions. Earlier smartphone models locate the data and user partitions at different logical partitions (Google Nexus S, http://www.google.com/nexus/s/; Samsung Galaxy S, http://www.samsung.com/global/microsite/galaxys/index_2.html; Samsung Galaxy S2, http://www.samsung.com/global/microsite/galaxys2/html/). Configuration 1 in Figure 1 corresponds to this configuration. In this configuration, the data and user partitions are formatted with different filesystems, EXT4 and VFAT, respectively. Recently, Android adopted FUSE to manage both the data and user partitions at the same physical partition (Google Galaxy Nexus, http://www.google.com/nexus/4; Samsung Galaxy S3, http://www.samsung.com/ae/microsite/galaxys3/en/index.html). This is to dynamically adjust the partition sizes, allowing more flexibility in utilising the storage space, and to maintain host compatibility for transferring files to and from the host PC via the media transfer protocol (MTP) (Osborne et al., 2010). In this configuration, /sdcard is imposed with FUSE. The FUSE imposed filesystem looks and works like an independent logical partition, and VFS mounts this partition with the FUSE filesystem type. The FUSE imposed filesystem resides in the existing concrete filesystem as a normal file (or a directory). Configuration 2 in Figure 1 illustrates this configuration.

By mounting the user partition (/sdcard) with FUSE, the system allows the users to freely manoeuvre the proportion of the data and user partitions. However, the overhead caused by the FUSE virtualisation service is prohibitively large.

Table 1 illustrates the information on the data and user partitions of six Android-based smartphones. The models are sorted in chronological order. In earlier models, e.g., Nexus One, the data partition uses raw NAND and is managed by a log structured filesystem such as YAFFS2. The more recent smartphones, e.g., Nexus S (Google Nexus S, http://www.google.com/nexus/s/), Galaxy S (Samsung Galaxy S, http://www.samsung.com/global/microsite/galaxys/index_2.html), and Galaxy S2 (Samsung Galaxy S2, http://www.samsung.com/global/microsite/galaxys2/html/), began to use FTL loaded NAND storage devices such as eMMC as their internal storage. In these models, the data partition and the user partition reside in separate partitions. In the most recent Android-based smartphones, such as Galaxy Nexus and Galaxy S3, the data and user partitions reside in the same partition. The user partition is mounted with FUSE and resides in the data partition as a normal directory (and files).

Table 1 Storage configuration of various smartphone models

Model name (Year) | Internal /data | Internal /sdcard | Ext-storage /extSdCard
Nexus One (http://en.wikipedia.org/wiki/Nexus_One) (2010) | YAFFS2 | VFAT | VFAT
Nexus S (http://www.google.com/nexus/s/) (2010) | EXT4 | VFAT | -
Galaxy S (http://www.samsung.com/global/microsite/galaxys/index_2.html) (2010) | RFS | VFAT | VFAT
Galaxy S2 (http://www.samsung.com/global/microsite/galaxys2/html/) (2011) | EXT4 | VFAT | VFAT
Galaxy Nexus (http://www.google.com/nexus/4) (2011) | EXT4 | FUSE | -
Galaxy S3 (http://www.samsung.com/ae/microsite/galaxys3/en/index.html) (2012) | EXT4 | FUSE | VFAT


2.2 Filesystem in user space

FUSE (Filesystem in Userspace, http://fuse.sourceforge.net/) is a widely used framework created to enable filesystems to be developed in user space without modifying kernel source code. The FUSE framework is especially helpful in realising virtual filesystems. FUSE allows programmers to directly control filesystem operations using a simple application interface that is equivalent to mount, open, read, or write in the legacy system calls. On the other hand, FUSE causes additional context switches and memory copies, which lead to significantly lower performance. Some performance may be regained by increasing the speed of the processor, memory, and bus. Rajgarhia and Gehani (2010) showed that FUSE exhibits similar performance to the existing kernel level filesystems on a desktop PC.

2.3 Dissection of the IO stack overhead

We overhauled the IO stack for sdcard partition access and analysed the overhead caused by each layer of the storage stack. We conducted an experiment on a commercially available smartphone model, the Samsung Galaxy S3 (http://www.samsung.com/ae/microsite/galaxys3/en/index.html) (Samsung Exynos 4412 1.4 GHz quad-core, 2 GB RAM, 32 GB eMMC), running Android 4.0.4 (ICS) and Android 4.1.2 (JB). The respective Android kernel versions are 3.0.15 and 3.0.30. The data partition resides at internal flash storage and is formatted with the EXT4 filesystem. The user partition is imposed with FUSE and actually resides at the data partition. The default IO scheduler in Android is complete fair queuing (CFQ) (Axboe, 2007). We generated sequential write workloads with 512 Kbyte IO unit size and 512 Mbyte file size using Mobibench (Jeong et al., 2013a).

First, we measured the performance of the raw device. We open() the /sdcard partition and generated the workloads. To minimise interference from the storage layers, we used the NOOP IO scheduler instead of the default CFQ; CFQ is known to be more CPU demanding than NOOP (Kim et al., 2009). This experiment is labelled NOOP. Second, we changed the disk scheduler to CFQ and still used the raw device, to measure the overhead of the disk scheduler. This is labelled BLK. In the third experiment (EXT4), we measured the filesystem overhead: we formatted the storage device with the EXT4 filesystem and measured the performance. In the fourth experiment (VFAT), we used VFAT instead of EXT4. In the fifth experiment (FUSE), we mounted the FUSE partition, which resides on EXT4, and measured the IO performance. This is the default configuration for the user partition in Galaxy S3. Table 2 summarises the labels and the respective experiment configurations.

Table 2 Configuration options for each experiment

Label        | NOOP | BLK | EXT4 | VFAT | FUSE
IO scheduler | NOOP | CFQ | CFQ  | CFQ  | CFQ
Filesystem   | -    | -   | EXT4 | VFAT | EXT4
FUSE         | -    | -   | -    | -    | Yes

Figure 2 Overhead of individual storage layers in eMMC on Galaxy S3

Notes: NOOP: raw device with NOOP scheduler, BLK: raw device with CFQ scheduler, EXT4: EXT4 filesystem with CFQ scheduler, VFAT: VFAT filesystem with CFQ scheduler, FUSE: FUSE with EXT4 and CFQ scheduler.

Figure 2 illustrates the results. The labels on the x-axis denote the five experiments. In ICS and JB, the raw device (NOOP) yielded sequential write bandwidths of 39 Mbyte/sec and 40 Mbyte/sec, respectively. The performance decreased by 7% and 5% in ICS and JB, respectively, when the CFQ disk scheduler was used (BLK). The performance decreased by 14% and 16%, respectively, when the EXT4 filesystem was used. In the default configuration for the user partition of Galaxy S3, i.e., when we imposed FUSE on the filesystem, we obtained a sequential write bandwidth of only 7 Mbyte/sec on both Android versions. This is less than 20% of the physical bandwidth of the raw device.

Figure 3 IO overhead breakdown

In the Android storage stack of the JB version, the overheads of FUSE, EXT4, and the IO scheduler account for 79%, 16%, and 5% of the total overhead, respectively (Figure 3), and only 20% of the raw device performance is available to the applications. The IO accesses to the user partition suffer from excessive software overhead, and the dominant fraction of the overhead is caused by FUSE. The overhead of the filesystem, EXT4, is more significant in a mobile device with NAND-based storage than in a desktop with an HDD, where it is commonly accepted that the filesystem overhead is 5% of the total overhead (Giampaolo, 1998). When the Android platform adopts FUSE instead of VFAT to manage the user partition, the IO performance decreases by 50%. Based on these observations, we carefully conjecture that, from Google's point of view, flexible and efficient space management must be more important than storage performance in current smartphone models.

Figure 4 IO distribution of file types, block types, IO modes, randomness, and IO size, (a) IO size (b) locality (c) IO modes (d) block types
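The layer-by-layer measurements above boil down to timing the same sequential-write loop under each configuration. As an illustrative sketch only (not the authors' Mobibench code), a minimal sequential-write microbenchmark might look like the following Python; the 512 Mbyte file and 512 Kbyte record size are scaled down so the example runs quickly, and the target is a temporary file rather than the /sdcard partition:

```python
import os
import tempfile
import time

def sequential_write_bw(path, file_size, record_size):
    """Write `file_size` bytes in `record_size` records; return Mbyte/sec."""
    buf = b"\xa5" * record_size
    t0 = time.perf_counter()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = 0
        while written < file_size:
            written += os.write(fd, buf)
        os.fsync(fd)  # include the flush so the device, not just the page cache, is timed
    finally:
        os.close(fd)
    elapsed = max(time.perf_counter() - t0, 1e-9)
    return (file_size / (1024 * 1024)) / elapsed

# Scaled-down workload: 8 Mbyte file, 512 Kbyte records.
with tempfile.NamedTemporaryFile(delete=False) as f:
    target = f.name
bw = sequential_write_bw(target, 8 * 1024 * 1024, 512 * 1024)
os.remove(target)
print(f"sequential write bandwidth: {bw:.1f} Mbyte/sec")
```

Running the same loop against the raw device, a kernel filesystem, and a FUSE mount is what separates the NOOP/BLK/EXT4/VFAT/FUSE bars from each other.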

3 The Android IO stack for the user partition

3.1 IO characteristics of the user partition accesses

To design an IO subsystem, it is mandatory to acquire a thorough understanding of the IO workload characteristics for the respective storage partition. We selected the most popular Android apps that use the sdcard partition: camera, camcorder, and gallery. For the camera application, we took pictures consecutively for one minute. For camcorder, we recorded a video for one minute at full HD resolution (1,920 × 1,080 @ 30 fps). For gallery, we scrolled the thumbnails and displayed each picture for one minute. We ran the experiment on Galaxy S3 with the user partition mounted with FUSE. We used the mobile storage analysis tool (MOST) (Jeong et al., 2013a) to collect and analyse the trace. MOST can extract the file type (journal vs. data file) and block type (data vs. metadata) information for a given block access. In this study, we examined the IO size distribution, the spatial locality (random vs. sequential), the fraction of synchronous IO, and the fraction of accesses for each filesystem region (metadata, journal, and data) for the given Android apps. This information provides an important guideline in designing and optimising the filesystem and the respective storage stack. The following summarises the characteristics of the IOs to the user partition.

• Most IOs are over 256 Kbyte in size. In camera and camcorder, 70% of the write requests to the block device are larger than 256 Kbyte [Figure 4(a)]. This is an important characteristic in designing the filesystem as well as the storage device. In particular, the current FUSE splits incoming IO requests into 4 Kbyte units. When the incoming request is large, this splitting behaviour of the FUSE causes significant overhead. For the data partition accesses in Android, 64% of the write operations are 4 Kbyte (Jeong et al., 2013b).

• Over 90% of the total IOs are sequential. For sdcard partition accesses, sequential IOs account for over 90% of the total IOs [Figure 4(b)]. The percentages were near 100% for camera and camcorder. We have observed an entirely different access characteristic in the data partition accesses, where random writes constitute a dominant fraction (75%) of the entire write operations.

• Over 70% of the total IOs are buffered. Buffered IOs accounted for over 70% of all IOs in all three applications [Figure 4(c)]. The percentage of buffered IOs reached 100% for the camcorder application. This is in direct contrast to the data partition access characteristics (Jeong et al., 2013b), where 70% of the writes are synchronous, caused by SQLite database behaviour.

• Journal and metadata writes are negligible. We categorised the IO accesses into three groups based on the type of filesystem region: data, metadata, and journal. Accesses to the filesystem journal and metadata are negligible. Data writes account for 95% and 97% of the total IO counts for camera and camcorder, respectively [Figure 4(d)]. In the case of gallery, data reads account for almost 100% of the total read counts and data writes account for 30% of the total write counts.

Note: In Figure 4, the number at the end of each bar indicates the number of respective block IOs for R (read) and W (write).

3.2 Anatomy of the FUSE behaviour

We observed that 79% of all IO overhead for the IO accesses to the sdcard partition is caused by FUSE (Figure 3). We examined the call path of accessing the sdcard partition to analyse the cause of this overhead. Figure 5 schematically illustrates the mechanism for a write() system call. When the application makes a write system call to the FUSE partition, VFS forwards the request to the FUSE layer. The FUSE layer inserts this request into its own request queue, which resides in the kernel. The OS hands over the CPU to another thread after it inserts the write request into the queue. The user-level FUSE library reads the FUSE request queue through the VFS interface. The FUSE library is run by its own thread. The FUSE layer splits the IO request in the request queue into 4 Kbyte units and returns them to the FUSE library. The FUSE library translates the incoming IOs into the filesystem operations appropriate for EXT4 (or another existing concrete filesystem where FUSE resides) and creates the requests.

Figure 5 write() path of the FUSE framework (see online version for colours)

We identified the main source of FUSE's inefficiency to be IO fragmentation. The FUSE library removes an IO request from the FUSE mount filesystem in 4 Kbyte units, which we call 'FUSE IO units'. Since the default FUSE IO unit is 4 Kbyte, the FUSE library splits a large IO request from the application into multiple IO requests. This IO fragmentation increases the system call processing overhead and the number of context switches. Also, IO fragmentation can dismantle the spatial locality that the original IO stream bears. Since large IO requests are split into multiple requests, it is possible that a set of split requests is interleaved with other IO streams, e.g., journal buffer flushes and metadata synchronisation. As a result, the aggregate stream may look 'more' random. It is critical for NAND-based flash storage to properly identify the randomness and hotness of the underlying traffic (Lee et al., 2008; Hsieh et al., 2006). Loss of access locality may significantly aggravate the inefficiency of the underlying storage behaviour.

In the Android IO stack, the side effect of IO fragmentation amplifies dramatically because most sdcard accesses are large sized (> 256 Kbyte) and mobile devices have relatively low CPU and memory copy performance. Subsequently, the overhead of going through the FUSE layer becomes prohibitively large. We physically examined the overhead of IO fragmentation. We varied the FUSE IO size from 4 Kbyte to 512 Kbyte and measured the throughput for sequential writes. The IO size denotes the unit size into which the FUSE library splits the IO request at the request queue. Figure 6 illustrates the results. On both Android versions, with a 4 Kbyte FUSE IO unit size, the internal storage yielded only 7 Mbyte/sec. With a 512 Kbyte FUSE IO unit size, the internal storage yielded 30 Mbyte/sec sequential write throughput.

Figure 6 Throughput vs. FUSE IO unit size

Notes: Sequential buffered write, internal eMMC, file: 512 Mbyte, record size: 512 Kbyte.

We examined the number of context switches for each FUSE IO unit size using Mobibench (Jeong et al., 2013a). Figure 7 illustrates the results. The total number of context switches decreases as the IO unit size increases.

Figure 7 Number of context switches vs. FUSE IO size (see online version for colours)

Notes: Sequential buffered write, internal eMMC, file: 512 Mbyte, record size: 512 Kbyte.
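The scale of the fragmentation is easy to quantify. A 512 Mbyte sequential write issued in 512 Kbyte records is only 1,024 application-level writes, but with the default 4 Kbyte FUSE IO unit the FUSE library produces 131,072 fragments (the round figure of 128,000 write system calls quoted in Section 5.3). The short calculation below is an illustrative sketch; the "roughly two context switches per fragment" cost model (one switch into the FUSE library's thread and one back) is our simplifying assumption, not a measured figure:

```python
MBYTE = 1024 * 1024
KBYTE = 1024

FILE_SIZE = 512 * MBYTE   # workload size used throughout the paper
APP_RECORD = 512 * KBYTE  # record size issued by the application

def fuse_fragments(file_size, fuse_io_unit):
    """Number of requests the FUSE library produces for a sequential write."""
    return file_size // fuse_io_unit

app_writes = FILE_SIZE // APP_RECORD                # writes issued by the app
frag_4k = fuse_fragments(FILE_SIZE, 4 * KBYTE)      # default FUSE IO unit
frag_512k = fuse_fragments(FILE_SIZE, 512 * KBYTE)  # extended FUSE IO unit (bFUSE)

# Assumed cost model (illustrative only): ~2 context switches per fragment.
switches_4k = 2 * frag_4k
switches_512k = 2 * frag_512k

print(app_writes, frag_4k, frag_512k)        # 1024 131072 1024
print(switches_4k, "vs", switches_512k)
```

Extending the FUSE IO unit from 4 Kbyte to 512 Kbyte thus cuts the number of fragments, and with them the kernel/user crossings, by a factor of 128, which is consistent with the trends in Figures 6 and 7.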

4 Buffered FUSE

4.1 Design

We carefully examined the IO workload characteristics of sdcard accesses in the Android platform and designed a novel user-level filesystem mechanism, called bFUSE, to effectively address the IO fragmentation issue of the FUSE. bFUSE is specifically designed for the Android IO stack, where a dominant fraction of the IOs are large sized (> 256 Kbyte) sequential (and buffered) writes. bFUSE practically eliminates all overhead of the FUSE layer and increases the efficiency of the FUSE while preserving the flexibility in the storage space management.

Despite its dramatic performance impact, the design and implementation of bFUSE are rather simple and straightforward. bFUSE consists of the following key ingredients. First is the extended FUSE IO unit size: in bFUSE, the FUSE IO unit increases from 4 Kbyte to 512 Kbyte. Note that about 70% of the IO accesses are larger than 256 Kbyte [Figure 4(a)]. Second, we introduce a FUSE buffer between the FUSE library and VFS. In legacy FUSE, upon reading a request from the FUSE request queue, the sdcard service immediately issues the request to the underlying filesystem; in bFUSE, the FUSE library takes incoming requests from the request queue in the kernel and places them in the FUSE buffer. Third, we introduce a management thread for the FUSE buffer, which periodically flushes the FUSE buffer contents to the underlying filesystem. Thread-based FUSE buffer management enables the FUSE library to process IO requests concurrently with the FUSE buffer management, exploiting the multiple cores of a modern mobile CPU. Figure 8 illustrates the organisation of the bFUSE framework. The FUSE buffer resides in the user space (within the FUSE library). We implemented the FUSE buffer in the user address space instead of the kernel space to reduce the number of system calls and to dynamically adjust the FUSE buffer size.

Figure 8 write() path of the bFUSE framework (see online version for colours)

4.2 Implementation

A FUSE buffer is an array of fixed size entries. An entry consists of a buffer header and buffer data. The buffer header contains a file descriptor, offset, and size. The buffer data is a fixed size region that holds the data blocks of write operations. A sufficiently large region, 2 Mbyte, is allocated to an entry so that it can hold consecutive write requests in a single entry. Figure 9 illustrates the structure of the buffer.

When the FUSE library inserts a request into the FUSE buffer, it searches the FUSE buffer (write buffer) entries for the same file descriptor. If the incoming request and an existing FUSE buffer entry form a consecutive region of a file, the FUSE library places the data blocks of the incoming request in the data block region of the existing entry and updates the size field. When the data region of the existing entry is full, the FUSE library allocates a new entry in the FUSE buffer.

Figure 9 Structure of bFUSE (see online version for colours)

bFUSE may negatively affect the overall IO performance due to the memory copy overhead and the increased latency of flushing IO requests to the filesystem. The size of the data region is user configurable, and we performed a series of experiments to determine the right size for the data region. We managed the bFUSE with a producer-consumer paradigm and allocated a separate thread for flushing the buffer (Figure 9).

The FUSE buffer manager regularly flushes the buffer entries. In addition, it flushes the buffer entries for a given file when the file is closed, when a request reaches the end of the file, or when fsync()/fdatasync() is called. The performance of the user partition accesses relies heavily on the interval at which the FUSE buffer manager flushes the contents. We examined the IO performance under various flush intervals and found that 10 msec yields the best performance. In all experiments, the flush interval of the FUSE buffer manager was set to 10 msec.

We measured the throughput of sequential buffered writes (512 Kbyte FUSE IO unit size) both with and without the FUSE buffer manager. Figure 10 illustrates the results. With a single threaded implementation of bFUSE, i.e., without the FUSE buffer manager, the throughput actually decreases with the introduction of bFUSE. With the FUSE buffer manager, we achieved an 8% increase in performance. Based on a detailed examination of the process, we believe the reason for this performance gain is reduced randomness in the incoming IOs.

Figure 10 Effect of separate thread management under various FUSE buffer sizes (see online version for colours)

Notes: Sequential buffered write, internal eMMC, filesystem: EXT4, file size: 512 Mbyte, record size: 512 Kbyte.

The FUSE buffer in the FUSE layer brings a number of improvements to the Android storage stack: reductions in the number of write() system calls, the number of context switches, the randomness in the underlying IO traffic, and the filesystem journaling overhead.

Figure 12 Effect of FUSE buffer on randomness of the IO accesses, (a) without FUSE buffer (b) with FUSE buffer (see online version for colours)

The objective of the FUSE buffer is to coalesce multiple IO requests into one so that we can reduce the aggregate system call overhead and make a larger fraction of the IOs sequential. Most NAND flash storages are designed to exploit the sequentiality of the incoming IOs (Lee et al., 2005) to improve the IO performance and to lengthen the storage lifetime. We examined the ratio of random IOs under various FUSE buffer sizes.
Figure 11 illustrates the results. We can see that as the ratio of random IOs decreases, the throughput increases. With an 8 Mbyte FUSE buffer, the ratio of random IOs to the entire IOs is the smallest.
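The entry-coalescing rule of Section 4.2 can be sketched compactly. The following is an illustrative Python model, not the authors' implementation: each entry keeps a file descriptor, a starting offset, and up to 2 Mbyte of data, and an incoming write is appended to an existing entry only when it targets the same file and starts exactly where that entry ends; otherwise a new entry is allocated:

```python
ENTRY_CAPACITY = 2 * 1024 * 1024  # 2 Mbyte data region per entry, as in the paper

class Entry:
    """One FUSE buffer entry: header (fd, offset, size) plus a data region."""
    def __init__(self, fd, offset):
        self.fd = fd
        self.offset = offset
        self.data = bytearray()

    @property
    def size(self):
        return len(self.data)

class FuseBuffer:
    def __init__(self):
        self.entries = []

    def insert(self, fd, offset, data):
        # Look for an entry for the same fd that this write extends
        # contiguously and that still has room in its data region.
        for e in self.entries:
            if (e.fd == fd and e.offset + e.size == offset
                    and e.size + len(data) <= ENTRY_CAPACITY):
                e.data += data          # coalesce: append and grow the size field
                return e
        e = Entry(fd, offset)           # otherwise allocate a new entry
        e.data += data
        self.entries.append(e)
        return e

buf = FuseBuffer()
buf.insert(fd=3, offset=0, data=b"a" * 524288)       # 512 Kbyte write
buf.insert(fd=3, offset=524288, data=b"b" * 524288)  # contiguous: coalesced
buf.insert(fd=3, offset=4194304, data=b"c" * 4096)   # gap: new entry
print(len(buf.entries), buf.entries[0].size)         # 2 1048576
```

Two contiguous 512 Kbyte writes end up in one entry, so one flush later issues a single 1 Mbyte request to the filesystem instead of two (or, under the default 4 Kbyte FUSE IO unit, 256) smaller ones.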

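The time-driven synchronisation can likewise be sketched as a producer-consumer pair: a separate manager thread wakes at a fixed interval (10 msec in the paper's final configuration) and drains whatever entries have accumulated, while a flush is also forced on close() or fsync(). The sketch below models the flush as a callback rather than real filesystem IO, and names such as `FlushManager` are illustrative, not from the paper:

```python
import threading

class FlushManager:
    """Periodically drain buffered entries to a flush callback."""
    def __init__(self, flush_cb, interval=0.010):  # 10 msec flush interval
        self.entries = []
        self.lock = threading.Lock()
        self.flush_cb = flush_cb
        self.interval = interval
        self.stop = threading.Event()
        self.thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self.thread.start()

    def add(self, entry):
        with self.lock:
            self.entries.append(entry)

    def flush_now(self):
        """Also invoked on close(), fsync()/fdatasync(), or a full entry."""
        with self.lock:
            pending, self.entries = self.entries, []
        for e in pending:
            self.flush_cb(e)

    def _run(self):
        # Wake every `interval` seconds until asked to stop.
        while not self.stop.wait(self.interval):
            self.flush_now()

    def shutdown(self):
        self.stop.set()
        self.thread.join()
        self.flush_now()  # close path: force the remaining entries out

flushed = []
mgr = FlushManager(flushed.append)
mgr.start()
mgr.add(("fd=3", 0, 524288))
mgr.add(("fd=3", 524288, 524288))
mgr.shutdown()
print(len(flushed))  # 2: every buffered entry reaches the flush callback
```

Whether an entry leaves the buffer via the periodic timer or via the close path, it is flushed exactly once, which mirrors the at-most-10-msec staleness bound the paper settles on.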
Figure 11 Fraction of random IOs (IO count)

Notes: Sequential buffered write on internal eMMC through FUSE with/without bFUSE. Base filesystem: EXT4, file size: 512 Mbyte, record size (app): 512 Kbyte.

The FUSE buffer successfully reduces the number of journal and metadata updates and the number of system calls, reducing the system call overhead. Also, by coalescing multiple write requests into one, there is less chance that the sequential IOs issued by the application are inter-mixed with other IO streams, e.g., journal flushes. We examined the block access trace for a sequential write of a 512 Mbyte file with and without the FUSE buffer. Figure 12 illustrates the results. Without the FUSE buffer, writing a file consists of a larger number of smaller write bursts [Figure 12(a)]. With the FUSE buffer, writing a file consists of a smaller number of larger write bursts. Also, with the write buffer, there are fewer journal updates and metadata updates. The occasional updates in the 0–0.5 × 10³ LBA range and the 2.5 × 10³ LBA range are for metadata updates and journal updates, respectively.

Figure 13 Effect of IO scheduling policy

Notes: Sequential buffered write on internal eMMC through FUSE with three different IO schedulers. File size: 512 Mbyte, record size: 512 Kbyte.

5 Optimising the Android storage stack

In an effort to optimise the entire IO stack, we examined the throughput of the user partition under five filesystems, EXT4, XFS, F2FS, BTRFS, and NILFS2, and three IO schedulers, NOOP, CFQ, and deadline, in a combinatorial manner.


5.1 IO scheduler

The Linux kernel currently provides the CFQ, NOOP, and deadline IO schedulers, with CFQ being the default. These schedulers are designed for HDDs and should be re-examined in the context of NAND-based storage devices (Wang et al., 2013; Shen and Park, 2013). We compared the performance of CFQ, NOOP, and deadline as the IO scheduler. Figure 13 shows the results. The default IO scheduler, CFQ, yielded the worst performance among the three, with deadline performing 6% better than CFQ. With an 8 Mbyte FUSE buffer and deadline-driven scheduling, the sequential write throughput increases by 12% compared to the CFQ IO scheduler.

5.2 Filesystem

We examined the performance of five widely used filesystems in Linux: EXT4, BTRFS, NILFS2, XFS, and F2FS. We examined how well these filesystems match with the FUSE framework. XFS (Sweeney et al., 1996) is a journaling filesystem designed for enterprise-class storage systems. BTRFS (Rodeh et al., 2012) is a copy-on-write-based filesystem and is envisioned as the future filesystem for the Linux OS. NILFS2 is a log-structured filesystem (Konishi et al., 2006). We included NILFS2 to examine the effectiveness of the legacy log-structured filesystem on the Android platform. Flash Friendly Filesystem, F2FS (F2FS Patch on LKML, https://lkml.org/lkml/2012/10/5/205), is the youngest among the five, specifically designed for flash-based storage.

The mkfs tools for F2FS, NILFS2, and BTRFS are ported for the ARM processor. Since F2FS was only recently merged into the standard Linux kernel 3.8 (F2FS File-System Merged into Linux 3.8 Kernel, http://www.phoronix.com/scan.php?page=news_item&px=MTI1OTU), we backported F2FS to Linux 3.0.15 (Android ICS for Galaxy S3) and Linux 3.0.30 (Android JB for Galaxy S3), respectively. We measured the performance of sequential writes (buffered IOs). The IO size was 512 Kbyte. The FUSE IO unit size was 512 Kbyte and the FUSE buffer was not used. Figure 14 shows the throughput for the different filesystems.

Figure 14 Throughput vs. fraction of sequential IO counts

Notes: Sequential buffered write on internal eMMC through FUSE with five different base filesystems. File size: 512 Mbyte, record size: 512 Kbyte, FUSE IO size: 512 Kbyte, without FUSE buffer.

Without the FUSE buffer, XFS and F2FS exhibited 17% and 10% improvements, respectively, over EXT4 on both Android versions. BTRFS exhibited a 5% improvement over EXT4 on ICS, while its performance decreased by 20% on JB. NILFS2 performed significantly worse than the rest. In addition to the write requests to the storage device, there were occasional kernel-generated IO requests, such as journal updates and metadata updates. These requests negatively affect the filesystem performance because they break the sequentiality in the underlying IO stream. Figure 14 also shows the fraction of sequential IOs to all IO requests to the filesystem. The filesystem's performance coincides with the fraction of sequential IOs.

5.3 Analysis

We analysed the block level access patterns in these filesystems. Figure 15 shows the time series of block accesses in writing 512 Mbyte. The X-axis is time and the Y-axis is the logical block address. It only shows write() operations. Sequential write patterns are commonly observed in all filesystem block traces. IO requests in the lower LBA region (0–0.5 × 10³) are for metadata updates.

The FUSE buffer and the extended FUSE IO unit are designed to mitigate or to remove the overhead caused by IO fragmentation. We examined in detail how these features contribute to minimising the overhead of FUSE. We generated sequential write workloads and measured the number of system calls, the number of block IO requests issued to the NAND storage device, and the number of metadata and journal updates. Without the extended FUSE IO unit, writing 512 Mbyte consists of 128,000 write system calls since the FUSE library splits the incoming IO requests into 4 Kbyte units. By increasing the FUSE IO unit size to 512 Kbyte, the number of write system calls decreases by ×128 (from 128,000 to 1,000). With an 8 Mbyte FUSE buffer, we further reduce the number of system calls by ×4 (from 1,000 to 249). With bFUSE (extended FUSE IO and a FUSE buffer), we are able to achieve a sheer ×512 reduction in the number of system calls (from 128,000 to 249). On average, bFUSE coalesces four write system calls from the application into one write system call with the help of the FUSE buffer.

The IO requests shown in Table 3 are the requests issued to the internal storage. Since the maximum IO size is 512 Kbyte in the eMMC standard (e-MMC, 2011), the number of IO requests remains the same regardless of the FUSE buffer. In data partition accesses, journal and metadata update operations constitute a dominant fraction of all IO operations (Lee and Won, 2012). We found that for user partition accesses, i.e., buffered writes, the journal and metadata update overhead is negligible.
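The system call counts in this analysis follow directly from the unit sizes (using decimal units, as the paper does); the small gap between the computed 250 and the measured 249 comes from the time-driven flush. A quick sanity check:

```python
FILE_KB = 512_000  # 512 Mbyte file, decimal units as used in the paper

# Baseline FUSE: the FUSE library splits every request into 4 Kbyte units.
base_calls = FILE_KB // 4        # 128,000 pwrite() calls

# Extended FUSE IO unit: 512 Kbyte per call.
ext_calls = FILE_KB // 512       # 1,000 calls, a x128 reduction

# FUSE buffer: on average, four application writes coalesce into one.
buf_calls = ext_calls // 4       # ~250; the paper measures 249

assert base_calls == 128_000
assert ext_calls == 1_000
assert base_calls // ext_calls == 128
assert buf_calls == 250
```

The combined ×512 figure quoted in the text (128,000 → 249) is the product of the two reductions, rounded by the flush-timing effect.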


Table 3 The number of system calls, IO operations and metadata/journal updates

                          EXT4       XFS        F2FS       BTRFS      NILFS2
  # of pwrite()
    FUSE                  128,000    128,000    128,000    128,000    128,000
    nobuf                 1,000      1,000      1,000      1,000      1,000
    bFUSE                 249        249        249        249        249
  # of IO requests
    FUSE                  1,048      1,121      1,044      1,041      1,077
    nobuf                 1,032      1,112      1,048      1,016      1,070
    bFUSE                 1,030      1,099      1,049      1,011      1,059
  Metadata/journal
    FUSE                  23         11         26         15         153
    nobuf                 12         12         33         9          137
    bFUSE                 11         6          35         9          142

Note: FUSE: baseline; nobuf: bFUSE without FUSE buffer; bFUSE: bFUSE with 8 Mbyte FUSE buffer.

Figure 15 Block access pattern, sequential buffered write on internal eMMC through FUSE with five different base filesystems, (a) EXT4 (b) XFS (c) F2FS (d) BTRFS (e) NILFS2 (see online version for colours)



Notes: File size: 512 Mbyte, record size: 512 Kbyte.

6 Combined study

We examined the performance of user partition accesses under each combination of storage stack options. There are five different filesystems, EXT4, XFS, F2FS, BTRFS, and NILFS2, and three IO schedulers, NOOP, CFQ, and deadline. We tested the storage throughputs under 4 Kbyte to 512 Kbyte FUSE IO unit sizes and 2 Mbyte to 64 Mbyte FUSE buffer sizes. It has been reported that the IO characteristics of Android-based smartphones are not sensitive to hardware models (Kim et al., 2012a; Lee and Won, 2012). Galaxy S3 devices with Android 4.0.4 (ICS) and Android 4.1.2 (JB) are used as the baseline. In the baseline configuration, the FUSE IO unit size is 4 Kbyte, and the default IO scheduler is CFQ.

Figure 16 shows the performances of the five filesystems. The experiment for each filesystem was conducted in four steps: First, we tested the baseline configuration, 'base', which is the CFQ IO scheduler with a FUSE IO unit of 4 Kbyte. Second, the FUSE IO unit is increased to 512 Kbyte, shown as '512 KB IO'. Third, the IO scheduler is changed to deadline, keeping the FUSE IO unit at 512 Kbyte. This step is labelled 'deadline'.

Fourth, the FUSE buffer is set to 8 Mbyte (with 512 Kbyte FUSE IO and the deadline IO scheduler). This step is labelled 'bFUSE'.

In the EXT4 filesystem, on both Android versions, the extended IO size led to about a 350% performance enhancement, the change of IO scheduler added about 5%, and bFUSE added 12% more. The performance enhancements in XFS, F2FS, and BTRFS at each step were similar. It is worth noting that the FUSE buffer needs to be managed by a separate thread so as not to block the caller. Otherwise, the performance gain becomes less significant.

Figure 16 Effect of individual optimisation options, (a) ICS (b) JB

Notes: Sequential buffered write on internal eMMC with different filesystems. File size: 512 Mbyte, record size (app): 512 Kbyte.

Finally, we combine and summarise the results of all experiments (Figure 17). For ease of comparison, the throughput was normalised to the raw device throughput, MAX (38.6 Mbyte/s and 39.8 Mbyte/s for ICS and JB, respectively). In EXT4 with bFUSE (512 Kbyte FUSE IO size and 8 Mbyte FUSE buffer), we achieved 95% and 91% of the raw device bandwidth for ICS and JB, respectively. With the XFS filesystem and bFUSE (512 Kbyte FUSE IO unit, 8 Mbyte FUSE buffer, deadline scheduler), we eliminated most of the software overhead and achieved 99% and 91% of the raw device throughput for ICS and JB, respectively. This corresponds to a spectacular improvement of 470% and 419% over the original baseline configuration for ICS and JB, respectively.

Figure 17 Throughput of optimised Android storage stack for user partition accesses

Notes: Sequential buffered write on internal eMMC. IO scheduler: deadline, file size: 512 Mbyte, record size (app): 512 Kbyte.

7 Related work

In the past few years, many studies have examined the performance of NAND flash-based storage devices. Kim et al. (2012a) showed that the determinant of mobile device performance is not the network speed but the storage, which is contrary to popular belief.

Kim et al. (2012b) conducted a study on improving buffer cache management in order to improve the IO performance of smartphones using low-cost flash memory. They developed a new buffer cache replacement scheme, spatial clock, which addresses the issue of spatial locality that has been neglected in other algorithms. These algorithms, including LRU, clock (Bensoussan et al., 1972), Linux2Q (Bovet and Cesati, 2005), CFLRU (Park et al., 2006), LRU-WSR (Jung et al., 2008), FOR (Lv et al., 2011), and FAB (Jo et al., 2006), focused only on reducing write cycles. Spatial clock is based on the clock algorithm, but the page frame is aligned with the logical sector number. Alignment was achieved using an AVL tree (Adelson-Velskii and Landis, 1963).

Lee and Won (2012) performed a comprehensive analysis of the Android IO workload. They performed trace-driven analysis on 14 apps from various categories, e.g., contact, calendar, gallery, camera, twitter, etc. They found that typical IOs to the data partition (/data) and the sdcard partition (/sdcard) are 4 Kbyte writes followed by fsync() and large buffered writes (> 64 Kbyte), respectively. They also found that SQLite and EXT4 interact in an unexpected way and put significant stress on the underlying storage.

On the other hand, Chiang and Lo (2006) proposed a component-based VFAT file system for embedded systems, which improved the performance of VFAT by supporting both RAM-based and disk-based storage. They not only saved system memory but also improved the performance of read and write operations in resource-limited embedded system environments.

Studies have been done on the relationship between the actions of the FTL and the unnecessary writes generated by the EXT4 journal. Jang and Lim (2012) asserted that the FTL should be located on the host side instead of the NAND controller to be more efficient, taking into account the resource constraints (RAM and CPU) for the page mapping table and the redundant writing by the journal. They modified the MTD and JBD of Linux 2.6.35 and compared the performance of host-side FTL to that of controller-side FTL on NANDsim. Random write IOPS improved by about 20% to 30%. Although this study is significant in that it presents a solution to overcoming the hardware limitations of the FTL as well as the performance reduction caused by excess journaling, it does not provide specific instructions on implementing the proposed method nor explain the cause of the performance enhancement.

Falaki et al. (2010) analysed various smartphone usage patterns. On top of the performance increases achieved by these earlier works (Kim et al., 2012a; Lee and Won, 2012), bFUSE provides an effective solution to further enhance the performance of the Android IO stack.

8 Conclusions

The role of smartphones is changing from a simple office assistant, which handles contact management, schedule management, and e-mails, to a professional device for creating various types of multimedia contents, replacing cameras and camcorders. There is a significant demand for a higher performance storage subsystem to harbour better quality pictures and videos.

The Android storage stack allocates a separate storage partition for user-created multimedia files. The Android storage stack for the user partition, as it currently stands, is an uncoordinated collection of software layers with excessive overhead. It can draw only 20% of the storage performance. In this study, we focus our efforts on optimising the Android IO stack for sdcard partition accesses. We characterised the IO workload to the sdcard partition, examined the overhead of the individual software layers, and found the optimal storage configuration for sdcard partition accesses. We found that a dominant fraction of the software overhead is caused by FUSE. We propose buffered FUSE, bFUSE, to address the performance issues of the legacy FUSE. We examined the performances of five filesystems and three disk schedulers to find the best match for the bFUSE framework. After applying a 512 Kbyte FUSE IO unit and an 8 Mbyte FUSE buffer in bFUSE, with XFS in the deadline scheduling mode, the proposed Android IO stack achieved 470% and 419% performance improvements in Android 4.0.4 (ICS) and Android 4.1.2 (JB), respectively, over the existing state-of-the-art smartphone model with the default options. The observed performance corresponds to 99% and 91% of the raw device bandwidth in ICS and JB, respectively. The proposed Android IO stack opens up a new opportunity for the existing smartphone storage to handle higher definition multimedia contents such as ultra-high definition video recording (3,840 × 2,160 @ 60 frames/sec).

Acknowledgements

This work is sponsored by the IT R&D programme of MKE/KEIT [No. 10035202, Large Scale hyper-MLC SSD Technology Development], and by the IT R&D programme of MKE/KEIT [No. 10041608, Embedded System Software for New-memory-based Smart Device].

References

Adelson-Velskii, M. and Landis, E.M. (1963) An Algorithm for the Organization of Information, Vol. 146, pp.263–266, Defense Technical Information Center, in Russian; English translation by Myron J. Ricci in Soviet Math. Doklady, Vol. 3, pp.1259–1263, 1962.

Axboe, J. (2007) 'CFQ IO scheduler', Presentation at Linux.Conf.AU, January.

Bensoussan, A., Clingen, C.T. and Daley, R.C. (1972) 'The multics virtual memory: concepts and design', Communications of the ACM, Vol. 15, No. 5, pp.308–318.

Bovet, D. and Cesati, M. (2005) Understanding the Linux Kernel, O'Reilly Media, Sebastopol, CA, USA.

Chiang, M-L. and Lo, C-J. (2006) 'Lyrafile: a component-based VFAT file system for embedded systems', International Journal of Embedded Systems, Vol. 2, No. 3, pp.248–259.

Embedded Multi-Media Card (e-MMC) (2011) Electrical Standard (4.5 Device), June.

F2FS File-System Merged into Linux 3.8 Kernel [online] http://www.phoronix.com/scan.php?page=news_item&px=MTI1OTU (accessed 8 November 2013).

F2FS Patch on LKML [online] https://lkml.org/lkml/2012/10/5/205 (accessed 8 November 2013).

Falaki, H., Mahajan, R., Kandula, S., Lymberopoulos, D., Govindan, R. and Estrin, D. (2010) 'Diversity in smartphone usage', in Proc. of the 8th International Conference on Mobile Systems, Applications, and Services, ACM, pp.179–194.

File system in user space [online] http://fuse.sourceforge.net/ (accessed 8 November 2013).

Galaxy Camera [online] http://www.samsung.com/us/photography/galaxy-camera (accessed 8 November 2013).

Galaxy S2 [online] http://www.samsung.com/global/microsite/galaxys2/html/ (accessed 8 November 2013).

Giampaolo, D. (1998) Practical File System Design with the Be File System, Morgan Kaufmann Publishers Inc., San Francisco, CA.

Google Galaxy Nexus [online] http://en.wikipedia.org/wiki/Nexus_4 (accessed 8 November 2013).

Google Nexus S [online] http://en.wikipedia.org/wiki/Nexus_S (accessed 8 November 2013).

Hsieh, J-W., Kuo, T-W. and Chang, L-P. (2006) 'Efficient identification of hot data for flash memory storage systems', TOS, Vol. 2, No. 1, pp.22–40.

HTC Dream [online] http://en.wikipedia.org/wiki/HTC_Dream (accessed 8 November 2013).

Jang, B. and Lim, S-H. (2012) 'Storage subsystem implementation for mobile embedded devices', in Park, J.H., Jeong, Y-S., Park, S.O. and Chen, H-C. (Eds.): Embedded and Multimedia Technology and Service, Lecture Notes in Electrical Engineering, Vol. 181, pp.197–204, Springer, Netherlands.

Jeong, S., Lee, K., Hwang, J., Lee, S. and Won, Y. (2013a) 'AndroStep: Android storage performance analysis tool', in ME13: Proceedings of the First European Workshop on Mobile Engineering, Lecture Notes in Electrical Engineering, Aachen, Germany, February, Vol. 215.

Jeong, S., Lee, K., Lee, S., Son, S. and Won, Y. (2013b) 'I/O stack optimization for smartphones', in Proc. of the USENIX Annual Technical Conference.

Jo, H., Kang, J.U., Park, S.Y., Kim, J.S. and Lee, J. (2006) 'FAB: flash-aware buffer management policy for portable media players', IEEE Trans. on Consumer Electronics, Vol. 52, No. 2, pp.485–493.

Jung, H., Shim, H., Park, S., Kang, S. and Cha, J. (2008) 'LRU-WSR: integration of LRU and writes sequence reordering for flash memory', IEEE Transactions on Consumer Electronics, Vol. 54, No. 3, pp.1215–1223.

Kim, H., Agrawal, N. and Ungureanu, C. (2012a) 'Revisiting storage for smartphones', in Proc. of the 10th USENIX Conference on File and Storage Technologies, San Jose, CA, USA, February.

Kim, H., Ryu, M. and Ramachandran, U. (2012b) 'What is a good buffer cache replacement scheme for mobile flash storage?', in Proceedings of the 12th ACM SIGMETRICS/Performance Joint International Conference on Measurement and Modeling of Computer Systems, ACM, pp.235–246.

Kim, J., Oh, Y., Kim, E., Choi, J., Lee, D. and Noh, S.H. (2009) 'Disk schedulers for solid state drives', in EMSOFT 2009: 7th ACM Conf. on Embedded Software, pp.295–304.

Konishi, R., Amagai, Y., Sato, K., Hifumi, H., Kihara, S. and Moriai, S. (2006) 'The Linux implementation of a log-structured file system', SIGOPS Oper. Syst. Rev., July, Vol. 40, No. 3, pp.102–107.

Lee, K. and Won, Y. (2012) 'Smart layers and dumb result: IO characterization of an android-based smartphone', in EMSOFT 2012: Proc. of International Conference on Embedded Software, Tampere, Finland, 7–12 October.

Lee, S., Shin, S., Kim, Y-J. and Kim, J. (2008) 'LAST: locality-aware sector translation for NAND flash memory-based storage systems', SIGOPS Oper. Syst. Rev., Vol. 42, No. 6, pp.36–42.

Lee, S.W., Park, D.J., Chung, T.S., Lee, D.H., Park, S.W. and Song, H.J. (2005) 'FAST: a log-buffer based FTL scheme with fully associative sector translation', The UKC, August.

Lim, S-H., Lee, S. and Ahn, W-H. (2013) 'Applications IO profiling and analysis for smart devices', Journal of Systems Architecture, October, Vol. 59, No. 9, pp.740–747.

Lv, Y., Cui, B., He, B. and Chen, X. (2011) 'Operation-aware buffer management in flash-based systems', in Proceedings of the 2011 International Conference on Management of Data, ACM, New York, NY, USA, pp.13–24.

Mathur, A., Cao, M., Bhattacharya, S., Dilger, A., Tomas, A. and Vivier, L. (2007) 'The new ext4 filesystem: current status and future plans', in Proc. of the Linux Symposium, Ottawa.

Microsoft Corporation (2000) FAT32 File System Specification [online] http://microsoft.com/whdc/system/platform/firmware/fatgen.mspx (accessed 8 November 2013).

Nexus One (Google/HTC) [online] http://en.wikipedia.org/wiki/Nexus_One (accessed 8 November 2013).

Osborne, R., Van Zoest, A., Robinson, A., Fudge, B., Srinivasan, M., Fry, K. et al. (2010) Media Transfer Protocol, 16 February, US Patent 7,664,872.

Park, S., Jung, D., Kang, J., Kim, J. and Lee, J. (2006) 'CFLRU: a replacement algorithm for flash memory', in Proc. of the 2006 International Conference on Compilers, Architecture and Synthesis for Embedded Systems, ACM, New York, NY, USA, pp.234–241.

Rajgarhia, A. and Gehani, A. (2010) 'Performance and extension of user space file systems', in Proceedings of the 2010 ACM Symposium on Applied Computing, Sierre, Switzerland, SAC '10, ACM, New York, NY, USA, pp.206–213.

Rodeh, O., Bacik, J. and Mason, C. (2012) 'BTRFS: the Linux B-tree filesystem', IBM Research Report, July.

Samsung Galaxy S [online] http://www.samsung.com/global/microsite/galaxys/index_2.html (accessed 8 November 2013).

Samsung Galaxy S3 [online] http://www.samsung.com/ae/microsite/galaxys3/en/index.html (accessed 8 November 2013).

Shen, K. and Park, S. (2013) 'FlashFQ: a fair queueing I/O scheduler for flash-based SSDs', in Proc. of the USENIX Annual Technical Conference.

Sweeney, A., Doucette, D., Hu, W., Anderson, C., Nishimoto, M. and Peck, G. (1996) 'Scalability in the XFS file system', in Proc. of the USENIX Annual Technical Conference, USENIX Association, Berkeley, CA, USA, p.1.

Wang, H., Huang, P., He, S., Zhou, K. and Li, C. (2013) 'A novel I/O scheduler for SSD with improved performance and lifetime', in Mass Storage Systems and Technologies (MSST).