CFS: A Distributed File System for Large Scale Container Platforms

Haifeng Liu§†, Wei Ding†, Yuan Chen†, Weilong Guo†, Shuoran Liu†, Tianpeng Li†, Mofei Zhang†, Jianxing Zhao†, Hongyin Zhu†, Zhengyi Zhu†

§University of Science and Technology of China, Hefei, China

†JD.com, Beijing, China

ABSTRACT

We propose CFS, a distributed file system for large scale container platforms. CFS supports both sequential and random file accesses with optimized storage for both large files and small files, and adopts different replication protocols for different write scenarios to improve the replication performance. It employs a metadata subsystem to store and distribute the file metadata across different storage nodes based on the memory usage. This metadata placement strategy avoids the need for data rebalancing during capacity expansion. CFS also provides POSIX-compliant APIs with relaxed POSIX semantics and metadata atomicity to improve the system performance. We performed a comprehensive comparison with Ceph, a widely-used distributed file system on container platforms. Our experimental results show that, in testing 7 commonly used metadata operations, CFS gives around a 3 times performance boost on average. In addition, CFS exhibits better random-read/write performance in highly concurrent environments with multiple clients and processes.

CCS CONCEPTS

• Information systems → Distributed storage;

KEYWORDS

distributed file system; container; cloud native

ACM Reference Format:
Haifeng Liu, Wei Ding, Yuan Chen, Weilong Guo, Shuoran Liu, Tianpeng Li, Mofei Zhang, Jianxing Zhao, Hongyin Zhu, Zhengyi Zhu. 2019. CFS: A Distributed File System for Large Scale Container Platforms. In 2019 International Conference on Management of Data (SIGMOD '19), June 30-July 5, 2019, Amsterdam, Netherlands. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3299869.3314046

1 INTRODUCTION

Containerization and microservices have revolutionized cloud environments and architectures over the past few years [1, 3, 17]. As applications can be built, deployed, and managed faster through continuous delivery, more and more companies have started to move legacy applications and core business functions to containerized environments.

The microservices running on each set of containers are usually independent of the local disk storage. While decoupling compute from storage allows companies to scale their container resources more efficiently, it also brings up the need for separate storage because (1) containers may need to preserve the application data even after they are closed, (2) the same file may need to be accessed by different containers simultaneously, and (3) the storage resources may need to be shared by different services and applications. Without the ability to persist data, containers would have limited use in many workloads, especially in stateful applications.

One option is to take existing distributed file systems and bring them to the cloud native environment through the Container Storage Interface (CSI, https://github.com/container-storage-interface/spec), which has been supported by various container orchestrators such as Kubernetes [5] and Mesos [13], or through a storage orchestrator such as Rook (https://rook.io/). When seeking such a distributed file system, the engineering teams who own the applications and services running on JD's container platform provided much valuable feedback. However, in terms of performance and scalability, this feedback also made it hard for us to adopt any existing open source solution directly.

For example, to reduce the storage cost, different applications and services usually need to be served from the same shared storage infrastructure. As a result, the size of the files in the combined workloads can vary from a few kilobytes to hundreds of gigabytes, and these files can be accessed in a sequential or random fashion. However, many distributed file systems are optimized either for large files, such as HDFS [22], or for small files, such as Haystack [2], and very few of them provide optimized storage for both large and small files [6, 12, 20, 26]. Moreover, these file systems usually employ a one-size-fits-all replication protocol, which may not provide optimal replication performance for different write scenarios.

In addition, there can be heavy accesses to the files by a large number of clients simultaneously. Most file operations, such as creating, appending, or deleting a file, require updating the file metadata. Therefore, a single node that stores all the file metadata can easily become the performance or storage bottleneck due to hardware limits [22, 23]. One can resolve this problem by employing a separate cluster to store the metadata, but most existing works [4] on this path require rebalancing the storage nodes during capacity expansion, which can bring significant degradation of the read/write performance.

Lastly, although a POSIX-compliant file system interface greatly simplifies the development of the upper-level applications, the strongly consistent semantics defined in the POSIX I/O standard can also drastically affect the performance. Most POSIX-compliant file systems alleviate this issue by providing relaxed POSIX semantics, but the atomicity requirement between the inode and dentry of the same file can still limit their performance on metadata operations.

To solve these problems, in this paper, we propose the Chubao File System (CFS), a distributed file system designed for large scale container platforms. CFS is written in Go and the code is available at https://github.com/ChubaoFS/cfs. Some key features include:

- General-Purpose and High-Performance Storage Engine. CFS provides a general-purpose storage engine to efficiently store both large and small files with optimized performance for different file access patterns. It utilizes the punch hole interface in Linux [21] to asynchronously free the disk space occupied by deleted small files, which greatly simplifies the engineering work of dealing with small file deletions.

- Scenario-Aware Replication. Different from existing open source solutions that only allow a single replication protocol at any time [22, 26, 27], CFS adopts two strongly consistent replication protocols for different write scenarios (namely, append and overwrite) to improve the replication performance.

- Utilization-Based Metadata Placement. CFS employs a separate cluster to store and distribute the file metadata across different storage nodes based on the memory usage. One advantage of this utilization-based placement is that it does not require any metadata rebalancing during capacity expansion. Although a similar idea has been used for chunk-server selection in MooseFS [23], to the best of our knowledge, CFS is the first open source solution to apply this technique to metadata placement.

- Relaxed POSIX Semantics and Metadata Atomicity. In a POSIX-compliant distributed file system, the behavior of serving multiple processes on multiple client nodes should be the same as the behavior of a local file system serving multiple processes on a single node with direct attached storage. CFS provides POSIX-compliant APIs. However, the POSIX consistency semantics, as well as the atomicity requirement between the inode and dentry of the same file, have been carefully relaxed in order to better align with the needs of applications and to improve the system performance.
2 DESIGN AND IMPLEMENTATION

As shown in Figure 1, CFS consists of a metadata subsystem, a data subsystem, and a resource manager, and can be accessed by different clients as a set of application processes hosted on the containers.

Figure 1: Architecture of CFS. (Clients mount volumes and communicate with the resource manager, the metadata subsystem of meta nodes holding in-memory meta partitions, and the data subsystem of data nodes holding on-disk data partitions.)

The metadata subsystem stores the file metadata and consists of a set of meta nodes. Each meta node hosts a set of meta partitions. The data subsystem stores the file contents and consists of a set of data nodes. Each data node hosts a set of data partitions. We will give more details about these two subsystems in the following sections.

A volume is a logical concept in CFS and consists of a set of meta partitions and data partitions. Each partition can only be assigned to a single volume. From a client's perspective, a volume can be viewed as a file system instance that contains data accessible by containers. A volume can be mounted to multiple containers so that files can be shared among different clients simultaneously. It needs to be created before any file operation.

The resource manager manages the file system by processing different types of tasks (such as creating and deleting partitions, creating new volumes, and adding/removing nodes). It also keeps track of status information such as the memory and disk utilization and the liveness of the meta and data nodes in the cluster. The resource manager has multiple replicas, among which strong consistency is maintained by a consensus algorithm such as Raft [16], and its state is persisted to a key-value store such as RocksDB (https://rocksdb.org/) for backup and recovery.

2.1 Metadata Storage

The metadata subsystem can be considered as a distributed in-memory datastore of the file metadata.

2.1.1 Internal Structure. The metadata subsystem consists of a set of meta nodes, where each meta node can have hundreds of meta partitions.

Each meta partition stores the inodes and dentries of the files from the same volume in memory, and employs two b-trees called inodeTree and dentryTree for fast lookup. The inodeTree is indexed by the inode id, and the dentryTree is indexed by the parent inode id and the dentry name. The code snippet below shows the definitions of the meta partition, the inode, and the dentry in CFS.

    type metaPartition struct {
        config        *MetaPartitionConfig
        size          uint64
        dentryTree    *BTree               // btree for dentries
        inodeTree     *BTree               // btree for inodes
        raftPartition raftstore.Partition
        freeList      *freeList            // free inode list
        vol           *Vol
        ...                                // other fields
    }

    type inode struct {
        inode      uint64 // inode id
        type       uint32 // inode type
        linkTarget []byte // symLink target name
        nLink      uint32 // number of links
        flag       uint32
        ...               // other fields
    }

    type dentry struct {
        parentId uint64 // parent inode id
        name     string // name of the dentry
        inode    uint64 // current inode id
        type     uint32 // dentry type
        ...             // other fields
    }
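To make the role of the two indexes concrete, the following is a minimal sketch, not the actual CFS code, of how one path component could be resolved on a meta partition: the dentry index maps (parent inode id, name) to an inode id, and the inode index maps the inode id to the inode itself. Plain Go maps stand in for the b-trees; all names here are illustrative.

    // Sketch only: simplified in-memory lookup on a meta partition.
    // Maps stand in for CFS's inodeTree and dentryTree b-trees.
    package main

    import "fmt"

    type inodeRec struct {
        id    uint64
        nLink uint32
    }

    type dentryKey struct {
        parentID uint64
        name     string
    }

    type metaPartitionSketch struct {
        dentries map[dentryKey]uint64  // (parent inode id, name) -> inode id
        inodes   map[uint64]*inodeRec  // inode id -> inode
    }

    // lookup resolves one path component under the given parent inode.
    func (mp *metaPartitionSketch) lookup(parentID uint64, name string) (*inodeRec, bool) {
        id, ok := mp.dentries[dentryKey{parentID, name}]
        if !ok {
            return nil, false
        }
        ino, ok := mp.inodes[id]
        return ino, ok
    }

    func main() {
        mp := &metaPartitionSketch{
            dentries: map[dentryKey]uint64{{1, "a.txt"}: 100},
            inodes:   map[uint64]*inodeRec{100: {id: 100, nLink: 1}},
        }
        if ino, ok := mp.lookup(1, "a.txt"); ok {
            fmt.Println("found inode", ino.id)
        }
    }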

2.1.2 Raft-Based Replication. The replication during file writes is performed in terms of meta partitions. The strong consistency among the replicas is ensured by a revision of the Raft consensus protocol [16] called MultiRaft (https://github.com/cockroachdb/cockroach/tree/v0.1-alpha/multiraft), which has the advantage of reduced heartbeat traffic on the network compared with the original version.

2.1.3 Failure Recovery. The meta partitions cached in memory are persisted to the local disk by snapshots and logs [16] for backup and recovery. Techniques such as log compaction are used to reduce the log file sizes and shorten the recovery time.

It is worth noting that a failure during a metadata operation could result in an orphan inode that has no dentry associated with it, and the memory and disk space occupied by such an inode can be hard to free. To minimize the chance of this happening, the client always retries after a failure until the request succeeds or the maximum retry limit is reached.

2.2 Data Storage

The data subsystem is optimized for the storage of both large and small files, which can be accessed in a sequential or random fashion.

2.2.1 Internal Structure. The data subsystem consists of a set of data nodes, where each data node has a set of data partitions.

Each data partition stores a set of partition metadata, such as the partition id and the replica addresses. It also has a storage engine called the extent store (see Figure 2), which is constructed from a set of storage units called extents. The CRC of each extent is cached in memory to speed up the data integrity check. The code snippet below shows the structure of the data partition in CFS.

    type dataPartition struct {
        clusterID       string
        volumeID        string
        partitionID     uint64
        partitionStatus int
        partitionSize   int
        replicas        []string // replica addresses
        disk            *Disk
        isLeader        bool
        isRaftLeader    bool
        path            string
        extentStore     *storage.ExtentStore
        raftPartition   raftstore.Partition
        ...                      // other fields
    }

Figure 2: Internal structure of a data partition. (Each data partition keeps its partition metadata, per-extent metadata, and an extent store, in which a large file occupies one or more dedicated extents while multiple small files share a single extent.)
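As an illustration of the per-extent CRC cache mentioned above, the sketch below (assumed code, not the CFS implementation) recomputes an extent's CRC-32 checksum from disk and compares it against the value cached in memory; the extent path and cached value are hypothetical.

    // Sketch: verify an extent against its CRC cached in memory.
    package main

    import (
        "fmt"
        "hash/crc32"
        "io"
        "log"
        "os"
    )

    // verifyExtent streams the extent file through a CRC-32 hasher and compares
    // the result with the checksum kept in memory for fast integrity checks.
    func verifyExtent(path string, cachedCRC uint32) (bool, error) {
        f, err := os.Open(path)
        if err != nil {
            return false, err
        }
        defer f.Close()

        h := crc32.NewIEEE()
        if _, err := io.Copy(h, f); err != nil {
            return false, err
        }
        return h.Sum32() == cachedCRC, nil
    }

    func main() {
        // Hypothetical extent path and cached checksum, for illustration only.
        ok, err := verifyExtent("/data/partition1/extent_7", 0x1c291ca3)
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println("extent intact:", ok)
    }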

Large files and small files are stored in different ways, and a threshold t (128 KB by default) is used to determine whether a file should be considered a "small file", i.e., a file whose size is less than or equal to t is treated as a small file. The value of t is configurable at startup and is usually aligned with the packet size used during data transfer to avoid packet assembly or splitting.

2.2.2 Large File Storage. For large files, the contents are stored as a sequence of one or multiple extents, which can be distributed across different data partitions on different data nodes. Writing a new file to the extent store always causes the data to be written at the zero offset of a new extent, which eliminates the need to track an offset within the extent. The last extent of a file does not need to fill up to its size limit by padding (i.e., the extent does not have holes), and it never stores data from other files.

2.2.3 Small File Storage and Punch Holes. The contents of multiple small files are aggregated and stored in a single extent, and the physical offset of each file's content in the extent is recorded in the corresponding meta node. CFS relies on the punch hole interface, fallocate() (http://man7.org/linux/man-pages/man2/fallocate.2.html), to asynchronously free the disk space occupied by a deleted file. The advantage of this design is that it eliminates the need to implement a garbage collection mechanism and therefore avoids employing a mapping from logical offsets to physical offsets within an extent [2]. Note that this is different from deleting large files, where the extents of the file can be removed directly from the disk.
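As an illustration of the punch-hole deletion just described, the sketch below (an assumed example, not the CFS implementation) frees the byte range that a deleted small file occupied inside a shared extent file, using fallocate() with FALLOC_FL_PUNCH_HOLE so that the extent keeps its size while the underlying blocks are released. The extent path and offsets are hypothetical.

    // Sketch: free the on-disk range of a deleted small file inside an extent file.
    // Assumes Linux and the golang.org/x/sys/unix package.
    package main

    import (
        "log"
        "os"

        "golang.org/x/sys/unix"
    )

    // punchHole releases the blocks backing [offset, offset+size) of the extent
    // file while keeping the file size unchanged, so other small files stored in
    // the same extent are unaffected.
    func punchHole(extentPath string, offset, size int64) error {
        f, err := os.OpenFile(extentPath, os.O_RDWR, 0)
        if err != nil {
            return err
        }
        defer f.Close()

        mode := uint32(unix.FALLOC_FL_PUNCH_HOLE | unix.FALLOC_FL_KEEP_SIZE)
        return unix.Fallocate(int(f.Fd()), mode, offset, size)
    }

    func main() {
        // Hypothetical extent file and layout, used only for illustration.
        if err := punchHole("/data/partition1/extent_42", 4096, 64*1024); err != nil {
            log.Fatal(err)
        }
    }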

2.2.4 Scenario-Aware Replication. Although one could simply apply a Raft-based replication to ensure strong consistency, CFS employs two replication protocols for the data subsystem to achieve a better tradeoff between performance and code reusability.

The replication is performed in terms of partitions during file writes. Depending on the file write pattern, CFS adopts different strongly consistent replication strategies. Specifically, for sequential writes (i.e., appends), we use primary-backup replication [25, 27], and for overwrites, we employ a MultiRaft-based replication protocol similar to the one used in the metadata subsystem.

One reason is that primary-backup replication is not suitable for overwrites, as the replication performance would need to be compromised. Specifically, during an overwrite, at least one new extent will be created to store the new file contents, and some of the original extents will be logically split into several fragments, which are usually linked together like a linked list. In this linked list, the pointers to the original fragments are replaced by the ones associated with the newly created extents. As more and more file contents are overwritten, eventually there will be too many fragments on the data partitions, which require defragmentation and could significantly affect the replication performance.

The other reason is that the MultiRaft-based replication is known to have a write amplification issue, as it introduces extra IO for writing the log files, which could directly affect the read-after-write performance. However, because our applications and microservices usually have far fewer overwrites than sequential writes, this performance issue is tolerable.

2.2.5 Failure Recovery. Due to the existence of two different replication protocols, when a failure on a replica is discovered, we first start the recovery process of the primary-backup replication by checking and aligning all extents. Once this process is finished, we then start the recovery process of our MultiRaft-based replication.

It should be noted that, during a sequential write, stale data is allowed on the partitions as long as it is never returned to the client. This is because, in this case, the commitment of the data at an offset also indicates the commitment of all the data before that offset. As a result, we can use the offset to indicate the portion of the data that has been committed by all replicas, and the leader always returns the largest offset that has been committed by all the replicas to the client, who will then update this offset at the corresponding meta node. When the client requests a read, the meta node only provides the address of a replica together with the offset of the data that has been committed by all replicas, regardless of the existence of stale data that has not been fully committed.

Therefore, there is no need to recover the inconsistent portion of the data on the replicas. If a client sends a write request for a k MB file and only the first p MB have been committed by all replicas, then the client will resend a write request for the remaining k − p MB of data to extents in different data partitions/nodes. This design greatly simplifies our implementation of maintaining strong replication consistency when a file is sequentially written to CFS.
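The committed-offset rule above can be sketched as follows (a simplified assumption, not the CFS code): the leader records the highest offset acknowledged by each replica for a sequentially written extent and reports the minimum of these values as the offset committed by all replicas; anything beyond it is treated as uncommitted and is never served to clients.

    // Sketch: tracking the offset committed by all replicas during sequential write.
    package main

    import "fmt"

    // commitTracker records, per replica, the highest contiguous offset it has
    // acknowledged for an extent that is being written sequentially.
    type commitTracker struct {
        acked map[string]int64 // replica address -> acked offset (bytes)
    }

    func newCommitTracker(replicas []string) *commitTracker {
        t := &commitTracker{acked: make(map[string]int64)}
        for _, r := range replicas {
            t.acked[r] = 0
        }
        return t
    }

    // ack is called when a replica confirms data up to the given offset.
    func (t *commitTracker) ack(replica string, offset int64) {
        if offset > t.acked[replica] {
            t.acked[replica] = offset
        }
    }

    // committed returns the largest offset confirmed by every replica; data past
    // this offset may be stale on some replicas and is never returned to clients.
    func (t *commitTracker) committed() int64 {
        min := int64(-1)
        for _, off := range t.acked {
            if min < 0 || off < min {
                min = off
            }
        }
        if min < 0 {
            return 0
        }
        return min
    }

    func main() {
        t := newCommitTracker([]string{"dn1", "dn2", "dn3"})
        t.ack("dn1", 4<<20)
        t.ack("dn2", 4<<20)
        t.ack("dn3", 3<<20)
        fmt.Println("committed bytes:", t.committed()) // 3 MB
    }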

2.3 Resource Manager

The resource manager manages the file system by processing different types of tasks.

2.3.1 Utilization-Based Placement. One important task is to distribute the file metadata and contents among different nodes. In CFS, we adopt a utilization-based placement, where the file metadata and contents are distributed across the storage nodes based on the memory/disk usage. This greatly simplifies the resource allocation problem in a distributed file system.

To the best of our knowledge, CFS is the first open source solution to employ this idea for metadata placement. Some commonly used metadata placement schemes, such as hashing and subtree partitioning [4], usually require a disproportionate amount of metadata to be moved when adding servers. Such data migration can be a headache in a container environment, as capacity expansion usually needs to be finished in a short period of time. Other approaches, such as LazyHybrid [4] and dynamic subtree partitioning [28], move the data lazily or employ a proxy to hide the data migration latency, but these solutions also bring an extra amount of engineering work for future development and maintenance. Different from these approaches, the utilization-based placement, despite its simplicity, does not require any rebalancing of the data when adding new storage nodes during capacity expansion. In addition, because of the uniform distribution, the chance of multiple clients accessing the data on the same storage node simultaneously is reduced, which can improve the performance stability of the file system. Specifically, our utilization-based placement works as follows:

First, after creating a volume, the client asks the resource manager for a certain number of available meta and data partitions. These partitions are usually the ones on the nodes with the lowest memory/disk utilization. However, it should be noted that, when writing a file, the client simply selects meta and data partitions in a random fashion from the ones allocated by the resource manager. The reason the client does not adopt a similar utilization-based approach is to avoid communicating with the resource manager to obtain the up-to-date utilization information of each allocated node.

Second, when the resource manager finds that all the partitions in a volume are about to be full, it automatically adds a set of new partitions to this volume. These partitions are usually the ones on the nodes with the lowest memory/disk utilization. Note that, when a partition is full, or a threshold (i.e., the number of files on a meta partition or the number of extents on a data partition) is reached, no new data can be stored on this partition, although the existing data can still be modified or deleted.
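The node selection behind this placement can be sketched as follows (a simplified assumption, not the resource manager's actual code): candidate nodes are sorted by their reported memory or disk utilization, and the least-utilized ones are chosen to host the new partitions. The node addresses below are hypothetical.

    // Sketch: choose the least-utilized nodes to host new meta/data partitions.
    package main

    import (
        "fmt"
        "sort"
    )

    type nodeStat struct {
        addr        string
        utilization float64 // fraction of memory (meta node) or disk (data node) in use
    }

    // pickNodes returns up to count node addresses with the lowest utilization,
    // mirroring the utilization-based placement described in Section 2.3.1.
    func pickNodes(nodes []nodeStat, count int) []string {
        sorted := append([]nodeStat(nil), nodes...)
        sort.Slice(sorted, func(i, j int) bool {
            return sorted[i].utilization < sorted[j].utilization
        })
        if count > len(sorted) {
            count = len(sorted)
        }
        addrs := make([]string, 0, count)
        for _, n := range sorted[:count] {
            addrs = append(addrs, n.addr)
        }
        return addrs
    }

    func main() {
        nodes := []nodeStat{
            {"meta-1:17210", 0.82},
            {"meta-2:17210", 0.41},
            {"meta-3:17210", 0.57},
        }
        fmt.Println(pickNodes(nodes, 2)) // [meta-2:17210 meta-3:17210]
    }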
2.3.2 Meta Partition Splitting. The resource manager also needs to handle a special requirement when splitting a meta partition. In particular, if a meta partition is about to reach its upper limit on the number of stored inodes and dentries, a splitting task needs to be performed to ensure that the inode ids stored on the newly created partition are distinct from the ones stored on the original partition.

The pseudocode of our solution is given in Algorithm 1, in which the resource manager first cuts off the inode range of the meta partition in advance at an upper bound end, a value greater than the largest inode id used so far (denoted as maxInodeID), and then sends a split request to the meta node to (1) update the inode id range to [1, end] for the original meta partition, and (2) create a new meta partition with the inode range starting from end + 1. As a result, the inode ranges for these two meta partitions become [1, end] and [end + 1, ∞), respectively. If another file needs to be created, its inode id will be chosen as maxInodeID + 1 in the original meta partition, or end + 1 in the newly created meta partition. The maxInodeID of each meta partition is obtained through the periodic communication between the resource manager and the meta nodes.

Algorithm 1 Splitting a meta partition
     1: procedure Partitioning
     2:     mp ← current meta partition
     3:     c ← current cluster
     4:     v ← c.getVolume(mp.volName)
     5:     maxPartitionID ← v.getMaxPartitionID()
     6:     if mp.ID < maxPartitionID then return
     7:     if mp.end == math.MaxUint64 then
     8:         end ← maxInodeID + ∆                ▷ cut off the inode range
     9:         mp.end ← end
    10:         task ← newSplitTask(c.Name, mp.partitionID, end)
    11:         c.addTask(task)                      ▷ sync with the meta node
    12:         c.updateMetaPartition(mp.volName, mp)
    13:         c.createMetaPartition(mp.volName, mp.end + 1)

2.3.3 Exception Handling. When a request to a meta/data partition times out (e.g., due to a network outage), the remaining replicas are marked as read-only. When a meta/data partition is no longer available (e.g., due to hardware failures), all the data on this partition will eventually be migrated to a new partition manually. Such unavailability is identified by multiple failures reported by the node.

2.4 Client

The client has been integrated with FUSE (https://github.com/libfuse/libfuse) to provide a file system interface in the user space. The client process runs entirely in the user space with its own cache.

To reduce the communication with the resource manager, the client caches the addresses of the available meta and data partitions assigned to the mounted volume, which are obtained at startup, and periodically synchronizes the set of available partitions with the resource manager.

To reduce the communication with the meta nodes, the client also caches the returned inodes and dentries when creating new files, as well as the data partition id, the extent id, and the offset after a file has been written to the data node successfully. When a file is opened for read/write, the client forces the cached metadata to be synchronized with the meta node.

To reduce the communication with the data nodes, the client caches the most recently identified leader. Our observation is that, when reading a file, the client may not know which data node is the current leader, because the leader could change after a failure recovery. As a result, the client may try to send the read request to each replica one by one until a leader is identified. However, since the leader does not change frequently, by caching the last identified leader, the client needs only a minimal number of retries in most cases.

2.5 Optimizations

There are mainly two optimizations applied to the CFS clusters deployed at JD, as explained below.

2.5.1 Minimizing Heartbeats. Because our production environment can have a huge number of partitions spread across different meta and data nodes, even with the MultiRaft-based protocol, each node can still receive an explosion of heartbeats from the rest of the nodes in the same Raft group, causing significant communication overheads. To alleviate this, we employ an extra layer of abstraction on the nodes, called a Raft set, to further minimize the number of heartbeats exchanged among the Raft groups. Specifically, we divide all the nodes into several Raft sets, each of which maintains its own Raft groups. When creating a new partition, we prefer to select the replicas from the same Raft set. In this way, each node only needs to exchange heartbeats with the nodes in the same Raft set.

2.5.2 Non-Persistent Connections. There can be tens of thousands of clients accessing the same CFS cluster, which could cause the resource manager to be overloaded if all their connections were kept alive. To prevent this from happening, we use non-persistent connections between each client and the resource manager.

2.6 Metadata Operations

In most modern POSIX-compliant distributed file systems [26], the inode and dentry of a file usually reside on the same storage node in order to preserve directory locality, i.e., the phenomenon that when the metadata of a file is accessed, the metadata of the files under the same directory, or the directory metadata itself, is also likely to be accessed. However, because of the utilization-based metadata placement, in CFS the inode and dentry of the same file may be distributed across different meta nodes. As a result, operating on a file's inode and dentry would usually need to run as a distributed transaction, which brings overheads that could significantly affect the system performance.

Our tradeoff is to relax this atomicity requirement as long as a dentry is always associated with at least one inode. All the metadata operations in CFS are based on this design principle. The downside is that there is a chance of creating orphan inodes (inodes with no associated dentry), which may be difficult to release from memory. To mitigate this issue, each metadata operation workflow in CFS has been carefully designed to minimize the chance of an orphan inode appearing. In practice, a meta node rarely has many orphan inodes in memory; if this does happen, tools like fsck can be used by the administrator to repair the files.

Figure 3 illustrates the workflows of three common metadata operations, which can be explained as follows.

2.6.1 Create. When creating a file, the client first asks an available meta node to create an inode. The meta node picks the smallest inode id that has not been used so far in this partition for the newly created inode, and updates its largest inode id accordingly. Only when the inode has been successfully created can the client ask for the creation of the corresponding dentry. If a failure happens, the client sends an unlink request and puts the newly created inode into a local list of orphan inodes, which will be deleted when the meta node receives an evict request from the client. Note that the inode and dentry of the same file do not need to be stored on the same meta node. A minimal client-side sketch of this workflow is given below.
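The following sketch illustrates the client-side create workflow just described: the inode is created first, then the dentry; on dentry failure, the inode is unlinked and remembered locally as an orphan to be evicted later. The RPC helpers (createInode, createDentry, unlinkInode) are hypothetical stand-ins for the client's actual calls to the meta nodes, not the CFS API.

    // Sketch of the client-side create workflow (hypothetical helper names).
    package cfsclient

    // metaClient abstracts the RPCs a client issues to meta nodes (hypothetical).
    type metaClient interface {
        createInode() (uint64, error)
        createDentry(parentID uint64, name string, inodeID uint64) error
        unlinkInode(inodeID uint64) error
    }

    type client struct {
        meta    metaClient
        orphans []uint64 // local list of orphan inodes, deleted on a later evict
    }

    // create builds the inode first and only then the dentry, mirroring the
    // workflow in Section 2.6.1.
    func (c *client) create(parentID uint64, name string) (uint64, error) {
        ino, err := c.meta.createInode()
        if err != nil {
            return 0, err // no inode was created, nothing to clean up
        }
        if err := c.meta.createDentry(parentID, name, ino); err != nil {
            // Dentry creation failed: unlink the inode and remember it locally
            // so it can be evicted later instead of becoming a silent orphan.
            _ = c.meta.unlinkInode(ino)
            c.orphans = append(c.orphans, ino)
            return 0, err
        }
        return ino, nil
    }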

Figure 3: Workflows of three commonly used metadata operations: (a) Create a file; (b) Link; (c) Unlink. (Each panel shows the client-side sequence of meta node requests, e.g., for create: find an available meta partition, create the new inode, create the new dentry, and on failure add the inode to the orphan inode list.)

2.6.2 Link. When linking a file, the client first asks the meta node of the inode to increase the value of nlink (the number of associated links) by one, and then asks the meta node of the target parent inode to create a dentry on the same meta partition. If a failure happens when creating the dentry, the value of nlink is decreased by one.

2.6.3 Unlink. When unlinking a file, the client first asks the corresponding meta node to delete the dentry. Only when this operation succeeds does the client send an unlink request to the meta node to decrease the value of nlink in the target inode by one. When nlink reaches a certain threshold (0 for a file and 2 for a directory), the client puts this inode into a local list of orphan inodes, which will be deleted when the meta node receives an evict request from the client. Note that, if decreasing the value of nlink fails, the client will perform several retries. If all the retries fail, this inode will eventually become an orphan inode, and the administrator may need to resolve the issue manually.

2.7 File Operations

CFS has relaxed POSIX consistency semantics, i.e., instead of providing strong consistency guarantees, it only ensures sequential consistency for file/directory operations, and it does not have any leasing mechanism to prevent multiple clients from writing to the same file/directory. It is up to the upper-level application to maintain a stricter consistency level if necessary.

2.7.1 Sequential Write. As shown in Figure 4, to sequentially write a file, the client first randomly chooses an available data partition from its cache, then continuously sends a number of fixed-size packets (e.g., 128 KB) to the leader, each of which includes the addresses of the replicas, the target extent id, the offset in the extent, and the file content. The addresses of the replicas are provided as an array by the resource manager and cached on the client side. The indices of the items in this array determine the order of the replication, i.e., the replica whose address is at index zero is the leader. Therefore, the client can always send a write request to the leader without introducing extra communication overhead, and, as mentioned in the previous section, a primary-backup replication is performed by following this order. Once the client receives the commit from the leader, it updates its local cache immediately and synchronizes with the meta node periodically or upon receiving the system call fsync() (http://man7.org/linux/man-pages/man2/fdatasync.2.html) from the upper-level application. A minimal client-side sketch of this loop is given after the captions of Figures 4 and 5.

2.7.2 Random Write. The random write in CFS is in-place. To randomly write a file, the client first uses the offsets of the original data and the new data to figure out which portion of the data is to be appended and which portion is to be overwritten, and then processes them separately. In the former case, the client sequentially writes the file as described previously. In the latter case, as shown in Figure 5, the offset of the file on the data partition does not change. (A short sketch of this offset split is given after the delete workflow below.)

2.7.3 Delete. The delete operation is asynchronous. To delete a file, the client sends a delete request to the corresponding meta node. The meta node, once it receives this request, updates the value of nlink in the target inode. If this value reaches a certain threshold (0 for a file and 2 for a directory), the target inode is marked as deleted (see the inode structure given in Section 2.1). Later on, a separate process clears up this inode and communicates with the data node to delete the file content.
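The following is a minimal sketch (an assumption, not the CFS client code) of the offset arithmetic behind the random write described in Section 2.7.2: given the current file size, a write at an arbitrary offset is divided into an overwrite portion, which goes through the in-place MultiRaft-based path, and an append portion, which goes through the sequential-write path.

    // Sketch: split a random write into its overwrite and append portions.
    package main

    import "fmt"

    type writeRange struct {
        offset int64
        length int64
    }

    // splitWrite divides a write of `length` bytes at `offset` against a file of
    // `fileSize` bytes into the part that overwrites existing data and the part
    // that appends past the current end of the file.
    func splitWrite(fileSize, offset, length int64) (overwrite, appendPart writeRange) {
        end := offset + length
        if offset < fileSize {
            overEnd := end
            if overEnd > fileSize {
                overEnd = fileSize
            }
            overwrite = writeRange{offset: offset, length: overEnd - offset}
        }
        if end > fileSize {
            appendStart := offset
            if appendStart < fileSize {
                appendStart = fileSize
            }
            appendPart = writeRange{offset: appendStart, length: end - appendStart}
        }
        return overwrite, appendPart
    }

    func main() {
        // A 1 MB file written with 512 KB at offset 768 KB: 256 KB is overwritten
        // in place, 256 KB is appended via the sequential-write path.
        ow, ap := splitWrite(1<<20, 768<<10, 512<<10)
        fmt.Printf("overwrite: %+v, append: %+v\n", ow, ap)
    }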

Figure 4: Workflow of sequential write.

Figure 5: Workflow of overwriting an existing file (no appending).
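Relating to the sequential-write workflow of Figure 4 (Section 2.7.1), the sketch below (hypothetical helper names, not the CFS client) shows the client-side loop: the data is cut into fixed-size packets, each packet is sent to the replica at index zero of the cached address array (the leader), and the locally tracked offset advances only after the leader reports the commit.

    // Sketch: client-side sequential write loop with a fixed packet size.
    package main

    import "fmt"

    const packetSize = 128 << 10 // 128 KB packets, as in Section 2.7.1

    // packet mirrors the fields listed in Section 2.7.1.
    type packet struct {
        replicas []string // replicas[0] is the leader; the order drives replication
        extentID uint64
        offset   int64
        data     []byte
    }

    // sendToLeader stands in for the RPC that ships a packet to the leader and
    // waits for the commit acknowledgement (hypothetical).
    type sendToLeader func(p packet) error

    // sequentialWrite streams data to one extent, advancing the committed offset
    // only after each packet is acknowledged by the leader.
    func sequentialWrite(send sendToLeader, replicas []string, extentID uint64, start int64, data []byte) (int64, error) {
        committed := start
        for len(data) > 0 {
            n := packetSize
            if n > len(data) {
                n = len(data)
            }
            p := packet{replicas: replicas, extentID: extentID, offset: committed, data: data[:n]}
            if err := send(p); err != nil {
                return committed, err // caller may resend the rest to another partition
            }
            committed += int64(n)
            data = data[n:]
        }
        return committed, nil
    }

    func main() {
        send := func(p packet) error { return nil } // stub transport for illustration
        off, err := sequentialWrite(send, []string{"dn1", "dn2", "dn3"}, 7, 0, make([]byte, 300<<10))
        fmt.Println(off, err) // 307200 <nil>
    }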

2.7.4 Read. A read can only happen at the Raft leader (note that the primary-backup group leader and the Raft group leader could be different). To read a file, the client sends a read request to the corresponding data node. This request is constructed from the data in the client cache, such as the data partition id, the extent id, and the offset within the extent.

3 DISCUSSION OF DESIGN CHOICES

In this section, we highlight some of the design choices we made when building CFS.

3.1 Centralization vs. Decentralization

Centralization and decentralization are two design paradigms for distributed systems [7]. Although the former [10, 22] is relatively easy to implement, the single master can become a scalability bottleneck. In contrast, the latter [8] is generally more reliable but also more complex to implement.

In designing CFS, we chose centralization over decentralization mainly for its simplicity. However, a single resource manager that stores all the file metadata would limit the scalability of the metadata operations, which can make up as much as half of typical file system workloads [18]. For this reason, we employ a separate cluster to store the metadata, which drastically improves the scalability of the entire file system. From this perspective, CFS is designed in a way that minimizes the involvement of the resource manager so that it has less chance of becoming a bottleneck. Admittedly, even with these efforts, the scalability of the resource manager could still be limited by its memory and disk space, but based on our experience, this has never become an issue.

3.2 Separate Meta Node vs. Metadata on Data Node

In some distributed file systems, the file metadata and contents are stored on the same machine [15, 26]. There are also some distributed file systems where the metadata is managed separately by specialized metadata servers [11, 22].

In CFS, we choose to have separate meta nodes and maintain all the file metadata in memory for fast access. As a result, we can select memory-intensive machines for the meta nodes and disk-intensive machines for the data nodes for cost-effectiveness. Another advantage of this design is the flexibility of deployment: in case a machine has both large memory and large disk space, we can always deploy a meta node and a data node physically together.

3.3 Consistency Model and Guarantees

In CFS, the storage layer and the file system layer have different consistency models.

The storage engine guarantees strong consistency among the replicas through either the primary-backup or the Raft-based replication protocol. This design decision is based on the observations that the former is not suitable for overwrites, as the replication performance would need to be compromised, and the latter has a write amplification issue, as it introduces extra IO for writing the log files.

The file system itself, although it provides POSIX-compliant APIs, has selectively relaxed the POSIX consistency semantics in order to better align with the needs of applications and to improve system performance. For example, the POSIX semantics define that writes must be strongly consistent, i.e., a write is required to block application execution until the system can guarantee that any other read will see the data that was just written. While this can be easily accomplished locally, ensuring such strong consistency in a distributed file system is very challenging due to the degraded performance associated with lock contention and synchronization. CFS relaxes the POSIX consistency in a way that provides consistency when different clients modify non-overlapping parts of a file, but it does not provide any consistency guarantee if two clients try to modify the same portion of a file. This design is based on the fact that, in a containerized environment, there are many cases where the rigidity of the POSIX semantics is not strictly necessary, i.e., the applications seldom rely on the file system to deliver full strong consistency, and two independent jobs rarely write to a common shared file on a multi-tenant system.

4 EVALUATION

Ceph is a distributed file system that has been widely used on container platforms. In this section, we perform a set of experiments to compare CFS with Ceph from various aspects.

4.1 Experiment Setup

The specifications of the machines used in our experiments are given in Table 1. For CFS, we deploy the meta nodes and data nodes on the same cluster of 10 machines, and a single resource manager with 3 replicas. Each machine has 10 meta partitions and 1500 data partitions. For Ceph, we have a similar setup, where the object storage devices (OSD) and metadata servers (MDS) are deployed on the same cluster of 10 machines. Each machine has 16 OSD processes and 1 MDS process. The Ceph version is 12.2.11, and the storage engine is configured as BlueStore with the TCP/IP network stack. Unless otherwise stated, both CFS and Ceph use default configurations in the following experiments.

Table 1: System specification.
    Processor Number           Xeon E5-2683 v4
    Number of Cores            16
    Max Turbo Frequency        3.00 GHz
    Processor Base Frequency   2.10 GHz
    Network Bandwidth          1000 Mbps
    Memory                     DDR4 2400 MHz, 8 × 32 GB
    Disk                       16 × 960 GB SSD
    Operating System           Linux 4.17.12

4.2 Metadata Operations

In evaluating the performance and scalability of the metadata subsystem, we focus on tests of 7 commonly used metadata operations from mdtest (https://github.com/hpc/ior). Their descriptions are given in Table 2. Note that the TreeCreation and TreeRemoval tests mainly focus on operating directories as non-leaf nodes in the tree structure.

Table 2: Description of the tests in mdtest.
    DirCreation    Create a directory
    DirStat        List all the files in the current directory
    DirRemoval     Remove a directory
    FileCreation   Create a file
    FileRemoval    Remove a file
    TreeCreation   Create a directory with multiple files as a tree structure
    TreeRemoval    Remove a directory with multiple files as a tree structure

Figure 6 plots the IOPS of these tests in a single-client environment with different numbers of processes, and Figure 7 plots the IOPS of the same set of tests in a multi-client environment, where each client runs 64 processes. Note that the Y-axis uses a logarithmic scale for better illustration of the different test results.

It can be seen that, when there is only a single client with a single process, Ceph outperforms CFS in 5 out of 7 tests (all except the DirStat and TreeRemoval tests), but as the number of clients and processes increases, CFS starts to catch up. When it reaches 8 clients (each running 64 processes), CFS outperforms Ceph in 6 out of 7 tests (all except the TreeCreation test). The detailed IOPS of the tests with 8 clients are given in Table 3, which shows about a 3 times performance boost on average.

Table 3: IOPS for the metadata operations with 8 clients, where each client has 64 processes.
    Test Name      CFS (multi)   Ceph (multi)   % of Improv.
    DirCreation    83,729        16,627         404
    DirStat        875,867       91,050         862
    DirRemoval     94,235        23,807         296
    FileCreation   85,556        21,919         290
    FileRemoval    50,119        22,573         122
    TreeCreation   10            11             -9
    TreeRemoval    12            3              300

From this result we can see that, as the number of clients and processes increases, the performance benefits from the utilization-based metadata placement in CFS can outweigh the advantage of the directory-locality-aware metadata placement in Ceph.

There are a few observations from the results of DirStat, TreeCreation, and TreeRemoval, as explained below.

Figure 6: IOPS when a single client operates the file metadata.


Figure 7: IOPS when multiple clients operate the file metadata.

In the DirStat test, CFS exhibits better performance than Ceph in both cases (i.e., single-client and multi-client). This is mainly because they handle the readdir request in different ways. In Ceph, each readdir request is followed by a set of inodeGet requests to fetch all the inodes in the current directory from different MDSs. The requested inodes are usually cached in the MDS for fast access in the future. In CFS, however, these inodeGet requests are replaced by a single batchInodeGet request in order to reduce the communication overheads, and the results are cached at the client side so that successive requests can be answered quickly without further communication with the same set of meta nodes. The sudden performance drops in the multi-client case (see Figure 7) can be caused by client cache misses in CFS.

In the TreeCreation test, Ceph performs better than CFS in the single-client case, but as more clients get involved, the performance gap between them closes. This can be explained by the fact that, when there are only a few clients, the benefits of directory locality, such as reusing the cached dentries and inodes on the same MDS, give Ceph better performance. However, as more clients get involved, the increasing pressure on certain MDSs requires Ceph to dynamically place the metadata of files under the same directory onto different MDSs and redirect the corresponding requests to the proxy MDSs [28], which incurs extra overheads and closes the gap between Ceph and CFS.

In the TreeRemoval test, CFS gives better performance than Ceph in both cases. Similar to the DirStat test, the way CFS handles the readdir request can be one of the reasons for such results. In addition, when there are only a few clients, the requests for deleting the file metadata may need to be queued in Ceph, as the metadata of the files under the same directory is usually stored on a single MDS; and as more clients get involved, the potential benefits of directory locality in Ceph may be reduced, since the file metadata may have been distributed across different MDSs because of the rebalancing triggered in the previous tests.

4.3 Large Files

For large file operations, we first look at the results with different numbers of processes in a single-client environment, where each process operates on a separate 40 GB file. We use fio (https://github.com/axboe/fio) with direct IO mode to generate various types of workloads. In both the CFS and Ceph setups, the clients and servers are deployed on different machines. In addition, two parameters need to be tuned in Ceph in order to obtain optimal performance, namely osd_op_num_shards and osd_op_num_threads_per_shard, which control the number of queues and the number of threads processing the queues. We set them to 6 and 4, respectively. Increasing either of these values further causes degraded write performance due to high CPU pressure.

As shown in Figure 8, the performance under different numbers of processes is quite similar, with the exception that in the random read/write tests, CFS has higher IOPS when the number of processes is greater than 16.

This is probably due to the following reasons. First, each MDS of Ceph only caches a portion of the file metadata in its memory. In the case of random reads, the cache miss rate can increase dramatically as the number of processes increases, causing frequent disk IOs. In contrast, each meta node of CFS caches all the file metadata in memory to avoid the expensive disk IOs. Second, the overwrite in CFS is in-place, which does not need to update the file metadata. In contrast, an overwrite in Ceph usually needs to walk through multiple queues, and only after the data and metadata have been persisted and synchronized can the commit message be returned to the client.

Figure 8: IOPS when different numbers of processes, in a single client, operate files with different access patterns. Each process operates on a separate 40 GB file.

Figure 9: IOPS with different numbers of clients. Each client has 64 processes in the random read/write tests and 16 processes in the sequential read/write tests. Each process operates on a separate 40 GB file.

Next, we study how CFS and Ceph perform in a multi-client environment with the same set of tests. In this experiment, each client has 64 processes in the random read/write tests and 16 processes in the sequential read/write tests, where each process operates on a separate 40 GB file. Each client in Ceph operates on different file directories, and each directory is bound to a specific MDS in order to maximize the concurrency and improve the performance stability. As can be seen in Figure 9, CFS has a significant performance advantage over Ceph in the random read/write tests, although their performance in the sequential read/write tests is quite similar. These results are consistent with the ones in the previous single-client experiment and can be explained in a similar way.

To sum up, in a highly concurrent environment, CFS outperforms Ceph in our random read/write tests for large files.

4.4 Small Files

In this section we study the performance of operating on small files with various sizes from 1 KB to 128 KB in CFS and Ceph. Similar to the metadata tests, the results are obtained from mdtest. This experiment simulates the use case of operating on product images, which are usually never modified once created. In our CFS configuration, 128 KB is the threshold set to determine whether a file should be aggregated in a single extent, i.e., whether we should treat it as a "small file". In Ceph, each client operates on different file directories, and each directory is bound to a specific MDS in order to maximize the concurrency and improve the performance stability.

As can be seen from Figure 10, CFS outperforms Ceph in both the read and write tests. This can be explained by the facts that (1) CFS stores all the file metadata in memory to avoid the expensive disk IOs during file reads; and (2) in the case of small file writes, the CFS client does not need to ask the resource manager for new extents; instead, it sends the write request to the data node directly, which further reduces the network overheads.

Figure 10: IOPS when 8 clients, each of which has 64 processes, operate small files with different sizes.
5 RELATED WORK

GFS [10] and its open source implementation HDFS [22] are designed for storing large files with sequential access. Both of them adopt the master-slave architecture [24], where a single master stores all the file metadata. Unlike GFS and HDFS, CFS employs a separate metadata subsystem to provide a scalable solution for the metadata storage, so that the resource manager has less chance of becoming the bottleneck.

Haystack [2] takes after log-structured file systems [19] to serve the long tail of requests seen when sharing photos in a large social network. The key insight is to avoid disk operations when accessing metadata. CFS adopts a similar idea by putting the file metadata into main memory. However, different from Haystack, the actual physical offsets, instead of logical indices, of the file contents are stored in memory, and deleting a file is achieved by the punch hole interface provided by the underlying file system instead of relying on a garbage collector that performs merging and compacting regularly for more efficient disk utilization. In addition, Haystack does not guarantee strong consistency among the replicas when deleting files, and its regular merging and compacting can be a performance killer. Several works [9, 29] have proposed to manage small files and metadata more efficiently by grouping related files and metadata together intelligently. CFS takes a different design principle and separates the storage of file metadata and contents. In this way, we can have more flexible and cost-effective deployments of meta and data nodes.

Windows Azure Storage (WAS) [6] is a cloud storage system that provides strong consistency and multi-tenancy to the clients. Different from CFS, it builds an extra partition layer to handle random writes before streaming data into the lower level. AWS EFS [20] is a cloud storage service that provides scalable and elastic file storage. We could not evaluate the performance of Azure and AWS EFS against CFS, as they are not open-sourced.

PolarFS [7] is a distributed file system designed for Alibaba's database service; it utilizes a lightweight network stack and I/O stack to take advantage of emerging techniques like RDMA, NVMe, and SPDK. OctopusFS [14] is a distributed file system based on HDFS with automated data-driven policies for managing the placement and retrieval of data across the storage tiers of the cluster, such as memory, SSDs, HDDs, and remote storage. These are orthogonal to our work, as CFS can also adopt similar techniques to fully utilize the emerging hardware and storage hierarchies.

There are a few distributed file systems that have been integrated with Kubernetes [12, 23, 26]. Ceph [26] is a petabyte-scale object/file store that maximizes the separation between data and metadata management by employing a pseudo-random data distribution function (CRUSH) designed for heterogeneous and dynamic clusters of unreliable object storage devices (OSDs). We have performed a comprehensive comparison with Ceph in this paper. GlusterFS [12] is a scalable distributed file system that aggregates disk storage resources from multiple servers into a single global namespace. MooseFS [23] is a fault-tolerant, highly available, POSIX-compliant, and scalable distributed file system. However, similar to HDFS, it employs a single master to manage the file metadata.

MapR-FS (https://mapr.com/products/mapr-fs/) is a POSIX-compliant distributed file system that provides reliable, high-performance, scalable, and full read/write data storage. Similar to Ceph, it stores the metadata in a distributed way alongside the data itself.

6 CONCLUSIONS AND FUTURE WORK

This paper describes CFS, a distributed file system designed for serving JD's e-commerce business. CFS has a few unique features that differentiate it from other open source solutions. For example, it has a general-purpose storage engine with scenario-aware replication to accommodate different file access patterns with optimized performance. In addition, it employs a separate cluster to store the file metadata in a distributed fashion with a simple but efficient metadata placement strategy. Moreover, it provides POSIX-compliant APIs with relaxed POSIX semantics and metadata atomicity to obtain better system performance.

We have implemented most of the operations following the POSIX standard, and we are actively working on the remaining ones, such as xattr, fcntl, ioctl, mknod, and readdirplus. In the future, we plan to take advantage of the Linux page cache to speed up file operations, improve the file locking mechanism and cache coherency, and support emerging hardware standards such as RDMA. We also plan to develop our own POSIX-compliant file system interface in the kernel space to completely eliminate the overhead from FUSE.
7 ACKNOWLEDGMENTS

We thank the anonymous reviewers for their valuable suggestions and opinions that helped improve the paper. We also express our gratitude to Professor Enhong Chen of the University of Science and Technology of China (USTC), who offered significant help for this work.

REFERENCES

[1] Balalaie, A., Heydarnoori, A., and Jamshidi, P. Microservices architecture enables devops: Migration to a cloud-native architecture. IEEE Software 33, 3 (May 2016), 42–52.
[2] Beaver, D., Kumar, S., Li, H. C., Sobel, J., and Vajgel, P. Finding a needle in Haystack: Facebook's photo storage. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (Berkeley, CA, USA, 2010), OSDI'10, USENIX Association, pp. 47–60.
[3] Bernstein, D. Containers and cloud: From LXC to Docker to Kubernetes. IEEE Cloud Computing 1, 3 (Sept. 2014), 81–84.
[4] Brandt, S. A., Miller, E. L., Long, D. D. E., and Xue, L. Efficient metadata management in large distributed storage systems. In Proceedings of the 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies (MSS'03) (Washington, DC, USA, 2003), MSS '03, IEEE Computer Society, pp. 290–.
[5] Brewer, E. A. Kubernetes and the path to cloud native. In Proceedings of the Sixth ACM Symposium on Cloud Computing (New York, NY, USA, 2015), SoCC '15, ACM, pp. 167–167.
[6] Calder, B., Wang, J., Ogus, A., Nilakantan, N., Skjolsvold, A., McKelvie, S., Xu, Y., Srivastav, S., Wu, J., Simitci, H., Haridas, J., Uddaraju, C., Khatri, H., Edwards, A., Bedekar, V., Mainali, S., Abbasi, R., Agarwal, A., Haq, M. F. u., Haq, M. I. u., Bhardwaj, D., Dayanand, S., Adusumilli, A., McNett, M., Sankaran, S., Manivannan, K., and Rigas, L. Windows Azure Storage: A highly available cloud storage service with strong consistency. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (New York, NY, USA, 2011), SOSP '11, ACM, pp. 143–157.
[7] Cao, W., Liu, Z., Wang, P., Chen, S., Zhu, C., Zheng, S., Wang, Y., and Ma, G. PolarFS: An ultra-low latency and failure resilient distributed file system for shared storage cloud database. Proc. VLDB Endow. 11, 12 (Aug. 2018), 1849–1862.
[8] DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., and Vogels, W. Dynamo: Amazon's highly available key-value store. In Proceedings of Twenty-first ACM SIGOPS Symposium on Operating Systems Principles (New York, NY, USA, 2007), SOSP '07, ACM, pp. 205–220.
[9] Ganger, G. R., and Kaashoek, M. F. Embedded inodes and explicit grouping: Exploiting disk bandwidth for small files. In Proceedings of the Annual Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 1997), ATEC '97, USENIX Association, pp. 1–1.
[10] Ghemawat, S., Gobioff, H., and Leung, S.-T. The Google file system. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (New York, NY, USA, 2003), SOSP '03, ACM, pp. 29–43.
[11] Gibson, G. A., and Van Meter, R. Network attached storage architecture. Commun. ACM 43, 11 (Nov. 2000), 37–45.
[12] Hat, R. Gluster file system. https://github.com/gluster/glusterfs, 2005.
[13] Hindman, B., Konwinski, A., Zaharia, M., Ghodsi, A., Joseph, A. D., Katz, R., Shenker, S., and Stoica, I. Mesos: A platform for fine-grained resource sharing in the data center. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (Berkeley, CA, USA, 2011), NSDI'11, USENIX Association, pp. 295–308.
[14] Kakoulli, E., and Herodotou, H. OctopusFS: A distributed file system with tiered storage management. In Proceedings of the 2017 ACM International Conference on Management of Data (New York, NY, USA, 2017), SIGMOD '17, ACM, pp. 65–78.
[15] Morris, J. H., Satyanarayanan, M., Conner, M. H., Howard, J. H., Rosenthal, D. S., and Smith, F. D. Andrew: A distributed personal computing environment. Commun. ACM 29, 3 (Mar. 1986), 184–201.
[16] Ongaro, D., and Ousterhout, J. In search of an understandable consensus algorithm. In Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 2014), USENIX ATC'14, USENIX Association, pp. 305–320.
[17] Pahl, C. Containerization and the PaaS cloud. IEEE Cloud Computing 2, 3 (May-June 2015), 24–31.
[18] Roselli, D., Lorch, J. R., and Anderson, T. E. A comparison of file system workloads. In Proceedings of the Annual Conference on USENIX Annual Technical Conference (Berkeley, CA, USA, 2000), ATEC '00, USENIX Association, pp. 4–4.
[19] Rosenblum, M., and Ousterhout, J. K. The design and implementation of a log-structured file system. ACM Trans. Comput. Syst. 10, 1 (Feb. 1992), 26–52.
[20] Service, A. W. Amazon elastic file system. https://docs.aws.amazon.com/efs/latest/ug/efs-ug.pdf, 2016.
[21] Shen, K., Park, S., and Zhu, M. Journaling of journal is (almost) free. In Proceedings of the 12th USENIX Conference on File and Storage Technologies (FAST 14) (Santa Clara, CA, 2014), USENIX, pp. 287–293.
[22] Shvachko, K., Kuang, H., Radia, S., and Chansler, R. The Hadoop distributed file system. In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST) (Washington, DC, USA, 2010), MSST '10, IEEE Computer Society, pp. 1–10.
[23] Technology, C. MooseFS 3.0 storage classes manual. https://moosefs.com/Content/Downloads/moosefs-storage-classes-manual.pdf, 2016.
[24] Thekkath, C. A., Mann, T., and Lee, E. K. Frangipani: A scalable distributed file system. In Proceedings of the Sixteenth ACM Symposium on Operating Systems Principles (New York, NY, USA, 1997), SOSP '97, ACM, pp. 224–237.
[25] van Renesse, R., and Schneider, F. B. Chain replication for supporting high throughput and availability. In Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6 (Berkeley, CA, USA, 2004), OSDI'04, USENIX Association, pp. 7–7.
[26] Weil, S. A., Brandt, S. A., Miller, E. L., Long, D. D. E., and Maltzahn, C. Ceph: A scalable, high-performance distributed file system. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (Berkeley, CA, USA, 2006), OSDI '06, USENIX Association, pp. 307–320.
[27] Weil, S. A., Leung, A. W., Brandt, S. A., and Maltzahn, C. RADOS: A scalable, reliable storage service for petabyte-scale storage clusters. In Proceedings of the 2nd International Workshop on Petascale Data Storage: Held in Conjunction with Supercomputing '07 (New York, NY, USA, 2007), PDSW '07, ACM, pp. 35–44.
[28] Weil, S. A., Pollack, K. T., Brandt, S. A., and Miller, E. L. Dynamic metadata management for petabyte-scale file systems. In Proceedings of the 2004 ACM/IEEE Conference on Supercomputing (Washington, DC, USA, 2004), SC '04, IEEE Computer Society, pp. 4–.
[29] Zhang, Z., and Ghose, K. hFS: A hybrid file system prototype for improving small file and metadata performance. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007 (New York, NY, USA, 2007), EuroSys '07, ACM, pp. 175–187.