Optimizing Local File Accesses for FUSE-Based Distributed Storage

Shun Ishiguro∗ Jun Murakami∗ Yoshihiro Oyama∗‡ Osamu Tatebe†‡
∗Department of Informatics, The University of Electro-Communications
Email: {shun,murakami}@ol.inf.uec.ac.jp, [email protected]
†Faculty of Engineering, Information and Systems, University of Tsukuba
Email: [email protected]
‡Japan Science and Technology Agency, CREST

Abstract—Modern distributed file systems can store huge amounts of information while retaining the benefits of high reliability and performance. Many of these systems are prototyped with FUSE, a popular framework for implementing user-level file systems. Unfortunately, when these systems are mounted on a client that uses FUSE, they suffer from I/O overhead caused by extra memory copies and context switches during local file access. Overhead imposed by FUSE on distributed file systems is not small and may significantly degrade the performance of data-intensive applications. In this paper, we propose a mechanism that achieves rapid local file access in FUSE-based distributed file systems by reducing the number of memory copies and context switches. We then incorporate the mechanism into the FUSE framework and demonstrate its effectiveness through experiments, using the Gfarm distributed file system.

I. INTRODUCTION

Highly scalable distributed file systems have become a key component of data-intensive science where low latency, high capacity, and high scalability of storage are crucial to application performance. As a result, a large body of literature has been devoted to distributed file systems and how they can store a great number of large files by federating storage across multiple servers [1]–[7].

Several modern distributed file systems [1], [3], [6], [7] allow users to mount their file systems on UNIX clients by using FUSE [8], a widely used framework for implementing user-level file systems. Once the file systems are mounted in this way, the distributed file system can be accessed with standard system calls such as open, close, read, and write. While this approach increases the simplicity and enhances the separation of concerns significantly, it also has a drawback: the current FUSE framework imposes considerable overhead on I/O and can degrade the overall performance of data-intensive applications. Previous papers describe that FUSE may incur large runtime overhead [9]–[11]. For example, Bent et al. [9] reported that approximately 20% overhead was imposed on the bandwidth of file systems.
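To make the preceding point concrete, the following minimal C sketch reads a file on a FUSE-mounted distributed file system using only standard system calls. The mount point /mnt/gfarm and the file name are illustrative assumptions; any path under a FUSE mount behaves the same way, with each open and read below being routed through the FUSE kernel module to the userland daemon.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Read a file from a FUSE-mounted distributed file system with plain
 * POSIX calls.  "/mnt/gfarm/data.bin" is an assumed mount point and
 * file name; the same code works on any mounted path. */
int main(void)
{
    char buf[4096];
    ssize_t n;

    int fd = open("/mnt/gfarm/data.bin", O_RDONLY);  /* request forwarded to the daemon */
    if (fd < 0) {
        perror("open");
        return EXIT_FAILURE;
    }
    while ((n = read(fd, buf, sizeof(buf))) > 0) {    /* each read crosses kernel/daemon */
        /* process buf[0..n-1] ... */
    }
    if (n < 0)
        perror("read");
    close(fd);
    return EXIT_SUCCESS;
}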
The FUSE framework consists of a kernel module and userland daemons. The kernel module mediates between applications and a userland daemon, forwarding requests for file system access to the daemon and returning the processing results of the daemon to the application. Because these communications between the kernel module and the userland daemon involve frequent memory copies and context switches, they introduce significant runtime overhead. The framework forces applications to access data in the mounted file system via the userland daemon, even when the data is stored locally and could be accessed directly. The memory copies also increase memory consumption because redundant data is stored in different page caches.

In this paper, we propose a mechanism that allows applications to access local storage directly via the FUSE kernel module. We focus on the flow of operations performed in local file accesses by FUSE-based file systems and propose a mechanism to “bypass” redundant context switches and memory copies. We then explain our modifications to the FUSE kernel module in order to incorporate our mechanism into it. The result is a kernel module that directly accesses local storage wherever possible, thereby eliminating unnecessary kernel-daemon communication and many memory copies and context switches associated with this communication, which in turn improves the I/O performance of FUSE-based file systems.

We demonstrate the effectiveness of our proposed mechanism by using a Gfarm distributed file system [1] against our modified version of FUSE and testing its performance. Our results show that the mechanism significantly improved the I/O throughput of the Gfarm file system (ranging from 26% to 63%), and also greatly reduced the amount of memory consumed by page caching (half of the original). Because our mechanism has no strict dependencies on Gfarm, we believe that our test results are generalizable to other FUSE-based distributed file systems and that our proposed mechanism could be clearly integrated into the FUSE kernel module.

The contributions of this work are (1) to provide a mechanism for reducing context switches and memory copies occurring in local storage accesses in FUSE-based distributed file systems, and (2) to demonstrate its effectiveness by adapting the mechanism to Gfarm and showing experimental results.

II. GFARM AND FUSE

A. Gfarm

Figure 1 shows the basic architecture of Gfarm, which consists of metadata servers and I/O servers. A metadata server manages file system metadata, including inodes, directory entries, and file locations. I/O servers manage file data and are run in either dedicated nodes or client nodes. Under certain settings, an I/O server and its client programs run in the same node; under other settings, they may run in different nodes.

Fig. 1. Gfarm Architecture

One of the ways for applications to access a Gfarm file system is to call functions in the Gfarm library, including those for standard operations such as open and read. When a client accesses a file, the flow of the operation is as follows:
1) The client sends an open request to a metadata server when it wants to open a file.
2) The client receives the location of the file from the metadata server.
3) The client sends an open request to the I/O server that contains the file.
4) The I/O server communicates with the metadata server to check whether the open request is valid. File information such as an inode number and a generation number is also communicated.
5) The client sends further requests, such as read and write, to the I/O server.
6) The client sends a close request to both the metadata server and the I/O server.
By restricting communication of read and write requests to only the I/O server containing the file, Gfarm reduces the workloads of the metadata server and the network.
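A minimal sketch of this library access path follows, assuming libgfarm client entry points of roughly this shape (gfarm_initialize, gfs_pio_open, gfs_pio_read, gfs_pio_close); the exact signatures, flags, and the gfarm:// URL used here are assumptions and should be checked against the installed Gfarm headers.

#include <stdio.h>
#include <gfarm/gfarm.h>

/* Illustrative sketch of direct access through the Gfarm library, as an
 * alternative to going through a FUSE mount.  The entry points and the
 * gfarm:// URL below are assumptions based on the libgfarm client API. */
int main(int argc, char **argv)
{
    GFS_File gf;
    char buf[4096];
    int n;
    gfarm_error_t e;

    e = gfarm_initialize(&argc, &argv);
    if (e != GFARM_ERR_NO_ERROR) {
        fprintf(stderr, "gfarm_initialize: %s\n", gfarm_error_string(e));
        return 1;
    }
    /* The open request goes to a metadata server and then to the I/O
     * server holding the file (steps 1-4 above). */
    e = gfs_pio_open("gfarm:///tmp/data.bin", GFARM_FILE_RDONLY, &gf);
    if (e == GFARM_ERR_NO_ERROR) {
        /* Reads are exchanged only with that I/O server (step 5). */
        while (gfs_pio_read(gf, buf, sizeof(buf), &n) == GFARM_ERR_NO_ERROR && n > 0)
            ;  /* process buf[0..n-1] ... */
        gfs_pio_close(gf);  /* close goes to both servers (step 6). */
    } else {
        fprintf(stderr, "gfs_pio_open: %s\n", gfarm_error_string(e));
    }
    gfarm_terminate();
    return 0;
}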
B. FUSE

The basic FUSE architecture consists of a kernel module (hereafter referred to as the “FUSE module”) and userland daemons. Developers create their user-level file systems within these daemons by providing file system operations such as open, close, read, and write.

Fig. 2. Access to a user-level file system using FUSE

Figure 2 shows how the FUSE framework is used to access a user-level file system. The steps are as follows:
1) The application issues a system call to the operating system to access files in a FUSE-based file system. The service routine for the system call sends a request for the corresponding file operation to the FUSE module.
2) The FUSE module forwards the request to a userland daemon.
3) The daemon processes the request in several ways. Some use a local file system such as ext3. Others use a remote server and network communications.
4) The daemon returns the result to the FUSE module.
5) The FUSE module forwards the result to the application program.

To clarify the runtime overhead imposed on reading and writing files, we here outline the flow of the read operation in FUSE-based file systems. First, the application issues a read system call to the user-level file system, which sends a corresponding read request to the FUSE module. The FUSE module receives this read request and forwards it to a userland daemon by writing the request and its properties to the special file /dev/fuse. The daemon picks up the request from /dev/fuse and performs the operations appropriate for the read request, such as copying data to the FUSE module buffer. Finally, the FUSE module returns the result to the application.

Due to these additional context switches and memory copies, the performance of a FUSE-based user-level file system tends to be poorer than that of a kernel-level file system. At minimum, two additional context switches occur when an application process sends a request to a user-level file system and receives the result: one to transfer control from the application process to the daemon, and another to transfer control back from the daemon to the application process. Additional memory copies also occur, when request data (such as the request type) is communicated to the daemon via /dev/fuse, and when the daemon copies file data to the FUSE module’s own buffer, so that it can be returned to the application.

C. Accessing a FUSE-Mounted Gfarm File System

Gfarm2fs is a FUSE-based userland daemon for mounting a Gfarm file system on a UNIX client. Figure 3 shows the flow of a file access when gfarm2fs is used. The steps are as follows:
1) The application program issues a system call for a file access.
2) The FUSE module receives a request for file system access.
3) The FUSE module forwards the request to the gfarm2fs daemon. gfarm2fs processes the request by communicating with a metadata server and I/O servers using the Gfarm library.
4) The results of the request are returned to the FUSE module.

Fig. 3. Access to a Gfarm file system using FUSE
Fig. 4. Flow of the open component. The FUSE module obtains a file descriptor from gfarm2fs, which opens the file.
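The extra copy and the context switches described in Section II-B take place around the daemon’s callbacks. The sketch below is a simple passthrough daemon written against the libfuse high-level API (version 2.6), serving files from an assumed local backing directory; it is only an illustration of where the daemon copies data into the buffer supplied by the FUSE module, not the actual gfarm2fs implementation, which forwards these operations to the Gfarm library instead.

#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Backing directory served through FUSE; an assumption for illustration. */
static const char *backing = "/var/lib/backing";

static void full_path(char *out, size_t len, const char *path)
{
    snprintf(out, len, "%s%s", backing, path);
}

static int pt_getattr(const char *path, struct stat *st)
{
    char p[4096];
    full_path(p, sizeof(p), path);
    return lstat(p, st) < 0 ? -errno : 0;
}

/* open: the daemon opens the real file and stashes the descriptor in fi->fh,
 * roughly what Fig. 4 describes for gfarm2fs (with a Gfarm handle instead). */
static int pt_open(const char *path, struct fuse_file_info *fi)
{
    char p[4096];
    full_path(p, sizeof(p), path);
    int fd = open(p, fi->flags);
    if (fd < 0)
        return -errno;
    fi->fh = fd;
    return 0;
}

/* read: the daemon copies file data into buf, the buffer handed over by the
 * FUSE module via /dev/fuse; this is the extra copy discussed in II-B. */
static int pt_read(const char *path, char *buf, size_t size, off_t offset,
                   struct fuse_file_info *fi)
{
    (void)path;
    ssize_t n = pread(fi->fh, buf, size, offset);
    return n < 0 ? -errno : (int)n;
}

static int pt_release(const char *path, struct fuse_file_info *fi)
{
    (void)path;
    close(fi->fh);
    return 0;
}

static struct fuse_operations pt_ops = {
    .getattr = pt_getattr,
    .open    = pt_open,
    .read    = pt_read,
    .release = pt_release,
};

int main(int argc, char *argv[])
{
    /* The mount point is given on the command line, e.g. ./passthrough -f /mnt/pt */
    return fuse_main(argc, argv, &pt_ops, NULL);
}

Such a daemon would typically be built against libfuse 2 (for example, gcc passthrough.c `pkg-config fuse --cflags --libs`) and mounted by running it with a mount point argument; every read issued against the mount then follows the kernel-module-to-daemon round trip outlined above.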