
zFS - A Scalable Distributed File System Using Object Disks

Ohad Rodeh (orodeh@il..com) and Avi Teperman ([email protected])
IBM Labs, Haifa University, Mount Carmel, Haifa 31905, Israel

Abstract

zFS is a research project aimed at building a decentralized file system that distributes all aspects of file and storage management over a set of cooperating machines interconnected by a high-speed network. zFS is designed to be a file system that scales from a few networked computers to several thousand machines and to be built from commodity, off-the-shelf components.

The two most prominent features of zFS are its cooperative cache and distributed transactions. zFS integrates the memory of all participating machines into one coherent cache. Thus, instead of going to the disk for a block of data already in one of the machine memories, zFS retrieves the data block from the remote machine. zFS also uses distributed transactions and leases, instead of group-communication and clustering software.

This article describes the zFS high-level architecture and how its goals are achieved.

1. Introduction

zFS is a research project aimed at building a decentralized file system that distributes all aspects of file and storage management over a set of cooperating machines interconnected by a high-speed network. zFS is designed to be a file system that will (1) scale from a few networked computers to several thousand machines, supporting tens of thousands of clients, and (2) be built from commodity, off-the-shelf components such as PCs, Object Store Devices (OSDs) and a high-speed network, and run on existing operating systems such as Linux.

zFS extends the research done in the DSF project [10] by using object disks as storage media and by using leases and distributed transactions.

The two most prominent features of zFS are its cooperative cache [8, 14] and distributed transactions. zFS integrates the memory of all participating machines into one coherent cache. Thus, instead of going to the disk for a block of data already in one of the machine memories, zFS retrieves the data block from the remote machine. zFS also uses distributed transactions and leases, instead of group-communication and clustering software. We intend to test and show the effectiveness of these two features in our prototype.

zFS has six components: a Front End (FE), a Cooperative Cache (Cache), a File Manager (FMGR), a Lease Manager (LMGR), a Transaction Server (TSVR), and an Object Store (OSD). These components work together to provide applications and users with a distributed file system.

The design of zFS addresses, and is influenced by, issues of fault tolerance, security and backup/mirroring. However, in this article, we focus on the zFS high-level architecture and briefly describe zFS's fault tolerance characteristics. The first prototype of zFS is under development and will be described in another document.

The rest of the article is organized as follows. In Section 2, we describe the goals of zFS. Section 3 details the functionality of zFS's various components, followed by Section 4, which details zFS's architecture and protocols. Issues of fault tolerance are briefly discussed in Section 5, and Section 6 compares zFS to other file systems. We conclude with Section 7, summarizing how combining all these components supports higher performance and scalability.

2. zFS Goals

The design and implementation of zFS is aimed at achieving a scalable file system beyond those that exist today. More specifically, the objectives of zFS are:

• Creating a file system that operates equally well on few or thousands of machines

• Using off-the-shelf components with OSDs

• Making use of the memory of all participating machines as a global cache to increase performance

• Achieving almost linear scalability: the addition of machines will lead to an almost linear increase in performance

zFS will achieve scalability by separating storage management from file management and by dynamically distributing file management.

Storage management in zFS is encapsulated in the Object Store Devices (OSDs) [1] (we also use the term object disk), while file management is done by other zFS components, as described in the following sections. Having OSDs handle storage management implies that functions usually handled by file systems are done in the OSD itself, and are transparent to other components of zFS. These include: striping, mirroring, and continuous copy/PPRC.

The Object Store does not distinguish between files and directories. It is the responsibility of the file system management (the other components of zFS) to handle them correctly.

zFS is designed to work with a relatively loosely-coupled set of components. This allows us to eliminate clustering software and take a different approach than those used by other clustered file systems [12, 6, 2]. zFS is designed to support a low-to-medium degree of file and directory sharing. We do not claim to reach GPFS-like scalability for very high sharing situations [12].

3. zFS Components

This section describes the functionality of each zFS component and how it interacts with other components. It also contains a description of the file system layout on the object store.

3.1. Object Store

The object store (OSD) is the storage device on which files and directories are created, and from where they are retrieved. The OSD API enables creation and deletion of objects, and writing and reading byte-ranges to/from an object. Object disks provide file abstractions, security, safe writes and other capabilities, as described in [9].

Using object disks allows zFS to focus on management and scalability issues, while letting the OSD handle the physical disk chores of block allocation and mapping.
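To make the OSD interface concrete, the following is a minimal in-memory sketch of an object store in Python. The class and method names are our own illustrative assumptions, not the actual OSD command set, which is defined in [1].

    # Minimal sketch of the object-level interface an OSD exposes (illustrative;
    # the real command set is defined by the OSD standard draft [1]).

    class ObjectStore:
        """An object store holds flat objects addressed by object id (oid)."""

        def __init__(self, obs_id):
            self.obs_id = obs_id          # object-store identifier
            self.objects = {}             # oid -> bytearray (sparse in a real OSD)

        def create_object(self, oid):
            if oid in self.objects:
                raise ValueError("object already exists")
            self.objects[oid] = bytearray()

        def delete_object(self, oid):
            del self.objects[oid]

        def write(self, oid, offset, data):
            """Write a byte-range into an object, extending it if needed."""
            obj = self.objects[oid]
            if len(obj) < offset + len(data):
                obj.extend(b"\0" * (offset + len(data) - len(obj)))
            obj[offset:offset + len(data)] = data

        def read(self, oid, offset, length):
            """Read a byte-range from an object."""
            return bytes(self.objects[oid][offset:offset + length])

The point of the sketch is that the interface is object- and byte-range-oriented: there is no notion of blocks, directories or file names at this level.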

3.2. Layout

zFS uses the object stores to lay out both files and directories. We assume each directory maps to a single object, and that a file also maps to a single object. (This can change in the future to multiple objects per file.) A file-object contains the set of bytes that the file is comprised of. It may be sparse, containing many non-contiguous chunks. A directory contains a set of entries, where each entry contains: (a) a file name, (b) some flags, and (c) a file-system pointer, fsptr, that points to the location in the file system where the file or directory resides. An fsptr is a pair of an Object Store Identifier and an object id inside that OSD: (obs_id, oid). An example is depicted in Figure 1.

Figure 1. An example of a zFS layout on disk. There are three object stores: two, three, and seven. ObS2 contains two file-objects with object ids 5 and 19; it also contains a directory-object, number 11, that has two directory entries. These point to two files, "bar" and "foo", that are located on OSDs three and seven.

Some file systems use different storage systems for meta-data (directories) and file data. Using object stores for storing all data allows using the higher-level management and copy services provided by the OSD. For example, an OSD will support snapshots; hence, creating a file-system snapshot requires taking snapshots at the same time from all the OSDs.

The downside is that directories become dispersed throughout the OSDs, and directory operations become distributed transactions.
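The layout described above reduces to two small records, sketched here in Python. The field names are illustrative, not the actual on-disk encoding; the flag values shown are the ones used by the directory transactions of Section 4.2.3.

    # Sketch of the zFS layout records described above (field names are
    # illustrative, not the actual on-disk encoding).

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class FsPtr:
        """File-system pointer: locates one object on one object store."""
        obs_id: int   # object-store (OSD) identifier
        oid: int      # object id inside that OSD

    @dataclass
    class DirEntry:
        """One entry in a directory object."""
        name: str     # file or directory name
        flag: str     # e.g. "Normal", "InCreate", "InDelete" (see Section 4.2.3)
        fsptr: FsPtr  # where the file or directory object resides

    # Hypothetical directory with two entries pointing at objects on OSDs 3 and 7.
    example_dir = [
        DirEntry(name="bar", flag="Normal", fsptr=FsPtr(obs_id=3, oid=1001)),
        DirEntry(name="foo", flag="Normal", fsptr=FsPtr(obs_id=7, oid=1002)),
    ]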
3.3. Front End

The zFS front end (FE) runs on every node on which a client wants to use zFS. It presents the file system API to the client and provides access to zFS files and directories. Using Linux as our implementation platform, this implies integration with the VFS layer, which also means that the FE is an in-kernel component. On many systems (including Linux), a file system has to define and implement three sets of operations:

Super Block Operations: operations that determine the behavior of the file system.

Inode Operations: operations on whole file and directory objects; e.g., create, delete, etc.

File Operations: specific operations on files or directories; e.g., read, write, readdir, etc.

By implementing these sets of operations and integrating them within the operating system kernel, a new file system can be created. In Linux this can be done either by changing the kernel sources or by building a loadable module implementing these operations. When the module is loaded, it registers the new file system with the kernel, and then the new file system can be mounted.

3.4. Lease Manager

The need for a Lease Manager (LMGR) stems from the following facts: (1) file systems use one form or another of locking mechanism to control access to the disks in order to maintain consistency when several users work on the same files; (2) to work in SAN file systems, where clients can write directly to object disks, the OSDs themselves have to support some form of locking; otherwise, two clients could damage each other's data.

In distributed environments, where network connections and even machines themselves can fail, it is preferable to use leases rather than locks. Leases are locks with an expiration period that is set up in advance. Thus, when a machine holding a lease on a resource fails, we are able to acquire a new lease after the lease of the failed machine expires. Obviously, the use of leases incurs the overhead of lease renewal on the client that acquired the lease and still needs the resource.

To reduce the overhead on the OSD, the following mechanism is used: each OSD maintains one major lease for the whole disk. Each OSD also has one lease manager (LMGR) which acquires and renews the major lease. Leases for specific objects (files or directories) on the OSD are managed by the OSD's LMGR. Thus, the majority of the lease management overhead is offloaded from the OSD, while still maintaining the ability to protect data.

The OSD stores in memory the network address of the current holder of the major lease. To find out which machine is currently managing a particular OSD O, a client simply asks O for the network address of its current LMGR.

The lease manager, after acquiring the major lease, grants exclusive leases for objects residing on the OSD. It also maintains in memory the current network address of each object-lease owner. This allows looking up file managers.

Any machine that needs to access an object obj on OSD O first figures out who its LMGR is. If one exists, the object-lease for obj is requested from the LMGR. If one does not exist, the requesting machine creates a local instance of an LMGR to manage O for it.
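A minimal sketch of such expiring leases, with the LMGR holding the major lease and granting exclusive object leases, is given below. The durations and names are assumptions for illustration; in zFS the leases are granted and renewed over the network.

    # Simplified, single-process sketch of expiring leases (illustrative only).
    import time

    class Lease:
        def __init__(self, resource, holder, duration):
            self.resource = resource          # "major", or an object id
            self.holder = holder              # network address of the lease holder
            self.expires = time.time() + duration

        def valid(self):
            return time.time() < self.expires

        def renew(self, duration):
            self.expires = time.time() + duration

    class LeaseManager:
        """Holds the OSD major lease and grants exclusive object leases."""
        MAJOR_DURATION = 60.0    # assumed lease periods, in seconds
        OBJECT_DURATION = 30.0

        def __init__(self, my_addr):
            self.major = Lease("major", my_addr, self.MAJOR_DURATION)
            self.object_leases = {}           # oid -> Lease

        def grant_object_lease(self, oid, requester):
            """Grant an exclusive object lease if it is free or has expired."""
            current = self.object_leases.get(oid)
            if current is not None and current.valid() and current.holder != requester:
                raise RuntimeError("lease held by %s until expiry" % current.holder)
            lease = Lease(oid, requester, self.OBJECT_DURATION)
            self.object_leases[oid] = lease
            return lease

    lmgr = LeaseManager(my_addr="host-a:7001")
    lease = lmgr.grant_object_lease(oid=19, requester="host-b:7002")

Because a lease simply expires on its own, a failed holder never has to be contacted; the next requester only has to wait out the remaining lease period.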
3.5. File Manager

Each opened file in zFS is managed by a single file manager assigned to the file when it is opened. The set of all currently active file managers manages all opened zFS files. Initially, no file has an associated file manager (FMGR). The first machine to perform an open() on file F will create an instance of a file manager for F. Henceforth, and until that file manager is shut down, each lease request for any part of the file will be mediated by that FMGR. For better performance, the first machine which performs an open() on a file creates a local instance of the file manager for that file.

The FMGR keeps track of each accomplished open() and read() request, and maintains the information regarding where each file's blocks reside in internal data structures. When an open() request arrives at the file manager, it checks whether the file has already been opened by another client (on another machine). If not, the FMGR acquires the proper exclusive lease from the lease manager and directs the request to the object disk. In case the requested data resides in the cache of another machine, the FMGR directs the Cache on that machine to forward the data to the requesting Cache. This can be either the local Cache, in case the FMGR is located on the client machine initiating the request, or a remote Cache otherwise.

The file manager interacts with the lease manager of the OSD where the file resides to obtain an exclusive lease on the file. It also creates and keeps track of all the range-leases it distributes. These leases are kept in internal FMGR tables, and are used to control and provide proper access to files by various clients. For more details on the lease manager, see Section 3.4.

3.6. Cooperative Cache

The cooperative cache (Cache) of zFS is a key component in achieving high scalability. Due to the fast increase in network speed nowadays, it takes less time to retrieve data from another machine's memory than from a local disk. This is where a cooperative cache is useful. When a client on machine A requests a block of data via FE_A, and the file manager (FMGR_B on machine B) realizes that the requested block resides in the Cache of machine M, Cache_M, it sends a message to Cache_M to send the block to Cache_A and updates the information on the location of that block in FMGR_B. The Cache on A then receives the block, updates its internal tables (for future accesses to the block) and passes the data to FE_A, which passes it to the client.

Needless to say, leases are checked, revoked and created by the FMGR to ensure proper use of the data.

3.7. Transaction Server

In zFS, directory operations are implemented as distributed transactions. For example, a create-file operation includes, at the very least, (a) creating a new entry in the parent directory, and (b) creating a new file object. Each of these operations can fail independently, and the initiating host can fail as well. Such occurrences can corrupt the file system. Hence, each directory operation should be protected inside a transaction, such that in the event of failure, the consistency of the file system can be restored. This means either rolling the transaction forward or backward.

The most complicated directory operation is rename(). This requires, at the very least, (a) locking the source directory, target directory, and file (to be moved), (b) creating a new directory entry at the target, (c) erasing the old entry, and (d) releasing the locks.

Since such transactions are complex, zFS uses a special component to manage them: a transaction server (TSVR). The TSVR works on a per-operation basis. It acquires all required leases and performs the transaction. The TSVR attempts to hold onto acquired leases for as long as possible and releases them only for the benefit of other hosts. The FE sends all directory operations, asynchronous-RPC style, to the TSVR and updates its internal dir-entry caches according to the results.

4. zFS Architecture

In this section we describe in detail how zFS components interact to present a file system to applications. First we show how zFS components interconnect, following it with several protocols describing how file system operations are carried out in the zFS architecture.

4.1. zFS Component Interconnections

Figure 2 illustrates all the components of zFS. At the bottom we see several object disks, and at the top we see two hosts running zFS components. The FE and Cache are situated inside the kernel, while the LMGR, FMGR, and TSVR are located in a single process in user space. It is important to emphasize that not all these components are active on all machines at all times. In the extreme case, only the TSVR may be active, and all other components for the files used on this particular machine run on other nodes. A socket connects the in-kernel and out-of-kernel components, and OSDs are accessed directly by the hosts.

Figure 2. zFS Components

4.2. Protocols

To see the interactions between zFS components, let us walk through several protocols. We start with the read() and write() operations, following them with the create() and rename() file operations, which require transactions.

4.2.1. File Read Protocol. Figure 3 shows the control and information paths for the read(file, ...) operation, detailed below.

(a) The FE looks up the fsptr for file.

(b) If the read can be satisfied by locally cached file blocks (i.e., the data and a read lease are locally cached), then the requested data is returned to the application and we return.

(c) Otherwise, a read request is sent to the FMGR of the file, and the FE/Cache waits for the request to be satisfied.

(d) The FMGR checks, and if necessary creates, a read lease for the requested blocks.

(e) The FMGR checks if other Caches hold the requested blocks of the file and does the following:

(1) If so, it forwards the above byte-range lease, the read request and the address of the requesting FE (the requester) to the Cache/FE on the host holding the requested blocks (host).

(2) Otherwise, it forwards the above byte-range lease, the read request and the address of the requesting FE (the requester) to the OSD holding the requested blocks.

(f) The FE/Cache on host, or the OSD, sends the requested data blocks and the read lease to the requester.
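The control flow of steps (a)-(f) can be sketched as follows. This is a single-process simulation under assumed names (Cache, ReadFMGR, zfs_read); it is not the zFS implementation, but it follows the same decision points: local hit, cache-to-cache forwarding, and fall-back to the OSD.

    # End-to-end sketch of the read path of Section 4.2.1 (illustrative names,
    # in-process stand-ins for the network messages).

    class Cache:
        def __init__(self):
            self.blocks = {}            # (fsptr, block_no) -> bytes
            self.read_leases = set()

    class ReadFMGR:
        def __init__(self, osd_blocks):
            self.osd_blocks = osd_blocks    # (fsptr, block_no) -> bytes on the OSD
            self.locations = {}             # (fsptr, block_no) -> Cache holding it

        def read(self, key, requester):
            """(c)-(f): grant a read lease and route the data to the requester."""
            requester.read_leases.add(key)          # (d) create the read lease
            holder = self.locations.get(key)
            if holder is not None:                  # (e.1) another cache has it:
                data = holder.blocks[key]           #       cache-to-cache transfer
            else:                                   # (e.2) otherwise fetch from OSD
                data = self.osd_blocks[key]
            requester.blocks[key] = data            # (f) deliver data + lease
            self.locations[key] = requester
            return data

    def zfs_read(fe_lookup, fmgr, local_cache, path, block_no):
        key = (fe_lookup[path], block_no)           # (a) resolve path to an fsptr
        if key in local_cache.blocks and key in local_cache.read_leases:
            return local_cache.blocks[key]          # (b) satisfied locally
        return fmgr.read(key, local_cache)          # (c) go through the FMGR

    # Example: two caches sharing one block of file "/a".
    osd = {(("obs2", 7), 0): b"hello"}
    fmgr = ReadFMGR(osd)
    names = {"/a": ("obs2", 7)}
    c1, c2 = Cache(), Cache()
    zfs_read(names, fmgr, c1, "/a", 0)   # first read comes from the OSD
    zfs_read(names, fmgr, c2, "/a", 0)   # second read is served from c1's cache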

Figure 3. zFS Operations - walk through read(file). (a) a cache-to-cache read (b) an OSD read

We define the above scenarios as third-party communication; i.e., the party that passes the actual data is not the same party from whom the data was requested. In Figure 3, in both cases the request is sent to the FMGR, while the data arrives either from another Cache or from the OSD.

4.2.2. File Write Protocol. Figure 4 shows the control and information paths for the write(file, ...) operation.

We assume that read operations were conducted by several users on the file and that some of its data blocks reside in several Caches. After some period of time, a user wants to write to the file.

(a) The FE sends the write request to fmgr_i (the file manager of the file).

(b) fmgr_i checks if other Caches hold blocks of the file and does the following:

(1) Revokes all read leases on the object (file) whose range overlaps with the write area. Messages are sent to the various Caches to that effect. Note that we have to wait for acknowledgements from the Caches; otherwise some clients will read incorrect data.

(2) Revokes all overlapping write leases. This may require flushing buffers to the object disk.

(3) Creates a write lease for the specified range.

(c) If the data blocks are not (fully) in the local Cache and the data is not aligned on page boundaries, the proper pages are read into the cache.

(d) The data is written to the proper blocks in the Cache.

(e) The proper result (the number of bytes written, or an error code, as the semantics of write requires) is returned to the FE, which passes it to the user.
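Step (b) is the heart of the write path: the file manager must revoke conflicting leases before granting the write lease. Below is a minimal sketch, assuming byte ranges are half-open (start, end) intervals and that revocations are acknowledged synchronously; the class and method names are illustrative, not the zFS interfaces.

    # Sketch of the lease revocation performed on a write (Section 4.2.2, step b).

    def overlaps(a, b):
        """Half-open intervals (start, end) overlap?"""
        return a[0] < b[1] and b[0] < a[1]

    class PeerCache:
        """Stand-in for a remote Cache that honours revocation messages."""
        def revoke_read(self, rng):
            pass                  # drop the cached blocks for rng, then ack
        def flush_and_revoke(self, rng):
            pass                  # write dirty blocks for rng to the OSD, then drop

    class WriteFMGR:
        def __init__(self):
            self.read_leases = []     # list of (range, cache)
            self.write_leases = []    # list of (range, cache)

        def acquire_write_lease(self, rng, writer):
            # (b.1) revoke overlapping read leases and wait for the acks,
            # otherwise some client could keep serving stale data.
            for lease_rng, cache in list(self.read_leases):
                if overlaps(lease_rng, rng):
                    cache.revoke_read(lease_rng)
                    self.read_leases.remove((lease_rng, cache))
            # (b.2) revoke overlapping write leases; this may flush dirty buffers.
            for lease_rng, cache in list(self.write_leases):
                if overlaps(lease_rng, rng):
                    cache.flush_and_revoke(lease_rng)
                    self.write_leases.remove((lease_rng, cache))
            # (b.3) create the write lease for the requested range.
            self.write_leases.append((rng, writer))
            return rng

    fm = WriteFMGR()
    fm.read_leases.append(((0, 4096), PeerCache()))
    fm.acquire_write_lease((1024, 2048), PeerCache())   # revokes the overlapping read lease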

Figure 4. zFS Operations - walk through write(file). Host B requests to write to a byte-range. The FMGR revokes existing read/write leases for the same range on other machines. Machine A has dirty data on the range and needs to flush. Machine B can immediately return a Revoke Ok. B then reads the relevant block(s) directly from the cache of A; there is no need to go to the OSD.

4.2.3. Create File Protocol. Upon receiving a create(parent_dir, fname) from the FE, the TSVR executes the following protocol, shown also in Figure 5:

1. The FE receives a request to create fname in directory parent_dir. It does a lookup, which starts at a pre-configured location: the root directory. (This is the only part of the file system that is located in a location that does not change.) The TSVR is consulted for missing parts of the path. Finally, the FE holds an fsptr for the directory where the file is to be created, parent_dir.

2. The FE sends a request to the TSVR to create a file named fname in directory parent_dir.

3. The TSVR chooses an object store obs_id on which to create the new file.

4. The TSVR creates a new (never used before) object id, oid, for the new file and acquires an exclusive lease for the new file (obs_id, oid).

5. The TSVR acquires a lease for parent_dir.

6. The TSVR writes a new entry in the parent directory: { name = fname; flag = InCreate; fsptr = (obs_id, oid) }.

7. The TSVR creates the file at (obs_id, oid).

8. The TSVR writes initial meta-data to the file: gid, uid, etc. (In our design, the file's meta-data is attached to the file's data in a special block.)

9. The TSVR overwrites the dir-entry flag with the Normal flag.

10. The TSVR returns the fsptr of the new file to the FE.

Note that no Cache is involved during file creation, since the Cache only gets involved when data is actually read by the client. Also, simultaneous requests from two clients to open the same file are serialized by the file manager which manages the file.

Figure 5. zFS Operations - walk through create(parent_dir/fname, ...). (2) The host sends a request to create a file. (4) The TSVR creates a fresh object-id for the file and takes an exclusive lease for it from the LMGR. (5) The TSVR takes a lease for the parent directory from the FMGR. (6) The TSVR writes a new temporary directory entry on the object disk holding the parent directory. (7,8) The TSVR creates the file and writes the initial meta-data. (9) The TSVR finalizes the transaction.

To acquire the leases in stages 4 and 5, a two-level hierarchy is used; see Figure 6. Each OSD has an associated LMGR that takes the major lease and manages exclusive object-leases on it. Each FMGR takes an exclusive lease for an object, and allows taking single-writer-multiple-reader leases on byte-ranges in the file. Each TSVR takes the leases it needs for directory transactions, and attempts to hold on to leases for as long as possible, assuming other hosts do not request them. Only safe caching is done at the FE and Cache: no data that is being modified on other hosts is locally cached. This provides strong cache consistency to file-system applications.

Figure 6. The hierarchy of leases

As part of its normal operation, the TSVR has to locate LMGRs and FMGRs for OSDs, files, and directories. The input for this function is the OSD identifier, or the fsptr. To locate an LMGR for a particular OSD, the OSD itself is queried for the network address of the LMGR currently managing it. If one exists, its address is returned. If one does not exist, the TSVR takes the OSD major lease and creates a local LMGR for it. To locate an FMGR for a particular (obs_id, oid), (1) the LMGR for obs_id is located and (2) it is queried for the FMGR of oid. If none exists, the TSVR creates a local FMGR to manage the object.

If all works out, then at the end of this transaction the file has been created. However, failures can occur at each of these stages. The solution is to always roll back. This is achieved by the following Erase sub-protocol:

1. Take the parent directory lease and the exclusive object lease for (obs_id, oid).

2. Overwrite the flag in the directory entry with an InDelete flag.

3. Erase the object (obs_id, oid).

4. Erase the parent dir-entry.

5. Release the leases taken.

Choosing an object store on which to create the file is an important issue. In order to achieve good load balancing, the creator needs to take into account the set of possible OSDs and choose the best candidate. We intend for each host to keep track of a set of OSDs it works with, and monitor their usage through periodic statistics gathering. It should fill up the same OSDs to provide locality, and should also refrain from filling up any particular OSD, since this degrades its performance. This, admittedly simple, algorithm requires no global knowledge. This property is important in a large scalable system.

Choosing a fresh object id needs to be done carefully. We use a special object C on each object store O that holds an object-id counter. The counter is initialized when zFS formats the object disk; it is set to an initial value of I. Objects with oids smaller than I are special zFS auxiliary objects; objects with larger oids are regular files and directories. The LMGR for O provides an interface for taking ranges of fresh object names. It takes the exclusive lease for C, reads the current value, increments it by R, and writes it back to disk. This provides a range of R fresh names. Any application that needs fresh names can use the LMGR for that purpose.
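The counter-object scheme can be sketched as follows. The values of I and R, and all names, are illustrative assumptions; in zFS the counter lives in object C on the OSD and is read and updated under its exclusive lease.

    # Sketch of fresh object-id allocation: a counter object C on each OSD
    # hands out ranges of R never-used-before ids (values are illustrative).

    R = 1024          # assumed size of each allocated range
    INITIAL_I = 256   # assumed boundary: oids below I are zFS auxiliary objects

    class FreshIdAllocator:
        def __init__(self, counter_value=INITIAL_I):
            # In zFS the counter is stored in object C and protected by an
            # exclusive lease; here it is just an in-memory field.
            self.counter = counter_value

        def take_range(self):
            """Bump the counter by R (persisted to the OSD in the real system)
            and return the range of R fresh object ids."""
            start = self.counter
            self.counter += R
            return range(start, start + R)

    alloc = FreshIdAllocator()
    first_range = alloc.take_range()    # ids INITIAL_I .. INITIAL_I + R - 1

Handing out whole ranges means the counter object is touched once per R creations, so id allocation adds almost no load to the OSD.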

Using fresh (never used before) object names is crucial. For example, assume that a create-object operation uses object id i and fails after stage 6. Later, a create operation reuses i and creates a file with id i in another directory. Even later, Erase is activated to erase the initial failed create-object, and erases i. This sequence of events will corrupt the file system, with the second created dir-entry pointing to a non-existent object.

This protocol may seem overly complex for a simple operation such as create. To see how it works, we shall walk through an example of a failure scenario. Assume host A starts a create(parent_dir, fname) and fails after step 7. Host B comes along, attempts to read the dir-entry in parent_dir and sees the InCreate flag. It then initiates the Erase sub-protocol. The list of actions can be seen in Table 1.

Table 1. Example run of an interrupted create.

    Host   Operation
    A      create dir-entry (fname, InCreate, (obs_id, oid))
    A      create object ((obs_id, oid))
    B      dir-entry flag := InDelete
    B      delete object ((obs_id, oid))
    B      erase dir-entry (fname, (obs_id, oid))

We are careful to include in stage 6 of the create protocol the fsptr of the new file-object, although it has not been created yet. It is possible to first create the object and then link it into the directory. However, if the host fails, a dangling object will remain on the OSD; since there are no links to this object, it will never be erased. It is also possible to fail after the first phase of the protocol. Thus, storing the fsptr is important for the erase operation, since it is required to finish the operation in case of failure.
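As an illustration of the recovery just described, the following sketch replays the interrupted create of Table 1 with plain dictionaries standing in for the parent directory and the OSD. All names are assumptions for illustration; leases and meta-data writes are omitted.

    # Sketch of the create transaction and its Erase roll-back (Table 1).
    # directory: name -> (flag, fsptr); osd_objects: set of (obs_id, oid) pairs.

    def create_file(directory, osd_objects, fname, fsptr):
        directory[fname] = ("InCreate", fsptr)     # step 6: temporary dir-entry
        osd_objects.add(fsptr)                     # step 7: create the file object
        # (step 8, writing initial meta-data, omitted here)
        directory[fname] = ("Normal", fsptr)       # step 9: finalize the entry

    def erase_half_baked(directory, osd_objects, fname):
        """Roll back an interrupted create, as host B does in Table 1."""
        flag, fsptr = directory[fname]
        if flag != "InCreate":
            return                                 # nothing to roll back
        directory[fname] = ("InDelete", fsptr)     # mark the entry for deletion
        osd_objects.discard(fsptr)                 # erase the object if it exists
        del directory[fname]                       # erase the dir-entry
        # (in zFS, the parent-directory and object leases are taken before these
        #  steps and released afterwards)

    d, objs = {}, set()
    create_file(d, objs, "ok.txt", (2, 1002))      # the failure-free path
    # Host A fails between step 7 and step 9 while creating "fname":
    d["fname"] = ("InCreate", (2, 1001)); objs.add((2, 1001))
    erase_half_baked(d, objs, "fname")             # host B completes the roll-back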

4.2.4. Rename Protocol. The rename(src_dir, src_name, trg_dir, trg_name) operation works as follows:

1. Take leases for the source and target directories.

2. Check that the target entry has the correct type: a file can only be copied onto a file, and a directory can only be copied onto a directory.

3. Overwrite the flag in the source dir-entry to RenameFrom(trg_dir, trg_name).

4. Create the target entry if it does not exist. Overwrite the flag to RenameTo(src_dir, src_name).

5. Overwrite the source dir-entry flag to RenameFromSucc(trg_dir, trg_name).

6. Overwrite the target dir-entry flag to Normal.

7. Erase the old entry (including the log).

8. Release the leases taken.

The no-return stage is step 5. If the initiating host fails prior to that phase, the rename operation needs to be rolled back. If that step is passed, any other host must help roll it forward.

To roll the rename back we do:

1. Take leases for the source and target directories.

2. Overwrite the source dir-entry flag with Normal.

3. Erase the target dir-entry.

4. Release the leases taken.

To support a complete roll-back, we must log into the dir-entries participating in the operation complete information on the rename source and target. The largest log component is the file name, which can be any sequence of 1 to 255 bytes.

An example of an interrupted rename operation, where the initiating host fails before reaching the no-return stage, is depicted in Table 2. Host A performs stages one and two and fails. Host B stumbles onto the source dir-entry, reads that there is an in-flight rename operation, and determines that it should be rolled back. It then converts the source dir-entry to Normal and erases the half-baked target entry.

Table 2. Example run of an interrupted rename.

    Host   Operation on source                       Operation on target
    A      flag := RenameFrom(trg_dir, trg_name)
    A                                                 flag := RenameTo(src_dir, src_name)
    B      flag := Normal
    B                                                 erase dir-entry
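The rename steps and the roll-back rule can be sketched as follows. Directory entries are modeled as (flag, ..., fsptr) tuples and leases are omitted; all names are illustrative assumptions, not the zFS implementation.

    # Sketch of the rename transaction of Section 4.2.4 and its roll-back.
    # Directories are dicts of name -> entry, where an entry's last element is
    # the fsptr and its first element is the flag.

    def rename(src_dir, src_name, trg_dir, trg_name):
        fsptr = src_dir[src_name][-1]
        # step 3: log the rename in the source entry
        src_dir[src_name] = ("RenameFrom", trg_name, fsptr)
        # step 4: create the target entry, logging the source
        trg_dir[trg_name] = ("RenameTo", src_name, fsptr)
        # step 5: the no-return stage; past this point other hosts roll forward
        src_dir[src_name] = ("RenameFromSucc", trg_name, fsptr)
        # step 6: finalize the target entry
        trg_dir[trg_name] = ("Normal", fsptr)
        # step 7: erase the old entry (including its log)
        del src_dir[src_name]
        # (step 1 takes the directory leases and step 8 releases them)

    def roll_back(src_dir, src_name, trg_dir, trg_name):
        """Recovery when the initiating host failed before the no-return stage."""
        if src_dir[src_name][0] == "RenameFrom":       # never reached step 5
            src_dir[src_name] = ("Normal", src_dir[src_name][-1])
            trg_dir.pop(trg_name, None)                # erase the half-baked target

    src, trg = {"a.txt": ("Normal", (2, 1001))}, {}
    rename(src, "a.txt", trg, "b.txt")                 # uninterrupted rename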

5. Handling Failures

In this section, we describe several failure scenarios which can occur during file system operation. The zFS error-recovery procedures sketched here enable the system to recover and return to a normal operating state.

If an FE fails when it is connected to some FMGR opening file X, then the FE's leases will expire after a timeout and the FMGR will be able to recover control of the portions of the file held by the failed FE. The same goes for any directories the FE held open.

Failure of an FMGR managing file X is detected by all the FEs that held X open, as well as by the LMGR that manages the OSD on which X is located. To ensure that in such a case all dirty pages are saved on the OSD, zFS uses the following mechanism. zFS limits the number of dirty pages in each host's Cache to N, where N is the number of pages which can safely be written to the OSD in time T. Assuming the lease renewal time is set to Renew_Time (Renew_Time >> T), if a lease renewal does not arrive within Renew_Time - T after the last renewal, then all dirty pages are written to the OSD.

Once the FE's range-leases expire and cannot be refreshed, all file blocks are discarded. Suppose X is mapped to some object X_obj on OSD obs_i. Since the FMGR does not renew its object lease on X_obj, the lease is automatically recovered after a timeout by LMGR_i. Clients instantiate a new FMGR, and once the lease for X_obj expires, the new file manager takes over.

Failure of an LMGR_i is detected by the FMGRs that hold leases for objects on OSD_i. Upon failure to renew a lease, each FMGR informs all FEs that received leases on the file to flush all their data to disk and release the file. Subsequently, the client instantiates a new LMGR which attempts to take the OSD_i lease. Once the old lease expires, this is possible, and operations on OSD_i can continue.

An OSD failure is catastrophic, unless the OSD is replicated, or unless the file system is intelligent enough to reconnect to it when it comes back up.
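The dirty-page rule above can be expressed as a small check. Only the relations come from the text (N is the number of pages flushable within T, and flushing starts if no renewal arrives within Renew_Time - T); the concrete rates and durations below are illustrative assumptions.

    # Sketch of the dirty-page safety rule of Section 5 (numbers are assumptions).
    import time

    PAGES_PER_SECOND = 1000           # assumed sustained write rate to the OSD
    T = 2.0                           # seconds needed to flush a full cache
    RENEW_TIME = 30.0                 # lease renewal period; RENEW_TIME >> T
    N = int(PAGES_PER_SECOND * T)     # maximum number of dirty pages allowed

    def must_flush(dirty_pages, last_renewal, now=None):
        """Return True if the host must start writing dirty pages to the OSD."""
        now = time.time() if now is None else now
        too_many_dirty = dirty_pages >= N
        renewal_overdue = (now - last_renewal) >= (RENEW_TIME - T)
        return too_many_dirty or renewal_overdue

    must_flush(dirty_pages=500, last_renewal=time.time())   # False: within limits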

6. Comparison to other File Systems

There are many file systems to compare to, and we cannot include a complete list here. Therefore, we focus only on a few examples. We observed that none of these file systems use cooperative caching.

6.1. Coda

Coda [4] is a file system designed for disconnected operation. It separates the file system into servers that maintain FS volumes, and remote clients that connect to the system. The clients may work disconnected from the system, and later wish to reconnect and synchronize their local work with the FS state. To allow efficient disconnected operation, Coda relaxes consistency requirements. It optimistically assumes that clients work on their own separate home directories, and that there is very little sharing. Hence, a client may take its files, disconnect, make changes, and after a few hours reconnect and send its new file versions back to the servers. This works well up to a point. If there is file or directory sharing, relaxed consistency will result in inconsistency: each client will take a directory or file home, and make incompatible changes. To work around this, the file system has a mechanism to detect inconsistency upon client reconnection, and either merges changes together or asks the client to manually merge incompatible files.

In contrast, zFS does not support disconnected operation, and enforces strong cache consistency. The challenge is to make such a system scale and work efficiently.
6.2. InterMezzo

InterMezzo [5] is a recent rewrite of Coda so that its local operations will be on par with local file systems such as Ext2 [7].

In zFS, we would like to achieve good performance for local operations as well; i.e., to be as fast as a local file system when we are working against a locally cached file or directory.

6.3. Lustre

Lustre [2] is a SAN file system (we use the term "SAN file system" to denote a file system which uses a SAN as its block-device server) that is built out of three components connected together by a SAN: clients, a cluster control system, and storage targets.

The clients see a cluster file system with standard POSIX semantics.

Cluster control systems manage name-space and file-system meta-data coherence, security, cluster recovery, and coordinate storage management functions. Cluster control systems do not handle file data; they direct clients to perform file I/O directly and securely with the storage targets.

Storage targets store persistent data and participate in management functions. Targets are OSDs and are, furthermore, programmable, allowing the execution of downloaded modules.

Lustre is under construction and is an open-source project.

zFS does not assume that OSDs are programmable. However, it does assume a non-standard locking interface provided by the OSDs. Lustre uses consensus-style methods to achieve file-system coherency in caching and locking, whereas zFS uses OSD-based leases.

6.4. xFS and XFS

xFS [3] is a network file system which attempts to distribute all aspects of file operations over multiple machines connected by a network. The goal of xFS is similar to that of zFS: achieving high availability, performance and scalability. However, the xFS management distribution policy is static, while zFS uses dynamic distribution which is sensitive to network load.

XFS [13] is a local file system built by SGI that scales to large file systems, and is reputed to provide good scalability and performance for local and SMP systems. zFS is designed to achieve these goals, but in a more demanding distributed setting. We do not expect to achieve XFS's excellent local performance in the near future.

6.5. StorageTank

StorageTank [6], much like Lustre, is a SAN file system built in IBM, where there are clients, meta-data servers, and OSDs. StorageTank currently works with standard SCSI disks over a SAN. A new design is underway to enable it to use object stores. For a fair comparison, we assume a future system that works with OSDs.

StorageTank clients and OSDs are connected directly by a SAN, allowing the efficient movement of bulk data. Meta-data servers are connected to the rest of the system through a different, IP, network. Meta-data servers maintain all file-system meta-data state, and are further responsible for all coherency and management issues. Meta-data servers may be clustered for better scalability.

zFS does not use clustering technologies and does not rely on separate SAN/IP networks. The disadvantage is the added overhead of directory operations being distributed transactions instead of local operations on a meta-data server.

6.6. GPFS

GPFS [12] is a file system built by IBM for high-end super-computers. The largest installation we know of is comprised of 512 compute nodes and 1024 disks. The system can handle traditional workloads, with little sharing; however, it is designed to handle well large scientific applications which use large files with heavy write-sharing.

In contrast with zFS, GPFS is a fully developed file system, with years of experience and with some of the largest installations in existence. However, whereas GPFS uses standard consensus-style solutions to address failures, zFS attempts a different, OSD-based scheme. Furthermore, GPFS uses standard disks, whereas zFS uses OSDs. We expect comparable performance for low-to-medium sharing situations.

6.7. StorageNet

StorageNet is a file system under development in IBM Research. It shares some features with zFS: the use of object-based storage, and high scalability requirements. It uses no file servers; the file system is comprised solely of clients and object stores. While the file systems are similar in some respects, their goals are very different. StorageNet is focused on high scalability, WAN operation, and security, while zFS targets strong cache consistency and safe distributed transactions.

7. Summary

Building a file system from the components described above is expected to provide high performance and scalability due to the following features:

Separation of storage from file management: Caching and file management (path resolution) are done on a machine that is different from the one storing the data, the object disk (OSD). Dynamic distribution of file and directory management across multiple machines is done when files and directories are opened. This offers superior performance and scalability compared to traditional server-based file systems. For low-sharing scenarios, each file manager will be located on the machine using that file, which provides good locality. Because multiple machines can read and write to the disks directly, the traditional centralized file-server bottleneck is removed. File system recovery can occur automatically; whenever a directory transaction fails, the next client to access the directory will fix it.

Cooperative caching: The memories of the machines running the cooperative cache process are treated as one global cooperative cache. Clients are able to access blocks cached by other clients, thereby reducing the load on the OSDs and reducing the cost of local cache misses.

Lack of dedicated machines: Any machine in the system, including ones that run user applications, can run a file manager and a lease manager. Hence, machines can automatically get exclusive access to files and directories when they are the sole users. Furthermore, any machine in the system can assume the responsibilities of a failed component. This allows online recovery from directory corruption caused by failed transactions. The lease mechanism employed in zFS ensures that, in the event of failure, zFS will operate correctly. Thus, in zFS, there is no centralized manager and no centralized server that would be a single point of failure.

Use of Object Disks: The use of object disks greatly enhances the separation between management and storage activities. It relieves the file system from handling the meta-data chores of allocating, removing and keeping track of disk blocks on the physical disk. Assuming some discovery support will be added to OSDs, similar to what SCSI provides today, zFS clients will be able to discover online the addition of OSDs. Using load statistics, available from the OSD interface, will allow intelligent determination of file placement. All hard-state is stored on disk; hence, the rest of the file system can fail all at once without corrupting the file system layout on disk.

Acknowledgements

We wish to thank Alain Azagury, Ealan Henis, Julian Satran, Kalman Meth, Liran Schour, Michael Factor, Zvi Dubitzky and Uri Schonfeld from IBM Research for their participation in useful discussions on zFS.

Thanks are due also to the DSF [10] team, Zvi Dubitzky, Israel Gold, Ealan Henis, Julian Satran, and Dafna Sheinwald, on whose work zFS is based.

References

[1] SCSI Object-Based Storage Device Commands (OSD). Working Draft, Revision 6. See ftp://ftp.t10.org/t10/drafts/osd/osd-r06.pdf.
[2] Lustre technical project summary. Technical report, Cluster File Systems, Intel Labs, June 2001. www.clusterfs.com.
[3] T. Anderson, M. Dahlin, J. Neefe, D. Patterson, D. Roselli, and R. Wang. Serverless network file systems. ACM Transactions on Computer Systems, February 1996.
[4] P. J. Braam. The Coda distributed file system. June 1998.
[5] P. J. Braam, M. Callahan, and P. Schwan. The InterMezzo file system. August 1999.
[6] R. Burns. Data Management in a Distributed File System for Storage Area Networks. PhD thesis, Department of Computer Science, University of California, Santa Cruz, 2000.
[7] R. Card, T. Ts'o, and S. Tweedie. The design and implementation of the second extended filesystem. In Dutch International Symposium on Linux, December 1994.
[8] M. D. Dahlin, R. Y. Wang, T. E. Anderson, and D. A. Patterson. Cooperative caching: Using remote client memory to improve file system performance. In OSDI, 1994.
[9] V. Dreizin, N. Rinetzky, A. Tavory, and E. Yerushalmi. The Antara object-disk design. Technical report, IBM Labs in Israel, Haifa University, Mount Carmel, 2001.
[10] Z. Dubitzky, I. Gold, E. Henis, J. Satran, and D. Scheinwald. DSF - Data Sharing Facility. Technical report, IBM Labs in Israel, Haifa University, Mount Carmel, 2000. See also http://www.haifa.il.ibm.com/projects/systems/dsf.html.
[11] E. Ladan, O. Rodeh, and D. Tuitou. Lock Free File System (LFFS). Technical report, IBM Labs in Israel, Haifa University, Mount Carmel, 2002.
[12] F. Schmuck and R. Haskin. GPFS: A shared-disk file system for large computing clusters. January 2002.
[13] A. Sweeney, D. Doucette, W. Hu, C. Anderson, M. Nishimoto, and G. Peck. Scalability in the XFS file system. In Proceedings of the USENIX 1996 Technical Conference, pages 1-14, San Diego, CA, USA, 1996.
[14] G. M. Voelker, E. J. Anderson, T. Kimbrel, M. J. Feeley, J. S. Chase, A. R. Karlin, and H. M. Levy. Implementing cooperative prefetching and caching in a globally-managed memory system. Technical report, Department of Computer Science and Engineering, University of Washington.