The Google File System

Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
Google∗

ABSTRACT

We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.

While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points.

The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients.

In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.

Categories and Subject Descriptors
D[4]: 3—Distributed file systems

General Terms
Design, reliability, performance, measurement

Keywords
Fault tolerance, scalability, data storage, clustered storage

∗The authors can be reached at the following addresses: {sanjay,hgobioff,shuntak}@google.com.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SOSP’03, October 19–22, 2003, Bolton Landing, New York, USA.
Copyright 2003 ACM 1-58113-757-5/03/0010 ...$5.00.

1. INTRODUCTION

We have designed and implemented the Google File System (GFS) to meet the rapidly growing demands of Google’s data processing needs. GFS shares many of the same goals as previous distributed file systems such as performance, scalability, reliability, and availability. However, its design has been driven by key observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system design assumptions. We have reexamined traditional choices and explored radically different points in the design space.

First, component failures are the norm rather than the exception. The file system consists of hundreds or even thousands of storage machines built from inexpensive commodity parts and is accessed by a comparable number of client machines. The quantity and quality of the components virtually guarantee that some are not functional at any given time and some will not recover from their current failures. We have seen problems caused by application bugs, operating system bugs, human errors, and the failures of disks, memory, connectors, networking, and power supplies. Therefore, constant monitoring, error detection, fault tolerance, and automatic recovery must be integral to the system.

Second, files are huge by traditional standards. Multi-GB files are common. Each file typically contains many application objects such as web documents. When we are regularly working with fast growing data sets of many TBs comprising billions of objects, it is unwieldy to manage billions of approximately KB-sized files even when the file system could support it. As a result, design assumptions and parameters such as I/O operation and block sizes have to be revisited.

Third, most files are mutated by appending new data rather than overwriting existing data. Random writes within a file are practically non-existent. Once written, the files are only read, and often only sequentially. A variety of data share these characteristics. Some may constitute large repositories that data analysis programs scan through. Some may be data streams continuously generated by running applications. Some may be archival data. Some may be intermediate results produced on one machine and processed on another, whether simultaneously or later in time. Given this access pattern on huge files, appending becomes the focus of performance optimization and atomicity guarantees, while caching data blocks in the client loses its appeal.

Fourth, co-designing the applications and the file system API benefits the overall system by increasing our flexibility. For example, we have relaxed GFS’s consistency model to vastly simplify the file system without imposing an onerous burden on the applications. We have also introduced an atomic append operation so that multiple clients can append concurrently to a file without extra synchronization between them. These will be discussed in more details later in the paper.

Multiple GFS clusters are currently deployed for different purposes. The largest ones have over 1000 storage nodes, over 300 TB of disk storage, and are heavily accessed by hundreds of clients on distinct machines on a continuous basis.

2. DESIGN OVERVIEW

2.1 Assumptions

In designing a file system for our needs, we have been guided by assumptions that offer both challenges and opportunities. We alluded to some key observations earlier and now lay out our assumptions in more details.

• The system is built from many inexpensive commodity components that often fail. It must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis.

• The system stores a modest number of large files. We expect a few million files, each typically 100 MB or larger in size. Multi-GB files are the common case and should be managed efficiently. Small files must be supported, but we need not optimize for them.

• The workloads primarily consist of two kinds of reads: large streaming reads and small random reads. In large streaming reads, individual operations typically read hundreds of KBs, more commonly 1 MB or more. Successive operations from the same client often read through a contiguous region of a file. A small random read typically reads a few KBs at some arbitrary offset. Performance-conscious applications often batch and sort their small reads to advance steadily through the file rather than go back and forth.

• The workloads also have many large, sequential writes that append data to files. Typical operation sizes are similar to those for reads. Once written, files are seldom modified again. Small writes at arbitrary positions in a file are supported but do not have to be efficient.

• The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file. Our files are often used as producer-consumer queues or for many-way merging. Hundreds of producers, running one per machine, will concur-

2.2 Interface

GFS provides a familiar file system interface, though it does not implement a standard API such as POSIX. Files are organized hierarchically in directories and identified by pathnames. We support the usual operations to create, delete, open, close, read, and write files.

Moreover, GFS has snapshot and record append operations. Snapshot creates a copy of a file or a directory tree at low cost. Record append allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual client’s append. It is useful for implementing multi-way merge results and producer-consumer queues that many clients can simultaneously append to without additional locking. We have found these types of files to be invaluable in building large distributed applications. Snapshot and record append are discussed further in Sections 3.4 and 3.3 respectively.

2.3 Architecture

A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients, as shown in Figure 1. Each of these is typically a commodity Linux machine running a user-level server process. It is easy to run both a chunkserver and a client on the same machine, as long as machine resources permit and the lower reliability caused by running possibly flaky application code is acceptable.

Files are divided into fixed-size chunks. Each chunk is identified by an immutable and globally unique 64 bit chunk handle assigned by the master at the time of chunk creation. Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and byte range. For reliability, each chunk is replicated on multiple chunkservers. By default, we store three replicas, though users can designate different replication levels for different regions of the file namespace.

The master maintains all file system metadata. This includes the namespace, access control information, the mapping from files to chunks, and the current locations of chunks. It also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers. The master periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state.

GFS client code linked into each application implements the file system API and communicates with the master and chunkservers to read or write data on behalf of the application. Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the chunkservers. We do not provide the POSIX API and therefore need not hook into the Linux vnode layer.

Neither the client nor the chunkserver caches file data. Client caches offer little benefit because most applications stream through huge files or have working sets too large to be cached.
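The division of labor described in Section 2.3 (the master holds metadata, chunkservers hold chunk data, and clients fetch data directly from chunkservers) can be illustrated with a small in-memory sketch. This is not GFS code: the class and function names, the tiny `CHUNK_SIZE`, the counter-based handle allocator, the round-robin replica placement, and the choice to read from the first replica are all assumptions made for illustration. The `load_file` helper is only a fixture that places bytes on replicas and does not model how GFS performs writes.

```python
import itertools

CHUNK_SIZE = 8     # bytes; deliberately tiny (an assumption) so the example is easy to trace
REPLICATION = 3    # default replica count mentioned in the text


class Chunkserver:
    """Holds chunk bytes keyed by chunk handle (stand-in for Linux files on local disk)."""

    def __init__(self, name):
        self.name = name
        self.chunks = {}  # chunk handle -> bytes

    def read(self, handle, start, length):
        # Serve a byte range within a single chunk.
        return self.chunks[handle][start:start + length]


class Master:
    """Holds metadata only; it never stores or serves file data."""

    def __init__(self, chunkservers):
        self.chunkservers = chunkservers
        self.file_chunks = {}   # pathname -> ordered list of chunk handles
        self.locations = {}     # chunk handle -> list of replica Chunkservers
        self._handles = itertools.count(1)   # hypothetical handle allocator
        self._cursor = itertools.count(0)    # hypothetical round-robin placement

    def create_chunk(self, path):
        handle = next(self._handles) % 2**64          # keep the handle within 64 bits
        first = next(self._cursor)
        n = len(self.chunkservers)
        replicas = [self.chunkservers[(first + i) % n] for i in range(REPLICATION)]
        self.file_chunks.setdefault(path, []).append(handle)
        self.locations[handle] = replicas
        return handle, replicas

    def lookup(self, path, chunk_index):
        handles = self.file_chunks.get(path, [])
        if chunk_index >= len(handles):
            return None                               # unknown file or past its last chunk
        handle = handles[chunk_index]
        return handle, self.locations[handle]


def load_file(master, path, data):
    """Fixture that places data directly on replicas; NOT the GFS write path."""
    for start in range(0, len(data), CHUNK_SIZE):
        handle, replicas = master.create_chunk(path)
        for server in replicas:
            server.chunks[handle] = data[start:start + CHUNK_SIZE]


def client_read(master, path, offset, length):
    """Read a byte range: metadata from the master, data from a chunkserver."""
    out = bytearray()
    while length > 0:
        chunk_index, within = divmod(offset, CHUNK_SIZE)
        found = master.lookup(path, chunk_index)      # metadata operation
        if found is None:
            break
        handle, replicas = found
        want = min(length, CHUNK_SIZE - within)
        piece = replicas[0].read(handle, within, want)  # data comes from a chunkserver
        if not piece:
            break                                     # short final chunk is exhausted
        out += piece
        offset += len(piece)
        length -= len(piece)
    return bytes(out)


if __name__ == "__main__":
    servers = [Chunkserver(f"cs{i}") for i in range(4)]
    master = Master(servers)
    load_file(master, "/logs/a", b"abcdefghijklmnopqrstuvwxyz")
    print(client_read(master, "/logs/a", 6, 5))       # spans two chunks: b'ghijk'
```

In this sketch the master answers only the question "which chunk handle and which replicas cover this chunk index", which keeps file bytes off the master entirely. The sketch assumes at least `REPLICATION` chunkservers so that the replicas of a chunk land on distinct machines.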
