
The Google File System

By Ghemawat, Gobioff, and Leung

Outline

- Overview
- Assumptions
- Design of GFS
- System Interactions
- Master Operations
- Fault Tolerance
- Measurements

Overview

- GFS: a scalable distributed file system for large distributed data-intensive applications

Assumptions

- Built from inexpensive commodity components that often fail
- The system stores a modest number of large files
  - A few million files, each 100 MB or larger; multi-GB files are common. Small files are supported but need not be optimized for.

Assumptions

- Typical workload: read pattern
  - Large streaming reads
    - Individual operations read hundreds of KBs, more commonly 1 MB or more.
    - Successive operations from the same client often read through a contiguous region of a file.
  - Small random reads
    - Read a few KB at some random offset.
    - Performance-conscious applications often batch their small reads.

Assumptions

- Typical workload: write pattern
  - Many large sequential writes that append data to files.
    - Once written, files are seldom modified again.
    - Small writes at arbitrary positions are supported but need not be efficient (they are rare in the typical workload).

Design of GFS

- Multiple standalone and independent "clusters"
- Each cluster includes a master and multiple chunkservers
- A cluster is accessed by multiple clients
- A chunkserver and a client can run on the same machine
- Files are divided into fixed-size chunks
  - Each chunk is 64 MB
- Each chunk is identified by a 64-bit chunk handle, assigned by the master at the time of chunk creation
  - The addressable range of a GFS cluster is therefore
    - 2^64 chunks × 64 MB = 1,125,899,906,842,624 TB (that is, 2^50 TB)
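
As a quick check of that capacity figure, here is a minimal arithmetic sketch; the constants come from the slides, the variable names are mine.

```python
# Sketch: verify the addressable-capacity figure quoted above.
# A 64-bit chunk handle gives 2**64 distinct chunks; each chunk is 64 MB.

CHUNK_SIZE = 64 * 2**20   # 64 MB in bytes
NUM_HANDLES = 2**64       # distinct 64-bit chunk handles

total_bytes = NUM_HANDLES * CHUNK_SIZE
total_tb = total_bytes // 2**40   # bytes -> TB

print(total_tb)   # 1125899906842624 TB, i.e. 2**50 TB
```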

Design of GFS

- Master
  - Maintains all metadata (sketched after this list)
    - Namespace, access control information, file-to-chunk mappings, and the locations of chunks
  - Handles chunk lease management, garbage collection of orphaned chunks, and chunk migration
  - Periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state
- Chunkservers
  - The repositories of chunks in GFS
  - Chunks are stored on each chunkserver's local disks as Linux files
  - Chunks are replicated (default replication factor of 3)
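
A minimal sketch of how the master's metadata described above might be laid out. The class and field names (MasterMetadata, file_to_chunks, chunk_locations) are hypothetical; the real master also persists the namespace and mappings through its operation log, while chunk locations are rebuilt from chunkserver reports.

```python
# Sketch of the three kinds of metadata the master keeps (names are hypothetical).
from dataclasses import dataclass, field
from typing import Dict, List

ChunkHandle = int  # 64-bit chunk handle


@dataclass
class MasterMetadata:
    # Namespace and access control: full path -> per-file attributes.
    namespace: Dict[str, dict] = field(default_factory=dict)
    # File -> ordered list of chunk handles (one per 64 MB chunk).
    file_to_chunks: Dict[str, List[ChunkHandle]] = field(default_factory=dict)
    # Chunk handle -> chunkservers currently holding a replica.
    # Not persisted: rebuilt from chunkserver reports at startup and via HeartBeats.
    chunk_locations: Dict[ChunkHandle, List[str]] = field(default_factory=dict)

    def chunk_for_offset(self, path: str, offset: int, chunk_size: int = 64 * 2**20):
        """Translate (file, byte offset) into (chunk handle, replica locations)."""
        handle = self.file_to_chunks[path][offset // chunk_size]
        return handle, self.chunk_locations.get(handle, [])
```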

Design of GFS

- GFS client code is linked into each application
- Clients do not cache file data
  - Most applications stream through huge files or have working sets too large to be cached
  - No need to worry about cache-coherence issues
- Chunkservers do not cache file data
  - Most applications stream through huge files or have working sets too large to be cached
  - Chunkservers can still benefit from the Linux buffer cache

Design of GFS

- Fault-tolerance measures
  - (Dynamic) replication
  - Heartbeat messages
  - Logging
  - Checkpointing / recovery

Design of GFS

- Consistency model
  - File regions can be "consistent" or "defined"
    - Consistent: all clients will always see the same data, regardless of which replicas they read from.
    - Defined: consistent, AND clients see what a mutation writes in its entirety.
  - Guarantees provided by GFS
    - File namespace mutations (e.g., file creation) are atomic; they are handled exclusively by the master.
    - Successful mutation without interference → defined
    - Concurrent successful mutations → undefined but consistent
    - Failed mutation → inconsistent (and undefined)

Design of GFS

- Implications of the consistency model for applications
  - Applications need to distinguish between defined and undefined regions (see the sketch below)
    - Rely on appends rather than overwrites => avoid undefined regions
    - Use application-level checkpointing
    - Write self-validating, self-identifying records
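
A sketch of the last recommendation: self-validating, self-identifying records that a reader can use to skip padding and drop duplicates left by retried appends. The record format here (8-byte id, 4-byte length, 4-byte CRC32) is purely illustrative, not a format the paper prescribes.

```python
# Sketch: self-validating, self-identifying records for append-only GFS files.
# Hypothetical format: 8-byte record id | 4-byte payload length | 4-byte CRC32 | payload.
import struct
import zlib

HEADER = struct.Struct(">QII")  # record id, payload length, crc32


def encode_record(record_id: int, payload: bytes) -> bytes:
    return HEADER.pack(record_id, len(payload), zlib.crc32(payload)) + payload


def decode_records(blob: bytes):
    """Yield valid records, skipping padding/garbage; duplicate ids are dropped."""
    seen = set()
    pos = 0
    while pos + HEADER.size <= len(blob):
        rid, length, crc = HEADER.unpack_from(blob, pos)
        payload = blob[pos + HEADER.size : pos + HEADER.size + length]
        if rid != 0 and len(payload) == length and zlib.crc32(payload) == crc:
            if rid not in seen:          # a retried append carries the same id
                seen.add(rid)
                yield rid, payload
            pos += HEADER.size + length
        else:
            pos += 1                     # resynchronize after padding or a bad region
    # A real reader would use a stronger sync marker than byte-by-byte scanning.
```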

System Interactions

- Interactions include
  - File reads, file mutations, and snapshots
- Must preserve the consistency guarantees
- Must sustain high throughput
  - Minimize the involvement of the master in the interactions
  - Delegate operations onto chunkservers/clients

System Interactions

- Read access

System Interactions

- File mutations: write / record append / snapshot
  - Write: causes data to be written at an application-specified file offset.
  - Record append: causes data to be appended atomically at least once, even in the presence of concurrent mutations.
  - Snapshot: makes a copy of a file or directory tree.
  - All of these must preserve the consistency model.

System Interactions

- Each mutation is performed at all of the chunk's replicas
- Leases are used to maintain a consistent mutation order across replicas
  - The master grants a chunk lease to one of the replicas, the primary.
  - The primary picks a serial order for all mutations to the chunk; all replicas follow this order when applying mutations.
  - Thus the global mutation order is defined first by the lease grant order chosen by the master, and then, within a lease, by the serial numbers assigned by the primary.

System Interactions

- Leases
  - Delegate authority for data mutations to the primary replica.
  - Minimize management overhead at the master.
  - Lease expiration time: 60 seconds; leases can be extended (a minimal bookkeeping sketch follows).
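
A minimal sketch of the master-side lease bookkeeping implied by these bullets; the 60-second term comes from the slide, while the class and method names are hypothetical.

```python
# Sketch: master-side chunk lease bookkeeping (names are hypothetical).
import time

LEASE_TERM = 60.0  # seconds, per the slide


class LeaseTable:
    def __init__(self):
        # chunk handle -> (primary chunkserver, lease expiration time)
        self._leases = {}

    def grant_or_get(self, chunk: int, replicas: list) -> str:
        """Return the current primary for the chunk, granting a lease if needed."""
        primary, expires = self._leases.get(chunk, (None, 0.0))
        if primary is None or time.monotonic() >= expires or primary not in replicas:
            primary = replicas[0]  # the real master picks a replica more carefully
            self._leases[chunk] = (primary, time.monotonic() + LEASE_TERM)
        return primary

    def extend(self, chunk: int, primary: str) -> bool:
        """Extend the lease (e.g. piggybacked on a HeartBeat) while mutations continue."""
        current, _ = self._leases.get(chunk, (None, 0.0))
        if current != primary:
            return False
        self._leases[chunk] = (primary, time.monotonic() + LEASE_TERM)
        return True

    def revoke(self, chunk: int) -> None:
        """Used e.g. before a snapshot, so new writes must come back to the master."""
        self._leases.pop(chunk, None)
```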

System Interactions

- Write access
  1. The client asks the master which chunkserver holds the current lease for the chunk and where the other replicas are. If no one holds a lease, the master grants one to a replica it chooses.
  2. The master replies with the identity of the primary and the locations of the secondary replicas. The client caches this information for future mutations; it contacts the master again only when the primary becomes unreachable or no longer holds a lease.
  3. The client pushes the data to all replicas, in any order. Each chunkserver stores the data in an internal LRU buffer cache until the data is used or aged out.
  4. Once all replicas have received the data, the client sends a write request to the primary. The primary assigns consecutive serial numbers to the mutations it receives, possibly from multiple clients, which provides the necessary serialization. It applies the mutation to its own local state in serial-number order.
  5. The primary forwards the write request to all secondary replicas. Each secondary applies mutations in the same serial-number order assigned by the primary.
  6. The secondaries all reply to the primary, indicating that they have finished the operation.
  7. The primary replies to the client, reporting either
     - success of the operation, or
     - failure of the operation.

(A client-side sketch of these steps follows.)
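
The client side of the steps above might look roughly like the sketch below. All of the RPC stubs (find_lease_holder, push_data, apply_write) are hypothetical placeholders for the real GFS RPCs.

```python
# Sketch of the client-side write flow, steps 1 through 7 (all RPC stubs are hypothetical).

class GFSWriteClient:
    def __init__(self, master):
        self.master = master
        self.lease_cache = {}  # chunk handle -> (primary, secondaries)

    def write_chunk(self, chunk_handle, offset_in_chunk, data):
        # Steps 1-2: ask the master for the primary and secondaries, and cache them.
        if chunk_handle not in self.lease_cache:
            self.lease_cache[chunk_handle] = self.master.find_lease_holder(chunk_handle)
        primary, secondaries = self.lease_cache[chunk_handle]

        # Step 3: push the data to every replica, in any order; each chunkserver
        # buffers it (LRU) until it is used or aged out.
        data_id = (chunk_handle, hash(data))
        for replica in [primary] + list(secondaries):
            replica.push_data(data_id, data)

        # Step 4: send the write to the primary, which assigns a serial number and
        # applies the mutation locally in serial-number order.
        # Steps 5-6 happen between the primary and the secondaries.
        ok = primary.apply_write(chunk_handle, offset_in_chunk, data_id, secondaries)

        # Step 7: the primary's reply tells us whether the mutation succeeded.
        if not ok:
            self.lease_cache.pop(chunk_handle, None)  # lease may have moved; re-ask master
        return ok
```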

System Interactions

- Write access
  - If a write by the application is large or straddles a chunk boundary, GFS client code breaks it down into multiple write operations (see the splitting sketch below).
    - Concurrent writes on the same chunk can result in "undefined" (but consistent) regions.
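
A sketch of how such a split could be computed, assuming the 64 MB chunk size given earlier; the helper name is mine.

```python
# Sketch: split a large file write into per-chunk writes at 64 MB boundaries.
CHUNK_SIZE = 64 * 2**20


def split_write(file_offset: int, length: int):
    """Yield (chunk index, offset within chunk, length) pieces for one application write."""
    pos = file_offset
    end = file_offset + length
    while pos < end:
        chunk_index = pos // CHUNK_SIZE
        offset_in_chunk = pos % CHUNK_SIZE
        piece = min(end - pos, CHUNK_SIZE - offset_in_chunk)
        yield chunk_index, offset_in_chunk, piece
        pos += piece


# Example: a 100 MB write starting 10 MB before a chunk boundary becomes 3 pieces,
# and the pieces may interleave with other clients' writes -> possibly undefined regions.
print(list(split_write(54 * 2**20, 100 * 2**20)))
```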

System Interactions

- Record append
  - An atomic operation that appends data to the end of a file
    - The appended data is guaranteed to be "defined" and "consistent".
    - GFS appends the data to the file atomically, at least once, at an offset of GFS's choosing, and returns that offset to the client.
    - The control flow is similar to that of a write, with some small changes.
  - The size of the appended data is limited to 16 MB (1/4 of the maximum chunk size).

System Interactions

- Record append (steps)
  - Follows steps similar to those of write access (see the primary-side sketch after this list):
    - The client pushes the data to all replicas.
    - The client sends the append request to the primary replica.
    - The primary checks whether the appended data would straddle the boundary of the last chunk. (Note: the appended data is at most 16 MB.) If so, it:
      - pads the chunk to the maximum size (on all replicas), and
      - replies to the client indicating that the operation should be retried on the next chunk.
      - This is an uncommon case.

System Interactions

- Record append (steps, continued)
  - If the data fits within the chunk's boundary:
    - The primary appends the data,
    - tells the secondaries to append the data at the same offset, and
    - replies success to the client.
    - This is the common case!
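
Putting the two cases together, the primary's decision during a record append might look like the following sketch. The chunk and replica methods (pad_to, append_at) are hypothetical stand-ins for the real chunkserver operations.

```python
# Sketch of the primary's decision in a record append (method names are hypothetical).
MAX_CHUNK_SIZE = 64 * 2**20
MAX_APPEND_SIZE = 16 * 2**20  # 1/4 of the maximum chunk size, per the slide


def handle_record_append(chunk, data, secondaries):
    assert len(data) <= MAX_APPEND_SIZE

    if chunk.size + len(data) > MAX_CHUNK_SIZE:
        # Uncommon case: the record would straddle the chunk boundary.
        # Pad this chunk to its maximum size on all replicas and tell the
        # client to retry the append on the next chunk.
        chunk.pad_to(MAX_CHUNK_SIZE)
        for s in secondaries:
            s.pad_to(chunk.handle, MAX_CHUNK_SIZE)
        return {"status": "retry_on_next_chunk"}

    # Common case: append at the offset the primary chooses, and have every
    # secondary append the same data at the same offset.
    offset = chunk.size
    chunk.append_at(offset, data)
    for s in secondaries:
        s.append_at(chunk.handle, offset, data)
    return {"status": "ok", "offset": offset}
```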

System Interactions

- Snapshots
  - Make a copy of a file or directory tree (the "source") very quickly, while minimizing any interruption of ongoing mutations.
  - Users use snapshots to quickly create branch copies of huge data sets, or to checkpoint the current state before experimenting with changes that can later be committed or rolled back easily.
  - GFS adopts the copy-on-write technique for high efficiency and low overhead.

System Interactions

- When the master receives a snapshot request, it first revokes any outstanding leases on the chunks in the files it is about to snapshot. Thus, any subsequent writes to these chunks will require an interaction with the master to find the lease holder, giving the master an opportunity to create a new copy of the chunk first.
- After the leases have been revoked or expired, the master logs the operation to disk. It then applies the log record to its in-memory state by duplicating the metadata for the source file or directory tree. The newly created snapshot files point to the same chunks as the source files.
  - The reference counts of these chunks are increased by one.
- The first time a client wants to write to a chunk C after the snapshot operation, it sends a request to the master to find the current lease holder.
  - The master notices that the reference count for chunk C is greater than one.
  - It defers replying to the client request and instead picks a new chunk handle C'.
  - It then asks each chunkserver that holds C to create a new chunk called C'.
  - Request handling then proceeds as normal.

(A master-side sketch of this copy-on-write path follows.)
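
All of the master fields and helpers used in this sketch (chunks_under, files_under, refcount, new_chunk_handle, copy_chunk, leases) are hypothetical names for the bookkeeping the slide describes.

```python
# Sketch of the master-side snapshot + copy-on-write path (names are hypothetical).

def snapshot(master, source_path, snapshot_path):
    # 1. Revoke outstanding leases so future writes must come back to the master.
    for handle in master.chunks_under(source_path):
        master.leases.revoke(handle)
    # 2. Log the operation, then duplicate the metadata in memory.
    master.log("snapshot", source_path, snapshot_path)
    for file_path, chunks in master.files_under(source_path):
        new_path = file_path.replace(source_path, snapshot_path, 1)
        master.file_to_chunks[new_path] = list(chunks)   # point at the same chunks
        for handle in chunks:
            master.refcount[handle] += 1                 # shared until first write


def find_lease_holder(master, handle):
    # First write to a shared chunk C: defer, copy C to a fresh handle C', then proceed.
    if master.refcount[handle] > 1:
        new_handle = master.new_chunk_handle()
        for server in master.chunk_locations[handle]:
            server.copy_chunk(handle, new_handle)        # local copy, no network transfer
        master.refcount[handle] -= 1
        master.refcount[new_handle] = 1
        master.chunk_locations[new_handle] = list(master.chunk_locations[handle])
        # (the file's chunk list would also be updated to point at new_handle)
        handle = new_handle
    return handle, master.leases.grant_or_get(handle, master.chunk_locations[handle])
```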

Master Operations

- Maintains the mapping between file names and chunks
- Handles the mutation leases
- Namespace management and locking
- Replica placement
  - Chunk creation, re-replication, rebalancing
    - Based on space utilization, "recent" creations, and spreading replicas across racks
- Garbage collection
  - Lazy garbage collection, for simplicity; runs as a background activity
- Stale replica deletion
  - Using chunk version numbers

Master Operations

- Namespace management
  - File creation / deletion
- Locks on the directories / files are used to properly serialize various master operations (see the locking sketch after this list)
  - A snapshot of /home/user requires a read-lock on /home and a write-lock on /home/user (L1).
  - A file creation of /home/user/foo.dat requires a read-lock on /home/user (L2) and a write-lock on /home/user/foo.dat.
  - L1 (the write-lock on /home/user) and L2 (the read-lock on /home/user) conflict with each other, so these two operations will be serialized.
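
A simplified sketch of the locking discipline: take locks on every ancestor directory name and on the full pathname itself. For brevity this sketch uses plain mutexes where GFS uses read-write locks, and all names are mine.

```python
# Sketch: lock ancestor directory names plus the target path to serialize
# namespace operations on the master (hypothetical, simplified API).
from contextlib import ExitStack
from threading import Lock


class PathLockTable:
    """One lock per full pathname (simplified: plain mutexes, not read-write locks)."""
    def __init__(self):
        self._locks = {}

    def lock_for(self, path: str) -> Lock:
        return self._locks.setdefault(path, Lock())


def ancestors(path: str):
    parts = path.strip("/").split("/")
    return ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]


def acquire_namespace_locks(table: PathLockTable, path: str, stack: ExitStack):
    # Locks on /d1, /d1/d2, ... (read locks in real GFS) and on the full path
    # itself (a write lock in real GFS).
    for ancestor in ancestors(path):
        stack.enter_context(table.lock_for(ancestor))
    stack.enter_context(table.lock_for(path))


# Example: a snapshot of /home/user and a creation of /home/user/foo.dat both need
# a lock on /home/user (write vs. read), so the two operations serialize.
table = PathLockTable()
with ExitStack() as stack:
    acquire_namespace_locks(table, "/home/user/foo.dat", stack)
    print(ancestors("/home/user/foo.dat"))  # ['/home', '/home/user']
```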

Master Operations

- Replica placement
  - Chunk creation
  - Re-replication
    - When a replica is corrupted or its chunkserver fails.
    - To create more replicas for hot chunks (chunks that are accessed very often).
  - Rebalancing
    - Equalize space utilization.
    - Balance the load on each chunkserver.
    - Spread replicas across racks for higher reliability.

Master Operations

- Garbage collection
  - Releases the space occupied by deleted files (a deletion sketch follows this list).
    - File deletion in GFS is just a renaming of the file name to a "hidden" name.
  - Removes orphaned chunks.
    - Orphaned chunks are chunks not referenced by any file.
    - They result from failures in the chunk creation/deletion process.
  - Removes stale replicas (described on the next slide).
  - Garbage collection is performed when the master is not too busy.
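
A minimal sketch of lazy deletion as described above: deletion renames the file to a hidden name carrying a timestamp, and a background sweep reclaims it later. The hidden-name prefix and the three-day grace period are illustrative defaults, not values taken from these slides.

```python
# Sketch: lazy deletion as rename-to-hidden plus a later background sweep
# (names, the hidden prefix, and the grace period are hypothetical).
import time

HIDDEN_PREFIX = ".deleted."
GRACE_PERIOD = 3 * 24 * 3600  # e.g. three days before the space is reclaimed


def delete_file(namespace: dict, path: str) -> None:
    """'Delete' = rename to a hidden name stamped with the deletion time."""
    meta = namespace.pop(path)
    namespace[f"{HIDDEN_PREFIX}{time.time():.0f}.{path}"] = meta


def background_sweep(namespace: dict, now: float) -> list:
    """Run while the master is not busy: drop hidden files older than the grace period."""
    reclaimed = []
    for name in list(namespace):
        if name.startswith(HIDDEN_PREFIX):
            stamp = float(name[len(HIDDEN_PREFIX):].split(".", 1)[0])
            if now - stamp > GRACE_PERIOD:
                reclaimed.append(namespace.pop(name))  # chunks become orphaned, GC'd later
    return reclaimed
```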

Master Operations

- Stale replica deletion
  - Chunk replicas may become stale if a chunkserver fails and misses mutations to the chunk while it is down.
  - Every time the master grants a new lease on a chunk, it increases the chunk's version number.
    - The new version number is passed to the replicas and the client.
    - The master keeps a copy of the chunk version number in its metadata.
    - Both the master and the chunkservers save the version number in their persistent state before the client is notified about the lease.
  - When a version-number mismatch is detected, the replica with the highest version number is used; stale replicas (those with lower version numbers) are garbage collected (a version-check sketch follows).
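
A small sketch of that version-number comparison; the data and server names are made up.

```python
# Sketch: how a version-number comparison identifies stale replicas (hypothetical data).

def classify_replicas(master_version: int, replica_versions: dict):
    """Return (up-to-date replicas, stale replicas to garbage-collect)."""
    latest = max(replica_versions.values(), default=master_version)
    # If a replica reports a version newer than the master's record (e.g. the master
    # failed right after granting a lease), the master adopts the higher version.
    current = max(master_version, latest)
    fresh = [s for s, v in replica_versions.items() if v == current]
    stale = [s for s, v in replica_versions.items() if v < current]
    return fresh, stale


# Example: chunkserver "cs3" was down during a mutation and missed the version bump.
print(classify_replicas(7, {"cs1": 7, "cs2": 7, "cs3": 6}))
# -> (['cs1', 'cs2'], ['cs3'])
```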

Fault Tolerance

- Fast recovery
  - The master uses operation logs and checkpoints.
    - Fast recovery from the nearest checkpoint by reapplying the changes recorded in the log.
  - When a chunkserver boots up (possibly after a crash), it reports the chunks it holds to the master. Usually this takes just seconds.
- Chunk replication
- Master replication
  - Operation logs and checkpoints are replicated on multiple machines.
  - A mutation to the state is considered committed only after its log record has been flushed to disk locally and on all master replicas.
  - There is only one active master server.
    - However, there can be multiple read-only "shadow" master replicas.
  - An external monitoring infrastructure starts a new active master, based on the replicated logs and checkpoints, in case of master failure.

Fault Tolerance

- Data integrity
  - Checksums
    - Each 64 KB block of a chunk is protected by a 32-bit checksum.
  - Checksum verification is done every time a chunk is read (see the sketch below).
  - During idle periods, chunkservers scan through the chunks and verify the checksums.
- Diagnosis tools and analysis of logs
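
A sketch of per-block checksum verification on read, using 64 KB blocks and CRC32 as a stand-in for whatever 32-bit checksum GFS actually uses.

```python
# Sketch: per-64KB-block checksum verification on a chunk read.
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB blocks, each protected by a 32-bit checksum


def block_checksums(chunk_data: bytes):
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]


def verify_range(chunk_data: bytes, checksums, offset: int, length: int) -> bool:
    """Verify only the blocks that overlap the requested read range."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    for b in range(first, last + 1):
        block = chunk_data[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != checksums[b]:
            return False  # report corruption to the master; read from another replica
    return True


data = bytes(200 * 1024)            # a 200 KB chunk fragment of zeros
sums = block_checksums(data)
print(verify_range(data, sums, offset=70 * 1024, length=10 * 1024))  # True
```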

Measurements

Cluster A: used for research and development by over a hundred engineers.
Cluster B: used for production data processing.

Note: the average metadata size at the master or at a chunkserver is 50 to 100 MB.

Measurements

Cluster A sustained a read rate of 580 MB/s for a week, while the network configuration can support 750 MB/s; resources are being used quite efficiently.

Measurements

Cluster X: for R&D. Cluster Y: for production.

Reviews and Conclusions

- Special aspects of GFS
  - No caching
    - Due to Google's workload pattern
  - Centralized (single) master
    - Good for Google's file pattern (a modest number of gigabyte- to terabyte-sized files)
    - Bad for zillions of (small) files
    - Simpler and more efficient file metadata management
  - Relaxed consistency model
    - Simpler design and implementation
    - Potentially more reliable
    - Supports higher throughput

Reviews and Conclusions

- Special aspects of GFS
  - Incorporates multiple fault-tolerance measures
    - Each chunk is protected through checksums and replication.
    - The master server is protected through logging/checkpointing/recovery and replication.
    - Dynamic replica reallocation.
    - Replication cost ($$$) is a non-issue for Google.
  - Does not use majority voting as a fault-tolerance mechanism
    - It needs to maintain high throughput.
    - Corruption of the metadata at the master can be a problem.
    - Detected through diagnosis tools and analysis of logs.

Measurements

Micro-benchmarks on a GFS cluster consisting of one master, two master replicas, 16 chunkservers, and 16 clients.