
The Google File System

By Ghemawat, Gobioff, and Leung

Outline

- Overview
- Assumptions
- Design of GFS
- System Interactions
- Master Operations
- Fault Tolerance
- Measurements

Overview

- GFS: a scalable distributed file system for large distributed data-intensive applications

Assumptions

- Built from inexpensive commodity components that often fail
- The system stores a modest number of large files
  - A few million files, each 100 MB or larger; multi-GB files are common. Small files are supported but need not be optimized for.

Assumptions

- Typical workload: read pattern
  - Large streaming reads
    - Individual operations read hundreds of KBs, more commonly 1 MB or more.
    - Successive operations from the same client often read through a contiguous region of a file.
  - Small random reads
    - Read a few KB at some random offset.
    - Performance-conscious applications often batch their small reads.

Assumptions

- Typical workload: write pattern
  - Many large sequential writes that append data to files.
    - Once written, files are seldom modified again.
    - Small writes at arbitrary positions are supported but need not be efficient (they are rare in the typical workload).

Design of GFS

- Multiple standalone and independent "clusters"
- Each cluster includes a master and multiple chunkservers
- A cluster is accessed by multiple clients
- A chunkserver and a client can run on the same machine
- Files are divided into fixed-size chunks
  - Each chunk is 64 MB
- Each chunk is identified by a 64-bit chunk handle, assigned by the master at the time of chunk creation
  - The addressable range of a GFS cluster is therefore
    - 2^64 chunks × 64 MB = 1,125,899,906,842,624 TB (that is, 2^50 TB)
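
As a quick check of that capacity figure, here is a minimal arithmetic sketch; the constants come from the slides, the variable names are mine.

```python
# Sketch: verify the addressable-capacity figure quoted above.
# A 64-bit chunk handle gives 2**64 distinct chunks; each chunk is 64 MB.

CHUNK_SIZE = 64 * 2**20   # 64 MB in bytes
NUM_HANDLES = 2**64       # distinct 64-bit chunk handles

total_bytes = NUM_HANDLES * CHUNK_SIZE
total_tb = total_bytes // 2**40   # bytes -> TB

print(total_tb)   # 1125899906842624 TB, i.e. 2**50 TB
```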

Design of GFS

- Master
  - Maintains all metadata (sketched after this list)
    - Namespace, access control information, file-to-chunk mappings, and the locations of chunks
  - Handles chunk lease management, garbage collection of orphaned chunks, and chunk migration
  - Periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state
- Chunkservers
  - The repositories of chunks in GFS
  - Chunks are stored on each chunkserver's local disks as Linux files
  - Chunks are replicated (default replication factor of 3)
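
A minimal sketch of how the master's metadata described above might be laid out. The class and field names (MasterMetadata, file_to_chunks, chunk_locations) are hypothetical; the real master also persists the namespace and mappings through its operation log, while chunk locations are rebuilt from chunkserver reports.

```python
# Sketch of the three kinds of metadata the master keeps (names are hypothetical).
from dataclasses import dataclass, field
from typing import Dict, List

ChunkHandle = int  # 64-bit chunk handle


@dataclass
class MasterMetadata:
    # Namespace and access control: full path -> per-file attributes.
    namespace: Dict[str, dict] = field(default_factory=dict)
    # File -> ordered list of chunk handles (one per 64 MB chunk).
    file_to_chunks: Dict[str, List[ChunkHandle]] = field(default_factory=dict)
    # Chunk handle -> chunkservers currently holding a replica.
    # Not persisted: rebuilt from chunkserver reports at startup and via HeartBeats.
    chunk_locations: Dict[ChunkHandle, List[str]] = field(default_factory=dict)

    def chunk_for_offset(self, path: str, offset: int, chunk_size: int = 64 * 2**20):
        """Translate (file, byte offset) into (chunk handle, replica locations)."""
        handle = self.file_to_chunks[path][offset // chunk_size]
        return handle, self.chunk_locations.get(handle, [])
```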

Design of GFS

- GFS client code is linked into each application
- Clients do not cache file data
  - Most applications stream through huge files or have working sets too large to be cached
  - No need to worry about cache-coherence issues
- Chunkservers do not cache file data
  - Most applications stream through huge files or have working sets too large to be cached
  - Chunkservers can still benefit from the Linux buffer cache

Design of GFS

- Fault-tolerance measures
  - (Dynamic) replication
  - Heartbeat messages
  - Logging
  - Checkpointing / recovery

Design of GFS

- Consistency model
  - File regions can be "consistent" or "defined"
    - Consistent: all clients will always see the same data, regardless of which replicas they read from.
    - Defined: consistent, AND clients see what a mutation writes in its entirety.
  - Guarantees provided by GFS
    - File namespace mutations (e.g., file creation) are atomic; they are handled exclusively by the master.
    - Successful mutation without interference → defined
    - Concurrent successful mutations → undefined but consistent
    - Failed mutation → inconsistent (and undefined)

Design of GFS

- Implications of the consistency model for applications
  - Applications need to distinguish between defined and undefined regions (see the sketch below)
    - Rely on appends rather than overwrites => avoid undefined regions
    - Use application-level checkpointing
    - Write self-validating, self-identifying records
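
A sketch of the last recommendation: self-validating, self-identifying records that a reader can use to skip padding and drop duplicates left by retried appends. The record format here (8-byte id, 4-byte length, 4-byte CRC32) is purely illustrative, not a format the paper prescribes.

```python
# Sketch: self-validating, self-identifying records for append-only GFS files.
# Hypothetical format: 8-byte record id | 4-byte payload length | 4-byte CRC32 | payload.
import struct
import zlib

HEADER = struct.Struct(">QII")  # record id, payload length, crc32


def encode_record(record_id: int, payload: bytes) -> bytes:
    return HEADER.pack(record_id, len(payload), zlib.crc32(payload)) + payload


def decode_records(blob: bytes):
    """Yield valid records, skipping padding/garbage; duplicate ids are dropped."""
    seen = set()
    pos = 0
    while pos + HEADER.size <= len(blob):
        rid, length, crc = HEADER.unpack_from(blob, pos)
        payload = blob[pos + HEADER.size : pos + HEADER.size + length]
        if rid != 0 and len(payload) == length and zlib.crc32(payload) == crc:
            if rid not in seen:          # a retried append carries the same id
                seen.add(rid)
                yield rid, payload
            pos += HEADER.size + length
        else:
            pos += 1                     # resynchronize after padding or a bad region
    # A real reader would use a stronger sync marker than byte-by-byte scanning.
```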

System Interactions

- Interactions include
  - File reads, file mutations, and snapshots
- Must preserve the consistency guarantees
- Must sustain high throughput
  - Minimize the involvement of the master in the interactions
  - Delegate operations onto chunkservers/clients

System Interactions

- Read access

System Interactions

- File mutations: write / record append / snapshot
  - Write: causes data to be written at an application-specified file offset.
  - Record append: causes data to be appended atomically at least once, even in the presence of concurrent mutations.
  - Snapshot: makes a copy of a file or directory tree.
  - All of these must preserve the consistency model.

System Interactions

- Each mutation is performed at all of the chunk's replicas
- Leases are used to maintain a consistent mutation order across replicas
  - The master grants a chunk lease to one of the replicas, the primary.
  - The primary picks a serial order for all mutations to the chunk; all replicas follow this order when applying mutations.
  - Thus the global mutation order is defined first by the lease grant order chosen by the master, and then, within a lease, by the serial numbers assigned by the primary.

System Interactions

- Leases
  - Delegate authority for data mutations to the primary replica.
  - Minimize management overhead at the master.
  - Lease expiration time: 60 seconds; leases can be extended (a minimal bookkeeping sketch follows).
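
A minimal sketch of the master-side lease bookkeeping implied by these bullets; the 60-second term comes from the slide, while the class and method names are hypothetical.

```python
# Sketch: master-side chunk lease bookkeeping (names are hypothetical).
import time

LEASE_TERM = 60.0  # seconds, per the slide


class LeaseTable:
    def __init__(self):
        # chunk handle -> (primary chunkserver, lease expiration time)
        self._leases = {}

    def grant_or_get(self, chunk: int, replicas: list) -> str:
        """Return the current primary for the chunk, granting a lease if needed."""
        primary, expires = self._leases.get(chunk, (None, 0.0))
        if primary is None or time.monotonic() >= expires or primary not in replicas:
            primary = replicas[0]  # the real master picks a replica more carefully
            self._leases[chunk] = (primary, time.monotonic() + LEASE_TERM)
        return primary

    def extend(self, chunk: int, primary: str) -> bool:
        """Extend the lease (e.g. piggybacked on a HeartBeat) while mutations continue."""
        current, _ = self._leases.get(chunk, (None, 0.0))
        if current != primary:
            return False
        self._leases[chunk] = (primary, time.monotonic() + LEASE_TERM)
        return True

    def revoke(self, chunk: int) -> None:
        """Used e.g. before a snapshot, so new writes must come back to the master."""
        self._leases.pop(chunk, None)
```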

System Interactions

- Write access
  1. The client asks the master which chunkserver holds the current lease for the chunk and where the other replicas are. If no one holds a lease, the master grants one to a replica it chooses.
  2. The master replies with the identity of the primary and the locations of the secondary replicas. The client caches this information for future mutations; it contacts the master again only when the primary becomes unreachable or no longer holds a lease.
  3. The client pushes the data to all replicas, in any order. Each chunkserver stores the data in an internal LRU buffer cache until the data is used or aged out.
  4. Once all replicas have received the data, the client sends a write request to the primary. The primary assigns consecutive serial numbers to the mutations it receives, possibly from multiple clients, which provides the necessary serialization. It applies the mutation to its own local state in serial-number order.
  5. The primary forwards the write request to all secondary replicas. Each secondary applies mutations in the same serial-number order assigned by the primary.
  6. The secondaries all reply to the primary, indicating that they have finished the operation.
  7. The primary replies to the client, reporting either
     - success of the operation, or
     - failure of the operation.

(A client-side sketch of these steps follows.)
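
The client side of the steps above might look roughly like the sketch below. All of the RPC stubs (find_lease_holder, push_data, apply_write) are hypothetical placeholders for the real GFS RPCs.

```python
# Sketch of the client-side write flow, steps 1 through 7 (all RPC stubs are hypothetical).

class GFSWriteClient:
    def __init__(self, master):
        self.master = master
        self.lease_cache = {}  # chunk handle -> (primary, secondaries)

    def write_chunk(self, chunk_handle, offset_in_chunk, data):
        # Steps 1-2: ask the master for the primary and secondaries, and cache them.
        if chunk_handle not in self.lease_cache:
            self.lease_cache[chunk_handle] = self.master.find_lease_holder(chunk_handle)
        primary, secondaries = self.lease_cache[chunk_handle]

        # Step 3: push the data to every replica, in any order; each chunkserver
        # buffers it (LRU) until it is used or aged out.
        data_id = (chunk_handle, hash(data))
        for replica in [primary] + list(secondaries):
            replica.push_data(data_id, data)

        # Step 4: send the write to the primary, which assigns a serial number and
        # applies the mutation locally in serial-number order.
        # Steps 5-6 happen between the primary and the secondaries.
        ok = primary.apply_write(chunk_handle, offset_in_chunk, data_id, secondaries)

        # Step 7: the primary's reply tells us whether the mutation succeeded.
        if not ok:
            self.lease_cache.pop(chunk_handle, None)  # lease may have moved; re-ask master
        return ok
```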

System Interactions

- Write access
  - If a write by the application is large or straddles a chunk boundary, GFS client code breaks it down into multiple write operations (see the splitting sketch below).
    - Concurrent writes on the same chunk can result in "undefined" (but consistent) regions.
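
A sketch of how such a split could be computed, assuming the 64 MB chunk size given earlier; the helper name is mine.

```python
# Sketch: split a large file write into per-chunk writes at 64 MB boundaries.
CHUNK_SIZE = 64 * 2**20


def split_write(file_offset: int, length: int):
    """Yield (chunk index, offset within chunk, length) pieces for one application write."""
    pos = file_offset
    end = file_offset + length
    while pos < end:
        chunk_index = pos // CHUNK_SIZE
        offset_in_chunk = pos % CHUNK_SIZE
        piece = min(end - pos, CHUNK_SIZE - offset_in_chunk)
        yield chunk_index, offset_in_chunk, piece
        pos += piece


# Example: a 100 MB write starting 10 MB before a chunk boundary becomes 3 pieces,
# and the pieces may interleave with other clients' writes -> possibly undefined regions.
print(list(split_write(54 * 2**20, 100 * 2**20)))
```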

System Interactions

- Record append
  - An atomic operation that appends data to the end of a file
    - The appended data is guaranteed to be "defined" and "consistent".
    - GFS appends the data to the file atomically, at least once, at an offset of GFS's choosing, and returns that offset to the client.
    - The control flow is similar to that of a write, with some small changes.
  - The size of the appended data is limited to 16 MB (1/4 of the maximum chunk size).

System Interactions

- Record append (steps)
  - Follows steps similar to those of write access (see the primary-side sketch after this list):
    - The client pushes the data to all replicas.
    - The client sends the append request to the primary replica.
    - The primary checks whether the appended data would straddle the boundary of the last chunk. (Note: the appended data is at most 16 MB.) If so, it:
      - pads the chunk to the maximum size (on all replicas), and
      - replies to the client indicating that the operation should be retried on the next chunk.
      - This is an uncommon case.

System Interactions

- Record append (steps, continued)
  - If the data fits within the chunk's boundary:
    - The primary appends the data,
    - tells the secondaries to append the data at the same offset, and
    - replies success to the client.
    - This is the common case!
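
Putting the two cases together, the primary's decision during a record append might look like the following sketch. The chunk and replica methods (pad_to, append_at) are hypothetical stand-ins for the real chunkserver operations.

```python
# Sketch of the primary's decision in a record append (method names are hypothetical).
MAX_CHUNK_SIZE = 64 * 2**20
MAX_APPEND_SIZE = 16 * 2**20  # 1/4 of the maximum chunk size, per the slide


def handle_record_append(chunk, data, secondaries):
    assert len(data) <= MAX_APPEND_SIZE

    if chunk.size + len(data) > MAX_CHUNK_SIZE:
        # Uncommon case: the record would straddle the chunk boundary.
        # Pad this chunk to its maximum size on all replicas and tell the
        # client to retry the append on the next chunk.
        chunk.pad_to(MAX_CHUNK_SIZE)
        for s in secondaries:
            s.pad_to(chunk.handle, MAX_CHUNK_SIZE)
        return {"status": "retry_on_next_chunk"}

    # Common case: append at the offset the primary chooses, and have every
    # secondary append the same data at the same offset.
    offset = chunk.size
    chunk.append_at(offset, data)
    for s in secondaries:
        s.append_at(chunk.handle, offset, data)
    return {"status": "ok", "offset": offset}
```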

System Interactions

- Snapshots
  - Make a copy of a file or directory tree (the "source") very quickly, while minimizing any interruption of ongoing mutations.
  - Users use snapshots to quickly create branch copies of huge data sets, or to checkpoint the current state before experimenting with changes that can later be committed or rolled back easily.
  - GFS adopts the copy-on-write technique for high efficiency and low overhead.

System Interactions

- When the master receives a snapshot request, it first revokes any outstanding leases on the chunks in the files it is about to snapshot. Thus, any subsequent writes to these chunks will require an interaction with the master to find the lease holder, giving the master an opportunity to create a new copy of the chunk first.
- After the leases have been revoked or expired, the master logs the operation to disk. It then applies the log record to its in-memory state by duplicating the metadata for the source file or directory tree. The newly created snapshot files point to the same chunks as the source files.
  - The reference counts of these chunks are increased by one.
- The first time a client wants to write to a chunk C after the snapshot operation, it sends a request to the master to find the current lease holder.
  - The master notices that the reference count for chunk C is greater than one.
  - It defers replying to the client request and instead picks a new chunk handle C'.
  - It then asks each chunkserver that holds C to create a new chunk called C'.
  - Request handling then proceeds as normal.

(A master-side sketch of this copy-on-write path follows.)
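
All of the master fields and helpers used in this sketch (chunks_under, files_under, refcount, new_chunk_handle, copy_chunk, leases) are hypothetical names for the bookkeeping the slide describes.

```python
# Sketch of the master-side snapshot + copy-on-write path (names are hypothetical).

def snapshot(master, source_path, snapshot_path):
    # 1. Revoke outstanding leases so future writes must come back to the master.
    for handle in master.chunks_under(source_path):
        master.leases.revoke(handle)
    # 2. Log the operation, then duplicate the metadata in memory.
    master.log("snapshot", source_path, snapshot_path)
    for file_path, chunks in master.files_under(source_path):
        new_path = file_path.replace(source_path, snapshot_path, 1)
        master.file_to_chunks[new_path] = list(chunks)   # point at the same chunks
        for handle in chunks:
            master.refcount[handle] += 1                 # shared until first write


def find_lease_holder(master, handle):
    # First write to a shared chunk C: defer, copy C to a fresh handle C', then proceed.
    if master.refcount[handle] > 1:
        new_handle = master.new_chunk_handle()
        for server in master.chunk_locations[handle]:
            server.copy_chunk(handle, new_handle)        # local copy, no network transfer
        master.refcount[handle] -= 1
        master.refcount[new_handle] = 1
        master.chunk_locations[new_handle] = list(master.chunk_locations[handle])
        # (the file's chunk list would also be updated to point at new_handle)
        handle = new_handle
    return handle, master.leases.grant_or_get(handle, master.chunk_locations[handle])
```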

Master Operations

- Maintains the mapping between file names and chunks
- Handles the mutation leases
- Namespace management and locking
- Replica placement
  - Chunk creation, re-replication, rebalancing
    - Based on space utilization, "recent" creations, and spreading replicas across racks
- Garbage collection
  - Lazy garbage collection, for simplicity; runs as a background activity
- Stale replica deletion
  - Using chunk version numbers

Master Operations

- Namespace management
  - File creation / deletion
- Locks on the directories / files are used to properly serialize various master operations (see the locking sketch after this list)
  - A snapshot of /home/user requires a read-lock on /home and a write-lock on /home/user (L1).
  - A file creation of /home/user/foo.dat requires a read-lock on /home/user (L2) and a write-lock on /home/user/foo.dat.
  - L1 (the write-lock on /home/user) and L2 (the read-lock on /home/user) conflict with each other, so these two operations will be serialized.
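
A simplified sketch of the locking discipline: take locks on every ancestor directory name and on the full pathname itself. For brevity this sketch uses plain mutexes where GFS uses read-write locks, and all names are mine.

```python
# Sketch: lock ancestor directory names plus the target path to serialize
# namespace operations on the master (hypothetical, simplified API).
from contextlib import ExitStack
from threading import Lock


class PathLockTable:
    """One lock per full pathname (simplified: plain mutexes, not read-write locks)."""
    def __init__(self):
        self._locks = {}

    def lock_for(self, path: str) -> Lock:
        return self._locks.setdefault(path, Lock())


def ancestors(path: str):
    parts = path.strip("/").split("/")
    return ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]


def acquire_namespace_locks(table: PathLockTable, path: str, stack: ExitStack):
    # Locks on /d1, /d1/d2, ... (read locks in real GFS) and on the full path
    # itself (a write lock in real GFS).
    for ancestor in ancestors(path):
        stack.enter_context(table.lock_for(ancestor))
    stack.enter_context(table.lock_for(path))


# Example: a snapshot of /home/user and a creation of /home/user/foo.dat both need
# a lock on /home/user (write vs. read), so the two operations serialize.
table = PathLockTable()
with ExitStack() as stack:
    acquire_namespace_locks(table, "/home/user/foo.dat", stack)
    print(ancestors("/home/user/foo.dat"))  # ['/home', '/home/user']
```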

Master Operations

- Replica placement
  - Chunk creation
  - Re-replication
    - When a replica is corrupted or its chunkserver fails.
    - To create more replicas for hot chunks (chunks that are accessed very often).
  - Rebalancing
    - Equalize space utilization.
    - Balance the load on each chunkserver.
    - Spread replicas across racks for higher reliability.

Master Operations

- Garbage collection
  - Releases the space occupied by deleted files (a deletion sketch follows this list).
    - File deletion in GFS is just a renaming of the file name to a "hidden" name.
  - Removes orphaned chunks.
    - Orphaned chunks are chunks not referenced by any file.
    - They result from failures in the chunk creation/deletion process.
  - Removes stale replicas (described on the next slide).
  - Garbage collection is performed when the master is not too busy.
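
A minimal sketch of lazy deletion as described above: deletion renames the file to a hidden name carrying a timestamp, and a background sweep reclaims it later. The hidden-name prefix and the three-day grace period are illustrative defaults, not values taken from these slides.

```python
# Sketch: lazy deletion as rename-to-hidden plus a later background sweep
# (names, the hidden prefix, and the grace period are hypothetical).
import time

HIDDEN_PREFIX = ".deleted."
GRACE_PERIOD = 3 * 24 * 3600  # e.g. three days before the space is reclaimed


def delete_file(namespace: dict, path: str) -> None:
    """'Delete' = rename to a hidden name stamped with the deletion time."""
    meta = namespace.pop(path)
    namespace[f"{HIDDEN_PREFIX}{time.time():.0f}.{path}"] = meta


def background_sweep(namespace: dict, now: float) -> list:
    """Run while the master is not busy: drop hidden files older than the grace period."""
    reclaimed = []
    for name in list(namespace):
        if name.startswith(HIDDEN_PREFIX):
            stamp = float(name[len(HIDDEN_PREFIX):].split(".", 1)[0])
            if now - stamp > GRACE_PERIOD:
                reclaimed.append(namespace.pop(name))  # chunks become orphaned, GC'd later
    return reclaimed
```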

Master Operations

- Stale replica deletion
  - Chunk replicas may become stale if a chunkserver fails and misses mutations to the chunk while it is down.
  - Every time the master grants a new lease on a chunk, it increases the chunk's version number.
    - The new version number is passed to the replicas and the client.
    - The master keeps a copy of the chunk version number in its metadata.
    - Both the master and the chunkservers save the version number in their persistent state before the client is notified about the lease.
  - When a version-number mismatch is detected, the replica with the highest version number is used; stale replicas (those with lower version numbers) are garbage collected (a version-check sketch follows).
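
A small sketch of that version-number comparison; the data and server names are made up.

```python
# Sketch: how a version-number comparison identifies stale replicas (hypothetical data).

def classify_replicas(master_version: int, replica_versions: dict):
    """Return (up-to-date replicas, stale replicas to garbage-collect)."""
    latest = max(replica_versions.values(), default=master_version)
    # If a replica reports a version newer than the master's record (e.g. the master
    # failed right after granting a lease), the master adopts the higher version.
    current = max(master_version, latest)
    fresh = [s for s, v in replica_versions.items() if v == current]
    stale = [s for s, v in replica_versions.items() if v < current]
    return fresh, stale


# Example: chunkserver "cs3" was down during a mutation and missed the version bump.
print(classify_replicas(7, {"cs1": 7, "cs2": 7, "cs3": 6}))
# -> (['cs1', 'cs2'], ['cs3'])
```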

Fault Tolerance

- Fast recovery
  - The master uses operation logs and checkpoints.
    - Fast recovery from the nearest checkpoint by reapplying the changes recorded in the log.
  - When a chunkserver boots up (possibly after a crash), it reports the chunks it holds to the master. Usually this takes just seconds.
- Chunk replication
- Master replication
  - Operation logs and checkpoints are replicated on multiple machines.
  - A mutation to the state is considered committed only after its log record has been flushed to disk locally and on all master replicas.
  - There is only one active master server.
    - However, there can be multiple read-only "shadow" master replicas.
  - An external monitoring infrastructure starts a new active master, based on the replicated logs and checkpoints, in case of master failure.

Fault Tolerance

- Data integrity
  - Checksums
    - Each 64 KB block of a chunk is protected by a 32-bit checksum.
  - Checksum verification is done every time a chunk is read (see the sketch below).
  - During idle periods, chunkservers scan through the chunks and verify the checksums.
- Diagnosis tools and analysis of logs
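
A sketch of per-block checksum verification on read, using 64 KB blocks and CRC32 as a stand-in for whatever 32-bit checksum GFS actually uses.

```python
# Sketch: per-64KB-block checksum verification on a chunk read.
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB blocks, each protected by a 32-bit checksum


def block_checksums(chunk_data: bytes):
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]


def verify_range(chunk_data: bytes, checksums, offset: int, length: int) -> bool:
    """Verify only the blocks that overlap the requested read range."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    for b in range(first, last + 1):
        block = chunk_data[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]
        if zlib.crc32(block) != checksums[b]:
            return False  # report corruption to the master; read from another replica
    return True


data = bytes(200 * 1024)            # a 200 KB chunk fragment of zeros
sums = block_checksums(data)
print(verify_range(data, sums, offset=70 * 1024, length=10 * 1024))  # True
```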

Measurements

Cluster A: used for research and development by over a hundred engineers.
Cluster B: used for production data processing.

Note: the average metadata size at the master or at a chunkserver is 50 to 100 MB.

Measurements

Cluster A sustained a read rate of 580 MB/s for a week, while the network configuration can support 750 MB/s; resources are being used quite efficiently.

Measurements

Cluster X: for R&D. Cluster Y: for production.

Reviews and Conclusions

- Special aspects of GFS
  - No caching
    - Due to Google's workload pattern
  - Centralized (single) master
    - Good for Google's file pattern (a modest number of gigabyte- to terabyte-sized files)
    - Bad for zillions of (small) files
    - Simpler and more efficient file metadata management
  - Relaxed consistency model
    - Simpler design and implementation
    - Potentially more reliable
    - Supports higher throughput

Reviews and Conclusions

- Special aspects of GFS
  - Incorporates multiple fault-tolerance measures
    - Each chunk is protected through checksums and replication.
    - The master server is protected through logging/checkpointing/recovery and replication.
    - Dynamic replica reallocation.
    - Replication cost ($$$) is a non-issue for Google.
  - Does not use majority voting as a fault-tolerance mechanism
    - It needs to maintain high throughput.
    - Corruption of the metadata at the master can be a problem.
    - Detected through diagnosis tools and analysis of logs.

Measurements

Micro-benchmarks on a GFS cluster consisting of one master, two master replicas, 16 chunkservers, and 16 clients.