CephFS fsck: Distributed Filesystem Checking

Hi, I’m Greg
Greg Farnum, CephFS Tech Lead, Red Hat ([email protected])
I’ve been working as a core Ceph developer since June 2009.
What is Ceph?
An awesome, software-based, scalable, distributed storage system that is designed for failures
• Object storage (our native API)
• Block devices (Linux kernel, QEMU/KVM, others)
• RESTful S3 & Swift API object store
• POSIX filesystem
The Ceph stack (diagram):
• LIBRADOS: a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
• RADOSGW: a bucket-based REST gateway, compatible with S3 and Swift
• RBD: a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
• CEPH FS: a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
All of these sit on top of:
RADOS
A reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
(RADOS = Reliable Autonomic Distributed Object Store)

What is CephFS?
An awesome, software-based, scalable, distributed POSIX-compliant file system that is designed for failures
RADOS: a user perspective
Objects in RADOS

An object (diagram) has three parts:
• data: an arbitrary byte stream
• xattrs: extended attributes (e.g. version: 1)
• omap: a key-value store (e.g. foo -> bar, baz -> qux)
The librados API

C, C++, Python, Java, shell. File-like API:
• read/write (extent), truncate, remove; get/set/remove xattr or key
• efficient copy-on-write clone
• snapshots: single object or pool-wide
• atomic compound operations/transactions
  • read + getxattr, write + setxattr
  • compare xattr value; if match, write + setxattr
• “object classes”
  • load new code into the cluster to implement new methods
  • calc sha1, grep/filter, generate thumbnail
  • encrypt, increment, rotate image
  • implement your own access mechanisms: HDF5 on the node
• watch/notify: use an object as a communication channel between clients (locking primitive)
• pgls: list the objects within a placement group
The RADOS Cluster

(diagram: a client talks to the monitors (M) and to many Object Storage Devices (OSDs))
(diagram: OSDs hold replicated objects (Object 1–Object 4), grouped into pools (Pool 1, Pool 2); a monitor (M) tracks cluster state)
Data placement

An object maps to a placement group (PG) by hashing its name, and CRUSH maps the PG to OSDs:

pg   = hash(object name) % num_pg
osds = CRUSH(pg, cluster state, rule set)
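The two-step mapping can be sketched in Python. Note that `toy_crush` below is only a rendezvous-hash stand-in for the real CRUSH algorithm, which descends a weighted bucket hierarchy described by the cluster map and rule set; the object name and PG count are illustrative.

```python
import hashlib

def hash_to_pg(object_name: str, num_pg: int) -> int:
    # Step 1: the object name hashes to a placement group (PG).
    h = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
    return h % num_pg

def toy_crush(pg: int, osds: list, replicas: int = 3) -> list:
    # Step 2 (toy stand-in for CRUSH): deterministically rank the OSDs
    # for this PG and take the first `replicas` of them. Real CRUSH
    # instead walks a weighted hierarchy from the cluster map.
    ranked = sorted(osds, key=lambda o: hashlib.md5(f"{pg}.{o}".encode()).hexdigest())
    return ranked[:replicas]

pg = hash_to_pg("10000000000.00000000", num_pg=64)
acting_set = toy_crush(pg, osds=list(range(10)))
```

The key property: any client can compute the mapping locally, so there is no central lookup table to consult (or lose).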
RADOS data guarantees
• Any write acked as safe will be visible to all subsequent readers
• Any write ever visible to a reader will be visible to all subsequent readers
• Any write acked as safe will not be lost unless the whole containing PG is lost
• A PG will not be lost unless all N copies are lost (N is admin-configured, usually 3)…
• …and in case of OSD failure the system will try to bring you back up to N copies (no user intervention required)
RADOS data guarantees (continued)
• Data is regularly scrubbed to ensure copies are consistent with each other, and administrators are alerted if inconsistencies arise
  • …and while it’s not automated, it’s usually easy to identify the correct data with “majority voting” or similar
• btrfs maintains checksums for certainty, and we think this is the future
CephFS System Design
CephFS Design Goals
• Infinitely scalable
• Avoid all single points of failure
• Self-managing
(diagram: a client talks to a Metadata Server (MDS) for metadata and directly to the OSDs for file data; monitors (M) track cluster state)
Scaling Metadata

So we have to use multiple metadata servers (MDSes).

Two issues:
• Storage of the metadata
• Ownership of the metadata
Scaling Metadata – Storage

Some systems store metadata on the MDS system itself. But that’s a single point of failure!
• Hot standby?
• External metadata storage ✓
Scaling Metadata – Ownership

Traditionally: assign hierarchies manually to each MDS
• But if workloads change, your nodes can become unbalanced
Newer: hash directories onto MDSes
• But then clients have to jump around for every directory traversal
(diagrams: one tree, partitioned two different ways across two metadata servers)
The Ceph Metadata Server

Key insight: if metadata is stored in RADOS, ownership should be impermanent.

One MDS is authoritative over any given subtree, but…
• That MDS doesn’t need to keep the whole tree in memory
• There’s no reason the authoritative MDS can’t be changed!
The Ceph MDS – Partitioning

Cooperative partitioning between servers:
• Keep track of how hot metadata is
• Migrate subtrees to keep heat distribution similar
• Cheap because all metadata is in RADOS
• Maintains locality
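A minimal sketch of the heat-migration idea. The data model and the balancing heuristic here are my own illustration, not the actual MDS balancer; the point is that moving a subtree only moves authority, since the metadata itself lives in RADOS.

```python
def rebalance_once(ownership):
    # ownership: {mds_name: {subtree_path: heat}}. Move one subtree from
    # the busiest MDS toward the least-loaded one; the migration is cheap
    # because only authority moves -- the metadata stays in RADOS.
    load = {mds: sum(heats.values()) for mds, heats in ownership.items()}
    busiest = max(load, key=load.get)
    coolest = min(load, key=load.get)
    if busiest == coolest or not ownership[busiest]:
        return None
    # pick the subtree whose heat comes closest to halving the gap
    gap = (load[busiest] - load[coolest]) / 2
    subtree = min(ownership[busiest],
                  key=lambda s: abs(ownership[busiest][s] - gap))
    ownership[coolest][subtree] = ownership[busiest].pop(subtree)
    return subtree
```

Migrating whole subtrees (rather than hashing individual directories) is what preserves locality: a client walking one part of the tree keeps talking to one MDS.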
The Ceph MDS – Persistence

All metadata is written to RADOS
• And changes are only visible once in RADOS
The Ceph MDS – Clustering Benefits

Dynamic adjustment to metadata workloads
• Replicate hot data to distribute workload
Dynamic cluster sizing:
• Add nodes as you wish
• Decommission old nodes at any time
Recover quickly and easily from failures
Dynamic Subtree Partitioning (diagrams)
Does it work?

It scales! (graph)
It redistributes! (graph)

Cool Extras
Besides POSIX compliance and scaling
Snapshots

$ mkdir foo/.snap/one    # create snapshot
$ ls foo/.snap
one
$ ls foo/bar/.snap
_one_1099511627776       # parent's snap name is mangled
$ rm foo/myfile
$ ls -F foo
bar/
$ ls foo/.snap/one
myfile  bar/
$ rmdir foo/.snap/one    # remove snapshot
Recursive statistics

$ ls -alSh | head
total 0
drwxr-xr-x 1 root       root      9.7T 2011-02-04 15:51 .
drwxr-xr-x 1 root       root      9.7T 2010-12-16 15:06 ..
drwxr-xr-x 1 pomceph    pg4194980 9.6T 2011-02-24 08:25 pomceph
drwx--x--- 1 fuzyceph   adm       1.5G 2011-01-18 10:46 fuzyceph
drwxr-xr-x 1 dallasceph pg275     596M 2011-01-14 10:06 dallasceph
$ getfattr -d -m ceph. pomceph
# file: pomceph
ceph.dir.entries="39"
ceph.dir.files="37"
ceph.dir.rbytes="10550153946827"
ceph.dir.rctime="1298565125.590930000"
ceph.dir.rentries="2454401"
ceph.dir.rfiles="1585288"
ceph.dir.rsubdirs="869113"
ceph.dir.subdirs="2"
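The r* values are just recursive sums maintained up the tree. A simplified sketch of the accounting (my own model of the idea behind the ceph.dir.r* xattrs; Ceph's exact rules may differ):

```python
def rstats(tree):
    # tree: {name: nested dict for a subdirectory, or int byte size for
    # a file}. Returns (rentries, rfiles, rsubdirs, rbytes), counting
    # everything beneath the directory recursively.
    rentries = rfiles = rsubdirs = rbytes = 0
    for name, node in tree.items():
        if isinstance(node, dict):          # subdirectory: recurse
            e, f, d, b = rstats(node)
            rentries += e + 1
            rfiles += f
            rsubdirs += d + 1
            rbytes += b
        else:                               # file of `node` bytes
            rentries += 1
            rfiles += 1
            rbytes += node
    return rentries, rfiles, rsubdirs, rbytes
```

This is why `ls -alSh` above can show 9.7T on a directory: the directory's reported size is its recursive byte count, not a block count.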
Different storage strategies
• Set a “virtual xattr” on a directory, and all new files underneath it follow that layout.
• Layouts can specify lots of detail about storage:
  • which pool file data goes into
  • how large file objects and stripes are
  • how many objects are in a stripe set
• So in one cluster you can use:
  • one slow pool with big objects for Hadoop workloads
  • one fast pool with little objects for a scratch space
  • one slow pool with small objects for home directories
  • or whatever else makes sense…
CephFS: Important Data Structures

Directory objects
• One (or more!) per directory
• Deterministically named
File objects
• One or more per file
• Deterministically named
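Because the names are deterministic, locating the object that holds any byte of a file is pure arithmetic. A sketch assuming a simple non-striped layout and the common `<inode-hex>.<index>` naming shape (treat the exact format as illustrative):

```python
def data_object_name(ino: int, offset: int, object_size: int = 4 << 20) -> str:
    # Which RADOS object holds byte `offset` of inode `ino`? With a
    # simple layout, file bytes are chunked into fixed-size objects,
    # so the name is computable with no lookup table at all.
    index = offset // object_size
    return f"{ino:x}.{index:08x}"
```

This property is what makes backwards repair plausible at all: given a raw object name, you can recover which inode it belongs to.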
MDS log (objects)

The MDS fully journals all metadata operations. The log is chunked across objects.
• Deterministically named
MDSTable objects
• Single objects, all located in the metadata pool
• SessionMap (per-MDS)
  • stores the state of each client session
  • particularly: preallocated inodes for each client
• InoTable (per-MDS)
  • tracks which inodes are available to allocate
  • (this is not a traditional inode mapping table or similar)
• SnapTable (shared)
  • tracks system snapshot IDs and their state (in use, pending create/delete)
CephFS Metadata Update Flow

Client Sends Request (diagram): the client sends a “create dir” request to the MDS. On the OSDs sit log objects (log.1–log.3) and directory objects (dir.1–dir.3).
MDS Processes Request: “early reply” and journaling (diagram): the MDS sends an early reply to the client while writing the journal entry to the OSDs.
MDS Processes Request: journaling and safe reply (diagram): once the OSDs ack the journal write (the log is now log.1–log.4), the MDS sends a safe reply to the client.
…time passes… (diagram): the journal grows to log.4–log.8 alongside the directory objects (dir.1–dir.3).
MDS Flushes Log (diagram): the MDS writes the journaled directory updates to the directory objects.
MDS Flushes Log, continued (diagram): the OSDs ack the directory write; the directory objects are now dir.1–dir.4.
MDS Flushes Log, continued (diagram): the MDS deletes the now-flushed log segment (log.4), leaving log.5–log.8.
Traditional fsck
e2fsck

Drawn from “ffsck: The Fast File System Checker” (FAST ’13) and http://mm.iit.uni-miskolc.hu/Data/texts/Linux/SAG/node92.html
Key data structures & checks
• Superblock
  • check free block count, free inode count
• Data bitmap: blocks marked free are not in use
• Inode bitmap: inodes marked free are not in use
• Directories
  • inodes are allocated, reasonable, and in-tree
• Inodes
  • consistent internal state
  • link counts
  • blocks claimed are valid and unique
Procedure
• Pass 1: iterate over all inodes
  • check self-consistency
  • build up maps of in-use blocks/inodes/etc.
  • correct any issues with doubly-allocated blocks
• Pass 2: iterate over all directories
  • check dentry validity and that all referenced inodes are valid
  • cache a tree structure
• Pass 3: check directory connectivity in memory
• Pass 4: check inode reference counts in memory
• Pass 5: check cached maps against on-disk maps and overwrite if needed
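The heart of passes 1 and 5 can be sketched as bitmap reconciliation. This is a simplified model of the idea, not e2fsck's actual code:

```python
def check_bitmaps(inodes, disk_bitmap):
    # inodes: {inode_number: [block numbers it claims]}.
    # disk_bitmap: the on-disk "block in use" bitmap, as a list of bools.
    # Pass 1 builds the true in-use bitmap from the inodes and notices
    # doubly-claimed blocks; pass 5 diffs it against the on-disk bitmap.
    computed = [False] * len(disk_bitmap)
    doubly_claimed = []
    for ino, blocks in inodes.items():
        for b in blocks:
            if computed[b]:
                doubly_claimed.append((ino, b))
            computed[b] = True
    mismatches = [i for i, (real, disk) in enumerate(zip(computed, disk_bitmap))
                  if real != disk]
    return doubly_claimed, mismatches

dup, bad = check_bitmaps({1: [0, 2], 2: [2, 3]},
                         [True, False, True, False, False])
```

Note what this requires: a full scan of every inode, with the computed maps held in memory until the final pass. That is exactly the constraint CephFS fsck cannot accept.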
CephFS fsck: what it needs to do
RADOS is different from a disk
• We look at objects, not disk blocks
• And we can’t lose them (at least, not in a way we can recover)
• We can deterministically identify all pieces of file data
  • and the inode they belong to!
• It is not feasible to keep all metadata in memory at once
• Data loss is the result of:
  • bugs in the system,
  • simultaneous catastrophic failure of RADOS (probably losing lots of random data),
  • or bitrot
Failure detection: “forward scrub”
• Intended to find tree inconsistencies from bugs or bitrot
• Runs continuously in the background
• Traverses the tree and makes sure referents agree in both directions
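The “both directions” check can be sketched as: follow forward links (dentries) down the tree, and verify each child's backtrace points back at its actual parent. The data model and names here are my own illustration:

```python
def forward_scrub(dirs, backtraces, root="/"):
    # dirs: {dir_path: [(child_name, is_dir)]} -- the forward links.
    # backtraces: {child_path: parent_path} -- the backward links.
    # Returns paths whose backtrace disagrees with their actual parent.
    bad = []
    stack = [root]
    while stack:
        d = stack.pop()
        for name, is_dir in dirs.get(d, []):
            path = d.rstrip("/") + "/" + name
            if backtraces.get(path) != d:
                bad.append(path)          # stale or corrupt backtrace
            if is_dir:
                stack.append(path)
    return bad

errors = forward_scrub(
    dirs={"/": [("home", True)], "/home": [("greg", False)]},
    backtraces={"/home": "/", "/home/greg": "/oops"})
```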
Catastrophic failure repair: “backwards repair”
• Repairs the tree after catastrophic failure, or after scrub detects an issue
• Run only with administrator intervention
• Could affordably be offline-only
Forward Scrub (being implemented)
Goal: check the tree
In the background, examine the entire filesystem hierarchy and make sure it’s self-consistent.
File objects
• Every inode we find has the correct data objects
• The inode location and data-object backtrace are self-consistent
  • slightly tricky, since backtraces can be stale
• (Optionally) the file size is consistent with what objects exist
Directory objects
• The directory objects are self-consistent with their summaries and actual contents (the rstats)
• The directory’s backtrace and actual parent directory are self-consistent
  • This covers any bugs causing double links
Constraints
• Limited memory: we can’t hold all the metadata in memory at once
• Scalable: we have an MDS cluster, and this needs to work within that framework
On-disk data structure changes
• inode_t gets a scrub_stamp (time) and scrub_version to keep track of the last time it was scrubbed
• frag_t gets a scrub_stamp and scrub_version for both “local” and “recursive” scrubs
  • Directories can be “fragmented” into multiple pieces; frag_t holds the metadata about a given directory fragment
• These values can be reported to the user via our rstats “virtual xattr” interface
Algorithm Design
• Obviously we’re doing a depth-first search:
  • in the worst case this restricts memory usage to O(tree depth)
  • validating files first means that when we validate directory rstats, they’re not about to get changed
• Basic strategy: construct a stack of “CDentry”s to scrub, and when you scrub a directory, push its contents onto the top of the stack first
  • But we want to limit it so we don’t explode memory usage
ScrubStack
• Examine the dentry
  • If it’s a file: scrub it directly and record the scrub stamp/version
  • If it’s a directory:
    • on first access, generate a list of all dentries, segregated by directory/file status
      • this lets us kick out the CDentry, CInode, and CDir objects we had to read in to get the list
    • on first access, note the current version and time
    • push the next unscrubbed directory fragment onto the stack and restart
    • if there are no frags left, push files onto the stack and restart
    • when all files are scrubbed, scrub the directory frags and record the directory as scrubbed as of the starting stamp/version values
• If we hit a tree owned by another MDS, spin off a request to scrub that tree and move on to another area of our own hierarchy until it’s completed
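The traversal order that strategy produces can be sketched as follows. This is simplified (no directory fragments, no cross-MDS trees), and the two-phase push is my rendering of the idea, not the real ScrubStack code:

```python
def scrub_order(tree, root="/"):
    # tree: {dir_path: [child names]}; any path not in `tree` is a file.
    # Depth-first, files before their parent directory, so at any moment
    # we only hold the current path plus one directory listing in memory.
    def join(d, n):
        return d.rstrip("/") + "/" + n
    stack = [(root, False)]     # (path, children_already_pushed)
    order = []
    while stack:
        path, expanded = stack.pop()
        if path not in tree:                  # a file: scrub it now
            order.append(path)
        elif not expanded:
            stack.append((path, True))        # revisit dir after children
            for name in sorted(tree[path], reverse=True):
                stack.append((join(path, name), False))
        else:                                 # children done: stamp the dir
            order.append(path)
    return order

order = scrub_order({"/": ["d", "f"], "/d": ["g"]})
```

Every file is scrubbed before its parent directory is stamped, which is what makes the recorded directory stamp/version trustworthy.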
How does it scale?

Well, each MDS scrubs the data for which it’s authoritative. There’s only interaction at the boundaries, and there’s a lot of machinery around those which makes it fairly easy.
Backwards Scrub/Repair (being designed)
Goal: repair the tree
Once we know that there’s been some failure (either due to catastrophic data loss or a serious bug), examine the raw RADOS state and get all the data back into the filesystem hierarchy. Make all referents consistent with each other.
File objects
• Given an object name and location, we know whether it’s a file object and which inode it belongs to
• The first file object for an inode carries a backtrace of its location
Directory objects
• Given an object name and location, we know whether it’s a directory object, and for which directory inode
• The directory has a backtrace on it
• The directory has (versioned) forward links to all children
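Since backtraces are versioned, conflicting copies gathered from different objects resolve simply: keep the newest one per inode. A sketch (tuple layout is my own):

```python
def best_backtraces(observed):
    # observed: iterable of (ino, version, ancestry) tuples gathered
    # from many objects. Stale backtraces linger on objects that were
    # never rewritten, so only the highest version per inode is trusted.
    best = {}
    for ino, version, ancestry in observed:
        if ino not in best or version > best[ino][0]:
            best[ino] = (version, ancestry)
    return best

best = best_backtraces([
    (1, 3, ["/", "a"]),       # stale: the file was later moved
    (1, 5, ["/", "b"]),
    (2, 1, ["/", "c"]),
])
```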
MDS logs

The MDS logs can contain much newer versions of inodes than any of the backing objects
• e.g., a file got renamed into a directory on a different MDS and it hasn’t been flushed yet
Constraints (same as before)
• Limited memory: we can’t hold all the metadata in memory at once
• Scalable: we have an MDS cluster, and this needs to work within that framework
Algorithm Design

Well, we haven’t finished this yet… but we have lots of ideas!
Building blocks
• Forward scrub: lets us identify things we know are in the filesystem (potentially tagging them while doing so), and find things we know are missing but shouldn’t be
• Backtraces: give us (possibly stale) snapshots of where in the tree each directory or file lives
  • importantly, these are versioned! So we can use the backtrace from one file to update the older backtrace from another
• RADOS object listing: we can list every object in the filesystem
  • even better: we can inject “filters” to restrict the listing to the first file object, or to objects we have not tagged in a previous pass
• MDS logs: each log segment contains a lot of information about the objects it changes
• RADOS operations:
  • snapshots could be useful
  • you can store objects “next to” other objects, creating a scratch space!
Check what we know we have
• Flush the MDS logs out to disk
  • We can probably identify whether logs are whole: they’re numbered consecutively, and we can check objects on the boundary against any lost PGs
  • Missing transactions are likely to be reconstructable from context: if the path for an inode changes unexpectedly, we can adjust to that!
  • Cross-MDS transactions are either renames across authoritative zones (resolvable by looking at the journals together) or transfers of authority (we can make up new boundaries)
• Have each MDS scrub its authoritative data, and tag all reached objects
• If we expect to find objects and don’t, or they are broken, add them to a fix list
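Because segments are numbered consecutively, checking whether a log is whole reduces to a gap check. A sketch (the `log.N` naming is illustrative, not the real journal object naming scheme):

```python
def find_log_gaps(segment_names, prefix="log."):
    # Missing journal segments show up as holes in the number sequence.
    nums = sorted(int(name[len(prefix):])
                  for name in segment_names if name.startswith(prefix))
    if not nums:
        return []
    present = set(nums)
    return [i for i in range(nums[0], nums[-1] + 1) if i not in present]

gaps = find_log_gaps(["log.5", "log.6", "log.8"])
```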
Look at the raw objects
• Run a (filtered?) listing of all objects in the metadata and data pools
  • We can filter out anything that was tagged if we want; and depending on thoroughness, we can skip objects without a backtrace
  • This can scale in several ways (e.g., “map” along PG lines, “reduce” based on authority guessed from backtraces)
• For each found object we don’t have in the hierarchy, attempt to place it in the tree based on its backtrace
  • Is it so new that the dentry never got flushed out (or was lost in a busted journal)?
  • Does it belong to a missing directory object?
• Create “phantom” directories based on backtrace contents, if needed
• If previously guessed data conflicts with new data, take the one with the newer version
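The phantom-directory idea can be sketched as replaying each orphan's backtrace from the root and materializing any missing ancestors along the way. The data model is my own illustration:

```python
def rebuild_tree(orphans):
    # orphans: {ino: backtrace}, where a backtrace is the list of names
    # from the root down to the inode, e.g. ["home", "greg", "file"].
    # Returns {directory_path: set of child names}; ancestors that no
    # longer exist are created as "phantoms" so every orphan has a home.
    tree = {}
    for ino, backtrace in orphans.items():
        path = ""
        for name in backtrace:
            tree.setdefault(path or "/", set()).add(name)
            path = path + "/" + name
    return tree

tree = rebuild_tree({10: ["home", "greg", "f"], 11: ["home", "x"]})
```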
Try again!
• After doing both a hierarchy scan and a deep scan, and fixing things up, we should be able to touch everything in the system. Run it again to make sure.
How do we store repaired data?

We have a few options here:
• use RADOS snapshots to leave the original data unchanged, and write updates in place
• use the “locator” functionality to explicitly create our own in-flight repair objects next to the original data until we can check it and flush back to the original data
• or something more complicated? (object-class code, explicitly copying all dentries within a directory object, etc.)
How would repair scale?
• Scrubbing scales across MDS authority zones, as before
  • Note that, if necessary, we can spin up new MDSes with new authority zones to scale out the checking
• Each MDS can try to repair the data for which it is authoritative, and pass along objects it finds that belong elsewhere
  • and request things it thinks it should own from its peers, too
• Obviously a mapping from the raw RADOS listing to these authority zones is required:
  • partition the PG space evenly between MDSes, and let each one handle its part of the listing
  • chop up the raw data into expected authority zones based on backtraces (probably logging them to an area in RADOS)
  • ship the lists off to the proper authoritative MDS
Questions?
• #ceph-devel on irc.oftc.net (I’m gregsfortytwo)
• [email protected]
• Greg Farnum, [email protected], @gregsfortytwo