1 CephFS fsck: Distributed Filesystem Checking

2 Hi, I’m Greg

Greg Farnum CephFS Tech Lead, Red Hat [email protected]

Been working as a core Ceph developer since June 2009

3–4 What is Ceph?

An awesome, software-based, scalable, distributed storage system that is designed for failures

• Object storage (our native API)
• Block devices (kernel, QEMU/KVM, others)
• RESTful S3 & Swift API object store
• POSIX filesystem

5 The Ceph stack:
• LIBRADOS: a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
• RADOSGW: a bucket-based REST gateway, compatible with S3 and Swift
• RBD: a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
• CEPH FS: a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
• RADOS: a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

RADOS = Reliable Autonomic Distributed Object Store

6 What is CephFS?

An awesome, software-based, scalable, distributed, POSIX-compliant filesystem that is designed for failures

7 RADOS: a user perspective

8 Objects in RADOS

[Diagram: a RADOS object holds byte data, xattrs (e.g. version: 1), and an omap key/value store (e.g. foo -> bar, baz -> qux)]

9 The librados API

C, C++, Python, Java, shell. File-like API:
• read/write (extent), truncate, remove; get/set/remove xattr or key
• efficient copy-on-write clone
• snapshots: single object or pool-wide
• atomic compound operations/transactions
  • read + getxattr, write + setxattr
  • compare xattr value, if match write + setxattr
• “object classes”
  • load new code into the cluster to implement new methods
  • calc sha1, filter, generate thumbnail
  • encrypt, increment, rotate image
  • implement your own access mechanisms (HDF5 on the node)
• watch/notify: use an object as a communication channel between clients (locking primitive)
• pgls: list the objects within a placement group
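As a hedged illustration, here is a minimal sketch using the python-rados bindings; the pool name 'mypool' and object name 'greeting' are made up, and error handling is omitted:

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('mypool')          # hypothetical pool name
    try:
        ioctx.write_full('greeting', b'hello')    # replace the whole object
        ioctx.set_xattr('greeting', 'version', b'1')
        data = ioctx.read('greeting')             # offset/length arguments select an extent
        version = ioctx.get_xattr('greeting', 'version')
        print(data, version)
    finally:
        ioctx.close()
finally:
    cluster.shutdown()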

10 The RADOS Cluster

[Diagram: a CLIENT, the Object Storage Devices (OSDs), and three monitors (M)]

11 [Diagram: objects 1–4 grouped into Pool 1 and Pool 2, stored across the Object Storage Devices (OSDs); M = Monitor]

12 Data placement:
• objects map to placement groups: hash(object name) % num pg
• placement groups map to OSDs: CRUSH(pg, cluster state, rule set)
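To make the two-step mapping concrete, here is a purely conceptual Python sketch; the real system uses a Jenkins-style hash and the CRUSH algorithm over a weighted device hierarchy, neither of which is reproduced here:

import hashlib

def object_to_pg(object_name: str, num_pgs: int) -> int:
    # Step 1: hash(object name) % num pg
    digest = hashlib.md5(object_name.encode()).hexdigest()
    return int(digest, 16) % num_pgs

def pg_to_osds(pg: int, osds: list, replicas: int = 3) -> list:
    # Step 2 stand-in for CRUSH(pg, cluster state, rule set): deterministically
    # pick `replicas` distinct OSDs for this PG.
    return [osds[(pg + i) % len(osds)] for i in range(replicas)]

pg = object_to_pg('myobject', num_pgs=128)
print(pg, pg_to_osds(pg, osds=['osd.0', 'osd.1', 'osd.2', 'osd.3', 'osd.4']))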

13 RADOS data guarantees

• Any write acked as safe will be visible to all subsequent readers
• Any write ever visible to a reader will be visible to all subsequent readers
• Any write acked as safe will not be lost unless the whole containing PG is lost
• A PG will not be lost unless all N copies are lost (N is admin-configured, usually 3)…
• …and in case of OSD failure the system will try to bring you back up to N copies (no user intervention required)

14 RADOS data guarantees

• Data is regularly scrubbed to ensure copies are consistent with each other, and administrators are alerted if inconsistencies arise
• …and while it’s not automated, it’s usually easy to identify the correct data with “majority voting” or similar
• Maintaining checksums gives real certainty here, and we think that is the future

15 CephFS System Design

16 CephFS Design Goals

• Infinitely scalable
• Avoid all Single Points Of Failure
• Self-managing

17 [Diagram: a CLIENT, its data (01 10), the Metadata Server (MDS), and the monitors (M)]

18 Scaling Metadata

So we have to use multiple MetaData Servers (MDSes)

Two issues:
• Storage of the metadata
• Ownership of the metadata

19 Scaling Metadata – Storage

Some systems store metadata on the MDS system itself

But that’s a Single Point Of Failure!

• Hot standby?
• External metadata storage ✓

20 Scaling Metadata – Ownership

Traditionally: assign hierarchies manually to each MDS
• But if workloads change, your nodes can become unbalanced

Newer: hash directories onto MDSes
• But then clients have to jump around for every folder traversal

21–22 [Diagrams: one tree, two metadata servers]

23 The Ceph Metadata Server

Key insight: If metadata is stored in RADOS, ownership should be impermanent

One MDS is authoritative over any given subtree, but...
• That MDS doesn’t need to keep the whole tree in memory
• There’s no reason the authoritative MDS can’t be changed!

24 The Ceph MDS – Partitioning

Cooperative partitioning between servers:
• Keep track of how hot metadata is
• Migrate subtrees to keep heat distribution similar (see the sketch below)
• Cheap, because all metadata is in RADOS
• Maintains locality
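A toy sketch of the balancing idea; the names, the heat metric, and the selection policy are all invented for illustration, and the real MDS balancer is considerably more involved:

from dataclasses import dataclass

@dataclass
class Subtree:
    path: str
    heat: float          # popularity counter maintained per subtree

def pick_exports(subtrees, my_load, target_load):
    """Choose subtrees to migrate away until our load is close to the target."""
    exports, excess = [], my_load - target_load
    for st in sorted(subtrees, key=lambda s: s.heat, reverse=True):
        if excess <= 0:
            break
        if st.heat <= excess:          # don't overshoot the target
            exports.append(st)
            excess -= st.heat
    return exports

mine = [Subtree('/home', 40.0), Subtree('/scratch', 25.0), Subtree('/archive', 5.0)]
print([s.path for s in pick_exports(mine, my_load=70.0, target_load=35.0)])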

25 The Ceph MDS – Persistence

All metadata is written to RADOS
• And changes are only visible once they are in RADOS

26 The Ceph MDS – Clustering Benefits

Dynamic adjustment to metadata workloads
• Replicate hot data to distribute workload

Dynamic cluster sizing:
• Add nodes as you wish
• Decommission old nodes at any time

Recover quickly and easily from failures

27–31 DYNAMIC SUBTREE PARTITIONING [diagram sequence]

32 Does it work?

33 It scales!

34 It redistributes!

35 Cool Extras: Besides POSIX compliance and scaling

36 Snapshots

$ mkdir foo/.snap/one          # create snapshot
$ ls foo/.snap
one
$ ls foo/bar/.snap
_one_1099511627776             # parent's snap name is mangled
$ rm foo/myfile
$ ls -F foo
bar/
$ ls foo/.snap/one
myfile  bar/
$ rmdir foo/.snap/one          # remove snapshot

37 Recursive statistics

$ ls -alSh
total 0
drwxr-xr-x 1 root       root      9.7T 2011-02-04 15:51 .
drwxr-xr-x 1 root       root      9.7T 2010-12-16 15:06 ..
drwxr-xr-x 1 pomceph    pg4194980 9.6T 2011-02-24 08:25 pomceph
drwx--x--- 1 fuzyceph   adm       1.5G 2011-01-18 10:46 fuzyceph
drwxr-xr-x 1 dallasceph pg275     596M 2011-01-14 10:06 dallasceph
$ getfattr -d -m ceph. pomceph
# file: pomceph
ceph.dir.entries="39"
ceph.dir.files="37"
ceph.dir.rbytes="10550153946827"
ceph.dir.rctime="1298565125.590930000"
ceph.dir.rentries="2454401"
ceph.dir.rfiles="1585288"
ceph.dir.rsubdirs="869113"
ceph.dir.subdirs="2"
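Since the recursive statistics are exposed as virtual xattrs (as shown above), they can also be read programmatically. A small sketch using Python's os.getxattr on Linux; the mount path is illustrative:

import os

def rstats(path):
    names = ['ceph.dir.rbytes', 'ceph.dir.rentries', 'ceph.dir.rfiles',
             'ceph.dir.rsubdirs', 'ceph.dir.rctime']
    # Each virtual xattr comes back as a byte string, e.g. b'10550153946827'.
    return {name: os.getxattr(path, name).decode() for name in names}

print(rstats('/mnt/cephfs/pomceph'))   # path is illustrative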

38 Different Storage strategies

• Set a “virtual xattr” on a directory, and all new files underneath it follow that layout (see the sketch after this list)
• Layouts can specify lots of detail about storage:
  • which pool file data goes into
  • how large file objects and stripes are
  • how many objects are in a stripe set

• So in one cluster you can use:
  • one slow pool with big objects for Hadoop workloads
  • one fast pool with little objects for a scratch space
  • one slow pool with small objects for home directories
  • or whatever else makes sense...
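A hedged sketch of steering a directory at a different pool through that virtual-xattr interface; it assumes the ceph.dir.layout.* xattr names and an already-mounted filesystem, and the pool/path names are made up:

import os

scratch = '/mnt/cephfs/scratch'                                   # illustrative path
# Assumed virtual xattr names; new files created under `scratch` inherit this layout.
os.setxattr(scratch, 'ceph.dir.layout.pool', b'fast-ssd-pool')    # which pool file data goes into
os.setxattr(scratch, 'ceph.dir.layout.object_size', b'4194304')   # 4 MiB objects
os.setxattr(scratch, 'ceph.dir.layout.stripe_count', b'1')        # objects per stripe set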

39 CephFS: Important data structures

40 Directory objects

• One (or more!) per directory
• Deterministically named: <directory inode number>.<fragment id>
• Embeds dentries and inodes for each child of the folder
• Contains a potentially-stale versioned backtrace (path location)
• Located in the metadata pool

41 File objects

• One or more per file
• Deterministically named: <file inode number>.<object index>
• The first object contains a potentially-stale versioned backtrace
• Located in any of the data pools

42 MDS log (objects)

The MDS fully journals all metadata operations. The log is chunked across objects.
• Deterministically named: <journal inode number>.<object index>
• Log objects may or may not be replayable if previous entries are lost
  • each entry contains what it needs, but e.g. a file move can depend on a previous rename entry
• Located in the metadata pool

43 MDSTable objects

• Single objects
• SessionMap (per-MDS)
  • stores the state of each client Session
  • particularly: preallocated inodes for each client
• InoTable (per-MDS)
  • tracks which inodes are available to allocate
  • (this is not a traditional inode mapping table or similar)
• SnapTable (shared)
  • tracks system snapshot IDs and their state (in use, pending create/delete)

• All located in the metadata pool

44 CephFS Metadata update flow

45 Client Sends Request: the CLIENT sends a “create dir” request to the MDS. The OSDs currently hold log objects log.1–log.3 and directory objects dir.1–dir.3.

46 MDS Processes Request (“early reply” and journaling): the MDS sends the client an Early Reply and issues a Journal Write to the OSDs.

47 MDS Processes Request (journaling and safe reply): the OSDs return a Journal ack (a new log.4 object now exists), and the MDS sends the client a Safe Reply.

48 …time passes… the journal grows (log.4 plus new segments log.5–log.8) while the directory objects dir.1–dir.3 remain unchanged.

49 MDS Flushes Log: the MDS issues a Directory Write to the OSDs.

50 MDS Flushes Log: the OSDs return a Write ack; a new directory object dir.4 now exists.

51 MDS Flushes Log: with the directory safely written, the MDS issues a Log Delete; log.4 is removed, leaving log.5–log.8 and dir.1–dir.4. (A toy sketch of this whole flow follows.)
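To summarize slides 45–51, here is a toy, invented-name simulation of the flow; it only models the ordering of early reply, journal write, safe reply, and the later flush, not any real Ceph interfaces:

class ToyMDS:
    def __init__(self):
        self.cache = {}       # in-memory metadata
        self.journal = []     # stands in for the log.N objects in RADOS
        self.backing = {}     # stands in for the dir.N objects in RADOS

    def handle_mkdir(self, path):
        self.cache[path] = {'type': 'dir'}
        print('early reply to client (update visible, not yet durable)')
        self.journal.append(('mkdir', path))       # journal write to the OSDs
        print('journal acked by OSDs -> safe reply to client')

    def flush_log(self):
        # Later, in the background: write the dirty metadata to directory
        # objects; only then can the covering journal segments be deleted.
        for _op, path in self.journal:
            self.backing[path] = self.cache[path]  # directory write + ack
        self.journal.clear()                       # log delete

mds = ToyMDS()
mds.handle_mkdir('/foo')
mds.flush_log()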

52 Traditional fsck

53 e2fsck

Drawn from “ffsck: The Fast File System Checker” (FAST ’13) and http://mm.iit.uni-miskolc.hu/Data/texts/Linux/SAG/node92.html

54 Key data structures & checks

• Superblock
  • check free block count, free inode count
• Data bitmap: blocks marked free are not in use
• Inode bitmap: inodes marked free are not in use
• Directories
  • inodes are allocated, reasonable, and in-tree
• Inodes
  • consistent internal state
  • link counts
  • blocks claimed are valid and unique

55 Procedure

• Pass 1: iterate over all inodes
  • check self-consistency
  • build up maps of in-use blocks/inodes/etc
  • correct any issues with doubly-allocated blocks
• Pass 2: iterate over all directories
  • check dentry validity and that all referenced inodes are valid
  • cache a tree structure
• Pass 3: check directory connectivity in-memory
• Pass 4: check inode reference counts in-memory
• Pass 5: check cached maps against on-disk maps and overwrite if needed
(A conceptual sketch of the passes follows.)
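A conceptual Python outline of those passes over a toy in-memory model; the data layout and issue reporting are invented, not real ext2 structures:

def fsck(inodes, directories, disk_block_bitmap, disk_inode_bitmap):
    """inodes: {ino: {'blocks': [...], 'links': n}}; directories: {ino: {'entries': {name: ino}}}."""
    problems = []
    used_blocks, link_counts = set(), {}

    # Pass 1: per-inode self-consistency; build the in-use block map.
    for ino, inode in inodes.items():
        for blk in inode['blocks']:
            if blk in used_blocks:
                problems.append(('doubly-allocated block', blk))
            used_blocks.add(blk)
        link_counts[ino] = 0

    # Pass 2: dentries must reference allocated, valid inodes.
    for d in directories.values():
        for name, ino in d['entries'].items():
            if ino not in inodes:
                problems.append(('dangling dentry', name))
            else:
                link_counts[ino] += 1

    # Pass 3 would check directory connectivity from the root (omitted here).

    # Pass 4: recorded link counts must match what the tree references.
    for ino, inode in inodes.items():
        if inode['links'] != link_counts[ino]:
            problems.append(('bad link count', ino))

    # Pass 5: the cached maps must match the on-disk bitmaps.
    if used_blocks != disk_block_bitmap or set(inodes) != disk_inode_bitmap:
        problems.append(('bitmap mismatch', None))
    return problems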

56 CephFS fsck: What it needs to do

57 RADOS is different than a disk

• We look at objects, not disk blocks
• And we can’t lose them (at least, not in a way we can recover)
• We can deterministically identify all pieces of file data
  • and the inode they belong to!
• It is not feasible to keep all metadata in memory at once
• Data loss is the result of:
  • bugs in the system,
  • simultaneous catastrophic failure of RADOS (probably losing lots of random data),
  • or bitrot

58 Failure detection: “forward scrub”

• Intended to catch tree inconsistencies caused by bugs or bitrot
• Runs continuously in the background
• Traverses the tree and makes sure referents agree in both directions

59 Catastrophic failure repair: backwards repair

• Repair the tree after a catastrophic failure, or after scrub detects an issue
• Run only with administrator intervention
• Could affordably be offline-only

60 Forward Scrub (being implemented)

61 Goal: Check the tree

In the background, examine the entire filesystem hierarchy and make sure it’s self-consistent.

62 File objects

• Every inode we find has the correct data objects
• The inode location and the data object backtrace are self-consistent
  • slightly tricky, since they can be stale
• (Optionally) The file size is consistent with what objects exist

63 Directory Objects

• The directory objects are self-consistent with their summaries and actual contents (the rstats)
• The directory’s backtrace and actual parent directory are self-consistent
  • This covers any bugs causing double-links

64 Constraints

• Limited memory: we can’t hold all the metadata in memory at once
• Scalable: we have an MDS cluster, and this needs to work within that framework

65 On-disk data structure changes

• inode_t gets a scrub_stamp (time) and a scrub_version to keep track of the last time it was scrubbed
• frag_t gets a scrub_stamp and scrub_version for both “local” and “recursive” scrubs
  • Directories can be “fragmented” into multiple pieces; frag_t holds the metadata about a given directory fragment
• These values can be reported to the user via our rstats “virtual xattr” interface
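The real fields live in the C++ inode_t and frag_t structures; a Python-flavored sketch of the bookkeeping being added:

from dataclasses import dataclass, field

@dataclass
class ScrubInfo:
    scrub_stamp: float = 0.0     # wall-clock time of the last completed scrub
    scrub_version: int = 0       # metadata version that scrub covered

@dataclass
class FragScrubInfo:
    local: ScrubInfo = field(default_factory=ScrubInfo)       # this fragment itself
    recursive: ScrubInfo = field(default_factory=ScrubInfo)   # everything beneath it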

66 Algorithm Design

• Obviously we’re doing a depth-first search:
  • This means in the worst case we restrict memory usage to O(tree depth)
• Validating files first means that when we validate directory rstats, they’re not about to get changed
• Basic strategy: construct a stack of “CDentry”s to scrub, and when you scrub a directory, push its contents onto the top of the stack first
• But we want to limit it so we don’t explode memory usage

67 ScrubStack

• Examine the dentry
• If it’s a file: scrub it directly and record the scrub stamp/version
• If it’s a directory:
  • on first access, generate a list of all dentries, segregated by directory/file status
    • this lets us kick out the CDentry, CInode, and CDir objects we had to read in to get the list
  • on first access, note the current version and time
  • push the next unscrubbed directory fragment onto the stack and restart
  • if there are no frags left, push files onto the stack and restart
  • when all files are scrubbed, scrub the directory frags and record the directory as scrubbed as of the start stamp/version values
• If we hit a tree owned by another MDS, spin off a request to scrub that tree and move on to another area of our own hierarchy until it’s completed

(A toy sketch of this traversal follows.)
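An invented-name toy of that traversal for a single MDS; the cross-MDS hand-off, directory fragments, and cache eviction are elided, and the actual object checks are stubbed out:

import time

class Dentry:
    def __init__(self, name, children=None, version=1):
        self.name, self.version = name, version
        self.children = children          # None => file, list => directory
        self.started = False
        self.scrub_stamp = self.scrub_version = None

    def is_file(self):
        return self.children is None

def scrub(root):
    stack = [root]
    while stack:
        dn = stack.pop()
        if dn.is_file():
            # File: check data objects / backtrace (elided), then record the scrub.
            dn.scrub_stamp, dn.scrub_version = time.time(), dn.version
            continue
        if not dn.started:
            # First visit: split children into dirs and files, remember where we started.
            dn.pending_dirs = [c for c in dn.children if not c.is_file()]
            dn.pending_files = [c for c in dn.children if c.is_file()]
            dn.start_stamp, dn.start_version = time.time(), dn.version
            dn.started = True
        if dn.pending_dirs:
            stack += [dn, dn.pending_dirs.pop()]   # revisit dn after this subtree
        elif dn.pending_files:
            stack.append(dn)
            stack += dn.pending_files
            dn.pending_files = []
        else:
            # Everything underneath is scrubbed: record the directory as of the start values.
            dn.scrub_stamp, dn.scrub_version = dn.start_stamp, dn.start_version

# tiny usage example
root = Dentry('/', [Dentry('etc', [Dentry('passwd')]), Dentry('README')])
scrub(root)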

68 How’s it scale?

Well, each MDS scrubs the data for which it’s authoritative. There’s only interaction at the boundaries, and there’s a lot of machinery around that which makes it fairly easy.

69 Backwards scrub/repair (being designed)

70 Goal: repair the tree

Once we know that there’s been some failure (either due to catastrophic data loss or a serious bug), examine the raw RADOS state and get all the data back into the filesystem hierarchy. Make all referents consistent with each other.

71 File objects

• Given an object name and location, we know if it’s a file object and which inode it belongs to
• The first file object for an inode has a backtrace of its location

72 Directory Objects

• Given an object name and location, we know if it’s a directory object, and for which directory inode
• The directory has a backtrace on it
• The directory has (versioned) forward links to all children

73 MDS Logs

The MDS logs can contain much newer versions of inodes than any of the backing objects
• e.g., a file got renamed into a directory on a different MDS and it hasn’t been flushed yet

74 Constraints

• Limited memory: we can’t hold all the metadata in memory at once
• Scalable: we have an MDS cluster, and this needs to work within that framework

75 Algorithm Design

Well, we haven’t finished this yet…but we have lots of ideas!

76 Building blocks

• Forward scrub: lets us identify things we know are in the filesystem (potentially: tag them while doing so), and find things we know are missing but shouldn’t be
• Backtraces: give us (possibly stale) snapshots of where in the tree each directory or file lives
  • importantly, these are versioned! So we can use the backtrace from one file to update the older backtrace from another (see the sketch after this list)
• RADOS object listing: we can list every object in the filesystem
  • Even better: we can inject “filters” to restrict the listing to the first file object, or to objects which we have not tagged in a previous pass
• MDS logs: each log segment contains a lot of info about the objects it changes
• RADOS operations:
  • snapshots could be useful
  • you can store objects “next to” other objects: create a scratch space!
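A toy illustration of using backtrace versions to reconcile ancestry information; the data layout here is invented, since real backtraces are encoded structures stored on the objects themselves:

def merge_backtraces(known, backtrace):
    """known: {dir_ino: (version, parent_ino, name)}.
    backtrace: list of (dir_ino, version, parent_ino, name) tuples taken from one
    file or directory object, ordered from the immediate parent up toward the root."""
    for dir_ino, version, parent_ino, name in backtrace:
        if dir_ino not in known or known[dir_ino][0] < version:
            known[dir_ino] = (version, parent_ino, name)   # newer version wins
    return known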

77 Check what we know we have

• Flush the MDS logs out to disk
  • We can probably identify whether logs are whole or not: they’re numbered consecutively, and we can check objects on the boundary against any lost PGs
  • Missing transactions are likely to be reconstructable from context: if the path for an inode changes unexpectedly, we can adjust to that!
  • Cross-MDS transactions are either renames across authoritative zones (can be resolved by looking at the journals together) or transfers of authority (we can make up new boundaries)
• Have each MDS scrub its auth data, and tag all reached objects
• If we expect to find objects and don’t, or they are broken, add them to a fix list

78 Look at the raw objects

• Run a (filtered?) listing of all objects in the metadata and data pools
  • We can filter out stuff that was tagged if we want; and depending on thoroughness we can skip objects without a backtrace
  • This can scale in several ways (e.g. “map” along PG lines, “reduce” based on guessed authority from backtraces)
• For each found object we don’t have in the hierarchy, attempt to place it in the tree based on its backtrace (a sketch follows this list)
  • is it so new that the dentry never got flushed out (or was lost in a busted journal)?
  • does it belong to a missing directory object?
• Create “phantom” directories based on backtrace contents, if needed
• If prior guessed data conflicts with new data, take the one with the newer version
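An invented-structure sketch of that placement step, including phantom directories; none of the names or data shapes correspond to real Ceph code:

def place_orphan(tree, ino, backtrace):
    """tree: {parent_ino: {name: child_ino}}.
    backtrace: list of (ancestor_ino, name) pairs from the immediate parent up to the root."""
    child = ino
    for parent_ino, name in backtrace:
        entries = tree.setdefault(parent_ino, {})   # creates a phantom directory if missing
        if entries.get(name) not in (None, child):
            # Conflict with an earlier reconstruction: in the real design the newer
            # backtrace version would win; this toy just keeps the existing link.
            break
        entries[name] = child
        child = parent_ino                          # continue linking upward
    return tree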

79 Try again!

• After doing both a hierarchy scan and a deep scan, and fixing things up, we should be able to find everything in the system. Run it again to make sure.

80 How do we store repaired data?

We have a few options here:
• use RADOS snapshots so we don’t change the original data at all, and write updates in place
• use the “locator” functionality to explicitly create our own in-flight repair objects next to the original data until we can check it and flush back to the original data
• or something more complicated? (object class code, explicitly copying all dentries within a directory object, etc.)

81 How would repair scale?

• Scrubbing scales across MDS authority zones as before
  • Note that if necessary, we can spin up new MDSes with new authority zones to scale out the checking
• Each MDS can try to repair data for which it is authoritative, and pass along objects it finds to belong elsewhere
  • and request stuff it thinks it should own from its peers, too
• Obviously a mapping from the raw RADOS listing to these authoritative zones is required:
  • partition the PG space up evenly between MDSes, let each one handle the listing
  • chop up the raw data into expected authority zones based on backtraces (probably log them to an area in RADOS)
  • ship off lists to the proper authoritative MDS

82 Questions?

• #ceph-devel on irc.oftc.net
  • I’m gregsfortytwo
• [email protected]

• Greg Farnum
• [email protected]
• @gregsfortytwo
