1 CephFS fsck: Distributed Filesystem Checking

2 Hi, I’m Greg

Greg Farnum CephFS Tech Lead, Red Hat [email protected]

Been working as a core Ceph developer since June 2009

3–4 What is Ceph?

An awesome, software-based, scalable, distributed storage system that is designed for failures

• Object storage (our native API)
• Block devices (kernel, QEMU/KVM, others)
• RESTful S3 & Swift API object store
• POSIX filesystem

5 The Ceph stack:
• LIBRADOS: a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
• RADOSGW: a bucket-based REST gateway, compatible with S3 and Swift
• RBD: a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
• CEPH FS: a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
• RADOS: a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes

RADOS = Reliable Autonomic Distributed Object Store

6 What is CephFS?

An awesome, software-based, scalable, distributed, POSIX-compliant filesystem that is designed for failures

7 RADOS: a user perspective

8 Objects in RADOS

[Diagram: a RADOS object holds byte data, xattrs (e.g. version: 1), and an omap key/value store (e.g. foo -> bar, baz -> qux)]

9 The librados API

C, C++, Python, Java, shell. File-like API:
• read/write (extent), truncate, remove; get/set/remove xattr or key
• efficient copy-on-write clone
• snapshots: single object or pool-wide
• atomic compound operations/transactions
  • read + getxattr, write + setxattr
  • compare xattr value, if match write + setxattr
• “object classes”
  • load new code into the cluster to implement new methods
  • calc sha1, filter, generate thumbnail
  • encrypt, increment, rotate image
  • implement your own access mechanisms (HDF5 on the node)
• watch/notify: use an object as a communication channel between clients (locking primitive)
• pgls: list the objects within a placement group
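As a hedged illustration, here is a minimal sketch using the python-rados bindings; the pool name 'mypool' and object name 'greeting' are made up, and error handling is omitted:

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('mypool')          # hypothetical pool name
    try:
        ioctx.write_full('greeting', b'hello')    # replace the whole object
        ioctx.set_xattr('greeting', 'version', b'1')
        data = ioctx.read('greeting')             # offset/length arguments select an extent
        version = ioctx.get_xattr('greeting', 'version')
        print(data, version)
    finally:
        ioctx.close()
finally:
    cluster.shutdown()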

10 The RADOS Cluster

[Diagram: a CLIENT, the Object Storage Devices (OSDs), and three monitors (M)]

11 [Diagram: objects 1–4 grouped into Pool 1 and Pool 2, stored across the Object Storage Devices (OSDs); M = Monitor]

12 Data placement:
• objects map to placement groups: hash(object name) % num pg
• placement groups map to OSDs: CRUSH(pg, cluster state, rule set)
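To make the two-step mapping concrete, here is a purely conceptual Python sketch; the real system uses a Jenkins-style hash and the CRUSH algorithm over a weighted device hierarchy, neither of which is reproduced here:

import hashlib

def object_to_pg(object_name: str, num_pgs: int) -> int:
    # Step 1: hash(object name) % num pg
    digest = hashlib.md5(object_name.encode()).hexdigest()
    return int(digest, 16) % num_pgs

def pg_to_osds(pg: int, osds: list, replicas: int = 3) -> list:
    # Step 2 stand-in for CRUSH(pg, cluster state, rule set): deterministically
    # pick `replicas` distinct OSDs for this PG.
    return [osds[(pg + i) % len(osds)] for i in range(replicas)]

pg = object_to_pg('myobject', num_pgs=128)
print(pg, pg_to_osds(pg, osds=['osd.0', 'osd.1', 'osd.2', 'osd.3', 'osd.4']))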

13 RADOS data guarantees

• Any write acked as safe will be visible to all subsequent readers
• Any write ever visible to a reader will be visible to all subsequent readers
• Any write acked as safe will not be lost unless the whole containing PG is lost
• A PG will not be lost unless all N copies are lost (N is admin-configured, usually 3)…
• …and in case of OSD failure the system will try to bring you back up to N copies (no user intervention required)

14 RADOS data guarantees

• Data is regularly scrubbed to ensure copies are consistent with each other, and administrators are alerted if inconsistencies arise
• …and while it’s not automated, it’s usually easy to identify the correct data with “majority voting” or similar
• Maintaining checksums gives real certainty here, and we think that is the future

15 CephFS System Design

16 CephFS Design Goals

• Infinitely scalable
• Avoid all Single Points Of Failure
• Self-managing

17 [Diagram: a CLIENT, its data (01 10), the Metadata Server (MDS), and the monitors (M)]

18 Scaling Metadata

So we have to use multiple MetaData Servers (MDSes)

Two issues:
• Storage of the metadata
• Ownership of the metadata

19 Scaling Metadata – Storage

Some systems store metadata on the MDS system itself

But that’s a Single Point Of Failure!

• Hot standby?
• External metadata storage ✓

20 Scaling Metadata – Ownership

Traditionally: assign hierarchies manually to each MDS
• But if workloads change, your nodes can become unbalanced

Newer: hash directories onto MDSes
• But then clients have to jump around for every folder traversal

21–22 [Diagrams: one tree, two metadata servers]

23 The Ceph Metadata Server

Key insight: If metadata is stored in RADOS, ownership should be impermanent

One MDS is authoritative over any given subtree, but...
• That MDS doesn’t need to keep the whole tree in memory
• There’s no reason the authoritative MDS can’t be changed!

24 The Ceph MDS – Partitioning

Cooperative partitioning between servers:
• Keep track of how hot metadata is
• Migrate subtrees to keep heat distribution similar (see the sketch below)
• Cheap, because all metadata is in RADOS
• Maintains locality
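A toy sketch of the balancing idea; the names, the heat metric, and the selection policy are all invented for illustration, and the real MDS balancer is considerably more involved:

from dataclasses import dataclass

@dataclass
class Subtree:
    path: str
    heat: float          # popularity counter maintained per subtree

def pick_exports(subtrees, my_load, target_load):
    """Choose subtrees to migrate away until our load is close to the target."""
    exports, excess = [], my_load - target_load
    for st in sorted(subtrees, key=lambda s: s.heat, reverse=True):
        if excess <= 0:
            break
        if st.heat <= excess:          # don't overshoot the target
            exports.append(st)
            excess -= st.heat
    return exports

mine = [Subtree('/home', 40.0), Subtree('/scratch', 25.0), Subtree('/archive', 5.0)]
print([s.path for s in pick_exports(mine, my_load=70.0, target_load=35.0)])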

25 The Ceph MDS – Persistence

All metadata is written to RADOS
• And changes are only visible once they are in RADOS

26 The Ceph MDS – Clustering Benefits

Dynamic adjustment to metadata workloads
• Replicate hot data to distribute workload

Dynamic cluster sizing:
• Add nodes as you wish
• Decommission old nodes at any time

Recover quickly and easily from failures

27–31 DYNAMIC SUBTREE PARTITIONING [diagram sequence]

32 Does it work?

33 It scales!

34 It redistributes!

35 Cool Extras: Besides POSIX compliance and scaling

36 Snapshots

$ mkdir foo/.snap/one          # create snapshot
$ ls foo/.snap
one
$ ls foo/bar/.snap
_one_1099511627776             # parent's snap name is mangled
$ rm foo/myfile
$ ls -F foo
bar/
$ ls foo/.snap/one
myfile  bar/
$ rmdir foo/.snap/one          # remove snapshot

37 Recursive statistics

$ ls -alSh
total 0
drwxr-xr-x 1 root       root      9.7T 2011-02-04 15:51 .
drwxr-xr-x 1 root       root      9.7T 2010-12-16 15:06 ..
drwxr-xr-x 1 pomceph    pg4194980 9.6T 2011-02-24 08:25 pomceph
drwx--x--- 1 fuzyceph   adm       1.5G 2011-01-18 10:46 fuzyceph
drwxr-xr-x 1 dallasceph pg275     596M 2011-01-14 10:06 dallasceph
$ getfattr -d -m ceph. pomceph
# file: pomceph
ceph.dir.entries="39"
ceph.dir.files="37"
ceph.dir.rbytes="10550153946827"
ceph.dir.rctime="1298565125.590930000"
ceph.dir.rentries="2454401"
ceph.dir.rfiles="1585288"
ceph.dir.rsubdirs="869113"
ceph.dir.subdirs="2"
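Since the recursive statistics are exposed as virtual xattrs (as shown above), they can also be read programmatically. A small sketch using Python's os.getxattr on Linux; the mount path is illustrative:

import os

def rstats(path):
    names = ['ceph.dir.rbytes', 'ceph.dir.rentries', 'ceph.dir.rfiles',
             'ceph.dir.rsubdirs', 'ceph.dir.rctime']
    # Each virtual xattr comes back as a byte string, e.g. b'10550153946827'.
    return {name: os.getxattr(path, name).decode() for name in names}

print(rstats('/mnt/cephfs/pomceph'))   # path is illustrative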

38 Different Storage strategies

• Set a “virtual xattr” on a directory, and all new files underneath it follow that layout (see the sketch after this list)
• Layouts can specify lots of detail about storage:
  • which pool file data goes into
  • how large file objects and stripes are
  • how many objects are in a stripe set

• So in one cluster you can use:
  • one slow pool with big objects for Hadoop workloads
  • one fast pool with little objects for a scratch space
  • one slow pool with small objects for home directories
  • or whatever else makes sense...
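A hedged sketch of steering a directory at a different pool through that virtual-xattr interface; it assumes the ceph.dir.layout.* xattr names and an already-mounted filesystem, and the pool/path names are made up:

import os

scratch = '/mnt/cephfs/scratch'                                   # illustrative path
# Assumed virtual xattr names; new files created under `scratch` inherit this layout.
os.setxattr(scratch, 'ceph.dir.layout.pool', b'fast-ssd-pool')    # which pool file data goes into
os.setxattr(scratch, 'ceph.dir.layout.object_size', b'4194304')   # 4 MiB objects
os.setxattr(scratch, 'ceph.dir.layout.stripe_count', b'1')        # objects per stripe set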

39 CephFS: Important data structures

40 Directory objects

• One (or more!) per directory
• Deterministically named: <directory inode number>.<fragment id>
• Embeds dentries and inodes for each child of the folder
• Contains a potentially-stale versioned backtrace (path location)
• Located in the metadata pool

41 File objects

• One or more per file
• Deterministically named: <file inode number>.<object index>
• The first object contains a potentially-stale versioned backtrace
• Located in any of the data pools

42 MDS log (objects)

The MDS fully journals all metadata operations. The log is chunked across objects.
• Deterministically named: <journal inode number>.<object index>
• Log objects may or may not be replayable if previous entries are lost
  • each entry contains what it needs, but e.g. a file move can depend on a previous rename entry
• Located in the metadata pool

43 MDSTable objects

• Single objects
• SessionMap (per-MDS)
  • stores the state of each client Session
  • particularly: preallocated inodes for each client
• InoTable (per-MDS)
  • tracks which inodes are available to allocate
  • (this is not a traditional inode mapping table or similar)
• SnapTable (shared)
  • tracks system snapshot IDs and their state (in use, pending create/delete)

• All located in the metadata pool

44 CephFS Metadata update flow

45 Client Sends Request: the CLIENT sends a “create dir” request to the MDS. The OSDs currently hold log objects log.1–log.3 and directory objects dir.1–dir.3.

46 MDS Processes Request (“early reply” and journaling): the MDS sends the client an Early Reply and issues a Journal Write to the OSDs.

47 MDS Processes Request (journaling and safe reply): the OSDs return a Journal ack (a new log.4 object now exists), and the MDS sends the client a Safe Reply.

48 …time passes… the journal grows (log.4 plus new segments log.5–log.8) while the directory objects dir.1–dir.3 remain unchanged.

49 MDS Flushes Log: the MDS issues a Directory Write to the OSDs.

50 MDS Flushes Log: the OSDs return a Write ack; a new directory object dir.4 now exists.

51 MDS Flushes Log: with the directory safely written, the MDS issues a Log Delete; log.4 is removed, leaving log.5–log.8 and dir.1–dir.4. (A toy sketch of this whole flow follows.)
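To summarize slides 45–51, here is a toy, invented-name simulation of the flow; it only models the ordering of early reply, journal write, safe reply, and the later flush, not any real Ceph interfaces:

class ToyMDS:
    def __init__(self):
        self.cache = {}       # in-memory metadata
        self.journal = []     # stands in for the log.N objects in RADOS
        self.backing = {}     # stands in for the dir.N objects in RADOS

    def handle_mkdir(self, path):
        self.cache[path] = {'type': 'dir'}
        print('early reply to client (update visible, not yet durable)')
        self.journal.append(('mkdir', path))       # journal write to the OSDs
        print('journal acked by OSDs -> safe reply to client')

    def flush_log(self):
        # Later, in the background: write the dirty metadata to directory
        # objects; only then can the covering journal segments be deleted.
        for _op, path in self.journal:
            self.backing[path] = self.cache[path]  # directory write + ack
        self.journal.clear()                       # log delete

mds = ToyMDS()
mds.handle_mkdir('/foo')
mds.flush_log()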

52 Traditional fsck

53 e2fsck

Drawn from “ffsck: The Fast File System Checker” (FAST ’13) and http://mm.iit.uni-miskolc.hu/Data/texts/Linux/SAG/node92.html

54 Key data structures & checks

• Superblock
  • check free block count, free inode count
• Data bitmap: blocks marked free are not in use
• Inode bitmap: inodes marked free are not in use
• Directories
  • inodes are allocated, reasonable, and in-tree
• Inodes
  • consistent internal state
  • link counts
  • blocks claimed are valid and unique

55 Procedure

• Pass 1: iterate over all inodes
  • check self-consistency
  • build up maps of in-use blocks/inodes/etc
  • correct any issues with doubly-allocated blocks
• Pass 2: iterate over all directories
  • check dentry validity and that all referenced inodes are valid
  • cache a tree structure
• Pass 3: check directory connectivity in-memory
• Pass 4: check inode reference counts in-memory
• Pass 5: check cached maps against on-disk maps and overwrite if needed
(A conceptual sketch of the passes follows.)
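A conceptual Python outline of those passes over a toy in-memory model; the data layout and issue reporting are invented, not real ext2 structures:

def fsck(inodes, directories, disk_block_bitmap, disk_inode_bitmap):
    """inodes: {ino: {'blocks': [...], 'links': n}}; directories: {ino: {'entries': {name: ino}}}."""
    problems = []
    used_blocks, link_counts = set(), {}

    # Pass 1: per-inode self-consistency; build the in-use block map.
    for ino, inode in inodes.items():
        for blk in inode['blocks']:
            if blk in used_blocks:
                problems.append(('doubly-allocated block', blk))
            used_blocks.add(blk)
        link_counts[ino] = 0

    # Pass 2: dentries must reference allocated, valid inodes.
    for d in directories.values():
        for name, ino in d['entries'].items():
            if ino not in inodes:
                problems.append(('dangling dentry', name))
            else:
                link_counts[ino] += 1

    # Pass 3 would check directory connectivity from the root (omitted here).

    # Pass 4: recorded link counts must match what the tree references.
    for ino, inode in inodes.items():
        if inode['links'] != link_counts[ino]:
            problems.append(('bad link count', ino))

    # Pass 5: the cached maps must match the on-disk bitmaps.
    if used_blocks != disk_block_bitmap or set(inodes) != disk_inode_bitmap:
        problems.append(('bitmap mismatch', None))
    return problems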

56 CephFS fsck: What it needs to do

57 RADOS is different than a disk

• We look at objects, not disk blocks
• And we can’t lose them (at least, not in a way we can recover)
• We can deterministically identify all pieces of file data
  • and the inode they belong to!
• It is not feasible to keep all metadata in memory at once
• Data loss is the result of:
  • bugs in the system,
  • simultaneous catastrophic failure of RADOS (probably losing lots of random data),
  • or bitrot

58 Failure detection: “forward scrub”

• Intended to catch tree inconsistencies caused by bugs or bitrot
• Runs continuously in the background
• Traverses the tree and makes sure referents agree in both directions

59 Catastrophic failure repair: backwards repair

• Repair the tree after a catastrophic failure, or after scrub detects an issue
• Run only with administrator intervention
• Could affordably be offline-only

60 Forward Scrub (being implemented)

61 Goal: Check the tree

In the background, examine the entire filesystem hierarchy and make sure it’s self-consistent.

62 File objects

• Every inode we find has the correct data objects
• The inode location and the data object backtrace are self-consistent
  • slightly tricky, since they can be stale
• (Optionally) The file size is consistent with what objects exist

63 Directory Objects

• The directory objects are self-consistent with their summaries and actual contents (the rstats)
• The directory’s backtrace and actual parent directory are self-consistent
  • This covers any bugs causing double-links

64 Constraints

• Limited memory: we can’t hold all the metadata in memory at once
• Scalable: we have an MDS cluster, and this needs to work within that framework

65 On-disk data structure changes

• inode_t gets a scrub_stamp (time) and a scrub_version to keep track of the last time it was scrubbed
• frag_t gets a scrub_stamp and scrub_version for both “local” and “recursive” scrubs
  • Directories can be “fragmented” into multiple pieces; frag_t holds the metadata about a given directory fragment
• These values can be reported to the user via our rstats “virtual xattr” interface
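The real fields live in the C++ inode_t and frag_t structures; a Python-flavored sketch of the bookkeeping being added:

from dataclasses import dataclass, field

@dataclass
class ScrubInfo:
    scrub_stamp: float = 0.0     # wall-clock time of the last completed scrub
    scrub_version: int = 0       # metadata version that scrub covered

@dataclass
class FragScrubInfo:
    local: ScrubInfo = field(default_factory=ScrubInfo)       # this fragment itself
    recursive: ScrubInfo = field(default_factory=ScrubInfo)   # everything beneath it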

66 Algorithm Design

• Obviously we’re doing a depth-first search:
  • This means in the worst case we restrict memory usage to O(tree depth)
• Validating files first means that when we validate directory rstats, they’re not about to get changed
• Basic strategy: construct a stack of “CDentry”s to scrub, and when you scrub a directory, push its contents onto the top of the stack first
• But we want to limit it so we don’t explode memory usage

67 ScrubStack

• Examine the dentry
• If it’s a file: scrub it directly and record the scrub stamp/version
• If it’s a directory:
  • on first access, generate a list of all dentries, segregated by directory/file status
    • this lets us kick out the CDentry, CInode, and CDir objects we had to read in to get the list
  • on first access, note the current version and time
  • push the next unscrubbed directory fragment onto the stack and restart
  • if there are no frags left, push files onto the stack and restart
  • when all files are scrubbed, scrub the directory frags and record the directory as scrubbed as of the start stamp/version values
• If we hit a tree owned by another MDS, spin off a request to scrub that tree and move on to another area of our own hierarchy until it’s completed

(A toy sketch of this traversal follows.)
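An invented-name toy of that traversal for a single MDS; the cross-MDS hand-off, directory fragments, and cache eviction are elided, and the actual object checks are stubbed out:

import time

class Dentry:
    def __init__(self, name, children=None, version=1):
        self.name, self.version = name, version
        self.children = children          # None => file, list => directory
        self.started = False
        self.scrub_stamp = self.scrub_version = None

    def is_file(self):
        return self.children is None

def scrub(root):
    stack = [root]
    while stack:
        dn = stack.pop()
        if dn.is_file():
            # File: check data objects / backtrace (elided), then record the scrub.
            dn.scrub_stamp, dn.scrub_version = time.time(), dn.version
            continue
        if not dn.started:
            # First visit: split children into dirs and files, remember where we started.
            dn.pending_dirs = [c for c in dn.children if not c.is_file()]
            dn.pending_files = [c for c in dn.children if c.is_file()]
            dn.start_stamp, dn.start_version = time.time(), dn.version
            dn.started = True
        if dn.pending_dirs:
            stack += [dn, dn.pending_dirs.pop()]   # revisit dn after this subtree
        elif dn.pending_files:
            stack.append(dn)
            stack += dn.pending_files
            dn.pending_files = []
        else:
            # Everything underneath is scrubbed: record the directory as of the start values.
            dn.scrub_stamp, dn.scrub_version = dn.start_stamp, dn.start_version

# tiny usage example
root = Dentry('/', [Dentry('etc', [Dentry('passwd')]), Dentry('README')])
scrub(root)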

68 How’s it scale?

Well, each MDS scrubs the data for which it’s authoritative. There’s only interaction at the boundaries, and there’s a lot of machinery around that which makes it fairly easy.

69 Backwards scrub/repair (being designed)

70 Goal: repair the tree

Once we know that there’s been some failure (either due to catastrophic data loss or a serious bug), examine the raw RADOS state and get all the data back into the filesystem hierarchy. Make all referents consistent with each other.

71 File objects

• Given an object name and location, we know if it’s a file object and which inode it belongs to
• The first file object for an inode has a backtrace of its location

72 Directory Objects

• Given an object name and location, we know if it’s a directory object, and for which directory inode
• The directory has a backtrace on it
• The directory has (versioned) forward links to all children

73 MDS Logs

The MDS logs can contain much newer versions of inodes than any of the backing objects
• e.g., a file got renamed into a directory on a different MDS and it hasn’t been flushed yet

74 Constraints

• Limited memory: we can’t hold all the metadata in memory at once
• Scalable: we have an MDS cluster, and this needs to work within that framework

75 Algorithm Design

Well, we haven’t finished this yet…but we have lots of ideas!

76 Building blocks

• Forward scrub: lets us identify things we know are in the filesystem (potentially: tag them while doing so), and find things we know are missing but shouldn’t be
• Backtraces: give us (possibly stale) snapshots of where in the tree each directory or file lives
  • importantly, these are versioned! So we can use the backtrace from one file to update the older backtrace from another (see the sketch after this list)
• RADOS object listing: we can list every object in the filesystem
  • Even better: we can inject “filters” to restrict the listing to the first file object, or to objects which we have not tagged in a previous pass
• MDS logs: each log segment contains a lot of info about the objects it changes
• RADOS operations:
  • snapshots could be useful
  • you can store objects “next to” other objects: create a scratch space!
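A toy illustration of using backtrace versions to reconcile ancestry information; the data layout here is invented, since real backtraces are encoded structures stored on the objects themselves:

def merge_backtraces(known, backtrace):
    """known: {dir_ino: (version, parent_ino, name)}.
    backtrace: list of (dir_ino, version, parent_ino, name) tuples taken from one
    file or directory object, ordered from the immediate parent up toward the root."""
    for dir_ino, version, parent_ino, name in backtrace:
        if dir_ino not in known or known[dir_ino][0] < version:
            known[dir_ino] = (version, parent_ino, name)   # newer version wins
    return known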

77 Check what we know we have

• Flush the MDS logs out to disk
  • We can probably identify whether logs are whole or not: they’re numbered consecutively, and we can check objects on the boundary against any lost PGs
  • Missing transactions are likely to be reconstructable from context: if the path for an inode changes unexpectedly, we can adjust to that!
  • Cross-MDS transactions are either renames across authoritative zones (can be resolved by looking at the journals together) or transfers of authority (we can make up new boundaries)
• Have each MDS scrub its auth data, and tag all reached objects
• If we expect to find objects and don’t, or they are broken, add them to a fix list

78 Look at the raw objects

• Run a (filtered?) listing of all objects in the metadata and data pools
  • We can filter out stuff that was tagged if we want; and depending on thoroughness we can skip objects without a backtrace
  • This can scale in several ways (e.g. “map” along PG lines, “reduce” based on guessed authority from backtraces)
• For each found object we don’t have in the hierarchy, attempt to place it in the tree based on its backtrace (a sketch follows this list)
  • is it so new that the dentry never got flushed out (or was lost in a busted journal)?
  • does it belong to a missing directory object?
• Create “phantom” directories based on backtrace contents, if needed
• If prior guessed data conflicts with new data, take the one with the newer version
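An invented-structure sketch of that placement step, including phantom directories; none of the names or data shapes correspond to real Ceph code:

def place_orphan(tree, ino, backtrace):
    """tree: {parent_ino: {name: child_ino}}.
    backtrace: list of (ancestor_ino, name) pairs from the immediate parent up to the root."""
    child = ino
    for parent_ino, name in backtrace:
        entries = tree.setdefault(parent_ino, {})   # creates a phantom directory if missing
        if entries.get(name) not in (None, child):
            # Conflict with an earlier reconstruction: in the real design the newer
            # backtrace version would win; this toy just keeps the existing link.
            break
        entries[name] = child
        child = parent_ino                          # continue linking upward
    return tree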

79 Try again!

• After doing both a hierarchy scan and a deep scan, and fixing things up, we should be able to find everything in the system. Run it again to make sure.

80 How do we store repaired data?

We have a few options here:
• use RADOS snapshots so we don’t change the original data at all, and write updates in place
• use the “locator” functionality to explicitly create our own in-flight repair objects next to the original data until we can check it and flush back to the original data
• or something more complicated? (object class code, explicitly copying all dentries within a directory object, etc.)

81 How would repair scale?

• Scrubbing scales across MDS authority zones as before
  • Note that if necessary, we can spin up new MDSes with new authority zones to scale out the checking
• Each MDS can try to repair data for which it is authoritative, and pass along objects it finds to belong elsewhere
  • and request stuff it thinks it should own from its peers, too
• Obviously a mapping from the raw RADOS listing to these authoritative zones is required:
  • partition the PG space up evenly between MDSes, let each one handle the listing
  • chop up the raw data into expected authority zones based on backtraces (probably log them to an area in RADOS)
  • ship off lists to the proper authoritative MDS

82 Questions?

• #ceph-devel on irc.oftc.net
  • I’m gregsfortytwo
• [email protected]

• Greg Farnum
• [email protected]
• @gregsfortytwo
