Recent developments in GFS2

Steven Whitehouse
Manager, GFS2 Filesystem
LinuxCon Europe – October 2013

Topics

● Principles of operation
● Locking
● Hints and Tips
● Inodes, Directories and System files
● NFS/Samba
● What is new in GFS2?
● FIEMAP
● Discards, FITRIM and fallocate
● Tracing
● Performance
● Ease of Use
● Future Developments
● Questions

Principles of Operation

● The page cache
    ● Caches on-disk information to allow rapid access
    ● Indexed by “address space” and “page index”
    ● For GFS2 this is split between nodes and managed by glocks
    ● Local filesystems: data address space in the inode, metadata address space in the block device inode
    ● GFS2: data address space in the inode, metadata address space in the inode glock
● The glock is a GFS2 concept used to manage the cache
    ● There is a 1:1 mapping from GFS2 glocks to DLM locks
    ● For inodes, the glock manages the data and metadata caches
    ● The glock state (EX, SH, DF, UN) determines what can be cached
● Most performance problems occur when the implications of the glock system are not understood

Locking

● Three lock modes (see the mapping sketch below):
    ● Shared, Exclusive, Deferred (shared but not compatible with Shared)
    ● Deferred is used for Direct I/O to ensure it is cluster coherent
    ● The DLM’s NULL lock mode is used to maintain a ref count on LVBs (currently only used for quota data)
● Pluggable locking architecture
    ● lock_nolock
        ● A “no-op” lock manager used with local filesystems
    ● lock_dlm
        ● A full cluster-aware lock manager
        ● Works with the kernel’s DLM and Red Hat clustering software
        ● Provides VAX-like DLM semantics translated into GFS’s requirements
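To make the state-to-mode mapping concrete, here is a small illustrative sketch in C. It is not the kernel's lock_dlm code; the enum and function names are invented for this example, but the DLM_LOCK_* constants are the real ones from <linux/dlmconstants.h>.

#include <linux/dlmconstants.h>   /* DLM_LOCK_NL/CW/PR/EX */

/* Hypothetical glock states, mirroring the UN/SH/DF/EX states above */
enum glock_state { GL_UNLOCKED, GL_SHARED, GL_DEFERRED, GL_EXCLUSIVE };

/* Sketch of the 1:1 translation from glock state to DLM lock mode */
static int glock_state_to_dlm(enum glock_state state)
{
	switch (state) {
	case GL_EXCLUSIVE:
		return DLM_LOCK_EX;   /* exclusive: full read/write caching */
	case GL_SHARED:
		return DLM_LOCK_PR;   /* protected read: shared read caching */
	case GL_DEFERRED:
		return DLM_LOCK_CW;   /* concurrent write: cluster-coherent direct I/O */
	case GL_UNLOCKED:
	default:
		return DLM_LOCK_NL;   /* null: only keeps a reference on the LVB */
	}
}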

Hints and Tips

● fcntl caveat
    ● When using F_GETLK a PID will be returned, but the process might exist on any node in the cluster! Be careful
    ● Currently there is no way to work out which node the process holding the blocking locks is on
    ● We do not support leases
    ● We do not support dnotify/inotify (except on the local node)
● It is also possible to ask for “local” fcntl locking even when running as a cluster filesystem
● flock is fully supported across the cluster. Use it in preference to fcntl if your application supports it (see the example below)
● Using the DLM from an application
    ● The DLM is available for application use and this is probably a better solution than fcntl locking
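As a concrete illustration of the advice above, here is a minimal userspace sketch (the file path is a placeholder and error handling is abbreviated) showing flock(), which GFS2 makes cluster-wide, next to the F_GETLK caveat that the returned l_pid may belong to a process on a different node:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/file.h>

int main(void)
{
	int fd = open("/mnt/gfs2/shared.dat", O_RDWR | O_CREAT, 0644);
	if (fd < 0)
		return 1;

	/* Preferred: flock() is fully cluster coherent on GFS2 */
	if (flock(fd, LOCK_EX) == 0) {
		/* ... critical section ... */
		flock(fd, LOCK_UN);
	}

	/* fcntl caveat: l_pid may refer to a process on another node */
	struct flock fl = {
		.l_type = F_WRLCK,
		.l_whence = SEEK_SET,
		.l_start = 0,
		.l_len = 0,              /* whole file */
	};
	if (fcntl(fd, F_GETLK, &fl) == 0 && fl.l_type != F_UNLCK)
		printf("blocked by pid %d (possibly on another node!)\n",
		       (int)fl.l_pid);

	close(fd);
	return 0;
}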

Inodes

● Very similar to those of GFS
    ● Retains common fields at the same offsets
    ● One inode per block
    ● Spare space used for either data (“stuffed” inodes) or indirect pointers as required
    ● Indirect pointers formatted as an equal-height metadata tree (constant time access to any area of the file, independent of offset)
    ● Due to the metadata header in each indirect block, the number of pointers per block isn’t a power of two :( (see the worked example below)
    ● Attributes are now set/read by the lsattr and chattr commands
    ● Extended attributes use either one extra block, or a single-layer metadata tree similar to the main metadata tree
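A quick worked example of the “not a power of two” point above. This is only a sketch: the 4096 byte block size and 24 byte per-block metadata header are assumptions chosen for illustration.

#include <stdio.h>

int main(void)
{
	unsigned int block_size  = 4096;  /* assumed filesystem block size */
	unsigned int header_size = 24;    /* assumed per-block metadata header */
	unsigned int ptr_size    = 8;     /* 64-bit block pointers */

	/* (4096 - 24) / 8 = 509 pointers per indirect block, not 512 */
	printf("pointers per indirect block: %u\n",
	       (block_size - header_size) / ptr_size);
	return 0;
}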

Directories

The original paper upon which the GFS2 directory structure is based is: "Extendible Hashing" by Fagin et al. in ACM Trans. on Database Systems, September 1979.

● Small directories are packed into the directory inode
● Hashed dirs based on extendible hashing (fast lookup, see the sketch below), but
    ● There is a maximum hash table size
    ● Beyond that size, the directory leaf chains grow
● We use GFS2’s own hash (a CRC32) in the dcache to reduce the number of times we hash a filename (unlike GFS1)
● Each directory leaf block contains an unordered set of “dirents”, very much like ext2/ext3
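The lookup side of extendible hashing can be sketched as below. This illustrates the idea from the Fagin et al. paper (the top bits of a 32-bit name hash index a table of leaf pointers, and several slots may share one leaf) rather than GFS2's exact on-disk layout; all names here are invented for the example.

#include <stdint.h>
#include <stddef.h>

struct leaf;                       /* a block holding unordered dirents */

struct hashed_dir {
	unsigned int depth;        /* hash table has 1 << depth entries */
	struct leaf **table;       /* leaf pointers; slots may share a leaf */
};

/* Find the leaf that would contain a name with the given 32-bit hash */
static struct leaf *dir_lookup_leaf(const struct hashed_dir *dir,
				    uint32_t name_hash)
{
	if (dir->depth == 0)
		return dir->table[0];
	/* Use the most significant bits so that doubling the table
	 * (when a leaf splits) keeps existing hashes valid. */
	return dir->table[name_hash >> (32 - dir->depth)];
}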

System files

● In GFS they are “hidden” and accessed via an ioctl
● In GFS2 they are accessed by mounting a filesystem of type gfs2meta
● The rindex holds the locations of resource groups
● The jindex holds the locations of the journals
    ● A file in GFS1
    ● A directory in GFS2
● The “per_node” subdirectory holds a set of files for each node in GFS2
    ● Inum – Inode number generation
    ● Statfs – For fast & fuzzy statfs
● Quota files
● The GFS2 system is extensible so that further files can be added as required

NFS

● The NFS interface has been designed with failover in mind
    ● Requires use of the “fsid” export option since the same device might have different device numbers on different nodes
    ● The file handles are all byte-swapped to be endian independent
● The fcntl locks used by NFS are handled by a daemon: dlm_controld
    ● Each node maintains a view of all the fcntl locks in the cluster
    ● Uses openais/corosync to synchronise locks between nodes
    ● Limited to active/passive failover for the time being

FIEMAP ioctl

● This allows mapping the physical location of disk blocks containing data
● Used by the filefrag utility
    ● Provides feedback on the fragmentation of individual inodes
    ● Location information can be used in performance/correctness analysis
● Example:

[root@chywoon mnt]# dd if=/dev/zero bs=4096 count=100 of=my.test
100+0 records in
100+0 records out
409600 bytes (410 kB) copied, 0.00385974 s, 106 MB/s
[root@chywoon mnt]# filefrag -v my.test
Filesystem type is: 1161970
File size of my.test is 409600 (100 blocks, blocksize 4096)
 ext logical physical expected length flags
   0       0   165367            100 merged,eof
my.test: 1 extent found
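The same information can be obtained directly from the FIEMAP ioctl. Below is a minimal sketch of roughly what filefrag does under the hood; the 32-extent buffer and the abbreviated error handling are simplifications chosen for the example.

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
	if (argc < 2)
		return 1;
	int fd = open(argv[1], O_RDONLY);
	if (fd < 0)
		return 1;

	/* Room for up to 32 extents after the fiemap header */
	size_t sz = sizeof(struct fiemap) + 32 * sizeof(struct fiemap_extent);
	struct fiemap *fm = calloc(1, sz);

	fm->fm_start = 0;
	fm->fm_length = FIEMAP_MAX_OFFSET;   /* map the whole file */
	fm->fm_flags = FIEMAP_FLAG_SYNC;     /* flush data before mapping */
	fm->fm_extent_count = 32;

	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
		return 1;

	for (unsigned int i = 0; i < fm->fm_mapped_extents; i++) {
		struct fiemap_extent *fe = &fm->fm_extents[i];
		printf("extent %u: logical %llu physical %llu length %llu\n",
		       i, (unsigned long long)fe->fe_logical,
		       (unsigned long long)fe->fe_physical,
		       (unsigned long long)fe->fe_length);
	}
	free(fm);
	close(fd);
	return 0;
}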

Discards, FITRIM and fallocate

● Discard supported as a mount option
● The FITRIM ioctl provides a method of periodically sweeping the unused fs space
● Uses the generic fstrim tool
● Useful for:
    ● SSDs
    ● Thin provisioning
● GFS2 doesn’t have the ability to indicate “allocated but zero” in metadata
● So fallocate makes use of sb_issue_zeroout()
● Still faster than using dd to create a zeroed file
● Allows allocation beyond end of file, without extending the file size (see the sketch below)
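For illustration, here is a userspace sketch of both interfaces: FITRIM to discard unused filesystem space (what fstrim does), and fallocate() with FALLOC_FL_KEEP_SIZE to preallocate beyond end of file without changing the file size. The mount point and file name are placeholders and error handling is abbreviated.

#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>      /* FITRIM, struct fstrim_range */

int main(void)
{
	/* FITRIM: trim all free space in the filesystem mounted at /mnt/gfs2 */
	struct fstrim_range range = {
		.start = 0,
		.len = (__u64)-1,   /* whole filesystem */
		.minlen = 0,
	};
	int dirfd = open("/mnt/gfs2", O_RDONLY);
	if (dirfd >= 0 && ioctl(dirfd, FITRIM, &range) == 0)
		printf("trimmed %llu bytes\n", (unsigned long long)range.len);
	if (dirfd >= 0)
		close(dirfd);

	/* fallocate: preallocate 1 MiB past EOF without extending the size */
	int fd = open("/mnt/gfs2/prealloc.dat", O_RDWR | O_CREAT, 0644);
	if (fd >= 0) {
		fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 1024 * 1024);
		close(fd);
	}
	return 0;
}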

Tracepoints

● Dynamic debugging feature
● Killer feature: the ordering of GFS2 events is maintained with respect to other tracepoint-supporting subsystems
● Can be combined easily with tracing of blocks/processes, etc.
● Simple to use via the debugfs interface
● Covers all major parts of GFS2
● Gradually being extended
● Provides the raw data for performance and correctness analysis
    ● e.g. for PCP
● Easy to filter out unwanted data at source
● See also the new glstats files in debugfs

Ordered write list sorting

● Like ext3, GFS2 defaults to ordered write mode
    ● This means we write all dirty data to disk before each journal flush
    ● If barriers are supported, we use a barrier between the data flush and the journal flush
    ● If barriers are not supported, we have to wait for all data I/O before continuing
● Items are added to the ordered write list by write/page_mkwrite
● The original implementation added each buffer to the ordered write list
● The new implementation adds the inodes to the ordered write list
    ● This makes the list shorter, so less work to sort it
    ● It means syncing is per inode, so we can use ->writepages()
    ● As a result, we generate fewer, larger I/Os

Faster glock dump

● Each cached glock generates at least a single line in the glock dump file
● Glocks which are in use may generate several lines
● In large deployments, millions of glocks are common
● The seq_file code finds its place by counting from the start of the file each time it is called, and only returns a relatively small number of lines
● Problem: we have an O(N²) algorithm
● Result: it often takes hours to dump the glock state
● Solution (sketched below):
    ● Keep a cached position indicator (hash table position plus offset)
    ● Increase the buffer size to report more data on each read call
    ● Buffer size: min(PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER, 65536UL)
    ● Falls back silently to using PAGE_SIZE if the allocation fails
● Glock dumps now take seconds rather than hours
● Enables use of glocktop and other debug programs
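The buffer-size part of the solution can be sketched in kernel-style C. This is a simplified illustration of the approach, not the actual GFS2 source: try a large seq_file buffer so each read() call returns many glocks, and quietly leave seq_file to its default single page if that allocation fails.

#include <linux/seq_file.h>
#include <linux/slab.h>
#include <linux/mm.h>

/* Buffer size as described above */
#define GLOCK_DUMP_BUFSIZE \
	min(PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER, 65536UL)

static void glock_dump_set_bufsize(struct seq_file *seq)
{
	seq->buf = kmalloc(GLOCK_DUMP_BUFSIZE, GFP_KERNEL | __GFP_NOWARN);
	if (seq->buf)
		seq->size = GLOCK_DUMP_BUFSIZE;
	/* On failure say nothing: seq_file falls back to PAGE_SIZE */
}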

Performance

● Merged the lock_dlm lock module into GFS2
    ● Many optimisations due to shared state
    ● Reduced memory footprint (and number of allocations)
    ● Added an address space to certain glocks
    ● Avoids the need to have an extra inode structure for metadata
● RCU used for the glock hash table
● “rbm” resource group bitmap location structure
● Mkfs aligns resource groups, etc., according to RAID stripe sizes
● Block reservations
    ● Like a write-side version of readahead
    ● Greatly improves multi-stream write performance
    ● Reduces fragmentation
    ● Work still ongoing in this area
● atomic_open – combines create or lookup and open

Better Journaling

● Better memory management
    ● Using bio directly to build I/Os
    ● Using a mempool to keep enough pages to flush the log
    ● Eliminates __GFP_NOFAIL from the journaling code
    ● Fewer pauses in low memory conditions
● Since I/O will be mostly contiguous...
    ● Ideal situation to build large I/Os
    ● Keep a bio for the journal which we build up with each log I/O
    ● Send off the bio when we cannot enlarge it (see the sketch below):
        ● Due to a discontinuity in the journal
        ● Due to a requirement to flush immediately
        ● When it is full
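The append-or-submit decision can be sketched roughly as follows, using the 2013-era block layer API. This is a simplified illustration with invented names, not the actual GFS2 log code, and it omits the "flush immediately" case.

#include <linux/bio.h>
#include <linux/blkdev.h>

static struct bio *journal_bio;   /* bio currently being built for the log */

/* Append one journal page; submit the current bio first if it cannot grow */
static void journal_write_page(struct block_device *bdev, sector_t sector,
			       struct page *page)
{
	/* Try to enlarge the existing bio when the new page is contiguous */
	if (journal_bio && sector == bio_end_sector(journal_bio) &&
	    bio_add_page(journal_bio, page, PAGE_SIZE, 0) == PAGE_SIZE)
		return;

	/* Discontinuity or bio full: send off what we have built so far */
	if (journal_bio)
		submit_bio(WRITE, journal_bio);

	journal_bio = bio_alloc(GFP_NOIO, BIO_MAX_PAGES);
	journal_bio->bi_sector = sector;
	journal_bio->bi_bdev = bdev;
	bio_add_page(journal_bio, page, PAGE_SIZE, 0);
}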

Ease of Use

● Quota
    ● Now uses the generic quota tools
    ● The old, obsolete gfs2_quota tool has been retired
● New clustering interface
    ● Uses pacemaker
● PCP Support
    ● See Paul Evans' talk!
    ● Intent is to provide a visual indication of performance
● gfs2_tool now obsolete
    ● Replaced by a variety of things:
        ● Tunables -> mount options
        ● SB editing -> tunegfs2
        ● Freeze -> dmsetup suspend
        ● Lockdump -> cat /sys/kernel/debug/gfs2/<fsname>/glocks

Future developments

● Stability
    ● Continuously adding more tests to the test suite
    ● Aiming at ease of maintenance
● Samba/NFS integration
● Performance
    ● Monitored on an ongoing basis by our Performance Engineering team
    ● Streaming writes/multipage write are a priority
    ● Will also include fsck.gfs2
● Scalability
    ● Using RCU for more data structures
● Ease of use
    ● Better performance monitoring
    ● Simpler configuration

Questions?