Recent developments in GFS2
Steven Whitehouse, Manager, GFS2 Filesystem
LinuxCon Europe – October 2013

Topics
● Principles of operation
● Locking
● Hints and Tips
● Inodes, Directories and System files
● NFS/Samba
● What is new in GFS2?
● FIEMAP
● Discards, FITRIM and fallocate
● Tracing
● Performance
● Ease of Use
● Future Developments
● Questions

Principles of Operation
The page cache
● Caches on-disk information to allow rapid access
● Indexed by “address space” and “page index”
● For GFS2 this cache is split between nodes and managed by glocks
● Local filesystems: the data address space lives in the inode, the metadata address space in the block device inode
● GFS2: the data address space lives in the inode, the metadata address space in the inode glock

The glock is a GFS2 concept used to manage the cache
● There is a 1:1 mapping from GFS2 glocks to DLM locks
● For inodes, the glock manages both the data and metadata caches
● The glock state (EX, SH, DF, UN) determines what can be cached: for example, dirty data may only be cached in EX, clean data in SH
Most performance problems occur when the implications of the glock system are not understood

Locking
Three lock modes:
● Shared, Exclusive, Deferred (shared, but not compatible with Shared)
● Deferred is used for Direct I/O to ensure that it is cluster coherent
● The DLM's NULL lock mode is used to maintain a reference count on LVBs (currently only used for quota data)

Pluggable locking architecture
● lock_nolock
  ● A “no-op” lock manager used when GFS2 runs as a local filesystem
● lock_dlm
  ● A fully cluster-aware lock manager
  ● Works with the kernel's DLM and Red Hat clustering software
  ● Provides VAX-like DLM semantics translated into GFS2's requirements

Hints and Tips
fcntl caveat
● When using F_GETLK a PID will be returned, but that process might exist on any node in the cluster! Be careful
● Currently there is no way to work out which node the process holding a blocking lock is on
● We do not support leases
● We do not support dnotify/inotify (except on the local node)
● It is also possible to ask for “local” fcntl locking, even when running as a cluster filesystem

flock is fully supported across the cluster. Use it in preference to fcntl if your application supports it (see the sketch below).

Using the DLM from an application
● The DLM is available for application use, and this is probably a better solution than fcntl locking
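A minimal sketch of whole-file locking via flock(2); the file path is just an example:

/* Whole-file locking with flock(2), which GFS2 honours
   cluster-wide. The path is hypothetical. */
#include <sys/file.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/mnt/mygfs2/shared.dat", O_RDWR);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        if (flock(fd, LOCK_EX) == 0) {  /* blocks until granted, cluster-wide */
                /* ... critical section, safe against every node ... */
                flock(fd, LOCK_UN);
        }
        close(fd);
        return 0;
}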
Inodes
Very similar to those of GFS
● Retains common fields at the same offsets
● One inode per block
● Spare space is used either for data (“stuffed” inodes) or for indirect pointers, as required
● Indirect pointers are formatted as an equal-height metadata tree (constant-time access to any area of the file, independent of offset)
● Due to the metadata header in each indirect block, the number of pointers per block isn't a power of two :(
● Attributes are now set/read by the lsattr and chattr commands
● Extended attributes use either one extra block, or a single-layer metadata tree similar to the main metadata tree
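As a worked example (assuming the common 4KiB block size and a 24-byte metadata header at the start of each indirect block): each indirect block holds (4096 − 24) / 8 = 509 eight-byte block pointers, rather than the 512 that a headerless block would give.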
Directories
The original paper upon which the GFS2 directory structure is based is: "Extendible Hashing" by Fagin et al., ACM Trans. on Database Systems, Sept 1979.
● Small directories are packed into the directory inode
● Hashed directories are based on extendible hashing (fast lookup), but:
  ● There is a maximum hash table size
  ● Beyond that size, the directory leaf chains grow (see the sketch below)
● We use GFS2's own hash (a CRC32) in the dcache to reduce the number of times we hash a filename (unlike GFS1)
● Each directory leaf block contains an unordered set of “dirents”, very much like ext2/ext3
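A simplified sketch of that lookup scheme (illustrative only, not the actual GFS2 code; all names are hypothetical):

/* Extendible-hashing directory lookup: the top bits of the filename
   hash index a table of pointers to leaf blocks; once the table
   reaches its maximum size, the leaf chains grow instead. */
#include <stdint.h>

struct leaf {
        struct leaf *next;      /* chain, grows once table is at max size */
        /* ... unordered dirents, much like ext2/ext3 ... */
};

struct dir {
        unsigned depth;         /* table has 1 << depth entries, depth >= 1 */
        struct leaf **table;
};

static struct leaf *lookup_leaf(const struct dir *d, uint32_t name_hash)
{
        /* Use the top 'depth' bits of the hash as the table index */
        uint32_t idx = name_hash >> (32 - d->depth);
        return d->table[idx];   /* caller then walks the ->next chain */
}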
System files
● In GFS they are “hidden” and accessed via an ioctl
● In GFS2 they are accessed by mounting a filesystem of type gfs2meta (see the example below)
● The rindex holds the locations of the resource groups
● The jindex holds the locations of the journals
  ● A file in GFS1
  ● A directory in GFS2
● The “per_node” subdirectory holds a set of files for each node in GFS2
  ● Inum – inode number generation
  ● Statfs – for fast & fuzzy statfs
● Quota files
● The GFS2 system file mechanism is extensible, so that further files can be added as required
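For example (paths hypothetical; the gfs2 filesystem itself must already be mounted, here at /mnt/mygfs2):

# mount -t gfs2meta /mnt/mygfs2 /mnt/mygfs2.meta

The system files described above (rindex, jindex, per_node and so on) then appear under /mnt/mygfs2.meta.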
NFS
The NFS interface has been designed with failover in mind
● Requires use of the “fsid” export option, since the same device might have different device numbers on different nodes (see the example below)
● The file handles are all byte-swapped to be endian independent

The fcntl locks used by NFS are handled by a userspace daemon: dlm_controld
● Each node maintains a view of all the fcntl locks in the cluster
● Uses openais/corosync to synchronise locks between nodes
● Limited to active/passive failover for the time being
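For example, an /etc/exports entry might look like this (the path and fsid value are illustrative; the fsid option itself is the point):

/mnt/mygfs2    *(rw,sync,fsid=77)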
FIEMAP ioctl
Allows mapping of the physical locations of the disk blocks containing a file's data
Used by the filefrag utility
● Provides feedback on the fragmentation of individual inodes
● Location information can be used in performance/correctness analysis

Example:
[root@chywoon mnt]# dd if=/dev/zero bs=4096 count=100 of=my.test
100+0 records in
100+0 records out
409600 bytes (410 kB) copied, 0.00385974 s, 106 MB/s
[root@chywoon mnt]# filefrag -v my.test
Filesystem type is: 1161970
File size of my.test is 409600 (100 blocks, blocksize 4096)
 ext  logical  physical  expected  length  flags
   0        0    165367               100  merged,eof
my.test: 1 extent found
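The same information is available programmatically. A minimal sketch of calling the FIEMAP ioctl directly, as filefrag does (error handling trimmed, filename as above):

/* Map the extents of a file via FS_IOC_FIEMAP */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
        unsigned int i;
        struct fiemap *fm;
        int fd = open(argc > 1 ? argv[1] : "my.test", O_RDONLY);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        /* Header plus room for up to 32 extent records */
        fm = calloc(1, sizeof(*fm) + 32 * sizeof(struct fiemap_extent));
        if (!fm)
                return 1;
        fm->fm_start = 0;
        fm->fm_length = FIEMAP_MAX_OFFSET;      /* map the whole file */
        fm->fm_flags = FIEMAP_FLAG_SYNC;        /* flush dirty data first */
        fm->fm_extent_count = 32;

        if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
                perror("FIEMAP");
                return 1;
        }
        for (i = 0; i < fm->fm_mapped_extents; i++)
                printf("extent %u: logical %llu, physical %llu, length %llu\n",
                       i,
                       (unsigned long long)fm->fm_extents[i].fe_logical,
                       (unsigned long long)fm->fm_extents[i].fe_physical,
                       (unsigned long long)fm->fm_extents[i].fe_length);
        free(fm);
        close(fd);
        return 0;
}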
Discards, FITRIM and fallocate
● Discard is supported as a mount option
● The FITRIM ioctl provides a method of periodically sweeping the unused filesystem space
  ● Uses the generic fstrim tool
● Useful for:
  ● SSDs
  ● Thin provisioning
● GFS2 doesn't have the ability to indicate “allocated but zero” in its metadata, so fallocate makes use of sb_issue_zeroout()
  ● Still faster than using dd to create a zeroed file
● fallocate allows allocation beyond the end of the file, without extending the file size (see the sketch below)
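A minimal sketch of both operations from C (paths are hypothetical; FITRIM requires root):

/* Preallocate space with fallocate(2), then trim unused filesystem
   space with the FITRIM ioctl, which is what fstrim does */
#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>           /* FITRIM, struct fstrim_range */
#include <linux/falloc.h>       /* FALLOC_FL_KEEP_SIZE */

int main(void)
{
        struct fstrim_range range = { .start = 0, .len = (__u64)-1, .minlen = 0 };
        int fd, mntfd;

        /* Preallocate 64MiB without changing the file size, i.e. the
           allocation lies beyond EOF on this empty file */
        fd = open("/mnt/mygfs2/prealloc.dat", O_RDWR | O_CREAT, 0644);
        if (fd >= 0) {
                if (fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 64 << 20))
                        perror("fallocate");
                close(fd);
        }

        /* Sweep all unused space on the filesystem */
        mntfd = open("/mnt/mygfs2", O_RDONLY);
        if (mntfd >= 0) {
                if (ioctl(mntfd, FITRIM, &range))
                        perror("FITRIM");
                else
                        printf("trimmed %llu bytes\n",
                               (unsigned long long)range.len);
                close(mntfd);
        }
        return 0;
}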
Tracepoints
● A dynamic debugging feature
● The killer feature is that the ordering of GFS2 events is maintained with respect to other tracepoint-supporting subsystems
● Can be combined easily with tracing of blocks, processes, etc.
● Simple to use via the debugfs interface (see the example below)
● Covers all major parts of GFS2
● Gradually being extended
● Provides the raw data for performance and correctness analysis
  ● e.g. for PCP
● Easy to filter out unwanted data at source
● See also the new glstats files in debugfs
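For example, from a shell on one node (assuming debugfs is mounted at the usual location):

# echo 1 > /sys/kernel/debug/tracing/events/gfs2/enable
# cat /sys/kernel/debug/tracing/trace_pipe
  (GFS2 events stream out here, interleaved with any other enabled tracepoints)
# echo 0 > /sys/kernel/debug/tracing/events/gfs2/enable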
Ordered write list sorting
Like ext3, GFS2 defaults to ordered write mode
● This means we write all dirty data to disk before each journal flush
● If barriers are supported, we use a barrier between the data flush and the journal flush
● If barriers are not supported, we have to wait for all data I/O before continuing
Items are added to the ordered write list by write/page_mkwrite
● The original implementation added each buffer to the ordered write list
● The new implementation adds the inodes to the ordered write list instead (see the sketch below)
  ● This makes the list shorter, so there is less work to sort it
  ● It means syncing is per inode, so we can use ->writepages()
  ● As a result, we generate fewer, larger I/Os
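An illustrative sketch of the idea (not the actual GFS2 code; names are hypothetical), using the kernel's list_sort() with its 2013-era callback signature:

/* With inodes rather than buffers on the ordered-write list, the
   list is short enough to sort cheaply by disk address before each
   journal flush, and each entry can be synced via ->writepages() */
#include <linux/types.h>
#include <linux/list.h>
#include <linux/list_sort.h>

struct ordered_inode {                  /* hypothetical list entry */
        struct list_head list;
        u64 disk_addr;                  /* inode's on-disk address */
};

static int ordered_cmp(void *priv, struct list_head *a, struct list_head *b)
{
        struct ordered_inode *ia = list_entry(a, struct ordered_inode, list);
        struct ordered_inode *ib = list_entry(b, struct ordered_inode, list);

        if (ia->disk_addr < ib->disk_addr)
                return -1;
        if (ia->disk_addr > ib->disk_addr)
                return 1;
        return 0;
}

/* Before a journal flush: sort, then write each inode's dirty pages */
static void flush_ordered_list(struct list_head *ordered)
{
        list_sort(NULL, ordered, ordered_cmp);
        /* for each entry: filemap_fdatawrite(inode->i_mapping); */
}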
Faster glock dump
● Each cached glock generates at least a single line in the glock dump file
  ● Glocks which are in use may generate several lines
● In large deployments, millions of glocks are common
● The seq_file code finds its place by counting from the start of the file each time it is called, and only returns a relatively small number of lines
  ● Problem: we have an O(N²) algorithm
  ● Result: it often takes hours to dump the glock state
● Solution:
  ● Keep a cached position indicator (hash table position plus offset)
  ● Increase the buffer size to report more data on each read call
    ● Buffer size: min(PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER, 65536UL)
    ● Falls back silently to using PAGE_SIZE if that allocation fails
● Glock dumps now take seconds rather than hours
● Enables the use of glocktop and other debug programs
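With 4KiB pages and PAGE_ALLOC_COSTLY_ORDER = 3, that works out to min(4096 << 3, 65536) = 32KiB per read call, eight times the previous PAGE_SIZE buffer.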
Performance
● Merged the lock_dlm lock module into GFS2
  ● Many optimisations due to shared state
  ● Reduced memory footprint (and number of allocations)
● Added an address space to certain glocks
  ● Avoids the need to have an extra inode structure for metadata
● RCU is used for the glock hash table
● “rbm” resource group bitmap location structure
● mkfs aligns resource groups, etc., according to RAID stripe sizes
● Block reservations
  ● Like a write-side version of readahead
  ● Greatly improve multi-stream write performance
  ● Reduce fragmentation
  ● Work is still ongoing in this area
● atomic_open – combines create-or-lookup and open

Better Journaling
Better memory management
● Using bios directly to build I/Os
● Using a mempool to keep enough pages to flush the log
● Eliminates __GFP_NOFAIL from the journaling code
● Fewer pauses in low memory conditions

Since journal I/O will be mostly contiguous...
● Ideal situation in which to build large I/Os
● Keep a bio for the journal, which we build up with each log I/O
● Send off the bio when we cannot enlarge it (see the sketch below):
  ● Due to a discontinuity in the journal
  ● Due to a requirement to flush immediately
  ● When it is full
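An illustrative sketch of that approach (not the actual GFS2 code; names are hypothetical and completion handling is omitted), using the 2013-era block layer API:

/* Accumulate contiguous log pages into one bio and submit it
   only when it cannot be enlarged */
#include <linux/bio.h>
#include <linux/blkdev.h>

static struct bio *log_bio;     /* current journal bio being built */

static void log_write_page(struct block_device *bdev, struct page *page,
                           sector_t sector)
{
        /* Submit the current bio if this page is discontiguous with
           it, or if no more pages fit */
        if (log_bio &&
            (bio_end_sector(log_bio) != sector ||
             !bio_add_page(log_bio, page, PAGE_SIZE, 0))) {
                submit_bio(WRITE, log_bio);
                log_bio = NULL;
        }
        if (!log_bio) {
                log_bio = bio_alloc(GFP_NOIO, BIO_MAX_PAGES);
                log_bio->bi_sector = sector;
                log_bio->bi_bdev = bdev;
                bio_add_page(log_bio, page, PAGE_SIZE, 0);
        }
        /* a final submit happens at journal flush time */
}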
Ease of Use
Quota
● Now uses the generic quota tools
● The old, obsolete gfs2_quota tool has been retired
New clustering interface
● Uses Pacemaker
PCP support
● See Paul Evans' talk!
● The intent is to provide a visual indication of performance
gfs2_tool is now obsolete
● Replaced by a variety of things:
  ● Tunables → mount options
  ● SB editing → tunegfs2
  ● Freeze → dmsetup suspend
  ● Lockdump → cat /sys/kernel/debug/gfs2/
Future Developments
Stability
● Continuously adding more tests to the test suite
● Aiming at ease of maintenance
● Samba/NFS integration
Performance
● Monitored on an ongoing basis by our Performance Engineering team
● Streaming writes/multi-page writes are a priority
● Will also include fsck.gfs2
Scalability
● Using RCU for more data structures
Ease of use
● Better performance monitoring
● Simpler configuration

Questions?