Scalable Filesystems XFS & CXFS

Presented by: Yingping Lu

Outline

• XFS Overview
• XFS Architecture
• XFS Fundamental Data Structures
  – Extent list
  – B+Tree
• XFS Filesystem On-Disk Layout
• XFS Directory Structure
• CXFS: shared filesystem

XFS: A World-Class File System

– Scalable
  • Full 64 bit support
  • Dynamic allocation of metadata space
  • Scalable structures and algorithms
– Fast
  • Fast metadata speeds
  • High bandwidths
  • High transaction rates
– Reliable
  • Field proven
  • Log/Journal

Scalable

– Full 64 bit support
  • Large filesystems: 2^64 - 1 = 18,446,744,073,709,551,615 bytes = 18 million TB (exabytes)
  • Large files: 2^63 - 1 = 9,223,372,036,854,775,807 bytes = 9 million TB (exabytes)

– Dynamic allocation of metadata space
  • Inode size configurable; inode space allocated dynamically
  • Unlimited number of files (constrained only by storage space)

– Scalable structures and algorithms (B-Trees)
  • Performance does not degrade with large numbers of files and directories

Fast

– Fast metadata speeds
  • B-Trees everywhere (nearly all lists of metadata information)
    – Directory contents
    – Metadata free lists
    – Extent lists within a file

– High bandwidths (Storage: RM6700)
  • 7.32 GB/s on one filesystem (32p Origin2000, 897 FC disks)
  • >4 GB/s in one file (same Origin, 704 FC disks)
  • Large extents (4 KB to 4 GB)
  • Request parallelism (multiple AGs)
  • Delayed allocation, read ahead/write behind

– High transaction rates: 92,423 IOPS (Storage: TP9700)

Reliable

– Field proven
  • Has run for years on 100s of 1,000s of IRIX systems, for over a decade
  • Ships as the default filesystem on the 64-bit SGI Altix family
  • Commercial vendors shipping XFS
    – Ciprico DiMeda NAS Solutions, The Quantum Guardian™ 14000, BigStorage K2~NAS, EchoStar DishPVR 721, Sun Cobalt RaQ™ 550
  • Linux distributions shipping XFS
    – Mandrake Linux, SuSE Linux, Gentoo Linux, Slackware Linux, JB Linux
  • Now in the Linux kernel
– Log/Journal
  • XFS is designed around its log
  • No UNIX fsck is needed
  • Recovery time is independent of filesystem size
    – Depends on system activity levels
  • Usually, recovery completes in under a second

Other XFS Features

– Large range of block sizes (512 B to 64 KB)
– Extended attributes
– Sparse files (holes do not use disk space)
– Guaranteed-rate I/O
– Online dump, resize, and defragmentation of active file systems
– DMAPI interface supported (for HSM)
– Support for real-time file streams
– Runs on IRIX, Linux, FreeBSD

The System Architecture (Linux)

[Diagram: the Linux I/O stack. User applications issue mount/open/read/write/lseek system calls; in the kernel, the VFS dispatches to filesystem drivers (nfs, XFS), which work with the cache manager; below them sit the SCSI middle layer, the SCSI device drivers (st, sd, sr), and the HBA driver (qla2200).]

XFS Architecture

[Diagram: XFS architecture, layered above the XVM volume manager.]

Inode Structure

– Core component of xfs_dinode_t (see the C sketch below)
  • di_core (96 bytes): time stamps, size, ino
  • Records the formats of the two other components
  • di_next_unlinked (4 bytes) follows the core
– Data component (union di_u)
  • B+tree / extent list
  • Small data
  • Small directory list
– Attribute component (union di_a): the extended attribute fork
  • B+tree / extent list
  • Local attributes
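A minimal C sketch of this layout may help; the field names follow the slide (di_core, di_next_unlinked, di_u, di_a), but the member widths and contents are illustrative assumptions, not the exact XFS on-disk definitions.

#include <stdint.h>

typedef struct dinode_core {          /* di_core: 96 bytes in XFS */
    uint16_t di_mode;                 /* file type and permissions */
    uint8_t  di_format;               /* data fork format: local/extent/btree */
    uint8_t  di_aformat;              /* attribute fork format */
    uint64_t di_size;                 /* file size in bytes */
    uint32_t di_atime, di_mtime, di_ctime;  /* time stamps (seconds) */
    /* ... remaining core fields (uid, gid, nlink, extent counts, ...) */
} dinode_core_t;

typedef struct dinode {
    dinode_core_t di_core;            /* fixed-size core component */
    uint32_t      di_next_unlinked;   /* 4 bytes: unlinked-inode list link */
    union {                           /* di_u: data fork, format-dependent */
        uint8_t u_bytes[1];           /* extent list, B+tree root, inline
                                         data, or a small directory */
    } di_u;
    union {                           /* di_a: extended attribute fork */
        uint8_t a_bytes[1];           /* extent list, B+tree root, or
                                         local attributes */
    } di_a;
} dinode_t;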

Extent

• An extent is a run of contiguous file blocks
• The minimal size of an extent is one file block
• An extent is represented as a tuple: starting file system block, number of blocks, and a flag
• Extents can significantly reduce the space needed to record allocated and free space when that space contains long runs of contiguous blocks
• For a regular file, the extent record also includes the file offset it maps
• Extents support sparse files, i.e. files with potential "holes" (see the sketch below)
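To make the tuple concrete, here is a self-contained C sketch of an extent record and an offset lookup; the names (extent_rec_t, extent_lookup) and the flag encoding are illustrative assumptions, and the sample data matches the mapping diagram on the next slide.

#include <stdint.h>
#include <stdio.h>

typedef struct extent_rec {
    uint64_t startoff;     /* file offset of the extent, in blocks */
    uint64_t startblock;   /* starting file system block */
    uint64_t blockcount;   /* number of contiguous blocks */
    int      flag;         /* e.g. 0 = written, nonzero = preallocated */
} extent_rec_t;

/* Map a file-block offset to a file system block by scanning a sorted
   extent list; returns -1 when the offset falls in a hole. */
static int64_t extent_lookup(const extent_rec_t *ext, int n, uint64_t off)
{
    for (int i = 0; i < n; i++)
        if (off >= ext[i].startoff &&
            off <  ext[i].startoff + ext[i].blockcount)
            return (int64_t)(ext[i].startblock + (off - ext[i].startoff));
    return -1;
}

int main(void)
{
    extent_rec_t ext[] = {            /* the three extents pictured below */
        {  0, 20,  4, 0 },            /* file blocks 0-3   -> fs 20-23 */
        {  4, 32, 10, 0 },            /* file blocks 4-13  -> fs 32-41 */
        { 16, 50,  5, 0 },            /* file blocks 16-20 -> fs 50-54 */
    };
    printf("%lld\n", (long long)extent_lookup(ext, 3, 5));   /* 33 */
    printf("%lld\n", (long long)extent_lookup(ext, 3, 14));  /* -1: hole */
    return 0;
}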

Extent List Mapping

[Diagram: extent list mapping. An inode with dformat=extent and aformat=local holds three extent records <file offset, fs block, count, flag>: <0, 20, 4, 0>, <4, 32, 10, 0>, <16, 50, 5, 0>. File blocks 0-3 map to fs blocks 20-23, file blocks 4-13 to fs blocks 32-41, and file blocks 16-20 to fs blocks 50-54; file blocks 14-15 are a hole. The local attribute fork stores name/value pairs (nlen, vlen, name, value) inside the inode.]

Data B+tree

[Diagram: data fork in B+tree format (dformat=btree, aformat=local). The root node in the inode (level=2) holds file-offset/fs-block pairs 0/13, 5000/14, 9800/15 pointing to intermediate nodes in fs blocks 13, 14, 15 (level=1, numrecs 50/50/32); those point in turn to leaf nodes such as fs blocks 40 and 51 (level=0, numrecs=50), whose records are file-offset/fs-block/#blocks extents such as 0/201/2, 2/206/2, 4/210/3, 7/218/1.]
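The shape of such a tree can be sketched in C as follows; the header and record layouts are simplified assumptions, not the exact on-disk bmap B+tree format.

#include <stdint.h>

typedef struct btree_hdr {
    uint16_t level;          /* 0 = leaf, >0 = interior */
    uint16_t numrecs;        /* number of records in this block */
} btree_hdr_t;

typedef struct btree_key {   /* interior entry: key plus child pointer */
    uint64_t startoff;       /* smallest file offset under this child */
    uint64_t child_fsblock;  /* fs block holding the child node */
} btree_key_t;

typedef struct btree_rec {   /* leaf entry: one extent */
    uint64_t startoff;       /* file offset */
    uint64_t startblock;     /* fs block */
    uint64_t blockcount;     /* number of blocks */
} btree_rec_t;

A lookup descends from the root, at each interior node following the last key whose file offset does not exceed the target offset, until it reaches a level-0 leaf holding the extent record.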

File System On-Disk Layout

A File System

AG0 AG1 AG2 …

AG: Allocation Group

SB AGF AGFL AGI FDB1 FDB2 …

SB: Super Block (1 sector, xfs_sb_t)
AGF: Allocation Group Free Space (1 sector, xfs_agf_t)
AGI: Allocation Group Inode (1 sector, xfs_agi_t)
AGFL: Allocation Group Freelist (1 sector, xfs_agfl_t)
FDB: File Data Block (file block size: 4K default)
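Given this layout, locating an AG's header sectors is simple arithmetic. The sketch below assumes 512-byte sectors and takes the blocks-per-AG and block size as parameters (in XFS these come from superblock fields); the names are illustrative.

#include <stdint.h>

#define SECTOR_SIZE 512u    /* each AG header occupies one sector */

/* Byte offset of the start of allocation group 'agno'. */
static uint64_t ag_start(uint64_t agno, uint64_t ag_blocks, uint32_t blk_size)
{
    return agno * ag_blocks * blk_size;
}

/* Header sectors at the front of each AG, in the order drawn above. */
enum ag_header { AG_SB = 0, AG_AGF = 1, AG_AGFL = 2, AG_AGI = 3 };

static uint64_t ag_header_offset(uint64_t agno, uint64_t ag_blocks,
                                 uint32_t blk_size, enum ag_header h)
{
    return ag_start(agno, ag_blocks, blk_size) + (uint64_t)h * SECTOR_SIZE;
}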

AGF B+Tree

AGF B+Tree (2 levels)

AGI B+Tree

Inode Allocation

• Inode size
  – Configurable at FS creation time
  – Can be 256 B, 512 B, up to 4096 B
• Inode allocation
  – The allocation unit is a cluster of 64 inodes
  – An inode mask shows which inodes in the cluster are free
• An inode number consists of (see the sketch below):
  – AG #
  – FS block # of the inode cluster
  – Index within the cluster
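A sketch of composing and splitting an inode number from those three parts; the two bit widths are per-filesystem values (log2 of blocks per AG and log2 of inodes per block), passed as parameters here rather than read from the superblock.

#include <stdint.h>

typedef struct ino_geom {
    unsigned agblklog;   /* log2(blocks per AG), rounded up */
    unsigned inopblog;   /* log2(inodes per block) */
} ino_geom_t;

static uint64_t make_ino(const ino_geom_t *g, uint64_t agno,
                         uint64_t agbno, uint64_t index)
{
    return (agno  << (g->agblklog + g->inopblog)) |
           (agbno <<  g->inopblog) |
            index;
}

static void split_ino(const ino_geom_t *g, uint64_t ino, uint64_t *agno,
                      uint64_t *agbno, uint64_t *index)
{
    *index = ino & ((1ull << g->inopblog) - 1);
    *agbno = (ino >> g->inopblog) & ((1ull << g->agblklog) - 1);
    *agno  = ino >> (g->agblklog + g->inopblog);
}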

XFS Directory Structure

• Unix files are organized in an inverted tree structure
• Each directory holds a list of the files under it
• Each entry in a directory represents a file, a logical link, or a sub-directory
• Each entry holds the object name, the length of the name, and the corresponding inode number (see the sketch below)
• Directory data are usually stored in directory blocks; a directory block's size is a multiple of the file data block size. The superblock's sb_dirblklog designates the size, which can range from 4 KB to 64 KB
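A minimal sketch of one such entry; the layout is illustrative, not the exact on-disk directory entry format.

#include <stdint.h>

typedef struct dir_entry {
    uint64_t inumber;    /* inode number of the named object */
    uint8_t  namelen;    /* length of the name */
    char     name[];     /* name bytes (not NUL-terminated on disk) */
} dir_entry_t;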

Directory Forms

• Directory data
  – Directory entries (name, len, ino, offset)
  – Leaf array: hash/address pairs for lookup
  – Freeindex array for allocation
• Directory forms (see the sketch after this list)
  – Shortform directory: directory data stored within the inode
  – Block directory: one extent; all directory entries stored within a single directory block
  – Leaf directory: extent list; multiple data blocks, one leaf block
  – Node directory: extent list; multiple data blocks, B+tree-like leaf blocks
  – B+tree directory: B+tree format for the data fork
• The system dynamically adjusts the form as directory entries are added or removed
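The escalation between forms amounts to choosing the smallest form that still fits, roughly as in this sketch; the predicate names and thresholds are assumptions, not XFS's actual conversion logic.

typedef enum {
    DIR_SHORTFORM,   /* entries inline in the inode */
    DIR_BLOCK,       /* one directory block */
    DIR_LEAF,        /* data blocks plus one leaf block */
    DIR_NODE,        /* data blocks plus a B+tree of leaf blocks */
    DIR_BTREE        /* data fork itself in B+tree format */
} dir_form_t;

static dir_form_t choose_form(unsigned bytes_needed, unsigned fork_size,
                              unsigned dirblk_size, int leaf_fits_one_block,
                              int extent_list_fits_inode)
{
    if (bytes_needed <= fork_size)    return DIR_SHORTFORM;
    if (bytes_needed <= dirblk_size)  return DIR_BLOCK;
    if (leaf_fits_one_block)          return DIR_LEAF;
    if (extent_list_fits_inode)       return DIR_NODE;
    return DIR_BTREE;
}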

Shortform Directory

Block Directory

• A single directory block stores all directory entries
• The location of the block is stored in the inode's in-core extent list: di_u.u_bmx[0]
• The directory block (xfs_dir2_block_t) has the following data fields:
  – A header with the magic number and the freespace list (the 3 largest free spaces)
  – The directory entry list (name len, name, ino, offset)
  – A leaf array: hashval/address pairs for quickly looking up a name by its hash value (see the lookup sketch below)
  – A tail structure giving the number of elements in the leaf array and the number of stale entries; the tail always sits at the end of the block
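The leaf array turns lookup into hash-then-binary-search, roughly as sketched below; the rolling hash mirrors the spirit of XFS's name hash, and the structure names are illustrative.

#include <stdint.h>

static uint32_t rol32(uint32_t x, int n) { return (x << n) | (x >> (32 - n)); }

/* Rolling hash over the name bytes. */
static uint32_t name_hash(const char *name, int len)
{
    uint32_t hash = 0;
    for (int i = 0; i < len; i++)
        hash = (uint8_t)name[i] ^ rol32(hash, 7);
    return hash;
}

typedef struct leaf_entry {
    uint32_t hashval;    /* hash of the entry's name */
    uint32_t address;    /* where the full entry lives in the data area */
} leaf_entry_t;

/* Binary-search the sorted leaf array for the first entry with this
   hash; returns its index or -1. Hashes can collide, so the caller
   must still compare names at each matching address. */
static int leaf_find(const leaf_entry_t *leaf, int n, uint32_t hash)
{
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (leaf[mid].hashval < hash)      lo = mid + 1;
        else if (leaf[mid].hashval > hash) hi = mid - 1;
        else {
            while (mid > 0 && leaf[mid - 1].hashval == hash) mid--;
            return mid;
        }
    }
    return -1;
}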

Leaf Directory

• When the directory entries can no longer fit in one block, an extent list is used to store them
• Data and leaf information are split into separate blocks
• There are one or more data blocks; each directory data block has its own header and bestfree list
• There is only one leaf block (the last one). The leaf block has its own header and hash/address array; its best-free-space array records the length of each data block's bestfree[0], and its tail holds the number of bestfree elements

Leaf Block

Node Directory

• When the leaf fills a block, another split is needed
• The data blocks are the same as in the leaf directory
• The leaf blocks change into a B+tree with a generic header pointing to the directory "leaves"
• A new freeindex block records the best free space of each data block
• The leaf blocks may sit in any order; the only way to find the appropriate one is through the node block's hash/before values

B+Tree-style Leaf Blocks

[Diagram: a node block pointing to leaf blocks.]

B+Tree Directory

• With a very large number of directory entries, the inode format changes to B+tree
• The B+tree's extents hold the maps for data (directory entries), node, leaf (hash/address), and freeindex blocks
• The node/leaf trees can be more than one level deep
• More than one freeindex block may exist

CXFS

[Diagram: CXFS cluster nodes sharing storage over Fibre Channel.]

• Full standard Unix interface
• Near-local file performance from direct data channels
• As easy to share files as with NFS, but faster
• Fully resilient (HA)

CXFS Concepts

– Metadata
  • The data about a file, including: size, inode, create/modify times, and permissions

– Metadata server node (a.k.a. CXFS server)
  • The one machine in the cluster responsible for controlling the metadata of files. Plays "traffic cop" to control access to the file.

– Metadata client node (a.k.a. CXFS client)
  • A machine in the cluster that is not the metadata server. It must obtain permission from the metadata server before accessing a file.

– A single server manages metadata
  • Backup metadata servers are designated for fail-over
  • No single point of failure

CXFS networks

• Besides the storage area network, CXFS uses the following networks:

– Metadata network
  • A dedicated IP network for metadata and tokens

– Membership network
  • An IP network used for heartbeating

– Reset network between metadata servers
  • Non-IP serial lines used to reset nodes

– I/O fencing
  • SAN switch port disable/enable

Data Integrity - I/O Fencing

• CXFS nodes all have direct access to FC storage.
• Integrity of a shared file system requires a unified view of who is allowed to read/write what.
• Tokens control access (see the sketch below).
• A failed node may retain write tokens; such a node must be prevented from unilaterally writing to the shared file system.
• Applies to all CXFS platforms and is independent of disk subsystems.
• Uses a Brocade switch to disable/enable FC ports.
• The I/O fencing architecture could be ported to other switches.
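Conceptually, the token-plus-fencing invariant looks like the sketch below; this illustrates the idea only and is not CXFS's actual implementation.

#include <stdbool.h>

typedef enum { TOKEN_NONE, TOKEN_READ, TOKEN_WRITE } token_t;

typedef struct client_state {
    token_t token;    /* granted and revoked by the metadata server */
    bool    fenced;   /* FC switch port disabled by I/O fencing */
} client_state_t;

/* A node may write shared data only while it holds a write token and
   its FC port is enabled; fencing closes the window in which a failed
   node still holding a token could write stale data. */
static bool may_write(const client_state_t *c)
{
    return c->token == TOKEN_WRITE && !c->fenced;
}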

CXFS Architecture

[Diagram: CXFS architecture, centered on the CXFS metadata server.]

Research Issues

• Self-healing, especially deadlock detection and self-recovery
• I/O fencing
• Fail-over
• Intelligent data placement algorithms
• QoS provisioning
• Scalable clusters
• OSD support
