OCFS2: The Oracle Clustered File System, Version 2

Mark Fasheh, Oracle
[email protected]

Abstract

This talk will review the various components of the OCFS2 stack, with a focus on the file system and its clustering aspects. OCFS2 extends many local file system features to the cluster, some of the more interesting of which are unlink semantics, data consistency, shared readable mmap, etc.

In order to support these features, OCFS2 logically separates cluster access into multiple layers. An overview of the low level DLM layer will be given. The higher level file system locking will be described in detail, including a walkthrough of inode locking and messaging for various operations.

Caching and consistency strategies will be discussed. Metadata journaling is done on a per node basis with JBD. Our reasoning behind that choice will be described.

OCFS2 provides robust and performant recovery on node death. We will walk through the typical recovery process including journal replay, recovery of orphaned inodes, and recovery of cached metadata allocations.

Disk space is broken up into clusters which can range in size from 4 kilobytes to 1 megabyte. File data is allocated in extents of clusters. This allows OCFS2 a large amount of flexibility in file allocation.

File metadata is allocated in blocks via a sub allocation mechanism. All block allocators in OCFS2 grow dynamically. Most notably, this allows OCFS2 to grow inode allocation on demand.

Allocation areas in OCFS2 are broken up into groups which are arranged in self-optimizing "chains." The chain allocators allow OCFS2 to do fast searches for free space, and deallocation in a constant time algorithm. Detail on the layout and use of chain allocators will be given.

1 Design Principles

A small set of design principles has guided most of OCFS2 development. None of them are unique to OCFS2 development, and in fact, almost all are principles we learned from the kernel community. They will, however, come up often in discussion of OCFS2 file system design, so it is worth covering them now.

1.1 Avoid Useless Abstraction Layers

Some file systems have implemented large abstraction layers, mostly to make themselves portable across kernels. The OCFS2 developers have held from the beginning that OCFS2 code would be Linux only. This has helped us in several ways. An obvious one is that it made the code much easier to read and navigate. Development has been faster because we can directly use the kernel features without worrying if another OS implements the same features, or worse, writing a generic version of them.

Unfortunately, this is all easier said than done. Clustering presents a problem set which most Linux file systems don't have to deal with. When an abstraction layer is required, three principles are adhered to:

• Mimic the kernel API.
• Keep the abstraction layer as thin as possible.
• If object life timing is required, try to use the VFS object life times.

1.2 Keep Operations Local

Bouncing file system data around a cluster can be very expensive. Changed metadata blocks, for example, must be synced out to disk before another node can read them. OCFS2 design attempts to break file system updates into node local operations as much as possible.

1.3 Copy Good Ideas

There is a wealth of open source file system implementations available today. Very often during OCFS2 development, the question "How do other file systems handle it?" comes up with respect to design problems. There is no reason to reinvent a feature if another piece of software already does it well. The OCFS2 developers thus far have had no problem getting inspiration from other Linux file systems [1]. In some cases, whole sections of code have been lifted, with proper citation, from other open source projects!

[1] Most notably Ext3.

2 Disk Layout

Near the top of the ocfs2_fs.h header, one will find this comment:

    /*
     * An OCFS2 volume starts this way:
     *  Sector 0: Valid ocfs1_vol_disk_hdr that cleanly
     *            fails to mount OCFS.
     *  Sector 1: Valid ocfs1_vol_label that cleanly
     *            fails to mount OCFS.
     *  Block 2:  OCFS2 superblock.
     *
     * All other structures are found
     * from the superblock information.
     */

The OCFS disk headers are the only amount of backwards compatibility one will find within an OCFS2 volume. It is an otherwise brand new cluster file system. While the file system basics are complete, there are many features yet to be implemented. The goal of this paper, then, is to provide a good explanation of where things are in OCFS2 today.

2.1 Inode Allocation Structure

The OCFS2 file system has two main allocation units, blocks and clusters. Blocks can be anywhere from 512 bytes to 4 kilobytes, whereas clusters range from 4 kilobytes up to one megabyte. To make the file system mathematics work properly, cluster size is always greater than or equal to block size. At format time, the disk is divided into as many cluster-sized units as will fit. Data is always allocated in clusters, whereas metadata is allocated in blocks.

Inode data is represented in extents which are organized into a b-tree. In OCFS2, extents are represented by a triple called an extent record.

Extent records are stored in a large in-inode array which extends to the end of the inode block. When the extent array is full, the file system will allocate an extent block to hold the current array. The first extent record in the inode will be re-written to point to the newly allocated extent block. The e_clusters and e_cpos values will refer to the part of the tree underneath that extent. Bottom level extent blocks form a linked list so that queries across a range can be done efficiently.

[Figure 1: An Inode B-tree — a disk inode whose extent records point to extent blocks.]

    Record Field   Field Size   Description
    e_cpos         32 bits      Offset into the file, in clusters
    e_clusters     32 bits      Clusters in this extent
    e_blkno        64 bits      Physical disk offset

Table 1: OCFS2 extent record
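
For illustration, the extent record can be sketched as a C structure. This is only an approximation built from Table 1; the shipped ocfs2_fs.h defines the real structure with explicit little-endian disk types, and its exact layout may differ.

    #include <stdint.h>

    /*
     * Illustrative sketch of an extent record -- field names and widths
     * follow Table 1.  The real definition lives in ocfs2_fs.h.
     */
    struct ocfs2_extent_rec_sketch {
            uint32_t e_cpos;     /* offset into the file, in clusters */
            uint32_t e_clusters; /* clusters covered by this extent */
            uint64_t e_blkno;    /* physical disk offset of the extent */
    };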

2.2 Directories

Directory layout in OCFS2 is very similar to Ext3, though unfortunately, htree has yet to be ported. The only difference in directory entry structure is that OCFS2 inode numbers are 64 bits wide. The rest of this section can be skipped by those already familiar with the dirent structure.

Directory inodes hold their data in the same manner which file inodes do. Directory data is arranged into an array of directory entries. Each directory entry holds a 64-bit inode pointer, a 16-bit record length, an 8-bit name length, an 8-bit file type enum (this allows us to avoid reading the inode block for type), and of course the set of characters which make up the file name.
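
A directory entry as described above can likewise be sketched in C. Again, this is an illustration of the layout, not the definition from ocfs2_fs.h, which uses little-endian disk types and a fixed maximum name length.

    #include <stdint.h>

    /* Illustrative sketch of an OCFS2 directory entry (cf. the Ext3 dirent). */
    struct ocfs2_dir_entry_sketch {
            uint64_t inode;      /* 64-bit inode pointer (Ext3 uses 32 bits here) */
            uint16_t rec_len;    /* total record length, for walking the array */
            uint8_t  name_len;   /* length of the name that follows */
            uint8_t  file_type;  /* enum: regular, directory, symlink, ... */
            char     name[];     /* the file name characters, name_len bytes */
    };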

2.3 The Super Block

The OCFS2 super block information is contained within an inode block. It contains a standard set of super block information—block size, compat/incompat/ro features, root inode pointer, etc. There are four values which are somewhat unique to OCFS2.

• s_clustersize_bits – Cluster size for the file system.
• s_system_dir_blkno – Pointer to the system directory.
• s_max_slots – Maximum number of simultaneous mounts.
• s_first_cluster_group – Block offset of first cluster group descriptor.

s_clustersize_bits is self-explanatory. The reason for the other three fields will be explained in the next few sections.

2.4 The System Directory

In OCFS2, file system metadata is contained within a set of system files. There are two types of system files, global and node local. All system files are linked into the file system via the hidden system directory [2] whose inode number is pointed to by the superblock. To find a system file, a node need only search the system directory for the name in question. The most common ones are read at mount time as a performance optimization. Linking to system files from the system directory allows system file locations to be completely dynamic. Adding new system files is as simple as linking them into the directory.

[2] debugfs.ocfs2 can list the system dir with the ls // command.

Global system files are generally accessible by any cluster node at any time, given that it has taken the proper cluster-wide locks. The global_bitmap is one such system file. There are many others.

Node local system files are said to be owned by a mounted node which occupies a unique slot. The maximum number of slots in a file system is determined by the s_max_slots superblock field. The slot_map global system file contains a flat array of node numbers which details which mounted node occupies which set of node local system files.

Ownership of a slot may mean a different thing to each node local system file. For some, it means that access to the system file is exclusive—no other node can ever access it. For others it simply means that the owning node gets preferential access—for an allocator file, this might mean the owning node is the only one allowed to allocate, while every node may delete.

A node local system file has its slot number encoded in the file name. For example, the journal used by the node occupying the third file system slot (slot numbers start at zero) has the name journal:0002.
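
The slot-to-name mapping is easy to demonstrate. The sketch below assumes a zero-padded, four-digit decimal suffix, which matches the journal:0002 example; the exact format string used by the tools, and the local_alloc file name, are assumptions here.

    #include <stdio.h>

    /* Print the node local system file name for a given slot number. */
    static void print_slot_file_name(const char *base, int slot)
    {
            /* "journal" + slot 2 -> "journal:0002", as in the example above. */
            printf("%s:%04d\n", base, slot);
    }

    int main(void)
    {
            print_slot_file_name("journal", 2);
            print_slot_file_name("local_alloc", 0);  /* hypothetical slot 0 file */
            return 0;
    }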

2.5 Chain Allocators

OCFS2 allocates free disk space via a special set of files called chain allocators. Remember that OCFS2 allocates in clusters and blocks, so the generic term allocation units will be used here to signify either. The space itself is broken up into allocation groups, each of which contains a fixed number of allocation units. These groups are then chained together into a set of singly linked lists, which start at the allocator inode.

[Figure 2: Allocation Group — a group descriptor block followed by its allocation units.]

The first block of the first allocation unit within a group contains an ocfs2_group_descriptor. The descriptor contains a small set of fields followed by a bitmap which extends to the end of the block. Each bit in the bitmap corresponds to an allocation unit within the group. The most important descriptor fields follow.

• bg_free_bits_count – number of unallocated units in this group.
• bg_chain – describes which group chain this descriptor is a part of.
• bg_next_group – points to the next group descriptor in the chain.
• bg_parent_dinode – pointer to disk inode of the allocator which owns this group.

Embedded in the allocator inode is an ocfs2_chain_list structure. The chain list contains some fields followed by an array of ocfs2_chain_rec records. An ocfs2_chain_rec is a triple which describes a chain.

• c_blkno – First allocation group.
• c_total – Total allocation units.
• c_free – Free allocation units.

[Figure 3: Chain Allocator — an allocator inode whose chain records point to singly linked lists of group descriptors.]

The two most interesting fields at the top of an ocfs2_chain_list are: cl_cpg, clusters per group; and cl_bpc, bits per cluster. The product of those two fields describes the total number of blocks occupied by each allocation group. As an example, the cluster allocator whose allocation units are clusters has a cl_bpc of 1 and cl_cpg is determined by mkfs.ocfs2 (usually it just picks the largest value which will fit within a descriptor bitmap).

Chain searches are said to be self-optimizing. That is, while traversing a chain, the file system will re-link the group with the most number of free bits to the top of the list. This way, full groups can be pushed toward the end of the list and subsequent searches will require fewer disk reads.
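
The self-optimizing behavior can be modeled in isolation: walk the chain, remember the group with the most free bits, and re-link it to the head so later searches find it first. The following user-space sketch illustrates only that re-linking step, not the kernel allocator.

    #include <stddef.h>

    /* Simplified in-memory stand-in for an ocfs2_group_descriptor. */
    struct group {
            unsigned int  free_bits;  /* bg_free_bits_count */
            struct group *next;       /* bg_next_group */
    };

    /*
     * Find the group with the most free bits and move it to the head of
     * the chain.  Returns the new head.  Full groups drift toward the
     * tail, so subsequent searches need fewer reads.
     */
    static struct group *chain_search_relink(struct group *head)
    {
            struct group *g, *prev, *best = head, *best_prev = NULL;

            if (!head)
                    return NULL;

            for (prev = head, g = head->next; g; prev = g, g = g->next) {
                    if (g->free_bits > best->free_bits) {
                            best = g;
                            best_prev = prev;
                    }
            }

            if (best_prev) {                /* "best" is not already the head */
                    best_prev->next = best->next;
                    best->next = head;
                    head = best;
            }
            return head;
    }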

2.6 Sub Allocators

The total number of file system clusters is largely static as determined by mkfs.ocfs2 or optionally grown via tunefs.ocfs2. File system blocks however are dynamic. For example, an inode block allocator file can be grown as more files are created.

To grow a block allocator, cl_cpg clusters are allocated from the cluster allocator. The new ocfs2_group_descriptor record is populated and that block group is linked to the top of the smallest chain (wrapping back to the first chain if all are equally full). Other than the descriptor block, zeroing of the remaining blocks is skipped—when allocated, all file system blocks will be zeroed and written with a file system generation value. This allows fsck.ocfs2 to determine which blocks in a group are valid metadata.

2.7 Local Alloc

Very early in the design of OCFS2 it was determined that a large amount of performance would be gained by reducing contention on the cluster allocator. Essentially the local alloc is a node local system file with an in-inode bitmap which caches clusters from the global cluster allocator. The local alloc file is never locked within the cluster—access to it is exclusive to a mounted node. This allows the block to remain valid in memory for the entire lifetime of a mount.

As the local alloc bitmap is exhausted of free space, an operation called a window slide is done. First, any unallocated bits left in the local alloc are freed back to the cluster allocator. Next, a large enough area of contiguous space is found with which to re-fill the local alloc. The cluster allocator bits are set, the local alloc bitmap is cleared, and the size and offset of the new window are recorded. If no suitable free space is found during the second step of a window slide, the local alloc is disabled for the remainder of that mount.

The size of the local alloc bitmap is tuned at mkfs time to be large enough so that most block group allocations will fit, but the total size would not be so large as to keep an inordinate amount of data unallocatable by other nodes.
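
The window slide sequence reads naturally as code. The toy sketch below mirrors the steps above on simple in-memory bitmaps—return unused local bits, find a contiguous window, claim it, reset the local bitmap, and record the new window offset, disabling the local alloc on failure. Sizes, names, and data structures are invented for illustration; locking, journaling, and the on-disk format are ignored.

    #include <stdbool.h>
    #include <string.h>

    #define CLUSTERS    64   /* toy volume size, in clusters */
    #define WINDOW_LEN   8   /* toy local alloc window size */

    static bool cluster_bitmap[CLUSTERS];  /* global cluster allocator: true = in use */
    static bool local_bitmap[WINDOW_LEN];  /* bits within the cached window */
    static unsigned window_start;
    static bool local_alloc_enabled = true;

    /* Find WINDOW_LEN contiguous free clusters; on success set *start. */
    static bool find_window(unsigned *start)
    {
            for (unsigned i = 0; i + WINDOW_LEN <= CLUSTERS; i++) {
                    unsigned j;

                    for (j = 0; j < WINDOW_LEN && !cluster_bitmap[i + j]; j++)
                            ;
                    if (j == WINDOW_LEN) {
                            *start = i;
                            return true;
                    }
            }
            return false;
    }

    /* One window slide, following the steps described above. */
    static void window_slide(void)
    {
            unsigned start, i;

            /* 1. Free any unallocated local bits back to the cluster allocator. */
            for (i = 0; i < WINDOW_LEN; i++)
                    if (!local_bitmap[i])
                            cluster_bitmap[window_start + i] = false;

            /* 2. Find a large enough contiguous area with which to re-fill. */
            if (!find_window(&start)) {
                    local_alloc_enabled = false;  /* off for the rest of the mount */
                    return;
            }

            /* 3. Set the cluster allocator bits, clear the local bitmap, and
             *    record the offset of the new window. */
            for (i = 0; i < WINDOW_LEN; i++)
                    cluster_bitmap[start + i] = true;
            memset(local_bitmap, 0, sizeof(local_bitmap));
            window_start = start;
    }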

2.8 Truncate Log

The truncate log is a node local system file with nearly the same properties as the local alloc file. The major difference is that the truncate log is involved in de-allocation of clusters. This in turn dictates a difference in disk structure.

Instead of a small bitmap covering a section of the cluster allocator, the truncate log contains an in-inode array of ocfs2_truncate_rec structures. Each ocfs2_truncate_rec is an extent, with a start (t_start) cluster and a length (t_clusters). This structure allows the truncate log to cover large parts of the cluster allocator.

All cluster de-allocation goes through the truncate log. It is flushed when full, two seconds after the most recent de-allocation, or on demand by a sync(2) call.
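
A truncate log record is just an extent, so a two-field sketch captures it. The field names come from the text; the 32-bit widths and the plain integer types are assumptions, as the real definition uses little-endian disk types.

    #include <stdint.h>

    /* Illustrative sketch of one truncate log entry: a freed extent. */
    struct ocfs2_truncate_rec_sketch {
            uint32_t t_start;     /* first cluster being de-allocated */
            uint32_t t_clusters;  /* number of clusters in this extent */
    };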

3 Metadata Consistency

A large amount of time is spent inside a cluster file system keeping metadata blocks consistent. A cluster file system not only has to track and journal dirty blocks, but it must understand which clean blocks in memory are still valid with respect to any disk changes which other nodes might initiate.

3.1 Journaling

Journal files in OCFS2 are stored as node local system files. Each node has exclusive access to its journal, and retains a cluster lock on it for the duration of its mount.

OCFS2 does block based journaling via the JBD subsystem which has been present in the Linux kernel for several years now. This is the same journaling system in use by the Ext3 file system. Documentation on the JBD disk format can be found online, and is beyond the scope of this document.

Though the OCFS2 team could have invented their own journaling subsystem (which could have included some extra cluster optimizations), JBD was chosen for one main reason—stability. JBD has been very well tested as a result of being in use in Ext3. For any journaled file system, stability in its journaling layer is critical. To have done our own journaling layer at the time, no matter how good, would have inevitably introduced a much larger time period of unforeseen stability and corruption issues which the OCFS2 team wished to avoid.

3.2 Clustered Uptodate

The small amount of code (less than 550 lines, including a large amount of comments) in fs/ocfs2/uptodate.c attempts to mimic the buffer_head caching API while maintaining those properties across the cluster.

The Clustered Uptodate code maintains a small set of metadata caching information on every OCFS2 memory inode structure (struct ocfs2_inode_info). The caching information consists of a single sector_t per block. These are stored in a 2 item array unioned with a red-black tree root item (struct rb_root). If the number of buffers that require tracking grows larger than the array, then the red-black tree is used.
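
The array-or-tree arrangement can be pictured with a small sketch. The field names below are assumed for illustration; the real per-inode caching structure also carries flags and locking not shown here.

    #include <stdint.h>

    typedef uint64_t sector_t;          /* as in the kernel: a disk block number */
    struct rb_root { void *rb_node; };  /* stand-in for the kernel's rb-tree root */

    #define CACHE_INFO_MAX_ARRAY 2

    /*
     * Sketch of the per-inode metadata caching information: block numbers
     * live in the small array until it overflows, at which point a
     * red-black tree keyed by block number takes over.
     */
    struct caching_info_sketch {
            unsigned int num_cached;    /* how many blocks are tracked (assumed field) */
            union {
                    sector_t       array[CACHE_INFO_MAX_ARRAY];
                    struct rb_root tree;
            } cache;
    };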

A few rules were taken into account before designing the Clustered Uptodate code:

1. All metadata changes are done under cluster lock.
2. All metadata changes are journaled.
3. All metadata reads are done under a read-only cluster lock.
4. Pinning buffer_head structures is not necessary to track their validity.
5. The act of acquiring a new cluster lock can flush metadata on other nodes and invalidate the inode caching items.

There are actually a very small number of exceptions to rule 2, but none of them require the Clustered Uptodate code and can be ignored for the sake of this discussion.

Rules 1 and 2 have the effect that the return code of buffer_jbd() can be relied upon to tell us that a buffer_head can be trusted. If it is in the journal, then we must have a cluster lock on it, and therefore, its contents are trustable.

Rule 4 follows from the logic that a newly allocated buffer head will not have its BH_Uptodate flag set. Thus one does not need to pin them for tracking purposes—a block number is sufficient.

Rule 5 instructs the Clustered Uptodate code to ignore BH_Uptodate buffers for which we do not have a tracking item—the kernel may think they're up to date with respect to disk, but the file system knows better.

From these rules, a very simple algorithm is implemented within ocfs2_buffer_uptodate():

1. If buffer_uptodate() returns false, return false.
2. If buffer_jbd() returns true, return true.
3. If there is a tracking item for this block, return true.
4. Return false.

For existing blocks, tracking items are inserted after they are successfully read from disk. Newly allocated blocks have an item inserted after they have been populated.
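
The four-step check is short enough to write out. The sketch below is a user-space paraphrase: the two buffer-state queries are passed in as booleans and the tracking-item lookup is reduced to a callback, so it compiles on its own while preserving the order of the tests listed above.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t sector_t;

    /* Returns true if the caller has a tracking item for this block. */
    typedef bool (*tracking_lookup_t)(void *inode_cache, sector_t blkno);

    /*
     * Paraphrase of the ocfs2_buffer_uptodate() decision described above:
     *   1. not uptodate in memory       -> cannot be trusted
     *   2. in the journal (buffer_jbd)  -> changed under our cluster lock, trusted
     *   3. we have a tracking item      -> read or populated under a lock, trusted
     *   4. otherwise                    -> the kernel may think it is uptodate,
     *                                      but the file system knows better
     */
    bool buffer_trusted(bool uptodate, bool jbd, void *inode_cache,
                        sector_t blkno, tracking_lookup_t lookup)
    {
            if (!uptodate)
                    return false;
            if (jbd)
                    return true;
            if (lookup(inode_cache, blkno))
                    return true;
            return false;
    }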

4 Cluster Locking

4.1 A Short DLM Tutorial

OCFS2 includes a DLM which exports a pared-down VMS style API. A full description of the DLM internals would require another paper the size of this one. This subsection will concentrate on a description of the important parts of the API.

A lockable object in the OCFS2 DLM is referred to as a lock resource. The DLM has no idea what is represented by that resource, nor does it care. It only requires a unique name by which to reference a given resource. In order to gain access to a resource, a process [3] acquires locks on it. There can be several locks on a resource at any given time. Each lock has a lock level which must be compatible with the levels of all other locks on the resource. All lock resources and locks are contained within a DLM domain.

[3] When we say process here, we mean a process which could reside on any node in the cluster.

In OCFS2, locks can have one of three levels, also known as lock modes. Table 2 describes each mode and its compatibility.

    Name     Access Type   Compatible Modes
    EXMODE   Exclusive     NLMODE
    PRMODE   Read Only     PRMODE, NLMODE
    NLMODE   No Lock       EXMODE, PRMODE, NLMODE

Table 2: OCFS2 DLM lock modes

Most of the time, OCFS2 calls a single DLM function, dlmlock(). Via dlmlock() one can acquire a new lock, or upconvert and downconvert existing locks.

    typedef void (dlm_astlockfunc_t)(void *);
    typedef void (dlm_bastlockfunc_t)(void *, int);

    enum dlm_status dlmlock(
            struct dlm_ctxt *dlm,
            int mode,
            struct dlm_lockstatus *lksb,
            int flags,
            const char *name,
            dlm_astlockfunc_t *ast,
            void *data,
            dlm_bastlockfunc_t *bast);

Upconverting a lock asks the DLM to change its mode to a level greater than the currently granted one. For example, to make changes to an inode it was previously reading, the file system would want to upconvert its PRMODE lock to EXMODE. The currently granted level stays valid during an upconvert.

Downconverting a lock is the opposite of an upconvert—the caller wishes to switch to a mode that is more compatible with other modes. Often, this is done when the currently granted mode on a lock is incompatible with the mode another process wishes to acquire on its lock.
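
Table 2 boils down to a tiny compatibility test: NLMODE is compatible with everything, PRMODE with anything that is not exclusive, and EXMODE only with NLMODE. A sketch (the enum names mirror the table; the real DLM defines its own constants):

    #include <stdbool.h>

    enum dlm_mode { NLMODE, PRMODE, EXMODE };   /* no lock, read only, exclusive */

    /* Returns true if two granted modes can coexist on one resource (Table 2). */
    static bool dlm_modes_compatible(enum dlm_mode a, enum dlm_mode b)
    {
            if (a == NLMODE || b == NLMODE)
                    return true;                /* NLMODE is compatible with all modes */
            if (a == PRMODE && b == PRMODE)
                    return true;                /* shared readers coexist */
            return false;                       /* anything involving EXMODE conflicts */
    }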

All locking operations in the OCFS2 DLM are asynchronous. Status notification is done via a set of callback functions provided in the arguments of a dlmlock() call. The two most important are the AST and BAST calls.

The DLM will call an AST function after a dlmlock() request has completed. If the status value on the dlm_lockstatus structure is DLM_NORMAL then the call has succeeded. Otherwise there was an error and it is up to the caller to decide what to do next.

The term BAST stands for Blocking AST. The BAST is the DLM's method of notifying the caller that a lock it is currently holding is blocking the request of another process.

As an example, if process A currently holds an EXMODE lock on resource foo and process B requests a PRMODE lock, process A will be sent a BAST call. Typically this will prompt process A to downconvert its lock held on foo to a compatible level (in this case, PRMODE or NLMODE), upon which an AST callback is triggered for both process A (to signify completion of the downconvert) and process B (to signify that its lock has been acquired).

The OCFS2 DLM supports a feature called Lock Value Blocks, or LVBs for short. An LVB is a fixed length byte array associated with a lock resource. The contents of the LVB are entirely up to the caller. There are strict rules to LVB access. Processes holding PRMODE and EXMODE locks are allowed to read the LVB value. Only processes holding EXMODE locks are allowed to write a new value to the LVB. Typically a read is done when acquiring or upconverting to a new PRMODE or EXMODE lock, while writes to the LVB are usually done when downconverting from an EXMODE lock.
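
The process A / process B example can be followed in a toy, single-address-space simulation. The callback names and modes mirror the description above; the structures, the downconvert choice, and the call sequence are invented purely for illustration and bear no relation to the real DLM internals.

    #include <stdio.h>

    enum mode { NLMODE, PRMODE, EXMODE };

    struct lock {
            const char *owner;
            enum mode   granted;
            enum mode   requested;
    };

    static void ast(struct lock *l)                      /* a request completed */
    {
            l->granted = l->requested;
            printf("AST: %s now holds mode %d\n", l->owner, l->granted);
    }

    static void bast(struct lock *l, enum mode blocked)  /* we block someone else */
    {
            printf("BAST: %s must downconvert (blocking a mode %d request)\n",
                   l->owner, blocked);
            l->requested = PRMODE;   /* pick a level compatible with the blocked request */
            ast(l);                  /* the downconvert completes */
    }

    int main(void)
    {
            struct lock a = { "process A", EXMODE, EXMODE };
            struct lock b = { "process B", NLMODE, PRMODE };

            /* B's PRMODE request conflicts with A's EXMODE grant... */
            bast(&a, b.requested);
            /* ...and once A has downconverted, B's request can be granted. */
            ast(&b);
            return 0;
    }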

4.2 DLM Glue

DLM glue (for lack of a better name) is a performance-critical section of code whose job it is to manage the relationship between the file system and the OCFS2 DLM. As such, DLM glue is the only part of the stack which knows about the internals of the DLM—regular file system code never calls the DLM API directly.

DLM glue defines several cluster lock types with different behaviors via a set of function pointers, much like the various VFS ops structures. Most lock types use the generic functions. The OCFS2 metadata lock defines most of its own operations for complexity reasons.

The most interesting callback that DLM glue requires is the unblock operation, which has the following definition:

    int (*unblock)(struct ocfs2_lock_res *, int *);

When a blocking AST is received for an OCFS2 cluster lock, it is queued for processing on a per-mount worker thread called the vote thread. For each queued OCFS2 lock, the vote thread will call its unblock() function. If possible the unblock() function is to downconvert the lock to a compatible level. If a downconvert is impossible (for instance the lock may be in use), the function will return a non-zero value indicating the operation should be retried.

By design, the DLM glue layer never determines lifetiming of locks. That is dictated by the container object—in OCFS2, this is predominantly the struct inode which already has a set of lifetime rules to be obeyed.

Similarly, DLM glue is only concerned with multi-node locking. It is up to the callers to serialize themselves locally. Typically this is done via well-defined methods such as holding inode->i_mutex.

The most important feature of DLM glue is that it implements a technique known as lock caching. Lock caching allows the file system to skip costly DLM communication for very large numbers of operations. When a DLM lock is created in OCFS2 it is never destroyed until the container object's lifetime makes it useless to keep around. Instead, DLM glue maintains its current mode and, instead of creating new locks, calling processes only take references on a single cached lock. This means that, aside from the initial acquisition of a lock and barring any BAST calls from another node, DLM glue can keep most lock / unlock operations down to a single integer increment.

DLM glue will not block locking processes in the case of an upconvert—say a PRMODE lock is already held, but a process wants exclusive access in the cluster. DLM glue will continue to allow processes to acquire PRMODE level references while upconverting to EXMODE. Similarly, in the case of a downconvert, processes requesting access at the target mode will not be blocked.
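
Lock caching can be pictured as a small reference-counting wrapper: as long as the cached DLM lock is already at (or above) the level a caller wants, taking and dropping the lock is just a counter; only a genuine level change goes to the DLM. The structure and helpers below are invented for illustration and ignore blocking ASTs, waiters, and error handling.

    #include <stdbool.h>

    enum mode { NLMODE, PRMODE, EXMODE };

    struct cached_lock {
            enum mode level;      /* mode currently granted by the DLM */
            int       holders;    /* local references at or below that mode */
    };

    /* Stand-in for a real dlmlock() upconvert round trip. */
    static void dlm_convert(struct cached_lock *cl, enum mode wanted)
    {
            cl->level = wanted;
    }

    /* Take a reference at the wanted mode; talk to the DLM only if needed. */
    static void cluster_lock(struct cached_lock *cl, enum mode wanted)
    {
            if (cl->level < wanted)          /* cache miss: upconvert */
                    dlm_convert(cl, wanted);
            cl->holders++;                   /* cache hit: a single increment */
    }

    static void cluster_unlock(struct cached_lock *cl)
    {
            cl->holders--;                   /* the DLM lock itself stays cached */
    }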

4.3 Inode Locks

A very good example of cluster locking in OCFS2 is the inode cluster locks. Each OCFS2 inode has three locks. They are described in locking order, outermost first.

1. ip_rw_lockres, which serializes file read and write operations.
2. ip_meta_lockres, which protects inode metadata.
3. ip_data_lockres, which protects inode data.

The inode metadata locking code is responsible for keeping inode metadata consistent across the cluster. When a new lock is acquired at PRMODE or EXMODE, it is responsible for refreshing the struct inode contents. To do this, it stuffs the most common inode fields inside the lock LVB. This allows us to avoid a read from disk in some cases. The metadata unblock() method is responsible for waking up a checkpointing thread which forces journaled data to disk. OCFS2 keeps transaction sequence numbers on the inode to avoid checkpointing when unnecessary. Once the checkpoint is complete, the lock can be downconverted.

The inode data lock has a similar responsibility for data pages. Complexity is much lower however. No extra work is done on acquisition of a new lock. It is only at downconvert that work is done. For a downconvert from EXMODE to PRMODE, the data pages are flushed to disk. Any downconvert to NLMODE truncates the pages and destroys their mapping.

OCFS2 has a cluster wide rename lock, for the same reason that the VFS has s_vfs_rename_mutex—certain combinations of rename(2) can cause deadlocks, even between multiple nodes. A comment in ocfs2_rename() is instructive:

    /* Assume a directory hierarchy thusly:
     * a/b/c
     * a/d
     * a,b,c, and d are all directories.
     *
     * from cwd of 'a' on both nodes:
     * node1: mv b/c d
     * node2: mv d b/c
     *
     * And that's why, just like the VFS, we need a
     * file system rename lock. */

The super block lock serializes operations such as mount(2) and umount(2). File system membership changes occur only under an EXMODE lock on the super block. This is used to allow the mounting node to choose an appropriate slot in a race-free manner. The super block lock is also used during node messaging, as described in the next subsection.
4.4 Messaging

OCFS2 has a network vote mechanism which covers a small number of operations. The vote system stems from an older DLM design and is scheduled for final removal in the next major version of OCFS2. In the meantime it is worth reviewing.

    Vote Type               Operation
    OCFS2_VOTE_REQ_MOUNT    Mount notification
    OCFS2_VOTE_REQ_UMOUNT   Unmount notification
    OCFS2_VOTE_REQ_UNLINK   Remove a name
    OCFS2_VOTE_REQ_RENAME   Remove a name
    OCFS2_VOTE_REQ_DELETE   Query an inode wipe

Table 3: OCFS2 vote types

Each vote is broadcast to all mounted nodes (except the sending node) where they are processed. Typically vote messages about a given object are serialized by holding an EXMODE cluster lock on that object. That way the sending node knows it is the only one sending that exact vote. Other than errors, all votes except one return true. Membership is kept static during a vote by holding the super block lock. For mount/unmount that lock is held at EXMODE. All other votes keep a PRMODE lock. This way most votes can happen in parallel with respect to each other.

The mount/unmount votes inform the other mounted OCFS2 nodes of the mount status of the sending node. This allows them in turn to track whom to send their own votes to.

The rename and unlink votes instruct receiving nodes to look up the dentry for the name being removed, and call the d_delete() function against it. This has the effect of removing the name from the system. If the vote is an unlink vote, the additional step of marking the inode as possibly orphaned is taken. The flag OCFS2_INODE_MAYBE_ORPHANED will trigger additional processing in ocfs2_drop_inode(). This vote type is sent after all directory and inode locks for the operation have been acquired.

The delete vote is crucial to OCFS2 being able to support POSIX style unlink-while-open across the cluster. Delete votes are sent from ocfs2_delete_inode(), which is called on the last iput() of an orphaned inode. Receiving nodes simply check an open count on their inode. If the count is anything other than zero, they return a busy status. This way the sending node can determine whether an inode is ready to be truncated and deleted from disk.

5 Recovery

5.1 Heartbeat

The OCFS2 cluster stack heartbeats on disk and via its network connection to other nodes. This allows the cluster to maintain an idea of which nodes are alive at any given point in time. It is important to note that though they work closely together, the cluster stack is a separate entity from the OCFS2 file system.

Typically, OCFS2 disk heartbeat is done on every mounted volume in a contiguous set of sectors allocated to the heartbeat system file at file system create time. OCFS2 heartbeat actually knows nothing about the file system, and is only given a range of disk blocks to read and write. The system file is only used as a convenient method of reserving the space on a volume. Disk heartbeat is also never initiated by the file system, and always started by the mount.ocfs2 program. Manual control of OCFS2 heartbeat is available via the ocfs2_hb_ctl program.

Each node in OCFS2 has a unique node number, which dictates which heartbeat sector it will periodically write a timestamp to. Optimizations are done so that the heartbeat thread only reads those sectors which belong to nodes which are defined in the cluster configuration. Heartbeat information from all disks is accumulated together to determine node liveness. A node need only write to one disk to be considered alive in the cluster.

Network heartbeat is done via a set of keep-alive messages that are sent to each node. In the event of a split brain scenario, where the network connection to a set of nodes is unexpectedly lost, a majority-based quorum algorithm is used. In the event of a 50/50 split, the group with the lowest node number is allowed to proceed.

In the OCFS2 cluster stack, disk heartbeat is considered the final arbiter of node liveness. Network connections are built up when a node begins writing to its heartbeat sector. Likewise network connections will be torn down when a node stops heartbeating to all disks.

At startup time, interested subsystems register with the heartbeat layer for node up and node down events. Priority can be assigned to callbacks and the file system always gets node death notification before the DLM. This is to ensure that the file system has the ability to mark itself needing recovery before DLM recovery can proceed. Otherwise, a race exists where DLM recovery might complete before the file system notification takes place. This could lead to the file system gaining locks on resources which are in need of recovery—for instance, metadata whose changes are still in the dead node's journal.
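
The quorum rule is simple to state in code: a group with a strict majority of heartbeating nodes survives, and on an exact 50/50 split the group containing the lowest live node number proceeds. The helper below illustrates that rule only; it is not the real quorum code.

    #include <stdbool.h>

    /*
     * Decide whether our group may proceed after a network split.
     *   our_nodes / total_nodes:  node counts before and after the split
     *   we_have_lowest_node:      true if the lowest numbered live node is in
     *                             our group (the 50/50 tie-breaker)
     */
    static bool group_may_proceed(int our_nodes, int total_nodes,
                                  bool we_have_lowest_node)
    {
            int other_nodes = total_nodes - our_nodes;

            if (our_nodes > other_nodes)
                    return true;                 /* clear majority */
            if (our_nodes == other_nodes)
                    return we_have_lowest_node;  /* 50/50 split: lowest node wins */
            return false;                        /* minority group must not proceed */
    }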

5.2 File System Recovery

Upon notification of an unexpected node death, OCFS2 will mark a recovery bitmap. Any file system locks which cover recoverable resources have a check in their locking path for any set bits in the recovery bitmap. Those paths will then block until the bitmap is clear again. Right now the only path requiring this check is the metadata locking code—it must wait on journal replay to continue.

A recovery thread is then launched which takes an EXMODE lock on the super block. This ensures that only one node will attempt to recover the dead node. Additionally, no other nodes will be allowed to mount while the lock is held. Once the lock is obtained, each node will check the slot_map system file to determine which journal the dead node was using. If the node number is not found in the slot map, then that means recovery of the node was completed by another cluster node.

If the node is still in the slot map then journal replay is done via the proper JBD calls. Once the journal is replayed, it is marked clean and the node is taken out of the slot map.

At this point, the most critical parts of OCFS2 recovery are complete. Copies are made of the dead node's truncate log and local alloc files, and clean ones are stamped in their place. A worker thread is queued to reclaim the disk space represented in those files, the node is removed from the recovery bitmap and the super block lock is dropped.

The last part of recovery—replay of the copied truncate log and local alloc files—is (appropriately) called recovery completion. It is allowed to take as long as necessary because locking operations are not blocked while it runs. Recovery completion is even allowed to block on recovery of other nodes which may die after its work is queued. These rules greatly simplify the code in that section.

One aspect of recovery completion which has not been covered yet is orphan recovery. The orphan recovery process must be run against the dead node's orphan directory, as well as the local orphan directory. The local orphan directory is recovered because the now dead node might have had open file descriptors against an inode which was locally orphaned—thus the delete_inode() code must be run again.

Orphan recovery is a fairly straightforward process which takes advantage of the existing inode life-timing code. The orphan directory in question is locked, and the recovery completion process calls iget() to obtain an inode reference on each orphan. As references are obtained, the orphans are arranged in a singly linked list. The orphan directory lock is dropped, and iput() is run against each orphan.

6 What's Been Missed!

Lots, unfortunately. The DLM has mostly been glossed over. The rest of the OCFS2 cluster stack has hardly been mentioned. The OCFS2 tool chain has some unique properties which would make an interesting paper. Readers interested in more information on OCFS2 are urged to explore the web page and mailing lists found in the references section. OCFS2 development is done in the open and when not busy, the OCFS2 developers love to answer questions about their project.

7 Acknowledgments

A huge thanks must go to all the authors of Ext3 from which we took much of our inspiration.

Also, without JBD, OCFS2 would not be what it is today, so our thanks go to those involved in its development.

Of course, we must thank the Linux kernel community for being so kind as to accept our humble file system into their kernel. In particular, our thanks go to Christoph Hellwig and Andrew Morton whose guidance was critical in getting our file system code up to kernel standards.

8 References

The OCFS2 home page can be found at http://oss.oracle.com/projects/ocfs2/.

From there one can find mailing lists, documentation, and the source code repository.
