OCFS2: The Oracle Clustered File System, Version 2

Mark Fasheh
Oracle
[email protected]

Abstract

This talk will review the various components of the OCFS2 stack, with a focus on the file system and its clustering aspects. OCFS2 extends many local file system features to the cluster, some of the more interesting of which are posix unlink semantics, data consistency, shared readable mmap, etc.

In order to support these features, OCFS2 logically separates cluster access into multiple layers. An overview of the low level DLM layer will be given. The higher level file system locking will be described in detail, including a walkthrough of inode locking and messaging for various operations.

Caching and consistency strategies will be discussed. Metadata journaling is done on a per node basis with JBD. Our reasoning behind that choice will be described.

OCFS2 provides robust and performant recovery on node death. We will walk through the typical recovery process including journal replay, recovery of orphaned inodes, and recovery of cached metadata allocations.

Disk space is broken up into clusters which can range in size from 4 kilobytes to 1 megabyte. File data is allocated in extents of clusters. This allows OCFS2 a large amount of flexibility in file allocation.

File metadata is allocated in blocks via a sub allocation mechanism. All allocators in OCFS2 grow dynamically. Most notably, this allows OCFS2 to grow inode allocation on demand.

Allocation areas in OCFS2 are broken up into groups which are arranged in self-optimizing "chains". The chain allocators allow OCFS2 to do fast searches for free space, and deallocation in a constant time algorithm. Detail on the layout and use of chain allocators will be given.

1 Design Principles

A small set of design principles has guided most of OCFS2 development. None of them are unique to OCFS2 development, and in fact, almost all are principles we learned from the Linux kernel community. They will, however, come up often in discussion of OCFS2 file system design, so it is worth covering them now.

1.1 Avoid Useless Abstraction Layers

Some file systems have implemented large abstraction layers, mostly to make themselves portable across kernels. The OCFS2 developers have held from the beginning that OCFS2 code would be Linux only. This has helped us in several ways. An obvious one is that it made the code much easier to read and navigate. Development has been faster because we can directly use the kernel features without worrying if another OS implements the same features, or worse, writing a generic version of them.

Unfortunately, this is all easier said than done. Clustering presents a problem set which most Linux file systems don't have to deal with. When an abstraction layer is required, three principles are adhered to:

• Mimic the kernel API.

• Keep the abstraction layer as thin as possible.

• If object life timing is required, try to use the VFS object life times.

1.2 Keep Operations Local

Bouncing file system data around a cluster can be very expensive. Changed metadata blocks, for example, must be synced out to disk before another node can read them. OCFS2 design attempts to break file system updates into node local operations as much as possible.

1.3 Copy Good Ideas

There is a wealth of open source file system implementations available today. Very often during OCFS2 development, the question "How do other file systems handle it?" comes up with respect to design problems. There is no reason to reinvent a feature if another piece of software already does it well. The OCFS2 developers thus far have had no problem getting inspiration from other Linux file systems.[1] In some cases, whole sections of code have been lifted, with proper citation, from other open source projects!

[1] Most notably Ext3.

2 Disk Layout

Near the top of the ocfs2_fs.h header, one will find this comment:

/*
 * An OCFS2 volume starts this way:
 *  Sector 0: Valid ocfs1_vol_disk_hdr that cleanly
 *            fails to mount OCFS.
 *  Sector 1: Valid ocfs1_vol_label that cleanly
 *            fails to mount OCFS.
 *  Block 2:  OCFS2 superblock.
 *
 * All other structures are found
 * from the superblock information.
 */

The OCFS disk headers are the only amount of backwards compatibility one will find within an OCFS2 volume. It is an otherwise brand new cluster file system. While the file system basics are complete, there are many features yet to be implemented. The goal of this paper, then, is to provide a good explanation of where things are in OCFS2 today.

2.1 Inode Allocation Structure

The OCFS2 file system has two main allocation units - blocks and clusters. Blocks can be anywhere from 512 bytes to 4 kilobytes, whereas clusters range from 4 kilobytes up to 1 megabyte. To make the file system mathematics work properly, cluster size is always greater than or equal to block size. At format time, the disk is divided into as many cluster-sized units as will fit. Data is always allocated in clusters, whereas metadata is allocated in blocks.

Inode data is represented in extents which are organized into a b+ tree. In OCFS2, extents are represented by a triple called an extent record. Extent records are stored in a large in-inode array which extends to the end of the inode block.
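To make the extent record triple concrete, below is a minimal sketch of the record and of a logical-to-physical lookup over the in-inode array. The field names follow Table 1 below; the authoritative definition lives in ocfs2_fs.h, and the helper function is purely illustrative.

/* Sketch of an extent record; field names follow Table 1. */
struct ocfs2_extent_rec {
        __le32 e_cpos;     /* offset into the file, in clusters */
        __le32 e_clusters; /* clusters covered by this extent */
        __le64 e_blkno;    /* physical disk offset of the extent */
};

/* Illustrative only: map a cluster offset within a file to a physical
 * block by scanning an in-inode array of extent records. */
static u64 sketch_cpos_to_blkno(struct ocfs2_extent_rec *recs, int count,
                                u32 cpos, unsigned int blocks_per_cluster)
{
        int i;

        for (i = 0; i < count; i++) {
                u32 start = le32_to_cpu(recs[i].e_cpos);
                u32 len = le32_to_cpu(recs[i].e_clusters);

                if (cpos >= start && cpos < start + len)
                        return le64_to_cpu(recs[i].e_blkno) +
                               (u64)(cpos - start) * blocks_per_cluster;
        }
        return 0; /* not allocated */
}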

Figure 1: An Inode B+ tree

Record Field   Field Size   Description
e_cpos         32 bits      Offset into the file, in clusters
e_clusters     32 bits      Clusters in this extent
e_blkno        64 bits      Physical disk offset

Table 1: OCFS2 extent record

When the extent array is full, the file system will allocate an extent block to hold the current array. The first extent record in the inode will be re-written to point to the newly allocated extent block. The e_clusters and e_cpos values will refer to the part of the tree underneath that extent. Bottom level extent blocks form a linked list so that queries across a range can be done efficiently.

2.2 Directories

Directory layout in OCFS2 is very similar to Ext3, though unfortunately, htree has yet to be ported. The only difference in directory entry structure is that OCFS2 inode numbers are 64 bits wide. The rest of this section can be skipped by those already familiar with the dirent structure.

Directory inodes hold their data in the same manner which file inodes do. Directory data is arranged into an array of "directory entries". Each directory entry holds a 64-bit inode pointer, a 16-bit record length, an 8-bit name length, an 8-bit file type enum (this allows us to avoid reading the inode block for type) and of course the set of characters which make up the file name.
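For illustration, a directory entry with the fields just described might be declared roughly as follows; this is a sketch rather than the actual ocfs2_fs.h definition.

/* Sketch of an OCFS2 directory entry - essentially an Ext3-style
 * dirent with a 64-bit inode number. Illustrative only. */
struct ocfs2_dir_entry_sketch {
        __le64 inode;     /* 64-bit inode pointer */
        __le16 rec_len;   /* 16-bit record length */
        __u8   name_len;  /* 8-bit name length */
        __u8   file_type; /* 8-bit file type enum */
        char   name[];    /* the file name characters */
};

As in Ext3, rec_len gives the distance to the next entry, which is what allows entries to be packed within the directory data.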

2.3 The Super Block

The OCFS2 super block information is contained within an inode block. It contains a standard set of super block information - block size, compat/incompat/ro features, root inode pointer, etc. There are four values which are somewhat unique to OCFS2.

• s_clustersize_bits - Cluster size for the file system.

• s_system_dir_blkno - Pointer to the system directory.

• s_max_slots - Maximum number of simultaneous mounts.

• s_first_cluster_group - Block offset of first cluster group descriptor.

s_clustersize_bits is self explanatory. The reason for the other three fields will be explained in the next few sections.

2.4 The System Directory

In OCFS2 file system metadata is contained within a set of system files. There are two types of system files - global and node local. All system files are linked into the file system via the hidden system directory[2] whose inode number is pointed to by the superblock. To find a system file, a node need only search the system directory for the name in question. The most common ones are read at mount time as a performance optimization. Linking to system files from the system directory allows system file locations to be completely dynamic. Adding new system files is as simple as linking them into the directory.

[2] debugfs.ocfs2 can list the system dir with the "ls //" command.

Global system files are generally accessible by any cluster node at any time, given that it has taken the proper cluster-wide locks. The global_bitmap is one such system file. There are many others.

Node local system files are said to be "owned" by a mounted node which occupies a unique "slot". The maximum number of slots in a file system is determined by the s_max_slots superblock field. The slot_map global system file contains a flat array of node numbers which details which mounted node occupies which set of node local system files.

Ownership of a slot may mean a different thing to each node local system file. For some, it means that access to the system file is exclusive - no other node can ever access it. For others it simply means that the owning node gets preferential access - for an allocator file this might mean the owning node is the only one allowed to allocate, while every node may delete.

A node local system file has its slot number encoded in the file name. For example, the journal used by the node occupying the third file system slot (slot numbers start at zero) has the name journal:0002.

2.5 Chain Allocators

OCFS2 allocates free disk space via a special set of files called chain allocators. Remember that OCFS2 allocates in clusters and blocks, so the generic term allocation units will be used here to signify either. The space itself is broken up into allocation groups, each of which contains a fixed number of allocation units. These groups are then chained together into a set of singly linked lists, which start at the allocator inode.

Figure 2: Allocation Group

The first block of the first allocation unit within a group contains an ocfs2_group_descriptor. The descriptor contains a small set of fields followed by a bitmap which extends to the end of the block. Each bit in the bitmap corresponds to an allocation unit within the group. The most important descriptor fields follow.

• bg_free_bits_count - number of unallocated units in this group.

• bg_chain - describes which group chain this descriptor is a part of.

• bg_next_group - points to the next group descriptor in the chain.

• bg_parent_dinode - pointer to disk inode of the allocator which owns this group.

Embedded in the allocator inode is an ocfs2_chain_list structure. The chain list contains some fields followed by an array of ocfs2_chain_rec records. An ocfs2_chain_rec is a triple which describes a chain.
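A rough sketch of these structures ties the pieces together. Field names follow the descriptor fields above and the chain record fields described just below; the real definitions in ocfs2_fs.h may differ in layout and contain additional fields.

/* Sketch of a group descriptor - a small header followed by a bitmap
 * which extends to the end of the block. Illustrative only. */
struct ocfs2_group_desc_sketch {
        __le16 bg_chain;            /* which chain this group is part of */
        __le16 bg_free_bits_count;  /* unallocated units in this group */
        __le64 bg_next_group;       /* next group descriptor in the chain */
        __le64 bg_parent_dinode;    /* allocator inode owning this group */
        __u8   bg_bitmap[];         /* one bit per allocation unit */
};

/* Sketch of a chain record - the triple describing one chain. */
struct ocfs2_chain_rec_sketch {
        __le32 c_free;   /* free allocation units in this chain */
        __le32 c_total;  /* total allocation units in this chain */
        __le64 c_blkno;  /* first allocation group in the chain */
};

/* Sketch of the chain list embedded in the allocator inode. */
struct ocfs2_chain_list_sketch {
        __le16 cl_cpg;   /* clusters per group */
        __le16 cl_bpc;   /* bits (allocation units) per cluster */
        /* ... other fields ... */
        struct ocfs2_chain_rec_sketch cl_recs[]; /* one record per chain */
};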

Figure 3: Chain Allocator

• c_blkno - First allocation group.

• c_total - Total allocation units.

• c_free - Free allocation units.

The two most interesting fields at the top of an ocfs2_chain_list are cl_cpg - clusters per group and cl_bpc - bits per cluster. The product of those two fields describes the total number of blocks occupied by each allocation group. As an example, the cluster allocator whose allocation units are clusters has a cl_bpc of 1 and cl_cpg is determined by mkfs.ocfs2 (usually it just picks the largest value which will fit within a descriptor bitmap).

Chain searches are said to be self-optimizing. That is, while traversing a chain, the file system will re-link the group with the most number of free bits to the top of the list. This way, full groups can be pushed toward the end of the list and subsequent searches will require fewer disk reads.

2.6 Sub Allocators

The total number of file system clusters is largely static as determined by mkfs.ocfs2 or optionally grown via tunefs.ocfs2. File system blocks however are dynamic. For example, an inode block allocator file can be grown as more files are created.

To grow a block allocator, cl_bpc clusters are allocated from the cluster allocator. The new ocfs2_group_descriptor record is populated and that block group is linked to the top of the smallest chain (wrapping back to the first chain if all are equally full). Other than the descriptor block, zeroing of the remaining blocks is skipped - when allocated all file system blocks will be zero'd and written with a file system generation value. This allows fsck.ocfs2 to determine which blocks in a group are valid metadata.

2.7 Local Alloc

Very early in the design of OCFS2 it was determined that a large amount of performance would be gained by reducing contention on the cluster allocator. Essentially the local alloc is a node local system file with an in-inode bitmap which caches clusters from the global cluster allocator. The local alloc file is never locked within the cluster - access to it is exclusive to a mounted node. This allows the block to remain valid in memory for the entire lifetime of a mount.
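In outline, a node-local allocation from the cached bitmap can then be as cheap as the following sketch - no cluster lock, no disk read. The helper is hypothetical; it only illustrates that the common case is a simple bitmap scan in memory.

/* Hypothetical sketch: allocate one cluster from the in-memory
 * local alloc bitmap. No cluster locking is involved. */
static int sketch_local_alloc_one(unsigned long *la_bitmap,
                                  unsigned int la_bits)
{
        unsigned int bit = find_first_zero_bit(la_bitmap, la_bits);

        if (bit >= la_bits)
                return -ENOSPC;  /* window exhausted - slide it (see below) */

        set_bit(bit, la_bitmap);
        return bit;              /* caller adds the window's start offset */
}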

As the local alloc bitmap is exhausted of free space, an operation called a window slide is done. First, any unallocated bits left in the local alloc are freed back to the cluster allocator. Next, a large enough area of contiguous space is found with which to re-fill the local alloc. The cluster allocator bits are set, the local alloc bitmap is cleared, and the size and offset of the new window are recorded. If no suitable free space is found during the second step of a window slide, the local alloc is disabled for the remainder of that mount.

The size of the local alloc bitmap is tuned at mkfs time to be large enough so that most block group allocations will fit, but the total size would not be so large as to keep an inordinate amount of data unallocatable by other nodes.

2.8 Truncate Log

The truncate log is a node local system file with nearly the same properties as the local alloc file. The major difference is that the truncate log is involved in de-allocation of clusters. This in turn dictates a difference in disk structure.

Instead of a small bitmap covering a section of the cluster allocator, the truncate log contains an in-inode array of ocfs2_truncate_rec structures. Each ocfs2_truncate_rec is an extent, with a start (t_start) cluster and a length (t_clusters). This structure allows the truncate log to cover large parts of the cluster allocator.

All cluster de-allocation goes through the truncate log. It is flushed when full, two seconds after the most recent de-allocation, or on demand by a sync(2) call.

3 Metadata Consistency

A large amount of time is spent inside a cluster file system keeping metadata blocks consistent. A cluster file system not only has to track and journal dirty blocks, but it must understand which clean blocks in memory are still valid with respect to any disk changes which other nodes might initiate.

3.1 Journaling

Journal files in OCFS2 are stored as node local system files. Each node has exclusive access to its journal, and retains a cluster lock on it for the duration of its mount.

OCFS2 does block based journaling via the JBD subsystem which has been present in the Linux kernel for several years now. This is the same journaling system in use by the Ext3 file system. Documentation on the JBD disk format can be found online, and is beyond the scope of this document.

Though the OCFS2 team could have invented their own journaling subsystem (which could have included some extra cluster optimizations), JBD was chosen for one main reason - stability. JBD has been very well tested as a result of being in use in Ext3. For any journaled file system, stability in its journaling layer is critical. To have done our own journaling layer at the time, no matter how good, would have inevitably introduced a much larger time period of unforeseen stability and corruption issues which the OCFS2 team wished to avoid.

3.2 Clustered Uptodate

The small amount of code (less than 550 lines, including a large amount of comments) in fs/ocfs2/uptodate.c attempts to mimic the buffer_head caching API while maintaining those properties across the cluster.

The Clustered Uptodate code maintains a small set of metadata caching information on every OCFS2 memory inode structure (struct ocfs2_inode_info). The caching information consists of a single sector_t per block. These are stored in a 2 item array unioned with a red-black tree root item struct rb_root. If the number of buffers that require tracking grows larger than the array, then the red-black tree is used.
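The caching information described above might look roughly like this; names are illustrative and the real structure is in fs/ocfs2/uptodate.c.

/* Sketch of the per-inode metadata caching information. */
struct sketch_caching_info {
        unsigned int ci_num_cached;         /* how many blocks are tracked */
        union {
                sector_t       ci_array[2]; /* small, common case */
                struct rb_root ci_tree;     /* used once the array overflows */
        } ci_cache;
};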

A few rules were taken into account before designing the Clustered Uptodate code:

1. All metadata changes are done under cluster lock.

2. All metadata changes are journaled.

3. All metadata reads are done under a read-only cluster lock.

4. Pinning buffer_head structures is not necessary to track their validity.

5. The act of acquiring a new cluster lock can flush metadata on other nodes and invalidate the inode caching items.

There are actually a very small number of exceptions to rule 2, but none of them require the Clustered Uptodate code and can be ignored for the sake of this discussion.

Rules 1 and 2 have the effect that the return code of buffer_jbd() can be relied upon to tell us that a buffer_head can be trusted. If it is in the journal, then we must have a cluster lock on it, and therefore, its contents are trustable.

Rule 4 follows from the logic that a newly allocated buffer head will not have its BH_Uptodate flag set. Thus one does not need to pin them for tracking purposes - a block number is sufficient.

Rule 5 instructs the Clustered Uptodate code to ignore BH_Uptodate buffers for which we do not have a tracking item - the kernel may think they're up to date with respect to disk, but the file system knows better.

From these rules, a very simple algorithm is implemented within ocfs2_buffer_uptodate():

1. If buffer_uptodate() returns false, return false.

2. If buffer_jbd() returns true, return true.

3. If there is a tracking item for this block, return true.

4. Return false.

For existing blocks, tracking items are inserted after they are successfully read from disk. Newly allocated blocks have an item inserted after they have been populated.

4 Cluster Locking

4.1 A Short DLM Tutorial

OCFS2 includes a DLM which exports a pared-down VMS style API. A full description of the DLM internals would require another paper the size of this one. This subsection will concentrate on a description of the important parts of the API.

A lockable object in the OCFS2 DLM is referred to as a lock resource. The DLM has no idea what is represented by that resource, nor does it care. It only requires a unique name by which to reference a given resource. In order to gain access to a resource, a process[3] acquires locks on it. There can be several locks on a resource at any given time. Each lock has a lock level which must be compatible with the levels of all other locks on the resource. All lock resources and locks are contained within a DLM domain.

[3] When we say process here, we mean a process which could reside on any node in the cluster.

In OCFS2, locks can have one of three levels, also known as lock modes. Table 2 describes each mode and its compatibility.

Name     Access Type   Compatible Modes
EXMODE   Exclusive     NLMODE
PRMODE   Read Only     PRMODE, NLMODE
NLMODE   No Lock       EXMODE, PRMODE, NLMODE

Table 2: OCFS2 DLM lock Modes

Most of the time, OCFS2 calls a single DLM function, dlmlock(). Via dlmlock() one can acquire a new lock, or upconvert and downconvert existing locks.

typedef void (dlm_astlockfunc_t)(void *);
typedef void (dlm_bastlockfunc_t)(void *, int);

enum dlm_status dlmlock(
        struct dlm_ctxt *dlm,
        int mode,
        struct dlm_lockstatus *lksb,
        int flags,
        const char *name,
        dlm_astlockfunc_t *ast,
        void *data,
        dlm_bastlockfunc_t *bast);

Upconverting a lock asks the DLM to change its mode to a level greater than the currently granted one. For example, to make changes to an inode it was previously reading, the file system would want to upconvert its PRMODE lock to EXMODE. The currently granted level stays valid during an upconvert.

Downconverting a lock is the opposite of an upconvert - the caller wishes to switch to a mode that is more compatible with other modes. Often, this is done when the currently granted mode on a lock is incompatible with the mode another process wishes to acquire on its lock.

All locking operations in the OCFS2 DLM are asynchronous. Status notification is done via a set of callback functions provided in the arguments of a dlmlock() call. The two most important are the AST and BAST calls.

The DLM will call an AST function after a dlmlock() request has completed. If the status value on the dlm_lockstatus structure is DLM_NORMAL then the call has succeeded. Otherwise there was an error and it is up to the caller to decide what to do next.

The term BAST stands for Blocking AST. The BAST is the DLM's method of notifying the caller that a lock it is currently holding is blocking the request of another process.

As an example, if process A currently holds an EXMODE lock on resource foo and process B requests a PRMODE lock, process A will be sent a BAST call. Typically this will prompt process A to downconvert its lock held on foo to a compatible level (in this case, PRMODE or NLMODE), upon which an AST callback is triggered for both process A (to signify completion of the downconvert) and process B (to signify that its lock has been acquired).

The OCFS2 DLM supports a feature called Lock Value Blocks, or LVBs for short. An LVB is a fixed length byte array associated with a lock resource. The contents of the LVB are entirely up to the caller. There are strict rules to LVB access. Processes holding PRMODE and EXMODE locks are allowed to read the LVB value. Only processes holding EXMODE locks are allowed to write a new value to the LVB. Typically a read is done when acquiring or upconverting to a new PRMODE or EXMODE lock, while writes to the LVB are usually done when downconverting from an EXMODE lock.
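Putting the API together, an asynchronous lock request might look roughly like the sketch below. The caller-side context structure, the resource name "foo", the zero flags value, the status field spelling, and the meaning of the BAST's integer argument are illustrative assumptions; only the dlmlock() signature and the AST/BAST roles are taken from the description above.

/* Illustrative caller state - not an actual file system structure. */
struct sketch_lock_ctx {
        struct dlm_lockstatus lksb;
        int blocked;  /* set when another node wants an incompatible lock */
};

static void sketch_ast(void *data)
{
        struct sketch_lock_ctx *ctx = data;

        if (ctx->lksb.status == DLM_NORMAL) {
                /* the lock (or convert) request has been granted */
        } else {
                /* error - it is up to the caller to decide what to do next */
        }
}

static void sketch_bast(void *data, int blocking_level)
{
        struct sketch_lock_ctx *ctx = data;

        /* a lock we hold is blocking another process: arrange for a
         * downconvert to a level compatible with blocking_level */
        ctx->blocked = 1;
}

/* Ask for a PRMODE (read only) lock on the resource named "foo". */
static enum dlm_status sketch_acquire(struct dlm_ctxt *dlm,
                                      struct sketch_lock_ctx *ctx)
{
        return dlmlock(dlm, PRMODE, &ctx->lksb, 0 /* flags */, "foo",
                       sketch_ast, ctx, sketch_bast);
}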

4.2 DLM Glue

DLM glue (for lack of a better name) is a performance-critical section of code whose job it is to manage the relationship between the file system and the OCFS2 DLM. As such, DLM glue is the only part of the stack which knows about the internals of the DLM - regular file system code never calls the DLM API directly.

DLM glue defines several cluster lock types with different behaviors via a set of function pointers, much like the various VFS ops structures. Most lock types use the generic functions. The OCFS2 metadata lock defines most of its own operations for complexity reasons.

The most interesting callback that DLM glue requires is the unblock operation, which has the following definition:

int (*unblock)(struct ocfs2_lock_res *, int *);

When a blocking AST is received for an OCFS2 cluster lock, it is queued for processing on a per-mount worker thread called the vote thread. For each queued OCFS2 lock, the vote thread will call its unblock() function. If possible the unblock() function is to downconvert the lock to a compatible level. If a downconvert is impossible (for instance the lock may be in use), the function will return a non-zero value indicating the operation should be retried.

By design, the DLM glue layer never determines lifetiming of locks. That is dictated by the container object - in OCFS2 this is predominantly the struct inode which already has a set of lifetime rules to be obeyed.

Similarly, DLM glue is only concerned with multi-node locking. It is up to the callers to serialize themselves locally. Typically this is done via well-defined methods such as holding inode->i_mutex.

The most important feature of DLM glue is that it implements a technique known as lock caching. Lock caching allows the file system to skip costly DLM communication for very large numbers of operations. When a DLM lock is created in OCFS2 it is never destroyed until the container object's lifetime makes it useless to keep around. Instead, DLM glue maintains its current mode and instead of creating new locks, calling processes only take references on a single cached lock. This means that, aside from the initial acquisition of a lock and barring any BAST calls from another node, DLM glue can keep most lock / unlock operations down to a single integer increment.

DLM glue will not block locking processes in the case of an upconvert - say a PRMODE lock is already held, but a process wants exclusive access in the cluster. DLM glue will continue to allow processes to acquire PRMODE level references while upconverting to EXMODE. Similarly, in the case of a downconvert, processes requesting access at the target mode will not be blocked.

4.3 Inode Locks

A very good example of cluster locking in OCFS2 is the inode cluster locks. Each OCFS2 inode has three locks. They are described in locking order, outermost first.

1. ip_rw_lockres which serializes file read and write operations.

2. ip_meta_lockres which protects inode metadata.

3. ip_data_lockres which protects inode data.
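In the in-memory inode these appear alongside the rest of the OCFS2 state; a minimal sketch, with other fields omitted:

/* Sketch: the three per-inode cluster locks, embedded in the
 * OCFS2 in-memory inode structure. Only these fields are shown. */
struct ocfs2_inode_info_sketch {
        struct ocfs2_lock_res ip_rw_lockres;   /* file read/write serialization */
        struct ocfs2_lock_res ip_meta_lockres; /* inode metadata */
        struct ocfs2_lock_res ip_data_lockres; /* inode data */
        /* ... caching info, VFS struct inode, etc. ... */
};

A path that needs more than one of these acquires them in the order listed, and, as noted in the DLM glue discussion, callers still serialize locally (for example via inode->i_mutex) before taking cluster locks.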

The inode metadata locking code is responsible for keeping inode metadata consistent across the cluster. When a new lock is acquired at PRMODE or EXMODE, it is responsible for refreshing the struct inode contents. To do this, it stuffs the most common inode fields inside the lock LVB. This allows us to avoid a read from disk in some cases. The metadata unblock() method is responsible for waking up a checkpointing thread which forces journaled data to disk. OCFS2 keeps transaction sequence numbers on the inode to avoid checkpointing when unnecessary. Once the checkpoint is complete, the lock can be downconverted.

The inode data lock has a similar responsibility for data pages. Complexity is much lower however. No extra work is done on acquiry of a new lock. It is only at downconvert that work is done. For a downconvert from EXMODE to PRMODE, the data pages are flushed to disk. Any downconvert to NLMODE truncates the pages and destroys their mapping.

OCFS2 has a cluster wide rename lock, for the same reason that the VFS has s_vfs_rename_mutex - certain combinations of rename(2) can cause deadlocks, even between multiple nodes. A comment in ocfs2_rename() is instructive:

/* Assume a directory hierarchy thusly:
 * a/b/c
 * a/d
 * a,b,c, and d are all directories.
 *
 * from cwd of 'a' on both nodes:
 * node1: mv b/c d
 * node2: mv d b/c
 *
 * And that's why, just like the VFS, we need a
 * file system rename lock. */

The super block lock serializes mount and unmount operations. File system membership changes occur only under an EXMODE lock on the super block. This is used to allow the mounting node to choose an appropriate slot in a race-free manner. The super block lock is also used during node messaging, as described in the next subsection.

4.4 Messaging

OCFS2 has a network vote mechanism which covers a small number of operations. The vote system stems from an older DLM design and is scheduled for final removal in the next major version of OCFS2. In the meantime it is worth reviewing.

Vote Type               Operation
OCFS2_VOTE_REQ_MOUNT    Mount notification.
OCFS2_VOTE_REQ_UMOUNT   Unmount notification.
OCFS2_VOTE_REQ_UNLINK   Remove a name.
OCFS2_VOTE_REQ_RENAME   Remove a name.
OCFS2_VOTE_REQ_DELETE   Query an inode wipe.

Table 3: OCFS2 vote types

Each vote is broadcast to all mounted nodes (except the sending node) where they are processed. Typically vote messages about a given object are serialized by holding an EXMODE cluster lock on that object. That way the sending node knows it is the only one sending that exact vote. Other than errors, all votes except one return "true". Membership is kept static during a vote by holding the super block lock. For mount/unmount that lock is held at EXMODE. All other votes keep a PRMODE lock. This way most votes can happen in parallel with respect to each other.

The mount/unmount votes inform the other mounted OCFS2 nodes of the mount status of the sending node. This allows them in turn to track whom to send their own votes to.
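In outline, sending a vote looks something like the sketch below. The helper names and response encoding are hypothetical; the structure - broadcast to every mounted node except the sender, and require every answer to be affirmative - follows the description above.

/* Hypothetical sketch of the vote broadcast. */
static int sketch_broadcast_vote(struct sketch_cluster *cluster,
                                 int my_node, int vote_type, u64 blkno)
{
        int node, answer;

        for (node = 0; node < cluster->max_nodes; node++) {
                if (node == my_node || !sketch_node_is_mounted(cluster, node))
                        continue;

                answer = sketch_send_vote(cluster, node, vote_type, blkno);
                if (answer < 0)
                        return answer;   /* network error */
                if (!answer)
                        return -EBUSY;   /* a node voted "false" */
        }

        return 0;                        /* every node voted "true" */
}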

The rename and unlink votes instruct receiving nodes to look up the dentry for the name being removed, and call the d_delete() function against it. This has the effect of removing the name from the system. If the vote is an unlink vote, the additional step of marking the inode as possibly orphaned is taken. The flag OCFS2_INODE_MAYBE_ORPHANED will trigger additional processing in ocfs2_drop_inode(). This vote type is sent after all directory and inode locks for the operation have been acquired.

The delete vote is crucial to OCFS2 being able to support POSIX style unlink-while-open across the cluster. Delete votes are sent from ocfs2_delete_inode(), which is called on the last iput() of an orphaned inode. Receiving nodes simply check an open count on their inode. If the count is anything other than zero, they return a busy status. This way the sending node can determine whether an inode is ready to be truncated and deleted from disk.

5 Recovery

5.1 Heartbeat

The OCFS2 cluster stack heartbeats on disk and via its network connection to other nodes. This allows the cluster to maintain an idea of which nodes are alive at any given point in time. It is important to note that though they work closely together, the cluster stack is a separate entity from the OCFS2 file system.

Typically, OCFS2 disk heartbeat is done on every mounted volume in a contiguous set of sectors allocated to the heartbeat system file at file system create time. OCFS2 heartbeat actually knows nothing about the file system, and is only given a range of disk blocks to read and write. The system file is only used as a convenient method of reserving the space on a volume. Disk heartbeat is also never initiated by the file system, and always started by the mount.ocfs2 program. Manual control of OCFS2 heartbeat is available via the ocfs2_hb_ctl program.

Each node in OCFS2 has a unique node number, which dictates which heartbeat sector it will periodically write a timestamp to. Optimizations are done so that the heartbeat thread only reads those sectors which belong to nodes which are defined in the cluster configuration. Heartbeat information from all disks is accumulated together to determine node liveness. A node need only write to one disk to be considered alive in the cluster.

Network heartbeat is done via a set of keepalive messages that are sent to each node. In the event of a split brain scenario, where the network connection to a set of nodes is unexpectedly lost, a majority-based quorum algorithm is used. In the event of a 50/50 split, the group with the lowest node number is allowed to proceed.

In the OCFS2 cluster stack, disk heartbeat is considered the final arbiter of node liveness. Network connections are built up when a node begins writing to its heartbeat sector. Likewise network connections will be torn down when a node stops heartbeating to all disks.
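A single disk heartbeat pass can be pictured with the following hypothetical sketch: a node stamps its own sector and checks whether the other configured nodes' stamps are still changing. The structure and helper names are invented for illustration; the real heartbeat thread differs in detail.

/* Hypothetical sketch of one disk heartbeat pass. */
struct sketch_hb_sector {
        u64 seq;  /* incremented by the owning node every interval */
};

static void sketch_heartbeat_pass(struct sketch_hb_region *reg, int my_node)
{
        int node;

        reg->sectors[my_node].seq++;
        sketch_write_sector(reg, my_node);       /* write only our own sector */

        for (node = 0; node < reg->num_nodes; node++) {
                if (node == my_node || !sketch_node_configured(reg, node))
                        continue;                /* only read configured nodes */

                sketch_read_sector(reg, node);
                if (reg->sectors[node].seq != reg->last_seq[node]) {
                        reg->last_seq[node] = reg->sectors[node].seq;
                        reg->missed[node] = 0;   /* still writing: alive */
                } else if (++reg->missed[node] > reg->dead_threshold) {
                        sketch_node_down(reg, node); /* fire node-down callbacks */
                }
        }
}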

At startup time, interested subsystems register with the heartbeat layer for node up and node down events. Priority can be assigned to callbacks and the file system always gets node death notification before the DLM. This is to ensure that the file system has the ability to mark itself needing recovery before DLM recovery can proceed. Otherwise, a race exists where DLM recovery might complete before the file system notification takes place. This could lead to the file system gaining locks on resources which are in need of recovery - for instance, metadata whose changes are still in the dead node's journal.

5.2 File System Recovery

Upon notification of an unexpected node death, OCFS2 will mark a recovery bitmap. Any file system locks which cover recoverable resources have a check in their locking for any set bits in the recovery bitmap. Those paths will then block until the bitmap is clear again. Right now the only path requiring this check is the metadata locking code - it must wait on journal replay to continue.

A recovery thread is then launched which takes an EXMODE lock on the super block. This ensures that only one node will attempt to recover the dead node. Additionally, no other nodes will be allowed to mount while the lock is held. Once the lock is obtained, each node will check the slot_map system file to determine which journal the dead node was using. If the node number is not found in the slot map, then that means recovery of the node was completed by another cluster node.

If the node is still in the slot map then journal replay is done via the proper JBD calls. Once the journal is replayed, it is marked clean and the node is taken out of the slot map.

At this point, the most critical parts of OCFS2 recovery are complete. Copies are made of the dead node's truncate log and local alloc files, and clean ones are stamped in their place. A worker thread is queued to reclaim the disk space represented in those files, the node is removed from the recovery bitmap and the super block lock is dropped.

The last part of recovery - replay of the copied truncate log and local alloc files - is (appropriately) called recovery completion. It is allowed to take as long as necessary because locking operations are not blocked while it runs. Recovery completion is even allowed to block on recovery of other nodes which may die after its work is queued. These rules greatly simplify the code in that section.

One aspect of recovery completion which has not been covered yet is orphan recovery. The orphan recovery process must be run against the dead node's orphan directory, as well as the local orphan directory. The local orphan directory is recovered because the now dead node might have had open file descriptors against an inode which was locally orphaned - thus the delete_inode() code must be run again.

Orphan recovery is a fairly straightforward process which takes advantage of the existing inode life-timing code. The orphan directory in question is locked, and the recovery completion process calls iget() to obtain an inode reference on each orphan. As references are obtained, the orphans are arranged in a singly linked list. The orphan directory lock is dropped, and iput() is run against each orphan.

6 What's Been Missed!

Lots, unfortunately. The DLM has mostly been glossed over. The rest of the OCFS2 cluster stack has hardly been mentioned. The OCFS2 tool chain has some unique properties which would make an interesting paper. Readers interested in more information on OCFS2 are urged to explore the web page and mailing lists found in the references section. OCFS2 development is done in the open and when not busy, the OCFS2 developers love to answer questions about their project.

7 Acknowledgments

A huge thanks must go to all the authors of Ext3 from which we took much of our inspiration. Also, without JBD OCFS2 would not be what it is today, so our thanks go to those involved in its development.

Of course, we must thank the Linux kernel community for being so kind as to accept our humble file system into their kernel. In partic- ular, our thanks go to Christoph Hellwig and Andrew Morton whose guidance was critical in getting our file system code up to kernel stan- dards.

8 References

The OCFS2 home page can be found at http://oss.oracle.com/projects/ocfs2/. From there one can find mailing lists, docu- mentation, and the source code repository.
