OCFS2(7) OCFS2 Manual Pages OCFS2(7)

NAME

OCFS2 - A Shared-Disk Cluster File System for Linux

INTRODUCTION

OCFS2 is a file system. It allows users to store and retrieve data. The data is stored in files that are organized in a hierarchical directory tree. It is a POSIX compliant file system that supports the standard interfaces and the behavioral semantics as spelled out by that specification.

It is also a shared disk cluster file system, one that allows multiple nodes to access the same disk at the same time. This is where the fun begins, as allowing a file system to be accessible on multiple nodes opens a can of worms. What if the nodes are of different architectures? What if a node dies while writing to the file system? What data consistency can one expect if processes on two nodes are reading and writing concurrently? What if one node removes a file while it is still being used on another node?

Unlike most shared file systems where the answer is fuzzy, the answer in OCFS2 is very well defined. It behaves on all nodes exactly like a local file system. If a file is removed, the directory entry is removed but the inode is kept as long as it is in use across the cluster. When the last user closes the descriptor, the inode is marked for deletion.

The data consistency model follows the same principle. It works as if the two processes that are running on two different nodes are running on the same node. A read on a node gets the last write irrespective of the I/O mode used. The modes can be buffered, direct, asynchronous, splice or memory-mapped I/O. It is fully cache coherent.

Take for example the REFLINK feature that allows a user to create multiple writeable snapshots of a file. This feature, like all others, is fully cluster-aware. A file being written to on multiple nodes can be safely reflinked on another. The snapshot created is a point-in-time image of the file that includes both the file data and all its attributes (including extended attributes).

It is a journaling file system. When a node dies, a surviving node transparently replays the journal of the dead node. This ensures that the file system metadata is always consistent. It also defaults to ordered data journaling to ensure the file data is flushed to disk before the journal commit, to remove the small possibility of stale data appearing in files after a crash.

It is architecture and endian neutral. It allows concurrent mounts on nodes with different processors like x86, x86_64, IA64 and PPC64. It handles little and big endian, 32-bit and 64-bit architectures.

It is feature rich. It supports indexed directories, metadata checksums, extended attributes, POSIX ACLs, quotas, REFLINKs, sparse files, unwritten extents and inline-data.

It is fully integrated with the mainline Linux kernel. The file system was merged into Linux kernel 2.6.16 in early 2006.

It is quickly installed. It is available with almost all Linux distributions. The file system is on-disk compatible across all of them.

It is modular. The file system can be configured to operate with other cluster stacks like Pacemaker and CMAN along with its own stack, O2CB.

It is easily configured. The O2CB cluster stack configuration involves editing two files, one for cluster layout and the other for cluster timeouts.

It is very efficient. The file system consumes very few resources. It is used to store virtual machine images in limited memory environments like Xen and KVM.

Version 1.8.2 January 2012 1 OCFS2(7) OCFS2 Manual Pages OCFS2(7)

In summary, OCFS2 is an efficient, easily configured, modular, quickly installed, fully integrated and compatible, feature-rich, architecture and endian neutral, cache coherent, ordered data journaling, POSIX-compliant, shared disk cluster file system.

OVERVIEW

OCFS2 is a general-purpose shared-disk cluster file system for Linux capable of providing both high performance and high availability.

As it provides local file system semantics, it can be used with almost all applications. Cluster-aware applications can make use of cache-coherent parallel I/Os from multiple nodes to scale out applications easily. Other applications can make use of the clustering facilities to fail over running applications in the event of a node failure.

The notable features of the file system are:

Tunable Block size
The file system supports block sizes of 512, 1K, 2K and 4K bytes. 4KB is almost always recommended. This feature is available in all releases of the file system.

Tunable Cluster size
A cluster size is also referred to as an allocation unit. The file system supports cluster sizes of 4K, 8K, 16K, 32K, 64K, 128K, 256K, 512K and 1M bytes. For most use cases, 4KB is recommended. However, a larger value is recommended for volumes hosting mostly very large files like database files, virtual machine images, etc. A large cluster size allows the file system to store large files efficiently. This feature is available in all releases of the file system.
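The trade-off behind the cluster size recommendation can be sketched with a little arithmetic. The helper below is hypothetical (it is not part of ocfs2-tools) and assumes a fully allocated file with no holes and no inline data, so that space is consumed in whole clusters.

```shell
#!/bin/sh
# Hypothetical helper: number of clusters needed to hold a file of a given size.
# Assumption: the file is fully allocated (no holes, no inline data).
clusters_needed() {
    # $1 = file size in bytes, $2 = cluster size in bytes
    # Integer division rounded up, as space is handed out in whole clusters.
    echo $(( ($1 + $2 - 1) / $2 ))
}

small=10000                # a 10,000 byte file
large=10737418240          # a 10 GiB file (10 * 1024^3 bytes)

# Compare the smallest (4K) and largest (1M) supported cluster sizes.
for cluster_size in 4096 1048576; do
    echo "cluster size $cluster_size:" \
         "small file needs $(clusters_needed "$small" "$cluster_size") cluster(s)," \
         "large file needs $(clusters_needed "$large" "$cluster_size") cluster(s)"
done
```

With 4K clusters the small file occupies 3 clusters (12,288 bytes) while the large file is tracked as 2,621,440 clusters. With 1M clusters the large file needs only 10,240 clusters, but the small file consumes a full 1,048,576 byte cluster.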

Endian and Architecture neutral
The file system can be mounted concurrently on nodes having different architectures, such as 32-bit, 64-bit, little-endian (x86, x86_64, ia64) and big-endian (ppc64, s390x). This feature is available in all releases of the file system.

Buffered, Direct, Asynchronous, Splice and Memory Mapped I/O modes
The file system supports all modes of I/O for maximum flexibility and performance. It also supports cluster-wide shared writeable mmap(2). The support for buffered, direct and asynchronous I/O is available in all releases. The support for splice I/O was added in Linux kernel 2.6.20 and for shared writeable mmap(2) in 2.6.23.

Multiple Cluster Stacks
The file system includes a flexible framework to allow it to function with userspace cluster stacks like Pacemaker (pcmk) and CMAN (cman), its own in-kernel cluster stack o2cb and no cluster stack.

The support for o2cb cluster stack is available in all releases.

The support for no cluster stack, or local mount, was added in Linux kernel 2.6.20.

The support for userspace cluster stack was added in Linux kernel 2.6.26.

Journaling
The file system supports both ordered (default) and writeback data journaling modes to provide file system consistency in the event of power failure or system crash. It uses JBD2 in Linux kernel 2.6.28 and later. It used JBD in earlier kernels.

Extent-based Allocations
The file system allocates and tracks space in ranges of clusters. This is unlike block based file systems that have to track each and every block. This feature allows the file system to be very efficient when dealing with both large volumes and large files. This feature is available in all releases of the file system.

Sparse files
Sparse files are files with holes. With this feature, the file system delays allocating space until a write is issued to a cluster. This feature was added in Linux kernel 2.6.22 and requires enabling on-disk feature sparse.

Unwritten Extents
An unwritten extent is also referred to as user pre-allocation. It allows an application to request a range of clusters to be allocated, but not initialized, within a file. Pre-allocation allows the file system to optimize the data layout with fewer, larger extents. It also provides a performance boost, delaying initialization until the user writes to the clusters. This feature was added in Linux kernel 2.6.23 and requires enabling on-disk feature unwritten.

Hole Punching
Hole punching allows an application to remove arbitrary allocated regions within a file, essentially creating holes. This is more efficient than zeroing the same extents. This feature is especially useful in virtualized environments as it allows a block discard in a guest file system to be converted to a hole punch in the host file system, thus allowing users to reduce disk space usage. This feature was added in Linux kernel 2.6.23 and requires enabling on-disk features sparse and unwritten.

Inline-data
Inline data is also referred to as data-in-inode as it allows storing small files and directories in the inode block. This not only saves space but also has a positive impact on cold-cache directory and file operations. The data is transparently moved out to an extent when it no longer fits inside the inode block. This feature was added in Linux kernel 2.6.24 and requires enabling on-disk feature inline-data.

REFLINK
REFLINK is also referred to as fast copy. It allows users to atomically (and instantly) copy regular files. In other words, create multiple writeable snapshots of regular files. It is called REFLINK because it looks and feels more like a (hard) link(2) than a traditional snapshot. Like a link, it is a regular user operation, subject to the security attributes of the inode being reflinked and not to the super user privileges typically required to create a snapshot. Like a link, it operates within a file system. But unlike a link, it links the inodes at the data extent level, allowing each reflinked inode to grow independently as and when written to. Up to four billion inodes can share a data extent. This feature was added in Linux kernel 2.6.32 and requires enabling on-disk feature refcount.

Allocation Reservation
File contiguity plays an important role in file system performance. When a file is fragmented on disk, reading and writing to the file involves many seeks, leading to lower throughput. Contiguous files, on the other hand, minimize seeks, allowing the disks to perform I/O at the maximum rate.

With allocation reservation, the file system reserves a window in the bitmap for all extending files, allowing each to grow as contiguously as possible. As this extra space is not actually allocated, it is available for use by other files if the need arises. This feature was added in Linux kernel 2.6.35 and can be tuned using the mount option resv_level.

Indexed Directories
An indexed directory allows users to perform quick lookups of a file in very large directories. It also results in faster creates and unlinks and thus provides better overall performance. This feature was added in Linux kernel 2.6.30 and requires enabling on-disk feature indexed-dirs.

File Attributes
This refers to EXT2-style file attributes, such as immutable, modified using chattr(1) and queried using lsattr(1). This feature was added in Linux kernel 2.6.19.

Extended Attributes
An extended attribute refers to a name:value pair that can be associated with file system objects like regular files, directories, symbolic links, etc. OCFS2 allows associating an unlimited number of attributes per object. The attribute names can be up to 255 bytes in length, terminated by the first NUL character. While it is not required, printable names (ASCII) are recommended. The attribute values can be up to 64 KB of arbitrary binary data. These attributes can be modified and listed using standard Linux utilities setfattr(1) and getfattr(1). This feature was added in Linux kernel 2.6.29 and requires enabling on-disk feature xattr.

Metadata Checksums
This feature allows the file system to detect silent corruptions in all metadata blocks like inodes and directories. This feature was added in Linux kernel 2.6.29 and requires enabling on-disk feature metaecc.

POSIX ACLs and Security Attributes
POSIX ACLs allow assigning fine-grained discretionary access rights for files and directories. This security scheme is a lot more flexible than the traditional file access permissions that impose a strict user-group-other model.

Security attributes allow the file system to support other security regimes like SELinux, SMACK, AppArmor, etc.

Both these security extensions were added in Linux kernel 2.6.29 and require enabling on-disk feature xattr.

User and Group Quotas
This feature allows setting up usage quotas on a user and group basis by using the standard utilities like quota(1), setquota(8), quotacheck(8), and quotaon(8). This feature was added in Linux kernel 2.6.29 and requires enabling on-disk features usrquota and grpquota.

Unix File Locking
Unix has historically provided two system calls to lock files: flock(2), or BSD locking, and fcntl(2), or POSIX locking. OCFS2 extends both file locks to the cluster. File locks taken on one node interact with those taken on other nodes.

The support for clustered flock(2) was added in Linux kernel 2.6.26. All flock(2) options are supported, including the kernel's ability to cancel a lock request when an appropriate signal is received by the user. This feature is supported with all cluster stacks including o2cb.

The support for clustered fcntl(2) was added in Linux kernel 2.6.28. But because it requires group communication to make the locks coherent, it is only supported with the userspace cluster stacks, pcmk and cman, and not with the default cluster stack o2cb.
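The flock(2) behavior described above can be sketched from the shell. This example assumes the flock(1) wrapper from util-linux is installed, which this page does not mention, and it runs entirely on one node using a temporary file. On an OCFS2 volume the competing lock request could equally come from another node.

```shell
#!/bin/sh
# Sketch: an exclusive flock(2) lock refuses a second, independent request.
# Assumption: flock(1) from util-linux is available on this system.
lockfile=$(mktemp)

{
    flock -x 9                          # take an exclusive lock on descriptor 9
    # A separate open of the same file must not get the lock while we hold it.
    # -n makes the attempt fail immediately instead of waiting.
    if flock -n "$lockfile" true; then
        held="acquired"
    else
        held="refused"
    fi
} 9> "$lockfile"                        # descriptor 9 closes here, dropping the lock

# With the first lock released, the same request now succeeds.
if flock -n "$lockfile" true; then
    released="acquired"
else
    released="refused"
fi

echo "while locked: $held, after release: $released"
rm -f "$lockfile"
```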

Comprehensive Tools Support
The file system has a comprehensive EXT3-style toolset that tries to use similar parameters for ease-of-use. It includes mkfs.ocfs2(8) (format), tunefs.ocfs2(8) (tune), fsck.ocfs2(8) (check), debugfs.ocfs2(8) (debug), etc.

Online Resize
The file system can be dynamically grown using tunefs.ocfs2(8). This feature was added in Linux kernel 2.6.25.

RECENT CHANGES

The O2CB cluster stack has a global heartbeat mode. It allows users to specify heartbeat regions that are consistent across all nodes. The cluster stack also allows online addition and removal of both nodes and heartbeat regions.

o2cb(8) is the new cluster configuration utility. It is an easy to use utility that allows users to create the cluster configuration on a node that is not part of the cluster. It replaces the older utility o2cb_ctl(8), which has been deprecated.

ocfs2console(8) has been obsoleted.

o2info(8) is a new utility that can be used to provide file system information. It allows non-privileged users to see the enabled file system features, block and cluster sizes, extended file stat, free space fragmentation, etc.

o2hbmonitor(8) is an o2hb heartbeat monitor. It is an extremely lightweight utility that logs messages to the system logger once the heartbeat delay exceeds the warn threshold. This utility is useful in identifying volumes encountering I/O delays.

debugfs.ocfs2(8) has some new commands. net_stats shows the o2net message times between various nodes. This is useful in identifying nodes that are slowing down the cluster operations. stat_sysdir allows the user to dump the entire system directory, which can be used to debug issues. grpextents dumps the complete free space fragmentation in the cluster group allocator.

mkfs.ocfs2(8) now enables xattr, indexed-dirs, discontig-bg, refcount, extended-slotmap and clusterinfo feature flags by default, in addition to the older defaults, sparse, unwritten and inline-data.

mount.ocfs2(8) allows users to specify the level of cache coherency between nodes. By default the file system operates in full coherency mode that also serializes the direct I/Os. While this mode is technically correct, it limits the I/O throughput in a clustered database. This mount option allows the user to limit the cache coherency to only the buffered I/Os to allow multiple nodes to do concurrent direct writes to the same file. This feature works with Linux kernel 2.6.37 and later.

COMPATIBILITY

The OCFS2 development team goes to great lengths to maintain compatibility. It attempts to maintain both on-disk and network protocol compatibility across all releases of the file system. It does so even while adding new features that entail on-disk format and network protocol changes. To do this successfully, it follows a few rules:

1. The on-disk format changes are managed by a set of feature flags that can be turned on and off. The file system in kernel detects these features during mount and continues only if it understands all the features. Users encountering this have the option of either disabling that feature or upgrading the file system to a newer release.

2. The latest release of ocfs2-tools is compatible with all versions of the file system. All utilities detect the features enabled on disk and continue only if they understand all the features. Users encountering this have to upgrade the tools to a newer release.

3. The network protocol version is negotiated by the nodes to ensure all nodes understand the active protocol version.

FEATURE FLAGS

The feature flags are split into three categories, namely, Compat, Incompat and RO Compat.

Compat, or compatible, is a feature that the file system does not need to fully understand to safely read/write to the volume. An example of this is the backup-super feature that added the capability to backup the super block in multiple locations in the file system. As the backup super blocks are typically not read nor written to by the file system, an older file system can safely mount a volume with this feature enabled.

Incompat, or incompatible, is a feature that the file system needs to fully understand to read/write to the volume. Most features fall under this category.

RO Compat, or read-only compatible, is a feature that the file system needs to fully understand to write to the volume. Older software can safely read a volume with this feature enabled. An example of this would be user and group quotas. As quotas are manipulated only when the file system is written to, older software can safely mount such volumes in read-only mode.
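The three categories can be summarized as a small decision table. The function below is a hypothetical illustration of the rules just described, not code from the file system or from ocfs2-tools. It answers the question: what can software do with a volume that has a feature it does not understand?

```shell
#!/bin/sh
# Hypothetical helper: best mount mode available when the running software
# does NOT understand a feature of the given category.
mount_outcome() {
    case $1 in
        compat)   echo "read-write" ;;   # safe to read and write regardless
        rocompat) echo "read-only"  ;;   # safe to read, not safe to write
        incompat) echo "refused"    ;;   # cannot be mounted at all
        *)
            echo "unknown category: $1" >&2
            return 1 ;;
    esac
}

for category in compat rocompat incompat; do
    echo "$category: $(mount_outcome "$category")"
done
```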

The list of feature flags, the version of the kernel it was added in, the earliest version of the tools that understands it, etc., is as follows:

Feature Flags         Kernel Version  Tools Version    Category   Hex Value
backup-super          All             ocfs2-tools 1.2  Compat     1
strict-journal-super  All             All              Compat     2
local                 Linux 2.6.20    ocfs2-tools 1.2  Incompat   8
sparse                Linux 2.6.22    ocfs2-tools 1.4  Incompat   10
inline-data           Linux 2.6.24    ocfs2-tools 1.4  Incompat   40
extended-slotmap      Linux 2.6.27    ocfs2-tools 1.6  Incompat   100
xattr                 Linux 2.6.29    ocfs2-tools 1.6  Incompat   200
indexed-dirs          Linux 2.6.30    ocfs2-tools 1.6  Incompat   400
metaecc               Linux 2.6.29    ocfs2-tools 1.6  Incompat   800
refcount              Linux 2.6.32    ocfs2-tools 1.6  Incompat   1000
discontig-bg          Linux 2.6.35    ocfs2-tools 1.6  Incompat   2000
clusterinfo           Linux 2.6.37    ocfs2-tools 1.8  Incompat   4000
unwritten             Linux 2.6.23    ocfs2-tools 1.4  RO Compat  1
grpquota              Linux 2.6.29    ocfs2-tools 1.6  RO Compat  2
usrquota              Linux 2.6.29    ocfs2-tools 1.6  RO Compat  4

To query the features enabled on a volume, do:

$ o2info --fs-features /dev/sdf1
backup-super strict-journal-super sparse extended-slotmap inline-data xattr indexed-dirs refcount discontig-bg clusterinfo unwritten

ENABLING AND DISABLING FEATURES

The format utility, mkfs.ocfs2(8), allows a user to enable and disable specific features using the --fs-features option. The features are provided as a comma separated list. The enabled features are listed as is. The disabled features are prefixed with no. The example below shows the file system being formatted with sparse disabled and inline-data enabled.

# mkfs.ocfs2 --fs-features=nosparse,inline-data /dev/sda1

After formatting, the users can toggle features using the tune utility, tunefs.ocfs2(8). This is an offline operation. The volume needs to be unmounted across the cluster. The example below shows the sparse feature being enabled and inline-data disabled.

# tunefs.ocfs2 --fs-features=sparse,noinline-data /dev/sda1

Care should be taken before enabling and disabling features. Users planning to use a volume with an older version of the file system will be better off not enabling newer features, as disabling them later may not succeed.

An example would be disabling the sparse feature; this requires filling every hole. The operation can only succeed if the file system has enough free space.

DETECTING FEATURE INCOMPATIBILITY

Say one tries to mount a volume with an incompatible feature. What happens then? How does one detect the problem? How does one know the name of that incompatible feature?

To begin with, one should look for error messages in dmesg(8). Mount failures that are due to an incompatible feature will always result in an error message like the following:

ERROR: couldn't mount because of unsupported optional features (200).

Here the file system is unable to mount the volume due to an unsupported optional feature. That means the feature is an Incompat feature. By referring to the table above, one can then deduce that the user failed to mount a volume with the xattr feature enabled. (The value in the error message is in hexadecimal.)

Another example of an error message due to incompatibility is as follows:

ERROR: couldn't mount RDWR because of unsupported optional features (1).

Here the file system is unable to mount the volume in the RW mode. That means the feature is an RO Compat feature. Another look at the table and it becomes apparent that the volume had the unwritten feature enabled.

In both cases, the user has the option of disabling the feature. In the second case, the user also has the choice of mounting the volume in the RO mode.
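The table lookup described above can be sketched as a short script. decode_features is a hypothetical helper, not part of ocfs2-tools. The names and hexadecimal values are copied from the feature flag table in this page. The sketch further assumes that when several features are unsupported, the reported value is the bitwise OR of their individual values.

```shell
#!/bin/sh
# Hypothetical helper: translate the hexadecimal value from a mount error
# message into feature names, using the feature flag table from this page.
# Usage: decode_features incompat|rocompat HEXVALUE
decode_features() (
    category=$1
    value=$(( 0x$2 ))          # the value in the error message is hexadecimal

    case $category in
        incompat)
            table="local:8 sparse:10 inline-data:40 extended-slotmap:100
                   xattr:200 indexed-dirs:400 metaecc:800 refcount:1000
                   discontig-bg:2000 clusterinfo:4000" ;;
        rocompat)
            table="unwritten:1 grpquota:2 usrquota:4" ;;
        *)
            echo "unknown category: $category" >&2
            exit 1 ;;
    esac

    out=""
    for entry in $table; do
        name=${entry%%:*}                  # text before the colon
        bit=$(( 0x${entry##*:} ))          # hex value after the colon
        # Assumption: multiple unsupported features are OR-ed together.
        if [ $(( value & bit )) -ne 0 ]; then
            out="${out:+$out }$name"
        fi
    done
    echo "$out"
)

decode_features incompat 200    # first error message above
decode_features rocompat 1      # second error message above
```

The category comes from the wording of the message: a plain mount failure points at the Incompat table, while a failure to mount RDWR points at the RO Compat table.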

GETTING STARTED

The OCFS2 software is split into two components, namely, kernel and tools. The kernel component includes the core file system and the cluster stack, and is packaged along with the kernel. The tools component is packaged as ocfs2-tools and needs to be specifically installed. It provides utilities to format, tune, mount, debug and check the file system.

To install ocfs2-tools, refer to the package handling utility in your distribution.

The next step is selecting a cluster stack. The options include:

A. No cluster stack, or local mount.

B. In-kernel o2cb cluster stack with local or global heartbeat.

C. Userspace cluster stacks pcmk or cman.

The file system allows changing cluster stacks easily using tunefs.ocfs2(8). To list the cluster stacks stamped on the OCFS2 volumes, do:

# mounted.ocfs2 -d
Device     Stack  Cluster     F  UUID                              Label
/dev/sdb1  o2cb   webcluster  G  DCDA2845177F4D59A0F2DCD8DE507CC3  hbvol1
/dev/sdc1  None                  23878C320CF3478095D1318CB5C99EED  localmount
/dev/sdd1  o2cb   webcluster  G  8AB016CD59FC4327A2CDAB69F08518E3  webvol
/dev/sdg1  o2cb   webcluster  G  77D95EF51C0149D2823674FCC162CF8B  logsvol
/dev/sdh1  o2cb   webcluster  G  BBA1DBD0F73F449384CE75197D9B7098  scratch

NON-CLUSTERED OR LOCAL MOUNT

To format an OCFS2 volume as a non-clustered (local) volume, do:

# mkfs.ocfs2 -L "mylabel" --fs-features=local /dev/sda1

To convert an existing clustered volume to a non-clustered volume, do:

# tunefs.ocfs2 --fs-features=local /dev/sda1

Non-clustered volumes do not interact with the cluster stack. One can have both clustered and non-clustered volumes mounted at the same time.

While formatting a non-clustered volume, users should consider the possibility of later converting that volume to a clustered one. If there is a possibility of that, then the user should add enough node-slots using the -N option. Adding node-slots during format creates journals with large extents. If created later, the journals will be fragmented, which is not good for performance.

CLUSTERED MOUNT WITH O2CB CLUSTER STACK

Only one of the two heartbeat modes can be active at any one time. Changing heartbeat modes is an offline operation.

Both heartbeat modes require /etc/ocfs2/cluster.conf and /etc/sysconfig/o2cb to be populated as described in ocfs2.cluster.conf(5) and o2cb.sysconfig(5) respectively. The only difference in set up between the two modes is that global requires heartbeat devices to be configured whereas local does not.

Refer to o2cb(7) for more information.

LOCAL HEARTBEAT

This is the default heartbeat mode. The user needs to populate the configuration files as described in ocfs2.cluster.conf(5) and o2cb.sysconfig(5). In this mode, the cluster stack heartbeats on all mounted volumes. Thus, one does not have to specify heartbeat devices in cluster.conf.

Once configured, the o2cb cluster stack can be onlined and offlined as follows:

# service o2cb online
Setting cluster stack "o2cb": OK
Registering O2CB cluster "webcluster": OK
Setting O2CB cluster timeouts : OK

# service o2cb offline
Clean userdlm domains: OK
Stopping O2CB cluster webcluster: OK
Unregistering O2CB cluster "webcluster": OK

GLOBAL HEARTBEAT

The configuration is similar to local heartbeat. The one additional step in this mode is that it also requires heartbeat devices to be configured.

These heartbeat devices are OCFS2 formatted volumes with global heartbeat enabled on disk. These volumes can later be mounted and used as clustered file systems.

The steps to format a volume with global heartbeat enabled are listed in o2cb(7). Also listed there are the steps to list all volumes with the cluster stack stamped on disk.

In this mode, the heartbeat is started when the cluster is onlined and stopped when the cluster is offlined.

# service o2cb online
Setting cluster stack "o2cb": OK
Registering O2CB cluster "webcluster": OK
Setting O2CB cluster timeouts : OK
Starting global heartbeat for cluster "webcluster": OK

# service o2cb offline
Clean userdlm domains: OK
Stopping global heartbeat on cluster "webcluster": OK
Stopping O2CB cluster webcluster: OK
Unregistering O2CB cluster "webcluster": OK

# service o2cb status
Driver for "configfs": Loaded
Filesystem "configfs": Mounted
Stack glue driver: Loaded
Stack plugin "o2cb": Loaded
Driver for "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster "webcluster": Online
  Heartbeat dead threshold: 31
  Network idle timeout: 30000
  Network keepalive delay: 2000
  Network reconnect delay: 2000
  Heartbeat mode: Global
Checking O2CB heartbeat: Active
  77D95EF51C0149D2823674FCC162CF8B /dev/sdg1
Nodes in O2CB cluster: 92 96

CLUSTERED MOUNT WITH USERSPACE CLUSTER STACK

Configure and online the userspace stack pcmk or cman before using tunefs.ocfs2(8) to update the cluster stack on disk.

# tunefs.ocfs2 --update-cluster-stack /dev/sdd1
Updating on-disk cluster information to match the running cluster.
DANGER: YOU MUST BE ABSOLUTELY SURE THAT NO OTHER NODE IS USING THIS
FILESYSTEM BEFORE MODIFYING ITS CLUSTER CONFIGURATION.
Update the on-disk cluster information? y

Refer to the cluster stack documentation for information on starting and stopping the cluster stack.

FILE SYSTEM UTILITIES

This section lists the utilities that are used to manage the OCFS2 file systems. This includes tools to format, tune, check, mount and debug the file system. Each utility has a man page that lists its capabilities in detail.

mkfs.ocfs2(8)
This is the file system format utility. All volumes have to be formatted prior to use. As this utility overwrites the volume, use it with care. Double check to ensure the volume is not in use on any node in the cluster.

As a precaution, the utility will abort if the volume is locally mounted. It also detects use across the cluster if used by OCFS2. But these checks are not comprehensive and can be overridden. So use it with care.

While it is not always required, the cluster should be online.

tunefs.ocfs2(8)
This is the file system tune utility. It allows users to change certain on-disk parameters like label, uuid, number of node-slots, volume size and the size of the journals. It also allows turning on and off the file system features as listed above.

This utility requires the cluster to be online.

fsck.ocfs2(8)
This is the file system check utility. It detects and fixes on-disk errors. All the check codes and their fixes are listed in fsck.ocfs2.checks(8).

This utility requires the cluster to be online to ensure the volume is not in use on another node and to prevent the volume from being mounted for the duration of the check.

mount.ocfs2(8)
This is the file system mount utility. It is invoked indirectly by the mount(8) utility.

This utility detects the cluster status and aborts if the cluster is offline or does not match the cluster stamped on disk.

o2cluster(8)
This is the file system cluster stack update utility. It allows the users to update the on-disk cluster stack to the one provided.

This utility only updates the disk if the utility is reasonably assured that the file system is not in use on any node.

o2info(1)
This is the file system information utility. It provides information like the features enabled on disk, block size, cluster size, free space fragmentation, etc.

It can be used by both privileged and non-privileged users. Users having read permission on the device can provide the path to the device. Other users can provide the path to a file on a mounted file system.

debugfs.ocfs2(8)
This is the file system debug utility. It allows users to examine all file system structures including walking directory structures, displaying inodes, backing up files, etc., without mounting the file system.

This utility requires the user to have read permission on the device.

o2image(8)
This is the file system image utility. It allows users to copy the file system metadata skeleton, including the inodes, directories, bitmaps, etc. As it excludes data, it shrinks the size of the file system tremendously.

The image file created can be used in debugging on-disk corruptions.

mounted.ocfs2(8)
This is the file system detect utility. It detects all OCFS2 volumes in the system and lists their label, uuid and cluster stack.

O2CB CLUSTER STACK UTILITIES

This section lists the utilities that are used to manage the O2CB cluster stack. Each utility has a man page that lists its capabilities in detail.

o2cb(8)
This is the cluster configuration utility. It allows users to update the cluster configuration by adding and removing nodes and heartbeat regions. This utility is used by the o2cb init script to online and offline the cluster.

This is a new utility and replaces o2cb_ctl(8) which has been deprecated.

ocfs2_hb_ctl(8)
This is the cluster heartbeat utility. It allows users to start and stop local heartbeat. This utility is invoked by mount.ocfs2(8) and should not be invoked directly by the user.

o2hbmonitor(8)
This is the disk heartbeat monitor. It tracks the elapsed time since the last heartbeat and logs warnings once that time exceeds the warn threshold.

FILE SYSTEM NOTES

This section includes some useful notes that may prove helpful to the user.

BALANCED CLUSTER

A cluster is a computer. This is a fact and not a slogan. What this means is that an errant node in the cluster can affect the behavior of other nodes. If one node is slow, the cluster operations will slow down on all nodes. To prevent that, it is best to have a balanced cluster. This is a cluster that has equally powered and loaded nodes.

The standard recommendation for such clusters is to have identical hardware and software across all the nodes. However, that is not a hard and fast rule. After all, we have taken the effort to ensure that OCFS2 works in a mixed architecture environment.

If one uses OCFS2 in a mixed architecture environment, try to ensure that the nodes are equally powered and loaded. The use of a load balancer can assist with the latter. Power refers to the number of processors, speed, amount of memory, I/O throughput, network bandwidth, etc. In reality, having equally powered heterogeneous nodes is not always practical. In that case, make the lower node numbers more powerful than the higher node numbers. The O2CB cluster stack favors lower node numbers in all of its tiebreaking logic.

This is not to suggest you should add a single core node in a cluster of quad cores. No amount of node number juggling will help you there.

FILE DELETION

In Linux, rm(1) removes the directory entry. It does not necessarily delete the corresponding inode. By removing the directory entry, it gives the illusion that the inode has been deleted. This puzzles users when they do not see a corresponding up-tick in the reported free space. The reason is that inode deletion has a few more hurdles to cross.

First is the hard link count. This indicates the number of directory entries pointing to that inode. As long as a directory entry is linked to that inode, it cannot be deleted. The file system has to wait for that count to drop to zero.

The second hurdle is the POSIX semantics allowing files to be unlinked even while they are in use. In OCFS2, that translates to in use across the cluster. The file system has to wait for all processes across the cluster to stop using the inode.

Once these two conditions are met, the inode is deleted and the freed bits are flushed to disk on the next sync.

This assumes that the inode was not reflinked. If it was, then the deletion would only release space that was private to the inode. Shared space would only be released when the last inode using it is deleted.

Users interested in following the trail can use debugfs.ocfs2(8) to view the node specific system files orphan_dir and truncate_log. Once the link count is zero, an inode is moved to the orphan_dir. After deletion, the freed bits are added to the truncate_log, where they remain until the next sync, during which the bits are flushed to the global bitmap.
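The unlink-while-open behavior described above can be observed on a single node with a short script. This is a minimal sketch that runs on any POSIX file system. The function name and the scratch directory are illustrative only, and it does not exercise the cluster-wide case.

```shell
# Sketch: an unlinked file stays readable while a descriptor is still open.
unlink_while_open() {
    dir=$(mktemp -d) || return 1
    echo "still here" > "$dir/demo"
    exec 3< "$dir/demo"        # hold the inode open on descriptor 3
    rm "$dir/demo"             # removes the directory entry only
    if [ -e "$dir/demo" ]; then
        echo "name: present"
    else
        echo "name: gone"
    fi
    cat <&3                    # the data is still reachable via the descriptor
    exec 3<&-                  # last close; the inode can now be deleted
    rmdir "$dir"
}
unlink_while_open
```

The name disappears as soon as rm(1) returns, but the data remains readable until the descriptor is closed. On OCFS2 the same holds when the descriptor is held on another node.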

DIRECTORY LISTING ls(1) may be a simple command, but it is not cheap. What is expensive is not the part where it reads the directory listing, but the second part where it reads all the inodes, also referred to as an inode stat(2). If the inodes are not in cache, this can entail disk I/O. Now, while a cold cache inode stat(2) is expensive in all file systems, it is especially so in a clustered file system. It needs to take a lock on each node, pure overhead when compared to a local file system.

A hot cache stat(2), on the other hand, has been shown to perform on OCFS2 as it does on EXT3.

In other words, the second ls(1) will be quicker than the first. However, it is not guaranteed. Say you have a million files in a file system and not enough kernel memory to cache all the inodes. In that case, each ls(1) will involve some cold cache stat(2)s.

ALLOCATION RESERVATION Allocation reservation allows multiple concurrently extending files to grow as contiguously as possible. One way to demonstrate its functioning is to run a script that extends multiple files in a circular order. The script below does that by writing one hundred 4KB chunks to four files, one after another.

$ for i in $(seq 0 99);
> do
> for j in $(seq 4);
> do
> dd if=/dev/zero of=file$j bs=4K count=1 seek=$i;
> done;
> done;

When run on a system running Linux kernel 2.6.34 or earlier, we end up with files with 100 extents each. That is full fragmentation. As the files are being extended one after another, the on-disk allocations are fully interleaved.

$ filefrag file1 file2 file3 file4
file1: 100 extents found
file2: 100 extents found
file3: 100 extents found
file4: 100 extents found

When run on a system running Linux kernel 2.6.35 or later, we see files with 7 extents each. That is a lot fewer than before. Fewer extents mean more on-disk contiguity and that always leads to better overall performance.

$ filefrag file1 file2 file3 file4
file1: 7 extents found
file2: 7 extents found
file3: 7 extents found
file4: 7 extents found
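To compare the two runs at a glance, the per-file counts can be totalled. The helper below is a sketch only. The name sum_extents is made up for this example, and it assumes filefrag prints one "NAME: N extents found" line per file, as in the output above.

```shell
# Sketch: total the extent counts reported by filefrag.
# Reads lines such as "file1: 100 extents found" from standard input.
sum_extents() {
    awk '/extents? found$/ { total += $(NF - 2); files++ }
         END { printf "%d files, %d extents\n", files, total }'
}
```

A typical use would be "filefrag file1 file2 file3 file4 | sum_extents", run once before and once after the kernel upgrade.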

REFLINK OPERATION This feature allows a user to create a writeable snapshot of a regular file. In this operation, the file system creates a new inode with the same extent pointers as the original inode. Multiple inodes are thus able to share data extents. This adds a twist in file system administration because none of the existing file system utilities in Linux expect this behavior. du(1), a utility used to compute file space usage, simply adds the blocks allocated to each inode. As it does not know about shared extents, it overestimates the space used. Say, we have a 5GB file in a volume having 42GB free.

$ ls -l
total 5120000
-rw-r--r-- 1 jeff jeff 5242880000 Sep 24 17:15 myfile

$ du -m myfile*
5000 myfile

$ df -h .
Filesystem Size Used Avail Use% Mounted on
/dev/sdd1 50G 8.2G 42G 17% /ocfs2

If we were to reflink it 4 times, we would expect the directory listing to report five 5GB files, but the df(1) to report no loss of available space. du(1), on the other hand, would report the disk usage to climb to 25GB.

$ reflink myfile myfile-ref1
$ reflink myfile myfile-ref2
$ reflink myfile myfile-ref3
$ reflink myfile myfile-ref4

$ ls -l
total 25600000
-rw-r--r-- 1 jeff jeff 5242880000 Sep 24 17:15 myfile
-rw-r--r-- 1 jeff jeff 5242880000 Sep 24 17:16 myfile-ref1
-rw-r--r-- 1 jeff jeff 5242880000 Sep 24 17:16 myfile-ref2
-rw-r--r-- 1 jeff jeff 5242880000 Sep 24 17:16 myfile-ref3
-rw-r--r-- 1 jeff jeff 5242880000 Sep 24 17:16 myfile-ref4

$ df -h .
Filesystem Size Used Avail Use% Mounted on
/dev/sdd1 50G 8.2G 42G 17% /ocfs2

$ du -m myfile*
5000 myfile
5000 myfile-ref1
5000 myfile-ref2
5000 myfile-ref3
5000 myfile-ref4
25000 total

Enter shared-du(1), a shared extent-aware du. This utility reports the shared extents per file in parentheses and the overall footprint. As expected, it lists the overall footprint at 5GB. One can view the details of the extents using shared-filefrag(1). Both these utilities are available at http://oss.oracle.com/~smushran/reflink-tools/. We are currently in the process of pushing the changes to the upstream maintainers of these utilities.

$ shared-du -m -c --shared-size myfile*
5000 (5000) myfile
5000 (5000) myfile-ref1
5000 (5000) myfile-ref2
5000 (5000) myfile-ref3
5000 (5000) myfile-ref4
25000 total
5000 footprint

# shared-filefrag -v myfile
Filesystem is: 7461636f
File size of myfile is 5242880000 (1280000 blocks, blocksize 4096)
ext  logical  physical  expected  length  flags
  0        0   2247937              8448
  1     8448   2257921   2256384   30720
  2    39168   2290177   2288640   30720
  3    69888   2322433   2320896   30720
  4   100608   2354689   2353152   30720
  7   192768   2451457   2449920   30720
...
 37  1073408   2032129   2030592   30720  shared
 38  1104128   2064385   2062848   30720  shared
 39  1134848   2096641   2095104   30720  shared
 40  1165568   2128897   2127360   30720  shared
 41  1196288   2161153   2159616   30720  shared
 42  1227008   2193409   2191872   30720  shared
 43  1257728   2225665   2224128   22272  shared,eof
myfile: 44 extents found

DATA COHERENCY One of the challenges in a shared file system is data coherency when multiple nodes are writing to the same set of files. NFS, for example, provides close-to-open data coherency that results in the data being flushed to the server when the file is closed on the client. This leaves open a wide window for stale data being read on another node.

A simple test to check the data coherency of a shared file system involves concurrently appending to the same file, for example by running "uname -a >> /dir/file" using a parallel distributed shell like dsh or pconsole. If coherent, the file will contain the results from all nodes.

# dsh -R ssh -w node32,node33,node34,node35 "uname -a >> /ocfs2/test"
# cat /ocfs2/test
Linux node32 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
Linux node35 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
Linux node33 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
Linux node34 2.6.32-10 #1 SMP Fri Sep 17 17:51:41 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

OCFS2 is a fully cache coherent cluster file system.
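The result of such an append test can be checked mechanically instead of by eye. The following is a sketch only. The function name check_appends is made up for this example, and it relies on uname -a printing the node name as its second field.

```shell
# Sketch: verify that every listed node left a line in the test file.
# Usage: check_appends FILE NODE...
check_appends() {
    file=$1
    shift
    missing=0
    for node in "$@"; do
        # uname -a prints the node name as its second field
        if ! awk -v n="$node" '$2 == n { found = 1 } END { exit !found }' "$file"; then
            echo "missing: $node"
            missing=1
        fi
    done
    if [ "$missing" -eq 0 ]; then
        echo "all nodes present"
    fi
    return "$missing"
}
```

For the run shown above, one would call "check_appends /ocfs2/test node32 node33 node34 node35". A missing node suggests that an append was lost or overwritten.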

DISCONTIGUOUS BLOCK GROUP Most file systems pre-allocate space for inodes during format. OCFS2 dynamically allocates this space when required.

However, this dynamic allocation has been problematic when the free space is very fragmented, because the file system required the inode and extent allocators to grow in contiguous fixed-size chunks.

The discontiguous block group feature takes care of this problem by allowing the allocators to grow in smaller, variable-sized chunks.

This feature was added in Linux kernel 2.6.35 and requires enabling on-disk feature discontig-bg.

BACKUP SUPER BLOCKS A file system super block stores critical information that is hard to recreate. In OCFS2, it stores the block size, cluster size, and the locations of the root and system directories, among other things. As this block is close to the start of the disk, it is very susceptible to being overwritten by an errant write. Say, dd if=file of=/dev/sda1.

Backup super blocks are copies of the super block. These blocks are dispersed in the volume to minimize the chances of being overwritten. On the small chance that the original gets corrupted, the backups are available to scan and fix the corruption.

mkfs.ocfs2(8) enables this feature by default. Users can disable this by specifying --fs-features=nobackup-super during format.

o2info(1) can be used to view whether the feature has been enabled on a device.

# o2info --fs-features /dev/sdb1
backup-super strict-journal-super sparse extended-slotmap inline-data xattr indexed-dirs refcount discontig-bg clusterinfo unwritten

In OCFS2, the super block is on the third block. The backups are located at the 1G, 4G, 16G, 64G, 256G and 1T byte offsets. The actual number of backup blocks depends on the size of the device. The super block is not backed up on devices smaller than 1GB.
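The offsets follow a simple pattern: each is four times the previous one, starting at 1G. The sketch below lists the backups that fit on a device of a given size and converts each offset to a block number. The function name is made up for this example. It also assumes binary units (1G = 2^30 bytes) and that a backup exists whenever its offset is smaller than the device size. Both assumptions are for illustration and are not taken from this manual.

```shell
# Sketch: list the backup super block offsets that fall inside a device.
# Usage: backup_super_offsets SIZE_IN_BYTES [BLOCK_SIZE]
backup_super_offsets() {
    size=$1
    blocksize=${2:-4096}
    n=1
    offset=$((1 << 30))                 # first backup at 1G
    while [ "$n" -le 6 ] && [ "$offset" -lt "$size" ]; do
        echo "backup $n: byte offset $offset, block $((offset / blocksize))"
        n=$((n + 1))
        offset=$((offset * 4))          # 1G, 4G, 16G, 64G, 256G, 1T
    done
}
```

With a 4096-byte block size, backup 2 (the 4G offset) works out to block 1048576, which matches the block number that fsck.ocfs2(8) reports when run with -r 2.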

fsck.ocfs2(8) refers to these six offsets by numbers, 1 to 6. Users can specify any backup with the -r option to recover the volume. The example below uses the second backup. If successful, fsck.ocfs2(8) overwrites the corrupted super block with the backup.

# fsck.ocfs2 -f -r 2 /dev/sdb1
fsck.ocfs2 1.8.0
[RECOVER_BACKUP_SUPERBLOCK] Recover superblock information from backup block#1048576? y
Checking OCFS2 filesystem in /dev/sdb1:
Label: webhome
UUID: B3E021A2A12B4D0EB08E9E986CDC7947
Number of blocks: 13107196
Block size: 4096
Number of clusters: 13107196
Cluster size: 4096
Number of slots: 8

/dev/sdb1 was run with -f, check forced.
Pass 0a: Checking cluster allocation chains
Pass 0b: Checking inode allocation chains
Pass 0c: Checking extent block allocation chains
Pass 1: Checking inodes and blocks.
Pass 2: Checking directory entries.
Pass 3: Checking directory connectivity.
Pass 4a: checking for orphaned inodes
Pass 4b: Checking inodes link counts.
All passes succeeded.


SYNTHETIC FILE SYSTEMS The OCFS2 development effort included two synthetic file systems, configfs and dlmfs. It also makes use of a third, debugfs.

configfs configfs has since been accepted as a generic kernel component and is also used by netconsole and fs/dlm. OCFS2 tools use it to communicate the list of nodes in the cluster, details of the heartbeat device, cluster timeouts, and so on to the in-kernel node manager. The o2cb init script mounts this file system at /sys/kernel/config.

dlmfs dlmfs exposes the in-kernel o2dlm to the user-space. While it was developed primarily for OCFS2 tools, it has seen usage by others looking to add a cluster locking dimension in their applications. Users interested in doing the same should look at the libo2dlm library provided by ocfs2-tools. The o2cb init script mounts this file system at /dlm.

debugfs OCFS2 uses debugfs to expose its in-kernel information to user space. For example, listing the file system cluster locks, dlm locks, dlm state, o2net state, etc. Users can access the information by mounting the file system at /sys/kernel/debug. To automount, add the following to /etc/fstab:

debugfs /sys/kernel/debug debugfs defaults 0 0

DISTRIBUTED LOCK MANAGER One of the key technologies in a cluster is the lock manager, which maintains the locking state of all resources across the cluster. An easy implementation of a lock manager involves designating one node to handle everything. In this model, if a node wanted to acquire a lock, it would send the request to the lock manager. However, this model has a weakness: the lock manager's death causes the cluster to seize up.

A better model is one where all nodes manage a subset of the lock resources. Each node maintains enough information for all the lock resources it is interested in. In the event of a node death, the remaining nodes pool their information to reconstruct the lock state maintained by the dead node. In this scheme, the locking overhead is distributed amongst all the nodes. Hence, the term distributed lock manager.

O2DLM is a distributed lock manager. It is based on the specification titled "Programming Locking Application" written by Kristin Thomas and is available at the following link. http://opendlm.sourceforge.net/cvsmirror/opendlm/docs/dlmbook_final.pdf

DLM DEBUGGING O2DLM has a rich debugging infrastructure that allows it to show the state of the lock manager, all the lock resources, among other things. The figure below shows the dlm state of a nine-node cluster that has just lost three nodes: 12, 32, and 35. It can be ascertained that node 7, the recovery master, is currently recovering node 12 and has received the lock states of the dead node from all other live nodes.

# cat /sys/kernel/debug/o2dlm/45F81E3B6F2B48CCAAD1AE7945AB2001/dlm_state
Domain: 45F81E3B6F2B48CCAAD1AE7945AB2001 Key: 0x10748e61
Thread Pid: 24542 Node: 7 State: JOINED
Number of Joins: 1 Joining Node: 255
Domain Map: 7 31 33 34 40 50
Live Map: 7 31 33 34 40 50
Lock Resources: 48850 (439879)
MLEs: 0 (1428625)
Blocking: 0 (1066000)
Mastery: 0 (362625)
Migration: 0 (0)
Lists: Dirty=Empty Purge=Empty PendingASTs=Empty PendingBASTs=Empty
Purge Count: 0 Refs: 1
Dead Node: 12
Recovery Pid: 24543 Master: 7 State: ACTIVE
Recovery Map: 12 32 35
Recovery Node State:
7 - DONE
31 - DONE
33 - DONE
34 - DONE
40 - DONE
50 - DONE

The figure below shows the state of a dlm lock resource that is mastered (owned) by node 25, with 6 locks in the granted queue and node 26 holding the EX (writelock) lock on that resource.

# debugfs.ocfs2 -R "dlm_locks M000000000000000022d63c00000000" /dev/sda1
Lockres: M000000000000000022d63c00000000 Owner: 25 State: 0x0
Last Used: 0 ASTs Reserved: 0 Inflight: 0 Migration Pending: No
Refs: 8 Locks: 6 On Lists: None
Reference Map: 26 27 28 94 95
Lock-Queue Node Level Conv Cookie Refs AST BAST Pending-Action
Granted 94 NL -1 94:3169409 2 No No None
Granted 28 NL -1 28:3213591 2 No No None
Granted 27 NL -1 27:3216832 2 No No None
Granted 95 NL -1 95:3178429 2 No No None
Granted 25 NL -1 25:3513994 2 No No None
Granted 26 EX -1 26:3512906 2 No No None

The figure below shows a lock from the file system perspective. Specifically, it shows a lock that is in the process of being upconverted from NL to EX. Locks in this state are referred to in the file system as busy locks and can be listed using the debugfs.ocfs2 command, "fs_locks -B".

# debugfs.ocfs2 -R "fs_locks -B" /dev/sda1
Lockres: M000000000000000000000b9aba12ec Mode: No Lock
Flags: Initialized Attached Busy
RO Holders: 0 EX Holders: 0
Pending Action: Convert Pending Unlock Action: None
Requested Mode: Exclusive Blocking Mode: No Lock
PR > Gets: 0 Fails: 0 Waits Total: 0us Max: 0us Avg: 0ns
EX > Gets: 1 Fails: 0 Waits Total: 544us Max: 544us Avg: 544185ns
Disk Refreshes: 1

With this debugging infrastructure in place, users can debug hang issues as follows:

* Dump the busy fs locks for all the OCFS2 volumes on the node with hanging processes. If no locks are found, then the problem is not related to O2DLM.

* Dump the corresponding dlm lock for all the busy fs locks. Note down the owner (master) of all the locks.

* Dump the dlm locks on the master node for each lock.
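The first two steps lend themselves to a little scripting. The helpers below are a sketch only. Their names are made up for this example, and they assume that each lock in the debugfs.ocfs2 output starts with a line beginning with "Lockres:" and that dlm_locks prints "Owner:" on that same line.

```shell
# Sketch: extract lock names and owners from debugfs.ocfs2 output.

# Reads "fs_locks -B" output on standard input; prints one lock name per line.
busy_lockres_names() {
    awk '$1 == "Lockres:" { print $2 }'
}

# Reads "dlm_locks NAME" output on standard input; prints "NAME OWNER".
lockres_owner() {
    awk '$1 == "Lockres:" {
             for (i = 3; i < NF; i++)
                 if ($i == "Owner:") print $2, $(i + 1)
         }'
}
```

For example, piping the output of debugfs.ocfs2 -R "fs_locks -B" /dev/sda1 into busy_lockres_names gives the list of names to feed, one at a time, to the dlm_locks command, whose output lockres_owner then reduces to the master node number.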

At this stage, one should note that the hanging node is waiting to get an AST from the master. The master, on the other hand, cannot send the AST until the current holder has down converted that lock, which it will do upon receiving a Blocking AST. However, a node can only down convert if all the lock holders have stopped using that lock. After dumping the dlm lock on the master node, identify the current lock holder and dump both the dlm and fs locks on that node.

The trick here is to see whether the Blocking AST message has been relayed to the file system. If not, the problem is in the dlm layer. If it has, then the most common reason would be a lock holder, the count for which is maintained in the fs lock.

At this stage, printing the list of processes helps.

$ ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN

Make a note of all D state processes. At least one of them is responsible for the hang on the first node.
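On a busy node the process list is long, so it helps to filter it. The function below is a sketch with a made-up name. It assumes the process state is the second column, as it is when ps(1) is asked for pid, stat, comm and wchan in that order.

```shell
# Sketch: keep the header line and every process whose state starts with D
# (uninterruptible sleep). Reads ps output on standard input.
d_state_processes() {
    awk 'NR == 1 || $2 ~ /^D/'
}
```

A typical use would be "ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN | d_state_processes". The wchan column of the surviving lines shows where in the kernel each process is sleeping.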

The challenge then is to figure out why those processes are hanging. Failing that, at least get enough information (like alt-sysrq t output) for the kernel developers to review. What to do next depends on where the process is hanging. If it is waiting for the I/O to complete, the problem could be anywhere in the I/O subsystem, from the block device layer through the drivers to the disk array. If the hang concerns a user lock (flock(2)), the problem could be in the user's application. A possible solution could be to kill the holder. If the hang is due to tight or fragmented memory, free up some memory by killing non-essential processes.

The thing to note is that the symptom for the problem was on one node but the cause is on another. The issue can only be resolved on the node holding the lock. Sometimes, the best solution will be to reset that node. Once killed, the O2DLM recovery process will clear all locks owned by the dead node and let the cluster continue to operate. As harsh as that sounds, at times it is the only solution. The good news is that, by following the trail, you now have enough information to file a bug and get the real issue resolved.

NFS EXPORTING OCFS2 volumes can be exported as NFS volumes. This support is limited to NFS version 3, which translates to Linux kernel version 2.4 or later.

If the version of the Linux kernel on the system exporting the volume is older than 2.6.30, then the NFS clients must mount the volumes using the nordirplus mount option. This disables the READDIRPLUS RPC call to work around a bug in NFSD, detailed in the following link:

http://oss.oracle.com/pipermail/ocfs2-announce/2008-June/000025.html

Users running NFS version 2 can export the volume after having disabled subtree checking (mount option no_subtree_check). Be warned, disabling the check has security implications (documented in the exports(5) man page) that users must evaluate on their own.

FILE SYSTEM LIMITS OCFS2 has no intrinsic limit on the total number of files and directories in the file system. In general, it is only limited by the size of the device. But there is one limit imposed by the current filesystem. It can address at most four billion clusters. A file system with 1MB cluster size can go up to 4PB, while a file system with a 4KB cluster size can address up to 16TB.
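The two figures follow from multiplying the cluster size by the number of addressable clusters. The sketch below does that arithmetic for any cluster size. The function name is made up for this example, and it takes "four billion" to mean 2^32 and uses binary units.

```shell
# Sketch: largest addressable volume, in TiB, for a cluster size given in KiB.
# Assumes 2^32 addressable clusters.
max_volume_tib() {
    cluster_kib=$1
    # 2^32 clusters * cluster_kib KiB each, divided by 2^30 KiB per TiB
    echo $(( (1 << 32) * cluster_kib / (1 << 30) ))
}
```

"max_volume_tib 4" prints 16, and "max_volume_tib 1024" prints 4096, that is, 4PB.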


SYSTEM OBJECTS The OCFS2 file system stores its internal meta-data, including bitmaps, journals, etc., as system files. These are grouped in a system directory. These files and directories are not accessible via the file system interface but can be viewed using the debugfs.ocfs2(8) tool.

To list the system directory (referred to as double-slash), do:

# debugfs.ocfs2 -R "ls -l //" /dev/sde1
66 drwxr-xr-x 10 0 0 3896 19-Jul-2011 13:36 .
66 drwxr-xr-x 10 0 0 3896 19-Jul-2011 13:36 ..
67 -rw-r--r-- 1 0 0 0 19-Jul-2011 13:36 bad_blocks
68 -rw-r--r-- 1 0 0 1179648 19-Jul-2011 13:36 global_inode_alloc
69 -rw-r--r-- 1 0 0 4096 19-Jul-2011 14:35 slot_map
70 -rw-r--r-- 1 0 0 1048576 19-Jul-2011 13:36 heartbeat
71 -rw-r--r-- 1 0 0 53686960128 19-Jul-2011 13:36 global_bitmap
72 drwxr-xr-x 2 0 0 3896 25-Jul-2011 15:05 orphan_dir:0000
73 drwxr-xr-x 2 0 0 3896 19-Jul-2011 13:36 orphan_dir:0001
74 -rw-r--r-- 1 0 0 8388608 19-Jul-2011 13:36 extent_alloc:0000
75 -rw-r--r-- 1 0 0 8388608 19-Jul-2011 13:36 extent_alloc:0001
76 -rw-r--r-- 1 0 0 121634816 19-Jul-2011 13:36 inode_alloc:0000
77 -rw-r--r-- 1 0 0 0 19-Jul-2011 13:36 inode_alloc:0001
78 -rw-r--r-- 1 0 0 268435456 19-Jul-2011 13:36 journal:0000
79 -rw-r--r-- 1 0 0 268435456 19-Jul-2011 13:37 journal:0001
80 -rw-r--r-- 1 0 0 0 19-Jul-2011 13:36 local_alloc:0000
81 -rw-r--r-- 1 0 0 0 19-Jul-2011 13:36 local_alloc:0001
82 -rw-r--r-- 1 0 0 0 19-Jul-2011 13:36 truncate_log:0000
83 -rw-r--r-- 1 0 0 0 19-Jul-2011 13:36 truncate_log:0001

The file names that end with numbers are slot specific and are referred to as node-local system files. The set of node-local files used by a node can be determined from the slot map. To list the slot map, do:

# debugfs.ocfs2 -R "slotmap" /dev/sde1
Slot# Node#
0 32
1 35
2 40
3 31
4 34
5 33
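Given a slot number from the slot map, the names of that node's system files can be derived from the naming pattern in the listing above: a base name, a colon, and the slot number padded to four digits. The sketch below prints them. The function name is made up for this example, and the list of base names is taken from the listing rather than from any authoritative source.

```shell
# Sketch: print the node-local system file names for a slot number.
# Pass the slot as a plain decimal number without leading zeros.
node_local_files() {
    slot=$1
    for base in orphan_dir extent_alloc inode_alloc journal local_alloc truncate_log; do
        printf '%s:%04d\n' "$base" "$slot"
    done
}
```

For the slot map shown above, node 35 occupies slot 1, so "node_local_files 1" lists orphan_dir:0001 through truncate_log:0001.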

For more information, refer to the OCFS2 support guides available in the Documentation section at http://oss.oracle.com/projects/ocfs2.

HEARTBEAT, QUORUM, AND FENCING Heartbeat is an essential component in any cluster. It is charged with accurately designating nodes as dead or alive. A mistake here could lead to a cluster hang or a corruption.

o2hb is the disk heartbeat component of o2cb. It periodically updates a timestamp on disk, indicating to others that this node is alive. It also reads all the timestamps to identify other live nodes. Other cluster components, like o2dlm and o2net, use the o2hb service to get node up and down events.

The quorum is the group of nodes in a cluster that is allowed to operate on the shared storage.


When there is a failure in the cluster, nodes may be split into groups that can communicate in their groups and with the shared storage but not between groups. o2quo determines which group is allowed to continue and initiates fencing of the other group(s).

Fencing is the act of forcefully removing a node from a cluster. A node with OCFS2 mounted will fence itself when it realizes that it does not have quorum in a degraded cluster. It does this so that other nodes won't be stuck trying to access its resources.

o2cb uses a machine reset to fence. This is the quickest route for the node to rejoin the cluster.

PROCESSES

[o2net] One per node. It is a work-queue thread started when the cluster is brought on-line and stopped when it is off-lined. It handles network communication for all mounts. It gets the list of active nodes from O2HB and sets up a TCP/IP communication channel with each live node. It sends regular keep-alive packets to detect any interruption on the channels.

[user_dlm] One per node. It is a work-queue thread started when dlmfs is loaded and stopped when it is unloaded (dlmfs is a synthetic file system that allows user space processes to access the in-kernel dlm).

[ocfs2_wq] One per node. It is a work-queue thread started when the OCFS2 module is loaded and stopped when it is unloaded. It is assigned background file system tasks that may take cluster locks, like flushing the truncate log, orphan directory recovery and local alloc recovery. For example, orphan directory recovery runs in the background so that it does not affect recovery time.

[o2hb-14C29A7392] One per heartbeat device. It is a kernel thread started when the heartbeat region is populated in configfs and stopped when it is removed. It writes every two seconds to a block in the heartbeat region, indicating that this node is alive. It also reads the region to maintain a map of live nodes. It notifies subscribers like o2net and o2dlm of any changes in the live node map.

[ocfs2dc] One per mount. It is a kernel thread started when a volume is mounted and stopped when it is unmounted. It downgrades locks in response to blocking ASTs (BASTs) requested by other nodes.

[jbd2/sdf1-97] One per mount. It is part of JBD2, which OCFS2 uses for journaling.

[ocfs2cmt] One per mount. It is a kernel thread started when a volume is mounted and stopped when it is unmounted. It works with kjournald2.

[ocfs2rec] It is started whenever a node has to be recovered. This thread performs file system recovery by replaying the journal of the dead node. It is scheduled to run after dlm recovery has completed.

[dlm_thread] One per dlm domain. It is a kernel thread started when a dlm domain is created and stopped when it is destroyed. This thread sends ASTs and blocking ASTs in response to lock level convert requests. It also frees unused lock resources.

[dlm_reco_thread] One per dlm domain. It is a kernel thread that handles dlm recovery when another node dies. If this node is the dlm recovery master, it re-masters every lock resource owned by the dead node.

[dlm_wq] One per dlm domain. It is a work-queue thread that o2dlm uses to queue blocking tasks.

FUTURE WORK File system development is a never ending cycle. Faster and larger disks, faster and more numerous processors, larger caches, etc. keep changing the sweet spot for performance, forcing developers to rethink long held beliefs. Add to that new use cases, which force developers to be innovative in providing solutions that meld seamlessly with existing semantics.

We are currently looking to add features like transparent compression, transparent encryption, delayed allocation, multi-device support, etc. as well as work on improving performance on newer generation machines.

If you are interested in contributing, email the development team at [email protected].

ACKNOWLEDGEMENTS The principal developers of the OCFS2 file system, its tools and the O2CB cluster stack, are Joel Becker, Zach Brown, Mark Fasheh, Jan Kara, Kurt Hackel, Tao Ma, Sunil Mushran, Tiger Yang and Tristan Ye.

Other developers who have contributed to the file system via bug fixes, testing, etc. are Wim Coekaerts, Srinivas Eeda, Coly Li, Jeff Mahoney, Marcos Matsunaga, Goldwyn Rodrigues, Manish Singh and Wengang Wang.

The members of the Linux Cluster community including Andrew Beekhof, Lars Marowsky-Bree, Fabio Massimo Di Nitto and David Teigland.

The members of the Linux File system community including Christoph Hellwig and Chris Mason.

The corporations that have contributed resources for this project including Oracle, SUSE Labs, EMC, Emulex, HP, IBM, Intel and Network Appliance.

SEE ALSO debugfs.ocfs2(8) fsck.ocfs2(8) fsck.ocfs2.checks(8) mkfs.ocfs2(8) mount.ocfs2(8) mounted.ocfs2(8) o2cluster(8) o2image(8) o2info(1) o2cb(7) o2cb(8) o2cb.sysconfig(5) o2hbmonitor(8) ocfs2.cluster.conf(5) tunefs.ocfs2(8)

AUTHOR


COPYRIGHT Copyright © 2004, 2012 Oracle. All rights reserved.

o2cb(7) OCFS2 Manual Pages o2cb(7)

NAME o2cb − Default cluster stack of the OCFS2 file system.

SYNOPSIS o2cb is the default cluster stack of the OCFS2 file system. It is an in-kernel cluster stack that includes a node manager (o2nm) to keep track of the nodes in the cluster, a disk heartbeat agent (o2hb) to detect node live-ness, a network agent (o2net) for intra-cluster node communication and a distributed lock manager (o2dlm) to keep track of lock resources. It also includes a synthetic file system, dlmfs, to allow applications to access the in-kernel dlm.

CONFIGURATION The stack is configured using the o2cb(8) cluster configuration utility and operated (online/offline/status) using the o2cb init service.

CLUSTER CONFIGURATION

It has two configuration files. One for the cluster layout (/etc/ocfs2/cluster.conf) and the other for the cluster timeouts, etc. (/etc/sysconfig/o2cb). More information about these two files can be found in ocfs2.cluster.conf(5) and o2cb.sysconfig(5).

The o2cb cluster stack supports two heartbeat modes, namely, local and global. Only one heartbeat mode can be active at any one time.

Local heartbeat refers to disk heartbeating on all shared devices. In this mode, the heartbeat is started during mount and stopped during umount. This mode is easy to set up as it does not require configuring heartbeat devices. The one drawback in this mode is the overhead on servers having a large number of OCFS2 mounts. For example, a server with 50 mounts will have 50 heartbeat threads. This is the default heartbeat mode.

Global heartbeat, on the other hand, refers to heartbeating on specific shared devices. These devices are normal OCFS2 formatted volumes that could also be mounted and used as clustered file systems. In this mode, the heartbeat is started during cluster online and stopped during cluster offline. While this mode can be used for all clusters, it is strongly recommended for clusters having a large number of mounts.

More information on disk heartbeat is provided below.

KERNEL CONFIGURATION

Two sysctl values need to be set for o2cb to function properly. The first, panic_on_oops, must be enabled to turn a kernel oops into a panic. If a kernel thread required for o2cb to function crashes, the system must be reset to prevent a cluster hang. If it is not set, another node may not be able to distinguish whether a node is unable to respond or slow to respond.

The other related sysctl parameter is panic, which specifies the number of seconds after a panic that the system will be auto-reset. Setting this parameter to zero disables autoreset; the cluster will require manual intervention. This is not preferred in a cluster environment.

To manually enable panic on oops and set a 30 sec timeout for reboot on panic, do:

# echo 1 > /proc/sys/kernel/panic_on_oops
# echo 30 > /proc/sys/kernel/panic

To enable the above on every boot, add the following to /etc/sysctl.conf:

Version 1.8.2 August 2011 1 o2cb(7) OCFS2 Manual Pages o2cb(7)

kernel.panic_on_oops = 1
kernel.panic = 30
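Whether the two values are in effect on a running node can be checked by reading them back. The function below is a sketch with a made-up name. Its optional argument, the directory holding the two sysctl files, defaults to /proc/sys/kernel and exists only so that the check can be tried against a scratch directory.

```shell
# Sketch: report whether panic_on_oops is enabled and panic is non-zero.
# Usage: check_o2cb_sysctls [DIRECTORY]
check_o2cb_sysctls() {
    dir=${1:-/proc/sys/kernel}
    oops=$(cat "$dir/panic_on_oops")
    panic=$(cat "$dir/panic")
    status=0
    if [ "$oops" -ne 1 ]; then
        echo "panic_on_oops is $oops, expected 1"
        status=1
    fi
    if [ "$panic" -eq 0 ]; then
        echo "panic is 0, auto-reset is disabled"
        status=1
    fi
    if [ "$status" -eq 0 ]; then
        echo "ok"
    fi
    return "$status"
}
```

Run without arguments on a cluster node, it prints "ok" when both settings match the recommendations above and otherwise names the value that needs attention.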

OS CONFIGURATION

The o2cb cluster stack also requires iptables (firewalling) to be either disabled or modified to allow network traffic on the private network interface. The port used by o2cb is specified in /etc/ocfs2/cluster.conf.

DISK HEARTBEAT O2CB uses disk heartbeat to detect node liveness. The disk heartbeat thread, o2hb, periodically reads and writes to a heartbeat file in an OCFS2 file system. Its write payload contains a sequence number that it increments in each write. This allows other nodes reading the same heartbeat file to detect the change and associate that with a live node. Conversely, a node whose sequence number has stopped changing is marked as a possible dead node. Possible. Not confirmed. That is because it could just be slow I/Os.

To differentiate between a dead node and one that has slow I/Os, O2CB has a disk heartbeat threshold (timeout). Only nodes whose sequence number has not incremented for that duration are marked dead.

However, that node may not be dead but just experiencing slow I/O. To prevent that, the heartbeat thread keeps track of the time elapsed since the last completed write. If that time exceeds the timeout, it forces a self-fence. It does so to prevent other nodes from marking it as dead while it is still alive.

This self-fencing scheme has proven to be very reliable as it relies on kernel timers and PCI bus reset. External fencing, while attractive, is rarely as reliable as it relies on external hardware and software that is prone to failure due to misconfiguration, etc.

Having said that, O2CB disk heartbeat has had its share of problems with self fencing. Nodes experiencing slow I/O on only one of multiple devices have to initiate self-fence.

This is because in the default local heartbeat scheme, nodes in a cluster may not be heartbeating on the same set of devices.

The global heartbeat mode addresses this shortcoming by introducing a scheme that forces all nodes to heartbeat on the same set of devices. In this scheme, a node experiencing a slowdown in I/O on a device may not need to initiate self-fence. It will only have to do so if it encounters slowdown on 50% or more of the heartbeat devices. In a cluster with 3 heartbeat regions, a slowdown in 1 region will be tolerated. In a cluster with 5 regions, a slowdown in 2 will be tolerated.
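The rule can be written as a one-line calculation: a node survives as long as the slow regions number strictly fewer than half of the configured regions. The sketch below computes that bound. The function name is made up for this example.

```shell
# Sketch: number of slow heartbeat regions tolerated before self-fence,
# given the "50% or more" rule described above.
tolerated_slow_regions() {
    regions=$1
    # largest k such that 2 * k < regions
    echo $(( (regions - 1) / 2 ))
}
```

"tolerated_slow_regions 3" prints 1 and "tolerated_slow_regions 5" prints 2. An even count gains nothing over the odd count below it: with 4 regions, 2 slow regions already reach 50%, so only 1 is tolerated.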

It is for this reason, this mode is recommended for users that have 3 ormore OCFS2 mounts.

O2CB allows upto 32 heartbeat regions to be configured in the global heartbeat mode.

ONLINE CLUSTER MODIFICATION

The O2CB cluster stack allows adding and removing nodes in an online cluster when run in the global heartbeat mode. Use the o2cb(8) utility to make the changes in the configuration and (re)online the cluster using the o2cb init script. The user must do the same on all nodes in the cluster. The cluster will not allow any new cluster mounts if the node configuration on all nodes is not the same.

The removal of a node will only succeed if that node is no longer in use. If the user removes an active node from the configuration, the re-online will fail.

The cluster stack also allows adding and removing heartbeat regions in an online cluster. Use the o2cb(8) utility to make the changes in the configuration file and (re)online the cluster using the o2cb init script. The user must do the same on all nodes in the cluster. The cluster will not allow any new cluster mounts if the heartbeat region configuration on all nodes is not the same.

The removal of heartbeat regions will only succeed if the active heartbeat region count is greater than 3. This is to protect against edge conditions that can destabilize the cluster.

GETTING STARTED

The first step in configuring o2cb is deciding whether to set up local or global heartbeat. If global heartbeat, then one has to format at least one heartbeat device.

To format an OCFS2 volume with global heartbeat enabled, do:

    # mkfs.ocfs2 --cluster-stack=o2cb --cluster-name=webcluster --global-heartbeat -L "hbvol1" /dev/sdb1

Once formatted, set up /etc/ocfs2/cluster.conf following the example provided in ocfs2.cluster.conf(5).

If local heartbeat, then one can set up cluster.conf without any heartbeat devices. The next step is starting the cluster.

To online the cluster stack, do:

    # service o2cb online
    Loading stack plugin "o2cb": OK
    Loading filesystem "ocfs2_dlmfs": OK
    Mounting ocfs2_dlmfs filesystem at /dlm: OK
    Setting cluster stack "o2cb": OK
    Registering O2CB cluster "webcluster": OK
    Setting O2CB cluster timeouts : OK
    Starting global heartbeat for cluster "webcluster": OK

Once the cluster stack is online, new OCFS2 volumes can be formatted normally without specifying the cluster stack information. mkfs.ocfs2(8) will pick up that information automatically.

    # mkfs.ocfs2 -L "datavol" /dev/sdc1

Meanwhile, existing volumes can be converted to the new cluster stack using the tunefs.ocfs2(8) utility.

    # tunefs.ocfs2 --update-cluster-stack /dev/sdd1
    Updating on-disk cluster information to match the running cluster.
    DANGER: YOU MUST BE ABSOLUTELY SURE THAT NO OTHER NODE IS USING THIS FILESYSTEM BEFORE MODIFYING ITS CLUSTER CONFIGURATION.
    Update the on-disk cluster information? y

Another utility, mounted.ocfs2(8), is useful for listing all the OCFS2 volumes along with the cluster stack information.

To get a list of OCFS2 volumes, do:

    # mounted.ocfs2 -d
    Device     Stack  Cluster     F  UUID                              Label
    /dev/sdb1  o2cb   webcluster  G  DCDA2845177F4D59A0F2DCD8DE507CC3  hbvol1
    /dev/sdc1  None                  23878C320CF3478095D1318CB5C99EED  localmount
    /dev/sdd1  o2cb   webcluster  G  8AB016CD59FC4327A2CDAB69F08518E3  webvol
    /dev/sdg1  o2cb   webcluster  G  77D95EF51C0149D2823674FCC162CF8B  logsvol
    /dev/sdh1  o2cb   webcluster  G  BBA1DBD0F73F449384CE75197D9B7098  scratch

The o2cb init script can also be used to check the status of the cluster, offline the cluster, etc.

To check the status of the cluster stack, do:

    # service o2cb status
    Driver for "configfs": Loaded
    Filesystem "configfs": Mounted
    Stack glue driver: Loaded
    Stack plugin "o2cb": Loaded
    Driver for "ocfs2_dlmfs": Loaded
    Filesystem "ocfs2_dlmfs": Mounted
    Checking O2CB cluster "webcluster": Online
      Heartbeat dead threshold: 62
      Network idle timeout: 60000
      Network keepalive delay: 2000
      Network reconnect delay: 2000
      Heartbeat mode: Global
    Checking O2CB heartbeat: Active
      77D95EF51C0149D2823674FCC162CF8B /dev/sdg1
      DCDA2845177F4D59A0F2DCD8DE507CC3 /dev/sdk1
      BBA1DBD0F73F449384CE75197D9B7098 /dev/sdh1
    Nodes in O2CB cluster: 6 7 10
    Active userdlm domains: ovm

To offline and unload the cluster stack, do:

    # service o2cb offline
    Clean userdlm domains: OK
    Stopping global heartbeat on cluster "webcluster": OK
    Stopping O2CB cluster webcluster: OK
    Unregistering O2CB cluster "webcluster": OK

    # service o2cb unload
    Clean userdlm domains: OK
    Unmounting ocfs2_dlmfs filesystem: OK
    Unloading module "ocfs2_dlmfs": OK
    Unloading module "ocfs2_stack_o2cb": OK

SEE ALSO

o2cb(8) o2cb.sysconfig(5) ocfs2.cluster.conf(5) o2hbmonitor(8)

AUTHORS

Oracle Corporation

COPYRIGHT

Copyright © 2004, 2011 Oracle. All rights reserved.

Version 1.8.2 August 2011

o2cb(8) OCFS2 Manual Pages o2cb(8)

NAME

o2cb − Cluster registration utility for the O2CB cluster stack.

SYNOPSIS

o2cb [--config-file=path] [-h|--help] [-v|--verbose] [-V|--version] COMMAND [ARGS]

DESCRIPTION

o2cb(8) is used to add, remove and list the information in the O2CB cluster configuration file. This utility is also used to register and unregister the cluster, as well as start and stop global heartbeat.

The default location of the configuration file, /etc/ocfs2/cluster.conf, can be overridden using the --config-file option.

OPTIONS

--config-file config-file
    Specify a path to the configuration file. If not provided, it will use the default path of /etc/ocfs2/cluster.conf.

-v, --verbose
    Verbose mode.

-h, --help
    Help.

-V, --version
    Show version and exit.

O2CB COMMANDS

add-cluster cluster-name
    Adds a cluster to the configuration file. The O2CB configuration file can hold multiple clusters. However, only one cluster can be active at any time.

remove-cluster cluster-name
    Removes a cluster from the configuration file. This command removes all the nodes and heartbeat regions assigned to the cluster.

add-node cluster-name node-name [--ip ip-address] [--port port] [--number node-number]
    Adds a node to the cluster in the configuration file. It accepts three optional arguments. If not provided, the ip-address defaults to the one assigned to the node-name, port to 7777, and node-number to the lowest unused node number.

remove-node cluster-name node-name
    Removes a node from the cluster in the configuration file.

add-heartbeat cluster-name [uuid|device]
    Adds a heartbeat region to the cluster in the configuration file.

remove-heartbeat cluster-name [uuid|device]
    Removes a heartbeat region from the cluster in the configuration file.

heartbeat-mode cluster-name [local|global]
    Sets the heartbeat mode for the cluster in the configuration file.

list-clusters
    Lists all the cluster names in the configuration file.

list-cluster cluster-name --oneline
    Lists all the nodes and heartbeat regions associated with the cluster in the configuration file.

list-nodes cluster-name --oneline
    Lists all the nodes associated with the cluster in the configuration file.

list-heartbeats cluster-name --oneline
    Lists all the heartbeat regions associated with the cluster in the configuration file.

register-cluster cluster-name
    Registers the cluster listed in the configuration file with configfs. If called when the cluster is already registered, it will update configfs with the current configuration.

unregister-cluster cluster-name
    Unregisters the cluster from configfs.

start-heartbeat cluster-name
    Starts global heartbeat on all regions for the cluster as listed in the configuration file. If repeated, it will start heartbeat on new regions and stop on regions since removed. It will silently exit if global heartbeat has not been enabled.

stop-heartbeat cluster-name
    Stops global heartbeat on all regions for the cluster. It will silently exit if global heartbeat has not been enabled.

cluster-status [cluster-name]
    Shows whether the given cluster is offline or online. If no cluster is provided, it shows the currently active cluster, if any.
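The add-node default of "the lowest unused node number" can be illustrated with a short sketch. The function name is hypothetical and this is not the utility's own code; the upper bound of 254 is the node number range documented in ocfs2.cluster.conf(5).

```python
def lowest_unused_node_number(used, max_number=254):
    """Return the smallest node number in 0..max_number that is not in use."""
    taken = set(used)
    for candidate in range(max_number + 1):
        if candidate not in taken:
            return candidate
    raise ValueError("no free node number left in the cluster")
```

For a cluster whose nodes are numbered 6, 7 and 10, this sketch would pick 0 for the next node added without --number.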

EXAMPLE

To create a cluster, mycluster, having two nodes, node1 and node2, do:

    $ o2cb add-cluster mycluster
    $ o2cb add-node mycluster node1 --ip 10.10.10.1
    $ o2cb add-node mycluster node2 --ip 10.10.10.2

To specify a global heartbeat device, /dev/sda1, do:

    $ o2cb add-heartbeat mycluster /dev/sda1

To enable global heartbeat, do:

    $ o2cb heartbeat-mode mycluster global

SEE ALSO

o2cb(7) o2cb.sysconfig(5) ocfs2.cluster.conf(5)

AUTHORS

Oracle Corporation

COPYRIGHT

Copyright © 2010, 2012 Oracle. All rights reserved.

Version 1.8.2 January 2012

/etc/ocfs2/cluster.conf(5) OCFS2 Manual Pages /etc/ocfs2/cluster.conf(5)

NAME

/etc/ocfs2/cluster.conf − Cluster configuration file for the o2cb cluster stack.

SYNOPSIS

The cluster layout of the o2cb cluster stack is specified in /etc/ocfs2/cluster.conf. It lists the name of the cluster, the nodes comprising that cluster and its heartbeat regions. The cluster stack expects this file to be the same on all nodes in that cluster.

This file should be populated using the o2cb(8) cluster configuration utility. A sample of the same is shown in the example section.

DESCRIPTION

The configuration file is divided into three types of stanzas, each with a list of parameters and values. The three stanza types are cluster, node and heartbeat. While a configuration file can store definitions of multiple clusters, the o2cb cluster stack allows only one cluster to be active at any one time. The name of this active cluster is stored in /etc/sysconfig/o2cb [o2cb.sysconfig(5)].

The cluster stanza specifies the name of the cluster, number of nodes and the heartbeat mode. The cluster name can include up to 16 alphanumeric characters [0-9A-Za-z]. No special characters are allowed.

    Parameters      Description
    node_count      Number of nodes in the cluster
    heartbeat_mode  local or global heartbeat
    name            Cluster name (up to 16 alphanumeric chars [0-9A-Za-z])
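The cluster name rule (1 to 16 characters drawn from [0-9A-Za-z]) is easy to check before running the configuration utility. The helper below is a hypothetical sketch of that rule, not part of ocfs2-tools.

```python
import re

# 1 to 16 alphanumeric characters, nothing else.
_CLUSTER_NAME = re.compile(r"[0-9A-Za-z]{1,16}")


def is_valid_cluster_name(name: str) -> bool:
    """Check a candidate cluster name against the documented character rule."""
    return _CLUSTER_NAME.fullmatch(name) is not None
```

For example, "webcluster" passes, while "web-cluster" is rejected because of the hyphen.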

The node stanza specifies the node name that is part of the cluster along with its IPv4 address, port and node number. The node name must match the hostname. The domain name is not required. For example, appserver1.company.com can be appserver1. The IPv4 address need not be the one associated with that hostname. That is, any valid IPv4 address on that node can be used. The o2cb cluster stack will not attempt to match the node name (hostname) with the specified IPv4 address. A low-latency private interconnect address is recommended for best performance.

    Parameters  Description
    ip_port     IPv4 port
    ip_address  IPv4 address (private interconnect recommended)
    number      Node number (0 - 254)
    name        Node name (hostname without the domain name)
    cluster     Cluster name (should match the name in the cluster stanza)

The heartbeat stanza specifies the global heartbeat region UUIDs. A cluster can have up to 32 heartbeat regions. This is an optional stanza and is only required if the global heartbeat mode is enabled. In other words, the regions are only used if heartbeat_mode = global is in the cluster stanza. If not, this stanza is ignored.

    Parameters  Description
    region      Heartbeat region UUID
    cluster     Cluster name (should match the name in the cluster stanza)

While manual editing is not recommended, users doing so must follow the format strictly. The stanza name should start at the first column and end with a colon. The parameters must start after a tab. A blank line must demarcate each stanza. Care should be taken to avoid stray white-spaces.
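The layout rules above can be made concrete with a small generator. This is a sketch only: the function names are hypothetical, and the "key = value" form is assumed from the sample output in the example section. The o2cb(8) utility remains the recommended way to write the file.

```python
def render_stanza(kind, params):
    """Render one stanza: its name at column one ending in a colon, then tab-indented parameters."""
    lines = [f"{kind}:"]
    lines += [f"\t{key} = {value}" for key, value in params.items()]
    return "\n".join(lines)


def render_cluster_conf(stanzas):
    """Join stanzas with exactly one blank line between them and no trailing spaces."""
    return "\n\n".join(render_stanza(kind, params) for kind, params in stanzas) + "\n"
```

Generating the text this way avoids the two mistakes manual editing tends to introduce: spaces in place of the leading tab, and stray white-space at the end of a line.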

EXAMPLE

The example below illustrates populating a cluster.conf with a cluster called webcluster, having 3 nodes and 3 global heartbeat regions, using the o2cb(8) utility.

    $ o2cb add-cluster webcluster

    $ o2cb add-node webcluster node7 --ip 192.168.0.107 --number 7
    $ o2cb add-node webcluster node6 --ip 192.168.0.106 --number 6
    $ o2cb add-node webcluster node10 --ip 192.168.0.110 --number 10

    $ o2cb add-heartbeat webcluster /dev/sdg1
    $ o2cb add-heartbeat webcluster /dev/sdk1
    $ o2cb add-heartbeat webcluster /dev/sdh1

    $ o2cb heartbeat-mode webcluster global

    $ o2cb list-cluster webcluster
    heartbeat:
            region = 77D95EF51C0149D2823674FCC162CF8B
            cluster = webcluster

    heartbeat:
            region = DCDA2845177F4D59A0F2DCD8DE507CC3
            cluster = webcluster

    heartbeat:
            region = BBA1DBD0F73F449384CE75197D9B7098
            cluster = webcluster

    node:
            ip_port = 7777
            ip_address = 192.168.0.107
            number = 7
            name = node7
            cluster = webcluster

    node:
            ip_port = 7777
            ip_address = 192.168.0.106
            number = 6
            name = node6
            cluster = webcluster

    node:
            ip_port = 7777
            ip_address = 192.168.0.110
            number = 10
            name = node10
            cluster = webcluster

    cluster:
            node_count = 3
            heartbeat_mode = global
            name = webcluster

SEE ALSO

o2cb(7) o2cb(8) o2cb.sysconfig(5)

AUTHORS

Oracle Corporation

COPYRIGHT

Copyright © 2004, 2012 Oracle. All rights reserved.

Version 1.8.2 January 2012

/etc/sysconfig/o2cb(5) OCFS2 Manual Pages /etc/sysconfig/o2cb(5)

NAME

/etc/sysconfig/o2cb − Cluster configuration file for the o2cb cluster stack.

SYNOPSIS

The configuration file /etc/sysconfig/o2cb stores the active cluster stack, its name and the various cluster timeouts for the o2cb cluster stack.

DESCRIPTION

This file can be populated using the o2cb init script. An example of the same is illustrated in the examples section.

The configurable parameters in this file are:

O2CB_STACK
    Name of the cluster stack. The possible values are o2cb, pcmk and cman. o2cb is the default cluster stack of the OCFS2 file system. pcmk (Pacemaker) and cman (rgmanager) are the two other cluster stacks that are supported by the same file system.

O2CB_BOOTCLUSTER
    Name of the active cluster. While /etc/ocfs2/cluster.conf can hold descriptions of multiple clusters, only one can be active at any one time. The name of that active cluster is specified here. The name itself can be up to 16 alphanumeric characters [0-9A-Za-z] with no special characters.

The remaining configurable parameters (cluster timeouts) are only relevant for the o2cb cluster stack. These cluster timeouts are used by the o2cb cluster stack to determine whether a node is dead or alive. The default timeouts are just a guide and may need to be tweaked depending on the hardware the software is running on.

The various cluster timeouts for the o2cb cluster stack are:

O2CB_HEARTBEAT_THRESHOLD
    The disk heartbeat timeout is the number of two-second iterations before a node is considered dead. The exact formula used to convert the timeout in seconds to the number of iterations is as follows:

    O2CB_HEARTBEAT_THRESHOLD = (((timeout in seconds) / 2) + 1)

    For example, to specify a 60 sec timeout, set it to 31. For 120 secs, set it to 61. The default for this timeout is 60 secs (O2CB_HEARTBEAT_THRESHOLD = 31).

While it defaults to 60 secs, multipath users typically set it to 120 secs.

O2CB_IDLE_TIMEOUT_MS
    The network idle timeout specifies the time in milliseconds before a network connection is considered dead. While it defaults to 30000 ms, network bonding users typically set it to 60000 ms.

O2CB_KEEPALIVE_DELAY_MS
    The network keepalive specifies the maximum delay in milliseconds before a keepalive packet is sent to another node to check whether it is alive or not. It defaults to 2000 ms.

O2CB_RECONNECT_DELAY_MS
    The network reconnect specifies the minimum delay in milliseconds between repeated connect attempts. It defaults to 2000 ms.
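The O2CB_HEARTBEAT_THRESHOLD formula given earlier in this list can be worked in both directions. The helper names below are hypothetical, and the use of integer division for odd timeouts is an assumption; the documented examples all use even values.

```python
ITERATION_SECS = 2  # each heartbeat iteration lasts two seconds


def threshold_from_timeout(timeout_secs: int) -> int:
    """O2CB_HEARTBEAT_THRESHOLD = (timeout in seconds / 2) + 1."""
    return timeout_secs // ITERATION_SECS + 1


def timeout_from_threshold(threshold: int) -> int:
    """Inverse of the formula: the timeout in seconds implied by a threshold."""
    return (threshold - 1) * ITERATION_SECS
```

Reading the formula backwards is useful when inspecting an existing setup: a threshold of 62 corresponds to a 122 second timeout, not 124.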

EXAMPLE

The example below illustrates populating the o2cb sysconfig file using the o2cb init script.

    $ service o2cb configure
    Configuring the O2CB driver.

    This will configure the on-boot properties of the O2CB driver. The following questions will determine whether the driver is loaded on boot. The current values will be shown in brackets ('[]'). Hitting <ENTER> without typing an answer will keep that current value. Ctrl-C will abort.

    Load O2CB driver on boot (y/n) [n]: y
    Cluster stack backing O2CB [o2cb]:
    Cluster to start on boot (Enter "none" to clear) [ocfs2]: webcluster
    Specify heartbeat dead threshold (>=7) [31]: 62
    Specify network idle timeout in ms (>=5000) [30000]: 60000
    Specify network keepalive delay in ms (>=1000) [2000]:
    Specify network reconnect delay in ms (>=2000) [2000]:
    Writing O2CB configuration: OK

    $ cat /etc/sysconfig/o2cb
    #
    # This is a configuration file for automatic startup of the O2CB
    # driver. It is generated by running /etc/init.d/o2cb configure.
    # On Debian based systems the preferred method is running
    # 'dpkg-reconfigure ocfs2-tools'.
    #

    # O2CB_ENABLED: 'true' means to load the driver on boot.
    O2CB_ENABLED=true

    # O2CB_STACK: The name of the cluster stack backing O2CB.
    O2CB_STACK=o2cb

    # O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
    O2CB_BOOTCLUSTER=webcluster

    # O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
    O2CB_HEARTBEAT_THRESHOLD=62

    # O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection is considered dead.
    O2CB_IDLE_TIMEOUT_MS=60000

    # O2CB_KEEPALIVE_DELAY_MS: Max time in ms before a keepalive packet is sent
    O2CB_KEEPALIVE_DELAY_MS=2000

    # O2CB_RECONNECT_DELAY_MS: Min time in ms between connection attempts
    O2CB_RECONNECT_DELAY_MS=2000

SEE ALSO

o2cb(7) o2cb(8) ocfs2.cluster.conf(5)

AUTHORS

Oracle Corporation

COPYRIGHT

Copyright © 2004, 2012 Oracle. All rights reserved.

Version 1.8.2 January 2012

mkfs.ocfs2(8) OCFS2 Manual Pages mkfs.ocfs2(8)

NAME

mkfs.ocfs2 − Creates an OCFS2 file system.

SYNOPSIS

mkfs.ocfs2 [-b block-size] [-C cluster-size] [-L volume-label] [-M mount-type] [-N number-of-nodes] [-J journal-options] [--fs-features=[no]sparse...] [--fs-feature-level=feature-level] [-T filesystem-type] [--cluster-stack=stackname] [--cluster-name=clustername] [--global-heartbeat] [-FqvV] device [blocks-count]

DESCRIPTION

mkfs.ocfs2 is used to create an OCFS2 file system on a device, usually a partition on a shared disk. In order to prevent data loss, mkfs.ocfs2 will not format an existing OCFS2 volume if it detects that it is mounted on another node in the cluster. This tool requires the cluster service to be online.

OPTIONS

-b, --block-size block-size
    Valid block size values are 512, 1K, 2K and 4K bytes per block. If omitted, a value will be heuristically determined based on the expected usage of the file system (see the -T option). A block size of 512 bytes is never recommended. Choose 1K, 2K or 4K.

-C, --cluster-size cluster-size
    Valid cluster size values are 4K, 8K, 16K, 32K, 64K, 128K, 256K, 512K and 1M. If omitted, a value will be heuristically determined based on the expected usage of the file system (see the -T option). For volumes expected to store large files, like database files, a cluster size of 128K or more is recommended, although one can opt for a smaller size as long as that value is not smaller than the database block size. For others, use 4K.

-F, --force
    For existing OCFS2 volumes, mkfs.ocfs2 ensures the volume is not mounted on any node in the cluster before formatting. For that to work, mkfs.ocfs2 expects the cluster service to be online. Specify this option to disable this check.

-J, --journal-options options
    Create the journal using options specified on the command line. Journal options are comma separated, and may take an argument using the equals ('=') sign. The following options are supported:

    size=journal-size
        Create a journal of size journal-size. Minimum size is 4M. If omitted, a value is heuristically determined based upon the file system size.

    block32
        Use a standard 32bit journal. The journal will be able to access up to 2^32-1 blocks. This is the default. It has been the journal format for OCFS2 volumes since the beginning. The journal is compatible with all versions of OCFS2. Prepending no is equivalent to the block64 journal option.

    block64
        Use a 64bit journal. The journal will be able to access up to 2^64-1 blocks. This allows large filesystems that can extend to the theoretical limits of OCFS2. It requires a new-enough filesystem driver that uses the new journalled block device, JBD2. Prepending no is equivalent to the block32 journal option.

-L, --label volume-label
    Set the volume label for the file system. This is useful for mounting-by-label. Limit the label to under 64 bytes.

-M, --mount mount-type
    Valid types are local and cluster. Local mount allows users to mount the volume without the cluster overhead and works only with OCFS2 bundled with Linux kernels 2.6.20 or later. Defaults to cluster.

-N, --node-slots number-of-node-slots
    Valid number ranges from 1 to 255. This number specifies the maximum number of nodes that can concurrently mount the partition. If omitted, the number defaults to 8. The number of slots can be later tuned up or down using tunefs.ocfs2.

-T filesystem-type
    Specify how the filesystem is going to be used, so that mkfs.ocfs2 can choose optimal filesystem parameters for that use. The supported filesystem types are:

    mail
        Appropriate for file systems that will host lots of small files.

    datafiles
        Appropriate for file systems that will host a relatively small number of very large files.

    vmstore
        Appropriate for file systems that will host virtual machine images.

--fs-features=[no]sparse...
    Turn specific file system features on or off. A comma separated list of feature flags can be provided, and mkfs.ocfs2 will try to create the file system with those features set according to the list. To turn a feature on, include it in the list. To turn a feature off, prepend no to the name. Choices here will override individual features set via the --fs-feature-level option. Refer to the section titled FEATURE COMPATIBILITY before selecting specific features. The following flags are supported:

    backup-super
        mkfs.ocfs2, by default, makes up to 6 backup copies of the super block at offsets 1G, 4G, 16G, 64G, 256G and 1T depending on the size of the volume. This can be useful in disaster recovery. This feature is fully compatible with all versions of the file system and generally should not be disabled.

    local
        Create the file system as a local mount, so that it can be mounted without a cluster stack.

    sparse
        Enable support for sparse files. With this, OCFS2 can avoid allocating (and zeroing) data to fill holes. Turn this feature on if you can, otherwise extends and some writes might be less performant.

    unwritten
        Enable unwritten extents support. With this turned on, an application can request that a range of clusters be pre-allocated within a file. OCFS2 will mark those extents with a special flag so that expensive data zeroing doesn't have to be performed. Reads and writes to a pre-allocated region act as reads and writes to a hole, except a write will not fail due to lack of data allocation. This feature requires sparse file support to be turned on.

    inline-data
        Enable inline-data support. If this feature is turned on, OCFS2 will store small files and directories inside the inode block. Data is transparently moved out to an extent when it no longer fits inside the inode block. In some cases, this can also make a positive impact on cold-cache directory and file operations.

    extended-slotmap
        The slot-map is a hidden file on an OCFS2 fs which is used to map mounted nodes to system file resources. The extended slot map allows a larger range of possible node numbers, which is useful for userspace cluster stacks. If required, this feature is automatically turned on by mkfs.ocfs2.

    metaecc
        Enables metadata checksums. With this enabled, the file system computes and stores the checksums in all metadata blocks. It also computes and stores an error correction code capable of fixing single bit errors.

    refcount
        Enables creation of reference counted trees. With this enabled, the file system allows users to create inode-based snapshots and clones known as reflinks.

    xattr
        Enable extended attributes support. With this enabled, users can attach name:value pairs to objects within the file system. In OCFS2, the names can be up to 255 bytes in length, terminated by the first NUL byte. While it is not required, printable names (ASCII) are recommended. The values can be up to 64KB of arbitrary binary data. Attributes can be attached to all types of inodes: regular files, directories, symbolic links, device nodes, etc. This feature is required for users wanting to use extended security facilities like POSIX ACLs or SELinux.

    usrquota
        Enable user quota support. With this feature enabled, the filesystem will track the amount of space and number of inodes (files, directories, symbolic links) each user owns. It is then possible to limit the maximum amount of space or inodes a user can have. See the documentation of the quota-tools package for more details.

    grpquota
        Enable group quota support. With this feature enabled, the filesystem will track the amount of space and number of inodes (files, directories, symbolic links) each group owns. It is then possible to limit the maximum amount of space or inodes a group can have. See the documentation of the quota-tools package for more details.

    indexed-dirs
        Enable directory indexing support. With this feature enabled, the file system creates an indexed tree for non-inline directory entries. For large scale directories, directory entry lookup performance from the indexed tree is faster than from the legacy directory blocks.

    discontig-bg
        Enables discontiguous block groups. With this feature enabled, the file system is able to grow the inode and the extent allocators even when there is no contiguous free chunk available. It allows the file system to grow the allocators in smaller (discontiguous) chunks.

    clusterinfo
        Enables storing the cluster stack information in the superblock. This feature is needed to support userspace cluster stacks and the global heartbeat mode in the o2cb cluster stack. If needed, this feature is automatically turned on by mkfs.ocfs2.

--fs-feature-level=feature-level
    Choose from a set of pre-determined file-system features. This option is designed to allow users to conveniently choose a set of file system features which fits their needs. There is no downside to trying a set of features which your module might not support: if it won't mount the new file system, simply reformat at a lower level. Feature levels can be fine-tuned via the --fs-features option. Currently, there are 3 types of feature levels:

    max-compat
        Chooses fewer features but ensures that the file system can be mounted from older versions of the OCFS2 module.

    default
        The default feature set tries to strike a balance between providing new features and maintaining compatibility with relatively recent versions of OCFS2. It currently enables sparse, unwritten, inline-data, xattr, indexed-dirs, discontig-bg, refcount, extended-slotmap and clusterinfo.

    max-features
        Choose the maximum number of features available. This will typically provide the best performance from OCFS2 at the expense of creating a file system that is only compatible with very recent versions of the OCFS2 kernel module.

--cluster-stack
    Specify the cluster stack. This option is normally not required as mkfs.ocfs2 chooses the currently active cluster stack. It is required only if the cluster stack is not online and the user wishes to use a stack other than the default, o2cb. Other supported cluster stacks are pcmk (Pacemaker) and cman (rgmanager). Once set, OCFS2 will only allow mounting the volume if the active cluster stack and cluster name match the ones specified on-disk.

--cluster-name
    Specify the name of the cluster. This option is mandatory if the user has specified a cluster-stack. This name is restricted to a max of 16 characters. Additionally, the o2cb cluster stack allows only alpha-numeric characters.

--global-heartbeat
    Enable the global heartbeat mode of the o2cb cluster stack. This option is not required if the o2cb cluster stack with global heartbeat is online as mkfs.ocfs2 will detect the active stack. However, if the cluster stack is not up, then this option is required along with cluster-stack and cluster-name. For more, refer to o2cb(7).

--no-backup-super
    This option is deprecated, please use --fs-features=nobackup-super instead.

-n, --dry-run
    Display the heuristically determined values without overwriting the existing file system.

-q, --quiet
    Quiet mode.

-U uuid
    Specify a custom UUID in the plain (2A4D1C581FAA42A1A41D26EFC90C1315) or traditional (2a4d1c58-1faa-42a1-a41d-26efc90c1315) format. This option is not recommended because the file system uses the UUID to uniquely identify a file system. If more than one file system were to have the same UUID, one is very likely to encounter erratic behavior, if not outright file system corruption.

-v, --verbose
    Verbose mode.

-V, --version
    Print version and exit.

blocks-count
    Usually mkfs.ocfs2 automatically determines the size of the given device and creates a file system that uses all of the available space on the device. This optional argument specifies that the file system should only consume the given number of file system blocks (see -b) on the device.
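The two UUID spellings accepted by the -U option describe the same 128-bit value. The sketch below converts between them with Python's standard uuid module; the helper names are hypothetical and the code is independent of mkfs.ocfs2 itself.

```python
import uuid


def to_traditional(text: str) -> str:
    """Return the lowercase, hyphenated form (8-4-4-4-12 hex digits)."""
    return str(uuid.UUID(text))


def to_plain(text: str) -> str:
    """Return the 32 hex digit form without hyphens, in upper case."""
    return uuid.UUID(text).hex.upper()
```

Normalizing to one spelling before comparing is a simple way to check that no two volumes have been given the same UUID.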

FEATURE COMPATIBILITY

This section lists the file system features that have been added to the OCFS2 file system and the version that each first appeared in. The table below lists the versions of the mainline Linux kernel and ocfs2-tools. Users should use this information to enable only those features that are available in the file system that they are using. Before enabling new features, users are advised to review the section titled FEATURE VALUES.

    Feature           Kernel Version  Tools Version
    local             Linux 2.6.20    ocfs2-tools 1.2
    sparse            Linux 2.6.22    ocfs2-tools 1.4
    unwritten         Linux 2.6.23    ocfs2-tools 1.4
    inline-data       Linux 2.6.24    ocfs2-tools 1.4
    extended-slotmap  Linux 2.6.27    ocfs2-tools 1.6
    metaecc           Linux 2.6.29    ocfs2-tools 1.6
    grpquota          Linux 2.6.29    ocfs2-tools 1.6
    usrquota          Linux 2.6.29    ocfs2-tools 1.6
    xattr             Linux 2.6.29    ocfs2-tools 1.6
    indexed-dirs      Linux 2.6.30    ocfs2-tools 1.6
    refcount          Linux 2.6.32    ocfs2-tools 1.6
    discontig-bg      Linux 2.6.35    ocfs2-tools 1.6
    clusterinfo       Linux 2.6.37    ocfs2-tools 1.8

Users can query the features enabled in the file system as follows:

    # tunefs.ocfs2 -Q "Label: %V\nFeatures: %H %O\n" /dev/sdg1
    Label: apache_files_10
    Features: sparse inline-data unwritten

FEATURE VALUES

This section lists the hex values that are associated with the file system features. This information is useful when debugging mount failures that are due to feature incompatibility. When a user attempts to mount an OCFS2 volume that has features enabled that are not supported by the running file system software, it will fail with an error like:

    ERROR: couldn't mount because of unsupported optional features (200).

By referring to the table below, it becomes apparent that the user attempted to mount a volume with the xattr (extended attributes) feature enabled with a version of the file system software that did not support it. At this stage, the user has the option of either upgrading the file system software, or disabling that on-disk feature using tunefs.ocfs2.

Some features allow the file system to be mounted with an older version of the software provided the mount is read-only. If a user attempts to mount such a volume in a read-write mode, it will fail with an error like:

    ERROR: couldn't mount RDWR because of unsupported optional features (1).

This error indicates that the volume had the unwritten RO compat feature enabled. This volume can be mounted by older file system software only in the read-only mode. In this case, the user has the option of either mounting the volume with the ro mount option, or disabling that on-disk feature using tunefs.ocfs2.

Feature            Category    Hex value
local              Incompat    8
sparse             Incompat    10
inline-data        Incompat    40
extended-slotmap   Incompat    100
xattr              Incompat    200
indexed-dirs       Incompat    400
metaecc            Incompat    800
refcount           Incompat    1000
discontig-bg       Incompat    2000
clusterinfo        Incompat    4000
unwritten          RO Compat   1
usrquota           RO Compat   2
grpquota           RO Compat   4
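The lookup described above can be sketched in a few lines of Python. This is only an illustration, not part of ocfs2-tools: the function name decode_features is hypothetical, the bit values are copied from the table, and it assumes the number in parentheses in the error message is a hexadecimal mask of the unsupported features.

```python
# Feature bits copied from the FEATURE VALUES table (hex values).
INCOMPAT = {
    0x8: "local",
    0x10: "sparse",
    0x40: "inline-data",
    0x100: "extended-slotmap",
    0x200: "xattr",
    0x400: "indexed-dirs",
    0x800: "metaecc",
    0x1000: "refcount",
    0x2000: "discontig-bg",
    0x4000: "clusterinfo",
}
RO_COMPAT = {0x1: "unwritten", 0x2: "usrquota", 0x4: "grpquota"}


def decode_features(hex_value, table):
    """Return the feature names for the bits set in hex_value.

    hex_value is the string printed in the mount error, e.g. "200".
    Bits that are not in the table are reported as unknown(<hex>).
    """
    mask = int(hex_value, 16)
    names = [name for bit, name in sorted(table.items()) if mask & bit]
    known_bits = 0
    for bit in table:
        known_bits |= bit
    leftover = mask & ~known_bits
    if leftover:
        names.append("unknown(%x)" % leftover)
    return names


if __name__ == "__main__":
    print(decode_features("200", INCOMPAT))
    print(decode_features("1", RO_COMPAT))
```

Use INCOMPAT for the "unsupported optional features" error and RO_COMPAT for the "couldn't mount RDWR" error.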

SEE ALSO debugfs.ocfs2(8) fsck.ocfs2(8) mount.ocfs2(8) mounted.ocfs2(8) o2cb(7) o2cluster(8) o2image(8) o2info(1) tunefs.ocfs2(8)

AUTHORS Oracle Corporation

COPYRIGHT Copyright © 2004, 2012 Oracle. All rights reserved.

Version 1.8.2 January 2012 7 mount.ocfs2(8) OCFS2 Manual Pages mount.ocfs2(8)

NAME mount.ocfs2 − mount an OCFS2 filesystem

SYNOPSIS mount.ocfs2 [−vn] [−o options] device dir

DESCRIPTION mount.ocfs2 mounts an OCFS2 filesystem at dir. It is usually invoked indirectly by the mount(8) command.

OPTIONS _netdev Indicates that the file system resides on a device that requires network access (used to prevent the system from attempting to mount these filesystems until the network has been enabled on the system). mount.ocfs2(8) transparently appends this option during mount. However, users mounting the volume via /etc/fstab must explicitly specify this mount option to delay the system from mounting the volume until after the network has been enabled.

noatime The file system will not update access time.

relatime The file system will update atime only if the on-disk atime is older than mtime or ctime.

strictatime, atime_quantum=nrsec The file system will always perform atime updates, but the minimum update interval is specified by atime_quantum, which defaults to 60 seconds. Set it to zero to always update atime. These two options need to work together.

[no]acl Enables / disables POSIX ACLs (access control lists) support.

[no]user_xattr Enables / disables extended user attributes.

commit=nrsec Sync all data and metadata every nrsec seconds. The default value is 5 seconds. Zero means default.

data=[ordered|writeback] Specifies the handling of file data during metadata journalling.

ordered This is the default mode. Data is flushed to disk before the corresponding meta-data is committed to the journal.

writeback Data ordering is not preserved - data may be flushed to disk after the corresponding meta-data is committed to the journal. This is rumored to be the higher-throughput option. While it guarantees internal file system integrity, it can allow old data to appear in files after a crash and journal recovery.

errors=[remount-ro|panic] Specifies the behavior when an on-disk corruption is encountered.

remount-ro This is the default mode. The file system is remounted read-only.

panic The system is halted via panic.

localflocks This disables cluster-aware flock(2).

coherency=[full|buffered] Specifies the extent of coherency for the cached file data across the cluster. This mount option works with Linux kernel 2.6.37 and later.

full This is the default mode. The file system ensures the cached file data is coherent across the cluster for all IO modes.

buffered The file system only ensures the cached file data coherency for buffered mode IOs. It does not perform IO serialization for direct IOs. This allows multiple nodes to perform concurrent direct IOs to the same file. This is the recommended mode for volumes hosting database files.

resv_level=level Specifies the level of allocation reservation for files. The higher the value, the more aggressive it is. Valid values range from 0 (reservation off) to 8 (maximum space for reservation). It defaults to 2. This mount option works with Linux kernel 2.6.35 and later.

dir_resv_level=level By default, directory reservation scales with file reservation. Users should rarely need to change this value. If the file allocation reservation is turned off, this option will have no effect. This mount option works with Linux kernel 2.6.35 and later.

inode64 Indicates that the file system can create inodes at any location in the volume, including those which will result in inode numbers greater than 4 billion.

[no]intr Specifies whether a signal can interrupt IOs. It is disabled by default.

ro Mount the file system read-only.

rw Mount the file system read-write.
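As a rough illustration of the value constraints listed in the options above, the following Python sketch checks a comma-separated option string before it is passed to mount. The function name check_mount_options and its messages are hypothetical and not part of mount.ocfs2. Only the constraints stated in this section are encoded, and options it does not know are passed over unchecked.

```python
# Allowed values taken from the OPTIONS section; names and messages are illustrative.
ENUM_OPTIONS = {
    "data": {"ordered", "writeback"},
    "errors": {"remount-ro", "panic"},
    "coherency": {"full", "buffered"},
}
RANGE_OPTIONS = {"resv_level": (0, 8)}
NON_NEGATIVE = {"commit", "atime_quantum"}


def check_mount_options(option_string):
    """Return a list of problems found in a comma-separated -o option string."""
    problems = []
    for item in filter(None, option_string.split(",")):
        key, _, value = item.partition("=")
        if key in ENUM_OPTIONS:
            if value not in ENUM_OPTIONS[key]:
                allowed = sorted(ENUM_OPTIONS[key])
                problems.append("%s: expected one of %s" % (key, allowed))
        elif key in RANGE_OPTIONS or key in NON_NEGATIVE:
            if not value.isdigit():
                problems.append("%s: expected a non-negative integer" % key)
            elif key in RANGE_OPTIONS:
                low, high = RANGE_OPTIONS[key]
                if not low <= int(value) <= high:
                    problems.append("%s: expected %d..%d" % (key, low, high))
        # Any other option (noatime, _netdev, ro, ...) is not checked here.
    return problems


if __name__ == "__main__":
    print(check_mount_options("_netdev,data=writeback,resv_level=9"))
```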

NOTES To mount and umount an OCFS2 volume, do:

    # mount /dev/sda1 /mount/path
    ...
    # umount /mount/path

Users mounting a clustered volume should be aware of the following:

1. The cluster stack must be online for a clustered mount to succeed.

2. The clustered mount operation is not instantaneous; it must wait for the node to join the DLM domain.

3. Likewise, clustered umount is also not instantaneous; it involves migrating all mastered lock resources to the other nodes in the cluster.

If the mount fails, detailed errors can be found via dmesg(8). These might include incorrect cluster configuration (say, a missing node or incorrect IP address) or a firewall interfering with o2cb network traffic. Check the configuration as listed in o2cb(7) or the man page of the active cluster stack.

To auto-mount volumes on startup, the file system tools include an ocfs2 init service. This runs after the o2cb init service has started the cluster. The ocfs2 init service mounts all OCFS2 volumes listed in /etc/fstab.

    # chkconfig --add o2cb
    o2cb 0:off 1:off 2:on 3:on 4:off 5:on 6:off

    $ chkconfig --add ocfs2
    ocfs2 0:off 1:off 2:on 3:on 4:off 5:on 6:off

    $ cat /etc/fstab
    ...
    /dev/sda1 /u01 ocfs2 _netdev,defaults 0 0
    ...

SEE ALSO debugfs.ocfs2(8) fsck.ocfs2(8) mkfs.ocfs2(8) mounted.ocfs2(8) o2cb(7) o2cluster(8) o2image(8) o2info(1) tunefs.ocfs2(8)

AUTHORS Oracle Corporation

COPYRIGHT Copyright © 2004, 2012 Oracle. All rights reserved.

Version 1.8.2 January 2012 3 mounted.ocfs2(8) OCFS2 Manual Pages mounted.ocfs2(8)

NAME mounted.ocfs2 − Detects all OCFS2 volumes on a system.

SYNOPSIS mounted.ocfs2 [−d] [−f] [device]

DESCRIPTION mounted.ocfs2 is used to detect OCFS2 volume(s) on a system. When run without specifying a device, it scans all the partitions listed in /proc/partitions.

OPTIONS −d Lists the OCFS2 volumes along with their labels and uuids. It also lists the cluster stack, cluster name and the cluster flags. The possible cluster stacks are o2cb, pcmk and cman. None indicates a local mount or a non-clustered volume. A G cluster flag indicates global-heartbeat for the o2cb cluster stack.

−f Lists the OCFS2 volumes along with the list of nodes that have mounted the volume.

NOTES As this utility gathers information without taking any cluster locks, the information listed in the full detect mode could be stale. This is only problematic for volumes that were not cleanly umounted by the last node. Such volumes will show up as mounted (as per this utility) on one or more nodes but are in fact not mounted on any node. Such volumes are awaiting slot-recovery, which is auto-performed on the next mount (or file system check).

EXAMPLES To view the list of OCFS2 volumes, do:

    # mounted.ocfs2 -d
    Device     Stack  Cluster     F  UUID                              Label
    /dev/sdc1  None                  23878C320CF3478095D1318CB5C99EED  localmount
    /dev/sdd1  o2cb   webcluster  G  8AB016CD59FC4327A2CDAB69F08518E3  webvol
    /dev/sdg1  o2cb   webcluster  G  77D95EF51C0149D2823674FCC162CF8B  logsvol
    /dev/sdh1  o2cb   webcluster  G  BBA1DBD0F73F449384CE75197D9B7098  scratch
    /dev/sdk1  o2cb   webcluster  G  DCDA2845177F4D59A0F2DCD8DE507CC3  hb1

To view the list of nodes that have potentially (see notes) mounted the OCFS2 volumes, do:

    # mounted.ocfs2 -f
    Device     Stack  Cluster     F  Nodes
    /dev/sdc1  None
    /dev/sdd1  o2cb   webcluster  G  node1, node3, node10
    /dev/sdg1  o2cb   webcluster  G  node1, node3, node10
    /dev/sdh1  o2cb   webcluster  G  Not mounted
    /dev/sdk1  o2cb   webcluster  G  node1, node3, node10

SEE ALSO debugfs.ocfs2(8) fsck.ocfs2(8) mkfs.ocfs2(8) mount.ocfs2(8) o2cluster(8) o2image(8) o2info(1) tunefs.ocfs2(8)

AUTHORS Oracle Corporation

COPYRIGHT Copyright © 2004, 2012 Oracle. All rights reserved.

Version 1.8.2 January 2012 2 tunefs.ocfs2(8) OCFS2 Manual Pages tunefs.ocfs2(8)

NAME tunefs.ocfs2 − Change OCFS2 file system parameters.

SYNOPSIS tunefs.ocfs2 [−−cloned−volume[=new-label]] [−−fs−features=list−of−features] [−J journal-options] [−L volume-label] [−N number-of-node-slots] [−Q query-format] [−ipqnSUvVy] [−−backup-super] [−−list−sparse] device [blocks-count]

DESCRIPTION tunefs.ocfs2(8) is used to adjust OCFS2 file system parameters on disk. The tool expects the cluster to be online as it needs to take the appropriate cluster locks to write safely to disk.

OPTIONS −−cloned−volume[=new-label] Change the volume UUID (auto-generated) and the label, if provided, of a cloned OCFS2 volume. This option does not perform volume cloning. It only changes the UUID and label on a cloned volume so that it can be mounted on the node that has the original volume mounted.

−−fs−features=[no]sparse... Turn specific file system features on or off. tunefs.ocfs2(8) will attempt to enable or disable the feature list provided. To enable a feature, include it in the list. To disable a feature, prepend no to the name. For a list of feature names, refer to mkfs.ocfs2(8).

−J, −−journal−options options Modify the journal using options specified on the command line. Journal options are comma separated, and may take an argument using the equals (’=’) sign. For a list of possible options, refer to mkfs.ocfs2(8).

−L, −−label volume−label Change the volume label of the file system. Limit the label to under 64 bytes.

−N, −−node−slots number−of−node−slots Valid number ranges from 1 to 255. This number specifies the maximum number of nodes that can concurrently mount the partition. Use this to increase or decrease the number of node slots. One reason to decrease could be to release the space consumed by the journals for those slots.

−S, −−volume−size Grow the size of the OCFS2 file system. If blocks-count is not specified, tunefs.ocfs2(8) extends the volume to the current size of the device.

−Q, −−query query−format Query the file system for its attributes like block size, label, etc. Query formats are modified versions of the standard printf(3) formatting. The format is made up of static strings (which may include standard C character escapes for newlines, tabs, and other special characters) and printf(3) type formatters. The list of type specifiers is as follows:

    B  Block size in bytes
    T  Cluster size in bytes
    N  Number of node slots
    R  Root directory block number
    Y  System directory block number
    P  First cluster group block number
    V  Volume label
    U  Volume uuid
    M  Compat flags
    H  Incompat flags
    O  RO Compat flags

−q, −−quiet Quiet mode.

−U, −−uuid−reset[=new-uuid] Reset the volume UUID of the file system. If not provided, the utility will auto generate it. For a custom UUID, specify it in either the plain (2A4D1C581FAA42A1A41D26EFC90C1315) or the traditional (2a4d1c58-1faa-42a1-a41d-26efc90c1315) format. Users specifying custom UUIDs must be careful to ensure that no two volumes have the same UUID. If more than one file system were to have the same UUID, one is very likely to encounter erratic behavior, if not outright file system corruption.

−v, −−verbose Verbose mode.

−V, −−version Show version and exit.

−y, −−yes Always answer Yes in interactive command line.

−n, −−no Always answer No in interactive command line.

−−backup−super Backs up the superblock to fixed offsets (1G, 4G, 16G, 64G, 256G and 1T) on disk. This option is useful for backing up the superblock on volumes where the user either explicitly disallowed backups while formatting, or used a version of mkfs.ocfs2(8) (1.2.2 or older) that did not provide this facility.

−−list-sparse Lists the files having holes. This option is useful when disabling the sparse feature.

−−update-cluster-stack Updates the on-disk cluster information to match the running cluster. Users looking to update the on-disk cluster stack without starting the new cluster should use the o2cluster(8) utility.

blocks-count During resize, tunefs.ocfs2(8) automatically determines the size of the given device and grows the file system such that it uses all of the available space on the device. This optional argument specifies that the file system should be extended to consume only the given number of file system blocks on the device.
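The −U, −−uuid−reset option accepts a UUID in either the plain or the traditional spelling, and duplicate UUIDs across volumes must be avoided. A minimal Python sketch of converting between the two spellings and spotting duplicates is shown here. The helper names uuid_forms and find_duplicates are hypothetical, and the sketch relies only on Python's standard uuid module, not on any OCFS2 tool.

```python
import uuid


def uuid_forms(text):
    """Return (plain, traditional) spellings of a UUID given in either form."""
    parsed = uuid.UUID(text)  # accepts 32 hex digits with or without hyphens
    return parsed.hex.upper(), str(parsed)


def find_duplicates(uuids):
    """Return the plain-form UUIDs that occur more than once in the input."""
    seen, dupes = set(), set()
    for text in uuids:
        plain = uuid_forms(text)[0]
        if plain in seen:
            dupes.add(plain)
        else:
            seen.add(plain)
    return sorted(dupes)


if __name__ == "__main__":
    print(uuid_forms("2A4D1C581FAA42A1A41D26EFC90C1315"))
```

Normalizing to one spelling before comparing avoids missing a duplicate that was merely written in the other format.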

EXAMPLES

    # tunefs.ocfs2 -Q "UUID = %U\nNumSlots = %N\n" /dev/sda1
    UUID = CBB8D5E0C169497C8B52A0FD555C7A3E
    NumSlots = 4

SEE ALSO debugfs.ocfs2(8) fsck.ocfs2(8) fsck.ocfs2.checks(8) mkfs.ocfs2(8) mount.ocfs2(8) mounted.ocfs2(8) o2cluster(8) o2image(8) o2info(1)

AUTHORS Oracle Corporation

COPYRIGHT Copyright © 2004, 2012 Oracle. All rights reserved.

Version 1.8.2 January 2012 3 o2cluster(8) OCFS2 Manual Pages o2cluster(8)

NAME o2cluster − Change cluster stack stamped on an OCFS2 file system.

SYNOPSIS o2cluster [−o|−−show−ondisk] [−r|−−show−running] [−u|−−update[=]] [−hvVyn] [device]

DESCRIPTION o2cluster is used to change the cluster stack stamped on an OCFS2 file system. It is also used to list the active cluster stack and the one stamped on-disk. This utility does not expect the cluster to be online. It only updates the file system if it is reasonably assured that it is not in-use on any other node. Clean journals imply the file system is not in-use. This utility aborts if it detects even one dirty journal.

Before using this utility, the user should use other means to ensure that the volume is not in-use, and more importantly, not about to be put in-use. While clean journals imply the file system is not in-use, there is a tiny window after the check and before the update during which another node could mount the file system using the older cluster stack.

If a dirty journal is detected, it implies one of two scenarios. Either the file system is mounted on another node, or the last node to have it mounted crashed. There is no way, short of joining the cluster, that the utility can use to differentiate between the two. Considering this utility is targeted at scenarios in which the user is looking to change the on-disk cluster stack, it becomes a chicken-and-egg problem.

If one were to run into this scenario, the user should manually re-confirm that the file system is not in-use on another node and then run fsck.ocfs2(8). It will update the on-disk cluster stack to the active cluster stack and do a complete file system check.

SPECIFYING CLUSTER STACK The cluster stack can be specified in one of two forms. The first is default, denoting the original classic o2cb cluster stack with local heartbeat. The second is a triplet with the stack name, the cluster name and the cluster flags separated by commas, for example o2cb,mycluster,global.

The valid stack names are o2cb, pcmk, and cman.

The cluster name can be up to 16 characters. The o2cb stack further restricts the names to contain only alphanumeric characters.

The valid flags for the o2cb stack are local and global, denoting the two heartbeat modes. The only valid flag for the other stacks is none.
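The rules above can be collected into a small validator. The Python sketch below is illustrative only: parse_cluster_stack is a hypothetical name, not an o2cluster interface, and mapping default to an o2cb stack with the local flag and no cluster name is this sketch's reading of "classic o2cb cluster stack with local heartbeat".

```python
# Valid flags per stack, as listed in SPECIFYING CLUSTER STACK.
VALID_FLAGS = {
    "o2cb": {"local", "global"},
    "pcmk": {"none"},
    "cman": {"none"},
}


def parse_cluster_stack(spec):
    """Parse 'default' or 'stack,cluster,flags' into (stack, cluster, flag)."""
    if spec == "default":
        # Assumption: the default form carries no cluster name.
        return ("o2cb", None, "local")
    parts = spec.split(",")
    if len(parts) != 3:
        raise ValueError("expected 'default' or stack,cluster,flags")
    stack, cluster, flag = parts
    if stack not in VALID_FLAGS:
        raise ValueError("unknown stack: %s" % stack)
    if not 1 <= len(cluster) <= 16:
        raise ValueError("cluster name must be 1 to 16 characters")
    if stack == "o2cb" and not (cluster.isascii() and cluster.isalnum()):
        raise ValueError("o2cb cluster names must be alphanumeric")
    if flag not in VALID_FLAGS[stack]:
        raise ValueError("invalid flag %s for stack %s" % (flag, stack))
    return (stack, cluster, flag)


if __name__ == "__main__":
    print(parse_cluster_stack("o2cb,mycluster,global"))
```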

OPTIONS −o|−−show−ondisk Shows the cluster stack stamped on-disk.

−r|−−show−running Shows the active cluster stack.

−u|−−update[=] Updates the on-disk cluster stack with the one provided. If no cluster stack is provided, the utility detects the active cluster stack and stamps it on-disk.

−v, −−verbose Verbose mode.

−V, −−version Show version and exit.

−y, −−yes Always answer Yes in interactive command line.

−n, −−no Always answer No in interactive command line.

EXAMPLES

    # o2cluster -r
    o2cb,myactivecluster,global

    # o2cluster -o /dev/sda1
    o2cb,mycluster,global

    # o2cluster --update=o2cb,yourcluster,global /dev/sdb1
    Changing the clusterstack from o2cb,mycluster,global to o2cb,yourcluster,global. Continue? y
    Updated successfully.

SEE ALSO debugfs.ocfs2(8) fsck.ocfs2(8) fsck.ocfs2.checks(8) mkfs.ocfs2(8) mount.ocfs2(8) mounted.ocfs2(8) o2image(8) o2info(1) tunefs.ocfs2(8)

AUTHORS Oracle Corporation

COPYRIGHT Copyright © 2011, 2012 Oracle. All rights reserved.

Version 1.8.2 January 2012 2 debugfs.ocfs2(8) OCFS2 Manual Pages debugfs.ocfs2(8)

NAME debugfs.ocfs2 − OCFS2 file system debugger.

SYNOPSIS
    debugfs.ocfs2 [−f cmdfile] [−R command] [−s backup] [−nwV?] [device]
    debugfs.ocfs2 −l [tracebit ... [allow|off|deny]] ...
    debugfs.ocfs2 −d, −−decode lockname
    debugfs.ocfs2 −e, −−encode lock_type block_num [generation | parent]

DESCRIPTION The debugfs.ocfs2 program is an interactive file system debugger useful in displaying on-disk OCFS2 filesystem structures on the specified device.

OPTIONS −d, −−decode lockname Display the information encoded in the lockname.

−e, −−encode lock_type block_num [generation | parent] Display the lockname obtained by encoding the arguments provided.

−f, −−file cmdfile Executes the debugfs commands in cmdfile.

−i, −−image Specifies that device is an o2image file created by the o2image tool.

−l [tracebit ... [allow|off|deny]] ... Control OCFS2 filesystem tracing by enabling and disabling trace bits. Do debugfs.ocfs2 -l to get the list of all trace bits.

−n, −−noprompt Hide prompt.

−R, −−request command Executes a single debugfs command.

−s, −−superblock backup−number mkfs.ocfs2 makes up to 6 backup copies of the superblock at offsets 1G, 4G, 16G, 64G, 256G and 1T depending on the size of the volume. Use this option to specify the backup, 1 thru 6, to use to open the volume.

−w, −−write Opens the filesystem in RW mode. By default the filesystem is opened in RO mode.

−V, −−version Display version and exit.

−?, −−help Display help and exit.
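The backup superblock offsets named under −s, −−superblock each sit at four times the previous one, so the mapping from backup number to byte offset is easy to tabulate. The Python sketch below is illustrative: the helper names are hypothetical, and the rule used to decide which backups a volume can hold (offset strictly below the volume size) is an assumption, since the text only says the count depends on the size of the volume.

```python
GIB = 1 << 30

# Offsets named in the text: 1G, 4G, 16G, 64G, 256G and 1T.
BACKUP_OFFSETS = [GIB * 4 ** i for i in range(6)]


def backup_offset(backup_number):
    """Byte offset of backup superblock number 1 thru 6."""
    if not 1 <= backup_number <= 6:
        raise ValueError("backup number must be 1 thru 6")
    return BACKUP_OFFSETS[backup_number - 1]


def backups_that_fit(volume_bytes):
    """Backup numbers whose offset lies inside a volume of the given size.

    Assumption: a backup is considered present when its offset is below the
    volume size; the real tools may require additional room.
    """
    return [n for n, off in enumerate(BACKUP_OFFSETS, start=1)
            if off < volume_bytes]


if __name__ == "__main__":
    print(backups_that_fit(20 * GIB))
```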

SPECIFYING FILES Many debugfs.ocfs2 commands take a filespec as an argument to specify an inode (as opposed to a pathname) in the filesystem which is currently opened by debugfs.ocfs2. The filespec argument may be specified in two forms. The first form is an inode number or lockname surrounded by angle brackets, e.g., <32>. The second form is a pathname; if the pathname is prefixed by a forward slash (’/’), then it is interpreted relative to the root of the filesystem which is currently opened by debugfs.ocfs2. If not, the path is interpreted relative to the current working directory as maintained by debugfs.ocfs2, which can be modified using the command cd. If the pathname is prefixed by a double forward slash (’//’), then it is interpreted relative to the root of the system directory of the filesystem opened by debugfs.ocfs2.
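The filespec forms described above can be told apart with a few string checks. The following Python sketch is illustrative and is not how debugfs.ocfs2 itself parses its arguments. The function name classify_filespec and the returned labels are hypothetical, and it assumes that a bracketed value made only of decimal digits is an inode number while any other bracketed value is a lockname.

```python
import re


def classify_filespec(spec):
    """Classify a filespec string into (kind, value)."""
    bracketed = re.fullmatch(r"<(.+)>", spec)
    if bracketed:
        inner = bracketed.group(1)
        kind = "inode" if inner.isdigit() else "lockname"
        return (kind, inner)
    if spec.startswith("//"):
        # Relative to the root of the system directory.
        return ("system-dir-path", spec[2:])
    if spec.startswith("/"):
        # Relative to the root of the opened filesystem.
        return ("root-path", spec[1:])
    # Relative to the current working directory kept by the debugger.
    return ("relative-path", spec)


if __name__ == "__main__":
    for example in ["<32>", "/a/b", "//name", "a/b"]:
        print(example, classify_filespec(example))
```

The double-slash check has to come before the single-slash check, since every system directory path also starts with a single slash.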

LOCKNAMES Locknames are specially formatted strings used by the file system to uniquely identify objects in the filesystem. Most locknames used by OCFS2 are generated using the inode number and its generation number and can be decoded using the decode command or used directly in place of an inode number in commands requiring a filespec. Like inode numbers, locknames need to be enclosed in angle brackets. Use the encode command to generate all possible locknames for an object.

COMMANDS This is a list of the commands which debugfs.ocfs2 supports.

bmap filespec logical_block Display the physical block number corresponding to the logical block number logical_block in the inode filespec.

cat filespec Dump the contents of inode filespec to stdout.

cd filespec Change the current working directory to filespec.

chroot filespec Change the root directory to be the directory filespec.

close Close the currently opened filesystem.

controld dump Display information obtained from ocfs2_controld.

curdev Show the currently open device.

decode Display the inode number encoded in the lockname.

dirblocks <filespec> Display the directory blocks associated with the given filespec.

dlm_locks [−f <file>] [−l] []... Display the status of all lock resources in the o2dlm domain that the file system is a member of. This command expects the debugfs filesystem to be mounted as mount -t debugfs debugfs /sys/kernel/debug. Use lockname(s) to limit the output to the given lock resources, -l to include contents of the lock value block and -f <file> to specify a saved copy of /sys/kernel/debug/o2dlm//locking_state.

dump [−p] filespec outfile Dump the contents of the inode filespec to the output file outfile. If -p is given, set the owner, group, timestamps and permissions information on outfile to match those of filespec.

dx_dump filespec Display the indexed directory information for the given directory.

dx_leaf Display the contents of the given indexed directory leaf block.

dx_root Display the contents of the given indexed directory root block.

dx_space filespec Display the directory free space list.

encode filespec Display the lockname for the filespec.

extent Display the contents of the extent structure at block#.

findpath [|] Display the pathname for the inode specified by lockname or inode#. This command does not display all the hard-linked paths for the inode.

frag filespec Display the inode’s number of extents to clusters ratio.

fs_locks [-f <file>] [-l] [-B] []... Display the status of all locks known by the file system. This command expects the debugfs filesystem to be mounted as mount -t debugfs debugfs /sys/kernel/debug. Use lockname(s) to limit the output to the given lock resources, -B to limit the output to only the busy locks, -l to include contents of the lock value block and -f <file> to specify a saved copy of /sys/kernel/debug/ocfs2//locking_state.

group Display the contents of the group descriptor at block#.

grpextents Display free extents in the chain group.

hb Display the contents of the heartbeat system file.

help, ? Print the list of commands understood by debugfs.ocfs2.

icheck block# ... Display the inodes that use the one or more blocks specified on the command line. If the inode is a regular file, also display the corresponding logical block offset.

lcd directory Change the current working directory of the debugfs.ocfs2 process to the directory on the native filesystem.

locate [|] ... Display all pathnames for the inode(s) specified by locknames or inode#s.

logdump [-T] slot# Display the contents of the journal for slot slot#. Use -T to limit the output to just the summary of the inodes in the journal.

ls [−l] filespec Print the listing of the files in the directory filespec. The −l flag will list files in the long format.

net_stats [interval [count]] Display net statistics.

ncheck [|] ... See locate.

open device Open the filesystem on device.

quit, q Quit debugfs.ocfs2.

rdump [−v] filespec outdir Recursively dump directory filespec and all its contents (including regular files, symbolic links and other directories) into the outdir which should be an existing directory on the native filesystem.

refcount [−e] filespec Display the refcount block, and optionally its tree, of the specified inode.

slotmap Display the contents of the slotmap system file.

stat [−t|−T] filespec Display the contents of the inode structure for the filespec. The -t ("traverse") option selects traversal of the inode’s metadata. The extent tree, chain list, or other extra metadata will be dumped. This is the default. The -T option turns off traversal to reduce the I/O required when only basic inode information is needed.

stat_sysdir Display the contents of all objects in the system directory.

stats [−h] [−s backup−number] Display the contents of the superblock. Use −s to display a specific backup superblock. Use −h to hide the inode.

xattr [-v] <filespec> Display extended attributes associated with the given filespec.

ACKNOWLEDGEMENT This tool has been modelled after debugfs, a debugging tool for ext2.

SEE ALSO fsck.ocfs2(8) fsck.ocfs2.checks(8) mkfs.ocfs2(8) mount.ocfs2(8) mounted.ocfs2(8) o2cluster(8) o2image(8) o2info(1) ocfs2(7) tunefs.ocfs2(8)

AUTHOR Oracle Corporation

COPYRIGHT Copyright © 2004, 2012 Oracle. All rights reserved.

Version 1.8.2 January 2012 5 o2image(8) OCFS2 Manual Pages o2image(8)

NAME o2image − Copy or restore OCFS2 file system meta-data

SYNOPSIS o2image [−r] [−I] device image-file

DESCRIPTION o2image copies the OCFS2 file system meta-data from the device to the specified image-file. This image file contains the file system skeleton that includes the inodes, directory names and file names. It does not include any file data.

This image file can be useful to debug certain problems that are not reproducible otherwise, such as on-disk corruptions. It could also be used to analyse the file system layout in an aging file system with an eye towards improving performance.

As the image-file contains a copy of all the meta-data blocks, it can be a large file. By default, it is created in a packed format, in which all meta-data blocks are written back-to-back. With the −r option, the user could choose to have the file in the raw (or sparse) format, in which the blocks are written to the same offset as they are on the device.

debugfs.ocfs2 understands both formats.

o2image also has the option, −I, to restore the meta-data from the image file onto the device. This option will rarely be useful to end-users and has been written specifically for developers and testers.

OPTIONS −r Copies the meta-data to the image-file in the raw format. Use this option only if the destination file system supports sparse files. If unsure, do not use this option and let the tool create the image-file in the packed format.

−I Restores meta-data from the image-file onto the device. CAUTION: This option could corrupt the file system.

−i Interactive mode - before writing out the image file, print its size and ask whether to proceed. This setting only applies when ’-I’ is not specified. It can be useful when the file system holding the image is low on disk space and the user might need to free up space once the target image size is calculated.

EXAMPLES Copies metadata blocks from /dev/sda1 device to sda1.out file.

    # o2image /dev/sda1 sda1.out

Copies meta-data blocks from sda1.out onto the /dev/sda1 device. As this command overwrites an existing volume, please use with CAUTION.

    # o2image -I /dev/sda1 sda1.out

SEE ALSO debugfs.ocfs2(8) fsck.ocfs2(8) fsck.ocfs2.checks(8) mkfs.ocfs2(8) mount.ocfs2(8) mounted.ocfs2(8) o2cluster(8) o2info(1) tunefs.ocfs2(8)

AUTHORS Oracle Corporation

COPYRIGHT Copyright © 2007, 2012 Oracle. All rights reserved.

Version 1.8.2 January 2012 2 o2hbmonitor(8) OCFS2 Manual Pages o2hbmonitor(8)

NAME o2hbmonitor − Monitors disk heartbeat in the O2CB cluster stack

SYNOPSIS o2hbmonitor [−w percent] [−ivV]

DESCRIPTION o2hbmonitor is a utility to monitor the disk heartbeat in the o2cb cluster stack. It tracks the time elapsed since the last heartbeat and logs messages once it exceeds the warn threshold.

By default, it runs as a daemon and logs messages to the system logger. It can be started at any time and stopped using kill(1). It does not affect the functioning of the heartbeat thread. It is typically started automatically during cluster online and stopped during cluster offline by the o2cb init script.

This utility expects the debugfs file system to be mounted at /sys/kernel/debug.

OPTIONS −w percent Warn threshold percent. It is the percentage of the idle threshold. It defaults to 50%.

−i Interactive mode. It works as a daemon by default. This mode is typically only used for debug- ging.

−v Verbose mode. It logs messages only to the system logger by default. In this mode it also logs the messages to stdout.

−V Displays version.
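The warn threshold set by −w is a percentage of the idle threshold, so the point at which a message is logged follows from simple arithmetic. The Python sketch below only illustrates that relationship. The function names are hypothetical, and the idle threshold value used in the example (60000 ms) is an arbitrary assumption, not a value taken from this page.

```python
def warn_threshold_ms(idle_threshold_ms, warn_percent=50):
    """Elapsed time after which a warning would be logged.

    warn_percent is the -w value: a percentage of the idle threshold
    (50 by default, per the OPTIONS text).
    """
    return idle_threshold_ms * warn_percent / 100


def should_warn(elapsed_ms, idle_threshold_ms, warn_percent=50):
    """True once the time since the last heartbeat exceeds the warn threshold."""
    return elapsed_ms > warn_threshold_ms(idle_threshold_ms, warn_percent)


if __name__ == "__main__":
    # Assumed idle threshold of 60000 ms, for illustration only.
    print(warn_threshold_ms(60000))
    print(should_warn(45000, 60000))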

NOTES This utility works with Linux kernel 2.6.37 and later.

SEE ALSO o2cb(7)

AUTHORS Oracle Corporation

COPYRIGHT Copyright © 2010, 2012 Oracle. All rights reserved.

Version 1.8.2 January 2012 1 ocfs2_hb_ctl(8) OCFS2 Manual Pages ocfs2_hb_ctl(8)

NAME ocfs2_hb_ctl − Starts and stops the O2CB local heartbeat on a given device.

SYNOPSIS
    ocfs2_hb_ctl -S -d device service
    ocfs2_hb_ctl -S -u uuid service
    ocfs2_hb_ctl -K -d device service
    ocfs2_hb_ctl -K -u uuid service
    ocfs2_hb_ctl -I -d device
    ocfs2_hb_ctl -I -u uuid
    ocfs2_hb_ctl -P -d device [-n io_priority]
    ocfs2_hb_ctl -P -u uuid [-n io_priority]
    ocfs2_hb_ctl -h

DESCRIPTION ocfs2_hb_ctl starts and stops local heartbeat on an OCFS2 device. Users are strongly urged not to use this tool directly. It is automatically invoked by mount.ocfs2(8) and other tools that require heartbeat notifications.

This utility only operates in the local heartbeat mode. It fails silently when run in global heartbeat mode. More information on the heartbeat modes can be found in o2cb(7).

The tool accepts devices specified either by name or by uuid. Service denotes the application that is requesting the heartbeat notification.

OPTIONS −S Starts the heartbeat.

−K Stops the heartbeat.

−I Prints the heartbeat reference counts for that heartbeat region.

−d Specify region by device name.

−u Specify region by device uuid.

−n Adjust IO priority for the heartbeat thread. This option calls the ionice tool to set its IO scheduling class to realtime with scheduling class data as provided. This option is usable only with the O2CB cluster stack.

−h Displays help and exits.

SEE ALSO mount.ocfs2(8) o2cb(7) o2cb(8) o2cb.sysconfig(5) ocfs2.cluster.conf(5) o2cluster(8)

AUTHORS Oracle Corporation

COPYRIGHT Copyright © 2004, 2012 Oracle. All rights reserved.

Version 1.8.2 January 2012 1 fsck.ocfs2(8) OCFS2 Manual Pages fsck.ocfs2(8)

NAME fsck.ocfs2 − Check an OCFS2 file system.

SYNOPSIS fsck.ocfs2 [ −pafFGnuvVy ] [ −b superblockblock ] [ −B blocksize ] device

DESCRIPTION fsck.ocfs2 is used to check an OCFS2 file system.

device is the file where the file system is stored (e.g. /dev/sda1). It will almost always be a device file but a regular file will work as well.

OPTIONS −a This option does the same thing as the -p option. It is provided for backwards compatibility only: it is suggested that people use the -p option whenever possible.

−b superblockblock Normally, fsck.ocfs2 will read the superblock from the first block of the device. This option specifies an alternate block that the superblock should be read from. (Use −r instead of this option.)

−B blocksize The blocksize, specified in bytes, can range from 512 to 4096. A value of 0, the default, is used to indicate that the blocksize should be automatically detected.

−D Optimize directories in filesystem. This option causes fsck.ocfs2 to coalesce the directory entries in order to improve the filesystem performance.

−f Force checking even if the file system is clean.

−F By default fsck.ocfs2 will check with the cluster services to ensure that the volume is not in-use (mounted) on any node in the cluster before proceeding. -F skips this check and should only be used when it can be guaranteed that the volume is not mounted on any node in the cluster. WARNING: If the cluster check is disabled and the volume is mounted on one or more nodes, file system corruption is very likely. If unsure, do not use this option.

−G Usually fsck.ocfs2 will silently assume inodes whose generation number does not match the generation number of the super block are unused inodes. This option causes fsck.ocfs2 to ask the user if these inodes should in fact be marked unused.

−n Give the ’no’ answer to all questions that fsck will ask. This guarantees that the file system will not be modified and the device will be opened read-only. The output of fsck.ocfs2 with this option can be redirected to produce a record of a file system’s faults.

−p Automatically repair ("preen") the file system. This option will cause fsck.ocfs2 to automatically fix any problem that can be safely corrected without human intervention. If there are problems that require intervention, the descriptions will be printed and fsck.ocfs2 will exit with the value 4 logically or’d into the exit code. (See the EXIT CODE section.) This option is normally used by the system’s boot scripts.

−P Show progress.

−r backup-number mkfs.ocfs2 makes up to 6 backup copies of the superblock at offsets 1G, 4G, 16G, 64G, 256G and 1T depending on the size of the volume. Use this option to specify the backup, 1 through 6, to use to recover the superblock.
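The six offsets listed for −r form a geometric series, each four times the previous one. The following is a minimal sketch of how a candidate backup number might be chosen for a given volume size. It assumes that "1G" means 2^30 bytes and that a backup exists only when its offset lies inside the volume. Neither assumption is stated above, and the function name is illustrative only.

```python
GIB = 1024 ** 3

# Offsets quoted in the -r description: 1G, 4G, 16G, 64G, 256G and 1T.
BACKUP_OFFSETS = [GIB * 4 ** i for i in range(6)]


def available_backups(volume_bytes):
    """Return (backup_number, offset) pairs whose offset lies inside the volume.

    Assumption: a backup is present only if its offset is smaller than the
    volume size; the exact rule used by mkfs.ocfs2 is not given here.
    """
    return [
        (number, offset)
        for number, offset in enumerate(BACKUP_OFFSETS, start=1)
        if offset < volume_bytes
    ]
```

Under these assumptions a 20G volume would offer backups 1 through 3, so −r 1, −r 2 or −r 3 would be the values worth trying.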

−t Show I/O statistics. If this option is specified twice, it shows the statistics on a pass by pass basis.

−y Give the ’yes’ answer to all questions that fsck will ask. This will repair all faults that fsck.ocfs2 finds but will not give the operator a chance to intervene if fsck.ocfs2 decides that it wants to drastically repair the file system.

−v This option causes fsck.ocfs2 to produce a very large amount of debugging output.

−V Print version information and exit.

EXIT CODE The exit code returned by fsck.ocfs2 is the sum of the following conditions:

0 − No errors
1 − File system errors corrected
2 − File system errors corrected, system should be rebooted
4 − File system errors left uncorrected
8 − Operational error
16 − Usage or syntax error
32 − fsck.ocfs2 canceled by user request
128 − Shared library error
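Because the exit code is a sum of distinct powers of two, it can be decoded by testing each bit. The sketch below shows one way a wrapper script might do this. The table is copied from the list above, and the function name is hypothetical.

```python
# Conditions from the EXIT CODE section, keyed by their bit value.
EXIT_FLAGS = {
    1: "File system errors corrected",
    2: "File system errors corrected, system should be rebooted",
    4: "File system errors left uncorrected",
    8: "Operational error",
    16: "Usage or syntax error",
    32: "fsck.ocfs2 canceled by user request",
    128: "Shared library error",
}


def decode_exit_code(code):
    """Return the list of conditions encoded in an fsck.ocfs2 exit code."""
    if code == 0:
        return ["No errors"]
    # Test each documented bit; several may be set at once.
    return [text for bit, text in EXIT_FLAGS.items() if code & bit]
```

For example, an exit code of 12 decodes to "File system errors left uncorrected" together with "Operational error".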

SEE ALSO debugfs.ocfs2(8) fsck.ocfs2.checks(8) mkfs.ocfs2(8) mount.ocfs2(8) mounted.ocfs2(8) o2cluster(8) o2image(8) o2info(1) tunefs.ocfs2(8)

AUTHORS Oracle Corporation. This man page entry derives some text, especially the exit code summary, from e2fsck(8) by Theodore Y. Ts’o.

COPYRIGHT Copyright © 2004, 2012 Oracle. All rights reserved.

Version 1.8.2 January 2012 2 fsck.ocfs2.checks(8) OCFS2 Manual Pages fsck.ocfs2.checks(8)

NAME fsck.ocfs2.checks − Consistency checks that fsck.ocfs2(8) performs and its means for fixing inconsistencies.

DESCRIPTION fsck.ocfs2(8) is used to check an OCFS2 file system. It performs many consistency checks and will offer to fix faults that it finds. This man page lists the problems it may find and describes their fixes. The problems are indexed by the error number that fsck.ocfs2(8) emits when it describes the problem and asks if it should be fixed.

The prompts are constructed such that answering ’no’ results in no changes to the file system. This may result in errors later on that stop fsck.ocfs2(8) from proceeding.

CHECKS

EB_BLKNO Extent blocks contain a record of the disk block where they are located. An extent block was found at a block that didn’t match its recorded location.

Answering yes will update the data structure in the extent block to reflect its real location on disk.

EB_GEN Extent blocks are created with a generation number to match the generation number of the volume at the time of creation. An extent block was found which contains a generation number that doesn’t match.

Answering yes implies that the generation number is correct and that the extent block is from a previous file system. The extent block will be ignored and the file that contains it will lose the data it referenced.

EB_GEN_FIX Extent blocks are created with a generation number to match the generation number of the volume at the time of creation. An extent block was found which contains a generation number that doesn’t match.

Answering yes implies that the generation number in the extent block is incorrect and that the extent block is valid. The generation number in the block is updated to match the generation number in the volume.

EXTENT_MARKED_UNWRITTEN An extent record has the UNWRITTEN flag set, but the filesystem feature set does not include unwritten extents.

Answering yes clears the UNWRITTEN flag. This is safe to do, as the feature is disabled anyway.

EXTENT_MARKED_REFCOUNTED An extent record has the REFCOUNTED flag set, but neither the filesystem nor the file has the REFCOUNTED flag set.

Answering yes clears the REFCOUNTED flag.

EXTENT_BLKNO_UNALIGNED The block that marks the start of an extent should always fall on the start of a cluster. An extent was found that starts part-way into a cluster.

Answering yes moves the start of the extent back to the start of the addressed cluster. This may add data to the middle of the file that contains this extent.
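The realignment described for EXTENT_BLKNO_UNALIGNED amounts to rounding a block number down to a cluster boundary. The following is a minimal sketch of that arithmetic, assuming block numbers count from zero and the cluster size is a whole multiple of the block size. The function and parameter names are illustrative and are not taken from fsck.ocfs2.

```python
def align_to_cluster_start(blkno, cluster_size, block_size):
    """Round blkno down to the first block of the cluster that contains it."""
    # Assumption: cluster_size is an exact multiple of block_size.
    blocks_per_cluster = cluster_size // block_size
    return blkno - (blkno % blocks_per_cluster)
```

With 4096-byte blocks and 32768-byte clusters (8 blocks per cluster), block 1027 would be moved back to block 1024, which is why the blocks between the two may show up as extra data in the file.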

EXTENT_CLUSTERS_OVERRUN An extent was found which claims to contain clusters which are beyond the end of the volume.

Answering yes clamps the extent to the end of the volume. This may result in a reduced file size for the file that contains the extent, but it couldn’t have addressed those final clusters anyway. One can imagine this problem arising if there are problems shrinking a volume.

EXTENT_EB_INVALID Deep extent trees are built by forming a tree out of extent blocks. An extent tree references an invalid extent block.

Answering yes stops the tree from referencing the invalid extent block. This may truncate data from the file which contains the tree.

EXTENT_LIST_DEPTH Extent lists contain a record of their depth in the tree. An extent list was found whose recorded depth doesn’t match its position in the tree.

Answering yes updates the depth field in the list to match the tree on disk.

EXTENT_LIST_COUNT The number of entries in an extent list is bounded by either the size of the inode or the size of the block which contains it. An extent list was found which claims to have more entries than would fit in its container.

Answering yes updates the count field in the extent list to match the container. Answering no to this question may stop further fixes from being done because the count value can not be trusted.

EXTENT_LIST_FREE The number of free entries in an extent list must be less than the total number of entries in the list. A list was found which claims to have more free entries than possible entries.

Answering yes sets the number of free entries in the list equal to the total possible entries.

EXTENT_BLKNO_RANGE An extent record was found which references a block which can not be referenced by an extent. The referenced block is either very early in the volume, and thus reserved, or beyond the end of the volume.

Answering yes removes this extent record from the tree. This may remove data from the file which owns the tree but any such data was inaccessible.
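The EXTENT_BLKNO_RANGE condition is a two-sided bounds test. The sketch below expresses it with two hypothetical parameters: first_usable_block stands in for the end of the reserved area at the start of the volume (its real value is not given in this page), and total_blocks for the size of the volume in blocks.

```python
def extent_block_in_range(blkno, first_usable_block, total_blocks):
    """Return True if an extent may legitimately reference blkno.

    Both bounds are assumptions for illustration: blocks below
    first_usable_block are treated as reserved, and blocks at or past
    total_blocks are treated as beyond the end of the volume.
    """
    return first_usable_block <= blkno < total_blocks
```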

CHAIN_CPG The bitmap inode indicates a different clusters per group than the group descriptor. This value is typically static and only modified by tunefs during volume resize and that too only on volumes having only one cluster group.

Answering yes updates the clusters per group on the bitmap inode to the corresponding value in the group descriptor.

SUPERBLOCK_CLUSTERS The super block indicates a different total clusters value than the global bitmap. This is only possible due to a failed volume resize operation.

Answering yes updates the total clusters in the super block to the value specified in the global bitmap.

FIXED_CHAIN_CLUSTERS The global bitmap inode was repaired, resulting in a change to the total cluster count of the filesystem.

Answering yes updates the total clusters in the super block to the value specified in the global bitmap.

GROUP_UNEXPECTED_DESC The group descriptors that make up the global bitmap chain allocator reside at predictable locations on disk. A group descriptor was found in the global bitmap allocator which isn’t at one of these locations and so shouldn’t be in the allocator.

Answering yes removes this descriptor from the global bitmap allocator.

GROUP_EXPECTED_DESC The group descriptors that make up the global bitmap chain allocator reside at predictable locations on disk. A group descriptor at one of these locations was not linked into the global bitmap allocator.

Answering yes will relink this group into the allocator.

GROUP_GEN A group descriptor was found with a generation number that doesn’t match the generation number of the volume.

Answering yes sets the group descriptor’s generation equal to the generation number in the volume.

GROUP_PARENT Group descriptors contain a pointer to the allocator inode which contains the chain they belong to. A group descriptor was found in an allocator inode that doesn’t match the descriptor’s parent pointer.

Answering yes updates the group descriptor’s parent pointer to match the inode it resides in.

GROUP_DUPLICATE Group descriptors contain a pointer to the allocator inode which contains the chain they belong to. A group descriptor was found in two allocator inodes so it may be duplicated.

Answering yes removes the group descriptor from the current allocator inode.

GROUP_BLKNO Group descriptors have a field which records their block location on disk. A group descriptor was found at a given location but is recorded as being located somewhere else.

Answering yes updates the group descriptor’s recorded location to match where it actually is found on disk.

GROUP_CHAIN Group descriptors are found in a number of different singly-linked chains in an allocator inode. A group descriptor records the chain number that it is linked in. A group descriptor was found whose chain field doesn’t match the chain it was found in.

Answering yes sets the group descriptor’s chain field to match the chain it is found in.

GROUP_FREE_BITS A group descriptor records the number of bits in its bitmap that are free. A group descriptor was found which claims to have more free bits than are valid in its bitmap.

Answering yes decreases the number of recorded free bits so that it equals the total number of bits in the group descriptor’s bitmap.

CHAIN_COUNT The chain list embedded in an inode is limited by the block size and the number of bytes consumed by the rest of the inode. A chain list header was found which claimed that there are more entries in the list than could fit in the inode.

Answering yes resets the header’s cl_count member to the maximum size allowed by the block size after accounting for the space consumed by the inode.

CHAIN_NEXT_FREE This is identical to CHAIN_COUNT except that it is testing and fixing the pointer to the next free list entry recorded in the cl_next_free_rec member instead of the total number of entries.

CHAIN_EMPTY Chain entries need to be packed such that there are no chains without descriptors found before the chain that is marked as free by the chain header. A chain without descriptors was found before the chain that was marked free.

Answering yes will remove the unused chain and shift the remaining chains forward in the list.

CHAIN_I_CLUSTERS Chain allocator inodes have an i_clusters value that represents the number of clusters used by the allocator. An inode was found whose i_clusters value doesn’t match the number of clusters its chains cover.

Answering yes updates i_clusters in the inode to reflect what was actually found by walking the chain.

CHAIN_I_SIZE Chain allocator inodes multiply the number of bytes per cluster by their i_clusters value and store it in i_size. An inode was found which didn’t have the correct value in its i_size.

Answering yes updates i_size to be the product of i_clusters and the cluster size. Nothing else uses this value, and previous versions of tools didn’t calculate it properly, so don’t be too worried if this error appears.

CHAIN_GROUP_BITS The inode that contains an embedded chain list has fields which record the total number of bits covered by the chain as well as the amount free. These fields didn’t match what was found in the chain.

Answering yes updates the fields in the inode to reflect what was actually found by walking the chain.

CHAIN_HEAD_LINK_RANGE The header that starts a chain tried to reference a group descriptor at a block number that couldn’t be valid.

Answering yes will clear the reference to this invalid block and truncate the chain that it started.

CHAIN_LINK_GEN A reference was made to a group descriptor whose generation number doesn’t match the generation of the volume.

Answering yes to this question implies that the group descriptor is invalid and the chain is truncated at the point that it referred to this invalid group descriptor. Answering no to this question considers the group descriptor as valid and its generation may be fixed.

CHAIN_LINK_MAGIC Chains are built by chain headers and group descriptors which are linked together by block references. A reference was made to a group descriptor at a given block but a valid group descriptor signature wasn’t found at that block.

Answering yes clears the reference to this invalid block and truncates the chain at the point of the reference.

CHAIN_LINK_RANGE Chains are built by chain headers and group descriptors which are linked together by block references. A reference to a block was found which can’t possibly be valid because it was either too small or extended beyond the volume.

Answering yes truncates the chain in question by zeroing the invalid block reference. This shortens the chain in question and could result in more fixes later if the part of the chain that couldn’t be referenced was valid at some point.

CHAIN_BITS A chain’s header contains members which record the total number of bits in the chain as well as the number of bits that are free. After walking through a chain it was found that the number of bits recorded in its header doesn’t match what was found by totalling up the group descriptors.

Answering yes updates the c_total and c_free members of the header to reflect what was found in the group descriptors in the chain.
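The CHAIN_BITS repair is a recomputation of two sums over the group descriptors linked into a chain. The following sketch shows that totalling step with a hypothetical GroupDesc record. The field names total_bits and free_bits are placeholders, not the on-disk member names.

```python
from dataclasses import dataclass


@dataclass
class GroupDesc:
    # Hypothetical stand-in for a group descriptor's bit counters.
    total_bits: int
    free_bits: int


def chain_totals(groups):
    """Return the (c_total, c_free) values implied by walking a chain."""
    c_total = sum(g.total_bits for g in groups)
    c_free = sum(g.free_bits for g in groups)
    return c_total, c_free


def chain_header_matches(c_total, c_free, groups):
    """True if the header counters agree with the descriptors in the chain."""
    return (c_total, c_free) == chain_totals(groups)
```

A header recording (150, 80) over descriptors that total (150, 90) would fail this comparison, and the repair would replace the header values with the computed pair.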

DISCONTIG_BG_DEPTH A discontiguous block group has an extent list which records all the clusters allocated to it. Discontiguous block groups only support extent lists with a tree depth of 0. A block group claims to have a tree depth greater than 0.

Answering yes will set the tree depth of the extent list to 0.

DISCONTIG_BG_COUNT A discontiguous block group has an extent list which records all the clusters allocated to it. A block group claims to have more records than can actually fit.

Answering yes will set the record count to the maximum possible.

DISCONTIG_BG_REC_RANGE Block groups set aside clusters to be used for metadata. A discontiguous block group claims to contain clusters beyond the end of the volume.

Answering yes will remove the block group.

DISCONTIG_BG_CORRUPT_LEAVES A discontiguous block group has an extent list which records all the clusters allocated to it. A group has more than one extent claiming to have an impossible number of clusters.

Answering yes will remove the block group.

DISCONTIG_BG_CLUSTERS Extent records in a discontiguous block group were found having more clusters allocated than a block group can have.

Answering yes will remove the block group.

DISCONTIG_BG_LESS_CLUSTERS Extent records in a discontiguous block group were found having fewer clusters allocated than a block group can have.

Answering yes will remove the block group.

DISCONTIG_BG_NEXT_FREE_REC A discontiguous block group has an extent list which records all the clusters allocated to it. A group was found with fewer filled in extents than it claims to have. The filled in extents describe a complete and correct group.

Answering yes will set the used extent count to the number of filled extents.

DISCONTIG_BG_LIST_CORRUPT A discontiguous block group has an extent list which records all the clusters allocated to it. The group claims to have more extents than is possible, and the existing extents contain errors.

Answering yes will remove the block group.

DISCONTIG_BG_REC_CORRUPT A discontiguous block group has an extent list which records all the clusters allocated to it. A group was found with one extent claiming too many clusters but the sum of the remaining extents is equal to the total clusters a group must have.

Answering yes will remove the block group.

DISCONTIG_BG_LEAF_CLUSTERS A discontiguous block group has an extent list which records all the clusters allocated to it. A group was found with one extent claiming too many clusters, but the remaining extents are correct.

Answering yes will set the number of the clusters on the broken extent to the difference between the total clusters a group must have and the sum of the remaining extents.
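The DISCONTIG_BG_LEAF_CLUSTERS repair is a subtraction: the broken extent receives whatever is left once the correct extents are accounted for. The sketch below uses illustrative names throughout, and group_clusters stands for the total number of clusters a group must have.

```python
def repaired_leaf_clusters(extent_clusters, broken_index, group_clusters):
    """Return the cluster count to store in the one broken extent.

    extent_clusters lists the cluster count claimed by each extent;
    the entry at broken_index is the one claiming too many.
    """
    others = sum(
        count for i, count in enumerate(extent_clusters) if i != broken_index
    )
    return group_clusters - others
```

For a group that must hold 32 clusters with extents claiming 4, 900 and 8, the middle extent would be reset to 32 − (4 + 8) = 20.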

INODE_ALLOC_REPAIR The inode allocator did not accurately reflect the set of inodes that are free and in use in the volume.

Answering yes will update the inode allocator bitmaps. Each bit that doesn’t match the state of its inode will be inverted.

INODE_SUBALLOC Each inode records the node whose allocator is responsible for the inode. An inode was found in a given node’s allocator but the inode itself claimed to belong to a different node.

Answering yes will correct the inode to point to the node’s allocator that it belongs to.

LALLOC_SIZE Each node has a local allocator contained in a block that is used to allocate clusters in batches. A node’s local allocator claims to reflect more bytes than are possible for the volume’s block size.

Answering yes decreases the local allocator’s size to reflect the volume’s block size.

LALLOC_NZ_USED A given node’s local allocator isn’t in use but it claims to have bits in use in its bitmap.

Answering yes zeros this used field.

LALLOC_NZ_BM A given node’s local allocator isn’t in use but it has a field which records the bitmap as starting at a non-zero cluster offset.

Answering yes zeros the bm_off field.

LALLOC_BM_OVERRUN Each local allocator contains a reference to the first cluster that its bitmap addresses. A given local allocator was found which references a starting cluster that is beyond the end of the volume.

Answering yes resets the given local allocator. No allocated data will be lost.

LALLOC_BM_SIZE The given local allocator claims to cover more bits than are possible for the size in bytes of its bitmap.

Answering yes decreases the number of bits the allocator covers to reflect the size in bytes of the bitmap and resets the allocator. No allocated data will be lost.

LALLOC_BM_STRADDLE The given local allocator claims to cover a region of clusters which extends beyond the end of the volume.

Answering yes resets the given local allocator. No allocated data will be lost.

LALLOC_USED_OVERRUN The given local allocator claims to have more bits in use than it has total bits in its bitmap.

Answering yes decreases the number of bits used so that it equals the total number of available bits.

LALLOC_CLEAR A local allocator inode was found to have problems. This gives the operator a chance to just reset the local allocator inode.

Answering yes clears the local allocator. No information is lost but the global bitmap allocator may need to be updated to reflect clusters that were reserved for the local allocator but were free.

DEALLOC_COUNT The given truncate log inode contains a count that is greater than the value that is possible given the size of the inode.

Answering yes resets the count value to the possible maximum.

DEALLOC_USED The given truncate log inode claims to have more records in use than it is possible to store in the inode.

Answering yes resets the record of the number used to the maximum value possible.

TRUNCATE_REC_START_RANGE A truncate record was found which claims to start at a cluster that is beyond the number of clusters in the volume.

Answering yes will clear the truncate record. This may result in previously freed space being marked as allocated. This will be fixed up later as the allocator is updated to match what is used by the file system.

TRUNCATE_REC_WRAP Clusters are recorded as 32-bit values. A truncate record was found which claims to have enough clusters to cause this value to wrap. This could never be the case and is a sure sign of corruption.

Answering yes will clear the truncate record. This may result in previously freed space being marked as allocated. This will be fixed up later as the allocator is updated to match what is used by the file system.

TRUNCATE_REC_RANGE A truncate record was found which claims to reference a region of clusters which partially extends beyond the number of clusters in the volume.

Answering yes will clear the truncate record. This may result in previously freed space being marked as allocated. This will be fixed up later as the allocator is updated to match what is used by the file system.
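The three truncate record checks above differ only in which bound the record violates. The following is a minimal sketch of how they could be told apart, assuming a record is described by a starting cluster and a cluster count. The order of the tests and the exact boundary conditions are assumptions, not taken from fsck.ocfs2.

```python
U32_MAX = 2 ** 32 - 1  # clusters are recorded as 32-bit values


def classify_truncate_rec(start, count, volume_clusters):
    """Return the name of the check a truncate record trips, or None."""
    if start >= volume_clusters:
        # The record starts beyond the clusters in the volume.
        return "TRUNCATE_REC_START_RANGE"
    if start + count > U32_MAX:
        # start + count would not fit in a 32-bit cluster number.
        return "TRUNCATE_REC_WRAP"
    if start + count > volume_clusters:
        # The record starts inside the volume but runs past its end.
        return "TRUNCATE_REC_RANGE"
    return None
```

In every non-None case the described repair is the same: the record is cleared and the allocator is reconciled later.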

INODE_GEN Inodes are created with a generation number to match the generation number of the volume at the time of creation. An inode was found which contains a generation number that doesn’t match.

Answering yes implies that the generation number is correct and that the inode is from a previous file system. The inode will be recorded as free.

INODE_GEN_FIX Inodes are created with a generation number to match the generation number of the volume at the time of creation. An inode was found which contains a generation number that doesn’t match.

Answering yes implies that the generation number in the inode is incorrect and that the inode is valid. The generation number in the inode is updated to match the generation number in the volume.

INODE_BLKNO Inodes contain a field that must match the block that they reside in. An inode was found at a block that doesn’t match the field in the inode.

Answering yes updates the field to match the inode’s position on disk.

ROOT_NOTDIR The super block contains a reference to the inode that contains the root directory. This block was found to contain an inode that isn’t a directory.

Answering yes clears this inode. The operator will be asked to recreate the root directory at a point in the near future.

INODE_NZ_DTIME Inodes contain a field describing the time at which they were deleted. This can not be set for an inode that is still in use. An inode was found which is in use but which contains a non-zero dtime.

Answering yes implies that the inode is still valid and resets its dtime to zero.

LINK_FAST_DATA The target name for a symbolic link is stored either as file contents for that inode or in the inode structure itself on disk. Only small destination names are stored in the inode structure. The i_blocks field of the inode indicates that the name is stored in the inode when it is zero. An inode was found that has both i_blocks set to zero and file contents.

Answering yes clears the inode and so deletes the link.

LINK_NULLTERM The targets of links on disk must be null terminated. A link was found whose target wasn’t null terminated.

Answering yes clears the inode and so deletes the link.

LINK_SIZE The size of a link on disk must match the length of its target string. A link was found whose size does not.

Answering yes updates the link’s size to reflect the length of its target string.
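The LINK_NULLTERM and LINK_SIZE checks can be pictured as two tests on the raw bytes that hold a link target. The sketch below assumes the target area is available as a bytes object and that the recorded size excludes the terminating null byte. Both are assumptions made for illustration, as is the function name.

```python
def check_link_target(raw, recorded_size):
    """Return the names of the link checks that a stored target trips.

    raw is the on-disk area holding the target string; recorded_size is
    the size stored for the link.
    """
    end = raw.find(b"\0")
    if end == -1:
        # No terminator anywhere in the target area.
        return ["LINK_NULLTERM"]
    if end != recorded_size:
        # The string is terminated but its length disagrees with the size.
        return ["LINK_SIZE"]
    return []
```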

LINK_BLOCKS Links can not be sparse. There must be exactly as many blocks allocated as are needed to cover its size. A link was found which doesn’t have enough blocks allocated to cover its size.

Answering yes clears the link’s inode thus deleting the link.

DIR_ZERO Directories must at least contain a block that has the "." and ".." entries. A directory was found which doesn’t contain any blocks.

Answering yes to this question clears the directory’s inode thus deleting the directory.

INODE_SIZE Certain inodes record the size of the data they reference in an i_size field. This can be the number of bytes in a file, directory, or symlink target which are stored in data mapped by extents of clusters. This error occurs when the extent lists are walked and the amount of data found does not match what is stored in i_size.

Answering yes to this question updates the inode’s i_size to match the amount of data referenced by the extent lists. It is vitally important that i_size matches the extent lists and so answering yes is strongly encouraged.

INODE_SPARSE_SIZE Certain inodes record the size of the data they reference in an i_size field. This can be the number of bytes in a file, directory, or symlink target which are stored in data mapped by extents of clusters. This error occurs when a sparse inode was found that had data allocated past its i_size.

Answering yes to this question will update the inode’s i_size to cover all of its allocated storage. It is vitally important that i_size matches the extent lists and so answering yes is strongly encouraged.

INODE_INLINE_SIZE Inodes can only fit a certain amount of inline data. This inode has its data inline but claims an i_size larger than will actually fit.

Answering yes to this question updates the inode’s i_size to the maximum available inline space.

INODE_CLUSTERS Inodes contain a record of how many clusters are allocated to them. An inode was found whose recorded number of clusters doesn’t match the number of blocks that were found associated with the inode.

Answering yes resets the inode’s number of clusters to reflect the number of blocks that were associated with the file.

INODE_SPARSE_CLUSTERS Inodes contain a record of how many clusters are allocated to them. A sparse inode was found whose recorded number of clusters doesn’t match the number of blocks that were found associated with the inode.

Answering yes resets the inode’s number of clusters to reflect the number of blocks that were associated with the file.

INODE_INLINE_CLUSTERS An inode with inline data should not have allocated clusters. An inode with the inline data flag set was found with clusters allocated.

Answering yes resets the inode’s number of clusters to zero.

LALLOC_REPAIR An active local allocator did not accurately reflect the set of clusters that are free and in use in its region.

Answering yes will update the local allocator bitmap. Each bit that doesn’t match the use of its cluster will be inverted.

LALLOC_USED A local allocator records the number of bits that are used in its bitmap. An allocator was found whose used value doesn’t reflect the number of bits that are set in its bitmap.

Answering yes sets the used value to match the number of bits set in the allocator’s bitmap.
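The LALLOC_USED comparison is a population count over the allocator bitmap. The following sketch shows the idea, assuming the bitmap is available as a bytes object. The function names are illustrative only.

```python
def count_set_bits(bitmap):
    """Count the bits that are set across every byte of the bitmap."""
    return sum(bin(byte).count("1") for byte in bitmap)


def corrected_used(bitmap, recorded_used):
    """Return the used value to store, or None if the record is correct."""
    actual = count_set_bits(bitmap)
    return None if actual == recorded_used else actual
```

A bitmap holding the single byte 0x0F has four bits set, so a recorded used value of 7 would be replaced with 4.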

CLUSTER_ALLOC_BIT A specific cluster’s use didn’t match the setting of its bit in the cluster allocator.

Answering yes will invert the bit in the allocator to match the use of the cluster -- either allocated and in use or free.

REFCOUNT_FLAG_INVALID A refcounted file can only exist in a volume that supports refcount. fsck has found a file in a non-refcount volume that has the refcount flag set.

Answering yes removes this flag from the file.

REFCOUNT_LOC_INVALID Refcount loc can only be valid if the file has the refcount flag set. fsck has found a file that has refcount loc set while it doesn’t have the refcount flag set.

Answering yes resets refcount loc to zero for the file.

RB_BLKNO Refcount blocks contain a record of the disk block where they are located. A refcount block was found at a block that didn’t match its recorded location.

Answering yes will update the data structure in the refcount block to reflect its real location on disk.

RB_GEN Refcount blocks are created with a generation number to match the generation number of the volume at the time of creation. A refcount block was found which contains a generation number that doesn’t match.

Answering yes implies that the generation number is correct and that the refcount block is from a previous file system. The refcount block will be removed and the file that uses it will lose the refcounted information, but it may be regenerated later.

RB_GEN_FIX Refcount blocks are created with a generation number to match the generation number of the volume at the time of creation. A refcount block was found which contains a generation number that doesn’t match.

Answering yes implies that the generation number in the refcount block is incorrect and that the refcount block is valid. The generation number in the block is updated to match the generation number in the volume.

RB_PARENT Refcount blocks contain a record of the parent this disk block belongs to. A refcount block was found storing a wrong parent location.

Answering yes will update the data structure in the refcount block to reflect its parent’s real location on disk.

REFCOUNT_LIST_COUNT The number of entries in a refcount list is bounded by the size of the block which contains it. A refcount list was found which claims to have more entries than would fit in its container.

Answering yes updates the count field in the refcount list to match the container. Answering no to this question may stop further fixes from being done because the count value can not be trusted.

REFCOUNT_LIST_USED The number of free entries in a refcount list must be less than the total number of entries in the list. A list was found which claims to have more free entries than possible entries.

Answering yes sets the number of free entries in the list equal to the total possible entries.

REFCOUNT_CLUSTER_RANGE A refcount record was found which references a cluster which can not be referenced by a refcount. The referenced cluster is either very early in the volume, and thus reserved, or beyond the end of the volume.

Answering yes removes this refcount record from the tree.

REFCOUNT_CLUSTER_COLLISION A refcount record was found which references a cluster which has a collision with the previous valid refcount record.

Answering yes removes this refcount record from the tree.

REFCOUNT_LIST_EMPTY A refcount list was found which has no refcount record in it. It is normally caused by a corrupted refcount record.

Answering yes removes this refcount block from the tree. It will be regenerated by the refcounted extent records handler if all the other information is sane.

REFCOUNT_BLOCK_INVALID A refcount block stores the refcount records for the physical clusters of a file. A reference to an invalid refcount block was found.

Answering yes removes this refcount block.

REFCOUNT_CLUSTERS A refcount tree contains a record of how many clusters are allocated to it. A tree was found whose recorded number of clusters doesn’t match the number of blocks that were found associated with it.

Answering yes resets the number of clusters to reflect the real number of clusters that were associated with the tree.

REFCOUNT_ROOT_BLOCK_INVALID The root refcount block is the root of the refcount records for a file. A reference to an invalid root refcount block was found.

Answering yes removes this refcount block and clears the refcount flag from this file.

REFCOUNT_REC_REDUNDANT A refcount record is used to store the refcount for physical clusters. A refcount record was found that has no physical clusters corresponding to it.

Answering yes removes the refcount record.

REFCOUNT_COUNT_INVALID A refcount record is used to store the refcount for physical clusters. A refcount record was found which claims the wrong refcount for some physical clusters.

Answering yes updates the corresponding refcount record.

REFCOUNT_COUNT Refcount tree contains a record of howmanyfiles refering to this tree. Atree was found whose recorded number of files doesn’tmatch the real files refering to the tree.

Answering yes resets the number of files to reflect the real number of files that were associated with the tree.

DUP_CLUSTERS_SYSFILE_CLONE Asystem file inode claims clusters that are also claimed by another inode. ocfs2 does not allowthis. Sys- tem files may be cloned but may not be deleted. Allocation system files may not be cloned or deleted.

Answering yes will copythe data of this inode to newly allocated extents. This will break the claim on the overcommitted clusters.

DUP_CLUSTERS_CLONE An inode claims clusters that are also claimed by another inode. ocfs2 does not allow this.

Answering yes will copy the data of this inode to newly allocated extents. This will break the claim on the overcommitted clusters.

DUP_CLUSTERS_DELETE An inode claims clusters that are also claimed by another inode. ocfs2 does not allow this.

Answering yes will remove this inode, thus breaking its claim on the overcommitted clusters.

DUP_CLUSTERS_ADD_REFCOUNT An inode claims clusters that are also claimed by another inode. ocfs2 does not allow this.

Answering yes will try to add a refcount record for all these inodes, so that they will share the clusters.

DIRENT_DOTTY_DUP There can be only one instance of each of the "." and ".." entries in a directory. A directory entry was found which duplicated one of these entries.

Answering yes will remove the duplicate directory entry.

DIRENT_NOT_DOTTY The first and second directory entries in a directory must be "." and ".." respectively. One of these directory entries was found to not match these rules.

Answering yes will force the directory entry to be either "." or "..". This might consume otherwise valid entries and cause some files to appear in lost+found.

DIRENT_DOT_INODE The inode field of the "." directory entry must refer to the directory inode that contains the given directory block. A "." entry was found which doesn’t do so.

Answering yes sets the directory entry’s inode reference to the parent directory that contains the entry.

DIRENT_DOT_EXCESS A "." directory entry was found whose length exceeds the amount required for the single dot in the name.

Answering yes creates another empty directory entry in this excess space.

DIRENT_ZERO A directory entry was found with a zero-length name.

Answering yes clears the directory entry so its space can be reused.

DIRENT_NAME_CHARS Directory entry names cannot contain either the NUL character (ASCII 0) or the forward slash (ASCII 47). A directory entry was found which contains either.

Answering yes will change each instance of these forbidden characters into a period (ASCII 46).
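The repair described for DIRENT_NAME_CHARS amounts to a byte-wise substitution. The following is a minimal sketch, assuming the entry name is available as a raw byte string. The function name sanitize_dirent_name is hypothetical and is not part of fsck.ocfs2.

```python
# Illustrative sketch only; this is not fsck.ocfs2 code.
FORBIDDEN = {0, 47}   # NUL (ASCII 0) and forward slash (ASCII 47)
REPLACEMENT = 46      # period (ASCII 46)


def sanitize_dirent_name(name: bytes) -> bytes:
    """Replace each forbidden byte in a directory entry name with a period."""
    return bytes(REPLACEMENT if b in FORBIDDEN else b for b in name)
```

Names that contain neither forbidden byte pass through unchanged.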

DIRENT_INODE_RANGE Each directory entry contains an inode field which the entry’s name corresponds to. An entry was found which referenced an inode number that is invalid for the current volume.

Answering yes clears this entry so its space can be reused. If the entry once corresponded to a real inode and was corrupted, this inode may appear in lost+found.

DIRENT_INODE_FREE Each directory entry contains an inode field which the entry’s name corresponds to. An entry was found which referenced an inode number that isn’t in use.

Answering yes clears this directory entry.

DIRENT_TYPE Each directory entry contains a field which describes the type of file that the entry refers to. An entry was found whose type doesn’t match the inode it is referring to.

Answering yes resets the entry’s type to match the target inode.

DIR_PARENT_DUP Each directory can only be pointed to by one directory entry in a parent directory. A directory entry was found which was the second entry to point to a given directory inode.

Answering yes clears this entry which was the second to refer to a given directory. This reflects the policy that hard links to directories are not allowed.

DIRENT_DUPLICATE File names within a directory must be unique. A file name occurred in more than one directory entry in a given directory.

Answering yes renames the duplicate entry to a name that doesn’t collide with recent entries and is unlikely to collide with future entries in the directory.

DIRENT_LENGTH There are very few directory entry lengths that are valid. The lengths must be greater than the minimum required to record a single character directory, be rounded to 12 bytes, be within the amount of space remaining in a directory block, and be properly rounded for the size of the name of the directory entry. An entry was found which didn’t meet these criteria.

Answering yes will try to repair the directory entry. This runs a very good chance of invalidating all the entries in the directory block. Orphaned inodes may appear in lost+found.

DIR_TRAILER_INODE A directory block trailer is a fake directory entry at the end of the block. The trailer has compatibility fields for when it is viewed as a directory entry. The inode field must be zero.

Answering yes will set the inode field to zero.

DIR_TRAILER_NAME_LEN A directory block trailer is a fake directory entry at the end of the block. The trailer has compatibility fields for when it is viewed as a directory entry. The name length field must be zero.

Answering yes will set the name length field to zero.

DIR_TRAILER_REC_LEN A directory block trailer is a fake directory entry at the end of the block. The trailer has compatibility fields for when it is viewed as a directory entry. The record length field must be equal to the size of the trailer.

Answering yes will set the record length field to the size of the trailer.

DIR_TRAILER_BLKNO A directory block trailer is a fake directory entry at the end of the block. The self-referential block number is incorrect.

Answering yes will set the block number to the correct block on disk.

DIR_TRAILER_PARENT_INODE A directory block trailer is a fake directory entry at the end of the block. It has a pointer to the directory inode it belongs to. This pointer is incorrect.

Answering yes will set the parent inode pointer to the inode referencing this directory block.
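The five DIR_TRAILER_* checks above each test one field of the trailer against a known value. The following is a rough sketch of how they fit together. The DirTrailer field names and the check_trailer helper are hypothetical, and the trailer size is passed in as a parameter because this page does not state it.

```python
# Illustrative sketch only; field names are assumptions, not the on-disk layout.
from dataclasses import dataclass


@dataclass
class DirTrailer:
    inode: int         # compatibility field, must be zero
    name_len: int      # compatibility field, must be zero
    rec_len: int       # must equal the size of the trailer
    blkno: int         # self-referential block number
    parent_inode: int  # directory inode that owns this block


def check_trailer(t, trailer_size, actual_blkno, dir_inode):
    """Return the names of the DIR_TRAILER_* checks that this trailer fails."""
    problems = []
    if t.inode != 0:
        problems.append("DIR_TRAILER_INODE")
    if t.name_len != 0:
        problems.append("DIR_TRAILER_NAME_LEN")
    if t.rec_len != trailer_size:
        problems.append("DIR_TRAILER_REC_LEN")
    if t.blkno != actual_blkno:
        problems.append("DIR_TRAILER_BLKNO")
    if t.parent_inode != dir_inode:
        problems.append("DIR_TRAILER_PARENT_INODE")
    return problems
```

In each case the repair is to overwrite the offending field with the expected value.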

ROOT_DIR_MISSING The super block contains a reference to the inode that serves as the root directory. This reference points to an inode that isn’t in use.

Answering yes will create a new inode and update the super block to refer to this inode as the root directory.

LOSTFOUND_MISSING The super block contains a reference to the inode that serves as the lost+found directory. This reference points to an inode that isn’t in use.

Answering yes will create a new lost+found directory in the root directory.

DIR_NOT_CONNECTED Every directory in the file system should be reachable by a directory entry in its parent directory. This is verified by walking every directory in the system. A directory inode was found during this walk which doesn’t have a parent directory entry.

Answering yes moves this directory entry into the lost+found directory and gives it a name based on its inode number.

DIR_DOTDOT A directory inode’s ".." directory entry must refer to the parent directory. A directory was found whose ".." doesn’t refer to its parent.

Answering yes will read the directory block for the given directory and update its ".." entry to reflect its parent.

INODE_NOT_CONNECTED Almost all inodes in the system should be referenced by a directory entry. An inode was found which isn’t referred to by any directory entry.

Answering yes moves this inode into the lost+found directory and gives it a name based on its inode number.

INODE_COUNT Each inode records the number of directory entries that refer to it. An inode was found whose recorded count doesn’t match the number of entries that refer to it.

Answering yes sets the inode’s count to match the number of referring directory entries.
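The INODE_COUNT comparison can be pictured as tallying references during the directory walk and comparing the tally against each inode's recorded count. The following is a simplified sketch. The function name and the input shapes are assumptions made for illustration, and the special handling of "." and ".." entries is ignored.

```python
# Illustrative sketch only; this is not fsck.ocfs2 code.
from collections import Counter


def find_link_count_mismatches(recorded, dirents):
    """Compare recorded link counts with the references actually seen.

    recorded: dict mapping inode number -> recorded link count
    dirents:  iterable of (name, inode number) pairs from the directory walk
    Returns a dict mapping inode number -> (recorded, actual) per mismatch.
    """
    actual = Counter(ino for _name, ino in dirents)
    return {
        ino: (count, actual[ino])
        for ino, count in recorded.items()
        if count != actual[ino]
    }
```

An inode whose actual count comes out as zero in this sketch corresponds to the INODE_NOT_CONNECTED case rather than INODE_COUNT.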

INODE_ORPHANED While files are being deleted they are placed in an internal directory. If the machine crashes while this is taking place the files will be left in this directory. Fsck has found an inode in this directory and would like to finish the job of truncating and removing it.

Answering yes removes the file data associated with the inode and frees the inode.

RECOVER_BACKUP_SUPERBLOCK When fsck.ocfs2 successfully uses the specified backup superblock, it provides the user with this option to overwrite the existing superblock with that backup.

Answering yes will refresh the superblock from the backup. Answering no will only disable the copying of the backup superblock and will not affect the remaining fsck.ocfs2 processing.

ORPHAN_DIR_MISSING While files are being deleted they are placed in an internal directory, named the orphan directory. If an orphan directory does not exist, an OCFS2 volume cannot be mounted successfully. Fsck has found the orphan directory is missing and would like to create it for future use.

Answering yes creates the orphan directory in the system directory.

JOURNAL_FILE_INVALID OCFS2 uses JBD for journaling and some journal files exist in the system directory. Fsck has found some journal files that are invalid.

Answering yes to this question will regenerate the invalid journal files.

JOURNAL_UNKNOWN_FEATURE Fsck has found some journal files with unknown features. Other journals on the filesystem have only known features, so this is likely a corruption. If you think your filesystem may be newer than this version of fsck.ocfs2, say N here and grab the latest version of fsck.ocfs2.

Answering yes resets the journal features to match other journals.

JOURNAL_MISSING_FEATURE Fsck has found that some journal files have features that are not set on all journal files. All journals on the filesystem should have the same set of features.

Answering yes will set all journals to the union of set features.
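Setting every journal to the union of set features can be sketched as a bitwise OR across all journals. The sketch below assumes each journal's features are represented as an integer bitmask. The function names are hypothetical and the bit values in any example are arbitrary.

```python
# Illustrative sketch only; assumes features are stored as integer bitmasks.
from functools import reduce
from operator import or_


def union_of_features(journal_features):
    """Return the bitwise OR of every journal's feature mask."""
    return reduce(or_, journal_features, 0)


def journals_missing_features(journal_features):
    """Return the indexes of journals that lack a bit set on some other journal."""
    target = union_of_features(journal_features)
    return [i for i, mask in enumerate(journal_features) if mask != target]
```

Every journal reported by journals_missing_features would be rewritten with the mask returned by union_of_features.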

JOURNAL_TOO_SMALL Fsck has found some journal files are too small.

Answering yes extends these journals.

RECOVER_CLUSTER_INFO The currently active cluster stack is different from the one the filesystem is configured for. Thus, fsck.ocfs2 cannot determine whether the filesystem is mounted on another node or not. The recommended solution is to exit and run fsck.ocfs2 on this device from a node that has the appropriate active cluster stack. However, you can proceed with the fsck if you are sure that the volume is not in use on any node.

Answering yes reconfigures the filesystem to use the current cluster stack. DANGER: YOU MUST BE ABSOLUTELY SURE THAT NO OTHER NODE IS USING THIS FILESYSTEM BEFORE CONTINUING. OTHERWISE, YOU CAN CORRUPT THE FILESYSTEM AND LOSE DATA.

INLINE_DATA_FLAG_INVALID An inline file can only exist in a volume with inline data support. Fsck has found that a file in a non-inline volume has the inline flag set.

Answering yes removes this flag from the file.

INLINE_DATA_COUNT_INVALID For an inline file, there is a limit for id2.id_data.id_count. Fsck has found that this value isn’t right.

Answering yes changes this value to the right number.

XATTR_BLOCK_INVALID Extended attributes are stored off an extended attribute block referenced by the inode. This inode references an invalid extended attribute block.

Answering yes will remove this block.

XATTR_COUNT_INVALID The count of extended attributes in an inode, block, or bucket does not match the number of entries found by fsck.

Answering yes will change this to the correct count.

XATTR_ENTRY_INVALID An extended attribute entry points to already used space.

Answering yes will remove this entry.

XATTR_NAME_OFFSET_INVALID The name_offset field of an extended attribute entry is not correct. Without a correct name_offset field, the entry cannot be used.

Answering yes will remove this entry.

XATTR_VALUE_INVALID The value region of an extended attribute points to already used space.

Answering yes will remove this entry.

XATTR_LOCATION_INVALID The xe_local field and xe_value_size field of an extended attribute entry do not match, so the entry cannot be used.

Answering yes will remove this entry.

XATTR_HASH_INVALID Extended attributes use a hash of their name for lookup purposes. The name_hash of this extended attribute entry is not correct.

Answering yes will change this to the correct hash.

XATTR_FREE_START_INVALID Extended attributes use free_start to indicate the offset of the free space in the inode, block, or bucket. The free_start field of this object is not correct.

Answering yes will change this to the correct offset.

XATTR_VALUE_LEN_INVALID Extended attributes use name_value_len to store the total length of all entries’ names and values in the inode, block, or bucket. The name_value_len field of this object is not correct.

Answering yes will change this to the correct value.

XATTR_BUCKET_COUNT_INVALID The count of extended attribute buckets pointed to by one extent record does not match the number of buckets found by fsck.

Answering yes will change this to the correct count.

QMAGIC_INVALID The magic number in the header of the quota file does not match the proper number.

Answering yes will make fsck use values in the quota file header anyway.

QTREE_BLK_INVALID A block with references to other blocks with quota data is corrupted.

Answering yes will make fsck use references in the block.

DQBLK_INVALID The structure with quota limits was found in a corrupted block.

Answering yes will use the values of limits for the user / group.

DUP_DQBLK_INVALID The structure with quota limits was found in a corrupted block and fsck has already found quota limits for this user / group.

Answering yes will use new values of limits for the user / group.

DUP_DQBLK_VALID The structure with quota limits was found in a correct block but fsck has already found quota limits for this user / group.

Answering yes will use new values of limits for the user / group.

IV_DX_TREE A directory index was found on an inode but that feature is not enabled on the file system.

Answering yes will truncate the invalid index.

DX_LOOKUP_FAILED A directory entry is missing an entry in the directory index. The missing index entry will cause lookups on this name to fail.

Answering yes will rebuild the directory index, restoring the missing entry.

NO_HOLES A metadata structure encountered a hole where it should not. Examples of such structures are directories, refcount trees, and dx_trees.

Answering yes will remove the hole by updating the offset to the expected value.

EXTENT_OVERLAP The extents of the file overlap, which means there could be two or more possible sets of data for a particular offset in the file.

Answering yes will serialize the extents.
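Detecting the EXTENT_OVERLAP condition can be sketched as a single pass over the extents sorted by logical offset. The sketch below covers detection only and does not attempt the repair, since this page does not spell out how the extents are serialized. The function name and the (logical_start, length) tuple shape are assumptions made for illustration.

```python
# Illustrative sketch only; this is not fsck.ocfs2 code.
def find_overlapping_extents(extents):
    """Return the extents that begin inside an earlier extent's logical range.

    extents: iterable of (logical_start, length) pairs with logical_start >= 0.
    """
    overlapping = []
    furthest_end = 0  # highest logical offset covered so far (exclusive)
    for start, length in sorted(extents):
        if start < furthest_end:
            overlapping.append((start, length))
        furthest_end = max(furthest_end, start + length)
    return overlapping
```

Tracking the furthest end seen, rather than only the previous extent's end, also catches a short extent that lies entirely inside a much longer earlier one.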

SEE ALSO debugfs.ocfs2(8) fsck.ocfs2(8) mkfs.ocfs2(8) mount.ocfs2(8) mounted.ocfs2(8) o2cluster(8) o2image(8) o2info(1) tunefs.ocfs2(8)

AUTHORS Oracle Corporation.

COPYRIGHT Copyright © 2004, 2012 Oracle. All rights reserved.

Version 1.8.2 January 2012