<<

systems , and Table of contents

• Ext2 – Directories – structures on disk – Data structures in memory – Disk space management • Ext3 file system – Journaling – Storing entries in H-trees – Data structures • Ext4 file system – Extents – allocation – Journal – Nanosecond time stamps • Tests

2 File systems of the ext* family

Linux supports many file systems, but ext* family systems are native to it. Ext2 (second )

• Introduced in 1993. Main developer is Rémy Card. • The maximum file allowed is from 16 GB to 2 TB. • The total file system size is between 2 TB and 32 TB. • A directory can contain 32,000 subdirectories. • Recommended on flash drives and USB, because it does not introduce overhead associated with journaling. • A journaling extension to the ext2 has been developed. It is possible to add a journal to an existing ext2 filesystem without the need for data conversion.

3 General features of ext2

The main features of ext2 affecting its performance: • When creating the system, the administrator can choose the optimal block size (in the range of 1 KB to 4 KB), depending on the expected average . • When creating a system, the administrator can set the number of for a particular partition size, depending on the number of files expected. • Disk blocks are divided into groups including adjacent tracks, thanks to which reading a file located within a single group is associated with a short seek time. • The file system preallocates disk blocks for regular files, so as the file grows, blocks are already reserved for it in physically adjacent areas, which reduces file fragmentation. • Thanks to the careful implementation, it is stable and flexible. • Defined in /fs/ext2.

4 Directories in ext2

Directory – consists of blocks of type ext2_dir_entry_2.

#define EXT2_NAME_LEN 255

struct ext2_dir_entry_2 { __le32 ; /* Inode number */ __le16 rec_len; /* Directory entry length */ __u8 name_len; /* Name length */ __u8 file_type; char name[]; /* File name, up to EXT2_NAME_LEN */ };

The file name is limited to 255 characters (constant EXT2_NAME_LEN). Depending on the setting of the NO_TRUNCATE, flag, longer names may be truncated or treated as incorrect. The structure has variable length because the last field is an array of variable length. Each entry is supplemented with \0 to multiples of 4. The name_len field stores the actual length of the file name. 5 Directories in ext2

The rec_len field can be interpreted as a pointer to the next correct directory entry. To remove an entry, just reset the inode field and increase the rec_len value. The file is added by means of a linear search to the first structure, in which the inode number is 0 and there is enough space. If one is not found, the new file will be appended at the end.

User processes can the directory as a file, but only the kernel can the directory, which guarantees the correctness of the data.

Example directory (source: Bovet, Cesati, Understanding the Kernel) 6 Ext2 file system data structures on disk

The basic physical unit of data on a disk is a block. The block size is constant throughout the entire file system. These constants limit the block size: #define EXT2_MIN_BLOCK_SIZE 1024 #define EXT2_MAX_BLOCK_SIZE 4096 /* in */ Ext2 on a disk consists of many groups of disk blocks (of the same size, located sequentially one after the other). Block groups reduce file fragmentation.

Source: https://computing.ece.vt.edu/~changwoo/ECE-LKP-2019F/l/lec21-fs.pdf 7 Ext2 file system data structures on disk

Superblock — Superblocks in all groups have the same content*. Group descriptors — As with superblocks, their content is copied to all groups*.

* Originally, the superblock and group descriptors were replicated in every block group with those located in block group 0 designated as the primary copies. This is no longer common practice due to the Sparse SuperBlock Option. The Sparse SuperBlock Option only replicates the file system superblock and group descriptors in a fraction of the block groups.

The kernel only uses the superblock and group descriptors from group 0. When e2fsck checks the consistency of the file system, it reaches into the superblock and descriptors from group 0 and copies them to other groups. If as a result of the failure the structure data stored in block 0 are unusable, the administrator can order e2fsck to reach older copies in the other groups. During system initialization, blocks with group descriptors from group 0 are read into memory. Unless there are exceptional situations, the system does not use blocks with descriptors and a superblock from other groups

8 Ext2 file system data structures on disk

Number of block groups depends on the partition size and the block size. The block bitmap must be stored in a single block, so if the block size in bytes is b, there can be at most 8 * b blocks in each block group, thus if s is the partition size in blocks, the total number of block groups is roughly s/(8 * b). If the blocks are 1 KB in size, then we can describe 8 K blocks in a single bitmap, so in the group we have 8 MB for data blocks. If the blocks are 4 KB, then we have sixteen times more space for data blocks, i.e. 128 MB. Let’s consider a 32 GB ext2 partition with a 4 KB block size. In this case, each 4 KB block bitmap describes 32K data blocks—that is, 128 MB. Therefore, at most 256 block groups are needed. Dividing the file system into block groups is designed to increase security and optimize data writing to disk. Increased security is achieved by maintaining redundant information about the file system (superblock and group descriptors) in each block group. Optimization of data writing is ensured by algorithms for the allocation of new inodes and disk blocks.

9 Superblock

Stores information describing file system (block size, number of inodes and blocks, number of free blocks, etc.) struct ext2_super_block { __le32 s_inodes_count; /* Inodes count */ __le32 s_blocks_count; /* Blocks count */ __le32 s_free_blocks_count; /* Free blocks count */ __le32 s_free_inodes_count; /* Free inodes count */ __le32 s_first_data_block; /* First Data Block */ __le32 s_blocks_per_group; /* # blocks per group */ __le32 s_inodes_per_group; /* # inodes per group */ __le16 s_inode_size; /* size of inode structure */ /* * Performance hints. Directory preallocation should only * happen if the EXT2_COMPAT_PREALLOC flag is on. */ __u8 s_prealloc_blocks; /* Nr of blocks to try to preallocate*/ __u8 s_prealloc_dir_blocks; /* Nr to preallocate for dirs */ ... }; 10 Group descriptors and bitmaps

It occupes 32 bytes, which means that when a disk block has a size of 1 KB, one block contains 32 group descriptors. Bitmaps are arrays in which a value of 0 means that the corresponding inode or data block is free, and 1 means it is busy.

struct ext2_group_desc { __le32 bg_block_bitmap; /* Blocks bitmap block */ __le32 bg_inode_bitmap; /* Inodes bitmap block */ __le32 bg_inode_table; /* Inodes table block */ __le16 bg_free_blocks_count; /* Free blocks count */ __le16 bg_free_inodes_count; /* Free inodes count */ __le16 bg_used_dirs_count; /* Directories count */ ... };

11 Inode table

It consists of a sequence of adjacent blocks, each containing a fixed number of inodes. All inodes have the same size of 128 bytes. One 1 KB block contains 8 inodes, and a 4 KB block contains 32 inodes. Inode number = starting inode number of a block group + inode location in the inode table. The inode in the ext2 file system is described by the structure ext2_inode.

struct ext2_inode { __le16 i_mode; /* File mode */ __le32 i_size; /* Size in bytes */ __le16 i_links_count; /* Links count */ __le32 i_blocks; /* Blocks count */ __le32 i_block[EXT2_N_BLOCKS];/* Pointers to blocks */ __le32 i_file_acl; /* File ACL */ __le32 i_dir_acl; /* Directory ACL */ ... }; 12 Inode table

The i_size field has 32 , i.e. the field size limits the file size to 4 GB. In fact, the highest bit of the field is not used, so the maximum file size is even smaller — 2 GB. Ext2 uses different methods to increase the allowable file size. Some architectures use the i_dir_acl field as a 32-bit extension of the i_size field. In ext2 there is no need to store on the disk the mapping between the inode number and the number of the block containing it, because this block number can be calculated based on the group number and relative position of the inode in the inode table. Let's assume that each group has 4,096 inodes and that we want to find the disk address of the inode 13,021. The inode therefore belongs to the third group and its disk address is stored in the 733th position of the corresponding inode table.

13 Ext2 file system data structures in memory

Most of the information stored in disk data structures is copied to RAM during file system mounting, thanks to which the kernel has quick access to them.

Mapping ext2 data structures to VFS structures

Type Data structure on disk Data structure in memory Buffering mode

Superblock ext2_super_block ext2_sb_info Always cache Group descriptor ext2_group_desc ext2_group_desc Always cache

Block bitmap Bitmap in a block Bitmap in a buffer Dynamically

Inode bitmap Bitmap in a block Bitmap in a buffer Dynamically

Inode ext2_inode ext2_inode_info Dynamically Data block array VFS buffer Dynamically Free inode ext2_inode ------Never Free block Byte array ------Never 14 Superblock in memory

In VFS, the superblock is described in the super_block structure. The s_fs_info field of this structure for the ext2 filesystem points to the ext2_sb_info structure, and the s_es field of this structure to ext2_super_block. The kernel initializes these data structures during mounting and loads the superblock and group descriptors to memory. It releases them when unmounting the file.

The ext2_sb_info structure (source: Bovet, Cesati, Understanding the )

15 Inode in memory

An ext2_inode_info inode descriptor is created for a file from the ext2 system. The descriptor includes inode descriptor from VFS (inode). Objects of the ext2_inode type go to buffers (see the ext2_get_inode(), function, which ends with the statement return (struct ext2_inode *) (bh  b_data + offset); Field i_block_group is the number of the block group which contains this file's inode. Constant across the lifetime of the inode, it is used for making block allocation decisions – we try to place a file's data blocks near its inode block, and new inodes near to their parent directory's inode.

struct ext2_inode_info { __le32 i_data[15]; __u32 i_flags; __u16 i_state; __u32 i_dtime; __u32 i_block_group; /* block reservation info */ struct ext2_block_alloc_info *i_block_alloc_info; struct inode vfs_inode; /* the entire inode object from VFS */ }; 16 Managing ext2 disk space

The storage of a file on disk differs from the view the programmer has of the file in two ways: blocks can be scattered around the disk (although the filesystem tries to keep blocks sequential to improve access time), and files may appear to a programmer to be bigger than they really are because a program can introduce holes into them (through the lseek() ). The i_size field of the inode contains the file size including holes (as seen by the program), and the i_blocks field remembers the number of blocks actually allocated to the file.

17 Allocating an ext2 disk inode for a directory

The ext2_new_inode() function creates an ext2 disk inode, returning the address of the corresponding inode object (or NULL). If the new inode is a directory, the function finds a suitable block group for the directory: 1. Directories having as parent the filesystem root should be spread among all block groups. Thus, the function searches the block groups looking for a group having a number of free inodes and a number of free blocks above the average. If there is no such group, it jumps to step 3. 2. Nested directories should be put in the group of the parent if it satisfies the following rules: – The group does not contain too many directories. – The group has a sufficient number of free inodes left. – The group has a small “debt” (the debt is increased each time a new directory is added and decreased each time another type of file is added). If the parent’s group does not satisfy these rules, it picks the first group that satisfies them. If no such group exists, it jumps to step 3.

18 Allocating an ext2 disk inode for a directory

3. The function starts with the block group containing the parent directory and selects the first block group that has more free inodes than the average number of free inodes per block group. This requires to search all group descriptors that have free inode counters. It is not expensive, creating a new directory is not a frequent activity, and all group descriptors are read into memory when the file system is mounted and remain there until the end of the system's operation (their number varies from several dozen to several hundred).

19 Allocating an ext2 disk inode for a regular file

If the new inode is not a directory, the function selects the group by starting from the one that contains the parent directory and moving farther away from it. This means that files in the same directory are more likely to be in the same block group: 1. If no free inodes exist in the parent directory group, then a fast logarithmic search starts. The algorithm searches log(n) groups, where n is the total number of block groups. The algorithm considers block groups i mod (n), i+1 mod (n), i+1+2 mod (n) etc., where i is the starting group. 2. If the logarithmic search failed in finding a block group with a free inode, the function performs an exhaustive linear search starting from the block group that includes the parent directory, until it finds a group with free inodes.

20 Inode bitmaps

After selecting a block group, it remains to search for a free inode in the group and assign it to the created file. Information which inode in the block group is free and which is occupied is stored in bitmaps. Unlike group descriptors, inode bit mappings are not all stored in memory. The maximum number of bitmaps loaded in memory is 8. When the number of groups exceeds this constant (the most common case), the LRU algorithm is used to exchange bitmaps in memory. When an inode is found, the buffers containing the group descriptor and bitmap are marked as dirty for later writing to disk, and the pointer to the selected inode is passed as a function value.

The ext2_free_inode() function deletes a disk inode.

21 Block addressing in ext2

In the ext2 file system, the table with addresses in the inode has the name: 1. i_data – for the inode structure in memory (the field of the ext2_inode_info structure),

__le32 i_data[15]

2. i_block – for the inode structure on the disk (the field of the ext2_inode structure).

__le32 i_block[EXT2_N_BLOCKS]; /* Pointers to blocks */

This table has size equal to 15. There are 12 items with addresses of direct blocks and one item for single indirect, double indirect and triple indirect blocks.

#define EXT2_NDIR_BLOCKS 12 #define EXT2_IND_BLOCK EXT2_NDIR_BLOCKS #define EXT2_DIND_BLOCK (EXT2_IND_BLOCK + 1) #define EXT2_TIND_BLOCK (EXT2_DIND_BLOCK + 1) #define EXT2_N_BLOCKS (EXT2_TIND_BLOCK + 1)

22 Some block numbers may be zero. This means that nothing has been saved to a certain space in the file (this is possible thanks to the lseek () function). 23 Block addressing in ext2

Data block addressing (source: Bovet, Cesati, Understanding the Linux Kernel) b is the block size, block number occupies 4 bytes, (b/4) is the number of addresses in one index block.

You can refer to the block with: – relative position in file – block number in file or – position in the disk partition – logical block number. Obtaining a logical block number from an offset within a file requires two steps: – translating the offset to the block index containing the appropriate byte, – translating the index to a logical block number. 24 Allocating an ext2 disk data block

The allocation of disk blocks is performed by the function ext2_new_blocks(). The inode parameter indicates the inode for which we allocate the block, count indicates the desired number of blocks, goal gives the number of the block we would like to allocate (this is related to pre-allocation). If it is not possible to allocate a block with this number, the function will try to allocate any other free block. ext2_fsblk_t ext2_new_blocks (struct inode *inode, ext2_fsblk_t goal, unsigned long *count, int *errp) void ext2_free_blocks (struct inode * inode, unsigned long block, unsigned long count)

If the goal block is free, it will be allocated. If the request cannot be completed within the current group, it tries in the others. If it still fails, preallocation is turned off. Searching for a new block to be allocated first in the immediate vicinity of the given block makes sense for the speed of the file system. The ext2 file system owes it its extremely low file fragmentation rate. The blocks of a given file are almost always together and loading the file is fast.

25 File systems of the ext* family

Ext3 (third extended file system)

• Introduced in 2001, withdrawn in 2015. The main developer is . • Available since kernel version 2.4.15. • The main benefit of ext3 is that it allows journaling. Journaling has a dedicated area in the file system, where all the changes are tracked. When the system crashes, the possibility of file system corruption is lower because of journaling. • Maximum individual file size can be from 16 GB to 2 TB. • Overall ext3 file system size can be from 4 TB to 32 TB. • A directory can contain 32,000 subdirectories • You can ext2 file system to ext3 file system directly (without /restore).

26 Ext3 – journaling

The ext3 file system was first mentioned in Journaling the Linux ext2fs Filesystem (Stephen Tweedie, 1998). If you shut down your computer without unmounting the ext2 file system, you must examine the integrity of this partition before mounting it again. The larger this partition is, the longer it takes, for large file systems it can take hours. In the case of ext3 with the has_journal option enabled, the consistency test is replaced by playing a journal, which is much faster, in the order of seconds. After incorrect unmounting, playing the journal restores the correct state of the data or even the metadata. Information about pending file system updates is written to the journal. Regardless of the mode of operation, the journal ensures consistency only at the level of the system function call. There are six types of metadata in ext2 and ext3: superblocks, block group descriptors, inodes, intermediate index blocks, data block bitmaps and inode bitmaps.

27 Ext3 – journaling

Journal is logically a fixed-size, circular array. • Implemented as a special file with a hard-coded inode number. • Each journal transaction is composed of a begin marker, log, and end marker.

Source: https://computing.ece.vt.edu/~changwoo/ECE-LKP-2019F/l/lec21-fs.pdf 28 Ext3 – journaling

Transactions Instead of considering each file system update as a separate transaction, ext3 groups many updates into a single compound transaction that is periodically committed to disk. Compound transactions may have better performance than more fine-grained transactions when the same structure is frequently updated in a short period of time (e.g., a free space bitmap or an inode of a file that is constantly being extended).

Checkpointing It is the of writing journaled metadata and data to their fixed-locations. Checkpointing is triggered when various thresholds are crossed, e.g., when file system buffer space is low, when there is little free space left in the journal, or when a timer expires. https://www.usenix.org/legacy/publications/library/proceedings/usenix05/tech/general/full_ papers/prabhakaran/prabhakaran_html/main.html 29 Ext3 – journaling

At some point we will wish to commit our outstanding filesystem updates to the journal as a new compound transaction. When we commit a transaction, the new updated filesystem blocks are sitting in the journal but have not yet been synced back to their permanent home blocks on disk (we need to keep the old blocks unsynced in case we crash before committing the journal). Once the journal has been committed, the old version on the disk is no longer important and we can write back the buffers to their home locations at our leisure. Until we have finished syncing those buffers, we cannot delete the copy of the data in the journal. The ext3 uses checkpoints at which a check is made to ascertain whether the changes in the journal have been written to the filesystem. If they have, the data in the journal are no longer needed and can be removed.

30 Ext3 – journal recovery

During recovery, the file system scans the log for committed complete transactions; incomplete transactions are discarded. Each update in a completed transaction is simply replayed into the fixed-place ext3 structures. Case 1: failure between journal commit and checkpointing 1. There is a transaction end marker at the tail of the journal → logs between begin and end markers are consistent. 2. Replay logs in the last journal transaction. Case 2: failure before journal commit 1. There is no transaction end marker at the tail of the journal → there is no guarantee that logs between begin and end markers are consistent. 2. Ignore the last journal transaction.

Source: https://computing.ece.vt.edu/~changwoo/ECE-LKP-2019F/l/lec21-fs.pdf 31 Ext3 – journaling modes

1. Writeback (highest risk) Only metadata is journaled (as in XFS, JFS and ReiserFS); file contents are not. The contents might be written before or after the journal is updated (the system does not wait for associated changes to file data to be written before updating metadata). Files modified right before a crash can become corrupted. For example, a file being appended to may be marked in the journal as being larger than it actually is, causing garbage at the end. Older versions of files could also appear unexpectedly after a journal recovery. Usually causes the smallest overhead. Source: Source: https://en.wikipedia.org/wiki/Ext3 https://www.eecs.harvard.edu /~cs161/notes/journaling.pdf32 Ext3 – journaling modes

2. Ordered (medium risk) Only metadata is saved in the journal. Metadata are journaled only after writing data to disk. This is the default on many Linux distributions. If there is a power outage while a file is being written or appended to, the journal will indicate that the new file or appended data has not been committed, so it will be purged by the cleanup process. (Thus appends and new files have the same level of integrity protection as the journaled level.) Files being overwritten can be corrupted because the original version of the file is not stored. It's possible to end up with a file in an intermediate state between new and old, without enough information to restore either one or the other (the new data never made it to disk completely, and the old data is not stored anywhere). Even worse, the intermediate state might intersperse old and new data, because the order of the write is left up to the disk's hardware. This mode is generally slightly slower than writeback and much Source: faster than journal. https://www.eecs.harvard.edu Source: https://en.wikipedia.org/wiki/Ext3 /~cs161/notes/journaling.pdf33 Ext3 – journaling modes

3. Journal (lowest risk) Both metadata and file contents are written to the journal before being committed to the main file system. Data consistency is guaranteed: in case of failure, the file will contain old data or new data. Because the journal is relatively continuous on disk, this can improve performance, if the journal has enough space. In other cases, performance gets worse, because the data must be written twice—once to the journal, and once to the main part of the filesystem. In all modes, ext3 logs full blocks, as opposed to differences from old versions; thus, even a single bit change in a bitmap results in the entire bitmap block being logged. Source: https://www.eecs.harvard.edu/ Source: https://en.wikipedia.org/wiki/Ext3 ~cs161/notes/journaling.pdf 34 Ext3 – journaling

The ext3 file system journal is usually written to a hidden file called .journal located in the root of the file system. Ext3 uses the Journaling Block Device layer to support journaling. This is the code used to physically write the journal. It is located outside of the ext3 driver, it can be used by other file system drivers (ext4, oracle cluster file system) and you can give it the device on which the journal is to be located and the sequence of blocks on this device intended for the journal. A partition with the has_journal option enabled has a journal. Has_journal is marked in the superblock in the s_feature_compat field with the bit EXT3_FEATURE_COMPAT_HAS_JOURNAL. FEATURE_COMPAT in the name means that the ext2 driver can mount a partition with a read and write journal. In ext2 we talk about reserved inodes and one of them is devoted to the journal. The superblock keeps the number of the inode saved.

35 Ext3 – directories in H-trees

In ext2, the directory is a list of variable size directory entries. Searching for the inode number takes O (n), when n is the number of entries in the directory. The ext3 partition with the dir_index option enabled can reduce the search time of the inode several times. H-trees (, hashed binary ) used in ext3 directory indexes are trees of height 2 or 3 with equal depth of all nodes. The root of an h-tree index is the first block of a directory file. The leaves are normal ext2 directory blocks, referenced by the root or indirectly through intermediate h-tree index blocks. References within the directory file are by means of logical block offsets within the file. The nodes other than leaves, as in B-trees, contain key values that separate the keys in the subtrees attached to subsequent child nodes. The keys are the hash function values for the file names. Ext3 supports several hash functions.

A Directory Index for Ext2, Daniel Phillips, 2001 Add ext3 indexed directory (htree) support (October 2002, ver. 2.5.40) This patch significantly increases the speed of using large directories in ext3, in a completely backwards and forwards compatible fashion. Creating 100,000 files in a single directory took 38 minutes without directory indexing... and 11 seconds with the directory indexing turned on. 36 37 Ext3 – directories in H-trees struct dx_entry struct dx_root { { __le32 hash; struct fake_dirent dot; __le32 block; char dot_name[4]; }; struct fake_dirent dotdot; char dotdot_name[4]; struct fake_dirent struct dx_root_info { { __le32 inode; __le32 reserved_zero; __le16 rec_len; u8 hash_version; u8 name_len; u8 info_length; /* 8 */ u8 file_type; u8 indirect_levels; }; u8 unused_flags; } info; struct dx_entry entries[0]; }; struct dx_node { struct fake_dirent fake; struct dx_entry entries[0];

}; 38 Ext3 – directories in H-trees

The search for a file name in the H-tree begins with a binary search of the leaf in which the directory entry for the name is found. Directory entries within the leaf are not ordered, the leaf should be searched linearly. There may be a collision of hash function values. An important case is when of the two directory entries for conflicting names, one has the largest hash value in its leaf and the other has the smallest in the successor of this node. If we are looking for the second name, then we get to this first block and there we recognize, that the searched name is missing. At this point we need to search the successor node (and perhaps more nodes). The youngest bit of the hash in the parent node indicates whether such a collision on the border occurred; thanks to this information we can skip searching the successor.

39 Ext3 – directories in H-trees

Adding an entry consists in adding a new directory entry to the appropriate leaf. If the leaf is full and there is only one level of index nodes, we perform the split operation as in a B-tree. If the leaf is full and there are two levels of index nodes, then there are several tens of millions of entries in the directory, or because of the fragmentation too many other nodes are not full – in this case the inability to create the file is reported. The directory entry is deleted only in the leaf. If the leaf becomes empty, we do nothing about it – which simplifies the implementation, but potentially prevents the operation of splitting another node

40 Ext3 – directories in H-trees

Ext2 drivers and older ext3 drivers do not know the dir_index option, but they can read and write in partitions with the option enabled. Index nodes cannot be interpreted by older drivers as blocks with directory entries. By using the appropriate combination of the first eight bytes of a block, you can force drivers to treat the block as empty. When the old driver creates or deletes a directory entry in an indexed directory, it will not update the index. The new driver must recognize the inconsistency and if the index is out of date, rebuild it or search for the entry linearly. Additional information is stored in the root for quick consistency checking. Indexed directories have a bit set in the inode so that the newer driver does not treat the first directory block created by the old driver as the root of the H-tree. Hashing algorithms may introduce the risk of a denial of service (DOS) attack. Ext3 is vulnerable to maliciously creating multiple directory entries with the same name hash value, which will bring the name search to a linear search.

41 File systems of the ext* family

Ext4 (fourth extended file system)

• Introduced in 2008 (not entirely new filesystem, rather of ext3). • Main maintainers: Theodore Ts’o, Andreas Dilger. • Available since kernel version 2.6.19. • Supports huge individual file size and overall file system size. • Maximum individual file size can be from 16 GB to 16 TB. • Overall file system size can be from 1 EB (exabyte). 1 EB = 1024 PB (petabyte), 1 PB = 1024 TB (terabyte). • A directory can contain 64,000 subdirectories. • Several other new features are introduced in ext4: multiple block allocation, delayed allocation, journal checksum, fast , etc. • There is an option of turning the journaling feature off. • An existing ext3 can be mounted as ext4 (without having to upgrade it).

42 General features of ext4

Ext4 was initially a series of backward-compatible extensions to ext3, but developers opposed accepting extensions to ext3 for stability reasons. On 28 June 2006 the source code of ext3 was forked and the new branch renamed to ext4. Kernel 2.6.28, containing the ext4 filesystem, was released on 25 December 2008. The ext4 file system is defined in the /fs /ext4 directory. The directory contains noticeably more files than ext2 and ext3. Ext4 is backward-compatible with ext3 and ext2, making it possible to mount ext3 and ext2 as ext4. Mechanisms that require changed data structures on the disk (in particular extents) will not work – so many optimizations introduced to ext4 will not be available for such a mounted partition. Conversion from ext2 and ext3 to ext4 is possible without changing the inodes (old block addressing is used). In this way, files using the old block indirect addressing can coexist on the disk, as well as files using the new extents mechanism. To identify which addressing method the inode uses, check the EXT4_EXTENTS_FL bit in the i_flags inode field.

43

The most important feature that distinguishes ext4 from the ext2 and ext3 is the extents mechanism, which replaces indirect block addressing. Instead of addressing individual blocks, ext4 tries to map as much data as possible to a continuous block area on the disk. To get this ext4 mapping needs 3 values: – the initial mapping block in the file, – the size of the mapped area (in blocks) and – the initial block of data saved on the disk. The structure that stores these values is called extent.

#define EXT4_MIN_BLOCK_SIZE 1024 #define EXT4_MAX_BLOCK_SIZE 65536 /* in bytes*/

struct ext4_extent { __le32 ee_block; /* first logical block extent covers */ __le16 ee_len; /* number of blocks covered by extent */ __le16 ee_start_hi; /* high 16 bits of physical block */ __le32 ee_start_lo; /* low 32 bits of physical block */ }; 44 Source: https://computing.ece.vt.edu/~changwoo/ECE-LKP-2019F/l/lec21-fs.pdf 45 File, , extent size

1 exbibyte = 260 bytes = 1152921504606846976 bytes = 1,024 pebibytes 1 EiB is approximately 1.15 EB, where exabyte (EB) to 1018 bytes. 46 File, volume, extent size

File blocks in the ext4 system are numbered using 32 bits, which limits their number to 232 4 KB blocks. This gives a maximum file size of 232*212=24*240=16 TiB. In standard ext3, the file can have a maximum of 2 TiB. The volume size, in turn, is limited by the 48-bit block identifier on the disk, which for a 4 KB block size gives 248*212=260=1 EiB. For comparison, ext3 with a 32-bit number and a 4 KB block size offered a maximum partition size of 16 TiB. The size of the extent is limited by 215 blocks, i.e. for a 4 KB block it gives 215*212=217=128 MB. This limitation results from the division into block groups, and a single block group can have a maximum size of 128 MB. Due to this limitation, the last bit of the 16-bit extent size can be used in the preallocation mechanism. The extents mechanism reduces the size of metadata, which means that operations on large files are much faster. The 500 MB file in ext4 uses four 12-byte extents, while the block addresses of the same file need more than 0.5 MB metadata in ext2. The advantage of the new solution can be seen especially in operations requiring many operations on metadata (e.g. file deletion).

47 Storing files up to 512 MB

The metadata for storing information about extents is saved in the inode in the 60- bytes area used in ext2 for storing block addresses (12 direct blocks and one position for single, double and triple indirect blocks). There are 4 extents in this area and 1 header describing them (12 bytes for each structure).

/* Each block (leaves and indexes), even inode-stored has header. */ struct ext4_extent_header { __le16 eh_magic; /* probably will support different formats */ __le16 eh_entries; /* number of valid entries */ __le16 eh_max; /* capacity of stores in entries */ __le16 eh_depth; /* has tree real underlying blocks? */ /* 0 = this extent node points to data blocks; otherwise, this extent node points to other extent * nodes. The extent tree can be at most 5 levels deep: a logical block number can be at most * 2^32, and the smallest n that satisfies 4*(((blocksize - 12)/12)^n) >= 2^32 is 5. */ */ __le32 eh_generation; /* generation of the tree */ };

48 Storing files up to 512 MB

Extent map (source: Ext4: The Next Generation of Ext2/3 Filesystem, Mingming Cao, Suparna Bhattacharya, Ted Tso) 49 Storing files over 512 MB

A tree is built for larger chunks of data. For this purpose, an additional structure is used – an index containing the initial position of the extent in the file and the block number on the data on the disk. This block always contains a header describing the data and may contain further indexes or extents with the data.

* This is index on-disk structure. * It's used at all the levels except the bottom. */ struct ext4_extent_idx { __le32 ei_block; /* index covers logical blocks from 'block' */ __le32 ei_leaf_lo; /* pointer to the physical block of the next */ /* level. leaf or next index could be there */ __le16 ei_leaf_hi; /* high 16 bits of physical block */ __u16 ei_unused; };

50 Storing files over 512 MB

Inode structure and extent tree in ext4 (source: M. Michałowski) 51 Layout of the large inode

Ext3 supports different inode . The inode size can be set to any power-of-two larger than 128 bytes size up to the filesystem block size. This can be done by the mke2fs -I [inode size] command at time. The default inode size is 128 bytes, which is already crowded with data and has little space for new fields. In ext4, the default inode structure size is 256 bytes.

In order to avoid duplicating a lot of code in the kernel and e2fsck, the large inodes keep the same fixed layout for the first 128-bytes. The rest of the inode is split into two parts: 1. A fixed-field section that allows addition of fields common to all inodes, such as nanosecond timestamps. 2. A section for Fast Extended Attributes (EAs) that consumes the rest of the inode (support for fast EAs in large inodes has been available in Linux kernels since 2.6.12). The first use of EAs was to store file ACLs and other Source : The new ext4 filesystem: security data (selinux). current status and future plans52 Block allocation enhancements

The goal is to increase filesystem throughput. One of the method is to reduce filesystem fragmentation. The new features in ext4 take advantage of the existing extents mapping and aim at reducing file system fragmentation by improving block allocation techniques.

53 Persistent preallocation

Persistent preallocation The fallocate() system call allows to reserve a specific area for a file that does not initially use all space. The mechanism is useful when working with slow-growing files or files that are not written sequentially. Preallocation also protects against the lack of disk space for file extension and allows to reduce data fragmentation. Information that the file is pre-allocated and extent contains uninitialized data is contained in bit 16 of the field describing the size of the extent (ee_len). During reads, an uninitialized extent is treated just like a hole, so that the VFS returns zero-filled blocks. Upon writes, the extent must be split into initialized and uninitialized extents, merging the initialized portion with an adjacent initialized extent if contiguous.

LWN: fallocate()

54 Multiple block allocation

In ext2, as well as ext3, each block of the file had to be allocated separately, which in the case of large files resulted in a large number of calls to the allocation function. In addition to performance issues, this made the file system more susceptible to fragmentation. Ext4 has a multiple block allocation mechanism (mballoc) that is necessary to ensure a continuous block area for extents. Depending on the file size, the allocator uses different strategies – for small files (<16 blocks) it tries to keep them close together, which will speed up their reading; – large files are allocated so that they are in the most continuous memory area possible. This solves the performance and fragmentation issues that occur in ext2. Regardless of which strategy the ext4 allocator uses – it first checks if there are free preallocated blocks, only in the next step uses the buddy cache. Description of the allocator: Mballoc.c @ LXR – 300 line comment on the operation of the multiblock allocator. Using this mechanism does not affect the format of the data stored on the disk. 55 Delayed allocation

Delayed allocation (allocate-on-flush) is a technique used in many modern file systems, consisting in maximum delay in block allocation (in contrast to traditional file systems, in which blocks are allocated as soon as possible) . If the process writes to a file, the file system immediately allocates the blocks where the data will be written, even if it does not happen immediately and the data is cached for some time. This approach has its drawbacks. If the process writes continuously to a file that is growing, subsequent writes allocate blocks to data, but it is not known whether the file will continue to grow. In the case of delayed allocation, blocks are not allocated immediately upon writing, but only when disk writing is actually to take place. This allows the block allocator to optimize allocation in situations where the old system could not. Delayed writing works very well with two other techniques: extents and multiple block allocation, because in many situations when the file is finally saved to disk, it will be placed in the extents allocated using the mballoc allocator. This improves performance and reduces fragmentation.

56 Delayed allocation

When the file is created, free blocks are searched in VFS that can hold data blocks and metadata. They are reserved to ensure that there is sufficient disk space at the time of actual writing (disk space for the appended data is subtracted from the free-space counter, but not actually allocated in the free-space bitmap.) Buffer_head structures that hold the contents of a file in memory are marked as allocated with delay (BH_DELAY). When writepage () or writepages () is called, multiple block allocation is made to dirty file frames in a continuous disk area. After successful allocation, unused blocks reserved in the first step can be released.

57 Delayed allocation

Benefits: – Combined with the simultaneous allocation of multiple blocks, it increases performance when saving growing files. This is because the chance that subsequent allocations will be grouped into one large allocation that will be more efficiently implemented than many individual ones increases. – Reduces data fragmentation, especially if the file has grown from creation to actual disk allocation. – In the case of temporary files, there is a chance that you will not need to save them to disk at all. Disadvantages: – Increases the risk of data loss during a failure. – Many assumptions about writing to a file, true for ext2, become false for ext4.

58 Journal

Journal checksums The journal is one of the most intensively used areas of the disk, therefore data corruption in this area is particularly severe for the entire file system. Ext4 introduces the journal checksumming. The checksum is calculated for each transaction and each block group descriptor. On the one hand, this is done at the expense of a larger overhead for I/O operations, on the other hand it allows to transform two journal writing phases known from ext3 to one, which improves both reliability and performance.

Barriers This mechanism allows to order the disk driver to save data in a specific order, which prevents data from being split in case of an emergency. Modern disk drivers could, in order to improve performance, change the order in which data is written. The file system must explicitly instruct the disk that writing data to the journal must precede writing the record with a commit. Barriers are used for this. 59 Nanosecond time stamps

Ext4 introduces time stamps with nanosecond precision. The consequence of adding extended time stamps is also increasing the default inode size from 128 bits to 256 bits. The new time stamps also shift the problem of 2038 by another 204 years. Additional fields in the inode compared to ext3:

__le32 i_ctime_extra; /* extra Change time (nsec << 2 | epoch) */ __le32 i_mtime_extra; /* extra Modification time(nsec << 2 | epoch) */ __le32 i_atime_extra; /* extra Access time (nsec < 2 | epoch) */ __le32 i_crtime; /* File Creation time */ __le32 i_crtime_extra; /* extra FileCreationtime (nsec << 2 | epoch) */ __le32 i_version_hi; /* high 32 bits for 64-bit version */

60 File systems of the ext* family

Linux FS: ext2 vs ext3 vs ext4 61 Tests

Android Modders Guide - File systems (2017-10-20). Benchmarking filesystems, Linux Gazette #102, linuxgazette.net, 10 maja 2010.

Test 1

Test 1 (source: Ext4: The Next Generation of Ext2/3 Filesystem – 2006) 62 Tests

Test 2 FFSB is a powerful filesystem benchmarking tool that can be tuned to simulate very specific workload. Multithreaded creation of large files. The test runs 4 threads, which combined create 24 1-GB files, and stress the sequential write operation. 35% improvement in throughput and 40% decrease in CPU utilization in ext4 as compared to ext3.

Test 2 (source: The new ext4 filesystem: current status and future plans – 2007) 63 Tests

Test 3 Postmark is a well-known benchmark simulating a mail server performing many single-threaded transactions on small to medium files. About 30% throughput gain with ext4. Similar percent improvements in CPU utilization, because metadata is much more compact with extents. The write throughput is higher than read throughput because everything is being written to memory. Aside from the obvious performance gain on large contiguous files, ext4 is also a good choice on smaller file workloads.

Test 3 (source: The new ext4 filesystem: current status and future plans – 2007) 64 Tests

Test 4 For the IOzone benchmark testing, the system was booted with only 64 M of memory to really stress disk I/O. The tests were performed with 8 MB record sizes on various file sizes. Write, rewrite, read, reread, random write, and random read operations were tested. Figure shows throughput results for 512 MB sized files. There is great improvement between ext3 and ext4,especially on rewrite, random- write and reread operations. In this test, XFS still has better read performance, while ext4 has shown higher throughput on write operations.

Test 4 (source: The new ext4 filesystem: current status and future plans – 2007) 65 Additional reading

• Documentation/filesystems/ext2.txt. • Access Control Lists (Krzysztof Bałażyk). • Preallocation in ext2 (Damian Karasiński). • State of the Art: Where we are with the Ext3 filesystem, M. Cao, T. Y. Ts'o, B. Pulavarty, S. Bhattacharya, IBM. • A Directory Index for Ext2, Daniel Phillips, 2001. • Journaling the Linux ext2fs Filesystem, LinuxExpo, Stephen C. Tweedie, 1998. • Anatomy of Linux journaling file systems, M. Tim Jones, IBM. • Ext3, Wikipedia, the free encyclopedia, 7 maja 2010. • Ext3 removal, quota & udf fixes (Linus Torwalds, September 2015) So the thing I'm happy to see is that the ext4 developers seem to unanimously agree that maintaining ext3 compatibility is part of their job, and nobody seems to be arguing for keeping ext3 around. Assuming no major objections come up, the EXT3 file-system driver will be dropped for the Linux 4.3 kernel. • File system design case studies (Paul Krzyzanowski, March 2012)

66 Additional reading

• Documentation/filesystems/ext4.txt. • Ext4 wiki. • Ext4 Howto. • Ext4 Disk Layout. • Ext4 block and inode allocator improvements, A. Kumar, M. Cao, J. Santos, A. Diliger, 2008 . • Case-insensitive ext4 (Jake Edge, March 2019). • The new ext4 filesystem: current status and future plans, A. Mathur, M. Cao, S. Bhattacharya, A. Dilger, A. Tomas, L. Vivier, 2007 Linux Symposium. • Ext4: The Next Generation of Ext2/3 Filesystem, M. Cao, S. Bhattacharya, T. Tso, IBM, 2007. • A Minimum Complete Tutorial of Linux ext4 File System • Understanding Linux filesystems: ext4 and beyond, Jim Salter, April 2018 • Ext4 file system and crash consistency, Hangwoo Min • How do SSDs work?

67