Native LINUX Filesystems
Extended filesystems (Ext, Ext2, Ext3, Ext4)

The extended filesystem (ext fs), second extended filesystem (ext2fs) and third extended filesystem (ext3fs) were designed and implemented on Linux by Rémy Card (Laboratoire MASI, Institut Blaise Pascal), Theodore Ts'o (Massachusetts Institute of Technology) and Stephen Tweedie (University of Edinburgh).

Non-Journaled Filesystems

Extended filesystem (Ext FS)

This is the original extended filesystem used in early Linux systems. The standard filesystem for Linux, ext2, is a high-performance, non-journaled filesystem. Although ext2 lacks journaling features, many users choose it because of its high speed and reliability.

Second Extended Filesystem (Ext2 FS)

The Second Extended File System provides standard Unix file semantics and advanced features. The Ext2 filesystem format forms the basis for the later native Linux filesystem versions. Ext2fs benefits from optimizations included in the kernel code, and its design makes provision for extensions to the current filesystem: access control lists conforming to POSIX semantics, undelete, and on-the-fly file compression.

Ext2 features:

- Long file names (255 characters, extendable to 1012) and variable-length directory entries.
- Filesystem sizes up to 4 TB, a limit set by the VFS layer.
- Reserves 5% of the blocks for the super user (root), so that the administrator can recover from user processes filling up filesystems.
- Synchronous writes of filesystem metadata (inodes, bitmap blocks, indirect blocks and directory blocks).
- Choice of logical block size when creating the filesystem, typically 1024, 2048 or 4096 bytes. Larger blocks speed up I/O, since fewer I/O requests, and thus fewer disk head seeks, are needed to access a file.
- Fast symbolic links that do not use any data block on the filesystem; the target name is stored in the inode itself rather than in a data block.
- Filesystem state tracking: a special field in the superblock indicates the status of the filesystem. When a filesystem is mounted in read/write mode, its state is set to "Not Clean". When it is unmounted or remounted in read-only mode, its state is reset to "Clean".
  At boot time, the filesystem checker (fsck) uses this information to decide if a filesystem must be checked... The filesystem checker also tests this field to force a check of the filesystem regardless of its apparently clean state.
- Filesystem checks are forced at regular intervals. A mount counter is maintained in the superblock, together with a last check time and a maximal check interval. Each time the filesystem is mounted in read/write mode, the counter is incremented and the timestamps are checked. When the counter reaches its maximal value (also recorded in the superblock), or the maximal check interval has elapsed, the filesystem checker forces the check even if the filesystem is "Clean".
- An attribute allows users to request secure deletion of files. When such a file is deleted, random data is written to the disk blocks previously allocated to the file. This prevents malicious people from gaining access to the previous content of the file by using a disk editor.

Ext2 Physical Structure

An ext2 filesystem is made up of block groups, which play the role of the cylinder groups in FFS. Unlike cylinder groups, block groups are not tied to the physical layout of the blocks on the disk, since modern drives tend to be optimized for sequential access ("smart" drives: SAN, SCSI, SATA) and hide their physical geometry from the operating system.

Ext2 filesystem layout

,---------+---------+---------+---------+---------,
| Boot    | Block   | Block   |   ...   | Block   |
| sector  | group 1 | group 2 |         | group n |
`---------+---------+---------+---------+---------'

Each block group contains a redundant copy of crucial filesystem control information (the superblock and the filesystem descriptors) and also contains a part of the filesystem (a block bitmap, an inode bitmap, a piece of the inode table, and data blocks). The structure of a block group is represented in this table:

Ext2 blockgroup layout

,---------+---------+---------+---------+---------+---------,
| Super   | FS      | Block   | Inode   | Inode   | Data    |
| block   | desc.   | bitmap  | bitmap  | table   | blocks  |
`---------+---------+---------+---------+---------+---------'

Using block groups improves reliability: since the control structures are replicated in each block group, it is easy to recover a filesystem whose superblock has been corrupted. This structure also helps performance: by reducing the distance between the inode table and the data blocks, it reduces the disk head seeks during I/O on files.

- In Ext2fs, directories are managed as linked lists of variable-length entries, each containing the inode number, the entry length, the file name and its length. Variable-length entries permit long file names without wasting disk space in directories.
- Ext2fs buffer cache management performs readaheads: when a block has to be read, several contiguous blocks are read along with it. This way, it tries to ensure that the next block to read will already be loaded into the buffer cache. Readaheads are also extended to directory reads.
- Ext2fs performs allocation optimizations. Block groups are used to cluster together related inodes and data, to reduce the disk head seeks made when the kernel reads an inode and its data blocks.
- Ext2fs preallocates up to 8 adjacent blocks when allocating a new block for writing data. Preallocation hit rates are around 75% even on very full filesystems, which gives good write performance under heavy load. Preallocation also allows contiguous blocks to be allocated to files, which speeds up future sequential reads.

Journaled Filesystems

Journaled filesystems include additional record keeping that increases the ability of the filesystem to recover from a crash.

· ext3 - the ext2 filesystem with journaling extensions.
· jfs - Journaled File System, a filesystem contributed to Linux by IBM.
· xfs - a filesystem contributed to open source by SGI.
· reiserfs - developed by Namesys, with DARPA among its sponsors; the default filesystem for SUSE Linux.

• Third Extended Filesystem (Ext3 FS)

Ext3 supports the same features as Ext2, but also includes journaling.
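The ext2 directory format described earlier (variable-length entries holding an inode number, an entry length, a name length and the name) can be illustrated with a short Python sketch. The field widths used here (4-byte inode, 2-byte entry length, 2-byte name length, little-endian, entries padded to a multiple of 4 bytes) and the convention that inode 0 marks an unused entry are assumptions taken from the classic ext2 layout, and the helper names `pack_entry` and `iter_entries` are hypothetical. It is a simplified model, not the kernel's implementation.

```python
import struct

# Assumed classic ext2 directory entry header (little-endian):
#   inode number (4 bytes), entry length (2 bytes), name length (2 bytes).
# The name bytes follow, padded so the entry length is a multiple of 4.
HEADER = struct.Struct("<IHH")


def pack_entry(inode, name, rec_len=None):
    """Build one directory entry; rec_len defaults to the minimal padded size."""
    raw = name.encode("utf-8")
    needed = (HEADER.size + len(raw) + 3) & ~3  # round up to a multiple of 4
    if rec_len is None:
        rec_len = needed
    if rec_len < needed:
        raise ValueError("rec_len too small for this name")
    return (HEADER.pack(inode, rec_len, len(raw)) + raw).ljust(rec_len, b"\0")


def iter_entries(block):
    """Walk the entries by following each entry length to the next entry."""
    offset = 0
    while offset + HEADER.size <= len(block):
        inode, rec_len, name_len = HEADER.unpack_from(block, offset)
        if rec_len < HEADER.size:
            break  # malformed entry: stop rather than loop forever
        if inode != 0:  # assumption: inode 0 marks an unused entry
            start = offset + HEADER.size
            yield inode, block[start:start + name_len].decode("utf-8")
        offset += rec_len


# Hypothetical inode numbers, for illustration only.
block = (pack_entry(2, ".")
         + pack_entry(2, "..")
         + pack_entry(12, "a_rather_long_file_name.txt"))
entries = list(iter_entries(block))
print(entries)
```

Because each entry records its own length, a short name such as "." costs 12 bytes in this model while the 27-character name costs 36, which is how long names are supported without wasting space on short ones.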
• Fourth Extended Filesystem (Ext4 FS)

Compatibility

Any existing Ext3 filesystem can be migrated to Ext4 with an easy procedure, which consists of running a couple of commands in read-only mode.

Migrate existing Ext3 filesystems to Ext4

You need to use the tune2fs and fsck tools on the filesystem, and that filesystem needs to be unmounted. Run:

    tune2fs -O extents,uninit_bg,dir_index /dev/yourfilesystem

After running this command you MUST run fsck. If you don't, Ext4 WILL NOT MOUNT your filesystem. This fsck run is needed to return the filesystem to a consistent state. It WILL tell you that it finds checksum errors in the group descriptors. This is expected, and it is exactly what needs to be rebuilt before the filesystem can be mounted as Ext4, so don't be surprised by them. Each time fsck finds one of those errors it asks you what to do; always say YES. If you don't want to be asked, add the "-p" parameter to the fsck command, which means "automatic repair":

    (e2)fsck -pfDC0 /dev/yourfilesystem

Bigger filesystem/file sizes

Currently, Ext3 supports a maximum filesystem size of 16 TB and a maximum file size of 2 TB. Ext4 adds 48-bit block addressing, so it will have a maximum filesystem size of 1 EB and a maximum file size of 16 TB. 1 EB = 1,048,576 TB (1 EB = 1024 PB, 1 PB = 1024 TB, 1 TB = 1024 GB).

Sub directory scalability

Right now the maximum possible number of sub directories contained in a single directory in Ext3 is 32000. Ext4 breaks that limit and allows an unlimited number of sub directories.

Extents

Traditional Unix-derived filesystems like Ext3 use an indirect block mapping scheme to keep track of each block that holds the data of a file. This is inefficient for large files, especially on large file delete and truncate operations, because the mapping keeps an entry for every single block, and big files have many blocks: the mappings become huge and slow to handle. Modern filesystems use a different approach called "extents".
An extent is basically a run of contiguous physical blocks.

Multiblock allocation

When Ext3 needs to write new data to the disk, a block allocator decides which free blocks will be used to write the data. But the Ext3 block allocator only allocates one block (4 KB) at a time. That means that if the system needs to write 100 MB of data, it will need to call the block allocator 25600 times (and that is just 100 MB!). Ext4 uses a "multiblock allocator" (mballoc), which allocates many blocks in a single call instead of a single block per call, avoiding a lot of overhead. This improves performance, and it is particularly useful with delayed allocation and extents. This feature doesn't affect the disk format. Also, note that the Ext4 block/inode allocator has other improvements, described in detail in this paper.

Delayed allocation

Delayed allocation is a performance feature (it doesn't change the disk format) found in a few modern filesystems such as XFS, ZFS, btrfs or Reiser 4. It consists of delaying the allocation of blocks as much as possible, contrary to what traditional filesystems (such as Ext3, reiser3, etc.) do: allocate the blocks as soon as possible. Ext4 delayed allocation, on the other hand, does not allocate the blocks immediately when the process calls write(); rather, it delays the allocation of the blocks while the file is kept in cache, until it is really going to be written to the disk. This gives the block allocator the opportunity to optimize the allocation in situations where the old system couldn't.

Fast fsck

Fsck is a very slow operation, especially the first step: checking all the inodes in the filesystem. In Ext4, a list of unused inodes (with a checksum, for safety) is stored at the end of each group's inode table, so fsck does not check those inodes. The result is that total fsck time improves by a factor of 2 to 20, depending on the number of used inodes.
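The extent and allocator arithmetic from the Extents and Multiblock allocation sections can be checked with a small Python sketch. It assumes 4 KB blocks, as in the text, and models an extent simply as a (start block, length) pair. The function names and the starting block number 5000 are hypothetical and chosen only for illustration; this is not how the kernel's mballoc or extent tree is implemented.

```python
BLOCK_SIZE = 4 * 1024  # 4 KB blocks, as assumed in the text


def blocks_needed(size_bytes, block_size=BLOCK_SIZE):
    """Number of blocks required to hold size_bytes (ceiling division)."""
    return -(-size_bytes // block_size)


def to_extents(block_numbers):
    """Collapse a sorted list of physical block numbers into (start, length) runs."""
    extents = []
    for b in block_numbers:
        if extents and extents[-1][0] + extents[-1][1] == b:
            start, length = extents[-1]
            extents[-1] = (start, length + 1)  # block continues the current run
        else:
            extents.append((b, 1))  # block starts a new run
    return extents


file_size = 100 * 1024 * 1024          # the 100 MB example
n_blocks = blocks_needed(file_size)    # one allocator call per block in the Ext3 model
contiguous = list(range(5000, 5000 + n_blocks))  # hypothetical contiguous placement
n_extents = len(to_extents(contiguous))

print(n_blocks, n_extents)
```

In this model a contiguously placed 100 MB file needs 25600 per-block mapping entries (and as many single-block allocator calls) but only one extent, while a fragmented file such as blocks 10, 11, 12, 20, 21, 40 collapses into three extents.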
Journal checksumming

The journal is the most used part of the disk, making the blocks that form part of it more prone to hardware failure.