File Systems
Parallel Storage Systems, 2021-04-20
Jun.-Prof. Dr. Michael Kuhn ([email protected])
Parallel Computing and I/O, Institute for Intelligent Cooperating Systems
Faculty of Computer Science, Otto von Guericke University Magdeburg
https://parcio.ovgu.de

Outline (Review)
• Review
• Introduction
• Structure
• Example: ext4
• Alternatives
• Summary

Storage Devices (Review)
• Which hard-disk drive parameter is increasing at the slowest rate?
  1. Capacity
  2. Throughput
  3. Latency
  4. Density

Storage Devices (Review)
• Which RAID level does not provide redundancy?
  1. RAID 0
  2. RAID 1
  3. RAID 5
  4. RAID 6

Storage Devices (Review)
• Which problem is called the write hole?
  1. Inconsistency due to non-atomic data/parity update
  2. Incorrect parity calculation
  3. Storage device failure during reconstruction
  4. Partial stripe update

Outline (Introduction)

Motivation (Introduction)
1. File systems provide structure
  • File systems typically use a hierarchical organization
  • Hierarchy is built from files and directories
  • Other approaches: tagging, queries etc.
2. File systems manage data and metadata
  • They are responsible for block allocation and management
  • Access is handled via file and directory names
  • Metadata includes access permissions, time stamps etc.
• File systems use underlying storage devices
  • Devices can also be provided by storage arrays such as RAID
  • Common choices are Logical Volume Manager (LVM) and/or mdadm on Linux

Examples (Introduction)
• Linux: tmpfs, ext4, XFS, btrfs, ZFS
• Windows: FAT, exFAT, NTFS
• OS X: HFS+, APFS
• Universal: ISO9660, UDF
• Pseudo: sysfs, proc

Examples... (Introduction)
• Network: NFS, AFS, Samba
• Cryptographic: EncFS, eCryptfs
• Parallel distributed: Spectrum Scale, Lustre, OrangeFS, CephFS, GlusterFS
• Most of these use another underlying file system

I/O Interfaces (Introduction)
• I/O operations are realized using I/O interfaces
• Interfaces are available for different abstraction levels
• Interfaces forward operations to the actual file system
• Low-level interfaces provide basic functionality
  • POSIX, MPI-IO
• High-level interfaces provide more convenience
  • HDF, NetCDF, ADIOS

I/O Operations (Introduction)
• open can be used to open and create files
• Features many different flags and modes
  • O_RDWR: Open for reading and writing
  • O_CREAT: Create file if necessary
  • O_TRUNC: Truncate if it exists already
• Initial access happens via a path
• Afterwards, file descriptors can be used (with a few exceptions)
• All functions provide a return value
  • errno should be checked in case of errors (see the sketch below)

    fd = open("/path/to/file",
              O_RDWR | O_CREAT | O_TRUNC,
              S_IRUSR | S_IWUSR);

    rv = close(fd);
    rv = unlink("/path/to/file");
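The slide above only shows the calls themselves. As a complement, here is a minimal, self-contained sketch (not part of the deck) of the return-value and errno convention; the path /tmp/example-file and the message are placeholders chosen for illustration.

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        /* Open (and create) the file; the path is just an example. */
        int fd = open("/tmp/example-file", O_RDWR | O_CREAT | O_TRUNC,
                      S_IRUSR | S_IWUSR);
        if (fd < 0) {
            /* errno describes why the call failed. */
            fprintf(stderr, "open: %s\n", strerror(errno));
            return EXIT_FAILURE;
        }

        const char data[] = "Hello, file system!";
        ssize_t nb = write(fd, data, sizeof(data));
        if (nb < 0) {
            fprintf(stderr, "write: %s\n", strerror(errno));
        }

        if (close(fd) != 0) {
            fprintf(stderr, "close: %s\n", strerror(errno));
        }
        if (unlink("/tmp/example-file") != 0) {
            fprintf(stderr, "unlink: %s\n", strerror(errno));
        }

        return EXIT_SUCCESS;
    }

Every call is checked individually because, as the slide notes, the return value alone signals failure and errno then carries the reason.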
I/O Operations... (Introduction)

    nb = write(fd, data, sizeof(data));

• write returns the number of written bytes
  • Does not necessarily correspond to the given size (error handling!)
• write updates the file pointer internally (alternative: pwrite)
• Functions are provided by libc
  • libc performs system calls

Virtual File System (Switch) (Introduction)
• VFS is a central file system component in the kernel
• Provides a standardized interface for all file systems (POSIX)
  • Defines file system structure and interface for the most part
• Forwards operations performed by applications to the corresponding file system
  • File system is selected based on the mount point
• Enables supporting a wide range of different file systems
  • Applications are still portable due to POSIX

Virtual File System (Switch)... (Introduction)
[Figure: The Linux Storage Stack Diagram, showing the path from applications through the VFS, the block-based, network, pseudo, special-purpose and stackable file systems, the block layer and the device drivers down to the physical devices; http://www.thomas-krenn.com/en/wiki/Linux_Storage_Stack_Diagram, created by Werner Fischer and Georg Schönberger, CC-BY-SA 3.0] [Fischer and Schönberger, 2017]
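To make the VFS dispatch concrete, the following sketch (not from the slides, Linux-specific) calls statfs(2) on a few paths and prints the file system magic number the kernel reports for each mount point; the chosen paths are only examples. The application code is identical regardless of which file system is mounted there.

    #include <stdio.h>
    #include <sys/vfs.h>   /* statfs(2), Linux-specific */

    /* Print the file system type magic for a given path. */
    static void print_fs_type(const char *path)
    {
        struct statfs buf;

        if (statfs(path, &buf) != 0) {
            perror(path);
            return;
        }

        /* f_type identifies the file system the VFS selected for this
           mount point (e.g. ext4, proc or tmpfs magic numbers). */
        printf("%s: f_type = 0x%lx\n", path, (unsigned long)buf.f_type);
    }

    int main(void)
    {
        /* The same system call works for every mounted file system. */
        print_fs_type("/");
        print_fs_type("/proc");
        print_fs_type("/tmp");

        return 0;
    }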
Virtual File System (Switch)... (Introduction)
[Figure: the Linux Storage Stack Diagram again (version 4.10, 2017-03-10), annotated with the steps below] [Fischer and Schönberger, 2017]
• Applications call functions in userspace
• libc performs system calls
• System calls are handled by VFS
• VFS determines correct file system instance
• Data is read/written via page cache or directly
• Block layer handles communication with devices

Outline (Structure)

File System Objects (Structure)
• Differences between user and system points of view
  • Users deal with files and directories that contain data and metadata
  • Files consist of bytes, directories contain files and further directories
  • The system manages all internals
    • Combines individual blocks into files etc.
• Inodes
  • The most basic data structure in POSIX file systems
  • Each file and directory is represented by an inode (see stat)
  • Inodes contain mostly metadata
    • Some of the metadata is visible to users, some is internal
  • Inodes are typically referenced by ID and have a fixed size

File System Objects... (Structure)
• Files
  • Files contain data in the form of a byte array
    • POSIX specifies that data is a byte stream
  • Data can be read/written using explicit functions
  • Data can also be mapped into memory for implicit access
• Directories
  • Directories organize the file system's namespace
  • They can contain files and further directories
    • Directories within directories lead to a hierarchical namespace
  • From a user's point of view, directories are a list of entries
  • Internally, file systems often use tree structures

B-Tree vs. B+-Tree (Structure)
[Figure: example B-tree with the keys 1, 2, 5, 6, 7, 9, 12, 16, 18 and 21] [CyHawk, 2010]
• B-trees are generalized binary trees
  • They are optimized for systems that read/write large blocks
  • Pointers and data are mixed in the tree

B-Tree vs. B+-Tree... (Structure)
[Figure: the same keys as a B+-tree; all keys are repeated in the leaf nodes] [CyHawk, 2010]
• B+-trees are a modification of B-trees
  • Data is only stored in leaf nodes
  • Advantageous for caching since the smaller inner nodes are easier to cache
• Used in NTFS, XFS etc.

Alternative Trees (Structure)
• H-trees
  • Based on B-trees
  • Different handling of hash collisions
  • Used in ext3 and ext4
• Bε-trees
  • Optimized for write operations
  • Improved performance for insert, range query and update operations

Files (Structure)

    nb = pwrite(fd, data, sizeof(data), 42);
    nb = pread(fd, data, sizeof(data), 42);

• pwrite and pread behave like write and read
  • They allow specifying the offset and do not modify the file pointer
  • Both functions are therefore thread-safe
• Access is done via an open file descriptor
  • Can be used in parallel by multiple threads (see the sketch below)
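Picking up the pwrite/pread slide above: a minimal sketch (not part of the deck) in which two threads share one file descriptor and read disjoint regions concurrently. The path and the region size are placeholders; build with -pthread.

    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Each thread reads its own region of the shared file descriptor.
       Since pread takes an explicit offset and does not touch the shared
       file pointer, no synchronization between the threads is needed. */

    struct task {
        int fd;
        off_t offset;
    };

    static void *reader(void *arg)
    {
        struct task *t = arg;
        char buf[4096];

        ssize_t nb = pread(t->fd, buf, sizeof(buf), t->offset);
        if (nb < 0) {
            perror("pread");
        } else {
            printf("read %zd bytes at offset %lld\n", nb, (long long)t->offset);
        }

        return NULL;
    }

    int main(void)
    {
        /* The path is only an example. */
        int fd = open("/tmp/example-data", O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        pthread_t threads[2];
        struct task tasks[2] = { { fd, 0 }, { fd, 4096 } };

        for (int i = 0; i < 2; i++)
            pthread_create(&threads[i], NULL, reader, &tasks[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(threads[i], NULL);

        close(fd);
        return 0;
    }

Because each call carries its own offset, the threads never race on a shared file pointer, which is exactly the property the slide points out.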