NOVA-Fortis: a Fault-Tolerant Non-Volatile Main Memory File

Total Page:16

File Type:pdf, Size:1020Kb

NOVA-Fortis: a Fault-Tolerant Non-Volatile Main Memory File NOVA-Fortis: A Fault-Tolerant Non- Volatile Main Memory File System Jian Andiry Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase, Tamires Brito Da Silva, Andy Rudoff (Intel), Steven Swanson Non-Volatile Systems Laboratory Department of Computer Science and Engineering University of California, San Diego 1 Non-volatile Memory and DAX • Non-volatile main memory (NVMM) – PCM, STT-RAM, ReRAM, 3D XPoint technology – Reside on memory bus, load/store interface Application load/store load/store DRAM NVMM File system HDD / SSD 2 Non-volatile Memory and DAX • Non-volatile main memory (NVMM) – PCM, STT-RAM, ReRAM, 3D XPoint technology – Reside on memory bus, load/store interface Application • Direct Access (DAX) mmap() DAX-mmap() – DAX file I/O bypasses the page cache – DAX-mmap() maps NVMM pages to application DRAM NVMM address space directly and bypasses file system – “Killer app” copy HDD / SSD 3 Application expectations on NVMM File System DAX POSIX I/O Atomicity Fault Direct Speed Tolerance Access 4 ext4 xfs BtrFS F2FS DAX POSIX I/O Atomicity Fault Direct Speed Tolerance Access ✔ ❌ ✔❌ ❌ 5 PMFS ext4-DAX xfs-DAX DAX ✔ ❌ ❌ ✔✔ 6 Fault Direct Speed Strata SOSP ’17 DAX ✔✔ ❌ ❌ ✔ 7 Fault Direct Speed NOVA FAST ’16 DAX ✔ ✔ ❌ ✔ ✔ 8 Fault Direct Speed NOVA-Fortis DAX ✔ ✔✔✔ ✔ 9 Fault Direct Speed Challenges DAX 10 NOVA: Log-structured FS for NVMM • Per-inode logging – High concurrency Per-inode logging – Parallel recovery • High scalability Inode Head Tail – Per-core allocator, journal and inode table Inode log • Atomicity Data Data Data – Logging for single inode update – Journaling for update across logs – Copy-on-Write for file data Jian Xu and Steven Swanson, NOVA: A Log-structured File System for Hybrid Volatile/Non-volatile Main Memories, FAST ’16. 11 Snapshot 13 Snapshot support • Snapshot is essential for file system backup • Widely used in enterprise file systems – ZFS, Btrfs, WAFL • Snapshot is not available with DAX file systems 14 Snapshot for normal file I/O Current snapshot 012 write(0, 4K); take_snapshot(); write(0, 4K); File log 0 Page 0 1 Page 0 1 Page 0 2 Page 0 write(0, 4K); take_snapshot(); Data Data Data Data write(0, 4K); recover_snapshot(1); File write entry Data in snapshot Reclaimed data Current data 15 Memory Ordering With DAX-mmap() D V Valid D = 42; ? False ✓ Fence(); 42 False ✓ V = True; 42 True ✓ ? True ✗ • Recovery invariant: if V == True, then D is valid 16 Memory Ordering With DAX-mmap() D = 42; Application D V Fence(); DAX-mmap() V = True; NVMM Page 1 Page 3 • Recovery invariant: if V == True, then D is valid • D and V live in two pages of a mmap()’d region. 17 DAX Snapshot: Idea • Set pages read-only, then copy-on-write Applications: no file system intervention File system: DAX-mmap() File data: RO 18 DAX Snapshot: Incorrect implementation • Application invariant: if V is True, then D is valid Application Application NOVA Snapshot thread values snapshot values D = ?; D V D V V = False; ? F snapshot_begin(); set_read_only(page_d); ? D = 42;page fault 42 F copy_on_write(page_d); V = True; 42 T set_read_only(page_v); ? T snapshot_end(); ? T 19 DAX Snapshot: Correct implementation • Delay CoW page faults completion until all pages are read-only Application Application NOVA Snapshot thread values snapshot values D = ?; D V D V V = False; ? F snapshot_begin(); set_read_only(page_d); ? D = 42;page fault set_read_only(page_v); ? F snapshot_end(); ? F 42 F copy_on_write(page_d); V = True; 42 T copy_on_write(page_v); 20 Performance impact of snapshots • Normal execution vs. taking snapshots every 10s – Negligible performance loss through read()/write() – Average performance loss 3.7% through mmap() W/O snapshot W snapshot 1.2 Filebench (read/write) WHISPER (DAX-mmap()) 1 0.8 0.6 0.4 0.2 0 21 Protecting Metadata and Data 22 NVMM Failure Modes • Detectable errors – Media errors detected by NVMM controller Software: Receives MCE – Raises Machine Check Exception Detects uncorrectable errors (MCE) NVMM Ctrl.: Raises exception • Undetectable errors Read NVMM data: Media error – Media errors not detected by NVMM controller – Software scribbles 23 NVMM Failure Modes • Detectable errors – Media errors detected by NVMM controller Software: Consumes corrupted data – Raises Machine Check Exception (MCE) NVMM Ctrl.: Sees no error • Undetectable errors Read NVMM data: Media error – Media errors not detected by NVMM controller – Software scribbles 24 NVMM Failure Modes • Detectable errors – Media errors detected by NVMM controller Software: Bug code scribbles NVMM – Raises Machine Check Exception Write (MCE) NVMM Ctrl.: Updates ECC • Undetectable errors NVMM data: Scribble error – Media errors not detected by NVMM controller – Software scribbles 25 NOVA-Fortis Metadata Protection • Detection inode’ Head’ Tail’ csumcsum’’ H1’ T1’ – CRC32 checksums in all structures – Use memcpy_mcsafe() to catch inode Head Tail csum H1 T1 MCEs • Correction log ent1 c1 … entN cN – Replicate all metadata: inodes, logs, superblock, etc. log’ ent1’ c1’ … entN’ cN’ – Tick-tock: persist primary before updating replica Data 1 Data 2 26 NOVA-Fortis Data Protection inode’ Head’ Tail’ csumcsum’’ H1’ T1’ inode Head Tail csum H1 T1 • Metadata – CRC32 + replication for all structures log ent1 c1 … entN cN • Data c1 log’ ent1’ … entN’ cN’ ’ – RAID-4 style parity – Replicated checksums Data 1 Data 2 1 Block (8 stripes) S0 S1 S2 S3 S4 S5 S6 S7 P P = ⊕ S0..7 Ci = CRC32C(Si) Replicated 27 File data protection with DAX-mmap • Stores are invisible to the file systems • The file systems cannot protect mmap’ed data • NOVA-Fortis’ data protection contract: DAX NOVA-Fortis protects pages from media errors and scribbles iff they are not mmap()’d for writing. 28 File data protection with DAX-mmap • NOVA-Fortis logs mmap() operations User-space load/store load/store Applications: Kernel-space NOVA-Fortis: read/write mmap() NVDIMMs File data: protected unprotected File log: mmap log entry 29 File data protection with DAX-mmap • On munmap and during recovery, NOVA-Fortis restores protection User-space load/store munmap() Applications: Kernel-space NOVA-Fortis: read/write mmap() NVDIMMs Protection restored File data: File log: 30 File data protection with DAX-mmap • On munmap and during recovery, NOVA-Fortis restores protection User-space System Failure + Applications: recovery Kernel-space NOVA-Fortis: read/write mmap() NVDIMMs File data: File log: 31 Performance 32 Latency breakdown VFS alloc inode journaling memcpy_mcsafe memcpy_nocache append entry free old data calculate entry csum verify entry csum replicate inode replicate log verify data csum update data csum update data parity Create Append 4KB Overwrite 4KB Overwrite 512B Read 4KB Read 16KB 0 1 2 3 4 5 6 Latency (microsecond) 33 Latency breakdown VFS alloc inode journaling memcpy_mcsafe memcpy_nocache append entry free old data calculate entry csum verify entry csum replicate inode replicate log verify data csum update data csum update data parity Create Append 4KB Overwrite 4KB Overwrite 512B Read 4KB Read 16KB 0 1 2 3 4 5 6 Latency (microsecond) Metadata Protection 34 Latency breakdown VFS alloc inode journaling memcpy_mcsafe memcpy_nocache append entry free old data calculate entry csum verify entry csum replicate inode replicate log verify data csum update data csum update data parity Create Append 4KB Overwrite 4KB Overwrite 512B Read 4KB Read 16KB 0 1 2 3 4 5 6 Latency (microsecond) Metadata Protection Data Protection 35 Application performance Normalized throughput 1.2 1 0.8 0.6 0.4 Normalized throughput Normalized 0.2 0 Fileserver Varmail MongoDB SQLite TPCC Average ext4-DAX Btrfs NOVA w/ MP w/ MP+DP 36 Conclusion • Fault tolerance is critical for file system, but existing DAX file systems don’t provide it • We identify new challenges that NVMM file system fault tolerance poses • NOVA-Fortis provides fault tolerance with high performance – 1.5x on average to DAX-aware file systems without reliability features – 3x on average to other reliable file systems 37 Give a try https://github.com/NVSL/linux-nova 38 Thanks! 39 Backup slides 40 Hybrid DRAM/NVMM system • Non-volatile main memory (NVMM) – PCM, STT-RAM, ReRAM, 3D XPoint technology • File system for NVMM Host CPU NVMM FS DRAM NVMM 41 Disk-based file systems are inadequate for NVMM Ext4 Ext4 Ext4 • Ext4, xfs, Btrfs, F2FS, NILFS2 Atomicity Btrfs xfs wb order dataj • 1-Sector Built for hard disks and SSDs ✓ ✓ ✓ ✓ ✓ overwrite – Software overhead is high 1-Sector ✗ ✓ ✓ ✓ ✓ – CPU may reorder writes to NVMM append 1-Block ✗ ✗ ✓ ✓ ✗ – NVMM has different atomicity guarantees overwrite 1-Block ✗ ✓ ✓ ✓ ✓ append • Cannot exploit NVMM performance N-Block ✗ ✗ ✗ ✗ ✗ • write/append Performance optimization compromises N-Block ✗ ✓ ✓ ✓ ✓ consistency on system failure [1] prefix/append [1] Pillai et al, All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications, OSDI '14. 42 NVMM file systems are not strongly consistent • BPFS, PMFS, Ext4-DAX, SCMFS, Aerie • None of them provide strong metadata and data consistency Metadata Data Mmap File system atomicity atomicity Atomicity [1] BPFS Yes Yes [2] No PMFS Yes No No Ext4-DAX Yes No No SCMFS No No No Aerie Yes No No NOVA Yes Yes Yes [1] Each msync() commits updates atomically. [2] In BPFS, write times are not updated atomically with respect to the write itself. 43 Why LFS? • Log-structuring provides cheaper atomicity than journaling and shadow paging • NVMM supports fast, highly concurrent random accesses – Using multiple logs does not negatively impact performance
Recommended publications
  • The Kernel Report
    The kernel report (ELC 2012 edition) Jonathan Corbet LWN.net [email protected] The Plan Look at a year's worth of kernel work ...with an eye toward the future Starting off 2011 2.6.37 released - January 4, 2011 11,446 changes, 1,276 developers VFS scalability work (inode_lock removal) Block I/O bandwidth controller PPTP support Basic pNFS support Wakeup sources What have we done since then? Since 2.6.37: Five kernel releases have been made 59,000 changes have been merged 3069 developers have contributed to the kernel 416 companies have supported kernel development February As you can see in these posts, Ralink is sending patches for the upstream rt2x00 driver for their new chipsets, and not just dumping a huge, stand-alone tarball driver on the community, as they have done in the past. This shows a huge willingness to learn how to deal with the kernel community, and they should be strongly encouraged and praised for this major change in attitude. – Greg Kroah-Hartman, February 9 Employer contributions 2.6.38-3.2 Volunteers 13.9% Wolfson Micro 1.7% Red Hat 10.9% Samsung 1.6% Intel 7.3% Google 1.6% unknown 6.9% Oracle 1.5% Novell 4.0% Microsoft 1.4% IBM 3.6% AMD 1.3% TI 3.4% Freescale 1.3% Broadcom 3.1% Fujitsu 1.1% consultants 2.2% Atheros 1.1% Nokia 1.8% Wind River 1.0% Also in February Red Hat stops releasing individual kernel patches March 2.6.38 released – March 14, 2011 (9,577 changes from 1198 developers) Per-session group scheduling dcache scalability patch set Transmit packet steering Transparent huge pages Hierarchical block I/O bandwidth controller Somebody needs to get a grip in the ARM community.
    [Show full text]
  • Trusted Docker Containers and Trusted Vms in Openstack
    Trusted Docker Containers and Trusted VMs in OpenStack Raghu Yeluri Abhishek Gupta Outline o Context: Docker Security – Top Customer Asks o Intel’s Focus: Trusted Docker Containers o Who Verifies Trust ? o Reference Architecture with OpenStack o Demo o Availability o Call to Action Docker Overview in a Slide.. Docker Hub Lightweight, open source engine for creating, deploying containers Provides work flow for running, building and containerizing apps. Separates apps from where they run.; Enables Micro-services; scale by composition. Underlying building blocks: Linux kernel's namespaces (isolation) + cgroups (resource control) + .. Components of Docker Docker Engine – Runtime for running, building Docker containers. Docker Repositories(Hub) - SaaS service for sharing/managing images Docker Images (layers) Images hold Apps. Shareable snapshot of software. Container is a running instance of image. Orchestration: OpenStack, Docker Swarm, Kubernetes, Mesos, Fleet, Project Docker Layers Atomic, Lattice… Docker Security – 5 key Customer Asks 1. How do you know that the Docker Host Integrity is there? o Do you trust the Docker daemon? o Do you trust the Docker host has booted with Integrity? 2. How do you verify Docker Container Integrity o Who wrote the Docker image? Do you trust the image? Did the right Image get launched? 3. Runtime Protection of Docker Engine & Enhanced Isolation o How can Intel help with runtime Integrity? 4. Enterprise Security Features – Compliance, Manageability, Identity authentication.. Etc. 5. OpenStack as a single Control Plane for Trusted VMs and Trusted Docker Containers.. Intel’s Focus: Enable Hardware-based Integrity Assurance for Docker Containers – Trusted Docker Containers Trusted Docker Containers – 3 focus areas o Launch Integrity of Docker Host o Runtime Integrity of Docker Host o Integrity of Docker Images Today’s Focus: Integrity of Docker Host, and how to use it in OpenStack.
    [Show full text]
  • Rootless Containers with Podman and Fuse-Overlayfs
    CernVM Workshop 2019 (4th June 2019) Rootless containers with Podman and fuse-overlayfs Giuseppe Scrivano @gscrivano Introduction 2 Rootless Containers • “Rootless containers refers to the ability for an unprivileged user (i.e. non-root user) to create, run and otherwise manage containers.” (https://rootlesscontaine.rs/ ) • Not just about running the container payload as an unprivileged user • Container runtime runs also as an unprivileged user 3 Don’t confuse with... • sudo podman run --user foo – Executes the process in the container as non-root – Podman and the OCI runtime still running as root • USER instruction in Dockerfile – same as above – Notably you can’t RUN dnf install ... 4 Don’t confuse with... • podman run --uidmap – Execute containers as a non-root user, using user namespaces – Most similar to rootless containers, but still requires podman and runc to run as root 5 Motivation of Rootless Containers • To mitigate potential vulnerability of container runtimes • To allow users of shared machines (e.g. HPC) to run containers without the risk of breaking other users environments • To isolate nested containers 6 Caveat: Not a panacea • Although rootless containers could mitigate these vulnerabilities, it is not a panacea , especially it is powerless against kernel (and hardware) vulnerabilities – CVE 2013-1858, CVE-2015-1328, CVE-2018-18955 • Castle approach : it should be used in conjunction with other security layers such as seccomp and SELinux 7 Podman 8 Rootless Podman Podman is a daemon-less alternative to Docker • $ alias
    [Show full text]
  • F2FS) Overview
    Flash Friendly File System (F2FS) Overview Leon Romanovsky [email protected] www.leon.nu November 17, 2012 Leon Romanovsky [email protected] F2FS Overview Disclaimer Everything in this lecture shall not, under any circumstances, hold any legal liability whatsoever. Any usage of the data and information in this document shall be solely on the responsibility of the user. This lecture is not given on behalf of any company or organization. Leon Romanovsky [email protected] F2FS Overview Introduction: Flash Memory Definition Flash memory is a non-volatile storage device that can be electrically erased and reprogrammed. Challenges block-level access wear leveling read disturb bad blocks management garbage collection different physics different interfaces Leon Romanovsky [email protected] F2FS Overview Introduction: Flash Memory Definition Flash memory is a non-volatile storage device that can be electrically erased and reprogrammed. Challenges block-level access wear leveling read disturb bad blocks management garbage collection different physics different interfaces Leon Romanovsky [email protected] F2FS Overview Introduction: General System Architecture Leon Romanovsky [email protected] F2FS Overview Introduction: File Systems Optimized for disk storage EXT2/3/4 BTRFS VFAT Optimized for flash, but not aware of FTL JFFS/JFFS2 YAFFS LogFS UbiFS NILFS Leon Romanovsky [email protected] F2FS Overview Background: LFS vs. Unix FS Leon Romanovsky [email protected] F2FS Overview Background: LFS Overview Leon Romanovsky [email protected] F2FS Overview Background: LFS Garbage Collection 1 A victim segment is selected through referencing segment usage table. 2 It loads parent index structures of all the data in the victim identified by segment summary blocks.
    [Show full text]
  • Dm-X: Protecting Volume-Level Integrity for Cloud Volumes and Local
    dm-x: Protecting Volume-level Integrity for Cloud Volumes and Local Block Devices Anrin Chakraborti Bhushan Jain Jan Kasiak Stony Brook University Stony Brook University & Stony Brook University UNC, Chapel Hill Tao Zhang Donald Porter Radu Sion Stony Brook University & Stony Brook University & Stony Brook University UNC, Chapel Hill UNC, Chapel Hill ABSTRACT on Amazon’s Virtual Private Cloud (VPC) [2] (which provides The verified boot feature in recent Android devices, which strong security guarantees through network isolation) store deploys dm-verity, has been overwhelmingly successful in elim- data on untrusted storage devices residing in the public cloud. inating the extremely popular Android smart phone rooting For example, Amazon VPC file systems, object stores, and movement [25]. Unfortunately, dm-verity integrity guarantees databases reside on virtual block devices in the Amazon are read-only and do not extend to writable volumes. Elastic Block Storage (EBS). This paper introduces a new device mapper, dm-x, that This allows numerous attack vectors for a malicious cloud efficiently (fast) and reliably (metadata journaling) assures provider/untrusted software running on the server. For in- volume-level integrity for entire, writable volumes. In a direct stance, to ensure SEC-mandated assurances of end-to-end disk setup, dm-x overheads are around 6-35% over ext4 on the integrity, banks need to guarantee that the external cloud raw volume while offering additional integrity guarantees. For storage service is unable to remove log entries documenting cloud storage (Amazon EBS), dm-x overheads are negligible. financial transactions. Yet, without integrity of the storage device, a malicious cloud storage service could remove logs of a coordinated attack.
    [Show full text]
  • Providing User Security Guarantees in Public Infrastructure Clouds
    1 Providing User Security Guarantees in Public Infrastructure Clouds Nicolae Paladi, Christian Gehrmann, and Antonis Michalas Abstract—The infrastructure cloud (IaaS) service model offers improved resource flexibility and availability, where tenants – insulated from the minutiae of hardware maintenance – rent computing resources to deploy and operate complex systems. Large-scale services running on IaaS platforms demonstrate the viability of this model; nevertheless, many organizations operating on sensitive data avoid migrating operations to IaaS platforms due to security concerns. In this paper, we describe a framework for data and operation security in IaaS, consisting of protocols for a trusted launch of virtual machines and domain-based storage protection. We continue with an extensive theoretical analysis with proofs about protocol resistance against attacks in the defined threat model. The protocols allow trust to be established by remotely attesting host platform configuration prior to launching guest virtual machines and ensure confidentiality of data in remote storage, with encryption keys maintained outside of the IaaS domain. Presented experimental results demonstrate the validity and efficiency of the proposed protocols. The framework prototype was implemented on a test bed operating a public electronic health record system, showing that the proposed protocols can be integrated into existing cloud environments. Index Terms—Security; Cloud Computing; Storage Protection; Trusted Computing F 1 INTRODUCTION host level. While support data encryption at rest is offered by several cloud providers and can be configured by tenants Cloud computing has progressed from a bold vision to mas- in their VM instances, functionality and migration capabil- sive deployments in various application domains. However, ities of such solutions are severely restricted.
    [Show full text]
  • Beej's Guide to Unix IPC
    Beej's Guide to Unix IPC Brian “Beej Jorgensen” Hall [email protected] Version 1.1.3 December 1, 2015 Copyright © 2015 Brian “Beej Jorgensen” Hall This guide is written in XML using the vim editor on a Slackware Linux box loaded with GNU tools. The cover “art” and diagrams are produced with Inkscape. The XML is converted into HTML and XSL-FO by custom Python scripts. The XSL-FO output is then munged by Apache FOP to produce PDF documents, using Liberation fonts. The toolchain is composed of 100% Free and Open Source Software. Unless otherwise mutually agreed by the parties in writing, the author offers the work as-is and makes no representations or warranties of any kind concerning the work, express, implied, statutory or otherwise, including, without limitation, warranties of title, merchantibility, fitness for a particular purpose, noninfringement, or the absence of latent or other defects, accuracy, or the presence of absence of errors, whether or not discoverable. Except to the extent required by applicable law, in no event will the author be liable to you on any legal theory for any special, incidental, consequential, punitive or exemplary damages arising out of the use of the work, even if the author has been advised of the possibility of such damages. This document is freely distributable under the terms of the Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License. See the Copyright and Distribution section for details. Copyright © 2015 Brian “Beej Jorgensen” Hall Contents 1. Intro................................................................................................................................................................1 1.1. Audience 1 1.2. Platform and Compiler 1 1.3.
    [Show full text]
  • In Search of Optimal Data Placement for Eliminating Write Amplification in Log-Structured Storage
    In Search of Optimal Data Placement for Eliminating Write Amplification in Log-Structured Storage Qiuping Wangy, Jinhong Liy, Patrick P. C. Leey, Guangliang Zhao∗, Chao Shi∗, Lilong Huang∗ yThe Chinese University of Hong Kong ∗Alibaba Group ABSTRACT invalidated by a live block; a.k.a. the death time [12]) to achieve Log-structured storage has been widely deployed in various do- the minimum possible WA. However, without obtaining the fu- mains of storage systems for high performance. However, its garbage ture knowledge of the BIT pattern, how to design an optimal data collection (GC) incurs severe write amplification (WA) due to the placement scheme with the minimum WA remains an unexplored frequent rewrites of live data. This motivates many research stud- issue. Existing temperature-based data placement schemes that ies, particularly on data placement strategies, that mitigate WA in group blocks by block temperatures (e.g., write/update frequencies) log-structured storage. We show how to design an optimal data [7, 16, 22, 27, 29, 35, 36] are arguably inaccurate to capture the BIT placement scheme that leads to the minimum WA with the fu- pattern and fail to group the blocks with similar BITs [12]. ture knowledge of block invalidation time (BIT) of each written To this end, we propose SepBIT, a novel data placement scheme block. Guided by this observation, we propose SepBIT, a novel data that aims to minimize the WA in log-structured storage. It infers placement algorithm that aims to minimize WA in log-structured the BITs of written blocks from the underlying storage workloads storage.
    [Show full text]
  • ECE 598 – Advanced Operating Systems Lecture 19
    ECE 598 { Advanced Operating Systems Lecture 19 Vince Weaver http://web.eece.maine.edu/~vweaver [email protected] 7 April 2016 Announcements • Homework #7 was due • Homework #8 will be posted 1 Why use FAT over ext2? • FAT simpler, easy to code • FAT supported on all major OSes • ext2 faster, more robust filename and permissions 2 btrfs • B-tree fs (similar to a binary tree, but with pages full of leaves) • overwrite filesystem (overwite on modify) vs CoW • Copy on write. When write to a file, old data not overwritten. Since old data not over-written, crash recovery better Eventually old data garbage collected • Data in extents 3 • Copy-on-write • Forest of trees: { sub-volumes { extent-allocation { checksum tree { chunk device { reloc • On-line defragmentation • On-line volume growth 4 • Built-in RAID • Transparent compression • Snapshots • Checksums on data and meta-data • De-duplication • Cloning { can make an exact snapshot of file, copy-on- write different than link, different inodles but same blocks 5 Embedded • Designed to be small, simple, read-only? • romfs { 32 byte header (magic, size, checksum,name) { Repeating files (pointer to next [0 if none]), info, size, checksum, file name, file data • cramfs 6 ZFS Advanced OS from Sun/Oracle. Similar in idea to btrfs indirect still, not extent based? 7 ReFS Resilient FS, Microsoft's answer to brtfs and zfs 8 Networked File Systems • Allow a centralized file server to export a filesystem to multiple clients. • Provide file level access, not just raw blocks (NBD) • Clustered filesystems also exist, where multiple servers work in conjunction.
    [Show full text]
  • CS 5600 Computer Systems
    CS 5600 Computer Systems Lecture 10: File Systems What are We Doing Today? • Last week we talked extensively about hard drives and SSDs – How they work – Performance characterisEcs • This week is all about managing storage – Disks/SSDs offer a blank slate of empty blocks – How do we store files on these devices, and keep track of them? – How do we maintain high performance? – How do we maintain consistency in the face of random crashes? 2 • ParEEons and MounEng • Basics (FAT) • inodes and Blocks (ext) • Block Groups (ext2) • Journaling (ext3) • Extents and B-Trees (ext4) • Log-based File Systems 3 Building the Root File System • One of the first tasks of an OS during bootup is to build the root file system 1. Locate all bootable media – Internal and external hard disks – SSDs – Floppy disks, CDs, DVDs, USB scks 2. Locate all the parEEons on each media – Read MBR(s), extended parEEon tables, etc. 3. Mount one or more parEEons – Makes the file system(s) available for access 4 The Master Boot Record Address Size Descripon Hex Dec. (Bytes) Includes the starEng 0x000 0 Bootstrap code area 446 LBA and length of 0x1BE 446 ParEEon Entry #1 16 the parEEon 0x1CE 462 ParEEon Entry #2 16 0x1DE 478 ParEEon Entry #3 16 0x1EE 494 ParEEon Entry #4 16 0x1FE 510 Magic Number 2 Total: 512 ParEEon 1 ParEEon 2 ParEEon 3 ParEEon 4 MBR (ext3) (swap) (NTFS) (FAT32) Disk 1 ParEEon 1 MBR (NTFS) 5 Disk 2 Extended ParEEons • In some cases, you may want >4 parEEons • Modern OSes support extended parEEons Logical Logical ParEEon 1 ParEEon 2 Ext.
    [Show full text]
  • Mmap and Dma
    CHAPTER THIRTEEN MMAP AND DMA This chapter delves into the area of Linux memory management, with an emphasis on techniques that are useful to the device driver writer. The material in this chap- ter is somewhat advanced, and not everybody will need a grasp of it. Nonetheless, many tasks can only be done through digging more deeply into the memory man- agement subsystem; it also provides an interesting look into how an important part of the kernel works. The material in this chapter is divided into three sections. The first covers the implementation of the mmap system call, which allows the mapping of device memory directly into a user process’s address space. We then cover the kernel kiobuf mechanism, which provides direct access to user memory from kernel space. The kiobuf system may be used to implement ‘‘raw I/O’’ for certain kinds of devices. The final section covers direct memory access (DMA) I/O operations, which essentially provide peripherals with direct access to system memory. Of course, all of these techniques requir e an understanding of how Linux memory management works, so we start with an overview of that subsystem. Memor y Management in Linux Rather than describing the theory of memory management in operating systems, this section tries to pinpoint the main features of the Linux implementation of the theory. Although you do not need to be a Linux virtual memory guru to imple- ment mmap, a basic overview of how things work is useful. What follows is a fairly lengthy description of the data structures used by the kernel to manage memory.
    [Show full text]
  • Ext4 File System and Crash Consistency
    1 Ext4 file system and crash consistency Changwoo Min 2 Summary of last lectures • Tools: building, exploring, and debugging Linux kernel • Core kernel infrastructure • Process management & scheduling • Interrupt & interrupt handler • Kernel synchronization • Memory management • Virtual file system • Page cache and page fault 3 Today: ext4 file system and crash consistency • File system in Linux kernel • Design considerations of a file system • History of file system • On-disk structure of Ext4 • File operations • Crash consistency 4 File system in Linux kernel User space application (ex: cp) User-space Syscalls: open, read, write, etc. Kernel-space VFS: Virtual File System Filesystems ext4 FAT32 JFFS2 Block layer Hardware Embedded Hard disk USB drive flash 5 What is a file system fundamentally? int main(int argc, char *argv[]) { int fd; char buffer[4096]; struct stat_buf; DIR *dir; struct dirent *entry; /* 1. Path name -> inode mapping */ fd = open("/home/lkp/hello.c" , O_RDONLY); /* 2. File offset -> disk block address mapping */ pread(fd, buffer, sizeof(buffer), 0); /* 3. File meta data operation */ fstat(fd, &stat_buf); printf("file size = %d\n", stat_buf.st_size); /* 4. Directory operation */ dir = opendir("/home"); entry = readdir(dir); printf("dir = %s\n", entry->d_name); return 0; } 6 Why do we care EXT4 file system? • Most widely-deployed file system • Default file system of major Linux distributions • File system used in Google data center • Default file system of Android kernel • Follows the traditional file system design 7 History of file system design 8 UFS (Unix File System) • The original UNIX file system • Design by Dennis Ritche and Ken Thompson (1974) • The first Linux file system (ext) and Minix FS has a similar layout 9 UFS (Unix File System) • Performance problem of UFS (and the first Linux file system) • Especially, long seek time between an inode and data block 10 FFS (Fast File System) • The file system of BSD UNIX • Designed by Marshall Kirk McKusick, et al.
    [Show full text]