NOVA: A High-Performance, Hardened File System for Non-Volatile Main Memories (NVSL presentation)


Slide 1: NOVA: A High-Performance, Hardened File System for Non-Volatile Main Memories
Jian Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase, Tamires Brito Da Silva, Andy Rudoff (Intel), Steven Swanson
Non-Volatile Systems Laboratory, Department of Computer Science and Engineering, University of California, San Diego

Slide 2: NVDIMM Usage Models
• Legacy file IO acceleration – fast and easy
  – Run existing IO-intensive apps on NVDIMMs; it "just works"
  – NOVA is 30% to 10x faster than Ext4 for write-intensive workloads
  – Need strong protections on data
• DAX mmap – maximum speed, but programming challenges
  – Load/store access
  – You still need a strongly consistent file system: file system corruption can still destroy your data, and NOVA is strongly consistent
  – Data protection is still critical
[Chart: legacy IO throughput, ops per second (x1000), Ext4-datajournal vs. NOVA]

Slide 3: Disk-based file systems (XFS, F2FS, NILFS, EXT4, BTRFS)

Slide 4: Disk-based file systems are inadequate for NVMM
• Disk-based file systems cannot exploit NVMM performance
• Ext4 performance optimization compromises consistency on system failure [1]

| File system | 1-Sector overwrite | 1-Sector append | 1-Block overwrite | 1-Block append | N-Block overwrite | N-Block append | Metadata protection | Data protection | Snapshots |
|-------------|--------------------|-----------------|-------------------|----------------|-------------------|----------------|---------------------|-----------------|-----------|
| Ext4 wb     | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ |
| Ext4 order  | ✓ | ✓ | ✗ | ✓ | ✗ | ✓ | ✓ | ✗ | ✓ |
| Ext4 dataj  | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ | ✓ | ✗ | ✓ |
| Btrfs       | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ |
| xfs         | ✓ | ✓ | ✗ | ✓ | ✗ | ✓ | ✓ | ✗ | ✓ |
| ReiserFS    | ✓ | ✓ | ✗ | ✓ | ✗ | ✓ | ✓ | ✗ | ✓ |

[1] Pillai et al., "All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications," OSDI '14.

Slide 5: NVMM file systems (BPFS, SCMFS, PMFS, Aerie, EXT4-DAX, M1FS, XFS-DAX)

Slide 6: NVMM file systems don't provide strong consistency or data protection
• DAX does not provide data atomicity guarantees
• Programming is more difficult

| File system | Metadata atomicity | Data atomicity | Metadata protection | Data protection | Snapshots |
|-------------|--------------------|----------------|---------------------|-----------------|-----------|
| BPFS        | ✓ | ✓ | ✗ | ✗ | ✗ |
| PMFS        | ✓ | ✗ | ✗ | ✗ | ✗ |
| Ext4-DAX    | ✓ | ✗ | ✓ | ✗ | ✗ |
| XFS-DAX     | ✓ | ✗ | ✓ | ✗ | ✗ |
| SCMFS       | ✗ | ✗ | ✗ | ✗ | ✗ |
| Aerie       | ✓ | ✗ | ✗ | ✗ | ✗ |

Slides 7–8: NOVA provides strong atomicity guarantees
• NOVA adds a row of ✓ to every column of both tables above: atomic 1-sector, 1-block, and N-block overwrites and appends, metadata and data atomicity, metadata and data protection, and snapshots.

Slide 9: NOVA's Key Features
• Features: high performance, strong consistency, snapshot support, data protection
• Usage models: open()/close() and read()/write(), and DAX-mmap()

Slide 10: NOVA's Architecture (diagram)

Slide 11: Core NOVA Structures – log structure + copy-on-write + journals
• One log per inode
• Logs are non-contiguous
• Fast, simple atomic log updates
• Logs hold metadata only
[Diagram: a per-file log with its Tail pointer]

Slide 12: Core NOVA Structures – log structure + copy-on-write + journals
• Copy-on-write of data pages gives multi-page atomic updates
• Fast allocation
• Instant data garbage collection
[Diagram: the file log points at data blocks Data 0, Data 1, Data 2; a write installs new blocks and advances the Tail]
(A code sketch of this write path follows.)
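The two slides above describe NOVA's core update protocol: new data is staged in freshly allocated blocks, a metadata-only entry is appended to the inode's log, and a single 8-byte tail-pointer update is the commit point. The user-space sketch below only illustrates that ordering; the malloc-backed "NVMM", the persist() stub, and every structure and function name are illustrative assumptions, not NOVA's actual kernel code.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE  4096
#define LOG_ENTRIES 64

struct write_entry {              /* metadata-only log entry */
    uint64_t file_offset;
    uint64_t num_blocks;
    void    *block_addr;          /* freshly allocated (copy-on-write) data */
};

struct inode_log {
    struct write_entry entries[LOG_ENTRIES];
    uint64_t tail;                /* commit point: entries[0..tail) are valid */
};

/* Stand-in for cache-line flush + fence on real NVMM (clwb/clflushopt, sfence). */
static void persist(const void *addr, size_t len) { (void)addr; (void)len; }

/* Write path: stage data, append a log entry, then bump the tail last. */
static int cow_write(struct inode_log *log, const void *buf,
                     uint64_t file_offset, uint64_t nblocks)
{
    if (log->tail == LOG_ENTRIES)
        return -1;

    /* 1. Copy-on-write: new data goes to fresh blocks; old data is untouched. */
    void *fresh = malloc(nblocks * BLOCK_SIZE);   /* stand-in for NVMM allocation */
    if (!fresh)
        return -1;
    memcpy(fresh, buf, nblocks * BLOCK_SIZE);
    persist(fresh, nblocks * BLOCK_SIZE);

    /* 2. Append a metadata-only log entry; it is not yet reachable. */
    struct write_entry *e = &log->entries[log->tail];
    e->file_offset = file_offset;
    e->num_blocks  = nblocks;
    e->block_addr  = fresh;
    persist(e, sizeof(*e));

    /* 3. Commit: the single 8-byte tail update makes the write visible. */
    log->tail += 1;
    persist(&log->tail, sizeof(log->tail));
    return 0;
}

int main(void)
{
    struct inode_log log = {0};
    char data[BLOCK_SIZE] = "hello, NVMM";

    if (cow_write(&log, data, 0, 1) == 0)
        printf("log tail after commit: %llu\n", (unsigned long long)log.tail);
    return 0;
}
```

Because the tail is updated last, a crash before the commit leaves the staged blocks and the new log entry unreachable, and after the commit the superseded blocks can be freed immediately — which is what the slides mean by simple atomic updates and instant data garbage collection.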
Slide 13: Core NOVA Structures – log structure + copy-on-write + journals
• Small, fixed-sized journals
• Used for complex operations
[Diagram: a journal records the directory log tail and the file log tail so that the directory log and the file log can be updated together]

Slide 14: Supporting Backups with Snapshots

Slide 15: Snapshots for Normal File Access
[Diagram: the current epoch advances 0 → 1 → 2 as snapshot 0 and snapshot 1 are taken; file write entries in the log (data at 0x1000, 0x2000, 0x3000, 0x4000) carry the epoch ID in which they were written, which distinguishes data belonging to a snapshot, reclaimed data, and current data]

Slide 16: Corrupt Snapshots with DAX-mmap()
• Recovery invariant: if V == True, then D is valid
• Incorrect: naïvely mark pages read-only one at a time
[Timeline: the application stores D = 1 and then V = True while pages are being marked read-only one at a time; the snapshot can capture V = True on one page without capturing D = 1 on the other, violating the invariant]

Slide 17: Consistent Snapshots with DAX-mmap()
• Recovery invariant: if V == True, then D is valid
• Correct: block page faults until all pages are read-only
[Timeline: copy-on-write faults are blocked until every mmap'ed page has been marked read-only, so the snapshot cannot record V = True without the matching D]

Slide 18: Performance impact of snapshots
• Normal execution vs. taking a snapshot every 10 s
  – Negligible performance loss through read()/write()
  – Average performance loss of 6.2% through mmap()
[Charts: conventional workloads and NVMM-aware workloads from WHISPER]

Slide 19: Data Protection: Metadata

Slide 20: NVMM Failure Modes – Media Failures
• Media errors
  – Detectable & correctable: transparent to software
  – Detectable & uncorrectable: affect a contiguous range of data and raise a machine check exception (MCE)
  – Undetectable: software may consume corrupted data
• Software scribbles
  – Kernel bugs or NOVA's own bugs
  – Transparent to hardware
[Diagram: detectable & correctable case — on a read, the NVMM controller detects and corrects the error, and software consumes good data]

Slide 21: NVMM Failure Modes – Media Failures
[Diagram: detectable & uncorrectable case — on a read, the NVMM controller detects the error and raises an exception, and software receives an MCE; the poison radius (PR) is e.g. 512 bytes]

Slide 22: Detecting NVMM Media Errors – memcpy_mcsafe()
• Copy data from NVMM
• Catch MCEs and return failure
[Flowchart: on an MCE, if a handler is registered, the error is processed and the copy returns a failure; otherwise a kernel access panics the kernel, while a user access gets SIGBUS for a recoverable MCE and a kernel panic for an unrecoverable one]
(A sketch of this read path follows.)
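Slide 22's point is that reads from NVMM should go through a machine-check-safe copy so the file system can return an error instead of consuming poisoned data. In the Linux kernel this helper has appeared as memcpy_mcsafe() and, in later kernels, as copy_mc_to_kernel(), both following a "return the number of bytes not copied" convention. The user-space sketch below only imitates that contract with a simulated poisoned byte; the names and the simulation are assumptions for illustration, not NOVA's real code.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Simulated media state: offset of the first poisoned byte, or -1 for none. */
static long poison_at = -1;

/*
 * Imitation of the "returns bytes NOT copied" contract used by the kernel's
 * machine-check-safe copy helpers: stop before the poisoned location instead
 * of consuming it, and let the caller decide what to do.
 */
static size_t mc_safe_copy(void *dst, const void *src, size_t len)
{
    size_t ok = len;
    if (poison_at >= 0 && (size_t)poison_at < len)
        ok = (size_t)poison_at;          /* copy only the good prefix */
    memcpy(dst, src, ok);
    return len - ok;                     /* 0 means the whole copy succeeded */
}

/* Read path: surface -EIO to the caller rather than returning corrupt bytes. */
static long nvmm_read(void *dst, const void *nvmm, size_t len)
{
    if (mc_safe_copy(dst, nvmm, len) != 0)
        return -EIO;
    return (long)len;
}

int main(void)
{
    char media[32] = "persistent data";
    char buf[32];

    printf("clean read    -> %ld\n", nvmm_read(buf, media, sizeof(media)));
    poison_at = 7;                       /* inject an uncorrectable media error */
    printf("poisoned read -> %ld\n", nvmm_read(buf, media, sizeof(media)));
    return 0;
}
```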
Slide 23: NVMM Failure Modes – Media Failures
[Diagram: undetectable case — on a read, the NVMM controller sees no error and software consumes corrupted data]

Slide 24: NVMM Failure Modes – Scribbles
• Software "scribbles"
  – Kernel bugs or NOVA bugs
  – NVMM file systems are highly vulnerable
[Diagram: buggy code writes ("scribbles") over NVMM; the NVMM controller simply updates the ECC, so the corruption is not flagged as a media error]

Slide 25: NVMM Failure Modes – Scribbles
[Diagram: on a later read of scribbled data, the NVMM controller sees no error and software consumes corrupted data]

Slide 26: NOVA Metadata Protection
• Replicate everything: inodes, logs, the superblock, …
• CRC32 checksums everywhere
[Diagram: each inode (head, tail, checksum) has a replica inode', and each log entry ent1 … entN carries its own checksum c1 … cN and has a replica entry]

Slide 27: Defense Against Scribbles
• Tolerating larger scribbles
  – Allocate replicas far from one another
  – Can tolerate arbitrarily large scribbles to metadata
• Preventing scribbles
  – Mark all NVMM as read-only
  – Disable CPU write protection while accessing NVMM

Slide 28: Data Protection: Data

Slide 29: NOVA Data Protection
• Divide 4 KB blocks into 512-byte stripes
• Compute a RAID 5-style parity stripe
• Compute and replicate checksums for each stripe
For stripes S0 … S7 of a block: P = S0 ⊕ S1 ⊕ … ⊕ S7, and Ci = CRC32C(Si), with the checksums replicated.
(A sketch of this scheme follows.)
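Slide 29's scheme is concrete enough to sketch: a 4 KB block is treated as eight 512-byte stripes, the parity stripe is the XOR of the eight stripes, and each stripe gets a replicated CRC32C checksum. The user-space sketch below shows how the checksums localize a corrupted stripe and the parity rebuilds it; it uses a slow reference CRC32C and hypothetical helper names, and is an illustration rather than NOVA's kernel implementation.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define STRIPE_SIZE 512
#define STRIPES     8                     /* 8 x 512 B = one 4 KB block */

/* Slow reference CRC32C (Castagnoli polynomial, reflected form 0x82F63B78). */
static uint32_t crc32c(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* P = S0 ^ ... ^ S7 and Ci = CRC32C(Si) for one 4 KB block. */
static void protect_block(const uint8_t *block, uint8_t *parity, uint32_t *csum)
{
    memset(parity, 0, STRIPE_SIZE);
    for (int s = 0; s < STRIPES; s++) {
        const uint8_t *stripe = block + s * STRIPE_SIZE;
        for (int i = 0; i < STRIPE_SIZE; i++)
            parity[i] ^= stripe[i];
        csum[s] = crc32c(stripe, STRIPE_SIZE);
    }
}

/* Verify every stripe; rebuild a single bad stripe from parity + the others. */
static int check_and_repair(uint8_t *block, const uint8_t *parity,
                            const uint32_t *csum)
{
    int bad = -1;
    for (int s = 0; s < STRIPES; s++) {
        if (crc32c(block + s * STRIPE_SIZE, STRIPE_SIZE) != csum[s]) {
            if (bad >= 0)
                return -1;                /* more than one bad stripe */
            bad = s;
        }
    }
    if (bad < 0)
        return 0;                         /* every stripe checks out */
    for (int i = 0; i < STRIPE_SIZE; i++) {
        uint8_t v = parity[i];
        for (int s = 0; s < STRIPES; s++)
            if (s != bad)
                v ^= block[s * STRIPE_SIZE + i];
        block[bad * STRIPE_SIZE + i] = v; /* missing stripe = P ^ good stripes */
    }
    return 1;                             /* one stripe repaired */
}

int main(void)
{
    static uint8_t block[STRIPES * STRIPE_SIZE];
    uint8_t parity[STRIPE_SIZE];
    uint32_t csum[STRIPES];

    for (size_t i = 0; i < sizeof(block); i++)
        block[i] = (uint8_t)(i * 131u);

    protect_block(block, parity, csum);
    block[3 * STRIPE_SIZE + 17] ^= 0xFF;  /* scribble one byte in stripe S3 */
    printf("repair result: %d\n", check_and_repair(block, parity, csum));
    return 0;
}
```

With one parity stripe per block, a single corrupted stripe can be rebuilt from the parity and the seven good stripes; two or more bad stripes in the same block are detectable via the checksums but not recoverable.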
Slide 30: File Data Protection with DAX-mmap()
• With DAX-mmap(), file data changes are invisible to NOVA
• NOVA cannot protect mmap'ed file data
• NOVA logs mmap() and restores protection on munmap() or during recovery
[Diagram: applications load/store the mapped pages directly in user space, while NOVA only sees read(), write(), and mmap(); the mapped file data is unprotected and the file log records an mmap log entry]

Slide 31: File Data Protection with DAX-mmap()
• NOVA cannot protect mmap'ed file data
  – User applications directly load/store the mmap'ed region
  – NOVA has to know which file pages are mmap'ed
[Diagram: on munmap(), protection of the affected file data is restored]

Slide 32: File Data Protection with DAX-mmap()
[Diagram: after a system failure, recovery uses the logged mmap entries to restore protection of the previously mapped file data]

Slide 33: Performance

Slide 34: Performance Cost of Data Integrity
[Chart: relative performance (0–1.2) for Fileserver, Varmail, Webproxy, Webserver, RocksDB, MongoDB, Exim, TPCC, and their average, comparing xfs-DAX, ext4-DAX, ext4-dataj, the Fortis baseline, Fortis w/ MP+WP, and Fortis w/ MP+DP+WP]

Slide 35: Conclusion
• Existing file systems do not meet the requirements of applications on NVMM
• NOVA's multi-log design achieves high performance and strong consistency
• NOVA's data protection features ensure data integrity
• NOVA outperforms existing file systems while providing stronger consistency and data protection guarantees

Thank you! Try NOVA: https://github.com/NVSL/NOVA