NOVA: a High-Performance, Fault Tolerant File System for Non

Total Page:16

File Type:pdf, Size:1020Kb

NOVA: a High-Performance, Fault Tolerant File System for Non NOVA: A High-Performance, Fault Tolerant File System for Non-Volatile Main Memories Jian Xu, Lu Zhang, Amirsaman Memaripour, Akshatha Gangadharaiah, Amit Borase, Tamires Brito Da Silva, Andy Rudoff (Intel), Steven Swanson Non-VolaHle Systems Laboratory Department of Computer Science and Engineering University of California, San Diego 1 NVMM File System Requirements • Legacy File IO Acceleraon – fast and easy – Run exis1ng IO-intensive apps on NVDIMMs – “just works” • Strong atomicity • DAX Mmap – Load-store access & complex data structures – More control/responsibility for programs • Data protec1on – Don’t trust the media or other soPware – Support backups • DAX High performance – Otherwise, why bother? 2 XFS EXT4 F2FS BTRFS NILFS ✔ ❌ ✔❌DAX ❌ 3 BPFS SCMFS PMFS Aerie EXT4-DAX XFS-DAX M1FS ✔ ❌ ❌ ✔DAX ✔ 4 NOVA ✔✔✔✔DAX ✔ 5 Files, Atomicity, and Performance 6 Log Structured FS For NVMM Per-inode logging • A Nova FS is a tree of logs Inode Head Tail – One log per i-node – Logs are not con1guous Inode log • I-nodes point to the head and tail Commi^ed entry Uncommi^ed entry 7 Atomicity: Logging for Simple Metadata Operations • Combines log-structuring, journaling and copy-on-write • Log-structuring for single log Tail Tail update File log – Write, msync, chmod, etc – Lower overhead than journaling and shadow paging 8 Atomicity: Lightweight Journaling for Complex Metadata Operations • Lightweight journaling for update across logs TailTail – Unlink, rename, etc Directory log – Journal log tails instead of metadata or data TailTail File log Dir tail Journal File tail 9 Atomicity: Copy-on-write for file data • Copy-on-write for file data – Log only contains metadata – Log is short Tail Tail File log Data 0 Data 1 Data 1 Data 2 10 Atomicity: DAX • Nova does not make atomicity guarantees for DAX mmap()’d data • msync() works, but it’s slow & non-atomic • The program must ensure consistency – ISA Support: CLWB, clflush, etc. DAX – Careful programming 11 DRAM Indexes for High Performance • Per-inode logging allows for Radix tree high concurrency 0 1 2 3 DRAM • Split data structure between NVMM Tail DRAM and NVMM – Persistent log is simple and File log efficient Data 0 Data 1 Data 2 – Volale tree structure has no consistency overhead 12 Performance and Scalability • Put allocator in DRAM CPU 0 CPU 1 • High scalability Free list Free list – Per-CPU NVMM free list, DRAM journal and inode table NVMM Journal Journal – Concurrent transac1ons Super Inode table Inode table and allocaon/deallocaon block Recovery inode Inode Head Tail Inode log 13 Fast garbage collection • Log is a linked list • Log only contains Tail metadata Head • Fast GC deletes dead log pages from the linked list • No copying Vaild log entry Invalid log entry 14 Thorough garbage collection • Starts if valid log entries < 50% log length • Format a new log and atomically replace the old one • Only copy metadata Tail Head Vaild log entry Invalid log entry 15 Recovery • Normal shutdown CPU 0 CPU 1 recovery: Free list Free list – Store allocator state in recovery inode DRAM NVMM – Constant 1me startup Journal Journal • Failure recovery: Super Inode table Inode table block – Parallel scan Recovery inode Recovery Recovery – Failure recovery bandwidth: thread thread > 400 GB/s 16 Operation Latency 30 Ext4-datajournal Ext4-DAX xfs-DAX m1fs 25 NOVA • Intel PM Emulaon Plaorm – Emulates different NVM 20 characteris1cs 15 – Emulates clwb/clflush latency • NOVA provides low latency 10 Latency (microsecond) atomicity 5 0 Create Append (4KB) Delete 17 Filebench throughput 450 Ext4-datajournal Ext4-DAX 400 xfs-DAX m1fs NOVA 350 300 • NOVA achieves high 250 performance with 200 strong data consistency 150 Ops per second (x1000) 100 50 0 Fileserver Varmail Webproxy Webserver 18 DAX, Backup, and Data Protec1on DAX 19 DAX Challenges • DAX mmap() gives the programmer great power – Fine-grain updates – Pointer-based data structures – Custom data layout • …along with great responsibili1es. – Data structure consistency/persistence ordering DAX – Memory protec1on/media error management • The file system must not interfere. 20 Snapshots for Normal File Access Current epoch 012 Snapshot 0 Snapshot 1 File log 0 Page 1 1 Page 1 1 Page 1 2 Page 1 Data Data Data Data Snapshot entry File write entry Epoch ID Data in snapshot Reclaimed data Current data 21 Memory Ordering With DAX mmap() D = 42; V = true; • D and V live in two pages of a mmap()’d region. • Recovery invariant: if V == True, then D is valid 22 Corrupt Snapshots with DAX-mmap() • Recovery invariant: if V == True, then D is valid – Incorrect: Naïvely mark pages read-only one-at-a-1me V = Applicaon: D = 1; True; Snapshot Page hos1ng D: ? 1 ? Page hos1ng V: False True T Time Snapshot Page Copy on Value R/W RO Fault Write Change 23 Consistent Snapshots with DAX-mmap() • Recovery invariant: if V == True, then D is valid – Correct: Block page faults un1l all pages are read-only V = Applicaon: D = 1; True; Snapshot Page hos1ng D: ? 1 ? Page hos1ng V: False True F Time Snapshot Page Copy on Value R/W RO RO Fault Write Change Blocking 24 Performance impact of snapshots • Normal execu1on vs. taking snapshots every 10s – Negligible performance loss through read()/write() – Average performance loss 6.2% through mmap() Conven1onal workloads NVMM-aware workloads from WHISPER 25 Protec1ng Metadata 26 NVMM Failure Modes: Media Failures • Media errors – Detectable & correctable • Transparent to soPware – Detectable & uncorrectable SoPware: Consumes good data • Affect a con1guous range of data • Raise machine check excep1on (MCE) NVMM Ctrl.: Detects & corrects errors – Undetectable Read NVMM data: • May consume corrupted data Media error • SoPware scribbles – Kernel bugs or own bugs – Transparent to hardware 27 NVMM Failure Modes : Media Failures • Media errors – Detectable & correctable • Transparent to soPware – Detectable & uncorrectable SoPware: Receives MCE • Affect a con1guous range of data • Raise machine check excep1on (MCE) NVMM Ctrl.: Detects uncorrectable errors Raises excep1on Read – Undetectable NVMM data: • May consume corrupted data Media error & Poison Radius • SoPware scribbles (PR) – Kernel bugs or own bugs e.g. 512 bytes – Transparent to hardware 28 NVMM Failure Modes : Media Failures • Media errors – Detectable & correctable • Transparent to soPware – Detectable & uncorrectable SoPware: Consumes corrupted data • Affect a con1guous range of data • Raise machine check excep1on (MCE) NVMM Ctrl.: Sees no error – Undetectable Read NVMM data: • Consume corrupted data Media error • SoPware scribbles – Kernel bugs or own bugs – Transparent to hardware 29 NVMM Failure Modes: Scribbles • Media errors – Detectable & correctable • Transparent to soPware – Detectable & uncorrectable SoPware: Bug code scribbles NVMM • Affect a con1guous range of data Write • Raise machine check excep1on (MCE) NVMM Ctrl.: Updates ECC – Undetectable NVMM data: • Consume corrupted data Scribble error • SoPware “scribbles” – Kernel bugs or NOVA bugs – NVMM file systems are highly vulnerable 30 NVMM Failure Modes: Scribbles • Media errors – Detectable & correctable • Transparent to soPware – Detectable & uncorrectable SoPware: Consumes corrupted data • Affect a con1guous range of data • Raise machine check excep1on (MCE) NVMM Ctrl.: Sees no error – Undetectable Read NVMM data: • Consume corrupted data Scribble error • SoPware “scribbles” – Kernel bugs or NOVA bugs – NVMM file systems are highly vulnerable 31 NOVA Metadata Protection • Replicate everything inode’ Head’ Tail’ csumcsum’’ H1’ T1’ – Inodes inode – Logs Head Tail csum H1 T1 – Superblock – … ent1 c1 … entN cN • CRC32 Checksums everywhere ent1’ c1’ … entN’ cN’ Data 1 Data 2 32 Defense Against Scribbles • Tolerang Larger Scribbles – Allocate replicas far from one another – Can tolerate arbitrarily large scribbles to metadata. • Preven1ng scribbles – Mark all NVMM as read-only – Disable CPU write protec1on while accessing NVMM 33 Protec1ng Data DAX 34 NOVA Data Protection • Divide 4KB blocks into 512-byte stripes • Compute a RAID 5-style parity stripe • Compute and replicate checksums for each stripe 1 Block 512-Byte stripe segments S0 S1 S2 S3 S4 S5 S6 S7 P P = ⊕ S0..7 Ci = CRC32C(Si) Replicated 35 Data Protection With DAX mmap() • File systems cannot efficiently protect mmap’ed data, since stores are invisible • NOVA’s data protec1on contract: NOVA protects pages from media errors DAX and scribbles iff they are not mmap()’d for wring. 36 File data protection with DAX-mmap • NOVA logs mmap() operaons User-space load/store load/store Applicaons: Kernel-space read(), mmap() NOVA: write() NVDIMMs File data: protected unprotected File log: mmap log entry 38 File data protection with DAX-mmap • On unmap and during recovery, NOVA restores protec1on User-space load/store munmap() Applicaons: Kernel-space read(), mmap() NOVA: write() NVDIMMs Protec1on restored File data: File log: 39 File data protection with DAX-mmap • On unmap and during recovery, NOVA restores protec1on User-space System Failure + Applicaons: recovery Kernel-space read(), mmap() NOVA: write() NVDIMMs File data: File log: 40 Performance 41 Metadata Protec1on O(1) Data Protec1on O(N) 42 Storage Utilization Primary log: 2.00% Redundancy Primary inode: 0.10% 14.76% File data: 82.4% Replica inode: 0.10% Replica log: 2.00% Unused File checksum: 1.56% 0.75% File parity: 11.1% 43 Performance Cost of Data Integrity 1.2 1 0.8 0.6 0.4 0.2 0 Fileserver Varmail Webproxy Webserver RocksDB MongoDB Exim TPCC average ext4-DAX ext4-dataj For1s baseline NOVA w/ MP+WP w/ MP+DP+WP 44 Conclusion • Exis1ng file systems do not meet the requirements of applicaons on NVMM file systems • NOVA’s mul1-log design achieves high performance and strong consistency • NOVA’s data protec1on features ensure data integrity • NOVA outperforms exis1ng file systems while providing stronger consistency and data protec1on guarantees 45 NOVA is open source. We are preparing it for addi1on to Linux. To help or try it out: h^ps://github.com/ NVSL/linux-nova 46 Thanks! 47 .
Recommended publications
  • The Kernel Report
    The kernel report (ELC 2012 edition) Jonathan Corbet LWN.net [email protected] The Plan Look at a year's worth of kernel work ...with an eye toward the future Starting off 2011 2.6.37 released - January 4, 2011 11,446 changes, 1,276 developers VFS scalability work (inode_lock removal) Block I/O bandwidth controller PPTP support Basic pNFS support Wakeup sources What have we done since then? Since 2.6.37: Five kernel releases have been made 59,000 changes have been merged 3069 developers have contributed to the kernel 416 companies have supported kernel development February As you can see in these posts, Ralink is sending patches for the upstream rt2x00 driver for their new chipsets, and not just dumping a huge, stand-alone tarball driver on the community, as they have done in the past. This shows a huge willingness to learn how to deal with the kernel community, and they should be strongly encouraged and praised for this major change in attitude. – Greg Kroah-Hartman, February 9 Employer contributions 2.6.38-3.2 Volunteers 13.9% Wolfson Micro 1.7% Red Hat 10.9% Samsung 1.6% Intel 7.3% Google 1.6% unknown 6.9% Oracle 1.5% Novell 4.0% Microsoft 1.4% IBM 3.6% AMD 1.3% TI 3.4% Freescale 1.3% Broadcom 3.1% Fujitsu 1.1% consultants 2.2% Atheros 1.1% Nokia 1.8% Wind River 1.0% Also in February Red Hat stops releasing individual kernel patches March 2.6.38 released – March 14, 2011 (9,577 changes from 1198 developers) Per-session group scheduling dcache scalability patch set Transmit packet steering Transparent huge pages Hierarchical block I/O bandwidth controller Somebody needs to get a grip in the ARM community.
    [Show full text]
  • Trusted Docker Containers and Trusted Vms in Openstack
    Trusted Docker Containers and Trusted VMs in OpenStack Raghu Yeluri Abhishek Gupta Outline o Context: Docker Security – Top Customer Asks o Intel’s Focus: Trusted Docker Containers o Who Verifies Trust ? o Reference Architecture with OpenStack o Demo o Availability o Call to Action Docker Overview in a Slide.. Docker Hub Lightweight, open source engine for creating, deploying containers Provides work flow for running, building and containerizing apps. Separates apps from where they run.; Enables Micro-services; scale by composition. Underlying building blocks: Linux kernel's namespaces (isolation) + cgroups (resource control) + .. Components of Docker Docker Engine – Runtime for running, building Docker containers. Docker Repositories(Hub) - SaaS service for sharing/managing images Docker Images (layers) Images hold Apps. Shareable snapshot of software. Container is a running instance of image. Orchestration: OpenStack, Docker Swarm, Kubernetes, Mesos, Fleet, Project Docker Layers Atomic, Lattice… Docker Security – 5 key Customer Asks 1. How do you know that the Docker Host Integrity is there? o Do you trust the Docker daemon? o Do you trust the Docker host has booted with Integrity? 2. How do you verify Docker Container Integrity o Who wrote the Docker image? Do you trust the image? Did the right Image get launched? 3. Runtime Protection of Docker Engine & Enhanced Isolation o How can Intel help with runtime Integrity? 4. Enterprise Security Features – Compliance, Manageability, Identity authentication.. Etc. 5. OpenStack as a single Control Plane for Trusted VMs and Trusted Docker Containers.. Intel’s Focus: Enable Hardware-based Integrity Assurance for Docker Containers – Trusted Docker Containers Trusted Docker Containers – 3 focus areas o Launch Integrity of Docker Host o Runtime Integrity of Docker Host o Integrity of Docker Images Today’s Focus: Integrity of Docker Host, and how to use it in OpenStack.
    [Show full text]
  • Rootless Containers with Podman and Fuse-Overlayfs
    CernVM Workshop 2019 (4th June 2019) Rootless containers with Podman and fuse-overlayfs Giuseppe Scrivano @gscrivano Introduction 2 Rootless Containers • “Rootless containers refers to the ability for an unprivileged user (i.e. non-root user) to create, run and otherwise manage containers.” (https://rootlesscontaine.rs/ ) • Not just about running the container payload as an unprivileged user • Container runtime runs also as an unprivileged user 3 Don’t confuse with... • sudo podman run --user foo – Executes the process in the container as non-root – Podman and the OCI runtime still running as root • USER instruction in Dockerfile – same as above – Notably you can’t RUN dnf install ... 4 Don’t confuse with... • podman run --uidmap – Execute containers as a non-root user, using user namespaces – Most similar to rootless containers, but still requires podman and runc to run as root 5 Motivation of Rootless Containers • To mitigate potential vulnerability of container runtimes • To allow users of shared machines (e.g. HPC) to run containers without the risk of breaking other users environments • To isolate nested containers 6 Caveat: Not a panacea • Although rootless containers could mitigate these vulnerabilities, it is not a panacea , especially it is powerless against kernel (and hardware) vulnerabilities – CVE 2013-1858, CVE-2015-1328, CVE-2018-18955 • Castle approach : it should be used in conjunction with other security layers such as seccomp and SELinux 7 Podman 8 Rootless Podman Podman is a daemon-less alternative to Docker • $ alias
    [Show full text]
  • CERIAS Tech Report 2017-5 Deceptive Memory Systems by Christopher N
    CERIAS Tech Report 2017-5 Deceptive Memory Systems by Christopher N. Gutierrez Center for Education and Research Information Assurance and Security Purdue University, West Lafayette, IN 47907-2086 DECEPTIVE MEMORY SYSTEMS ADissertation Submitted to the Faculty of Purdue University by Christopher N. Gutierrez In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy December 2017 Purdue University West Lafayette, Indiana ii THE PURDUE UNIVERSITY GRADUATE SCHOOL STATEMENT OF DISSERTATION APPROVAL Dr. Eugene H. Spa↵ord, Co-Chair Department of Computer Science Dr. Saurabh Bagchi, Co-Chair Department of Computer Science Dr. Dongyan Xu Department of Computer Science Dr. Mathias Payer Department of Computer Science Approved by: Dr. Voicu Popescu by Dr. William J. Gorman Head of the Graduate Program iii This work is dedicated to my wife, Gina. Thank you for all of your love and support. The moon awaits us. iv ACKNOWLEDGMENTS Iwould liketothank ProfessorsEugeneSpa↵ord and SaurabhBagchi for their guidance, support, and advice throughout my time at Purdue. Both have been instru­ mental in my development as a computer scientist, and I am forever grateful. I would also like to thank the Center for Education and Research in Information Assurance and Security (CERIAS) for fostering a multidisciplinary security culture in which I had the privilege to be part of. Special thanks to Adam Hammer and Ronald Cas­ tongia for their technical support and Thomas Yurek for his programming assistance for the experimental evaluation. I am grateful for the valuable feedback provided by the members of my thesis committee, Professor Dongyen Xu, and Professor Math­ ias Payer.
    [Show full text]
  • F2FS) Overview
    Flash Friendly File System (F2FS) Overview Leon Romanovsky [email protected] www.leon.nu November 17, 2012 Leon Romanovsky [email protected] F2FS Overview Disclaimer Everything in this lecture shall not, under any circumstances, hold any legal liability whatsoever. Any usage of the data and information in this document shall be solely on the responsibility of the user. This lecture is not given on behalf of any company or organization. Leon Romanovsky [email protected] F2FS Overview Introduction: Flash Memory Definition Flash memory is a non-volatile storage device that can be electrically erased and reprogrammed. Challenges block-level access wear leveling read disturb bad blocks management garbage collection different physics different interfaces Leon Romanovsky [email protected] F2FS Overview Introduction: Flash Memory Definition Flash memory is a non-volatile storage device that can be electrically erased and reprogrammed. Challenges block-level access wear leveling read disturb bad blocks management garbage collection different physics different interfaces Leon Romanovsky [email protected] F2FS Overview Introduction: General System Architecture Leon Romanovsky [email protected] F2FS Overview Introduction: File Systems Optimized for disk storage EXT2/3/4 BTRFS VFAT Optimized for flash, but not aware of FTL JFFS/JFFS2 YAFFS LogFS UbiFS NILFS Leon Romanovsky [email protected] F2FS Overview Background: LFS vs. Unix FS Leon Romanovsky [email protected] F2FS Overview Background: LFS Overview Leon Romanovsky [email protected] F2FS Overview Background: LFS Garbage Collection 1 A victim segment is selected through referencing segment usage table. 2 It loads parent index structures of all the data in the victim identified by segment summary blocks.
    [Show full text]
  • Dm-X: Protecting Volume-Level Integrity for Cloud Volumes and Local
    dm-x: Protecting Volume-level Integrity for Cloud Volumes and Local Block Devices Anrin Chakraborti Bhushan Jain Jan Kasiak Stony Brook University Stony Brook University & Stony Brook University UNC, Chapel Hill Tao Zhang Donald Porter Radu Sion Stony Brook University & Stony Brook University & Stony Brook University UNC, Chapel Hill UNC, Chapel Hill ABSTRACT on Amazon’s Virtual Private Cloud (VPC) [2] (which provides The verified boot feature in recent Android devices, which strong security guarantees through network isolation) store deploys dm-verity, has been overwhelmingly successful in elim- data on untrusted storage devices residing in the public cloud. inating the extremely popular Android smart phone rooting For example, Amazon VPC file systems, object stores, and movement [25]. Unfortunately, dm-verity integrity guarantees databases reside on virtual block devices in the Amazon are read-only and do not extend to writable volumes. Elastic Block Storage (EBS). This paper introduces a new device mapper, dm-x, that This allows numerous attack vectors for a malicious cloud efficiently (fast) and reliably (metadata journaling) assures provider/untrusted software running on the server. For in- volume-level integrity for entire, writable volumes. In a direct stance, to ensure SEC-mandated assurances of end-to-end disk setup, dm-x overheads are around 6-35% over ext4 on the integrity, banks need to guarantee that the external cloud raw volume while offering additional integrity guarantees. For storage service is unable to remove log entries documenting cloud storage (Amazon EBS), dm-x overheads are negligible. financial transactions. Yet, without integrity of the storage device, a malicious cloud storage service could remove logs of a coordinated attack.
    [Show full text]
  • Providing User Security Guarantees in Public Infrastructure Clouds
    1 Providing User Security Guarantees in Public Infrastructure Clouds Nicolae Paladi, Christian Gehrmann, and Antonis Michalas Abstract—The infrastructure cloud (IaaS) service model offers improved resource flexibility and availability, where tenants – insulated from the minutiae of hardware maintenance – rent computing resources to deploy and operate complex systems. Large-scale services running on IaaS platforms demonstrate the viability of this model; nevertheless, many organizations operating on sensitive data avoid migrating operations to IaaS platforms due to security concerns. In this paper, we describe a framework for data and operation security in IaaS, consisting of protocols for a trusted launch of virtual machines and domain-based storage protection. We continue with an extensive theoretical analysis with proofs about protocol resistance against attacks in the defined threat model. The protocols allow trust to be established by remotely attesting host platform configuration prior to launching guest virtual machines and ensure confidentiality of data in remote storage, with encryption keys maintained outside of the IaaS domain. Presented experimental results demonstrate the validity and efficiency of the proposed protocols. The framework prototype was implemented on a test bed operating a public electronic health record system, showing that the proposed protocols can be integrated into existing cloud environments. Index Terms—Security; Cloud Computing; Storage Protection; Trusted Computing F 1 INTRODUCTION host level. While support data encryption at rest is offered by several cloud providers and can be configured by tenants Cloud computing has progressed from a bold vision to mas- in their VM instances, functionality and migration capabil- sive deployments in various application domains. However, ities of such solutions are severely restricted.
    [Show full text]
  • In Search of Optimal Data Placement for Eliminating Write Amplification in Log-Structured Storage
    In Search of Optimal Data Placement for Eliminating Write Amplification in Log-Structured Storage Qiuping Wangy, Jinhong Liy, Patrick P. C. Leey, Guangliang Zhao∗, Chao Shi∗, Lilong Huang∗ yThe Chinese University of Hong Kong ∗Alibaba Group ABSTRACT invalidated by a live block; a.k.a. the death time [12]) to achieve Log-structured storage has been widely deployed in various do- the minimum possible WA. However, without obtaining the fu- mains of storage systems for high performance. However, its garbage ture knowledge of the BIT pattern, how to design an optimal data collection (GC) incurs severe write amplification (WA) due to the placement scheme with the minimum WA remains an unexplored frequent rewrites of live data. This motivates many research stud- issue. Existing temperature-based data placement schemes that ies, particularly on data placement strategies, that mitigate WA in group blocks by block temperatures (e.g., write/update frequencies) log-structured storage. We show how to design an optimal data [7, 16, 22, 27, 29, 35, 36] are arguably inaccurate to capture the BIT placement scheme that leads to the minimum WA with the fu- pattern and fail to group the blocks with similar BITs [12]. ture knowledge of block invalidation time (BIT) of each written To this end, we propose SepBIT, a novel data placement scheme block. Guided by this observation, we propose SepBIT, a novel data that aims to minimize the WA in log-structured storage. It infers placement algorithm that aims to minimize WA in log-structured the BITs of written blocks from the underlying storage workloads storage.
    [Show full text]
  • ECE 598 – Advanced Operating Systems Lecture 19
    ECE 598 { Advanced Operating Systems Lecture 19 Vince Weaver http://web.eece.maine.edu/~vweaver [email protected] 7 April 2016 Announcements • Homework #7 was due • Homework #8 will be posted 1 Why use FAT over ext2? • FAT simpler, easy to code • FAT supported on all major OSes • ext2 faster, more robust filename and permissions 2 btrfs • B-tree fs (similar to a binary tree, but with pages full of leaves) • overwrite filesystem (overwite on modify) vs CoW • Copy on write. When write to a file, old data not overwritten. Since old data not over-written, crash recovery better Eventually old data garbage collected • Data in extents 3 • Copy-on-write • Forest of trees: { sub-volumes { extent-allocation { checksum tree { chunk device { reloc • On-line defragmentation • On-line volume growth 4 • Built-in RAID • Transparent compression • Snapshots • Checksums on data and meta-data • De-duplication • Cloning { can make an exact snapshot of file, copy-on- write different than link, different inodles but same blocks 5 Embedded • Designed to be small, simple, read-only? • romfs { 32 byte header (magic, size, checksum,name) { Repeating files (pointer to next [0 if none]), info, size, checksum, file name, file data • cramfs 6 ZFS Advanced OS from Sun/Oracle. Similar in idea to btrfs indirect still, not extent based? 7 ReFS Resilient FS, Microsoft's answer to brtfs and zfs 8 Networked File Systems • Allow a centralized file server to export a filesystem to multiple clients. • Provide file level access, not just raw blocks (NBD) • Clustered filesystems also exist, where multiple servers work in conjunction.
    [Show full text]
  • CS 5600 Computer Systems
    CS 5600 Computer Systems Lecture 10: File Systems What are We Doing Today? • Last week we talked extensively about hard drives and SSDs – How they work – Performance characterisEcs • This week is all about managing storage – Disks/SSDs offer a blank slate of empty blocks – How do we store files on these devices, and keep track of them? – How do we maintain high performance? – How do we maintain consistency in the face of random crashes? 2 • ParEEons and MounEng • Basics (FAT) • inodes and Blocks (ext) • Block Groups (ext2) • Journaling (ext3) • Extents and B-Trees (ext4) • Log-based File Systems 3 Building the Root File System • One of the first tasks of an OS during bootup is to build the root file system 1. Locate all bootable media – Internal and external hard disks – SSDs – Floppy disks, CDs, DVDs, USB scks 2. Locate all the parEEons on each media – Read MBR(s), extended parEEon tables, etc. 3. Mount one or more parEEons – Makes the file system(s) available for access 4 The Master Boot Record Address Size Descripon Hex Dec. (Bytes) Includes the starEng 0x000 0 Bootstrap code area 446 LBA and length of 0x1BE 446 ParEEon Entry #1 16 the parEEon 0x1CE 462 ParEEon Entry #2 16 0x1DE 478 ParEEon Entry #3 16 0x1EE 494 ParEEon Entry #4 16 0x1FE 510 Magic Number 2 Total: 512 ParEEon 1 ParEEon 2 ParEEon 3 ParEEon 4 MBR (ext3) (swap) (NTFS) (FAT32) Disk 1 ParEEon 1 MBR (NTFS) 5 Disk 2 Extended ParEEons • In some cases, you may want >4 parEEons • Modern OSes support extended parEEons Logical Logical ParEEon 1 ParEEon 2 Ext.
    [Show full text]
  • Ext4 File System and Crash Consistency
    1 Ext4 file system and crash consistency Changwoo Min 2 Summary of last lectures • Tools: building, exploring, and debugging Linux kernel • Core kernel infrastructure • Process management & scheduling • Interrupt & interrupt handler • Kernel synchronization • Memory management • Virtual file system • Page cache and page fault 3 Today: ext4 file system and crash consistency • File system in Linux kernel • Design considerations of a file system • History of file system • On-disk structure of Ext4 • File operations • Crash consistency 4 File system in Linux kernel User space application (ex: cp) User-space Syscalls: open, read, write, etc. Kernel-space VFS: Virtual File System Filesystems ext4 FAT32 JFFS2 Block layer Hardware Embedded Hard disk USB drive flash 5 What is a file system fundamentally? int main(int argc, char *argv[]) { int fd; char buffer[4096]; struct stat_buf; DIR *dir; struct dirent *entry; /* 1. Path name -> inode mapping */ fd = open("/home/lkp/hello.c" , O_RDONLY); /* 2. File offset -> disk block address mapping */ pread(fd, buffer, sizeof(buffer), 0); /* 3. File meta data operation */ fstat(fd, &stat_buf); printf("file size = %d\n", stat_buf.st_size); /* 4. Directory operation */ dir = opendir("/home"); entry = readdir(dir); printf("dir = %s\n", entry->d_name); return 0; } 6 Why do we care EXT4 file system? • Most widely-deployed file system • Default file system of major Linux distributions • File system used in Google data center • Default file system of Android kernel • Follows the traditional file system design 7 History of file system design 8 UFS (Unix File System) • The original UNIX file system • Design by Dennis Ritche and Ken Thompson (1974) • The first Linux file system (ext) and Minix FS has a similar layout 9 UFS (Unix File System) • Performance problem of UFS (and the first Linux file system) • Especially, long seek time between an inode and data block 10 FFS (Fast File System) • The file system of BSD UNIX • Designed by Marshall Kirk McKusick, et al.
    [Show full text]
  • Comparing Filesystem Performance: Red Hat Enterprise Linux 6 Vs
    COMPARING FILE SYSTEM I/O PERFORMANCE: RED HAT ENTERPRISE LINUX 6 VS. MICROSOFT WINDOWS SERVER 2012 When choosing an operating system platform for your servers, you should know what I/O performance to expect from the operating system and file systems you select. In the Principled Technologies labs, using the IOzone file system benchmark, we compared the I/O performance of two operating systems and file system pairs, Red Hat Enterprise Linux 6 with ext4 and XFS file systems, and Microsoft Windows Server 2012 with NTFS and ReFS file systems. Our testing compared out-of-the-box configurations for each operating system, as well as tuned configurations optimized for better performance, to demonstrate how a few simple adjustments can elevate I/O performance of a file system. We found that file systems available with Red Hat Enterprise Linux 6 delivered better I/O performance than those shipped with Windows Server 2012, in both out-of- the-box and optimized configurations. With I/O performance playing such a critical role in most business applications, selecting the right file system and operating system combination is critical to help you achieve your hardware’s maximum potential. APRIL 2013 A PRINCIPLED TECHNOLOGIES TEST REPORT Commissioned by Red Hat, Inc. About file system and platform configurations While you can use IOzone to gauge disk performance, we concentrated on the file system performance of two operating systems (OSs): Red Hat Enterprise Linux 6, where we examined the ext4 and XFS file systems, and Microsoft Windows Server 2012 Datacenter Edition, where we examined NTFS and ReFS file systems.
    [Show full text]