The NILFS2 Filesystem: Review and Challenges
Total Page:16
File Type:pdf, Size:1020Kb
Ryusuke KONISHI NTT Cyberspace Laboratories NTT Corporation — NILFS Introduction — FileSystem Design — Development Status — Wished features & Challenges Copyright (C) 2009 NTT Corporation 2 — NILFS is the Linux file system supporting “continuous snapshotting” üProvides versioning capability of entire file system üCan retrieve previous states before operation mistake even restores files mistakenly overwritten or destroyed just a few seconds ago. üMerged into the mainline kernel 2.6.30 Copyright (C) 2009 NTT Corporation 3 CAUSE OF DATA LOSS Software Viruses Natural disaster malfunction 4% 2% 9% Human error Hardware 26% failure 59% Unprotected by redundant drives. Mostly preventable with basic high-integrity system Source: Ontrack Data Recovery, Inc. (Redundant configuration) Including office PC.The data is based on actual data recoveries performed by Ontrack. Copyright (C) 2009 NTT Corporation 4 ISSUES IN BACKUP — Changes after the last backup are not safe — But, frequent backups place burden on the system as well as the interval is limited by the backup time. Incrementally done Hard drive Backup Discretely scheduled e.g. per 8 hours, per day Backup Backup Some applications (esp. for system software) Restoration is cumbersome are sensitive to coherency among files (and often fails or unhelpful) Copyright (C) 2009 NTT Corporation 5 — Adopt Log-structured File System approach to continually save data on disk. — Checkpoints are created every time user makes a change, and each checkpoint is mutable to snapshots later on. Instantly Ordinary file systems NILFS taken Backup CP CP A CP B B → → → CP C A B C CP D → → CP E E D E F F Accessible ex. any time Previous data is Previous data is per 8 hours overridden preserved on disk Copyright (C) 2009 NTT Corporation 6 Maximum File System Instant Writable Retroactive Incremental Number of Snapshotting Snapshots Backup (solution) Snapshots Snapshots NTFS Optional Volume 64 Third party Shadow Copy product ZFS Unlimited*1 √ √ √ Btrfs Unlimited*1 √ √ Planned NILFS2 Unlimited*1 √ √ Requested Apple Time Thinned out ― Backup Machine automatically √ Solutions CDP Unlimited*1 ― ― ― √ *1:No practical limits (bounded by disk capacity) Copyright (C) 2009 NTT Corporation 7 — Only modified blocks are incrementally written to disk üThis write scheme is applied even to metadata and intermediate blocks Application view: File A (modified) File B (appended) A A’ A B B’ On disk images: B-Tree intermediate blocks Metadata blocks (inodes, …) A B modified blocks A B A’ B’ Original blocks are not overridden Copyright (C) 2009 NTT Corporation 8 Super blocks Block mapping (B-tree) Node blocks Super root block Inode Intermediate 3 blocks *1 x255 x255 CPFILE SUFILE DAT Data blocks checkpoint Segment Disk block information usage address translation *1 B-tree depth varies with the number of data blocks Files, Directories, Symbolic links cno=100 IFILE cno=101 cno=102 contains cno=103 ino=123 ino=124 ino=125 ino=126 Inodes cno=104 cno=105 Copyright (C) 2009 NTT Corporation 9 — Segments ü Disk space is allocated or freed per segment ü Each segment is filled with logs — Logs ü Organize delta of data and metadata per file ü Composea new version of metadata hierarchy every checkpoint Super Segments Logs Super block block 0 1 2 3 N-1 N Metadata files file-A file-B file-C file-D IFILE CPFILE SUFILE DAT Data blocks B-tree nodes Summary blocks (per log) Super Root ( Changed blocks only ) Block Copyright (C) 2009 NTT Corporation 10 — How does NILFS recover from unclean status? üFinds the last log which has a super root block, and done! üEach log is validated with checksums The super block points to the most recent log. *1 The pointer is periodically updated. incomplete series of logs is ignored Super root block Super block Summary block to be chosen next segment next segment next segment log log log log log Scan order of logs *1Series of logs may not have the super root block. This type of variant is allowed for optimizations to make synchronous write operation faster. Copyright (C) 2009 NTT Corporation 11 — Creates new disk space to continue writing logs ü Essential function of Log structured File Systems — A disk block is in-use if it belongs to a snapshot or recent checkpoints; unused blocks are freed with their checkpoints Preserving checkpoints as snapshots Protection CP SS SS period CP CP CP CP CP CP CP CP CP CP CP A checkpoint which user marked as SNAPSHOT Not all unprotected checkpoints will be Recent checkpoints deleted in a shot of GC Copyright (C) 2009 NTT Corporation 12 Overall view segment usage information checkpoint information Cleanerd execute GC Virtual block address (Standalone daemon) information ioctl() Userland Log summary SUFILE CPFILE DAT GC cache Log writer Kernel L L L L Segment 101 Segment 102 Segment 103 Segment 1000 Segment 1001 On disk S L L D L S D L D D S L L Live blocks Dead blocks S L L S L L L L Copyright (C) 2009 NTT Corporation 13 block addressing — Issue for moving disk blocks ü Must rewrite b-tree node blocks and inodeshaving a pointer to moved blocks ü Disk blocks are pointed from many parent blocks because NILFS makes numerous versions — Solution ü Use virtual (i.e. indirect) block numbers instead of real disk block numbers Block Disk Address mapping Translation File offset Virtual Block Disk Block (per block) Number Number offset=256 vblocknr=1235 blocknr= 10002 Example. DAT File #1234 4000 #1235 10002 Extra lookup is needed but DAT is cached as a file B-tree lookup Table reference Copyright (C) 2009 NTT Corporation 14 Live or dead determination — Cleanerddetermines if each disk block is LIVE or DEAD from DAT Snapshots Recent checkpoints Virtual Block Numbers DAT cno cno cno le64 start cno>= 800 vblocknr=1234 200 2 120 280 le64 end LIVE! 200 < 280 < 300 300 le64 start vblocknr=1235 300 le64 end LIVE! 300 < 800 < 1000 1000 le64 start vblocknr=1440 300 DEAD le64 end Log 500 Summary vblocknr vblocknr vblocknr Block(s) 1234 1235 1440 Virtual block numbers of the payload blocks are written in the summary Copyright (C) 2009 NTT Corporation 15 — Achievements üSnapshots Automatically and continuously taken Mountable as read-only file systems Mountable concurrently with the writable mount (convenient for online backup) Quick listing Easy administration üOnline disk space reclamation Can maintain multiple snapshots Copyright (C) 2009 NTT Corporation 16 — Achievements üOther Features Quick recovery on-mount after system crash B-tree based file and meta data management 64-bit data structures; support many files, large files and disks Block sizes smaller than page size (e.g. 1KB or 2KB) Redundant super blocks (automatic switch) 64-bit on-disk timestamps which are free of the year 2038 problem Nano second timestamps Copyright (C) 2009 NTT Corporation 17 — Todo üOn-disk atime üExtended attributes (work in progress) üPOSIX ACLs üO_DIRECT write Currently fallback to buffered write üFsck üResize üQuotas — Performance issues üDirectory operations üWrite performance üOptimization for silicon disks (esp. for SSD) Copyright (C) 2009 NTT Corporation 18 Compilebench (kernel 2.6.31-rc8) st 180 a Ext3 F 160 Btrfs 140 Under investigation. Nilfs2 120 Applying ext3 hash tree or b-tree based ec directory management is discussed /s 100 in the NILFS community. B M 80 . g v 60 A 40 20 0 low initial create patch compile clean read delete stat S create tree tree tree Hardware specs: Processor: Pentium Dual-Core CPU E5200 @ 2.49GHz, Chipset: Intel 4 Series Chipset + ICH10R, Memory: 2989MB, Disk: ST3500620AS Copyright (C) 2009 NTT Corporation 19 Bonnie++ (kernel 2.6.31-rc8) st 120 a Ext3 F 100 Btrfs Implementation needs to be Nilfs2 80 brushed up, especially in ut p log-writer and b-tree. h ec /s ug 60 B ro M h T 40 20 low 0 S seq write seq rewrite seq read Hardware specs: Processor: Pentium Dual-Core CPU E5200 @ 2.49GHz, Chipset: Intel 4 Series Chipset + ICH10R, Memory: 2989MB, Disk: ST3500620AS Copyright (C) 2009 NTT Corporation 20 — Faster and robust online backup like ZFS ü Back up checkpoints instead of usual files ü Similar features are planned for btrfs, TUX3, and the Device Mapper (dm replication) The reader extracts delta between The writer appends the delta two checkpoints and streams it to a passive NILFS partition over the network Replicator Replicator sshconnection (writer) (reader) CP CP CP CP CP CP CP CP CP CP CP CP Passive (possibly unmounted) NILFS Drive NILFS Drive Copyright (C) 2009 NTT Corporation 21 KEY CHALLENGES DISCUSSED IN THE NILFS COMMUNITY — How to extract delta between two checkpoints ? üTwo approaches Scan logs in creation order (just gets delta from logs) Scan DAT to gather blocks changed during given period üHave pros and cons The former seems to be efficient, but has a limit due to GC. Replicator may use either or both of these methods — Rollback on the destination file system üNeeded before starting replication especially to thin out the backups with GC Copyright (C) 2009 NTT Corporation 22 — Better garbage collector is much needed üBetter data retention policy to prevent disk full üSelf-regulating speed üSmarter selection algorithm of target segments to reduce I/O — Further chance of optimization and enhancement üBackground data verification üDefragmentation üDe-duplication üBackground disk format upgrade Copyright (C) 2009 NTT Corporation 23 — NILFS is in the mainline kernel üYou can go back in time just before you scream “ ” üInstant failure recovery. Simple administration. üPotential for innovative application ü… and most importantly, WORKING STABLY :) — Contribution is welcome üVarious topics in GC, snapshot tools, and time-oriented tools. üLet’s drop the (EXPERIMENTAL) flag! Copyright (C) 2009 NTT Corporation 24 — Project page ühttp://www.nilfs.org/ — Mailing-list üusers (at) nilfs.org üusers-ja(at) nilfs.org — Contact Information üRyusuke KONISHI <ryusuke(at) osrg.net> Copyright (C) 2009 NTT Corporation 25 Copyright (C) 2009 NTT Corporation 26.