ReFS
J.R. Tipton, Malcolm Smith
Microsoft
2012 Storage Developer Conference. Copyright © 2012 Microsoft Corp. All Rights Reserved.

Background: NTFS
- Released in 1993; significantly enhanced since
- Rich API
- File-system-level transactions
- ARIES-like write-ahead logging
  - Recovery on crash
- Writes in place
  - Depends on write ordering, a.k.a. careful writing

Why ReFS?
- Hardware has changed
  - Data integrity & capacity
- Our expectations have changed
  - Data integrity and availability: can't take the volume offline
  - Scale & performance: many more files and directories, much larger files and volumes, concurrency
- ...and stay compatible with Windows applications

ReFS data philosophy
- Corruption is commonplace; it's not a special case
  - Can't take a day off for corruption
- Must approach availability coherently
  - Don't take the volume offline
  - Many different things live in one volume; one bad apple shouldn't spoil the bunch
  - Don't freak out when something breaks
- Do not write in place
  - Really: it turns Gizmo into a Gremlin

Component overview
- Kept the NTFS file system logic: the API and its detailed semantics
- New storage engine: Minstore
  - A recoverable object store
  - All file system metadata lives here
- Minstore supplies primitives; the file system gives them meaning
[Diagram: the NTFS-compatible FS layer sits on top of the Minstore on-disk engine]

Minstore is a component
- Minstore doesn't know what a file system is
  - Not just an engineering aesthetic: make things easier and they become feasible
  - Online chkdsk/fsck
- The file system is oblivious to many things
  - Checksum error detection, inline repair
  - Recovery semantics
- A central place for performance work
- Reusable: kernel, user mode, file system, whatever

Minstore highlights
- Transaction-centered table (key/value) API
  - Create a transaction, manipulate tables and rows, commit or abort
- Allocate-on-write
- All metadata checksummed; user data checksumming optional
- Hierarchical allocation
- Recoverable
  - No inherent on-disk ordering (no log); takes advantage of allocate-on-write
- Can "salvage" trees, i.e. remove bad chunks, online

Minstore B+ tables
- One flexible B+ implementation
- There is a lot the B+ engine doesn't know
  - Doesn't know how pages/buckets are manifested
  - Doesn't know about table embedding
  - Doesn't really know about allocate-on-write

Minstore recovery
- The B+ table is the unit of recovery
  - Represents one or more file system transactions
  - A table is recovered entirely or not at all
- Tables are written in almost any order
  - Modification order != write order
  - Relatively novel
- Checkpoints "harden" tables to disk
  - Utilize flush/sync for stabilizing writes
- Some tables are special, internal to Minstore
  - Written with checkpoints

Table atomicity
- Minstore can pick which table to write with any heuristic
  - Free from strict ordering requirements; compare to ARIES-like logging systems
- Minstore automatically tracks table dependencies
  - A transaction that spans two tables automatically expands the atomic unit
- This has potentially visible implications

Table atomicity, cont.
- What if a transaction spans two tables?
  - Minstore combines them into one atomic unit
  - We call this table binding: some "atoms" become one "molecule"
  - Automatic, and invisible to the client, e.g. the file system
- After writing the bound set, the tables are unbound
  - The "molecule" is broken back into "atoms"
  - Automatic and invisible
- The implementation was challenging, but it paid off

Hierarchical allocation
- Allocate-on-write == more allocation requests
  - Allocation can become a synchronization bottleneck
- Minstore uses hierarchical allocation
  - To allocate, move down a level; to free, move up a level
  - Think of it as a "big chunk allocator" allocating to a "smaller chunk allocator"
  - Like Russian dolls

Hierarchical allocation, cont.
- The large allocator describes gigabytes
- The medium allocator describes tens of megabytes
- A private allocator describes clusters

Hierarchical allocation, cont.
- Freeing blocks at a low level frees them in the upper level, which might cascade
- Allocators can be sparse

Hierarchical allocation, cont.
- Any table can have its own private allocator
  - The bottommost level of the allocator hierarchy
  - Better concurrency; aids the locality goals
- Allocators are B+ tables
  - Same properties as other B+ tables, e.g. scalability, embedding options
- There is no bitmap
  - Or is there? There isn't

Embedded tables
- Express the relatedness of tables while letting them scale independently
  - For example, a child can grow very large without bloating its parent
- Reduce table dependency tracking overhead
- Just like top-level tables, except for the root
  - The root, instead of a bucket, is the data portion of a row in the parent
- Seems quirky at first look
Embedded tables
- Look just like a row in a table
  - Until you descend into it

Checksums on metadata
- The goal is to detect, and correct, latent hardware corruption
- Every storage pointer is an <address, checksum> pair
- The checksum algorithm is flexible
- Take advantage of redundancy schemes underneath Minstore/ReFS
  - Spaces
  - Conceivably could work with third-party storage

Checksums on user data
- Exactly the same as for metadata
- Sometimes undesirable
  - Fragmentation, synchronization
  - Sophisticated applications may receive no benefit: if they can detect and repair corruption, why are we in the way?
- Let applications decide
  - Per-file, per-directory attribute
- New APIs
  - Control checksum validation
  - Allow applications to read and verify/fix copies

A file system on Minstore
- Schema
  - Table schema, table relationships
  - Embedded vs. top-level; where to utilize private allocators
  - No inodes, no MFT, no file table... well, okay, file tables
- Stream IO
  - Conventional
  - Integrity: allocation changes on write
- Performance
  - Caching
  - Buffer stabilization

A file system on Minstore, cont.
- Locking
  - File system independence
  - Take advantage of allocate-on-write: metadata IO concurrent with access (read & write)
- Salvage
- Many design choices
  - We think of it like a protocol
  - Separation of concerns is very helpful here
  - Out on a limb, design-wise

Thank you