ReFS
J.R. Tipton and Malcolm Smith, Microsoft
2012 Storage Developer Conference. Copyright © 2012 Microsoft Corp. All Rights Reserved.
Background: NTFS
- Released in 1993
- Significantly enhanced since
- Rich API
  - File-system-level transactions
- ARIES-like write-ahead logging
  - Recovery on crash
- Writes in place
  - Depends on write ordering, aka careful writing
Why ReFS?
- Hardware has changed
  - Data integrity & capacity
- Our expectations have changed
  - Data integrity and availability
  - Can’t take volume offline
- Scale & performance
  - Many more files, directories
  - Much larger files, volumes
  - Concurrency
- …stay compatible with Windows applications
ReFS data philosophy
- Corruption is commonplace
  - It’s not a special case
  - Can’t take a day off for corruption
- Must approach availability coherently
  - Don’t take volume offline
  - Many different things in one volume
  - One bad apple shouldn’t spoil the bunch
  - Don’t freak out when something breaks
- Do not write in place
  - Really, it turns Gizmo into a Gremlin
Component overview
- Kept NTFS file system logic
  - API
  - Detailed semantics
- New storage engine: Minstore
  - Recoverable object store
  - All on-disk file system metadata lives here
- Minstore supplies primitives; the file system gives them meaning
Minstore is a component
- Minstore doesn’t know what a file system is
  - Not just an engineering aesthetic
  - Make things easier and they become feasible
  - Online chkdsk/fsck
- File system oblivious to many things
  - Checksum error detection, inline repair
  - Recovery semantics
- Central place for performance work
- Reusable
  - Kernel, user, file system, whatever
Minstore highlights
- Transaction-centered table key/value API
  - Create transaction
  - Manipulate tables and rows
  - Commit or abort
- Allocate-on-write
- All metadata checksummed
  - User data checksumming optional
- Hierarchical allocation
- Recoverable
  - No inherent on-disk ordering (no log)
  - Takes advantage of allocate-on-write
- Can “salvage” trees (remove bad chunks) online
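The transaction-centered key/value API can be illustrated with a minimal Python sketch. All names here (`MiniStore`, `Transaction`, `set`/`delete`) are hypothetical stand-ins for illustration, not the real Minstore interface:

```python
# Hypothetical sketch of a transaction-centered table key/value API:
# create a transaction, manipulate tables and rows, then commit or abort.
class Transaction:
    def __init__(self, store):
        self._store = store
        self._pending = {}          # (table, key) -> value, or None for delete

    def set(self, table, key, value):
        self._pending[(table, key)] = value

    def delete(self, table, key):
        self._pending[(table, key)] = None

    def commit(self):
        # Apply every buffered change at once; nothing is visible before this.
        for (table, key), value in self._pending.items():
            rows = self._store.tables.setdefault(table, {})
            if value is None:
                rows.pop(key, None)
            else:
                rows[key] = value
        self._pending.clear()

    def abort(self):
        self._pending.clear()       # Discard all uncommitted changes.

class MiniStore:
    def __init__(self):
        self.tables = {}            # table name -> {key: value}

    def create_transaction(self):
        return Transaction(self)

    def get(self, table, key):
        return self.tables.get(table, {}).get(key)
```

An aborted transaction leaves the store untouched; a committed one applies all of its row manipulations together.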
Minstore B+ tables
- One flexible B+ implementation
- There is a lot the B+ engine doesn’t know:
  - Doesn’t know how pages/buckets are manifested
  - Doesn’t know about table embedding
  - Doesn’t really know about allocate-on-write
Minstore recovery
- B+ table is the unit of recovery
  - Represents one or more file system transactions
  - A table is recovered entirely or not at all
- Tables are written in almost any order
  - Modification order != write order
  - Relatively novel
- Checkpoints “harden” tables to disk
  - Utilize flush/sync for stabilizing writes
- Some tables are special, internal to Minstore
  - Written with checkpoints
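The all-or-nothing, log-free recovery property falls out of allocate-on-write: a checkpoint writes a table's pages to fresh space, stabilizes them with a flush, and only then swaps a single root pointer. A minimal sketch under those assumptions (all names here are illustrative, not the real on-disk layout):

```python
# Sketch of log-free recovery via allocate-on-write: after a crash, the root
# pointer references either the old tree or the new one, never a mix, so a
# table is recovered entirely or not at all.
class Disk:
    def __init__(self):
        self.blocks = {}            # block address -> data
        self.next_addr = 1
        self.root_addr = 0          # stable root pointer (0 = empty volume)

    def write_new(self, data):
        # Allocate-on-write: never overwrite a live block in place.
        addr = self.next_addr
        self.next_addr += 1
        self.blocks[addr] = data
        return addr

def checkpoint(disk, table_rows):
    new_root = disk.write_new(dict(table_rows))  # harden to fresh space
    # ... flush/sync here to stabilize the new blocks before the swap ...
    disk.root_addr = new_root                    # single atomic pointer update

def recover(disk):
    # Recovery just follows the last stabilized root; there is no log to replay.
    return disk.blocks[disk.root_addr] if disk.root_addr else {}
```

A crash before the pointer swap leaves the previous checkpoint fully intact, because the old tree was never modified in place.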
Table atomicity
- Minstore can pick which table to write with any heuristic
  - Free from strict ordering requirements
  - Compare to ARIES-like logging systems
- Minstore automatically tracks table dependencies
  - A transaction that spans two tables automatically expands the atomic unit
- This has potentially visible implications
Table atomicity, cont
- What if a transaction spans two tables?
- Minstore combines them into one atomic unit
  - We call this table binding
  - Some “atoms” become one “molecule”
  - Automatic, invisible to the client (e.g. the file system)
- After writing the bound set, they are unbound
  - The “molecule” is broken back into “atoms”
  - Automatic and invisible
- Implementation was challenging, but paid off
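The binding bookkeeping can be sketched as a set-union over tables: a cross-table transaction merges the molecules of every table it touches, and after the bound set is written the molecule dissolves back into atoms. This is an illustrative model, not the real implementation:

```python
# Sketch of table binding: tables touched by one transaction must be written
# together (one "molecule"); after that write, they become independent again.
class BindingTracker:
    def __init__(self):
        self._group = {}            # table name -> its molecule (a set of tables)

    def _molecule(self, table):
        return self._group.setdefault(table, {table})

    def bind(self, *tables):
        # A transaction spanning these tables: union their molecules.
        merged = set()
        for t in tables:
            merged |= self._molecule(t)
        for t in merged:
            self._group[t] = merged

    def write_set(self, table):
        # Everything that must be written atomically alongside `table`.
        return frozenset(self._molecule(table))

    def unbind(self, table):
        # After writing the bound set, break the molecule back into atoms.
        for t in list(self._molecule(table)):
            self._group[t] = {t}
```

Note how binding is transitive: if a later transaction spans a bound table and a third one, the whole molecule grows, which is why the atomic unit "automatically expands".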
Hierarchical allocation
- Allocate-on-write == more allocation requests
  - Allocation can be a synchronization bottleneck
- Minstore uses hierarchical allocation
  - To allocate, move down a level
  - To free, move up a level
- Think of it as a “big chunk allocator” allocating to a “smaller chunk allocator”
  - Like Russian dolls
Hierarchical allocation, cont
- Large allocator describes gigabytes
- Medium allocator describes 10s of MB
- Private allocator describes clusters
Hierarchical allocation, cont
- Freeing blocks at a low level frees them in the upper level…
  - …which might cascade
- Allocators can be sparse
Hierarchical allocation, cont
- Any table can have its own private allocator
  - Bottommost level of the allocator hierarchy
  - Better concurrency
  - Aids locality goals
- Allocators are B+ tables
  - Same properties as other B+ tables, e.g. scalability, embedding options
- There is no bitmap
  - Or is there? There isn’t
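The large/medium/private hierarchy can be sketched as allocators that refill from the level above and return fully free chunks back up, which can cascade. The class and the specific sizes below are illustrative assumptions, not real ReFS geometry:

```python
# Sketch of hierarchical allocation: each level carves chunks out of the
# level above it and hands smaller pieces down. Freeing can cascade upward.
class Allocator:
    def __init__(self, name, parent=None, chunk=0, capacity=0):
        self.name = name
        self.parent = parent        # larger-granularity allocator above
        self.chunk = chunk          # unit pulled from / returned to parent
        self.free_units = capacity  # top level starts owning the whole volume

    def alloc(self, n):
        # To allocate, move down a level: refill from the parent if needed.
        while self.free_units < n:
            self.parent.alloc(self.chunk)
            self.free_units += self.chunk
        self.free_units -= n

    def free(self, n):
        self.free_units += n
        # A fully free chunk moves back up a level, which might cascade.
        while self.parent is not None and self.free_units >= self.chunk:
            self.free_units -= self.chunk
            self.parent.free(self.chunk)
```

A private, per-table allocator at the bottom satisfies most requests locally, so concurrent tables rarely contend on the shared upper levels.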
Embedded tables
- Express relatedness of tables while letting them scale independently
  - For example, a child can grow very large without bloating its parent
- Reduce table dependency tracking overhead
- Just like top-level tables, except for the root
  - The root, instead of being a bucket, is the data portion of a row in the parent
- Seems quirky at first look
Embedded tables
- Looks just like a row in a table
- Until you descend into it
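The "row until you descend into it" idea can be sketched with a table whose row data may itself be a child table's root. The `Table`/`embed`/`descend` names are hypothetical illustrations, not the Minstore API:

```python
# Sketch of an embedded table: the child's root lives as the data portion of
# a row in its parent, so it occupies exactly one parent row until descended.
class Table:
    def __init__(self):
        self.rows = {}              # key -> plain value, or an embedded Table

    def insert(self, key, value):
        self.rows[key] = value

    def embed(self, key):
        # The row's data *is* the embedded child table's root.
        child = Table()
        self.rows[key] = child
        return child

    def descend(self, key):
        value = self.rows[key]
        if not isinstance(value, Table):
            raise TypeError("row %r holds plain data, not an embedded table" % (key,))
        return value
```

Because the child has its own root, it can grow arbitrarily large while still costing its parent only a single row.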
Checksums on metadata
- Goal is to detect, and correct, latent hardware corruption
- Every storage pointer is an <address, checksum> pair
- Checksum algorithm is flexible
- Take advantage of redundancy schemes underneath Minstore/ReFS
  - Storage Spaces
  - Conceivably could work with third-party storage
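Carrying the checksum in the pointer means corruption is caught on read and, given a redundant copy underneath (as with mirrored Storage Spaces), repaired inline. A sketch under those assumptions, with `zlib.crc32` standing in for the flexible checksum algorithm and all class names hypothetical:

```python
# Sketch of checksum-verified storage pointers over a mirrored store:
# a pointer is an <address, checksum> pair; reads validate and repair inline.
import zlib

class MirroredStore:
    def __init__(self):
        self.copies = [{}, {}]      # two redundant copies of every block

    def write(self, addr, data):
        for copy in self.copies:
            copy[addr] = data
        return (addr, zlib.crc32(data))   # pointer = <address, checksum> pair

    def read(self, pointer):
        addr, expected = pointer
        for data in (copy[addr] for copy in self.copies):
            if zlib.crc32(data) == expected:
                # Inline repair: overwrite any corrupt sibling copies.
                for copy in self.copies:
                    copy[addr] = data
                return data
        raise IOError("all copies of block %d are corrupt" % addr)
```

Note the store never trusts a block just because the read succeeded; the pointer's checksum is the arbiter, which is what lets latent corruption be detected at all.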
Checksums on user data
- Exactly the same as metadata
- Sometimes undesirable
  - Fragmentation
  - Synchronization
  - Sophisticated applications may receive no benefit
  - If they can detect and repair corruption, why are we in the way?
- Let applications decide
  - Per-file, per-directory attribute
- New APIs
  - Control checksum validation
  - Allow applications to read and verify/fix copies
A file system on Minstore
- Schema
  - Table schema, table relationships
  - Embedded vs. top-level
  - Where to utilize private allocators
- No inodes, no MFT, no file table
  - Well, okay, file tables
- Stream IO
  - Conventional
  - Integrity: allocation changes on write
- Performance
  - Caching
  - Buffer stabilization
A file system on Minstore, cont
- Locking
  - File system independence
  - Take advantage of allocate-on-write
  - Metadata IO concurrent with access (read & write)
- Salvage
- Many design choices
  - We think of it like a protocol
  - Separation of concerns very helpful here
- Out on a limb, design-wise
Thank you