ReFS

J.R. Tipton, Malcolm Smith

2012 Storage Developer Conference. Copyright © 2012 Microsoft Corp. All Rights Reserved.

Background: NTFS

- Released in 1993
- Significantly enhanced since
  - Rich API
  - File-level transactions
  - ARIES-like write-ahead logging
  - Recovery on crash
- Writes in place
  - Depends on write ordering, a.k.a. careful writing

Why ReFS?

- Hardware has changed
  - Data integrity & capacity
- Our expectations have changed
  - Data integrity and availability
  - Can't take the volume offline
- Scale & performance
  - Many more files, directories
  - Much larger files, volumes
  - Concurrency
- …and stay compatible with Windows applications

ReFS data philosophy

- Corruption is commonplace
  - It's not a special case
  - Can't take a day off for corruption
- Must approach availability coherently
  - Don't take the volume offline
- Many different things in one volume
  - One bad apple shouldn't spoil the bunch
  - Don't freak out when something breaks
- Do not write in place
  - Really, it turns Gizmo into a Gremlin

Component overview

- Kept NTFS file system logic
  - API
  - Detailed semantics
- New storage engine: Minstore
  - Recoverable object store
  - All file system metadata is in here
- Minstore supplies primitives; the file system gives them meaning

[Diagram: the NTFS-derived file system layer (FS API, semantics) sitting on top of the Minstore on-disk engine]

Minstore is a component

- Minstore doesn't know what a file system is
- Not just an engineering aesthetic
  - Make things easier and they become feasible
  - Online chkdsk/fsck
- File system oblivious to many things
  - Checksum error detection, inline repair
  - Recovery semantics
- Central place for performance work
- Reusable
  - Kernel, user, file system, whatever

Minstore highlights

- Transaction-centered table key/value API (sketched below)
  - Create transaction
  - Manipulate tables and rows
  - Commit or abort
- Allocate-on-write
- All metadata checksummed
  - User data checksumming optional
- Hierarchical allocation
- Recoverable
  - No inherent on-disk ordering (no log)
  - Takes advantage of allocate-on-write
- Can "salvage" trees – remove bad chunks – online
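
As a rough feel for the API shape, here is a toy, in-memory sketch. All of the ms_* names are invented for illustration (the real Minstore interface is not public); the point is the create/manipulate/commit-or-abort pattern, with commit publishing a shadow copy so the committed state is never modified in place.

    /* Toy in-memory sketch of a transaction-centered table API. All ms_*
     * names are invented; commit publishes a whole shadow copy
     * (allocate-on-write in spirit), abort just drops it. */
    #include <stdio.h>
    #include <string.h>

    #define ROWS 8
    typedef struct { char key[16]; int val; int used; } ms_row;
    typedef struct { ms_row rows[ROWS]; } ms_table;
    typedef struct { ms_table shadow; ms_table *target; } ms_tx;

    static ms_tx ms_tx_begin(ms_table *t) {
        ms_tx tx = { *t, t };               /* work on a private shadow copy */
        return tx;
    }
    static int ms_table_put(ms_tx *tx, const char *key, int val) {
        for (int i = 0; i < ROWS; i++) {
            ms_row *r = &tx->shadow.rows[i];
            if (!r->used || strcmp(r->key, key) == 0) {
                snprintf(r->key, sizeof r->key, "%s", key);
                r->val = val;
                r->used = 1;
                return 0;
            }
        }
        return -1;                          /* table full */
    }
    static void ms_tx_commit(ms_tx *tx) { *tx->target = tx->shadow; } /* atomic flip */
    static void ms_tx_abort(ms_tx *tx)  { (void)tx; }  /* nothing to undo in place */

    int main(void) {
        ms_table t = {0};
        ms_tx tx = ms_tx_begin(&t);
        ms_table_put(&tx, "size", 4096);    /* manipulate rows... */
        ms_tx_commit(&tx);                  /* ...then commit (or ms_tx_abort) */
        printf("%s = %d\n", t.rows[0].key, t.rows[0].val);
        return 0;
    }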

Minstore B+ tables

- One flexible B+ implementation
- There is a lot the B+ engine doesn't know (see the interface sketch below)
  - Doesn't know how pages/buckets are manifested
  - Doesn't know about table embedding
  - Doesn't really know about allocate-on-write
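
One way to picture that separation: the engine reaches buckets only through client callbacks, so page manifestation, embedding, and allocate-on-write all live on the client side of the boundary. The bplus_* names below are invented; this is an interface-shape sketch, not the actual code.

    /* Interface-shape sketch (invented names): the B+ engine sees buckets
     * only through callbacks, so it stays ignorant of how and where they
     * are manifested. */
    #include <stddef.h>
    #include <stdint.h>

    typedef uint64_t bucket_id;

    typedef struct bplus_ops {
        /* Map a bucket id to readable memory. */
        const void *(*pin)(void *ctx, bucket_id id, size_t *len);
        /* Hand back a dirty bucket; the client decides where it lands,
         * e.g. a freshly allocated location for allocate-on-write. */
        bucket_id (*publish)(void *ctx, const void *data, size_t len);
    } bplus_ops;

    typedef struct bplus_tree {
        bucket_id root;   /* could equally live in a parent row (embedding) */
        bplus_ops ops;
        void     *ctx;    /* client state: cache, allocator, ... */
    } bplus_tree;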

Minstore recovery

- B+ table is the unit of recovery
  - Represents one or more file system transactions
  - Table is recovered entirely or not at all
- Tables are written in almost any order
  - Modification order != write order
  - Relatively novel
- Checkpoints "harden" tables to disk (see the sketch below)
  - Utilize flush/sync for stabilizing writes
- Some tables are special, internal to Minstore
  - Written with checkpoints
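
The slides imply a flow with flush barriers rather than a log. A minimal sketch, assuming invented stand-in routines (write_table, flush_disk, write_roots): the only ordering constraints are the barriers between phases; within a phase, tables can go out in any order.

    #include <stdio.h>

    /* Stubs standing in for real IO. */
    static void write_table(int t) { printf("write table %d\n", t); }
    static void flush_disk(void)   { printf("flush\n"); }
    static void write_roots(void)  { printf("write internal root tables\n"); }

    /* Checkpoint: harden dirty tables, then publish the new roots via the
     * special Minstore-internal tables. */
    static void checkpoint(const int *dirty, int n) {
        for (int i = 0; i < n; i++) write_table(dirty[i]);  /* any order */
        flush_disk();     /* stabilize the new buckets */
        write_roots();    /* point the internal tables at them */
        flush_disk();     /* this checkpoint is now the recovery point */
    }

    int main(void) {
        int dirty[] = { 3, 1, 7 };
        checkpoint(dirty, 3);
        return 0;
    }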

Table atomicity

- Minstore can pick which table to write with any heuristic
  - Free from strict ordering requirements
  - Compare to ARIES-like logging systems
- Minstore automatically tracks table dependencies
  - Transaction that spans two tables
  - Automatically expands the atomic unit
- This has potentially visible implications

Table atomicity, cont

- What if a transaction spans two tables?
  - Minstore combines them into one atomic unit
- We call this table binding (illustrated below)
  - Some "atoms" become one "molecule"
  - Automatic
  - Invisible to the client, e.g. the file system
- After writing the bound set, they are unbound
  - "Molecule" broken into "atoms"
  - Automatic and invisible
- Implementation was challenging, but paid off
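
To make binding concrete, here is a toy model using a disjoint-set structure: each dirty table starts as its own atomic write unit, and a cross-table transaction unions them into one "molecule" that must hit disk together. The mechanism and names are illustrative assumptions, not the actual implementation.

    #include <stdio.h>

    #define NTABLES 4
    static int parent[NTABLES];

    static int find(int t) { return parent[t] == t ? t : (parent[t] = find(parent[t])); }
    static void bind_tables(int a, int b) { parent[find(a)] = find(b); } /* atoms -> molecule */

    int main(void) {
        for (int i = 0; i < NTABLES; i++) parent[i] = i;  /* every table its own atom */
        bind_tables(0, 2);            /* one transaction updated tables 0 and 2 */
        printf("tables 0 and 2 bound: %s\n", find(0) == find(2) ? "yes" : "no");
        printf("table 1 independent:  %s\n", find(1) != find(0) ? "yes" : "no");
        /* After the bound set is written, the sets are reset ("unbound"). */
        return 0;
    }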

Hierarchical allocation

- Allocate-on-write == more allocation requests
- Allocation can be a synchronization bottleneck
- Minstore uses hierarchical allocation (sketched below)
  - To allocate, move down a level
  - To free, move up a level
- Think of it as a "big chunk allocator" allocating to a "smaller chunk allocator"
  - Like Russian dolls
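
A minimal two-level sketch, with invented names and sizes: the lower level refills itself with one large region from the level above, then serves small requests privately, so most allocations never touch the shared upper level.

    /* Two-level toy: a medium allocator carves small allocations out of
     * big regions it takes from the large allocator (the "big chunk
     * allocator allocating to a smaller chunk allocator"). */
    #include <stdint.h>
    #include <stdio.h>

    #define LARGE_CHUNK (64u * 1024 * 1024)   /* illustrative region size */

    typedef struct { uint64_t next; } large_alloc;  /* stand-in: bump pointer */
    typedef struct {
        large_alloc *parent;
        uint64_t base, used;          /* current region taken from parent */
    } medium_alloc;

    static uint64_t large_take(large_alloc *l) {
        uint64_t r = l->next;
        l->next += LARGE_CHUNK;
        return r;
    }

    static uint64_t medium_take(medium_alloc *m, uint32_t bytes) {
        if (m->used + bytes > LARGE_CHUNK) {   /* region exhausted: */
            m->base = large_take(m->parent);   /* refill from the level above */
            m->used = 0;
        }
        uint64_t r = m->base + m->used;        /* otherwise serve locally, */
        m->used += bytes;                      /* no shared-level contention */
        return r;
    }

    int main(void) {
        large_alloc  l = { 0 };
        medium_alloc m = { &l, 0, 0 };
        m.base = large_take(&l);               /* initial refill */
        printf("%llu\n", (unsigned long long)medium_take(&m, 4096)); /* 0    */
        printf("%llu\n", (unsigned long long)medium_take(&m, 4096)); /* 4096 */
        return 0;
    }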

Hierarchical allocation, cont

- Large allocator: describes gigabytes
- Medium allocator: describes tens of MB
- Private allocator: describes clusters

Hierarchical allocation, cont

- Freeing blocks at a low level frees them in the upper level, which might cascade
- Allocators can be sparse

Hierarchical allocation, cont

- Any table can have its own private allocator
  - Bottommost level of the allocator hierarchy
  - Better concurrency
  - Aids locality goals
- Allocators are B+ tables (see the extent sketch below)
  - Same properties as other B+ tables, e.g. scalability, embedding options
- There is no bitmap
  - Or is there? There isn't
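
Since free space lives in a table rather than a bitmap, one natural shape is extent rows; the first-fit array below is a stand-in for a B+ table lookup, and all names are invented for illustration.

    #include <stdint.h>
    #include <stdio.h>

    typedef struct { uint64_t offset, length; } free_extent;  /* one table row */

    /* First-fit against an extent list standing in for the B+ table. */
    static int64_t alloc_from_extents(free_extent *rows, int n, uint64_t want) {
        for (int i = 0; i < n; i++) {
            if (rows[i].length >= want) {
                uint64_t r = rows[i].offset;
                rows[i].offset += want;    /* shrink the row in place */
                rows[i].length -= want;
                return (int64_t)r;
            }
        }
        return -1;                         /* no extent large enough */
    }

    int main(void) {
        free_extent rows[] = { { 0, 100 }, { 500, 1000 } };
        printf("%lld\n", (long long)alloc_from_extents(rows, 2, 300)); /* 500 */
        return 0;
    }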

Embedded tables

- Express relatedness of tables, let them scale independently
  - For example, a child can grow very large without bloating the parent
- Reduce table dependency tracking overhead
- Just like top-level tables, except for the root
  - The root, instead of a bucket, is the data portion of a row in the parent
- Seems quirky at first look

Embedded tables

- Look just like a row in a table
- Until you descend into it (sketched below)
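
A hedged picture of a row whose data portion is a child table's root; the layout and names are invented for illustration only.

    /* Illustrative layout: a row's data portion either holds an inline
     * value or is itself the root of an embedded child table. */
    #include <stdint.h>
    #include <stdio.h>

    typedef enum { ROW_INLINE_VALUE, ROW_EMBEDDED_TABLE } row_kind;

    typedef struct {
        row_kind kind;
        union {
            uint64_t value;             /* small payload, right in the row */
            struct {
                uint64_t root_bucket;   /* where the child's root lives */
                uint32_t height;        /* child tree height */
            } embedded;                 /* the child table's root, as row data */
        } u;
    } row_payload;

    int main(void) {
        row_payload r = { .kind = ROW_EMBEDDED_TABLE,
                          .u.embedded = { .root_bucket = 4242, .height = 2 } };
        if (r.kind == ROW_EMBEDDED_TABLE)   /* "until you descend into it" */
            printf("descend into child rooted at bucket %llu\n",
                   (unsigned long long)r.u.embedded.root_bucket);
        return 0;
    }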

Checksums on metadata

- Goal is to detect – and correct – latent hardware corruption
- Every storage pointer is a (location, checksum) pair (see the sketch below)
- Checksum algorithm is flexible
- Take advantage of redundancy schemes underneath Minstore/ReFS
  - Storage Spaces
  - Conceivably could work with third-party storage
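
The key property is that the checksum travels with the reference, not the block, so a misdirected or torn write is caught when the pointer is followed. A runnable sketch with an FNV-1a hash as a stand-in (the slides only say the algorithm is flexible); all names are invented.

    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        uint64_t lcn;        /* location of the child bucket */
        uint64_t checksum;   /* expected checksum of its contents */
    } ms_pointer;

    static uint64_t toy_checksum(const void *p, size_t n) {
        const uint8_t *b = p;
        uint64_t h = 14695981039346656037ull;   /* FNV-1a, a stand-in */
        while (n--) { h ^= *b++; h *= 1099511628211ull; }
        return h;
    }

    /* Follow a pointer; on mismatch the caller would try a redundant copy
     * (e.g. a Storage Spaces mirror) and repair in place. */
    static int read_bucket(const ms_pointer *ptr, const void *data, size_t len) {
        return toy_checksum(data, len) == ptr->checksum ? 0 : -1;
    }

    int main(void) {
        char bucket[16] = "metadata bytes";
        ms_pointer ptr = { 42, toy_checksum(bucket, sizeof bucket) };
        printf("%s\n", read_bucket(&ptr, bucket, sizeof bucket) == 0 ? "ok" : "corrupt");
        bucket[0] ^= 0xFF;                       /* simulate bit rot */
        printf("%s\n", read_bucket(&ptr, bucket, sizeof bucket) == 0 ? "ok" : "corrupt");
        return 0;
    }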

Checksums on user data

- Exactly the same as metadata
- Sometimes undesirable
  - Fragmentation
  - Synchronization
- Sophisticated applications may receive no benefit
  - If they can detect and repair corruption, why are we in the way?
- Let applications decide
  - Per-file, per-attribute
- New APIs (sketched below)
  - Control checksum validation
  - Allow applications to read and verify/fix copies
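
This per-file control surfaced in the Windows 8 SDK as FSCTL_SET_INTEGRITY_INFORMATION. A minimal sketch of enabling checksums on one file, assuming that documented control code and structure; the path is hypothetical and error handling is trimmed.

    #include <windows.h>
    #include <winioctl.h>
    #include <stdio.h>

    int main(void) {
        /* Hypothetical file on an ReFS volume. */
        HANDLE h = CreateFileW(L"D:\\data\\big.db", GENERIC_READ | GENERIC_WRITE,
                               0, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (h == INVALID_HANDLE_VALUE) return 1;

        FSCTL_SET_INTEGRITY_INFORMATION_BUFFER info = {0};
        info.ChecksumAlgorithm = CHECKSUM_TYPE_CRC64;  /* turn checksumming on */

        DWORD bytes = 0;
        BOOL ok = DeviceIoControl(h, FSCTL_SET_INTEGRITY_INFORMATION,
                                  &info, sizeof info, NULL, 0, &bytes, NULL);
        printf("integrity %s\n", ok ? "enabled" : "not enabled");
        CloseHandle(h);
        return ok ? 0 : 1;
    }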

A file system on Minstore

- Schema (one possible shape is sketched below)
  - Table schema, table relationships
  - Embedded vs. top-level
  - Where to utilize private allocators
  - No inodes, no MFT, no file table
  - Well, okay, file tables
- Stream IO
  - Conventional
  - Integrity
  - Allocation changes on write
- Performance
  - Caching
  - Buffer stabilization
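
An entirely illustrative guess at what "a file system as schema" might look like; the slides do not specify the real ReFS layout. Directories as tables keyed by name, with each file row embedding its own stream extent table.

    #include <stdint.h>

    typedef struct { uint64_t vcn, lcn, count; } extent_row; /* one stream run */

    typedef struct {
        uint64_t size, attributes;  /* basic file metadata, in the row itself */
        uint64_t extents_root;      /* embedded table root: this file's extents */
    } file_row;

    typedef struct {
        uint64_t table_root;        /* top-level table: file name -> file_row */
    } directory_table;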

A file system on Minstore, cont

- Locking
  - File system independence of …
  - Take advantage of allocate-on-write: metadata IO concurrent with access (read & write)
- Salvage
- Many design choices
  - We think of it like a protocol
  - Separation of concerns very helpful here
- Out on a limb, design-wise

Thank you
