
Department Informatik Technical Reports / ISSN 2191-5008

Paul Prade, Tobias Groß, Andreas Dewald

Forensic Analysis of the Resilient File System (ReFS) Version 3.4

Technical Report CS-2019-05

December 2019

Please cite as: Paul Prade, Tobias Groß, Andreas Dewald, “Forensic Analysis of the Resilient File System (ReFS) Version 3.4,” Friedrich-Alexander-Universität Erlangen-Nürnberg, Dept. of Computer Science, Technical Reports, CS-2019-05, December 2019.

Friedrich-Alexander-Universität Erlangen-Nürnberg, Department Informatik

Martensstr. 3 · 91058 Erlangen · Germany · www.cs.fau.de

Forensic Analysis of the Resilient File System (ReFS) Version 3.4

Paul Prade1, Tobias Groß1 and Andreas Dewald1, 2

1 Friedrich-Alexander University, Erlangen-Nuremberg, Germany
2 ERNW Research GmbH, Heidelberg, Germany
[email protected], [email protected]

Abstract

ReFS is a modern file system developed by Microsoft, and its internal structures and behavior are not officially documented. Even though there have been some efforts to decipher its structures, some of these findings have since become deprecated and cannot be applied to current ReFS versions anymore. In this work, general concepts and internal structures found in ReFS are examined and documented. Based on these structures and the processes by which they are modified, approaches to recover (deleted) files from ReFS-formatted file systems are shown. We also evaluated our implementation and the allocation strategy of ReFS with respect to accuracy, runtime, and the ability to recover older file states. In addition, we extended The Sleuth Kit, allowing it to parse ReFS partitions, and built a carver on top of that extension.

List of Tables

2.1 Data categories ...... 16
2.2 Postfixes of the names of file system tools ...... 16

4.1 List of default ReFS attribute types ...... 33
4.2 Data structure in the first 24 bytes of the File System Recognition Structure (FSRS) ...... 49
4.3 Data structure contained in the remainder of the boot sector ...... 49
4.4 Structure of a page header ...... 51
4.5 Structure of a page reference ...... 52
4.6 Structure of the superblock ...... 53
4.7 Structure of the checkpoint ...... 55
4.8 Structure of the index root element ...... 57
4.9 Structure of the index header ...... 58
4.10 Structure of the header of a regular index entry ...... 58
4.11 Rows found in the Schema Table ...... 60
4.12 Rows found in the Object ID Table ...... 60
4.13 Variable index root element used in Allocator Tables ...... 61
4.14 Rows found in Allocator Tables ...... 61
4.15 Bitmap structure used in Allocator Tables ...... 61
4.16 Rows found in the Container Table ...... 62
4.17 Rows found in the Parent Child Table ...... 62
4.18 Rows found in the Upcase Table ...... 62
4.19 Structure of the buffer found in the Logfile Information Table ...... 63
4.20 Structure of the “ ” row in the Volume Information Table ...... 63
4.21 Structure of the “General Information” row in the Volume Information Table ...... 63
4.22 Structure of the “General Information ” row in the Volume Information Table ...... 64
4.23 Rows found in the Security Table ...... 64
4.24 Rows found in the Reparse Index Table ...... 64
4.25 Structure of the “ Descriptor” row ...... 65
4.26 Structure of the “ID2” row type ...... 65
4.27 Structure of the “File” row type ...... 65
4.28 Structure of the “Directory ” row type ...... 66
4.29 Variable index root element used in Directory Descriptor tables ...... 66
4.30 Variable index root element used in File tables ...... 67
4.31 Structure of the key in an attribute row ...... 68
4.32 Structure of the header of a standalone attribute ...... 68
4.33 Structure of the $DIR_LINK attribute type ...... 68
4.34 Structure of the $INDEX_ROOT attribute type ...... 69
4.35 Structure of the $NAMED_DATA attribute type ...... 69
4.36 Structure of the $REPARSE_POINT attribute type ...... 69
4.37 Structure of the $USN_INFO attribute type ...... 69
4.38 Rows found in the Data Run Table ...... 70
4.39 Structure of a log page in the restart area ...... 76
4.40 Structure of a log page in the logging area ...... 77
4.41 Structure of the redo information found in a log page in the logging area ...... 78


4.42 Structure of a table component in a redo operation...... 78

6.1 Overview of the created data sets ...... 91
6.2 Output of the TSK extension, compared to the final state of the file system ...... 91
6.3 Final state of the file system, compared to the output of the TSK extension ...... 92
6.4 State of all allocated files according to the TSK extension, compared to all actions ...... 92
6.5 State of all files that could be recovered with the TSK extension compared to all actions ...... 93
6.6 State of all files that could be recovered with the carver application compared to all actions ...... 93
6.7 Runtimes of the different applications ...... 94
6.8 Experiment to analyze the recoverability of COW copies ...... 94

List of Figures

2.1 NTFS structure, based on [35, p. 7] and [12, p. 275] ...... 6
2.2 Corruption in B+-trees, according to [44] ...... 8
2.3 (a) A basic b-tree (b) Inserting key 19, and creating a path of modified pages [39, p. 7] ...... 11
2.4 (a) A basic tree (b) Deleting key 6. [39, p. 7] ...... 11
2.5 Experiment to show that Copy-On-Write allocations make full page copies ...... 12
2.6 Abstraction layers used in The Sleuth Kit (TSK), based on [5] ...... 14

4.1 Overview of the checkpoint mechanism ...... 25
4.2 Global root nodes referenced by the checkpoint ...... 26
4.3 Implementation of a table in a key-value store by using a B+-tree ...... 27
4.4 Concept of embedding tables ...... 27
4.5 Tables referenced by the Object ID Table ...... 28
4.6 Schematic view of the organization of the Object ID Table ...... 29
4.7 Exemplary directory table ...... 30
4.8 Overview of the interaction between the Object ID Table and directory tables ...... 32
4.9 Tables referenced by the checkpoint structure ...... 35
4.10 Exemplary view of the Upcase Table ...... 35
4.11 Overview of the Volume Information Table ...... 36
4.12 Exemplary Parent Child Table with its interpretation ...... 37
4.13 Interaction between the hierarchical allocators, according to [22, p. 41] ...... 38
4.14 Schematic view of bands found in ReFS ...... 39
4.15 Necessity of the address translation ...... 40
4.16 Exemplary address translation process ...... 41
4.17 Exemplary usage of the Block Reference Count Table ...... 41
4.18 Overview of the logging areas used in ReFS ...... 46
4.19 Example for an inconsistent checkpoint state (redo operations are necessary) ...... 47
4.20 Example for a consistent checkpoint state (no redo operations are necessary) ...... 47
4.21 Implementation to calculate the checksum of the FSRS ...... 49
4.22 Example layout of the FSRS structure, explained in table 4.2 ...... 50
4.23 Example layout of the remaining boot sector, explained in table 4.3 ...... 50
4.24 Example layout of a page header, as explained in table 4.4 ...... 51
4.25 Example layout of a page reference structure, as explained in table 4.5 ...... 51
4.27 Calculation of a volume signature ...... 53
4.26 Example layout of a superblock structure, as explained in table 4.6 ...... 54
4.28 Example layout of the first 144 bytes of a checkpoint structure, as explained in table 4.7 ...... 56
4.29 Example layout of the pointer list found in a checkpoint structure, as explained in table 4.7 ...... 56
4.30 Overview of the components of a Minstore B+ node ...... 57
4.32 Example layout of an Index Header structure, according to 4.9 ...... 58
4.33 Example layout of a key index structure ...... 58
4.31 Example layout of an index root structure, according to table 4.8 ...... 59
4.34 Example layout of an index entry structure, according to 4.10 ...... 59
4.35 Example for the deletion of the entries e2 and e1 ...... 71
4.36 Example for the deletion of the entries e1 and e2 ...... 72
4.37 Example for the deletion of the entry e3 ...... 72
4.38 Example for the insertion of the entry e5 ...... 72
4.39 Example for the insertion of the entry e4 ...... 73
4.40 Example for the remaining traces after a push operation ...... 74
4.41 Overview to express the recoverability of tree structures ...... 75


5.1 Array of “file system openers” ...... 79
5.2 Example for a struct overlay used in TSK ...... 80
5.3 Illustration of the file system context object ...... 81
5.4 Combining entries of existing and deleted tables that are referenced by the Object ID Table ...... 83
5.5 Classification process in the collection phase of the carver ...... 87
5.6 Multiple file systems at different locations within a disk ...... 87

Contents

1 Introduction ...... 1
1.1 Related Work ...... 1
1.1.1 Building the next generation file system for Windows: ReFS [46] ...... 1
1.1.2 Analysis of the File System found on Windows 2012 [24] ...... 1
1.1.3 Analysis of the Windows Resilient File System [32] ...... 2
1.1.4 Resilient Filesystem [22] ...... 2
1.1.5 Reverse engineering of ReFS [34] ...... 2
1.2 Results ...... 2
1.3 Outline ...... 3

2 Background ...... 4
2.1 File System ...... 4
2.1.1 NTFS ...... 5
2.2 B+-Tree ...... 7
2.3 Copy-On-Write (COW) ...... 9
2.4 The Sleuth Kit (TSK) ...... 13
2.4.1 File System Category ...... 14
2.4.2 Content Category ...... 14
2.4.3 Metadata Category ...... 15
2.4.4 File Name Category ...... 15
2.4.5 Application Category ...... 15
2.4.6 Extending the file system layer of TSK ...... 16

3 File System Reverse Engineering ...... 18
3.1 Static Analysis ...... 18
3.2 Dynamic Analysis ...... 18
3.3 Examples of file system reverse engineering ...... 19
3.4 Preliminaries for the reverse engineering process ...... 20

4 Analysis ...... 23
4.1 ReFS Concepts ...... 23
4.1.1 Pages ...... 23
4.1.2 Page References ...... 24
4.1.3 Superblock and Checkpoint ...... 24
4.1.4 Minstore B+ Tables ...... 25
4.2 ReFS Analysis ...... 34
4.2.1 File System Category ...... 34
4.2.2 Content Category ...... 37
4.2.3 Metadata Category ...... 42


4.2.4 File Name Category ...... 44
4.2.5 Application Category ...... 45
4.3 ReFS Data Structures ...... 48
4.3.1 Boot Sector ...... 48
4.3.2 Page Structure ...... 50
4.3.3 Page Reference ...... 51
4.3.4 Superblock ...... 52
4.3.5 Checkpoint ...... 54
4.3.6 Tables ...... 56
4.3.7 Schema Table ...... 59
4.3.8 Object Identifier Table ...... 59
4.3.9 Allocator Table ...... 60
4.3.10 Container Table ...... 61
4.3.11 Parent-Child Table ...... 61
4.3.12 Upcase Table ...... 61
4.3.13 Logfile Information Table ...... 62
4.3.14 Volume Information Table ...... 63
4.3.15 Security Table ...... 63
4.3.16 Reparse Index Table ...... 64
4.3.17 Directory Table ...... 64
4.3.18 Directory Descriptor Table ...... 66
4.3.19 File Table ...... 66
4.3.20 Attribute Types ...... 67
4.3.21 Data Run Table ...... 67
4.3.22 File System Journal ...... 69
4.4 ReFS ...... 70
4.4.1 Operations in a single node ...... 71
4.4.2 Operations in a tree ...... 73
4.4.3 Concepts to restore entries from tree structures ...... 74

5 Implementation ...... 79
5.1 Extending “The Sleuth Kit” to support ReFS ...... 79
5.2 Development of a carver based on TSK ...... 86

6 Evaluation ...... 89
6.1 Setup for recovering multiple files from a file system ...... 89
6.1.1 Testing the correctness of the applications ...... 91
6.1.2 Testing the recovery capabilities of the applications ...... 92
6.2 Setup for recovering copies of a single file ...... 93

7 Conclusion and Future Work 95

Bibliography 96

1 Introduction

Storage media analysis is a common task in the field of digital forensics when PCs or mobile devices are analyzed. Investigators have to rely on the proper functioning of their tools to provide them with a correct interpretation of traces. File systems have to be interpreted and presented when analyzing storage media, and doing this manually is unfeasible. From this situation emerges the need for digital forensic tools to ideally support all of the file systems that are currently in use and may be encountered in a forensic analysis. Limitations of classical file systems such as low performance, limited capacity, or unsuitability for SSD drives led to the development of new file systems like APFS or ReFS. These new file systems have to be supported by open-source forensic tools and documentation. Transparency in forensic processes and tools is important when digital evidence is used in severe cases. That is even more important when file systems are proprietary, as the above-mentioned ones are. With this work, we want to provide the forensic community with tools and information to properly analyze ReFS partitions. This report is a revised version of Paul Prade’s1 master thesis.

1.1 Related Work

As mentioned previously, some analysis efforts have already been put into understanding the structures used in ReFS. Most of the available works look at ReFS in versions 1.1 and 1.2. There exists only one publication that focuses on the more recent version 3.2. The terminology used within these publications also differs, as most articles by Microsoft only reveal a few official names for data structures, and different analysts use different terms to refer to the examined structures. Whenever those terms are discussed in the later part of this work, several different terms are mentioned to make this work as comparable as possible to the referenced publications.

1.1.1 Building the next generation file system for Windows: ReFS [46]

In this article, Steven Sinofsky, the former president of Microsoft’s Windows division, introduces ReFS as a new file system for Windows. He explains key goals as well as features that are provided by ReFS. Most newly introduced features contribute to additional integrity, such as the usage of checksums for metadata as well as file data and the Copy-On-Write mechanism that is used to avert corruption caused by write errors. Another highlighted aspect is the scaling for large volumes and directories. The article also compares various key metrics, such as the capacity limits of ReFS and its predecessor NTFS. Regarding the architecture of ReFS, the article mentions that manipulations of on-disk structures happen over a generic key-value interface provided by a storage engine. The implementation of the storage engine exclusively relies on B+-trees. The key-value interface provided by the storage engine is used by an upper layer that utilizes it to implement files, directories, and more.

1.1.2 Analysis of the File System found on Windows 2012 [24]

This working report contains an extensive unofficial documentation of various structures that are used in ReFS. All found structures are explained, and in a concluding appendix, an overview of the identified structures is given. The report does not yet depict how the internal B+-tree and the key-value structure of ReFS work. Green [24] also looks at deleted items. He states that deleted files and directories in ReFS are not expunged from the file system entirely. Instead, he claims that the deletion of files and directories even generates artifacts that inform about their deletion. The artifacts identified by Green, however, are not part of the ReFS specification but instead refer to the recycle bin folder. Windows moves files that are not deleted permanently

1 first author

into a folder called the recycle bin. To prevent naming conflicts in that folder, deleted files are renamed, and an additional entry to store their original name and metadata is created. This process happens entirely outside the ReFS driver, in the implementation of the file explorer, and is best described by [1].

1.1.3 Analysis of the Windows Resilient File System [32]

This document is another unofficial draft of structures found in ReFS. The author of this work references the working report of Green [24]. This work strongly focuses on the low-level presentation of how data structures in ReFS are composed. This document is also the first to vaguely describe the key-value structure that the storage engine used by ReFS provides (see: 4.4 Metadata table value entry, [32, p. 6]). The findings in this work are the foundation for the library libfsrefs2, which was also developed by the document’s author Metz [32]. libfsrefs allows processing ReFS-formatted file systems. As of now, the library supports processing and verifying the first sector and extracting the volume name of a ReFS file system.

1.1.4 Resilient Filesystem [22]

This work of Georges gives another advanced look at the data structures found in ReFS. It builds upon the findings of the previous works but also aims at explaining structures that had been omitted so far. A core goal of this work is to develop a tool that produces analysis results that can be compared to the output of the digital forensics software EnCase. EnCase is a popular commercial analysis tool for digital investigations that also provides support for the ReFS file system. Georges even found the commercial tool to produce faulty outputs, as it reported a different cluster size compared to his findings. He used this opportunity to appeal to the need for public research, as well as open-source applications, in digital investigations. Additional research questions of this work are concerned with different data structures found in ReFS and the pieces of information contained in these. This work is also the first one to look at how ReFS manages to store the allocation status of clusters. The tool developed as part of this work may be used for interactively generating reports about ReFS file systems and their contents as well as for extracting files from them. However, just as the previous works, this thesis looks at the outdated ReFS version 1.2. Since ReFS was not designed to be backward compatible, these findings cannot be fully transferred to the current ReFS version.

1.1.5 Reverse engineering of ReFS [34]

Although Nordvik et al. [34] also investigate ReFS, they mainly reverse engineered v1.2 and v3.2 of ReFS. At the end, they tested their findings on v3.4, too. For v3.2, they came to the conclusion that the version did not differ much from v1.2 and that “the structures are almost identical”. In our work, we come to the conclusion that between v3.4 and v1.2 many data structures were added, deprecated, or changed. New functionality was also added, such as virtual addresses, which has to be taken into account when extracting data from the disk. In contrast to Nordvik et al. [34], we discovered more details and also investigated how the Copy-On-Write mechanism of ReFS is implemented, and we propose strategies to recover deleted or old versions of files from ReFS. In addition to documenting their findings, they implemented a prototype parser and showed the validity of their findings on the basis of hex dumps.

1.2 Results

This work is the first one to examine the current ReFS version 3.4 in detail. Between the previously analyzed ReFS version 1.2 and this current version, many data structures were added, deprecated, or changed. The entire addressing mechanism used to reference metadata blocks has been altered, as more recent versions of ReFS introduced virtual addresses. Because of that, many old findings cannot be applied anymore. More recent versions of ReFS have also introduced a redo log that stores valuable information about the last operations that were performed in a ReFS file system. In this work, numerous of these changes and features are addressed, explained, and documented. Additionally, previous findings of related work are expanded by putting a stronger focus on explaining general

2 https://github.com/libyal/libfsrefs

high-level concepts instead of solely looking at underlying data structures. This approach also allowed comparing ReFS to its predecessor NTFS and finding lots of similarities and related design choices. This work is also the first one to look at how the removal process of entries in the data structures of ReFS works and how this may be used to recover deleted items. Previous works did not yet examine the Copy-On-Write policy used in ReFS either. This work introduces considerations from other well-known file systems that equally use a Copy-On-Write policy. Based on these ideas, strategies are proposed to recover deleted files from a ReFS file system. Finally, the work introduces two applications that were developed to enable investigators to analyze ReFS file systems. The first application is an extension of a well-known forensic tool that may be used to interpret a functioning file system and to recover files and directories in it. The second application is based on the same tool but is more thorough, as it scans an entire drive and can thereby locate unreferenced structures of ReFS file systems. It also does not rely on the correctness of as many data structures as the first tool does. Lastly, both tools are evaluated against other methods to extract contents from a ReFS file system and are made publicly available3.

1.3 Outline

Chapter 1 provides a motivation that stresses the necessity for this work. It also contains a list of related work. Chapter 2 introduces terminology as well as background information that is required to understand basic concepts used in the analyzed file system. In a later chapter, the findings of this work are pigeonholed into the file system category model proposed by Brian Carrier; this chapter also introduces that model. Chapter 3 introduces terminology that deals with the reverse engineering of file systems. Additionally, some previous efforts of reverse engineering other file systems are presented. Based on these findings, various strategies are proposed that we used to analyze the internal structures of ReFS. Chapter 4 is concerned with the actual analysis of the data structures found in ReFS. The chapter starts with an explanation of how ReFS utilizes the concepts that are introduced in chapter 2. This explanation also gives an overall view of how to access files in ReFS. In the subsequent section, many further components of the ReFS file system are presented at a high level without looking at their inner structure. The structures are then also categorized into the file system category model that was introduced in chapter 2. Subsequently, the actual layout of the data structures that ReFS uses is explained. Conclusively, it is explained how ReFS deletes items as well as metadata and how this data can be restored. Chapter 5 presents the implementation of two applications that deal with the interpretation as well as the data recovery process under ReFS. The chapter is used to explain how these tools work and to reveal details of their implementation. Chapter 6 presents the results of the developed applications in various scenarios. The findings in this chapter are used to assess the effectiveness and the correctness of the applications. Chapter 7 concludes the work and discusses the limitations of the analysis as well as the limitations of the developed tools. Finally, an outlook is given regarding what future work needs to be done and which components of ReFS still need to be examined in more depth.

3 https://faui1-gitlab.cs.fau.de/paul.prade/-sleuthkit-implementation

2 Background

In this chapter, fundamental terms and concepts are defined that are used in the subsequent chapters. When explaining these terms and concepts, we also try to name some specifics of ReFS that can be put into the context of the current topic. The chapter ends with an explanation of a model that is used to classify the data structures of a file system. In a later chapter, the findings of the analysis are embedded in this model.

2.1 File System

A file system can be considered the link between the operating system and the data stored on the designated devices. Its task is to provide the foundation for long-term storage and retrieval of data. To fulfill this purpose, file systems usually provide mechanisms for users to store data in a hierarchy of files and directories [12, p. 175]. Aside from the contents of a file, file systems usually also store additional metadata that describes files and folders. Examples of this metadata are timestamps, permissions, size values, or flags. The amount and form of metadata associated with files varies depending on the type of the file system. File systems also differ in the structures they use to fulfill their purpose [22, p. 24]. When talking about the properties of storage media and file systems, it is essential to mention that there exists a variety of terms to refer to differently sized units of data. The smallest addressable unit of data in a storage medium is the so-called sector. A single sector is typically 512 bytes large. File systems may use further terms to describe sizes of aggregated sectors that form larger containers. As there exist multiple names for these containers, Carrier [12, p. 174] uses the term “data unit” to refer to them independently of the file system type. The data units used in ReFS are called clusters and pages. Currently, clusters represent the smallest allocation unit in ReFS. A cluster in ReFS may either be 4 KiB or 64 KiB large. Pages in ReFS are used to store organizational file system data and are formed by multiple clusters. A file system driver always allocates complete data units, even if it does not use them entirely. Often, the unused bytes in a data unit are not wiped when it is allocated. Because of that, the unused part of a data unit may still store remnants of files or other metadata that was previously stored in it. These remnants are called slack space. Slack space plays an essential role when forensically examining a file system. The examined ReFS driver, however, always completely wipes clusters as well as pages when allocating them. As a consequence, no real slack space, according to its definition, exists in ReFS. Nevertheless, the Copy-On-Write process introduces a new variant of slack space that is explained in section 2.3. During the last decades, many different file systems were developed, steadily increasing in flexibility, scalability, robustness, and the set of supported features. Salter [43] separates the development of file systems into the following five generations. Early forms of storage merely provided an arbitrary data area with no file system at all. Punchcards and data on audiocassettes fall into this category. In the next generation, named files, but no folders or metadata, exist. The succeeding generation already provided folders to form hierarchical structures. This development is followed by file systems that are capable of storing metadata, tracking ownership and permissions, and allowing the enforcement of access control. Even more modern file systems added journaling on top of these features to keep the file system, even in the event of a crash, in a consistent state. The latest generation of file systems identified by Salter [43] introduced concepts such as Copy-On-Write, per-block checksumming, and self-healing mechanisms. Those features can be found in modern file systems such as BTRFS, APFS, and ReFS [43].
From the standpoint of file system forensics, it is crucial to know the structures that are employed within these file systems, as any currently used file system may also become part of a digital investigation. To be able to interpret files and directories located on a file system correctly, it is essential to know the underlying model and the structures that are used by that file system. Tools that are used in the context of file system forensics need to dig even deeper, as they are expected to be able to show and possibly recover deleted files. Locating and recovering deleted files is a new challenge for every new file system, as this task was most likely not considered by the original specification of the file system [11, p. 6]. Because of

that, there also exists no standard approach to recover files, and it is necessary to develop strategies to solve this task anew whenever a new file system type is encountered. Obtaining the underlying file system model and the used structures of a file system is a rather simple task if documentation of the file system or the source code of its driver is freely available. If the documentation is insufficient or such information is not publicly available at all, it might be necessary to examine data structures found on the file system manually and to perform reverse engineering to understand them.

2.1.1 NTFS

The New Technologies File System (NTFS) was developed in the ’90s by Microsoft as the successor of the FAT file system. NTFS is still the default file system for all recent Windows versions. Just like ReFS, NTFS was not published with official documentation, so that drivers for other operating systems had to be developed through reverse engineering. Already in 2005, Carrier [12] assumed that NTFS would “likely be the most common file system for Windows investigations” [12, p. 273]. By now, NTFS is still dominant on Windows, but as it lacks the “next-gen” features that were listed above, this might change in the next decades. Horalek et al. [28, p. 1] state that NTFS was initially made available to the public exclusively in a server version of the Windows operating system before it was brought to a broader audience. The authors assume the same for its successor ReFS, which currently is also only available in a few selected server and enterprise versions of the Windows operating system. When introducing ReFS, Sinofsky [46] mentioned that ReFS was built on the foundations of NTFS to maintain backward compatibility. On the other hand, Sinofsky stressed the necessity of deprecating other features of NTFS that provided a poor cost-benefit ratio. Initial codenames for ReFS were Protogon and Monolithic NTFS (MNTFS); the latter name also hints at possible similarities between the file systems NTFS and ReFS. By now, only a limited number of unofficial tools exist that allow analyzing ReFS volumes, whereas for NTFS, which has been in widespread use for many years, vast amounts of material, documentation, and tools exist. This section is intended to explain core concepts and the architecture of NTFS, as it is a well-known file system, and some of its underlying ideas were likely carried into the implementation of ReFS.

Master File Table (MFT)

The core concept of NTFS is that “everything is a file”. Even basic administrative data that is used to manage the file system is mapped to generic file structures, just like files and folders. This design allows a more flexible layout of the file system, as files that contain administrative data may now, just like regular files, be placed anywhere in the volume and may scale identically to regular files. It is merely enforced that the first sector of an NTFS volume stores its boot sector; the remaining file system spans a data area. Every sector in that area may be used to store files [12, p. 274]. The central data structure in NTFS is the Master File Table (MFT). The MFT is a table structure that contains so-called MFT entries, which represent files. Each of those entries is usually 1 KiB large but may be extended if necessary. The first 42 bytes of an MFT entry have a fixed structure, whereas the remaining 982 bytes are unstructured and may be used to store so-called attributes. As the MFT itself is also mapped to a file, it contains an entry that describes itself. Previous implementations of NTFS throughout different versions of the Windows operating system did not only enforce the boot sector to be at the beginning of an NTFS volume but also reserved a contiguous chunk of space for the MFT to prevent fragmentation in it. Figure 2.1 (not to scale) shows an exemplary layout of an NTFS file system. The first 16 MFT entries in NTFS are reserved for file system metadata files. These files reside in the root directory of an NTFS file system but are typically hidden from most users. The names of all those file system metadata files start with a $ and are followed by a name, with the first letter being capitalized. One exception to those names is the root directory, which is simply called “.”. As of now, only 12 of those 16 reserved MFT entries are used. Their names are: $MFT, $MFTMirr, $LogFile, $Volume, $AttrDef, ., $Bitmap, $Boot, $BadClus, $Secure, $Upcase, $Extend. The MFT entry $Extend is a folder that extends this layout and allows storing even more file system metadata files. As this section should only cover the core ideas of the NTFS file system, the purpose of the listed MFT entries is omitted. When later looking at the data structures of ReFS, some of these entries that share similarities with concepts found in ReFS are however referenced.
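To make the fixed 42-byte header tangible, the following C sketch overlays a simplified header structure on a raw MFT entry buffer. The field names and offsets follow the commonly documented NTFS on-disk layout; the struct is an illustrative assumption rather than the exact definition used by any driver.

#include <stdint.h>
#include <string.h>

/* Simplified overlay of the fixed part of an MFT entry ("FILE" record).
 * Offsets follow the commonly documented NTFS layout; illustrative only. */
#pragma pack(push, 1)
typedef struct {
    uint8_t  signature[4];   /* "FILE" */
    uint16_t fixup_offset;   /* offset of the update sequence array */
    uint16_t fixup_count;    /* number of entries in that array */
    uint64_t lsn;            /* $LogFile sequence number */
    uint16_t sequence;       /* reuse counter of this entry */
    uint16_t link_count;     /* number of hard links */
    uint16_t attr_offset;    /* offset of the first attribute */
    uint16_t flags;          /* 0x01 = in use, 0x02 = directory */
    uint32_t used_size;      /* bytes used within this entry */
    uint32_t alloc_size;     /* allocated size, typically 1024 */
} mft_entry_header;
#pragma pack(pop)

/* Returns a pointer to the first attribute of a record, or NULL if the
 * buffer does not start with the expected "FILE" signature. */
static const uint8_t *first_attribute(const uint8_t *record)
{
    const mft_entry_header *hdr = (const mft_entry_header *)record;
    if (memcmp(hdr->signature, "FILE", 4) != 0)
        return NULL;
    return record + hdr->attr_offset;
}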

[Figure 2.1 sketches the on-disk layout: boot sector, MFT, data area, MFT copy, data area; an MFT entry consists of an MFT entry header followed by attributes 1 to n.]

Figure 2.1: NTFS structure, based on [35, p. 7] and [12, p. 275]

File Attributes

As seen in figure 2.1, all MFT entries, and thereby files, consist of a header of fixed size and a variable list of attributes. Attributes of MFT entries represent the properties of a file and comprise file metadata, file names, data content, index structures to model directory contents, and more. Attributes are marked with a type identifier that indicates their purpose. If multiple attributes in an MFT entry share the same type identifier, the attributes must additionally be marked with a name so that they can be distinguished. Attributes in MFT entries are stored in ascending order of their type identifier. Important attributes use a lower type identifier. Because of that, attributes of higher importance can be located faster when traversing the attribute list of an MFT entry. NTFS differentiates between so-called resident and non-resident attributes. If an attribute exceeds a defined size, it is not saved as part of the MFT entry anymore. Instead, the attribute in the MFT entry references a cluster or a cluster run in which only the contents of that attribute are saved. Such an attribute is called “non-resident”. Attributes that, on the other hand, are stored directly in an MFT entry are called “resident”. The file system metadata file $AttrDef stores a list of definitions and information for each attribute type. The contents of this list include the name of the attribute as well as its minimum and maximum sizes and whether the attribute may be stored resident or not.
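The distinction between resident and non-resident attributes is made in the common attribute header. The following sketch, again based on the commonly documented NTFS layout and simplified here, walks the sorted attribute list of an MFT entry and stops early once the requested type has been passed, which is exactly the benefit of the ascending type ordering described above.

#include <stdint.h>

/* Simplified common attribute header; only the fields needed to walk the
 * list and to distinguish resident from non-resident attributes are shown. */
#pragma pack(push, 1)
typedef struct {
    uint32_t type;          /* attribute type identifier, e.g. 0x30 = $FILE_NAME */
    uint32_t length;        /* total length of this attribute record */
    uint8_t  non_resident;  /* 0 = resident, 1 = non-resident */
    uint8_t  name_length;   /* length of the optional attribute name */
    uint16_t name_offset;
    uint16_t flags;
    uint16_t attribute_id;
    /* followed by either a resident header (content size and offset)
     * or a non-resident header (cluster runs) */
} attr_header;
#pragma pack(pop)

/* Return the first attribute of the wanted type, or NULL.
 * 0xFFFFFFFF terminates the attribute list of an MFT entry. */
static const attr_header *find_attribute(const uint8_t *entry,
                                         uint16_t attr_offset,
                                         uint32_t wanted_type)
{
    const uint8_t *p = entry + attr_offset;
    for (;;) {
        const attr_header *a = (const attr_header *)p;
        if (a->type == 0xFFFFFFFF || a->length == 0)
            return NULL;            /* end marker or malformed entry */
        if (a->type == wanted_type)
            return a;
        if (a->type > wanted_type)
            return NULL;            /* list is sorted ascending by type */
        p += a->length;
    }
}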

MFT Entry Addresses

A file reference address is a 64-bit value that uniquely identifies MFT entries in NTFS. The address may be split into a 16-bit sequence number that is counted up when an MFT entry is reallocated and a 48-bit address that may be used to address the actual record in the MFT. MFT entry addresses do not depend on the directory and the path in which a file resides. MFT entries may also contain multiple file name attributes. This design choice makes NTFS more flexible than its predecessor FAT and enables the creation of so-called hard links. The concept of a hard link allows files to possess more than one file name and furthermore allows storing directory entries of the same file in different directories. Hard links can be treated just like the file that they reference, as they only expose a different file name but point to the same MFT entry. When creating a hard link under NTFS, a directory entry is inserted into the index attribute of the current directory, just like for a regular file or folder. Additionally, the linked MFT entry is extended with a new $FILE_NAME attribute. When a file is deleted, it must also be possible to check whether the MFT entry is no longer referenced by any hard link and thus may be deleted. For this purpose, NTFS has to maintain a link counter in every MFT entry.
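The split of a file reference into its two components is a simple bit operation. The following sketch assumes the 48-bit record number / 16-bit sequence number layout described above; the constant used in the example is made up for illustration.

#include <stdint.h>
#include <stdio.h>

/* Split a 64-bit NTFS file reference into its MFT record number (lower 48
 * bits) and its sequence number (upper 16 bits). */
static void split_file_reference(uint64_t file_ref,
                                 uint64_t *record, uint16_t *sequence)
{
    *record   = file_ref & 0x0000FFFFFFFFFFFFULL;
    *sequence = (uint16_t)(file_ref >> 48);
}

int main(void)
{
    uint64_t rec;
    uint16_t seq;
    split_file_reference(0x0003000000000005ULL, &rec, &seq);
    printf("MFT record %llu, sequence %u\n",
           (unsigned long long)rec, seq);   /* MFT record 5, sequence 3 */
    return 0;
}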

Journal

Many modern file systems use journals. File system journals facilitate the recovery process in the event of a power loss or a system crash. They usually store a list of actions as well as an identifier which marks the last modifying action that was completely persisted to the storage medium. This information may be used to gain an insight into past interactions with the file system. NTFS provides two files that serve as journals.


• File System Journal ($LogFile): Before metadata in NTFS is updated, the update process is recorded in a journal. The completion of a metadata update is also recorded in the journal. These two pieces of information allow a file system driver to perform undo or redo operations to recover from a crash of the operating system. The file system metadata file $LogFile is used to track such update procedures. When the end of the log file is reached, the writing process cycles and continues at the start of the log file. Operations such as creating files or directories, changing the content of a file, or renaming a file constitute so-called “file system transactions”. Those transactions can be broken up into smaller operations that were bundled by the transaction and need to be performed consecutively [12, p. 205].

• Change Journal ($Extend\$UsnJrnl): Since version 3.0, NTFS additionally provides a so-called Update Sequence Number Journal, short USN Journal or Change Journal. This type of log may be used to track which files were modified in a given time span. It uses bit fields to describe the types of changes that were made but does not include which data of a file was changed. Carrier [12, p. 343] mentions that the change journal feature is disabled by default but may be enabled by any application. Because of this circumstance, he notes that it is unclear how useful this journal is, as it is not always enabled, but still stresses its importance in reconstructing events that occurred recently.

2.2 B+-Tree

The following definitions of trees and their variations stem from Elmasri and Navathe [20, p. 646-654]. Bayer and McCreight initially introduced the B-tree and the algorithms associated with it in 1972. Over the years, specializations of the classical B-tree, such as the B+-tree, were developed. Both the B-tree and the B+-tree originate from the well-known search tree data structure. Trees consist of nodes that are connected by edges. Edges manage the relationship between nodes, whereby they form the structure of the tree. Every node apart from the root node has a parent node. Additionally, every node may have an arbitrary number of child nodes. A node that does not have any child nodes is called a leaf node. In opposition, a node that does have child nodes is called an internal node or inner node. The depth of a node describes its position in the tree: starting at 0 with the root node, the depth increases by following down its child nodes. If the leaf nodes are all at the same depth, the tree is called balanced. The term height describes the longest path from a node to a leaf node. In a balanced tree, all paths from a node to a leaf node have an equal length. The search tree is a specialization of common trees. Search trees are used to manage key-value entries in a tree and to be able to search them efficiently. Being able to search for entries in a tree is enabled by storing them in the sorting order of their keys. Nodes in a search tree use an index structure that indicates what edge to follow when searching for a key. Following an edge in a search tree restricts the search space for an element, as all nodes not in the followed subtree are ignored. B-trees are balanced search trees designed to work well on disks or other direct-access storage devices. Through more complex insertion and deletion algorithms, B-trees limit the waste of space in their nodes. Reorganization operations in B-trees assure that internal nodes always have a minimum occupancy of 50%. Nodes in a B-tree are organized in the form of:

$\langle P_1, \langle K_1, Pr_1 \rangle, P_2, \langle K_2, Pr_2 \rangle, \ldots, \langle K_{q-1}, Pr_{q-1} \rangle, P_q \rangle$

$P_i$ is a tree pointer to a child node in the B-tree, and $\langle K_i, Pr_i \rangle$ represents a key-value pair. Within that key-value pair, $Pr_i$ is a pointer to the record with a search key equal to $K_i$. Leaf nodes in B-trees differ from internal nodes in that all of their tree pointers are unset. B-trees use a so-called minimum key rule to sort entries by their key. This rule requires that the relation $K_1 < K_2 < \ldots < K_{q-1}$ is fulfilled in every node. It also enforces that tree pointers $P_i$ only reference subtrees with keys that are $\leq K_i$ [38, p. 7-8]. In 1979, Douglas Comer’s publication “The Ubiquitous B-Tree” provided the first prominent description of the B+-tree as a variant of the B-tree. Whereas B-trees allow keys including data pointers to be stored at any level in the tree, the B+-tree restricts data pointers to the leaf nodes of the tree. Through this change, the leaf nodes in B+-trees differ from the inner nodes, as the latter merely contain links to their child nodes, whereas the former are only used to store data. With this adjustment, inner nodes of B+-trees may be denoted by the form:


$\langle P_1, K_1, P_2, K_2, \ldots, P_{q-1}, K_{q-1}, P_q \rangle$

whereas leaf nodes in a B+-tree may be described as:

$\langle K_1, Pr_1 \rangle, \langle K_2, Pr_2 \rangle, \ldots, \langle K_{q-1}, Pr_{q-1} \rangle$

The core idea of storing data elements only in leaf nodes and putting child links solely into inner nodes allows packing more entries into internal nodes. Through this change in the inner nodes, B+-trees provide a higher fan-out than conventional B-trees and thereby allow keeping the tree more shallow. This goal is especially important when looking at database or file systems, where nodes often correspond to blocks that have to be read from traditionally slower storage media such as disks. Storing data pointers only in leaf nodes also allows adding forward and/or backward links between the leaf nodes so that it is possible to sequentially traverse all data elements [20, p. 646-654].
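The node forms above translate directly into a point lookup: inner nodes are only used to choose the child pointer whose key range covers the search key, and only the final leaf node is scanned for the value. The following in-memory C sketch illustrates this; the node layout (fixed-capacity arrays, 64-bit keys) is a simplification chosen for the example and does not correspond to any on-disk format.

#include <stdint.h>
#include <stddef.h>

/* In-memory sketch of a B+-tree node. Inner nodes hold q-1 keys and q child
 * pointers; leaf nodes hold key-value pairs. Pointer P_i leads to the subtree
 * whose keys are <= K_i (and greater than K_{i-1}). */
typedef struct bpt_node {
    int              is_leaf;
    size_t           nkeys;
    uint64_t         keys[64];
    struct bpt_node *children[65];  /* inner nodes only */
    void            *values[64];    /* leaf nodes only  */
} bpt_node;

/* Choose the child pointer to follow for a search key. */
static const bpt_node *descend(const bpt_node *node, uint64_t key)
{
    size_t i = 0;
    while (i < node->nkeys && key > node->keys[i])
        i++;                        /* first index with key <= K_i */
    return node->children[i];
}

/* Point lookup: walk from the root down to a leaf, then scan the leaf. */
static void *bpt_lookup(const bpt_node *root, uint64_t key)
{
    const bpt_node *n = root;
    while (!n->is_leaf)
        n = descend(n, key);
    for (size_t i = 0; i < n->nkeys; i++)
        if (n->keys[i] == key)
            return n->values[i];
    return NULL;
}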

B+-trees in the context of file systems

File systems such as BTRFS, APFS, and ReFS utilize B-/B+-trees to store organizational data [38, p. 2]. The conjunction of B-/B+-trees and the Copy-On-Write (COW) update policy that is often seen in modern file systems makes several concepts of classical B- and B+-trees inapplicable. Rodeh [38] proposed various changes to B-trees to make them “COW friendly” so they may be used together with a COW update policy. B+-trees implemented in ReFS presumably use some of those ideas and thus must be treated and considered differently from regular B+-trees. While they can still be searched and viewed like regular B+-trees, they show a different behavior when insertion and deletion operations are executed on them. As explained, B+-trees typically store the data in their nodes indirectly by merely referencing it through a pointer $Pr_i$. B+-trees in modern file systems typically omit this last indirection and store values directly in nodes. Because of that, and since the size of keys and values in the B+-trees of modern file systems is variable, requirements for the balanced distribution of entries cannot be fulfilled. This circumstance additionally hinders B+-trees as implemented in ReFS from achieving the minimum occupancy of 50% in inner nodes, as this concept presupposes a fixed size for the entries saved in a B+-tree. The underlying storage engine of ReFS deals with this issue by using a different insertion behavior for different types of tree structures but cannot guarantee the same space utilization as regular B+-trees. Some modern file systems such as ReFS combine the tree structure provided by B+-trees with the concept of so-called Merkle trees. In Merkle trees, every inner node validates all of its child nodes by storing cryptographic hashes of them. Nodes in B+-trees as implemented in ReFS do not store cryptographic hashes of their child nodes, but a checksum of them instead. In a blog article, Shanks [44] explains how ReFS may utilize B+-trees to dampen the corruption of metadata. By storing checksums together with the references to nodes in a B+-tree, it is possible to detect corrupted nodes. If corruption in a node of a B+-tree occurs, only the node itself and the nodes that it references are affected by the corruption.
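A minimal sketch of such a checksummed reference is shown below. The actual ReFS page reference layout is discussed in chapter 4; here, the structure, the callback, and the FNV-1a stand-in checksum are assumptions made purely to illustrate how corruption can be detected and confined to a single subtree.

#include <stdint.h>
#include <stddef.h>

/* Simplified model of a child reference that stores a checksum of the
 * referenced node. FNV-1a is only a stand-in for the actual checksum. */
typedef struct {
    uint64_t target_address;  /* where the child node is stored      */
    uint64_t checksum;        /* expected checksum of the child node */
} node_ref;

static uint64_t fnv1a64(const uint8_t *data, size_t len)
{
    uint64_t h = 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* Return the child page only if its checksum matches the reference; a NULL
 * result lets the caller isolate the corrupted subtree instead of failing
 * the whole volume. */
static const uint8_t *follow_reference(const node_ref *ref,
                                       const uint8_t *(*read_page)(uint64_t),
                                       size_t page_size)
{
    const uint8_t *page = read_page(ref->target_address);
    if (page == NULL || fnv1a64(page, page_size) != ref->checksum)
        return NULL;    /* corruption detected in this subtree only */
    return page;
}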

Figure 2.2: Corruption in B+-trees, according to [44]

If, as shown in figure 2.2, the link in the node with key 90 indicates corruption, only the subtree that it references is affected by the corruption. Through this isolation of the erroneous data, the rest of the file system remains unaffected and is still available to the user. Only if the root node of a crucial tree is affected

by corruption should the entire volume be taken offline. Presumably for this reason, ReFS stores completely identical fallback copies of important tree structures. In the implementation of ReFS, files and nearly all existing data structures are saved as entries in B+-trees. Thus, it becomes crucial to analyze in detail how the concept of a COW friendly B+-tree is implemented in ReFS, how insertion and deletion operations work, and how deleted entries from a B+-tree may be recovered. Hansen and Toolan [25, p. 6] also note that file systems that use B-trees to organize metadata are challenging in a forensic context. They state that through balancing and rewriting operations, free records in such trees get overwritten rapidly. They also mention that in their attempt to recover historical metadata from a file system that was based on B-trees, they were only able to restore data from a very limited time frame. On the other hand, however, they note that the usage of a Copy-On-Write mechanism still provides great recovery opportunities.

2.3 Copy-On-Write (COW)

Copy-On-Write (COW) is an update policy that may be used to alter data on a storage medium. A COW update policy makes sure that data is never updated in place. Whenever the contents of a block should be altered, the block is read into memory, modified, and its contents are written to an alternate location on the storage medium. Not overwriting old data helps in preventing data loss. When the system crashes while a block is updated through a COW policy, the old state of the data remains untouched and still persists. The opposite updating strategy to COW is called Update-In-Place (UIP). Whenever the contents of a block should be altered with a UIP policy, the block is read into memory, modified, and its contents are written over the data at its original location [15, p. 2343]. Copy-On-Write offers a simple strategy to enforce atomicity in write operations and to ensure the integrity of data structures [39, p. 15]. In literature, the term shadowing is sometimes used synonymously for the process of Copy-On-Write [38, p. 1]. ReFS uses Copy-On-Write to perform robust disk updates and to avert inconsistent states in the case of a crash. In his introduction of ReFS as a new file system, Sinofsky [46] mentioned that “metadata must not be written in place to avoid the possibility of ‘torn writes’”. Torn writes occur when the process of writing new data over old data is interrupted and cannot complete. A typical reason for this issue are power failures that occur during a write operation [44]. Similar to various more recent file systems, ReFS uses Copy-On-Write combined with adjusted B+-trees. If data is updated with a COW policy and the system crashes during the update, the old state of the data still exists and is referenced from a structure called checkpoint. In the context of a file system, a checkpoint references the complete file system tree that reflects the state of that file system. The term checkpoint also refers to the operation that manifests this complete state of a file system. Only after data was successfully written to a new location is a link in the checkpoint updated to point to the newly written data. After the checkpoint was updated successfully, the previous checkpoint may be discarded. Rodeh [38, p. 6] states that it is unproblematic if a system crash occurs while a checkpoint is written, as the system may fall back to using the previous checkpoint, which remains intact until the new checkpoint has been successfully written. The checkpoint process is efficient, as modifications to tree blocks may be collected and batched first and written to the disk at a later time. Rodeh [38, p. 6] also mentions that recoverability in the context of Copy-On-Write may be ensured by additionally logging commands. More recent versions of ReFS use a journal into which all changes to the file system are logged so that they can be redone. Similar to the journaling implemented in NTFS, the journal under ReFS is written sequentially and cyclically. When examining Copy-On-Write as a concept for file systems based on B+-trees, it seems appropriate to discuss the ideas used in previous file systems that were developed with similar foundations. All so-called next-gen file systems such as BTRFS, APFS, and ReFS use COW as a policy to update data structures, and all also rely on variants of B-trees. Changes to entries in B+-trees always happen in leaf nodes. Sometimes these changes propagate up into inner nodes.
When updating a leaf node while using a Copy-On-Write policy, the node is written to a new location on the disk. Hence, the link through which the node is referenced by its parent node must also be updated. This procedure continues recursively in all ancestors of the node until the root node of the B+-tree is reached and a new checkpoint is written. This growth of write operations forces a rethinking of existing algorithms and concepts used in regular B+-trees.
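The following sketch makes this propagation explicit: updating a single key produces a shadow copy of every node on the root-to-leaf path, while the old nodes are left untouched and keep describing the previous tree version. The node layout is the same illustrative in-memory simplification as before, not an on-disk format.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Minimal sketch of COW path copying in a B+-tree. */
typedef struct cow_node {
    int              is_leaf;
    size_t           nkeys;
    uint64_t         keys[4];
    struct cow_node *child[5];    /* inner nodes only */
    uint64_t         values[4];   /* leaf nodes only  */
} cow_node;

static cow_node *shadow(const cow_node *n)   /* allocate a modifiable copy */
{
    cow_node *copy = malloc(sizeof *copy);
    memcpy(copy, n, sizeof *copy);
    return copy;
}

/* Update an existing key and return the root of the new tree version.
 * The old root still references the previous, fully intact state. */
static cow_node *cow_update(const cow_node *node, uint64_t key, uint64_t value)
{
    cow_node *copy = shadow(node);
    if (node->is_leaf) {
        for (size_t i = 0; i < copy->nkeys; i++)
            if (copy->keys[i] == key)
                copy->values[i] = value;     /* change only the shadow copy */
        return copy;
    }
    size_t i = 0;
    while (i < node->nkeys && key > node->keys[i])
        i++;
    copy->child[i] = cow_update(node->child[i], key, value);
    return copy;   /* the parent changes too: its child pointer moved */
}

A checkpoint would then be switched over to the returned root; until the allocator reclaims the old nodes, they remain on disk as exactly the kind of unreferenced previous state that is interesting for recovery.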


A possible solution to this problem that does not affect the implementation of B-trees is to use an alternate style of Copy-On-Write. Rosenberg et al. [40] and Reuter and Gray [37] first described this method. The main problem of Copy-On-Write with respect to linked data structures is that all links to an altered component must be updated, as an update writes the component to a new place and thus its address changes. This issue is combated by not altering the address of structures updated via Copy-On-Write. This can be achieved by assigning constant virtual addresses to every block in the system. All references use these virtual addresses and do not need to know the physical address of a block. A virtual address may be translated to a physical address by querying a mapping table that maintains such translations for every existent block (see the sketch below). The major drawback of this idea is that such a mapping table can become quite large. The performance of the lookup process via this table also becomes crucial, as the table has to be queried whenever a page is read. In the implementation of most modern file systems that leverage the combination of B+-trees and Copy-On-Write, a different solution has been dominant, which relies on modifying the algorithms and the structures used by B+-trees. In a research report, Rodeh et al. [39] discuss characteristics of BTRFS. They describe so-called “COW friendly” B-trees as a central element in the structure of the BTRFS file system. COW friendly B-trees are tree structures that are based on standard B+-trees but use various algorithmic adjustments. Regular B+-trees provide links between leaf nodes to simplify range queries and tree balancing operations. Chained leaf nodes, however, impose a problem for the Copy-On-Write paradigm. When using the COW paradigm and linked leaf nodes, a change in a single leaf node recursively impacts all its siblings that link to it. Having to update all siblings of a node, and thus all leaf nodes in a tree, requires rewriting the complete B+-tree structure. As this cost would make most operations in the B+-tree structure unprofitable, COW friendly B-trees renounce linking leaf nodes. This limitation, however, makes it impossible to apply many older ideas from literature to COW friendly B-trees. Another change that is proposed for B-trees to be COW friendly is to further restrict shuffle operations between neighboring leaf nodes. Regular B+-trees allow shuffling keys from neighboring leaf nodes after a delete operation was performed to keep the tree balanced. If the adjacent leaf node involved in the shuffle operation, however, belongs to a different subtree, Copy-On-Write would again induce noticeably more write operations than regularly required. Another noteworthy limitation that is posed by the combination of B-trees and Copy-On-Write concerns concurrent tree updates. When inserting or removing a key from a leaf node, in most cases solely this node must be exclusively locked to prevent concurrent accesses to it. However, as all changes in B-trees with a COW policy propagate up to the root node of the tree, all nodes in this chain, and most critically the root node, must be exclusively locked. This drastically impacts concurrent accesses to the tree, as modifying operations may not be executed in parallel anymore.
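The mapping-table approach mentioned above can be sketched as follows. The structure and function names are hypothetical and only illustrate the idea of stable virtual addresses: a COW update rewrites the block and one table entry, while every reference that stores the virtual address stays valid.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical mapping table that translates stable virtual block
 * addresses to their current physical locations. */
typedef struct {
    uint64_t virtual_addr;
    uint64_t physical_addr;
} map_entry;

typedef struct {
    map_entry *entries;   /* kept sorted by virtual_addr */
    size_t     count;
} mapping_table;

/* Binary search for the physical address; returns 0 if unmapped. */
static uint64_t translate(const mapping_table *t, uint64_t vaddr)
{
    size_t lo = 0, hi = t->count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (t->entries[mid].virtual_addr == vaddr)
            return t->entries[mid].physical_addr;
        if (t->entries[mid].virtual_addr < vaddr)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;
}

/* After a COW write, only the mapping of the moved block is changed. */
static void cow_relocate(mapping_table *t, uint64_t vaddr, uint64_t new_paddr)
{
    for (size_t i = 0; i < t->count; i++)
        if (t->entries[i].virtual_addr == vaddr)
            t->entries[i].physical_addr = new_paddr;
}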

The proposed changes applied by the COW friendly B-tree discussed by Rodeh [38] boil down to the following adjustments:

1. Updates in the tree are executed in top-down order.
2. Linking of leaf nodes is omitted.
3. A lazy reference-counting mechanism is introduced for space management. This step is only necessary if one wants to provide an efficient method to clone tree structures.

In their work, Rodeh et al. [39] visualize the impact of Copy-On-Write when used in conjunction with B+-trees. The same behavior applies to the proposed COW friendly B-trees. The following figures from their report show what changes for insert and delete operations in a B+-tree if all operations are executed by using Copy-On-Write. Yellow nodes in the figures represent existing blocks, whereas green nodes represent new blocks that were allocated via the COW policy.

Insertion into a B+-tree, see figure 2.3

If a key is inserted into the shown B-tree, all nodes that are affected by the change are written to a new place. The value 19 is inserted into the rightmost leaf node containing the values 10 and 11. As this scenario requires no reorganization of the tree structure, an insertion algorithm for an Update-In-Place policy would stop here. The Copy-On-Write policy, however, alters the location of any updated node. Thus, the link referencing the node must be updated as well. In this scenario, the node was previously referenced by the

root node. As a result of that, the root node also needs to be updated and written to a different location. The old root node still references the state of the tree before the update operation. A Copy-On-Write policy would finally update the referenced root node in a checkpoint and thereby discard the old root node.

Figure 2.3: (a) A basic b-tree (b) Inserting key 19, and creating a path of modified pages [39, p. 7]

Deletion from a B+-tree, see figure 2.4

Figure 2.4: (a) A basic tree (b) Deleting key 6. [39, p. 7]

A similar effect can be observed when looking at the deletion of entries. The key 6 can be found in a leaf node that contains the values 5, 6, and 7. When the key is removed from the node, a new copy of the node is written to a new place on the disk. After this operation has been executed, the same recursive updating procedure as described before has to be performed.

Forensic implications of Copy-On-Write policies

Both the insertion and the deletion of entries in a B+-tree that utilizes a COW update policy are interesting from the viewpoint of a forensic examiner. When an entry is inserted into the tree, the old entry is not removed; only a reference to the node in which it was stored is removed. Thus, it might be possible to locate the previous state of the B-tree. In the second scenario, in which a value was removed from the B-tree, it is similarly possible to obtain an older state of the tree in which the entry was not yet deleted. File systems based on B+-trees mostly represent files and directories as such tree structures. While directories are usually implemented as trees that store variable-sized directory entries in their leaves, files are typically portrayed as trees that store disk extents in their leaves [38, p. 2]. This concept can equally be found under ReFS, where tree structures are similarly utilized to represent the structure of files and directories. By this means, a Copy-On-Write policy constructs a partial or complete copy of a file or a directory every time it is altered. In addition to the entry that was altered, other entries, as well as unused data chunks in the block in which the entry resided, are duplicated. In this way, the Copy-On-Write policy leaves lots of traces behind that might be of value for a forensic examiner and might allow him to look at different past states of a file or directory. Copy-On-Write slack space: As stated in section 2.1, no traditional slack space exists in ReFS. However, the Copy-On-Write policy introduces a new variant of slack space. ReFS uses so-called pages to store organizational file system data. Pages are also the unit of Copy-On-Write operations and usually correspond to tree nodes. Whenever a Copy-On-Write operation is performed, the contents of an entire page (aside from its header data) are copied to a different on-disk location. Every page has an inner structure



Figure 2.5: Experiment showing that Copy-On-Write allocations make full page copies, which may fragment as well.

When an entry is removed from a page of a B+-tree, the entry is not wiped. Instead, only a link to a data chunk in the page is removed, and the data chunk is released. Since COW operations happen at the granularity of a page and do not logically copy the referenced elements in a page, they retain deleted entries. Because of that, COW operations may also generate duplicates of entries even after they have been removed. We verified this observation in a small experiment that is shown in figure 2.5. In the experiment, we modified a single page that was referenced by the file system. To make the modification clearly visible, we completely filled the unused areas of the page with markers. In the figure, the markers are represented by X’s. To make the driver accept the modified page, it was also necessary to correct the checksum of the page. The driver later used and copied this page multiple times. As shown in the figure, the marker tags remained in the page for some time. The figure also shows an entry that is deleted in the first step. Nevertheless, the entry would be retained and copied whenever a COW operation is carried out, until the page contents are reorganized and it gets overwritten. We refer to such logical elements, which still exist in a page but were previously removed, as Copy-On-Write slack space.

In practice, the success of recovering traces that arise through Copy-On-Write strongly depends on the allocation strategy used by the file system. The allocation strategy that is used to supply new pages for Copy-On-Write operations determines which pages are reallocated and overwritten and thereby how long old copies of previously used pages exist. Ideally, an allocation policy for a file system should allocate its pages in a cohesive region of storage to prevent fragmentation. Sinofsky [46] explains that concepts were introduced in the implementation of ReFS to maintain a measure of read contiguity. These concepts might, however, lead to short-lived copies of old pages, as such pages might rapidly be reclaimed and overwritten. It is also not a trivial task to locate older pages that were created through Copy-On-Write operations, as there is no place from which those pages are referenced. Instead, those potentially interesting pages merely remain unreferenced, unallocated clusters located potentially anywhere on the disk. The ReFS driver maintains a linked list between older states of Copy-On-Write pages in memory. This information, however, is on the one hand limited to a short time frame and on the other hand usually not available to an examiner who has to analyze the on-disk structures of a ReFS file system. A technique to locate pages that stem from previous COW operations is carving. Strategies to carve for older states of pages are explained in section 5.2. To track the order in which blocks were written via COW, all blocks under ReFS also contain clock values. To our understanding, ReFS makes no practical use of these clock values yet. They, however, offer great value to a forensic examiner, as they can be used to sort unreferenced pages by their creation time.
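As a rough illustration of how these clock values can be exploited, the following sketch orders carved page candidates per table by their clock values. The parse_page_header() helper, its field names, and the candidate format are hypothetical placeholders for a real carver (the actual structures are documented in chapter 4):

    from collections import defaultdict

    def order_page_candidates(candidates, parse_page_header):
        """candidates: iterable of (disk_offset, raw_page_bytes) produced by a signature scan."""
        by_table = defaultdict(list)
        for offset, raw in candidates:
            hdr = parse_page_header(raw)          # hypothetical helper returning header fields
            if hdr is None:
                continue                          # not a plausible page after all
            by_table[hdr.table_id].append((hdr.allocator_clock, hdr.tree_update_clock, offset))
        # A higher virtual allocator clock means the copy was written in a later checkpoint,
        # so sorting the copies of one table yields a relative creation order.
        return {table_id: sorted(pages) for table_id, pages in by_table.items()}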

Implementation aspects of Copy-On-Write in BTRFS In their report, Rodeh et al. [39] explain further concepts used in BTRFS that are based on Copy-On-Write. During the forensic analysis, we found several of these concepts similarly implemented in ReFS. BTRFS considers a file system to be a forest of different tree structures. A so-called superblock, which is located at a fixed disk location, is the entry point and also the anchor of the file system. The superblock contains a pointer to a “tree of tree roots”. This tree indexes all the B-trees that form the current state of the file system. ReFS follows a similar approach, but instead of saving important tree roots in a tree structure, ReFS stores them in a list in a structure called the checkpoint. The tree references in this structure are also referred to as “global root nodes”. Whereas the “tree of tree roots” under BTRFS allows storing an arbitrary number of tree roots, the number of referenced root nodes under ReFS is currently limited. In BTRFS, changes to files and directories as well as to allocation or other tree structures induce updates in leaf nodes. As explained previously, when using a COW policy, changes to entries in leaf nodes always cause

recursive updates that ripple up the affected tree until its root node is reached. As it would be quite expensive if every modification to the tree entailed a rewrite of all nodes from the leaf to the root node, those changes are collected. Based on either a timer or a counter of changed pages, modifications in the tree are first accumulated and at a later point written to the disk as a batch. Such behavior is referred to as a lazy checkpointing mechanism. ReFS uses a timer as well as the journal filling state and other criteria to decide when to update the on-disk structures and to persist a new checkpoint.

2.4 The Sleuth Kit (TSK)

When the first digital forensic investigations were conducted, digital investigators would often use the evidentiary computer itself to retrieve evidence from it. This approach bears the risk that the examined computer alters the evidence in an undetectable way. This methodology of obtaining evidence also neglects the recovery of deleted files. Even though operating systems provided tools that allowed to extract deleted data from hard drives manually, they were rarely used. Hence digital evidence examinations at that time mostly ignore deleted data. Only since the early 1990s tools were developed, that would enable digital investigators to collect all data on a computer disk, without altering important details. As the awareness of the evidentiary value of computers in forensic examinations increased, the need for more advanced tools grew. Tools like Encase and FTK firstly helped digital investigators to conduct examinations more efficiently, as these tools could perform routine tasks. Additionally, those tools also firstly presented data in a graphical so that investigators could locate important details more easily [14, p. 28]. Initially, analysis software that was used in the field of digital forensics mostly was custom and proprietary. At a later time, the development of open-source tools within the field of digital forensics provided freely accessible utilities that supported a comparable set of features. Open-source tools are more transparent to the public as they document their procedures within their source code and by that provide a more retraceable behavior [11]. Aside from that, open-source tools may also be extended and adjusted to own needs. The Sleuth Kit (TSK) is an example of such a tool suite of open-source file system forensic tools. It was developed with high portability and extensibility in mind [8, p. 42]. The highly extensible design of TSK allows developers to extend its functionality as well as the types of file systems it supports. TSK is structured into multiple layers of abstraction that map how data is stored on storage media. Figure 2.6 presents an overview of the layers found in TSK. The following description of the layers is based on [5]. The description merely comprises the layers that are relevant to access a file system.

• Base Layer: This is the lowest layer of TSK. The base layer contains functions and structures that may be used by all layers. The functionality of this layer includes error handling, the definitions of various data types, as well as convenience functions.

• Disk Layer: This layer provides an abstraction over various formats of disk images. It hides the inner properties of file system images and abstracts from whether an image is split, compressed, or encrypted. Through this abstraction, the layer provides a contiguous view of the image to the upper layers.

• Volume System Layer: This layer provides an abstraction over various types of volume systems that may be used to manage multiple partitions on a disk. The functions implemented in this layer provide the starting and ending locations of all partitions contained on the disk.

• File System Layer: This layer provides an abstraction over various types of file systems that may be found within a partition. It is also possible for a file system to span the entire disk image. The functions provided to bridge this abstraction enable a user to access files and folders stored on a file system. This layer can be further subdivided into five sub-layers that generically describe the inner structures of most file systems.
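To make the interplay of these layers more concrete, the following sketch walks them from Python, assuming the pytsk3 bindings of TSK, 512-byte sectors, and a raw, single-part image file named disk.img (all of which are assumptions for illustration, not part of the implementation described later):

    import pytsk3

    img = pytsk3.Img_Info("disk.img")            # Disk Layer: open the image
    vs = pytsk3.Volume_Info(img)                 # Volume System Layer: parse the partition table

    for part in vs:
        if part.len <= 0:                        # skip meta/unallocated entries
            continue
        try:
            # File System Layer: try to open a file system in this partition
            fs = pytsk3.FS_Info(img, offset=part.start * 512)
        except IOError:
            continue                             # no file system that TSK understands
        for entry in fs.open_dir(path="/"):      # enumerate the root directory
            name = entry.info.name.name.decode(errors="replace")
            meta = entry.info.meta
            print(part.addr, name, meta.addr if meta else None)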

The layer of most importance within this work is the File System Layer, as this is the abstraction in which file systems such as FAT, NTFS, and also ReFS reside. To make different file systems more comparable and to develop a generic interface for them, Carrier [12] proposed a reference model within the File System Layer. Carrier introduces the model in his work “File system forensic analysis”, where he utilizes it to explain a variety of file systems, such as FAT, NTFS, Ext, and UFS, in a generic fashion.


Figure 2.6: Abstraction layers used in The Sleuth Kit (TSK), based on [5]

The same model is also embedded in TSK. All file system tools provided by TSK can be assigned to a single category of this reference model. As the findings of this work were also implemented in TSK and since the model proposed by Carrier can similarly be applied to ReFS, the structure of the analysis chapter closely follows the structure of Carrier’s model. The model offers a uniform starting point for looking at different file systems and makes the identified components of ReFS comparable to those of other well-known file systems. In the following subsections, the different categories within the File System Layer are explained by referring to [12].

2.4.1 File System Category

Data that falls into this category is concerned with general information about a file system. Compared to the other categories, the file system category plays a superordinate role, as it describes the layout of the file system and must be processed first to find out where the data structures of the other categories reside. Data in the file system category may also describe the size of a data unit as well as the number of data units or metadata structures managed by the file system. Data structures of this category can typically be found in the first block or at a defined entry point of a file system. Structures in this category mostly consist of single, independent values and thus cannot be interpreted in more depth. Tools typically use this data to access other structures of the file system and might display values from the file system category to an investigator. TSK offers a single tool called fsstat that may be used to query data of this category. Depending on the analyzed file system type and the file system instance, fsstat presents different data to an investigator.

2.4.2 Content Category

This category comprises data that is contained in files. Usually, most of the data in a file system can be assigned to this category. This data is traditionally stored as an accumulation of data units; in ReFS, these data units are also called clusters. As a file system needs to know which data units are in use and which may be allocated to a new file, file systems use a data structure to keep track of the allocation status of each data unit.
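A generic sketch (not specific to ReFS, and assuming one bit per cluster with least-significant-bit-first ordering) of how such an allocation structure, here a bitmap, answers whether a data unit is in use:

    def cluster_is_allocated(bitmap: bytes, cluster: int) -> bool:
        byte_index, bit_index = divmod(cluster, 8)
        return bool((bitmap[byte_index] >> bit_index) & 1)

    bitmap = bytes([0b00000101])                 # clusters 0 and 2 allocated, cluster 1 free
    print(cluster_is_allocated(bitmap, 1))       # False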


2.4.3 Metadata Category

Data in the metadata category describes the properties of a file. Such properties may be information on where the data of the file is located and how big it is, as well as timestamps that describe when the file was created, last accessed, or last written. Entries that store the metadata of files and directories can typically be referenced through a numerical identifier. This identifier is also called the metadata address; TSK calls it “inum”. Metadata of a file may describe that a file is write-protected, hidden, stored sparse, compressed, or encrypted, or it may contain access control information of a file. In more recent file systems, metadata may also contain checksum data that was calculated from the contents of a file. Data in this category also offers the opportunity to perform what is called metadata-based recovery. If one can locate the metadata structure of a deleted file, its actual recovery is mostly straightforward and no different from interpreting the contents of an allocated file. The only difference that exists between reading the contents of a deleted and an existing file is that one cannot be sure whether the data units of the deleted file have been reallocated to a new file in the meantime. Metadata structures and data units can become out of sync if data units are allocated to new files, and in general, there exists no strategy to find out whether the found blocks represent the last state of a recovered file or not. Modern file systems partly change this. File systems such as ZFS, BTRFS, and ReFS did not only introduce checksums to protect the integrity of file system structures but also allow calculating and storing checksums of user data. ReFS only calculates the checksums of user data if this is explicitly requested by a user. The reason for this is that the calculation of checksums is time-consuming. Since older file systems did not use checksums to verify the contents of data, applications that worked with the data would often take on this task. If an application still calculates the checksum of the data it works with, it is not necessary to additionally calculate checksums on the file system layer. Checksums are used to detect and repair bit rot resulting from media degradation and corruption from other sources. Checksums make modern file systems more resilient towards these errors but also introduce new opportunities for file system forensic tools. If checksum data of a file’s contents is stored together with its metadata, a tool can verify the correctness of the checksum. A checksum mismatch might indicate that data units of the file were reallocated and overwritten or that the content of a file is for some other reason out of sync with its checksum. If the checksum saved with a deleted file matches the checksum calculated from its data, the data units of the file have most likely not been reallocated.
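A minimal sketch of this test, deciding whether the clusters of a deleted file were probably reallocated. The concrete checksum algorithm used by ReFS is not asserted here; zlib.crc32 merely stands in for it, and the extent list is assumed to come from the file’s recovered metadata:

    import zlib

    def clusters_probably_intact(image, extents, cluster_size, stored_checksum):
        """extents: list of (first_cluster, cluster_count) runs taken from the file's metadata."""
        crc = 0
        for first_cluster, cluster_count in extents:
            image.seek(first_cluster * cluster_size)
            crc = zlib.crc32(image.read(cluster_count * cluster_size), crc)
        # A match suggests the data units were not reallocated since deletion; a mismatch
        # means they were overwritten or the metadata is out of sync with the content.
        return crc == stored_checksum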

2.4.4 File Name Category

The file name category establishes a mapping between metadata addresses and file names and thus allows a user to refer to a file by its name instead of its metadata address. Data of this category can usually be found by reading the data referenced by metadata entries that represent directories. The analysis of data in the file name category mostly starts with locating the root directory of a file system, as this directory is required to find a file by its path. While metadata-based recovery only requires a metadata entry to exist, file name-based recovery requires both the file name and the metadata entry of the corresponding file to exist. Once a file name has been recovered, it may be used to find its associated metadata entry. Starting from this metadata entry, it then becomes possible to perform metadata-based recovery. When discussing metadata-based recovery previously, we encountered the problem that the clusters referenced by a metadata entry might have been reallocated and overwritten by the contents of a different file. A similar problem occurs here, as the metadata entry to which a deleted file name points might also have been reallocated and reused.

2.4.5 Application Category

The application category is used to store optional data through which additional features may be introduced to a file system. Data in this category traditionally is stored as special file system data instead of in regular files and therefore is also sometimes mentioned in the specification of a file system. Practical examples of such data are user quota statistics and journals. The file system could work perfectly fine without them, but they extend its functionality or help a file system to recover faster from a crash.


First Letter   Layer
d              Data Unit
i              Metadata
f              File Name
fs             File System
j              Journal

Table 2.1: Data categories

Ending   Function
ls       Lists information in the layer
cat      Displays content in the layer
stat     Displays details about a given object in the layer
find     Maps other layers to its layer
calc     Calculates “something” in the layer

Table 2.2: Postfixes of the names of file system tools

2.4.6 Extending the file system layer of TSK

The TSK toolkit contains over 20 command-line analysis tools that are separated into the layers shown in figure 2.6. The essential tools for this work are the file system tools that are located in the file system group. The names of the tools found in the file system group correspond to the five identified subcategories. The first part of the name of a tool identifies the category in which it operates, and the second part of the name specifies the operation that is performed by the tool. The tool icat, for example, operates in the metadata category (i) and prints the contents of a file (cat). An overview of the different abbreviations of categories and functionalities is given in tables 2.1 and 2.2. Soon after TSK was released, it allowed investigators to analyze the commonly used file system types NTFS, Ext, and FAT. Over time, support for new file system types such as ISO9660, ExFAT, HFS+, and Yaffs2 was added. TSK, however, still lacks official support for some more modern file systems such as BTRFS, ZFS, and ReFS. Since the code of TSK is open source and its design is extensible, support for new file systems can be added with reasonable effort. When implementing the support for ReFS in TSK, some remarks from Carrier [11] were considered:

• The analysis tool must handle every possible condition: If the analysis tool does not cover every possible condition, a suspect could potentially create a condition that would hide data from the investigator. The driver that implements the logic of the file system only has to ensure that it can handle every condition that it can create itself [11, p. 4-5]. The analysis tool must exceed this functionality and be able to correctly interpret any given file system that is valid according to the file system’s specification.

• The analysis tool must perform tasks that exceed the specification of a file system: Most tools that are utilized to perform file system forensic analysis provide the option to show deleted files and directories to an investigator. In some cases, tools are also able to recover these. Such tasks were mostly not considered when the file system was specified. Because of that, there exists no standard procedure to perform the recovery of deleted files and directories. To recover deleted entries, tools must often process and interpret unused space to locate data that fulfills various sanity checks. The purpose of the sanity checks is to make sure that the found data is what the tool assumes it to be. If the sanity checks used by a tool are too strict, it is likely that some deleted entries will not be shown, and evidence is discarded. On the other hand, such sanity checks must also not be too weak, as this may lead to falsely interpreting data as something it is not. This behavior would produce artificial, false evidence [11, p. 6]. As mentioned before, the ReFS file system uses a Copy-On-Write (COW) allocation strategy, so that data is never written in place. This behavior potentially leaves many traces in a ReFS formatted file system.

• A forensic analysis tool that regularly interprets a file system and additionally searches the slack space of referenced pages for links to such promising copies will likely not find all old pages. Many pages that were previously allocated via COW and deleted at a later time cannot be located, as they are not referenced from any page in the file system anymore.
An attempt to search for those structures must go further than just interpreting file system structures or searching the slack space of blocks that can be reached. A tool that is to locate all blocks that were ever created through the COW policy has to scan the entire file system.

• Additionally, such a tool should not require the volume to contain a completely valid ReFS file system. If structures of the file system category, such as the boot sector or the


superblock, are damaged, classical file system forensic tools refuse to interpret the file system, and an analysis becomes impossible. Such a file system might, however, still contain valuable evidence.

To tackle these problems, we developed an additional tool called refstool within the file system layer of TSK. This tool is a command-line application that allows scanning an entire ReFS file system and thus allows recovering all files potentially existing on it. This approach also yields old copies of files that were created through the COW policy. Both implementations are presented in chapter 5.
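The core idea behind such a full scan can be summarized in a few lines; the page signature value and the is_plausible_header() checks below are illustrative assumptions, not the exact rules implemented in refstool:

    def scan_for_pages(image, volume_size, cluster_size, is_plausible_header,
                       signatures=(b"MSB+",)):
        """Test every cluster of the volume instead of only following references."""
        candidates = []
        for cluster in range(volume_size // cluster_size):
            image.seek(cluster * cluster_size)
            raw = image.read(cluster_size)
            if raw[:4] not in signatures:        # cheap first filter: assumed page signature
                continue
            if not is_plausible_header(raw):     # hypothetical stricter sanity checks
                continue                         # (sane table id, clock values, ...)
            candidates.append((cluster, raw))
        return candidates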

3 File System Reverse Engineering

Carrier et al. [13, p. 7] describe file system analysis as a process that “translates the bytes and sectors of the partition to directories and files”. This process, however, requires knowledge about the model of the underlying file system. Given the existence of many proprietary, poorly documented file systems, it is necessary to develop techniques to obtain file system models. Plum [35, p. 14] defines file system reverse engineering as the process of extracting such a file system model from an existing file system. Tools that are used to analyze file systems as part of forensic investigations have the purpose of giving an examiner an overview of existing and mostly also deleted files and directories found on the file system. To be able to offer this desired functionality, the authors of those tools require detailed knowledge about the model of the file system that is subject to the analysis [35, p. 14]. The main goal of this work is to obtain the file system model of the ReFS file system and to develop a tool based on this model. In this section, various approaches to reverse engineering file systems and the information available for an analysis of ReFS are discussed.

Chikofsky and Cross [16] formally define reverse engineering as “the process of analyzing a subject system to identify the system’s components and their interrelationships” and to “create representations of the system in another form or at a higher level of abstraction”. Reverse engineering enables an analyst to visualize the structure of a piece of software and allows him to understand how the software accomplishes its tasks and how it operates. Data reverse engineering is a branch of classical reverse engineering that is concerned with deciphering undocumented file formats or network protocols. Data reverse engineering mostly looks at proprietary file formats or protocols that can only be accessed by a program written by their original author, as no one else knows the internal structure of that data [19]. File system reverse engineering is a variant of data reverse engineering. Instead of the structure of a file or a network protocol, file system reverse engineering is concerned with the data structures used by an entire file system. In general, there exist two main techniques to perform reverse engineering: static and dynamic analysis.

3.1 Static Analysis

In a static analysis, an analyst tries to understand a program’s behavior by reading its code. As analysts mostly have no access to the source code of a program, the code they look at usually is the program’s instruction sequence. This sequence may be obtained by using a disassembler. Disassemblers vary strongly in the set of features they provide. The most basic example of a disassembler is the objdump application, which merely provides a linear listing of the instructions of an application. More advanced disassemblers such as Hex-Rays IDA Pro offer helpful comfort features that, amongst other things, aid in automating analysis processes and visualizing program structures in a graph view, and make it easy to add notes to disassembled instructions. The whole purpose of reverse engineering revolves around understanding the analyzed application; therefore, a visualization of the program structure is always useful. A benefit and at the same time a drawback of static analysis is that an analyst sees every instruction in a program. When looking at large applications, the number of functions and instructions becomes unmanageable. Usually, however, reverse engineers are only interested in a few paths that are taken and the code that is executed there.

3.2 Dynamic Analysis

A dynamic analysis involves the execution of software that should be analyzed. This execution usually happens in an isolated, controlled environment. During the execution of the software, all observable aspects of its behavior are recorded. As a consequence of that, dynamic analysis can be most useful if an analyst intends to find interesting paths in the program that are traversed during its execution. The findings of a static analysis might overwhelm

an analyst with too many instructions that cannot all be covered, whereas a dynamic analysis is only concerned with a given execution path. In a dynamic analysis, it is likely that some functions are never called and thus remain out of the scope of the analysis. Since it still might be necessary to analyze them, it is most useful to use a combination of techniques from both the static and the dynamic analysis.

3.3 Examples of file system reverse engineering

The following compilation of publications that give an insight into how reverse engineering of file systems may be accomplished is strongly based on the findings of Plum [35, p. 20-23]. One common purpose of data reverse engineering is the analysis of closed-source software to provide interoperability with it. A general example is driver code that is only released for a single hardware architecture or operating system. If the vendor of a driver refuses to support different platforms, reverse engineering becomes necessary to comprehend the functionality of the original driver [18, p. 7]. The developers of the NTFS driver used for Linux faced a similar issue. In a report, they state that at the time of the development of an unofficial NTFS driver, Microsoft had not provided any documentation about the internals of NTFS. This circumstance forced them to reverse engineer the NTFS file system from scratch. They describe their procedure in four steps that were repeated very often:

• Look at the volume with a hex-editor
• Perform some operation, e.g. create a file
• Use the hex-editor to look for changes
• Classify and document the changes

They characterize this methodology as time-consuming and laborious and also state that writing a driver derived from their findings was far more straightforward than the analysis of the file system model of NTFS [4]. The approach taken by the authors might suffice to develop a driver that supports all operations that can be tested manually. A significant problem of this approach, however, is that it only exposes a driver’s behavior in tested scenarios. The functionality of the analyzed driver might be more comprehensive, but the findings of an analysis are limited to the behavior that can be observed in the covered cases. An additional hurdle of this method is its implicit assumption that changes made to the file system are so small that it is feasible to manually compare the current and a previous state of the file system. This hurdle becomes an even greater burden if the driver does not overwrite data in place. Copy-On-Write strategies, which are widely employed by modern file systems, lead to changes of far more data on the file system than a single altered sequence and thus make it harder to employ such methodologies.

Another example of reverse engineering a file system model to provide interoperability is described by Hellwig [27]. Hellwig states that for achieving full compatibility towards a file system, it is indispensable to comprehend all on-disk structures in detail. Presumably, a better understanding of the driver’s internals goes hand in hand with an understanding of the on-disk structures of a file system and vice versa. However, Hellwig furthermore stresses that for achieving compatibility, a thorough understanding of on-disk structures is more important than an understanding of the inner workings of a driver. When looking at file system forensics, the internal workings of a file system driver gain more importance, as the behavior of a driver determines how files and folders are deleted and what kinds of traces are left. Hellwig’s analysis began by examining publicly available documentation of the VxFS file system. He later employed various reverse engineering techniques such as disassembly and using symbolic information [27, p. 192]. Shullich [45] analyzed the proprietary exFAT file system developed by Microsoft to facilitate forensic analyses of it.
His analysis approach included searching for patents that were registered by Microsoft and referred to the exFAT file system. In his work, he also looked for similarities in internal structures shared by exFAT and other already well-known variants of the FAT file system. Similar to the approach of the analysts of the NTFS file system, Shullich also used the driver provided by Microsoft to create multiple slightly different images of the exFAT file system which he later compared in a hex-editor. Shullich describes the process of repeatedly using a hex-editor on large files as very tedious. Hence he also developed a program to provide formatted printouts of the located structures [45].


Authors who have already analyzed the structure of the ReFS file system, such as Ballenthin [9], Metz [32], Green [24], Head [26], and Georges [22], used similar approaches. They also fell back on the pieces of information provided by Microsoft. Additionally, most of them also created instances of ReFS file systems, made a copy of the original state, altered something by using the official driver, and later compared both states in a hex-editor. Hellwig [27, p. 192-193], who reverse engineered the VxFS file system, also mentions a technique called “protocol snooping”. Protocol snooping is the idea of monitoring the input/output behavior of a system. Transferred to a file system, protocol snooping might instead be concerned with monitoring read and write operations. Hellwig found the concept of protocol snooping unsuitable for file systems, as this technique requires either specialized, rarely available hardware or the possibility to emulate block storage devices. In more recent work, Tobin et al. [49] use this technique in the context of CCTV systems.

Tobin et al. [49] note that reverse engineering of yet unknown proprietary file systems is a time-consuming process. On the other hand, forensic investigations sometimes require extracting data as fast as possible. In their work, they try to resolve this conflict by proposing a so-called “eavesdrop” approach that is similar to the idea of protocol snooping. Whereas reverse engineering of file systems is mostly concerned with trying to reconstruct low-level data structures, the approach of Tobin et al. [49] aims to identify the order in which blocks in a file system are being accessed. In their analysis, they accessed unknown structured data with a tool that was able to interpret that data. Instead of trying to reverse engineer the tool to uncover its interpretation of the data, they performed various actions with the tool and monitored the resulting disk I/O to obtain the patterns in which the tool accessed the data. In this way, it is possible to find out how data is roughly structured and in which order it may be accessed without the need to analyze an underlying executable application. This approach allows detecting fixed entry points as well as pointer structures in data. These findings may yield information about the data structures that are used. Such an approach is flexible, as it merely requires a tool or a driver that is aware of the structures used in the unknown data. Its granularity, however, is limited to the size of the blocks that are read by the file system. The mere information that a file system read a block of data does not give a reverse engineer insight into which data within that block was needed by the application.

3.4 Preliminaries for the reverse engineering process

When starting with the analysis of ReFS, various sources of information were available to us:

• Publications and unofficial documentation of ReFS

• Publications about different file systems that may employ concepts similar to those of ReFS
• A Windows version that uses the ReFS driver in version 1.2
• Windows 10 Update 1803, a Windows version that uses the current ReFS driver in version 3.4

• A program database (PDB) file containing public debug symbols of the current ReFS driver

Sotirov [47] gives a good introduction to the reverse engineering of binaries provided by Microsoft. He notes that most Microsoft binaries do not utilize code obfuscation. Since most of the code of Microsoft binaries is written in object-oriented C++ and compiled with the Microsoft Visual C++ compiler, one may also make assumptions about the optimizations that were applied during compilation. Knowing how a compiler might have translated single operations into machine instructions is a great help when analyzing them. Probably the most important aspect mentioned in the work of Sotirov is the fact that Microsoft provides debugging symbols, in the form of PDB files, for most of its binaries. Although Microsoft distinguishes between public and private debug symbols and only publishes the public ones, these symbols nonetheless provide a vast amount of information. Public debug symbols include the names of functions as well as the names of global variables used in an application [31]. Most of the names that are introduced in this work to refer to data structures employed in ReFS stem from these symbols. Otherwise, if a structure was artificially named by us, this is mentioned explicitly. The first thing we did in this examination was to inspect all existing publications about ReFS. Most of them are listed as related work in the introductory chapter. We also looked for concepts from known file systems

that might have been transferred to ReFS. Since ReFS is the successor of NTFS, we found it necessary to seek internal similarities between both of these file systems. All pieces of unofficial documentation that describe ReFS as of now look at its early versions ReFS 1.1 and ReFS 1.2. Thus we gradually tried to transfer those earlier findings to its current version. In this early stage of our work, we merely focused on the representation of data structures on sample ReFS images in version 3.4 that we created ourselves. Various concepts used in ReFS posed hurdles to us. It is not easily possible to mount the file system, make changes to it, unmount it, and compare the modified image to its original state, as all changes are written by using a COW update policy. As a consequence, it is initially hard to determine which blocks were updated. Sometimes it is useful to change data in a file system from the outside to analyze how the interpretation of the driver reflects these changes. This concept, however, cannot be applied as simply to ReFS, since ReFS utilizes checksums that are used to detect and prevent such external changes.

To get a better understanding of the structures and concepts used in ReFS, we employed various reverse engineering techniques. We first set up an analysis environment that would allow us to analyze the ReFS driver statically and dynamically. Since a driver was the subject of analysis, we had to build a setup that would allow us to perform kernel-mode debugging. Drivers are executed in the kernel of an operating system, and halting the execution of a driver implies halting the entire operating system. To still be able to control the debuggee, a second computer must exist which receives the state from and sends controls to the debuggee. The debugger is called the host system, whereas the debuggee is also referred to as the guest system. We used a virtual machine in which an instance of Windows acted as the host system. On the host system, we used the Windows debugger (WinDbg) to control and analyze the execution of the ReFS driver. The host machine was also used to run multiple tools for a static analysis of the driver and to store notes about findings that were obtained through the dynamic analysis. We used another virtual machine to execute the guest system in which the code to be debugged, namely the ReFS driver, was actively used to perform actions on a ReFS file system. [2] best explains the fundamentals of this setup.

While we performed a dynamic analysis of the ReFS driver, we pursued an approach similar to the eavesdropping strategy that was proposed by Tobin et al. [49]. First, we tried to identify the functions that are responsible for reading data of the file system. With these functions, it was possible for us to automatically determine in which order on-disk blocks were accessed and read. Using a debugger makes this process even more effective, as it enabled us to study the read blocks with a finer granularity. Directly after a block has been read into a buffer, it is possible to use so-called “access breakpoints” that point into this buffer. Access breakpoints allow stopping the debuggee whenever a value is read from or written to memory at a specified address. Setting access breakpoints within blocks that were read allows identifying whenever the driver accesses fields within these blocks. This approach enabled us to rapidly identify the size of single fields within a block.
Combining this information with the list of function names from the PDB file made it significantly easier to classify different components of the file system driver. We supplemented the dynamic analysis with a static analysis of the driver in which we also extracted all strings found in the driver. We did this by using the strings utility under Linux. It is important to note that Windows encodes most of the strings that it uses as 16-bit little-endian Unicode (UTF-16LE), which has to be considered when executing strings. It was also interesting to see that the ReFS driver in version 1.2 contained many strings that are not present in the current ReFS driver and vice versa. Both ReFS drivers use most of their strings for debugging or logging purposes. If, for example, a table structure in a ReFS file system becomes corrupted, the driver writes a message that contains the name of that table into the Windows event log. From this information, we could derive many official names of data structures. We then searched for places that reference these strings, so-called cross-references, in the driver to determine where these strings are formatted and printed. Another important methodology was using tools that employ the file system API provided by ReFS. We used the application fsutil repeatedly to read properties of different instances of ReFS formatted file systems. At the same time, we employed the kernel debugger to find out which values were read via the driver and later output by the fsutil application. As fsutil also associates names with the data that it extracts from the file system model of ReFS, this gave us further insight into the terminology used in ReFS. The ReFS driver also contains much code to trace its execution and to log errors, as well as code to provide telemetry functionality that is used to format and send records of events that occur in the driver. We found 46 functions in total that were used to format different telemetry messages. These messages contain formatted strings that store the names of properties of the file system as well as their associated values.
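As a stand-in for the strings invocation with a 16-bit little-endian encoding described earlier in this section, the following sketch extracts printable UTF-16LE sequences from a driver binary (the file name refs.sys and the minimum length are arbitrary illustrative choices):

    import re

    def utf16le_strings(path, min_chars=6):
        data = open(path, "rb").read()
        # printable ASCII characters, each followed by a zero byte, repeated min_chars times
        pattern = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_chars)
        return [m.group().decode("utf-16-le") for m in pattern.finditer(data)]

    for s in utf16le_strings("refs.sys"):
        print(s)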

We also employed a dynamic analysis to find out from which on-disk structures these values were read. We furthermore found more than 100 functions in the driver that were created by the Windows software trace preprocessor (WPP). WPP mainly exists for debugging purposes but is also a valuable source of information for reverse engineers. Developers may introduce WPP events into an application to log variables in particular situations. WPP functions are automatically generated code that writes the data of these events. This data may be consumed by tracing applications. The structure of the data that is written by WPP functions can be looked up in so-called event manifests. These documents, however, are usually not supplied to end users. Still, reverse engineering may be used to comprehend which data has been logged in a tracing process. The functions that are created by WPP to produce event messages take different numbers and data types of arguments, which they write at a later point into a consumer object. The utility of WPP for reverse engineering and its internal components are best explained in [23]. At a later point of our analysis, we also deciphered the structure of the redo transaction log that ReFS uses. The contents of the redo log also helped us much in understanding ReFS, as the log records all modifying interactions with the file system.

The current ReFS driver consists of roughly 3,000 different functions. Thanks to the PDB file that was accessible to us, we were able to extract the names of these functions. Knowing the names of functions was a great aid in prioritizing the functions in the analysis process. Still, it was unfeasible for us to understand values in the file system that were rarely used or ignored by the driver. Because of that, and because of the large number of functions that exist in the driver, the findings of our analysis are still incomplete and lack the interpretation of multiple structures that are used in ReFS. After we had analyzed the file system model of ReFS to a point where we understood the most critical data structures and were able to interpret them, we put a stronger focus on analyzing the internals of the driver. It was essential for us not only to find out which data is stored on the disk but also to comprehend how this data is modified and how the process of file creation and deletion works, as these processes help to understand how files and directories may be recovered.

4 Analysis

In this chapter, the results of the analysis of the on-disk structures used under ReFS are presented. The structure of this chapter is kept similar to the characterization of different file system types by Carrier [12]. The first section discusses the basic concepts found under ReFS. The second section looks at the different categories of data found under ReFS and describes how data in each category may be analyzed. The final section is dedicated to the data structures used under ReFS and shows how they may be interpreted.

4.1 ReFS Concepts

The data unit under ReFS is referred to as a cluster. When allocating storage for the contents of a file in ReFS, one or more whole clusters are always allocated. Data that forms the organizational structures of a ReFS formatted file system is allocated in a different unit, in so-called pages.

4.1.1 Pages

Pages are the data unit of organizational file system data, as well as the unit at which the Copy-On-Write mechanism operates. Apart from two exceptions, namely the superblock and the checkpoint, pages represent equally sized chunks of data. In more modern versions of ReFS, pages are always at least as large as a cluster. Depending on the size of a cluster, the size of a page ranges from 1 to 4 clusters. In previous versions of ReFS, pages had a fixed size of 16 KiB and were smaller than the only available cluster size. Every page starts with a header that consists of the following elements:

• Header Information: This part of the page contains the page signature that describes the type and hence the purpose of the page. It also contains a volume signature that is randomly generated and assigned to the file system when it is created.

• Virtual Address: In the header of a page, up to four virtual addresses may be stored to describe the location of the page. As a measure for performance, ReFS uses an address translation process that maps virtual addresses to real clusters. When following a reference to a page, one can verify that the virtual address in the found page matches the virtual address of the reference. If a page consists of multiple clusters, these do not need to be placed contiguously on the underlying volume. To reassemble the contents of a page, all four virtual clusters need to be read and recombined.

• Clock Values: Pages also contain clock values that play an essential role when it comes to the implementation of Copy-On-Write. Each page stores a virtual allocator clock and a tree update clock. The virtual allocator clock is a 64-bit counter that is updated whenever a new checkpoint is written. Every time a new page is allocated, the current virtual allocator clock is written into its header. As of now, we are unsure where the clocks in a page header are used in the ReFS driver. They offer, however, great value to a forensic examiner, as they allow establishing a relative ordering among pages. This clock has nothing to do with real timestamps, which may easily be forged, but instead establishes a reliable relative order that is purely influenced and maintained by the driver of the file system. The tree update clock is an additional clock value that encodes the number of updates that were written in the last clock cycle. When the driver of the file system writes a batch of altered pages to their new locations, the tree update clock describes how many update operations were executed in that batch; the virtual allocator clocks and tree update clocks are equal for all pages in the batch. As of now, we are uncertain what exactly constitutes a single update.

• Identifier: In addition to the page signature, every page also holds an identifier field that may be used to further subdivide the purpose of a page with a given type. The most important type of page are pages that represent nodes in a B+-tree.


For these types of pages, the identifier is used to encode the tree to which the node belongs.

4.1.2 Page References

Pages in ReFS rarely exist independently, only for themselves. Instead, most pages form a composition of more complex data structures. To realize such behavior, it must at least be possible to reference pages from other pages. As of now, the most crucial data structure that is formed by pages referencing one another is a modified B+-tree. However, there also exists a different data structure that is used to store a list of references to root nodes of B+-trees. Instead of only saving the location that describes where the referenced page is stored, references also store a checksum of the data contained in the referenced page. The checksum is stored separately from the page and thus independently of its data. The checksum may be used to detect corruption in metadata pages which stems from bit rot. Bit rot can result from bit flips, I/O operations that were performed incorrectly, or from magnetic media that fails to hold a bit position [30, p. 8]. Since page references contain the checksum of the referenced pages, they can be used as a reliable indicator to verify whether the referenced page is still valid. At a later point, the recovery of pages is discussed. For this process, it is most useful to be able to verify whether a reference to a page is still valid. If the checksum in its reference does not match its content, the page has either been reallocated and filled with new data, or the data in the page is corrupt. In these scenarios, such a reference cannot be trusted, as the node most likely does not belong to a past state of the current data structure anymore.

Nearly all addresses used in ReFS are virtual and must first be translated to real addresses before they may be interpreted. Virtual addresses are not used to circumvent the shortcomings that occur when using a Copy-On-Write policy with linked data structures. Instead, virtual addresses, as well as the address translation mechanism, seem to have been added to achieve better performance when using ReFS in conjunction with storage tiering. More recent versions of ReFS offer “tiering aware allocation”; this form of allocation introduced a logical indirection layer between the addresses that are used to reference clusters and the physical clusters of a volume. Tiering aware allocation is an allocation strategy that is aware of the storage tiers of its underlying storage media. This knowledge about the properties of the components in a storage tier is used to decide which storage tier fits a current allocation best. The abstraction layer that is used to implement virtual addresses is explained in section 4.2.2.
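A sketch of the reference-validation step described above; the helper functions and field names (read_clusters, reference.lcns, reference.checksum, page_checksum) are hypothetical, and neither the checksum algorithm nor the exact byte range it covers is asserted here:

    def follow_reference(image, reference, cluster_size, read_clusters, page_checksum):
        # A page may span several, not necessarily contiguous, clusters.
        raw = b"".join(read_clusters(image, lcn, cluster_size) for lcn in reference.lcns)
        if page_checksum(raw) != reference.checksum:
            # The page was reallocated/overwritten or is corrupt: the reference no longer
            # points to data belonging to this (past) state of the tree.
            return None
        return raw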

4.1.3 Superblock and Checkpoint

Like many other file systems, ReFS utilizes the first sector of a volume as a boot sector. The boot sector contains a header that may be used to identify a file system as being of the type ReFS. Additionally, the boot sector stores multiple parameters of the file system, such as the used ReFS version, the sector count, as well as the cluster size that is used in the file system. After the boot sector has been processed successfully, the driver continues to interpret the superblock. The superblock is the entry point to all further structures of a ReFS file system and is placed at a fixed location in cluster 30 (0x1e) of a volume. The most important task of the superblock is to reference the checkpoint. The checkpoint references the root nodes of all further important structures found under ReFS and thereby the state of the file system.

Two different checkpoint structures logically form a checkpoint. In contrast to most data structures, the single checkpoint structures are updated in place. If they were updated with a Copy-On-Write policy instead, the links found in the superblock would have to be updated whenever the checkpoint is altered, and the same updating problem would propagate into the superblock. If there were only a single checkpoint structure that is written in place, ReFS would lose the robustness provided through Copy-On-Write. Therefore, two checkpoint structures exist in every ReFS file system. These checkpoint structures are written alternatingly, as seen in figure 4.1. After a batch of update operations has been persisted, one of the two checkpoint structures is updated to reflect the current state of the file system. When the next batch of update operations is persisted, the other checkpoint structure is manifested. If the system crashes while the first checkpoint is written, it may fall back to using an older state of the second checkpoint and vice versa. Additionally, in more recent versions of ReFS, a redo log can be replayed to repeat all operations that occurred between a crash and the state of the last valid checkpoint and were not persisted to the disk. To determine which checkpoint is more recent, the checkpoint structures use a clock value that is incremented whenever a checkpoint structure is updated.



Figure 4.1: Overview of the checkpoint mechanism

The correctness of a checkpoint structure may be verified by using its self-checksum. The actual payload of a checkpoint structure is a list of references to the “global root nodes”, that is, the root nodes of the essential tree structures found in the file system. On a higher abstraction level, these trees may be considered tables of a key-value store. Figure 4.2 shows an overview of the global root nodes that are referenced by a checkpoint.
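A sketch of how a driver or an analysis tool could select the current checkpoint, assuming a hypothetical parse_checkpoint() helper that returns the clock value and whether the self-checksum validates:

    def pick_current_checkpoint(raw_cp1, raw_cp2, parse_checkpoint):
        candidates = [cp for cp in (parse_checkpoint(raw_cp1), parse_checkpoint(raw_cp2))
                      if cp is not None and cp.valid]
        if not candidates:
            return None                 # both structures damaged: fall back to carving
        # The structure with the higher clock value was written last and describes the most
        # recent consistent state; the other one still describes the previous state.
        return max(candidates, key=lambda cp: cp.clock)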

4.1.4 Minstore B+ Tables

The most important data structure under ReFS is the B+-tree. Microsoft calls their B+-tree implementation “Minstore B+-trees”. Minstore is the name of a new storage engine that is used in ReFS to supply a recoverable key-value store formed out of such Minstore B+-trees. Instances of such key-value stores are also referred to as tables. To a user of its API, the Minstore engine only provides primitives. On an upper layer, these primitives are combined to implement the logic of a file system. The API provided by Minstore enables a user to create transactions that operate on the key-value store. A transaction bundles multiple operations on one or more tables of the key-value store into an atomic unit [29]. Every transaction ends either with a commit or an abort: either the transaction was successful and the changes are persisted to the key-value store, or it failed and the performed changes are rolled back.

A single Minstore B+-tree forms a key-value store of the form <idx_root, <k1, v1>, ..., <kn, vn>>, where ki represents a key and vi its associated value. idx_root describes a fixed chunk of data in the root node of the tree that serves as header data and may be used to store general as well as specific information about the table. Figure 4.3 shows how a B+-tree can be logically interpreted as a key-value store. It is also possible for tables to be embedded in other tables. Such a relationship may be represented by the key-value pair <k1, <idx_root, <j1, v1>, <j2, v2>, ..., <jn, vn>>>, where k1 is a key through which data may be accessed that itself forms a completely new table. Usually, nodes in a Minstore B+-tree correspond to complete pages in the file system. Root nodes of embedded tables, however, are smaller than a whole page, as they use the data portion of a row to represent the root node of a tree. The concept of embedded tables may be applied recursively, so that embedded tables may also contain other embedded tables. Embedded tables are used to express how tables relate to one another while allowing them to scale independently. Some embedded tables might even fit entirely into the value portion of a row. This makes embedded tables more space-efficient compared to the approach of referencing other pages. Figure 4.4 shows how the embedding of other tables works at the level of B+-trees. A table that contains embedded tables is not limited to storing only tables but might as well store atomic key-value pairs in its rows. Another important concept that is used together with embedded tables is reparenting. Moving an embedded table from one table to another table only requires moving its root node. Files under ReFS are, for example, mapped to embedded tables in so-called directory tables. If a file is moved from one directory to another directory, it is merely necessary to move its root node into the new directory table, that is, to reparent it.



Figure 4.2: Global root nodes referenced by the checkpoint

Tables in ReFS can be referred to by a 64-bit identifier that is stored in every page header and can thereby be found in all non-embedded nodes of a Minstore B+-tree. Contrary to the address of the root node of a table, its identifier stays constant over time. Embedded tables inherit the identifier of the table in which they reside. They may be referred to by the table identifier of their outermost table, combined with the key or the sequence of keys that is required to reach them. It is important to note that if a table is reparented, its nodes are not updated, and thus they might still hold the table identifier of one of their former parent tables. Every table under ReFS also has a fixed set of information associated with it, such as the number of rows it stores, the number of pages the table is formed of, as well as an identifier of the schema that is used in the table. The schema of a table defines, amongst other things, the amount of memory that the root node and that other nodes may consume in a tree. A schema also defines the data type of the keys that are used in a tree. The data type of a key may be used to determine the collation of the entries that are stored in a tree.
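The following toy model (plain Python dictionaries, not the on-disk layout) illustrates the embedding and reparenting concepts described above: because a file is itself an embedded table, moving it between directory tables only moves its root row. The table identifiers follow the examples listed at the end of this chapter (0x600 for the root directory, identifiers above 0x700 for other directories); the file name and sizes are purely illustrative.

    tables = {
        0x600: {                                   # root directory table
            "idx_root": {"rows": 2},
            "rows": {
                b"report.txt": {"idx_root": {"size": 1024}, "rows": {}},  # file = embedded table
                b"subdir": 0x701,                  # other directories live in their own tables
            },
        },
        0x701: {"idx_root": {"rows": 0}, "rows": {}},
    }

    def reparent(tables, src_id, dst_id, key):
        """Move an embedded table (e.g. a file) between two directory tables."""
        row = tables[src_id]["rows"].pop(key)      # only the embedded root row moves ...
        tables[dst_id]["rows"][key] = row          # ... its contents are left untouched

    reparent(tables, 0x600, 0x701, b"report.txt")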

Schema Table Information on which table schemas exist as well as the definitions of table schemas reside not only in the driver but are also stored in on-disk structures of a ReFS file system. Like most pieces of file system metadata, the schema definitions that store the properties of all tables under ReFS are contained in a table. The table that is used to store information about different table schemas is called “Schema Table” and introduces a circular logic issue, similar to the $AttrDef file used in NTFS. To be able to interpret the contents of a table, one first has to parse the Schema Table, but how does one parse the Schema Table without knowing how to interpret a table correctly? To solve this issue, the ReFS driver initially loads a


Figure 4.3: Implementation of a table in a key-value store by using a B+-tree


Figure 4.4: Concept of embedding tables

set of pre-defined table schemas. With this set of schemas, the driver can load all further schemas from the Schema Table. It is important to note that the Schema Table does not define how tables are structured but rather defines size values and flags that apply to tables. These flags are crucial for reading the nodes of a table, as they also indicate whether the addresses of nodes are virtual or not. The interpretation of the data found in the tables is up to the driver and its internal specification, and is not explained in any on-disk structures. Nevertheless, the concept of schemas provides a good way to maintain backward compatibility with deprecated data structures. As an example: the metadata entry of a file in ReFS is also encoded as a table. If Microsoft wanted to alter the current layout of how files are stored in ReFS, they would have to introduce a new schema. In this regard, the identifier of a schema fulfills a similar purpose as a version number. A driver that understands the new schema as well as the old schema can interpret files saved in both layouts correctly. Tables also contain header data that depends on their schema. ReFS, for example, uses so-called allocator tables to manage the allocation status of clusters in the file system. The header data of such a table could include the number of used, free, and reserved clusters that are managed by the table. A table that represents a file might instead store timestamps and flags as its specific information.

Object ID Table When conducting a forensic analysis of a ReFS formatted file system, the table of most importance is the so-called Object ID Table. This table references the root nodes of a variety of other tables and associates an identifier to them. Alongside a table that contains general information about the volume, a table storing an


Upcase Table and a table containing an entry point to the redo log, the Object ID Table also references the root nodes of all directory tables. When searching for any directory, the Object ID Table must be queried. After a directory has been deleted, it is not referenced by the Object ID Table anymore. The Object ID Table is the only place where the actual addresses of directory tables are stored.

Identifier   Table Name
0x7          Upcase Table
0x8          Upcase Table, duplicate
0x9          Logfile Information Table
0xa          Logfile Information Table, duplicate
0xd          Stream Table
0x500        Volume Information Table
0x501        Volume Information Table, duplicate
0x520        File System Metadata Table
0x530        Security (Descriptor) Table
0x540        Reparse Index Table
0x541        Reparse Index Table, duplicate
0x600        Root Directory Table
> 0x700      Any other directory table

Figure 4.5: Tables referenced by the Object ID Table


Figure 4.6: Schematic view of the organization of the Object ID Table

Because of these circumstances, recovery techniques that attempt to restore directories under ReFS should focus on recovering rows found in the Object ID Table. Figure 4.5 provides an overview of the tables that are referenced by the Object ID Table. Even though it would be possible to completely embed the indexed tables into the Object ID Table, as proposed in figure 4.4, this concept is not used here. Instead, the Object ID Table merely stores references to the root nodes of other tables, as shown in figure 4.6. A possible explanation for this implementation might be the hint that J.R. and Smith [29, p. 16] give on embedded tables: embedded tables are used to “express relatedness of tables”, but the Object ID Table is not related to the tables it references and serves the sole purpose of indexing them.

The Object ID Table additionally stores meta-information about the tables that it references. It stores the addresses and the checksums of the referenced tables, as well as their last persisted log sequence number in the file system journal. Additionally, entries in the Object ID Table may also store a buffer with variable data. For links to directory tables, this buffer holds the next file identifier to be used in the directory. As the Object ID Table plays a superior role to the tables that it references, greater effort is made to guarantee its integrity. A so-called “Duplicate Object ID Table” exists, which contains the same entries as well as the same slack space as the regular “Object ID Table”. While Copy-On-Write is used to achieve atomic write operations, duplicate table structures seem to be used to battle bit rot. If one variant of the Object ID Table becomes corrupted, the ReFS driver may fall back to using the other variant.

Directory Tables As their name suggests, directory tables are tables that implement the logic of directories. Every directory table contains a single row that stores metadata of the directory that the table represents. We refer to this row as the descriptor of that directory. The other rows found in a directory table mostly represent directory entries that are contained in the directory. There exist two different types of directory entries: files and folders. For all files in a directory table, an additional entry type exists that provides a mapping between the metadata address of a file and its current handle.

• Directory Descriptor: Stores metadata and attributes of the directory. The key of this entry is a constant identifier. Its value is an embedded table that contains the metadata and the attributes of the directory that the directory table represents.

• File: File name → Entry that stores metadata and attributes of a file. The key of this entry begins with a constant identifier followed by the name of the file it represents. Its value is an embedded table that contains the metadata and the attributes of the file that the entry represents. This structure is similar to the structure of directory descriptors.

• Directory Link: File name → Directory identifier, includes some metadata


The key of this entry begins with a constant identifier followed by the name of the directory. Its value is a structure with a fixed size. The structure contains metadata that describes the referenced directory. It also contains the table identifier of the referenced directory table, but not its physical address. To find the referenced directory, one has to search its ID in the Object ID Table. The metadata that is stored in directory links is limited in its size and is not as extensive as the metadata found in the directory descriptor of a directory table.

• ID2: File identifier → File name / Metadata address. Since directory tables index metadata entries only by their file names, this additional structure exists. With this row type, it is possible to use the metadata address of a file to look up its name. The found name may then be used to search the actual metadata entry in the current directory table. ID2 entries are also used to make metadata addresses stable. If a file is moved from its initial directory into a different directory, an ID2 entry remains in its initial directory and points to the current address of the file.

File names: Using file names and paths as a primary handle for a file is rather unusual. Most file systems address and reference files merely by their metadata address. In ReFS, however, this behavior shifts and file names gain more importance. Figure 4.7 visualizes the structure of a directory. The representation of the metadata entry for the file f.txt is omitted as it looks nearly identical to the metadata entry of the directory descriptor. The table in the figure also shows an ID2 entry that maps the file identifier 0x1 in the directory to the file name f.txt.


Figure 4.7: Exemplary directory table

Directory entries under ReFS were designed so that the name of the file or folder that is described by a directory entry is used as its key. Because of that concept, metadata entries that represent files do not need to possess a $FILE_NAME attribute anymore. Directory descriptors, which are used to store all the metadata and attributes of a directory, however, still contain an attribute that stores their name as well as the identifier of their parent directory. Because of that choice of design, all subdirectories in ReFS possess two file names: the file name found in the directory descriptor of the subdirectory as well as the file name stored in the directory link that holds the directory identifier of the respective subdirectory. Both names should be equal. When printing the structure of a directory, Windows only displays the directory name found in the directory link.

Metadata addresses: Metadata addresses are used to uniquely address files and directories. They offer a shorter notation than the path of a file. A metadata address in ReFS consists of two components:


• Directory Identifier: The directory identifier describes in which directory (table) the entry resides. Directory identifiers may be reused. When the file system is mounted, the driver picks the successor of the highest existing directory table identifier as a starting value.

• File Identifier: The file identifier allows referring to a file within a directory table uniquely. The file identifier 0 refers to the directory descriptor in the respective directory table. All other file identifiers refer to files within that directory table. File identifiers in a directory are never reset or reused.

In this work, metadata addresses are noted in the form directory identifier | file identifier, for example 0x600 | 0x42. The directory identifier of a directory is equal to its table identifier. A table with the id 0x800 represents the directory with the metadata address 0x800|0. The files inside this directory are referred to as 0x800|i, where 0 < i < max_file_id_0x800. This choice of addressing files induces a tight coupling between metadata addresses and the actual paths in which files reside. The directory identifier of a metadata entry defines the directory table in which the metadata entry is stored. Within that table, the file name associated with the entry is used as its key. Alternatively, each directory table also stores so-called “ID2 entries” that allow finding a file name in a directory table by a file identifier.

If a file is moved from one directory to another directory, its metadata address is altered, as a new directory identifier and a new file identifier are assigned to it. However, it is still possible to refer to the file by its original file and directory identifier. The original identifiers are still saved in a field of the metadata entry of the file. Additionally, moving a file into another directory creates an ID2 row in its original directory table. This row offers a mapping between the original and the current metadata address of the file. This measure allows metadata addresses of files to remain stable even if a file is moved into a new directory. Through this mapping, a file may be addressed either by its initial or by its current metadata address. When querying the metadata address of a file, Windows returns its initial address. When a file is deleted, the ID2 entry in its current directory and its original ID2 entry are both deleted as well.

Contrary to NTFS, where metadata entries may own multiple file names, a concept also known as hard links, files in ReFS cannot own more than one file name. Theoretically, hard links could be introduced as names that link to the original ID2 entry of a file, as this entry works as a stable pointer to the file. Additionally, it would be necessary to add a link count to ID2 entries so that they could track the number of names that point to a file.

There exist two special directory tables:

• . (0x600): The root directory is special as it is the base directory of the file system. It is used as an entry point for all operations that refer to file names and is needed whenever a path to a file must be interpreted.

• File System Metadata (0x520): This directory fulfills a similar purpose to the $Extend directory known from NTFS. It is invisible to all users and is used to manage system files specific to the ReFS file system. Just like the root directory, it is not referenced from any other directory, even though various other file system structures suggest that it is placed in the root directory.
All other directories use table identifiers that are ≥ 0x701. Directory tables with an identifier that does not fulfill these conditions are not shown by the ReFS driver and cannot be accessed either.

Locating an entry by its metadata address: When given a metadata address, one first has to look at its directory identifier and search it in the Object ID Table. In the case of success, one finds a row that contains the table identifier, and thus the directory identifier, as its key and the reference to the table as its data. It is then only necessary to search the file identifier of the given metadata address in the located directory table.

Locating an entry by its path: Even though file names seem to play a superordinate role compared to metadata addresses, it is still more complex to locate a file or directory by its path. The search for a path starts in the root directory table. Every component in the path constitutes an indirection that must be searched in the respective directory table. The process of interpreting a path to obtain a directory entry starts in the root directory table and can be described as follows:


• Search the file name of the current path segment in the current directory table (e.g. k/a.txt)

• If the current path segment is a file
  – Return its metadata entry and stop
• Else if the current path segment is a directory
  – Parse the found directory link
  – Extract the directory identifier from the directory link
  – Search the directory identifier in the Object ID Table to locate the corresponding directory table
  – If the current path segment is the last component in the path
    ∗ Return the directory descriptor found in that table
  – Else, repeat the procedure, starting in the directory table just located, with the remaining path

Figure 4.8 visualizes this interplay of the Object ID Table and directory tables. To focus on the relevant content, the ID2 rows in the directory tables were omitted. The previously explained procedure may be applied to the example shown in figure 4.8 and the path k/a.txt as follows (a code sketch of the lookup procedure follows the example):

Object ID Table (0x2): key 0x600 → table reference r1, key 0x701 → table reference r2, ..., key 0x80f → table reference rn. Directory table 0x600: directory descriptor (metadata, ID 0x0), f.txt (metadata, ID 0x1), g.txt (metadata, ID 0x3), directory link k/ → 0x701. Directory table 0x701: directory descriptor (metadata, ID 0x0), a.txt (metadata, ID 0x1), directory link l → 0x72a.

Figure 4.8: Overview of the interaction between the Object ID Table and directory tables

• Use the Object ID Table to locate the root directory table (table identifier 0x600)
• Search the entry k in the root directory table
• Obtain the directory identifier 0x701 from the found directory link
• Use the Object ID Table to locate the directory table with the identifier 0x701
• Search the entry a.txt in the found directory table
• Return the metadata of the found entry
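The following Python sketch reproduces this lookup procedure against a simplified in-memory model of the Object ID Table and the directory tables. The data layout (dictionaries keyed by table identifiers and file names) is purely illustrative; the real structures are B+-trees on disk.

    # Hypothetical in-memory model: the Object ID Table maps directory identifiers to directory
    # tables; a directory table maps names to either file metadata or a directory link of the
    # form ("dirlink", <directory identifier>).
    object_id_table = {
        0x600: {"f.txt": {"id": 0x1}, "g.txt": {"id": 0x3}, "k": ("dirlink", 0x701)},
        0x701: {"a.txt": {"id": 0x1}, "l": ("dirlink", 0x72A)},
    }

    def resolve_path(path: str):
        """Walk a path such as 'k/a.txt' starting at the root directory table (0x600)."""
        current_dir = object_id_table[0x600]          # root directory table
        parts = [p for p in path.split("/") if p]
        for i, name in enumerate(parts):
            entry = current_dir[name]                 # raises KeyError if the name is unknown
            if isinstance(entry, tuple) and entry[0] == "dirlink":
                # Directory link: follow the directory identifier via the Object ID Table.
                current_dir = object_id_table[entry[1]]
                if i == len(parts) - 1:
                    return current_dir                # last component: return the directory table
            else:
                return entry                          # file: return its metadata entry

    print(resolve_path("k/a.txt"))                    # -> {'id': 1} (metadata address 0x701|0x1)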


Type Identifier   Attribute Name
0x10              $STANDARD_INFORMATION
0x30              $FILE_NAME
0x38              $DIR_LINK
0x40              $OBJECT_ID
0x50              $OBSOLETE
0x60              $VOLUME_NAME
0x70              $VOLUME_INFO
0x80              $DATA
0x90              $INDEX_ROOT
0xa0              $INDEX_ALLOCATION
0xb0              $NAMED_DATA
0xc0              $REPARSE_POINT
0xd0              $EA_INFORMATION
0xe0              $EA

Table 4.1: List of default ReFS attribute types

File Tables As explained previously, ReFS uses the concept of embedding tables mainly to express the relatedness of tables. Files, as well as directory descriptors, are tables that are embedded within directory tables. Like all other tables, they start with a fixed-sized chunk of data that is used to describe table-specific information. The entries stored in a file table are the properties that a file possesses. This concept is strongly reminiscent of the attributes used in NTFS. In this sense, the MFT entries known from NTFS behave similarly to file and directory descriptor tables in ReFS: both consist of a small fixed component as well as a variable-sized list of attributes. In contrast to NTFS, ReFS uses no on-disk structures to store the properties of attributes, such as the $AttrDef file found under NTFS.

Since the design of ReFS does not focus on attributes as much as NTFS did, various attributes also seem to have been eliminated. Through a static analysis of the ReFS driver, we were able to obtain multiple definitions as well as names of attributes that may be assigned to files or directories. The found attribute names and their identifiers are presented in table 4.1. Of all the attributes defined in the driver, we, however, found only the following ones to be practically used: $DIR_LINK, $INDEX_ROOT, $DATA, $NAMED_DATA, and $REPARSE_POINT. Most of the information that was previously stored in the $STANDARD_INFORMATION attribute has now become a part of the fixed data found in files and directory descriptors. Since the form in which the contents of a directory are represented was changed completely and shifted into the responsibility of rows in directory tables, both the $INDEX_ROOT and the $INDEX_ALLOCATION attribute known from NTFS seem to have become obsolete. Still, we found all directories to use identically filled dummy $INDEX_ROOT attributes.

A prominent concept under NTFS is that attributes may be stored either resident or non-resident. The contents of a non-resident attribute are stored in an external cluster, while resident attributes store their content directly within a buffer that is reserved for that attribute. Much of the flexibility of this concept seems to have been dropped in ReFS. Aside from the $DATA attribute, the contents of all attributes under ReFS seem to be stored resident. In NTFS, it was possible for the $DATA attribute to be stored either resident or non-resident. In ReFS, the $DATA attribute now seems to always be stored non-resident. Even if a file is only a few bytes large, ReFS allocates a dedicated cluster for it and saves its data non-resident. Furthermore, the $DATA attribute seems to be the only attribute that spans an embedded tree. All other attributes store their data in a small variable-sized chunk of memory within the attribute list. If this chunk of memory is not large enough to store the contents of the attribute, multiple identically identified attributes of the same type are allocated. Since attributes also employ a field to encode their relative start offset as well as their length, a single attribute may easily be stored in a split form and reassembled later. The tree that is produced when a $DATA attribute is allocated stores rows of cluster runs. This approach also makes it easy to search for a relative offset within the file, as the tree used to store data runs is collated by the relative start of a data run.


4.2 ReFS Analysis

In the following, the different data categories that were introduced in section 2.4 are applied to the data structures found in ReFS. Figure 4.2 already gave an overview of all tables referenced by the checkpoint structure. These tables and all tables referenced by the Object ID Table are the most important source when it comes to analyzing data categories in the ReFS file system. Data of the metadata category as well as the file name category is primarily found in directory tables as well as their subtables which were described in section 4.1.4. The other tables found under ReFS predominantly store data of the file system category, as well as data relevant to the content category. Only two tables are concerned with the application category.

4.2.1 File System Category It is important to note that, contrary to the other categories, this category is also represented in the boot sector, as well as in the superblock and the checkpoint structures.

Boot Sector The boot sector is the first sector in every ReFS file system. As is common for the file system category, the boot sector contains multiple single and independent values, such as a signature to indicate the file system type, the size of the volume, the size of a sector, the size of a cluster, the size of a container, as well as a version and a serial number. Those values are needed to interpret the remaining structure of a file system correctly. Especially the size of a cluster is important, as this value affects the page size used under ReFS. A backup boot sector exists in the last sector of the file system. It may be used if the regular boot sector becomes corrupted.

Superblock With the knowledge of the values found in the boot sector, it is possible to reach the superblock that is contained in cluster 30 of a ReFS file system. There also exist two copies of the superblock at the end of every ReFS partition. They may be located by either using the size found in the partition table or the volume size that can be found in the boot sector. The superblock contains only three values of importance: a 128-bit wide GUID value g that is randomly created when a volume is formatted with ReFS, as well as two references to structures that make up the current checkpoint state. The GUID g may be separated into four 32-bit values g1, g2, g3, g4. From these values, the volume signature may be calculated as vol_sig = g1 ⊕ g2 ⊕ g3 ⊕ g4. As described in section 4.1.1, every page of a ReFS formatted file system holds a value that should be equal to the calculated volume signature vol_sig. Data in the file system category is often described as a map that hints at where further important structures of a file system reside. The two checkpoint references found in the superblock serve exactly this purpose, as they point to the location at which the current checkpoint can be found.
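To illustrate the signature computation, the following sketch splits a 16-byte GUID into four 32-bit integers and XORs them. The little-endian interpretation of the four dwords is our assumption and should be verified against an actual volume.

    import struct
    import uuid

    def volume_signature(guid_bytes: bytes) -> int:
        """Compute vol_sig = g1 XOR g2 XOR g3 XOR g4 from the 16-byte volume GUID."""
        # Assumption: the GUID is split into four consecutive little-endian 32-bit values.
        g1, g2, g3, g4 = struct.unpack("<4I", guid_bytes)
        return g1 ^ g2 ^ g3 ^ g4

    guid = uuid.uuid4().bytes     # stand-in for the GUID read from the superblock
    print(hex(volume_signature(guid)))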

Checkpoint The checkpoint of a file system is the first structure that holds the current clock values, which were described in section 4.1.1. Additionally, the checkpoint also contains the current log sequence number (LSN), which identifies the last entry that was written into the redo log. Since ReFS keeps changes to the file system in memory and writes them back later as a batch, the system may crash within this time window. A system crash reverts the system to its last valid state and discards all changes executed in the meantime. However, every transaction that was successfully performed and is part of the batch is also written to a sequential log file that is discussed in the application category in section 4.2.5. To know from where to start with the redo operations, it is essential to save an identifier for the last transaction that was part of the current checkpoint. The most important structures found in the checkpoint are the table references that were already shown in figure 4.2. For the sake of thoroughness, figure 4.9 presents an overview of all tables referenced by the checkpoint along with their table identifiers. All of these tables, as well as the tables referenced by them or embedded in them, contain the remaining data that can be assigned to the different categories proposed by Carrier.


Table Identifier   Table Name
0x2                Object ID Table
0x21               Medium Allocator Table
0x20               Container Allocator Table
0x1                Schema Table
0x3                Parent Child Table
0x4                Object ID Table, duplicate
0x5                Block Reference Count Table
0xb                Container Table
0xc                Container Table, duplicate
0x6                Schema Table, duplicate
0xe                Container Index Table
0xf                Integrity State Table
0x22               Small Allocator Table

Figure 4.9: Tables referenced by the checkpoint structure

The first table to discuss within the file system category is the Schema Table, but since it was already extensively introduced in section 4.1.4, it is omitted here. The next table of importance is the Object ID Table and the tables referenced by it, as shown in figure 4.5.

Object ID Table The Object ID Table constitutes an important part of the file system category as it references other structures of the file system. Its purpose and the tables that it references were already explained and shown in figure 4.5. The Object ID Table references two different tables that may also be assigned to the file system category: The Upcase Table and the Volume Information Table.

Upcase Table The Upcase Table is presumably the table with the most trivial structure. Its keys are 32-bit wide integer values that are used to index and arrange a large chunk of data, as shown in figure 4.10. The first entry in the Upcase Table stores the ASCII string “Upcase Table”. The second entry in the Upcase Table is a 32-bit value that reflects the number of characters managed by the following entries: 0x10000 (65536). The remaining entries form a 128 KiB large chunk of data. This chunk of data represents an array of uppercase UTF-16 characters. Confusingly, this array is also termed Upcase Table, just like the structure in which it is stored. Hence, we refer to the array as “Upcase Array”, whereas we use the term Upcase Table for the structure in which it is contained. If the file system needs to determine the upcased version of a UTF-16 character, it uses the integer representation of that character as an index into the Upcase Array.

Upcase Table rows: key 0x0 → the string “Upcase Table”, key 0x1 → 0x10000, keys 0x2 and onwards → consecutive chunks of Upcase_Data[0x0:0x20000].

Figure 4.10: Exemplary view of the Upcase Table

35 CHAPTER 4. ANALYSIS

The concept of such an array is not new to ReFS and was already employed in NTFS. Carrier [12] briefly describes this structure as a file that “contains the uppercase version of every character”.
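A minimal sketch of how such an array is used for case-insensitive name handling; the array contents are faked here with Python's str.upper(), whereas a real implementation would read the 128 KiB Upcase Array from the rows of the Upcase Table.

    # Build a stand-in upcase array: index = UTF-16 code unit, value = uppercased code unit.
    # A forensic tool would instead load the 0x10000-entry array from the Upcase Table rows.
    upcase_array = []
    for c in range(0x10000):
        up = chr(c).upper()
        upcase_array.append(ord(up) if len(up) == 1 else c)

    def upcase_name(name: str) -> str:
        """Uppercase a file name one UTF-16 code unit at a time, as the file system would."""
        return "".join(chr(upcase_array[ord(ch)]) for ch in name if ord(ch) < 0x10000)

    # Case-insensitive comparison of two file names, e.g. when matching directory table keys
    print(upcase_name("Datei.TXT") == upcase_name("datei.txt"))   # True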

Volume Information Table The Volume Information Table is another table with a rather simple structure. It uses 64-bit wide keys to index its values. As of now, it seems to manage three different keys, 0x510, 0x520, and 0x540, which are depicted in figure 4.11.

• Volume Name (0x510): The first key addresses a row that contains the name of the volume. This entry seems to be optional and is non-existent if a volume has no name.

• General Information (0x520): The second key addresses a row that contains general information about the volume. This information includes when and in which ReFS version the volume was formatted. It furthermore includes the current ReFS version used by the volume as well as a timestamp that indicates when the volume was last mounted. This row also contains a single flag value that we have not yet been able to decipher.

• Backup cluster of general information (0x540): The third key addresses a row that merely contains two 64-bit values: the number of a cluster, as well as a length field that describes over how many clusters the data in that cluster stretches. The data found in the referenced cluster is equal to the information found in the row with the key 0x520.


Figure 4.11: Overview of the Volume Information Table

Parent-Child Table The Parent-Child Table is used to express hierarchies formed among different tables. As of now, the Parent-Child Table is only used to form hierarchies between directory tables. In directory tables, hierarchies are already established through directory entries that link to subdirectories, as well as through directory metadata that links to the parent directory. Nevertheless, this table provides the same information in a more structured way, as it only deals with the presentation of such relationships. Even though the driver maintains the state of the Parent-Child Table whenever a directory is created, moved, or deleted, it normally never has to use these entries. The Parent-Child Table solely serves the purpose of recovering path names that have been corrupted. The directory “File System Metadata” is a system directory that is not referenced by any other directory and thus cannot be seen or accessed by any user. Still, the Parent-Child Table contains an entry which suggests that the “File System Metadata” directory is located within the root directory of the file system. The root directory of a ReFS file system is the only directory that has no entry within the Parent-Child Table to describe its parent identifier, as it owns none. Every entry in the Parent-Child Table identifies the parent table of a given table. Thereby, the Parent-Child Table represents a 1:N relationship: every table may only possess a single parent table, but a parent table may have an arbitrary number of child tables. This mapping

could easily be represented by using the identifier of a table as the key and the identifier of its parent as the associated value. Instead, the key and the data overlap completely and contain both the identifier of a table as well as the identifier of its parent table. Figure 4.12 shows an exemplary Parent-Child Table as well as its interpretation.

Parent Child Table (key/value rows): Parent ID 0x600, Child ID 0x520; Parent ID 0x600, Child ID 0x701; Parent ID 0x701, Child ID 0x702.

Figure 4.12: Exemplary Parent Child Table with its interpretation
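Because the Parent-Child Table gives each directory table the identifier of its parent, a recovery tool can rebuild the path of an orphaned directory by walking this mapping upwards. The sketch below assumes that the parent-child rows and the directory names (taken, for instance, from $DIR_LINK attributes or directory descriptors) have already been extracted into dictionaries.

    # Hypothetical extracted data: child table id -> parent table id, and table id -> directory name.
    parent_of = {0x520: 0x600, 0x701: 0x600, 0x702: 0x701}
    name_of = {0x520: "File System Metadata", 0x701: "k", 0x702: "l"}

    def rebuild_path(table_id: int) -> str:
        """Walk the Parent-Child mapping up to the root directory (0x600) and join the names."""
        parts = []
        while table_id != 0x600:                  # the root directory has no parent entry
            parts.append(name_of.get(table_id, f"<unknown {table_id:#x}>"))
            table_id = parent_of[table_id]        # a KeyError here would indicate a missing row
        return "/" + "/".join(reversed(parts))

    print(rebuild_path(0x702))                    # -> /k/l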

4.2.2 Content Category The most important concept when talking about the contents of files is that of allocators. In a presentation at the Storage Developer Conference, J.R. and Smith [29, p. 15] stated that ReFS uses no bitmap to manage the allocation of clusters. Instead, so-called hierarchical allocators were originally responsible for the execution of allocations.

Allocator Tables: ReFS v1 Since file systems that use Copy-On-Write policies need to perform far more allocations than file systems that employ an Update-In-Place strategy, the efficiency of an allocator becomes a decisive factor. In ReFS v1, which encompasses the versions 1.1 and 1.2, there existed multiple hierarchical allocators. Georges [22, p. 36-41] describes the concept of allocators extensively. He refers to a blog post by Microsoft in which it is claimed that, in total, three different allocator tables exist [46]. Internally, those allocators behave like Russian dolls. There exists a large allocator that is used to claim large cohesive chunks of data; in ReFS v1 these chunks are 64 MiB in size. The large allocator is realized through a medium allocator, which is used to claim smaller cohesive chunks of data that were 64 KiB in size. Thus, a single block managed by the large allocator corresponds to 1024 blocks managed by the medium allocator. The blocks that are assigned by the medium allocator again stem from another allocator, the small allocator. The small allocator managed the smallest data chunks, which were only 16 KiB in size. A single block of the medium allocator consisted of 4 blocks from the small allocator. This relationship is shown in figure 4.13. J.R. and Smith [29, p. 15] furthermore highlight that the design of the allocators allows every table, and thus every B+-tree, to own its private allocator from which it may request space. When an allocator of a table runs out of space, it turns to its next superordinate allocator. Having no central data structure to query space from allows the allocators to all work in parallel and, according to a talk by Das [17], gave a massive boost to the speed of allocations. This concept of hierarchical allocation was used at least until the versions 1.1 and 1.2 of ReFS.

Allocator Tables: ReFS v2 In more recent versions of ReFS, the concept of hierarchical allocation was vastly reworked. While ReFS v1 provided a large number of allocators, more modern versions of ReFS rely on a single allocator. Relying on a single allocator made it possible to centralize the allocation process. This change was necessary to provide a better foundation for data tiering, which plays an important role in ReFS v2. Internally, however, there still exist three allocator tables, but only one of them plays a decisive role when it comes to allocating blocks for metadata entries and files. In ReFS 3.4, the existing allocators are called “Small Allocator”, “Medium Allocator”, and “Container Allocator” and are not interlayered anymore as they were in ReFS v1. The “Container Allocator” manages



Figure 4.13: Interaction between the hierarchical allocators, according to [22, p. 41]

allocations required for itself as well as allocations for the “Container Table” that is used to enable virtual addresses. The “Small Allocator” manages allocations required for its own tree structure, as well as allocations for the tree structure of the “Medium Allocator”, the “Block Reference Count Table”, and the “Integrity State Table”. In this way, the Small Allocator manages the allocations for the data of structures that implement other functionality in the content category. The “Medium Allocator” is the single allocation structure that is used to manage storage allocations for the remaining tree structures as well as the allocations for file contents. Das [17] notes that the “single allocator” in ReFS provides a policy-driven allocation model and is aware of the storage tiers. When an allocation request is made, a policy module is attached to the request. This policy module gives the allocator a hint from which region of the disk it should allocate space. This, for example, allows large sequential writes to be placed directly into storage tiers that are most suitable for such workloads. Interestingly, the “Small Allocator” and the “Medium Allocator” both manage allocations at the granularity of a cluster and do not differ in their granularity; their naming might therefore be misleading. If entries in an allocator table become corrupted, the driver marks the entire corrupted range as “allocated”, as it cannot immediately decide which clusters were in use. This process is also called memory leakage. In more modern versions of ReFS, a tool called refsutil exists that allows rebuilding the state of allocator tables.

When trying to obtain an overview of which clusters are allocated in a ReFS file system, it is necessary to process the allocation information found in all allocator tables. Additionally, it is also necessary to supplement these findings with information stored in the “Container Table”. An allocator table solely stores information on cohesive memory regions that contain clusters that may be allocated. The sole purpose of allocator tables is to provide new clusters to a requester. If a large region of memory, a so-called band or container, is not available for allocation anymore, this does not need to be reflected in the allocator table. Instead, it is expressed in a more compact form in the “Container Table”. To obtain an overall view of the allocation status of clusters in a ReFS file system, it is necessary to reassemble all these pieces of information. This can be achieved by executing the following steps (a sketch of the procedure follows the list):

• Allocate an array a that holds n entries, where n is the number of clusters that reside in the file system. Initially set all bits within that array to 0 (unused).
• Iterate through the Small Allocator Table and mark all occupied clusters in the array a as 1 (used).
• Iterate through the Medium Allocator Table and mark all occupied clusters in the array a as 1 (used).
• Iterate through the Container Allocator Table and mark all occupied clusters in the array a as 1 (used).
• Iterate through the Container Table. For every band that is completely used, mark the complete area in a as used (1).
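A condensed sketch of these steps, assuming that iterators over the relevant tables (the allocated cluster numbers of the three allocator tables and the completely used bands of the Container Table) have already been implemented on top of a parsed volume; all names and interfaces are our own.

    from dataclasses import dataclass
    from typing import Iterable, List, Tuple

    @dataclass
    class ParsedVolume:
        """Placeholder for a parsed ReFS volume offering iterators over the relevant tables."""
        total_clusters: int
        small_alloc: Iterable[int]                # allocated clusters (Small Allocator Table)
        medium_alloc: Iterable[int]               # allocated clusters (Medium Allocator Table)
        container_alloc: Iterable[int]            # allocated clusters (Container Allocator Table)
        full_bands: Iterable[Tuple[int, int]]     # (first cluster, length) of completely used bands

    def build_allocation_map(vol: ParsedVolume) -> List[int]:
        """Merge allocator tables and Container Table information into one allocation bitmap."""
        bitmap = [0] * vol.total_clusters                      # 0 = unused, 1 = used
        for source in (vol.small_alloc, vol.medium_alloc, vol.container_alloc):
            for cluster in source:
                bitmap[cluster] = 1
        for start, length in vol.full_bands:                   # bands unhooked from the allocators
            bitmap[start:start + length] = [1] * length
        return bitmap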


Container Table In 2015, Tipton [48, p. 17-22] presented changes made to ReFS at the Storage Developer Conference (SDC). There, he first introduced the concept of containers, which play an essential role in more recent ReFS versions. Internally, a ReFS volume is separated into multiple equally sized “bands” or “containers”, as seen in figure 4.14. The size of a single band is relatively large at around 64 MiB or 256 MiB and is thus comparable to the size of the chunks that were previously managed by the Large Allocator Table. The Container Table, however, provides far more detailed information on the managed memory regions than the Large Allocator did. The Large Allocator only recorded whether a block was used or unused. The Container Table also tracks how many clusters within a band are used and additionally collects statistics about data usage and access times within that band. If a band is entirely used, it may be unhooked from the allocator tables that describe its allocation status, as the information in the Container Table suffices to describe that all clusters within that band are occupied. Tipton [48] argues that the introduction of Container Tables was important to treat multi-tiered storage systems more efficiently. He also acknowledges that ReFS v2 puts a strong focus on server systems and aims to take on tasks that lie beyond the traditional scope of a file system. In a multi-tiered storage system, it is usual to have different types of storage media that offer different reading and writing characteristics. Some storage media, such as flash memory, allow performing random access faster than traditional hard disks, which are more suitable for sequential operations.


Figure 4.14: Schematic view of bands found in ReFS

The mixture of different storage tiers creates a bottleneck, as different storage media behave differently for the various forms of access. To battle this bottleneck on the level of the file system, ReFS adds performance counters to every band. The general idea of this concept is to administer the volume in such a way that small random write operations on metadata can mostly be performed in areas of the file system where random access is relatively cheap. This concept also requires the file system to reorganize the bands from time to time. Data that has not been needed for a long period is also referred to as “cold data” and may be moved into a storage tier that offers worse access times, so that data that is needed often, referred to as “hot data”, may be moved into a faster storage tier. To perform this reorganization, it is possible to swap bands, and thus their contents, between different storage tiers. The additional tracking of metadata information within a band allows monitoring its “heat”, that is, how often data in it is accessed. This performance metric may be used to decide when to shuffle two bands. Since bands are large I/O units, this shuffling process happens as a relatively fast sequential write operation. Especially since this task seems to be the responsibility of a lower layer in a storage system and not the competence of a file system, this choice of implementation seems odd. Still, this concept as it is described by Tipton [48] and Das [17] can be found in current ReFS versions. It is important to note that shuffling two different bands also changes the respective physical addresses of the data found in the bands. Before such a shuffle operation, it would be necessary to adjust all pointers that reference data in the affected bands. The costs of this measure would outweigh all the benefits of the actual shuffle operation. To prevent the necessity of updating any pointers, ReFS v2 implemented virtual addresses as a mechanism to support shuffle operations on containers. As explained in section 4.1.2, nearly all addresses used under ReFS have to be translated into real addresses before they may be used. After two containers were shuffled, it is

only necessary to update their respective address mappings. Figure 4.15 shows a fictive scenario of an address interpretation. Given is a page reference that points to a metadata block in the file system. The metadata block is initially located in band 0x4. In the scenario, the bands 0x4 and 0x8 are rotated.

• 1: If there were no additional layer of indirection and no existing pointers were updated, the page reference would now point to the data previously found in band 0x8. This is obviously wrong and would lead to accessing the wrong block.

• 2: The solution implemented by ReFS introduces an address translation process. Pages are not referenced by their physical address anymore but use virtual addresses instead. The data required for the address translation process can be found in the Container Table.


Figure 4.15: Necessity of the address translation process

The Superblock, the Checkpoint, the Container Table, and the Container Allocator Table all use real addresses that do not need to be translated first. This is obvious, as these structures are required to bootstrap the address translation mechanism. We did not examine the inner structure of the rows of the Container Table in more detail. As of now, only two fields in these rows are known to us. One field is required to perform the address translation process. Another field in the rows of the Container Table indicates how many clusters within a band are in use. It is likely that, in tiered storage systems, the remaining fields of the Container Table are used to store performance metrics of the underlying storage devices.


Figure 4.16 portrays the practical usage of the Container Table in the address translation process. Every virtual address used under ReFS may be separated into a container component and an offset component. The container component determines which row of the Container Table needs to be queried to perform an address translation. The offset component provides an offset that must be added to the translated address.

A logical cluster number is interpreted as container:offset. The container component, divided by twice the container size, yields the index of a container, which is used as the key into the Container Table; the associated value is the start LCN of that container, and the real cluster number is this start LCN plus the offset.

Figure 4.16: Exemplary address translation process
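A sketch of the translation step under the interpretation given above: the virtual LCN is split into a container index and an offset, the container index is looked up in a parsed Container Table, and the offset is added to the start LCN of the container. The splitting rule (dividing by twice the container size in clusters) and the example values are assumptions that should be validated against a real volume.

    # Hypothetical parsed Container Table: container index -> start LCN of that container.
    container_table = {0x2: 0x40000, 0x3: 0x60000}

    CLUSTERS_PER_CONTAINER = 0x10000    # example value; depends on volume and cluster size

    def translate_virtual_lcn(virtual_lcn: int) -> int:
        """Translate a virtual cluster number into a real cluster number."""
        container_index = virtual_lcn // (CLUSTERS_PER_CONTAINER * 2)    # container component
        offset = virtual_lcn % (CLUSTERS_PER_CONTAINER * 2)              # offset within the container
        start_lcn = container_table[container_index]                     # row of the Container Table
        return start_lcn + offset

    print(hex(translate_virtual_lcn(0x2 * CLUSTERS_PER_CONTAINER * 2 + 0x123)))   # -> 0x40123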

Further remarks Additional tables such as the Container Index Table, the Integrity State Table, as well as the Block Reference Count Table also fall into the content category. Within our analysis, we were, however, unable to produce scenarios that would use those tables and only had the choice of statically analyzing them. The Block Reference Count Table seems to implement data cloning and . Both of those features are as of now reserved for recent server editions of the Windows operating system, and we were unable to test them practically. The fundamental idea of these concepts is to reuse data regions of files whose contents overlap. To copy a large file within the same volume, it is usually necessary to allocate new clusters and to write all contents of the original file into the clusters of the newly allocated file. Data cloning introduces a different approach that transforms such data-intensive operations into operations that are based on changes in metadata. Instead of copying all clusters of a file, reference counters for clusters are introduced. As shown in figure 4.17, if a file called X is copied to a new file called Y, it is not necessary to allocate new clusters for Y and to duplicate the contents of X. Instead, a copy operation simply lets the new file Y point at the same clusters that X contains. Additionally, a new data structure must be introduced to track how many pointers reference a cluster; as of now, we assume that the Block Reference Count Table is this data structure in ReFS. If a file is deleted, the reference counts for its used regions are decremented. As both files diverge, the reference counters of the affected regions are decremented, and the copied file may claim its own clusters. Not only does this concept help to drastically reduce the effort of copy operations, but it also allows data to be stored more efficiently by avoiding redundant storage [6].

Block Reference Count Table example: file X and file Y both reference the regions A, B, and C; the table records a reference count of 2 for each of these regions.

Figure 4.17: Exemplary usage of the Block Reference Count Table
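The reference-counting idea can be summarized in a few lines; the dictionary-based model below is purely conceptual and does not mirror the on-disk layout of the Block Reference Count Table.

    from collections import Counter

    # Conceptual model: region -> reference count, and file -> list of referenced regions.
    ref_count = Counter()
    files = {}

    def clone_file(src: str, dst: str) -> None:
        """Clone a file by referencing the same regions instead of copying their contents."""
        files[dst] = list(files[src])
        ref_count.update(files[dst])              # each shared region gains one reference

    def delete_file(name: str) -> None:
        """Deleting a file only decrements the counters; regions become free at count 0."""
        ref_count.subtract(files.pop(name))

    files["X"] = ["A", "B", "C"]
    ref_count.update(files["X"])
    clone_file("X", "Y")
    print(ref_count)                              # Counter({'A': 2, 'B': 2, 'C': 2})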


4.2.3 Metadata Category Most of the metadata in ReFS can be found within directory tables as well as their subtables as this is the place where the directory entries for files as well as directories reside. However, some of the metadata information that describes files and folders is outsourced to different table structures.

Security Table In NTFS file systems, access control for files as well as directories relied on security descriptors. Security descriptors are data structures that specify the security information of securable objects. Every file and directory in NTFS could own a $SECURITY_DESCRIPTOR attribute that indicated the owner of a file and the associated access rights to it. Since multiple files may use the same security descriptor, this method may store some security descriptors redundantly and therefore waste space. Because of that, later NTFS versions store security descriptors in a single central file called $Secure. ReFS handles security descriptors in a way similar to the $Secure file. Instead of storing security descriptors as attributes of a file or directory, every file and directory table solely holds a 32-bit security identifier. This identifier may be used to look up the corresponding security descriptor in a central Security Table. Keys used in the Security Table are security identifiers. The values consist of header data that is followed by the actual security descriptors. Microsoft officially describes the structure of security descriptors. Security descriptors used in ReFS have the same form as security descriptors used in NTFS. Thus, their analysis can be performed just as in NTFS.
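Since the security descriptors themselves follow the documented Windows layout for self-relative security descriptors, a parser only has to skip the ReFS-specific row header (whose size is not reproduced here) and can then read the standard fields. The sketch below parses just the documented 20-byte descriptor header.

    import struct

    def parse_security_descriptor_header(sd: bytes) -> dict:
        """Parse the fixed header of a self-relative Windows security descriptor."""
        # Layout: Revision (1 byte), Sbz1 (1 byte), Control (2 bytes), followed by four 32-bit
        # offsets to the owner SID, group SID, SACL and DACL, relative to the descriptor start.
        revision, sbz1, control, owner, group, sacl, dacl = struct.unpack_from("<BBHIIII", sd, 0)
        return {
            "revision": revision,
            "control": control,
            "offset_owner": owner,
            "offset_group": group,
            "offset_sacl": sacl,
            "offset_dacl": dacl,
        }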

Trash Stream Table The Trash Stream Table has a structure similar to common directory tables but fulfills a different purpose. It is used to conserve large files that were removed from ReFS formatted file systems. If a file that is formed by 5000 (0x1388) or more data runs is deleted, it is not actually removed. Even its clusters are not released immediately. Instead, the table representation of the file that was previously stored in a directory table is moved into the Trash Stream Table. As of now, no user can see the contents of the Trash Stream Table. It may still be accessed by suitable software to recover large deleted files entirely. The size threshold at which a file is moved into the Trash Stream Table is blurry. To check whether a file qualifies to be moved into the Trash Stream Table, the driver merely verifies whether the file contains 5000 or more data runs, but it does not look at their size. Some data runs may only encompass a few clusters, whereas other data runs may contain multiple thousand clusters. Because of that, the actual size of files that are moved into the Trash Stream Table is unclear. On larger unfragmented volumes, only files of a size of 200 GiB may be written into the Trash Stream Table, whereas on a more fragmented volume, files written to the Trash Stream Table might only encompass a few GiB. To internally move a file into the Trash Stream Table, it is “reparented” from its current directory into the Trash Stream Table. Since the clusters of files in the Trash Stream Table are not released and thus not available for reuse, files referenced by the Trash Stream Table still consume space on the volume. The ReFS driver therefore uses a worker method to periodically remove entries from the Trash Stream Table. Theoretically, it would be possible to patch the ReFS driver so that files are never directly deleted but always reparented to the Trash Stream Table first and only removed once it becomes necessary to reclaim their clusters. Such a change would negatively impact the performance of the file system but would add great value when it comes to performing a forensic analysis.

General metadata of files and directories As mentioned in section 4.1.4, tables not only contain rows but also header information that is specific to their table type. Tables that represent files, as well as tables that represent directory descriptors, use this header to store information similar to the $STANDARD_INFORMATION attribute known from NTFS. In NTFS, the $STANDARD_INFORMATION attribute was used to store general information about a file or directory. The table-specific information of files and directories contains the same four MACE (Modified, Accessed, Created, Metadata Entry) timestamps that are already known from NTFS. Just like in NTFS, the timestamps are stored in the Windows NT time format, which uses the number of one-hundred-nanosecond intervals

since January 1, 1601, to encode time information. Unlike many other file systems, but in common with NTFS, a “Metadata Entry” timestamp is used to encode when the metadata of a file was changed. Just like the $STANDARD_INFORMATION attribute, this fixed part of every file and directory descriptor table also contains flags that are used to encode properties of a file or directory. A comprehensive view of all flags is given in [7]. The data furthermore contains size information, as well as a security identifier that may be used to locate an associated security descriptor. The data also contains information that may be used to locate the last entry in the change journal (USN journal) that referred to this metadata entry. If the metadata entry stores a so-called reparse point attribute, the general information also includes the tag of the reparse point. Reparse points are a concept that stems from NTFS; they are explained in section 4.2.3. For files, the general information also stores the file identifier that forms the lower part of the metadata address. For directories, the number of the next file identifier to be used in the directory is stored instead.

In the Windows version that was used during the analysis, the access timestamps were never updated. Not updating the timestamps whenever a file or a folder is accessed avoids unnecessary write operations that would become even more extensive under a Copy-On-Write policy. Windows systems manage a registry key to determine whether access timestamps in ReFS file systems should be updated. The registry path HKEY_LOCAL_MACHINE/SYSTEM/CurrentControlSet/Control/FileSystem stores the keys NtfsDisableLastAccessUpdate and RefsDisableLastAccessUpdate that control whether access timestamps in the corresponding file systems are updated. Within the analysis environment that we used, the key RefsDisableLastAccessUpdate was set to True from the start.
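When extracting these timestamps, it is common to convert the 64-bit Windows NT time format into a human-readable form; the sketch below does so using the fixed offset between 1601-01-01 and the Unix epoch.

    from datetime import datetime, timedelta, timezone

    WINDOWS_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)

    def filetime_to_datetime(filetime: int) -> datetime:
        """Convert a count of 100-nanosecond intervals since 1601-01-01 (UTC) to a datetime."""
        return WINDOWS_EPOCH + timedelta(microseconds=filetime // 10)

    # Example value in 100 ns units; prints a UTC timestamp that falls in the year 2019.
    print(filetime_to_datetime(0x01D4DA9C8E8B0000))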

$DIR_LINK Attribute (0x38) The directory link attribute is new compared to NTFS. In NTFS, merely a $FILE_NAME attribute existed that was used to store the name of a file or a directory. In contrast, the $DIR_LINK attribute serves only to store information about the name of a directory. The $DIR_LINK attribute also stores the same MACE timestamps that are contained in the general information of a directory. The information stored in the $DIR_LINK attribute is only altered if the name of the directory is changed or if it is moved. The $DIR_LINK attribute also stores the directory identifier of the parent directory of the directory it describes. This supplementary piece of information makes it possible to restore the paths of unreferenced directories. As of now, we assume that a directory may only own a single attribute of this type.

$INDEX_ROOT Attribute (0x90) The $INDEX_ROOT attribute was only found as an attribute of directories. In NTFS an equally named attribute exists which is used to implement indices as well as directory structures within the metadata entries of directories. This task is not required in ReFS, as directory tables adopt the purpose to index files and subdirectories. The $INDEX_ROOT attribute is still found as an attribute in ReFS but seems to be unused.

$DATA Attribute (0x80) The data attribute is used to reference the clusters in which the contents of a file reside. Unlike in NTFS, where data could be stored resident or non-resident, the data referred to by the $DATA attribute in ReFS is always located in external clusters, thus making it a non-resident attribute. While all other attributes are stored as primitive chunks of data in the table of a file or a directory descriptor, this attribute is the only one that forms an embedded table. ReFS implements the $DATA attribute as a table that stores data runs, which describe mappings between LCNs and VCNs. The term Logical Cluster Number (LCN) refers to a cluster within the file system, whereas the Virtual Cluster Number (VCN) describes an offset within a file. In ReFS, not only are checksums of metadata calculated, but it is also possible for users to enable the calculation of checksums on file data. This feature is called “integrity streams” and is realized in the table spanned by the $DATA attribute. If integrity streams are enabled, the corresponding data run table does not only store the data runs of a file but also supplements them with checksum information.
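Resolving a file offset therefore amounts to finding the data run whose VCN range covers the requested cluster and mapping it onto the corresponding LCN. A minimal sketch, assuming the data run rows have already been extracted as (start VCN, length, start LCN) tuples:

    import bisect

    # Hypothetical extracted data runs, sorted (collated) by their starting VCN.
    # Each tuple holds (start_vcn, length_in_clusters, start_lcn).
    data_runs = [(0, 8, 0x1000), (8, 4, 0x5000), (12, 16, 0x2200)]

    def vcn_to_lcn(vcn: int) -> int:
        """Map a virtual cluster number of the file to a logical cluster number on the volume."""
        starts = [run[0] for run in data_runs]
        i = bisect.bisect_right(starts, vcn) - 1        # last run starting at or before vcn
        if i < 0:
            raise ValueError("negative VCN or empty run list")
        start_vcn, length, start_lcn = data_runs[i]
        if vcn >= start_vcn + length:
            raise ValueError("sparse range or VCN beyond the end of the file")
        return start_lcn + (vcn - start_vcn)

    print(hex(vcn_to_lcn(10)))                          # -> 0x5002 (third cluster of the second run)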

$NAMED_DATA Attribute (0xb0) NTFS allowed a file to own more than one $DATA attribute. The default data attribute of a file has no name associated with it. Additional $DATA attributes were distinguished through their names. This feature is

also known as “alternate data streams” (ADS) or named streams. When ReFS was introduced, alternate data streams were announced to be deprecated. Nevertheless, the current version of ReFS allows creating alternate data streams. Their implementation, however, differs vastly between ReFS and NTFS. Instead of treating alternate data streams like regular file contents and storing them in external clusters, they are stored directly as a resident attribute and thus as part of the metadata of the file system. Because of that choice of design, alternate data streams always profit from the checksums that ReFS uses to ensure the integrity of metadata. However, they also drastically inflate the tree structure spanned by a file, and thus their size is limited to 128 KiB. Since their maximum size exceeds the size of a page, a single alternate data stream may be split into multiple attributes that are distinguished by different offset and length values.
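Reassembling such a split stream for analysis is straightforward once the individual $NAMED_DATA fragments have been extracted; a sketch, assuming each fragment is available as an (offset, payload) pair:

    # Hypothetical extracted $NAMED_DATA fragments of one alternate data stream:
    # each tuple holds the fragment's offset within the stream and its payload bytes.
    fragments = [(0x0, b"first part "), (0xB, b"second part")]

    def reassemble_stream(frags) -> bytes:
        """Concatenate the fragments of a resident named data stream in offset order."""
        buf = bytearray()
        for offset, payload in sorted(frags):
            if len(buf) < offset:                 # pad gaps, should they occur
                buf.extend(b"\x00" * (offset - len(buf)))
            buf[offset:offset + len(payload)] = payload
        return bytes(buf)

    print(reassemble_stream(fragments))           # b'first part second part'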

$REPARSE_POINT Attribute (0xc0) Reparse points are another well-known concept from NTFS. Reparse points were always associated with a file or a directory. This has not changed in ReFS, as only files and directory descriptors may possess this type of attribute. Reparse points are formed by an arbitrary data portion and a tag. The interpretation of the data stored in a reparse point happens independently of the structure of a file system. Any application or system may define a specific format for the data of a reparse point. The tag of a reparse point must be unique. Reparse tags are chosen by applications or systems and are used to distinguish different types of reparse points. In NTFS, a reparse status code was returned by the driver whenever a file with an associated reparse point was processed. This status code could be processed by an application or a system that defines the actions to take when reparse data with a given tag is accessed [41, p. 154]. A frequently used example of reparse points are links. In the implementation of links, a $REPARSE_POINT attribute is used to indicate to which location a file or directory links. This concept is also known as symbolic links and is discussed in more detail in section 4.2.4.

$USN_INFO Attribute (0xf0) This attribute was not defined within the attribute list found in the ReFS driver. Thus, the name $USN_INFO is not official but was picked by us. It stems from the observation that this attribute is only used in conjunction with Update Sequence Number (USN) journals. USN journals are explained in section 4.2.5. If a user creates a USN journal under ReFS, a file called Change Journal is created in the File System Metadata directory. This file not only stores a $DATA attribute that is used to reference the data of the USN journal but also contains this $USN_INFO attribute to store organizational information about the USN journal itself. In NTFS, the change journal file used a data stream called $Max to store the same properties of a USN journal that can be found in the $USN_INFO attribute.

Further remarks Every ReFS formatted file system contains a folder called File System Metadata. This folder serves a similar purpose as the $Extend directory found under NTFS. Initially, it contains three files: Volume Direct IO File, Security Descriptor Stream, and Reparse Index. When a user creates a change journal, another file is created in the File System Metadata directory. Aside from the change journal, none of the files in this directory contain any meaningful contents. We assume that most of the files in this directory represent counterparts to different tables and may be used to access the contents of these tables through the interface of a file. The counterpart of the Security Descriptor Stream is the Security Table, whereas the counterpart of the Reparse Index is the Reparse Index Table that is introduced in the next section.

4.2.4 File Name Category
File names under ReFS are tightly coupled to the actual metadata of a file. As explained in section 4.1.4, the name of a file or a directory link is used as a key in the directory table in which it resides. The $DIR_LINK attribute is another example of data that can be assigned to the file name category, but as it also contains timestamps and other types of metadata, it may similarly be assigned to the metadata category. The directory design used in ReFS eliminates the need for attributes that implement directory indices.


Symbolic links, as well as junctions, offer an additional method of introducing file names to ReFS. Symbolic links and junctions are realized as regular files or directories that point to different files, which do not necessarily have to exist. The pointer in such a link is stored in a $REPARSE_POINT attribute and holds the name of the referenced object. Junctions are a legacy concept of NTFS that was retained in ReFS. Junctions may only reference absolute paths that are local to the volume on which they are stored. The paths they point to do not necessarily have to exist. Symbolic links, in contrast, may also be used to reference remote files and directories by referring to them via their UNC (Uniform Naming Convention) path. Symbolic links store references in the same format that the user provided when creating them, either relative or absolute. In NTFS, a third type of link, the so-called "hard link", exists that allows referencing a file on the same volume by multiple names that all point to the same metadata entry. Hard links are implemented through internal mechanisms of NTFS that go beyond the scope of reparse points. As stated in section 4.1.4, hard links were abolished in the specification of ReFS. Regular symbolic links can be created by using the command-line application mklink. The command mklink link_name target creates a new file called link_name, which acts as a symbolic link and references the file name target. The command-line application fsutil may be used to query general information about a reparse point and thus may also be used to query information stored with a link. The implementation of the internal workings of symbolic links has not changed between NTFS and ReFS since in both file systems they are equally mapped to reparse points.

Reparse Index Table In NTFS, a file called \$Extend\$Reparse was used to keep track of all files that implemented reparse points. In ReFS, a file with the similar name File System Metadata\Reparse Index exists. This file, however, seems not to be used at all. Instead, a table called "Reparse Index Table" is used to track reparse points. The table merely stores reparse tags together with the metadata addresses of the files and directories that hold them. This design choice makes it possible to efficiently locate all metadata entries that contain reparse data. Traversing the "Reparse Index Table" and looking only for entries with the reparse tag IO_REPARSE_TAG_SYMLINK could be a strategy to locate all symbolic links that exist in a file system.
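Such a search could be sketched as follows. The snippet assumes that the rows of the Reparse Index Table (see table 4.24) have already been parsed into dictionaries; the field names are our own, whereas the tag constants are the documented Windows reparse tags.

# Sketch: filtering parsed Reparse Index Table rows for symbolic links.

IO_REPARSE_TAG_SYMLINK = 0xA000000C      # documented Windows reparse tag
IO_REPARSE_TAG_MOUNT_POINT = 0xA0000003  # used by junctions

def find_symbolic_links(reparse_index_rows):
    """Yield (directory_id, file_id) pairs of all entries tagged as symbolic links."""
    for row in reparse_index_rows:
        if row["reparse_tag"] == IO_REPARSE_TAG_SYMLINK:
            yield row["directory_id"], row["file_id"]

# Usage: for dir_id, file_id in find_symbolic_links(rows): resolve the metadata entry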

4.2.5 Application Category
File System Journaling The Copy-On-Write policy that enables atomic updates of data structures, as well as the checksum mechanism used in ReFS, greatly contribute to its robustness. Because of the atomic updates, it is not required to persist data for undo operations. Instead, ReFS uses a file system journal solely to perform redo operations. Changes to the file system are written in batches. Within the period between two batch operations, multiple changes to data structures may occur. All of those changes will not be persisted until the next checkpoint is written. If a system failure occurs within this timespan, accumulated changes that were previously wholly executed but were not yet written to the disk might get lost. The storage engine on which ReFS is based provides the concept of transactions to its user. A traditional requirement of transactions is that they must be persistent. Once a transaction has been committed, it is expected that its result exists in the file system regardless of a future failure. To fulfill this requirement, ReFS uses a file system journal. After a transaction under ReFS has been executed successfully, an entry which confirms the success of the transaction is written into a logging area. The logged entry contains detailed information on which actions were executed as part of the transaction so that the entire transaction may be repeated. Additionally, the entry stores a log sequence number (LSN) which describes a relative order of when it was written as well as other organizational data. Every entry written to the logging area is also secured by a checksum which is calculated directly before the entry is written to the disk. Just as in NTFS, the log entries do not contain user data; therefore, they cannot be used for file recovery. NTFS used a file called $LogFile to store the journal of the file system. The $LogFile contains a data attribute consisting of a restart and a logging area. When NTFS tries to recover from a crash, the file system driver first interprets the restart area, which provides information on the logging area in which the actual log entries reside. The restart area also provides information on which action was the last one that was performed successfully and thus gives the location from where to start in the logging area [12, p. 340]. The structure of the file system journal found under ReFS is quite similar. Instead of storing the entire journal within a file, ReFS utilizes a table that references the restart area as well as the logging area, as seen in figure 4.18. The start and the end of the logging area are not only defined by the Logfile Information Table but can also be found in the restart area.

[Figure content: the Logfile Information Table (0x9) stores under key 0x0 the string "Logfile Information Table" and under key 0x1 a buffer with the start, end, and length of the logging area as well as the locations of the restart area and its backup. The logging area contains log entries that carry an LSN, a checksum, and the redo operations #1 ... #n.]

Figure 4.18: Overview of the logging areas used in ReFS

To mitigate the risk of corruption, the restart area of the journal used in ReFS consists of two blocks that hold the same content and, similar to the checkpoint structures, are written alternately. Just like all other blocks of the file system journal, the blocks of the restart area use checksums to offer a method to detect corruption in them. Whenever one of the blocks in the restart area is rewritten, a counter value in it is updated. If both blocks can be verified successfully, the block with the higher counter is picked; otherwise, the block without corruption is picked. If both blocks become corrupted, the ReFS driver has to discard the information in the journal. The restart area contains an interval that reaches from the most recent LSN that has been written to the previous checkpoint to the first LSN that is included in the next checkpoint. Excluding its boundaries, this interval forms the range of LSNs that are included in the current checkpoint. The interval is always updated right before a new checkpoint is written; figure 4.19 shows this concept. Right before the checkpoint C2 is persisted to a storage medium, the LSNs in the restart area are adjusted accordingly. When a checkpoint is written to the disk, all changes in memory are flushed, and all log entries that lie within its range become part of the current on-disk state. After C2 in figure 4.19 was written successfully, the transactions described by the log sequence numbers 6, 7, 8, and 9 have been persisted to the volume. The checkpoint also holds a field that contains the next LSN that should be written. This LSN is used in the recovery process to decide which parts of the file system journal have to be replayed.
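The selection logic for the two restart blocks described above can be summarized in a short sketch; the attribute names are ours and merely stand for the parsed counter and checksum state of each block.

# Sketch of the restart area selection logic.

def pick_restart_block(block_a, block_b):
    valid = [b for b in (block_a, block_b) if b.checksum_ok]
    if not valid:
        return None  # both blocks corrupt: the journal information has to be discarded
    # If both blocks verify, the one written more recently (higher counter) wins.
    return max(valid, key=lambda b: b.counter)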


[Figure content: checkpoint C1 (next LSN: 6) includes LSNs 1-5, checkpoint C2 (next LSN: 10) includes LSNs 6-9, checkpoint C3 is still incomplete and will include LSNs 10, 11, ...; the restart area stores the old and the new LSN of the current interval.]

Figure 4.19: Example for an inconsistent checkpoint state (redo operations are necessary)

To examine whether parts of the file system journal have to be replayed, the log sequence number found in the checkpoint is compared to the start of the current interval. When the file system is unmounted and all transactions are persisted to a checkpoint, the "old lsn" identifier in the restart area is set to the same value as the "next lsn" field in the checkpoint. If both of these fields are equal, the ReFS driver detects that the redo log does not need to be interpreted. Such a scenario is shown in figure 4.20. Otherwise, starting at the "next lsn" value of the checkpoint, all entries in the redo log are interpreted and replayed. This process continues until an invalid log entry is reached. A log entry is detected to be invalid if its log sequence number does not match the currently expected log sequence number or if an unrecoverable checksum error occurs. A sketch of this decision follows after figure 4.20.

[Figure content: checkpoint C2 (next LSN: 10) includes LSNs 6-9, checkpoint C3 (next LSN: 14) includes LSNs 10-13; the restart area stores the old and the new LSN of the current interval.]

Figure 4.20: Example for a consistent checkpoint state (no redo operations are necessary)
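The recovery decision described above can be sketched as follows; the names follow our description rather than official symbols, and LSNs are assumed to increase by one, as in the examples shown in the two figures.

# Sketch of the recovery decision and the replay loop.

def needs_replay(restart_old_lsn, checkpoint_next_lsn):
    # Equal values mean the volume was cleanly unmounted and all transactions
    # are already part of the checkpoint; nothing has to be redone.
    return restart_old_lsn != checkpoint_next_lsn

def replay(log_entries, checkpoint_next_lsn, apply_redo):
    """Replay log entries starting at the checkpoint's next LSN until an
    invalid entry (unexpected LSN or checksum error) is reached."""
    expected = checkpoint_next_lsn
    for entry in log_entries:  # entries as they appear in the logging area
        if entry.lsn != expected or not entry.checksum_ok:
            break
        apply_redo(entry)
        expected += 1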

The logging area is filled with records that describe modifications to different tree structures. The format of entries in the logging area is independent of the internal representation of tree structures that are modified. Instead, a descriptive format is used that explains which high-level actions need to be repeated in order to replay the log. An entry in the logging area represents a transaction. A single transaction encapsulates multiple operations. The following properties characterize a single operation:

• Redo Identifier: Specifies which operation was performed and thus needs to be redone

• Table Path: Is a path to the table that was altered as part of the operation. If the entry refers to an embedded table, the table path consists of a table identifier as well as a sequence of keys that is used to reach the embedded table. If multiple succeeding operations in a transaction refer to the same table, only the first operation must specify a table path. If the table path is omitted in a log entry, it is assumed that the previous table path is reused.

• Data: Many operations are characterized by parameters. An insert operation, for example, might be performed with a key and a value, whereas a remove operation may only expect a key. Within the transaction log, these parameters are encoded as "data" that is passed to the redo operations. Some redo operations also accept a variable number of parameters, as some parameters might be optional or are only needed in exceptional situations.

Transactions usually need to rely on the collaboration of multiple tables. An example of a transaction is the creation of a file, which consists of multiple steps:

1. Create an ID2 entry that builds a mapping between the file name and its metadata address
2. Update the "next metadata address" field in the directory into which the file is inserted
3. Update the "next metadata address" field that is used in the Object ID Table
4. Create a table for the new file
5. Fill the table with basic information, such as timestamps as well as a security id, ...
6. Insert a $DATA attribute into the table
7. Create an embedded table that may be used to store the data runs for the file

If one of these steps is not performed, the system would fall into an inconsistent state. Therefore, it is crucial for the driver to make sure that either all actions or none are executed. There exists no comprehensive list that includes all transactions as well as the operations they comprise. Indeed, it would be possible to realize the effects of the same transaction in multiple different ways. It might be a first step to collect different transactions in a ReFS file system and to classify them as high-level operations, but a more precise approach should also be able to detect variants of those transactions which serve the same purpose.
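A purely illustrative way to model the logged operations characterized above is shown below; the class and field names are hypothetical and the sketch only demonstrates how an omitted table path is carried forward from the preceding operation of the same transaction.

# Sketch: representing the operations of a logged transaction.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Operation:
    redo_id: int                 # which redo routine has to be invoked
    table_path: Optional[tuple]  # table identifier plus keys of embedded tables, or None
    data: bytes                  # encoded parameters of the operation

def resolve_table_paths(operations):
    """Fill in omitted table paths by reusing the path of the previous operation."""
    current = None
    for op in operations:
        if op.table_path is not None:
            current = op.table_path
        yield Operation(op.redo_id, current, op.data)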

Change Journal With more recent versions of NTFS, a so-called change journal, or Update Sequence Number (USN) journal, was introduced. The change journal was implemented as an optional file called $Extend\$UsnJrnl and was used to roughly track recent changes made to files and directories. Its specification is completely disclosed by Microsoft, and thus it can be easily processed. In different versions of the Windows operating system, the change journal provided by NTFS is sometimes enabled by default and sometimes not. Carrier [12] states that "any application can turn the change journal feature on and off". Because of that, the change journal is not guaranteed to be enabled and to provide valuable information about a file system. Since the change journal merely stores data in a generic form that provides an abstract view on changes made to files and directories, it can be easily ported to other file systems. ReFS uses a file called File System Metadata\Change Journal to implement the mechanics of the change journal known from NTFS. In NTFS, the $UsnJrnl file consisted of two alternate data streams. The data stream $J contains the actual contents of the change journal, whereas the data stream $Max contains organizational information about the journal. In ReFS, the Change Journal file uses a regular $DATA attribute to store the contents of the change journal as well as a $USN_INFO attribute to store its size information.

4.3 ReFS Data Structures

All values under ReFS are saved in little-endian representation. Strings such as the volume label as well as names of files and directories are saved in a little-endian UTF-16 representation. Internal names that are part of file system components are saved as ASCII strings.

4.3.1 Boot Sector
Every ReFS formatted volume starts with a 512-byte large boot sector. In the literature, this structure is also referred to as the volume boot record (VBR). Just like in NTFS, the boot sector used under ReFS starts with 3 bytes that are reserved for storing an instruction to jump to boot code. This instruction is only necessary if a file system is bootable. As of now, those bytes are merely reserved under ReFS; it is not yet possible to boot from a ReFS file system. The last sector of a ReFS file system stores a copy of the boot sector. This copy is useful if the boot sector at the start of the file system is damaged.


The boot sector is the first structure that must be parsed under ReFS to obtain cluster and sector sizes as well as the used ReFS version. The first 24 bytes of the boot sector form the so-called File System Recognition Structure (FSRS) which, amongst other things, contains the name of the file system (ReFS). The internal structure of the FSRS is described in table 4.2. Its definition is known from a patent that was registered by Microsoft [42]. An operating system that does not include a driver that is able to interpret the type of a file system can still use the FSRS to deduce whether the volume holds a valid file system.

• Example: Changing the file system's name in the FSRS from ReFS to SeFS and adjusting the checksum used in the FSRS leads to the operating system no longer being able to mount the partition, while still showing it as a valid partition. In previous Windows versions (up to Windows 7), the operating system would open a dialog to format such partitions that hold an unknown file system.

The remaining 48 bytes of the boot sector, which are explained in table 4.3, are used to store multiple size values. With the cluster size that can be retrieved from the boot sector, it is possible to determine the location of the superblock.

Byte Range    Description
0x00 - 0x03   Assembly instruction to jump to boot code (Must be zero).
0x03 - 0x0b   FSRS file system name: ReFS (0x5265465300000000).
0x0b - 0x10   Must be zero.
0x10 - 0x14   FSRS Identifier. Must be FSRS (0x53525346).
0x14 - 0x16   Size of the VBR. Must be 0x200 (512 bytes).
0x16 - 0x18   Checksum for the entire boot sector, see Figure 4.21 for an implementation of the used checksum algorithm.

Table 4.2: Data structure in the first 24 bytes of the boot sector (FSRS)

Byte Range    Description
0x00 - 0x18   See Table 4.2.
0x18 - 0x20   Sector count.
0x20 - 0x24   Bytes per sector. Must be 0x200 (512 bytes).
0x24 - 0x28   Sectors per cluster (data unit).
0x28 - 0x2a   Major and minor version number.
0x2c - 0x30   Unknown flag.
0x38 - 0x40   Volume serial number.
0x40 - 0x48   Bytes per container.

Table 4.3: Data structure contained in the remainder of the boot sector
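A parsing routine for these fields could look like the following sketch, which extracts only the values needed to interpret the volume and to locate the superblock; the dictionary keys are our own naming, not official terms.

# Sketch: parsing the fields from tables 4.2 and 4.3 out of a raw boot sector.

import struct

def parse_boot_sector(vbr: bytes) -> dict:
    if vbr[0x10:0x14] != b"FSRS" or vbr[0x03:0x07] != b"ReFS":
        raise ValueError("not a ReFS volume boot record")
    sector_count, = struct.unpack_from("<Q", vbr, 0x18)
    bytes_per_sector, sectors_per_cluster = struct.unpack_from("<II", vbr, 0x20)
    major, minor = vbr[0x28], vbr[0x29]
    serial, bytes_per_container = struct.unpack_from("<QQ", vbr, 0x38)
    cluster_size = bytes_per_sector * sectors_per_cluster
    return {
        "version": (major, minor),
        "sector_count": sector_count,
        "cluster_size": cluster_size,
        "serial": serial,
        "bytes_per_container": bytes_per_container,
        # The superblock resides in cluster 30 (0x1e), see section 4.3.4.
        "superblock_offset": 0x1e * cluster_size,
    }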

# BS_DATA: bytes of the boot sector
def fsrs_checksum(BS_DATA):
    checksum_val = 0
    for i in range(0, len(BS_DATA)):
        if checksum_val & 1:
            tmp_val = 0x8000
        else:
            tmp_val = 0x0
        # Keep the running value within 16 bits, matching the 2-byte checksum field.
        checksum_val = (tmp_val + (checksum_val >> 1) + BS_DATA[i]) & 0xffff
    return checksum_val

Figure 4.21: Implementation to calculate the checksum of the FSRS


Example Figures 4.22 and 4.23 visualize an exemplary boot sector structure. The shown boot sector indicates that the volume that it describes contains 0x240000 sectors. The size of a single sector is 0x200 (512 B); a cluster is formed by 0x8 sectors. The version of the volume is 3.4. Its serial number is 0xdab891efb891ca81. The volume is divided into containers of 0x4000000 bytes, which is equal to 64 MiB. Knowing the cluster size used in a volume, one may continue to process the superblock found in cluster 0x1e (30).

00000000 00 00 00 52 65 46 53 00 00 00 00 00 00 00 00 00
00000010 46 53 52 53 00 02 37 a6

Figure 4.22: Example layout of the FSRS structure, explained in table 4.2

00000010 00 00 24 00 00 00 00 00
00000020 00 02 00 00 08 00 00 00 03 04 00 00 06 00 00 00
00000030 00 00 00 00 00 00 00 00 81 ca 91 b8 ef 91 b8 da
00000040 00 00 00 04 00 00 00 00

Figure 4.23: Example layout of the remaining boot sector, explained in table 4.3

4.3.2 Page Structure
Apart from the boot sector, all data that is used to manage the structure of a ReFS file system is stored in so-called pages. Ballenthin [9], Metz [32], Head [26], and Georges [22, p. 18] use the term metadata blocks to refer to them. We will stick to the term page, which was obtained from the symbols of the driver. Pages differ from the regular clusters that are used to store the contents of a file, as pages mostly use a different size and all start with an identically structured header. Ballenthin [9] found pages to always be of the size 16 KiB (0x4000). With this size, pages in previous ReFS versions were smaller than the supported cluster size of 64 KiB. Hence, they also were addressed differently from regular clusters. The header of every page in older ReFS versions started with a 64-bit identifier which was used as an index. To determine the address of a page, one had to multiply its identifier by its size: PAGE_ADDRESS = PAGE_IDENTIFIER · 0x4000. In more recent versions of ReFS, pages always have the size of one or multiple clusters. At a cluster size of 4 KiB, a page may be up to 16 KiB large. At a cluster size of 64 KiB, a page is 64 KiB large. With this change, the addressing scheme of pages was also altered. Every page starts with a so-called page header, as shown in table 4.4. The page header contains general information about a page, such as its logical cluster number (LCN), which may be used to reference it. Furthermore, the page header contains fields from which the recency as well as the purpose of a page may be derived. The LCNs in a page header are stored in a quadruple. The addresses contained in the quadruple do not need to lie contiguously on the disk; thus, pages may become fragmented. An LCN is also not always equal to the actual cluster offset in the volume, as some LCNs have to be translated to real clusters first. With more recent versions of ReFS, a page signature and a volume signature were introduced to the page header. The page signature may be used to distinguish the type of the page (Superblock, Checkpoint, Minstore B+ Node), whereas the volume signature may be used to make sure that the page belongs to the current volume. Both of these new values come in handy when searching entire volumes for pages.

Example Figure 4.24 visualizes an exemplary page header structure. The page described by the header belongs to a Minstore B+ Node (0x2b42534d: MSB+). That node seems to be a part of a tree with the identifier 0xb. The identifier of the volume in which the page resides is 0x6100b379. The virtual allocator clock of the batch in which the page was written is 0x69. Within that batch, 0xa operations were performed. The address quadruple that may be used to reference the page is <0x4000, 0x4001, 0x4002, 0x4003>.


Byte Range    Description
0x00 - 0x04   Page signature (SUPB, CHKP or MSB+)
0x04 - 0x08   Always 0x2
0x08 - 0x0c   Always 0x0
0x0c - 0x10   Volume signature
0x10 - 0x20   Recency of the page
  0x10 - 0x18   Virtual Allocator Clock
  0x18 - 0x20   Tree Update Clock
0x20 - 0x40   Address of the page (LCN)
  0x20 - 0x28   LCN 0
  0x28 - 0x30   LCN 1
  0x30 - 0x38   LCN 2
  0x38 - 0x40   LCN 3
0x40 - 0x50   Table Identifier
  0x40 - 0x48   Table Identifier (High)
  0x48 - 0x50   Table Identifier (Low)

Table 4.4: Structure of a page header

04000000 4d 53 42 2b 02 00 00 00 00 00 00 00 79 b3 00 61
04000010 69 00 00 00 00 00 00 00 0a 00 00 00 00 00 00 00
04000020 00 40 00 00 00 00 00 00 01 40 00 00 00 00 00 00
04000030 02 40 00 00 00 00 00 00 03 40 00 00 00 00 00 00
04000040 00 00 00 00 00 00 00 00 0b 00 00 00 00 00 00 00

Figure 4.24: Example layout of a page header, as explained in table 4.4
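Decoding a page header according to table 4.4 could be sketched as follows; the field names are ours and follow the descriptions used in this section.

# Sketch: parsing a page header (table 4.4).

import struct

PAGE_SIGNATURES = {b"SUPB": "superblock", b"CHKP": "checkpoint", b"MSB+": "minstore node"}

def parse_page_header(page: bytes, expected_volume_signature=None) -> dict:
    signature = page[0x00:0x04]
    if signature not in PAGE_SIGNATURES:
        raise ValueError("unknown page signature")
    volume_signature, = struct.unpack_from("<I", page, 0x0c)
    if expected_volume_signature is not None and volume_signature != expected_volume_signature:
        raise ValueError("page belongs to a different volume")
    allocator_clock, tree_update_clock = struct.unpack_from("<QQ", page, 0x10)
    lcns = struct.unpack_from("<4Q", page, 0x20)
    table_id = struct.unpack_from("<QQ", page, 0x40)
    return {
        "type": PAGE_SIGNATURES[signature],
        "volume_signature": volume_signature,
        "allocator_clock": allocator_clock,
        "tree_update_clock": tree_update_clock,
        "lcns": lcns,
        "table_id": table_id,
    }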

4.3.3 Page Reference
To be able to reference pages, ReFS uses a so-called page reference structure. A page reference structure holds the LCN quadruple of the referenced page as well as information about its checksum and checksum data that may be used to verify the integrity of the referenced page. Its structure is described in table 4.5. A page may also hold such a reference to itself so that its own correctness may be verified. Self-references are used to verify the integrity of structures which serve as entry points to the file system.

Example

0005c0f8 00 40 00 00 00 00 00 00 01 40 00 00 00 00 00 00
0005c108 02 40 00 00 00 00 00 00 03 40 00 00 00 00 00 00
0005c118 00 00 02 08 08 00 00 00 f7 63 73 c7 e5 30 06 3c

Figure 4.25: Example layout of a page reference structure, as explained in table 4.5

Figure 4.25 visualizes an exemplary page reference structure. The structure references the page with the address quadruple <0x4000, 0x4001, 0x4002, 0x4003>. To follow the reference, it is first necessary to translate the virtual addresses stored in the quadruple to real clusters. Next, all these clusters are read, and their data is assembled into a single contiguous page. It is then checked that the address quadruple stored in the reference matches the address quadruple found in the header of the read page. The reference in the example uses the checksum type (0x2) CRC64-ECMA-182. If the calculated checksum matches the checksum found in the page reference, it can be assumed that the referenced page is valid. Otherwise, either the checksum itself or the referenced page is assumed to be corrupt.


Byte Range    Description
0x00 - 0x20   Address of the referenced page
  0x00 - 0x08   LCN 0
  0x08 - 0x10   LCN 1
  0x10 - 0x18   LCN 2
  0x18 - 0x20   LCN 3
0x20 - ....   Checksum of the referenced page
  0x20 - 0x22   Unused
  0x22 - 0x23   Checksum Type (1: CRC32-C, 2: CRC64-ECMA-182)
  0x23 - 0x24   Checksum Offset (Relative to the start of the checksum information)
  0x24 - 0x26   Checksum Length
  0x26 - 0x28   Unused
  .... - ....   Checksum Data

Table 4.5: Structure of a page reference

If a page reference was found in an attempt to recover deleted structures, it is also possible that the page reference refers to a page that has been reallocated and that the data in it no longer has anything to do with its previous contents.
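A corresponding sketch for decoding a page reference and matching it against the header of the referenced page is given below; the actual CRC32-C or CRC64-ECMA-182 computation is omitted, and the field names are ours.

# Sketch: parsing a page reference (table 4.5) and checking the LCN quadruple.

import struct

def parse_page_reference(buf: bytes, offset: int = 0) -> dict:
    lcns = struct.unpack_from("<4Q", buf, offset)
    checksum_type, checksum_offset = buf[offset + 0x22], buf[offset + 0x23]
    checksum_length, = struct.unpack_from("<H", buf, offset + 0x24)
    checksum_start = offset + 0x20 + checksum_offset  # relative to the checksum information
    return {
        "lcns": lcns,
        "checksum_type": checksum_type,  # 1: CRC32-C, 2: CRC64-ECMA-182
        "checksum": buf[checksum_start:checksum_start + checksum_length],
    }

def reference_matches_page(reference: dict, page_header: dict) -> bool:
    # The LCN quadruple in the reference must equal the one in the page header.
    return tuple(reference["lcns"]) == tuple(page_header["lcns"])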

4.3.4 Superblock
The superblock (short SUPB) is the first page that must be read to interpret a ReFS file system. Its purpose is to reference the current checkpoint state. We derived the name of this structure from various functions in the refs.sys driver as well as from the signature at the beginning of such a page, 0x53555042 (SUPB).

• Green [24, p. 13] calls this data structure POMB: Partition Offset Metadata Block

• Georges [22, p. 93] calls it Entry Block 0x1e
• Metz [32, p. 7] calls it Level 0 metadata block

As mentioned in section 4.1.2, the addressing scheme was altered between different ReFS versions. In prior ReFS versions (1.1, 1.2), pages, and thus also the superblock, were addressed via sequential identifiers. At that time, the superblock could be located in page 30 (0x1e) of a volume. In more recent versions of ReFS, the superblock is always found in cluster 30.

ReFS v1, page size 16 KiB:    bytes per page · 0x1e → 0x078000
ReFS v3, cluster size 4 KiB:  bytes per cluster · 0x1e → 0x01e000
ReFS v3, cluster size 64 KiB: bytes per cluster · 0x1e → 0x1e0000

There exist multiple copies of the superblock structure. The first is located at either page or cluster 30. The other superblock copies can be found in the third-last and second-last page or cluster of a file system. The driver picks the most recent superblock. Recency is defined through a clock field contained in the superblock. To enable the driver to verify its integrity, the superblock also contains a page reference structure that serves as a self-descriptor. The page reference structure contains a CRC checksum that is calculated over the entire superblock. During the calculation of the checksum of the superblock, the page reference structure is completely set to zero. When a superblock is found to be corrupt, because its page header or its self page reference cannot be verified or because offsets or length values in the superblock lie out of bounds, ReFS may fall back on using one of the two backup superblocks. If no valid superblock exists, the driver refuses to mount the file system. The only information contained in the superblock that the ReFS driver saves for further use are the GUID and two references to the current checkpoint state.


Byte Range    Description
0x00 - 0x50   Page Header (See table 4.4)
0x50 - 0x60   GUID (Is used to compute the volume signature)
0x68 - 0x70   Superblock version (Used to determine recency)
0x70 - 0x74   Offset to checkpoint references
0x74 - 0x78   Should be 0x2 (Number of checkpoint references?)
0x78 - 0x7c   Offset to the self-descriptor
0x7c - 0x80   Length of the self-descriptor
Checkpoint references
  0x00 - 0x10   Checkpoint references
    0x00 - 0x08   Checkpoint reference #1
    0x08 - 0x10   Checkpoint reference #2
Self-descriptor reference

Table 4.6: Structure of the superblock

• GUID: The GUID consists of four double words: x1, x2, x3, and x4. The GUID may be used to calculate the volume signature: vol_sig = x1 ⊕ x2 ⊕ x3 ⊕ x4. The volume signature is used to verify the headers of all pages. Once a ReFS volume has been formatted, its GUID and thus its volume signature never change. It is expected that all pages referenced within a volume hold the same volume signature.

• Checkpoint addresses: The superblock must always save two distinct references to the current checkpoint state so that the Copy-On-Write mechanism works properly. This necessity is explained in section 4.1.3. The superblock merely contains the addresses of the checkpoints and does not store their checksums. The reason for this is that the checkpoints are altered constantly; thus, their checksums would have to be adjusted very often. The superblock, however, has to serve as an anchor and is by design a structure that stays constant. That is why the checkpoint, similar to the superblock, stores a checksum that describes itself. The first checkpoint structure is usually located at roughly 1% of the volume size. The second checkpoint structure can be found at 12% of the volume size. These values were determined by creating multiple differently sized ReFS file systems ranging from 2 GiB to 4 TiB.

Example Figure 4.26 visualizes an exemplary superblock structure according to its definition in table 4.6. The GUID in the superblock may be separated into the double words 0xcaeec5c4, 0x4b6ffa52, 0xb3b9d4a4, and 0x5338584b. The volume signature may be calculated from the GUID, as shown in figure 4.27:

0xcaeec5c4 ⊕ 0x4b6ffa52 ⊕ 0xb3b9d4a4 ⊕ 0x5338584b = 0x6100b379

Figure 4.27: Calculation of a volume signature

The resulting volume signature 0x6100b379 must be contained in every page header that is referenced within the volume. The recency value found in the superblock is usually equal within all copies of it (0x1). The ReFS driver picks the first superblock with the highest recency value and, because of that, may virtually always use the first superblock it finds. Within the shown superblock, the checkpoint addresses start at the offset 0xc0. The self-descriptor of the superblock starts at the offset 0xd0 and is 0x68 bytes long. The checkpoint structures are located at the clusters 0xba8 and 0x8a5c. From the previously read boot sector that was shown in figure 4.23, it is known that the volume contains 0x48000 clusters. These values are consistent with the claim that the checkpoints are stored at 1% and 12% of the entire volume size. The page reference that refers to the superblock may be examined in the same way as the page references that were explained previously. To calculate the checksum of the superblock, one would have to zero the bytes 0xd0-0x138 of it and then perform the calculation of a CRC32-C checksum.


0001e000 [PAGE HEADER]
0001e040 [PAGE HEADER]
0001e050 c4 c5 ee ca 52 fa 6f 4b a4 d4 b9 b3 4b 58 38 53
0001e060 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00
0001e070 c0 00 00 00 02 00 00 00 d0 00 00 00 68 00 00 00
0001e080 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0001e090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0001e0a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0001e0b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0001e0c0 a8 0b 00 00 00 00 00 00 5c 8a 00 00 00 00 00 00
0001e0d0 1e 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0001e0e0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0001e0f0 00 00 01 08 04 00 00 00 b5 14 33 c1 00 00 00 00
0001e100 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0001e110 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0001e120 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0001e130 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Figure 4.26: Example layout of a superblock structure, as explained in table 4.6
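The calculation of the volume signature can be reproduced with a few lines of Python; the GUID bytes are taken from figure 4.26 and the expected signature from the page header in figure 4.24.

# Sketch: deriving the volume signature from the GUID of the superblock.

import struct

def volume_signature(guid: bytes) -> int:
    # XOR of the four 32-bit little-endian words of the GUID.
    x1, x2, x3, x4 = struct.unpack("<4I", guid)
    return x1 ^ x2 ^ x3 ^ x4

guid = bytes.fromhex("c4c5eeca52fa6f4ba4d4b9b34b583853")  # taken from figure 4.26
assert volume_signature(guid) == 0x6100b379               # matches figure 4.24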

4.3.5 Checkpoint
The checkpoint (short CHKP) is a page that references the entire state of a file system. It contains some regular values as well as 13 references to the most important tables of a ReFS file system. Since tables under ReFS are implemented as a variant of B+-trees, these table references are also termed "global root nodes". We derived the name of this structure from various functions in the refs.sys driver as well as from the signature at its beginning: 0x43484b50 (CHKP).

• Green [24, p. 14] calls this data structure MOMB: Master Offset Metadata Block
• Georges [22, p. 34, 61-62] calls this data structure $TREE_CONTROL
• Metz [32, p. 10] calls it Level 1 metadata block

The checkpoint has a variable location and can be found by looking up an offset in the superblock. The superblock contains two references to checkpoint structures that form the current checkpoint state. A checkpoint structure is updated in place and does not use a Copy-On-Write policy. As explained in section 4.1.3, it is crucial to write the checkpoint structures alternately to uphold the properties of the Copy-On-Write policy that is used in the pages referenced by the checkpoint. The driver will always first try to pick the checkpoint structure that was written last and thereby owns the highest version value. After a file system has been unmounted, this structure is usually the second referenced checkpoint. If the ReFS driver detects an unrecoverable checksum mismatch in the read checkpoint, it uses the other checkpoint. If the other checkpoint is also damaged, the driver refuses to mount the volume. A checkpoint structure under ReFS 3.4 contains 13 references to the most important tables. When ReFS was introduced, the checkpoint only referenced 6 tables. Tables that are new in ReFS 3.4 are a set of duplicate tables that are exact clones of important tables as well as the "Block Reference Count Table", the "Container Table", and the "Integrity State Table". To reference these tables, the checkpoint stores an array of pointers which point to different offsets within the checkpoint structure. At every such offset a table reference starts, which references the root node of a table. The entries in the array do not contain additional information on which table they reference. Instead, the identifier of the tables must be derived implicitly from the accessed offset in the pointer list. Aside from the references to tables, the checkpoint also contains various other values, such as the current log sequence number, which may be used to process the redo journal. After the ReFS driver has processed the checkpoint, it first has to parse the Container Table to establish the address mapping. Subsequently, it can interpret any further table structures such as the Object ID Table.


Byte Range    Description
0x00 - 0x50   Page header (See table 4.4)
0x50 - 0x54   Unknown
0x54 - 0x56   Major ReFS Version
0x56 - 0x58   Minor ReFS Version
0x58 - 0x5c   Offset to self-descriptor
0x5c - 0x60   Length of self-descriptor
0x60 - 0x68   Checkpoint virtual clock
0x68 - 0x70   Allocator virtual clock
0x70 - 0x78   Oldest log record reference
0x78 - 0x7c   Unknown
0x7c - 0x80   Alignment
0x80 - 0x88   Unused / Reserved
0x88 - 0x8c   Unknown buffer offset
0x8c - 0x90   Unknown buffer length
0x90 - 0x94   Pointer Count
0x94 - 0xc8   Pointer List
  0x94 - 0x98   Pointer to the Object ID Table reference
  0x98 - 0x9c   Pointer to the Medium Allocator Table reference
  0x9c - 0xa0   Pointer to the Container Allocator Table reference
  0xa0 - 0xa4   Pointer to the Schema Table reference
  0xa4 - 0xa8   Pointer to the Parent Child Table reference
  0xa8 - 0xac   Pointer to the Object ID Table duplicate reference
  0xac - 0xb0   Pointer to the Block Reference Count Table reference
  0xb0 - 0xb4   Pointer to the Container Table reference
  0xb4 - 0xb8   Pointer to the Container Table duplicate reference
  0xb8 - 0xbc   Pointer to the Schema Table duplicate reference
  0xbc - 0xc0   Pointer to the Container Index Table reference
  0xc0 - 0xc4   Pointer to the Integrity State Table reference
  0xc4 - 0xc8   Pointer to the Small Allocator Table reference
Self-descriptor reference
Table references

Table 4.7: Structure of the checkpoint

Example Figure 4.28 visualizes the start of an exemplary checkpoint structure according to its definition in table 4.7. To make the illustration more intelligible, only some fields in it were highlighted with different colors, whereas other, less relevant fields were merely underlined. The start of the checkpoint gives information on where its self-descriptor resides. The self-descriptor of the checkpoint may be used to verify its integrity. In figure 4.28, the self-descriptor starts at offset 0xd0 and is 0x68 bytes long. The checkpoint also includes the checkpoint virtual clock, which is updated whenever the checkpoint structure is rewritten. The value of the clock is 0x69. The value of the next log sequence number is 0x6b. Figure 4.29 shows an exemplary pointer list within a checkpoint. The interpretation of the different pointers is made according to the definition in table 4.7. Pointers to similar table references share the same colors. For example, all pointers to allocator table references were picked to be blue. When following the offsets found in the pointer list, one reaches the page references that point to the root nodes of the respective tables.


00ba8000 [PAGE HEADER]
00ba8010 [...... ]
00ba8040 [PAGE HEADER]
00ba8050 00 00 00 00 03 00 04 00 d0 00 00 00 68 00 00 00
00ba8060 69 00 00 00 00 00 00 00 49 00 00 00 00 00 00 00
00ba8070 6b 00 00 00 08 00 00 00 02 00 00 00 00 00 00 00
00ba8080 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Figure 4.28: Example layout of the first 144 bytes of a checkpoint structure, as explained in table 4.7

00ba8090 0d 00 00 00 38 01 00 00 a0 01 00 00 08 02 00 00
00ba80a0 70 02 00 00 d8 02 00 00 40 03 00 00 a8 03 00 00
00ba80b0 10 04 00 00 78 04 00 00 e0 04 00 00 48 05 00 00
00ba80c0 b0 05 00 00 18 06 00 00 00 00 00 00 00 00 00 00

Figure 4.29: Example layout of the pointer list found in a checkpoint structure, as explained in table 4.7
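Reading the pointer list could be sketched as follows; the list of table names simply mirrors the order given in table 4.7 for ReFS 3.4, and the helper name is ours.

# Sketch: pairing checkpoint pointer-list entries with the tables they imply.

import struct

CHECKPOINT_TABLES = [
    "Object ID Table", "Medium Allocator Table", "Container Allocator Table",
    "Schema Table", "Parent Child Table", "Object ID Table (duplicate)",
    "Block Reference Count Table", "Container Table", "Container Table (duplicate)",
    "Schema Table (duplicate)", "Container Index Table", "Integrity State Table",
    "Small Allocator Table",
]

def checkpoint_table_references(checkpoint_page: bytes) -> dict:
    count, = struct.unpack_from("<I", checkpoint_page, 0x90)
    offsets = struct.unpack_from("<%dI" % count, checkpoint_page, 0x94)
    # Each offset points to a page reference (table 4.5) inside the checkpoint
    # page that leads to the root node of the respective table.
    return dict(zip(CHECKPOINT_TABLES, offsets))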

4.3.6 Tables
Everything from directories and files to bitmaps as well as security descriptors or general information about the volume is stored in tables. All tables found under ReFS are built up generically; only the interpretation of their rows differs depending on the purpose of the table. Tables under ReFS are implemented as so-called Minstore B+-trees. Thus, it becomes most important to look at the internal structure of this specific B+-tree implementation. The root node of a Minstore B+-tree holds some general as well as some table-specific information. Inner nodes in a Minstore B+-tree store keys with associated page references that point to child nodes. Leaf nodes are used to store key-value pairs that form the rows of a table. Internally, a single node of a Minstore B+-tree is built from the following elements:

• Index Root: Is only used in the root node; however, it also exists as a filler struct in all non-root nodes. The root node contains the Index Root element that is used to store general information about a tree, such as the number of nodes from which it is formed as well as the number of elements that it stores. Furthermore, it also stores specific information about a table. Its structure is explained in table 4.8.

• Index Header: Is used to maintain organizational data of a node, such as its height or the number of free bytes it contains. The Index Header also references the start and the end of an array of indices, the so-called "Key Index", through which the entries stored in the node may be accessed in the sorting order of their keys. Furthermore, the Index Header references the start and the end of the Data Area, which is the place where the entries of a node reside. The structure of the Index Header is explained in table 4.9.

• Data Area: Is a cohesive region of memory that is used to store key-value pairs. In a leaf node, the Data Area is used to store keys and associated data. In an inner node, the Data Area is used to store keys together with links to child nodes. The term "Index Entry" or "Entry" is used to refer to a key-value pair in the data area regardless of whether it stores a link to a child node or actual data. The format of Index Entries is described in table 4.10. There also exists a slightly modified variant of Index Entries that is used to store references to data runs more efficiently.

• Key Index: The Key Index is an array of offsets that are relative to the start of the index header. These offsets point to the Index Entries that are stored in a node. The Key Index offers an ordered view on the entries of a node. This is achieved by reordering it whenever a new entry is inserted into a node. Even though the name suggests it, the Key Index does not store keys. The key and the value of an entry are stored together within the data area.

Figure 4.30 shows how these elements may interact and visualizes a node structure. The abbreviation ie_i is used for Index Entries, the abbreviation r_i is used for the references which can be found in the Key Index. All parts of the figure that are marked grey represent unused space within a node that was potentially previously filled with data. Recovering files, directories, or file system metadata under ReFS comes down to recovering index entries from these components of a node and is discussed in more detail in section 4.4.

[Figure content: a node consists of the Index Root, the Index Header, the Data Area containing the index entries ie_1, ie_2, ..., slack space, and the Key Index containing the references r_1, r_2, ...; a dangling reference may still point into the slack space.]

Figure 4.30: Overview of the components of a Minstore B+ node

Byte Range    Description
Root node
  0x00 - 0x04   Size
  0x04 - 0x06   Size of the fixed component of the index root (Should always be 0x28)
  0x06 - 0x08   Unknown
  0x08 - 0x0a   Unknown (Checksum type related)
  0x0a - 0x0c   Unknown (Unused ?)
  0x0c - 0x0e   Schema of the table
  0x0e - 0x10   Unknown (Unused ?)
  0x10 - 0x12   Schema of the table
  0x12 - 0x14   Unknown (Unused ?)
  0x14 - 0x16   Unknown (Checksum type related)
  0x16 - 0x18   Unknown (Checksum type related)
  0x18 - 0x20   Number of extents in the table
  0x20 - 0x28   Number of rows in the table
  0x28 - ....   Variable component
Non-root node
  0x00 - 0x04   Size (Should always be 0x8)
  0x04 - 0x08   Alignment

Table 4.8: Structure of the index root element

Every Minstore B+-tree possesses a table identifier as well as a schema.

Table Identifier: All nodes in a Minstore B+-tree store an identifier that indicates to which tree they belong. The identifier is located in the header of any page. Embedded tables inherit the table identifier of the table in which they are contained. They can be uniquely identified by the table identifier of their outermost parent table combined with the key sequence that was used to locate them.

Table Schema: Additionally, every table has a schema associated with it. There exists one table that associates schema identifiers (key) with the attributes of a table (value). Georges [22, p. 41] called this table $ATTRIBUTE_LIST but did not analyze it any further in his work. The current ReFS driver refers to the table as "Schema Table". Schemas define the data type of the key used in a table. They also provide flags and size information for a table (e.g., how many bytes the Index Root element consumes).

Example In the following example, the structure of a single Minstore B+-tree node is dissected based on the definitions that were given previously. Figure 4.31 starts with the definition of the index root element.


Byte Range    Description
0x00 - 0x04   Offset to the start of the data area
0x04 - 0x08   Offset to the end of the data area (New index entries are inserted at this offset)
0x08 - 0x0c   Free bytes in the node
0x0c - 0x0d   Height of the node
0x0d - 0x0e   Flags (0x1: Inner Node, 0x2: Root Node, 0x4: Stream Node)
0x10 - 0x14   Start of the key index
0x14 - 0x18   Number of entries in the key index
0x18 - 0x20   Unused
0x20 - 0x24   End of the key index
0x24 - 0x28   Alignment

Table 4.9: Structure of the index header

Byte Range    Description
0x00 - 0x04   Index entry length
0x04 - 0x06   Offset to the start of the key
0x06 - 0x08   Length of the key
0x08 - 0x0a   Flags (0x2: Rightmost extent in a subtree, 0x4: Entry was deleted, 0x40: Stream index entry)
0x0a - 0x0c   Offset to the start of the value
0x0c - 0x0e   Length of the value

Table 4.10: Structure of the header of a regular index entry

The given index root element is 0x70 bytes long. Its fixed component is 0x28 bytes large; the variable component is 0x70 - 0x28 = 0x48 bytes large but is completely zeroed out. The table uses the schema with the identifier 0xe090. The table is formed by 0xb (11) nodes and stores 0x183 (387) rows. The next element found in a Minstore B+ node is the Index Header. All offsets that an Index Header stores must be interpreted relative to its start. The Index Header shown in figure 4.32 starts at the offset 0xc0. The data area of the node spans from 0xc0 + 0x28 = 0xe8 to 0xc0 + 0x68 = 0x128; its key index spans from 0xc0 + 0x8c = 0x14c to 0xc0 + 0x90 = 0x150. The tree that is formed by the node has a height of 0x2. The flag in the node is set to 0x3. This flag contains the masks "inner node" (0x1) and "root node" (0x2). 0x24 bytes of data in the node are free. The node merely contains a single entry. With this information, the key index of the node may be located. It is shown in figure 4.33.

0004c0c0 28 00 00 00 68 00 00 00 24 00 00 00 02 03 00 00
0004c0d0 8c 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00
0004c0e0 90 00 00 00 00 00 00 00

Figure 4.32: Example layout of an Index Header structure, according to table 4.9

0004c140 28 00 ff ff

Figure 4.33: Example layout of a key index structure

Just as specified by the previously shown Index Header, the Key Index only encompasses a single entry.


0004c050 70 00 00 00 28 00 00 00 01 00 00 00 90 e0 00 00
0004c060 90 e0 00 00 02 00 00 00 0b 00 00 00 00 00 00 00
0004c070 83 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0004c080 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0004c090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0004c0a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0004c0b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Figure 4.31: Example layout of an index root structure, according to table 4.8

The offsets within the Key Index are stored as 4-byte values, even though as of now 2 bytes are always sufficient to address all data in a page. Maybe because of that, the two most significant bytes of entries in the Key Index are always set to 0xffff. To correctly interpret an entry in the Key Index, one has to remove these bytes: 0xffff0028 & 0x0000ffff = 0x28. With this information, we know that the only index entry within this node is located at offset 0x28 relative to the Index Header: 0xc0 + 0x28 = 0xe8. The contents of this Index Entry are shown in figure 4.34. Offsets found within an Index Entry always refer to its start. The entry is 0x40 bytes large; its key would start at offset 0x10, but as the key length is specified to be 0x0, the entry does not use a key. The flag 0x2 indicates that the entry is the last Index Entry in an inner node. The last entry within an inner node of a B+-tree does not need to store a key. The data associated with the Index Entry starts at the offset 0x10 and is 0x30 bytes long. Since the node is an inner node, the data represents a link to a child node. The link is implemented through a page reference structure.

0005c0e0 40 00 00 00 10 00 00 00
0005c0f0 02 00 10 00 30 00 00 00 60 00 01 00 00 00 00 00
0005c100 61 00 01 00 00 00 00 00 62 00 01 00 00 00 00 00
0005c110 63 00 01 00 00 00 00 00 00 00 02 08 08 00 00 00
0005c120 f7 63 73 c7 e5 30 06 3c

Figure 4.34: Example layout of an index entry structure, according to table 4.10
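The steps walked through in this example can be condensed into a small sketch that enumerates the rows of a single node; the offsets follow tables 4.9 and 4.10, and the function name is ours.

# Sketch: enumerating the index entries of a Minstore B+ node. `node` is the
# raw page and `hdr` the offset of the Index Header (right after the Index Root).

import struct

def iter_index_entries(node: bytes, hdr: int):
    key_index_start, key_count = struct.unpack_from("<II", node, hdr + 0x10)
    for i in range(key_count):
        slot, = struct.unpack_from("<I", node, hdr + key_index_start + 4 * i)
        entry = hdr + (slot & 0xffff)          # the upper 16 bits are always 0xffff
        length, key_off, key_len, flags, val_off, val_len = \
            struct.unpack_from("<IHHHHH", node, entry)
        key = node[entry + key_off:entry + key_off + key_len]
        value = node[entry + val_off:entry + val_off + val_len]
        yield flags, key, value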

To proceed in parsing this tree, one would have to follow the found page reference and parse the referenced node in a similar fashion. This parsing process continues until every child node has been visited and all key-value pairs have been processed. With this information, one can extract all key-value pairs out of a single table. The following subsections look at the different tables that exist in ReFS. Their structure is explained by deciphering their usage of the variable index root component as well as by looking at the key-value pairs they contain.

4.3.7 Schema Table
The purpose of the Schema Table was explained in section 4.1.4. The rows of the Schema Table store the definitions of different table schemas. A key in a row of the Schema Table stores a schema identifier; the data in such a row contains the definition of the respective schema. The schema definitions contain various fields that were laborious to analyze, as they were rarely used in the driver and only had little significance for the analysis. Many of those values were omitted in the analysis and must be examined in further work.
Variable Index Root Element: Unused
Rows: See table 4.11

4.3.8 Object Identifier Table
The purpose of the Object ID Tables was explained in section 4.1.4. The rows within an Object ID Table link to further tables that are not referenced by the checkpoint. A key in a row of the Object ID Table stores a table identifier; the data in such a row contains information about the respective table as well as a page reference that may be used to access it.
Variable Index Root Element: Unused


Byte Range   Description                             Key         Data
0x00:0x04    Schema ID                               0x00:0x04   -
0x08:0x0c    Complete size of the schema             -           0x00:0x04
0x0c:0x10    Offset to the start of the schema       -           0x04:0x08
0x10:0x18    Unused                                  -           0x08:0x10
0x18:0x20    Unused                                  -           0x10:0x18
0x20:0x24    Length of the schema                    -           0x18:0x1c
0x24:0x28    Collation (Data type of the used key)   -           0x1c:0x20
0x28:0x2a    Unknown Value                           -           0x20:0x22
0x2a:0x2c    Alignment                               -           0x22:0x24
0x2c:0x30    Unknown Value                           -           0x24:0x28
0x30:0x34    Unknown Value                           -           0x28:0x2c
0x34:0x38    Unknown Value                           -           0x2c:0x30
0x38:0x3c    Complete size of the root node          -           0x30:0x34
0x3c:0x40    Index Root Size                         -           0x34:0x38
0x40:0x44    Unknown Value                           -           0x38:0x3c
0x44:0x48    Page Size                               -           0x3c:0x40
0x48:0x4c    Unknown Value                           -           0x40:0x44
0x4c:0x50    Unknown Value                           -           0x44:0x48
0x50:0x54    Unknown Value                           -           0x48:0x4c
0x54:0x58    Unknown Value                           -           0x4c:0x50

Table 4.11: Rows found in the Schema Table

Rows: See table 4.12

Byte Range   Description             Key         Data
0x00:0x10    Table ID                0x00:0x10   -
0x10:0x11    Unknown (Always 0x2?)   -           0x00:0x01
0x18:0x1c    Offset to durable LSN   -           0x08:0x0c
0x1c:0x20    Unknown                 -           0x0c:0x10
0x20:0x24    Buffer offset           -           0x10:0x14
0x24:0x28    Buffer length           -           0x10:0x14
0x28:0x30    Durable LSN             -           0x18:0x20
0x30:0xd8    Page reference          -           0x20:0xc8
0xd8:....                            -           0xc8:...

Table 4.12: Rows found in the Object ID Table

4.3.9 Allocator Table
The purpose of Allocator Tables was explained in section 4.2.2. The rows within an allocator represent the allocation status of cohesive clusters. A key in a row of an Allocator Table stores a cluster range; the data in such a row is a bitmap that describes the allocation status of the clusters in the range. It is also possible to compactly store the allocation status of large unused memory regions without using bitmaps. For this purpose, allocation regions may be marked as sparse.
Variable Index Root Element: See table 4.13
Rows: See table 4.14. If the allocation region is not sparse, a bitmap as shown in table 4.15 follows.


Byte Range     Description
0x000:0x008    Unknown
0x008:0x010    Total clusters in allocator
0x010:0x018    Used clusters in allocator
0x018:0x020    Unknown
0x020:0x028    Metadata clusters in allocator
0x028:0x030    Unknown
0x030:0x038    Unknown

Table 4.13: Variable index root element used in Allocator Tables

Byte Range   Description                    Key         Data
0x00:0x08    First cluster in bitmap        0x00:0x08   -
0x08:0x10    Number of clusters in bitmap   0x08:0x10   -
0x10:0x12    Free clusters in bitmap        -           0x00:0x02
0x12:0x14    Flag (0x1: ???, 0x2: Sparse)   -           0x02:0x04
0x14:0x15    Unknown                        -           0x04:0x05
0x15:0x16    Alignment                      -           0x05:0x06
0x16:0x18    Bitmap flags                   -           0x06:0x08

Table 4.14: Rows found in Allocator Tables

Byte Range   Description                         Key   Data
0x18:0x19    Allocation status of cluster 1-8    -     0x08:0x09
0x19:0x1a    Allocation status of cluster 9-16   -     0x09:0x0a
0x1a:....    ...                                 ...   ...

Table 4.15: Bitmap structure used in Allocator Tables
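Evaluating such a row for a specific cluster could be sketched as follows; note that the bit order inside a bitmap byte is an assumption (least significant bit first) that we did not verify, and the function name is ours.

# Sketch: checking the allocation state of a cluster from an Allocator Table row
# (tables 4.14 and 4.15). `key` and `value` are the raw row buffers.

import struct

def cluster_allocated(key: bytes, value: bytes, cluster: int) -> bool:
    first_cluster, cluster_count = struct.unpack_from("<QQ", key, 0x00)
    if not first_cluster <= cluster < first_cluster + cluster_count:
        raise ValueError("cluster not covered by this row")
    flags, = struct.unpack_from("<H", value, 0x02)
    if flags & 0x2:                       # sparse region: no bitmap follows
        return False
    index = cluster - first_cluster
    bitmap_byte = value[0x08 + index // 8]
    return bool(bitmap_byte & (1 << (index % 8)))  # assumed LSB-first bit order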

4.3.10 Container Table
The purpose of the Container Table was explained in section 4.2.2. Every row in the Container Table refers to a single container. The key in a row specifies the identifier of a container; the data in such a row stores its physical address, the number of used clusters in the container, as well as metadata to track its heat.
Variable Index Root Element: Unused
Rows: See table 4.16

4.3.11 Parent-Child Table
The purpose of the Parent-Child Table was explained in section 4.2.1. Every row in the Parent-Child Table establishes a relationship between two tables. One of the tables involved in the relationship is considered a parent table, the other a child table. The key and the value in a row of the Parent-Child Table are equal. They both comprise the identifier of the parent table as well as the identifier of the child table.
Variable Index Root Element: Unused
Rows: See table 4.17

4.3.12 Upcase Table
The purpose of the Upcase Table was explained in section 4.2.1. The structure of the rows found in the Upcase Table provides a flexible way to store data. The used keys are sequential 32-bit integers. The values store arbitrary data that is reassembled in the order of the keys.
Variable Index Root Element: Unused


Byte Range   Description                           Key         Data
0x00:0x08    Band ID                               0x00:0x08   0x00:0x08
0x08:0x0c    Unknown                               0x08:0x0c   0x08:0x0c
0x0c:0x10    Unknown                               0x0c:0x10   0x0c:0x10
0x10:0x11    Flag                                  -           0x10:0x11
0x20:0x28    Free clusters in container            -           0x20:0x28
Cluster size: 4k
0x90:0x98    Physical start LCN of the container   -           0x90:0x98
0x98:0xa0    Number of clusters in the container   -           0x98:0xa0
Cluster size: 64k
0xd0:0xd8    Physical start LCN of the container   -           0xd0:0xd8
0xd8:0xe0    Number of clusters in the container   -           0xd8:0xe0

Table 4.16: Rows found in the Container Table

Byte Range   Description                 Key         Data
0x00:0x10    Parent (Table Identifier)   0x00:0x10   0x00:0x10
0x10:0x20    Child (Table Identifier)    0x10:0x20   0x10:0x20

Table 4.17: Rows found in the Parent Child Table

Rows: See table 4.18
• Key: 0, Value: “Upcase Table”
• Key: 1, Value: 0x10000 (Number of entries in the Upcase Array)
• Key: 2, Value: First 0x155 bytes of the Upcase Array
• Key: 3, Value: Next 0x155 bytes of the Upcase Array
• Key: 386, Value: Last 0x80 bytes of the Upcase Array

Byte Range   Description       Key         Data
0x00:0x04    Sequence number   0x00:0x04   -
0x08:0x..    Raw data          -           0x00:...

Table 4.18: Rows found in the Upcase Table

4.3.13 Logfile Information Table
The purpose of the Logfile Information Table was explained in section 4.2.5. Internally, the Logfile Information Table is constructed similarly to the Upcase Table, as its rows merely specify the order in which chunks of data have to be reassembled.
Variable Index Root Element: Unused
Rows:

• Key: 0, Value: “Logfile Information Table”
• Key: 1, Value: 0x30 bytes of data that are further specified in table 4.19


Byte Range   Description
0x00:0x08    Logging area start
0x08:0x10    Logging area end
0x10:0x18    Logging area size
0x18:0x20    Restart area
0x20:0x28    Restart area copy
0x28:0x30    Unknown

Table 4.19: Structure of the buffer found in the Logfile Information Table

4.3.14 Volume Information Table
The purpose of the Volume Information Table was explained in section 4.2.1. The rows stored in the Volume Information Table contain general information about the volume. The key in a row is a 64-bit value. The format of the data in a row depends on the value of the key.
Variable Index Root Element: Unused
Rows:

• Volume Label, see table 4.20
• General information, see table 4.21
• General information backup block, see table 4.22

Byte Range   Description                          Key         Data
0x00:0x08    Entry identifier: 0x510              0x00:0x08   -
0x08:0x..    UTF-16 string (Name of the volume)   -           0x00:..

Table 4.20: Structure of the “Volume Label” row in the Volume Information Table

Byte Range   Description                      Key         Data
0x00:0x08    Entry identifier: 0x510          0x00:0x08   -
0x88:0x8a    ReFS version at creation         -           0x80:0x82
0x8a:0x8c    ReFS version during last mount   -           0x82:0x84
0x8c:0x90    File system flags (Unknown)      -           0x84:0x88
0x98:0xa0    FS last time mounted             -           0x90:0x98
0xa8:0xb0    FS creation time                 -           0xa0:0xa8

Table 4.21: Structure of the “General Information” row in the Volume Information Table

4.3.15 Security Table
The purpose of the Security Table was explained in section 4.2.3. The keys found in the rows of the Security Table contain security IDs. The values hold security descriptors. In this way, the Security Table forms an index that one can use to search for security descriptors by their security identifiers. The format of security descriptors is well known and publicly described by Microsoft. The structure of so-called System Access Control Lists (SACL) and Discretionary Access Control Lists (DACL) as well as the SIDs that are used in security descriptors is not new to ReFS and is publicly documented; thus, their structure definition is omitted here.
Variable Index Root Element: Unused
Rows: See table 4.23


Byte Range   Description                            Key         Data
0x00:0x08    Entry identifier: 0x540                0x00:0x08   -
0x08:0x10    Cluster of verify block                -           0x00:0x08
0x10:0x14    Cluster count (size) of verify block   -           0x08:0x0c

Table 4.22: Structure of the “General Information backup block” row in the Volume Information Table

Byte Range   Description                          Key         Data
0x00:0x04    Data length                          0x00:0x04   -
0x04:0x08    Unused                               0x04:0x08   -
0x08:0x0c    Always 0x1?                          0x08:0x0c   -
0x0c:0x10    Security ID                          0x0c:0x10   -
0x10:0x14    Security ID                          -           0x00:0x04
0x14:0x18    Always 0x1?                          -           0x04:0x08
0x18:0x1c    Data length                          -           0x08:0x0c
0x1c:0x....  Security Descriptor
0x1c:0x1d    Revision                             -           0x0c:0x0d
0x1d:0x1e    Sbz1                                 -           0x0d:0x0e
0x1e:0x20    Control                              -           0x0e:0x10
Offsets are relative to the start of the security descriptor
The value 0 implies that no such object is referenced
0x20:0x24    Offset to the Owner SID (Required)   -           0x10:0x14
0x24:0x28    Offset to the Group SID (Required)   -           0x14:0x18
0x28:0x2c    Offset to the SACL (Optional)        -           0x18:0x1c
0x2c:0x30    Offset to the DACL (Optional)        -           0x1c:0x20
0x30:....

Table 4.23: Rows found in the Security Table

4.3.16 Reparse Index Table

The purpose of the Reparse Index Table was explained in section 4.2.4. The entries in the Reparse Index Table only use the key of a row and not its value component. A single key contains the reparse tag as well as the metadata address of a file or directory that stores the respective reparse point.
Variable Index Root Element: Unused
Rows: See table 4.24

Byte Range   Description            Key         Data
0x00:0x04    0x80000001             0x00:0x04   -
0x04:0x08    Reparse tag value      0x04:0x08   -
0x08:0x10    File identifier        0x08:0x10   -
0x10:0x18    Directory identifier   0x10:0x18   -

Table 4.24: Rows found in the Reparse Index Table

4.3.17 Directory Table

The purpose of the Directory Table was illustrated in section 4.1.4. The rows in a directory table represent the contents of a directory. There exist four different types of rows that are stored in directory tables. One row type represents the directory descriptor which contains information about the directory itself. The other three types are used to represent files, ID2 entries and directory links.
Variable Index Root Element: Unused
Rows:

• Directory Descriptor, see table 4.25
• File, see table 4.27
• ID2 Entry, see table 4.26
• Directory Link, see table 4.28

Byte Range    Description                                            Key         Data
0x000:0x004   Directory Descriptor Type (0x10)                       0x00:0x04   -
0x008:0x288   Root node of an embedded directory descriptor table    -           0x000:0x280

Table 4.25: Structure of the “Directory Descriptor” row type

Byte Range   Description                        Key         Data
0x00:0x04    ID2 Type (0x80000020)              0x00:0x04   -
0x04:0x08    Unused (0x0)                       0x04:0x08   -
0x08:0x10    File Identifier                    0x08:0x10   -
0x10:0x18    Unused (0x0)                       0x10:0x18   -
0x18:0x20    Type of the ID2 entry (1 or 2)     -           0x00:0x08
Type 1: File name of a metadata entry in the current directory table
0x20:0x22    Offset to the file name            -           0x08:0x0a
0x22:0x24    Length of the file name            -           0x0a:0x0c
0x24:......  File Name                          -           0x0c:...
Type 2: Metadata address of a metadata entry in a different directory table
0x20:0x28    Directory identifier of the file   -           0x08:0x10
0x28:0x30    File identifier of the file        -           0x10:0x18

Table 4.26: Structure of the “ID2” row type

Byte Range      Description                           Key           Data
0x00:0x04       File Type (0x00010030)                0x00:0x04     -
0x04:......     File Name                             0x04:......   -
......:......   Root node of an embedded file table   -             0x000:0x280

Table 4.27: Structure of the “File” row type


Byte Range      Description                          Key           Data
0x00:0x04       Directory Type (0x00020030)          0x00:0x04     -
0x04:......     Directory Name                       0x04:......   -
......:......   Unused (0x0)                         -             0x00:0x08
......:......   Directory Identifier                 -             0x08:0x10
......:......   Creation Time                        -             0x10:0x18
......:......   Modification Time                    -             0x18:0x20
......:......   Metadata Modification Time           -             0x20:0x28
......:......   Access Time                          -             0x28:0x30
......:......   Rounded Up Size (0x0)                -             0x30:0x38
......:......   Actual Size (0x0)                    -             0x38:0x40
......:......   Flags (Always includes 0x10000000)   -             0x40:0x44
......:......   Reparse Tag                          -             0x44:0x48

Table 4.28: Structure of the “Directory Link” row type

4.3.18 Directory Descriptor Table

The purpose of the Directory Descriptor Table was motivated in section 4.1.4. Directory Descriptor Tables are one of the few tables that make use of their variable-sized index root area to store table-specific information. In Directory Descriptor Tables, this information contains general properties of the directory that is represented by the table. The rows of a Directory Descriptor Table are used to store attributes. The structure of attributes is explained separately in section 4.3.20, as files also use attributes.
Variable Index Root Element: See table 4.29

Byte Range   Description
0x00:0x08    Creation Time
0x08:0x10    Modification Time
0x10:0x18    Metadata Modification Time
0x18:0x20    Access Time
0x20:0x24    Flags
0x24:0x28    Unknown (0x0 or 0x1)
0x28:0x30    Security Descriptor ID
0x30:0x38    Size
0x38:0x40    Allocated Size
0x40:0x48    Offset in the USN Journal
0x48:0x50    Identifier of the USN Journal
0x50:0x54    Reparse Tag
0x54:0x58    Lower two bytes of the reparse tag
0x58:0x60    Maximum file identifier in directory
0x60:0x68    Unused (Used by file tables only)
0x68:0x70    Unused (Used by file tables only)
0x70:0x72    Unknown (0x1)
0x72:0x74    Unknown (Uninitialized)
0x74:0x7c    Unknown (Never written)

Table 4.29: Variable index root element used in Directory Descriptor tables

4.3.19 File Table

The purpose of the File Table was motivated in section 4.1.4. File Tables use a similar format to Directory Descriptor Tables to store general information in their variable index root area. Just like Directory Descriptor Tables, their rows are also attributes which are explained in section 4.3.20.


Variable Index Root Element: See table 4.30

Byte Range   Description
0x00:0x08    Creation Time
0x08:0x10    Modification Time
0x10:0x18    Metadata Modification Time
0x18:0x20    Access Time
0x20:0x24    Flags
0x24:0x28    Unknown (0x0 or 0x1)
0x28:0x30    Security Descriptor ID
0x30:0x38    Size
0x38:0x40    Allocated Size
0x40:0x48    Offset in the USN Journal
0x48:0x50    Identifier of the USN Journal
0x50:0x54    Lower two bytes of the reparse tag
0x54:0x58    Reparse Tag
0x58:0x60    Maximum file identifier in directory
0x60:0x68    Old file identifier
0x68:0x70    Old directory identifier
0x70:0x72    Unknown (0x1)
0x72:0x74    Unknown (Uninitialized)
0x74:0x7c    Unknown (Never written)

Table 4.30: Variable index root element used in File tables

4.3.20 Attribute Types

There exist two different kinds of attributes: attributes that store a linear chunk of data (standalone) and attributes that are mapped as embedded tree structures (complex). The key of every attribute row starts with a header that contains the type of the attribute. To distinguish attributes of the same type, their rows are additionally tagged with a name. If the data of a standalone attribute gets too large, it is split into multiple rows. The header of a standalone attribute also stores information that may be used to reassemble it. The structure of the key of an attribute is shown in table 4.31. The value of standalone attribute types starts with a header shown in table 4.32. Complex attribute types do not require a header but store a complete root node in their value instead. In the following tables, the structure of all attribute types that we could identify is explained (a struct sketch of the common key and value headers follows the list):

• $DIR_LINK: See table 4.33
• $DATA: Is an embedded table, see section 4.3.21
• $INDEX_ROOT: See table 4.34
• $NAMED_DATA: See table 4.35
• $REPARSE_POINT: See table 4.36
• $USN_INFO: See table 4.37
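As a minimal illustration of the headers just described, the following struct overlays sketch the attribute key from table 4.31 and the value header of a standalone attribute from table 4.32. The struct and field names are our own; the byte ranges are taken directly from the tables. Rows that belong to the same split attribute share the attribute identifier and name and can be concatenated in the order given by the “offset in attribute” field.

typedef struct {
    uint8_t total_length[4];   /* 0x00:0x04 Total attribute length */
    uint8_t offset[4];         /* 0x04:0x08 Offset in attribute (used to reassemble split attributes) */
    uint8_t id[2];             /* 0x08:0x0a Attribute identifier */
    uint8_t unused[2];         /* 0x0a:0x0c Unused (0x0) */
    /* 0x0c:...  Attribute name follows */
} refs_attribute_key;

typedef struct {
    uint8_t unknown1[1];       /* 0x00:0x01 Unknown (0x0) */
    uint8_t align1[1];         /* 0x01:0x02 Alignment */
    uint8_t unknown2[2];       /* 0x02:0x04 Unknown */
    uint8_t data_length[4];    /* 0x04:0x08 Data length */
    uint8_t header_size[2];    /* 0x08:0x0a Attribute header size (0xc) */
    uint8_t unknown3[1];       /* 0x0a:0x0b Unknown */
    uint8_t align2[1];         /* 0x0b:0x0c Alignment */
} refs_standalone_attribute_header;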

4.3.21 Data Run Table

Unlike all other attributes, the $DATA attribute stores the contents of a root node as its value. This node forms a so-called Data Run Table. The Data Run Table organizes the clusters that are used to store the contents of a file. It maps virtual cluster numbers (VCNs), which provide relative cluster offsets within a file, to logical cluster numbers (LCNs), which describe the locations of clusters in the file system. If “integrity streams” are enabled, the Data Run Table also stores the checksums of the respective clusters.
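To make this mapping concrete, the following is a minimal sketch of how a VCN could be resolved to an LCN from the rows of the Data Run Table (table 4.38). The type and function names are our own, and we assume that the “start cluster relative in this run” field holds the first VCN covered by the run; the resulting LCN would additionally have to be run through the container-based address translation before it can be read from disk.

#include <stdint.h>
#include <stddef.h>

/* One data run as described by a row of the Data Run Table (table 4.38). */
typedef struct {
    uint64_t starting_lcn;   /* 0x00:0x08 Starting LCN */
    uint64_t start_vcn;      /* 0x0c:0x14 Start cluster relative in this run */
    uint32_t cluster_count;  /* 0x14:0x18 Number of clusters in this run */
} refs_data_run;

/* Sketch: resolve a VCN to an LCN. Returns 0 if the VCN is not covered by
 * any run (e.g., a sparse region). */
static uint64_t resolve_vcn(const refs_data_run *runs, size_t n, uint64_t vcn)
{
    for (size_t i = 0; i < n; i++) {
        if (vcn >= runs[i].start_vcn &&
            vcn < runs[i].start_vcn + runs[i].cluster_count)
            return runs[i].starting_lcn + (vcn - runs[i].start_vcn);
    }
    return 0;
}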


Byte Range   Description
0x00:0x04    Total attribute length
0x04:0x08    Offset in attribute (Required to reassemble split attributes)
0x08:0x0a    Attribute identifier
0x0a:0x0c    Unused (0x0)
0x0c:......  Attribute name

Table 4.31: Structure of the key in an attribute row

Byte Range   Description
0x00:0x01    Unknown (0x0)
0x01:0x02    Alignment
0x02:0x04    Unknown
0x04:0x08    Data length
0x08:0x0a    Attribute header size (0xc)
0x0a:0x0b    Unknown
0x0b:0x0c    Alignment

Table 4.32: Structure of the header of a standalone attribute

Variable Index Root Element: Unknown
Rows: See table 4.38

Byte Range   Description                          Key         Data
0x00:0x0c    Attribute key header                 0x00:0x0c   -
0x10:0x1c    Attribute data header                -           0x00:0x0c
0x1c:0x24    Parent file identifier (0x0)         -           0x0c:0x14
0x24:0x2c    Parent directory identifier          -           0x14:0x1c
0x2c:0x34    Creation Time                        -           0x1c:0x24
0x34:0x3c    Modification Time                    -           0x24:0x2c
0x3c:0x44    Metadata Modification Time           -           0x2c:0x34
0x44:0x4c    Access Time                          -           0x34:0x3c
0x4c:0x54    Size (0x0)                           -           0x3c:0x44
0x54:0x5c    Allocated Size (0x0)                 -           0x44:0x4c
0x5c:0x60    Flags (Always includes 0x10000000)   -           0x4c:0x50
0x60:0x64    Unknown                              -           0x4c:0x50
0x64:0x66    Directory Name Length                -           0x50:0x52
0x66:0x6a    Unknown                              -           0x52:0x56
0x6a:0x6e    Directory Type (0x00020030)          -           0x56:0x5a
0x6e:......  Directory Name                       -           0x5a:......

Table 4.33: Structure of the $DIR_LINK attribute type


Byte Range   Description                                     Key         Data
0x00:0x14    Attribute key header (Includes the name $I30)   0x00:0x14   -
0x18:0x24    Attribute data header                           -           0x00:0x0c
0x24:0x54    Unknown header                                  -           0x0c:0x3c
0x24:0x28    Length of the header                            -           0x0c:0x10
0x54:0xac    Unused                                          -           0x3c:0x94

Table 4.34: Structure of the $INDEX_ROOT attribute type

Byte Range      Description                                               Key           Data
0x00:.....      Attribute key header (Includes the name of the stream)    0x00:......   -
......:......   Attribute data header                                     -             0x00:0x0c
......:......   Unknown header                                            -             0x0c:0x3c
......:......   Length of the header                                      -             0x0c:0x10
......:......   Contents of the data stream                               -             0x3c:......

Table 4.35: Structure of the $NAMED_DATA attribute type

Byte Range   Description             Key         Data
0x00:0x0c    Attribute key header    0x00:0x0c   -
0x10:0x1c    Attribute data header   -           0x00:0x0c
0x1c:0x20    Reparse tag             -           0x0c:0x10
0x20:0x24    Reparse data length     -           0x10:0x14
0x24:......  Reparse data            -           0x14:......

Table 4.36: Structure of the $REPARSE_POINT attribute type

Byte Range   Description                     Key         Data
0x00:0x0c    Attribute key header            0x00:0x0c   -
0x10:0x1c    Attribute data header           -           0x00:0x0c
0x1c:0x24    Maximum Size                    -           0x0c:0x14
0x24:0x2c    Allocation Delta                -           0x14:0x1c
0x2c:0x34    Identifier of the USN Journal   -           0x1c:0x24
0x34:0x3c    Lowest Valid USN                -           0x24:0x2c

Table 4.37: Structure of the $USN_INFO attribute type

4.3.22 File System Journal

The purpose of the File System Journal was already explained in section 4.2.5. Blocks that are used to store journaling data are always 4 KiB large. We refer to blocks of the journal as “log pages”. To access the actual logging area into which all redo operations are written, one first has to parse the restart area. The structure of the restart area is explained in table 4.39.
After the restart area has been processed, one may locate the logging area. Log pages in the logging area are referenced through their log sequence number (LSN). The lower bytes of an LSN determine at which


Byte Range   Description
0x00:0x08    Starting LCN
0x08:0x0a    Flags
             0x0010 → Contains data (Not sparse)
             0x0040 → Stream Index Entry
             0x0080 → CRC32 checksum
             0x0100 → CRC64 checksum
0x0a:0x0c    Total length (len)
0x0c:0x14    Start cluster relative in this run
0x14:0x18    Number of clusters in this run
0x18:len     Cluster size of 4 KiB
0x18:0x1c    Checksum # 1: 4 KiB
0x1c:0x20    Checksum # 2: 4 KiB
....:len     Checksum # n: 4 KiB
0x18:len     Cluster size of 64 KiB
0x18:0x20    Checksum # 1.1: 16 KiB
0x20:0x28    Checksum # 1.2: 16 KiB
0x28:0x30    Checksum # 1.3: 16 KiB
0x30:0x38    Checksum # 1.4: 16 KiB
0x38:0x40    Checksum # 2.1: 16 KiB
....:len     Checksum # n.4: 16 KiB

Table 4.38: Rows found in the Data Run Table

offset within the logging area a respective log page is located. Log pages all share the same structure. The purpose of a log page in the logging area is to encapsulate the redo operations of a single transaction. Table 4.41 illustrates the structure of the redo information found in a single log page. Since transactions mostly consist of multiple individual actions, a single log page also stores multiple redo operations. As explained in section 4.2.5, the most critical parameters for a single redo operation are the redo id, the table path, and the data. The redo id determines which operation has to be re-done; the table path points to the table which is affected by the redo operation. The structure of table paths is explained in table 4.42. The data stored together with a redo operation is necessary to repeat the operation.
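The way a log page is located from an LSN can be illustrated with a small sketch. It assumes that the lower four bytes of the LSN are a plain 4 KiB block index counted from the start of the logging area (cf. table 4.40) and that the logging area boundaries from the Logfile Information Table (table 4.19) are byte offsets; both assumptions as well as all names are ours.

#include <stdint.h>

#define REFS_LOG_PAGE_SIZE 0x1000  /* log pages are always 4 KiB large */

/* Sketch: compute the byte offset of a log page in the logging area. */
static uint64_t log_page_offset(uint64_t lsn,
                                uint64_t logging_area_start,
                                uint64_t logging_area_end)
{
    uint64_t block_index = lsn & 0xffffffffULL;  /* lower 4 bytes of the LSN */
    uint64_t offset = logging_area_start + block_index * REFS_LOG_PAGE_SIZE;

    if (offset + REFS_LOG_PAGE_SIZE > logging_area_end)
        return (uint64_t) -1;                    /* outside of the logging area */
    return offset;
}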

4.4 ReFS Data Recovery

There exist two distinct approaches to recover deleted files: application-based and metadata-based analysis techniques. Application-based analysis techniques are out of the scope of a file system analysis. They are used to search data chunks which contain signatures that correspond to known file types. This technique is especially useful to recover file contents that have no metadata structure pointing to them. As this work is purely focused on an analysis of the file system layer, only metadata-based recovery techniques are discussed. Metadata-based recovery techniques are based on finding metadata structures of deleted files. If the metadata of a file still exists, its contents may be read just like the contents of an allocated file. The only difference is that the contents of the file may have already been reallocated and overwritten by a new file [12].

All metadata found under ReFS resides in pages that store nodes of B+-trees, more precisely nodes of directory tables. When a metadata entry is removed from a directory table, it is not overwritten or wiped directly. Instead, it is only removed from the index of the tree. This behavior makes it possible to recover deleted metadata entries from tree structures, which in turn may then be used to locate the contents of a file. It is also possible to find deleted metadata entries in older copies of pages that were created through the Copy-On-Write mechanism. If these older copies are, however, not referenced from anywhere in the file system, their metadata entries remain unseen. This problem is approached in the implementation of the carver that is explained in section 5.2. Since all pages used under ReFS start with a common header structure,


it is easy to carve for them. This section, however, only deals with how entries in nodes of tree structures can be restored. To comprehend how slack space, as well as copies of entries, are created, it is first necessary to look at the insertion and deletion process of Minstore B+-tree structures. As figure 4.30 is too complicated to illustrate the relations between multiple nodes in this section, a simplified version of the B+ nodes is used, which only consists of a data area and a key index. Yellow entries in a node represent existing referenced rows or pointers in the key index. Grey marked entries represent rows that were deleted and are unreferenced or pointers that do not belong to the key index anymore. Regular arrows are used to represent existing pointer structures. Dashed arrows are used to represent deleted pointer structures that can still be restored.

Figure 4.35: Example for the deletion of the entries e2 and e1

4.4.1 Operations in a single node

First, we want to look only at inserting entries at the node level. For this, we assume that if an entry is inserted into a tree structure, the node into which it is inserted contains enough space to store the entry. When an entry is inserted into a node, its header, its key, and its value, as described in table 4.10, are written into the data area of the node. Additionally, the key index of the node is extended by 4 bytes and the pointer list in the key index is adjusted so that it correctly sorts the entries in the data area by their keys.

The ReFS driver uses a “next available” allocation strategy to determine at which position of the data area new entries are inserted. A next available allocation strategy does not start its search for free space at the start of the data area. Instead, it performs the insertion at a marker that points to the end of the last inserted entry. Usually, the marker is not set back until the end of the data area is reached and the node may be reorganized. The marker in a node may also be set back if the last inserted entry is deleted. If entries before the last inserted entry are deleted, the marker is not set back. In the following illustrations, the marker for the insertion of new entries is represented as a red line.

In the following, multiple figures are used to illustrate the process of deleting entries in a Minstore B+ node. The data area in a node is used to store its actual entries, which are named ei, with i describing the position of the entry in the node. The node itself has to provide an ordered view of its stored entries. Since it would, however, be laborious to always reorder multiple entries in the data area, the key index is used to establish an order in the node. The key index is stored at the end of a node. It contains a pointer for each entry that resides in the node. If an entry is removed from the node, the key index is shrunk, and its pointers are reordered. If an entry is inserted into the node, its key index is enlarged towards the data area, a new pointer is added, and all existing pointers are reordered.

In figure 4.35, it is shown how the entries e2 and e3 are removed from a node. While the contents of the deleted entries remain in the data area, the pointers in the key index that reference them are overwritten. In the first step, the entry e2 is removed; as a result of that, the insertion marker is moved back. It also becomes necessary to reorder the key index. The key index is shortened; because of that, the pointer to the former first element is not part of the key index anymore but still points to it. After that step, the remaining two entries in the key index need to be reordered. Another interesting observation of the deletion process is that the marker that determines where to insert the next entry is continuously moved backward since every deletion removes the last entry in the data area. If a new entry were added to the node, the last deleted entry would be overwritten first.

Figure 4.36 provides another example similar to the previous illustration. Instead of the entries e2 and e3, now the entries e1 and e2 are removed from the node. In this scenario, all pointers that are removed from the key index remain usable. The reason for this is that the entries were removed in the order in which they were organized within the key index. Because of that, it is not required to


reorder the pointers in the key index. An additional interesting observation is that the insertion marker is only moved back when e2 is deleted. e1 remains in a data area that is not immediately overwritten when inserting a new entry. This is illustrated even more concisely in figure 4.37.

Figure 4.36: Example for the deletion of the entries e1 and e2

Figure 4.37: Example for the deletion of the entry e3

Even if the entry e3 is removed from the node as well, the insertion marker will not be set to a position before e1. The marker is only moved back when the most recently inserted entry is deleted. As e1, however, was deleted prior to e3, the insertion marker can never be set to a position before e1, except if the contents of the node are reorganized.

Figure 4.38: Example for the insertion of the entry e5

Figure 4.38 shows another example, which is the first to include an insertion operation. Within this scenario, the entry e5 is inserted into the node. Writing the data of the entry into the data area overwrites the start of the entry e2. Such scenarios pose problems for the recovery of entries. In some of the previous figures, it was possible to reach deleted entries by locating old pointers in the key index structure. If this was not possible, it was still possible to sequentially iterate over all entries in the node. Every entry starts with a length value that may be used to skip it. If all entries are stored one after another in the data area, their respective length values may be used to walk over all of them. This strategy, however, becomes infeasible once gaps of an unknown size occur between two entries, such as e5 and e4 in the figure. In this constellation, it becomes hard to determine where the deleted entry e4 starts. This, however, is a problem often faced by a tool that has to recover entries from a node.

Figure 4.39 shows a final insertion scenario. If the insertion marker is close to the end of the data area and a new entry, such as e4 in the example, does not fit into the remaining space anymore, the data area of the node must be reorganized. Reorganizing a node means moving all its referenced entries to the front of the data area. This process has a similar effect as the previous scenario since the reorganization process may also overwrite older entries in the data area of the node. Similar to the previous scenario, it is also possible that the reorganization of a node leaves a gap in the


data area of the node. This gap again makes it a complicated task to locate further entries after it since its length is unknown. The scenario also shows another interesting effect that may be provoked when a node is reorganized. After the reorganization of its data area, the entry e3 exists twice in the node. If the referenced e3 entry is altered, these changes are not reflected in the older unreferenced e3 entry. Being able to obtain this older entry allows an investigator to look at an older state of metadata. For this, it is, however, necessary to use a strategy that can detect as many deleted and unreferenced entries in a node as possible.

Figure 4.39: Example for the insertion of the entry e4

4.4.2 Operations in a tree

When analyzing the insertion and deletion behavior exposed by the Minstore B+-tree, it is not sufficient to exclusively look at what happens at a node level. It is also essential to analyze the interplay of multiple nodes, as changes in the formed tree structure also create and remove traces. There exist four different operations that are used to restructure the nodes found in a Minstore B+-tree. As mentioned at the beginning of this work, in ReFS no actual slack space at a cluster level exists, as clusters and pages are completely wiped when they are allocated. Thus, this work considers logical elements (index entries) in pages that are not referenced anymore but which still exist as slack space.

Split: If a non-root node has not enough space to store a new entry, a split operation occurs. To perform a split operation, a new node is allocated. This node becomes the right neighbor of the original node. Next, the entries in the data area of the original node are reorganized. This reorganization process also includes moving entries, depending on a split ratio, from the old left node to its new right-hand sibling. Depending on the type of the table into which data is inserted, a different split ratio may be picked. For directory tables, up to 90% of the data of the original node remain in it. Thus, only around 10% of its data are moved to the newly allocated node. Due to the reorganization operation within the original node, most of its slack space is overwritten. The newly allocated right-hand node does not contain slack space yet. Finally, the parent node of the subtree needs to store an additional entry that references the newly created sibling. If the parent node also lacks the capacity to store this new entry, it must also be split. The splitting process may ripple up towards the root node, in which a push operation would be performed.

Push: If the same requirements for a split operation are fulfilled in a root node, a so-called push operation is performed. In a push operation, the height of the tree increases. First, a new child node for the root node is allocated. The driver then tries to move all entries from the root node into that child node. In conventional assumptions of a B+-tree, this would never work since all nodes have the same size. If the entries do not fit into the root node, they will not fit into any other single node either. In Minstore B+-trees, however, the sizes of nodes may vary. Root nodes of embedded tables are usually significantly smaller than common nodes, and thus, a single child node can potentially store more entries than the root node. If the newly allocated child node is too small to store all the entries moved from the root node, a split operation is performed in it. The push operation generates slack space in the root node. All entries of the root node are shifted into its new child node but also remain unreferenced at their original location in the data area. One or two new links to child nodes are inserted into the root node, but most of its slack space is retained. Figure 4.40 shows how push operations may even lead to entire subtrees being still accessible from the root node. In conjunction with the Copy-On-Write paradigm, this becomes even more interesting. References in the slack space of a node are likely to point to older states of subtrees. It is also important to note that references found in this way always point to subtrees that have a lower height than the rest of the tree.
Sibling Merge: If the used capacity of two siblings in a Minstore B+-tree drops under a defined threshold (Less than 40% usage in both nodes) they are merged into a single node which is usually the left-hand node.


Figure 4.40: Example for the remaining traces after a push operation

This process completely discards the slack space of the right node, as it is not referenced anymore. Shifting many entries from the right node to the left node may also overwrite a significant amount of slack space of the left node. Additionally, the link to the right node is removed from the parent node. It is, however, still possible to restore this link and thus to find the entries and the slack space of the right node.

Pop: The final operation is the so-called pop operation, which is the opposite of the push operation. If it is possible to move all entries from the direct child nodes of a root node back into it, a pop operation may be performed. In a pop operation, the links to the direct child nodes of a root node are removed. Afterward, all entries from its former direct child nodes are moved into the root node. In this way, a pop operation decreases the height of a Minstore B+-tree. Since the child nodes of the root node are discarded, potential slack space is lost. Moving many entries into the root node additionally overwrites slack space in it.

4.4.3 Concepts to restore entries from tree structures

The various examples in section 4.4.1 have shown that there exists no definite strategy to recover entries from Minstore B+ nodes. It is not sufficient to rely on recovering pointers of the key index structure, as the driver reorders the key index whenever an entry is added to the node. Which pointer structures are overwritten strongly depends on the order in which files and directories are removed. It is also not sufficient to iterate over the blocks in the data area since there might exist gaps between them.

One simple but very effective strategy that remains is to scan the entire data area of a node for indicators of references to child nodes or rows. The structure of the headers of index entries is described in table 4.10. This structure defines some implicit constraints. The first field, the “length of an index entry”, must be at least 0x10 bytes large, as this is the size of the header structure. The offsets of the succeeding key and data buffers must also contain a value equal to or greater than 0x10. It is also required that neither the key nor the value exceeds the boundaries of the index entry. Thus, a primitive but effective method of restoring entries is to search the data area for memory chunks that fulfill the constraints an index entry has to fulfill (a sketch of such a scan follows the list below). The implementation of the recovery tool is a little more specialized. Many rows employ different constant identifiers associated with their keys. Every file table under ReFS, for example, starts with the identifier 0x00010030. One can easily search the data area for this pattern. After such a pattern has been found, one has to conduct various sanity checks on the structures around it to make sure that the found entry is not a false positive.

The implementation allows defining different data area carving strategies that may be selected when a tree is traversed. We focused on developing strategies that could recover entries from the Object ID Table as well as the Directory Table. These tables contain information that is necessary to recover files as well as directory structures and thus are most relevant for a forensic analysis of a file system. Figure 4.41 summarizes the effectiveness of different approaches to restore table entries. There exist three different ideas that are explained in the following:

• Using no recovery techniques at all: This is the representation that is provided by an implementation of ReFS that merely follows its specification. The ReFS driver used in Windows, for example, implements this behavior. With this approach, only a single directory table would be found in figure 4.41.

• Use various techniques to recover files from the data area of a node: If one uses methods that recover entries from nodes, one might even find deleted directory tables as well as deleted file tables. The TSK implementation, for example, shows this behavior. This approach, however, is still incomplete, as there might exist some nodes that are not referenced from any other node.


• Search the entire disk for node structures and recover all entries in them: To get a complete overview of all files and directories that can be recovered from a ReFS file system, it is necessary to scan an entire disk for such structures. The carver application that was developed in this work implements this behavior. On top of locating all nodes that exist in the file system, this approach also needs to extract entries from the slack space of these nodes.

Figure 4.41 also highlights that every node and even every page starts with a specific header. This header offers a great opportunity when it comes to identifying and locating all nodes in a ReFS file system. Based on these findings, it is possible to develop applications that apply these ideas to a ReFS formatted file system, in an attempt to recover deleted files and directories. Both of the implementations are explained in the succeeding chapter.
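The following is a minimal sketch of the constraint-based scan mentioned above. It assumes the index entry header of table 4.10, i.e. a 32-bit entry length at offset 0x00 followed by 16-bit key offset, key length, flags, value offset and value length fields, and it reads the fields as host integers, which presumes a little-endian host; the scan granularity, the names and the exact field order are our assumptions.

#include <stdint.h>
#include <string.h>

/* Candidate index entry found in the data area of a node. */
typedef struct {
    uint32_t offset;   /* offset of the candidate inside the data area */
    uint32_t length;   /* total length stored in its header */
} refs_entry_candidate;

static uint32_t get32(const uint8_t *p) { uint32_t v; memcpy(&v, p, 4); return v; }
static uint16_t get16(const uint8_t *p) { uint16_t v; memcpy(&v, p, 2); return v; }

/* Scan a node's data area for chunks that satisfy the constraints an index
 * entry has to fulfill (length >= 0x10, key and value offsets >= 0x10, key
 * and value inside the entry). Returns the number of candidates found. */
static size_t scan_data_area(const uint8_t *area, uint32_t area_len,
                             refs_entry_candidate *out, size_t max_out)
{
    size_t found = 0;
    for (uint32_t pos = 0; pos + 0x10 <= area_len && found < max_out; pos += 4) {
        uint32_t len  = get32(area + pos);         /* length of the index entry */
        uint16_t koff = get16(area + pos + 0x04);  /* offset of the key */
        uint16_t klen = get16(area + pos + 0x06);  /* length of the key */
        uint16_t voff = get16(area + pos + 0x0a);  /* offset of the value */
        uint16_t vlen = get16(area + pos + 0x0c);  /* length of the value */

        if (len < 0x10 || len > area_len - pos)
            continue;
        if (koff < 0x10 || (uint32_t) koff + klen > len)
            continue;
        if (voff < 0x10 || (uint32_t) voff + vlen > len)
            continue;

        out[found].offset = pos;
        out[found].length = len;
        found++;
    }
    return found;
}

Each candidate would subsequently be validated with the row-specific sanity checks described above, for example the constant identifier 0x00010030 that starts every file row.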

Figure 4.41: Overview to express the recoverability of tree structures


Byte Range Description 0x000:0x004 Page Signature (MLog) 0x004:0x008 Volume Signature 0x008:0x00c Always 0x1 0x00c:0x010 Always 0x1000 0x010:0x020 UUID (ExUuidCreate) 0x020:0x024 Unknown 0x024:0x028 Alignment 0x028:0x030 LSN_NULL (Current LSN) 0x030:0x038 LSN_NULL (Previous LSN) 0x038:0x03c Number of LSN’s in the block (0x1) 0x03c:0x040 Number of 4 KiB blocks in the log page (0x1) 0x040:0x044 Always 0x0 0x054:0x058 Offset 1 (Usually: 0x78) The following values are relative to offset 1 (Usually: 0x78) 0x000:0x008 LSN_NULL (Current LSN) 0x008:0x010 Checksum 0x010:0x018 Unknown 0x018:0x020 LSN_NULL (Previous LSN) 0x020:0x024 Length of next relative entry 0x024:0x028 Always 0x0 0x028:0x02c Offset 2 (Usually: 0x38) 0x02c:0x030 Always 0x0 0x030:0x034 Always 0x1 0x034:0x038 Always 0x0 The following values are relative to offset 1 and offset 2 Usually: 0x78 + 0x38) 0x000:0x008 Virtual Clock (Used to decide which restart area is more recent) 0x008:0x010 Logging Area Start 0x010:0x018 Logging Area End 0x018:0x020 Most recent LSN of the previous checkpoint 0x020:0x028 First LSN in the next checkpoint 0x028:0x038 16 byte long UUID (Same as in the header) 0x038:0x03c Unknown 0x03c:0x040 Flag (Indicates whether checksums for the logging area are enabled) 0x040:0x044 Always 0x1 0x044:0x054 Unknown 0x054:..... Unused rest

Table 4.39: Structure of a log page in the restart area


Byte Range    Description
0x000:0x004   Page Signature (MLog)
0x004:0x008   Volume Signature
0x008:0x00c   Always 0x1
0x00c:0x010   Always 0x1000
0x010:0x020   UUID (Identical to the UUID from the restart area)
0x020:0x024   Unknown
0x024:0x028   Alignment
0x028:0x030   Current LSN (Lower 4 bytes: Index of the 4 KiB block in the logging area)
0x030:0x038   Previous LSN (Lower 4 bytes: Index of the previous 4 KiB block in the logging area)
0x038:0x03c   Number of LSNs in the block (0x1)
0x03c:0x040   Number of 4 KiB blocks in the log page (0x1)
0x040:0x044   Always 0x0
0x054:0x058   Offset 1 (Usually: 0x78)
The following values are relative to offset 1 (Usually: 0x78)
0x000:0x008   LSN_NULL (Current LSN)
0x008:0x010   Checksum
0x010:0x018   Unknown
0x018:0x020   Previous LSN
0x020:0x024   Length of next relative entry
0x024:0x028   Always 0x0
0x028:0x02c   Offset 2 (Usually: 0x38)
0x02c:0x030   Offset to the end of the log block
0x030:0x038   Always 0x2
The following values are relative to offset 1 and offset 2 (Usually: 0x78 + 0x38)
0x000:....    Redo information of a transaction

Table 4.40: Structure of a log page in the logging area


Byte Range    Description
0x000:0x004   Length of the redo information of a complete transaction
0x004:0x008   Offset to the first redo block
First redo record, relative to the offset at 0x004:0x008 (Usually: 0x8)
0x000:0x004   Length of the first redo record
0x004:0x008   Redo ID of the first redo record
0x008:0x00c   Number of table path components
0x00c:0x010   Offset to first table path component
0x010:0x014   Number of data components
0x014:0x018   Offset to first data component
0x018:0x020   Virtual Allocator Clock of the last checkpoint
0x020:0x028   LSN of the last checkpoint
0x028:0x02c   Always 0x0
0x02c:0x030   Flag
0x030:0x038   Unknown
Offsets to table path components, relative to the offset at 0x00c:0x010
0x000:0x004   Offset to first table path component
0x004:0x008   Length of first table path component
0x008:0x00c   Offset to second table path component
0x00c:0x010   Length of second table path component
.....:......
0x008:0x00c   Offset to last table path component
0x00c:0x010   Length of last table path component
Offsets to data components, relative to the offset at 0x014:0x018
0x000:0x004   Offset to first data component
0x004:0x008   Length of first data component
0x008:0x00c   Offset to second data component
0x00c:0x010   Length of second data component
.....:......
0x008:0x00c   Offset to last data component
0x00c:0x010   Length of last data component
Table path components
Data components
.....:.....   Length of the second redo record
.....:......

Table 4.41: Structure of the redo information found in a log page in the logging area

Byte Range    Description
0x000:0x004   Table schema old
0x004:0x008   Table schema new
0x008:0x00c   Always 0x0
0x00c:.....   Key (The first key of a table path is always a Table Identifier)

Table 4.42: Structure of a table path component in a redo operation
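To tie tables 4.41 and 4.42 together, the following sketch walks the table path components of a single redo record. The helper and type names are ours; the assumption that the (offset, length) pairs, as well as the offsets they contain, are relative to the position referenced at 0x00c:0x010 is also ours, and the values are read as little-endian host integers.

#include <stdint.h>
#include <string.h>

static uint32_t rd32(const uint8_t *p) { uint32_t v; memcpy(&v, p, 4); return v; }

/* Sketch: walk the table path components of one redo record. "rec" points
 * to the start of the redo record, "rec_len" is its total length. */
static void walk_table_path(const uint8_t *rec, uint32_t rec_len)
{
    uint32_t n_components = rd32(rec + 0x08);  /* number of table path components */
    uint32_t array_offset = rd32(rec + 0x0c);  /* offset to first table path component */

    for (uint32_t i = 0; i < n_components; i++) {
        uint32_t c_off = rd32(rec + array_offset + i * 8);
        uint32_t c_len = rd32(rec + array_offset + i * 8 + 4);
        if (c_len < 0x0c || array_offset + c_off + c_len > rec_len)
            break;                             /* basic sanity check */

        const uint8_t *component = rec + array_offset + c_off;
        /* Table 4.42: 0x00 old table schema, 0x04 new table schema,
         * 0x08 always 0x0, 0x0c.. key; the key of the first component
         * is always a Table Identifier. */
        uint32_t schema_old = rd32(component + 0x00);
        uint32_t schema_new = rd32(component + 0x04);
        (void) schema_old; (void) schema_new;
    }
}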

5 Implementation

The application areas of “The Sleuth Kit” (TSK) were already motivated in section 2.4. TSK provides a solid foundation that may easily be extended with support for new file systems. The codebase of the file system layer found in TSK is based on C and C++. Initially, the complete TSK was purely based on C, but over time, multiple C++ based modules were added. In the following, we describe the implementation of ReFS in the file system layer of TSK as well as the implementation of a carver which utilizes various functions that are provided by TSK.

TSK offers only a few data structures that may be used to store and manage data. Those include a list and a stack, but both of these data structures support only a limited set of data types. Therefore, both of our implementations use a mix of C and C++. C++ is mainly used to make use of data structures as well as convenience features that are provided by its Standard Template Library. We made the extension of The Sleuth Kit and the developed carver publicly available at https://faui1-gitlab.cs.fau.de/paul.prade/refs-sleuthkit-implementation.

5.1 Extending “The Sleuth Kit” to support ReFS

The development process began with a small change in the volume layer of TSK. TSK provides a command-line application called mmls, which allows a user to display the partition table of a volume. When displaying the partition table of a volume, mmls also shows the partition types of all partitions that exist in the volume. Partition types are used to specify the file system type that is contained in a partition. A partition type, however, does not have to be equal to the actual file system type used in a partition. It is also possible that a partition type can be assigned to multiple different file system types. In TSK, only partition types stored in DOS partitions are translated to possible file system types. ReFS uses the same partition type as the exFAT as well as the NTFS file system. Therefore, it was only necessary to add the name “ReFS” to the output of file system names. This change was made in the file tsk/vs/.c.

The remaining development process was exclusively focused on the file system layer. The entry point of the file system parsing process in TSK is found in the source file tsk/fs/fs_open.c. This file exposes a method called tsk_fs_open_img that is required to process any file system with TSK. The function takes a pointer to the representation of a disk image, as well as an offset in the image and an optional file system type as arguments. The function defines an array of so-called “file system openers”. This array stores the names of different file systems together with functions that are used to interpret those file systems. Figure 5.1 shows the structure of the “file system openers” array. For every supported file system type, a function called _open exists. Such an “opener function” tries to interpret the data found in the provided disk image as an appropriate file system. If an opener function is able to interpret the data on the disk image successfully, it returns a pointer to a file system context object of the type TSK_FS_INFO. This object stores state information about the opened file system. If an opener function fails to interpret the data found in a disk image as a file system, it returns NULL.

FS_OPENERS[] = {
    { "NTFS",     ntfs_open,    TSK_FS_TYPE_NTFS_DETECT },
    { "FAT",      fatfs_open,   TSK_FS_TYPE_FAT_DETECT },
    { "ReFS",     refs_open,    TSK_FS_TYPE_REFS_DETECT },
    { "EXT2/3/4", ext2fs_open,  TSK_FS_TYPE_EXT_DETECT },
    { "UFS",      ffs_open,     TSK_FS_TYPE_FFS_DETECT },
    { "YAFFS2",   yaffs2_open,  TSK_FS_TYPE_YAFFS2_DETECT },
    { "HFS",      hfs_open,     TSK_FS_TYPE_HFS_DETECT },
    { "ISO9660",  iso9660_open, TSK_FS_TYPE_ISO9660_DETECT }
};

Figure 5.1: Array of “file system openers”


typedef struct {
    uint8_t f1[20];
    uint8_t value_1[4];
    uint8_t value_2[8];
    uint8_t f2[3];
    char    string_1[5];
    uint8_t value_3[2];
} struct_def;

struct_def s;
cnt = tsk_fs_read(fs_info, 0, &s, sizeof(struct_def));
if (cnt != sizeof(struct_def))
    ..error handling..

uint32_t value_1 = tsk_getu32(fs_info->endian, (&s)->value_1);
int64_t value_2 = tsk_gets64(fs_info->endian, (&s)->value_2);

Figure 5.2: Example for a struct overlay used in TSK

TSK also provides an autodetect feature that can automatically determine the type of a used file system. The autodetect feature calls the open functions of all “file system openers”. If an open function returns successfully, it is assumed that the volume holds a file system of the respective type. TSK, however, does not stop once it has identified a matching opener function. Instead, it executes every opener function on the provided disk image. This approach is used to identify ambiguity in the process of detecting a file system. If multiple opener functions are able to interpret the file system type correctly, TSK aborts and asks the user to specify the file system type manually [10].

The context object that is returned by a file system opener function is reused for all further interactions with the opened file system. An opener function fills the context object with information about the file system. This information includes the size of a block, as well as the number of the first and the last block of the file system. The term “block” refers to the data unit used in the file system. For NTFS and ReFS, a block corresponds to a cluster. The context object also holds the first and last metadata address in the file system, the endianness that is used by the file system, as well as various metadata addresses, such as the metadata address of the root directory. To enable a user to interact with an opened file system, the context object also stores several function pointers that serve as callback functions. These function pointers offer a generic way to interact with any file system and are used to realize the functionality of various file system tools. Under the hood, the implementation of these functions depends on the corresponding file system type. However, all details of different file systems are hidden behind the interfaces of these functions. An example of such a callback function is the function pointer called file_add_meta. When a user wants to read the metadata of a file, he merely provides a metadata address to this function and may expect, regardless of the underlying file system type, that the function returns a representation of the file’s metadata.

The first important task to implement functions in the file system layer is to read data from a file system and to interpret it. Read operations on the file system layer are performed by calling the function tsk_fs_read that allows reading an arbitrary number of bytes from the disk into a buffer that is provided by the caller. After data has been read into a buffer, the common approach used in TSK is to cast the read data into a struct definition. The macros tsk_get<16/24/32/48/64> are then used to extract fields of such a struct. They also provide an abstraction from the endianness used in the file system. tsk_getu returns unsigned values, whereas tsk_gets returns signed values. The number after the signedness indicator specifies the number of bits in the field that should be interpreted. Figure 5.2 shows an example of how data of a file system may be read and extracted with TSK. Entries in the struct that are named f1, f2, fn represent alignment gaps, values that are irrelevant for the interpretation of the file system, or unknown fields.

refs_open: The entry function of every file system first allocates a file system context object on the heap. This object consists of two distinct components: a well-known TSK_FS_INFO struct that is common to all different file system types and a file system specific data area.
Since there exist huge differences between different file system types, it would be impracticable to store all data relevant to a file system in the TSK_FS_INFO struct. This would require adding many fields to the struct that are relevant for some file

systems but utterly useless for others. Instead, the file system context object used in TSK starts with a well-known, documented TSK_FS_INFO struct that is followed by file system specific structures. This concept is shown in figure 5.3; a struct sketch follows the figure.

Figure 5.3: Illustration of the file system context object
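As a rough sketch of this concept, the ReFS context object could be declared as follows. The layout mirrors the pattern used by the existing TSK file systems (such as NTFS_INFO); the name REFS_INFO and the fields after the embedded TSK_FS_INFO struct are our assumptions, not the exact fields of the implementation.

#include "tsk/libtsk.h"

typedef struct {
    TSK_FS_INFO fs_info;     /* generic part; must be the first member so a
                              * REFS_INFO pointer can be passed wherever a
                              * TSK_FS_INFO pointer is expected */

    uint32_t cluster_size;   /* 4 KiB or 64 KiB, taken from the boot sector */
    uint64_t checkpoint_lcn; /* location of the checkpoint that was picked */
    /* ... references to the Object ID Table, Container Table data for the
     * address translation, cached directory entries, etc. ... */
} REFS_INFO;

/* Inside a callback function, the specific part is reached by a cast: */
/* REFS_INFO *refs = (REFS_INFO *) a_fs_info; */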

• Boot Sector: Similar to the implementation of other file system types, the refs_open function starts with parsing the boot sector. It does that by reading the first sector of the file system and applying a structure definition according to table 4.3 to it. Next, the single values of the boot sector are verified and written into the file system context object. The boot sector of a valid ReFS file system, for example, must contain the character string ReFS. The checksum of the File System Recognition Structure in the boot sector must also be correct. Various other parameters restrict the contents of the boot sector. The cluster size of a ReFS file system is expected to be 4 KiB or 64 KiB, as these are currently the only supported cluster sizes. The implementation furthermore expects that the sector size is 512 B. The implementation also makes sure that the ReFS version indicated by the boot sector is 3.4, since we do not have sufficient knowledge about other versions of ReFS.

• Superblock: After the cluster size has been determined, the function process_superblock reads the superblock found in cluster 30, as well as the backup superblocks found at the end of the file system. The implementation applies the structure definition of a superblock to the read data. The structure definition of the superblock is equal to its description in table 4.6. The implementation picks the most recent superblock that is valid. A superblock is only considered valid if none of its offsets and length values exceed its bounds. To make sure that a superblock is valid, it is also checked that its self-checksum equals the checksum that can be calculated from it. From the superblock, the volume signature is calculated, and the cluster addresses of the checkpoint structures are extracted.

• Checkpoint: The function process_checkpoint reads both checkpoint structures and applies the struct definitions explained in table 4.7 to them. Just like for the superblock, the most recent valid checkpoint is picked. Usually, the contents of both checkpoint structures are nearly identical. It might, however, be possible that one checkpoint structure references one or multiple older root nodes that were created through the Copy-On-Write process. So far, we refrain from parsing both checkpoints, as it is not trivial to combine the two forests that the checkpoints reference. Combining those two results, however, could slightly increase the number of metadata structures found by the implementation. From the chosen checkpoint, the 13 references to the different root nodes are extracted. Our implementation uses a datatype called table_reference to store the addresses as well as checksum information of references that link to pages. This struct is used to store both the root nodes of pages and extents that are referenced by other pages. After a page has been mapped, the data in its table_reference struct is complemented by fields extracted from its page header.

• Container: The function refs_load_container sets up the address translation mechanism employed by ReFS. The function enumerates the most recent Container Table and extracts data from its rows that is necessary to perform address translations. After the function has returned successfully, the


function refs_map_page may be used to interpret all further page references in the file system. refs_map_page uses the helper function translateLCN that performs address translations based on the data obtained from the Container Table.

• Object ID Table: It should not matter whether the regular Object ID Table or its identical copy, the duplicate Object ID Table, is processed, since both should store the same contents and presumably also contain the same slack space. Many tables referenced by the Object ID Table represent directories. Because of that, it is not only important to read all regular entries found in the Object ID Table but also to recover deleted ones. The function enumerateTable allows obtaining all entries found in a tree structure, that is, all rows in a table. It takes a table_reference structure as an argument that contains the address of the node at which the enumeration of the tree starts. It also expects a function as an argument that provides a strategy to recover entries from a node. Recovery strategies may depend on the structure of the tree. Some trees, for example, store specific signatures in their keys; it is then easy for a strategy to search for these signatures within a node to recover entries.

When recovering links to directory tables, it is possible to find multiple links with the same table identifier. This behavior mainly occurs as a result of the Copy-On-Write process, which continually writes the root nodes of tables to new pages. Duplicate page references that store the same table identifier as others may reference different root nodes. As a consequence of that, it is possible for a directory to be formed by multiple tree structures. An example of this state of affairs is shown in figure 5.4 (a sketch of the merging step follows the figure). The grey marked rows in the figure represent rows that were recovered. When combining the keys found in the different restored tree structures, it is important that all keys are only interpreted once. The current state of the table stores the keys 2, 4, 6, and 9. These are interpreted as regular entries. The recovered trees must be sorted in descending order of their recency. The recency of a tree may be derived from the virtual allocator clock found in the page header of its root node. After a tree has been enumerated completely, all new keys are added to the set of existent keys. When parsing the tree with the clock value 0x2, only the key 7 would be added to the set of keys. In the next recovered tree, only the key 3 is new. If such a scenario occurs, only the keys found in the referenced tree structure (2, 4, 6, 9) are considered to be existent. Keys that are not found in the referenced tree structure (3, 7) but are located in different tables that were recovered are considered to be deleted entries. Recovery of deleted entries is only attempted for entries which reference directory tables. All other table structures referenced by the Object ID Table are regularly processed by querying their index.


Figure 5.4: Combining entries of existing and deleted tables that are referenced by the Object ID Table
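The merging step illustrated in figure 5.4 can be sketched as follows. The types and names are our own, and the sketch assumes that the recovered trees have already been ordered by decreasing virtual allocator clock, with the currently referenced tree first.

#include <stdint.h>
#include <stddef.h>

/* One (current or recovered) state of a table, reduced to the keys of its rows. */
typedef struct {
    const uint64_t *keys;
    size_t          n_keys;
} refs_recovered_tree;

static int key_seen(const uint64_t *seen, size_t n, uint64_t key)
{
    for (size_t i = 0; i < n; i++)
        if (seen[i] == key)
            return 1;
    return 0;
}

/* Collect every key exactly once. Keys provided by the referenced tree
 * (index 0) are existing entries; keys that only appear in an older,
 * unreferenced copy are treated as deleted entries. "seen" must be able
 * to hold all keys. */
static void merge_trees(const refs_recovered_tree *trees, size_t n_trees,
                        uint64_t *seen, size_t *n_seen)
{
    *n_seen = 0;
    for (size_t t = 0; t < n_trees; t++) {
        for (size_t k = 0; k < trees[t].n_keys; k++) {
            uint64_t key = trees[t].keys[k];
            if (key_seen(seen, *n_seen, key))
                continue;              /* a more recent copy already provided this key */
            seen[(*n_seen)++] = key;
            /* t == 0: regular entry; t > 0: entry recovered from an older
             * Copy-On-Write copy of the tree. */
        }
    }
}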

• Volume Information Table: Next, the Volume Information Table is processed. All three entries as described in tables 4.20, 4.21, and 4.22 are extracted from it, and their contents are written into a structure of the datatype volume_info. The only place where this information is necessary is within the refs_fsstat function, which prints general information about the file system and data from the file system category. The function is used to realize the functionality of the file system tool of the same name, fsstat.

After all of these tables have been successfully processed, the function pointers to the callback functions are written into the file system context object. In total, there are 16 callback functions; therefore, only the most important ones are explained below.

refs_inode_lookup: This is presumably the most important callback function referenced by the file system context object. The arguments of the function are the current file system context object, a pointer to a TSK specific data structure which is used to store general metadata for files and directories, and the metadata address of a file that should be processed. If the function is successful, it returns 0; otherwise, it returns 1. The function uses the function refs_dinode_lookup to locate the table of a file or a directory descriptor. Every metadata address under ReFS is formed by a file identifier and a directory identifier. The directory identifier specifies in which directory table the searched file resides. To find the location of a file, one must query the directory table in which it resides. When the Object ID Table was processed, a map with directory identifiers and references to the corresponding root nodes of the tables was populated. This map structure makes it easy to locate the directory table in which the file is stored.

refs_dinode_lookup uses a helper function to enumerate the found directory table. The helper function, in turn, uses the enumerateTable function together with a strategy function that is specialized in locating deleted files and directory links. It is not necessary to carve for the directory descriptor entry of a directory table since there exists only one within a table and it is always referenced by the index. The processed entries that were found in the directory table are retained so that the table does not have to be enumerated entirely again at a later time. If a caller, however, were to use this function to open thousands of files in different directories, this process might consume more memory than intended, as all entries found in the directories are cached so that subsequent requests can be handled faster. Because of that, it might be necessary to establish a threshold value at which the cached directory entries of previously processed directory tables are released.

The directory descriptor table or file table that was found can then be parsed as explained in tables 4.29 and 4.30. The function refs_dinode_copy is used to transform a directory descriptor table or a file table into the

metadata representation used by TSK. The general information of files and directories is covered in the body of the TSK_FS_FILE structure. The structure can be flexibly extended by adding attributes to it. NTFS has already benefited from this design decision, which allows attributes of files and folders to be easily mapped to the provided metadata structure. All attributes except the data attribute may be modeled as classical resident attributes. The data attribute is non-resident, as data in ReFS resides in external clusters. To parse an on-disk data attribute, the function enumerateTable is called to obtain all of its data runs. Similar to its previous uses, a strategy function is supplied that may be used to restore data runs that have been removed from a file. If a file is rather small, so that all of its data runs fit into a single node, the complete index of its data run table is cleared when the file is deleted. The strategy function attempts to recover data runs that have been removed from the index in such scenarios. After all of these steps have been performed, the TSK_FS_META structure is filled with data that reflects the metadata of the opened file. To indicate its success, the function finally returns 0.

refs_dir_open_meta: The previous function was merely concerned with opening metadata addresses and did not consider the file name category at all. This function associates the names of files and folders with their corresponding metadata addresses. The arguments of the function are the current file system context object, a pointer to a pointer to a TSK specific data structure that represents a directory, and finally the metadata address of a directory that should be processed. If the function is successful, it returns 0; otherwise, it returns 1. The purpose of the function is to populate the provided directory data structure TSK_FS_DIR with mappings of file names and associated metadata addresses. The TSK_FS_NAME structure is used to express such a mapping from a file name to a metadata address. The function tsk_fs_dir_add that TSK provides allows adding such single mappings to the directory structure. To implement this functionality, first, it is determined to which directory table the provided metadata address refers. Next, the corresponding directory table is wholly enumerated. Finally, all files and directory links contained in it are transformed into separate TSK_FS_NAME structures. The File System Metadata folder used under ReFS resides in the root directory, but because there exists no directory link to it, this mapping must be expressed manually.

refs_get_default_attr_type: After an interesting metadata entry has been identified, a user may want to access its contents. TSK needs to know which attribute of a file stores its “main data”. A file may store multiple attributes, such as named data streams or a reparse attribute. Therefore, it is required to define a default attribute. The default attribute used by files is their data attribute. In NTFS, the index root attribute was considered to be the standard attribute of directories. Because of that, we chose the same attribute type as default attribute type for directories under ReFS, even if the index root attribute is always unused and seems to have become deprecated.

refs_close: After a file system has been processed and it is not necessary to interact with it anymore, its internally allocated data structures must be freed. Cleaning up data structures and deallocating buffers is done by the “close” function of a file system.

Other functions: The file system context object provides various other functions. refs_inode_walk, for example, implements the traversal over a range of metadata addresses. The function takes a start and an end metadata address as arguments as well as flags that may be used to check whether a found metadata entry qualifies to be processed. To process single metadata entries, the function also takes a callback function as an argument that is called on all qualified metadata entries. TSK also defines a set of functions that operate on the blocks of a file system. The code of these functions is, for the most part, independent of the used file system and could largely be reused from the NTFS implementation. It is only required to provide a data structure that expresses the allocation state of single clusters in the file system. The refs_load_bmap function assembles multiple tables to generate a cohesive view of the allocation status of clusters in a file system. The function refs_file_get_sidstr is another component that has been largely reused from the NTFS implementation. It obtains the security descriptor of a given metadata entry and extracts its security identifier (SID) string.

Since both NTFS and ReFS use security descriptors to express the properties of secured objects, the interpretation of security descriptors that already exists in TSK does not need to be implemented redundantly. Since the information stored in the TSK_FS_INFO struct and the fixed part of the TSK_FS_FILE struct is not universal enough to store the properties of all file systems and all files, the functions refs_fsstat and refs_istat exist. refs_fsstat takes a file system context object as its argument, which it uses to write general information about the file system into an output object. refs_fsstat also interprets the file system specific structs that are located in a TSK_FS_INFO. refs_istat works similarly, but instead of describing the properties of a file system, it is concerned with printing the characteristics of a metadata entry. It is necessary to offer different istat implementations for different file systems, as the interpretation of the attributes of a file completely depends on the analyzed file system. The file system context object also provides three function pointers that refer to the journal of a file system. refs_jopen is used to “open” the journal so that other functions may access it. refs_jblk_walk allows a caller to define flags, a range, and a callback function that is executed on all journal blocks that lie within the range and fulfill the properties defined by the flags. refs_jentry_walk works in a similar fashion, but instead of complete journal blocks, it iterates over the single entries of the file system journal. Aside from the callback functions defined in the file system context object, TSK also offers a tool called usnjls. This tool may be used to display the contents of an Update Sequence Number (USN) journal. So far, NTFS was the only file system supported by TSK that provides such a journal. ReFS, however, also allows creating a similarly structured USN journal. Since the contents of the USN journal are stored in a common file, it may be read easily by using the previously implemented functionality. The interpretation of the entries found in the USN journal is already implemented for NTFS. Thus, it was only necessary to extend this implementation, as ReFS uses a new USN record format that was not yet supported by TSK.

Hurdles in the implementation of the tool

• Size of metadata addresses: The greatest obstacle in the implementation of the ReFS file system layer in TSK was the format of metadata addresses. All file systems previously supported by TSK use 64-bit metadata addresses to reference files and directories. ReFS, however, uses two 64-bit identifiers (128 bits in total) to refer to a metadata entry. One identifier encodes the directory table in which the metadata entry resides. The other one addresses the entry within the corresponding directory table. This finding is supported by a compatibility report of Microsoft [33]. In section 1.10.3 of the report, it is stated that ReFS uses 128-bit identifiers as a measure to allow ReFS deployments to store billions of files. While the ReFS API still allows addressing files by using 64-bit identifiers to maintain backward compatibility, the report states that an application which merely works with 64-bit addresses is likely to crash if it encounters a 128-bit file identifier.

Metadata addresses used in TSK are defined as typedef uint64_t TSK_INUM_T;. All callback functions provided by the file system context object use the data type TSK_INUM_T to refer to metadata addresses. 64-bit values, however, are not sufficient to address metadata entries found in ReFS. Some compilers, such as GCC, provide 128-bit built-in data types that are not covered by any C or C++ standard. Using such a compiler-specific datatype would drastically limit the compatibility of the TSK implementation. There also exists no format string that allows printing such a value, and various unpleasant side effects may occur when working with these types. Therefore, we decided to change the type of a metadata address by mapping the TSK_INUM_T type to a struct that holds a low and a high 64-bit value (a minimal sketch of such a struct is shown after this list). This change, however, had an impact on all other implementations of file systems and tools in TSK, as all of them make use of the TSK_INUM_T type and had to be adjusted accordingly. It was also necessary to alter all arithmetic operations on TSK_INUM_T values, since the C programming language offers no method to execute most of the built-in operators on structs. This modification also affects all applications that are connected to TSK and expect metadata addresses to be 64-bit values. All graphical user interfaces that rely on the APIs of TSK will not work anymore as long as they do not reflect this modification in the data model of TSK.

• Older copies of files: TSK also offers only a rigid assignment of metadata addresses to file structures. TSK does not cover the idea of multiple copies of a file that all share the same metadata address.


This concept, however, is crucial if one intends to analyze ReFS-formatted file systems, as the Copy-On-Write policy used in ReFS, as well as the reorganization of tree structures, may create multiple copies of files that represent their states at different times. Our implementation does not tackle this issue in the file system layer of TSK, as this approach would introduce many additional changes to the architecture of TSK. Instead, a carver based on the core functionality of TSK was developed. The carver allows extracting all files and directories from ReFS-specific data structures, even if the underlying file system does not contain a complete ReFS partition anymore. In contrast to the implementation in the file system layer, the carver can identify and extract all older copies of files.
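A minimal sketch of the widened metadata address type mentioned above might look as follows. The field names, their assignment to the directory and file identifiers, and the comparison helper are illustrative assumptions; the text only states that the struct holds a low and a high 64-bit value.

#include <stdint.h>

/* Illustrative replacement for:  typedef uint64_t TSK_INUM_T; */
typedef struct {
    uint64_t low;    /* one half of the address, e.g. the file identifier          */
    uint64_t high;   /* the other half, e.g. the identifier of the directory table */
} TSK_INUM_T;

/* C offers no built-in operators on structs, so every comparison that TSK
 * previously performed on plain 64-bit addresses needs a helper like this one. */
static inline int
tsk_inum_cmp(TSK_INUM_T a, TSK_INUM_T b)
{
    if (a.high != b.high)
        return (a.high < b.high) ? -1 : 1;
    if (a.low != b.low)
        return (a.low < b.low) ? -1 : 1;
    return 0;
}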

5.2 Development of a carver based on TSK

The need for a carver is best expressed by figure 4.41, which was discussed in section 4.4.3. Even though many deleted entries may be restored by analyzing remnants in tree structures, many nodes of previously allocated pages still remain unseen. The reason for this is that some nodes are not referenced by any accessible page anymore, and there exists no way to locate them. It is possible that at an arbitrary position on the disk, important data is located that belongs to the metadata of a deleted file or a directory structure. If we, however, have no way to find this structure, it remains unseen and potentially substantial evidence is suppressed. The implementations for the file systems in TSK are purely based on parsing existing structures and are therefore highly responsive. It would, however, be necessary to scan an entire volume if all deleted pages are to be recovered. Therefore, a carver independent of the actual ReFS implementation was developed. Core concepts of the carver were taken from the internals of the application refsutil.exe, which was also analyzed in our work. The carver is called refstool and is based on the tsk_img_open function provided by TSK. A user may provide an arbitrary image and an offset to the carver, which then tries to recover deleted files and directories from the image. The carver works in two steps:

• Collection phase: This is the phase in which the carver starts. The carver scans the entire provided image for blocks that look like pages of tree nodes. The step size of the carver is 4 KiB. In every step, it verifies the page header within the first 80 bytes of a read block (a minimal sketch of this scan loop follows below). If these bytes fulfill the characteristic properties of a page header, the page is kept for further analysis. If the table identifier of the page is equal to the table identifier of a “Container Table” or any directory table, the page is retained, else it is discarded. The carver then proceeds to classify the page, as shown in figure 5.5. The carver maintains a structure that maps volume signatures to an object that represents the state of a ReFS file system. This structure is used to distinguish different instances of ReFS file systems that may exist if the underlying disk contains multiple ReFS partitions or was formatted multiple times with the ReFS file system. Within the state of each volume, a map of directory identifiers is maintained. This map assigns table identifiers to a list of pages. When looking up the key of a directory table in this map, the caller obtains a list of all pages that store information relevant to this table, that is, all nodes that once belonged to, and all nodes that still belong to, a table with the given identifier. The carver also derives further information from the pages that were read. The number of addresses stored in the page header is used to decide whether the page belongs to a volume with a cluster size of 4 KiB or 64 KiB. It is also important for the carver to know at which disk offset the volume started. At a later time, the application must interpret cluster runs, and for this, it is necessary to know to which starting address they refer. Figure 5.6 illustrates this issue. The first file system (marked red in the figure) starts at the beginning of the disk image, so address references used in it may be interpreted without adding any offsets. The second file system (marked blue) starts somewhere in the middle of the disk image. Therefore, its starting offset relative to the start of the disk image needs to be added to all cluster references that are interpreted in it.
Luckily, the cluster numbers found in the page header of the Container Table are equal to their physical addresses. Thus, the first page found that belongs to a Container Table may be used to determine the starting location of a ReFS file system.
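A minimal sketch of the scan loop of the collection phase is given below. tsk_img_read and the size member of TSK_IMG_INFO are existing TSK interfaces; looks_like_tree_page and record_page are hypothetical placeholders for the page header sanity checks and for the classification step shown in figure 5.5.

#include <tsk/libtsk.h>

#define CARVER_STEP 4096   /* scan granularity: 4 KiB                   */
#define HDR_LEN       80   /* page header bytes inspected in every step */

/* Hypothetical helpers standing in for the checks described in the text. */
int  looks_like_tree_page(const char *hdr);
void record_page(const char *hdr, TSK_OFF_T physical_offset);

static void
collect_pages(TSK_IMG_INFO *img)
{
    char hdr[HDR_LEN];

    for (TSK_OFF_T off = 0; off + CARVER_STEP <= img->size; off += CARVER_STEP) {
        if (tsk_img_read(img, off, hdr, HDR_LEN) != HDR_LEN)
            continue;                        /* unreadable block, skip it   */
        if (!looks_like_tree_page(hdr))
            continue;                        /* not a plausible page header */
        /* Pages of the Container Table or of directory tables are retained
         * and grouped by volume signature and table identifier; all other
         * pages are discarded. */
        record_page(hdr, off);
    }
}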


(Figure content: each volume signature, e.g. 0x6100b379, maps to page information holding directory information that assigns table identifiers such as 0x600 and 0x701 to lists of page addresses, together with the volume offset, the container mapping, the page size, and the cluster size.)

Figure 5.5: Classification process in the collection phase of the carver

The collection process may also be halted at any time. Whenever the carver has read and classified 256 MiB of data, it writes its progress state as well as the list of found pages into a file. When the application is restarted at a later time, the collection process may continue at the last written state. The idea of this feature, as well as many other concepts, was extracted from the refsutil.exe utility. refsutil.exe was introduced in early 2019 by Microsoft. It allows verifying or recreating a boot sector, reclaiming leaked clusters, removing corrupt files, or recovering files from a ReFS file system. The recovery operation is also referred to as “salvage”. We analyzed the implementation of this operation as it appeared to contain valuable knowledge when it comes to recovering files.

(Figure content: a disk image containing ReFS file system #1 and ReFS file system #2 at different offsets.)

Figure 5.6: Multiple file systems at different locations within a disk

• Reconstruction phase: After all pages have been read and classified according to figure 5.5, the carver goes into the reconstruction phase. The reconstruction phase loops over all found volume states and performs identical operations on them. First, the latest state of the Container Table is restored so that it is possible to translate virtual addresses in a volume to real addresses. Next, the carver loops over all directory tables found in a volume.

• Metadata reconstruction: Directory tables are stored as a flat sequence of pages. The carver searches these pages for signatures that may be used to identify files, directory links, and directory descriptors. When a corresponding signature is found, the carver executes various sanity checks on the found data and tries to interpret it as the structure it believes it to be. The carver must also restore the subtree formed by directory descriptor tables and file tables to access their attributes and their data runs. If the carver was successful in restoring an entry, the entry is saved as an object of the class file_information or dirref_information. Every directory stores these entries as a set. If a file_information entry or a dirref_information entry is equal to an already existing entry, it is discarded. To determine whether two entries are identical, multiple of their properties are compared (a minimal sketch of this equality check is shown after this list).


It would not be sufficient to merely check whether the file identifier already exists in the set of found entries, as this would discard potential older copies of files. Instead, the file identifier, the last modification time, and the file size are used in conjunction to check whether two file entries are equal. If they are not equal, both may be contained in the same result set.

• Extraction phase: After all directory pages of a volume have been analyzed and their entries have been transformed into the internal representation of files and directories, the extraction phase begins. First, all directory tables found in the volume are created as folders of the form /. Next, a volume report with the file name /report.txt is created. The report describes the properties of the corresponding volume, such as its offset in the disk image, its cluster size, and the number of directories that it contains. The report also describes the address translation mapping that was extracted from the Container Table of the volume. Next, the application creates a file called /structures.txt that is filled with the reconstructed directory hierarchy formed by the file system. Since the application stores the extracted directories in a flat form and only refers to them through their directory identifiers, this file may be used to re-establish directory structures. Finally, the application iterates over all directory structures and dumps the files found in them. Into every directory, a file called / /directory_report.txt is written, which contains a list of all files found in the directory. The list also contains the metadata of duplicate files that may have been created through the Copy-On-Write mechanism. The contents of all files are finally dumped into the corresponding directory. To prevent naming conflicts among restored files that use identical file names, files are written as //_. Their names, however, may easily be looked up by examining the /structures.txt or the / /directory_report.txt file.
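The equality check used during this deduplication can be made explicit with the following minimal sketch; the struct and its field names are illustrative stand-ins for the corresponding members of the file_information class.

#include <stdint.h>

/* Illustrative stand-in for the relevant members of file_information. */
struct recovered_file {
    uint64_t file_id;   /* identifier of the file inside its directory table */
    uint64_t mtime;     /* last modification time                            */
    uint64_t size;      /* file size in bytes                                */
};

/* Two recovered entries describe the same file state only if identifier,
 * modification time, and size all match; checking the identifier alone
 * would silently discard older Copy-On-Write copies of the same file. */
static int
same_file_state(const struct recovered_file *a, const struct recovered_file *b)
{
    return a->file_id == b->file_id
        && a->mtime   == b->mtime
        && a->size    == b->size;
}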

As of now, the carver is only able to extract files and directories of a file system and, in contrast to the ReFS implementation in the file system layer of TSK, does not show additional information about a volume. It might be best to use both of those tools in conjunction. The file system implementation in TSK provides valuable information, such as attributes of files as well as file system metadata, that has not yet been added to the analysis process of the carver. A crucial advantage of the implementation in the file system layer of TSK is that it can decide whether a file is allocated or has been removed. The carver is not aware of whether the files that it extracts were allocated or not. On the other hand, the carver provides a valuable tool for an investigator who intends to examine a corrupted ReFS file system or who wants to obtain potential older copies of files.

6 Evaluation

To generate reproducible results in digital forensics, it is necessary to publish the data sets that were used to evaluate recovery techniques. If this premise is not met, different research teams might compare results starting from different initial conditions. Therefore, we made our generated test data available at https://faui1-files.cs.fau.de/public/refs/refs3.4_full_dataset.. As of now, there exists no work on recovering files and folders from ReFS-formatted file systems that has published its test data. Thus, no work has yet stated test cases that could be used to evaluate tools that serve such a purpose. As the design and the internal structures of ReFS differ from other file systems, interactions with files and folders produce a different amount of recoverable data. Creating a folder in ReFS, for example, always leads to the allocation of a complete page. Compared to the sole creation of a file, the creation of a folder produces a significantly higher footprint on the disk and is more likely to overwrite multiple old copies of deleted files at once. Garfinkel et al. [21] state that it is much more complicated to achieve reproducibility in research findings than in a specific forensic investigation. Forensic tools and recovery techniques should ideally be tested against representative data sets. However, it is not clear when a file system is representative. File systems that are perceived as representative are usually large data sets of real data created by real users. Those file systems, however, mostly contain personal or proprietary information that most likely is privacy-sensitive. Under these circumstances, researchers are often forced to generate their own data sets. Generating own data sets and own test scenarios, however, bears the risk that the developed method performs differently in distinct real-world scenarios that were not covered by the designed test scenarios. To obtain a metric for the effectiveness of the developed tools, we designed multiple test scenarios that portray a fictitious usage of the file system. These scenarios, however, are all affected by the described limitation and by the problem that it is not clear to which extent the developed test cases are representative.

6.1 Setup for recovering multiple files from a file system

The evaluation process of the tools is strongly based on the description of the evaluation data set used by Plum and Dewald [36]. Plum and Dewald generated multiple data sets to verify the effectiveness of a carver for the APFS file system. To generate test data sets, we developed a Python application. The application randomly executed 100 of the following actions on a ReFS file system. All actions except for the copy operation stem from Plum and Dewald.

• add_file: Adds a random file from the EDRM File Formats Data Set 1.0 [3] to the volume
• delete_file: Deletes a random file from the volume
• move_file: Moves a random file within the volume
• change_file: Inserts a random amount of ASCII A’s into a randomly picked file
• copy_file: Copies a random file within the volume
• add_folder: Creates a randomly named folder in the volume
• remove_folder: Removes a randomly picked folder from the volume

If an action could not be executed in a step (e.g., removing a directory if no directory exists at all), a different action was picked randomly. The application was executed by a batch job. After the application terminated, the batch job waited 10 seconds, automatically remounted the file system, waited 10 seconds, and executed the application again. This process was repeated 10 times, so that in total 1000 actions were executed. The waiting pause after the actions were performed was introduced to make sure that the application had terminated successfully.

The waiting pause between mounting the file system and executing the actions should ensure that the volume was mounted correctly. Remounting the file system served the purpose of making sure that the changes were persisted to the disk. The picked actions were not evenly distributed. Instead, some actions, such as “Creating a file”, were performed more often than “Creating a directory” or “Deleting a directory”. This distribution was chosen deliberately to generate a more “realistic” reflection of the usage of a ReFS file system. The distribution of the actions, however, was an arbitrary guess by the authors of how often these actions occur. An overview of the distribution of these actions and the individual data sets is given in table 6.1. The data sets in the table use different volume sizes and different cluster sizes, as indicated by their names. The data set 2G_4K.vhd, for example, represents a 2 GiB ReFS file system with a cluster size of 4 KiB. After all 1000 actions were executed, the file system was remounted as read-only, and a Python script was run to capture the final state of the existent files on the file system. While the single actions were performed, their outcomes were also logged. The output format of both of these logging processes was a slightly modified body file. Body files are a logging format specified by TSK and are used to store general meta information of files. The modified body file stored enough information so that the impact of nearly every action produced a distinct output line. On a few occasions, a file from the original data set is written twice to the same location of the volume, or a file is modified and the newly written data is identical to its old data. If such an event happens in a time frame that is shorter than a second, the action generates a duplicate entry in the body file. Duplicate entries are removed from the body file and discarded. Both the TSK implementation and the carver were adjusted so that they were able to output the modified body file format. The following characteristics were stored in the modified body file:

MD5 | base_name | dir_id | init_dir_id | init_file_id | mtime | atime | ctime | size

The field MD5 stores the MD5 checksum of a file and thereby allows comparing the contents of an entry that occurs in two body files. It is set to 0 for directories. The field base_name stores only the name of a file or a directory instead of its complete path. The field dir_id stores the current directory identifier of a file; for directories, it stores their parent identifier. init_dir_id and init_file_id form the initial address of the file. This address stays constant, even if the file is moved. Directories use the init_dir_id field to refer to their own directory table. mtime, atime, and ctime track the last modification, access, and creation time of a file. The field size is used to store the actual size of a file. The Python applications that are used to acquire the ground truth of the state of the file system use the function os.stat to obtain most of these values. Interestingly, the os.stat implementation in Python 3.7 was only able to obtain the backward-compatible 64-bit metadata addresses of files and directories in ReFS. This, however, was no limitation for the evaluation, as the created metadata addresses were small enough. The metadata address returned by os.stat could be compared to init_dir_id and init_file_id. The MD5 checksum of the files was calculated by using the hashlib module. Single lines in two body files are considered to represent the same state of a file or a directory if all their properties match. If any property, such as the MD5 checksum or the base_name, differs between two entries, they cannot be considered to refer to the same file state and thus are not perceived as a match. The capture of the final state of the file system includes all existent files and directories, whereas the outcomes of the single executed operations only log changes made to files and the creation of directories. Every action that creates, alters, moves, or copies a file generates a log entry for the modified state of the file. Actions that create files also change the timestamps of the directory in which the file is created. The same effect applies to copying and deleting a file. If a file is moved from one directory into another directory, the timestamps of the source and of the destination directory are changed. We refrained from logging these types of changes in the metadata of directories as a deliberate decision in weighting the results. When a file is created in a directory, practically two entries in the body file would have to be created: one entry that represents the current state of the created file and another entry for the directory in which the file was created. In practice, it might be more important to restore metadata entries of files than of directories. We did not want to weight the incapability to restore an old timestamp of a directory equally to the incapability to restore the contents of a file. Another decision was to compare only the base names of restored files and folders instead of comparing their complete file paths. This step was taken so that orphan files with incomplete paths could also be included in the weighting.


Data Set       add_file (35%)   delete_file (20%)   move_file (5%)   copy_file (10%)   alter_file (10%)   add_dir (15%)   delete_dir (5%)
2G_4K.vhd      345              177                 60               101               112                154             51
2G_64K.vhd     336              201                 59               90                109                159             46
5G_4K.vhd      358              201                 53               100               96                 144             48
5G_64K.vhd     346              206                 51               108               81                 159             49
40G_4K.vhd     360              210                 49               85                122                129             45
40G_64K.vhd    378              182                 47               98                93                 159             43
100G_4K.vhd    348              218                 45               92                109                147             41
100G_64K.vhd   343              205                 59               95                98                 152             48

Table 6.1: Overview of the created test data sets

               Interpreted entries
Data Set       Directories          Files
2G_4K.vhd      47/46 (102.17%)      112/110 (101.82%)
2G_64K.vhd     44/43 (102.33%)      83/81 (102.47%)
5G_4K.vhd      45/44 (102.27%)      123/121 (101.65%)
5G_64K.vhd     20/19 (105.26%)      33/31 (106.45%)
40G_4K.vhd     25/24 (104.17%)      78/76 (102.63%)
40G_64K.vhd    56/55 (101.82%)      143/141 (101.42%)
100G_4K.vhd    35/34 (102.94%)      64/62 (103.23%)
100G_64K.vhd   53/52 (101.92%)      110/108 (101.85%)

Table 6.2: Output of the TSK extension, compared to the final state of the file system

6.1.1 Testing the correctness of the applications

In the first evaluation phase, we compared the output of the developed tools to the final state of the file system. It is expected that the TSK extension can produce a report equal to the actual state of the file system. The TSK extension can differentiate between allocated and restored files and is capable of displaying only allocated files. The output of this option should be equal to the state of the file system that Windows reports.

• If the output of the TSK extension contained entries that are not in the file system state, the tool would be faulty, as it would interpret data as an existent file or directory that in practice does not exist. The evaluation of this scenario is shown in table 6.2. It is expected that for no test case the state calculated by the TSK extension contains entries that are not in the current file system state.

• If the output of the TSK extension, on the other hand, did not contain entries that are logged in the final file system state, it would be faulty, as it would be unable to find existent files. This comparison is shown in table 6.3.

As shown in table 6.2 and table 6.3, for all randomly generated testing scenarios, the output of the TSK extension nearly always matched the actual state of the file system. The only difference between the real state and the output of the TSK extension is that the latter also contained the directory “System Volume Information” (Table ID: 0x701) and its contents “WPSettings.dat” and “IndexerVolumeGuid”. These entries were not included in the extracted final state of the file system. The “System Volume Information” directory is a system directory that cannot regularly be read by any user but still exists in the file system. The same applies to the “File System Metadata” directory, which we omitted from this output of the TSK extension. The carving application is unable to distinguish between allocated and unallocated files. Thus, it can only be verified that the carver at least finds all files that exist in the file system. Most likely, the carver is also able to restore more files, but it is unable to distinguish which files belong to the current file system state.


               Interpreted entries
Data Set       Directories        Files
2G_4K.vhd      46/46 (100.00%)    110/110 (100.00%)
2G_64K.vhd     43/43 (100.00%)    81/81 (100.00%)
5G_4K.vhd      44/44 (100.00%)    121/121 (100.00%)
5G_64K.vhd     19/19 (100.00%)    31/31 (100.00%)
40G_4K.vhd     24/24 (100.00%)    76/76 (100.00%)
40G_64K.vhd    55/55 (100.00%)    141/141 (100.00%)
100G_4K.vhd    34/34 (100.00%)    62/62 (100.00%)
100G_64K.vhd   52/52 (100.00%)    108/108 (100.00%)

Table 6.3: Final state of the file system, compared to the output of the TSK extension

               Interpreted entries
Data Set       Directories       Files
2G_4K.vhd      14/154 (9.09%)    110/613 (17.94%)
2G_64K.vhd     16/159 (10.06%)   81/590 (13.73%)
5G_4K.vhd      11/144 (7.64%)    121/606 (19.97%)
5G_64K.vhd     9/159 (5.66%)     31/585 (5.30%)
40G_4K.vhd     9/129 (6.98%)     76/615 (12.36%)
40G_64K.vhd    7/159 (4.40%)     141/616 (22.89%)
100G_4K.vhd    6/147 (4.08%)     62/592 (10.47%)
100G_64K.vhd   12/152 (7.89%)    108/593 (18.21%)

Table 6.4: State of all allocated files according to the TSK extension, compared to all actions

6.1.2 Testing the recovery capabilities of the applications

The second part of the evaluation was concerned with the recovery of deleted files and the reconstruction of previous states of files and folders. For this purpose, the outputs of the developed tools were compared to the trace that was generated by the tool that performed the actions on the file system. If the TSK extension and the carver work properly, it is expected that the carver outperforms the TSK extension in finding files. Table 6.4 starts with comparing the files and directories that were determined to be allocated by the TSK extension with all files and directories that were generated in a complete test case. The resulting numbers represent how many intermediate files and directories have not been removed or changed in the scenario. Table 6.5 includes the file and directory recovery capabilities of the TSK extension. The TSK extension was used to display all files that it could recover. These results were compared to all entries that were generated when creating the data set. As the ReFS extension can only restore entries from referenced pages, it is likely that there remain pages that the TSK extension cannot locate. Additionally, the TSK extension only retrieves the most recent state of a file and is unable to address, and thus to retrieve, past states of a file. Table 6.6 portrays all files that could be recovered by using the refstool carver. In all scenarios, the carver is able to recover more files and directories than the TSK extension. The major drawback of the carver is that it is unable to differentiate between existing and removed files. Furthermore, its runtime is significantly higher than the runtime of the TSK extension, since it has to scan all blocks on the volume. As of now, the implementation of the carver does not encompass all features that the TSK extension offers, such as the recovery of the bitmap of the file system or the restoration of security descriptors. Table 6.7 conclusively gives an overview of the runtimes of the various test scenarios. To conduct all experiments, we used a 4 TiB Western Digital hard drive with the product ID WDBHDW0040BBK. We used the command-line application dd to estimate the speed of the hard drive for sequential read operations. The hard drive was able to read data sequentially at a speed of 95-110 MiB/s. Depending on the size of a volume, the carver is most of the time busy with reading and collecting pages from the volume. The runtime of the TSK extension also varies strongly based on the number of pages that are read.
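As a rough plausibility check (our own estimate, not part of the measurements): at the observed sequential read speed of roughly 100 MiB/s, a full scan of a 100 GiB volume takes about

    100 GiB / (100 MiB/s) = 102,400 MiB / (100 MiB/s) ≈ 1024 s,

which is in line with the full-volume scan times of roughly 1000 s listed for the 100 GiB data sets in table 6.7.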


               Restored entries
Data Set       Directories       Files
2G_4K.vhd      15/154 (9.74%)    132/613 (21.53%)
2G_64K.vhd     19/159 (11.95%)   125/590 (21.19%)
5G_4K.vhd      14/144 (9.72%)    159/606 (26.24%)
5G_64K.vhd     28/159 (17.61%)   75/585 (12.82%)
40G_4K.vhd     12/129 (9.30%)    97/615 (15.77%)
40G_64K.vhd    11/159 (6.92%)    175/616 (28.41%)
100G_4K.vhd    10/147 (6.80%)    113/592 (19.09%)
100G_64K.vhd   13/152 (8.55%)    142/593 (23.95%)

Table 6.5: State of all files that could be recovered with the TSK extension compared to all actions

               Restored entries
Data Set       Directories       Files
2G_4K.vhd      15/154 (9.74%)    139/613 (22.68%)
2G_64K.vhd     21/159 (13.21%)   133/590 (22.54%)
5G_4K.vhd      16/144 (11.11%)   169/606 (27.89%)
5G_64K.vhd     34/159 (21.38%)   88/585 (15.04%)
40G_4K.vhd     16/129 (12.40%)   102/615 (16.59%)
40G_64K.vhd    11/159 (6.92%)    183/616 (29.71%)
100G_4K.vhd    10/147 (6.80%)    124/592 (20.95%)
100G_64K.vhd   13/152 (8.55%)    150/593 (25.30%)

Table 6.6: State of all files that could be recovered with the carver application compared to all actions

6.2 Setup for recovering copies of a single file

While the previous evaluation scenario was concerned with the recovery of multiple randomly generated files and folders, it is also interesting to look at how the Copy-On-Write process affects single files. For this experiment, we developed a small application that writes text from a source outside of a ReFS file system into a file on a ReFS file system. The text was artificially written at the speed at which a human types (175 characters per minute). Every 2 minutes, the text file was saved and a checksum of its current intermediate state was logged together with its metadata. The purpose of this experiment is to analyze how many distinct older states of the written text can be restored. These states can be used to discover how the contents of the file changed over time. Because of the Copy-On-Write policy that ReFS uses, it is likely that metadata that describes the file and the location of its clusters is dispersed across multiple places of the volume. As the TSK implementation is only able to retrieve the most recent copy of a file, it is unsuitable for this task. Instead, the carver refstool was used to find as many existent old copies of a file as possible. To perform this experiment, we also modified the ReFS driver so that it logs its own behavior. We altered instructions in multiple functions of the driver that play a role when allocating new pages. In all functions affected by this measure, jumps to trampoline code were inserted. The trampoline code was used to log information about the current action, for example the number of a cluster that was just allocated, or information about a Copy-On-Write process in which data from one page is written to another page. We used the driver’s own tracing functionality to log these values. These modifications of the driver code can be made either in memory or on disk by modifying the binary of the driver. In the latter case, however, it is necessary to disable the verification of driver signatures. The data generated in this process allowed us to reconstruct which intermediate states of the file were generated. The continuous modification of a single page P in the Copy-On-Write process can be viewed as a chain of page states: P1 - P2 - P3 - P4 - ... - Pn. Every page state has a physical address associated with it. If this physical address is reused at a later time for an allocation of the page P or another page, the corresponding state cannot be recovered anymore.


Data Set       TSK (alloc. files)   TSK (incl. unalloc. files)   refstool
2G_4K.vhd      1.233 s              1.379 s                      20.411 s
2G_64K.vhd     1.21 s               1.947 s                      20.519 s
5G_4K.vhd      0.613 s              0.694 s                      50.1 s
5G_64K.vhd     0.917 s              1.206 s                      50.764 s
40G_4K.vhd     0.972 s              0.94 s                       383.027 s
40G_64K.vhd    2.256 s              2.409 s                      402.892 s
100G_4K.vhd    1.441 s              1.415 s                      997.922 s
100G_64K.vhd   2.037 s              2.344 s                      1012.462 s

Table 6.7: Runtimes of the different applications

All older pages in such a page chain that remain intact may still yield valuable contents. The experiment focused on writing data into a single small file. This requirement made sure that the metadata entry of the file always remained in the same page. Additionally, no other modifications were made to the file system while the file was altered. While this is not realistic behavior in practice, it is also infeasible to repeat this scenario infinitely often with all possible distinct external influences. Creating and deleting additional files and folders as background noise would also possibly cause the test file to wander into a different page, as the directory tree in which the file resides might be expanded or shrunk. The experiment was conducted with five different data sets. Each data set was a 5 GiB ReFS volume with a cluster size of 4 KiB.
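For orientation (our own arithmetic, not stated above): at the typing speed of 175 characters per minute, every two-minute save interval adds about 350 characters, so even the longest run of 240 minutes produces only

    240 min × 175 chars/min = 42,000 characters ≈ 41 KiB,

which indeed keeps the file, and thus its metadata entry, small.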

Data Set   Duration (m)   Save operations   Recovered states      Recoverable pages
                                            found by refstool     (obtained by the instrumented driver)
1.vhd      10             5                 3 (2 valid)           6 / 15
2.vhd      30             15                3 (3 valid)           5 / 31
3.vhd      60             30                3 (3 valid)           5 / 47
4.vhd      120            60                3 (3 valid)           5 / 86
5.vhd      240            120               3 (3 valid)           4 / 141

Table 6.8: Experiment to analyze the recoverability of COW copies

As seen in table 6.8, the number of recoverable files does not differ much between the various experiments. Even though different runtimes were used for the experiments, and more page states were created for the longer running experiments, the number of recoverable pages stays nearly constant. It seems that the allocator in the ReFS driver reuses old pages relatively fast. The number of file states that could be recovered in the experiment thus also stays constant. A file state is considered “valid” if it represents a correct intermediate state of the file. In the first scenario, only 2 of the 3 recovered file states corresponded to the obtained intermediate states. The reason for this is that the carver also found an empty copy of the file from before any content had been written into it. Aside from that, in the first experiment it was also possible to locate a state of the root directory in which the file did not exist yet. In all scenarios, the valid files that could be recovered corresponded to the last 3 (or 2) states of the file. As the number of recoverable pages was very small for every generated file system, we were also able to manually verify that the carver found all existent states of the file. This finding confirms the correctness of the carver in recovering older states of a file. The evaluation scenario attempted to obtain allocation information about ReFS by treating it as a black box. It might also be worthwhile for future work to understand how the concrete allocation process in ReFS works. While the method used here only allows describing metrics of observed behavior, an understanding of the allocation process in ReFS could be used to make more generally valid statements about the implemented Copy-On-Write policy and the recoverability of removed files and folders.

7 Conclusion and Future Work

The results of this work validate and confirm the findings of the related work that was discussed in the introductory chapter. By using reverse engineering techniques, various new insights into the structure of the ReFS file system could be attained, and questions that were left open for research by the most recent work regarding ReFS [22] could be answered. By using symbol files officially provided by Microsoft and by performing an analysis of the driver, many official names for data structures could be obtained, and a more coherent and generally valid nomenclature could be established. Previous works also called for an analysis of more recent versions of ReFS. In this work, data structures of the most recent ReFS version (3.4), as well as other previously unknown data structures of ReFS, were examined. Additionally, this work is the first one to document concepts and techniques to recover rows in the B+-tree of the file system. Recovering these rows allows restoring deleted files and directories from ReFS file systems. For the audience of digital forensic investigators, this work offers detailed descriptions of data structures found in a ReFS file system. These descriptions may be used to implement their own tools that perform analysis tasks on a ReFS file system. While the analysis of ReFS file systems and the retrieval of deleted files from ReFS file systems has been dominated by vendors of commercial tools, the findings of this work offer a foundation on which free and open tools may be built.

The results, however, are also affected by various limitations. Due to limited time and workforce, we were not able to thoroughly analyze all data structures that exist in ReFS. Most findings were obtained by applying reverse engineering techniques. The reliability and the information value of results obtained through reverse engineering strongly depend on the expertise and the skills of the analyst. Findings in this process are seldom unambiguous and clear. Because of that, it is necessary that our results are validated and confirmed by other researchers. A significant limitation in the expressiveness of this work was the lack of representative data to evaluate the developed tools. As stated in the evaluation chapter, there might exist many different constellations of data sets for which the developed tools might perform worse or better than in the tested scenarios. Since we only had access to a few older versions of the Windows operating system, and to keep the effort within manageable bounds, the analysis was firmly focused on the most recent version of the ReFS driver. Many older versions of the ReFS file system are still unknown to us. Supporting them would likely require a major overhaul of the developed applications.

Bibliography

[1] Examining the Windows 10 recycle bin. URL https://www.blackbagtech.com/blog/2017/01/19/examining-the-windows-10-recycle-bin/. last visited: 2019-12-05.

[2] Debug Drivers - Step by Step Lab (Sysvad Kernel Mode). URL https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/debug-universal-drivers--kernel-mode-. last visited: 2019-12-05.

[3] EDRM Data Set. URL https://www.edrm.net/resources/data-sets/edrm-file-format-data-set/. last visited: 2019-12-05.

[4] NTFS FAQ (en). URL https://flatcap.org/linux-ntfs/info/ntfs.html#3.8. last visited: 2019-12-05.

[5] The Sleuth Kit, Library API Documentation. URL http://www.sleuthkit.org/sleuthkit/docs/api-docs/4.3/basicpage.html. last visited: 2019-12-05.

[6] Block cloning on ReFS, 2018. URL https://docs.microsoft.com/en-us/windows-server/storage/refs/block-cloning. last visited: 2019-12-05.

[7] File Attribute Constants, 2019. URL https://docs.microsoft.com/en-us/windows/win32/fileio/file-attribute-constants. last visited: 2019-12-05.

[8] Cory Altheide and Harlan Carvey. Digital forensics with open source tools. Elsevier, 2011.

[9] William Ballenthin. Updated ReFS Documentation, 2013. URL http://www.williballenthin.com/blog/2013/10/15/updated-refs-documentation/. last visited: 2018-11-05.

[10] Brian Carrier. The Sleuth Kit Informer, issue 20. URL http://www.sleuthkit.org/informer/sleuthkit-informer-20.html. last visited: 2019-12-05.

[11] Brian Carrier. Open Source Digital Forensics Tools: The Legal Argument 1. Technical report, @stake, 2002. https://www.researchgate.net/publication/240899558_Open_Source_Digital_Forensics_Tools_The_Legal_Argument_1.

[12] Brian Carrier. File system forensic analysis. Addison-Wesley Professional, 2005.

[13] Brian Carrier et al. Defining digital forensic examination and analysis tools using abstraction layers. International Journal of Digital Evidence, 1(4):1–12, 2003.

[14] Eoghan Casey. Digital evidence and computer crime: Forensic science, computers, and the internet. Academic Press, 2000.

[15] Jie Chen, Jun Wang, Zhi-hu Tan, and Changsheng Xie. Recursive Updates in Copy-on-write File Systems-Modeling and Analysis. JCP, 9(10):2342–2351, 2014.

[16] Elliot J. Chikofsky and James H. Cross. Reverse engineering and design recovery: A taxonomy. IEEE Software, 7(1):13–17, 1990.

[17] Rajsekhar Das. ReFS Support For SMR Drives. Presentation, SDC 2017, 2017. https://www.snia.org/sites/default/files/SDC/2017/presentations/smr/Das_Rajsekhar_ReFS_Support_For_Shingled_Magnetic_Recording_Drives.pdf.

[18] Chris Eagle. The IDA Pro Book. No Starch Press, 2011.


[19] Eldad Eilam. Reversing: secrets of reverse engineering. John Wiley & Sons, 2011.

[20] Ramez Elmasri and Shamkant Navathe. Fundamentals of database systems. Addison-Wesley Publishing Company, 2010.

[21] Simson Garfinkel, Paul Farrell, Vassil Roussev, and George Dinolt. Bringing science to digital forensics with standardized forensic corpora. Digital Investigation, 6:S2–S11, 2009.

[22] Henry Georges. Resilient Filesystem. Master’s thesis, NTNU, 2018. https://ntnuopen.ntnu.no/ntnu-xmlui/handle/11250/2502565.

[23] Matt Graeber. Data Source Analysis and Dynamic Windows RE using WPP and TraceLogging, 2019. URL https://posts.specterops.io/data-source-analysis-and-dynamic-windows-re-using-wpp-and-tracelogging-e465f8b653f7. last visited: 2019-12-05.

[24] Paul Green. Resilient File System (ReFS), Analysis of the File System found on Windows Server 2012. Technical report, Staffordshire University, 2013.

[25] Kurt Hansen and Fergus Toolan. Decoding the APFS file system. Digital Investigation, 22, 09 2017. doi: 10.1016/j.diin.2017.07.003. https://www.sciencedirect.com/science/article/pii/S1742287617301408.

[26] Andrew Head. Forensic Investigation of Microsoft’s Resilient File System (ReFS), 2015. URL http://resilientfilesystem.co.uk/. last visited: 2019-12-05.

[27] Christoph Hellwig. Reverse engineering an advanced filesystem. In Ottawa Linux Symposium, pages 191–196, 2002. https://www.kernel.org/doc/ols/2002/ols2002-pages-191-196.pdf.

[28] J. Horalek, V. Sobeslav, and R. Cimler. Verifying properties of Resilient File System. In Radioelektronika, 2015 25th International Conference, pages 389–394. IEEE, 2015. https://ieeexplore.ieee.org/abstract/document/7128971/.

[29] J.R. Tipton and Malcolm Smith. ReFS. Presentation, SDC 2012, 2012. https://www.snia.org/sites/default/orig/SDC2012/presentations/File_Systems/JRTipton_Next_Generaltion-3.pdf.

[30] Alex Kibkalo. ReFS file system in Windows Server 2012 / R2 and its future in vNext, 2015. URL https://de.slideshare.net/Vitalyns/refs-windows-server-2012r2-vnext. last visited: 2019-12-05.

[31] Don Marshall. Public and Private Symbols, 2017. URL https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/public-and-private-symbols. last visited: 2019-12-05.

[32] Joachim Metz. Resilient File System (ReFS), 2013. URL https://github.com/libyal/libfsrefs/blob/master/documentation/Resilient%20File%20System%20(ReFS).pdf. last visited: 2019-12-05.

[33] Microsoft Corporation. Application Compatibility with ReFS, 2012. URL https://www.microsoft.com/en-us/download/details.aspx?id=29043. last visited: 2019-12-05.

[34] Rune Nordvik, Henry Georges, Fergus Toolan, and Stefan Axelsson. Reverse engineering of ReFS. Digital Investigation, 30:127–147, 2019. ISSN 1742-2876. doi: https://doi.org/10.1016/j.diin.2019.07.004. URL http://www.sciencedirect.com/science/article/pii/S1742287619301252.

[35] Jonas Plum. Reverse Engineering proprietärer Dateisysteme. Master’s thesis, Technische Universität Darmstadt, 2016. https://blog.cugu.eu/files/pub/2016_01_masterthesis.pdf.


[36] Jonas Plum and Andreas Dewald. Forensic APFS File Recovery. In Proceedings of the 13th International Conference on Availability, Reliability and Security, ARES 2018, pages 47:1–47:10, New York, NY, USA, 2018. ACM. ISBN 978-1-4503-6448-5. doi: 10.1145/3230833.3232808. http://doi.acm.org/10.1145/3230833.3232808.

[37] Andreas Reuter and Jim Gray. Transaction processing: Concepts and techniques. Morgan Kaufmann Publishers, 1993.

[38] Ohad Rodeh. B-trees, Shadowing, and Clones. TOS, 3(4):2–1, 2008.

[39] Ohad Rodeh, Josef Bacik, and Chris Mason. BTRFS: The Linux B-tree filesystem. ACM Transactions on Storage (TOS), 9(3):9, 2013.

[40] John Rosenberg, Frans Henskens, Fred Brown, Ron Morrison, and David Munro. Stability in a persistent store based on a large virtual memory. In Security and Persistence, pages 229–245. Springer, 1990.

[41] M.E. Russinovich, D.A. Solomon, and A. Ionescu. Windows Internals. 6th edition, 2012. ISBN 9780735677272.

[42] Matthew S. Garson, Ravinder S. Thind, Darwin Ou-Yang, Karan Mehra, and Neal R. Christiansen. File system recognition structure, 2009. https://patentimages.storage.googleapis.com/e1/80/95/1a8eccce2f8eaf/US20100281299A1.pdf.

[43] Jim Salter. Bitrot and atomic COWs: inside “next-gen” filesystems, 2014. URL https://arstechnica.com/information-technology/2014/01/bitrot-and-atomic-cows-inside-next-gen-filesystems/. last visited: 2019-12-05.

[44] Eric Shanks. Microsoft’s Resilient File System (ReFS), 2014. URL https://theithollow.com/2014/01/13/microsofts-resilient-file-system-refs/. last visited: 2019-12-05.

[45] Robert Shullich. Reverse Engineering the Microsoft exFAT File System, 2009. URL https://www.sans.org/reading-room/whitepapers/forensics/reverse-engineering-microsoft--file-system-33274. last visited: 2019-12-05.

[46] Building the next generation file system for Windows: ReFS, 2012. URL https://blogs.msdn.microsoft.com/b8/2012/01/16/building-the-next-generation-file-system-for-windows-refs/. last visited: 2019-12-05.

[47] Alexander Sotirov. Reverse engineering Microsoft binaries. Recon 2006, 2006. http://www.phreedom.org/presentations/reverse-engineering-microsoft-binaries/reverse-engineering-microsoft-binaries.pdf.

[48] J.R. Tipton. ReFS v2, Cloning, projecting, and moving data. Presentation, SDC 2015, 2015. https://www.snia.org/sites/default/files/SDC15_presentations/file_sys/JRTipton_ReFS_v2.pdf.

[49] Lee Tobin, Ahmed Shosha, and Pavel Gladyshev. Reverse engineering a CCTV system, a case study. Digital Investigation, 11(3):179–186, 2014.
