1/26/2016

Motivation – Process Need

• Processes store, retrieve information • When process terminates, memory lost Distributed Computing Systems • How to make it persist? • What if multiple processes want to share?

File Systems • Requirements: – large Solution? Files – persistent are large, – concurrent access persistent!

Motivation – Disk Functionality (1 of 2) Motivation – Disk Functionality (2 of 2)

• Questions that quickly arise – How do you information? – How to map blocks to files? bs – sb – super block – How do you keep one user from reading another’s ? – How do you know which blocks are free?

Solution? Systems • Sequence of fixed-size blocks • Support reading and writing of blocks

Outline File Systems • Abstraction to disk (convenience) • Files (next) – “The only thing friendly about a disk is that it has • Directories persistent storage.” – Devices may be different: tape, USB, SSD, IDE/SCSI, • Disk space management NFS • Misc • Users • Example file systems – don’t care about implementation details – care about interface • OS – cares about implementation (efficiency and robustness)

1 1/26/2016

File System Concepts Files: The User’s Point of View

• Files - store the data • Naming: how does user refer to it? • Directories - organize files • Does case matter? Example: blah , BLAH , Blah – Users often don’t distinguish, and in much of Internet no • Partitions - separate collections of directories (also difference (e.g., domain name), but sometimes (e.g., URL called “volumes”) ) – all information kept in partition – Windows: generally case doesn’t matter, but is preserved – mount to access – : generally case matters • Protection - allow/restrict access for files, directories, • Does extension matter? Example: file.c , file.com – Software may distinguish (e.g., compiler for .cpp , partitions Windows Explorer for application association) – Windows: explorer recognizes extension for applications – Linux: extension ignored by system, but software may use defaults

Structure Type and Access • What’s inside? a) Sequence of bytes (most modern OSes (e.g., Linux, • Access Method: Windows)) – sequential (for character files, an abstraction of I/O of b) Records - some internal structure (rarely today) serial device, such as network/modem) c) - organized records within file space (rarely – random (for block files, an abstraction of I/O of block today) device, such as a disk) • Type: – ascii - human readable – binary - computer only readable – Allowed operations/applications (e.g., executable, c-file …) (typically via extension type or “magic number” (see next slide)

Determining File Type – file Determining File Type – Unix file (1 of 2) (2 of 2) $ cat TheLinuxCommandLine 1. Use () (see Project 1; see also • What is displayed? system program) %PDF-1.6^M%âãÏÓ^M 3006 0 obj^M<>stream^M stat ... – “mode” for special file (socket, sym , named pipes) • How to determine file type? ‰ file $ file grof 2. Try examining parts of file (“magic number”) • grof: PostScript document text – Bytes with specific format, specific location $ file Desktop/ 25 50 44 46, offset 0 ‰ PDF https://en.wikipedia.org/wiki • Desktop/: directory – Custom, too (see man magic ) /List_of_file_signatures $ file script.sh • script.sh: Bourne-Again shell script, ASCII, exe 3. Examine content $ file a.out – Text? ‰ Examine blocks for known types (e.g., ) • a.out: ELF 32-bit LSB executable, Intel 80386... $ file TheLinuxCommandLine 4. Otherwise ‰ data • TheLinuxCommandLine: PDF document, version 1.6

2 1/26/2016

Common Attributes System Calls for Files

• Create • Seek • Delete • Get attributes • Truncate • Set attributes • • Rename • • Append

Example: Program to Copy File (1 of 2) Example: Program to Copy File (2 of 2) “man ” tells what system include files are needed

Command line args.

Next ‰ Zoom in on open() system call

int in_fd = open (argv[1], O_RDONLY); /* open file for reading */

Example: Unix open() Unix open() – Under the Hood int open (char *path, int flags [, int mode]) int fid = open (“blah”, flags); read (fid, …); User Space

• path is name of file (NULL terminated string) System Space • flags is bitmap to set switch – O_RDONLY, O_WRONLY, O_TRUNC … 0 stdin 1 stdout – O_CREATE then use mode for permissions 2 stderr ... • success, returns index 3 File File Structure – On error, -1 and set errno (see Project 1) ... Descriptor ... (where blocks are) (index) (attributes) (Per file) (Per process) (Per device)

3 1/26/2016

Example: Windows CreateFile() Disk Partitions

• Returns file object handle: • Partition large group HANDLE CreateFile ( of sectors allocated lpFileName , // name of file for specific purpose dwDesiredAccess , // read-write • Specify number of dwShareMode , // shared or not cylinders to use lpSecurity , // permissions • Specify type ... ) • Linux: “ df –h” and • File objects used for all: files, directories, “fdisk ” disk drives, ports, pipes, sockets and console (“System Reserved” partition for Windows contains OS boot code and code to do HDD decryption, if set)

File System Layout MBR vs. GPT • BIOS reads in program (“bootloader”, e.g., grub ) from known disk location ( or GUID Partition Table ) • MBR = Master Boot Record • MBR /GPT has partition table (start, end of each partition) – Older standard, still in widespread use • Bootloader reads first block (“boot block”) of partition • GPT = GUID (Globally Unique ID) Partition Table • Boot block knows how to read next block and start OS – Newer standard • Rest can vary. Often “superblock” with details on file system • Both help OS know partition structure of hard disk – Type, number of blocks,… • Linux – default GPT (must use Grub 2), but can use MBR • Mac – default GPT . Can run on MBR disk, but can’t install on it • Windows – 64-bit support GPT . default (or GPT MBR , but Windows 8 & 10 default GPT see next)

Master Boot Record ( MBR ) GUID Partition Table ( GPT )

• Old standard, still widely in • Newest standard use • GUID = globally unique • At beginning of disk, hold identifiers information on partitions • Unlimited partitions (but most • Also code that can scan for OSes limit to 128) active OS and load up boot • Since 64-bit, 1 billion TB (1 code for OS Zettabyte) partition size • Only 4 partitions, unless 4 th is (Windows limit ~18 million TB) extended • table stored at end • 32-bit, so partition size • CRC32 checksums to detect limited to 2TB errors • If MBR corrupted ‰ trouble! • Protective MBR layer for apps that don’t know about GPT

4 1/26/2016

File System Implementation Example – Linux (1 of 3)

Process Open File Disk Control Block Table Table Each task_struct describes a process, refers to open file table File sys info // /usr/include/linux/sched.h Copy fd File struct task_struct { to mem descriptors Open volatile long state; File long counter; Directories Pointer long priority; Array … struct files_struct *files; // open file table (per process) (in memory Data … copy, one per } device)

Example – Linux (2 of 3) Example – Linux (3 of 3)

The files_struct data structure describes files Each open file is represented by file descriptor , refers to blocks of data on disk process has open, refers to file descriptor table struct file { mode_t f_mode; // /usr/include/linux/fs.h loff_t f_pos; struct files_struct { unsigned short f_flags; unsigned short f_count; int count; unsigned long f_reada, f_ramax, f_raend, f_ralen, f_rawin; fd_set close_on_exec; struct file *f_next, *f_prev; fd_set open_fds; int f_owner; struct *f_inode; // file descriptor (next) struct file *fd[NR_OPEN]; // file descriptor table struct file_operations *f_op; }; unsigned long f_version; void *private_data; };

File System Implementation Contiguous Allocation (1 of 2)

• Core data to track: which blocks with which • Store file as contiguous block – ex: w/ 1K block, 50K file has 50 consecutive blocks file? File A: start 0, length 2 File B: start 14, length 3 • File descriptor implementations: • Good : – Contiguous allocation – Easy: remember location with 1 number – Linked list allocation – Fast: read entire file in 1 operation (length) – Linked list allocation with index • Bad : – Static: need to know at creation – inode File Descriptor • Or tough to grow! – Fragmentation: remember why we had paging in memory? (Example next slide)

5 1/26/2016

Contiguous Allocation (2 of 2) Linked List Allocation • Keep linked list with disk blocks

null null File File File File File Block Block Block Block Block 0 1 2 0 1

Physical 4 7 2 6 3 Block

• Good : – Easy: remember 1 number (location) – Efficient: no space lost in fragmentation a) 7 files • Bad : b) 5 files (file D and F deleted) – Slow: random access bad

Linked List Allocation with Index inode

Physical Block

0 • Table in memory

1 – MS-DOS FAT, Win98 VFAT

2 null • Good: faster random access 3 null • Bad: can be large! e.g., 1 TB • Fast for small files disk, 1 KB blocks 4 7 • Can hold large files – Table needs 1 billion entries 5 – Each entry 3 bytes (say 4 typical) Typically 15 pointers • Number of pointers per block? Depends 6 3 • ‰ 4 GB memory! 12 to direct blocks upon block size and pointer size – – e.g., 1k byte block, 4 byte pointer ‰ each indirect 7 2 – 1 single indirect has 256 pointers Common format still (e.g., USB drives) since – 1 doubly indirect • Max size of file? Same – it depends upon block size and pointer size supported by many Oses & additional features – 1 triply indirect not needed – e.g., 4KB block, 4 byte pointer ‰ max size 2 TB

Linux File System: inode Outline // linux/include/linux/ext3_fs.h #define EXT3_NDIR_BLOCKS 12 // Direct blocks #define EXT3_IND_BLOCK EXT3_NDIR_BLOCKS + 1 // Indirect block index Files (done) #define EXT3_DIND_BLOCK EXT3_IND_BLOCK + 1 // Double-ind. block index • #define EXT3_TIND_BLOCK EXT3_DIND_BLOCK + 1 // Triple-ind. block index #define EXT3_N_BLOCKS EXT3_TIND_BLOCK + 1 // (Last index & total) • Directories (next) struct ext3_inode { __u16 i_mode; // File mode • Disk space management __u16 i_uid; // Low 16 bits of owner Uid __u32 i_size; // Size in bytes • Misc __u32 i_atime; // Access time __u32 i_ctime; // Creation time __u32 i_mtime; // Modification time • Example file systems __u32 i_dtime; // Deletion time __u16 i_gid; // Low 16 bits of group Id __u16 i_links_count; // Links count __u32 i_blocks; // Blocks count ... __u32 i_block[EXT3_N_BLOCKS]; // Pointers to blocks ... }

6 1/26/2016

Directory Layout - Paths Directory Implementation

• Two types of file system paths – • Just like files (“wait, what?”) Absolute & Relative – Have data blocks • Absolute – File descriptor to map which blocks to directory – Full path from root to file • But have special bit set so user process cannot modify – e.g., /home/joe/cs4513/hw1.pdf contents – e.g., C:\Users\home\Doc\hw1.pdf – Data in directory is information / links to files • Relative – Modify only through system call – (See .c ) – OS keeps track of process working directory for each process • Organized for: – Path relative to current working directory – Efficiency - locating file quickly – e.g., [working directory = /home/joe]: – Convenience - user patterns syllabus.docx ‰ /home/joe/syllabus.docx • Groups ( .c, .exe ), same names cs4513/hw1.pdf ‰ • Tree structure, directory the most flexible /home/joe/cs4513/hw1.pdf – User sees hierarchy of directories ./cs4513/hw1.pdf ‰ /home/joe/cs3513/hw1.pdf ../peter/hw1.pdf ‰ /home/peter/hw1.pdf 37

System Calls for Directories Directories

• Before reading file, must be opened • Create • Readdir • Directory entry provides information to get • Delete • Rename blocks • Opendir • Link – Disk location (blocks, address) • Closedir • • Map ASCII name to file descriptor

name block count

block numbers

Where are file attributes (e.g., owner, permissions) stored?

Options for Storing Attributes Windows (FAT) Directory

a) Directory entry has attributes (Windows) • Hierarchical directories b) Directory entry refers to file descriptor (e.g., inode), and descriptor has attributes (Unix) • Entry: – name - date – type (extension) - block number (w/FAT) – time

name type attrib time date block size

7 1/26/2016

Unix Directory Unix Directory Example

Root Directory Block 132 • Hierarchical directories Block 406 1 . 6 . inode 6 inode 26 26 . • Entry: inode name 1 .. 1 .. 6 .. 4 bin 26 bob – name 12 grants 7 dev 17 jeff – inode number (try “ ls –i” or “ ls –iad .”) 81 books 14 lib 14 sue 132 406 60 mbox • Example, say want to read data from below file 9 etc 51 sam 17 Linux /usr/bob/mbox 6 usr 29 mark 8 tmp Aha! Want contents of file, which is in blocks Looking up inode 60 bob gives Need file descriptor (inode) to get blocks Looking up has contents Contents of inode 26 usr gives Contents of of mbox How to find the file descriptor (inode)? inode 6 usr in bob in block 132 block 406

Note: handled by OS in system call to open() , not user or user code

Length of File Names Handling Long

• Each directory entry is name (and maybe attributes) plus descriptor • How long should file names be? • If fixed small, will hit limit (users don’t like) • If fixed large, may be wasted space (internal fragmentation) • Solution ‰ allow variable length names a) Compact (all in memory, so fast) on word boundary b) Heap to file

User Access to Same File in More than One Directory Possible Implementations

I. Directory entry contains disk blocks? B C (Instead of tree, – Contents (blocks) may change really have directed acyclic – What happens when blocks change? graph) A ? B C “” II. Directory entry points to file descriptor? – If removed, refers to non-existent file Possibilities for the “alias”: – Must keep count, remove only if 0 – Remember Linux ext3 inode? I. Directory entry contains disk __u16 i_links_count; // Links count blocks? Will review each – implementation II. Directory entry points to – Similar if delete file in use attributes structure? choice, next III. Have new type of file to redirect? Example: try “ ” and “ls -i”

8 1/26/2016

Possible Implementation (“ hard link ”) Possible Implementation (“ soft link ”)

III. Have new type of file to redirect? – New file only contains alternate name for file – Overhead, must parse tree second time – Soft link (or ) • Note, in Windows only viewable by graphic browser, are absolute paths, with metadata, can track even if move • Does have mklink (hard and soft ) for NTFS – Often have max link count in case loop (example) – What about soft link across partitions? (example)

a) Initial situation Example: try “ln -s” What about hard link across b) After link created partitions? (example) c) Original owner removes file (what if quotas?)

Need for Robust File Systems Crash Consistency Problem

• Consider upkeep for removing file • Disk guarantees that single sector writes are a. Remove file from directory entry atomic b. Return all disk blocks to pool of free disk blocks – But no way to make multi-sector writes atomic c. Release file descriptor (inode) to pool of free descriptors • How to ensure consistency after crash? • What if system crashes in middle? 1. Don’t bother to ensure consistency • Accept that the file system may be inconsistent after crash – a) inode becomes orphaned ( lost+found , 1 per • Run program that fixes file system during bootup partition) • File system checker (e.g., fsck ) – b) blocks free and allocated 2. Use transaction log to make multi-writes atomic – if flip steps, blocks/descriptor free but directory entry • Log stores history of all writes to disk exists! • After crash log “replayed” to finish updates • Crash consistency problem •

52

File System Checker – Journaling File Systems the Good and the Bad 1. Write intent to do actions (a-c) to log before starting • Advantages of File System Checker – Option - read back to verify integrity before continue – Doesn’t require file system to do any work to ensure consistency 2. Perform operations – Makes file system implementation simpler 3. Erase log • Disadvantages of File System Checker Block Block Block – Complicated to implement fsck program Superblock Journal … Group 0 Group 1 Group N • Many possible inconsistencies that must be identified • Many difficult corner cases to consider and handle – Usually super slow • If system crashes, when restart read log and apply • Scans entire file system multiple times • Consider really large disks, like 400 TB RAID array! operations • Logged operations must be idempotent (can be

53 repeated without harm)

9 1/26/2016

Journaling Example Commits and Checkpoints

• Assume appending new data block (D 2) to file • Transaction committed after all writes to log complete – 3 writes: inode v2 , data bitmap v2 (next topic), data D 2 • Before executing writes, first log them • After transaction is committed, OS checkpoints update

Committed! TxB TxE I v2 B v2 D ID=1 2 ID=1

Journal Journal TxB I v2 B v2 D2 TxE

Inode Data 1. Begin new transaction with unique ID=1 Data BlocksCheckpointed! Bitmap Bitmap 2. Write updated meta-data block v1v2 D D 3. Write file data block 1 2 4. Write end-of-transaction with ID= 1 • Final step: free checkpointed transaction 55 56

Journal Implementation Crash Recovery (1 of 2)

• What if system crashes during logging? • Journals typically implemented as circular – If transaction not committed, data lost buffer – But, file system remains consistent! – Journal is append-only • OS maintains pointers to front and back of Journal TxB I v2 B v2 D2 transactions in buffer Inode Data Inodes Data Blocks – As transactions are freed, back moved up Bitmap Bitmap v1 • Thus, contents of journal are never deleted, D1 just overwritten over time

57 58

Journaling – Crash Recovery (2 of 2) The Good and the Bad • What if system crashes during checkpoint? • Advantages of journaling – File system may be inconsistent – During reboot, transactions committed but not free replayed in – Robust, fast file system recovery order • No need to scan entire journal or file system – Thus, no data is lost and consistency restored! – Relatively straight forward to implement • Disadvantages of journaling Journal TxB I v2 B v2 D TxE 2 – Write traffic to disk doubled Inode Data Inodes Data Blocks • Especially file data, which is probably large Bitmap Bitmap v1v2 D1 D2

59 60

10 1/26/2016

Meta-Data Journaling Crash Recovery Redux (1 of 2)

• Most expensive part of data journaling writing file data • What if system crashes during logging? twice – If transaction not committed, data lost – Meta-data small (<1 block), file data is large – D2 will eventually be overwritten • So, only journal meta-data – File system remains consistent

Journal TxB I v2 B v2 TxE Journal TxB I v2 B v2 Inode Data Inodes Data Blocks Bitmap Bitmap Inode Data Inodes Data Blocks v1v2 Bitmap Bitmap D1 D2 v1 D1 D2

61 62

Crash Recovery Redux (2 of 2) Journaling Summary • What if system crashes during checkpoint? • Today, most OSes use journaling file systems – File system may be inconsistent – During reboot, transactions committed but not free replayed in – ext3/ on Linux order – NTFS on Windows – Thus, no data lost and consistency is restored • Provides crash recovery with relatively low space and performance overhead Journal TxB I v2 B v2 TxE • Next-gen OSes likely move to file systems with Inode Data Inodes Data Blocks Bitmap Bitmap copy-on-write semantics v1v2 – and on Linux D1 D2

63 64

Outline Disk Space Management • n bytes ‰ choices: • Files (done) 1. contiguous 2. blocks • Directories (done) • Similarities with memory management • Disk space management (next) – contiguous is like variable-sized partitions Misc • but compaction by moving on disk very slow! • • so use blocks • Example file systems – blocks are like paging (can be wasted space) • how to choose block size? • (Note, physical disk block size typically 512 bytes, but file system logical block size chosen when formatting) • Depends upon size of files stored

11 1/26/2016

File Sizes in Practice (1 of 2) File Sizes in Practice (2 of 2)

Claypool Office PC Linux Ubuntu March 2014 • (VU – University circa 2005, Web – Commercial Web server 2005) • Files trending larger. But most small. What are the tradeoffs?

Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639

Choosing Block Size Disk Performance and Efficiency • Large blocks – faster throughput, less seek time, more data per read Data Rate – wasted space (internal fragmentation) Utilization • Small blocks – less wasted space – more seek time since more blocks to access same data Disk Space Utilization

• Assume 4 KB files. • At crossover (~64 KB), only 6.6 MB/sec, Efficiency 7% (both bad) Data Rate • However ‰ Most disk block sizes not larger than paging system hardware – On x86 is 4K ‰ most file systems pick 1KB – 4 KB Block size

Tanenbaum, Modern Operating Systems 3 e, (c) 2008 Prentice-Hall, Inc. All rights reserved. 0-13-6006639

Keeping Track of Free Blocks Keeping Track of Free Blocks a) Linked list of free blocks – 1K block, 32 bit disk block number = 255 free blocks/block (one number points to next block) – 500 GB disk has 488 millions disk blocks • About 1,900,000 1 KB blocks to track free blocks b) Bitmap of free blocks – 1 bit per block, represents free or allocated – 500 GB disk needs 488 million bits • About 60,000 1 KB blocks to track free blocks a) Linked-list of free blocks b) Bitmap of free blocks

12 1/26/2016

Tradeoffs File System Performance • DRAM ~5 nanoseconds, Hard disk ~5 milliseconds • Bitmap usually smaller since 1-bit per block rather than 32 – Disk access 1,000,000x slower than memory! bits per block ‰ Reduce number of disk accesses needed • Only if disk is nearly full does linked list require fewer • Block/buffer cache blocks – Cache to memory – But linked-list blocks are not needed for space (they are free) • Full cache? Replacement algorithms use: FIFO, LRU, – Only matters for some maintenance (e.g., consistency checking) 2nd chance … • If enough RAM, bitmap method preferred since provides – Exact LRU can be done (why?) locality, too Pure LRU inappropriate sometimes ‰ e.g., some • If only 1 “block” of RAM, and disk is full, bitmap method • may be inefficient since have to load multiple blocks to find blocks are “more important” than others free space – Block heavily used always in memory – Linked list can take first in line – Crash w/inode can lead to inconsistent state – Some rarely referenced (double indirect block)

Modified LRU Outline • Files (done) Is block likely to be needed soon? • • Directories (done) – If no, put at beginning of list • Disk space management (done) • Is block essential for consistency of file Misc (next) system? • – partitions ( fdisk , mount ) – Write immediately – maintenance • Occasionally write out all – quotas – sync • Example file systems

Partitions Mount isn’t Just for Bootup • mount , unmount • When plug storage devices – pick access point in file-system into running system, – load super-block from disk / (root) mount executed in • Super-block background – e.g., plugging in USB stick – file system type usr tmp home – block size • What does it mean to “safely eject” device? – free blocks – Flush cached writes to that – free file descriptors (inodes) device – Cleanly unmount file system on that device

78

13 1/26/2016

File System Maintenance Defragmenting (Example, 1 of 2) • Format: – Create file system structure: super block, descriptors (inodes) – format (Windows ), mke2fs (Linux ) (e.g., Windows : “ format /? ” and Linux : “ man mke2fs ”) • “Bad blocks” – Most disks have some (even when brand new) – chkdsk (Win , or properties->tools->error checking) or badblocks (Linux ) – Add to “bad-blocks” list (file system can ignore) • Defragment (see picture next slide) – Arrange blocks allocated to files efficiently • Scanning (when system crashes) – lost+found , correcting file descriptors...

Defragmenting (Example, 2 of 2) Disk Quotas

• Table 1: Open file table in memory – When file size changed, charged to user – User index to table 2 • Table 2: quota record – Soft limit checked, exceed allowed w/warning – Hard limit never exceeded • Limit: blocks & file descriptors (inodes) – Running out of inodes as bad as running out of blocks • Overhead? Again, in memory

Outline Linux File System

• Files (done) • Directories (done) • Disk space management (done) • Misc (done) • Example systems (next) • Virtual FS allows loading of many different FS, without changing – Linux process interface – Still have struct file_struct , open() , creat(), … – Windows • When build/install, FS choices ‰ ext3/4, hfps, DOS, NFS, NTFS, smbfs, is9660, … (about 2 dozen) • ext4 is “default” for many, most popular (but ext3 still widely in use, e.g., CCC)

14 1/26/2016

Linux File System: extfs Linux File System: inodes (1 of 2)

• “Extended” (from ) file system, version 2 – (Minix a Unix-like teaching OS by Tanenbaum) • ext2fs • Uses inodes – Long file names, long files, better performance – mode for file, – Main for many years directory, • ext3fs – Larger files (2 TB), Larger file system (32 TB) symbolic link – Fully compatible with ... – Adds journaling • ext4fs – Larger files (16 TB), Larger file systems (1 EB) – Extents (for free space management) – Improved perf (multi-block allocation, journal checksum…)

Linux File System: inodes (2 of 2) Linux File System: Blocks % sudo tune2fs -l /dev/sda1 | grep Block Block count: 60032256 • Default block size Block size: 4096 Blocks per group: 32768 • For higher performance – Performs I/O in chunks (reduce requests) – Clusters adjacent requests (block groups ) • Group has:

struct ext3_inode { – Bit-map of free blocks and free inodes __u16 i_mode; // File mode __u16 i_uid; // Low 16 bits of owner Uid – Copy of super block __u32 i_size; // Size in bytes __u32 i_atime; // Access time __u32 i_ctime; // Creation time __u32 i_mtime; // Modification time __u32 i_dtime; // Deletion time __u16 i_gid; // Low 16 bits of group Id __u16 i_links_count; // Links count __u32 i_blocks; // Blocks count ... __u32 i_block[EXT3_N_BLOCKS]; // Pointers to blocks ... }

Linux File System: Directories Linux File System: Unified • Directory just special file with names and inodes

• (left) separate file trees (ala Windows) • (right) after mounting “DVD” under “b” Linux

15 1/26/2016

Linux Filesystem: /proc Linux Filesystem: ext3fs & ext4fs • Contents of “files” not stored, but computed • Journaling – internal structure assured • Provide interface to kernel statistics – Journal (lowest risk) - Both metadata and file contents written to journal before being committed. • Most read only, access • Roughly, write twice (journal and data) using Unix text tools – Ordered (medium risk) - Only metadata, not file contents. – e.g., cat /proc/cpuinfo | grep model Guarantee write contents before journal committed • Enabled by • Often the default “” – Writeback (highest risk) - Only metadata, not file contents. Contents might be written before or after the journal is (Windows has perfmon ) updated. So, files modified right before crash can be corrupted (Show examples No built-in defragmentation tools • e.g., cd /proc/self ) – Probably not much needed yukon% sudo fsck -nvf /dev/sda1 … since blocks grouped 942826 inodes used (6.28%) 1138 non-contiguous files (0.1%) 821 non-contiguous directories (0.1%)

Windows New Technology File System: NTFS: Fundamental Concepts NTFS • Background: Windows had (has) FAT • File names limited to 255 characters • FAT-16, FAT-32 • Full paths limited to 32,000 characters – 16-bit addresses, so limited disk partitions (2 GB) • File names in unicode (other languages, 16- – 32-bit can support 2 TB bits per character) – No security • Case sensitive names (“Foo” different than • NTFS default in Win XP and later “FOO”) – 64-bit addresses – But Win32 API does not fully support

NTFS: Fundamental Concepts NTFS: Fundamental Concepts

• File not sequence of bytes, but multiple • Hierarchical, with “\” as component separator attributes, each a stream of bytes Throwback for MS-DOS to support /M • Example: – microcomputer OS – One stream name (short) – One stream id (short) • Supports “aliases” (links), but only for POSIX – One stream data (long) subsystem – But can have more than one long stream • Streams can have metadata (e.g., thumbnail image) • Streams fragile, and not always preserved by utilities over network or when copied/backed up

16 1/26/2016

NTFS: File System Structure NTFS: Storage Allocation

• Basic allocation unit called a cluster (block) – Sizes from 512 bytes to 64 Kbytes (most 4 KBytes) – Referred to by offset from start, 64-bit number • Each has Master File Table (MFT) – Sequence of 1 KByte records – Bitmap to keep track of which MFT records are free • Each MFT record – Unique ID - MFT index, and “version” for caching and consistency – Contains attributes (name, length, value) – If number of extents small enough, whole entry stored in MFT (faster access) • Bitmap to keep track of free blocks • Disk blocks kept in runs (extents), when • Extents to keep clusters of blocks possible

NTFS: Storage Allocation NTFS: Directories • Name plus pointer to record with file system entry • Also cache attributes (name, sizes, update) for faster directory listing • If few files, entire directory in MFT record

• If file too large, can link to another MFT record

NTFS: Directories NTFS: File Compression • Transparent to user • But if large, linear search can be slow – Can be created (set) in compressed mode • Compresses (or not) in 16-block chunks • Store directory info (names, perms, …) in B+ tree – Every path from root to leaf “costs” the same

– Insert, delete, search all O(log FN) • F is the “fanout” (typically 3)

– Faster than linear search O(N) versus O(log FN) – Doesn’t need reorganizing like binary tree

17 1/26/2016

NTFS: Journaling

• Many file systems lose metadata (and data) if powerfailure – fsck , chkdsk when reboot – Can take a looong time and lose data • lost+found • Recover via “transaction” model – Log file with redo and undo information – Start transactions, operations, commit – Every 5 seconds, checkpoint log to disk – If crash, redo successful operations and undo those that don’t commit • Note, doesn’t cover user data, only meta data

18