Outline for today

Administrative:
– Still groups that need to sign up for demos.
– Next time: midterm will be returned and discussed.
Objective:
– NTFS – another case study
– Distributed File Systems

NTFS: Another Case Study

NTFS Files

• Files consist of multiple “streams”
  – Filename is a stream
  – One long unnamed data stream
  – Other data streams possible
    • Image in unnamed stream
    • Thumbnail in named stream – Filename:stream1

NTFS System Calls

• API functions for file I/O in Windows 2000
• Second column gives nearest UNIX equivalent

File System API Example

A program fragment for copying a file using the Windows 2000 API functions
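The original fragment is not reproduced here. As a rough, hedged sketch of the kind of code the slide shows, the following copies a file with the Win32 calls CreateFile, ReadFile, WriteFile, and CloseHandle; the function name, buffer size, and error handling are illustrative, not taken from the slide.

    /* Sketch: copy file src to dst with Win32 API calls (illustrative only). */
    #include <windows.h>

    int copy_file(const char *src, const char *dst)
    {
        HANDLE in, out;
        char buf[4096];
        DWORD nread, nwritten;

        in = CreateFileA(src, GENERIC_READ, FILE_SHARE_READ, NULL,
                         OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (in == INVALID_HANDLE_VALUE) return -1;

        out = CreateFileA(dst, GENERIC_WRITE, 0, NULL,
                          CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
        if (out == INVALID_HANDLE_VALUE) { CloseHandle(in); return -1; }

        /* Read a block at a time and write it to the destination. */
        while (ReadFile(in, buf, sizeof(buf), &nread, NULL) && nread > 0) {
            if (!WriteFile(out, buf, nread, &nwritten, NULL) || nwritten != nread)
                break;
        }

        CloseHandle(in);
        CloseHandle(out);
        return 0;
    }

Note that the same CreateFileA call can open a named stream ("file.jpg:thumbnail") just like the unnamed one, which is how the alternate streams above are reached.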

Directory System Calls

• API functions for directory management in Windows 2000
• Second column gives nearest UNIX equivalent, when one exists

File System Structure

• Each volume has its own MFT
• 4KB blocks
• Files and directories described by 1KB records in the MFT
• Attribute/value pairs

The NTFS master file table

File System Structure

The attributes used in MFT records

File System Structure

An MFT record for a three-run, nine-block file
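To make the run idea concrete, here is a small sketch; the struct layout and block numbers are invented for illustration and are not NTFS's on-disk format. Each run records a starting disk block and a length, and a file-relative block number is mapped to a disk block by walking the run list.

    #include <stdio.h>

    /* Illustrative only: a "run" of contiguous disk blocks, as in an MFT record. */
    struct run { long disk_start; long length; };

    /* Map a file-relative block number to a disk block via the run list.
       Returns -1 if the block is beyond the end of the file. */
    long block_to_disk(struct run *runs, int nruns, long fileblock)
    {
        for (int i = 0; i < nruns; i++) {
            if (fileblock < runs[i].length)
                return runs[i].disk_start + fileblock;
            fileblock -= runs[i].length;
        }
        return -1;
    }

    int main(void)
    {
        /* A three-run, nine-block file, like the figure (addresses made up). */
        struct run runs[3] = { {20, 4}, {64, 2}, {80, 3} };
        printf("file block 5 -> disk block %ld\n", block_to_disk(runs, 3, 5));
        return 0;
    }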

File System Structure

A file that requires three MFT records to store its runs

Immediate Files

MFT record fields: std info header, filename, data, unused.

Data resides directly in entry for short file.

File System Structure

The MFT record for a small directory.

File Name Lookup

Steps in looking up the file C:\maria\web.htm, starting in the Object Manager name space

File Compression

(a) An example of a 48-block file being compressed to 32 blocks. (b) The MFT record for the file after compression.

File Encryption

Operation of the encrypting file system: the file key K is protected with the user's public key and retrieved when the file is decrypted.

Caching in Windows 2000

The path through the cache to the hardware

Comparisons

Comparison of FFS, LFS, and NTFS:
• Data: FFS – clustering, cylinder grouping; LFS – log, contiguous by temporal ordering; NTFS – runs of contiguous blocks possible, immediate files
• Directories: FFS – directory nodes, cylinder grouping; LFS – directory nodes in log; NTFS – they are MFT entries
• Block indices: FFS – inodes, at a specified location in the cylinder group; LFS – inodes in log; NTFS – in MFT entries for files

Distributed File Systems

Distributed File Systems

• Naming
  – Location transparency / independence
• Caching
  – Consistency
• Replication
  – Availability and updates

Naming

• \\His\d\pictures\castle.jpg
  – Not location transparent: both machine and drive are embedded in the name.
• NFS mounting
  – Remote directory mounted over a local directory in the local naming hierarchy.
  – /usr/m_pt/A
  – No global view
(Figure: her local directory tree and his local directory tree, and her tree after his for_export directory is mounted at her /usr/m_pt.)

Global Name Space

Example: Andrew File System

Figure: the root / contains afs, tmp, bin, and lib. tmp, bin, and lib are local files; /afs holds shared files and looks identical to all clients.

VFS: the Filesystem Switch

Sun introduced the virtual file system framework in 1985 to accommodate NFS cleanly.
• VFS allows diverse specific file systems to coexist in a file tree, isolating all FS-dependencies in pluggable filesystem modules.

VFS was an internal kernel restructuring with no effect on the syscall interface. It incorporates object-oriented concepts: a generic procedural interface with multiple implementations.
(Figure: the syscall layer (file, uio, etc.) sits above the Virtual File System (VFS), which dispatches to NFS, FFS, LFS, *FS, etc., alongside the network protocol stack (TCP/IP) and the device drivers.)

Other abstract interfaces in the kernel: device drivers, file objects, executable files, memory objects.

Vnodes

In the VFS framework, every file or directory in active use is represented by a vnode object in kernel memory.


Each vnode has a standard file attributes struct.

• The generic vnode points at a filesystem-specific struct (e.g., inode, rnode), seen only by the filesystem.
• Active vnodes are reference-counted by the structures that hold pointers to them, e.g., the system open file table.
• Vnode operations are macros that vector to filesystem-specific procedures.
• Each specific file system (e.g., NFS, UFS) maintains a hash of its resident vnodes.

Example: Network File System (NFS)

Figure: on the client, user programs enter the syscall layer and VFS, which dispatches to UFS or to the NFS client; the NFS client talks over the network to the NFS server, which calls through the server's syscall layer/VFS into UFS.

Vnode Operations and Attributes

vnode/file attributes (vattr or fattr):
  type (VREG, VDIR, VLNK, etc.)
  mode (9+ bits of permissions)
  nlink (link count)
  owner user ID
  owner group ID
  filesystem ID
  unique file ID
  file size (bytes and blocks)
  access time
  modify time
  generation number

directories only:
  vop_lookup (OUT vpp, name)
  vop_create (OUT vpp, name, vattr)
  vop_remove (vp, name)
  vop_link (vp, name)
  vop_rename (vp, name, tdvp, tvp, name)
  vop_mkdir (OUT vpp, name, vattr)
  vop_rmdir (vp, name)
  vop_readdir (uio, cookie)
  vop_symlink (OUT vpp, name, vattr, contents)
  vop_readlink (uio)

files only:
  vop_getpages (page**, count, offset)
  vop_putpages (page**, count, sync, offset)
  vop_fsync ()

generic operations:
  vop_getattr (vattr)
  vop_setattr (vattr)
  vhold()
  vholdrele()
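A minimal sketch of the "generic interface, multiple implementations" idea; the types and names are simplified, and real vnode layers carry many more fields and operations. Each vnode points to a table of filesystem-specific procedures, and the generic VOP_* entry points dispatch through that table.

    #include <stdio.h>

    struct vnode;   /* forward declaration */

    /* Filesystem-specific operations table (greatly simplified). */
    struct vnodeops {
        int (*vop_lookup)(struct vnode *dvp, const char *name, struct vnode **vpp);
        int (*vop_getattr)(struct vnode *vp);
    };

    struct vnode {
        int refcount;                 /* active vnodes are reference-counted     */
        const struct vnodeops *ops;   /* UFS, NFS, ... supply their own table    */
        void *fs_data;                /* filesystem-specific struct (inode, rnode) */
    };

    /* Generic dispatch: the moral equivalent of the vop_* macros. */
    static int VOP_LOOKUP(struct vnode *dvp, const char *name, struct vnode **vpp)
    {
        return dvp->ops->vop_lookup(dvp, name, vpp);
    }

    /* A toy "filesystem" implementation of lookup. */
    static int toyfs_lookup(struct vnode *dvp, const char *name, struct vnode **vpp)
    {
        printf("toyfs: lookup of \"%s\"\n", name);
        *vpp = NULL;
        return -1;  /* not found */
    }

    static const struct vnodeops toyfs_ops = { toyfs_lookup, NULL };

    int main(void)
    {
        struct vnode root = { 1, &toyfs_ops, NULL };
        struct vnode *vp;
        VOP_LOOKUP(&root, "tmp", &vp);
        return 0;
    }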

Pathname Traversal

• When a pathname is passed as an argument to a system call, the syscall layer must “convert it to a vnode”.
• Pathname traversal is a sequence of vop_lookup calls to descend the tree to the named file or directory.

Example: open(“/tmp/zot”)
  vp = get vnode for / (rootdir)
  vp->vop_lookup(&cvp, “tmp”);
  vp = cvp;
  vp->vop_lookup(&cvp, “zot”);

Issues:
1. crossing mount points
2. obtaining root vnode (or current dir)
3. finding resident vnodes in memory
4. caching name->vnode translations
5. symbolic (soft) links
6. disk implementation of directories
7. locking/referencing to handle races with name create and delete operations
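The open("/tmp/zot") steps above generalize to a loop over path components. The self-contained sketch below uses a stubbed lookup function and ignores mount points, symlinks, the name cache, and locking (issues 1-7 above); it only shows the shape of the loop.

    #include <stdio.h>
    #include <string.h>

    struct vnode { const char *name; };        /* stand-in for a real vnode */
    static struct vnode rootdir = { "/" };

    /* Stub for vp->vop_lookup(&cvp, component): always "succeeds" here. */
    static int vop_lookup(struct vnode *dvp, const char *component,
                          struct vnode **vpp)
    {
        static struct vnode result;
        printf("lookup \"%s\" in \"%s\"\n", component, dvp->name);
        result.name = component;
        *vpp = &result;
        return 0;
    }

    /* Convert a pathname to a vnode by repeated lookups, one per component. */
    static struct vnode *namei(const char *path)
    {
        char buf[256], *component, *save;
        struct vnode *vp = &rootdir, *cvp;

        strncpy(buf, path, sizeof(buf) - 1);
        buf[sizeof(buf) - 1] = '\0';
        for (component = strtok_r(buf, "/", &save); component != NULL;
             component = strtok_r(NULL, "/", &save)) {
            if (vop_lookup(vp, component, &cvp) != 0)
                return NULL;      /* component not found */
            vp = cvp;             /* descend one level   */
        }
        return vp;
    }

    int main(void)
    {
        namei("/tmp/zot");
        return 0;
    }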

Hints

• A valuable distributed systems design technique that can be illustrated in naming.
• Definition: information that is not guaranteed to be correct. If it is, it can improve performance. If not, things will still work OK. Must be able to validate the information.
• Example: Sprite prefix tables

Prefix Tables

Example prefix table (colors in the figure indicate which server holds each subtree):
  /A/m_pt1 -> blue
  /A/m_pt1/usr/B -> pink
  /A/m_pt1/usr/m_pt2 -> pink
A lookup of /A/m_pt1/usr/m_pt2/stuff.below matches its longest prefix in the table and is sent to the corresponding (pink) server.
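A prefix table is just a longest-matching-prefix map from pathname prefixes to servers. A sketch of that lookup, with the table contents taken from the figure and the color names standing in for real server addresses:

    #include <stdio.h>
    #include <string.h>

    struct prefix_entry { const char *prefix; const char *server; };

    /* Prefix table from the figure: colors stand in for server addresses. */
    static struct prefix_entry table[] = {
        { "/A/m_pt1",           "blue" },
        { "/A/m_pt1/usr/B",     "pink" },
        { "/A/m_pt1/usr/m_pt2", "pink" },
    };

    /* Return the server for the longest matching prefix of `path`, or NULL
       if no prefix matches (the hint failed and must be refreshed). */
    const char *lookup_server(const char *path)
    {
        const char *best = NULL;
        size_t bestlen = 0;
        for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
            size_t len = strlen(table[i].prefix);
            if (len > bestlen && strncmp(path, table[i].prefix, len) == 0) {
                best = table[i].server;
                bestlen = len;
            }
        }
        return best;
    }

    int main(void)
    {
        printf("%s\n", lookup_server("/A/m_pt1/usr/m_pt2/stuff.below")); /* pink */
        return 0;
    }

The table is a hint in exactly the sense above: if an entry is wrong, the contacted server rejects the request and the client refetches the mapping; nothing breaks, it is just slower.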

Distributed File Systems

• Naming
  – Location transparency / independence
• Caching
  – Consistency
• Replication
  – Availability and updates

Caching was “The Answer”

• Avoid the disk for as many file operations as possible.
• The cache acts as a filter for the requests seen by the disk - reads served best.
• Delayed writeback will avoid going to disk at all for temp files.

Caching in Distributed F.S.

• Location of cache on client - disk or memory
• Update policy
  – write through
  – delayed writeback
  – write-on-close
• Consistency
  – Client does validity check, contacting the server
  – Server call-backs

File Cache Consistency

Caching is a key technique in distributed systems. The cache consistency problem: cached data may become stale if the data is updated elsewhere in the network.
Solutions:
• Timestamp invalidation (NFS): timestamp each cache entry and periodically ask the server “has this file changed since time t?”; invalidate the cache entry if it is stale.
• Callback invalidation (AFS): request notification (a callback) from the server if the file changes; invalidate the cache entry on callback.
• Leases (NQ-NFS) [Gray&Cheriton89]

Sun NFS Cache Consistency

• Server is stateless
• Requests are self-contained.
• Blocks are transferred and cached in memory.
• Timestamp of last known modification (ti) kept with the cached file, compared with the “true” timestamp (tj) at the server on Open. (Good for an interval)
• Updates delayed but flushed before Close ends.
(Figure: on open, the client checks ti == tj? against the server; writes are flushed by close.)
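A sketch of the open-time check described above; the field names, stubbed GETATTR call, and freshness window are illustrative, not Sun's actual client code. The client keeps the modification time it last saw, asks the server for the current attributes on open, and drops its cached blocks if the times disagree.

    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    /* Illustrative client-side cache entry for one file. */
    struct cache_entry {
        char   path[256];
        time_t cached_mtime;   /* ti: mod time of the data we cached      */
        time_t last_checked;   /* when we last compared with the server   */
        int    valid;
    };

    #define FRESHNESS_WINDOW 3 /* seconds; "good for an interval" */

    /* Stand-in for the RPC that fetches the server's timestamp tj. */
    static time_t server_getattr_mtime(const char *path)
    {
        (void)path;
        return 1700000000;     /* a made-up server mtime for the demo */
    }

    /* Called on open(): revalidate the cached copy against the server. */
    static void revalidate_on_open(struct cache_entry *e)
    {
        time_t now = time(NULL);
        if (e->valid && now - e->last_checked < FRESHNESS_WINDOW)
            return;                              /* trust the recent check    */

        time_t tj = server_getattr_mtime(e->path);
        if (tj != e->cached_mtime) {             /* ti == tj ?                */
            e->valid = 0;                        /* stale: drop cached blocks */
            e->cached_mtime = tj;
        }
        e->last_checked = now;
    }

    int main(void)
    {
        struct cache_entry e = { "", 1600000000, 0, 1 };
        strcpy(e.path, "/home/maria/web.htm");
        revalidate_on_open(&e);
        printf("valid after open? %d\n", e.valid);   /* 0: timestamps differ */
        return 0;
    }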

Cache Consistency for the Web

• Time-to-Live (TTL) fields - HTTP “Expires” header
• Client polling - HTTP “If-Modified-Since” request headers
  – polling frequency? possibly adaptive (e.g., based on age of object and assumed stability)
(Figure: clients on a LAN share a proxy cache in front of the Web server.)


AFS Cache Consistency

• Server keeps state of all clients holding copies (the copy set, e.g., {c0, c1})
• Callbacks when cached data are about to become stale
• Large units (whole files or 64K portions)
• Updates propagated upon close
• Cache on local disk & memory

• If client crashes, revalidation on recovery (lost callback possibility)

NQ-NFS Leases

In NQ-NFS, a client obtains a lease on the file that permits the client's desired read/write activity. “A lease is a ticket permitting an activity; the lease is valid until some expiration time.”
– A read-caching lease allows the client to cache clean data. Guarantee: no other client is modifying the file.
– A write-caching lease allows the client to buffer modified data for the file. Guarantee: no other client has the file cached.
Leases may be revoked by the server if another client requests a conflicting operation (the server sends an eviction notice). Since leases expire, losing the “state” of leases at the server is OK.
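A sketch of the bookkeeping a lease-granting server might do; the struct and conflict rule are simplified from the description above and are not NQ-NFS's actual data structures. A lease records its holder, whether it covers writes, and an expiration time; a conflicting request either waits for expiration or triggers an eviction notice.

    #include <stdio.h>
    #include <time.h>

    enum lease_type { LEASE_NONE, LEASE_READ, LEASE_WRITE };

    struct lease {
        enum lease_type type;
        int             holder;    /* client id */
        time_t          expires;   /* "valid until some expiration time" */
    };

    /* Does a new request conflict with an existing, unexpired lease?
       Reads conflict only with a write lease held by someone else;
       writes conflict with any lease held by someone else. */
    int conflicts(const struct lease *l, enum lease_type want, int client, time_t now)
    {
        if (l->type == LEASE_NONE || now >= l->expires || l->holder == client)
            return 0;
        if (want == LEASE_READ)
            return l->type == LEASE_WRITE;
        return 1;  /* want == LEASE_WRITE */
    }

    int main(void)
    {
        struct lease l = { LEASE_READ, /*holder*/ 1, time(NULL) + 30 };
        /* Client 2 wants to write: conflict, so the server would send an
           eviction notice (or simply let the lease expire). */
        printf("conflict? %d\n", conflicts(&l, LEASE_WRITE, 2, time(NULL)));
        return 0;
    }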

NFS Protocol

NFS is a network protocol layered above TCP/IP.
– Original implementations (and most today) use UDP datagram transport for low overhead.
  • Maximum IP datagram size was increased to match the FS block size, to allow send/receive of entire file blocks.
  • Some newer implementations use TCP as a transport.
The NFS protocol is a set of message formats and types.
• Client issues a request message for a service operation.
• Server performs the requested operation and returns a reply message with status and (perhaps) requested data.

File Handles

Question: how does the client tell the server which file or directory the operation applies to?
– Similarly, how does the server return the result of a lookup?
• More generally, how do we pass a pointer or an object reference as an argument/result of an RPC call?
In NFS, the reference is a file handle or fhandle, a 32-byte token/ticket whose value is determined by the server.
– It includes all the information needed to identify the file/object on the server and get a pointer to it quickly: a volume ID, an inode #, and a generation #.
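The three fields named above suggest a sketch like the following; the field widths and padding are illustrative, since a real fhandle is an opaque token whose layout only the server understands.

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative fhandle contents; clients treat the whole thing as opaque. */
    struct fhandle {
        uint32_t volume_id;     /* which exported volume/filesystem            */
        uint32_t inode_number;  /* which file on that volume                   */
        uint32_t generation;    /* detects reuse of an inode number after a
                                   delete, so stale handles can be rejected    */
        uint8_t  opaque[20];    /* server-private data/padding up to 32 bytes  */
    };

    int main(void)
    {
        printf("fhandle is %zu bytes\n", sizeof(struct fhandle));  /* 32 */
        return 0;
    }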

NFS: From Concept to Implementation

How do we make it work in a real system?
– How do we make it fast?
  • Answer: caching, read-ahead, and write-behind.
– How do we make it reliable? What if a message is dropped? What if the server crashes?
  • Answer: the client retransmits the request until it receives a response.
– How do we preserve file system semantics in the presence of failures and/or sharing by multiple clients?
  • Answer: well, we don't, at least not completely.
– What about security and access control?

Coda – Using Caching to Handle Disconnected Access

• Single location-transparent UNIX FS.
• Scalability - coarse granularity (whole-file caching, volume management)
• First class (server) replication and client caching (second class replication)
• Optimistic replication & consistency maintenance.
• Designed for disconnected operation for mobile computing clients

AFS Cache Consistency

• Server keeps state of all clients holding copies (the copy set, e.g., {c0, c1})
• Callbacks when cached data are about to become stale
• Large units (whole files or 64K portions)
• Updates propagated upon close
• Cache on local disk & memory

• If client crashes, revalidation on recovery (lost callback possibility)

Explicit First-class Replication

• File name maps to a set of replicas, one of which will be used to satisfy the request
  – Goal: availability
• Update strategy
  – Atomic updates - all or none
  – Primary copy approach
  – Voting schemes
  – Optimistic, then detection of conflicts

Optimistic vs. Pessimistic

Pessimistic:
• Avoids conflicts by holding shared or exclusive locks.
• How to arrange locks when disconnection is involuntary?
• Leases [Gray, SOSP89] put a time bound on locks - but what about expiration?
Optimistic:
• High availability.
• Conflicting updates are the potential problem, requiring detection and resolution.

Multiple Copy Schemes

• Primary Copy - Of all copies, there is one that is “primary”, to which updates are sent. Secondary copies eventually get the updates. Reads can go to any copy. How to take over when the primary is gone?
• Voting - An operation must acquire locks on some subset of copies (overlapping read or write quorums; see the sketch below).
• Optimistic - Act as though there are no conflicts; when possible, compare replicas for conflicts (version vector) and resolve them.
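For the voting scheme, the standard constraint is that any two write quorums overlap and any read quorum overlaps any write quorum. A sketch of that check, where N, R, and W are parameters chosen by the administrator:

    #include <stdio.h>

    /* Quorum sizes are valid if any two write quorums overlap (so there is a
       single latest version) and any read quorum overlaps any write quorum
       (so a read sees that latest version). */
    int quorums_valid(int n, int r, int w)
    {
        return (w + w > n) && (r + w > n);
    }

    int main(void)
    {
        /* Example: 5 copies, read quorum 3, write quorum 3. */
        printf("N=5, R=3, W=3 valid? %d\n", quorums_valid(5, 3, 3));
        /* R=1, W=5 also works: reads are cheap, writes need every copy. */
        printf("N=5, R=1, W=5 valid? %d\n", quorums_valid(5, 1, 5));
        return 0;
    }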

“Committing” a Transaction

Begin Transaction
  lots of reads and writes
Commit or Abort Transaction

“Committing” a Transaction

Begin Transaction
  Withdraw $1000 from savings account
  Deposit $1000 to checking account
Commit or Abort Transaction

Atomic Transactions

ACID properties - data is recoverable.
• Atomicity - a transaction must be all-or-nothing.
• Consistency - a transaction takes the system from one consistent state to another.
• Isolation - no intermediate effects are visible to others (serializability).
• Durability - the effects of a committed transaction are permanent.

Implementation Mechanisms

• Stable storage • Shadow blocks • Logging

Stable Storage

We need to be able to trust something not to be corrupted or destroyed.
• Mirrored disks. Always write disk 1, verify, then write disk 2.
  – If crash, compare the disks; disk 1 “wins”.
  – If bad checksum, use the other disk's block.
• Battery-backed-up RAM

Private Workspace

• Create a shadow data structure
• On commit, make the shadow the real one.
  – One pointer change to exchange indices allows this to be atomic (see the sketch below).
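A sketch of the private-workspace commit; the data type and the use of C11 atomics are illustrative choices, not from the slide. Updates are applied only to a shadow copy, and commit is a single pointer swap, so readers see either the old version or the new one, never a mixture.

    #include <stdatomic.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct account_state { long savings; long checking; };

    /* The single pointer everyone reads; swapping it is the commit point. */
    static _Atomic(struct account_state *) current;

    void commit_transfer(long amount)
    {
        struct account_state *old = atomic_load(&current);

        /* Private workspace: copy, then modify only the shadow. */
        struct account_state *shadow = malloc(sizeof(*shadow));
        *shadow = *old;
        shadow->savings  -= amount;
        shadow->checking += amount;

        /* Commit: one atomic pointer store makes the shadow the real one.
           To abort, simply free the shadow instead. */
        atomic_store(&current, shadow);
        /* (A real system would also reclaim `old` safely and log for durability.) */
    }

    int main(void)
    {
        struct account_state init = { 5000, 100 };
        atomic_store(&current, &init);
        commit_transfer(1000);
        struct account_state *s = atomic_load(&current);
        printf("savings=%ld checking=%ld\n", s->savings, s->checking);
        return 0;
    }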

Logging

• Intentions list
• Do/undo log
• Log is written to stable storage. Rollback, if abort. Completion, if commit.
Example log: Savings $5K/$4K; Checking $100/$1100; Commit (a sketch of such records follows).
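A sketch of a do/undo log record for the bank-account example; the record format is invented for illustration, and stdout stands in for stable storage. Each record carries the old and new values so the system can roll back on abort or reapply after a crash, and the commit record is what makes the transaction final.

    #include <stdio.h>

    /* One do/undo log record: enough to redo (new value) or undo (old value). */
    struct log_record {
        int         txn_id;
        const char *object;     /* e.g., "Savings", "Checking" */
        long        old_value;  /* undo information            */
        long        new_value;  /* redo information            */
    };

    /* Append a record to the log on stable storage (stdout stands in here). */
    void log_write(const struct log_record *r)
    {
        printf("T%d %-8s old=%ld new=%ld\n",
               r->txn_id, r->object, r->old_value, r->new_value);
    }

    int main(void)
    {
        /* The slide's example: Savings $5K -> $4K, Checking $100 -> $1100. */
        struct log_record a = { 1, "Savings",  5000, 4000 };
        struct log_record b = { 1, "Checking",  100, 1100 };
        log_write(&a);
        log_write(&b);
        printf("T1 COMMIT\n");   /* only after this is the transaction final */
        return 0;
    }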

2-Phase Commit

Coordinator:
  Write “prepare” in log; send “prepare”.
  Collect all responses (commit or abort?).
  Write “commit” in log; send “commit”.
  Collect all “done” responses. Done.
Worker:
  Write “ready” in log; send “ready”.
  Write “commit” in log; do commit; send “done”.

Coda – Using Caching to Handle Disconnected Access

• Single location-transparent UNIX FS.
• Scalability - coarse granularity (whole-file caching, volume management)
• First class (server) replication and client caching (second class replication)
• Optimistic replication & consistency maintenance.
• Designed for disconnected operation for mobile computing clients

Client-cache State Transitions

States: hoarding, emulation, reintegration. Transitions: disconnection (hoarding -> emulation), physical reconnection (emulation -> reintegration), logical reconnection (reintegration -> hoarding).

Prefetching

• To avoid the access latency of moving the data in for that first cache miss.
• Prediction! “Guessing” what data will be needed in the future.
  – It's not for free: consequences of guessing wrong, and overhead.

Hoarding - Prefetching for Disconnected Information Access

• Caching for availability (not just latency)
• Cache misses, when operating disconnected, have no redeeming value. (Unlike in connected mode, they can't be used as the triggering mechanism for filling the cache.)
• How to preload the cache for subsequent disconnection? Planned or unplanned.
• What does it mean for replacement?

Hoard Database

• Per-workstation, per-user set of pathnames with priority
• User can explicitly tailor the HDB using scripts called hoard profiles
• Delimited observations of reference behavior (snapshot spying with bookends)

Coda Hoarding State

• Balancing act - caching for 2 purposes at once:
  – performance of current accesses,
  – availability of future disconnected access.
• Prioritized algorithm - the priority of an object for retention in the cache is f(hoard priority, recent usage).
• Hoard walking (periodically or on request) maintains equilibrium - no uncached object has higher priority than any cached object.

The Hoard Walk

• Hoard walk - phase 1 - reevaluate name bindings (e.g., any new children created by other clients?)
• Hoard walk - phase 2 - recalculate priorities in the cache and in the HDB; evict and fetch to restore equilibrium

Hierarchical Cache Mgt

• Ancestors of a cached object must be cached in order to resolve the pathname.
• Directories with cached children are assigned infinite priority.

Callbacks During Hoarding

• Traditional callbacks - invalidate object and refetch on demand
• With threat of disconnection
  – Purge files and refetch on demand or at hoard walk
  – Directories - mark as stale and fix on reference or at hoard walk; available until then, just in case.

Emulation State

• Pseudo-server, subject to validation upon reconnection
• Cache management by priority
  – modified objects assigned infinite priority
  – freeing up disk space - compression, replacement to floppy, backing out updates
• Replay log also occupies non-volatile storage (RVM - recoverable virtual memory)

Reintegration

• Problem: apparently simultaneous, incompatible updates
• Version vectors
  – Each site maintains a version vector for each file: CVVi(f)[j] = k means Si knows that Sj has seen version k of file f

Example: 3 servers hold file f, initially [1,1,1].

After updating & propagating to S1 and S2 [2,2,1]

Simultaneously, S3 updates: [1,1,2]. Upon reintegration, compare the CVVs: neither vector dominates the other, so the updates conflict.
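A sketch of the CVV comparison done at reintegration; the vector length and return convention are illustrative. One vector dominates another if it is at least as large in every entry; if neither dominates, the updates were concurrent and the conflict must be resolved.

    #include <stdio.h>

    #define NSERVERS 3

    enum vv_order { VV_EQUAL, VV_DOMINATES, VV_DOMINATED, VV_CONFLICT };

    enum vv_order compare_vv(const int a[NSERVERS], const int b[NSERVERS])
    {
        int a_bigger = 0, b_bigger = 0;
        for (int i = 0; i < NSERVERS; i++) {
            if (a[i] > b[i]) a_bigger = 1;
            if (b[i] > a[i]) b_bigger = 1;
        }
        if (a_bigger && b_bigger) return VV_CONFLICT;   /* concurrent updates */
        if (a_bigger)             return VV_DOMINATES;
        if (b_bigger)             return VV_DOMINATED;
        return VV_EQUAL;
    }

    int main(void)
    {
        int s1[NSERVERS] = { 2, 2, 1 };   /* after update propagated to S1, S2 */
        int s3[NSERVERS] = { 1, 1, 2 };   /* S3's simultaneous update          */
        printf("conflict? %d\n", compare_vv(s1, s3) == VV_CONFLICT);  /* yes */
        return 0;
    }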

Client-cache State Transitions with Weak Connectivity

States: hoarding, emulation, write disconnected. Transition labels: disconnection, weak connection, strong connection, physical reconnection.

Cache Misses with Weak Connectivity

• At least now it's possible to service misses - but it's costly ($$$) and a foreground activity (noticeable impact). Maybe not worth it.
• User patience threshold - estimated service time compared with what is acceptable
• Defer misses by adding them to the HDB and letting the hoard walk deal with them
• User interaction during the hoard walk.
