Distributed Systems
Distributed File Systems: Case Studies

NFS • AFS • CODA • DFS • SMB • CIFS • Dfs • WebDAV • GFS • Gmail-FS? • xFS

Paul Krzyzanowski [email protected]

Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License.

NFS
Sun Microsystems, c. 1985

NFS Design Goals
• Any machine can be a client or a server
• Must support diskless workstations
• Heterogeneous systems must be supported
  – Different hardware, OS, underlying file system
• Access transparency
  – Remote files are accessed as local files through normal file system calls (via VFS in UNIX)
• Recovery from failure
  – Stateless, UDP, client retries
• High performance
  – Use caching and read-ahead

NFS Design Goals
No migration transparency
– If a resource moves to another server, the client must remount the resource

NFS Design Goals
No support for UNIX file access semantics
– Stateless design: file locking is a problem
– All controls may not be available

NFS Design Goals
Devices
– Must support diskless workstations, where every file is remote
– Remote devices refer back to local devices

NFS Design Goals
Transport Protocol
– Initially NFS ran over UDP using Sun RPC
– Why UDP?
  • Slightly faster than TCP
  • No connection to maintain (or lose)
  • NFS is designed for a LAN environment, which is relatively reliable
  • Error detection but no correction: NFS retries requests

NFS Protocols
Mounting protocol
– Request access to an exported directory tree
Directory & file access protocol
– Access files and directories (read, write, mkdir, readdir, …)

Mounting Protocol
• Send the pathname to the server
• Request permission to access the contents
• Client: parses the pathname, contacts the server for a file handle
• Server returns a file handle
  – File device #, inode #, instance #
• Client: creates an in-core vnode at the mount point
  – (points to an inode for local files)
  – points to an rnode for remote files, which stores state on the client

Mounting Protocol
Static mounting
– mount request contacts the server
Server: edit /etc/exports
Client: mount fluffy:/users/paul /home/paul

Directory and file access protocol
• First, perform a lookup RPC
  – returns a file handle and attributes
• Not like open
  – No information is stored on the server
• The handle is passed as a parameter for other file access functions
  – e.g. read(handle, offset, count)
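
To make the stateless access pattern above concrete, here is a minimal C sketch of the client's side of it. The nfs_lookup and nfs_read functions are hypothetical stand-ins for the Sun RPC LOOKUP and READ procedures (they return dummy data here), and the 32-byte opaque handle mirrors the fixed-size NFSv2 handle; this is an illustration, not the real client code.

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    typedef struct { uint8_t data[32]; } nfs_fh;   /* opaque file handle */

    /* Placeholder stubs: a real client would issue Sun RPC requests here. */
    static int nfs_lookup(const nfs_fh *dir, const char *name, nfs_fh *out) {
        (void)dir; (void)name; memset(out, 0, sizeof *out); return 0;
    }
    static int nfs_read(const nfs_fh *fh, uint32_t off, uint32_t cnt, char *buf) {
        (void)fh; (void)buf; return off < 16384 ? (int)cnt : 0;   /* fake 16 KB file */
    }

    int main(void) {
        nfs_fh dir = {{0}}, fh;
        char buf[8192];
        long total = 0;

        if (nfs_lookup(&dir, "notes.txt", &fh) != 0) return 1;

        /* No open/close: every READ names the file (handle) and an absolute
           offset, so the server keeps no per-client state and a retried
           request is harmless. */
        for (;;) {
            int n = nfs_read(&fh, (uint32_t)total, sizeof buf, buf);
            if (n <= 0) break;
            total += n;
        }
        printf("read %ld bytes\n", total);
        return 0;
    }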

Directory and file access protocol
NFS has 16 functions (version 2; six more added in version 3):
null, link, getattr, lookup, symlink, setattr, readlink, create, statfs, remove, mkdir, rename, rmdir, readdir, read, write

NFS Performance
• Usually slower than local
• Improve by caching at the client
  – Goal: reduce the number of remote operations
  – Cache results of read, readlink, getattr, lookup, readdir
  – Cache file data at the client (buffer cache)
  – Cache file attribute information at the client
  – Cache pathname bindings for faster lookups
• Server side
  – Caching is “automatic” via the buffer cache
  – All NFS writes are write-through to disk to avoid unexpected data loss if the server dies

Inconsistencies may arise
Try to resolve by validation
– Save the timestamp of the file
– When the file is opened or the server is contacted for a new block
  • Compare the last modification time
  • If the remote copy is more recent, invalidate the cached data

Validation
• Always invalidate data after some time
  – After 3 seconds for open files (data blocks)
  – After 30 seconds for directories
• If a data block is modified, it is:
  – Marked dirty
  – Scheduled to be written
  – Flushed on file close
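
A hedged C sketch of the validation rule above: the 3-second and 30-second freshness windows come from the slide, while the structure and the GETATTR stub are invented for illustration.

    #include <stdbool.h>
    #include <time.h>

    struct cache_entry {
        time_t validated;      /* when we last checked with the server */
        time_t server_mtime;   /* file's modification time as of that check */
        bool   is_directory;
        bool   valid;
    };

    /* Placeholder for a GETATTR RPC returning the server's current mtime. */
    time_t server_getattr_mtime(void) { return 0; }

    /* Revalidate a cached object before using it. */
    bool cache_ok(struct cache_entry *e, time_t now)
    {
        time_t window = e->is_directory ? 30 : 3;   /* seconds, per the slide */

        if (now - e->validated < window)
            return e->valid;                 /* trust the cache inside the window */

        /* Window expired: ask the server and compare modification times. */
        time_t mtime = server_getattr_mtime();
        if (mtime > e->server_mtime) {
            e->valid = false;                /* remote copy is newer: invalidate */
            e->server_mtime = mtime;
        }
        e->validated = now;
        return e->valid;
    }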

Improving read performance
• Transfer data in large chunks
  – 8K bytes default
• Read-ahead
  – Optimize for sequential file access
  – Send requests to read disk blocks before they are requested by the application

Problems with NFS
• File consistency
• Assumes clocks are synchronized
• Open with append cannot be guaranteed to work
• Locking cannot work
  – Separate lock manager added (stateful)
• No reference counting of open files
  – You can delete a file you (or others) have open!
• Global UID space assumed

Problems with NFS
• No reference counting of open files
  – You can delete a file you (or others) have open!
• Common practice
  – Create a temp file, delete it, continue access
  – Sun’s hack:
    • If the same process with the open file tries to delete it
    • Move it to a temp name
    • Delete on close

Problems with NFS
• File permissions may change
  – Invalidating access to the file
• No encryption
  – Requests are sent via unencrypted RPC
  – Authentication methods available
    • Diffie-Hellman, Kerberos, Unix-style
  – Rely on user-level software to encrypt

Improving NFS: version 2
• User-level lock manager
  – Monitored locks
    • Status monitor: monitors clients with locks
    • Informs the lock manager if a host is inaccessible
    • If the server crashes: the status monitor reinstates locks on recovery
    • If a client crashes: all locks from that client are freed
• NV RAM support
  – Improves write performance
  – Normally NFS must write to disk on the server before responding to client write requests
  – Relax this rule through the use of non-volatile RAM

Improving NFS: version 2
• Adjust RPC retries dynamically
  – Reduce network congestion from excess RPC retransmissions under load
  – Based on performance
• Client-side disk caching
  – cacheFS
  – Extend the buffer cache to disk for NFS
    • Cache in memory first
    • Cache on disk in 64KB chunks

The automounter
Problem with mounts
– If a client has many remote resources mounted, boot-time can be excessive
– Each machine has to maintain its own name space
  • Painful to administer on a large scale
Automounter
– Allows administrators to create a global name space
– Supports on-demand mounting

Automounter
• Alternative to static mounting
• Mount and unmount in response to client demand
  – A set of directories is associated with a local directory
  – None are mounted initially
  – When the local directory is referenced
    • The OS sends a message to each server
    • First reply wins
  – Attempt to unmount every 5 minutes

Automounter maps
Example: automount /usr/src srcmap
srcmap contains:
  cmd     -ro   doc:/usr/src/cmd
  kernel  -ro   frodo:/release/src \
                bilbo:/library/source/kernel
  lib     -rw   sneezy:/usr/local/lib
Access /usr/src/cmd: the request goes to doc
Access /usr/src/kernel: ping frodo and bilbo, mount the first response

The automounter
[Diagram: an application's file request enters the kernel VFS; NFS requests are passed to the automounter, which performs the NFS mount and forwards NFS requests to the NFS server]
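
A hedged C sketch of the "first reply wins" selection for a replicated map entry such as the kernel line above. probe_server and nfs_mount are placeholders (a real automounter would ping each server's RPC null procedure, and in parallel rather than sequentially as done here for simplicity).

    #include <stdio.h>
    #include <stddef.h>

    struct map_entry { const char *server; const char *remote_path; };

    /* Placeholder: report whether the server's NFS/mount daemon answers. */
    int probe_server(const char *server) { (void)server; return 1; }

    /* Placeholder mount call: returns 0 on success. */
    int nfs_mount(const char *server, const char *rpath, const char *local) {
        printf("mount %s:%s on %s\n", server, rpath, local);
        return 0;
    }

    /* Pick the first replica that answers and mount it at the local directory. */
    int automount_replicated(const struct map_entry *replicas, size_t n,
                             const char *local_dir)
    {
        for (size_t i = 0; i < n; i++)            /* "first reply wins" */
            if (probe_server(replicas[i].server))
                return nfs_mount(replicas[i].server,
                                 replicas[i].remote_path, local_dir);
        return -1;                                /* no server responded */
    }

    int main(void) {
        struct map_entry kernel[] = {             /* mirrors the srcmap example */
            { "frodo", "/release/src" },
            { "bilbo", "/library/source/kernel" },
        };
        return automount_replicated(kernel, 2, "/usr/src/kernel") ? 1 : 0;
    }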

More improvements… NFS v3
• Updated version of the NFS protocol
• Support for 64-bit file sizes
• TCP support and large-block transfers
  – UDP caused more problems on WANs (errors)
  – All traffic can be multiplexed on one connection
    • Minimizes connection setup
  – No fixed limit on the amount of data that can be transferred between client and server
• Negotiate for optimal transfer size
• Server checks access for the entire path from the client

More improvements… NFS v3
• New commit operation
  – Check with the server after a write operation to see if the data is committed
  – If the commit fails, the client must resend the data
  – Reduces the number of write requests to the server
  – Speeds up write requests
    • Don’t require the server to write to disk immediately
• Return file attributes with each request
  – Saves extra RPCs
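
A sketch of how a v3 client might use the new commit operation: WRITE requests go out as "unstable" so the server can reply before touching the disk, and a later COMMIT makes the data durable. The verifier value models the token a server changes when it reboots; the function names and stubs are illustrative, not the real protocol API.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Placeholders for the v3 WRITE and COMMIT RPCs. Each returns a "write
       verifier"; the server picks a new verifier every time it restarts. */
    uint64_t nfs3_write_unstable(uint64_t off, const void *buf, size_t len) {
        (void)off; (void)buf; (void)len; return 42;
    }
    uint64_t nfs3_commit(uint64_t off, size_t len) { (void)off; (void)len; return 42; }

    /* Write a buffer, letting the server delay its disk writes, then commit.
       Returns true if the data is known to be on stable storage. */
    bool write_then_commit(uint64_t off, const void *buf, size_t len)
    {
        uint64_t v1 = nfs3_write_unstable(off, buf, len); /* fast: not yet on disk */
        uint64_t v2 = nfs3_commit(off, len);              /* force it to disk */

        /* A changed verifier means the server rebooted and may have lost the
           uncommitted data, so the client must resend the writes. */
        return v1 == v2;
    }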

AFS
Carnegie Mellon University
c. 1986 (v2), 1989 (v3)

AFS
• Developed at CMU
• Commercial spin-off
  – Transarc
• IBM acquired Transarc
• Currently open source under the IBM Public License
• Also: OpenAFS, Arla, and a Linux version

AFS Design Goal
Support information sharing on a large scale
– e.g., 10,000+ systems

AFS Assumptions
• Most files are small
• Reads are more common than writes
• Most files are accessed by one user at a time
• Files are referenced in bursts (locality)
  – Once referenced, a file is likely to be referenced again

AFS Design Decisions
Whole file serving
– Send the entire file on open
Whole file caching
– Client caches the entire file on its local disk
– Client writes the file back to the server on close
  • if modified
– Keeps the cached copy for future accesses

AFS Design
• Each client has an AFS disk cache
  – Part of the disk devoted to AFS (e.g. 100 MB)
  – Client manages the cache in LRU manner
• Clients communicate with a set of trusted servers
• Each server presents one identical name space to all clients
  – All clients access it in the same way
  – Location transparent

AFS Server: cells
• Servers are grouped into administrative entities called cells
• Cell: a collection of
  – Servers
  – Administrators
  – Users
  – Clients
• Each cell is autonomous, but cells may cooperate and present users with one uniform name space

AFS Server: volumes
A disk partition contains files and directories, grouped into volumes
– Administrative unit of organization
  • e.g. a user’s home directory, local source, etc.
– Each volume is a directory tree (one root)
– Assigned a name and ID number
– A server will often have hundreds of volumes

Namespace management
Clients get information via a cell directory server (Volume Location Server) that hosts the Volume Location Database (VLDB)

Goal: everyone sees the same namespace
/afs/cellname/path
e.g., /afs/mit.edu/home/paul/src/try.c

Accessing an AFS file
1. Traverse the AFS mount point, e.g., /afs/cs.rutgers.edu
2. The AFS client contacts the Volume Location DB on the Volume Location server to look up the volume
3. The VLDB returns the volume ID and a list of machines (>1 for replicas on read-only file systems)
4. Request the root directory from any machine in the list
5. The root directory contains files, subdirectories, and mount points
6. Continue parsing the file name until another mount point (from step 5) is encountered; go to step 2 to resolve it
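
The lookup loop in steps 1-6 can be sketched as follows. The is_mount_point test, the vldb_lookup stub, and fetch_dir are invented placeholders for the cache manager's internals (real AFS marks mount points inside directory data rather than with a '#' prefix), so treat this as a toy walk-through of the steps, not the real client.

    #include <stdbool.h>
    #include <string.h>
    #include <stdio.h>

    struct volume { int id; const char *servers[4]; int nservers; };

    /* Toy test: real AFS recognizes mount points from the directory entry itself. */
    bool is_mount_point(const char *component) { return component[0] == '#'; }

    /* Placeholder for steps 2-3: ask the VLDB for the volume and its replicas. */
    struct volume vldb_lookup(const char *name) {
        (void)name;
        struct volume v = { 7, { "fs1.cs.rutgers.edu", "fs2.cs.rutgers.edu" }, 2 };
        return v;
    }

    /* Placeholder for step 4: fetch the directory from any replica in the list. */
    void fetch_dir(const struct volume *v, const char *component) {
        printf("fetch %s from %s (volume %d)\n", component, v->servers[0], v->id);
    }

    /* Resolve /afs/cellname/path one component at a time (steps 1-6). */
    void resolve(char *path)
    {
        struct volume vol = vldb_lookup("root.afs");    /* volume behind /afs */
        for (char *c = strtok(path, "/"); c; c = strtok(NULL, "/")) {
            if (is_mount_point(c))
                vol = vldb_lookup(c);    /* back to step 2 for the new volume */
            fetch_dir(&vol, c);
        }
    }

    int main(void) {
        char p[] = "afs/mit.edu/home/paul/src/try.c";
        resolve(p);
        return 0;
    }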

Internally on the server
• Communication is via RPC over UDP
• Access control lists are used for protection
  – Directory granularity
  – UNIX permissions ignored (except execute)

Authentication and access
Kerberos authentication:
– Trusted third party issues tickets
– Mutual authentication
Before a user can access files:
– Authenticate to AFS with the klog command
  • “Kerberos login”: centralized authentication
– Get a token (ticket) from Kerberos
– Present it with each file access
Unauthorized users have the identity system:anyuser

AFS cache coherence
On open:
– The server sends the entire file to the client and provides a callback promise:
  – It will notify the client when any other process modifies the file

AFS cache coherence
If a client modified a file:
– Contents are written to the server on close
When a server gets an update:
– It notifies all clients that have been issued the callback promise
– Clients invalidate their cached files
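
A toy C sketch of the server-side bookkeeping behind callback promises: when a modified file is stored, the server breaks the promise of every other client that cached it, and those clients drop their copies. Structures, sizes, and names here are invented for illustration.

    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_CLIENTS 8

    struct file_state {
        const char *name;
        bool callback[MAX_CLIENTS];   /* which clients hold a callback promise */
        bool cached[MAX_CLIENTS];     /* which clients have a valid cached copy */
    };

    /* Client c fetched the file: it caches it and gets a callback promise. */
    void on_fetch(struct file_state *f, int c) {
        f->cached[c] = f->callback[c] = true;
    }

    /* A writer closed a modified copy: the server stores it and breaks the
       callback promise of every other client, which must then re-fetch. */
    void on_store(struct file_state *f, int writer) {
        for (int c = 0; c < MAX_CLIENTS; c++) {
            if (c == writer || !f->callback[c]) continue;
            f->callback[c] = false;    /* "break" the promise (notify the client) */
            f->cached[c]   = false;    /* client invalidates its cached file */
        }
    }

    int main(void) {
        struct file_state f = { "plan.txt", {0}, {0} };
        on_fetch(&f, 0);
        on_fetch(&f, 1);
        on_store(&f, 1);                              /* client 1 wrote the file */
        printf("client 0 cache valid: %d\n", f.cached[0]);   /* prints 0 */
        return 0;
    }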

AFS cache coherence
If a client was down, on startup:
– Contact the server with the timestamps of all cached files to decide whether to invalidate
If a process has a file open, it continues accessing it even if it has been invalidated
– Upon close, the contents will be propagated to the server
AFS: Session Semantics

AFS: replication and caching
• Read-only volumes may be replicated on multiple servers
• Whole file caching is not feasible for huge files
  – AFS caches in 64KB chunks (by default)
  – Entire directories are cached
• Advisory locking is supported
  – Query the server to see if there is a lock

AFS summary
Whole file caching
– offers dramatically reduced load on servers
Callback promise
– keeps clients from having to check with the server to invalidate cached data

AFS summary
AFS benefits
– AFS scales well
– Uniform name space
– Read-only replication
– Security model supports mutual authentication, encryption
AFS drawbacks
– Session semantics
– Directory-based permissions
– Uniform name space

Sample Deployment (2008)
• Intel engineering (2007)
  – 95% NFS, 5% AFS
  – Approx. 20 AFS cells managed by 10 regional organizations
  – AFS used for: CAD, applications, global data sharing, secure data
  – NFS used for: everything else
• Morgan Stanley (2004)
  – 25,000+ hosts in 50+ sites on 6 continents
  – AFS is the primary distributed filesystem for all UNIX hosts
  – 24x7 system usage; near zero downtime
  – Bandwidth from LANs to 64 Kbps inter-continental WANs

CODA
COnstant Data Availability
Carnegie Mellon University, c. 1990-1992

CODA Goals
Descendant of AFS
CMU, 1990-1992
Goals
– Provide better support for replication than AFS: support shared read/write files
– Support mobility of PCs

Mobility
• Provide constant data availability in disconnected environments
• Via hoarding (user-directed caching)
  – Log updates on the client
  – Reintegrate on connection to the network (server)
• Goal: improve fault tolerance

Modifications to AFS
• Support replicated file volumes
• Extend the mechanism to support disconnected operation
• One-time lookup
  – Replicated volume ID → list of servers and local volume IDs
  – Cache the results for efficiency
• Read files from any server
• Write to all available servers

Volume Storage Group
• The volume ID used in the File ID is a replicated volume ID
• A volume can be replicated on a group of servers
  – Volume Storage Group (VSG)

Disconnection of volume servers
AVSG: Available Volume Storage Group
– Subset of the VSG
What if some volume servers are down?
– On the first download, contact everyone you can and get a version timestamp of the file

Disconnected servers
If the client detects that some servers have old versions:
– Some server resumed operation
– The client initiates a resolution process
  • Updates the servers: notifies them of the stale data
  • Resolution is handled entirely by the servers
  • Administrative intervention may be required (if there are conflicts)

AVSG = Ø
• If no servers are available
  – The client goes to disconnected operation mode
• If the file is not in the cache
  – Nothing can be done… fail
• Do not report failure of an update to the server
  – Log the update locally in the Client Modification Log (CML)
  – The user does not notice

Reintegration
Upon reconnection
– Commence reintegration
Bring the server up to date with CML log playback
– Optimized to send the latest changes
Try to resolve conflicts automatically
– Not always possible
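
A minimal sketch of a client modification log in C: operations performed while the AVSG is empty are appended locally and replayed in order at reintegration time. The record format and the replay_to_server stub are invented for illustration (and the "send only the latest changes" optimization is omitted).

    #include <stdio.h>

    enum op { OP_STORE, OP_CREATE, OP_REMOVE, OP_RENAME };

    struct cml_record { enum op op; char path[128]; };

    static struct cml_record cml[1024];
    static int cml_len;

    /* Disconnected: record the update locally instead of failing. */
    void cml_append(enum op op, const char *path) {
        struct cml_record *r = &cml[cml_len++];
        r->op = op;
        snprintf(r->path, sizeof r->path, "%s", path);
    }

    /* Placeholder: a real client would ship the record to the server here. */
    int replay_to_server(const struct cml_record *r) {
        printf("replay op %d on %s\n", r->op, r->path);
        return 0;      /* 0 = accepted; nonzero would indicate a conflict */
    }

    /* Reconnected: play the log back in order; conflicts need resolution. */
    void reintegrate(void) {
        for (int i = 0; i < cml_len; i++)
            if (replay_to_server(&cml[i]) != 0)
                printf("conflict on %s: needs (possibly manual) resolution\n",
                       cml[i].path);
        cml_len = 0;
    }

    int main(void) {
        cml_append(OP_CREATE, "/coda/usr/paul/notes.txt");
        cml_append(OP_STORE,  "/coda/usr/paul/notes.txt");
        reintegrate();
        return 0;
    }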

Support for disconnection
Keep important files up to date
– Ask the server to send updates if necessary
Hoard database
– Automatically constructed by monitoring the user’s activity
– And user-directed prefetch

CODA summary
• Session semantics as with AFS
• Replication of read/write volumes
  – Client-driven reintegration
• Disconnected operation
  – Client modification log
  – Hoard database for needed files
  – User-directed prefetch
  – Log replay on reintegration

DFS
Distributed File System
Open Group

DFS
• Part of the Open Group’s Distributed Computing Environment
• Descendant of AFS: AFS version 3.x
• Development stopped c. 2005
Assumes (like AFS):
– Most file accesses are sequential
– Most file lifetimes are short
– The majority of accesses are whole-file transfers
– Most accesses are to small files

DFS Goals
Use whole file caching (like original AFS)
But… session semantics are hard to live with
Create a strong consistency model

DFS Tokens
Cache consistency is maintained by tokens
Token:
– A guarantee from the server that a client can perform certain operations on a cached file

DFS Tokens
• Open tokens
  – Allow the token holder to open a file
  – The token specifies access (read, write, execute, exclusive-write)
• Data tokens
  – Apply to a byte range
  – read token: can use cached data
  – write token: write access, cached writes
• Status tokens
  – read: can cache file attributes
  – write: can cache modified attributes
• Lock token
  – Holder can lock a byte range of a file

Living with tokens
• The server grants and revokes tokens
  – Multiple read tokens are OK
  – Multiple read tokens plus a write token, or multiple write tokens, are not OK if the byte ranges overlap
    • Revoke all other read and write tokens
    • Block the new request and send a revocation to the other token holders
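
A sketch of the compatibility check a server could apply before granting a data token: read tokens always coexist, while anything involving a write conflicts only when the byte ranges overlap. The types and ranges are illustrative, not the actual DCE/DFS structures.

    #include <stdbool.h>
    #include <stdint.h>

    enum token_type { TOKEN_READ, TOKEN_WRITE };

    struct data_token {
        enum token_type type;
        uint64_t start, len;     /* byte range the token covers */
    };

    static bool ranges_overlap(const struct data_token *a, const struct data_token *b) {
        return a->start < b->start + b->len && b->start < a->start + a->len;
    }

    /* Can a new token be granted alongside an existing one?
       - read + read: always compatible
       - anything involving a write: only if the byte ranges do not overlap */
    bool compatible(const struct data_token *held, const struct data_token *requested)
    {
        if (held->type == TOKEN_READ && requested->type == TOKEN_READ)
            return true;
        return !ranges_overlap(held, requested);
    }

    /* If compatible() is false for some holder, the server revokes or blocks:
       it sends a revocation to the conflicting holders and delays the new
       request until they give their tokens up. */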

DFS design
• Token granting mechanism
  – Allows for long-term caching and strong consistency
• Caching sizes: 8K – 256K bytes
• Read-ahead (like NFS)
  – Don’t have to wait for the entire file
• File protection via ACLs
• Communication via authenticated RPCs

DFS Summary
Essentially AFS v2 with server-based token granting
– The server keeps track of who is reading and who is writing files
– The server must be contacted on each open and close operation to request a token

SMB
Server Message Blocks
c. 1987

SMB Goals
• File sharing protocol for Windows 95/98/NT/200x/ME/XP/Vista
• Protocol for sharing: files, devices, communication abstractions (named pipes), mailboxes
• Servers: make file systems and other resources available to clients
• Clients: access shared file systems, printers, etc. from servers
• Design priority: locking and consistency over client caching

SMB Design
• Request-response protocol
  – Send and receive message blocks
    • the name comes from an old DOS system call structure
  – Send a request to the server (the machine with the resource)
  – The server sends a response
• Connection-oriented protocol
  – Persistent connection: a “session”
• Each message contains:
  – A fixed-size header
  – A command string (based on the message) or a reply string

Message Block
• Header: [fixed size]
  – Protocol ID
  – Command code (0..FF)
  – Error class, error code
  – Tree ID: unique ID for the resource in use by the client (handle)
  – Caller process ID
  – User ID
  – Multiplex ID (to route requests within a process)
• Command: [variable size]
  – Param count, params, #bytes of data, data
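
The fixed-size header can be pictured as a C struct along these lines. The field widths follow the classic 32-byte SMB layout (4-byte 0xFF "SMB" protocol ID, 1-byte command, 2-byte TID/PID/UID/MID), but treat it as an illustration rather than the exact wire format; the 12-byte pad stands in for signature and reserved fields elided here.

    #include <stdint.h>

    /* Fixed-size SMB header (illustrative layout; little-endian on the wire). */
    #pragma pack(push, 1)
    struct smb_header {
        uint8_t  protocol[4];   /* 0xFF 'S' 'M' 'B' */
        uint8_t  command;       /* command code, 0x00..0xFF */
        uint8_t  error_class;
        uint8_t  reserved;
        uint16_t error_code;
        uint8_t  flags;
        uint16_t flags2;
        uint8_t  pad[12];       /* signature/reserved fields, elided here */
        uint16_t tid;           /* tree ID: handle for the resource in use */
        uint16_t pid;           /* caller process ID */
        uint16_t uid;           /* user ID from session setup */
        uint16_t mid;           /* multiplex ID: routes replies within a process */
    };
    #pragma pack(pop)

    /* The variable-size command part follows the header:
       word count, parameter words, byte count, data bytes. */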

SMB Commands
• Files
  – Get disk attributes
  – Create/delete directories
  – Search for file(s)
  – Create/delete/rename files
  – Lock/unlock a file area
  – Open/commit/close a file
  – Get/set file attributes

SMB Commands
• Print-related
  – Open/close a spool file
  – Write to the spool
  – Query the print queue
• User-related
  – Discover the home system for a user
  – Send a message to a user
  – Broadcast to all users
  – Receive messages

Protocol Steps
• Establish connection

Protocol Steps
• Establish connection
• Negotiate protocol
  – negprot SMB
  – Responds with the version number of the protocol

Protocol Steps
• Establish connection
• Negotiate protocol
• Authenticate/set session parameters
  – Send a sesssetupX SMB with username, password
  – Receive a NACK or the UID of the logged-on user
  – The UID must be submitted in future requests

Protocol Steps
• Establish connection
• Negotiate protocol – negprot
• Authenticate – sesssetupX
• Make a connection to a resource
  – Send a tcon (tree connect) SMB with the name of the shared resource
  – The server responds with a tree ID (TID) that the client will use in future requests for the resource

Protocol Steps
• Establish connection
• Negotiate protocol – negprot
• Authenticate – sesssetupX
• Make a connection to a resource – tcon
• Send open/read/write/close/… SMBs

Locating Services
• Clients can be configured to know about servers
• Each server broadcasts info about its presence
  – Clients listen for the broadcast
  – Build a list of servers
• Fine in a LAN environment
  – Does not scale to WANs
  – Microsoft introduced browse servers and the Windows Internet Name Service (WINS)
  – or … an explicit pathname to the server

Security
• Share level
  – Protection per “share” (resource)
  – Each share can have a password
  – The client needs the password to access all files in the share
  – Only security model in early versions
  – Default in Windows 95/98
• User level
  – Protection applied to individual files in each share, based on access rights
  – The client must log in to the server and be authenticated
  – The client gets a UID which must be presented for future accesses

CIFS
Common Internet File System
Microsoft, Compaq, …
c. 1995?

SMB evolves
SMB was reverse-engineered
– Samba under Linux
Microsoft released the protocol to X/Open in 1992
Microsoft, Compaq, SCO, and others joined to develop an enhanced public version of the SMB protocol:
Common Internet File System (CIFS)

Original Goals
• Heterogeneous HW/OS can request file services over the network
• Based on the SMB protocol
• Support
  – Shared files
  – Byte-range locking
  – Coherent caching
  – Change notification
  – Replicated storage
  – Unicode file names

Original Goals
• Applications can register to be notified when file or directory contents are modified
• Replicated virtual volumes
  – For load sharing
  – Appear as one volume server to the client
  – Components can be moved to different servers without a name change
  – Use referrals
  – Similar to AFS

Original Goals
• Batch multiple requests to minimize round-trip latencies
  – Support wide-area networks
• Transport independent
  – But requires a reliable connection-oriented message stream transport
• DFS support (compatibility)

Caching and Server Communication
• Increase effective performance with
  – Caching
    • Safe if multiple clients are reading and nobody is writing
  – Read-ahead
    • Safe if multiple clients are reading and nobody is writing
  – Write-behind
    • Safe if only one client is accessing the file
• Minimize the times the client informs the server of changes

Oplocks
The server grants opportunistic locks (oplocks) to a client
– The oplock tells the client how/if it may cache data
– Similar to DFS tokens (but more limited)
The client must request an oplock
– An oplock may be
  • Granted
  • Revoked
  • Changed by the server

Level 1 oplock (exclusive access)
– The client can open the file for exclusive access
– Arbitrary caching
– Cache lock information
– Read-ahead
– Write-behind
If another client opens the file, the server has the former client break its oplock:
– The client must send the server any lock and write data and acknowledge that it no longer has the lock
– Purge any read-aheads

Level 2 oplock (one writer)
– A Level 1 oplock is replaced with a Level 2 lock if another process tries to read the file
– Request this if you expect others to read
– Multiple clients may have the same file open as long as none are writing
– Cache reads and file attributes
  • Send other requests to the server
A Level 2 oplock is revoked if another client opens the file for writing

Batch oplock (remote open even if local closed)
– The client can keep the file open on the server even if a local process that was using it has closed the file
– Exclusive R/W open lock + data lock + metadata lock
– A client requests a batch oplock if it expects programs to behave in a way that generates a lot of traffic (e.g. accessing the same files over and over)
– Designed for Windows batch files
– A batch oplock is revoked if another client opens the file

Filter oplock (allow preemption)
• Open the file for read or write
• Allow clients with a filter oplock to be suspended while another process preempts the file access
  – E.g., an indexing service can run and open files without causing programs to get an error when they need to open the file
  – The indexing service is notified that another process wants to access the file
  – It can abort its work on the file and close it, or finish its indexing and then close the file
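
A sketch of how a client might react to an oplock break notification from the server, following the behavior on the slides (Level 1/batch holders flush cached writes and locks and purge read-aheads; Level 2 holders just stop caching). The names and the flush/purge placeholders are invented.

    enum oplock { OPLOCK_NONE, OPLOCK_LEVEL2, OPLOCK_LEVEL1, OPLOCK_BATCH };

    struct open_file {
        enum oplock oplock;
        int dirty;              /* cached (write-behind) data not yet sent */
    };

    /* Placeholders for the client's cache operations. */
    void flush_writes_and_locks(struct open_file *f) { f->dirty = 0; }
    void purge_read_ahead(struct open_file *f)       { (void)f; }

    /* The server asked us to break the oplock down to 'new_level'. */
    void on_oplock_break(struct open_file *f, enum oplock new_level)
    {
        if (f->oplock == OPLOCK_LEVEL1 || f->oplock == OPLOCK_BATCH) {
            /* We may hold writes, locks, and read-ahead data: hand the write
               and lock state back to the server and drop speculative reads. */
            flush_writes_and_locks(f);
            purge_read_ahead(f);
        }
        /* Level 2 -> none: nothing to flush (level 2 never caches writes);
           just stop serving reads from the local cache. */
        f->oplock = new_level;
        /* Finally, acknowledge the break to the server (not shown). */
    }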

No oplock
– All requests must be sent to the server
– Can work from the cache only if the byte range was locked by the client

Naming
• Multiple naming formats supported:
  – N:\junk.doc
  – \\myserver\users\paul\junk.doc
  – file://grumpy.pk.org/users/paul/junk.doc

Microsoft Dfs
• “Distributed File System”
  – Provides a logical view of files & directories
• Each computer hosts volumes
  – \\servername\dfsname
  – Each Dfs tree has one root volume and one level of leaf volumes
• A volume can consist of multiple shares
  – Alternate paths: load balancing (read-only)
  – Similar to Sun’s automounter
• Dfs = SMB + naming/ability to mount server shares on other server shares

Redirection
• A share can be replicated (read-only) or moved through Microsoft’s Dfs
• The client opens the old location:
  – Receives STATUS_DFS_PATH_NOT_COVERED
  – The client requests a referral: TRANS2_DFS_GET_REFERRAL
  – The server replies with the new server
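
The redirection dance can be sketched as a retry loop: open the old path, and if the server answers STATUS_DFS_PATH_NOT_COVERED, ask for a referral and retry against the server it names. The two smb_* calls are invented placeholders (not a real SMB client API); the status and transaction names are the ones on the slide.

    #include <stdio.h>
    #include <string.h>

    #define STATUS_OK                    0
    #define STATUS_DFS_PATH_NOT_COVERED  1

    /* Placeholder SMB calls: a real client would issue the corresponding
       requests (open, TRANS2_DFS_GET_REFERRAL) over its SMB session. */
    int smb_open(const char *server, const char *path) {
        (void)path;
        return strcmp(server, "oldsrv") == 0 ? STATUS_DFS_PATH_NOT_COVERED : STATUS_OK;
    }
    void smb_get_dfs_referral(const char *path, char *server_out, int outsz) {
        (void)path; snprintf(server_out, outsz, "newsrv");
    }

    int dfs_open(const char *path) {
        char server[64] = "oldsrv";                 /* the location we knew about */
        for (int attempt = 0; attempt < 2; attempt++) {
            int st = smb_open(server, path);
            if (st != STATUS_DFS_PATH_NOT_COVERED)
                return st;                          /* opened (or some other error) */
            /* Share was moved or replicated: ask where it lives now and retry. */
            smb_get_dfs_referral(path, server, sizeof server);
        }
        return STATUS_DFS_PATH_NOT_COVERED;
    }

    int main(void) {
        printf("open -> %d\n", dfs_open("\\\\corp\\users\\paul\\junk.doc"));
        return 0;
    }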

CIFS Summary
• A “standard” SMB
• Oplocks mechanism supported in the base OS: Windows NT, 2000, XP
• Oplocks offer flexible control for distributed consistency
• Dfs offers namespace management

NFS version 4
Network File System
Sun Microsystems

NFS version 4 enhancements
• Stateful server
• Compound RPC
  – Group operations together
  – Receive a set of responses
  – Reduce round-trip latency
• Stateful open/close operations
  – Ensures atomicity of share reservations for Windows file sharing (CIFS)
  – Supports exclusive creates
  – Client can cache aggressively

NFS version 4 enhancements
• create, link, open, remove, rename
  – Inform the client if the directory changed during the operation
• Strong security
  – Extensible authentication architecture
• File system replication and migration
  – To be defined
• No concurrent write sharing or distributed cache coherence
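
Picking up the compound RPC idea from the first slide of the pair above: the client queues several operations into one request and gets back one array of results, paying a single round trip. The structures and operation names below (PUTFH, OPEN, READ) are illustrative labels; the real protocol encodes compounds in XDR.

    #include <stdio.h>

    enum nfs4_op { OP_PUTFH, OP_OPEN, OP_READ, OP_GETATTR };

    struct nfs4_opcall { enum nfs4_op op; const char *arg; };

    /* Placeholder for sending one compound request and receiving one reply
       with a status per operation (processing stops at the first error). */
    int send_compound(const struct nfs4_opcall *ops, int n, int *statuses) {
        for (int i = 0; i < n; i++) {
            printf("op %d (%s)\n", ops[i].op, ops[i].arg ? ops[i].arg : "");
            statuses[i] = 0;
        }
        return 0;
    }

    int main(void) {
        /* One round trip instead of three separate RPCs. */
        struct nfs4_opcall ops[] = {
            { OP_PUTFH, "root file handle" },
            { OP_OPEN,  "src/try.c" },
            { OP_READ,  "offset 0, 8192 bytes" },
        };
        int status[3];
        send_compound(ops, 3, status);
        return 0;
    }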

NFS version 4 enhancements
• The server can delegate specific actions on a file to enable more aggressive client caching
  – Similar to CIFS oplocks
• Callbacks
  – Notify the client when file/directory contents change

Other (less conventional) Distributed File Systems

Google File System: Application-Specific
• Component failures are the norm
  – Thousands of storage machines
  – Some are not functional at any given time
• Built from inexpensive commodity components
• Datasets:
  – Billions of objects consuming many terabytes
  – Access is mostly read-only, sequential

Usage needs
• Stores a modest number of large files
  – Files are huge by traditional standards
    • Multi-gigabyte files are common
  – Don’t optimize for small files
• Workload:
  – Large streaming reads
  – Small random reads
  – Most files are modified by appending
  – Support concurrent appends
• High sustained bandwidth is more important than latency
• Optimize the FS API for the application
  – E.g., an atomic append operation

Google file system
• GFS cluster
  – Multiple chunkservers
    • Data storage: fixed-size chunks
    • Chunks replicated on several systems (3 replicas)
  – One master
    • File system metadata
    • Mapping of files to chunks
• Clients ask the master to look up a file
  – Get (and cache) the chunkserver/chunk ID for the file and offset
• Master replication
  – Periodic logs and replicas

WebDAV
• Not a file system, just a protocol
• Web-based Distributed Authoring [and Versioning]
  – RFC 2518
• Extension to HTTP to make the Web writable
• New HTTP methods
  – PROPFIND: retrieve properties from a resource, including a collection (directory) structure
  – PROPPATCH: change/delete multiple properties on a resource
  – MKCOL: create a collection (directory)
  – COPY: copy a resource from one URI to another
  – MOVE: move a resource from one URI to another
  – LOCK: lock a resource (shared or exclusive)
  – UNLOCK: remove a lock
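
Returning to the GFS cluster layout two slides up, here is a sketch of the client-side lookup: the byte offset is turned into a chunk index (GFS uses fixed 64 MB chunks), the master maps (file, index) to a chunk handle plus replica list, and the client caches that answer and reads from a chunkserver directly. master_lookup and chunkserver_read are invented placeholders.

    #include <stdint.h>
    #include <stdio.h>

    #define CHUNK_SIZE (64ULL * 1024 * 1024)     /* fixed-size 64 MB chunks */

    struct chunk_loc { uint64_t handle; const char *replicas[3]; };

    /* Placeholder: the master maps (file name, chunk index) -> handle + replicas. */
    struct chunk_loc master_lookup(const char *file, uint64_t chunk_index) {
        (void)file;
        struct chunk_loc loc = { 1000 + chunk_index, { "cs-a", "cs-b", "cs-c" } };
        return loc;
    }

    /* Placeholder: read from one of the chunkservers holding a replica. */
    int chunkserver_read(struct chunk_loc loc, uint64_t off_in_chunk,
                         void *buf, int len) {
        (void)loc; (void)off_in_chunk; (void)buf; return len;
    }

    int gfs_read(const char *file, uint64_t offset, void *buf, int len) {
        uint64_t index = offset / CHUNK_SIZE;            /* which chunk? */
        /* A real client caches this answer to avoid asking the master again. */
        struct chunk_loc loc = master_lookup(file, index);
        return chunkserver_read(loc, offset % CHUNK_SIZE, buf, len);
    }

    int main(void) {
        char buf[4096];
        printf("%d bytes\n", gfs_read("/logs/web-00", 3 * CHUNK_SIZE + 10,
                                      buf, sizeof buf));
        return 0;
    }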

Who uses WebDAV?
• File systems:
  – davfs2: Linux file system driver to mount a DAV server as a file system
    • Coda kernel driver and neon for WebDAV communication
  – Native filesystem support in OS X (since 10.0)
  – Microsoft web folders
• Apache HTTP server
• Apple iCal & iDisk
• Jakarta Slide & Tomcat
• KDE Desktop
• Microsoft Exchange & IIS
• SAP NetWeaver
• Many others…
• Check out webdav.org

An ad hoc file system using Gmail
• Gmail file system (Richard Jones, 2004)
• User-level file system
  – Python application
  – FUSE userland file system interface
• Supports
  – Read, write, open, close, stat, symlink, link, unlink, truncate, rename, directories
• Each message represents a file
  – Subject headers contain:
    • File system name, filename, pathname, info, owner ID, group ID, size, etc.
  – File data stored in attachments
  – Files can span multiple attachments

Client-server file systems
• Central servers
  – Point of congestion, single point of failure
• Alleviate somewhat with replication and client caching
  – E.g., Coda
  – Limited replication can lead to congestion
• Separate set of machines to administer
• But … user systems have LOTS of disk space
  – (500 GB disks are commodity items @ $45)

Serverless file systems?
• Use workstations cooperating as peers to provide file system service
• Any machine can share/cache/control any block of data
• Prototype serverless file system
  – xFS from Berkeley, demonstrated to be scalable
• Others:
  – See Fraunhofer FS (www.fhgfs.com)

The end