
Distributed File Systems

Distributed Computing Systems

• Early networking and files
  – Had FTP to transfer files
  – Telnet to remote login to other systems with files
• But want more transparency!
  – Local computing with remote files
• Distributed file systems → one of the earliest distributed system components
• Enables programs to access remote files as if local
  – Transparency
• Allows sharing of data and programs
• Performance and reliability comparable to local disk

Outline

• Overview (done)
• Basic principles (next)
  – Concepts
  – Models
• Network File System (NFS)
• Andrew File System (AFS)
• Dropbox

Concepts of Distributed File Systems

• Transparency
• Concurrent Updates
• Replication
• Fault Tolerance
• Consistency
• Platform Independence
• Security
• Efficiency


Transparency

Illusion that all files are similar. Includes:
• Access transparency – a single set of operations; clients that work on local files can work with remote files
• Location transparency – clients see a uniform name space; files can be relocated without changing path names
• Mobility transparency – files can be moved without modifying programs or changing system tables
• Performance transparency – within limits, local and remote file access meet performance standards
• Scaling transparency – increased loads do not degrade performance significantly; capacity can be expanded

Concurrent Updates

• Changes to a file from one client should not interfere with changes from other clients
  – Even if the changes happen at the same time
• Solutions often include:
  – File or record-level locking (see the sketch after this slide)
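One concrete form of the "file or record-level locking" bullet is advisory locking on a local Unix file. A minimal sketch using Python's fcntl module (Unix only; the file name and byte range are made-up examples, and a distributed file system needs a lock manager or server support to get the same effect across machines):

import fcntl
import os

# Whole-file advisory lock: other cooperating clients calling flock()
# on the same file block until the lock is released.
fd = os.open("shared.db", os.O_RDWR | os.O_CREAT, 0o644)   # hypothetical shared file
fcntl.flock(fd, fcntl.LOCK_EX)           # exclusive lock on the whole file
try:
    os.write(fd, b"update that must not interleave with other writers\n")
finally:
    fcntl.flock(fd, fcntl.LOCK_UN)       # release

# Record-level lock: lock only bytes 0..99 so concurrent updates to
# other regions of the file can proceed.
fcntl.lockf(fd, fcntl.LOCK_EX, 100, 0)   # (fd, cmd, len, start)
os.lseek(fd, 0, os.SEEK_SET)
os.write(fd, b"record 0")
fcntl.lockf(fd, fcntl.LOCK_UN, 100, 0)
os.close(fd)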


Replication

• File may have several copies of its data at different locations
  – Often for performance reasons
  – Requires updating other copies when one copy is changed
• Simple solution
  – Change master copy and periodically refresh the other copies
• More complicated solution
  – Multiple copies can be updated independently at the same time; needs finer-grained refresh and/or merge

Fault Tolerance

• Function when clients or servers fail
• Detect, report, and correct faults that occur
• Solutions often include:
  – Redundant copies of data, redundant hardware, backups, transaction logs and other measures
  – Stateless servers
  – Idempotent operations



Consistency

• Data must always be complete, current, and correct
• File seen by one process looks the same for all processes accessing it
• Consistency is a special concern whenever data is duplicated
• Solutions often include:
  – Timestamps and ownership information

Platform Independence

• Access even though hardware and OS are completely different in design, architecture and functioning, from different vendors
• Solutions often include:
  – Well-defined way for clients to communicate with servers


Security

• File systems must be protected against unauthorized access, data corruption, loss and other threats
• Solutions include:
  – Access control mechanisms (ownership, permissions)
  – Encryption of commands or data to prevent "sniffing"

Efficiency

• Overall, want same power and generality as local file systems
• Early days, goal was to share an "expensive" resource → the disk
• Now, allow convenient access to remotely stored files



Outline

• Overview (done)
• Basic principles (next)
  – Concepts
  – Models
• Network File System (NFS)
• Andrew File System (AFS)
• Dropbox

File Service Models

Upload/Download Model
• Read file: copy file from server to client
• Write file: copy file from client to server
• Good
  – Simple
• Bad
  – Wasteful – what if client only needs a small piece?
  – Problematic – what if client doesn't have enough space?
  – Consistency – what if others need to modify the file?

Remote Access Model
• File service provides functional interface
  – Create, delete, read bytes, write bytes, …
• Good
  – Client only gets what's needed
  – Server can manage coherent view of file system
• Bad
  – Possible server and network congestion
    • Servers used for duration of access
    • Same data may be requested repeatedly

(A sketch contrasting the two models follows.)
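A rough way to see the difference between the two models is in the client interface each one implies. The classes and method names below are invented for illustration, not any real system's API: the upload/download client moves whole files, while the remote-access client asks the server for each individual operation.

# Hypothetical interfaces only; not a real protocol.

class UploadDownloadClient:
    """Whole files move between client and server."""
    def __init__(self, server):
        self.server = server

    def read_file(self, name):
        return self.server.download(name)          # copy entire file to the client

    def write_file(self, name, data):
        self.server.upload(name, data)              # copy entire file back to the server


class RemoteAccessClient:
    """Server exposes a functional interface; client fetches only what it needs."""
    def __init__(self, server):
        self.server = server

    def read(self, name, offset, count):
        return self.server.read(name, offset, count)    # just the bytes needed

    def write(self, name, offset, data):
        self.server.write(name, offset, data)            # server keeps the coherent copy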

Semantics of File Service

Sequential Semantics
• Read returns result of last write
• Easily achieved if
  – Only one server
  – Clients do not cache data
• But
  – Performance problems if no cache
    • Can instead write-through
  – Must notify clients holding copies
  – Requires extra state, generates extra traffic

Session Semantics
• Relax sequential rules
• Changes to an open file are initially visible only to the process that modified it
• Last process to modify the file "wins"
• Can hide or lock a file under modification from other clients

Accessing Remote Files (1 of 2)

• For transparency, implement client as a module under the VFS

(Additional picture next slide)


Accessing Remote Files (2 of 2)

• Design allows for transparency

Stateful or Stateless

Stateful: server maintains client-specific state
• Shorter requests
• Better performance in processing requests
• Cache coherence possible
  – Server can know who's accessing what
• File locking possible

Stateless: server maintains no information on client accesses
• Each request must identify file and offsets (see sketch below)
• Server can crash and recover
  – No state to lose
• No open/close needed
  – They only establish state
• No server space used for state
  – Don't worry about supporting many clients
• Problems if file is deleted on server
• File locking not possible
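A minimal sketch of the stateless idea, with an invented request format (not NFS's actual wire protocol): every request names the file and offset explicitly, so the server keeps no per-client table and can crash and restart between requests without losing anything.

import os

class StatelessFileServer:
    """Each request is self-contained: file id, offset, length."""
    def __init__(self, root):
        self.root = root                          # directory being exported

    def read(self, file_id, offset, count):
        path = os.path.join(self.root, file_id)
        with open(path, "rb") as f:               # open/close per request; no open-file table
            f.seek(offset)
            return f.read(count)

    def write(self, file_id, offset, data):
        path = os.path.join(self.root, file_id)
        with open(path, "r+b") as f:
            f.seek(offset)
            f.write(data)                         # idempotent: repeating the same write is safe

# A stateful design would instead keep something like
#   self.open_files[client_id] -> {descriptor, current offset}
# which is exactly the state that is lost if the server crashes.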

Caching

• Hide latency to improve performance for repeated accesses
• Four places:
  – Server's disk
  – Server's buffer cache (memory)
  – Client's buffer cache (memory)
  – Client's disk
• Client caches risk cache consistency problems

Concepts of Caching (1 of 2)

Centralized control
• Keep track of who has what open and cached on each node
• Stateful file system with signaling traffic

Read-ahead (pre-fetch)
• Request chunks of data before needed (see sketch below)
• Minimize wait when actually needed
• But what if data pre-fetched is out of date?
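The read-ahead idea can be sketched as a small client-side wrapper (names are hypothetical; the 8 KB chunk size matches the figure used for NFS later): when block n is read, block n+1 is also requested so a sequential reader finds it already cached.

CHUNK = 8 * 1024   # 8 KB chunks, as in the NFS discussion later

class ReadAheadCache:
    def __init__(self, fetch_block):
        self.fetch_block = fetch_block     # function(block_no) -> bytes, e.g. an RPC to the server
        self.cache = {}                    # block_no -> data

    def read_block(self, block_no):
        if block_no not in self.cache:                  # miss: go to the server
            self.cache[block_no] = self.fetch_block(block_no)
        # Pre-fetch the next block so a sequential reader will not wait for it.
        nxt = block_no + 1
        if nxt not in self.cache:
            self.cache[nxt] = self.fetch_block(nxt)     # may be stale by the time it is used
        return self.cache[block_no]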


Concepts of Caching (2 of 2)

Write-through
• All writes to a file sent to the server
  – What if another client reads its own (out-of-date) cached copy?
• All accesses require checking with the server
• Or … server maintains state and sends invalidations

Delayed writes (write-behind)
• Only send writes to files in batch mode (i.e., buffer locally)
• One bulk write is more efficient than lots of little writes
• Problem: semantics become ambiguous
  – Watch out for consistency – others won't see updates!

Write on close
• Only allows session semantics
• If lock, must lock whole file

Outline

• Overview (done)
• Basic principles (done)
• Network File System (NFS) (next)
• Andrew File System (AFS)
• Dropbox

Network File System (NFS)

• Introduced in 1984 (by Sun Microsystems)
• Not the first made, but first to be used as a product
• Made interfaces public domain
  – Allowed other vendors to produce implementations
• Internet standard is NFS protocol (version 3)
  – RFC 1813
• Still widely deployed, up to v4, but maybe too bloated, so v3 widely used

NFS Overview

• Provides transparent access to remote files
  – Independent of OS (e.g., Mac, Linux, Windows) or hardware
• Symmetric – any computer can be server and client
  – But many institutions have dedicated servers
• Export some or all files
• Must support diskless clients
• Recovery from failure
  – Stateless, UDP, client retries
• High performance
  – Caching and read-ahead


Underlying Transport Protocol

• Initially NFS ran over UDP using Sun RPC
• Why UDP?
  – Slightly faster than TCP
  – No connection to maintain (or lose)
  – NFS designed for Ethernet LAN
    • Relatively reliable
    • Error detection but no correction
• NFS retries requests

NFS Protocols

• Since clients and servers can be implemented for different platforms, need a well-defined way to communicate → protocol
  – Protocol – agreed-upon set of requests and responses between clients and servers
• Once agreed upon, an Apple-implemented Mac NFS client can talk to a Sun-implemented Solaris NFS server
• NFS has two main protocols
  – Mounting Protocol: request access to exported directory tree
  – Directory and File Access Protocol: access files and directories (read, write, mkdir, readdir, …)

NFS Mounting Protocol

• Request permission to access contents at pathname
• Client
  – Parses pathname
  – Contacts server for file handle (see sketch below)
• Server
  – Returns file handle: file device #, i-node #, instance #
• Client
  – Creates in-memory VFS i-node at mount point
  – Internally points to r-node for remote files
• Client keeps state, not the server
• Soft-mounted – if client access fails, throw error to processes; but many do not handle file errors well
• Hard-mounted – client blocks processes, retries until server up (can cause problems when NFS server down)

NFS Architecture

• In many cases, on same LAN, but not required
  – Can even have client and server on same machine
• Directories available on server through /etc/exports
  – When client mounts, becomes part of directory hierarchy

[Figure: a client and two servers, each with its own root. The file system mounted at /usr/students is the sub-tree located at /export/people on Server 1, and the file system mounted at /usr/staff is the sub-tree located at /nfs/users on Server 2.]
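A toy version of the mount exchange above (field names are assumptions, not the real NFS wire format): the client sends a pathname, and the server answers with an opaque file handle built from the device number, i-node number, and instance number listed on the slide.

import os
from collections import namedtuple

# File handle contents as listed on the slide: device #, i-node #, instance #.
FileHandle = namedtuple("FileHandle", ["device", "inode", "instance"])

class MountServer:
    def __init__(self, exports, instance=1):
        self.exports = exports          # e.g. {"/export/people": "/export/people"}
        self.instance = instance        # bumped when the file system is rebuilt

    def mount(self, pathname):
        if pathname not in self.exports:
            raise PermissionError("not exported: " + pathname)
        st = os.stat(self.exports[pathname])
        # Opaque handle; the client stores it and presents it on later requests.
        return FileHandle(st.st_dev, st.st_ino, self.instance)

# Client side keeps the handle (the client holds the state, not the server):
#   server = MountServer({"/export/people": "/export/people"})
#   handle = server.mount("/export/people")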


Example NFS exports File

• File stored on server, typically /etc/exports

# See exports(5) for a description.
/public 192.168.1.0/255.255.255.0 (rw,no_root_squash)

• Share the folder /public
• Restrict to the 192.168.1.0/24 Class C subnet
  – Use '*' for wildcard/any
• Give read/write access
• Allow the root user to connect as root

NFS Automounter

• Automounter – only mount when accessing an empty NFS-specified directory
  – Attempt unmount every 5 minutes
  – Avoids long start-up costs when there are many NFS mounts
  – Useful if users don't need all mounts

NFS Access Protocol

• Most file operations supported from server to client (e.g., read(), write(), getattr())
• Doesn't support open() and close()
  – Instead, on, say, read(), the client sends the RPC handle, UFID, and offset
• Allows server to be stateless, not remember connections
  – Better for scaling and robustness
• However, typical Unix can lock a file on open(), unlock on close()
  – NFS must run a separate lock daemon

NFS Access Operations

• NFS has 16 functions (v2; v3 added six more)
• First, perform lookup RPC
  – Returns file handle and attributes (not like open(), since no information is stored on the server)


NFS Caching - Server

• Keep file data in memory as much as possible (avoid slow disk)
• Read-ahead – get subsequent blocks (typically 8 KB chunks) before needed
• Delayed write – only put data on disk when flushed from memory cache (typically every 30 seconds)
• Server responds by write-through (data to disk when client asks)
  – Performance can suffer, so another option writes to disk only when the file is closed, called commit

NFS Caching - Client

• Reduce number of requests to server (avoid slow network)
• Cache results of read, write, getattr, readdir
• Can result in different versions at client
  – Validate with timestamp (see sketch below)
  – When contacting server (open() or new block), invalidate block if server has a newer timestamp
• Clients responsible for polling server
  – Typically 3 seconds for a file
  – Typically 30 seconds for a directory
• Send written (dirty) blocks every 30 seconds
  – Flush on close()
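The client-side validation above can be sketched as follows (a simplification with invented names, not the real NFS client code): cached data is reused only if the file's attributes were checked within the 3-second/30-second windows from the slide, and a newer server timestamp invalidates the cached copy.

import time

FILE_POLL = 3      # seconds between attribute checks for files
DIR_POLL = 30      # seconds between attribute checks for directories

class CachedEntry:
    def __init__(self, data, server_mtime):
        self.data = data
        self.server_mtime = server_mtime
        self.last_checked = time.time()

class ClientCache:
    def __init__(self, server):
        self.server = server            # assumed to offer getattr(name) and read(name)
        self.entries = {}               # name -> CachedEntry

    def read(self, name, is_dir=False):
        poll = DIR_POLL if is_dir else FILE_POLL
        entry = self.entries.get(name)
        if entry and time.time() - entry.last_checked < poll:
            return entry.data                          # fresh enough: no server contact
        mtime = self.server.getattr(name)              # cheap attribute check
        if entry and entry.server_mtime == mtime:
            entry.last_checked = time.time()           # still valid: keep cached data
            return entry.data
        data = self.server.read(name)                  # stale or missing: refetch
        self.entries[name] = CachedEntry(data, mtime)
        return data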

Improve Read Performance

• Transfer data in large chunks
  – 8K bytes default (that used to be large)
• Read-ahead
  – Optimize for sequential file access
  – Send requests to read disk blocks before requested by the process

Problems with NFS

• File consistency (client caches)
• Assumes clocks are synchronized
• No locking
  – Separate lock manager needed, but adds state
• No reference count for open files
  – Could delete a file that others have open!
• File permissions may change
  – Invalidating access


NFS Version 3

• TCP support
  – UDP caused more problems on WANs (errors)
  – All traffic can be multiplexed on one connection
    • Minimizes connection setup
• Large-block transfers
  – Negotiate for optimal transfer size
  – No fixed limit on amount of data per request

NFS Version 4

• Adds state to the system
• Supports open() operations since state can be maintained on the server
• Read operations not absolute, but relative, and don't need all file information, just the handle
  – Shorter messages
• Locking integrated
• Includes optional security/encryption

Outline

• Overview (done)
• Basic principles (done)
• Network File System (NFS) (done)
• Andrew File System (AFS) (next)
• Dropbox

Andrew File System (AFS)

• Developed at CMU (hence the "Andrew", from "Andrew Carnegie")
  – Commercialized through IBM to OpenAFS (http://openafs.org/)
• Transparent access to remote files
• Using Unix-like file operations (creat, open, …)
• But AFS differs markedly from NFS in design and implementation…


General Observations Motivating AFS

• For Unix users
  – Most files are small, less than 10 KB in size
  – read() more common than write() – about 6x
  – Sequential access dominates, random access rare
  – Files referenced in bursts – if used recently, likely to be used again
• Typical scenarios for most files:
  – Many files are for one user only (i.e., not shared), so no problem
  – Shared files that are infrequently updated by others (e.g., code, large report), no problem
• Local cache of a few hundred MB is enough of a working set for most users
• What doesn't fit? → databases – updated frequently, often shared, need fine-grained control
  – Explicitly, AFS is not for databases

AFS Design

• Scalability is the most important design goal
  – More users than other distributed systems
• Key strategy is caching of whole files at clients
  – Whole-file serving – entire files and directories
  – Whole-file caching – clients store cache on disk
    • Typically several hundred files
    • Permanent, so still there if rebooted

AFS Example

• Process at client issues open() system call
• Check if local cached copy (yes, then use; no, then next step)
• Send request to server
• Server sends back entire copy
• Client opens file (normal Unix file descriptor)
• read(), write(), etc. all apply to the copy
• When close(), if the local cached copy changed, send it back to the server (sketch below)

AFS Questions

• How does AFS gain control on open() or close()?
• What space is allocated for cached files on workstations?
• How does the FS ensure cached copies are up-to-date, since they may be updated by several clients?
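The open/close flow in the AFS example can be sketched like this (class and method names are invented; real AFS does this inside Venus and the kernel, not in application code): open() pulls a whole copy into the local cache if needed, reads and writes go to that copy, and close() ships a modified copy back.

import os

class WholeFileClient:
    def __init__(self, server, cache_dir):
        self.server = server            # assumed to offer fetch(name) -> bytes and store(name, bytes)
        self.cache_dir = cache_dir
        self.dirty = {}                 # local path -> remote name, for files opened for writing

    def open(self, name, mode="r"):
        local = os.path.join(self.cache_dir, name.replace("/", "_"))
        if not os.path.exists(local):                    # not cached: fetch the whole file
            with open(local, "wb") as f:
                f.write(self.server.fetch(name))
        f = open(local, mode)                            # ordinary local file descriptor
        if "w" in mode or "a" in mode or "+" in mode:
            self.dirty[local] = name
        return f                                         # read()/write() are purely local

    def close(self, f):
        local = f.name
        f.close()
        if local in self.dirty:                          # changed: send the whole file back
            with open(local, "rb") as g:
                self.server.store(self.dirty.pop(local), g.read())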


AFS Architecture

• Vice – implements flat file system on server
• Venus – intercepts remote requests, passes them to Vice

[Figure: workstations each run user programs, Venus, and the UNIX kernel; servers run Vice over the UNIX kernel; all connected by the network.]

System Call Interception in AFS

• Kernel mod to open() and close() system calls
• If remote, pass to Venus

[Figure: within a workstation, a user program's non-local file operations are passed by the UNIX kernel to Venus, while local UNIX file operations are handled by the kernel directly.]

Cache Consistency

• Vice issues a callback promise with each file
• If the server copy changes, it "calls back" to Venus processes, cancelling the file
  – Note, change only happens on close of the whole file
• If a Venus process then opens the file, it must fetch a copy from the server
• If reboot, cannot be sure callbacks are all correct (may have missed some)
  – Checks with server for each open
• Note, versus traditional cache checking, AFS has far less communication for non-shared, read-only files
• (Callback sketch after the table below)

Implementation of System Calls in AFS

open(FileName, mode)
  UNIX kernel: If FileName refers to a file in shared file space, pass the request to Venus. Open the local file and return the file descriptor to the application.
  Venus: Check the list of files in the local cache. If not present, or there is no valid callback promise, send a request for the file to the Vice server that is the custodian of the volume containing the file. Place the copy of the file in the local file system, enter its local name in the local cache list, and return the local name to UNIX.
  Vice: Transfer a copy of the file and a callback promise to the workstation. Log the callback promise.

read(FileDescriptor, Buffer, length)
  UNIX kernel: Perform a normal UNIX read operation on the local copy.

write(FileDescriptor, Buffer, length)
  UNIX kernel: Perform a normal UNIX write operation on the local copy.

close(FileDescriptor)
  UNIX kernel: Close the local copy and notify Venus that the file has been closed.
  Venus: If the local copy has been changed, send a copy to the Vice server that is the custodian of the file.
  Vice: Replace the file contents and send a callback to all other clients holding callback promises on the file.
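A server-side sketch of callback promises (hypothetical data structures; real Vice tracks volumes, versions, and much more): the server remembers which clients hold a copy of each file, and when one client closes a modified copy it "calls back" the others so their cached copies are cancelled.

from collections import defaultdict

class CallbackServer:
    def __init__(self):
        self.files = {}                          # name -> current contents
        self.promises = defaultdict(set)         # name -> clients holding a valid copy

    def fetch(self, client, name):
        self.promises[name].add(client)          # record the callback promise
        return self.files.get(name, b"")

    def store(self, client, name, data):
        self.files[name] = data                  # replace contents on close of the whole file
        for other in self.promises[name] - {client}:
            other.invalidate(name)               # "call back": cancel their cached copies
        self.promises[name] = {client}           # only the writer's copy is still valid

class Client:
    def __init__(self):
        self.cache = {}
    def invalidate(self, name):
        self.cache.pop(name, None)               # next open must refetch from the server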


Update Semantics

• No other access control mechanisms
• If several workstations close() a file after writing, only the last file written will be kept
  – Others silently lost
• Clients must implement concurrency control separately
• If two processes on the same machine access a file, local Unix semantics apply

AFS Misc

• 1989: Benchmark with 18 clients, standard NFS load
  – Up to 120% improvement over NFS
• 1996: Transarc (acquired by IBM) deployed on 1000 servers over 150 sites
  – 96-98% cache hit rate
• Today, some AFS cells have up to 25,000 clients (Morgan Stanley)
• OpenAFS standard: http://www.openafs.org/

Other Distributed File Systems

• SMB: Server Message Blocks, Microsoft (Samba is a free re-implementation of SMB). Favors locking and consistency over client caching.
• Coda: AFS spin-off at CMU. Disconnection and fault recovery.
• Sprite: research project in the 1980s from UC Berkeley; introduced the log-structured file system.
• Amoeba Bullet File Server: Tanenbaum research project. Favors throughput with atomic file change.
• xFS: serverless file system that distributes file service across multiple machines (distinct from SGI's XFS for Irix).

Outline

• Overview (done)
• Basic principles (done)
• Network File System (NFS) (done)
• Andrew File System (AFS) (done)
• Dropbox (next)


File Synchronization

• Client runs on desktop
• Copies changes to a local folder
  – Uploaded automatically
  – Downloads new versions automatically
  – Traffic only when changes occur (see sketch below)
• Huge scale – 100+ million users, 1 billion files/day
• Design
  – Small client, few resources
  – Possibility of low-capacity network to user
  – Scalable back-end
  – (99% of code in Python)

Dropbox Differences

• Most Web apps have high read-to-write ratios
  – e.g., Twitter, Facebook, Reddit: 100:1, 1000:1, or more
• Dropbox
  – Everyone's computer has a complete copy of their Dropbox
  – File upload : file download is about 1:1
  – Huge number of uploads compared to a traditional service
• Guided by ACID requirements (from databases)
  – Atomic – don't share partially modified files
  – Consistent – operations have to be in order, reliable; cannot delete a file in a shared folder and have others still see it
  – Durable – files cannot disappear
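A toy illustration of "traffic only when changes occur" (the hashing scheme and function names are invented here; this is not Dropbox's actual protocol): the client hashes the files in the synced folder and uploads only those whose content hash changed since the last scan.

import hashlib
import os

def content_hash(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(64 * 1024), b""):
            h.update(block)
    return h.hexdigest()

def sync_folder(folder, known_hashes, upload):
    """Upload only files whose contents changed since the previous scan."""
    for dirpath, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(dirpath, name)
            digest = content_hash(path)
            if known_hashes.get(path) != digest:      # new or modified file
                upload(path)                          # e.g. a request to the back-end
                known_hashes[path] = digest           # unchanged files cost no traffic next time
    return known_hashes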

Dropbox Architecture – v1

Dropbox Architecture – v2


Dropbox Architecture – v3

Dropbox Architecture – v4

Dropbox Architecture – v5

Bit Bucket


File System Functions

• Abstraction of file system functions that apply to distributed file systems
  – Directory module: relates file names to file IDs
  – File module: relates file IDs to particular files
  – Access control module: checks permission for the operation requested
  – File access module: reads or writes file data or attributes
  – Block module: accesses and allocates disk blocks
  – Device module: disk I/O and buffering
• Most use a set of functions derived from Unix (next slide; usage sketch follows the table)

UNIX File System Operations

filedes = open(name, mode): Opens an existing file with the given name.
filedes = creat(name, mode): Creates a new file with the given name. Both operations deliver a file descriptor referencing the open file. The mode is read, write, or both.
status = close(filedes): Closes the open file filedes.
count = read(filedes, buffer, n): Transfers n bytes from the file referenced by filedes into buffer.
count = write(filedes, buffer, n): Transfers n bytes into the file referenced by filedes from buffer. Both operations deliver the number of bytes actually transferred and advance the read-write pointer.
pos = lseek(filedes, offset, whence): Moves the read-write pointer to offset (relative or absolute, depending on whence).
status = unlink(name): Removes the file name from the directory structure. If the file has no other names, it is deleted.
status = link(name1, name2): Adds a new name (name2) for the file (name1).
status = stat(name, buffer): Gets the file attributes for file name into buffer.
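For a quick way to try these operations, Python's os module exposes them almost one-for-one (the file names below are just examples):

import os

fd = os.open("example.txt", os.O_RDWR | os.O_CREAT, 0o644)   # open/creat -> file descriptor
os.write(fd, b"hello distributed file systems\n")            # write(filedes, buffer, n)
os.lseek(fd, 0, os.SEEK_SET)                                  # move the read-write pointer
data = os.read(fd, 5)                                         # read(filedes, buffer, n) -> b"hello"
os.close(fd)                                                  # close(filedes)

info = os.stat("example.txt")                                 # stat(name, buffer)
print(info.st_size)

os.link("example.txt", "alias.txt")                           # link(name1, name2)
os.unlink("alias.txt")                                        # unlink(name)
os.unlink("example.txt")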

File Service Architecture

[Figure: the client computer runs application programs and a client module; the server computer runs the directory service and the flat file service.]

• Flat file service
  – Implements operations on the files
  – Manages unique file identifiers (UFIDs) – create, delete
• Directory service
  – Mapping between text names and UFIDs
  – Create, delete new directories and entries
  – Ideally, hierarchies, as in Unix/Windows
• Client module
  – Integrates flat file and directory services under a single API (sketch below)
  – Makes them available to all computers
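A minimal sketch of this split (UFID generation and method names are assumptions for illustration): the flat file service works purely in terms of UFIDs, the directory service maps text names to UFIDs, and the client module combines them behind one API.

import itertools

class FlatFileService:
    """Operates on files identified only by UFIDs."""
    def __init__(self):
        self._next = itertools.count(1)
        self._files = {}                       # ufid -> bytearray

    def create(self):
        ufid = next(self._next)
        self._files[ufid] = bytearray()
        return ufid

    def read(self, ufid, offset, count):
        return bytes(self._files[ufid][offset:offset + count])

    def write(self, ufid, offset, data):
        buf = self._files[ufid]
        buf[offset:offset + len(data)] = data

class DirectoryService:
    """Maps text names to UFIDs."""
    def __init__(self):
        self._names = {}
    def add(self, name, ufid):
        self._names[name] = ufid
    def lookup(self, name):
        return self._names[name]

class ClientModule:
    """Single API that integrates the two services."""
    def __init__(self, files, dirs):
        self.files, self.dirs = files, dirs
    def create(self, name):
        self.dirs.add(name, self.files.create())
    def write(self, name, offset, data):
        self.files.write(self.dirs.lookup(name), offset, data)
    def read(self, name, offset, count):
        return self.files.read(self.dirs.lookup(name), offset, count)

# Usage: cm = ClientModule(FlatFileService(), DirectoryService())
#        cm.create("notes.txt"); cm.write("notes.txt", 0, b"hi"); cm.read("notes.txt", 0, 2)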
