
Distributed Computing Systems: Distributed File Systems



Distributed File Systems

• Early networking and files
  – Had FTP to transfer files
  – Telnet to remote login to other systems with files
• But want more transparency!
  – Local computing with remote file system
• Distributed file systems → one of the earliest distributed system components
  – Enable programs to access remote files as if local (transparency)
  – Allow sharing of data and programs
  – Performance and reliability comparable to local disk

Outline

• Overview (done)
• Basic principles (next)
  – Concepts
  – Models
• Network File System (NFS)
• Andrew File System (AFS)
• Dropbox

Concepts of Distributed File System

• Transparency
• Concurrent Updates
• Replication
• Fault Tolerance
• Consistency
• Platform Independence
• Security
• Efficiency

Transparency

Illusion that local and remote files are similar. Includes:
• Access transparency – a single set of operations; clients that work on local files can work with remote files
• Location transparency – clients see a uniform name space; files can be relocated without changing path names
• Mobility transparency – files can be moved without modifying programs or changing system tables
• Performance transparency – within limits, local and remote file access meet performance standards
• Scaling transparency – increased loads do not degrade performance significantly; capacity can be expanded

Concurrent Updates

• Changes to a file from one client should not interfere with changes from other clients
  – Even if the changes happen at the same time
• Solutions often include:
  – File or record-level locking (see the sketch below)
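A minimal sketch of file/record-level locking using standard POSIX advisory locks (fcntl(2)). This shows the mechanism applications typically see, not any particular distributed file system's own locking protocol; the file name is illustrative.

    /* Sketch: advisory record lock on a byte range with fcntl(2).
     * Works on local files; on NFS it relies on the separate lock
     * daemon mentioned later in these notes. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("shared.dat", O_RDWR | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        struct flock lk = {
            .l_type   = F_WRLCK,   /* exclusive (write) lock        */
            .l_whence = SEEK_SET,
            .l_start  = 0,         /* lock bytes 0..99 only         */
            .l_len    = 100,
        };

        if (fcntl(fd, F_SETLKW, &lk) < 0) {  /* block until granted */
            perror("fcntl(F_SETLKW)");
            return 1;
        }

        /* ... update the locked record here ... */

        lk.l_type = F_UNLCK;                 /* release the lock    */
        fcntl(fd, F_SETLK, &lk);
        close(fd);
        return 0;
    }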



Replication

• File may have several copies of its data at different locations
  – Often for performance reasons
  – Requires updating other copies when one copy is changed
• Simple solution
  – Change master copy and periodically refresh other copies
• More complicated solution
  – Multiple copies can be updated independently at the same time → needs finer-grained refresh and/or merge

Fault Tolerance

• Function when clients or servers fail
• Detect, report, and correct faults that occur
• Solutions often include:
  – Redundant copies of data, redundant hardware, backups, transaction logs and other measures
  – Stateless servers
  – Idempotent operations


Consistency

• Data must always be complete, current, and correct
• File seen by one process looks the same for all processes accessing it
• Consistency is a special concern whenever data is duplicated
• Solutions often include:
  – Timestamps and ownership information

Platform Independence

• Access even though hardware and OS are completely different in design, architecture and functioning, from different vendors
• Solutions often include:
  – Well-defined way for clients to communicate with servers (protocol)


Security

• File systems must be protected against unauthorized access, data corruption, loss and other threats
• Solutions include:
  – Access control mechanisms (ownership, permissions)
  – Encryption of commands or data to prevent "sniffing"

Efficiency

• Overall, want the same power and generality as local file systems
• Early days, goal was to share an "expensive" resource → the disk
• Now, allow convenient access to remotely stored files



Outline

• Overview (done)
• Basic principles (next)
  – Concepts
  – Models
• Network File System (NFS)
• Andrew File System (AFS)
• Dropbox

File Service Models

Upload/Download Model
• Read file: copy file from server to client
• Write file: copy file from client to server
• Good
  – Simple
• Bad
  – Wasteful – what if client only needs a small piece?
  – Problematic – what if client doesn't have enough space?
  – Consistency – what if others need to modify the file?

Remote Access Model
• File service provides functional interface
  – Create, delete, read bytes, write bytes, …
• Good
  – Client only gets what's needed
  – Server can manage coherent view of file system
• Bad
  – Possible server and network congestion
    – Servers used for duration of access
    – Same data may be requested repeatedly

Semantics of File Service

Sequential Semantics
• Read returns result of last write
• Easily achieved if
  – Only one server
  – Clients do not cache data
• But
  – Performance problems if no cache
    – Can instead write-through
  – Must notify clients holding copies
  – Requires extra state, generates extra traffic

Session Semantics
• Relax the sequential rules
• Changes to an open file are initially visible only to the process that modified it
• Last process to modify the file "wins"
• Can hide or lock a file under modification from other clients

Accessing Remote Files (1 of 2)

• For transparency, implement client as a module under the Virtual File System (VFS)

(Additional picture next slide)

Accessing Remote Files (2 of 2)

• Virtual file system allows for transparency (figure: VFS layer routing local and remote accesses)

Stateful or Stateless Design

Stateful – server maintains client-specific state
• Shorter requests
• Better performance in processing requests
• Cache coherence possible
  – Server can know who's accessing what
• File locking possible

Stateless – server maintains no information on client accesses
• Each request must identify file and offsets
• Server can crash and recover
  – No state to lose
• No open/close needed
  – They only establish state
• No server space used for state
  – Don't worry about supporting many clients
• Problems if file is deleted on server
• File locking not possible
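To make the "each request must identify file and offsets" point concrete, here is a hypothetical layout of a stateless read request. The field names and sizes are illustrative only, not the real NFS XDR structures.

    /* Sketch: what a stateless read request must carry.
     * Field names are illustrative, not the real NFS wire format. */
    #include <stdint.h>
    #include <stdio.h>

    struct file_handle {            /* opaque id the server can resolve  */
        uint8_t data[32];
    };

    struct read_request {
        struct file_handle fh;      /* which file (no prior open needed) */
        uint64_t offset;            /* where in the file                 */
        uint32_t count;             /* how many bytes                    */
    };

    int main(void)
    {
        struct read_request req = { .offset = 8192, .count = 4096 };
        /* every request is self-describing, so the server keeps no state */
        printf("request is about %zu bytes\n", sizeof req);
        return 0;
    }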


Caching

• Hide latency to improve performance for repeated accesses
• Four places:
  – Server's disk
  – Server's buffer cache (memory)
  – Client's buffer cache (memory)
  – Client's disk
• Client caches risk cache consistency problems

Concepts of Caching (1 of 2)

Centralized control
• Keep track of what files each client has open and cached
• Stateful file system with signaling traffic

Read-ahead (pre-fetch)
• Request chunks of data before they are needed
• Minimize wait when actually needed
• But what if data pre-fetched is out of date?
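Read-ahead can also be expressed from an application with posix_fadvise(2) and POSIX_FADV_WILLNEED. This is only a sketch: servers and kernels do the equivalent internally, and the 8 KB chunk size here is just the "typical" figure from these notes.

    /* Sketch: asking the OS to pre-fetch the next chunk while we
     * process the current one (application-level read-ahead). */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #define CHUNK (8 * 1024)        /* 8 KB, the "typical" NFS chunk */

    int main(int argc, char **argv)
    {
        if (argc != 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        char buf[CHUNK];
        off_t off = 0;
        ssize_t n;
        while ((n = pread(fd, buf, sizeof buf, off)) > 0) {
            /* hint: we will want the next chunk soon */
            posix_fadvise(fd, off + n, CHUNK, POSIX_FADV_WILLNEED);
            /* ... process buf[0..n) here ... */
            off += n;
        }
        close(fd);
        return 0;
    }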

Concepts of Caching (2 of 2)

Write-through
• All writes to a file are sent to the server
  – What if another client reads its own (out-of-date) cached copy?
• All accesses require checking with the server
• Or… server maintains state and sends invalidations

Delayed writes (write-behind)
• Only send writes to files in batch mode (i.e., buffer locally)
• One bulk write is more efficient than lots of little writes
• Problem: semantics become ambiguous
  – Watch out for consistency – others won't see updates!

Write on close
• Only allows session semantics
• If locking, must lock the whole file

(A short write-policy sketch follows the outline below.)

Outline

• Overview (done)
• Basic principles (done)
• Network File System (NFS) (next)
• Andrew File System (AFS)
• Dropbox
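As a concrete illustration of the write-through versus delayed-write trade-off above, here is a sketch using ordinary POSIX calls (not the internals of any particular DFS): O_SYNC approximates write-through, while buffered writes plus one fsync() approximate delayed writes. File names are illustrative.

    /* Sketch: write-through vs. delayed (buffered) writes at the
     * system-call level. A DFS makes the same trade-off internally. */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char *msg = "hello\n";

        /* Write-through: O_SYNC makes each write() reach stable storage
         * before returning -- safe but slow. */
        int wt = open("through.log", O_WRONLY | O_CREAT | O_SYNC, 0644);
        if (wt >= 0) { write(wt, msg, strlen(msg)); close(wt); }

        /* Delayed write: writes sit in the buffer cache and are pushed
         * out later in bulk; fsync() (or flushing at close) is the
         * explicit "commit" point. */
        int dw = open("delayed.log", O_WRONLY | O_CREAT, 0644);
        if (dw >= 0) {
            for (int i = 0; i < 1000; i++)
                write(dw, msg, strlen(msg));   /* cheap, buffered */
            fsync(dw);                         /* one bulk flush  */
            close(dw);
        }
        return 0;
    }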

Network File System (NFS)

• Introduced in 1984 (by Sun Microsystems)
  – First was the 1970's Data Access Protocol by DEC
  – But NFS was the first to be used as a product
  – Developed in conjunction with Sun RPC
• Interfaces put in the public domain
  – Request For Comments (RFC) by the Internet Engineering Task Force (IETF) – technical development of Internet standards
  – Allowed other vendors to produce implementations
• Internet standard is the NFS protocol (version 3)
  – RFC 1813
• Still widely deployed, up to v4, but maybe too bloated, so v3 widely used

NFS Overview

• Provides transparent access to remote files
  – Independent of OS (e.g., Mac, Linux, Windows) or hardware
• Symmetric – any machine can be both server and client
  – But many setups have a dedicated server
• Export some or all files
• Must support diskless clients
• Recovery from failure
  – Stateless, UDP, client retries
• High performance
  – Caching and read-ahead


Underlying Transport Protocol

• Initially NFS ran over UDP using Sun RPC
• Why UDP?
  – Slightly faster than TCP
  – No connection to maintain (or lose)
  – Reliable send not an issue
    – NFS is designed for Ethernet LAN (relatively reliable)
    – UDP has error detection but no correction
  – NFS retries requests upon error/timeout

NFS Protocols

• Since clients and servers can be implemented for different platforms, need a well-defined way to communicate → protocol
  – Protocol – agreed-upon set of requests and responses between clients and servers
• Once agreed upon, an Apple Mac NFS client can talk to a Sun Solaris NFS server
• NFS has two main protocols
  – Mounting Protocol – request access to an exported directory tree
  – Directory and File Access Protocol – access files and directories (read, write, mkdir, readdir, …)

NFS Mounting Protocol

• Request permission to access contents at a pathname
• Client
  – Parses pathname
  – Contacts server for file handle
• Server
  – Returns file handle: file device #, inode #, instance #
• Client
  – Creates in-memory VFS inode at mount point
  – Internally points to an r-node (for remote/RPC) for remote files
    – Client keeps state, not the server
• Soft-mounted – if client access fails, throw an error to processes; but many programs do not handle file errors well
• Hard-mounted – client blocks processes, retries until server is up (can cause problems when NFS server is down)

NFS Architecture

• In many cases on the same LAN, but not required
  – Can even have client and server on the same machine
• Directories made available on the server through /etc/exports
  – When a client mounts them, they become part of its directory hierarchy
• (Figure: client mounts sub-trees from Server 1 and Server 2 into its own hierarchy.) File system mounted at /usr/students is the sub-tree located at /export/people on Server 1, and file system mounted at /usr/staff is the sub-tree located at /nfs/users on Server 2

Example NFS exports File

• File stored on server, typically /etc/exports

    # See exports(5) for a description.
    /public 192.168.1.0/255.255.255.0(rw,no_root_squash)

• Share folder /public
• Restrict to the 192.168.1.0/24 Class C subnet
  – Use '*' for wildcard/any
• Give read/write access (rw)
• Allow root user to connect as root (no_root_squash)

NFS Automounter

• Automounter – only mount when an empty NFS-specified directory is accessed
  – Attempt unmount every 5 minutes
  – Conserves local resources if users don't need the mount
  – Avoids dependencies on unneeded servers when there are many NFS mounts


NFS Access Protocol

• Most file operations supported from client to server (e.g., read(), write(), getattr())
  – But doesn't support open() and close()
• First, client performs a lookup RPC
  – Gets RPC handle for connection/return call
  – Successful call gets file handle (UFID) and attributes
  – Note, not like open() since no information is stored on the server
• On, e.g., read(), client sends RPC handle, UFID and offset
• Allows server to be stateless, not remember connections
  – Better for scaling and robustness
• However, a typical Unix file system can lock a file on open(), unlock on close()
  – If doing this with NFS, must run a separate lock daemon

NFS Access Operations

• NFS has 16 core operations (v2; v3 added six more)
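A sketch of the open-less access pattern above: one lookup turns a pathname into a handle, and every read then carries (handle, offset, count). The lookup_rpc() and read_rpc() functions below are local stand-ins for the real RPCs, just to show the sequence.

    /* Sketch: the NFS-style access pattern -- no open()/close(),
     * every read carries (handle, offset, count). */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    typedef struct { uint64_t id; } fhandle_t;   /* opaque file handle */

    static int lookup_rpc(const char *name, fhandle_t *fh)
    {
        fh->id = 42;                 /* pretend the server resolved it */
        printf("LOOKUP %s -> handle %llu\n", name, (unsigned long long)fh->id);
        return 0;
    }

    static int read_rpc(fhandle_t fh, uint64_t off, uint32_t cnt, char *buf)
    {
        memset(buf, 'x', cnt);       /* pretend the server returned data */
        printf("READ handle=%llu offset=%llu count=%u\n",
               (unsigned long long)fh.id, (unsigned long long)off, cnt);
        return (int)cnt;
    }

    int main(void)
    {
        fhandle_t fh;
        char buf[8192];

        lookup_rpc("/export/data/report.txt", &fh);  /* once; no server state kept */
        for (uint64_t off = 0; off < 3 * sizeof buf; off += sizeof buf)
            read_rpc(fh, off, sizeof buf, buf);      /* each call self-contained   */
        return 0;
    }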

NFS Caching - Server

• Keep file data in memory as much as possible (avoid slow disk)
• Read-ahead – get subsequent blocks (typically 8 KB chunks) before needed
• Server supports write-through (data to disk immediately when client asks)
  – Performance can suffer, so another option is to write only when the file is closed, called commit

NFS Caching - Client

• Reduce number of requests to server (avoid slow network)
• Cache read(), write(), getattr(), readdir() results
• Can result in different versions at the client
  – Validate with timestamp
  – When contacting server (local open() or new block), invalidate block if server has a newer timestamp
• Clients responsible for polling server
  – Typically 3 seconds for a file
  – Typically 30 seconds for a directory
• Delayed write – only put data on disk in batches when using memory cache
  – Send written (dirty) blocks to server, typically every 30 seconds
  – Flush on close()
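A sketch of the client-side timestamp validation described above, using the 3-second/30-second windows from the notes. get_server_mtime() is a local stand-in for a GETATTR request; the structure and names are illustrative, not the actual NFS client code.

    /* Sketch: NFS-style client cache validation.  An entry is trusted
     * for a short window; after that the client re-checks attributes
     * and invalidates the entry if the server's mtime is newer. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <time.h>

    struct cache_entry {
        time_t validated;   /* when we last checked with the server */
        time_t mtime;       /* server modification time we cached   */
        bool   is_dir;
    };

    static time_t get_server_mtime(const struct cache_entry *e)
    {
        return e->mtime;    /* stand-in: pretend nothing changed */
    }

    static bool cache_entry_fresh(struct cache_entry *e)
    {
        time_t now = time(NULL);
        time_t window = e->is_dir ? 30 : 3;      /* dir vs. file window */

        if (now - e->validated < window)
            return true;                          /* still inside window */

        time_t server_mtime = get_server_mtime(e); /* ask the server     */
        e->validated = now;
        if (server_mtime > e->mtime) {
            e->mtime = server_mtime;
            return false;                         /* stale: refetch data */
        }
        return true;
    }

    int main(void)
    {
        struct cache_entry e = { .validated = time(NULL) - 10, .mtime = 0 };
        printf("entry fresh? %s\n", cache_entry_fresh(&e) ? "yes" : "no");
        return 0;
    }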

Improve Read Performance

• Transfer data in large chunks
  – 8K bytes "typical" default (that used to be large)
  – Common Linux default 32K
• Read-ahead
  – Optimize for sequential file access
  – Send requests to read disk blocks before they are requested by the process
• Generally → tune NFS performance
  – Many possibilities – server threads, network timeout, cache write policy, cache sizes, server disk layout, …
  – "Best" depends upon system and workload

Problems with NFS

• File consistency (if client caches)
• Assumes clocks are synchronized
• No locking
  – Separate lock manager needed, but adds state
• No reference count for open files
  – Could delete a file that others have open!
• File permissions may change
  – Invalidating access


NFS Version 3

• TCP support
  – UDP caused more problems (errors) on WANs or wireless
  – Realized all traffic from one client to a server can be multiplexed on one connection
    – Minimizes connection setup cost
• Large-block transfers
  – Negotiate for optimal transfer size
  – No fixed limit on amount of data per request

NFS Version 4

• Adds state to the system
• Supports open() operations since state can be maintained on the server
• Read operations are not absolute, but relative, and don't need all file information, just the handle
  – Shorter messages
• Locking integrated
• Includes optional security/encryption

Outline

• Overview (done)
• Basic principles (done)
• Network File System (NFS) (done)
• Andrew File System (AFS) (next)
• Dropbox

Andrew File System (AFS)

• Developed at CMU in the 1980's (hence the "Andrew", from "Andrew Carnegie")
  – Commercialized through IBM to OpenAFS (http://openafs.org/)
• Transparent access to remote files
• Using Unix-like file operations (creat(), open(), …)
• But AFS differs markedly from NFS in design and implementation…

General Observations Motivating AFS

• For Unix users
  – Most files are small, less than 10 KB in size
  – read() more common than write() – about 6x
  – Sequential access dominates, random access rare
  – Files referenced in bursts – if used recently, will likely be used again
• Typical scenarios for most files:
  – Many files are for one user only (i.e., not shared), so no problem
  – Shared files that are infrequently updated by others (e.g., code, large reports) are no problem
  – A local cache of a few hundred MB is enough for the working set of most users
• What doesn't fit? → databases – updated frequently, often shared, need fine-grained control
  – Explicitly, AFS is not for databases

AFS Design

• Scalability is the most important design goal
  – Distributed file systems generally have more users than other distributed systems
• Key strategy is caching of whole files at clients
  – Whole-file serving – entire files and directories
  – Whole-file caching – clients store the cache on disk
    – Typically several hundred files
    – "Permanent" in that it is written to local disk, so still there after a reboot


AFS Example

• Process at client issues open() system call
• Check for a local cached copy
  – Yes? Then use it. Done.
  – No? Then proceed to the next step.
• Send request to server
• Server sends back entire copy
• Client opens file (normal Unix file descriptor, local access)
• read(), write(), etc. all apply to the local copy
• When close(), if the local cached copy changed, send it back to the server (see the sketch below)

AFS Questions

• How does AFS gain control on open() or close()?
• What space is allocated for cached files on clients?
• How does AFS ensure cached copies are up-to-date, since they may be updated by several clients?
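A sketch of the AFS Example flow above: open() uses the cached copy or fetches the whole file, all reads and writes are local, and close() writes back only if the copy changed. The fetch/store helpers and paths are stand-ins for Venus talking to Vice, not real OpenAFS interfaces.

    /* Sketch of the AFS whole-file flow: open() either uses the cached
     * copy or fetches the entire file; close() ships it back only if it
     * changed. */
    #include <stdbool.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    static bool have_valid_cached_copy(const char *path) { (void)path; return false; }
    static void fetch_whole_file(const char *path, const char *cache_path)
    { printf("fetch %s -> %s\n", path, cache_path); }
    static void store_whole_file(const char *cache_path, const char *path)
    { printf("store %s -> %s\n", cache_path, path); }

    static int afs_open(const char *path, const char *cache_path)
    {
        if (!have_valid_cached_copy(path))
            fetch_whole_file(path, cache_path);       /* whole file, once */
        return open(cache_path, O_RDWR | O_CREAT, 0644); /* plain local fd */
    }

    static void afs_close(int fd, const char *cache_path, const char *path, bool dirty)
    {
        close(fd);                              /* all reads/writes were local */
        if (dirty)
            store_whole_file(cache_path, path); /* write back on close         */
    }

    int main(void)
    {
        int fd = afs_open("/afs/example.edu/user/notes.txt", "/tmp/afs_cache_0001");
        if (fd >= 0) {
            write(fd, "x", 1);
            afs_close(fd, "/tmp/afs_cache_0001", "/afs/example.edu/user/notes.txt", true);
        }
        return 0;
    }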

AFS Architecture

• (Figure: workstations run user programs and the Venus process above the UNIX kernel; servers run the Vice process above the UNIX kernel; all connected over the network.)
• Vice – implements a flat file system on the server
• Venus – intercepts remote requests and passes them to Vice
  – Venus provides the directory structure, relative location, working directory

System Call Interception in AFS

• Kernel modified to intercept open() and close() system calls
• Local operations go to the UNIX file system; non-local file operations are passed to Venus

Cache Consistency

• Vice issues a callback promise with each file
• If the server copy changes, it "calls back" to Venus processes, cancelling the file
  – Note, change only happens on close of the whole file
• If a Venus process re-opens the file, it must fetch a fresh copy from the server (see the sketch below)
  – Note, if a client already had the file open, it will still proceed
• If a client reboots, it cannot be sure its callbacks are all correct (may have missed some)
  – Checks with the server on each open
• Note, versus traditional cache checking, AFS needs far less communication for non-shared, read-only files

Implementation of System Calls in AFS

• open(FileName, mode)
  – UNIX kernel: if FileName refers to a file in shared file space, pass the request to Venus
  – Venus: check the list of files in the local cache; if not present or there is no valid callback promise, send a request for the file to the Vice server that is custodian of the volume containing the file
  – Vice: transfer a copy of the file and a callback promise to the workstation; log the callback promise
  – Venus: place the copy of the file in the local file system, enter its local name in the local cache list, and return the local name to UNIX
  – UNIX kernel: open the local file and return the file descriptor to the application
• read(FileDescriptor, Buffer, length) / write(FileDescriptor, Buffer, length)
  – UNIX kernel: perform a normal UNIX read or write operation on the local copy
• close(FileDescriptor)
  – UNIX kernel: close the local copy and notify Venus that the file has been closed
  – Venus: if the local copy has been changed, send a copy to the Vice server that is the custodian of the file
  – Vice: replace the file contents and send a callback to all other clients holding callback promises on the file
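A sketch of the callback-promise check Venus makes on open(): a valid promise means the cached copy can be used with no server traffic; a cancelled (or, after a reboot, unknown) promise forces a check with Vice. The states and functions are illustrative, not OpenAFS code, and the "unknown" case is simplified here to a refetch.

    /* Sketch of the callback-promise logic on open(). */
    #include <stdio.h>

    enum promise { PROMISE_VALID, PROMISE_CANCELLED, PROMISE_UNKNOWN };

    static enum promise promise_state = PROMISE_VALID;

    static void fetch_file_and_promise(const char *path)
    {
        printf("fetch %s plus a new callback promise from Vice\n", path);
        promise_state = PROMISE_VALID;
    }

    static void venus_open(const char *path)
    {
        switch (promise_state) {
        case PROMISE_VALID:
            printf("use cached copy of %s (no server contact)\n", path);
            break;
        case PROMISE_CANCELLED:   /* server called back: someone changed it     */
        case PROMISE_UNKNOWN:     /* after a reboot: must check with the server */
            fetch_file_and_promise(path);
            break;
        }
    }

    int main(void)
    {
        venus_open("/afs/example.edu/proj/report.tex");  /* cached, promise valid */
        promise_state = PROMISE_CANCELLED;               /* server cancelled it   */
        venus_open("/afs/example.edu/proj/report.tex");  /* must refetch          */
        return 0;
    }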


Update Semantics

• No other access control mechanisms
• If several workstations close() a file after writing, only the last file written is kept
  – Others are silently lost
• Clients must implement concurrency control separately
• If two processes on the same machine access a file, local Unix semantics apply (i.e., generally none, unless the processes explicitly lock)

AFS Misc

• 1989: benchmark with 18 clients, standard NFS load
  – Up to 120% improvement over NFS
• 1996: Transarc (acquired by IBM) deployed on 1000 servers over 150 sites
  – 96-98% cache hit rate
• Today, some AFS cells have up to 25,000 clients (Morgan Stanley)
• OpenAFS standard: http://www.openafs.org/

Other Distributed File Systems

• SMB: Server Message Blocks, Microsoft (Samba is a free re-implementation of SMB). Favors locking and consistency over client caching.
• CODA: AFS spin-off at CMU. Disconnection and fault recovery.
• Sprite: research project in the 1980's from UC Berkeley, introduced the first journaling file system.
• Amoeba Bullet File Server: Tanenbaum research project. Favors throughput with atomic file change.
• xFS: SGI serverless file system, distributing the file system across multiple machines for the Irix OS.

Outline

• Overview (done)
• Basic principles (done)
• Network File System (NFS) (done)
• Andrew File System (AFS) (done)
• Dropbox (next)

Dropbox Overview (1 of 3)

• Client runs on desktop
• Copies changes to the local folder
  – Uploaded automatically
  – New versions downloaded automatically
• Huge scale – 100+ million users, 1 billion files/day
• Design
  – Small client, few resources
  – Possibility of a low-capacity network to the user
  – Scalable back-end
  – (99% of code in Python)

Dropbox Overview (2 of 3)

• Motivation: most Web apps have high read-to-write ratios
  – e.g., Twitter, Facebook, reddit at 100:1, 1000:1, or more
• Everyone's computer has a complete copy of their Dropbox
  – Run a daemon on the computer to track the "Sync" folder
• Traffic only when changes occur
  – Results in a file upload : file download ratio of about 1:1
  – Huge number of uploads compared to a traditional service
• Uses compression to reduce traffic


Dropbox Overview (3 of 3)

• (Figure: the Dropbox daemon on the client checks for updates (e.g., stat()) and uploads files (e.g., send()).)

Dropbox Upload (1 of 3)

• Client attempts to "commit" a new file
  – Breaks the file into blocks, computes hashes
  – Contacts the Metaserver
• Metaserver checks if the hashes are known
• If not, Metaserver returns that it "needs blocks" (nb)
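A sketch of the client side of the commit step above: split the file into fixed-size blocks and hash each one, producing the list sent to the Metaserver. The block size and checksum here are illustrative stand-ins (the real client uses SHA-256 and its own block size); no actual Dropbox API is shown.

    /* Sketch: split a file into blocks and hash each block, as a
     * Dropbox-style client would do before a commit. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define BLOCK_SIZE (4u * 1024 * 1024)   /* illustrative block size */

    static uint64_t toy_hash(const unsigned char *p, size_t n)  /* NOT SHA-256 */
    {
        uint64_t h = 1469598103934665603ull;          /* FNV-1a */
        for (size_t i = 0; i < n; i++) { h ^= p[i]; h *= 1099511628211ull; }
        return h;
    }

    int main(int argc, char **argv)
    {
        if (argc != 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }
        FILE *f = fopen(argv[1], "rb");
        if (!f) { perror("fopen"); return 1; }

        unsigned char *buf = malloc(BLOCK_SIZE);
        size_t n;
        unsigned block = 0;
        while ((n = fread(buf, 1, BLOCK_SIZE, f)) > 0) {
            /* In the real protocol this hash list goes to the Metaserver,
             * which answers "needs blocks" for any hash it has not seen. */
            printf("block %u: %zu bytes, hash %016llx\n",
                   block++, n, (unsigned long long)toy_hash(buf, n));
        }
        free(buf);
        fclose(f);
        return 0;
    }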

Dropbox Upload (2 of 3)

• Client talks to the Blockserver to add the needed blocks
• Limit bytes/request (typically 8 MB), so there may be multiple requests

Dropbox Upload (3 of 3)

• Client commits again
  – Contacts the Metaserver with the same request
• This time, ok

Dropbox Download (1 of 2)

• Client periodically polls the Metaserver
  – Lists the files it "knows about"
• Metaserver returns information on new files

Dropbox Download (2 of 2)

• Client checks if the blocks exist locally
  – For a new file, this fails
• Retrieve blocks
• Limit bytes/request (typically 8 MB), so there may be multiple requests
• When done, reconstruct files and add them to the local file system
  – Using local filesystem system calls (e.g., open(), write(), …)
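A sketch of the reconstruction step above: blocks are fetched (from the Blockserver, or a LAN peer as described next) and written back with ordinary open()/write() calls, as the slide says. get_block() is a stand-in that fabricates block data; the file name is illustrative.

    /* Sketch: reassemble a downloaded file from its blocks using plain
     * local file-system calls. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static ssize_t get_block(unsigned idx, char *buf, size_t max)
    {
        /* stand-in: pretend each block is 16 bytes; 3 blocks total */
        if (idx >= 3 || max < 16) return 0;
        memset(buf, 'A' + idx, 16);
        return 16;
    }

    int main(void)
    {
        int fd = open("reassembled.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        char buf[4096];
        ssize_t n;
        for (unsigned idx = 0; (n = get_block(idx, buf, sizeof buf)) > 0; idx++)
            write(fd, buf, (size_t)n);    /* append block idx in order */

        close(fd);                        /* file now appears in the Sync folder */
        return 0;
    }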


Dropbox Misc – Streaming Sync

• Normally, cannot download to another client until the upload is complete
  – For large files, that takes time
• Instead, enable the client to start downloading when some blocks arrive, before the commit
  – Streaming Sync

Dropbox Misc – LAN Sync

• LAN Sync – download from other clients on the LAN
• Periodically broadcast "sync" on the LAN (via UDP)
• Response is used to get a TCP connection to other clients
• Pull blocks over HTTP

Dropbox Architecture – v1 through v4

• (Figures only: evolution of the Dropbox back-end architecture across versions 1-4.)


Dropbox Architecture – v5

• (Figure only: version 5 of the Dropbox back-end architecture.)

Bit Bucket (extra slides)

File System Functions

• Abstraction of file system functions that apply to distributed file systems
  – Directory module: relates file names to file IDs
  – File module: relates file IDs to particular files
  – Access control module: checks permission for the operation requested
  – File access module: reads or writes file data or attributes
  – Block module: accesses and allocates disk blocks
  – Device module: disk I/O and buffering
• Most use a set of functions derived from Unix (next slide)

UNIX File System Operations

• filedes = open(name, mode) – opens an existing file with the given name
• filedes = creat(name, mode) – creates a new file with the given name
  – Both operations deliver a file descriptor referencing the open file; the mode is read, write or both
• status = close(filedes) – closes the open file filedes
• count = read(filedes, buffer, n) – transfers n bytes from the file referenced by filedes into buffer
• count = write(filedes, buffer, n) – transfers n bytes into the file referenced by filedes from buffer
  – Both operations deliver the number of bytes actually transferred and advance the read-write pointer
• pos = lseek(filedes, offset, whence) – moves the read-write pointer to offset (relative or absolute, depending on whence)
• status = unlink(name) – removes the file name from the directory structure; if the file has no other names, it is deleted
• status = link(name1, name2) – adds a new name (name2) for the file (name1)
• status = stat(name, buffer) – gets the file attributes for file name into buffer
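A short program exercising the calls in the table above (standard POSIX, nothing distributed): create, write, seek, read, stat, link and unlink. File names are illustrative.

    /* Demo of the calls in the table: creat/open, write, lseek, read,
     * stat, close, link, unlink. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = creat("demo.txt", 0644);            /* create a new file      */
        write(fd, "hello, dfs\n", 11);               /* write 11 bytes         */
        close(fd);

        fd = open("demo.txt", O_RDONLY);             /* open it again          */
        lseek(fd, 7, SEEK_SET);                      /* move read-write ptr    */
        char buf[16] = {0};
        read(fd, buf, 3);                            /* reads "dfs"            */
        close(fd);

        struct stat st;
        stat("demo.txt", &st);                       /* file attributes        */
        printf("read \"%s\", size %lld bytes\n", buf, (long long)st.st_size);

        link("demo.txt", "demo2.txt");               /* second name, same file */
        unlink("demo.txt");                          /* remove first name      */
        unlink("demo2.txt");                         /* file now deleted       */
        return 0;
    }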

File Service Architecture

• (Figure: client computer runs application programs and the client module; server computer runs the directory service and the flat file service.)
• Flat file service
  – Implements operations on the files
  – Manages unique file identifiers (UFIDs) – create, delete
• Directory service
  – Mapping between text names and UFIDs
  – Create, delete new directories and entries
  – Ideally, hierarchies, as in Unix/Windows
• Client module
  – Integrates flat file and directory services under a single API
  – Makes them available to all
