Frangipani: a Scalable Distributed File System

C. A. Thekkath, T. Mann, and E. K. Lee
Systems Research Center, Digital Equipment Corporation
Presented by: Long Zhang
Slides combine material from a previous course with Frangipani’s original slides from SOSP 97.

Motivation
- Large-scale distributed file systems are hard to administer.
- Administration is a problem because of
  - size of installation
  - number of components

Outline
- Background
- Introduction
- System Structure
- Disk Layout
- Logging and Recovery
- The Lock Service
- Easy Administration
- Performance
- Conclusions
- Questions

Background (cont'd)
- Original slides: http://ftp.digital.com/pub/Digital/SRC/publications/thekkath/talk/frangipani-sosp.ppt
- This paper builds on two related papers:
  - Edward K. Lee and Chandramohan A. Thekkath. Petal: Distributed Virtual Disks. Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 84-92, October 1996, Cambridge, Massachusetts, United States.
  - Leslie Lamport. The Part-Time Parliament. Technical Report 49, Digital Equipment Corporation, Systems Research Center, 130 Lytton Ave., Palo Alto, CA 94301-1044, September 1989.

Related Work
- NFS (Sandberg et al., ’85, Sun)
- VAXClusters (Kronenberg, Levy, & Strecker, ’86, DEC)
- AFS (Howard et al., ’88, CMU)
- Echo (Mann et al., ’94, SRC)
- xFS (Anderson et al., ’95, Berkeley)
- Calypso (Devarakonda, Kish, and Mohindra, ’95, IBM)
- Shillner and Felten (’96, Princeton)

Introduction
- Many distributed file systems already exist: the VMS Cluster file system, Echo, Calypso, etc.
- In general, large-scale distributed file systems are hard to manage.
- Much file system administration work requires human intervention and has to be done manually.
- The administration problem is caused by
  - growing computer installations
  - more disks attached to more machines (more components)

Introduction – Solution
Frangipani
- A new scalable distributed file system.
- Two-layered model: built on top of Petal, a distributed storage system.
- Can also be viewed as a cluster file system.
It addresses the administration problem as follows:
- Gives all users a consistent view of the files.
- Frangipani servers can easily be added to an existing installation to improve performance.
- Users can be added without manual configuration.
- Dynamic (hot) backup support.
- Fault tolerance (machine, network, and disk failures).

Petal Prototype
[Figure: Petal clients, a switched network, Petal servers with their disks, and the resulting Petal virtual disk.]

Introduction – Layered structure
[Figure: layers from top to bottom: user programs; Frangipani file servers; distributed lock service and Petal distributed virtual disk; physical disks.]

System Structure – Common workstations
[Figure: Petal virtual disk.]

System Structure – Components
- User programs access Frangipani through the standard operating system call interface (the Digital Unix vnode interface).
- The Frangipani file server module runs within the OS kernel.
- Changes to file contents are staged through the local kernel buffer pool and may remain volatile until the next fsync/sync system call.
- Metadata changes are logged in Petal and are guaranteed to be non-volatile (write-ahead redo log, discussed later).

Components
- The Frangipani file server module reads and writes Petal virtual disks through the local Petal device driver.
  - Exploits Petal’s large virtual address space.
  - More details are in a separate paper.
- The lock service
  - Multiple-reader/single-writer locks
  - Locks with leases (discussed later)

Client/Server configuration
- Security issues:
  - Any Frangipani machine can read or write any block of the shared Petal virtual disk.
  - The network interconnecting the Petal and Frangipani machines can be eavesdropped on.
- Solution:
  - Run Frangipani, Petal, and the lock servers on trusted networks, machines, and operating systems.
  - Use a client/server configuration.
  - All the servers are interconnected by a private network.
  - Remote, untrusted clients talk to the Frangipani servers through a separate network.
    (They have no access to Petal.)
- Bonus: clients can use Frangipani without modification.

Client/Server configuration
[Figure: client/server configuration.]

System Structure – Design issues
- Why not use an old file system on Petal?
  - Petal does work with old file systems.
  - But traditional file systems such as UFS and AdvFS (the comparison target in the performance section) cannot share a block device.
  - The machine that runs the file system can become a bottleneck.
- Why choose a two-layer structure?
  - The two-layer structure is not unique, e.g. the Universal File Server.
  - Modularity.
  - Frangipani machines can be added and removed transparently.
  - Consistent backup without halting the system.

Design issues (cont'd)
Three aspects of the Frangipani design can be problematic:
- Duplicated logging: some updates are logged by both Petal and Frangipani.
- Frangipani does not use disk location information when placing data.
- Frangipani locks entire files and directories rather than individual blocks.

Disk Layout
- Petal provides 2^64 bytes of address space.
- Petal commits and decommits physical space in large chunks of 64 KB.
- Six regions in the address space:
  - 1st region (1 TB) stores shared configuration parameters and housekeeping information.
  - 2nd region (1 TB) stores logs, one per Frangipani server. The region is partitioned into 256 logs.
  - 3rd region (3 TB) holds allocation bitmaps, which describe which blocks in the remaining regions are free.
  - 4th region (1 TB) holds inodes.

Disk Layout (cont'd)
  - 5th region holds small data blocks, each 4 KB in size; 2^47 bytes (128 TB) are allocated.
  - The remainder holds large data blocks, with 1 TB reserved for each large block. This limits the system to 2^24 large files.
- Frangipani takes advantage of Petal’s large, sparse disk address space to simplify its data structures.

Logging and Recovery
- Frangipani uses a write-ahead redo log for metadata.
  - Metadata: any on-disk data structure other than the contents of an ordinary file.
- Log records are kept on Petal.
- Logs are bounded in size: 128 KB.
- Data is written to Petal
  - on fsync/sync system calls, or every 30 seconds;
  - on lock revocation, or when the log wraps.
- Each Frangipani machine has a separate log.
  - Reduces contention.
  - Allows independent recovery.

Logging and Recovery (cont'd)
- Frangipani server crashes can be detected in two ways:
  - by a client of the failed server;
  - when the lock service asks the failed server to return a lock it is holding.
- Recovery is generally initiated by the lock service.
- A recovery daemon takes ownership of the failed server’s log and locks.
- After recovery, it releases all the locks and frees the log.

Lock Services
- Multiple-reader/single-writer lock mechanism
  - A read lock allows a server to read data and cache it.
  - A write lock allows a server to read or write data.
  - When a write lock is downgraded or released, the server must flush its dirty data to disk.
- Locks are moderately coarse-grained.
  - One lock for each logical segment.
  - Each file, directory, or symbolic link is one segment.
  - A lock protects an entire file or directory.

Lock Services (cont'd)
- Deadlock is avoided by globally ordering the locks and acquiring them in two phases:
  - Phase one: the server determines which locks it needs (which files or directories, read lock or write lock).
  - Phase two: the server sorts the locks by inode address and acquires each lock in turn.
  - The server then checks whether any objects identified in phase one were modified while their locks were released.
  - If so, it releases the locks and loops back to phase one.

Lock Services (cont'd)
- The lock service deals with client failure using leases.
  - A client obtains a lease together with a lock.
  - The client must renew the lease before it expires; otherwise its locks become invalid.
- Three different implementations (key problem: where to store the lock state?):
  - 1st: a single, centralized server. All lock state is kept in the server’s volatile memory.
  - 2nd: a primary/backup server. The lock state is stored on a Petal virtual disk, so it can be recovered after a server crash. Poor performance.
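
As a rough illustration of the two-phase, sorted lock acquisition described under Lock Services above, here is a minimal single-process Python sketch. It is not Frangipani's code. The `Segment` class, its `version` counter, and the caller-supplied `plan` function are all hypothetical names introduced for this example. An ordinary exclusive `threading.Lock` stands in for the distributed read/write locks.

```python
import threading


class Segment:
    """Hypothetical lockable segment (a file, directory, or symbolic link)."""

    def __init__(self, inode_addr):
        self.inode_addr = inode_addr   # position in the global lock order
        self.version = 0               # assumed change counter, bumped on each update
        self.lock = threading.Lock()   # exclusive only; read/write modes are omitted


def acquire_in_order(plan):
    """Repeat phase one (plan) and phase two (sorted acquisition) until they agree.

    `plan` is a caller-supplied function returning (segments, seen), where
    `seen` maps inode address -> version observed during phase one.
    """
    while True:
        segments, seen = plan()                               # phase one
        unique = {s.inode_addr: s for s in segments}          # drop duplicates
        ordered = [unique[addr] for addr in sorted(unique)]   # global order
        for seg in ordered:                                   # phase two
            seg.lock.acquire()
        if all(seg.version == seen[seg.inode_addr] for seg in ordered):
            return ordered                                    # plan is still valid
        release_all(ordered)                                  # stale plan: retry


def release_all(held):
    """Release the locks in reverse acquisition order."""
    for seg in reversed(held):
        seg.lock.release()


# Usage: segment `a` changes after the first phase one, forcing one retry.
a, b = Segment(30), Segment(10)
state = {"attempts": 0}


def plan():
    state["attempts"] += 1
    seen = {a.inode_addr: a.version, b.inode_addr: b.version}
    if state["attempts"] == 1:
        a.version += 1   # simulate another server modifying `a` in the meantime
    return [a, b], seen


held = acquire_in_order(plan)
order = [seg.inode_addr for seg in held]
```

Because every caller acquires locks in ascending inode-address order, no two callers can each hold a lock the other is waiting for. The version check after acquisition plays the role of the "modified while their locks were released" test.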
Lock Services (cont'd)
- 3rd and final implementation: a set of mutually cooperating lock servers, plus a clerk module linked into each Frangipani server. Result: fully distributed, for fault tolerance and scalable performance.
- Highlights of the final implementation:
  - The lock servers maintain a lock table for each Frangipani server.
  - The clerk module is responsible for communication (via asynchronous messages).
  - A small amount of global state information is replicated across all lock servers using Lamport’s Paxos algorithm (also used in Google's Chubby lock service, http://labs.google.com/papers/chubby.html).

Easy Administration (adding/removing servers)
- Adding another Frangipani server requires minimal administrative work. The new server only needs to be told
  - which Petal virtual disk to use, and
  - where to find the lock service.
- Removing a Frangipani server is even easier.
  - Simply shut the server off.
  - After the lease expires, the lock servers invalidate the locks held by the server and initiate the recovery service to run the redo log.

Easy Administration – backup
- Petal’s snapshot feature provides a convenient way to make a consistent full dump of a Frangipani file system.
  - Uses copy-on-write techniques.
  - Crash-consistent: a snapshot reflects a coherent state.
- To back up a Frangipani file system:
  - take a Petal snapshot, and
  - copy it to tape.

Performance – Experimental
- Non-volatile memory (NVRAM)
  - Solves the Frangipani server latency problem.
  - Placed between the physical disks and the Petal server.
- Ideal testbed:
  - 100 Petal nodes (small array controllers)
  - 50 Frangipani servers (typical workstations)
- Reality:
  - Seven 333 MHz DEC Alpha 500 5/333 machines as Petal servers.
  - Each has 9 DIGITAL RZ29 disks of 4.3 GB each.
  - Connected to a 24-port ATM switch by 155 Mbit/s links.

Single Machine Performance
- Why AdvFS?
  - Significantly faster than the BSD-derived UFS file system.
  - Can
