Appeared in the Proceedings of the Summer USENIX Conference, June 1991, pages 17-29

Management of Replicated Volume Location Data in the Ficus Replicated File System

Thomas W. Page Jr., Richard G. Guy, John S. Heidemann, Gerald J. Popek, Wai Mak, and Dieter Rothmeier
Department of Computer Science
University of California Los Angeles

This work was sponsored by DARPA contract number F29601-87-C-0072.
† This author is also associated with Locus Computing Corporation.

Abstract

Existing techniques to provide name transparency in distributed file systems have been designed for modest scale systems, and do not readily extend to very large configurations. This paper details the approach which is now operational in the Ficus replicated Unix filing environment, and shows how it differs from other methods currently in use. The Ficus mechanism permits optimistic management of the volume location data by exploiting the existing directory reconciliation algorithms which merge directory updates made during network partition.

1 Introduction

Most Unix file system implementations separate the maintenance of name binding data into two levels: within a volume or filesystem, directory entries bind names to low-level identifiers such as inodes; a second mechanism is used to form a super-tree by "gluing" the set of volumes to each other. The traditional Unix volume super-tree connection mechanism has been widely altered or replaced to support both small and large scale distributed file systems. Examples of the former are Sun's Network File System (NFS) [13] and IBM's TCF [12]; larger scale file systems are exemplified by AFS [7], Decorum [6], Coda [14], Sprite [9], and Ficus [2, 10].

The problem addressed by this paper is simply stated as follows: in the course of expanding a path name in a distributed file system, the system encounters a graft point. That is, it reaches a leaf node in the current volume which indicates that path name expansion should continue in the root directory of another volume which is to be grafted on at that point. How does the system identify and locate a replica of that next volume? Solutions to this problem are very much constrained by the number of volumes in the name hierarchy, the number of replicas of volumes, the topology and failure characteristics of the communications network, the frequency or ease with which replica storage sites or graft points change, and the degree to which the hierarchy of volumes spans multiple administrative domains.

1.1 Related Solutions

The act of gluing sub-hierarchies of the name space together is commonly known as mounting [1]. In a conventional single-host Unix system, a single mount table exists which contains the mappings between the mounted-on leaf nodes and the roots of mounted volumes. However, in a distributed file system, the equivalent of the mount table must be a distributed data structure. The distributed mount table information must be replicated for reliability, and the replicas kept consistent in the face of update. Most distributed Unix file systems to some degree attempt to provide the same view of the name space from any site. Such name transparency requires mechanisms to ensure the coherence of the distributed and replicated name translation database. NFS, TCF, and AFS each employ quite different approaches to this problem.
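For concreteness, the fragment below sketches the conventional single-host mount table just described, in plain C. It is illustrative only: the type and function names (struct mount_entry, cross_mount_point, and so on) are invented for this example and are not taken from any particular kernel. Pathname expansion checks each resolved node against the table and, on a match, continues at the root of the mounted volume; a distributed file system must answer the same question without a single local table.

    /*
     * Illustrative sketch only: a single-host Unix mount table.  The type and
     * function names are invented here, not taken from any particular kernel.
     */
    #include <stdio.h>

    #define MAX_MOUNTS 32

    struct vnode {                  /* stand-in for an in-core inode/vnode     */
        int volume_id;              /* which volume the node belongs to        */
        int node_id;                /* low-level identifier within that volume */
    };

    struct mount_entry {            /* one row of the single-host mount table  */
        struct vnode mounted_on;    /* leaf node that is covered               */
        struct vnode volume_root;   /* root of the volume mounted there        */
    };

    static struct mount_entry mount_table[MAX_MOUNTS];
    static int mount_count;

    /*
     * If vn is a covered leaf, return the root of the volume mounted on it;
     * otherwise return vn unchanged.  A distributed file system must answer
     * the same question, but without a single local table like this one.
     */
    static struct vnode cross_mount_point(struct vnode vn)
    {
        for (int i = 0; i < mount_count; i++) {
            struct mount_entry *m = &mount_table[i];
            if (m->mounted_on.volume_id == vn.volume_id &&
                m->mounted_on.node_id == vn.node_id)
                return m->volume_root;
        }
        return vn;
    }

    int main(void)
    {
        /* Mount volume 7 on node 42 of volume 1 (say, "/usr"). */
        mount_table[mount_count++] = (struct mount_entry){
            .mounted_on  = { .volume_id = 1, .node_id = 42 },
            .volume_root = { .volume_id = 7, .node_id = 2  },
        };

        struct vnode vn = { .volume_id = 1, .node_id = 42 };
        vn = cross_mount_point(vn);
        printf("expansion continues in volume %d at node %d\n",
               vn.volume_id, vn.node_id);
        return 0;
    }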
To the degree that NFS achieves name transparency, it does so through convention and out-of-band coordination by system administrators. Each site must explicitly mount every volume which is to be accessible from that site; NFS does not traverse mount points in remotely mounted volumes. If one administrator decides to mount a volume at a different place in the name tree, this information is not automatically propagated to other sites which also mount the volume. While allowing sites some autonomy in how they configure their name tree is viewed as a feature by some, it leads to frequent violations of name transparency which in turn significantly complicate the users' view of the distributed file system and limit the ability of users and programs to move between sites. Further, as a distributed file system scales across distinct administrative domains, the prospect of maintaining global agreement by convention becomes exceedingly difficult.

IBM's TCF, like its predecessor Locus [12], achieves transparency by renegotiating a common view of the mount table among all sites in a partition every time the communications or node topology (partition membership) changes. This design achieves a very high degree of network transparency in limited scale local area networks where topology change is relatively rare. However, for a network the size of the Internet, a mount table containing several volumes for each site in the network results in an unmanageably large data structure on each site. Further, in a nationwide environment, the topology is constantly in a state of flux; no algorithm which must renegotiate global agreements upon each partition membership change may be considered.

Clearly neither of the above approaches scales beyond a few tens of sites. Cellular AFS [15], like Ficus, is designed for larger scale application. AFS employs a Volume Location Data Base (VLDB) for each cell (local cluster) which is replicated on the cell's backbone servers. The mount point itself contains the cell and volume identifiers. The volume identifier is used as a key to locate the volume in a copy of the VLDB within the indicated cell. Volume location information, once obtained, is cached by each site. The VLDB is managed separately from the file system using its own replication and consistency mechanism. A primary copy of the VLDB on the system control machine periodically polls the other replicas to pull over any updates, compute a new VLDB for the cell, and redistribute it to the replicas. The design does not permit volumes to move across cell boundaries, and does not provide location transparency across cells, as each cell's management may mount remote cell volumes anywhere in the namespace. Again, this may be billed as a feature or a limitation depending on where one stands on the tradeoff between cell autonomy and global transparency.

[1] We will henceforth use the terms "graft" and "graft point" for the Ficus notion of grafting volumes while retaining the mount terminology for the Unix notion of mounting filesystems.

1.2 The Ficus Solution

Ficus uses AFS-style on-disk mounts, and unlike NFS readily traverses remote mount points. The difference between the Ficus and AFS methods lies in the nature of Ficus volumes (which are replicated) and the relationship of graft points and volume location databases.
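Before turning to the Ficus structures, it may help to have the shape of the AFS-style lookup from Section 1.1 in mind as a baseline. The C sketch below is an assumption-laden illustration, not the real AFS interfaces: the types and the linear-scan "database" are invented here. A mount point names a (cell, volume) pair, the volume identifier keys a lookup in that cell's VLDB, and a real client would cache the resulting server list.

    /*
     * Illustrative sketch only: the shape of an AFS-style, cell-scoped volume
     * location lookup.  The types and the linear-scan "database" are invented
     * for this example; they are not the real AFS interfaces.
     */
    #include <stdio.h>

    #define MAX_SERVERS 3

    struct vldb_entry {
        int  volume_id;
        char servers[MAX_SERVERS][32]; /* hosts storing replicas of the volume */
    };

    struct cell {
        const char        *name;       /* administrative cluster ("cell")      */
        struct vldb_entry *vldb;       /* that cell's volume location database */
        int                nentries;
    };

    /*
     * Look up a volume in the indicated cell's VLDB.  A real client would
     * contact one of the cell's database servers and cache the reply.
     */
    static const struct vldb_entry *
    vldb_lookup(const struct cell *c, int volume_id)
    {
        for (int i = 0; i < c->nentries; i++)
            if (c->vldb[i].volume_id == volume_id)
                return &c->vldb[i];
        return NULL;
    }

    int main(void)
    {
        static struct vldb_entry entries[] = {
            { 1234, { "fs1.example.edu", "fs2.example.edu", "" } },
        };
        struct cell local = { "cs.example.edu", entries, 1 };

        /* A mount point records exactly this much: the cell and volume id. */
        const struct vldb_entry *e = vldb_lookup(&local, 1234);
        if (e != NULL)
            printf("volume %d in cell %s served by %s\n",
                   e->volume_id, local.name, e->servers[0]);
        return 0;
    }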
In Ficus, like AFS [5], a volume is a collection of files which are managed together and which form a subtree of the name space [2]. Each logical volume in Ficus is represented by a set of volume replicas which form a maximal, but extensible, collection of containers for file replicas. Files and directories within a logical volume are replicated in one or more of the volume replicas. Each individual volume replica is normally stored entirely within one Unix disk partition.

Ficus and AFS differ in how volume location information is made highly available. Instead of employing large, monolithic mount tables on each site, Ficus fragments the information needed to locate a volume and places the data in the mounted-on leaf (a graft point). A graft point maps a set of volume replicas to hosts, which in turn each maintain a private table mapping volume replicas to specific storage devices. Thus the various pieces of information required to locate and access a volume replica are stored so that they are accessible exactly where and when they will be needed.

A graft point may be replicated and manipulated just like any other object (file or directory) in a volume. In Ficus the format of a graft point is compatible with that of a directory: a single bit indicates that it contains grafting information and not file name bindings. The extreme similarity between graft points and normal file directories allows the use of the same optimistic replication and reconciliation mechanism that manages directory updates. Without building any additional mechanism, graft point updates are propagated to accessible replicas, and uncoordinated updates are detected and automatically repaired where possible, and reported to the system administrators otherwise. Volume replicas may be moved, created, or deleted, so long as the target volume replica and any replica of the graft point are accessible in the partition (one copy availability). This optimistic approach to replica management is critical, as one of the primary motivations for adding a new volume replica may be that network partition has left only one replica still accessible, and greater reliability is desired.

This approach to managing volume location information scales to arbitrarily large networks,

[2] Whereas a filesystem in Unix is traditionally one-to-one with a disk partition, a volume is a logical grouping of files which says nothing about how they are mapped to disk partitions.
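To make the two-level location data of this section concrete, the C sketch below shows one plausible in-memory shape for a graft point and a host's private replica table. The structure and field names are invented for illustration and do not reflect the actual Ficus on-disk format: a graft point (stored like a directory, distinguished by a single bit) maps volume replicas to hosts, and each host privately maps its own replicas to storage devices.

    /*
     * Illustrative sketch only: one plausible in-memory shape for the Ficus
     * two-level location data described above.  Structure and field names are
     * invented here and do not reflect the actual on-disk graft point format.
     */
    #include <stdio.h>

    struct graft_entry {            /* one record inside a graft point          */
        int  replica_id;            /* which replica of the grafted volume      */
        char host[32];              /* host believed to store that replica      */
    };

    struct graft_point {
        int is_graft;               /* the single bit: graft data, not names    */
        int volume_id;              /* logical volume grafted at this leaf      */
        int nreplicas;
        struct graft_entry replicas[8];
    };

    struct local_replica {          /* entry in a host's private table          */
        int  replica_id;
        char device[32];            /* disk partition holding the replica       */
    };

    int main(void)
    {
        struct graft_point gp = {
            .is_graft = 1, .volume_id = 12, .nreplicas = 2,
            .replicas = { { 1, "hera.example.edu" }, { 2, "zeus.example.edu" } },
        };
        struct local_replica mine[] = { { 1, "/dev/sd0g" } };

        /*
         * Pathname expansion reaches the graft point: choose any accessible
         * replica listed there, then let its host map the replica to a
         * storage device using the host's private table.
         */
        for (int i = 0; i < gp.nreplicas; i++)
            printf("replica %d of volume %d stored on host %s\n",
                   gp.replicas[i].replica_id, gp.volume_id, gp.replicas[i].host);
        printf("locally, replica %d lives on %s\n",
               mine[0].replica_id, mine[0].device);
        return 0;
    }

Because both structures are fragments of the name space rather than a global table, their replicas can be kept loosely consistent by the same directory reconciliation machinery described above.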