xFS: A Wide Area Mass Storage File System
Randolph Y. Wang and Thomas E. Anderson
{rywang,tea}@cs.berkeley.edu
Computer Science Division
University of California
Berkeley, CA 94720
Abstract

The current generation of file systems is inadequate in facing the new technological challenges of wide area networks and massive storage. xFS is a prototype file system we are developing to explore the issues brought about by these technological advances. xFS adapts many of the techniques used in the field of high performance multiprocessor design. It organizes hosts into a hierarchical structure so locality within clusters of workstations can be better exploited. By using an invalidation-based write back cache coherence protocol, xFS minimizes network usage. It exploits the file system naming structure to reduce cache coherence state. xFS also integrates different storage technologies in a uniform manner. Due to its intelligent use of local hosts and local storage, we expect xFS to achieve better performance and availability than current generation network file systems run in the wide area.

1 Introduction

Recent advances in robot technology and tertiary media have resulted in the emergence of large capacity and cost effective automated tertiary storage devices. When placed on high performance wide area networks, such devices have dramatically increased the amount of digital information accessible to a large number of geographically distributed users. The amount of near-line storage and the degree of cooperation made possible by these hardware advances are unprecedented.

The presence of wide area networks and tertiary storage, however, has brought new challenges to the design of file systems:

Bandwidth and latency: At least in the short run, bandwidth over a WAN is much more expensive than over a LAN. Furthermore, latency over the wide area, a parameter restricted by the speed of light, will remain high.

Scalability: The central file server model breaks down when there can be:
- thousands of clients,
- terabytes of total client caches for the servers to keep track of,
- billions of files, and
- petabytes of total storage.

Availability: As file systems are made more scalable, allowing larger groups of clients and servers to work together, it becomes more likely at any given time that some clients and servers will be unable to communicate.

Existing distributed file systems were originally designed for local area networks and disks as the bottom layer of the storage hierarchy. They are inadequate in facing the challenges of wide area networks and massive storage. Many suffer from the following shortcomings:

Poor file system protocol: Frequently used techniques, such as broadcasts and write through protocols, while acceptable on a LAN, are no longer appropriate on a WAN because of their inefficient use of bandwidth.

Dependency on centralized servers: The availability and performance of many existing file systems depend crucially on the health of a few centralized servers.

Data structures that do not scale: For example, if a server has to keep cache coherence state for every piece of data stored on the clients, then both the amount of data and the number of clients in the system will have to be quite limited.

Lack of tertiary support: Existing file systems do a poor job of hiding multiple layers of the storage hierarchy. Some use manual migration and/or whole file migration, which is neither convenient nor efficient. Some offer ad hoc extensions to existing systems, which usually increases code complexity.

While others have addressed some aspects of the problem [2, 5, 9, 11, 13], there is as yet no system that handles both WANs and mass storage. While we do not pretend that we have reached the perfect solution of providing fast, cheap, and reliable wide area access to massive amounts of storage, we present the xFS prototype as a design point from which some of the issues of mass storage in the wide area can be explored and future measurements can be taken.

This work was supported in part by the National Science Foundation (CDA-8722788), the Digital Equipment Corporation (the Systems Research Center and the External Research Program), and the AT&T Foundation. Anderson was supported by a National Science Foundation Young Investigator Award.
Our goal in xFS is to provide the performance and availability of a local disk file system when sharing is minimal and storage requirements are small. We observe that many of the problems we face have also been considered by researchers studying distributed shared memory multiprocessors, and there are many well-tested solutions that can be applied to our design. In particular, xFS has the following features:

- xFS organizes hosts into a sophisticated hierarchical structure analogous to those found in some high performance multiprocessors such as DASH [10]. Requests are satisfied by a local cluster if possible to minimize remote communication.

- xFS uses an invalidation-based write back cache coherence protocol pioneered by research done on multiprocessor caches. Clients with local stable storage can store private data indefinitely, consuming less wide area bandwidth with lower latency than if all modifications were written through to the server. For data cached locally, clients can operate without server attention. The Coda [18] study illustrates that this should offer better availability in case of server or network failure.

- xFS exploits the file system naming structure to reduce the state needed to preserve cache coherence. Ownership of a directory and its descendents can be granted as a whole. By avoiding the maintenance of cache coherence information on a strictly per-file or per-block basis, xFS reduces storage requirements and improves performance.

- xFS integrates multiple levels of the storage hierarchy. It employs uniform log-structured management and fast lookup across all storage technologies.

xFS is currently in the middle of the implementation stage and we do not yet have performance numbers to report. The remainder of this position paper is organized as follows. Section 2 presents the hierarchical client-server structure of xFS. Section 3 describes the cache coherence protocol. Section 4 describes how xFS reduces the amount of cache coherence information maintained by hosts. Section 5 discusses the integration of multiple levels of storage. Section 6 discusses some additional issues including crash recovery. Section 7 concludes.

2 Hierarchical Client-server Organization

Andrew [6], NFS [17], and Sprite [20] all have a simple one-layer client-server hierarchy. While it is conceivable to port such a model onto a wide area network (figure 1), the result is likely to be unsatisfactory because it requires a large amount of communication over the WAN. xFS instead uses clustering based on geographic proximity to exploit the distinction between local and remote communication, utilize locality of data usage within clusters, and increase the scalability of the system.

[Figure 1: Simple client-server relationship. A single server at UCB serves clients at UCLA, UCSB, and UCB over a combination of WAN and LAN links.]

The model seen in figure 1 has a number of problems. First of all, WANs have higher latency, lower bandwidth, and higher cost. Although the situation is expected to improve as the technology matures, we believe wide area communication is likely to remain more costly and offer less bandwidth than its LAN counterpart for some time. Even if the bandwidth bottleneck is solved, latency will continue to haunt a naive system. Even under the best possible circumstances, the round trip delay over 1000 km will be comparable to a disk seek. A simple client-server model as seen in figure 1 fails to recognize the distinction between wide area links and local links. The result is overuse of wide area communication, which in turn leads to high cost and poor performance.

Secondly, locality plays a more important role in a wide area, where clusters of hosts that are geographically close to each other are likely to share information more frequently and have an easier time talking to each other. A file system that treats all clients as equals might make sense on a LAN where all clients are virtually peers that exhibit little distinction. Such an organization becomes less acceptable in a wide area where paying more attention to locality is likely to improve performance.

Thirdly, the organization of figure 1 has problems with scale. The server load and the amount of cache coherence state on the server increase with the number of clients. A straightforward deployment of many existing file systems in the wide area can be neither "wide" nor "massive".

In an attempt to solve these problems, xFS employs a hierarchy similar to those found in scalable shared memory multiprocessors (figure 2; in principle, it is possible to extend the hierarchy further to form clusters of clusters). To understand the motivation for such an organization, it is instructive to study the analogy between DASH [10] and xFS. The DASH system consists of a number of processor clusters. Communication among the clusters is provided by a mesh network. This is analogous to the WAN links in xFS, where bandwidth is limited and latency is high. Each DASH cluster consists of a small number of processors, analogous to clients on a LAN. Intra-cluster communication is provided by a bus. This is analogous to the LAN links in xFS, where local communication is cheap and fast.

[Figure 2: xFS hierarchy. A home server in the home cluster at UCB serves clients 5 and 6 at UCB directly; consistency server 0 at UCLA serves clients 0 and 1 (cluster 0), and consistency server 1 at UCSB serves clients 2, 3, and 4 (cluster 1), with the consistency servers connected to the home server over the WAN.]

The analogy, however, does not stop here. Firstly, DASH recognizes the distinction between intra-cluster communication (bus based) and inter-cluster communication (point to point) to optimize performance. Similarly, xFS clients may choose to write through and/or broadcast more frequently on a LAN, where communication is cheap and fast. When communicating across slower and more expensive WAN links, however, xFS hosts speak a write back protocol (section 3).

Secondly, the hierarchical organization does a better job of taking advantage of potential locality within clusters of workstations. When a DASH processor references data cached in a local cluster, no remote communication is incurred. Similarly, one experiment [2] has shown that roughly two thirds of the files read by one workstation are already cached by neighboring clients. A study done by Dahlin [4] shows clients within a cluster can effectively serve each other's cache misses. To exploit such locality, when an xFS client references data not present in its local cache, it first queries a consistency server on the cluster to see if any peer clients can supply the data. It is our hope that most of the local traffic can be contained within the cluster. Only occasionally will a consistency server have to contact the home cluster to handle cache misses.

The function of a consistency server is analogous to that of the directory mechanism in DASH, which is responsible for locating data and enforcing cache coherence. Once the location of the desired data is discovered, data flows directly between the sender and the receiver. In a way, the separation of control messages (those related to requesting, granting, and revoking ownership, as discussed in section 3) and data messages (those that contain actual data) in xFS is analogous to the approach taken in the Mass Storage Reference Model [3], where name services, file movement, and storage management all have distinct information flow paths.

A number of recent studies have explored the idea of hierarchical organizations [2, 5]. One experiment [5] employed an intermediate data cache server and its hit rate was found to be surprisingly low. In such a hierarchy of data caches, the higher levels are made of the same technology as the lower ones, and the higher level caches contain little more than their lower level counterparts do. In effect, intermediate servers become "delayers". An xFS consistency server, on the other hand, by not participating in data caching, avoids unnecessary delays.

Thirdly, the xFS hierarchy provides better scalability. Consistency servers shield the home servers from intra-cluster traffic. Furthermore, in an organization as shown in figure 1, the server must keep long lists of clients requiring consistency actions. In an xFS hierarchy, home servers only have to track the consistency servers, which act as agents for their local clusters.

We believe xFS has a hierarchy better suited to a wide area context. It recognizes the distinction between local and remote communication. It exploits locality more aggressively. It distributes the burden of a centralized server to a number of consistency servers and peer clients. It is thus expected to offer better scalability.

3 Cache Coherence Protocol

Existing file systems use protocols that do not provide strong enough consistency semantics and/or generate an excessive amount of network traffic. xFS speaks an invalidation-based write back protocol designed to minimize communication. A client with local stable storage can store private data indefinitely. The ability of a client to operate more independently without server intervention should also result in better availability.

NFS [17] writes through and offers little consistency guarantee for concurrent read write sharing. Andrew [6] provides a consistent view at file open time and writes back at close time. Sprite [14] enforces perfect consistency by disabling caching for write-shared files, but still writes all dirty data to the server every 30 seconds for reliability reasons. Furthermore, none of these file systems does effective directory caching; consequently, even after optimizing write back policies, there are still a large number of name related operations left [19]. Such unnecessary network traffic might have posed little problem on a LAN but will become a performance bottleneck in a wide area.

We observe that the field of multiprocessor computer architecture has extensive literature on maintaining the consistency of cached replicas while minimizing network usage, by limiting communication to those cases when data is truly being shared. To reduce the number of bytes transferred over the wide area network, xFS uses a protocol based on multiprocessor style cache coherence [15]. An xFS host (a client or a consistency server) requests ownership of a file or a directory from a higher level server. A client that possesses read ownership is allowed to cache data locally for reads (Unix-like file access times are not kept). A client that possesses write ownership is assured that it has the only valid copy of the data. Such a copy is considered private, and the host can cache and modify the copy indefinitely in its own stable storage without contacting the server.
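As an illustration of the ownership rules just described, the following is a minimal sketch of our own (invented names, not code from the xFS prototype): read ownership may be shared among hosts, write ownership is exclusive, and before granting a conflicting request the server issues revoking calls to the current owners.

```python
# Sketch of xFS-style ownership tracking at a server (hypothetical names).
# Read ownership may be shared; write ownership is exclusive.
# Before granting, the server revokes conflicting ownership.

class OwnershipServer:
    def __init__(self):
        self.readers = {}   # path -> set of hosts holding read ownership
        self.writer = {}    # path -> host holding write ownership, if any

    def request_read(self, path, host):
        # Any current writer must be revoked (it writes dirty data back).
        revoked = []
        w = self.writer.pop(path, None)
        if w is not None and w != host:
            revoked.append(w)
        self.readers.setdefault(path, set()).add(host)
        return revoked              # hosts that receive revoking calls

    def request_write(self, path, host):
        # All other readers and any other writer lose ownership.
        revoked = [h for h in self.readers.get(path, set()) if h != host]
        w = self.writer.get(path)
        if w is not None and w != host:
            revoked.append(w)
        self.readers[path] = set()
        self.writer[path] = host
        return revoked

s = OwnershipServer()
s.request_read("/foo", "client0")
s.request_read("/foo", "client1")
print(s.request_write("/foo", "client2"))  # both prior readers are revoked
```

A real server would of course send the revoking calls over the network and wait for the readers' replies; the sketch only records who must be contacted.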
Ownership of data, once obtained by a consistency server, can be granted to lower level clients. A server, in order to enforce cache coherence, may refuse to grant ownership. It may also choose to revoke the ownership of data it has previously granted to other hosts. A client that has write ownership is obliged to transfer its dirty data back only upon receipt of such a revoking call.

In existing stateful file systems [6, 14, 19], an open involves requesting ownership; a close involves relinquishing ownership; and a callback is associated with revoking ownership. Many systems do not distinguish the notion of open from that of requesting ownership, or the notion of close from that of relinquishing ownership. Even in a system that aggressively exploits caching, such as Sprite [14], a client merely passes the user system calls to the server (figure 3a). If a user repeatedly opens and writes the same file, which is a likely event, a Sprite client dutifully transmits each open request and writes dirty data every 30 seconds.

xFS is different in two respects (figure 3b). Firstly, xFS explicitly separates the notion of open from that of requesting ownership. Ownership requests are sent only when necessary. Secondly, the xFS protocol does not include anything that resembles a close system call. Hosts never voluntarily relinquish ownership. They give up ownership only in response to servers' revoking calls. If a user repeatedly writes the same file, only the initial open can result in a request for ownership, and no further communication is required until another client requests the data. In this way, file and directory data is only transferred over the wide area network on a cache miss, on a cache flush, or because of true sharing between multiple clients. In other words, xFS writes on demand, as opposed to traditional file systems that write data through to guard against the possibility that it might be shared.

[Figure 3: Separating notions of open/close/callback from those of requesting/relinquishing/revoking ownership. In the Sprite style protocol (a), user opens and closes pass from the client to the server, with callbacks in the reverse direction; in the xFS protocol (b), clients exchange ownership requests and revocations with consistency servers, which in turn do so with the home server.]

Figure 4 shows the xFS file state transition in more detail. A file is in one of five states: not owned (nobody has ownership), read-1 (one host has read ownership), read sharing (multiple hosts have read ownership), write-1 (one host has write ownership), and write sharing (nobody has ownership). As an example, consider a file that is in the read sharing state. If a new host requests read ownership, it is granted the ownership and the file stays in the same state. If a host requests write ownership, the server revokes ownership from all the current readers. In the reply messages, the readers specify whether or not they still desire read ownership. If none of the previous readers wants to retain read ownership, the file goes into the write-1 state and the new writer is granted write ownership. If some readers are still actively reading the file, then the file goes into the write sharing state and all hosts are denied ownership.

[Figure 4: xFS file state transition.]

Due to its better use of the local cache and less dependency on servers, xFS clients not only should observe lower latency and use less bandwidth, they should also achieve better availability. Existing file systems rely heavily on the servers for write throughs and name related operations. The demise of a single server or a failed network link usually renders the clients helpless. xFS hosts, on the other hand, after acquiring ownership of the proper data, can operate more independently. Furthermore, xFS hosts will have an easier time reintegrating after disconnection. A Coda [18] client, for example, is required to write the dirty data it has accumulated during disconnection back to the server during reintegration. In xFS, such writes are not necessary unless the dirty data is needed by others.

There are a number of problems with this cache coherence scheme. The first is the large amount of state information each host needs to keep. This is addressed in the next section. The second is the need for local stable storage that will keep dirty data indefinitely. This is addressed in section 5. The third is the need for policies to deal with network partitions and down hosts. This is discussed in section 6.

4 Reducing Cache Coherence State

Many existing stateful file systems maintain cache consistency in a way that is not designed to scale to a massive amount of client cache. xFS reduces the cache coherence state by exploiting the hierarchical naming structure of the file system.
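The key observation, introduced in section 2, is that ownership of a directory and its descendents can be granted as a whole. A minimal sketch of our own (not xFS code, and with invented names) shows how a server might then answer ownership queries by walking up the name space to the nearest covering grant, instead of keeping per-file state:

```python
# Sketch: subtree-granularity ownership. A single grant recorded at a
# directory covers all of its descendents, so the server keeps one
# entry per subtree instead of one entry per file.

import posixpath

class SubtreeOwnership:
    def __init__(self):
        self.owned = {}  # directory path -> owning host

    def grant(self, directory, host):
        self.owned[directory] = host

    def owner_of(self, path):
        # Walk toward the root until a covering grant is found.
        while True:
            if path in self.owned:
                return self.owned[path]
            parent = posixpath.dirname(path)
            if parent == path:       # reached the root; no grant found
                return None
            path = parent

t = SubtreeOwnership()
t.grant("/home/foo", "X")            # one entry covers the whole subtree
print(t.owner_of("/home/foo/baz/report"))  # X
```

One table entry here stands in for arbitrarily many files, which is the storage saving the section describes; revocation would simply remove or split the covering entry.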
A Sprite or an Andrew server, for example, must keep track of every file cached by any client. The amount of server state grows with the total size of the client caches. xFS lives on a wide area network, where there are potentially a large number of clients and a large amount of callback information associated with each client. A way to deal with this explosion of state information needs to be found.

The hierarchical nature of the system as discussed in section 2 partially alleviates the problem. Home servers only need to keep track of the top level consistency servers. Bottom level consistency servers only need to tally the leaf clients that they serve. This reduces the length of the client lists but does not reduce the number of files that might require consistency actions.

A second technique xFS uses to deal with state explosion is based on the observation that clusters of files tend to have the same ownership. There is little need to keep state for individual files when a representative for a large cluster can be used instead. Figure 5 illustrates the idea. Here we see user "foo" is working on machine "X". "X" has acquired read ownership for cluster "A" and write ownership for clusters "B" and "C". This has happened as a result of modification of her home directory and directory "baz" by user "foo". Similarly, machine "Y" has acquired read ownership for cluster "A" and write ownership for cluster "D". We see that ownership only needs to be associated with clusters of files.

[Figure 5: Hierarchical state information. A directory tree rooted at "/" with subdirectories "foo", "bar", and "baz": cluster A is read owned by X and Y, clusters B and C are write owned by X, and cluster D is write owned by Y.]

A third technique is based on the observation that a large number of files in the system are widely shared but rarely modified. For such files (an entire subtree that is exported read only, for example), instead of remembering exactly which hosts have acquired read ownership, xFS merely remembers the fact that read ownership has been granted to some hosts. In the rare event that such files do get modified, xFS resorts to broadcasts to locate the readers. Such broadcasts, however, do not necessarily have to flood every single host, due to the hierarchical nature of xFS. A home server communicates with all the consistency servers it serves, and only the consistency servers that have acquired read ownership have to forward the revoking broadcast to their clients.

An xFS server thus maintains cache coherence state for clusters of files instead of individual files. For each cluster of files, a server only keeps a list of consistency servers instead of individual clients. And for files that are almost never modified, even this list can be done away with. By employing these techniques, we expect to substantially reduce the storage requirements for cache coherence state.

5 Integration of Multi-level Storage

xFS is designed to utilize a multi-level hierarchy of stable storage, capable of indefinitely storing modified file and directory data. xFS integrates all storage devices in a simple and uniform manner:

- All storage devices are treated as caches.
- All storage devices are written in a log-structured manner.
- Data blocks are located by looking up virtual-to-physical translation tables maintained on fast devices.

In addition to deciding on appropriate migration policies that move data around, we need to address two immediate questions. The first is how to locate data as blocks migrate among different levels of storage. The second is how to lay out data on media in an efficient manner.

To solve the problem of locating data in multiple levels of storage, some existing systems extend the UFS data structures to incorporate tertiary devices. For example, a block address in Highlight [9] can belong to the disk farm, some tertiary store, or the disk cache for tertiary storage. Given an <inode number, offset> pair, locating data involves locating the inode and indexing the inode. If the block address is found to be on tertiary storage, a fetch is done to bring the missing data into the disk cache and the operation resumes. Such extensions usually complicate the code and force the new storage devices to inherit data structures that were originally designed for disks.

A solution to the second problem, namely that of laying out data on media efficiently, is provided by the log-structured file system (LFS) [16]. LFS appends newly written data to a segmented log and is optimized for write performance. As old data is deleted, LFS reclaims disk space by recopying sparsely populated segments to the tail of the log and marking those segments as clean. LFS provides superior performance in environments where large main memory caches absorb most of the reads and consequently disk write speed becomes the limiting factor. Since many tertiary devices are append-only (most media deliver their best performance when used as append-only devices) and tertiary archival storage is mostly write-only, a log-structured layout is a natural way of managing such devices.
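The two design decisions above (every device is a cache, and every device is written as a log indexed by a translation table) can be sketched together; this is our own simplified illustration, not the xFS implementation, and it keeps one block per table entry for brevity:

```python
# Sketch of an xFS-style storage device: all writes append to a log,
# and a translation table maps a block ID to its physical location.

class LogDevice:
    def __init__(self):
        self.log = []    # append-only log of (block_id, data) records
        self.table = {}  # block ID, e.g. (file_id, block_no) -> log slot

    def write(self, block_id, data):
        self.table[block_id] = len(self.log)  # newest copy wins
        self.log.append((block_id, data))

    def read(self, block_id):
        pos = self.table.get(block_id)
        if pos is None:
            return None                       # cache miss
        return self.log[pos][1]

    def invalidate(self, block_id):
        self.table.pop(block_id, None)        # old copy becomes garbage

    def is_live(self, pos):
        # Cleaner check: a log slot is live iff the table still points at it.
        block_id, _ = self.log[pos]
        return self.table.get(block_id) == pos

d = LogDevice()
d.write(("fileA", 0), b"v1")
d.write(("fileA", 0), b"v2")       # overwrite appends; slot 0 is now dead
print(d.read(("fileA", 0)))        # b'v2'
print(d.is_live(0), d.is_live(1))  # False True
```

Migrating a block to another level of the hierarchy then amounts to writing it to the destination device and invalidating it at the source, exactly the table update the text goes on to describe.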
Highlight [9], a descendant of LFS, bases its design on this observation. It employs a different cleaner process to migrate selected data blocks out to the tertiary store, which is also written log-structured.

When designing the xFS approach to locating data, we draw analogies from solutions to memory management problems. A virtually addressed memory word can be anywhere from an on-chip cache to a secondary backing store. To find out where it is, we first have to translate a virtual address to a physical address. The xFS equivalent of a virtual address is a block ID:

  <file ID, block #>

Such a virtual address can have its physical incarnation anywhere in a hierarchy of storage devices. Each device is treated as a cache. Each device employs a fast translation table that translates a block ID to a physical address on the device. A translation table entry is of the following form:

  <block ID, # of blocks, device address>

As data migrates among different storage levels, we simply change the corresponding translation tables, a simpler and cleaner approach than extending the existing UFS data structures. This is similar to the approach taken by [7], where logical disk addresses are mapped to physical ones to allow a clean separation between file and disk management without sacrificing performance.

The memory management analogy, unfortunately, does not apply to data layout. Firstly, unlike processor caches, which are usually direct mapped, we would like our xFS devices to be fully associative. Secondly, unlike processor caches and main memory, few of the storage technologies xFS manages are truly random access devices. For these two reasons, xFS writes its caches in a log-structured manner.

When an xFS device receives a read request, it queries its translation table to find the corresponding physical address. A successful search leads us directly to the device location where the data is found. If the search fails, the device declares a cache miss. When an xFS device receives a write request, it appends the data blocks to the end of the log and updates its translation table to reflect the new mapping. Invalidation of data blocks involves removing the corresponding mappings from the translation table. When it is time to clean, whether a block is live or not can be determined by comparing its identity against the corresponding translation table entry.

Ejecting from an xFS device is accomplished by a migration process. It chooses a number of victim blocks according to some policy. If the blocks are dirty, the migration agent retrieves the blocks from the source device and sends them to the destination device. Then it simply invalidates the blocks in the source device, and the freed space will be reclaimed by the cleaner.

In principle, the translation table can always be "paged" to a slower device. For simplicity, we have opted to keep the translation table entirely on a faster device. For example, the translation table for a disk cache is kept entirely in memory (upon reboot, the memory resident translation table is reconstructed by examining the identities of the data blocks on disk), and the translation table for tertiary storage is kept entirely on disk. An obvious disadvantage of doing so is the storage required for the translation tables. For example, tens of megabytes of main memory might be needed to manage less than ten gigabytes of disk storage. This problem is partially alleviated by using a single translation table entry to point to a number of consecutively written blocks. Thus large sequentially written files will need little space in the translation table. We believe that, given the fast decrease in memory cost, the memory used for the translation table is money well spent in exchange for the resulting simplicity. Similarly, studies of some workloads suggest that less than ten gigabytes of disk space will suffice for "translating" around ten terabytes of tertiary storage, and such an investment is probably worthwhile.

We plan to implement two interfaces for each kind of xFS device: a storage interface that allows the device to be used as a caching store, and a translation interface that allows the device to be used as a translation device. A disk, for example, has two interfaces that allow it to be used either as a storage device or as a translation table for tertiary storage. An appropriately chosen pair of slow and fast devices can be "glued" together to form a storage bin. Migration agents running on each machine choose blocks to inject into or eject from the storage bins they oversee. Pieces of the same file can potentially spread across multiple levels of the storage hierarchy over a wide area, reflecting the widely distributed and tertiary nature of xFS.

6 Other Issues

One question any distributed system has to answer is how it handles machine crashes and network partitions. An xFS client has to remember what ownerships it has acquired from which servers. An xFS server needs to remember what ownerships it has granted to which clients. Even after applying the state compression techniques described in section 4, the amount of information would still be too large to keep entirely in memory as Sprite [20] and Spritely NFS [19] do. Part of it needs to be written to stable storage. On the other hand, we cannot afford to log every ownership change to disk. Consequently, we need to recover the memory resident information lost in a crash.

The task is relatively simple for clients. Whenever a client commits data to stable storage, it conveniently logs the corresponding ownership. Any ownership lost in a crash will not have dirty data associated with it. In the worst case, the server might believe it has granted ownership to certain clients that the clients have lost in crashes.
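The client side of this recovery scheme can be sketched as follows (our own illustration, not the xFS implementation): ownership is written to the stable log only together with committed dirty data, so after a crash the client reinstates exactly those ownerships that have dirty data behind them, and any unlogged ownership is simply lost.

```python
# Sketch: a client logs ownership alongside committed dirty data, so
# that crash recovery reinstates only ownership backed by stable data.

class Client:
    def __init__(self):
        self.stable_log = []     # survives crashes
        self.ownerships = set()  # volatile; lost in a crash

    def acquire(self, path, mode):
        self.ownerships.add((path, mode))   # not logged yet

    def commit(self, path, mode, data):
        # Committing dirty data also records the ownership it needs.
        self.stable_log.append(("own", path, mode))
        self.stable_log.append(("data", path, data))

    def crash_and_recover(self):
        # Rebuild volatile state from the ownership records in the log.
        self.ownerships = {(p, m) for tag, p, m in
                           [e for e in self.stable_log if e[0] == "own"]}

c = Client()
c.acquire("/a", "write")
c.acquire("/b", "write")
c.commit("/a", "write", b"dirty")
c.crash_and_recover()
print(c.ownerships)  # only /a survives; /b had no dirty data behind it
```

This matches the text: losing the unlogged ownership for /b is harmless, because no dirty data depends on it, and the server's stale grant is cleared the next time it sends a revoking call.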
7 Conclusion ownership to certain clients which the clients have lost
in crashes. Later when the server sends revoking calls to
Existing disk based lo cal area le systems can no longer
which the clients simply reply that they do not haveit
meet the demands of mass storage over a wide area. xFS
any more. A recovering server, however, has to recover
uses a hierarchy that can b etter take advantage of the lo-
the exact state it had b efore the crash. It app ears that
cality nature. It sp eaks a cache coherency proto col that
a server centric approach as done in [12]would work
minimizes the use of wide area communication. Con-
well. Under suchascheme, a recovering server directs
sequently, it is exp ected to op erate with lower latency
its clients to help rebuild the lost server state. It has
and consume less wide area bandwidth. xFS minimizes
also b een noted that such frequently up dated and small
cache coherence state information by exploiting the hi-
amountofvolatile state is a go o d candidate for inclusion
erarchical nature of le system name space and hierar-
in a stable memory that can survive machine crashes
chical nature of the cluster based organization. xFS in-
and the recovery cost can b e kept to a minimum [1].
Another question we need to answer is how to deal with machines that do not respond to ownership revokes due to crashes or network partitions. There are three alternatives: fail the operation that triggered the revoke in the first place, hang the operation indefinitely until the revoke is answered, or let the operation proceed at the risk of allowing consistency conflicts, which will have to be resolved when the offending hosts rejoin. We adopt the third approach, taken by Coda [18], which trades consistency for availability. Fortunately, the Coda study has shown that such conflicts are extremely rare events, and we do not expect ownership revokes to be frequent in xFS.
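The third alternative can be sketched as follows; the revoke call, the reachability test, and the conflict log are hypothetical stand-ins, not xFS's actual protocol:

```python
# Hypothetical sketch of the third alternative described above: if an
# ownership revoke goes unanswered, the operation proceeds anyway and
# the potential conflict is logged for resolution when the host rejoins.

conflict_log = []   # conflicts to resolve when offending hosts rejoin

def send_revoke(host, block, reachable):
    """Stand-in for the revoke call; returns False on timeout or partition."""
    return reachable.get(host, False)

def write_block(block, owner, reachable):
    if not send_revoke(owner, block, reachable):
        # Owner is crashed or partitioned: proceed optimistically.
        conflict_log.append((owner, block))
    return "written"

reachable = {"clientA": True, "clientB": False}
write_block(1, "clientA", reachable)   # revoke answered, no conflict
write_block(2, "clientB", reachable)   # unanswered: logged for later
print(conflict_log)                    # → [('clientB', 2)]
```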
xFS's approach of storing data locally and indefinitely also raises other concerns. Security is one. Andrew [6] treats servers as first-class citizens while clients are deemed less trustworthy. In xFS, however, the line between clients and servers is blurred when clients are allowed to serve each other (Section 2). In such an organization, the clusters can be treated as security domains in which cluster members trust each other, while foreign clusters that are not homes are deemed untrustworthy. A more paranoid approach would require the writer to compute a checksum. Kaliski [8] has devised an efficient algorithm which makes it computationally infeasible to find two messages with the same checksum, or a message with a prespecified checksum. A reader, upon receipt of the data and the checksum, can run the same algorithm to verify that the data has not been altered.
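The checksum scheme can be illustrated with Python's standard `hashlib`; SHA-256 is used here merely as a stand-in for the MD4 algorithm of [8]:

```python
# Illustration of the writer-checksum scheme described above. The paper
# cites MD4 [8]; SHA-256 from Python's standard library serves here as a
# stand-in with the same collision-resistance goal.
import hashlib

def write_with_checksum(data: bytes):
    """The writer computes a digest over the data it stores."""
    return data, hashlib.sha256(data).hexdigest()

def verify(data: bytes, checksum: str) -> bool:
    """The reader recomputes the digest and compares."""
    return hashlib.sha256(data).hexdigest() == checksum

data, cksum = write_with_checksum(b"block contents")
print(verify(data, cksum))          # → True
print(verify(b"tampered", cksum))   # → False
```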
Another difficulty is backup. When storage is concentrated on several gigabytes of disks on a few servers, disaster recovery, as opposed to crash recovery, can be accomplished by rolling the whole world back to a tape dump. Such an approach fails to work when there are massive amounts of storage and/or the storage is so widely distributed that taking consistent snapshots is virtually impossible. One possible way of providing disaster recovery for xFS is to avoid a disaster in the first place by insisting on having two copies of everything at different sites at all times. An effective way of providing backup for mass storage in the wide area is a topic for future research.

7 Conclusion

Existing disk-based local area file systems can no longer meet the demands of mass storage over a wide area. xFS uses a hierarchy that can better exploit locality. It speaks a cache coherence protocol that minimizes the use of wide area communication; consequently, it is expected to operate with lower latency and consume less wide area bandwidth. xFS minimizes cache coherence state information by exploiting the hierarchical nature of the file system name space and of the cluster-based organization. xFS integrates multiple levels of storage in a uniform manner. Its ability to take advantage of local storage should deliver better economy, superior performance, and higher availability.

Acknowledgements

Discussions with Mike Dahlin and Dave Patterson have helped improve the xFS design, particularly that of the hierarchical client-server structure. We would also like to thank Soumen Chakrabarti, Mike Dahlin, John Hartman, Arvind Krishnamurthy, and Clifford Mather for their helpful comments.

References

[1] M. Baker and M. Sullivan. The Recovery Box: Using Fast Recovery to Provide High Availability in the Unix Environment. USENIX Association Summer 1992 Conference Proceedings, pages 31-43, June 1992.

[2] M. Blaze and R. Alonso. Toward Massive Distributed Systems. Proceedings of the 3rd Workshop on Workstation Operating Systems, pages 48-51, April 1992.

[3] S. Coleman and S. Miller. Mass Storage System Reference Model: Version 4. IEEE Technical Committee on Mass Storage Systems and Technology, May 1990.

[4] M. Dahlin. Private communication. 1993.

[5] D. Muntz and P. Honeyman. Multi-Level Caching in Distributed File Systems. USENIX Association Winter 1992 Conference Proceedings, January 1992.

[6] J. Howard, M. Kazar, S. Menees, D. Nichols, M. Satyanarayanan, R. Sidebotham, and M. West. Scale and Performance in a Distributed File System. ACM Transactions on Computer Systems, 6(1):51-81, February 1988.

[7] W. de Jonge, M. F. Kaashoek, and W. Hsieh. The Logical Disk: A New Approach to Improving File System Performance. Proceedings of the 14th Symposium on Operating Systems Principles, December 1993. To appear.

[8] B. Kaliski, Jr. The MD4 Message Digest Algorithm. Workshop on the Theory and Application of Cryptographic Techniques Proceedings, May 1990.

[9] J. Kohl and C. Staelin. HighLight: Using a Log-structured File System for Tertiary Storage Management. USENIX Association Winter 1993 Conference Proceedings, January 1993.

[10] D. Lenoski, J. Laudon, K. Gharachorloo, A. Gupta, and J. Hennessy. The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor. Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 148-159, May 1990.

[11] F. McClain. DataTree and UniTree: Software for File and Storage Management. Digest of Papers, Proceedings of the 10th IEEE Symposium on Mass Storage Systems, pages 126-128, May 1990.

[12] J. Mogul. A Recovery Protocol for Spritely NFS. Proceedings of the USENIX Workshop on File Systems, May 1992.

[13] J. Mott-Smith. The Jaquith Archive Server. UCB/CSD Report 92-701, University of California, Berkeley, September 1992.

[14] M. Nelson, B. Welch, and J. Ousterhout. Caching in the Sprite Network File System. ACM Transactions on Computer Systems, 6(1):134-154, February 1988.

[15] M. Papamarcos and J. Patel. A Low Overhead Coherence Solution for Multiprocessors with Private Cache Memories. Proceedings of the 11th Annual International Symposium on Computer Architecture, pages 348-354, June 1984.

[16] M. Rosenblum and J. Ousterhout. The Design and Implementation of a Log-Structured File System. Operating Systems Review, 25(5):1-15, October 1991.

[17] R. Sandberg, D. Goldberg, S. Kleiman, D. Walsh, and B. Lyon. Design and Implementation of the Sun Network Filesystem. Proceedings of the 1985 Summer USENIX Conference, June 1985.

[18] M. Satyanarayanan, J. Kistler, P. Kumar, M. Okasaki, E. Siegel, and D. Steere. Coda: A Highly Available File System for a Distributed Workstation Environment. IEEE Transactions on Computers, 39(4):447-459, April 1990.

[19] V. Srinivasan and J. Mogul. Spritely NFS: Experiments with Cache Consistency Protocols. Proceedings of the 12th Symposium on Operating Systems Principles, pages 45-57, December 1989.

[20] B. Welch. The Sprite Distributed File System. PhD thesis, Department of Electrical Engineering and Computer Science, University of California, Berkeley, March 1990.