xFS: A Wide Area Mass Storage File System

Randolph Y. Wang and Thomas E. Anderson

{rywang, tea}@cs.berkeley.edu

Computer Science Division

University of California

Berkeley, CA 94720

Abstract

The current generation of file systems is inadequate in facing the new technological challenges of wide area networks and massive storage. xFS is a prototype file system we are developing to explore the issues brought about by these technological advances. xFS adapts many of the techniques used in the field of high performance multiprocessor design. It organizes hosts into a hierarchical structure so locality within clusters of workstations can be better exploited. By using an invalidation-based write back cache coherence protocol, xFS minimizes network usage. It exploits the file system naming structure to reduce cache coherence state. xFS also integrates different storage technologies in a uniform manner. Due to its intelligent use of local hosts and local storage, we expect xFS to achieve better performance and availability than current generation network file systems run in the wide area.

1 Introduction

Recent advances in robot technology and tertiary media have resulted in the emergence of large capacity and cost effective automated tertiary storage devices. When placed on high performance wide area networks, such devices have dramatically increased the amount of digital information accessible to a large number of geographically distributed users. The amount of near-line storage and the degree of cooperation made possible by these hardware advances are unprecedented.

The presence of wide area networks and tertiary storage, however, has brought new challenges to the design of file systems:

- Bandwidth and latency: At least in the short run, bandwidth over a WAN is much more expensive than over a LAN. Furthermore, latency over the wide area, a parameter restricted by the speed of light, will remain high.

- Scalability: The central file server model breaks down when there can be thousands of clients, terabytes of total client caches for the servers to keep track of, billions of files, and petabytes of total storage.

- Availability: As file systems are made more scalable, allowing larger groups of clients and servers to work together, it becomes more likely at any given time that some clients and servers will be unable to communicate.

Existing distributed file systems were originally designed for local area networks and disks as the bottom layer of the storage hierarchy. They are inadequate in facing the challenges of wide area networks and massive storage. Many suffer from the following shortcomings:

- Poor file system protocol: Frequently used techniques, such as broadcasts and write through protocols, while acceptable on a LAN, are no longer appropriate on a WAN because of their inefficient use of bandwidth.

- Dependency on centralized servers: The availability and performance of many existing file systems depend crucially on the health of a few centralized servers.

- Data structures that do not scale: For example, if a server has to keep cache coherence state for every piece of data stored on the clients, then both the amount of data and the number of clients in the system will have to be quite limited.

- Lack of tertiary support: Existing file systems do a poor job of hiding multiple layers of the storage hierarchy. Some use manual migration and/or whole file migration, which is neither convenient nor efficient. Some offer ad hoc extensions to existing systems, which usually increases code complexity.

While others have addressed some aspects of the problem [2, 5, 9, 11, 13], there is as yet no system that handles both WANs and mass storage. While we do not pretend that we have reached the perfect solution of providing fast, cheap, and reliable wide area access to massive amounts of storage, we present the xFS prototype as a design point from which some of the issues of mass storage in the wide area can be explored and future measurements can be taken.

[Footnote: This work was supported in part by the National Science Foundation (CDA-8722788), the Digital Equipment Corporation (the Systems Research Center and the External Research Program), and the AT&T Foundation. Anderson was supported by a National Science Foundation Young Investigator Award.]

Our goal in xFS is to provide the performance and availability of a local disk file system when sharing is minimal and storage requirements are small. We observe that many of the problems we face have also been considered by researchers studying distributed shared memory multiprocessors, and there are many well-tested solutions that can be applied to our design. In particular, xFS has the following features:

- xFS organizes hosts into a more sophisticated hierarchical structure analogous to those found in some high performance multiprocessors such as DASH [10]. Requests are satisfied by a local cluster if possible to minimize remote communication.

- xFS uses an invalidation-based write back cache coherence protocol pioneered by research done on multiprocessor caches. Clients with local stable storage can store private data indefinitely, consuming less wide area bandwidth with lower latency than if all modifications were written through to the server. For data cached locally, clients can operate on it without server attention. The Coda [18] study illustrates that this should offer better availability in case of server or network failure.

- xFS exploits the file system naming structure to reduce the state needed to preserve cache coherence. Ownership of a directory and its descendents can be granted as a whole. By avoiding the maintenance of cache coherence information on a strictly per-file or per-block basis, xFS reduces storage requirements and improves performance.

- xFS integrates multiple levels of the storage hierarchy. xFS employs a uniform log-structured management and fast lookup across all storage technologies.

xFS is currently in the middle of the implementation stage and we do not yet have performance numbers to report. The remainder of this position paper is organized as follows. Section 2 presents the hierarchical client-server structure of xFS. Section 3 describes the cache coherence protocol. Section 4 describes how xFS reduces the amount of cache coherence information maintained by hosts. Section 5 discusses the integration of multiple levels of storage. Section 6 discusses some additional issues including crash recovery. Section 7 concludes.

[Figure 1: Simple client-server relationship. A single server (UCB) is connected by WAN and LAN links directly to every client, whether the client is at UCB, UCLA, or UCSB.]

2 Hierarchical Client-server Organization

Andrew [6], NFS [17], and Sprite [20] all have a simple one-layer client-server hierarchy. While it is conceivable to port such a model onto a wide area network (figure 1), the result is likely to be unsatisfactory because it requires a large amount of communication over the WAN. xFS uses clustering based on geographic proximity to exploit the distinction between local and remote communication, utilize locality of data usage within clusters, and increase the scalability of the system.

The model seen in figure 1 has a number of problems. First of all, WANs have higher latency, lower bandwidth, and higher cost. Although the situation is expected to improve as the technology matures, we believe wide area communication is likely to remain more costly and offer less bandwidth than its LAN counterpart for some time. Even if the bandwidth bottleneck is solved, latency will continue to haunt a naive system. Even under the best possible circumstances, the round trip delay over 1000 km will be comparable to a disk seek. A simple client-server model as seen in figure 1 fails to recognize the distinction between wide area links and local links. The result is overuse of wide area communication, which in turn leads to high cost and poor performance.

Secondly, locality plays a more important role in a wide area, where clusters of hosts that are geographically close to each other are likely to share information more frequently and have an easier time talking to each other. A file system that treats all clients as equals might make sense on a LAN, where all clients are virtually peers that exhibit little distinction. Such an organization becomes less acceptable in a wide area, where paying more attention to locality is likely to improve performance.

Thirdly, the organization of figure 1 has problems with scale. The server load and the amount of cache coherence state on the server increase with the number of clients. A straightforward deployment of many existing file systems in the wide area can be neither "wide" nor "massive".

In an attempt to solve these problems, xFS employs a hierarchy similar to those found in scalable shared memory multiprocessors (figure 2). [Footnote 1: In principle, it is possible to extend the hierarchy further to form clusters of clusters.] To understand the motivation of such an organization, it is instructive to study the analogy between DASH [10] and xFS. The DASH system consists of a number of processor clusters. Communication among the clusters is provided by a mesh network. This is analogous to the WAN links in xFS, where bandwidth is limited and latency is high. Each DASH cluster consists of a small number of processors, analogous to clients on a LAN. Intra-cluster communication is provided by a bus. This is analogous to the LAN links in xFS, where local communication is cheap and fast.

[Figure 2: xFS hierarchy. A home server at UCB (the home cluster) communicates over the WAN with consistency servers for Cluster 0 (UCLA) and Cluster 1 (UCSB); clients talk to their local consistency server over the LAN.]

The analogy, however, does not stop here. Firstly, DASH recognizes the distinction between intra-cluster communication (bus based) and inter-cluster communication (point to point) to optimize performance. Similarly, xFS clients may choose to write through and/or broadcast more frequently on a LAN, where communication is cheap and fast. When communicating across slower and more expensive WAN links, however, xFS hosts speak a write back protocol (section 3).

Secondly, the hierarchical organization does a better job of taking advantage of potential locality within clusters of workstations. When a DASH processor references data cached in a local cluster, no remote communication is incurred. Similarly, one experiment [2] has shown that roughly two thirds of the files read by one workstation are already cached by neighboring clients. A study done by Dahlin [4] shows clients within a cluster can effectively serve each other's cache misses. To exploit such locality, when an xFS client references data not present in its local cache, it first queries a consistency server on the cluster to see if any peer clients can supply the data. It is our hope that most of the local traffic can be contained within the cluster. Only occasionally will a consistency server have to contact the home cluster to handle cache misses.
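To make this lookup path concrete, the sketch below walks a read request through the hierarchy just described: the local cache first, then a peer in the cluster located through the consistency server, and only then the home cluster over the WAN. The class and method names (HomeServer, ConsistencyServer, Client, lookup) are our own illustration, not the xFS interfaces.

```python
# Minimal sketch of the hierarchical read path (illustrative names, not xFS code).

class HomeServer:
    """Authoritative storage in the home cluster, reached over the WAN."""
    def __init__(self, store):
        self.store = store                     # block_id -> data

    def fetch(self, block_id):
        return self.store[block_id]


class ConsistencyServer:
    """Per-cluster agent: locates data but does not cache it itself."""
    def __init__(self, home):
        self.home = home
        self.cached_at = {}                    # block_id -> set of local clients

    def lookup(self, block_id, requester):
        for peer in self.cached_at.get(block_id, set()):
            if peer is not requester:          # LAN transfer from a peer
                return peer.cache[block_id]
        return self.home.fetch(block_id)       # occasional WAN round trip

    def note_cached(self, block_id, client):
        self.cached_at.setdefault(block_id, set()).add(client)


class Client:
    def __init__(self, cserver):
        self.cserver = cserver
        self.cache = {}                        # local (stable) storage

    def read(self, block_id):
        if block_id in self.cache:             # local hit: no communication at all
            return self.cache[block_id]
        data = self.cserver.lookup(block_id, self)
        self.cache[block_id] = data
        self.cserver.note_cached(block_id, self)
        return data


home = HomeServer({("paper.tex", 0): b"..."})
cs = ConsistencyServer(home)
a, b = Client(cs), Client(cs)
a.read(("paper.tex", 0))    # miss: fetched from the home cluster
b.read(("paper.tex", 0))    # miss: satisfied by peer a within the cluster
```

In the second read no wide area traffic is needed; the consistency server only redirects the request within the cluster.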

The function of a consistency server is analogous to that of the directory mechanism in DASH, which is responsible for locating data and enforcing cache coherence. Once the location of the desired data is discovered, data flows directly between the sender and the receiver. In a way, the separation of control messages [Footnote 2: This refers to messages related to requesting, granting, and revoking ownerships as discussed in section 3.] and data messages [Footnote 3: This refers to messages that contain actual data.] in xFS is analogous to the approach taken in the Mass Storage Reference Model [3], where name services, file movement, and storage management all have distinct information flow paths.

A number of recent studies have explored the idea of hierarchical organizations [2, 5]. One experiment [5] employed an intermediate data cache server and its hit rate was found to be surprisingly low. In such a hierarchy of data caches, the higher levels are made of the same technology as the lower ones, and the higher level caches contain little more than their lower level counterparts do. In effect, intermediate servers become "delayers". An xFS consistency server, on the other hand, by not participating in data caching, avoids unnecessary delays.

Thirdly, the xFS hierarchy provides better scalability. Consistency servers shield the home servers from intra-cluster traffic. Furthermore, in an organization as shown in figure 1, the server must keep long lists of clients requiring consistency actions. In an xFS hierarchy, home servers only have to track the consistency servers, which act as agents for their local clusters.

We believe xFS has a hierarchy better suited for a wide area context. It recognizes the distinction between local and remote communication. It exploits locality more aggressively. It distributes the burden of a centralized server to a number of consistency servers and peer clients. It is thus expected to offer better scalability.

3 Cache Coherence Protocol

Existing file systems use protocols that do not provide strong enough consistency semantics and/or generate an excessive amount of network traffic. xFS speaks an invalidation-based write back protocol designed to minimize communication. A client with local stable storage can store private data indefinitely. The ability of a client to operate more independently without server intervention should also result in better availability.

NFS [17] writes through and offers little consistency guarantee for concurrent read write sharing. Andrew [6] provides a consistent view at file open time and writes back at close time. Sprite [14] enforces perfect consistency by disabling caching for write-shared files, but still writes all dirty data to the server every 30 seconds for reliability reasons. Furthermore, none of these file systems does effective directory caching; consequently, even after optimizing write back policies, there are still a large number of name related operations left [19]. Such unnecessary use of network traffic might have posed little problem on a LAN but will become a performance bottleneck in a wide area.

We observe that the field of multiprocessor computer architecture has an extensive literature on maintaining the consistency of cached replicas while minimizing network usage by limiting communication to those cases when data is truly being shared. To reduce the number of bytes transferred over the wide area network, xFS uses a protocol based on multiprocessor style cache coherence [15].

An xFS host [Footnote 4: This refers to a client or a consistency server.] requests ownership of a file or a directory from a higher level server. A client that possesses read ownership is allowed to cache data locally for reads. [Footnote 5: Unix-like file access times are not kept.] A client that possesses write ownership is assured that it has the only valid copy of the data. Such a copy is considered private, and the host can cache and modify the copy indefinitely in its own stable storage without contacting the server. Ownership of data, once obtained by a consistency server, can be granted to lower level clients. A server, in order to enforce cache coherence, may refuse to grant ownership. It may also choose to revoke the ownership of data it has previously granted to other hosts. A client that has the write ownership is obliged to transfer its dirty data back only upon receipt of such a revoking call.
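The sketch below illustrates this discipline with hypothetical names: ownership is requested only on demand, writes stay in the local cache while write ownership is held, and dirty data travels back only when a revoking call arrives. Read ownership and the consistency server level are omitted; this is not the xFS implementation.

```python
# Write back ownership discipline (hypothetical sketch, single-owner only).

class Server:
    def __init__(self):
        self.owner = {}                   # path -> host currently holding ownership
        self.data = {}

    def request_ownership(self, path, host):
        holder = self.owner.get(path)
        if holder is not None and holder is not host:
            holder.revoke(path)           # only now is dirty data pulled back
        self.owner[path] = host
        return self.data.get(path)


class Host:
    def __init__(self, server):
        self.server = server
        self.owned = set()
        self.cache = {}
        self.dirty = set()

    def write(self, path, data):
        if path not in self.owned:        # request ownership only on demand
            self.cache[path] = self.server.request_ownership(path, self)
            self.owned.add(path)
        self.cache[path] = data           # purely local: no server traffic
        self.dirty.add(path)

    def revoke(self, path):
        if path in self.dirty:            # write back happens only on revocation
            self.server.data[path] = self.cache[path]
            self.dirty.discard(path)
        self.owned.discard(path)


server = Server()
x, y = Host(server), Host(server)
x.write("/foo/baz/notes", b"draft 1")
x.write("/foo/baz/notes", b"draft 2")     # no further messages to the server
y.write("/foo/baz/notes", b"edit by y")   # triggers a revoke; x writes back first
```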

In existing stateful file systems [6, 14, 19], an open involves requesting ownership; a close involves relinquishing ownership; and a callback is associated with revoking ownership. Many systems do not distinguish the notion of open from that of requesting ownership, or the notion of close from that of relinquishing ownership. Even in a system that aggressively exploits caching, such as Sprite [14], a client merely passes the user system calls to the server (figure 3a). If a user repeatedly opens and writes the same file, which is a likely event, a Sprite client dutifully transmits each open request and writes dirty data every 30 seconds.

xFS is different in two respects (figure 3b). Firstly, xFS explicitly separates the notion of open from that of requesting ownership. Ownership requests are sent only when necessary. Secondly, the xFS protocol does not include anything that resembles a close system call. Hosts never voluntarily relinquish ownership. They give up ownership only in response to servers' revoking calls. If a user repeatedly writes the same file, only the initial open can result in a request for ownership, and no further communication is required until another client requests the data. In this way, file and directory data is only transferred over the wide area network on a cache miss, a cache flush, or because of true sharing between multiple clients. In other words, xFS writes on demand, as opposed to traditional file systems that write data through to guard against the possibility that it might be shared.

[Figure 3: Separating notions of open/close/callback from those of requesting/relinquishing/revoking ownership. (a) In a Sprite style protocol, the client forwards the user's open and close calls to the server, and the server issues callbacks. (b) In the xFS protocol, opens and closes are absorbed by the client; ownership is requested from and revoked by the consistency server, which in turn requests ownership from, and has it revoked by, the home server.]

Figure 4 shows the xFS file state transition in more detail. As an example, consider a file that is in the read sharing state. If a new host requests read ownership, it is granted the ownership and the file stays in the same state. If a host requests write ownership, the server revokes ownership from all the current readers. In the reply messages, the readers specify whether or not they still desire the read ownership. If none of the previous readers want to retain read ownership, the file goes into the write-1 state and the new writer is granted write ownership. If some readers are still actively reading the file, then the file goes into the write sharing state and all hosts are denied ownership.

[Figure 4: xFS file state transition. The states are: not owned (nobody has ownership), read-1 (one host has read ownership), read sharing (multiple hosts have read ownership), write-1 (one host has write ownership), and write sharing (nobody has ownership).]
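The transitions just described can be written down as a small state machine. The following Python model is only illustrative: revocation replies are summarized by a flag rather than exchanged as messages.

```python
# Simplified model of the figure 4 state machine (illustrative only).

NOT_OWNED, READ_1, READ_SHARING, WRITE_1, WRITE_SHARING = (
    "not owned", "read-1", "read sharing", "write-1", "write sharing")

class FileState:
    def __init__(self):
        self.state = NOT_OWNED
        self.readers = set()
        self.writer = None

    def request_read(self, host):
        self.readers.add(host)
        self.writer = None
        self.state = READ_1 if len(self.readers) == 1 else READ_SHARING
        return True                         # read ownership granted

    def request_write(self, host, readers_keep_reading=False):
        # The server revokes read ownership from the current readers; each reply
        # says whether that reader still wants the file (summarized by the flag).
        other_readers = self.readers - {host}
        if other_readers and readers_keep_reading:
            self.state = WRITE_SHARING      # active readers remain, nobody owns
            self.writer = None
            return False
        self.readers.clear()
        self.writer = host
        self.state = WRITE_1
        return True                         # write ownership granted


f = FileState()
f.request_read("a"); f.request_read("b")                    # read sharing
assert not f.request_write("c", readers_keep_reading=True)  # -> write sharing
```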

Due to its better use of the local cache and less dependency on servers, xFS clients not only should observe lower latency and use less bandwidth, they should also achieve better availability. Existing file systems rely heavily on the servers for write throughs and name related operations. The demise of a single server or a failed network link usually renders the clients helpless. xFS hosts, on the other hand, after acquiring ownership of the proper data, can operate more independently. Furthermore, xFS hosts will have an easier time reintegrating after disconnection. A Coda [18] client, for example, is required to write the dirty data it has accumulated during disconnection back to the server during reintegration. In xFS, such writes are not necessary unless the dirty data is needed by others.

There are a number of problems with this cache coherence scheme. The first is the large amount of state information each host needs to keep. This is addressed in the next section. The second problem is the need for local stable storage that will keep dirty data indefinitely. This is addressed in section 5. The third is the need for policies to deal with network partitions and down hosts. This is discussed in section 6.

4 Reducing Cache Coherence State

Many existing stateful file systems maintain cache consistency in a way that is not designed to scale to handle massive amounts of client cache. xFS reduces the cache coherence state by exploiting the hierarchical naming structure of the file system.

A Sprite or an Andrew server, for example, must keep track of every file cached by any client. The amount of server state grows with the total size of the client caches. xFS lives on a wide area network, where there are potentially a large number of clients and a large amount of callback information associated with each client. A way to deal with the explosion of state information needs to be found.

The hierarchical nature of the system as discussed in section 2 partially alleviates the problem. Home servers only need to keep track of the top level consistency servers. Bottom level consistency servers only need to tally the leaf clients that they serve. This reduces the length of client lists but does not reduce the number of files that might require consistency actions.

A second technique xFS uses to deal with state explosion is based on the observation that clusters of files tend to have the same ownership. There is little need to keep state for individual files when a representative for a large cluster can be used instead. Figure 5 illustrates the idea. Here we see user "foo" is working on machine "X". "X" has acquired read ownership for cluster "A" and write ownership for clusters "B" and "C". This has happened as a result of the modification of her home directory and directory "baz" by user "foo". Similarly, machine "Y" has acquired read ownership for cluster "A" and write ownership for cluster "D". We see that ownership only needs to be associated with clusters of files.

[Figure 5: Hierarchical state information. In the directory tree rooted at "/", cluster A is read owned by X and Y; clusters B and C (user foo's home directory and its subdirectory baz) are write owned by X; cluster D (under user bar) is write owned by Y.]

A third technique is based on the observation that a large number of files in the system are widely shared but rarely modified. For such files (an entire subtree that is exported read only, for example), instead of remembering exactly which hosts have acquired read ownership, xFS merely remembers the fact that read ownership has been granted to some hosts. In the rare event that such files do get modified, xFS resorts to broadcasts to locate the readers. Such broadcasts, however, do not necessarily have to flood every single host, due to the hierarchical nature of xFS. A home server communicates with all the consistency servers it serves, and only the consistency servers that have acquired read ownership have to forward the revoking broadcast to their clients.

An xFS server maintains cache coherence state for clusters of files instead of individual files. For each cluster of files, a server only keeps a list of consistency servers instead of individual clients. And for files that are almost never modified, even this list can be done away with. By employing these techniques, we expect to substantially reduce storage requirements for cache coherence state.
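A minimal sketch of such per-cluster state appears below, with clusters identified by directory prefixes as in figure 5 and the longest matching prefix deciding which entry governs a file. The dictionary layout, and the empty holder set standing for "granted to some hosts", are our illustration, not the actual server data structure.

```python
# Illustrative per-cluster coherence state, keyed by directory subtree.

class CoherenceState:
    def __init__(self):
        # subtree -> (mode, holders); holders are consistency servers, not clients.
        # An empty holder set stands for "read ownership granted to some hosts",
        # in which case a revoke falls back to a hierarchical broadcast.
        self.entries = {"/": ("read", set())}

    def grant(self, subtree, mode, cserver):
        _, holders = self.entries.get(subtree, (mode, set()))
        holders.add(cserver)
        self.entries[subtree] = (mode, holders)

    def governing_cluster(self, path):
        # Longest matching prefix wins (simplified: plain string prefixes).
        best = "/"
        for prefix in self.entries:
            if path.startswith(prefix) and len(prefix) > len(best):
                best = prefix
        return best


state = CoherenceState()
state.grant("/foo", "write", "consistency-server-X")
state.grant("/bar", "write", "consistency-server-Y")
print(state.governing_cluster("/foo/baz/ch1.tex"))           # -> "/foo"
print(state.entries[state.governing_cluster("/etc/motd")])   # root: ("read", set())
```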

5 Integration of Multi-level Storage

xFS is designed to utilize a multi-level hierarchy of stable storage, capable of indefinitely storing modified file and directory data. xFS integrates all storage devices in a simple and uniform manner:

- All storage devices are treated as caches.

- All storage devices are written in a log-structured manner.

- Data blocks are located by looking up virtual-to-physical translation tables maintained on fast devices.

In addition to deciding on appropriate migration policies that move data around, we need to address two immediate questions. The first of these is how to locate data as blocks migrate among different levels of storage. The second is how to lay out data on media in an efficient manner.

To solve the problem of locating data in multiple levels of storage, some existing systems extend the UFS data structures to incorporate tertiary devices. For example, a block address in Highlight [9] can belong to the disk farm, some tertiary store, or the disk cache for tertiary storage. Given an <inode number, offset> pair, locating data involves locating the inode and indexing the inode. If the block address is found to be on tertiary, a fetch is done to bring the missing data into the disk cache and the operation resumes. Such extensions usually complicate the code and force the new storage devices to inherit data structures that were originally designed for disks.

A solution to the second problem, namely that of laying out data on media efficiently, is provided by the log-structured file system (LFS) [16]. LFS appends newly written data to a segmented log and is optimized for write performance. As old data is deleted, LFS reclaims disk space by recopying sparsely populated segments to the tail of the log and marking those segments as clean. LFS provides superior performance in environments where large main memory caches absorb most of the reads and consequently disk write speed becomes the limiting factor. Since many tertiary devices are append-only and tertiary archival storage is mostly write-only [Footnote 6: Most media deliver best performance when they are used as append-only devices.], a log-structured layout is a natural way of managing such devices. Highlight [9], a descendant of LFS, bases its design on this observation. It employs a different cleaner process to migrate selected data blocks out to tertiary store, which is also written log-structured.

When designing the xFS approach to locating data, we draw analogies from solutions to memory management problems. A virtually addressed memory word can be anywhere from an on-chip cache to a secondary backing store. To find out where it is, we first have to translate a virtual address to a physical address. The xFS equivalent of a virtual address is a block ID:

<file ID, block #>

Such a virtual address can have its physical incarnation anywhere in a hierarchy of storage devices. Each device is treated as a cache. Each device employs a fast translation table that translates a block ID to a physical address on the device. A translation table entry is of the following form:

<block ID, # of blocks, device address>

As data migrates among different storage levels, we simply change the corresponding translation tables, a simpler and cleaner approach than extending the existing UFS data structures. This is similar to the approach taken by [7], where logical disk addresses are mapped to physical ones to allow a clean separation between file and disk management without sacrificing performance.

The memory management analogy, unfortunately, does not apply to data layout. Firstly, unlike processor caches, which are usually direct mapped, we would like our xFS devices to be fully associative. Secondly, unlike processor caches and main memory, few of the storage technologies xFS manages are truly random access devices. For these two reasons, xFS writes its caches in a log-structured manner.

When an xFS device receives a read request, it queries its translation table to find the corresponding physical address. A successful search leads us directly to the device location where the data is found. If the search fails, the device declares a cache miss. When an xFS device receives a write request, it appends the data blocks to the end of the log and updates its translation table to reflect the new mapping. Invalidation of data blocks involves removing the corresponding mappings from its translation table. When it is time to clean, whether a block is live or not can be determined by comparing its identity against the corresponding translation table entry.
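The following sketch models one such device: an append-only log plus a translation table keyed by block ID. Reads consult the table, writes append and remap, invalidation drops the mapping, and the cleaner's liveness test compares a log record's identity against the current mapping. Extent-like entries covering several consecutively written blocks are omitted for brevity, and all names here are illustrative rather than the real xFS data structures.

```python
# One xFS-style storage device: an append-only log plus a translation table.
# Illustrative sketch only.

class LogDevice:
    def __init__(self):
        self.log = []        # append-only list of (block_id, data) records
        self.table = {}      # block_id -> "device address" (index into the log)

    def read(self, block_id):
        addr = self.table.get(block_id)
        if addr is None:
            return None                      # cache miss on this device
        return self.log[addr][1]

    def write(self, block_id, data):
        self.log.append((block_id, data))    # never overwrite in place
        self.table[block_id] = len(self.log) - 1

    def invalidate(self, block_id):
        self.table.pop(block_id, None)       # the old log record becomes dead

    def is_live(self, addr):
        # Cleaner's test: a record is live only if the table still points at it.
        block_id, _ = self.log[addr]
        return self.table.get(block_id) == addr


disk = LogDevice()
disk.write(("fileA", 0), b"v1")
disk.write(("fileA", 0), b"v2")              # remap: the first record is now dead
assert disk.read(("fileA", 0)) == b"v2"
assert not disk.is_live(0) and disk.is_live(1)
```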

Ejecting from an xFS device is accomplished by a migration process. It chooses a number of victim blocks according to some policy. If the blocks are dirty, the migration agent retrieves the blocks from the source device and sends them to the destination device. Then it simply invalidates the blocks in the source device, and the freed space will be reclaimed by the cleaner.
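Continuing the previous sketch, a migration agent between two such devices might look like the following. The victim-selection policy (oldest mappings first) and the omission of dirty-block tracking are simplifications of ours, not xFS policy.

```python
# Hypothetical migration agent between two LogDevice instances from the
# previous sketch; the eviction policy here is only a placeholder.

def migrate(src, dst, n_victims):
    victims = sorted(src.table, key=src.table.get)[:n_victims]  # oldest mappings
    for block_id in victims:
        data = src.read(block_id)
        dst.write(block_id, data)    # inject into the destination level
        src.invalidate(block_id)     # freed space is later reclaimed by the cleaner

tertiary = LogDevice()
migrate(disk, tertiary, n_victims=1)
assert disk.read(("fileA", 0)) is None
assert tertiary.read(("fileA", 0)) == b"v2"
```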

In principle, the translation table can always be "paged" to a slower device. For simplicity, we have opted to keep the translation table entirely on a faster device. For example, the translation table for a disk cache is kept entirely in memory [Footnote 7: Upon reboot, the memory resident translation table is reconstructed by examining the identities of data blocks on disk.] and the translation table for tertiary is kept entirely on disk. An obvious disadvantage of doing so is the storage required for the translation tables. For example, tens of megabytes of main memory might be needed to manage less than ten gigabytes of disk storage. This problem is partially alleviated by using a single translation table entry to point to a number of consecutively written blocks. Thus large sequentially written files will need little space in the translation table. We believe that, given the fast decrease in memory cost, the memory used for the translation table is money well spent in exchange for the resulting simplicity. Similarly, studies of some workloads suggest that less than ten gigabytes of disk space will suffice for "translating" around ten terabytes of tertiary storage, and such an investment is probably worthwhile.

We plan to implement two interfaces for each kind of xFS device: a storage interface that allows the device to be used as a caching store, and a translation interface that allows the device to be used as a translation device. A disk, for example, has two interfaces that allow it to be used either as a storage device or as a translation table for tertiary. An appropriately chosen pair of slow and fast devices can be "glued" together to form a storage bin. Migration agents running on each machine choose blocks to inject into or eject from the storage bins they oversee. Pieces of the same file can potentially spread across multiple levels of the storage hierarchy over a wide area, reflecting the widely distributed and tertiary nature of xFS.

6 Other Issues

One question any distributed system has to answer is how it handles machine crashes and network partitions.

An xFS client has to remember what ownerships it has acquired from which servers. An xFS server needs to remember what ownerships it has granted to which clients. Even after applying the state compression techniques described in section 4, the amount of information would still be too large to keep entirely in memory, as Sprite [20] and Spritely NFS [19] do. Part of it needs to be written to stable storage. On the other hand, we cannot afford to log every ownership change to disk. Consequently, we need to recover the memory resident information lost in a crash.

The task is relatively simple for clients. Whenever a client commits data to stable storage, it conveniently logs the corresponding ownership. Any ownership lost in a crash will not have dirty data associated with it. In the worst case, the server might believe it has granted ownership to certain clients which the clients have lost in crashes. Later, when the server sends revoking calls, the clients simply reply that they no longer have the ownership. A recovering server, however, has to recover the exact state it had before the crash. It appears that a server centric approach as done in [12] would work well. Under such a scheme, a recovering server directs its clients to help rebuild the lost server state. It has also been noted that such frequently updated and small amounts of volatile state are good candidates for inclusion in a stable memory that can survive machine crashes, so that the recovery cost can be kept to a minimum [1].
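A minimal sketch of the client-side rule, assuming a simple per-client log file: ownership is recorded in the same step that commits the data it covers, so any ownership missing after a crash cannot have guarded dirty data. The file layout and names are hypothetical.

```python
import json, os

def commit(log_path, block_id, data, mode):
    # Hypothetical client commit: the data and the ownership that covers it are
    # logged together, so a crash never loses ownership that guarded dirty data.
    record = {"block": block_id, "mode": mode, "data": data.hex()}
    with open(log_path, "a") as log:
        log.write(json.dumps(record) + "\n")
        log.flush()
        os.fsync(log.fileno())

def recover(log_path):
    # After a crash, replay the log to rebuild the client's ownership table.
    owned = {}
    if os.path.exists(log_path):
        with open(log_path) as log:
            for line in log:
                rec = json.loads(line)
                owned[tuple(rec["block"])] = rec["mode"]
    return owned

commit("client.log", ["fileA", 0], b"dirty bytes", "write")
print(recover("client.log"))    # {('fileA', 0): 'write'}
```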

Another question we need to answer is how to deal with machines that do not respond to ownership revokes due to crashes or network partitions. There are three alternatives. We can fail the operation that triggered the revoke in the first place, hang the operation indefinitely until the revoke is answered, or let the operation proceed at the risk of allowing consistency conflicts, which will have to be resolved when the offending hosts rejoin. We adopt the third approach, taken by Coda [18], which trades quality for availability. Fortunately, the Coda study has shown that such conflicts are extremely rare events, and we do not expect ownership revokes to be frequent in xFS.

xFS's approach of storing data locally and indefinitely also raises other concerns. Security is one. Andrew [6] treats servers as first class citizens, and clients are deemed less trustworthy. In xFS, however, the line between clients and servers is blurred when clients are allowed to serve each other (section 2). In such an organization, the clusters can be treated as security domains, where cluster members trust each other but foreign clusters that are not homes are deemed untrustworthy. A more paranoid approach would require the writer to compute a checksum. Kaliski [8] has devised an efficient algorithm which makes it computationally infeasible to find two messages with the same checksum or a message with a prespecified checksum. A reader, upon receipt of the data and the checksum, can run the same algorithm to verify that the data has not been altered.
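This check is easy to state in code. The sketch below uses SHA-256 from Python's standard hashlib as a stand-in for the MD4 algorithm of [8], since the only property relied on is that it is computationally infeasible to forge data matching a given digest.

```python
import hashlib

def checksum(data: bytes) -> str:
    # Stand-in for MD4 [8]; any collision-resistant digest provides the property
    # used here: altered data will not match the digest shipped by the writer.
    return hashlib.sha256(data).hexdigest()

# The writer computes the digest and sends it along with the data.
data = b"file contents cached in a foreign cluster"
digest = checksum(data)

# The reader recomputes and compares before trusting what arrived over the WAN.
received = data
assert checksum(received) == digest, "data was altered"
```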

Another difficulty is backup. When the storage is concentrated on several gigabytes of disks on a few servers, disaster recovery (as opposed to crash recovery) can be accomplished by rolling the whole world back to a tape dump. Such an approach fails to work when there are massive amounts of storage and/or the storage is so widely distributed that taking consistent snapshots is virtually impossible. One possible way of providing disaster recovery for xFS is to avoid a disaster in the first place by insisting on having two copies of everything at different sites at all times. An effective way of providing backup for mass storage in the wide area is a topic for future research.

7 Conclusion

Existing disk based local area file systems can no longer meet the demands of mass storage over a wide area. xFS uses a hierarchy that can better take advantage of locality. It speaks a cache coherence protocol that minimizes the use of wide area communication. Consequently, it is expected to operate with lower latency and consume less wide area bandwidth. xFS minimizes cache coherence state information by exploiting the hierarchical nature of the file system name space and the hierarchical nature of the cluster based organization. xFS integrates multiple levels of storage in a uniform manner. Its ability to take advantage of local storage should deliver better economy, superior performance, and higher availability.

Acknowledgements

Discussions with Mike Dahlin and Dave Patterson have helped improve the xFS design, particularly that of the hierarchical client-server structure. We would also like to thank Soumen Chakrabarti, Mike Dahlin, John Hartman, Arvind Krishnamurthy, and Clifford Mather for their helpful comments.

References

[1] M. Baker and M. Sullivan. The Recovery Box: Using Fast Recovery to Provide High Availability in the Unix Environment. USENIX Association Summer 1992 Conference Proceedings, pages 31-43, June 1992.

[2] M. Blaze and R. Alonso. Toward Massive Distributed Systems. Proceedings of the 3rd Workshop on Workstation Operating Systems, pages 48-51, April 1992.

[3] S. Coleman and S. Miller. Mass Storage System Reference Model: Version 4. IEEE Technical Committee on Mass Storage Systems and Technology, May 1990.

[4] M. Dahlin. Private Communication. 1993.

[5] D. Muntz and P. Honeyman. Multi-Level Caching in Distributed File Systems. USENIX Association Winter 1992 Conference Proceedings, January 1992.

[6] J. Howard, M. Kazar, S. Menees, D. Nichols, M. Satyanarayanan, R. Sidebotham, and M. West. Scale and Performance in a Distributed File System. ACM Transactions on Computer Systems, 6(1):51-82, February 1988.

[7] W. de Jonge, M. F. Kaashoek, and W. Hsieh. The Logical Disk: A New Approach to Improving File System Performance. Proceedings of the 14th Symposium on Operating Systems Principles, December 1993. To appear.

[8] B. Kaliski, Jr. The MD4 Message Digest Algorithm. Workshop on the Theory and Application of Cryptographic Techniques Proceedings, May 1990.

[9] J. Kohl and C. Staelin. HighLight: Using a Log-structured File System for Tertiary Storage Management. USENIX Association Winter 1993 Conference Proceedings, January 1993.

[10] D. Lenoski, J. Laudon, K. Gharachorloo, A. Gupta, and J. Hennessy. The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor. Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 148-159, May 1990.

[11] F. McClain. DataTree and UniTree: Software for File and Storage Management. Digest of Papers, Proceedings of the 10th IEEE Symposium on Mass Storage Systems, pages 126-128, May 1990.

[12] J. Mogul. A Recovery Protocol for Spritely NFS. Proceedings of the USENIX Workshop on File Systems, May 1992.

[13] J. Mott-Smith. The Jaquith Archive Server. UCB/CSD Report 92-701, University of California, Berkeley, September 1992.

[14] M. Nelson, B. Welch, and J. Ousterhout. Caching in the Sprite Network File System. ACM Transactions on Computer Systems, 6(1):134-154, February 1988.

[15] M. Papamarcos and J. Patel. A Low Overhead Coherence Solution for Multiprocessors with Private Cache Memories. Proceedings of the 11th Annual International Symposium on Computer Architecture, pages 348-354, June 1984.

[16] M. Rosenblum and J. Ousterhout. The Design and Implementation of a Log-Structured File System. Operating Systems Review, 25(5):1-15, October 1991.

[17] R. Sandberg, D. Goldberg, S. Kleiman, D. Walsh, and B. Lyon. Design and Implementation of the Sun Network Filesystem. Proceedings of the 1985 Summer USENIX Conference, June 1985.

[18] M. Satyanarayanan, J. Kistler, P. Kumar, M. Okasaki, E. Siegel, and D. Steere. Coda: A Highly Available File System for a Distributed Workstation Environment. IEEE Transactions on Computers, 39(4):447-459, April 1990.

[19] V. Srinivasan and J. Mogul. Spritely NFS: Experience with Cache Consistency Protocols. Proceedings of the 12th Symposium on Operating Systems Principles, pages 45-57, December 1989.

[20] B. Welch. The Sprite Distributed File System. PhD Thesis, Department of Electrical Engineering and Computer Science, University of California, Berkeley, March 1990.