USENIX Association

Proceedings of the 2001 USENIX Annual Technical Conference

Boston, Massachusetts, USA, June 25–30, 2001

THE ADVANCED COMPUTING SYSTEMS ASSOCIATION

© 2001 by The USENIX Association All Rights Reserved For more information about the USENIX Association: Phone: 1 510 528 8649 FAX: 1 510 548 5738 Email: [email protected] WWW: http://www.usenix.org Rights to individual papers remain with the author or the author's employer. Permission is granted for noncommercial reproduction of the work for educational or research purposes. This copyright notice must be included in the reproduced paper. USENIX acknowledges all trademarks herein.

Storage management for web proxies

Lan Huang, Eran Gabber, Elizabeth Shriver

SUNY Stony Brook; Bell Laboratories; Bell Laboratories

[email protected], [email protected], [email protected]

Christopher A. Stein

Harvard University

[email protected]

Abstract

Today, caching web proxies use general-purpose file systems to store web objects. Proxies, e.g., Squid or Apache, when running on a UNIX system, typically use the standard UNIX file system (UFS) for this purpose. UFS was designed for research and engineering environments, which have different characteristics from those of a caching web proxy. Some of the differences are high temporal locality, relaxed persistence requirements, and a different read/write ratio. In this paper, we characterize the web proxy workload, describe the design of Hummingbird, a light-weight file system for web proxies, and present performance measurements of Hummingbird. Hummingbird has two distinguishing features: it separates object naming and storage locality through direct application-provided hints, and its clients are compiled with a linked library interface for memory sharing. When we simulated the Squid proxy, Hummingbird achieves document request throughput 2.3–9.4 times larger than with several different versions of UFS. Our experimental results are verified within the Polygraph proxy benchmarking environment.

1 Introduction

Caching web proxies are computer systems dedicated to caching and delivering web content. Typically, they exist on a corporate firewall or at the point where an Internet Service Provider (ISP) peers with its network access provider. From a web performance and scalability point of view, these systems have three purposes: improve web client latency, drive down the ISP's network access costs because of reduced bandwidth requirements, and reduce request load on origin servers.

Squid and Apache are two popular web proxies. Both of these systems use the standard file system services provided by the host operating system. On UNIX this is usually UFS, a descendant of the 4.2BSD UNIX Fast File System (FFS) [13]. FFS was designed for workstation workloads and is not optimized for the different workload and requirements of a web proxy. It has been observed that file system latency is a key component in the latency observed by web clients [21].

Some commercial vendors have improved I/O performance by rebuilding the entire system stack: a special operating system with an application-specific file system executing on dedicated hardware (e.g., CacheFlow, Network Appliance). Needless to say, these solutions are expensive. We believe that a lightweight and portable file system can be built that will allow proxies to achieve performance close to that of a specialized system on commodity hardware, within a general-purpose operating system, and with minimal changes to their source code; Gabber and Shriver [5] discuss this view in detail.

We have built a simple, lightweight file system library named Hummingbird that runs on top of a raw disk partition. This system is easily portable; we have run it with minimal changes on FreeBSD, IRIX, Solaris, and Linux. In this paper we describe the design, interface, and implementation of this system, along with some experimental results that compare the performance of our system with a UNIX file system. Our results indicate that Hummingbird's throughput is 2.3–4.0 times larger than a simulated version of Squid running UFS mounted asynchronously on FreeBSD, 5.4–9.4 times faster than Squid running UFS mounted synchronously on FreeBSD, 5.6–8.4 times larger than a simulated version of Squid running UFS with soft updates on FreeBSD, and 5.4–13 times larger than XFS and EFS on IRIX (see Section 4). We also performed experiments using the Polygraph environment [18] with an Apache proxy; the mean response time for hits in the proxy is 14 times smaller with Hummingbird than with UFS (see Section 5).

Throughout the rest of this paper, we use the terms proxy or web proxy to mean caching web proxy. Section 2 presents the important characteristics of the proxy workload considered for our file system. It also presents some background on file systems and proxies that is important because it motivates much of our design. Section 3 describes the Hummingbird file system. Our experiments and results are presented in Sections 4 and 5. Section 6 discusses related work in file systems and web proxy caching.

2 File system limitations

The web proxy workload is different from the standard UNIX workload; these differences are discussed in Section 2.1. Due to these differences, the performance which can be attained for a web proxy running on UFS is limited by some of its features; these limitations are discussed in Section 2.2.

2.1 Web proxy workload characteristics

Web proxy workloads have special characteristics, which are different from those of a traditional UNIX file system workload. This section describes the special characteristics of proxy workloads.

We studied a week's worth of web proxy logs from a major, national ISP, collected from January 30 to February 5, 1999. This proxy ran Netscape Enterprise server proxy software, and the logs were generated in Netscape-Extended2 format. For the purpose of our analysis, we isolated the request stream to those requests that would affect the file system underlying the proxy. Thus we excluded the 34% of the GET requests which are considered non-cacheable by the proxy. If we could not find the file size, we removed the log event. This preprocessing results in the removal of about 4% of the log records, nearly all during the first few days. We eliminated the first few days and are left with 4 days of processed logs containing 4.8 million requests for 14.3 GB of unique cacheable data and 27.6 GB of total requested cacheable data.

Characteristics of a web proxy and its workload are:

Naming. The web proxy application determines the name of a file to be written into the file system. In a traditional UNIX file system, file names are normally selected by a user.

Reference locality. Client web accesses are characterized by a request for an HTML page followed by requests for embedded images. The set of images within a cacheable HTML page changes slowly over time. So, if a page is requested by a client, it makes sense for the proxy to anticipate future requests by prefetching its historically associated images.

We studied one day of the web proxy log for this reference locality. The first time we see an HTML file, we coalesce it with all following non-HTML files from the same client to form the primordial locality set, which does not change. The next time the HTML file is referenced, we form another locality set in the same manner and compare its file members with those of the primordial locality set. The average hit rate (the ratio between the size of the latter sets and the size of the primordial set) across all references is 47%. Thus, on average, a locality set re-reference accesses almost half of the files of the original reference. One of the reasons that this hit rate is small might be the assumption that all non-HTML files that follow an HTML file are in the same locality set; this is clearly not true if a user has multiple active browser sessions. Also, we determined the type of a file using the file extension, thus possibly placing some HTML files in another file's locality set. We also studied the size of the locality sets in bytes, and found that 42% are 32 KB or smaller, 62% are 64 KB or smaller, 73% are 96 KB or smaller, and 80% are 128 KB or smaller.
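The hit-rate computation above can be sketched in C under one reading of the definition (count how many members of the primordial set reappear in a later locality set, divided by the primordial set size); the function name and encoding here are ours, not the paper's:

```c
#include <assert.h>
#include <string.h>

/* Fraction of the primordial locality set that reappears in a later
 * locality set for the same HTML page. Illustrative sketch only; the
 * paper averages this ratio over all re-references in one day of logs. */
static double locality_hit_rate(const char *primordial[], int np,
                                const char *later[], int nl)
{
    int hits = 0;
    for (int i = 0; i < np; i++)
        for (int j = 0; j < nl; j++)
            if (strcmp(primordial[i], later[j]) == 0) { hits++; break; }
    return np > 0 ? (double)hits / np : 0.0;
}
```

For example, a primordial set of four images of which two are requested again yields a hit rate of 0.5, in line with the 47% average reported above.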

File access. Several older UNIX file system performance studies have shown that files are usually accessed sequentially [16, 1]. A recent study [19] has suggested that this is changing, especially for memory-mapped and large files. However, UNIX file systems have been designed for the traditional sequential workload. Web proxies have an even stronger and more predictable pattern of behavior: they currently always access files sequentially and in their entirety. (This may change when range requests become more popular.)

File size. Most cacheable web documents are small; the median size of requested files is 1986 B and the average size is 6167 B. Over 90% of the references are for files smaller than 8 KB.

Persistence. A web proxy only writes files retrieved from an origin server. Thus, these are just cached files, and can be retrieved again if necessary.

Idleness. The proxy workload is characterized by a large variability in request rate and frequent idle periods. Using simulation, we found that, in each 1-minute interval, the disk is idle 57–99% of the duration of the interval, with a median of 82%. If the request rate in the trace doubles, the disk is idle less of the time, with a median of 60%. In the busiest periods, it is idle only 1%. Note that when the request rate doubles, the disk is almost saturated in the busy intervals (idle time drops to 1%), while the disk remains idle in the low-activity intervals. This idleness study used assumptions that represent Hummingbird: the disk has 64 KB blocks, all new files are written to the disk, and only data needs to be written to disk (no meta-information). The existence of idle periods is crucial for the design of Hummingbird, since Hummingbird performs background bookkeeping operations in those idle periods (see Section 3.5).

2.2 UFS performance limitations

Due to the proxy workload characteristics presented in the previous section, UFS has a number of features which are not needed, or could be greatly simplified, so that file system performance could be improved. Table 1 presents the features on which UFS and a desired caching web proxy file system should differ. We now discuss these features.

UFS files are collections of fixed-size blocks, typically 8 KB in size. When accessing a file, the disk delays are due to disk head positioning occurring when the file blocks are not stored contiguously on disk. UFS attempts to minimize the disk head positioning time by storing the file blocks contiguously and prefetching blocks when a file is accessed sequentially, and it does a good job of this for small files. Thus, when the workload consists of mostly small files, the largest component of disk delays is due to the reference stream locality not corresponding with the on-disk layout. UFS attempts to reduce this delay by having the user place files into directories, and it locates the files in a directory on a group of contiguous disk blocks called a cylinder group. Thus, reference locality is tied to naming. Here, the responsibility for performance lies with the application or user, who must construct a hierarchy with directory locality that matches future usage patterns. In addition, to reduce file lookup times, the directory hierarchy must be well-balanced and any single directory should not have too many entries. Squid attempts to balance the directory hierarchy, but in the process distributes the reference stream across directories, thus destroying locality. Apache maps files from the same origin server into the same directory. For specifying locality by a web proxy, a more direct and low-overhead mechanism can be used.

Experience with Squid and Apache has shown that it is difficult for web proxies to use directory hierarchies to their advantage [11]. Deep pathnames mean long traversal times. Populated directories slow down lookup further because many legacy implementations still do a linear search for the filename through the directory contents. The hierarchical name space allows files to be organized by separating them across directories, but this is not needed by a proxy. What the proxy actually needs is a flat name space and the ability to specify storage locality.

UFS file meta-data is stored in the i-node, which is updated using synchronous disk writes to ensure meta-data integrity. A caching web proxy does not require file durability for correctness, so it is free to replace synchronous meta-data writes with asynchronous writes to improve performance (which is done with soft updates [6]) and to eliminate the need to maintain much of the meta-data associated with persistence.

Traditional file systems also force two architectural issues. First, the standard file system interface copies from kernel VM into the applications' address space. Second, the file system caches file blocks in its own buffer cache. Web proxies manage their own application-level VM caches to eliminate memory copies and use private information to facilitate more effective cache management. However, web documents cached at the application level are likely to also exist in the file system buffer cache, especially if recently accessed from disk. This multiple buffering reduces the effective size of the memory. A single unified cache solves the multiple buffering and configuration problems. Memory copy costs can be alleviated by passing data by reference rather than copying. Both of these can be done using a file system implemented by a library that accesses the raw disk partition. Another approach for alleviating multiple buffering is using memory-mapped files, as done by Maltzahn et al. [11].

3 File system design

The design of Hummingbird is influenced by the proxy workload characteristics discussed in Section 2. Hummingbird uses locality hints generated by the proxy to pack files into large, fixed-size extents called clusters, which are the unit of disk access. Therefore, reads and writes are large, amortizing disk positioning times and interrupt processing.

Table 1: Comparison of file system features.

feature              | UFS                                                              | Hummingbird
name space           | hierarchical name space                                          | flat name space
reference locality   | application manages directories                                  | application sends locality hints
file meta-data       | meta-data kept on disk; synchronous updates preserve consistency | most meta-data in memory
disk layout of files | data in blocks, with separate i-node for meta-data               | file and meta-data stored contiguously
disk access          | could be as small as a block size                                | cluster size (typically 32 or 64 KB)
interface            | buffers are passed; memory copies are necessary                  | pointers are passed

Hummingbird manages a large memory cache. Since Hummingbird is implemented by a library that is linked in with the proxy, no memory copies or memory mappings are required to move data from the file system to the proxy. The file system simply passes a pointer. Likewise, the proxy passes the file system a pointer when it writes data. We have only designed the file system for a single client, so protection is not necessary. Since the client (the proxy) handles all data transfers to and from the system, it must be trusted in any case. The proxy may be multi-threaded or have multiple processes (to perform memory management for the multi-process version of Hummingbird, the processes have a shared memory region which is used for malloc()), in which case access is serialized with a single lock on the file system meta-data that is released before blocking for disk I/O. Using a single lock may slow the system under a heavy load. Since the file system is I/O bound, the lock is held only when referring to Hummingbird's data structures. The lock is released before a thread blocks for disk I/O. No locking is needed when accessing file data.

Since the typical workload is bursty, Hummingbird is designed to reduce the response time during bursty periods, and to perform maintenance activities during the idle periods present in the workload. Hummingbird performs the maintenance activities by calling several daemons responsible for: (1) reclaiming main memory space by writing files into clusters, and (2) reclaiming disk space by deleting unused clusters.

While there are invariants across proxy workloads, some characteristics will change. We have designed Hummingbird to be configurable so that the system can be optimized for a proxy workload and the underlying storage hardware. To this effect, Hummingbird has several parameters that the proxy is free to set to optimize the system for its workload. The parameters set at file system initialization include: the size of a cluster, the memory cache eviction policy, the file hash table size, file and cluster lifetimes, the disk data layout policy, and recovery policies.

3.1 Hummingbird objects

Hummingbird stores two main types of objects in main memory: files and clusters. A file is created by a write_file() call. Clusters contain files and some file meta-data. Grouping files into clusters allows the file system to physically collocate files together, since when a cluster is read from disk, all of the files contained in the cluster are read. Clusters are clean, i.e., they can be evicted from main memory by reclaiming their space without writing to disk, since a cluster is written to disk as soon as it is created. (Section 3.5 discusses where on disk the clusters are written.)

The application provides locality hints by the collocate_files(fnameA, fnameB) call. The file system saves these hints until the files are assigned to clusters. This assignment occurs as late as possible, that is, when space is needed in main memory. At this point, the file system attempts to write fnameA and fnameB in the same cluster. It is possible for a file to be a member of multiple clusters, and stored in multiple locations on disk, by the application sending multiple hints (e.g., collocate_files(fnameA, fnameB) and collocate_files(fnameC, fnameB)). For proxy caches, this is a useful feature, since embedded images are frequently referenced in a number of related HTML pages.

When the file system is building a cluster, it determines which files to add to the cluster using an LRU ordering according to the last time the file was read. If the least-recently-used file has a list of collocated files, then these files are added to the cluster if they are in main memory. (If a file is on the collocation list and has already been added to a cluster, it can still be added to the current cluster if the file is in memory.) Files are packed into the cluster until the cluster size threshold is reached, or until all files on the LRU list have been processed. This way, small locality sets with similar last-read times can be packed into the same cluster. Another possible algorithm to pack files into clusters is the Greedy Dual Size algorithm [3].

Large files are special. They account for a very small fraction of the requests, but a significant fraction of the bytes transferred. In the log we analyzed, files over 1 MB accounted for over 8% of the bytes transferred, but only 0.02% of the requests. Caching these large files is not important for the average latency perceived by clients, but it is an important factor in the network access costs of the ISP. It is better to store these large files on disk, and not in the file system cache in memory. The write_nomem_file() call bypasses the main memory cache and writes a file directly to disk; if the file is larger than the cluster size, multiple clusters are allocated. Having an explicit write_nomem_file() function allows the application to request that any file bypass main memory, not just large files.
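The packing loop above can be sketched as follows. The struct and function names are ours, and the real system also follows collocation lists, which this sketch omits; it only shows filling one cluster from the least-recently-read end of the LRU list until the size threshold is reached:

```c
#include <assert.h>

/* Illustrative sketch of cluster packing from the LRU tail. Files are
 * taken in least-recently-read order and added until the next file
 * would exceed the cluster size threshold. */
#define CLUSTER_SIZE (64 * 1024)   /* a typical cluster size (Table 1) */

struct hb_file { const char *name; int size; };

/* Packs files lru[0..n-1] (index 0 = least recently read) into one
 * cluster. Returns the number of files packed; *used gets the bytes. */
static int pack_cluster(const struct hb_file *lru, int n, int *used)
{
    int bytes = 0, packed = 0;
    for (int i = 0; i < n; i++) {
        if (bytes + lru[i].size > CLUSTER_SIZE)
            break;                 /* cluster size threshold reached */
        bytes += lru[i].size;
        packed++;
    }
    *used = bytes;
    return packed;
}
```

With a 64 KB cluster, two 30000-byte files fit, but a third 10000-byte file would exceed the threshold and starts the next cluster.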

3.2 Meta-data

Hummingbird maintains three types of meta-data: file system meta-data, file meta-data, and cluster meta-data.

File system meta-data. To determine when main memory space needs to be freed, Hummingbird maintains counts of the amount of space used for storing the files and the file system data. To assist with determining which files and clusters to evict from main memory, Hummingbird maintains two LRU lists, one for files which have not yet been packed into clusters and another for clusters that are in memory.

File meta-data. A hash table stores pointers to the file information, such as the file number (discussed below), status, and a reference count of the number of users that are currently reading the file. The file status field identifies whether the file is not in a cluster, in one cluster, in multiple clusters, or not cacheable. Until a file becomes a member of a cluster, the file name and file size need to be maintained as part of the file meta-data. We also maintain a list of files that should be collocated with this file. When a file is added to a cluster, the file meta-data must include the cluster ID and the file reference count for that file.

It is natural for a proxy to use the URL as a file name. URLs may be extremely long, and since we have many small files, the file names would take up a large portion of main memory if they were kept permanently in memory. Thus, we save the file name with the file data in its cluster and not permanently in memory. Internally, Hummingbird hashes the file name into a 32-bit index, which is used to locate the file meta-data. Hash collisions can be detected by comparing the requested file name with the file name stored in the cluster. If there is a collision, the next element in the hash table bucket is checked.

Cluster meta-data. A cluster table contains information about each cluster on disk: the status, the last-accessed time, and a linked list of the files in the cluster. The cluster status field identifies whether the cluster is empty, on disk, or in memory. For our file system, the cluster ID identifies the location of the cluster on disk. While a cluster is in memory, the address of the cluster in memory is needed. The last-accessed time is needed to determine the amount of time since the cluster was last touched.
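The lookup path just described might look like the following sketch. The bucket layout and the particular 32-bit hash (FNV-1a here) are our stand-ins; the paper specifies only the 32-bit index and the full-name comparison (against the copy kept with the file data) to detect collisions:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative sketch of locating file meta-data by a 32-bit hash of
 * the file name (URL). In Hummingbird the authoritative name copy
 * lives in the cluster; here we keep a pointer for simplicity. */
#define NBUCKETS 1024

struct fmeta {
    const char *name;      /* full name, used to detect hash collisions */
    struct fmeta *next;    /* next element in the hash table bucket */
};

static uint32_t hash_name(const char *s)   /* FNV-1a: one possible choice */
{
    uint32_t h = 2166136261u;
    while (*s) { h ^= (uint8_t)*s++; h *= 16777619u; }
    return h;
}

/* Walks the bucket, comparing full names to skip colliding entries. */
static struct fmeta *lookup(struct fmeta *buckets[], const char *name)
{
    struct fmeta *e = buckets[hash_name(name) % NBUCKETS];
    while (e && strcmp(e->name, name) != 0)
        e = e->next;           /* collision: try the next element */
    return e;
}
```

A lookup that lands on a bucket holding a colliding entry simply keeps walking the chain until the stored name matches the requested one.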

3.3 The Hummingbird interface

This section describes the basic Hummingbird calls. All routines return a positive integer or zero if the operation succeeds; a negative return value is an error code.

* int write_file(char* fname, void* buf, size_t sz); This function writes the contents of the memory area starting at buf with size sz to the file system with filename fname. It returns the size of the file. Once the pointer buf is handed over to the file system, the application should not use it again. Hummingbird will eventually pack the file into a cluster, free the buffer, and write the cluster to stable storage.

* int read_file(char* fname, void** buf); This function sets *buf to the beginning of the memory area containing the contents of the file fname. It increments a reference count and returns the size of the file. If the file is not in main memory, another file or files may be evicted to make room, and a cluster containing the specified file will be read from disk.

* int done_read_file(char* fname, void* buf); This function releases the space occupied by the file in main memory by decrementing the reference count; the application should not use the pointer again. Every read_file() must be accompanied by a done_read_file(). Otherwise, the file will stay in memory indefinitely (until the application program terminates).

* int delete_file(char* fname); This function deletes the file fname. An error code is returned if the file has any active read_file() calls.

* int collocate_files(char* fnameA, char* fnameB); This function attempts to collocate file fnameB with file fnameA on disk. Both files must have been previously written (by calling write_file()).

* int write_nomem_file(char* fname, void* buf, size_t sz); This function bypasses the main memory cache and writes a file directly to disk. This file is flagged so that when it is read, it does not compete with other documents for cache space and is immediately released after the application issues the done_read_file().

Missing from this API are commands such as ls and df. We have seen no need for such commands for a caching web proxy. For example, the existence of a file can be determined with read_file().
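To illustrate the calling convention (buffer ownership passes to the file system on write_file(), and every read_file() must be paired with a done_read_file()), here is a toy in-memory stand-in for four of the calls. The table-based implementation is entirely ours; the real library packs files into clusters on a raw partition:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy in-memory mock of four Hummingbird calls, only to show the
 * reference-count discipline of the interface. Not the real library. */
#define MAX_FILES 16

struct entry { char name[128]; void *buf; size_t sz; int refs; int used; };
static struct entry tbl[MAX_FILES];

static struct entry *find(const char *fname) {
    for (int i = 0; i < MAX_FILES; i++)
        if (tbl[i].used && strcmp(tbl[i].name, fname) == 0) return &tbl[i];
    return 0;
}

int write_file(char *fname, void *buf, size_t sz) {
    struct entry *e = find(fname);
    if (e) {
        free(e->buf);                      /* overwrite: drop old contents */
    } else {
        for (int i = 0; i < MAX_FILES && !e; i++)
            if (!tbl[i].used) e = &tbl[i];
        if (!e) return -1;                 /* table full */
        strncpy(e->name, fname, sizeof e->name - 1);
        e->name[sizeof e->name - 1] = '\0';
        e->used = 1;
        e->refs = 0;
    }
    e->buf = buf;                          /* ownership passes to the "FS" */
    e->sz = sz;
    return (int)sz;
}

int read_file(char *fname, void **buf) {
    struct entry *e = find(fname);
    if (!e) return -1;
    e->refs++;                             /* pinned until done_read_file() */
    *buf = e->buf;
    return (int)e->sz;
}

int done_read_file(char *fname, void *buf) {
    struct entry *e = find(fname);
    if (!e || e->refs <= 0 || e->buf != buf) return -1;
    e->refs--;
    return 0;
}

int delete_file(char *fname) {
    struct entry *e = find(fname);
    if (!e) return -1;
    if (e->refs > 0) return -2;            /* active readers: refuse */
    free(e->buf);
    e->used = 0;
    return 0;
}
```

A proxy would store a fetched document with write_file(url, body, len), serve later hits via read_file()/done_read_file(), and evict with delete_file(), which refuses while readers are active.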

3.4 Recovery

Web proxies are slowly convergent; it takes days to reach the maximal hit rate. Consequently, proxy cache contents must be saved across system failures. At the same time, recovery must be quick. With today's disk sizes, the system cannot wait for full disk scans before servicing document requests. Hummingbird warms the main memory cache with files and clusters that were "hot" before the system reboot. Bounding file create and delete persistence, rather than attaching them to system call semantics, allows for higher performance.

Data stored on disk. Hummingbird's disk storage is segmented into four regions:

* clusters, which store the file data and meta-data,
* mappings of files to clusters, which allow a file to be quickly located on disk by identifying which clusters the file is in,
* the hot cluster log, which caches the identity of frequently used clusters, and
* the delete log, which stores small records describing intentional deletes.

The mappings of files to clusters are part of the file meta-data described in Section 3.2.

Warming the cache. Hummingbird lacks a directory structure, and all meta-data consistency dependencies are contained within clusters. There is no need for an analog of the UNIX fsck utility to ensure file system consistency after a crash. During a planned shutdown, Hummingbird will write the file-to-cluster mappings to disk. (They are also written periodically.) During a crash, the system has no such luxury. So, while the file system is guaranteed to be consistent with itself, the lack of directories might make it impossible to locate files on disk immediately after a crash if the file-to-cluster mapping was not up to date. Hummingbird speeds up crash recovery time with a log containing the cluster identifiers of popular clusters. The sync_hot_clusters_daemon() creates this log using the cluster LRU list and writes it to disk periodically. After a crash, these clusters are scanned in first, quickly achieving a hit rate close to that before the crash.

Persistence. Hummingbird does not change disk contents immediately for file and cluster deletions. Research with journalling file systems has shown that hard meta-data update persistence is expensive, due to the necessity of updating stable storage synchronously [22]. Hummingbird does not provide hard persistence, but uses a log of intentional deletions to bound the persistence of deletions. Records describing deletions, either cluster or file, are written into the log, which is buffered in memory. Periodically, the log is written out to disk according to the specified thresholds. The user specifies when the log should be written by specifying either a number-of-files threshold (i.e., once X files have been recorded in the log, it must be written to disk), or a threshold on the passing of time (i.e., the log must be written to disk at least once every Y seconds).

The log is structured as a table, indexed by cluster or file identifier. When a file or cluster is overwritten on disk, the delete intent record is removed from the log. Records contain file or cluster identifiers as well as a generation number, which is stored in the on-disk meta-data and used to eliminate the possibility of replayed deletes.

The recovery procedure. During recovery, all four regions of the disk are locked exclusively by the recovery process. The Hummingbird interface comes alive to the application early in the process, after the hot cluster log of the previous session has been recovered. However, Hummingbird will not write new files or clusters to disk until recovery has completed; write_file() calls will fail.
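The two delete-log flush thresholds and the generation-number check described above can be sketched as follows (the struct and names are ours):

```c
#include <assert.h>

/* Sketch of the delete-log flush policy: the buffered log is forced to
 * disk once X records accumulate or Y seconds have passed since the
 * last flush. Field names are illustrative, not Hummingbird's. */
struct delete_log {
    int records;          /* records currently buffered in memory */
    long last_flush;      /* time of the last flush, in seconds */
    int max_records;      /* X: flush after this many records */
    long max_age;         /* Y: flush at least this often, in seconds */
};

static int must_flush(const struct delete_log *dl, long now)
{
    return dl->records >= dl->max_records ||
           now - dl->last_flush >= dl->max_age;
}

/* A replayed delete record is ignored when its generation number no
 * longer matches the generation stored in the on-disk meta-data. */
static int delete_applies(unsigned rec_gen, unsigned ondisk_gen)
{
    return rec_gen == ondisk_gen;
}
```

During recovery, each scanned delete record would be applied only when delete_applies() holds, so a stale record from before an overwrite cannot delete the newer data.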

It continues to recover the clusters in the background and rebuilds the in-memory meta-data. During this phase, requests that do not hit in the currently-rebuilt meta-data are checked in the file-to-cluster mappings to identify the cluster that needs to be recovered. Since the file-to-cluster mapping can be out-of-date (due to a crash), a file without a file-to-cluster mapping is viewed as not in the file system.

We now outline the sequence of events during crash recovery; recovery is much simpler during a planned shutdown. (1) Crash or power failure. The system reboots and enters recovery mode. (2) The hot cluster log is scanned, the hot clusters are read in, and their contents are used to initialize in-memory meta-data. (3) The file-to-cluster mappings are read in from disk. (4) The delete log is scanned and records are applied, modifying meta-data as necessary. (5) Proxy service is enabled. Hummingbird will now service requests. (6) During idle time, the recovery process scans the cluster region, rebuilding the in-memory meta-data for all files and clusters. As files are recovered, they are available to the application. (7) Recovery is complete. All files and clusters are now available. Hummingbird can now write new clusters to disk.

3.5 Daemons

4 Exp erimental results

Hummingbird employs a numb er of daemons in ad-

We built an environment to run trace-driven exp er-

dition to the recovery daemons mentioned ab ove

iments on real implementations of UFS and Hum-

that run when the le system is idle and can be

mingbird. This environment consists of four com-

called on demand. In the future, wehopetosupport

ponents, as depicted in Figure 1. (1) The pro-

client-provided daemons, whichwould, for example,

cessed proxy log whichwas discussed in Section 2.1.

supp ort di erentcache eviction p olicies.

(2) The workload generator simulates the op eration

of a caching web proxy. It reads the proxy log

Pack les daemon. If the amountofmainmem-

events and generates the le system op erations that

ory used to store les exceeds a tunable threshold,

aproxy would have generated during pro cessing of

the pack files daemon() uses the le LRUlistto

the original HTTP requests. (3) The le system,

create a cluster of les and write the cluster to

which is either Hummingbird or various implemen-

disk. The daemon packs the les using the informa-

tations of UFS, EFS, or XFS. (4) The disk, which

file() calls, attempting tion from the collocate

is a physical magnetic disk that the le system uses

to pack les from the same lo cality set in the same

to store the les. The workload generators, imple-

cluster. If a le is larger than the cluster size, it is

mentation details, and exp eriments are describ ed in

split b etween multiple clusters.

the following section.

This daemon uses the disk data layout policy to de-

4.1 Workload generators

cide which cluster to pack next. Implemented p oli-

We develop ed two workload generators: wg-Squid cies include: closest cluster, which picks the clos-

which mimics Squid's interaction with the le sys- est free cluster to the previous cluster on disk ac-

tem, and wg-Hummingbird which issues Humming- cessed, closest cluster to previous write,whichpicks

bird calls. The same wg-squid workload genera- the closest free cluster to the previous cluster writ-

tor was used with the various UFS implementation, ten to on disk, and overwrite, whichoverwrites the

EFS and XFS. The generators take as input the next cluster. All of these p olicies tend to write les

mo di ed proxy access log. The workload genera- accessed within a short time to clusters close to-

tors op erate in a trace-driven lo op which pro cesses gether on disk. Thus, long-term fragmentation do es

2

HTML le request seen from a particular client. logged events sequentially without pause . This

The hash table only stores the URLs of static simulates a heavily loaded server.

HTML les. As requests for non-HTML do cu-
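A minimal sketch of such a trace-driven loop follows. Replaying back-to-back corresponds to the paper's THROTTLE timing mode; the event format, the `DictFS` stand-in, and the capped-wait option (in the spirit of the FAST mode) are hypothetical:

```python
import time

class DictFS:
    """Toy stand-in for the file system under test."""
    def __init__(self):
        self.store = {}
    def read(self, url):
        return self.store.get(url)
    def write(self, url, size):
        self.store[url] = size

def replay(events, fs, pace=None):
    """Replay proxy-log events against a file system.

    pace=None replays events back-to-back (THROTTLE style); a numeric
    pace caps the wait between events, roughly like the FAST mode.
    """
    processed = 0
    prev_ts = None
    for ts, url, size in events:
        if pace is not None and prev_ts is not None:
            time.sleep(min(ts - prev_ts, pace))  # capped inter-event wait
        prev_ts = ts
        # Issue the file system operations the proxy would have issued:
        # a read on a hit, a write after a simulated miss.
        if fs.read(url) is None:
            fs.write(url, size)
        processed += 1
    return processed

events = [(0.0, "/a.html", 1024), (0.1, "/a.gif", 512), (0.2, "/a.html", 1024)]
fs = DictFS()
n = replay(events, fs)
```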

UFS workload generator: wg-Squid. The wg-Squid simulates the file system behavior of Squid-2.2. The Squid cache has a 3-level directory hierarchy for storing files; the number of children at each level is a configurable parameter. In order to minimize file lookup times, Squid attempts to keep directories small by distributing files evenly across the directory hierarchy.

When a file is written to the cache, a top-level directory, or SwapDir, is selected. Squid attempts to load balance across the SwapDirs. Once the SwapDir has been selected, the file is assigned a file number which uniquely identifies the file within the SwapDir. The value of this file number is used to compute the names of the level-2 and level-3 directories. Thus, Squid does not use the URL or URL reference locality for file placement into directories, limiting the ability of the file system to collocate files which will be accessed together. Once the directory path has been determined, the file is created and written.
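The file-number-to-path mapping can be sketched as below. The arithmetic is illustrative, in the spirit of Squid's layout but not its exact code; `l1` and `l2` stand for the configurable children counts:

```python
def cache_path(swapdir, fileno, l1=16, l2=256):
    """Map a per-SwapDir file number to a 3-level cache path.

    Only the file number (never the URL) determines the level-2 and
    level-3 directory names, so related URLs land in unrelated
    directories -- the locality limitation described above.
    """
    d1 = (fileno // l2) % l1          # level-2 directory name
    d2 = fileno % l2                  # level-3 directory name
    return "%s/%02X/%02X/%08X" % (swapdir, d1, d2, fileno)

p = cache_path("/cache0", 0x1234)
```

Sequential file numbers spread across the level-3 directories, which keeps directories small but scatters files that will be accessed together.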

Squid has configurable high and low water marks. When the total cache size passes the high water mark, eviction begins. Files are deleted from the cache in modified LRU order, with a small bias towards expired files. Eviction continues until the low water mark is reached. The wg-Squid simulates this behavior except that it uses LRU instead of modified LRU; our log does not contain the expires information.
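The water-mark eviction that wg-Squid simulates (plain LRU, since our log lacks expiry data) can be sketched as:

```python
from collections import OrderedDict

def evict(cache, total, high, low):
    """Once `total` bytes exceed the high water mark, delete entries in
    LRU order until the low water mark is reached.

    `cache` maps name -> size, ordered oldest-first (LRU at the front).
    Returns the remaining total size.
    """
    if total <= high:
        return total  # below the high water mark: nothing to do
    while total > low and cache:
        name, size = next(iter(cache.items()))  # least recently used
        del cache[name]
        total -= size
    return total

cache = OrderedDict([("a", 40), ("b", 30), ("c", 30)])
remaining = evict(cache, total=100, high=90, low=50)
```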

Hummingbird workload generator: wg-Hummingbird. Each iteration of the wg-Hummingbird loop parses the next sequential event from the log and attempts to read the URL from Hummingbird. If successful, it explicitly releases the file buffer. If the read attempt fails, it attempts to write the URL into the file system; this would have occurred after the proxy fetched the URL from the server.
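Combined with the per-client collocation-hint bookkeeping that wg-Hummingbird keeps, the loop might be sketched as follows; the `ToyHummingbird` class is a hypothetical Python stand-in for the real Hummingbird calls:

```python
class ToyHummingbird:
    """Toy stand-in for the Hummingbird file system interface."""
    def __init__(self):
        self.store = {}
        self.hints = []
    def read(self, url):
        return self.store.get(url)
    def write(self, url, size):
        self.store[url] = size
    def release(self, url):
        pass  # buffer release is a no-op in this sketch
    def collocate_files(self, html, obj):
        self.hints.append((html, obj))

def is_html(url):
    return url.endswith(".html") or url.endswith("/")

def wg_hummingbird(events, fs):
    """Read-or-write each URL; pair each non-HTML object with the
    client's most recent static HTML page via a collocation hint."""
    current_html = {}  # client IP -> most recent static HTML URL
    for client, url, size in events:
        if is_html(url):
            current_html[client] = url
        elif client in current_html:
            fs.collocate_files(current_html[client], url)
        if fs.read(url) is not None:
            fs.release(url)      # explicitly release the file buffer
        else:
            fs.write(url, size)  # as if fetched from the origin server

events = [("10.0.0.1", "/p.html", 4096),
          ("10.0.0.1", "/img.gif", 512),
          ("10.0.0.1", "/p.html", 4096)]
fs = ToyHummingbird()
wg_hummingbird(events, fs)
```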

The wg-Hummingbird maintains a hash table keyed by client IP address to store information for generating the collocate hints. The values in the hash table are the URLs of the most recent HTML file request seen from a particular client. The hash table only stores the URLs of static HTML files. As requests for non-HTML documents are processed, wg-Hummingbird generates collocate_files() calls for the non-HTML document paired with its client's current HTML file, as stored in the hash table.

Note that Squid (and wg-Squid) do not provide explicit locality hints to the underlying operating system; they only place files in the directory hierarchy. This is in contrast with wg-Hummingbird, which provides explicit locality hints via the collocate_files() call. There are other ways in which a proxy could use UFS to obtain file locality; we did not test against these other approaches. Thus, our comparisons are restricted to the Squid approach for file locality.

4.2 Experiments

We performed our experiments on PCs with a 700 MHz Pentium III running FreeBSD 4.1 with an 18 GB IBM 10,000 RPM Ultra2 LVD SCSI disk (Ultrastar 18LZX). We also performed experiments on an SGI Origin 2000 with 18 GB 10,000 RPM Seagate Cheetah disks (ST118202FC) under IRIX 6.5. The results with different parameter settings were similar to each other, so we only present a representative subset of the results.

Our experiments used the 4-day web proxy log as input to the workload generators. Among other measurements, we measured the proxy hit rate, which represents how frequently the web page does not have to be fetched from the server, and the file system read time, which represents how long the proxy must wait to get the file from the file system. The file system write time is the time it takes the file system to return after a write; this may not include the time to write the file or file meta-data to disk. (We call the file system read (write) times FS read (write) time in our figures and tables.) wg-Squid and Hummingbird measured the file system read/write times directly, and the output of the iostat -o and sar commands was used to determine the disk I/O times. We also report the throughput, which we compute as the total number of requests processed divided by the experiment run time. The measurements are averaged over the entire log; warm-up effects were insignificant.
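Measuring the FS call times and throughput directly, as the generators do, can be sketched as below; `perf_counter` stands in for whatever clock the generators actually used:

```python
import time

class DictFS:
    """Toy file system for the measurement sketch."""
    def __init__(self):
        self.store = {}
    def read(self, url):
        return self.store.get(url)
    def write(self, url, size):
        self.store[url] = size

def timed_replay(events, fs):
    """Accumulate per-call FS read/write latencies and overall throughput."""
    read_time = write_time = 0.0
    reads = writes = 0
    start = time.perf_counter()
    for url, size in events:
        t0 = time.perf_counter()
        data = fs.read(url)
        read_time += time.perf_counter() - t0
        reads += 1
        if data is None:
            t0 = time.perf_counter()
            fs.write(url, size)
            write_time += time.perf_counter() - t0
            writes += 1
    run_time = max(time.perf_counter() - start, 1e-9)
    return {
        "mean_read_ms": 1000 * read_time / reads,
        "mean_write_ms": 1000 * write_time / max(writes, 1),
        "throughput_rps": len(events) / run_time,  # requests per second
    }

stats = timed_replay([("/a", 1), ("/a", 1), ("/b", 2)], DictFS())
```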

²We call this timing mode THROTTLE. We have implemented two other timing modes: REAL and FAST. REAL issues the events at the frequency at which they were recorded. FAST processes the events with a speed-up heuristic parameterized by n; if the time between events is longer than time n, then we only wait time n between them.

Comparing Hummingbird with UFS: single thread. We compared Hummingbird with three versions of UFS on FreeBSD 4.1. The three versions of UFS were: UFS, which is UFS mounted

Table 2: Comparing Hummingbird with UFS, UFS-async, and UFS-soft when files greater than 64 KB are not cached.

file          disk   main          proxy     FS read    FS write   # of disk    mean disk
system        size   memory (MB)   hit rate  time (ms)  time (ms)  I/Os         I/O time (ms)
Hummingbird   4 GB   256           0.62      1.81       0.32       1,161,163    5.14
UFS-async     4 GB   128+128       0.64      6.13       2.54       4,807,440    4.66
UFS-soft      4 GB   128+128       0.64      5.54       21.07      10,737,300   4.97
UFS           4 GB   128+128       0.64      5.93       20.77      10,238,460   5.02
Hummingbird   4 GB   1024          0.63      1.05       0.03       552,882      5.75
UFS-async     4 GB   512+512       0.64      3.21       2.35       3,217,440    3.95
UFS-soft      4 GB   512+512       0.64      3.27       20.85      9,714,420    4.76
UFS           4 GB   512+512       0.64      3.92       18.43      9,022,200    4.63
Hummingbird   8 GB   256           0.64      2.03       0.32       1,194,494    5.64
UFS-async     8 GB   128+128       0.67      5.69       1.47       4,605,360    4.34
UFS-soft      8 GB   128+128       0.67      5.78       15.66      9,058,141    4.86
UFS           8 GB   128+128       0.67      6.17       12.83      8,313,360    4.63
Hummingbird   8 GB   1024          0.64      1.20       0.03       578,562      6.37
UFS-async     8 GB   512+512       0.67      4.05       1.34       3,561,180    4.13
UFS-soft      8 GB   512+512       0.67      3.74       15.65      8,148,721    4.62
UFS           8 GB   512+512       0.67      3.81       15.70      7,611,840    4.68

synchronously (the default), UFS-soft, which is UFS with soft updates, and UFS-async, which is UFS mounted asynchronously, so that meta-data updates are not synchronous and the file system is not guaranteed to be recoverable after a crash. We used a version of Hummingbird with a single working thread, where the daemons were called explicitly every 1000 log events. Table 2 presents comparisons for two different disk sizes, 4 GB and 8 GB, with two memory sizes, 256 MB and 1024 MB, when files greater than 64 KB are not cached. The memory was split evenly between the Squid cache and the file system buffer cache³. The proxy-perceived latency in Table 2 is the 5th column, the FS read time. Hummingbird's smaller file system read time is due to the hits in main memory caused by grouping files in locality sets into clusters. Hummingbird's smaller file system write time (6th column) when compared to UFS-async is due to cluster writes, which write multiple files to disk in a single operation. The FS write times for UFS and UFS-soft are greater than UFS-async due to the synchronous file create operation.

Figure 2: Request throughput from Table 2. (Bar chart of throughput, in FS reads per second, for Hummingbird, UFS-async, UFS-soft, and UFS in the 4 GB/256 MB, 4 GB/1024 MB, 8 GB/256 MB, and 8 GB/1024 MB configurations.)

The effectiveness of the clustered reads and writes and the collocation strategy is illustrated in the number of disk I/Os. In all test configurations, Hummingbird issued substantially fewer disk I/Os than any of the UFS configurations. Also, note that the number of disk I/Os in the UFS experiments is larger than the total number of requests in the log. This is because file operations resulted in multiple disk I/Os. This also explains why UFS read and write operations (as seen in FS read and write times) are slower than individual disk I/Os. The mean disk I/O time is larger in Hummingbird since the request size is a cluster, which is larger when compared to the mean data transfer size accessed by UFS.

³We controlled the size of the file system buffer cache by locking the Squid's memory using the mlock system call and locking additional memory, so that the file system buffer cache could use only the remaining unlocked memory. In this way we also prevented paging of the Squid process.
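As a concrete reading of Table 2, the per-request disk I/O gap can be computed directly from the table's I/O counts; here for the 4 GB disk with 256 MB of total memory:

```python
# Disk I/O counts from Table 2, 4 GB disk, 256 MB total memory.
hummingbird_ios = 1_161_163
ufs_ios = 10_238_460

# UFS issued roughly 8.8x as many disk I/Os as Hummingbird for the
# same request stream -- the clustering and collocation effect.
ratio = ufs_ios / hummingbird_ios
print(round(ratio, 1))
```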

The throughput for each experiment in Table 2 is shown in Figure 2. Figure 2 shows that Hummingbird's throughput is much higher than that of UFS, UFS-soft, and UFS-async for the same disk size and memory size. This is not quite a fair comparison, since the proxy hit rate is lower with Hummingbird. (We do not expect the experiment run time to increase more than 10% when the Hummingbird policies are set so that it would have a hit rate equivalent to wg-Squid's.) The throughput is larger for Hummingbird since much less time is spent in disk I/O. Using throughput as a comparison metric, we see that Hummingbird is 2.3-4.0 times faster than simulated Squid running on UFS-async, 5.6-8.4 times faster than a simulated version of Squid running on UFS-soft, and 5.4-9.4 times faster than simulated Squid running on UFS. These numbers also include the results from Table 3.

The experiments for Table 2 assumed that files greater than 64 KB were not cached by the proxy. We got similar results when assuming the proxy would cache all files; see Table 3. Note that the proxy hit rate in Table 3 is lower than in Table 2. This is the result of the cache being "polluted" with large files, which cause some smaller files to be evicted. The end result is that there are fewer hits, which translates into less file system activity and fewer file accesses.

Comparing Hummingbird with EFS and XFS: single thread. We compared Hummingbird with EFS and XFS on SGI IRIX. EFS is an extent-based file system. XFS is a high-performance journalling file system that interacts with the kernel through the traditional VFS and vnode interfaces.

We experimented with 3 different disk sizes: 4 GB, 9 GB, and 18 GB. Since EFS supports file systems only up to 8 GB in size, we could not test it with the 9 or 18 GB disks. Table 4 presents results where the wg-Squid cache and the IRIX buffer cache together use 256 MB of main memory, which is divided in two ways: 50 MB + 206 MB and 128 MB + 128 MB (wg-Squid and buffer cache, respectively). The 50 MB for the wg-Squid cache was selected according to the Squid administration guidelines [23]. Hummingbird has 256 MB of main memory, and it has the best choice of policies, as previously discussed.

Our experimental results are in Table 4. Hummingbird's throughput is much higher than both XFS and EFS on the same disk size, as seen in the UFS experiments, and the user-perceived latency (5th column in Table 4) is smallest in Hummingbird. The second observation from Table 4 is that EFS is much slower than XFS, which is a journalling file system. In particular, the file system write time for EFS is more than 3 times larger than the XFS write time. This is the result of the frequent synchronous meta-data write operations performed by EFS. The third observation is that increasing the wg-Squid cache size to 128 MB and reducing the file system buffer cache size actually improved file system performance, which indicates that the wg-Squid cache is more effective than the file system buffer cache.

Comparing Hummingbird with UFS: multi-threaded workload generators. Many proxies are either multi-threaded or have multiple processes. Hummingbird is both thread- and process-safe. We implemented multi-threaded versions of both of our workload generators, with a thread for the daemons in wg-Hummingbird, and ran experiments similar to those above with one processor running FreeBSD 4.1 with the LinuxThreads library. Table 5 contains a subset of the results of our experiments using four threads and two disks. In the multi-threaded Squid workload generator, two cache root directories are used, each residing on its own disk. The experiment run times for the experiments in Table 5 are consistently longer than the corresponding cases in Table 3 due to an uneven distribution of the files on the 2 disks; we observed a bursty access pattern to each disk in the iostat log. Queuing in the device driver results in longer FS read/write times too. However, the Hummingbird throughput is again 2-4 times greater than for UFS, UFS-async, and UFS-soft.

Recovery performance. Recovering the hot clusters and the deletion log is quick; it takes about 30 seconds to read 2500 hot clusters and a log of 3000 deletion entries into memory. Thus, after a system crash with an 18 GB disk, Hummingbird can start to service requests in 30 seconds, while UFS will take more than 20 minutes to perform the necessary fsck before requests can be serviced. While rebuilding the in-memory meta-data (Step 6 in Section 3.4), the FS read time increased a small amount (e.g., from 2.03 ms to 2.50 ms for an 8 GB disk).

Journalling file systems recover integrity quickly compared with traditional UFSs due to log-based recovery. However, journalling alone will not fetch hot data from disk as part of the recovery process.

Table 3: Comparing Hummingbird with UFS, UFS-async, and UFS-soft when all files are cached.

file          disk   main          proxy     FS read    FS write   # of disk    mean disk      experiment
system        size   memory (MB)   hit rate  time (ms)  time (ms)  I/Os         I/O time (ms)  run time (s)
Hummingbird   4 GB   256           0.60      1.68       0.39       1,349,175    4.17           6,362
UFS-async     4 GB   128+128       0.62      6.61       2.88       5,134,380    4.72           25,510
UFS-soft      4 GB   128+128       0.62      5.81       22.91      11,858,100   5.02           59,475
UFS           4 GB   128+128       0.62      6.13       22.67      11,283,060   5.05           59,807
Hummingbird   4 GB   1024          0.61      0.88       0.52       630,362      5.62           6,030
UFS-async     4 GB   512+512       0.62      3.67       2.70       3,919,620    3.98           16,384
UFS-soft      4 GB   512+512       0.62      3.51       22.89      10,868,700   4.84           52,377
UFS           4 GB   512+512       0.62      4.24       20.23      10,115,400   4.69           49,749
Hummingbird   8 GB   256           0.64      2.17       0.40       1,464,727    5.03           7,923
UFS-async     8 GB   128+128       0.66      6.42       2.30       5,017,920    4.67           24,657
UFS-soft      8 GB   128+128       0.66      6.14       19.31      10,461,841   4.98           51,797
UFS           8 GB   128+128       0.66      7.37       16.42      9,954,540    4.89           51,062
Hummingbird   8 GB   1024          0.62      1.18       0.51       695,771      6.41           6,802
UFS-async     8 GB   512+512       0.66      4.66       1.90       4,016,400    4.32           18,240
UFS-soft      8 GB   512+512       0.67      3.69       15.71      8,158,080    4.61           37,097
UFS           8 GB   512+512       0.66      4.24       18.90      8,894,760    4.82           45,044

Table 4: Comparing Hummingbird with XFS and EFS with 256 MB of main memory when all files are cached.

file     disk    cache       proxy     FS read    FS write   # of disk    mean disk      experiment
system   size    size (MB)   hit rate  time (ms)  time (ms)  I/Os         I/O time (ms)  run time (s)
Hum      4 GB    256         0.62      2.82       0.20       922,871      9.85           9,926
EFS      4 GB    50+206      0.62      14.79      46.38      10,784,969   10.99          128,878
XFS      4 GB    50+206      0.62      14.97      16.56      10,870,353   6.67           75,565
EFS      4 GB    128+128     0.62      14.08      48.33      10,540,778   11.37          130,359
XFS      4 GB    128+128     0.62      14.90      13.95      10,115,369   6.72           70,746
Hum      9 GB    256         0.66      3.30       0.21       1,028,922    10.77          11,861
XFS      9 GB    50+206      0.66      15.47      12.39      9,558,954    7.02           69,929
XFS      9 GB    128+128     0.66      15.65      8.80       8,737,878    7.12           64,726
Hum      15 GB   256         0.67      3.71       0.21       1,049,121    11.92          13,313
XFS      15 GB   50+206      0.67      16.71      5.73       8,103,576    7.60           63,371
XFS      15 GB   128+128     0.67      15.91      5.79       7,841,184    7.55           60,943

Table 5: Comparing Hummingbird with UFS, UFS-async, and UFS-soft with 256 MB of main memory when all files are cached, with 2 disks and 4 threads in the workload generator.

file          disk   proxy     FS read    FS write   # of disk    mean disk      experiment
system        size   hit rate  time (ms)  time (ms)  I/Os         I/O time (ms)  run time (s)
Hummingbird   4 GB   0.64      4.86       1.34       1,478,005    11.68          11,743
UFS-async     4 GB   0.66      5.92       6.39       5,638,201    10.72          30,427
UFS-soft      4 GB   0.66      3.25       20.42      9,405,301    9.76           45,655
UFS           4 GB   0.66      6.71       15.86      10,771,201   9.05           48,577
Hummingbird   8 GB   0.67      6.12       1.40       1,556,169    14.08          12,997
UFS-async     8 GB   0.67      6.19       5.18       5,466,061    10.67          29,354
UFS-soft      8 GB   0.67      6.04       14.67      8,678,761    10.52          44,622
UFS           8 GB   0.67      6.78       9.73       8,903,641    8.74           38,464

5 Polygraph results

A web cache benchmarking package called Web Polygraph [18] is used for the comparison of caching web proxies. The clients and servers are simulated; client workload parameters such as hit ratio, cacheability, and response sizes can be specified, as can server-side delays. We used the PolyMix-2 traffic model to compare Apache [25] using UFS and a slightly modified version of Apache using Hummingbird without collocate_files() calls. Due to the space limitations of this paper, we only briefly discuss our results.

We used one of our FreeBSD Pentium IIIs for the proxy. We used a number of different client request rates; the following results are for 8 requests/second. We found that the mean response time for hits in the proxy is 14 times smaller with Hummingbird than with UFS, and the median response time for hits is 20 times smaller. The improvement in mean and median response times for proxy misses was much smaller, as expected; Hummingbird is 20% faster than UFS. Since Hummingbird serves proxy hits faster, its request queue is shorter, which in turn shortens the queue time of proxy misses.

6 Related work

Related work falls into two categories: first, analyses of traditional UFS-based systems and ways to beat their performance limitations, and second, analyses of the behavior of web proxies and how they can better use the underlying I/O and file systems.

The first set of research extends back to the original FFS work of McKusick et al. [13], which addressed the limitations of the System V file system by introducing larger block sizes, fragments, and cylinder groups. With increasing memory and buffer cache sizes, UNIX file systems were able to satisfy more reads out of memory. The FFS clustering work of McVoy and Kleiman [14] sought to improve write times by lazily writing the data to disk in contiguous extents called clusters. LFS [20] sought to improve write times by packing dirty file blocks together and writing to an on-disk log in large extents called segments. The LFS approach necessitates a cleaner daemon to coalesce live data and free on-disk segments. As well, new on-disk structures are required. Work in soft updates [6] and journalling [7, 4] has sought to alleviate the performance limitations due to synchronous meta-data operations, such as file create or delete, which must modify file system structures in a specified order. Soft updates maintains dependency information in kernel memory to order disk updates. Journalling systems write meta-data updates to an auxiliary log using the write-ahead logging protocol. This differs from LFS, in which the log contains all data, including meta-data. LFS also addresses the meta-data update problem by ordering updates within segments.

The Bullet server [26, 24] is the file system for Amoeba, a distributed operating system. The Bullet service supports whole-file operations to read, create, and delete files. All files are immutable. Each file is stored contiguously, both on disk and in memory. Even though the file system API is similar to Hummingbird's, the Bullet service does not perform clustering of files together, so it would not have the same type of performance improvement that Hummingbird has for a caching web proxy workload.

Kaashoek et al. [9] approach high performance through developing server operating systems, where a server operating system is a set of abstractions and runtime support for specialized, high-performance server applications. Their implementation of the Cheetah web server is similar to Hummingbird in one way: collocating an HTML page and its images on disk and reading them from disk as a unit. Web servers' data storage is represented naturally by the UFS file hierarchy. This is not true for caching web proxies, as discussed in Section 2.2.

CacheFlow [2] builds a cache operating system called CacheOS with an optimized object storage system which minimizes the number of disk seek and I/O operations per object. Unfortunately, details of the object storage are not public. The Network Appliance filer [8] is a prime example of the combination of an operating system and a specialized file system (WAFL) inside a storage appliance. Novell [15] has developed the Cache Object Store (COS), which they state is 10 times more efficient than typical file systems; few details on the design are available. The COS prefetches the components for a page when the page is requested, leading us to believe that the components are not stored contiguously as they are in Hummingbird.

Rousskov and Soloviev [21] studied the performance of Squid and its use of the file system. Markatos et al. [12] present methods for web proxies to work around costly file system file opens, closes, and deletes. One of their methods, LAZY-READS, gathers read requests n-at-a-time and issues them all at the same time to the disk; results are presented for n = 10. This is similar to our clustering of locality sets, since a read for a cluster will, on average, access 8 files. We feel that Hummingbird is a more general solution to decreasing the effect of costly file system operations on a web proxy.

Maltzahn et al. [10] compared the disk I/O of Apache and Squid and concluded that they were remarkably similar. In a later paper [11], they simulated the operation of several techniques for modifying Squid, one of which was to use a memory-mapped interface to access small files. Other techniques improved the locality of related files based on domain names. This paper reported a reduction of up to 70% in the number of disk operations relative to unmodified Squid. An inherent problem with using one memory-mapped file to access all small objects is that it cannot scale to handle a very large number of objects. Like Hummingbird, using memory-mapped files requires modification to the proxy code.

Pai et al. [17] developed a kernel I/O system called IO-lite to permit sharing of "buffer aggregates" between multiple applications and kernel subsystems. This system solves the multiple buffering problem, but, like Hummingbird, applications must use a different interface that supersedes the traditional UNIX read and write.

7 Conclusions

This paper explores file system support for applications which can take advantage of the performance/persistence tradeoff. Such a file system is especially useful for local caching of data, where permanent storage of the data is available elsewhere. A caching web proxy is the prime example of an application that may benefit from this file system.

We implemented Hummingbird, a light-weight file system that is designed to support caching web proxies. Hummingbird has two distinguishing features: it stores its meta-data in memory, and it stores groups of related objects (e.g., an HTML page and its embedded images) together on the disk. By setting the tunable parameters to achieve persistence, Hummingbird can also be used to improve the performance of web servers, which have similar reference locality as proxies. Our results are very promising; Hummingbird's throughput is 2.3-4.0 times larger than a simulated version of Squid running UFS mounted asynchronously on FreeBSD, 5.4-9.4 times larger than a simulated version of Squid running UFS mounted synchronously on FreeBSD, 5.6-8.4 times larger than a simulated version of Squid running UFS with soft updates on FreeBSD, and 5.4-13 times larger than XFS and EFS on IRIX. The Web Polygraph environment confirmed these improvements for the response times for proxy hits.

Additional information about Hummingbird is available at http://www.bell-labs.com/project/hummingbird/.

Acknowledgements. Many thanks to Arthur Goldberg for giving us copies of the ISP log files that we studied. WeeTeck Ng helped us with preparing our experiment environment. Bruce Hillyer and Philip Bohannon were very helpful in initial design discussions. Hao Sun performed the Polygraph experiments.

References

[1] Baker, M. G., Hartman, J. H., Kupfer, M. D., Shirriff, K. W., and Ousterhout, J. K. Measurements of a distributed file system. In Proceedings of the Thirteenth ACM Symposium on Operating Systems Principles (Asilomar (Pacific Grove), CA, Oct. 1991), ACM Press, pp. 198-212.

[2] Network cache performance measurements. CacheFlow White Papers Version 2.1, CacheFlow, Sept. 1998.

[3] Cao, P., and Irani, S. Cost-aware WWW proxy caching algorithms. pp. 193-206.

[4] Chutani, S., Anderson, O. T., Kazar, M. L., Leverett, B. W., Mason, W. A., and Sidebotham, R. N. The Episode file system. In Proceedings of the Winter 1992 USENIX Conference (San Francisco, CA, Winter 1992), pp. 43-60.

[5] Gabber, E., and Shriver, E. Let's put NetApp and CacheFlow out of business! In Proceedings of the 9th ACM SIGOPS European Workshop (Kolding, Denmark, Sept. 2000), pp. 85-90.

[6] Ganger, G. R., and Patt, Y. N. Metadata update performance in file systems. In Proceedings of the First USENIX Symposium on Operating Systems Design and Implementation (OSDI) (Monterey, CA, Nov. 1994), pp. 49-60.

[7] Hagmann, R. Reimplementing the Cedar file system using logging and group commit. In Proceedings of the Eleventh ACM Symposium on Operating Systems Principles (Austin, TX, Nov. 1987), pp. 155-162. In ACM Operating Systems Review 21:5.

[8] Hitz, D., Lau, J., and Malcolm, M. File system design for an NFS file server appliance. In Proceedings of the USENIX 1994 Winter Technical Conference (San Francisco, CA, Jan. 1994), pp. 235-246.

[9] Kaashoek, M. F., Engler, D. R., Ganger, G. R., and Wallach, D. A. Server operating systems. In Proceedings of the 1996 SIGOPS European Workshop (Connemara, Ireland, Sept. 1996), pp. 141-148.

[10] Maltzahn, C., Richardson, K., and Grunwald, D. Performance issues of enterprise level proxies. In Proceedings of the ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems (Sigmetrics '97) (Seattle, WA, June 1997), pp. 13-23.

[11] Maltzahn, C., Richardson, K. J., and Grunwald, D. Reducing the disk I/O of web proxy server caches. In Proceedings of the 1999 USENIX Annual Technical Conference (Monterey, CA, June 1999), pp. 225-238.

[12] Markatos, E. P., Katevenis, M. G., Pnevmatikatos, D., and Flouris, M. Secondary storage management for web proxies. In Proceedings of the 2nd USENIX Symposium on Internet Technologies and Systems (Boulder, CO, Oct. 1999), pp. 93-114.

[13] McKusick, M. K., Joy, W. N., Leffler, S. J., and Fabry, R. S. A fast file system for UNIX. ACM Transactions on Computer Systems 2, 3 (Aug. 1984), 181-197.

[14] McVoy, L. W., and Kleiman, S. R. Extent-like performance from a UNIX file system. In Proceedings of the Winter 1991 USENIX Conference (Dallas, TX, Jan. 1991), pp. 33-43.

[15] The Novell ICS advantage: Competitive white paper. Tech. rep. Available at http://www.novell.com/advantage/nics/nics-compwp.html.

[16] Ousterhout, J. K., Da Costa, H., Harrison, D., Kunze, J. A., Kupfer, M., and Thompson, J. G. A trace-driven analysis of the UNIX 4.2 BSD file system. In Proceedings of the 10th ACM Symposium on Operating Systems Principles (Orcas Island, WA, Dec. 1985), vol. 19, ACM Press, pp. 15-24.

[17] Pai, V. S., Druschel, P., and Zwaenepoel, W. IO-lite: a unified I/O buffering and caching system. In Proceedings of the Third USENIX Symposium on Operating Systems Design and Implementation (OSDI '99) (New Orleans, LA, Feb. 1999), pp. 15-28.

[18] Polyteam. Web Polygraph site. http://polygraph.ircache.net/.

[19] Roselli, D., Lorch, J. R., and Anderson, T. E. A comparison of file system workloads. In Proceedings of the 2000 USENIX Annual Technical Conference (USENIX-00) (San Diego, CA, June 2000), pp. 41-54.

[20] Rosenblum, M., and Ousterhout, J. K. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems 10, 1 (Feb. 1992), 26-52.

[21] Rousskov, A., and Soloviev, V. A performance study of the Squid proxy on HTTP/1.0. World Wide Web Journal, Special Edition on WWW Characterization and Performance Evaluation 2, 1-2 (1999).

[22] Seltzer, M. I., Ganger, G. R., McKusick, M. K., Smith, K. A., Soules, C. A. N., and Stein, C. A. Journaling versus soft updates: asynchronous meta-data protection in file systems. In Proceedings of the 2000 USENIX Annual Technical Conference (USENIX-00) (San Diego, CA, June 2000), pp. 71-84.

[23] Squid-users mailing list archive, 1999. http://squid.nlanr.net/mail-archive/squid-users/.

[24] Tanenbaum, A. S., van Renesse, R., van Staveren, H., Sharp, G. J., Mullender, S. J., Jansen, J., and van Rossum, G. Experiences with the Amoeba distributed operating system. Communications of the ACM 33 (Dec. 1990), 46-63.

[25] The Apache Software Foundation. Apache HTTP server project. http://www.apache.org/httpd.html.

[26] van Renesse, R., Tanenbaum, A. S., and Wilschut, A. N. The design of a high-performance file server. In Proceedings of the Ninth International Conference on Distributed Computing Systems (ICDCS 1989) (Newport Beach, CA, June 1989), IEEE Computer Society Press, pp. 22-27.