Proceedings of the 2001 USENIX Annual Technical Conference
Total Page:16
File Type:pdf, Size:1020Kb
USENIX Association Proceedings of the 2001 USENIX Annual Technical Conference Boston, Massachusetts, USA June 25–30, 2001 THE ADVANCED COMPUTING SYSTEMS ASSOCIATION © 2001 by The USENIX Association All Rights Reserved For more information about the USENIX Association: Phone: 1 510 528 8649 FAX: 1 510 548 5738 Email: [email protected] WWW: http://www.usenix.org Rights to individual papers remain with the author or the author's employer. Permission is granted for noncommercial reproduction of the work for educational or research purposes. This copyright notice must be included in the reproduced paper. USENIX acknowledges all trademarks herein. Storage management for web proxies Lan Huang Eran Gabb er Elizab eth Shriver SUNY Stony Brook Bel l Laboratories Bel l Laboratories [email protected] [email protected] [email protected] Christopher A. Stein Harvard University [email protected] Abstract Squid and Apache are two p opular web proxies. Both of these systems use the standard le sys- tem services provided by the host op erating system. On UNIX this is usually UFS, a descendantofthe Today,caching web proxies use general-purp ose le 4.2BSD UNIX Fast File System (FFS) [13]. FFS systems to store web ob jects. Proxies, e.g., Squid or was designed for workstation workloads and is not Apache, when running on a UNIX system, typically optimized for the di erent workload and require- use the standard UNIX le system (UFS) for this ments of a web proxy. It has b een observed that le purp ose. UFS was designed for research and engi- system latency is a key comp onent in the latency neering environments, whichhave di erentcharac- observed byweb clients [21]. teristics from that of a caching web proxy. Some of the di erences are high temp oral lo cality, relaxed Some commercial vendors haveimproved I/O p er- p ersistence requirements, and a di erent read/write formance by rebuilding the entire system stack: ratio. In this pap er, wecharacterize the web proxy a sp ecial op erating system with an application- workload, describ e the design of Hummingbird, a sp eci c le system executing on dedicated hardware light-weight le system for web proxies, and present (e.g., CacheFlow, Network Appliance). Needless to p erformance measurements of Hummingbird. Hum- say, these solutions are exp ensive. We b elieve that mingbird has two distinguishing features: it sepa- a lightweight and p ortable le system can b e built rates ob ject naming and storage lo cality through that will allowproxies to achieve p erformance close direct application-provided hints, and its clients are to that of a sp ecialized system on commo dity hard- compiled with a linked library interface for mem- ware, within a general-purp ose op erating system, ory sharing. When we simulated the Squid proxy, and with minimal changes to their source co de; Gab- Hummingbird achieves do cument request through- b er and Shriver [5] discuss this view in detail. put 2.3{9.4 times larger than with several di erent versions of UFS. Our exp erimental results are veri- We have built a simple, lightweight le system li- ed within the Polygraph proxy b enchmarking en- brary named Hummingbird that runs on top of a vironment. raw disk partition. This system is easily p ortable| wehave run it with minimal changes on FreeBSD, IRIX, Solaris, and Linux. In this pap er we de- 1 Intro duction scrib e the design, interface, and implementation of this system along with some exp erimental results Caching web proxies are computer systems dedi- that compare the p erformance of our system with a cated to caching and delivering web content. Typ- UNIX le system. Our results indicate that Hum- ically, they exist on a corp orate rewall or at the mingbird's throughput is 2.3{4.0 times larger than point where an Internet Service Provider (ISP) asimulated version of Squid running UFS mounted p eers with its network access provider. From a web asynchronously on FreeBSD, 5.4{9.4 times faster p erformance and scalability p oint of view, these sys- than Squid running UFS mounted synchronously tems have three purp oses: improve web client la- on FreeBSD, 5.6{8.4 times larger than a simulated tency, drive down the ISP's network access costs version of Squid running UFS with soft up dates on b ecause of reduced bandwidth requirements, and re- FreeBSD, and 5.4{13 times larger than XFS and duce request load on origin servers. Naming. The web proxy application determines EFS on IRIX (see Section 4). We also p erformed thenameofa letobewritteninto the le system. exp eriments using the Polygraph environment [18] In a traditional UNIX le system, le names are with an Apache proxy; the mean resp onse time for normally selected by a user. hits in the proxy is 14 times smaller with Humming- bird than with UFS (see Section 5). Reference lo cality. Clientweb accesses are char- acterized by a request for an HTML page followed Throughout the rest of this pap er, we use the terms by requests for emb edded images. The set of images proxy or web proxy to mean caching web proxy. Sec- within a cacheable HTML page changes slowly over tion 2 presents the imp ortantcharacteristics of the time. So, if a page is requested by a client, it makes proxy workload considered for our le system. It sense for the proxy to anticipate future requests by also presents some background on le systems and prefetching its historically asso ciated images. proxies that is imp ortant b ecause it motivates much of our design. Section 3 describ es the Humming- We studied one day of the web proxy log for this bird le system. Our exp eriments and results are reference lo cality. The rst time we see an HTML presented in Sections 4 and 5. Section 6 discusses le, we coalesce it with all following non-HTMLs related work in le systems and web proxy caching. from the same client to form the primordial local- ity set,whichdoesnotchange. The next time the HTML le is referenced, we form another lo cality 2 File system limitations set in the same manner and compare its le mem- The web proxy workload is di erentthanthestan- b ers with those of the primordial lo cality set. The dard UNIX workload; these di erences are discussed average hit rate (the ratio between the size of the in Section 2.1. Due to these di erences, the p erfor- latter sets and the size of the primordial set) across mance which can b e attained for a web proxy run- all references is 47%. Thus, on average, a lo cality set ning on UFS is limited by some of the features; these re-reference accesses almost half of the les of the limitations are discussed in Section 2.2. original reference. One of the reasons that this hit rate is small mightbe due to the assumption that 2.1 Web proxy workload characteristics all non-HTML les that follow a HTML le are in the same lo cality set; this is clearly not true if a Web proxy workloads have sp ecial characteristics, user has multiple active browser sessions. Also, we which are di erent from those of a traditional UNIX determined the typ e of le using the le extension, le system workload. This section describ es the sp e- thus possibly placing some HTML les in another cial characteristics of proxy workloads. le's lo cality set. We also studied the size of the lo- cality sets in bytes, and found that 42% are 32 KB We studied a week's worth of web proxy logs from or smaller, 62% are 64 KB or smaller, 73% are 96 a ma jor, national ISP, collected from January 30 to KB or smaller, and 80% are 128 KB or smaller. February 5, 1999. This proxy ran Netscap e Enter- prise server proxy software and the logs were gen- File access. Several older UNIX le system p er- erated in Netscap e-Extended2 format. For the pur- formance studies have shown that les are usually p ose of our analysis we isolated the request stream accessed sequentially [16, 1]. A recent study [19] to those that would a ect the le system underlying has suggested that this is changing, esp ecially for the proxy. Thus we excluded 34% of the GET re- memory-mapp ed and large les. However, UNIX quests which are considered non-cacheable by the le systems have b een designed for the traditional proxy. If we could not nd the le size, we re- sequential workload. Web proxies have an even moved the log event. This prepro cessing results in stronger and more predictable pattern of b ehavior. the removal of ab out 4% of the log records, nearly Web proxies currently always access les sequen- all during the rst few days. We eliminated the tially and in their entirety. (This maychange when rst few days and are left with 4 days of pro cessed range requests are more p opular.) logs containing 4.8 million requests for 14.3 GB of unique cacheable data and 27.6 GB total requested File size. Most cacheable web do cuments are cacheable data. small; the median size of requested les is 1986 B and the average size is 6167 B. Over 90% of the Characteristics of a web proxy and its workload are: references are for les smaller than 8 KB. Idleness. The proxy workload is characterized by Persistence.