1 A Secure, Publisher-Centric Web Caching
Infrastructure
z Andy Myersy John Chuang Urs Hengartner Yinglian Xie Weiqiang Zhuang Hui Zhang
yDepartment of Electrical and Computer Engineering, Carnegie Mellon University
zSchool of Information Management and Systems, University of California, Berkeley
Department of Computer Science, Carnegie Mellon University
Abstract— streams) back to the publishers. This is of particular con- The current web cache infrastructure, though it has a cern to publishers who rely on accurate hit counts to justify number of performance benefits, does not address many of their advertisement-driven revenue model, and to publish- the publishers’ requirements. We argue that web caches ers who wish to obtain accurate representations of the size should be enhanced to address publishers’ needs. For ex- and information consumption behavior of their audience. ample, caches will need to dynamically produce content, log/report client accesses, and give publishers QoS guaran- Finally, caches unilaterally make local copies of web ob- tees. In this paper, we propose Gemini, a publisher-centric jects, often without the consent or even the awareness of web caching infrastructure. Since Gemini caches can alter the publishers. Publishers have no knowledge of the num- content, traditional end-to-end security mechanisms can no ber and locations of cached copies of their objects, mak- longer ensure the integrity and authenticity of content. We ing object consistency impossible to maintain. As a re- thus introduce a new security architecture, based on digi- sult, caches may be serving stale or outdated objects to tal certificates and signatures. Certificates allow publishers the clients. For these reasons, many publishers have re- to specify which caches they trust to generate content for them. A certificate may apply to an entire group of caches sorted to cache-busting, i.e., bypassing the caches by tag- or it can delegate the trust decision to a third party. Digi- ging their objects ’non-cacheable’. This forces the caches tal signatures are used for verification: We require Gemini to forward all object requests back to the origin server. caches to sign their dynamically generated content, there- While this practice assures proper dynamic page genera- fore, clients can verify that this content was generated by tion, accurate hit counts, data consistency, and copyright an authorized cache. We also present a probabilistic scheme protection, it also forfeits all the benefits of caching. that exploits the non-repudiation property of digital signa- We believe that caching is fundamental to the long-term tures to let publishers identify misbehaving caches. In our scalability of the web infrastructure, and therefore it is im- design, we ensure that Gemini is incrementally deployable and seamlessly interoperates with the existing caching in- portant to align the interests of publishers and cache opera- frastructure. Along with a system design, we also present tors. We propose Gemini, a publisher-centric web caching an implementation and preliminary performance results. infrastructure and paradigm that will encourage the pub- lishers and cache operators to cooperate in the distribution and caching of web content. I. INTRODUCTION The Gemini strategy is to endow cache nodes with Web caching, like other forms of caching that occur at communications, storage and processing capabilities that various levels of the memory hierarchy (e.g., hardware, can be beneficially employed by publishers. A Gemini operating system, application), exploits the reference lo- cache node is designed as a next-generation web cache cality principle to improve the cost and performance of that can be incrementally deployed in the current cache data access. This has been especially effective at the In- infrastructure. It can transparently substitute for a regu- ternet level, where large geographic and topological dis- lar cache, as well as interoperate with existing coopera- tances separate the producers and consumers of content. tive caching schemes. A Gemini cache can support a va- The direct and tangible benefits of web caching include: riety of publisher-specified functions. In the data plane, improved access latency, reduced bandwidth consumption, it can support dynamic content generation using filtering, improved data availability, and reduced server load. versioning, and/or other publisher-authored methods based The main drawback of today’s cache infrastructure is on sandboxed, virtual machine based languages such as that it is network-centric, but not publisher-centric. From Java. In the control plane, a Gemini cache can support the publisher’s point of view, a number of important fea- customizable logging and reporting, as well as other func- tures are missing. First, caches are not equipped to han- tions such as object consistency control, access control, dle dynamically generated content, an increasingly large and publisher-specified QoS. portion of all web traffic. Requests for dynamic con- Central to our design is the architectural assumption of a tent have to be forwarded back to the origin servers, and heterogeneous global web cache infrastructure. Just as the the dynamically constructed pages cannot be reused, even Internet’s routers and links are owned by different admin- by the same client. Second, caches are unable to fur- istrative domains, we assume that caches belong to many nish reports on access statistics (e.g., hit counts and click- different administrative domains, and may have different 2 functionalities. Furthermore, the traditional client-server on modular and differential page construction techniques. architecture is replaced by a three-party architecture where For example, delta encoding [2], [3] combines the data al- the intermediary (web cache) can actively create/alter con- ready in cache with any differential updates from the origin tent on behalf of the servers. This means that traditional server. Other techniques include partial transfers, cache- end-to-end security mechanisms can no longer ensure the based compaction [4] and HTML macros [5]. integrity and authenticity of content. In the second case, Gemini supports on-the-fly page The Gemini security architecture is designed to protect construction by running publisher-authored code for filter- clients, publishers and caches from one another. First, the ing and versioning, etc. Consider the application of filter- publishers are assured of proper content generation and ac- ing to the dynamic generation of customized news pages curate logging/reporting by the caches. Second, the caches (e.g., MyYahoo). The publisher code residing at the Gem- are protected against faulty or malicious code from pub- ini cache will apply one or more filters to construct a cus- lishers or attackers. The end result: the caches create and tomized page on the fly. The filters may be derived from deliver content to the clients according to the specifications several sources. First, filters may be supplied by the user in of the publishers. the form of cookies in the HTTP request header. For exam- The rest of the paper is organized as follows. Section ple, a user may specify the news categories and stock sym- II describes the different dynamic content generation tech- bols that she wants to keep track of. Second, filters may niques and applications supported by Gemini. Sections III be derived from a user profile match that incorporates her and IV describes the security architecture and incremen- past browsing and purchasing history. This type of filters tal deployment strategy. The design and prototype imple- may be used for delivery of targetted ad banners, product mentation of the Gemini node, based on the open source recommendations and offers. Finally, the publisher code Squid [1] caching software, are presented in Section V. can generate its own filters by incorporating data that are We discuss the performance of our implementation in Sec- specific to the local environment. For example, when the tion VI, and identify related work in Section VII before we user accesses the URL from within her home area, the cus- conclude the paper. tomized page may include local weather, traffic and sports news. When the user is travelling outside her home area, II. APPLICATIONS the page may include links to food, accomodation, services Traditional caches can only handle static objects such as and maps for the foreign area instead. HTML pages. Gemini caches, on the other hand, are capa- Versioningis also useful for producing customized news ble of storing and processing active documents, including pages. For example, a page may be laid out in different the invocation of any publisher-authored methods based on ways according to user-specified preferences stored in a sandboxed, virtual machine based languages such as Java. cookie. The publisher code may also create different ver- This allows the Gemini caches to support, among a wide sions of the page for the same user based on the hard- range of publisher-centric applications, the generation and ware device (e.g., desktop and handhelds have different delivery of dynamic content. display capabilities), access bandwidth, operating system There are two main types of dynamic content. In the first and browser used to issue the request. case, a web page is dynamic because the underlying data While we have used the example of a customized news source changes frequently. Examples include stock tick- page, these techniques can also be beneficially employed ers, news headlines, and traffic reports. In the second case, by other types of web sites. For example, a consumer e- a web page is dynamic because it is constructed on the fly commerce merchant may tailor web pages to individual on a per request basis. The exact form and substance of the customers with product recommendations, special offers, page may be based on input from the client, server, and/or etc., based on the customer profile filter. cache. Examples include database or search responses, customized news, customized page layout. Using a vari- III. SECURITY ety of techniques, Gemini caches can support both types of dynamic content generation. In traditional distributed communication, end-to-end When the underlying data source changes frequently, mechanisms are sufficient to secure data sent between the the cost-effectiveness of caching is dependent on the ex- client and the publisher because intermediate nodes do not pected lifetime of the data. More specifically, the thresh- alter content. In our system, caches are active participants old for caching should be a function of the ratio of read-to- in content generation, so end-to-end security mechanisms write frequency. For example, online stock tickers may be are no longer sufficient. But it is not only dynamic content updated on a minute-to-minute basis, but popular tickers that affects the end-to-end nature of securing content de- may be read multiple times per minute to justify caching. livery. Caches are now responsible for logging user hits as In many cases, a dynamic page consists of a small amount well. Publishers need assurances that caches will log ac- of frequently updated data embedded in a sea of static data. cess correctly, and that these logs will be transported back Instead of treating the entire page as un-cacheable, a Gem- to the publisher intact. To accomplish this, caches must ini node can cache portions of a page according to pub- become fully involved in the system’s security. lisher directives. Then it can generate new pages based As an example, consider a publisher’s digital signature 3 1 on a document . Previously, a client would be able to use D . Because there is a chance that a trusted cache will be the signature to verify the authenticity of the document. compromised by an attacker, the publisher must still verify With Gemini, a cache between the publisher and client that trusted caches are functioning correctly. might transform the document according to a publisher’s A client will trust a publisher to deliver its own content. instructions, but the cache is unable to alter the publisher’s The client will also trust a publisher to delegate content signature because it does not possess the publisher’s secret delivery. Thus, if a publisher trusts some cache to deliver key. The result is that the client is unable to use the pub- some document, the client will also trust that cache for that lisher’s signature to verify the version of the document it document. For the same reasons as the publisher, the client receives. The obvious solution to this problem would be to needs to verify that the cache is performing correctly. distribute the publisher’s secret key to caches, but this has A cache also trusts the publisher, and any cache the pub- serious ramifications: a cache with the secret key would be lisher trusts, to deliver the publisher’s content. Other able to sign any content whatsoever on behalf of the pub- caches are completely untrusted, with one exception: a lisher. Even if the cache’s owner is honest enough not to cache may trust other caches in the same administrative exploit this, crackers who break into the cache may not be domain to deliver any document. Even if it can be sure as polite. which publisher sent a piece of code, a cache will not trust Our design is guided by four principles: that code to function correctly. Protect the publisher and client—not just the cache. Many Publishers/clients find out about attacks eventually. If the previous systems have focused on protecting caches and publisher trusts honest caches which are never compro- clients from the publisher. However, it is just as important mised, its content will always be safe. But if trusted caches to protect the publisher and clients from caches. become dishonest, the system’s security is endangered. The main risk to the publisher and client is of content be- Publishers and clients must have a way to detect these ing altered before it is delivered to the client. An attacking breaches of trust. cache could edit or delete the publisher’s objects, or add Many applications require that attacks be detected in- entirely new objects which appear to be from the publisher, stantly, but in a caching infrastructure, instant detection so that what a client receives is not what the publisher in- is expensive because it requires the publisher to verify all tended to send. In addition to simply altering content, a content delivered. If we loosen the restriction and increase cache could run a publisher’s code incorrectly (either cor- the amount of time an attack can go undiscovered, the cost rupting the program’s state or the input given to the pro- of verification can be reduced since it can be done less of- gram). ten. Each publisher should be allowed to make its own Caches can also mishandle a publisher’s content by prema- decision about how long an attack can continue undiscov- turely ejecting it, by not respecting the quality of service ered since each publisher’s content has a different value. that the publisher requested for the content, or by serving We believe that for most content, the value to a publisher a stale version of the content. The first two of these affect of a single page being served correctly is very small. If performance but not correctness, while the last does affect a publisher’s content is temporarily altered or made un- correctness. Finally, a cache could add, alter, or delete en- available, the loss to the publisher is tolerably small. On tries from the log of client accesses recorded by the cache the other hand, a publisher might wish to frequently verify and returned to the publisher. content with a high value since even a short attack would Publishers decide whom to trust. Like the other hardware have a significant cost. In this case, the high value of the in the Internet, we expect caches to be owned and admin- content justifies a higher cost of verification. istrated by a wide variety of organizations. We cannot ex- The system should be incrementally deployable. The het- pect every publisher to trust every organization, nor can we erogeneous nature of the Internet prevents any system from even expect all publishers to agree on which organizations being universally deployed in a short amount of time. In- are trustworthy. For this reason, we must let publishers de- stead, new systems must be able to be deployed gradually, cide which caches store and serve their content. Further, and must not interfere with existing systems. On the other because some of a publisher’s documents may be more im- hand, the system is not secure until the whole path from portant than others, publishers must be able to specify trust the publisher to the client is secured. on a document by document basis. This allows a publisher Neither caches nor client browsers should need to be modi- to widely distribute less important content while keeping fied. The caching infrastructure should be secure from the its most vital content in a smaller number of highly trusted publisher all the way to the last cache or browser which caches. has our system installed. Clients which do not run our sys-
Each publisher may choose to trust a set of caches, C ,to tem will be as vulnerable to attack as they are today since C serve a set of documents, D . Any cache not in is not an attacker could alter content right before it arrives at the trusted to correctly store or run code from the objects in unmodified client.
D . The publisher trusts that any cache which is a member The challenge in securing the cache is to come up with of C will correctly store and run code from any object in an approach that is both powerful enough to provide pro-
1 We consider a “document” to be a single object, rather than a whole tection and generally applicable. For example, one ap- “page,” or group of objects which a browser might display together. proach would be to require that each cache include a se-
4 Expires cure coprocessor [6], which is a processor and memory en- lisher’s public key, from Valid to is the range
cased in a secure, tamper-proof enclosure. The idea is that of time that the certificate is valid, and CA is the name all parties can trust the coprocessor to oversee the genera- of the certificate authority who created the certificate. The
tion of all content on the cache. Unfortunately, secure co- certificate is signed with the certificate authority’s private