A Secure, Publisher-Centric Web Caching Infrastructure

Andy Myers†, John Chuang‡, Urs Hengartner, Yinglian Xie, Weiqiang Zhuang, Hui Zhang

†Department of Electrical and Computer Engineering, Carnegie Mellon University
‡School of Information Management and Systems, University of California, Berkeley
Department of Computer Science, Carnegie Mellon University

Abstract: The current web cache infrastructure, though it has a number of performance benefits, does not address many of the publishers’ requirements. We argue that web caches should be enhanced to address publishers’ needs. For example, caches will need to dynamically produce content, log/report client accesses, and give publishers QoS guarantees. In this paper, we propose Gemini, a publisher-centric web caching infrastructure. Since Gemini caches can alter content, traditional end-to-end security mechanisms can no longer ensure the integrity and authenticity of content. We thus introduce a new security architecture, based on digital certificates and signatures. Certificates allow publishers to specify which caches they trust to generate content for them. A certificate may apply to an entire group of caches or it can delegate the trust decision to a third party. Digital signatures are used for verification: we require Gemini caches to sign their dynamically generated content, so that clients can verify that this content was generated by an authorized cache. We also present a probabilistic scheme that exploits the non-repudiation property of digital signatures to let publishers identify misbehaving caches. In our design, we ensure that Gemini is incrementally deployable and seamlessly interoperates with the existing caching infrastructure. Along with a system design, we also present an implementation and preliminary performance results.

I. INTRODUCTION

Web caching, like other forms of caching that occur at various levels of the memory hierarchy (e.g., hardware, operating system, application), exploits the reference locality principle to improve the cost and performance of data access. This has been especially effective at the Internet level, where large geographic and topological distances separate the producers and consumers of content. The direct and tangible benefits of web caching include improved access latency, reduced bandwidth consumption, improved data availability, and reduced server load.

The main drawback of today’s cache infrastructure is that it is network-centric, but not publisher-centric. From the publisher’s point of view, a number of important features are missing. First, caches are not equipped to handle dynamically generated content, an increasingly large portion of all web traffic. Requests for dynamic content have to be forwarded back to the origin servers, and the dynamically constructed pages cannot be reused, even by the same client. Second, caches are unable to furnish reports on access statistics (e.g., hit counts and click-streams) back to the publishers. This is of particular concern to publishers who rely on accurate hit counts to justify their advertisement-driven revenue model, and to publishers who wish to obtain accurate representations of the size and information consumption behavior of their audience. Finally, caches unilaterally make local copies of web objects, often without the consent or even the awareness of the publishers. Publishers have no knowledge of the number and locations of cached copies of their objects, making object consistency impossible to maintain. As a result, caches may be serving stale or outdated objects to the clients.

For these reasons, many publishers have resorted to cache-busting, i.e., bypassing the caches by tagging their objects ‘non-cacheable’. This forces the caches to forward all object requests back to the origin server. While this practice assures proper dynamic page generation, accurate hit counts, data consistency, and copyright protection, it also forfeits all the benefits of caching.

We believe that caching is fundamental to the long-term scalability of the web infrastructure, and therefore it is important to align the interests of publishers and cache operators. We propose Gemini, a publisher-centric web caching infrastructure and paradigm that will encourage the publishers and cache operators to cooperate in the distribution and caching of web content.

The Gemini strategy is to endow cache nodes with communications, storage and processing capabilities that can be beneficially employed by publishers. A Gemini cache node is designed as a next-generation web cache that can be incrementally deployed in the current cache infrastructure. It can transparently substitute for a regular cache, as well as interoperate with existing cooperative caching schemes. A Gemini cache can support a variety of publisher-specified functions. In the data plane, it can support dynamic content generation using filtering, versioning, and/or other publisher-authored methods based on sandboxed, virtual machine based languages such as Java. In the control plane, a Gemini cache can support customizable logging and reporting, as well as other functions such as object consistency control, access control, and publisher-specified QoS.

Central to our design is the architectural assumption of a heterogeneous global web cache infrastructure. Just as the Internet’s routers and links are owned by different administrative domains, we assume that caches belong to many different administrative domains, and may have different functionalities. Furthermore, the traditional client-server architecture is replaced by a three-party architecture where the intermediary (web cache) can actively create/alter content on behalf of the servers. This means that traditional end-to-end security mechanisms can no longer ensure the integrity and authenticity of content.

The Gemini security architecture is designed to protect clients, publishers and caches from one another. First, the publishers are assured of proper content generation and accurate logging/reporting by the caches. Second, the caches are protected against faulty or malicious code from publishers or attackers. The end result: the caches create and deliver content to the clients according to the specifications of the publishers.

The rest of the paper is organized as follows. Section II describes the different dynamic content generation techniques and applications supported by Gemini. Sections III and IV describe the security architecture and incremental deployment strategy. The design and prototype implementation of the Gemini node, based on the open source Squid [1] caching software, are presented in Section V. We discuss the performance of our implementation in Section VI, and identify related work in Section VII before we conclude the paper.

II. APPLICATIONS

Traditional caches can only handle static objects such as HTML pages. Gemini caches, on the other hand, are capable of storing and processing active documents, including the invocation of any publisher-authored methods based on sandboxed, virtual machine based languages such as Java. This allows the Gemini caches to support, among a wide range of publisher-centric applications, the generation and delivery of dynamic content.

There are two main types of dynamic content. In the first case, a web page is dynamic because the underlying data source changes frequently. Examples include stock tickers, news headlines, and traffic reports. In the second case, a web page is dynamic because it is constructed on the fly on a per request basis. The exact form and substance of the page may be based on input from the client, server, and/or cache. Examples include database or search responses, customized news, and customized page layout. Using a variety of techniques, Gemini caches can support both types of dynamic content generation.

[...] on modular and differential page construction techniques. For example, delta encoding [2], [3] combines the data already in cache with any differential updates from the origin server. Other techniques include partial transfers, cache-based compaction [4] and HTML macros [5].

In the second case, Gemini supports on-the-fly page construction by running publisher-authored code for filtering, versioning, etc. Consider the application of filtering to the dynamic generation of customized news pages (e.g., MyYahoo). The publisher code residing at the Gemini cache will apply one or more filters to construct a customized page on the fly. The filters may be derived from several sources. First, filters may be supplied by the user in the form of cookies in the HTTP request header. For example, a user may specify the news categories and stock symbols that she wants to keep track of. Second, filters may be derived from a user profile match that incorporates her past browsing and purchasing history. This type of filter may be used for delivery of targeted ad banners, product recommendations and offers. Finally, the publisher code can generate its own filters by incorporating data that are specific to the local environment. For example, when the user accesses the URL from within her home area, the customized page may include local weather, traffic and sports news. When the user is travelling outside her home area, the page may include links to food, accommodation, services and maps for the foreign area instead.

Versioning is also useful for producing customized news pages. For example, a page may be laid out in different ways according to user-specified preferences stored in a cookie. The publisher code may also create different versions of the page for the same user based on the hardware device (e.g., desktop and handhelds have different display capabilities), access bandwidth, operating system and browser used to issue the request.

While we have used the example of a customized news page, these techniques can also be beneficially employed by other types of web sites. For example, a consumer e-commerce merchant may tailor web pages to individual customers with product recommendations, special offers, etc., based on the customer profile filter.

III. SECURITY
