School of Computer Science and Engineering
University of New South Wales

Caching Dynamic Data for Web Applications

For the Degree of Doctor of Philosophy

Submitted by Mehregan Mahdavi

Supervisor: Dr. John Shepherd
Co-supervisor: A/Prof. Boualem Benatallah

November 2006

Acknowledgements

I would like to express my gratitude to all those who gave me the possibility to complete this thesis. Dr. John Shepherd and Dr. Boualem Benatallah provided the motivation, enthusiasm, and constructive comments and feedback during many discussions we had. It was a great pleasure for me to finish this thesis under their supervision. I would also like to thank the people who helped in implementing fragments of the simulation test-bed used in the evaluation: Willy Mong, Chatarpreet Singh Jitla, and Ka Yee Joanna Chan. Last but not least, I would like to give my special thanks to my wife Shirley, whose patient love enabled me to complete this thesis.

List of Publications

The work described in this thesis has been presented in the following publications. In each case, the author of this thesis was the primary contributor.

“Caching on the Web”, Mehregan Mahdavi, John Shepherd, Boualem Benatallah. Web Data Management Practices: Emerging Techniques and Technologies, Eds. Athena Vakali and George Pallis, Idea Group, Inc., 2005.

“Enabling Dynamic Content Caching in Web Portals”, Mehregan Mahdavi and John Shepherd. 14th International Workshop on Research Issues on Data Engineering (IEEE-RIDE’04), March 2004, USA.

“A Collaborative Approach for Caching Dynamic Data in Portal Applications”, Mehregan Mahdavi, John Shepherd, Boualem Benatallah. The Fifteenth Australasian Database Conference (ADC’04), January 2004, New Zealand.

“Caching Dynamic Data for E-Business Applications”, Mehregan Mahdavi, Boualem Benatallah, Fethi Rabhi. International Conference on Intelligent Information Systems (IIS’03): New Trends in Intelligent Information Processing and Web Mining, June 2003, Poland.

Abstract

Web portals are one of the most rapidly growing classes of Web application, providing a single interface for accessing different sources (providers). The results from the providers are typically obtained by each provider querying a database and returning an HTML or XML document. Performance, and in particular providing fast response time, is one of the critical issues in such applications. Dissatisfaction of users dramatically increases with increasing response time, resulting in abandonment of Web sites, which in turn could result in loss of revenue for the providers and the portal. Caching is one of the key techniques that address the performance of such applications. In this work we focus on improving the performance of portal applications via caching. We discuss the limitations of existing caching solutions in such applications and introduce a caching strategy based on collaboration between the portal and its providers. Providers trace their logs, extract information to identify good candidates for caching, and notify the portal. Caching at the portal is decided based on scores calculated by providers and associated with objects. We evaluate the performance of the collaborative caching strategy using simulation data. We show how providers can trace their logs, calculate cache-worthiness scores for their objects, and notify the portal. We also address the issue of heterogeneous scoring policies among different providers and introduce mechanisms to regulate caching scores. Finally, we show how the portal and providers can synchronize their meta-data in order to minimize the overhead associated with collaboration for caching.

Contents

1 Introduction

2 Background Information
  2.1 Web Portals
    2.1.1 Architectures
    2.1.2 Enabling Technologies
    2.1.3 Performance Issues
    2.1.4 Benchmarking
  2.2 Web Data Caching: An Overview
    2.2.1 Cache Hierarchy
    2.2.2 Caching Issues
    2.2.3 Distributed Cache Management
    2.2.4 Cache Coherency
    2.2.5 Dynamic Content Caching
    2.2.6 Caching Policy and Replacement Strategy
    2.2.7 Query Rewriting and Caching
    2.2.8 Query Processing & Caching
    2.2.9 Case Studies
  2.3 Summary

3 A Collaborative Caching Strategy for Web Portals

  3.1 Caching in Web Portals
  3.2 Caching Strategy
  3.3 Meta-data Support
  3.4 Calculating Cache-worthiness
  3.5 Other Parameters
  3.6 Regulating Heterogeneous Caching Scores
  3.7 Synchronization of Meta-data
  3.8 Summary

4 Evaluation and Analysis
  4.1 Evaluation Test-bed
    4.1.1 Dependence Between Objects
    4.1.2 Size and Computation Cost
    4.1.3 Cache-worthiness Scores
  4.2 Evaluation Results
    4.2.1 Throughput
    4.2.2 Network Bandwidth Usage
    4.2.3 Average Access Time
    4.2.4 Analysis of the Performance Results
    4.2.5 Average Access Time - First Reply
    4.2.6 Effect of User Access Pattern
    4.2.7 Effect of Cache Size
    4.2.8 Weak Cache Coherency
    4.2.9 Recency of Objects
    4.2.10 Utility of Providers
    4.2.11 Regulation
    4.2.12 Meta-Data Synchronization
  4.3 Summary

5 Conclusions

Bibliography

List of Figures

2.1 Centralized Portal Architecture
2.2 Distributed portal architecture
2.3 Complementary providers
2.4 Competitor providers
2.5 Server accelerator
2.6 Edge servers (forward setup)
2.7 Edge servers (reverse setup)

3.1 Caching in portals
3.2 Caching algorithm used by portal
3.3 Caching algorithm used by providers
3.4 Meta-data used by the caching strategy
3.5 Details of caching algorithm used by provider

4.1 Architecture of the test-bed
4.2 Synchronization between UserSimulators and PortalSimulator
4.3 Synchronization of request items
4.4 Synchronization of response items
4.5 Object Dependence Graph (ODG)
4.6 Input file to define object dependence
4.7 Input file to define size of objects

4.8 Input file to define computation cost of objects
4.9 Throughput
4.10 Throughput - upper-bound scenario for CacheCW
4.11 Network Bandwidth Usage
4.12 Network Bandwidth Usage - upper-bound scenario for CacheCW
4.13 Average Access Time
4.14 Average Access Time - upper-bound scenario for CacheCW
4.15 Average Access Time (first reply)
4.16 Average Access Time (first reply) - upper-bound scenario for CacheCW
4.17 Throughput - high update rate
4.18 Hit Ratio - different cache sizes
4.19 Hit Ratio - smaller cache sizes
4.20 Effect of cache size on performance of CacheCW
4.21 Performance measures - weak coherency
4.22 Average Access Time (first reply) - weak coherency
4.23 Regulating heterogeneous cache-worthiness scores

List of Tables

2.1 A classification of cache coherency mechanisms
2.2 Comparison of push and pull
2.3 Summary of ESI tags
2.4 Summary of JESI tags
2.5 Supported cache tag attributes in BEA WebLogic
2.6 Supported cache directive attributes in ASP.NET

4.1 False Hit Ratio - weak coherency
4.2 Comparison of CacheCW-R and CacheCW
4.3 Effect of utility on throughput
4.4 Effect of utility on average access time
4.5 Average access times for individual providers
4.6 Total average access time
4.7 Periodic synchronization of meta-data
4.8 Effect of synchronization on throughput

Chapter 1

Introduction

The Internet provides a convenient and inexpensive infrastructure for communicating and exchanging data between users and data sources. It has influenced many aspects of life such as communication, education, business, shopping, and entertainment. There are many resources on the Internet; some provide data to be used and shared among users, while others are designed to provide applications. For example, Web sites such as university sites, people's home pages, and yellow and white pages provide data. There are also Web sites which provide applications that can be used by users, such as on-line shopping, flight booking, and banking. Users find and access appropriate data or applications through Web browsers.

Performance is one of the major issues in today's Web-enabled applications. Previous research has shown that abandonment of Web sites dramatically increases with increasing response time [Zon01], resulting in loss of revenue by businesses. In other words, providing a fast response time is one of the critical issues that today's Web applications must deal with. Nowadays, many Web sites employ dynamic Web pages by accessing a back-end database and formatting the result into HTML pages. Accessing the database and assembling the final result on the fly is an expensive process and a significant factor in the overall performance of such systems. Server workload or failure and network traffic are other contributing factors for slow response times.

With the increasing use of the Internet for applications, and the emergence of a class of Web applications called Web portals, there is a need for better performance. Web portals enable access to different data or application providers through a single interface. They save time and effort for customers, who only need to access the portal's Web interface rather than navigating through many providers. Business portals, such as Expedia (www.expedia.com) and Amazon (www.amazon.com), are examples of such applications.

Caching is one of the key techniques that address some of the performance issues of Web-enabled applications. Caching can improve response time; as a result, customer satisfaction is increased and better revenue for the portal and the providers is generated. In addition, network traffic and the workload on the providers' servers are considerably reduced. This in turn improves throughput and scalability and reduces hardware and software costs.

When considering caching techniques, a caching policy is required to determine which objects should be cached. Rapidly changing data may not be worth caching because of the space, communication, or computation costs involved. Therefore, a cache policy is required that decides whether or not to cache an item based on the benefits and costs of caching it. Communication cost is measured in terms of the number of messages exchanged between the cache and the data provider. Computation cost is measured in terms of the processing time required at data providers for refreshing or invalidating cached items. Moreover, there is some cost for the cache manager to run the cache replacement algorithm.

Web caching has been extensively studied. Existing approaches have examined caching in a general setting and can provide some benefit to portals, as well as to general Web sites. However, portals have distinctive properties which can be exploited to provide significantly better caching than that provided by more general approaches. This thesis aims at examining caching solutions specifically for portal applications. We propose a new approach that results in significant benefits over existing approaches. The primary technique used in this thesis is for portals and providers to collaborate by sharing critical caching information.

In this thesis, we aimed to investigate the problem of improving the performance of portal applications via caching. This included proposing a new strategy for portal caching involving collaboration between the portal and its providers and analyzing its performance. Throughput, average access time, and network bandwidth usage were used as the primary performance measures, with the goal of achieving better performance than existing techniques. We also aimed to investigate the issues raised by such a collaborative approach and provide solutions to address them. These issues included the heterogeneity of the different providers involved in the collaboration process and the coherence of the meta-data used by the portal and its providers.

This thesis is organized as follows: Chapter 2 provides an overview of Web portals and Web caching. It also discusses the use and shortcomings of existing caching techniques in Web portals. Chapter 3 describes our proposed caching strategy in detail. Evaluation of the strategy is presented in Chapter 4. Finally, Chapter 5 concludes the thesis.

Chapter 2

Background Information

This chapter provides an overview of Web portals, their architectures, enabling technologies, and performance issues. It also gives a comprehensive overview of Web caching. We study the issues of caching dynamic data in Web portals and discuss the shortcomings of existing solutions.

2.1 Web Portals

It is now common for businesses to offer a Web site through which customers can search for and buy products or services on-line. Such businesses are referred to as product or service providers. Due to the large number of existing providers, portals have emerged as Internet-based applications which enable access to different providers through a single Web interface. The idea is to save time and effort for customers, who only need to access the portal's Web interface instead of having to navigate through many provider Web sites. In other words, the portal represents an integrated service which is an aggregation of the services provided by the available providers. A Web portal is normally used to provide a mediated schema of data sources or service providers. Users issue queries based on the schema provided by the portal; the portal finds relevant data sources, reformulates each query into a form acceptable by each data source, submits a query to each individual data source, and combines the results from the different data sources [MAG+97, QWG+96, HMN+99]. Business portals, such as Amazon (www.amazon.com) and Expedia (www.expedia.com), are examples of such applications where customers can search for services or products to use or buy on-line.

Each provider may have a membership relationship with a number of portals. Moreover, each provider may have a number of sub-providers. Each provider stores its own catalog, and the integrated catalog represents the aggregation of all providers' catalogs. The portal deals with a request from the customer by sending requests to the appropriate providers. Responses from providers are sent back to the portal, processed, and a final response is returned to the customer.

2.1.1 Architectures

There are two main approaches to establishing and maintaining the relationship between the portal and its providers:

• Centralized: In the centralized approach, each provider sends its content to the portal, and the contents from different providers are combined and maintained by the portal. When the portal receives a query, e.g., a browse request, it processes the query locally. Each provider is responsible for updating its content to provide fresh data. Normally, a person such as an administrator manages and updates the content on a regular basis. In this case, the provider does not have to have a Web site or an application program that talks to the portal. This is normally done manually by uploading the catalogue data through the portal's Web site or through FTP. It may also be done automatically, i.e., through a program, on a periodic basis or when the number of changes in the provider's database exceeds a threshold. Figure 2.1 shows the architecture of a centralized portal.

• Distributed: In the distributed approach, each provider maintains its own content. When the portal receives a request, it queries the appropriate provider(s). Each provider processes the query and returns the result to the portal, e.g., as an XML or SOAP message. Each provider may in turn need to contact other providers, e.g., when a service or product is made up of different services or items which are provided by other providers. The final result is integrated at the requesting portal, formatted into HTML or any other format, and returned to the user. In the distributed approach, an application represents the provider; it manages the content and talks to the portal application. Figure 2.2 shows the architecture of a distributed portal. A small code sketch of this fan-out-and-aggregate flow is given after this list.
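To make the distributed flow above concrete, the following is a minimal sketch (not taken from this thesis) of a portal that fans a user query out to several providers in parallel and integrates whatever replies arrive within a time budget. The Provider interface, the thread-pool size, the time budget, and the plain-string result type are illustrative assumptions.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical provider interface: each provider answers a sub-query,
// e.g., by returning an XML or SOAP fragment as a string.
interface Provider {
    String query(String subQuery) throws Exception;
}

public class DistributedPortal {
    private final List<Provider> providers;
    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    public DistributedPortal(List<Provider> providers) {
        this.providers = providers;
    }

    // Fan the user query out to all relevant providers and collect their replies.
    public List<String> handleRequest(final String userQuery) throws InterruptedException {
        List<Callable<String>> tasks = new ArrayList<Callable<String>>();
        for (final Provider p : providers) {
            tasks.add(new Callable<String>() {
                public String call() throws Exception {
                    return p.query(userQuery);
                }
            });
        }
        // Illustrative time budget: do not let one slow provider stall the whole page.
        List<Future<String>> replies = pool.invokeAll(tasks, 5, TimeUnit.SECONDS);
        List<String> results = new ArrayList<String>();
        for (Future<String> reply : replies) {
            try {
                results.add(reply.get());          // provider answered in time
            } catch (Exception slowOrFailed) {
                // Timed out or failed: the portal can still return partial results.
            }
        }
        return results;                            // the portal formats these into HTML
    }
}

With this structure, a slow or failed provider degrades the response rather than stalling it, which is the behaviour that the caching and partial-result techniques discussed later aim to improve further.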

Both approaches rely on a meta-data repository to store information about providers, such as user-id, password, address, phone, Web address, as well as a mapping from the portal schema to the provider schemas.

In the distributed approach, the meta-data repository is used to find which providers, out of many candidate choices, should be chosen to answer a sub-query. For example, if a user is looking for accommodation in a travel portal, there is no point querying car rental companies. As another example, if a user is looking for a mobile phone to be delivered to his home, the request does not need to go to providers that do not offer delivery.


Figure 2.1: Centralized Portal Architecture

Other information that might be stored about providers in the repository includes category, data model, query language, attributes of tables (in the case of relational databases), operating system, etc. The process of choosing a number of providers out of the existing ones for query processing is referred to as query routing [Liu99]. Effective query routing not only reduces the query response time and the overall processing cost, but also eliminates a lot of unnecessary communication overhead in contacting providers that do not contribute to the answer of the query.

Although the centralized approach provides an effective (fast) scenario for many applications (e.g., an on-line book store), it may fail in other applications where providing fresh data is more important. In applications such as flight booking and travel planning, providing fresh data is important. Managing the content becomes difficult when a large number of changes happen in the provider's database, especially when the provider has relationships with a number of portals. The distributed approach is deemed more appropriate for such applications.


Figure 2.2: Distributed portal architecture

As mentioned earlier, when the portal receives a query, it reformulates the query (e.g., a browse catalog query is broken down into browse sub-queries for different providers) and queries the appropriate providers. Each provider processes the query and returns the result to the portal, e.g., as a SOAP message. The results are then integrated by the portal.

In a portal-based application, providers fall into two categories based on the way they take part in satisfying the queries:

• Complementary: Complementary providers are those who provide different elements of a composite service or product. In travel planner portals, for example, flight, accommodation, and car rental are complementary services. Another example is a computer manufacturing portal where each provider may provide different parts of a computer. Figure 2.3 shows a travel portal with complementary providers.

• Competitor: Competitor providers are those who provide the same service or product.


Figure 2.3: Complementary providers


Figure 2.4: Competitor providers

A computer-selling portal is an example where providers compete with each other in selling their products, such as PCs or printers. They compete with each other either in providing better Quality of Service (QoS), e.g., faster response time through the portal, or a cheaper price. Fast response time through the portal is a QoS property which both the portal and the provider try to provide. Figure 2.4 shows a computer-selling portal with competitor providers.

2.1.2 Enabling Technologies

In early distributed Web applications, much hard-coded programming needed to be done at both the portal and provider sites to establish relationships between the portal and providers, e.g., using Java RMI or CORBA. Emerging technologies such as Web services promise to substantially simplify the development of portal-enabled applications [BC02]. They enable program-to-program communication independent of the hardware and software platform and also of the programming language [Kre01].

A Web service is the interface that enables the use of a network-accessible operation. The operation itself is called a service [Kre01]. Distributed application integration via Web services is enabled by three major technologies:

• Web Services Description Language (WSDL): an XML-based language used to describe the service and its programmatic interfaces.

• Universal Description, Discovery and Integration (UDDI): a mechanism by which a Web service can be described and registered in a registry. It also enables other applications to discover and integrate the Web service.

• Simple Object Access Protocol (SOAP): an XML-based protocol that provides the communication means between Web services and applications. An application can request a service by sending a SOAP request. The result of the service is sent back to the application as a SOAP response.

Any business that provides a Web service can describe and publish its service in a registry on the Web. Any business that is willing to integrate the service into its application finds and binds (invokes) the service. Web services thus provide the means for building loosely coupled distributed Web applications, which can in turn lead towards a highly dynamic marketplace. Requesting a service through Web services can be done using the following steps (a small code sketch follows the list):

• The service requester creates a SOAP request message which identifies the service being requested plus all the input parameters for the service. A request message can also be included in a URL; in this case, all the input parameters are encoded in the URL (i.e., the GET method) or sent through the message body (i.e., the POST method).

• The request message is delivered to the service provider through the network.

• The service provider processes the request and generates the response message, usually a SOAP response message. However, it can be in any format agreed between the service provider and the requester, e.g., any XML message or a text file. The response message is delivered to the service requester.
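As a rough illustration of these steps, the sketch below builds and sends a SOAP request using the SAAJ API (javax.xml.soap), which ships with older Java SE and Java EE platforms. The service endpoint, namespace, operation name, and parameters are invented for the example.

import java.net.URL;
import javax.xml.soap.MessageFactory;
import javax.xml.soap.SOAPBody;
import javax.xml.soap.SOAPConnection;
import javax.xml.soap.SOAPConnectionFactory;
import javax.xml.soap.SOAPElement;
import javax.xml.soap.SOAPMessage;

public class SoapRequestExample {
    public static void main(String[] args) throws Exception {
        // 1. Create a SOAP request message identifying the service and its input parameters.
        SOAPMessage request = MessageFactory.newInstance().createMessage();
        SOAPBody body = request.getSOAPBody();
        // Invented operation and namespace of a hypothetical flight-search service.
        SOAPElement op = body.addChildElement("findFlights", "svc", "http://example.com/travel");
        op.addChildElement("from").addTextNode("SYD");
        op.addChildElement("to").addTextNode("AKL");
        request.saveChanges();

        // 2. Deliver the request message to the service provider through the network.
        SOAPConnection connection = SOAPConnectionFactory.newInstance().createConnection();
        URL endpoint = new URL("http://example.com/travel/service"); // illustrative endpoint
        SOAPMessage response = connection.call(request, endpoint);

        // 3. The provider processes the request and returns a SOAP response message.
        response.writeTo(System.out);
        connection.close();
    }
}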

Although Web services simplify the development of distributed applications on the Web, the same performance issues apply to Web services as to older technologies. On-line query processing in the distributed scenario is more time-consuming than in the centralized approach, where query processing is performed locally. Network traffic between the portal and individual providers, server workload, and failures at provider sites are other contributing factors for slow response time. Therefore, this scenario may fail to provide fast response time to the user.

2.1.3 Performance Issues

In this work, we focus on distributed portals, where providing fast response time is one of the critical issues.

Previous research [Won99, Zon01] shows that dissatisfaction of users dramatically increases with increasing access time. The importance of speed is referred to as the “8 second rule” in this research: if users experience response times of more than 8 seconds, they start thinking of abandoning the Web site and doing their business with competitors, or taking their business off the Internet entirely. It is estimated that about one-third of on-line customers abandon transactions for this reason; this amounts to around $4.35 billion of lost revenue annually.

In recent years we have witnessed an explosive increase in the number of data providers, application providers, and users. We have also witnessed unparalleled growth in the Internet in terms of total bytes transferred and network capacity. In order to exploit these resources, there is a requirement for more efficient query processing techniques.

There are three classes of delays on the Internet that can affect the responsiveness of query processing:

1. Initial Delay: Longer than expected wait time until the first tuple arrives from a remote source

2. Slow Delivery: The data arrives at a fairly constant rate, but at a slower rate than expected

3. Bursty Arrival: The data arrives in a fluctuating manner

In portal applications, query execution can stall if providers experience such delays [UF00]. It is therefore desirable to provide partial results of query execution, based on the results that have arrived so far. One problem with this, however, is dealing with blocking operators, such as max, average, and difference. A non-blocking operator does not have to wait for all the inputs to arrive before it can produce results; operators such as select, project, and intersect are non-blocking. Operators that have to wait for all the inputs to arrive before generating output are blocking; operators such as sort and average are blocking. Some operators, such as join, can be implemented in either a blocking or a non-blocking way. Non-blocking operators can easily be implemented to generate partial results based on the part of the input that has arrived so far. The challenge is to implement blocking operators as if they were non-blocking: at any time, a blocking operator should be able to output its current result, based on the data that has arrived so far on the input stream(s) [STD+00, NDM+00, UF00, IFF+99].
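As a small illustration, the sketch below implements a blocking operator (average) in a non-blocking fashion: it can report its current partial result at any time, using only the tuples that have arrived so far. The class and method names are our own and are not part of the systems cited above.

public class IncrementalAverage {
    private long count = 0;
    private double sum = 0.0;

    // Called as each tuple arrives on the input stream.
    public synchronized void accept(double value) {
        sum += value;
        count++;
    }

    // Can be called at any time to obtain the current (partial) result.
    public synchronized double currentResult() {
        return count == 0 ? 0.0 : sum / count;
    }

    public static void main(String[] args) {
        IncrementalAverage avg = new IncrementalAverage();
        double[] arrivingTuples = {12.0, 7.5, 9.0};   // e.g., prices from a slow provider
        for (double v : arrivingTuples) {
            avg.accept(v);
            // The portal can display this partial value before the stream is complete.
            System.out.println("partial average = " + avg.currentResult());
        }
    }
}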

Caching is another key technique that addresses some of the performance issues faced by today's Web applications. In particular, caching response messages (which we also refer to as dynamic objects or, for short, objects; a dynamic object is a data item requested by the portal, such as the result of a database query, the result of a JSP page, or an XML or SOAP response message) gives portals the ability to respond to some customer requests locally. As a result, response time to the customer is improved, customer satisfaction is increased, and better revenue for the portal and the providers is generated. In addition, network traffic and the workload on the providers' servers are reduced. This in turn improves scalability and reduces hardware costs.

Replication is a technique with a similar purpose to caching.


Replicating the content at different replication servers reduces network bandwidth. The content of a data source (or parts of it) can be duplicated in different geographical regions, and users around each region can access the data from the closest node. Replication can also increase scalability, as it lightens the load on the original server and divides it among different servers.

Although caching and replication have similar advantages, there are some differences between them. Caching is usually done automatically as a result of query processing, while the query is being executed, whereas replication is an off-line process and is usually decided by a system administrator. Replication takes effect at server machines (i.e., data sources) or at other servers managed by or on behalf of the origin server, while caching might also take effect at clients (i.e., query sources). In addition, replication is typically coarse-grained: the whole content of a Web site, a whole database, a whole table, a whole index, or a whole (horizontal) partition of a table or index can be replicated. Caching, on the other hand, is typically fine-grained: individual pages of a Web site, or individual pages of a table or index, can be cached at a client or middle-tier machine [Kos00].

Pre-fetching is a technique that can be accommodated in cache servers. Instead of waiting for a request and then caching its result, cache servers try to predict future accesses to Web pages, pro-actively request them from origin servers, and cache them. Prefetching between clients and proxy servers is an example that can improve performance considerably.

Many clients connect to the Internet through low-bandwidth dial-up connections. Even though the content they are accessing might have been cached at the proxy server, the connection between them is a major bottleneck which results in slow response time. Prefetching the content in a pro-active manner, by predicting the next object(s) to be accessed, can improve performance. This can be achieved either by browsers pulling the content into their browser cache or by proxy servers pushing the content from their cache to the browser. This can be done during the modem's idle time, while the user is viewing or reading the current object [FJCL99]. The same applies to portal applications: portals can prefetch content from providers by predicting future user activity. Obviously, the prediction algorithm plays a major role in the success of prefetching. Poor prediction not only fails to improve performance, but also increases the network load and fills the cache with useless copies of data, which might ultimately degrade performance.

All of the above techniques (replication, caching, prefetching) involve some overhead. If applied naively, they could lead to increased overall cost (i.e., the overhead outweighs the cost benefit derived from the method). Because the rest of this thesis is concerned with caching, we briefly consider here some of the factors taken into account by caching schemes in order to achieve net cost benefits.

Deciding whether to cache a data item depends on the data freshness (QoD) and response time (QoS) constraints, the Web site usage patterns, copyright and security issues, and the particular hardware and software environment [KF00]. For example, caching HTML files can avoid the reconstruction of HTML pages from underlying databases in the case of dynamic HTML pages. This is useful if the data is accessed more frequently than it is updated. If the update frequency is high, this method does not perform well and places a heavy load on the origin or cache server to keep cached HTML files consistent with the actual data.
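The kind of cost-based test described above can be illustrated with the following sketch, which caches a generated page only if the expected saving from serving hits locally exceeds the expected cost of keeping the cached copy consistent. The rates and costs are hypothetical inputs (in practice they would come from server logs), and this is not the cache-worthiness scoring scheme proposed later in this thesis.

// Cache a page only if the expected saving from serving hits locally exceeds
// the expected cost of keeping the cached copy consistent with the database.
public class CachingDecision {

    static boolean worthCaching(double accessesPerHour,
                                double updatesPerHour,
                                double generationCostMs,    // cost to rebuild the page once
                                double invalidationCostMs)  // cost per update to refresh/invalidate
    {
        double expectedSaving   = accessesPerHour * generationCostMs;
        double expectedOverhead = updatesPerHour * (invalidationCostMs + generationCostMs);
        return expectedSaving > expectedOverhead;
    }

    public static void main(String[] args) {
        // Read-mostly page: 500 accesses/hour, 2 updates/hour.
        System.out.println(worthCaching(500, 2, 40, 5));    // true  -> worth caching
        // Rapidly changing page: 50 accesses/hour, 200 updates/hour.
        System.out.println(worthCaching(50, 200, 40, 5));   // false -> not worth caching
    }
}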

2.1.4 Benchmarking

The performance of systems can be measured and compared using a benchmark [JBW99b]. Performance measures include throughput, scalability, response time, etc. There are different benchmarks available, ranging from load-testing systems such as RadView's WebLoad [Rad] to benchmarks specifically developed for e-business applications, such as WebEC [JBW99b, JBW99a], TPC-W [Tra01], ECPerf [Sun02], and its successor SPECjAppServer [Suna].

Workload generators generate representative Web references and hence simulate the behavior of Web applications. Generating a workload can be done using either a trace-based or an analytical approach. In a trace-based approach, a previous workload of the system is used: the recorded workload is replayed several times to generate the desired workload for testing purposes, and the time frame over which the workload is imposed can be varied accordingly. In an analytical approach, a mathematical model of the workload characteristics is used and the workload is generated according to this model [BG98]. The most important parameter in workload generators is the access frequency of individual Web pages. Some research [BCF+99b, BCF+99a] has shown that the relative frequency of accesses to pages of a Web site follows Zipf's law, according to which the probability of access to the i-th most popular page is proportional to 1/i. Further research shows that this distribution actually follows a Zipf-like distribution, 1/i^α, with α normally less than and close to 1. This research also shows that there is only little correlation between the access frequency of a document and its size, and that the correlation between access frequency and update rate is very low to none. Therefore, to model Web accesses one can simply assume a Zipf-like distribution for accesses with no correlation with response size and update rate.
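A minimal sketch of an analytical workload generator along these lines is given below: it draws page ranks from a Zipf-like distribution 1/i^α by inverting the cumulative distribution. The number of pages and the value of α are illustrative parameters.

import java.util.Random;

// Draws page ranks (1 = most popular) from a Zipf-like distribution 1/i^alpha.
public class ZipfWorkloadGenerator {
    private final double[] cumulative;   // cumulative access probabilities
    private final Random rng = new Random();

    public ZipfWorkloadGenerator(int numPages, double alpha) {
        double[] weights = new double[numPages];
        double norm = 0.0;
        for (int i = 1; i <= numPages; i++) {
            weights[i - 1] = 1.0 / Math.pow(i, alpha);
            norm += weights[i - 1];
        }
        cumulative = new double[numPages];
        double running = 0.0;
        for (int i = 0; i < numPages; i++) {
            running += weights[i] / norm;
            cumulative[i] = running;
        }
    }

    // Returns the 1-based rank of the next requested page (inverse-CDF sampling).
    public int nextPage() {
        double u = rng.nextDouble();
        for (int i = 0; i < cumulative.length; i++) {
            if (u <= cumulative[i]) {
                return i + 1;
            }
        }
        return cumulative.length;
    }

    public static void main(String[] args) {
        ZipfWorkloadGenerator gen = new ZipfWorkloadGenerator(1000, 0.9); // alpha close to 1
        for (int r = 0; r < 5; r++) {
            System.out.println("request page #" + gen.nextPage());
        }
    }
}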

For e-business frameworks, a more complicated benchmark is needed to capture the differences in business models. Different business models include intermediary (e-broker), manufacturer, and auction models [JBW99b, JBW99a]. The differences in business models imply that a single benchmark cannot be created for all business applications. TPC-W is a benchmark that simulates the activity of a retail store. The performance metric for this benchmark is the number of Web interactions per second; more specifically, performance is measured as Web Interactions per Second at a tested scale factor (WIPS@scale factor), where the scale factor is the number of items in the table used for items [Tra01]. ECperf [Sun02], which is now called SPECjAppServer [Suna], is a benchmark to measure the scalability and performance of J2EE servers and platforms. It is based on a supply-chain manufacturing model. WebEC [JBW99b, JBW99a] is another benchmark, created for an e-broker business application.

2.2 Web Data Caching: An Overview

A Web cache stores Web resources for future requests. It is located somewhere between the client and the origin content provider. Candidate Web objects for caching include HTML pages, images, audio/video files, XML pages or fragments, query results (e.g., SQL), results of dynamic Web pages (e.g., JSP/Servlet, ASP, PHP), and programs (e.g., Java applets). When the cache server receives a request, it checks the cache to see whether the request can be answered locally or not. If so, the result is sent to the client; otherwise, the request is forwarded to the content provider. A Web cache can result in one or more of the following benefits:

• Reducing network traffic and therefore reducing network costs for both content providers and consumers

• Reducing user-perceived delay

• Reducing load on the Web/application server and the database server for dynamically generated Web pages from back-end databases

• Increasing reliability and availability of Web application servers

• Reducing hardware and support costs

Deploying cache servers close to clients (e.g., browser or proxy caches) reduces network traffic. When a hit is detected, the content can be served to the user from the cached copy. This eliminates the need for receiving the content from the original server, which in turn avoids additional network traffic.

One of the important aspects of caching is that it can reduce user-perceived delay. When the content is served from a shorter distance, users experience less delay. In other words, by answering requests locally, caches hide or reduce network latency. This is a very important aspect of caching in today's Web applications, and directly relevant to the problem of losing e-commerce customers under the “8 second rule” mentioned above.

Caching can also improve performance by decreasing the load on the Web application server or database server. Caching the result of dynamic Web pages, such as JSP, ASP, or PHP pages, on the Web application server reduces the computation cost of generating these pages from the back-end database each time the page is requested. Moreover, caching the results of parameterized queries, such as SQL queries, on the database server and using them for subsequent requests reduces the computation cost. This improves performance when the query execution time is a major cost and page generation needs to access the database through expensive SQL statements [FYVI00, YFVI00, LR00, LR01, GO01].

Using local caches can hide temporary network unavailability during network outages, making the network appear more reliable. This is especially important for delivery of multimedia objects, such as video or audio, where consistent bandwidth and response times are important [BO00]. Moreover, availability can be improved by deploying cache servers in a reverse fashion. A reverse proxy is managed by or on behalf of content providers and improves the scalability of their site. In this case, cache servers improve the availability and fault tolerance of Web servers and the network seems more reliable. During Web server down-time, requests can be answered with cached copies even if the cached copy is not fresh; research shows that most users prefer to be given stale data rather than an error message if the server is down [BO00]. The reverse proxy server can also act as a load balancer if a farm of Web application servers is being used.

Finally, caching data on remote machines reduces the load on the origin server. When a hit occurs, the request can be answered using the cached copy, which would otherwise have to be requested from the origin server. This lightens the load on the origin server, which in turn reduces hardware and support costs.

In the rest of this chapter we discuss Web caching in more detail and consider the major issues relevant to caching.

2.2.1 Cache Hierarchy

Data items can be cached within the following nodes: Web browser, proxy server, Web/application server, server accelerator, database server, database accelerator, transparent proxy, Content Delivery/Distribution Network (CDN) services, and application program. Where a particular data item is cached among the eligible nodes depends both on behavioral information (e.g., the access and update pattern of the original data) and on the processing capability of the given node [YFVI00].

Browser: Caching is a feature supported by nearly all traditional Web browsers in use on PCs today. Most browsers can store static objects, such as images that a user has accessed on the Web, in a directory on the user's hard drive. The browser is configured to allocate a certain amount of hard drive space for this purpose. Browser caching can speed up the rendering of pages that contain cached objects. When a URL is requested, the browser first looks at its cache. Depending on the browser's configuration, if it finds the object, the browser will load it from the cache rather than connecting to the origin server to get a new one. If the object is not available in the cache, the browser retrieves it from the origin Web server and saves it to its local cache for future requests. A browser cache may help speed up the delivery of some static page elements, such as images, but does little to off-load the computation required to construct dynamic pages on the origin Web servers. Another reason browser caching is not very effective is that content providers tend to mark even their statically generated content with special HTTP headers, such as a "Pragma: no-cache" header or an expiration in the past, that render the content not cacheable. Content providers do this because they want to maintain control over their content, which is especially important when that content is changing frequently. It is impossible for content providers to retrieve or check the freshness of cached objects at Web browsers once those objects have been delivered, so content providers are careful about which objects they allow the browser to cache [Ora01b].
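For example, a content provider implemented as a Java servlet might mark a personalized response as non-cacheable (or, alternatively, give a rarely changing response a short lifetime) with standard HTTP headers, roughly as in the sketch below. The servlet name and content are invented.

import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Illustrative servlet showing how a provider controls browser and proxy caching
// of its responses through standard HTTP headers.
public class AccountPageServlet extends HttpServlet {
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        // Personalized content: forbid caching anywhere.
        resp.setHeader("Cache-Control", "no-cache, no-store");
        resp.setHeader("Pragma", "no-cache");        // understood by HTTP/1.0 caches
        resp.setDateHeader("Expires", 0);            // an expiration date in the past

        // Alternatively, rarely changing content could be given a 10-minute lifetime:
        // resp.setHeader("Cache-Control", "max-age=600");
        // resp.setDateHeader("Expires", System.currentTimeMillis() + 600000L);

        resp.setContentType("text/html");
        PrintWriter out = resp.getWriter();
        out.println("<html><body>Account page generated at " + new java.util.Date() + "</body></html>");
    }
}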

Proxy Server: Proxy servers are located between a large number of client machines, such as an ISP's or Intranet's users, and the Internet. Similar to the browser cache, when a request is received, the proxy server checks its cache. If the object is available, it sends the object to the client. If the object is not available, or if it has expired, it requests the object from the origin server and sends it to the client; the object is then stored in the proxy's local cache for future requests. The Squid Web Proxy Cache [Squ] is an example. Web browsers need to be configured to refer to proxy servers. Unlike the browser cache, which deals with only one user, a proxy cache deals with a large number of users. The main problem of browser and proxy caches is that they only deal with static Web pages and do little or nothing about dynamic Web pages. Despite this, proxy caches are still the most common caching strategy for Web pages [FYVI00, YFVI00]. With emerging personalized Web pages and Web databases, Web pages are no longer static. Specifically, in E-Business applications, Web pages are highly dynamic and personalized, which prevents them from being easily cached at a proxy. Caching dynamic Web pages at the proxy server can be enabled by sending and caching some programs, such as Java applets, at the proxy server. These programs generate the dynamic part of some Web pages, while the static part can be provided directly from the cache [LN01, LNK+00, LCD01].

Web/Application Server: The Web/application server is more likely to be able to cache dynamic Web pages. Generating dynamic Web pages (e.g., JSPs/Servlets) puts a lot of workload on Web/application servers and database servers due to processing and generating the dynamic content. Caching the result of such pages can reduce the workload on the Web/application server and back-end database [CLL+01]. On a hit, the Web/application server answers the request using the cache if the entry is still valid. Changes in the back-end database invalidate the relevant Web pages that use the modified data. For this purpose, the Web/application server creates an entry for each cached page in a table called the cache validation table; changes in the back-end database invalidate the relevant entries in the table. When the Web/application server detects a hit, it checks the relevant entry to see whether the page is still valid or not. Current application servers such as BEA WebLogic (http://www.bea.com), IBM WebSphere Application Server (http://www.ibm.com), and Oracle Application Server (http://www.oracle.com) support caching dynamic Web pages. To provide more scalability, the application can be distributed over different Web/application servers.
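The cache validation table just described might be realized along the following lines: a map from cached-page keys to a validity flag, with the data-access layer (or database triggers) invalidating the entries of pages that depend on modified tables. The dependency bookkeeping shown here is a deliberate simplification, not the mechanism of any particular application server.

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArraySet;

// A simplified cache validation table: each cached page has a validity flag,
// and a change to a database table invalidates every page that depends on it.
public class CacheValidationTable {
    private final Map<String, Boolean> valid = new ConcurrentHashMap<String, Boolean>();
    // pageKey -> database tables the page was generated from
    private final Map<String, Set<String>> dependsOn = new ConcurrentHashMap<String, Set<String>>();

    public void register(String pageKey, Set<String> tables) {
        dependsOn.put(pageKey, new CopyOnWriteArraySet<String>(tables));
        valid.put(pageKey, Boolean.TRUE);
    }

    public boolean isValid(String pageKey) {
        return Boolean.TRUE.equals(valid.get(pageKey));
    }

    // Called when the back-end database reports a change to a table.
    public void tableChanged(String table) {
        for (Map.Entry<String, Set<String>> entry : dependsOn.entrySet()) {
            if (entry.getValue().contains(table)) {
                valid.put(entry.getKey(), Boolean.FALSE);   // regenerate on the next hit
            }
        }
    }
}

On a hit, the application server checks isValid(pageKey) and serves the cached page only if it is still marked valid; otherwise the page is regenerated from the database and re-registered.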

Server Accelerator: Cache servers can be deployed in front of the Web/application server. This type of caching solution is known as reverse proxy or server acceleration. Unlike a proxy server, which caches content from an unlimited number of sources, a server accelerator caches content for one or a small number of origin servers. It intercepts requests to the Web/application server and either answers the request (if the result is cached) or forwards the request to the origin server. After a cache miss, the server accelerator caches any cacheable result returned by the origin server and forwards the reply back to the requester. Server accelerators can be used to cache dynamic Web pages as well. Some examples include IBM WebSphere Cache Manager [IBM] and Oracle 9i AS Web Cache [Ora], which promise caching of dynamic and personalized Web pages. They can decrease the processing overhead on the origin Web/application server and the back-end database server, and by decreasing such overheads they increase the throughput of the Web/application server. They also increase the reliability of the Web/application server when the server is down or a crash occurs by serving requests from the cache; in this case, they can even answer requests with out-dated cached copies. Some research suggests that for many applications (such as e-business applications) it is better to serve out-dated results than not to answer the request at all.

Database Server: When dynamic Web pages are generated by querying a back-end database, the results of such queries can be cached at the database server. For example, caching the results of SQL queries on the database server as materialized views and using them for future requests can reduce the computational cost on the database server. Similarly, the results of XML queries could be stored in an XML database.

Database Accelerator: Unlike the database cache, which is deployed at the data server, the database accelerator is deployed at the application server. It accelerates the processing of database queries by caching common data sets. This kind of caching is also known as middle-tier caching. A database accelerator increases performance by reducing the communication between the application server and the database server. It also reduces the load on the back-end database, resulting in more scalability [Ora01a, Tim]. Products such as Oracle Application Server Database Cache [Ora01a] and TimesTen [Tim02] provide this kind of caching.


• User requests content from origin server

• Accelerator receives the request and checks the cache. Two cases may occur:
  MISS:
  – Cache server asks the origin server for the result or missing fragments.
  – Origin server sends the result.
  – Result is sent back to the user and a copy is stored in the cache.
  HIT:
  – The copy in the cache is fresh (by checking Expires in the HTTP header) and is sent to the user, OR
  – Accelerator asks the server for the freshness (using an If-Modified-Since request).
  – Server validates the freshness or sends the fresh result to the accelerator.
  – Result is sent back to the user and a copy is stored in the cache.

Figure 2.5: Server accelerator


Transparent Proxy: Transparent proxy caching eliminates one of the big drawbacks of the proxy server approach: the requirement to configure Web browsers to refer to a specific proxy. Transparent caches work by intercepting HTTP requests and redirecting them to Web cache servers or cache clusters. This style of caching establishes a point at which different kinds of administrative control are possible; for example, deciding how to load-balance requests across multiple caches [BO00].

Content Delivery/Distribution Network (CDN): Caching Web objects has already created a multi-million dollar business: Content Delivery/Distribution Networks (CDNs) [Ora]. Companies such as Akamai [Aka] and Digital Island [Dig] have been providing CDN services for several years. CDN services are designed to deploy cache or replication servers at different geographical locations. The first generation of such services aimed at caching or replicating static Web pages or fragments, such as HTML pages, images, audio and video files, at special servers called “edge servers”. These servers are deployed in different geographical areas all around the world and serve requests or parts of them. Static content is not likely to change frequently and will most likely be requested by other users in the same geographic area. In theory, moving content closer to the end users reduces the network traffic and also shortens response times. By caching frequently accessed content closer to end users, the number of router hops is reduced and data reaches its destination more quickly. Lightening the traffic to/from Web/application servers and database servers leaves more power for them to process and generate dynamic Web pages [Mar00, Ora01b]. Examples of edge servers include Akamai EdgeSuite [Aka] and IBM WebSphere Edge Server [IBM]. They can be used in a reverse or forward set-up. In a reverse set-up, the host name is used for the edge server; therefore, edge servers can intercept the request and decide whether to answer the request from their cache or forward it to the origin server. In a forward set-up, the request first goes to the Web/application server and a decision is made to serve the request based on the cache at the edge server(s). Using Edge Side Includes (ESI) [Edg], the origin server returns a template (rather than the actual Web page) with references to fragments, such as image files, which exist in the edge server.


• User requests content from origin server.
• Server replies to the request with in-line references to fragments included in the edge server.
• Client machine asks the edge server for the fragments.
• Fragments are sent to the client machine and assembled at the client's machine.

Figure 2.6: Edge servers (forward setup)

ESI enables the definition of different cacheability for different fragments of an object. Processing ESI at these servers enables dynamic assembly of objects at edge servers, which otherwise may be done at the server accelerator, proxy server, or browser. Detecting fragments for caching might also be done automatically [RILD05]. Typical customers of CDNs are large Web sites like cnn.com, yahoo.com, and microsoft.com. Note that the caches are not necessarily filled with the most popular content but rather with the content of the Web sites buying the service.

Application Program: Some applications may need a customized caching technique, and the existing caching solutions might therefore be insufficient. Application-level caching is normally enabled by providing a cache API, allowing application writers to explicitly manage the cache to add, delete, and modify cached objects. A system that provides a generic application-level cache is presented in [DIR01].


• User requests content from origin server.
• Edge server receives the request and checks the cache. Two cases may occur:
  MISS:
  – Edge server asks the origin server for the result or missing fragments.
  – Origin server sends the result.
  – Result is sent back to the user and a copy is stored in the cache.
  HIT:
  – The copy in the cache is fresh (by checking Expires in the HTTP header) and is sent to the user, OR
  – Edge server asks the server for the freshness (using If-Modified-Since).
  – Server validates the freshness or sends the fresh result to the edge server.
  – Result is sent back to the user and a copy is stored in the cache.

Figure 2.7: Edge servers (reverse setup)

Object Caching Service for Java (OCS4J) is a caching system used in Oracle9i that enables caching of static and non-static Java objects [Bor04]. OCS4J is also referred to as JSR-107; JSRs (Java Specification Requests) are specifications, in this case of such a caching system, for the Java platform [Sunb]. JCache Open Source is an effort to make an open source version of JSR-107 [Sou04]. JCS (Java Caching System) is an attempt by Apache [Apa] to build a system close to JCache based on JSR-107 [Apa04].

2.2.2 Caching Issues

Despite the advantages of caching, there are some issues which should be considered when applying caching techniques.

• Misleading Statistics: Many e-commerce Web sites need to know the number of hits they get, because their businesses are valued by the number of hits they receive. If the content is delivered by a cache without notifying the origin server, the user statistics collected on the origin server will be misleadingly low. This is one reason for many content providers to make their content non-cacheable, e.g., by specifying a "Pragma: no-cache" header or an expiration date in the past [Ora01b].

• Copyright Protection: Providers of copyrighted material want to make their material available only to those who have arranged to pay a fee, such as pre-payment for an access code, payment by credit card, or in some cases direct electronic payment via e-cash. A potentially serious problem with caching is that copyrighted materials residing in a cache are available for further access, either by the original end-user or by other end-users, and there is no mechanism to enable payment to the copyright holder for such secondary access. Given that proxy cache servers relabel requests for information from the end-user to the server, it becomes difficult for such payment mechanisms to work. There is also a legal issue of whether caching without explicit permission of the copyright owner results in infringement, even in the case that no payment is demanded. At present, the only way of dealing with this situation is for the copyright holder to tag the material as non-cacheable. This prevents it from being served to secondary users, but also effectively prevents it from being cached. As more sites choose to enforce copyright in such a manner, the effectiveness of existing caching schemes will be severely impaired.

• Privacy: A cache implicitly contains records of an individual's Web browsing activities. This is more an issue with local caches, since an ownership relationship exists between cache files and the end-user. At the level of a proxy cache server, such information is lost due to the multiple-access nature which is its reason for existence. Whether the access trail in a local cache is more easily examined than network activity logs or direct packet-level traces is open to some question; however, insofar as caching may facilitate snooping on an end-user's habits and activities, it holds great potential for abuse.

• Security: The use of the Internet for transferring sensitive data, such as personal information or financial transactions, raises challenging security issues. For example, in a university database some data can only be accessed by academics. If these data are cached at a proxy and a student attempts to access them, there should be mechanisms to prevent him/her from doing so. It should be possible to efficiently impose the security policy of data sources on cached data. However, if some data are cached outside the data source, imposing the access policy on cached data will be a challenging issue for data sources. Clearly, caching of pages exchanged in a secure transaction should be protected from secondary unauthorized access. But simply preventing secure transactions from being cached forces the end-user to re-request the data from the content provider. As more transactions are made in a secure manner, caching will lose its effectiveness. One of the solutions to security issues is the use of encrypted sessions using, for example, the Secure Socket Layer (SSL) protocol for Web transactions.

2.2.3 Distributed Cache Management

The performance of individual cache servers increases when they collaborate with each other by answering each other's misses. Protocols such as the Internet Cache Protocol (ICP) [WC97], Summary Cache [FCB00], and the Cache Array Routing Protocol (CARP) [Mic97] enable collaboration between proxy servers to share their contents.

ICP was developed to enable querying other proxies in order to find requested Web objects. When a cache miss occurs, the server sends an ICP message to all its neighboring proxies. The neighbors send back ICP replies indicating a miss or a hit. After receiving the first hit reply, the requesting server asks the relevant neighbor to send the object. In this protocol the number of messages exchanged between proxy servers increases dramatically as the number of servers increases. Therefore, the communication and computation cost of transferring and processing ICP messages may outweigh the benefit of caching itself.

It is desirable that each proxy be able to forward the request to a proxy which can answer the request from its cache, or is at least likely to contain the data in its cache. This can be achieved by keeping some meta-data about the caches maintained at other proxies. When the number of proxies becomes large, it is not possible to keep complete meta-data, and a summary might be kept instead. This summary should be accurate enough to balance the volume of meta-data against the communication overhead of sending and receiving messages [LCD01, PF00, CK00b, FCB00, RBS01, Cha00]. In Summary Cache, each cache server keeps a summary table of the content of the cache at other servers. When a cache miss occurs, the server probes the table to find the missing object in other servers. It then sends a request only to those servers expected to contain the missing object. This protocol involves a trade-off between the storage space required for summary tables and the accuracy of the summaries. In other words, the summaries do not need to be complete and may only represent a subset of the real summary table. Moreover, they can be updated on a periodic basis or when the divergence between the summary tables and the content of the cache exceeds a threshold.
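A sketch of the summary-table idea follows: each proxy keeps a digest of the URLs its neighbours hold and, on a local miss, forwards the request only to neighbours whose digest contains the URL. Here the digest is a plain hash set per neighbour rather than the compact (and possibly lossy) structure a real implementation would use, and the surrounding proxy machinery is omitted.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// On a local miss, consult the per-neighbour summaries and ask only the proxies
// that are expected to hold the object (instead of broadcasting, as ICP does).
public class SummaryDirectory {
    // neighbour name -> summary of the URLs cached there (refreshed periodically)
    private final Map<String, Set<String>> summaries = new HashMap<String, Set<String>>();

    public void updateSummary(String neighbour, Set<String> cachedUrls) {
        summaries.put(neighbour, new HashSet<String>(cachedUrls));
    }

    // Neighbours that probably hold the URL; stale summaries may cause false
    // positives (wasted probes) or false negatives (the request goes to the origin).
    public List<String> candidatesFor(String url) {
        List<String> result = new ArrayList<String>();
        for (Map.Entry<String, Set<String>> entry : summaries.entrySet()) {
            if (entry.getValue().contains(url)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}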

CARP is essentially a routing protocol. In CARP, all proxy servers are included in an array membership list. For each proxy, a hash value is computed from the name of the proxy; a hash value is also computed from the requested URL. These hash values are combined, and the proxy with the highest combined value is determined to be the owner of the URL. Using a deterministic way to choose the proxy that caches a URL, and likewise to find the owner of a cached URL, eliminates the need for sending messages between proxy servers to locate a cached URL. It also eliminates the duplication of cached objects in different proxy servers.
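The deterministic owner computation in CARP can be sketched as follows: combine a hash of each proxy's name with a hash of the requested URL and pick the proxy with the highest combined value. The combining function below is a simple stand-in for the one defined by the actual protocol.

import java.util.Arrays;
import java.util.List;

// CARP-style routing: the owner of a URL is the proxy whose combined
// (proxy-name hash, URL hash) score is highest. Every participant using the
// same member list computes the same owner, so no lookup messages are needed.
public class CarpRouter {
    private final List<String> proxies;

    public CarpRouter(List<String> arrayMembershipList) {
        this.proxies = arrayMembershipList;
    }

    public String ownerOf(String url) {
        String owner = null;
        long best = Long.MIN_VALUE;
        for (String proxy : proxies) {
            // Stand-in combining function; the real protocol defines its own.
            long score = (31L * proxy.hashCode()) ^ url.hashCode();
            if (score > best) {
                best = score;
                owner = proxy;
            }
        }
        return owner;
    }

    public static void main(String[] args) {
        CarpRouter router = new CarpRouter(Arrays.asList("proxy-a", "proxy-b", "proxy-c"));
        System.out.println(router.ownerOf("http://example.com/catalog/item42"));
    }
}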

2.2.4 Cache Coherency

Changes to the original data sources should be effectively propagated to cached copies; in other words, the cached copies should be kept consistent with the original data. This can be achieved either by invalidating or by refreshing the copy of the data in the cache. The decision between these two options should be made using a cost-based approach, based on the usefulness of the data item: if the data item is likely to be accessed frequently and will not change in the near future, it is worth refreshing the cache; otherwise, the copy in the cache should be invalidated. It is also desirable to send only the changes (deltas) to the cache instead of re-sending the whole data set. On receiving a delta, the cache manager updates the data in the cache [MBFS01, MACM01, NACP01, TIH01].

The basic HTTP protocol provides mechanisms for caching which aim at either eliminating the need to send requests to servers or eliminating the need to send full responses by servers. The former reduces the num- ber of network round-trips while the latter reduces network bandwidth usage. A server or client uses caching directives and includes them in Cache-Control header. The two major mechanisms for caching provided by HTTP are: the Expires response header and the If-Modified-Since request header [FGM+99].

• The Expires response header is the simplest mechanism for determining the freshness of a cached object: the object's Expires header is simply compared with the current time (in GMT). This mechanism is also referred to as Time-To-Live (TTL). The cache manager can generally serve objects that have not yet expired without disturbing the Web/application server where the object originated. If content providers know when their content changes, this approach takes full advantage of caching; otherwise, they have to set a short expiry time, or an expiry time in the past to prevent caching altogether, which negates the benefit of caching. This mechanism aims at eliminating the need to send subsequent requests for the same object before it expires.

• The If-Modified-Since request header is another mechanism the cache server uses to manage cache coherency. Before serving a cached copy, the cache server checks with the origin server whether a newer version is available. If a newer version exists, the cache server requests it from the origin server and refreshes its cache as it serves the new object; if the object has not changed, the cache server delivers the cached object. This mechanism aims at eliminating the need to send a full response when the object is still valid, i.e., the server returns only a validation response with status 304 (Not Modified).
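As a simple illustration of how a cache might combine the two mechanisms (the class and field names are assumptions, not part of the HTTP specification): serve unexpired copies directly, and otherwise revalidate with a conditional request that may be answered by a 304 reply.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Instant;

// A cached entry with the freshness information taken from the response headers.
class HttpEntry {
    String url;
    String body;
    Instant expires;       // from the Expires response header
    String lastModified;   // from the Last-Modified response header
}

class HttpCoherency {
    private final HttpClient client = HttpClient.newHttpClient();

    String serve(HttpEntry entry) throws Exception {
        if (entry.expires != null && Instant.now().isBefore(entry.expires)) {
            return entry.body; // still fresh: no contact with the origin
        }
        HttpRequest request = HttpRequest.newBuilder(URI.create(entry.url))
                .header("If-Modified-Since", entry.lastModified)
                .GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() == 304) {
            return entry.body; // validated: cached copy is still current
        }
        entry.body = response.body(); // refreshed copy from the origin
        return entry.body;
    }
}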

Different applications have different coherency requirements. Applications such as stock market data or e-commerce applications need a higher level of coherency than applications such as a telephone directory or the catalog of a bookshop. The former are said to require a strong coherency mechanism while the latter need only a weak coherency mechanism. Applications with strong coherency requirements fall into two groups. In some, such as stock market data, the coherency requirements are strict and cached copies of data must be 100% consistent with the original data: whenever a change occurs on the original data, the cached copies must be updated or invalidated without delay. Theoretically, there will always be some delay in updating or invalidating the cached copies because each message must travel through the communication network; in practice, this delay is generally very short, and so we ignore it for the purpose of our analysis. In other applications, such as weather forecast data, the coherency requirements are not so strict and, for example, 95% consistency can be tolerated. The distinction between these two is helpful in deploying an appropriate coherency mechanism. For example, under strict coherency a pull-based approach based on "polling-every-time" can guarantee 100% freshness, whereas under strong (but not strict) coherency a Time-To-Live (TTL) or Time-To-Refresh (TTR) approach with an appropriate time period might be a good option. The details of these coherency mechanisms are discussed in the following sections.

According to the discussion above, coherency requirements can be distinguished in the following three categories:

• Strict: where cached copies must be coherent with original data at all times.

• Strong: where the coherency between cached copies and original data is important, but the coherency requirement is not as strict as in the former.

• Weak: where stale copies of cached data are acceptable to some extent.

The problem is therefore maintaining the required level of coherency according to the requirements of the application. Maintaining coherency can be achieved either by a pull or push approach.

Pull-based Coherency Mechanism

In the pull approach, the cache server contacts the source to check freshness. If the original data has changed, the cache server either refreshes the cache by requesting the data or invalidates the cached copy by removing it from the cache. This is also called client polling. There are two main methods by which coherency is achieved:

• Polling-Every-Time

• Time-To-Refresh (TTR)

In Polling-Every-Time, whenever a cache hit occurs, the cache server sends an If-Modified-Since (IMS) message to the origin server to check freshness. Although this mechanism provides strong cache consistency, its drawback is the overhead involved in sending an IMS message on every cache hit: the cache server must generate these messages and the origin data server must process them. Moreover, the network overhead of these messages delays responses; even when the cached copy is fresh at the time of the hit, the user still experiences the delay of sending and receiving IMS messages.

According to TTR, each client periodically checks the data source in a time period called Time-To-Refresh (TTR). Smaller TTR pro- vides stronger consistency but increases the number of refresh mes- sages. Larger TTR decreases the number of refresh messages but pro- vides weaker consistency. Adaptive TTR aims at determining the TTR based on change frequency, consistency requirements, network traffic, etc. [RDK+00, DKP+01].

In a pull-based approach, Time-To-Live (TTL) can be used to decrease the number of IMS messages in polling-every-time, or to avoid unnecessary refreshes in TTR. Under TTL, each object (document, image file, etc.) is assigned a TTL value, normally set by the content provider, which indicates approximately when the object will expire: before this time a cached copy of the object can be used, but not after it. If a cache server receives a request after the TTL has expired, the object is requested from the original server. Selecting an appropriate value for the TTL is a challenge. Smaller values provide higher consistency but less effective caching: objects expire sooner, more requests are sent to the original server, more network bandwidth is used, and users experience more latency. Larger TTL values provide better performance but lower cache coherency. Adaptive TTL approaches try to overcome this problem by adapting TTL values to the update frequency of the object [GS96]: an object that has not been modified for a long time tends to stay unchanged for longer than an object that has changed frequently. However, all TTL-based approaches fail to provide strong coherency and can only be used when weak cache coherency is acceptable.

In [GS96] the notion of an object's age is used to set the expiry time. They define an "update threshold" as a percentage of the object's age: if the time since the last validation exceeds the product of the update threshold and the object's age, the object becomes invalid, and on the next request a validation message is sent to the origin to check the object's validity. Piggyback Client Validation (PCV) [KW97] is an algorithm that checks the validity of objects in advance: by communicating with the data server, the cache checks the validity of a list of objects whose TTL has expired. This method reduces the chance of stale objects being served, and batching the validation messages reduces the communication overhead as well.
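A minimal sketch of the update-threshold rule (the names and the threshold value are illustrative):

import java.time.Duration;
import java.time.Instant;

// Update-threshold rule of adaptive TTL: an entry is treated as valid while
// the time since its last validation is below a fixed fraction of its age
// (the time since it was last modified).
class AdaptiveTtl {
    private final double updateThreshold; // e.g. 0.1 means 10% of the age

    AdaptiveTtl(double updateThreshold) { this.updateThreshold = updateThreshold; }

    boolean needsValidation(Instant lastModified, Instant lastValidated, Instant now) {
        Duration age = Duration.between(lastModified, now);
        Duration sinceValidation = Duration.between(lastValidated, now);
        long allowedSeconds = (long) (age.getSeconds() * updateThreshold);
        return sinceValidation.getSeconds() > allowedSeconds;
    }
}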

It is worth mentioning that a pull-based interaction can be done in either a synchronous or an asynchronous mode. In the synchronous mode the client sends a polling request to the server and waits, idle, until the server replies. The asynchronous mode tries to decrease or eliminate the client's waiting time: the client does not have to wait for the result of the poll and can do other work until the result arrives. This can increase the responsiveness of the application; in a Web application, for example, the results can be generated and displayed as they arrive. Technologies such as AJAX [Gar05] enable asynchronous interactions in Web applications.

Push-based Coherency Mechanism

In the push approach, the server takes responsibility for refreshing or invalidating cached copies. Content providers can achieve this by keeping a list of all cached objects and notifying the caches that hold them when the content changes. The invalidation algorithm frees the client cache manager from the burden of sending IMS messages: to ensure strong consistency, if an object A changes on the server, the server must immediately send an invalidation message to all the caches that store a copy of A. Cache managers then need not worry about object validity; as long as no invalidation message has been received, the object is valid. The invalidation approach requires the server to play a major role in the consistency control of cached objects. This can be a significant burden for the Web server, because it has to maintain state for each object that has ever been accessed, as well as a list of the addresses of all the requesters. Such a list is likely to grow very quickly, especially for popular objects, so the server must set aside substantial storage space for these lists. Moreover, an object stored in a cache whose address is kept on the server's list may later be evicted from that cache, either because it is rarely if ever requested again or because the cache manager needs free space for newly-arrived objects; it then no longer makes sense for the server to keep that cache's address on the list. Even worse, when the object is about to change, the server has to send invalidation messages to caches whose addresses are on the list but which no longer keep the object, which adds unnecessary traffic to the network.
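The following sketch (names are illustrative) shows the kind of per-object holder lists such a server-driven invalidation scheme has to maintain; the unbounded growth of these lists is exactly the burden discussed above:

import java.util.*;

// Server-driven invalidation: the origin keeps, per object, the set of caches
// known to hold a copy and notifies them when the object changes.
class InvalidationServer {
    private final Map<String, Set<String>> holders = new HashMap<>();

    void recordDelivery(String objectId, String cacheAddress) {
        holders.computeIfAbsent(objectId, k -> new HashSet<>()).add(cacheAddress);
    }

    // Called when the underlying data changes; in a strict scheme the update
    // would be delayed until every cache acknowledges the invalidation.
    void invalidate(String objectId) {
        for (String cache : holders.getOrDefault(objectId, Set.of())) {
            sendInvalidation(cache, objectId); // network send, stubbed out here
        }
        holders.remove(objectId);
    }

    private void sendInvalidation(String cacheAddress, String objectId) {
        System.out.println("invalidate " + objectId + " at " + cacheAddress);
    }
}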

To enforce a strong cache consistency all invalidation messages should be acknowledged. The server delays the updates until it receives all the acknowledgements from the cache servers to whom an invalidation mes- sage has been sent [RS02, CO02a]. However, unreachability of any cache server causes others to keep waiting and none of them gets updated ver- sion of the object. This results in stale copies of the object being used [YBS99, CO02a].

Multicasting is one modification to the invalidation mechanism that addresses the performance issue faced by the server, as discussed in [RS02]. A multicast group is assigned to each object; each multicast group is a list of cache servers and is maintained by routers. When an update occurs, the server sends a single invalidation message to the corresponding multicast group, and it is the router's job to send individual invalidation messages to all the caches whose names are included in the list [CO02a].

Piggyback Server Invalidation (PSI) [KW98] is a variation of server invalidation, which sends invalidation messages in batch. Sending in- validation messages in batch reduces the server workload compared to sending individual invalidations, as it reduces the number of invalidation messages. However, a strong coherency cannot be guaranteed using this mechanism.

Requirement   Pull                 Push
Strict        Polling-every-time   Invalidation
Strong        TTR                  Invalidation, TTR
Weak          TTL, PCV, TTR        PSI, TTR

Table 2.1: A classification of cache coherency mechanisms

Combining Push and Pull

According to [RDK+00, DKP+01, CO02b, LC98], push and pull ap- proaches have complementary properties regarding to coherency, over- heads (i.e., network, computation, and space), and resilience, as summa- rized in Table 2.2.

• Coherency: In the pull approach based on polling-every-time, strong coherency is guaranteed. In the push approach, when a change happens in the source that affects the freshness of a cached item, the source will either invalidate or refresh the cached copy. In other words, push approach can offer high coherency. In a TTR- based mechanism either on a pull or push-based approach the provided coherency depends on the TTR. Smaller TTR provides stronger consistency but increases the number of refresh messages. Larger TTR decreases the number of refresh messages but pro- vides weaker consistency. Invalidation can provide either a strict or strong coherency depending on how it is implemented and its para- meters are set. Table 2.1 shows a classification of cache coherency mechanisms.

• Network Overheads: A pull-based approach requires two messages per poll: an HTTP request followed by a response. In the TTR-based pull approach, a cache server polls the server based on its estimate of how frequently the data is changing; if the data actually changes at a slower rate, the cache server may poll more frequently than necessary. Hence, a pull-based approach is liable to impose a large load on the network. In the push-based approach, the number of messages transferred over the network equals the number of times the data changes. However, a push-based approach may push to clients who are no longer interested in a piece of information, thereby incurring unnecessary message overheads.

• Computational Overheads: Computational overheads for a pull- based approach result from the need to deal with individual pull requests. After getting a pull request from the cache server, the ori- gin server has to just look up the latest data value and respond. On the other hand, when the server has to push changes to the proxy, for each change that occurs, the server has to check if the coherency requirement for any of the caches has been violated. This compu- tation is directly proportional to the rate of arrival of new data values and the number of unique temporal coherency requirements associated with that data value.

• Space Overheads: A pull-based approach is stateless, meaning that neither the origin server nor the cache server needs to store any information for coherency purposes. In contrast, in a push-based approach the server must maintain information such as the consistency requirement of each client and the latest pushed value, along with the state associated with an open connection. Since this state is maintained for the duration of client connectivity, the number of clients the origin server can handle may be limited when the state space overhead becomes large (resulting in scalability problems).

• Resilience: By virtue of being stateless, a pull-based server is re- silient to failures. In contrast, a push server maintains crucial state information about the needs of its clients; this state is lost when the server fails. Consequently, the client’s coherency requirements will not be met until the cache server detects the failure and re-registers the coherency requirements with the server. Failures can be classi- fied in three groups, each of which has different implications on the behavior of the system.

– Origin Server: In case of server failures, state at the server is lost. Most of the push algorithms require state to be main- tained at the server and hence their correctness may get com- promised in such cases. Cache coherency is not guaranteed until the state is reconstructed at the server.

– Cache Server: Cache server may also fail. An origin server has to allocate resources to each cache server. As resources are valuable, in case of unreachable clients these resources must be reclaimed.

– Communication: Communication failures occur either due to socket failures at any one of the ends, network congestion or network partition. Push-based techniques must employ spe- cial mechanisms to deal with such errors. Otherwise, the state information kept by the server would be incorrect.

• Scalability: Pull servers are generally stateless and hence scalable; cache servers deployed all over the world are pull-based and stateless. A user sends a request and waits for the response. The primary consideration has been to make Web servers scalable. This works for many ordinary applications, but it is less effective for rapidly changing data: when data at a source changes very fast, the cache server generates a large number of requests to keep its cache synchronized with the source, incurring a large overhead in opening and closing connections, and the computational load on the server becomes high because it has to respond to far more requests.

Push servers have complementary characteristics. The server has to keep network connections open and allocate enough buffers to handle each client; with a large number of clients, state space and network resources can soon become bottlenecks and the server may start dropping requests. In short, scalability issues may arise either from excessive server computation and network traffic, or from the state space maintained at the server and the resources (such as sockets) allocated to clients, and there is a clear trade-off between these two constraints.

In summary, the pull approach does not offer high coherency when the data changes rapidly and strong consistency is required. Achieving a high coherency needs a small TTR which in turn increases network traffic and incurs extra workload on the origin server to process these messages. This workload will be significant, if the number of clients is large. On the other hand, the push approach is more likely to offer high coherency for rapidly changing data and/or strict coherency requirements. However, it increases the overhead on the origin data server to produce and send push messages. Moreover, the approach is less resilient to failures as it has to store state information for clients.

        Resilience   Coherency     Overheads (Network & Computation & State Space)
Push    low          high          low & high & high
Pull    high         high or low   high & low & low

Table 2.2: Comparison of push and pull

These properties indicate that a push-based approach is suitable when a client expects its coherency requirements to be satisfied with high fi- delity, or when the communication overheads are a bottleneck. A pull- based approach is better suited to less frequently changing data or for less stringent coherency requirements, and when resilience to failures is important. As is clear from the discussion, neither push nor pull alone is sufficient for efficient dissemination of dynamic data. The complementary properties of the two approaches indicate the need for having an approach which combines the advantages of both while not suffering from any of their disadvantages.

Leases are one attempt to combine push and pull. A lease is like a contract given to a lease holder over some property: whenever a client requests a document from a server, the server returns that document along with a lease, thereby taking responsibility for informing the client of any changes during the lease period. Once a lease expires, the client must contact the server and renew it. The client can use the cached copy while it holds a valid lease over the data item: during the valid lease period the client remains in push mode, and it switches back to pull mode after the lease expires, so the client is served alternately in push and pull modes. When the lease has expired, on the next request for the object the cache manager sends an IMS message to the origin server, and the server either responds with the new version of the object or, if the object has not changed, extends the lease and returns it to the client, and the same rule applies again. It is very important to choose a good lease period. For long lease periods, the client remains in push mode most of the time and scalability problems may arise; for small values, the lease renewal cost may be very high. The trade-off between storage space and control messages depends on the duration of the leases: with a smaller lease duration this approach behaves like a pull-based system, and with a larger lease duration it behaves like a push-based system [DST00].

The Lease algorithm can maintain strong cache coherency while keep- ing servers from indefinitely waiting due to a client failure. If a server can- not contact a client, it delays updating the object until the unreachable client’s lease expires, and from then on it becomes the client’s responsi- bility to contact the server for validation. On the other hand, the Lease algorithm needs to be implemented both at client side and on the server.

As with the TTL algorithm, the lease duration affects the efficiency of the algorithm itself. If the lease value is shorter than the interval between two requests, every subsequent request arrives after the current lease has already expired; in this case, leasing degenerates into polling-every-time, which is far from desirable. This does not mean, however, that a long lease is better: a very long lease forces the server to delay object updates until that lease expires. This problem can be addressed by introducing a "volume lease" in addition to the leases on individual objects [YADL99]. In this approach, each volume lease is assigned to a set of related objects on the same server. In order to use a cached object, a client must hold leases on both the object and the volume it belongs to; the cache manager cannot respond to a user request with a cached object unless both the object lease and the volume lease are valid. The server is free to update an object as soon as either the volume or the object lease on that object has expired. By making object leases long and volume leases short, the server can make object updates without long delays, while the long object leases save the cache manager from having to validate individual objects frequently.
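A minimal sketch of the object-plus-volume lease check (field names are assumptions):

import java.time.Instant;

// A cached copy may be served only while both its own lease and the lease on
// its volume are unexpired; the server may update the object as soon as
// either lease has lapsed.
class LeasedObject {
    Instant objectLeaseExpiry;
    Instant volumeLeaseExpiry;
    String body;

    boolean servableFromCache(Instant now) {
        return now.isBefore(objectLeaseExpiry) && now.isBefore(volumeLeaseExpiry);
    }
}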

Adaptive leases determine the lease duration adaptively based on cur- rent information such as access and update frequencies, amount of avail- able storage space for keeping state information and workload on the original server [DST00].

To combine push and pull, the proxy can operate in pull mode using some TTR algorithm, while the server is in push mode and knows the co- herency requirement [RDK+00]. Using this requirement and proxy access patterns the server tries to predict when a client is going to poll next. If it determines that within this predicted time the client is likely to miss a relevant change, it pushes that change to the client. For predicting the client connection times, the server may run the TTR algorithm in parallel with the client or use some simpler approximation of it. In the ideal case, the coherency offered will be 100%, but due to synchronization problems and other factors, it will be slightly less. But, it will always be much greater than pull. Because of the pull component the resilience of the system will be high. Also, due to the push component, communication overheads will be low. This algorithm has parameters, such as window size for pushing the changes to client, which swing it towards more push or more pull, and thus its performance in terms of fidelity and coherency can be controlled.

Another possibility is to divide the incoming clients at the server into push and pull clients and dynamically switch them from one mode to the other [RDK+00]. If resources are plentiful, every client is given a push connection irrespective of its coherency requirements, which ensures that the best coherency is offered. As more and more clients start requesting the service, resource contention may arise at the server, leading to performance problems; some clients are then shifted to pull mode, so that valuable resources are freed and the system scales properly. Conversely, when resources become available again, high-priority clients can be switched back to push mode, ensuring high coherency. The most important issue is how to assign priorities to different clients. Possible parameters include the access frequency of each client, its temporal coherency requirement, and the available network bandwidth. Clearly no single criterion suffices, but collectively they have the potential to offer high average coherency while keeping the system scalable.

Web Cache Invalidation Protocol (WCIP)

WCIP enables maintaining consistency using invalidations and updates. In server-driven mode, cache servers subscribe to invalidation channels maintained by an invalidation server. The invalidation server sends in- validation messages to channels. These invalidation messages will be re- ceived by cache servers. In client-driven mode cache servers periodically synchronize the objects with the invalidation server. The interval depends on coherency requirements [LCD01].

Best-Effort Synchronization

As mentioned earlier, serving stale (out-of-date) data from a cache can be acceptable for some applications, whether because of the coherency requirements of the application or because of bandwidth or other resource constraints. It is then desirable to minimize the overall divergence between source objects and cached copies by selectively refreshing modified objects, which requires choosing which objects to refresh. The algorithm for selecting the best objects to refresh is called best-effort synchronization. In most approaches, the cache coordinates the process and selects the objects to refresh. A best-effort synchronization scheduling policy is provided in [OW02].

In this approach, the resource limits on cache synchronization may occur at a number of points, including the capacity of the link connecting the cache to the rest of the network (cache-side bandwidth) and the capacity of the link connecting each source to the rest of the network (source-side bandwidth); moreover, all bandwidth capacities may fluctuate over time if traffic is shared with other applications. The technique is claimed to apply more generally to other types of resource limitation, such as limited computational resources at the sources available for cache synchronization due to local processing load, and limited resources at the caches.

2.2.5 Dynamic Content Caching

While static Web pages were sufficient for the first generation of Web sites, they cannot support the requirements of many of today's Web applications, such as e-businesses [Ora01b]. Dynamic Web pages are generated automatically by querying a back-end database and wrapping the result in HTML format; a JSP/Servlet, ASP, or PHP page is usually used to query the database and produce the appropriate HTML file. Regenerating a Web page, even with the same input parameters, may produce a different HTML page, as the underlying data may have changed since the last time the database was queried. While the static part of such a Web page can be cached, the dynamic part may not be easily cached; the resulting Web page can then be assembled by requesting the dynamic part separately [Abe01, Tec01, Chu, Aka, CB00]. To enable caching of dynamic parts, changes to the back-end database must be detected effectively, and these changes should then invalidate or refresh the cached copies [Ora01b, AJL+02, CLL+01, CID99, Dyn]. In what follows, we refer to such content as "dynamic Web pages" or "dynamic objects". Dynamic content can be categorized into two groups:

• The first group includes pages which are assembled on-the-fly based on the user request/preference. These include personalized Web pages for users or user groups containing different images, news pages, etc., such as “My Yahoo”.

• The second group includes pages which are generated dynamically by a program running on a Web server, typically with access to a back-end database (e.g. through JDBC), and formatting the results into HTML or some other format for presentation. Such dynamic pages clearly depend on the current values in the underlying data store. When such underlying data is modified (e.g. the price of a product is changed), a number of pages are typically affected.

The latter is more complicated from coherency point of view. Deter- mining the relevant objects based on the changes in the back-end data- base is more challenging. In this work we are primarily interested in the second group.

Dynamic Web pages (e.g. those produced by JSPs/Servlets) can be cached at a cache server; Web/application servers or server accelerators are particularly well placed to cache such content. For this purpose a look-up table is kept at the Web/application server or server accelerator which stores the URIs of cached pages. When a change (e.g. in the back-end database) affects the freshness of a page, the relevant entry in the look-up table is invalidated. When a cache hit occurs at the Web/application server, the look-up table is probed to see whether the page is still fresh; if it is, the request can be answered from the cache. Similarly, if a hit occurs at a server accelerator, a validation request message is sent to the Web/application server to validate the cached copy. The Web/application server replies positively or negatively: if positive, the object can be served by the accelerator from its cache; otherwise, the Web/application server sends the object to the accelerator.
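A sketch of such a look-up table (the names are illustrative) might look as follows:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Per-URI freshness table kept by a Web/application server or accelerator.
// A back-end change marks the affected entries invalid; a cache hit first
// probes the table before the page is served from the cache.
class FreshnessTable {
    private final Map<String, Boolean> fresh = new ConcurrentHashMap<>();

    void registerCachedPage(String uri)   { fresh.put(uri, Boolean.TRUE); }
    void invalidate(String uri)           { fresh.put(uri, Boolean.FALSE); }
    boolean canServeFromCache(String uri) { return fresh.getOrDefault(uri, Boolean.FALSE); }
}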

There are different solutions to relating changes in the back-end data- base to the freshness of a page in the cache. Triggers are one option that the content provider can use to cause changes on the base data to val- idate (refresh) or invalidate the cached copy. When a limited number of cache servers is involved, it is feasible to make use of such triggers. However, managing and handling triggers involves overhead on the data server, and this overhead increases significantly as the number of cache servers increases.

Using materialized views is another option for queries submitted on base data. When the data is generated by the data server, it is cached as a materialized view either on the data server or the Web/application server and triggers can be defined on these materialized views. The difference with the previous approach is that in this approach the views can be managed by the application (if it runs the same DBMS as data server or a DBMS which can collaborate with the data server for view management).

It provides a more expressive way of defining and using triggers, but also incurs a significant workload for view and trigger management on the database server [AJL+02].

An alternative to using database mechanisms for invalidation is to use the URLs that invoke access to the back-end database. Fine-grained invalidation (e.g., by the exact URL resulting from an HTML form submitted with the GET method) incurs considerable workload but gives more accurate results; coarse-grained invalidation (e.g., invalidating all cached copies sharing a URI prefix) incurs less workload on the data server but is less accurate. An example of coarse-grained invalidation is invalidating all queries on a table whenever anything in that table changes, regardless of whether the change affects the query results. By combining fine-grained and coarse-grained invalidation, it is possible to strike a trade-off between workload and invalidation quality [CAL+02].
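The following sketch (illustrative names) contrasts the two granularities:

import java.util.*;

// URL-based invalidation at two granularities: an exact-URL invalidation
// removes one cached query result, while a prefix-based invalidation drops
// every entry under a URI prefix, trading accuracy for a cheaper decision
// on the data server.
class UrlInvalidator {
    private final Set<String> cachedUrls = new HashSet<>();

    void add(String url) { cachedUrls.add(url); }

    // Fine-grained: only the exact URL is invalidated.
    void invalidateExact(String url) { cachedUrls.remove(url); }

    // Coarse-grained: everything sharing the prefix is invalidated,
    // e.g. all queries against a table whose contents changed.
    void invalidateByPrefix(String prefix) {
        cachedUrls.removeIf(url -> url.startsWith(prefix));
    }
}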

Another option is to use server logs to detect changes and invalidate the relevant entries, as proposed in CachePortal [CLL+01]. CachePortal intercepts and analyzes three kinds of system logs to detect changes to the base data: HTTP request/delivery logs determine the requested page, query instance request/delivery logs determine the query issued on the database for the user request, and database update logs record changes to the data. A sniffer module derives the mapping between query instances and URLs from the HTTP and query instance logs and generates a QI/URL map table; an invalidator module uses the database update logs and invalidates cached copies based on the updates and the QI/URL map table. This approach needs to keep state for all cached data. Moreover, it cannot guarantee 100% freshness of cached data, as the invalidator module may not be able to process the logs and invalidate cached copies in real time. In order to guarantee 100% fresh data where required, it is necessary to use a pull-based approach, or a combination of a pull-based approach with this one, which can adapt itself based on the current server load, storage space, and required coherency.

The use of CachePortal for caching dynamic objects is demonstrated in [LHP+04], where the J2EE reference software, the Java PetStore, has been used as a case study.

The DUP algorithm [DIR01, CIW+00, CID99] uses an object depen- dence graph (ODG) for the dependence between cached objects and the underlying data. The cache architecture is based on a cache manager which manages one or more caches. Application programs use an API to explicitly access caches to add, delete, and update cached objects.

There are also a number of systems and products on the market that support dynamic caching in one way or the other. They will be further discussed in case studies.

2.2.6 Caching Policy and Replacement Strategy

Since it is generally neither possible nor desirable to cache indefinitely every object received by the cache server, strategies are needed to determine: (i) which objects should be added to the cache on arrival (the caching policy), and (ii) which object should be removed if the cache is full and a new object arrives (the replacement strategy).

Caching Policy

The caching policy assists in determining whether it is worthwhile placing a newly-arrived object in the cache. In the case of rapidly changing data, it is not obvious whether objects based on such data should be placed in the cache at all; this clearly depends on the likelihood of subsequent references to the object before its data changes. Also, filling the cache means that the cache replacement algorithm must run more frequently, which imposes an overhead on the cache server. The cost of update propagation or invalidation should also be taken into account. Finally, caching "useless" objects may mean removing useful data from the cache during replacement. Therefore, the decision whether to cache an object should be made by weighing the benefit of caching for future requests against the above costs.

Products such as Oracle Web Cache [Ora01b], IBM WebSphere Edge Server [IBM] , and Dynamai [Dyn] enable system administrators to spec- ify caching policies. This is done primarily by including or excluding ob- jects or object groups (e.g., objects with a common prefix in the URI) to be cached, determining expiry date for cached objects or object groups, etc. Server logs (i.e. web server access log and database update log) are also used to identify objects which are good candidates for caching.

Weave [FYVI00, YFVI00] is a Web site management system which provides a language to specify a customized cache management strategy.

A framework called profile-based caching [BR02] allows clients to express caching preferences for their applications using a set of parameters. The basis of this approach is for the cache to capture the latency-recency trade-off for a particular user or application. The cache then uses the profiles to decide whether to deliver a cached object to the client (if the object might be out of date but the user/application does not have a strong coherency requirement) or to download a fresh object from a remote server (if there is a strong coherency requirement) [BR02].

Cache Replacement Strategy

Traditional cache replacement strategies such as First In First Out (FIFO) and Least Recently Used (LRU) were developed to solve the “page-level caching” problem, and may thus not be suitable for Web objects. Page-level caching was developed in the context of executing programs in virtual memory and involves maintaining a cache of fixed size objects and exploiting the “working set” behavior of programs. Web caching differs from this in dealing with objects that are heterogeneous in type and size and whose access patterns do not necessarily follow a “working set” model. Surveys of Web cache replacement strategies are presented in [PB03, BK04]. Some of the cache replacement strategies targeted at Web objects include:

The recency of Web objects is considered in the replacement strategy of [CI97], which observes that the probability of access to a Web object decreases dramatically as the time since its last access increases.

Least Likely to be Used (LLU) evicts objects which are less likely to be used in future [DDT+02]. This is achieved by mining access logs and extracting association rules between objects.

LRUMIN favors smaller objects in order to minimize the number of objects replaced [AWY99]. If there is no space in the cache for a newly arrived object of size S, an object of size at least S is removed from the cache; among such objects, the least recently used one is evicted. If no such object exists, objects of size at least S/2 are removed in LRU order; otherwise, the same is done for objects of size at least S/4, and so forth, until enough space is made.

The SIZE strategy [AWY99] assigns larger objects a higher priority for removal; if two objects have the same size, the least recently accessed one is removed first. The rationale is that removing a large object leaves room for a number of small objects, and small objects are also believed to be accessed more frequently on the Web (traces of Web access logs show that most requests are for small objects [WA96]).

Size-Adjusted LRU (SLRU) [AWY99] evicts the objects with the lowest cost-to-size ratio, defined as 1/(Si · ∆Tik), where Si is the size of object i and ∆Tik is the number of accesses since the last time object i was accessed. In other words, it sorts the objects in non-descending order of Si · ∆Tik and then greedily purges the objects with the highest values, one by one, until enough space for the incoming object has been created.

Size-Adjusted and Popularity-Aware LRU (LRU-SP) [CK00a] addresses a shortcoming of SLRU, in which objects of similar size are treated equally regardless of their popularity. LRU-SP uses both the size and the popularity of Web objects in the context of LRU. Given that fi is the number of accesses to object i since it was cached, LRU-SP uses the cost-to-size ratio fi/(Si · ∆Tik); in other words, it sorts the objects in non-descending order of Si · ∆Tik/fi and then greedily purges the objects with the highest values.

GreedyDual [You91] takes into account the differing costs of fetching objects into the cache. The original GreedyDual algorithm considers only the situation where all objects have the same size but the costs of fetching them into the cache differ [You91]. In this algorithm, each cached page p is associated with a non-negative value H, initially set to the cost of bringing the object into the cache. When a replacement is needed, the object with the lowest H value (Hmin) is removed and the H values of all objects remaining in the cache are reduced by Hmin. When a page is accessed, its H value is restored to its initial value. By reducing the H values of objects and restoring them when they are accessed, this algorithm integrates locality of access with replacement cost.

To accommodate objects of different sizes, GreedyDual-Size [CI97] extends the GreedyDual algorithm by setting H to the cost-per-size ratio (Cost/Size) of an object, where Cost is the cost of bringing the object into the cache and Size is its size in bytes. The definition of cost depends on the goal of the replacement algorithm: it is set to 1 if the goal is to maximize the hit ratio, to the download time if minimizing user-perceived delay is the goal, and to the total network cost in the general case. A good implementation of GreedyDual-Size avoids k subtractions when an object is replaced, where k is the number of objects in the cache; this is achieved by keeping an "inflation" value equal to the most recent Hmin and offsetting all future settings of H by this value.
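A sketch of GreedyDual-Size with the inflation optimization (illustrative names, not a production implementation):

import java.util.Comparator;
import java.util.PriorityQueue;

// GreedyDual-Size: instead of decreasing every H value on eviction, the
// evicted object's H becomes the inflation value L, and newly inserted or
// re-accessed objects get H = L + cost/size.
class GreedyDualSize {
    static class Entry {
        final String key;
        final double cost;   // e.g. 1 for hit ratio, or the download time
        final double size;
        double h;
        Entry(String key, double cost, double size) {
            this.key = key; this.cost = cost; this.size = size;
        }
    }

    private final PriorityQueue<Entry> queue =
            new PriorityQueue<>(Comparator.comparingDouble(e -> e.h));
    private double inflation = 0.0; // L

    void insertOrTouch(Entry e) {
        queue.remove(e);                    // no-op if the entry is not present
        e.h = inflation + e.cost / e.size;  // restore or initialise its priority
        queue.add(e);
    }

    // Evict the entry with the smallest H and remember it as the new L.
    Entry evict() {
        Entry victim = queue.poll();
        if (victim != null) {
            inflation = victim.h;
        }
        return victim;
    }
}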

Site-based LRU uses the names of Web sites in the cache replacement strategy [WY01]. When a user requests an object from a site, the site name (rather than the object name) is inserted at the top of a list, and the object is stored under the newly inserted site. Only a limited number of a site's objects, specified by Max Ob, are cached; when an object arrives from a site which already has Max Ob objects in the cache, the least recently used object from that site is removed and the newly arrived object is inserted. This prevents caching the unpopular objects of a popular Web site. Each time an object is requested, its site name moves to the top of the site list, and the least recently accessed sites migrate to the bottom. When a replacement is required, the system chooses the sites at the bottom of the list and their associated objects.

2.2.7 Query Rewriting and Caching

There are times when the requested object is not held in the cache server, but parts of the result might be available in the cache. Consider the following example, where clients request dynamic objects:

• When using a search engine, the user submits a request by providing some keywords. The search engine extracts the result set, wraps it in an appropriate format and sends it to the client (along with necessary logos, images, etc.)

• Searching catalog information through a company’s Web site may involve accessing a back-end database. In this case after retrieving the result set, it is wrapped in an appropriate format and sent to the client.

The cache server can store the result of such queries in cache. When a new query is received, the cache server can check to see whether the current query can be answered based on the results of previous queries contained in the cache. Therefore, to answer a query the following cases may occur:

1. The result set has no overlap with the data in the cache. In this case the data should be requested from the origin server.

2. The result set is fully contained in the cache. In this case the query can be answered from the data contained in the cache.

3. The result set is partly contained in the cache. In this case the query is split into two parts: a probe part that is already contained in the cache, and a remainder part which must be obtained from the origin server. The query can be rewritten by the cache server to ask the origin for the remainder part; on receiving the remainder, the result set is assembled and forwarded to the client.
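As a toy illustration of the third case (a single-attribute range predicate with made-up bounds), the probe/remainder split can be sketched as follows:

// The cache holds results for price in [cachedLo, cachedHi); a new query for
// [lo, hi) is answered partly from the cache (probe) and partly by a
// rewritten query sent to the origin (remainder).
class RangeSplit {
    static void split(int cachedLo, int cachedHi, int lo, int hi) {
        int probeLo = Math.max(lo, cachedLo);
        int probeHi = Math.min(hi, cachedHi);
        if (probeHi <= probeLo) {
            System.out.println("no overlap: forward the whole query to the origin");
            return;
        }
        System.out.println("probe (from cache): [" + probeLo + ", " + probeHi + ")");
        if (lo < probeLo) {
            System.out.println("remainder (to origin): [" + lo + ", " + probeLo + ")");
        }
        if (hi > probeHi) {
            System.out.println("remainder (to origin): [" + probeHi + ", " + hi + ")");
        }
    }

    public static void main(String[] args) {
        split(0, 100, 50, 150); // probe [50,100), remainder [100,150)
    }
}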

To achieve the above, some query processing capability (e.g., SQL) must be added to the cache server. This can be achieved by shipping programs which implement these capabilities, as proposed in [CZB98]. For server accelerators, which are close to the business edge and managed directly by the content provider, deploying such programs at the cache server is straightforward. Cache servers at the network edge (CDNs) are normally managed by an ISP or CDN operator, which makes deploying such programs less straightforward.

Moreover, to guarantee the freshness of the results, extensions such as invalidation or TTL-based approaches should be considered to adapt them for caching query results.

When a cache server receives a request consisting of an SQL query, it first considers whether the query can be answered from results already in the cache, and checks their freshness. It then identifies the probe and remainder parts, decomposing the original query into sub-queries over them, and queries the origin server(s) for the remainder(s). After receiving the remainder(s), it assembles the results, sends them back to the user, and stores a copy in the cache.

Finding the probe and remainder parts, i.e., matching against arbitrary SQL, is not easy. However, when querying through a Web interface, users normally do not submit arbitrary SQL queries; instead, they submit a query through a Web form, which corresponds to a simplified SQL query based on a predefined template. In this scenario, matching SQL queries is more straightforward [LN01, CB00, LNK+00, Mar00].

2.2.8 Query Processing & Caching

A portal is normally used to provide a consistent view of data/services from various providers. Data is transferred from the providers to the portal to be processed by query operators executing on the portal. To process query operators such as aggregates and projections, a large amount of data often needs to be transferred over the connection network, even though the result of such operators is normally much smaller than the input data.

Recently, research has been done on flexible and dynamic placement of query operators in distributed database architectures: query operators can be placed at different sites to minimize execution time, communication cost, and/or other parameters. One interesting technique is to process such query operators on or near the providers and transfer only the results to the portal. Transferring smaller amounts of data through the connection network saves network capacity, so users experience faster response times.

Following a cost-based approach, evaluation of data-reducing opera- tors can be pushed to the providers and the evaluation of data-inflating operators to the portal.

• Data-inflating operators are those whose result is larger than the input(s); for example, scaling up an image file, or a join operator whose result is larger than its inputs.

• Data-reducing operators are those whose result is smaller than the input(s); for example, aggregates, or a join operation whose output is smaller than its inputs.

Therefore, data reducing operators are more likely to be executed on or close to providers and data inflating operators are more likely to be executed on or close to the portal. The philosophy behind this scheme is that transferring data is normally the major performance bottleneck in large-scale systems [RR00].

Based on [BKK+01, BKK99] the overall picture is to make it possible to execute virtually any kind of query operator on any machine and any kind of data on the Internet. The idea is to create an open market place for three kinds of suppliers: data providers supply data, function providers offer query operators to process the data, and cycle providers are con- tracted to execute query operators. It is desirable to execute complex queries which involve the execution of operators from multiple function providers at different sites (cycle providers) and the retrieval of data and documents from multiple data sources. When considering caching in such a system there would be a circular dependency between caching and query plan optimization. Operator site selection decisions made for one query have ramifications on the performance of subsequent queries. Therefore, the query optimization process must be extended to take longer-term view of the impact of its decisions.

Cache Investment is a technique introduced in [KF00] which combines data placement and query optimization: it causes the optimizer to invest resources during the execution of one query in order to benefit later queries.

2.2.9 Case Studies

As mentioned earlier, there are a number of systems and products on the market that support dynamic caching in one way or the other. We study the most important ones here:

Oracle 9iAS Web Cache

Oracle 9iAS Web Cache is a product used as a server accelerator with the capability of caching dynamic Web pages [Ora01b]. It enables system administrators to specify cacheability rules using regular expressions; these rules specify whether a particular URL or group of URLs should or should not be cached. Supported objects include static content such as GIF and JPEG images as well as dynamic content from server-side languages such as JSP/Servlets, ASP, PL/SQL Server Pages (PSP), and CGI. In a number of cases content should be declared non-cacheable; examples include update transactions, shopping cart views, and personal account views. If no cacheability rules are specified, Oracle Web Cache behaves like a traditional proxy server and uses HTTP header information for caching purposes.

It also supports caching multiple versions of the same URL: accessing the same URL returns one of the versions depending on which user is accessing the page. For example, an e-commerce application might show different prices (e.g., full or discounted) depending on whether the customer is a first-time, regular, or returning customer. A Web application may use cookies or other HTTP request headers to decide which version to return. E-commerce applications also use cookies to track the click-stream habits of users as they browse a Web site. Some users choose to disable cookies in their browsers because they do not want private information about their browsing habits to be collected. To track such customers, many Web sites embed cookies as parameters in URLs by inserting a sequence number called a session ID into all hyperlinks to other parts of the application. Including such session IDs in every URL makes Web pages unique for each user, so they would normally be considered non-cacheable. In Oracle 9iAS Web Cache such pages can still be cached: it has a string substitution mechanism for handling session IDs embedded in URLs. When it inserts such pages into the cache, it takes the session information out and leaves a placeholder instead; for subsequent requests from a user, it substitutes values for the embedded session IDs based on the request header.

It also enables caching personalized Web pages, such as greeting pages containing, for example, "Welcome, John Citizen".

For this purpose, special SGML comments called “Web Cache tags” are used to identify the personalized attribute information within a page, for example:

<HTML>
...
<!-- WEBCACHETAG="greeting" -->
Welcome to our store, John Citizen
<!-- WEBCACHEEND="greeting" -->
...
</HTML>

Oracle AS Web Cache parses the HTML and caches a generic version of the page, leaving a placeholder for the personalized attributes placed between the Web cache tags.

Some applications such as e-commerce or portal applications need a fine-grained caching solution. Oracle AS Web Cache can break the content down into its building elements and cache elements separately. It takes advantage of the Edge Side Includes (ESI) [Edg] to identify content fragments for caching.

For cache coherency purposes, Oracle AS Web Cache supports expi- ration and message-based invalidation.

In the expiration method an “expiration policy” can be assigned to the cache content. When an object expires, Oracle AS Web Cache marks it as invalid.

In the invalidation method an XML/HTTP invalidation message is sent to the Oracle AS Web Cache host machine. Invalidation messages are HTTP POST requests that carry an XML payload. The contents of the XML message body tells the cache which URLs to mark as invalid.

Edge Side Includes

The Edge Side Includes (ESI) [Edg] mark-up language defined by Oracle and Akamai allows applications to identify content fragments for caching and assembly at the network edges, either at the application edge (i.e. application server or server accelerator), or Internet edge (i.e. CDNs). Assembling dynamic pages from individual page fragments means that only expired or non-cacheable fragments need to be fetched from the origin server, which results in better performance.

Tag                                          Purpose
<esi:include>                                Include a separately cacheable fragment.
<esi:choose>, <esi:when>, <esi:otherwise>    Conditional execution: choose among several different alternatives based on, for example, cookie value or user agent.
<esi:try>, <esi:attempt>, <esi:except>       Specify alternative processing when a request fails (e.g., the origin server is not accessible).
<esi:vars>                                   Permit variable substitution (for environment variables).
<esi:remove>                                 Specify alternative content to be stripped by ESI but displayed by the browser if ESI processing is not done.
<!--esi ...-->                               Specify content to be processed by ESI but hidden from the browser.
<esi:inline>                                 Include a separately cacheable fragment whose body is included in the template.

Table 2.3: Summary of ESI tags

A server-side include is a variable value that a server can include in an HTML file before sending it to the requester. For example, LAST_MODIFIED is one of several environment variables that an operating system can keep track of and that can be accessible to a server program. When writing a Web page, an include statement can be inserted in the file that looks like:

<!--#echo var="LAST_MODIFIED" -->

In this case, the server will obtain the last-modified date for the file and insert it before the HTML file is sent to requesters. A server-side include can be considered as a limited form of Common Gateway Interface (CGI) application. The server simply searches the server-side include file for CGI environment variables, and inserts the variable information in the places in the file where the “include” statements have been inserted. Table 2.3 shows a summary of ESI tags.

Tag                  Purpose
<jesi:include>       Used in a "template" page to indicate to the ESI processor how the fragments are to be assembled (the tag generates the <esi:include> tag).
<jesi:control>       Assign an attribute (e.g., expiration) to templates and fragments.
<jesi:template>      Used to contain the entire content of a JSP container page within its body.
<jesi:fragment>      Encapsulate individual content fragments within a JSP page.
<jesi:codeblock>     Specify that a particular piece of code needs to be executed before any other fragment is executed (a database connection established, user id computed, etc.).
<jesi:invalidate>    Explicitly remove and/or expire selected objects cached in an ESI processor.
<jesi:personalize>   Insert personalized content into a page where the content is placed in cookies and inserted into the page by the ESI processor.

Table 2.4: Summary of JESI tags

Oracle and Akamai have also defined an adaptation of ESI for Java, called Java Edge Side Includes (JESI), which can be used in JSP pages. In other words, it is a specification and custom JSP tag library that can be used by developers to automatically generate ESI code. Table 2.4 summarizes the JESI tags.

Server Side Includes (SSI)

A Web file with the suffix ".shtml" (rather than the usual ".htm") indicates a file that includes some information that will be added "on the fly" by the server before it is sent to the requester. A typical use is to include a "Last modified" date at the bottom of the page. This Hypertext Transfer Protocol facility is referred to as a server-side include. (Although rarely done, the server administrator can designate a file name suffix other than ".shtml" for server-side include files.) A server-side include can be thought of as a limited form of Common Gateway Interface application; in fact, the CGI is not used. The server simply searches the server-side include file for CGI environment variables and inserts the variable information at the places in the file where the "include" statements have been inserted.

When creating a Web site, a good idea is to ask your server ad- ministrator which environment variables can be used and whether the administrator can arrange to set the server up so that these can be han- dled. Your server administrator should usually be able to help you insert the necessary include statements in an HTML file.

JSP Cache Tag Library in BEA WebLogic

One of the JSP tags provided by BEA is the cache tag, which is packaged in the weblogic-tags.jar tag library file. By copying this file to the WEB-INF/lib directory of the Web application, cache tags can be used in JSP files. Using the cache tag enables caching of the tag's body, i.e., the fragment within the tag. The following XML fragment in web.xml enables a Web application to refer to this library:

<taglib>
    <taglib-uri>weblogic-tags.tld</taglib-uri>
    <taglib-location>/WEB-INF/lib/weblogic-tags.jar</taglib-location>
</taglib>

Referencing the tag library in JSP files is done using the taglib directive:

<%@ taglib uri="weblogic-tags.tld" prefix="w1" %>

Attribute   Description
timeout     Specifies the amount of time after which the body of the cache tag is refreshed.
scope       Specifies the scope in which data is cached. Valid scopes include: page, request, session, and application.
key         Specifies additional values for evaluating the condition on which caching is decided.
async       A true value for this attribute denotes updating the cache asynchronously, if possible.
name        Specifies a unique name for the cache. This allows a cache to be shared between different JSP files.
size        Specifies the maximum number of entries in the cache.
vars        Used for input caching, i.e., caching calculated values.
flush       A true value for this attribute causes the cache to be flushed.

Table 2.5: Supported cache tag attributes in BEA WebLogic

Table 2.5 summarizes the supported cache tag attributes in BEA WebLogic.
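As an illustrative sketch only (assuming the w1 prefix declared above, a timeout given in seconds, and a hypothetical categoryId request parameter as the cache key), a JSP fragment could be cached per category as follows:

<w1:cache timeout="600" key="parameter.categoryId">
    <%-- fragment whose generated output is cached for each categoryId value --%>
</w1:cache>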

Invalidation Mechanism in Dynamai

Dynamai [Dyn] from Persistence Software acts as a server accelerator and caches the results of requests for dynamic Web pages. Such pages become invalid when the data on which they were based changes in the underlying database. Two kinds of events may cause this to happen:

• A database update by the application through the Web interface

• A database update by an external event, such as an action by the system administrator or another application

In the first case, incoming requests are monitored, and if they cause an update on the database, the affected pages are invalidated.

In the second case, the invalidation mechanism will need to be pro- grammed in a script file and executed by the administrator to invalidate appropriate Web pages.

Dynamai enables the application developer or system administrator to define dependency and event rules for invalidation purposes. They identify all dependencies between Web pages and all the events that a request may cause which in turn may invalidate other Web pages. Dynamai supports the following request-based dependency/event rules:

• Query string or form parameters (GET or POST action methods)

• Cookies

• Directory based URLs

Cache Directives and API in ASP.NET

ASP.NET provides both page level caching and fragment caching. It also provides application level caching through a cache API that can be used in the application to manually manage the cache [Smi03]. To incorporate page level output caching, an OutputCache directive can be added to the page as follows:

<%@ OutputCache Duration="60" VaryByParam="*" %>

This directive appears at the top of the ASPX page, before any output. Five attributes are supported by this directive, two of which are required and the others optional, as shown in Table 2.6.

Attribute      Required   Description
Duration       Yes        Time, in seconds, the page should be cached. Must be a positive integer.
Location       No         The location where the output should be cached. If specified, it must be one of: Any, Client, Downstream, None, Server, or ServerAndClient.
VaryByParam    Yes        The names of the variables in the Request which should result in separate cache entries. "none" can be used to specify no variation. "*" can be used to create new cache entries for every different set of variables. Separate variables with ";".
VaryByHeader   No         Varies cache entries based on variations in a specified header.
VaryByCustom   No         Allows custom variations to be specified in the global.asax (for example, "browser").

Table 2.6: Supported cache directive attributes in ASP.NET

Most situations can be handled with a combination of the required Duration and VaryByParam options. For instance, consider a product catalog that allows the user to view pages of the catalog based on a categoryID and a page variable. The page could be cached for some period of time (an hour would probably be acceptable unless the products change all the time) with a VaryByParam of "categoryID;page". This creates separate cache entries for every page of the catalog for each category, and each entry persists for one hour from its first request.
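Concretely, such a catalog page might carry the following directive (one hour expressed in seconds):

<%@ OutputCache Duration="3600" VaryByParam="categoryID;page" %>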

VaryByHeader and VaryByCustom are primarily used to allow customization of the cached page's look or content based on the client that is accessing it. For example, the same URL might generate output for both Web browsers and mobile phone clients, and separate versions have to be cached for each. Similarly, a page might be optimized for IE but need to be rendered, perhaps with degraded quality, for Netscape or Mozilla. In order to enable separate cache entries for each browser, VaryByCustom can be set to a value of "browser". This functionality inserts separate cached versions of the page for each browser name and major version.

<%@ OutputCache Duration="60" VaryByParam="None" VaryByCustom="browser" %>

ASP.NET also enables caching different fragments within a Web page using the same syntax as in full page caching. In ASP.NET a Web page is referred to as a Web form (.aspx file) and a fragment within a page is referred to as a user control (.ascx file). The same syntax may be used for caching user controls. All the attributes supported by the OutputCache directive on a Web form are also supported for user controls, except for the Location attribute. There is, however, an extra attribute for user controls called VaryByControl, which caches a separate copy of the user control based on the value of a member of that control, such as a DropDownList. VaryByParam may be omitted if VaryByControl is specified. If a user control is used with the same name among different pages, the Shared="true" parameter enables using the cached version(s) of the user control for all the pages containing that control; by default, each user control on each page is cached separately. The following cache directive:

<%@ OutputCache Duration="60" VaryByParam="*" %>

caches the user control for 60 seconds, and creates a separate cache entry for every variation of input parameters.

Using the following directive:

<%@ OutputCache Duration="60" VaryByParam="none" VaryByControl="CategoryDropDownList" %>

causes the user control to be cached for 60 seconds. It creates a separate cache entry for each different value of the CategoryDropDownList control, and for each page that contains this control.

<%@ OutputCache Duration="60" VaryByParam="none" VaryByCustom="browser" Shared="true" %>

The above cache directive caches the user control for 60 seconds, and creates separate cache entries for each browser name and major version. All pages containing a reference to this user control can share such entries in the cache.

ASP.NET also provides a more flexible means for caching through the Cache object. Using the Cache object, any serializable object can be placed in the cache and its expiration can be controlled using one or more dependencies. Examples of dependencies include: time elapsed since caching the object, time elapsed since the last access to the object, changes to files/folders, etc.

Objects are inserted in the cache as key-value pairs, similar to a Hashtable. To store an object in the cache, the following assignment may be used; it stores the item in the cache without any dependencies. In this case, the object stays in the cache and does not expire unless the cache engine removes it as a result of running its cache replacement strategy.

Cache["key"] = "value";

API methods such as Add() or Insert() may be used to attach specific cache dependencies to an item.
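For example (an illustrative one-liner with a hypothetical file name and variable), an item can be inserted with a file dependency so that it is evicted from the cache whenever the file changes:

Cache.Insert("catalog", catalogData, new CacheDependency(Server.MapPath("catalog.xml")));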

PreLoader from Chutney Technologies

Chutney’s Preloader may be deployed as a server accelerator that sits next to an application server farm and caches Web pages or individual components of Web pages [Abe01, Tec01, Chu].

It enables page-level dynamic content caching, including personalized Web pages such as those that contain personal greetings, for example, "Welcome to our store, John Citizen!". For such pages, the cache stores a generic page with a placeholder for the personalized information, which can be changed on-the-fly by the cache to represent different personalized pages.

Moreover, it enables component-level caching by breaking the content down into its components and caching them separately. It automatically assembles such components when a request is made.

Finally, it provides a variety of invalidation mechanisms, such as TTL settings and database triggers.

2.3 Summary

In this chapter we studied caching as a technique to improve performance in Web applications. Browsers and proxies provide a better user experience by caching static content. Database servers and database accelerators are concerned with caching dynamic data in the form of database query results (e.g., SQL). CDN servers accelerate content delivery by replicating the content at different geographical areas, i.e., network edges. They deal with static content as well as fragment caching and dynamic assembly of content at network edges. Web/application servers

and server accelerators are potentially capable of caching dynamic data in applications with a back-end database. A table is normally used to keep track of cached objects, where each entry in the table represents an object or object group, and changes in the back-end database invalidate the relevant entry(ies).

In existing systems, caching policies are defined by system administrators based on the previous history of available resources, access and update patterns. This is done by including or excluding objects or object groups for caching, setting expiry times, etc.

In portal applications these policies need to be defined and adapted by the system. The portal and providers are normally managed by different organizations, and the administrator of the portal does not normally have the information needed to define such policies for providers. There is a need for caching solutions for Web portals that address such limitations. The rest of this thesis explores the limitations of existing solutions and proposes a collaborative caching strategy for Web portals.

Chapter 3

A Collaborative Caching Strategy for Web Portals

In this chapter we explain our proposed caching strategy in detail. The first section introduces the challenges of caching in Web portals. In the second section, we explain how the portal and providers collaborate to achieve an effective caching strategy. The next section describes the meta-data used to support the approach. How providers process logs and determine "cache-worthy" objects is explained in the following section. We then explain how the portal can deal with heterogeneous caching policies from different providers in order to achieve fairness among providers and/or better performance. We finally discuss the synchronization of meta-data and its effect on performance.

3.1 Caching in Web Portals

To improve performance, response messages from providers can be cached at a Web portal. In a portal that may include complementary

providers, competitor providers, or both, the relationship between portal and providers is not necessarily fixed and pre-established. It can rather be dynamic, where new providers join or existing providers leave the portal. Figure 3.1 briefly represents the general architecture of a Web portal along with its providers. Meta-data is information about providers, recorded when they register their service. As can be seen in the figure, each provider may have a membership relationship with a number of portals. Moreover, each provider may have a number of sub-providers.

Response messages (e.g., SOAP or XML) from providers can be cached at the portal. Caching data from providers can reduce network traffic. It also reduces the workload on the provider's Web/application server and database server by satisfying some requests locally at the portal site. A lower workload on the provider site leaves more processing power for incoming requests, which results in more scalability and reduces hardware costs.

In an environment where providers are complementary, the performance of the portal is limited by the provider with the lowest performance among the providers taking part in providing a composite service. Caching objects from such providers can boost the apparent performance of the slowest providers and therefore the performance of the portal overall. Providing data from a shorter distance also improves the response time to the user. This in turn helps user satisfaction and, finally, revenue for the portal and the providers. For example, if a fast browse session for products is provided, users might be more willing to continue shopping, which may finally lead to buying the product. Moreover, users are more likely to come back to the Web site again if they have a good shopping experience, including fast response time.

Figure 3.1: Caching in portals (clients send requests to the portal, which maintains a cache and meta-data and obtains content from a set of providers)

In an environment where providers are competitors, caching is of even more interest to providers. Assuming a portal which lists the contents from different providers as they arrive, providers with better response time get a better chance of being listed to the user. Given that in many cases users are only interested in "Top N" results, failing to provide fast response time may mean losing the chance of being listed in the result set. This may result in less revenue for the business.

Inherent in the notion of caching are the ideas that we should maintain as many objects as possible in the cache, but that the cache is not large enough to hold all of the objects that we would like. This introduces the notion of choosing better candidates for caching. Caching a particular object at the portal depends on the available storage space, response time (QoS) requirements, and the access and update frequency of objects [KF00]. As mentioned earlier, the best candidates for caching are those that are accessed frequently and do not change often. For rapidly changing or

infrequently accessed objects, it might not be beneficial to cache them at all. Due to the large number of providers and the dynamicity of the environment, identifying objects for caching at the portal, either by a system administrator or from server logs, is impractical.

A caching policy is required to determine which objects should be cached. Products such as Oracle Web Cache [Ora01b], IBM WebSphere Edge Server [IBM], and Dynamai [Dyn] enable system administrators to specify caching policies. This is done mainly by including or excluding objects or object groups (e.g., objects with a common prefix in the URI) to be cached, determining expiry dates for cached objects or object groups, etc. Server logs (i.e., the access log and the database update log) are also used to identify objects to be cached.

Caching dynamic objects at Web portals introduces new problems to which existing techniques cannot be easily adapted. Most importantly, the portal and providers are managed by different organizations and administrators. Therefore, the administrator of the portal does not normally have enough information to determine caching policies for individual providers. Moreover, since the portal may be dealing with a (large) number of providers, determining the best objects for caching manually or by processing logs is impractical. On the one hand, an administrator cannot identify candidate objects in a dynamic environment where providers may join and leave the portal frequently. On the other hand, keeping and processing access logs in the portal is impractical due to high storage space and processing time requirements. Also, database update logs are not normally accessible by the portal.

3.2 Caching Strategy

In current systems, caching policies are defined and tuned by parameters which are set by a system administrator based on the previous history of available resources, access and update patterns [AH00, IFF+99, OLW01, Ora01b, CID99, CLL+01, Dyn]. A more useful infrastructure should provide more powerful means to define and deploy caching policies, preferably with minimal manual intervention. As the owners of the objects, providers are better placed to decide which objects to cache. In this work, we focus only on distributed portals, where providing fast response time is one of the critical issues. The proposal in this thesis involves both the portal and the providers contributing information that allows an effective caching strategy to evolve automatically on the portal.

A caching score (called cache-worthiness) is associated with each object, determined by the provider of that object. The cache-worthiness of an object, a value in the range [0,1], represents the usefulness of caching this object at the portal. A value of zero indicates that it is not worthwhile to cache the object in the portal, while a value of one indicates that it is essential to cache the object in the portal [MBR03, MSB04, MS04]. The cache-worthiness score is sent by the provider to the portal in response to a request from the portal.

Cache-worthiness scores are determined by providers via an off-line process which examines the provider's server logs, calculates scores and then stores the scores in a local table. In calculating cache-worthiness, the providers consider parameters such as access frequency, update frequency, computation/construction cost and delivery cost. A typical cache-worthiness calculation would assign higher scores to objects

with higher access frequency, lower update frequency, higher computation/construction cost, and higher delivery cost. However, each provider can have its own definition of these scores, based on its own policies and priorities. For example, a provider might choose not to process server logs to define the scores. It might, for example, let the system administrator assign zero to non-cacheable objects and some non-zero value to cacheable objects or object groups, as in some existing caching solutions. In this case all the objects with a non-zero cache-worthiness value will be considered cacheable by the portal.

The decision whether to cache an object or not is made by the portal, based on the cache-worthiness scores along with other parameters such as recency of objects, utility of providers, and correlation of objects.

Figure 3.2 and Figure 3.3 briefly show the algorithms used by the portal and providers to enable the caching strategy.

3.3 Meta-data Support

The caching strategy is supported by two major tables: the cache look-up table, used by the portal to keep track of cached objects, and the cache validation table, used by providers to validate/invalidate the objects cached at the portal.

An entry in the cache look-up table consists of Object-Identifier, Cache-Time, and Cache-Worthiness. The Object-Identifier is used to represent the object within the system. For example, the portal can invoke a Web service operation through a URI which includes the name of a Servlet at the provider along with the values for its input parameters. Cache-Time indicates the time the object was cached by the portal; this is used to validate the object when a validation request is sent to the provider. Cache-Worthiness is the cache-worthiness score assigned to the object.

Portal()
  ...
  while (true)
    req = get-user-request();
    case (req.type)
      "Read":
        /** Generate sub-queries Q = (q1, q2, ...) to a subset of providers
            P = (p1, p2, ...); objects o1, o2, ... denote the results of q1, q2, ... */
        Q = generate-sub-queries(req);
        for (qi IN Q)
          if (cached(oi) && not expired(oi))
            send-validation-request(oi.oid);
          else
            send-read-request(qi);
        end for
        for (pi IN P)
          respi = get-response(pi);
          if (respi.type == "Validation" && respi.valid)
            oi = cache(respi.oid);
          else    /** respi.type == "Read", or the cached copy was invalid */
            oi = respi.content;
          endif
        end for
        result = integrate-results(o1, o2, ...);
        send-reply(result);
      "Write":
        /** Generate sub-requests W = (w1, w2, ...) to a subset of providers
            P = (p1, p2, ...) */
        W = generate-sub-requests(req);
        for (wi IN W)
          send-write-request(wi);
        end for
    end case
  end while
end Portal

Figure 3.2: Caching algorithm used by portal

Provider()
  ...
  while (true)
    req = get-request();
    case (req.type)
      "Read":
        o = generate-result(req);
        send-response(o);
      "Write":
        update-database(req);
        invalidate-objects(req.oid);
      "Validation":
        search-cachevalidationtable(req.oid);
        if (found && valid)
          send-validation-response(true);
        else
          o = generate-result(req);
          send-validation-response(false, o);
        end if
    end case
  end while
end Provider

Figure 3.3: Caching algorithm used by providers

An entry in the cache validation table consists of Object-Identifier, Generation-Time, and Validity. The Object-Identifier is the same as the one used in the cache look-up table. Generation-Time records the time the object was generated; it is used to validate the object when a validation request is received from the portal. Changes in the back-end database invalidate entries in the cache validation table: if a change to the content of the database affects the freshness of an object, the appropriate entry in the provider's cache validation table is invalidated by resetting the Validity field for that object. For this purpose we use the incoming (update) requests to invalidate the affected objects. The concept of an Object Dependence Graph (ODG) is used to determine which objects are affected: each update request invalidates a number of objects or object groups, which can be determined from the ODG, and the relevant entries in the cache validation table are set to "invalid" accordingly. More details about the ODG are given in Chapter 4.

Figure 3.4: Meta-data used by the caching strategy (the portal's cache look-up table with Object-Identifier, Cache-Time and Cache-Worthiness, and each provider's cache validation table with Object-Identifier, Generation-Time and Validity, backed by the provider's database)

When a hit is detected at the portal and the object is not yet considered expired, a validation request message is sent to the relevant provider. The message includes the corresponding Object-Identifier and Cache-Time. Upon receiving the validation request, the provider checks the freshness of the object by probing the cache validation table: it checks the Validity value of the entry and compares the Cache-Time of the validation request with the Generation-Time of the entry. The comparison may result in the following cases:

• The object's Cache-Time is greater than or equal to Generation-Time. In other words, the cached copy is the newest version generated by the provider and is therefore fresh, unless changes in the back-end database have made the object invalid, which is detected by checking the Validity field in the table.

• The object’s Cache-Time is smaller than Generation-Time. This means that the copy in the cache is (most likely) stale, as a new version of the object has been generated by the provider after the object was cached. In this case the object will be considered as invalid.

If the object is fresh, the provider confirms this by sending a validation message. If the object is not fresh (as determined by checking Validity or comparing the times), or the relevant entry is not found in the table, a fresh copy of the object is returned to the portal. After receiving the object, the portal responds to the user, and a copy of the object may be cached at the portal for future requests. The algorithm used by the provider to validate an object is shown in Figure 3.5.
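A minimal Java sketch of this freshness test (class and field names are illustrative, not necessarily those used in the test-bed):

class ValidationCheckSketch {
    static class ValidationEntry {
        long generationTime;   // time the object was last generated by the provider
        boolean valid;         // reset when a back-end update affects the object
    }

    /** Returns true if the copy cached at the portal at cacheTime is still fresh. */
    static boolean isFresh(ValidationEntry entry, long cacheTime) {
        if (entry == null) {
            return false;      // entry not found: treated as not fresh (a possible false miss)
        }
        return entry.valid && cacheTime >= entry.generationTime;
    }
}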

Clearly, storing the cache validation table imposes on each provider a space overhead proportional to the number of cacheable objects. It also imposes some computation overhead for detecting changes on the back-end database and invalidating the relevant entries in the table. If a provider has limited resources, it could restrict the table size, maintaining only enough entries to fit the available space. The provider would also need to run a cache replacement strategy to free some space when the size of the table reaches an upper bound. This method puts a bound on the space overhead, but decreases caching performance at the portal, because the provider may replace valid objects in its validation table and subsequently tell the portal that a valid object is invalid (a situation called a false miss).

Provider()
  ...
  while (true)
    req = get-request();
    case (req.type)
      "Read":
        ...
      "Write":
        ...
      "Validation":
        i = find-cachevalidationtable(req);   /** -1 indicates not found */
        if (i >= 0)
          if (cachevalidationtable[i].valid &&
              req.cache-time >= cachevalidationtable[i].generation-time)
            send-validation-response(true);
          else
            o = generate-result(req);
            send-validation-response(false, o);
          end if
        else
          o = generate-result(req);
          send-validation-response(false, o);
        end if
    end case
  end while
end Provider

Figure 3.5: Details of caching algorithm used by provider

This leads to the notion that there is a trade-off in each provider between the space and computation overhead on one hand, and the final performance of the cache on the other. The more storage space and computation power the providers are willing to spend on storing and maintaining the cache validation table, the better the performance of the cache will be. This applies to the performance of the cache for individual providers as well as to the performance of the cache as a whole, as a result of a better hit ratio for individual providers.

3.4 Calculating Cache-worthiness

As mentioned earlier, due to the potentially large number of providers and the dynamicity of the environment, in terms of providers joining and leaving, it is not feasible to identify "cache-worthy" objects on the portal, either by a system administrator or by mining server logs: a human administrator cannot handle frequent changes to the collection of providers; maintaining and processing access logs in the portal imposes too much storage and processing overhead; and database update logs from the providers are typically not accessible to the portal. Providers, as the owners of the objects, are more capable of deciding which objects should be selected. In order to provide effective caching in a distributed, dynamic portal environment, we propose a strategy based on collaboration between the providers and the portal.

The best candidates for caching are objects that are: (i) requested frequently, (ii) not changed frequently, and (iii) expensive to compute or deliver [YFVI00]. For other objects, the caching overheads may outweigh the caching benefits. We use server logs at provider sites to calculate a score for cache-worthiness. In the rest of this chapter we use Oi,m to denote an object i from a provider m. We identify four important parameters:

• Access Frequency is denoted by A(Oi,m, k), and indicates the access frequency of Oi,m through portal k over the period for which the log has been recorded, or since the last time the access frequency of objects was calculated. It is calculated by processing the Web/application server access log and counting the accesses made to each object Oi,m. More frequently accessed objects are better choices for caching.

• Update Frequency is denoted by U(Oi,m), and indicates the number of times Oi,m has been invalidated over the period for which the calculation is being done (i.e., the time the log has been recorded, or since the last time the update frequency was calculated). It is calculated by processing the update log. Objects with lower update frequency are better candidates for caching.

• Computation/Construction Cost is denoted by C(Oi,m), and indicates the cost of generating Oi,m in terms of database accesses and formatting the results. It is calculated by processing the server logs and measuring the time elapsed between the request being sent to the database and the result being ready for delivery. Objects with higher computation cost are better candidates for caching.

• Delivery Cost is represented by the size of the object, and is denoted by S(Oi,m). Larger objects are more expensive to deliver in terms of network bandwidth consumption and elapsed delivery time, and are therefore more appropriate for caching.

The cache-worthiness score is calculated as an aggregation of the above-mentioned parameters. As can be noticed, each of these parameters could have a different range of values and/or different units. To make the parameters comparable, and therefore aggregatable, we standardize each parameter X using

ZX = (X − X̄)/σ(X), where σ²(X) = (1/n) Σi=1..n (Xi − X̄)² = (1/n) Σi=1..n Xi² − X̄².

We use the second form of the variance because it can be computed in a single pass over the input.

The corresponding standard variables are ZA for access, ZU for update, ZC for computation cost, and ZS for size. The resulting standard variables have an average (Z̄) equal to 0 and a standard deviation (σ(Z)) equal to 1.

The above parameters are finally aggregated to generate the value for cache-worthiness. The resulting value is denoted by CW (Oi,m, k). It indicates how useful caching Oi,m at portal k is.

The effect of each term in the cache-worthiness score can be tuned by giving different weights to each term. If network bandwidth is low, giving a heavier weight to the size puts more emphasis on caching larger objects when calculating the scores; conversely, if network bandwidth is high, a lighter weight on the size reduces its influence. Similarly, if the database server is slow or normally under heavy load, so that the computation of (e.g., SQL) queries and the generation of results create a bottleneck, a heavier weight can be given to the computation cost, which results in higher scores for objects with higher computation requirements, and vice versa. Moreover, if data change frequently and/or the coherency requirements are high, giving a heavier weight to the update frequency results in higher scores for less frequently changed objects, and vice versa. Finally, giving a heavier weight to access frequency puts more emphasis on access frequency when calculating the cache-worthiness scores.
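As an illustration, the following Java sketch standardizes the four parameters and combines them into a weighted score. The signs of the weights (update frequency counting against caching) and the final squashing of the aggregate into (0,1) are assumptions made for the example; the thesis does not prescribe this exact aggregation formula.

class CacheWorthinessSketch {

    /** Computes illustrative cache-worthiness scores from per-object log statistics. */
    static double[] cacheWorthiness(double[] access, double[] update,
                                    double[] cost, double[] size,
                                    double wA, double wU, double wC, double wS) {
        double[] zA = standardize(access);
        double[] zU = standardize(update);
        double[] zC = standardize(cost);
        double[] zS = standardize(size);
        double[] score = new double[access.length];
        for (int i = 0; i < score.length; i++) {
            // higher access, cost and size raise the score; higher update frequency lowers it
            double raw = wA * zA[i] - wU * zU[i] + wC * zC[i] + wS * zS[i];
            score[i] = 1.0 / (1.0 + Math.exp(-raw));   // assumed mapping into (0,1)
        }
        return score;
    }

    /** Single-pass standardization: z = (x - mean) / stddev, with the variance
        computed as the mean of squares minus the square of the mean. */
    static double[] standardize(double[] x) {
        double sum = 0, sumSq = 0;
        for (double v : x) {
            sum += v;
            sumSq += v * v;
        }
        double mean = sum / x.length;
        double var = sumSq / x.length - mean * mean;
        double sd = Math.sqrt(Math.max(var, 1e-12));   // guard against zero variance
        double[] z = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            z[i] = (x[i] - mean) / sd;
        }
        return z;
    }
}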

3.5 Other Parameters

There are other parameters that may be used in conjunction with cache-worthiness scores in order to provide a more efficient caching strategy. These parameters include:

• Recency: Recency of an object is defined as a value in [0, 1]. The oldest object in the cache has a recency equal to 0 and the recency of the newest object is defined as 1. More recent objects are more likely to stay in cache and older objects are more likely to be removed when the cache replacement strategy is run. The following formula shows how the recency of an object can be calculated using the time-stamps (TS) assigned to each object:

R(Oi) = (TS(Oi) − TS(OE))/(TS(OK ) − TS(OE))

Where:

OE : The oldest object in the cache

OK : The most recent object

• Utility of Providers: It should be noted that the throughput of the portal is bounded by the throughput of the provider(s) with the lowest performance, whenever the results of such provider(s) cannot be ignored, e.g., when the portal is keen to do business with the provider, the provider normally offers good deals for the customers, the provider pays a commission to the portal, or the provider is needed to satisfy a composite service and there is no other option. Boosting the performance of such providers can increase the performance of the portal as a whole. Therefore, such providers might be given higher utility.

T ≤ min(T1,T2,T3, ...)

Each provider can be given a weight in advance and this weight can be used to favor some providers against others for caching. This weight can be used in conjunction with cache-worthiness scores to boost the performance of some particular providers.

• Correlation of Objects: In a Web application, some objects are normally requested in order. For example, in a travel portal, a browse session for accommodation in a particular region might be followed by a browse session for car rental in the same region. Therefore, it would be worth caching the result of the car rental browsing session if the result of the accommodation browsing session is already cached. The correlation between Oi and Oj shows the rate at which Oj is accessed after Oi, and can be calculated by processing access logs as follows:1

r(Oi,Oj) = (f(Oj : Oi → Oj))/f(Oi)

Where:

f(Oj : Oi → Oj) : Access frequency of Oj when accessed after Oi

f(Oi) : Access frequency of Oi

1The calculation of correlation we use is different from the one used in statistics textbooks. However, the idea of correlation follows the same concept.

The results of the processing are maintained in a correlation matrix, where every cell stores r(Oi,Oj).
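For example, the correlation matrix could be built from a time-ordered access log as in the following Java sketch, which interprets "accessed after Oi" as the immediately following access; the log representation and this interpretation are assumptions for illustration.

import java.util.List;

class CorrelationSketch {

    /** Builds r(Oi, Oj) = f(Oj : Oi -> Oj) / f(Oi) from a time-ordered access log,
        where log.get(t) is the index of the object accessed at step t. */
    static double[][] correlationMatrix(List<Integer> log, int numObjects) {
        long[] freq = new long[numObjects];                  // f(Oi)
        long[][] follow = new long[numObjects][numObjects];  // accesses to Oj right after Oi
        for (int t = 0; t < log.size(); t++) {
            int oi = log.get(t);
            freq[oi]++;
            if (t + 1 < log.size()) {
                follow[oi][log.get(t + 1)]++;
            }
        }
        double[][] r = new double[numObjects][numObjects];
        for (int i = 0; i < numObjects; i++) {
            for (int j = 0; j < numObjects; j++) {
                r[i][j] = (freq[i] == 0) ? 0.0 : (double) follow[i][j] / freq[i];
            }
        }
        return r;
    }
}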

These parameters can be used in conjunction with cache-worthiness scores to provide a more effective caching strategy.

3.6 Regulating Heterogeneous Caching Scores

The fact that it is up to providers to calculate cache-worthiness scores may lead to inconsistencies between them. Although all providers may use the same overall strategy to score their objects, the scores may not be consistent. In the absence of any regulation of cache-worthiness scores, objects from providers who give higher scores get a better chance of being cached, and such providers obtain more cache space than others. This leads to unfair treatment of providers: those who give lower scores get comparatively less cache space, and their performance improvements are expected to be less than those who score higher. It may also result in less effective cache performance as a whole. The following factors contribute to inconsistencies in caching scores among providers:

• Each provider uses a limited number of log entries to extract the required information, and the available log entries may vary from one provider to another

• The value of computation cost (C(Oi,m)) depends on the provider hardware and software platform, workload, etc.

• Providers may use other mechanisms to score their objects (they are not required to use the above approach)

• Malicious providers may claim that all of their own objects should be cached, in the hope of getting more cache space

To achieve a fair and effective caching strategy, the portal should detect these inconsistencies and regulate the scores given by different providers. For this purpose, the portal uses a regulating factor λ(m) for each provider m, applies it to the cache-worthiness scores received from that provider, and uses the result in the calculation of the overall caching scores. This factor has a neutral value in the beginning and is adapted dynamically by monitoring the cache behavior. This is done by tracing false hits and true hits.

A false hit is a cache hit occurring at the portal when the object has already been invalidated. False hits degrade the performance and increase the overheads at both the portal and provider sites, without any benefit. These overheads include probing the cache validation table, generating validation request messages, wasting cache space, and probing the cache look-up table.

A true hit is a cache hit occurring at the portal when the object is still fresh and can be served by the cache. The performance of the cache can only be judged by true hits.

The portal monitors the performance of the cache by tracing false and true hits and dynamically adapts λ(m) for each provider. For those providers with a higher ratio of false hits, the portal downgrades λ(m), so all the cache-worthiness scores from that provider are treated as lower scores, i.e., λ(m) → λ−(m). For those providers with a higher ratio of true hits, the portal upgrades λ(m), i.e., λ(m) → λ+(m), so all the cache-worthiness scores from that provider are treated as being higher than before.

A high false hit ratio for a provider m indicates that the cache space for that particular provider is not well utilized: the cached objects for that provider are not as worthy as they should be, i.e., the provider has given its objects higher cache-worthiness scores than they deserve. This can be resolved by downgrading the scores from that provider and treating them as if they were lower.

Conversely, a high true hit ratio for a provider m indicates that the cache performance for this provider is good, and that provider m is taking good advantage of the cache space. Upgrading the cache-worthiness scores of provider m results in more cache space being assigned to this provider. This ensures fairness in cache usage based on how the cache is utilized by providers. The fair distribution of cache space among providers also results in better cache performance overall. The experimental results confirm this claim.
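A minimal Java sketch of this regulation step; the multiplicative update rule, the 0.5 threshold, and the bounds on λ are assumptions made for illustration, not the exact policy used in the evaluation.

class RegulationSketch {

    /** Adjusts the regulating factor lambda for one provider, based on the
        true and false hits observed for that provider in the last monitoring period. */
    static double adjustLambda(double lambda, long trueHits, long falseHits) {
        long hits = trueHits + falseHits;
        if (hits == 0) {
            return lambda;                            // nothing observed: keep the current factor
        }
        double falseRatio = (double) falseHits / hits;
        if (falseRatio > 0.5) {
            lambda *= 0.9;                            // mostly false hits: downgrade the provider's scores
        } else {
            lambda *= 1.1;                            // mostly true hits: upgrade the provider's scores
        }
        return Math.min(1.0, Math.max(0.1, lambda));  // keep lambda within a bounded range
    }
}

The regulated score used by the portal would then be λ(m) multiplied by the cache-worthiness score reported by the provider.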

3.7 Synchronization of Meta-data

The meta-data used to support the caching strategy consists of two types of tables. These tables store information about the validity of cached objects. As changes happen in the back-end database, some entries may become invalid in the cache validation table maintained by the provider. These changes do not invalidate the cached objects at the portal unless the tables are synchronized with each other; otherwise, the portal has to check the validity of the object each time a hit is detected. Moreover, due to limited storage space, the portal or the providers might not be able to store all the entries in these tables. Having a larger cache validation table incurs more overhead on the provider to monitor changes on the back-end database and invalidate the relevant entries. Therefore, the cache look-up table may become inconsistent with the cache validation table(s). As a result, the performance of the cache can degrade: the inconsistency in the meta-data increases the number of false hits, which in turn degrades performance.

A false hit might occur when a hit is detected at the portal but the provider cannot check the freshness of the object and has to generate the object again, even though the object in the cache might still have been fresh. This happens when the relevant entry does not exist in the cache validation table, for example because it was removed by a cache replacement strategy run to free some space at the provider.

Keeping the meta-data tables coherent reduces these avoidable false hits, which in turn results in better cache performance. The synchronization can be performed in the following ways:

Change-based: When a cached copy becomes invalid as a result of a change in the back-end database, the relevant cache entry is invalidated. This can impose a lot of overhead for generating and processing invalidation messages, but it eliminates the need for sending validation messages each time a hit occurs, while guaranteeing 100% freshness in applications with strong cache coherency requirements. Moreover, fresh objects can be pushed to the portal along with the invalidation message. However, this needs to be considered carefully: the overhead of pushing objects to the portal might outweigh the performance gain if the objects are not chosen properly and non-worthy objects are pushed to the portal. The change-based method works well when the update rate is low and strong coherency is required.

Time-based: The meta-data tables are synchronized periodically. At specific times, invalid objects are removed from the cache. Keeping these tables highly coherent requires storage, computation and communication overhead which may, in turn, defeat the benefit of caching. Maintaining low coherency between these tables leads to reduced cache performance (i.e., more false hits), wasted storage space, etc. To provide an effective caching strategy, these tables should be reasonably coherent: it is desirable to keep the caching performance high while keeping computation, network and storage overheads low. Smaller time periods increase the coherence while increasing the computation and communication overheads, and vice versa. This method can provide strong coherency if the portal uses a polling-every-time method to validate the objects; otherwise, it provides weak coherency.

3.8 Summary

In this chapter we introduced a collaborative caching strategy to be used in Web portals. It relies on a score called cache-worthiness, calculated by providers and sent to the portal along with the objects. A method for calculating such scores using server logs was discussed. We also discussed the issue of consistency between providers in calculating the scores and introduced a regulating method to deal with it. Finally, we discussed the issue of synchronization and gave two alternative synchronization strategies.

Chapter 4

Evaluation and Analysis

In order to evaluate the performance of the collaborative caching strategy, we have built a test-bed. This test-bed enables us to simulate the behavior of a business portal with different numbers of providers, message sizes, response times, update rates, cache sizes, etc. We examine the behavior of the system under a range of different scenarios. Network bandwidth usage, throughput, and average access time are used as the primary performance measures. In the next section we explain the evaluation test-bed in detail and explain how the simulation instances are set up. In the following sections different alternatives in deploying our collaborative caching strategy are examined and the performance is compared with some existing caching strategies.

4.1 Evaluation Test-bed

The test-bed enables us to evaluate the performance of our caching strategies. It also implements existing caching strategies in order to compare their performance with our proposed strategy. The following caching strategies have been implemented in the test-bed: CacheCW (our proposed caching strategy), LRU (Least Recently Used), FIFO (First In First Out), LFU (Least Frequently Used), SIZE, LRU-S, and LRU-SP1. We believe that these strategies provide a good basis for the evaluation, as they include a variety of standard and state-of-the-art strategies. However, any other caching strategy can be implemented and plugged into the test-bed. It is worth mentioning that, according to the literature and also our evaluation results, LRU-SP performs the best for Web objects among all the known strategies. The simulation can also be run without caching (NoCache) and the performance results used as a baseline.

The simulation can implement a variety of scenarios by using different values for a set of system parameters stored in a configuration file. The most important parameters that we have used in the experiments include: (i) the average number of providers, letting providers join or leave while keeping an average number of providers during the simulation, (ii) the average number of providers involved in processing each request, (iii) the average number of objects per provider, (iv) the ratio of updates for each provider, to study the behavior of the caching strategy based on the update frequency (e.g., a higher update ratio represents a business portal where many users' search sessions result in buying products; similarly, a lower update ratio can represent a business portal where users' search sessions are rarely followed by a purchase), (v) network bandwidth, (vi) the average number of invalidated objects per update in the back-end database, (vii) the size of response messages, to define different message sizes for each provider, (viii) the computation/generation cost of objects, to simulate providers with different response times, (ix) the average number of online users, and (x) the cache size.

1Please refer to Section 2 for the detailed description of such strategies.

Figure 4.1: Architecture of the test-bed (UserSimulators communicate with the PortalSimulator, which holds the cache, through a synchronization mechanism, and the PortalSimulator communicates with the ProviderSimulators through another synchronization mechanism)

Figure 4.1 shows the architecture of the test-bed. PortalSimulator is the main module of the test-bed and implements the portal. UserSimulator0, UserSimulator1, etc. simulate the behavior of online users. User requests to objects are generated based on a Zipf-like distribution. UserSimulator(s) send requests to PortalSimulator and wait for the results. The average access time (i.e., the time between a user sending the request and receiving the result) is measured and used as a performance measure. ProviderSimulator0, ProviderSimulator1, etc., simulate providers. Upon receiving a request from PortalSimulator (i.e., read, write, or validation), they process the request and return the results to PortalSimulator.
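The Zipf-like request stream can be generated as in the following Java sketch; the skew parameter and the ranking of objects by popularity are assumptions for illustration rather than the exact generator used in the test-bed.

import java.util.Random;

class ZipfRequestSketch {
    private final double[] cdf;             // cumulative probabilities over object ranks
    private final Random rnd = new Random();

    /** numObjects objects ranked by popularity; alpha close to 1 gives a classic Zipf skew. */
    ZipfRequestSketch(int numObjects, double alpha) {
        double[] weights = new double[numObjects];
        double total = 0;
        for (int rank = 1; rank <= numObjects; rank++) {
            weights[rank - 1] = 1.0 / Math.pow(rank, alpha);
            total += weights[rank - 1];
        }
        cdf = new double[numObjects];
        double acc = 0;
        for (int i = 0; i < numObjects; i++) {
            acc += weights[i] / total;
            cdf[i] = acc;
        }
    }

    /** Returns the (0-based) rank of the next requested object. */
    int nextObject() {
        double u = rnd.nextDouble();
        for (int i = 0; i < cdf.length; i++) {
            if (u <= cdf[i]) {
                return i;
            }
        }
        return cdf.length - 1;
    }
}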

The communication between the components (i.e., UserSimulator(s), PortalSimulator, and ProviderSimulator(s)) is enabled and synchronized using cubbyholes.

As shown in Figure 4.2, UserCubbyHole.java implements the synchronization mechanism between UserSimulator(s) and PortalSimulator. It provides methods for sending and receiving requests and response messages. The putRequest() method is used by UserSimulator to send a request to PortalSimulator, and getResponse() is used to receive the result. The getRequest() and putResponse() methods are used by PortalSimulator to get the requests and return the results to UserSimulator.

Two cubbyholes are used for the synchronization between PortalSimulator and ProviderSimulator(s), one for the requests and the other for the responses, as shown in Figure 4.3 and Figure 4.4. PortalSimulator uses the putRequest() and getResponse() methods to send requests and receive responses. ProviderSimulator uses getRequest() and putResponse() to receive the requests and send the responses.

As can be noticed, there are two kinds of parameters in the system. The first group of parameters are defined for the system as a whole, e.g., the number of providers, the number of online users, the cache size, etc. The second group are those defined per provider/object, e.g., the dependence between underlying data and objects (which objects are invalidated upon a change in the back-end database), the size of objects, the computation cost of objects, etc. For the second group of parameters, we provide mechanisms to set them effectively.

In the rest of this section we explain how the dependence between objects and the underlying data is implemented. We also show how the size of objects and their computation/generation cost can be set.

package mypackage;

import java.util.*;

public class UserCubbyHole {
    public ArrayList eList = new ArrayList();
    private String respondedThreadName;

    public class EventItem {
        public String threadName;
        public long time;
        public char operation;
        public long getReqTime;
    }

    public synchronized void putRequest(String thrName, long time,
                                        char operation, long getReqTime) {
        EventItem eItem = new EventItem();
        eItem.threadName = thrName;
        eItem.time = time;
        eItem.operation = operation;
        eList.add(eItem);
        notifyAll();
    }

    public synchronized EventItem getRequest() {
        while (eList.size() == 0) {
            try { wait(); } catch (InterruptedException e) { }
        }
        EventItem eItem = (EventItem) eList.remove(0);
        eItem.getReqTime = System.currentTimeMillis();
        return eItem;
    }

    public synchronized void putResponse(String threadName) {
        while (respondedThreadName != null) {
            try { wait(); } catch (InterruptedException e) { }
        }
        respondedThreadName = threadName;
        notifyAll();
    }

    public synchronized void getResponse(String threadName) {
        while (!threadName.equalsIgnoreCase(respondedThreadName)) {
            try { wait(); } catch (InterruptedException e) { }
        }
        respondedThreadName = null;
        notifyAll();
    }
}

Figure 4.2: Synchronization between UserSimulators and PortalSimulator

package mypackage;

import java.util.*;

public class RequestCubbyHole {
    private boolean available = false;
    private RequestItem reqItem = new RequestItem();

    public synchronized void putRequest(char o, int pNo, int oNo, long t) {
        while (available == true) {
            try { wait(); } catch (InterruptedException e) { }
        }
        reqItem.op = o;
        reqItem.pNo = pNo;
        reqItem.oNo = oNo;
        reqItem.cacheTime = t;
        available = true;
        notifyAll();
    }

    public synchronized RequestItem getRequest() {
        while (available == false) {
            try { wait(); } catch (InterruptedException e) { }
        }
        available = false;
        notifyAll();
        return reqItem;
    }
}

Figure 4.3: Synchronization of request items

4.1.1 Dependence Between Objects

As mentioned earlier, one way to determine invalid objects is to monitor update requests to the back-end database and invalidate the affected entries in the cache validation table. In order to determine the invalid objects, we model the dependencies between underlying data and objects. The Object Dependence Graph (ODG), as described in the literature, is one way of modeling this. It should be mentioned that there are a number of ways to invalidate objects in the cache validation table; however, this choice does not affect our study, and some analysis from the literature has already been presented in Section 2.2.5.

package mypackage;

import java.util.*;

public class ResponseCubbyHole {
    private boolean available = false;
    private ResponseItem respItem = new ResponseItem();

    public synchronized void putResponse(char o, int pNo, int oNo, long s,
                                         double c, long gT, boolean sts) {
        while (available == true) {
            try { wait(); } catch (InterruptedException e) { }
        }
        respItem.op = o;
        respItem.pNo = pNo;
        respItem.oNo = oNo;
        respItem.size = s;
        respItem.cw = c;
        respItem.generationTime = gT;
        respItem.status = sts;
        available = true;
        notifyAll();
    }

    public synchronized ResponseItem getResponse() {
        while (available == false) {
            try { wait(); } catch (InterruptedException e) { }
        }
        available = false;
        notifyAll();
        return respItem;
    }

    public synchronized boolean responseReady() {
        return available;
    }
}

Figure 4.4: Synchronization of response items

Based on the ODG methodology, the dependencies between objects and their underlying data are modelled using a DAG (Directed Acyclic Graph) G = (V,E), where V = {U, O} is the set of underlying data and Web objects, and E ⊆ U × O is a set of relations. Each element in E indicates which Web object (Oi) will be invalidated when a change occurs on data object Ui. Figure 4.5 shows an ODG with V = {U0,U1,O0,O1,O2,O3} (U = {U0,U1}, O = {O0,O1,O2,O3}) and E = {(U0,O0), (U0,O1), (U1,O0), (U1,O2), (U1,O3)}. Modification of the underlying data U0 invalidates objects O0 and O1, and modification of U1 invalidates objects O0, O2, and O3. In other words, the graph shows which entries in the cache validation table are invalidated as a result of a change in the back-end database.

Figure 4.5: Object Dependence Graph (ODG) with Web objects O0, O1, O2, O3 depending on underlying data objects U0 and U1

To implement the ODG model in our test-bed, we use one input file for each provider. This file shows which objects will be invalidated based on the modifications. Therefore, real applications could be simulated more closely by modifying these files.

Figure 4.6 shows a sample input file. The file shows a set of dependencies between Web objects (O) and underlying database objects (U). Each database object is written alongside all of the Web objects that are affected when the database object changes. Note that the sample shown is much smaller than the actual files used in the simulations, and is primarily for explanation purposes. It should also be noted that the system is able to generate such a file randomly, based on an input parameter that specifies what percentage of objects is invalidated per write request (WriteFanOut). Once the input file is generated, it can be manually modified to achieve the desired dependence between objects. As can be seen in the figure, each underlying database object is associated (via "->") with a set of Web objects.

InputDependency000.txt
U0->O3,O5,O10-O35,O70-O80,O95;
U1->O0-O8,O20,O25,O35-O42;
U2->O15-O30,O40-O58,O70,O80-O87;
U3->O2-O7,O12-O20,O38,O64-O84;
...

Figure 4.6: Input file to define object dependence

InputSize000.txt
O0-O1000000->32000;
O0-O10->27000;
O18->37000;
O55-O85->45000;
O90-O110->18000;
O125-O131->29000;
...

Figure 4.7: Input file to define size of objects

The set of Web objects can be defined either by enumerating the objects one-by-one, separated by commas (e.g., O1,O2,O3), by specifying a range of objects (via the hyphen, as in O35-O42), or by some combination of these. Each dependency is terminated by a semi-colon.
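Such a dependency file could be read with a small routine like the following Java sketch; the file handling and in-memory representation are assumptions for illustration and are not taken from the test-bed sources.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.*;

class DependencyFileSketch {

    /** Maps each underlying data object (e.g. "U0") to the Web objects it invalidates. */
    static Map<String, Set<String>> parse(String fileName) throws IOException {
        Map<String, Set<String>> deps = new HashMap<String, Set<String>>();
        BufferedReader in = new BufferedReader(new FileReader(fileName));
        String line;
        while ((line = in.readLine()) != null) {
            line = line.trim();
            if (!line.contains("->")) {
                continue;                            // skip the header line and blanks
            }
            String[] parts = line.replace(";", "").split("->");
            Set<String> objects = new HashSet<String>();
            for (String item : parts[1].split(",")) {
                if (item.contains("-")) {            // a range such as O10-O35
                    String[] range = item.split("-");
                    int from = Integer.parseInt(range[0].substring(1));
                    int to = Integer.parseInt(range[1].substring(1));
                    for (int k = from; k <= to; k++) {
                        objects.add("O" + k);
                    }
                } else {                             // a single object such as O3
                    objects.add(item);
                }
            }
            deps.put(parts[0], objects);
        }
        in.close();
        return deps;
    }
}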

4.1.2 Size and Computation Cost

The simulation system also uses input files for defining the size and computation cost of objects. One file defines object sizes and the other defines computation costs. The syntax is similar to that of the object dependency file, except that the object sets occur on the left of the "->". The right-hand side gives the size or cost of all of the objects on the left-hand side of the "->". Figure 4.7 and Figure 4.8 show examples.

Size is represented in bytes and computation cost in milliseconds. For example, the second line in Figure 4.7 sets the size of objects O0 through O10 to 27000 bytes.

InputComputation000.txt
O0-O1000000->750;
O3-O12->520;
O18-O22->450;
O24->640;
O28-O32->380;
O35-O43->580;
...

Figure 4.8: Input file to define computation cost of objects

Similarly, the second line in Figure 4.8 sets the computation cost for objects O3 through O12 to 520 milliseconds. The first line in both files sets a default value for all objects that are not mentioned in subsequent lines.

As for the dependence files, the test-bed is able to generate values for size and computation cost randomly based on some input parameters, and the files can be manually modified afterwards. These parameters give a range of minimum and maximum values for size and computation cost (i.e., minObjectSize and maxObjectSize for size, and minObjectGenerationTime and maxObjectGenerationTime for computation cost).

These files allow us to set up a simulation to model certain kinds of real-life systems. For example, a slow system with large pages could be modelled by setting (comparatively) large values for computation costs and (comparatively) large values for object sizes.

4.1.3 Cache-worthiness Scores

As mentioned in Section 3.4, the cache-worthiness scores are calculated as the aggregation of: (i) access frequency, calculated by processing the access log, (ii) update frequency, calculated by processing the database update log, (iii) computation cost, calculated by processing the database request/delivery log, and (iv) delivery cost, measured by the size of the objects. Each term is associated with a weight in the aggregation formula. The value for each weight is set according to the system requirements and/or characteristics.

In order to determine the best value for each weight, we run the system using possible values for each one and compare the performance results. The values which result in the best performance are then selected. However, these values are based on the current system characteristics and/or requirements; when the situation changes, the best value for each term will also change and should be calculated again. For each set of experimental results presented in the following sections, we have run the system on all possible2 values for each weight and used the values that result in the best performance.

4.2 Evaluation Results

In this section we present the performance results of the collaborative strategy. We compare its performance in terms of throughput, network bandwidth usage, and average access time with other caching strategies. The following caching strategies have been used: LRU, FIFO, LFU, SIZE, LRU-S, and LRU-SP. It should be mentioned that, among all the strategies in the literature, LRU-SP is the most recently introduced strategy for Web objects and was expected to perform the best. The results of the evaluation confirm this; however, our caching strategy outperforms LRU-SP. We also use NoCache (i.e., without caching) as a baseline.

2In order to limit the number of experiments, we have run these experiments on discrete values for each weight with a difference of 0.1.

4.2.1 Throughput

In the first experiment, we study the throughput, measured as the number of performed requests per minute. We vary the average size of objects and run each simulation instance for 12 hours with a cache size equal to 20% of the size that would accommodate all the objects. The effect of cache size on throughput is examined in subsequent experiments. Figure 4.9 shows the results for the different caching strategies, i.e., CacheCW, LRU, FIFO, LFU, SIZE, LRU-S, LRU-SP, and NoCache.

The throughput of NoCache, as shown in the figure, stays almost constant at about 293 for 16 KB, 224 for 32 KB, and 185 for 64 KB objects. CacheCW shows the best results, with 404 for 16 KB, 311 for 32 KB, and 261 for 64 KB objects; the improvement in throughput compared to NoCache is 1.38, 1.39, and 1.41, respectively. LRU-SP shows the second-best results, as expected, with throughput values of 334, 255, and 211. These are followed by LRU-S, LRU, LFU, FIFO, and SIZE, respectively. The order is consistent for all three object sizes used in the experiment.

All caching strategies show faster growth in throughput at the beginning, which slowly stabilizes at a peak. The fast increase at the beginning occurs while the cache is being populated; once the cache is populated with mostly cache-worthy objects, the throughput stabilizes.

Based on the results, CacheCW outperforms all the other caching strategies by a factor greater than 1.2, 1.22, and 1.24 for the three object sizes.

Figure 4.9: Throughput. Throughput (req/min) over time (min) for NoCache, CacheCW, LRU-SP, LRU-S, LRU, LFU, FIFO, and SIZE. (a) Average object size: 16 KB; (b) Average object size: 32 KB; (c) Average object size: 64 KB.

As can be noticed, the improvement in throughput increases with object size. This is mainly because smaller objects incur relatively more overhead for generating and exchanging validation messages: more validation messages are exchanged, and more overhead is involved, relative to the amount of data transferred. For very small objects, the caching overhead might even outweigh the benefit; however, this is not a likely situation in real applications, as the average object size is estimated to be on the order of 8 KB.

One of the reasons why CacheCW outperforms all the other strategies is that none of those caching strategies is particularly designed for dynamic objects; in particular, the update frequency and generation time of objects are not considered in any of them. Moreover, they do not address the issue of identifying good candidates for caching: it is assumed that all the objects are cacheable and therefore all the objects get cached, while CacheCW dynamically identifies caching policies by disregarding those objects with a cache-worthiness score equal to zero.
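A minimal sketch of such a score-driven caching decision is shown below: zero-scored objects are never admitted, and the lowest-scored cached object is evicted when space is needed. The data structures are illustrative only and are not the portal's actual cache implementation.

import java.util.HashMap;
import java.util.Map;

// Sketch of score-based admission and eviction.
public class ScoreBasedCache {
    private final Map<String, Double> scores = new HashMap<>(); // objectId -> cache-worthiness
    private final int capacity;

    public ScoreBasedCache(int capacity) { this.capacity = capacity; }

    public void admit(String objectId, double cacheWorthiness) {
        if (cacheWorthiness <= 0.0) return;              // zero-scored objects are never cached
        if (!scores.containsKey(objectId) && scores.size() >= capacity) {
            // evict the currently cached object with the lowest score
            String victim = null;
            double lowest = Double.MAX_VALUE;
            for (Map.Entry<String, Double> e : scores.entrySet()) {
                if (e.getValue() < lowest) { lowest = e.getValue(); victim = e.getKey(); }
            }
            if (victim != null) scores.remove(victim);
        }
        scores.put(objectId, cacheWorthiness);
    }
}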

To set up a worst-case, or upper-bound, scenario for CacheCW compared to the other strategies, we run the same simulation with one modification: this time, all the caching strategies cache only those objects with a non-zero cache-worthiness score.³ In practice this situation could only arise if a system administrator defined such caching policies by including or excluding objects or object groups for caching. Identifying individual objects for caching would not be possible in practice, unless object groups are considered instead of individual objects. In other words, the considered situation cannot happen in practice.

³ In this scenario the performance of CacheCW will be the same as before and only the other caching strategies perform better.

Therefore, it is a valid case for setting up a worst-case (upper-bound) scenario for CacheCW. The results for NoCache and CacheCW stay the same, but the other strategies are expected to show better throughput. The results are shown in Figure 4.10.

As expected, all the other caching strategies show better results. LRU-SP, with a throughput of 361, shows the second-best result after CacheCW; the other strategies fall behind LRU-SP in the same order as in the previous experiment, although all of them experience an increase in throughput. According to the results, CacheCW still outperforms all the strategies, although the improvement in throughput decreases to 1.12 for 16 KB, 1.13 for 32 KB, and 1.15 for 64 KB objects compared to LRU-SP, which is still a significant difference.

4.2.2 Network Bandwidth Usage

The second experiment examines the amount of network bandwidth usage per request. Figure 4.11 shows the network bandwidth usage per request for the different object sizes, i.e., 16 KB, 32 KB, and 64 KB. The results are based on a 12-hour simulation with a cache size equal to 20%.

As shown in Figure 4.11, all caching strategies reduce the network bandwidth usage per request. CacheCW achieves the best reduction, using 1.3 times less bandwidth than NoCache for 16 KB objects; the ratio is 1.34 and 1.38 for 32 KB and 64 KB objects. The second-best result belongs to LRU-SP, with reduction rates of 1.10, 1.11, and 1.12 for the different object sizes. These are followed by LRU-S, LRU, LFU, FIFO, and SIZE, respectively. The order is consistent among all object sizes.

Figure 4.10: Throughput - upper-bound scenario for CacheCW. Throughput (req/min) over time (min) for NoCache, CacheCW, LRU-SP, LRU-S, LRU, LFU, FIFO, and SIZE. (a) Average object size: 16 KB; (b) Average object size: 32 KB; (c) Average object size: 64 KB.

Figure 4.11: Network Bandwidth Usage. Network usage (KB/req) for NoCache, CacheCW, LRU-SP, LRU-S, LRU, LFU, FIFO, and SIZE. (a) Average object size: 16 KB; (b) Average object size: 32 KB; (c) Average object size: 64 KB.

The different ratios for the different object sizes are due to the transmission of validation messages through the network. Note that one validation message is sent to validate each object; therefore, for smaller object sizes, proportionally more network bandwidth is consumed by validation messages. It should also be mentioned that each user request generates a number of sub-requests to providers, which can be varied in the simulation; in other words, each request results in a number of response messages being sent by providers. That is why the amount of network usage per request is several times larger than the average size of a response message.

Similar to the throughput experiments, we consider a case where all the other caching strategies perform at their best and compare the results with CacheCW. Although this might not be possible in practice, the results demonstrate an upper-bound scenario for comparing CacheCW with the other caching strategies. The results are shown in Figure 4.12.

As shown in the figure, even in this case, CacheCW outperforms all the caching strategies by factors of at least 1.10, 1.12, and 1.15 for the different object sizes. LRU-SP has the second-best result, followed by the other strategies in the same order.

4.2.3 Average Access Time

In this experiment, we study the average access time, which is defined as the time from when a request is submitted by the user until the result(s) are received. Figure 4.13 shows the average access time of CacheCW in comparison with NoCache and the other caching strategies. The same simulation time and cache size as in the previous experiments are used here, i.e., a 12-hour simulation and a 20% cache size.

Figure 4.12: Network Bandwidth Usage - upper-bound scenario for CacheCW. Network usage (KB/req) for NoCache, CacheCW, LRU-SP, LRU-S, LRU, LFU, FIFO, and SIZE. (a) Average object size: 16 KB; (b) Average object size: 32 KB; (c) Average object size: 64 KB.

Figure 4.13: Average Access Time. Average access time (sec) for NoCache, CacheCW, LRU-SP, LRU-S, LRU, LFU, FIFO, and SIZE. (a) Average object size: 16 KB; (b) Average object size: 32 KB; (c) Average object size: 64 KB.

As shown in Figure 4.13, caching improves the average access time. For 16 KB objects, CacheCW results in an average access time of 4.67 seconds, compared with 6.12 seconds for NoCache and 5.42 seconds for LRU-SP. For 32 KB objects, the average access times are 6.45, 8.71, and 7.61 seconds for CacheCW, NoCache, and LRU-SP, respectively; for 64 KB objects they are 7.1, 9.8, and 8.52 seconds. As can be noticed, with increasing average object size the improvement in average access time of CacheCW over NoCache and LRU-SP increases: 1.45 and 0.75 seconds for 16 KB objects, 2.26 and 1.16 seconds for 32 KB objects, and finally 2.7 and 1.42 seconds for 64 KB objects.

Similar to the previous experiments, we consider a best-case scenario for the other caching strategies and use it as a comparison base for an upper-bound scenario for CacheCW. The results are shown in Figure 4.14.

As can be seen in the figure, even in this worst case CacheCW still outperforms all the other caching strategies, by factors of at least 1.11 for 16 KB objects, 1.35 for 32 KB objects, and 1.39 for 64 KB objects. The second-best result belongs to LRU-SP, followed by the other strategies in the same order as before.

4.2.4 Analysis of the Performance Results

CacheCW outperforms all the caching strategies in terms of throughput, network usage, and average access time. Even in an upper-bound study, the improvement is noticeable. In all the experiments, LRU-SP has the second-best results, followed by LRU-S, LRU, LFU, FIFO, and finally SIZE. The order is consistent across all the performance measures, i.e., throughput, network usage, and average access time.

Figure 4.14: Average Access Time - upper-bound scenario for CacheCW. Average access time (sec) for NoCache, CacheCW, LRU-SP, LRU-S, LRU, LFU, FIFO, and SIZE. (a) Average object size: 16 KB; (b) Average object size: 32 KB; (c) Average object size: 64 KB.

In the following experiments, we only compare the performance results of CacheCW with LRU-SP, as the other strategies have lower performance results.

The reason why SIZE has the worst performance lies in the fact that it does not consider the access pattern of Web objects when selecting objects for removal. The other caching strategies studied here consider the access frequency in one way or another. The difference between LRU, LFU, and FIFO is the way they use past history to predict the access frequency in the future; LRU-S and LRU-SP combine it with the size and popularity of objects to achieve more effectiveness. Unlike CacheCW, none of them considers the update frequency and computation cost of objects.

All caching strategies perform slightly better with larger objects, because less overhead is involved in generating, processing, and transferring validation messages: for larger objects, fewer validation messages are generated, processed, and transferred relative to the amount of data being processed and transferred.

4.2.5 Average Access Time - First Reply

As mentioned in Chapter 2, results can be shown to the user as they arrive. In such systems, the user-perceived delay could be considered as the time from when the request is sent until the “first” response message is received; subsequent messages are delivered to the user while they browse the results received so far. A slightly different definition of access time measures the time from when the request is received by the portal until the result is made ready by the portal. Obviously, this definition hides the network delay between the users and the portal, and the calculated average access time will be less than the one calculated based on the former definition. Figure 4.15 shows the results based on this definition of average access time.

Figure 4.15: Average Access Time (first reply). Average access time (sec) for NoCache, CacheCW, and LRU-SP, for object sizes 16K, 32K, and 64K.

According to the results, caching improves the average access time to the first response significantly. In fact, all caching strategies show a significant improvement, since any of them can eliminate the need for portal-provider communication and/or object generation for at least one of the objects.

For 16 KB objects, the resulting average access times are 2.46, 5.29, and 3.11 seconds for CacheCW, NoCache, and LRU-SP. The corresponding values for 32 KB objects are 2.92, 6.45, and 3.73 seconds, and for 64 KB objects 3.79, 8.72, and 4.9 seconds, respectively.

We also show the results for the upper-bound comparison of CacheCW with the other strategies. The results are shown in Figure 4.16.

In the worst case CacheCW shows 1.19, 1.2, and 1.22 times improvement compared to LRU-SP. As mentioned earlier, the results for the other caching strategies were excluded from the figures, as LRU-SP shows the best results among them, followed by the others in the same order.

Figure 4.16: Average Access Time (first reply) - upper-bound scenario for CacheCW. Average access time (sec) for NoCache, CacheCW, and LRU-SP, for object sizes 16K, 32K, and 64K.

4.2.6 Effect of User Access Pattern

In this experiment we study the effect of user access patterns on performance. In the normal activity of a Web application, read requests are expected to occur more frequently than write requests; this is actually the logic behind all caching strategies. In the simulation, we consider a parameter readToWriteRatio which denotes the ratio of read to write requests. This parameter was set to 0.9 for the previous simulation instances, meaning that 90% of user requests are read requests, i.e., search and browse requests, and 10% are write requests, e.g., updating personal information or buying a product. In this experiment we examine an extreme case where read and write requests each make up 50% of the load, by setting readToWriteRatio to 0.5. This is to make sure that in such extreme cases the overhead of caching does not outweigh its benefits and our caching strategy does not experience degraded performance. This is an important issue in our caching strategy, as we use meta-data and need to be sure that managing this meta-data does not become troublesome in cases like this.

Figure 4.17 shows the throughput for 16 KB, 32 KB, and 64 KB objects. As shown in the figure, even with a ratio of 50%, our caching strategy shows a significant improvement compared to NoCache and all the other caching strategies.

The most important reason the other caching strategies fall behind CacheCW in such an extreme case is that they do not take the update rate of objects into account in the caching decision.

Network bandwidth usage cannot be negatively affected by such an increase in update rate: it could only increase if the response to most of the validation messages were “Invalid”, or if all response messages were very small, and neither of these occurs even in extreme cases. Average access time also cannot degrade when throughput is improved. Therefore, we omit the results for network bandwidth usage and average access time.

4.2.7 Effect of Cache Size

In this experiment, we vary the cache size and compare the performance results of CacheCW with the others. An average object size of 32 KB is used in this experiment. Figure 4.18 shows the hit ratio (number of hits divided by number of accesses) for 100%, 50%, and 20% cache sizes. As shown in the figure, when the cache size is very large (e.g., 100%), all strategies perform almost similarly⁴. CacheCW still performs better, as it ignores unsuitable objects for caching, thereby minimizing false hits and achieving a better hit ratio.

⁴ Only LRU-SP is shown in the figure as it was the best after CacheCW among all the caching strategies.

Figure 4.17: Throughput - high update rate. Throughput (req/min) over time (min) for NoCache, CacheCW, and LRU-SP. (a) Average object size: 16 KB; (b) Average object size: 32 KB; (c) Average object size: 64 KB.

When there is a cache size limitation (e.g., 50% or 20%), CacheCW performs better than LRU-SP and the others, because it caches more suitable objects. With a tighter limitation on the cache size the difference becomes significant, as shown in Figure 4.19.

We also evaluate the effect of cache size on the overall performance of CacheCW (i.e., throughput and network bandwidth usage). We run simulation instances of CacheCW with different cache sizes and compare the results. Figure 4.20 (a) shows that increasing the cache size beyond 30% has little or no effect on throughput (i.e., 315 requests per minute for a 30% cache, compared to 315 for 100%, 316 for 50%, 311 for 20%, 292 for 10%, and 276 for 5% cache sizes). Increasing the cache size beyond 30% in this case does not have a positive impact on the throughput (only a very minor one for 50%), because of the extra overhead of housekeeping operations incurred for larger cache sizes, e.g., probing a larger cache look-up table at the portal. Figure 4.20 (b) supports this claim: for cache sizes larger than 30% there is a slight improvement in cache hit ratio, but, as noted before, due to the overhead it does not impact the throughput positively. According to Figure 4.20 (c), larger cache sizes have a positive impact on reducing network bandwidth usage.

4.2.8 Weak Cache Coherency

Some Web applications do not need to provide strong coherency for cached objects, as discussed in Section 2. In such applications stale cached objects might, to some extent, be delivered to users. One way of doing this is to assign an expiration time to objects: objects are delivered to users before their expiration time, without checking the freshness of the object.

Figure 4.18: Hit Ratio - different cache sizes. Hit ratio (hit/access) over time (min) for CacheCW and LRU-SP. (a) Cache size: 100%; (b) Cache size: 50%; (c) Cache size: 20%.

Figure 4.19: Hit Ratio - smaller cache sizes. Hit ratio (hit/access) over time (min) for CacheCW and LRU-SP. (d) Cache size: 10%; (e) Cache size: 5%.

In this experiment we compare CacheCW with the other caching strategies when strong coherency is not required by the application. We set an expiration time of 30 seconds for the objects and compare the performance results. As shown in Figure 4.21, CacheCW is superior to all the other caching strategies, because it chooses the most suitable objects for caching. The improvements in throughput are 1.49, 1.55, and 1.63 times compared to NoCache and 1.2, 1.21, and 1.22 times compared to LRU-SP for 16 KB, 32 KB, and 64 KB objects, respectively. The improvements in network usage are 1.41, 1.45, and 1.5 times compared to NoCache and 1.19, 1.2, and 1.22 times compared to LRU-SP.
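As a rough illustration of this weak-coherency mode, the sketch below serves a cached copy without any validation message as long as its expiration time (30 seconds in the experiment above) has not passed. The class and method names are assumptions made for illustration.

import java.time.Duration;
import java.time.Instant;

// Sketch of an expiring cache entry used for weak coherency.
public class ExpiringEntry {
    private final String content;
    private final Instant expiresAt;

    public ExpiringEntry(String content, Duration timeToLive) {
        this.content = content;
        this.expiresAt = Instant.now().plus(timeToLive);
    }

    // Returns the cached content if still fresh, or null to force a provider round trip.
    public String serveIfFresh() {
        return Instant.now().isBefore(expiresAt) ? content : null;
    }

    public static void main(String[] args) {
        ExpiringEntry entry = new ExpiringEntry("<html>...</html>", Duration.ofSeconds(30));
        System.out.println(entry.serveIfFresh() != null ? "served from cache" : "revalidate");
    }
}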

Figure 4.20: Effect of cache size on performance of CacheCW. (a) Throughput (req/min); (b) Hit ratio (hit/access); (c) Network bandwidth usage (KB/req); each for cache sizes of 5%, 10%, 20%, 30%, 50%, 100%, and NoCache.

Strategy    False Hit Ratio
CacheCW     0.0143
LRU-SP      0.0167

Table 4.1: False Hit Ratio - weak coherency

Finally, the improvements in average access time are 1.42, 1.45, and 1.48 times compared to NoCache and 1.18, 1.19, and 1.20 times compared to LRU-SP.

Figure 4.22 shows the results for the first reply. The improvements are 2.41, 2.46, and 2.56 times compared to NoCache and 1.22, 1.23, and 1.25 times compared to LRU-SP.

The results presented in Figure 4.21 and Figure 4.22 show the superiority of CacheCW compared to the other caching strategies in cases where weak coherency is acceptable. Comparing the false hit ratio of CacheCW and the other caching strategies (Table 4.1) reveals that the level of coherency offered by CacheCW is also better: besides the improvement in all performance measures, it results in a lower false hit ratio. As mentioned earlier, this is because CacheCW chooses more suitable objects for caching.

4.2.9 Recency of Objects

As mentioned earlier in Section 3, the recency of objects may be combined with the cache-worthiness scores by the portal. The decision for caching objects is then made by the portal based on the combined value of recency and cache-worthiness scores.

Figure 4.21: Performance measures - weak coherency. (a) Throughput (req/min); (b) Network usage (KB/req); (c) Average access time (sec); each for NoCache, CacheCW, and LRU-SP at object sizes 16K, 32K, and 64K.

Figure 4.22: Average Access Time (first reply) - weak coherency. Average access time (sec) for NoCache, CacheCW, and LRU-SP at object sizes 16K, 32K, and 64K.

Strategy     Throughput (req/min)    Hit Ratio    Network Usage (KB/req)
CacheCW-R    322                     0.37         305
CacheCW      311                     0.35         289

Table 4.2: Comparison of CacheCW-R and CacheCW

We call the resulting strategy CacheCW-R. The results of comparing CacheCW-R with CacheCW are shown in Table 4.2 for 32 KB objects. As shown in the table, CacheCW-R improves the performance measures. However, more improvement is expected if the access and update patterns change quickly over time; in that case, including recency compensates for the changes in access pattern.
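A possible way to realize this combination is sketched below: the provider-supplied cache-worthiness score is mixed with a recency term that decays with the time since the last access. The exponential decay and the mixing weight alpha are illustrative assumptions; the actual combination used by the portal is the one described in Section 3.

// Sketch of combining recency with the cache-worthiness score (CacheCW-R).
public class RecencyCombiner {
    private final double alpha;       // weight of the provider score vs. recency
    private final double halfLifeSec; // recency halves after this many seconds without access

    public RecencyCombiner(double alpha, double halfLifeSec) {
        this.alpha = alpha;
        this.halfLifeSec = halfLifeSec;
    }

    public double combinedScore(double cacheWorthiness, double secondsSinceLastAccess) {
        double recency = Math.exp(-Math.log(2) * secondsSinceLastAccess / halfLifeSec);
        return alpha * cacheWorthiness + (1 - alpha) * recency; // both terms assumed in [0, 1]
    }
}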

4.2.10 Utility of Providers

In this experiment we examine the effect of considering the utility of providers in the caching strategy. As mentioned earlier, each query's processing time is bounded by the throughput of the slowest provider participating in the query processing, especially when the result of such a provider cannot be ignored, e.g., when satisfying a composite service or when the provider's result must be included in the results.

                        CacheCW    CacheCW-Util
Throughput (req/min)    259        275

Table 4.3: Effect of utility on throughput

                             CacheCW    CacheCW-Util
Average Access Time (sec)    7.91       7.32

Table 4.4: Effect of utility on average access time

In this experiment we put some delay on some of the providers, in such a way that on average one provider in each query experiences such a delay. We first run the simulation without considering the utility; in other words, we assign the same utility to all providers and measure the throughput of the system. For the second run we set a higher utility value for the slow providers and measure the throughput. The results are shown in Table 4.3. As shown in the table, this results in a 6% improvement in throughput.

We also examine the effect of considering the utility of providers on the average access time. Table 4.4 shows the results. As can be seen in the table, the average access time is improved by 8%.
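One simple way to fold provider utility into the caching decision is sketched below, where the cache-worthiness score of an object is boosted by the utility assigned to its provider. The multiplicative combination is an assumption made purely for illustration; the thesis only requires that slow providers receive a higher utility.

import java.util.Map;

// Sketch of weighting cache-worthiness scores by provider utility.
public class UtilityWeighting {
    private final Map<String, Double> providerUtility; // providerId -> utility (>= 1 for slow providers)

    public UtilityWeighting(Map<String, Double> providerUtility) {
        this.providerUtility = providerUtility;
    }

    public double effectiveScore(String providerId, double cacheWorthiness) {
        // objects from high-utility (slow) providers are more likely to stay cached
        return cacheWorthiness * providerUtility.getOrDefault(providerId, 1.0);
    }
}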

4.2.11 Regulation

As discussed in Section 3, the fact that it is up to the providers to calculate cache-worthiness scores may lead to inconsistencies between them. To avoid such inconsistencies, a regulation factor is assigned to each provider, and every cache-worthiness score from a provider is multiplied by the corresponding factor. Therefore, providers whose scores are low should have a high regulation factor, and vice versa. The regulation factor is adjusted over time by monitoring the false and real hit ratios.

For this purpose, the providers in the simulation are set up in such a way that the first one deliberately overestimates, the second underestimates, and the third produces the standard cache-worthiness score; the same pattern applies to subsequent providers. In other words, one third of the providers overestimate, one third underestimate, and the remaining one third use the normal cache-worthiness score. Each provider was initially given a regulation factor of 1, so that its cache-worthiness scores were not modified.

Both the false and the real hit ratio were used to downgrade or upgrade the regulation factor. Among all the variations we tried, using the real hit ratio divided by the cache space occupied by each provider to adjust the regulation factor was the most successful. Using the real hit ratio by itself did not produce the desired results, as the performance of the cache for a provider depends both on the real hit ratio and on the cache space occupied by that provider.

Providers were monitored to see whether the regulation factor moved in such a way as to separate the three groups of providers, so that the underestimating providers consistently had the highest factor, followed by the accurately estimating providers, with the overestimating providers having the lowest regulation factor. Figure 4.23 (a) shows the changes in the regulation factor for the different groups of providers. One provider from each group is shown in the figure; however, all providers in each group show similar results.

The throughput of the system was compared with the case where no regulation is performed. The results of the experiment are shown in Figure 4.23 (b). They demonstrate that the regulation factor does separate the providers accordingly, which in turn helps to improve the performance of the Web portal in terms of throughput.

When upgrading or downgrading the regulation factor, we use a small value δ by which λ(m) is increased or decreased. If a very small value is chosen for δ, it takes a long time for the system to adjust itself; on the other hand, a large value for δ makes the regulation factor fluctuate unnecessarily. Choosing an appropriate value lets the system adjust itself in a reasonable amount of time. For this purpose we have examined different values for δ, which is defined as follows:

CW_{i,j}: cache-worthiness score of object O_j at provider P_i
CW_i: average value of the cache-worthiness scores at provider P_i

∆_i(CW) = CW_{i+1} − CW_i
δ = f × ∆(CW), 0 < f ≤ 1

Smaller values for f make the adjustment smoother, but slower. The experimental results show that any value for f in the range 0 < f ≤ 1 generates the expected results; the results in this experiment were generated with f = 0.1.
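The following sketch illustrates one plausible form of this adjustment step, assuming the regulation factor λ of a provider is nudged up or down by δ = f × ∆(CW) depending on how the provider's real hit ratio per unit of occupied cache space compares with the portal-wide average; the exact comparison rule is an assumption made for illustration.

// Sketch of the per-interval regulation-factor adjustment.
public class Regulator {
    private final double f; // 0 < f <= 1; f = 0.1 in the experiments

    public Regulator(double f) { this.f = f; }

    public double adjust(double lambda, double hitRatioPerSpace,
                         double portalAverage, double deltaCW) {
        double delta = f * deltaCW;                                   // delta = f x Delta(CW)
        if (hitRatioPerSpace > portalAverage) return lambda + delta;  // upgrade under-estimators
        if (hitRatioPerSpace < portalAverage) return lambda - delta;  // downgrade over-estimators
        return lambda;                                                // equal performance: leave unchanged
    }
}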

Figure 4.23: Regulating heterogeneous cache-worthiness scores. (a) Regulation factor over time (min) for UnderEst, NormalEst, and OverEst providers; (b) Throughput (req/min) over time (min) with (Reg) and without (NoReg) regulation.

         UnderEst    NormalEst    OverEst
NoReg    6.93        6.65         6.58
Reg      6.63        6.58         6.55

Table 4.5: Average access times for individual providers

         Average Access Time (sec)
NoReg    6.72
Reg      6.59

Table 4.6: Total average access time

When ∆(CW) = 0, although very unlikely, δ will be zero; in other words, in this special case the regulation factors stay unchanged. However, in the next interval, when ∆(CW) is calculated again, the regulation process resumes. The value of ∆(CW) is calculated using the objects available in the cache. Our experiments show that even using a subset of the cached objects to compute ∆(CW) gives the same results; an estimate of the value, in case the overhead is an issue, also generates the desired results.

According to the experiments, regulation results in a minor improvement in throughput, i.e., 300 compared to 291, a factor of 1.03. However, as Table 4.5 and Table 4.6 show, it promotes fairness among providers by producing more even average access times for individual providers. Moreover, it improves the overall average access time of the system.

4.2.12 Meta-Data Synchronization

In this experiment we examine the effect of meta-data synchronization (of the cache look-up and cache validation tables) on performance. As already discussed, the synchronization may be done either on a periodic basis or whenever a change happens. We run two sets of experiments, one for periodic synchronization and one for change-based synchronization, and present the results.

Time-based Synchronization

In the first experiment, we use time-based synchronization and study its effect on performance. We use different intervals, i.e., 30 sec, 1 min, 2 min, 3 min, 4 min, 5 min, 7 min, and 10 min, and compare the results with the case where no synchronization is done (NoSynch). Obviously, smaller intervals increase the hit ratio; however, they impose more network and computational overhead, which in turn degrades performance. In contrast, larger intervals do not impose much network and computational overhead, but have less effect on increasing the hit ratio. According to the results, a three-minute interval provides the best throughput, while the best result for network bandwidth usage is achieved with an interval of 4 minutes or more. The results are shown in Table 4.7.

The best value for the interval depends on system parameters such as the size of the meta-data tables, the network bandwidth, the update rate, the number of providers, and the load on individual providers or the portal. In practice the best value can be determined by trying different values for the interval and monitoring the performance of the system; the best value can be set after this period. However, if the parameters of the system change over time, e.g., network bandwidth or number of providers, then the best interval should be determined again.
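As an illustration of time-based synchronization, the sketch below refreshes the portal's meta-data at a fixed interval using a standard scheduled executor; fetchProviderMetaData() is a hypothetical placeholder for the actual exchange with the providers.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of periodic (time-based) meta-data synchronization at the portal.
public class PeriodicSynchronizer {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void start(long intervalSeconds) {
        scheduler.scheduleAtFixedRate(this::synchronize, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
    }

    private void synchronize() {
        // pull updated meta-data from every provider and merge it into the
        // portal's cache look-up and cache validation tables
        fetchProviderMetaData();
    }

    private void fetchProviderMetaData() {
        // placeholder: the network exchange with the providers would happen here
    }

    public static void main(String[] args) {
        new PeriodicSynchronizer().start(180); // a three-minute interval, as in the experiments
    }
}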

Interval    Throughput (req/min)    Network Usage (KB/req)
NoSynch     311                     289
30 sec      291                     314
1 min       302                     306
2 min       312                     399
3 min       325                     290
4 min       321                     288
5 min       318                     288
7 min       316                     289
10 min      315                     289

Table 4.7: Periodic synchronization of meta-data

Change-based Synchronization

In the second experiment, we use change-based synchronization and study its effect on performance. It should be noted that this mechanism eliminates the need for the portal to send validation messages when a hit is detected, while still guaranteeing strong coherency: if a change happens on the back-end database, the relevant object(s) at the portal are invalidated immediately, so the portal does not need to validate cached objects. This saves a network round trip each time a hit is detected, compared to the other synchronization mechanism or the case where no synchronization is done. However, its performance depends highly on the update rate; this method is expected to provide good performance results when the update rate is low. We run the simulation for different update rates, i.e., 2%, 10%, and 50%, and compare the results. The result of the experiment and its comparison with time-based synchronization⁵ is shown in Table 4.8.

                 2% update rate    10% update rate    50% update rate
No-Synch         388               311                310
Time-based       407               321                318
Change-based     436               318                278

Table 4.8: Effect of synchronization on throughput

According to Table 4.8, change-based synchronization improves throughput significantly when the update rate is low, e.g., 2%. However, when more changes happen on the back-end database, its overhead might outweigh the benefits and the system might experience degraded performance.
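A minimal sketch of change-based invalidation is given below: when a provider reports a change in its back-end database, the portal immediately drops the affected cached object, so subsequent hits can be served without validation messages. The class and method names are illustrative only, not the actual portal API.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of change-based invalidation at the portal.
public class ChangeBasedInvalidation {
    private final Map<String, String> cache = new ConcurrentHashMap<>(); // objectId -> content

    // Called when a provider pushes a change notification for an object.
    public void onProviderUpdate(String changedObjectId) {
        cache.remove(changedObjectId); // invalidate immediately; the next request regenerates it
    }

    // A hit can be served directly, since stale entries have already been removed.
    public String lookup(String objectId) {
        return cache.get(objectId);
    }

    public void store(String objectId, String content) {
        cache.put(objectId, content);
    }
}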

It should be noted that the throughput is measured as the number of requests per minute, including both read and write requests. Read requests normally involve more providers, while write requests involve fewer providers, mostly one (e.g., booking accommodation at a particular hotel). In this experiment, the results for the 50% update rate happen to be close to those for 10%.

⁵ The results for time-based synchronization are based on the best result achieved over the different synchronization intervals.

4.3 Summary

In this chapter, we have studied the performance of CacheCW and compared it with other caching strategies: LRU, FIFO, LFU, SIZE, LRU-S, and LRU-SP. Extensive simulations were conducted using an evaluation test-bed which enables us to simulate the behavior of a portal under different system parameters, including the number of providers, average message size, response time, access frequency, update rate, and cache size. Throughput, average access time, and network bandwidth usage were used as the primary performance measures. According to the results, CacheCW outperforms all the existing strategies.

Using a caching strategy increases the throughput. For all the simulated caching strategies, the throughput increases rapidly at the beginning, while the cache is being populated with more objects; the increase continues until the throughput slowly stabilizes at a peak, once the cache is populated with the most useful objects. According to the results, CacheCW achieves the best throughput, and better results are achieved for larger objects. That is because of the overhead involved in generating and exchanging validation messages: for smaller objects, more validation messages are exchanged and more overhead is involved relative to the amount of data transferred. In the worst-case scenario, the results show that CacheCW still performs better than the others. The reason CacheCW has a better throughput is that it populates the cache with the most useful objects, while the other strategies fall behind; in fact, none of those caching strategies is particularly designed for dynamic objects. In calculating the cache-worthiness scores all the aspects of an object are considered, including access frequency, update rate, size, and computation cost.

The update rate of an object is an important factor for dynamic data which is missed by the other strategies. The computation cost is another factor that is also missed by all the other strategies. Moreover, the other strategies do not address the issue of identifying good candidates for caching: it is assumed that all the objects are cacheable and therefore all the objects get cached, while CacheCW dynamically identifies caching policies by disregarding worthless objects. For example, when the update rate is high CacheCW avoids caching objects that change frequently, because they are not good candidates for caching and normally waste cache space without any benefit; moreover, they might incur more overhead on the cache. As a result, better performance is achieved for CacheCW compared with all the other caching strategies. The difference becomes significant when there is more limitation on cache space.

All the other caching strategies lose important information about the caching history of an object, e.g., its access frequency, when it is removed from the cache; when the object re-enters the cache, this information is calculated from scratch. This is not the case for CacheCW: every object carries its caching information, i.e., the cache-worthiness score, at all times. This feature enables the strategy to make better caching decisions, which makes it unique among all the other strategies.

Combining cache-worthiness scores with the recency of objects yields better performance. Based on the concept of temporal locality, objects that were accessed more recently are more likely to be accessed in the near future. The experimental results show that this combination produces better performance results.

The utility of a provider is another factor used in our strategy. Giving more priority to the objects of a slow provider that contributes to a composite service boosts the performance of the system as a whole in terms of response time and throughput. The concept of utility can also be used in a business portal to favour some providers over others based on, for example, the commission paid or other criteria. The evaluation results show that throughput and average access time are improved in such cases.

Regulation is an important issue that needs to be addressed appropriately in our strategy. We detect inconsistencies between different providers in calculating cache-worthiness scores and regulate them. Some providers might give artificially high scores to their objects in the hope of getting more cache space; it is also possible that some providers use inappropriate scoring methods which give high scores to their objects. Conversely, some providers might give low scores to their objects. These inconsistencies between different providers are detected in the regulation process and the scores from such providers are downgraded or upgraded. This improves the performance results and provides fairness among providers. The hit ratio of a provider is used as the primary factor in this process, as it reflects the performance of the cache for individual providers; however, we normalize the hit ratio values by dividing them by the cache space assigned to individual providers.

Synchronization of meta-data, i.e., the cache look-up table and cache validation tables, can be achieved using either a time-based or a change-based method. False hits are reduced if the meta-data are kept consistent with each other, and as a result better performance is achieved. If the update rate is low, a change-based method performs well.

When there is a high update rate, a time-based method with an appropriate interval normally performs better. The best value for the interval depends on the system parameters. In practice, this value is determined by trying different values for the interval and monitoring the performance results; the best value is set after the monitoring period, based on the value that generates the best performance results. However, if the system parameters change over time, this value should be determined again. These parameters include the size of the meta-data tables, network bandwidth, update rate, number of providers, and the load on individual providers or the portal.

CacheCW adaptively determines the caching policy for the objects. This makes it unique among caching strategies; in all the other strategies, a system administrator defines the caching policy by including or excluding objects or object groups for caching. As we have already discussed, this is almost impossible in a portal, where the system administrator does not have enough information about the objects of the providers.

Chapter 5

Conclusions

In this thesis, we studied portals as a growing class of Web-based applications, focussing on the performance of such applications. We examined existing solutions for caching Web data and discussed their limitations in providing effective caching in portals.

We introduced a collaborative caching technique for portals, to address the limitations of existing solutions for defining caching policies. The collaborative caching strategy provides an automatic way to define such policies, which makes it unique among all the existing caching strategies. The experimental results show that this strategy improves the performance by increasing throughput, reducing user-perceived delay, and reducing network bandwidth usage.

We also addressed the issue of heterogeneous caching policies by tracing the performance of the cache and regulating the scores from different providers. As a result, better fairness among providers is achieved and the performance of the cache as a whole is increased.

We also examined the synchronization of meta-data and evaluated the performance of change-based and time-based synchronization mechanisms.

Our proposed strategy can be deployed in conjunction with existing strategies such as caching at Web/application servers or server accelerators. It can also be implemented in existing application servers or server accelerators, making it easier for application developers to deploy our caching strategy in their applications by using the APIs or tag libraries provided by the server.

Although the collaborative caching strategy is proposed for portal applications, the idea of cache-worthiness scores can be used to specify candidate caching objects in any Web application.

The key factor in the overall performance of this strategy is the effective collaboration of the portal and its providers in calculating and using the cache-worthiness scores. Each provider should calculate the caching scores in a way that correctly distinguishes its objects from a caching point of view; otherwise, the portal cannot effectively take advantage of the cache. In practice, a scorer module could be provided by the portal as a downloadable plug-in to avoid such problems. The computation of the caching scores could be implemented in the module so that discrepancies between providers in calculating the scores are avoided. This module could be customized based on individual providers' parameters and priorities.

The calculation of cache-worthiness scores is performed periodically. The time period depends on how often system parameters such as access or update frequency change; to include the new parameters in the caching scores, the scores have to be calculated again. In systems where such parameters change frequently, the caching scores should be calculated more often. This might result in decreased system performance due to the overhead imposed on the providers. However, if the behavior of the system parameters follows a regular pattern during different times, different sets of pre-calculated scores may be used for different time intervals.

Otherwise, it would be helpful to calculate the caching scores only for objects that witness a change in the pattern and, for the other objects, to use the previous values. However, this issue is not that important in systems with stable system parameters.

There are cases where providers collaborate with several portals or applications. In these cases, the caching scores should be calculated separately for each portal or application, because access frequency, update frequency, and also network bandwidth may be different for each one. Moreover, the ODG should be defined separately. Calculating different sets of cache-worthiness scores and processing multiple ODGs can impose additional overheads on providers; to avoid such overheads, a solution for minimizing this processing is desired. For example, a common set could be calculated and maintained, with only the differences calculated for the different portals or applications.

For the synchronization of meta-data, either a time-based or a change-based method can be used. A change-based method can guarantee strict cache coherence without the need for a polling-every-time method, but may impose an increased overhead on the system. A time-based method can provide strict cache coherence only if a polling-every-time method is used; otherwise, it provides strong or weak cache coherence, depending on the synchronization period. Determining the best synchronization period for the current system parameters can be an issue; we have used empirical methods to find it. It would be ideal to have divergence-based synchronization: in this method, a function is defined to represent the divergence of the meta-data, using parameters such as false hits, and synchronization takes place when the value of this function exceeds a threshold. The threshold value can be set by a system administrator to enforce different requirements for cache coherence.

However, defining such a function is an open issue and a case for further work.

In our system, we take advantage of full-page caching. The caching performance may be increased with the use of partial caching: sometimes the whole object does not exist in the cache, but parts of the result do. A query may then result in a probe (i.e., the part that exists in the cache and can be served from it) and a remainder (i.e., the part of the result that does not exist in the cache and must be queried from the provider(s)). The decision for caching an object can also be made based on whether the object can partially satisfy requests for other objects. This may lead to defining a new kind of dependence between objects, and the caching policy should be adapted accordingly to make use of such dependencies.

145 Bibliography

[Abe01] Aberdeen Group. Cutting the Costs of Per- sonalization With Dynamic Content Caching. http://www.chutneytech.com/tech/aberdeen.cfm, March 2001. An Executive White Paper.

[AH00] Ron Avnur and Joseph M. Hellerstein. Eddies: Continuously Adaptive Query Processing. In ACM SIGMOD Conference, May 2000.

[AJL+02] Jesse Anton, Lawrence Jacobs, Xiang Liu, Jordan Parker, Zheng Zeng, and Tie Zhong. Web Caching for Database Aplications with Oracle Web cache. In ACM SIGMOD Con- ference, pages 594–599, 2002. Oracle Corporation.

[Aka] Akamai Technologies Corporate. http://www.akamai.com.

[Apa] Apache Software Foundation. http://www.apache.org.

[Apa04] Apache Software Foundation. JCS and JCACHE (JSR- 107). http://jakarta.apache.org/turbine/jcs/index.html, July 2004.

[AWY99] Charu Aggrawal, Joel L. Wolf, and Philip S. Yu. Caching on the World Wide Web. IEEE Transactions on Knowledge and Data Engineering (TKDE), 11(1):94–107, January/February 1999.

146 [BC02] Boualem Benatallah and Fabio Casati, editors. Special Issue on Web Services, Distributed and Parallel Databases, An In- ternational Journal. Kluwer Academic Publishers, 2002.

[BCF+99a] Lee Breslau, Pei Cao, Li Fan, Graham Phillips, and Scott Shenker. Web Caching and Zipf-like Distribution: Evidence and Implications. In IEEE INFOCOM, 1999.

[BCF+99b] Lee Breslau, Pei Cao, Li Fan, Graham Phillips, and Scott Shenker:. Web Caching and Zipf-like Distributions: Evidence and Implications. In IEEE INFOCOM, pages 126–134, 1999.

[BG98] Paul Barford and Mark Grovella. Generating Representative Web Workloads for Network and Server Performance Evalu- ation. In ACM Sigmetrics Performance Evaluation, 1998.

[BK04] Abdullah Balamash and Marwin Krunz. An Overview of Web Caching Replacement Algorithms. In IEEE Comunica- tions Surveys and Tutorials, volume 6, pages 44–56, 2004.

[BKK+01] R. Braumandl, M. Keidl, A. Kemper, D. Kossmann, A. Kreutz, S. Prols, S. Seltzsam, and K. Stocke. Object- Globe: Ubiquitious Query Processing on the Internet. In VLDB Journal, volume 10, pages 48–71, 2001.

[BKK99] Reinhard Braumandl, Alfons Kemper, and Donald Koss- mann. Database Patchwork on the Internet. In ACM SIG- MOD Conference, pages 550–552, 99.

[BO00] Greg Barish and Katia Obraczka. World Wide Web Caching: Trends and Techniques. In ACM Communications, vol- ume 38, pages 178–185. 2000.

[Bor04] Jerry Bortvedt. Functional Specification for Ob- ject Caching Service for Java (OCS4J), 2.0.

147 http://jcp.org/aboutJava/communityprocess/jsr/cacheFS.pdf, June 2004.

[BR02] Laura Bright and Louiqa Raschid. Using Latency-Recency Profiles for Data Delivery on the Web. In VLDB, 2002.

[CAL+02] K. Selcuk Candan, Divyakant Agrawal, Wen-Syan Li, Oliver Po, and Wang-Pin Hsiug. View Invalidation for Dynamic Content Caching in Multitired Architectures. In Proceed- ings of of 28th International Conference on Very Large Data Bases (VLDB), pages 562–573, 2002.

[CB00] Boris Chidlovskii and Uwe Borghoff. Semantic caching of Web queries. VLDB Journal, 9(1):2–17, 2000.

[Cha00] Nikhil Chandhok. Web Distribution Systems: Caching and Replication. http://www.cis.ohio-state.edu/ jain/cis788- 99/web-caching/index.html, 2000.

[Chu] Chutney Technologies. http://www.chutneytech.com.

[CI97] Pei Cao and Sandi Irani. Cost-Aware WWW Proxy Caching Algorithms. In Proceedings of the USENIX Symposium on Internet Technologies and Systems, pages 193–206, 1997.

[CID99] Jim Challenger, Arun Iyengar, and Paul Dantzig. A Scalable System for Consistently Caching Dynamic Web Data. In IEEE INFOCOM, pages 294–303, 1999.

[CIW+00] Jim Challenger, Arun Iyengar, Karen Witting, Cameron Fer- stat, and Paul Reed. A Publishing System for Efficiently Creating Dynamic Web Content. In IEEE INFOCOM, pages 844–853, 2000.

148 [CK00a] Kai Cheng and Yahiko Kambayashi. LRU-SP: A Size- Adjusted and Popularity-Aware LRU Replacement Algo- rithm for Web Caching. In IEEE Compsac, pages 48–53, 2000.

[CK00b] Kai Cheng and Yahiko Kambayashi. Multicache-based Con- tent Management for Web Caching. In WISE, 2000.

[CLL+01] K. S. Candan, Wen-Syan Li, Qiong Luoand, Wang-Pin Hsi- ung, and Divyakant Agrawal. Enabling Dynamic Content Caching for Database-Driven Web Sites. In ACM SIGMOD Conference, pages 532–543, 2001.

[CO02a] L. Y. Cao and M. T. Ozsu. Evaluation of Strong Consistency Web Caching Techniques. In World Wide Web: Internet and Web Information Systems, 2002.

[CO02b] Y. Cao and M. T. Ozsu. Evaluation of Strong Consistency Web Caching Techniques. World Wide Web: Internet and Web Information Systems, 5(2):95–123, 2002.

[CZB98] Pei Cao, Jin Zhang, and Kevin Beach. Active Cache: Caching Dynamic Contents on the Web. In IFIP, 1998.

[DDT+02] Anindya Datta, Kaushik Dutta, Helen M. Thomas, Debra E. VanderMeer, and Krithi Ramamritham. Accelerating Dy- namic Web Content Generation. IEEE Internet Computing, 6(5):27–36, September/October 2002.

[Dig] Digital Island. http://www.digitalisland.com.

[DIR01] Louis Degenaro, Arun Iyengar, and Isabelle Ruvellou. Im- proving Performance with Application-Level Caching. In International Conference on Advances in Infrastructure for

149 Electronic Business, Science, and Education on the Internet (SSGRR), 2001.

[DKP+01] Pavan Deolasee, Amol Katkar, Ankur Panchbudhe, Krithi Ramamaritham, and Prashant Shenoy. Adaptive Push-Pull: Disseminating Dynamic Web Data. In The Tenth World Wide Web Conference (WWW-10), pages 265–274, 2001.

[DST00] Venkata Duvuri, Prashant Shenoy, and Renu Tewari. Adap- tive Leases: A Strong Consistency Mechanism for the World Wide Web. In IEEE INFOCOM’2000, pages 834–843, March 2000.

[Dyn] Dynamai. http://www.persistence.com.

[Edg] Edge Side Includes. http://www.esi.org.

[FCB00] Li Fan, Pei Cao, and Andrei Broder. Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol. In IEEE/ACM Transactions on Networking, volume 8, June 2000.

[FGM+99] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Mas- inter, P. Leach, and T. Berners-Lee. Hypertext Trans- fer Protocol – http/1.1. http://www.cis.ohio-state.edu/cgi- bin/rfc/rfc2616.html, June 1999.

[FJCL99] Li Fan, Quinn Jacobson, Pei Cao, and Wei Lin. Web Prefetching Between Low-Bandwidth Clients and Proxies: Potential and Performance. In SIGMETRICS’99, 1999.

[FYVI00] Daniela Florescu, Khaled Yagoub, Patrick Valduriez, and Va- lerie Issarny. WEAVE: A Data-Intensive Web Site Manage- ment System. In The Conference on Extending Database Technology (EDBT), 2000.

150 [Gar05] Jesse James Garrett. Ajax: A New Approach to Web Applications. http://www.adaptivepath.com/publications/essays/archives/000385.php, February 2005.

[GO01] Kian-Lee Tan Shen-Tat Goh and Beng Chin Ooi. Cache- On-Demand: Recycling with Certainty. In Proceedings of the 17th International Conference on Data Engineering(ICDE), pages 633–640, 2001.

[GS96] J. Gwertzman and M. Seltzer. World Wide Web Cache Con- sistency. In Proceedings of the USENIX Techical Conference, pages 141–152, 1996.

[HMN+99] L. M. Haas, R. J. Miller, B. Niswonger, M. Tork Roth, P. M. Schwarz, and E. L. Wimmers. Transforming Hetrogeneous Data with Database Middleware: Beyond Integration. In Data Engineering Bulletin. 1999.

[IBM] IBM Corporation. http://www.ibm.com.

[IFF+99] Zachary G. Ives, Daniela Florescu, Marc Friedman, Alon Y. Levy, and Daniel S. Weld. An Adaptive Query Execution System for Data Integration. In ACM SIGMOD Conference, pages 299–310, USA, 1999.

[JBW99a] Dawn Jutla, Peter Bodorik, and Yie Wang. Developing In- ternet E-Commerve Benchmarks. In Information Systems, volume 24, 1999.

[JBW99b] Dawn Jutla, Peter Bodorik, and Yie Wang. WebEC: A Benchmark for the Cybermediary Business Model in E- Commerce. In IMSA, Nassau, Bahamas, 1999.

151 [KF00] Donnald Kossmann and Michael J. Franklin. Cache Invest- ment: Integrating Query Optimization and Distributed Data Placement. In ACM TODS, December 2000.

[Kos00] Donald Kossmann. The State of the art in Distributed Query Processing. In ACM Computing Survey. 2000.

[Kre01] Heather Kreger. Web Services Conceptual Architecture (WSCA 1.0). Technical report, IBM Software Group, http://www.ibm.com, 2001.

[KW97] B. Krishnamurthy and C. E. Willis. Study of Piggyback Cache Validation for Proxy Caches in the World Wide Web. In Proceedings of the USENIX Symposium on Internet Tech- nologies and Systems, pages 1–12, 1997.

[KW98] B. Krishnamurthy and C. E. Willis. Piggyback Server Inval- idation for Proxy Cache Coherency. Computer Networks and ISDN Systems, 30(1-7):185–193, 1998.

[LC98] Chengjie Liu and Pei Cao. Maintaining Strong Cache Consis- tency in the World-Wide Web. In International Conference on Distributed Computing Systems, pages 12–21, 1998.

[LCD01] D. Li, P. Cao, and M. Dahlin. WCIP: Web Cache invali- dation Protocol. http://www.ietf.org/internet-drafts/draft- danli-wrec-wcip-01.txt, March 2001.

[LHP+04] Wen-Syan Li, Wang-Pin Hsiung, Oliver Po, Koji Hino, K. Selcuk Candan, and Divyakant Agrawal. Challenges and Practices in Deploying Web Acceleration Solutions for Dis- tributed Enterprise Systems. In Thirteenth World Wide Web Conference (WWW2004), 2004.

152 [Liu99] Ling Liu. Query Routing in Large-scale Digital Library Systems. In International Conference on Data Engineering (ICDE), 1999.

[LN01] Qiong Luo and Jeffrey F. Naughton. Form-Based Proxy Caching for Database-Backed Web Sites. In Proceedings of 27th International Conference on Very Large Data Bases (VLDB), pages 191–200, 2001.

[LNK+00] Qiong Luo, Jeffrey F. Naughton, Rajasekar Krishnamurthy, Pei Cao, and Yunrui Li. Active Query Caching for Data- base Web Servers. In Proceedings of the Third International Workshop on the Web and Databases (WebDB), pages 29– 34, 2000.

[LR00] Alexandros Labrinidis and Nick Roussopoulos. On the Ma- terialization of Web Views. In ACM SIGMOD Conference, pages 367–378, USA, June 2000.

[LR01] Alexandros Labrinidis and Nick Roussopoulos. Adaptive We- bview Materialization. In Proceedings of the ACM SIGMOD Workshop on the Web and Databases (WebDB), pages 85– 90, USA, May 24-25 2001.

[MACM01] Amelie Marian, Serge Abiteboul, Gregory Cobena, and Lau- rent Mignet. Change-Centric Management of Versions in an XML Warehouse. In Proceedings of 27th International Con- ference on Very Large Data Bases (VLDB), pages 581–590, 2001.

[MAG+97] J. McHugh, S. Abiteboul, R. Goldman, D. Quass, and J. Widom. Lore: A Database Management System for Semi-

153 structured Data. SIGMOD Record, 26(3):54–66, September 1997.

[Mar00] Evangelos P. Markatos. On Caching Search Engine Query Results. In 5th International Web Caching and Content De- livery Workshop, pages 137–143, 2000.

[MBFS01] Ioana Manolescu, Luc Bouganim, Francoise Fabret, and Eric Simon. Efficient Data and Program Integration Using Bind- ing Patterns. INRIA, France, 2001. Technical Report 4239.

[MBR03] Mehregan Mahdavi, Boualem Benatallah, and Fethi Rabhi. Caching Dynamic Data for E-Business Applications. In In- ternational Conference on Intelligent Information Systems (IIS’03): New Trends in Intelligent Information Processing and Web Mining (IIPWM), pages 459–466, 2003.

[Mic97] Microsoft Corporation. Cache Array Routing Protocol and Microsoft Proxy Server 2.0. http://www.mcoecn.org/WhitePapers/Mscarp.pdf, 1997. White Paper.

[MS04] Mehregan Mahdavi and John Shepherd. Enabling Dynamic Content Caching in Web Portals. In 14th International Workshop on Research Issues on Data Engineering (RIDE’04), pages 129–136, 2004.

[MSB04] Mehregan Mahdavi, John Shepherd, and Boualem Benatallah. A Collaborative Approach for Caching Dynamic Data in Portal Applications. In The Fifteenth Australasian Database Conference (ADC’04), pages 181–188, 2004.

[NACP01] Benjamin Nguyen, Serge Abiteboul, Gregory Cobena, and Mihai Preda. Monitoring XML Data on the Web. In SIGMOD Conference, pages 437–448, 2001.

[NDM+00] Jeffrey Naughton, David DeWitt, David Maier, et al. The Niagara Internet Query System, 2000.

[OLW01] Chris Olston, Boon Thau Loo, and Jennifer Widom. Adap- tive Precision Setting for Cached Approximate Values. In SIGMOD Conference, 2001.

[Ora] Oracle Corporation. http://www.oracle.com.

[Ora01a] Oracle Corporation. Oracle9i Application Server: Database Cache. Technical report, Oracle Corporation, http://www.oracle.com, February 2001.

[Ora01b] Oracle Corporation. Oracle9iAS Web Cache. Technical report, Oracle Corporation, http://www.oracle.com, June 2001.

[OW02] Chris Olston and Jennifer Widom. Best-Effort Synchroniza- tion with Source Cooperation. In ACM SIGMOD, 2002.

[PB03] Stefan Podlipnig and Laszlo Boszormenyi. A Survey of Web Cache Replacement Strategies. ACM Computing Surveys, 35:374–398, 2003.

[PF00] Sanjoy Paul and Zongming Fei. Distributed Caching with Centralized Control. In 5th International Web Caching and Content Delivery Workshop, 2000.

[QWG+96] D. Quass, J. Widom, R. Goldman, K. Haas, Q. Luo, J. McHugh, S. Nestorov, A. Rajaraman, H. Rivero, S. Abiteboul, J. Ullman, and J. Wiener. LORE: A Lightweight Object REpository for Semistructured Data. In SIGMOD Conference, 1996.

[Rad] Radview. http://www.radview.com.

[RBS01] Uwe Rohm, Klemens Bohm, and Hans-Jorg Schek. Cache-Aware Query Routing in a Cluster of Databases. In ICDE, 2001.

[RDK+00] Krithi Ramamritham, Pavan Deolasee, Amol Katkar, Ankur Panchbudhe, and Prashant Shenoy. Dissemination of Dynamic Data on the Internet. In DNIS 2000, pages 173–178, December 2000.

[RILD05] Lakshmish Ramaswamy, Arun Iyengar, Ling Liu, and Fred Douglis. Automatic Fragment Detection in Dynamic Web Pages and Its Impact on Caching. IEEE Transactions on Knowledge and Data Engineering, 17(6):859–874, 2005.

[RR00] Manuel Rodriguez and Nick Roussopoulos. MOCHA: A Self-Extensible Database Middleware System for Distributed Data Sources. In ACM SIGMOD Conference, May 2000.

[RS02] Michael Rabinovich and Oliver Spatscheck. Web Caching and Replication. Addison-Wesley, 2002.

[Smi03] Steven A. Smith. ASP.NET Caching: Techniques and Best Practices. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnaspp/html/aspnet-cachingtechniquesbestpract.asp, August 2003.

[Sou04] SourceForge. JCache Open Source. http://jcache.sourceforge.net/, August 2004.

[Squ] Squid Web Proxy Cache. http://www.squid-cache.org.

[STD+00] Jayavel Shanmugasundaram, Kristin Tufte, David DeWitt, Jeffrey Naughton, and David Maier. Architecting a Network Query Engine for Producing Partial Results. In WebDB, pages 17–22, 2000.

[Suna] Sun Microsystems. http://www.spec.org/osg/jAppServer/.

[Sunb] Sun Microsystems. JSRs: Java Specification Requests. http://www.jcp.org/en/jsr/overview.

[Sun02] Sun Microsystems. ECperf Specification. http://java.sun.com/j2ee/ecperf/, April 2002.

[Tec01] Chutney Technologies. Dynamic Content Acceleration: A Caching Solution to Enable Scalable Dynamic Web Page Generation. In SIGMOD Conference, 2001.

[TIH01] Igor Tatarinov, Zachary G. Ives, and Alon Y. Halevy. Updating XML. In SIGMOD Conference, pages 413–424, 2001.

[Tim] TimesTen Inc. http://www.timesten.com.

[Tim02] TimesTen Inc. Mid-Tier Caching. Technical report, TimesTen Inc., http://www.timesten.com, 2002.

[Tra01] Transaction Processing Performance Council. TPC Benchmark W (TPC-W). http://www.tpc.org/tpcw/, October 2001.

[UF00] Tolga Urhan and Michael J. Franklin. XJoin: A Reactively-Scheduled Pipelined Join Operator. IEEE Data Engineering Bulletin, 23(3):27–33, June 2000.

[WA96] Stephen Williams and Marc Abrams. Removal Policies in Network Caches for World-Wide Web Documents. In ACM SIGCOMM, pages 293–305, 1996.

[WC97] D. Wessels and K. Claffy. Application of Internet Cache Protocol (ICP), version 2. Network Working Group, Internet-Draft, July 1997.

[Won99] Stephanie Wong. Estimated $4.35 Billion Ecommerce Sales at Risk Each Year. http://www.zonaresearch.com/info/press/99-june30.htm, 1999.

[WY01] Kin Yeung Wong and Kai Hau Yeung. Site-Based Approach to Web Cache Design. IEEE Internet Computing, 5(5):28– 34, September/October 2001.

[YADL99] J. Yin, L. Alvisi, M. Dahlin, and C. Lin. Volume Leases for Consistency in Large-scale Systems. IEEE Transactions on Knowledge and Data Engineering, 11(4):563–576, 1999.

[YBS99] Haobo Yu, Lee Breslau, and Scott Shenker. A Scalable Web Cache Consistency Architecture. In ACM SIGCOMM, pages 163–174, Boston, USA, 1999.

[YFVI00] Khaled Yagoub, Daniela Florescu, Patrick Valduriez, and Valerie Issarny. Caching Strategies for Data-Intensive Web Sites. In Proceedings of 26th International Conference on Very Large Data Bases (VLDB), pages 188–199, Cairo, Egypt, September 2000.

[You91] Neal E. Young. The k-Server Dual and Loose Competitiveness for Paging. Algorithmica, 11(6):525–541, 1991.

[Zon01] Zona Research Inc. Zona Research Releases Need for Speed II. http://www.zonaresearch.com/info/press/01- may03.htm, 2001.
