School of Computer Science and Engineering
University of New South Wales

Caching Dynamic Data for Web Applications

For the Degree of Doctor of Philosophy

Submitted by Mehregan Mahdavi

Supervisor: Dr. John Shepherd
Co-supervisor: A/Prof. Boualem Benatallah

November 2006

Acknowledgements

I would like to express my gratitude to all those who gave me the possibility to complete this thesis. Dr. John Shepherd and Dr. Boualem Benatallah provided the motivation, enthusiasm, and constructive comments and feedback during many discussions we had. It was a great pleasure for me to finish this thesis under their supervision. I would also like to thank the people who helped in implementing fragments of the simulation test-bed used in the evaluation: Willy Mong, Chatarpreet Singh Jitla, and Ka Yee Joanna Chan. Last but not least, I would like to give my special thanks to my wife Shirley, whose patient love enabled me to complete this thesis.

List of Publications

The work described in this thesis has been presented in the following publications. In each case, the author of this thesis was the primary contributor.

“Caching on the Web”, Mehregan Mahdavi, John Shepherd, Boualem Benatallah. Web Data Management Practices: Emerging Techniques and Technologies, Eds. Athena Vakali and George Pallis, Idea Group, Inc., 2005.

“Enabling Dynamic Content Caching in Web Portals”, Mehregan Mahdavi and John Shepherd. 14th International Workshop on Research Issues on Data Engineering (IEEE-RIDE’04), March 2004, USA.

“A Collaborative Approach for Caching Dynamic Data in Portal Applications”, Mehregan Mahdavi, John Shepherd, Boualem Benatallah. The Fifteenth Australasian Database Conference (ADC’04), January 2004, New Zealand.

“Caching Dynamic Data for E-Business Applications”, Mehregan Mahdavi, Boualem Benatallah, Fethi Rabhi. International Conference on Intelligent Information Systems (IIS’03): New Trends in Intelligent Information Processing and Web Mining, June 2003, Poland.

Abstract

Web portals are one of the most rapidly growing classes of Web application, providing a single interface for accessing different sources (providers). The results from the providers are typically obtained by each provider querying a database and returning an HTML or XML document. Performance, and in particular providing fast response time, is one of the critical issues in such applications. Dissatisfaction of users dramatically increases with increasing response time, resulting in abandonment of Web sites, which in turn could result in loss of revenue for the providers and the portal. Caching is one of the key techniques that address the performance of such applications. In this work we focus on improving the performance of portal applications via caching. We discuss the limitations of existing caching solutions in such applications and introduce a caching strategy based on collaboration between the portal and its providers. Providers trace their logs, extract information to identify good candidates for caching, and notify the portal. Caching at the portal is decided based on scores calculated by providers and associated with objects. We evaluate the performance of the collaborative caching strategy using simulation data. We show how providers can trace their logs, calculate cache-worthiness scores for their objects, and notify the portal. We also address the issue of heterogeneous scoring policies among different providers and introduce mechanisms to regulate caching scores. Finally, we show how the portal and providers can synchronize their meta-data in order to minimize the overhead associated with collaboration for caching.

Contents

1 Introduction

2 Background Information
  2.1 Web Portals
    2.1.1 Architectures
    2.1.2 Enabling Technologies
    2.1.3 Performance Issues
    2.1.4 Benchmarking
  2.2 Web Data Caching: An Overview
    2.2.1 Cache Hierarchy
    2.2.2 Caching Issues
    2.2.3 Distributed Cache Management
    2.2.4 Cache Coherency
    2.2.5 Dynamic Content Caching
    2.2.6 Caching Policy and Replacement Strategy
    2.2.7 Query Rewriting and Caching
    2.2.8 Query Processing & Caching
    2.2.9 Case Studies
  2.3 Summary

3 A Collaborative Caching Strategy for Web Portals

  3.1 Caching in Web Portals
  3.2 Caching Strategy
  3.3 Meta-data Support
  3.4 Calculating Cache-worthiness
  3.5 Other Parameters
  3.6 Regulating Heterogeneous Caching Scores
  3.7 Synchronization of Meta-data
  3.8 Summary

4 Evaluation and Analysis
  4.1 Evaluation Test-bed
    4.1.1 Dependence Between Objects
    4.1.2 Size and Computation Cost
    4.1.3 Cache-worthiness Scores
  4.2 Evaluation Results
    4.2.1 Throughput
    4.2.2 Network Bandwidth Usage
    4.2.3 Average Access Time
    4.2.4 Analysis of the Performance Results
    4.2.5 Average Access Time - First Reply
    4.2.6 Effect of User Access Pattern
    4.2.7 Effect of Cache Size
    4.2.8 Weak Cache Coherency
    4.2.9 Recency of Objects
    4.2.10 Utility of Providers
    4.2.11 Regulation
    4.2.12 Meta-Data Synchronization
  4.3 Summary

5 Conclusions

Bibliography

List of Figures

2.1 Centralized Portal Architecture
2.2 Distributed portal architecture
2.3 Complementary providers
2.4 Competitor providers
2.5 Server accelerator
2.6 Edge servers (forward setup)
2.7 Edge servers (reverse setup)

3.1 Caching in portals
3.2 Caching algorithm used by portal
3.3 Caching algorithm used by providers
3.4 Meta-data used by the caching strategy
3.5 Details of caching algorithm used by provider

4.1 Architecture of the test-bed
4.2 Synchronization between UserSimulators and PortalSimulator
4.3 Synchronization of request items
4.4 Synchronization of response items
4.5 Object Dependence Graph (ODG)
4.6 Input file to define object dependence
4.7 Input file to define size of objects

4.8 Input file to define computation cost of objects
4.9 Throughput
4.10 Throughput - upper-bound scenario for CacheCW
4.11 Network Bandwidth Usage
4.12 Network Bandwidth Usage - upper-bound scenario for CacheCW
4.13 Average Access Time
4.14 Average Access Time - upper-bound scenario for CacheCW
4.15 Average Access Time (first reply)
4.16 Average Access Time (first reply) - upper-bound scenario for CacheCW
4.17 Throughput - high update rate
4.18 Hit Ratio - different cache sizes
4.19 Hit Ratio - smaller cache sizes
4.20 Effect of cache size on performance of CacheCW
4.21 Performance measures - weak coherency
4.22 Average Access Time (first reply) - weak coherency
4.23 Regulating heterogeneous cache-worthiness scores

List of Tables

2.1 A classification of cache coherency mechanisms
2.2 Comparison of push and pull
2.3 Summary of ESI tags
2.4 Summary of JESI tags
2.5 Supported cache tag attributes in BEA WebLogic
2.6 Supported cache directive attributes in ASP.NET

4.1 False Hit Ratio - weak coherency
4.2 Comparison of CacheCW-R and CacheCW
4.3 Effect of utility on throughput
4.4 Effect of utility on average access time
4.5 Average access times for individual providers
4.6 Total average access time
4.7 Periodic synchronization of meta-data
4.8 Effect of synchronization on throughput

Chapter 1

Introduction

The Internet provides a convenient and inexpensive infrastructure for communicating and exchanging data between users and data sources. It has influenced many aspects of life such as communication, education, business, shopping, and entertainment. There are many resources on the Internet; some provide data to be used and shared among users, while others are designed to provide applications. For example, Web sites such as university sites, people's home pages, and yellow and white pages provide data. There are also Web sites which provide applications that can be used by users, such as on-line shopping, flight booking, and banking. Users find and access appropriate data or applications through Web browsers.

Performance is one of the major issues in today's Web-enabled applications. Previous research has shown that abandonment of Web sites dramatically increases with increasing response time [Zon01], resulting in loss of revenue by businesses. In other words, providing a fast response time is one of the critical issues that today's Web applications must deal with. Nowadays, many Web sites employ dynamic Web pages by accessing a back-end database and formatting the result into HTML pages. Accessing the database and assembling the final result on the fly is an expensive process and a significant factor in the overall performance of such systems. Server workload or failure and network traffic are other contributing factors for slow response times.

With the increasing use of the Internet for applications, and the emergence of a class of Web applications called Web portals, there is a need for better performance. Web portals enable access to different data or application providers through a single interface. They save time and effort for customers, who only need to access the portal's Web interface rather than navigating through many providers. Business portals, such as Expedia (www.expedia.com) and Amazon (www.amazon.com), are examples of such applications.

Caching is one of the key techniques that address some of the performance issues of Web-enabled applications. Caching can improve response time; as a result, customer satisfaction is increased and better revenue for the portal and the providers is generated. In addition, network traffic and the workload on the providers' servers are considerably reduced. This in turn improves throughput and scalability and reduces hardware and software costs.

When considering caching techniques, a caching policy is required to determine which objects should be cached. Rapidly changing data may not be worth caching because of the space, communication, or computation costs involved. Therefore, a cache policy is required that decides whether or not to cache an item based on the benefits and costs of caching it. Communication cost is measured in terms of the number of messages exchanged between the cache and the data provider. Computation cost is measured in terms of the processing time required at data providers for refreshing or invalidating cached items. Moreover, there is some cost for the cache manager to run the cache replacement algorithm.

Web caching has been extensively studied. Existing approaches have examined caching in a general setting and can provide some benefit to portals, as well as to general Web sites. However, portals have distinctive properties which can be exploited to provide significantly better caching than that provided by more general approaches. This thesis aims at examining caching solutions specifically for portal applications. We propose a new approach that results in significant benefits over existing approaches. The primary technique used in this thesis is for portals and providers to collaborate by sharing critical caching information.

In this thesis, we aimed to investigate the problem of improving the performance of portal applications via caching. This included proposing a new strategy for portal caching involving collaboration between the portal and its providers and analyzing its performance. Throughput, average access time, and network bandwidth usage were used as the primary performance measures, with the goal of achieving better performance than existing techniques. We also aimed to investigate the issues raised by such a collaborative approach and provide solutions to address them. These issues included the heterogeneity of the different providers involved in the collaboration process and the coherence of the meta-data used by the portal and its providers.

This thesis is organized as follows: Chapter 2 provides an overview of Web portals and Web caching. It also discusses the use and shortcomings of existing caching techniques in Web portals. Chapter 3 describes our proposed caching strategy in detail. Evaluation of the strategy is presented in Chapter 4. Finally, Chapter 5 concludes the thesis.

Chapter 2

Background Information

This chapter provides an overview of Web portals, their architectures, enabling technologies, and performance issues. It also gives a comprehensive overview of Web caching. We study the issues of caching dynamic data in Web portals and discuss the shortcomings of existing solutions.

2.1 Web Portals

It is now common for businesses to offer a Web site through which customers can search for and buy products or services on-line. Such businesses are referred to as product or service providers. Due to the large number of existing providers, portals have emerged as Internet-based applications which enable access to different providers through a single Web interface. The idea is to save time and effort for customers, who only need to access the portal's Web interface instead of having to navigate through many provider Web sites. In other words, the portal represents an integrated service which is an aggregation of the services provided by the available providers. A Web portal is normally used to provide a mediated schema of data sources or service providers. Users issue queries based on the schema provided by the portal; the portal finds relevant data sources, reformulates each query into a form acceptable by each data source, submits a query to each individual data source, and combines the results from the different data sources [MAG+97, QWG+96, HMN+99]. Business portals, such as Amazon (www.amazon.com) and Expedia (www.expedia.com), are examples of such applications where customers can search for services or products to use or buy on-line.

Each provider may have a membership relationship with a number of portals. Moreover, each provider may have a number of sub-providers. Each provider stores its own catalog, and the integrated catalog represents the aggregation of all providers' catalogs. The portal deals with a request from the customer by sending requests to the appropriate providers. Responses from providers are sent back to the portal, processed, and a final response is returned to the customer.

2.1.1 Architectures

There are two main approaches to establishing and maintaining the relationship between the portal and its providers:

• Centralized: In the centralized approach, each provider sends its content to the portal, and the contents from different providers are combined and maintained by the portal. When the portal receives a query, e.g., a browse request, it processes the query locally. Each provider is responsible for updating its content to provide fresh data. Normally, a person such as an administrator manages and updates the content on a regular basis. In this case, the provider does not have to have a Web site or an application program that talks to the portal. This is normally done manually by uploading the catalogue data through the portal's Web site or through FTP. It may also be done automatically, i.e., through a program, on a periodic basis or when the number of changes in the provider's database exceeds a threshold. Figure 2.1 shows the architecture of a centralized portal.

• Distributed: In the distributed approach, each provider maintains its own content. When the portal receives a request, it queries the appropriate provider(s). Each provider processes the query and returns the result to the portal, e.g., as an XML or SOAP message. Each provider may in turn need to contact other providers, e.g., when a service or product is made up of different services or items which are provided by other providers. The final result is integrated at the requesting portal, formatted into HTML or any other format, and returned to the user. In the distributed approach, an application represents the provider; it manages the content and talks to the portal application. Figure 2.2 shows the architecture of a distributed portal. A small code sketch of this fan-out-and-aggregate flow is given after this list.
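To make the distributed flow above concrete, the following is a minimal sketch (not taken from this thesis) of a portal that fans a user query out to several providers in parallel and integrates whatever replies arrive within a time budget. The Provider interface, the thread-pool size, the time budget, and the plain-string result type are illustrative assumptions.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical provider interface: each provider answers a sub-query,
// e.g., by returning an XML or SOAP fragment as a string.
interface Provider {
    String query(String subQuery) throws Exception;
}

public class DistributedPortal {
    private final List<Provider> providers;
    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    public DistributedPortal(List<Provider> providers) {
        this.providers = providers;
    }

    // Fan the user query out to all relevant providers and collect their replies.
    public List<String> handleRequest(final String userQuery) throws InterruptedException {
        List<Callable<String>> tasks = new ArrayList<Callable<String>>();
        for (final Provider p : providers) {
            tasks.add(new Callable<String>() {
                public String call() throws Exception {
                    return p.query(userQuery);
                }
            });
        }
        // Illustrative time budget: do not let one slow provider stall the whole page.
        List<Future<String>> replies = pool.invokeAll(tasks, 5, TimeUnit.SECONDS);
        List<String> results = new ArrayList<String>();
        for (Future<String> reply : replies) {
            try {
                results.add(reply.get());          // provider answered in time
            } catch (Exception slowOrFailed) {
                // Timed out or failed: the portal can still return partial results.
            }
        }
        return results;                            // the portal formats these into HTML
    }
}

With this structure, a slow or failed provider degrades the response rather than stalling it, which is the behaviour that the caching and partial-result techniques discussed later aim to improve further.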

Both approaches rely on a meta-data repository to store information about providers, such as user-id, password, address, phone, Web address, as well as a mapping from the portal schema to the provider schemas.

In the distributed approach, the meta-data repository is used to find which providers, out of many candidate choices, should be chosen to answer a sub-query. For example, if a user is looking for accommodation in a travel portal, there is no point querying car rental companies. As another example, if a user is looking for a mobile phone to be delivered to his home, the request does not need to go to providers that do not offer delivery.


Figure 2.1: Centralized Portal Architecture

Other information that might be stored about providers in the repository includes category, data model, query language, attributes of tables (in the case of relational databases), operating system, etc. The process of choosing a number of providers out of the existing ones for query processing is referred to as query routing [Liu99]. Effective query routing not only reduces the query response time and the overall processing cost, but also eliminates a lot of unnecessary communication overhead in contacting providers that do not contribute to the answer of the query.

Although the centralized approach provides an effective (fast) scenario for many applications (e.g., an on-line book store), it may fail in other applications where providing fresh data is more important. In applications such as flight booking and travel planning, providing fresh data is important. Managing the content becomes difficult when a large number of changes happen in the provider's database, especially when the provider has relationships with a number of portals. The distributed approach is deemed more appropriate for such applications.


Figure 2.2: Distributed portal architecture

As mentioned earlier, when the portal receives a query, it reformulates the query (e.g., a browse catalog query is broken down into browse sub-queries for different providers) and queries the appropriate providers. Each provider processes the query and returns the result to the portal, e.g., as a SOAP message. The results are then integrated by the portal.

In a portal-based application, providers fall into two categories based on the way they take part in satisfying the queries:

• Complementary: Complementary providers are those who provide different elements of a composite service or product. In travel planner portals, for example, flight, accommodation, and car rental are complementary services. Another example is a computer manufacturing portal where each provider may provide different parts of a computer. Figure 2.3 shows a travel portal with complementary providers.

• Competitor: Competitor providers are those who provide the same service or product.


Figure 2.3: Complementary providers


Figure 2.4: Competitor providers

A computer-selling portal is an example where providers compete with each other in selling their products, such as PCs or printers. They compete with each other either in providing better Quality of Service (QoS), e.g., faster response time through the portal, or a cheaper price. Fast response time through the portal is a QoS property which both the portal and the provider try to provide. Figure 2.4 shows a computer-selling portal with competitor providers.

2.1.2 Enabling Technologies

In early distributed Web applications, much hard-coded programming needed to be done at both the portal and provider sites to establish relationships between the portal and providers, e.g., using Java RMI or CORBA. Emerging technologies such as Web services promise to substantially simplify the development of portal-enabled applications [BC02]. They enable program-to-program communication independent of the hardware and software platform and also of the programming language [Kre01].

A Web service is the interface that enables the use of a network-accessible operation. The operation itself is called a service [Kre01]. Distributed application integration via Web services is enabled by three major technologies:

• Web Services Description Language (WSDL): an XML-based language used to describe the service and its programmatic interfaces.

• Universal Description, Discovery and Integration (UDDI): a mechanism by which a Web service can be described and registered in a registry. It also enables other applications to discover and integrate the Web service.

• Simple Object Access Protocol (SOAP): an XML-based protocol that provides the communication means between Web services and applications. An application can request a service by sending a SOAP request. The result of the service is sent back to the application as a SOAP response.

Any business that provides a Web service can describe and publish its service in a registry on the Web. Any business that is willing to integrate the service into its application finds and binds (invokes) the service. Web services thus provide the means for building loosely coupled distributed Web applications, which can in turn lead towards a highly dynamic marketplace. Requesting a service through Web services can be done using the following steps (a small code sketch follows the list):

• The service requester creates a SOAP request message which identifies the service being requested plus all the input parameters for the service. A request message can also be included in a URL; in this case, all the input parameters are encoded in the URL (i.e., the GET method) or sent through the message body (i.e., the POST method).

• The request message is delivered to the service provider through the network.

• The service provider processes the request and generates the response message, usually a SOAP response message. However, it can be in any format agreed between the service provider and the requester, e.g., any XML message or a text file. The response message is delivered to the service requester.
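As a rough illustration of these steps, the sketch below builds and sends a SOAP request using the SAAJ API (javax.xml.soap), which ships with older Java SE and Java EE platforms. The service endpoint, namespace, operation name, and parameters are invented for the example.

import java.net.URL;
import javax.xml.soap.MessageFactory;
import javax.xml.soap.SOAPBody;
import javax.xml.soap.SOAPConnection;
import javax.xml.soap.SOAPConnectionFactory;
import javax.xml.soap.SOAPElement;
import javax.xml.soap.SOAPMessage;

public class SoapRequestExample {
    public static void main(String[] args) throws Exception {
        // 1. Create a SOAP request message identifying the service and its input parameters.
        SOAPMessage request = MessageFactory.newInstance().createMessage();
        SOAPBody body = request.getSOAPBody();
        // Invented operation and namespace of a hypothetical flight-search service.
        SOAPElement op = body.addChildElement("findFlights", "svc", "http://example.com/travel");
        op.addChildElement("from").addTextNode("SYD");
        op.addChildElement("to").addTextNode("AKL");
        request.saveChanges();

        // 2. Deliver the request message to the service provider through the network.
        SOAPConnection connection = SOAPConnectionFactory.newInstance().createConnection();
        URL endpoint = new URL("http://example.com/travel/service"); // illustrative endpoint
        SOAPMessage response = connection.call(request, endpoint);

        // 3. The provider processes the request and returns a SOAP response message.
        response.writeTo(System.out);
        connection.close();
    }
}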

Although Web services simplify the development of distributed applications on the Web, the same performance issues apply to Web services as to older technologies. On-line query processing in the distributed scenario is more time-consuming than in the centralized approach, where query processing is performed locally. Network traffic between the portal and individual providers, server workload, and failures at provider sites are other contributing factors for slow response time. Therefore, this scenario may fail to provide fast response time to the user.

2.1.3 Performance Issues

In this work, we focus on distributed portals, where providing fast response time is one of the critical issues.

Previous research [Won99, Zon01] shows that dissatisfaction of users dramatically increases with increasing access time. The importance of speed is referred to as the “8 second rule” in this research: if users experience response times of more than 8 seconds, they start thinking of abandoning the Web site and doing their business with competitors, or taking their business off the Internet entirely. It is estimated that about one-third of on-line customers abandon transactions for this reason; this amounts to around $4.35 billion of lost revenue annually.

In recent years we have witnessed an explosive increase in the number of data providers, application providers, and users. We have also witnessed unparalleled growth in the Internet in terms of total bytes transferred and network capacity. In order to exploit these resources, there is a requirement for more efficient query processing techniques.

There are three classes of delays on the Internet that can affect the responsiveness of query processing:

1. Initial Delay: Longer than expected wait time until the first tuple arrives from a remote source

2. Slow Delivery: The data arrives at a fairly constant rate, but at a slower rate than expected

3. Bursty Arrival: The data arrives in a fluctuating manner

In portal applications, query execution can stall if providers experience such delays [UF00]. It is therefore desirable to provide partial results of query execution, based on the results that have arrived so far. One problem with this, however, is dealing with blocking operators, such as max, average, and difference. A non-blocking operator does not have to wait for all the inputs to arrive before it can produce results; operators such as select, project, and intersect are non-blocking. Operators that have to wait for all the inputs to arrive before generating output are blocking; operators such as sort and average are blocking. Some operators, such as join, can be implemented in either a blocking or a non-blocking way. Non-blocking operators can easily be implemented to generate partial results based on the part of the input that has arrived so far. The challenge is to implement blocking operators as if they were non-blocking: at any time, a blocking operator should be able to output its current result, based on the data that has arrived so far on the input stream(s) [STD+00, NDM+00, UF00, IFF+99].
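As a small illustration, the sketch below implements a blocking operator (average) in a non-blocking fashion: it can report its current partial result at any time, using only the tuples that have arrived so far. The class and method names are our own and are not part of the systems cited above.

public class IncrementalAverage {
    private long count = 0;
    private double sum = 0.0;

    // Called as each tuple arrives on the input stream.
    public synchronized void accept(double value) {
        sum += value;
        count++;
    }

    // Can be called at any time to obtain the current (partial) result.
    public synchronized double currentResult() {
        return count == 0 ? 0.0 : sum / count;
    }

    public static void main(String[] args) {
        IncrementalAverage avg = new IncrementalAverage();
        double[] arrivingTuples = {12.0, 7.5, 9.0};   // e.g., prices from a slow provider
        for (double v : arrivingTuples) {
            avg.accept(v);
            // The portal can display this partial value before the stream is complete.
            System.out.println("partial average = " + avg.currentResult());
        }
    }
}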

Caching is another key technique that addresses some of the performance issues faced by today's Web applications. In particular, caching response messages (which we also refer to as dynamic objects or, for short, objects; a dynamic object is a data item requested by the portal, such as the result of a database query, the result of a JSP page, or an XML or SOAP response message) gives portals the ability to respond to some customer requests locally. As a result, response time to the customer is improved, customer satisfaction is increased, and better revenue for the portal and the providers is generated. In addition, network traffic and the workload on the providers' servers are reduced. This in turn improves scalability and reduces hardware costs.

Replication is a technique with a similar purpose to caching.


Replicating the content at different replication servers reduces network bandwidth. The content of a data source (or parts of it) can be duplicated in different geographical regions, and users around each region can access the data from the closest node. Replication can also increase scalability, as it lightens the load on the original server and divides it among different servers.

Although caching and replication have similar advantages, there are some differences between them. Caching is usually done automatically as a result of query processing, while the query is being executed, whereas replication is an off-line process and is usually decided by a system administrator. Replication takes effect at server machines (i.e., data sources) or at other servers managed by or on behalf of the origin server, while caching might also take effect at clients (i.e., query sources). In addition, replication is typically coarse-grained: the whole content of a Web site, a whole database, a whole table, a whole index, or a whole (horizontal) partition of a table or index can be replicated. Caching, on the other hand, is typically fine-grained: individual pages of a Web site, or individual pages of a table or index, can be cached at a client or middle-tier machine [Kos00].

Pre-fetching is a technique that can be accommodated in cache servers. Instead of waiting for a request and then caching its result, cache servers try to predict future accesses to Web pages, pro-actively request them from origin servers, and cache them. Prefetching between clients and proxy servers is an example that can improve performance considerably.

Many clients connect to the Internet through low-bandwidth dial-up connections. Even though the content they are accessing might have been cached at the proxy server, the connection between them is a major bottleneck which results in slow response time. Prefetching the content in a pro-active manner, by predicting the next object(s) to be accessed, can improve performance. This can be achieved either by browsers pulling the content into their browser cache or by proxy servers pushing the content from their cache to the browser. This can be done during the modem's idle time, while the user is viewing or reading the current object [FJCL99]. The same applies to portal applications: portals can prefetch content from providers by predicting future user activity. Obviously, the prediction algorithm plays a major role in the success of prefetching. Poor prediction not only fails to improve performance, but also increases the network load and fills the cache with useless copies of data, which might ultimately degrade performance.

All of the above techniques (replication, caching, prefetching) involve some overhead. If applied naively, they could lead to increased overall cost (i.e., the overhead outweighs the cost benefit derived from the method). Because the rest of this thesis is concerned with caching, we briefly consider here some of the factors taken into account by caching schemes in order to achieve net cost benefits.

Deciding whether to cache a data item depends on the data freshness (QoD) and response time (QoS) constraints, the Web site usage patterns, copyright and security issues, and the particular hardware and software environment [KF00]. For example, caching HTML files can avoid the reconstruction of HTML pages from underlying databases in the case of dynamic HTML pages. This is useful if the data is accessed more frequently than it is updated. If the update frequency is high, this method does not perform well and places a heavy load on the origin or cache server to keep cached HTML files consistent with the actual data.
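The kind of cost-based test described above can be illustrated with the following sketch, which caches a generated page only if the expected saving from serving hits locally exceeds the expected cost of keeping the cached copy consistent. The rates and costs are hypothetical inputs (in practice they would come from server logs), and this is not the cache-worthiness scoring scheme proposed later in this thesis.

// Cache a page only if the expected saving from serving hits locally exceeds
// the expected cost of keeping the cached copy consistent with the database.
public class CachingDecision {

    static boolean worthCaching(double accessesPerHour,
                                double updatesPerHour,
                                double generationCostMs,    // cost to rebuild the page once
                                double invalidationCostMs)  // cost per update to refresh/invalidate
    {
        double expectedSaving   = accessesPerHour * generationCostMs;
        double expectedOverhead = updatesPerHour * (invalidationCostMs + generationCostMs);
        return expectedSaving > expectedOverhead;
    }

    public static void main(String[] args) {
        // Read-mostly page: 500 accesses/hour, 2 updates/hour.
        System.out.println(worthCaching(500, 2, 40, 5));    // true  -> worth caching
        // Rapidly changing page: 50 accesses/hour, 200 updates/hour.
        System.out.println(worthCaching(50, 200, 40, 5));   // false -> not worth caching
    }
}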

2.1.4 Benchmarking

The performance of systems can be measured and compared using a benchmark [JBW99b]. Performance measures include throughput, scalability, response time, etc. There are different benchmarks available, ranging from load-testing systems such as RadView's WebLoad [Rad] to benchmarks specifically developed for e-business applications, such as WebEC [JBW99b, JBW99a], TPC-W [Tra01], ECPerf [Sun02], and its successor SPECjAppServer [Suna].

Workload generators generate representative Web references and hence simulate the behavior of Web applications. Generating a workload can be done using either a trace-based or an analytical approach. In a trace-based approach, a previous workload of the system is used: the recorded workload is replayed several times to generate the desired workload for testing purposes, and the time frame over which the workload is imposed can be varied accordingly. In an analytical approach, a mathematical model of the workload characteristics is used and the workload is generated according to this model [BG98]. The most important parameter in workload generators is the access frequency of individual Web pages. Some research [BCF+99b, BCF+99a] has shown that the relative frequency of accesses to pages of a Web site follows Zipf's law, according to which the probability of access to the i-th most popular page is proportional to 1/i. Further research shows that this distribution actually follows a Zipf-like distribution, 1/i^α, with α normally less than and close to 1. This research also shows that there is only little correlation between the access frequency of a document and its size, and that the correlation between access frequency and update rate is very low to none. Therefore, to model Web accesses one can simply assume a Zipf-like distribution for accesses with no correlation with response size and update rate.
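A minimal sketch of an analytical workload generator along these lines is given below: it draws page ranks from a Zipf-like distribution 1/i^α by inverting the cumulative distribution. The number of pages and the value of α are illustrative parameters.

import java.util.Random;

// Draws page ranks (1 = most popular) from a Zipf-like distribution 1/i^alpha.
public class ZipfWorkloadGenerator {
    private final double[] cumulative;   // cumulative access probabilities
    private final Random rng = new Random();

    public ZipfWorkloadGenerator(int numPages, double alpha) {
        double[] weights = new double[numPages];
        double norm = 0.0;
        for (int i = 1; i <= numPages; i++) {
            weights[i - 1] = 1.0 / Math.pow(i, alpha);
            norm += weights[i - 1];
        }
        cumulative = new double[numPages];
        double running = 0.0;
        for (int i = 0; i < numPages; i++) {
            running += weights[i] / norm;
            cumulative[i] = running;
        }
    }

    // Returns the 1-based rank of the next requested page (inverse-CDF sampling).
    public int nextPage() {
        double u = rng.nextDouble();
        for (int i = 0; i < cumulative.length; i++) {
            if (u <= cumulative[i]) {
                return i + 1;
            }
        }
        return cumulative.length;
    }

    public static void main(String[] args) {
        ZipfWorkloadGenerator gen = new ZipfWorkloadGenerator(1000, 0.9); // alpha close to 1
        for (int r = 0; r < 5; r++) {
            System.out.println("request page #" + gen.nextPage());
        }
    }
}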

For e-business frameworks, a more complicated benchmark is needed to capture the differences in business models. Different business models include intermediary (e-broker), manufacturer, and auction models [JBW99b, JBW99a]. The differences in business models imply that a single benchmark cannot be created for all business applications. TPC-W is a benchmark that simulates the activity of a retail store. The performance metric for this benchmark is the number of Web interactions per second; more specifically, performance is measured as Web Interactions per Second at a tested scale factor (WIPS@scale factor), where the scale factor is the number of items in the table used for items [Tra01]. ECperf [Sun02], which is now called SPECjAppServer [Suna], is a benchmark to measure the scalability and performance of J2EE servers and platforms. It is based on a supply-chain manufacturing model. WebEC [JBW99b, JBW99a] is another benchmark, created for an e-broker business application.

2.2 Web Data Caching: An Overview

A Web cache stores Web resources for future requests. It is located somewhere between the client and the origin content provider. Candidate Web objects for caching include HTML pages, images, audio/video files, XML pages or fragments, query results (e.g., SQL), results of dynamic Web pages (e.g., JSP/Servlet, ASP, PHP), and programs (e.g., Java applets). When the cache server receives a request, it checks the cache to see whether the request can be answered locally or not. If so, the result is sent to the client; otherwise, the request is forwarded to the content provider. A Web cache can result in one or more of the following benefits:

• Reducing network traffic and therefore reducing network costs for both content providers and consumers

• Reducing user-perceived delay

• Reducing load on the Web/application server and the database server for dynamically generated Web pages from back-end databases

• Increasing reliability and availability of Web application servers

• Reducing hardware and support costs

Deploying cache servers close to clients (e.g., browser or proxy caches) reduces network traffic. When a hit is detected, the content can be served to the user from the cached copy. This eliminates the need for receiving the content from the original server, which in turn avoids additional network traffic.

One of the important aspects of caching is that it can reduce user-perceived delay. When the content is served from a shorter distance, users experience less delay. In other words, by answering requests locally, caches hide or reduce network latency. This is a very important aspect of caching in today's Web applications, and directly relevant to the problem of losing e-commerce customers under the “8 second rule” mentioned above.

Caching can also improve performance by decreasing the load on the Web application server or database server. Caching the result of dynamic Web pages, such as JSP, ASP, or PHP pages, on the Web application server reduces the computation cost of generating these pages from the back-end database each time the page is requested. Moreover, caching the results of parameterized queries, such as SQL queries, on the database server and using them for subsequent requests reduces the computation cost. This improves performance when the query execution time is a major cost and page generation needs to access the database through expensive SQL statements [FYVI00, YFVI00, LR00, LR01, GO01].

Using local caches can hide temporary network unavailability during network outages, making the network appear more reliable. This is especially important for delivery of multimedia objects, such as video or audio, where consistent bandwidth and response times are important [BO00]. Moreover, availability can be improved by deploying cache servers in a reverse fashion. A reverse proxy is managed by or on behalf of content providers and improves the scalability of their site. In this case, cache servers improve the availability and fault tolerance of Web servers and the network seems more reliable. During Web server down-time, requests can be answered with cached copies even if the cached copy is not fresh; research shows that most users prefer to be given stale data rather than an error message if the server is down [BO00]. The reverse proxy server can also act as a load balancer if a farm of Web application servers is being used.

Finally, caching data on remote machines reduces the load on the origin server. When a hit occurs, the request can be answered using the cached copy, which would otherwise have to be requested from the origin server. This lightens the load on the origin server, which in turn reduces hardware and support costs.

In the rest of this chapter we discuss Web caching in more detail and consider the major issues relevant to caching.

2.2.1 Cache Hierarchy

Data items can be cached within the following nodes: Web browser, proxy server, Web/application server, server accelerator, database server, database accelerator, transparent proxy, Content Delivery/Distribution Network (CDN) services, and application program. Where a particular data item is cached among the eligible nodes depends both on behavioral information (e.g., the access and update pattern of the original data) and on the processing capability of the given node [YFVI00].

Browser: Caching is a feature supported by nearly all traditional Web browsers in use on PCs today. Most browsers can store static objects, such as images that a user has accessed on the Web, in a directory on the user's hard drive. The browser is configured to allocate a certain amount of hard drive space for this purpose. Browser caching can speed up the rendering of pages that contain cached objects. When a URL is requested, the browser first looks at its cache. Depending on the browser's configuration, if it finds the object, the browser will load it from the cache rather than connecting to the origin server to get a new one. If the object is not available in the cache, the browser retrieves it from the origin Web server and saves it to its local cache for future requests. A browser cache may help speed up the delivery of some static page elements, such as images, but does little to off-load the computation required to construct dynamic pages on the origin Web servers. Another reason browser caching is not very effective is that content providers tend to mark even their statically generated content with special HTTP headers, such as a "Pragma: no-cache" header or an expiration in the past, that render the content not cacheable. Content providers do this because they want to maintain control over their content, which is especially important when that content is changing frequently. It is impossible for content providers to retrieve or check the freshness of cached objects at Web browsers once those objects have been delivered, so content providers are careful about which objects they allow the browser to cache [Ora01b].
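For example, a content provider implemented as a Java servlet might mark a personalized response as non-cacheable (or, alternatively, give a rarely changing response a short lifetime) with standard HTTP headers, roughly as in the sketch below. The servlet name and content are invented.

import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Illustrative servlet showing how a provider controls browser and proxy caching
// of its responses through standard HTTP headers.
public class AccountPageServlet extends HttpServlet {
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        // Personalized content: forbid caching anywhere.
        resp.setHeader("Cache-Control", "no-cache, no-store");
        resp.setHeader("Pragma", "no-cache");        // understood by HTTP/1.0 caches
        resp.setDateHeader("Expires", 0);            // an expiration date in the past

        // Alternatively, rarely changing content could be given a 10-minute lifetime:
        // resp.setHeader("Cache-Control", "max-age=600");
        // resp.setDateHeader("Expires", System.currentTimeMillis() + 600000L);

        resp.setContentType("text/html");
        PrintWriter out = resp.getWriter();
        out.println("<html><body>Account page generated at " + new java.util.Date() + "</body></html>");
    }
}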

Proxy Server: Proxy servers are located between a large number of client machines, such as an ISP's or Intranet's users, and the Internet. Similar to the browser cache, when a request is received, the proxy server checks its cache. If the object is available, it sends the object to the client. If the object is not available, or if it has expired, it requests the object from the origin server and sends it to the client; the object is then stored in the proxy's local cache for future requests. The Squid Web Proxy Cache [Squ] is an example. Web browsers need to be configured to refer to proxy servers. Unlike the browser cache, which deals with only one user, a proxy cache deals with a large number of users. The main problem of browser and proxy caches is that they only deal with static Web pages and do little or nothing about dynamic Web pages. Despite this, proxy caches are still the most common caching strategy for Web pages [FYVI00, YFVI00]. With emerging personalized Web pages and Web databases, Web pages are no longer static. Specifically, in E-Business applications, Web pages are highly dynamic and personalized, which prevents them from being easily cached at a proxy. Caching dynamic Web pages at the proxy server can be enabled by sending and caching some programs, such as Java applets, at the proxy server. These programs generate the dynamic part of some Web pages, while the static part can be provided directly from the cache [LN01, LNK+00, LCD01].

Web/Application Server: The Web/application server is more likely to be able to cache dynamic Web pages. Generating dynamic Web pages (e.g., JSPs/Servlets) puts a lot of workload on Web/application servers and database servers due to processing and generating the dynamic content. Caching the result of such pages can reduce the workload on the Web/application server and back-end database [CLL+01]. On a hit, the Web/application server answers the request using the cache if the entry is still valid. Changes in the back-end database invalidate the relevant Web pages that use the modified data. For this purpose, the Web/application server creates an entry for each cached page in a table called the cache validation table; changes in the back-end database invalidate the relevant entries in the table. When the Web/application server detects a hit, it checks the relevant entry to see whether the page is still valid or not. Current application servers such as BEA WebLogic (http://www.bea.com), IBM WebSphere Application Server (http://www.ibm.com), and Oracle Application Server (http://www.oracle.com) support caching dynamic Web pages. To provide more scalability, the application can be distributed over different Web/application servers.
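The cache validation table just described might be realized along the following lines: a map from cached-page keys to a validity flag, with the data-access layer (or database triggers) invalidating the entries of pages that depend on modified tables. The dependency bookkeeping shown here is a deliberate simplification, not the mechanism of any particular application server.

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArraySet;

// A simplified cache validation table: each cached page has a validity flag,
// and a change to a database table invalidates every page that depends on it.
public class CacheValidationTable {
    private final Map<String, Boolean> valid = new ConcurrentHashMap<String, Boolean>();
    // pageKey -> database tables the page was generated from
    private final Map<String, Set<String>> dependsOn = new ConcurrentHashMap<String, Set<String>>();

    public void register(String pageKey, Set<String> tables) {
        dependsOn.put(pageKey, new CopyOnWriteArraySet<String>(tables));
        valid.put(pageKey, Boolean.TRUE);
    }

    public boolean isValid(String pageKey) {
        return Boolean.TRUE.equals(valid.get(pageKey));
    }

    // Called when the back-end database reports a change to a table.
    public void tableChanged(String table) {
        for (Map.Entry<String, Set<String>> entry : dependsOn.entrySet()) {
            if (entry.getValue().contains(table)) {
                valid.put(entry.getKey(), Boolean.FALSE);   // regenerate on the next hit
            }
        }
    }
}

On a hit, the application server checks isValid(pageKey) and serves the cached page only if it is still marked valid; otherwise the page is regenerated from the database and re-registered.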

Server Accelerator: Cache servers can be deployed in front of the Web/application server. This type of caching solution is known as reverse proxy or server acceleration. Unlike a proxy server, which caches content from an unlimited number of sources, a server accelerator caches content for one or a small number of origin servers. It intercepts requests to the Web/application server and either answers the request (if the result is cached) or forwards the request to the origin server. After a cache miss, the server accelerator caches any cacheable result returned by the origin server and forwards the reply back to the requester. Server accelerators can be used to cache dynamic Web pages as well. Some examples include IBM WebSphere Cache Manager [IBM] and Oracle 9i AS Web Cache [Ora], which promise caching of dynamic and personalized Web pages. They can decrease the processing overhead on the origin Web/application server and the back-end database server, and by decreasing such overheads they increase the throughput of the Web/application server. They also increase the reliability of the Web/application server when the server is down or a crash occurs by serving requests from the cache; in this case, they can even answer requests with out-dated cached copies. Some research suggests that for many applications (such as e-business applications) it is better to serve out-dated results than not to answer the request at all.

Database Server: When dynamic Web pages are generated by querying a back-end database, the results of such queries can be cached at the database server. For example, caching the results of SQL queries on the database server as materialized views and using them for future requests can reduce the computational cost on the database server. Similarly, the results of XML queries could be stored in an XML database.

Database Accelerator: Unlike the database cache, which is deployed at the data server, the database accelerator is deployed at the application server. It accelerates the processing of database queries by caching common data sets. This kind of caching is also known as middle-tier caching. A database accelerator increases performance by reducing the communication between the application server and the database server. It also reduces the load on the back-end database, resulting in more scalability [Ora01a, Tim]. Products such as Oracle Application Server Database Cache [Ora01a] and TimesTen [Tim02] provide this kind of caching.


• User requests content from origin server

• Accelerator receives the request and checks the cache. Two cases may occur:
  MISS:
  – Cache server asks the origin server for the result or missing fragments.
  – Origin server sends the result.
  – Result is sent back to the user and a copy is stored in the cache.
  HIT:
  – The copy in the cache is fresh (by checking Expires in the HTTP header) and is sent to the user, OR
  – Accelerator asks the server for the freshness (using an If-Modified-Since request).
  – Server validates the freshness or sends the fresh result to the accelerator.
  – Result is sent back to the user and a copy is stored in the cache.

Figure 2.5: Server accelerator


Transparent Proxy: Transparent proxy caching eliminates one of the big drawbacks of the proxy server approach: the requirement to configure Web browsers to refer to a specific proxy. Transparent caches work by intercepting HTTP requests and redirecting them to Web cache servers or cache clusters. This style of caching establishes a point at which different kinds of administrative control are possible; for example, deciding how to load-balance requests across multiple caches [BO00].

Content Delivery/Distribution Network (CDN): Caching Web objects has already created a multi-million dollar business: Content Delivery/Distribution Networks (CDNs) [Ora]. Companies such as Akamai [Aka] and Digital Island [Dig] have been providing CDN services for several years. CDN services are designed to deploy cache or replication servers at different geographical locations. The first generation of such services aimed at caching or replicating static Web pages or fragments, such as HTML pages, images, audio and video files, at special servers called “edge servers”. These servers are deployed in different geographical areas all around the world and serve requests or parts of them. Static content is not likely to change frequently and will most likely be requested by other users in the same geographic area. In theory, moving content closer to the end users reduces the network traffic and also shortens response times. By caching frequently accessed content closer to end users, the number of router hops is reduced and data reaches its destination more quickly. Lightening the traffic to/from Web/application servers and database servers leaves more power for them to process and generate dynamic Web pages [Mar00, Ora01b]. Examples of edge servers include Akamai EdgeSuite [Aka] and IBM WebSphere Edge Server [IBM]. They can be used in a reverse or forward set-up. In a reverse set-up, the host name is used for the edge server; therefore, edge servers can intercept the request and decide whether to answer the request from their cache or forward it to the origin server. In a forward set-up, the request first goes to the Web/application server and a decision is made to serve the request based on the cache at the edge server(s). Using Edge Side Includes (ESI) [Edg], the origin server returns a template (rather than the actual Web page) with references to fragments, such as image files, which exist in the edge server.


• User requests content from origin server.
• Server replies to the request with in-line references to fragments included in the edge server.
• Client machine asks the edge server for the fragments.
• Fragments are sent to the client machine and assembled at the client's machine.

Figure 2.6: Edge servers (forward setup)

ESI enables the definition of different cacheability for different fragments of an object. Processing ESI at these servers enables dynamic assembly of objects at edge servers, which otherwise may be done at the server accelerator, proxy server, or browser. Detecting fragments for caching might also be done automatically [RILD05]. Typical customers of CDNs are large Web sites like cnn.com, yahoo.com, and microsoft.com. Note that the caches are not necessarily filled with the most popular content but rather with the content of the Web sites buying the service.

Application Program: Some applications may need a customized caching technique, and the existing caching solutions might therefore be insufficient. Application-level caching is normally enabled by providing a cache API, allowing application writers to explicitly manage the cache to add, delete, and modify cached objects. A system that provides a generic application-level cache is presented in [DIR01].


• User requests content from origin server.
• Edge server receives the request and checks the cache. Two cases may occur:
  MISS:
  – Edge server asks the origin server for the result or missing fragments.
  – Origin server sends the result.
  – Result is sent back to the user and a copy is stored in the cache.
  HIT:
  – The copy in the cache is fresh (by checking Expires in the HTTP header) and is sent to the user, OR
  – Edge server asks the server for the freshness (using If-Modified-Since).
  – Server validates the freshness or sends the fresh result to the edge server.
  – Result is sent back to the user and a copy is stored in the cache.

Figure 2.7: Edge servers (reverse setup)

Object Caching Service for Java (OCS4J) is a caching system used in Oracle9i that enables caching of static and non-static Java objects [Bor04]. OCS4J is also referred to as JSR-107; JSRs (Java Specification Requests) are specifications, in this case of such a caching system, for the Java platform [Sunb]. JCache Open Source is an effort to make an open source version of JSR-107 [Sou04]. JCS (Java Caching System) is an attempt by Apache [Apa] to build a system close to JCache based on JSR-107 [Apa04].

2.2.2 Caching Issues

Despite the advantages of caching, there are some issues which should be considered when applying caching techniques.

• Misleading Statistics: Many e-commerce Web sites need to know the number of hits they get, because their businesses are valued by the number of hits they receive. If the content is delivered by a cache without notifying the origin server, the user statistics collected on the origin server will be misleadingly low. This is one reason for many content providers to make their content non-cacheable, e.g., by specifying a "Pragma: no-cache" header or an expiration date in the past [Ora01b].

• Copyright Protection: Providers of copyrighted material want to make their material available only to those who have arranged to pay a fee, such as pre-payment for an access code, payment by credit card, or in some cases direct electronic payment via e-cash. A potentially serious problem with caching is that copyrighted materials residing in a cache are available for further access, either by the original end-user or by other end-users, and there is no mechanism to enable payment to the copyright holder for such secondary access. Given that proxy cache servers relabel requests for information from the end-user to the server, it becomes difficult for such payment mechanisms to work. There is also a legal issue of whether caching without explicit permission of the copyright owner results in infringement, even in the case that no payment is demanded. At present, the only way of dealing with this situation is for the copyright holder to tag the material as non-cacheable. This prevents it from being served to secondary users, but also effectively prevents it from being cached. As more sites choose to enforce copyright in such a manner, the effectiveness of existing caching schemes will be severely impaired.

• Privacy: A cache implicitly contains records of an individual's Web browsing activities. This is more an issue with local caches, since an ownership relationship exists between cache files and the end-user. At the level of a proxy cache server, such information is lost due to the multiple-access nature which is its reason for existence. Whether the access trail in a local cache is more easily examined than network activity logs or direct packet-level traces is open to some question; however, insofar as caching may facilitate snooping on an end-user's habits and activities, it holds great potential for abuse.

• Security: The use of the Internet for transferring sensitive data, such as personal information or financial transactions, raises challenging security issues. For example, in a university database some data can only be accessed by academics. If these data are cached at a proxy and a student attempts to access them, there should be mechanisms to prevent him/her from doing so. It should be possible to efficiently impose the security policy of data sources on cached data. However, if some data are cached outside the data source, imposing the access policy on cached data will be a challenging issue for data sources. Clearly, caching of pages exchanged in a secure transaction should be protected from secondary unauthorized access. But simply preventing secure transactions from being cached forces the end-user to re-request the data from the content provider. As more transactions are made in a secure manner, caching will lose its effectiveness. One of the solutions to security issues is the use of encrypted sessions using, for example, the Secure Socket Layer (SSL) protocol for Web transactions.

2.2.3 Distributed Cache Management

The performance of individual cache servers increases when they collaborate with each other by answering each other's misses. Protocols such as the Internet Cache Protocol (ICP) [WC97], Summary Cache [FCB00], and the Cache Array Routing Protocol (CARP) [Mic97] enable collaboration between proxy servers to share their contents.

ICP was developed to enable querying other proxies in order to find requested Web objects. When a cache miss occurs, the server sends an ICP message to all its neighboring proxies. The neighbors send back ICP replies indicating a miss or a hit. After receiving the first hit reply, the requesting server asks the relevant neighbor to send the object. In this protocol the number of messages exchanged between proxy servers increases dramatically as the number of servers increases. Therefore, the communication and computation cost of transferring and processing ICP messages may outweigh the benefit of caching itself.

It is desirable that each proxy be able to forward the request to a proxy which can answer the request from its cache, or is at least likely to contain the data in its cache. This can be achieved by keeping some meta-data about the caches maintained at other proxies. When the number of proxies becomes large, it is not possible to keep complete meta-data, and a summary might be kept instead. This summary should be accurate enough to balance the volume of meta-data against the communication overhead of sending and receiving messages [LCD01, PF00, CK00b, FCB00, RBS01, Cha00]. In Summary Cache, each cache server keeps a summary table of the content of the cache at other servers. When a cache miss occurs, the server probes the table to find the missing object in other servers. It then sends a request only to those servers expected to contain the missing object. This protocol involves a trade-off between the storage space required for summary tables and the accuracy of the summaries. In other words, the summaries do not need to be complete and may only represent a subset of the real summary table. Moreover, they can be updated on a periodic basis or when the divergence between the summary tables and the content of the cache exceeds a threshold.
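A sketch of the summary-table idea follows: each proxy keeps a digest of the URLs its neighbours hold and, on a local miss, forwards the request only to neighbours whose digest contains the URL. Here the digest is a plain hash set per neighbour rather than the compact (and possibly lossy) structure a real implementation would use, and the surrounding proxy machinery is omitted.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// On a local miss, consult the per-neighbour summaries and ask only the proxies
// that are expected to hold the object (instead of broadcasting, as ICP does).
public class SummaryDirectory {
    // neighbour name -> summary of the URLs cached there (refreshed periodically)
    private final Map<String, Set<String>> summaries = new HashMap<String, Set<String>>();

    public void updateSummary(String neighbour, Set<String> cachedUrls) {
        summaries.put(neighbour, new HashSet<String>(cachedUrls));
    }

    // Neighbours that probably hold the URL; stale summaries may cause false
    // positives (wasted probes) or false negatives (the request goes to the origin).
    public List<String> candidatesFor(String url) {
        List<String> result = new ArrayList<String>();
        for (Map.Entry<String, Set<String>> entry : summaries.entrySet()) {
            if (entry.getValue().contains(url)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}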

CARP is essentially a routing protocol. In CARP, all proxy servers are included in an array membership list. For each proxy, a hash value is computed from the name of the proxy; a hash value is also computed from the requested URL. These hash values are combined, and the proxy with the highest combined value is determined to be the owner of the URL. Using a deterministic way to choose the proxy that caches a URL, and likewise to find the owner of a cached URL, eliminates the need for sending messages between proxy servers to locate a cached URL. It also eliminates the duplication of cached objects in different proxy servers.
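The deterministic owner computation in CARP can be sketched as follows: combine a hash of each proxy's name with a hash of the requested URL and pick the proxy with the highest combined value. The combining function below is a simple stand-in for the one defined by the actual protocol.

import java.util.Arrays;
import java.util.List;

// CARP-style routing: the owner of a URL is the proxy whose combined
// (proxy-name hash, URL hash) score is highest. Every participant using the
// same member list computes the same owner, so no lookup messages are needed.
public class CarpRouter {
    private final List<String> proxies;

    public CarpRouter(List<String> arrayMembershipList) {
        this.proxies = arrayMembershipList;
    }

    public String ownerOf(String url) {
        String owner = null;
        long best = Long.MIN_VALUE;
        for (String proxy : proxies) {
            // Stand-in combining function; the real protocol defines its own.
            long score = (31L * proxy.hashCode()) ^ url.hashCode();
            if (score > best) {
                best = score;
                owner = proxy;
            }
        }
        return owner;
    }

    public static void main(String[] args) {
        CarpRouter router = new CarpRouter(Arrays.asList("proxy-a", "proxy-b", "proxy-c"));
        System.out.println(router.ownerOf("http://example.com/catalog/item42"));
    }
}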

2.2.4 Cache Coherency

Changes to the original data sources should be effectively propagated to cached copies; in other words, the cached copies should be kept consistent with the original data. This can be achieved either by invalidating or by refreshing the copy of the data in the cache. The decision between these two options should be made using a cost-based approach, based on the usefulness of the data item: if the data item is likely to be accessed frequently and will not change in the near future, it is worth refreshing the cache; otherwise, the copy in the cache should be invalidated. It is also desirable to send only the changes (deltas) to the cache instead of re-sending the whole data set. On receiving a delta, the cache manager updates the data in the cache [MBFS01, MACM01, NACP01, TIH01].

The basic HTTP protocol provides mechanisms for caching which aim at either eliminating the need to send requests to servers or eliminating the need to send full responses by servers. The former reduces the num- ber of network round-trips while the latter reduces network bandwidth usage. A server or client uses caching directives and includes them in Cache-Control header. The two major mechanisms for caching provided by HTTP are: the Expires response header and the If-Modified-Since request header [FGM+99].

• The Expires response header is the simplest mechanism for determining the freshness of a cached object: the object's Expires header is simply compared with the current time (in GMT). This mechanism is also referred to as Time-To-Live (TTL). The cache manager can generally serve objects that have not yet expired without disturbing the Web/application server where the object originated. If content providers know when their content changes, this approach takes full advantage of caching; otherwise, they have to set a short expiry time, or an expiry time in the past to prevent caching altogether, which negates the benefit of caching. This mechanism aims at eliminating the need to send subsequent requests for the same object before it expires.

• The If-Modified-Since request header is another mechanism the cache server uses to manage cache coherency. Before serving a cached copy, the cache server checks with the origin server whether a newer version is available. If a newer version exists, the cache server requests it from the origin server and refreshes its cache as it serves the new object; if the object has not changed, the cache server delivers the cached object. This mechanism aims at eliminating the need to send a full response when the object is still valid, i.e., the server returns only a validation response with status 304 (Not Modified).
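As a simple illustration of how a cache might combine the two mechanisms (the class and field names are assumptions, not part of the HTTP specification): serve unexpired copies directly, and otherwise revalidate with a conditional request that may be answered by a 304 reply.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Instant;

// A cached entry with the freshness information taken from the response headers.
class HttpEntry {
    String url;
    String body;
    Instant expires;       // from the Expires response header
    String lastModified;   // from the Last-Modified response header
}

class HttpCoherency {
    private final HttpClient client = HttpClient.newHttpClient();

    String serve(HttpEntry entry) throws Exception {
        if (entry.expires != null && Instant.now().isBefore(entry.expires)) {
            return entry.body; // still fresh: no contact with the origin
        }
        HttpRequest request = HttpRequest.newBuilder(URI.create(entry.url))
                .header("If-Modified-Since", entry.lastModified)
                .GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() == 304) {
            return entry.body; // validated: cached copy is still current
        }
        entry.body = response.body(); // refreshed copy from the origin
        return entry.body;
    }
}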

Different applications have different coherency requirements. Applications such as stock market data or e-commerce applications need a higher level of coherency than applications such as a telephone directory or the catalog of a bookshop. The former are said to require a strong coherency mechanism while the latter need only a weak coherency mechanism. Applications with strong coherency requirements fall into two groups. In some, such as stock market data, the coherency requirements are strict and cached copies of data must be 100% consistent with the original data: whenever a change occurs on the original data, the cached copies must be updated or invalidated without delay. Theoretically, there will always be some delay in updating or invalidating the cached copies because each message must travel through the communication network; in practice, this delay is generally very short, and so we ignore it for the purpose of our analysis. In other applications, such as weather forecast data, the coherency requirements are not so strict and, for example, 95% consistency can be tolerated. The distinction between these two is helpful in deploying an appropriate coherency mechanism. For example, under strict coherency a pull-based approach based on "polling-every-time" can guarantee 100% freshness, whereas under strong (but not strict) coherency a Time-To-Live (TTL) or Time-To-Refresh (TTR) approach with an appropriate time period might be a good option. The details of these coherency mechanisms are discussed in the following sections.

According to the discussion above, coherency requirements can be distinguished in the following three categories:

• Strict: where cached copies must be coherent with original data at all times.

• Strong: where the coherency between cached copies and original data is important, but the coherency requirement is not as strict as in the former.

• Weak: where stale copies of cached data are acceptable to some extent.

The problem is therefore maintaining the required level of coherency according to the requirements of the application. Maintaining coherency can be achieved either by a pull or push approach.

Pull-based Coherency Mechanism

In the pull approach, the cache server contacts the source to check freshness. If the original data has changed, the cache server either refreshes the cache by requesting the data or invalidates the cached copy by removing it from the cache. This is also called client polling. There are two main methods by which coherency is achieved:

• Polling-Every-Time

• Time-To-Refresh (TTR)

In Polling-Every-Time, whenever a cache hit occurs, the cache server sends an If-Modified-Since (IMS) message to the origin server to check freshness. Although this mechanism provides strong cache consistency, its drawback is the overhead involved in sending an IMS message on every cache hit: the cache server must generate these messages and the origin data server must process them. Moreover, the network overhead of these messages delays responses; even when the cached copy is fresh at the time of the hit, the user still experiences the delay of sending and receiving IMS messages.

According to TTR, each client periodically checks the data source in a time period called Time-To-Refresh (TTR). Smaller TTR pro- vides stronger consistency but increases the number of refresh mes- sages. Larger TTR decreases the number of refresh messages but pro- vides weaker consistency. Adaptive TTR aims at determining the TTR based on change frequency, consistency requirements, network traffic, etc. [RDK+00, DKP+01].

In a pull-based approach, Time-To-Live (TTL) can be used to decrease the number of IMS messages in polling-every-time, or to avoid unnecessary refreshes in TTR. Under TTL, each object (document, image file, etc.) is assigned a TTL value, normally set by the content provider, which indicates approximately when the object will expire: before this time a cached copy of the object can be used, but not after it. If a cache server receives a request after the TTL has expired, the object is requested from the original server. Selecting an appropriate value for the TTL is a challenge. Smaller values provide higher consistency but less effective caching: objects expire sooner, more requests are sent to the original server, more network bandwidth is used, and users experience more latency. Larger TTL values provide better performance but lower cache coherency. Adaptive TTL approaches try to overcome this problem by adapting TTL values to the update frequency of the object [GS96]: an object that has not been modified for a long time tends to stay unchanged for longer than an object that has changed frequently. However, all TTL-based approaches fail to provide strong coherency and can only be used when weak cache coherency is acceptable.

In [GS96] the notion of an object's age is used to set the expiry time. They define an "update threshold" as a percentage of the object's age: if the time since the last validation exceeds the product of the update threshold and the object's age, the object becomes invalid, and on the next request a validation message is sent to the origin to check the object's validity. Piggyback Client Validation (PCV) [KW97] is an algorithm that checks the validity of objects in advance: by communicating with the data server, the cache checks the validity of a list of objects whose TTL has expired. This method reduces the chance of stale objects being served, and batching the validation messages reduces the communication overhead as well.
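A minimal sketch of the update-threshold rule (the names and the threshold value are illustrative):

import java.time.Duration;
import java.time.Instant;

// Update-threshold rule of adaptive TTL: an entry is treated as valid while
// the time since its last validation is below a fixed fraction of its age
// (the time since it was last modified).
class AdaptiveTtl {
    private final double updateThreshold; // e.g. 0.1 means 10% of the age

    AdaptiveTtl(double updateThreshold) { this.updateThreshold = updateThreshold; }

    boolean needsValidation(Instant lastModified, Instant lastValidated, Instant now) {
        Duration age = Duration.between(lastModified, now);
        Duration sinceValidation = Duration.between(lastValidated, now);
        long allowedSeconds = (long) (age.getSeconds() * updateThreshold);
        return sinceValidation.getSeconds() > allowedSeconds;
    }
}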

It is worth mentioning that a pull-based interaction can be done in either a synchronous or an asynchronous mode. In the synchronous mode the client sends a polling request to the server and waits, idle, until the server replies. The asynchronous mode tries to decrease or eliminate the client's waiting time: the client does not have to wait for the result of the poll and can do other work until the result arrives. This can increase the responsiveness of the application; in a Web application, for example, the results can be generated and displayed as they arrive. Technologies such as AJAX [Gar05] enable asynchronous interactions in Web applications.

Push-based Coherency Mechanism

In the push approach, the server takes responsibility for refreshing or invalidating cached copies. Content providers can achieve this by keeping a list of all cached objects and notifying the caches that hold them when the content changes. The invalidation algorithm frees the client cache manager from the burden of sending IMS messages: to ensure strong consistency, if an object A changes on the server, the server must immediately send an invalidation message to all the caches that store a copy of A. Cache managers then need not worry about object validity; as long as no invalidation message has been received, the object is valid. The invalidation approach requires the server to play a major role in the consistency control of cached objects. This can be a significant burden for the Web server, because it has to maintain state for each object that has ever been accessed, as well as a list of the addresses of all the requesters. Such a list is likely to grow very quickly, especially for popular objects, so the server must set aside substantial storage space for these lists. Moreover, an object stored in a cache whose address is kept on the server's list may later be evicted from that cache, either because it is rarely if ever requested again or because the cache manager needs free space for newly-arrived objects; it then no longer makes sense for the server to keep that cache's address on the list. Even worse, when the object is about to change, the server has to send invalidation messages to caches whose addresses are on the list but which no longer keep the object, which adds unnecessary traffic to the network.
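The following sketch (names are illustrative) shows the kind of per-object holder lists such a server-driven invalidation scheme has to maintain; the unbounded growth of these lists is exactly the burden discussed above:

import java.util.*;

// Server-driven invalidation: the origin keeps, per object, the set of caches
// known to hold a copy and notifies them when the object changes.
class InvalidationServer {
    private final Map<String, Set<String>> holders = new HashMap<>();

    void recordDelivery(String objectId, String cacheAddress) {
        holders.computeIfAbsent(objectId, k -> new HashSet<>()).add(cacheAddress);
    }

    // Called when the underlying data changes; in a strict scheme the update
    // would be delayed until every cache acknowledges the invalidation.
    void invalidate(String objectId) {
        for (String cache : holders.getOrDefault(objectId, Set.of())) {
            sendInvalidation(cache, objectId); // network send, stubbed out here
        }
        holders.remove(objectId);
    }

    private void sendInvalidation(String cacheAddress, String objectId) {
        System.out.println("invalidate " + objectId + " at " + cacheAddress);
    }
}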

To enforce a strong cache consistency all invalidation messages should be acknowledged. The server delays the updates until it receives all the acknowledgements from the cache servers to whom an invalidation mes- sage has been sent [RS02, CO02a]. However, unreachability of any cache server causes others to keep waiting and none of them gets updated ver- sion of the object. This results in stale copies of the object being used [YBS99, CO02a].

Multicasting is one modification to the invalidation mechanism that addresses the performance issue faced by the server, as discussed in [RS02]. A multicast group is assigned to each object; each multicast group is a list of cache servers and is maintained by routers. When an update occurs, the server sends a single invalidation message to the corresponding multicast group, and it is the router's job to send individual invalidation messages to all the caches whose names are included in the list [CO02a].

Piggyback Server Invalidation (PSI) [KW98] is a variation of server invalidation, which sends invalidation messages in batch. Sending in- validation messages in batch reduces the server workload compared to sending individual invalidations, as it reduces the number of invalidation messages. However, a strong coherency cannot be guaranteed using this mechanism.

Requirement   Pull                 Push
Strict        Polling-every-time   Invalidation
Strong        TTR                  Invalidation, TTR
Weak          TTL, PCV, TTR        PSI, TTR

Table 2.1: A classification of cache coherency mechanisms

Combining Push and Pull

According to [RDK+00, DKP+01, CO02b, LC98], push and pull ap- proaches have complementary properties regarding to coherency, over- heads (i.e., network, computation, and space), and resilience, as summa- rized in Table 2.2.

• Coherency: In the pull approach based on polling-every-time, strong coherency is guaranteed. In the push approach, when a change happens in the source that affects the freshness of a cached item, the source will either invalidate or refresh the cached copy. In other words, push approach can offer high coherency. In a TTR- based mechanism either on a pull or push-based approach the provided coherency depends on the TTR. Smaller TTR provides stronger consistency but increases the number of refresh messages. Larger TTR decreases the number of refresh messages but pro- vides weaker consistency. Invalidation can provide either a strict or strong coherency depending on how it is implemented and its para- meters are set. Table 2.1 shows a classification of cache coherency mechanisms.

• Network Overheads: A pull-based approach requires two messages per poll: an HTTP request followed by a response. In the TTR-based pull approach, a cache server polls the server based on its estimate of how frequently the data is changing; if the data actually changes at a slower rate, the cache server may poll more frequently than necessary. Hence, a pull-based approach is liable to impose a large load on the network. In the push-based approach, the number of messages transferred over the network equals the number of times the data changes. However, a push-based approach may push to clients who are no longer interested in a piece of information, thereby incurring unnecessary message overheads.

• Computational Overheads: Computational overheads for a pull- based approach result from the need to deal with individual pull requests. After getting a pull request from the cache server, the ori- gin server has to just look up the latest data value and respond. On the other hand, when the server has to push changes to the proxy, for each change that occurs, the server has to check if the coherency requirement for any of the caches has been violated. This compu- tation is directly proportional to the rate of arrival of new data values and the number of unique temporal coherency requirements associated with that data value.

• Space Overheads: A pull-based approach is stateless, meaning that neither the origin server nor the cache server needs to store any information for coherency purposes. In contrast, in a push-based approach the server must maintain information such as the consistency requirement of each client and the latest pushed value, along with the state associated with an open connection. Since this state is maintained for the duration of client connectivity, the number of clients the origin server can handle may be limited when the state space overhead becomes large (resulting in scalability problems).

• Resilience: By virtue of being stateless, a pull-based server is re- silient to failures. In contrast, a push server maintains crucial state information about the needs of its clients; this state is lost when the server fails. Consequently, the client’s coherency requirements will not be met until the cache server detects the failure and re-registers the coherency requirements with the server. Failures can be classi- fied in three groups, each of which has different implications on the behavior of the system.

– Origin Server: In case of server failures, state at the server is lost. Most of the push algorithms require state to be main- tained at the server and hence their correctness may get com- promised in such cases. Cache coherency is not guaranteed until the state is reconstructed at the server.

– Cache Server: Cache server may also fail. An origin server has to allocate resources to each cache server. As resources are valuable, in case of unreachable clients these resources must be reclaimed.

– Communication: Communication failures occur either due to socket failures at any one of the ends, network congestion or network partition. Push-based techniques must employ spe- cial mechanisms to deal with such errors. Otherwise, the state information kept by the server would be incorrect.

• Scalability: Pull servers are generally stateless and hence scalable; cache servers deployed all over the world are pull-based and stateless. A user sends a request and waits for the response. The primary consideration has been to make Web servers scalable. This works for many ordinary applications, but it is less effective for rapidly changing data: when data at a source changes very fast, the cache server generates a large number of requests to keep its cache synchronized with the source, incurring a large overhead in opening and closing connections, and the computational load on the server becomes high because it has to respond to far more requests.

Push servers have complementary characteristics. The server has to keep network connections open and allocate enough buffers to handle each client; with a large number of clients, state space and network resources can soon become bottlenecks and the server may start dropping requests. In short, scalability issues may arise either from excessive server computation and network traffic, or from the state space maintained at the server and the resources (such as sockets) allocated to clients, and there is a clear trade-off between these two constraints.

In summary, the pull approach does not offer high coherency when the data changes rapidly and strong consistency is required. Achieving a high coherency needs a small TTR which in turn increases network traffic and incurs extra workload on the origin server to process these messages. This workload will be significant, if the number of clients is large. On the other hand, the push approach is more likely to offer high coherency for rapidly changing data and/or strict coherency requirements. However, it increases the overhead on the origin data server to produce and send push messages. Moreover, the approach is less resilient to failures as it has to store state information for clients.

        Resilience   Coherency     Overheads (Network & Computation & State Space)
Push    low          high          low & high & high
Pull    high         high or low   high & low & low

Table 2.2: Comparison of push and pull

These properties indicate that a push-based approach is suitable when a client expects its coherency requirements to be satisfied with high fi- delity, or when the communication overheads are a bottleneck. A pull- based approach is better suited to less frequently changing data or for less stringent coherency requirements, and when resilience to failures is important. As is clear from the discussion, neither push nor pull alone is sufficient for efficient dissemination of dynamic data. The complementary properties of the two approaches indicate the need for having an approach which combines the advantages of both while not suffering from any of their disadvantages.

Leases are one attempt to combine push and pull. A lease is like a contract given to a lease holder over some property: whenever a client requests a document from a server, the server returns that document along with a lease, thereby taking responsibility for informing the client of any changes during the lease period. Once a lease expires, the client must contact the server and renew it. The client can use the cached copy while it holds a valid lease over the data item: during the valid lease period the client remains in push mode, and it switches back to pull mode after the lease expires, so the client is served alternately in push and pull modes. When the lease has expired, on the next request for the object the cache manager sends an IMS message to the origin server, and the server either responds with the new version of the object or, if the object has not changed, extends the lease and returns it to the client, and the same rule applies again. It is very important to choose a good lease period. For long lease periods, the client remains in push mode most of the time and scalability problems may arise; for small values, the lease renewal cost may be very high. The trade-off between storage space and control messages depends on the duration of the leases: with a smaller lease duration this approach behaves like a pull-based system, and with a larger lease duration it behaves like a push-based system [DST00].

The Lease algorithm can maintain strong cache coherency while keep- ing servers from indefinitely waiting due to a client failure. If a server can- not contact a client, it delays updating the object until the unreachable client’s lease expires, and from then on it becomes the client’s responsi- bility to contact the server for validation. On the other hand, the Lease algorithm needs to be implemented both at client side and on the server.

As with the TTL algorithm, the lease duration affects the efficiency of the algorithm itself. If the lease value is shorter than the interval between two requests, every subsequent request arrives after the current lease has already expired; in this case, leasing degenerates into polling-every-time, which is far from desirable. This does not mean, however, that a long lease is better: a very long lease forces the server to delay object updates until that lease expires. This problem can be addressed by introducing a "volume lease" in addition to the leases on individual objects [YADL99]. In this approach, each volume lease is assigned to a set of related objects on the same server. In order to use a cached object, a client must hold leases on both the object and the volume it belongs to; the cache manager cannot respond to a user request with a cached object unless both the object lease and the volume lease are valid. The server is free to update an object as soon as either the volume or the object lease on that object has expired. By making object leases long and volume leases short, the server can make object updates without long delays, while the long object leases save the cache manager from having to validate individual objects frequently.
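A minimal sketch of the object-plus-volume lease check (field names are assumptions):

import java.time.Instant;

// A cached copy may be served only while both its own lease and the lease on
// its volume are unexpired; the server may update the object as soon as
// either lease has lapsed.
class LeasedObject {
    Instant objectLeaseExpiry;
    Instant volumeLeaseExpiry;
    String body;

    boolean servableFromCache(Instant now) {
        return now.isBefore(objectLeaseExpiry) && now.isBefore(volumeLeaseExpiry);
    }
}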

Adaptive leases determine the lease duration adaptively based on cur- rent information such as access and update frequencies, amount of avail- able storage space for keeping state information and workload on the original server [DST00].

To combine push and pull, the proxy can operate in pull mode using some TTR algorithm, while the server is in push mode and knows the co- herency requirement [RDK+00]. Using this requirement and proxy access patterns the server tries to predict when a client is going to poll next. If it determines that within this predicted time the client is likely to miss a relevant change, it pushes that change to the client. For predicting the client connection times, the server may run the TTR algorithm in parallel with the client or use some simpler approximation of it. In the ideal case, the coherency offered will be 100%, but due to synchronization problems and other factors, it will be slightly less. But, it will always be much greater than pull. Because of the pull component the resilience of the system will be high. Also, due to the push component, communication overheads will be low. This algorithm has parameters, such as window size for pushing the changes to client, which swing it towards more push or more pull, and thus its performance in terms of fidelity and coherency can be controlled.

Another possibility is to divide the incoming clients at the server into push and pull clients and dynamically switch them from one mode to the other [RDK+00]. If resources are plentiful, every client is given a push connection irrespective of its coherency requirements, which ensures that the best coherency is offered. As more and more clients start requesting the service, resource contention may arise at the server, leading to performance problems; some clients are then shifted to pull mode, so that valuable resources are freed and the system scales properly. Conversely, when resources become available again, high-priority clients can be switched back to push mode, ensuring high coherency. The most important issue is how to assign priorities to different clients. Possible parameters include the access frequency of each client, its temporal coherency requirement, and the available network bandwidth. Clearly no single criterion suffices, but collectively they have the potential to offer high average coherency while keeping the system scalable.

Web Cache Invalidation Protocol (WCIP)

WCIP enables maintaining consistency using invalidations and updates. In server-driven mode, cache servers subscribe to invalidation channels maintained by an invalidation server. The invalidation server sends in- validation messages to channels. These invalidation messages will be re- ceived by cache servers. In client-driven mode cache servers periodically synchronize the objects with the invalidation server. The interval depends on coherency requirements [LCD01].

Best-Effort Synchronization

As mentioned earlier, serving stale (out-of-date) data from a cache can be acceptable for some applications, whether because of the coherency requirements of the application or because of bandwidth or other resource constraints. It is then desirable to minimize the overall divergence between source objects and cached copies by selectively refreshing modified objects, which requires choosing which objects to refresh. The algorithm for selecting the best objects to refresh is called best-effort synchronization. In most approaches, the cache coordinates the process and selects the objects to refresh. A best-effort synchronization scheduling policy is provided in [OW02].

In this approach, the resource limits on cache synchronization may occur at a number of points, including the capacity of the link connecting the cache to the rest of the network (cache-side bandwidth) and the capacity of the link connecting each source to the rest of the network (source-side bandwidth); moreover, all bandwidth capacities may fluctuate over time if traffic is shared with other applications. The technique is claimed to apply more generally to other types of resource limitation, such as limited computational resources at the sources available for cache synchronization due to local processing load, and limited resources at the caches.

2.2.5 Dynamic Content Caching

While static Web pages were sufficient for the first generation of Web sites, they cannot support the requirements of many of today's Web applications, such as e-businesses [Ora01b]. Dynamic Web pages are generated automatically by querying a back-end database and wrapping the result in HTML format; a JSP/Servlet, ASP, or PHP page is usually used to query the database and produce the appropriate HTML file. Regenerating a Web page, even with the same input parameters, may produce a different HTML page, as the underlying data may have changed since the last time the database was queried. While the static part of such a Web page can be cached, the dynamic part may not be easily cached; the resulting Web page can then be assembled by requesting the dynamic part separately [Abe01, Tec01, Chu, Aka, CB00]. To enable caching of dynamic parts, changes to the back-end database must be detected effectively, and these changes should then invalidate or refresh the cached copies [Ora01b, AJL+02, CLL+01, CID99, Dyn]. In what follows, we refer to such content as "dynamic Web pages" or "dynamic objects". Dynamic content can be categorized into two groups:

• The first group includes pages which are assembled on-the-fly based on the user request/preference. These include personalized Web pages for users or user groups containing different images, news pages, etc., such as “My Yahoo”.

• The second group includes pages which are generated dynamically by a program running on a Web server, typically with access to a back-end database (e.g. through JDBC), and formatting the results into HTML or some other format for presentation. Such dynamic pages clearly depend on the current values in the underlying data store. When such underlying data is modified (e.g. the price of a product is changed), a number of pages are typically affected.

The latter is more complicated from coherency point of view. Deter- mining the relevant objects based on the changes in the back-end data- base is more challenging. In this work we are primarily interested in the second group.

Dynamic Web pages (e.g. those produced by JSPs/Servlets) can be cached at a cache server; Web/application servers or server accelerators are particularly well placed to cache such content. For this purpose a look-up table is kept at the Web/application server or server accelerator which stores the URIs of cached pages. When a change (e.g. in the back-end database) affects the freshness of a page, the relevant entry in the look-up table is invalidated. When a cache hit occurs at the Web/application server, the look-up table is probed to see whether the page is still fresh; if it is, the request can be answered from the cache. Similarly, if a hit occurs at a server accelerator, a validation request message is sent to the Web/application server to validate the cached copy. The Web/application server replies positively or negatively: if positive, the object can be served by the accelerator from its cache; otherwise, the Web/application server sends the object to the accelerator.
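A sketch of such a look-up table (the names are illustrative) might look as follows:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Per-URI freshness table kept by a Web/application server or accelerator.
// A back-end change marks the affected entries invalid; a cache hit first
// probes the table before the page is served from the cache.
class FreshnessTable {
    private final Map<String, Boolean> fresh = new ConcurrentHashMap<>();

    void registerCachedPage(String uri)   { fresh.put(uri, Boolean.TRUE); }
    void invalidate(String uri)           { fresh.put(uri, Boolean.FALSE); }
    boolean canServeFromCache(String uri) { return fresh.getOrDefault(uri, Boolean.FALSE); }
}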

There are different solutions to relating changes in the back-end data- base to the freshness of a page in the cache. Triggers are one option that the content provider can use to cause changes on the base data to val- idate (refresh) or invalidate the cached copy. When a limited number of cache servers is involved, it is feasible to make use of such triggers. However, managing and handling triggers involves overhead on the data server, and this overhead increases significantly as the number of cache servers increases.

Using materialized views is another option for queries submitted on base data. When the data is generated by the data server, it is cached as a materialized view either on the data server or the Web/application server and triggers can be defined on these materialized views. The difference with the previous approach is that in this approach the views can be managed by the application (if it runs the same DBMS as data server or a DBMS which can collaborate with the data server for view management).

It provides a more expressive way of defining and using triggers, but also incurs a significant workload for view and trigger management on the database server [AJL+02].

An alternative to using database mechanisms for invalidation is to use the URLs that invoke access to the back-end database. Fine-grained invalidation (e.g., by the exact URL resulting from an HTML form submitted with the GET method) incurs considerable workload but gives more accurate results; coarse-grained invalidation (e.g., invalidating all cached copies sharing a URI prefix) incurs less workload on the data server but is less accurate. An example of coarse-grained invalidation is invalidating all queries on a table whenever anything in that table changes, regardless of whether the change affects the query results. By combining fine-grained and coarse-grained invalidation, it is possible to strike a trade-off between workload and invalidation quality [CAL+02].
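The following sketch (illustrative names) contrasts the two granularities:

import java.util.*;

// URL-based invalidation at two granularities: an exact-URL invalidation
// removes one cached query result, while a prefix-based invalidation drops
// every entry under a URI prefix, trading accuracy for a cheaper decision
// on the data server.
class UrlInvalidator {
    private final Set<String> cachedUrls = new HashSet<>();

    void add(String url) { cachedUrls.add(url); }

    // Fine-grained: only the exact URL is invalidated.
    void invalidateExact(String url) { cachedUrls.remove(url); }

    // Coarse-grained: everything sharing the prefix is invalidated,
    // e.g. all queries against a table whose contents changed.
    void invalidateByPrefix(String prefix) {
        cachedUrls.removeIf(url -> url.startsWith(prefix));
    }
}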

Another option is to use server logs to detect changes and invalidate the relevant entries, as proposed in CachePortal [CLL+01]. CachePortal intercepts and analyzes three kinds of system logs to detect changes to the base data: HTTP request/delivery logs determine the requested page, query instance request/delivery logs determine the query issued on the database for the user request, and database update logs record changes to the data. A sniffer module derives the mapping between query instances and URLs from the HTTP and query instance logs and generates a QI/URL map table; an invalidator module uses the database update logs and invalidates cached copies based on the updates and the QI/URL map table. This approach needs to keep state for all cached data. Moreover, it cannot guarantee 100% freshness of cached data, as the invalidator module may not be able to process the logs and invalidate cached copies in real time. In order to guarantee 100% fresh data where required, it is necessary to use a pull-based approach, or a combination of a pull-based approach with this one, which can adapt itself based on the current server load, storage space, and required coherency.

The use of CachePortal for caching dynamic objects is demonstrated in [LHP+04], where the J2EE reference software, the Java PetStore, has been used as a case study.

The DUP algorithm [DIR01, CIW+00, CID99] uses an object depen- dence graph (ODG) for the dependence between cached objects and the underlying data. The cache architecture is based on a cache manager which manages one or more caches. Application programs use an API to explicitly access caches to add, delete, and update cached objects.

There are also a number of systems and products on the market that support dynamic caching in one way or the other. They will be further discussed in case studies.

2.2.6 Caching Policy and Replacement Strategy

Since it is generally neither possible nor desirable to cache indefinitely every object received by the cache server, strategies are needed to determine: (i) which objects should be added to the cache on arrival (the caching policy), and (ii) which object should be removed if the cache is full and a new object arrives (the replacement strategy).

Caching Policy

The caching policy assists in determining whether it is worthwhile placing a newly-arrived object in the cache. In the case of rapidly changing data, it is not obvious whether objects based on such data should be placed in the cache at all; this clearly depends on the likelihood of subsequent references to the object before its data changes. Also, filling the cache means that the cache replacement algorithm must run more frequently, which imposes an overhead on the cache server. The cost of update propagation or invalidation should also be taken into account. Finally, caching "useless" objects may mean removing useful data from the cache during replacement. Therefore, the decision whether to cache an object should be made by weighing the benefit of caching for future requests against the above costs.

Products such as Oracle Web Cache [Ora01b], IBM WebSphere Edge Server [IBM] , and Dynamai [Dyn] enable system administrators to spec- ify caching policies. This is done primarily by including or excluding ob- jects or object groups (e.g., objects with a common prefix in the URI) to be cached, determining expiry date for cached objects or object groups, etc. Server logs (i.e. web server access log and database update log) are also used to identify objects which are good candidates for caching.

Weave [FYVI00, YFVI00] is a Web site management system which provides a language to specify a customized cache management strategy.

A framework called profile-based caching [BR02] allows clients to express caching preferences for their applications using a set of parameters. The basis of this approach is for the cache to capture the latency-recency trade-off for a particular user or application. The cache then uses the profiles to decide whether to deliver a cached object to the client (if the object might be out of date but the user/application does not have a strong coherency requirement) or to download a fresh object from a remote server (if there is a strong coherency requirement) [BR02].

Cache Replacement Strategy

Traditional cache replacement strategies such as First In First Out (FIFO) and Least Recently Used (LRU) were developed to solve the “page-level caching” problem, and may thus not be suitable for Web objects. Page-level caching was developed in the context of executing programs in virtual memory and involves maintaining a cache of fixed size objects and exploiting the “working set” behavior of programs. Web caching differs from this in dealing with objects that are heterogeneous in type and size and whose access patterns do not necessarily follow a “working set” model. Surveys of Web cache replacement strategies are presented in [PB03, BK04]. Some of the cache replacement strategies targeted at Web objects include:

The recency of Web objects is considered in the replacement strategy of [CI97], which observes that the probability of access to a Web object decreases dramatically as the time since its last access increases.

Least Likely to be Used (LLU) evicts objects which are less likely to be used in future [DDT+02]. This is achieved by mining access logs and extracting association rules between objects.

LRUMIN favors smaller objects in order to minimize the number of objects replaced [AWY99]. If there is no space in the cache for a newly arrived object of size S, an object of size at least S is removed from the cache; among such objects, the least recently used one is evicted. If no such object exists, objects of size at least S/2 are removed in LRU order; otherwise, the same is done for objects of size at least S/4, and so forth, until enough space is made.

The SIZE strategy [AWY99] assigns larger objects a higher priority for removal; if two objects have the same size, the least recently accessed one is removed first. The rationale is that removing a large object leaves room for a number of small objects, and small objects are also believed to be accessed more frequently on the Web (traces of Web access logs show that most requests are for small objects [WA96]).

Size-Adjusted LRU (SLRU) [AWY99] evicts the objects with the lowest cost-to-size ratio, defined as 1/(Si · ∆Tik), where Si is the size of object i and ∆Tik is the number of accesses since the last time object i was accessed. In other words, it sorts the objects in non-descending order of Si · ∆Tik and then greedily purges the objects with the highest values, one by one, until enough space for the incoming object has been created.

Size-Adjusted and Popularity-Aware LRU (LRU-SP) [CK00a] addresses a shortcoming of SLRU, in which objects of similar size are treated equally regardless of their popularity. LRU-SP uses both the size and the popularity of Web objects in the context of LRU. Given that fi is the number of accesses to object i since it was cached, LRU-SP uses the cost-to-size ratio fi/(Si · ∆Tik); in other words, it sorts the objects in non-descending order of Si · ∆Tik/fi and then greedily purges the objects with the highest values.

GreedyDual [You91] takes into account the differing costs of fetching objects into the cache. The original GreedyDual algorithm considers only the situation where all objects have the same size but the costs of fetching them into the cache differ [You91]. In this algorithm, each cached page p is associated with a non-negative value H, initially set to the cost of bringing the object into the cache. When a replacement is needed, the object with the lowest H value (Hmin) is removed and the H values of all objects remaining in the cache are reduced by Hmin. When a page is accessed, its H value is restored to its initial value. By reducing the H values of objects and restoring them when they are accessed, this algorithm integrates locality of access with replacement cost.

To accommodate objects of different sizes, GreedyDual-Size [CI97] extends the GreedyDual algorithm by setting H to the cost-per-size ratio (Cost/Size) of an object, where Cost is the cost of bringing the object into the cache and Size is its size in bytes. The definition of cost depends on the goal of the replacement algorithm: it is set to 1 if the goal is to maximize the hit ratio, to the download time if minimizing user-perceived delay is the goal, and to the total network cost in the general case. A good implementation of GreedyDual-Size avoids k subtractions when an object is replaced, where k is the number of objects in the cache; this is achieved by keeping an "inflation" value equal to the most recent Hmin and offsetting all future settings of H by this value.
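A sketch of GreedyDual-Size with the inflation optimization (illustrative names, not a production implementation):

import java.util.Comparator;
import java.util.PriorityQueue;

// GreedyDual-Size: instead of decreasing every H value on eviction, the
// evicted object's H becomes the inflation value L, and newly inserted or
// re-accessed objects get H = L + cost/size.
class GreedyDualSize {
    static class Entry {
        final String key;
        final double cost;   // e.g. 1 for hit ratio, or the download time
        final double size;
        double h;
        Entry(String key, double cost, double size) {
            this.key = key; this.cost = cost; this.size = size;
        }
    }

    private final PriorityQueue<Entry> queue =
            new PriorityQueue<>(Comparator.comparingDouble(e -> e.h));
    private double inflation = 0.0; // L

    void insertOrTouch(Entry e) {
        queue.remove(e);                    // no-op if the entry is not present
        e.h = inflation + e.cost / e.size;  // restore or initialise its priority
        queue.add(e);
    }

    // Evict the entry with the smallest H and remember it as the new L.
    Entry evict() {
        Entry victim = queue.poll();
        if (victim != null) {
            inflation = victim.h;
        }
        return victim;
    }
}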

Site-based LRU uses the names of Web sites in the cache replacement strategy [WY01]. When a user requests an object from a site, the site name (rather than the object name) is inserted at the top of a list, and the object is stored under the newly inserted site. Only a limited number of a site's objects, specified by Max Ob, are cached; when an object arrives from a site which already has Max Ob objects in the cache, the least recently used object from that site is removed and the newly arrived object is inserted. This prevents caching the unpopular objects of a popular Web site. Each time an object is requested, its site name moves to the top of the site list, and the least recently accessed sites migrate to the bottom. When a replacement is required, the system chooses the sites at the bottom of the list and their associated objects.

2.2.7 Query Rewriting and Caching

There are times when the requested object is not held in the cache server, but parts of the result might be available in the cache. Consider the following example, where clients request dynamic objects:

• When using a search engine, the user submits a request by providing some keywords. The search engine extracts the result set, wraps it in an appropriate format and sends it to the client (along with necessary logos, images, etc.)

• Searching catalog information through a company’s Web site may involve accessing a back-end database. In this case after retrieving the result set, it is wrapped in an appropriate format and sent to the client.

The cache server can store the result of such queries in cache. When a new query is received, the cache server can check to see whether the current query can be answered based on the results of previous queries contained in the cache. Therefore, to answer a query the following cases may occur:

1. The result set has no overlap with the data in the cache. In this case the data should be requested from the origin server.

2. The result set is fully contained in the cache. In this case the query can be answered from the data contained in the cache.

3. The result set is partly contained in the cache. In this case the query is split into two parts: a probe part that is already contained in the cache, and a remainder part which must be obtained from the origin server. The query can be rewritten by the cache server to ask the origin for the remainder part; on receiving the remainder, the result set is assembled and forwarded to the client.
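As a toy illustration of the third case (a single-attribute range predicate with made-up bounds), the probe/remainder split can be sketched as follows:

// The cache holds results for price in [cachedLo, cachedHi); a new query for
// [lo, hi) is answered partly from the cache (probe) and partly by a
// rewritten query sent to the origin (remainder).
class RangeSplit {
    static void split(int cachedLo, int cachedHi, int lo, int hi) {
        int probeLo = Math.max(lo, cachedLo);
        int probeHi = Math.min(hi, cachedHi);
        if (probeHi <= probeLo) {
            System.out.println("no overlap: forward the whole query to the origin");
            return;
        }
        System.out.println("probe (from cache): [" + probeLo + ", " + probeHi + ")");
        if (lo < probeLo) {
            System.out.println("remainder (to origin): [" + lo + ", " + probeLo + ")");
        }
        if (hi > probeHi) {
            System.out.println("remainder (to origin): [" + probeHi + ", " + hi + ")");
        }
    }

    public static void main(String[] args) {
        split(0, 100, 50, 150); // probe [50,100), remainder [100,150)
    }
}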

To achieve the above, some query processing capability (e.g., SQL) must be added to the cache server. This can be achieved by shipping programs which implement these capabilities, as proposed in [CZB98]. For server accelerators, which are close to the business edge and managed directly by the content provider, deploying such programs at the cache server is straightforward. Cache servers at the network edge (CDNs) are normally managed by an ISP or CDN operator, which makes deploying such programs less straightforward.

Moreover, to guarantee the freshness of the results, extensions such as invalidation or TTL-based approaches should be considered to adapt them for caching query results.

When a cache server receives a request consisting of an SQL query, it first considers whether the query can be answered from results already in the cache, and checks their freshness. It then identifies the probe and remainder parts, decomposing the original query into sub-queries over them, and queries the origin server(s) for the remainder(s). After receiving the remainder(s), it assembles the results, sends them back to the user, and stores a copy in the cache.

Finding the probe and remainder parts, i.e., matching against arbitrary SQL, is not easy. However, when querying through a Web interface, users normally do not submit arbitrary SQL queries; instead, they submit a query through a Web form, which corresponds to a simplified SQL query based on a predefined template. In this scenario, matching SQL queries is more straightforward [LN01, CB00, LNK+00, Mar00].

2.2.8 Query Processing & Caching

A portal is normally used to provide a consistent view of data/services from various providers. Data is transferred from the providers to the portal to be processed by query operators executing on the portal. To process query operators such as aggregates and projections, a large amount of data often needs to be transferred over the connection network, even though the result of such operators is normally much smaller than the input data.

Recently, research has been done on flexible and dynamic placement of query operators in distributed database architectures: query operators can be placed at different sites to minimize execution time, communication cost, and/or other parameters. One interesting technique is to process such query operators on or near the providers and transfer only the results to the portal. Transferring smaller amounts of data through the connection network saves network capacity, so users experience faster response times.

Following a cost-based approach, evaluation of data-reducing opera- tors can be pushed to the providers and the evaluation of data-inflating operators to the portal.

• Data-inflating operators are those whose result is larger than the input(s); for example, scaling up an image file, or a join operator whose result is larger than its inputs.

• Data-reducing operators are those whose result is smaller than the input(s); for example, aggregates, or a join operation whose output is smaller than its inputs.

Therefore, data reducing operators are more likely to be executed on or close to providers and data inflating operators are more likely to be executed on or close to the portal. The philosophy behind this scheme is that transferring data is normally the major performance bottleneck in large-scale systems [RR00].

Based on [BKK+01, BKK99] the overall picture is to make it possible to execute virtually any kind of query operator on any machine and any kind of data on the Internet. The idea is to create an open market place for three kinds of suppliers: data providers supply data, function providers offer query operators to process the data, and cycle providers are con- tracted to execute query operators. It is desirable to execute complex queries which involve the execution of operators from multiple function providers at different sites (cycle providers) and the retrieval of data and documents from multiple data sources. When considering caching in such a system there would be a circular dependency between caching and query plan optimization. Operator site selection decisions made for one query have ramifications on the performance of subsequent queries. Therefore, the query optimization process must be extended to take longer-term view of the impact of its decisions.

Cache Investment is a technique introduced in [KF00] which combines data placement and query optimization: it causes the optimizer to invest resources during the execution of one query in order to benefit later queries.

2.2.9 Case Studies

As mentioned earlier, there are a number of systems and products on the market that support dynamic caching in one way or the other. We study the most important ones here:

Oracle 9iAS Web Cache

Oracle 9iAS Web Cache is a product used as a server accelerator with the capability of caching dynamic Web pages [Ora01b]. It enables system administrators to specify cacheability rules using regular expressions; these rules specify whether a particular URL or group of URLs should or should not be cached. Supported objects include static content such as GIF and JPEG images as well as dynamic content from server-side languages such as JSP/Servlets, ASP, PL/SQL Server Pages (PSP), and CGI. In a number of cases content should be declared non-cacheable; examples include update transactions, shopping cart views, and personal account views. If no cacheability rules are specified, Oracle Web Cache behaves like a traditional proxy server and uses HTTP header information for caching purposes.

It also supports caching multiple versions of the same URL: accessing the same URL returns one of the versions depending on which user is accessing the page. For example, an e-commerce application might show different prices (e.g., full or discounted) depending on whether the customer is a first-time, regular, or returning customer. A Web application may use cookies or other HTTP request headers to decide which version to return. E-commerce applications also use cookies to track the click-stream habits of users as they browse a Web site. Some users choose to disable cookies in their browsers because they do not want private information about their browsing habits to be collected. To track such customers, many Web sites embed cookies as parameters in URLs by inserting a sequence number called a session ID into all hyperlinks to other parts of the application. Including such session IDs in every URL makes Web pages unique for each user, so they would normally be considered non-cacheable. In Oracle 9iAS Web Cache such pages can still be cached: it has a string substitution mechanism for handling session IDs embedded in URLs. When it inserts such pages into the cache, it takes the session information out and leaves a placeholder instead; for subsequent requests from a user, it substitutes values for the embedded session IDs based on the request header.

It also enables caching personalized Web pages, such as greeting pages containing, for example, "Welcome, John Citizen".

For this purpose, special SGML comments called “Web Cache tags” are used to identify the personalized attribute information within a page, for example:

<HTML>
...
<!-- WEBCACHETAG="greeting" -->
Welcome to our store, John Citizen
<!-- WEBCACHEEND="greeting" -->
...
</HTML>

Oracle AS Web Cache parses the HTML and caches a generic version of the page, leaving a placeholder for the personalized attributes placed between the Web cache tags.

Some applications such as e-commerce or portal applications need a fine-grained caching solution. Oracle AS Web Cache can break the content down into its building elements and cache elements separately. It takes advantage of the Edge Side Includes (ESI) [Edg] to identify content fragments for caching.

For cache coherency purposes, Oracle AS Web Cache supports expi- ration and message-based invalidation.

In the expiration method an “expiration policy” can be assigned to the cache content. When an object expires, Oracle AS Web Cache marks it as invalid.

In the invalidation method an XML/HTTP invalidation message is sent to the Oracle AS Web Cache host machine. Invalidation messages are HTTP POST requests that carry an XML payload. The contents of the XML message body tells the cache which URLs to mark as invalid.

Edge Side Includes

The Edge Side Includes (ESI) [Edg] mark-up language defined by Oracle and Akamai allows applications to identify content fragments for caching and assembly at the network edges, either at the application edge (i.e. application server or server accelerator), or Internet edge (i.e. CDNs). Assembling dynamic pages from individual page fragments means that only expired or non-cacheable fragments need to be fetched from the origin server, which results in better performance.

Tag                                          Purpose
<esi:include>                                Include a separately cacheable fragment.
<esi:choose>, <esi:when>, <esi:otherwise>    Conditional execution: choose among several different alternatives based on, for example, cookie value or user agent.
<esi:try>, <esi:attempt>, <esi:except>       Specify alternative processing when a request fails (e.g., the origin server is not accessible).
<esi:vars>                                   Permit variable substitution (for environment variables).
<esi:remove>                                 Specify alternative content to be stripped by ESI but displayed by the browser if ESI processing is not done.
<!--esi ...-->                               Specify content to be processed by ESI but hidden from the browser.
<esi:inline>                                 Include a separately cacheable fragment whose body is included in the template.

Table 2.3: Summary of ESI tags

A server-side include is a variable value that a server can include in an HTML file before sending it to the requester. For example, LAST_MODIFIED is one of several environment variables that an operating system can keep track of and that can be accessible to a server program. When writing a Web page, an include statement can be inserted in the file that looks like:

<!--#echo var="LAST_MODIFIED" -->

In this case, the server will obtain the last-modified date for the file and insert it before the HTML file is sent to requesters. A server-side include can be considered as a limited form of Common Gateway Interface (CGI) application. The server simply searches the server-side include file for CGI environment variables, and inserts the variable information in the places in the file where the “include” statements have been inserted. Table 2.3 shows a summary of ESI tags.

Tag                  Purpose
<jesi:include>       Used in a "template" page to indicate to the ESI processor how the fragments are to be assembled (the tag generates the <esi:include> tag).
<jesi:control>       Assign an attribute (e.g., expiration) to templates and fragments.
<jesi:template>      Used to contain the entire content of a JSP container page within its body.
<jesi:fragment>      Encapsulate individual content fragments within a JSP page.
<jesi:codeblock>     Specify that a particular piece of code needs to be executed before any other fragment is executed (a database connection established, user id computed, etc.).
<jesi:invalidate>    Explicitly remove and/or expire selected objects cached in an ESI processor.
<jesi:personalize>   Insert personalized content into a page where the content is placed in cookies and inserted into the page by the ESI processor.

Table 2.4: Summary of JESI tags

Oracle and Akamai have also defined an adaptation of ESI for Java, called Java Edge Side Includes (JESI), which can be used in JSP pages. In other words, it is a specification and custom JSP tag library that can be used by developers to automatically generate ESI code. Table 2.4 summarizes the JESI tags.

Server Side Includes (SSI)

A Web file with the suffix ".shtml" (rather than the usual ".htm") indicates a file that includes some information that will be added "on the fly" by the server before it is sent to the requester. A typical use is to include a "Last modified" date at the bottom of the page. This Hypertext Transfer Protocol facility is referred to as a server-side include. (Although rarely done, the server administrator can designate a file name suffix other than ".shtml" for server-side include files.) A server-side include can be thought of as a limited form of Common Gateway Interface application; in fact, the CGI is not used. The server simply searches the server-side include file for CGI environment variables and inserts the variable information at the places in the file where the "include" statements have been inserted.

When creating a Web site, a good idea is to ask your server ad- ministrator which environment variables can be used and whether the administrator can arrange to set the server up so that these can be han- dled. Your server administrator should usually be able to help you insert the necessary include statements in an HTML file.

JSP Cache Tag Library in BEA WebLogic

One of the JSP tags provided by BEA is the cache tag, which is packaged in the weblogic-tags.jar tag library file. By copying this file to the WEB-INF/lib directory of the Web application, cache tags can be used in JSP files. Using the cache tag enables caching of the tag's body, i.e., the fragment within the tag. The following XML fragment in web.xml enables a Web application to refer to this library:

<taglib>
    <taglib-uri>weblogic-tags.tld</taglib-uri>
    <taglib-location>/WEB-INF/lib/weblogic-tags.jar</taglib-location>
</taglib>

Referencing the tag library in JSP files is done using the taglib directive:

<%@ taglib uri="weblogic-tags.tld" prefix="w1" %>

Attribute   Description
timeout     Specifies the amount of time after which the body of the cache tag is refreshed.
scope       Specifies the scope in which data is cached. Valid scopes include: page, request, session, and application.
key         Specifies additional values for evaluating the condition on which caching is decided.
async       A true value for this attribute denotes updating the cache asynchronously, if possible.
name        Specifies a unique name for the cache. This allows a cache to be shared between different JSP files.
size        Specifies the maximum number of entries in the cache.
vars        Used for input caching, i.e., caching calculated values.
flush       A true value for this attribute causes the cache to be flushed.

Table 2.5: Supported cache tag attributes in BEA WebLogic

Table 2.5 summarizes the supported cache tag attributes in BEA WebLogic.
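As an illustrative sketch only (assuming the w1 prefix declared above, a timeout given in seconds, and a hypothetical categoryId request parameter as the cache key), a JSP fragment could be cached per category as follows:

<w1:cache timeout="600" key="parameter.categoryId">
    <%-- fragment whose generated output is cached for each categoryId value --%>
</w1:cache>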

Invalidation Mechanism in Dynamai

Dynamai [Dyn] from Persistence Software acts as a server accelerator and caches the results of requests for dynamic Web pages. Such pages become invalid when the data on which they were based changes in the underlying database. Two kinds of events may cause this to happen:

• A database update by the application through the Web interface

• A database update by an external event, such as an action by the system administrator or another application

In the first case, incoming requests are monitored, and if they cause an update on the database, the affected pages are invalidated.

In the second case, the invalidation mechanism will need to be pro- grammed in a script file and executed by the administrator to invalidate appropriate Web pages.

Dynamai enables the application developer or system administrator to define dependency and event rules for invalidation purposes. They identify all dependencies between Web pages and all the events that a request may cause which in turn may invalidate other Web pages. Dynamai supports the following request-based dependency/event rules:

• Query string or form parameters (GET or POST action methods)

• Cookies

• Directory based URLs

Cache Directives and API in ASP.NET

ASP.NET provides both page level caching and fragment caching. It also provides application level caching through a cache API that can be used in the application to manually manage the cache [Smi03]. To incorporate page level output caching, an OutputCache directive can be added to the page as follows:

<%@ OutputCache Duration="60" VaryByParam="*" %>

This directive appears at the top of the ASPX page, before any output. Five attributes are supported by this directive, two of which are required and the others optional, as shown in Table 2.6.

Attribute      Required   Description
Duration       Yes        Time, in seconds, the page should be cached. Must be a positive integer.
Location       No         The location where the output should be cached. If specified, it must be one of: Any, Client, Downstream, None, Server, or ServerAndClient.
VaryByParam    Yes        The names of the variables in the Request which should result in separate cache entries. "none" can be used to specify no variation. "*" can be used to create new cache entries for every different set of variables. Separate variables with ";".
VaryByHeader   No         Varies cache entries based on variations in a specified header.
VaryByCustom   No         Allows custom variations to be specified in the global.asax (for example, "browser").

Table 2.6: Supported cache directive attributes in ASP.NET

Most situations can be handled with a combination of the required Duration and VaryByParam options. For instance, consider a product catalog that allows the user to view pages of the catalog based on a categoryID and a page variable. The page could be cached for some period of time (an hour would probably be acceptable unless the products change all the time) with a VaryByParam of "categoryID;page". This creates separate cache entries for every page of the catalog for each category, and each entry persists for one hour from its first request.
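Concretely, such a catalog page might carry the following directive (one hour expressed in seconds):

<%@ OutputCache Duration="3600" VaryByParam="categoryID;page" %>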

VaryByHeader and VaryByCustom are primarily used to allow customization of the cached page's look or content based on the client that is accessing it. For example, the same URL might generate output for both Web browsers and mobile phone clients, and separate versions have to be cached for each. Similarly, a page might be optimized for IE but need to be rendered, perhaps with degraded quality, for Netscape or Mozilla. In order to enable separate cache entries for each browser, VaryByCustom can be set to a value of "browser". This functionality inserts separate cached versions of the page for each browser name and major version.

<%@ OutputCache Duration="60" VaryByParam="None" VaryByCustom="browser" %>

ASP.NET also enables caching different fragments within a Web page using the same syntax as in full page caching. In ASP.NET a Web page is referred to as a Web form (.aspx file) and a fragment within a page is referred to as a user control (.ascx file). The same syntax may be used for caching user controls. All the attributes supported by the OutputCache directive on a Web form are also supported for user controls, except for the Location attribute. There is, however, an extra attribute for user controls called VaryByControl, which caches a separate copy of the user control based on the value of a member of that control, such as a DropDownList. VaryByParam may be omitted if VaryByControl is specified. If a user control is used with the same name among different pages, the Shared="true" parameter enables using the cached version(s) of the user control for all the pages containing that control; by default, each user control on each page is cached separately. The following cache directive:

<%@ OutputCache Duration="60" VaryByParam="*" %>

caches the user control for 60 seconds, and creates a separate cache entry for every variation of input parameters.

Using the following directive:

<%@ OutputCache Duration="60" VaryByParam="none" VaryByControl="CategoryDropDownList" %>

causes the user control to be cached for 60 seconds. It creates a separate cache entry for each different value of the CategoryDropDownList control, and for each page that contains this control.

<%@ OutputCache Duration="60" VaryByParam="none" VaryByCustom="browser" Shared="true" %>

The above cache directive caches the user control for 60 seconds, and creates separate cache entries for each browser name and major version. All pages containing a reference to this user control can share such entries in the cache.

ASP.NET also provides a more flexible means for caching through the Cache object. Using the Cache object, any serializable object can be placed in the cache and its expiration can be controlled using one or more dependencies. Examples of dependencies include: time elapsed since caching the object, time elapsed since the last access to the object, changes to files/folders, etc.

Objects are inserted in the cache as key-value pairs, similar to a Hashtable. To store an object in the cache, the following assignment may be used; it stores the item in the cache without any dependencies. In this case, the object stays in the cache and does not expire unless the cache engine removes it as a result of running its cache replacement strategy.

Cache["key"] = "value";

API methods such as Add() or Insert() may be used to attach specific cache dependencies to an item.
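For example (an illustrative one-liner with a hypothetical file name and variable), an item can be inserted with a file dependency so that it is evicted from the cache whenever the file changes:

Cache.Insert("catalog", catalogData, new CacheDependency(Server.MapPath("catalog.xml")));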

PreLoader from Chutney Technologies

Chutney’s Preloader may be deployed as a server accelerator that sits next to an application server farm and caches Web pages or individual components of Web pages [Abe01, Tec01, Chu].

It enables page-level dynamic content caching, including personalized Web pages such as those that contain personal greetings, for example, "Welcome to our store, John Citizen!". For such pages, the cache stores a generic page with a placeholder for the personalized information, which can be changed on-the-fly by the cache to represent different personalized pages.

Moreover, it enables component-level caching by breaking the content down into its components and caching them separately. It automatically assembles such components when a request is made.

Finally, it provides a variety of invalidation mechanisms, such as TTL settings and database triggers.

2.3 Summary

In this chapter we studied caching as a technique to improve performance in Web applications. Browsers and proxies provide a better user experience by caching static content. Database servers and database accelerators are concerned with caching dynamic data in the form of database query results (e.g., SQL). CDN servers accelerate content delivery by replicating the content at different geographical areas, i.e., network edges. They deal with static content as well as fragment caching and dynamic assembly of content at network edges. Web/application servers

and server accelerators are potentially capable of caching dynamic data in applications with a back-end database. A table is normally used to keep track of cached objects, where each entry in the table represents an object or object group, and changes in the back-end database invalidate the relevant entry(ies).

In existing systems, caching policies are defined by system administrators based on the previous history of available resources, access and update patterns. This is done by including or excluding objects or object groups for caching, setting expiry times, etc.

In portal applications these policies need to be defined and adapted by the system. The portal and providers are normally managed by different organizations, and the administrator of the portal does not normally have the information needed to define such policies for providers. There is a need for caching solutions for Web portals that address such limitations. The rest of this thesis explores the limitations of existing solutions and proposes a collaborative caching strategy for Web portals.

Chapter 3

A Collaborative Caching Strategy for Web Portals

In this chapter we explain our proposed caching strategy in detail. The first section introduces the challenges of caching in Web portals. In the second section, we explain how the portal and providers collaborate to achieve an effective caching strategy. The next section describes the meta-data used to support the approach. How providers process logs and determine "cache-worthy" objects is explained in the following section. We then explain how the portal can deal with heterogeneous caching policies from different providers in order to achieve fairness among providers and/or better performance. We finally discuss the synchronization of meta-data and its effect on performance.

3.1 Caching in Web Portals

To improve performance, response messages from providers can be cached at a Web portal. In a portal that may include complementary

providers, competitor providers, or both, the relationship between portal and providers is not necessarily fixed and pre-established. It can rather be dynamic, where new providers join or existing providers leave the portal. Figure 3.1 briefly represents the general architecture of a Web portal along with its providers. Meta-data is information about providers, recorded when they register their service. As can be seen in the figure, each provider may have a membership relationship with a number of portals. Moreover, each provider may have a number of sub-providers.

Response messages (e.g., SOAP or XML) from providers can be cached at the portal. Caching data from providers can reduce network traffic. It also reduces the workload on the provider's Web/application server and database server by satisfying some requests locally at the portal site. A lower workload on the provider site leaves more processing power for incoming requests, which results in more scalability and reduces hardware costs.

In an environment where providers are complementary, the performance of the portal is limited by the provider with the lowest performance among the providers taking part in providing a composite service. Caching objects from such providers can boost the apparent performance of the slowest providers and therefore the performance of the portal overall. Providing data from a shorter distance also improves the response time to the user. This in turn helps user satisfaction and, finally, revenue for the portal and the providers. For example, if a fast browse session for products is provided, users might be more willing to continue shopping, which may finally lead to buying the product. Moreover, users are more likely to come back to the Web site again if they have a good shopping experience, including fast response time.

Figure 3.1: Caching in portals (clients send requests to the portal, which maintains a cache and meta-data and obtains content from a set of providers)

In an environment where providers are competitors, caching is of even more interest to providers. Assuming a portal which lists the contents from different providers as they arrive, providers with better response time get a better chance of being listed to the user. Given that in many cases users are only interested in "Top N" results, failing to provide fast response time may mean losing the chance of being listed in the result set. This may result in less revenue for the business.

Inherent in the notion of caching are the ideas that we should maintain as many objects as possible in the cache, but that the cache is not large enough to hold all of the objects that we would like. This introduces the notion of choosing better candidates for caching. Caching a particular object at the portal depends on the available storage space, response time (QoS) requirements, and the access and update frequency of objects [KF00]. As mentioned earlier, the best candidates for caching are those that are accessed frequently and do not change often. For rapidly changing or

infrequently accessed objects, it might not be beneficial to cache them at all. Due to the large number of providers and the dynamicity of the environment, identifying objects for caching at the portal, either by a system administrator or from server logs, is impractical.

A caching policy is required to determine which objects should be cached. Products such as Oracle Web Cache [Ora01b], IBM WebSphere Edge Server [IBM], and Dynamai [Dyn] enable system administrators to specify caching policies. This is done mainly by including or excluding objects or object groups (e.g., objects with a common prefix in the URI) to be cached, determining expiry dates for cached objects or object groups, etc. Server logs (i.e., the access log and the database update log) are also used to identify objects to be cached.

Caching dynamic objects at Web portals introduces new problems to which existing techniques cannot be easily adapted. Most importantly, the portal and providers are managed by different organizations and administrators. Therefore, the administrator of the portal does not normally have enough information to determine caching policies for individual providers. Moreover, since the portal may be dealing with a (large) number of providers, determining the best objects for caching manually or by processing logs is impractical. On the one hand, an administrator cannot identify candidate objects in a dynamic environment where providers may join and leave the portal frequently. On the other hand, keeping and processing access logs in the portal is impractical due to high storage space and processing time requirements. Also, database update logs are not normally accessible by the portal.

3.2 Caching Strategy

In current systems, caching policies are defined and tuned by parameters which are set by a system administrator based on the previous history of available resources, access and update patterns [AH00, IFF+99, OLW01, Ora01b, CID99, CLL+01, Dyn]. A more useful infrastructure should provide more powerful means to define and deploy caching policies, preferably with minimal manual intervention. As the owners of the objects, providers are better placed to decide which objects to cache. In this work, we focus only on distributed portals, where providing fast response time is one of the critical issues. The proposal in this thesis involves both the portal and the providers contributing information that allows an effective caching strategy to evolve automatically on the portal.

A caching score (called cache-worthiness) is associated with each object, determined by the provider of that object. The cache-worthiness of an object, a value in the range [0,1], represents the usefulness of caching this object at the portal. A value of zero indicates that it is not worthwhile to cache the object in the portal, while a value of one indicates that it is essential to cache the object in the portal [MBR03, MSB04, MS04]. The cache-worthiness score is sent by the provider to the portal in response to a request from the portal.

Cache-worthiness scores are determined by providers via an off-line process which examines the provider's server logs, calculates scores and then stores the scores in a local table. In calculating cache-worthiness, the providers consider parameters such as access frequency, update frequency, computation/construction cost and delivery cost. A typical cache-worthiness calculation would assign higher scores to objects

with higher access frequency, lower update frequency, higher computation/construction cost, and higher delivery cost. However, each provider can have its own definition of these scores, based on its own policies and priorities. For example, a provider might choose not to process server logs to define the scores. It might, for example, let the system administrator assign zero to non-cacheable objects and some non-zero value to cacheable objects or object groups, as in some existing caching solutions. In this case all the objects with a non-zero cache-worthiness value will be considered cacheable by the portal.

The decision whether to cache an object or not is made by the portal, based on the cache-worthiness scores along with other parameters such as recency of objects, utility of providers, and correlation of objects.

Figure 3.2 and Figure 3.3 briefly show the algorithms used by the portal and providers to enable the caching strategy.

3.3 Meta-data Support

The caching strategy is supported by two major tables: the cache look-up table, used by the portal to keep track of cached objects, and the cache validation table, used by providers to validate/invalidate the objects cached at the portal.

An entry in the cache look-up table consists of Object-Identifier, Cache-Time, and Cache-Worthiness. The Object-Identifier is used to represent the object within the system. For example, the portal can invoke a Web service operation through a URI which includes the name of a Servlet at the provider along with the values for its input parameters. Cache-Time indicates the time the object was cached by the portal; this is used to validate the object when a validation request is sent to the provider. Cache-Worthiness is the cache-worthiness score assigned to the object.

Portal()
  ...
  while (true)
    req = get-user-request();
    case (req.type)
      "Read":
        /** Generate sub-queries Q = (q1, q2, ...) to a subset of providers
            P = (p1, p2, ...); objects o1, o2, ... denote the results of q1, q2, ... */
        Q = generate-sub-queries(req);
        for (qi IN Q)
          if (cached(oi) && not expired(oi))
            send-validation-request(oi.oid);
          else
            send-read-request(qi);
        end for
        for (pi IN P)
          respi = get-response(pi);
          if (respi.type == "Validation" && respi.valid)
            oi = cache(respi.oid);
          else    /** respi.type == "Read", or the cached copy was invalid */
            oi = respi.content;
          endif
        end for
        result = integrate-results(o1, o2, ...);
        send-reply(result);
      "Write":
        /** Generate sub-requests W = (w1, w2, ...) to a subset of providers
            P = (p1, p2, ...) */
        W = generate-sub-requests(req);
        for (wi IN W)
          send-write-request(wi);
        end for
    end case
  end while
end Portal

Figure 3.2: Caching algorithm used by portal

Provider()
  ...
  while (true)
    req = get-request();
    case (req.type)
      "Read":
        o = generate-result(req);
        send-response(o);
      "Write":
        update-database(req);
        invalidate-objects(req.oid);
      "Validation":
        search-cachevalidationtable(req.oid);
        if (found && valid)
          send-validation-response(true);
        else
          o = generate-result(req);
          send-validation-response(false, o);
        end if
    end case
  end while
end Provider

Figure 3.3: Caching algorithm used by providers

An entry in the cache validation table consists of Object-Identifier, Generation-Time, and Validity. The Object-Identifier is the same as the one used in the cache look-up table. Generation-Time records the time the object was generated; it is used to validate the object when a validation request is received from the portal. Changes in the back-end database invalidate entries in the cache validation table: if a change to the content of the database affects the freshness of an object, the appropriate entry in the provider's cache validation table is invalidated by resetting the Validity field for that object. For this purpose we use the incoming (update) requests to invalidate the affected objects. The concept of an Object Dependence Graph (ODG) is used to determine which objects are affected: each update request invalidates a number of objects or object groups, which can be determined from the ODG, and the relevant entries in the cache validation table are set to "invalid" accordingly. More details about the ODG are given in Chapter 4.

Figure 3.4: Meta-data used by the caching strategy (the portal's cache look-up table with Object-Identifier, Cache-Time and Cache-Worthiness, and each provider's cache validation table with Object-Identifier, Generation-Time and Validity, backed by the provider's database)

When a hit is detected at the portal and the object is not yet considered expired, a validation request message is sent to the relevant provider. The message includes the corresponding Object-Identifier and Cache-Time. Upon receiving the validation request, the provider checks the freshness of the object by probing the cache validation table: it checks the Validity value of the entry and compares the Cache-Time of the validation request with the Generation-Time of the entry. The comparison may result in the following cases:

• The object's Cache-Time is greater than or equal to Generation-Time. In other words, the cached copy is the newest version generated by the provider and is therefore fresh, unless changes in the back-end database have made the object invalid, which is detected by checking the Validity field in the table.

• The object’s Cache-Time is smaller than Generation-Time. This means that the copy in the cache is (most likely) stale, as a new version of the object has been generated by the provider after the object was cached. In this case the object will be considered as invalid.

If the object is fresh, the provider confirms this by sending a validation message. If the object is not fresh (as determined by checking Validity or comparing the times), or the relevant entry is not found in the table, a fresh copy of the object is returned to the portal. After receiving the object, the portal responds to the user, and a copy of the object may be cached at the portal for future requests. The algorithm used by the provider to validate an object is shown in Figure 3.5.
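A minimal Java sketch of this freshness test (class and field names are illustrative, not necessarily those used in the test-bed):

class ValidationCheckSketch {
    static class ValidationEntry {
        long generationTime;   // time the object was last generated by the provider
        boolean valid;         // reset when a back-end update affects the object
    }

    /** Returns true if the copy cached at the portal at cacheTime is still fresh. */
    static boolean isFresh(ValidationEntry entry, long cacheTime) {
        if (entry == null) {
            return false;      // entry not found: treated as not fresh (a possible false miss)
        }
        return entry.valid && cacheTime >= entry.generationTime;
    }
}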

Clearly, storing the cache validation table imposes on each provider a space overhead proportional to the number of cacheable objects. It also imposes some computation overhead for detecting changes on the back-end database and invalidating the relevant entries in the table. If a provider has limited resources, it could restrict the table size, maintaining only enough entries to fit the available space. The provider would also need to run a cache replacement strategy to free some space when the size of the table reaches an upper bound. This method puts a bound on the space overhead, but decreases caching performance at the portal, because the provider may replace valid objects in its validation table and subsequently tell the portal that a valid object is invalid (a situation called a false miss).

Provider()
  ...
  while (true)
    req = get-request();
    case (req.type)
      "Read":
        ...
      "Write":
        ...
      "Validation":
        i = find-cachevalidationtable(req);   /** -1 indicates not found */
        if (i >= 0)
          if (cachevalidationtable[i].valid &&
              req.cache-time >= cachevalidationtable[i].generation-time)
            send-validation-response(true);
          else
            o = generate-result(req);
            send-validation-response(false, o);
          end if
        else
          o = generate-result(req);
          send-validation-response(false, o);
        end if
    end case
  end while
end Provider

Figure 3.5: Details of caching algorithm used by provider

This leads to the notion that there is a trade-off in each provider between the space and computation overhead on one hand, and the final performance of the cache on the other. The more storage space and computation power the providers are willing to spend on storing and maintaining the cache validation table, the better the performance of the cache will be. This applies to the performance of the cache for individual providers as well as to the performance of the cache as a whole, as a result of a better hit ratio for individual providers.

3.4 Calculating Cache-worthiness

As mentioned earlier, due to the potentially large number of providers and the dynamicity of the environment, in terms of providers joining and leaving, it is not feasible to identify "cache-worthy" objects on the portal, either by a system administrator or by mining server logs: a human administrator cannot handle frequent changes to the collection of providers; maintaining and processing access logs in the portal imposes too much storage and processing overhead; and database update logs from the providers are typically not accessible to the portal. Providers, as the owners of the objects, are more capable of deciding which objects should be selected. In order to provide effective caching in a distributed, dynamic portal environment, we propose a strategy based on collaboration between the providers and the portal.

The best candidates for caching are objects that are: (i) requested frequently, (ii) not changed frequently, and (iii) expensive to compute or deliver [YFVI00]. For other objects, the caching overheads may outweigh the caching benefits. We use server logs at provider sites to calculate a score for cache-worthiness. In the rest of this chapter we use Oi,m to denote an object i from a provider m. We identify four important parameters:

• Access Frequency is denoted by A(Oi,m, k), and indicates the access frequency of Oi,m through portal k over the period for which the log has been recorded, or since the last time the access frequency of objects was calculated. It is calculated by processing the Web/application server access log and counting the accesses made to each object Oi,m. More frequently accessed objects are better choices for caching.

• Update Frequency is denoted by U(Oi,m), and indicates the number of times Oi,m has been invalidated over the period for which the calculation is being done (i.e., the time the log has been recorded, or since the last time the update frequency was calculated). It is calculated by processing the update log. Objects with lower update frequency are better candidates for caching.

• Computation/Construction Cost is denoted by C(Oi,m), and indicates the cost of generating Oi,m in terms of database accesses and formatting the results. It is calculated by processing the server logs and measuring the time elapsed between the request being sent to the database and the result being ready for delivery. Objects with higher computation cost are better candidates for caching.

• Delivery Cost is represented by the size of the object, and is denoted by S(Oi,m). Larger objects are more expensive to deliver in terms of network bandwidth consumption and elapsed delivery time, and are therefore more appropriate for caching.

The cache-worthiness score is calculated as an aggregation of the above-mentioned parameters. As can be noticed, each of these parameters could have a different range of values and/or different units. To make the parameters comparable, and therefore aggregatable, we standardize each parameter X using

ZX = (X − X̄)/σ(X), where σ²(X) = (1/n) Σi=1..n (Xi − X̄)² = (1/n) Σi=1..n Xi² − X̄².

We use the second form of the variance because it can be computed in a single pass over the input.

The corresponding standard variables are ZA for access, ZU for update, ZC for computation cost, and ZS for size. The resulting standard variables have an average (Z̄) equal to 0 and a standard deviation (σ(Z)) equal to 1.

The above parameters are finally aggregated to generate the value for cache-worthiness. The resulting value is denoted by CW (Oi,m, k). It indicates how useful caching Oi,m at portal k is.

The effect of each term in the cache-worthiness score can be tuned by giving different weights to each term. If network bandwidth is low, giving a heavier weight to the size puts more emphasis on caching larger objects when calculating the scores; conversely, if network bandwidth is high, a lighter weight on the size reduces its influence. Similarly, if the database server is slow or normally under heavy load, so that the computation of (e.g., SQL) queries and the generation of results create a bottleneck, a heavier weight can be given to the computation cost, which results in higher scores for objects with higher computation requirements, and vice versa. Moreover, if data change frequently and/or the coherency requirements are high, giving a heavier weight to the update frequency results in higher scores for less frequently changed objects, and vice versa. Finally, giving a heavier weight to access frequency puts more emphasis on access frequency when calculating the cache-worthiness scores.
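As an illustration, the following Java sketch standardizes the four parameters and combines them into a weighted score. The signs of the weights (update frequency counting against caching) and the final squashing of the aggregate into (0,1) are assumptions made for the example; the thesis does not prescribe this exact aggregation formula.

class CacheWorthinessSketch {

    /** Computes illustrative cache-worthiness scores from per-object log statistics. */
    static double[] cacheWorthiness(double[] access, double[] update,
                                    double[] cost, double[] size,
                                    double wA, double wU, double wC, double wS) {
        double[] zA = standardize(access);
        double[] zU = standardize(update);
        double[] zC = standardize(cost);
        double[] zS = standardize(size);
        double[] score = new double[access.length];
        for (int i = 0; i < score.length; i++) {
            // higher access, cost and size raise the score; higher update frequency lowers it
            double raw = wA * zA[i] - wU * zU[i] + wC * zC[i] + wS * zS[i];
            score[i] = 1.0 / (1.0 + Math.exp(-raw));   // assumed mapping into (0,1)
        }
        return score;
    }

    /** Single-pass standardization: z = (x - mean) / stddev, with the variance
        computed as the mean of squares minus the square of the mean. */
    static double[] standardize(double[] x) {
        double sum = 0, sumSq = 0;
        for (double v : x) {
            sum += v;
            sumSq += v * v;
        }
        double mean = sum / x.length;
        double var = sumSq / x.length - mean * mean;
        double sd = Math.sqrt(Math.max(var, 1e-12));   // guard against zero variance
        double[] z = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            z[i] = (x[i] - mean) / sd;
        }
        return z;
    }
}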

3.5 Other Parameters

There are other parameters that may be used in conjunction with cache-worthiness scores in order to provide a more efficient caching strategy. These parameters include:

• Recency: Recency of an object is defined as a value in [0, 1]. The oldest object in the cache has a recency equal to 0 and the recency of the newest object is defined as 1. More recent objects are more likely to stay in cache and older objects are more likely to be removed when the cache replacement strategy is run. The following formula shows how the recency of an object can be calculated using the time-stamps (TS) assigned to each object:

R(Oi) = (TS(Oi) − TS(OE))/(TS(OK ) − TS(OE))

Where:

OE : The oldest object in the cache

OK : The most recent object

• Utility of Providers: It should be noted that the throughput of the portal is bounded by the throughput of the provider(s) with the lowest performance, whenever the results of such provider(s) cannot be ignored, e.g., when the portal is keen to do business with the provider, the provider normally offers good deals for the customers, the provider pays a commission to the portal, or the provider is needed to satisfy a composite service and there is no other option. Boosting the performance of such providers can increase the performance of the portal as a whole. Therefore, such providers might be given higher utility.

T ≤ min(T1,T2,T3, ...)

Each provider can be given a weight in advance and this weight can be used to favor some providers against others for caching. This weight can be used in conjunction with cache-worthiness scores to boost the performance of some particular providers.

• Correlation of Objects: In a Web application, some objects are normally requested in order. For example, in a travel portal, a browse session for accommodation in a particular region might be followed by a browse session for car rental in the same region. Therefore, it would be worth caching the result of the car rental browsing session if the result of the accommodation browsing session is already cached. The correlation between Oi and Oj shows the rate at which Oj is accessed after Oi, and can be calculated by processing access logs as follows:1

r(Oi,Oj) = (f(Oj : Oi → Oj))/f(Oi)

Where:

f(Oj : Oi → Oj) : Access frequency of Oj when accessed after Oi

f(Oi) : Access frequency of Oi

1The calculation of correlation we use is different from the one used in statistics textbooks. However, the idea of correlation follows the same concept.

The results of the processing are maintained in a correlation matrix, where every cell stores r(Oi,Oj).
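For example, the correlation matrix could be built from a time-ordered access log as in the following Java sketch, which interprets "accessed after Oi" as the immediately following access; the log representation and this interpretation are assumptions for illustration.

import java.util.List;

class CorrelationSketch {

    /** Builds r(Oi, Oj) = f(Oj : Oi -> Oj) / f(Oi) from a time-ordered access log,
        where log.get(t) is the index of the object accessed at step t. */
    static double[][] correlationMatrix(List<Integer> log, int numObjects) {
        long[] freq = new long[numObjects];                  // f(Oi)
        long[][] follow = new long[numObjects][numObjects];  // accesses to Oj right after Oi
        for (int t = 0; t < log.size(); t++) {
            int oi = log.get(t);
            freq[oi]++;
            if (t + 1 < log.size()) {
                follow[oi][log.get(t + 1)]++;
            }
        }
        double[][] r = new double[numObjects][numObjects];
        for (int i = 0; i < numObjects; i++) {
            for (int j = 0; j < numObjects; j++) {
                r[i][j] = (freq[i] == 0) ? 0.0 : (double) follow[i][j] / freq[i];
            }
        }
        return r;
    }
}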

These parameters can be used in conjunction with cache-worthiness scores to provide a more effective caching strategy.

3.6 Regulating Heterogeneous Caching Scores

The fact that it is up to providers to calculate cache-worthiness scores may lead to inconsistencies between them. Although all providers may use the same overall strategy to score their objects, the scores may not be consistent. In the absence of any regulation of cache-worthiness scores, objects from providers who give higher scores get a better chance of being cached, and such providers obtain more cache space than others. This leads to unfair treatment of providers: those who give lower scores get comparatively less cache space, and their performance improvements are expected to be less than those who score higher. It may also result in less effective cache performance as a whole. The following factors contribute to inconsistencies in caching scores among providers:

• Each provider uses a limited number of log entries to extract the required information, and the available log entries may vary from one provider to another

• The value of computation cost (C(Oi,m)) depends on the provider hardware and software platform, workload, etc.

• Providers may use other mechanisms to score their objects (they are not required to use the above approach)

• Malicious providers may claim that all of their own objects should be cached, in the hope of getting more cache space

To achieve a fair and effective caching strategy, the portal should detect these inconsistencies and regulate the scores given by different providers. For this purpose, the portal uses a regulating factor λ(m) for each provider m, applies it to the cache-worthiness scores received from that provider, and uses the result in the calculation of the overall caching scores. This factor has a neutral value in the beginning and is adapted dynamically by monitoring the cache behavior. This is done by tracing false hits and true hits.

A false hit is a cache hit occurring at the portal when the object has already been invalidated. False hits degrade the performance and increase the overheads at both the portal and provider sites, without any benefit. These overheads include probing the cache validation table, generating validation request messages, wasting cache space, and probing the cache look-up table.

A true hit is a cache hit occurring at the portal when the object is still fresh and can be served by the cache. The performance of the cache can only be judged by true hits.

The portal monitors the performance of the cache by tracing false and true hits and dynamically adapts λ(m) for each provider. For those providers with a higher ratio of false hits, the portal downgrades λ(m), so all the cache-worthiness scores from that provider are treated as lower scores, i.e., λ(m) → λ−(m). For those providers with a higher ratio of true hits, the portal upgrades λ(m), i.e., λ(m) → λ+(m), so all the cache-worthiness scores from that provider are treated as being higher than before.

A high false hit ratio for a provider m indicates that the cache space for that particular provider is not well utilized: the cached objects for that provider are not as worthy as they should be, i.e., the provider has given its objects higher cache-worthiness scores than they deserve. This can be resolved by downgrading the scores from that provider and treating them as if they were lower.

Conversely, a high true hit ratio for a provider m indicates that the cache performance for this provider is good, and that provider m is taking good advantage of the cache space. Upgrading the cache-worthiness scores of provider m results in more cache space being assigned to this provider. This ensures fairness in cache usage based on how the cache is utilized by providers. The fair distribution of cache space among providers also results in better cache performance overall. The experimental results confirm this claim.
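A minimal Java sketch of this regulation step; the multiplicative update rule, the 0.5 threshold, and the bounds on λ are assumptions made for illustration, not the exact policy used in the evaluation.

class RegulationSketch {

    /** Adjusts the regulating factor lambda for one provider, based on the
        true and false hits observed for that provider in the last monitoring period. */
    static double adjustLambda(double lambda, long trueHits, long falseHits) {
        long hits = trueHits + falseHits;
        if (hits == 0) {
            return lambda;                            // nothing observed: keep the current factor
        }
        double falseRatio = (double) falseHits / hits;
        if (falseRatio > 0.5) {
            lambda *= 0.9;                            // mostly false hits: downgrade the provider's scores
        } else {
            lambda *= 1.1;                            // mostly true hits: upgrade the provider's scores
        }
        return Math.min(1.0, Math.max(0.1, lambda));  // keep lambda within a bounded range
    }
}

The regulated score used by the portal would then be λ(m) multiplied by the cache-worthiness score reported by the provider.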

3.7 Synchronization of Meta-data

The meta-data used to support the caching strategy consists of two types of tables. These tables store information about the validity of cached objects. As changes happen in the back-end database, some entries may become invalid in the cache validation table maintained by the provider. These changes do not invalidate the cached objects at the portal unless the tables are synchronized with each other; otherwise, the portal has to check the validity of the object each time a hit is detected. Moreover, due to limited storage space, the portal or the providers might not be able to store all the entries in these tables. Having a larger cache validation table incurs more overhead on the provider to monitor changes on the back-end database and invalidate the relevant entries. Therefore, the cache look-up table may become inconsistent with the cache validation table(s). As a result, the performance of the cache can degrade: the inconsistency in the meta-data increases the number of false hits, which in turn degrades performance.

A false hit might occur when a hit is detected at the portal but the provider cannot check the freshness of the object and has to generate the object again, even though the object in the cache might still have been fresh. This happens when the relevant entry does not exist in the cache validation table, for example because it was removed by a cache replacement strategy run to free some space at the provider.

Keeping the meta-data tables coherent reduces these avoidable false hits, which in turn results in better cache performance. The synchronization can be performed in the following ways:

Change-based: When a cached copy becomes invalid as a result of a change in the back-end database, the relevant cache entry is invalidated. This can impose a lot of overhead for generating and processing invalidation messages, but it eliminates the need for sending validation messages each time a hit occurs, while guaranteeing 100% freshness in applications with strong cache coherency requirements. Moreover, fresh objects can be pushed to the portal along with the invalidation message. However, this needs to be considered carefully: the overhead of pushing objects to the portal might outweigh the performance gain if the objects are not chosen properly and non-worthy objects are pushed to the portal. The change-based method works well when the update rate is low and strong coherency is required.

Time-based: The meta-data tables are synchronized periodically. At specific times, invalid objects are removed from the cache. Keeping these tables highly coherent requires storage, computation and communication overhead which may, in turn, defeat the benefit of caching. Maintaining low coherency between these tables leads to reduced cache performance (i.e., more false hits), wasted storage space, etc. To provide an effective caching strategy, these tables should be reasonably coherent: it is desirable to keep the caching performance high while keeping computation, network and storage overheads low. Smaller time periods increase the coherence while increasing the computation and communication overheads, and vice versa. This method can provide strong coherency if the portal uses a polling-every-time method to validate the objects; otherwise, it provides weak coherency.

3.8 Summary

In this chapter we introduced a collaborative caching strategy to be used in Web portals. It relies on a score called cache-worthiness, calculated by providers and sent to the portal along with the objects. A method for calculating such scores using server logs was discussed. We also discussed the issue of consistency between providers in calculating the scores and introduced a regulating method to deal with it. Finally, we discussed the issue of synchronization and gave two alternative synchronization strategies.

Chapter 4

Evaluation and Analysis

In order to evaluate the performance of the collaborative caching strategy, we have built a test-bed. This test-bed enables us to simulate the behavior of a business portal with different numbers of providers, message sizes, response times, update rates, cache sizes, etc. We examine the behavior of the system under a range of different scenarios. Network bandwidth usage, throughput, and average access time are used as the primary performance measures. In the next section we explain the evaluation test-bed in detail and explain how the simulation instances are set up. In the following sections different alternatives in deploying our collaborative caching strategy are examined and the performance is compared with some existing caching strategies.

4.1 Evaluation Test-bed

The test-bed enables us to evaluate the performance of our caching strategies. It also implements existing caching strategies in order to compare their performance with our proposed strategy. The following caching strategies have been implemented in the test-bed: CacheCW (our proposed caching strategy), LRU (Least Recently Used), FIFO (First In First Out), LFU (Least Frequently Used), SIZE, LRU-S, and LRU-SP1. We believe that these strategies provide a good basis for the evaluation, as they include a variety of standard and state-of-the-art strategies. However, any other caching strategy can be implemented and plugged into the test-bed. It is worth mentioning that, according to the literature and also our evaluation results, LRU-SP performs the best for Web objects among all the known strategies. The simulation can also be run without caching (NoCache) and the performance results used as a baseline.

The simulation can implement a variety of scenarios by using different values for a set of system parameters stored in a configuration file. The most important parameters that we have used in the experiments include: (i) the average number of providers, letting providers join or leave while keeping an average number of providers during the simulation, (ii) the average number of providers involved in processing each request, (iii) the average number of objects per provider, (iv) the ratio of updates for each provider, to study the behavior of the caching strategy based on the update frequency (e.g., a higher update ratio represents a business portal where many users' search sessions result in buying products; similarly, a lower update ratio can represent a business portal where users' search sessions are rarely followed by a purchase), (v) network bandwidth, (vi) the average number of invalidated objects per update in the back-end database, (vii) the size of response messages, to define different message sizes for each provider, (viii) the computation/generation cost of objects, to simulate providers with different response times, (ix) the average number of online users, and (x) the cache size.

1Please refer to Section 2 for the detailed description of such strategies.

Figure 4.1: Architecture of the test-bed (UserSimulators communicate with the PortalSimulator, which holds the cache, through a synchronization mechanism, and the PortalSimulator communicates with the ProviderSimulators through another synchronization mechanism)

Figure 4.1 shows the architecture of the test-bed. PortalSimulator is the main module of the test-bed and implements the portal. UserSimulator0, UserSimulator1, etc. simulate the behavior of online users. User requests to objects are generated based on a Zipf-like distribution. UserSimulator(s) send requests to PortalSimulator and wait for the results. The average access time (i.e., the time between a user sending the request and receiving the result) is measured and used as a performance measure. ProviderSimulator0, ProviderSimulator1, etc., simulate providers. Upon receiving a request from PortalSimulator (i.e., read, write, or validation), they process the request and return the results to PortalSimulator.
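The Zipf-like request stream can be generated as in the following Java sketch; the skew parameter and the ranking of objects by popularity are assumptions for illustration rather than the exact generator used in the test-bed.

import java.util.Random;

class ZipfRequestSketch {
    private final double[] cdf;             // cumulative probabilities over object ranks
    private final Random rnd = new Random();

    /** numObjects objects ranked by popularity; alpha close to 1 gives a classic Zipf skew. */
    ZipfRequestSketch(int numObjects, double alpha) {
        double[] weights = new double[numObjects];
        double total = 0;
        for (int rank = 1; rank <= numObjects; rank++) {
            weights[rank - 1] = 1.0 / Math.pow(rank, alpha);
            total += weights[rank - 1];
        }
        cdf = new double[numObjects];
        double acc = 0;
        for (int i = 0; i < numObjects; i++) {
            acc += weights[i] / total;
            cdf[i] = acc;
        }
    }

    /** Returns the (0-based) rank of the next requested object. */
    int nextObject() {
        double u = rnd.nextDouble();
        for (int i = 0; i < cdf.length; i++) {
            if (u <= cdf[i]) {
                return i;
            }
        }
        return cdf.length - 1;
    }
}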

The communication between the components (i.e., UserSimulator(s), PortalSimulator, and ProviderSimulator(s)) is enabled and synchronized using cubbyholes.

As shown in Figure 4.2, UserCubbyHole.java implements the synchronization mechanism between UserSimulator(s) and PortalSimulator. It provides methods for sending and receiving requests and response messages. The putRequest() method is used by UserSimulator to send a request to PortalSimulator, and getResponse() is used to receive the result. The getRequest() and putResponse() methods are used by PortalSimulator to get the requests and return the results to UserSimulator.

Two cubbyholes are used for the synchronization between PortalSimulator and ProviderSimulator(s), one for the requests and the other for the responses, as shown in Figure 4.3 and Figure 4.4. PortalSimulator uses the putRequest() and getResponse() methods to send requests and receive responses. ProviderSimulator uses getRequest() and putResponse() to receive the requests and send the responses.

As can be noticed, there are two kinds of parameters in the system. The first group of parameters are defined for the system as a whole, e.g., the number of providers, the number of online users, the cache size, etc. The second group are those defined per provider/object, e.g., the dependence between underlying data and objects (which objects are invalidated upon a change in the back-end database), the size of objects, the computation cost of objects, etc. For the second group of parameters, we provide mechanisms to set them effectively.

In the rest of this section we explain how the dependence between objects and the underlying data is implemented. We also show how the size of objects and their computation/generation cost can be set.

package mypackage;

import java.util.*;

public class UserCubbyHole {
    public ArrayList eList = new ArrayList();
    private String respondedThreadName;

    public class EventItem {
        public String threadName;
        public long time;
        public char operation;
        public long getReqTime;
    }

    public synchronized void putRequest(String thrName, long time,
                                        char operation, long getReqTime) {
        EventItem eItem = new EventItem();
        eItem.threadName = thrName;
        eItem.time = time;
        eItem.operation = operation;
        eList.add(eItem);
        notifyAll();
    }

    public synchronized EventItem getRequest() {
        while (eList.size() == 0) {
            try { wait(); } catch (InterruptedException e) { }
        }
        EventItem eItem = (EventItem) eList.remove(0);
        eItem.getReqTime = System.currentTimeMillis();
        return eItem;
    }

    public synchronized void putResponse(String threadName) {
        while (respondedThreadName != null) {
            try { wait(); } catch (InterruptedException e) { }
        }
        respondedThreadName = threadName;
        notifyAll();
    }

    public synchronized void getResponse(String threadName) {
        while (!threadName.equalsIgnoreCase(respondedThreadName)) {
            try { wait(); } catch (InterruptedException e) { }
        }
        respondedThreadName = null;
        notifyAll();
    }
}

Figure 4.2: Synchronization between UserSimulators and PortalSimulator

package mypackage;

import java.util.*;

public class RequestCubbyHole {
    private boolean available = false;
    private RequestItem reqItem = new RequestItem();

    public synchronized void putRequest(char o, int pNo, int oNo, long t) {
        while (available == true) {
            try { wait(); } catch (InterruptedException e) { }
        }
        reqItem.op = o;
        reqItem.pNo = pNo;
        reqItem.oNo = oNo;
        reqItem.cacheTime = t;
        available = true;
        notifyAll();
    }

    public synchronized RequestItem getRequest() {
        while (available == false) {
            try { wait(); } catch (InterruptedException e) { }
        }
        available = false;
        notifyAll();
        return reqItem;
    }
}

Figure 4.3: Synchronization of request items

4.1.1 Dependence Between Objects

As mentioned earlier, one way to determine invalid objects is to monitor update requests to the back-end database and invalidate the affected entries in the cache validation table. In order to determine the invalid objects, we model the dependencies between underlying data and objects. The Object Dependence Graph (ODG), as described in the literature, is one way of modeling this. It should be mentioned that there are a number of ways to invalidate objects in the cache validation table; however, this choice does not affect our study, and some analysis from the literature has already been presented in Section 2.2.5.

package mypackage;

import java.util.*;

public class ResponseCubbyHole {
    private boolean available = false;
    private ResponseItem respItem = new ResponseItem();

    public synchronized void putResponse(char o, int pNo, int oNo, long s,
                                         double c, long gT, boolean sts) {
        while (available == true) {
            try { wait(); } catch (InterruptedException e) { }
        }
        respItem.op = o;
        respItem.pNo = pNo;
        respItem.oNo = oNo;
        respItem.size = s;
        respItem.cw = c;
        respItem.generationTime = gT;
        respItem.status = sts;
        available = true;
        notifyAll();
    }

    public synchronized ResponseItem getResponse() {
        while (available == false) {
            try { wait(); } catch (InterruptedException e) { }
        }
        available = false;
        notifyAll();
        return respItem;
    }

    public synchronized boolean responseReady() {
        return available;
    }
}

Figure 4.4: Synchronization of response items

Based on the ODG methodology, the dependencies between objects and their underlying data are modelled using a DAG (Directed Acyclic Graph) G = (V,E), where V = {U, O} is the set of underlying data and Web objects, and E ⊆ U × O is a set of relations. Each element in E indicates which Web object (Oi) will be invalidated when a change occurs on data object Ui. Figure 4.5 shows an ODG with V = {U0,U1,O0,O1,O2,O3} (U = {U0,U1}, O = {O0,O1,O2,O3}) and E = {(U0,O0), (U0,O1), (U1,O0), (U1,O2), (U1,O3)}. Modification of the underlying data U0 invalidates objects O0 and O1, and modification of U1 invalidates objects O0, O2, and O3. In other words, the graph shows which entries in the cache validation table are invalidated as a result of a change in the back-end database.

Figure 4.5: Object Dependence Graph (ODG) with Web objects O0, O1, O2, O3 depending on underlying data objects U0 and U1

To implement the ODG model in our test-bed, we use one input file for each provider. This file shows which objects will be invalidated based on the modifications. Therefore, real applications could be simulated more closely by modifying these files.

Figure 4.6 shows a sample input file. The file shows a set of dependencies between Web objects (O) and underlying database objects (U). Each database object is written alongside all of the Web objects that are affected when the database object changes. Note that the sample shown is much smaller than the actual files used in the simulations, and is primarily for explanation purposes. It should also be noted that the system is able to generate such a file randomly, based on an input parameter that specifies what percentage of objects is invalidated per write request (WriteFanOut). Once the input file is generated, it can be manually modified to achieve the desired dependence between objects. As can be seen in the figure, each underlying database object is associated (via "->") with a set of Web objects.

InputDependency000.txt
U0->O3,O5,O10-O35,O70-O80,O95;
U1->O0-O8,O20,O25,O35-O42;
U2->O15-O30,O40-O58,O70,O80-O87;
U3->O2-O7,O12-O20,O38,O64-O84;
...

Figure 4.6: Input file to define object dependence

InputSize000.txt
O0-O1000000->32000;
O0-O10->27000;
O18->37000;
O55-O85->45000;
O90-O110->18000;
O125-O131->29000;
...

Figure 4.7: Input file to define size of objects

The set of Web objects can be defined either by enumerating the objects one-by-one, separated by commas (e.g., O1,O2,O3), by specifying a range of objects (via the hyphen, as in O35-O42), or by some combination of these. Each dependency is terminated by a semi-colon.
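Such a dependency file could be read with a small routine like the following Java sketch; the file handling and in-memory representation are assumptions for illustration and are not taken from the test-bed sources.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.*;

class DependencyFileSketch {

    /** Maps each underlying data object (e.g. "U0") to the Web objects it invalidates. */
    static Map<String, Set<String>> parse(String fileName) throws IOException {
        Map<String, Set<String>> deps = new HashMap<String, Set<String>>();
        BufferedReader in = new BufferedReader(new FileReader(fileName));
        String line;
        while ((line = in.readLine()) != null) {
            line = line.trim();
            if (!line.contains("->")) {
                continue;                            // skip the header line and blanks
            }
            String[] parts = line.replace(";", "").split("->");
            Set<String> objects = new HashSet<String>();
            for (String item : parts[1].split(",")) {
                if (item.contains("-")) {            // a range such as O10-O35
                    String[] range = item.split("-");
                    int from = Integer.parseInt(range[0].substring(1));
                    int to = Integer.parseInt(range[1].substring(1));
                    for (int k = from; k <= to; k++) {
                        objects.add("O" + k);
                    }
                } else {                             // a single object such as O3
                    objects.add(item);
                }
            }
            deps.put(parts[0], objects);
        }
        in.close();
        return deps;
    }
}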

4.1.2 Size and Computation Cost

The simulation system also uses input files for defining the size and computation cost of objects. One file defines object sizes and the other defines computation costs. The syntax is similar to that of the object dependency file, except that the object sets occur on the left of the "->". The right-hand side gives the size or cost of all of the objects on the left-hand side of the "->". Figure 4.7 and Figure 4.8 show examples.

Size is represented in bytes and computation cost in milliseconds. For example, the second line in Figure 4.7 sets the size of objects O0 through O10 to 27000 bytes.

InputComputation000.txt
O0-O1000000->750;
O3-O12->520;
O18-O22->450;
O24->640;
O28-O32->380;
O35-O43->580;
...

Figure 4.8: Input file to define computation cost of objects

Similarly, the second line in Figure 4.8 sets the computation cost for objects O3 through O12 to 520 milliseconds. The first line in both files sets a default value for all objects that are not mentioned in subsequent lines.

As for the dependence files, the test-bed is able to generate values for size and computation cost randomly based on some input parameters, and the files can be manually modified afterwards. These parameters give a range of minimum and maximum values for size and computation cost (i.e., minObjectSize and maxObjectSize for size, and minObjectGenerationTime and maxObjectGenerationTime for computation cost).

These files allow us to set up a simulation to model certain kinds of real-life systems. For example, a slow system with large pages could be modelled by setting (comparatively) large values for computation costs and (comparatively) large values for object sizes.

4.1.3 Cache-worthiness Scores

As mentioned in Section 3.4, the cache-worthiness scores are calculated as the aggregation of: (i) access frequency, calculated by processing the access log, (ii) update frequency, calculated by processing the database update log, (iii) computation cost, calculated by processing the database request/delivery log, and (iv) delivery cost, measured by the size of the objects. Each term is associated with a weight in the aggregation formula. The value for each weight is set according to the system requirements and/or characteristics.

In order to determine the best value for each weight, we run the system using possible values for each one and compare the performance results. The values which result in the best performance are then selected. However, these values are based on the current system characteristics and/or requirements; when the situation changes, the best value for each term will also change and should be calculated again. For each set of experimental results presented in the following sections, we have run the system on all possible2 values for each weight and used the values that result in the best performance.

4.2 Evaluation Results

In this section we present the performance results of the collaborative strategy. We compare its performance in terms of throughput, network bandwidth usage, and average access time with other caching strategies. The following caching strategies have been used: LRU, FIFO, LFU, SIZE, LRU-S, and LRU-SP. It should be mentioned that, among all the strategies in the literature, LRU-SP is the most recently introduced strategy for Web objects and was expected to perform the best. The results of the evaluation confirm this; however, our caching strategy outperforms LRU-SP. We also use NoCache (i.e., without caching) as a baseline.

2In order to limit the number of experiments, we have run these experiments on discrete values for each weight with a difference of 0.1.

4.2.1 Throughput

In the first experiment, we study the throughput, measured as the number of performed requests per minute. We vary the average size of objects and run each simulation instance for 12 hours with a cache size equal to 20% of the size that would accommodate all the objects. The effect of cache size on throughput is examined in subsequent experiments. Figure 4.9 shows the results for the different caching strategies, i.e., CacheCW, LRU, FIFO, LFU, SIZE, LRU-S, LRU-SP, and NoCache.

The throughput of NoCache, as shown in the figure, stays almost constant at about 293 for 16 KB, 224 for 32 KB, and 185 for 64 KB objects. CacheCW shows the best results, with 404 for 16 KB, 311 for 32 KB, and 261 for 64 KB objects; the improvement in throughput compared to NoCache is 1.38, 1.39, and 1.41, respectively. LRU-SP shows the second-best results, as expected, with throughput values of 334, 255, and 211. These are followed by LRU-S, LRU, LFU, FIFO, and SIZE, respectively. The order is consistent for all three object sizes used in the experiment.

All caching strategies show faster growth in throughput at the beginning, which slowly stabilizes at a peak. The fast increase at the beginning occurs while the cache is being populated; once the cache is populated with mostly cache-worthy objects, the throughput stabilizes.

Based on the results, CacheCW outperforms all the other caching strategies by a factor greater than 1.2, 1.22, and 1.24 for the three object sizes.

Figure 4.9: Throughput. Throughput (req/min) over time (min) for NoCache, CacheCW, LRU-SP, LRU-S, LRU, LFU, FIFO, and SIZE. (a) Average object size: 16 KB; (b) Average object size: 32 KB; (c) Average object size: 64 KB.

As can be noticed, the improvement in throughput increases with object size. This is mainly because smaller objects incur relatively more overhead for generating and exchanging validation messages: more validation messages are exchanged, and more overhead is involved, relative to the amount of data transferred. For very small objects, the caching overhead might even outweigh the benefit; however, this is not a likely situation in real applications, as the average object size is estimated to be on the order of 8 KB.

One of the reasons why CacheCW outperforms all the other strategies is that none of those caching strategies is particularly designed for dynamic objects; in particular, the update frequency and generation time of objects are not considered in any of them. Moreover, they do not address the issue of identifying good candidates for caching: it is assumed that all the objects are cacheable and therefore all the objects get cached, while CacheCW dynamically identifies caching policies by disregarding those objects with a cache-worthiness score equal to zero.
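A minimal sketch of such a score-driven caching decision is shown below: zero-scored objects are never admitted, and the lowest-scored cached object is evicted when space is needed. The data structures are illustrative only and are not the portal's actual cache implementation.

import java.util.HashMap;
import java.util.Map;

// Sketch of score-based admission and eviction.
public class ScoreBasedCache {
    private final Map<String, Double> scores = new HashMap<>(); // objectId -> cache-worthiness
    private final int capacity;

    public ScoreBasedCache(int capacity) { this.capacity = capacity; }

    public void admit(String objectId, double cacheWorthiness) {
        if (cacheWorthiness <= 0.0) return;              // zero-scored objects are never cached
        if (!scores.containsKey(objectId) && scores.size() >= capacity) {
            // evict the currently cached object with the lowest score
            String victim = null;
            double lowest = Double.MAX_VALUE;
            for (Map.Entry<String, Double> e : scores.entrySet()) {
                if (e.getValue() < lowest) { lowest = e.getValue(); victim = e.getKey(); }
            }
            if (victim != null) scores.remove(victim);
        }
        scores.put(objectId, cacheWorthiness);
    }
}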

To set up a worst-case, or upper-bound, scenario for CacheCW compared to the other strategies, we run the same simulation with one modification: this time, all the caching strategies cache only those objects with a non-zero cache-worthiness score.³ In practice this situation could only arise if a system administrator defined such caching policies by including or excluding objects or object groups for caching. Identifying individual objects for caching would not be possible in practice, unless object groups are considered instead of individual objects. In other words, the considered situation cannot happen in practice.

³ In this scenario the performance of CacheCW will be the same as before and only the other caching strategies perform better.

Therefore, it is a valid case for setting up a worst-case (upper-bound) scenario for CacheCW. The results for NoCache and CacheCW stay the same, but the other strategies are expected to show better throughput. The results are shown in Figure 4.10.

As expected, all the other caching strategies show better results. LRU-SP, with a throughput of 361, shows the second-best result after CacheCW; the other strategies fall behind LRU-SP in the same order as in the previous experiment, although all of them experience an increase in throughput. According to the results, CacheCW still outperforms all the strategies, although the improvement in throughput decreases to 1.12 for 16 KB, 1.13 for 32 KB, and 1.15 for 64 KB objects compared to LRU-SP, which is still a significant difference.

4.2.2 Network Bandwidth Usage

The second experiment examines the amount of network bandwidth usage per request. Figure 4.11 shows the network bandwidth usage per request for the different object sizes, i.e., 16 KB, 32 KB, and 64 KB. The results are based on a 12-hour simulation with a cache size equal to 20%.

As shown in Figure 4.11, all caching strategies reduce the network bandwidth usage per request. CacheCW achieves the best reduction, using 1.3 times less bandwidth than NoCache for 16 KB objects; the ratio is 1.34 and 1.38 for 32 KB and 64 KB objects. The second-best result belongs to LRU-SP, with reduction rates of 1.10, 1.11, and 1.12 for the different object sizes. These are followed by LRU-S, LRU, LFU, FIFO, and SIZE, respectively. The order is consistent among all object sizes.

Figure 4.10: Throughput - upper-bound scenario for CacheCW. Throughput (req/min) over time (min) for NoCache, CacheCW, LRU-SP, LRU-S, LRU, LFU, FIFO, and SIZE. (a) Average object size: 16 KB; (b) Average object size: 32 KB; (c) Average object size: 64 KB.

Figure 4.11: Network Bandwidth Usage. Network usage (KB/req) for NoCache, CacheCW, LRU-SP, LRU-S, LRU, LFU, FIFO, and SIZE. (a) Average object size: 16 KB; (b) Average object size: 32 KB; (c) Average object size: 64 KB.

The different ratios for the different object sizes are due to the transmission of validation messages through the network. Note that one validation message is sent to validate each object; therefore, for smaller object sizes, proportionally more network bandwidth is consumed by validation messages. It should also be mentioned that each user request generates a number of sub-requests to providers, which can be varied in the simulation; in other words, each request results in a number of response messages being sent by providers. That is why the amount of network usage per request is several times larger than the average size of a response message.

Similar to the throughput experiments, we consider a case where all the other caching strategies perform at their best and compare the results with CacheCW. Although this might not be possible in practice, the results demonstrate an upper-bound scenario for comparing CacheCW with the other caching strategies. The results are shown in Figure 4.12.

As shown in the figure, even in this case, CacheCW outperforms all the caching strategies by factors of at least 1.10, 1.12, and 1.15 for the different object sizes. LRU-SP has the second-best result, followed by the other strategies in the same order.

4.2.3 Average Access Time

In this experiment, we study the average access time, which is defined as the time from when a request is submitted by the user until the result(s) are received. Figure 4.13 shows the average access time of CacheCW in comparison with NoCache and the other caching strategies. The same simulation time and cache size as in the previous experiments are used here, i.e., a 12-hour simulation and a 20% cache size.

Figure 4.12: Network Bandwidth Usage - upper-bound scenario for CacheCW. Network usage (KB/req) for NoCache, CacheCW, LRU-SP, LRU-S, LRU, LFU, FIFO, and SIZE. (a) Average object size: 16 KB; (b) Average object size: 32 KB; (c) Average object size: 64 KB.

Figure 4.13: Average Access Time. Average access time (sec) for NoCache, CacheCW, LRU-SP, LRU-S, LRU, LFU, FIFO, and SIZE. (a) Average object size: 16 KB; (b) Average object size: 32 KB; (c) Average object size: 64 KB.

As shown in Figure 4.13, caching improves the average access time. For 16 KB objects, CacheCW results in an average access time of 4.67 seconds, compared with 6.12 seconds for NoCache and 5.42 seconds for LRU-SP. For 32 KB objects, the average access times are 6.45, 8.71, and 7.61 seconds for CacheCW, NoCache, and LRU-SP, respectively; for 64 KB objects they are 7.1, 9.8, and 8.52 seconds. As can be noticed, with increasing average object size the improvement in average access time of CacheCW over NoCache and LRU-SP increases: 1.45 and 0.75 seconds for 16 KB objects, 2.26 and 1.16 seconds for 32 KB objects, and finally 2.7 and 1.42 seconds for 64 KB objects.

Similar to the previous experiments, we consider a best-case scenario for the other caching strategies and use it as a comparison base for an upper-bound scenario for CacheCW. The results are shown in Figure 4.14.

As can be seen in the figure, even in this worst case CacheCW still outperforms all the other caching strategies, by factors of at least 1.11 for 16 KB objects, 1.35 for 32 KB objects, and 1.39 for 64 KB objects. The second-best result belongs to LRU-SP, followed by the other strategies in the same order as before.

4.2.4 Analysis of the Performance Results

CacheCW outperforms all the caching strategies in terms of throughput, network usage, and average access time. Even in an upper-bound study, the improvement is noticeable. In all the experiments, LRU-SP has the second-best results, followed by LRU-S, LRU, LFU, FIFO, and finally SIZE. The order is consistent across all the performance measures, i.e., throughput, network usage, and average access time.

Figure 4.14: Average Access Time - upper-bound scenario for CacheCW. Average access time (sec) for NoCache, CacheCW, LRU-SP, LRU-S, LRU, LFU, FIFO, and SIZE. (a) Average object size: 16 KB; (b) Average object size: 32 KB; (c) Average object size: 64 KB.

In the following experiments, we only compare the performance results of CacheCW with LRU-SP, as the other strategies have lower performance results.

The reason why SIZE has the worst performance lies in the fact that it does not consider the access pattern of Web objects when selecting objects for removal. The other caching strategies studied here consider the access frequency in one way or another. The difference between LRU, LFU, and FIFO is the way they use past history to predict the access frequency in the future; LRU-S and LRU-SP combine it with the size and popularity of objects to achieve more effectiveness. Unlike CacheCW, none of them considers the update frequency and computation cost of objects.

All caching strategies perform slightly better with larger objects, because less overhead is involved in generating, processing, and transferring validation messages: for larger objects, fewer validation messages are generated, processed, and transferred relative to the amount of data being processed and transferred.

4.2.5 Average Access Time - First Reply

As mentioned in Chapter 2, results can be shown to the user as they arrive. In such systems, the user-perceived delay could be considered as the time from when the request is sent until the “first” response message is received; subsequent messages are delivered to the user while they browse the results received so far. A slightly different definition of access time measures the time from when the request is received by the portal until the result is made ready by the portal. Obviously, this definition hides the network delay between the users and the portal, and the calculated average access time will be less than the one calculated based on the former definition. Figure 4.15 shows the results based on this definition of average access time.

Figure 4.15: Average Access Time (first reply). Average access time (sec) for NoCache, CacheCW, and LRU-SP, for object sizes 16K, 32K, and 64K.

According to the results, caching improves the average access time to the first response significantly. In fact, all caching strategies show a significant improvement, since any of them can eliminate the need for portal-provider communication and/or object generation for at least one of the objects.

For 16 KB objects, the resulting average access times are 2.46, 5.29, and 3.11 seconds for CacheCW, NoCache, and LRU-SP. The corresponding values for 32 KB objects are 2.92, 6.45, and 3.73 seconds, and for 64 KB objects 3.79, 8.72, and 4.9 seconds, respectively.

We also show the results for the upper-bound comparison of CacheCW with the other strategies. The results are shown in Figure 4.16.

In the worst case CacheCW shows 1.19, 1.2, and 1.22 times improvement compared to LRU-SP. As mentioned earlier, the results for the other caching strategies were excluded from the figures, as LRU-SP shows the best results among them, followed by the others in the same order.

Figure 4.16: Average Access Time (first reply) - upper-bound scenario for CacheCW. Average access time (sec) for NoCache, CacheCW, and LRU-SP, for object sizes 16K, 32K, and 64K.

4.2.6 Effect of User Access Pattern

In this experiment we study the effect of user access patterns on performance. In the normal activity of a Web application, read requests are expected to occur more frequently than write requests; this is actually the logic behind all caching strategies. In the simulation, we consider a parameter readToWriteRatio which denotes the ratio of read to write requests. This parameter was set to 0.9 for the previous simulation instances, meaning that 90% of user requests are read requests, i.e., search and browse requests, and 10% are write requests, e.g., updating personal information or buying a product. In this experiment we examine an extreme case where read and write requests each make up 50% of the load, by setting readToWriteRatio to 0.5. This is to make sure that in such extreme cases the overhead of caching does not outweigh its benefits and our caching strategy does not experience degraded performance. This is an important issue in our caching strategy, as we use meta-data and need to be sure that managing this meta-data does not become troublesome in cases like this.

Figure 4.17 shows the throughput for 16 KB, 32 KB, and 64 KB objects. As shown in the figure, even with a ratio of 50%, our caching strategy shows a significant improvement compared to NoCache and all the other caching strategies.

The most important reason the other caching strategies fall behind CacheCW in such an extreme case is that they do not take the update rate of objects into account in the caching decision.

Network bandwidth usage cannot be negatively affected by such an increase in update rate: it could only increase if the response to most of the validation messages were “Invalid”, or if all response messages were very small, and neither of these occurs even in extreme cases. Average access time also cannot degrade when throughput is improved. Therefore, we omit the results for network bandwidth usage and average access time.

4.2.7 Effect of Cache Size

In this experiment, we vary the cache size and compare the performance results of CacheCW with the others. An average object size of 32 KB is used in this experiment. Figure 4.18 shows the hit ratio (number of hits divided by number of accesses) for 100%, 50%, and 20% cache sizes. As shown in the figure, when the cache size is very large (e.g., 100%), all strategies perform almost similarly⁴. CacheCW still performs better, as it ignores unsuitable objects for caching, thereby minimizing false hits and achieving a better hit ratio.

⁴ Only LRU-SP is shown in the figure as it was the best after CacheCW among all the caching strategies.

Figure 4.17: Throughput - high update rate. Throughput (req/min) over time (min) for NoCache, CacheCW, and LRU-SP. (a) Average object size: 16 KB; (b) Average object size: 32 KB; (c) Average object size: 64 KB.

When there is a cache size limitation (e.g., 50% or 20%), CacheCW performs better than LRU-SP and the others, because it caches more suitable objects. With a tighter limitation on the cache size the difference becomes significant, as shown in Figure 4.19.

We also evaluate the effect of cache size on the overall performance of CacheCW (i.e., throughput and network bandwidth usage). We run simulation instances of CacheCW with different cache sizes and compare the results. Figure 4.20 (a) shows that increasing the cache size beyond 30% has little or no effect on throughput (i.e., 315 requests per minute for a 30% cache, compared to 315 for 100%, 316 for 50%, 311 for 20%, 292 for 10%, and 276 for 5% cache sizes). Increasing the cache size beyond 30% in this case does not have a positive impact on the throughput (only a very minor one for 50%), because of the extra overhead of housekeeping operations incurred for larger cache sizes, e.g., probing a larger cache look-up table at the portal. Figure 4.20 (b) supports this claim: for cache sizes larger than 30% there is a slight improvement in cache hit ratio, but, as noted before, due to the overhead it does not impact the throughput positively. According to Figure 4.20 (c), larger cache sizes have a positive impact on reducing network bandwidth usage.

4.2.8 Weak Cache Coherency

Some Web applications do not need to provide strong coherency for cached objects, as discussed in Section 2. In such applications stale cached objects might, to some extent, be delivered to users. One way of doing this is to assign an expiration time to objects: objects are delivered to users before their expiration time, without checking the freshness of the object.

Figure 4.18: Hit Ratio - different cache sizes. Hit ratio (hit/access) over time (min) for CacheCW and LRU-SP. (a) Cache size: 100%; (b) Cache size: 50%; (c) Cache size: 20%.

Figure 4.19: Hit Ratio - smaller cache sizes. Hit ratio (hit/access) over time (min) for CacheCW and LRU-SP. (d) Cache size: 10%; (e) Cache size: 5%.

In this experiment we compare CacheCW with the other caching strategies when strong coherency is not required by the application. We set an expiration time of 30 seconds for the objects and compare the performance results. As shown in Figure 4.21, CacheCW is superior to all the other caching strategies, because it chooses the most suitable objects for caching. The improvements in throughput are 1.49, 1.55, and 1.63 times compared to NoCache and 1.2, 1.21, and 1.22 times compared to LRU-SP for 16 KB, 32 KB, and 64 KB objects, respectively. The improvements in network usage are 1.41, 1.45, and 1.5 times compared to NoCache and 1.19, 1.2, and 1.22 times compared to LRU-SP.
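As a rough illustration of this weak-coherency mode, the sketch below serves a cached copy without any validation message as long as its expiration time (30 seconds in the experiment above) has not passed. The class and method names are assumptions made for illustration.

import java.time.Duration;
import java.time.Instant;

// Sketch of an expiring cache entry used for weak coherency.
public class ExpiringEntry {
    private final String content;
    private final Instant expiresAt;

    public ExpiringEntry(String content, Duration timeToLive) {
        this.content = content;
        this.expiresAt = Instant.now().plus(timeToLive);
    }

    // Returns the cached content if still fresh, or null to force a provider round trip.
    public String serveIfFresh() {
        return Instant.now().isBefore(expiresAt) ? content : null;
    }

    public static void main(String[] args) {
        ExpiringEntry entry = new ExpiringEntry("<html>...</html>", Duration.ofSeconds(30));
        System.out.println(entry.serveIfFresh() != null ? "served from cache" : "revalidate");
    }
}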

Figure 4.20: Effect of cache size on performance of CacheCW. (a) Throughput (req/min); (b) Hit ratio (hit/access); (c) Network bandwidth usage (KB/req); each for cache sizes of 5%, 10%, 20%, 30%, 50%, 100%, and NoCache.

Strategy    False Hit Ratio
CacheCW     0.0143
LRU-SP      0.0167

Table 4.1: False Hit Ratio - weak coherency

Finally, the improvements in average access time are 1.42, 1.45, and 1.48 times compared to NoCache and 1.18, 1.19, and 1.20 times compared to LRU-SP.

Figure 4.22 shows the results for the first reply. The improvements are 2.41, 2.46, and 2.56 times compared to NoCache and 1.22, 1.23, and 1.25 times compared to LRU-SP.

The results presented in Figure 4.21 and Figure 4.22 show the superiority of CacheCW compared to the other caching strategies in cases where weak coherency is acceptable. Comparing the false hit ratio of CacheCW and the other caching strategies (Table 4.1) reveals that the level of coherency offered by CacheCW is also better: besides the improvement in all performance measures, it results in a lower false hit ratio. As mentioned earlier, this is because CacheCW chooses more suitable objects for caching.

4.2.9 Recency of Objects

As mentioned earlier in Section 3, the recency of objects may be combined with the cache-worthiness scores by the portal. The decision for caching objects is then made by the portal based on the combined value of recency and cache-worthiness scores.

Figure 4.21: Performance measures - weak coherency. (a) Throughput (req/min); (b) Network usage (KB/req); (c) Average access time (sec); each for NoCache, CacheCW, and LRU-SP at object sizes 16K, 32K, and 64K.

Figure 4.22: Average Access Time (first reply) - weak coherency. Average access time (sec) for NoCache, CacheCW, and LRU-SP at object sizes 16K, 32K, and 64K.

Strategy     Throughput (req/min)    Hit Ratio    Network Usage (KB/req)
CacheCW-R    322                     0.37         305
CacheCW      311                     0.35         289

Table 4.2: Comparison of CacheCW-R and CacheCW

We call the resulting strategy CacheCW-R. The results of comparing CacheCW-R with CacheCW are shown in Table 4.2 for 32 KB objects. As shown in the table, CacheCW-R improves the performance measures. However, more improvement is expected if the access and update patterns change quickly over time; in that case, including recency compensates for the changes in access pattern.
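A possible way to realize this combination is sketched below: the provider-supplied cache-worthiness score is mixed with a recency term that decays with the time since the last access. The exponential decay and the mixing weight alpha are illustrative assumptions; the actual combination used by the portal is the one described in Section 3.

// Sketch of combining recency with the cache-worthiness score (CacheCW-R).
public class RecencyCombiner {
    private final double alpha;       // weight of the provider score vs. recency
    private final double halfLifeSec; // recency halves after this many seconds without access

    public RecencyCombiner(double alpha, double halfLifeSec) {
        this.alpha = alpha;
        this.halfLifeSec = halfLifeSec;
    }

    public double combinedScore(double cacheWorthiness, double secondsSinceLastAccess) {
        double recency = Math.exp(-Math.log(2) * secondsSinceLastAccess / halfLifeSec);
        return alpha * cacheWorthiness + (1 - alpha) * recency; // both terms assumed in [0, 1]
    }
}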

4.2.10 Utility of Providers

In this experiment we examine the effect of considering the utility of providers in the caching strategy. As mentioned earlier, each query's processing time is bounded by the throughput of the slowest provider participating in the query processing, especially when the result of such a provider cannot be ignored, e.g., when satisfying a composite service or when the provider's result must be included in the results.

                        CacheCW    CacheCW-Util
Throughput (req/min)    259        275

Table 4.3: Effect of utility on throughput

                             CacheCW    CacheCW-Util
Average Access Time (sec)    7.91       7.32

Table 4.4: Effect of utility on average access time

In this experiment we put some delay on some of the providers, in such a way that on average one provider in each query experiences such a delay. We first run the simulation without considering the utility; in other words, we assign the same utility to all providers and measure the throughput of the system. For the second run we set a higher utility value for the slow providers and measure the throughput. The results are shown in Table 4.3. As shown in the table, this results in a 6% improvement in throughput.

We also examine the effect of considering the utility of providers on the average access time. Table 4.4 shows the results. As can be seen in the table, the average access time is improved by 8%.
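One simple way to fold provider utility into the caching decision is sketched below, where the cache-worthiness score of an object is boosted by the utility assigned to its provider. The multiplicative combination is an assumption made purely for illustration; the thesis only requires that slow providers receive a higher utility.

import java.util.Map;

// Sketch of weighting cache-worthiness scores by provider utility.
public class UtilityWeighting {
    private final Map<String, Double> providerUtility; // providerId -> utility (>= 1 for slow providers)

    public UtilityWeighting(Map<String, Double> providerUtility) {
        this.providerUtility = providerUtility;
    }

    public double effectiveScore(String providerId, double cacheWorthiness) {
        // objects from high-utility (slow) providers are more likely to stay cached
        return cacheWorthiness * providerUtility.getOrDefault(providerId, 1.0);
    }
}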

4.2.11 Regulation

As discussed in Section 3, the fact that it is up to the providers to calculate cache-worthiness scores may lead to inconsistencies between them. To avoid such inconsistencies, a regulation factor is assigned to each provider, and every cache-worthiness score from a provider is multiplied by the corresponding factor. Therefore, providers whose scores are low should have a high regulation factor, and vice versa. The regulation factor is adjusted over time by monitoring the false and real hit ratios.

For this purpose, the providers in the simulation are set up in such a way that the first one deliberately overestimates, the second underestimates, and the third produces the standard cache-worthiness score; the same pattern applies to subsequent providers. In other words, one third of the providers overestimate, one third underestimate, and the remaining one third use the normal cache-worthiness score. Each provider was initially given a regulation factor of 1, so that its cache-worthiness scores were not modified.

Both the false and the real hit ratio were used to downgrade or upgrade the regulation factor. Among all the variations we tried, using the real hit ratio divided by the cache space occupied by each provider to adjust the regulation factor was the most successful. Using the real hit ratio by itself did not produce the desired results, as the performance of the cache for a provider depends both on the real hit ratio and on the cache space occupied by that provider.

Providers were monitored to see whether the regulation factor moved in such a way as to separate the three groups of providers, so that the underestimating providers consistently had the highest factor, followed by the accurately estimating providers, with the overestimating providers having the lowest regulation factor. Figure 4.23 (a) shows the changes in the regulation factor for the different groups of providers. One provider from each group is shown in the figure; however, all providers in each group show similar results.

The throughput of the system was compared with the case where no regulation is performed. The results of the experiment are shown in Figure 4.23 (b). They demonstrate that the regulation factor does separate the providers accordingly, which in turn helps to improve the performance of the Web portal in terms of throughput.

When upgrading or downgrading the regulation factor, we use a small value δ by which λ(m) is increased or decreased. If a very small value is chosen for δ, it takes a long time for the system to adjust itself; on the other hand, a large value for δ makes the regulation factor fluctuate unnecessarily. Choosing an appropriate value lets the system adjust itself in a reasonable amount of time. For this purpose we have examined different values for δ, which is defined as follows:

CW_{i,j}: cache-worthiness score of object O_j at provider P_i
CW_i: average value of the cache-worthiness scores at provider P_i

∆_i(CW) = CW_{i+1} − CW_i
δ = f × ∆(CW), 0 < f ≤ 1

Smaller values for f make the adjustment smoother, but slower. The experimental results show that any value for f in the range 0 < f ≤ 1 generates the expected results; the results in this experiment were generated with f = 0.1.
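The following sketch illustrates one plausible form of this adjustment step, assuming the regulation factor λ of a provider is nudged up or down by δ = f × ∆(CW) depending on how the provider's real hit ratio per unit of occupied cache space compares with the portal-wide average; the exact comparison rule is an assumption made for illustration.

// Sketch of the per-interval regulation-factor adjustment.
public class Regulator {
    private final double f; // 0 < f <= 1; f = 0.1 in the experiments

    public Regulator(double f) { this.f = f; }

    public double adjust(double lambda, double hitRatioPerSpace,
                         double portalAverage, double deltaCW) {
        double delta = f * deltaCW;                                   // delta = f x Delta(CW)
        if (hitRatioPerSpace > portalAverage) return lambda + delta;  // upgrade under-estimators
        if (hitRatioPerSpace < portalAverage) return lambda - delta;  // downgrade over-estimators
        return lambda;                                                // equal performance: leave unchanged
    }
}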

Figure 4.23: Regulating heterogeneous cache-worthiness scores. (a) Regulation factor over time (min) for UnderEst, NormalEst, and OverEst providers; (b) Throughput (req/min) over time (min) with (Reg) and without (NoReg) regulation.

         UnderEst    NormalEst    OverEst
NoReg    6.93        6.65         6.58
Reg      6.63        6.58         6.55

Table 4.5: Average access times for individual providers

         Average Access Time (sec)
NoReg    6.72
Reg      6.59

Table 4.6: Total average access time

When ∆(CW) = 0, although very unlikely, δ will be zero; in other words, in this special case the regulation factors stay unchanged. However, in the next interval, when ∆(CW) is calculated again, the regulation process resumes. The value of ∆(CW) is calculated using the objects available in the cache. Our experiments show that even using a subset of the cached objects to compute ∆(CW) gives the same results; an estimate of the value, in case the overhead is an issue, also generates the desired results.

According to the experiments, regulation results in a minor improvement in throughput, i.e., 300 compared to 291, a factor of 1.03. However, as Table 4.5 and Table 4.6 show, it promotes fairness among providers by producing more even average access times for individual providers. Moreover, it improves the overall average access time of the system.

4.2.12 Meta-Data Synchronization

In this experiment we examine the effect of meta-data synchronization (of the cache look-up and cache validation tables) on performance. As already discussed, the synchronization may be done either on a periodic basis or whenever a change happens. We run two sets of experiments, one for periodic synchronization and one for change-based synchronization, and present the results.

Time-based Synchronization

In the first experiment, we use time-based synchronization and study its effect on performance. We use different intervals, i.e., 30 sec, 1 min, 2 min, 3 min, 4 min, 5 min, 7 min, and 10 min, and compare the results with the case where no synchronization is done (NoSynch). Obviously, smaller intervals increase the hit ratio; however, they impose more network and computational overhead, which in turn degrades performance. In contrast, larger intervals do not impose much network and computational overhead, but have less effect on increasing the hit ratio. According to the results, a three-minute interval provides the best throughput, while the best result for network bandwidth usage is achieved with an interval of 4 minutes or more. The results are shown in Table 4.7.

The best value for the interval depends on system parameters such as the size of the meta-data tables, the network bandwidth, the update rate, the number of providers, and the load on individual providers or the portal. In practice the best value can be determined by trying different values for the interval and monitoring the performance of the system; the best value can be set after this period. However, if the parameters of the system change over time, e.g., network bandwidth or number of providers, then the best interval should be determined again.
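As an illustration of time-based synchronization, the sketch below refreshes the portal's meta-data at a fixed interval using a standard scheduled executor; fetchProviderMetaData() is a hypothetical placeholder for the actual exchange with the providers.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of periodic (time-based) meta-data synchronization at the portal.
public class PeriodicSynchronizer {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void start(long intervalSeconds) {
        scheduler.scheduleAtFixedRate(this::synchronize, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
    }

    private void synchronize() {
        // pull updated meta-data from every provider and merge it into the
        // portal's cache look-up and cache validation tables
        fetchProviderMetaData();
    }

    private void fetchProviderMetaData() {
        // placeholder: the network exchange with the providers would happen here
    }

    public static void main(String[] args) {
        new PeriodicSynchronizer().start(180); // a three-minute interval, as in the experiments
    }
}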

Interval    Throughput (req/min)    Network Usage (KB/req)
NoSynch     311                     289
30 sec      291                     314
1 min       302                     306
2 min       312                     399
3 min       325                     290
4 min       321                     288
5 min       318                     288
7 min       316                     289
10 min      315                     289

Table 4.7: Periodic synchronization of meta-data

Change-based Synchronization

In the second experiment, we use change-based synchronization and study its effect on performance. It should be noted that this mechanism eliminates the need for the portal to send validation messages when a hit is detected, while still guaranteeing strong coherency: if a change happens on the back-end database, the relevant object(s) at the portal are invalidated immediately, so the portal does not need to validate cached objects. This saves a network round trip each time a hit is detected, compared to the other synchronization mechanism or the case where no synchronization is done. However, its performance depends highly on the update rate; this method is expected to provide good performance results when the update rate is low. We run the simulation for different update rates, i.e., 2%, 10%, and 50%, and compare the results. The result of the experiment and its comparison with time-based synchronization⁵ is shown in Table 4.8.

                 2% update rate    10% update rate    50% update rate
No-Synch         388               311                310
Time-based       407               321                318
Change-based     436               318                278

Table 4.8: Effect of synchronization on throughput

According to Table 4.8, change-based synchronization improves throughput significantly when the update rate is low, e.g., 2%. However, when more changes happen on the back-end database, its overhead might outweigh the benefits and the system might experience degraded performance.
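A minimal sketch of change-based invalidation is given below: when a provider reports a change in its back-end database, the portal immediately drops the affected cached object, so subsequent hits can be served without validation messages. The class and method names are illustrative only, not the actual portal API.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of change-based invalidation at the portal.
public class ChangeBasedInvalidation {
    private final Map<String, String> cache = new ConcurrentHashMap<>(); // objectId -> content

    // Called when a provider pushes a change notification for an object.
    public void onProviderUpdate(String changedObjectId) {
        cache.remove(changedObjectId); // invalidate immediately; the next request regenerates it
    }

    // A hit can be served directly, since stale entries have already been removed.
    public String lookup(String objectId) {
        return cache.get(objectId);
    }

    public void store(String objectId, String content) {
        cache.put(objectId, content);
    }
}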

It should be noted that the throughput is measured as the number of requests per minute, including both read and write requests. Read requests normally involve more providers, while write requests involve fewer providers, mostly one (e.g., booking accommodation at a particular hotel). In this experiment, the results for the 50% update rate happen to be close to those for 10%.

⁵ The results for time-based synchronization are based on the best result achieved over the different synchronization intervals.

4.3 Summary

In this chapter, we have studied the performance of CacheCW and compared it with other caching strategies: LRU, FIFO, LFU, SIZE, LRU-S, and LRU-SP. Extensive simulations were conducted using an evaluation test-bed which enables us to simulate the behavior of a portal under different system parameters, including the number of providers, average message size, response time, access frequency, update rate, and cache size. Throughput, average access time, and network bandwidth usage were used as the primary performance measures. According to the results, CacheCW outperforms all the existing strategies.

Using a caching strategy increases the throughput. For all the simulated caching strategies, the throughput increases rapidly at the beginning, while the cache is being populated with more objects; the increase continues until the throughput slowly stabilizes at a peak, once the cache is populated with the most useful objects. According to the results, CacheCW achieves the best throughput, and better results are achieved for larger objects. That is because of the overhead involved in generating and exchanging validation messages: for smaller objects, more validation messages are exchanged and more overhead is involved relative to the amount of data transferred. In the worst-case scenario, the results show that CacheCW still performs better than the others. The reason CacheCW has a better throughput is that it populates the cache with the most useful objects, while the other strategies fall behind; in fact, none of those caching strategies is particularly designed for dynamic objects. In calculating the cache-worthiness scores all the aspects of an object are considered, including access frequency, update rate, size, and computation cost.

The update rate of an object is an important factor for dynamic data which is missed by the other strategies. The computation cost is another factor that is also missed by all the other strategies. Moreover, the other strategies do not address the issue of identifying good candidates for caching: it is assumed that all the objects are cacheable and therefore all the objects get cached, while CacheCW dynamically identifies caching policies by disregarding worthless objects. For example, when the update rate is high CacheCW avoids caching objects that change frequently, because they are not good candidates for caching and normally waste cache space without any benefit; moreover, they might incur more overhead on the cache. As a result, better performance is achieved for CacheCW compared with all the other caching strategies. The difference becomes significant when there is more limitation on cache space.

All the other caching strategies lose important information about the caching history of an object, e.g., its access frequency, when it is removed from the cache; when the object re-enters the cache, this information is calculated from scratch. This is not the case for CacheCW: every object carries its caching information, i.e., the cache-worthiness score, at all times. This feature enables the strategy to make better caching decisions, which makes it unique among all the other strategies.

Combining cache-worthiness scores with the recency of objects yields better performance. Based on the concept of temporal locality, objects that were accessed more recently are more likely to be accessed in the near future. The experimental results show that this combination produces better performance results.

The utility of a provider is another factor used in our strategy. Giving more priority to the objects of a slow provider that contributes to a composite service boosts the performance of the system as a whole in terms of response time and throughput. The concept of utility can also be used in a business portal to favour some providers over others based on, for example, the commission paid or other criteria. The evaluation results show that throughput and average access time are improved in such cases.

Regulation is an important issue that needs to be addressed appropriately in our strategy. We detect inconsistencies between different providers in calculating cache-worthiness scores and regulate them. Some providers might give artificially high scores to their objects in the hope of getting more cache space; it is also possible that some providers use inappropriate scoring methods which give high scores to their objects. Conversely, some providers might give low scores to their objects. These inconsistencies between different providers are detected in the regulation process and the scores from such providers are downgraded or upgraded. This improves the performance results and provides fairness among providers. The hit ratio of a provider is used as the primary factor in this process, as it reflects the performance of the cache for individual providers; however, we normalize the hit ratio values by dividing them by the cache space assigned to individual providers.

Synchronization of meta-data, i.e., the cache look-up table and cache validation tables, can be achieved using either a time-based or a change-based method. False hits are reduced if the meta-data are kept consistent with each other, and as a result better performance is achieved. If the update rate is low, a change-based method performs well.

When there is a high update rate, a time-based method with an appropriate interval normally performs better. The best value for the interval depends on the system parameters. In practice, this value is determined by trying different values for the interval and monitoring the performance results; the best value is set after the monitoring period, based on the value that generates the best performance results. However, if the system parameters change over time, this value should be determined again. These parameters include the size of the meta-data tables, network bandwidth, update rate, number of providers, and the load on individual providers or the portal.

CacheCW adaptively determines the caching policy for the objects. This makes it unique among caching strategies; in all the other strategies, a system administrator defines the caching policy by including or excluding objects or object groups for caching. As we have already discussed, this is almost impossible in a portal, where the system administrator does not have enough information about the objects of the providers.

Chapter 5

Conclusions

In this thesis, we studied portals as a growing class of Web-based applications, focussing on the performance of such applications. We examined existing solutions for caching Web data and discussed their limitations in providing effective caching in portals.

We introduced a collaborative caching technique for portals, to address the limitations of existing solutions for defining caching policies. The collaborative caching strategy provides an automatic way to define such policies, which makes it unique among all the existing caching strategies. The experimental results show that this strategy improves the performance by increasing throughput, reducing user-perceived delay, and reducing network bandwidth usage.

We also addressed the issue of heterogeneous caching policies by tracing the performance of the cache and regulating the scores from different providers. As a result, better fairness among providers is achieved and the performance of the cache as a whole is increased.

We also examined the synchronization of meta-data and evaluated the performance of change-based and time-based synchronization mechanisms.

Our proposed strategy can be deployed in conjunction with existing strategies such as caching at Web/application servers or server accelerators. It can also be implemented in existing application servers or server accelerators, making it easier for application developers to deploy our caching strategy in their applications by using the APIs or tag libraries provided by the server.

Although the collaborative caching strategy is proposed for portal applications, the idea of cache-worthiness scores can be used to specify candidate caching objects in any Web application.

The key factor in the overall performance of this strategy is the effective collaboration of the portal and its providers in calculating and using the cache-worthiness scores. Each provider should calculate the caching scores in a way that correctly distinguishes its objects from a caching point of view; otherwise, the portal cannot effectively take advantage of the cache. In practice, a scorer module could be provided by the portal as a downloadable plug-in to avoid such problems. The computation of the caching scores could be implemented in the module so that discrepancies between providers in calculating the scores are avoided. This module could be customized based on individual providers' parameters and priorities.

The calculation of cache-worthiness scores is performed periodically. The time period depends on how often system parameters such as access or update frequency change; to include the new parameters in the caching scores, the scores have to be calculated again. In systems where such parameters change frequently, the caching scores should be calculated more often. This might result in decreased system performance due to the overhead imposed on the providers. However, if the behavior of the system parameters follows a regular pattern during different times, different sets of pre-calculated scores may be used for different time intervals.

Otherwise, it would be helpful to calculate the caching scores only for objects that witness a change in the pattern and, for the other objects, to use the previous values. However, this issue is not that important in systems with stable system parameters.

There are cases where providers collaborate with several portals or applications. In these cases, the caching scores should be calculated separately for each portal or application, because access frequency, update frequency, and also network bandwidth may be different for each one. Moreover, the ODG should be defined separately. Calculating different sets of cache-worthiness scores and processing multiple ODGs can impose additional overheads on providers; to avoid such overheads, a solution for minimizing this processing is desired. For example, a common set could be calculated and maintained, with only the differences calculated for the different portals or applications.

For the synchronization of meta-data, either a time-based or a change-based method can be used. A change-based method can guarantee strict cache coherence without the need for a polling-every-time method, but may impose an increased overhead on the system. A time-based method can provide strict cache coherence only if a polling-every-time method is used; otherwise, it provides strong or weak cache coherence, depending on the synchronization period. Determining the best synchronization period for the current system parameters can be an issue; we have used empirical methods to find it. It would be ideal to have divergence-based synchronization: in this method, a function is defined to represent the divergence of the meta-data, using parameters such as false hits, and synchronization takes place when the value of this function exceeds a threshold. The threshold value can be set by a system administrator to enforce different requirements for cache coherence.

However, defining such a function is an open issue and a case for further work.

In our system, we take advantage of full-page caching. The caching performance may be increased with the use of partial caching: sometimes the whole object does not exist in the cache, but parts of the result do. A query may then result in a probe (i.e., the part that exists in the cache and can be served from it) and a remainder (i.e., the part of the result that does not exist in the cache and must be queried from the provider(s)). The decision for caching an object can also be made based on whether the object can partially satisfy requests for other objects. This may lead to defining a new kind of dependence between objects, and the caching policy should be adapted accordingly to make use of such dependencies.

145 Bibliography

[Abe01] Aberdeen Group. Cutting the Costs of Per- sonalization With Dynamic Content Caching. http://www.chutneytech.com/tech/aberdeen.cfm, March 2001. An Executive White Paper.

[AH00] Ron Avnur and Joseph M. Hellerstein. Eddies: Continuously Adaptive Query Processing. In ACM SIGMOD Conference, May 2000.

[AJL+02] Jesse Anton, Lawrence Jacobs, Xiang Liu, Jordan Parker, Zheng Zeng, and Tie Zhong. Web Caching for Database Aplications with Oracle Web cache. In ACM SIGMOD Con- ference, pages 594–599, 2002. Oracle Corporation.

[Aka] Akamai Technologies Corporate. http://www.akamai.com.

[Apa] Apache Software Foundation. http://www.apache.org.

[Apa04] Apache Software Foundation. JCS and JCACHE (JSR- 107). http://jakarta.apache.org/turbine/jcs/index.html, July 2004.

[AWY99] Charu Aggrawal, Joel L. Wolf, and Philip S. Yu. Caching on the World Wide Web. IEEE Transactions on Knowledge and Data Engineering (TKDE), 11(1):94–107, January/February 1999.

146 [BC02] Boualem Benatallah and Fabio Casati, editors. Special Issue on Web Services, Distributed and Parallel Databases, An In- ternational Journal. Kluwer Academic Publishers, 2002.

[BCF+99a] Lee Breslau, Pei Cao, Li Fan, Graham Phillips, and Scott Shenker. Web Caching and Zipf-like Distribution: Evidence and Implications. In IEEE INFOCOM, 1999.

[BCF+99b] Lee Breslau, Pei Cao, Li Fan, Graham Phillips, and Scott Shenker:. Web Caching and Zipf-like Distributions: Evidence and Implications. In IEEE INFOCOM, pages 126–134, 1999.

[BG98] Paul Barford and Mark Grovella. Generating Representative Web Workloads for Network and Server Performance Evalu- ation. In ACM Sigmetrics Performance Evaluation, 1998.

[BK04] Abdullah Balamash and Marwin Krunz. An Overview of Web Caching Replacement Algorithms. In IEEE Comunica- tions Surveys and Tutorials, volume 6, pages 44–56, 2004.

[BKK+01] R. Braumandl, M. Keidl, A. Kemper, D. Kossmann, A. Kreutz, S. Prols, S. Seltzsam, and K. Stocke. Object- Globe: Ubiquitious Query Processing on the Internet. In VLDB Journal, volume 10, pages 48–71, 2001.

[BKK99] Reinhard Braumandl, Alfons Kemper, and Donald Koss- mann. Database Patchwork on the Internet. In ACM SIG- MOD Conference, pages 550–552, 99.

[BO00] Greg Barish and Katia Obraczka. World Wide Web Caching: Trends and Techniques. In ACM Communications, vol- ume 38, pages 178–185. 2000.

[Bor04] Jerry Bortvedt. Functional Specification for Ob- ject Caching Service for Java (OCS4J), 2.0.

147 http://jcp.org/aboutJava/communityprocess/jsr/cacheFS.pdf, June 2004.

[BR02] Laura Bright and Louiqa Raschid. Using Latency-Recency Profiles for Data Delivery on the Web. In VLDB, 2002.

[CAL+02] K. Selcuk Candan, Divyakant Agrawal, Wen-Syan Li, Oliver Po, and Wang-Pin Hsiug. View Invalidation for Dynamic Content Caching in Multitired Architectures. In Proceed- ings of of 28th International Conference on Very Large Data Bases (VLDB), pages 562–573, 2002.

[CB00] Boris Chidlovskii and Uwe Borghoff. Semantic caching of Web queries. VLDB Journal, 9(1):2–17, 2000.

[Cha00] Nikhil Chandhok. Web Distribution Systems: Caching and Replication. http://www.cis.ohio-state.edu/ jain/cis788- 99/web-caching/index.html, 2000.

[Chu] Chutney Technologies. http://www.chutneytech.com.

[CI97] Pei Cao and Sandi Irani. Cost-Aware WWW Proxy Caching Algorithms. In Proceedings of the USENIX Symposium on Internet Technologies and Systems, pages 193–206, 1997.

[CID99] Jim Challenger, Arun Iyengar, and Paul Dantzig. A Scalable System for Consistently Caching Dynamic Web Data. In IEEE INFOCOM, pages 294–303, 1999.

[CIW+00] Jim Challenger, Arun Iyengar, Karen Witting, Cameron Fer- stat, and Paul Reed. A Publishing System for Efficiently Creating Dynamic Web Content. In IEEE INFOCOM, pages 844–853, 2000.

148 [CK00a] Kai Cheng and Yahiko Kambayashi. LRU-SP: A Size- Adjusted and Popularity-Aware LRU Replacement Algo- rithm for Web Caching. In IEEE Compsac, pages 48–53, 2000.

[CK00b] Kai Cheng and Yahiko Kambayashi. Multicache-based Con- tent Management for Web Caching. In WISE, 2000.

[CLL+01] K. S. Candan, Wen-Syan Li, Qiong Luoand, Wang-Pin Hsi- ung, and Divyakant Agrawal. Enabling Dynamic Content Caching for Database-Driven Web Sites. In ACM SIGMOD Conference, pages 532–543, 2001.

[CO02a] L. Y. Cao and M. T. Ozsu. Evaluation of Strong Consistency Web Caching Techniques. In World Wide Web: Internet and Web Information Systems, 2002.

[CO02b] Y. Cao and M. T. Ozsu. Evaluation of Strong Consistency Web Caching Techniques. World Wide Web: Internet and Web Information Systems, 5(2):95–123, 2002.

[CZB98] Pei Cao, Jin Zhang, and Kevin Beach. Active Cache: Caching Dynamic Contents on the Web. In IFIP, 1998.

[DDT+02] Anindya Datta, Kaushik Dutta, Helen M. Thomas, Debra E. VanderMeer, and Krithi Ramamritham. Accelerating Dy- namic Web Content Generation. IEEE Internet Computing, 6(5):27–36, September/October 2002.

[Dig] Digital Island. http://www.digitalisland.com.

[DIR01] Louis Degenaro, Arun Iyengar, and Isabelle Ruvellou. Im- proving Performance with Application-Level Caching. In International Conference on Advances in Infrastructure for

149 Electronic Business, Science, and Education on the Internet (SSGRR), 2001.

[DKP+01] Pavan Deolasee, Amol Katkar, Ankur Panchbudhe, Krithi Ramamaritham, and Prashant Shenoy. Adaptive Push-Pull: Disseminating Dynamic Web Data. In The Tenth World Wide Web Conference (WWW-10), pages 265–274, 2001.

[DST00] Venkata Duvuri, Prashant Shenoy, and Renu Tewari. Adap- tive Leases: A Strong Consistency Mechanism for the World Wide Web. In IEEE INFOCOM’2000, pages 834–843, March 2000.

[Dyn] Dynamai. http://www.persistence.com.

[Edg] Edge Side Includes. http://www.esi.org.

[FCB00] Li Fan, Pei Cao, and Andrei Broder. Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol. In IEEE/ACM Transactions on Networking, volume 8, June 2000.

[FGM+99] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Mas- inter, P. Leach, and T. Berners-Lee. Hypertext Trans- fer Protocol – http/1.1. http://www.cis.ohio-state.edu/cgi- bin/rfc/rfc2616.html, June 1999.

[FJCL99] Li Fan, Quinn Jacobson, Pei Cao, and Wei Lin. Web Prefetching Between Low-Bandwidth Clients and Proxies: Potential and Performance. In SIGMETRICS’99, 1999.

[FYVI00] Daniela Florescu, Khaled Yagoub, Patrick Valduriez, and Va- lerie Issarny. WEAVE: A Data-Intensive Web Site Manage- ment System. In The Conference on Extending Database Technology (EDBT), 2000.

150 [Gar05] Jesse James Garrett. Ajax: A New Approach to Web Applications. http://www.adaptivepath.com/publications/essays/archives/000385.php, February 2005.

[GO01] Kian-Lee Tan Shen-Tat Goh and Beng Chin Ooi. Cache- On-Demand: Recycling with Certainty. In Proceedings of the 17th International Conference on Data Engineering(ICDE), pages 633–640, 2001.

[GS96] J. Gwertzman and M. Seltzer. World Wide Web Cache Con- sistency. In Proceedings of the USENIX Techical Conference, pages 141–152, 1996.

[HMN+99] L. M. Haas, R. J. Miller, B. Niswonger, M. Tork Roth, P. M. Schwarz, and E. L. Wimmers. Transforming Hetrogeneous Data with Database Middleware: Beyond Integration. In Data Engineering Bulletin. 1999.

[IBM] IBM Corporation. http://www.ibm.com.

[IFF+99] Zachary G. Ives, Daniela Florescu, Marc Friedman, Alon Y. Levy, and Daniel S. Weld. An Adaptive Query Execution System for Data Integration. In ACM SIGMOD Conference, pages 299–310, USA, 1999.

[JBW99a] Dawn Jutla, Peter Bodorik, and Yie Wang. Developing In- ternet E-Commerve Benchmarks. In Information Systems, volume 24, 1999.

[JBW99b] Dawn Jutla, Peter Bodorik, and Yie Wang. WebEC: A Benchmark for the Cybermediary Business Model in E- Commerce. In IMSA, Nassau, Bahamas, 1999.

151 [KF00] Donnald Kossmann and Michael J. Franklin. Cache Invest- ment: Integrating Query Optimization and Distributed Data Placement. In ACM TODS, December 2000.

[Kos00] Donald Kossmann. The State of the art in Distributed Query Processing. In ACM Computing Survey. 2000.

[Kre01] Heather Kreger. Web Services Conceptual Architecture (WSCA 1.0). Technical report, IBM Software Group, http://www.ibm.com, 2001.

[KW97] B. Krishnamurthy and C. E. Willis. Study of Piggyback Cache Validation for Proxy Caches in the World Wide Web. In Proceedings of the USENIX Symposium on Internet Tech- nologies and Systems, pages 1–12, 1997.

[KW98] B. Krishnamurthy and C. E. Willis. Piggyback Server Inval- idation for Proxy Cache Coherency. Computer Networks and ISDN Systems, 30(1-7):185–193, 1998.

[LC98] Chengjie Liu and Pei Cao. Maintaining Strong Cache Consis- tency in the World-Wide Web. In International Conference on Distributed Computing Systems, pages 12–21, 1998.

[LCD01] D. Li, P. Cao, and M. Dahlin. WCIP: Web Cache invali- dation Protocol. http://www.ietf.org/internet-drafts/draft- danli-wrec-wcip-01.txt, March 2001.

[LHP+04] Wen-Syan Li, Wang-Pin Hsiung, Oliver Po, Koji Hino, K. Selcuk Candan, and Divyakant Agrawal. Challenges and Practices in Deploying Web Acceleration Solutions for Dis- tributed Enterprise Systems. In Thirteenth World Wide Web Conference (WWW2004), 2004.

152 [Liu99] Ling Liu. Query Routing in Large-scale Digital Library Systems. In International Conference on Data Engineering (ICDE), 1999.

[LN01] Qiong Luo and Jeffrey F. Naughton. Form-Based Proxy Caching for Database-Backed Web Sites. In Proceedings of 27th International Conference on Very Large Data Bases (VLDB), pages 191–200, 2001.

[LNK+00] Qiong Luo, Jeffrey F. Naughton, Rajasekar Krishnamurthy, Pei Cao, and Yunrui Li. Active Query Caching for Data- base Web Servers. In Proceedings of the Third International Workshop on the Web and Databases (WebDB), pages 29– 34, 2000.

[LR00] Alexandros Labrinidis and Nick Roussopoulos. On the Ma- terialization of Web Views. In ACM SIGMOD Conference, pages 367–378, USA, June 2000.

[LR01] Alexandros Labrinidis and Nick Roussopoulos. Adaptive We- bview Materialization. In Proceedings of the ACM SIGMOD Workshop on the Web and Databases (WebDB), pages 85– 90, USA, May 24-25 2001.

[MACM01] Amelie Marian, Serge Abiteboul, Gregory Cobena, and Lau- rent Mignet. Change-Centric Management of Versions in an XML Warehouse. In Proceedings of 27th International Con- ference on Very Large Data Bases (VLDB), pages 581–590, 2001.

[MAG+97] J. McHugh, S. Abiteboul, R. Goldman, D. Quass, and J. Widom. Lore: A Database Management System for Semi-

153 structured Data. SIGMOD Record, 26(3):54–66, September 1997.

[Mar00] Evangelos P. Markatos. On Caching Search Engine Query Results. In 5th International Web Caching and Content De- livery Workshop, pages 137–143, 2000.

[MBFS01] Ioana Manolescu, Luc Bouganim, Francoise Fabret, and Eric Simon. Efficient Data and Program Integration Using Bind- ing Patterns. INRIA, France, 2001. Technical Report 4239.

[MBR03] Mehregan Mahdavi, Boualem Benatallah, and Fethi Rabhi. Caching Dynamic Data for E-Business Applications. In In- ternational Conference on Intelligent Information Systems (IIS’03): New Trends in Intelligent Information Processing and Web Mining (IIPWM), pages 459–466, 2003.

[Mic97] Microsoft Corporation. Cache Array Routing Protocol and Microsoft Proxy Server 2.0. http://www.mcoecn.org/WhitePapers/Mscarp.pdf, 1997. White Paper.

[MS04] Mehregan Mahdavi and John Shepherd. Enabling Dynamic Content Caching in Web Portals. In 14th International Workshop on Research Issues on Data Engineering (RIDE’04), pages 129–136, 2004.

[MSB04] Mehregan Mahdavi, John Shepherd, and Boualem Benatallah. A Collaborative Approach for Caching Dynamic Data in Portal Applications. In The Fifteenth Australasian Database Conference (ADC’04), pages 181–188, 2004.

[NACP01] Benjamin Nguyen, Serge Abiteboul, Gregory Cobena, and Mihai Preda. Monitoring XML Data on the Web. In SIGMOD Conference, pages 437–448, 2001.

[NDM+00] Jeffrey Naughton, David DeWitt, David Maier, et al. The Niagara Internet Query System, 2000.

[OLW01] Chris Olston, Boon Thau Loo, and Jennifer Widom. Adap- tive Precision Setting for Cached Approximate Values. In SIGMOD Conference, 2001.

[Ora] Oracle Corporation. http://www.oracle.com.

[Ora01a] Oracle Corporation. Oracle9i Application Server: Database Cache. Technical report, Oracle Corporation, http://www.oracle.com, February 2001.

[Ora01b] Oracle Corporation. Oracle9iAS Web Cache. Technical report, Oracle Corporation, http://www.oracle.com, June 2001.

[OW02] Chris Olston and Jennifer Widom. Best-Effort Synchroniza- tion with Source Cooperation. In ACM SIGMOD, 2002.

[PB03] Stefan Podlipnig and Laszlo Boszormenyi. A Survey of Web Cache Replacement Strategies. ACM Computing Surveys, 35:374–398, 2003.

[PF00] Sanjoy Paul and Zongming Fei. Distributed Caching with Centralized Control. In 5th International Web Caching and Content Delivery Workshop, 2000.

[QWG+96] D. Quass, J. Widom, R. Goldman, K. Haas, Q. Luo, J. McHugh, S. Nestorov, A. Rajaraman, H. Rivero, S. Abiteboul, J. Ullman, and J. Wiener. LORE: A Lightweight Object REpository for Semistructured Data. In SIGMOD Conference, 1996.

[Rad] Radview. http://www.radview.com.

[RBS01] Uwe Rohm, Klemens Bohm, and Hans-Jorg Schek. Cache-Aware Query Routing in a Cluster of Databases. In ICDE, 2001.

[RDK+00] Krithi Ramamritham, Pavan Deolasee, Amol Katkar, Ankur Panchbudhe, and Prashant Shenoy. Dissemination of Dynamic Data on the Internet. In DNIS 2000, pages 173–178, December 2000.

[RILD05] Lakshmish Ramaswamy, Arun Iyengar, Ling Liu, and Fred Douglis. Automatic Fragment Detection in Dynamic Web Pages and Its Impact on Caching. IEEE Transactions on Knowledge and Data Engineering, 17(6):859–874, 2005.

[RR00] Manuel Rodriguez and Nick Roussopoulos. MOCHA: A Self-Extensible Database Middleware System for Distributed Data Sources. In ACM SIGMOD Conference, May 2000.

[RS02] Michael Rabinovich and Oliver Spatscheck. Web Caching and Replication. Addison-Wesley, 2002.

[Smi03] Steven A. Smith. ASP.NET Caching: Techniques and Best Practices. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnaspp/html/aspnet-cachingtechniquesbestpract.asp, August 2003.

[Sou04] SourceForge. JCache Open Source. http://jcache.sourceforge.net/, August 2004.

[Squ] Squid Web Proxy Cache. http://www.squid-cache.org.

[STD+00] Jayavel Shanmugasundaram, Kristin Tufte, David DeWitt, Jeffrey Naughton, and David Maier. Architecting a Network Query Engine for Producing Partial Results. In WebDB, pages 17–22, 2000.

[Suna] Sun Microsystems. http://www.spec.org/osg/jAppServer/.

[Sunb] Sun Microsystems. JSRs: Java Specification Requests. http://www.jcp.org/en/jsr/overview.

[Sun02] Sun Microsystems. ECperf Specification. http://java.sun.com/j2ee/ecperf/, April 2002.

[Tec01] Chutney Technologies. Dynamic Content Acceleration: A Caching Solution to Enable Scalable Dynamic Web Page Generation. In SIGMOD Conference, 2001.

[TIH01] Igor Tatarinov, Zachary G. Ives, and Alon Y. Halevy. Updating XML. In SIGMOD Conference, pages 413–424, 2001.

[Tim] TimesTen Inc. http://www.timesten.com.

[Tim02] TimesTen Inc. Mid-Tier Caching. Technical report, TimesTen Inc., http://www.timesten.com, 2002.

[Tra01] Transaction Processing Performance Council. TPC Benchmark W (TPC-W). http://www.tpc.org/tpcw/, October 2001.

[UF00] Tolga Urhan and Michael J. Franklin. XJoin: A Reactively-Scheduled Pipelined Join Operator. IEEE Data Engineering Bulletin, 23(3):27–33, June 2000.

[WA96] Stephen Williams and Marc Abrams. Removal Policies in Network Caches for World-Wide Web Documents. In ACM SIGCOMM, pages 293–305, 1996.

[WC97] D. Wessels and K. Claffy. Application of Internet Cache Protocol (ICP), version 2. Network Working Group, Internet-Draft, July 1997.

[Won99] Stephanie Wong. Estimated $4.35 Billion Ecommerce Sales at Risk Each Year. http://www.zonaresearch.com/info/press/99-june30.htm, 1999.

[WY01] Kin Yeung Wong and Kai Hau Yeung. Site-Based Approach to Web Cache Design. IEEE Internet Computing, 5(5):28– 34, September/October 2001.

[YADL99] J. Yin, L. Alvisi, M. Dahlin, and C. Lin. Volume Leases for Consistency in Large-scale Systems. IEEE Transactions on Knowledge and Data Engineering, 11(4):563–576, 1999.

[YBS99] Haobo Yu, Lee Breslau, and Scott Shenker. A Scalable Web Cache Consistency Architecture. In ACM SIGCOMM, pages 163–174, Boston, USA, 1999.

[YFVI00] Khaled Yagoub, Daniela Florescu, Patrick Valduriez, and Valerie Issarny. Caching Strategies for Data-Intensive Web Sites. In Proceedings of 26th International Conference on Very Large Data Bases (VLDB), pages 188–199, Cairo, Egypt, September 2000.

[You91] Neal E. Young. The k-Server Dual and Loose Competitiveness for Paging. Algorithmica, 11(6):525–541, 1991.

[Zon01] Zona Research Inc. Zona Research Releases Need for Speed II. http://www.zonaresearch.com/info/press/01- may03.htm, 2001.
