THE DESIGN AND EVALUATION OF WEB PREFETCHING AND CACHING TECHNIQUES

BY BRIAN DOUGLAS DAVISON

A dissertation submitted to the Graduate School—New Brunswick, Rutgers, The State University of New Jersey, in partial fulfillment of the requirements for the degree of Doctor of Philosophy, Graduate Program in Computer Science

Written under the direction of Haym Hirsh and approved by

New Brunswick, New Jersey
October, 2002

© 2002 Brian Douglas Davison
ALL RIGHTS RESERVED

ABSTRACT OF THE DISSERTATION

The Design and Evaluation of Web Prefetching and Caching Techniques

by Brian Douglas Davison

Dissertation Director: Haym Hirsh

User-perceived retrieval latencies in the World-Wide Web can be improved by pre-loading a local cache with resources likely to be accessed. A user requesting content that can be served by the cache is able to avoid the delays inherent in the Web, such as congested networks and slow servers. The difficulty, then, is to determine what content to prefetch into the cache.

This work explores machine learning algorithms for user sequence prediction, both in general and specifically for sequences of Web requests. We also consider information retrieval techniques to allow the use of the content of Web pages to help predict future requests. Although history-based mechanisms can provide strong performance in predicting future requests, performance can be improved by including predictions from additional sources.

While past researchers have used a variety of techniques for evaluating caching algorithms and systems, most of those methods were not applicable to the evaluation of prefetching algorithms or systems. Therefore, two new mechanisms for evaluation are introduced. The first is a detailed trace-based simulator, built from scratch, that estimates client-side response times in a simulated network of clients, caches, and Web servers with various connectivity. This simulator is then used to evaluate various

prefetching approaches. The second evaluation method presented is a novel architecture to simultaneously evaluate multiple proxy caches in a live network, which we introduce, implement, and demonstrate through experiments. The simulator is appropriate for evaluation of algorithms and research ideas, while simultaneous proxy evaluation is ideally suited to implemented systems.

We also consider the present and the future of Web prefetching, finding that changes to the HTTP standard will be required in order for Web prefetching to become commonplace.

Acknowledgements

Here I acknowledge the time, money, and support provided to me. Officially, I have been financially supported by a number of organizations while preparing this dissertation. The most recent and significant support has been through the National Science Foundation under NSF grant ANI 9903052. In addition, I was supported earlier by a Rutgers University special allocation to strategic initiatives in the Information Sciences, and by DARPA under Order Number F425 (via ONR contract N6600197C8534).

I should point out that much of the content of this dissertation has been published separately. Parts of Chapter 1 were included in a WWW workshop position statement [Dav99a]. A version of Chapter 2 has been published in IEEE Internet Computing [Dav01c]. Part of Chapter 3 (written with Haym Hirsh) was presented in an AAAI/ICML workshop paper [DH98]. Much of Chapter 5 was presented at a SIGIR conference [Dav00c]. A version of Chapter 6 was presented at a Hypertext conference [Dav02a]. Part of Chapter 7 was presented at WCW [Dav99e]. Part of Chapter 8 was presented at MASCOTS [Dav01b]. A version of Chapter 10 was presented at WCW [Dav99d]. A portion of Chapter 11 (written with Chandrasekar Krishnan and Baoning Wu) was presented at WCW [DKW02]. Much of Chapter 12 was presented at WCW [Dav01a].

Much of this dissertation is based on experiments using real-world data sets. I thank those who have made such Web traces available, including Martin Arlitt, Laura Bottomley, Owen C. Davison, Jim Dumoulin, Steven D. Gribble, Mike Perkowitz, and Carey Williamson. This work would not have been possible without their generosity.

I’d like to thank my advisor Haym Hirsh (who encouraged me to choose my own path, and said it was time to interview, and thus to finish the dissertation), and committee members Ricardo Bianchini, Thu Nguyen, and Craig Wills.

I’ve also appreciated the advice, support, and contributions of collaborators and

reviewers, including, but not limited to: Fred Douglis, Apostolos Gerasoulis, Paul Kantor, Kiran Komaravolu, Chandrasekar Krishnan, Vincenzo Liberatore, Craig Nevill-Manning, Mark Nottingham, Weisong Shi, Brett Vickers, Gary Weiss, and Baoning Wu.

The years at Rutgers would not have been nearly as much fun without the other current and former student members of the Rutgers Machine Learning Research Group: Arunava Banerjee, Chumki Basu, Daniel Kudenko, David Loewenstern, Sofus Macskassy, Chris Mesterharm, Khaled Rasheed, Gary Weiss, and Sarah Zelikovitz. Thank you for listening to me, and letting me listen to you.

Finally, my love and gratitude are extended to my wife, Karen, and our family, both immediate and extended by blood or marriage. Without their love, understanding, and support, I would have had to get a real job years ago and this work would never have been finished...

Dedication

This work is dedicated to all those who use the Web daily.

May it only get better and better.

Table of Contents

Abstract ...... ii

Acknowledgements ...... iv

Dedication ...... vi

List of Tables ...... xv

List of Figures ...... xvii

List of Abbreviations ...... xxvi

1. Introduction ...... 1

1.1. Motivation ...... 1

1.1.1. Methods to improve Web response times ...... 1

1.1.2. User action prediction for the Web ...... 3

1.1.3. Questions and answers ...... 4

1.2. Contributions ...... 6

1.3. Dissertation Outline ...... 7

2. A Web Caching Primer ...... 9

2.1. Introduction ...... 9

2.2. Caching ...... 10

2.2.1. Caching in memory systems ...... 11

2.2.2. Mechanics of a Web request ...... 12

2.2.3. Web caching ...... 14

2.2.4. Where is Web caching used? ...... 15

2.2.5. Utility of Web caching ...... 19

2.2.6. How caching saves time and bandwidth ...... 19

2.2.7. Potential problems ...... 20

2.2.8. Content cacheability ...... 21

2.2.9. Improving caching latencies ...... 22

2.3. Why do research in Web Caching? ...... 23

2.4. Summary ...... 26

3. Incremental Probabilistic Action Prediction ...... 27

3.1. Introduction ...... 27

3.2. Background ...... 28

3.3. UNIX Command Prediction ...... 29

3.3.1. Motivation ...... 29

3.3.2. An Ideal Online Learning Algorithm (IOLA) ...... 33

3.4. Incremental Probabilistic Action Modeling (IPAM) ...... 34

3.4.1. The algorithm ...... 34

3.4.2. Determining alpha ...... 37

3.5. Evaluation ...... 38

3.6. Discussion ...... 40

3.7. Related Work ...... 42

3.8. Summary ...... 45

4. Web Request Prediction ...... 46

4.1. Introduction ...... 46

4.2. Evaluation concerns and approaches ...... 48

4.2.1. Type of Web logs used ...... 48

4.2.2. Per-user or per-request averaging ...... 49

4.2.3. User request sessions ...... 49

4.2.4. Batch versus online evaluation ...... 50

4.2.5. Selecting evaluation data ...... 51

4.2.6. Confidence and support ...... 51

4.2.7. Calculating precision ...... 52

4.2.8. Top-n predictions ...... 53

4.3. Prediction Techniques ...... 54

4.4. Prediction Workloads ...... 61

4.4.1. Proxy ...... 62

4.4.2. Client ...... 62

4.4.3. Server ...... 64

4.5. Experimental Results ...... 65

4.5.1. Increasing number of predictions ...... 65

4.5.2. Increasing n-gram size ...... 67

4.5.3. Incorporating shorter contexts ...... 68

4.5.4. Increasing prediction window ...... 69

4.5.5. De-emphasizing likely cached objects ...... 71

4.5.6. Retaining past predictions ...... 74

4.5.7. Considering inexact sequence matching ...... 74

4.5.8. Tracking changing usage patterns ...... 76

4.5.9. Considering mistake costs ...... 77

4.6. Discussion ...... 79

4.7. Summary ...... 81

5. The Utility of Web Content ...... 83

5.1. Introduction ...... 83

5.2. Motivation ...... 84

5.3. Applications ...... 87

5.3.1. Web indexers ...... 87

5.3.2. Search ranking and community discovery systems ...... 88

5.3.3. Meta-search engines ...... 89

5.3.4. Focused crawlers ...... 90

5.3.5. Intelligent browsing agents ...... 90

5.4. Experimental Method ...... 90

5.4.1. Data set ...... 91

5.4.2. Textual similarity calculations ...... 92

5.4.3. Experiments performed ...... 93

5.5. Experimental Results ...... 95

5.5.1. General characteristics ...... 95

5.5.2. Page to page characteristics ...... 98

5.5.3. Anchor text to page text characteristics ...... 100

5.6. Related Work ...... 102

5.7. Summary ...... 103

6. Content-Based Prediction ...... 105

6.1. Introduction ...... 105

6.2. Background ...... 106

6.3. Related Work ...... 109

6.4. Content-Based Prediction ...... 111

6.5. Experimental Method ...... 112

6.5.1. Workload collection ...... 113

6.5.2. Data preparation ...... 114

6.5.3. Prediction methods ...... 115

6.5.4. Evaluation ...... 117

6.6. Experimental Results ...... 118

6.6.1. Overall results ...... 118

6.6.2. Distributions of performance ...... 120

6.6.3. Results with infinite cache ...... 122

6.7. Discussion ...... 124

6.8. Summary ...... 126

7. Evaluation ...... 127

7.1. Introduction ...... 127

7.1.1. Accurate workloads ...... 129

7.1.2. Accurate evaluation ...... 131

7.1.3. Accurate simulation ...... 132

7.2. Cache Evaluation Methodologies ...... 132

7.2.1. The space of cache evaluation methods ...... 133

7.2.2. Methodological appraisal ...... 134

7.2.3. Sampling of work in methodological space ...... 136

7.3. Recommendations for Future Evaluations ...... 141

7.4. Summary ...... 143

8. A Network and Cache Simulator ...... 145

8.1. Introduction ...... 145

8.1.1. Overview ...... 146

8.1.2. Motivation ...... 147

8.2. Implementation ...... 149

8.2.1. Features ...... 149

8.2.2. NCS parameters ...... 151

8.2.3. Network topologies and characteristics ...... 152

8.2.4. Manual tests ...... 153

8.3. The UCB Home-IP Usage Trace ...... 153

8.3.1. Background ...... 154

8.3.2. Trace preparation ...... 154

8.3.3. Analysis ...... 155

8.4. Validation ...... 158

8.4.1. The validation process ...... 158

8.4.2. Small-scale real-world networking tests ...... 159

8.4.3. Large-scale real-world networking tests ...... 162

8.4.4. LRU caching tests ...... 168

8.5. Sample Experiments ...... 169

8.5.1. Demonstration of proxy utility ...... 170

8.5.2. Additional client caching ...... 171

8.5.3. The addition of a caching proxy ...... 172

8.5.4. Modeling DNS effects ...... 173

8.5.5. Incorporating prefetching ...... 175

8.6. Discussion ...... 177

8.6.1. Related work ...... 177

8.6.2. Future directions ...... 178

8.6.3. Summary ...... 179

9. Multi-Source Prefetching ...... 180

9.1. Introduction ...... 180

9.2. Combining Multiple Predictions ...... 181

9.3. Combining Content and History ...... 182

9.4. Combining History-Based Predictions from Multiple Sources ...... 183

9.4.1. Web server log: SSDC ...... 184

9.4.2. Web server log: NASA ...... 192

9.4.3. Proxy server log: UCB ...... 197

9.5. Proxy Log Analysis ...... 198

9.6. Summary ...... 203

10. Simultaneous Proxy Evaluation ...... 204

10.1. Introduction ...... 204

10.2. Evaluating Proxy Cache Performance ...... 205

10.3. The SPE Architecture ...... 206

10.3.1. Architecture details ...... 206

10.3.2. Implementation issues ...... 211

10.4. Representative Applications ...... 212

10.4.1. Measuring hit-rates of black-box proxy caches ...... 212

10.4.2. Evaluating a cache hierarchy or delivery network ...... 213

10.4.3. Measuring effects of varying OS and hardware ...... 214

10.4.4. Evaluating a transparent evaluation architecture ...... 214

10.5. Related Work ...... 215

10.6. Summary ...... 216

11. SPE Implementation and Validation ...... 218

11.1. Introduction ...... 218

11.2. Implementation ...... 219

11.2.1. The Multiplier ...... 219

11.2.2. The Collector ...... 220

11.3. Issues and Implementation ...... 223

11.3.1. Persistent connections ...... 224

11.3.2. Request pipelining ...... 226

11.3.3. DNS resolution ...... 226

11.3.4. System scheduling granularity ...... 227

11.3.5. Inter-chunk delay times ...... 228

11.3.6. Persistent state in an event-driven system ...... 230

11.3.7. HTTP If-Modified-Since request ...... 231

11.3.8. Network stability ...... 231

11.3.9. Caching of uncacheable responses ...... 232

11.3.10. Request scheduling ...... 232

11.3.11. Additional client latency ...... 233

11.4. Validation ...... 234

11.4.1. Configuration and workloads ...... 234

11.4.2. Miss timing replication ...... 237

11.4.3. Proxy penalty ...... 240

11.4.4. Implementation overhead ...... 241

11.4.5. Proxy ordering effects ...... 242

11.5. Experiments ...... 242

11.5.1. Proxies driven by synthetic workload ...... 243

11.5.2. A reverse proxy workload ...... 244

11.6. Summary ...... 247

12. Prefetching with HTTP ...... 248

12.1. Introduction ...... 248

12.2. Background ...... 249

12.2.1. Prefetchability ...... 249

12.2.2. Prefetching in the Web today ...... 251

12.3. Problems with Prefetching ...... 251

12.3.1. Unknown cacheability ...... 252

12.3.2. Server overhead ...... 253

12.3.3. Side effects of retrieval ...... 254

12.3.4. Related non-prefetching applications ...... 255

12.3.5. User activity conflation ...... 256

12.4. Proposed Extension to HTTP ...... 257

12.4.1. Need for an extension ...... 257

12.4.2. Previous proposals ...... 258

12.4.3. Recommending changes ...... 259

12.5. Summary ...... 260

13. Looking Ahead ...... 262

13.1. Introduction ...... 262

13.1.1. User action prediction ...... 262

13.1.2. Evaluation of Web cache prefetching techniques ...... 263

13.2. The Next Steps ...... 264

13.3. The Future of Content Delivery ...... 266

13.4. Summary ...... 268

References ...... 270

Vita ...... 302

List of Tables

4.1. Traces used for prediction and their characteristics...... 62

5.1. Illustration of different amounts of text surrounding an anchor. . . . . 92

5.2. The twenty most common bigrams found after removing bigrams containing articles, prepositions, conjunctions, and various forms of the verb to be...... 102

7.1. A space of traditional evaluation methodologies for Web systems. . . . 133

7.2. An expanded space of evaluation methodologies for Web systems. . . . 134

8.1. Validation of simulator response time estimates on the small cluster workload. Examines HTTP/1.0 (serial retrievals on separate connections) and HTTP/1.0+KeepAlive (serial retrievals on a single connection). Measured shows the mean of 100 trials with standard deviation and 95% confidence intervals in parentheses from [HOT97]. Ratio m:s is the ratio of measured time vs. simulated time...... 160

8.2. Validation of simulator on large cluster workload. Examines performance for HTTP/1.1+Pipelining on a single connection. Measured shows the mean of 5 trials from [NGBS+97]. m:s ratio is the ratio of measured time to simulated time. m:a ratio is the ratio of measured time to adjusted simulated time...... 161

8.3. Validation of NCS on large cluster workload. Examines performance for HTTP/1.0 (with up to 4 parallel connections), HTTP/1.1 (with a single persistent connection), and HTTP/1.0+KeepAlive (with up to either 4 or 6 parallel connections for Netscape or MSIE respectively). Measured shows the mean of 5 trials, except for the last two rows where it is the mean of 3 trials, from [NGBS+97]. m:s ratio is the ratio of measured time to simulated time...... 162

8.4. Some of the simulation parameters used for replication of UCB workload. 163

8.5. Basic summary statistics of the response times provided by the actual and simulated trace results...... 168

9.1. Summary statistics for prefetching performance when one prediction is made with the SSDC log...... 184

9.2. Summary statistics for prefetching performance when five predictions are made with the SSDC log...... 186

9.3. Summary statistics for prefetching performance when one prediction is made with the NASA log...... 192

11.1. Mean and standard error of run means and medians using our Collector. 238

11.2. Mean and standard error of run means and medians using unmodified Squid...... 239

11.3. Comparison of response times between direct and proxy connections. . 240

11.4. Evaluation of SPE implementation overhead...... 241

11.5. Performance comparison of identical proxies...... 242

11.6. Artificial workload results...... 243

11.7. Reverse proxy workload results...... 246

List of Figures

2.1. A simplistic view of the Web in which the Web consists of a set of Web servers and clients and the Internet that connects them...... 12

2.2. Sample HTTP request and response headers. Headers identify client and server capabilities as well as describe the response content. . . . . 13

2.3. HTTP transfer timing cost for a new connection. The amount of time to retrieve a resource when a new connection is required can be approximated by two round-trip times plus the time to transmit the response (plus DNS resolution delays, if necessary)...... 14

2.4. In this scatter-plot of real-world responses (the first 100,000 from the UC Berkeley Home IP HTTP Trace of dialup and wireless modem users [Gri97]), the response time varies significantly, even for resources of a single size...... 15

2.5. Caches in the World Wide Web. Starting in the browser, a Web request may potentially travel through multiple caching systems on its way to the origin server. At any point in the sequence a response may be served if the request matches a valid response in the cache...... 16

2.6. Cooperative caching in which caches communicate with peers before making a request over the Web...... 17

3.1. A portion of one user’s history, showing the timestamp of the start of the session and the command typed. (The token BLANK marks the start of a new session.) ...... 31

3.2. Standard (Bayesian) incremental update: one row of a 5x5 state transition table, before and after updating the table as a result of seeing action A3 follow A1...... 34

3.3. IPAM incremental update: one row of a 5x5 state transition table, before and after updating the table using IPAM as a result of seeing action A3 follow A1. The value of alpha used was .8...... 35

3.4. Selected rows of a 5x5 state transition table, before and after updating the table as a result of seeing action A4 for the first time...... 36

3.5. Pseudo-code for IPAM...... 36

3.6. For a range of alpha values, the predictive accuracy of the Incremental Probabilistic Action Modeling algorithm on the Rutgers data is shown. 37

3.7. Macroaverage (per user) predictive accuracy for a variety of algorithms on the Rutgers dataset...... 38

3.8. Average per user accuracies of the top-n predictions. The likelihood of including the correct command grows as the number of suggested commands increases...... 38

3.9. Command completion accuracies. The likelihood of predicting the correct command grows as the number of initial characters given increases. 39

3.10. Cumulative performance (for n=3) over time for a typical user. . . . . 41

4.1. A sample node in a Markov tree...... 55

4.2. A sample trace and simple Markov tree of depth two built from it. . . 56

4.3. Pseudocode to build a Markov tree...... 56

4.4. Before and after incrementally updating a simple Markov tree...... 57

4.5. Predictive accuracy for the EPA-HTTP data set with varying numbers of allowed predictions...... 66

4.6. Predictive accuracy for various data sets with varying numbers of allowed predictions...... 67

4.7. Relative improvement in predictive accuracy for multiple data sets as maximum context length grows (as compared to a context length of two)...... 68

4.8. Relative improvement in predictive accuracy for multiple data sets using only maximal length sequences (as compared to a context length of two). 69

4.9. Relative improvement in predictive accuracy when using three PPM-based prediction models instead of n-grams...... 70

4.10. Relative improvement in predictive accuracy when using shorter contexts in n-gram prediction...... 71

4.11. Relative improvement in predictive accuracy as the window of actions against which each prediction is tested grows...... 72

4.12. Relative changes in predictive accuracy as cached requests are de-emphasized...... 72

4.13. Predictive accuracy as past predictions are retained...... 73

4.14. Relative improvement in predictive accuracy for n-grams as the weight for non-exact matches is varied...... 75

4.15. Relative change in predictive accuracy as alpha (emphasis on recency) is varied...... 76

4.16. Ratio of number of predictions made to those predicted correctly. . . . 77

4.17. Ratio of precision to overall accuracy for varying confidence...... 78

4.18. Ratio of precision to overall accuracy for varying support...... 78

4.19. Ratio of precision to overall accuracy for a minimum support of 10 with varying confidence...... 79

4.20. Predictive accuracy for a first-order Markov model on UCB proxy data as the data set size varies...... 80

5.1. Sample HTML page from www.acm.org. We record text found in section A (title), section B (description), and section C (body text)...... 94

5.2. Representation of the twenty most common top-level domain names in our combined dataset, sorted by frequency...... 96

5.3. The distribution of the number of links per Web page...... 96

5.4. Distribution of content lengths of Web pages...... 97

5.5. Percentages of page pairings in which the domain name matched. . . . 97

5.6. Distributions of URL match lengths between the URLs of a parent document and a randomly selected child document are similar to that of a parent and a different child document, as well as between the URLs of the two selected child (sibling) documents...... 98

5.7. Textual similarity scores for page self-descriptors. In this (and following figures), icons are used to help represent the relevant portions of pages (shaded) and the relationship being examined. Here, we consider the similarity between the title and the body text and between the description and body text...... 99

5.8. TFIDF-cosine similarity for linked pages, random pages, and sibling pages...... 99

5.9. Plot of similarity score between sibling pages as a function of distance between referring URLs in parent page for TFIDF...... 100

5.10. Distribution of the number of terms per title and per anchor...... 101

5.11. Measured similarity of anchor text only to linked text, text of a sibling of the link, and the text of random pages...... 101

5.12. Performance of varying amounts of anchor text to linked text. . . . . 102

6.1. The distribution of the number of hypertext (non-embedded resource) links per retrieved HTML page...... 107

6.2. The distribution of the minimum distance, in terms of the number of requests back that the current page could have been prefetched. . . . 108

6.3. Sequence of HTML documents requested. The content of recently requested pages is used as a model of the user’s interest to rank the links of the current page to determine what to prefetch...... 114

6.4. Fraction of pages that are uncacheable, found as a link from previously requested page (potentially predictable), or not found as a link from previous page...... 118

6.5. Overall predictive accuracy (within potentially predictable set) of content prediction methods when considering links from the most recently requested page...... 119

6.6. Potential for prediction when considering links from the most recently requested page...... 120

6.7. Performance distribution for content prediction algorithms when allowing guesses from the most recently requested page...... 121

6.8. Predictive accuracy (as a fraction of all pages) of content prediction methods with an infinite cache compared to no prediction...... 122

6.9. Performance distribution for original order with and without an infinite cache...... 123

6.10. Performance distribution for similarity ranking with and without an infinite cache...... 123

6.11. Comparison of performance distributions for original order and similar- ity with an infinite cache...... 124

7.1. Excerpt of a proxy log generated by Squid 1.1 which records the timestamp, elapsed-time, client, code/status, bytes, method, URL, client-username, peerstatus/peerhost and objecttype for each request. It is also an example of how requests can be logged in an order inappropriate for replaying in later experiments...... 130

8.1. NCS topology. NCS dynamically creates clients and servers from templates as they are found in the trace. It can be configured to use caching at clients and at an optional proxy that can be placed logically near clients or servers...... 152

8.2. Cumulative distribution function of response sizes from the UCB Home-IP request trace...... 156

8.3. Cumulative distribution function of client-viewed response times from the UCB Home-IP request trace...... 156

8.4. Scatter plot of file size vs. estimated actual response time for first 100k requests from UCB trace...... 157

8.5. Comparison of the distributions of response times from the UCB Home-IP request trace...... 163

8.6. Comparison of the CDF of response times from the UCB Home-IP request trace...... 164

8.7. Comparison of CDF of response times from the UCB Home-IP request trace (magnification of first 20s)...... 164

8.8. Cumulative distribution function of simulated and actual effective bandwidth (throughput) from the UCB Home-IP request trace...... 165

8.9. Distribution of simulated and actual effective bandwidth (throughput) from the UCB Home-IP request trace...... 165

8.10. Log-log distribution of effective bandwidth (throughput) from the UCB Home-IP request trace...... 166

8.11. Scatter plot of file size vs. simulated (static) response time for first 100k requests from UCB trace...... 167

8.12. Scatter plot of file size vs. simulated (stochastic) response time for first 100k requests from UCB trace...... 167

8.13. Performance of the cache replacement algorithms for NCS and the University of Wisconsin simulator show similar results...... 169

8.14. Comparison of client caching on the CDF of response times from the UCB Home-IP request trace...... 171

8.15. Comparison of proxy caching on the CDF of response times from the UCB Home-IP request trace...... 172

8.16. Comparison of DNS caching at the client on the CDF of response times from the UCB Home-IP request trace...... 174

8.17. Effect of DNS caching at a proxy on the CDF of response times from the UCB Home-IP request trace...... 174

8.18. Effect of prefetching at a proxy on the response times from the SSDC Web server request trace...... 176

9.1. Predictive accuracy on the full-content Rutgers trace using prediction from history, from content, and their combination...... 183

9.2. Cumulative distribution of client-side response times when only one prediction is used with the SSDC log...... 185

9.3. Cumulative distribution of client-side response time when five predictions are used with the SSDC log...... 186

9.4. Scatterplot comparing median versus mean response times for various configurations using the SSDC trace...... 188

9.5. Scatterplot comparing median response time versus bandwidth consumed using the SSDC trace, highlighting the number of predictions permitted...... 189

9.6. Scatterplot comparing median response time versus bandwidth consumed using the SSDC trace, highlighting the minimum confidence required...... 190

9.7. Scatterplot comparing median response time versus bandwidth consumed using the SSDC trace, highlighting the minimum repetitions required...... 190

9.8. CDF of simulated response times for various configurations using the SSDC trace...... 191

9.9. Cumulative distribution of client-side response time when only one prediction is used in the NASA trace...... 193

9.10. Scatterplot comparing 75th percentile versus mean response times for various configurations using the NASA trace...... 194

9.11. Scatterplot comparing median response time versus bandwidth consumed using the NASA trace, highlighting the number of predictions permitted...... 194

9.12. A magnified view of the best points in a scatterplot comparing median response time versus bandwidth consumed using the NASA trace, highlighting the number of predictions permitted...... 195

9.13. Scatterplot comparing median response time versus bandwidth consumed using the NASA trace, highlighting the minimum repetitions required...... 196

9.14. Scatterplot comparing median response time versus bandwidth consumed using the NASA trace, highlighting the minimum confidence required...... 196

9.15. CDF of simulated response times for various configurations using the NASA trace...... 197

9.16. Frequency of request versus request rank for the UCB Home-IP trace. 198

9.17. Frequency of request versus request rank for various server traces. . . 199

9.18. Cumulative distribution function of frequency of request versus request rank for various Web traces...... 200

9.19. The CDF of frequency of page references, ordered by increasing frequency...... 201

9.20. CDF of response times for various prefetching configurations using the UCB server subtrace...... 202

10.1. The SPE architecture for online evaluation of competing proxy caches. 207

11.1. Sample HTTP/1.0 request from Netscape and response via a Squid proxy...... 220

11.2. Sample Multiplier log showing the timestamp, the initial latency and transfer time in seconds and microseconds, the size of the object in bytes, the result of response validation, the client IP, the HTTP version, the method, and the URL requested. The possible results are OK (if validation succeeded), STATUS CODE MISMATCH (if status codes differ), HEADER MISMATCH (if headers differ, along with the particular header) and BODY MISMATCH (if the response body differs). . 221

11.3. Sample Collector (Squid) log showing the timestamp, elapsed time in ms, client IP address, cache operation/status code, object size, request method, URL, user ident (always ‘-’), upstream source, and filetype. . 222

11.4. Transaction timelines showing the sequence of events to satisfy a client need. The diagrams do not show TCP acknowledgments, nor the packet exchange to close the TCP connection...... 223

11.5. Transaction timelines showing how persistent connections complicate timing replication...... 224

11.6. Possible inter-chunk delay schedules for sending to client...... 228

11.7. Time/sequence graphs for one object...... 229

11.8. Experimental network configuration...... 235

11.9. Experiment structure. httperf is used to send requests to our modified Squid and measure response times...... 237

11.10. Distribution of cache hit and miss response times with NLANR IRCache data using our Collector...... 238

11.11. Distribution of absolute differences in paired response times between hits and misses with NLANR IRCache data using our Collector. . . . 238

11.12. Configuration to test one SPE implementation within another. . . . . 241

11.13. The cumulative distribution of response times for each proxy under the artificial workload...... 244

11.14. The cumulative distribution of response times for each proxy under the full-content reverse proxy workload...... 246

List of Abbreviations

ACK ...... Acknowledgement

CDF ...... Cumulative Distribution Function

CDN ...... Content Delivery Network

CPU ...... Central Processing Unit

DNS ...... Domain Name System

HTML ...... HyperText Markup Language

HTTP ...... HyperText Transfer Protocol

IOLA ...... Ideal Online Learning Algorithm

IP ...... Internet Protocol

IPAM ...... Incremental Probabilistic Action Modeling

ISP ...... Internet Service Provider

NCS ...... Network and Cache Simulator

PDF ...... Portable Document Format

PPM ...... Prediction by Partial Match

RAM ...... Random Access Memory

RFC ...... Request For Comments

RTT ...... Round Trip Time

SPE ...... Simultaneous Proxy Evaluation

TCP ...... Transmission Control Protocol

UDP ...... User Datagram Protocol

URL ...... Uniform Resource Locator

WWW ...... World-Wide Web


Chapter 1

Introduction

1.1 Motivation

Introduced a little over a decade ago, the World-Wide Web has transformed much of the world’s economy and will continue to do so. It can be compared to the telephone, television, and automobile as a paradigm shift in the way people work, communicate, and play. As the popular interface to what has been called the “information superhighway”, the Web purports to provide instantaneous access to vast quantities of information. However, the perception of time is always relative — access to information on today’s Web is rarely instantaneous. While performance continues to improve over time with improvements in bandwidth and device latencies, users continue to desire still more speed [KPS+99]. Likewise, content providers continue to make greater demands on bandwidth as it increases.

Good interactive response-time has long been known to be essential for user satisfaction and productivity [DT82, Bra86, Roa98]. This is also true for the Web [RBP98, BBK00]. A widely-cited study from Zona Research [Zon99] provides evidence for the “eight second rule” — in electronic commerce, if a Web site takes more than eight seconds to load, the user is much more likely to become frustrated and leave the site. Thus there is also significant economic incentive for many content providers to provide a responsive Web experience.

1.1.1 Methods to improve Web response times

Many factors contribute to a less-than-speedy Web experience, including heterogeneous network connectivity, real-world distances, and congestion in networks or servers due

to unexpected demand. As a result, many researchers (often as entrepreneurs) have considered the problem of improving Web response times. Some want to improve performance by achieving bandwidth increases and response-time gains through bigger “pipes” or alternative communication technologies. Others want to use the existing infrastructure more efficiently. Web caching, along with other forms of data dissemination, has been proposed as a technology that helps reduce network usage and server loads and improve typical response times experienced by the user. When successful, pre-loading Web objects into local caches can further reduce Web response times [PM96], and even shift network loads from peak to non-peak periods [MRGM99]. Our interest is in pre-loading interactively: by dynamically pre-loading Web objects likely to be of interest into local caches, we may invisibly improve the user experience through better response times.

To be more precise, let us start with a definition:

Pre-loading is the speculative installation of data in a cache in the anticipation that it will be needed in the future. (1)

On the Web, pre-loading typically implies the transmission of data over a network, and can be performed anywhere that a cache is present, including client browsers and caching proxies, but the definition equally applies to cache pre-loading in disk caches, CPU caches, or back-end Web server caches where the source of the data is some other system (such as a database). We use prefetching to distinguish the more specific form of pre-loading in which the system holding the cache initiates the retrieval of its own volition. Thus:

Prefetching is the (cache-initiated) speculative retrieval of a resource into a cache in the anticipation that it can be served from cache in the future. (2)

Pre-loading, then, is broader, encompassing both prefetching and prepushing, in which content is pushed from server to cache. While this dissertation will focus on prefetching, it will utilize models in which the server sends hints specifying the server’s recommendation on what should be prefetched.
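To make these definitions concrete, here is a minimal sketch (our illustration, not the mechanism implemented in this dissertation) of hint-driven prefetching: the cache receives a list of URLs recommended by the server and speculatively retrieves any it does not already hold. The cache structure and function names are assumptions made for the example.

import urllib.request

cache = {}  # URL -> response body; a stand-in for a browser or proxy cache

def prefetch(hinted_urls):
    # Speculatively retrieve server-hinted URLs into the local cache.
    for url in hinted_urls:
        if url in cache:
            continue                      # already cached; nothing to do
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                cache[url] = resp.read()
        except OSError:
            pass                          # prefetching is speculative; ignore failures

def fetch(url):
    # Serve from cache when possible (a hit); otherwise retrieve on demand (a miss).
    if url in cache:
        return cache[url]
    with urllib.request.urlopen(url, timeout=5) as resp:
        body = resp.read()
    cache[url] = body
    return body

If the server's hints are accurate, later calls to fetch() for the hinted URLs are served locally, which is precisely the response-time benefit this work aims to evaluate.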

1.1.2 User action prediction for the Web

Most requests on the Web are made on behalf of human users, and like other human-computer interactions, the actions of the user can be characterized as having identifiable regularities. Many of these patterns of activity, both within a user and between users, can be identified and exploited by intelligent action prediction mechanisms. Prediction here is different from what data mining approaches do with Web logs. Our user modeling attempts to build a (relatively) concise model of the user so as to be able to dynamically predict the next action(s) that the user will take. Data mining, in contrast, is typically concerned with characterizing the user, finding common attributes of classes of users, and predicting future actions (such as purchases) without the concern for interactivity or immediate benefit (e.g., see the KDD-Cup 2000 competition [BK00]).

Therefore one research focus is to apply machine learning techniques to the problem of user action prediction on the Web. In particular, we wish to be able to predict the next Web page that a user will select. If one were able to build such a user model, a system using it could anticipate each page retrieval and fetch that page ahead of time into a local cache so that the user experiences short response times. In this dissertation, we will demonstrate the use of machine learning models on real-world traces with predictive accuracies of 12-50% or better, depending on the trace.
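As a deliberately simplified illustration of the kind of history-based next-page prediction described above (our own sketch, not one of the models evaluated later in this dissertation), a first-order model can track how often one URL follows another in a user's request stream and predict the most frequent successors of the current page. The class name and the tiny example trace are assumptions.

from collections import defaultdict

class FirstOrderPredictor:
    # Minimal first-order (Markov-like) model over a single user's request stream.

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # prev URL -> {next URL: count}
        self.prev = None

    def observe(self, url):
        # Record a request, incrementally updating the transition counts.
        if self.prev is not None:
            self.counts[self.prev][url] += 1
        self.prev = url

    def predict(self, n=1):
        # Return up to n most likely next URLs given the current page.
        successors = self.counts.get(self.prev, {})
        return sorted(successors, key=successors.get, reverse=True)[:n]

model = FirstOrderPredictor()
for url in ["/index.html", "/news.html", "/index.html", "/news.html", "/index.html"]:
    model.observe(url)
print(model.predict(n=2))   # ['/news.html']

A prefetching agent would hand the predicted URLs to a pre-loading mechanism that retrieves them into the local cache ahead of the user's request.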

Naturally, pre-loading objects into a cache is not a new concept, and has already been incorporated into a few proxy caches and into a number of browser extensions (see our Web site on Web caching [Dav02b] for pointers to caching products and browser extensions). Thus we concentrate on the incorporation of a variety of sources of information for prediction, and the principled evaluation and comparison of such systems. We believe that multiple sources are necessary in order to incorporate desired characteristics into the system. Such sources of information would certainly include client history, so that an individual’s pattern of usage would serve as a strong guide. But we would also want to include community usage patterns (from proxy and origin servers) so that typical usage patterns may be used as intelligent defaults for cases in which there is no individual history. Context is also important when history is not relevant

— we use the textual contents of recent pages as a guide to the current interests of the user, and the link contents of those pages as significant influences on what may be chosen next. Finally, we recognize that the contents of related applications (such as Usenet news and electronic mail) also present URLs that can be chosen as pages to be retrieved, but leave that as future work for others (e.g., [HT01]). Our conjecture is that the appropriate combination of information from sources such as these will make more accurate predictions possible via a better user model, and thus reduce the amount of extra bandwidth required to generate adequate improvements in response times.

However, the evaluation of such models in terms of response time improvements requires the incorporation of real-world considerations such as network characteristics and content caching. Through simulation, we will model network connections with latencies and bandwidth limitations, thus limiting the transmission speed of content, or delaying its transmission when the simulated network is already at capacity. Embedding prediction mechanisms to perform prefetching within this environment will allow for more accurate evaluation of the efficacy of the prediction techniques. Simulation experiments in this dissertation will show the potential for prefetching on server traces to cut median response times in half or better, and to reduce the sum of all response times by 15-20%, while transmitting 80-85% more bytes than would be transmitted by the equivalent caching-only configuration.
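The response-time estimates such a simulation produces rest on simple network arithmetic. The sketch below (our illustration with assumed parameter values, not the simulator developed later in this dissertation) approximates the client-perceived time for a single retrieval as connection setup plus a request/response round trip plus transfer time at the available bandwidth, the same first-order approximation discussed with Figure 2.3 in the next chapter.

def estimated_response_time(size_bytes, rtt_s, bandwidth_bps,
                            new_connection=True, dns_s=0.0):
    # One RTT to establish a TCP connection (if needed), one RTT for the
    # request/response exchange, plus the time to transmit the body.
    setup = rtt_s if new_connection else 0.0
    transfer = (size_bytes * 8) / bandwidth_bps
    return dns_s + setup + rtt_s + transfer

# Illustrative numbers only: a 16 KB response over a 56 kbps modem with 150 ms RTT
# takes roughly 2.6 seconds; a cache hit avoids nearly all of that delay.
print(round(estimated_response_time(16000, 0.150, 56000), 2))

A full simulator must additionally queue transmissions when a link is saturated, which is why response times cannot simply be summed from per-request formulas.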

1.1.3 Questions and answers

In general, the quest to pre-load Web content has generated a number of questions:

• How can future Web activity be predicted?

• What methods can be used for pre-loading Web content?

• Can better predictions be made by combining the results of separate prediction systems?

• Does the accuracy of a prediction algorithm reflect its utility in a cache pre-loading environment?

• How can the performance improvements of caching and prefetching techniques be estimated?

• How are implemented caching systems evaluated?

• How can Web cache prefetching systems be evaluated?

• What safety and integrity concerns are there for prefetching systems?

• Why are pre-loading systems relatively uncommon on today’s Web?

The answers to these questions, and others, can be found to varying degrees in this dissertation. The search for answers has led to investigation into a variety of areas, including machine learning, simulation, networking, and information retrieval.

What we have found is that while many researchers have proposed various methods for prediction and pre-loading, few are able to accurately evaluate such systems in terms of client-side response times. Generally, one may choose from three approaches to studying system performance: analytic modeling, simulation, and direct measurement. Each provides unique insights into the problem. Analytic approaches provide tools to model systems and scenarios to find trends and limits. Simulation allows for the rapid testing of a variety of algorithms without causing undue harm to the real world. Direct measurements provide grounding in reality with “existence proofs” and challenges for explanations. To realistically consider response times on the Web, however, strict analytic models become unmanageably complex, and thus we have concentrated our efforts on the latter two.

The primary focus of this work has been the design and development of two tools for evaluation of such techniques: a simulator for testing caching and prefetching algorithms, and an evaluation architecture for the testing of implemented proxy caches. Thus, in addition to explorations and surveys of prediction techniques, the majority of the dissertation will propose, implement, validate, and give examples of the use of each tool.

1.2 Contributions

Although the research community has paid significant attention to pre-loading in the time since this thesis was conceived, this dissertation makes a number of contributions organized around its two themes:

• Prediction and Prefetching Methods for the Web

1. We identify and enumerate key aspects of idealized online learning algorithms.

2. We propose and demonstrate the utility of a simple approach to user action prediction (realizing an accuracy of approximately 40%) that allows the user’s profile to change over time.

3. We implement a parameterized history-based prediction method, and with it evaluate a space of Markov-like prediction schemes (achieving accuracies of 12-50% or better, depending on the dataset).

4. We explore the potential for content-based prediction by studying aspects of the Web, and evaluate proposed content-based methods on a full-content Web usage trace. We find a content-based approach to be 29% better than random link selection for prediction, and 40% better than not prefetching in a system with an infinite cache.

5. We consider the performance improvements possible when combining predictions from multiple sources, and find that combining disparate sources can produce significantly better accuracy than either source alone.

6. We identify deficiencies in HTTP for prefetching and propose corrective extensions to the protocol.

• Accurate Evaluation

1. We collect a small, full-content user workload for off-line analysis, since typical usage logs are insufficient.

2. We survey evaluation mechanisms for Web cache pre-loading and reveal problems in them.

3. We identify the need to include network effects in simulations to accurately estimate client-side response times.

4. We implement and validate a new proxy caching and network effects simulator to estimate Web response times at the client. Using it, we find that client prefetching based on Web server predictions can significantly reduce typical Web response times, without excessive bandwidth overhead.

5. We propose and implement a novel black-box proxy evaluation architecture that is specifically capable of evaluating prefetching proxies.

6. We perform experiments using the evaluation architecture to validate the implementation and to test multiple proxies with both artificial and real-world workloads.

1.3 Dissertation Outline

Immediately following this chapter is a primer on Web caching to provide a minimal background and to motivate the rest of the dissertation. The remaining chapters are divided into two parts.

Initially we consider the task of action prediction. The approaches taken for action prediction can vary considerably, and often have complex implementations. In many cases, however, simple methods can do well, and so in Chapter 3 we first consider a relatively unsophisticated approach to the similar problem of using past history to predict user actions within the application domain of UNIX shell commands. We then examine in Chapter 4 the particular problem of prediction of requests for Web objects. Various Markov-like prediction models are detailed, evaluated, and compared under multiple definitions of performance.

These initial chapters also deal with the complementary idea of content-based prediction. Here we consider the contents of Web pages to provide hints and recommendations for future requests. We first explore the potential of content-based mechanisms

by measuring characteristics of the Web, including topical locality and descriptive quality of page proxies in Chapter 5. Having established the potential for content-based prediction, we then consider in Chapter 6 one approach to using the pages that a user visits as a guide to the next page the user will request, and evaluate the approach on a small full-content Web trace.

In later chapters of the dissertation, we deal head-on with more real-world issues, including careful evaluation of caching and prefetching systems. Thus in Chapter 7 we introduce these concerns, and survey proxy cache evaluation techniques. In Chapter 8 we develop an advanced network and cache simulator, and describe efforts to validate it by comparing it to real-world measurements of Web traffic. Chapter 9 then uses the simulator in experiments to incorporate predictions from multiple sources for prefetching. In addition, we systematically explore various parameter settings for simulations using two Web server traces to identify better-performing prefetching configurations.

In contrast to simulation, Chapter 10 introduces a novel architecture for the simultaneous evaluation of multiple real-world black-box proxy caches. This architecture is then implemented, tested, and used to evaluate proxy caches in Chapter 11. Chapter 12 discusses the need to change HTTP because of problems with the current HTTP specification and its interpretation. Finally, Chapter 13 wraps up the dissertation by looking ahead to the future of Web caching, and the steps needed to make prefetching widespread.

Chapter 2

A Web Caching Primer

2.1 Introduction

At this time, educated readers are familiar with the spectacular growth of the Internet and the World Wide Web. More content provides incentive for more people to go online, which in turn provides incentive to produce more content. This growth, both in usage and content, has introduced significant challenges to the scalability of the Web as an effective means to access popular content.

When the Web was new, a single entity could (and did) list and index all of the Web pages available, and searching was just an application of the Unix egrep command over an index of 110,000 documents [McB94]. Today, even though the larger search engines index billions of documents, any one engine is likely to see only a fraction of the content available to users [LG99].

Likewise, when the Web was new, page retrieval was not a significant issue — the important thing was that it worked! Now with the widespread commercialization of the Web, exceeding the ‘eight-second rule’ for download times may mean the loss of significant revenue [Zon99] as increasing numbers of users will go on to a different site if they are unsatisfied with the performance of the current one.

Finally, as increasing use of the Web has necessitated faster and more expensive connections to the Internet, the concern for efficient use of those connections has similarly increased. Why should an organization (either a net consumer or producer of content) pay for more bandwidth than is necessary?

This chapter provides a primer on one technology used to make the Web scalable: Web resource caching. Web caching can reduce bandwidth usage, decrease user-perceived latencies, and reduce Web server loads. Moreover, this is generally accomplished in a manner that is transparent to the end user. As a result, caching has become a significant part of the infrastructure of the Web, and thus represents a strong commercial market (more than $2B in caching sales expected in 2003 [Int99]). Caching has even spawned a new industry: content delivery networks, which are likewise growing at a fantastic rate (expecting over $3B in sales and services by 2006 [CK02]). In subsequent sections we will introduce caching, discuss how it differs from traditional caching applications, describe where Web caching is used, and when and why it is useful.

Those readers familiar with relatively advanced Web caching topics such as the Internet Cache Protocol (ICP) [WC97b], invalidation, or interception proxies are not likely to learn much here. Instead, this chapter is designed for a more general audience that is familiar only as a user of the Web. Likewise, this is not a how-to guide to deploy caching technologies, but rather argues at a higher level for the value of caching on the Web to both consumers and producers of content. Furthermore, while this chapter will provide references to many other published papers, it is not intended as a survey. Instead, the reader may wish to examine some surveys [Wan99, BO00, Mog00] for a more detailed treatment of current Web caching trends and techniques.

2.2 Caching

A simple example of caching is found in the use of an address book that one keeps close at hand. By keeping oft-used information nearby, one can save time and effort. While the desired phone number may be listed in a phone book, looking it up there takes longer and may first require retrieving the phone book from across the room. This is the general principle used by caching systems in many contexts.

2.2.1 Caching in memory systems

A long-standing use of caching is in memory architectures [Smi82, HP87, Man82, MH99]. The central processing unit (CPU) of a modern computer operates at very high speeds. Memory systems typically are unable to work at the same speeds, and so when the CPU needs to access information in memory, it must wait for the slower memory to provide the required data. If a CPU always had to slow down for memory access, it would operate at the effective speed of a much slower CPU. To help minimize such slowdowns, the CPU designer provides one or more levels of cache — a small amount of (expensive) memory that operates at, or close to, the speed of the CPU. Thus, when the information that the CPU needs is in the cache, it doesn’t have to slow down. When the cache contains the requested object, this is called a hit. When the requested object cannot be found in the cache, there is a cache miss, and so the object must be fetched directly (incurring any requisite costs for doing so).

Typically, when a cache miss occurs, the object retrieved is then placed in the cache so that the next time the object is needed, it can be accessed quickly. An assumption of temporal locality — the idea that a recently requested object is more likely than others to be needed in the future — is at work here. A second kind of locality, spatial locality, in which nearby objects are likely to be needed, is also typically assumed. For this reason (and to take advantage of other efficiencies), memory systems retrieve multiple consecutive memory addresses and place them in the cache in a single operation, rather than just one object at a time.

Regardless of how objects get placed in a cache, at some point, the cache will become full. Some algorithm will be used to remove items from the cache to make room for new objects. Various methods have been proposed and implemented, including variations of simple ideas like first-in/first-out (FIFO), least recently used (LRU), or least frequently used (LFU). When selecting a replacement algorithm such as one of these, the goal is to optimize performance of the cache (e.g., maximizing the likelihood of a cache hit for typical memory architectures).
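As a concrete example of one such policy, the following minimal sketch (ours, not drawn from any system studied in this dissertation) implements an LRU cache: every access moves an entry to the most-recently-used position, and the least recently used entry is evicted when capacity is exceeded.

from collections import OrderedDict

class LRUCache:
    # Fixed-capacity cache that evicts the least recently used entry.

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()       # key -> value, ordered oldest to newest

    def get(self, key):
        if key not in self.entries:
            return None                    # cache miss
        self.entries.move_to_end(key)      # mark as most recently used
        return self.entries[key]           # cache hit

    def put(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict least recently used

A Web cache would typically bound the total bytes stored rather than the number of entries, a distinction that matters because, as discussed below, Web objects vary widely in size.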


Figure 2.1: A simplistic view of the Web in which the Web consists of a set of Web servers and clients and the Internet that connects them.

2.2.2 Mechanics of a Web request

Before getting into caching on the Web, we first need to address how the World-Wide Web works. In its simplest form, the Web can be thought of as a set of Web clients (such as Web browsers, or any other program used to make a request of a Web server) and Web servers. In order to retrieve a particular Web resource, the client attempts to communicate over the Internet with the origin Web server (as depicted in Figure 2.1). Given a URL such as http://www.web-caching.com/, a client can retrieve the content (the home page) by establishing a connection with the server and making a request. However, in order to connect to the server, the client needs the numerical identifier for the host. It queries the domain name system (DNS) to translate the hostname (www.web-caching.com) to its Internet Protocol (IP) address (209.182.1.122). Once it has the IP address, the client can establish a connection to the server. Once the Web server has received and examined the client’s request, it can generate and transmit the response. Each request and response includes headers (such as shown in Figure 2.2) and possibly content (e.g., the HTML of the page requested, or the data for a requested image). Each step in this process takes some time (see Figure 2.3 for estimates of the time cost to establish a connection, request content, and receive the content).
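The steps just described can be made concrete with a short sketch (ours, for illustration only) that resolves the hostname and issues a bare HTTP/1.1 GET over a TCP socket; a real browser adds many more headers and handles redirects, persistent connections, and caching.

import socket

def simple_get(host, path="/", port=80):
    # Resolve the hostname, connect, send a minimal request, and return the raw response.
    ip = socket.gethostbyname(host)            # DNS: hostname -> IP address
    with socket.create_connection((ip, port), timeout=10) as sock:
        request = ("GET " + path + " HTTP/1.1\r\n"
                   "Host: " + host + "\r\n"
                   "Connection: close\r\n\r\n")
        sock.sendall(request.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)             # read until the server closes the connection
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)

# Example: raw = simple_get("www.web-caching.com")

Each step in the function corresponds to a delay the user experiences: the DNS lookup, the connection establishment, and the transfer of the response itself.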

The HyperText Transfer Protocol (HTTP) specifies the interaction between Web clients, servers, and intermediaries. The current version is HTTP/1.1 [FGM+99]. Requests and responses are encoded in headers that precede an optional body containing content. Figure 2.2 shows one set of request and response headers. The initial line of the request shows the method used (GET), the resource requested (“/”), and the version of HTTP supported (1.1). GET is just one of the methods defined for HTTP; another commonly used method is POST, which allows for content to be sent with the request (e.g., to carry variables from an HTML form). Likewise, the first line of the response headers shows the HTTP version supported, and a response code with standard values (e.g., 200 is OK, 404 is Not Found, etc.).

Request Headers:

GET / HTTP/1.1
Host: www.web-caching.com
Referer: http://vancouver-webpages.com/CacheNow/
User-Agent: Mozilla/4.04 [en] (X11; I; SunOS 5.5.1 sun4u)
Accept: */*
Connection: close

Response Headers:

HTTP/1.1 200 OK
Date: Mon, 18 Dec 2000 21:25:23 GMT
Server: Apache/1.3.12 (Unix) mod_perl/1.18 PHP/4.0B2
Cache-Control: max-age=86400
Expires: Tue, 19 Dec 2000 21:25:23 GMT
Last-Modified: Mon, 18 Dec 2000 14:54:21 GMT
ETag: "838b2-4147-3a3e251d"
Accept-Ranges: bytes
Content-Length: 16711
Connection: close
Content-Type: text/html

Figure 2.2: Sample HTTP request and response headers. Headers identify client and server capabilities as well as describe the response content.

The headers of an HTTP transaction also specify aspects relevant to the cacheability of an object. The headers from the example in Figure 2.2 that are meaningful for caching include Date, Last-Modified, ETag, Cache-Control, and Expires. For example, in HTTP GET requests that include an If-Modified-Since header, Web servers use the Last-Modified date of the current content to return the object only if it has changed since the date of the cached copy. An accurate clock is essential for the origin server to be able to calculate and present modification times as well as expiration times in the other tags. An ETag (entity tag) represents a signature for the object and allows for a stronger test than If-Modified-Since — if the signature of the current object at this URL matches the cached one, the objects are considered equivalent. The Expires and Cache-Control: max-age headers specify how long the object can be considered valid. For slowly or never changing resources, setting an explicit expiration date is the way to tell caches how long they can keep the object (without requiring the cache to contact the origin server to validate it).

Figure 2.3: HTTP transfer timing cost for a new connection. The amount of time to retrieve a resource when a new connection is required can be approximated by two round-trip times plus the time to transmit the response (plus DNS resolution delays, if necessary).
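
To illustrate the freshness decision that these headers support, the following sketch (a simplified illustration rather than a full implementation of the HTTP/1.1 expiration rules, which have additional cases) checks whether a cached response may still be served based on Cache-Control: max-age and Expires; the header dictionary and timestamps are invented for the example.

from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def is_fresh(headers: dict, fetched_at: datetime, now: datetime) -> bool:
    """Return True if the cached copy may be served without revalidation."""
    cache_control = headers.get("Cache-Control", "")
    for directive in cache_control.split(","):
        directive = directive.strip()
        if directive.startswith("max-age="):
            max_age = int(directive.split("=", 1)[1])
            # max-age takes precedence over Expires in HTTP/1.1.
            return (now - fetched_at).total_seconds() < max_age
    if "Expires" in headers:
        return now < parsedate_to_datetime(headers["Expires"])
    return False  # no explicit lifetime: conservatively treat as stale

# Example using the lifetime from Figure 2.2 (max-age=86400, i.e., one day).
fetched = datetime(2000, 12, 18, 21, 25, 23, tzinfo=timezone.utc)
later = datetime(2000, 12, 19, 10, 0, 0, tzinfo=timezone.utc)
print(is_fresh({"Cache-Control": "max-age=86400"}, fetched, later))  # True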

2.2.3 Web caching

Now that we have discussed the basics of caching (from memory systems) and seen some of the details of an individual Web request and response, we can introduce some of the characteristics of caching on the Web and then discuss why caching is successful. Caching in the Web works in a similar manner to that of caching in memory systems — a Web cache stores Web resources that are expected to be requested again. However, it also differs in a number of significant aspects. Unlike the caching performed in many other domains, the objects stored by Web caches vary in size. As a result, cache operators and designers track both the overall object hit rate (percentage of requests served from cache) as well as the overall byte hit rate (percentage of bytes served from cache). Traditional replacement algorithms often assume the use of a fixed object size, so variable sizes may affect their performance. Additionally, non-uniform object size is one contributor to a second difference, that of non-uniform retrieval costs. Web resources that are larger are likely to have a higher cost (that is, higher response times) for retrieval, but even for equally-sized resources, the retrieval cost may vary as a result of distance traveled, network congestion, or server load (see Figure 2.4 for one example of real usage in which retrieval times for the same size object vary over multiple orders of magnitude). Finally, some Web resources cannot or should not be cached, e.g., because the resource is personalized to a particular client or is constantly updated (as in dynamic pages, discussed further in Section 2.2.8).

Figure 2.4: In this scatter-plot of real-world responses (the first 100,000 from the UC Berkeley Home IP HTTP Trace of dialup and wireless modem users [Gri97]), the response time varies significantly, even for resources of a single size.
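
As a small illustration of the two metrics just mentioned, the following sketch computes the object hit rate and the byte hit rate from a list of (hit, size) records; the log format and the numbers are invented for the example.

def hit_rates(log):
    """log: iterable of (was_hit: bool, size_in_bytes: int), one entry per response."""
    requests = hits = total_bytes = hit_bytes = 0
    for was_hit, size in log:
        requests += 1
        total_bytes += size
        if was_hit:
            hits += 1
            hit_bytes += size
    return hits / requests, hit_bytes / total_bytes

# Three small cached objects and one large miss: the object hit rate looks
# good (75%) while the byte hit rate tells a different story (about 23%).
log = [(True, 10_000), (True, 12_000), (True, 8_000), (False, 100_000)]
print(hit_rates(log))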

2.2.4 Where is Web caching used?

Caching is performed in various locations throughout the Web, including the two endpoints about which a typical user is aware. Figure 2.5 shows one possible chain of caches through which a request and response may flow. The internal nodes of this chain are termed proxies, since they will generate a new request on behalf of the user if the proxy cannot satisfy the request itself. A user request may be satisfied by a response captured in a browser cache, but if not, it may be passed to a department– or organization–wide proxy cache. If a valid response is not present there, the request may be received by a proxy cache operated by the client’s ISP. If the ISP cache does not contain the requested response, it will likely attempt to contact the origin server. However, reverse proxy caches operated by the content provider’s ISP or content delivery network (CDN) may instead respond to the request. If they do not have the requested information, the request may ultimately arrive at the origin server. Even at the origin server, content, or portions of content, can be stored in a server-side cache so that the server load can be reduced (e.g., by reducing the need for redundant computations or database retrievals).

Browser (plus cache) → Organization forward proxy cache → Client’s ISP forward proxy cache → Server’s ISP or CDN reverse proxy cache → Origin server (plus server-side cache)

Figure 2.5: Caches in the World Wide Web. Starting in the browser, a Web request may potentially travel through multiple caching systems on its way to the origin server. At any point in the sequence a response may be served if the request matches a valid response in the cache.

Figure 2.6: Cooperative caching in which caches communicate with peers before making a request over the Web.

Wherever the response is generated, it will likewise flow through the reverse path back to the client. Each step in Figure 2.5 has multiple arrows, signifying relationships with multiple entities at that level. For example, the reverse proxy (sometimes called an HTTP accelerator) operated by the content provider’s ISP or CDN can serve as a proxy cache for the content from multiple origin servers, and can receive requests from multiple downstream clients (including forward caches operated by others, as shown in the diagram).

In general, a cache need not talk only to the clients below it and the server above it. In fact, in order to scale to large numbers of clients it may be necessary to consider using multiple caches. A hierarchical caching structure [CDN+96] is common, and is partially illustrated in Figure 2.5. In this form, each cache serves many clients, which may be users or other caches. When the request is not served by a local cache, the request is passed on to a higher level in the hierarchy, until reaching a root cache, which, having no parent, would request the object from the origin server.

An alternative approach is to use a cooperative caching architecture (see Figure 2.6), in which caches communicate with each other as peers (e.g., using an inter-cache protocol such as ICP [WC97b]). In this form, on a miss, a cache would ask a predetermined set of peers whether they already have the missing object. If so, the request would be routed to the first responding cache. Otherwise, the cache attempts to retrieve the object directly from the origin server. This approach can prevent storage of multiple copies and reduce origin server retrievals, but pays a penalty in the form of increased inter-cache communication. Instead of asking peers directly about each request, another variation periodically gives peers a summary of the contents of the cache [FCAB98, RW98], which can significantly reduce communication overhead.
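
The following sketch illustrates this lookup pattern in a simplified form; the Peer objects and the fetch_from_origin callable are hypothetical stand-ins for real inter-cache communication (such as ICP) and origin retrieval, not implementations of them.

class CooperativeCache:
    def __init__(self, peers, fetch_from_origin):
        self.store = {}                   # url -> cached response
        self.peers = peers                # objects exposing lookup(url)
        self.fetch_from_origin = fetch_from_origin

    def get(self, url):
        if url in self.store:             # local hit
            return self.store[url]
        for peer in self.peers:           # ask peers; use the first that answers
            response = peer.lookup(url)
            if response is not None:
                return response           # served by a peer
        response = self.fetch_from_origin(url)   # all peers missed
        self.store[url] = response
        return response

class DictPeer:
    """Toy peer backed by a dictionary, standing in for a remote cache."""
    def __init__(self, contents):
        self.contents = contents
    def lookup(self, url):
        return self.contents.get(url)

cache = CooperativeCache([DictPeer({"http://a/1": "body-1"})],
                         fetch_from_origin=lambda url: "origin copy of " + url)
print(cache.get("http://a/1"))   # served by the peer
print(cache.get("http://a/2"))   # peers miss; fetched from the origin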

Variations and combinations of these approaches have been proposed; current thinking [WVS+99] limits cooperative caching to smaller client populations, but both cooperative caching and combinations are in use in real-world sites, particularly where network access is very expensive.

Technically, a client may or may not know about intermediate proxy caches. If the client is configured directly to use a proxy cache, it will send to the proxy all requests not satisfied by a built-in cache. Otherwise, the client has to look up the IP address of the origin host. If the content provider is using a CDN, the DNS servers may be customized to return the IP address of the server (or proxy cache) closest to the client (where closest likely means not only network distance, but may also take into consideration additional information to avoid overloaded servers or networks). In this way, a reverse proxy server can operate as if it were the origin server, answering immediately any cached requests, and forwarding the rest.

Even when the client has the IP address of the server that it should contact, it might not end up doing so. Along the network path between the client and the server, there may be a network switch or router that will ‘transparently’ direct all Web requests to a proxy cache. This approach may be utilized at any of the proxy cache locations shown in Figure 2.5. Under this scenario, the client believes it has contacted the origin server, but instead has reached an interception proxy which will serve the content either from cache, or by first fetching it from the origin server.

2.2.5 Utility of Web caching

Web caching has three primary benefits. The first is that it can reduce network bandwidth usage, which can save money for both clients (consumers of content) and servers (content creators) through more efficient usage of network resources. The second is that it can reduce user-perceived delays. This is important as the longer it takes to retrieve content, the lower the value as perceived by the user [RBP98, BBK00]. Finally, caching can also reduce loads on the origin servers. While this effect may not seem significant to end users, it can allow significant savings in hardware and support costs for content providers as well as a shorter response time for non-cached resources.

The combination of these features makes caching technology attractive to all Web participants, including end-users, network managers, and content creators. There are, of course, some potential drawbacks, which we will explore shortly. First, we need to discuss how caching achieves the benefits we have described.

2.2.6 How caching saves time and bandwidth

When a request is satisfied by a cache (whether it is a browser cache, or a proxy cache run by your organization or ISP, or an edge proxy operated by a CDN), the content no longer has to travel across the Internet from the origin Web server to the cache, saving bandwidth for the cache owner as well as the origin server.

TCP, the network protocol used by HTTP, has a fairly high overhead for connection establishment and sends data slowly at first. This, combined with the knowledge that most requests on the Web are for relatively small resources, means that reducing the number of connections that are necessary and holding them open (making them persistent) so that future requests can use them, has a significant positive effect on client performance [PM94, NGBS+97]. In particular, for clients of forward proxies, time can be saved; the client can retain a persistent connection to the proxy, instead of taking the time to establish new connections with each origin server that a user might visit during a session. Use of persistent connections is particularly beneficial to clients suffering from high-latency network service such as via dial-up modems [CDF+98, FCD+99].

Furthermore, busy proxies may be able to use persistent connections to send requests for multiple clients to the same server, reducing connection establishment times to servers as well. Therefore, a proxy cache supporting persistent connections can cache connections on both client and server sides (avoiding the initial round trip time for a new HTTP connection shown in Figure 2.3).
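
As a concrete illustration of connection reuse (a minimal sketch using the Python standard library; the hostname and paths are placeholders, and the server is assumed to keep the connection alive), several requests can share one persistent HTTP/1.1 connection, paying the connection-establishment round trip only once.

import http.client

conn = http.client.HTTPConnection("www.example.com", 80, timeout=10)
for path in ["/", "/page1.html", "/page2.html"]:   # hypothetical resources
    conn.request("GET", path)                      # reuses the same TCP connection
    response = conn.getresponse()
    body = response.read()                         # drain before the next request
    print(path, response.status, len(body), "bytes")
conn.close()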

Caching on the Web works because of various forms of temporal and spatial locality — people request popular objects from popular servers. In one study spanning more than a month, out of all the objects requested by individual users, close to 60% of those objects on average were requested more than once by the same user [TG97]. So, individual users request distinct Web objects repeatedly. Likewise, there is plenty of content that is of value to more than one user. In fact, of the hits provided in another caching study, up to 85% were the result of multiple users requesting the same objects [DMF97].

In summary, Web caching works because of popularity — the more popular a re- source is, the more likely it is to be requested in the future — either by the same client, or by another. However, caching is not a panacea; it can introduce problems as well.

2.2.7 Potential problems

There are a number of potential problems associated with Web caching. Most significant from the views of both the content consumer and the content provider is the possibility of the end user seeing stale (i.e., old or out-of-date) content, compared to what is available on the origin server (i.e., fresh content). HTTP does not ensure strong consistency and thus the potential for data to be cached too long is real. The likelihood of this occurrence is a trade-off that the content provider can explicitly manage (as we discuss in the next section).

Caching tends to improve the response times only for cached responses that are subsequently requested (i.e., hits). Misses that are processed by a cache are generally slowed, as each system through which the transaction passes adds a small amount of latency. Thus, a cache only benefits requests for content already stored in it. Caching is also limited by a high rate of change for popular Web resources and, importantly, by the fact that many resources will be requested only once [DFKM97]. Finally, some responses cannot or should not be cached, as described below.

2.2.8 Content cacheability

Not every resource on the Web is cacheable. Of those that are, some will be cacheable for long periods by any cache, and others may have restrictions, such as being cacheable only for short periods or only by certain kinds of caches (e.g., non-proxy caches). This flexibility maximizes the opportunity for caching of individual resources. The cacheability of a Web site affects both its user-perceived performance and the scalability of a particular hosting solution. Instead of taking seconds or minutes to load, a cached object can appear almost instantaneously. Regardless of how much the hosting costs, a cache-friendly design will allow more pages to be served before needing to upgrade to a more expensive solution. Fortunately, the cacheability of a resource is determined by the content provider. In particular, it is the Web server software that will set and send the HTTP headers that determine cacheability,1 and the mechanisms for doing so vary for each Web server. To maximize the cacheability of a Web site, typically all static content (buttons, graphics, audio and video files, and pages that rarely change) is given an expiration date far in the future so that it can be cached for weeks or months at a time.

By setting an expiration date far into the future, the content provider is trading off the potential of caching stale data with reduced bandwidth usage and improved response time for the user. A shorter expiration date reduces the chance of the user seeing stale content, but increases the number of times that caches will need to validate the resource. Currently, most caches use a client polling approach that favors re-validation for objects that have changed recently [GS96]. Since this can generate significant overhead (especially for popular content), a better approach might be for the origin server to invalidate the cached objects [LC98], but only recently has this approach been considered for non-proprietary networks. Work is underway to develop the Web Cache Invalidation Protocol (WCIP) [LCD01], in which Web caches subscribe to invalidation channel(s) corresponding to content in which they are interested. This protocol is intended to allow the caching and distribution of large numbers of frequently changing Web objects with freshness guarantees.

1Note that HTML META tags are not valid ways to specify caching properties and are ignored by most proxy caching products since they do not examine the contents of an object (i.e., they do not see the HTML source of a Web page).

Dynamically generated objects are typically considered uncacheable, although they are not necessarily so. Examples of dynamically generated objects include fast-changing content like stock quotes, personalized pages, real-time photos, results of queries to databases (e.g., search engines), page access counters, and e-commerce shopping carts. Rarely would it be desirable for any of these to be cacheable at an intermediate proxy, although some may be cacheable in the client browser cache (such as personalized resources that don’t change often).

However, dynamically generated objects constitute an increasing fraction of the Web. One proposed approach to allow for the caching of dynamic content is to cache programs that generate or modify the content, such as an applet [CZB99]. Another is to focus on the server, and to enable caching of portions of documents so that server-side operations are optimized (e.g., server-side products from companies like SpiderCache [Spi02], Persistence [Per02] and XCache [XCa02]). Since most dynamic pages are mostly static, it might make sense to just send the differences between pages or between versions of a page [HL96, MDFK97] or to break those documents into separately cacheable pieces and reassemble them at the client (e.g., [DHR97]). The latter is also essentially what is proposed in Edge Side Includes [OA01] (e.g., Akamai’s EdgeSuite service [Aka02]), except that the Web pages are assembled at edge servers rather than at the client.

2.2.9 Improving caching latencies

Caching can provide only a limited benefit (typically reaching object hit rates of 40-50% with sufficient traffic), as a cache can only provide objects that have been previously requested. If future requests can be anticipated, those objects can be obtained in advance. Once available in a local cache, those objects can be retrieved with minimal delay, enhancing the user experience. Prefetching is a well-known approach to decrease access times in the memory hierarchy of modern computer architectures (e.g., [Smi82, Mow94, KCK+96, VL97, KW98a, LM99, JG99, SSS99]), and has been proposed by many as a mechanism for the same in the World Wide Web (e.g., [PM96, KLM97]).

While the idea is promising, prefetching is not straightforward to evaluate (as we point out later in this thesis and then propose an alternative approach in Chapter 10) and has not been widely implemented in commercial systems. It can be found in some browser add-ons and workgroup proxy caches that will prefetch the links of the current page, or periodically prefetch the pages in a user’s bookmarks. One significant difficulty is to accurately predict which resources will be needed next so that mistakes which result in wasted bandwidth and increased server loads are minimized. Another concern is the need to determine what content may be prefetched safely, as some Web requests have (potentially undesirable) side-effects, such as adding items to an online shopping cart (which we discuss further in Chapter 12).

An alternative to true prefetching has been suggested by Cohen and Kaplan when they proposed techniques that do everything but prefetch [CK00]. That is, they demonstrated the usefulness of 1) pre-resolving (performing DNS lookup in advance); 2) pre-connecting (opening a TCP connection in advance); and 3) pre-warming (sending a dummy HTTP request to the origin server). These techniques can be implemented in both proxies and browsers, and could significantly reduce the response times experienced by users without large increases in bandwidth usage or the retrieval of content with side effects.
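
The following minimal sketch illustrates the first two of these techniques using Python's standard socket module, under the assumption that the proxy or browser can guess the server it will contact next; the hostname is a placeholder, and pre-warming would additionally send a small request on the opened connection.

import socket

def pre_resolve(host):
    """Perform the DNS lookup ahead of time and return the first address found."""
    addrinfo = socket.getaddrinfo(host, 80, proto=socket.IPPROTO_TCP)
    return addrinfo[0][4][0]

def pre_connect(host):
    """Open (and hold) a TCP connection so a later request skips one round trip."""
    return socket.create_connection((host, 80), timeout=10)

ip = pre_resolve("www.example.com")      # illustrative hostname
conn = pre_connect("www.example.com")    # kept open for an anticipated request
print("resolved to", ip)
conn.close()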

2.3 Why do research in Web Caching?

There are a number of reasons to do research in Web caching.2 Motivations in Web caching research can be distinguished by their outlook. In the short term, there is the desire to improve upon Web caching systems as they can affect the performance of the Web today. This includes:

2Many of these issues were brought up in a discussion on the ircache mailing list (http://www.ircache.net/). For the original posts, see the thread on ‘free bandwidth’ in the ircache archives at http://www.roads.lut.ac.uk/lists/ircache/ for September 1998.

• Reducing the cost of connecting to the Internet. Traffic on the Web consumes more bandwidth than any other Internet service, and so any method that can reduce the bandwidth requirements is particularly welcome, especially in parts of the world in which telecommunication services are much more expensive than in the U.S., and in which service is often provided on a cost per bit basis. Even when telecommunication costs are reasonable, a large percentage of traffic is destined to or from the U.S. and so must use expensive trans-oceanic network links.

• Reducing the latency of today’s WWW. One of the most common end-user desires is for more speed. Web caching can help reduce the “World Wide Wait”, and so research (such as in Web prefetching) that improves upon user-perceived response times is quite welcome. Latency improvements are most noticeable, again, in areas of the world where data must travel long distances, accumulating significant latency from speed-of-light constraints and from processing time at each system along many network hops, and where the chance of encountering congestion grows with the number of networks traversed. High latency as a result of speed-of-light constraints is particularly taxing in satellite communications.

In the long term one might argue that research in Web caching is not necessary, as the cost of bandwidth continues to drop. However, research in Web caching will continue to reap benefits as:

• Bandwidth will always have some cost. The cost of bandwidth will never reach zero, even though end-user costs are currently going down as competition increases, the market grows, and economies of scale contribute. At the same time, the cost of bandwidth at the core has stayed relatively stable, requiring ISPs to implement methods such as caching to stay competitive and reduce core bandwidth usage so that edge bandwidth costs can be low. Regardless of the

cost of bandwidth, one will always want to maximize the return on investment, and caching will often help. If applied indiscriminately, prefetching can make bandwidth needs larger, but when bandwidth is purchased at a flat rate (i.e., a T1 without per-bit charges), intelligent prefetching can be applied during idle periods to maximize the utility of the fixed cost.

• Non-uniform bandwidth and latencies. Because of physical limitations such as environment and location (including speed-of-light constraints mentioned above), as well as financial limitations, there will always be variations in band- width and latencies. Caching can help to smooth these effects.

• Network distances are increasing. More users are sending requests through firewalls and other proxies for security and privacy, which slows Web response times. Likewise, the use of virtual private networks for telecommuters and remote offices means that Web requests must first be routed through the VPN and suffer the associated latency costs. Both of these increase the number of hops through which content must travel, directly increasing the cost of a Web retrieval. While access to some objects served by CDNs is getting better, such objects still represent only a relatively small fraction of a typical user’s content [KV01, BV02].

• Bandwidth demands continue to increase. New users are still getting con- nected to the Internet in large numbers. Even as growth in the user base slows, demand for increased bandwidth will continue as high-bandwidth media such as audio and video increase in popularity. If the price is low enough, demand will always outstrip supply. Additionally, as the availability of bandwidth increases, user expectations of faster performance correspondingly increase.

• Hot spots in the Web will continue. While some user demand can be predicted (such as for the latest version of a free Web browser), and thus has the potential for intelligent load balancing by distributing content among many systems and networks, other popular Web destinations come as a surprise, sometimes as a result of current events, but also potentially just as a result of desirable content and word of mouth. These ‘hot spots’ resulting from flash traffic loads will

continue to affect availability and response time and can be alleviated through distributed Web caching.

• Communication costs exceed computational costs. Communication is likely to always be more expensive (to some extent) than computation. We can build CPUs that are much faster than main memory, and so memory caches are utilized. Likewise, caches will continue to be used as computer systems and net- work connectivity both get faster.

There are a number of parallels to the persistence of caching in the Web — one is that of main memory in computer systems. The cost of RAM has decreased tremendously over the past decades. Yet relatively few people would claim to have enough, and in fact, demand for additional memory to handle larger applications continues unabated. Therefore, virtual memory (caching) continues to be used strategically instead of purchasing additional physical memory.

2.4 Summary

The benefits of caching are apparent. Given the choices of caching products on the market today, how can one be selected? The process involves determining requirements like features (such as manageability, failure tolerance, and scalability), and performance (such as requests per second, bandwidth saved, and mean response times — often measured in cache benchmarks [RW00, RWW01]). Likewise, as more Web traffic is served by caches and content delivery services instead of origin servers, it is prudent for content developers to consider how to maximize the usefulness of this technology.

By reducing network latencies and bandwidth demands in a way that is functionally transparent to the end points, caching has become a significant part of the infrastructure of the Web. This chapter has provided an overview of Web caching technology, including where Web caching is utilized, and discussed how and why it is beneficial to end users and content providers. To examine this topic in more depth, take a look at books from Addison-Wesley [KR01, RS02] and O’Reilly [Wes01], or visit www.web-caching.com.

Chapter 3

Incremental Probabilistic Action Prediction

3.1 Introduction

How predictable are people? Each of us displays patterns of actions throughout whatever we do. Most occur without conscious thought. Some patterns are widespread among large communities, and are taught, as rules, such as reading from left to right, or driving on the correct side of the road. Other patterns are a function of our lifestyle, such as picking up pizza on the way home from work every Friday, or programming the VCR to record our favorite comedy each week. Many are a result of the way interfaces are designed, like the pattern of movement of your finger on a phone dialing a number you call often, or how you might log into your computer, check mail, read news, and visit your favorite website for the latest sports scores.

As computers pervade more and more aspects of our lives, the need for a system to be able to adapt to the user, perhaps in ways not programmed explicitly by the system’s designer, becomes ever more apparent. A car that can offer advice on driving routes is useful; one that can also guess your destination (such as a pizza parlor because it is Friday and you are leaving work) is likely to be found even more useful, particularly if you didn’t have to program it explicitly with that knowledge. The ability to predict the user’s next action allows the system to anticipate the user’s needs (perhaps through speculative execution or intelligent defaults) and to adapt to and improve upon the user’s work habits (such as automating repetitive tasks). Additionally, adaptive interfaces have also been shown to help those with disabilities [GDMW95, DM92].

This chapter will provide an overview of prediction models, and discuss some aspects of the approaches possible as well as enumerate characteristics that an idealized prediction approach might have. However, we will leave experimental examination of

most of these methods until Chapter 4, and spend the rest of this chapter providing an introduction to just one approach — the use of a simple mechanism for the prediction of user actions. The domain of UNIX command prediction will serve as a testbed for the development of IPAM, an algorithm based on first-order Markov models with an emphasis on recency for prediction of the next action, which we describe below in Section 3.4.

3.2 Background

The problem of modeling a user’s behavior has received much attention from researchers in recent years. Given an accurate model, the user’s interface can be customized — perhaps to improve ease of comprehension or navigation by providing hints or suggestions, or to increase the likelihood of generating a sale on an e-commerce site, or to invisibly work to improve the user’s perceptions of performance by anticipating the needs of the user in advance.

Regardless of the ultimate application, the task is the same — to build a model that can predict future user actions. Building such a model can take various forms, including conducting user interviews, soliciting explicit feedback, or passively watching the user [MMS85, Cha99]. This chapter focuses on perhaps the most common source of information about the user — the user’s past actions.

However, we are also concerned with building and using those predictions within an interactive system — typically one that uses each action as a chance to learn and an opportunity to make predictions to assist the user. This means the learning algorithm needs to be usable in an online setting, unlike batch learning approaches that are typically too slow to be retrained using a large dataset whenever they need to be updated.

“Machine learning is concerned with the question of how to construct computer programs that automatically improve with experience.” [Mit97]. However, while learning to predict a user’s action is certainly within the domain of machine learning, the problem domain is specific enough that the approaches commonly employed are not necessarily typical machine learning algorithms.

The approaches taken for action prediction vary considerably, from subsequence matching (as in [DH97b, SKS98, PP99b]), in which past sequences of actions can provide a possible next action for the current sequence, to Markov and Markov-like models, including both single and multi-step (e.g., [Bes95, PM96, JG99, DK01, PP99a, SYLZ00]), in which the possible next actions are encoded as states with probabilities from the current state (labeled with the most recent action). Other approaches fall in between, such as Prediction by Partial Match (e.g., [KL96, Pal98, PM99, FJCL99]) and TDAG [Lai92, LS94], which combine Markov models of varying order, or go further, such as inducing a grammar (as in [NM96]). Many of these methods have connections to data compression (in particular, TDAG, PPM, and grammar induction), and often have complex implementations. We will discuss a number of them in Chapter 4. In many cases, however, simple methods can do well, as we show in this chapter by introducing a simple Markov-like algorithm for action prediction and testing it in the domain of UNIX commands.

3.3 UNIX Command Prediction

3.3.1 Motivation

We consider here user actions within a command line shell. We have concentrated initially on UNIX command prediction1 because of its continued widespread use; the UNIX shell provides an excellent testbed for experimentation and automatic data collection. However, our interest is in more general action prediction. This chapter, therefore, reflects our focus on the underlying technology for action prediction, rather than on how prediction can be effectively used within an interface.

We wish to address the task of predicting the next element in a sequence, where the sequence is made up of nominal (unordered as well as non-numeric) elements. This type of problem (prediction of the next step in a series) is not studied often by machine learning researchers; concept recognition (i.e., a boolean classification task such as sequence recognition) is more common, as is the use of independent samples from a distribution of examples. UNIX commands, and user actions in general, however, are not independent, and being nominal, don’t fall into the domain of traditional statistical time-series analysis techniques.

1In this work we ignore command arguments and switches, but others [KG00] have replicated our approach on full command lines and similarly found it to be quite successful.

Below we use the user histories from two studies to suggest that relatively naive methods can predict a particular user’s next command surprisingly well. Finally, we will present and analyze a novel algorithm that satisfies the characteristics of an ideal online learning algorithm enumerated in Section 3.3.2 and additionally improves on the performance of our previous work [DH97a, DH97b].

3.3.1.1 Evaluation Criteria

In most machine learning experiments that have a single dataset of independent examples, cross-validation [Sto74, Sto78, BFOS84] is the standard method of evaluating the performance of an algorithm. When cross-validation is inappropriate, partitioning the data into separate training and test sets is common. For sequential datasets, then, the obvious split would have the training set contain the first portion of the sequence, and the test set contain the latter portion (so that the algorithm is not trained on data occurring after the test data). However, since we are proposing an adaptive method, we will be evaluating performance online — each algorithm is tested on the current command using the preceding commands for training. This maximizes the number of evaluations of the algorithms on unseen data and reflects the expected application of such an algorithm.

When considering performance across multiple users with differing amounts of data, we use two methods to compute averages. Macroaveraged results compute statistics separately for each user (e.g., the mean accuracy), and then average these statistics over all users. Alternatively, microaveraged results compute an average over all data, determining the number of correct predictions made across all users divided by the total number of commands for all users combined. The former provides equal weight to all users, since it averages across the average performance of each user; the latter gives equal weight to all predictions, emphasizing users with large amounts of data.
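
A short sketch contrasting the two averaging schemes just defined; each user is represented as a (correct, total) pair, and the numbers are invented for the example.

def macro_average(users):
    """Average of each user's accuracy: every user counts equally."""
    return sum(correct / total for correct, total in users) / len(users)

def micro_average(users):
    """Accuracy over all predictions pooled: every prediction counts equally."""
    return sum(c for c, _ in users) / sum(t for _, t in users)

users = [(400, 1000), (90, 100)]      # a heavy user and a light user
print(macro_average(users))           # 0.65: both users weighted equally
print(micro_average(users))           # about 0.445: dominated by the heavy user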

...
96102513:34:49 cd
96102513:34:49 ls
96102513:34:49 emacs
96102513:34:49 exit
96102513:35:32 BLANK
96102513:35:32 cd
96102513:35:32 cd
96102513:35:32 rlogin
96102513:35:32 exit
96102514:25:46 BLANK
96102514:25:46 cd
96102514:25:46 telnet
96102514:25:46 ps
96102514:25:46 kill
96102514:25:46 emasc
96102514:25:46 emacs
96102514:25:46 cp
96102514:25:46 emacs
...

Figure 3.1: A portion of one user’s history, showing the timestamp of the start of the session and the command typed. (The token BLANK marks the start of a new session.)

3.3.1.2 People Tend To Repeat Themselves

In order to determine how much repetition and other recognizable regularities were present in the average user’s command line work habits, we examined two datasets of many users’ activities in a UNIX shell. In the first (Rutgers) dataset, we collected command histories of 77 users, totaling over 168,000 commands executed during a period of 2-6 months [DH97a, DH97b] (see Figure 3.1 for an example of the data that was collected). The bulk of these users (70) were undergraduate computer science students in an Internet programming course and the rest were graduate students or faculty. All users had the option to disable logging and had access to systems on which logging was not being performed.

The average user had over 2000 command instances in his or her history, using 77 distinct commands during that time. On average over all users (macroaverage), 8.4% of the commands were new and had not been logged previously. The microaverage of new commands, however, was only 3.6%, reflecting the fact that smaller samples had larger numbers of unique commands. Approximately one out of five commands were the same

as the previous command executed (that is, the user repeated the last command 20% of the time).

The second (Greenberg) dataset is a larger collection of user activities with the csh UNIX shell [Gre88, Gre93]. It contains more than twice as many users and approximately twice as many commands overall. This dataset was acquired after completing the tests on the first, but we will present the results of both datasets when appropriate.

3.3.1.3 Earlier Results

In previous work [DH97a, DH97b], we considered a number of simple and well-studied algorithms. In each of these, the learning problem was to examine the commands executed previously, and to predict the command to be executed next. We found that, without explicit domain knowledge, a naive method based on C4.5 [Qui93] was able to predict each command with a macroaverage accuracy of 38.5% (microaverage was 37.2%). Given a sequence of commands, C4.5 was trained to use the preceding two commands to predict the next. Thus for each prediction, C4.5 was trained on the series of examples of the form (Command_{i-2}, Command_{i-1}) ⇒ Command_i, for 1 ≤ i ≤ k, where k is the number of examples seen so far. Command_0 and Command_{-1} are both defined to have the special value BLANK to allow prediction of the first and second commands where otherwise there would be no previous two commands.
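
The following sketch shows how such training examples can be formed from a command history: each example pairs the two preceding commands with the command to be predicted, padding the start of the history with BLANK; the helper name and the sample history are illustrative only.

def make_examples(history, blank="BLANK"):
    padded = [blank, blank] + list(history)
    return [
        ((padded[i], padded[i + 1]), padded[i + 2])   # (Cmd_{i-2}, Cmd_{i-1}) => Cmd_i
        for i in range(len(history))
    ]

print(make_examples(["cd", "ls", "emacs", "exit"]))
# [(('BLANK', 'BLANK'), 'cd'), (('BLANK', 'cd'), 'ls'),
#  (('cd', 'ls'), 'emacs'), (('ls', 'emacs'), 'exit')]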

While the prediction method was a relatively straightforward application of a standard machine learning algorithm, it has a number of drawbacks, including that it returned only the single most likely command. C4.5 also has significant computational overhead. It can only generate new decision-trees; it does not incrementally update or improve the decision tree upon receiving new information. (While there are other decision-tree systems that can perform incremental updates [Utg89], they have not achieved the same levels of performance as C4.5.) Therefore, C4.5 decision tree generation must be performed outside of the command prediction loop.

Additionally, since C4.5 (like many other machine learning algorithms) is not incremental, it must revisit each past command situation, causing the decision-tree generation to require more time and computational resources as the number of commands

in the history grows. Finally, it treats each command instance equally; commands at the beginning of the history are just as important as commands that were recently executed. Note that C4.5 was selected as a common, well-studied decision-tree learner with excellent performance over a variety of problems, but not with any claim of superiority over other algorithms applicable to this domain.

These initial experiments dealt with some of these issues by only allowing the learning algorithm to consider the command history within some fixed window. This prevented the model generation time from growing without bound and from exceeding all available system memory. This workaround, however, caused the learning algorithms to forget relatively rare but consistently predictable situations (such as typographical errors) and restricted consideration only to recent commands.

3.3.2 An Ideal Online Learning Algorithm (IOLA)

By focusing on the application of adaptive interfaces, we can propose characteristics of an “ideal” learning algorithm for such problems. An Ideal Online Learning Algorithm (IOLA) and its realization would:

1. have predictive accuracy as good as the best known resource-unlimited methods;

2. operate incrementally (modifying an existing model rather than building a new one as new data are obtained);

3. be affected by all events (remembering uncommon, but useful, events regardless of how much time has passed);

4. not need to retain a copy of the user’s full history of actions;

5. output a list of predictions, sorted by confidence;

6. adapt to changes to the target concept;

7. be fast enough for interactive use;

8. learn by passively watching the user; and

(a) Before update:

      A1     A2     A3     A4     A5
A1   1/10   4/10   2/10   1/10   2/10

(b) After update:

      A1     A2     A3     A4     A5
A1   1/11   4/11   3/11   1/11   2/11

Figure 3.2: Standard (Bayesian) incremental update: one row of a 5x5 state transition table, before and after updating the table as a result of seeing action A3 follow A1.

9. apply even in the absence of domain knowledge.

Such an algorithm would be ideally suited for incorporation into many systems that interact with users.

3.4 Incremental Probabilistic Action Modeling (IPAM)

3.4.1 The algorithm

In our previous work, we implicitly assumed that the patterns of use would form multi-command chains of actions, and accordingly built algorithms to recognize such patterns. If, however, we make the simpler Markov assumption that each command depends only on the previous command (i.e., patterns of length two, so that the previous command is the state), we can use the history data collected to count the number of times each command followed each other command and thus calculate the probability of a future command. This could be implemented by the simple structure of an n by n table showing the likelihood of going from one command to the next, as shown in Figure 2.2a.

For the anticipated use of action prediction in an adaptive interface, however, an incremental method is desirable. Instead of just using a static table of probabilities calculated on some training data, the table can be updated after every new instance (as shown by the portions of a state transition table before and after an update in Figure 3.2). From the recorded counts, the fractions can be interpreted as probabilities for the various state transitions. However, as mentioned earlier, we believe it is useful to weigh recent events more highly when building an adaptive model. This can be accomplished in this probabilistic model by the use of an update function with an exponential decay (in which the most recent occurrence has full impact; older occurrences have ever-decreasing contributions). Given the previous table of probabilities and another table containing probabilities from new data points, a combined new table may be computed by the weighted average of the two, where the weights sum to 1. So, for example, if the weights were both .5, the new probabilities would have equal contributions from the old table and from the new. Assuming that the table updates were performed periodically, the data points making up the first table would be contributing only 1/2^n of the final weights (where n is the number of table updates so far).

(a) Before IPAM update:

      A1     A2     A3     A4     A5
A1    .1     .4     .2     .1     .2

(b) After IPAM update:

      A1     A2     A3     A4     A5
A1   .08    .32    .36    .08    .16

Figure 3.3: IPAM incremental update: one row of a 5x5 state transition table, before and after updating the table using IPAM as a result of seeing action A3 follow A1. The value of alpha used was .8.

We can extend this model further, to an algorithm that starts with an empty table and updates after every command. An empty table is one in which all commands are equally likely (initially a uniform probability distribution). After seeing a command c_i for the first time, a new row is added for that command, and has a uniform distribution.

When the second command, c_{i+1}, is seen, it too gets a new row with a uniform distribution, but we update the first row (since we saw c_i followed by c_{i+1}) by multiplying all elements of that row by a constant 0 ≤ alpha ≤ 1, and the probability of seeing c_{i+1} is increased by adding (1 − alpha). In this way, we emphasize more recent commands, at the expense of older actions, but the sum of the probabilities in each row is always 1. An example can be found in Figure 3.3.

Note that an alpha of 0 equates to a learner that always predicts what it most recently saw for that command, and an alpha of 1 corresponds to an algorithm that never changes its probabilities (in this case, preserving a uniform distribution).

(a) Before IPAM update:

            A1      A2      A3
Default    .712    .128    .16
A1         .8      .2      0

(b) After IPAM update:

            A1      A2      A3      A4
Default    .5696   .1024   .128    .2
A1         .64     .16     0       .2
A4         .5696   .1024   .128    .2

Figure 3.4: Selected rows of a 5x5 state transition table, before and after updating the table as a result of seeing action A4 for the first time.

Update(PreviousCommand, CurrentCmd):
  - Call UpdateRow(DefaultRow, CurrentCmd)
  - Call UpdateRow(PreviousCommand's row, CurrentCmd)

UpdateRow(ThisRow, CurrentCmd):
  - If initial update for ThisRow, copy distribution from default row
  - Multiply probability in each column by alpha and add (1-alpha) to the
    column that corresponds to CurrentCmd

(a) The update function.

Predict(NumCmds, PreviousCmd):
  - Call SelectTopN(NumCmds, PreviousCmd's row, {})
  - Let P be the number of commands returned
  - If P < NumCmds, call SelectTopN again, but ask for the top (NumCmds - P)
    commands from the default row and exclude those commands already returned
  - Return the first set followed by the default set of predicted commands

SelectTopN(NumCmds, Row, ExcludeCmds):
  - Sort the probabilities in Row
  - Eliminate commands in ExcludeCmds
  - Return the top NumCmds from sorted list

(b) The predict function.

Figure 3.5: Pseudo-code for IPAM.

For prediction, the command with highest probability for the given state would be output as the most likely next command. Instead of making no prediction for a command with an empty row, we can track probabilities in an additional default row, which would use the same mechanism for updating but would apply to all commands seen so far (without consideration of the preceding command). Finally, since we are keeping track of overall likelihoods in this default row, we can use it to initialize rows for new commands (making the assumption that these default statistics are better than a uniform distribution). Figure 3.4 illustrates the process of adding a new row when an action is first seen.

Figure 3.6: For a range of alpha values, the predictive accuracy of the Incremental Probabilistic Action Modeling algorithm on the Rutgers data is shown.

See Figure 3.5 for pseudo-code for the Update and Predict functions that implement this Incremental Probabilistic Action Modeling (IPAM) algorithm.
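
As a compact illustration of how those functions fit together, the following Python sketch is our own rendering of the pseudo-code in Figure 3.5, not the implementation used in the experiments; it keeps sparse dictionary rows (handling unseen commands through the default-row fallback rather than an explicit uniform prior), and alpha defaults to the .80 chosen in Section 3.4.2.

class IPAM:
    def __init__(self, alpha=0.8):
        self.alpha = alpha
        self.default = {}            # overall next-command probabilities
        self.rows = {}               # previous command -> probability row

    def _update_row(self, row, cmd):
        for c in row:                # decay every existing probability
            row[c] *= self.alpha
        row[cmd] = row.get(cmd, 0.0) + (1.0 - self.alpha)

    def update(self, previous_cmd, current_cmd):
        self._update_row(self.default, current_cmd)
        if previous_cmd is not None:
            # A new row is initialized from the default distribution.
            row = self.rows.setdefault(previous_cmd, dict(self.default))
            self._update_row(row, current_cmd)

    def predict(self, n, previous_cmd):
        row = self.rows.get(previous_cmd, {})
        ranked = sorted(row, key=row.get, reverse=True)[:n]
        if len(ranked) < n:          # fall back to the default row, as in Predict
            fallback = sorted(self.default, key=self.default.get, reverse=True)
            ranked += [c for c in fallback if c not in ranked][: n - len(ranked)]
        return ranked

model = IPAM()
prev = None
for cmd in ["cd", "ls", "emacs", "exit", "cd", "ls"]:
    model.update(prev, cmd)
    prev = cmd
print(model.predict(3, "cd"))        # 'ls' should top the list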

3.4.2 Determining alpha

We empirically determined the best average alpha by computing the performance for this algorithm on the Rutgers dataset with each of seven values of alpha (from .65 to .95 in increments of .05). While the best value of alpha varied, depending on how performance was calculated over the dataset, our subjective choice for the best overall was .80. (See Figure 3.6 for a graph of the parameter study of alpha showing the average user’s performance as well as the average performance over all commands.) We will use this value of alpha for the rest of the experiments in this chapter. Since alpha controls the amount of influence recent commands have over earlier commands, we expect that this value will vary by problem domain. To ensure fair evaluation of our approach with this parameter, we do not recalculate a best alpha for the Greenberg data, but instead re-use the same value.

Figure 3.7: Macroaverage (per user) predictive accuracy for a variety of algorithms on the Rutgers dataset.

Figure 3.8: Average per user accuracies of the top-n predictions ((a) Rutgers dataset; (b) Greenberg dataset). The likelihood of including the correct command grows as the number of suggested commands increases.

3.5 Evaluation

This algorithm, when applied to the Rutgers data set, performs better than C4.5 (given an alpha of .80). It achieves a predictive accuracy of 39.9% (macroaverage) and 38.5% (microaverage) versus C4.5’s 38.5% and 37.2% (macroaverage and microaverage, respectively) for best guess predictions (see the bars labeled C4.5 and IPAM in Figure 3.7). For comparison, we also show our method without the specialized update, which corresponds to naive Bayes (that is, a predictor in which the conditional probabilities to select the most likely next command are based strictly on the frequency of pairs of commands), as well as a straightforward most recent command predictor (labeled MRC) that always suggests the command that was just executed.

Figure 3.9: Command completion accuracies ((a) Rutgers dataset; (b) Greenberg dataset). The likelihood of predicting the correct command grows as the number of initial characters given increases.

To be precise, over the 77 Rutgers users, IPAM beat the C4.5-based system sixty times, tied once, and lost sixteen times on the task of predicting the next command. At the 99% confidence level, the average difference between their scores was 1.42 ± 1.08 percentage points, showing that the improvement in predictive accuracy for IPAM over C4.5 is statistically significant, given this ideal value for alpha. While in practice the improvement in accuracy might not be noticeable, it is important to note the simplicity (as well as other IOLA characteristics) of IPAM in comparison to C4.5.

IPAM keeps a table in memory of size O(k^2), where k is the number of distinct commands. Predictions can be performed in constant time (when a list of next commands is kept sorted by probability), with updates requiring O(k) time.

Some applications of this method may be able to utilize multiple predictions for a given situation. For example, an adaptive interface might present for easy selection the top five most likely choices that a user would make. Thus, such an interface would want the five predictions that are most likely to be correct. Systems that can generate top-n predictions are also valuable when the predictions are to be combined with other information with predictive value. Since IPAM generates a list of commands with associated probabilities for prediction, we can also compute the average accuracy of the top-n commands for varying values of n (as compared to only the single most likely command). Figures 3.8a and b show that we do get increased performance and that for n = 5, the correct command will be listed almost 75% of the time. This makes it possible for an interface designer to consider the trade-off of increased likelihood of listing the correct command versus the increased cognitive load of an interface showing multiple suggestions.

In UNIX command prediction, it is also helpful to be able to perform command completion (that is, taking the first k characters typed and producing the most likely command that is prefixed by those characters). Such a mechanism would enable shells that perform completion when there is a unique command with that prefix (such as tcsh) to also be able to perform completion when there are multiple possibilities. Figures 3.9a and b measure the predictive accuracy when given 0-3 initial characters to match when applied to all of the data. (Note that command completion when given 0 initial characters is just command prediction.) As expected, accuracy improves as additional characters of the command are provided to the system. For this task, examining just the first character almost doubles predictive performance (from 40% to close to 80% when predicting the single most likely command).
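
A small sketch of how such completion can be layered on top of any ranked predictor: keep only those predicted commands that begin with the characters typed so far. The ranked list below is invented rather than taken from the experiments.

def complete(ranked_predictions, typed_prefix, n=1):
    """Return up to n predicted commands consistent with the typed prefix."""
    matches = [c for c in ranked_predictions if c.startswith(typed_prefix)]
    return matches[:n]

ranked = ["ls", "emacs", "exit", "cd", "elm"]   # hypothetical predictor output
print(complete(ranked, "e", n=2))               # ['emacs', 'exit']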

3.6 Discussion

While not shown, the results described apply to both macroaverage performance (shown in most figures) and microaverage performance, although the former is almost always slightly higher. While the results on the first dataset (collected from users of tcsh) can be argued as showing the potential for this method (since the selection of alpha was based on the same set), the performance on the larger, and arguably more representative, Greenberg dataset (collected almost ten years earlier from users of csh) demonstrates a more believable performance.

Figure 3.10: Cumulative performance (for n=3) over time for a typical user ((a) IPAM; (b) Bayes).

Although learning may be performed throughout the history of a user’s actions, the cumulative accuracy of a user does not vary much after an initial training period. Figures 3.10a and b show the performance of the IPAM and Bayes algorithms, respectively, over the history of one user. The solid line depicts the current overall average predictive accuracy (from the first command to the current command), while the dashed line shows the variations in predictive accuracy when measured over a window containing only the most recent 30 commands.

We might consider initializing the table in IPAM with useful values rather than starting from scratch. For example, is the performance improved if we start with a table averaged over all other users? This lets us examine cross-user training to leverage the experience of others. Unfortunately, preliminary experiments indicate that, at least for this dataset, starting with the average of all other users’ command prediction tables does not improve predictive accuracy. This result matches with those of Greenberg [Gre93] and Lee [Lee92], who found that individual users were not as well served by systems tuned for best average performance over a group of users.

The goal of our work has been to discover the performance possible without domain knowledge. This can then be used as a baseline for comparison against ‘strong methods’, or as a base upon which a system with domain-specific knowledge might be built. IPAM’s implementation, in addition, was guided by the characteristics of an IOLA, and thus offers benefits beyond predictive performance, such as its limited use of resources.

3.7 Related Work

The problem of learning to predict a user’s next action is related to work in a number of areas. There are many similarities in the problems studied in plan and goal recognition in the user modeling community (e.g., [Bau96, LE95, Les97]). Such work attempts to model users in terms of plans and goals specific to the given task domain. Efforts in this area typically require that the modeling system know the set of goals and plans in advance. This usually requires a significant human investment in acquiring and representing this domain knowledge. In contrast, our goal is to explore the potential for action prediction in a knowledge-sparse environment (i.e., where user plans are not known or cannot easily be developed).

A smaller number of researchers have, instead, studied methods that have similar goals of generality. Yoshida and Motoda [MY97, YM96, Yos94] apply specially developed machine learning techniques to perform command prediction. This lets them implement speculative execution, and they report fairly high predictive accuracy (albeit on a small amount of real user data). However, much of the success of their work comes from knowing a fair amount about each action a user takes, by using an extension to the operating system that lets them record the I/O accesses (e.g., reading files) that each command performs.

Greenberg [Gre93] and Lee [Lee92] have studied patterns of usage in the UNIX domain, focusing on simple patterns of repetitions. They found that the recurrence rate (the likelihood of repeating something) was high for command lines as well as for commands themselves, but that individual usage patterns varied. More recently, Tauscher and Greenberg [TG97] extended Greenberg’s recurrence analysis to URL revisitation in World Wide Web browsers. These efforts consider only some simple methods for offering the top-n possibilities for easy selection (such as the most recent n commands).

Stronger methods (including a genetic algorithm-based classifier system) were attempted by Andrews in his master’s thesis [And97] to predict user commands, but he had only a small sample of users with fairly short histories (≤ 500) in a batch framework, so the implications are unclear. In a conference paper, Debevc, Meyer, and Svecko [DMS97] report on an application called the Adaptive Short List for presenting a list of potential URLs as an addition to a Web browser. This method goes beyond recency-based selection methods, and instead tracks a priority for each URL which is updated after each visitation. The priority for a particular URL is computed essentially as the normalized averages of: a count of document usage, relative frequency of document usage, and the count of highest sequential document usage. This contrasts with our approach most significantly in that it does not consider relationships between documents and thus patterns of usage. Therefore, their Adaptive Short List computes a simplistic ‘most likely’ set of documents without regard to historical context.

A number of researchers have studied the use of machine learning in developing intelligent interfaces incorporating action prediction. For example, WebWatcher [JFM97] predicts which links on a World Wide Web page a user will select, and Hirashima et al. [HMT98] present a method for context-sensitive filtering, but both systems rely on the precise nature of the artifacts being manipulated (namely decomposable pages of text). Predictive methods in automated form completion [SW96, HS93] are similarly tailored to the specifics of the application. Similarly, the Reactive Keyboard [DWJ90] uses simple history-matching methods for prediction, but at the detailed level of keypresses, allowing for easy word or phrase completion as a document is being typed.

Programming by demonstration [Cyp93b, NM93] also has some similarities to this work. For example, Cypher’s Eager [Cyp93a] can automate explicitly marked loops in user actions in a graphical interface. They, too, are concerned with performance when integrated into a user interface. While our approach is not designed to notice arithmetic progressions in loops, we can find and use the patterns in usage that do not recur as explicit loops and do not require special training by the user. Masui [MN94] also learns repeated user patterns, requiring the user to hit the ‘repeat’ button when the system should learn or execute a macro.


Sequence prediction is also strongly related to data compression [BCW90], since an algorithm that can predict the next item in a sequence well can also be used to compress the data stream. Indeed, many of the approaches we describe could be used in this fashion, precisely because they apply when only the user’s history is available. However, we differ in that success in compression would only be an interesting side effect, not something we explicitly target with our methods. Perhaps even more importantly, we target methods into which additional information sources can be easily injected. Our methods are also designed to be responsive to concept drift, since we make no assumptions about the stability of a user’s actions over time — something that tends to reduce the usefulness of dictionary or grammar-based compression schemes [NM96]. Laird and Saul [LS94] present the TDAG algorithm for discrete sequence prediction, and apply it to a number of problems, including text compression, dynamic program optimization, and predictive caching. TDAG is based on Markov trees, but limits the growth of storage by discarding the less likely prediction contexts. It is a fast online algorithm, but it, too, does not explicitly consider the problem of concept drift — each point in a sequence has essentially the same weight as any other point.

While more distantly related, work in anomaly detection for computer systems [Kum95, Lun90] develops ways to quantify a user’s normal behavior so that unusual activity can be flagged as a potential intrusion. Most intrusion detection systems can be categorized as using either statistical-anomaly detection or rule-based detection. While rule-based expert systems monitor for known attack patterns and thus trigger few false alarms, they are also criticized as encouraging the development of ad hoc rules [ESNP96] and require significant human engineering effort to develop. In contrast, statistical systems traditionally build profiles of normal user behavior and then search for unusual sequences of events for consideration [DS98, FP99, Lan00]. Unlike most systems that perform anomaly detection by off-line audit-trail processing, our method works online, incrementally updating users’ profiles as additional data arrives, and could be augmented to provide user recognition.

Finally, IPAM’s success has fostered work by others. Jacobs and Blockeel [JB01] use Inductive Logic Programming to capture regularities to help build shell scripts. Zukerman et al. [ZAN99] consider user models for plan recognition that predict users’ immediate activities and their eventual goals in games and on the Web. Gorniak and Poole [GP00b, GP00a] consider the problem of automatically building models for action prediction without modifying the application. Ruvini and Dony [RD99] claim near-IOLA status for their approach to programming by demonstration. Lastly, Korvemaker and Greiner [KG00] explicitly attempt to improve upon IPAM’s results in predicting full UNIX command lines, but find little or no improvement possible.

3.8 Summary

This chapter has introduced a method for action prediction that fulfills the requirements of an Ideal Online Learning Algorithm. Incremental Probabilistic Action Modeling has an average predictive accuracy for UNIX command prediction at least as good as that previously reported with C4.5. It operates incrementally, will remember rare events such as typos, and does not retain a copy of the user’s action history. IPAM can generate top-n predictions, and by weighing recent events more heavily than older events it is able to react to ‘concept-drift’. Finally, its speed and simplicity make it a strong candidate for incorporation into the next adaptive interface.

By only using the previous command, IPAM provides a straightforward method for history-based prediction. In Chapter 4 we examine more complex prediction models and consider the application of such history-based models to the domain of interest — Web request prediction. Prediction solely from history has some limitations, however, and so in Chapters 5 and 6 we consider an alternative — the exploitation of content to assist in modeling user interests for prediction.

Chapter 4

Web Request Prediction

4.1 Introduction

Modeling user activities on the Web has value for both content providers and consumers. Consumers may appreciate better responsiveness as a result of precalculating and preloading content in advance of their requests. Additionally, consumers may find adaptive and personalized Web sites that can make suggestions and improve navigation useful. Likewise, the content provider will appreciate the insights that modeling can provide, and the potential financial implications of a happier consumer who gets the desired information even faster.

Sequence prediction (that is, predicting the next item in a sequence) is fundamental to prefetching, whether for memory references, disk accesses, or Web requests. Without accurate prediction, prefetching is not worthwhile. Therefore, it is useful to consider the various methods for prediction, even before embedding them within a prefetching system. For this reason we introduced the simple IPAM prediction method in Chapter 3, and examine the topic in more depth here. Later, in Chapters 8 and 9 we will embed these prediction mechanisms within a simulator to estimate user-perceived response times, rather than just predictive accuracy.

Sequence prediction in general is a well-studied problem, particularly within the data compression field (e.g., [BCW90, CKV93, WMB99]). Unfortunately, the Web community has re-discovered many of these techniques, leading to islands of similar work with dissimilar vocabulary. Here we will both re-examine these techniques and offer modifications motivated by the Web domain. This chapter will describe, implement, and experimentally evaluate a number of methods to model usage and predict Web requests. Our focus will be on Markov-based and Markov-like probabilistic techniques, both for their predictive power and for their popularity in Web modeling and other domains.


We are interested in applying prediction to various types of Web workloads – those seen by clients, proxies, and servers. Each location provides a different view of Web activities and the context in which they occur. As a result, different levels of performance will be possible.

This chapter will provide:

• an enumeration and discussion of the many aspects of how, what, and why to evaluate Web sequence prediction algorithms,

• a consistent description of various algorithms (often independently proposed) as just variations and simplifications of each other,

• the implementation of a prediction program sufficiently generalized to implement a variety of techniques,

• a comparison of multiple algorithms on a single set of traces to facilitate comparison,

• evidence of some “concept drift” within Web traces (similar to what we found in Chapter 3), and

• three Web domain-inspired improvements to existing algorithms.

In the next section, we will detail the many concerns and approaches we will take to the empirical evaluation of various Web request prediction algorithms. In Section 4.3 we describe the various algorithms and their extensions as we have implemented them. In Section 4.4 we describe the workloads used and present experimental results in Section 4.5. Finally, Sections 4.6 and 4.7 discuss and summarize the findings of this chapter.

4.2 Evaluation concerns and approaches

In this section we discuss the various aspects of how and what to evaluate when we compare Web request prediction algorithms. Two high-level concerns are addressed repeatedly: 1) whether to modify typical evaluation approaches to better fit the domain, and 2) whether to modify the predictions themselves to better fit the domain (or both).

4.2.1 Type of Web logs used

One important aspect of any experimental work is the data sets used in the experiments. While we will introduce the Web workloads that we will use in Section 4.4, the type of workload is itself an evaluation concern. At a high level, we are simply concerned with methods that learn models of typical Web usage. However, at a lower level, those models are often simply identifying co-occurrences among resources — with the ultimate goal of making accurate predictions for resources that might be requested given a particular context.

However, there are multiple types of relationships between Web resources that might cause recognizable co-occurrences in Web logs [Bes96, CJ97, Cun97]. One possible relationship is that of an embedded object and its referring page — an object, such as an image, audio, or applet that is automatically retrieved by the browser when the referring page is rendered. Another relationship is that of traversal — when a user clicks on a link from the referring page to another page. The first (embedding) is solely an aspect of how the content was prepared. The second (traversal), while likewise existing because of a link placed by the content creator, is also a function of how users navigate through the Web hypertext.

Many researchers (see, for example, [JK97, JK98, NZA98, Pal98, ZAN99, SYLZ00, SH02, YZL01]) distinguish between such relationships and choose not to make predictions for embedded resources. They are concerned with “click-stream” analysis — just the sequence of requests made by the user, and not the set of automatically requested additional resources. Sometimes the data can provide the distinction for us — a Web server often records the referrer in its access logs, which captures the traversal relationship.

But the HTTP referrer header is not a required field, and is thus not always available. In other cases, analysis of the data is required to label the kind of request — for example, requests for embedded links are usually highly concentrated in time near the referring page. Unfortunately, many of the publicly available access logs do not provide sufficient detail to allow us to make such distinctions authoritatively. In addition, methods that use click-stream data exclusively will require a more complex implementation, as they assume that the embedded resources will be prefetched automatically, which requires parsing and may miss resource retrievals that are not easily parsed (such as those that result from JavaScript or Java execution). As a result, in this chapter we have chosen to disregard the type of content and the question of whether it was an embedded resource or not, and instead use all logged requests as data for training and prediction. This approach is also taken by Bestavros et al. [Bes96, Bes95] and Fan et al. [FJCL99].

4.2.2 Per-user or per-request averaging

As described in Chapter 3, one can calculate average performance on a per-user (macroaverage) or per-request (microaverage) basis. Even though predictions are made on a per-user basis (i.e., on the basis of that user’s previous action, which is not necessarily the most recent action in the system), we do not always build per-user models. In our experiments, we will build single models (for the typical user) for Web servers and proxies, and only consider per-user models when making predictions at the client. Thus, for comparison, we will generally report only per-request averages.

4.2.3 User request sessions

A session is a period of sustained Web activity by a user. In most traces, users have request activities that could be broken into sessions. In fact, during the month of December 1999, the median number of sessions per user was four (where a session boundary is defined by two or more hours without Web access), and has been growing at a rate of 1.4% per month, according to data from 20,000 Internet users collected by Jupiter Media Metrix [MF01].

In practice, it may be helpful to mark session boundaries to learn when not to prefetch; alternately, it may be desirable to prefetch the first page that the user will request at the beginning of the next session. We have not analyzed the data sets for sessions — for the purposes of this chapter, each is treated as a set of per-user strings of tokens. Thus, even though a Web log contains interleaved requests by many clients, our algorithms will consider each prediction for a client solely in the context of the requests made by that same client. In addition, it does not matter how much time has passed since the previous request by the same client, nor what the actual request was. If the actual request made matches the predicted request, it is considered a success.

4.2.4 Batch versus online evaluation

The traditional approach to machine learning [Mit97] evaluation is the batch approach, in which data sets are separated into distinct training and test sets. The algorithm attempts to determine the appropriate concept by learning from the training set. The model is then used statically on the test set to evaluate its performance. This approach is used in a number of Web prediction papers (e.g., [ZAN99, AZN99, NZA98, Sar00, SYLZ00, YZL01, ZY01]). While we can certainly do the same (and will do so in one case for comparison), our normal approach will be to apply the predictive algorithms incrementally, in which each action serves to update the current user model and assist in making a prediction for the next action. This matches the approach taken in other Web prediction papers (e.g., [Pad95, PM96, FJCL99, Pal98, PM99]). This model is arguably more realistic in that it matches the expected implementation — a system that learns from the past to improve its predictions in the future. Similarly, under this approach we can test the code against all actions (not just a fraction assigned to be a test set), with the caveat that performance is likely to be poor initially before the user model has acquired much knowledge (although some use an initial warming phase in which evaluation is not performed to alleviate this effect).

When using the prediction system in a simulated or actual system, note that the predictions may cause the user’s actual or perceived behavior to change. In prefetching, the user may request the next document sooner if there was no delay in fetching the first (because it was preloaded into a cache).

Likewise, a proxy- or server-based model that sends hints about what to prefetch, or the content itself, to the browser will change the reference stream that comes out of the browser, since the contents of the cache will have changed. Thus, the predictive system may have to adjust itself to the change in activity that was caused by its own operation. Alternatively, with appropriate HTTP extensions, the browser could tell the server about requests served by the cache (as suggested in [FJCL99, Duc99]). In future chapters we will incorporate caching effects, and so here we limit ourselves to models that are built from the same requests as those used for evaluation. This model is appropriate for prefetching clients and proxies, as long as they don’t depend on server hints (built with additional data).

4.2.5 Selecting evaluation data

Even when using an online per-user predictive model, it will be impossible for a proxy to know when to “predict” the first request, since the client had not connected previously. Note that if the predictions are used for server-side optimization, then a generic prefetch of the most likely first request for any user may be helpful and feasible, and similarly for clients a generic prefetch of the likely first request can be helpful. Likewise, we cannot test predictions made after the user’s last request in our sample trace. Thus, the question of which data to use for evaluation arises. While we will track performance along many of these metrics, we will generally plot performance on the broadest metric — the number of correct predictions out of the total number of actions. This is potentially an underestimate of performance, as the first requests and the unique requests (that is, those requests that are never repeated) are counted, and may become less of a factor over time. If we were to break the data into per-user sessions, this would be a larger factor, as there would be more first and last requests that must be handled.

4.2.6 Confidence and support

In most real-world implementation scenarios, there is some cost for each prediction made. For example, the cost can be cognitive if the predictions generate increased cognitive load in a user interface.

Or the cost can be financial, as in prefetching when there is a cost per byte retrieved. Thus, we may wish to consider exactly when we wish to make a prediction — in other words, to not take a guess at every opportunity.

We consider two mechanisms to reduce or limit the likelihood of making a false prediction. They are:

• Thresholds on confidence. Confidence is loosely defined as the probability that the predicted action will occur, and is typically based on the fraction of times that the predicted action occurred in this context in the past. Our methods use probabilistic predictors, and thus each possible prediction has what can be considered an associated probability. By enforcing a minimum threshold, we can restrict the predictions to those that have a high expected probability of being correct. Thresholds of this type have been used previously (e.g., [Pad95, PM96, LBO99, DK01]).

• Thresholds on support. Support is strictly the number of times that the predicted action has occurred in this context. Even when a prediction probability (i.e., confidence) is high, that value could be based on only a small number of examples. By providing a minimum support, we can limit predictions to those that have had sufficient experience to warrant a good prediction [JK97, JK98, SKS98, LBO99, FJCL99, PP99b, SYLZ00, DK01, YZL01].1

In fielded systems, these two factors may be combined with some function. However, for our studies it is useful to examine them separately.

4.2.7 Calculating precision

Given the ability to place minimum thresholds on confidence and support, the system may choose not to make a prediction at all. Thus, in addition to overall accuracy (correct predictions / all actions), we will also calculate precision — the accuracy of the predictions when predictions are made. However, the exact selection of the denominator can be uncertain.

1 In fact, Su et al. [SYLZ00] go further, requiring thresholds not only on page popularity (i.e., support), but also ignoring sequences below some minimum length.

We note at least two choices: those actions for which a prediction was attempted, and those actions against which a prediction was compared. The former includes predictions for requests that were never received (i.e., made beyond the last request received from a client). Since we cannot judge such predictions, when reporting precision we will use the latter definition.
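To make the two denominators concrete, the following small sketch (the helper name is illustrative, not taken from the prediction system described in this chapter) counts overall accuracy against all actions and precision against only those actions for which a comparable prediction was made:

def accuracy_and_precision(events):
    # events: list of (prediction, actual) pairs, one per user action;
    # prediction is None when no prediction was made for that action.
    actions = len(events)
    predicted = sum(1 for p, a in events if p is not None)
    correct = sum(1 for p, a in events if p is not None and p == a)
    accuracy = correct / actions if actions else 0.0       # correct / all actions
    precision = correct / predicted if predicted else 0.0  # correct / compared predictions
    return accuracy, precision

print(accuracy_and_precision([("A", "A"), (None, "B"), ("C", "D"), ("B", "B")]))
# (0.5, 0.6666666666666666)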

4.2.8 Top-n predictions

All of the variations we have described above provide probabilities to determine a prediction. Typically there are multiple predictions possible, with varying confidence and support. Since it may be useful (and feasible) to prefetch more than one object simultaneously, we will explore the costs and benefits of various sets of predictions. As long as additional predictions still exceed minimum confidence and support values, we will generate them, up to some maximum prediction list length of n. The top-n predictions can all be used for prediction and prefetching, and, depending on the evaluation metric, it may not matter which one is successful, as long as one of the n predictions is chosen by the user.

In some systems (e.g., [FJCL99]), there are no direct limits on the number of predictions made. Instead, effective limits are achieved by thresholds on confidence or support, or by available transmission time when embedded in a prefetching system. In this chapter we do not directly limit the number of predictions, but instead consider various threshold values. Limits are typically needed to trade off resources expended against benefits gained, and that trade-off depends on the environment within which the predictions are being made. Later, in Chapters 8 and 9, we will embed this prediction system within a simulated environment and will then need to specifically limit the predictions (prefetches) performed because of resource constraints.
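As a rough sketch of how minimum confidence, minimum support, and a top-n limit can interact (function and parameter names are illustrative only):

def select_predictions(candidates, max_n=5, min_confidence=0.1, min_support=2):
    # candidates: list of (item, confidence, support) triples for the current context.
    eligible = [c for c in candidates
                if c[1] >= min_confidence and c[2] >= min_support]
    # Highest confidence first; ties broken by higher support.
    eligible.sort(key=lambda c: (c[1], c[2]), reverse=True)
    return [item for item, confidence, support in eligible[:max_n]]

# Only "b.html" and "c.gif" meet both thresholds here.
print(select_predictions([("a.html", 0.05, 9), ("b.html", 0.4, 3),
                          ("c.gif", 0.3, 5), ("d.html", 0.6, 1)]))
# ['b.html', 'c.gif']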

4.3 Prediction Techniques

n-grams

Typically the term sequence is used to describe an ordered set of actions (Web requests, in this case). Another name, from statistical natural language processing, for the same ordered set is an n-gram. Thus, an n-gram is a sequence of n items. For example, the ordered pair (A, B) is an example 2-gram (or bigram) in which A appeared first, followed by B.

For prediction, we would match the complete prefix of length n − 1 (i.e., the current context) to an n-gram, and use the n-gram to predict the last item contained in it. Since there may be multiple n-grams with the same prefix of n − 1 actions, a mechanism is needed to decide which n-gram should be used for prediction. Markov models provide that means, by tracking the likelihood of each n-gram, but using a slightly different perspective — that of a state space encoding the past. In this approach, we explicitly make the Markov assumption which says that the next action is a function strictly of the current state. In a k-step Markov model, then, each state represents the sequence of k previous actions (the context), and has probabilities on the transitions to each of the next possible states. Each state corresponds to the sequence of k = n − 1 actions in a particular n-gram. There are at most |a|^k states in such a system (where |a| is the number of possible actions).

Typically k is fixed and is often small in practice to limit the space cost of representing the states. In our system we allow k to be of arbitrary size, but in practice we will typically use relatively small values (primarily because long sequences are infrequently repeated on the Web).

Markov Trees

To implement various sequence prediction methods, we built a highly parameterized prediction system. We use a data structure that works for the more complex algorithms, and simply ignore some aspects of it for the simpler ones.

Figure 4.1: A sample node in a Markov tree.

In particular, we build a Markov tree [SSM87, LS94, FX00] — in which the transitions from the root node to its children represent the probabilities in a zero-th order Markov model, the transitions to their children correspond to a first order model, and so on. The tree itself thus stores sequences in the form of a trie — a data structure that stores elements in a tree, where the path from the root to the leaf is described by the key (the sequence, in this case). A description of an individual node is shown in Figure 4.1.

Figure 4.2 depicts an example Markov tree of depth two with the information we record after having two visitors with the given request sequences. The root node corresponds to a sequence without context (that is, when nothing has come before). Thus, a zero-th order Markov model interpretation would find the naive probability of an item in the sequence to be .3, .3, .3, and .1, for items A, B, C, and D, respectively. Given a context of B, the probability of a C following in this model is 1. These probabilities are calculated from the node’s SelfCount divided by the parent’s ChildCounts.

To make this process clear, we provide pseudocode to build a Markov tree in Figure 4.3, and Figure 4.4 illustrates one step in that process. Given the sequence (A, B, A), we describe the steps taken to update the tree in 4.4a to get the tree in 4.4b. All of the suffixes of this sequence will be used, starting with the empty sequence. Given a sequence of length zero, we go to the root and increment its SelfCount. Given next the sequence of length one, we start at the root and update its ChildCounts, and since there is already a child A, update that node’s SelfCount. Given next the sequence of length two, we start again from the root, travel to child B and update its ChildCounts, and find that we need to add a new child. Thus we also update B’s NumChildren as we add the new node, and increment the new node’s SelfCount. Assuming we are limiting this tree to depth two, we have finished the process, and have (b).

Sequences from two different sessions: (A, B, C, D, A, B, C) followed by (A, B, C).

Figure 4.2: A sample trace and simple Markov tree of depth two built from it.

Given a sequence s, a MaxTreeDepth, and an initial tree (possibly just a root node) t:
  for i from 0 to min(|s|, MaxTreeDepth)
    let ss be the subsequence containing the last i items from s
    let p be a pointer to t
    if |ss| = 0
      increment p.SelfCount
    else
      for j from first(ss) to last(ss)
        increment p.ChildCounts
        if not exists-child(p, j)
          increment p.NumChildren
          add a new node for j to the list of p’s children
        end if
        let p point to child j
        if j = last(ss)
          increment p.SelfCount
        end if
      end for
    end if
  end for

Figure 4.3: Pseudocode to build a Markov tree.

(a) Markov tree from sequence (A, B). (b) Markov tree from sequence (A, B, A).

Figure 4.4: Before and after incrementally updating a simple Markov tree.

In effect, we have built essentially what Laird and Saul [Lai92, LS94] call a TDAG (Transition Directed Acyclic Graph). Like them, we limit the depth of the tree explicitly and limit the number of predictions we make at once. In contrast, they additionally have a mechanism to limit the expansion of a tree by eliminating nodes in the graph that are rarely visited.
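For concreteness, a minimal Python transcription of the update procedure of Figure 4.3 might look as follows; the class and identifier names are illustrative, not those of the implemented system, and the driving loop assumes (as in the Figure 4.4 example) that each new request triggers an update over the recent history:

class Node:
    # A Markov-tree node, roughly following Figure 4.1.
    def __init__(self):
        self.self_count = 0    # times the sequence ending at this node was observed
        self.child_counts = 0  # times a descent from this node continued to a child
        self.children = {}     # action -> Node (NumChildren is len(self.children))

def update_tree(root, s, max_tree_depth):
    # Record every suffix of s, up to max_tree_depth items long (Figure 4.3).
    for i in range(min(len(s), max_tree_depth) + 1):
        ss = s[len(s) - i:] if i > 0 else []
        p = root
        if not ss:
            p.self_count += 1
            continue
        for pos, action in enumerate(ss):
            p.child_counts += 1
            if action not in p.children:
                p.children[action] = Node()
            p = p.children[action]
            if pos == len(ss) - 1:
                p.self_count += 1

# Each arriving request triggers an update over the recent history,
# as in the (A, B) -> (A, B, A) example of Figure 4.4.
root, depth, history = Node(), 2, []
for request in ["A", "B", "C", "D", "A", "B", "C"]:
    history.append(request)
    update_tree(root, history[-(depth + 1):], depth)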

Path and point profiles

Thus, after building a Markov tree of sufficient depth, we can use it to match sequences for prediction. Schechter et al. [SKS98] describe a predictive approach which is effectively a Markov tree similar to what we have described above. The authors call this approach using path profiles, a name borrowed from techniques used in compiler optimization, as contrasted with point profiles, which are simply bigrams (first order Markov models). The longest path profile matching the current context is used for prediction, with frequency of occurrence used to select among equal-length profiles.

k-th order Markov models

Our Markov tree can in general be used to find k-th order Markov probabilities by traversing the tree from the root in the order of the k items in the current context. Many researchers (e.g., [Pad95, PM96, Bes96, Bes95, JK97, JK98, NZA98, LBO99, Duc99, SR00, FS00, ZY01]) have used models equivalent to first order Markov models (corresponding to trees like that depicted in Figure 4.2) for Web request prediction, but such models are also used in many other domains (e.g., the UNIX command prediction of Chapter 3 and hardware-based memory address prefetching [JG99]). Others have found that second order Markov models give better predictive accuracy [SH02, ZAN99], and some have used even higher order models (e.g., fourth-order [PP99a]).

During the use of a non-trivial Markov model of a particular order, it is likely that there will be instances in which the current context is not found in the model. Examples of this include a context shorter than the order of the model, or contexts that have introduced a new item into the known alphabet (that is, the set of actions seen so far). Earlier we mentioned the use of the longest matching sequence (path profile) for prediction. The same approach can be taken with Markov models. Given enough data, Markov models of high order typically provide high accuracy, and so using the largest model with a matching context is commonly the approach taken. Both Pitkow and Pirolli [PP99b] and Deshpande and Karypis [DK01] take this route, but also consider variations that prune the tree to reduce the space and time needed for prediction (e.g., to implement thresholds on confidence and support, testing on a validation set, and minimum differences in confidence between the first and second most likely predictions). Our code allows this approach, and in general provides for a minimum and maximum n-gram length for prediction. Su et al. [SYLZ00] also combine multiple higher-order n-gram models in a similar manner. Interestingly, Li et al. [LYW01] argue in contrast that the longest match is not always the best, and provide a pessimistic selection method (based on Quinlan’s pessimistic error estimate [Qui93]) to choose the context with the highest pessimistic confidence of all applicable contexts, regardless of context length. They show that this approach improves precision as the context gets larger.
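Continuing the sketch above (and assuming its Node class, update_tree function, and the root/history variables), prediction by longest matching context can be outlined as below; ranking candidates by each child's SelfCount, normalized over its siblings, is one reasonable reading of the description here, though the implemented system may differ in details:

def predict_longest_match(root, context, max_order, top_n=1):
    # Try the longest matching suffix of the context first, then back off.
    for k in range(min(len(context), max_order), -1, -1):
        node = root
        matched = True
        for action in (context[len(context) - k:] if k > 0 else []):
            if action not in node.children:
                matched = False
                break
            node = node.children[action]
        if matched and node.children:
            total = sum(child.self_count for child in node.children.values())
            ranked = sorted(node.children.items(),
                            key=lambda item: item[1].self_count, reverse=True)
            return [(action, child.self_count / total)
                    for action, child in ranked[:top_n]]
    return []   # nothing matches (e.g., an empty tree)

print(predict_longest_match(root, history, max_order=2, top_n=2))
# [('D', 1.0)] -- in the sample stream, the only request ever seen after C is D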

PPM

There are, of course, other ways to incorporate context into the prediction model. PPM, or prediction by partial matching [BCW90, WMB99], is typically used as a powerful method for data compression. It works similarly to the simple Markov model-based approach above, using the largest permissible context to encode the probabilities of the next item. However, it also tracks the probability that the next item will be something that has never been seen before in this context (called the “escape” symbol when used for compression), which explicitly says to use the next smaller matching context instead. There are multiple versions of PPM that correspond to different ways of calculating the escape probability: PPM-A, PPM-C, and PPM-D, as well as others that we do not consider. Since the next item in the sequence could be (and is) given some probability at each of the levels below the longest matching context, we have to examine all matching contexts to sum the probabilities for each candidate prediction (appropriately weighted by the preceding level’s escape probability). We do this by merging the current set of predictions with those from the shorter context, multiplying the probabilities from the shorter context by the escape symbol confidence (e) in the longer context, and multiplying those in the longer context by (1 − e). We can calculate the various escape probabilities using the counts stored in our Markov tree. A few researchers have used PPM-based models for Web prediction (e.g., [FJCL99, Pal98, PM99]). Fan et al. actually go further, building Markov trees with larger contexts. Instead of using large contexts (e.g., order n) for prediction, they use a context smaller than n (say, m), and use the remaining portion of the tree to make predictions of requests up to n − m steps in advance.
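The merging step described above can be sketched as follows; the escape probability itself is computed differently by PPM-A, PPM-C, and PPM-D, so here it is simply passed in as a parameter (an illustration, not the implemented code):

def blend(longer, shorter, escape):
    # longer and shorter map candidate -> probability for two context lengths;
    # escape is the probability, in the longer context, of something novel.
    merged = {item: p * (1.0 - escape) for item, p in longer.items()}
    for item, p in shorter.items():
        merged[item] = merged.get(item, 0.0) + p * escape
    return merged

# Starting from the shortest context and working outward, each level is
# folded in with its own escape probability.
order2 = {"C": 0.9, "D": 0.1}            # predictions given the two-item context
order1 = {"C": 0.5, "B": 0.3, "D": 0.2}  # predictions given the one-item context
print(blend(order2, order1, escape=0.2))  # C ~ 0.82, D ~ 0.12, B ~ 0.06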

Recency

As we discussed in Chapter 3, the patterns of activity of a user can change over time. The same is true of Web users. As the user’s interests change, the activity recorded may also change — there is no need to revisit a site if the user remembers the content, or if the user now has a different concern. Thus, the users themselves provide a source of change over time in the patterns found in usage logs.

There is a second source of change that may be even more significant — changes in content. A user’s actions (e.g., visiting a page) depend upon the content of those pages. If pages change content or links, or are removed, then the request stream generated by the user will also change. Thus we speculate that, as long as Web access is primarily an information-gathering process, server-side changes will drive changes in user access patterns.

We can attempt to quantify the changes found in some Web logs by using a variant of the IPAM algorithm described in Chapter 3. Recall that it used an alpha parameter to emphasize recent activity. However, for processing efficiency, we cannot update all transition probabilities for a given state on each visit. Instead, here we use a simplification: before incrementing the count for a new action, on every 1/(1 − alpha) visits we halve the counts of all actions that have been recorded from the previous action. So, for example, an alpha of .99 would halve the counts on every 100th visit. Thus we still maintain the property of emphasizing recent events, and in addition we can bound the maximum count value by 2/(1 − alpha). This is helpful as it allows us to limit the storage needed to represent the values. Other approaches have been used to emphasize recent activity. Duchamp [Duc99], for example, keeps track of the next page requested by the most recent fifty visitors to a page, thus limiting the effect that a no-longer-popular page sequence can have, but unrealistically requiring cooperation from browsers to self-report link selections.
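A minimal sketch of the halving scheme described above, for the counts kept for a single context (the names and driving loop are illustrative only):

def update_with_recency(counts, next_action, visits, alpha=0.99):
    # counts: per-context counts of next actions (i.e., for one previous action).
    # Every 1/(1 - alpha) visits, halve all counts before recording the new one.
    period = int(round(1.0 / (1.0 - alpha)))   # alpha = .99 -> every 100th visit
    visits += 1
    if visits % period == 0:
        for action in counts:
            counts[action] /= 2.0
    counts[next_action] = counts.get(next_action, 0.0) + 1.0
    return visits

counts, visits = {}, 0
for action in ["A"] * 150 + ["B"] * 50:
    visits = update_with_recency(counts, action, visits)
print(counts)   # the older "A" evidence was halved at visits 100 and 200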

Additional parameters

In addition to the varieties described above, the prediction system can also vary a number of other parameters that are motivated by the Web domain:

• Increasing the prediction window (that is, the window of actions against which the prediction is tested). Typically, evaluation is performed by measuring the accuracy of predicting the next request. Instead of only predicting the very next request, we can measure the accuracy when the prediction can match any of the next n requests. This may be useful in matching the utility of preloading a cache, as the resource may be useful later.

• De-emphasizing objects likely to be already cached. Since the browser has a cache, it is unlikely to make requests for recently requested resources. Thus, the system can reduce the likelihood of predicting recently requested resources.

• Retaining past predictions. Likewise, we know that users may go back to pages in their cache, and click on a different item. Our system can retain predictions made in the past (that is, a set of possible next requests and their probabilities) and combine them with current predictions to get (hopefully) a more accurate set of predictions.

• Considering inexact sequence matching. Web request sequences are potentially noisy. In particular, users may change the order in which some pages are visited, or skip some resources entirely and still have a successful experience. Thus, our system has been expanded to be able to consider some kinds of inexact sequence matching for building a prediction model.

More detail on these approaches will be provided when we describe experiments using them.

Finally, we should note that while we would like to find or build an IOLA as we described in Chapter 3, for the purposes of these tests we will focus on potential predictive performance (e.g., accuracy) and ignore certain aspects of implementation efficiency. In particular, our codes are designed for generality, and not necessarily for efficiency (especially in terms of space; to make more experiments tractable, we have had to pay some attention to time).

4.4 Prediction Workloads

There are three primary types of Web workloads, corresponding to three viewpoints on the traffic. Here we characterize each of the three and describe the datasets of each type that we will use. A summary of the datasets used can be found in Table 4.1.

Trace name       Described in      Requests   Clients   Servers   Time
EPA-HTTP         [Bot95]              47748      2333         1   1 day
Music Machines   [PE98, PE00]        530873     46816         1   2 months
SSDC             [DL00]              187774      9532         1   8 months
UCB-12days       [Gri97, GB97]        5.95M      7726     41836   12 days
UCB-20-clients   Section 4.4.2       163677        20      3755   12 days
UCB-20-servers   Section 4.4.3       746711      6620        20   12 days

Table 4.1: Traces used for prediction and their characteristics.

4.4.1 Proxy

Typically sitting between a set of users and all Web servers, a proxy sees a more limited user base than origin Web servers, but in some cases a proxy may group together users with some overlapping interests (e.g., users of a workgroup or corporate proxy may be more likely to view the same content). It typically records all requests not served by browser caches, but logs may contain overlapping user requests from other proxies or from different users that are assigned the same IP address at different times.

The models built by proxies can vary — from the typical user in a single model to highly personalized models for each individual user. In any case, the predictive model can be used to prefetch directly into its cache, or to provide hints to a client cache for prefetching.

We will use one proxy trace in our experiments. The UC Berkeley Home IP HTTP Traces [Gri97] are a record of Web traffic collected by Steve Gribble as a graduate student in November 1996. Gribble used a snooping proxy to record traffic generated by the UC Berkeley Home IP dialup and wireless users (2.4Kbps, 14.4Kbps, and 28.8Kbps land-line modems, and 20-30Kbps bandwidth for the wireless modems). This is a large trace, from which we have selected the first 12 out of 18 days (for comparison with Fan et al. [FJCL99]), for a total of close to six million requests.

4.4.2 Client

The client is one of the two necessary participants in a Web transaction. A few researchers have recorded transactions from within the client browser [CP95, CBC95, CB97, TG97], making it possible to see exactly what the user does — clicking on links, typing URLs, and using navigational aids such as Back and Forward buttons and Bookmarks.

It is also typically necessary to log at this level to capture activity that is served by the browser cache, although one potential alternative is to use a cache-busting proxy [Kel01].

Using an individual client history to build a model of the client provides the opportunity to make predictions that are highly personalized, and thus reflect the behavior patterns of the individual user. Unfortunately, logs from augmented browsers (such as those captured by Tauscher and Greenberg [TG97]) are rare. Instead, we will use a subset of requests captured by an upstream proxy (from the UCB dataset) with the understanding that such traces do not reflect all user actions — just those that were not served from the browser cache.

We have extracted individual request histories from the UCB proxy trace. However, not all “clients” identified from proxy traces with unique IP addresses are really individual users. Since proxies can be configured into a hierarchy of proxy caches, we have to be concerned with the possibility that proxy traces could have “clients” which are really proxy caches themselves, with multiple users (or even proxies!) behind them. Likewise, even when a client corresponds to a particular non-proxy system, it may correspond to a mechanized process that repeatedly fetches one resource (or a small set of resources). Since the latter correspond to highly regular request patterns, and the former correspond to overlapping request patterns, we attempt to avoid the worst of both out of a concern for fairness in evaluation. We have ranked the clients by total number of requests, ignored the top twenty, and instead selected the second twenty as the representative set of active users.

The potential performance improvement by prefetching into the client cache is, of course, one of the primary motivations of this dissertation. Building good user models can also help in the development of intelligent interfaces and appropriate history mechanisms.

4.4.3 Server

Servers provide a complementary view on Web usage from that of clients. Instead of seeing all requests made by a user, they see all requests made to one or more servers. Since they only know of requests for particular servers, such logs are unlikely to contain information about transitions to other systems. Technically, the HTTP Referrer header provides information on transitions into a particular server, but these are rarely provided in publicly available logs.

When learning a predictive model, the server could build an individual model for each user. This would be useful to personalize the content on the Web site for the individual user, or to provide hints to the client browser on what resources would be useful to prefetch. One difficulty is the relative scarcity of information unless the user visits repeatedly, providing more data than the typical site visit which commonly contains requests for only a handful of resources. An alternative is to build a single model of the typical user — providing directions that may say that most users request resource B after fetching resource A. When trends are common, this approach finds them. A single model can also provide information that could be used in site re-design for better navigation [PE97, PE00]. In between the two extremes lies the potential for a collaborative filtering approach in which individual models from many users can contribute to suggest actions useful to the current user, as pointed out by Zukerman et al. [ZA01]. For the experiments in this dissertation, we generally build a single model based on the traffic seen where the model is stored. Thus, a server would build a single model, which while likely not corresponding to any particular user, would effectively model the typical behavior seen.

In addition to those already described, predictive Web server usage models can also be used to improve server performance through in-memory caching, reduced disk load, and reduced loads on back-end database servers. Similarly, they could be used for prefetching into a separate server-specific reverse proxy cache (with similar benefits).

Server logs are widely available (compared to other kinds of logs), but they have some limitations, as they do not record requests to non-server resources and do not see responses served by downstream caches (whether browser or proxy).


This chapter will use traces from three servers. The first is the EPA-HTTP server logs [Bot95], which contain close to 48,000 requests corresponding to 24 hours of service at the end of August 1995. The second is two months of usage from the Music Machines website [PE98, PE00], collected in September and October of 1997. Unlike most Web traces, the Music Machines website was specifically configured to prevent caching, so the log represents all requests (not just the browser cache misses). The third (SSDC) is a trace of approximately eight months of non-local usage of a website for a small software development company. This site was hosted behind a dedicated modem, and the trace was collected over 1997 and 1998. Additionally, we have extracted the twenty most popular servers from the UCB proxy trace.

4.5 Experimental Results

In this section we will examine the effect of changes to various model parameters on predictive performance. In this way we can determine the sensitivity of the model (or data sets) to small and large changes in parameter values, and find useful settings of those parameters for the tested data sets.

4.5.1 Increasing number of predictions

In order to help validate our prediction codes, we replicated (to the extent possible) Sarukkai’s HTTP server request prediction experiment [Sar00]. This experiment used the EPA-HTTP data set, in which the first 40,000 requests were used as training data, and the remainder for testing.

We set up our tests identically, and configured our prediction codes to use a first order Markov model (i.e., an n-gram size of 2, with no minimum support or confidence needed to predict). Thus, unlike the remainder of the experiments presented in this chapter, this experiment builds a model using the initial training data and freezes it for use on the test data. This is to facilitate comparison with Sarukkai’s results. The Static Predictions line in Figure 4.5 corresponds closely to the performance of the first test of Markov chains reported in Figure 5 of [Sar00].


Figure 4.5: Predictive accuracy for the EPA-HTTP data set with varying numbers of allowed predictions.

Figure 4.5 also shows the approximately 1% absolute increase in predictive accuracy of the same system when it is allowed to incrementally update its model as it moves through the test set.

The EPA-HTTP logs, however, are rather short (especially the test set), and so we consider the performance of prediction for other server logs in subsequent tests and figures. Since they provide a more realistic measure of performance, we will use incremental predictive accuracy throughout the rest of this dissertation.

In Figure 4.6, we examine incremental predictive performance while varying the same parameter (the number of predictions permitted) over four other traces (SSDC, UCB servers, Music Machines, and UCB clients). For the initial case of one allowed prediction, we find that performance ranges from slightly over 10% to close to 40% accuracy. As the number of predictions allowed increases to 20, predictive performance increases significantly — a relative improvement of between 45% and 167%. The server-based traces show marked performance increases, demonstrating that the traces do indeed contain sufficient data to include most of the choices that a user might make. The performance of the client-based trace, conversely, remains low, demonstrating that the experience of an individual user is insufficient to include the activities that the user will perform in the future.


Figure 4.6: Predictive accuracy for various data sets with varying numbers of allowed predictions.

Finally, note that to achieve such high levels of predictive performance, systems will need to make many predictions, which usually come with some resource cost, such as time, bandwidth, and CPU usage.

4.5.2 Increasing n-gram size

One potential way to improve accuracy is to consider n-grams larger than two. This increase in context allows the model to learn more specific patterns. Figure 4.7 shows the relative improvement (compared to n=2) in incremental predictive accuracy for multiple traces when making just one prediction at each step. Thus, if the accuracy at n=2 were .2, an accuracy of .3 would be a 50% relative improvement, and all traces are shown to have zero improvement at n=2. In each case, the longest n-gram is used for prediction (ties are broken by selecting the higher probability n-gram), and 1-grams are permitted (i.e., corresponding to predictions of overall request popularity) if nothing else matches. The graph shows that adding longer sequences does help predictive accuracy, but improvement peaks and then wanes as longer but rarer (and less accurate) sequences are used for prediction.2


Figure 4.7: Relative improvement in predictive accuracy for multiple data sets as maximum context length grows (as compared to a context length of two).

Thus, the figure shows that larger n-grams themselves can be useful, but Figure 4.8 demonstrates that using the largest n-grams alone for prediction is insufficient. This performance is explained by the fact that longer n-grams match fewer cases than shorter n-grams, and are thus able to make fewer correct predictions.

4.5.3 Incorporating shorter contexts

Prediction by partial match provides an automatic way to use contexts shorter than the longest-matching one. In our experiments, we find that the PPM variations are able to perform slightly better than the comparable longest n-gram match. Figure 4.9 plots the performance improvement for three versions of PPM (described in Section 4.3) over the default n-gram approach. As Figure 4.9(a) shows, in the best case, PPM-C gives a close to 3% improvement when only the best prediction is used. When multiple predictions are permitted (as in Figure 4.9(b)), the performance improvement is even less.

2 As a result, in subsequent tests we will often use an n-gram limit of 4, as the inclusion of larger n-grams typically does not improve performance.


Figure 4.8: Relative improvement in predictive accuracy for multiple data sets using only maximal length sequences (as compared to a context length of two).

However, we can also endow n-grams with the ability to incorporate shorter n-grams. In this version of our system, the n-grams always merge with the results of prediction at the shorter context, with a fixed weight (as opposed to the weight from the dynamically calculated escape probability in the PPM model). Figures 4.10(a) and (b) show the predictive accuracy for various weightings when using a maximum n-gram size of 6 and a minimum of 2. The n-gram model combined with the best of the five tested weightings for the shorter context provides comparable performance (less than .5% absolute difference) to the best PPM model.

4.5.4 Increasing prediction window

Another way to improve reported accuracy is to allow predictions to match more than just the very next action. In a real Web system, prefetched content would be cached and thus available to satisfy user requests in the future. One can argue that when a prediction does not match the next request, it is incorrect. However, if the prediction instead matches and can be applied to some future request, we should count it as correct. Thus, in this test we apply a Web-specific heuristic for measuring performance.
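One simple way to score predictions against a window of upcoming requests is sketched below, with a deliberately trivial predictor for illustration; the counting used in the experiments reported here may differ in detail:

def windowed_accuracy(requests, predictor, window=1):
    # A prediction counts as correct if it matches any of the next `window` requests.
    correct = 0
    for i in range(len(requests) - 1):
        prediction = predictor(requests[: i + 1])       # predict from the history so far
        upcoming = requests[i + 1 : i + 1 + window]
        if prediction is not None and prediction in upcoming:
            correct += 1
    return correct / (len(requests) - 1)

# A deliberately trivial predictor: repeat the most recent request.
repeat_last = lambda history: history[-1]
print(windowed_accuracy(["A", "B", "A", "B", "A"], repeat_last, window=1))  # 0.0
print(windowed_accuracy(["A", "B", "A", "B", "A"], repeat_last, window=2))  # 0.75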

In Figure 4.11 we graph the relative performance improvement that results when we allow predictions to match the next 1, 2, 3, 5, 10, and 20 subsequent requests.

Figure 4.9: Relative improvement in predictive accuracy when using three PPM-based prediction models instead of n-grams. (a) Top-1 predictions; (b) Top-4 predictions.

While at 20 requests performance continues to increase, the rate of growth is tapering. The apparent boost in potential performance suggests that even when the next request is not predicted perfectly, the predicted requests are potentially executed in the near future.

Predictive accuracy, even when using the mechanisms described here, is limited at least by the recurrence rates of the sequences being examined. That is, in this chapter we only consider predicting from history, which means that every unique action cannot be predicted when it is first introduced. Thus, if we were to plot predictive performance on a scale of what is possible, the graphs would be 2-6 absolute percentage points higher.

Figure 4.10: Relative improvement in predictive accuracy when using shorter contexts in n-gram prediction. (a) Top-1 predictions; (b) Top-4 predictions.

4.5.5 De-emphasizing likely cached objects

One drawback to the above approach is that it breaks the typical modus operandi of “predict the next request” with which most machine learning or data compression researchers are familiar. Here we retain the traditional evaluation model, but instead change the prediction based on a Web domain-specific heuristic. Since recently requested objects are likely to be in a user’s cache, it is unlikely that the user will request them again, even when such an object has high probability from the model. This will be the case for most embedded resources (like images), which are often highly cacheable, but it is even possible for HTML pages, when a user takes an atypical route through a Web site. Since it will not always be true (some objects are uncacheable, and so the user will need to retrieve them repeatedly), we are unlikely to want to prevent such near-term repetitions, but we do want to de-emphasize them.


Figure 4.11: Relative improvement in predictive accuracy as the window of actions against which each prediction is tested grows.


Figure 4.12: Relative changes in predictive accuracy as cached requests are de-emphasized.


Figure 4.13: Predictive accuracy as past predictions are retained.

Our system keeps a window of the most recent n requests made per client. We compare a potential prediction to those in that window, and examine the appropriate emphasis to give such predictions by varying the weight assigned to them. Figure 4.12 shows the performance possible using this technique on two traces and three values for the number of predictions allowed (all within a first order Markov model). It shows that for the SSDC trace, it is beneficial to reduce the likelihood of re-predicting objects that have been recently requested (and thus are likely still in cache). Since the Music Machines trace was generated from a Web site that set all objects to be uncacheable, a test using that Web log would never show improvements (thus, we do not plot it here). The UCB clients trace likely contains references with a smaller likelihood of being cacheable, and thus shows only slight improvements for small decreases in weights. The UCB server traces (not shown for clarity) have similar or slightly worse performance than the UCB clients.
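A sketch of the de-emphasis step, assuming a per-client window of recent requests (names and the example weight are illustrative; the weight values plotted in Figure 4.12 range from dropping the item to leaving it unchanged):

from collections import deque

def deemphasize_cached(predictions, recent_requests, cached_weight=0.25):
    # Scale down candidates that appear in the client's recent-request window;
    # a weight of 0 drops them entirely, and 1 leaves the predictions unchanged.
    recent = set(recent_requests)
    return {item: score * (cached_weight if item in recent else 1.0)
            for item, score in predictions.items()}

recent = deque(["logo.gif", "index.html"], maxlen=10)   # per-client window of size n
print(deemphasize_cached({"logo.gif": 0.6, "news.html": 0.3}, recent))
# {'logo.gif': 0.15, 'news.html': 0.3}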

4.5.6 Retaining past predictions

Another Web domain-specific improvement can be realized by considering how users often browse the Web. When a particular page (e.g., B) is found to be uninteresting, or has no further value, the user will often use the back button to select a new link (e.g., link C, from page A) of potential value. The act of going back to a previous page is of interest in two ways. First, the revisiting of page A is not recorded in external request logs because the page is typically cached. Second, the links from the earlier page (A) may be selected, even when that page was not the most recently requested one. Thus, since the links of page A may be valuable to the user in later contexts, so might the predictions of next pages from A. We implemented a mechanism to attempt to capture this effect by merging the previous prediction with the current one, subject to a weighting.
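The merging of the previous prediction set with the current one can be sketched as follows (the weight, names, and example values are illustrative, not those used in the experiments):

def retain_past(current, previous, past_weight=0.1):
    # Combine the current prediction set with the one made at the previous step,
    # so links from an earlier (possibly revisited) page remain candidates.
    combined = {item: p * (1.0 - past_weight) for item, p in current.items()}
    for item, p in previous.items():
        combined[item] = combined.get(item, 0.0) + p * past_weight
    return combined

prev = {"review.html": 0.7, "specs.html": 0.3}    # predictions made from page A
curr = {"buy.html": 0.8, "specs.html": 0.2}       # predictions made from page B
print(retain_past(curr, prev, past_weight=0.1))
# buy.html ~ 0.72, specs.html ~ 0.21, review.html ~ 0.07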

In our experiments, we found this to be only minimally useful, and even then only for cases in which multiple predictions are allowed. Figure 4.13 shows that the improvement in predictive performance increases slightly as the number of allowed predictions increases.

4.5.7 Considering inexact sequence matching

In many domains, small variations in the sequences are not significant. HTTP request traces are one of those domains in which some variation is expected — the particular request ordering of a page and its embedded resources is a function of the page HTML source, the browser rendering speed and enabled options (such as Java and JavaScript), and network performance. Additionally, on some visits a resource may be retrieved, whereas on other visits the resource may be cached and thus missing from the request trace.

Therefore, we considered a mechanism that allowed for such variation. In our standard implementation (i.e., for exact sequences) with maximum n-gram size of 4, the arrival of D in the real sequence (A, B, C, D) will trigger model updates for all suffixes:

Figure 4.14: Relative improvement in predictive accuracy for n-grams (Music Machines and SSDC traces) as the weight for non-exact matches is varied.

(D), (C, D), (B, C, D), and (A, B, C, D). In this way, we track how often each subsequence has been seen. To allow inexact matching, we generalized the model to incorporate updates corresponding to unreal sequences, while continuing to use exact matching at the point of prediction.

In our modified implementation, the arrival of D in the real sequence (A, B, C, D) will again trigger model updates for all real suffixes, but also for all unreal suffixes that include the last event: (A, D), (B, D), (A, C, D), (A, B, D). The counts for these unreal sequences are updated by adding a weight (typically less than 1).
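The following fragment sketches this update rule for the example above; it is a simplified reconstruction under the stated weighting, and the function and parameter names are assumptions.

```python
# Simplified sketch of the inexact-matching update: on each new event, update
# counts for every real suffix of the recent context and, at a reduced weight,
# for every "unreal" suffix that keeps the new event but omits earlier events.
from itertools import combinations

def update_counts(counts, context, event, max_n=4, unreal_weight=0.5):
    """counts: dict mapping tuple of events -> weight; context: recent events."""
    history = tuple(context[-(max_n - 1):])              # events preceding `event`
    for k in range(len(history) + 1):
        for subset in combinations(history, k):          # order-preserving subsets
            seq = subset + (event,)
            is_real = subset == history[len(history) - k:]  # contiguous suffix?
            counts[seq] = counts.get(seq, 0.0) + (1.0 if is_real else unreal_weight)
    return counts

# For context (A, B, C) and new event D this updates (D), (C, D), (B, C, D),
# and (A, B, C, D) by 1, and (A, D), (B, D), (A, C, D), (A, B, D) by the
# unreal weight, matching the example above.
```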

Figure 4.14 shows the relative improvement as the unreal update weight is varied. The scenario plotted uses a fixed-size Markov model (order 4) with the top five predictions, no confidence or support thresholds, and performance tested across various prediction window sizes. This approach is particularly beneficial in cases like these, but is not effective (and can even be detrimental) for scenarios that predict from multiple contexts (e.g., PPM) and that are evaluated strictly on predicting the next action. We should also point out a practical consideration — this approach is slower and requires a much larger prediction model, as many of the unreal sequences now captured are sequences that did not occur naturally. This factor, combined with improvements of just 3-16%, suggests that this approach is not likely to be useful in practice.

Figure 4.15: Relative change in predictive accuracy as alpha (emphasis on recency) is varied, plotted against 1 - alpha for the UCB clients, UCB servers, SSDC, and Music Machines traces.

4.5.8 Tracking changing usage patterns

As we mentioned in the discussion of predictive methods in Section 4.3, we can emphasize recent activity by changing the value of alpha. Figure 4.15 shows the effect of alpha on predictive accuracy. In these experiments, we set the maximum and minimum n-gram size to be 4 and 1, respectively. Note that an alpha of 1 means no emphasis on recency, and an alpha of 0 means to use the most recent action as the prediction of the action coming next. Unlike most of the traces in the figure, which show small improvements with an appropriate alpha, the Music Machines trace declined slightly in performance as alpha decreased. We speculate that concept drift in server logs is predominantly from server changes and not from changes in user patterns (from changing user interests). If the Music Machines content did not change over the (two month) time period of the trace, then an emphasis on recent activity is not warranted.
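One plausible realization of this recency emphasis is sketched below. The exact form used by our predictors is the one given in Section 4.3; the code here is only an assumed illustration in which existing counts are multiplied by alpha before each new observation.

```python
# Assumed illustration of recency emphasis: decay existing per-context counts
# by alpha before adding the new observation. With alpha = 1 nothing is
# forgotten; as alpha approaches 0 only the most recent transition matters.
def decayed_update(counts, context, next_event, alpha=0.99):
    """counts: dict mapping context tuple -> dict of next_event -> weight."""
    successors = counts.setdefault(tuple(context), {})
    for event in successors:
        successors[event] *= alpha          # de-emphasize older evidence
    successors[next_event] = successors.get(next_event, 0.0) + 1.0
    return counts
```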

Figure 4.16: Ratio of the number of predictions made to those predicted correctly (overhead, in predictions per correct prediction, plotted against the fraction predicted correctly for the Music Machines, UCB servers, UCB clients, and SSDC traces).

4.5.9 Considering mistake costs

Accuracy alone is typically only important to researchers. In the real world, there are costs for mistakes. Figure 4.16 provides one measure of the cost of allowing for additional predictions per action. As the number of allowed predictions per action increases (1, 2, 5, 10, 15, 20), the fraction predicted correctly grows, but the number of predictions needed to get that accuracy also increases significantly. (Note that this example configuration, the same as in Figure 4.6, does not limit predictions to any particular confidence or support, and so likely overestimates the number of predictions made in practice.) In a prefetching scenario in which mistakes are not kept, for Music Machines or SSDC the average per-request bandwidth could be up to twenty times the non-prefetching bandwidth. Note, however, that this analysis does not consider the positive and extensive effects of caching, which we will examine in later chapters. The simple point is that there is always a tradeoff — in this case we can see that increasing coverage requires additional bandwidth usage.

Therefore, it is worthwhile to consider how to reduce or limit the likelihood of making a false prediction, as discussed in Section 4.2.6. Figure 4.17 demonstrates the increased precision possible (with lower overall accuracy) as the threshold is increased for an otherwise traditional 1-step Markov model. For each trace, we plot the precision as the minimum confidence threshold varies from .01 to .95. Low threshold values result in higher fractions predicted correctly. The figure demonstrates that while much larger values of precision are possible, they come at a cost in the number of requests predicted correctly. On the other hand, high precision minimizes the rate of mis-predictions, reducing the costs associated with each correct prediction.

Figure 4.17: Ratio of precision to overall accuracy for varying confidence (Music Machines, UCB servers, UCB clients, and SSDC traces).

Figure 4.18: Ratio of precision to overall accuracy for varying support (Music Machines, UCB servers, UCB clients, and SSDC traces).

Figure 4.19: Ratio of precision to overall accuracy for a minimum support of 10 with varying confidence (Music Machines, UCB servers, UCB clients, and SSDC traces).

Figure 4.18 shows the same scenario except using a support threshold (ranging from 1 to 50 instances) instead. The effect on increased precision is more variable, as it depends on the number of times each pattern is found in a trace. The UCB clients trace has relatively few repetitions per client, so its curve is shifted further to the left. Neither SSDC nor Music Machines achieves significant growth in precision, but the UCB servers trace with a threshold of 50 instances reaches a precision close to .7.

Figure 4.19 provides one combined view — a minimum support of 10 instances combined with a varying confidence threshold. As can be seen, with this combination it is possible to achieve precision that exceeds what is possible from either threshold alone. The lesson here is that high-accuracy predictions are quite achievable, but are typically applicable to a much smaller fraction of the trace.
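To make the use of these thresholds concrete, the sketch below filters candidate predictions by both a minimum support and a minimum confidence before emitting the top few. The function name and default values are illustrative only (the default support of 10 simply echoes the setting plotted in Figure 4.19).

```python
# Hedged sketch: apply support and confidence thresholds before returning the
# highest-confidence predictions for the current context.
def thresholded_predictions(successors, max_predictions=5,
                            min_confidence=0.1, min_support=10):
    """successors: dict mapping next request -> observed count for this context."""
    total = sum(successors.values())
    candidates = []
    for event, count in successors.items():
        confidence = count / total if total else 0.0
        if count >= min_support and confidence >= min_confidence:
            candidates.append((confidence, event))
    candidates.sort(reverse=True)
    return [event for _, event in candidates[:max_predictions]]
```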

4.6 Discussion

We’ve demonstrated various techniques to improve predictive accuracy on logs of Web requests. Most of these methods can be combined, leading to significant improvements.

Figure 4.20: Predictive accuracy for a first-order Markov model on UCB proxy data as the data set size varies (per-segment and cumulative accuracy versus the number of requests in the proxy trace).

A first-order Markov model achieves approximately 38% top-1 predictive accuracy on the SSDC trace. A top-5 version increases that by another 30 points, to 68% predictive accuracy. After a few more adjustments (including using PPM, version C, with a maximum n-gram size of 6, down-weighting predictions likely to be in cache to 20%, and including past predictions with weight .025), accuracies of 75% are possible. Put another way, this constitutes a 20% reduction in error. And if we consider only those predictions that were actually tested (i.e., ignoring those for which the client made no more requests), we get a prediction accuracy of over 79%. This performance, of course, comes at the cost of a large number of guesses; in this case, we are making five guesses almost all of the time. Alternatively, lower overall accuracy but higher precision is possible with some modifications (such as the use of confidence and support thresholds).

Some investigators have focused on this issue. Although they use a simplistic model of co-occurrence, Jiang and Kleinrock [JK97, JK98] develop a complex cost-based metric (delay cost, server resource cost) to determine a threshold for prefetching, and use it to consider system load, capacity, and costs. Likewise, Zukerman et al. [ZA01, AZN99, ZAN99], as well as others (e.g., [SH02, Hor98]), develop decision-theoretic utility models to balance cost and benefit and for use in evaluation.

Finally, the astute reader may have noticed that while we mentioned proxy workloads for completeness back in Section 4.4, we have not included them in the figures shown, as they do not provide significant additional information. While running the UCB client workloads and calculating the client-side predictive accuracy (which is what we report), we also calculated the accuracy that a proxy-based predictor would have achieved. In general, its accuracy was 5-10% higher, relative to the client accuracy (i.e., 0.1-2% in absolute terms).

For reference, we provide one graph of performance over one large proxy trace. Figure 4.20 shows the short- and long-term trends in predictive accuracy for the first 3.4M requests of the UCB trace. This is for a baseline predictor: a first-order Markov model allowing just one guess, without thresholds. It shows that the model continues to learn throughout the trace (as the per-100k-request segment accuracies are always above the cumulative accuracy). Note that even with millions of requests, the displayed trace corresponds to only about eight days of traffic, and thus it is understandable that its performance has not yet stabilized. However, it is also apparent that the overall mean predictive accuracy will likely converge to an asymptote somewhere below 25%.
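For concreteness, the following sketch shows how such segment and cumulative accuracies could be computed for a baseline first-order model; it is a simplification (a single request stream, top-1 predictions only) and the names are assumptions rather than our actual evaluation code.

```python
# Simplified sketch of the evaluation behind Figure 4.20: train a first-order
# Markov model online over a request stream, recording per-segment and
# cumulative top-1 accuracy.
from collections import defaultdict

def evaluate_trace(requests, segment_size=100_000):
    model = defaultdict(lambda: defaultdict(int))   # prev URL -> {next URL: count}
    prev = None
    correct = tested = seg_correct = 0
    results = []
    for i, url in enumerate(requests, 1):
        if prev is not None:
            successors = model[prev]
            if successors and max(successors, key=successors.get) == url:
                correct += 1
                seg_correct += 1
            tested += 1
            successors[url] += 1                    # learn only after predicting
        prev = url
        if i % segment_size == 0:
            results.append((seg_correct / segment_size,     # segment accuracy
                            correct / max(tested, 1)))      # cumulative accuracy
            seg_correct = 0
    return results
```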

4.7 Summary

In this chapter we have examined in detail the problem of predicting Web requests, focusing on Markov and Markov-like algorithms. We have discussed the problems of evaluation, and have provided a consistent description of algorithms used by many researchers in this domain. We built generalized code that implements the various algorithms described. The algorithms were tested on the same set of Web traces, facilitating performance comparisons.

We demonstrated the potential for using probabilistic Markov and Markov-like methods for Web sequence prediction. By exploring various parameters, we showed that larger models incorporating multiple contexts could outperform simpler models. We also provided evidence for limited “concept drift” in some Web workloads by testing

a variation that emphasized recent activity. Finally, we demonstrated the performance changes that result when using Web-inspired algorithm changes. Based on these results, we recommend the use of a multi-context predictor (such as PPM, or the multi-context n-gram approach described in this chapter). However, for the Web, it appears that relatively few contexts are needed — n-grams don’t need to be much more than 3 or 4. Incorporating some caching-related heuristics may also help, such as de-emphasizing predictions likely to be cached. Finally, the utilization of thresholds on prediction can significantly reduce the likelihood of erroneous predictions.

In this chapter we hinted at some of the effects that client caching can have on performance. This issue will be explored further in later chapters, once we have better mechanisms for evaluating the effects of prefetching techniques.

Chapter 5

The Utility of Web Content

5.1 Introduction

Prediction from history, as described in Chapters 3 and 4, has some limitations. In particular, history cannot be used to predict an action that has never been seen. Thus on the Web, when a user visits a brand new page, no prediction is possible using a model based solely on history. More importantly, a user model only using history ignores a rich set of available content that can guide user actions. Specifically, many of the resources being retrieved contain textual content, which can be analyzed, and Web pages contain hyperlinks to other resources. This link structure can carry significant information implicitly, such as approval, recommendation, similarity, and relevance.

Pitkow and Pirolli [PP97] have observed that “hyperlinks, when employed in a non-random format, provide semantic linkages between objects, much in the same manner that citations link documents to other related documents.” This observation can be restated as: most Web pages are linked to others with related content. This idea, combined with another that says that text in, and possibly around, HTML anchors describes the pages to which they point, is the foundation for a usable World-Wide Web. In this chapter we examine to what extent these ideas hold by empirically testing whether topical locality mirrors spatial locality of pages on the Web. In particular, we demonstrate that this semantic linkage, as approximated by textual similarity, is measurably present in the Web. We find the likelihood of linked pages having similar textual content to be high; that the similarity of sibling pages increases when the links from the parent are close together; that titles, descriptions, and anchor text represent at least part of the target page; and that anchor text may be a useful discriminator among unseen child pages. These results present the foundations necessary for the success of many

proposed techniques (e.g., [PPR96, Spe97]) and implemented Web systems, including search and meta-search engines (e.g., [Inf02b, SE95, SE97, DH97c, HD97, Inf02a, ZE98, ZE99, LG98b, LG98a, Goo02, Alt02, Ink02, Fas02, McB94, Lyc02b]), focused crawlers (e.g., [CGMP98, CvdBD99, BSHJ+99, Men97, MB00, Lie97, RM99]), linkage analyzers (e.g., [BFJ96, DH99, KRRT99, FLGC02, CDR+98, CDG+99, BH98b, BP98, DGK+99, IBM00, Kle98]), and intelligent Web agents (e.g., [AFJM95, JFM97, Mla96, Lie95, Lie97, MB00, BS97, LaM96, LaM97, Lie97, PP97, Dav99a]), as we will describe below.

While in this dissertation we are concerned with extracting information from Web content to guide user action prediction, the utility of Web content is much broader. Thus, in this chapter we consider extensively where and how long-held intuitions about the nature of Web linkages, and the content from and to which they point, can be used, providing examples from a number of areas. We then describe experimental work to collect and analyze a Web dataset and show how it supports the common assumptions about the nature of the Web. While we provide some intuitions and empirical data to show why content can be helpful for usage modeling in the section below, we postpone our experimental examination of how it can help with prefetching to the following chapter.

5.2 Motivation

It is commonly believed that most links on the Web connect pages with related content. This assumption, along with another that says that the text in and around Web links describes the pages to which they point, is the foundation for a usable World-Wide Web. Together they make browsing possible, since users would not follow links if those links were unlikely to point to relevant and useful content. These ideas have also been noticed by researchers and developers, and are implicit in many of the search, crawling, and navigation services found on the Web today.

Expanding on this, intuition suggests that content matters when characterizing Web traffic because:

• Web usage is rarely random. Web users typically have some kind of information-seeking goal, even if it is fleeting (i.e., when surfing from one topic to another). In particular, Web users have an expectation of what information is available behind a link, based at least in part on the content of the pointing page.

• Web users click on links. Unlike television, where most viewers passively receive information for significant periods of time, Web users actively click on links to direct their experience and to decide what content to receive. The links they click on are part of the content they get, and have a tremendous influence on where they go next. Out of the billions of URLs that a user could type into a browser at any point in time, the user will most often choose instead to click a link from the page being viewed.

To illustrate, we analyzed the Web server logs from a small software development company (188 thousand hits over an 8-month period), and found that approximately 83.4% of the 48,000 hits resulting from a user’s choice (i.e., not from spiders or inline images) had a Referer tag that showed the URL of the referring page. These hits represent Web link selections (rather than, for example, URLs typed in by hand or selected from a set of bookmarks). This result coincides nicely with the statistic reported by Tauscher and Greenberg [TG97] (who measured client actions from a browser’s perspective), which said that 82.7% of all open document actions were generated by selections within the page (5% bookmarks, 6.8% typed in, etc.). In an earlier study, Catledge and Pitkow [CP95] found an even higher percentage (over 90% of open document requests were from selecting a hyperlink in a document). In Chapter 6 we will additionally show that a large portion of the clicks will be to links on pages that have been requested in the recent past.

Thus, for evaluation of many Web algorithms, typical Web usage logs are insufficient. As we describe elsewhere [Dav99c], standard logs generated by Web servers or proxies do not provide enough data for accurate analysis of user activity. For example, they do not include many of the headers which specify cacheability, or whether the request is to validate a cached copy. Further, Web resources change rapidly [GS96, Kah97, DFKM97,

CM99] and so URLs cited in the logs are likely to have changed or disappeared entirely in a short period of time. Thus, in order to do offline analysis of content-based prefetching methods, a full-content trace needs to be collected (which we do, and use in experiments we report later in Chapter 6).

Intuitions about Web content locality are so fundamental that in many cases they are not mentioned, even though without them the systems would fail to be useful. When mentioned explicitly (as in [MB00, DH99, GKR98, BS97, Kle98, BP98, CDR+98, Ami98]), their influence is measured implicitly, if at all. This chapter is an attempt to rectify the situation — we wish to measure the extent to which these ideas hold.

This chapter primarily addresses two topics: it examines the textual similarity of pages near one another in the Web, and the related issue of the quality of descriptions of Web pages. The former is most relevant to focused Web crawlers and to search engines using link analysis (as we discuss in Section 5.3.2). The latter is primarily of use to Web indexers (Section 5.3.1), meta-search tools (Section 5.3.3), and to human browsers of the Web, since users expect to find pages that are indeed described by link text (when browsing the Web) and to find pages that are characterized accurately by the descriptive text presented in search engine results. We show empirical evidence of topical locality in the Web, and of the value of descriptive text as a representative of the targeted page. In particular, we find the likelihood of linked pages having similar textual content to be high; that the similarity of sibling pages increases when the links from the parent are close together; that titles, descriptions, and anchor text represent at least part of the target page; and that anchor text may be a useful discriminator among unseen child pages.

For the experiments described in this chapter, we select a set of pages from the Web and follow a random subset of the links present on those pages. This provides us with a corpus in which we can measure the textual similarity of nearby or remote pages and explore the quality of titles, descriptions, and anchor links with respect to their representation of the document so described. In the next section, we will describe some applications for this work, giving examples from many areas, including Web indexers, search ranking systems, focused crawlers and Web prefetchers. We will then describe

our experimental methodology, present the results found, discuss related work, and conclude with a summary of our findings.

5.3 Applications

The World-Wide Web is not a homogeneous, strictly-organized structure. While small parts of it may be ordered systematically, many pages have links to others that appear almost random at first glance. Fortunately, further inspection generally shows that the typical Web page author does not place random links in her pages (with the possible exception of banner advertising), but instead tends to create links to pages on related topics. This practice is widely believed to be typical, and as such underlies a number of systems and services on the Web, some of which are mentioned below.

Additionally, there is the question of describing the Web pages. While it is common for some applications to just use the contents of the Web pages themselves, there are situations in which one may have only the titles and/or descriptions of a page (as in the results page from a query of a typical search engine), or only the text in and around a link to a page. A number of systems (such as content-based prefetchers, included below in Section 5.3.5) could or do assume that these “page proxies” accurately represent the pages they describe, and we include some of those systems below.

5.3.1 Web indexers

A Web indexer takes pages crawled from the Web and generates an inverted index of those pages for later searching. Companies operating search engine indices including Google [Goo02], AltaVista [Alt02], Inktomi [Ink02], FAST [Fas02], etc., all have indexers of some sort that perform this function. However, many search engines once indexed much less than the full text of each page. An early search engine called the WWW Worm [McB94], for example, indexed titles and anchor text. Lycos [Lyc02b], at one time, only indexed the first 20 lines or 20% of the text [KABL96]. More recently Google started out by indexing just the titles [BP98]. Today it is common for the major engines to index not only all the text, but also the title of each page. Smaller services such as

research projects or intranet search engines may opt for reduced storage and index less. What is less common is the indexing of HTML META tags containing author-supplied keywords and descriptions. Some search engines will index the text of these fields, but others do not [Sul02], citing problems with search engine “spamming” (that is, some authors will place keywords and text that are not relevant to the current page but instead are designed to draw traffic as a result of popular search terms).

Likewise, while indexers typically include anchor text (text within and/or around a hypertext link) as some of the terms that represent the page on which it is found, most do not use it as terms to describe the referenced page. One significant exception is Google, which does index anchor text. By doing so, Google is able to present target pages to the user that have not been crawled, or have no text, or are redirected to another page. One drawback, however, is that this text might not in fact be related to the target page. One publicized example was the query “more evil than satan himself” which, at least for a while, returned Microsoft as the highest ranked answer from Google [Sul99, Spr99].

So, for search engine designers, we want to address the questions of how well anchor text, title text, and META tag description text represent the target page’s text. Even when title and descriptions are indexed, they may need to be weighted differently from terms appearing in the text of a page. Our goal (among others) is to provide some evidence that may be used in making decisions about whether to include such text (in addition to or instead of the target text content) in the indexing process.

5.3.2 Search ranking and community discovery systems

Traditionally, search engines have used text analysis to find documents relevant to a query. Today, however, many Web search engines incorporate additional factors of user popularity (based on actual user traffic), link popularity (that is, how many other pages link to the page), and various forms of page status calculations. Both link popularity and status calculations depend, at least in part, on the assumption that page authors do not link to random pages. Presumably, link authors want to direct their readers to pages that will be of interest or are relevant to the topic on the

current page. The link analysis approaches used by Clever [IBM00, Kle98] and others [BH98b, BP98, DGK+99] depend on having a set of interconnected pages that are both relevant to the topic of interest and richly interconnected in order to calculate page status. Additionally, some [CDR+98, CDG+99] use anchor text to help rank the relevance of a query to communities discovered from the analysis. The similar link analysis performed in community discovery research (e.g., [KRRT99, FLGC02]) makes the same assumptions about the relatedness of linked pages, but uses the presence of patterns of such links to reinforce its models.

LASER [BFJ96] demonstrates a different use of linkage information to rank pages. It computes the textual relevance, and then propagates that relevance backward along links that point to the relevant pages. The goal is to enable the engine to find pages that are good starting points for automated crawling, even if those pages don’t rank highly based on text alone.

Our analysis may help to explain the utility of anchor text usage, as well as show how likely neighboring pages are to be on the same topic.

5.3.3 Meta-search engines

Meta-search engines (e.g., MetaCrawler [Inf02b, SE95, SE97], SavvySearch [DH97c, HD97] which is now incorporated into Search.com, and DogPile [Inf02a]) are search services that do not search an index of their own, but instead collect and compile the results of searching using other engines. While these services may do nothing more than present the results they obtained for the client, they may want to attempt to rank the results or perform additional processing. Grouper [ZE98, ZE99], for example, performs result clustering. While early versions of MetaCrawler and Inquirus [LG98b, LG98a] fetch all documents for analysis on full-text, simpler versions (perhaps with little available bandwidth) might decide to fetch only the most likely pages for further analysis. In this case, the meta-engine has only the information provided by the original search engines (usually just URL, title, and description), and the quality of these page descriptors is thus quite important to an after-the-fact textual ranking or clustering of the pages.

5.3.4 Focused crawlers

Focused crawlers are Web crawlers that follow links that are expected to be relevant to the client’s interest (e.g., [CvdBD99, BSHJ+99, Men97, MB00, Lie97, RM99] and the query similarity crawler in [CGMP98]). They may use the results of a search engine as a starting point, or they may crawl the Web from their own dataset. In either case, they assume that it is possible to find highly relevant pages using local search starting with other relevant pages. While not concerned with crawling, Dean and Henzinger [DH99] demonstrate a similar approach to find related pages for search engine results.

Since focused crawlers may use the content of the current page, or anchor text to determine whether to expand the links on a page, our examination of nearby page relevance and anchor text relevance may be useful.

5.3.5 Intelligent browsing agents

There have been a variety of agents proposed to help people browse the Web. Many of those that are content-based depend on the contents of a page and/or the text contained in or around anchors to help determine what to suggest to the user (e.g., [AFJM95, JFM97, Mla96, Lie95, Lie97, MB00, BS97, LaM96, LaM97]) or to prefetch links for the user (e.g., [Lie97, PP97, Dav99a]).

By comparing the text of neighboring pages, we can build models for future estimates of the relevance of unseen pages neighboring some current one. We also find out how well anchor text describes the targeted page.

5.4 Experimental Method

In this section we describe our data, explain how it was collected, identify the mechanism we use to measure similarity, and detail the experiments performed.

5.4.1 Data set

5.4.1.1 Initial Data Set

Ideally, when characterizing the pages of the WWW, one would choose a random set of pages selected across the Web. Unfortunately, while the Web is currently estimated to contain billions of pages, no one entity has a complete enumeration. Even the major search engines only know of a fraction of the Web, and the pages retained in those datasets are biased samples of the Web [LG98c, LG99]. As a result, the unbiased selection of a random subset of the Web is an open question [BB98], although some progress has been made [HHMN00].

Accordingly, the data set used as the starting point in this chapter was selected at random from a subset of the Web. We randomly selected 100,000 pages out of the approximately 3 million pages that our local research search engine (DiscoWeb [DGK+99]) had crawled and indexed by early December 1999. The pages in the DiscoWeb dataset at that time were generated primarily from the results of inquiries made to the major search engines (such as HotBot [Lyc02a] and AltaVista) plus pages that were in the neighborhood of those results (i.e., direct ancestors or descendants of pages in those results). Thus, selecting pages from this dataset will bias our sample toward pages in the neighborhood of high-ranking English-language pages (that is, pages near other pages that have scored highly on some query to a search engine).

5.4.1.2 Remaining Data Set

From the initial data set, we randomly selected one outgoing link per page and retrieved those pages. We also randomly selected a second, different outgoing link per page (when the number of unique links was > 1) and fetched those pages as well. The latter set was used for testing anchor text relevance to sibling pages and to measure similarity between sibling pages.

Table 5.1: Illustration of different amounts of text surrounding an anchor (the anchor text alone, then the anchor plus surrounding terms up to distances 1, 2, and 5 on each side).

5.4.1.3 Retrieval, Parsing, and Textual Extraction

The pages were retrieved using the Perl LWP::UserAgent library, and were parsed with the Perl HTML::TreeBuilder library. Text extraction from the HTML pages was performed using custom code that down-cased all terms and replaced all punctuation with whitespace so that all terms are made strictly of alphanumerics. Content text of the page does not include title or META tag descriptions, but does include alt text for images. URLs were parsed and extracted using the Perl URI::URL library plus custom code to standardize the URL format (down-casing host, dropping #, etc.) to maximize matching of equivalent URLs. The title (when available), description (when available), and non-HTML body text were recorded, along with anchor text and target URLs. The anchor text included the text within the link itself (i.e., between the <A HREF=...> and </A> tags), as well as surrounding text (up to 20 terms, but never spanning another link). See Table 5.1 for an example. Term vector representation [SM83] was used for each sample of text.

5.4.2 Textual similarity calculations

To perform the textual analysis, we calculated a simple measure. Elsewhere [Dav00c] we report similar results with additional measures. The measure has the following useful properties:

• The measure can be applied to any pair of documents,

• The measure produces scores in the range [0..1], and

• Identical documents generate a score of 1 while documents having no terms in common generate a score of 0.

TFIDF [SM83, SB98], which multiplies the term frequency of a term in a document by the term’s inverse document frequency, was selected for its widespread use and long history in information retrieval. Intuitively it gives extra weight to terms heavily used

by a document, but penalizes terms that are found in most documents. Given term $w_i$ in document $X$, the specific formulas used were:

\[
  \mathrm{TFIDF}(w_i, X) \;=\; \frac{\mathrm{TF}(w_i, X)\,\mathrm{IDF}(w_i)}
  {\sqrt{\sum_{\mathrm{all}\ w}\bigl(\mathrm{TF}(w, X)\,\mathrm{IDF}(w)\bigr)^{2}}}
\]
where
\[
  \mathrm{TF}(w, X) = \lg(\text{number of times } w \text{ appears in } X + 1)
\]
and
\[
  \mathrm{IDF}(w) = \lg\!\left(\frac{\text{number of docs} + 1}{\text{number of docs with term } w}\right).
\]
Note that the IDF values are calculated from the documents in the combined retrieved sample, not over the entire Web.

The TFIDF values for each term of the document are then normalized (divided by the sum of the values) so that the values of the terms in a document sum to 1. Each document then is represented as a vector of values corresponding to the TFIDF values for each possible term. To calculate the similarity between document vectors X and Y we use the cosine measure.

\[
  \mathrm{TFIDF\mbox{-}Cos}(X, Y) \;=\; \frac{\sum_{\mathrm{all}\ w} \mathrm{TFIDF}(w, X)\,\mathrm{TFIDF}(w, Y)}
  {\sqrt{\sum_{\mathrm{all}\ w} \mathrm{TFIDF}(w, X)^{2} \;\sum_{\mathrm{all}\ w} \mathrm{TFIDF}(w, Y)^{2}}}
\]
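A direct (if simplified) implementation of these formulas is sketched below; it takes lg to be the base-2 logarithm, uses assumed function names, and computes document frequencies over the retrieved sample as described above.

```python
# Sketch of the TFIDF and cosine computations above. `doc_freq` holds the
# number of sample documents containing each term; lg is taken as log base 2.
import math
from collections import Counter

def tfidf_vector(doc_terms, doc_freq, num_docs):
    """doc_terms: list of terms in one document; returns a normalized vector."""
    counts = Counter(doc_terms)
    raw = {w: math.log2(c + 1) * math.log2((num_docs + 1) / max(doc_freq.get(w, 1), 1))
           for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
    return {w: v / norm for w, v in raw.items()}

def tfidf_cosine(x, y):
    """Cosine similarity between two term-vector dicts."""
    dot = sum(v * y.get(w, 0.0) for w, v in x.items())
    nx = math.sqrt(sum(v * v for v in x.values()))
    ny = math.sqrt(sum(v * v for v in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0
```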

5.4.3 Experiments performed

The primary experiments performed include measuring the textual similarity:

• of the title to the text in the body of the page, and of the description to the body (see Figure 5.1)

• of a page and one of its children

• of a page and a random page

Figure 5.1: Sample HTML page from www.acm.org. We record text found in section A (title), section B (description), and section C (body text).

• of two pages with the same immediate ancestor (i.e., between siblings) and with respect to the distance in the parent document between referring URLs

• of anchor text and the page to which it points

• of anchor text and a random page

• of anchor text and a page different from the one to which it points (but still linked from the parent page)

We additionally measured lengths of titles, descriptions (text provided in the description META tag of the page), anchor texts, and page textual contents. We also examined how often links between pages were in the same domain, and if so, the same host, same directory, etc.

We also performed experiments with stop word elimination and Porter term stemming [Por97], with similar results, and thus they are not included below (but can be found in a technical report [Dav00b]). No other feature selection was used (i.e., all terms were included).

5.5 Experimental Results

5.5.1 General characteristics

For a baseline, we first consider characteristics of the overall dataset. Out of the initial 100,000 URLs selected, 89,891 were retrievable. An additional 111,107 unique URLs were retrievable by randomly fetching two distinct child links from each page of the initial set (whenever possible). The top five represented hosts were: www.geocities.com (561 URLs), www.webring.com (419 URLs), www.amazon.com (303 URLs), members.aol.com (287 URLs), and www.tripod.com (196 URLs). Combined, they represent less than 1% of the URLs used. Figure 5.2 shows the most frequent top-level domains in our data; close to half of the URLs are from .com, and another 26.8% of the URLs came from .edu, .org, and .net. Approximately 18% of the URLs represent top-level home pages (i.e., URLs with a path component of just /). The initial dataset contained a mean of approximately 49 links per page, with the distribution shown in Figure 5.3.

Figure 5.2: Representation of the twenty most common top-level domain names in our combined dataset, sorted by frequency.


Figure 5.3: The distribution of the number of links per Web page.

With respect to content length, the sample distributions used for source and target pages are similar, so we present one distribution (pages from the initial dataset containing titles), shown in Figure 5.4. Thus it can be seen that almost half of the Web pages contain 250 words or less.

As shown in Figure 5.5, for pairings of pages with links between them, the domain name matched 55.67% of the time. For pairings of siblings, the percentage was 46.32%.


Figure 5.4: Distribution of content lengths of Web pages.

Figure 5.5: Percentages of page pairings (parent-child, sibling, and random) in which the domain name matched.

Figure 5.6: Distributions of URL match lengths between the URLs of a parent document and a randomly selected child document are similar to that of a parent and a different child document, as well as between the URLs of the two selected child (sibling) documents.

For random pairings of pages, the domain name matched 0.003% of the time. We also measured the number of segments that matched between URLs. A score of 1 means that the host name and port (more strict than just domain name matching) matched. For each point above 1, an additional path segment matched (i.e., top-level directory match would get 2; an additional subdirectory would get 3, and so on). The distributions of these segment match lengths for connected pages are shown in Figure 5.6.
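The match-length score can be computed in a few lines of code; the sketch below uses an assumed function name and treats the final path component as a filename, which is an assumption for illustration.

```python
# Illustrative sketch of the URL match-length score: 1 if host and port match,
# plus one for each additional leading path (directory) segment in common.
from urllib.parse import urlparse

def url_match_length(url_a, url_b):
    a, b = urlparse(url_a), urlparse(url_b)
    if a.netloc.lower() != b.netloc.lower():
        return 0
    score = 1
    dirs_a = [s for s in a.path.split('/') if s][:-1]
    dirs_b = [s for s in b.path.split('/') if s][:-1]
    for seg_a, seg_b in zip(dirs_a, dirs_b):
        if seg_a != seg_b:
            break
        score += 1
    return score
```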

Figure 5.7 shows TFIDF-cosine similarities for the author-supplied same-page descriptors (titles and description META tag contents). Here we can see that the description is typically a slightly better descriptor for the body of the page than the title.

5.5.2 Page to page characteristics

Figure 5.8 presents the TFIDF similarity scores of the current page to the linked page, to random pages and between sibling pages. This figure demonstrates that random page texts have almost nothing in common, and that linked page texts have more in common than sibling pages.

In Figure 5.9, we plot sibling page similarity scores as a function of distance between



Figure 5.7: Textual similarity scores for page self-descriptors. In this (and following figures), icons are used to help represent the relevant portions of pages (shaded) and the relationship being examined. Here, we consider the similarity between the title and the body text and between the description and body text.

Figure 5.8: TFIDF-cosine similarity for linked pages, random pages, and sibling pages.


Figure 5.9: Plot of similarity score between sibling pages as a function of distance between referring URLs in parent page for TFIDF.

referring URLs in the parent page (where distance is the count of the number of links away). Thus, if two outgoing links (A, C) from page X are separated by a third outgoing link (B), then C is a distance of two away from A. We find that, in general, the closer two randomly selected URLs are on their parent page, the more likely they are to share the same terms. This is corroborated by others [DH99, CDI98] who have observed that links to pages on similar topics are often clustered together on the parent page.

5.5.3 Anchor text to page text characteristics

Anchor text, by itself, has a mean length of 2.69 terms, slightly lower than the averages reported in studies of smaller Web page collections by Amitay [Ami97]. In comparison, titles are almost twice as long with a mean length of 5.27 terms (distributions of both are shown in Figure 5.10). However, we can also consider using text before or after the anchor text, and when we consider using up to 20 terms before and 20 terms after, we get a mean of 11.02 terms.

Figure 5.11 shows that anchor text is a much better descriptor for non-random pages. Even the similarity of anchor text to pages that are siblings of the targeted page gets scores at least an order of magnitude better than random.

We also tested the improvement that additional text around the anchor might provide to the similarity between link and target text. However, the mean TFIDF similarity scores (Figure 5.12) between the target page text and the anchor text plus varying amounts of surrounding text are almost constant. While there is some improvement as more text is added, it is relatively small.


Figure 5.10: Distribution of the number of terms per title and per anchor.

Figure 5.11: Measured similarity of anchor text only to linked text, text of a sibling of the link, and the text of random pages.


Figure 5.12: Performance of varying amounts of anchor text to linked text.

1) e mail            6) web site       11) contact us         16) Web sites
2) home page         7) you can        12) last updated       17) world wide
3) all rights        8) this site      13) more information   18) copyright 1999
4) rights reserved   9) this page      14) check out          19) Web page
5) click here       10) http www       15) new york           20) your own

Table 5.2: The twenty most common bigrams found after removing bigrams containing articles, prepositions, conjunctions, and various forms of the verb to be.

5.6 Related Work

While at first glance they may look different, the results presented above are compatible with those reported by Chakrabarti et al. [CDR+98]. They found that including fifty bytes of text around the anchor would catch most references of the term “Yahoo” for a large dataset of links to the Yahoo home page [Yah02]. Our interpretation is that while additional text does increase the chance of getting the important term(s), it also tends to catch more unimportant terms, lowering the overall term probability scores. While these results may not be particularly encouraging for the use of text surrounding an anchor, such text is occasionally quite useful (especially for link text made of low-content terms like “click here”).

Text on the Web is not the same as text off the Web. Amitay [Ami97, Ami99]

examines the linguistic choices that Web authors use in comparison to non-hypertext documents. Without going into the same detailed analysis, we did find some similar characteristics of Web pages. The bigrams “home page” and “click here” were the seventh- and thirteenth-most popular (of all raw bigrams), and certainly not typical bigrams of off-Web text. Interestingly, “all rights” and “rights reserved” were the eleventh- and twelfth-most popular, perhaps reflecting the increasing commercialization of the Web. Table 5.2 contains a list of the most frequent content-bearing bigrams (i.e., adjacent terms not containing articles, prepositions, conjunctions, or forms of the verb to be).

Amitay [Ami00, AP00, Ami01] has built a system called InCommonSense to automatically extract the descriptions that people write about other pages to use as summaries (e.g., for page descriptions in search engine results). This approach (implicitly using paragraph writing conventions) is likely to be more robust than the simplistic method described in this chapter, and so might provide the basis for improvement in the utility of surrounding text.

In many cases, it may be desirable to differentiate between different types of Web links (e.g., [HG98]) and to have a better model for determining internal versus external site links. Elsewhere [Dav00a], we have examined the problem of nepotistic links, in which links exist for reasons other than merit (e.g., advertising or navigational links rather than links created as recommendations). Such links are often (but not exclusively) created to influence search engine rankings that use some forms of link analysis. We demonstrated the potential of some simple methods for recognizing such links.

5.7 Summary

This chapter provides empirical evidence of topical locality of pages mirroring spatial locality in the Web — that is, WWW pages are typically linked to other pages with similar textual content. We found that pages are significantly more likely to be related topically to pages to which they are linked, as opposed to other pages selected at random, or other nearby pages. Furthermore, we found evidence of topical locality

within pages, in that sibling pages are more similar when the links from the parent are closer together.

We also found that anchor text is most similar to the page it references, followed by siblings of that page, and least similar to random pages, and that the differences in scores are easily statistically significant (because of the large number of samples) and often large (an order of magnitude or more). This suggests that anchor text may be useful in discriminating among unseen child pages. We note that anchor text terms can be found in the target page nearly as often as the title terms on that target page, but that the titles also have better similarity scores. We have pointed out that on average the inclusion of text around the anchor does not particularly improve similarity measures (but neither does it hurt). Finally, we have shown that titles, descriptions, and anchor text all have relatively high mean similarities to the corresponding document, implying that these page proxies represent at least part of the target page well.

Web links, as well as the anchor text used in those links, provide some measure of semantic linkage between pages. We have demonstrated that this semantic linkage, as approximated by textual similarity, is measurably present in the Web, thus providing the underpinnings for various Web systems, including search engines, focused crawlers, linkage analyzers, intelligent Web agents, and content-based usage predictors.

Chapter 6

Content-Based Prediction

6.1 Introduction

The World Wide Web has become the ultimate hypertext interface to billions of individually accessible documents. Unfortunately, for most people, retrieval times for many of those documents are well above the threshold for perception, leading to impressions of the “World Wide Wait.” In Chapter 2 we saw that Web caching is one approach that can effectively eliminate the retrieval time for many recently accessed documents, but it cannot help with documents that have never been visited in the past.

If future requests can be correctly predicted, those documents could potentially be pre-loaded into a local cache to provide fast access when requested by a user, even when they have not been accessed recently. This chapter proposes and evaluates such a predictive approach. While conventional predictive techniques look at past actions as a basis to predict the future, we consider an approach based on the hypertext characteristics of the Web.

In Chapter 5, we showed that Web pages are typically linked to pages with related textual content, and more specifically, that the anchor text was a reasonable descriptor for the page to which it pointed. Many applications already take advantage of this Web characteristic (e.g., for indexing terms not on a target page [Goo02, BP98] or to extract high-quality descriptions of a page [Ami98, AP00]). In this work we will examine in detail another application of this result.

Our motivating goal is to accurately predict the next request that an individual user is likely to make on the WWW. Therefore, this chapter examines the value of

using Web page content to make predictions for what will be requested next. Many researchers have considered complex models for history-based predictions for pre-loading (e.g., [PM96, Sar00]), but relatively few have considered using anything other than simplistic approaches to the use of Web page content. Unlike many techniques that do examine content, our approach does not noticeably interfere with the user experience at all — it does not ask for a statement of interest, nor does it modify the pages presented to the user. Instead, it can be used invisibly to improve user-perceived performance.

One naive content-based approach is to pre-load all links on a page. A slightly more intelligent approach is to pre-load, as time permits, the links of a page in HTML source order from first to last (corresponding generally to links visible from top to bottom). In this work we compare those approaches with an information retrieval-based one that ranks the list of links using a measure of textual similarity to the set of pages recently accessed by the user. In summary, we find that textual similarity-based predictions outperform the simpler approaches.
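To make the similarity-based ranking concrete, a small self-contained sketch follows. It uses a plain term-count cosine rather than the full TFIDF machinery of Chapter 5, and the names and the bag-of-words user context are assumptions, so it should be read as an illustration of the idea rather than our implementation.

```python
# Hedged illustration: rank a page's links by the cosine similarity between
# each link's anchor (plus surrounding) text and the terms of recently viewed
# pages, then prefetch in descending order of score.
import math
from collections import Counter

def cosine(a, b):
    dot = sum(c * b.get(t, 0) for t, c in a.items())
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank_links(links, recent_terms):
    """links: list of (url, anchor_terms); recent_terms: terms of recent pages."""
    context = Counter(recent_terms)
    scored = sorted(((cosine(Counter(terms), context), url) for url, terms in links),
                    reverse=True)
    return [url for _, url in scored]
```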

6.2 Background

As we described in Chapter 1, by transparently reducing network latencies and bandwidth demands, Web resource caching [Dav01c] has become a significant part of the infrastructure of the Web. Unfortunately, caching can only help when the objects requested are already present in the cache. Typically, caches contain objects that have been accessed in the past. Prefetching [PM96, KLM97], however, can be used speculatively to put content into the cache in advance of an actual request. One difficulty, however, is in knowing what to prefetch. Typical approaches have used Markovian techniques (e.g., [Duc99, Sar00]) on the history of Web page references to recognize patterns of activity. Others prefetch bookmarked pages and often-requested objects (e.g., [Pea02]). Still others prefetch links from the currently requested page [Lie97, CY97, Cac02a, Web02, Ims02, Kle99, IX00, PP97, Dee02].

However, prefetching all of the links of the current page is not a viable option, given that the number of links per page can be quite large, and that a prefetching


Figure 6.1: The distribution of the number of hypertext (non-embedded resource) links per retrieved HTML page.

system typically has only a limited amount of time to prefetch before the user makes a new selection. While likely heavy-tailed [CP95, CBC95], this “thinking-time” is typically less than one minute [KR01]. Likewise, prefetching content that is never used exacts other costs, as additional resources such as bandwidth are consumed. We do not consider the thinking time or prefetching costs further in this chapter (see [Bes96, JK98, AZN99, ZAN99] for various utility-theoretic approaches to balancing costs and benefits).

The difficulty of making a selection in content-based analysis is illustrated in Figure 6.1. It shows the distribution of the number of unique hypertext (non-embedded resource) links per retrieved HTML page for the dataset used in this chapter. The median number of links per page is 11, while the mean is 26, indicating the presence of some pages with a large number of links. (We describe this dataset further in Section 6.5.1.) Note that this histogram reflects the distribution of pages requested by users (which includes repeated retrievals), versus the more or less static distribution of pages on the Web (such as described by Bray [Bra96]).

The prefetching of content may cause problems, as not all content is cacheable (so prefetching it only wastes resources), and prefetching even cacheable content can abuse server and network resources, worsening the situation [Dav01a, CB98]. An alternative is to do everything but prefetch [CK00] — that is, to resolve the DNS in advance, connect


Figure 6.2: The distribution of the minimum distance, in terms of the number of requests back, at which the current page could have been prefetched.

in advance to the Web server, and even warm up the Web server with a dummy request.

Prefetching systems are difficult to evaluate, especially content-based prefetching ones (although we will consider one approach in Chapters 10 and 11) because even state-of-the-art proxy evaluation techniques (e.g., Polygraph [Rou02] as used in caching competitions [RWW01]) use artificial data sets without actual content. Even when using captured logs as the workload, real content will need to be retrieved, and will no longer be the same as the content seen when the logs were generated.

A more complex analysis is shown in Figure 6.2. It shows (by summing the fractions for columns 1 and 2-3) that approximately 46% of prefetchable pages (those that are considered likely to be cacheable) can be reached by examining the links of the current page and its two immediate predecessors in the current user’s request stream. (The bar marked unknown represents those pages that were never linked from any pages seen during our data collection.)

For comparison, the per-client average recurrence rate (that is, the rate at which requests are repeated [Gre93]) for clients making more than 500 prefetchable requests was 49.2%, and for all clients was 22.5% (since some clients made very few requests). This means that if each client utilized an infinite cache that passively recorded cacheable responses, the per-client average hit rate on prefetchable content would be less than

25% (corresponding to the average recurrence rate). However, if the clients additionally prefetched all perceived prefetchable links, the hit rates on prefetchable content would increase to almost 75%, because the prefetchable pages that would not be in the cache (i.e., the misses) correspond at most to the pages represented by a distance of unknown in Figure 6.2. Note that these are the results of per-user analysis on text/html responses not satisfied by the browser cache. Additionally, a single shared cache (that stores all user requests) would have a recurrence rate of 50.9% for cacheable resources (47.3% for all resources). From this analysis we conclude that there is significant potential for content-based prediction of future Web page requests when caching is considered.
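The recurrence rate itself is straightforward to compute from a client's request stream; the following sketch (with an assumed function name) shows the calculation that bounds the hit rate of a passive infinite cache.

```python
# Sketch of the per-client recurrence rate: the fraction of requests that
# repeat an earlier request by the same client.
def recurrence_rate(requests):
    """requests: ordered list of URLs requested by a single client."""
    seen = set()
    repeats = 0
    for url in requests:
        if url in seen:
            repeats += 1
        seen.add(url)
    return repeats / len(requests) if requests else 0.0
```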

6.3 Related Work

Chan [Cha99] outlines a non-invasive learning approach to construct Web user profiles that incorporates content and linkage, as well as other factors. It uses frequency of visitation, whether a page is bookmarked, time spent on the page, and the percentage of child links that have been visited to estimate user interest in a page. Predictions of user interest in a page are made by building a classifier with word phrases as features and labeled with the estimation of page interest. Thus, new pages can be classified as potentially being of interest from an examination of their contents. Chan’s goal is similar but not the same as ours. He is interested in modeling what the user might like to see, while we are concerned with what the user will do. (See Haffner et al. [HRI+00] for a theoretical discussion of the similarities of the two goals.) Unlike the approaches considered here, his method requires that anything that is to be considered for recommendation must already be prefetched (for analysis). Additionally, the methods he proposed were not incremental, nor did they change over time.

Chinen and Yamaguchi [CY97] describe and evaluate the performance of their Wcol prefetching proxy cache. Their system simply prefetches up to the first n embedded images and m linked documents. In a trace with a large number of clients, they found that if all targets were prefetched, a hit rate of 69% was achievable. Prefetching ten targets resulted in approximately a 45% hit rate, and five targets corresponded to about a 20% hit rate. Here we use a variant of their approach (termed original order) as one of the prediction methods that we will use for evaluation.

More sophisticated link ranking approaches are possible. Klemm’s WebCompanion [Kle99] is a prefetching agent that parses pages retrieved, tracks costs to retrieve from each host, and prefetches the pages with the highest estimated round trip time. Average access speedups of over 50% with a network byte overhead (i.e., wasted bandwidth) of 150% are claimed from tests using an automated browser.

Ibrahim and Xu [IX00] describe a neural net approach to prefetching, using clicked links to update weights on anchor text keywords for future predictions. Thus they rank the links of a page by a score computed by an artificial neuron. System performance of approximately 60% page hit ratio is claimed based on an artificial news-reading workload.

A number of others have considered the similar problem of Web page recommendation and Web page similarity. However, they (e.g., [DH99]) tend to use information that can only be acquired by the retrieval of additional pages, which would likely overwhelm the link bandwidth and negate the purpose of intelligent prefetching. Our approach, in contrast, uses only the content that the client has previously requested to base decisions on what to prefetch next.

Recently Yang et al. [YSG02] considered various approaches to hypertext classification. Their results are mixed, finding that identification of hypertext regularities and appropriate representations are crucial to categorization performance. They note, however, that “algorithms focusing on automated discovery of the relevant parts of the hypertext neighborhood should have an edge over more naive approaches.” While their study examines a related question, unlike our work it treats all links (both incoming and outgoing) equally, and does not consider the use of anchor or surrounding text.

One of the difficulties in comparing work in the area of content-based prediction and prefetching is the inability to share data. Some, like Chinen and Yamaguchi, evaluate their work by using their system on live data. Even if the trace of resources requested were available, such traces deteriorate in value quickly because the references contained within them change or disappear. Others, such as Klemm as well as Ibrahim and Xu, test their systems with relatively contrived methods, which makes it difficult to ascertain general applicability. Our initial approach here has been to capture a full-content Web trace so that analysis can be performed offline. However, because of privacy concerns, we (like other researchers) are unable to provide such data to others. In Chapter 10 we will propose a mechanism capable of comparing multiple prefetching proxy caches simultaneously. We will develop a prototype system in Chapter 11, but it is only applicable to fully realized prefetching systems.

6.4 Content-Based Prediction

In many user interfaces, multiple user actions may appear atomic and thus may be inaccessible for analysis. For those environments, prediction methods are often limited to those that can recognize and extrapolate from patterns found in a history log (e.g., UNIX shell commands as we described in Chapter 3, or telecommunication switch alarms [Wei01]). In contrast, the Web provides easy access to the content being provided to the user. Moreover, this content prescribes boundaries within which the user is likely to act. As described in Chapter 5, studies [CP95, TG97, Dav99c] have shown that the user is likely to select a link on the current page between 80 and 90% of the time. Therefore, the set of links shown to the user is a significant guide to what the user is likely to request next.

In addition, the content provides possible next steps that history-based prediction approaches cannot. A user model based on past experience cannot predict an action never previously taken. Thus, a content-based approach can be a source of predictions for new situations with which history cannot help.

While a user does provide some thinking time between requests in which prefetching may occur, it is limited, and so there may be insufficient time to prefetch all links. Moreover, even if there were sufficient time, the user and network provider may find the extra bandwidth usage to be undesirable. Thus, the approach is typically not to predict all links, but instead to assign some weight to each link and use only the highest-scoring links for prediction (and subsequent prefetching).

A prefetching system, whether integrated into a browser, or operating at a proxy, has access to the content of the pages served to the user. Given a particular page, the external links that it contains can be extracted. The anchor and surrounding text can be used as a description of the page (as shown in [Dav00c] and used in [Ami98, AP00]). Given a set of links and their descriptions, the problem is then how to rank them.

Ideally, we would have a model of a user’s interest and the current context in which the user is found (e.g., as in [BJ01]). One approach is to explicitly ask the user about their current interest, as done in WebWatcher [AFJM95, JFM97] and in AntWorld [KBM+00], or to have the user rate pages (as in Syskill and Webert [PMB96, PB97] and AntWorld). We could also consider modifying the content seen by the user to give hints or suggestions of recommended or preloaded content (as in Letizia [Lie95, Lie97], WebWatcher, QuIC [EBHDC01], the prefetcher implemented by Jiang and Kleinrock [JK97], and common uses of WBI [BMK97]).

However, our preference is to build a user model as unobtrusively as possible (as in [Cha99]) to perform prefetching behind the scenes. For the purposes of this study, this means we cannot ask the user explicitly about his or her interest, nor change the content, but it does enable us to work offline with logs of standard usage. Therefore, we will not consider obtrusive approaches further here.

Instead we will use the textual contents of recently requested pages as a guide to the current interests of the user. This method allows us to model a user’s changing interests without mind-reading or explicit questions, and is intended to combine with other sources of predictions to generate an adaptive Web prefetching approach [Dav99a]. How well it performs is the subject of the next section.

6.5 Experimental Method

As mentioned earlier, experimental evaluation of a content-based prefetching system is difficult. Rather than deploying and testing a complete prefetching system, we have elected to test the content-based prediction methods described above in an offline manner. For comparison, one baseline we will use is the accuracy of a random ranking.

6.5.1 Workload collection

We implemented a custom Java-based non-caching HTTP/1.0-compliant proxy to monitor and record the full HTML content and all headers of user requests and responses. We captured approximately 135,000 requests, of which just over a third generated HTML responses. (The rest correspond to embedded resources such as images and non-data responses such as errors or content moved.) Users were predominantly computer science students and staff from Rutgers University. More than fifty distinct clients used the service from August 1998 to March 1999.

While covering close to seven months, this trace is rather small, and predominantly reflects the traffic of university users. However, with concerns over privacy, the choice was made to recruit explicit volunteers instead of surreptitiously recording the activity of all users of an existing cache or packet snooping on a network link (e.g., as in [FCD+99, Fel98]).

This log does not distinguish users; it only distinguishes clients based on IP address, and so if multiple users of this proxy operated browsers on the same machine (a possibility under UNIX) or behind the same proxy, those users’ requests could be interleaved. Likewise, users with dynamic IP addresses are seen as distinct users on each session, and multiple users (if assigned the same IP address at different times) may seem like the same user returning over multiple sessions. We believe the likelihood of multiple users per IP (whether because of a shared proxy or the re-use of dynamic addresses) is small for this data. More importantly, though, the log does not reflect the requests made by users that were satisfied by browser caches. Users were directed to use the proxy but were not told to disable browser caching.

The data was recorded in several segments, generally corresponding to separate runs of the proxy. The first segment, representing about 5% of the log, was used for exploratory analysis and was manually examined. The specific adaptations made to the data, described below, derive from the measurement and analysis of this initial segment. Overall results shown in graphs will reflect measurements over the entire dataset. As a result, any effects of fitting to the initial data are limited to just 5% of the dataset.

Figure 6.3: Sequence of HTML documents requested. The content of recently requested pages is used as a model of the user’s interest to rank the links of the current page to determine what to prefetch.

6.5.2 Data preparation

Our custom proxy attempted to record all HTTP [FGM+99] headers of requests and responses, and the contents of all HTML responses. These were written in chronological order, thus interleaving requests when multiple users were active. Since we are interested in predictions from sequences of requests for individual users, we extracted the HTML records and sorted them by client and timestamp. From these sorted requests, we built sets of up to five HTML responses. (While a larger number of responses might have provided more data, we wanted to model the current user interest, which might be diluted with data from further in the past.) Thus, for example, responses d1, d2, d3, d4, d5 would be the first set, from which we would attempt to predict d6. Likewise, d2, d3, d4, d5, d6 corresponded to the second set, from which we would attempt to predict d7. However, occasionally error messages (e.g., HTTP response codes 403, 404), redirections (e.g., 301, 302), and content unchanged (code 304) ended up being included, as Web servers sometimes described these as of type text/html as well.
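As a rough illustration of this preparation step, a sketch of the grouping and windowing is shown below; the record layout (dictionaries with client, time, and url fields) is an assumption made for the example, not the format of our logs.

from itertools import groupby

def build_prediction_cases(records, window=5):
    # Sort HTML responses by client and timestamp, then emit
    # (preceding pages, target) cases, e.g., d1..d5 predicting d6.
    ordered = sorted(records, key=lambda r: (r["client"], r["time"]))
    for _, group in groupby(ordered, key=lambda r: r["client"]):
        seq = list(group)
        for i in range(window, len(seq)):
            yield seq[i - window:i], seq[i]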

Since we had stored the content, whenever we encountered a 304 content-unchanged response for content that we had previously recorded, we used our copy of the earlier content. URLs generating the remaining 304 responses could still be used as prediction targets, but not as content for future predictions (since that content had been sent before our measurements began, we did not have it).

We discarded those responses with HTTP error codes (i.e., response codes ≥ 400), but retained redirections (e.g., response codes 301, 302) since links were made to one URL, which would then redirect elsewhere. In this case, the original log might have included retrieval events d1, d2, d3, d4, d5, r6, d6, d7 (where r6 corresponds to a redirection to d6), but we would instead generate the two cases d1, d2, d3, d4, d5 predicting r6 and d2, d3, d4, d5, d6 predicting d7. We use r6 as a target of prediction since any link followed (if present at all) would have been made to r6, but there is no usable content in r6, so we do not use it for text analysis. Unfortunately, as mentioned earlier, we only wanted HTML responses, but occasionally we would see redirections for non-HTML content, such as banner advertising that would use one or more redirections. We manually filtered URLs for some of the more popular advertising hosting services to minimize this effect.

The pages were parsed with the Perl HTML::TreeBuilder library. Text extraction from the HTML pages was performed using custom code that down-cased all terms and replaced all punctuation with whitespace so that all terms are made strictly of alphanumerics. Content text of the page includes title but not META tags nor alt text for images. URLs were parsed and extracted using the Perl URI::URL library plus custom code to standardize the URL format (down-casing host, dropping #, etc.) to maximize matching of equivalent URLs. The basic representation of each textual item was bag-of-words with term frequency [SM83].

6.5.3 Prediction methods

We present a total of four methods for ranking the URLs of the current page for prediction. Two are simple methods: the baseline random ordering, and original rank ordering (i.e., first-occurrence order in the page). The other two rank each link based on the similarity of the link text to the combined non-HTML text of the preceding pages (as illustrated in Figure 6.3). To measure the similarity between two text documents (D1, D2), we use a very simple metric¹:

TF(w, Di) = number of times term w appears in Di

¹Other variations of this similarity metric (e.g., to include factors for document length) were tested on the first data segment, but they performed similarly or worse. We make no claims as to the optimality of this similarity metric for this task.

Text-Sim(D1, D2) = Σ over all terms w of TF(w, D1) × TF(w, D2)

The first method (termed similarity in our tests) uses the highest similarity score (in the case of multiple instances of the same link to a URL) of the anchor and surrounding text to the preceding pages. The second (termed cumulative) is identical, except that it sums the similarity scores when there are multiple links to the same target (giving extra weight to such a URL).
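A minimal sketch of this scoring follows, assuming the simple down-casing and alphanumeric tokenization described later in this section; the function names and data layout are illustrative only.

import re
from collections import Counter

def term_freq(text):
    # Bag-of-words with term frequency: down-case, keep only alphanumerics.
    return Counter(re.sub(r"[^0-9a-z]+", " ", text.lower()).split())

def text_sim(d1, d2):
    # Text-Sim(D1, D2): sum over all terms w of TF(w, D1) * TF(w, D2).
    tf1, tf2 = term_freq(d1), term_freq(d2)
    return sum(tf1[w] * tf2[w] for w in tf1 if w in tf2)

def rank_links(links, interest_text, cumulative=False):
    # links: list of (url, anchor-plus-surrounding text) pairs.  A URL linked
    # more than once keeps its highest score ("similarity" method) or the sum
    # of its scores ("cumulative" method).
    scores = {}
    for url, link_text in links:
        s = text_sim(link_text, interest_text)
        if cumulative:
            scores[url] = scores.get(url, 0) + s
        else:
            scores[url] = max(scores.get(url, 0), s)
    return sorted(scores, key=scores.get, reverse=True)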

Links recognized include the typical a href, but also area href, and frame src. We did not parse embedded JavaScript to find additional links, although in reviewing the initial data segment we did find JavaScript with such links.

Since links are typically presented within some context, in addition to anchor text we tested the addition of terms before and after the anchor. For example, if the HTML content of a page were:

The ACM Conference will take you beyond cyberspace.

where ACM is the anchor text of a link, then 0 additional terms would provide just {ACM}. If we permit 5 additional terms on each side, then we would get the set {The, ACM, Conference, will, take, you, beyond} to be used for comparison against the text of the preceding pages. Note that text around an anchor may be used by more than one link, in the case that the window around an anchor overlaps that of another anchor. In experiments on the first data segment, we tested the use of up to 20, 10, 5, and 0 additional terms and found that in most cases, when more terms were used, performance was slightly better. Thus we use as many as twenty additional terms before and after the anchor text. However, the text around a URL was never permitted to extend into another URL’s anchor text.
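One way to implement this windowing is sketched below; here tokens is the page text as a list of terms and anchor_spans maps each link to its (start, end) token indices, both of which are assumptions made for the example.

def anchor_window_terms(tokens, anchor_spans, k=20):
    # Collect each link's anchor terms plus up to k terms on either side,
    # stopping when the window would run into another anchor's own text.
    owned = set()
    for start, end in anchor_spans.values():
        owned.update(range(start, end))

    windows = {}
    for link, (start, end) in anchor_spans.items():
        before, i = [], start - 1
        while i >= 0 and len(before) < k and i not in owned:
            before.append(tokens[i])
            i -= 1
        before.reverse()
        after, i = [], end
        while i < len(tokens) and len(after) < k and i not in owned:
            after.append(tokens[i])
            i += 1
        windows[link] = before + list(tokens[start:end]) + after
    return windows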

Furthermore, we noticed that when exploring pages within a site, certain elements of that site were often repeated on every page. Examples include a copyright notice with a link to a use-license page. In this case, such repeated elements would get undue weight over non-repeated text. As a result, a link containing the repeated text (the link to a license, in the case above) would be ranked highly, even if irrelevant to the true content of the pages. To combat this effect, we tested a variation that no longer used all text. Instead we used a Perl version of htmldiff [DB96] to compare sequentially requested pages and generate only the text that was present on the second (more recent) page. Thus, we used the first page, plus the textual differences to the second page, plus the differences between pages 2 and 3, etc. In this way we hoped to eliminate significant emphasis on repeated structure in the pages.

We used the well-known Porter [Por97] stemmer (since it helped slightly), but did not use stop-word elimination (since it either performed slightly worse or made no difference). The non-HTML text (including title, keywords, and meta description) of the preceding pages was combined to serve as a “document” representing the current user interest against which we would measure similarity. However, since parts of the current page are also represented in the anchor texts pointing to the target documents, we tested the use of the current page in the model of user interest, and found that it slightly decreased performance on the initial segment, and so we do not include it. Thus, in summary, we use the non-repeating text of the preceding four pages as our model of current user interest against which we measure similarity, as illustrated in Figure 6.3.

6.5.4 Evaluation

Given the ranking methods described above, we will evaluate their predictive accuracy (the fraction of times that they include the correct next request in their prediction) over the entire dataset. In most cases, however, we will scale the results to show the fraction predicted correctly out of those possible to get correct (defined below). We will consider variations in the number of predictions by evaluating algorithms that use their best 1, 3, or 5 predictions. Additionally, we will consider the case in which the clients have an infinite cache into which predictions are prefetched. With such a cache, success is measured not by whether the immediate prediction was correct, but whether the requested object is present in the cache (a test arguably closer to real world considerations).
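The evaluation just described can be summarized with a short sketch; each case holds the preceding pages, the links of the current page, and the URL actually requested next (all names are illustrative, and the real evaluation also restricts itself to prefetchable targets).

def evaluate(cases, ranker, k=3, with_cache=False):
    # Without a cache, a case is correct if the next request is among the
    # top-k ranked links; with an infinite cache, every prediction ever made
    # is retained and a case is a hit if the next request is in the cache.
    correct, total, cache = 0, 0, set()
    for preceding, links, actual in cases:
        predictions = ranker(preceding, links)[:k]
        if with_cache:
            cache.update(predictions)
            correct += actual in cache
        else:
            correct += actual in predictions
        total += 1
    return correct / total if total else 0.0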

Figure 6.4: Fraction of pages that are uncacheable, found as a link from a previously requested page (potentially predictable), or not found as a link from the previous page.

The ultimate goal of this effort is to use the best prediction method to provide suggestions for prefetching. It is generally not possible to prefetch the results of form entry, and it is potentially undesirable to prefetch dynamically generated pages, so we would like to disallow predictions for such pages (i.e., all responses generated by POSTs or with URLs containing “cgi” or “?”). Those responses represent 28% of all HTML collected. Additionally, we can recognize those cases where the next cacheable URL requested was not present in the preceding page, making a correct prediction an impossibility in another 47%. The remaining 25% have the potential to be predicted correctly by choosing from the links of the current page (see Figure 6.4). Approximately 2.8% of those pages (0.7% of the total) had only one URL, and so will be predicted correctly under all prediction mechanisms.
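The rule excluding form results and likely dynamic pages amounts to a simple predicate; a sketch is below (no heuristics beyond those named in the text are implied).

def is_prefetchable(method, url):
    # Exclude responses generated by POSTs and URLs containing "cgi" or "?",
    # as described above.
    return method.upper() != "POST" and "cgi" not in url and "?" not in url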

Typically a prefetching system has time to prefetch more than one page, and so we consider the case in which we can select the top three and five most-likely URLs and mark the set as “correct” if one of them is the requested page. Under this arrangement, the set of potentially predictable pages stays the same (25%), but the number of default or “free” wins grows since there will be more cases in which the number of distinct links available is within the size of the predicted set.

6.6 Experimental Results

6.6.1 Overall results

Figure 6.5 shows the overall accuracy for each of the content prediction algorithms, as a fraction of predictions that are possible to get correct from the last page of links.

Figure 6.5: Overall predictive accuracy (within the potentially predictable set) of content prediction methods when considering links from the most recently requested page: (a) top one prediction, (b) top three predictions, (c) top five predictions.

Figure 6.6: Potential for prediction when considering links from the most recently requested page. (The plot shows the fraction of pages versus the number of links per page, for all links and for the maximum predictions possible.)

It depicts performance for the cases in which only the highest-scoring prediction is evaluated as well as when the top-3- and top-5-highest-scoring predictions are used (note the different y-axis scales). In addition to being included within the performance for each algorithm, the “free” cases are also plotted to show a lower bound (recall that the free cases are pages in which the number of links in the page is no larger than the number of predictions permitted, thus automatically correct).

It can be seen that the similarity and cumulative ranking methods outperform the original rank ordering, and all three outperform the baseline random order. This suggests that the similarity-based approaches are beneficial for this link-ranking task. It also suggests that, either by custom or by design, users tend to select links earlier in the page. The cumulative approach edges out the similarity approach for top-one, but similarity is best for top-three and top-five. This is likely because the top-three and top-five prediction sets are already likely to include the entries that appear more than once (the entries that benefit from the cumulative emphasis).

6.6.2 Distributions of performance

This chapter examines the predictability of Web page requests based on the links of the current page. In this section we consider the distribution of successful predictions with respect to the number of links on the current page.

Figure 6.7: Performance distribution for content prediction algorithms when allowing guesses from the most recently requested page: (a) top one prediction, (b) top three predictions, (c) top five predictions. (Each panel plots the fraction of possible pages versus the number of links per page for the similarity, original, and random orderings, along with the maximum predictions possible.)

Figure 6.8: Predictive accuracy (as a fraction of all pages) of content prediction methods with an infinite cache compared to no prediction.

Figure 6.6 compares the overall distribution of the number of links per page with the distribution of cases in which the predicted page is found in the links of the last page. This figure demonstrates the limits of strict prediction from the links of the current page: there are many requests that do not appear within that set.

Focusing on those links that do appear within the current page, Figure 6.7 shows the distribution of the number of correct predictions per page size for each method plus the distribution of cases that can possibly be predicted correctly. Note that for clarity we have omitted the cumulative approach from these and further graphs, as its performance is closely tied with that of the similarity method. For most points of Figure 6.7(a), in which a single prediction is permitted, the performances of the various approaches are almost indistinguishable, but with more allowed predictions, more variation becomes visible.

6.6.3 Results with infinite cache

Since the system prefetching Web pages would be putting them into a cache, we consider briefly here the performance of prediction approaches in such a context. Figure 6.8 shows a summary of the predictive accuracy of two ranking methods assuming the use of an infinite cache, and compares them to an infinite cache without prefetching. While caching under these circumstances is obviously a significant contributor to performance, the predictive systems are able to improve on the caching-only performance, particularly when more than one prediction is permitted.

Figure 6.9: Performance distribution for original order with and without an infinite cache. (The plot shows the fraction of all pages versus the number of links per page for the total page distribution, the infinite cache combined with top-1, top-3, and top-5 predictions, the infinite cache alone, and prediction alone.)

Figure 6.10: Performance distribution for similarity ranking with and without an infinite cache. (The plot shows the same curves as Figure 6.9, but for the similarity ordering.)

Figures 6.9 and 6.10 provide performance distributions of the original rank orderings and similarity orderings, respectively. In these figures, note that the top curve represents the total page distribution, since we are no longer limited to the pages potentially reached from the last ones retrieved. The original (non-cached) performance is shown (at the bottom), while the performance using an infinite cache is much higher. However, the cache does not do all the work, as the performance of an infinite cache alone is visibly below the cache+prefetching lines.

Figure 6.11: Comparison of performance distributions for original order and similarity with an infinite cache.

Figure 6.11 compares the cached performance of both the original rank ordering and similarity based ordering (we omit the top-three curves for clarity). The performance of cached original ranking and similarity ranking are nearly identical in most points; only for one point (at 32-63 links per page) does similarity rise much above the original ordering (when five predictions are allowed).

6.7 Discussion

Recall that the non-cached results presented are in terms of potentially predictable pages. This is, in fact, a relatively small fraction of all requests generated. For example, the best predictive method was shown to be the similarity ranking when predictions of the top 5 links are permitted, achieving an accuracy of approximately 55% (a relative improvement of close to 30% over random link selection). However, this represents success in only 14% of all cases, since only a quarter of all pages are possible to predict in this manner. In the case of an infinite cache we are able to provide hits for 64% of all requests (a 40% improvement over a system without prefetching). Actual performance is likely to lie in between, as caches are finite and prefetched content can expire before it is needed.

Perhaps more importantly, this analysis is of a relatively small trace, and may not be representative of users in non-academic environments. However, more extensive analysis would require capturing content from a larger, more general population (such as that of an ISP), and is likely to raise significant privacy concerns.

The prefetching examined here is only for HTML resources. However, HTML resources represent only a fraction of all requests made by users. Most requests are for embedded resources (such as images, sounds, or Java applets, or the results of JavaScript) that are automatically retrieved by the browser. We made the choice to ignore such resources as most of them are easy to predict by examining the HTML of the page in which they are embedded. This choice has a drawback, though, as users do indeed make requests for non-HTML resources, such as PDF and PostScript files (which can often be quite large), plain text resources, and program downloads, albeit much less frequently.

In practice, a prefetching system will have time to fetch a variable number of resources. We have examined only three points in the range: allowing one, three, and five predictions. We believe that the typical number will fall within this range, however, as it will likely be useful to prefetch the embedded resources of prefetched pages, and some resources and pages will be cached from previous retrievals.

While extensive, these tests are not comprehensive; we have not attempted to disprove the possibility of other algorithms (text-based or otherwise) outperforming those presented here. In particular, we believe that systems with a stronger model of the user’s interest (e.g., AntWorld or WebWatcher) could provide better prediction, but alternately may lose when the user’s interest shifts (as is often the case when surfing the Web). For comparison, recall that WebWatcher [AFJM95, JFM97] compares the user’s stated goal, the given page, and each link within it to other pages requested by previous users, their stated goals, and the links they selected. Using assessed similarity on this information in a restricted domain, it picked the three most likely links, among which the selected link was present 48.9% of the time. In a second, smaller trial, WebWatcher scored just 42.9% (as compared to humans at 47.5% on the same task). Our most comparable experiments have shown a lower accuracy (approximately 40%), but additionally they have been applied to a general task in an unlimited domain, with no user goals nor past users for guidance. Thus, these experiments provide evidence of the applicability of content-based methods for predicting future Web page accesses.

6.8 Summary

This chapter has examined the performance of Web request prediction on the basis of the content of the recently requested Web pages. We have compared text-similarity-based ranking methods to simple original link ordering and a baseline random ordering and found that similarity-based rankings performed 29% better than random link selection for prediction, and 40% better than no prefetching in a system with an infinite cache. In general, textual similarity-based rankings outperformed the simpler methods examined in terms of accuracy, but in the context of an infinite cache, the predictive performance of similarity and original rankings are fairly similar.

Most proposed Web prefetching techniques make predictions based on historical data (either from an individual’s past or from the activity of many users, as hints from servers or proxies). History-based prefetching systems can do well when they have a model for the pages that a user is visiting. However, since users often request unseen (e.g., new) resources (perhaps 40% of the time [TG97]), history may not always be able to provide suggestions, or may have insufficient evidence for a strong prediction. In such cases, content-based approaches, such as those presented here, are ideal alternatives, since they can make predictions of actions that have never been taken by the user and potentially make predictions that are more likely to reflect current user interests.

Chapter 7

Evaluation

In the previous four chapters of this dissertation, we discussed two complementary approaches to Web prediction: history-based prediction and content-based prediction. The next five chapters are primarily concerned with the evaluation of prefetching techniques. While there are various procedures for evaluating prediction methods, they are likely to be insufficient as they perform measurements in isolation from the system in which they will be embedded.

7.1 Introduction

This chapter is particularly concerned with the general task of evaluation. Accurate evaluation of systems and ideas is central to scientific progress. Evaluation, however, is usually more than a binary result – not just success or failure. Typically, an evaluation of performance is really that of a measure suited for comparison with other systems or variations of the same system.

However, in all evaluations there is some measure of uncertainty in the results (especially for comparison), depending on what variables could be controlled. For example, system A might provide a large improvement in performance over system B for some workload. But if that workload is not sufficiently representative of the workloads in which A or B might be applied, then the experiment may have failed to truly determine the better system.

This is a common charge against the use of benchmarks. Conversely, benchmark proponents would cite the need for a standard test so that the results of many systems can be compared. However, even when a standard benchmark, such as SPECweb96 [Sys], is used, the test environment needs to be identical. Performance on a benchmark can be affected by many variables, as pointed out by Iyengar and Challenger [IC97]. These include different hardware, software (such as the operating system and networking stack), and configuration (parameter tuning, logging, disk assignments and partitioning, etc.). Thus, it is often difficult to fairly compare benchmark results when published by different sources.

A related problem arises when a standard benchmark is the primary mechanism against which a system is evaluated. To show performance improvements, system architects will often tune their system specifically to the benchmark in question (e.g., optimizing memory cache design [LW94a]). Unless the benchmark is particularly representative of real-world workloads, such optimizations result in even less confidence in the overall applicability of the evaluation results.

Thus, a valid question is whether a perfect evaluation methodology is possible. In an ideal evaluation methodology, the test would provide:

• accuracy — the evaluation correctly measures the performance aspect of interest.

• repeatability — the same test performed at some other time would provide the same results.

• comparability — evaluation results of one system could be fairly compared to the results of another.

• applicability — the results of evaluation fairly reflect performance of the system in practice.

Finally, most systems do not exist in isolation. Regardless of whether a system or program interacts with people or other systems, those interactions may change when system performance changes. This consideration raises new questions. Sometimes these effects can be managed, as when the other systems are also under experimental control, in which case the performance of the system as a whole is being evaluated as changes within one subsystem are made. However, when they are not under experimental control (such as systems that interact with others across the Internet, or that are used by people), the workload could change as performance changes. For example, if response time improves, inter-request times may drop and more requests may be made. Likewise, if response time gets worse, requests may be canceled or skipped entirely.

In the remainder of this section, we consider some of the issues raised above, but with a focus on the domain of Web and proxy cache evaluation.

7.1.1 Accurate workloads

Historically, the typical workload in Web caching research is that of a log of real usage, either from a Web server, or from a Web proxy, as in Figure 7.1. However, it is also possible to collect logs by packet sniffing (as in [WWB96, MDFK97, GB97]) or by appropriately modifying the browser to perform the logging (e.g., [TG97, CBC95, CP95]).

Those logs are then replayed within a simulation to measure cache performance. Unfortunately, in many cases, the representativeness of the logs is never questioned. Some of the most commonly used traces are logs of Web servers in the mid-1990s. While indeed those logs were once representative of real-world activity, they are not likely to be so today. While Web traffic and the number of deployed Web servers and proxies have exploded, the public availability of real-world traces has apparently decreased because of privacy concerns.

Assuming the availability of a representative trace of activity, that trace will need to be applied to a simulation. If the concerns of side-effects of replaying Web requests (which we will address in Chapter 12) are ignored, the possibility exists to replay the Web requests across a live network. In either case, the chances are that the environment in which the trace is being re-played will not exactly match the environment under which the trace was captured. For example, network properties (such as the bottleneck bandwidth to the rest of the Internet) may have changed (perhaps even intentionally). This suggests, at minimum, that the performance of the re-enacted system cannot be compared to the system under which the trace was captured. It also brings up the potential (as discussed above) that the trace may not be representative of the new environment, or of the new system. If the performance as seen by the user changes, the user may alter the behavior from that which might have otherwise occurred.

893252015.307 14 TCP_HIT/200 227 GET http://images.go2net.com/metacrawler/images/transparent.gif - NONE/- image/gif
893252015.312 23 TCP_HIT/200 4170 GET http://images.go2net.com/metacrawler/images/head.gif - NONE/- image/gif
893252015.318 38 TCP_HIT/200 406 GET http://images.go2net.com/metacrawler/images/bg2.gif - NONE/- image/gif
893252015.636 800 TCP_REFRESH_MISS/200 8872 GET http://www.metacrawler.com/ - DIRECT/www.metacrawler.com text/html
893252015.728 355 TCP_HIT/200 5691 GET http://images.go2net.com/metacrawler/images/market2.gif - NONE/- image/gif
893252016.138 465 TCP_HIT/200 219 GET http://images.go2net.com/metacrawler/templates/tips/../../images/pixel.gif - NONE/- image/gif
893252016.430 757 TCP_REFRESH_HIT/200 2106 GET http://images.go2net.com/metacrawler/templates/tips/../../images/ultimate.jpg - DIRECT/images.go2net.com image/jpeg

Figure 7.1: Excerpt of a proxy log generated by Squid 1.1, which records the timestamp, elapsed-time, client, code/status, bytes, method, URL, client-username, peerstatus/peerhost and objecttype for each request. It is also an example of how requests can be logged in an order inappropriate for replaying in later experiments.

There are additionally some non-user actions that will change. Requests for embed- ded resources (images, sounds, animations) are made at the time that the browser has parsed that portion of the encompassing page, and so when page retrieval performance changes, the request time for embedded resources will also change.

Even if the environment and system were identical to the original, the characteristics of playback can cause evaluation concerns. At one level, a Web log represents just a sequence of requests. If the only goal is to simulate replacement policies on a finite cache of unchanging objects, that sequence could be replayed at any rate and still produce identical results. Playback at maximum speed can be found in the research literature. However, it would not be realistic in that Web objects in the real world do change and have particular expiration times. This implies that a slower playback might have a higher probability of requesting objects that have expired, and that a faster playback would generate more cache hits than are warranted. Changes in playback speeds also imply changes in browser performance, connectivity, and user think times between requests, and in general may result in overall performance differences because of higher or lower demands for resources such as bandwidth.

Similarly, request inter-arrival rates need to be maintained. While a fixed request rate might be simpler, the original request arrival times provide a strong contribution to the variation in network and server loads experienced. In addition, interactive prefetching systems rely on the existence of such “thinking time” (the time between page requests) to provide time in which to prefetch the objects likely to be requested next.

The above discussion of workloads has concentrated on the use of captured workloads. One alternative is the use of artificial workloads. Such workloads are designed to capture the relevant properties of real-world workloads, such as request inter-arrival and recurrence rates. Past research [CP95, CBC95] suggests a heavy-tailed distribution of thinking times with a mean of 30 seconds, which can be replicated in carefully created workloads. Artificial workloads provide the advantage (and disadvantage) of being able to generate workloads that do not arise in practice. This is helpful when stress-testing a system for anticipated workloads, but also opens questions for evaluation. For example, even advanced artificial workloads (e.g., those in Web Polygraph [Rou02]) are unable to replicate the Web page relationships that the hypertext links of the Web provide for real datasets.
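For illustration, heavy-tailed think times with a chosen mean can be drawn from, for example, a Pareto distribution; the shape parameter below is an assumption made for the sketch, not a value taken from the cited studies.

import random

def think_times(n, mean=30.0, alpha=1.5):
    # Pareto-distributed think times (seconds), scaled so the mean is `mean`.
    # For shape alpha > 1, the mean of scale * paretovariate(alpha) is
    # scale * alpha / (alpha - 1), so solve for the scale.
    scale = mean * (alpha - 1) / alpha
    return [scale * random.paretovariate(alpha) for _ in range(n)]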

7.1.2 Accurate evaluation

As mentioned previously, accurate evaluation requires that the evaluation mechanism correctly measures the performance aspect of interest. This is sometimes overlooked in favor of evaluating metrics that are available. For example, early caching research often favored reports of increasing object hit rates. However, hit rates were never the real aspect of interest. Instead, they were a substitute metric for the real interests — reductions in bandwidth usage and improvements in user-perceived response times. In this case, byte hit rates (or better, a measure of all bytes received and generated by a system) would provide a real metric of byte savings. Similarly, user-perceived latencies and response times require significantly more careful evaluation of a more detailed system.
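The distinction between object and byte hit rates is easy to make concrete: given a trace annotated with whether each request was served from the cache and how large it was, both rates fall out of one pass (a sketch only; the tuple layout is an assumption).

def hit_rates(log):
    # log: iterable of (was_hit, size_in_bytes) entries; assumes a non-empty log.
    entries = list(log)
    hits = sum(1 for was_hit, _ in entries if was_hit)
    hit_bytes = sum(size for was_hit, size in entries if was_hit)
    total_bytes = sum(size for _, size in entries)
    return hits / len(entries), hit_bytes / total_bytes

Many small cached objects can yield a high object hit rate while the byte hit rate, and hence the bandwidth actually saved, remains modest.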

Finally, since Web systems are typically deployed for use by people, a human study may be required. For example, if the goal is strictly to measure reduced bandwidth usage, the change in user-perceived latency associated with caching usage might prevent accurate measurement. Likewise, measures such as user-perceived response times are still one step removed: user satisfaction might be the ultimate goal, whose measurement would require a user study.

7.1.3 Accurate simulation

A prerequisite of accurate evaluation is accurate simulation. In this case, the simulation must not only be correct (i.e., without errors) but appropriately model various factors that contribute to the desired performance. In Web caching research, desired performance includes the three major claimed benefits of caching on the Web: reduced bandwidth utilization, reduced server loads, and reduced client-side response times. Web bandwidth utilization can be modeled effectively with correct simulation, but the latter two may require extra care. To accurately model server loads, it may be necessary to model internal operating system and hardware characteristics (such as connection queue lengths and disk seek times). For accurate client-side response times, it may be necessary to model transport-layer characteristics such as TCP slow start effects and user behaviors such as request cancellations (e.g., [FCD+99, CB98]). In both cases, accurate modeling of connection characteristics (multiple parallel connections, persistent connections) will be essential. These effects will need to be integrated into what might otherwise have been caching-only simulators, if accurate simulation of response times is to be achieved. Likewise, for prefetching simulators, the costs of prefetching must be included, such as the time and cost of receiving server hints.
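As a toy example of why such modeling matters, the following sketch estimates the time to receive a response over a fresh TCP connection by counting slow-start rounds; it ignores connection setup, loss, delayed acknowledgments, and receiver windows, and it is not the model used by the simulator described later in this dissertation.

def slow_start_response_time(size_bytes, rtt_s, bandwidth_bps, mss=1460, init_cwnd=2):
    # Count one round-trip per congestion window's worth of segments while the
    # window doubles (slow start), then add serialization time for the bytes.
    segments = -(-size_bytes // mss)          # ceiling division
    rounds, cwnd, sent = 0, init_cwnd, 0
    while sent < segments:
        sent += cwnd
        cwnd *= 2
        rounds += 1
    return rounds * rtt_s + size_bytes * 8.0 / bandwidth_bps

Even a small object costs several round-trip times on a fresh connection, something a simple size-divided-by-bandwidth estimate would miss entirely.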

7.2 Cache Evaluation Methodologies

Since proxy caches are increasingly used around the world to reduce bandwidth requirements and alleviate delays associated with the World-Wide Web, this section describes a space of proxy cache evaluation methodologies and places a sampling of research within that space. The primary contributions of this section are threefold: 1) definition and description of the space of evaluation techniques; 2) appraisal of the different methods within that space; and 3) a survey of cache evaluation techniques from the research literature.

                                         Workload Source
Algorithm Implementation                 artificial    captured logs
simulated systems/network                A1            A2
real systems/isolated network            B1            B2

Table 7.1: A space of traditional evaluation methodologies for Web systems.

It is useful to be able to assess the performance of proxy caches, both for consumers selecting the appropriate system for a particular situation and also for developers working on alternative proxy mechanisms. By evaluating performance across all measures of interest, it is possible to recognize drawbacks with particular implementations. For example, one can imagine a proxy with a very large disk cache that provides a high hit rate, but because of poor disk management it noticeably increases the client-perceived response times. One can also imagine a proxy that achieves a high hit rate by 1) caching images only for a very short time, which frees storage for use by other objects, and 2) prefetching the inline (embedded) images for the page currently requested. This would allow it to report high request hit rates, but not perform well in terms of bandwidth savings. Finally, it is possible to select a proxy based on its performance on a workload of requests generated by dialup users and have it perform unsatisfactorily as a parent proxy for a large corporation with other proxies as clients. These examples illustrate the importance of appropriate evaluation mechanisms for proxy systems.

7.2.1 The space of cache evaluation methods

As mentioned earlier, the most commonly used cache evaluation method is that of simulation on a benchmark log of object requests. In such a system, the byte and page hit rate savings can be calculated, as well as estimates of latency improvements. In a few cases, an artificial dataset with the necessary characteristics (appropriate average and median object sizes, or similar long-tail distributions of object sizes and object repetitions) is used. More commonly, actual client request logs are used since they arguably better represent likely request patterns and include exact information about object sizes and retrieval times.

                                         Workload Source
Algorithm Implementation                 artificial    captured logs    current requests
simulated systems/network                A1            A2               A3
real systems/isolated network            B1            B2               B3
real systems/real network                C1            C2               C3

Table 7.2: An expanded space of evaluation methodologies for Web systems.

We characterize Web/proxy evaluation architectures along two dimensions: the source of the workload used and the form of algorithm implementation. Table 7.1 shows the traditional space with artificial versus captured logs and simulations versus implemented systems. With the introduction of prefetching proxies, the space needs to be expanded to include implemented systems on a real network connection, and in fact, evaluation of an implemented proxy in the field would necessitate another column, that of live requests. Thus, the expanded view of the space of evaluation methodologies can be found in Table 7.2 and shows that there are at least three categories of data sources for testing purposes, and at least three forms of algorithm evaluation. While it may be possible to make finer distinctions within each area, such as the representativeness of the data workload or the fidelity of simulations, the broader definitions used in the expanded table form a simple yet useful characterization of the space of evaluation methodologies.

For the systems represented in each area of the space, we find it worthwhile to consider how the desired qualities of Web caching (reduced bandwidth requirements, reduced server loads, and improved response times) are measured. As will be seen below, each of the areas of the space represents a trade-off of desirable and undesirable features. In the rest of this section we point out the characteristic qualities of systems in each area of the space and list work in Web research that can be so categorized.

7.2.2 Methodological appraisal

In general, both realism and complexity increase as one moves diagonally downward and to the right in the evaluation methodology space. Note that only methods in areas A1, A2, B1 and B2 are truly replicable, since live request stream samples change over time, as do the availability and access characteristics of hosts via a live network connection.

Simulated systems. Simulation is the simplest mechanism for evaluation as it does not require full implementation. However, simulating the caching algorithm requires detailed knowledge of the algorithms, which is not always possible (especially for commercial implementations). Even then, typical simulation cannot accurately assess document retrieval delays [WA97].

It is also impossible to accurately simulate caching mechanisms that prefetch on the basis of the contents of the pages being served (termed content-based prefetching), such as those in CacheFlow [Cac02b], Wcol [CY97], and WebCompanion [Kle99] since they need at least to have access to the links within Web pages — something that is not available from server logs. Even if page contents were logged (as in [CBC95, MDFK97, Dav02a]), caches that perform prefetching may prefetch objects that are not on the user request logs and thus have unknown characteristics such as size and Web server response times.

Real systems/isolated networks. In order to combat the difficulties associated with a live network connection, measurement techniques often eliminate the variability of the Internet by using local servers and isolated networks, which may generate unrealistic expectations of performance. In addition to not taking into consideration current Web server and network conditions, isolated networks do not allow for retrieval of current content (updated Web pages).

Real systems and networks. Because the state of the Internet changes continuously (i.e., Web servers and intermediate network connections may be responsive at one time, and sluggish the next), tests on a live network are generally not replicable. Additionally, to perform such tests the experiment requires reliable hardware, robust software, and a robust network connection to handle the workload applied. Finally, connection to a real network requires compatibility with, and no abuse of, existing systems (e.g., one cannot easily run experiments that require changes to standard httpd servers or experimental extensions to TCP/IP).

Artificial workloads. Synthetic traces are often used to generate workloads that have characteristics that do not currently exist, to help answer questions such as whether the system can handle twice as many requests per second, or whether caching of non-Web objects is feasible. However, artificial workloads often make significant assumptions (e.g., that all objects are cacheable, or that requests follow a particular distribution) which are necessary for testing, but not necessarily known to be true.

Captured logs. Using actual client request logs can be more realistic than artificial workloads, but since on average the lifetime of a Web page is short (less than two months [Wor94, GS96, Kah97, DFKM97]), any captured log loses its value quickly as more references within it are no longer valid, either by becoming inaccessible or by changing content (i.e., looking at a page a short time later may not give the same content). In addition, unless the logs are recorded and processed carefully, it is possible for the logs to reflect an inaccurate request ordering as the sample Squid logs show in Figure 7.1. Note that the request for the main page follows requests for three subitems of that main page. This is because entries in the log are recorded when the request is completed, and the timestamp records the time at which the socket is closed. Finally, proxy cache trace logs are inaccurate when they return stale objects since they may not have the same characteristics as current objects. Note that logs generated from non-caching proxies or from HTTP packet sniffing may not have this drawback. In other work [Dav99c], we further examine the drawbacks of using standard trace logs and investigate what can be learned from more complete logs that include additional information such as page content.
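One consequence for trace preparation: because Squid writes an entry only when a request completes, approximate request order can be restored by re-sorting on the request start time (timestamp minus elapsed time), as in the sketch below (field positions follow the format in Figure 7.1; this is an illustration, not the preprocessing used elsewhere in this dissertation).

def reorder_by_request_start(squid_lines):
    # Squid 1.1 access log: field 0 is the completion timestamp (seconds),
    # field 1 is the elapsed time (milliseconds); their difference approximates
    # when the request was issued.
    def start_time(line):
        fields = line.split()
        return float(fields[0]) - int(fields[1]) / 1000.0
    return sorted(squid_lines, key=start_time)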

Current request stream workloads. Using a live request stream produces experiments that are not reproducible (especially when paired with live hardware/networks). Additionally, the test hardware and software may have difficulties handling a high real load. On the other hand, if the samples are large enough and have similar characteristics, an argument for comparability might be made.

7.2.3 Sampling of work in methodological space

In this section we place work from the research literature into the evaluation space described above. Note that inclusion/citation in a particular area does not necessarily indicate the main thrust of the paper mentioned.

Area A1: Simulations using artificial workloads. It can be difficult to characterize a real workload sufficiently to be able to generate a credible artificial workload. This, and the wide availability of a number of proxy traces, means that relatively few researchers attempt this kind of research:

• Jiang and Kleinrock [JK97, JK98] evaluate prefetching mechanisms theoretically (area A1) but also on a limited trace set (area A2).

• Manley, Seltzer and Courage [MSC98] propose a Web benchmark which generates realistic loads to focus on measuring response times. This tool would be used to perform experiments somewhere between areas A1 and A2 as it uses captured logs to build particular loads on request.

• Tewari et al. [TVDS98] use synthetic traces for their simulations of caching continuous media traffic.

Area A2: Simulations using captured logs. Simulating proxy performance is much more popular than one might expect from the list of research in area A1. In fact, the most common mechanism for evaluating an algorithm’s performance is simulation over one or more captured traces.

• Cunha et al. [CBC95], Partl [Par96], Williams et al. [WAS+96], Gwertzman and Seltzer [GS96], Cao and Irani [CI97], Bolot and Hoschka [BH96], Gribble and Brewer [GB97], Duska, Marwood and Feeley [DMF97], Caceres et al. [CDF+98], Niclausse, Liu and Nain [NLN98], Kurcewicz, Sylwestrzak and Wierzbicki [KSW98], Scheuermann, Shim and Vingralek [SSV97, SSV98, SSV99], and Zhang et al. [ZIRO99] all utilize trace-based simulation to evaluate different cache management algorithms.

• Trace-based simulation is also used in evaluating approaches to prefetching, such as Bestavros’s server-side speculation [Bes95], Padmanabhan and Mogul’s persistent HTTP protocol along with prefetching [PM96], Kroeger, Long and Mogul’s calculations on limits to response time improvement from caching and prefetching [KLM97], Markatos and Chronaki’s object popularity-based method [MC98], Fan, Cao and Jacobson’s response time reduction to low-bandwidth clients via prefetching [FJCL99], and Crovella and Barford’s prefetching with simulated network effects [CB98].

• Markatos [Mar96] simulates performance of a Web server augmented with a main memory cache on a number of public Web server traces.

• Mogul et al. [MDFK97] use two large, full-content traces to evaluate delta-encoding and compression methods for HTTP via calculated savings (area A2), but also performed some experiments to include computational costs on real hardware with representative request samples (something between areas B1 and B2).

Area A3: Simulation based on current requests. We are aware of no published research that could be categorized in this area. The algorithms examined above in areas A1 and A2 do not need the characteristics of live request streams (such as contents rather than just headers of HTTP requests and replies). Those researchers who use live request streams all use real systems of some sort (as will be seen below).

Area B1: Real systems on an isolated network using an artificial dataset. A number of researchers have built tools to generate workloads for Web systems in a closed environment. Such systems make it possible to generate workloads that are uncommon in practice (or impossible to capture) to illuminate implementation problems. These include:

• Almeida, Almeida and Yates [AAY97a] propose a Web server measurement tool (Webmonitor), and describe experiments driven by an HTTP load generator.

• Banga, Douglis and Rabinovich [BDR97] use an artificial workload with custom client and server software to test the use of transmitting only page changes from a server proxy to a client proxy over a slow link.

• Almeida and Cao’s Wisconsin Proxy Benchmark [AC98b, AC98a] uses a combination of Web client and Web server processes on an isolated network to evaluate proxy performance.

• While originally designed to exercise Web servers, both Barford and Crovella’s SURGE [BC98a] and Mosberger and Jin’s httperf [MJ98] generate particular workloads useful for server and proxy evaluation.

• The CacheFlow [Cac00] measurement tool was designed specifically for areas C1 and C2, but could also be used on an isolated network with an artificial dataset.

Area B2: Real systems on an isolated network using captured trace logs. Some of the research projects listed in area B1 (using artificial logs) may be capable of using captured trace logs. In addition, we place the following here:

• In evaluating their Crispy Squid, Gadde, Chase and Rabinovich [GCR98] describe the tools and libraries called Proxycizer. These tools provide a trace-driven client and a characterized Web server that surround the proxy under evaluation, much like the Wisconsin Proxy Benchmark.

Area B3: Real systems on an isolated network using current requests. Like area A3 which used simulation, we found no research applicable to this area. Since live requests can attempt to fetch objects from around the globe, it is unlikely to be useful for testing forward proxies within an isolated network. However, we suggest that reverse proxies could be tested internally under this model.

Area C1: Real systems on a live network using an artificial dataset. Some of the research projects in area B1 (e.g., CacheFlow’s measurement tool) may be designed for the use of a live network connection. The primary restriction is the use of real, valid URLs that fetch over the Internet rather than particular files on a local server.

Area C2: Real systems on a live network using captured logs. With captured logs, the systems being evaluated are as realistically operated as possible without involving real users as clients. In addition to those listed below, some of the tools from area B1 may be usable in this type of experiment.

• Wooster and Abrams [Woo96, WA97] report on evaluating multiple cache replacement algorithms in parallel within the Harvest cache, both using URL traces and online (grid areas C2 and C3, respectively), but the multiple replacement algorithms are within a single proxy. Wooster also describes experiments in which a client replayed logs to multiple separate proxies running on a single multiprocessor machine.

• Maltzahn and Richardson [MR97] evaluate proxies with the goal of finding enterprise-class systems. They test real systems with a real network connection and used carefully selected high load-generating trace logs.

• Liu et al. test the effect of network connection speed and proxy caching on response time using public traces [LAJF98] on what appears to be a live network connection.

• Lee et al. [LHC+98] evaluate different cache-to-cache relationships using trace logs on a real network connection.

Area C3: Real systems on a live network using the current request stream. In many respects, this area represents the strongest of all evaluation methodologies. However, most real installations are not used for the comparison of alternate systems or configurations, and so we report on only a few research efforts here:

• Chinen and Yamaguchi [CY97] described and evaluated the performance of the Wcol proxy cache on a live network using live data, but do not compare it to any other caching systems.

• Rousskov and Soloviev [RS98] evaluated performance of seven different proxies in seven different real-world request streams.

• Cormack [Cor98] described performance of different configurations at different times of a live Web cache on a live network.

• Later in Chapter 10 we will describe the Simultaneous Proxy Evaluation (SPE) architecture that compares multiple proxies on the same request stream. While originally designed for this area with a live request stream and live network connection, it can also be used for replaying captured or artificial logs on isolated or live networks (areas B1, B2, C1, and C2).

7.3 Recommendations for Future Evaluations

While almost every researcher uses some kind of evaluation in their work, there are a number of pitfalls that may be encountered. The better papers will point out the drawbacks of their methodology, but not all do. In this section, we consider many of the typical mistakes that can be found in existing caching research.

Using an inappropriate trace. Some publicly available traces are more than a few years old1 and thus may no longer be representative of current network traffic. In some cases, however, these are the only ones available with the desired characteristics. Ideally traces would be large, representative of the desired traffic, and be publicly accessible for others to examine and possibly reconstruct tests.

Hiding the effects of changes to a subsystem. When comparing an artificial or otherwise modified environment to a real-world trace, there are a multitude of factors that must be accounted for. An example is replaying a proxy trace through a Web browser: unless specially configured, the browser may use its built-in cache and no longer generate as many external requests as were found in the original trace.

Ignoring the effect of connection caching. Connection caching [CKZ99] can be a significant consequence of the use of persistent connections in proxies. When a proxy is introduced to an environment, the effects of persistent connections and of the caching of connections by the proxy should be considered.

Ignoring the effect of differences between HTTP/1.0 and HTTP/1.1. HTTP/1.1 [FGM+99] introduced a number of improvements over HTTP/1.0 [KMK99], including persistent connections as default behavior and increased support for caching.

1Unfortunately, many of the traces used in this dissertation have also gotten old. We encourage ISPs, content providers, and hosting providers to make traces of Web usage publicly available for analysis (after suitable anonymization).

A trace containing one kind of traffic should not be blindly replayed in a different environment. HTTP traffic measurements today are likely to see both protocols in widespread use.

Ignoring TCP effects on HTTP performance. TCP introduces particular effects for the relatively short-lived HTTP connections, including costs associated with slow start and connection establishment [PK98, SCJO01].

Ignoring canceled requests. While relatively few public traces include information about canceled requests, some studies (e.g., [FCD+99]) have revealed the relatively high presence of such activity. Inappropriate use of such requests in logs (such as fully retrieving them or assuming the sizes reflect a new, smaller object) may distort actual performance of a system.

Ignoring effects of replaying a trace. A replayed trace does not faithfully replicate what users would do or see in a new environment. For example, the network and destination hosts may have changed content or accessibility. More importantly, the user is likely to have acted differently: when responses are made faster, the user might make requests faster. Conversely, if a response were much slower, the user might give up, cancel the request, and move on to something else.

Ignoring real timing issues. For those researchers interested in latency and response time effects, it is not sufficient to measure performance in terms of round trip times. As mentioned above, real networking effects may be needed to get accurate estimates of performance.

Obscuring pertinent details of test infrastructure. For example, the use of old proxy cache software (e.g., the CERN proxy [LNBL96]) often implies a lack of support for HTTP/1.1 and possibly persistent connections under HTTP/1.0 as well. Therefore, such details are important to disclose.

Ignoring concerns of freshness in a trace. Since most Web usage traces do not include information containing a response’s expiration, a common approach is to test for changes in response size to determine a change in the underlying object. While this is not ideal, it is better than simpler (but wrong) approaches that only check for recurrence of the object request. Likewise, simulations that hold data in a cache forever are not realistic; many objects should not be cached at all, or only for a short period.
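To make the size-change heuristic concrete, the following sketch counts a repeated request as a hit only when the logged response size is unchanged. It is an illustration of the heuristic just described, with invented field names, and is not code from any particular simulator.

def classify_by_size_change(requests):
    """requests: iterable of (url, size) pairs in request order."""
    last_size = {}
    hits = misses = 0
    for url, size in requests:
        if url in last_size and last_size[url] == size:
            hits += 1       # same size: assume the underlying object is unchanged
        else:
            misses += 1     # first sighting, or an apparent modification
        last_size[url] = size
    return hits, misses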

Ignoring other cacheability information. In addition to the concerns men- tioned above, there are many factors involved in determining cacheability. For example, under HTTP/1.0, the presence of a cookie can make a response uncacheable. Likewise, many researchers mistakenly assume that uncacheable responses can be determined from the format of the URL (e.g., the inclusion of a ‘?’) when under HTTP/1.1, no such a priori relationship exists.

Ignoring DNS issues when building a prototype. The domain name system can be a significant source of slow response times [CK00, CK01a, STA01]. Thus, DNS effects must be considered; for example, prefetching DNS can provide real savings.

Hiding or ignoring bad data. Some researchers choose to drop requests believed to be dynamic from their traces or to eliminate error responses. As long as it is well documented, the practice may be acceptable, but a better, more general approach would be to incorporate all requests into the model.

Vague or missing latency calculations. Since caching and prefetching can provide significant benefits to user-perceived response times, it is imperative that latency and response time calculations be explicitly described.

The concerns presented above are not meant to be exhaustive. Instead, they simply represent many of those raised by the author while reviewing publications cited in this dissertation.

7.4 Summary

Evaluation is a significant concern, both for consumers and for researchers. Objective measurements are essential, as is comparability of measurements from system to system. Furthermore, it is important to eliminate variables that can affect evaluation.

Selection of an appropriate evaluation methodology depends on the technology being tested and the desired level of evaluation. Some technologies, such as protocol extensions, are often impossible to test over a live network because other systems do not support them. Similarly, if the goal is to find bottlenecks in an implementation, one may want to stress the system with very high synthetic loads, since such loads are unlikely to be available in captured request traces. In fact, this reveals another instance in which a live network and request stream should not be used: when the goal of the test is to drive the system into a failure mode to find its limits.

In general, we argue for the increased believability of methodologies that are placed further down and to the right in the evaluation space when objectively testing the successful performance of a Web caching system. If the tested systems make decisions based on content, a method from the bottom row is likely to be required. When looking for failure modes, it will be more useful to consider methodologies near the upper left of the evaluation space.

This survey provides a sampling of published Web caching research and presents one of potentially many spaces of evaluation methodologies. In particular, we have not considered aspects of testing cache hierarchies and inter-cache communication [CDN+96, Nat02, Dan98, CC95, CC96, ZFJ97, GRC97, GCR97, FCAB98, RCG98, WC97b, WC97a, RW98, VR98, VW00], cache consistency, or low-level protocol issues such as connection caching, which are significant in practice [CDF+98].

In this chapter we have described many concerns for accurate, repeatable, comparable, and applicable evaluation methodologies. We have additionally described a space of evaluation methodologies and shown where a sample of research efforts fall within it. By considering the space of appropriate evaluation methodologies, one can select the best trade-off of information provided, implementation costs, comparability, and relevance to the target environment.

In Chapter 8 we will describe a simulator that can be used to develop and evaluate some types of caching and prefetching approaches. Subsequently, in Chapter 9 we will use that simulator to combine predictions from multiple sources and report on the efficacy of the approach. As we described in this chapter, simulation has limitations as well, and so in Chapters 10 and 11, we propose and develop a proxy cache evaluation architecture and test it with real systems. Finally, Chapter 12 concludes the dissertation by looking ahead to the next steps that will be necessary to make widespread prefetching possible on the Web.

Chapter 8

A Network and Cache Simulator

8.1 Introduction

Simulation is a common approach for algorithmic research, and is widespread in Web caching research (as we saw in Chapter 7). In general, simulation provides many benefits, including the ability to:

• test scenarios that are dangerous or have not yet occurred in the real world,

• predict performance to aid technology design,

• predict expected behavior of new protocols and designs without implementation costs or disruption of existing systems,

• slow or speed time for effective analysis, and

• quickly survey a range of potential variations.

These features make simulation valuable to Web research, especially early in the research and development cycle. In this chapter, we will explore the use of Web simulation to evaluate caching and prefetching schemes. Unlike much early caching research, we use simulation to estimate response times, rather than just object and byte hit rates. This focus is needed because response time improvement is a common justification for the use of Web caching, and is arguably the initial raison d’être for content delivery networks.

A simulator is only useful when it produces believable results. Believability, however, is a subjective quality, and typically only comes after the simulator has been validated by exhaustive testing and widespread use. Heidemann et al. [HMK00] provide one definition of validation as the “process to evaluate how accurately a model reflects the real-world phenomenon that it purports to represent.”

Thus, in addition to introducing a new caching simulator, this chapter will need to provide some measure of validation so that it may be found believable. It is our intent to demonstrate the validity of the HTTP response times as estimated for the client in a new HTTP simulator by comparison to real-world performance data at both a small scale (individual transactions, carefully measured) and at a large scale (tens of thousands of clients and servers from a log of real-world usage).

The rest of this section will provide an overview of the new simulator as well as the motivation to develop it. Section 8.2 describes the implementation and organization of the simulator. In Section 8.3 we turn from the simulator to describe a large dataset of real-world traffic, which we will use in Section 8.4 to validate the simulator against small-scale and large-scale real-world data. Section 8.5 will provide additional demonstrations of the utility of the simulator. In Section 8.6 we wrap up the chapter by comparing our simulation work to that of others and considering future work.

8.1.1 Overview

NCS (Network and Cache Simulator) is an HTTP trace-driven discrete event simulator of network and caching activity. It is highly parameterized for maximal compatibility with previous caching and network simulations. In granularity, it resides between the high-level caching-only simulators prominent in much Web caching research (e.g., [Par96, DMF97, GPB98, BH98a, HWMS98, BBBC99, ZIRO99, BKNM02]), and the detailed simulators of networking protocols and traffic. In an effort to capture estimates of user-perceived response times, it simulates simplistic caching and prefetching functionality at various locations in a network comprising client browsers, an optional intermediate proxy, and Web servers. Caching is optionally supported at the proxy and clients. Additionally, it simulates many aspects of TCP traffic among these entities on a somewhat idealized network.

The development goals for NCS included:

• Estimate client-side response times and bandwidth usages by

– Capturing intrinsic network effects (e.g., new connection costs, TCP slow start).

– Modeling real topologies (including browsers, proxies, and servers, with potentially multiple connections and/or persistent connections).

– Capturing real-world network effects (including distributions of response times and bandwidths).

• Provide credibility – be able to compare simulation results to real-world numbers.

• Incorporate optional prefetching techniques for testing and evaluation.

While most Web caching simulators measure hit rates and bandwidth used, few consider the detail needed to estimate response times believably. In contrast, NCS is specifically designed to estimate the response times experienced by the user, and so includes network simulation in addition to a caching implementation.

This chapter motivates the design, lists the features, provides an overview of the implementation, and presents some sample experiments to demonstrate the utility of the simulator.

8.1.2 Motivation

As mentioned in Chapter 7, evaluation of Web caching research has varied widely among dimensions of measurement metrics, implementation forms (from logic arguments, to event simulations, to full implementation), and workloads used. Even when restricting one’s view to caching-oriented simulators (e.g., those used in [WAS+96, CDF+98, FCD+99, FJCL99, ZIRO99, BH00, DL00]), it is apparent that there is a wide range of simulation detail. However, it has also been noted that some details are of particular importance [CDF+98], such as certain TCP slow-start network effects and HTTP connection caching when trying to capture estimates of response times.

Caching in the Web is known to reduce network bandwidth, server loads, and user-perceived latency. As overall bandwidth in the net improves and becomes less expensive, and more experience is gained in managing loads of popular servers, the final characteristic of latency improvement grows in interest among caching researchers. This, in fact, was the initial motivation for the development of NCS. As we have seen in earlier chapters, prefetching is a well-known approach to reduce response times. However, a significant difficulty in Web prefetching research is the lack of good methods for evaluation. Later in Chapter 10 we will propose a mechanism for evaluation of fully implemented systems that may employ prefetching, but a simulator is more appropriate for the earlier feedback needed in any research effort.

Therefore, NCS has been designed to estimate the client-perceived response times by simulating the network latencies and the effects of caches present in the network. In addition, it contains optional prediction code described in Chapter 4 to provide various levels of prefetching wherever caching is available.

Although inspired by the caching simulators described and used elsewhere [FCD+99, FJCL99, DL00], the coding and development of NCS has proceeded independently. An alternative might have been to use the network simulator ns [UCB01] and extend it appropriately for prefetching. However, ns is also known to have a steep learning curve [BEF+00], and would require significantly more computational resources because of the more detailed modeling it performs. Instead, initial NCS prototypes were quickly implemented in Perl and appeared quite promising. By using a less-detailed network model, and a current implementation in C, NCS is able to estimate performance for traces using tens of thousands of hosts much faster than real-time. In summary, we wanted more detail (and response-time accuracy) than typical caching-only simulators, but faster simulation times for large experiments than the fine-grained network-oriented simulator ns, and to be able to incorporate code that optionally estimates the effects of prefetching.

The code for HTTP over TCP, and especially for slow-start effects, is based significantly on the models shown in [THO96, HOT97, KR00]. Heidemann et al. [HOT97] claim their model provides guidance for wide-area and low-bandwidth network conditions, but may be somewhat inaccurate when applied to LANs. Therefore, we anticipate similar performance, which will be shown in Section 8.4.2. Fortunately, the area of interest for most caching simulations is not that of LANs, and so we expect that the simulation of network effects over wide-area and low-bandwidth conditions will dominate the occasional inaccuracy in the modeling of LAN conditions.

8.2 Implementation

In general, NCS is designed to be flexible so that a large variety of simulated environments can be created. We had particular goals of replicating many of the first-order characteristics of existing browsers and proxies (such as caching of data, DNS, and connections). The suggested values were often derived empirically (as in [WC98]) or by direct examination of publicly available source code (for Netscape Communicator [Net99] and Squid [Wes02]).

One perceived drawback of a large parameter set is that all parameters must have a value, even when trying to simulate a fairly simple environment. Conversely, this is better viewed as a feature that makes design choices explicit, rather than implicit as in most simulators.

8.2.1 Features

In brief, the features of NCS include:

• The use of a simple form of server and proxy logs as input.

• Optional support of persistent and/or pipelined connections, and more than one connection to the same host.

• Idle connections can be dropped after timeouts, or when a new connection is needed but the maximum number of connections has already been established.

• Provisions for an arbitrary number of clients, an optional intermediate proxy, and an optional number of servers.

• Support for caching at clients and intermediate proxies.

• Parameterized link bandwidths and latencies as input to an emulation of TCP slow-start (with some limitations such as assuming an infinite slow-start threshold).

• Three versions of TCP slow-start: naive slow-start, BSD-style starting with payload of size 2, and a version that uses delayed ACKs.

• Optional inclusion of a simple model for parameterized DNS lookup cost with DNS caching at clients and proxies.

• Optional ability for proxies to buffer received files in their entirety before sending, or to forward data as it is received.

• Support for proxies with multiple requests for the same item to either open multiple connections, or wait until the first is received and then serve the second request from cache.

With regard to prefetching, it supports prefetching at clients and/or at proxies. It can model the transmission of hints from server to client at various points in the process. Hints can then be integrated into prediction models. It does not prefetch items that are already being retrieved.

On the other hand, NCS makes a number of simplifying assumptions. At the networking level, it always sends (and receives) packets in order, and there is no packet loss. It also ignores TCP byte overheads as well as ACK bandwidth (although ACK timing is used). For some calculations, it also ignores response header bytes and request bandwidth when determining bandwidth resource availability. For most workloads, this is not a significant issue. Finally, it assumes a simple model of a server, in particular that the server (and its subsystems) are not the bottleneck (e.g., through the use of appropriate file and disk management as in [MRG99, MKPF99]).

As a result of implementing more features than a caching-only simulator but at a lower granularity than ns, the runtime of NCS is orders of magnitude faster than ns, but slower than a well-written caching simulator, such as that written by Pei Cao while at the University of Wisconsin [Cao97] (and used in [FJCL99]). For a small, 10,000-request subset of the NASA trace, the UW simulator completed in 0.2 seconds on a 1GHz Intel Pentium III. NCS took 1.8 seconds on the same platform, but additionally incorporated network delay characteristics to estimate client response time improvements. In contrast, ns took more than 30 minutes (1.8×10^3 seconds). To simulate the entire trace with approximately 850,000 requests, the UW simulator needs just seven seconds; NCS requires 165 seconds. Unfortunately, ns crashed when building the larger topology needed for the tens of thousands of hosts represented in the trace. When run on a simpler topology of approximately 900 hosts, ns required more than three days to complete (2.8×10^5 seconds) on the same platform, and was entirely CPU bound.

8.2.2 NCS parameters

The simulator takes just two parameters on the command line: the name of the primary parameter file (described next) and an optional debugging level (increasing values present more detail).

All parameter files have the following form: each parameter is specified, one per line, in the correct order. Since there is just one value used on each line, anything after a whitespace on the same line is ignored.
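As a hedged illustration of this format, the sketch below reads such a file, keeping only the first whitespace-delimited token on each line; the parameter names shown are invented for the example and do not reflect NCS’s actual parameter order.

def read_param_file(path, names):
    """Read one value per line, in order; anything after the first token is ignored."""
    values = []
    with open(path) as fh:
        for line in fh:
            tokens = line.split()
            if tokens:                 # skip blank lines
                values.append(tokens[0])
    return dict(zip(names, values))

# Illustrative use with invented parameter names:
# params = read_param_file("ncs.params", ["packet_size", "request_size", "client_latency_ms"])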

The simulator reads an entirely numerical trace of requests, one per line, in a log to generate events. The format of the trace file is as follows: the first 6 white-space delimited values are required (client-id, time, request-id, response-code (e.g., 200, 304, etc.), size, and cacheable (binary value)); the server-id is the only item in the first optional group (if not present, all requests go to the same server); the second optional group includes three more values (elapsed-request-time in ms, the last modification time, and the server-request time in ms).

#client-id time req-id code size cacheable [server-id [elapsedms lm serverms]]

Most Web traces contain a subset of the values listed above. In the tests we describe, we have converted an existing Web trace (from a proxy or server) into the format described. Here is an excerpt of one trace with all parameters:


Figure 8.1: NCS topology. NCS dynamically creates clients and servers from templates as they are found in the trace. It can be configured to use caching at clients and at an optional proxy that can be placed logically near clients or servers.

46 846890365.49576 142 200 368 1 51 -1 -1 0.796008
21 846890365.508394 143 200 10980 1 23 -1 833504726 3.505611
47 846890365.519689 144 200 12573 1 52 -1 845000914 19.612891
47 846890365.520312 145 200 6844 1 52 -1 846823803 18.945683
48 846890365.555019 146 200 1340 1 53 -1 845410505 14.995183

The timings record the number of seconds elapsed since 1 January 1970. Thus these requests were made on 1 November 1996, from four different clients to five different servers. All requests were satisfied (response code 200). The first one does not include a last-modified time. None provide an overall elapsed time, but the servers had response times ranging from approximately .8ms to 19.6ms.
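A small sketch of a parser for this record format follows; the field names are taken from the format string above, and the code is illustrative rather than NCS’s actual input routine.

def parse_trace_line(line):
    f = line.split()
    record = {
        "client_id": int(f[0]),
        "time": float(f[1]),             # seconds since 1 January 1970
        "request_id": int(f[2]),
        "response_code": int(f[3]),
        "size": int(f[4]),
        "cacheable": f[5] == "1",
    }
    if len(f) >= 7:                       # first optional group
        record["server_id"] = int(f[6])
    if len(f) >= 10:                      # second optional group; -1 means not recorded
        record["elapsed_ms"] = float(f[7])
        record["last_modified"] = float(f[8])
        record["server_ms"] = float(f[9])
    return record

# First record from the excerpt above:
print(parse_trace_line("46 846890365.49576 142 200 368 1 51 -1 -1 0.796008"))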

8.2.3 Network topologies and characteristics

As suggested by the parameters described in Section 8.2.2, the simulator is easily configured for a particular set of network topologies. It currently allows for a single set of clients which are each connected at identical bandwidths and latencies. Those clients send their requests either to a single proxy, if used, or to the origin servers. Each server is likewise identically configured with fixed bandwidths and latencies, representing the typical characteristics of all servers. See Figure 8.1. This particular representation allows for easy management of nodes and parameter files. However, it may be of interest in the future to allow for more complex network topologies and characteristics. For example, it may be useful to have multiple proxies so that different-sized sets of clients can be evaluated, or to have some clients use a proxy and others fetch directly. Alternatively, it could be desirable to have multiple hierarchically connected proxies to test performance with proxy chains.

8.2.4 Manual tests

A variety of simple manual tests were performed while debugging the simulator. These included tests of network latencies for request loads with few items, tests of the cache replacement algorithms, tests of persistent connections, and so on. However, since manual tests for debugging and validation purposes require significant human effort, they can only be used for trivial experiments. In addition to the manual tests used for debugging, the experiments in Section 8.4.2 were also verified by manual calculation, since they were small enough to make this feasible. For larger experiments, therefore, manual calculations are unreasonable, and so we depend on alternative methods for validation, as described below in Sections 8.4.3 and 8.4.4.

8.3 The UCB Home-IP Usage Trace

In this section we describe the UC Berkeley Home IP HTTP Traces [Gri97] and some of the effort needed to prepare the logs for our use. This dataset was not selected arbitrarily, but was chosen specifically because of its non-trivial length and recorded timing characteristics, and because it is well known to researchers. A longer version of this trace has been analyzed in depth by Gribble and Brewer [GB97], but the public version described here has also been used in numerous published papers (e.g., [FCAB98, JBC98, BCF+99, THVK99, FJCL99, BS00, LYBS00, MIB00, PF00, RID00]).

8.3.1 Background

As mentioned previously in Chapter 4, the UC Berkeley Home IP HTTP Traces [Gri97] are a record of Web traffic collected by Steve Gribble as a graduate student in November 1996. Gribble used a snooping proxy to record traffic generated by the UC Berkeley Home IP dialup and wireless users (2.4Kbps, 14.4Kbps, and 28.8Kbps land-line modems, and 20-30Kbps bandwidth for the wireless modems). This is a large trace, covering 8,377 unique clients over 18 days with over nine million requests. His system captured all HTTP packets and recorded, among other items, the time for each request, the first byte of response, and the last byte of response. These timestamps provide a sample of real-world response times that can be used to validate our simulator.

8.3.2 Trace preparation

Most researchers have found that Web traces need to be checked and often cleaned before using them in a simulator or for evaluation [KR98, Dav99c]. The UCB Home-IP Trace is no exception.

Like many proxy logs, the events in this trace are recorded in order of event completion. This can cause anomalies in sequence analysis and when replaying the trace for simulation, so the trace must instead be sorted by request time.

Unfortunately, this trace does not record the HTTP response code associated with each object. Thus, we are unable to distinguish between valid responses (e.g., code 200), error responses (e.g., 404), and file not modified responses (304). For the purpose of simulation, we assume all responses contain a 200 response code.

While attractive for timing simulations, this trace also includes some anomalies. One example of this is impossibly high client bandwidths: for certain responses, the combination of size of reply and timings of request and end of response suggest bandwidths that meet or exceed LAN capabilities, and which certainly do not reflect dialup or wireless clients of 1996.1 More importantly, the trace does not directly reflect client-perceived response times, which are the timing estimates provided by NCS. Since the trace is captured by a snooping proxy on an Ethernet on the way to the university’s Internet connection, two modem (dialup or wireless) round trips are not captured (the first round trip initiates the connection; the second corresponds to the time to make a request and start receiving the last response), nor is the modem transmission time of the initial packets to open a connection. There is also the transmission cost of the request packet and final data packet (or more, as will be discussed in the next section). Therefore, in order to compare the original trace timings to our simulated client response times, we will add a factor of up to 667ms, reflecting typical timings for a 28.8 modem (which underestimates the higher latencies of wireless modems as well as slower dialup modems).2 Prior to the adjustment, the trace had a mean response time of 17.4s and a median of 2.9s; afterward it had a mean and median of 18.0s and 3.6s, respectively.
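The arithmetic behind this 667ms adjustment (spelled out in the accompanying footnote) can be reproduced directly; the small computation below uses only the figures given there for a 28.8 modem and is included purely as a worked example.

# Reproduces the ~667ms adjustment: two modem round trips, plus the
# connection-setup packet, the request, and the final response packet.
rtt = 0.200                      # modem round-trip time (s)
raw_bw = 3600.0                  # raw modem bandwidth (bytes/s)
eff_bw = 3420.0                  # effective bandwidth after TCP/IP and PPP overhead
setup = (40 + 40) / raw_bw       # 40B packet sent and received to open the connection
request = 325 / eff_bw           # transmit the HTTP request
last_packet = 512 / eff_bw       # receive the final 512B response packet
print(2 * rtt + setup + request + last_packet)   # approximately 0.667 seconds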

The UCB Home-IP trace also misses the latencies associated with DNS lookups. However, since the simulator can also ignore this, we don’t need to be concerned about it for the purposes of validation.3

8.3.3 Analysis

We analyzed statistics and distributions of the UCB Home-IP Trace to better understand the activity represented, and to look for anomalies. First we consider the response size distribution. Figure 8.2 shows the cumulative distribution function of response sizes from this usage trace. We found that 25% of all responses were 430 bytes or smaller, and that the median response size was approximately 2310 bytes. This compares to a mean response size of approximately 6.7KB.

1Therefore, in our simulations, we drop more than half a million anomalous entries from the trace (primarily those with missing timestamp fields, corresponding to missed packets or canceled requests, but also 13 entries with server-to-proxy bandwidth over 80Mbps, and 7237 entries with size > 100MB).
2The value 667ms can be calculated as follows. We assume a modem round-trip time of 200ms, and effective bandwidth of 3420 bytes/sec (to account, in part, for the overhead of TCP/IP header bytes and PPP framing bytes). The transmission time to send and receive the initial packet to set up a new connection is 22.2ms (40 bytes sent and received at the raw 3600 bytes/sec), the time to transmit the request is 95ms (325 bytes at 3420 bytes/sec), and the time to receive the last packet is estimated to be 149.7ms (512 bytes at 3420 bytes/sec). The total delay is then the sum of two round-trip times, the transmit time to send and receive the initial packet to set up a new connection, the time to transmit a request, and the time to send a final response packet.
3In general, though, DNS effects can contribute significantly to user-perceived retrieval latencies [CK00, CK01a, STA01] and ought to be included in simulation and analysis of Web delays.


Figure 8.2: Cumulative distribution function of response sizes from the UCB Home-IP request trace.


Figure 8.3: Cumulative distribution function of client-viewed response times from the UCB Home-IP request trace.

An obvious measure of user-oriented performance for a Web system is the response time. Figure 8.3 shows the cumulative distribution function of response times. In examining this trace, we found that the first quartile of the response time distribution is at 1.0s, and the median is at 3.6s. The mean response time is much higher at 18.0s.


Figure 8.4: Scatter plot of file size vs. estimated actual response time for first 100k requests from UCB trace.

In order to better understand the distributions involved, we also examined the relationship between the size of a file transferred and the response time for that file. In Figure 8.4, the actual response time vs. size of file is plotted for the first 100,000 requests in the UCB trace. We also plot the idealized performance of a 28.8 modem. Note the large number of points along the lower-right edge that correspond to transmission bandwidths that are higher than expected (below the 28.8 modem line). For example, there are many points at 8000 bytes with a response time of less than 1s. Assuming 1s, this corresponds to 64000 bits per second, which is clearly above the bandwidth limits for dialup and wireless technologies of 1996. We have two conjectures that provide plausible explanations. One is that these points (which contribute to the fuzziness along the lower-right edge) are an artifact of packet buffering at the edge device to which the modems were attached (such as a terminal server). The second possibility is that we are seeing the effect of data compression performed by the modems. Note that if we were to plot the original dataset (without filtering obvious anomalies), we would see even more extremal points to the lower right.

In the next section, we will compare simulated timing results with those shown here.

8.4 Validation

Verification and validation [Sar98] are essential if the results of simulations are to be believed. This is particularly important when the simulated experiments are not easily replicated (e.g., perhaps because of extensive development effort required, or proprietary log data). Lamentably, even limited validation of simulators has been uncommon in the Web caching community.

We can validate either network effects or caching effects, or their combination. To limit variation when making comparisons, we will attempt to use the same data set published previously to the extent possible (including the UCB dataset described above). Unfortunately, the results in many papers use proprietary data that cannot be shared. We describe related work later in Section 8.6.1.

In this section we will use real-world data to validate simulated HTTP network performance at both a small scale (individual transactions, carefully measured) and at a large scale (tens of thousands of clients and servers from a log of real-world usage).

First we discuss the validation process necessary for our Web performance simulator. Then we compare simulated end-user response times with real-world values at small and large scales. Finally, we compare caching performance with another respected simulator.

8.4.1 The validation process

Typically it is impossible for a simulator to exactly mirror real-world experience. NCS is no exception. In NCS, we make assumptions and simplifications as a trade-off for faster completion of simulations. For example, NCS ordinarily uses fixed parameters for many aspects of client, network, and server performance. In the real world, these values would be a function of changing world conditions. Different servers would have different connectivity characteristics (including network latency and bandwidth) and service loads at different times. Dial-up clients will likewise differ in bandwidths and latencies depending on the hardware, software, and phone lines used. NCS only models the traffic captured in the input trace, ignoring unlogged traffic (e.g., non-HTTP traffic like HTTPS, streaming media, FTP, or traffic destined for non-logged clients or servers).

Given that the simulation will not generally replicate real-world experiences, what can be done? Instead, we can use the simulator to repeat simple real-world experiments, and thus validate the simulator by comparing it to both real-world results and the calculated results of others (which we do in Section 8.4.2). In particular, we can ask whether the same effects are visible, and most importantly, verify that the simulator works as expected (in general, similarly to other results with some deviation as a result of the simplifications made).

8.4.2 Small-scale real-world networking tests

Fortunately, some networking researchers have taken the time to validate their networking models by comparing them to real-world results (e.g., [HOT97]). By publishing such work in detail, others are not only able to validate new simulators, but also to compare the fidelity of various theoretical models to those results.

In this section we do exactly this. We use the real-world measurements reported in published papers [HOT97, NGBS+97] and attempt to reproduce the real-world experiments under simulation. Heidemann et al. [HOT97] use two sets of experiments for validation of their model of various implementations of HTTP over TCP. The first measures the overall retrieval time of a small cluster of resources.4 This workload is tested in two environments (Ethernet and a high-speed Internet connection) using two forms of the HTTP protocol (HTTP/1.0 and HTTP/1.0 with persistent connections). The actual measurements were gathered and used by Heidemann et al.

The second set of experiments measures the overall retrieval time of a 42KB Web page with 42 embedded images totaling 125KB. This workload is tested in multiple environments (including Ethernet, high-speed Internet, and modem), but Heidemann et al. only considered validating one protocol (HTTP/1.1-style with pipelining). The measurements (along with others) were collected and published by a different group [NGBS+97].

4These resources were a single 6651B page with embedded 3883B and 1866B images, corresponding to the Yahoo home page on May 1, 1996.

Protocol       Network        NCS simulated   [HOT97] measured           Ratio m:s
HTTP/1.0       Ethernet       32.0ms          36.8ms (10ms +/- 2.0ms)    1.15
HTTP/1.0+KA    Ethernet       27.7ms          26.6ms (8.8ms +/- 1.7ms)   0.96
HTTP/1.0       Fast-Internet  1729ms          1716ms (101ms +/- 20.1ms)  0.99
HTTP/1.0+KA    Fast-Internet  1167ms          1103ms (48ms +/- 9.5ms)    0.95

Table 8.1: Validation of simulator response time estimates on the small cluster workload. Examines HTTP/1.0 (serial retrievals on separate connections) and HTTP/1.0+KeepAlive (serial retrievals on a single connection). Measured shows the mean of 100 trials with standard deviation and 95% confidence intervals in parentheses from [HOT97]. Ratio m:s is the ratio of measured time vs. simulated time.

The simulator was configured to be as realistic as possible. Beyond the basic modeling of HTTP over TCP/IP that Heidemann et al. describe [HOT97], we incorporated additional costs. We estimated the cost of establishing a new connection based on transmission of a 40B packet (i.e., just TCP/IP headers) round-trip and a CPU time cost of 1ms.5 We model the appropriate number of parallel connections. We also modeled a fixed reply header with an estimated size of 200B. Finally, we reduced bandwidth estimates in the modem cases to account for TCP/IP and PPP error correction framing bytes.

In Table 8.1 we show the results of our simulator and compare it to the measurements made by Heidemann et al. It examines performance corresponding to HTTP/1.0 (in which serial retrievals on separate connections were made) and HTTP/1.0+KeepAlive (serial retrievals on a single persistent connection, saving the time for subsequent connection establishments). As can be seen from the ratio column, the simulated results are quite close to the real measurements. They also improve upon the adjusted modeled values predicted in Table 7 of [HOT97] in three cases, and equal the fourth (reducing the average error from 11.5% to 6.4%). From these results it appears that we may be underestimating the cost of establishing a new connection.6

5This is a revised value of the CPU time cost; in an earlier version of this work [Dav01b] we mistakenly used a smaller value of 0.1ms.
6A more appropriate estimate of the cost of a new connection may be on the order of multiple ms to account for slower machines and relatively unoptimized code back in 1996. For example, Feldman et al. [FCD+99] report the mean total connection setup time in their modem data as 1.3s, which suggests the presence of significant server delays.

Protocol     Network      Client/server   NCS simulated   [NGBS+97] measured   Ratio m:s   Ratio m:a
HTTP/1.1+P   Ethernet     libwww/Apache   164.3ms         490ms                2.98        1.33
HTTP/1.1+P   F-Internet   libwww/Apache   1623ms          2230ms               1.37        1.22
HTTP/1.1+P   Modem        libwww/Apache   53073ms         53400ms              1.01        1.00

Table 8.2: Validation of simulator on large cluster workload. Examines performance for HTTP/1.1+Pipelining on a single connection. Measured shows the mean of 5 trials from [NGBS+97]. m:s ratio is the ratio of measured time to simulated time. m:a ratio is the ratio of measured time to adjusted simulated time.

Table 8.2 compares the results of NCS to the measurements from [NGBS+97]. In this test of pipelining, the simulator performs poorly for Ethernet, but respectably on the Fast-Internet and Modem cases. We suspect this is the result of a number of factors:

1. We model CPU time costs as a delay (which may occur in parallel with transmission of other data) rather than a use of resources (on the server side) during which no transmission likely takes place.

2. We do not know precisely the sizes nor ordering of the images, and instead use equal-sized values.

3. We do not know where within the first HTML retrieval that the embedded image references are placed, and so assume that the client can attempt to retrieve all objects at start.

4. We do not model the pipeline buffering effects described in [NGBS+97].

These effects are more pronounced for the Ethernet case because the scale of measurements is much smaller. The column marked ratio m:a shows what the ratio would be if we were to account for factors 1, 3 (assuming a delay of two packets), and 4 (adding a single 50ms delay as described by Nielsen et al.). The adjusted values show improvement compared to the adjusted modeled values predicted in Table 8 of [HOT97] in all cases (reducing the average error from 39.3% to 18.3%, although [HOT97] was validating a slightly different dataset from an earlier version of the paper by Nielsen et al.).

Protocol      Network        Client/server      NCS simulated   [NGBS+97] measured   Ratio m:s
HTTP/1.0      Ethernet       libwww/Apache      194ms           720ms                3.71
HTTP/1.1      Ethernet       libwww/Apache      358ms           810ms                2.27
HTTP/1.0      Fast-Internet  libwww/Apache      5402ms          4090ms               0.76
HTTP/1.1      Fast-Internet  libwww/Apache      5628ms          6140ms               1.09
HTTP/1.1      Modem          libwww/Apache      62672ms         65600ms              1.05
HTTP/1.0+KA   Modem          Netscape4/Apache   53523ms         58700ms              1.10
HTTP/1.0+KA   Modem          IE4/Apache         53823ms         60600ms              1.13

Table 8.3: Validation of NCS on large cluster workload. Examines performance for HTTP/1.0 (with up to 4 parallel connections), HTTP/1.1 (with a single persistent connection), and HTTP/1.0+KeepAlive (with up to either 4 or 6 parallel connections for Netscape or MSIE respectively). Measured shows the mean of 5 trials, except for the last two rows where it is the mean of 3 trials, from [NGBS+97]. m:s ratio is the ratio of measured time to simulated time.

Since [NGBS+97] include many more measurements than those used by Heidemann et al., we provide additional comparisons in Table 8.3, again using Heidemann et al.’s estimates of network RTT and bandwidth. Again, the simulated values are not far from the measured values, except for the case of Ethernet (for the same reasons as described earlier).

Over these datasets, NCS performs similarly to the model used by Heidemann et al. and on average is slightly closer to the actual measurements reported. Thus we can conclude that at least in the small, NCS is likely to provide a reasonable estimate of real-world delays attributable to TCP.

8.4.3 Large-scale real-world networking tests

By making reasonable estimates of network and host parameters (shown in Table 8.4), we can replay the large UCB Home-IP trace (as described above in Section 8.3) within the simulator, and compare the resulting response times. Figure 8.5 shows the distribution, and Figures 8.6 and 8.7 show the cumulative distributions, of response times for the original trace and the results of two simulations (a deterministic run and a stochastic run that used heavy-tailed distributions of latencies to randomly determine latencies on a per-host and per-connection basis). Likewise, we show results from the same experiments in Figures 8.8-8.10, which compare the CDFs and distributions of effective bandwidth.

Parameter                              Value
packet size                            512 bytes
request size                           325 bytes
client network latency (one-way)       100 ms
client bandwidth                       3420 Bps
server network latency (one-way)       5 ms
server bandwidth                       16384 Bps
server per-request overhead            30 ms
cost of new connection                 22 ms
max conns. from client to host         4
latency distribution for hosts         Pareto
latency distribution for connections   Pareto

Table 8.4: Some of the simulation parameters used for replication of UCB workload.

The closeness of all of these graphs helps to validate the simulator. However, the results will not be identical for a number of reasons:

• The simulator does not estimate the same timings that the snooping proxy measured. The 667ms adjustment mentioned above in Section 8.3.2 is only a gross modification, and is not likely to adequately address the timings missing from the variety of client access devices, nor account for the differences in timings that result from significant buffering performed by terminal servers or modem compression.


Figure 8.5: Comparison of the distributions of response times from the UCB Home-IP request trace.


Figure 8.6: Comparison of the CDF of response times from the UCB Home-IP request trace.


Figure 8.7: Comparison of CDF of response times from the UCB Home-IP request trace (magnification of first 20s).


• NCS does not model all aspects of the Internet (e.g., packet loss, TCP congestion control) nor does it model all traffic on relevant network links.

• The simulator does not match the original host and network characteristics. Under the deterministic version of the simulator, each node is given the same network and host characteristics, which does not model the variations in connectivity (or responsivity of Web hosts). Under the stochastic version, while these characteristics are determined in part by sampling from particular distributions, the activity of a client may actually depend on the connectivity, which is being set independently.


Figure 8.8: Cumulative distribution function of simulated and actual effective bandwidth (throughput) from the UCB Home-IP request trace.


Figure 8.9: Distribution of simulated and actual effective bandwidth (throughput) from the UCB Home-IP request trace.


• The parameters for the simulations have not been exhaustively explored, suggesting that there may be parameter values that would produce results even closer to the trace results.


Figure 8.10: Log-log distribution of effective bandwidth (throughput) from the UCB Home-IP request trace.

The graphs also show that the stochastic run of the simulator is appreciably better than the deterministic run. In the stochastic simulation, we used Pareto distributions for the upstream latency of clients and the downstream latency of servers. To test the stochastic aspects of the simulator, thirty trials of the stochastic version were run over traces of 500k requests. In these tests the average of the average response times was 27.75s with a standard deviation of 13.14s. The average median was 4.24s, with a standard deviation of 0.36s. Thus, as expected with heavy-tailed distributions, we see variation in the means, but are able to find stable results in the median.
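The following sketch illustrates the same point in isolation: repeated samples from a heavy-tailed Pareto distribution show unstable means but comparatively stable medians. The scale and shape parameters here are invented for the illustration and are not NCS's actual settings.

import random, statistics

def pareto_latency(scale=0.1, alpha=1.2):
    # random.paretovariate(alpha) returns values >= 1, with a heavy tail for small alpha
    return scale * random.paretovariate(alpha)

for trial in range(5):
    sample = [pareto_latency() for _ in range(100000)]
    print("trial %d: mean=%.3fs median=%.3fs"
          % (trial, statistics.mean(sample), statistics.median(sample)))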

As in Figure 8.4, we have also displayed file size vs. response time from a simulated run, using static values for bandwidths and latencies, shown in Figure 8.11. In this scatter-plot, a well-defined lower-right edge is visible, corresponding to the best response times possible with the given bandwidths and latencies. The notch visible at approximately 1024 bytes corresponds to the point at which three packets are needed rather than two to transmit the file (since under typical TCP implementations, the server will have to wait for an acknowledgment from the client before sending the third packet).

The obvious differences between Figures 8.4 and 8.11 suggest that a single static specification of bandwidths and latencies for clients and servers is insufficient to generate a sufficiently wide distribution of response times. Figure 8.12, on the other hand, does show more variance in the distribution of response times, qualitatively closer to the actual distributions shown in Figure 8.4, albeit with a defined lower-right edge.


Figure 8.11: Scatter plot of file size vs. simulated (static) response time for first 100k requests from UCB trace.


Figure 8.12: Scatter plot of file size vs. simulated (stochastic) response time for first 100k requests from UCB trace.

Thus Figure 8.12 may be considered “closer” to real-world performance than Figure 8.11, and as a result demonstrates the trade-off of fidelity vs. complexity. While using heavy-tailed stochastic models may provide increased fidelity (in some sense), it increases the complexity of experiments that use the simulator (because of the need to perform repeated experiments to eliminate the stochasticity as the cause of the effect under consideration).

Trace                Median   Mean
Adjusted actual      3.6s     18.0s
Deterministic sim.   1.9s     5.0s
Stochastic sim.      4.0s     25.7s

Table 8.5: Basic summary statistics of the response times provided by the actual and simulated trace results.

The summary statistics are also closer, as shown in Table 8.5. The median response time from the stochastic run was only 11% larger than the actual median response time, as compared to the deterministic run, which was 47% smaller. Means also improved: 42% larger vs. 72% smaller. In any case, we maintain that NCS is capable of generating aggregate response time results that are comparable to real-world results, even in large-scale experiments.

8.4.4 LRU caching tests

To validate the caching code used in the simulator, we ran 12 days of the UCB modem workload [Gri97] through two additional pieces of code. The first is a simple Perl script that calculates the recurrence rate [Gre93] of the requests and caching statistics given the assumption of an infinite cache. Such a script tells us, among other things, that:

• There are a total of 1,921,788 unique cacheable objects out of 5,953,843 served objects. The total number of bytes delivered is just over 39GB.

• The overall recurrence rate is 67.23%.

• Taking into account typically uncacheable requests, and requiring that size and last-modification date (when available) must match, the maximum hit rate of an infinite passive cache is 33.45%.

The second is the University of Wisconsin caching simulator [Cao97]. It simulates various-sized caches under a few different replacement algorithms. It calculates an object hit rate of 34.49%.


Figure 8.13: Performance of the cache replacement algorithms for NCS and the University of Wisconsin simulator shows similar results.

Note that while the UofW simulator uses last-modified times if available, it does not take into consideration the response code or URLs that are typically considered uncacheable, as the Perl script does. If the Perl script is set to consider all responses as cacheable, it finds an overall infinite cache hit rate of 34.41%.

The UofW simulator also calculates hit rates for various fractions of the infinite cache size. These fixed cache sizes can then be simulated by NCS. In Figure 8.13 the hit rates are reported for various cache sizes for NCS and the UofW simulator. There are slight variations in hit rates, but these can be attributed to small differences in the implementations.
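For reference, the kind of fixed-capacity LRU hit-rate computation being compared here can be sketched as follows; cacheability and freshness checks are simplified relative to both simulators, so this is an illustration rather than either implementation.

from collections import OrderedDict

def lru_hit_counts(requests, capacity_bytes):
    """requests: iterable of (object_id, size, cacheable) tuples, in request order."""
    cache = OrderedDict()               # object_id -> size, least recently used first
    used = hits = total = 0
    for obj, size, cacheable in requests:
        total += 1
        if obj in cache:
            hits += 1
            cache.move_to_end(obj)      # mark as most recently used
            continue
        if not cacheable or size > capacity_bytes:
            continue
        while used + size > capacity_bytes:
            _, evicted_size = cache.popitem(last=False)   # evict least recently used
            used -= evicted_size
        cache[obj] = size
        used += size
    return hits, total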

8.5 Sample Experiments

In this section we provide some sample experiments to demonstrate the utility of the simulator and the various environments that it can model. Unlike the experiments in Section 8.4, we are not intending to replicate the original scenario under which the log might have been captured. Instead, we use the capabilities of the simulator to test new environments and variations to determine the effects of such changes (such as the introduction of a caching proxy, which we describe in Section 8.5.3).

8.5.1 Demonstration of proxy utility

In some situations, proxies can provide performance benefits even without caching any content. Here we will simulate such a situation to demonstrate a sample of the benefits that are possible and to provide an example use of the simulator. While one could calculate these results analytically, the simulator is capable of performing the calculations easily when configured properly and given an appropriate artificial trace on which to operate.

Consider the following scenario: a client is attached to the Internet with a bandwidth of 10000B/s and a network latency of 1s; a server is likewise attached with the same bandwidth and latency. Thus the overall (uncontested) bandwidth between them is still 10000B/s, but the one-way latency is 2s (thus the RTT is 4s). We will assume a segment size of 1000B, that request and response headers are negligible in size, and that the server takes 0s to calculate a response. If the client makes a new connection to the server and requests a file that is 10000B long, it will take (according to the simulator) 16.9s to get the entire file.
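As a rough illustration of where numbers of this magnitude come from, the following sketch estimates the transfer time under a naive slow-start model with the parameters above. It is not the NCS code (NCS uses a more sophisticated slow-start model, which is why it reports 16.9s), and the structure of the calculation is an assumption made for illustration only.

    # Naive slow-start estimate for one retrieval over a fresh TCP connection.
    # Assumed parameters follow the scenario in the text: 10000 B/s bandwidth,
    # 4 s round-trip time, 1000-byte segments, negligible headers, and zero
    # server think time.
    def naive_response_time(file_size, seg_size=1000, bandwidth=10000.0, rtt=4.0):
        segments = -(-file_size // seg_size)      # ceiling division
        t = rtt                                   # TCP connection setup (SYN / SYN-ACK)
        t += rtt / 2.0                            # request travels to the server
        window, sent = 1, 0
        while sent < segments:
            burst = min(window, segments - sent)
            transmit = burst * seg_size / bandwidth
            sent += burst
            if sent < segments:
                t += max(rtt, transmit)           # naive: wait a full RTT for ACKs
            else:
                t += transmit + rtt / 2.0         # final burst only has to reach the client
            window *= 2                           # slow start doubles the window
        return t

    print(naive_response_time(10000))   # roughly 20 s, in the neighborhood of the
                                        # naive slow-start figure quoted below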

However, if a proxy is introduced exactly half-way between the two, with infinite bandwidth, no internal latencies, and zero-latency connections to the Internet, the response time will improve. In this situation, the bandwidth is still 10000B/s between client and proxy, and between proxy and server, but the delay between client and proxy, and between proxy and server, is now just 1s (RTT of 2s). Even though the RTT from client to server is still conceptually 4s, TCP's slow start is driven by the RTT in the local connection, and effectively forms a pipeline such that activity (or delays) occurs simultaneously on both sides of the proxy, resulting in a response time of just 13s. For naive slow-start (instead of the more sophisticated one used here) the response times would be 20.6s vs. 14.7s. In either case, the improvement is a significant fraction of the cost of the retrieval.

If we use a more realistic environment in which both client and server are connected via a DSL line (1Mb/s, 10ms latency), we get a 232ms response time instead of the 200ms obtained with a proxy in the middle. In reality, the introduction of a proxy does introduce some latency for the computation time in the proxy (plus disk, if it is a proxy cache). This, too, could be modeled by the simulator for greater accuracy. In certain environments, however, if this latency is small enough, the overall effect can be positive, even without caching. In any case, the simulator provides an environment in which different scenarios can be tested.

[Plot for Figure 8.14: cumulative distribution of requests vs. response time (seconds); curves: with cache+persist, with cache, no cache.]

Figure 8.14: Comparison of client caching on the CDF of response times from the UCB Home-IP request trace.

8.5.2 Additional client caching

NCS provides the ability to model Web caching at clients and/or proxies. Here we examine the effect of adding a 1MB cache (in addition to any cache already present) at the client. This new space is called an extension cache [FJCL99]. We further assume that the content of this cache will expire in at most one day from the last request for that object. This simulation uses neither persistent connections nor pipelining, as in the original UCB Home-IP trace.7
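The cache behavior assumed here can be sketched as follows. This is a minimal illustration of an LRU extension cache with a one-day expiration, not the NCS implementation; the class and method names are made up for the example.

    from collections import OrderedDict

    ONE_DAY = 24 * 3600

    class ExtensionCache:
        """Byte-bounded LRU cache whose entries also expire one day after
        the last request for the object (an illustrative sketch)."""

        def __init__(self, capacity_bytes=1 * 1024 * 1024):   # 1MB, as in the text
            self.capacity = capacity_bytes
            self.used = 0
            self.entries = OrderedDict()       # url -> (size, last_request_time)

        def lookup(self, url, now):
            entry = self.entries.get(url)
            if entry is None:
                return False                   # miss: not cached
            size, last = entry
            if now - last > ONE_DAY:
                self._evict(url)               # expired: treat as a miss
                return False
            self.entries[url] = (size, now)    # refresh timestamp and LRU position
            self.entries.move_to_end(url)
            return True                        # hit

        def store(self, url, size, now):
            if size > self.capacity:
                return                         # too large to cache at all
            if url in self.entries:
                self._evict(url)
            while self.used + size > self.capacity:
                oldest = next(iter(self.entries))
                self._evict(oldest)            # evict the least recently used entry
            self.entries[url] = (size, now)
            self.used += size

        def _evict(self, url):
            size, _ = self.entries.pop(url)
            self.used -= size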

To examine the UCB Home-IP trace, we ran variations of the statically-configured simulator on the first 12 days of requests. We used the versions of the traces in which sizes did not include headers, and added a fixed header according to the average response header size from the overall trace (158 bytes). The servers are assumed to be connected with T1 (1.544Mbit) bandwidth, 50ms one-way latencies, and 30ms per-request overhead. Clients are connected at an effective bandwidth of 27.4kbps with 100ms one-way latencies.

7Actually, [GB97] says persistent connections were present, but very uncommon.

[Plot for Figure 8.15: cumulative distribution of requests vs. response time (seconds); curves: with cache+persist, with cache, no cache with persistent connections, no cache and no persistence.]

Figure 8.15: Comparison of proxy caching on the CDF of response times from the UCB Home-IP request trace.

From this test, it can be seen that a small amount of additional client caching can improve performance. Just over 11% of requests are served from the client extension caches. The mean response time drops from close to 5.5s to 5.1s, and efficiency improves 17%. Adding persistent connections improves performance even further, dropping the mean to 4.6s (and improving efficiency over uncached by 26%). The median results are visible in Figure 8.14, in which the uncached median stands at 2.4s, caching improves that to 2.2s, and persistent connections bring the median down to 1.6s.

8.5.3 The addition of a caching proxy

In lieu of client caching, we can use NCS to examine the effect of caching at the proxy level. In these experiments we assume the use of a single proxy cache that provides all content to the clients in the trace. While the proxy will cache objects for all clients (assuming a relatively small, fixed 512MB cache size), the clients will not use an extension cache.

In this test, the clients and servers are configured identically to those in Section 8.5.2, with the exception of no caching at the client. Here we add instead a proxy, with 45Mbit connectivity upstream and downstream. The modeled proxy also adds 2ms latency in both directions, and a per-request CPU overhead of 10ms.

In this system, a scenario with a non-caching proxy without persistent connections achieves a mean client response time of 5.10s. A caching proxy saves little more than 1.4% in mean response time, but does provide a reasonable hit rate (19.7%) and reduced network bandwidth usage (a savings of 9%). These minor improvements are expected, as the predominant bottleneck (the poor client connectivity) has not been addressed in this scenario (either by caching at the client or by persistent connections between client and proxy).

Support for persistent connections provides a good boost in response time. A non-caching proxy with persistent connections shaves 562ms from the average response time, and .6s off of the median (a savings of 10.7% and 27.3%, respectively). Add caching, and the mean response time drops to 4.65s (11.5% response-time savings over the non-caching, non-persistent scenario). See Figure 8.15 to compare all four configurations.

8.5.4 Modeling DNS effects

NCS offers a simple mechanism to model DNS effects. Each node can maintain a fixed-size cache of DNS lookups, holding entries for a parameterized amount of time. While the cost of DNS lookups is currently fixed (or possibly selected randomly from some distribution), a more complex simulation (not currently implemented) could be possible, utilizing organization-wide DNS caches to supplement the client cache.
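The per-node DNS model just described might be sketched as follows. This is an assumption-laden illustration (the class name, entry policy, and default values are not taken from the NCS code); the defaults happen to correspond to the browser-like configuration used in the next subsection.

    from collections import OrderedDict

    class DnsCache:
        """Fixed-size DNS cache: a hit adds no delay, a miss costs a fixed
        lookup delay, and entries expire after a configurable lifetime."""

        def __init__(self, max_entries=10, lifetime=900.0, miss_cost=0.6):
            self.max_entries = max_entries   # e.g., ~10 entries, as in Netscape 4
            self.lifetime = lifetime         # seconds before an entry expires
            self.miss_cost = miss_cost       # fixed cost (seconds) of a new lookup
            self.entries = OrderedDict()     # host -> time the entry was cached

        def lookup_cost(self, host, now):
            cached = self.entries.get(host)
            if cached is not None and now - cached < self.lifetime:
                return 0.0                   # cached and still fresh: no added delay
            # Miss (or expired entry): pay for a lookup and (re)cache the answer.
            self.entries[host] = now
            self.entries.move_to_end(host)
            if len(self.entries) > self.max_entries:
                self.entries.popitem(last=False)   # drop the oldest entry
            return self.miss_cost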

8.5.4.1 Client caching of DNS entries

Here we model characteristics of the Netscape 4 browser (10 entries in the DNS cache, expiration after 900s), and assume a fixed cost of 600ms to retrieve a new entry for the UCB Home-IP trace. We are not simulating any additional Web caching, so that only the caching of DNS responses contributes to changes in performance. Otherwise, this simulation is identical to the caching client scenario in Section 8.5.2.

[Plot for Figure 8.16: cumulative distribution of requests vs. response time (seconds); curves: DNS cost=zero, DNS not cached, DNS cached.]

Figure 8.16: Comparison of DNS caching at the client on the CDF of response times from the UCB Home-IP request trace.

[Plot for Figure 8.17: cumulative distribution of requests vs. response time (seconds); curves: DNS cost at proxy only, DNS cost at both with caching, DNS costs at both without caching.]

Figure 8.17: Effect of DNS caching at a proxy on the CDF of response times from the UCB Home-IP request trace.

We can show improvements in all measures: mean per-request response times decreased by 11% (from 6.25s to 5.54s); median per-request response times decreased by 20% (from 3.0s to 2.4s); and mean per-request throughput also improved by 19%. See Figure 8.16.

8.5.4.2 Proxy caching of DNS entries

DNS caching can also be performed by shared proxies, and is in fact a benefit of using a (non-transparent) proxy. When a client is configured to use a parent proxy, the only DNS lookup that is necessary is for the IP address of the proxy, since all requests will be sent to it. The proxy will have to do the DNS lookups, and can operate a larger and better-managed DNS cache than most browsers. In the case in which requests are “transparently” intercepted by a proxy, both the client and the proxy have to ask for DNS resolution. In Figure 8.17, we again assume modem-based clients using the UCB trace with a cost of 600ms for DNS lookups from clients and a cost of 200ms for DNS lookups at the proxy, and show a measurable improvement in typical response times when using an explicit proxy.

8.5.5 Incorporating prefetching

NCS incorporates the ability to perform prefetching — that is, to speculatively issue retrieval requests in advance of an actual request in the trace. Prefetching can be performed by clients or by proxies. The model to decide what to prefetch can be built by the client, proxy, or server. In the case of a server or proxy model, the predictions can be included as hints that are sent to the downstream prefetcher that can use them or integrate them with its own predictions. However, the model built is reflective of the requests seen by the system building the model — i.e., a server model sending hints will be built from the requests (prefetching or demand) generated by its clients.

To demonstrate this feature, we will use the SSDC Web server trace and prediction code introduced in Chapter 4, and configure the simulated network topology to consist of non-caching DSL clients (128kbps bandwidth, 80ms round-trip delay) and a prefetching proxy with 5MB of cache placed at the ISP to which the Web server is connected via a 33.6kbps modem. The proxy cache is additionally configured such that it will retain objects for an hour at most. The prediction module makes only one prediction, using the largest Markov model with sufficient support, after each request. The maximum order used was three; the minimum order was one. Ten requests of an object were considered sufficient to be used for prediction.

[Bar chart for Figure 8.18: median and mean response times (seconds) for the Omniscient, Prefetching, and Passive proxy configurations.]

Figure 8.18: Effect of prefetching at a proxy on the response times from the SSDC Web server request trace.

If the proxy is configured to passively cache only, the simulation estimates a mean end-user response time of 3.437s and median of .790s. It had an object hit rate of 51%, and used only 72.2% of the demand bandwidth when communicating with the origin server. Alternatively, the prefetching version described above reduced the median client delay to .70s and the mean to 3.280s, at the cost of another 4.7% in demand bandwidth.

Given a suitably augmented trace file (that is, one with the next future request listed alongside the current request), we can simulate a single-step omniscient predictor. Thus, for comparison, if we had an omniscient predictor that would give the next request made by a client, the same prefetching proxy would have achieved a median client delay of .670s and a mean of 3.106s, as shown in Figure 8.18. The prefetching version therefore came within 4.5% of the median response time and 6% of the mean response time that were possible.
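Producing such an augmented trace is straightforward; the following sketch shows one way it could be done. The record layout and function name are assumptions for illustration, not the format actually used by NCS.

    def augment_with_next_request(records):
        """records: list of (timestamp, client, url) tuples sorted by timestamp.
        Returns the same records with a fourth field holding the next request
        made by the same client (None for a client's final request)."""
        last_index = {}                          # client -> index of its previous record
        augmented = [list(r) + [None] for r in records]
        for i, (_, client, url) in enumerate(records):
            if client in last_index:
                augmented[last_index[client]][3] = url   # fill in the "next request"
            last_index[client] = i
        return augmented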

8.6 Discussion

8.6.1 Related work

Many researchers use simulation to estimate Web caching performance, as mentioned in Chapter 7. Often they measure object and byte hit rates and ignore response times. Response time improvement, however, is a common justification for the use of Web caching, and is arguably the initial raison d'être for content delivery networks such as Akamai [Aka02]. Few, however, report on the verification and validation efforts they performed, if any. Typically, the most that is provided is a statement that hit rates (or other measures) are similar to other published results (e.g., [KLM97]). Here we describe simulators that either estimate response time or provide some measure of validation.

PROXIM [CDF+98, FCD+99] is a caching and network effects simulator developed by researchers at AT&T Labs. Like NCS, PROXIM accounts for TCP slow start and does not consider packet losses and their resulting effects. It does, however, consider canceled requests and their effect on bandwidth usage, unlike NCS or the other simulators mentioned here. PROXIM does not include prefetching, and only one of their papers [FCD+99] provides for even minimal validation — showing similar curves for the simulated response time distribution as the original response time distribution, and comparing median total response times. The traces used by Feldmann et al. were captured by a snooping proxy, much like those described in Section 8.3, and have likely been cleaned and adjusted similarly (although this isn't described explicitly). One unusual aspect of these experiments is that the traces provide extraordinary timing details (including timestamps of TCP as well as HTTP events), allowing PROXIM to use RTT estimates from measurements of SYN to SYN-ACK and HTTP request and response timestamps on a per-connection basis.

Like NCS, the simulator described in Fan et al. [FJCL99] uses the timing information in a portion (12 days) of the UCB trace [Gri97] to estimate response times seen by each modem client. However, it also uses a much simpler model of response time that

• estimates client-seen response time to be the sum of 1) the time between seeing the request and the first byte of the response, and 2) the time to transfer the response over the modem link,


• appears to ignore connection setup costs,

• appears to ignore TCP slow start effects, and

• groups overlapping responses together.

However, Fan et al. claim that by grouping overlapping responses together they are able to measure the time spent by the user waiting for the whole document. This simulator does incorporate some prefetching — it uses one kind of prediction by partial match (PPM) for proxy initiated prefetching (in which the only objects prefetchable are those cached at the proxy, resulting in no new external bandwidth usage).

Kroeger et al. [KLM97] examine the limits of latency reduction from caching and prefetching. They use proxy traces from a large company and find that 23% of the latency experienced by a user is internal, 20% is external but cannot be fixed by caching or prefetching, and the remaining 57% correspond to external latencies that can be improved. Prediction is performed by an optimal predictor, with some limitations on when prediction is permitted. The latencies used are not calculated per se, but are extracted directly from the logs.

The ns simulator [UCB01] is likely the best-known networking simulator, but is not typically used for caching performance measurements, possibly because of slow simulation speeds. It uses detailed models of networking protocols to calculate performance metrics (see [BEF+00] for an overview). This simulator has been validated by widespread user acceptance and extensive verification tests [Flo99].

8.6.2 Future directions

Much of NCS is simplistic, and could be improved (both in fidelity and in simulation efficiency) with additional time and effort. Suggested areas of improvement include:

• Range-based requests and caching. HTTP/1.1 traces often include partial content responses, implying that range requests were used.

• Better models for packet loss. The simulator does include a simple uniform probabilistic model of loss, but one based on congestion, along with recovery mechanisms, would be more appropriate.

• A better model for consistency. The current simulator does not sufficiently model changes at the source, and as a result it does not generate simulated GET If-Modified-Since requests.

• Model terminated requests. Some traces provide information that specifies if a response was terminated early (as a result of user cancellation via the stop button, or by clicking ahead before the document fully loaded). Similarly, it may be important that prefetch requests using GET be terminable when demand requests are present.

• Prefetching other than most-likely next request(s). Large objects need extra time to be retrieved, and so prefetching further in advance (i.e., more than one step ahead) may be needed.

Such improvements would increase simulation fidelity. Their impact on simulation results, however, is unknown.

8.6.3 Summary

In this chapter we have motivated the development of NCS and described its features and capabilities. We additionally provided a number of sample experiments showing the simulator’s utility in a variety of contexts.

While we expect that true believability will come with hands-on experience, we have also used this chapter to argue for the validity and believability of NCS for simulating retrieval latencies. In the process, we have provided a case study for network simulator validation by comparison to real-world data and discussed some of the preparation required for using the UCB Home-IP HTTP Trace.

Chapter 9

Multi-Source Prefetching

9.1 Introduction

In this thesis we have examined Web predictions from two general sources — prediction from content, and prediction from history. These can be broken down into many independent sources of prediction information:

• Links from the current Web page.

• Links from other browser sources, such as bookmarks.

• Links from content in the user's context, such as Usenet news and e-mail.

• URLs based on the user’s historical access patterns.

• URLs based on the average user’s access patterns as seen by the server.

• URLs based on a workgroup’s access patterns as seen by a proxy.

Some of these we have explored in depth in previous chapters. However, each of these sources of predictions has limitations. Each provides a different viewpoint, and each has relevant knowledge at different times.

This chapter will explore the results of combining predictions from multiple sources. In particular, we will examine the performance improvements possible when:

• combining content-based prediction and history-based prediction.

• combining multiple history-based predictions from various sources.

Each combination is capable of providing performance improvements in some situations.

While not common, other researchers have proposed the idea of combining information from various sources for prefetching. Hine et al. [HWMS98] build models of resources likely to be accessed in the same session and combine that knowledge with a categorization of the client's interest in resources at the site. In contrast, our approach will build models of user behavior from the perspective of the client as well as the server, but does not attempt to classify user interest.

Jiang and Kleinrock [JK97, JK98], on the other hand, explicitly suggest the use of both client- and server-based predictions. In their approach, if the client has visited the page fewer than five times and the server has a prediction, the server's prediction is used; otherwise, the client's prediction is used. Unfortunately, the experimental tests are too limited to determine realistic performance measures.

9.2 Combining Multiple Predictions

Prediction is often quite similar to classification, and combining multiple classifiers has a history of success in machine learning tasks. For example, boosting is a method for finding accurate classifiers by combining many weaker classifiers [FS96, SFBL97]. The weighted majority [LW94b] and Winnow [Lit88] algorithms can be seen as methods for combining the classifications of many simplistic “experts”. However, combining ranked predictions in machine learning is not so common.

In a sense, the PPM-style predictors introduced in Chapter 4 already combine multiple predictions. PPM takes the predictions of multiple contexts (i.e., the predictions of Markov models of different orders) and combines them to generate an overall probability distribution of the possible next actions.

Our approach will be similar. After every action in the sequence, each prediction source provides a (possibly empty) list of predictions, each associated with a weight. We will then merge those predictions using a fixed weight, as we did in Section 4.5.3, with performance similar to PPM. More complex merging functions are certainly possible, and may offer improved performance (as they have in other areas such as search engines [CSS99, DKNS01]).
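One plausible reading of this fixed-weight merge is sketched below. The data layout, weights, and function name are illustrative assumptions rather than the code used in Section 4.5.3, whose exact weighting scheme may differ.

    def merge_predictions(sources, max_predictions=5):
        """sources: list of (source_weight, [(url, confidence), ...]) pairs.
        Returns the top URLs ranked by their summed, weighted confidence."""
        scores = {}
        for weight, predictions in sources:
            for url, confidence in predictions:
                scores[url] = scores.get(url, 0.0) + weight * confidence
        ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
        return ranked[:max_predictions]

    # Example: a history-based source (weight .7) and a content-based source (.3)
    history = [("/a.html", 0.4), ("/b.html", 0.2)]
    content = [("/b.html", 0.5), ("/c.html", 0.3)]
    print(merge_predictions([(0.7, history), (0.3, content)], max_predictions=2))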

One issue relevant (but not specific) to our application is the concern for trust of predictions. If we were to implement a prefetching system that incorporated the use of server hints as a source of predictions, we would need a trust model for those predictions. Since the predictions are likely to influence the prefetching activity of our system, unscrupulous server operators could send hints with high weights to cause, for example, increased traffic that might be reported to advertisers, or if given sufficient traffic, to overload some unsuspecting target server. While we do not explore this issue further, we note that some mechanism will be needed to model the trustworthiness of the servers sending such hints.

9.3 Combining Content and History

In this section we consider the task of combining content-based and history-based predictions. We will use the same log as described in Chapter 6, apply the history-based mechanisms from Chapter 4, and compare the predictive performance of each separately and in combination. Unfortunately, because we will be using content-based predictions, we will be unable to use NCS to simulate response times, as the content-based predictions include objects that were never requested in the original trace, and thus we do not have sizes or retrieval times for those objects. We will instead focus on predictive performance by incorporating the content-based predictions into a modified version of the history-based prediction codes used in Chapter 4.

Recall the performance of the content-based predictor from Chapter 6. For comparison, we'll use the top-1 and top-5 similarity-based orderings using up to 20 terms around the anchor, and we'll measure the predictive performance over all points (not just the possible ones). In this environment, the content-based predictor gets 5.3% and 15.8% correct for top-1 and top-5, respectively. A PPM-based predictor (using bigrams and trigrams) that makes a prediction whenever the probability of success is estimated to be .1 or above gets 6.8% and 8.4%, respectively. However, if we combine the two predictors — by using the content-based predictor whenever the history-based predictor did not use all of its allowed predictions — we get much better performance, as shown in Figure 9.1. For this dataset, at least, the two predictors make predictions in mostly different scenarios, resulting in performance that is close (between 85% and 95%) to the sum of the performance of the two predictors alone.

Figure 9.1: Predictive accuracy on the full-content Rutgers trace using prediction from history, from content, and their combination.
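The combination rule just described is simple enough to sketch directly. The function and argument names below are illustrative, and both predictors are assumed to return ranked lists of URLs; this is not the modified prediction code itself.

    def combine(history_predictions, content_predictions, max_predictions=5):
        """History-based predictions get first claim on the available slots;
        any unused slots are filled from the content-based ranked list."""
        combined = list(history_predictions[:max_predictions])
        for url in content_predictions:
            if len(combined) >= max_predictions:
                break
            if url not in combined:
                combined.append(url)
        return combined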

9.4 Combining History-Based Predictions from Multiple Sources

In this section we examine the performance potential for combining multiple history-based predictions. We consider three logs — the SSDC Web server log, a Web server log from NASA, and the UCB Web proxy log.

For each trace we will present the results of an arbitrary combination of simulation parameters. For the Web server logs we will also examine a wider selection of possible prediction parameter settings. In particular, we will examine the effect of the following on response times:

• the number of predictions permitted (from 1 to 5),

• the maximum n-gram size used (from 2 to 4),

• the minimum support for prediction (from 1 to 20), and

• the minimum confidence for prediction (from .01 to .2).

9.4.1 Web server log: SSDC

9.4.1.1 Initial experiments

As described in Section 4.4.3, the SSDC Web server log records the requests made by users of a modem-connected small software development company's Web server. We will use NCS to simulate a sample of potential prefetching configurations and evaluate the client-perceived response times that result from those configurations. In our simulations we will assume the use of a modem with slightly less than 32kbps bandwidth and 100ms one-way latency. A mix of client connections is likely (i.e., both modem and broadband users), so the simulated client is given approximately 120kbps bandwidth and 40ms one-way latency. Thus clearly the origin server bandwidth is the bottleneck, as it was for the system in reality.

When caching or prefetching at the client is used, the client will have an additional 2MB cache. Prediction models at either the client or server will be built with the same parameters: n-grams must be between 1 and 4 requests long; a scenario must be seen at least 10 times before a prediction is made; and the probability in a prediction must be at least .1 to be used. The latter two provide thresholds to require some confidence in the prediction before spending the resources to prefetch.
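A predictor of the kind configured here could be sketched as follows. This is a simplification of the Chapter 4 prediction code with hypothetical names; it illustrates only the threshold logic (preferring the longest context with sufficient support and confidence), not the full model.

    from collections import defaultdict

    class NGramPredictor:
        """n-gram (Markov) next-request predictor with support and confidence
        thresholds (an illustrative sketch, not the dissertation's code)."""

        def __init__(self, max_order=4, min_support=10, min_confidence=0.1):
            self.max_order = max_order
            self.min_support = min_support
            self.min_confidence = min_confidence
            self.counts = defaultdict(lambda: defaultdict(int))  # context -> next -> count

        def update(self, history, request):
            # Record the request against every context length up to max_order.
            for order in range(1, self.max_order + 1):
                if len(history) >= order:
                    context = tuple(history[-order:])
                    self.counts[context][request] += 1

        def predict(self, history):
            # Prefer the longest context that has been seen often enough.
            for order in range(self.max_order, 0, -1):
                if len(history) < order:
                    continue
                context = tuple(history[-order:])
                nexts = self.counts.get(context)
                if not nexts:
                    continue
                total = sum(nexts.values())
                if total < self.min_support:
                    continue
                best, count = max(nexts.items(), key=lambda item: item[1])
                if count / total >= self.min_confidence:
                    return best
            return None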

Table 9.1 summarizes the performance of various configurations. It provides the mean and median response time for each approach, as well as the bandwidth required, the fraction of requests served by prefetched content, and the fraction of requests for which content was prefetched but not usable.

From the same experiments, Figure 9.2 depicts the fraction of requests served with at most a given response time, by showing the cumulative distribution of client-side response times.

Approach             Mean resp. time   Median resp. time   Fraction of original bw   % reqs pref'd successfully   % reqs pref'd unsuccessfully
Prefetch perfectly   4.23s             0.95s               97.5%                     30.88%                       1.26%
Prefetch both        4.64s             1.17s               133.7%                    34.06%                       59.94%
Prefetch server      4.62s             1.18s               132.3%                    33.89%                       58.98%
Prefetch client      5.16s             1.68s               94.7%                     0.08%                        0.11%
No prefetching       5.16s             1.68s               94.7%                     0%                           0%
No cache             5.91s             1.91s               100%                      0%                           0%

Table 9.1: Summary statistics for prefetching performance when one prediction is made with the SSDC log.

[Plot for Figure 9.2: cumulative distribution function vs. response time (seconds); curves: prefetch both, prefetch server, prefetch next one perfectly, prefetch client, no prefetch, no cache.]

Figure 9.2: Cumulative distribution of client-side response times when only one prediction is used with the SSDC log.

The bottom-most line corresponds to the performance seen without any additional caching or prefetching. As a result, it has the worst performance of all, with mean and median response times of 5.91s and 1.91s (recorded in Table 9.1) while using the originally recorded bandwidth (defined for comparison as 100%). The next two lines overlap almost entirely — the configuration in which an additional cache at the client is used, and the one in which those clients can also prefetch their single best prediction. This is because the clients are rarely able to make predictions (because of the minimum thresholds established). With an object hit rate of over 8%, the mean and median response times improve to 5.16s and 1.68s, and bandwidth needs are reduced by 5%.

The server, however, has the benefit of seeing all requests and thus having lots of data to provide for a better model. When the server provides the predictions for the client to prefetch, it achieves client object hit rates of 39.4% and mean/median response times of 4.62s/1.18s. Even with that performance, the per-client model can still provide some benefit, as when the two are combined, hit rates increase slightly to 39.6%, reducing the median slightly to 1.17s, but increasing the mean response time to 4.64s. Bandwidth usage in both cases increases by close to a third.

At the top we show the performance provided by the perfect predictor — that is, if after each request the client knew the next request needed by the user, it could attempt to prefetch it at that time. This scenario provides a median response time of just .95s and reduces the mean response time to 4.23s while using only 97.5% of the original bandwidth. The perfect predictor has bandwidth usage of less than 100% because of the use of caching, but has greater usage than the caching-only configuration because it has some requests that are unsuccessful (not retrieved in time to be useful) and may additionally push some otherwise useful objects out of the cache.

Approach             Mean resp. time   Median resp. time   Fraction of original bw   % reqs pref'd successfully   % reqs pref'd unsuccessfully
Prefetch perfectly   4.23s             0.95s               97.5%                     30.88%                       1.26%
Prefetch both        4.41s             0.75s               179.1%                    41.94%                       122.8%
Prefetch server      4.55s             0.94s               170.4%                    39.42%                       106.8%
Prefetch client      5.16s             1.68s               94.8%                     0.10%                        0.21%
No prefetching       5.16s             1.68s               94.7%                     0%                           0%
No cache             5.91s             1.91s               100%                      0%                           0%

Table 9.2: Summary statistics for prefetching performance when five predictions are made with the SSDC log.


If we allow up to five predictions, we can improve performance further. Table 9.2 provides statistics over the same set of configurations (depicted again as a CDF in Figure 9.3), but with the use of five predictions for prefetching.

[Plot for Figure 9.3: cumulative distribution function vs. response time (seconds); curves: prefetch both, prefetch server, prefetch next one perfectly, prefetch client, no prefetch, no cache.]

Figure 9.3: Cumulative distribution of client-side response time when five predictions are used with the SSDC log.

Five choices do not help client prediction in this dataset, but the server's model can increase hit rates further, to 44.5%, and decrease mean and median response times to 4.55s and .94s, respectively. The cost for this increased performance is increased bandwidth usage — 70% more than the original. When combined with the client predictor, hit rates reach 46.8% and client response times drop to a mean of 4.41s and a median of .75s, at a cost of 75% more than the original bandwidth.

At first glance, such performance improvements seem impossible, since the hit rates and median response times are considerably better than those of the perfect predictor. However, there is a rational explanation. Although the perfect predictor knows exactly what the next request will be, that knowledge is only available when the current request is made. In many cases, it is not possible to retrieve the object before it is requested. The five-guess configurations, on the other hand, prefetch many more objects, but do so further in advance, allowing an object to arrive perhaps multiple requests before it is needed and thus providing even higher hit rates and shorter average response times.

9.4.1.2 Parameter studies

We now systematically examine how performance changes as the prediction parameters are modified. We first investigate the relationship between median response times and mean response times. We show the asymmetry of the two metrics here to provide a reference point for comparison, as we will focus on median measurements in subsequent graphs.

Figure 9.4 plots the median versus mean response times for the various configurations of our prefetching system that uses server hints for the SSDC trace. One point represents the caching-only configuration; the other points signify the other tested configurations. Mean response times range from slightly over 4 seconds to just over 18 seconds. Median response times vary from under .5 seconds to close to 2.1 seconds. This figure shows that while many configurations can significantly improve median response times (since there are many points well under the caching-only configuration), none significantly improve mean response times (since better points are not much below the caching-only configuration), although many significantly increase mean response times.

[Scatterplot for Figure 9.4: mean response time (seconds) vs. median response time (seconds); series: prefetching configurations and the cache-only configuration.]

Figure 9.4: Scatterplot comparing median versus mean response times for various configurations using the SSDC trace.

It also suggests that the very large or slow-to-retrieve objects are less likely to be prefetchable (since they are a significant contributor to the mean, which does not improve much).

Our next concern is for the cost of prefetching as compared to its benefits, and how parameters might be used to optimize the benefits gained. Therefore, in Figure 9.5, we plot the median response time versus bandwidth usage. We also distinguish among the number of predictions permitted (labelled Prefetching 1, Prefetching 2, etc. in the figure). All of the prefetching approaches use more bandwidth than a caching-only configuration, but it is apparent that many prefetching approaches are capable of reducing the median response time while using no more than twice the bandwidth of the original trace. The figure also demonstrates that many configurations are able to cut the caching-only median response time in half (or better).

We can also make some observations based on the number of predictions permitted. Using just a single best prediction is typically insufficient for large improvements in response times. However, configurations using two predictions are sufficient to provide a range of points along the lower left edge, demonstrating various tradeoffs between bandwidth and response time. The use of three or four predictions results in curves slightly further back. Five predictions are likewise further back, but are also more spread out through the field, achieving both the best and worst median response times.

[Scatterplot for Figure 9.5: bandwidth used (% of original bw) vs. median response time (seconds); series: Prefetching 1 through Prefetching 5, and cache only.]

Figure 9.5: Scatterplot comparing median response time versus bandwidth consumed using the SSDC trace, highlighting the number of predictions permitted.

Finally, four points deserve highlighting. First is the caching-only configuration, using 94.7% bandwidth to provide a median response time of 1.69 seconds. Second, a five-prediction point provides an outlier with a much better median response time than any other, at 0.48 seconds. That configuration allowed n-grams up to length four (the maximum tested) and required a minimum of two repetitions and a minimum probability of 0.1 (both in the middle of the ranges tested). Third, while not one of the smallest medians, a two-prediction point along the edge provides an excellent tradeoff, using 170% of the original bandwidth to achieve a median response time of 0.82 seconds. Later, in Figure 9.8, we will use this configuration with the label “Second best server”. It used a fixed n-gram length of two, and required ten repetitions and a minimum confidence of 0.05 to make a prediction. Fourth, in the top-right corner lies the point with the largest median response time (our “Worst server”). It also used an n-gram length of two and a minimum confidence of 0.05, but required only 1 repetition. Thus, many predictions satisfied the requirements and were used, resulting in less successful prefetching and leading to congestion and increasing delays.

[Scatterplot for Figure 9.6: bandwidth used (% of original bw) vs. median response time (seconds); series: minimum confidence of .01, .02, .05, .10, and .20, and cache only.]

Figure 9.6: Scatterplot comparing median response time versus bandwidth consumed using the SSDC trace, highlighting the minimum confidence required.

[Scatterplot for Figure 9.7: bandwidth used (% of original bw) vs. median response time (seconds); series: minimum repetitions of 1, 2, 5, 10, and 20, and cache only.]

Figure 9.7: Scatterplot comparing median response time versus bandwidth consumed using the SSDC trace, highlighting the minimum repetitions required.

In Figures 9.6 and 9.7 we re-examine the median response time and bandwidth used, but in the context of the parameter settings for the minimum confidence and repetitions required, respectively. While the various confidence values are scattered throughout Figure 9.6, there are definite clusterings in Figure 9.7. The latter shows that the minimum repetition setting should certainly be larger, rather than smaller. The best points (along the lower left edge) are primarily those with minimum repetitions of 10 or 20. Thus, it is obvious that better predictions are made when the model has seen the pattern many times.

[Plot for Figure 9.8: fraction of responses vs. response time (seconds); curves: best server, second best server, prefetch both, cache only, worst server; the median is marked.]

Figure 9.8: CDF of simulated response times for various configurations using the SSDC trace.

In Figure 9.8, we plot the cumulative distribution function of the response times for various configurations on the SSDC trace. At the bottom we show the caching-only configuration. The top curve is occupied primarily by the configuration achieving the very lowest median response time. However, while it does have a better hit rate and median, its curve ends up being slightly worse than our “Second best server” for response times greater than two seconds. For comparison, we note that the really poorly performing configurations, such as our “Worst server” shown here, can perform much worse than the caching-only configuration when they overwhelm limited resources, causing congestion and delays.

While not plotted graphically, we can compare the summation of all response times simulated for each configuration (i.e., the total waiting time experienced by the simulated users). We find that in comparison to the caching-only configuration, the worst configuration cost users an additional 344% in delays, while the best saved 8.15% in waiting time. Surprisingly, the second best saved users more than 15% in delays, while transmitting only 80% additional bytes compared to the caching-only configuration.

This is explained by the “best” server's poorer response times for documents that were large, causing suboptimal performance after about 80% of responses in Figure 9.8.

9.4.2 Web server log: NASA

9.4.2.1 Initial experiments

In addition to the SSDC trace above, we provide a prefetching analysis from a sample (12 days, 858123 requests) of a second Web server trace, from NASA [Dum95]. For this simulation using NCS, we assume a server connected at T1 speeds (1.5Mbps, 10ms one-way latency) and clients identical to those of the SSDC simulations (approximately 120kbps and 40ms one-way latency).

Again, the client will have an additional 2MB cache. All prediction models will use n-grams between 1 and 3 requests long, and the scenario must be seen at least 10 times before a prediction is made with a probability of at least .1. Each predictor will be permitted to make just one prediction.

Table 9.3 summarizes the performance of the various configurations. It provides the mean and median response time for each approach, as well as the bandwidth required, the fraction of requests served by prefetched content, and the fraction of requests for which content was prefetched but not usable. Using the same experiments, Figure 9.9 shows the distribution of client-side response times. The bottom-most line corresponds to the performance seen without any prefetching (just caching). As a result, it has the worst performance of all, with mean and median response times of 1.62s and .47s, while using 90.8% of the original bandwidth used without a cache. Client-based prefetching helps slightly, reducing mean and median response times to 1.61s and .46s, at a cost of an additional 1% of bandwidth. Server-based prefetching helps tremendously; the median response time drops by more than half to .2s, and the mean by 15% to 1.38s, but at a cost of 29% more than the original bandwidth. The combination of client and server predictions results in only slightly better performance — the mean response time stays the same, and the median drops to .19s. Bandwidth usage increases slightly to 131.7%.

Approach             Mean resp. time   Median resp. time   Fraction of original bw   % reqs pref'd successfully   % reqs pref'd unsuccessfully
Prefetch both        1.37s             0.19s               131.7%                    38.4%                        46.6%
Prefetch server      1.38s             0.20s               129.0%                    38.0%                        48.7%
Prefetch client      1.61s             0.46s               91.0%                     1.0%                         0.2%
Cache only           1.62s             0.47s               90.8%                     0%                           0%

Table 9.3: Summary statistics for prefetching performance when one prediction is made with the NASA log.

[Plot for Figure 9.9: cumulative distribution function vs. response time (seconds); curves: prefetch both, prefetch server, prefetch client, cache only.]

Figure 9.9: Cumulative distribution of client-side response time when only one prediction is used in the NASA trace.

9.4.2.2 Parameter studies

We now examine whether performance changes for simulations based on the NASA trace in the same way that performance changed for SSDC-based simulations.

The initial NASA configurations tested above are limited. In fact, with ideal parameter settings, it is possible to serve more than half of the requests from the client cache. This makes the median response time metric useless for comparative purposes (but potentially great for marketing — “reduced median response time by 100%!”). Instead, in Figure 9.10, we compare the 75th percentile response time to the mean, again for the various configurations of our prefetching system that uses server hints for the NASA trace. Unlike the SSDC data, it shows that many configurations are able to improve both 75th percentile response times and mean response times. However, we note again that poor configurations can hurt in both dimensions.

[Scatterplot for Figure 9.10: mean response time (seconds) vs. 75th percentile response time (seconds); series: prefetching configurations and cache only.]

Figure 9.10: Scatterplot comparing 75th percentile versus mean response times for various configurations using the NASA trace.

[Scatterplot for Figure 9.11: bandwidth used (% of original bw) vs. 75th percentile response time (seconds); series: Prefetching 1 through Prefetching 5, and cache only.]

Figure 9.11: Scatterplot comparing median response time versus bandwidth consumed using the NASA trace, highlighting the number of predictions permitted.

We compare the cost of prefetching to its benefits on the 75th percentile response times in Figure 9.11. We also distinguish among the number of predictions permitted. Like the SSDC experiments, it is apparent that many prefetching approaches are capable of reducing the median response time while using no more than twice the bandwidth of the original trace. The figure also demonstrates that many configurations are able to cut more than 25% off the caching-only response times. While not shown, the caching-only configuration had a median response time of 1.615 seconds, and more than half of the configurations tested achieved hit rates of greater than 50%, making their median response times zero.

[Scatterplot for Figure 9.12: bandwidth used (% of original bw) vs. 75th percentile response time (seconds), magnified to the best points; series: Prefetching 1 through Prefetching 5, and cache only.]

Figure 9.12: A magnified view of the best points in a scatterplot comparing median response time versus bandwidth consumed using the NASA trace, highlighting the number of predictions permitted.

Observations can also be made based on the number of predictions permitted. While a single best prediction does improve response time, it does not perform as well as successful configurations using a larger number of predictions. In contrast, configurations using two, three, four, or five predictions all succeed in getting significant improvements in latency without bandwidth requirements getting too large (seen better in Figure 9.12), but are also well-represented throughout the scatterplot.

The caching-only configuration uses 90.8% bandwidth to provide a 75th percentile response time of 1.01 seconds. The best configuration achieved a response time of .71 seconds at a cost of 168.4% of the original bandwidth. It made up to five guesses using n-grams of length two, with a minimum of 5 repetitions and a minimum probability level of .1.

[Scatterplot for Figure 9.13: bandwidth used (% of original bw) vs. 75th percentile response time (seconds); series: minimum repetitions of 1, 2, 5, 10, and 20, and cache only.]

Figure 9.13: Scatterplot comparing median response time versus bandwidth consumed using the NASA trace, highlighting the minimum repetitions required.

[Scatterplot for Figure 9.14: bandwidth used (% of original bw) vs. 75th percentile response time (seconds); series: minimum confidence of .01, .02, .05, .10, and .20, and cache only.]

Figure 9.14: Scatterplot comparing median response time versus bandwidth consumed using the NASA trace, highlighting the minimum confidence required.

While the SSDC traces showed significance only for the number of repetitions, Figures 9.13 and 9.14 both show visible clustering. Typically the best-performing points have minimums of at least five repetitions and a probability of .1.

In Figure 9.15 we plot the cumulative distribution function of the response times for extreme configurations on the NASA trace. Unlike the SSDC trace, we can see improvements in response times for close to 95% of all responses. As in the SSDC case, the worst-performing configuration performs significantly worse for essentially all non-cached responses (which account for more than 40%).

[Plot for Figure 9.15: fraction of responses vs. response time (seconds); curves: best server, cache only, worst server; the 75th percentile and median are marked.]

Figure 9.15: CDF of simulated response times for various configurations using the NASA trace.

Calculation of the sums of all response times simulated for each configuration demonstrates that in the worst case, users will wait an additional 304%, while in the best case, users will wait 19.7% fewer seconds while transmitting 85% additional bytes compared to the caching-only configuration.

9.4.3 Proxy server log: UCB

As described in Section 4.4.1, the UCB Home-IP Trace records the requests made by relatively low-bandwidth dialup and wireless users of the UC Berkeley Home IP service.

Unfortunately, straightforward simulation of history-based prefetching using the UCB trace finds essentially no improvement in client-side response times (the median is 1.91s). Plotting the CDF of the prefetching and non-prefetching implementations shows just a single overlapping curve. This result is initially perplexing, and so we consider it further below.

[Plot for Figure 9.16: number of references vs. object rank; series: UCB, alpha=.72.]

Figure 9.16: Frequency of request versus request rank for the UCB Home-IP trace.

9.5 Proxy Log Analysis

This chapter has explored two combinations of predictions — combining content and history, and combining multiple views of history. In both cases we demonstrated, by simulation, situations in which combinations can provide better performance than either prediction source alone.

However, the combination of multiple views of history does not appear to be effective when tested on the UCB Home-IP usage trace. While the Home-IP trace is large, containing millions of requests, it also has a large number of clients and a large number of servers (more than 40,000) represented. As a result, the number of requests per server is relatively low — much lower than in the server-specific traces we have discussed so far. Highly referenced servers in this trace receive between 20,000 and 100,000 references, but there are few such servers. Thus the primary difficulty is a lack of data. Without enough references to individual pages, they will not be predicted in the future.

A comparison of some of the characteristics of the UCB trace with others may be helpful.

[Plot for Figure 9.17: number of references vs. object rank; series: NASA, UCB server subset, SSDC, and Zipf with alpha=1.]

Figure 9.17: Frequency of request versus request rank for various server traces.

Figure 9.16 shows the number of times a particular request was made versus the rank of that request (where a rank of 1 signifies the most popular) for the UCB proxy trace. Figure 9.17 shows the same for a number of server traces as well as subsets of the UCB trace. A CDF view of all traces can be found in Figure 9.18. The UCB trace is a good fit to the Zipf-like model, in which the relative probability of a request for the ith most popular page is proportional to 1/i^α. However, the server traces are a poorer match. If a Zipf curve were matched to them, α would be higher than typical,1 and in general they do not match as well the log-log line typical of Zipfian distributions. In fact, other work has found that not all Web traffic has a Zipf-like request popularity with α less than 1 [PQ00]. While not identified as such, the effect is also visible in plots in earlier work [AW97].
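For reference, a Zipf exponent of the kind quoted in the footnote can be estimated from a trace with a simple log-log regression, as sketched below. This is a generic least-squares fit offered for illustration; it is not the fitting procedure used for the figures or for the values cited from other work.

    import math
    from collections import Counter

    def estimate_zipf_alpha(requests):
        """Fit log(frequency) = c - alpha * log(rank) and return alpha."""
        freqs = sorted(Counter(requests).values(), reverse=True)
        xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
        ys = [math.log(f) for f in freqs]
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) /
                 sum((x - mean_x) ** 2 for x in xs))
        return -slope          # P(i) ~ 1/i^alpha, so the log-log slope is -alpha

    # Toy example: popularity that falls off quickly with rank
    print(estimate_zipf_alpha(["a"] * 8 + ["b"] * 4 + ["c"] * 2 + ["d"]))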

One aspect that is apparent is that the Web server traces appear to have fewer infrequent requests. The least frequent request (a reference that appears once, often called a one-timer) is much more prevalent in the UCB trace than in the others.

1For example, [MWE00] shows proxy cache logs with values for α between .74 and .84 with good fits to rank vs. frequency of reference plots, and [BCF+99] examines a different set of proxy traces with α values between .64 and .83.

[Plot for Figure 9.18: fraction of all references vs. object rank; series: NASA, UCB server subset, SSDC, UCB.]

Figure 9.18: Cumulative distribution function of frequency of request versus request rank for various Web traces.

One-timers account for 22.5% of all references. In fact, pages referenced fewer than 10 times account for more than half of all references (51.8%). In contrast, pages referenced fewer than 10 times in the NASA trace account for just 1% of all references, and in the SSDC trace for even less — .6% of all references. In the UCB server subset, such pages account for 12.9% of all references. For a set of proxy traces, Mahanti et al. [MWE00] report one-timers consisting of between 17.8% and 47.5% of all requests.

Figure 9.19 depicts the cumulative distribution functions of the frequency of page references (that is, the ordered list of the number of distinct times some object was referenced versus the total number of requests for pages in that category). It shows, for example, that all three UCB-based traces (the original UCB trace, a 100,000 request UCB subset, and the UCB server subset) have a larger fraction of requests for rare objects than the standard server-based traces (NASA and SSDC).

The more typical analysis is concerned primarily with the number of objects that account for a large percentage of requests (and are thus ideal for caching).

[Plot for Figure 9.19: fraction of all references vs. popularity; series: UCB, UCB 100k, UCB server subset, NASA, SSDC.]

Figure 9.19: The CDF of frequency of page references, ordered by increasing frequency.

In the NASA trace, the top-ranked 1% of references account for 65% of the trace. For the SSDC trace, it is 80.5%. Similarly, Doyle et al. [DCGV01] describe very large traces from IBM's Web site in which a small number of popular objects accounts for an unusually large percentage of requests. In a 1998 trace, the top 1% of objects generated 77% of all requests. In a 2001 IBM trace, the top 1% accounted for 93% of all requests. In contrast, the top-ranked 1% of UCB references account for only 31.8% of all references.

Therefore, it appears that the popularity of documents in proxy traces is quite different from that in Web server traces. The large fraction of one-timers and other rare pages makes an informed prediction difficult, if not impossible.

While the UCB trace as a whole makes effective prefetching difficult, there are still some servers represented within it that have a reasonable number of requests made to them. Thus we selected the top twenty servers (that is, those that served the most requests). These twenty servers (out of more than 40,000) handled a total of 12.5% of all requests in the trace. However, the trace examined still represents only 12 days of traffic, which is a relatively short time period for clients to learn and benefit from past experience.

[Plot for Figure 9.20: cumulative distribution function vs. response time (seconds); curves: prefetch both, prefetch server, prefetch client, cache only.]

Figure 9.20: CDF of response times for various prefetching configurations using the UCB server subtrace.

NCS was used to simulate client response times under a variety of prefetching scenarios, much as we did with the server logs earlier in this chapter. The results are shown in Figure 9.20. Using the server's prediction model (either alone or in combination with the client's model), we get an improvement in median response time of 50ms. While not reflected in the response times, the client model does provide increased request hits — without prefetching, 170941 requests were satisfied by the 2MB client cache. Client prefetching increases this to 177441 hits. When added to server prefetching, the benefit is much smaller — an increase from 251954 to 252807.

As a result of relatively strict limits on when predictions are made, the additional prefetch requests did not affect bandwidth utilization significantly. Even when prefetching based on client and server predictions, bandwidth utilization was just 16.4% higher than the original trace, and just 24.2% higher than passive caching alone with the 2MB client cache.

9.6 Summary

This chapter has explored two combinations of predictions — combining content and history, and combining multiple views of history. In both cases we demonstrated that the combination of predictions may provide better performance than either alone.

We also explored performance as prediction parameters were varied. Simulation experiments with NCS in this chapter demonstrated that prefetching on server traces could cut median response times in half or better, and reduce overall client-perceived response times by 15-20%, by transmitting 80-85% more bytes than would be transmitted by the equivalent caching-only configuration.

From this we found that the minimum repetitions threshold was an important value for getting high-quality predictions, while the minimum confidence threshold was much less useful. The number of predictions and the size of the n-grams used for matching were less important — arguably useful points were found with various values of those parameters, although a single prediction did appear to be less helpful than two or more.

However, we've also used this chapter to point out some characteristics of situations in which prefetching will not be particularly effective. In particular, when rare objects comprise a large fraction of all requests (as they do in the UCB proxy trace), history-based predictions are at a severe disadvantage. While intuitions and some data suggest that longer traces (i.e., building a model over a longer period of time) will help, the real answer will be to use multiple prediction sources for prefetching. Content-based prediction is one answer, as is the use of server hints when they are available.

Chapter 10

Simultaneous Proxy Evaluation (spe.tex)

10.1 Introduction

While Chapters 8 and 9 considered the value of simulation for estimating performance, they are primarily of value to researchers trying to test ideas. However, simulators are generally unable to evaluate the performance of implemented systems. Simulation often abstracts away many of the implementation-specific details — such as memory and disk speeds, network interrupts, processor characteristics, etc. — which become important when they vary between systems. Indeed, this is one reason why benchmarks are common — they provide a mechanism to test and compare performance on something close to a real-world workload, hopefully exercising those details in the process.

This chapter presents a new architecture for the evaluation of proxy caches. Initially, it grew out of research in techniques for prefetching in Web caches. In particular, we found that existing mechanisms for the evaluation of proxy caches were not well suited to prefetching systems. Objective evaluation is paramount to all research, whether applied or academic. Since this is certainly relevant when exploring various approaches to prefetching, we considered the range of existing proxy evaluation methods, and found that they each had distinct disadvantages. This led to the design of the more general architecture described below, and to its implementation which will be described in Chapter 11.

Our evaluation method, the Simultaneous Proxy Evaluation (SPE) architecture, reduces the problems of unrealistic test environments and dated or inappropriate workloads, and allows for the evaluation of a larger class of proxies. It objectively measures proxy response times, hit rates, and staleness by forming a wrapper around the tested proxies to monitor all input and output. In the next section we motivate the need for the

SPE architecture by describing the space of proxy evaluation methodologies. We then specify our new evaluation architecture, and comment on implementation issues. We conclude the chapter by describing a set of sample evaluation tasks that would utilize it, and relate SPE to other work.

10.2 Evaluating Proxy Cache Performance

A variety of proxy caches exist on the market. (Web sites such as web-caching.com [Dav02b] provide a detailed list of products and vendors.) The features and performance provided sometimes vary more than the marketing literature would suggest. As more and more Internet users access the Internet through proxy servers every day, these factors play a significant role in determining the perceived quality of the Internet.

Therefore, evaluating the various aspects of proxy performance is essential. There are, in fact, many characteristics of proxy caches that could be evaluated:

• The peak and average number of requests that can be serviced by the proxy cache per unit time.

• The object hit ratio — the number of objects that are serviced from the proxy’s cache against the total number of objects served by the proxy.

• The byte hit ratio — the sum of the bytes served from the proxy’s cache against the total bytes served by the proxy. (A small sketch computing these two ratios appears after this list.)

• The average response time as experienced by the user.

• Bandwidth used by the proxy. In addition to being affected by hit rates, a prefetching proxy is likely to use more bandwidth than a non-prefetching proxy.

• Robustness of a proxy.

• Consistency of the data served by a proxy.

• Conformance to HTTP standards.
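
For concreteness, the object and byte hit ratios can be computed from per-request records as in the following sketch; this is not one of the dissertation’s tools, and the request_t fields are invented for illustration.

/* Sketch only: object and byte hit ratios from per-request records. */
#include <stddef.h>

typedef struct {
    int  served_from_cache;   /* 1 if the proxy answered from its cache */
    long bytes;               /* size of the response body in bytes     */
} request_t;

static void
hit_ratios(const request_t *reqs, size_t n,
           double *object_hit_ratio, double *byte_hit_ratio)
{
    size_t i;
    long hits = 0, hit_bytes = 0, total_bytes = 0;

    for (i = 0; i < n; i++) {
        total_bytes += reqs[i].bytes;
        if (reqs[i].served_from_cache) {
            hits++;
            hit_bytes += reqs[i].bytes;
        }
    }
    *object_hit_ratio = n ? (double) hits / (double) n : 0.0;
    *byte_hit_ratio   = total_bytes ? (double) hit_bytes / total_bytes : 0.0;
}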

Many of these are implementation characteristics, and thus can only be evaluated by techniques that work with complete systems. The SPE architecture (described in the next section) is designed to work at this level and is concerned with many of the characteristics of proxy cache performance: bandwidth usage, retrieval latencies, robustness and consistency. As a result, it provides benefits complementary to existing techniques described in Chapter 7. It is, however, the first mechanism to allow for the simultaneous black-box comparison of competing mechanisms on live, real-world data in a real-world environment.

10.3 The SPE Architecture

The SPE architecture can use actual client requests and the existing network to evaluate multiple proxies. It records timing measurements to calculate proxy response times and can compute page and byte hit rates. In summary, it forms a wrapper around competing proxies and produces logs that measure external network usage as well as performance and consistency as seen by a client. By simultaneously evaluating multiple proxies, we can utilize a single, possibly live, source of Web requests and a single network connection to provide objective measurements under a real load. Importantly, it also allows for the evaluation of content-based prefetching proxies, and can test for cache consistency.

10.3.1 Architecture details

In the SPE architecture, clients are configured to connect to a non-caching proxy, the Multiplier. Requests received by the Multiplier are sent to each of the proxy caches which are being evaluated. Each attempts to satisfy the request. To do so, each either returns a cached copy of the requested object, or forwards the request to a parent proxy cache, the Collector. The Collector then sends a single copy of the request to the origin server for fulfillment when necessary (or potentially asks yet another upstream proxy cache for the document, as shown back in Figure 2.5). By forcing the proxies to use the Collector, we can prevent two types of problems that might otherwise arise: 1) an increase in traffic loads on the network or at the origin server, and 2) side-effects from multiple copies of non-idempotent requests [FGM+99, Dav01a]. The Multiplier also sends the requests directly to the Collector to use the responses to service the client requests and validate the responses from the test proxies. See Figure 10.1 for a diagram of this architecture.

Figure 10.1: The SPE architecture for online evaluation of competing proxy caches.

Thus the Multiplier does not perform any caching or content transformation. It forwards copies of each request to the Collector and the test proxies, receives responses, performs validation and logs some of the characteristics of the test proxies. The clients view the Multiplier as a standard HTTP proxy.

Only minimal changes are required in the clients, the proxies, or the origin servers involved in the evaluation. The clients need to be configured to connect to the Multiplier instead of the origin server, and with popular browsers this is straightforward. The test proxies need to be configured with the Collector as the parent proxy. No changes are required to the Web servers which ultimately service the requests.

When a user of a proxy-enabled browser clicks on a link, the browser sends the request for that URL to the proxy. The proxy, in this case the Multiplier, takes the request and sends a duplicate request to each of the proxies under evaluation and directly to the Collector. Some proxies may have the object in the cache, in which case they return the object quickly. Others may experience a cache miss, or need to verify the contents of cache, and so have to send a request through the Collector.

Without a proxy Multiplier, client browsers would need modifications to send requests to each proxy under evaluation. Instead, one can configure off-the-shelf browsers to use the Multiplier as their proxy. In addition to request duplication, the Multiplier calculates request fulfillment times for each of the proxies being evaluated. The Multiplier also determines which response is passed back to the client — either arbitrarily, or by an algorithm such as by first response.1
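
The dissertation does not show the Multiplier’s bookkeeping here; the fragment below is only a sketch of the kind of per-proxy timing record this description implies, with field and function names invented for illustration.

/* Sketch only: a per-proxy, per-request timing record. */
#include <sys/time.h>

struct proxy_timing {
    struct timeval sent;        /* request forwarded to this proxy     */
    struct timeval first_byte;  /* first byte of the response received */
    struct timeval last_byte;   /* last byte of the response received  */
    long           bytes;       /* size of the response body in bytes  */
};

/* elapsed time in milliseconds between two timestamps */
static double
elapsed_ms(struct timeval a, struct timeval b)
{
    return (b.tv_sec - a.tv_sec) * 1000.0 +
           (b.tv_usec - a.tv_usec) / 1000.0;
}

/* request fulfillment time recorded for one proxy */
static double
fulfillment_ms(const struct proxy_timing *t)
{
    return elapsed_ms(t->sent, t->last_byte);
}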

Each of the proxy caches being evaluated can be treated as a black-box — we do not need to know how they work, as long as they function as a valid HTTP/1.0 [BLFF96] or 1.1 [FGM+99] proxy cache. This is helpful in particular for commercial proxies, but in general it eliminates the need for detailed knowledge of the algorithms used (as required for simulation) or for source code (as required for specialized logging additions [MR97]). Typically the proxy caches would run on different machines, but for simple tests that may not be necessary. Alternatively, the same software with different hardware configurations (or vice versa) can be tested in this architecture.

In order to prevent each proxy cache from issuing its own request to the destination Web server, we utilize a proxy Collector which functions as a cache to eliminate extra traffic that would otherwise be generated. It cannot be a standard cache, as it must eliminate duplicate requests that are normally not cacheable (such as those resulting from POSTs, generated by cgi-bin scripts, designated as private, etc.). By caching even uncacheable requests, the Collector will be able to prevent multiple requests from being issued for requests that are likely to have side effects such as adding something to an electronic shopping cart. However, the Collector must then be able to distinguish between requests for the same URL with differing arguments (i.e., differing fields from otherwise identical POST or GET requests).

The Collector returns all documents with the same transmission characteristics as when fetched from the destination server. This includes the amount of time for initial connection establishment as well as transmission rate. By accurately simulating a “miss”, the Collector 1) enables the black-box proxy to potentially use the cost of retrieval in its calculations, and 2) allows the Multiplier to record response times that more accurately reflect real-world performance. Without it, any requests for objects that are in the Collector’s cache will inappropriately get a performance boost.

1Since the first response is likely to be a cached response, it will affect (positively) the users’ perceived performance during the test. On the other hand, the response might be stale or erroneous, so a conservative approach would be to send back the response obtained from the Collector directly.

Each proxy tested thus sees a single proxy (the Multiplier) as client, and a single proxy (the Collector) as parent. If the tested proxies only cache passively, then the overall bandwidth used by the system should be exactly that which is needed to serve the client requests. If one or more proxies perform active revalidation or prefetching, then the overall bandwidth used will reflect the additional traffic.

The architecture attempts to tolerate problems with the test proxies. Problems like non-adherence to protocol standards, improper management of connections, and the serving of stale responses can be caught and logged by the Multiplier. Note that while the SPE architecture is designed to preserve safety for passive proxy caches (allowing just one copy of a request to go to the origin server), it cannot provide guarantees of safety for non-demand requests that use the standard HTTP/1.1 GET mechanism [Dav01a].

To summarize, traditional evaluation of caching algorithms has been via trace-based simulations. We propose an online method, useful because it:

• takes existing proxy caches and uses them as black-boxes for evaluation purposes.

• can test the proxy caches on real-world data.

• eliminates variation in measured performance due to changes over time in network and server loads.

• eliminates variation in measured performance due to “dated” request logs.

• allows for the evaluation of content-based prefetching systems without having to locally mirror all pages referenced in a trace or to generate an artificial workload complete with internal links.

• allows for the evaluation of prefetching systems that might fetch pages never seen by the user (and thus not included in a traditional trace).

• allows for the measurement of consistency by comparing objects returned from proxies with those fetched directly.

• simultaneously compares different proxies on the same request stream.

• can evaluate cache freshness.

• can be used in the environment where the proxy will be used.

We realize that any method has drawbacks. We have identified the following disadvantages of the SPE architecture:

• Evaluating multiple prefetching schemes may place a higher demand for bandwidth on the network. If bandwidth usage is already high, the increased demand may cause congestion.

• Multiple copies of comparable hardware may be needed. At least n + 1 systems are needed to compare n different proxy caches.

• The Multiplier and Collector will introduce additional latencies experienced by users. However, our expectation is that this will be less than that of a two-level cache hierarchy (since the Multiplier does not do any caching).

• The proxy Multiplier effectively hides the client IP address. This prevents a tested proxy from performing tasks differently depending on the client, such as predicting future accesses on a per-client basis.

• Caches are evaluated independently. Hierarchies are not easily included in the SPE architecture, even though better performance is often claimed when part of a hierarchy.

• SPE tests may not be replicable. This drawback applies to any method using a live workload or network connection.

• The Multiplier and Collector must manage large numbers of connections (discussed further below).

10.3.2 Implementation issues

The SPE architecture described has characteristics that may not be entirely feasible in some circumstances. For example, to eliminate any requirements for user intervention, there should be an existing proxy that can be replaced by the Multiplier (i.e., at the same IP and port address), or the Multiplier and Collector combination could be configured to transparently handle and route HTTP requests. In the earlier description of the SPE architecture, all HTTP requests go through all proxies when not filled by the previous proxy’s cache. In particular, all black-box proxy misses go through the Collector. To ensure its use, the Collector should be placed on a firewall so that there is no chance of the proxies being tested getting around it.

To preserve client performance during the test, the Multiplier contacts the Collector directly. It also uses the object returned from the Collector as the standard against which the results of the proxies can be compared for consistency. When a proxy returns a stale object (in comparison) or an error, that can be recorded. By sending a request directly to the Collector, the Multiplier also eliminates the difficulty of determining which response to return to the client, and provides some degree of robustness in case of failure of the proxy caches being tested.

In general, any implementation of the SPE architecture should be as invisible to the participating software as possible. For example, caching connections has been shown to be a significant factor in cache performance [CDF+98], so the Multiplier should support HTTP/1.0 + Keep-Alive and HTTP/1.1 persistent connections to clients for performance, but should only attempt to contact the proxies at the same protocol level at which the client contacted the Multiplier. In this way the Multiplier will attempt to generate the same ratio of HTTP/1.0 and 1.1 requests as it received. Likewise, the Collector may receive a range of request types from the proxies under evaluation depending on their capabilities. For highest fidelity, the Collector would record when it connects to destination Web servers that support HTTP/1.1 and use that information to selectively provide HTTP/1.1 support for requests from the tested proxies.

Unfortunately, the requirements for an accurate Collector are high. If one is unavailable, a traditional proxy cache may need to be used as a Collector, which means it would not attempt to cache the “uncacheable” objects, nor return objects with accurate timing characteristics. This allows for the calculation of hit-rates but not comparative average response time values. Additionally, some proxies may not be able to handle firewalls, so they may attempt to fetch some objects directly. Therefore, instead of passing all requests to the proxies being tested, the Multiplier may need to be configured to fetch non-cacheable requests (cgi, forms, URLs containing “?”, etc.) directly to ensure that only one copy gets delivered to the origin server.

Other difficulties include implementing a sufficiently robust Multiplier and Collector — for high-load measurements (such as that needed for enterprise-level proxies [MR97]), they need to be able to handle a higher load than the best black-box proxy being tested, and support k + 1 times as many connections (where k is the number of black-box proxies). In general, this difficulty is significant — under the SPE architecture, the Multiplier and Collector have to manage a large number of connections. As a result, this architecture is not well suited for stress-testing proxies, but instead provides insight into their performance on (perhaps more typical) non-peak workloads.

10.4 Representative Applications

To demonstrate the potential of the SPE architecture, we present the following scenarios. Each exhibits how some aspect of the SPE architecture can be valuable in evaluating alternative proxies.

10.4.1 Measuring hit-rates of black-box proxy caches

Consider first the task of comparing bandwidth reduction across multiple black-box proxy caches, including some that perform prefetching on a believable client load. Since the internal workings may not be known, simulation is not viable, and in order to evaluate the prefetching caches, a realistic workload with full content is desirable. This is a typical evaluation task for corporate purchasing but is also similar to that needed

by researchers evaluating new prototypes.

Properly implemented, the SPE architecture can be used to channel all or part (with load balancing hardware) of the existing, live client request load through the multiple proxies under evaluation. Since both incoming and outgoing bytes to each proxy can be measured, the calculation of actual bandwidth savings on the current workload and network connection can be performed. Additionally, the SPE architecture is sufficient to measure the staleness of data returned by each proxy. Finally, note also that with some loss of generality, a minimal test of this sort can be made without implementing the Collector, instead testing the proxies on cacheable requests only (with the Multiplier fetching uncacheable requests directly).

10.4.2 Evaluating a cache hierarchy or delivery network

In an early version of this work [Dav99b] we proposed the use of the SPE Multiplier for evaluating the latency benefits of using a cache hierarchy, such as NLANR [Nat02]. In this model, the Collector is not used, and the Multiplier replicates requests, sending them to different places. To test a cache hierarchy, we would send one request to the hierarchy, and the other directly to the origin server. Likewise, to test a content delivery network, one would send the request to the CDN and to the origin server.

There are many concerns with this kind of evaluation, however. Most worrisome is the replication of non-idempotent requests (those causing side-effects) that eventually get back to the origin server. We discuss this issue further in Chapter 12. If that were not an issue (say, for static, highly cacheable objects), then the remaining concerns are in methodology. One must take care in measuring external services as they cannot be restarted and the evaluation itself can affect their performance. For example, by sending the request to an origin server, that request will warm the server, potentially placing the object in a server-side cache, or at least loading the relevant files in a file buffer. Then, in the case that the cache hierarchy or CDN does not have a fresh copy of the content requested, they can get it from the origin at a faster rate than would otherwise be the case. Likewise, servers that employ CDNs to handle delivery of static content may not often serve such content or be tuned to do so, and thus be at a disadvantage

for a comparison which ideally would compare best cases with best cases, or average cases with average cases, on both sides.

10.4.3 Measuring effects of varying OS and hardware

In this task, a single proxy caching solution is used, such as Squid [Wes02]. The question being asked is which version of UNIX performs best, or what hardware is sufficient to get the best performance for a particular organization’s workload. If the various installations of Squid are identical, the application-level operations of the cache will be almost identical. Thus, the difference in performance from using IDE versus SCSI drives, or performance from a stock OS installation versus one that is custom-tuned, can be determined. Since SPE allows each to receive the same request load and generate a single network load, the relative performance of each configuration can be calculated under the current loads and network capabilities of the particular organization.

10.4.4 Evaluating a transparent evaluation architecture

Finally, consider the problem of evaluating the overhead of a transparent evaluation architecture. An implementation of SPE, for example, imposes two additional proxies between client and server (the Multiplier and Collector). Each of these is likely to add some latency to handle connection requests and to copy data. In addition, the Collector portion of SPE provides caching services, and like other proxy caches, has the potential to serve stale data.

Given an SPE implementation and an appropriate (artificial) workload, the SPE implementation could be used to evaluate itself, to find out how much latency it introduces and to attempt to verify the consistency of the Collector implementation.

10.5 Related Work

A number of researchers have proposed proxy cache (or more generally just Web) benchmark architectures (e.g., the Wisconsin Proxy Benchmark [AC98b, AC98a], WebMonitor [AAY97a, AAY97b], hbench:Web [MSC98], and Surge [BC98b]). Some use artificial traces; some base their workloads on data from real-world traces. They are not, however, principally designed to use a live workload or live network connection, and are generally incapable of handling prefetching proxies.

Web Polygraph [Rou02] is an open-source benchmarking tool for performance measurement of caching proxies and other content networking equipment. It includes high-performance HTTP clients and servers to generate artificial Web workloads with realistic characteristics. Web Polygraph has been used to benchmark proxy cache performance in multiple Web cache-offs [RW00, RWW01].

Koletsou and Voelker [KV01] built the Medusa Proxy, which is designed to measure user-perceived Web performance. It operates similarly to our Multiplier in that it duplicates requests to different Web delivery systems and compares results. It also can transform requests, e.g., from Akamaized URLs to URLs to the customer’s origin server. The primary use of this system was to capture the usage of a single user and to evaluate (separately) the impact of using either: 1) the NLANR proxy cache hierarchy, or 2) the Akamai content delivery network. The first test is quite similar in concept to that made by Davison [Dav99b] as a partial application of the SPE architecture, which also explored the potential for use of the NLANR hierarchy to reduce retrieval times, but which assumed the use of a small local proxy cache in between the browser and the hierarchy. While acknowledging the limitations of a small single-user study, their paper also uses a static, sequential ordering of requests — first to the origin server, and then to the NLANR cache. The effects of such an ordering (such as warming the origin server) are not measured. Other limitations of the study include support only for non-persistent HTTP/1.0 requests and a fixed request inter-arrival time of 500ms when replaying logs.

Liston and Zegura [LZ01] also report on a personal proxy to measure client-perceived

performance. Based on the Internet Junkbuster Proxy [Jun00], it measures response times and groups most requests for embedded resources with the outside page. Limitations include support for only non-persistent HTTP/1.0 requests, and a random request load.

Liu et al. [LAJF98] describe experiments measuring connect time and elapsed time for a number of workloads by replaying traces using Webjamma [Joh98]. The Webjamma tool plays back HTTP accesses read from a log file using the GET method. It maintains a configurable number of parallel requests, and keeps them busy continuously. While Webjamma is capable of sending the same request to multiple servers so that the server behaviors can be compared, it is designed to push the servers as hard as possible. It does not compare results for differences in content.

While all of the work cited above is concerned with performance, and may indeed be focused on user-perceived latencies (as we are), there are some significant differences. For example, our approach is designed carefully to minimize the possibility of unpleasant side effects — we explicitly attempt to prevent multiple copies of a request instance from being issued to the general Internet (unlike Koletsou and Voelker). Similarly, our approach minimizes any additional bandwidth resource usage (since only one response is needed). Finally, while the SPE Multiplier can certainly be used for measuring client-side response times if placed adjacent to clients, it has had a slightly different target: the comparative performance evaluation of proxy caches.

10.6 Summary

In this chapter we presented an online evaluation method, called the SPE (Simultaneous Proxy Evaluation) architecture, which simultaneously evaluates different caching strategies on the same request stream. This is in contrast to existing evaluation methodologies that use less realistic test environments and potentially out-of-date or irrelevant workloads. Importantly, it also allows for the evaluation of content-based prefetching proxies, and can test for cache consistency. To help motivate our unique approach to cache evaluation, we have provided some sample applications to demonstrate the

potential of the proposed architecture.

The SPE architecture has been designed in particular for use in an online environment with real traffic. However, it can still be useful as an evaluation technique with artificial workloads or trace logs, as in the experiments reported here. In fact, artificial workloads are often useful for evaluating systems under workloads that cannot easily be captured or are in excess of current loads. In this way one can capture the statistical properties of specialized workloads and still retain the benefits of simultaneous evaluations on a live network.

The SPE architecture is particularly useful for evaluation of black-box proxies and prefetching systems. It can be used to help validate manufacturer claims and provide comparisons between systems using live networks. Evaluation systems that implement the SPE architecture will be welcome additions to the set of measurement tools available to cache designers and network engineers alike. In the next chapter we describe such an implementation.

Chapter 11

SPE Implementation and Validation

11.1 Introduction

There are many possible methods to evaluate caching algorithms and implemented proxy caches, as we describe in Chapter 7. Of those that examine real systems, the typical approach is to apply benchmark tests (e.g. [AC98b, AC98a, BC98a, MJ98, GCR98, BD99, Rou02]) to either measure performance with a standard load, or to determine the maximum load attainable. The current de facto process is to compare many proxies using the same Polygraph [Rou02] workload (e.g., in a caching competition [RW00, RWW01]). Unfortunately, these approaches are unusable for evaluating proxy caches that preload content (such as [CY97, Cac02a]), and do not necessarily test on real-world workloads.

A preloading proxy cache typically chooses what to preload based on either user retrieval history or the content of recently retrieved pages. Neither are modeled well (if at all) in artificial workloads and datasets. In the case where the workload is a captured trace from real users, the content is typically unavailable (at least not in the form that the users saw). Additionally, since a preloading proxy cache may choose to preload content that is never actually used, a captured trace will not be able to provide resource characteristics such as size or retrieval time.

For these reasons, we have proposed in Chapter 10 an approach using simultaneous proxy evaluation (SPE), potentially using a live network and client workload. In this chapter we describe an initial implementation of SPE, and our experiences with it. In the next sections we describe our application-layer implementation, and discuss issues that arose from the implementation and how we resolved them. We provide some baseline experiments in Section 11.4 to help validate its performance. In Section 11.5,

we demonstrate the use of our implementation by detailing two sample experiments that evaluate multiple proxies simultaneously. We wrap up with a discussion of future work and conclusions reached from this effort.

11.2 Implementation

Accompanied by a few simple log analysis tools, the Multiplier and Collector are the two major pieces of software that make up any SPE implementation. In our system the Multiplier is a stand-alone multi-threaded program which is compatible with HTTP/1.1 [FGM+99] and HTTP/1.0 [BLFF96] protocols and was developed from scratch. The publicly available Squid proxy server [Wes02] version 2.5PRE7 was modified to develop the Collector.

11.2.1 The Multiplier

The Multiplier listens on a configurable port for TCP connections from clients and assigns a thread from a pre-generated pool for every client connection opened. Connections to the tested proxies are handled by this process using select(2). When a complete request is read, the process opens a TCP connection with the Collector and test proxies, sends the request to them (the HTTP version of this request is the same as the one sent by the client) and waits for more requests (in the case of persistent connections) and responses. Persistent connections are allowed if the client requests them (the version of Squid used to build the Collector supports persistent connections). However, the test proxies are not required to support persistent connections. If they do not, a new connection is opened every time a client sends a request over a persistent connection in the same session. An alternative approach is to not support persistent connections at all, which is clearly not desirable when some proxies do support persistent connections. In any case, the cost of establishing a connection will be recorded each time for a proxy which does not support client-side persistent connections.

Figure 11.1 shows the headers of a sample request and response.

Sample HTTP/1.0 request:
GET http://news.bbc.co.uk/fn.gif HTTP/1.0
Referer: http://news.bbc.co.uk/
Proxy-Connection: Keep-Alive
User-Agent: Mozilla/4.76 [en] (WinNT; U)
Host: news.bbc.co.uk
Accept: image/gif, image/x-xbitmap, image/jpeg

Sample HTTP/1.0 response:
HTTP/1.0 200 OK
Date: Mon, 26 Mar 2001 05:42:00 GMT
Server: Apache/1.3.12 (Unix)
Last-Modified: Fri, 10 Dec 1999 15:35:39 GMT
ETag: "2c137-65f-38511dcb"
Content-Length: 1631
Content-Type: image/gif
Proxy-Connection: keep-alive

Figure 11.1: Sample HTTP/1.0 request from Netscape and response via a Squid proxy.

The Multiplier parses the status line of the request and response to record the request method, response code and HTTP version. The only other headers that are manipulated during interaction with client and proxies are the connection header (“Connection” and the non-standard “Proxy-Connection”) to determine when the connection is to be closed and the content length header (“Content-Length”) to determine when a request or a response is complete. The Multiplier adds a “Via” header to the requests it sends in conformance with HTTP/1.1 standards.

The Multiplier measures various features like the time to establish connection, the size of the object retrieved, the initial latency, the total time to transfer the entire object and the MD5 checksum [Riv92] of the object. A data structure with this information is maintained and once all the connections are closed, the measurements are logged, along with the results of validation of responses. Validation of responses is performed by comparing the status codes, the header fields and the MD5 checksums of the responses sent by the Collector and each test proxy. Mismatches, if any, are recorded. Figure 11.2 shows some sample lines from the Multiplier log.
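
As an illustration of this validation step, the comparison might look like the following sketch; the record fields are invented, while the result strings mirror those listed in the caption of Figure 11.2.

/* Sketch only: the Collector's response is the reference, and a mismatch
 * in status code, compared header fields, or body checksum is logged. */
#include <string.h>

typedef struct {
    int           status;              /* HTTP status code                */
    char          content_type[128];   /* one of the compared headers     */
    unsigned char body_md5[16];        /* MD5 digest of the response body */
} response_t;

static const char *
validate_response(const response_t *collector, const response_t *proxy)
{
    if (collector->status != proxy->status)
        return "STATUS CODE MISMATCH";
    if (strcmp(collector->content_type, proxy->content_type) != 0)
        return "HEADER MISMATCH";
    if (memcmp(collector->body_md5, proxy->body_md5, 16) != 0)
        return "BODY MISMATCH";
    return "OK";
}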

11.2.2 The Collector

In this work, we have made changes only to the source code of Squid to implement our partial SPE tool. We have not modified the underlying operating system.

Wed Jun 20 02:10:11 2001:(pid=4630):RESULT: Collector: 0:34195 0:34090 5083 {DUMMY,} 192.168.0.2 non-validation HTTP/1.0 GET http://www.cs.rutgers.edu/
Wed Jun 20 02:10:11 2001:(pid=4630):RESULT: Proxy-1: 0:108105 0:50553 5083 {200,200,OK,} 192.168.0.2 non-validation HTTP/1.0 GET http://www.cs.rutgers.edu/
Wed Jun 20 02:10:11 2001:(pid=4630):RESULT: Proxy-2: 1:180157 0:3868 5083 {200,200,OK,} 192.168.0.2 non-validation HTTP/1.0 GET http://www.cs.rutgers.edu/
Wed Jun 20 02:10:11 2001:(pid=4630):RESULT: Proxy-3: 0:113010 0:34160 5083 {200,200,OK,} 192.168.0.2 non-validation HTTP/1.0 GET http://www.cs.rutgers.edu/

Figure 11.2: Sample Multiplier log showing the timestamp, the initial latency and trans- fer time in seconds and microseconds, the size of the object in bytes, the result of re- sponse validation, the client IP, the HTTP version, the method, and the URL requested. The possible results are OK (if validation succeeded), STATUS CODE MISMATCH (if status codes differ), HEADER MISMATCH (if headers differ, along with the particular header) and BODY MISMATCH (if the response body differs).

Recall from Chapter 10 that the Collector is a proxy cache with additional functionality:

• It will replicate the time cost of a miss for each hit. By doing so we make it look like each proxy has to pay the same time cost of retrieving an object.

• It must cache normally uncacheable responses. This prevents multiple copies of uncacheable responses from being generated when each tested proxy makes the same request.

The basic approach used in our system is as follows: if the object requested by a client is not present in the cache, we fetch it from the origin server. We record the connection setup time, response time and transfer time from the server for this object. In subsequent requests from clients for the same object in which the object is in the cache, we delay some time before sending the first chunk of the cached object, and we also delay some time before sending subsequent chunks, so that even though this request is a cache hit, the total response time experienced is the same as a cache miss.

The function clientSendMoreData in client_side.c takes care of sending a chunk of data to the client. We have modified it so that for cached objects, the appropriate delays are introduced by using Squid’s built-in event scheduling apparatus. Rather than sending the chunk right away, we schedule the execution of a new function, at the desired time, to send the chunk.
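
The modified code itself is not reproduced in this dissertation; the fragment below is only a rough sketch of the approach, using the eventAdd() call from Squid 2.x’s event scheduler (delay given in seconds). The clientWriteChunk() helper and the way the request state is passed are invented for illustration.

/* Sketch only, not the actual patch: on a hit, rather than writing the
 * next chunk of a cached object immediately, schedule a callback to
 * write it after the delay computed for that chunk. */
static void clientWriteChunk(clientHttpRequest * http);   /* hypothetical */

static void
clientWriteChunkEvent(void *data)
{
    clientWriteChunk((clientHttpRequest *) data);
}

static void
clientScheduleChunk(clientHttpRequest * http, double delay_sec)
{
    if (delay_sec <= 0.0)
        clientWriteChunk(http);          /* no delay needed */
    else
        eventAdd("clientWriteChunk", clientWriteChunkEvent, http,
                 delay_sec, 0);
}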

986185568.949 677 128.6.60.20 TCP_MISS/200 13661 GET http://news.bbc.co.uk/cele0.jpg - DIRECT/news.bbc.co.uk image/jpeg
986185568.965 2 128.6.60.79 TCP_HIT/200 13660 GET http://news.bbc.co.uk/cele0.jpg - NONE/- image/jpeg
986185569.404 1914 128.6.60.20 TCP_MISS/200 34543 GET http://news.bbc.co.uk/1254610.stm - DIRECT/news.bbc.co.uk text/html
986185570.593 926 128.6.60.79 TCP_HIT/200 34550 GET http://news.bbc.co.uk/1254610.stm - NONE/- text/html

Figure 11.3: Sample Collector (Squid) log showing the timestamp, elapsed time in ms, client IP address, cache operation/status code, object size, request method, URL, user ident (always ‘-’), upstream source, and filetype.

To cache uncacheable objects in Squid, two modifications are required. The algorithm which determines the cacheability of responses has to be modified to cache these special objects. In addition, when a subsequent request is received for such an object, the cached object should be used instead of retrieving it from the origin server, since these objects would normally be treated as uncacheable. Squid’s httpMakePrivate function in http.c marks an object as uncacheable and the object is released as soon as it is transferred to the client. Instead of marking it private, we modified the code so that the object is valid in the cache for a configurable amount of time (a minimum freshness time). Subsequent requests for this (otherwise uncacheable) object would be serviced from the cache if they are received before the object expires in the cache. The Collector also keeps track of the clients which have seen a particular uncacheable object with a bit mask. If a client has already retrieved this object and makes a request for it again before its expiration, the object will be fetched from the origin server (since such a client would want a fresh copy).
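
As a rough illustration of the bit-mask bookkeeping (the structure and function names are invented; the real code extends Squid’s internal structures), the test could look like this, assuming some mapping of clients to bit positions.

/* Sketch only: remember which clients have already been served a given
 * "uncacheable" object, so a repeat request from the same client goes
 * back to the origin server. */
#include <stdint.h>
#include <time.h>

struct uncacheable_entry {
    uint64_t seen_mask;   /* bit i set => client i has received the object */
    time_t   expires;     /* end of the short freshness window             */
};

/* Returns 1 if this client has already received the object (fetch again
 * from the origin); otherwise marks it as seen and returns 0 (serve from
 * the Collector's cache). */
static int
client_already_served(struct uncacheable_entry *e, unsigned client_index)
{
    uint64_t bit = (uint64_t) 1 << (client_index % 64);

    if (e->seen_mask & bit)
        return 1;
    e->seen_mask |= bit;
    return 0;
}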

If an object is validated with the origin server when an If-Modified-Since (IMS) request is received and a subsequent non-IMS request is received for the same object, then to fake a miss, the cost for the entire retrieval of the object is used and not the cost for the IMS response. While handling POST requests, an MD5 checksum of the body of the request is computed and stored when it is received for the first time. Thus in subsequent POSTs for the same URL, the entire request has to be read before it can be determined whether the cached response can be used to respond to the request.
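
The idea of keying cached POST responses by URL plus a digest of the request body can be sketched as follows; md5_buffer() stands in for whatever MD5 routine is available (Squid carries its own) and is not a standard C function, and the structure is invented for illustration.

/* Sketch only: POSTs to the same URL with different bodies are kept
 * separate by hashing the request body into the lookup key. */
#include <string.h>

extern void md5_buffer(const void *buf, size_t len,
                       unsigned char digest[16]);   /* assumed helper */

struct post_key {
    char          url[1024];
    unsigned char body_md5[16];
};

static void
make_post_key(struct post_key *key, const char *url,
              const void *body, size_t body_len)
{
    strncpy(key->url, url, sizeof(key->url) - 1);
    key->url[sizeof(key->url) - 1] = '\0';
    md5_buffer(body, body_len, key->body_md5);
}

static int
same_post(const struct post_key *a, const struct post_key *b)
{
    return strcmp(a->url, b->url) == 0 &&
           memcmp(a->body_md5, b->body_md5, 16) == 0;
}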

Figure 11.4: Transaction timelines showing the sequence of events to satisfy a client need. The diagrams do not show TCP acknowledgments, nor the packet exchange to close the TCP connection.

Similarly, requests with different cookies are sent to the origin server, even though the URLs are the same. Figure 11.3 shows some lines from a sample Collector log (identical to a standard Squid log).

11.3 Issues and Implementation

In order to reproduce miss timings as hits, we need to be careful about some of the details. In Figure 11.4a, we show a transaction timeline depicting the sequence of events when using a non-caching proxy (or equivalently, when the proxy does not have a valid copy of the content requested). In contrast, the transaction timeline in Figure 11.4b illustrates the typical sequence of events when a caching proxy has a valid copy of the requested content and returns it to the client.

In the case that the client has an idle persistent connection to the proxy, the transaction would start with t2. If the proxy had an idle connection to the desired origin server, a new connection would not be necessary (merging the actions at t3 and t5 and eliminating the activity between). From the client’s perspective, the time between t2 and t0 constitutes the connection setup time. The time between t8 and t2 constitutes the initial response time. The time between t11 and t8 is the transfer time. The complete response time would be the time elapsed between t11 and t0.

Figure 11.5: Transaction timelines showing how persistent connections complicate timing replication. (a) Client re-uses persistent connection. (b) Second client finds cached data.

In the rest of this section, we discuss a number of the most significant issues encountered during implementation and how they have been addressed.

11.3.1 Persistent connections

Issue: Most Web clients, proxies and servers support persistent connections (which are optional in HTTP/1.0 and the default in HTTP/1.1). Persistent connections allow subsequent requests to the same server to re-use an existing connection to that server, obviating the TCP connection establishment delay that would otherwise occur. Squid supports persistent connections between client and proxy and between proxy and server. This process is sometimes called connection caching [CKZ99], and is a source of difficulty for our task.

Consider the case in which client C1 requests resource R1 from origin server S via a proxy cache (see the transaction timeline in Figure 11.5a). Assuming that the proxy did not have a valid copy of R1 in cache, it would establish a connection to S and request the object. When the request is complete, the proxy and server maintain the

connection between them. A short time later, C1 requests resource R2 from the same server. Again assuming the proxy does not have the resource, the connection would be re-used to deliver the request to and response from S. This arrangement has saved the time it would take to establish a new connection between the proxy and server from the total response time for R2.

Assume, however, a short time later, client C2 also requests resource R2, which is now cached (Figure 11.5b). Under our goals, we would want the proxy to delay serving R2 as if it were a miss. But a miss under what circumstances? If it were served as the miss had been served to C1, then the total response time would be the sum of the server’s initial response time and transfer time when retrieving R2. But if C1 had never made these requests, then a persistent connection would not exist, and so the proxy would have indeed experienced the connection establishment delay. Our most recent measurement of that time was from R1, so it could be re-used. On the other hand, if C2 had earlier made a request to S, a persistent connection might be warranted. Similarly, if C2 were then to request R1, we would not want to replicate the delay that was experienced the first time R1 was retrieved, as it included the connection establishment time.

In general it will be impossible for the Collector to determine whether a new request from a tested proxy would have traveled on an existing, idle connection to the origin server. The existence of a persistent connection is a function of the policies and resources on either end. The Collector does not know the idle timeouts of either the tested proxy nor the origin server. It also does not know what restrictions might be in place for the number of idle or, more generally, simultaneous connections permitted.

Implementation: While an ideal SPE implementation would record connection establishment time separately from initial response times and transfer times, and apply connection establishment time when persistent connections would be unavailable, such an approach is not possible (as explained above). Two simplifications were possible — to simulate some policies and resources on the proxy and a fixed maximum idle connection time for the server side, or to serve every response as if it were a connection

miss. For this implementation, we chose the latter, as the former would require the maintenance of new data structures on a per-proxy-and-origin-server basis, as well as the design and simulation of policies and resources.

The remaining decision is whether to use persistent connections internally to the origin servers from the Collector. While a Collector built on Squid (as ours is) has the mechanisms to use persistent connections, idle connections to the origin server consume resources (at both ends), and persistent connections may skew the transfer performance of subsequent responses as they will benefit from an established transfer rate and reduced effects of TCP slow-start.

Therefore, in our implementation, we modified Squid to never re-use a connection to an origin server. This effectively allows us to serve every response as a connection miss, since the timings we log correspond to connection misses, and accurately represent data transfer times as only one transfer occurs per connection.

11.3.2 Request pipelining

Issue: Pipelining of requests (having more than one outstanding request at a time) is allowed by HTTP/1.1, but it requires some additional complexity on the client side, since if a transfer fails, the client needs to keep track of the failed requests and reissue them.

Implementation: In our SPE implementation, the Multiplier sends pipelined requests to the Collector only when the client generates them. Thus, the client is responsible for reissuing the failed requests. Even though we accept pipelined requests, we do not send them to the test proxies, and we make sure that there is only one outstanding request at any point in time (for a single connection — if a browser opens multiple connections to the Multiplier, each is treated independently). The impact of this choice is that we do not know whether the tested proxies support pipelined requests, but it also means that the Multiplier does not need to be responsible for regenerating failed requests.

11.3.3 DNS resolution

Issue: The first request to an origin server requires a DNS lookup to determine the

server’s IP address. In some cases, this process can take a noticeable amount of time [CK00]. For subsequent HTTP requests to the same server, no lookup will be required, as the IP address will have been saved in cache. It is important that even though one proxy may have issued a request that encountered a delay because of DNS resolution times, requests by other proxies should still experience the same delay.

Implementation: In order to provide no advantage to slower proxies that make a request later than a faster one, we incorporate the DNS resolution actions as part of the server connection establishment time. This helps to more accurately replicate the delay seen by the first request.
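
The principle (start the setup timer before name resolution, so DNS time is charged to connection establishment) can be illustrated with the self-contained fragment below; this is not how Squid itself resolves names, and the function name is invented.

/* Illustration only: returns connection setup time in milliseconds,
 * including DNS resolution, or -1.0 on failure. */
#include <sys/time.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#include <unistd.h>
#include <string.h>

static double
connect_with_dns_ms(const char *host, const char *port)
{
    struct timeval t0, t1;
    struct addrinfo hints, *res = NULL;
    int fd;

    gettimeofday(&t0, NULL);                /* timer starts before DNS */
    memset(&hints, 0, sizeof(hints));
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host, port, &hints, &res) != 0)
        return -1.0;
    fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) != 0) {
        if (fd >= 0)
            close(fd);
        freeaddrinfo(res);
        return -1.0;
    }
    gettimeofday(&t1, NULL);                /* setup time includes DNS */
    close(fd);
    freeaddrinfo(res);
    return (t1.tv_sec - t0.tv_sec) * 1000.0 +
           (t1.tv_usec - t0.tv_usec) / 1000.0;
}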

11.3.4 System scheduling granularity

Issue: Most operating systems have a limit on the granularity of task scheduling that is possible. Unless it is modified within the OS, this limit will provide a lower-bound on the ability to match previously recorded timings.

Implementation: In Linux as well as other UNIX-like systems on the Intel architecture, the select(2) system call has a granularity of 10ms. Thus, if we tell select to wait at most 15ms, and no other monitored events occur, it will likely return after approximately 20ms. Thus, in our initial implementation a large difference would accumulate between what we wanted to delay and the actual amount of time delayed. In order to solve this problem, we use the following method: given some granularity, we must decide to round up or down. That is, for our development platform, if we need to delay 15ms, do we tune it to 10ms or to 20ms? In our implementation we decide this probabilistically. We first extract the largest multiple of the system-specific granularity from our desired delay. The fraction remaining is then used as a probability to determine whether to delay an additional timeslice, and thus the end result is a stochastically chosen delay value that is at most one timeslice larger or smaller than our original need.
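
A minimal sketch of this probabilistic rounding follows; the function name and the use of drand48() are illustrative, and the 10ms granularity is the one described above.

/* Round a desired delay to a whole number of timeslices, rounding the
 * fractional remainder up with probability equal to that fraction, so
 * the expected delay equals the desired delay. */
#include <stdlib.h>

#define GRANULARITY_MS 10

static int
quantize_delay_ms(double desired_ms)
{
    int    slices   = (int) (desired_ms / GRANULARITY_MS);
    double fraction = desired_ms / GRANULARITY_MS - slices;

    if (drand48() < fraction)
        slices++;                    /* round up this time */
    return slices * GRANULARITY_MS;
}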

Figure 11.6: Possible inter-chunk delay schedules for sending to client.

11.3.5 Inter-chunk delay times

Issue: When Squid receives a non-trivial response from an origin server, it groups the data received in a single call into an object it calls a chunk, which it then stores into cache. Chunks then are the content of at minimum one packet, and possibly many packets, depending on the rate at which data is being received. In a cache miss, those chunks are sent on to the client as they are received. Since our goal is to replicate the miss experience as much as possible, it would be undesirable to wait and then send all data at the end, even though that might be a simplistic approach to replicate overall response time. Likewise, sending all data but the last byte until the desired time had passed would also be unreasonable, as it does not come close to replicating reality. Ideally, our system would send data at the same rate as it was received, and with the same inter-chunk delays. These three scenarios are depicted graphically in Figure 11.6. In each portion, the timings of when chunks were received are re-drawn on the right, while the various schedules for inter-chunk delays in the case of a cache miss are drawn on the left of the timeline.

The differences between these conditions are real — modern browsers will parse the HTML page as it is received (i.e., in chunks) to find embedded resources such as images to fetch, and when found, issue new requests for them. Prefetching proxies may in fact do something similar (e.g., [CY97, Cac02a]). Likewise, images with a progressive encoding may be rendered by a browser with low-resolution when the initial data is received and then be refined as the remaining data arrive.

(a) SPE Collector miss (b) SPE Collector hit

Figure 11.7: Time/sequence graphs for one object.

Implementation: As described above, it may be insufficient to merely replicate the total response time — when data arrives within the response time period may be important. The overhead of recording the per-chunk delay was deemed too high for implementation. Instead, we replicate the average delay between chunks, as this only requires a fixed per-object storage overhead.

In order to accurately replicate the overall response time, we record the time difference between the time at which we receive the client request (t3) and the time at which we begin to send the response (t7), thus summing the server connection establishment time and the initial response time. In addition, we record the time difference between receiving the first chunk of the response from the origin server (t7) and the time when we receive the last chunk of the response from the origin server (t10) as the transfer time. As shown in Figure 11.4, this also corresponds to the time difference between just before sending the first chunk of the response to the client and just before sending the last chunk of the response to the client. So for subsequent requests for the same object, we intentionally add delays (in particular, before each chunk) so that the total time experienced between events t3 and t10 is identical to the miss case.

We insert delays in two situations: before the first chunk (i.e., before event t7) we delay the amount of time it took for the connection establishment and initial response

time from the origin server. Before sending subsequent chunks, we delay a portion of the transfer time. In that situation the delay portion is calculated by the following formula:

delay = (chunksize / totalsize) × transfertime

In our experiments we found that the above formula is not accurate enough as there is a non-negligible chance that the randomly chosen errors (because of select call granularity) will accumulate. Therefore, we use a global optimization approach and change the formula to be:

delay = (chunksize / bytesremaining) × timeremaining

where bytesremaining is defined as the total remaining bytes of the cached object that has yet to be sent, and timeremaining is defined as the delay time remaining that is to be distributed to the remaining chunks of the object.
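
As an illustration, the bookkeeping behind this formula can be sketched as follows; the structure and function names are invented, and in the real Collector this state lives inside Squid’s request structures.

/* Sketch only: each chunk's delay is its share of the bytes still to be
 * sent, applied to the delay budget still to be spent, so rounding
 * errors from earlier chunks are absorbed by later ones. */
struct replay_state {
    long   bytes_remaining;   /* bytes of the cached object not yet sent  */
    double time_remaining;    /* transfer-time budget not yet distributed */
};

/* delay to apply before sending the next chunk */
static double
next_chunk_delay(const struct replay_state *s, long chunk_size)
{
    if (s->bytes_remaining <= 0)
        return 0.0;
    return (double) chunk_size / (double) s->bytes_remaining
           * s->time_remaining;
}

/* after the chunk is sent, charge the delay that actually elapsed so the
 * remaining chunks absorb any difference from the intended delay */
static void
account_chunk(struct replay_state *s, long chunk_size, double elapsed)
{
    s->bytes_remaining -= chunk_size;
    s->time_remaining  -= elapsed;
}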

Figures 11.7a and b demonstrate the approach. In 11.7a, the Collector passes packets to the client as they are received from the origin, in effectively four blocks. In 11.7b, the Collector closely replicates the start and end of the transmission, but uses ten (smaller) blocks to do so.

11.3.6 Persistent state in an event-driven system

Issue: Our implementation requires the recording of various time values, some of which are object-based, and some of which are request-based. For example, we record which proxies have retrieved a particular object. Request-based information includes the timeremaining delay left to be applied to the current request.

Implementation: Since Squid uses a single process, we cannot easily use global storage for persistent state. Thus, we have extended the structures of connStateData, StoreEntry, and clientHttpRequest in order to record the information for later access.

11.3.7 HTTP If-Modified-Since request

Issue: Not all HTTP requests return content. Instead, some respond only with header information along with response codes — telling the client that the object has moved (e.g., response code 302), or cannot be found (404), or that there is no newer version (304). The latter is of particular interest because it forces the consideration of many situations. Previously we have considered only the case in which a client makes a request and either the proxy has a valid copy of the requested object or it does not. In truth, it can be more complicated. If the client requests object R, and the cache has an old version of R, it must check to see if a modified version is available from the origin server. If a newer one exists, then we have the same case as a standard cache miss. If the cached version is still valid, then the proxy has spent time communicating with the origin server that must be accounted for in the delays seen by the client. Likewise, the client may have a copy and need to verify its freshness with the proxy. If the proxy has a valid copy, it may wish to respond directly to the client, in which case the proxy needs to add delays as if the proxy had to confer with the origin server.

Implementation: We have considered IMS requests, and the general principle we use is to reproduce any communication delays that would have occurred when communicating with the origin server. Thus, if the proxy does communicate with the origin server, then no additional connection establishment and initial response time delay needs to be added. If the proxy is serving content that was cached, then as always, it needs to add transfer time delays (as experienced when the content was received). If the response served to the client does not contain content, then no transfer time is involved.

11.3.8 Network stability

Issue: An object may be cacheable for an arbitrary period of time. Since there exists significant variability in end-to-end network performance on the Internet [Pax99], network characteristics may have changed since the object was first retrieved.

Implementation: We use the times from the most recent origin server access of an object as our gold standard. In reality, the origin server may be more or less busy and network routes to it may be better or worse. Thus, we have chosen consistent performance reproduction over technically accurate performance reproduction (which would preclude caching). Note that to minimize any adverse effects, we can limit the amount of time that we allow objects to be cached (since a new retrieval will get a new measurement of communication performance).

11.3.9 Caching of uncacheable responses

Issue: As described earlier, a SPE Collector must cache all responses, even those marked as uncacheable, to minimize the potential for duplicate requests to be sent to the origin server.

Implementation: In our Collector, the freshness time for uncacheable objects is set to a small amount of time (arbitrarily chosen to be 180 seconds) since we expect requests for such objects to appear from all interested proxies within a short time period and we don’t really want to cache them for a long time — they are not supposed to be cached at all. In some cases, the object could be requested by a test proxy after its expiration in the cache (e.g., if one of the proxies prefetched an object that turned out to be uncacheable and the user makes a request for it later). In this case, it is retrieved again from the origin server. We cannot avoid this by increasing the freshness time since the Collector would be more likely to return stale data.

11.3.10 Request scheduling

Issue: If the Multiplier forwards requests to the Collector and the test proxies immediately after receiving a request from a client, there is a good chance that a test proxy will request an object from the Collector, whose transfer from the origin server has not started at all or whose response headers are still being read. In both these cases, a standard Squid proxy opens a parallel connection with the origin server to service the second request, since it does not have information about the cacheability of the object

and assumes it is safer to retrieve it separately. This is clearly not desirable, since we want only one request to be sent to the origin server for every request sent by the client.

Implementation: Our Multiplier addresses this problem by sending the request to the Collector first; only after receiving the complete response from the Collector does it send the request to each of the test proxies. Thus we ensure that when a test proxy passes a client request to the Collector, the object will already be in the Collector’s cache because the transfer of the object from the origin server is complete. Since the object is completely in cache, the Collector can serve subsequent requests from cache and avoid sending duplicate requests. However, this approach also implies that the inter-request timings seen by the tested proxies are not the same as those generated by the client, since the time that the proxy receives the request is dependent upon when the request is satisfied by the Collector.
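
This ordering can be sketched as follows; all types and helper functions here are hypothetical stand-ins for the Multiplier’s actual networking code, which uses threads and select(2) rather than this simplified sequential flow.

/* Sketch only: Collector first, then the test proxies, then the client. */
typedef struct request  request_t;    /* a parsed client request        */
typedef struct response response_t;   /* a complete HTTP response       */
typedef struct proxy    proxy_t;      /* a connection to one test proxy */

extern response_t *fetch_from_collector(request_t *req);          /* assumed */
extern void        forward_to_proxy(proxy_t *p, request_t *req);  /* assumed */
extern void        send_to_client(request_t *req, response_t *r); /* assumed */

static void
handle_client_request(request_t *req, proxy_t **proxies, int nproxies)
{
    int i;

    /* 1. Ask the Collector first and wait for the complete response, so
     *    the object is fully cached before any test proxy can ask for it. */
    response_t *resp = fetch_from_collector(req);

    /* 2. Only then forward the same request to every test proxy; their
     *    misses will now be served from the Collector's cache. */
    for (i = 0; i < nproxies; i++)
        forward_to_proxy(proxies[i], req);

    /* 3. Answer the client with the Collector's response. */
    send_to_client(req, resp);
}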

11.3.11 Additional client latency

Issue: From Figure 10.1 which showed the SPE architecture and the communication paths between entities, it is clear that the object retrieved must traverse at least two more connections than if it is retrieved by the client directly from the origin server — one between the client and the Multiplier and the other between the Multiplier and the Collector.

Implementation: Though the SPE architecture introduces additional latency on the client’s side, we expect it to be small since the SPE architecture typically runs in the same local network as the client. We will evaluate the overhead introduced by our implementation by running one instance within another in Section 11.4.4. 234

11.4 Validation

In this section, we describe our system configuration and three experiments that test and provide limited validation of our implementation.

11.4.1 Configuration and workloads

Software configuration

In the validation tests and in the experiments, we will use up to five proxy cache configurations. All are free and publicly available with source code. They are: Squid 2.4.STABLE6 [Wes02], Apache 1.3.23 (using mod_proxy) [Apa02], WcolD 961127 143211 [CY97], Oops 1.5.22 [Kha02], and DeleGate 7.9.9 [Sat02]. Each used default settings where possible. We selected a uniform 200MB disk cache for each (except for DeleGate, which does not appear to support a maximum disk cache size). The source for each was unmodified except when needed to allow compilation on our development platforms and to fix a few WcolD bugs. In addition, for many tests we will also run a "Dummy" proxy, which is just our Multiplier configured to do nothing but pass requests on to a parent proxy. WcolD is the only one of the five proxy caches to support link and embedded resource prefetching.

To generate workloads, we used two tools. The first is httperf [MJ98], which we use whenever generating an artificial workload. The httperf tool is highly configurable and supports HTTP/1.1 and persistent connections. In some experiments, we will collect the per-request timing results captured by httperf to calculate statistics on the overall response times as seen by the client.

The other is the simclient tool from the Proxycizer [Gad01, GCR98] toolset,1 which we use because it is able to replay captured Web logs in close to real time, preserving the inter-request delays that were originally present.

Figure 11.8: Experimental network configuration.

Network configuration

For each test we run a Multiplier and a Collector as described above. As shown in Figure 11.8, each proxy runs on a separate system, as do each of the monitoring and load-generating processes. Each machine running a proxy under test has a single 1GHz Pentium III with 512MB RAM and a single 40GB IDE drive. All machines are connected via full-duplex 100Mbit/sec Fast Ethernet to a gigabit switch using private IP space. A mostly stock RedHat 7.3 version of Linux was installed on each machine. The Multiplier and Collector ran on similar systems but with dual 1GHz Pentium III processors, 2GB RAM, and gigabit Ethernet cards, and were dual-homed on the private network and on a LAN for external access.

Data sets

In our experiments we use three representative data sets. To generate the first (entitled Popular Queries), we used the top fifty queries as reported by Lycos [Lyc02b] on 24 April 2002. Those queries were fed automatically to Google [Goo02], resulting in 10 URLs per query, for a total of 500 URLs that will be replayed by httperf. We eliminated three that caused problems for httperf in initial tests.

The second dataset (IRCache) was extracted from an NLANR IRCache proxy log (sv.ircache.net) on 18 April 2002. The intentions for this log of about half a million

1Proxycizer is a suite of applications and C++ classes that can be used in simulating and/or driving Web proxies. 236

entries were to focus on cacheable responses,2 so we removed all log entries that had non-200 HTTP response codes or Squid status codes other than TCP_HIT, TCP_MEM_HIT, TCP_IMS_HIT, TCP_REFRESH_HIT, and TCP_REFRESH_MISS. This eliminated more than 75% of the log entries, leaving only those entries corresponding to objects that had been cached by the proxy or a client of the proxy. We then arbitrarily selected the URLs that ended in .html and eliminated any duplicate URLs. We additionally removed those that generated HTTP errors in initial tests, resulting in a set of 1054 URLs.
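For illustration, a minimal sketch of this filtering step, assuming a Squid-style native access log in which the fourth whitespace-separated field holds the status/code pair (e.g., TCP_HIT/200) and the seventh holds the URL; the exact field positions are an assumption about the log format:

KEPT_STATUSES = {"TCP_HIT", "TCP_MEM_HIT", "TCP_IMS_HIT",
                 "TCP_REFRESH_HIT", "TCP_REFRESH_MISS"}

def select_cacheable_html_urls(log_lines):
    """Keep .html URLs from entries that were 200 responses with a cache-hit
    (or refresh) Squid status, dropping duplicates while preserving order."""
    seen = set()
    urls = []
    for line in log_lines:
        fields = line.split()
        if len(fields) < 7:
            continue                          # malformed entry
        status, _, code = fields[3].partition("/")   # e.g. "TCP_HIT/200"
        url = fields[6]
        if code != "200" or status not in KEPT_STATUSES:
            continue
        if not url.endswith(".html") or url in seen:
            continue
        seen.add(url)
        urls.append(url)
    return urls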

The third dataset (Reverse proxy) was selected to be a Web server workload, to be replayed in real time with simclient. We started with the access logs from the Apache [Apa02] Web server that hosts a small departmental Web site, www.cse.lehigh.edu, for the month of May 2002. We filtered the original logs for malformed requests as well as requests for dynamic resources (using the heuristic that looks for URLs containing "cgi-bin", "?", or ".cgi", or use of the POST method). The remaining GET and HEAD requests to resources believed to be static numbered over 94,000.

Since the simclient tool from the Proxycizer suite required logs in some proxy format, we converted the Apache logs to Squid logs. Unfortunately, Apache logs have only a one-second timestamp resolution, so even though Squid logs typically have millisecond resolution, these logs would not. These logs did, however, have multiple requests at exactly the same time, which is extremely unlikely with a single processor under relatively light load. Thus, we arbitrarily spaced requests that had identical timestamps 10ms apart in the converted trace.
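A minimal sketch of the timestamp-spacing step, assuming each parsed Apache entry is a dictionary carrying an integer one-second timestamp; the field names and example values are hypothetical:

def space_identical_timestamps(entries, gap=0.010):
    """Assign millisecond-resolution times to entries that share a one-second
    Apache timestamp by spacing them gap seconds (10ms) apart."""
    adjusted = []
    last_second = None
    offset = 0
    for entry in sorted(entries, key=lambda e: e["timestamp"]):
        second = entry["timestamp"]
        if second == last_second:
            offset += 1                      # another request in the same second
        else:
            last_second, offset = second, 0
        adjusted.append(dict(entry, timestamp=second + offset * gap))
    return adjusted

# Example: three requests logged in the same second end up 10ms apart.
example = [{"timestamp": 1021959000, "url": "/a.html"},
           {"timestamp": 1021959000, "url": "/b.gif"},
           {"timestamp": 1021959000, "url": "/c.css"}]
for e in space_identical_timestamps(example):
    print(e["timestamp"], e["url"])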

From this log we selected a subset of three days (May 21-23) to be replayed through our SPE implementation. This represented just over 16,000 requests from 977 distinct client IP addresses.

A total of 196 robot IPs could be identified by looking for requests to /robots.txt in the full trace. We did not filter these clients from the three-day test set, as they represent a portion of what a real reverse proxy would encounter. While we did not analyze the effects that these robots had on caching performance, Almeida et al. [AMR+01] have found robot activity to have a significant impact.

2Proxy logs from NLANR's IRCache proxies [Nat02] cannot be replayed in general because some entries are incomplete. To preserve privacy, all query terms (parts of the URL after a "?") are obscured.

Figure 11.9: Experiment structure. httperf is used to send requests to our modified Squid and measure response times.

11.4.2 Miss timing replication

In this section we test the ability of our Collector, a modified Squid proxy, to hide the use of its cache. To claim success, we must argue that proxies connecting to our Collector would be unlikely to tell whether it utilizes a cache. To provide evidence supporting this argument, we experimentally measure response times of requests for data on real origin servers. For each run of our experiments, we iterate through a list of URLs (the IRCache workload) and give one at a time to httperf to send the request through our modified proxy (as shown in Figure 11.9).

We repeat the run for each of the four combinations of: original Squid versus modified Squid, and Squid with empty cache (i.e., generating all cache misses) versus Squid with cached results of previous run (making cache hits possible). Each run was repeated once per hour.

Results

After running the tests hourly for most of a day, we calculated relevant statistical properties using strat [MB01]. Examining first the IRCache data, we show the hit and miss response time distributions in Figure 11.10. Our primary concern, however, is in the paired differences in response times. That is, what is the typical difference in response time for a miss versus a replicated miss (i.e., a hit)? Since we are not concerned with whether the difference is positive or negative, we plot the distribution of the absolute value of the differences in Figure 11.11.

Figure 11.10: Distribution of cache hit and miss response times with NLANR IRCache data using our Collector.

Figure 11.11: Distribution of absolute differences in paired response times between hits and misses with NLANR IRCache data using our Collector.

In Table 11.1 we show various summary statistics for the two datasets. Tests on each dataset were repeated for some number of runs, and from each run we calculated the mean and median difference between response times of a cache miss versus a cache hit, and a cache miss versus a cache miss one hour later. In each row, we show the mean and standard error for the run means and the run medians. All values are in milliseconds.

                              Cache Miss vs. Hit (ms)            Cache Miss vs. Miss (ms)
Data Set            # runs    run means        run medians       run means         run medians
Orig. IRCache          13     12.85 ± 4.41     5.440 ± 0.08      −18.03 ± 12.96    0.08 ± 0.16
Abs. Val. IRCache      13     54.20 ± 3.90     6.45 ± 0.07       347.95 ± 10.09    10.53 ± 0.04
Rev. Abs. IRCache      13     7.44 ± 0.82      6.34 ± 0.07       337.84 ± 11.23    10.31 ± 0.46
Popular Queries        17     366.12 ± 15.27   13.01 ± 0.19      670.71 ± 22.92    33.34 ± 1.39

Table 11.1: Mean and standard error of run means and medians using our Collector.

                              Cache Miss vs. Hit (ms)            Cache Miss vs. Miss (ms)
Data Set            # runs    run means        run medians       run means         run medians
Abs. Val. IRCache      17     503.3 ± 7.5      245.6 ± 1.0       279.6 ± 8.3       10.3 ± 0.5
Popular Queries        17     600.2 ± 17.6     109.1 ± 2.5       578.1 ± 17.2      40.2 ± 1.6

Table 11.2: Mean and standard error of run means and medians using unmodified Squid.

For reference, the first row of Table 11.1 shows the results when calculating the real (not absolute value) difference between response times. The remaining rows calculate results using the absolute value of the differences. The data for the second row is otherwise identical, but now shows a significant difference in response times between the Miss-Hit and Miss-Miss cases. As expected, the typical difference for the Miss-Hit case is quite small (since our intention was to get this close to zero).

Recall from Section 11.4.1 that the IRCache data set was selected specifically to be cacheable. In fact, when we retrieved the URLs, in some instances the responses were not cacheable. If we remove these points, we get even better results, shown in the third row.

We now examine the data set generated from popular search engine queries. Unlike the IRCache data, this dataset contains many references (approximately two-thirds) to uncacheable objects. Under these conditions, we find the absolute difference in response times between miss and hit pairs of the same requests are much closer to the miss pairs separated by one hour, as shown in the last row.

For comparison, one might also ask what the typical difference is when an unmodified Squid is used (i.e., when hits are served as fast as possible). Results from testing on both datasets are recorded in Table 11.2. In both cases, mean differences in Miss-Hit response times are quite high (similar to Miss-Miss), and the medians are in fact much higher. This confirms that unmodified hits are served much faster than misses, generating at least as much variation in response times as misses separated by an hour.

The above tests compared hit performance with that of a miss. It is also important to be able to generate multiple hits to the same object with consistent response times. The modified Squid is indeed able to achieve very similar response times for repeated hits: the means of the run means and run medians of the absolute difference between two hits for the same object are 6.60 ± 0.64ms and 2.03 ± 0.29ms, respectively.

Connection type    mean      median
dummy direct       1057ms    526ms
dummy proxy        1108ms    650ms

Table 11.3: Comparison of response times between direct and proxy connections.

From the above analysis, we can state that the typical difference in response times is significantly reduced for subsequent requests for the same object when our modified Squid Collector is utilized. In fact, when tested only on fully cacheable data, we find a usefully small mean miss-hit difference of just 7.4ms, and a mean hit-hit difference of 6.6ms.

11.4.3 Proxy penalty

In this section we are concerned with the internal architecture of our SPE implementation. We want to know whether the implementation imposes extra costs for the proxies being tested, as compared to the direct connection between Multiplier and Collector. The experiment we chose is to force both the direct connection and the tested proxy to go through identical processes. Since both proxies are running identical software on essentially identical machines, we can determine the difference in how the architecture handles the first (normally direct) connection and the tested connections. We used the IRCache dataset and configured httperf to establish 1000 connections, with up to two requests per connection. Intervals between connection creations were randomly selected from a Poisson distribution with a mean delay of one second.
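As a side note, the inter-arrival delays of a Poisson arrival process are exponentially distributed; a minimal sketch of generating such a connection-creation schedule with a one-second mean (independent of httperf's own configuration mechanism, and with names of our own choosing) is:

import random

def poisson_process_delays(n, mean_delay=1.0, seed=42):
    """Generate n inter-connection delays (seconds) for a Poisson arrival
    process with the given mean delay; such delays are exponentially
    distributed with rate 1/mean_delay."""
    rng = random.Random(seed)
    return [rng.expovariate(1.0 / mean_delay) for _ in range(n)]

delays = poisson_process_delays(1000)
print(sum(delays) / len(delays))   # should be close to 1.0 second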

In this experiment we found that the direct connection was reported to be 50-125ms faster on average (shown in Table 11.3). From this result (of an admittedly small test), we might consider subtracting a factor of 50-100ms from the measured times of tested proxies if we want to try to calculate exact timings. However, our primary concern is for comparative timings, which we address in Section 11.4.5.

Figure 11.12: Configuration to test one SPE implementation within another.

11.4.4 Implementation overhead

In this experiment we are concerned with the general overhead that a SPE implementation introduces to the request/response data stream. We want to know how much worse performance will be for the users behind the implementation, given that each request will have to go through at least two additional processes (the Multiplier and the Collector).

The idea is to test a SPE implementation using another SPE implementation, as shown in Figure 11.12. We again used the same IRCache artificial workload driven by httperf as input to the external SPE. The inner SPE (being measured) had just one Dummy proxy to drive. The outer one measured just the inner one.

The results of our test are presented in Table 11.4. We found that the inner SPE implementation generated a mean response time of 836ms, as compared to the direct connection response time of 721ms. Medians were smaller, at 444ms and 320ms, respectively. Thus, we estimate the overhead (for this configuration) to be approximately 120ms.

Connection type    mean     median
direct             721ms    320ms
inner SPE          836ms    444ms

Table 11.4: Evaluation of SPE implementation overhead.

Connection type     mean      median
Apache proxy 1      1039ms    703ms
Apache proxy 2      1040ms    701ms
Apache proxy 3      1040ms    701ms
Apache proxy 4      1038ms    703ms

Table 11.5: Performance comparison of identical proxies.

11.4.5 Proxy ordering effects

As described earlier, the Multiplier in our SPE implementation waits until it gets a response from the Collector, and then proceeds to make the same request to the tested proxies, in the order defined by the configuration. In reality, our implementation is more careful. On a new request that has just received the response from the Collector, the Multiplier will sequentially go through the list of tested proxies. If a persistent connection to that proxy has already been established in this process, the Multiplier uses it, sends the request, and moves on to the next proxy. Otherwise, the Multiplier issues a non-blocking connect to the proxy and moves on. When the sequence is complete, the Multiplier waits for activity on each connection and responds appropriately (i.e., receiving and checking the next buffer's worth of data, or sending the request if a new connection has just been established).
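A simplified sketch of this fan-out loop, using Python's standard selectors module rather than the Multiplier's actual C code; persistent-connection reuse and error handling are omitted, and the proxy address list is hypothetical:

import selectors
import socket

def fan_out_request(proxy_addrs, request_bytes):
    """Issue non-blocking connects to each tested proxy, send the same request
    on each connection as it becomes writable, and collect the replies."""
    sel = selectors.DefaultSelector()
    for addr in proxy_addrs:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.setblocking(False)
        s.connect_ex(addr)                    # non-blocking connect
        sel.register(s, selectors.EVENT_WRITE,
                     {"sent": False, "data": b"", "addr": addr})
    replies = {}
    while len(replies) < len(proxy_addrs):
        events = sel.select(timeout=5.0)
        if not events:                        # give up on unresponsive proxies
            break
        for key, _ in events:
            sock, state = key.fileobj, key.data
            if not state["sent"]:
                sock.sendall(request_bytes)   # connection ready: send the request
                state["sent"] = True
                sel.modify(sock, selectors.EVENT_READ, state)
            else:
                chunk = sock.recv(4096)       # read the next buffer's worth
                if chunk:
                    state["data"] += chunk
                else:                         # peer closed: response complete
                    replies[state["addr"]] = state["data"]
                    sel.unregister(sock)
                    sock.close()
    sel.close()
    return replies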

In this test we are concerned about the potential effects of the strict ordering of tested proxies. We want to make certain that there is no advantage to any particular position for a proxy being tested. Again we used httperf and the IRCache data to generate an artificial workload against a set of identically configured Apache proxy caches.

In this experiment we found that all four proxies had very similar performance, as shown in Table 11.5. Mean response times were within two milliseconds, as were the median response times. Thus, we find that the results of identical systems with identical software are correctly reported to have essentially identical performance results.

11.5 Experiments

For the experiments using our SPE implementation to test proxies, we employed the same configurations as described in Section 11.4.1, except where noted. 243

Proxy name    Bytes sent    Bytes received    Responses delivered    Mean response time    Median response time
(direct)      100.00%       100.00%           100.00%                529ms                 285ms
WcolD         99.38%        58.18%            99.78%                 360ms                 61ms
Oops          98.95%        66.02%            99.73%                 401ms                 180ms
DeleGate      98.75%        58.97%            99.40%                 1045ms                1039ms
Squid         98.96%        58.87%            99.73%                 360ms                 106ms
Apache        98.99%        59.02%            99.73%                 374ms                 199ms

Table 11.6: Artificial workload results.

11.5.1 Proxies driven by synthetic workload

Here we wish to test the performance of various proxies using the httperf workload generator. For this experiment we tested Squid, DeleGate, Apache, WcolD, and an Oops proxy.

As in the previous validation tests, we use httperf to generate an artificial workload using the IRCache dataset. Since this dataset does not represent any real sequences of requests, we have disabled WcolD’s prefetching mechanisms.

The results of this experiment are shown in Table 11.6 and graphically displayed in Figure 11.13. The timings shown are unadjusted results. Bytes sent measures the bytes delivered by the proxy to the Multiplier, relative to those sent by the Collector. Similarly, Bytes received measures the bytes received by the proxy from the Collector, relative to the bytes received by the Multiplier from the Collector which has no cache. Thus, a smaller fraction represents a higher byte caching ratio.

WcolD performs best under all measures. It ties with Squid for best mean response time at 360ms, but easily bests Squid's median of 106ms with a median of just 61ms. We record similar results for Apache's caching ability, but its mean response time is slightly higher at 374ms, and its median response time is much higher at 199ms. While Oops provides the worst caching ability of the group and a slightly higher mean response time of 401ms, its median response time, at 180ms, is slightly better than Apache's. Finally, DeleGate provides similar bandwidth reductions, but has strikingly higher mean and median response times at just over one second.

DeleGate also shows a significant discontinuity in the cumulative response time distribution around 1 second in Figure 11.13. Intuition suggested that some kind of timeout was likely involved, which we were able to confirm upon examination of DeleGate's source code, where we found the use of poll(2) with a timeout of 1000ms.

Figure 11.13: The cumulative distribution of response times for each proxy under the artificial workload.

11.5.2 A reverse proxy workload

We also wish to be able to test content-based prefetching proxies. To avoid the need to interfere with an existing user base and systems, we decided to replay a Web server trace. Since fetching real content can have side effects [Dav01a], we specifically chose to use a log from a Web server that predominantly generated static content.

The simclient tool was used to generate the workload by replaying the requests in the Reverse proxy dataset. This allows us to preserve the inter-request delays (e.g., including user "thinking times") that were originally present, and to match more closely what a proxy would see in a real-world situation. In order to increase the load presented to the tested proxies, we will use three simclients simultaneously, each replaying requests from the first fifteen hours of a different day. We tested all five proxies, but in this case

permitted WcolD to prefetch embedded resources, but not links to new pages (since those pages could be on servers other than the one represented by this log).

Note that even though the requests in the workload included many HTTP/1.1 requests, the Proxycizer simclient only generates HTTP/1.0 requests. Additionally, unlike the original trace, the simclient does not generate IMS (If-Modified-Since) queries, since it does not have a cache of its own, nor does it know which of the original requests carried IMS (although we know some of them did, generating 304 responses).

Likewise, some clients (such as Adobe Acrobat [Ado02]) generate HTTP/1.1 ranged requests, which elicit HTTP/1.1 206 partial content responses. We are unable to model this well with the Proxycizer simclient, which only generates HTTP/1.0 requests, and so it instead generates multiple full requests when a real HTTP/1.0 client would have only made one such request (since the whole response would have been returned on the first request). Of the original three-day log, 2.3% of the responses were for partial content (HTTP response code 206).

The tests we have performed are relatively short and contain a small working set. As a result, the proxies do not significantly exercise their storage infrastructure or replacement policies. Thus, additional experiments are necessary before making significant conclusions about the proxies tested.

Instead, these tests have demonstrated some of the abilities of our SPE implementation to measure the simultaneous performance of implemented proxies. Unfortunately, because of limitations in the types of workloads used, the sample experiments reported here were unable to demonstrate improved performance from the use of prefetching. Note that we replayed the request log of a LAN-attached Web server, and thus requests and responses did not have to traverse the Internet. With the higher response times typical of the real Internet, caching and prefetching may be able to show greater effects. As experience with and confidence in our SPE implementation grows, we anticipate larger-scale tests with a live user load, rather than replaying a trace, with content retrieved from remote servers.

Results from this experiment can be found in Table 11.7 and are graphically displayed in Figure 11.14. Since this workload was originally generated by a Web server with relatively few objects that were statically generated, it is highly cacheable, and so all of the caches perform significantly better than direct retrieval. Thus we see a significant benefit in placing any of these proxies in front of the Web server.

Proxy name    Bytes sent    Bytes received    Resp. delivered    % resp. match    % not match    Mean lat.    Median lat.
(direct)      100.00%       100.00%           100.00%            100.00%          0.00%          433ms        235ms
WcolD         94.73%        25.78%            99.61%             99.91%           0.09%          140ms        9ms
Oops          89.74%        30.01%            99.52%             99.76%           0.24%          151ms        9ms
DeleGate      95.47%        15.78%            99.75%             100.00%          0.00%          292ms        16ms
Squid         89.79%        14.67%            99.61%             100.00%          0.00%          92ms         7ms
Apache        89.44%        15.43%            99.51%             100.00%          0.00%          73ms         7ms

Table 11.7: Reverse proxy workload results.

Figure 11.14: The cumulative distribution of response times for each proxy under the full-content reverse proxy workload.

Squid does very well in this test, using the least external bandwidth, matching all responses to those sent by the Collector, and achieving the best median response time of 7ms and an excellent mean response time of 92ms. However, Apache matches the median response time and shortens Squid's mean response time by 19ms, in spite of higher external bandwidth usage. Examination of the experiment logs reveals an explanation. For requests with a URL containing a tilde (~), Apache apparently encodes the symbol as %7E when passing it on to the Collector. While equivalent from a Web server's point of view, Squid (and thus our Collector) does not recognize the equivalence of the two forms, and serves the request as a real cache miss. Since this cache miss typically follows shortly after completion of the direct request, the response time from the Web server for this object is often better (since some or all of the relevant data may still be in the Web server's memory). The effect can also be seen in Figure 11.14, where Apache and Squid perform almost equivalently for more than 70% of the responses. The slower responses correspond to proxy cache misses, and Apache serves misses more quickly than Squid.
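One way to avoid this particular mismatch is to canonicalize percent-encoded unreserved characters before using a URL as a cache key. This is a sketch of that idea (not something our Collector does), using only the Python standard library; the example URLs are made up:

from urllib.parse import urlsplit, urlunsplit, unquote, quote

def canonical_cache_key(url):
    """Decode percent-escapes in the path and re-encode only what must be
    encoded, so that /~user/ and /%7Euser/ map to the same cache key."""
    parts = urlsplit(url)
    path = quote(unquote(parts.path), safe="/~!$&'()*+,;=:@-._")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, parts.fragment))

print(canonical_cache_key("http://www.example.edu/~user/index.html"))
print(canonical_cache_key("http://WWW.EXAMPLE.EDU/%7Euser/index.html"))
# Both print the same canonical form.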

WcolD and Oops come close with median response times of 9ms, but their mean response times are worse (140ms and 151ms, respectively) and not all responses end up matching those sent by the Collector (0.09% and 0.24% do not, respectively). They also use significantly more bandwidth than any of the others. DeleGate, while caching a similar number of bytes as Apache, has a higher median response time of 16ms and the worst mean response time of 292ms.

11.6 Summary

This chapter has described our implementation of SPE, and many of the issues related to it. We have described validation tests that we have performed on our implementation, as well as demonstrated the use of the current system in two experiments to simultaneously evaluate the performance of five publicly available proxy caches.

The primary contribution of this chapter has been the implementation, description, and evaluation of a working SPE system. We have shown that the Collector can return cached results as if they were not cached. We have shown that the Multiplier measures the performance of equivalent proxies equivalently, and that the overhead of running our system is relatively small. While we believe that even tighter results may be possible (particularly when the implementation includes operating system modifications), this chapter has demonstrated a credible, complete SPE implementation.

Chapter 12

Prefetching with HTTP

12.1 Introduction

Previous chapters have examined possible mechanisms to predict user actions on the Web for prefetching, and to evaluate proposed algorithms and implemented prefetching systems. However, even when we find mechanisms that work, challenges remain to get them fielded, notably limitations of the HTTP mechanism itself.

This chapter will provide some answers to the question of why prefetching is not presently widespread, as well as why it should not be deployed within the current infrastructure, and what changes are needed to make prefetching possible on the commercial World-Wide Web.

Prefetching into a local cache is a well-known technique for reducing latency in distributed systems. However, prefetching is not widespread in the Web today. There are a number of possible explanations for this, including the following perceived problems:

• the bandwidth overhead of prefetching data that is never utilized,

• the potential increase in queue lengths and delays from naive prefetching approaches [CB98], and

• the effect on the Internet (e.g., added traffic and server loads) if all clients were to implement prefetching.

Fortunately, the first two objections can be addressed by more careful approaches: prefetchers can be limited to high-confidence predictions, which can limit the bandwidth overhead, and prefetching systems can be used to moderate network loads when requests are spaced appropriately. The third is still up for debate, but when prefetching occurs

over a private connection for objects stored on an intermediate proxy (as in proxy initiated prefetching [FJCL99]), this is no longer an issue.

We will first define prefetchability and then review how Web prefetching has been suggested by many researchers and implemented in a number of commercial systems. In Section 12.3 we will discuss a number of less-well-investigated problems with prefetching, and with the use of GET for prefetching in particular. In effect, we argue that (1) GET is unsafe because, in practice, GET requests may have (not insignificant) side-effects, and (2) GET can unfairly use server resources. As a result, we believe that prefetching with GET under the current incarnation of HTTP is not appropriate. We then propose extending HTTP/1.1 [FGM+99] to incorporate methods and headers that will support prefetching and variable Web server quality of service.

12.2 Background

While some researchers have considered prepushing or presending, this chapter is only concerned with client-activated preloading of content. We are less concerned here about how the client chooses what to prefetch (e.g., Markov model of client history, bookmarks on startup, user-specified pages, the use of proxy or server hints), with the understanding that a hints-based model (as suggested in [PM96, Mog96, Duc99] and others) has the potential to avoid some of the concerns discussed here but may also introduce other problems and likewise lacks specification as a standard.

Therefore, in this chapter we make the not-unreasonable assumption that a prefetching system (at the client or intermediate proxy) has a list of one or more URLs that it desires to prefetch.

12.2.1 Prefetchability

Can prefetchability be defined? Let us start with prefetching. A relatively early paper [DP96] defines prefetching as a cache that "periodically refreshes cached pages so that they will be less stale when requested by the user." While this definition does have a Web-specific perspective, it does not reflect current usage of the term, since it is specific to cache refreshment. The more general definition from Chapter 1 is:

Prefetching is the (cache-initiated) speculative retrieval of a resource into a cache in the anticipation that it can be served from cache in the future. (2)

Typically the resource has not yet been requested by the client (whether user or machine), but the definition applies equally to cache refreshes (when the content has changed). With this definition we can see that:

• Prefetching requires the use of a cache.

• Prefetchability requires cacheability.

Note that not all prefetched resources will be used before they expire or are removed from the cache. Some will never be requested by the client. This implies that:

• Prefetching requires at least idempotence, since resources may end up being fetched multiple times.

• Speculative prefetchability requires safety. Unless we can guarantee that the client will actually use this resource (before removal or expiration), a prefetching system should not perform any action that has unwanted (from either the client or server’s perspective) side-effects at the server.

Therefore, now we have a definition for prefetchable:

A Web resource is prefetchable if and only if it is cacheable and its retrieval is safe. (3)

Typical concerns for safety (e.g., those raised in RFC 2310 [Hol98]) are really concerns for idempotency, but idempotency is insufficient for prefetching, as the prefetched response might never be seen or used. Here we are referring to the definition of safe as having no significant side-effects (see section 9.1.1 of the HTTP/1.1 specification [FGM+99]).

12.2.2 Prefetching in the Web today

Prefetching for the Web is not a new concept. Discussions of mechanisms and their merits date back at least to 1994-1995.1 Many papers have been published on the use of prefetching as a mechanism to improve the latencies provided by caching systems (e.g., [PM96, Mog96, WC96, CJ97, CY97, JK98, MC98, Pal98, SKS98, AZN99, CKR99, Kle99, Duc99, FJCL99, PP99b, SR00]).

A number of commercial systems used today implement some form of prefetching. CacheFlow [Cac02a] implements prefetching for inline resources like images, and can proactively check for expired objects. There are also a number of browser extensions for Netscape and Microsoft Internet Explorer as well as some personal proxies that perform prefetching of links of the current page (and sometimes bookmarks, popular sites, etc.). Examples include NetSonic [Web02], PeakJet 2000 [Pea02], and Webcelerator [eAc01b]. In addition, older versions of Netscape Navigator have some prefetching code present, but disabled [Net99].

12.3 Problems with Prefetching

Even with a proliferation of prefetching papers in the research community, and examples of its use in commercial products, there are a number of well-known difficulties with prefetching as introduced in the beginning of this section.

Some researchers even suggest pre-executing the steps leading up to retrieval [CK00] as one way to avoid some of the difficulties of content retrieval and still achieve significant response time savings. However, as suggested above, researchers and developers are still interested in the performance enhancements possible with speculative data transfer. When prefetching (and not prepushing/presending), the standard approach for system builders is to use the HTTP GET mechanism. Advantages of GET and standard HTTP request headers include:

• GET is well-understood and universally used.

1For one interesting discussion still accessible, see the threads "two ideas..." and "prefetching attribute" from the www-talk mailing list in Nov-Dec 1995 at http://lists.w3.org/Archives/Public/www-talk/1995NovDec/thread.html.

• There are supporting libraries to ease implementation (e.g., for C there is W3C’s Libwww [Wor01] and for Perl there is Libwww-perl [Aas02]).

• It requires no changes on the server’s side.

However, the use of the current version of GET has some obvious and non-obvious drawbacks, which we discuss in this section. Even from the prefetching client's perspective, there may be some drawbacks. In particular, just like users, a naive prefetcher may not know that the object requested is very large and will take minutes or hours to arrive.

In many cases, however, these drawbacks can be characterized as server abuses, because the inefficiencies and deleterious effects described could be avoided if the content provider were able to decide what and when to prefetch. In the rest of this section we detail the problems with prefetching, with emphasis on the consequences of using the standard HTTP GET mechanism in the current Web.

12.3.1 Unknown cacheability

One of the differences between Web caching and traditional caching in file and memory systems is that the cacheability of Web objects varies. One cannot look at the URL of an object and a priori know that the response generated for that URL will be cacheable or uncacheable. Some simple heuristics are known, including examining the URL for substrings (e.g., “?” and “cgi-bin”) that are commonly associated with fast-changing or individualized content, for which caching does not provide much use. However, while a URL with an embedded “?” has special meaning for HTTP/1.0 caching (see the HTTP/1.1 specification [FGM+99], section 13.9), a significant fraction of content that might be labeled dynamic (e.g., e-commerce site queries) does not change from day to day [WM00] and should be treated as cacheable unless marked otherwise (as stated by the HTTP/1.1 specification). Only the content provider will know for certain, which is perhaps why the cacheability of a response is determined by one or more HTTP headers that travel with the response.
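For illustration only, here is a minimal sketch of the kind of URL-based heuristic described above; as the text notes, such guesses can easily be wrong, and only the response headers are authoritative. The example URLs are made up:

def looks_dynamic(url):
    """Heuristically guess whether a URL names fast-changing or personalized
    content; real cacheability is decided by the response's HTTP headers."""
    markers = ("?", "cgi-bin", ".cgi")
    return any(m in url for m in markers)

for u in ("http://www.example.com/index.html",
          "http://www.example.com/cgi-bin/search?q=caching"):
    print(u, "->", "probably dynamic" if looks_dynamic(u) else "probably static")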

Some researchers (e.g., [Duc99]) have proposed prefetching and caching data that is otherwise uncacheable, on the grounds that it will be viewed once and then purged from the cache. However, the HTTP specification [FGM+99] states that such non-transparent operations require an explicit warning to the end user, which may be undesirable. Since many objects are marked uncacheable just for hit metering purposes, Mogul and Leach suggest something similar in their "Simple Hit-Metering and Usage-Limiting for HTTP" RFC [ML97], which states in part (emphasis added):

Note that the limit on “uses” set by the max-uses directive does not include the use of the response to satisfy the end-client request that caused the proxy’s request to the server. This counting rule supports the notion of a cache-initiated prefetch: a cache may issue a prefetch request, receive a max-uses=0 response, store that response, and then return that response (without revalidation) when a client makes an actual request for the resource. However, each such response may be used at most once in this way, so the origin server maintains precise control over the number of actual uses.

Under this arrangement, the response no longer has to be marked uncacheable for the server to keep track of views and can even direct the number of times that the object may be served.

In summary, though, when a client or proxy prefetches an object that it is not permitted to cache, it has wasted both client and server resources.

12.3.2 Server overhead

Even when the object fetched is cacheable, the server may incur non-negligible costs to provide it. The retrieval will consume some portion of the server's resources, including bandwidth, CPU time, and the logging of requests that are never viewed. Likewise, the execution of some content retrievals may not be desirable from a content provider's point of view, such as the instigation of a large database search. In fact, the use of overly aggressive prefetchers has been banned at some sites [Pel96].

As more attention is paid to end-user quality of service for the Web (e.g., [BBK00]), the content provider may want to be able to serve prefetched items at a lower quality of service [Pad95, PM96] so that those requests do not adversely impact the demand-fetched requests.

12.3.3 Side effects of retrieval

According to the HTTP/1.1 specification [FGM+99], GET and HEAD requests are not supposed to have significance other than retrieval and should be considered safe. In fact, the GET and HEAD methods are supposed to be idempotent. In reality, many requests do have side-effects (and are thus not idempotent).

Often these side effects are the reason for some objects to be made uncacheable (that is, the side effects are desirable to the content producers, such as for logging usage). More importantly, some requests, when executed speculatively, may generate undesirable effects for either the content owner or the intended content recipient. Placing an item into an electronic shopping basket is an obvious example: a prefetching system, when speculatively retrieving links, does not want to end up placing the items corresponding to those links into a shopping cart without the knowledge and consent of the user.

This reality is confirmed in the help pages of some prefetching products. For example, one vendor [eAc01a] states:

On sites where clicking a link does more than just fetch a page (such as on-line shopping), Webcelerator’s Prefetching may trigger some unexpected actions. If you are having problems with interactive sites (such as e-mail sites, online ordering, chat rooms, and sites with complex forms) you may wish to disable Prefetching.

Idempotency is a significant issue and is also mentioned in other RFCs (see for example [BLMM94, BLC95, BLFM98]). It is additionally an issue for pipelined connections in HTTP/1.1 (prefetching or otherwise), as non-idempotent requests should not be retried if a pipelined connection were to fail before completion.2 The experimental RFC 2310 [Hol98] proposes the use of a new response header called Safe, which would tell the client that the request may be repeated without harm, but this is only useful after the client already has header information about the resource. As suggested by Chapter 10, the lack of guaranteed safe prefetching also complicates live proxy evaluation.

2HTTP idempotency continues to be significant. For a discussion of HTTP idempotency with emphasis on idempotent sequences, see the discussion thread "On pipelining" from the IETF HTTP working group mailing list archives at http://www.ics.uci.edu/pub/ietf/http/hypermail/2000/thread.html.

The HTTP/1.1 specification [FGM+99] does note that "it is not possible to ensure that the server does not generate side-effects as a result of performing a GET request." A conscientious content developer would not only need to use POST to access all potentially unsafe resources but also to prevent GET access to those resources, since linking to content cannot be controlled on the WWW. There may also be some confusion as to the interpretation of the idempotence of GET as a MUST rather than a SHOULD, and as to what constitutes safety and its significance (as a SHOULD requirement). In any case, enough content providers have built websites (perhaps without thought to prefetching) that do generate side-effects from GET requests that careful developers are unlikely to develop widespread prefetching functionality, for fear of causing problems.

12.3.4 Related non-prefetching applications

There are other situations in which server resources are potentially abused. These include price-comparison systems that repeatedly retrieve pages from e-commerce sites to get the latest product pricing and availability, and extensive mechanical crawling of a site at too fast a rate [Luh01]. Both of these can cause undue loads at a Web site and affect the performance of demand-fetched responses. While the need for an improved interface between crawlers and servers has been recognized (e.g., [BCGMS00]), the need for distinguishing between demand and non-demand requests has not been suggested. Interestingly, there has been work [RGM00] to build crawlers that attempt to access the "hidden Web," those pages behind search forms [Ber00], further blurring the semantic distinction between the GET and POST methods.

A server may also benefit from special handling of other non-demand requests, such as non-demand-based cache refreshment traffic [CK01b], in which additional validation requests are made to ensure freshness of cached content. From the content provider's perspective, it would be helpful in general to serve non-demand requests of any kind at a different quality of service than those requests that come from a client with a human waiting for the result in real time.

12.3.5 User activity conflation

Non-interactive retrieval also has the potential for generating anomalous server statistics, in the sense that a page retrieval is assumed to represent a page view. Almost certainly some fraction of prefetched requests will not be displayed to the user, and so the logs generated by the server may overestimate and likely skew pageview statistics. This is also the case for other kinds of non-interactive retrievals, such as those generated by Web crawlers. Today, when analyzing server logs for interesting patterns of usage, researchers must first separate (if possible) the retrievals from automated crawlers [KR98, Dav99c]. As Pitkow and Pirolli have noted [PP99a]:

“... There is no standardized manner to determine if requests are made by autonomous agents (e.g., robots), semi-autonomous agents acting on behalf of users (e.g., copying a set of pages for off-line reading), or humans following hyperlinks in real time. Clearly it is important to be able to identify these classes of requests to construct accurate models of surfing behaviors.”

Thus, if the server were maintaining a user model based on history, undistinguished prefetching (or other non-demand) requests could influence that model, which could generate incorrect hints; those hints would cause additional requests for useless resources, which would likewise (incorrectly) influence the server's model of the user, and so on. A server that is able to distinguish such requests from demand traffic would also be able to distinguish them in logs and in user model generation. Therefore, a mechanism is needed to allow a server to distinguish demand requests from non-demand requests, as described below.

12.4 Proposed Extension to HTTP

The preceding section has discussed the problems with prefetching on the Web using current methods. In this section, we describe the need for an extension (as opposed to just new headers) for HTTP, mention briefly some previous proposals for HTTP prefetching extensions, and then provide a starting point for recommended changes to the protocol.

12.4.1 Need for an extension

HTTP is a generic protocol that can be extended through new request methods, error codes, and headers. An extension is needed to allow the server to define the conditions under which prefetching (or any other non-demand request activity) is allowed or served.

The alternative to an extension is more appropriate use of the current protocol. One approach, say, to eliminate some wasted effort would be to use range requests (much as Adobe's Acrobat Reader [Ado02] does) to first request a portion of the document. If the document were small, then there is a chance the whole object would be retrieved. In any case, the headers would be available. A slight variation would be to use HEAD to get meta-information about the resource. That would allow the client to use some characteristics of the response (such as size or cacheability) to determine whether the retrieval is useful. These suggestions do not address the problems of request side-effects and high back-end overhead, which would not be visible from the headers, and in fact may be exacerbated by HEAD requests (e.g., if the server has to generate a dynamic response to calculate values such as Content-Length). Neither do they allow for a variable quality of service at the Web server.
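As an illustration of this workaround (not a recommendation, for the reasons just given), a client might issue a HEAD request and inspect the reply headers before deciding whether to prefetch. The size threshold and example URL below are arbitrary choices of ours:

import urllib.request

def worth_prefetching(url, max_size=100_000):
    """Issue a HEAD request and apply simple checks on the reply headers.

    Returns False for objects marked no-store or private, or that report a
    Content-Length above max_size; errors are treated as "do not prefetch"."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            cc = resp.headers.get("Cache-Control", "").lower()
            length = int(resp.headers.get("Content-Length", 0))
    except Exception:
        return False
    if "no-store" in cc or "private" in cc:
        return False
    return length <= max_size

print(worth_prefetching("http://www.example.com/"))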

A second alternative would be to classify systems performing prefetching activity as robots, and require compliance with the so-called "robot exclusion standard" [Kos94]. However, conformity with this non-standard is voluntary, and is ignored by many robots. Further, compliance does not prevent server abuse, as there is no restriction on request rates, allowing robots to overload a server at whim.

A third approach would be to incorporate a mechanism to support mandatory extensions to HTTP without protocol revisions. Such a mechanism has been proposed in Experimental RFC 2774 [NLL00], which allows for the definition of mandatory headers. It modifies existing methods by prefixing them with "M-" to signify that a response to such a request must fulfill the requirements of the mandatory headers, or report an error. If RFC 2774 were widely supported, this proposal might only need to suggest the definition of new mandatory headers that allow prefetching and other non-interactive requests to be identified.

12.4.2 Previous proposals

Other researchers have proposed extensions to HTTP to support prefetching in one form or another.

• In his master's thesis, Lee [Lee96] prepared a draft proposal for additional HTTP/1.1 headers and methods. PREDICT-GET is the same as GET, but allows for separate statistical logging (so that logs are not skewed with prefetching requests). PREDICT-GOT is a method equivalent to HEAD, but is to be used as a way for the client to tell the server that the prefetched item was eventually requested by the client. Lee also lists new headers that would specify what kinds of predictions are desired by the client, and the kinds of predictions returned by the server.

• Padmanabhan and Mogul [PM96] suggest adding a new response header to carry predictions (termed Prediction) that would provide a list of URLs, their probabilities, and possibly their size. Elsewhere [Mog96], Mogul provides examples of other possible response headers that would be useful, including ones to describe mean interarrival times.

• Duchamp [Duc99] defines a new HTTP header (termed Prefetch) along with a number of directives. The new header is included in both requests and responses to pass information on resource usage by the client and suggested resources for prefetching by the server. The actual retrieval is performed using GET. 259

Unlike these proposals, we would like to see changes that are independent of a particular prediction technique.

12.4.3 Recommending changes

Any realistic set of changes will need to be agreed upon by a large part of the Web community, and so it makes sense to develop a proposal for these changes in consultation with many members of that community. However, there are certain aspects to the changes that appear to be sensible with respect to the type of prefetching described in this document.

We argue that a new method is needed. It should be treated similarly to the optional methods in HTTP/1.1. The capability of the server could be tested by attempting to use the new method, or by using the OPTIONS method to list supported methods. A new method provides safety: if a client uses a new method, older servers that do not yet support prefetching would return the 501 Not Implemented status code, rather than automatically serving the request, which would happen with the use of a new header accompanying a GET request. If the server does support these prefetching extensions, it would have the option to serve the request as if it were a normal GET, to serve it at a lower priority, or to deny the request with a 405 Method Not Allowed status code. When a client receives some kind of error code using the new method, it may choose to re-try the request with a traditional GET, but with the understanding that such action may have undesirable side-effects.

Therefore, we suggest a second version of GET (arbitrarily called GET2). A new header (possibly called GET2-Constraint) may also be useful to allow the client to specify constraints (e.g., safe to guarantee safety of the processing of this request, max-size to prevent fetching even a portion of an object that is too large, min-fresh to prevent fetching an object that will expire too quickly, max-resp-time to not bother if the response will take too long, etc.). Note that some of the Cache-Control header values may be relevant here (like min-fresh), even though GET2 requests may be made to servers as well as caches. Another header might be useful to allow client-type identification (e.g., interactive for browsers and other clients needing immediate response,

non-interactive for prefetching systems that will eventually send this content to users, robot for non-interactive systems that will parse and use content directly, etc.), but it may also be sufficient to assume that GET2 requests will only be generated by non-interactive systems.
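To make the proposal concrete, here is a sketch of how a prefetching client might issue such a request and fall back when the method is not supported. GET2 and GET2-Constraint are the hypothetical names suggested above, not part of any standard, and the host name and constraint values are made up:

import http.client

def prefetch_with_get2(host, path):
    """Attempt a speculative fetch with the (hypothetical) GET2 method,
    giving up quietly if the server does not implement or allow it."""
    conn = http.client.HTTPConnection(host, 80, timeout=10)
    headers = {
        # Hypothetical constraint header from the proposal in this section.
        "GET2-Constraint": "safe, max-size=100000, min-fresh=300",
    }
    conn.request("GET2", path, headers=headers)
    resp = conn.getresponse()
    body = resp.read()
    conn.close()
    if resp.status in (501, 405):      # method unknown or not allowed here
        return None                    # a cautious prefetcher simply gives up
    return resp.status, body

# Example (hypothetical host):
# print(prefetch_with_get2("www.example.com", "/index.html"))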

In summary, we recommend the development of a protocol extension that, when implemented by non-interactive clients that would otherwise use GET, allows for absolutely safe operation (no side-effects). When implemented by servers, such a protocol gives them the ability to distinguish between interactive and non-interactive clients and thus potentially provide different levels of service.

12.5 Summary

The benefits of Web cache prefetching are well understood. Prefetching could be a standard feature of today’s browsers and proxy caches. This chapter has argued that the current support for prefetching in HTTP/1.1 is insufficient. Existing implementations can cause problems with undesirable side-effects and server abuse, and the potential for these problems may thwart additional development. We have made some initial suggestions of extensions to HTTP that would allow for safe prefetching, reduced server abuse, and differentiated Web server quality of service.

Overall, the conclusions we reach about the current state of affairs are the following:

• Prefetching with GET can be applied safely and efficiently between client and proxy under HTTP/1.1 using the only-if-cached and min-fresh options to the Cache-Control header (a brief illustration appears after this list). Proxy resources can be abused, however, by a client that prefetches content that will never be used, and prefetching effectiveness is limited to the contents of the cache.

• Neither the safety nor the efficiency of prefetching from an origin server can be guaranteed, as the origin server cannot recognize non-demand requests and so must serve every request, including requests for content that is uncacheable or has side effects.

• Proxies and origin servers cannot provide different qualities of service to demand and non-demand requests, again because they cannot distinguish between demand requests and non-demand requests (generated by prefetchers and Web robots).

• Much of the uncacheable content of the Web can be prefetched by clients in the sense that the content can be cached if semantic transparency is allowed to be compromised (although user warnings are still required under HTTP/1.1).
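The first point above can be illustrated with a sketch of the request a client-side prefetcher might send to its proxy under existing HTTP/1.1; the proxy address and URL are hypothetical:

import http.client

def prefetch_from_proxy(proxy_host, proxy_port, url):
    """Speculatively request url via a proxy, but only if the proxy can serve
    it from its cache and keep it fresh for at least another 60 seconds.

    With only-if-cached the proxy must not contact the origin server, so the
    prefetch can have no side effects there; an absent or stale copy simply
    yields a 504 response and the prefetch is abandoned."""
    conn = http.client.HTTPConnection(proxy_host, proxy_port, timeout=10)
    conn.request("GET", url,
                 headers={"Cache-Control": "only-if-cached, min-fresh=60"})
    resp = conn.getresponse()
    body = resp.read()
    conn.close()
    if resp.status == 504:             # proxy could not satisfy from cache
        return None
    return body

# Example (hypothetical proxy and URL):
# prefetch_from_proxy("proxy.example.com", 3128, "http://www.example.com/")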

An extension to HTTP could rectify the second and third problems and allow for safe, semantically transparent prefetching for those systems that support it, degrading "gracefully" to the existing situation when the extension is not supported by either the client or the server. It is our hope that this chapter will restart a dialog on these issues that will in time move into a standards development process.

Chapter 13

Looking Ahead

13.1 Introduction

To conclude this work, we first ask what we have learned over the preceding chapters. We summarize the key contributions below. We then consider the next steps: how the major systems in this thesis can be extended in the future, and what is needed to make prefetching commonplace in the Web. Finally, we will offer some thoughts about the future of the broader task of improving the performance of content and service delivery.

13.1.1 User action prediction

In the first part of the dissertation, we focused on the task of predicting the next action that a user will take. We introduced a simple approach called IPAM for predicting the UNIX shell commands issued by a user, and realized a predictive accuracy of approximately 40% over two independent datasets. We also specified the characteristics of an idealized prediction algorithm.

Later we considered specifically the challenges involved in predicting Web requests from history. We presented many domain-specific choice-points, as well as the details of various Markov-like prediction models. We implemented generalized prediction codes to model the various algorithms, and compared their performance on a consistent set of Web request traces (realizing predictive accuracies from 12-50% or better, depending on the dataset). The identification, implementation, and evaluation of Markov-based predictions based on the history of actions taken by one or more users is one of our key contributions. The best performing approaches included predictions from multiple

contexts (as in PPM – prediction by partial match).

In addition to prediction methods using history-based techniques, we also considered the complementary idea of content-based prediction. First we experimentally validated widespread assumptions of textual similarity in nearby pages and descriptions on the Web, and then used those assumptions to implement a content-based prediction approach and tested it on a full-content Web usage log. Another key contribution is the evaluation of the use of content-based prediction methods, in which we found a content-based approach to be 29% better than random link selection for prediction, and 40% better than not prefetching in a system with an infinite cache.

13.1.2 Evaluation of Web cache prefetching techniques

While the initial chapters dealt with the concerns of designing and evaluating prediction techniques in isolation, later chapters concentrated on evaluating those techniques within the environment of a prefetching system. We first introduced the topics of concern to evaluators, and considered specifically the task and common techniques for proxy cache evaluation.

In order to estimate client-side response times, we implemented and validated a new Web caching and network simulator. In contrast to testing predictive accuracy as we did earlier, the use of the simulator allowed us to evaluate prediction mechanisms embedded within a prefetching environment. Using NCS, we demonstrated the potential performance improvements resulting from the use of multiple prediction sources for prefetching. We showed some scenarios in which a combination of prediction sources can provide better performance than either alone, and found that client prefetching based on Web server predictions can significantly reduce typical Web response times, without excessive bandwidth overhead. We also presented some analysis of when history-based prefetching will not be particularly effective. In general, NCS provides a useful framework to test other network and caching configurations with real-world usage traces.

Since simulation is not helpful for evaluating proxy cache implementations, we proposed a novel architecture called Simultaneous Proxy Evaluation that can optionally utilize live traffic and real data to evaluate black-box Web proxies. Unlike existing evaluation

methods, SPE is able to objectively evaluate prefetching proxies because it provides access to real content and live data for speculative retrieval. We then implemented the SPE architecture, discussed various issues and their ramifications for our system, and demonstrated the use of SPE in validation tests and in experiments evaluating five publicly available proxy caches.

Finally, we identified the need to change HTTP because of fairness and safety concerns with prefetching using the current standard.

13.2 The Next Steps

The past chapters have dealt with the mechanisms to predict user actions on the Web for prefetching, and to evaluate proposed algorithms and implemented prefetching systems.

In fact, multiple approaches to predict Web activity were proposed, implemented, and evaluated. Multiple approaches to evaluate Web prefetching algorithms were proposed and implemented. We showed that higher predictive accuracy and lower client-side response times can be accomplished by combining the results of separate prediction systems.

However, there are some aspects that could not be addressed by this dissertation, and which remain directions for future work. Such issues include:

• Congestion modeling and packet loss in NCS. The simulator currently provides for the equivalent of infinite queues, which are unrealistic. However, modeling the network at this level would likely also require modeling routers and additional TCP protocol detail, rendering it much slower. (A minimal sketch of a finite-queue link model appears after this list.)

• Proxy evaluation utilizing SPE on a live request stream. While SPE was designed to handle live user requests, the experiments presented in this dissertation used artificial and captured traces. We look forward to the opportunity to use SPE to conclusively evaluate multiple proxies on a live request stream.

• Evaluation of caching and prefetching implemented in browsers. While caching and prefetching in browsers are considered within NCS, the SPE architecture is not directly suited to the evaluation of client-side systems. One potential alternative is to break out whatever system is being added to a browser (such as the caching and prefetching) and place it into a separate proxy that can then be tested using SPE.

• The cooperation of proxies, whether in a mesh, a hierarchy or as part of a content delivery network. Such systems of proxy caches may be able to provide significant benefits, but would not, for example, be amenable to evaluation using the SPE architecture since SPE would be unable to wrap itself around a truly distributed CDN. However, it may be possible to locate the Multiplier with the load generator, and the Collector at the origin server. If the CDN can be configured to access the Collector instead of the origin server, then SPE could indeed be applied to the constrained task of evaluating response times for requests for a particular server.

• Evaluation of larger scenarios in NCS. Presently NCS builds all of its data structures in memory, limiting possible simulation sizes to the amount of memory that the OS can assign to a single application. While likely to be considerably slower, a disk-based simulator would have the ability to simulate longer and larger scenarios. Larger scenarios might enable simulation of CDN-style sets of proxy caches, or test theories of what would happen “if everyone were to prefetch” with some level of aggressiveness.

• The use of an SPE implementation in which clients are not explicitly configured to access the Multiplier. Many proxy caches are deployed to intercept Web requests transparently, and thus serve responses without the knowledge of the client. With additional equipment and small changes, SPE could be implemented in this fashion as well, enabling it to evaluate proxies so configured.
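Regarding the first item above, the following is a minimal sketch of the kind of finite-queue, drop-tail link model that would replace an infinite queue; the function and its parameters are invented for illustration and do not correspond to NCS internals.

```python
import random

def droptail_link(arrivals, capacity_bps, queue_limit, packet_bits=12000):
    """Toy single-link model: packets arriving while the queue is full are
    dropped rather than delayed, unlike an infinite-queue model."""
    service = packet_bits / capacity_bps        # transmission time per packet
    queue, departures, dropped = [], [], 0      # queue holds departure times
    for t in sorted(arrivals):
        while queue and queue[0] <= t:          # packets fully sent by time t
            departures.append(queue.pop(0))
        if len(queue) >= queue_limit:
            dropped += 1                        # loss under congestion
        else:
            start = max(t, queue[-1]) if queue else t
            queue.append(start + service)
    departures.extend(queue)
    return departures, dropped

# A burst of 50 packets into a 10-packet queue on a 1 Mb/s link loses most.
burst = [random.uniform(0.0, 0.01) for _ in range(50)]
sent, lost = droptail_link(burst, capacity_bps=1_000_000, queue_limit=10)
print(f"delivered {len(sent)} packets, dropped {lost}")
```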

There are two reasons why the multi-source prefetching explored in this dissertation cannot be put on the Web today. The first is that some requests on the Web have side effects, as discussed in Chapter 12. The second is that prediction hints (sent, in general, from server to client) need to be transported under some standardized mechanism. As demonstrated in this thesis and elsewhere, a Web server or proxy that has seen many requests for an object will have a better model of what is likely to be requested next. Therefore, in a client-based prefetching system, the predictions need to be sent to the client in some form. A number of researchers have proposed the use of hints of some kind for caching, invalidation, or prefetching [Pad95, PM96, Mog96, CKR98, KW97, KW98b, Duc99]. While the hints themselves could be delivered through HTTP using a particular unique URL, the client still needs to know the identity of the appropriate URL that would contain hints of actions likely to follow a particular object retrieval. One possibility is to calculate the hints URL by means of a predetermined URL transformation function. Such an approach might take the original URL and add a prefix or suffix, or generate an opaque identifier such as an MD5 checksum. However, those approaches risk colliding with objects that may already be served at the URLs that would result from the transformation. Another, more reasonable and commonly proposed approach is to include the hints, or a pointer to them, in the headers of an HTTP response. While not presently standardized, making such a standard does not pose any significant technological hurdles. Both of these difficulties would be obviated with appropriate modifications to HTTP.
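To make these two options concrete, the sketch below derives a hints URL from the object URL by suffix and by MD5 digest, and shows how a hint might instead travel in a response header. It is purely illustrative: the helper names, the /hints/ prefix, and the Prediction-Hints header are invented here and are not part of HTTP or of any standard.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

def hints_url_by_suffix(url, suffix=".hints"):
    """Derive a hints URL by appending a suffix to the object's path."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=parts.path + suffix))

def hints_url_by_digest(url, prefix="/hints/"):
    """Derive an opaque hints URL from an MD5 digest of the original URL."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=prefix + digest, query=""))

print(hints_url_by_suffix("http://www.example.com/index.html"))
# -> http://www.example.com/index.html.hints
print(hints_url_by_digest("http://www.example.com/index.html"))
# -> http://www.example.com/hints/<32 hex digits>

# The header-based alternative: the hints (or a pointer to them) ride in the
# response itself. The field name below is hypothetical, not a standard one.
#
#   HTTP/1.1 200 OK
#   Content-Type: text/html
#   Prediction-Hints: </sports/scores.html>;p=0.4, </weather/today.html>;p=0.2
```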

13.3 The Future of Content Delivery

In order to provide useful services to massive numbers of clients, the popular network services of today and tomorrow require large-scale yet economical solutions. For aspiring companies, it sometimes does more harm than good to be mentioned on a very high traffic site (e.g., the Slashdot effect [Adl99]). As a result, even smaller-scale services need the ability to scale to large numbers of clients. Instead of the obvious (but expensive and often complex) approach of placing Web servers or caches in distributed hosting centers, content distribution networks are typically the recommended solution.

CDNs have become well-established parts of the network infrastructure. Some are presently independent companies, such as Speedera [Spe02] and Akamai [Aka02], which operates tens of thousands of systems located in thousands of ISP and university facilities. Others operate as part of a network or Web hosting provider, such as AT&T’s services [ATT02], or Digital Island [Cab02], which is presently part of Cable & Wireless’s Exodus services. Finally, some use alternative communication methods to deliver content to caches, such as Cidera [Cid02], which uses satellite delivery. CDN services have become valuable mechanisms by which content providers can ensure better Web site performance for their customers.

In effect, CDNs allow customers to share the infrastructure with many other sites and leverage the low probability of a simultaneous surge in traffic to the shared sites. However, CDNs may not be tomorrow’s solution, as dynamic content and other services gain in popularity. CDNs have already begun the process of change: they are no longer limited to static Web content, but now can serve various streaming media and some dynamic content (e.g., through the use of Edge Side Includes [OA01]). Increasing support for Sun’s J2EE and Microsoft’s .NET will augment the ability to provide dynamic Web services from CDN edge platforms (e.g., [VM02]). Thus, the generation of some dynamic content at the edge is just part of a progression, leading to not just caching at the edge, but computation at the edge.

However, while distributing content and some computation to the edge of the well-connected Internet can significantly improve user-perceived performance, it does not directly address the last-mile problem. The user’s connection to the Internet is typically the bottleneck, generating the bulk of latencies and transmission delays. To improve this aspect of Web performance, caching, prefetching, and computation must move to the client. In some ways, that is already happening: browsers already cache content and connections, and typically implement JavaScript and a Java environment for applets, thus moving some computation to the client. However, with the cooperation of the major browser manufacturers, much more could be done with prefetching (as shown in this dissertation) and with client-side computation.

Caching and computation at the edge (wherever you define it) are important parts of the solution to improve performance of some applications. Others include:

• Caching and pre-loading of non-Web traffic. DNS is naturally cached by clients and servers, and while we have mentioned its effects, it is just one of many non-Web protocols that benefit from caching. For example, streaming media such as audio or video can be and is distributed using specialized proxy caches.

• Caching of all network traffic. A few researchers and companies have proposed mechanisms for the caching and compression of network traffic at the packet or byte level [SW98, SW00, Exp02]. Such approaches have the potential to reduce the benefit of passive caches, but prefetching will still be valuable.

• Other techniques to reduce response times. Many other research areas affect user- perceived latencies on the Web, including routing, quality of service (QoS) efforts, and improvements in last-mile networking.

In summary, researchers and developers concerned with optimizing content and service delivery will need to consider a larger view, and not just optimize one portion of the delivery process.

13.4 Summary

Prefetching has been shown (both in this thesis and elsewhere) to be an effective mechanism to reduce client-side response times in the Web. Why is this? In Chapter 2, we mentioned a number of factors supporting the need for caching research. Prefetching can successfully combat the same sources of latency (such as network congestion, speed-of-light delays, etc.) because it takes advantage of:

• User thinking time, which can be overlapped with retrieval. The idle time between requests provides resources that can be speculatively exploited.

• Multiple sources that can provide content speculatively. By retrieving from multiple sources, we can increase the utilization of otherwise idle resources, making the system more efficient.

• Repeated activity by people or systems. Both people and systems repeat sequences of actions, and such patterns of activity can be recognized and exploited by capable systems for prefetching.

Given these arguments, why is Web prefetching not prevalent? While there may be many reasons, one important aspect is the lack of standard mechanisms for recognizing non-demand HTTP requests (as we described in Chapter 12) and for transporting prediction hints (as described above). As a result, Web prefetching is more likely to be found within closed systems (e.g., a CDN) in which specialized headers or mechanisms can be deployed without concern for standards.

This final chapter has summarized the contributions of the previous chapters and has speculated on the future of content and service delivery. Given appropriate mechanisms for measurement such as NCS or SPE, prefetching systems can demonstrate significant contributions in the quest to improve user-perceived performance. While prefetching is just one tool to help improve performance, it is one that should not be ignored.

References

[Aas02] Gisle Aas. libwww-perl home page. http://www.linpro.no/lwp/, 2002.

[AAY97a] Jussara Almeida, Virgílio A. F. Almeida, and David J. Yates. Measuring the behavior of a World Wide Web server. In Proceedings of the Seventh Conference on High Performance Networking (HPN), pages 57–72. IFIP, April 1997.

[AAY97b] Jussara Almeida, Virgílio A. F. Almeida, and David J. Yates. WebMonitor: A tool for measuring World Wide Web server performance. First Monday, 2(7), July 1997.

[AC98a] Jussara Almeida and Pei Cao. Measuring proxy performance with the Wisconsin Proxy Benchmark. Computer Networks and ISDN Systems, 30(22-23):2179–2192, November 1998.

[AC98b] Jussara Almeida and Pei Cao. Wisconsin Proxy Benchmark 1.0. Available from http://www.cs.wisc.edu/~cao/wpb1.0.html, 1998.

[Adl99] Stephen Adler. The Slashdot effect: An analysis of three Internet publications. Available at http://ssadler.phy.bnl.gov/adler/SDE/SlashDotEffect.html, January 1999.

[Ado02] Adobe Systems Incorporated. Adobe Acrobat Reader. http://www.adobe.com/products/acrobat/readermain.html, 2002.

[AFJM95] Robert Armstrong, Dayne Freitag, Thorsten Joachims, and Tom Mitchell. WebWatcher: A learning apprentice for the World Wide Web. In Proceedings of the AAAI Spring Symposium on Information Gathering from Distributed, Heterogeneous Environments, Stanford University, March 1995. AAAI Press.

[Aka02] Akamai Technologies, Inc. Akamai home page. http://www.akamai.com/, 2002.

[Alt02] AltaVista Company. AltaVista — The search company. http://www.altavista.com/, 2002.

[Ami97] Einat Amitay. Hypertext — The importance of being different. Master's thesis, Edinburgh University, Scotland, 1997. Also Technical Report No. HCRC/RP-94.

[Ami98] Einat Amitay. Using common hypertext links to identify the best phrasal description of target web documents. In Proceedings of the SIGIR'98 Post-Conference Workshop on Hypertext Information Retrieval for the Web, Melbourne, Australia, 1998.

[Ami99] Einat Amitay. Anchors in context: A corpus analysis of web pages author- ing conventions. In Lynn Pemberton and Simon Shurville, editors, Words on the Web — Computer Mediated Communication. Intellect Books, UK, October 1999.

[Ami00] Einat Amitay. InCommonSense — Rethinking Web search results. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME 2000), New York, 2000.

[Ami01] Einat Amitay. Trends, fashions, patterns, norms, conventions... and hy- pertext too. Journal of the American Society for Information Science (JASIS), 52(1), 2001. Special Issue on Information Science at the Millen- nium. Also available as CSIRO Technical Report 66-2000.

[AMR+01] Virgílio Almeida, Daniel Menascé, Rudolf Riedi, Flávia Peligrinelli, Rodrigo C. Fonseca, and Wagner Meira, Jr. Analyzing the impact of robots on performance of Web caching systems. In Web Caching and Content Delivery: Proceedings of the Sixth International Web Content Caching and Distribution Workshop (WCW'01), Boston, MA, June 2001.

[And97] Tom Andrews. Computer command prediction. Master’s thesis, Univer- sity of Nevada, Reno, May 1997.

[AP00] Einat Amitay and Cecile Paris. Automatically summarising Web sites — Is there a way around it? In Proceedings of the Ninth ACM International Conference on Information and Knowledge Management (CIKM 2000), Washington, DC, November 2000.

[Apa02] Apache Group. Apache HTTP server documentation. Available at http://httpd.apache.org/docs/, 2002.

[ATT02] AT&T Intelligent Content Distribution Service. http://www.ipservices.att.com/icds, 2002.

[AW97] Martin F. Arlitt and Carey L. Williamson. Internet Web servers: workload characterization and performance implications. IEEE/ACM Transactions on Networking (TON), 5(5):631–645, October 1997.

[AZN99] David W. Albrecht, Ingrid Zukerman, and Ann E. Nicholson. Pre-sending documents on the WWW: A comparative study. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI- 99), volume 2, pages 1274–1279, Stockholm, Sweden, 1999. Morgan Kauf- mann.

[Bau96] Mathias Bauer. Acquisition of user preferences for plan recognition. In David Chin, editor, Proceedings of the Fifth International Conference on User Modeling (UM96), 1996.

[BB98] Krishna Bharat and Andrei Broder. A technique for measuring the relative size and overlap of public Web search engines. In Proceedings of the Seventh International World Wide Web Conference, Brisbane, Australia, April 1998.

[BBBC99] Paul Barford, Azer Bestavros, Adam Bradley, and Mark Crovella. Changes in Web client access patterns: Characteristics and caching implications. World Wide Web, 2(1):15–28, January 1999.

[BBK00] Nina Bhatti, Anna Bouch, and Allan Kuchinsky. Integrating user-perceived quality into Web server design. In Proceedings of the Ninth International World Wide Web Conference, Amsterdam, May 2000.

[BC98a] Paul Barford and Mark Crovella. Generating representative Web workloads for network and server performance evaluation. In Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '98/PERFORMANCE '98), pages 151–160, Madison, WI, June 1998.

[BC98b] Paul Barford and Mark Crovella. Generating representative Web workloads for network and server performance evaluation. In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 151–160, 1998.

[BCF+99] Lee Breslau, Pei Cao, Li Fan, Graham Phillips, and Scott Shenker. Web caching and Zipf-like distributions: Evidence and implications. In Proceedings of IEEE INFOCOM, New York, March 1999.

[BCGMS00] Onn Brandman, Junghoo Cho, Hector Garcia-Molina, and Shiva Shivakumar. Crawler-friendly Web servers. Performance Evaluation Review, 28(2), September 2000. Presented at the Performance and Architecture of Web Servers (PAWS) Workshop, June 2000.

[BCW90] Timothy C. Bell, John G. Cleary, and Ian H. Witten. Text Compression. Prentice Hall, Englewood Cliffs, NJ, 1990.

[BD99] Gaurav Banga and Peter Druschel. Measuring the capacity of a Web server under realistic loads. World Wide Web Journal, 2(1-2):69–83, 1999. Special Issue on World Wide Web Characterization and Performance Evaluation.

[BDR97] Gaurav Banga, Fred Douglis, and Michael Rabinovich. Optimistic deltas for WWW latency reduction. In Proceedings of the USENIX Technical Conference, 1997.

[BEF+00] Lee Breslau, Deborah Estrin, Kevin Fall, Sally Floyd, John Heidemann, Ahmed Helmy, Polly Huang, Steven McCanne, Kannan Varadhan, Ya Xu, and Haobo Yu. Advances in network simulation. IEEE Computer, 33(5):59–67, May 2000.

[Ber00] Michael K. Bergman. The deep Web: Surfacing hidden value. White paper, BrightPlanet.com, July 2000. Available from http://www.completeplanet.com/tutorials/deepweb/index.asp.

[Bes95] Azer Bestavros. Using speculation to reduce server load and service time on the WWW. In Proceedings of CIKM'95: The Fourth ACM International Conference on Information and Knowledge Management, Baltimore, MD, November 1995.

[Bes96] Azer Bestavros. Speculative data dissemination and service to reduce server load, network traffic and service time for distributed information systems. In Proceedings of the International Conference on Data Engi- neering (ICDE’96), New Orleans, LA, March 1996.

[BFJ96] Justin Boyan, Dayne Freitag, and Thorsten Joachims. A machine learning architecture for optimizing Web search engines. In AAAI Workshop on Internet-Based Information Systems, Portland, OR, August 1996.

[BFOS84] Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification and Regression Trees. Wadsworth and Brooks, 1984.

[BH96] Jean-Chrysostome Bolot and Philipp Hoschka. Performance engineering of the World Wide Web: Application to dimensioning and cache design. In Proceedings of the Fifth International World Wide Web Conference, Paris, France, May 1996.

[BH98a] Adam Belloum and L. O. Hertzberger. Replacement strategies in Web caching. In Proceedings of ISIC/CIRA/ISAS’98, Gaithersburg, MD, September 1998.

[BH98b] Krishna Bharat and Monika R. Henzinger. Improved algorithms for topic distillation in hyperlinked environments. In Proceedings of the 21st Inter- national ACM SIGIR Conference on Research and Development in Infor- mation Retrieval, pages 104–111, August 1998.

[BH00] Adam Belloum and Bob Hertzberger. Maintaining Web cache coherency. Information Research, 6(1), October 2000.

[BJ01] Peter J. Brown and Gareth J. F. Jones. Context-aware retrieval: Explor- ing a new environment for information retrieval and information filtering. Personal and Ubiquitous Computing, 5(4):253–263, 2001.

[BK00] Carla Brodley and Ronny Kohavi. KDD Cup 2000. Online at http://www.ecn.purdue.edu/KDDCUP/, 2000.

[BKNM02] Hyokyung Bahn, Kern Koh, Sam H. Noh, and Sang Lyul Min. Effi- cient replacement of nonuniform objects in web caches. IEEE Computer, 35(6):65–73, June 2002.

[BLC95] Tim Berners-Lee and D. Connolly. Hypertext markup language - 2.0. RFC 1866, http://ftp.isi.edu/in-notes/rfc1866.txt, November 1995.

[BLFF96] Tim Berners-Lee, Roy T. Fielding, and Henrik Frystyk Nielsen. Hypertext Transfer Protocol — HTTP/1.0. RFC 1945, http://ftp.isi.edu/in-notes/rfc1945.txt, May 1996.

[BLFM98] Tim Berners-Lee, Roy T. Fielding, and L. Masinter. Uniform resource identifiers (URI): Generic syntax. RFC 2396, http://ftp.isi.edu/in-notes/rfc2396.txt, August 1998.

[BLMM94] Tim Berners-Lee, L. Masinter, and M. McCahill. Uniform resource locators (URL). RFC 1738, http://ftp.isi.edu/in-notes/rfc1738.txt, December 1994.

[BMK97] Rob Barrett, Paul P. Maglio, and Daniel C. Kellem. How to personalize the Web. In Proceedings of the ACM SIGCHI’97 Conference on Human Factors in Computing Systems, pages 75–82, Atlanta, GA, March 1997. ACM Press.

[BO00] Greg Barish and Katia Obraczka. World Wide Web caching: Trends and techniques. IEEE Communications Magazine Internet Technology Series, 38(5):178–184, May 2000.

[Bot95] EPA-HTTP server logs. Laura Bottomley of Duke University, 1995. Avail- able from http://ita.ee.lbl.gov/html/contrib/EPA-HTTP.html.

[BP98] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertex- tual Web search engine. In Proceedings of the Seventh International World Wide Web Conference, Brisbane, Australia, April 1998.

[Bra86] James T. Brady. A theory of productivity in the creative process. IEEE Computer Graphics and Applications, 6(5):25–34, May 1986.

[Bra96] Tim Bray. Measuring the Web. In Proceedings of the Fifth International World Wide Web Conference, Paris, France, May 1996.

[BS97] Marko Balabanovic and Yoav Shoham. Fab: Content-based, collaborative recommendation. Communications of the ACM, 40(3), March 1997.

[BS00] B. R. Badrinath and Pradeep Sudame. Gathercast: The design and im- plementation of a programmable aggregation mechanism for the Internet. In Proceedings of IEEE International Conference on Computer Commu- nications and Networks (ICCCN), October 2000.

[BSHJ+99] Israel Ben-Shaul, Michael Herscovici, Michal Jacovi, Yoelle S. Maarek, Dan Pelleg, Menachem Shtalhaim, Vladimir Soroka, and Sigalit Ur. Adding support for dynamic and focused search with Fetuccino. In Proceedings of the Eighth International World Wide Web Conference, Toronto, Canada, May 1999.

[BV02] Leeann Bent and Geoffrey M. Voelker. Whole page performance. In Pro- ceedings of the Seventh International Workshop on Web Content Caching and Distribution (WCW’02), Boulder, CO, August 2002.

[Cab02] Cable & Wireless Internet Services, Inc. Exodus, a Cable & Wireless service. http://www.exodus.com/, 2002.

[Cac00] CacheFlow Inc. CacheFlow performance testing tool. http://www.cacheflow.com/technology/tools/, 2000.

[Cac02a] CacheFlow Inc. Active caching technology. http://www.cacheflow.com/technology/whitepapers/active.cfm, 2002.

[Cac02b] CacheFlow Inc. CacheFlow products web page. http://www.cacheflow.com/products/, 2002.

[Cao97] Pei Cao. Web cache simulator. Available from Pei Cao's Univ. of Wisconsin home page: http://www.cs.wisc.edu/~cao/, 1997.

[CB97] Mark E. Crovella and Azer Bestavros. Self-similarity in World Wide Web traffic: Evidence and possible causes. IEEE/ACM Transactions on Net- working, 5(6):835–846, December 1997.

[CB98] Mark Crovella and Paul Barford. The network effects of prefetching. In Proceedings of IEEE INFOCOM, San Francisco, CA, 1998.

[CBC95] Carlos R. Cunha, Azer Bestavros, and Mark E. Crovella. Characteristics of WWW client-based traces. Technical Report TR-95-010, Computer Science Department, Boston University, July 1995.

[CC95] Mark E. Crovella and Robert L. Carter. Dynamic server selection in the Internet. In Proceedings of the Third IEEE Workshop on the Architec- ture and Implementation of High Performance Communication Subsys- tems (HPCS’95), August 1995.

[CC96] Robert L. Carter and Mark E. Crovella. Dynamic server selection using bandwidth probing in wide-area networks. Technical Report TR-96-007, Computer Science Department, Boston University, March 1996.

[CDF+98] Ramón Cáceres, Fred Douglis, Anja Feldmann, Gideon Glass, and Michael Rabinovich. Web proxy caching: The Devil is in the details. Performance Evaluation Review, 26(3):11–15, December 1998. Proceedings of the Workshop on Internet Server Performance.

[CDG+99] Soumen Chakrabarti, Byron E. Dom, David Gibson, Jon M. Kleinberg, S. Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins. Mining the Web’s link structure. IEEE Computer, pages 60–67, August 1999.

[CDI98] Soumen Chakrabarti, Byron E. Dom, and Piotr Indyk. Enhanced hyper- text categorization using hyperlinks. In Proceedings of ACM SIGMOD, Seattle, WA, 1998.

[CDN+96] Anawat Chankhunthod, Peter B. Danzig, Chuck Neerdaels, Michael F. Schwartz, and Kurt J. Worrell. A hierarchical Internet object cache. In Proceedings of the USENIX Technical Conference, pages 153–163, San Diego, CA, January 1996.

[CDR+98] Soumen Chakrabarti, Byron E. Dom, Prabhakar Raghavan, Sridhar Rajagopalan, David Gibson, and Jon M. Kleinberg. Automatic resource compilation by analyzing hyperlink structure and associated text. In Proceedings of the Seventh International World Wide Web Conference, Brisbane, Australia, April 1998.

[CGMP98] Junghoo Cho, Hector Garcia-Molina, and Lawrence Page. Efficient crawl- ing through URL ordering. In Proceedings of the Seventh International World Wide Web Conference, Brisbane, Australia, April 1998.

[Cha99] Philip K. Chan. A non-invasive learning approach to building Web user profiles. In Proceedings of WebKDD’99, pages 7–12, San Diego, August 1999. Revised version in Springer LNCS Vol. 1836.

[CI97] Pei Cao and Sandy Irani. Cost-aware WWW proxy caching algorithms. In Proceedings of the USENIX Symposium on Internet Technology and Systems, pages 193–206, December 1997.

[Cid02] Cidera, Inc. Cidera home page. http://www.cidera.com/, 2002.

[CJ97] Carlos R. Cunha and Carlos F. B. Jaccoud. Determining WWW user’s next access and its application to prefetching. In Proceedings of Second IEEE Symposium on Computers and Communications (ISCC’97), Alexan- dria, Egypt, July 1997.

[CK00] Edith Cohen and Haim Kaplan. Prefetching the means for document transfer: A new approach for reducing Web latency. In Proceedings of IEEE INFOCOM, Tel Aviv, Israel, March 2000.

[CK01a] Edith Cohen and Haim Kaplan. Proactive caching of DNS records: Ad- dressing a performance bottleneck. In Proceedings of The 2001 Symposium on Applications and the Internet (SAINT-2001), San Diego, January 2001.

[CK01b] Edith Cohen and Haim Kaplan. Refreshment policies for Web content caches. In Proceedings of IEEE INFOCOM, Anchorage, AK, April 2001.

[CK02] Peter Christy and John Katsaros. The 2002 content distribution at the edge report. Report, NetsEdge Research Group, January 2002.

[CKR98] Edith Cohen, Balachander Krishnamurthy, and Jennifer Rexford. Evaluating server-assisted cache replacement in the Web. In Proceedings of the 6th European Symposium on Algorithms, August 1998. LNCS vol. 1461.

[CKR99] Edith Cohen, Balachander Krishnamurthy, and Jennifer Rexford. Efficient algorithms for predicting requests to Web servers. In Proceedings of IEEE INFOCOM, New York, March 1999.

[CKV93] Kenneth M. Curewitz, P. Krishnan, and Jeffrey Scott Vitter. Practical prefetching via data compression. In Proceedings of the ACM-SIGMOD Conference on Management of Data, pages 257–266, May 1993.

[CKZ99] Edith Cohen, Haim Kaplan, and Uri Zwick. Connection caching. In Proceedings of the 31st Annual ACM Symposium on Theory of Computing. ACM, 1999.

[CM99] Xiangping Chen and Prasant Mohapatra. Lifetime behavior and its impact on Web caching. In Proceedings of the IEEE Workshop on Internet Applications (WIAPP '99), San Jose, CA, July 1999.

[Cor98] Andrew Cormack. Experiences with installing and running an institutional cache. In Proceedings of the Third International WWW Caching Workshop, Manchester, England, June 1998.

[CP95] Lara D. Catledge and James E. Pitkow. Characterizing browsing strategies in the World Wide Web. Computer Networks and ISDN Systems, 26(6):1065–1073, 1995.

[CSS99] William W. Cohen, Robert E. Schapire, and Yoram Singer. Learning to order things. Journal of Artificial Intelligence Research, 10:243–270, 1999.

[Cun97] Carlos R. Cunha. Trace analysis and its applications to performance enhancements of distributed information systems. PhD thesis, Computer Science Department, Boston University, 1997.

[CvdBD99] Soumen Chakrabarti, Martin van den Berg, and Byron E. Dom. Focused crawling: A new approach to topic-specific Web resource discovery. In Proceedings of the Eighth International World Wide Web Conference, Toronto, Canada, May 1999.

[CY97] Ken-ichi Chinen and Suguru Yamaguchi. An interactive prefetching proxy server for improvement of WWW latency. In Proceedings of the Second IEEE Symposium on Computers and Communications (ISCC'97), Alexandria, Egypt, July 1997.

[Cyp93a] Allen Cypher. Eager: Programming repetitive tasks by demonstration. In Cypher [Cyp93b], pages 204–217.

[Cyp93b] Allen Cypher, editor. Watch What I Do: Programming by Demonstration. MIT Press, Cambridge, MA, 1993.

[CZB99] Pei Cao, Jin Zhang, and Kevin Beach. Active Cache: Caching dynamic contents on the Web. Distributed Systems Engineering, 6(1):43–50, 1999.

[Dan98] Peter B. Danzig. NetCache architecture and deployment. In Proceedings of the Third International WWW Caching Workshop, Manchester, England, June 1998.

[Dav99a] Brian D. Davison. Adaptive Web prefetching. In Proceedings of the 2nd Workshop on Adaptive Systems and User Modeling on the WWW, pages 105–106, Toronto, May 1999. Position paper. Proceedings published as Computing Science Report 99-07, Dept. of Mathematics and Computing Science, Eindhoven University of Technology.

[Dav99b] Brian D. Davison. Measuring the performance of prefetching proxy caches. Poster presented at the ACM International Student Research Competition, New Orleans, March 24-28, 1999, and at the AT&T Student Research Symposium (a regional ACM Student Research Competition), November 13, 1998. Awarded third place and first place, respectively, 1999.

[Dav99c] Brian D. Davison. Web traffic logs: An imperfect resource for evaluation. In Proceedings of the Ninth Annual Conference of the Internet Society (INET'99), June 1999.

[Dav99d] Brian D. Davison. Simultaneous proxy evaluation. In Proceedings of the Fourth International Web Caching Workshop (WCW99), pages 170–178, San Diego, CA, March 1999.

[Dav99e] Brian D. Davison. A survey of proxy cache evaluation techniques. In Proceedings of the Fourth International Web Caching Workshop (WCW99), pages 67–77, San Diego, CA, March 1999.

[Dav00a] Brian D. Davison. Recognizing nepotistic links on the Web. In Artifi- cial Intelligence for Web Search, pages 23–28. AAAI Press, July 2000. Presented at the AAAI-2000 workshop on Artificial Intelligence for Web Search, Technical Report WS-00-01.

[Dav00b] Brian D. Davison. Topical locality in the Web: Experiments and observa- tions. Technical Report DCS-TR-414, Department of Computer Science, Rutgers University, 2000.

[Dav00c] Brian D. Davison. Topical locality in the Web. In Proceedings of the 23rd Annual ACM International Conference on Research and Development in Information Retrieval (SIGIR 2000), Athens, Greece, July 2000.

[Dav01a] Brian D. Davison. Assertion: Prefetching with GET is not good. In A. Bestavros and M. Rabinovich, editors, Web Caching and Content Delivery: Proceedings of the Sixth International Web Content Caching and Distribution Workshop (WCW'01), pages 203–215, Boston, MA, June 2001. Elsevier.

[Dav01b] Brian D. Davison. HTTP simulator validation using real measurements: A case study. In Ninth International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS 2001), Cincinnati, OH, August 2001.

[Dav01c] Brian D. Davison. A Web caching primer. IEEE Internet Computing, 5(4):38–45, July/August 2001.

[Dav02a] Brian D. Davison. Predicting Web actions from HTML content. In Proceedings of the Thirteenth ACM Conference on Hypertext and Hypermedia (HT'02), pages 159–168, College Park, MD, June 2002.

[Dav02b] Brian D. Davison. web-caching.com: Web caching and content delivery resources. http://www.web-caching.com/, 2002.

[DB96] Fred Douglis and Thomas Ball. Tracking and viewing changes on the Web. In Proceedings of the USENIX Technical Conference, pages 165–176, San Diego, January 1996.

[DCGV01] Ronald P. Doyle, Jeffrey S. Chase, Syam Gadde, and Amin M. Vahdat. The trickle-down effect: Web caching and server request distribution. In Web Caching and Content Delivery: Proceedings of the Sixth International Web Content Caching and Distribution Workshop (WCW'01), Boston, MA, June 2001.

[Dee02] Deerfield Corporation. InterQuick for WinGate. http://interquick.deerfield.com/, 2002.

[DFKM97] Fred Douglis, Anja Feldmann, Balachander Krishnamurthy, and Jeffrey C. Mogul. Rate of change and other metrics: A live study of the World Wide Web. In Proceedings of the USENIX Symposium on Internet Technologies and Systems (USITS ’97), December 1997.

[DGK+99] Brian D. Davison, Apostolos Gerasoulis, Konstantinos Kleisouris, Ying- fang Lu, Hyunju Seo, Wei Wang, and Baohua Wu. DiscoWeb: Applying link analysis to Web search. In Poster proceedings of the Eighth Inter- national World Wide Web Conference, pages 148–149, Toronto, Canada, May 1999.

[DH97a] Brian D. Davison and Haym Hirsh. Experiments in UNIX command pre- diction. Technical Report ML-TR-41, Department of Computer Science, Rutgers University, 1997.

[DH97b] Brian D. Davison and Haym Hirsh. Toward an adaptive command line interface. In Advances in Human Factors/Ergonomics: Design of Comput- ing Systems: Social and Ergonomic Considerations, pages 505–508, San Francisco, CA, August 1997. Elsevier Science Publishers. Proceedings of the Seventh International Conference on Human-Computer Interaction.

[DH97c] Daniel Dreilinger and Adele E. Howe. Experiences with selected search engines using metasearch. ACM Transactions on Information Systems, 15(3):195–222, July 1997.

[DH98] Brian D. Davison and Haym Hirsh. Predicting sequences of user actions. In Predicting the Future: AI Approaches to Time-Series Problems, pages 5–12, Madison, WI, July 1998. AAAI Press. Proceedings of AAAI-98/ICML-98 Workshop, published as Technical Report WS-98-07.

[DH99] Jeffrey Dean and Monika R. Henzinger. Finding related pages in the World Wide Web. In Proceedings of the Eighth International World Wide Web Conference, pages 389–401, Toronto, Canada, May 1999.

[DHR97] Fred Douglis, Antonio Haro, and Michael Rabinovich. HPP: HTML macro-preprocessing to support dynamic document caching. In Proceed- ings of the USENIX Symposium on Internet Technologies and Systems (USITS ’97), pages 83–94, Monterey, CA, December 1997. USENIX As- sociation.

[DK01] Mukund Deshpande and George Karypis. Selective Markov models for predicting Web-page accesses. In Proceedings of the First SIAM Interna- tional Conference on Data Mining (SDM’2001), Chicago, April 2001.

[DKNS01] Cynthia Dwork, Ravi Kumar, Moni Naor, and D. Sivakumar. Rank aggregation methods for the Web. In Proceedings of the Tenth International World Wide Web Conference, Hong Kong, May 2001.

[DKW02] Brian D. Davison, Chandrasekar Krishnan, and Baoning Wu. When does a hit = a miss? In Proceedings of the Seventh International Workshop on Web Content Caching and Distribution (WCW’02), Boulder, CO, August 2002. Accepted for publication.

[DL00] Brian D. Davison and Vincenzo Liberatore. Pushing politely: Improving Web responsiveness one packet at a time (Extended abstract). Perfor- mance Evaluation Review, 28(2):43–49, September 2000. Presented at the Performance and Architecture of Web Servers (PAWS) Workshop, June 2000.

[DM92] Patrick W. Demasco and Kathleen F. McCoy. Generating text from com- pressed input: An intelligent interface for people with severe motor im- pairments. Communications of the ACM, 35(5):68–78, May 1992.

[DMF97] Bradley M. Duska, David Marwood, and Michael J. Feely. The measured access characteristics of World-Wide-Web client proxy caches. In Proceed- ings of the USENIX Symposium on Internet Technologies and Systems (USITS ’97), pages 23–36, Monterey, CA, December 1997. USENIX As- sociation.

[DMS97] Matjaz Debevc, Beth Meyer, and Rajko Svecko. An adaptive short list for documents on the World Wide Web. In Proceedings of the 1997 Interna- tional Conference on Intelligent User Interfaces, pages 209–211, Orlando, FL, 1997. ACM Press.

[DP96] Adam Dingle and Tomas Partl. Web cache coherence. Computer Networks and ISDN Systems, 28(7-11):907–920, May 1996.

[DS98] William DuMouchel and Matt Schonlau. A fast computer intrusion detec- tion algorithm based on hypothesis testing of command transition proba- bilities. In Agrawal, Stolorz, and Piatetsky-Shapiro, editors, Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pages 189–193, New York, 1998. AAAI Press.

[DT82] Walter J. Doherty and Ahrvind J. Thadani. The economic value of rapid response time. Technical Report GE20-0752-0, IBM, White Plains, NY, November 1982.

[Duc99] Dan Duchamp. Prefetching hyperlinks. In Proceedings of the Second USENIX Symposium on Internet Technologies and Systems (USITS ’99), Boulder, CO, October 1999.

[Dum95] NASA-HTTP server logs. Jim Dumoulin of the NASA Kennedy Space Center, 1995. Available from http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html.

[DWJ90] John J. Darragh, Ian H. Witten, and M. L. James. The reactive keyboard: A predictive typing aid. IEEE Computer, 23(11):41–49, November 1990.

[eAc01a] eAcceleration Corporation. Webcelerator help page on prefetching. http://www.webcelerator.com/webcelerator/prefetch.htm, 2001.

[eAc01b] eAcceleration Corporation. Webcelerator home page. http://www.webcelerator.com/products/webcelerator/, 2001.

[EBHDC01] Samhaa R. El-Beltagy, Wendy Hall, David De Roure, and Leslie Carr. Linking in context. In Proceedings of the Twelfth ACM Conference on Hypertext and Hypermedia (HT’01), Aarhus, Denmark, August 2001.

[ESNP96] Mansour Esmaili, Reihaneh Safavi-Naini, and Joseph Pieprzyk. Eviden- tial reasoning in network intrusion detection systems. In Josef Pieprzyk and Jennifer Seberry, editors, Information Security and Privacy: First Australasian Conference, ACISP’96, pages 253–265, Wollongong, NSW, Australia, June 1996. Springer-Verlag. Lecture Notes in Computer Sci- ence 1172.

[Exp02] Expand Networks, Inc. Expand Networks homepage. http://www.expand.com/, 2002.

[Fas02] Fast Search & Transfer ASA. Fast home page. http://www.fastsearch.com/, 2002.

[FCAB98] Li Fan, Pei Cao, Jussara Almeida, and Andrei Z. Broder. Summary cache: A scalable wide-area Web cache sharing protocol. Computer Communica- tion Review, 28(4), October 1998. Proceedings of ACM SIGCOMM.

[FCD+99] Anja Feldmann, Ramón Cáceres, Fred Douglis, Gideon Glass, and Michael Rabinovich. Performance of Web proxy caching in heterogeneous bandwidth environments. In Proceedings of IEEE INFOCOM, pages 106–116, New York, March 1999.

[Fel98] Anja Feldmann. Continuous online extraction of HTTP traces from packet traces. In World Wide Web Consortium Workshop on Web Characteriza- tion, Cambridge, MA, November 1998. Position paper.

[FGM+99] Roy T. Fielding, Jim Gettys, Jeffrey C. Mogul, Henrik Frystyk, L. Mas- inter, P. Leach, and Tim Berners-Lee. Hypertext Transfer Protocol — HTTP/1.1. RFC 2616, http://ftp.isi.edu/in-notes/rfc2616.txt, June 1999.

[FJCL99] Li Fan, Quinn Jacobson, Pei Cao, and Wei Lin. Web prefetching be- tween low-bandwidth clients and proxies: Potential and performance. In Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS ’99), Atlanta, GA, May 1999.

[FLGC02] Gary William Flake, Steve Lawrence, C. Lee Giles, and Frans M. Coetzee. Self-organization and identification of web communities. IEEE Computer, 35(3):66–71, 2002.

[Flo99] Sally Floyd. Validation experiences with the ns simulator. In Proceedings of the DARPA/NIST Network Simulation Validation Workshop, Fairfax, VA, May 1999.

[FP99] Tom Fawcett and Foster Provost. Activity monitoring: Noticing interest- ing changes in behavior. In Proceedings of the fifth ACM SIGKDD In- ternational Conference on Knowledge Discovery and Data Mining, pages 53–62, San Diego, CA, August 1999.

[FS96] Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In Proc. 13th International Conference on Machine Learning, pages 148–156, 1996.

[FS00] Dan Foygel and Dennis Strelow. Reducing Web latency with hierarchical cache-based prefetching. In Proceedings of the International Workshop on Scalable Web Services (in conjunction with ICPP’00), Toronto, August 2000.

[FX00] Guoliang Fan and Xiang-Gen Xia. Maximum likelihood texture analysis and classification using wavelet-domain hidden Markov models. In Proceedings of the 34th Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, 2000.

[Gad01] Syam Gadde. The Proxycizer Web proxy tool suite. Available at http://www.cs.duke.edu/ari/cisi/Proxycizer/, 2001.

[GB97] Steven D. Gribble and Eric A. Brewer. System design issues for Internet middleware services: Deductions from a large client trace. In Proceedings of the USENIX Symposium on Internet Technologies and Systems (USITS ’97), December 1997.

[GCR97] Syam Gadde, Jeff Chase, and Michael Rabinovich. Directory structures for scalable Internet caches. Technical Report CS-1997-18, Department of Computer Science, Duke University, November 1997.

[GCR98] Syam Gadde, Jeff Chase, and Michael Rabinovich. A taste of crispy squid. In Workshop on Internet Server Performance (WISP’98), Madison, WI, June 1998.

[GDMW95] Saul Greenberg, John J. Darragh, David Maulsby, and Ian H. Witten. Predictive interfaces: What will they think of next? In Alistair D. N. Edwards, editor, Extra-ordinary human–computer interaction: interfaces for users with disabilities, chapter 6, pages 103–140. Cambridge University Press, 1995.

[GKR98] David Gibson, Jon M. Kleinberg, and Prabhakar Raghavan. Inferring Web communities from link topology. In Proceedings of the 9th ACM Conference on Hypertext and Hypermedia (Hypertext’98), 1998. Expanded version at http://www.cs.cornell.edu/home/kleinber/.

[Goo02] Google Inc. Google home page. http://www.google.com/, 2002.

[GP00a] Peter Gorniak and David Poole. Building a stochastic dynamic model of application use. In Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI-2000), July 2000.

[GP00b] Peter Gorniak and David Poole. Predicting future user actions by observ- ing unmodified applications. In Proceedings of the Seventeenth National Conference on Artificial Intelligence, pages 217–222, Austin, TX, July 2000. AAAI Press.

[GPB98] Arthur Goldberg, Ilya Pevzner, and Robert Buff. Caching characteristics of Internet and intranet Web proxy traces. In Computer Measurement Group Conference CMG98, Anaheim, CA, December 1998.

[GRC97] Syam Gadde, Michael Rabinovich, and Jeff Chase. Reduce, reuse, recycle: An approach to building large Internet caches. In The Sixth Workshop on Hot Topics in Operating Systems (HotOS-VI), pages 93–98, May 1997.

[Gre88] Saul Greenberg. Using Unix: collected traces of 168 users. Research Re- port 88/333/45, Department of Computer Science, University of Calgary, Alberta, 1988. Includes tar-format cartridge tape.

[Gre93] Saul Greenberg. The Computer User as Toolsmith: The Use, Reuse, and Organization of Computer-based Tools. Cambridge University Press, New York, NY, 1993.

[Gri97] Steven D. Gribble. UC Berkeley Home IP HTTP traces. Online: http://www.acm.org/sigcomm/ITA/, July 1997.

[GS96] James Gwertzman and Margo Seltzer. World-Wide Web cache consis- tency. In Proceedings of the USENIX Technical Conference, pages 141– 151, San Diego, CA, January 1996. USENIX Association.

[HD97] Adele Howe and Daniel Dreilinger. SavvySearch: A metasearch engine that learns which search engines to query. AI Magazine, 18(2), 1997.

[HG98] Stephanie W. Haas and Erika S. Grams. Page and link classifications: Connecting diverse resources. In Proceedings of the third ACM Conference on Digital libraries, May 1998.

[HHMN00] Monika R. Henzinger, Allan Heydon, Michael Mitzenmacher, and Marc Najork. On near-uniform URL sampling. In Proceedings of the Ninth International World Wide Web Conference, Amsterdam, May 2000.

[HL96] Barron C. Housel and David B. Lindquist. WebExpress: A system for optimizing Web browsing in a wireless environment. In Proceedings of the Second Annual International Conference on Mobile Computing and Networking (MobiCom ’96), pages 108–116, Rye, New York, November 1996. ACM/IEEE.

[HMK00] John Heidemann, Kevin Mills, and Sri Kumar. Expanding confidence in network simulation. Research Report 00-522, USC/Information Sciences Institute, April 2000.

[HMT98] Tsukasa Hirashima, Noriyuki Matsuda, and Jun'ici Toyoda. Context-sensitive filtering for hypertext browsing. In Proceedings of the 1998 International Conference on Intelligent User Interfaces. ACM Press, 1998.

[Hol98] K. Holtman. The safe response header field. RFC 2310, http://ftp.isi.edu/in-notes/rfc2310.txt, April 1998.

[Hor98] Eric Horvitz. Continual computation policies for utility-directed prefetch- ing. In Proceedings of the Seventh ACM Conference on Information and Knowledge Management, pages 175–184, Bethesda, MD, November 1998. ACM Press: New York.

[HOT97] John Heidemann, Katia Obraczka, and Joe Touch. Modeling the perfor- mance of HTTP over several transport protocols. IEEE/ACM Transac- tions on Networking, 5(5):616–630, October 1997.

[HP87] Fredrick J. Hill and Gerald R. Peterson. Digital Systems: Hardware Orga- nization and Design. John Wiley & Sons, New York, 1987. Third Edition.

[HRI+00] Ernst Georg Haffner, Uwe Roth, Andreas Heuer II, Thomas Engel, and Christoph Meinel. What do hyperlink-proposals and request-prediction have in common? In Proceedings of the International Conference on Ad- vances in Information Systems (ADVIS), pages 275–282, 2000. Springer LNCS 1909.

[HS93] Leonard A. Hermens and Jeffrey C. Schlimmer. A machine-learning ap- prentice for the completion of repetitive forms. In Proceedings of the Ninth IEEE Conference on Artificial Intelligence Applications, pages 164–170, Los Alamitos, CA, March 1993. IEEE Computer Society Press.

[HT01] Amy S. Hughes and Joseph D. Touch. Expanding the scope of prefetching through inter-application cooperation. In Web Caching and Content De- livery: Proceedings of the Sixth International Web Content Caching and Distribution Workshop (WCW’01), Boston, MA, June 2001. Synopsis of work in progress.

[HWMS98] John H. Hine, Craig E. Wills, Anja Martel, and Joel Sommers. Combin- ing client knowledge and resource dependencies for improved World Wide Web performance. In Proceedings of the Eighth Annual Conference of the Internet Society (INET’98), Geneva, Switzerland, July 1998.

[IBM00] IBM Almaden Research Center. The CLEVER Project. Home page: http://www.almaden.ibm.com/cs/k53/clever.html, 2000.

[IC97] Arun Iyengar and Jim Challenger. Improving web server performance by caching dynamic data. In USENIX Symposium on Internet Technologies and Systems (USITS’97), 1997.

[Ims02] Imsisoft. Imsisoft home page. http://www.imsisoft.com/, 2002.

[Inf02a] InfoSpace, Inc. Dogpile home page. http://www.dogpile.com/, 2002.

[Inf02b] InfoSpace, Inc. MetaCrawler home page. http://www.metacrawler.com/, 2002.

[Ink02] Inktomi Corporation. Inktomi home page. http://www.inktomi.com/, 2002.

[Int99] Internet Research Group. Worldwide Internet caching market forecast. The Standard, August 1999. Chart available from http://www.thestandard.com/research/metrics/display/0,2799,9823,00.html.

[IX00] Tamer I. Ibrahim and Cheng-Zhong Xu. Neural net based pre-fetching to tolerate WWW latency. In Proceedings of the 20th International Confer- ence on Distributed Computing Systems (ICDCS2000), April 2000.

[JB01] Nico Jacobs and Hendrik Blockeel. From shell logs to shell scripts. In International Workshop on Inductive Logic Programming, pages 80–90, 2001.

[JBC98] John Judge, H.W.P. Beadle, and J. Chicharo. Sampling HTTP response packets for prediction of Web traffic volume statistics. In Proceedings of IEEE Globecom, November 1998.

[JFM97] Thorsten Joachims, Dayne Freitag, and Tom Mitchell. WebWatcher: A tour guide for the World Wide Web. In Proceedings of the Fifteenth Inter- national Joint Conference on Artificial Intelligence, pages 770–775. Mor- gan Kaufmann, August 1997.

[JG99] Doug Joseph and Dirk Grunwald. Prefetching using Markov predictors. Transactions on Computers, 48(2), February 1999.

[JK97] Zhimei Jiang and Leonard Kleinrock. Prefetching links on the WWW. In ICC ’97, pages 483–489, Montreal, Canada, June 1997.

[JK98] Zhimei Jiang and Leonard Kleinrock. An adaptive network prefetch scheme. IEEE Journal on Selected Areas in Communications, 16(3):358– 368, April 1998.

[Joh98] Tommy Johnson. Webjamma. Available from http://www.cs.vt.edu/~chitra/webjamma.html, 1998.

[Jun00] Junkbusters Corporation. The Internet Junkbuster proxy. Available from http://www.junkbuster.com/ijb.html, 2000.

[KABL96] T. Koch, A. Ardo, A. Brummer, and S. Lundberg. The building and maintenance of robot based Internet search services: A review of current indexing and data collection methods. Prepared for Work Package 3 of EU Telematics for Research, project DESIRE; Available from http://www.ub2.lu.se/desire/radar/reports/D3.11/, Septem- ber 1996.

[Kah97] Brewster Kahle. Preserving the Internet. Scientific American, 276(3):82– 83, March 1997.

[KBM+00] Paul B. Kantor, Endre Boros, Benjamin Melamed, Vladimir Menkov, Bracha Shapira, and David J. Neu. Capturing human intelligence in the net. Communications of the ACM, 43(8):112–115, August 2000.

[KCK+96] Tracy Kimbrel, Pei Cao, Anna Karlin, Ed Felten, and Kai Li. Integrated parallel prefetching and caching. In Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '96), Philadelphia, PA, May 1996. Longer version available as Princeton Computer Science Department Technical Report TR-502-95, November 1995.

[Kel01] Terence Kelly. Thin-client Web access patterns: Measurements from a cache-busting proxy. In Web Caching and Content Delivery: Proceedings of the Sixth International Web Content Caching and Distribution Workshop (WCW'01), Boston, MA, June 2001.

[KG00] Benjamin Korvemaker and Russell Greiner. Predicting UNIX command lines: Adjusting to user patterns. In Proceedings of the Seventeenth National Conference on Artificial Intelligence, pages 230–235, Austin, TX, July 2000. AAAI Press.

[Kha02] Igor Khasilev. Oops proxy server homesite. Available from http://www.oops-cache.org/, 2002.

[KL96] Thomas M. Kroeger and Darrell D. E. Long. Predicting file system actions from prior events. In Proceedings of the USENIX Technical Conference, pages 482–487, San Diego, CA, January 1996.

[Kle98] Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA-98), pages 668–677, San Francisco, CA, January 1998. Expanded version at http://www.cs.cornell.edu/home/kleinber/.

[Kle99] Reinhard P. Klemm. WebCompanion: A friendly client-side Web prefetching agent. IEEE Transactions on Knowledge and Data Engineering, 11(4):577–594, July/August 1999.

[KLM97] Thomas M. Kroeger, Darrell D. E. Long, and Jeffrey C. Mogul. Exploring the bounds of Web latency reduction from caching and prefetching. In Proceedings of the USENIX Symposium on Internet Technologies and Systems (USITS '97), December 1997.

[KMK99] Balachander Krishnamurthy, Jeffrey C. Mogul, and David M. Kristol. Key differences between HTTP/1.0 and HTTP/1.1. In Proceedings of the Eighth International World Wide Web Conference, pages 659–673, Toronto, Canada, May 1999.

[Kos94] Martijn Koster. A standard for robot exclusion. Available from: http://www.robotstxt.org/wc/norobots.html, 1994.

[KPS+99] Colleen Kehoe, Jim Pitkow, Kate Sutton, Gaurav Aggarwal, and Juan D. Rogers. Results of GVU's tenth World Wide Web user survey. Graphic, Visualization, & Usability Center, Georgia Institute of Technology, Atlanta, GA: http://www.gvu.gatech.edu/user_surveys/survey-1998-10/tenthreport.html, May 1999.

[KR98] Balachander Krishnamurthy and Jennifer Rexford. Software issues in characterizing Web server logs. In World Wide Web Consortium Work- shop on Web Characterization, Cambridge, MA, November 1998. Position paper.

[KR00] James F. Kurose and Keith W. Ross. Computer Networking: A Top-Down Approach Featuring the Internet. Addison-Wesley, 2000.

[KR01] Balachander Krishnamurthy and Jennifer Rexford. Web Protocols and Practice: HTTP 1.1, Networking Protocols, Caching, and Traffic Mea- surement. Addison-Wesley, 2001.

[KRRT99] S. Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins. Trawling the Web for emerging cyber-communities. In Proceed- ings of the Eighth International World Wide Web Conference, Toronto, Canada, May 1999.

[KSW98] Michal Kurcewicz, Wojtek Sylwestrzak, and Adam Wierzbicki. A filtering algorithm for proxy caches. In Proceedings of the Third International WWW Caching Workshop, Manchester, England, June 1998.

[Kum95] Sandeep Kumar. Classification and detection of computer intrusions. PhD thesis, Department of Computer Science, Purdue University, West Lafayette, IN, 1995.

[KV01] Mimika Koletsou and Geoffrey M. Voelker. The Medusa proxy: A tool for exploring user-perceived Web performance. In Web Caching and Content Delivery: Proceedings of the Sixth International Web Content Caching and Distribution Workshop (WCW’01), Boston, MA, June 2001.

[KW97] Balachander Krishnamurthy and Craig E. Wills. Study of piggyback cache validation for proxy caches in the world wide web. In Proceedings of the USENIX Symposium on Internet Technologies and Systems (USITS ’97). USENIX Association, December 1997.

[KW98a] Achim Kraiss and Gerhard Weikum. Integrated document caching and prefetching in storage hierarchies based on Markov-chain predictions. VLDB Journal, 7:141–162, 1998.

[KW98b] Balachander Krishnamurthy and Craig E. Wills. Piggyback server invalidation for proxy cache coherency. Computer Networks and ISDN Systems, 30, 1998. Also in Proceedings of the Seventh International World Wide Web Conference, pages 185-193, Brisbane, Australia, April 1998.

[Lai92] Philip Laird. Discrete sequence prediction and its applications. In Pro- ceedings of the Tenth National Conference on Artificial Intelligence, Menlo Park, CA, 1992. AAAI Press.

[LAJF98] Binzhang Lui, Ghaleb Abdulla, Tommy Johnson, and Edward A. Fox. Web response time and proxy caching. In Proceedings of WebNet98, Orlando, FL, November 1998.

[LaM96] Brian A. LaMacchia. Internet Fish. PhD thesis, MIT Department of Elec- trical Engineering and Computer Science, Cambridge, MA, 1996. Also available as AI Technical Report 1579, MIT Artificial Intelligence Labo- ratory.

[LaM97] Brian A. LaMacchia. The Internet Fish construction kit. In Proceedings of the Sixth International World Wide Web Conference, Santa Clara, CA, April 1997.

[Lan00] Terran Lane. Machine learning techniques for the computer security do- main of anomaly detection. PhD thesis, Purdue University, August 2000.

[LBO99] Bin Lan, Stephane Bressan, and Beng Chin Ooi. Making Web servers pushier. In Proceedings of the Workshop on Web Usage Analysis and User Profiling (WEBKDD’99), San Diego, CA, August 1999.

[LC98] Chengjie Liu and Pei Cao. Maintaining strong cache consistency for the World-Wide Web. IEEE Transactions on Computers, 47(4):445–457, April 1998.

[LCD01] Dan Li, Pei Cao, and M. Dahlin. WCIP: Web cache invalidation protocol. Internet-Draft draft-danli-wrec-wcip-01.txt, IETF, March 2001. Work in progress.

[LE95] Neal Lesh and Oren Etzioni. A sound and fast goal recognizer. In Pro- ceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pages 1704–1710. Morgan Kaufmann, 1995.

[Lee92] Alison Lee. Investigations into History Tools for User Support. PhD thesis, Computer Systems Research Institute, University of Toronto, April 1992. Available as Technical Report CSRI–271.

[Lee96] David C. Lee. Pre-fetch document caching to improve World-Wide Web user response time. Master’s thesis, Virginia Tech., Blacksburg, VA, March 1996.

[Les97] Neal Lesh. Adaptive goal recognition. In Proceedings of the Fifteenth In- ternational Joint Conference on Artificial Intelligence. Morgan Kaufmann, August 1997.

[LG98a] Steve Lawrence and C. Lee Giles. Context and page analysis for improved Web search. IEEE Internet Computing, 2(4):38–46, 1998.

[LG98b] Steve Lawrence and C. Lee Giles. Inquirus, the NECI meta search engine. In Proceedings of the Seventh International World Wide Web Conference, Brisbane, Australia, April 1998.

[LG98c] Steve Lawrence and C. Lee Giles. Searching the World Wide Web. Science, 280(5360):98–100, April 1998.

[LG99] Steve Lawrence and C. Lee Giles. Accessibility of information on the Web. Nature, 400:107–109, July 1999.

[LHC+98] Jieun Lee, Hyojung Hwang, Youngmin Chin, Hyunchul Kim, and Kilnam Chon. Report on the costs and benefits of cache hierarchy in Korea. In Proceedings of the Third International WWW Caching Workshop, Manchester, England, June 1998.

[Lie95] Henry Lieberman. Letizia: An agent that assists Web browsing. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI’95), pages 924–929, Montreal, August 1995.

[Lie97] Henry Lieberman. Autonomous interface agents. In Proceedings of the ACM SIGCHI’97 Conference on Human Factors in Computing Systems, Atlanta, GA, March 1997.

[Lit88] Nicholas Littlestone. Learning when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285–318, 1988.

[LM99] Chi-Keung Luk and Todd C. Mowry. Automatic compiler-inserted prefetching for pointer-based applications. IEEE Transactions on Computers, 48(2), 1999. Special Issue on Cache Memory and Related Problems.

[LNBL96] Ari Luotonen, Henrik Frystyk Nielsen, and Tim Berners-Lee. CERN/W3C httpd status page. http://www.w3.org/Daemon/Status.html, 1996.

[LS94] Philip Laird and Ronald Saul. Discrete sequence prediction and its applications. Machine Learning, 15(1):43–68, 1994.

[Luh01] James C. Luh. No bots allowed! Interactive Week, April 2001. Online at http://www.zdnet.com/intweek/stories/news/0,4164,2707542,00.html.

[Lun90] T. F. Lunt. IDES: An intelligent system for detecting intruders. In Proceedings of the Symposium: Computer Security, Threat and Countermeasures, Rome, Italy, 1990.

[LW94a] Alvin R. Lebeck and David A. Wood. Cache profiling and the SPEC benchmarks: A case study. IEEE Computer, 27(10):15–26, 1994.

[LW94b] Nicholas Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108:212–261, 1994.

[LYBS00] Jeremy Lilley, Jason Yang, Hari Balakrishnan, and Srinivasan Seshan. A unified header compression framework for low-bandwidth links. In Proceedings of MobiCom: Sixth Annual International Conference on Mobile Computing and Networking, Boston, August 2000.

[Lyc02a] Lycos, Inc. HotBot home page. http://hotbot.lycos.com/, 2002.

[Lyc02b] Lycos, Inc. Lycos home page. http://www.lycos.com/, 2002.

[LYW01] Ian Tianyi Li, Qiang Yang, and Ke Wang. Classification pruning for Web-request prediction. In Poster Proceedings of the 10th World Wide Web Conference (WWW10), Hong Kong, May 2001.

[LZ01] Richard Liston and Ellen Zegura. Using a proxy to measure client-side Web performance. In Web Caching and Content Delivery: Proceedings of the Sixth International Web Content Caching and Distribution Workshop (WCW’01), Boston, MA, June 2001.

[Man82] M. Morris Mano. Computer System Architecture. Prentice-Hall, Englewood Cliffs, NJ, 1982. Second Edition.

[Mar96] Evangelos P. Markatos. Main memory caching of Web documents. Computer Networks and ISDN Systems, 28(7-11):893–905, May 1996.

[MB00] Filippo Menczer and Richard K. Belew. Adaptive retrieval agents: Internalizing local context and scaling up to the Web. Machine Learning, 39(2/3):203–242, 2000. Longer version available as Technical Report CS98-579, University of California, San Diego.

[MB01] Ross J. Micheals and Terrance E. Boult. Efficient evaluation of classification and recognition systems. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, December 2001.

[MC98] Evangelos P. Markatos and Catherine E. Chronaki. A top–10 approach for prefetching the Web. In Proceedings of the Eighth Annual Conference of the Internet Society (INET’98), Geneva, Switzerland, July 1998.

[McB94] Oliver A. McBryan. GENVL and WWWW: Tools for taming the Web. In Proceedings of the First International World Wide Web Conference, Geneva, Switzerland, May 1994.

[MDFK97] Jeffrey Mogul, Fred Douglis, Anja Feldmann, and Balachander Krishnamurthy. Potential benefits of delta-encoding and data compression for HTTP. In Proceedings of ACM SIGCOMM, pages 181–194, Cannes, France, September 1997. An extended and corrected version appears as Research Report 97/4a, Digital Equipment Corporation Western Research Laboratory, December, 1997.

[Men97] Filippo Menczer. ARACHNID: Adaptive Retrieval Agents Choosing Heuristic Neighborhoods for Information Discovery. In Proceedings of the 14th International Conference on Machine Learning (ICML97), 1997. Available from http://dollar.biz.uiowa.edu/~fil/papers.html.

[MF01] Alan L. Montgomery and Christos Faloutsos. Identifying Web browsing trends and patterns. Computer, 34(7):94–95, July 2001.

[MH99] Miles J. Murdocca and Vincent P. Heuring. Principles of Computer Architecture. Prentice Hall, 1999.

[MIB00] Jean-Marc Menaud, Valérie Issarny, and Michel Banâtre. Improving effectiveness of Web caching. In Recent Advances in Distributed Systems, volume 1752 of LNCS. Springer Verlag, 2000.

[Mit97] Tom M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.

[MJ98] David Mosberger and Tai Jin. httperf—A tool for measuring Web server performance. Performance Evaluation Review, 26(3):31–37, December 1998.

[MKPF99] Evangelos P. Markatos, Manolis G.H. Katevenis, Dionisis Pnevmatikatos, and Michail Flouris. Secondary storage management for Web proxies. In Proceedings of the 2nd USENIX Symposium on Internet Technologies and Systems (USITS99), pages 93–104, 1999.

[ML97] Jeffrey C. Mogul and P. Leach. Simple hit-metering and usage-limiting for HTTP. RFC 2227, http://ftp.isi.edu/in-notes/rfc2227.txt, October 1997.

[Mla96] Dunja Mladenic. Personal WebWatcher: Implementation and design. Technical Report IJS-DP-7472, Department of Intelligent Systems, J. Stefan Institute, Univ. of Ljubljana, Slovenia, October 1996.

[MMS85] Tom M. Mitchell, Sridhar Mahadevan, and Louis I. Steinberg. LEAP: A learning apprentice for VLSI design. In Proceedings of the Ninth International Joint Conference on Artificial Intelligence, Los Angeles, CA, August 1985.

[MN94] T. Masui and K. Nakayama. Repeat and predict — two keys to efficient text editing. In Proceedings of the Conference on Human Factors in Computing Systems, pages 118–123, New York, April 1994. ACM Press.

[Mog96] Jeffrey C. Mogul. Hinted caching in the Web. In Proceedings of the Seventh ACM SIGOPS European Workshop, Connemara, Ireland, September 1996.

[Mog00] Jeffrey C. Mogul. Squeezing more bits out of HTTP caches. IEEE Network, 14(3):6–14, May/June 2000.

[Mow94] Todd C. Mowry. Tolerating Latency Through Software-Controlled Data Prefetching. PhD thesis, Computer Systems Laboratory, Stanford University, March 1994.

[MR97] Carlos Maltzahn and Kathy J. Richardson. Performance issues of enterprise level Web proxies. In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 13–23, Seattle, WA, June 1997. ACM Press.

[MRG99] Carlos Maltzahn, Kathy J. Richardson, and Dirk Grunwald. Reducing the disk I/O of Web proxy server caches. In Proceedings of the 1999 USENIX Annual Technical Conference, Monterey, California, June 1999.

[MRGM99] Carlos Maltzahn, Kathy J. Richardson, Dirk Grunwald, and James H. Martin. On bandwidth smoothing. In Proceedings of the Fourth International Web Caching Workshop (WCW99), San Diego, CA, March 1999.

[MSC98] Stephen Manley, Margo Seltzer, and Michael Courage. A self-scaling and self-configuring benchmark for Web servers. In Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS ’98/PERFORMANCE ’98), pages 270–271, Madison, WI, June 1998.

[MWE00] Anirban Mahanti, Carey Williamson, and Derek Eager. Traffic analysis of a Web proxy caching hierarchy. IEEE Network, 14(3):16–23, May/June 2000.

[MY97] Hiroshi Motoda and Kenichi Yoshida. Machine learning techniques to make computers easier to use. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, pages 1622–1631. Morgan Kaufmann, August 1997.

[Nat02] National Laboratory for Applied Network Research. A distributed testbed for national information provisioning. Home page: http://www.ircache.net/, 2002.

[Net99] Netscape, Inc. Mozilla source code. Available at http://lxr.mozilla.org/classic/, 1999.

[NGBS+97] Henrik Frystyk Nielsen, James Gettys, Anselm Baird-Smith, Eric Prud’hommeaux, Hakon Wium Lie, and Chris Lilley. Network performance effects of HTTP/1.1, CSS1, and PNG. Computer Communications Review, 27(4), October 1997. Proceedings of SIGCOMM ’97. Also available as W3C NOTE-pipelining-970624.

[NLL00] H. Nielsen, P. Leach, and S. Lawrence. An HTTP extension framework. RFC 2774, http://ftp.isi.edu/in-notes/rfc2774.txt, February 2000.

[NLN98] Nicolas Niclausse, Zhen Liu, and Philippe Nain. A new and efficient caching policy for the World Wide Web. In Workshop on Internet Server Performance (WISP’98), Madison, WI, June 1998.

[NM93] Craig G. Nevill-Manning. Programming by demonstration. New Zealand Journal of Computing, 4(2):15–24, May 1993.

[NM96] Craig G. Nevill-Manning. Inferring Sequential Structure. PhD thesis, Department of Computer Science, University of Waikato, New Zealand, May 1996.

[NZA98] Ann E. Nicholson, Ingrid Zukerman, and David W. Albrecht. A decision-theoretic approach for pre-sending information on the WWW. In Proceedings of the 5th Pacific Rim International Conference on Artificial Intelligence (PRICAI’98), pages 575–586, Singapore, 1998.

[OA01] Oracle Corporation and Akamai Technologies. Edge side includes home page. Online at http://www.edge-delivery.org/, 2001.

[Pad95] Venkata N. Padmanabhan. Improving World Wide Web latency. Technical Report UCB/CSD-95-875, UC Berkeley, May 1995.

[Pal98] Themistoklis Palpanas. Web prefetching using partial match prediction. Master’s thesis, Department of Computer Science, University of Toronto, Toronto, Ontario, CA, March 1998. Available as Technical Report CSRG-376.

[Par96] Tomas Partl. A comparison of WWW caching algorithm efficiency. In ICM Workshop on Web Caching, Warsaw, Poland, 1996.

[Pax99] Vern Paxson. End-to-end Internet packet dynamics. IEEE/ACM Transactions on Networking, 7(3):277–292, June 1999.

[PB97] Michael J. Pazzani and Daniel Billsus. Learning and revising user profiles: The identification of interesting Web sites. Machine Learning, 27:313–331, 1997.

[PE97] Mike Perkowitz and Oren Etzioni. Adaptive Web sites: Automatically learning from user access patterns. In Proceedings of the Sixth International World Wide Web Conference, Santa Clara, CA, April 1997. Poster.

[PE98] Mike Perkowitz and Oren Etzioni. Adaptive Web sites: Automatically synthesizing Web pages. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, Madison, WI, July 1998. AAAI Press.

[PE00] Mike Perkowitz and Oren Etzioni. Adaptive Web sites. Communications of the ACM, 43(8):152–158, August 2000.

[Pea02] PeakSoft Corporation. PeakJet 2000 web page. http://www.peaksoft.com/peakjet2.html, 2002.

[Pel96] Jeff Pelline. Accelerators cause headaches. c|net News.com, December 1996. http://news.cnet.com/news/0,10000,0-1003-200-314937,00.html.

[Per02] Persistence Software, Inc. Persistence home page. http://www.persistence.com/, 2002.

[PF00] Sanjoy Paul and Zongming Fei. Distributed caching with centralized control. In Proceedings of the Fifth International Web Caching and Content Delivery Workshop (WCW’00), Lisbon, Portugal, May 2000.

[PK98] Venkata N. Padmanabhan and Randy H. Katz. TCP fast start: A technique for speeding up Web transfers. In Proceedings of IEEE Globecom, pages 41–46, Sydney, Australia, November 1998.

[PM94] Venkata N. Padmanabhan and Jeffrey C. Mogul. Improving HTTP latency. In Proceedings of the Second International World Wide Web Conference: Mosaic and the Web, pages 995–1005, Chicago, IL, October 1994.

[PM96] Venkata N. Padmanabhan and Jeffrey C. Mogul. Using predictive prefetching to improve World Wide Web latency. Computer Communication Review, 26(3):22–36, July 1996. Proceedings of SIGCOMM ’96.

[PM99] Themistoklis Palpanas and Alberto Mendelzon. Web Prefetching Using Partial Match Prediction. In Proceedings of the Fourth International Web Caching Workshop (WCW99), San Diego, CA, March 1999. Work in progress.

[PMB96] Michael Pazzani, Jack Muramatsu, and Daniel Billsus. Syskill & Webert: Identifying interesting Web sites. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, Portland, OR, 1996.

[Por97] Martin F. Porter. An algorithm for suffix stripping. In Karen Sparck Jones and Peter Willett, editors, Readings in Information Retrieval. Morgan Kaufmann, San Francisco, 1997. Originally published in Program, 14(3):130-137 (1980).

[PP97] James E. Pitkow and Peter L. Pirolli. Life, death, and lawfulness on the electronic frontier. In ACM Conference on Human Factors in Computing Systems, Atlanta, GA, March 1997.

[PP99a] Peter L. Pirolli and James E. Pitkow. Distributions of surfers’ paths through the World Wide Web: Empirical characterization. World Wide Web, 2:29–45, 1999.

[PP99b] James E. Pitkow and Peter L. Pirolli. Mining longest repeated subsequences to predict World Wide Web surfing. In Proceedings of the Second USENIX Symposium on Internet Technologies and Systems, October 1999.

[PPR96] Peter Pirolli, James E. Pitkow, and Ramana Rao. Silk from a sow’s ear: Extracting usable structures from the Web. In Proceedings of CHI ’96: Human factors in computing systems, Vancouver, B.C., Canada, April 1996. ACM Press.

[PQ00] Venkata N. Padmanabhan and Lili Qiu. The content and access dynamics of a busy Web site: Findings and implications. In Proceedings of ACM SIGCOMM, pages 111–123, Stockholm, Sweden, August 2000.

[Qui93] J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993.

[RBP98] Judith Ramsay, Alessandro Barbesi, and Jenny Preece. A psychological investigation of long retrieval times on the World Wide Web. Interacting with Computers, 10:77–86, 1998.

[RCG98] Michael Rabinovich, Jeff Chase, and Syam Gadde. Not all hits are created equal: Cooperative proxy caching over a wide area network. In Proceedings of the Third International WWW Caching Workshop, Manchester, England, June 1998.

[RD99] Jean-David Ruvini and Christophe Dony. Learning users’ habits: The APE project. In Proceedings of the IJCAI’99 Workshop on Learning About Users, pages 65–73, Stockholm, Sweden, July 1999.

[RGM00] Sriram Raghavan and Hector Garcia-Molina. Crawling the hidden Web. Technical Report 2000-36, Computer Science Dept., Stanford University, December 2000.

[RID00] Daniela Rosu, Arun Iyengar, and Daniel Dias. Hint-based acceleration of Web proxy caches. In Proceedings of the 19th IEEE International Performance, Computing, and Communications Conference (IPCCC 2000), Phoenix, AZ, February 2000.

[Riv92] Ronald Rivest. The MD5 message-digest algorithm. RFC 1321, http://ftp.isi.edu/in-notes/rfc1321.txt, April 1992.

[RM99] Jason Rennie and Andrew McCallum. Efficient Web spidering with reinforcement learning. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML-99), 1999.

[Roa98] Chris Roast. Designing for delay in interactive information retrieval. Interacting with Computers, 10:87–104, 1998.

[Rou02] Alex Rousskov. Web Polygraph: Proxy performance benchmark. Available at http://www.web-polygraph.org/, 2002.

[RS98] Alex Rousskov and Valery Soloviev. On performance of caching proxies. In Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS ’98/PERFORMANCE ’98), pages 272–273, Madison, WI, June 1998.

[RS02] Michael Rabinovich and Oliver Spatscheck. Web Caching and Replication. Addison-Wesley, New York, 2002.

[RW98] Alex Rousskov and Duane Wessels. Cache digests. Computer Networks and ISDN Systems, 30(22–23):2155–2168, November 1998.

[RW00] Alex Rousskov and Duane Wessels. The third cache-off. Raw data and independent analysis at http://www.measurement-factory.com/results/, October 2000.

[RWW01] Alex Rousskov, Matthew Weaver, and Duane Wessels. The fourth cache-off. Raw data and independent analysis at http://www.measurement-factory.com/results/, December 2001.

[Sar98] Robert G. Sargent. Verification and validation of simulation models. In D. J. Medeiros, E. F. Watson, J. S. Carson, and M. S. Manivannan, editors, Proceedings of the Winter Simulation Conference, 1998.

[Sar00] Ramesh R. Sarukkai. Link prediction and path analysis using Markov chains. In Proceedings of the Ninth International World Wide Web Conference, Amsterdam, May 2000.

[Sat02] Yutaka Sato. DeleGate home page. Online at http://www.delegate.org/, 2002.

[SB98] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513–523, 1998.

[SCJO01] F. Donelson Smith, Felix Hernandez Campos, Kevin Jeffay, and David Ott. What TCP/IP protocol headers can tell us about the Web. In Proceedings of the ACM SIGMETRICS/Performance Conference on Measurement and Modeling of Computer Systems, pages 245–256, 2001.

[SE95] Erik Selberg and Oren Etzioni. Multi-service search and comparison using the MetaCrawler. In Proceedings of the Fourth International World Wide Web Conference, Boston, MA, December 1995.

[SE97] Erik Selberg and Oren Etzioni. The MetaCrawler architecture for resource aggregation on the Web. IEEE Expert, 12(1):8–14, Jan/Feb 1997.

[SFBL97] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. In Proc. 14th International Conference on Machine Learning, pages 322–330. Morgan Kaufmann, 1997.

[SH02] Rituparna Sen and Mark H. Hansen. Predicting a Web user’s next request based on log data. To appear in the Journal of Computational and Graphical Statistics. Available from http://cm.bell-labs.com/who/cocteau/papers/, 2002.

[SKS98] Stuart Schechter, Murali Krishnan, and Michael D. Smith. Using path profiles to predict HTTP requests. Computer Networks and ISDN Systems, 30:457–467, 1998. Proceedings of the Seventh International World Wide Web Conference.

[SM83] Gerard Salton and Michael J. McGill. Introduction to Modern Information Retrieval. McGraw Hill, New York, 1983.

[Smi82] Alan Jay Smith. Cache memories. ACM Computing Surveys, 14(3):473–530, September 1982.

[Spe97] Ellen Spertus. Parasite: Mining structural information on the Web. In Proceedings of the Sixth International World Wide Web Conference, Santa Clara, CA, April 1997.

[Spe02] Speedera Networks, Inc. Speedera home page. http://www.speedera.com/, 2002.

[Spi02] SpiderSoftware, Inc. Spidersoftware home page. http://www.spidersoftware.com/, 2002.

[Spr99] Tom Spring. Search engines gang up on Microsoft. PC World, November 1999. Available online at http://www.pcworld.com/pcwtoday/article/0,1510,13760,00.html.

[SR00] N. Swaminathan and S. V. Raghavan. Intelligent prefetching in WWW using client behavior characterization. In Proceedings of the Eighth International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), 2000.

[SSM87] Glenn Shafer, Prakash P. Shenoy, and Khaled Mellouli. Propagating belief functions in qualitative Markov trees. International Journal of Approximate Reasoning, 1(4):349–400, 1987.

[SSS99] Elizabeth Shriver, Christopher Small, and Keith Smith. Why does file system prefetching work? In Proceedings of the 1999 USENIX Annual Technical Conference, pages 71–83, Monterey, CA, June 1999.

[SSV97] Peter Scheuermann, Junho Shim, and Radek Vingralek. A case for delay-conscious caching of Web documents. In Proceedings of the Sixth International World Wide Web Conference, Santa Clara, CA, April 1997.

[SSV98] Junho Shim, Peter Scheuermann, and Radek Vingralek. A unified algorithm for cache replacement and consistency in Web proxy servers. In Proceedings of the Workshop on the Web and Data Bases (WebDB98), 1998. Extended version will appear in Lecture Notes in Computer Science, Springer Verlag, 1998.

[SSV99] Junho Shim, Peter Scheuermann, and Radek Vingralek. Proxy cache design: Algorithms, implementation and performance. IEEE Transactions on Knowledge and Data Engineering, 11(4):549–562, July/August 1999.

[STA01] Anees Shaikh, Renu Tewari, and Mukesh Agrawal. On the effectiveness of DNS-based server selection. In Proceedings of IEEE INFOCOM, Anchorage, AK, April 2001.

[Sto74] M. Stone. Cross-validation choices and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B, 36:111–147, 1974.

[Sto78] M. Stone. Cross validation: A review. Mathematische Operationsforschung und Statistik, Series Statistics, 9(1):127–139, 1978.

[Sul99] Danny Sullivan. More evil than Dr. Evil? From the Search Engine Report, at http://www.searchenginewatch.com/sereport/99/11-google.html, November 1999.

[Sul02] Danny Sullivan. Search engine features for webmasters. From Search Engine Watch, at http://www.searchenginewatch.com/webmasters/features.html, March 2002.

[SW96] Jeffrey C. Schlimmer and Patricia Crane Wells. Quantitative results comparing three intelligent interfaces for information capture: A case study adding name information into an electronic personal organizer. Journal of Artificial Intelligence Research, 5:329–349, 1996.

[SW98] J. Santos and David Wetherall. Increasing effective link bandwidth by suppressing replicated data. In Proceedings of the USENIX Annual Technical Conference, pages 213–224, New Orleans, June 1998.

[SW00] Neil T. Spring and David Wetherall. A protocol-independent technique for eliminating redundant network traffic. In Proceedings of ACM SIGCOMM, August 2000.

[SYLZ00] Zhong Su, Qiang Yang, Ye Lu, and Hong-Jiang Zhang. WhatNext: A prediction system for Web request using n-gram sequence models. In First International Conference on Web Information Systems and Engineering, Hong Kong, June 2000.

[Sys] System Performance Evaluation Cooperative (SPEC). SPECweb96 benchmark. http://www.specbench.org/osg/web96/.

[TG97] Linda Tauscher and Saul Greenberg. How people revisit Web pages: Empirical findings and implications for the design of history systems. International Journal of Human Computer Studies, 47(1):97–138, 1997.

[THO96] Joe Touch, John Heidemann, and Katia Obraczka. Analysis of HTTP performance. Available at http://www.isi.edu/lsam/publications/http-perf/, August 1996.

[THVK99] Renu Tewari, Michael Dahlin, Harrick M. Vin, and Jonathan S. Kay. Design considerations for distributed caching on the Internet. In Proceedings of the 19th International Conference on Distributed Computing Systems (ICDCS), Austin, May 1999.

[TVDS98] Renu Tewari, Harrick M. Vin, Asit Dan, and Dinkar Sitaram. Resource-based caching for Web servers. In Proceedings of the SPIE/ACM Conference on Multimedia Computing and Networking (MMCN), San Jose, CA, January 1998.

[UCB01] UCB/LBNL/VINT. Network simulator ns. http://www.isi.edu/nsnam/ns/, 2001.

[Utg89] Paul E. Utgoff. Incremental induction of decision trees. Machine Learning, 4(2):161–186, 1989.

[VL97] Steven P. Vander Wiel and David J. Lilja. When caches aren’t enough: Data prefetching techniques. Computer, 30(7), July 1997.

[VM02] Michael Vizard and Cathleen Moore. Akamai content to deliver. InfoWorld, March 22, 2002. Online at http://www.infoworld.com/articles/se/xml/02/03/25/020325seakamai.xml.

[VR98] Vinod Valloppillil and Keith W. Ross. Cache Array Routing Protocol v1.0. Internet Draft draft-vinod-carp-v1-03.txt, February 1998.

[VW00] Paul Vixie and Duane Wessels. Hyper Text Caching Protocol (HTCP/0.0). RFC 2756, http://ftp.isi.edu/in-notes/rfc2756.txt, January 2000.

[WA97] Roland P. Wooster and Marc Abrams. Proxy caching that estimates page load delays. In Proceedings of the Sixth International World Wide Web Conference, pages 325–334, Santa Clara, CA, April 1997.

[Wan99] Jia Wang. A survey of Web caching schemes for the Internet. ACM Computer Communication Review, 29(5):36–46, October 1999.

[WAS+96] Stephen Williams, Marc Abrams, Charles R. Standridge, Ghaleb Abdulla, and Edward A. Fox. Removal policies in network caches for World-Wide Web documents. In Proceedings of ACM SIGCOMM, pages 293–305, Stanford, CA, 1996. Revised March 1997.

[WC96] Zheng Wang and Jon Crowcroft. Prefetching in World-Wide Web. In Proceedings of IEEE Globecom, London, December 1996.

[WC97a] Duane Wessels and Kimberly Claffy. Application of Internet Cache Protocol (ICP), version 2. RFC 2187, http://ftp.isi.edu/in-notes/rfc2187.txt, September 1997.

[WC97b] Duane Wessels and Kimberly Claffy. Internet Cache Protocol (ICP), version 2. RFC 2186, http://ftp.isi.edu/in-notes/rfc2186.txt, September 1997.

[WC98] Zhe Wang and Pei Cao. Persistent connection behavior of popular browsers. http://www.cs.wisc.edu/~cao/papers/persistent-connection.html, 1998.

[Web02] Web 3000 Inc. NetSonic Internet Accelerator web page. http://www.web3000.com/, 2002.

[Wei01] Gary M. Weiss. Predicting telecommunication equipment failures from sequences of network alarms. In Willi Kloesgen and Jan Zytkow, editors, Handbook of Data Mining and Knowledge Discovery. Oxford University Press, 2001.

[Wes01] Duane Wessels. Web Caching. O’Reilly & Associates, 2001.

[Wes02] Duane Wessels. Squid Web proxy cache. Available at http://www.squid-cache.org/, 2002.

[WM00] Craig E. Wills and Mikhail Mikhailov. Studying the impact of more complete server information on Web caching. In Proceedings of the Fifth International Web Caching and Content Delivery Workshop (WCW’00), Lisbon, Portugal, May 2000.

[WMB99] Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, San Francisco, 1999. Second edition.

[Woo96] Roland Peter Wooster. Optimizing response time, rather than hit rates, of WWW proxy caches. Master’s thesis, Virginia Polytechnic Institute, December 1996. http://scholar.lib.vt.edu/theses/materials/public/etd-34131420119653540/etd-title.html.

[Wor94] Kurt Worrell. Invalidation in large scale network object caches. Master’s thesis, University of Colorado, Boulder, CO, 1994.

[Wor01] World Wide Web Consortium. Libwww — the W3C sample code library. http://www.w3.org/Library/, 2001.

[WVS+99] Alec Wolman, Geoffrey M. Voelker, Nitin Sharma, Neal Cardwell, Anna Karlin, and Henry M. Levy. On the scale and performance of cooperative Web proxy caching. Operating Systems Review, 34(5):16–31, December 1999. Proceedings of the 17th Symposium on Operating Systems Principles.

[WWB96] Roland P. Wooster, Stephen Williams, and Patrick Brooks. HTTPDUMP: Network HTTP packet snooper. Working paper available at http://www.cs.vt.edu/~chitra/work.html, April 1996.

[XCa02] XCache Technologies. XCache home page. http://www.xcache.com/, 2002.

[Yah02] Yahoo!, Inc. Yahoo! http://www.yahoo.com/, 2002.

[YM96] Kenichi Yoshida and Hiroshi Motoda. Automated user modeling for intelligent interface. International Journal of Human-Computer Interaction, 8(3):237–258, 1996.

[Yos94] Kenichi Yoshida. User command prediction by graph-based induction. In Proceedings of the Sixth International Conference on Tools with Artificial Intelligence, pages 732–735, Los Alamitos, CA, November 1994. IEEE Computer Society Press.

[YSG02] Yiming Yang, Sean Slattery, and Rayid Ghani. A study of approaches to hypertext categorization. Journal of Intelligent Information Systems, 18(2/3):219–241, 2002.

[YZL01] Qiang Yang, Haining Henry Zhang, and Ian Tianyi Li. Mining Web logs for prediction models in WWW caching and prefetching. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’01), San Francisco, August 2001. Industry Applications Track.

[ZA01] Ingrid Zukerman and David W. Albrecht. Predictive statistical models for user modeling. User Modeling and User-Adapted Interaction, 11(1/2):5–18, 2001.

[ZAN99] Ingrid Zukerman, David W. Albrecht, and Ann E. Nicholson. Predicting users’ requests on the WWW. In Proceedings of the Seventh International Conference on User Modeling (UM-99), pages 275–284, Banff, Canada, June 1999.

[ZE98] Oren Zamir and Oren Etzioni. Web document clustering: A feasibility demonstration. In Proceedings of the ACM/SIGIR Conference on Research and Development in Information Retrieval, 1998.

[ZE99] Oren Zamir and Oren Etzioni. Grouper: a dynamic clustering interface to Web search results. In Proceedings of the Eighth International World Wide Web Conference, Toronto, Canada, May 1999.

[ZFJ97] Lixia Zhang, Sally Floyd, and Van Jacobson. Adaptive Web caching. In Proceedings of the NLANR Web Cache Workshop, Boulder, CO, June 1997.

[ZIRO99] Junbiao Zhang, Rauf Izmailov, Daniel Reininger, and Maximilian Ott. WebCASE: A simulation environment for web caching study. In Proceedings of the Fourth International Web Caching Workshop (WCW99), San Diego, CA, March 1999.

[Zon99] The economic impacts of unacceptable Web site download speeds. White paper, Zona Research, 1999. Available from http://www.zonaresearch.com/deliverables/white papers/wp17/index.htm.

[ZY01] Michael Zhen Zhang and Qiang Yang. Model-based predictive prefetching. In Proceedings of the 2nd International Workshop on Management of Information on the Web — Web Data and Text Mining (MIW’01), Munich, Germany, September 2001.

Vita

Brian D. Davison

1969 Born in Paoli, Pennsylvania.

1987 Graduated from Twin Valley High School, Elverson, Pennsylvania.

1987-91 Attended Bucknell University, Lewisburg, Pennsylvania.

1991 B.S. in Computer Engineering, Bucknell University.

1991-2002 Graduate work in Computer Science, Rutgers, The State University of New Jersey, New Brunswick, New Jersey.

1991-1993 Teaching Assistant, Dept. of Computer Science, Rutgers University.

1994-2001 Research Assistant, Dept. of Computer Science, Rutgers University.

1995 M.S. in Computer Science, Rutgers, The State University of New Jersey.

1998 S.A. Macskassy, A. Banerjee, B.D. Davison, and H. Hirsh. Human performance on clustering Web pages: A preliminary study. Proc. of the 4th Int’l Conf. on Knowledge Discovery and Data Mining, 264-268.

1999 B.D. Davison. Simultaneous proxy evaluation. Proc. of the 4th Int’l Web Caching Workshop, 170-178.

2000 H. Hirsh, C. Basu, and B.D. Davison. Learning to personalize. Communications of the ACM, 43(8):102-106.

2000 B.D. Davison. Topical locality in the Web. Proc. of the 23rd Annual Int’l Conf. on Research and Development in Information Retrieval, 272-279.

2000 B.D. Davison and V. Liberatore. Pushing politely: Improving Web responsiveness one packet at a time. Performance Evaluation Review, 28(2):43-49.

2001 B.D. Davison. A Web caching primer. IEEE Internet Computing, 5(4):38-45.

2001 B.D. Davison. Assertion: Prefetching with GET is not good. Proc. of the 6th Int’l Web Content Caching and Distribution Workshop, 203-215.

2001-2002 Instructor, Dept. of Computer Science & Engineering, Lehigh University, Bethlehem, Pennsylvania.

2002 B.D. Davison. Predicting Web actions from HTML content. Proc. of the 13th ACM Conf. on Hypertext and Hypermedia, 159-168.

2002 B.D. Davison, C. Krishnan, and B. Wu. When does a hit = a miss? Proc. of the 7th Int’l Workshop on Web Content Caching and Distribution.