Adaptive Website Design Using Caching Algorithms
Total Page:16
File Type:pdf, Size:1020Kb
Adaptive Website Design using Caching Algorithms Justin Brickell Inderjit S. Dhillon Dharmendra S. Modha Department of Computer Department of Computer IBM Almaden Research Sciences Sciences Center University of Texas at Austin University of Texas at Austin San Jose, CA, USA Austin, TX, USA Austin, TX, USA [email protected] [email protected] [email protected] ABSTRACT Keywords Visitors enter a website through a variety of means, includ- Data Mining, Adaptive Web Sites, Caching, Pattern Mining, ing web searches, links from other sites, and personal book- Access Logs, Shortcutting marks. In some cases the first page loaded satisfies the vis- itor’s needs and no additional navigation is necessary. In other cases, however, the visitor is better served by content 1. INTRODUCTION located elsewhere on the site found by navigating links. If As websites increase in complexity, they run headfirst into the path between a user’s current location and his eventual a fundamental tradeoff: the more information that is avail- goal is circuitous, then the user may never reach that goal or able on the website, the more difficult it is for visitors to will have to exert considerable effort to reach it. By mining pinpoint the specific information that they are looking for. site access logs, we can draw conclusions of the form “users A well-designed website limits the impact of this tradeoff, so who load page p are likely to later load page q.” If there is that even if the amount of information is increased signifi- no direct link from p to q, then it would be advantageous cantly, locating that information becomes only marginally to provide one. The process of providing links to users’ more difficult. Typically, site designers ease information eventual goals while skipping over the in-between pages is overload by organizing the site content into a hierarchy of called shortcutting. Existing algorithms for shortcutting re- topics, and then providing a navigational tree that allows quire substantial offline training, which make them unable visitors to descend into the hierarchy and find the informa- to adapt when access patterns change between training ses- tion they are looking for. In their paper on adaptive web- sions. We present improved online algorithms for shortcut site design [12], Perkowitz and Etzioni describe these static, link selection that are based on a novel analogy drawn be- master-designed websites as “fossils cast in HTML.” They tween shortcutting and caching. In the same way that cache claim that a site designer’s a priori expectations for how a algorithms predict which memory pages will be accessed in site will be used and navigated are likely to inaccurately re- the future, our algorithms predict which web pages will be flect actual usage patterns, especially as the site adds new accessed in the future. Our algorithms are very efficient and content over time. As it is infeasible for even the most ded- are able to consider accesses over a long period of time, but icated site designer to understand the goals and access pat- give extra weight to recent accesses. Our experiments show terns of all site visitors, Perkowitz and Etzioni proposed significant improvement in the utility of shortcut links se- building websites that mine their own access logs in order lected by our algorithm as compared to those selected by to automatically determine helpful self-modifications. existing algorithms. One example of a helpful modification is shortcutting, in which links are added between unlinked pages in order to al- Categories and Subject Descriptors low visitors to reach their intended destinations with fewer clicks. Typically a limit N is imposed on the maximum I.5.2 [Pattern Recognition]: Design Methodology; H.2.8 number of outgoing shortcuts on any one particular page. [Database Management]: Database Applications—Data The shortcutting problem can then be thought of as an op- Mining timization problem to choose the N shortcuts per page that minimize the number of clicks needed for future visitors to General Terms reach their goal pages. These shortcuts may be modified Algorithms, Experimentation at any time based on past accesses in order to account for anticipated changes in the access patterns of future visi- tors. Finding an optimal solution to this problem would require an exact knowledge of the future and a precise way Permission to make digital or hard copies of all or part of this work for of determining each user’s goal. However, shortcutting al- personal or classroom use is granted without fee provided that copies are gorithms must provide shortcuts in an on-line framework, not made or distributed for profit or commercial advantage and that copies so the shortcuts must be chosen without knowledge of fu- bear this notice and the full citation on the first page. To copy otherwise, to ture accesses. Rather than solving the optimization problem republish, to post on servers or to redistribute to lists, requires prior specific exactly, shortcutting algorithms use heuristics and analyze permission and/or a fee. WEBKDD’06, August 20, 2006, Philadelphia, Pennsylvania, USA. past accesses in order to provide good shortcuts. Copyright 2006 ACM 1-59593-444-8 ...$5.00. In this paper, we draw a novel analogy between shortcut- ting algorithms, which maintain an active set of shortcuts on tion session. Eirinaki and Vazirgiannis [6] give a survey of each page, and caching algorithms, which maintain an ac- the use of web mining for personalization. tive set of items in cache. The goal of caching algorithms— Operating at a level between global adaptations and in- maximizing the fraction of future accesses for items in the dividual adaptations are group adaptations. The mixture- active set—is analogous to the goal of shortcutting algo- model variants of the MinPath algorithm [1] are examples rithms. The main contribution of this paper is the Cache- of group-based shortcutting algorithms. When suggesting Cut algorithm for shortcutting. By using replacement poli- shortcuts to a website visitor, they first classify that visi- cies developed for caching applications, CacheCut is able tor based on browsing behavior, and then provide shortcuts to run with less memory than other shortcutting algorithms, that are thought to be useful to that class of visitors. Classi- while producing better results. A second contribution is the fying users requires examining the “trails” or “clickstreams” FrontCache algorithm, which uses the same caching tech- in the access log, which are the sequences of pages accessed niques in order to select pages to promote on the front page by individual visitors. Other researchers have investigated with direct links. trails without the intention of adapting a website. Banerjee The remainder of this paper is organized as follows. In and Ghosh [3] use trails to cluster users. Cooley et al. [5] Section 2 we discuss related work in adaptive website prob- discover association rules to find correlations such as “60% lems. In Section 3 we give definitions for terms that are used of clients who accessed page A also accessed page B.” Yang throughout the remainder of the paper. Section 4 gives a et al. [15] conduct temporal event prediction, in which they formulation of the shortcutting problem and presents two also estimate when the client is likely to access B. shortcutting algorithms from existing literature. In Sec- Our work follows the global model of shortcutting seen tion 5 we detail our CacheCut algorithm for shortcutting, in [10], in which shortcutting is viewed as a global adapta- and also describe the FrontCache algorithm for promot- tion that adds links to each page that are the same for every ing pages with links on the front page. Section 6 describes visitor. Like Perkowitz’ algorithm, when choosing shortcuts our experimental setup and the results of our experiments. for a page p we pay close attention to the number of times Finally, in Section 7, we offer some concluding thoughts and other pages q are accessed after p within a trail; however, suggest directions for future work. our algorithm provides improvements in the form of reduced memory requirements and higher-quality shortcuts. A re- 2. RELATED WORK lated work by Yang and Zhang [16] sought to create an im- proved replacement policy for website caching by analyzing Perkowitz and Etzioni [11] issued the original challenge the access log. In contrast, our work uses existing caching to the AI community to build adaptive web sites that policies to create an improved website. learn visitor access patterns from the access log in order to automatically improve their organization and presenta- tion. Their follow-up paper [12] presented several global 3. DEFINITIONS adaptations that affect the presentation of the website to Before describing CacheCut and other shortcutting al- all users. One adaptation from their paper is “index page gorithms from the literature, we provide some definitions synthesis,” in which new pages are created containing col- that we will use throughout the paper. lections of links to related but currently unlinked pages. In Site Graph. The site graph of a website with n unique his thesis [10], Perkowitz presents the shortcutting problem pages is a directed n-node graph G = (V, E) where epq ∈ E as a global adaptive problem, in which links are added if and only if there is a link from page p to page q. to each page to ease the browsing experience of all site Shortcut. A shortcut is a directed connection between visitors. Ramakrishnan et al. [13] have also done work in web pages p and q that were not linked in the original site global adaptation; they observe that frustrated users who graph, i.e., epq 6∈ E. cannot find the content they are looking for are apt to Shortcut Set.