Summarizing Personal Web Browsing Sessions

Summarizing Personal Web Browsing Sessions Mira Dontcheva1 Steven M. Drucker3 Geraldine Wade4 David Salesin1,2 Michael F. Cohen3 1Computer Science & Engineering 2Adobe Systems 3Microsoft Research, 4Microsoft University of Washington 801 N. 34th Street One Microsoft Way Seattle, WA 98105-4615 Seattle, WA 98103 Redmond, WA 98052-6399 {mirad, salesin}@cs.washington.edu [email protected] {sdrucker, gwade, mcohen}@microsoft.com ABSTRACT ing sessions typically last a long time, span several sessions, We describe a system, implemented as a browser extension, involve gathering large amounts of heterogeneous content, that enables users to quickly and easily collect, view, and and can be difficult to organize ahead of time as the cate- share personal Web content. Our system employs a novel in- gories emerge through the tasks themselves. Current meth- teraction model, which allows a user to specify webpage ex- ods for collecting and organizing Web content include sav- traction patterns by interactively selecting webpage elements ing bookmarks or tabs, collecting content in documents, or and applying these patterns to automatically collect similar storing pages locally. While useful, these methods require a content. Further, we present a technique for creating visual great deal of overhead as pages must be saved manually and summaries of the collected information by combining user organized into folders, which distracts from the real task of labeling with predefined layout templates. These summaries analyzing the content and making decisions. are interactive in nature: depending on the behaviors encoded in their templates, they may respond to mouse events, in ad- In this paper, we present a new approach to collecting and dition to providing a visual summary. Finally, the summaries organizing Web information. Our approach is motivated by can be saved or sent to others to continue the research at an- Gibson et al.’s analysis [8] of the amount of template content other place or time. Informal evaluation shows that our ap- on the Web. Their study shows that 40-50% of the content on proach works well for popular websites, and that users can the Web is template content and that template material has quickly learn this interaction model for collecting content grown and is continuing to grow at a rate of approximately from the Web. 6% per year. Studies of users’ habits for collecting Web content [19] further show that people are often interested in col- ACM Classification H5.2 [Information interfaces and pre- lecting not only full webpages but also pieces of webpages. sentation]: User Interfaces. - Graphical user interfaces. These findings motivated us to design an interface that lever- ages the growing amount of templated information on the General Terms Design, Human Factors, Algorithms Web and allows users to gather only the pieces of content they find relevant to their task. Keywords: Information management, webpage extraction patterns, template-based summarization Our interface includes two parts: user specification of rele- INTRODUCTION vant content, and automatic collection of similar content. The The proliferation of Web content and the advances in search user specifies relevant content by interactively selecting web- technology have fundamentally changed our lives. Today, to page elements and labeling each element with a keyword, plan a vacation, you no longer go to a travel agent; when chosen from a predefined set. The selected content is stored making purchases, you no longer go to your neighborhood locally and used to create an extraction pattern that can be specialty store; and when learning about a new topic, you no applied to collect more content from similar pages. Extrac- longer go to the library. Instead, you search the Web. While tion patterns specify the locations of selected elements using finding information is becoming easier and easier, collecting the structure of the webpage and, optionally, text that must be and organizing Web research and coming back to it remains matched in any new pages. When the user visits a new page, difficult, especially when the research is exploratory. Sharing he can automatically collect all of the content that is analo- the results of the browsing session with others can be even gous to the content previously selected as relevant. Further, more difficult. As found by Sellen et al. [20], research brows- we provide an interface for collecting content from multiple pages simultaneously by allowing the user to select hyperlinks that point to pages of interest. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are To organize and present the gathered content, we employ lay- not made or distributed for profit or commercial advantage and that copies out templates that create visual summaries. A layout template bear this notice and the full citation on the first page. To copy otherwise, to specifies how the content should be organized and formatted republish, to post on servers or to redistribute to lists, requires prior specific and defines a set of user interactions available within a sum- permission and/or a fee. UIST’06, October 15–18, 2006, Montreux, Switzerland.. mary. The templates filter and present the content according Copyright 2006 ACM 1-59593-313-1/06/0010 ...$5.00. to the labeling specified by the user during content selection. To demonstrate possible content presentations, we have im- some cases, we override the browser’s layout engine, using plemented several layout templates, including a map, calen- scripts, to create more aesthetic results. dar, and grid. Our system is not limited to these designs and can incorporate additional layout templates. Bibliographic reference tools, such as RefWorks [16] and EndNote [6], are specifically designed for collecting and Our work has two key contributions: a novel interaction managing academic references. These systems have special- mechanism for collecting Web content semi-automatically, ized filters, much like our extraction patterns, for electronic and a presentation technique for the gathered content that references websites and can automatically import all the data uses pre-defined layout templates to construct summaries for a particular reference into a database. Our system com- that are interactive and can be shared with others. We demon- plements such systems by giving the user an interface for strate our approach on three types of tasks: comparison shop- selecting personally relevant bibliographic information. Fur- ping, travel planning, and researching academic papers. How- thermore, our approach is more general and is applicable to ever, our approach can be used for any type of Web research other Web content. and can be applied to other domains. User feedback shows USER INTERFACE that people can quickly learn to use our system and find many We designed the user interface with the goal of making the tasks for which it is useful. process of collecting information as unobtrusive as possible and allowing the user to focus on the task at hand rather RELATED WORK than worry about organizing and keeping track of content. Our work has its roots in the area of information workspaces Our system is implemented as an extension to the Firefox and systems like WebBook [5], Data Mountain [17], and browser and is presented to the user through a toolbar (see TopicShop [1], which present different mechanisms for pre- Figure 1). The toolbar includes four buttons and a check- senting and organizing collections of webpages. Our system box. The “Start” button opens the “summary” window that is most closely related to Hunter Gatherer [19] and Internet displays a visual summary of the content gathered thus far. Scrapbook [22], which provide the ability to save webpage The “Select” button initiates the selection mode and enables elements and place them together in a document. Like these the user to select and label pieces of Web content. The “Add two systems, we provide an interface that allows the user to Page” button allows the user to automatically add the content select pieces of webpages; however, we go further by mak- found by an extraction pattern to the summary. This button ing use of the user selection to create extraction patterns that is enabled only when there is a matching pattern that can can then be used to gather content automatically. We also be used to extract content automatically. The “Add Linked provide a mechanism for organizing the gathered content in Page(s)” button initiates a different selection mode in which richer ways, such as on a map or calendar, and for different the user can select any number of hyperlinks to add the con- purposes, such as for a PDA or printer. tent from the linked pages to the summary directly and simultaneously. A checkbox specifies whether to visually outline Our use of user-specified labels for relating content from dif- the elements found by an extraction pattern on a new web- ferent sources is similar to approaches in Semantic Web ap- page. If it is checked, the elements found are outlined in pur- plications, such as Piggy Bank [11] and Thresher [10]. Our ple. The summary window (shown in Figure 2) has buttons work differs from these in two ways. First, we create visual for opening and saving summaries and a menu for changing summaries of the gathered content, while Piggy Bank and the layout template used to compose the summary. Thresher store the collected data in a database. Second, our interface for labeling the content is more accessible to the Sample User Scenario average Web user, because the user must simply right-click To help explain the interface, we describe an example user and select a keyword. Piggy Bank requires that the user write scenario and show the steps taken by the user to create a scripts, and Thresher expects that the user has some knowl- summary.

Load more