CBR in the Pipeline
From: AAAI Technical Report WS-98-15. Compilation copyright © 1998, AAAI (www.aaai.org). All rights reserved.

Marc Goodman
Continuum Software, Inc.
800 West Cummings Park, Suite 4950
Woburn, Mass. 01801
[email protected]

Abstract

In a variety of reasoning tasks, even ones for which CBR seems ideally suited, a stand-alone CBR component may not prove adequate. First, the data available in system construction may be too raw or noisy for direct processing and may require sophisticated reasoning before it is in a form suitable for CBR. Second, capacity demands and other run-time constraints may prohibit a straight CBR module from being deployed. This paper describes a pipelined architecture where one or more reasoning steps are used to preprocess data into a form suitable for use in CBR, and CBR is used as a synthesis component for the creation of a stand-alone, run-time database.

Introduction

The SideClick link referral system [Goodman 1998] is a web-based service for resource exploration. Given a URL (most often a link to a particular web page of interest), SideClick can provide a list of related URLs organized by topic as well as a list of related topics. Or, given a topic of interest, SideClick can provide a list of URLs related to that topic as well as other related topics. For example, given a URL for "The Dilbert Zone" [Adams 1998], SideClick returns links for "Over the Hedge" [Fry and Lewis 1998], "Rose is Rose" [Brady 1998], "Peanuts" [Schulz 1998], the United Media comics page [United Media 1998], "Doonesbury" [Trudeau 1998], etc. and the related topics, "Entertainment" and "Comics and Humor." Clicking on the "Entertainment" topic returns links from baseball, movies, music, magazines, etc. and over 50 related topics from Art to UFOs. By following links and topics of interest, the user is free to discover new, interesting web resources in a serendipitous fashion.

SideClick models the way users of the web link together and organize information as embodied in bookmarks files and other on-line links pages. The core observation in the system is that people who create links pages tend to group links in sections, organized by content, with other similar links. Hence, a web page can be viewed as a case, composed of a number of snippets [Redmond 1992; Kolodner 1993] or microcases [Zito-Wolf and Alterman 1994]. Each snippet contains a group of links, a content header, a pointer to a parent snippet, and a set of pointers to child snippets. For example, a particular links page might contain a snippet consisting of links to peripheral manufacturers. Its header might be something like the text string "Peripherals". It might appear on the page as a subsection under a supersection called "Computer Hardware," and it might have child sections such as "Modems," "Printers," etc. Each of the child sections and the parent section would also be represented by snippets.

The process of recommending links, conceptually, consists of taking a particular link, retrieving all of the snippets that contain this link, synthesizing the snippets into a representative snippet, and displaying this snippet to the user. The process of listing the links that occur under a particular topic consists of retrieving all of the snippets that were indexed under an appropriate section header, synthesizing the snippets into a representative snippet, and displaying this snippet to the user. Stated more intuitively, the system is saying something like "Given that the user is interested in a particular link, other web users who have been interested in this link have tended to organize it with these other links, under these topics. Therefore, the user should find these links and topics interesting as well."

Harder than it Sounds

Unfortunately, several factors conspire to make this simple conceptual framework for link recommendation insufficient. First, data on web pages is extremely noisy. This is hardly surprising given that most of these documents are generated by hand, and many of them are generated by people who have only passing familiarity with computers and computer programming. What is surprising is the sheer variety of types of noise in web pages. Some common types of noise include:

• Markup Noise: Web pages reflect their organizational structure primarily via the author's choice of markup, or the way the layout of the page is expressed in terms of markup tags. One author, for example, might choose to create sections using delimited lists of links, with various levels of headers used to label sections and the relative size of those headers intended to convey scoping information. Another author might present the same information in tabular form, with the headers relegated to a column along the side of the page and the links contained in separate cells in a different column. A third author might display information within free-form descriptive paragraphs that contain embedded links, separated from other sections by horizontal rules. The number of distinct markup styles approaches the number of web authors. This source of noise is further compounded by the majority of authors who use markup tags incorrectly, invent their own markup tags (which browsers simply ignore), and even introduce syntax errors within the tags themselves. Reducing the amount of markup noise is crucial for placing links correctly within snippets as well as understanding the relationships between snippets within a case.

• URL Noise: It is unfortunate that URL stands for "Uniform Resource Locator," not "Unique Resource Locator." In fact, there are usually several distinct ways of referring to any particular web document. For example, the Netscape Home Page can be found at any of the following URLs: http://www.netscape.com/, http://home.mcom.com/, http://mcom.com/index.html, http://www.home.netscape.com/home/, and several others. Of the 4.5 million distinct URLs referred to by documents within the SideClick casebase, over 500,000 of these URLs are redundant. Successfully canonicalizing URLs prevents the system from referring the user to the same web resource via multiple URLs, as well as increasing the number and usefulness of snippets indexed under those URLs.

• Section Heading Noise: As described above, markup noise can make it difficult to identify the piece of text (if any) that identifies the topic of a snippet. However, even if that piece of text is successfully located, different people tend to label the same content differently. For example, the section headings, "Search," "Search Tools," "Suchmachinen," "Suchdienst," "Metasearch," "Keyword Search," "Search Forms," "Moteurs De Recherche," and "Search Engines" all refer to the same topic. Successfully canonicalizing section headings prevents the system from referring the user to multiple versions of the same topic with different names, as well as increasing the number and usefulness of snippets indexed under those section headings. A related but unsolved problem is ambiguity in section headings. For example, some people label links about stock quotations under "Quotations," while other people label links about [...]

[...] section/subsection relationships between snippets often do not correspond to taxonomic or partonomic relationships. For example, one web page might place "Scotland" under "Food." Perhaps the author intends the section to be "Scottish Foods." Another author will place "Recipes" under "Scotland," meaning "Scottish Recipes." A third author will place "Recipes" under "Food," and a fourth author will place "Chicken" under "Recipes." Extracting a meaningful taxonomy of topics from the raw data is currently an unsolved problem.

• Cobwebs: It is a big exaggeration to say that half the web is "under construction," and the other half is missing, relocated, or hopelessly out of date. In actual fact, only 18% of the URLs cited in pages on the web refer to documents that no longer exist, serve only to redirect the user to new locations, or live on servers that aren't reachable or fail DNS (based on a sampling of over one million commonly cited web documents). The fewer such "cobwebs" that are contained within a service, the more useful that service becomes.

Another factor that makes creating a link referral service difficult is the sheer size of the web. According to Search Engine Watch [Search Engine Watch 1997], AltaVista [AltaVista 1998] had indexed over 100 million web pages in April of 1997, and their Chief Technical Officer, Louis Monier, estimated that there were as many as 150 million distinct pages on the web. Even a small subset of the web will contain millions of documents with tens of millions of snippets. Retrieving and synthesizing these snippets can be very computationally expensive.

Finally, a successful web service is, by definition, a high-volume web service. The most popular web sites generate millions of page views per day. A scant million hits a day adds up to over 11 hits per second, and peak access times can easily reach two or three times as many hits per second as the average. At 33 hits per second, 30 msecs per query is about enough time to do three disk seeks. There isn't a lot of time for complicated run-time analysis.

CBR in the Pipeline

The solution we have developed to the above problems is to divide the system into a run-time component that does fast lookup on a pre-built database (or knowledge base), and a development component that builds the database. The development component is further broken down into several distinct processing steps, featuring one or more distinct