From: AAAI Technical Report WS-98-15. Compilation copyright © 1998, AAAI (www.aaai.org). All rights reserved.

CBR in the Pipeline

Marc Goodman

Continuum Software, Inc., 800 West Cummings Park, Suite 4950, Woburn, Mass. 01801
[email protected]

Abstract

In a variety of reasoning tasks, even ones for which CBR seems ideally suited, a stand-alone CBR component may not prove adequate. First, the data available in system construction may be too raw or noisy for direct processing and may require sophisticated reasoning before it is in a form suitable for CBR. Second, capacity demands and other run-time constraints may prohibit a straight CBR module from being deployed. This paper describes a pipelined architecture where one or more reasoning steps are used to preprocess data into a form suitable for use in CBR, and CBR is used as a synthesis component for the creation of a stand-alone, run-time database.

Introduction

The SideClick link referral system [Goodman 1998] is a web-based service for resource exploration. Given a URL (most often a link to a particular web page of interest), SideClick can provide a list of related URLs organized by topic as well as a list of related topics. Or, given a topic of interest, SideClick can provide a list of URLs related to that topic as well as other related topics. For example, given a URL for "The Dilbert Zone" [Adams 1998], SideClick returns links for "Over the Hedge" [Fry and Lewis 1998], "Rose is Rose" [Brady 1998], "Peanuts" [Schulz 1998], the comics page [United Media 1998], "Doonesbury" [Trudeau 1998], etc., and the related topics "Entertainment" and "Comics and Humor." Clicking on the "Entertainment" topic returns links from baseball, movies, music, magazines, etc., and over 50 related topics from Art to UFOs. By following links and topics of interest, the user is free to discover new, interesting web resources in a serendipitous fashion.

SideClick models the way users of the web link together and organize information, as embodied in bookmarks files and other on-line links pages. The core observation in the system is that people who create links pages tend to group links in sections, organized by content, with other similar links. Hence, a web page can be viewed as a case, composed of a number of snippets [Redmond 1992; Kolodner 1993] or microcases [Zito-Wolf and Alterman 1993]. Each snippet contains a group of links, a content header, a pointer to a parent snippet, and a set of pointers to child snippets. For example, a particular links page might contain a snippet consisting of links to peripheral manufacturers. Its header might be something like the text string "Peripherals". It might appear on the page as a subsection under a supersection called "Computer Hardware," and it might have child sections such as "Modems," "Printers," etc. Each of the child sections and the parent section would also be represented by snippets.

The process of recommending links, conceptually, consists of taking a particular link, retrieving all of the snippets that contain this link, synthesizing the snippets into a representative snippet, and displaying this snippet to the user. The process of listing the links that occur under a particular topic consists of retrieving all of the snippets that were indexed under an appropriate section header, synthesizing the snippets into a representative snippet, and displaying this snippet to the user. Stated more intuitively, the system is saying something like "Given that the user is interested in a particular link, other web users who have been interested in this link have tended to organize it with these other links, under these topics. Therefore, the user should find these links and topics interesting as well."

Harder than it Sounds

Unfortunately, several factors conspire to make this simple conceptual framework for link recommendation insufficient. First, data on web pages is extremely noisy. This is hardly surprising given that most of these documents are generated by hand, and many of them are generated by people who have only passing familiarity with computers and computer programming. What is surprising is the sheer variety of types of noise in web pages. Some common types of noise include:

• Markup Noise: Web pages reflect their organizational structure primarily via the author's choice of markup, or the way the layout of the page is expressed in terms of markup tags. One author, for example, might choose to create sections using delimited lists of links, with various levels of headers used to label sections and the relative size of those headers intended to convey scoping information. Another author might present the same information in tabular form, with the headers relegated to a column along the side of the page and the links contained in separate cells in a different column. A third author might display information within free-form descriptive paragraphs that contain embedded links, separated from other sections by horizontal rules. The number of distinct markup styles approaches the number of web authors. This source of noise is further compounded by the majority of authors who use markup tags incorrectly, invent their own markup tags (which browsers simply ignore), and even introduce syntax errors within the tags themselves. Reducing the amount of markup noise is crucial for placing links correctly within snippets as well as for understanding the relationships between snippets within a case.

• URL Noise: It is unfortunate that URL stands for "Uniform Resource Locator," not "Unique Resource Locator." In fact, there are usually several distinct ways of referring to any particular web document. For example, the Netscape Home Page can be found at any of the following URLs: http://www.netscape.com/, http://home.mcom.com/, http://mcom.com/index.html, http://www.home.netscape.com/home/, and several others. Of the 4.5 million distinct URLs referred to by documents within the SideClick case base, over 500,000 are redundant. Successfully canonicalizing URLs prevents the system from referring the user to the same web resource via multiple URLs, as well as increasing the number and usefulness of snippets indexed under those URLs.

• Section Heading Noise: As described above, markup noise can make it difficult to identify the piece of text (if any) that identifies the topic of a snippet. However, even if that piece of text is successfully located, different people tend to label the same content differently. For example, the section headings "Search," "Search Tools," "Suchmaschinen," "Suchdienst," "Metasearch," "Keyword Search," "Search Forms," "Moteurs De Recherche," and "Search Engines" all refer to the same topic. Successfully canonicalizing section headings prevents the system from referring the user to multiple versions of the same topic with different names, as well as increasing the number and usefulness of snippets indexed under those section headings. A related but unsolved problem is ambiguity in section headings. For example, some people label links about stock quotations "Quotations," while other people label links about quotes from famous people "Quotations." Or, some people might place stock chart links under "Charts," while other people might place music charts under "Charts." The result of this ambiguity is that the system currently contains some "interesting" mixed topics.

• Taxonomic Noise: Those of us who have experienced the joys of knowledge representation first-hand will not be surprised to learn that what look like section/subsection relationships between snippets often do not correspond to taxonomic or partonomic relationships. For example, one web page might place "Scotland" under "Food." Perhaps the author intends the section to be "Scottish Foods." Another author will place "Recipes" under "Scotland," meaning "Scottish Recipes." A third author will place "Recipes" under "Food," and a fourth author will place "Chicken" under "Recipes." Extracting a meaningful taxonomy of topics from the raw data is currently an unsolved problem.

• Cobwebs: It is a big exaggeration to say that half the web is "under construction," and the other half is missing, relocated, or hopelessly out of date. In actual fact, only 18% of the URLs cited in pages on the web refer to documents that no longer exist, serve only to redirect the user to new locations, or live on servers that aren't reachable or fail DNS (based on a sampling of over one million commonly cited web documents). The fewer such "cobwebs" that are contained within a service, the more useful that service becomes.

Another factor that makes creating a link referral service difficult is the sheer size of the web. According to Search Engine Watch [Search Engine Watch 1997], AltaVista [AltaVista 1997] had indexed over 100 million web pages in April of 1997, and their Chief Technical Officer, Louis Monier, estimated that there were as many as 150 million distinct pages on the web. Even a small subset of the web will contain millions of documents with tens of millions of snippets. Retrieving and synthesizing these snippets can be very computationally expensive.

Finally, a successful web service is, by definition, a high-volume web service. The most popular web sites generate millions of page views per day. A scant million hits a day adds up to over 11 hits per second, and peak access times can easily reach two or three times as many hits per second as the average. At 33 hits per second, 30 msecs per query is about enough time to do three disk seeks. There isn't a lot of time for complicated run-time analysis.

CBR in the Pipeline

The solution we have developed to the above problems is to divide the system into a run-time component that does fast lookup on a pre-built database (or knowledge base), and a development component that builds the database. The development component is further broken down into several distinct processing steps, featuring one or more distinct forms of reasoning/analysis at each step. These processing steps can be loosely grouped into 1) fetching the data, 2) preprocessing the raw data, 3) using CBR to synthesize the run-time database, and 4) accessing the run-time database.
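The four loosely grouped processing steps can be illustrated with a small sketch. Everything below is hypothetical stand-in code (a toy fetcher, parser, and synthesis step), not the SideClick implementation; it only shows how the development-time stages feed a database that the run-time component answers from by pure lookup.

```python
# Hypothetical sketch of the four-stage pipeline; every function below is a
# toy stand-in for the real component, not SideClick's implementation.

def fetch_page(url):
    # 1) Fetching the data: a real spider would issue HTTP requests here.
    return {"url": url,
            "header": "Comics and Humor",
            "links": ["http://a/index.html", "http://b/"]}

def canonicalize_urls(page):
    # 2a) Preprocessing: collapse redundant URL forms, e.g. a learned rule
    #     mapping "http://*/index.html" onto "http://*/".
    page["links"] = [u[: -len("index.html")] if u.endswith("/index.html")
                     else u
                     for u in page["links"]]
    return page

def parse_into_snippets(page):
    # 2b) Preprocessing: treat the whole page as a single snippet here;
    #     the real parser recovers a tree of scoped snippets.
    return [{"header": page["header"], "links": page["links"]}]

def synthesize_database(snippets):
    # 3) CBR synthesis: precompute, for every URL, the other links that
    #    co-occur with it in some snippet.
    db = {}
    for snippet in snippets:
        for link in snippet["links"]:
            others = [u for u in snippet["links"] if u != link]
            db.setdefault(link, set()).update(others)
    return db

def handle_query(db, url):
    # 4) Accessing the run-time database: a single fast lookup, no reasoning.
    return db.get(url, set())

database = synthesize_database(
    [snippet
     for seed in ["http://seed/"]
     for snippet in parse_into_snippets(canonicalize_urls(fetch_page(seed)))])
```

The point of this shape is that all of the expensive reasoning happens before deployment; the run-time component never touches case memory.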

Fetching the Data

The system has been bootstrapped to the point where the analysis of a body of existing documents later in the pipeline has produced a list of canonical URLs to fetch. The actual mechanics of fetching the corresponding web pages are straightforward, and well documented elsewhere (see, for example, SideClick search results for HTTP and RFC [SideClick 1998]).

Preprocessing the Data

Preprocessing the data consists of several reasoning steps. These steps include 1) learning a set of filtering rules for URL canonicalization, 2) parsing web pages into cases composed of snippets, and 3) canonicalizing section headers into SideClick topics.

Learning URL Filtering Rules. URL filtering rules are a set of regular expression patterns that map URLs into corresponding URLs that refer to the same document. For example, a filtering rule might specify that if a URL is of the form "http://*/index.html" and there is another URL that is of the form "http://*/" and the two URLs differ only in that one contains the "index.html" at the end and the other doesn't, then the two URLs probably refer to the same document. Another rule might specify that "www." in the host name of a URL can usually be stripped out if there is another known URL that differs only in that part of the host name.

Such rules are learned in a two-step process. First, an index of page similarity is created for all of the pair-wise combinations of documents in the set of web pages. Note that determining whether two documents are the same is, itself, a difficult problem. On the one hand, many documents are script generated and differ in the inclusion of banner ads, dates and times, number of page views, etc., even on subsequent fetches of the same document. Such documents will appear to differ, incorrectly, unless suitable fuzzy matching techniques are used with appropriate similarity thresholds. Similarly, pages change over time. Since the spider (the component that fetches the web pages) might take several days to fetch the millions of pages that comprise the set, it is quite possible that some pages will have changed between subsequent fetches. Hence, determining whether two pages are distinct often requires modification based on the time those pages were fetched. On the other hand, many documents from the same site are identical with respect to navigation content, layout, headers, and footers and differ only a small amount in the actual content of the web page. Such pages will appear to be similar if matching thresholds are set too low.

After the index of similarity is generated, a heuristic pattern learning algorithm is applied to generate the filtering rules. For a particular pair of similar pages, the algorithm creates a set of regular expressions of varying generality that describe how one URL can be mapped to another. These candidate rules are scored by applying them to the entire body of URLs, and counts are kept of the number of times a URL is incorrectly mapped into a differing URL, the number of times a URL is correctly mapped into a differing URL, and the number of times a URL is mapped into a URL that appears to differ, but might be the result of a document changing over time. These values are combined heuristically, and the most successful candidate rule is chosen (success is based on the most general rule that doesn't introduce too many false mappings). The process repeats until all of the URL matches have been accounted for.

Parsing Web Pages into Cases and Snippets. Some organizational and scoping information for a web page is explicit in the (possibly broken) markup for that web page. For example, a delimited list within a delimited list represents that one snippet is a child of another snippet, and the scope of each snippet is defined by the scope of the delimited list. Other organizational information is implicit in the markup. For example, a sequence of markup tags and strings of the form

  <b>string</b><br> <a>string</a><br> <a>string</a><p> <b>string</b><br> <a>string</a><br> <a>string</a><p>

implicitly defines two groups of anchors, and could be represented by the fuzzy regular expression

  (<b>string</b><br> (<a>string</a><br>)* <p>)*

where the first string in each occurrence of the regular expression probably denotes the section heading (the expression is fuzzy because it allows the last "<a>string</a><p>" of each subsequence to match the subexpression "<a>string</a><br>").

Parsing a web page, therefore, consists of two steps. First, a fault-tolerant HTML grammar is used to organize the tags and strings in the web page into a set of scoped subexpressions. Next, for each sequence of tokens and strings within a subexpression, a pattern detector reduces the sequence of tokens into a set of scoped subsequences based on increasingly complex regular expressions. The result of this analysis is a set of fully scoped tokens. "Interesting" scopes are detected and output as scoped snippets, and likely section headers for each snippet are identified and output.

Canonicalizing Section Headers. As previously mentioned, the raw organizational information present in web pages is not sufficient to generate an accurate taxonomy of topics. As such, we have knowledge engineered a taxonomy of over 3000 topics, by hand, with much suffering and loss of life. The maintenance and extension of this taxonomy is an ongoing process and consumes the bulk of the human labor in the system.

Mapping the section headers extracted during the previous processing stage consists of applying a large number of phrase canonicalization rules (which were constructed and are maintained by hand) to each section header, and performing a statistical analysis of how well the resulting section header matches each of the known topics. This analysis is based on morphological analysis of the words in the section header and topic, the number of matching words in the section header, the frequency of occurrence of these matching words in the set of documents as a whole, and the total length of the section header.
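As a rough illustration of that statistical matching step, the sketch below scores a header against candidate topics. A crude suffix-stripping "stemmer" stands in for real morphological analysis, and a toy document-frequency table stands in for corpus statistics; all names, data, and weights here are invented, not SideClick's actual rules.

```python
import math

# Hypothetical sketch of matching an extracted section header against the
# known topics. A naive suffix-stripping "stemmer" stands in for real
# morphological analysis, and doc_freq is a toy table of stemmed-word
# frequencies across the document set.

def stem(word):
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def score(header, topic, doc_freq, n_docs):
    header_words = [stem(w.lower()) for w in header.split()]
    topic_words = {stem(w.lower()) for w in topic.split()}
    matches = [w for w in header_words if w in topic_words]
    if not matches:
        return 0.0
    # Rare matching words count for more than common ones, and dividing by
    # the header length penalizes long headers whose topic words are
    # incidental -- mirroring the factors listed above.
    weight = sum(math.log(n_docs / (1 + doc_freq.get(w, 0))) for w in matches)
    return weight / len(header_words)

doc_freq = {"search": 5000, "engin": 800, "recip": 300}   # stemmed forms
topics = ["Search Engines", "Recipes"]
best = max(topics, key=lambda t: score("Keyword Search Engines", t,
                                       doc_freq, 1_000_000))
```

Headers scoring below a threshold against every topic would be rejected for later review by a knowledge engineer, as described above.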

Section headers that match topics above a certain threshold are canonicalized into the corresponding SideClick topics. The remaining section headers are rejected, and a knowledge engineer periodically reviews frequently occurring rejected headers for possible inclusion as new topics within SideClick.

The result of these preprocessing steps is a set of relatively clean and well-organized snippets and cases, which are fed into the CBR component.

Synthesizing the Database

Primary functions supported by the run-time system include:

• Links Related to Links: Given a URL, retrieve all of the snippets containing that URL. Synthesize these snippets into a new snippet, as follows: 1) count the number of snippets each URL appears in, 2) compare this count to the base probability that the URL will appear in a random collection of snippets, 3) if the URL occurs significantly more frequently than random chance, include the URL in the synthesized snippet.

• Topics Related to Links: Given a URL, retrieve all of the snippets containing that URL. Synthesize these snippets into a new snippet, as follows: 1) count the number of snippets under each topic, 2) compare this count to the base probability that a randomly selected snippet will appear under each topic, 3) if the topic occurs significantly more frequently than random chance, include the topic in the synthesized snippet.

• Links Related to Topics: Given a topic, retrieve all of the snippets under that topic. Synthesize these snippets into a new snippet, as follows: 1) count the number of snippets each URL appears in, 2) compare this count to the base probability that the URL will appear in a random collection of snippets, 3) if the URL occurs significantly more frequently than random chance, include the URL in the synthesized snippet.

• Topics Related to Topics: Consult the knowledge-engineered taxonomy for related topics.

Constructing a run-time database consists of iterating through all of the known URLs and topics, generating lists of the most closely related URLs and topics along with the strength of each relationship, as described above, and saving these results into a database.

There is no theoretical reason why these functions couldn't be supported by a run-time CBR module. However, there are three practical reasons for using the CBR module to build an optimized run-time database and to respond to most queries using database lookup. The first reason is, of course, speed. Popular URLs, such as Yahoo [Yahoo 1998], occur in tens of thousands of snippets within the case base. Each snippet may, in turn, contain references to tens or hundreds of links. Synthesizing all of these snippets can take orders of magnitude longer than the maximum time allowed for responding to a query.

The second reason for having a run-time system distinct from the CBR module is code complexity. The CBR module requires code for loading cases, organizing case memory, retrieving snippets, and synthesizing these snippets. Also, the internal data structures used to represent and index case memory are somewhat elaborate. It is a simple fact that a live system on the world-wide web is not allowed to crash (sometimes they do anyway, which is one of the reasons why large web services run two or three times as many servers in their server farms as they really need to handle capacity). The CBR module weighs in with six times as many lines of code as the run-time system. It is safe to assume that the run-time system is easier to modify and maintain.

Finally, the run-time database is actually smaller than the original case base. Instead of keeping around information about every link that appears in every snippet in every case in the case base, the run-time system only needs to know the relative strength of the relationship between a particular URL and its most closely related topics and URLs. In fact, the run-time database is small enough to fit within a gigabyte of RAM, and dual 200MHz Pentium Pro servers with one gigabyte of RAM can be purchased for around $6000 (as of April, 1998). Avoiding any disk lookup whatsoever drastically increases the speed of the run-time system.

Using the Database

As described above, the run-time system consists of a large, precomputed database and a simple lookup mechanism. This run-time system is implemented as a TCP-based server that responds to requests from a set of front-ends. Each front-end is a web server that is responsible for processing web page requests, querying the back-end run-time system for link and topic referral information, and generating suitable HTML web pages. The back-end is capable of handling over 30 requests per second, and most of this time is spent in TCP socket setup and teardown. Perhaps surprisingly, it takes longer to query the back-end and format the web page under Microsoft's IIS web server, with C-language DLLs and Visual Basic Script web page generation under Windows NT, than it does to process the back-end queries. Each front-end is only capable of processing around 11 requests per second.

What does this Say about CBR Integration?

The first observation is that while CBR seems to be an ideal technology for solving this problem, significant reasoning work is needed before the available data is in anything like a suitable format for processing. The system described here includes fuzzy page matching, a novel technique for inducing pattern matching rules, a fault-tolerant grammar, pattern detection, some simple Natural Language pattern matching, statistical matching of patterns and phrases, and a hand-engineered taxonomy of over 3000 topics before the CBR can even begin.
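The count / compare-to-base-rate / threshold synthesis used by the Links Related to Links function described above can be sketched as follows. The simple ratio test and all of the data here are illustrative stand-ins, not the system's actual statistical test.

```python
from collections import Counter

# Hypothetical sketch of snippet synthesis for "Links Related to Links":
# keep only the co-occurring links that appear significantly more often
# than their base rate across the whole case base.

def synthesize(snippets_with_url, base_counts, total_snippets, threshold=5.0):
    counts = Counter(link for s in snippets_with_url for link in s["links"])
    n = len(snippets_with_url)
    kept = []
    for link, count in counts.items():
        observed = count / n                              # 1) count
        base = base_counts.get(link, 0) / total_snippets  # 2) base probability
        if base == 0 or observed / base >= threshold:     # 3) significance
            kept.append(link)
    return {"links": sorted(kept)}

# Snippets retrieved for some query URL, plus global occurrence counts:
retrieved = [{"links": ["http://dilbert/", "http://peanuts/"]},
             {"links": ["http://dilbert/", "http://peanuts/", "http://cnn/"]}]
base_counts = {"http://dilbert/": 10, "http://peanuts/": 12,
               "http://cnn/": 400_000}
summary = synthesize(retrieved, base_counts, total_snippets=1_000_000)
```

Links that co-occur merely because they are ubiquitous (the toy http://cnn/ here) fall below the ratio threshold and are dropped from the synthesized snippet.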

All of this is on top of more "conventional" programming tasks such as creating a spider for fetching documents from the world-wide web, creating software for the efficient storage and retrieval of millions of web pages, etc.

The second observation is that even though a CBR module as "master" in a run-time system may be functionally adequate, it may be undesirable on practical grounds due to high capacity requirements, code complexity and maintenance issues, and case base size. For these reasons, we have ended up with a pipelined architecture of processing steps from raw data through a stand-alone database, with CBR planted squarely in the middle.

Is this General?

While clearly an inappropriate architecture for some reasoning tasks (for example, the Battle Planner system, where the ability to retrieve and examine cases forms an integral part of the decision support process [Goodman 1989]), this methodology has been applied to two other systems, Fido the Shopping Doggie [Goodman 1997] and FutureDB.

Fido is a web-based shopping service. As in SideClick, web pages are downloaded and preprocessed. In Fido, however, CBR is used to label parts of these web pages as product descriptions, product categories, vendors, prices, etc., based on a case library of pre-labeled web pages. These newly downloaded and labeled web pages are fed into a push-down automaton that uses the labels to construct a database of products and prices. The run-time system allows web users to perform keyword searches on this database to locate products of interest, along with links back to the web pages from which the products were extracted. As in SideClick, a variety of processing steps are needed to convert raw web pages into cases, and CBR is used as a component in a pipeline to synthesize an efficient run-time database.

In FutureDB, a product based on Projective Visualization [Goodman 1995], raw historical data is preprocessed and fused with external data sources, and CBR is used as a key component in constructing a simulator. This simulator is used to project historical data into the future, and the projected data is stored into a database in the same format as the historical database. This allows users to analyze the projected data using the same decision support systems and on-line analytical processing tools that they currently use to examine historical data. Once again, a variety of reasoning techniques are used to preprocess raw data into a form suitable for CBR, and CBR is used in a pipeline to produce a static run-time database.

Hence, while not universal, the architecture described here does support a variety of reasoning systems.

References

Adams, S. 1998. Welcome to the Dilbert Zone.

AltaVista. 1997. AltaVista: Main Page.

Brady, P. 1998. The Official Rose is Rose Site.

Fry, M. and Lewis, T. 1998. The Official Over the Hedge Site.

Goodman, M. 1989. CBR in Battle Planning. In Proceedings of the Second DARPA Workshop on Case-Based Reasoning, 312-326.

Goodman, M. 1995. Projective Visualization: Learning to Simulate from Experience. Ph.D. Thesis, Brandeis University, Waltham, Mass.

Goodman, M. 1997. Fido the Shopping Doggie.

Goodman, M. 1998. SideClick.

Kolodner, J. 1993. Case-Based Reasoning. Morgan Kaufmann Publishers, San Mateo, CA.

Redmond, M. 1992. Learning by Observing and Explaining Expert Problem Solving. Ph.D. Thesis, Georgia Institute of Technology, Atlanta, GA.

Schulz, C. 1998. The Official Peanuts Home Page.

Search Engine Watch. 1997. How Big Are The Search Engines?

SideClick. 1998. SideClick: +HTTP +RFC.

Trudeau, G. 1998. Doonesbury Electronic Town Hall.

United Media. 1998. Welcome to the Comic Zone.

Yahoo. 1998. Yahoo.

Zito-Wolf, R. and Alterman, R. 1993. A Framework and an Analysis of Current Proposals for the Case-Based Organization and Representation of Procedural Knowledge. In Proceedings of the Eleventh National Conference on Artificial Intelligence, 73-78.