Topic-Specific Link Analysis Using Independent Components for Information Retrieval
Wray Buntine, Jaakko Löfström, Sami Perttu and Kimmo Valtonen
Helsinki Institute for Information Technology (HIIT)
P.O. Box 9800, FIN-02015 HUT, Finland
[email protected]

Copyright © 2005, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

There has been mixed success in applying semantic component analysis (LSA, PLSA, discrete PCA, etc.) to information retrieval. Previous experiments have shown that high-fidelity language models do not imply good quality retrieval. Here we combine link analysis with discrete PCA (a semantic component method) to develop an auxiliary score for information retrieval that is used in post-filtering documents retrieved via regular Tf.Idf methods. For this, we use a topic-specific version of link analysis based on topics developed automatically via discrete PCA methods. To evaluate the resultant topic and link based scoring, a demonstration has been built using the Wikipedia, the public domain encyclopedia on the web.

Introduction

More sophisticated language models are starting to be used in information retrieval (Ponte & Croft 1998; Nallapati 2004) and some real successes are being achieved with their use (Craswell & Hawking 2003). A document modelling approach based on discrete versions of PCA (Hofmann 1999; Blei, Ng, & Jordan 2003; Buntine & Jakulin 2004) has been applied to the language modelling task in information retrieval (Buntine & Jakulin 2004; Canny 2004). However, it has been shown experimentally that this is not necessarily the right approach to use (Azzopardi, Girolami, & van Rijsbergen 2003). The problem can be explained as follows: when answering a query about “german immigration,” a general statistical model built on a full international news corpus often lacks fidelity on these two key words combined. Ideally, one would like a statistical model more specifically about “german immigration,” if it were feasible. Thus the statistically based language modelling approach to information retrieval is still in need of development.

Here we take an alternative path for bringing statistical models, such as those built with discrete PCA, into information retrieval. Our approach is motivated by the widespread observation that people would like to be able to bias their searches towards specific areas, but find it difficult to do so in general. Web critics have reported that Google, for instance, suffers perceived bias in some searches because the statistics of word usage in its corpus (“the web”) override dictionary word senses (Johnson 2003): on the internet an “apple” is a computer, not something you eat, and “Madonna” is an often-times risqué pop icon, not a religious icon. Thus one wants to use a keyword “Madonna” but bias the topic somehow towards Christianity in order to get the religious word sense.

A major user interface problem here is that people have trouble navigating concept hierarchies or ontologies (Suomela & Kekäläinen 2005), especially when they are unfamiliar with them. This user interface problem is compounded because good topic hierarchies and ontologies are usually multi-faceted, and a search might require specifying multiple nodes.

Thus we combine the following techniques:

Topic by example: Users do not have to know the hierarchy, or browse it, or travel to multiple places to get multi-facets for their search. They just enter a few words describing their general topic area in a “context words” box and let the system work out the topics “by example”. An example of the input screen is shown in Figure 1.

Topic specific page-rank: Many pages can be topically relevant, but when dealing with a specific topic area or combination of topic areas, which pages are considered the most important in terms of topically relevant citations? Topic specific versions of page rank (Haveliwala 2002; Richardson & Domingos 2002) address this.

Result filtering: The top results from a regular Tf.Idf query are reranked using a weighted combination of topic-specific page rank. In this way, the choice of topic “by example” affects the results.

Here we first apply the discrete PCA method to develop topics automatically. This gives topics suitable for the corpus, and a multi-faceted classification of all pages in it. We then apply these using a topic-specific version of page rank (Richardson & Domingos 2002) that is based on the notion of a random surfer willing to hit the back button when a non-topical page is encountered. This gives topic specific rankings for pages that can be used in a topic-augmented search interface.

Our intent is that these techniques yield a secondary topical score for retrieval in conjunction with a primary keyword based score such as Tf.Idf. Thus relevance of a document is a combination of both keyword relevance and topical relevance. Because search users are usually daunted by anything more than just a keyword box, and because keyword search currently works quite well, our default is to make the keyword entry and the topical entry equivalent initially in a search, and only give the option to change the topic, as shown in Figure 1, after a first batch of results has been returned. Thus the initial search screen contains no “context words” box.

Figure 1: The search options on the results page

Our platform for experiments with these methods is the English language part of the Wikipedia¹, an open source encyclopedia. This has a good internal link structure, which is necessary for the link analysis. The system is demonstrated² at our test website (http://kearsage.hiit.fi/wikisearch.html). The website is being used to test interface concepts as well as to perform user studies.

¹ http://en.wikipedia.org
² This link may not be functioning after 2005.

The combination of topic-specific and link-based scoring is fundamental, we believe, to the success of this method. Topic-based scoring alone can return documents with high topical scores, but they are not “characteristic” documents for the topic and keyword combination; rather, they are “typical”. A document with high topical content is not necessarily characteristic. For instance, entering the query “madonna” gives the following page titles as top results under a standard OKAPI BM25 version of Tf.Idf, under Google, and under our system (“Topical filtering”). These are listed in rank order:

Tf.Idf: Madonna (entertainer), Unreleased Madonna songs, List of music videos by year work in progress, Bedtime Stories (Madonna), American Life

Google: Madonna (entertainer), Madonna (singer), Madonna, Unreleased Madonna Songs, Black Madonna

Topical filtering: Madonna, Madonna (entertainer), Unreleased Madonna songs, The Madonna, American Life

Tf.Idf essentially returns documents with many instances of the word Madonna. Google essentially returns documents voted by web-links as being most important, mostly Madonna the entertainer. Our approach sees that Madonna is a word with both entertainment and religious connotations, and returns important documents with a better topical mix. “Madonna” in this case is the main disambiguating page that points to the different versions of Madonna. It becomes the highest ranked under our topical filtering because it is a better topical match to the query. Another example is the query “stars”.

Tf.Idf: List of The Simpsons episodes, List of stars on the Hollywood Walk of Fame, Star Wars, Star Trek, List of stars by constellation, Star, Star Trek Other Storylines

Google: Star, Neutron star, Flag of the United States, Movie star, List of nearest stars, Stars and Stripes, List of brightest stars

Topical filtering: Star system, Star (glyph), Star Trek Further Reading, Star (disambiguation), Star Wreck, Star, List of LucasArts Star Wars games

Here, “Star (glyph)” is the mathematical concept of a star. The disambiguation page appears only in the results from topical filtering, which also show a broader range of topical versions of star.

This paper first presents the background on discrete PCA (DPCA), and topic specific ranking using a topically motivated random surfer. Then the combination of these methods is described. The paper then describes the results of the topic specific ranking, a very appealing and rational set of document rankings for different topics. Finally, the application of these techniques to information retrieval is discussed and presented.

Background

Topic Specific Ranking

We use the term “random surfer model” in a broad sense: to encompass general Monte Carlo Markov chain methods, modelling eye-balls on pages, used to determine scores for documents. Examples are (Haveliwala 2002; Richardson & Domingos 2002). A general method for topic-specific ranking, roughly following (Richardson & Domingos 2002), goes as follows:

With probability $\alpha$ our surfer restarts, choosing page $i$ with probability $r_i$. From that page, they uniformly select a link to a document $i'$, and jump to this next page. They then consider the topic of the new page, whose strength of relevance is determined by another probability $t_{i'}$. With this probability $t_{i'}$ they accept the new page, and with probability $1 - t_{i'}$ they go back to the page $i$ to try a new link. The stationary distribution of the Markov chain, giving the probability $p_i$ of being on page $i$, is then given by the update equations:

$$p_i \longleftarrow \alpha r_i + (1 - \alpha) \sum_{i' \,:\, i' \to i} p_{i'} \, \frac{t_i}{\sum_{j \,:\, i' \to j} t_j}$$

where we perform the calculation only for those pages $i$ with $r_i > 0$, and $i' \to i$ denotes that page $i'$ links to page $i$. The vectors $\vec{r}$ and $\vec{t}$ allow specialization to a topic, so a set of such rankings $\vec{p}$ can be developed for every topic: $\vec{r}$ represents the starting documents for a topic and $\vec{t}$ represents the probability that someone interested in the topic will stay at a page.

an additional set of quantities which are the word/lexeme counts $w_j$ broken out into a term for each component, $w_{j,k}$
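The update equations for $p_i$ in the Topic Specific Ranking section can be sketched as a simple fixed-point iteration. This is a minimal illustration, not the authors' implementation: the function name `topic_rank`, the dictionary-based graph representation, the restart probability `alpha=0.15`, and the fixed iteration count are all assumptions made for the example. For simplicity the sketch updates every page rather than only those with $r_i > 0$, and a page with no outgoing links simply loses its probability mass.

```python
def topic_rank(links, r, t, alpha=0.15, iterations=50):
    """Iterate p_i <- alpha*r_i + (1-alpha) * sum_{i'->i} p_i' * t_i / sum_{j: i'->j} t_j.

    links: dict mapping each page to the list of pages it links to (hypothetical layout)
    r:     dict of restart probabilities for the topic (should sum to 1)
    t:     dict of probabilities that a topical surfer accepts each page
    alpha: restart probability (0.15 is an illustrative value, not from the paper)
    """
    pages = list(links)
    # Start from the restart distribution.
    p = {i: r.get(i, 0.0) for i in pages}
    for _ in range(iterations):
        # Restart term alpha * r_i.
        new_p = {i: alpha * r.get(i, 0.0) for i in pages}
        for src in pages:
            out = links[src]
            # Normaliser: total acceptance probability over the links of src.
            denom = sum(t[j] for j in out)
            if denom == 0:
                continue  # no acceptable link: this page's mass is dropped
            for dst in out:
                # A link is followed in proportion to the target's topical acceptance.
                new_p[dst] += (1 - alpha) * p[src] * t[dst] / denom
        p = new_p
    return p


# Toy graph: page "a" links to "b" and "c", which both link back to "a".
links = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
r = {"a": 1.0}                      # the topic's starting document
t = {"a": 1.0, "b": 0.9, "c": 0.1}  # "b" is far more topical than "c"
ranks = topic_rank(links, r, t)
```

In this toy graph the surfer leaving "a" ends up on "b" nine times as often as on "c", because the back-button step makes each link's chance of being followed proportional to the target's $t$ value.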