
Hypertext and Data Mining∗
Soumen Chakrabarti
Indian Institute of Technology Bombay
[email protected]

The volume of unstructured text and hypertext data far exceeds that of structured data. Text and hypertext are used for digital libraries, product catalogs, reviews, newsgroups, medical reports, customer service reports, and the like. Currently measured in billions of dollars, the worldwide activity is expected to reach a trillion dollars by 2002. Database researchers have kept some cautious distance from this action. The goal of this tutorial is to expose database researchers to text and hypertext information retrieval (IR) and mining systems, and to discuss emerging issues in the overlapping areas of databases, hypertext, and data mining.

Keyword indices: The workhorse of text search is the so-called "inverted index" data structure. Inverting a document collection involves building a mapping from each term in the collection to the set of documents containing the term, with details such as the term's offset in each document. This enables boolean queries like socks AND network AND NOT shoes. Most keyword indexing systems also support phrase search, such as "inverted index", and NEAR (proximity) queries, such as warehouse NEAR mining.

Vector space model and relevance ranking: Large responses must be ordered so that documents likely to fulfill the user's information need are ranked highly. To define similarity between a query and a document, both are represented as vectors in which each dimension corresponds to a term. Suitable weighting can be used for each term. The cosine of the angle between the query and document vectors is usually used as the similarity measure.

Similarity, clustering, and collaborative filtering: A limitation of the vector space model is that indirect evidence of similarity is overlooked. Latent Semantic Indexing (LSI) transforms the vector space into a lower-dimensional space so that similar terms such as car and auto are mapped to similar vectors, rather than being orthogonal as in the original space. This enhances query capability beyond exact keyword match. An improved measure of document similarity also enables clustering documents into related sub-collections, thereby constructing topical hierarchies. Just as terms induce similarity between documents (and vice versa), the relation between entities (movies, songs, books) and the people who like them can be analyzed to cluster both the people and the entities, thereby enabling collaborative recommendation of additional material.

Relevance feedback and supervised learning: Search systems can significantly improve their relevance ranking if the user gives even a little feedback, say by marking some of the top-ranking documents as relevant or irrelevant, thereby generating a two-class learning problem. More general learning programs can be trained to route documents automatically to appropriate nodes in a topic hierarchy such as Yahoo! Unlike classifiers for structured data, text classifiers must handle very high-dimensional (say 50,000 dimensions) and noisy data; they must automatically extract those dimensions that are best for classification, and construct statistical topic models. Scaling to large taxonomies and collections is important not only in its own right, but also for improved accuracy.

Exploiting hypertext for better learning: Judging the topic of a hypertext document in isolation, without regard to its link neighborhood, gives poor accuracy. There is information in links, but it is noisy. A page about 'cardiology' may cite other pages on cardiology, but also possibly a page on 'swimming'. Such 'sociology' of citations between topics can also be learnt. Integrating this knowledge into text classifiers using a general model of citations results in significant improvements in classification accuracy.

Social network analysis for hypertext: Hypertext is a kind of social network. Social networks formed between people by co-authoring, citing, and advising have been extensively researched to find authoritative nodes that have high prestige. The prestige of a social network node may be recursively modeled as the sum of the prestige of the nodes that cite it. Ordering web query results by prestige, as in the Google search engine, improves the searching experience. Another measure from social network analysis is reflected prestige, as evident in the topical hubs of resource links that are often found on the Web. Hubs can be found by a mutual recursion involving prestige and reflected prestige.

Distributed resource discovery: Given the social structure of the Web and the tendency of linked communities to form spontaneously, it is possible to mine the Web to obtain a coherent view of a topical section of it. Care is needed to crawl the Web in this goal-directed fashion, starting from a handful of relevant pages. The crawl frontier must be continually pruned to eliminate irrelevant links and paths, and good topical hubs must be detected for preferential expansion.

∗The printed version will not show the many informative hyperlinks to be found in the PDF version available through http://www.cs.berkeley.edu/~soumen.
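
As an illustration of the inverted index described under "Keyword indices", the following is a minimal Python sketch, not the implementation of any particular system. It assumes whitespace tokenization and lowercasing; the function names (`build_inverted_index`, `docs_with`, `phrase_docs`) and the four toy documents are hypothetical and chosen only for this example. It shows how a term-to-documents mapping with offsets supports a boolean query and a two-word phrase query.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: [token offsets of the term in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Assumption: tokens are lowercased, whitespace-separated words.
        for offset, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(offset)
    return index


def docs_with(index, term):
    """Set of document ids that contain the term (empty if the term is unknown)."""
    return set(index.get(term, {}))


def phrase_docs(index, first, second):
    """Documents in which `second` occurs immediately after `first`."""
    hits = set()
    for doc_id in docs_with(index, first) & docs_with(index, second):
        second_offsets = set(index[second][doc_id])
        if any(offset + 1 in second_offsets for offset in index[first][doc_id]):
            hits.add(doc_id)
    return hits


# Toy collection (hypothetical data).
docs = {
    1: "socks network",
    2: "socks network shoes",
    3: "inverted index for text search",
    4: "index inverted",
}
index = build_inverted_index(docs)

# socks AND network AND NOT shoes, expressed as set operations on posting sets.
boolean_hits = (docs_with(index, "socks") & docs_with(index, "network")) - docs_with(index, "shoes")

# Phrase query "inverted index", answered from the stored offsets.
phrase_hits = phrase_docs(index, "inverted", "index")
```

Boolean operators reduce to set intersection and difference over the per-term document sets, while the stored offsets are what make phrase queries answerable without rescanning the documents. A NEAR query could be handled the same way by comparing offsets within a window rather than requiring adjacency.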

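
The mutual recursion between prestige and reflected prestige mentioned under "Social network analysis for hypertext" can be sketched as a simple iteration. This is a hedged illustration only: the function names, the fixed iteration count, the unit-length normalization, and the toy link graph are all assumptions made for the example, and it is not presented as the method used by any specific search engine. A node's prestige is taken as the sum of the reflected prestige of the nodes citing it, and a node's reflected prestige as the sum of the prestige of the nodes it cites.

```python
import math


def normalize(scores):
    """Scale scores to unit Euclidean length so the iteration does not overflow."""
    norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
    return {node: s / norm for node, s in scores.items()}


def prestige_and_hubs(links, iterations=50):
    """links maps each node to the list of nodes it cites (assumed input format)."""
    nodes = set(links) | {v for targets in links.values() for v in targets}
    prestige = {n: 1.0 for n in nodes}
    reflected = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Prestige of v: sum of reflected prestige over nodes u that cite v.
        new_prestige = {n: 0.0 for n in nodes}
        for u, targets in links.items():
            for v in targets:
                new_prestige[v] += reflected[u]
        # Reflected prestige of u: sum of prestige over nodes v that u cites.
        new_reflected = {n: 0.0 for n in nodes}
        for u, targets in links.items():
            for v in targets:
                new_reflected[u] += new_prestige[v]
        prestige = normalize(new_prestige)
        reflected = normalize(new_reflected)
    return prestige, reflected


# Toy link graph (hypothetical): two hub-like pages citing resource pages.
links = {
    "hub1": ["a", "b"],
    "hub2": ["a", "b", "c"],
    "c": ["a"],
}
prestige, reflected = prestige_and_hubs(links)
top_authority = max(prestige, key=prestige.get)
top_hub = max(reflected, key=reflected.get)
```

In this toy graph the most cited page, `a`, ends up with the highest prestige, and `hub2`, which cites the most prestigious pages, ends up with the highest reflected prestige. Normalizing after every round keeps the scores bounded; only their relative order matters for ranking.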