Web Mining: Research and Practice

W EB E NGINEERING WEB MINING: RESEARCH AND PRACTICE Web mining techniques seek to extract knowledge from Web data. This article provides an overview of past and current work in the three main areas of Web mining research— content, structure, and usage—as well as emerging work in Semantic Web mining. s a large and dynamic information • Web content mining is the application of data- source that is structurally complex mining techniques to content published on the and ever growing, the World Wide Internet, usually as HTML (semistructured), Web is fertile ground for data- plaintext (unstructured), or XML (structured) miningA principles, or Web mining. The Web documents. mining field encompasses a wide array of is- • Web structure mining operates on the Web’s hy- sues, primarily aimed at deriving actionable perlink structure. This graph structure can knowledge from the Web, and includes re- provide information about a page’s ranking4 or searchers from information retrieval, database authoritativeness5 and enhance search results technologies, and artificial intelligence. Since through filtering. Oren Etzioni,1 among others, formally intro- • Web usage mining analyzes results of user induced the term, authors have used “Web min- teractions with a Web server, including Web ing” to mean slightly different things. For ex- logs, clickstreams, and database transactions at ample, Jaideep Srivastava and colleagues2 a Web site or a group of related sites. Web us- define it as age mining introduces privacy concerns and is currently the topic of extensive debate. The application of data-mining techniques to extract knowledge from Web data, in which at least We discuss some important research contribu- one of structure or usage (Web log) data is used tions in Web mining, with a goal of providing a in the mining process (with or without other broad overview rather than an in-depth analysis. types of Web data). Web Content and Structure Mining Researchers have identified three broad cate- Some researchers combine content and structure gories of Web mining:2,3 mining to leverage the techniques’ strengths. Al- though not all researchers agree to such a classifi- 1521-9615/04/$20.00 © 2004 IEEE cation, we list research in these two areas together. 6 7 Copublished by the IEEE CS and the AIP Fabrizio Sebastini and Soumen Chakrabarti discuss Web content mining techniques in detail, PRANAM KOLARI AND ANUPAM JOSHI and Johannes Fürnkranz8 surveys work in Web University of Maryland, Baltimore County structure mining. JULY/AUGUST 2004 49 Web as a Database search results are returned. A page’s PageRank Early work in the area of Web databases focused on computation is based on the number of links other the Web’s layered view, as suggested by Osmar Za- ranked pages have to it and the probability that a iane and colleagues.9 Placing a layer of abstraction surfer will visit it without traversing links (through containing some semantic information on top of bookmarks, for example). Researchers have sug- the semistructured Web lets users query the Web gested enhancements to the basic PageRank algo- as they would a database. For instance, users can rithm. Sepandar Kamwar and colleagues,14 for ex- readily query a metadata layer describing a docu- ample, developed a quadratic extrapolation ment’s author or topic. Researchers can use content algorithm that significantly improves the cost of and hyperlink mining approaches in which XML PageRank computation. represents the semantics to build such a multilay- ered Web. WebLog10 and WebSQL11 are such Clever: Ranking by Content database-based approaches. More recent work in Basic hub and authority approaches do not consider this area aims to realize the Semantic Web vision.12 a link’s semantics for page ranking. The Clever15 system addresses this problem by considering query Document Classification terms occurring in or near the anchor text (a certain Classification’s roots are in machine learning, pat- window) in an HTML page as a hint to link se- tern recognition, and text analysis. The basic idea mantics, and thus leverages content-mining tech- is to classify pages using supervised or unsupervised niques for structure analysis. Clever gives greater methods. In simple terms, supervised learning uses weight to links that are similar to the search query. preclassified training data, which is not required in It incorporates a link’s weight into the HITS algo- unsupervised learning. Classification is useful in rithm when deciding a page’s authoritativeness. For such areas as topic aggregation and Web- example, if n pages link to two other pages for dif- community identification. ferent reasons, such as business and sports, the en- Early work in document classification applied hanced HITS algorithm will return different ranks text-mining techniques to Web data directly. (Text for both pages for queries on sports and business. mining is a subcategory of Web content mining that Soumen Chakrabarti and colleagues suggested does not use Web structure.) Later research showed additional refinements,15 and their results show that harnessing the Web graph structure and semi- a significant improvement over contemporary structured content in the form of HTML tags im- approaches. proved results. Hypursuit is an early effort in this direction.13 Google News (http://news.google. Identifying Web Communities com), which automatically gathers and classifies the Many communities are well organized on the Web, most recent news from more than 4,000 sources, is with webrings (interlinks between Web sites with a a popular application of document classification. ring structure) or information portals linking them together. Ravi Kumar and colleagues16 proposed Hubs and Authorities trawling to identify nascent or emerging commu- Hyperlink-induced topic search (HITS) is an iter- nities using hyperlink data to obtain cocitation in- ative algorithm for mining the Web graph to iden- formation. They represent such a group or com- tify topic hubs and authorities. “Authorities” are munity as a dense directed bipartite graph with highly ranked pages for a given topic; “hubs” are nodes divided into the community core and the pages with links to authorities. The algorithm takes rest. The community core represents those Web as input search results returned by traditional text- sites that are part of the same community without indexing techniques, and filters these results to links between themselves. Trawling is the process identify hubs and authorities. The number and of identifying such subgraphs from the Web graph. weight of hubs pointing to a page determine the page’s authority. The algorithm assigns weight to a Web Usage Mining hub based on the authoritativeness of the pages it Web usage mining has several applications in e-busi- points to. For example, a page containing links to ness, including personalization, traffic analysis, and all authoritative news servers (CNN, CNBC, and targeted advertising. The development of graphical so on) is a powerful news hub. analysis tools such as Webviz17 popularized Web us- Larry Page and colleagues proposed PageRank4 age mining of Web transactions. The main areas of and popularized it through the Google search en- research in this domain are Web log data prepro- gine. With PageRank, a crawler precomputes page cessing and identification of useful patterns from this ranks, increasing the speed with which ranked preprocessed data using mining techniques. Several 50 COMPUTING IN SCIENCE & ENGINEERING surveys on Web usage mining exist.18,19 a fast linear clustering algorithm that can handle Most data used for mining is collected from Web significant data noise.24 They use this algorithm to servers, clients, proxy servers, or server databases, all cluster Web access logs and use the traversal pat- of which generate noisy data. Because Web mining terns identified for specific groups to automatically is sensitive to noise, data cleaning methods are nec- adapt the Web site to those groups. essary. Jaideep Srivastava and colleagues18 catego- rize data preprocessing into subtasks and note that Association Rules the final outcome of preprocessing should be data Early systems used collaborative filtering for user that allows identification of a particular user’s brows- recommendation and personalization. Bamshad ing pattern in the form of page views, sessions, and Mobasher and colleagues25 used association-rule clickstreams. Clickstreams are of particular interest mining based on frequent item sets and intro- because they allow reconstruction of user naviga- duced a data structure to store the item sets. They tional patterns. Recent work by Yannis Manolopou- split Web logs into user sessions and then mined los and colleagues20 provides a comprehensive dis- these sessions using their suggested association- cussion of Web logs for usage mining and suggests novel ideas for Web log indexing. Such preprocessed Adaptive sites automatically change their data enables various mining techniques. We briefly describe some of the notable research here. organization and presentation according to Adaptive Web Sites the preferences of the user accessing them. Personalization is one of the most widely re- searched areas in Web usage mining. An early effort in this direction was the adaptive Web site challenge posed by Oren Etzioni and colleagues.21 Adaptive sites automatically change their organi- rule algorithm. They argue that other techniques zation and presentation according to the prefer- based on association rules for usage data do not ences of the user accessing them. Other contem- satisfy the real-time constraints of recommender porary research seeks to build agent-based systems systems because they consider all association rules that give user recommendations. For instance, prior to making a recommendation. Ming-Syan Web-watcher22 uses content- and structure-min- Chen and colleagues26 proposed a somewhat sim- ing techniques to give guided tours to users brows- ilar approach that uses a different frequent item- ing a page. Popular Web sites like Amazon.com use set counting algorithm. similar techniques for “recommended links” pro- vided to users.

Load more