WEB ENGINEERING

WEB MINING: RESEARCH AND PRACTICE

Web mining techniques seek to extract knowledge from Web data. This article provides an overview of past and current work in the three main areas of Web mining research (content, structure, and usage) as well as emerging work in Semantic Web mining.

PRANAM KOLARI AND ANUPAM JOSHI
University of Maryland, Baltimore County

1521-9615/04/$20.00 © 2004 IEEE. Copublished by the IEEE CS and the AIP.

As a large and dynamic information source that is structurally complex and ever growing, the World Wide Web is fertile ground for data-mining principles, or Web mining. The Web mining field encompasses a wide array of issues, primarily aimed at deriving actionable knowledge from the Web, and includes researchers from information retrieval, database technologies, and artificial intelligence. Since Oren Etzioni [1], among others, formally introduced the term, authors have used "Web mining" to mean slightly different things. For example, Jaideep Srivastava and colleagues [2] define it as

"The application of data-mining techniques to extract knowledge from Web data, in which at least one of structure or usage (Web log) data is used in the mining process (with or without other types of Web data)."

Researchers have identified three broad categories of Web mining [2], [3]:

• Web content mining is the application of data-mining techniques to content published on the Internet, usually as HTML (semistructured), plaintext (unstructured), or XML (structured) documents.
• Web structure mining operates on the Web's hyperlink structure. This graph structure can provide information about a page's ranking [4] or authoritativeness [5] and enhance search results through filtering.
• Web usage mining analyzes results of user interactions with a Web server, including Web logs, clickstreams, and database transactions at a Web site or a group of related sites. Web usage mining introduces privacy concerns and is currently the topic of extensive debate.

We discuss some important research contributions in Web mining, with a goal of providing a broad overview rather than an in-depth analysis.

Web Content and Structure Mining

Some researchers combine content and structure mining to leverage the techniques' strengths. Although not all researchers agree with such a classification, we list research in these two areas together. Fabrizio Sebastiani [6] and Soumen Chakrabarti [7] discuss Web content mining techniques in detail, and Johannes Fürnkranz [8] surveys work in Web structure mining.
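To make structure mining on a hyperlink graph concrete, the following is a minimal sketch of how hub and authority scores can be derived from links alone by mutual reinforcement, in the spirit of Kleinberg's hubs-and-authorities formulation. It is an illustration under stated assumptions, not the published algorithm in full: the toy graph, the page names, the fixed iteration count, and the function name `hits` are all hypothetical, and the step that first selects candidate pages from text search results is omitted.

```python
from math import sqrt


def hits(links, iterations=50):
    """Return (hub, authority) score dicts for a directed link graph.

    `links` maps each page to the list of pages it links to.
    The iteration count is an illustrative assumption, not a tuned value.
    """
    pages = set(links) | {dst for targets in links.values() for dst in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of a page: sum of the hub scores of the pages linking to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub score of a page: sum of the authority scores of the pages it links to.
        hub = {p: sum(auth[dst] for dst in links.get(p, [])) for p in pages}
        # Normalize both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical toy Web: two link collections pointing at three news pages.
toy_web = {
    "portal": ["news_a", "news_b", "news_c"],
    "blog": ["news_a", "news_b"],
    "news_a": [],
    "news_b": [],
    "news_c": ["news_a"],
}
hub, auth = hits(toy_web)
best_hub = max(hub, key=hub.get)
best_authority = max(auth, key=auth.get)
print(best_hub, best_authority)
```

On this toy graph the page that links to every news page comes out as the strongest hub, and the news page with the most incoming links from good hubs comes out as the strongest authority. A production system would iterate until the scores converge rather than for a fixed number of rounds.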

COMPUTING IN SCIENCE & ENGINEERING, JULY/AUGUST 2004

Web as a Database

Early work in the area of Web databases focused on the Web's layered view, as suggested by Osmar Zaïane and colleagues [9]. Placing a layer of abstraction containing some semantic information on top of the semistructured Web lets users query the Web as they would a database. For instance, users can readily query a metadata layer describing a document's author or topic. Researchers can use content and hyperlink mining approaches in which XML represents the semantics to build such a multilayered Web. WebLog [10] and WebSQL [11] are such database-based approaches. More recent work in this area aims to realize the Semantic Web vision [12].

Document Classification

Classification's roots are in machine learning, pattern recognition, and text analysis. The basic idea is to classify pages using supervised or unsupervised methods. In simple terms, supervised learning uses preclassified training data, which is not required in unsupervised learning. Classification is useful in such areas as topic aggregation and Web-community identification.

Early work in document classification applied text-mining techniques to Web data directly. (Text mining is a subcategory of Web content mining that does not use Web structure.) Later research showed that harnessing the Web graph structure and semistructured content in the form of HTML tags improved results. HyPursuit is an early effort in this direction [13]. Google News (http://news.google.com), which automatically gathers and classifies the most recent news from more than 4,000 sources, is a popular application of document classification.

Hubs and Authorities

Hyperlink-induced topic search (HITS) is an iterative algorithm for mining the Web graph to identify topic hubs and authorities. "Authorities" are highly ranked pages for a given topic; "hubs" are pages with links to authorities. The algorithm takes as input search results returned by traditional text-indexing techniques, and filters these results to identify hubs and authorities. The number and weight of hubs pointing to a page determine the page's authority. The algorithm assigns weight to a hub based on the authoritativeness of the pages it points to. For example, a page containing links to all authoritative news servers (CNN, CNBC, and so on) is a powerful news hub.

Larry Page and colleagues proposed PageRank [4] and popularized it through the Google search engine. With PageRank, a crawler precomputes page ranks, increasing the speed with which ranked search results are returned. A page's PageRank computation is based on the number of links other ranked pages have to it and the probability that a surfer will visit it without traversing links (through bookmarks, for example). Researchers have suggested enhancements to the basic PageRank algorithm. Sepandar Kamvar and colleagues [14], for example, developed a quadratic extrapolation algorithm that significantly reduces the cost of PageRank computation.

Clever: Ranking by Content

Basic hub and authority approaches do not consider a link's semantics for page ranking. The Clever [15] system addresses this problem by considering query terms occurring in or near the anchor text (within a certain window) in an HTML page as a hint to link semantics, and thus leverages content-mining techniques for structure analysis. Clever gives greater weight to links whose surrounding text is similar to the search query. It incorporates a link's weight into the HITS algorithm when deciding a page's authoritativeness. For example, if n pages link to two other pages for different reasons, such as business and sports, the enhanced HITS algorithm will rank the two pages differently for queries on sports than for queries on business. Soumen Chakrabarti and colleagues suggested additional refinements [15], and their results show a significant improvement over contemporary approaches.

Identifying Web Communities

Many communities are well organized on the Web, with webrings (interlinks between Web sites with a ring structure) or information portals linking them together. Ravi Kumar and colleagues [16] proposed trawling to identify nascent or emerging communities using hyperlink data to obtain cocitation information. They represent such a group or community as a dense directed bipartite graph with nodes divided into the community core and the rest. The community core represents those Web sites that are part of the same community without links between themselves. Trawling is the process of identifying such subgraphs from the Web graph.

Web Usage Mining

Web usage mining has several applications in e-business, including personalization, traffic analysis, and targeted advertising. The development of graphical analysis tools such as WebViz [17] popularized Web usage mining of Web transactions. The main areas of research in this domain are Web log data preprocessing and identification of useful patterns from this preprocessed data using mining techniques. Several surveys on Web usage mining exist [18], [19].

Most data used for mining is collected from Web servers, clients, proxy servers, or server databases, all of which generate noisy data. Because Web mining is sensitive to noise, data cleaning methods are necessary. Jaideep Srivastava and colleagues [18] categorize data preprocessing into subtasks and note that the final outcome of preprocessing should be data that allows identification of a particular user's browsing pattern in the form of page views, sessions, and clickstreams. Clickstreams are of particular interest because they allow reconstruction of user navigational patterns. Recent work by Yannis Manolopoulos and colleagues [20] provides a comprehensive discussion of Web logs for usage mining and suggests novel ideas for Web log indexing. Such preprocessed data enables various mining techniques. We briefly describe some of the notable research here.

Adaptive Web Sites

Personalization is one of the most widely researched areas in Web usage mining. An early effort in this direction was the adaptive Web site challenge posed by Oren Etzioni and colleagues [21]. Adaptive sites automatically change their organization and presentation according to the preferences of the user accessing them. Other contemporary research seeks to build agent-based systems that give user recommendations. For instance, WebWatcher [22] uses content- and structure-mining techniques to give guided tours to users browsing a page. Popular Web sites like Amazon.com use similar techniques for "recommended links" provided to users. All these approaches primarily use association rules and clustering mechanisms on log data and Web pages.

Robust Fuzzy Clustering

Anupam Joshi and colleagues [23] use fuzzy techniques for Web page clustering and usage mining, and they use the mined knowledge to create adaptive Web sites [24]. They argue that given the inherent ambiguity and complexity of the underlying data, clustering results should not be clearly demarcated sets but rather fuzzy sets, that is, overlapping clusters. For instance, a user can belong to multiple user interest groups because at different times he or she accesses the Web for different information or merchandise. Insisting that each user fit only a single group is clearly inconsistent with this reality.

Moreover, given the noise expected in the data despite cleaning attempts, the clustering process must be robust in the statistical sense. Raghu Krishnapuram and colleagues discuss fuzzy clustering and its application to Web-log analysis and present a fast linear clustering algorithm that can handle significant data noise [24]. They use this algorithm to cluster Web access logs and use the traversal patterns identified for specific groups to automatically adapt the Web site to those groups.

Association Rules

Early systems used collaborative filtering for user recommendation and personalization. Bamshad Mobasher and colleagues [25] used association-rule mining based on frequent item sets and introduced a data structure to store the item sets. They split Web logs into user sessions and then mined these sessions using their suggested association-rule algorithm. They argue that other techniques based on association rules for usage data do not satisfy the real-time constraints of recommender systems because they consider all association rules prior to making a recommendation. Ming-Syan Chen and colleagues [26] proposed a somewhat similar approach that uses a different frequent item-set counting algorithm.

Recommender Systems

J. Ben Schafer and colleagues [27] note that recommender systems have enhanced e-business by

• converting browsers to buyers,
• increasing cross-sell by identifying related products, and
• building loyalty.

These systems primarily use association rule mining for pattern detection. In an e-business scenario, a recommender system uses customers' Web baskets (shopping carts) as data sources. Amazon.com has the most prominent application: "Customers who bought product A also bought product B."

Web Site Evaluation

Myra Spiliopoulou [28] suggests applying Web usage mining to Web site evaluation to determine needed modifications, primarily to the site's design of page content and link structure between pages. Such evaluation is one of the earliest steps in Web usage analysis conducted by Web sites and is necessary for repeat visitors. Evaluation is important because all subsequent Web usage mining techniques are effective only in the presence of large amounts of data created by repeat visitors. The main technique for evaluating data is to model user navigation patterns and compare them to site designers' expected patterns. The Web utilization miner (WUM) [28] analysis tool, for example, incorporates evaluation.

Hamlet: To Buy or Not to Buy

Etzioni and colleagues [29] applied Web mining to airline ticket purchasing. Airlines use sophisticated techniques to manage yield, varying ticket prices according to time and capacity. Etzioni's approach mined airline prices available on the Web and price changes over time to produce recommendations regarding the best time to buy tickets. Many more innovative areas are yet to be explored.

Privacy Issues

Recent data-mining privacy violations have caused concern, specifically where vertically partitioned data is involved (that is, data about the same entity or individual from multiple sources). One example is Terrorist Information Awareness (TIA, www.epic.org/privacy/profiling/tia), a DARPA-initiated program that aims to aggregate information from disparate sources to detect patterns that might indicate a terrorist. This program has led to serious public debate about whether such a system, even if technologically possible, should be used.

DoubleClick's (www.doubleclick.com) online advertising is an instance of tracking user behavior across multiple sites. If a user's transactions at every Web site are identified through uniquely identifiable information collected by Web logs, the combined data could yield a far more complete profile of the user's shopping habits. The current Web privacy architecture provided by the Platform for Privacy Preferences (P3P, www.w3.org/P3P) Protocol and A P3P Preference Exchange Language (APPEL) lets users control this kind of usage by explicitly agreeing or disagreeing to such tracking. However, Web sites and popular Web browsers offer limited support for such tools. Web mining research should accommodate this preference set and enforce it across organizations and databases. Lorrie Cranor surveys possible research directions in this area [30].

Semantic Web Mining

The Semantic Web [12] is emerging as the next-generation Web, with a semantically rich language such as the Web Ontology Language (OWL, www.w3.org/TR/owl-features) for marking up hypertext pages. OWL allows more complex assertions about a page (for instance, its provenance, access rules, and links to other pages) than the Web-as-database approach, which is limited to simple metadata (topics, author, creation date, and so on). Moreover, these assertions will be in a language with explicit semantics, making them machine interpretable. As Bettina Berendt and colleagues [31] discuss, the Semantic Web and Web mining can fit together: Web mining enables the Semantic Web vision, and the Semantic Web infrastructure improves Web mining's effectiveness.

In the Semantic Web, adding semantics to a Web resource is accomplished through explicit annotation (based on an ontology). Humans cannot be expected to annotate Web resources; it is simply not scalable. Hence, we need to automate the annotation process through ontology learning, mapping, merging, and instance learning. Web content-mining techniques can accomplish this. For instance, we can use topic classification to automatically annotate Web pages with information about topics in an ontology. Annotations of this kind enable new possibilities for Web mining. Ontologies can help improve clustering results through feature selection and aggregation (for instance, identifying that two different URLs both point to the same airfare). With the Semantic Web, page ranking is decided not just by the approximated semantics of the link structure, but also by explicitly defined link semantics expressed in OWL. Thus, page ranking will vary depending on the content domain. Data modeling of a complete Web site with an explicit ontology can enhance usage-mining analysis through enhanced queries and more meaningful visualizations.

Recent research has mostly focused on Web usage analysis, partly because of its applicability in e-business. We expect privacy issues, distributed Web mining, and Semantic Web mining to attract equal, if not more, interest from the research community. Increased use of Web mining techniques will require that privacy issues be addressed, however. Similarly, aggregating data in a central site and then mining it is rarely scalable, hence the need for distributed mining techniques. Finally, researchers will need to leverage the semantic information the Semantic Web provides. Exposing content and link semantics explicitly can help in many tasks, including mining the hidden Web, that is, data stored in databases and not accessible through search engines.

As new data is published every day, the Web's utility as an information source will continue to grow. The only question is: Can Web mining catch up to the WWW's growth?

Acknowledgments

The US National Science Foundation helped support this work under grant NSF IIS-9875433, as did the DARPA DAML program under contract F30602-97-1-0215.

References

1. O. Etzioni, "The World Wide Web: Quagmire or Gold Mine?," Comm. ACM, vol. 39, no. 11, 1996, pp. 65–68.
2. J. Srivastava, P. Desikan, and V. Kumar, "Web Mining: Accomplishments and Future Directions," Proc. US Nat'l Science Foundation Workshop on Next-Generation Data Mining (NGDM), Nat'l Science Foundation, 2002.
3. R. Kosala and H. Blockeel, "Web Mining Research: A Survey," ACM SIGKDD Explorations, vol. 2, no. 1, 2000, pp. 1–15.
4. L. Page et al., The PageRank Citation Ranking: Bringing Order to the Web, tech. report, Stanford Digital Library Technologies, 1999-0120, Jan. 1998.
5. J. Kleinberg, "Authoritative Sources in a Hyperlinked Environment," Proc. 9th Ann. ACM–SIAM Symp. Discrete Algorithms, ACM Press, 1998, pp. 668–677.
6. F. Sebastiani, Machine Learning in Automated Text Categorization, tech. report B4-31, Istituto di Elaborazione dell'Informazione, Consiglio Nazionale delle Ricerche, Pisa, 1999.
7. S. Chakrabarti, "Data Mining for Hypertext: A Tutorial Survey," ACM SIGKDD Explorations, vol. 1, no. 2, 2000, pp. 1–11.
8. J. Fürnkranz, "Web Structure Mining: Exploiting the Graph Structure of the World Wide Web," Österreichische Gesellschaft für Artificial Intelligence (ÖGAI), vol. 21, no. 2, 2002, pp. 17–26.
9. O.R. Zaïane and J. Han, "Resource and Knowledge Discovery in Global Information Systems: A Preliminary Design and Experiment," Proc. 1st Int'l Conf. Knowledge Discovery and Data Mining (KDD), AAAI Press, 1995, pp. 331–336.
10. L.V.S. Lakshmanan, F. Sadri, and I.N. Subramanian, "A Declarative Language for Querying and Restructuring the Web," Proc. 6th IEEE Int'l Workshop Research Issues in Data Eng., Interoperability of Nontraditional Database Systems (RIDE-NDS), IEEE CS Press, 1996.
11. A. Mendelzon, G. Mihaila, and T. Milo, "Querying the World Wide Web," Proc. 1st Int'l Conf. Parallel and Distributed Information System, IEEE CS Press, 1996, pp. 80–91.
12. T. Berners-Lee, J. Hendler, and O. Lassila, "The Semantic Web," Scientific Am., vol. 279, no. 5, 2001, pp. 34–43.
13. R. Weiss et al., "HyPursuit: A Hierarchical Network Search Engine that Exploits Content-Link Hypertext Clustering," Proc. 7th ACM Conf. Hypertext, ACM Press, 1996.
14. S.D. Kamvar et al., "Extrapolation Methods for Accelerating PageRank Computations," Proc. 12th Int'l World Wide Web Conf., ACM Press, 2003.
15. S. Chakrabarti et al., "Mining the Web's Link Structure," Computer, vol. 32, no. 8, 1999, pp. 60–67.
16. R. Kumar et al., "Trawling the Web for Emerging Cybercommunities," Proc. 8th World Wide Web Conf., Elsevier Science, 1999.
17. J.E. Pitkow and K. Bharat, "WebViz: A Tool for WWW Access Log Analysis," Proc. 1st Int'l Conf. World Wide Web, Elsevier Science, 1994, pp. 271–277.
18. J. Srivastava et al., "Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data," ACM SIGKDD Explorations, vol. 1, no. 2, 2000, pp. 12–23.
19. M. Eirinaki and M. Vazirgiannis, "Web Mining for Web Personalization," ACM Trans. Internet Technology, vol. 3, no. 1, 2003, pp. 1–27.
20. Y. Manolopoulos et al., "Indexing Techniques for Web Access Logs," Web Information Systems, IDEA Group, 2004.
21. M. Perkowitz and O. Etzioni, "Adaptive Web Site: An AI Challenge," Proc. Int'l Joint Conf. Artificial Intelligence (IJCAI), Morgan Kaufmann, 1997.
22. R. Armstrong et al., "Webwatcher: A Learning Apprentice for the World Wide Web," Proc. AAAI Spring Symp. Information Gathering from Heterogeneous, Distributed Environments, AAAI Press, 1995, pp. 6–13.
23. A. Joshi and R. Krishnapuram, "Robust Fuzzy Clustering Methods to Support Web Mining," Proc. ACM SIGMOD Workshop Data Management and Knowledge Discovery, ACM Press, 1998.
24. T. Kamdar, Creating Adaptive Web Servers Using Incremental Weblog Mining, master's thesis, Dept., Univ. of Maryland, Baltimore, C0–1, 2001.
25. B. Mobasher et al., "Effective Personalization Based on Association Rule Discovery from Web Usage Data," Proc. 3rd ACM Workshop Web Information and Data Management (WIDM 2001), ACM Press, 2001, pp. 9–15.
26. M.-S. Chen, J.S. Park, and P.S. Yu, "Efficient Data Mining for Path Traversal Patterns," IEEE Trans. Knowledge and Data Eng., vol. 10, no. 2, 1998, pp. 209–221.
27. J.B. Schafer, J. Konstan, and J. Riedl, "Electronic Commerce Recommender Applications," J. Data Mining and Knowledge Discovery, vol. 5, nos. 1/2, 2000, pp. 115–152.
28. M. Spiliopoulou, "Web Usage Mining for Site Evaluation," Comm. ACM, vol. 43, no. 8, 2000, pp. 127–134.
29. O. Etzioni et al., "To Buy or Not to Buy: Mining Airline Fare Data to Minimize Ticket Purchase Price," Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, ACM Press, 2003.
30. L. Cranor, "I Didn't Buy It for Myself: Privacy and E-Commerce Personalization," Proc. ACM Workshop Privacy in the Electronic Society, ACM Press, 2003.
31. B. Berendt, A. Hotho, and G. Stumme, "Towards Semantic Web Mining," Proc. US Nat'l Science Foundation Workshop Next-Generation Data Mining (NGDM), Nat'l Science Foundation, 2002.

Pranam Kolari is a graduate student in the Computer Science Department at the University of Maryland, Baltimore County. His research interests include the Semantic Web, social network analysis, machine learning, and Web mining. Kolari has a BE in computer science from Bangalore University, India. He is a member of the ACM. Contact him at [email protected].

Anupam Joshi is an associate professor of computer science and electrical engineering at UMBC. His research interests are in the broad area of networked computing and intelligent systems, with a primary focus on data management for mobile computing systems, and most recently on data management and security in pervasive computing and sensor environments. He is also interested in the Semantic Web and data and Web mining. Joshi has a BTech in electrical engineering from the Indian Institute of Technology, Delhi, and an MS and PhD in computer science from Purdue University. He is a senior member of the IEEE and a member of the IEEE Computer Society and the ACM. Contact him at [email protected].
