World-Wide Web WWW: Incentives
Total Page:16
File Type:pdf, Size:1020Kb
What is Web Mining? The use of data mining techniques to automatically RECOMMENDATION MODELS discover and extract information from Web documents and services (Etzioni, 1996) FOR WEB USERS Web mining research integrate research from several research communities (Kosala and Blockeel, 2000) such as: Dr. Şule Gündüz Öğüdücü Database (DB) [email protected] Information Retrieval (IR) The sub-areas of machine learning (ML) Natural language processing (NLP) 1 2 World-wide Web WWW: Incentives Initiated at CERN (the European Organization for Nuclear Research) WWW is a huge, By Tim Berners-Lee widely distributed, GUIs global information Berners-Lee (1990) source for: Erwise and Viola(1992), Midas (1993) Information services: Mosaic (1993) news, advertisements, a hypertext GUI for the X-window system consumer information, HTML: markup language for rendering hypertext financial management, HTTP: hypertext transport protocol for sending HTML and other data over the Internet education, government, e- CERN HTTPD: server of hypertext documents commerce, health 1994 services, etc. Netscape was founded st 1 World Wide Web Conference Hyper-link information World Wide Web Consortium was founded by CERN and MIT http://www.touchgraph.com/TGGoogleBrowser.html Web page access and http://www.w3.org/ usage information Web site contents and Mining the Web Chakrabarti and Ramakrishnan organizations 3 4 Mining the World Wide Web Application Examples: e-Commerce Growing and changing very rapidly 6 December 2006 : 12.52 billion pages http://www.worldwidewebsize.com/ Only a small portion of information on the Web is truly relevant or useful to Web user WWW provides rich sources for data mining Goals include: Target potential customers for electronic commerce Enhance the quality and delivery of Internet information services to the end user Improve Web server system performance Facilitates personalization/adaptive sites Improve site design Fraud/intrusion detection Predict user’s actions 5 6 1 Challenges on WWW Interactions Web Mining Taxonomy Searching for usage patterns, Web structures, regularities and dynamics of Web contents Web Mining Finding relevant information 99% of info of no interest to 99% of people Creating knowledge from information available Limited query interface based on keyword – oriented Web Structure Web Content Web Usage search Mining Mining Mining Personalization of the information Limited customization to individual users 7 8 Web Usage Mining Terms in Web Usage Mining Web usage mining also known as Web log mining User: A single individual who is accessing files from one or more Web servers through a browser. The application of data mining techniques to discover usage patterns from the secondary data derived from Page file: The file that is served through a HTTP the interactions of users while surfing the Web protocol to a user. Page: The set of page files that contribute to a single Information scent display in a Web browser constitutes a Web page. Information: Food Click stream: The sequence of pages followed by a WWW: Forest user. User: Animal Server session (visit): A set of pages that a user requested from a single Web server during her single Understanding the behavior of an animal that looks for visit to the Web site. food in the forest 9 10 Methodology Recommendation Process Offline Process Online Process Input Output Pattern Recommendation Preparation Extraction Data Collection Cleaning User Subset of items: Usage Recommendation Representation Patterns Set •Movies •Books •CDs •User Identification •Statistical analysis Recommender •Session Identification •Association rules •Web documents System •Server level collection •Calculation of visiting •Clustering Set of items: : •Client level collection page time •Classification : •Proxy level collection •Cleaning •Sequential pattern •Movies •Books •CDs Data collection •Web documents Data preparation and cleaning : : The right information is Pattern extraction delivered to the right people at Recommendation the right time. 11 12 2 What information can be included in User Representation? Data Collection Server level collection The order of visited Web pages Client level collection The visiting page time Proxy level collection The content of the visited Web page The change of user behavior over time The difference in usage and behavior from different geographic areas Client User profile Web server Client 13 Proxy server 14 Web Server Log File Entries Use of Log Files Method/ IP Address User ID Timestamp Status Size URL Questions Who is visiting the Web site? What is the path users take through the Web pages? How much time users spend on each page? Where and when visitors are leaving the Web site? 15 16 Data Preprocessing (1) Data Preprocessing (2) Data cleaning Data transformation Remove log entries with filename suffixes such as gif, jpeg, GIF, JPEG, jpg, JPG User identification Remove the page requests made by the automated Users with the same client IP are identical agents and spider programs Session idendification Remove the log entries that have a status code of 400 and 500 series A new session is created when a new IP address is encountered or if the visiting page time exceeds 30 Normalize the URLs: Determine URLs that correspond to the same Web page minutes for the same IP-address Data integration Data reduction Merge data from multiple server logs sampling Integrate semantics (meta-data) dimensionality reduction Integrate registration data 17 18 3 Discovery of Usage Patterns Statistical Analysis Pattern Discovery is the key component of the Web Different kinds of statistical analysis (frequency, median, mean, etc.) of the session file, one can extract statistical information such mining, which converges the algorithms and as: techniques from data mining, machine learning, The most frequently accessed pages statistics and pattern recognition etc research Average view time of a page categories Average length of a path through a site Application: Separate subsections: Improving the system performance Statistical analysis Enhancing the security of the system Providing support for marketing decisions Association rules Examples: Clustering PageGather (Perkowitz et al., 1998) Classification Discovering of user profiles (Larsen et al., 2001) Sequential pattern 19 20 Association Rules Clustering A technique to group together objects with the Sets of pages that are accessed together with a similar characteristics support value exceeding some specified Clustering of sessions threshold Clustering of Web pages Application: Clustering of users Finding related pages that are accessed Application: together regardless of the order Facilitate the development and execution of Examples: future marketing strategies Web caching and prefetching (Yang et al., Examples: 2001) Clustering of user sessions (Gunduz et al., 2003) Clustering of individuals with mixture of Markov models (Sarukkai, 2000) 21 22 Classification Sequential Pattern The technique to map data item into one of Discovers frequent subsequences as patterns several predefined classes Applications: Application The analysis of customer purchase behavior Developing a usage profile belonging to a Optimization of Web site structure particular class or category Examples: Examples: WUM (Spiliopoulou and Faulstich, 1998) WebLogMiner (Zaiane et al., 1998) Longest Repeated Subsequence (Pitkow and Pirolli, 1999) 23 24 4 Online Module: Recommendation Probabilistic models of browsing behavior The discovered patterns are used by the online Useful to build models that describe the component of the model to provide dynamic browsing behavior of users recommendations to users based on their Can generate insight into how we use Web current navigational activity Provide mechanism for making predictions The produced recommendation set is added to Can help in pre-fetching and personalization the last requested page as a set of links before the page is sent to the client browser 25 26 Markov models for page prediction Markov models for page prediction General approach is to use a finite-state Markov chain For simplicity, consider order-dependent, time-independent finite-state Markov chain with M states Each state can be a specific Web page or a category of Web pages Let s be a sequence of observed states of length L. e.g. s = ABBCAABBCCBBAA with three states A, B and C. st is state at If only interested in the order of visits (and not in time), position t (1<=t<=L). In general, each new request can be modeled as a transition of L P(s) = P(s1)∏ P(st | st−1,..., s1) states t =2 Under a first-order Markov assumption, we have Issues L P (s) = P (s1 )∏ P (s t | s t −1 ) t = 2 Self-transition This provides a simple generative model to produce sequential Time-independence data 27 28 Markov models for page prediction Markov models for page prediction If we denote T = P(s = j|s = i), we can define a M x M transition ij t t-1 Tij = P(st = j|st-1 = i) now represent the probability matrix that an individual user’s next request will be from Properties category j, given they were in category i Strong first-order assumption We can add E, an end-state to the model Simple way to capture sequential dependence If each page is a state and if W pages, O(W2), W can be of the E.g. for three categories with end state: - 5 6 order 10 to 10 for a CS dept. of a university ⎛ P(1|1) P(2 |1) P(3|1_) P(E |1) ⎞ ⎜ ⎟ To alleviate, we