CS 345A Lecture 1
Introduction to Web Mining

What is Web Mining?

Discovering useful information from the World-Wide Web and its usage patterns

Web Mining v. Data Mining

• Structure (or lack of it)
  - Textual information and linkage structure
• Scale
  - Data generated per day is comparable to largest conventional data warehouses
• Speed
  - Often need to react to evolving usage patterns in real-time (e.g., merchandising)

Web Mining topics

• Web graph analysis
• Power Laws and The Long Tail
• Structured data extraction
• Web advertising
• Systems Issues

Size of the Web

• Number of pages
  - Technically, infinite
  - Much duplication (30-40%)
  - Best estimate of “unique” static HTML pages comes from search engine claims
    • Google = 8 billion(?), Yahoo = 20 billion
• Number of web sites
  - Netcraft survey says 72 million sites (http://news.netcraft.com/archives/web_server_survey.html)

Netcraft survey

http://news.netcraft.com/archives/web_server_survey.html

The web as a graph

• Pages = nodes, hyperlinks = edges
  - Ignore content
  - Directed graph
• High linkage
  - 8-10 links/page on average
  - Power-law degree distribution
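The graph view above (pages as nodes, hyperlinks as directed edges) can be sketched with plain adjacency lists. This is a minimal illustration, not the course's tooling: the function name `build_graph` and the toy link list are hypothetical, and page content is ignored, as on the slide.

```python
from collections import defaultdict

def build_graph(links):
    """Build out-link and in-link adjacency sets from (source, target) pairs."""
    out_links = defaultdict(set)
    in_links = defaultdict(set)
    for src, dst in links:
        out_links[src].add(dst)
        in_links[dst].add(src)
        # Make sure every page appears in both maps, even with no links.
        out_links.setdefault(dst, set())
        in_links.setdefault(src, set())
    return dict(out_links), dict(in_links)

# Hypothetical toy crawl: each pair is one hyperlink (source page -> target page).
links = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]
out_links, in_links = build_graph(links)

# Degree of a page = number of distinct pages it links to / is linked from.
out_degree = {page: len(targets) for page, targets in out_links.items()}
in_degree = {page: len(sources) for page, sources in in_links.items()}
avg_out_degree = sum(out_degree.values()) / len(out_degree)
```

On a real crawl the same two degree maps are the raw input for the linkage statistics quoted above (average links per page, degree distribution).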

Structure of Web graph

• Let’s take a closer look at structure
  - Broder et al (2000) studied a crawl of 200M pages and other smaller crawls
  - Bow-tie structure
    • Not a “small world”

Bow-tie Structure

Source: Broder et al, 2000

What can the graph tell us?

• Distinguish “important” pages from unimportant ones
  - Page rank
• Discover communities of related pages
  - Hubs and Authorities
• Detect web spam
  - Trust rank
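As a rough illustration of how link structure can separate "important" pages from unimportant ones, here is a minimal power-iteration sketch of page rank. The slide only names the technique; the damping factor of 0.85, the iteration count, the handling of pages without out-links, and the toy graph are all assumptions made for this example.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping each page to the pages it links to.

    Assumes every link target also appears as a key of out_links.
    """
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (random jump).
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, targets in out_links.items():
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page with no out-links: spread its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

# Hypothetical toy graph: page -> list of pages it links to.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
most_important = max(ranks, key=ranks.get)
```

In this toy graph page "c" collects links from three pages and ends up with the highest score, while "d", which nothing links to, keeps only the baseline share.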

Power-law degree distribution

Source: Broder et al, 2000

Power-laws galore

• Structure
  - In-degrees
  - Out-degrees
  - Number of pages per site
• Usage patterns
  - Number of visitors
  - Popularity
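A power-law degree distribution means the number of pages with degree d falls off roughly as d to the power of minus some exponent, which shows up as a straight line on a log-log plot. The sketch below estimates that exponent with an ordinary least-squares fit of log(count) against log(degree). The function names, the synthetic data, and the exponent used to generate it are assumptions for illustration; a log-log regression is a crude estimator and is used here only because it is short.

```python
import math
from collections import Counter

def degree_histogram(degrees):
    """Count how many pages have each (positive) degree."""
    return Counter(d for d in degrees if d > 0)

def fit_power_law_exponent(histogram):
    """Least-squares slope of log(count) vs. log(degree), returned as a positive exponent."""
    xs = [math.log(degree) for degree in histogram]
    ys = [math.log(count) for count in histogram.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return -covariance / variance

# Synthetic degrees: the number of pages with degree d is proportional to d ** -2.1.
# The value 2.1 is an arbitrary choice for this example.
true_exponent = 2.1
degrees = []
for d in range(1, 31):
    degrees.extend([d] * round(100000 * d ** -true_exponent))

estimate = fit_power_law_exponent(degree_histogram(degrees))
```

The same fit can be applied to any of the quantities listed above (in-degrees, out-degrees, pages per site, visitors) once they are collected as a list of counts.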

The Long Tail

• Shelf space is a scarce commodity for traditional retailers
  - Also: TV networks, movie theaters,…
• The web enables near-zero-cost dissemination of information about products
  - Action moves from Hits to Niches

Source: Chris Anderson (2004)
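To make the "Hits to Niches" point concrete, the sketch below computes what fraction of total demand lies outside the most popular items that fit on a limited shelf. It assumes, purely for illustration, that demand for the item at popularity rank r is proportional to 1 / r ** exponent (a Zipf-like power law); the function name and the catalogue sizes are hypothetical.

```python
def tail_share(num_items, shelf_size, exponent=1.0):
    """Fraction of total demand that falls on items ranked below the shelf cutoff."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_items + 1)]
    total = sum(weights)
    head = sum(weights[:shelf_size])
    return (total - head) / total

# Hypothetical catalogue of 100,000 items where a physical store stocks the top 1,000.
share_in_tail = tail_share(num_items=100_000, shelf_size=1_000)
```

Under these assumptions more than a third of demand sits in items a shelf-limited retailer never stocks, which is the part of the curve a web retailer can reach.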

The Long Tail

• More choice necessitates better filters
  - Recommendation engines (e.g., Amazon)
  - How Into Thin Air made Touching the Void a bestseller
• In fact, page rank can be seen as a long tail filter
  - Tapping into the Wisdom of Crowds

Extracting Structured Data

http://www.simplyhired.com

Extracting structured data

http://www.fatlens.com

Searching the Web

[Diagram labels: The Web, Content aggregators, Content consumers]

Ads vs. search results

• Search advertising is the revenue model
  - Multi-billion-dollar industry
  - Advertisers pay for clicks on their ads
• Interesting problems
  - What ads to show for a search?
  - If I’m an advertiser, which search terms should I bid on and how much to bid?
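One simple way to think about "what ads to show for a search" when advertisers pay per click is to rank matching ads by expected revenue per impression, that is, bid times estimated click probability. The slides do not prescribe this rule; it is an assumption for this sketch, as are the `Ad` record, the `select_ads` function, and all the numbers.

```python
from dataclasses import dataclass

@dataclass
class Ad:
    advertiser: str
    keyword: str       # search term the advertiser bid on
    bid: float         # amount paid per click (hypothetical units)
    click_rate: float  # estimated probability of a click when shown

def select_ads(ads, query, slots=2):
    """Pick the ads whose keyword appears in the query, ordered by bid * click_rate."""
    terms = query.lower().split()
    candidates = [ad for ad in ads if ad.keyword in terms]
    candidates.sort(key=lambda ad: ad.bid * ad.click_rate, reverse=True)
    return candidates[:slots]

# Hypothetical advertisers and bids.
ads = [
    Ad("alpha", "insurance", bid=2.00, click_rate=0.01),
    Ad("beta", "insurance", bid=1.00, click_rate=0.05),
    Ad("gamma", "insurance", bid=3.00, click_rate=0.005),
    Ad("delta", "flights", bid=5.00, click_rate=0.02),
]
shown = select_ads(ads, "cheap car insurance")
```

Here the highest bidder ("gamma") is not shown, because its low click rate makes its expected revenue per impression the smallest of the three matching ads. This also hints at the advertiser's side of the problem: a bid only matters in combination with how often the ad gets clicked.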

Sidebar: What’s in a name?

• Geico sued Google, contending that it owned the trademark “Geico”
  - Thus, ads for the keyword geico couldn’t be sold to others
• Court Ruling: search engines can sell keywords including trademarks
• No court ruling yet: whether the ad itself can use the trademarked word(s)

Systems architecture

[Diagram: “Classical” Data Mining on a single machine with one CPU, memory, and disk; annotated “Statistics”]

Very Large-Scale Data Mining

[Diagram: cluster of commodity nodes, each with its own CPU, memory, and disk]

Systems Issues

• Web data sets can be very large
  - Tens to hundreds of terabytes
• Cannot mine on a single server!
  - Need large farms of servers
• How to organize hardware/software to mine multi-terabyte data sets
  - Without breaking the bank!
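When the data cannot fit on one server, a common pattern is to partition the records across nodes, compute partial results on each node independently, and merge the partials. The sketch below simulates that pattern inside a single process: the shards stand in for machines, and the job (counting in-links per page) is a hypothetical example. The function names and the use of a CRC32 hash for partitioning are assumptions, not something the slides specify.

```python
import zlib
from collections import Counter

def partition(edges, num_nodes):
    """Assign each (source, target) edge to a node by hashing the target page."""
    shards = [[] for _ in range(num_nodes)]
    for src, dst in edges:
        node = zlib.crc32(dst.encode("utf-8")) % num_nodes
        shards[node].append((src, dst))
    return shards

def count_in_links(shard):
    """Work done independently on one node: count in-links within its shard."""
    return Counter(dst for _, dst in shard)

def merge(partials):
    """Combine the per-node counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Hypothetical toy edge list; a real job would stream edges from a crawl.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]
shards = partition(edges, num_nodes=3)
partials = [count_in_links(shard) for shard in shards]
in_link_counts = merge(partials)
```

Because edges are partitioned by target page, all links into a given page land on the same node, so each node's count for that page is already complete and the merge step is a cheap union.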

Web Mining Project

• Lots of interesting project ideas
  - If you can’t think of one, please come discuss with us
• Data and Infrastructure
  - Webbase data (older Stanford web crawl)
  - Recent web crawl and server courtesy of Kosmix

The World-Wide Web

Our modern-day Library of Alexandria

[Figure label: The Web]