
Hypertext

• Academia
  – Web publication
• Consumer
  – Newsgroups, communities, product reviews
• Industry and organizations
  – Health care, customer service
  – Office documents, email

The Web is bigger than the sum of its parts.

Search products and services

• Verity
• Fulcrum
• PLS
• Oracle text extender
• DB2 text extender
• Infoseek Intranet
• SMART (academic)
• Glimpse (academic)
• Inktomi (HotBot)
• Alta Vista
• Google!
• Yahoo!
• Infoseek
• Lycos
• Excite

[Roadmap diagram: crawling, indexing, and search over local data, FTP, and HTML, moving toward more structure (XML); WebSQL, WebL; relevance ranking, latent semantic indexing, social network analysis; clustering, Scatter/Gather, web communities, collaborative filtering; topic directories, topic distillation; semi-supervised learning, automatic classification, focused crawling; a monitor, mine, modify loop spanning web servers, user profiling, and web browsers.]

Keyword indexing

D1: "My care is loss of care with old care done"
D2: "Your care is gain of care with new care won"

• Boolean search
  – care AND NOT old
• Stemming
  – gain*
• Phrases and proximity
  – "new care"
  – loss NEAR/5 care

Posting lists:
  care: D1: 1, 5, 8; D2: 1, 5, 8
  new:  D2: 7
  old:  D1: 7
  loss: D1: 3
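The posting-list operations above fit in a few lines of Python. This is a minimal sketch, not the tutorial's implementation; the tokenization (and hence the positions) is illustrative and differs from the slide's numbering.

  from collections import defaultdict

  # Toy corpus from the slides; whitespace tokenization is an assumption.
  docs = {
      "d1": "my care is loss of care with old care done",
      "d2": "your care is gain of care with new care won",
  }

  # Build POSTING: term -> {doc -> [positions]}
  posting = defaultdict(lambda: defaultdict(list))
  for did, text in docs.items():
      for pos, term in enumerate(text.split(), start=1):
          posting[term][did].append(pos)

  def boolean_and_not(t1, t2):
      """Docs containing t1 but not t2, e.g. 'care AND NOT old'."""
      return set(posting[t1]) - set(posting[t2])

  def near(t1, t2, window=5):
      """Docs where t1 and t2 occur within `window` positions (NEAR/5)."""
      hits = set()
      for did in set(posting[t1]) & set(posting[t2]):
          if any(abs(a - b) < window
                 for a in posting[t1][did] for b in posting[t2][did]):
              hits.add(did)
      return hits

  print(boolean_and_not("care", "old"))  # -> {'d2'}
  print(near("loss", "care"))            # -> {'d1'}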

Tables and queries 1

POSTING(tid, did, pos):

  tid   did  pos
  care  d1   1
  care  d1   5
  care  d1   8
  care  d2   1
  care  d2   5
  care  d2   8
  new   d2   7
  old   d1   7
  loss  d1   3
  ...

Boolean NOT combined with stemming (care AND NOT gain*):

  select distinct did from POSTING where tid = 'care'
  except
  select distinct did from POSTING where tid like 'gain%'

Phrase and proximity search:

  with
    TPOS1(did, pos) as
      (select did, pos from POSTING where tid = 'new'),
    TPOS2(did, pos) as
      (select did, pos from POSTING where tid = 'care')
  select distinct TPOS1.did from TPOS1, TPOS2
  where TPOS1.did = TPOS2.did
    and proximity(TPOS1.pos, TPOS2.pos)

  proximity(a, b) ::= a + 1 = b        -- phrase: "new care"
  proximity(a, b) ::= abs(a - b) < 5   -- loss NEAR/5 care

Relevance ranking

• Recall = coverage
  – What fraction of relevant documents were reported
• Precision = accuracy
  – What fraction of reported documents were relevant
• Trade-off between the two

[Figure: evaluation loop. A query is run, and the length-k prefix of the output sequence is compared against the "true response"; plotting precision (y, 0 to 1) against recall (x, 0 to 1) traces the trade-off curve.]
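A small sketch of how the precision-recall curve is computed; the ranked list and relevant set here are invented for illustration.

  def precision_recall_at_k(ranked, relevant, k):
      """Compare the length-k prefix of the output sequence
      against the set of truly relevant documents."""
      prefix = ranked[:k]
      hits = sum(1 for d in prefix if d in relevant)
      precision = hits / k            # fraction of reported docs that are relevant
      recall = hits / len(relevant)   # fraction of relevant docs that were reported
      return precision, recall

  ranked = ["d7", "d2", "d9", "d4", "d1"]   # search engine output order (invented)
  relevant = {"d2", "d4", "d8"}             # "true response" (invented)
  for k in range(1, len(ranked) + 1):
      p, r = precision_recall_at_k(ranked, relevant, k)
      print(k, round(p, 2), round(r, 2))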

Vector space model and TFIDF

• Some words are more important than others
• W.r.t. a document collection D
  – d+ = documents that have a term, d− = documents that do not
  – "Inverse document frequency": 1 + log((d+ + d−) / d+)
• "Term frequency" (TF)
  – Many variants: n(d,t);  n(d,t) / Σ_t n(d,t);  n(d,t) / max_t n(d,t)
• Probabilistic models
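Before the SQL formulation on the next slide, here is the same weighting in Python: a sketch using the length-normalized TF variant and the IDF formula above (the corpus is the toy one from the keyword-indexing slide).

  import math
  from collections import Counter

  docs = {
      "d1": "my care is loss of care with old care done".split(),
      "d2": "your care is gain of care with new care won".split(),
  }

  N = len(docs)
  df = Counter()                    # document frequency d+ per term
  for terms in docs.values():
      df.update(set(terms))

  def tfidf(did, t):
      n = Counter(docs[did])
      tf = n[t] / sum(n.values())   # n(d,t) / sum_t n(d,t)
      idf = 1 + math.log(N / df[t]) # 1 + log((d+ + d-) / d+)
      return tf * idf

  print(round(tfidf("d1", "care"), 3))
  print(round(tfidf("d2", "new"), 3))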

Tables and queries 2

VECTOR(did, tid, elem) ::=
  with
    TEXT(did, tid, freq) as
      (select did, tid, count(distinct pos)
       from POSTING group by did, tid),
    LENGTH(did, len) as
      (select did, sum(freq) from TEXT group by did),
    DOCFREQ(tid, df) as
      (select tid, count(distinct did)
       from POSTING group by tid)
  select TEXT.did, TEXT.tid,
    (freq / len) *
      (1 + log((select count(distinct did) from POSTING) / df))
  from TEXT, LENGTH, DOCFREQ
  where TEXT.did = LENGTH.did
    and TEXT.tid = DOCFREQ.tid

Similarity

• Direct similarity
  – Cosine, normalized distance
• Indirect similarity
  – auto and car co-occur often, so they must be related (car ≅ auto)
  – Documents having related words are related: "… auto …" ≅ "… car …"
• Useful for search and clustering
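A minimal cosine-similarity sketch over sparse term-weight vectors; the two vectors are invented to show why direct similarity misses the auto/car relation.

  import math

  def cosine(u, v):
      """Cosine of the angle between two sparse term-weight vectors."""
      if not u or not v:
          return 0.0
      dot = sum(w * v.get(t, 0.0) for t, w in u.items())
      norm = lambda x: math.sqrt(sum(w * w for w in x.values()))
      return dot / (norm(u) * norm(v))

  d1 = {"auto": 0.7, "engine": 0.3}
  d2 = {"car": 0.8, "engine": 0.4}
  print(round(cosine(d1, d2), 3))  # low: 'auto' and 'car' never match directly
  # Capturing the indirect auto ~ car relation is what LSI (next slide) is for.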

Latent semantic indexing

[Figure: singular value decomposition of the t × d term-document matrix A. A = U D Vᵀ, with U (t × r) holding term vectors, D (r × r) the singular values, and Vᵀ (r × d) holding document vectors; truncating to the k largest singular values maps each document (and term) to a k-dim vector.]
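A sketch of LSI using numpy's SVD; the tiny term-document matrix and k = 2 are invented for illustration.

  import numpy as np

  # Rows = terms, columns = documents (toy counts, invented).
  terms = ["auto", "car", "engine", "flower"]
  A = np.array([[1, 0, 1, 0],
                [0, 1, 1, 0],
                [1, 1, 0, 0],
                [0, 0, 0, 1]], dtype=float)

  U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U diag(s) Vt
  k = 2                                             # keep k largest singular values
  doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T            # each document -> k-dim vector

  # Documents dominated by 'auto' and by 'car' land close together in k-dim space.
  print(np.round(doc_vecs, 2))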

Supervised learning (classification)

• Many forms
  – Content: automatically organize the web per Yahoo!
  – Type: faculty, student, staff
  – Intent: education, discussion, comparison, advertisement
• Applications
  – Relevance feedback for re-scoring query responses
  – Filtering news, email, etc.
  – Narrowing searches and selective data acquisition

Difficulties

• Dimensionality
  – Classifiers: dozens of columns
  – Vector space model: 50,000 'columns'
• Context-dependent noise
  – 'Can' (v.) considered a 'stopword'
  – 'Can' (n.) may not be a stopword in /Yahoo/SocietyCulture/Environment/Recycling
• Need for scalability
  – High dimension needs more data to learn

Techniques

• Nearest neighbor
  + Standard keyword index also supports classification
  – How to define similarity? (TFIDF may not work)
  – Wastes space by storing individual document info
• Rule-based, decision-tree based
  – Very slow to train (but quick to test)
  + Good accuracy (but brittle rules)
• Model-based
  + Fast training and testing with small footprint

The "bag-of-words" document model

• Decide topic; topic c is picked with prior probability π(c); Σ_c π(c) = 1
• Each topic c has parameters θ(c,t) for terms t
• Coin with face probabilities: Σ_t θ(c,t) = 1
• Fix document length n(d) and keep tossing the coin
• Given c, the probability of document d is

    Pr[d | c] = (n(d) choose {n(d,t)}) · Π_{t∈d} θ(c,t)^n(d,t)
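A minimal sketch of the bag-of-words model used as a naive Bayes classifier. The training data is invented, add-one smoothing is an assumption beyond the slide, and the multinomial coefficient is omitted since it cancels when comparing classes.

  import math
  from collections import Counter

  train = [("arts", "film gallery film paint"),
           ("science", "physics math proof physics")]

  prior, theta = Counter(), {}
  vocab = set(t for _, d in train for t in d.split())
  for c, d in train:
      prior[c] += 1
      n = Counter(d.split())
      total = sum(n.values())
      # theta(c,t) with add-one smoothing so unseen terms keep Pr > 0
      theta[c] = {t: (n[t] + 1) / (total + len(vocab)) for t in vocab}

  def classify(doc):
      n = Counter(doc.split())
      scores = {c: math.log(prior[c] / len(train)) +
                   sum(cnt * math.log(theta[c][t])
                       for t, cnt in n.items() if t in vocab)
                for c in prior}
      return max(scores, key=scores.get)

  print(classify("math proof"))   # -> 'science'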

Feature selection

[Figure: two class-conditional models over the term set T with unknown parameters p1, p2, … (one class) and q1, q2, … (the other); the observed data gives 0/1 term occurrences over N documents, with confidence intervals around each estimated parameter.]

• Pick F ⊆ T such that models built over F have high separation confidence
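One simple instantiation of this idea, as a hedged sketch: the scoring rule below (keep a term when normal-approximation confidence intervals around its per-class occurrence rates do not overlap) is an assumption for illustration, not the tutorial's exact method.

  import math

  def separated(pos_docs, neg_docs, t, z=1.96):
      """Keep term t if the confidence intervals around its
      occurrence rates in the two classes do not overlap."""
      def rate_and_halfwidth(docs):
          n = len(docs)
          p = sum(t in d for d in docs) / n
          return p, z * math.sqrt(p * (1 - p) / n)  # normal-approx CI
      p, wp = rate_and_halfwidth(pos_docs)
      q, wq = rate_and_halfwidth(neg_docs)
      return abs(p - q) > wp + wq

  pos = [{"physics", "proof"}, {"math", "proof"}, {"physics"}]   # invented
  neg = [{"film"}, {"film", "gallery"}, {"paint"}]               # invented
  F = sorted(t for t in {"physics", "proof", "film", "math"}
             if separated(pos, neg, t))
  print(F)   # 'math' is dropped: too few observations to separate confidently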

Tables and queries 3

TAXONOMY(pcid, kcid, kcname):      Taxonomy tree:

  pcid  kcid  kcname                     1
  1     2     Arts                     /   \
  1     3     Science                 2     3
  3     4     Math                        /   \
  3     5     Physics                   4       5

EGMAP(did, kcid) maps example documents to their classes; TEXT(did, tid, freq) holds per-document term frequencies.

EGMAPR(did, kcid) ::=
  ((select did, kcid from EGMAP)
   union all
   (select e.did, t.pcid
    from EGMAPR as e, TAXONOMY as t
    where e.kcid = t.kcid))

STAT(pcid, tid, kcid, kdoc, knum) ::=
  (select pcid, tid, TAXONOMY.kcid,
     count(distinct TEXT.did), sum(freq)
   from EGMAPR, TAXONOMY, TEXT
   where TAXONOMY.kcid = EGMAPR.kcid
     and EGMAPR.did = TEXT.did
   group by pcid, tid, TAXONOMY.kcid)

Clustering

• Standard notion from structured data analysis
• Techniques (a k-means sketch follows below)
  – Agglomerative, k-means
  – Mixture models and Expectation Maximization
• How to reduce distance computation time?
  – Sample points or directions at random
  – Pre-cluster (eliminate redundancy)
  – Project remaining points onto a subspace

Collaborative filtering

• People = records, movies = features; cluster people
• Both people and features can be clustered
• For hypertext access, time of access is a feature
• Need advanced models

[Figure: people × movies matrix; rows Lyle, Ellen, Jason, Fred, Dean, Karen; columns Batman, Rambo, Andre, Hiver, Whispers, StarWars.]
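A minimal k-means sketch over the people × movies idea, tying the two slides above together; the 0/1 "watched" matrix is invented.

  import numpy as np

  # Rows = people, columns = movies (invented 0/1 viewing data).
  X = np.array([[1, 1, 0, 0],    # action fans
                [1, 1, 1, 0],
                [0, 0, 1, 1],    # drama fans
                [0, 1, 1, 1]], dtype=float)

  def kmeans(X, k=2, iters=20, seed=0):
      rng = np.random.default_rng(seed)
      centers = X[rng.choice(len(X), size=k, replace=False)]
      for _ in range(iters):
          # Assign each person to the nearest center, then re-average.
          dists = ((X[:, None] - centers[None]) ** 2).sum(axis=2)
          labels = dists.argmin(axis=1)
          centers = np.array([X[labels == j].mean(axis=0)
                              if (labels == j).any() else centers[j]
                              for j in range(k)])
      return labels

  print(kmeans(X))   # e.g. [0 0 1 1]: people cluster by taste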

Hypertext models

• c = class, t = text, N = neighbors
• Text-only model: Pr[t | c]
• Using neighbors' text to judge my topic: Pr[t, t(N) | c]
• Better model: Pr[t, c(N) | c]
• Non-linear relaxation

Result of exploiting link features

• Pretend to know only a% of neighborhood topics
• Bootstrap using the text-only classifier
• Use non-linear relaxation to update topic assignments iteratively
☛ Link information reduces error significantly

[Figure: %Error (0 to 40) vs. %Neighborhood known (0 to 100) for the Text, Link, and Text+Link classifiers.]

Hyperlink graph analysis

• The Web is a social network
  – Telephoned, advised, co-authored, paid, cited
• Social network theory (cf. Wasserman & Faust)
  – Extensive research applying graph notions
  – Centrality
  – Prestige
  – Reflected prestige
• Can be applied directly to Web search
  – HITS, Google, CLEVER, topic distillation

Google and HITS

• In-degree ≈ prestige
• Not all votes are worth the same
• Prestige of a page is the sum of the prestige of citing pages: p = Ep
• Pre-compute a query-independent prestige score
• High prestige ⇔ good authority
• High reflected prestige ⇔ good hub
• Bipartite iteration (sketched below)
  – a = Eh
  – h = Eᵀa
  – h = EᵀEh
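A sketch of the hub/authority iteration on an invented toy graph. Here E[u, v] = 1 when u links to v; the slide's E is oriented the other way, so the same updates read a = Eᵀh and h = Ea in this convention (an assumption about orientation, nothing more).

  import numpy as np

  # E[u, v] = 1 if page u links to page v (toy graph, invented).
  E = np.array([[0, 1, 1, 0],
                [0, 0, 1, 0],
                [0, 0, 0, 1],
                [1, 0, 0, 0]], dtype=float)

  h = np.ones(len(E))
  for _ in range(50):
      a = E.T @ h             # authorities: endorsed by good hubs
      a /= np.linalg.norm(a)  # normalize so the iteration converges
      h = E @ a               # hubs: point at good authorities
      h /= np.linalg.norm(h)

  print(np.round(a, 3), np.round(h, 3))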

Tables and queries 4

HUBS(url, score), AUTH(url, score)
LINK(urlsrc, urldst, ipsrc, ipdst, wgtfwd, wgtrev, type)

Recompute hub scores from authority scores along non-local links:

  delete from HUBS;
  insert into HUBS(url, score)
    (select urlsrc, sum(score * wgtrev)
     from AUTH, LINK
     where authwt is not null and type = 'non-local'
       and ipdst <> ipsrc and url = urldst
     group by urlsrc);
  update HUBS set score = score /
    (select sum(score) from HUBS);

Set forward edge weights so that many links from one site to the same target share a single vote:

  update LINK as X set wgtfwd = 1. /
    (select count(ipsrc) from LINK
     where ipsrc = X.ipsrc
       and urldst = X.urldst)
  where type = 'non-local';

Resource discovery

[Architecture diagram of the focused crawler; components: Taxonomy Editor and Browser, Example/Feedback input, Topic selection, Taxonomy Database, Scheduler, Distiller, Crawler with Crawl Workers, Crawl Database, and a Hypertext Classifier that learns models and applies them.]
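A heavily hedged sketch of the control loop the diagram implies. Every name here (fetch, out_links, relevance) is a placeholder for a component in the figure, not the tutorial's API, and the priority policy is one plausible choice.

  import heapq

  def focused_crawl(seeds, relevance, fetch, out_links,
                    budget=1000, threshold=0.5):
      """Priority-queue crawl: expand pages the hypertext classifier
      scores as relevant to the chosen taxonomy topics first."""
      frontier = [(-1.0, url) for url in seeds]   # max-heap via negated scores
      heapq.heapify(frontier)
      seen, harvested = set(seeds), []
      while frontier and len(harvested) < budget:
          _, url = heapq.heappop(frontier)
          page = fetch(url)
          score = relevance(page)                 # classifier: Pr[topic | page]
          if score >= threshold:                  # soft focus: expand relevant pages
              harvested.append(url)
              for link in out_links(page):
                  if link not in seen:
                      seen.add(link)
                      heapq.heappush(frontier, (-score, link))
      return harvested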

Resource discovery results

• High rate of "harvesting" relevant pages
• Robust to perturbations of the starting URLs
• Great resources found within about 10 links of the start set

[Figure, four panels: (1) Harvest Rate (Cycling, Unfocused) and (2) Harvest Rate (Cycling, Soft Focus): average relevance (0 to 1, averaged over windows of 100 and 1000 pages) vs. #URLs fetched; (3) URL Coverage: fraction of the reference crawl vs. #URLs crawled by the test crawler; (4) Distance to top authorities: frequency vs. shortest distance found (#links, 1 to 12).]

Database issues

• Useful features
  + Concurrency and recovery (for crawling)
  + I/O-efficient representation of mining algorithms
  + Enables ad-hoc semi-structured queries
• Would help to have
  – Unlogged tablespaces, flexible choice of recovery
  – Index(-ed scans) over temporary table expressions
  – Efficient string storage and operations
  – Answering multiple queries approximately