Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Pandu Nayak and Prabhakar Raghavan
Lecture 15: Web search basics

Brief (non-technical) history
. Early keyword-based engines ca. 1995-1997
  . Altavista, Excite, Infoseek, Inktomi, Lycos
. Paid search ranking: Goto (morphed into Overture.com → Yahoo!)
  . Your search ranking depended on how much you paid
  . Auction for keywords: casino was expensive!
. 1998+: Link-based ranking pioneered by Google
  . Blew away all early engines save Inktomi
  . Great user experience in search of a business model
  . Meanwhile Goto/Overture's annual revenues were nearing $1 billion
. Result: Google added paid search "ads" to the side, independent of search results
  . Yahoo followed suit, acquiring Overture (for paid placement) and Inktomi (for search)
. 2005+: Google gains search share, dominating in Europe and very strong in North America
. 2009: Yahoo! and Microsoft propose combined paid search offering
[Screenshot: a results page with paid search ads displayed alongside the algorithmic results]
Web search basics (Sec. 19.4.1)
[Diagram: the user issues searches against the engine's indexes (including ad indexes); a web spider crawls the Web and feeds the indexer that builds those indexes]

User Needs (Sec. 19.4.1)
. Need [Brod02, RL04]
  . Informational – want to learn about something (~40% / 65%)
  . Navigational – want to go to that page (~25% / 15%)
  . Transactional – want to do something (web-mediated) (~35% / 20%)
    . Access a service
    . Downloads
    . Shop
  . Gray areas
    . Find a good hub
    . Exploratory search "see what's there"
How far do people look for results?
[Chart omitted. Source: iprospect.com, WhitePaper_2006_SearchEngineUserBehavior.pdf]

Users' empirical evaluation of results
. Quality of pages varies widely
  . Relevance is not enough
  . Other desirable qualities (non IR!!)
    . Content: trustworthy, diverse, non-duplicated, well maintained
    . Web readability: display correctly & fast
    . No annoyances: pop-ups, etc.
. Precision vs. recall
  . On the web, recall seldom matters
  . What matters
    . Precision at 1? Precision above the fold?
    . Comprehensiveness – must be able to deal with obscure queries
      . Recall matters when the number of matches is very small
. User perceptions may be unscientific, but are significant over a large aggregate
Users' empirical evaluation of engines
. Relevance and validity of results
. UI – simple, no clutter, error tolerant
. Trust – results are objective
. Coverage of topics for polysemic queries
. Pre/post process tools provided
  . Mitigate user errors (auto spell check, search assist, …)
  . Explicit: search within results, more like this, refine ...
  . Anticipative: related searches
. Deal with idiosyncrasies
  . Web-specific vocabulary
    . Impact on stemming, spell-check, etc.
  . Web addresses typed in the search box
    . "The first, the last, the best and the worst …"

The Web document collection (Sec. 19.2)
. No design/co-ordination
. Distributed content creation, linking, democratization of publishing
. Content includes truth, lies, obsolete information, contradictions …
. Unstructured (text, html, …), semi-structured (XML, annotated photos), structured (databases) …
. Scale much larger than previous text collections … but corporate records are catching up
. Growth – slowed down from initial "volume doubling every few months" but still expanding
. Content can be dynamically generated
The trouble with paid search ads …
. It costs money. What's the alternative?
. Search Engine Optimization:
  . "Tuning" your web page to rank highly in the algorithmic search results for select keywords
  . Alternative to paying for placement
  . Thus, intrinsically a marketing function
. Performed by companies, webmasters and consultants ("search engine optimizers") for their clients
. Some perfectly legitimate, some very shady

SPAM (SEARCH ENGINE OPTIMIZATION)
Search engine optimization (Spam) (Sec. 19.2.2)
. Motives
  . Commercial, political, religious, lobbies
  . Promotion funded by advertising budget
. Operators
  . Contractors (search engine optimizers) for lobbies, companies
  . Web masters
  . Hosting services
. Forums
  . E.g., Web master world (www.webmasterworld.com)
    . Search engine specific tricks
    . Discussions about academic papers

Simplest forms (Sec. 19.2.2)
. First generation engines relied heavily on tf/idf
  . The top-ranked pages for the query maui resort were the ones containing the most maui's and resort's
. SEOs responded with dense repetitions of chosen terms
  . e.g., maui resort maui resort maui resort
. Often, the repetitions would be in the same color as the background of the web page
  . Repeated terms got indexed by crawlers
  . But not visible to humans on browsers
Pure word density cannot be trusted as an IR signal
Variants of keyword stuffing (Sec. 19.2.2)
. Misleading meta-tags, excessive repetition
. Hidden text with colors, style sheet tricks, etc.
. Example of spam meta-tags:
    Meta-Tags = "… London hotels, hotel, holiday inn, hilton, discount, booking, reservation, sex, mp3, britney spears, viagra, …"

Cloaking (Sec. 19.2.2)
. Serve fake content to search engine spider
. DNS cloaking: switch IP address; impersonate
[Diagram: the server asks "Is this a search engine spider?" – if Yes, it returns the cloaked doc; if No, the real doc]
More spam techniques (Sec. 19.2.2)
. Doorway pages
  . Pages optimized for a single keyword that re-direct to the real target page
. Link spamming
  . Mutual admiration societies, hidden links, awards – more on these later
  . Domain flooding: numerous domains that point or re-direct to a target page
. Robots
  . Fake query stream – rank checking programs
    . "Curve-fit" ranking programs of search engines
  . Millions of submissions via Add-Url

The war against spam
. Quality signals – prefer authoritative pages based on:
  . Votes from authors (linkage signals)
  . Votes from users (usage signals)
. Policing of URL submissions
  . Anti robot test
. Limits on meta-keywords
. Robust link analysis
  . Ignore statistically implausible linkage (or text)
  . Use link analysis to detect spammers (guilt by association)
. Spam recognition by machine learning
  . Training set based on known spam
. Family friendly filters
  . Linguistic analysis, general classification techniques, etc.
  . For images: flesh tone detectors, source text analysis, etc.
. Editorial intervention
  . Blacklists
  . Top queries audited
  . Complaints addressed
  . Suspect pattern detection
More on spam
. Web search engines have policies on SEO practices they tolerate/block
  . http://help.yahoo.com/help/us/ysearch/index.html
  . http://www.google.com/intl/en/webmasters/
. Adversarial IR: the unending (technical) battle between SEOs and web search engines
. Research: http://airweb.cse.lehigh.edu/

SIZE OF THE WEB
What is the size of the web? (Sec. 19.5)
. Issues
  . The web is really infinite
    . Dynamic content, e.g., calendars
    . Soft 404: www.yahoo.com/<anything> is a valid page

What can we attempt to measure? (Sec. 19.5)
. The relative sizes of search engines
  . The notion of a page being indexed is still reasonably well defined
New definition? (Sec. 19.5)
. The statically indexable web is whatever search engines index.
  . IQ is whatever the IQ tests measure.
. Different engines have different preferences
  . max url depth, max count/host, anti-spam rules, priority rules, etc.
. Different engines index different things under the same URL:
  . frames, meta-keywords, document restrictions, document extensions, ...

Relative Size from Overlap (Sec. 19.5)
Given two engines A and B:
. Sample URLs randomly from A
. Check if contained in B, and vice versa
. Suppose we find
    A ∩ B = (1/2) * Size A
    A ∩ B = (1/6) * Size B
. Then (1/2) * Size A = (1/6) * Size B
    ∴ Size A / Size B = (1/6) / (1/2) = 1/3
. Each test involves: (i) sampling, (ii) checking (the resulting estimator is sketched in code below)
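The arithmetic above generalizes directly. Below is a minimal Python sketch (not from the lecture) of the overlap-based estimator; sample_from_A, check_in_B, sample_from_B and check_in_A are hypothetical stand-ins for the sampling and checking procedures discussed on the following slides.

    def relative_size(sample_from_A, check_in_B, sample_from_B, check_in_A, n=1000):
        # Fraction of A's sampled pages also indexed by B: estimates |A ∩ B| / |A|
        frac_a = sum(check_in_B(sample_from_A()) for _ in range(n)) / n
        # Fraction of B's sampled pages also indexed by A: estimates |A ∩ B| / |B|
        frac_b = sum(check_in_A(sample_from_B()) for _ in range(n)) / n
        # |A ∩ B| = frac_a * |A| = frac_b * |B|  =>  |A| / |B| = frac_b / frac_a
        return frac_b / frac_a

With the numbers on the slide (half of A's sample found in B, one sixth of B's sample found in A), this returns 1/3.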
Sampling URLs (Sec. 19.5)
. Ideal strategy: generate a random URL and check for containment in each index.
. Problem: random URLs are hard to find! Enough to generate a random URL contained in a given engine.
. Approach 1: generate a random URL contained in a given engine
  . Suffices for the estimation of relative size
. Approach 2: random walks / IP addresses
  . In theory: might give us a true estimate of the size of the web (as opposed to just relative sizes of indexes)

Statistical methods (Sec. 19.5)
. Approach 1
  . Random queries
  . Random searches
. Approach 2
  . Random IP addresses
  . Random walks
Random URLs from random queries (Sec. 19.5)
. Generate random query: how?
  . Lexicon: 400,000+ words from a web crawl (not an English dictionary)
  . Conjunctive queries: w1 and w2
    . e.g., vocalists AND rsi
. Get 100 result URLs from engine A
. Choose a random URL as the candidate to check for presence in engine B
. This distribution induces a probability weight W(p) for each page.

Query Based Checking (Sec. 19.5)
. "Strong query" to check whether an engine B has a document D (sketched below):
  . Download D. Get list of words.
  . Use 8 low-frequency words as AND query to B
  . Check if D is present in result set.
. Problems:
  . Near duplicates
  . Frames
  . Redirects
  . Engine time-outs
  . Is 8-word query good enough?
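A sketch of the "strong query" check, assuming a hypothetical search_engine_B(query) function that returns result URLs and a document-frequency table df for picking low-frequency words; it ignores the near-duplicate, frame and redirect problems listed above.

    import re

    def strong_query_check(doc_text, doc_url, search_engine_B, df, k=8):
        # search_engine_B(query) -> list of result URLs   (hypothetical API)
        # df: dict mapping word -> estimated document frequency
        words = set(re.findall(r"[a-z]+", doc_text.lower()))
        rare = sorted(words, key=lambda w: df.get(w, 0))[:k]   # k rarest words in D
        query = " AND ".join(rare)
        results = search_engine_B(query)
        # Naive containment test; in practice near-duplicates, frames and
        # redirects make this unreliable, as the slide notes.
        return doc_url in results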
Advantages & disadvantages (Sec. 19.5)
. Statistically sound under the induced weight.
. Biases induced by random query
  . Query bias: favors content-rich pages in the language(s) of the lexicon
  . Ranking bias: solution: use conjunctive queries & fetch all
  . Checking bias: duplicates, impoverished pages omitted
  . Document or query restriction bias: engine might not deal properly with an 8-word conjunctive query
  . Malicious bias: sabotage by engine
  . Operational problems: time-outs, failures, engine inconsistencies, index modification.

Random searches (Sec. 19.5)
. Choose random searches extracted from a local log [Lawrence & Giles 97] or build "random searches" [Notess]
  . Use only queries with small result sets.
  . Count normalized URLs in result sets.
  . Use ratio statistics
Advantages & disadvantages (Sec. 19.5)
. Advantage
  . Might be a better reflection of the human perception of coverage
. Issues
  . Samples are correlated with source of log
  . Duplicates
  . Technical statistical problems (must have non-zero results, ratio average not statistically sound)

Random searches (Sec. 19.5)
. 575 & 1050 queries from the NEC RI employee logs
. 6 engines in 1998, 11 in 1999
. Implementation:
  . Restricted to queries with < 600 results in total
  . Counted URLs from each engine after verifying query match
  . Computed size ratio & overlap for individual queries
  . Estimated index size ratio & overlap by averaging over all queries (see the sketch below)
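A rough sketch (not the NEC implementation) of the per-query bookkeeping just described; results_A and results_B stand for the verified, normalized result URL sets that one query yields from the two engines.

    def per_query(results_A, results_B):
        # results_A, results_B: sets of normalized URLs that a single query
        # (with a small total result set) returns from engines A and B.
        overlap = len(results_A & results_B)
        ratio = len(results_A) / len(results_B)
        return ratio, overlap

    def estimate(all_results):
        # all_results: list of (results_A, results_B) pairs, one per query.
        # Average the per-query size ratios, as in the 1998/99 studies;
        # note the caveat above: averaging ratios is not statistically sound.
        ratios = [per_query(a, b)[0] for a, b in all_results if b]
        return sum(ratios) / len(ratios)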
Queries from Lawrence and Giles study (Sec. 19.5)
. adaptive access control
. neighborhood preservation topographic
. hamiltonian structures
. right linear grammar
. pulse width modulation neural
. unbalanced prior probabilities
. ranked assignment method
. internet explorer favourites importing
. fat shattering dimension
. abelson amorphous computing
. softmax activation function
. bose multidimensional system theory
. gamma mlp
. dvi2pdf
. john oliensis
. rieke spikes exploring neural
. video watermarking
. counterpropagation network
. karvel thornber
. zili liu

Random IP addresses (Sec. 19.5)
. Generate random IP addresses
. Find a web server at the given address
  . If there's one
. Collect all pages from server
  . From this, choose a page at random
Random IP addresses (Sec. 19.5)
. HTTP requests to random IP addresses (a probing sketch follows below)
  . Ignored: empty, authorization required, or excluded
. [Lawr99] Estimated 2.8 million IP addresses running crawlable web servers (16 million total) from observing 2500 servers.
  . OCLC using IP sampling found 8.7M hosts in 2001
  . Netcraft [Netc02] accessed 37.2 million hosts in July 2002
. [Lawr99] exhaustively crawled 2500 servers and extrapolated
  . Estimated size of the web to be 800 million pages
  . Estimated use of metadata descriptors:
    . Meta tags (keywords, description) in 34% of home pages, Dublin Core metadata in 0.3%

Advantages & disadvantages (Sec. 19.5)
. Advantages
  . Clean statistics
  . Independent of crawling strategies
. Disadvantages
  . Doesn't deal with duplication
  . Many hosts might share one IP, or not accept requests
  . No guarantee all pages are linked to the root page.
    . E.g.: employee pages
  . Power law for # pages/host generates bias towards sites with few pages.
    . But bias can be accurately quantified IF underlying distribution understood
  . Potentially influenced by spamming (multiple IPs for same server to avoid IP block)
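A minimal sketch of random-IP probing in Python. It really opens network connections; the studies above also excluded reserved address ranges and then crawled the pages found, which this sketch does not.

    import random
    import socket

    def random_ip():
        # Uniformly random 32-bit address (reserved/private ranges not excluded).
        return ".".join(str(random.randint(0, 255)) for _ in range(4))

    def has_web_server(ip, timeout=2.0):
        # "Runs a web server" here just means something accepts a TCP
        # connection on port 80 within the timeout.
        try:
            with socket.create_connection((ip, 80), timeout=timeout):
                return True
        except OSError:
            return False

    sample = [random_ip() for _ in range(100)]
    fraction = sum(has_web_server(ip) for ip in sample) / len(sample)
    print(fraction)   # crude estimate of the fraction of IPs answering on port 80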
Random walks (Sec. 19.5)
. View the Web as a directed graph
. Build a random walk on this graph (a toy sampler is sketched below)
  . Includes various "jump" rules back to visited sites
    . Does not get stuck in spider traps!
    . Can follow all links!
  . Converges to a stationary distribution
    . Must assume graph is finite and independent of the walk.
    . Conditions are not satisfied (cookie crumbs, flooding)
    . Time to convergence not really known
  . Sample from stationary distribution of walk
  . Use the "strong query" method to check coverage by SE

Advantages & disadvantages (Sec. 19.5)
. Advantages
  . "Statistically clean" method, at least in theory!
  . Could work even for infinite web (assuming convergence) under certain metrics.
. Disadvantages
  . List of seeds is a problem.
  . Practical approximation might not be valid.
  . Non-uniform distribution
    . Subject to link spamming
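A toy random-walk sampler over a link graph, just to make the "jump rule" concrete. Here graph is a hypothetical dict from URL to out-links (a crawled snapshot); a real walker fetches links on the fly.

    import random

    def random_walk_sample(graph, seeds, jump_prob=0.15, steps=100000):
        # graph: dict mapping a URL to a list of out-link URLs.
        visited = list(seeds)
        page = random.choice(visited)
        counts = {}
        for _ in range(steps):
            out_links = graph.get(page, [])
            if not out_links or random.random() < jump_prob:
                page = random.choice(visited)        # "jump" back to a visited site
            else:
                page = random.choice(out_links)      # follow a random out-link
                visited.append(page)
            counts[page] = counts.get(page, 0) + 1
        # After (approximate) convergence, pages drawn from the tail of the walk
        # are roughly samples from its stationary distribution; the "strong query"
        # method can then check whether a search engine covers each sample.
        return counts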
Conclusions (Sec. 19.5)
. No sampling solution is perfect.
. Lots of new ideas ...
. ... but the problem is getting harder
. Quantitative studies are fascinating and a good research problem

DUPLICATE DETECTION
Duplicate documents (Sec. 19.6)
. The web is full of duplicated content
. Strict duplicate detection = exact match
  . Not as common
. But many, many cases of near duplicates
  . E.g., last-modified date the only difference between two copies of a page

Duplicate/Near-Duplicate Detection (Sec. 19.6)
. Duplication: exact match can be detected with fingerprints (sketched below)
. Near-duplication: approximate match
  . Overview
    . Compute syntactic similarity with an edit-distance measure
    . Use similarity threshold to detect near-duplicates
      . E.g., similarity > 80% => documents are "near duplicates"
      . Not transitive, though sometimes used transitively
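A minimal illustration of exact-match detection with fingerprints. Real systems use cheap 64-bit fingerprints (e.g., Rabin fingerprints) rather than a cryptographic hash, which is used here only for convenience.

    import hashlib

    def fingerprint(text):
        # Exact-duplicate fingerprint: hash of the normalized page content.
        normalized = " ".join(text.split())
        return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

    def exact_duplicate_groups(docs):
        # docs: dict url -> page text. Group URLs whose fingerprints collide.
        groups = {}
        for url, text in docs.items():
            groups.setdefault(fingerprint(text), []).append(url)
        return [urls for urls in groups.values() if len(urls) > 1]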
Computing Similarity (Sec. 19.6)
. Features:
  . Segments of a document (natural or artificial breakpoints)
  . Shingles (word N-grams)
    . a rose is a rose is a rose →
        a_rose_is_a
        rose_is_a_rose
        is_a_rose_is
        a_rose_is_a
. Similarity measure between two docs (= sets of shingles)
  . Jaccard coefficient: Size_of_Intersection / Size_of_Union (computed in the sketch below)

Shingles + Set Intersection (Sec. 19.6)
. Computing exact set intersection of shingles between all pairs of documents is expensive/intractable
  . Approximate using a cleverly chosen subset of shingles from each (a sketch)
. Estimate (size_of_intersection / size_of_union) based on a short sketch
[Diagram: Doc A → shingle set A → sketch A; Doc B → shingle set B → sketch B; the Jaccard coefficient of the shingle sets is estimated from the two sketches]
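The shingling and Jaccard computation in a few lines of Python (word 4-grams, as in the rose example above):

    def shingles(text, k=4):
        # Word k-grams ("shingles") of a document, as a set.
        words = text.lower().split()
        return {"_".join(words[i:i + k]) for i in range(len(words) - k + 1)}

    def jaccard(s1, s2):
        # Size_of_Intersection / Size_of_Union
        return len(s1 & s2) / len(s1 | s2)

    d1 = shingles("a rose is a rose is a rose")
    print(d1)   # {'a_rose_is_a', 'rose_is_a_rose', 'is_a_rose_is'}; duplicates collapse in a set
    print(jaccard(d1, shingles("a rose is a rose is a flower")))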
Sketch of a document (Sec. 19.6)
. Create a "sketch vector" (of size ~200) for each document
  . Documents that share ≥ t (say 80%) corresponding vector elements are near duplicates
. For doc D, sketchD[i] is as follows:
  . Let f map all shingles in the universe to 0..2^m - 1 (e.g., f = fingerprinting)
  . Let πi be a random permutation on 0..2^m - 1
  . Pick MIN {πi(f(s))} over all shingles s in D (see the code sketch below)

Computing Sketch[i] for Doc1 (Sec. 19.6)
. Start with the 64-bit values f(shingle) of Document 1 on the number line 0..2^64 - 1
. Permute the number line with πi
. Pick the min value
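A sketch of the sketch-vector computation. The random permutations πi are approximated here by a universal hash family (a·x + b mod p), a common simplification rather than what the slide literally prescribes.

    import hashlib
    import random

    P = (1 << 61) - 1   # a Mersenne prime; permuted values live in 0..P-1

    def f(shingle):
        # Fingerprint: map a shingle to a 64-bit integer (here via SHA-1).
        return int.from_bytes(hashlib.sha1(shingle.encode()).digest()[:8], "big")

    def make_sketch(shingle_set, num_perms=200, seed=0):
        # sketch[i] = MIN over shingles s of pi_i(f(s)), where the "random
        # permutation" pi_i is approximated by pi_i(x) = (a*x + b) mod P.
        rng = random.Random(seed)
        params = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(num_perms)]
        values = [f(s) % P for s in shingle_set]
        return [min((a * v + b) % P for v in values) for (a, b) in params]

    def sketch_similarity(sk1, sk2):
        # Fraction of coordinates that agree: an estimate of the Jaccard
        # coefficient of the two shingle sets.
        return sum(x == y for x, y in zip(sk1, sk2)) / len(sk1)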
Test if Doc1.Sketch[i] = Doc2.Sketch[i] (Sec. 19.6)
[Diagram: the permuted shingle values of Document 1 and Document 2 on the 0..2^64 - 1 number line; A and B are the minimum values for the two documents]
. Are these equal?
  . Test for 200 random permutations: π1, π2, … π200

However… (Sec. 19.6)
. A = B iff the shingle with the MIN value in the union of Doc1 and Doc2 is common to both (i.e., lies in the intersection)
. Claim: this happens with probability Size_of_intersection / Size_of_union
Set Similarity of sets Ci, Cj (Sec. 19.6)
. View sets as columns of a matrix A; one row for each element in the universe. aij = 1 indicates presence of item i in set j
. Example
      C1  C2
       0   1
       1   0
       1   1
       0   0        Jaccard(C1, C2) = 2/5 = 0.4
       1   1
       0   1

Key Observation (Sec. 19.6)
. For columns Ci, Cj, there are four types of rows
          Ci  Cj
      A    1   1
      B    1   0
      C    0   1
      D    0   0
. Overload notation: A = # of rows of type A (likewise B, C, D)
. Claim: Jaccard(Ci, Cj) = A / (A + B + C)
"Min" Hashing (Sec. 19.6)
. Randomly permute rows
. Hash h(Ci) = index of first row with 1 in column Ci
. Surprising property: P[h(Ci) = h(Cj)] = Jaccard(Ci, Cj)
  . Why? Both are A/(A+B+C)
    . Look down columns Ci, Cj until the first non-Type-D row
    . h(Ci) = h(Cj) ⟺ it is a type A row

Min-Hash sketches (Sec. 19.6)
. Pick P random row permutations
. MinHash sketch
  . SketchD = list of P indexes of first rows with 1 in column CD
. Similarity of signatures
  . Let sim[sketch(Ci), sketch(Cj)] = fraction of permutations where MinHash values agree
  . Observe: E[sim(sketch(Ci), sketch(Cj))] = Jaccard(Ci, Cj) (checked by the simulation below)
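A tiny simulation (not from the slides) checking that the fraction of agreeing MinHash values tracks the Jaccard coefficient, using the C1, C2 columns from the "Set Similarity" example above.

    import random

    def minhash_signatures(columns, num_perms=1000, seed=0):
        # columns: list of sets of row indices (the rows where the column has a 1).
        # For each random permutation of the rows, record for every column the
        # position of the first row (in permuted order) that contains a 1.
        rows = sorted(set().union(*columns))
        rng = random.Random(seed)
        sigs = []
        for _ in range(num_perms):
            order = {r: i for i, r in enumerate(rng.sample(rows, len(rows)))}
            sigs.append([min(order[r] for r in col) for col in columns])
        return sigs

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    # C1 and C2 from the example: 1s in rows {2, 3, 5} and {1, 3, 5, 6}.
    C1, C2 = {2, 3, 5}, {1, 3, 5, 6}
    sigs = minhash_signatures([C1, C2])
    agree = sum(s[0] == s[1] for s in sigs) / len(sigs)
    print(agree, jaccard(C1, C2))   # both should be close to 0.4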
Example (Sec. 19.6)

  Input matrix                    Signatures
        C1  C2  C3                                  S1  S2  S3
  R1     1   0   1                Perm 1 = (12345)   1   2   1
  R2     0   1   1                Perm 2 = (54321)   4   5   4
  R3     1   0   0                Perm 3 = (34512)   3   5   4
  R4     1   0   1
  R5     0   1   0

  Similarities
            1-2   1-3   2-3
  Col-Col   0.00  0.50  0.25
  Sig-Sig   0.00  0.67  0.00

All signature pairs (Sec. 19.6)
. Now we have an extremely efficient method for estimating a Jaccard coefficient for a single pair of documents.
. But we still have to estimate N^2 coefficients where N is the number of web pages.
  . Still slow
. One solution: locality sensitive hashing (LSH)
. Another solution: sorting (Henzinger 2006)
More resources
. IIR Chapter 19