
Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Pandu Nayak and Prabhakar Raghavan
Lecture 15: Web search basics

Brief (non-technical) history
. Early keyword-based engines ca. 1995-1997
  . Altavista, …
. Paid search ranking: Goto (morphed into Overture.com → Yahoo!)
  . Your search ranking depended on how much you paid
  . Auction for keywords: casino was expensive!



Brief (non‐technical) history

. 1998+: Link-based ranking pioneered by Google
  . Blew away all early engines save Inktomi
  . Great user experience in search of a business model
  . Meanwhile Goto/Overture's annual revenues were nearing $1 billion
. Result: Google added paid search "ads" to the side, independent of search results
  . Yahoo followed suit, acquiring Overture (for paid placement) and Inktomi (for search)
. 2005+: Google gains search share, dominating in Europe and very strong in North America
. 2009: Yahoo! and Microsoft propose combined paid search offering

[Figure: a search results page with paid search ads shown alongside the algorithmic results.]



Web search basics

[Figure: schematic of web search: the user issues queries to the search engine, a web spider crawls The Web, and the indexer builds the indexes (including ad indexes) that answer queries.]

User Needs
. Need [Brod02, RL04]
  . Informational – want to learn about something (~40% / 65%)
  . Navigational – want to go to that page (~25% / 15%)
  . Transactional – want to do something (web-mediated) (~35% / 20%)
    . Access a service
    . Downloads
    . Shop
  . Gray areas
    . Find a good hub
    . Exploratory search "see what's there"


How far do people look for results?

[Figure: iProspect survey data on how many pages of results users examine; most users do not look beyond the first page.]
(Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)

Users' empirical evaluation of results
. Quality of pages varies widely
  . Relevance is not enough
  . Other desirable qualities (non IR!!)
    . Content: Trustworthy, diverse, non-duplicated, well maintained
    . Web readability: display correctly & fast
    . No annoyances: pop-ups, etc.
. Precision vs. recall
  . On the web, recall seldom matters
  . What matters
    . Precision at 1? Precision above the fold?
    . Comprehensiveness – must be able to deal with obscure queries
  . Recall matters when the number of matches is very small
. User perceptions may be unscientific, but are significant over a large aggregate


Users' empirical evaluation of engines
. Relevance and validity of results
. UI – Simple, no clutter, error tolerant
. Trust – Results are objective
. Coverage of topics for polysemic queries
. Pre/Post process tools provided
  . Mitigate user errors (auto spell check, search assist, …)
  . Explicit: Search within results, more like this, refine ...
  . Anticipative: related searches
. Deal with idiosyncrasies
  . Web specific vocabulary
    . Impact on stemming, spell-check, etc.
  . Web addresses typed in the search box
  . "The first, the last, the best and the worst …"

The Web document collection
. No design/co-ordination
. Distributed content creation, linking, democratization of publishing
. Content includes truth, lies, obsolete information, contradictions …
. Unstructured (text, html, …), semi-structured (XML, annotated photos), structured (Databases) …
. Scale much larger than previous text collections … but corporate records are catching up
. Growth – slowed down from initial "volume doubling every few months" but still expanding
. Content can be dynamically generated


The trouble with paid search ads …

. It costs money. What's the alternative?
. Optimization:
  . "Tuning" your web page to rank highly in the algorithmic search results for select keywords
  . Alternative to paying for placement
  . Thus, intrinsically a marketing function
. Performed by companies, webmasters and consultants ("Search engine optimizers") for their clients
. Some perfectly legitimate, some very shady

SPAM (SEARCH ENGINE OPTIMIZATION)


Search engine optimization (Spam)
. Motives
  . Commercial, political, religious, lobbies
  . Promotion funded by advertising budget
. Operators
  . Contractors (Search Engine Optimizers) for lobbies, companies
  . Web masters
  . Hosting services
. Forums
  . E.g., Web master world (www.webmasterworld.com)
    . Search engine specific tricks
    . Discussions about academic papers

Simplest forms
. First generation engines relied heavily on tf/idf
  . The top-ranked pages for the query maui resort were the ones containing the most maui's and resort's
. SEOs responded with dense repetitions of chosen terms
  . e.g., maui resort maui resort maui resort
. Often, the repetitions would be in the same color as the background of the web page
  . Repeated terms got indexed by crawlers
  . But not visible to humans on browsers
 Pure word density cannot be trusted as an IR signal (see the sketch below)
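A minimal, purely illustrative sketch of why raw term counts are spammable; the pages, query, and scoring function below are all invented for illustration and are not any engine's actual ranking code.

```python
# A toy term-frequency scorer in the spirit of first-generation engines:
# the score of a page is just the number of query-term occurrences.
from collections import Counter

def tf_score(query, page_text):
    counts = Counter(page_text.lower().split())
    return sum(counts[term] for term in query.lower().split())

legit = "maui resort guide: beaches, snorkeling and family friendly hotels on maui"
stuffed = "maui resort " * 50   # dense repetition, perhaps rendered in the background color

print(tf_score("maui resort", legit))    # modest score
print(tf_score("maui resort", stuffed))  # huge score: word density alone is easy to game
```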



Variants of keyword stuffing
. Misleading meta-tags, excessive repetition
. Hidden text with colors, style sheet tricks, etc.
. Example of spam meta-tags: "… London hotels, hotel, holiday inn, hilton, discount, booking, reservation, sex, mp3, britney spears, viagra, …"

Cloaking
. Serve fake content to search engine spider
. DNS cloaking: Switch IP address. Impersonate
[Figure: the serving host asks "Is this a search engine spider?"; if yes, it returns the cloaked doc, if no, the real doc.]



More spam techniques
. Doorway pages
  . Pages optimized for a single keyword that re-direct to the real target page
. Link spamming
  . Mutual admiration societies, hidden links, awards – more on these later
  . Domain flooding: numerous domains that point or re-direct to a target page
. Robots
  . Fake query stream – rank checking programs
    . "Curve-fit" ranking programs of search engines
  . Millions of submissions via Add-Url

The war against spam
. Quality signals - Prefer authoritative pages based on:
  . Votes from authors (linkage signals)
  . Votes from users (usage signals)
. Policing of URL submissions
  . Anti robot test
. Limits on meta-keywords
. Robust link analysis
  . Ignore statistically implausible linkage (or text)
  . Use link analysis to detect spammers (guilt by association)
. Spam recognition by machine learning
  . Training set based on known spam
. Family friendly filters
  . Linguistic analysis, general classification techniques, etc.
  . For images: flesh tone detectors, source text analysis, etc.
. Editorial intervention
  . Blacklists
  . Top queries audited
  . Complaints addressed
  . Suspect pattern detection



More on spam
. Web search engines have policies on SEO practices they tolerate/block
  . http://help.yahoo.com/help/us/ysearch/index.html
  . http://www.google.com/intl/en/webmasters/
. Adversarial IR: the unending (technical) battle between SEOs and web search engines
. Research: http://airweb.cse.lehigh.edu/

SIZE OF THE WEB



What is the size of the web?
. Issues
  . The web is really infinite
    . Dynamic content, e.g., calendars
    . Soft 404: www.yahoo.com/<anything> is a valid page
  . Static web contains syntactic duplication, mostly due to mirroring (~30%)
  . Some servers are seldom connected
. Who cares?
  . Media, and consequently the user
  . Engine design
  . Engine crawl policy. Impact on recall.

What can we attempt to measure?
. The relative sizes of search engines
  . The notion of a page being indexed is still reasonably well defined.
  . Already there are problems
    . Document extension: e.g., engines index pages not yet crawled, by indexing anchortext.
    . Document restriction: All engines restrict what is indexed (first n words, only relevant words, etc.)


New definition?
. The statically indexable web is whatever search engines index.
  . IQ is whatever the IQ tests measure.
. Different engines have different preferences
  . max url depth, max count/host, anti-spam rules, priority rules, etc.
. Different engines index different things under the same URL:
  . frames, meta-keywords, document restrictions, document extensions, ...

Relative Size from Overlap
Given two engines A and B:
. Sample URLs randomly from A
. Check if contained in B, and vice versa

A ∩ B = (1/2) * Size A
A ∩ B = (1/6) * Size B
(1/2) * Size A = (1/6) * Size B
∴ Size A / Size B = (1/6) / (1/2) = 1/3

Each test involves: (i) Sampling, (ii) Checking
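The overlap arithmetic above is easy to script; a tiny sketch follows (the function name is hypothetical and the 1/2 and 1/6 are just the slide's example fractions).

```python
# Estimate |A| / |B| from the two measured overlap fractions:
# frac_a_in_b * |A| and frac_b_in_a * |B| both estimate |A ∩ B|.
def size_ratio(frac_a_in_b, frac_b_in_a):
    return frac_b_in_a / frac_a_in_b

# Slide example: half of A's sampled URLs are in B, a sixth of B's are in A.
print(size_ratio(1/2, 1/6))   # 0.333...  =>  Size A / Size B = 1/3
```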


Sampling URLs
 Ideal strategy: Generate a random URL and check for containment in each index.
 Problem: Random URLs are hard to find! Enough to generate a random URL contained in a given engine.
 Approach 1: Generate a random URL contained in a given engine
  Suffices for the estimation of relative size
 Approach 2: Random walks / IP addresses
  In theory: might give us a true estimate of the size of the web (as opposed to just relative sizes of indexes)

Statistical methods
. Approach 1
  . Random queries
  . Random searches
. Approach 2
  . Random IP addresses
  . Random walks


Random URLs from random queries
. Generate random query: how?
  . Lexicon: 400,000+ words from a web crawl (not an English dictionary)
  . Conjunctive Queries: w1 and w2
    . e.g., vocalists AND rsi
. Get 100 result URLs from engine A
. Choose a random URL as the candidate to check for presence in engine B
. This distribution induces a probability weight W(p) for each page.

Query Based Checking
. Strong Query to check whether an engine B has a document D:
  . Download D. Get list of words.
  . Use 8 low frequency words as AND query to B
  . Check if D is present in result set.
. Problems:
  . Near duplicates
  . Frames
  . Redirects
  . Engine time-outs
  . Is 8-word query good enough?
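A hedged sketch of the strong-query idea above. The corpus-frequency table, document text, and helper name are hypothetical; a real implementation would take word statistics from a crawl-derived lexicon.

```python
import re

def strong_query(doc_text, corpus_freq, k=8):
    """Pick the k rarest distinct words of D (rarest by background corpus frequency)
    and AND them together as the query sent to engine B."""
    words = set(re.findall(r"[a-z]+", doc_text.lower()))
    # Words absent from the table are treated as rarest of all (frequency 0).
    rarest = sorted(words, key=lambda w: corpus_freq.get(w, 0))[:k]
    return " AND ".join(rarest)

corpus_freq = {"the": 10**9, "of": 10**9, "after": 10**7, "vocalists": 120, "rsi": 300}
doc = "The vocalists reported rsi of the wrist after years of rehearsal"
print(strong_query(doc, corpus_freq))
# One would then run this query against engine B and check whether D is in the result set.
```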



Advantages & disadvantages
. Statistically sound under the induced weight.
. Biases induced by random query
  . Query Bias: Favors content-rich pages in the language(s) of the lexicon
  . Ranking Bias: Solution: Use conjunctive queries & fetch all
  . Checking Bias: Duplicates, impoverished pages omitted
  . Document or query restriction bias: engine might not deal properly with 8-word conjunctive query
  . Malicious Bias: Sabotage by engine
  . Operational Problems: Time-outs, failures, engine inconsistencies, index modification.

Random searches
. Choose random searches extracted from a local log [Lawrence & Giles 97] or build "random searches" [Notess]
  . Use only queries with small result sets.
  . Count normalized URLs in result sets.
  . Use ratio statistics



Advantages & disadvantages
. Advantage
  . Might be a better reflection of the human perception of coverage
. Issues
  . Samples are correlated with source of log
  . Duplicates
  . Technical statistical problems (must have non-zero results, ratio average not statistically sound)

Random searches
. 575 & 1050 queries from the NEC RI employee logs
. 6 Engines in 1998, 11 in 1999
. Implementation:
  . Restricted to queries with < 600 results in total
  . Counted URLs from each engine after verifying query match
  . Computed size ratio & overlap for individual queries
  . Estimated index size ratio & overlap by averaging over all queries



Queries from Lawrence and Giles study
. adaptive access control
. neighborhood preservation topographic
. hamiltonian structures
. right linear grammar
. pulse width modulation neural
. unbalanced prior probabilities
. ranked assignment method
. internet explorer favourites importing
. fat shattering dimension
. abelson amorphous computing
. softmax activation function
. bose multidimensional system theory
. gamma mlp
. dvi2pdf
. john oliensis
. rieke spikes exploring neural
. video watermarking
. counterpropagation network
. karvel thornber
. zili liu

Random IP addresses
. Generate random IP addresses
. Find a web server at the given address
  . If there's one
. Collect all pages from server
  . From this, choose a page at random



Random IP addresses
. HTTP requests to random IP addresses
  . Ignored: empty or authorization required or excluded
  . [Lawr99] Estimated 2.8 million IP addresses running crawlable web servers (16 million total) from observing 2500 servers.
  . OCLC using IP sampling found 8.7 M hosts in 2001
    . Netcraft [Netc02] accessed 37.2 million hosts in July 2002
. [Lawr99] exhaustively crawled 2500 servers and extrapolated
  . Estimated size of the web to be 800 million pages
  . Estimated use of metadata descriptors:
    . Meta tags (keywords, description) in 34% of home pages, Dublin core metadata in 0.3%

Advantages & disadvantages
. Advantages
  . Clean statistics
  . Independent of crawling strategies
. Disadvantages
  . Doesn't deal with duplication
  . Many hosts might share one IP, or not accept requests
  . No guarantee all pages are linked to root page.
    . E.g.: employee pages
  . Power law for # pages/hosts generates bias towards sites with few pages.
    . But bias can be accurately quantified IF underlying distribution understood
  . Potentially influenced by spamming (multiple IP's for same server to avoid IP block)


Random walks
. View the Web as a directed graph
. Build a random walk on this graph
  . Includes various "jump" rules back to visited sites
    . Does not get stuck in spider traps!
    . Can follow all links!
  . Converges to a stationary distribution
    . Must assume graph is finite and independent of the walk.
    . Conditions are not satisfied (cookie crumbs, flooding)
    . Time to convergence not really known
  . Sample from stationary distribution of walk
  . Use the "strong query" method to check coverage by SE

Advantages & disadvantages
. Advantages
  . "Statistically clean" method, at least in theory!
  . Could work even for infinite web (assuming convergence) under certain metrics.
. Disadvantages
  . List of seeds is a problem.
  . Practical approximation might not be valid.
  . Non-uniform distribution
    . Subject to link spamming
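A minimal random-walk sketch on a hypothetical toy graph (the link structure, jump probability, and step count are all made up). It illustrates the "jump back to visited sites" rule that keeps the walk out of spider traps; after a long walk, visit frequencies approximate the stationary distribution from which pages would be sampled.

```python
import random
from collections import Counter

graph = {                          # hypothetical link structure; "d" is a spider trap
    "a": ["b", "c"], "b": ["c"], "c": ["a", "d"], "d": ["d"],
}

def random_walk(graph, start="a", steps=100_000, jump_prob=0.15, seed=0):
    rng = random.Random(seed)
    visited = [start]
    counts = Counter()
    page = start
    for _ in range(steps):
        if rng.random() < jump_prob or not graph[page]:
            page = rng.choice(visited)       # jump rule: return to a previously visited page
        else:
            page = rng.choice(graph[page])   # otherwise follow a random out-link
        visited.append(page)
        counts[page] += 1
    return counts

# Empirical visit frequencies approximate the stationary distribution of the walk.
counts = random_walk(graph)
total = sum(counts.values())
print({p: round(c / total, 3) for p, c in counts.items()})
```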



Conclusions
. No sampling solution is perfect.
. Lots of new ideas ...
  . ... but the problem is getting harder
. Quantitative studies are fascinating and a good research problem

DUPLICATE DETECTION



Duplicate documents
. The web is full of duplicated content
. Strict duplicate detection = exact match
  . Not as common
. But many, many cases of near duplicates
  . E.g., last-modified date the only difference between two copies of a page

Duplicate/Near-Duplicate Detection
. Duplication: Exact match can be detected with fingerprints
. Near-Duplication: Approximate match
  . Overview
    . Compute syntactic similarity with an edit-distance measure
    . Use similarity threshold to detect near-duplicates
      . E.g., Similarity > 80% => Documents are "near duplicates"
      . Not transitive though sometimes used transitively
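For the exact-match case, a fingerprint is simply a hash of the (normalized) page content; pages whose fingerprints collide are exact duplicates. A minimal sketch with hypothetical URLs and content, using SHA-1 as the fingerprint function:

```python
import hashlib
from collections import defaultdict

def fingerprint(text):
    # Normalize whitespace, then hash; any strong hash serves as a fingerprint here.
    return hashlib.sha1(" ".join(text.split()).encode("utf-8")).hexdigest()

pages = {
    "http://example.com/a": "Welcome to our site. Last modified 2009-01-01.",
    "http://mirror.example.org/a": "Welcome to our site. Last modified 2009-01-01.",
    "http://example.com/b": "Welcome to our site. Last modified 2009-02-15.",  # near, not exact, duplicate
}

groups = defaultdict(list)
for url, text in pages.items():
    groups[fingerprint(text)].append(url)

print([urls for urls in groups.values() if len(urls) > 1])  # only exact duplicates group together
```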



Computing Similarity
. Features:
  . Segments of a document (natural or artificial breakpoints)
  . Shingles (Word N-Grams)
  . a rose is a rose is a rose →
    a_rose_is_a
    rose_is_a_rose
    is_a_rose_is
    a_rose_is_a
. Similarity Measure between two docs (= sets of shingles)
  . Jaccard coefficient: Size_of_Intersection / Size_of_Union

Shingles + Set Intersection
. Computing exact set intersection of shingles between all pairs of documents is expensive/intractable
  . Approximate using a cleverly chosen subset of shingles from each (a sketch)
. Estimate (size_of_intersection / size_of_union) based on a short sketch
[Figure: Doc A → Shingle set A → Sketch A, Doc B → Shingle set B → Sketch B; the Jaccard coefficient is estimated from the two sketches.]
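A minimal sketch of word shingling and the Jaccard coefficient as defined above; k=4 matches the slide's example, and the second sentence is an invented comparison document.

```python
def shingles(text, k=4):
    """Set of word k-grams ("shingles") of the text."""
    words = text.lower().split()
    return {"_".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

d1 = shingles("a rose is a rose is a rose")
print(sorted(d1))   # ['a_rose_is_a', 'is_a_rose_is', 'rose_is_a_rose']

d2 = shingles("a rose is a flower which is a rose")
print(jaccard(d1, d2))   # 1 shared shingle out of 8 distinct ones = 0.125
```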


Sketch of a document
. Create a "sketch vector" (of size ~200) for each document
  . Documents that share ≥ t (say 80%) corresponding vector elements are near duplicates
. For doc D, sketch_D[i] is as follows:
  . Let f map all shingles in the universe to 0..2^m-1 (e.g., f = fingerprinting)
  . Let π_i be a random permutation on 0..2^m-1
  . Pick MIN {π_i(f(s))} over all shingles s in D

Computing Sketch[i] for Doc1
[Figure: the 64-bit fingerprints f(shingles) of Document 1 are placed on the number line 0..2^64, permuted on the number line with π_i, and the minimum value is picked.]
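A hedged sketch of the computation above: f fingerprints each shingle into a 64-bit integer, and each "permutation" π_i is approximated by a random affine hash (a·x + b) mod p, a common practical stand-in for a truly random permutation. This is an approximation of the scheme on the slide, not a literal implementation.

```python
import hashlib
import random

P = 2**61 - 1   # a large prime; (a*x + b) % P approximates a random permutation

def f(shingle):
    """64-bit fingerprint of a shingle."""
    return int.from_bytes(hashlib.sha1(shingle.encode("utf-8")).digest()[:8], "big")

def make_pi(rng):
    a, b = rng.randrange(1, P), rng.randrange(P)
    return lambda x: (a * x + b) % P

rng = random.Random(42)
pis = [make_pi(rng) for _ in range(200)]        # ~200 sketch elements, as on the slide

def sketch(shingle_set):
    fps = [f(s) for s in shingle_set]
    return [min(pi(x) for x in fps) for pi in pis]

def estimated_jaccard(sketch1, sketch2):
    return sum(x == y for x, y in zip(sketch1, sketch2)) / len(sketch1)

d1 = {"a_rose_is_a", "rose_is_a_rose", "is_a_rose_is"}
d2 = {"a_rose_is_a", "rose_is_a_flower", "is_a_flower_which"}
print(estimated_jaccard(sketch(d1), sketch(d2)))   # should land near the exact Jaccard of 0.2
```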



Test if Doc1.Sketch[i] = Doc2.Sketch[i]
[Figure: Documents 1 and 2 on the number line 0..2^64, both permuted with π_i; A and B are the respective minimum values.]
Are these equal? Test for 200 random permutations: π1, π2, … π200

However…
A = B iff the shingle with the MIN value in the union of Doc1 and Doc2 is common to both (i.e., lies in the intersection)
Claim: This happens with probability Size_of_intersection / Size_of_union


Set Similarity of sets Ci, Cj
. View sets as columns of a matrix A; one row for each element in the universe. a_ij = 1 indicates presence of item i in set j
. Example:

    C1 C2
     0  1
     1  0
     1  1          Jaccard(C1,C2) = 2/5 = 0.4
     0  0
     1  1
     0  1

Key Observation
. For columns Ci, Cj, four types of rows:

        Ci Cj
    A    1  1
    B    1  0
    C    0  1
    D    0  0

. Overload notation: A = # of rows of type A
. Claim: Jaccard(Ci, Cj) = A / (A + B + C)
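A quick numeric check of the claim using the example matrix from the previous slide (rows listed as (C1, C2) pairs):

```python
rows = [(0, 1), (1, 0), (1, 1), (0, 0), (1, 1), (0, 1)]   # the example columns C1, C2

A = sum(1 for c1, c2 in rows if c1 == 1 and c2 == 1)   # type A rows
B = sum(1 for c1, c2 in rows if c1 == 1 and c2 == 0)   # type B rows
C = sum(1 for c1, c2 in rows if c1 == 0 and c2 == 1)   # type C rows

print(A / (A + B + C))   # 2 / 5 = 0.4, matching Jaccard(C1, C2) above
```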


“Min” Hashing
. Randomly permute rows
. Hash h(Ci) = index of first row with 1 in column Ci
. Surprising Property: P[h(Ci) = h(Cj)] = Jaccard(Ci, Cj)
  . Why?
    . Both are A/(A+B+C)
    . Look down columns Ci, Cj until first non-Type-D row
    . h(Ci) = h(Cj)  type A row

Min-Hash sketches
. Pick P random row permutations
. MinHash sketch
  . Sketch_D = list of P indexes of first rows with 1 in column C_D
. Similarity of signatures
  . Let sim[sketch(Ci), sketch(Cj)] = fraction of permutations where MinHash values agree
  . Observe E[sim(sketch(Ci), sketch(Cj))] = Jaccard(Ci, Cj)



Example

Input matrix:
        C1  C2  C3
   R1    1   0   1
   R2    0   1   1
   R3    1   0   0
   R4    1   0   1
   R5    0   1   0

Signatures:
                      S1  S2  S3
   Perm 1 = (12345)    1   2   1
   Perm 2 = (54321)    4   5   4
   Perm 3 = (34512)    3   5   4

Similarities:
              1-2    1-3    2-3
   Col-Col   0.00   0.50   0.25
   Sig-Sig   0.00   0.67   0.00

All signature pairs
 Now we have an extremely efficient method for estimating a Jaccard coefficient for a single pair of documents.
 But we still have to estimate N² coefficients where N is the number of web pages.
  Still slow
 One solution: locality sensitive hashing (LSH)
 Another solution: sorting (Henzinger 2006)
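A small sketch that reproduces the example above: build the three signatures from the stated permutations (recording the original index of the first row, in permuted order, with a 1), then compare signature agreement against the exact column Jaccard.

```python
from itertools import combinations

matrix = {  # rows R1..R5, columns (C1, C2, C3), as in the example
    1: (1, 0, 1), 2: (0, 1, 1), 3: (1, 0, 0), 4: (1, 0, 1), 5: (0, 1, 0),
}
perms = [(1, 2, 3, 4, 5), (5, 4, 3, 2, 1), (3, 4, 5, 1, 2)]

def signature(col):
    # For each permutation, the first row (in permuted order) with a 1 in this column.
    return [next(r for r in perm if matrix[r][col] == 1) for perm in perms]

def jaccard(ci, cj):
    inter = sum(1 for row in matrix.values() if row[ci] and row[cj])
    union = sum(1 for row in matrix.values() if row[ci] or row[cj])
    return inter / union

sigs = [signature(c) for c in range(3)]
for ci, cj in combinations(range(3), 2):
    agree = sum(a == b for a, b in zip(sigs[ci], sigs[cj])) / len(perms)
    print(f"C{ci+1}-C{cj+1}: sig-sig {agree:.2f}, col-col {jaccard(ci, cj):.2f}")
# Matches the table: 0.00/0.00, 0.67/0.50, 0.00/0.25 -- the signature estimate is
# unbiased but noisy with only three permutations.
```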



More resources
. IIR Chapter 19

