Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Pandu Nayak and Prabhakar Raghavan
Lecture 15: Web search basics

Brief (non-technical) history
. Early keyword-based engines ca. 1995-1997
  . Altavista, Excite, Infoseek, Inktomi, Lycos
. Paid search ranking: Goto (morphed into Overture.com → Yahoo!)
  . Your search ranking depended on how much you paid
  . Auction for keywords: casino was expensive!
. 1998+: Link-based ranking pioneered by Google
  . Blew away all early engines save Inktomi
  . Great user experience in search of a business model
  . Meanwhile Goto/Overture's annual revenues were nearing $1 billion
. Result: Google added paid search "ads" to the side, independent of search results
  . Yahoo followed suit, acquiring Overture (for paid placement) and Inktomi (for search)
. 2005+: Google gains search share, dominating in Europe and very strong in North America
. 2009: Yahoo! and Microsoft propose combined paid search offering
[Screenshot: a results page with paid search ads displayed alongside the algorithmic results]
Web search basics (Sec. 19.4.1)
[Diagram: the user issues searches against the engine's indexes (including ad indexes); a web spider crawls the Web and feeds the indexer that builds those indexes]

User Needs (Sec. 19.4.1)
. Need [Brod02, RL04]
  . Informational – want to learn about something (~40% / 65%)
  . Navigational – want to go to that page (~25% / 15%)
  . Transactional – want to do something (web-mediated) (~35% / 20%)
    . Access a service
    . Downloads
    . Shop
  . Gray areas
    . Find a good hub
    . Exploratory search "see what's there"
How far do people look for results?
[Chart omitted. Source: iprospect.com, WhitePaper_2006_SearchEngineUserBehavior.pdf]

Users' empirical evaluation of results
. Quality of pages varies widely
  . Relevance is not enough
  . Other desirable qualities (non IR!!)
    . Content: trustworthy, diverse, non-duplicated, well maintained
    . Web readability: display correctly & fast
    . No annoyances: pop-ups, etc.
. Precision vs. recall
  . On the web, recall seldom matters
  . What matters
    . Precision at 1? Precision above the fold?
    . Comprehensiveness – must be able to deal with obscure queries
      . Recall matters when the number of matches is very small
. User perceptions may be unscientific, but are significant over a large aggregate
Users' empirical evaluation of engines
. Relevance and validity of results
. UI – simple, no clutter, error tolerant
. Trust – results are objective
. Coverage of topics for polysemic queries
. Pre/post process tools provided
  . Mitigate user errors (auto spell check, search assist, …)
  . Explicit: search within results, more like this, refine ...
  . Anticipative: related searches
. Deal with idiosyncrasies
  . Web-specific vocabulary
    . Impact on stemming, spell-check, etc.
  . Web addresses typed in the search box
    . "The first, the last, the best and the worst …"

The Web document collection (Sec. 19.2)
. No design/co-ordination
. Distributed content creation, linking, democratization of publishing
. Content includes truth, lies, obsolete information, contradictions …
. Unstructured (text, html, …), semi-structured (XML, annotated photos), structured (databases) …
. Scale much larger than previous text collections … but corporate records are catching up
. Growth – slowed down from initial "volume doubling every few months" but still expanding
. Content can be dynamically generated
The trouble with paid search ads …
. It costs money. What's the alternative?
. Search Engine Optimization:
  . "Tuning" your web page to rank highly in the algorithmic search results for select keywords
  . Alternative to paying for placement
  . Thus, intrinsically a marketing function
. Performed by companies, webmasters and consultants ("search engine optimizers") for their clients
. Some perfectly legitimate, some very shady

SPAM (SEARCH ENGINE OPTIMIZATION)
Search engine optimization (Spam) (Sec. 19.2.2)
. Motives
  . Commercial, political, religious, lobbies
  . Promotion funded by advertising budget
. Operators
  . Contractors (search engine optimizers) for lobbies, companies
  . Web masters
  . Hosting services
. Forums
  . E.g., Web master world (www.webmasterworld.com)
    . Search engine specific tricks
    . Discussions about academic papers

Simplest forms (Sec. 19.2.2)
. First generation engines relied heavily on tf/idf
  . The top-ranked pages for the query maui resort were the ones containing the most maui's and resort's
. SEOs responded with dense repetitions of chosen terms
  . e.g., maui resort maui resort maui resort
. Often, the repetitions would be in the same color as the background of the web page
  . Repeated terms got indexed by crawlers
  . But not visible to humans on browsers
Pure word density cannot be trusted as an IR signal
Variants of keyword stuffing (Sec. 19.2.2)
. Misleading meta-tags, excessive repetition
. Hidden text with colors, style sheet tricks, etc.
. Example of spam meta-tags:
    Meta-Tags = "… London hotels, hotel, holiday inn, hilton, discount, booking, reservation, sex, mp3, britney spears, viagra, …"

Cloaking (Sec. 19.2.2)
. Serve fake content to search engine spider
. DNS cloaking: switch IP address; impersonate
[Diagram: the server asks "Is this a search engine spider?" – if Yes, it returns the cloaked doc; if No, the real doc]
More spam techniques (Sec. 19.2.2)
. Doorway pages
  . Pages optimized for a single keyword that re-direct to the real target page
. Link spamming
  . Mutual admiration societies, hidden links, awards – more on these later
  . Domain flooding: numerous domains that point or re-direct to a target page
. Robots
  . Fake query stream – rank checking programs
    . "Curve-fit" ranking programs of search engines
  . Millions of submissions via Add-Url

The war against spam
. Quality signals – prefer authoritative pages based on:
  . Votes from authors (linkage signals)
  . Votes from users (usage signals)
. Policing of URL submissions
  . Anti robot test
. Limits on meta-keywords
. Robust link analysis
  . Ignore statistically implausible linkage (or text)
  . Use link analysis to detect spammers (guilt by association)
. Spam recognition by machine learning
  . Training set based on known spam
. Family friendly filters
  . Linguistic analysis, general classification techniques, etc.
  . For images: flesh tone detectors, source text analysis, etc.
. Editorial intervention
  . Blacklists
  . Top queries audited
  . Complaints addressed
  . Suspect pattern detection
More on spam
. Web search engines have policies on SEO practices they tolerate/block
  . http://help.yahoo.com/help/us/ysearch/index.html
  . http://www.google.com/intl/en/webmasters/
. Adversarial IR: the unending (technical) battle between SEOs and web search engines
. Research: http://airweb.cse.lehigh.edu/

SIZE OF THE WEB
What is the size of the web? (Sec. 19.5)
. Issues
  . The web is really infinite
    . Dynamic content, e.g., calendars
    . Soft 404: www.yahoo.com/<anything> is a valid page

What can we attempt to measure? (Sec. 19.5)
. The relative sizes of search engines
  . The notion of a page being indexed is still reasonably well defined
New definition? (Sec. 19.5)
. The statically indexable web is whatever search engines index.
  . IQ is whatever the IQ tests measure.
. Different engines have different preferences
  . max url depth, max count/host, anti-spam rules, priority rules, etc.
. Different engines index different things under the same URL:
  . frames, meta-keywords, document restrictions, document extensions, ...

Relative Size from Overlap (Sec. 19.5)
Given two engines A and B:
. Sample URLs randomly from A
. Check if contained in B, and vice versa
. Suppose we find
    A ∩ B = (1/2) * Size A
    A ∩ B = (1/6) * Size B
. Then (1/2) * Size A = (1/6) * Size B
    ∴ Size A / Size B = (1/6) / (1/2) = 1/3
. Each test involves: (i) sampling, (ii) checking (the resulting estimator is sketched in code below)
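The arithmetic above generalizes directly. Below is a minimal Python sketch (not from the lecture) of the overlap-based estimator; sample_from_A, check_in_B, sample_from_B and check_in_A are hypothetical stand-ins for the sampling and checking procedures discussed on the following slides.

    def relative_size(sample_from_A, check_in_B, sample_from_B, check_in_A, n=1000):
        # Fraction of A's sampled pages also indexed by B: estimates |A ∩ B| / |A|
        frac_a = sum(check_in_B(sample_from_A()) for _ in range(n)) / n
        # Fraction of B's sampled pages also indexed by A: estimates |A ∩ B| / |B|
        frac_b = sum(check_in_A(sample_from_B()) for _ in range(n)) / n
        # |A ∩ B| = frac_a * |A| = frac_b * |B|  =>  |A| / |B| = frac_b / frac_a
        return frac_b / frac_a

With the numbers on the slide (half of A's sample found in B, one sixth of B's sample found in A), this returns 1/3.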
Sampling URLs (Sec. 19.5)
. Ideal strategy: generate a random URL and check for containment in each index.
. Problem: random URLs are hard to find! Enough to generate a random URL contained in a given engine.
. Approach 1: generate a random URL contained in a given engine
  . Suffices for the estimation of relative size
. Approach 2: random walks / IP addresses
  . In theory: might give us a true estimate of the size of the web (as opposed to just relative sizes of indexes)

Statistical methods (Sec. 19.5)
. Approach 1
  . Random queries
  . Random searches
. Approach 2
  . Random IP addresses
  . Random walks
Random URLs from random queries (Sec. 19.5)
. Generate random query: how?
  . Lexicon: 400,000+ words from a web crawl (not an English dictionary)
  . Conjunctive queries: w1 and w2
    . e.g., vocalists AND rsi
. Get 100 result URLs from engine A
. Choose a random URL as the candidate to check for presence in engine B
. This distribution induces a probability weight W(p) for each page.

Query Based Checking (Sec. 19.5)
. "Strong query" to check whether an engine B has a document D (sketched below):
  . Download D. Get list of words.
  . Use 8 low-frequency words as AND query to B
  . Check if D is present in result set.
. Problems:
  . Near duplicates
  . Frames
  . Redirects
  . Engine time-outs
  . Is 8-word query good enough?
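A sketch of the "strong query" check, assuming a hypothetical search_engine_B(query) function that returns result URLs and a document-frequency table df for picking low-frequency words; it ignores the near-duplicate, frame and redirect problems listed above.

    import re

    def strong_query_check(doc_text, doc_url, search_engine_B, df, k=8):
        # search_engine_B(query) -> list of result URLs   (hypothetical API)
        # df: dict mapping word -> estimated document frequency
        words = set(re.findall(r"[a-z]+", doc_text.lower()))
        rare = sorted(words, key=lambda w: df.get(w, 0))[:k]   # k rarest words in D
        query = " AND ".join(rare)
        results = search_engine_B(query)
        # Naive containment test; in practice near-duplicates, frames and
        # redirects make this unreliable, as the slide notes.
        return doc_url in results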
Advantages & disadvantages (Sec. 19.5)
. Statistically sound under the induced weight.
. Biases induced by random query
  . Query bias: favors content-rich pages in the language(s) of the lexicon
  . Ranking bias: solution: use conjunctive queries & fetch all
  . Checking bias: duplicates, impoverished pages omitted
  . Document or query restriction bias: engine might not deal properly with an 8-word conjunctive query
  . Malicious bias: sabotage by engine
  . Operational problems: time-outs, failures, engine inconsistencies, index modification.

Random searches (Sec. 19.5)
. Choose random searches extracted from a local log [Lawrence & Giles 97] or build "random searches" [Notess]
  . Use only queries with small result sets.
  . Count normalized URLs in result sets.
  . Use ratio statistics
Advantages & disadvantages (Sec. 19.5)
. Advantage
  . Might be a better reflection of the human perception of coverage
. Issues
  . Samples are correlated with source of log
  . Duplicates
  . Technical statistical problems (must have non-zero results, ratio average not statistically sound)

Random searches (Sec. 19.5)
. 575 & 1050 queries from the NEC RI employee logs
. 6 engines in 1998, 11 in 1999
. Implementation:
  . Restricted to queries with < 600 results in total
  . Counted URLs from each engine after verifying query match
  . Computed size ratio & overlap for individual queries
  . Estimated index size ratio & overlap by averaging over all queries (see the sketch below)
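A rough sketch (not the NEC implementation) of the per-query bookkeeping just described; results_A and results_B stand for the verified, normalized result URL sets that one query yields from the two engines.

    def per_query(results_A, results_B):
        # results_A, results_B: sets of normalized URLs that a single query
        # (with a small total result set) returns from engines A and B.
        overlap = len(results_A & results_B)
        ratio = len(results_A) / len(results_B)
        return ratio, overlap

    def estimate(all_results):
        # all_results: list of (results_A, results_B) pairs, one per query.
        # Average the per-query size ratios, as in the 1998/99 studies;
        # note the caveat above: averaging ratios is not statistically sound.
        ratios = [per_query(a, b)[0] for a, b in all_results if b]
        return sum(ratios) / len(ratios)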
Queries from Lawrence and Giles study (Sec. 19.5)
. adaptive access control
. neighborhood preservation topographic
. hamiltonian structures
. right linear grammar
. pulse width modulation neural
. unbalanced prior probabilities
. ranked assignment method
. internet explorer favourites importing
. fat shattering dimension
. abelson amorphous computing
. softmax activation function
. bose multidimensional system theory
. gamma mlp
. dvi2pdf
. john oliensis
. rieke spikes exploring neural
. video watermarking
. counterpropagation network
. karvel thornber
. zili liu

Random IP addresses (Sec. 19.5)
. Generate random IP addresses
. Find a web server at the given address
  . If there's one
. Collect all pages from server
  . From this, choose a page at random
Random IP addresses (Sec. 19.5)
. HTTP requests to random IP addresses (a probing sketch follows below)
  . Ignored: empty, authorization required, or excluded
. [Lawr99] Estimated 2.8 million IP addresses running crawlable web servers (16 million total) from observing 2500 servers.
  . OCLC using IP sampling found 8.7M hosts in 2001
  . Netcraft [Netc02] accessed 37.2 million hosts in July 2002
. [Lawr99] exhaustively crawled 2500 servers and extrapolated
  . Estimated size of the web to be 800 million pages
  . Estimated use of metadata descriptors:
    . Meta tags (keywords, description) in 34% of home pages, Dublin Core metadata in 0.3%

Advantages & disadvantages (Sec. 19.5)
. Advantages
  . Clean statistics
  . Independent of crawling strategies
. Disadvantages
  . Doesn't deal with duplication
  . Many hosts might share one IP, or not accept requests
  . No guarantee all pages are linked to the root page.
    . E.g.: employee pages
  . Power law for # pages/host generates bias towards sites with few pages.
    . But bias can be accurately quantified IF underlying distribution understood
  . Potentially influenced by spamming (multiple IPs for same server to avoid IP block)
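A minimal sketch of random-IP probing in Python. It really opens network connections; the studies above also excluded reserved address ranges and then crawled the pages found, which this sketch does not.

    import random
    import socket

    def random_ip():
        # Uniformly random 32-bit address (reserved/private ranges not excluded).
        return ".".join(str(random.randint(0, 255)) for _ in range(4))

    def has_web_server(ip, timeout=2.0):
        # "Runs a web server" here just means something accepts a TCP
        # connection on port 80 within the timeout.
        try:
            with socket.create_connection((ip, 80), timeout=timeout):
                return True
        except OSError:
            return False

    sample = [random_ip() for _ in range(100)]
    fraction = sum(has_web_server(ip) for ip in sample) / len(sample)
    print(fraction)   # crude estimate of the fraction of IPs answering on port 80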
Random walks (Sec. 19.5)
. View the Web as a directed graph
. Build a random walk on this graph (a toy sampler is sketched below)
  . Includes various "jump" rules back to visited sites
    . Does not get stuck in spider traps!
    . Can follow all links!
  . Converges to a stationary distribution
    . Must assume graph is finite and independent of the walk.
    . Conditions are not satisfied (cookie crumbs, flooding)
    . Time to convergence not really known
  . Sample from stationary distribution of walk
  . Use the "strong query" method to check coverage by SE

Advantages & disadvantages (Sec. 19.5)
. Advantages
  . "Statistically clean" method, at least in theory!
  . Could work even for infinite web (assuming convergence) under certain metrics.
. Disadvantages
  . List of seeds is a problem.
  . Practical approximation might not be valid.
  . Non-uniform distribution
    . Subject to link spamming
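A toy random-walk sampler over a link graph, just to make the "jump rule" concrete. Here graph is a hypothetical dict from URL to out-links (a crawled snapshot); a real walker fetches links on the fly.

    import random

    def random_walk_sample(graph, seeds, jump_prob=0.15, steps=100000):
        # graph: dict mapping a URL to a list of out-link URLs.
        visited = list(seeds)
        page = random.choice(visited)
        counts = {}
        for _ in range(steps):
            out_links = graph.get(page, [])
            if not out_links or random.random() < jump_prob:
                page = random.choice(visited)        # "jump" back to a visited site
            else:
                page = random.choice(out_links)      # follow a random out-link
                visited.append(page)
            counts[page] = counts.get(page, 0) + 1
        # After (approximate) convergence, pages drawn from the tail of the walk
        # are roughly samples from its stationary distribution; the "strong query"
        # method can then check whether a search engine covers each sample.
        return counts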
Conclusions (Sec. 19.5)
. No sampling solution is perfect.
. Lots of new ideas ...
. ... but the problem is getting harder
. Quantitative studies are fascinating and a good research problem

DUPLICATE DETECTION
Duplicate documents (Sec. 19.6)
. The web is full of duplicated content
. Strict duplicate detection = exact match
  . Not as common
. But many, many cases of near duplicates
  . E.g., last-modified date the only difference between two copies of a page

Duplicate/Near-Duplicate Detection (Sec. 19.6)
. Duplication: exact match can be detected with fingerprints (sketched below)
. Near-duplication: approximate match
  . Overview
    . Compute syntactic similarity with an edit-distance measure
    . Use similarity threshold to detect near-duplicates
      . E.g., similarity > 80% => documents are "near duplicates"
      . Not transitive, though sometimes used transitively
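A minimal illustration of exact-match detection with fingerprints. Real systems use cheap 64-bit fingerprints (e.g., Rabin fingerprints) rather than a cryptographic hash, which is used here only for convenience.

    import hashlib

    def fingerprint(text):
        # Exact-duplicate fingerprint: hash of the normalized page content.
        normalized = " ".join(text.split())
        return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

    def exact_duplicate_groups(docs):
        # docs: dict url -> page text. Group URLs whose fingerprints collide.
        groups = {}
        for url, text in docs.items():
            groups.setdefault(fingerprint(text), []).append(url)
        return [urls for urls in groups.values() if len(urls) > 1]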
Computing Similarity (Sec. 19.6)
. Features:
  . Segments of a document (natural or artificial breakpoints)
  . Shingles (word N-grams)
    . a rose is a rose is a rose →
        a_rose_is_a
        rose_is_a_rose
        is_a_rose_is
        a_rose_is_a
. Similarity measure between two docs (= sets of shingles)
  . Jaccard coefficient: Size_of_Intersection / Size_of_Union (computed in the sketch below)

Shingles + Set Intersection (Sec. 19.6)
. Computing exact set intersection of shingles between all pairs of documents is expensive/intractable
  . Approximate using a cleverly chosen subset of shingles from each (a sketch)
. Estimate (size_of_intersection / size_of_union) based on a short sketch
[Diagram: Doc A → shingle set A → sketch A; Doc B → shingle set B → sketch B; the Jaccard coefficient of the shingle sets is estimated from the two sketches]
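The shingling and Jaccard computation in a few lines of Python (word 4-grams, as in the rose example above):

    def shingles(text, k=4):
        # Word k-grams ("shingles") of a document, as a set.
        words = text.lower().split()
        return {"_".join(words[i:i + k]) for i in range(len(words) - k + 1)}

    def jaccard(s1, s2):
        # Size_of_Intersection / Size_of_Union
        return len(s1 & s2) / len(s1 | s2)

    d1 = shingles("a rose is a rose is a rose")
    print(d1)   # {'a_rose_is_a', 'rose_is_a_rose', 'is_a_rose_is'}; duplicates collapse in a set
    print(jaccard(d1, shingles("a rose is a rose is a flower")))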
Sketch of a document (Sec. 19.6)
. Create a "sketch vector" (of size ~200) for each document
  . Documents that share ≥ t (say 80%) corresponding vector elements are near duplicates
. For doc D, sketchD[i] is as follows:
  . Let f map all shingles in the universe to 0..2^m - 1 (e.g., f = fingerprinting)
  . Let πi be a random permutation on 0..2^m - 1
  . Pick MIN {πi(f(s))} over all shingles s in D (see the code sketch below)

Computing Sketch[i] for Doc1 (Sec. 19.6)
. Start with the 64-bit values f(shingle) of Document 1 on the number line 0..2^64 - 1
. Permute the number line with πi
. Pick the min value
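A sketch of the sketch-vector computation. The random permutations πi are approximated here by a universal hash family (a·x + b mod p), a common simplification rather than what the slide literally prescribes.

    import hashlib
    import random

    P = (1 << 61) - 1   # a Mersenne prime; permuted values live in 0..P-1

    def f(shingle):
        # Fingerprint: map a shingle to a 64-bit integer (here via SHA-1).
        return int.from_bytes(hashlib.sha1(shingle.encode()).digest()[:8], "big")

    def make_sketch(shingle_set, num_perms=200, seed=0):
        # sketch[i] = MIN over shingles s of pi_i(f(s)), where the "random
        # permutation" pi_i is approximated by pi_i(x) = (a*x + b) mod P.
        rng = random.Random(seed)
        params = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(num_perms)]
        values = [f(s) % P for s in shingle_set]
        return [min((a * v + b) % P for v in values) for (a, b) in params]

    def sketch_similarity(sk1, sk2):
        # Fraction of coordinates that agree: an estimate of the Jaccard
        # coefficient of the two shingle sets.
        return sum(x == y for x, y in zip(sk1, sk2)) / len(sk1)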
Test if Doc1.Sketch[i] = Doc2.Sketch[i] (Sec. 19.6)
[Diagram: the permuted shingle values of Document 1 and Document 2 on the 0..2^64 - 1 number line; A and B are the minimum values for the two documents]
. Are these equal?
  . Test for 200 random permutations: π1, π2, … π200

However… (Sec. 19.6)
. A = B iff the shingle with the MIN value in the union of Doc1 and Doc2 is common to both (i.e., lies in the intersection)
. Claim: this happens with probability Size_of_intersection / Size_of_union
Set Similarity of sets Ci, Cj (Sec. 19.6)
. View sets as columns of a matrix A; one row for each element in the universe. aij = 1 indicates presence of item i in set j
. Example
      C1  C2
       0   1
       1   0
       1   1
       0   0        Jaccard(C1, C2) = 2/5 = 0.4
       1   1
       0   1

Key Observation (Sec. 19.6)
. For columns Ci, Cj, there are four types of rows
          Ci  Cj
      A    1   1
      B    1   0
      C    0   1
      D    0   0
. Overload notation: A = # of rows of type A (likewise B, C, D)
. Claim: Jaccard(Ci, Cj) = A / (A + B + C)
"Min" Hashing (Sec. 19.6)
. Randomly permute rows
. Hash h(Ci) = index of first row with 1 in column Ci
. Surprising property: P[h(Ci) = h(Cj)] = Jaccard(Ci, Cj)
  . Why? Both are A/(A+B+C)
    . Look down columns Ci, Cj until the first non-Type-D row
    . h(Ci) = h(Cj) ⟺ it is a type A row

Min-Hash sketches (Sec. 19.6)
. Pick P random row permutations
. MinHash sketch
  . SketchD = list of P indexes of first rows with 1 in column CD
. Similarity of signatures
  . Let sim[sketch(Ci), sketch(Cj)] = fraction of permutations where MinHash values agree
  . Observe: E[sim(sketch(Ci), sketch(Cj))] = Jaccard(Ci, Cj) (checked by the simulation below)
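A tiny simulation (not from the slides) checking that the fraction of agreeing MinHash values tracks the Jaccard coefficient, using the C1, C2 columns from the "Set Similarity" example above.

    import random

    def minhash_signatures(columns, num_perms=1000, seed=0):
        # columns: list of sets of row indices (the rows where the column has a 1).
        # For each random permutation of the rows, record for every column the
        # position of the first row (in permuted order) that contains a 1.
        rows = sorted(set().union(*columns))
        rng = random.Random(seed)
        sigs = []
        for _ in range(num_perms):
            order = {r: i for i, r in enumerate(rng.sample(rows, len(rows)))}
            sigs.append([min(order[r] for r in col) for col in columns])
        return sigs

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    # C1 and C2 from the example: 1s in rows {2, 3, 5} and {1, 3, 5, 6}.
    C1, C2 = {2, 3, 5}, {1, 3, 5, 6}
    sigs = minhash_signatures([C1, C2])
    agree = sum(s[0] == s[1] for s in sigs) / len(sigs)
    print(agree, jaccard(C1, C2))   # both should be close to 0.4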
Example (Sec. 19.6)

  Input matrix                    Signatures
        C1  C2  C3                                  S1  S2  S3
  R1     1   0   1                Perm 1 = (12345)   1   2   1
  R2     0   1   1                Perm 2 = (54321)   4   5   4
  R3     1   0   0                Perm 3 = (34512)   3   5   4
  R4     1   0   1
  R5     0   1   0

  Similarities
            1-2   1-3   2-3
  Col-Col   0.00  0.50  0.25
  Sig-Sig   0.00  0.67  0.00

All signature pairs (Sec. 19.6)
. Now we have an extremely efficient method for estimating a Jaccard coefficient for a single pair of documents.
. But we still have to estimate N^2 coefficients where N is the number of web pages.
  . Still slow
. One solution: locality sensitive hashing (LSH)
. Another solution: sorting (Henzinger 2006)
More resources
. IIR Chapter 19