
Needle in the Haystack: The Technology of Internet Search

Randy H. Katz
The United Microelectronics Corporation Distinguished Professor
Computer Science Division, EECS Department
University of California, Berkeley
Berkeley, CA 94720-1776 USA
[email protected]

Outline

• Historical Background
• Information Tsunami
• Anatomy of a Web Page
• Anatomy of Web Access
• The Challenge of Search
• Google’s Page Rank Algorithm
• Fun and Games with Internet Search
• New Directions

Search is BIG!

And the World is Going Digital


Historical Background: The Perfect Storm

• Vannevar Bush, “As We May Think,” MEMEX, 1945
• Ted Nelson, Xanadu, Hypertext, 1965-1990 (Autodesk)
• ARPANet, 1969
• NSFNet, 1985
• SGML, 1986
• Bill Atkinson, Hypercard, 1987
• Tim Berners-Lee, URL/HTTP/HTML, 1989
• World Wide Web: Marc Andreessen, NCSA Mosaic, 1993; Jim Clark, Netscape, 1995
• Commercial Internet, 1995
• Est. $15.5 Billion spent on-line, Thanksgiving to Christmas 2004, up 28% since 2003

Information Tsunami

• Bit: Binary digit – either a 0 or 1
• Byte: 8 bits
  – 1 byte: single character
  – 10 bytes: a single word
  – 100 bytes: Telegram or punched card
• Kilobyte: 1,000 or 10^3 bytes
  – 1 kilobyte: Very short story
  – 2 kilobytes: Typewritten page
  – 10 kilobytes: Encyclopedia page
  – 50 kilobytes: Compressed document image page
  – 100 kilobytes: Low-res photo
  – 200 kilobytes: Box of punched cards

Source: http://www.sims.berkeley.edu/research/projects/how-much-info/index.html

Information Tsunami (continued)

• Megabyte: 1,000,000 or 10^6 bytes
  – 1 megabyte: Small novel or 3.5in floppy disk
  – 2 megabytes: Hi-res photo
  – 5 megabytes: Complete works of Shakespeare
  – 10 megabytes: Minute of hi-fi sound
  – 100 megabytes: 1 m of shelved books
  – 500 megabytes: CD-ROM
• Gigabyte: 1,000,000,000 or 10^9 bytes
  – 1 gigabyte: Pickup truck filled with paper
  – 2 gigabytes: Movie on a DVD
  – 50 gigabytes: Floor of books
  – 100 gigabytes: Floor of academic journals
  – 500 gigabytes: Biggest FTP site
• Terabyte: 1,000,000,000,000 or 10^12 bytes
  – 1 terabyte: 50,000 trees made into paper and printed, or 1 day of EOS data
  – 2 terabytes: Academic research library
  – 10 terabytes: Printed collection of the U.S. Library of Congress
  – 50 terabytes: Contents of a large mass storage system
  – 400 terabytes: National Climate Data Center (NOAA) database
• Petabyte: 1,000,000,000,000,000 or 10^15 bytes
  – 1 petabyte: 3 years of Earth Observing System (EOS) data
  – 2 petabytes: All U.S. academic research libraries
  – 8 petabytes: All information available on the Web
  – 200 petabytes: All printed material (2001)

Information Tsunami (continued)

• Exabyte: 1,000,000,000,000,000,000 or 10^18 bytes
  – 2 exabytes: Total volume of information generated worldwide annually
  – 5 exabytes: All words ever spoken by humans
• Zettabyte: 1,000,000,000,000,000,000,000 or 10^21 bytes
• Yottabyte: 1,000,000,000,000,000,000,000,000 or 10^24 bytes
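The unit ladder on these slides climbs by a factor of 1,000 per step. A minimal sketch of that arithmetic follows; the function names `humanize` and `to_bytes` are illustrative choices, not part of the original talk.

```python
# Decimal storage units as used on the slides: each step is a factor of 1,000.
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte", "yottabyte"]


def to_bytes(amount, unit):
    """Convert an amount in the named unit (e.g. 8, "petabyte") to bytes."""
    return int(amount * 1000 ** UNITS.index(unit))


def humanize(num_bytes):
    """Express a byte count in the largest unit that keeps the value >= 1."""
    index = 0
    # Pick the unit with integer comparisons to avoid float rounding.
    while index < len(UNITS) - 1 and num_bytes >= 1000 ** (index + 1):
        index += 1
    value = num_bytes / 1000 ** index
    suffix = "" if value == 1 else "s"
    return f"{value:g} {UNITS[index]}{suffix}"
```

With the slide figures, `to_bytes(200, "petabyte") // to_bytes(8, "petabyte")` gives the ratio of all printed material (2001) to all information available on the Web.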


Anatomy of a Web Page: Randy’s Home Page

Page title: Professor Randy Howard Katz, University of California Berkeley Computer Science Division Home Page

• Locator
• Images

2005 vs. 1985 ... The hair is grayer, but the smirk remains the same!

"... Katz, a thin, almost gaunt man with horn-rimmed glasses magnifying sunken eyes. ..."
--George Johnson, WIRED Magazine (January 2000), page 150.

Anatomy of a Web Page: Randy’s Web Page

Professor Randy H. Katz

Electrical Engineering and Computer Science Department

The United Microelectronics Corporation Distinguished Professor

• Images
• Links!

Ph.D., University of California, Berkeley, 1980.
M.S., University of California, Berkeley, 1978.
A.B., Cornell University, 1976.
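The parts called out on these slides (page title, images, links) can be pulled out of a page's HTML mechanically. Below is a minimal sketch using Python's standard `html.parser`; the sample markup and image file names are hypothetical stand-ins, not the actual source of the home page.

```python
from html.parser import HTMLParser


class PageAnatomy(HTMLParser):
    """Collects the title, image sources, and link targets of an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.images = []
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "img" and attributes.get("src"):
            self.images.append(attributes["src"])
        elif tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Hypothetical markup, loosely modeled on the slide; not the real page source.
SAMPLE_PAGE = """
<html><head><title>Professor Randy Howard Katz Home Page</title></head>
<body>
<img src="katz-2005.jpg"> <img src="katz-1985.jpg">
<a href="http://www.umc.com.tw/">United Microelectronics Corporation</a>
</body></html>
"""


def dissect(html_text):
    """Parse the page and return the populated PageAnatomy object."""
    parser = PageAnatomy()
    parser.feed(html_text)
    return parser
```

Calling `dissect(SAMPLE_PAGE)` yields an object whose `images` and `links` lists correspond to the "Images" and "Links!" callouts.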


Anatomy of Web Access

[Diagram: a link URL (http://www.umc.com.tw/) in an HTML web page; the Naming System (DNS) performs name-to-address mapping and returns an IP address; the Web Browser then reaches the Web Server in Taiwan. Steps are labeled (1) through (4).]
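The access path in the diagram (link URL, name-to-address mapping, request to the web server) can be sketched as follows. This is an illustration under assumptions: the function name `plan_web_access` and the returned dictionary are invented for the example, and the resolver is passed in as a parameter so the sketch can be exercised without a network.

```python
import socket
from urllib.parse import urlsplit


def plan_web_access(url, resolve=socket.gethostbyname):
    """Split a URL, map its host name to an IP address, and build the request.

    `resolve` defaults to a real DNS lookup; pass a stand-in function to
    avoid touching the network.
    """
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or (443 if parts.scheme == "https" else 80)
    address = resolve(host)          # name-to-address mapping (DNS)
    path = parts.path or "/"
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n\r\n"
    )
    return {"host": host, "address": address, "port": port, "request": request}
```

A browser would next open a TCP connection to `address` on `port` and send `request`; that step is left out so the sketch stays self-contained.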


Anatomy of Web Access: Content Caching

[Diagram: a link URL (…/English/about/index.asp) in an HTML web page; the Naming System (DNS) and the Content Distribution Network DNS return either the Origin IP or the Edge Cache IP; the Web Browser reaches the Edge Cache in San Jose, backed by the Origin Web Server in Taiwan. Steps are labeled (5) through (8).]
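The edge-cache idea in the diagram can be sketched as a lookup that serves content locally when it can and falls back to the origin server otherwise. The class below is a toy illustration; `EdgeCache` and its `fetch_from_origin` callback are assumed names, and real content distribution networks also handle expiry and DNS-based steering, which are omitted here.

```python
class EdgeCache:
    """A toy edge cache: serve from local storage, else fetch from the origin."""

    def __init__(self, fetch_from_origin):
        self._fetch_from_origin = fetch_from_origin  # callable: path -> content
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, path):
        if path in self._store:
            self.hits += 1
        else:
            # First request for this path: go to the origin and keep a copy.
            self.misses += 1
            self._store[path] = self._fetch_from_origin(path)
        return self._store[path]
```

Only the first request for a given path travels to the origin; later requests are answered from the cache close to the browser.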

Quick (and Incomplete) History of Search Engines

[Timeline figure running from Pre-Web through 2005 (axis: 1993, 1995, 1997, 1999, 2001, 2003, 2005). Labels recoverable from the figure: Archie; Veronica; gopher & ftp services; UMinn; MIT; CMU; Wandex/WWW Wanderer; Aliweb; Webcrawler (UWash); Lycos; Excite (Stanford); Infoseek (ABC); Inktomi (Berkeley); AltaVista (DEC); HotBot (Wired); Yahoo! Directories; Google (Stanford); Ask Jeeves; AlltheWeb; Teoma; WiseNut; Gigablast; Overture; Walhello; Kartoo; GoHook; Ez2Find; Clusty; a9.com. Annotations, as far as they can be recovered: “1st Commercial Search Engine”; “Battle for Popularity”; “Yahoo! acquires Inktomi”; “Yahoo! acquires Overture (AlltheWeb, AltaVista)”; “Yahoo! deploys joint technology”.]

Challenges of Search

• How to find all the pages on the Web?
• How to order the pages by relevance?
• How to make searchable the content on those pages?
• How to keep it all up-to-date?
• Web Crawlers/SpiderBots
  – Network software executing in parallel that follows links in the Web to find content
  – Web pages “scraped” for more links
  – Web revisited on the order of once every two to three days
• Indexers
  – Web pages “scraped” for search terms to build indexes
  – (Google) Page rank algorithm: order a page within the index based (roughly) on how many pages refer to it
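The crawler bullets above can be sketched as a breadth-first walk over links. This is a single-threaded illustration, whereas the slide describes software executing in parallel; the `crawl` function, the `get_links` callback, and the in-memory `TOY_WEB` graph are all assumptions made for the example.

```python
from collections import deque


def crawl(seed, get_links, max_pages=1000):
    """Follow links breadth-first from `seed`; return pages in visit order."""
    seen = {seed}
    frontier = deque([seed])
    visited = []
    while frontier and len(visited) < max_pages:
        page = frontier.popleft()
        visited.append(page)
        # "Scrape" the page for more links and queue the ones not yet seen.
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical five-page web: page name -> pages it links to.
TOY_WEB = {"A": ["B", "C"], "B": ["C", "D"], "C": ["A"], "D": [], "E": ["A"]}
```

Starting from "A", the walk never reaches "E" because no crawled page links to it, a small version of the problem of finding all the pages on the Web.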

Search Challenges and Issues

• Web growing faster than search engines can index
• Web pages updated frequently, forcing frequent revisits
• Keyword-only searches result in many false positives
• Difficult to index dynamically generated sites: the so-called “invisible web”
• Some search engines order results by financial “placement” considerations rather than relevance
• Some sites trick search engines into displaying them first for some keywords, which pollutes search results, with more relevant links pushed down among the results


Page Ranking Algorithms

• Web page relevancy
  – Many hits: how to ensure the best/most relevant web pages are presented first in answer to a search
• Location and Frequency of Keywords
  – Index terms in page title raise its relevance for that term
  – Keywords near “top” of page more relevant than bottom
  – High keyword frequency boosts relevance
• If search engine strategy is known, page developers will “game” the strategy to get their pages ranked higher
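The location-and-frequency heuristics above can be sketched as a simple scoring function. The weights (`title_weight`, `position_weight`) and the way the three signals are combined are assumptions chosen for illustration; the slides do not give a formula.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_score(term, title, body, title_weight=5.0, position_weight=2.0):
    """Score a page for one search term by title match, position, and frequency."""
    term = term.lower()
    words = tokenize(body)
    score = 0.0
    if term in tokenize(title):
        score += title_weight                      # term appears in the page title
    positions = [i for i, word in enumerate(words) if word == term]
    if positions:
        score += len(positions) / len(words)       # keyword frequency
        # Earlier first occurrence ("top" of page) earns more.
        score += position_weight * (1 - positions[0] / len(words))
    return score
```

A scheme this transparent is easy to game: repeating the term near the top of the page raises the score, which is the weakness the last bullet points out.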

Google’s Page Rank Algorithm

• Which is the most important page?


Google’s Page Rank Algorithm

• Googlese from their web page:
  – PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value. Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important.”

Google Page Rank Algorithm

• Basic idea:
  – Page’s rank determined by the number of links to the page (also known as citations)
  – If citing page is more important (has a high page rank/authority page) then the pages it cites are more important
  – If citing page has many links, then cited page is less important (normalize for number of links on citing page)

PR(P) is the page rank of page P; T_1, …, T_n are the pages that cite P; C(P) is the number of links from page P; d is a “decay factor”, e.g., 0.85. Then:

PR(P) = (1 – d) + d (PR(T_1)/C(T_1) + … + PR(T_n)/C(T_n))

• See http://www-db.stanford.edu/~backrub/google.html
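The formula can be evaluated by repeated substitution: start every page at the same rank and reapply the formula until the values settle. The sketch below does that for a small link graph; the function name `page_rank`, the fixed iteration count, and the starting value of 1.0 are assumptions for the example, not details given on the slide.

```python
def page_rank(links, d=0.85, iterations=100):
    """Iterate PR(P) = (1 - d) + d * sum(PR(T)/C(T)) over pages T that cite P.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # C(T): number of links leaving each page.
    out_count = {page: len(links.get(page, [])) for page in pages}

    # For each page P, the pages T_1..T_n that cite it.
    cited_by = {page: [] for page in pages}
    for source, targets in links.items():
        for target in targets:
            cited_by[target].append(source)

    rank = {page: 1.0 for page in pages}
    for _ in range(iterations):
        rank = {
            page: (1 - d) + d * sum(rank[t] / out_count[t] for t in cited_by[page])
            for page in pages
        }
    return rank
```

For a graph where A and B both link to C and C links back to A, C ends up ranked highest, A next (it is cited by the important page C), and B lowest (nothing cites it).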


Google Conceptual Architecture

Google Server Architecture

[Diagram: Google Web Server, Spell Checker, Ad Server, Index Server, and multiple Doc Servers]

• Index servers: search term partitioned and mapped to doc list
• Intersect to find document list, sort by page rank
• Document IDs used to extract text from Doc Servers
• Over 100,000 processors (and growing) in Googleplex
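The index-server steps above (map each term to a document list, intersect the lists, sort by page rank, then fetch text by document ID) can be sketched on a single machine. The function names, the toy documents, and the rank values are hypothetical, and the partitioning of terms across many servers is not modeled.

```python
def build_index(docs):
    """Map each term to the set of document IDs whose text contains it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(query, index, page_rank, docs):
    """Intersect the per-term document lists and order matches by page rank."""
    terms = query.lower().split()
    if not terms:
        return []
    doc_lists = [index.get(term, set()) for term in terms]
    matches = set.intersection(*doc_lists)
    ranked = sorted(matches, key=lambda doc_id: page_rank[doc_id], reverse=True)
    # Document IDs are used to pull the text, as the Doc Servers would.
    return [(doc_id, docs[doc_id]) for doc_id in ranked]


# Hypothetical documents and page ranks, for illustration only.
TOY_DOCS = {
    1: "internet search engines",
    2: "search the web",
    3: "internet search history",
}
TOY_RANKS = {1: 0.4, 2: 1.2, 3: 0.9}
```

Searching the toy data for "internet search" intersects the two document lists to {1, 3} and returns document 3 first because its rank is higher.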

Fun and Games

• Google Scholar
• Googling Someone
• Google News
• Comparison Shopping
• Google Whacks


Google Scholar

Google Randy


Google Randy Katz “Google Index”

Google News

Advertising Placement


Comparison Shopping

elgooG


Google Whacks

Business Model: Ad Placement and Click-Thru

• Old data (2002): Google is now market leader in ad revenue
• 2004 revenue through 9/30/04: $2.1B

Top 10 Search Engines

10. DMOZ.org
9. Alltheweb.com
8. KartOO.com
7. MSN.com
6. Dogpile.com
5. AskJeeves.com
4. About.com
3. Yahoo.com
2. Vivisimo.com
1. Google.com


Clustering

Google Video Search

Amazon’s A9

A9’s Yellow Pages

Innovations Now and Yet to Come

• Index ever larger portions of the Web, even beyond traditional web pages, e.g., video
• Better quality/higher relevance searches
• Better presentation of results, e.g., clustering, site information
• Better exploitation of semantic relationships for improved page ranking, more personalization, e.g., user’s zip code
• More services (Web, news groups, blogs, comparison shopping, video/audio, yellow pages, etc.)
• Integrate with desktop machine


Parting Thoughts


Needle in the Haystack: The Technology of Internet Search

“Where is the wisdom we have lost in knowledge? Where is the knowledge we have lost in information?” T.S. Eliot, “Choruses from ‘The Rock’”, Selected Poems, NY: Harvest / Harcourt, 1962, p. 107.

Thanks for Your Patience & Attention! Questions?
