Lecture Outline
Introduction to Google
Jian-hua Yeh (葉建華)

Lecture Outline
• What are Google’s services?
• Inventing Google
• Current status of Google
• GMail service
• Google Office?
• iGoogle?
• What Google can not do

The Google Services
• Web search
• Images search
• Video search
• News search
• Maps
• Mail
• More?

10 Cool Things You Can Do With Google

1. Basic Searching

Basic Searching Step-by-Step
• Select search term(s)
• Enter search term(s) into search box
• Click Search or press Enter key
• Browse results

2. Advanced Searching
• Click on “Advanced Search” on main Google page
• Go to www.googleguide.com for more on how to use Google’s Basic and Advanced Search

Better Searches, Better Results
• Exact Phrase: [“one small step for man”]
• Excluded Words: [bass -fishing, virus -computer]
• Similar Words: [~mobile phone]
• Multiple Words (or): [Maui OR Hawaii]
• Multiple Words (and): [vacation Hawaii]
• “I’m feeling lucky”: takes you directly to the first web page returned for your query

3. Definitions
• “define ______” or “define: ____”
• Definitions gathered from around the Web

Define “Blog”

4. Calculator
• Addition: +
• Subtraction: -
• Multiplication: *
• Division: /
• Percentages: % of
• Exponents: ^

“15.99 + 32.50 + 13.25”

5. Numbers
• Phone #s
• Tracking #s
• VIN #s
• UPC codes
• Area codes
• More…

Examples of Number Searches
• Phone numbers
• Area codes
• Tracking packages by #
• UPC codes
• VIN #s

6. Movies
• Showtimes: “movies 91360”
• Reviews
• Buy tickets online

7. Stocks
• Find reports on specific stocks
• Compare stocks by entering multiple stock symbols

8. Weather
• Weather forecasts for specific regions of the world
• Example: “weather 91360”

9. Travel
• Airport weather and delays
• Airline flight information
• Examples: “lax airport” and “United 164”

10. Pizza!
• Find local businesses by typing in a keyword (like “pizza”) and your zip code

More?
Yes, there are more…

Lecture Outline
• What are Google’s services?
• Inventing Google
• Current status of Google
• GMail service
• Google Office?
• iGoogle?
• What Google can not do

Inventing Google
• Sergey & Larry - Ph.D. students at Stanford University
• Prototype (1998)
  – http://google.stanford.edu
  – 24,000,000 pages (8,058,044,651 today)
• Google
  – “We chose our system name, Google, because it is a common spelling of googol, or 10^100 and fits well with our goal of building very large-scale search engines.”
• PageRank
  – An objective measure of a page’s citation importance that corresponds well with people’s subjective idea of importance.

Google’s Mission
“Organize the world’s information and make it universally accessible and useful.”

Google’s Goal
“To provide a much higher level of service to all those who seek information, whether they're at a desk in Boston, driving through Bonn, or strolling in Bangkok.”

Business Ethics
1. Focus on the user and all else will follow.
2. It's best to do one thing really, really well.
3. Fast is better than slow.
4. Democracy on the web works.
5. You don't need to be at your desk to need an answer.
6. You can make money without doing evil.
7. There's always more information out there.
8. The need for information crosses all borders.
9. You can be serious without a suit.
10. Great just isn't good enough.

Inventing Google: Foundation
• PageRank*:
  – We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d... Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:
    PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
  [Figure: pages T1 … Tn, with outgoing link counts C1 … Cn, each pointing to page A]
*) Larry Page

Inventing Google: Foundation
• PageRank formula, informally
  – PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
  – PageRank can be thought of as a model of user behavior.
  – We assume there is a "random surfer" who is given a web page at random and keeps clicking on links, never hitting "back", but eventually gets bored and starts on another random page.
  – The probability that the random surfer visits a page is its PageRank.
• A page has a high PR if…
  – there are many pages that point to it
  – or if there are some pages that point to it and have a high PR
  – Note the recursive weight propagation through the web link structure.
  – Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages’ PageRanks will be one (for this to hold, the (1-d) term is usually divided by the total number of pages).
  – In the formula above, d is the probability that the "random surfer" follows a link on the current page; with probability 1-d the surfer gets bored and requests another random page.
• Personalization ☺

Inventing Google: Foundation
• PageRank relevancy tuning
  – Page title
  – Anchor text
  – Meta
  – Font
    • Size
    • Weight
  – Capitalization
  – …

Inventing Google: Anatomy
• URL Server
  – Provides the list of URLs to be fetched to the crawlers
• Google Crawlers (GoogleBot)
  – Multiple distributed crawlers
    • Own DNS cache
    • 300 connections open at once
  – Send fetched pages to the Store Server
  – Originally written in Python
• Store Server
  – Compresses and stores files to the Repository.
  – A DOCID is created for each page.
• Repository
  – Stores fetched pages for further processing by the Indexer

Inventing Google: Anatomy
• Indexer
  – Reads pages from the Repository (uncompressing them)
  – Parses each document (Flex on top of its own stack):
    • Page is converted to a set of Hits (position, font, capitalization, title/anchor/meta), 2 bytes per Hit
    • Added to the Document Index
  – Hits are distributed to Barrels (i.e., one document to multiple barrels)
  – Every link found in a page is stored in the Anchors file
• Forward and Inverted Barrels (2*64)
  – Forward Index
    • Each Barrel keeps a range of Hits sorted by DOCID
    • (DOCID, (WORDID, word’s Hit reference+)+)
  – Processed by the Sorter:
    • Generates the inverted index from the forward index by sorting Hits by WORDID
    • Creates the (WORDID, offsets) list used by the Lexicon
  – Inverted Index (short/full)
    • (WORDID, (DOCID reference, Hit list reference)+)
    • Short: DOCIDs sorted by, and containing just, quality Hits (word in title, anchor, ...); optimal for single-word search
    • Full: DOCIDs sorted by DOCID; optimal for merging Hit lists, i.e. multi-word search
• Anchors file
  – Anchor (from, to, text)
• URL Resolver
  – Reads the Anchors file:
    • Relative-to-absolute URL conversion + DOCID assignment
    • Creates the Links file
• Links file
  – (url, target: DOCID)

Inventing Google: Anatomy
• Searcher uses…
  – Lexicon
    • Keeps a map saying which Barrel to use.
    • Originally kept in memory (256MB).
      – IMHO something like a multi-level VM page table must be used now
      – It is/was of fixed size (14,000,000 words)
  – Barrels
    • Each Barrel keeps a range of WORDIDs
    • WORDID-to-DOCID map
  – PageRank pool
    • Keeps the computed PageRank for each DOCID
  – Doc Index
    • DOCID-ordered information about each document
      – (DOCID, status, repository pointer, checksum, stat, URL, title)

Cluster Innards

Cluster Innards: Global Google
• Over 30 Google clusters around the world.
  – DNS-based and geolocation-driven load balancing:
    • Domain Name: GOOGLE.COM
      Registrar: ALLDOMAINS.COM INC.
      Whois Server: whois.alldomains.com
      Referral URL: http://www.alldomains.com
      Name Server: NS2.GOOGLE.COM
      Name Server: NS1.GOOGLE.COM
      Name Server: NS3.GOOGLE.COM
      Name Server: NS4.GOOGLE.COM
      Status: REGISTRAR-LOCK
      Updated Date: 03-oct-2002
      Creation Date: 15-sep-1997
      Expiration Date: 14-sep-2011
• 2005, May 7: Google DNS hack speculations
• Total PCs
  – > 5,000 in 2000
  – > 15,000 in 2003
  – > 79,000* in 2004
*) I’m not sure about this number; it was taken from an external resource.

Cluster Innards: HW
• Basic cluster design insights
  – Reliability in SW rather than server-class HW.
    • Commodity PCs used to build a high-end computing cluster at low-end prices.
    • Example:
      – $287,000: 176x 2GHz Xeon, 176GB RAM, 7TB HDD
      – $758,000: 8x 2GHz Xeon, 64GB RAM, 8TB HDD
  – Design is tailored for best aggregate request throughput, not peak server response time (individual request parallelization).
• Google has inexpensively built out its computing infrastructure by using thousands of "commodity" servers
  – <2,000 servers in a single cluster.
  – Dual-processor x86 servers (starting at 533MHz Celeron) with 2-4 GB of memory per machine and one or more 80GB IDE drives.
  – Rack: 40-80 x86-based servers.

Cluster Innards: HW
• Optimistically, a consumer PC might crash once in three years from a software glitch or hardware problem.
  – "At Google scale...if you have thousands of PCs, you can expect one (failure) a day,…"
• 1,000,000s, not 1,000,000,000s, of dollars.
  – “The trick is to make these racks of hardware work together and to ensure that the failure of one machine doesn't derail an operation.”
• Switched Ethernet
  – Commodity networking hardware is used, typically either 100 megabits/second or 1 gigabit/second at the machine level, but averaging considerably less in overall bisection bandwidth.
  – Locality optimizations (GFS)

Cluster Innards: SW
• Stripped-down version of Linux, which is based on the Red Hat distribution but is really just the operating system kernel modified for Google.
• Google File System is optimized for handling large blocks of data.
  – 64MB blocks
  – The file system was designed to assume that a failure, such as a failed disk or unplugged network cable, can happen at any time.
  – Data is replicated in three places, and there is a "master" machine that can locate copies of a piece of data, such as a keyword index, if the original is out of commission.
• Google has created "batch" job scheduling software, called the Global Work Queue, that acts as a sort of taskmaster for millions of operations.
• Another important engineering feat by Google is making it very straightforward to write programs that run across thousands of servers…

Lecture Outline
• What are Google’s services?
• History of Google
• Current status of Google
• GMail service
• Google Office?
• iGoogle?
• What Google can not do

YEAR  MONTH  EVENT
1995  March  Sergey Brin and Larry Page meet at a Stanford University spring gathering of Ph.D.