Chapter Three: Google Technology
“Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results.... Fast crawling technology is needed to gather the Web documents and keep them up to date. Storage space must be used efficiently to store indices and, optionally, the documents themselves. The indexing system must process hundreds of gigabytes of data efficiently. Queries must be handled quickly, at the rate of hundreds to thousands per second.”
– Sergey Brin and Lawrence Page, 1997 [1]
1. From “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” www-db.stanford.edu/~backrub/google.html
In the beginning, there was BackRub, the service that became Google. Today, Google is most closely associated with its PageRank algorithm. PageRank is a voting algorithm weighted for importance. The indicator of a Web page’s importance is the number of pages that link to it. Messrs. Brin and Page soon added another factor that voted for the importance of a Web page: the number of people who click on it. The more clicks on a Web page, the more weight that Web page was given. Over time, still other factors have been added to the PageRank algorithm; for example, the frequency with which content on a page is changed.
Google’s PageRank technology is closely allied with Internet search. Voting algorithms are less effective in enterprise search, for instance. The attention given to Google and its search technology dominates popular thinking about the company. Google search is like a nova. The luminescence makes it difficult for the observer to see other aspects of the phenomenon clearly or easily.
Radiance aside, Google is a technology company.[2] Some of that technology, when described in technical papers such as the earliest one, “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” is demanding. The later papers, such as “MapReduce: Simplified Data Processing on Large Clusters,” can be a slow read.[3] Since Google is technology, explaining what Google does in easily digestible terms is difficult. The diagram below provides an unauthorized snapshot of Google’s computing framework.
[Diagram: Google’s computing framework, with callouts a, b, c, and d]
Important Google technologies that underlie this diagram of the Googleplex include:
[a] modifications to Linux to permit large file sizes and other functions so as to accelerate the overall system;
[b] a distributed architecture that allows applications and scaling to be “plugged in” without the type of hands-on set-up other operating systems require;
[c] a technical architecture that is similar at every level of scale;
[d] a Web-centric architecture that allows new types of applications to be built without a programming language limitation.
2. The annex to this monograph contains a listing of more than 60 Google patents. The list is not all-inclusive; however, it does provide the patent number and a brief description for some of Google’s most important patents. The PageRank patent belongs to the trustees of Stanford University. Google’s patent efforts have focused on systems and methods for relevance, advertising, and other core foci of the company. Google is creating a patent fence to protect its interests.
3. Jeff Dean, former Alta Vista researcher and a Google senior engineer, has been an advocate of MapReduce. His most recent papers are available on his Web page at http://labs.google.com/people/jeff/.
Google’s technology has emerged from a series of continuous improvements, or what Japanese management consultants call kaizen. Each Google technical change may be inconsequential to the average user of Google.
But when taken as a whole, Google’s “technological advantage” comes from Google’s incremental innovations, clever adaptations of research-computing concepts, and Byzantine tweaks to Linux. Some day, a historian of technology will be able to identify, from the hundreds of improvements that Google has engineered in the last nine years, one or two that stand with PageRank as of major importance.
Critics of Google will see that the company has grafted onto its core technology processes from many different sources. To illustrate, the structure of Google’s data centers and the messages passed to and from these data centers are in many ways a variant of grid computing.[4] Google’s ability to read data from many computers simultaneously is reminiscent of BitTorrent’s technology.[5] Google’s use of commodity or “white box” hardware in its data centers is an indication of Google’s hacker ethos. The use of memory and discs to store multiple copies of data comes from the frontiers of computing. Google’s approach to technology, then, is eclectic and in many ways represents a building-block approach to large-scale systems.
Google benefits from that eclecticism in several ways. First, Google’s computational framework delivers sizzling performance from low-cost hardware. Second, Google worked around the bottlenecks of such operating systems as Solaris, Windows Advanced Server, and off-the-shelf Linux. Third, Google took good programming ideas from other languages, implementing new functions and libraries to eliminate most of the manual coding required to parallelise an application across Google’s servers.[6]
According to Jeff Dean, one of Google’s senior engineers, “Google engineering is sort of chaotic.”[7] This is neither surprising nor necessarily a negative. The Googleplex is a toy box for engineers and programmers. The tools are sophisticated. The challenges posed by the problems and by peers make Google “the place to be” for the best and brightest technical talent in the world.
The nature of creativity combined with Google’s approach to innovation makes it difficult to predict the next big thing from Google.
Before selected parts of Google’s technology are reviewed in somewhat more detail, the diagram “Google’s Computing Framework” provides an overview of the Googleplex and some of its technologies. These will be touched upon in this section.
4. Grid computing is the application of resources from many computers in a network to a single problem or application. Google uses grid-like technology in its distributed computing system.
5. BitTorrent is a peer-to-peer file distribution tool written by programmer Bram Cohen in 2001. The reference implementation is written in Python and is released under the MIT License.
6. Google has anywhere from 100,000 to 165,000 or more servers. Servers are organized into clusters. Clusters may reside within one rack or across multiple racks of servers. Some Google functions are distributed across data centers.
7. From Dr Dean’s speech at the University of Washington in October 2003. See http://www.uwtv.org/programs/displayevent.asp?rid=2459.
PageRank requires a lot of computing cycles to work. When Google got underway in 1996, Messrs. Brin and Page had limited computing horsepower. To make PageRank work, they had to figure out how to get the PageRank algorithm to run on the garden-variety computers available to them. From the beginning – and this is an important issue with regard to Google’s almost-certain collision course with Microsoft – Google had to solve both software engineering and hardware engineering issues to make Google Search viable. In fact, when discussing Google technology, it is important to keep in mind that PageRank is important only because it can run quickly in the real world, not in a sterile computer lab illuminated with the blue glow of supercomputers.
The figure “Google’s Fusion: Hardware and Software Engineering” shows that Google’s technology framework has two areas of activity. There is the software engineering effort that focuses on PageRank and other applications. Software engineering, as used here, means writing code and thinking about how computer systems operate in order to get work done quickly. Quickly means the sub-one-second response times that Google is able to maintain despite its surging growth in usage, applications and data processing.
[Figure: Google’s Fusion: Hardware and Software Innovations. The Google phenomenon comes from the fusion occurring when PageRank’s software and hardware engineering interact. Google’s technology delivers supercomputer applications for mass markets.]
The other effort focuses on hardware. Google has refined server racks, cable placement, cooling devices, and data center layout. The payoff is lower operating costs and the ability to scale as demand for computing resources increases. With faster turnaround and the elimination of such troublesome jobs as backing up data, Google’s hardware innovations give it a competitive advantage few of its rivals can equal as of mid-2005.
PageRank, with its layering of additional computations added over the years, is a software problem of considerable difficulty. The Google system must find Web pages and perform dozens, if not hundreds, of analyses of those Web pages. Consider the links pointing to a Web page. Google must keep track of them for more than eight billion Web pages. For a single Web page with one link pointing to it, the problem is trivial. One link equals one pointer. But what happens when a site has 10,000 links pointing to it? The problem becomes many times larger and more computationally demanding. Some of these links are likely to come from sites that have more traffic than others. Some of the links may come from sites that have spoofed Google for fun or profit.
The calculations to sort out the “value” of each of these links add to the computational work associated with PageRank.
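To make the link bookkeeping described above concrete, here is a minimal sketch in Python. It is not Google’s production code. It assumes the simplified, normalized form of the PageRank calculation with a damping factor of 0.85, the value suggested in the Brin and Page paper quoted at the start of this chapter, and it leaves out clicks, traffic, freshness and spoof detection. The function names build_inlinks and pagerank, their parameters, and the five-page sample Web are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_inlinks(outlinks):
    """Invert a {page: [pages it links to]} map into {page: [pages linking to it]}."""
    inlinks = defaultdict(list)
    for source, targets in outlinks.items():
        for target in set(targets):  # a source votes at most once per target
            inlinks[target].append(source)
    return inlinks


def pagerank(outlinks, damping=0.85, iterations=50):
    """Return a {page: score} map; scores sum to 1 (assumed normalized variant)."""
    # Collect every page, including pages that are only ever linked to.
    pages = set(outlinks)
    for targets in outlinks.values():
        pages.update(targets)
    n = len(pages)

    out_degree = {p: len(set(outlinks.get(p, ()))) for p in pages}
    inlinks = build_inlinks(outlinks)

    # Start with every page holding an equal share of the total score.
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Pages with no outgoing links spread their score over all pages.
        dangling = sum(rank[p] for p in pages if out_degree[p] == 0)
        new_rank = {}
        for p in pages:
            # Each linking page q splits its own score evenly across its links,
            # so a link from an important page is worth more than one from an
            # obscure page.
            votes = sum(rank[q] / out_degree[q] for q in inlinks.get(p, ()))
            new_rank[p] = (1 - damping) / n + damping * (votes + dangling / n)
        rank = new_rank
    return rank


# Hypothetical five-page Web: three pages link to "popular", none link to "c".
web = {
    "hub": ["a", "b"],
    "a": ["popular"],
    "b": ["popular"],
    "c": ["popular", "a"],
    "popular": ["hub"],
}
ranks = pagerank(web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy Web, the page named popular ends up with the highest score because three pages vote for it, while c, which no page links to, receives only the baseline share. Each pass visits every link once, so the work per pass grows with the number of links, which is the scaling pressure the text describes.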