Indexing the World Wide Web: The Journey So Far

Abhishek Das, Google Inc., USA
Ankit Jain, Google Inc., USA

ABSTRACT

In this chapter, we describe the key indexing components of today's web search engines. As the World Wide Web has grown, the systems and methods for indexing have changed significantly. We present the data structures used, the features extracted, the infrastructure needed, and the options available for designing a brand new search engine. We highlight techniques that improve the relevance of results, discuss trade-offs to best utilize machine resources, and cover distributed processing concepts in this context. In particular, we delve into the topics of indexing phrases instead of terms, storage in memory vs. on disk, and data partitioning. We finish with some thoughts on information organization for newly emerging data forms.

INTRODUCTION

The World Wide Web is considered to be the greatest breakthrough in telecommunications since the telephone. Quoting The New Media Reader from MIT Press [Wardrip-Fruin, 2003]: "The World-Wide Web (W3) was developed to be a pool of human knowledge, and human culture, which would allow collaborators in remote sites to share their ideas and all aspects of a common project." The last two decades have witnessed many significant attempts to make this knowledge "discoverable". These attempts broadly fall into two categories: (1) classification of webpages into hierarchical categories (directory structure), championed by the likes of Yahoo! and the Open Directory Project; and (2) full-text index search engines such as Excite, AltaVista, and Google. The former is an intuitive method of arranging web pages, in which subject-matter experts collect and annotate pages for each category, much like books are classified in a library. With the rapid growth of the web, however, the popularity of this method gradually declined. First, the strictly manual editorial process could not cope with the growth in the number of web pages. Second, the user's idea of which sub-tree(s) to seek for a particular topic was expected to be in line with that of the editors responsible for the classification.

We are most familiar with the latter approach today, which presents the user with a keyword search interface and uses a pre-computed web index to algorithmically retrieve and rank web pages that satisfy the query. This is probably the most widely used method for navigating through cyberspace. The earliest search engines had to handle orders of magnitude more documents than previous information retrieval systems. Around 1995, when the number of static web pages was believed to double every few months, AltaVista reported having crawled and indexed approximately 25 million webpages. The indices of today's search engines are several orders of magnitude larger: Google reported around 25 billion web pages in 2005 [Patterson, 2005], while Cuil indexed 120 billion pages in 2008 [Arrington, 2008]. Harnessing the power of hundreds, if not thousands, of machines has proven key to addressing this challenge of grand scale.

Figure 1: History of Major Web Search Engine Innovations (1994-2010)

Using search engines has become routine nowadays, but they too have followed an evolutionary path. Jerry Yang and David Filo created Yahoo in 1994, starting it out as a listing of their favorite web sites along with a description of each page [Yahoo, 2010].
Later in 1994, WebCrawler was introduced, the first full-text search engine on the Internet; the entire text of each page was indexed for the first time. Excite, started in 1993 by six Stanford University students, became functional in December 1995. It used statistical analysis of word relationships to aid the search process and is part of AskJeeves today. Lycos, created at CMU by Dr. Michael Mauldin, introduced relevance retrieval, prefix matching, and word proximity in 1994. Though it was the largest search engine at the time, indexing over 60 million documents in 1996, it ceased crawling the web for its own index in April 1999. Today it provides access to human-powered results from LookSmart for popular queries and crawler-based results from Yahoo for others. Infoseek went online in 1995 and is now owned by the Walt Disney Internet Group. AltaVista, also started in 1995, was the first search engine to allow natural language questions and advanced search techniques. It also provided multimedia search for photos, music, and videos. Inktomi was started in 1996 at UC Berkeley, and in June 1999 introduced a directory search engine powered by concept induction technology, which tries to model human conceptual classification of content and project this intelligence across millions of documents. Yahoo purchased Inktomi in 2003. AskJeeves launched in 1997 and became famous as the natural language search engine: it allowed users to frame queries as questions and responded with what seemed to be the right answer. In reality, behind the scenes, the company employed many human editors who monitored search logs and located what seemed to be the best sites for the most popular queries. In 1999, it acquired Direct Hit, which had developed the world's first click-popularity search technology, and in 2001 it acquired Teoma, whose index was built upon clustering concepts of subject-specific popularity. Google, developed by Sergey Brin and Larry Page at Stanford University, launched in 1998 and used inbound links to rank sites. MSN Search and the Open Directory Project were also started in 1998; the former was reincarnated as Bing in 2009. The Open Directory, according to its website, "is the largest, most comprehensive human-edited directory of the Web". Formerly known as NewHoo, it was acquired by AOL Time Warner-owned Netscape in November 1998.

All current search engines rank web pages to identify potential answers to a query. Borrowing from information retrieval, a statistical similarity measure has always been used in practice to assess the closeness of each document (web page) to the user text (query), the underlying principle being that the higher the similarity score, the greater the estimated likelihood that the document is relevant to the user. This similarity formulation is based on models of documents and queries, the most effective of which is the vector space model [Salton, 1975]. The cosine measure [Salton, 1962] has consistently been found to be the most successful similarity measure for this model. It represents documents and queries as term vectors and takes as its distance function the cosine of the angle between each vector pair. From an entropy-based perspective, the score assigned to a document can be interpreted as the sum of the information conveyed by the query terms in the document.
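To make the cosine measure concrete, the following Python sketch scores a document against a query by the cosine of the angle between their term vectors. It is an illustration only: the function name and sample texts are our own, and the vectors use raw term counts rather than the weighted statistics discussed next.

    import math
    from collections import Counter

    def cosine_score(query_terms, doc_terms):
        # Build raw term-frequency vectors for the query and the document.
        q_vec, d_vec = Counter(query_terms), Counter(doc_terms)
        # Dot product over the terms the two vectors share.
        dot = sum(q_vec[t] * d_vec[t] for t in q_vec if t in d_vec)
        # Euclidean norms of the two vectors.
        q_norm = math.sqrt(sum(f * f for f in q_vec.values()))
        d_norm = math.sqrt(sum(f * f for f in d_vec.values()))
        return dot / (q_norm * d_norm) if q_norm and d_norm else 0.0

    # doc1 matches both query terms (one of them twice) and should outscore doc2,
    # which matches only one of them.
    doc1 = "indexing the web requires crawling the web at scale".split()
    doc2 = "the library catalog lists every indexing method".split()
    query = "web indexing".split()
    print(cosine_score(query, doc1), cosine_score(query, doc2))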
Intuitively, one would like to accumulate evidence by giving more weight to documents that match a query term several times than to ones that contain it only once. Each term's contribution is weighted such that terms that appear discriminatory are favored while the impact of more common terms is reduced. Most similarity measures are a composition of a few statistical values: the frequency of a term t in a document d (term frequency, or TF), the frequency of a term t in the query, the number of documents containing a term t (document frequency, or DF), the number of terms in a document, the number of documents in the collection, and the number of terms in the collection. The introduction of document-length pivoting [Singhal, 1996] addressed the bias toward or against long documents, which either contain many distinct terms or many instances of the same term.

The explosive growth of the web can primarily be attributed to the decentralization of content publication, with essentially no control of authorship. A huge drawback of this is that web pages are often a mix of facts, rumors, suppositions, and even contradictions. In addition, web-page content that is trustworthy to one user may not be so to another. With search engines becoming the primary means of discovering web content, however, users could no longer self-select the sources they find trustworthy. Thus, a significant challenge for search engines is to assign a user-independent measure of trust to each website or webpage.

Over time, search engines encountered another drawback [Manning, 2008] of web decentralization: the desire to manipulate webpage content for the purpose of appearing high up in search results. This is akin to companies choosing names that start with a long string of As in order to be listed early in the Yellow Pages. Content manipulation includes not only tricks like repeating keywords in the same color as the page background, but also sophisticated techniques such as cloaking and doorway pages, which serve different content depending on whether the HTTP request came from a crawler or a browser. To combat such spammers, search engines started exploiting the connectivity graph established by the hyperlinks between web pages.

Google [Brin, 1998] was the first web search engine known to apply link analysis on a large scale, although all web search engines currently make use of it. It assigned each page a score, called PageRank, which can be interpreted as the fraction of time that a random web surfer will spend on that webpage when following out-links from each page on the web. Another interpretation is that when a page links to another page, it effectively casts a vote of confidence; PageRank calculates a page's importance from the votes cast for it. HITS is another link-analysis technique, which scores pages as both hubs and authorities: a good hub is one that links to many good authorities, and a good authority is one that is linked to by many good hubs. It was developed by Jon Kleinberg and formed the basis of Teoma [Kleinberg, 1999].
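The random-surfer interpretation of PageRank can be sketched with a simple power iteration. The Python snippet below is a minimal illustration, not Google's production algorithm; the toy graph, the damping factor of 0.85, and the fixed iteration count are assumptions made for the example.

    def pagerank(out_links, damping=0.85, iterations=50):
        # out_links maps every page to the list of pages it links to;
        # every page in the graph is assumed to appear as a key.
        pages = list(out_links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}           # start from a uniform distribution
        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / n for p in pages}
            for page, targets in out_links.items():
                if not targets:                       # dangling page: spread its score evenly
                    for p in pages:
                        new_rank[p] += damping * rank[page] / n
                else:
                    share = damping * rank[page] / len(targets)
                    for t in targets:
                        new_rank[t] += share          # each out-link casts a "vote"
            rank = new_rank
        return rank

    # Tiny hypothetical web graph: A links to B and C, B links to C, C links back to A.
    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    print(pagerank(graph))                            # C ends up with the highest score here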
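Likewise, a minimal sketch of the HITS iteration, again on a hypothetical link graph, alternates the two mutually reinforcing updates: a page's authority score is summed from the hub scores of the pages that point to it, and its hub score is summed from the authority scores of the pages it points to.

    import math

    def hits(links, iterations=50):
        # links maps each page to the pages it points to.
        pages = set(links) | {t for targets in links.values() for t in targets}
        hub = {p: 1.0 for p in pages}
        auth = {p: 1.0 for p in pages}
        for _ in range(iterations):
            # A good authority is pointed to by many good hubs.
            auth = {p: sum(hub[q] for q in pages if p in links.get(q, ())) for p in pages}
            # A good hub points to many good authorities.
            hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
            # Normalize so the scores stay bounded across iterations.
            a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
            h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
            auth = {p: v / a_norm for p, v in auth.items()}
            hub = {p: v / h_norm for p, v in hub.items()}
        return hub, auth

    # Hypothetical graph: A and B both point to C, so C emerges as the strongest authority.
    hub, auth = hits({"A": ["B", "C"], "B": ["C"], "C": []})
    print(auth)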