The Google Similarity Distance
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 19, NO. 3, MARCH 2007, 370–383

Rudi L. Cilibrasi and Paul M.B. Vitányi

arXiv:cs/0412098v3 [cs.CL] 30 May 2007

Abstract— Words and phrases acquire meaning from the way they are used in society, from their relative semantics to other words and phrases. For computers, the equivalent of "society" is "database," and the equivalent of "use" is "way to search the database." We present a new theory of similarity between words and phrases based on information distance and Kolmogorov complexity. To fix thoughts, we use the world-wide-web as database, and Google as search engine. The method is also applicable to other search engines and databases. This theory is then applied to construct a method to automatically extract similarity, the Google similarity distance, of words and phrases from the world-wide-web using Google page counts. The world-wide-web is the largest database on earth, and the context information entered by millions of independent users averages out to provide automatic semantics of useful quality. We give applications in hierarchical clustering, classification, and language translation. We give examples to distinguish between colors and numbers, cluster names of paintings by 17th century Dutch masters and names of books by English novelists, the ability to understand emergencies, and primes, and we demonstrate the ability to do a simple automatic English-Spanish translation. Finally, we use the WordNet database as an objective baseline against which to judge the performance of our method. We conduct a massive randomized trial in binary classification using support vector machines to learn categories based on our Google distance, resulting in a mean agreement of 87% with the expert-crafted WordNet categories.

Index Terms— accuracy comparison with WordNet categories, automatic classification and clustering, automatic meaning discovery using Google, automatic relative semantics, automatic translation, dissimilarity semantic distance, Google search, Google distribution via page hit counts, Google code, Kolmogorov complexity, normalized compression distance (NCD), normalized information distance (NID), normalized Google distance (NGD), meaning of words and phrases extracted from the web, parameter-free data-mining, universal similarity metric

The material of this paper was presented in part at the IEEE ITSOC Information Theory Workshop 2005 on Coding and Complexity, 29th Aug. - 1st Sept., 2005, Rotorua, New Zealand, and the IEEE Int'l Symp. Information Theory, Seattle, Wash., USA, August 2006. Manuscript received April 12, 2005; final revision June 18, 2006.

Rudi Cilibrasi was supported in part by the Netherlands BSIK/BRICKS project, and by NWO project 612.55.002. He is at the Centre for Mathematics and Computer Science (Centrum voor Wiskunde en Informatica), Amsterdam, the Netherlands. Address: CWI, Kruislaan 413, 1098 SJ Amsterdam, The Netherlands. Email: [email protected].

Paul Vitányi's work was done in part while the author was on sabbatical leave at National ICT of Australia, Sydney Laboratory at UNSW. He is affiliated with the Centre for Mathematics and Computer Science (Centrum voor Wiskunde en Informatica) and the University of Amsterdam, both in Amsterdam, the Netherlands. Supported in part by the EU Project RESQ IST-2001-37559, the ESF QiT Programme, the EU NoE PASCAL, and the Netherlands BSIK/BRICKS project. Address: CWI, Kruislaan 413, 1098 SJ Amsterdam, The Netherlands. Email: [email protected].

I. INTRODUCTION

Objects can be given literally, like the literal four-letter genome of a mouse, or the literal text of War and Peace by Tolstoy. For simplicity we take it that all meaning of the object is represented by the literal object itself. Objects can also be given by name, like "the four-letter genome of a mouse," or "the text of War and Peace by Tolstoy." There are also objects that cannot be given literally, but only by name, and that acquire their meaning from their contexts in background common knowledge in humankind, like "home" or "red." To make computers more intelligent one would like to represent meaning in computer-digestible form. Long-term and labor-intensive efforts like the Cyc project [22] and the WordNet project [33] try to establish semantic relations between common objects, or, more precisely, names for those objects. The idea is to create a semantic web of such vast proportions that rudimentary intelligence, and knowledge about the real world, spontaneously emerge. This comes at the great cost of designing structures capable of manipulating knowledge, and of entering high-quality contents in these structures by knowledgeable human experts. While the efforts are long-running and large-scale, the overall information entered is minute compared to what is available on the world-wide-web.

The rise of the world-wide-web has enticed millions of users to type in trillions of characters to create billions of web pages of, on average, low-quality contents. The sheer mass of the information about almost every conceivable topic makes it likely that extremes will cancel and the majority or average is meaningful in a low-quality approximate sense. We devise a general method to tap the amorphous low-grade knowledge available for free on the world-wide-web, typed in by local users aiming at personal gratification of diverse objectives, and yet globally achieving what is effectively the largest semantic electronic database in the world. Moreover, this database is available to all by using any search engine that can return aggregate page-count estimates for a large range of search queries, like Google.

Previously, we and others developed a compression-based method to establish a universal similarity metric among objects given as finite binary strings [2], [25], [26], [7], [8], [39], [40], which was widely reported [20], [21], [13]. Such objects can be genomes, music pieces in MIDI format, computer programs in Ruby or C, pictures in simple bitmap formats, or time sequences such as heart rhythm data. This method is feature-free in the sense that it doesn't analyze the files looking for particular features; rather it analyzes all features simultaneously and determines the similarity between every pair of objects according to the most dominant shared feature. The crucial point is that the method analyzes the objects themselves. This precludes comparison of abstract notions or other objects that don't lend themselves to direct analysis, like emotions, colors, Socrates, Plato, Mike Bonanno and Albert Einstein. While the previous method that compares the objects themselves is particularly suited to obtain knowledge about the similarity of objects themselves, irrespective of common beliefs about such similarities, here we develop a method that uses only the name of an object and obtains knowledge about the similarity of objects, a quantified relative Google semantics, by tapping available information generated by multitudes of web users. Here we are reminded of the words of D.H. Rumsfeld [31]: "A trained ape can know an awful lot / Of what is going on in this world / Just by punching on his mouse / For a relatively modest cost!" In this paper, the Google semantics of a word or phrase consists of the set of web pages returned by the query concerned.

A. An Example:

While the theory we propose is rather intricate, the resulting method is simple enough. We give an example: At the time of doing the experiment, a Google search for "horse" returned 46,700,000 hits. The number of hits for the search term "rider" was 12,200,000. Searching for the pages where both "horse" and "rider" occur gave 2,630,000 hits, and Google indexed 8,058,044,651 web pages. Using these numbers in the main formula (III.3) we derive below, with N = 8,058,044,651, this yields a Normalized Google Distance between the terms "horse" and "rider" as follows:

NGD(horse, rider) ≈ 0.443.

In the sequel of the paper we argue that the NGD is a normed semantic distance between the terms in question, usually (but not always, see below) in between 0 (identical) and 1 (unrelated), in the cognitive space invoked by the usage of the terms on the world-wide-web as filtered by Google.

... storage, and on assumptions that are more refined, than what we propose. In contrast, [11], [1] and the many references cited there use the web and Google counts to identify lexico-syntactic patterns or other data. Again, the theory, aim, feature analysis, and execution are different from ours, and cannot meaningfully be compared. Essentially, our method below automatically extracts semantic relations between arbitrary objects from the web in a manner that is feature-free, up to the search engine used, and computationally feasible. This seems to be a new direction altogether.

C. Outline:

The main thrust is to develop a new theory of semantic distance between a pair of objects, based on (and unavoidably biased by) a background contents consisting of a database of documents. An example of the latter is the set of pages constituting the world-wide-web. Similarity relations between pairs of objects are distilled from the documents by just using the number of documents in which the objects occur, singly and jointly (irrespective of location or multiplicity). For us, the Google semantics of a word or phrase consists of the set of web pages returned by the query concerned. Note that this can mean that terms with different meaning have the same semantics, and that opposites like "true" and "false" often have a similar semantics. Thus, we just discover associations between terms, suggesting a likely relationship. As the web grows, the Google semantics may become less primitive. The theoretical underpinning is based on the theory of Kolmogorov