The Similarity Metric
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. XX, NO Y, MONTH 2004
arXiv:cs/0111054v3 [cs.CC] 5 Aug 2004

Ming Li, Xin Chen, Xin Li, Bin Ma, and Paul M.B. Vitányi

Abstract: A new class of distances appropriate for measuring similarity relations between sequences, say one type of similarity per distance, is studied. We propose a new “normalized information distance”, based on the noncomputable notion of Kolmogorov complexity, and show that it is in this class and it minorizes every computable distance in the class (that is, it is universal in that it discovers all computable similarities). We demonstrate that it is a metric and call it the similarity metric. This theory forms the foundation for a new practical tool. To evidence generality and robustness we give two distinctive applications in widely divergent areas using standard compression programs like gzip and GenCompress. First, we compare whole mitochondrial genomes and infer their evolutionary history. This results in a first completely automatically computed whole mitochondrial phylogeny tree. Secondly, we fully automatically compute the language tree of 52 different languages.

Index Terms: dissimilarity distance, Kolmogorov complexity, language tree construction, normalized information distance, normalized compression distance, phylogeny in bioinformatics, parameter-free data-mining, universal similarity metric

The material of this paper was presented in part in Proc. 14th ACM-SIAM Symposium on Discrete Algorithms, 2003, pp 863-872. Ming Li is with the Computer Science Department, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada, and with BioInformatics Solutions Inc., Waterloo, Canada. He is partially supported by NSF-ITR grant 0085801 and NSERC. Email: [email protected]. Xin Chen is with the Department of Computer Science, University of California, Santa Barbara, CA 93106, USA. Email: [email protected]. Xin Li is with the Computer Science Department, University of Western Ontario, London, Ontario N6A 5B7, Canada. Partially supported by NSERC grant RGP0238748. Email: [email protected]. Bin Ma is with the Computer Science Department, University of Western Ontario, London, Ontario N6A 5B7, Canada. Partially supported by NSERC grant RGP0238748. Email: [email protected]. Paul Vitányi is with the CWI, Kruislaan 413, 1098 SJ Amsterdam, The Netherlands, and with the University of Amsterdam, Amsterdam, The Netherlands. Email: [email protected]. Partially supported by the EU project QAIP, IST–1999–11234, the EU Project RESQ, the NoE QUIPROCONE IST–1999–29064, the ESF QiT Programme, the EU NeuroCOLT II Working Group EP 27150, and the EU PASCAL NoE.

I. Introduction

How do we measure similarity, for example to determine an evolutionary distance, between two sequences, such as internet documents, different language text corpora in the same language, among different languages based on example text corpora, computer programs, or chain letters? How do we detect plagiarism of student source code in assignments? Finally, the fast advance of worldwide genome sequencing projects has raised the following fundamental question to prominence in contemporary biological science: how do we compare two genomes [30], [51]?

Our aim here is not to define a similarity measure for a certain application field based on background knowledge and feature parameters specific to that field; instead we develop a general mathematical theory of similarity that uses no background knowledge or features specific to an application area. Hence it is, without changes, applicable to different areas and even to collections of objects taken from different areas. The method automatically zooms in on the dominant similarity aspect between every two objects. To realize this goal, we first define a wide class of similarity distances. Then, we show that this class contains a particular distance that is universal in the following sense: for every pair of objects the particular distance is less than any “effective” distance in the class between those two objects. This universal distance is called the “normalized information distance” (NID); it is shown to be a metric, and, intuitively, it uncovers simultaneously all the similarities that effective distances in the class uncover a single similarity apiece. (Here, “effective” is used as shorthand for a certain notion of “computability” that will acquire its precise meaning below.) We develop a practical analogue of the NID based on real-world compressors, called the “normalized compression distance” (NCD), and test it on real-world applications in a wide range of fields: we present the first completely automatic construction of the phylogeny tree based on whole mitochondrial genomes, and a completely automatic construction of a language tree for over 50 Euro-Asian languages.

Previous Work: Preliminary applications of the current approach were tentatively reported to the biological community and elsewhere [11], [31], [34]. That work, and the present paper, is based on information distance [33], [4], a universal metric that minorizes in an appropriate sense every effective metric: effective versions of Hamming distance, Euclidean distance, edit distances, Lempel-Ziv distance, and the sophisticated distances introduced in [16], [38]. Subsequent work in the linguistics setting, [2], [3], used related ad hoc compression-based methods, Appendix A. The information distance studied in [32], [33], [4], [31], and subsequently investigated in [25], [39], [43], [49], is defined as the length of the shortest binary program that is needed to transform the two objects into each other. This distance can be interpreted also as being proportional to the minimal amount of energy required to do the transformation: A species may lose genes (by deletion) or gain genes (by duplication or insertion from external sources), relatively easily. Deletion and insertion cost energy (proportional to the Kolmogorov complexity of deleting or inserting sequences in the information distance), an aspect that was stressed in [32]. But this distance is not suitable for measuring evolutionary sequence distance. For example, H. influenzae and E. coli are two closely related sister species. The former has about 1,856,000 base pairs and the latter has about 4,772,000 base pairs. However, using the information distance of [4], one would easily classify H. influenzae with a short (of comparable length) but irrelevant species simply because of length, instead of with E. coli. The problem is that the information distance of [4] deals with absolute distance rather than with relative distance. The paper [48] defined a transformation distance between two species, and [24] defined a compression distance. Both of these measures are essentially related to K(x|y). Other than being asymmetric, they also suffer from being absolute rather than relative. As far as the authors know, the idea of relative or normalized distance is, surprisingly, not well studied. An exception is [52], which investigates normalized Euclidean metric and normalized symmetric-set-difference metric to account for relative distances rather than absolute ones, and it does so for much the same reasons as does the present work. In [42] the equivalent functional of (V.1) in information theory, expressed in terms of the corresponding probabilistic notions, is shown to be a metric. (Our Lemma V.4 implies this result, but obviously not the other way around.)

This Work: We develop a general mathematical theory of similarity based on a notion of normalized distances. Suppose we define a new distance by setting the value between every pair of objects to the minimal upper semi-computable (Definition II.3 below) normalized distance (possibly a different distance for every pair). This new distance is a non-uniform lower bound on the upper semi-computable normalized distances. The central notion of this work is the “normalized information distance,” given by a simple formula, that is a metric, belongs to the class of

of the Eutherians can be reconstructed automatically from unaligned complete mitochondrial genomes by use of our software implementing (an approximation of) our theory, confirming one of the hypotheses in [9]. These experimental confirmations of the efficacy of our comprehensive approach contrast with recent more specialized approaches such as [50] that have been (and perhaps can only be) tested on small numbers of genes. They have not been experimentally tried on whole mitochondrial genomes that are, apparently, already numerically out of computational range. In area (ii) we fully automatically construct the language tree of 52 primarily Indo-European languages from translations of the “Universal Declaration of Human Rights”, leading to a grouping of language families largely consistent with current linguistic viewpoints. Other experiments and applications performed earlier but not reported here are detecting plagiarism in student programming assignments [10] and the phylogeny of chain letters [5].

Subsequent Work: The current paper can be viewed as the theoretical basis of a trilogy of papers: In [15] we address the gap between the rigorously proven optimality of the normalized information distance based on the noncomputable notion of Kolmogorov complexity, and the experimental successes of the “normalized compression distance” or “NCD” which is the same formula with the Kolmogorov complexity replaced by the