UvA-DARE (Digital Academic Repository)

Normalized information distance
Vitányi, P.M.B.; Balbach, F.J.; Cilibrasi, R.L.; Li, M.
Publication date: 2008

Citation for published version (APA):
Vitányi, P. M. B., Balbach, F. J., Cilibrasi, R. L., & Li, M. (2008). Normalized information distance. Institute for Logic, Language and Computation. http://arxiv.org/abs/0809.2553

Chapter 3
Normalized Information Distance

Paul M. B. Vitányi, Frank J. Balbach, Rudi L. Cilibrasi, and Ming Li

Paul M. B. Vitányi
CWI, Kruislaan 413, 1098 SJ Amsterdam, The Netherlands; e-mail: paulv@cwi.nl

Frank J. Balbach
University of Waterloo, Waterloo, ON, Canada; e-mail: fbalbach@uwaterloo.ca (supported by a postdoctoral fellowship of the German Academic Exchange Service (DAAD))

Rudi L. Cilibrasi
CWI, Kruislaan 413, 1098 SJ Amsterdam, The Netherlands; e-mail: cilibrar@cilibrar.com

Ming Li
University of Waterloo, Waterloo, ON, Canada; e-mail: mli@uwaterloo.ca

Abstract

The normalized information distance is a universal distance measure for objects of all kinds. It is based on Kolmogorov complexity and thus uncomputable, but there are ways to utilize it. First, compression algorithms can be used to approximate the Kolmogorov complexity if the objects have a string representation. Second, for names and abstract concepts, page count statistics from the World Wide Web can be used. These practical realizations of the normalized information distance can then be applied to machine learning tasks, especially clustering, to perform feature-free and parameter-free data mining. This chapter discusses the theoretical foundations of the normalized information distance and both practical realizations. It presents numerous examples of successful real-world applications based on these distance measures, ranging from bioinformatics to music clustering to machine translation.

3.1 Introduction

The typical data mining algorithm uses explicitly given features of the data to assess their similarity and discover patterns among them. It also comes with many parameters for the user to tune to specific needs according to the domain at hand. In this chapter, by contrast, we are discussing algorithms that neither use features of the data nor provide any parameters to be tuned, but that nevertheless often outperform algorithms of the aforementioned kind. In addition, the methods presented here are not just heuristics that happen to work, but they are founded in the mathematical theory of Kolmogorov complexity. The problems discussed in this chapter will mostly, yet not exclusively, be clustering tasks, in which naturally the notion of distance between objects plays a dominant role.

There are good reasons to avoid parameter laden methods. Setting the parameters requires an intimate understanding of the underlying algorithm.
Setting them incorrectly can result in missing the right patterns or, perhaps worse, in detecting false ones. Moreover, comparing two parametrized algorithms is difficult because different parameter settings can give a wrong impression that one algorithm is better than another, when in fact one is simply adjusted poorly. Comparisons using the optimal parameter settings for each algorithm are of little help because these settings are hardly ever known in real situations. Lastly, tweaking parameters might tempt users to impose their assumptions and expectations on the algorithm.

There are also good reasons to avoid feature based methods. Determining the relevant features requires domain knowledge, and determining how relevant they are often requires guessing. Implementing the feature extraction in an algorithm can be difficult, error-prone, and is often time consuming. It also limits the applicability of an algorithm to a specific field.

How can an algorithm perform well if it does not extract the important features of the data and does not allow us to tweak its parameters to help it do the right thing? Of course, parameter and feature free algorithms cannot mind read, so if we a priori know the features, how to extract them, and how to combine them into exactly the distance measure we want, we should do just that. For example, if we have a list of cars with their color, motor rating, etc. and want to cluster them by color, we can easily do that in a straightforward way.

Parameter and feature free algorithms are made with a different scenario in mind. In this exploratory data mining scenario we are confronted with data whose important features and how to extract them are unknown to us (perhaps there are not even features). We are then striving not for a certain similarity measure, but for the similarity measure between the objects. Does such an absolute measure of similarity exist at all? Yes, it does, in theory.
It is called the information distance, and the idea behind it is that two objects are similar if there is a simple description of how to transform each one of them into the other one. If, however, all such descriptions are complex, the objects are deemed dissimilar. For example, an image and its negative are very similar because the transformation can be described as "invert every pixel." By contrast, a description of how to transform a blank canvas into da Vinci's Mona Lisa would involve the complete, and comparably large, description of that painting.

The latter example already points to some issues one has to take care of, like asymmetry and normalization. Asymmetry refers to the fact that, after all, the inverse transformation of the Mona Lisa into a blank canvas can be described rather simply. Normalization refers to the fact that the transformation description size must be seen in relation to the size of the participating objects. Section 3.2 details how these and other issues are dealt with and explains in which sense the resulting information distance measure is universal.

The formulation of this distance measure will involve the mathematical theory of Kolmogorov complexity, which is generally concerned with shortest effective descriptions.

While the definition of the information distance is rather theoretical and cannot be realized in practice, one can still use its theoretical idea and approximate it with practical methods. Two such approaches are discussed in subsequent sections. They differ in which property of the Kolmogorov complexity they use and to what kind of objects they apply.

The first approach, presented in Sect. 3.3, exploits the relation between Kolmogorov complexity and data compression and consequently employs common compression algorithms to measure distances between objects. This method is applicable whenever the data to be clustered are given in a compressible form, for instance, as a text or other literal description.
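
To make the compression-based idea concrete, here is a minimal Python sketch. It rests on assumptions that this introduction does not state: zlib stands in for the "common compression algorithm", and the distance uses the normalized compression distance form, (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. The helper names and the sample data are hypothetical and serve only as illustration.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of zlib-compressed data (assumed stand-in for complexity)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Compression-based distance: small for similar inputs, near 1 for unrelated ones."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # cost of describing both objects together
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical sample data: two nearly identical texts and one random byte string.
text_a = b"the quick brown fox jumps over the lazy dog " * 40
text_b = b"the quick brown fox jumps over the lazy cat " * 40
rng = random.Random(0)  # fixed seed so the example is repeatable
noise = bytes(rng.getrandbits(8) for _ in range(len(text_a)))

print("similar texts:", ncd(text_a, text_b))
print("text vs noise:", ncd(text_a, noise))
```

The two near-identical texts should come out much closer to each other than either is to the random bytes, because compressing them together costs little more than compressing one alone. Real compressors only approximate the theoretical quantity, so the values depend on the compressor chosen.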
The second approach, presented in Sect. 3.4, exploits the relation between Kolmogorov complexity and probability. It uses statistics generated by common Web search engines to measure distances between objects. This method is applicable to non-literal objects, names and concepts, whose properties and interrelations are given by common sense and human knowledge.

3.2 Normalized Information Distance

Kolmogorov complexity measures the absolute information content of individual objects. For the purpose of data mining, especially clustering, we would also like to be able to measure the absolute information distance between individual objects. Such a notion should be universal in the sense that it contains all other alternative or intuitive notions of computable distances as special cases. Such a notion should also serve as an absolute measure of the informational, or cognitive, distance between discrete objects x and y.
