Marion REVOLLE

Algorithmic information theory for automatic classification

Marion Revolle, Nicolas le Bihan, François Cayre

Main objective

Files : any byte string in a computer (text, music, ...)
Algorithmic information theory : Kolmogorov complexity
→ Similarity measure in a non-probabilistic context : Similarity metric

1.1 Complexity

Given x a file string of size |x| defined on the alphabet A_x of size α_x.

A- Simple complexity
K(x) : Kolmogorov complexity : the length of a shortest binary program to compute x.
L(x) : Lempel-Ziv complexity : the minimal number of insert/copy operations from x's past needed to generate x.

B- Conditional complexity
K(x|y) : conditional Kolmogorov complexity : the length of a shortest binary program to compute x if y is furnished as an auxiliary input.
L(x|y) : Ziv-Merhav complexity : the minimal number of insert/copy operations from y needed to generate x.

1.2 GZIP

GZIP : compression algorithm = DEFLATE + Huffman.

A- DEFLATE
Dictionary compression based on LZ77 : make references from the past.
DEFLATE(x) generates two kinds of symbols :
L(a) : insert the element a in A_x = Literal.
Z(-i→j) : paste j elements, i elements before = Reference of length j.

B- Complexity
Number of symbols to compress x with DEFLATE ≈ number of symbols to compress with LZ77 = Lempel-Ziv complexity
Random byte string :
\[ \lim_{|x| \to \infty} L(x) = \frac{|x|}{\log_{\alpha} |x|} \]

1.3 Examples

A- x = ABCA BCCB CABC
L(x) = 6
Z(-1→1)
x : L(A) L(B) L(C) Z(-3→3) L(C) Z(-6→5)
    ABC ABC C BCABC

B- y = ABCA BCAB CABC
L(y) = 5
y : L(A) L(B) L(C) Z(-3→3) Z(-6→6)
    ABC ABC ABCABC
L(y|x) = 2
y|x : Z(-12→6) Z(-12→6)
      ABCABC ABCABC

C- z = MNOM NOMN OMNO
L(z) = 5
z : L(M) L(N) L(O) Z(-3→3) Z(-6→6)
    MNO MNO MNOMNO
L(z|x) = 12
z|x : L(M) L(N) L(O) L(M) L(N) L(O) L(M) L(N) L(O) ...
      MNOMNOMNO ...

2.1 Metrics

A- Vitanyi

A.1- Normalized Information Distance
The length of a shortest binary program that computes x from y as well as y from x.
\[ \mathrm{NID}(x, y) = \frac{\max\{K(x|y), K(y|x)\}}{\max\{K(x), K(y)\}} \tag{1} \]

A.2- Normalized Compression Distance
Approximations :
• K(x) = C(x) = compressed size of x
• C(x|y) = C(xy) − C(y)
\[ \mathrm{NCD}(x, y) = \frac{C(xy) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}} \tag{2} \]

B- Our proposals

B.1- Normalized Lempel-Ziv Distance
Approximation : K(x) = L(x)
\[ \mathrm{NLD}(x, y) = \frac{\max\{L(x|y) - 1, L(y|x) - 1\}}{\max\{L(x), L(y)\}} \tag{3} \]

B.2- Salza
• explicitude : keep only references which give information.
\[ ex(x|y) = \sum_{s \in S,\; l_i > l_{y0}} l_i \tag{4} \]
• weighted complexity : the shorter the references, the greater the complexity.
\[ LP(x|y) = \sum_{s \in S} \frac{1}{l_i} \tag{5} \]
\[ S(x, y) = \max\left\{ l_{y0} \left( \frac{LP(x|y)}{ex(x|y)} \right)^{2},\; l_{x0} \left( \frac{LP(y|x)}{ex(y|x)} \right)^{2} \right\} \tag{6} \]

2.2 Classification

Fig. 1: Markov chain classification with a) NCD, b) NLD and c) Salza
Fig. 2: Language classification with a) NCD, b) NLD and c) Salza

Grenoble Images Parole Signal Automatique, UMR CNRS 5216 - Grenoble Campus, 38400 Saint Martin d'Hères - FRANCE.
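
The complexities of sections 1.1 to 1.3 and the distances NCD (2) and NLD (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the names `count_ops`, `nld` and `ncd` are hypothetical, the parse is a greedy longest-match parse in which a copy may only read text that is already fully available (no overlapping copies), and `zlib` stands in for "compressed size of x". Under these assumptions the sketch reproduces the symbol counts of the examples in section 1.3.

```python
import zlib


def count_ops(target, source=None):
    """Number of insert/copy operations of a greedy parse of `target`.

    With source=None, copies read from the already parsed prefix of
    `target` (an estimate of L(x)). Otherwise copies read only from
    `source` (an estimate of the conditional L(x|y)).
    Assumption: copies never overlap the position being written.
    Works on str or bytes.
    """
    ops = 0
    pos = 0
    while pos < len(target):
        window = target[:pos] if source is None else source
        length = 0
        # Extend the match while the candidate still occurs in the window.
        while (pos + length < len(target)
               and target[pos:pos + length + 1] in window):
            length += 1
        # length == 0 means no match at all: insert one literal.
        pos += max(length, 1)
        ops += 1
    return ops


def nld(x, y):
    """Normalized Lempel-Ziv Distance, following equation (3)."""
    num = max(count_ops(x, y) - 1, count_ops(y, x) - 1)
    return num / max(count_ops(x), count_ops(y))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance, following equation (2).

    Assumption: C(.) is the length in bytes of the zlib output.
    """
    cx = len(zlib.compress(x, 9))
    cy = len(zlib.compress(y, 9))
    cxy = len(zlib.compress(x + y, 9))
    return (cxy - min(cx, cy)) / max(cx, cy)


x = "ABCABCCBCABC"
y = "ABCABCABCABC"
z = "MNOMNOMNOMNO"
print(count_ops(x), count_ops(y), count_ops(z))  # 6 5 5
print(count_ops(y, x), count_ops(z, x))          # 2 12
print(nld(x, y))
print(ncd(b"the quick brown fox " * 50, bytes(range(256)) * 4))
```

On strings as short as x, y and z the fixed header and checksum of the zlib stream dominate the compressed size, so `ncd` is only meaningful on longer inputs, whereas the symbol counts behind `nld` remain usable on very short strings.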
