Marion REVOLLE
Algorithmic information theory for automatic classification

Marion Revolle, Nicolas le Bihan, François Cayre

Main objective

Files: any byte string in a computer (text, music, ...).
Similarity measure in a non-probabilistic context: similarity metric.
Algorithmic information theory: Kolmogorov complexity.

1.1 Complexity

Let x be a file (byte string) of size |x|, defined on the alphabet A_x of size α_x.

A- Simple complexity

K(x): Kolmogorov complexity: the length of a shortest binary program that computes x.
L(x): Lempel-Ziv complexity: the minimal number of insert/copy operations, copying from x's past, needed to generate x.

B- Conditional complexity

K(x|y): conditional Kolmogorov complexity: the length of a shortest binary program that computes x if y is furnished as an auxiliary input.
L(x|y): Ziv-Merhav complexity: the minimal number of insert/copy operations, copying from y, needed to generate x.

1.2 GZIP

GZIP: compression algorithm = DEFLATE + Huffman.

A- DEFLATE

Dictionary compression based on LZ77: make references to the past.
DEFLATE(x) generates two kinds of symbols:
L(a): insert the element a of A_x = Literal.
Z(-i → j): paste j elements, starting i elements before = Reference of length j.

B- Complexity

Number of symbols to compress x with DEFLATE ≈ number of symbols to compress x with LZ77 = Lempel-Ziv complexity.

Random byte string:

$$\lim_{|x| \to \infty} L(x) = \frac{|x|}{\log_{\alpha} |x|}$$

1.3 Examples

A- x = ABCA BCCB CABC

L(x) = 6
x : L(A) L(B) L(C) Z(-3 → 3) L(C) Z(-6 → 5)
    ABC ABC C BCABC
The second L(C) could equivalently be written as the one-element reference Z(-1 → 1).

B- y = ABCA BCAB CABC

L(y) = 5
y : L(A) L(B) L(C) Z(-3 → 3) Z(-6 → 6)
    ABC ABC ABCABC
L(y|x) = 2
y|x : Z(-12 → 6) Z(-12 → 6)
      ABCABC ABCABC

C- z = MNOM NOMN OMNO

L(z) = 5
z : L(M) L(N) L(O) Z(-3 → 3) Z(-6 → 6)
    MNO MNO MNOMNO
L(z|x) = 12
z|x : L(M) L(N) L(O) L(M) L(N) L(O) L(M) L(N) L(O) ...
      MNOMNOMNO ...

2.1 Metrics

A- Vitanyi

A.1- Normalized Information Distance

Based on the length of a shortest binary program that computes x from y as well as y from x.

$$\mathrm{NID}(x, y) = \frac{\max\{K(x|y),\, K(y|x)\}}{\max\{K(x),\, K(y)\}} \tag{1}$$

A.2- Normalized Compression Distance

Approximations:
• K(x) ≈ C(x) = x's compression size
• C(x|y) = C(xy) − C(y)

$$\mathrm{NCD}(x, y) = \frac{C(xy) - \min\{C(x),\, C(y)\}}{\max\{C(x),\, C(y)\}} \tag{2}$$

B- Our proposals

B.1- Normalized Lempel-Ziv Distance

Approximation: K(x) ≈ L(x)

$$\mathrm{NLD}(x, y) = \frac{\max\{L(x|y) - 1,\, L(y|x) - 1\}}{\max\{L(x),\, L(y)\}} \tag{3}$$

B.2- Salza

• explicitude: keep only the references which give information.

$$ex(x|y) = \sum_{\substack{s \in S \\ l_i > l_{y0}}} l_i \tag{4}$$

• weighted complexity: the shorter the references, the greater the complexity.

$$LP(x|y) = \sum_{s \in S} \frac{1}{l_i} \tag{5}$$

$$S(x, y) = \max\left\{ l_{y0}\, \frac{LP(x|y)^2}{ex(x|y)},\; l_{x0}\, \frac{LP(y|x)^2}{ex(y|x)} \right\} \tag{6}$$

2.2 Classification

Fig. 1: Markov chain classification with a) NCD, b) NLD and c) Salza
Fig. 2: Language classification with a) NCD, b) NLD and c) Salza

Grenoble Images Parole Signal Automatique, UMR CNRS 5216 - Grenoble Campus, 38400 Saint Martin d'Hères - FRANCE.
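The symbol counts of section 1.3 and the distances (2) and (3) of section 2.1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`longest_match`, `symbol_count`, `nld`, `ncd`) are hypothetical, the parse is greedy, a reference may not overlap the position being written, matches shorter than `min_len = 2` are emitted as literals, and C(x) is taken to be the size of the `zlib` output at level 9.

```python
import zlib
from typing import Optional


def longest_match(source: bytes, target: bytes, start: int) -> int:
    """Length of the longest prefix of target[start:] that occurs in source."""
    length = 0
    while (start + length < len(target)
           and target[start:start + length + 1] in source):
        length += 1
    return length


def symbol_count(target: bytes, reference: Optional[bytes] = None,
                 min_len: int = 2) -> int:
    """Greedy count of literal and reference symbols.

    reference is None: copy from target's own past, an estimate of L(x).
    reference given:   copy from reference only, an estimate of L(x|y).
    min_len is an assumption: shorter matches are counted as literals.
    """
    count, pos = 0, 0
    while pos < len(target):
        source = target[:pos] if reference is None else reference
        match = longest_match(source, target, pos)
        pos += match if match >= min_len else 1  # reference or literal
        count += 1
    return count


def nld(x: bytes, y: bytes) -> float:
    """Normalized Lempel-Ziv Distance, equation (3)."""
    numerator = max(symbol_count(x, y) - 1, symbol_count(y, x) - 1)
    return numerator / max(symbol_count(x), symbol_count(y))


def compressed_size(data: bytes) -> int:
    """C(x): size in bytes of the zlib-compressed data (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance, equation (2)."""
    cx, cy = compressed_size(x), compressed_size(y)
    return (compressed_size(x + y) - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    x, y, z = b"ABCABCCBCABC", b"ABCABCABCABC", b"MNOMNOMNOMNO"
    print("L(x), L(y), L(z):", symbol_count(x), symbol_count(y), symbol_count(z))
    print("L(y|x), L(z|x):", symbol_count(y, x), symbol_count(z, x))
    print("NLD(x, y):", nld(x, y))
    print("NCD(x, y):", ncd(x, y))
```

With these assumptions the sketch reproduces the five counts given in section 1.3: L(x) = 6, L(y) = 5, L(z) = 5, L(y|x) = 2 and L(z|x) = 12. On strings this short the NCD value is dominated by the fixed overhead of the compressed stream, so it is only meaningful on larger files.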