Algorithmic theory for automatic classification

Marion Revolle, Nicolas le Bihan, François Cayre

Main objective

Files: any byte string in a computer (text, music, ...)
Similarity in a non-probabilistic context: similarity metric
Algorithmic: Kolmogorov

1.1 Complexity

Let x be a file, i.e. a string of size |x| defined on the alphabet A_x of size α_x.

A- Simple complexity
K(x): Kolmogorov complexity: the length of a shortest binary program that computes x.
L(x): Lempel-Ziv complexity: the minimal number of insert/copy operations, copying from x's own past, needed to generate x.

B- Conditional complexity
K(x|y): conditional Kolmogorov complexity: the length of a shortest binary program that computes x if y is furnished as an auxiliary input.
L(x|y): Ziv-Merhav complexity: the minimal number of insert/copy operations, copying from y, needed to generate x.

Random byte string:
$$\lim_{|x| \to \infty} L(x) = \frac{|x|}{\log_{\alpha_x} |x|}$$

1.2 GZIP

GZIP: compression algorithm = DEFLATE + Huffman.

A- DEFLATE
Dictionary compression based on LZ77: make references to the past.
DEFLATE(x) generates two kinds of symbols:
L(a): insert the element a of A_x = Literal.
Z(-i → j): paste j elements, starting i elements back = Reference of length j.

B- Complexity
Number of symbols to compress x with DEFLATE
≈ number of symbols to compress x with LZ77
= Lempel-Ziv complexity

1.3 Examples

A- x = ABCA BCCB CABC
L(x) = 6
x: L(A) L(B) L(C) Z(-3→3) L(C) Z(-6→5)
   ABC | ABCC | BCABC
(the lone C can equally be coded as the reference Z(-1→1))

B- y = ABCA BCAB CABC
L(y) = 5
y: L(A) L(B) L(C) Z(-3→3) Z(-6→6)
   ABC | ABC | ABCABC
L(y|x) = 2
y|x: Z(-12→6) Z(-12→6)
     ABCABC | ABCABC

C- z = MNOM NOMN OMNO
L(z) = 5
z: L(M) L(N) L(O) Z(-3→3) Z(-6→6)
   MNO | MNO | MNOMNO
L(z|x) = 12
z|x: L(M) L(N) L(O) L(M) L(N) L(O) L(M) L(N) L(O) ...
     MNOMNOMNO ...
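The symbol counts in the examples of 1.3 can be reproduced with a small greedy parser. The following is a minimal sketch, not the authors' implementation: the names `lz_parse` and `show` are hypothetical, copies are assumed not to overlap the part being generated, and in the conditional case the offset is assumed to be counted back from the end of the reference string, which is one reading consistent with the y|x example. The arrow is written `->` in the output.

```python
def lz_parse(x, ref=None):
    """Greedy insert/copy parse of x.

    ref is None : copies come from the already generated part of x.
    ref given   : copies come from ref only (conditional case).
    Returns a list of ("L", a) literals and ("Z", i, j) references.
    """
    symbols = []
    pos = 0
    while pos < len(x):
        source = x[:pos] if ref is None else ref
        # Grow the match while the extended prefix still occurs in source.
        length = 0
        while pos + length < len(x) and x[pos:pos + length + 1] in source:
            length += 1
        if length == 0:
            # Nothing to copy: insert one literal.
            symbols.append(("L", x[pos]))
            pos += 1
        else:
            # Use the closest occurrence of the matched block.
            start = source.rfind(x[pos:pos + length])
            # Assumption: conditional offsets count from the end of ref.
            back = (pos if ref is None else len(source)) - start
            symbols.append(("Z", back, length))
            pos += length
    return symbols


def show(symbols):
    """Render symbols in the poster's L(a) / Z(-i->j) notation."""
    return " ".join(
        f"L({s[1]})" if s[0] == "L" else f"Z(-{s[1]}->{s[2]})"
        for s in symbols
    )


x = "ABCABCCBCABC"
y = "ABCABCABCABC"
z = "MNOMNOMNOMNO"

for name, parse in [("x", lz_parse(x)), ("y", lz_parse(y)),
                    ("y|x", lz_parse(y, ref=x)), ("z|x", lz_parse(z, ref=x))]:
    print(f"{name}: {len(parse)} symbols: {show(parse)}")
```

The number of symbols returned plays the role of L(x) when `ref` is omitted and of L(x|y) when `ref=y` is given.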

2.1 Metrics

A- Vitanyi

A.1- Normalized Information Distance
Based on the length of a shortest binary program that computes x from y as well as y from x.
$$\mathrm{NID}(x, y) = \frac{\max\{K(x|y), K(y|x)\}}{\max\{K(x), K(y)\}} \tag{1}$$

A.2- Normalized Compression Distance
Approximations:
• K(x) ≈ C(x) = x's compressed size
• C(x|y) = C(xy) − C(y)
$$\mathrm{NCD}(x, y) = \frac{C(xy) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}} \tag{2}$$

B- Our proposals

B.1- Normalized Lempel-Ziv Distance
Approximation: K(x) ≈ L(x)
$$\mathrm{NLD}(x, y) = \frac{\max\{L(x|y) - 1, L(y|x) - 1\}}{\max\{L(x), L(y)\}} \tag{3}$$

B.2- Salza
• explicitude: keep only the references which give information.
$$ex(x|y) = \sum_{\substack{s \in S \\ l_i > l_{y0}}} l_i \tag{4}$$
• weighted complexity: the shorter the references, the greater the complexity.
$$LP(x|y) = \sum_{s \in S} \frac{1}{l_i} \tag{5}$$
$$S(x, y) = \max\left( \frac{LP(x|y)^2}{ex(x|y)}\, l_{y0},\ \frac{LP(y|x)^2}{ex(y|x)}\, l_{x0} \right) \tag{6}$$

2.2 Classification

Fig. 1: Markov chain classification with a) NCD, b) NLD and c) Salza
Fig. 2: Language classification with a) NCD, b) NLD and c) Salza
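The NCD of equation (2) and the NLD of equation (3) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the code behind the figures: `C` uses Python's `zlib` (DEFLATE) as the compressor, and `lz_count` is a hypothetical greedy, non-overlapping insert/copy counter standing in for L(x) and L(x|y).

```python
import zlib
from typing import Optional


def C(data: bytes) -> int:
    """Compressed size in bytes, used as a stand-in for K."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance, equation (2)."""
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def lz_count(x: bytes, ref: Optional[bytes] = None) -> int:
    """Number of greedy insert/copy symbols needed to generate x.

    Copies come from the already generated part of x when ref is None,
    and from ref only otherwise (assumed reading of L(x) and L(x|y)).
    """
    count = 0
    pos = 0
    while pos < len(x):
        source = x[:pos] if ref is None else ref
        length = 0
        while pos + length < len(x) and x[pos:pos + length + 1] in source:
            length += 1
        pos += max(length, 1)  # a literal consumes exactly one element
        count += 1
    return count


def nld(x: bytes, y: bytes) -> float:
    """Normalized Lempel-Ziv Distance, equation (3)."""
    num = max(lz_count(x, y) - 1, lz_count(y, x) - 1)
    return num / max(lz_count(x), lz_count(y))


if __name__ == "__main__":
    x = b"ABCABCCBCABC"
    y = b"ABCABCABCABC"
    z = b"MNOMNOMNOMNO"
    print("NLD(x, y) =", nld(x, y))
    print("NLD(x, z) =", nld(x, z))
    print("NCD(x, y) =", ncd(x, y))
```

On strings as short as the examples of 1.3, the fixed header overhead of `zlib` dominates the compressed size, so the NCD values only become meaningful on longer inputs; the symbol-count based NLD does not carry that overhead.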

Grenoble Images Parole Signal Automatique UMR CNRS 5216 - Grenoble Campus 38400 Saint Martin d'Hères - FRANCE