Marion REVOLLE

Algorithmic information theory for automatic classification

Marion Revolle, [email protected]
Nicolas le Bihan, [email protected]
François Cayre, [email protected]
Grenoble Images Parole Signal Automatique, UMR CNRS 5216, Grenoble Campus, 38400 Saint Martin d'Hères, FRANCE.

Main objective

• Files: any byte string in a computer (text, music, ...).
• Similarity measure in a non-probabilistic context: similarity metric.
• Algorithmic information theory: Kolmogorov complexity.

1.1 Complexity

Let x be a file (a string) of size |x|, defined on the alphabet A_x of size α_x.

A- Simple complexity

K(x): Kolmogorov complexity: the length of a shortest binary program to compute x.
L(x): Lempel-Ziv complexity: the minimal number of insert/copy operations from x's past needed to generate x.

B- Conditional complexity

K(x|y): conditional Kolmogorov complexity: the length of a shortest binary program to compute x if y is furnished as an auxiliary input.
L(x|y): Ziv-Merhav complexity: the minimal number of insert/copy operations from y needed to generate x.

Random byte string:

\[ \lim_{|x| \to \infty} L(x) = \frac{|x|}{\log_{\alpha} |x|} \]

1.2 GZIP

GZIP: compression algorithm = DEFLATE + Huffman.

A- DEFLATE

Dictionary compression based on LZ77: make references from the past.
DEFLATE(x) generates two kinds of symbols:
• L(a): insert the element a of A_x = Literal.
• Z(-i → j): paste j elements, starting i elements before = Reference of length j.

B- Complexity

Number of symbols to compress x with DEFLATE ≈ number of symbols to compress x with LZ77 = Lempel-Ziv complexity.

1.3 Examples

A- x = ABCA BCCB CABC

L(x) = 6
x : L(A) L(B) L(C) | Z(-3 → 3) | L(C) | Z(-6 → 5)
    ABC            | ABC       | C    | BCABC

B- y = ABCA BCAB CABC

L(y) = 5
y : L(A) L(B) L(C) | Z(-3 → 3) | Z(-6 → 6)
    ABC            | ABC       | ABCABC

L(y|x) = 2
y|x : Z(-12 → 6) | Z(-12 → 6)
      ABCABC     | ABCABC

C- z = MNOM NOMN OMNO

L(z) = 5
z : L(M) L(N) L(O) | Z(-3 → 3) | Z(-6 → 6)
    MNO            | MNO       | MNOMNO

L(z|x) = 12
z|x : L(M) L(N) L(O) L(M) L(N) L(O) L(M) L(N) L(O) ...
      MNOMNOMNO ...

2.1 Metrics

A- Vitanyi

A.1- Normalized Information Distance

The numerator is the length of a shortest binary program that computes x from y as well as y from x.

\[ \mathrm{NID}(x, y) = \frac{\max\{K(x|y),\, K(y|x)\}}{\max\{K(x),\, K(y)\}} \tag{1} \]

A.2- Normalized Compression Distance

Approximations:
• K(x) = C(x) = x's compression size
• C(x|y) = C(xy) − C(y)

\[ \mathrm{NCD}(x, y) = \frac{C(xy) - \min\{C(x),\, C(y)\}}{\max\{C(x),\, C(y)\}} \tag{2} \]

B- Our proposals

B.1- Normalized Lempel-Ziv Distance

Approximation: K(x) = L(x)

\[ \mathrm{NLD}(x, y) = \frac{\max\{L(x|y) - 1,\, L(y|x) - 1\}}{\max\{L(x),\, L(y)\}} \tag{3} \]

B.2- Salza

• explicitude: keep only references which give information.

\[ \mathrm{ex}(x|y) = \sum_{s \in S,\; l_i > l_{y0}} l_i \tag{4} \]

• weighted complexity: the shorter the references, the greater the complexity.

\[ L_P(x|y) = \sum_{s \in S} \frac{1}{l_i} \tag{5} \]

\[ S(x, y) = \max\left\{ l_{y0}\, \frac{L_P(x|y)^2}{\mathrm{ex}(x|y)},\; l_{x0}\, \frac{L_P(y|x)^2}{\mathrm{ex}(y|x)} \right\} \tag{6} \]

2.2 Classification

Fig. 1: Markov chain classification with a) NCD, b) NLD and c) Salza
Fig. 2: Language classification with a) NCD, b) NLD and c) Salza
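The Lempel-Ziv and Ziv-Merhav counts of section 1.3 and the distances (2) and (3) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `parse_count`, `nld` and `ncd` are hypothetical, the parse is assumed to be greedy with non-overlapping copies (an assumption chosen because it reproduces the counts shown in the examples), and `zlib` (a DEFLATE implementation from the Python standard library) stands in for GZIP as the compressor C.

```python
import zlib
from typing import Optional


def parse_count(target: bytes, dictionary: Optional[bytes] = None) -> int:
    """Count greedy insert/copy operations needed to generate `target`.

    dictionary is None -> copies come from target's own past (L(x)).
    dictionary given   -> copies come from the dictionary only (L(x|y)).
    Assumption: a copy source may not overlap the part being generated.
    """
    ops = 0
    pos = 0
    n = len(target)
    while pos < n:
        source = target[:pos] if dictionary is None else dictionary
        length = 0
        # Extend the phrase while it still occurs somewhere in the source.
        while pos + length < n and target[pos:pos + length + 1] in source:
            length += 1
        # length == 0 means no match at all: emit one literal instead.
        pos += max(length, 1)
        ops += 1
    return ops


def nld(x: bytes, y: bytes) -> float:
    """Normalized Lempel-Ziv Distance, equation (3)."""
    num = max(parse_count(x, y) - 1, parse_count(y, x) - 1)
    return num / max(parse_count(x), parse_count(y))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance, equation (2), with C = zlib size."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)


x = b"ABCABCCBCABC"
y = b"ABCABCABCABC"
z = b"MNOMNOMNOMNO"

print(parse_count(x), parse_count(y), parse_count(z))  # L(x), L(y), L(z)
print(parse_count(y, x), parse_count(z, x))            # L(y|x), L(z|x)
print(nld(x, y), ncd(x, y))
```

Under these assumptions the sketch returns L(x) = 6, L(y) = 5, L(z) = 5, L(y|x) = 2 and L(z|x) = 12, matching the worked examples. On strings this short the `zlib` header dominates the compressed size, so the NCD value is only meaningful for much longer inputs.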
