EE5585 April 4, 2013 Lecture 19 Instructor: Arya Mazumdar Scribe: Katie Moenkhaus

Kolmogorov Complexity

A major result is that the Kolmogorov complexity of a random sequence is, on average, close to its entropy. This captures the notion of compressibility. Also, an algorithmically incompressible binary sequence, as defined by the Kolmogorov complexity, has approximately the same number of 1s and 0s. The difference between the number of 0s and 1s gives insight into how compressible a signal is. Kolmogorov complexity can be defined for numbers as well as sequences. For n ∈ Z,

K(n) = \min_{p : U(p) = n} l(p),

where K(n) denotes the Kolmogorov complexity of n and l(p) is the length of the program p, so that K(n) is the minimum length of a computer program that outputs n. Some integers have low Kolmogorov complexity. For example, let n = 55^{5^{55}}. Even though this is a very large number, it has a short description. Also, e is easily described: even though it is an irrational number, it has a very short description as the base of the natural logarithm, i.e. the value at 1 of the function that is its own derivative. Note that

K(n) \le \log^* n + c, \qquad \text{where } \log^* n = \log n + \log\log n + \log\log\log n + \dots,

the sum being continued only as long as each term is positive.

Theorem 1 There exists an infinite number of integers n for which K(n) > \log n.

Proof (by contradiction): Assume there exist only a finite number of integers for which K(n) > \log n. From Kraft's inequality,

\sum_n 2^{-K(n)} \le 1.

If the assumption is true, then there is an n_0 such that n \ge n_0 \implies K(n) \le \log n. Then

\sum_{n \ge n_0} 2^{-K(n)} \ge \sum_{n \ge n_0} 2^{-\log n} = \sum_{n \ge n_0} \frac{1}{n}.

The series \sum_n \frac{1}{n} does not converge, so

\sum_{n \ge n_0} \frac{1}{n} = \infty.

It follows that

\sum_n 2^{-K(n)} = \infty.

This contradicts \sum_n 2^{-K(n)} \le 1, and thus the assumption is false. ∴ there exists an infinite number of integers n for which K(n) > \log n.
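As a small numerical illustration of the iterated logarithm \log^* n appearing in the bound K(n) \le \log^* n + c above, here is a minimal Python sketch (the function name log_star is ours, and base-2 logarithms are assumed):

```python
import math

def log_star(n: int) -> float:
    """log* n = log n + log log n + ..., keeping terms only while they are positive.
    Base-2 logarithms are assumed here."""
    total = 0.0
    x = float(n)
    while True:
        x = math.log2(x)
        if x <= 0:
            break
        total += x
    return total

print(log_star(2 ** 20))  # roughly 20 + 4.32 + 2.11 + ...
```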

This illustrates that there is an infinite number of integers that are not simple to describe, i.e. they will take more than \log n bits to describe. Given a binary sequence of length n, if

|\#1\text{s} - \#0\text{s}| \ge \epsilon n

for some fixed \epsilon > 0, then the sequence can be compressed, since its empirical entropy is strictly less than 1 bit per symbol.
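To make the compressibility claim concrete, the sketch below (our own illustration, with arbitrary counts) computes the empirical binary entropy of a sequence in which the counts of 1s and 0s differ by a constant fraction of n; since H(p) < 1 bit, roughly nH(p) < n bits suffice to describe the sequence.

```python
import math

def binary_entropy(p: float) -> float:
    """H(p) in bits, with H(0) = H(1) = 0 by convention."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

n = 1000
ones = 700                      # |#1s - #0s| = 400, a constant fraction of n
p = ones / n
print(n * binary_entropy(p))    # about 881 bits, noticeably fewer than n = 1000
```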


Consider a height/weight data set.

Because the data is highly correlated, a clockwise rotation can be done to obtain new axes such that the line of best fit becomes the new prominent axis.

Thus the difference between any data point and the new axis is small, and fewer bits are needed to describe the data. In the new coordinates, the entries are uncorrelated; ideally, there will be zero correlation. Note that the correlation between random variables X and Y is E[(X − E(X))(Y − E(Y))], where E(X) denotes the expected value of X. If X and Y are independent, then this correlation is 0; conversely, a large value indicates that the variables are strongly dependent. Without loss of generality, assume E(X) = E(Y) = 0.
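As a quick numerical check of the definition E[(X − E(X))(Y − E(Y))], the sketch below estimates it by a sample average on a few made-up height/weight pairs (the numbers are purely illustrative):

```python
import numpy as np

# Made-up height (cm) and weight (kg) samples; the two are strongly related.
height = np.array([150., 160., 165., 170., 180., 190.])
weight = np.array([50., 58., 62., 68., 80., 90.])

# Sample estimate of E[(X - E(X)) (Y - E(Y))].
corr = np.mean((height - height.mean()) * (weight - weight.mean()))
print(corr)   # a large positive value: height and weight are highly correlated
```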

Taking the variables to be zero mean just amounts to subtracting the mean from each one. Let

X = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{pmatrix}, \qquad E(X) = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}.

In the height/weight data, there were only two variables, i.e. height is x1 and weight is x2. The rotation is done by multiplying by a matrix:

Y = AX

Y is another N-dimensional vector, and A is N × N. It is desirable for A to be an orthonormal matrix, i.e. A^T A = I. If this holds, then the result is an orthogonal transformation. For any orthogonal transformation, Parseval's theorem holds:

\sum_i \sigma_{y_i}^2 = \sum_i E(y_i^2) = E(\|Y\|^2) = E(Y^T Y) = E(X^T A^T A X) = E(X^T X) = E(\|X\|^2) = \sum_i E(x_i^2) = \sum_i \sigma_{x_i}^2.
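A minimal sketch verifying this identity numerically, using a random orthogonal matrix obtained from a QR factorization (this particular way of producing A is our choice, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5

# A random orthogonal matrix: the Q factor of a QR factorization satisfies Q^T Q = I.
A, _ = np.linalg.qr(rng.standard_normal((N, N)))

X = rng.standard_normal(N)
Y = A @ X

print(np.allclose(A.T @ A, np.eye(N)))           # True: A is orthonormal
print(np.allclose(np.sum(Y**2), np.sum(X**2)))   # True: the energy is conserved
```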

If \sigma_{x_i}^2 is the variance of x_i, then \sum_i E(x_i^2) = \sum_i \sigma_{x_i}^2 follows from the fact that X is a zero-mean random vector. Thus, the transformation conserves the total energy. The transformation should also ensure that the new data are uncorrelated. The correlation structure is represented by the covariance matrix C_x, defined as

C_x = E(XX^T) = E\begin{pmatrix} x_1^2 & x_1x_2 & \cdots & x_1x_N \\ x_1x_2 & x_2^2 & \cdots & x_2x_N \\ \vdots & \vdots & \ddots & \vdots \\ x_1x_N & x_2x_N & \cdots & x_N^2 \end{pmatrix} = \begin{pmatrix} \sigma_{x_1}^2 & E(x_1x_2) & \cdots & E(x_1x_N) \\ E(x_1x_2) & \sigma_{x_2}^2 & \cdots & E(x_2x_N) \\ \vdots & \vdots & \ddots & \vdots \\ E(x_1x_N) & E(x_2x_N) & \cdots & \sigma_{x_N}^2 \end{pmatrix}.

Note that this is a symmetric matrix. From the definition of the trace of a matrix,

\mathrm{Trace}(C_x) = \sum_i \sigma_{x_i}^2,

which is the total energy of the signal. The transform matrix A should minimize the off-diagonal elements of the covariance matrix.
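The trace identity can be checked numerically. In this sketch the covariance matrix is estimated from M samples of a zero-mean vector (the per-coordinate scalings are arbitrary, chosen only to make the variances differ):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 4, 10000

# M samples of a zero-mean N-dimensional vector, stored as the columns of X.
X = rng.standard_normal((N, M)) * np.array([[1.0], [2.0], [0.5], [3.0]])

Cx = (X @ X.T) / M                      # sample estimate of E[X X^T]
total_energy = np.mean(np.sum(X**2, axis=0))

# Both quantities are the average of ||X||^2 over the samples, so they match.
print(np.trace(Cx), total_energy)
```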

C_y = E(YY^T) = E(AXX^T A^T) = A\,E(XX^T)\,A^T = A C_x A^T

Because A^T A = I and A is an orthogonal matrix, A^T = A^{-1}. Thus,

C_y = A C_x A^{-1}

C_y A = A C_x

The ideal covariance matrix C_y is a diagonal matrix. Suppose

C_y = \begin{pmatrix} \lambda_1 & & & & 0 \\ & \lambda_2 & & & \\ & & \ddots & & \\ & & & \lambda_{N-1} & \\ 0 & & & & \lambda_N \end{pmatrix} = \Lambda.

Because X is zero mean, Y is also zero mean. Since C_x and \Lambda are symmetric, transposing C_y A = A C_x gives

C_x A^T = A^T \Lambda.

Write the transform matrix as

A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1N} \\ a_{21} & a_{22} & \cdots & a_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ a_{N1} & a_{N2} & \cdots & a_{NN} \end{pmatrix},

and, for ease of notation, let A_i denote the i-th column of A^T (equivalently, the i-th row of A, written as a column vector).

Reading C_x A^T = A^T \Lambda column by column gives, in general,

C_x A_i = \lambda_i A_i.

\lambda_i is an eigenvalue of C_x, and A_i is the corresponding eigenvector. Hence

A^T = \begin{pmatrix} A_1 & A_2 & \cdots & A_N \end{pmatrix},

i.e., the transform matrix A is formed from the eigenvectors of C_x (its rows are the eigenvectors). The eigenvalues are simply \lambda_i = \sigma_{y_i}^2. This is called the Karhunen-Loève Transform (KLT). The process for finding the KLT is:

1. Subtract off the mean:

x_1 \to x_1 - E(x_1), \quad x_2 \to x_2 - E(x_2), \quad \dots, \quad x_N \to x_N - E(x_N).

2. Find the covariance matrix. If there are M samples of each variable, then the energy in the n-th variable is

\sigma_{x_n}^2 = \frac{1}{M}\sum_{i=1}^{M} x_{ni}^2, \qquad n = 1, \dots, N,

and the correlations are

\rho_{x_n, x_m} = \frac{1}{M}\sum_{i=1}^{M} x_{ni}\, x_{mi}.

The covariance matrix is then

C_x = \begin{pmatrix} \sigma_{x_1}^2 & \rho_{x_1,x_2} & \cdots & \rho_{x_1,x_N} \\ \rho_{x_1,x_2} & \sigma_{x_2}^2 & \cdots & \rho_{x_2,x_N} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{x_1,x_N} & \rho_{x_2,x_N} & \cdots & \sigma_{x_N}^2 \end{pmatrix}.

3. Knowing the covariance matrix C_x, find its eigenvalues and eigenvectors, and form the transform matrix A from the eigenvectors:

A^T = \begin{pmatrix} A_1 & A_2 & \cdots & A_N \end{pmatrix},

where A_i represents the i-th eigenvector of C_x. The process outlined above is called Principal Component Analysis (PCA). The goal of this analysis is to find a matrix that transforms the vectors to a new coordinate system in which they are uncorrelated. So ideally,

C_y = \begin{pmatrix} \lambda_1 & & & & 0 \\ & \lambda_2 & & & \\ & & \ddots & & \\ & & & \lambda_{N-1} & \\ 0 & & & & \lambda_N \end{pmatrix}.
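A minimal NumPy sketch of these three steps (the data generation, variable names, and sample sizes are our own, purely illustrative): it estimates C_x from samples, uses the eigenvectors of C_x as the rows of A, and checks that the transformed data are uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 2, 5000

# Correlated data: two coordinates that are almost the same signal.
base = rng.standard_normal(M)
X = np.vstack([base + 0.1 * rng.standard_normal(M),
               base + 0.1 * rng.standard_normal(M)])

# Step 1: subtract the mean of each coordinate.
X = X - X.mean(axis=1, keepdims=True)

# Step 2: sample covariance matrix.
Cx = (X @ X.T) / M

# Step 3: the rows of A are the eigenvectors of Cx.
eigvals, eigvecs = np.linalg.eigh(Cx)   # eigh, since Cx is symmetric
A = eigvecs.T                           # so that Y = A X

Y = A @ X
Cy = (Y @ Y.T) / M
print(np.round(Cy, 6))                  # diagonal (up to round-off), with the eigenvalues on the diagonal
```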

Keep only the eigenvalues that are largest. The energy lost in this process is the sum of the eigenvalues that are not used in the compression scheme. This is the mean square error. Suppose that the eigenvalues are ordered, i.e. \lambda_1 > \lambda_2 > \dots > \lambda_N. If all of \{\lambda_i\}_{i=N_0}^{N} are unused, then the mean square error is

\mathrm{MSE} = \sum_{i=N_0}^{N} \lambda_i.

This is the distortion induced by compressing the signal.
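This relation can be checked numerically; the sketch below (entirely our own, with an arbitrary covariance matrix) keeps only the components with the largest eigenvalues and compares the resulting reconstruction error with the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, keep = 4, 20000, 2                      # keep the 2 largest-eigenvalue components

# Zero-mean samples with an arbitrary (positive definite) covariance.
C_true = np.array([[4.0, 1.0, 0.5, 0.2],
                   [1.0, 3.0, 0.4, 0.1],
                   [0.5, 0.4, 2.0, 0.3],
                   [0.2, 0.1, 0.3, 1.0]])
X = np.linalg.cholesky(C_true) @ rng.standard_normal((N, M))

Cx = (X @ X.T) / M
eigvals, eigvecs = np.linalg.eigh(Cx)         # ascending eigenvalues
A = eigvecs[:, ::-1].T                        # rows = eigenvectors, largest eigenvalue first

Y = A @ X
Y[keep:, :] = 0                               # discard the small-eigenvalue components
X_hat = A.T @ Y                               # inverse transform (A is orthogonal)

mse = np.mean(np.sum((X - X_hat) ** 2, axis=0))
print(mse, eigvals[::-1][keep:].sum())        # the two quantities agree
```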

The KLT optimally minimizes the cross-correlations E(y_i y_j), i \neq j, but it is computationally inefficient because a new transform matrix has to be computed for each new data set. Instead, some standard transform matrices are used. One of these is

1 1 1 1 ... 1  1 ω ω2 ω3 . . . ωN   2 4 6 2N  1 1 ω ω ω . . . ω   3 6 9 3N  F = √ 1 ω ω ω . . . ω  N   ......  ......  1 ωN−1 ω2(N−1) ω3(N−1) . . . ω(N−1)(N−1)

This is an example of a generic transform matrix. Here \omega = e^{-i 2\pi/N}, and F is the Discrete Fourier Transform (DFT) matrix. The DFT projects the data onto the rows of F, and each row is a frequency kernel; thus, it produces the frequency components of the data.

Y = FX = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_N \end{pmatrix}

The values \{Y_i\}_{i=1}^{N} are the frequency components. Compressing the signal by omitting the low-amplitude Y_i's (usually the high-frequency ones) effectively filters the signal. If the low-amplitude Y_i's are scattered throughout the vector, filtering is still being done, but it is not necessarily characterizable as a low-pass, high-pass, or band-pass filter. This is a widely used compression scheme.
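A small sketch of this scheme (our own construction and signal choice): build the N × N DFT matrix F, compute Y = FX, zero out the low-amplitude components, and invert. Since F is unitary, the inverse is its conjugate transpose.

```python
import numpy as np

N = 8
n = np.arange(N)
omega = np.exp(-2j * np.pi / N)
F = omega ** np.outer(n, n) / np.sqrt(N)       # F[j, k] = omega^(j k) / sqrt(N)

# A slowly varying signal plus a little noise (illustrative choice).
x = np.cos(2 * np.pi * n / N) + 0.05 * np.random.default_rng(4).standard_normal(N)

Y = F @ x                                      # frequency components
keep = 2
idx = np.argsort(np.abs(Y))[:-keep]            # indices of all but the 2 largest bins
Y[idx] = 0                                     # compression: drop the low-amplitude bins

x_hat = np.conj(F.T) @ Y                       # F is unitary, so F^{-1} = F^H
print(np.max(np.abs(x - x_hat)))               # only the small (noise) bins were lost
```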

In the case of images, a Discrete Cosine Transform (DCT) is most frequently used:

A_{ij} = \begin{cases} \sqrt{\frac{1}{N}} \cos\left(\frac{j\pi(2i+1)}{2N}\right), & j = 0, \\[4pt] \sqrt{\frac{2}{N}} \cos\left(\frac{j\pi(2i+1)}{2N}\right), & j \neq 0. \end{cases}

For correlated data that form a first-order Markov chain, the energy compaction of the DCT is very close to that of the KLT. The DCT therefore performs very well for highly correlated data.
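A sketch constructing this DCT matrix directly from the formula above and checking that it is orthonormal (indices run from 0 to N − 1; the check itself is ours):

```python
import numpy as np

def dct_matrix(N: int) -> np.ndarray:
    """Orthonormal DCT matrix with A[i, j] = alpha_j * cos(j*pi*(2i+1) / (2N))."""
    i, j = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    A = np.sqrt(2.0 / N) * np.cos(j * np.pi * (2 * i + 1) / (2 * N))
    A[:, 0] = np.sqrt(1.0 / N)      # the j = 0 column uses the 1/sqrt(N) normalization
    return A

A = dct_matrix(8)
print(np.allclose(A.T @ A, np.eye(8)))   # True: the DCT matrix is orthonormal
```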

A problem with modern technology is that image-capturing systems, for example, lack hardware capable of performing these data compression techniques, so much more data is captured than is actually kept after compression. The process is outlined below.

Ideally, the sensing and compression would be combined into a compressed sensing step.

The compressed sensing block contains a matrix \Phi. If the discrete signal is X, ideally the sensors will be able to perform the matrix multiplication \Phi X directly. Given any signal X to be sensed, there is hardware to implement the linear combinations given by the rows of \Phi. The multiplication is

\begin{pmatrix} \phi_{11} & \phi_{12} & \cdots & \phi_{1N} \\ \vdots & \vdots & \ddots & \vdots \\ \phi_{M1} & \phi_{M2} & \cdots & \phi_{MN} \end{pmatrix} \begin{pmatrix} x_1 \\ \vdots \\ x_N \end{pmatrix} = \begin{pmatrix} y_1 \\ \vdots \\ y_M \end{pmatrix},

where M \ll N for compression. Note that \Phi is no longer a square matrix; it is short and fat. Thus only M samples are taken, and compression and sensing occur in the same step. This can be done in hardware.
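A sketch of the measurement step (the Gaussian choice of Φ and the sizes are illustrative assumptions, not something fixed by the lecture): M random linear combinations of the N signal samples are taken, so sensing and compression happen in one matrix multiplication.

```python
import numpy as np

rng = np.random.default_rng(5)
N, M = 256, 32                       # M << N: far fewer measurements than samples

Phi = rng.standard_normal((M, N))    # M x N sensing matrix (short and fat)
x = rng.standard_normal(N)           # the discrete signal to be sensed

y = Phi @ x                          # only M numbers are recorded
print(Phi.shape, y.shape)            # (32, 256) (32,)
```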

Decoding Transform Codes

In decoding, X needs to be recovered from Y, but there is not a unique solution for X given \Phi and Y. However, more is known about X: for a suitable (and known) transform matrix F, the vector FX has few non-zero entries.

Write

\Phi = \tilde{\Phi} F.

Note that \Phi is M \times N, \tilde{\Phi} is M \times N, and F is N \times N.

Then

Y = \Phi X = \tilde{\Phi} F X = \tilde{\Phi} \tilde{X}, \qquad \text{where } \tilde{X} = FX.

\tilde{X} is a sparse vector, i.e. it has few non-zero values. Let k be the number of non-zero entries of \tilde{X}; k \ll N. Knowing \tilde{\Phi} is equivalent to knowing \Phi, since F is known. The formulation of the problem is then: given Y = \tilde{\Phi}\tilde{X} and \tilde{\Phi}, find \tilde{X}, where \tilde{X} has k \ll N non-zero values (X is then recovered as X = F^{-1}\tilde{X}). Also, find the matrices \Phi for which this problem is solvable.
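For a toy instance, this recovery problem can even be solved by brute force over all supports of size k, which makes the problem statement concrete (the exhaustive search is only for illustration and is exponential in general; here the sparsifying transform is taken to be F = I, so the signal itself is sparse):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(6)
N, M, k = 12, 6, 2                         # tiny sizes so exhaustive search is feasible

Phi = rng.standard_normal((M, N))          # plays the role of the (known) sensing matrix
x = np.zeros(N)
x[[3, 9]] = [1.5, -2.0]                    # a k-sparse signal
y = Phi @ x                                # the M measurements

# Try every support of size k; accept one whose least-squares fit reproduces y exactly.
x_hat = None
for support in combinations(range(N), k):
    cols = list(support)
    coeffs, *_ = np.linalg.lstsq(Phi[:, cols], y, rcond=None)
    if np.allclose(Phi[:, cols] @ coeffs, y):
        x_hat = np.zeros(N)
        x_hat[cols] = coeffs
        break

print(np.allclose(x_hat, x))               # True: the sparse signal is recovered
```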
