EE5585 April 4, 2013 Lecture 19 Instructor: Arya Mazumdar Scribe: Katie Moenkhaus

Kolmogorov Complexity

A major result is that the Kolmogorov complexity of a sequence is, on average, close to its entropy. This captures the notion of compressibility. Also, an algorithmically incompressible binary sequence, as defined by the Kolmogorov complexity, has approximately the same number of 1s and 0s. The difference between the number of 0s and 1s gives insight into how compressible a sequence is. Kolmogorov complexity can be defined for numbers as well as sequences. For n ∈ Z,

K(n) = min_{p : U(p) = n} l(p)

where K(n) denotes the Kolmogorov complexity of n and l(p) is the length of the program p; the minimum is over all programs that print n on the universal computer U. Some integers have low Kolmogorov complexity. For example, let n = 55^{5^{55}}. Even though this is a very large number, it has a short description. Also, e is easily described: even though it is an irrational number, it has a very short description as the base of the natural logarithm, i.e., of the exponential function whose derivative is itself. Note,

K(n) ≤ log* n + c,   where log* n = log n + log log n + log log log n + ...,

the sum continuing as long as each term is positive.

Theorem 1  There exists an infinite number of integers for which K(n) > log n.

Proof (by contradiction): Assume there exist only a finite number of integers for which K(n) > log n. From Kraft's inequality,

∑_n 2^{−K(n)} ≤ 1.

If the assumption is true, then there is an n_0 such that for all n ≥ n_0, K(n) ≤ log n. Then

∑_{n ≥ n_0} 2^{−K(n)} ≥ ∑_{n ≥ n_0} 2^{−log n} = ∑_{n ≥ n_0} 1/n.

The series ∑ 1/n does not converge, so ∑_{n ≥ n_0} 1/n = ∞. It follows that ∑_n 2^{−K(n)} = ∞. This contradicts ∑_n 2^{−K(n)} ≤ 1, and thus the assumption is false: there exist infinitely many integers n for which K(n) > log n. This illustrates that there is an infinite number of integers that are not simple to describe, i.e., they take more than log n bits to describe. Given a binary sequence, if |#1s − #0s| ≥ εn, then the sequence can be compressed.
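As a small numerical illustration of the log* function appearing in the bound K(n) ≤ log* n + c above, here is a minimal sketch (Python, assuming base-2 logarithms; the function name is ours):

import math

def log_star(n: float) -> float:
    """Sum log n + log log n + ... (base 2) while each term stays positive."""
    total = 0.0
    term = math.log2(n)
    while term > 0:
        total += term
        term = math.log2(term)
    return total

print(log_star(2 ** 40))   # 40 + 5.32 + 2.41 + 1.27 + 0.34 ... ≈ 49.3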


Consider a height/weight data set.

Because the data is highly correlated, a clockwise rotation can be done to obtain new axes such that the line of best fit becomes the new prominent axis.

Thus the difference between any data point and the new axis is small, and fewer bits are needed to describe the data. In the new coordinates the entries are uncorrelated; ideally, there will be zero correlation. Note that the correlation between random variables X and Y is E[(X − E(X))(Y − E(Y))], where E(X) denotes the expected value of X. If X and Y are independent, then this correlation is 0; conversely, if the correlation is large, the variables are highly correlated. Without loss of generality, assume E(X) = E(Y) = 0.

This amounts to subtracting the mean from each component. Let

X = [x_1, x_2, ..., x_N]^T,   E(X) = [0, 0, ..., 0]^T.

In the height/weight data there were only two variables, i.e., height is x_1 and weight is x_2. The rotation is done by multiplying by a matrix:

Y = AX

Y is another N-dimensional vector, and A is N × N. It is desirable for A to be an orthonormal matrix, i.e., A^T A = I. If this holds, the result is an orthogonal transformation. For any orthogonal transformation, Parseval's theorem holds:

∑_i σ_{y_i}^2 = ∑_i E(y_i^2) = E(||Y||_2^2) = E(Y^T Y) = E(X^T A^T A X) = E(X^T X) = E(||X||_2^2) = ∑_i E(x_i^2) = ∑_i σ_{x_i}^2

If σ_{x_i}^2 is the variance of x_i, then ∑_i E(x_i^2) = ∑_i σ_{x_i}^2 stems from the fact that X is a zero-mean random vector. Thus, the transformation conserves the total energy. The transformation should also ensure that the new data are uncorrelated. The correlations are represented by the covariance matrix C_x, defined as

C_x = E(XX^T) = E( [x_1, x_2, ..., x_N]^T [x_1  x_2  ...  x_N] )

    = [ σ_{x_1}^2     E(x_1 x_2)   ...  E(x_1 x_N)
        E(x_1 x_2)    σ_{x_2}^2    ...  E(x_2 x_N)
        ...
        E(x_1 x_N)    E(x_2 x_N)   ...  σ_{x_N}^2 ]

Note that this is a symmetric matrix. From the definition of the trace of a matrix,

Trace(C_x) = ∑_i σ_{x_i}^2

which is the total energy of the signal. The transform matrix A should minimize the off-diagonal elements of the covariance matrix.

C_y = E(YY^T) = E(AXX^T A^T) = A E(XX^T) A^T = A C_x A^T

Because A^T A = I and A is an orthogonal matrix, A^T = A^{−1}. Thus,

C_y = A C_x A^{−1}

CyA = ACx

The ideal covariance matrix C_y is a diagonal matrix. Suppose

C_y = [ λ_1  0   ...  0
        0    λ_2 ...  0
        ...
        0    0   ...  λ_N ]

Thus C_y A = A C_x becomes

[ λ_1  0  ...  0
  ...
  0    0  ...  λ_N ] A = A C_x

Because X is zero mean, Y is also zero mean. Taking the transpose of C_y A = A C_x and using the symmetry of C_x and C_y gives A^T C_y = C_x A^T. For ease of notation, let A_i denote the i-th column of A^T, i.e., the i-th row of A:

A_i = [ a_{i1}, a_{i2}, ..., a_{iN} ]^T

Comparing the i-th columns of A^T C_y and C_x A^T, in general

C_x A_i = λ_i A_i

λ_i is an eigenvalue of C_x, and A_i is the corresponding eigenvector. Thus the transform matrix A has the eigenvectors of C_x as its rows,

A = [ A_1  A_2  ...  A_N ]^T = [ eigenvectors of C_x ]

The eigenvalues are simply λ_i = σ_{y_i}^2. This is called the Karhunen-Loève Transform (KLT). The process for finding the KLT is

1. Subtract off the mean:
   x_1 → x_1 − E(x_1)
   x_2 → x_2 − E(x_2)
   ...
   x_N → x_N − E(x_N)

2. Find the covariance matrix. If there are M samples of each component, then the energy in component x_n is

   σ_{x_n}^2 = (1/M) ∑_{i=1}^{M} x_{ni}^2

for each n going from 1 to N. The correlations are then

   ρ_{x_n, x_m} = (1/M) ∑_{i=1}^{M} x_{ni} x_{mi}

The covariance matrix is then

   C_x = [ σ_{x_1}^2     ρ_{x_1,x_2}  ...  ρ_{x_1,x_N}
           ρ_{x_1,x_2}   σ_{x_2}^2    ...  ρ_{x_2,x_N}
           ...
           ρ_{x_1,x_N}   ρ_{x_2,x_N}  ...  σ_{x_N}^2 ]

3. Knowing the covariance matrix C_x, find the eigenvalues and eigenvectors. Form the transform matrix A with the eigenvectors of C_x as its rows,

   A = [ A_1  A_2  ...  A_N ]^T

where A_i represents the i-th eigenvector of C_x.

The process outlined above is called Principal Component Analysis. The goal of this analysis is to find a matrix that transforms the vectors to a new coordinate system in which they are uncorrelated. So ideally,

   C_y = [ λ_1  0   ...  0
           0    λ_2 ...  0
           ...
           0    0   ...  λ_N ]

Keep only the largest eigenvalues. The energy lost in this process is the sum of the eigenvalues that are not used in the compression scheme; this is the mean square error. Suppose that the eigenvalues are ordered, i.e., λ_1 > λ_2 > ··· > λ_N. If all of {λ_i}_{i=N_0}^{N} are unused, then the mean square error is

MSE = ∑_{i=N_0}^{N} λ_i

This is the distortion induced by compressing the signal. The KLT optimally decorrelates the data (it minimizes the correlations E(y_i y_j)), but it is computationally inefficient because a new transform matrix has to be computed for each new data set; a small numerical sketch of the whole procedure is given below.
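A minimal sketch of the KLT/PCA procedure above (Python with NumPy; the toy correlated data set and the number of retained components are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
M, N = 500, 2                                   # M samples of an N-dimensional vector
x1 = rng.normal(size=M)
X = np.stack([x1, 1.5 * x1 + 0.1 * rng.normal(size=M)])   # shape (N, M), highly correlated

X = X - X.mean(axis=1, keepdims=True)           # step 1: subtract the mean
Cx = (X @ X.T) / M                              # step 2: covariance matrix
eigvals, eigvecs = np.linalg.eigh(Cx)           # step 3: eigenvalues/eigenvectors of Cx
order = np.argsort(eigvals)[::-1]               # largest eigenvalues first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

A = eigvecs.T                                   # rows of A are the eigenvectors of Cx
Y = A @ X                                       # decorrelated data

N0 = 1                                          # keep only the N0 largest components
X_hat = A[:N0, :].T @ Y[:N0, :]                 # reconstruction from kept components

mse = np.mean(np.sum((X - X_hat) ** 2, axis=0))
print(mse, eigvals[N0:].sum())                  # MSE equals the sum of discarded eigenvalues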

Instead of the data-dependent KLT, some standard transform matrices are used. One such matrix is

F = (1/√N) [ 1   1        1          ...  1
             1   ω        ω^2        ...  ω^{N−1}
             1   ω^2      ω^4        ...  ω^{2(N−1)}
             1   ω^3      ω^6        ...  ω^{3(N−1)}
             ...
             1   ω^{N−1}  ω^{2(N−1)} ...  ω^{(N−1)(N−1)} ]

This is an example of a generic transform matrix. Typically ω = e^{−i2π/N}; this is the Discrete Fourier Transform (DFT) matrix. The DFT projects the data onto the rows, and each row is a frequency kernel. Thus, it produces the frequency components of the data.

Y = FX = [ Y_1, Y_2, ..., Y_N ]^T

The values {Y_i}_{i=1}^{N} are the frequency components. Compressing a signal by omitting the low-amplitude Y_i's (usually high frequency) effectively filters the signal. If the low-amplitude Y_i's are scattered throughout the vector, filtering is still being done, but it is not necessarily characterizable as a low-, high-, or band-pass filter. This is a widely used compression scheme.
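A small sketch constructing the DFT matrix F above and applying Y = FX (NumPy; the test signal is made up, and for the complex DFT the conjugate transpose plays the role of the transpose in A^T A = I):

import numpy as np

N = 8
omega = np.exp(-2j * np.pi / N)
F = np.array([[omega ** (i * j) for j in range(N)] for i in range(N)]) / np.sqrt(N)

assert np.allclose(F.conj().T @ F, np.eye(N))   # F is unitary, so Parseval's theorem holds

x = np.cos(2 * np.pi * 2 * np.arange(N) / N)    # a slowly varying test signal
Y = F @ x                                       # frequency components
x_back = F.conj().T @ Y                         # inverse transform
assert np.allclose(x_back.real, x)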

In the case of images, a Discrete Cosine Transform (DCT) is most frequently used:

A_{ij} = √(1/N) cos( jπ(2i+1) / (2N) ),   j = 0
A_{ij} = √(2/N) cos( jπ(2i+1) / (2N) ),   j = 1, ..., N−1

with i, j = 0, 1, ..., N−1. For correlated data that forms a first-order Markov chain, the energy compaction factor is very close to the energy compaction factor of the KLT. The DCT has very good performance for highly correlated data.
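A sketch of the N × N DCT matrix above, checking that it is orthonormal (NumPy; this follows the convention that the normalization depends on the frequency index j):

import numpy as np

N = 8
i = np.arange(N).reshape(-1, 1)        # sample ("time") index
j = np.arange(N).reshape(1, -1)        # frequency index
A = np.cos(j * np.pi * (2 * i + 1) / (2 * N))
A[:, 0] *= np.sqrt(1.0 / N)            # j = 0 column gets the sqrt(1/N) weight
A[:, 1:] *= np.sqrt(2.0 / N)           # remaining columns get sqrt(2/N)

assert np.allclose(A.T @ A, np.eye(N)) # the columns form an orthonormal basis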

A problem with modern technology is that image-capturing systems, for example, lack hardware capable of these data compression techniques, so much more is captured than is actually used after the compression. The process is outlined below.

Ideally, the sensing and compression would be combined into a single compressed sensing step.

The compressed sensing block contains a matrix Φ. If the discrete signal is X, ideally the sensors will be able to do the matrix multiplication ΦX directly: given any signal X to be sensed, there is hardware to implement the linear combinations given by the rows of Φ. The multiplication is

[ φ_11  φ_12  ...  φ_1N ]   [ x_1 ]     [ y_1 ]
[ ...               ... ]   [ ... ]  =  [ ... ]
[ φ_M1  φ_M2  ...  φ_MN ]   [ x_N ]     [ y_M ]

where M ≪ N for compression. Note that Φ is no longer a square matrix; it is short and fat. Thus only M samples are taken, and compression and sensing occur all in the same step. This can be done in hardware.
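A sketch of this measurement step y = Φx with a short, fat random Φ (NumPy; the dimensions and the sparse test signal are made up):

import numpy as np

rng = np.random.default_rng(1)
N, M, k = 256, 64, 5                        # signal length, measurements, sparsity

Phi = rng.normal(size=(M, N)) / np.sqrt(M)  # M x N sensing matrix, M << N

x = np.zeros(N)                             # a k-sparse signal
support = rng.choice(N, size=k, replace=False)
x[support] = rng.normal(size=k)

y = Phi @ x                                 # only M samples are taken
print(Phi.shape, y.shape)                   # (64, 256) (64,)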

Decoding Transform Codes

In decoding, X needs to be recovered from Y, but there is no unique solution for X given Φ and Y. However, some more information is known about X: there is a known transform matrix F such that FX is a vector with few non-zero entries.

Φ = Φ̃ F

Note that Φ is M × N, Φ̃ is M × N, and F is N × N. Then

Y = ΦX = Φ̃FX = Φ̃X̃,   where X̃ = FX.

X̃ is a sparse vector, i.e., it has few non-zero values. Let k be the number of non-zero entries of X̃, k ≪ N. Knowing Φ̃ is equivalent to knowing Φ, since F is known. The formulation of the problem is then: Given Y = ΦX and Φ, find X, where X has k ≪ N non-zero values. Also find the matrix Φ for which this problem is solvable.

EE5585 Data Compression April 9, 2013 Lecture 20 Instructor: Arya Mazumdar Scribe: Aritra Konar

1 Review

In the last class, we studied Transform Coding, where we try to assign a new set of basis vectors to the data. If the data exhibits some form of correlation, we apply a linear unitary transformation to rotate the axes so that most of the data lies along one of the axes. The result is that the transformed data is sparse, meaning we can throw away the entries that have values near zero, thereby compressing the data. However, this approach suffers from two drawbacks.

i) It may happen that the data cannot be fit by a single axis (or a small set of axes). In this case linear transforms (using the DFT matrix or DCT matrix, for example) will not be able to compress the data. This is shown in the diagram below, where the data is fitted by a nonlinear function.

Figure 1: Example of data fitting by a non-linear function

ii) In this approach, the data is acquired from a sensor array, after which a basis is determined in which the representation of the data is sparse, and then the near-zero entries are thrown away to achieve data compression. Since a lot of the acquired data samples are thrown away, we may seek an approach where we compress the data by sensing only a small number of samples in the first place.

2 Compressed Sensing/Sampling

Suppose we have N pixels of an image we want to compress. We may capture the pixels using an array of N sensors to obtain a data vector, to which we then apply a linear unitary transformation (by multiplying it with the DFT matrix, for example) to obtain a sparse representation of the data. We can then throw away the small entries to compress the data. Alternatively, we may use only M (≪ N) sensors to capture the pixels. The diagram below illustrates this approach.

Figure 2: Single Pixel Camera

Here, each sensor senses a linear combination of N pixel values. The output of each sensor is given by

y_i = a_{1i} z_1 + a_{2i} z_2 + ··· + a_{Ni} z_N

where the coefficients of the linear combination are determined from the reflection coefficients of the mirrors placed at an angle θ. Hence, the output of the sensor array consists of M linear samples drawn from z. The output vector y may be represented as

y = Az,   where A is M × N, y is M × 1, z is N × 1.

We observe that dimensionality reduction is achieved by mapping the N × 1 vector of pixels into an M × 1 vector of observations, since M ≪ N. Now, z has a property: we can take the DFT matrix and use it to transform z into a sparse vector having only k ≪ N non-zero entries (i.e., z is compressible). The next problem we face is how to recover z from the compressed data vector y. If the matrix A were square, full rank and hence invertible, we could have easily recovered z from y by the relation

z = A^{−1} y

However, since A is a fat matrix, we have an underdetermined system of linear equations with an infinite number of solutions. We will try to use the sparsity property of z (in some domain) to recover z given A, y. Decompose A = ΦF, where A is M × N, Φ is M × N, and F is N × N (a Fourier matrix in some domain where z has a sparse representation). This is a valid matrix decomposition since F is an invertible matrix ⟹ Φ = A F^{−1}. Hence we have y = Az = ΦFz = Φx, where x = Fz. x is an N × 1 vector having only k ≪ N non-zero values; in other words, x is k-sparse. We have now reduced the problem to recovering x from the underdetermined system of equations y = Φx, where x is k-sparse. The matrix Φ is to be designed such that unambiguous recovery of x is possible. This happens if and only if, for any two k-sparse vectors x_1 and x_2 with

x_1 ≠ x_2

⟹ Φ(x_1 − x_2) ≠ 0

This statement implies that no two distinct k-sparse vectors may get mapped to the same observation vector y; otherwise, unique recovery will not be possible. Since x_1 − x_2 is at most 2k-sparse, the left-hand side of the above expression is a weighted sum of at most 2k columns of the matrix Φ. Since this linear combination of columns must be ≠ 0, it follows that any 2k columns of Φ must be linearly independent.

⟹ M ≥ 2k

The above expression implies that for stable recovery, the number of samples required for compressive sensing must be greater than or equal to twice the sparsity of the data vector x. Note that this statement is somewhat analogous to the Nyquist sampling theorem. Also, it is because we do not know the locations of the k non-zero samples in the N × 1 data vector x that we must use M ≥ 2k samples. If we knew the exact locations of the sparse entries, then we would require only k columns of Φ to be linearly independent, implying that only M = k samples would be sufficient for stable recovery. Hence, we pay a penalty for not knowing the locations of the non-zero entries of x by requiring a greater number of samples. Our ultimate goal now is to determine a matrix Φ that is 2k × N and in which any 2k columns are linearly independent.

Let us consider an example of signal recovery where x is a 1-sparse 6 × 1 vector and Φ is given by

Φ = [ 1 0 0 1 1 0
      0 1 0 1 0 1
      0 0 1 0 1 1 ]

If the observation vector y = Φx is known, then we can easily recover x. We know that y is obtained from a linear combination of the columns of Φ with the coefficients of the combination being the elements of the vector x. Since x is 1-sparse, with only a single non-zero entry, y is simply a scaled column of Φ whose scaling coefficient is the non-zero entry of x. Hence, knowing y, we can determine the non-zero entry of x, and from the position of the corresponding column in Φ, we can determine the position of the non-zero entry in x. Note, however, that in this case we used M = 3 > 2 samples in the sampling matrix Φ. In general, if we wanted to use M = 2k samples, we could use a Vandermonde matrix given by

Φ = [ 1           1           ...  1
      α_1         α_2         ...  α_N
      α_1^2       α_2^2       ...  α_N^2
      ...
      α_1^{M−1}   α_2^{M−1}   ...  α_N^{M−1} ]

where α_i ∈ R and α_i ≠ α_j for all i ≠ j. We can choose any M × M = 2k × 2k submatrix of Φ,

[ 1               1               ...  1
  α_{i_1}         α_{i_2}         ...  α_{i_M}
  α_{i_1}^2       α_{i_2}^2       ...  α_{i_M}^2
  ...
  α_{i_1}^{M−1}   α_{i_2}^{M−1}   ...  α_{i_M}^{M−1} ]

where the columns i_1, i_2, ..., i_M are chosen from the matrix Φ. Let α_{i_j} = β_j. Then

[ 1           1           ...  1
  β_1         β_2         ...  β_M
  β_1^2       β_2^2       ...  β_M^2
  ...
  β_1^{M−1}   β_2^{M−1}   ...  β_M^{M−1} ]

is a square Vandermonde matrix with

det = ∏_{1 ≤ j < i ≤ M} (β_i − β_j),

which is non-zero since the β_j are distinct. Hence any such submatrix is non-singular and full rank, i.e., any 2k columns of Φ are linearly independent.
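A sketch checking this Vandermonde construction numerically: with distinct α_i, every choice of M = 2k columns is non-singular (NumPy; the values of N, k and the α_i are arbitrary):

import numpy as np
from itertools import combinations

N, k = 8, 2
M = 2 * k
alpha = np.arange(1, N + 1, dtype=float)        # distinct alpha_i
Phi = np.vander(alpha, M, increasing=True).T    # rows: alpha^0, alpha^1, ..., alpha^(M-1)

# Any 2k columns are linearly independent, so no two distinct k-sparse vectors
# are mapped to the same observation y, and unique recovery is possible.
for cols in combinations(range(N), M):
    assert abs(np.linalg.det(Phi[:, cols])) > 1e-9
print("all", M, "column subsets are linearly independent")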
The recovery error is

ε = ||x̂ − x||²_{ℓ2} = (x̂ − x)^T (x̂ − x)

Now, we would like to bound this error even if x is not exactly k-sparse. Suppose x_k is the vector that keeps the k largest coordinates of x, with all others set to zero. Then we may write an approximately k-sparse vector x in terms of x_k as

x = x_k + ε

⟹ ε = x − x_k

where x_k is an exactly k-sparse vector and ε is the error due to x not being exactly k-sparse. We would like a recovery guarantee of the form

||x̂ − x||_{ℓ2} ≤ c ||x − x_k||_{ℓ2}   ...... (*)

The above statement implies that when x is exactly k-sparse, then x_k = x and the right-hand side of the above equation is zero, from which we get x̂ = x. In this case, perfect recovery is achieved. However, when x is only approximately k-sparse, even then x̂ is not too far from x! We are now ready to state a theorem which gives the recovery guarantee (*) subject to some conditions.

3 Theorem

Recovery with the guarantee (*) will require M = c_1 k log(N/k) samples. Then, there exists an M × N sampling matrix Φ with M = c_1 k log(N/k) which gives the guarantee

||x̂ − x||_{ℓ2} ≤ (c/√k) ||x − x_k||_{ℓ1}   ...... ($)

Note that, since we are dividing by √k, the error bound obtained is quite tight. This scheme guarantees stable recovery even when x is only approximately k-sparse. We can determine an x̂ which satisfies the above error bound by a linear program called Basis Pursuit, which is a polynomial-time algorithm.

3.1 Basis Pursuit
Given y = Φx where x is approximately k-sparse, we can obtain an x̂ which gives the guarantee ($) by solving the following optimization problem with a linear constraint:

min ||x̂||_{ℓ1}   subject to   Φx̂ = y
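Basis pursuit can be written as a linear program by splitting x̂ into its positive and negative parts. A minimal sketch using scipy.optimize.linprog (the problem instance here is made up):

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(2)
N, M, k = 60, 20, 3
Phi = rng.normal(size=(M, N)) / np.sqrt(M)
x_true = np.zeros(N)
x_true[rng.choice(N, k, replace=False)] = rng.normal(size=k)
y = Phi @ x_true

# min ||x||_1 s.t. Phi x = y, with x = u - v and u, v >= 0
c = np.ones(2 * N)
A_eq = np.hstack([Phi, -Phi])
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None))
x_hat = res.x[:N] - res.x[N:]
print(np.linalg.norm(x_hat - x_true))   # typically ~0 for an exactly k-sparse x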

3.2 Fun Facts about Norms
If q is an integer, then the ℓ_q norm of an N × 1 vector x is given by

||x||_{ℓq} = ( ∑_{i=1}^{N} |x_i|^q )^{1/q}

When q → ∞, ℓ_∞ = max_i |x_i|. When q → 0, ℓ_0 = number of non-zero entries of x.

4 Restricted Isometry Property

A sampling matrix Φ is said to possess the k-RIP if, for any k-sparse vector v,

(1 − δ_k) ||v||²_{ℓ2} ≤ ||Φv||²_{ℓ2} ≤ (1 + δ_k) ||v||²_{ℓ2}

We choose the smallest δ_k such that the above expression is true. Note that if δ_k is 0, then

||Φv||²_{ℓ2} = ||v||²_{ℓ2}

Thus, Φ is an orthogonal transform that preserves the norm (energy) of the k-sparse vector v. However, even if δ_k ≠ 0, Φ is still approximately orthogonal, since the energy in the signal is only changed by a factor δ_k. In 2005, Candes and Tao showed that if

δ_2k ≤ √2 − 1,

that is, if for any 2k-sparse vector the energy is preserved within a factor √2 − 1 = 0.414, then the x̂ obtained from Basis Pursuit will give the guarantee

||x̂ − x||_{ℓ2} ≤ (c/√k) ||x − x_k||_{ℓ1}

Later on, it will be shown that if Φ is a random matrix with i.i.d. Gaussian or Bernoulli random variables as its entries, then with very high probability Φ satisfies the RIP.

EE5585 Data Compression April 11, 2013 Lecture 21 Instructor: Arya Mazumdar Scribe: Zhongyu He

Compressed Sensing

From the last lecture, compressed sensing boils down to a simple mathematical problem, expressed by the following equation:

Φx = y

where Φ is an m × N matrix, x is an N × 1 vector, and y is an m × 1 vector, and the linear system is underdetermined. Given Φ and y, we need to solve for x, which is approximately sparse, i.e., x has at most k prominent coordinates. Therefore, we want to design a matrix Φ that can give us stable solutions.

Stable recovery

Suppose x̂ is our estimate and x_k is the vector keeping the k largest coefficients of x. For example, with k = 2,

x = [123, 6.4, 2.5, 193, 4.5]^T,   x_k = [123, 0, 0, 193, 0]^T,   x − x_k = [0, 6.4, 2.5, 0, 4.5]^T

The stable solution must satisfy the following:

||x̂ − x||_{ℓ2} ≤ C ||x − x_k||_{ℓ1}

where C is some constant. If x has exactly k non-zero elements, then the equation above gives us the exact solution. If not, the error is bounded by some small value. That is what is called stable recovery.

Basis Pursuit Algorithm(Φ, y)
Now we have an algorithm called Basis Pursuit, with Φ and y as input. It is stated as follows:

min ||z||_{ℓ1}
subject to: Φz = y

This is an optimization problem and can be solved by a linear program.

k-RIP
For any k-sparse vector z, if

(1 − δ_k) ||z||²_{ℓ2} ≤ ||Φz||²_{ℓ2} ≤ (1 + δ_k) ||z||²_{ℓ2}

is satisfied with the smallest such δ_k, then Φ is called (k, δ_k)-RIP. RIP is short for Restricted Isometry Property.

Theorem 1 (Candes, Tao 2005/2006)

If Φ is (2k, δ_2k)-RIP with δ_2k < √2 − 1 ≈ 0.414, that is,

(1 − 0.414) ||z||²_{ℓ2} ≤ ||Φz||²_{ℓ2} ≤ (1 + 0.414) ||z||²_{ℓ2}   for all 2k-sparse z,

then the basis pursuit solution x̂ will satisfy

||x̂ − x||_{ℓ2} ≤ (c/√k) ||x − x_k||_{ℓ1}

This property guarantees that stable recovery would happen.

Theorem 2  If m = c_1 k log N, then there exists an m × N matrix Φ that has δ_2k < √2 − 1.

If we choose an m × N matrix with independent zero-mean random Gaussian entries satisfying the theorem above, then with very high probability that matrix will have the RIP. This tells us how to construct Φ, but this is not our concern here.

Proof of Theorem 1. Assume x̂ = x + h, so h = x̂ − x, where h is the error vector. We will bound ||h||_{ℓ2} from above. First of all, since x̂ minimizes the ℓ1 norm and x is feasible,

||x||_{ℓ1} ≥ ||x̂||_{ℓ1} = ||x + h||_{ℓ1}

Now consider a vector v and a subset T ⊆ {1, 2, ..., N}. v_T is the projection of v onto T. For example:

v = [5, 3, 4, 1, 2]^T,   T = {1, 4, 5},   then v_T = [5, 0, 0, 1, 2]^T

||x||_{ℓ1} ≥ ||x_{T_0} + h_{T_0} + x_{T_0^c} + h_{T_0^c}||_{ℓ1}   (1)
          = ||x_{T_0} + h_{T_0}||_{ℓ1} + ||x_{T_0^c} + h_{T_0^c}||_{ℓ1}   (2)
          ≥ ||x_{T_0}||_{ℓ1} − ||h_{T_0}||_{ℓ1} − ||x_{T_0^c}||_{ℓ1} + ||h_{T_0^c}||_{ℓ1}   (3)
⟹ ||x_{T_0}||_{ℓ1} + ||x_{T_0^c}||_{ℓ1} ≥ ||x_{T_0}||_{ℓ1} − ||h_{T_0}||_{ℓ1} − ||x_{T_0^c}||_{ℓ1} + ||h_{T_0^c}||_{ℓ1}   (4)
⟹ ||h_{T_0^c}||_{ℓ1} ≤ ||h_{T_0}||_{ℓ1} + 2 ||x_{T_0^c}||_{ℓ1}   (5)

where T_0 is the set of the k largest coordinates of x.

*Reasoning for (2) and (4): For example,

x̂ = [9, 4, 2, 3, 1, 1, 9, 6]^T,   x̂_{T_0} = [9, 0, 0, 0, 0, 0, 9, 6]^T,   x̂_{T_0^c} = [0, 4, 2, 3, 1, 1, 0, 0]^T,

||x̂||_{ℓ1} = |9| + |4| + |2| + |3| + |1| + |1| + |9| + |6| = ||x̂_{T_0}||_{ℓ1} + ||x̂_{T_0^c}||_{ℓ1}

**Reasoning for (3): by the triangle inequality.

Lemma

If z, z′ are two vectors that are k_1-sparse and k_2-sparse respectively, and the coordinates where z, z′ are non-zero do not overlap, then

|⟨Φz, Φz′⟩| ≤ δ_{k_1+k_2} ||z||_{ℓ2} ||z′||_{ℓ2}

Proof

|⟨Φz, Φz′⟩| = (1/4) | ||Φz + Φz′||²_{ℓ2} − ||Φz − Φz′||²_{ℓ2} |
            = (1/4) | ||Φ(z + z′)||²_{ℓ2} − ||Φ(z − z′)||²_{ℓ2} |

(1 − δ_{k_1+k_2}) ||z ± z′||²_{ℓ2} ≤ ||Φ(z ± z′)||²_{ℓ2} ≤ (1 + δ_{k_1+k_2}) ||z ± z′||²_{ℓ2}

⟹ |⟨Φz, Φz′⟩| ≤ (1/4) [ (1 + δ_{k_1+k_2}) ||z + z′||²_{ℓ2} − (1 − δ_{k_1+k_2}) ||z − z′||²_{ℓ2} ]
             = (1/2) δ_{k_1+k_2} ||z + z′||²_{ℓ2}

(using that ||z + z′||²_{ℓ2} = ||z − z′||²_{ℓ2} = ||z||²_{ℓ2} + ||z′||²_{ℓ2}, since the supports do not overlap). Now apply this to the unit vectors z/||z||_{ℓ2} and z′/||z′||_{ℓ2}:

|⟨Φz, Φz′⟩| / (||z||_{ℓ2} ||z′||_{ℓ2}) = |⟨Φ(z/||z||_{ℓ2}), Φ(z′/||z′||_{ℓ2})⟩| ≤ (1/2) δ_{k_1+k_2} || z/||z||_{ℓ2} + z′/||z′||_{ℓ2} ||²_{ℓ2} = δ_{k_1+k_2}

⟹ |⟨Φz, Φz′⟩| ≤ δ_{k_1+k_2} ||z||_{ℓ2} ||z′||_{ℓ2}

Lemma proved.

Let's come back to the proof of Theorem 1. T_0 indexes the k largest (in absolute value) coefficients of x.
T_1 indexes the k largest in absolute value coefficients of h_{T_0^c}, T_2 the next k largest in absolute value coefficients of h_{T_0^c}, and so on.

Next, we will show that both ||h_{T_0 ∪ T_1}||_{ℓ2} and ||h_{(T_0 ∪ T_1)^c}||_{ℓ2} are bounded. Note that for any j ≥ 2,

||h_{T_j}||_{ℓ2} ≤ √k · (max absolute value in h_{T_j})    (definition of the ℓ2 norm)
              ≤ √k · (average absolute value in h_{T_{j−1}})
              ≤ (1/√k) ||h_{T_{j−1}}||_{ℓ1}

Now

||h_{(T_0 ∪ T_1)^c}||_{ℓ2} = || ∑_{j≥2} h_{T_j} ||_{ℓ2}
                          ≤ ∑_{j≥2} ||h_{T_j}||_{ℓ2}
                          ≤ (1/√k) ∑_{j≥1} ||h_{T_j}||_{ℓ1}
                          = (1/√k) ||h_{T_0^c}||_{ℓ1}
                          ≤ (1/√k) ( ||h_{T_0}||_{ℓ1} + 2 ||x_{T_0^c}||_{ℓ1} )   (6)

Because

||z||_{ℓ1} ≥ ||z||_{ℓ2}, e.g. |z_1| + |z_2| ≥ √(|z_1|² + |z_2|²),   and   ||z||_{ℓ1} / √(number of elements) ≤ ||z||_{ℓ2}, i.e.
(|z_1| + |z_2| + ··· + |z_k|)/√k ≤ √(z_1² + z_2² + ··· + z_k²),

so from (6),

||h_{(T_0 ∪ T_1)^c}||_{ℓ2} ≤ ||h_{T_0}||_{ℓ2} + (2/√k) ||x_{T_0^c}||_{ℓ1}
⟹ ||h_{(T_0 ∪ T_1)^c}||_{ℓ2} ≤ ||h_{T_0 ∪ T_1}||_{ℓ2} + (2/√k) ||x_{T_0^c}||_{ℓ1}   (7)

Now we have

||h_{T_0 ∪ T_1}||²_{ℓ2} ≤ (1/(1 − δ_2k)) ||Φ h_{T_0 ∪ T_1}||²_{ℓ2}   (8)

||Φ h_{T_0 ∪ T_1}||²_{ℓ2} = ⟨Φ h_{T_0 ∪ T_1}, Φ h_{T_0 ∪ T_1}⟩ = ⟨Φ h_{T_0 ∪ T_1}, Φ(h − h_{(T_0 ∪ T_1)^c})⟩

Since Φh = Φ(x̂ − x) = Φx̂ − Φx = y − y = 0,

⟹ ||Φ h_{T_0 ∪ T_1}||²_{ℓ2} = ⟨Φ h_{T_0 ∪ T_1}, −Φ h_{(T_0 ∪ T_1)^c}⟩

= ⟨Φ h_{T_0 ∪ T_1}, − ∑_{j≥2} Φ h_{T_j}⟩
≤ ∑_{j≥2} |⟨Φ h_{T_0}, Φ h_{T_j}⟩| + ∑_{j≥2} |⟨Φ h_{T_1}, Φ h_{T_j}⟩|
= ∑_{j≥2} [ |⟨Φ h_{T_0}, Φ h_{T_j}⟩| + |⟨Φ h_{T_1}, Φ h_{T_j}⟩| ]
≤ ∑_{j≥2} ( δ_2k ||h_{T_0}||_{ℓ2} ||h_{T_j}||_{ℓ2} + δ_2k ||h_{T_1}||_{ℓ2} ||h_{T_j}||_{ℓ2} )

Since √a + √b ≤ √(2(a + b)), we have ||h_{T_0}||_{ℓ2} + ||h_{T_1}||_{ℓ2} ≤ √2 ||h_{T_0 ∪ T_1}||_{ℓ2}, and therefore

⟹ ||Φ h_{T_0 ∪ T_1}||²_{ℓ2} ≤ δ_2k √2 ||h_{T_0 ∪ T_1}||_{ℓ2} ( ||h_{T_0 ∪ T_1}||_{ℓ2} + (2/√k) ||x_{T_0^c}||_{ℓ1} )

⟹ from (8):  ||h_{T_0 ∪ T_1}||²_{ℓ2} ≤ (√2 δ_2k / (1 − δ_2k)) ||h_{T_0 ∪ T_1}||_{ℓ2} ( ||h_{T_0 ∪ T_1}||_{ℓ2} + (2/√k) ||x_{T_0^c}||_{ℓ1} )

⟹ ||h_{T_0 ∪ T_1}||_{ℓ2} ( 1 − √2 δ_2k / (1 − δ_2k) ) ≤ (√2 δ_2k / (1 − δ_2k)) (2/√k) ||x_{T_0^c}||_{ℓ1}

⟹ ||h_{T_0 ∪ T_1}||_{ℓ2} ≤ [ 1 / (1 − √2 δ_2k / (1 − δ_2k)) ] (2/√k) ||x_{T_0^c}||_{ℓ1}   (9)

||h||_{ℓ2} ≤ ||h_{T_0 ∪ T_1}||_{ℓ2} + ||h_{(T_0 ∪ T_1)^c}||_{ℓ2}
          ≤ 2 ||h_{T_0 ∪ T_1}||_{ℓ2} + (2/√k) ||x_{T_0^c}||_{ℓ1}     [from (7)]
          ≤ ( 2 / (1 − √2 δ_2k / (1 − δ_2k)) + 1 ) (2/√k) ||x_{T_0^c}||_{ℓ1}   [from (9)]

⟹ ||x̂ − x||_{ℓ2} ≤ ( 2 / (1 − √2 δ_2k / (1 − δ_2k)) + 1 ) (2/√k) ||x − x_k||_{ℓ1}

where the constant ( 2 / (1 − √2 δ_2k / (1 − δ_2k)) + 1 ) is positive and finite for any δ_2k < √2 − 1.
QED (End of the proof of Theorem 1)

EE5585 Data Compression April 16, 2013 Lecture 22 Instructor: Arya Mazumdar Scribe: Cheng-Yu Hung

Review of Compressed Sensing

Consider a general linear measurement process that produces an m × 1 observation vector y in terms of an N × 1 signal vector x and the columns of an m × N (m ≪ N) measurement matrix Φ. It can be expressed as

Φx = y   (1)

It is noted that the signal vector x is an approximately k-sparse vector, which has at most k (≪ N) prominent non-zero entries.

Basis Pursuit
Given y and Φ, recovery of the unknown signal vector x can be pursued by finding the sparsest estimate of x satisfying the constraint Φx = y, i.e.,

minimize_x ||x||_{l0}
subject to Φx = y.

However, this is an NP-hard problem. Convex relaxation methods cope with the intractability of the above formulation by approximating the l0 norm by the convex l1 norm. The resulting well-known algorithm is called basis pursuit. The above formulation is changed to

minimize_x ||x||_{l1}   (2)
subject to Φx = y.

Many linear programming methods can solve this problem, such as the simplex method or interior point methods.

Stable Recovery and Restricted Isometry Property (RIP)

The RIP imposes that there exists a 0 < δ_2k < 1 such that for any z that has at most 2k non-zero entries,

(1 − δ_2k) ||z||²_{l2} ≤ ||Φz||²_{l2} ≤ (1 + δ_2k) ||z||²_{l2}   (3)

Candes et al. (2006) have shown that if Φ satisfies the RIP with δ_2k ≤ √2 − 1, then the solution x̂ to equation (2) achieves stable recovery,

||x̂ − x||_{l2} ≤ (c/√k) ||x − x_k||_{l1},   (4)

where x_k is the restriction of x to its k largest entries.

Design Good Measurement Matrices
One of the main issues is to find a matrix Φ with RIP constant δ_2k < √2 − 1. Given the N × 1 vector x and the m × N (m ≪ N) measurement matrix Φ, we denote by Φ_I the m × 2k submatrix with columns indexed by I, and by z = x_I the corresponding 2k × 1 vector, where I ⊆ {1, ..., N}, |I| = 2k. Since δ_2k < √2 − 1, the RIP becomes:

0.586 = (1 − δ_2k) ≤ ||Φ_I z||²_{l2} / ||z||²_{l2} ≤ (1 + δ_2k) = 1.414.   (5)

We want any 2k columns of Φ to satisfy the above inequality (5). Actually, (5) is equivalent to

0.586 ≤ eigenvalues of Φ_I^T Φ_I ≤ 1.414,   (6)

or

√0.586 ≤ singular values of Φ_I ≤ √1.414.   (7)

But what kind of matrix has the property (7)? The answer is that Φ can be a random matrix chosen in the following way:

Φ = (1/√m) [ φ_11  ...  φ_1N
             ...
             φ_m1  ...  φ_mN ],   where φ_ij ~ i.i.d. N(0, 1),   (8)

and then with high probability, for any I, Φ_I satisfies 0.586 ≤ eigenvalues of Φ_I^T Φ_I ≤ 1.414.

In particular, the high probability is 1 − e^{−αm} for some α > 0. Then,

Pr(RIP-2k is not satisfied)   (9)
= Pr(∃ a set of 2k columns for which (7) is not satisfied)   (10)
≤ C(N, 2k) e^{−αm}   (11)
≤ (Ne/2k)^{2k} e^{−αm}   (12)
= e^{2k log(Ne/2k) − αm}   (13)
≤ e^{−αm/2},   if αm ≥ 2 · 2k log(Ne/2k)   (14)
→ 0 as m → ∞   (15)

In summary, suppose that the entries of the m × 2k matrix Φ_I are i.i.d. Gaussian with zero mean and variance 1/m. Then, with high probability, Φ satisfies the required RIP condition for stable recovery to hold, provided that

m > (4/α) k log(Ne/2k).   (16)

Moreover, sensing matrices whose entries are i.i.d. from a Bernoulli distribution (+1/√m with prob. 1/2 and −1/√m with prob. 1/2), so that the columns are normalized to unit norm, also obey the RIP given equation (16). It is noted that since m > (4/α) k log(Ne/2k), the observation signal's dimension is m = O(k log(N/k)), while the dimension of the input signal is N.
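A small empirical sketch of this construction: sample a Gaussian Φ with entries N(0, 1/m) and spot-check the eigenvalues of Φ_I^T Φ_I on random 2k-column submatrices (NumPy; the parameters are arbitrary and this is only an illustration of the concentration, not a proof):

import numpy as np

rng = np.random.default_rng(3)
N, k, m = 512, 4, 300
Phi = rng.normal(size=(m, N)) / np.sqrt(m)  # i.i.d. N(0, 1/m) entries

lo, hi = 1.0, 1.0
for _ in range(2000):                       # random index sets I with |I| = 2k
    I = rng.choice(N, size=2 * k, replace=False)
    s = np.linalg.svd(Phi[:, I], compute_uv=False)
    lo, hi = min(lo, s[-1] ** 2), max(hi, s[0] ** 2)

print(lo, hi)   # empirical extremes of the eigenvalues of Phi_I^T Phi_I; they tighten toward 1 as m grows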

Differential Encoding

In many sources we are interested in, the sampled source output {x_n} does not change much from one sample to the next. This means that the variance of the sequence of differences d_n = x_n − x_{n−1} is


Figure 1: Compressed sensing

substantially smaller than that of the source output sequence. In other words, for correlated data the distribution of d_n is highly peaked at zero. It is useful to encode the difference from one sample to the next rather than encoding the actual sample values, since the variance of the quantization error in the differences is less than in the actual output samples. This technique is called differential encoding.

Quantize with Differential Encoding
Consider a sequence {x_n} as in Figure 2. A difference sequence {d_n} is generated by taking the differences x_n − x_{n−1}. The difference sequence is quantized to obtain the sequence {d̂_n}. So,

d_0 = x_0   (17)
d_1 = x_1 − x_0   (18)
d_2 = x_2 − x_1   (19)
...   (20)
d_n = x_n − x_{n−1}.   (21)

The above equations can be written in matrix form, i.e.,

[  1              ] [x_0]   [d_0]        [1            ] [d_0]   [x_0]
[ −1  1           ] [ . ] = [ . ]   or   [1 1          ] [ . ] = [ . ]   (22)
[    −1  1        ] [ . ]   [ . ]        [1 1 1        ] [ . ]   [ . ]
[        ...      ] [ . ]   [ . ]        [  ...        ] [ . ]   [ . ]
[           −1  1 ] [x_n]   [d_n]        [1 1 ... 1  1 ] [d_n]   [x_n]

And

dˆn = Q[dn]=dn + qn, where qn is the quantization error. (23)

At the receiver, the reconstructed sequence {x̂_n} is obtained by adding d̂_n to the previous reconstructed value x̂_{n−1}:

x̂_n = x̂_{n−1} + d̂_n.   (24)


Figure 2: Output samples {x_n} and difference samples {d_n}

Let us assume that both transmitter and receiver start with the same value x_0, that is, x̂_0 = x_0. Then, follow the quantization and reconstruction process:

x̂_0 = d_0 + q_0   (25)

x̂_1 = x̂_0 + d̂_1 = d_0 + q_0 + d_1 + q_1 = x_1 + q_0 + q_1   (26)

x̂_2 = x̂_1 + d̂_2 = d_0 + d_1 + q_0 + q_1 + d_2 + q_2 = x_2 + q_0 + q_1 + q_2   (27)

...   (28)

x̂_n = d_0 + d_1 + ··· + d_n + q_0 + q_1 + ··· + q_n = x_n + ∑_{i=0}^{n} q_i   (29)

So, at the i-th iteration we get

x̂_i = x_i + ∑_{j=0}^{i} q_j.   (30)

We can see that the quantization error accumulates as the process continues. According to the central limit


Figure 3: Quantizer of differential encoding

theorem, the sum of these quantization errors is Gaussian noise with zero mean, but in fact, before that happens, the finite precision of machines causes the reconstructed value to overflow. Therefore, instead of d_n = x_n − x_{n−1}, we use d_n = x_n − x̂_{n−1}. Using this new differencing operation, we repeat the quantization and reconstruction process. Assume that x̂_0 = x_0:

d_1 = x_1 − x_0   (31)
d̂_1 = d_1 + q_1   (32)
x̂_1 = x_0 + d̂_1 = x_0 + d_1 + q_1 = x_1 + q_1   (33)
d_2 = x_2 − x̂_1   (34)
d̂_2 = d_2 + q_2   (35)
x̂_2 = x̂_1 + d̂_2 = x̂_1 + d_2 + q_2 = x_2 + q_2   (36)

So, at the i-th iteration, we have

x̂_i = x_i + q_i,   (37)

and there is no accumulation of the quantization error. The quantization error q_i is the noise incurred by quantizing the i-th difference, and it is significantly less than the quantization error for the original sequence. Thus, this procedure leads to an overall reduction of the quantization error, and we can use fewer bits with differential encoding to attain the same distortion.
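A sketch contrasting the two differencing strategies above with a simple uniform quantizer: differencing against the true previous sample (errors accumulate) versus against the previous reconstruction (each sample carries only its own q_n). The step size and test signal are made up:

import numpy as np

def quantize(d, step=0.1):
    return step * np.round(d / step)        # uniform quantizer: Q[d] = d + q

rng = np.random.default_rng(4)
n = 1000
x = np.cumsum(0.03 * rng.normal(size=n))    # a slowly varying signal

# Open loop: d_n = x_n - x_{n-1}; quantization errors accumulate
xo = np.empty(n); xo[0] = x[0]
for i in range(1, n):
    xo[i] = xo[i - 1] + quantize(x[i] - x[i - 1])

# Closed loop (DPCM): d_n = x_n - x_hat_{n-1}; no accumulation
xc = np.empty(n); xc[0] = x[0]
for i in range(1, n):
    xc[i] = xc[i - 1] + quantize(x[i] - xc[i - 1])

print(np.max(np.abs(x - xo)), np.max(np.abs(x - xc)))
# the open-loop error typically drifts; the closed-loop error stays within half a step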

Subband Coding

Consider the sequence {x_n} in Figure 4. We see that while there is a significant amount of sample-to-sample variation, there is also an underlying long-term trend, shown by the blue line, that varies slowly. One way to extract this trend is to use a moving window to average the sample values. Let us use a window of size two and generate a new sequence {y_n} by averaging neighboring values of {x_n}:

y_n = (x_n + x_{n−1}) / 2.   (38)

The consecutive values of y_n will be closer to each other than the consecutive values of x_n. Thus, {y_n} can be coded more efficiently using differential encoding than the sequence {x_n}. However, we want to encode the sequence {x_n}, not {y_n}. Therefore, we need another sequence {z_n}:

z_n = x_n − y_n = x_n − (x_n + x_{n−1})/2 = (x_n − x_{n−1})/2.   (39)


Figure 4: A rapidly changing source output that contains a long-run component with slow variations.

Then, the sequences {y_n} and {z_n} can be coded independently of each other. Notice that we use the same number of bits for each value of y_n and z_n, and the number of elements in each of the sequences {y_n} and {z_n} is the same as the number of elements in the original sequence {x_n}. Although we are using the same number of bits per sample, we are transmitting twice as many samples and thus doubling the rate. We can avoid this by sending only every other value of y_n and z_n. Let's divide the sequence {y_n} into {y_2n} and {y_{2n−1}}, and similarly divide {z_n} into {z_2n} and {z_{2n−1}}. If we transmit either the even-numbered subsequences or the odd-numbered subsequences, we would transmit only as many elements as in the original sequence.
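A sketch of this two-band split: average and difference sequences, downsampled by two, with exact reconstruction of the even and odd samples (NumPy; the test signal is made up, and the pairing of consecutive samples follows the lecture's y_2n / z_2n relations with 0-based indexing):

import numpy as np

rng = np.random.default_rng(5)
x = np.cumsum(rng.normal(size=64))          # a slowly varying signal

a, b = x[0::2], x[1::2]                     # consecutive pairs (x_{2n}, x_{2n+1})
y = (b + a) / 2                             # low-pass (average) subband, half rate
z = (b - a) / 2                             # high-pass (difference) subband, half rate

x_rec = np.empty_like(x)                    # reconstruction: x_{2n+1} = y + z, x_{2n} = y - z
x_rec[1::2] = y + z
x_rec[0::2] = y - z
assert np.allclose(x_rec, x)

print(np.var(x), np.var(y), np.var(z))      # the difference subband has much smaller variance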

EE5585 Data Compression April 18, 2013 Lecture 23 Instructor: Arya Mazumdar Scribe: Trevor Webster

Differential Encoding

Suppose we have a signal that is slowly varying. For instance, if we were looking at a video frame by frame, we would see that only a few pixels are changing between subsequent frames. In this case, rather than encoding the signal as is, we would first sample it and look at the difference signal and encode this instead:

d_n = x_n − x_{n−1}

This d_n would then need to be quantized, creating an estimate d̂_n which contains some quantization noise q_n:

Q(dn)=dˆn

dˆn = dn + qn

What we are truly interested in recovering are the values of x. Unfortunately, the quantization error gets accumulated in the value of x. The reason is that the operation forming d_n is of the following matrix form

[ −1  1              ]
[     −1  1          ]
[         −1  1      ]
[             ...    ]
[             −1  1  ]

Inverting this operation we obtain the accumulator matrix

[ 1 1              ]
[ 1 1 1            ]
[ 1 1 1 1          ]
[   ...            ]
[ 1 1 1 1 ...  1   ]

If you make an error in d_n, the noise in x_n will be accumulated as follows:

x̂_n = x_n + ∑_{i=0}^{n} q_i

The quantization noise q_n can be positive or negative, and in the long term we expect it to add up to zero, but what happens with high probability is that the range of the error becomes too great for us to handle. Therefore, we adopt the following strategy:

d_n = x_n − x̂_{n−1}

We can then implement this technique using the following encoder

The corresponding decoder will look as follows

This encoder/decoder scheme shown is governed by the relationship introduced previously. Rearranged it looks as follows

x̂_n = d̂_n + x̂_{n−1}

Walking through an example using this scheme, we would have the following sequence:

xˆ0 = x0 (1)

xˆ1 = dˆ1 +ˆx0 (2)

= d1 + q1 +ˆx0 (3)

= x1 + q1 (4)

xˆ2 = dˆ2 +ˆx1 (5)

= d2 + q2 +ˆx1 (6)

= x2 + q2 (7) (8)

We see that in general

x̂_n = d̂_n + x̂_{n−1} = d_n + q_n + x̂_{n−1} = x_n + q_n

Notice that while the quantization noise q_n was originally accumulated in x_n, by adopting the aforementioned strategy we have made each estimate of x_n dependent only on its own quantization error. This type of method is known as a Differential Pulse Coded Modulation (DPCM) scheme.

Differential Pulse Coded Modulation Scheme

Generalizing the method described above, we see that we use a predicted value of x_n, denoted by p_n, to construct the difference sequence, which is then quantized and used to reconstruct the current value of x_n. Furthermore, the reconstructed values are operated on by a predictor (P), which provides an estimate of the next sample in order to recursively form the difference signal. This is shown in the respective encoder and decoder figures below.

Formalizing this procedure, we write

p_n = f(x̂_{n−1}, x̂_{n−2}, x̂_{n−3}, ..., x̂_0)
d_n = x_n − p_n = x_n − f(x̂_{n−1}, x̂_{n−2}, x̂_{n−3}, ..., x̂_0)

We hope that the values of d_n are much smaller than the values of x_n, given that it is a difference signal for a slowly varying signal. So if we were looking at the energy in d_n, we would guess it to be much smaller than the energy in x_n. For a zero-mean signal the energy is analogous to the variance. So if we were to optimize something in this DPCM scheme, we would want to find f such that we minimize the energy in d_n:

Find f(x̂_{n−1}, x̂_{n−2}, x̂_{n−3}, ..., x̂_0) such that σ_d² is minimized, where

σ_d² = E[ (x_n − f(x̂_{n−1}, x̂_{n−2}, x̂_{n−3}, ..., x̂_0))² ]

Since it is necessary to know previous estimates of x_n in order to calculate f, but we also need to know f in order to calculate those previous estimates, we find ourselves in a circular and difficult situation. Therefore, we make the following assumption. Suppose

p_n = f(x_{n−1}, x_{n−2}, x_{n−3}, ..., x_0)

In other words, we design the predictor assuming there is no quantization. This is called the fine quantization assumption. In many cases this makes sense because the estimate of x_n is very close to x_n, as the difference signal is very small, and the quantization error is even smaller. Now we have

σ_d² = E[ (x_n − f(x_{n−1}, x_{n−2}, x_{n−3}, ..., x_0))² ]

Considering the limitations on f, we note that f can be any nonlinear function, but we often don't know how to implement many nonlinear functions. So we assume further that the predictor f is linear and seek the best linear function. By definition, a linear function f(x_{n−1}, x_{n−2}, ..., x_0) is merely a weighted sum of previous values of x_n. Assuming we look back through a "window" of the previous N samples, the predictor can then be expressed as follows

p_n = ∑_{i=1}^{N} a_i x_{n−i}

Now the variance of d_n becomes

σ_d² = E[ (x_n − ∑_{i=1}^{N} a_i x_{n−i})² ]

Now we need to find the coefficients a_i that minimize this variance. To do this we take the derivative with respect to each a_i and set it equal to zero.

∂σ_d²/∂a_1 = E[ −2 (x_n − ∑_{i=1}^{N} a_i x_{n−i}) x_{n−1} ] = −2 E[ x_n x_{n−1} − ∑_{i=1}^{N} a_i x_{n−i} x_{n−1} ] = 0
...
∂σ_d²/∂a_j = −2 E[ x_n x_{n−j} − ∑_{i=1}^{N} a_i x_{n−i} x_{n−j} ] = 0

Setting these derivatives to zero and rearranging, we see that the coefficients we are looking for depend on second-order statistics:

E[x_n x_{n−j}] = ∑_{i=1}^{N} a_i E[x_{n−i} x_{n−j}]

We make the assumption that x is stationary, which means that the correlation between two values of x is only a function of the lag between them. Formally, we denote this by the fact that the expectation of the product of two values of x separated in time by k samples (also known as the autocorrelation) is purely a function of the time difference, or lag, k:

E[xnxn+k]=Rxx(k)

The relationship governing the coefficients a_i can then be rewritten using this notation as

R_xx(j) = ∑_{i=1}^{N} a_i R_xx(i − j)   for 1 ≤ j ≤ N

Thus, writing out the autocorrelation equations for each j we have

R_xx(1) = ∑_{i=1}^{N} a_i R_xx(i − 1)
R_xx(2) = ∑_{i=1}^{N} a_i R_xx(i − 2)
...
R_xx(N) = ∑_{i=1}^{N} a_i R_xx(i − N)

Now that we have N equations and N unknowns, we can solve for the coefficients a_i in matrix form:

[ R_xx(0)     R_xx(1)   ...  R_xx(N−1) ]   [ a_1 ]     [ R_xx(1) ]
[ R_xx(1)     R_xx(0)   ...     .      ]   [ a_2 ]  =  [ R_xx(2) ]
[   ...                        ...     ]   [  .  ]     [   ...   ]
[ R_xx(N−1)     ...          R_xx(0)   ]   [ a_N ]     [ R_xx(N) ]

Rewriting this more compactly, where R is a matrix and a and P are column vectors, we have

R a = P   ⟹   a = R^{−1} P

Thus, we see that in order to determine the coefficients for a linear predictor for DPCM we must invert the autocorrelation matrix and multiply by the autocorrelation vector P. Once we have determined the coefficients we can design the predictor necessary for our DPCM scheme.
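A sketch of solving Ra = P for the predictor coefficients from an estimated autocorrelation (NumPy; the AR-like test signal and the predictor order N are made up):

import numpy as np

rng = np.random.default_rng(6)
n, N = 5000, 3                                  # signal length, predictor order

x = np.zeros(n)                                 # a made-up correlated (AR-like) signal
for i in range(2, n):
    x[i] = 1.2 * x[i - 1] - 0.4 * x[i - 2] + rng.normal()

def Rxx(lag):
    """Biased sample autocorrelation E[x_n x_{n+lag}]."""
    lag = abs(lag)
    return np.dot(x[:n - lag], x[lag:]) / n

R = np.array([[Rxx(i - j) for j in range(N)] for i in range(N)])   # autocorrelation matrix
P = np.array([Rxx(j + 1) for j in range(N)])                        # right-hand side
a = np.linalg.solve(R, P)                                           # predictor coefficients

pred = sum(a[i] * x[N - 1 - i:n - 1 - i] for i in range(N))         # p_n = sum_i a_i x_{n-i}
print(a, np.var(x[N:] - pred) / np.var(x[N:]))                      # d_n has considerably smaller variance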

Now suppose that our signal is rapidly changing as shown below

We see that since the signal varies significantly over short intervals, the difference signal d_n would be very large and DPCM might not be a very good scheme. Suppose instead that we passed this rapidly changing signal through both a low-pass and a high-pass filter, as shown below

Low Pass High Pass

Let us create two signals based on the values of x_n to emulate this low-pass and high-pass response. Let y_n represent an averaging operation that smooths out the response of x_n (low pass) and let z_n represent a difference operation which captures the high-frequency variation of x_n (high pass):

y_n = (x_n + x_{n−1}) / 2
z_n = (x_n − x_{n−1}) / 2

Applying this method for each xn we would send or store two values (yn and zn). This is unnecessary. Instead, what we can do is apply the following strategy

y_2n = (x_2n + x_{2n−1}) / 2
z_2n = (x_2n − x_{2n−1}) / 2

Then we can recover both even and odd values of xn as follows

y_2n + z_2n = x_2n
y_2n − z_2n = x_{2n−1}

This process of splitting the signal into components and keeping only every other sample of each is called decimation, and it can be iterated further until a point at which you perform bit allocation. This method is called sub-band coding. We note that DPCM is well suited for the y_n low-pass components, whereas another technique would likely suit the z_n high-pass components more effectively.

Distributed Storage

Today's storage systems often utilize distributed storage. In such a system, data will be stored in many separate locations on servers. Suppose we want to do data compression in such a system. What we would like to do is query a portion of our data set; this could be 1 page out of an entire document, for instance. Rather than having to sift through the entire data set to find 1 page, we would like to be able to go directly there. In other words, we would like such a system to be query efficient. For this reason, suppose we divide the data into sections, such as by page, before compressing it, in an effort to preserve information about where the data came from. Such a partitioning of a data set is shown below

(101 | 010 | 011 | ······)

Next we define query efficiency as the number of bits we need to process to get back a single bit. Suppose that the partitioned bit stream shown above has length N and we have partitioned it into chunks of m = 3 bits. In this instance, the query efficiency would be m.

n H(x1 )+1 Compression Rate = n n H(x1 ) 1 = n + n nH(x) 1 = n + n 1 = H(x)+ n Now, in the case of our previous example regarding query eciency. The compression rate will be

1 Rate = H(x)+ m 1 = H(x)+ query efficiency

6 If we were able to compress the file as a whole the query eciency would be huge and the Compression Rate would be

1 Rate = H(x)+ N 1 Di↵erence in rate from optimal = Query Efficiency Thus we see there is a trade o↵ between Compression Rate and Query Eciency. Let us end the lecture by propose a research problem were we find a better relation between query eciency and redundancy.

EE5585 Data Compression April 23, 2013 Lecture 24 Instructor: Arya Mazumdar Scribe: Yuanhang Wang

Recap - Sub-band Coding

In the last class we talked about sub-band coding. In the first step, the signal is split into two parts, a high-frequency component (details, local properties) and a low-frequency component (average values, global properties). Then, after several LPFs, we end up with the average value of the signal. The idea is that we don't need some of the details that come from the HPFs, so we throw them out and thereby compress the data.

One possible way to achieve this is to do a DFT in each step. The main problem with the DFT is that we may not have a single matrix that achieves all the steps. Suppose we have some change in the time domain; then we cannot tell how it changes from observing the frequency domain alone. According to the uncertainty principle in Fourier analysis, a function f and its Fourier transform cannot be simultaneously well localized. To take care of that, we need something else, called wavelets.

Wavelets

The wavelet transform is also a transform coding technique; it is a kind of multi-resolution coding. Let's look at the Haar wavelet first.

Definition 1 (Mother function of the Haar wavelet ψ(x))

ψ(x) = {  1,   0 ≤ x < 1/2
         −1,   1/2 ≤ x < 1
          0,   otherwise

Now our idea is to find some basis for any function f. Here let's consider the space H of all square-integrable real-valued functions. A function is said to be square-integrable if

∫_{−∞}^{+∞} |f(x)|² dx < ∞   (1)

Definition 2 (Hilbert Space) A Hilbert space H is an inner product space that is also a complete metric space with respect to the distance function induced by the inner product.

In the space H, two square-integrable real-valued functions f and g have an inner product, which is defined as

⟨f, g⟩ = ∫ f(x) g(x) dx   (2)

The norm is defined as ||f||, where

||f||² = ∫ |f(x)|² dx = ⟨f, f⟩   (3)

Our idea is to change the basis of f(x) and represent it in another basis, say {g_k(x)}. Then the expression will be:

f(x) = ∑_k C_k g_k(x),   k = 0, 1, 2, ...

We also have the concept of a Lebesgue space: the space of all measurable functions f on [T_1, T_2] satisfying

∫_{T_1}^{T_2} |f(x)|² dx < ∞

is called the L² space, L²[T_1, T_2].

Let's define ψ_{0,0}(x) = ψ(x) and

ψ_{j,k}(x) = 2^{j/2} ψ(2^j x − k),   where j ∈ N, k ∈ Z.

It's clear that

ψ_{j,k}(x) = {  2^{j/2},    k/2^j ≤ x < (k + 1/2)/2^j
              −2^{j/2},    (k + 1/2)/2^j ≤ x < (k+1)/2^j
               0,           otherwise

Here the set {ψ_{j,k}, j ∈ N, k ∈ Z} is an orthonormal basis in L²[R]. Let's verify that this basis is orthonormal. We can notice that:

∫_{−∞}^{+∞} |ψ_{j,k}(x)|² dx = ∫_{−∞}^{+∞} (2^{j/2})² |ψ(2^j x − k)|² dx = ∫_{k/2^j}^{(k+1)/2^j} 2^j dx = 2^j · 2^{−j} = 1

Also it's easy to prove that:

∫_{−∞}^{+∞} ψ_{j,k}(x) ψ_{j′,k′}(x) dx = 0,   if (j, k) ≠ (j′, k′)

So {ψ_{j,k}, j ∈ N, k ∈ Z} forms an orthonormal basis. Now let's try to use this basis to represent a function f(x):

f(x) = ∑_{j ≥ 0, k ∈ Z} c_{j,k} ψ_{j,k}(x)

In practice, we take discrete points of f(x) and use ψ(x) to represent it. For example, y = [0  −3  −24  54  29  ...]; then our graph would be:

Now let’s look at the translating and scaling operations on mother function.

The rows show how translating works, and the columns show how scaling works.

Definition 3 (Scaling function of the Haar wavelet φ(x))

φ(x) = { 1,   0 ≤ x < 1
         0,   otherwise

We use φ_k(x) or φ_{0,k}(x) to denote φ(x − k), and φ_{1,k}(x) = φ_{0,k}(2x). Here are the steps that we use to approximate an arbitrary function f(x):

1. Approximate using only φ_k(x) (translating):

The approximation is f⁰(x) = ∑_k C_{0,k} φ(x − k), where C_{0,k} = ∫ f(x) φ(x − k) dx = ∫_k^{k+1} f(x) dx. C_{0,k} is the average value of f(x) in each interval.

2. Based on step one, we do the scaling:

φ_{1,k}(x) = φ(2x − k) = { 1,   k/2 ≤ x < (k+1)/2
                           0,   otherwise

This approximation is f¹(x) = ∑_k C_{1,k} φ_{1,k}(x), where C_{1,k} = 2 ∫ f(x) φ_{1,k}(x) dx = 2 ∫_{k/2}^{(k+1)/2} f(x) dx. C_{1,k} is still the average value of f(x) in each smaller interval.

We can see that:

C_{1,2k} + C_{1,2k+1} = 2 ∫_k^{k+1/2} f(x) dx + 2 ∫_{k+1/2}^{k+1} f(x) dx = 2 ∫_k^{k+1} f(x) dx = 2 C_{0,k}

So we may apply this property to construct a kind of sub-band coding like:

This is a simple wavelet, called the Haar wavelet, and eventually it gives you a matrix of transformation. There are also other, more complex wavelets, like the Daubechies wavelets. They all belong to MRA (Multiresolution Analysis), and the ideas are similar (have a mother function ψ(x) and approximate functions using ψ(2^j x − k)).
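A sketch of the resulting sub-band structure, using the Haar averaging/differencing relations C_{0,k} = (C_{1,2k} + C_{1,2k+1})/2 and b_{0,k} = (C_{1,2k} − C_{1,2k+1})/2 applied recursively to a discrete signal (Python; the data extends the example values above with made-up entries, and the orthonormal 2^{j/2} scaling is dropped for readability):

import numpy as np

def haar_analysis(c):
    """One level: averages (coarse part) and differences (detail part)."""
    c = np.asarray(c, dtype=float)
    return (c[0::2] + c[1::2]) / 2, (c[0::2] - c[1::2]) / 2

def haar_synthesis(avg, det):
    c = np.empty(2 * len(avg))
    c[0::2] = avg + det
    c[1::2] = avg - det
    return c

y = np.array([0, -3, -24, 54, 29, 11, 8, 8], dtype=float)

levels, coarse = [], y
while len(coarse) > 1:                  # iterate the low-pass branch, as in sub-band coding
    coarse, detail = haar_analysis(coarse)
    levels.append(detail)

rec = coarse                            # reconstruct by running the synthesis in reverse
for detail in reversed(levels):
    rec = haar_synthesis(rec, detail)
assert np.allclose(rec, y)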

5 Fixed to Fixed Almost Lossless Coding

Let’s look at our lossy coding techniques:

lossy coding techniques: { transform coding, rate-distortion theory, quantization, sub-band coding }

In all of these cases, especially in transform coding, we are taking a set of vectors and transforming it into another set of vectors. So it is from fixed-length messages to fixed-length codewords: no matter what the message is, the length of the codewords is the same. But when we do lossless coding:

lossless coding techniques: { Huffman code, Lempel-Ziv }

These are all fixed-to-variable-length codes. For example, when we do Huffman coding, if the alphabet X = {a, b, c, d}, our codewords can be a = 0, b = 10, c = 110, d = 111, which are not of fixed length.

Let us assume X is a source alphabet, say {0, 1}, and x ∈ X^n. Can we uniformly map x to a k-length string? Is there a way to do lossless coding in this case, which is fixed length to fixed length? The answer is that we cannot do exact lossless coding, but we can do something almost lossless; that requires that our decoder work with high probability. Suppose we have a binary sequence of length n:

x = 00...101......

Assume each bit has Pr(0) = 1 − p, Pr(1) = p, and the bits are i.i.d. Then we know:

Pr(00111011...) = f(number of 1s, L) = p^L (1 − p)^{n−L}

So

Pr(a sequence having L 1s) = C(n, L) p^L (1 − p)^{n−L} ≈ (2^{n h(L/n)} / √n) · 2^{L log p + (n−L) log(1−p)}

= (1/√n) 2^{n [ h(L/n) + (L/n) log p + (1 − L/n) log(1−p) ]}

= (1/√n) 2^{−n [ (L/n) log( (L/n)/p ) + (1 − L/n) log( (1 − L/n)/(1−p) ) ]}

= (1/√n) 2^{−n D(L/n || p)}

We know that D(p || q) ≥ 0, and it is 0 only when p = q. So when L ≠ np, D(L/n || p) > 0, and (1/√n) 2^{−n D(L/n || p)} goes to 0 as n → ∞. It is not vanishing only when L ≈ np, or n(p − ε) ≤ L ≤ n(p + ε). Then in this case

C(n, np) ~ 2^{n h(p)}

The number of these typical sequences is about 2^{n h(p)}, so we only need n h(p) bits to index any of them.

Suppose k = nh(p); then we can have fixed-length codes that use only nh(p) bits. Therefore the rate of compression is

Rate = k/n = h(p) = entropy

Now we compress the data from fixed length to fixed length, but we have a probability of error equal to the probability of the sequences that do not have approximately np 1's. Even though this probability is very small for very large n, those sequences are still there. But our decoder will work with very high probability. This connects lossless coding to lossy coding.
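A sketch computing, for a Bernoulli(p) source, the rate h(p) used by such a fixed-to-fixed code and the probability that a length-n sequence falls outside the typical range n(p ± ε), which is the event on which the code fails (SciPy; the values of n, p, ε are made up):

import numpy as np
from scipy.stats import binom

def h(p):
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)   # binary entropy

n, p, eps = 2000, 0.1, 0.02
lo, hi = int(np.floor(n * (p - eps))), int(np.ceil(n * (p + eps)))

rate = h(p)                                   # bits per symbol used by the code
p_err = 1 - (binom.cdf(hi, n, p) - binom.cdf(lo - 1, n, p))
print(rate, p_err)                            # rate ~ 0.47 bits/symbol; error prob. is small and -> 0 as n grows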

EE5585 Data Compression April 25, 2013 Lecture 25 Instructor: Arya Mazumdar Scribe: Zhongyu He

1 Wavelet Decomposition

Suppose we have a signal as in Figure 1. What we want to do is approximate the signal.

Figure 1: Approximated Signal.

One way of doing that is to take a function as in Figure 2 and use shifted versions of it. This is the simplest thing we can do. We can then have the piecewise-constant approximation shown in red in Figure 1. This also captures the quantization process: the signal between two points will be quantized to a certain value.

Figure 2: Block Wavelet

The function in Figure 2 is the following:

φ(t) = { 1,   0 ≤ t < 1
         0,   otherwise

The shifted versions of this function will be

φ(t − k),   k = 0, 1, ···

Using the shifted versions of the function in Figure 2, we can approximate the signal in Figure 1 at some resolution. However, this function has very low resolution: between 0 and 1 we have only one value. If we want to approximate the signal better, we need to scale the function.

The shifted, scaled function will be as follows:

φ(2t − k) = { 1,   0 ≤ 2t − k < 1, i.e. k/2 ≤ t < (k+1)/2
              0,   otherwise

Now, with the new function above, we double the resolution. We can improve the resolution by scaling the function more, as above. From the Nyquist rule, once the sampling rate is twice the maximum frequency of the original signal, the signal can be recovered without any loss. Now define the function:

φ_{j,k}(t) = φ(2^j t − k) = { 1,   0 ≤ 2^j t − k < 1, i.e. k/2^j ≤ t < (k+1)/2^j
                              0,   otherwise

Using the above function to approximate the signal, as stated in red in Figure 1, we then have the 0th-order approximation. The coefficients C_{0,k} are the average values of f between k and k+1.

f⁰(t) = ∑_k C_{0,k} φ_{0,k}(t)

C_{0,k} = ∫ f(t) φ_{0,k}(t) dt = ∫_k^{k+1} f(t) dt

This is actually a very low-resolution approximation of the signal we have. In order to get higher resolution, we can do higher-order approximations.

1st-order approximation:

f¹(t) = ∑_k C_{1,k} φ_{1,k}(t)

C_{1,k} = 2 ∫ f(t) φ_{1,k}(t) dt = 2 ∫_{k/2}^{(k+1)/2} f(t) dt

C_{1,2k} = 2 ∫_k^{k+1/2} f(t) dt
C_{1,2k+1} = 2 ∫_{k+1/2}^{k+1} f(t) dt

C_{1,2k} + C_{1,2k+1} = 2 ∫_k^{k+1} f(t) dt = 2 C_{0,k}
⟹ C_{0,k} = (1/2) [ C_{1,2k} + C_{1,2k+1} ]

We can go further and further.

2nd-order approximation:

f²(t) = ∑_k C_{2,k} φ_{2,k}(t)

C_{2,k} = 4 ∫_{k/4}^{(k+1)/4} f(t) dt

...

f^j(t) = ∑_k C_{j,k} φ_{j,k}(t)

C_{j,k} = 2^j ∫_{k/2^j}^{(k+1)/2^j} f(t) dt

We can do the above procedure until we get the resolution we want. Suppose we are representing the signal by the 1st-order approximation,

f¹(t) = ∑_k C_{1,k} φ_{1,k}(t)

Such a representation can be split into two lower-resolution parts:

f⁰(t) = ∑_k C_{0,k} φ_{0,k}(t),   C_{0,k} = (1/2)[ C_{1,2k} + C_{1,2k+1} ]

and

f¹(t) − f⁰(t)

We can see that the higher-resolution signal can be represented by lower-resolution ones. Let's look at the interval from l to l+1.

Figure 3: Interval from l to l+1.

We have:

f¹(t) − f⁰(t) = { C_{1,2l} − C_{0,l} = (1/2)[ C_{1,2l} − C_{1,2l+1} ],      l ≤ t < l + 1/2
                  C_{1,2l+1} − C_{0,l} = −(1/2)[ C_{1,2l} − C_{1,2l+1} ],   l + 1/2 ≤ t < l + 1

Define a function:

ψ(t) = {  1,   0 ≤ t < 1/2
         −1,   1/2 ≤ t < 1

Then we have, on [l, l+1],

f¹(t) − f⁰(t) = (1/2)[ C_{1,2l} − C_{1,2l+1} ] ψ(t − l)

             = ∑_k b_{0,k} ψ_{0,k}(t)

where b_{0,k} = (1/2)[ C_{1,2k} − C_{1,2k+1} ] and ψ_{0,k}(t) = ψ(t − k).

Now we can see that the coefficients of the lower-resolution parts can be represented by the coefficients of the higher-resolution parts. The wavelet part ψ(t) is called the Haar wavelet. Depending on what kind of signal we want to approximate, block functions may not be the best choice; functions like the one in Figure 4 may be better wavelets for approximating some signals.

j Then we have shifted: (t k), and scaled: (2 t k)=j,k(t). Then we will represent the signal with whatever resolution we need with these functions. Again, as what we did above:

j f (t)= Cj,kj,k(t)

X Cj,k = < j,k, j,k > f(t) (t)dt = j,k ( (t))2dt R j,k

Also j,k(t) can be split into two parts. j 1,kR(t) and j,k(t) j 1,k(t). We can go further if (t) is known.

2 Data Compression Course Frame

What have we learned from this course, and what can we do with it? That is Data Compression. Data compression is not only compression but also recovery, because if you don't have a decoding algorithm, then there is no use in compressing the data.

Therefore, there are two kinds of techniques, based on the kind of recovery we can give. One is lossless recovery; the other is lossy recovery.

2.1 Lossless Recovery
What is the limit of lossless compression? There are two parallel notions. Given a data string with unknown distribution, the best compression we can get is given by the Kolmogorov complexity. On the other hand, if the distribution of a random variable is known to us, or we can model the data as a random variable, we cannot compress below the entropy. So in some sense, the Kolmogorov complexity is more general than the entropy. There are several coding procedures that can take you to the value of the entropy. With these methods, we can get to the value of the entropy very closely, if not exactly.

2.1.1 Variable-length coding
1. Distribution is known: Huffman coding. In this part, we also had the notions of unique decodability and prefix-free codes. All the codes are both uniquely decodable and prefix-free. We also discussed research problems on fix-free codes.
2. Universal coding: Lempel-Ziv algorithm

2.1.2 Fixed to fix length coding This is a special case when distortion is equal to zero.

2.2 Lossy Coding
The more general case comes from lossy coding, when the distortion is not zero, and the theory is given by rate-distortion. The distortion can be measured in the worst case or the average case. Rate-distortion theory tells us that the compression algorithm we're using is lossy with some error, but given a certain amount of error, it gives the rate of compression we can achieve. Then we looked at quantizers. This is one important way to achieve the trade-off between compression rate and distortion.

2.2.1 Scalar Quantizers
Uniform quantizers and optimal quantizers (for which we looked at Lloyd's algorithm). In most cases, scalar quantizers are far away from what rate-distortion theory promises. The solution is to go to vector quantizers.

2.2.2 Vector Quantizers
Basically, in vector quantization we are doing clustering. We have multiple kinds of algorithms, like Linde-Buzo-Gray and k-means clustering. There are two things that vector quantizers exploit.

1. Shape gain. Most vector quantizers try to use rectangles to shape the decision regions. This may not be the best way to quantize, but it is the most prominent way in vector quantizers.

2. Correlation between data. We saw the height-weight example. It consists of random variables that are highly correlated, as in Figure 5. We can draw a regression line and turn it into a 1-dimensional quantization.

Figure 5: Height-Weight

When you have correlated data, one way to handle it is to rotate the axes. That gives you another tool for approaching the rate-distortion function, apart from the quantizers.

2.2.3 Transform coding
In transform coding, you basically rotate the axes by using a linear operator. In this case, you just multiply the data by a transform matrix to transform the data to other orthonormal axes.

1. Karhunen-Loeve transform

If we want to minimize the distortion, the best thing to do is the Karhunen-Loeve transform, which depends on the data. The process of doing this transform is called Principal Component Analysis.

If we want to find some algorithm that is independent of the data, we can look at the following.

2. DCT or DFT.

These are standard transform coding techniques that work well for a very large class of signals.

Now that we use transform coding to get rid of many dimensions, why do we need to acquire that many signal samples in the first place? This leads to the idea of the sensing of signals: that is Compressed Sensing.

3. Compressed Sensing.

This time, the transform matrix is not a square matrix but a rectangular matrix. So we need to look at the sufficient conditions for the existence of such sampling schemes. And then we have one decoding algorithm, called Basis Pursuit, that does the recovery. And there is the Restricted Isometry Property for this scheme.

If we want to deal with a wider class of signals, we need some more techniques, as follows.

4. Predictive coding

5. Differential coding (DPCM)

6. Subband coding. We separate the signal into different bands using wavelets; the most prominent wavelet is the Haar wavelet.
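As a reminder of what one level of a Haar decomposition does, here is a minimal sketch; the input signal is an arbitrary example, and normalization by sqrt(2) is one common convention.

```python
import numpy as np

def haar_one_level(x):
    """One level of the Haar wavelet transform: scaled sums and differences."""
    x = np.asarray(x, dtype=float)
    approx = (x[0::2] + x[1::2]) / np.sqrt(2)   # low-pass (coarse) coefficients
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)   # high-pass (detail) coefficients
    return approx, detail

def haar_inverse(approx, detail):
    """Invert the one-level Haar transform."""
    x = np.empty(2 * len(approx))
    x[0::2] = (approx + detail) / np.sqrt(2)
    x[1::2] = (approx - detail) / np.sqrt(2)
    return x

x = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
a, d = haar_one_level(x)
print(a, d)
print(np.allclose(haar_inverse(a, d), x))       # perfect reconstruction
```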

These are the basic building blocks of the course. Most applications combine several of these techniques, and the techniques themselves are closely related to one another.

3 Homework 4 Solution

Problem 1 Consider the following matrix:
$$\Phi = \begin{bmatrix} 1 & 0 & 0.5 \\ 2 & 1 & -1 \end{bmatrix}.$$
We observe samples of a 1-sparse vector $x$:
$$y = \Phi x = \begin{bmatrix} 0 \\ 5 \end{bmatrix}.$$
Find out $x$. Next, suppose we observe
$$z = \Phi x = \begin{bmatrix} 10.1 \\ 20.1 \end{bmatrix}$$
for a general vector $x$. Find out $x$ using $\ell_1$-minimization (basis pursuit).

Solution:

Question 1: Since every pair of columns of $\Phi$ is linearly independent, the 1-sparse problem has a unique solution. A simple way to solve it is to try
$$x_1 = \begin{bmatrix} a \\ 0 \\ 0 \end{bmatrix},\quad x_2 = \begin{bmatrix} 0 \\ a \\ 0 \end{bmatrix},\quad x_3 = \begin{bmatrix} 0 \\ 0 \\ a \end{bmatrix}.$$
Only the second form is consistent with $y$ (with $a = 5$), so the solution is
$$x = \begin{bmatrix} 0 \\ 5 \\ 0 \end{bmatrix}.$$

Question 2: Assume
$$x = \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix}, \qquad \min \|x\|_{\ell_1} \ \text{ s.t. } \ \Phi x = z,$$
that is,
$$\min\ |a_1| + |a_2| + |a_3| \quad \text{s.t.}\quad a_1 + 0.5a_3 = 10.1,\qquad 2a_1 + a_2 - a_3 = 20.1.$$
Eliminating $a_1 = 10.1 - 0.5a_3$ and $a_2 = 2a_3 - 0.1$ from the constraints, the objective becomes
$$f(a_3) = |10.1 - 0.5a_3| + |0.1 - 2a_3| + |a_3|.$$

Case 1: $a_3 \ge 20.2$.
$$f = 0.5a_3 - 10.1 + 2a_3 - 0.1 + a_3 = 3.5a_3 - 10.2 \ \Rightarrow\ \min f = 3.5 \times 20.2 - 10.2 = 60.5 \ \text{ at } a_3 = 20.2.$$

Case 2: $0.05 \le a_3 < 20.2$.
$$f = 10.1 - 0.5a_3 + 2a_3 - 0.1 + a_3 = 2.5a_3 + 10 \ \Rightarrow\ \min f = 2.5 \times 0.05 + 10 = 10.125 \ \text{ at } a_3 = 0.05.$$

Case 3: $0 \le a_3 < 0.05$.
$$f = 10.1 - 0.5a_3 + 0.1 - 2a_3 + a_3 = 10.2 - 1.5a_3 \ \Rightarrow\ f > 10.125 \ \text{ on this interval, approaching } 10.125 \ \text{ as } a_3 \to 0.05.$$

Case 4: $a_3 < 0$.
$$f = 10.1 - 0.5a_3 + 0.1 - 2a_3 - a_3 = 10.2 - 3.5a_3 \ \Rightarrow\ \min f = 10.2 \ \text{ at } a_3 = 0.$$

The overall minimum is $f = 10.125$ at $a_3 = 0.05$, so
$$x = \begin{bmatrix} 10.075 \\ 0 \\ 0.05 \end{bmatrix}.$$
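The same minimization can be checked numerically by writing basis pursuit as a linear program with the standard split $x = u - v$, $u, v \ge 0$. This is only a sketch, using the measurement matrix as reconstructed above and scipy's generic LP solver in place of a dedicated basis-pursuit routine.

```python
import numpy as np
from scipy.optimize import linprog

# Measurement matrix and observation from Problem 1, part 2.
Phi = np.array([[1.0, 0.0, 0.5],
                [2.0, 1.0, -1.0]])
z = np.array([10.1, 20.1])

# Basis pursuit: min ||x||_1 s.t. Phi x = z, as an LP in (u, v) with x = u - v.
n = Phi.shape[1]
c = np.ones(2 * n)                        # sum(u) + sum(v) = ||x||_1 at the optimum
A_eq = np.hstack([Phi, -Phi])             # Phi u - Phi v = z
res = linprog(c, A_eq=A_eq, b_eq=z, bounds=[(0, None)] * (2 * n))
x = res.x[:n] - res.x[n:]
print(x)                                  # approximately [10.075, 0, 0.05]
```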

Problem 2 The following table shows height-weight data of 12 monkeys.

Find an orthonormal transform matrix to compress this data. Suppose you compress this two-dimensional data to one dimension. Show the data-compression procedure. What is the average error/distortion?

Solution: The data is well approximated by the line $y = 1.5x$. Following the transform-coding procedure, find the angle $\theta = \arctan(1.5)$ and the transform matrix
$$A = \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix}.$$
Rotating the mean-removed data by $A$ aligns the first coordinate with this line; keeping only that coordinate compresses the data to one dimension, and the average distortion is the variance of the discarded second coordinate.
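A sketch of this compression procedure on synthetic data standing in for the monkey table, which is not reproduced in these notes. The slope 1.5 comes from the solution above; the sample values and noise level are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
height = rng.normal(50.0, 8.0, 12)                  # 12 synthetic "monkeys"
weight = 1.5 * height + rng.normal(0.0, 2.0, 12)
X = np.column_stack([height, weight])
Xc = X - X.mean(axis=0)                             # remove the mean first

theta = np.arctan(1.5)                              # angle of the line y = 1.5x
A = np.array([[np.cos(theta),  np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])     # clockwise rotation

Y = Xc @ A.T                                        # rotated (decorrelated) coordinates
Y[:, 1] = 0                                         # keep only the first coordinate
X_hat = Y @ A + X.mean(axis=0)                      # invert the rotation, add mean back

print(np.mean(np.sum((X - X_hat) ** 2, axis=1)))    # average distortion per sample
```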

EE5585 Data Compression May 2, 2013 Lecture 27 Instructor: Arya Mazumdar Scribe: Fangying Zhang

Distributed Data Compression / Source Coding

In the previous class we used a height-weight table as a simple example; now let us think of it as an X-Y table whose two columns are correlated. Suppose Alice has X and Bob has Y. Alice wants to compress her X data down to the entropy H(X) and Bob wants to compress his Y data down to H(Y). They share one decoder, and the total rate of compression is R_1 + R_2 = H(X) + H(Y). The remarkable thing about Slepian-Wolf coding is that even with no link between Alice and Bob, they can still do better than H(X) + H(Y). What is the best you could do if you knew both X and Y together? You could achieve H(X,Y), and
$$H(X,Y) = H(X) + H(Y \mid X) \le H(X) + H(Y)$$
(with equality when X and Y are independent), so jointly we can compress to fewer bits than H(X) + H(Y).

Let us formalize this in a more rigorous way. Suppose we have two discrete sources X and Y with distributions p(x) and p(y), and we make n independent drawings from a joint probability distribution p(x, y). To compress X, we see an n-length sequence x^n coming from X, and the encoding function maps it to an index identifying one of 2^{nR_1} codewords:

$$f_1 : \mathcal{X}^n \to \{1, 2, \ldots, 2^{nR_1}\}.$$
The index needs $\log(2^{nR_1}) = nR_1$ bits to express, so Alice's compression rate is $R_1$: any n-symbol sequence is compressed to $nR_1$ bits. Similarly,

$$f_2 : \mathcal{Y}^n \to \{1, 2, \ldots, 2^{nR_2}\},$$
and Bob's compression rate is $R_2$.

Then there is a single decoding function g, which is:

$$g : \{1, 2, \ldots, 2^{nR_1}\} \times \{1, 2, \ldots, 2^{nR_2}\} \to \mathcal{X}^n \times \mathcal{Y}^n.$$
So this gives you back the sequences. The error probability of this process is
$$P_e^{(n)} = \Pr\big(g(f_1(x^n), f_2(y^n)) \ne (x^n, y^n)\big).$$
A rate pair $(R_1, R_2)$ is achievable if there exist encoding and decoding functions such that $P_e^{(n)} \to 0$. Separate encoders that ignore the correlation need $R_1 \ge H(X)$ and $R_2 \ge H(Y)$. Surprisingly, by the Slepian-Wolf theorem, one can do as well as the joint encoder without any communication between Alice and Bob.

1 Slepian Wolf Theorem

The Slepian-Wolf theorem deals with the lossless compression of two or more correlated data streams (Slepian and Wolf, 1973). In the best-known variation, each of the correlated streams is encoded separately, and the compressed data from all these encoders are jointly decoded by a single decoder, as shown in Figure 1 for two correlated streams. Such systems are said to employ Slepian-Wolf coding, which is a form of distributed source coding. Lossless compression means that the source outputs can be reconstructed from the compressed versions with arbitrarily small error probability by a suitable choice of a parameter in the compression scheme.

Figure 1: Block diagram for Slepian-Wolf coding: independent encoding of two correlated data streams X and Y

The efficiency of the system is measured by the rates, in encoded bits per source symbol, of the compressed data streams output by the encoders. The Slepian-Wolf theorem specifies the set of rates that allow the decoder to reconstruct the correlated data streams with arbitrarily small error probability. As shown in Figure 1, encoder 1 observes X and sends the decoder a message that is a number from the set $\{1, 2, \ldots, 2^{nR_1}\}$. Similarly, encoder 2 observes Y and sends a number from the set $\{1, 2, \ldots, 2^{nR_2}\}$. The outputs of the two encoders are the inputs to the single decoder. The decoder, upon receiving these two inputs, outputs two n-vectors $\hat{X}$ and $\hat{Y}$ which are estimates of X and Y, respectively. The systems of interest are those for which the probability that $\hat{X} \ne X$ or $\hat{Y} \ne Y$ can be made as small as desired by choosing n sufficiently large. Such a system is called an achievable system, and the rate pair $(R_1, R_2)$ of an achievable system is an achievable rate pair. The achievable rate region is the closure of the set of all achievable rate pairs, and it is the set of rate pairs $(R_1, R_2)$ satisfying the three inequalities:

1) $R_1 + R_2 \ge H(X, Y)$

2) $R_1 \ge H(X \mid Y)$

This means that even if Y were known (Alice knows Bob's data), X cannot be compressed below $H(X \mid Y)$, which is never larger than $H(X)$. The same holds for Bob.

3) $R_2 \ge H(Y \mid X)$

Inequalities (2) and (3) say that even if there were a communication channel between Alice and Bob, i.e., each knew what the other has, $H(X \mid Y)$ and $H(Y \mid X)$ are the natural lower bounds on their individual rates. This achievable rate region is shown in Figure 2.

Figure 2: Achievable Rate Region

The significance of the Slepian-Wolf theorem is seen by comparing it with the entropy bound for compression of single sources. Separate encoders that ignore the source correlation can only achieve rates with $R_1 + R_2 \ge H(X) + H(Y)$. With Slepian-Wolf coding, however, the separate encoders exploit their knowledge of the correlation to achieve the same total rate as an optimal joint encoder, namely $R_1 + R_2 \ge H(X, Y)$.

2 Coin Tosses

Alice and Bob each toss a coin. For x (Alice) and y (Bob): $\Pr(1) = \Pr(0) = 0.5$ for each coin, and $\Pr(x = y) = 0.75$, so the tosses are correlated. The table in Figure 3 gives the joint distribution $p(x, y)$: $p(0,0) = p(1,1) = 3/8$ and $p(0,1) = p(1,0) = 1/8$.

Figure 3: Joint distribution $p(x, y)$ of the two coins

Now Alice and Bob want to compress their toss sequences. Because each coin is fair, naive separate compression needs at least 1 bit per toss from each of them, i.e., 2 bits per toss in total. However, with Slepian-Wolf coding you can compress more than this. You can achieve:

$$R_1 + R_2 = H(X,Y) = \frac{3}{8}\log\frac{8}{3} + \frac{1}{8}\log 8 + \frac{3}{8}\log\frac{8}{3} + \frac{1}{8}\log 8 = 3 - \frac{3}{4}\log 3 = 1.811,$$
$$R_1 \ge H(X \mid Y) = h\!\left(\frac{3}{4}\right)$$

$$= \frac{3}{4}\log\frac{4}{3} + \frac{1}{4}\log 4 = 2 - \frac{3}{4}\log 3 = 2 - 1.189 = 0.811.$$

Similarly, $R_2 \ge 0.811$. So you can still compress more than separate coding allows, even though there is no communication between Alice and Bob.
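A quick numerical check of these values, using the joint distribution from Figure 3 and base-2 logarithms:

```python
import numpy as np

# Joint pmf of (x, y): Pr(x = y) = 3/4 with fair marginals.
p_xy = np.array([[3/8, 1/8],
                 [1/8, 3/8]])

H_XY = -np.sum(p_xy * np.log2(p_xy))      # joint entropy H(X, Y)
p_y = p_xy.sum(axis=0)                    # marginal of Y (= [1/2, 1/2])
H_Y = -np.sum(p_y * np.log2(p_y))         # H(Y) = 1 bit
H_X_given_Y = H_XY - H_Y                  # chain rule: H(X|Y) = H(X,Y) - H(Y)

print(round(H_XY, 3))                     # 1.811
print(round(H_X_given_Y, 3))              # 0.811
```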

3 Proof: Typical Sequence

Before we prove this, let us recall the case of a single source. There we used fixed-to-fixed n-length coding, which relies on typical sequences. A sequence $x^n \in \mathcal{X}^n$ is called typical if
$$2^{-n(H(X)+\epsilon)} \le \Pr(x^n) \le 2^{-n(H(X)-\epsilon)}.$$
If $A_\epsilon^{(n)}$ is the set of typical sequences, then
$$2^{n(H(X)-\epsilon)} \le |A_\epsilon^{(n)}| \le 2^{n(H(X)+\epsilon)}.$$
This is called the Asymptotic Equipartition Property. Taking logarithms of the left and right sides shows that about $nH(X)$ bits suffice to index the typical set.
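A small illustration of the AEP for a Bernoulli(p) source. The values p = 0.2, n = 200, and eps = 0.05 are arbitrary choices; the code counts sequences by their number of ones, since typicality depends only on that.

```python
import numpy as np
from math import comb

p, n, eps = 0.2, 200, 0.05
H = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))       # entropy of Bernoulli(p)

typical_count = 0.0
typical_prob = 0.0
for k in range(n + 1):
    logp = k * np.log2(p) + (n - k) * np.log2(1 - p)   # log2 Pr(x^n) for weight k
    if -n * (H + eps) <= logp <= -n * (H - eps):       # typicality window
        typical_count += comb(n, k)
        typical_prob += comb(n, k) * 2.0 ** logp

print(np.log2(typical_count) / n, H)    # (1/n) log |A| is close to H(X)
print(typical_prob)                     # probability of the typical set, -> 1 as n grows
```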

4 Another Proof: Random Binning

Suppose the encoder is $f : \mathcal{X}^n \to \{1, 2, \ldots, 2^{nR}\}$, where $f(x^n)$ is a random index drawn uniformly from $\{1, 2, \ldots, 2^{nR}\}$ (random binning). The decoder is $g : \{1, 2, \ldots, 2^{nR}\} \to \mathcal{X}^n$, which means: given an index $i$, find a typical sequence that was assigned to index $i$ and output it.

$$P_e^{(n)} \le \Pr\big(x^n \notin A_\epsilon^{(n)}\big) + \Pr\big(\exists\, y^n \ne x^n,\ y^n \in A_\epsilon^{(n)},\ f(y^n) = f(x^n)\big).$$
The first term vanishes by the AEP. For the second term,
$$\begin{aligned}
\Pr\big(\exists\, y^n \ne x^n,\ y^n \in A_\epsilon^{(n)},\ f(y^n) = f(x^n)\big)
&= \sum_{x^n \in \mathcal{X}^n} P(x^n)\,\Pr\big(\exists\, y^n \in A_\epsilon^{(n)},\ y^n \ne x^n:\ f(y^n) = f(x^n)\big)\\
&\le \sum_{x^n \in \mathcal{X}^n} P(x^n) \sum_{y^n \in A_\epsilon^{(n)},\, y^n \ne x^n} \Pr\big(f(y^n) = f(x^n)\big)\\
&\le \sum_{x^n \in \mathcal{X}^n} P(x^n) \sum_{y^n \in A_\epsilon^{(n)}} 2^{-nR}
= 2^{-nR}\, 2^{n(H(X)+\epsilon)}.
\end{aligned}$$

Whenever $R > H(X) + \epsilon$, the bound $2^{-nR}\, 2^{n(H(X)+\epsilon)} = 2^{-n(R - H(X) - \epsilon)} \to 0$. So we can get a code with rate arbitrarily close to $H(X)$. Now we need to extend this idea to the distributed setting. What we have here are two encoders:

$$f_1 : \mathcal{X}^n \to \{1, 2, \ldots, 2^{nR_1}\}$$

$$f_2 : \mathcal{Y}^n \to \{1, 2, \ldots, 2^{nR_2}\}.$$

You can consider each index as a bin: whenever you have a sequence, you throw it into one of the bins at random; $f_1$ and $f_2$ are both random binnings. The idea is that the observed pair is typical and, with enough bins, with high probability no bin contains more than one typical sequence, so just by looking at the bins you can find out what the sequences were. The decoder $g(i, j)$: find $(x^n, y^n) \in A_\epsilon^{(n)}$ such that $f_1(x^n) = i$ and $f_2(y^n) = j$. When will this be successful? Again, let us calculate the error probability $P_e^{(n)}$.

Define the error events
$$E_0 = \big\{(x^n, y^n) \notin A_\epsilon^{(n)}\big\},$$
$$E_1 = \big\{\exists\, \tilde{x}^n \ne x^n:\ (\tilde{x}^n, y^n) \in A_\epsilon^{(n)},\ f_1(\tilde{x}^n) = f_1(x^n)\big\},$$
$$E_2 = \big\{\exists\, \tilde{y}^n \ne y^n:\ (x^n, \tilde{y}^n) \in A_\epsilon^{(n)},\ f_2(\tilde{y}^n) = f_2(y^n)\big\},$$
$$E_{12} = \big\{\exists\, (\tilde{x}^n, \tilde{y}^n):\ \tilde{x}^n \ne x^n,\ \tilde{y}^n \ne y^n,\ (\tilde{x}^n, \tilde{y}^n) \in A_\epsilon^{(n)},\ f_1(\tilde{x}^n) = f_1(x^n),\ f_2(\tilde{y}^n) = f_2(y^n)\big\}.$$
Then
$$P_e^{(n)} = \Pr(E_0 \cup E_1 \cup E_2 \cup E_{12}) \le \Pr(E_0) + \Pr(E_1) + \Pr(E_2) + \Pr(E_{12}),$$
and $\Pr(E_0) \to 0$ by the AEP. For $E_1$,
$$\begin{aligned}
\Pr(E_1) &= \sum_{(x^n, y^n)} P(x^n, y^n)\,\Pr\big(\exists\, \tilde{x}^n \ne x^n:\ (\tilde{x}^n, y^n) \in A_\epsilon^{(n)},\ f_1(\tilde{x}^n) = f_1(x^n)\big)\\
&\le \sum_{(x^n, y^n)} P(x^n, y^n) \sum_{\tilde{x}^n:\ (\tilde{x}^n, y^n) \in A_\epsilon^{(n)}} 2^{-nR_1}\\
&= 2^{-nR_1}\, \big|A_\epsilon^{(n)}(X \mid Y)\big| \le 2^{-nR_1}\, 2^{n(H(X \mid Y)+\epsilon)}.
\end{aligned}$$

When $R_1 > H(X \mid Y) + \epsilon$, $\Pr(E_1) \to 0$; when $R_2 > H(Y \mid X) + \epsilon$, $\Pr(E_2) \to 0$; and when $R_1 + R_2 > H(X, Y)$, $\Pr(E_{12}) \to 0$. Thus $P_e^{(n)} \to 0$.

What would a more efficient (explicit) encoder be? Let $X \sim \mathrm{Ber}(\tfrac{1}{2})$ and $Y \sim \mathrm{Ber}(\tfrac{1}{2})$ with $\Pr(x = y) = \tfrac{3}{4}$, so X and Y are correlated. In general, suppose $\Pr(x \ne y) = p$ and $\Pr(x = y) = 1 - p$; here $p = \tfrac{1}{4}$. Alice compresses X at its entropy $H(X)$, which is 1 bit per symbol. Bob compresses Y in the following way. He takes a random Bernoulli($\tfrac{1}{2}$) matrix $\Phi$ of size $m \times n$ ($m < n$) and sends $\Phi y^n \bmod 2$, which is only $m$ bits. The decoder, which already knows $x^n$, computes

$$\Phi x^n \bmod 2 \;\oplus\; \Phi y^n \bmod 2 \;=\; \Phi(x^n \oplus y^n) \bmod 2 \;=\; \Phi z^n \bmod 2.$$

Here,

$$z^n = z_1 z_2 \ldots z_n = x^n \oplus y^n.$$

The $z_i$ are i.i.d. with

$$\Pr(z_i = 1) = \Pr(x_i \ne y_i) = p, \qquad \Pr(z_i = 0) = 1 - p.$$
The decoder then finds $z^n$ from $\Phi z^n \bmod 2$. Once $z^n$ is found, adding it to $x^n$ (mod 2) gives $y^n$. The condition for finding this $z^n$: the typical set of $z$ satisfies

$$2^{n(H(Z)-\epsilon)} \le \big|A_\epsilon^{(n)}(z)\big| \le 2^{n(H(Z)+\epsilon)},$$
in which

$$H(Z) = h(p) = -p \log p - (1-p)\log(1-p),$$
and

$$\big|A_\epsilon^{(n)}(z)\big| \sim 2^{nh(p)}.$$

An error occurs if there exists another typical sequence $\tilde{z}^n \in A_\epsilon^{(n)}(z)$, $\tilde{z}^n \ne z^n$, for which $\Phi \tilde{z}^n = \Phi z^n \pmod 2$. The error probability is therefore bounded by

$$P_e^{(n)} \le 2^{nh(p)} \cdot 2^{-m}.$$
If $m > nh(p)$, i.e., $\tfrac{m}{n} > h(p) + c$ for some constant $c > 0$, then $P_e^{(n)} \to 0$.
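A toy simulation of this scheme. The block length, crossover probability, and number of syndrome bits are illustrative, and the typical-set search of the proof is replaced by a brute-force search for the lowest-weight (most probable) pattern consistent with the syndrome, which is feasible only for small n.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(4)
n, p = 14, 0.1
m = 9                                     # a bit above n*h(0.1), roughly 6.6 bits

x = rng.integers(0, 2, n)                 # Alice's sequence (sent losslessly)
z = (rng.random(n) < p).astype(int)       # noise pattern, Bernoulli(p)
y = (x + z) % 2                           # Bob's correlated sequence

Phi = rng.integers(0, 2, size=(m, n))     # random Bernoulli(1/2) binning matrix
syndrome = Phi @ y % 2                    # the m bits Bob actually sends

# Decoder: knows x and Phi@y, so it can form Phi@z, then searches for the
# most probable z consistent with it.
b = (Phi @ x + syndrome) % 2              # = Phi @ z (mod 2)
best, best_w = None, n + 1
for cand in product([0, 1], repeat=n):
    cand = np.array(cand)
    if cand.sum() < best_w and np.array_equal(Phi @ cand % 2, b):
        best, best_w = cand, cand.sum()

y_hat = (x + best) % 2
print(np.array_equal(y_hat, y))           # True with high probability
```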


EE5585 Data Compression May 7, 2013

Lecture 28

Instructor: Arya Mazumdar Scribe: Yijun Lin

1 Distributed Data Compression

From the previous lectures we have learned that for two data sources X and Y, two separate encoders that ignore the source correlation can only achieve rates with R1 + R2 ≥ H(X) + H(Y). However, if X and Y are correlated, then by the Slepian-Wolf theorem an optimal joint encoder can achieve a total rate of H(X,Y). Now let us consider a similar problem: suppose there are two data sets X and Y, each with n elements, (x1, x2, x3, ..., xn) and (y1, y2, y3, ..., yn), which are correlated and not independent. The two data sets are compressed by two separate encoders and then sent to a single joint decoder, as shown in Figure 1.

Figure 1: Block diagram of two correlated data sets

Such a system is measured by the compression rate, in bits per source symbol, of the encoder output streams. The single decoder is designed to reconstruct the correlated data streams from the compressed versions in an optimal way. As we can see, encoder 1 uses nH(X) bits while encoder 2 uses nH(Y) bits. If the two encoders could cooperate (for instance through a communication channel between them), a joint encoding could bring the total down, since

nH(X) + nH(Y) ≥ nH(X,Y)     (1)

and by the Slepian-Wolf theorem the same total is achievable even without such a channel, which means the pair can be compressed to H(X,Y) bits per symbol.

1.1 Distributed Data Compression with Bernoulli Source

Now let us consider the following example: assume there are two correlated data sources X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn). X is Bernoulli(q), and the correlation is given by Pr(xi = yi) = 1 − p and Pr(xi ≠ yi) = p. Hence Y is Bernoulli(p + q − 2pq), since:

Pr(Y = 1) = Pr(X = 1 and Y = X) + Pr(X = 0 and Y ≠ X) = q(1 − p) + (1 − q)p = p + q − 2pq

Compressing the two sources separately, ignoring the correlation, takes nh(q) + nh(p + q − 2pq) bits. However, as in (1), the optimal total is nH(X,Y) = nh(q) + nh(p) bits, since: n[ H(X,Y) ] = n[ H(X) + H(Y|X) ] = n[ H(X) + H(Z) ]

where Zi = Xi + Yi mod 2 for i = 1 to n, and each Zi is Bernoulli(p), so

= n[ h(q) + h(p) ]

Notice that since p < 1/2 (and q > 0), we have q(1 − 2p) > 0, which implies p + q − 2pq > p. Because the binary entropy h(·) is increasing on [0, 1/2] (and p + q − 2pq ≤ 1/2 when both p and q are at most 1/2), it follows that h(p + q − 2pq) > h(p), so the separate scheme indeed uses more bits than nh(q) + nh(p).
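A quick numerical comparison of the two totals for illustrative values of p and q (both below 1/2):

```python
import numpy as np

def h(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

p, q, n = 0.1, 0.3, 1000
naive = n * h(q) + n * h(p + q - 2 * p * q)   # separate coding, correlation ignored
joint = n * h(q) + n * h(p)                   # joint / Slepian-Wolf total
print(naive, joint)                           # the naive total is strictly larger
```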

1.2 Distributed Data Compression over Mod-2 Arithmetic

Suppose there are two data sources X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn) over the vector space F^n = {0, 1}^n. Let Z = X + Y. Notice that Z is also equal to X − Y or Y − X, because the arithmetic is mod 2.

Figure 2 shows the block diagram of sources X, Y and Z, where Z is Bernoulli(p).

Figure 2

The compressed descriptions of X and Y use nH(X) bits and m bits, respectively. From what was discussed above, the achievable number of bits per symbol for this problem is:

H(X) + m/n = h(q) + m/n.

Then we claim that there exists an m × n matrix Φ (m < n) with m ≈ nh(p) such that the source data Y can be recovered from ΦY together with X. Here is how the decoder works: looking at its inputs, it has the nH(X)-bit description of X and the vector ΦY. Then do the following: (1) decode X from its nH(X)-bit description; (2) find ΦX simply by multiplying the two together; (3) find ΦZ = Φ(X + Y) = ΦX + ΦY; (4) recover Z from ΦZ; (5) find Y from Y = X + Z. However, when recovering Z from ΦZ in step (4), there is no unique solution. Since an arbitrary solution would not serve the purpose, we restrict the search to the typical set of Z, which recovers the correct Z with arbitrarily small error probability. First, let us put down some definitions:

We know that Z is Bernoulli(p); write ΦZ = b for short. The typical set of Z is Aε^(n)(p), with |Aε^(n)| ≤ 2^(n(h(p)+ε)). The goal is to find a vector Ẑ in the typical set Aε^(n) that satisfies ΦẐ = b. This search can be complicated and will typically require a computer, but once we have the output Ẑ, from whatever tool we used, we can recover Y from Y = X + Ẑ.

The probability of error Pe^(n) can be bounded as:

Pe^(n) ≤ Pr( Z ∉ Aε^(n)(p) ) + Pr( ∃ Z' ≠ Z : Z' ∈ Aε^(n)(p) and ΦZ' = b )

The first term vanishes by the AEP, and by the union bound the second term is at most

∑_{Z' ∈ Aε^(n)(p), Z' ≠ Z} Pr( ΦZ' = b ).

To calculate Pr(ΦZ' = b), we choose Φ uniformly at random, with i.i.d. Bernoulli(1/2) entries, so that ΦZ' = b is the system of equations with Φ = (Φij) an m × n matrix, Z' = (Z1', ..., Zn')ᵀ, and b = (b1, ..., bm)ᵀ. Row by row,

Φ11Z1' + Φ12Z2' + ⋯ + Φ1nZn' = b1

Since the entries of Φ are i.i.d. Bernoulli(1/2) and Z' ≠ Z, each row equation is satisfied with probability 1/2, independently across the m rows. This implies:

Pr(ΦZ' = b) = (1/2) × (1/2) × ⋯ × (1/2) = 2^(-m). Substituting this back into the error inequality:

Pe^(n) ≤ | Aε^(n)(p) | × 2^(-m)

≤ 2^(n(h(p)+ε)) × 2^(-m)

so Pe^(n) ≤ 2^(-nε') → 0 if we take m = n(h(p) + ε + ε').

Then the total is nh(q) + m = n[ h(q) + m/n ] ≈ n[ h(q) + h(p) ].

1.3 Distributed Data Compression when the Two Sources are Close

Consider a similar problem in which the two sources X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn) are real-valued and very close to each other (i.e., X − Y is small). The process diagram is shown in Figure 3 below:

Figure 3

Let Z = X − Y, which can be modeled as Gaussian noise N(0, σ²). The job once again is to find Z from ΦZ = b. Since the noise should be as small as possible, we look for the minimum-norm solution: minimize ||Z|| subject to ΦZ = b.

♠ Claim: Assuming that ΦΦ^T is nonsingular, the minimizer is Ẑ = Φ^T (ΦΦ^T)^(-1) b.

♠ Proof: suppose that Z satisfies ΦZ = b.

Then

||Z||^2 = ||Z − Ẑ + Ẑ||^2 = ||Z − Ẑ||^2 + 2 Ẑ^T(Z − Ẑ) + ||Ẑ||^2.

The cross term Ẑ^T(Z − Ẑ) is zero because

Ẑ^T(Z − Ẑ) = b^T(ΦΦ^T)^(-1) Φ(Z − Ẑ)
           = b^T(ΦΦ^T)^(-1) (b − b)
           = 0.

Hence ||Z||^2 = ||Z − Ẑ||^2 + ||Ẑ||^2 ≥ ||Ẑ||^2, i.e., ||Z|| ≥ ||Ẑ||.

Finally, ΦẐ = ΦΦ^T (ΦΦ^T)^(-1) b = b, so Ẑ is feasible, and by the argument above it is the feasible solution of minimum norm.
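A small numerical check of this claim, assuming a random wide matrix Φ with full row rank; numpy's pseudo-inverse is used only for comparison.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 4, 10
Phi = rng.normal(size=(m, n))          # wide matrix, full row rank with probability 1
b = rng.normal(size=m)

# Closed-form minimum-norm solution from the claim.
Z_hat = Phi.T @ np.linalg.inv(Phi @ Phi.T) @ b
print(np.allclose(Phi @ Z_hat, b))                     # feasibility: Phi Z_hat = b

Z_pinv = np.linalg.pinv(Phi) @ b                       # pseudo-inverse min-norm solution
print(np.allclose(Z_hat, Z_pinv))                      # the two agree

# Any other feasible Z (Z_hat plus a null-space vector) has norm at least ||Z_hat||.
null_part = (np.eye(n) - np.linalg.pinv(Phi) @ Phi) @ rng.normal(size=n)
Z_other = Z_hat + null_part
print(np.allclose(Phi @ Z_other, b), np.linalg.norm(Z_other) >= np.linalg.norm(Z_hat))
```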

2 Linear Error-Correcting Code

Figure 4

In general, as shown in Figure 4, suppose a data source X ∈ F^n = {0, 1}^n goes through a communication channel that flips each bit with probability p, i.e., adds some noise Z, and the channel output Y is known. The process we discussed above can be used to recover X: compute the syndrome ΦY, find Z from it, and remove the noise. A code built this way is called a linear error-correcting code.