10-704: Information Processing and Learning, Spring 2015
Lecture 16: March 19
Lecturer: Aarti Singh    Scribes: Aarti Singh, Danai Koutra

Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Instructor.

So far in the course, we have talked about lossless compression. In today's class, we will begin talking about lossy compression. We will also make connections to sufficient statistics and the information bottleneck principle in machine learning.

16.1 Sufficient Statistics

Sufficient statistics provide a (usually short) summary of the data, and are often useful for learning tasks. A statistic is any function T(X) of the data X. If θ parametrizes the class of underlying data-generating distributions, then for any statistic we have the Markov chain

θ → X → T(X), i.e. θ ⊥ T(X) | X,

and the data processing inequality tells us that I(θ, T(X)) ≤ I(θ, X). A statistic is sufficient for θ if θ ⊥ X | T(X), i.e. we also have the Markov chain

θ → T (X) → X.

In words, once we know T(X), the remaining information in X does not depend on θ. This implies that p(X | T(X)) does not depend on θ (a useful characterization when you may not want to think of θ as a random variable). Also, I(θ, T(X)) = I(θ, X). Examples of sufficient statistics:

1. X = X_1, ..., X_n ∼ Ber(θ). Then T(X) = (1/n) ∑_{i=1}^n X_i is sufficient for θ.

2. X = X_1, ..., X_n ∼ Uniform(θ, θ + 1). Then T(X) = {min_i X_i, max_i X_i} is sufficient for θ.

Note: A sufficient statistic can be multi-dimensional, as the last example shows.
Note: A sufficient statistic is not unique. Clearly, X itself is always a sufficient statistic.
How do we find a sufficient statistic?

Theorem 16.1 (Fisher-Neyman Factorization Theorem) T (X) is a sufficient statistic for θ iff p(X; θ) = g(T (X), θ)h(X).

Here p(X; θ) is the joint distribution if θ is random, or is the likelihood of data under pθ otherwise.


Let's consider the second example above. Then

p(X; θ) = ∏_{i=1}^n p_θ(X_i) = ∏_{i=1}^n 1{X_i ∈ (θ, θ+1)} = 1{θ ≤ min_i X_i ≤ max_i X_i ≤ θ + 1} = g(T(X), θ)

If instead X = X_1, ..., X_n ∼ Uniform(0, θ), then T(X) = max_i X_i is a sufficient statistic, since

p(X; θ) = ∏_{i=1}^n p_θ(X_i) = ∏_{i=1}^n (1/θ) 1{X_i ∈ (0, θ)} = θ^{−n} 1{0 ≤ min_i X_i ≤ max_i X_i ≤ θ} = 1{0 ≤ min_i X_i} · θ^{−n} 1{max_i X_i ≤ θ} = h(X) g(T(X), θ)
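The factorization theorem gives an analytic route to sufficiency; the definition can also be checked numerically. Below is a minimal sketch (my own illustration, not part of the notes) for the Bernoulli example above: it enumerates all binary sequences of length n, computes p(X | T(X)) within each level set of T(X) = ∑_i X_i, and confirms that the result is the same for two different values of θ. The values of n and θ are arbitrary choices.

```python
import itertools
import numpy as np

def conditional_given_T(theta, n=4):
    """For X_1, ..., X_n ~ Ber(theta) i.i.d., return p(x | T(x)) for every
    binary sequence x, where T(x) = sum_i x_i, by brute-force enumeration."""
    seqs = list(itertools.product([0, 1], repeat=n))
    probs = np.array([theta**sum(x) * (1 - theta)**(n - sum(x)) for x in seqs])
    cond = np.empty_like(probs)
    for k in range(n + 1):                      # level sets {x : T(x) = k}
        idx = [i for i, x in enumerate(seqs) if sum(x) == k]
        cond[idx] = probs[idx] / probs[idx].sum()
    return cond

# Sufficiency: the conditional distribution of X given T(X) is free of theta.
print(np.allclose(conditional_given_T(0.3), conditional_given_T(0.8)))  # True
```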

A sufficient statistic T(X) is minimal if T(X) = g(S(X)) for all sufficient statistics S(X). For example, if X = X_1, ..., X_n ∼ Ber(θ), then T(X) = (1/n) ∑_{i=1}^n X_i is a minimal sufficient statistic for θ, but S(X) = X is not.
Note: A minimal sufficient statistic is not unique. For the above example, both (1/n) ∑_{i=1}^n X_i and ∑_{i=1}^n X_i are minimal sufficient statistics.
Let's mention an alternate characterization of a sufficient statistic and of a minimal sufficient statistic.

• If T (X) is sufficient, then

I(θ, T(X)) = max_{T'} I(θ, T'(X)) (= I(θ, X)),

i.e. it is a statistic that has the largest mutual information with θ among all statistics T'(X).

• If T(X) is a minimal sufficient statistic, then

T(X) ∈ arg min_{S} I(X, S(X)) s.t. I(θ, S(X)) = max_{T'} I(θ, T'(X)),

i.e. it is a statistic that has the smallest mutual information with X while having the largest mutual information with θ, as illustrated in the numerical sketch below.
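These characterizations can be verified in a toy setting. The following sketch (my illustration; the two-point prior on θ and the choice n = 4 are assumptions made for the example) puts a uniform prior on θ ∈ {0.2, 0.8}, takes X_i ∼ Ber(θ) with T(X) = ∑_i X_i, and checks that I(θ, T(X)) = I(θ, X) while I(X, T(X)) = H(T) is smaller than I(X, X) = H(X).

```python
import itertools
from math import comb
import numpy as np

def entropy(p):
    """Entropy in bits of a probability vector (zero entries ignored)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_info(joint):
    """I(A; B) computed from a joint probability table p(a, b)."""
    return entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0)) - entropy(joint)

n, thetas, prior = 4, [0.2, 0.8], [0.5, 0.5]
seqs = list(itertools.product([0, 1], repeat=n))

# Joint distributions p(theta, x) and p(theta, t) with t = sum_i x_i.
p_theta_x = np.array([[w * th**sum(x) * (1 - th)**(n - sum(x)) for x in seqs]
                      for th, w in zip(thetas, prior)])
p_theta_t = np.array([[w * comb(n, k) * th**k * (1 - th)**(n - k) for k in range(n + 1)]
                      for th, w in zip(thetas, prior)])

p_x = p_theta_x.sum(axis=0)
p_t = np.array([sum(p_x[i] for i, x in enumerate(seqs) if sum(x) == k)
                for k in range(n + 1)])

# Sufficiency: T(X) retains all the information about theta.
print(np.isclose(mutual_info(p_theta_x), mutual_info(p_theta_t)))   # True
# Minimality intuition: T is a deterministic function of X, so
# I(X, T) = H(T) < H(X) = I(X, X), i.e. T shares less information with X than S(X) = X.
print(entropy(p_t) < entropy(p_x))                                   # True
```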

The following theorem limits the usefulness of sufficient statistics for complex learning tasks.

Theorem 16.2 (Pitman-Koopman-Darmois Theorem) For non-compactly supported distributions, exact sufficient statistics with bounded (not growing with n) dimensionality exist only for distributions in the exponential family.

16.2 Information Bottleneck Principle

The Information Bottleneck (IB) Principle is an information-theoretic approach to machine learning that generalizes the notion of sufficient statistic. The IB principle suggests using a summary T (X) of the data X that preserves some amount of information about an auxiliary variable Y :

min_T I(X, T) s.t. I(Y, T) ≥ γ    ≡    min_T I(X, T) − β I(Y, T)

where β ≥ 0. Thus, T serves as an information bottleneck between X and Y, extracting the information from X that is relevant to Y. It can be thought of as dimensionality reduction. As β → ∞, T preserves all information in X relevant to Y and becomes a minimal sufficient statistic for Y based on the data X. Some examples of applications of the IB method in machine learning:

1. Speech compression: ideally, the speech waveform can be compressed down to its entropy. But in many applications it is possible to compress further without losing important information about words, meaning, speaker identity, etc. The auxiliary variable Y can be the transcript of the signal for speech recognition, the part of speech for grammar checking, the speaker identity, etc.

2. Protein folding: given the protein sequence data, we might compress it while preserving only information about its structure.

3. Neural coding: given neural stimuli, we might seek a summary relevant to characterizing the neural responses.

4. Unsupervised problems: in clustering documents, X can be the document identity, Y the words in the document, and T will be clusters of documents with similar word statistics.

5. Supervised problems: in document classification, X can be the words in the document, Y the topic labels, and T will be clusters of words that are sufficient for document classification.

Another viewpoint, explored in [1], is to consider I(Y, T) as the empirical risk and I(X, T) as a regularization term. As β → ∞, we are maximizing I(Y, T) and overfitting to the data X, while for lower β this is prevented as we minimize I(X, T). The IB method can also be considered as lossy compression of the data X, where the degree of information lost about Y is controlled by β. In [2], it is argued that IB optimization provides a more general framework than the rate distortion function used in lossy compression. We discuss this next.

[1] http://www.cs.huji.ac.il/labs/learning/Papers/ibgen.pdf
[2] http://www.cs.huji.ac.il/labs/learning/Papers/allerton.pdf

16.3 Lossy Compression and the Rate-Distortion Function

We have seen that for lossless compression, we can only compress a random variable X down to its entropy in bits. What if we allow some loss or distortion in how accurately we recover X? This question is unavoidable for continuous random variables, which cannot be represented to infinite precision with a finite number of bits. This leads to lossy compression, also known as rate distortion theory. The rate-distortion function characterizes the smallest number of bits to which we can compress X while allowing distortion at most D in some distortion metric d(X, X̂):

R(D) = min_{p(x̂|x): E[d(X, X̂)] ≤ D} I(X, X̂)

Observe that if we set D = 0, i.e. lossless compression is desired, then the rate distortion function is R(0) = I(X, X) = H(X), the entropy. Examples of distortion functions are:

• Squared error distortion: d(X, X̂) = (X − X̂)²
• Hamming distortion: d(X, X̂) = 1{X ≠ X̂}.

The IB principle can be considered as a more general framework than rate-distortion, since in many applications (such as speaker identification) there may not be a natural distortion function, and it is easier to state

the goal as preserving information about an auxiliary variable Y (speaker identity). However, as we will see in the next section, the IB method implicitly assumes the KL divergence as a distortion function. Example of a rate-distortion function:

• Gaussian source N(0, σ²) and squared-error distortion. We first find a lower bound for the rate distortion function:

I(X, X̂) = H(X) − H(X|X̂)
= (1/2) log((2πe)σ²) − H(X|X̂)
≥ (1/2) log((2πe)σ²) − H(X − X̂)
≥ (1/2) log((2πe)σ²) − (1/2) log((2πe)D)
= (1/2) log(σ²/D),

where σ² ≥ D. Note that we used the fact that conditioning reduces entropy, H(X|X̂) = H(X − X̂ | X̂) ≤ H(X − X̂) (1st inequality), and that the maximum entropy under the constraint E[(X − X̂)²] ≤ D is achieved by the Gaussian distribution (2nd inequality). The lower bound can be achieved by the simple construction in Fig. 16.1 with X̂ ∼ N(0, σ² − D), Z ∼ N(0, D), Z ⊥ X̂, and X = X̂ + Z. It is easy to verify that H(X|X̂) = (1/2) log((2πe)D) and E[(X − X̂)²] = D.

Figure 16.1: Joint distribution for Gaussian source.

If σ² < D, i.e. we are allowed mean squared distortion larger than σ², then we don't need to encode X at all: with R = 0 (e.g. always reconstructing X̂ = 0) we achieve squared distortion σ² ≤ D trivially. The rate distortion function is

R(D) = min_{p(x̂|x): E[d(X,X̂)] ≤ D} I(X; X̂) = (1/2) log(σ²/D) if D ≤ σ², and R(D) = 0 if D > σ².
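As a sanity check on the Gaussian example, the test channel of Fig. 16.1 can be simulated. The sketch below (my addition; σ² and D are arbitrary choices) draws X̂ ∼ N(0, σ² − D) and Z ∼ N(0, D) independently, sets X = X̂ + Z, and confirms that Var(X) ≈ σ² and E[(X − X̂)²] ≈ D, alongside the corresponding rate (1/2) log(σ²/D).

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, D, N = 2.0, 0.5, 1_000_000    # source variance, distortion budget, sample size

# Test channel of Fig. 16.1: Xhat ~ N(0, sigma^2 - D), Z ~ N(0, D), Z independent of Xhat.
Xhat = rng.normal(0.0, np.sqrt(sigma2 - D), N)
Z = rng.normal(0.0, np.sqrt(D), N)
X = Xhat + Z

print(np.var(X))                          # ~ sigma2: X has the right marginal N(0, sigma^2)
print(np.mean((X - Xhat) ** 2))           # ~ D: the squared-error budget is met
print(0.5 * np.log2(sigma2 / D), "bits")  # R(D) = (1/2) log(sigma^2 / D)
```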

16.4 Blahut-Arimoto algorithm for IB and R(D)

In some cases, such as the Gaussian example above, the rate-distortion function can be computed explicitly. More generally, a closed form solution may not exist and we resort to iterative methods. The Blahut-Arimoto algorithm provides an iterative solution to both the rate-distortion function and the information bottleneck. The solution will also reveal that IB is essentially using the KL divergence as the distortion measure.

Let's first consider the rate-distortion function. The objective that we need to minimize can be written (in analogy with the IB method) as

I(X, T) + β E_{X,T}[d(X, T)],

where β ≥ 0, and we optimize over p(t|x), which is a probability distribution (it integrates to 1 and is non-negative). The Lagrangian can be written as:

L = I(X, T) + β E_{X,T}[d(X, T)] + ∑_x λ_x [∑_t p(t|x) − 1] − ∑_{x,t} ν_{x,t} p(t|x)

where ν_{x,t} ≥ 0. The iterative approach starts with an initial guess of p(t) and alternates between updating p(t|x) and p(t). The update equation for p(t) is

p(t) = ∑_x p(t|x) p(x)

First recall

I(X, T) = ∑_{x,t} p(x, t) log [p(x, t) / (p(x) p(t))]
= ∑_{x,t} p(t|x) p(x) log [p(t|x) / p(t)]
= ∑_{x,t} p(t|x) p(x) log p(t|x) − ∑_{x,t} p(t|x) p(x) log p(t)
= ∑_{x,t} p(t|x) p(x) log p(t|x) − ∑_t p(t) log p(t)

Taking the derivative of I(X, T) with respect to p(t|x), for fixed x and t:

∂I(X, T)/∂p(t|x) = p(x) log p(t|x) + p(x) − (∂p(t)/∂p(t|x)) [log p(t) + 1]
= p(x) [log p(t|x) + 1 − log p(t) − 1]
= p(x) log [p(t|x) / p(t)]

Also,

∂E_{X,T}[d(X, T)] / ∂p(t|x) = d(x, t) p(x)
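As a quick aside, both partial derivatives can be checked by finite differences. In the sketch below (my own check; the distributions and distortion matrix are random placeholders), the entries of p(t|x) are treated as free variables with p(t) = ∑_x p(t|x) p(x), and the analytic partials p(x) log(p(t|x)/p(t)) and d(x, t) p(x) are compared against numerical differences.

```python
import numpy as np

rng = np.random.default_rng(1)
px = np.array([0.2, 0.5, 0.3])                          # p(x)
d = rng.random((3, 4))                                  # placeholder distortion d(x, t)
Q = rng.random((3, 4)); Q /= Q.sum(1, keepdims=True)    # conditional p(t|x)

def I_func(Q):                                          # I(X, T) as a function of p(t|x)
    pt = px @ Q                                         # p(t) = sum_x p(t|x) p(x)
    return np.sum(Q * px[:, None] * np.log(Q)) - np.sum(pt * np.log(pt))

def D_func(Q):                                          # E_{X,T}[d(X, T)]
    return np.sum(px[:, None] * Q * d)

x, t, eps = 1, 2, 1e-6
Qp = Q.copy(); Qp[x, t] += eps                          # perturb a single entry of p(t|x)

pt = px @ Q
print(px[x] * np.log(Q[x, t] / pt[t]), (I_func(Qp) - I_func(Q)) / eps)  # p(x) log(p(t|x)/p(t))
print(d[x, t] * px[x], (D_func(Qp) - D_func(Q)) / eps)                  # d(x, t) p(x)
```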

Thus, the derivative of the Lagrangian is:

∂L/∂p(t|x) = p(x) [ log (p(t|x)/p(t)) + β d(x, t) + (λ_x − ν_{x,t})/p(x) ]

Setting it equal to 0, we get the update equation for p(t|x):

p(t|x) ∝ p(t) e^{−β d(x,t)}

which, coupled with the update equation for p(t),

p(t) = ∑_x p(t|x) p(x),

yields the Blahut-Arimoto algorithm for calculating the rate-distortion function.
Note: The Blahut-Arimoto algorithm only helps calculate the rate-distortion function; it does not provide a way to actually perform lossy compression.
Note: The algorithm assumes p(x) is given. One can also jointly optimize over p(x) and p(t|x) in an EM-like fashion.
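These two coupled updates are easy to implement. The sketch below (my own illustrative implementation; the Bernoulli(1/2) source and Hamming distortion are assumed choices) runs the updates for a fixed β and reads off the pair (R, D) = (I(X, T), E[d(X, T)]); sweeping β traces out the rate-distortion curve, which for this source should satisfy R = 1 − H_b(D).

```python
import numpy as np

def blahut_arimoto(px, d, beta, n_iter=200):
    """Blahut-Arimoto updates for the Lagrangian I(X,T) + beta * E[d(X,T)].
    px: source distribution p(x); d: distortion matrix d[x, t]. Returns (R, D)."""
    nt = d.shape[1]
    pt = np.full(nt, 1.0 / nt)                          # initial guess for p(t)
    for _ in range(n_iter):
        q = pt[None, :] * np.exp(-beta * d)             # p(t|x) propto p(t) exp(-beta d(x,t))
        q /= q.sum(axis=1, keepdims=True)
        pt = px @ q                                     # p(t) = sum_x p(t|x) p(x)
    R = np.sum(px[:, None] * q * np.log2(q / pt[None, :]))   # I(X, T) in bits
    D = np.sum(px[:, None] * q * d)                          # E[d(X, T)]
    return R, D

# Bernoulli(1/2) source with Hamming distortion; here R(D) = 1 - H_b(D).
px = np.array([0.5, 0.5])
d = np.array([[0.0, 1.0],
              [1.0, 0.0]])
for beta in [1.0, 2.0, 5.0]:
    R, D = blahut_arimoto(px, d, beta)
    Hb = -D * np.log2(D) - (1 - D) * np.log2(1 - D)
    print(f"beta={beta}: R={R:.3f} bits, D={D:.3f}, 1 - H_b(D)={1 - Hb:.3f}")
```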

Similarly, we can derive an iterative algorithm for the IB method where the objective is

I(X, T) − β I(Y, T)

where β ≥ 0, and we optimize over p(t|x), which is a probability distribution (it integrates to 1 and is non-negative). The Blahut-Arimoto algorithm for IB yields the following update equations (assuming p(x) and p(y|x) are known):

p(t|x) ∝ p(t) e^{−β KL(p(y|x) || p(y|t))}

p(t) = ∑_x p(t|x) p(x)

p(y|t) = p(y, t)/p(t) = (1/p(t)) ∑_x p(y|x, t) p(t|x) p(x) = (1/p(t)) ∑_x p(y|x) p(t|x) p(x)

The last step follows since Y is independent of T given X. Thus, the effective distortion measure in IB is d(x, t) = KL(p(y|x) || p(y|t)), the KL divergence between the conditional distribution of Y given X and the conditional distribution of Y given T.
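The IB updates translate directly into code as well. The sketch below (my illustrative implementation; the small joint distribution p(x, y), the number of clusters, and β are made-up choices) alternates the three updates above for a fixed β, with KL(p(y|x) || p(y|t)) playing the role of the distortion.

```python
import numpy as np

def ib_blahut_arimoto(pxy, n_clusters, beta, n_iter=300, seed=0):
    """Iterative IB updates for min I(X,T) - beta * I(Y,T), given the joint pxy[x, y].
    Returns the soft assignments p(t|x), the marginal p(t), and p(y|t)."""
    rng = np.random.default_rng(seed)
    px = pxy.sum(axis=1)                                 # p(x)
    py_x = pxy / px[:, None]                             # p(y|x)
    qt_x = rng.random((pxy.shape[0], n_clusters))        # random initial p(t|x)
    qt_x /= qt_x.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        pt = px @ qt_x                                   # p(t) = sum_x p(t|x) p(x)
        py_t = (qt_x * px[:, None]).T @ py_x / pt[:, None]   # p(y|t)
        # Effective distortion: KL( p(y|x) || p(y|t) ) for every pair (x, t).
        kl = np.sum(py_x[:, None, :] * np.log(py_x[:, None, :] / py_t[None, :, :]), axis=2)
        qt_x = pt[None, :] * np.exp(-beta * kl)          # p(t|x) propto p(t) exp(-beta KL)
        qt_x /= qt_x.sum(axis=1, keepdims=True)
    pt = px @ qt_x
    py_t = (qt_x * px[:, None]).T @ py_x / pt[:, None]
    return qt_x, pt, py_t

# Toy joint p(x, y): x's with similar p(y|x) should end up sharing a cluster t.
pxy = np.array([[0.20, 0.05],
                [0.18, 0.07],
                [0.05, 0.20],
                [0.07, 0.18]])
qt_x, pt, py_t = ib_blahut_arimoto(pxy, n_clusters=2, beta=20.0)
print(np.round(qt_x, 3))   # soft assignments p(t|x); with large beta they are near-deterministic
```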