Dependent Random Variables

Mário S. Alvim ([email protected])

Information Theory

DCC-UFMG (2020/01)

Dependent random variables - Introduction

In previous lectures we studied measures of the information content of individual ensembles:

the information content h(x) of an event x, and the entropy H(X) of a single ensemble X.

In this lecture we will study the information content and entropy of dependent ensembles (or random variables). In particular, we will define the concepts of joint and conditional entropy and of mutual information, and we will study their mathematical properties.

These concepts will be later used in a variety of scenarios: from communication theory to machine learning and artificial intelligence, and from biology to security and privacy.


As a motivation, note that many problems can be modeled as follows:

1. We are interested in learning the outcome of an ensemble X.

2. We don't have direct access to X. Instead, we have access to the outcomes of another ensemble Y that may be correlated with X.

3. We want to infer as much information as possible about X using only what we observed from Y.

In this kind of problem we need to define and quantify the amount of information that an ensemble carries about another ensemble.


For instance, to reliably send a message through a communication channel we must know how the output from the channel (i.e., what we can actually see) is correlated with the input to the channel (i.e., what we can’t see but want to infer). Input and output to a channel can be modeled as dependent ensembles X and Y , respectively, and by measuring the amount of information Y carries about X we can quantify how good the channel is in transmitting information.

As other examples, it may be relevant to measure how much information:

1 the result of a particular type of exam (Y ) carries about whether or not someone has a disease (X );

2 a .jpeg file (Y ) carries about the original image it compressed (X );

3 an election ballot (Y ) carries about whether or not a particular individual voted for a particular candidate (X );

4 ...

Review of Shannon information content and Entropy

Shannon information content

The Shannon information content of an outcome x is defined as

$h(x) = \log_2 \frac{1}{p(x)},$

and it is measured in bits.

h(x) measures the surprise one has when learning that the outcome of an ensemble was x.

Additivity in case of independence: If x, y are independent, then h(x, y) = h(x) + h(y).
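
As a quick aside (not part of the original slides), the following Python sketch computes h(x) for an illustrative distribution and checks the additivity property numerically; the fair-die probabilities are an assumption made only for this example.

    import math

    def h(p):
        """Shannon information content, in bits, of an outcome with probability p."""
        return math.log2(1 / p)

    # Illustrative example: one roll of a fair six-sided die.
    p_roll = 1 / 6
    print(h(p_roll))                 # about 2.585 bits

    # Two independent rolls: p(x, y) = p(x) * p(y), so h(x, y) = h(x) + h(y).
    print(h(p_roll * p_roll))        # about 5.170 bits
    print(h(p_roll) + h(p_roll))     # same value, as additivity predicts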

Entropy

The entropy of an ensemble X is defined as

$H(X) = \sum_{x \in A_X} p(x) \log_2 \frac{1}{p(x)},$

with the convention that for p(x) = 0 we use 0 · log2(1/0) = 0.

The entropy of an ensemble is the expected value of the information content of each of its outcomes:

$H(X) = \sum_{x \in A_X} p(x) h(x).$

If a distribution has only two values p and 1 − p, we define the entropy

$H(p) = p \log_2 \frac{1}{p} + (1 - p) \log_2 \frac{1}{1 - p}.$
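
The two definitions above translate directly into code. Here is a minimal Python sketch (an addition, not from the slides), using the convention 0 · log2(1/0) = 0; the example distributions are assumptions for illustration.

    import math

    def entropy(probs):
        """Entropy H(X), in bits, of a distribution given as a list of probabilities."""
        # Terms with p(x) = 0 are skipped, implementing the 0 * log2(1/0) = 0 convention.
        return sum(p * math.log2(1 / p) for p in probs if p > 0)

    def binary_entropy(p):
        """H(p) for a two-valued distribution (p, 1 - p)."""
        return entropy([p, 1 - p])

    print(entropy([1/2, 1/4, 1/8, 1/8]))  # 1.75 bits
    print(binary_entropy(0.5))            # 1.0 bit, the maximum for two outcomes
    print(binary_entropy(0.11))           # about 0.5 bits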

Entropy - Properties

Bounds: 0 ≤ H(X ) ≤ log2 |AX |.

Changing the base of the entropy: H_b(X) = (log_b a) · H_a(X).

Concavity: H(p) is concave in p.

Joint entropy and Conditional entropy

Joint entropy

The joint entropy of ensembles X, Y is

$H(X, Y) = \sum_{x \in A_X} \sum_{y \in A_Y} p(x, y) \log_2 \frac{1}{p(x, y)}.$

The joint entropy of two ensembles is the expected value of the information content of each of their joint outcomes:

$H(X, Y) = \sum_{x \in A_X} \sum_{y \in A_Y} p(x, y) h(x, y).$

Additivity in case of independence: Entropy is additive for independent ensembles:

H(X , Y ) = H(X ) + H(Y ) iff X and Y are independent.

Proof. We already had a proof for this in one of our homework assignments. Later in this lecture we will provide an alternative proof.
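
For concreteness, the sketch below (my own illustration, not part of the slides) computes H(X, Y) from a joint distribution stored as a dictionary and checks the additivity property on an independent pair; the distributions used are arbitrary assumptions.

    import math

    def joint_entropy(p_xy):
        """H(X, Y), in bits, where p_xy maps (x, y) pairs to probabilities."""
        return sum(p * math.log2(1 / p) for p in p_xy.values() if p > 0)

    # Independent X and Y: p(x, y) = p(x) * p(y).
    p_x = {'a': 1/2, 'b': 1/2}
    p_y = {0: 1/4, 1: 3/4}
    p_xy = {(x, y): p_x[x] * p_y[y] for x in p_x for y in p_y}

    h_x = sum(p * math.log2(1 / p) for p in p_x.values())
    h_y = sum(p * math.log2(1 / p) for p in p_y.values())
    print(joint_entropy(p_xy))  # about 1.811 bits
    print(h_x + h_y)            # the same value: H(X, Y) = H(X) + H(Y) under independence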

Conditional entropy

We will now define a measure of the uncertainty about X given that the value y of a correlated ensemble Y is known.

The conditional entropy of an ensemble X given a result y ∈ A_Y is

$H(X \mid Y = y) = \sum_{x \in A_X} p(x \mid Y = y) \log_2 \frac{1}{p(x \mid Y = y)}.$


We can also define a measure of the uncertainty about an ensemble X given that an ensemble Y is known.

The conditional entropy of an ensemble X given an ensemble Y is

\begin{align*}
H(X \mid Y) &= \sum_{y \in A_Y} p(y) H(X \mid Y = y) \\
&= \sum_{y \in A_Y} p(y) \left( \sum_{x \in A_X} p(x \mid y) \log_2 \frac{1}{p(x \mid y)} \right) \\
&= \sum_{x \in A_X} \sum_{y \in A_Y} p(x, y) \log_2 \frac{1}{p(x \mid y)}.
\end{align*}
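
The last expression is convenient computationally, since it needs only the joint distribution. Here is a minimal Python sketch (an addition; the helper name and the small joint distribution are assumptions).

    import math

    def conditional_entropy(p_xy):
        """H(X | Y), in bits, where p_xy maps (x, y) pairs to probabilities."""
        # Marginal p(y), obtained by summing the joint distribution over x.
        p_y = {}
        for (x, y), p in p_xy.items():
            p_y[y] = p_y.get(y, 0) + p
        # H(X | Y) = sum over (x, y) of p(x, y) * log2( p(y) / p(x, y) ),
        # since p(x | y) = p(x, y) / p(y).
        return sum(p * math.log2(p_y[y] / p) for (x, y), p in p_xy.items() if p > 0)

    p_xy = {('x1', 'y1'): 1/2, ('x1', 'y2'): 1/4, ('x2', 'y2'): 1/4}
    print(conditional_entropy(p_xy))  # 0.5 bits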


Theorem If X and Y are independent, then H(X | Y ) = H(X ).

Proof.
\begin{align*}
H(X \mid Y) &= \sum_{x \in A_X} \sum_{y \in A_Y} p(x, y) \log_2 \frac{1}{p(x \mid y)} && \text{(by definition)} \\
&= \sum_{x \in A_X} \sum_{y \in A_Y} p(x) p(y) \log_2 \frac{1}{p(x)} && \text{(since X and Y are independent)} \\
&= \sum_{x \in A_X} p(x) \log_2 \frac{1}{p(x)} \sum_{y \in A_Y} p(y) && \text{(reorganizing the summations)} \\
&= \sum_{x \in A_X} p(x) \log_2 \frac{1}{p(x)} \cdot 1 && \text{(since } \sum_{y \in A_Y} p(y) = 1\text{)} \\
&= H(X) && \text{(by definition)}
\end{align*}

Chain rules

Chain rules tell us how to decompose a quantity that is a function of a joint distribution into a sum of functions on each individual component in the distribution.

For instance, we are already familiar with the chain rule of probabilities:

p(x, y) = p(x)p(y | x) = p(y)p(x | y),

which intuitively means that the probability of results x and y happening together can be decomposed into the chain product of the probability of x happening and the probability of y happening given that x has happened.

We will now derive chain rules for functions of dependent ensembles.

Chain rule for information content

Theorem (Chain rule of information content.)

h(x, y) = h(x) + h(y | x) = h(y) + h(x | y).

Proof.
\begin{align*}
h(x, y) &= \log_2 \frac{1}{p(x, y)} && \text{(by def. of information content)} \\
&= \log_2 \frac{1}{p(x) p(y \mid x)} && \text{(chain rule for probabilities)} \\
&= \log_2 \frac{1}{p(x)} + \log_2 \frac{1}{p(y \mid x)} \\
&= h(x) + h(y \mid x) && \text{(by def. of information content)}
\end{align*}

The proof of h(x, y) = h(y) + h(x | y) is analogous.

Chain rule for entropy

Theorem (Chain rule for entropy.)

H(X , Y ) = H(X ) + H(Y | X ) = H(Y ) + H(X | Y )

Proof.
\begin{align*}
H(X, Y) &= \sum_{x \in A_X} \sum_{y \in A_Y} p(x, y) h(x, y) && \text{(by definition)} \\
&= \sum_{x \in A_X} \sum_{y \in A_Y} p(x, y) [h(x) + h(y \mid x)] && \text{(chain rule for } h\text{)} \\
&= \sum_{x \in A_X} \sum_{y \in A_Y} [p(x, y) h(x) + p(x, y) h(y \mid x)] && \text{(distributivity)} \\
&= \sum_{x \in A_X} \sum_{y \in A_Y} p(x, y) h(x) + \sum_{x \in A_X} \sum_{y \in A_Y} p(x, y) h(y \mid x) && \text{(splitting the sums (*))}
\end{align*}


Proof. (Continued)

Note that the first term in the sum in Equation (*) can be rewritten as

\begin{align*}
\sum_{x \in A_X} \sum_{y \in A_Y} p(x, y) h(x) &= \sum_{x \in A_X} \sum_{y \in A_Y} p(x) p(y \mid x) h(x) && \text{(chain rule of prob.)} \\
&= \sum_{x \in A_X} p(x) h(x) \sum_{y \in A_Y} p(y \mid x) && \text{(moving constants out of the summation)} \\
&= \sum_{x \in A_X} p(x) h(x) \cdot 1 && \text{(since } \sum_{y \in A_Y} p(y \mid x) = 1\text{)} \\
&= H(X) && \text{(by def. of entropy (**))},
\end{align*}

and the second term in the sum in Equation (*) can be rewritten as

$\sum_{x \in A_X} \sum_{y \in A_Y} p(x, y) h(y \mid x) = H(Y \mid X) \quad \text{(by def. of conditional entropy (***))}.$


Proof. (Continued)

Now we can substitute Equations (**) and (***) in Equation (*) to obtain

H(X, Y) = H(X) + H(Y | X).

The proof of H(X , Y ) = H(Y ) + H(X |Y ) is analogous.
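
As a small sanity check of the chain rule (not part of the original slides), the sketch below computes H(X, Y) and H(X) + H(Y | X) separately from an arbitrarily chosen joint distribution and confirms that the two values agree.

    import math

    def H(probs):
        """Entropy, in bits, of a list of probabilities."""
        return sum(p * math.log2(1 / p) for p in probs if p > 0)

    # Illustrative joint distribution p(x, y) (an assumption).
    p_xy = {('x1', 'y1'): 1/2, ('x1', 'y2'): 1/4, ('x2', 'y1'): 1/8, ('x2', 'y2'): 1/8}

    # Marginal p(x).
    p_x = {}
    for (x, y), p in p_xy.items():
        p_x[x] = p_x.get(x, 0) + p

    # H(Y | X) computed directly as sum_{x,y} p(x, y) log2( p(x) / p(x, y) ).
    h_y_given_x = sum(p * math.log2(p_x[x] / p) for (x, y), p in p_xy.items() if p > 0)

    print(H(list(p_xy.values())))               # H(X, Y) = 1.75 bits
    print(H(list(p_x.values())) + h_y_given_x)  # H(X) + H(Y | X): also 1.75 bits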


We can use what we saw so far in this lecture to prove the additivity of joint entropy in case of independence.

Corollary Entropy is additive for independent ensembles:

H(X , Y ) = H(X ) + H(Y ) iff X and Y are independent.

Proof. The result follows from the chain rule for entropy, which states that H(X , Y ) = H(X ) + H(Y | X ), and from the fact that if X and Y are independent, then H(Y | X ) = H(Y ).

Chain rule for conditional entropy

The chain rule for entropy can be easily extended to conditional entropies.

Theorem (Chain rule for conditional entropy.)

H(X , Y | Z) = H(X | Z) + H(Y | X , Z) = H(Y | Z) + H(X | Y , Z)

Proof. This is a corollary of the chain rule: just apply the same reasoning as in the proof of the basic chain rule.

General chain rule for entropy

The chain rule for entropy can be extended to multiple ensembles as follows.

Theorem The general form of the chain rule for entropy is

$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, \ldots, X_{i-1}).$

Proof. (Sketch.) Let us consider the case of three ensembles X1, X2, X3. To expand H(X1, X2, X3) we can consider X2, X3 as a single, joint ensemble Y , and apply the basic chain rule for entropy of two ensembles:

H(X1, X2, X3) = H(X1) + H(X2, X3 | X1),

where X2, X3 together play the role of the single joint ensemble Y.


Proof. (Continued)

Now we can use the chain rule for conditional entropy on H(X2, X3 | X1):

H(X2, X3 | X1) = H(X2 | X1) + H(X3 | X1, X2).

And then we substitute the above result in the previous equation, to obtain

H(X1, X2, X3) = H(X1) + H(X2 | X1) + H(X3 | X1, X2),

which is the desired result for three ensembles. The general result for n ensembles can be shown by induction.

Joint and conditional entropies - Example

Example 1 (Conditional entropies.) Consider the joint ensemble X , Y distributed as follows.

P_{X,Y}    y1      y2      y3      y4
x1         1/8     1/16    1/16    1/4
x2         1/16    1/8     1/16    0
x3         1/32    1/32    1/16    0
x4         1/32    1/32    1/16    0

The marginal distributions are

P_X = (1/2, 1/4, 1/8, 1/8)

and

P_Y = (1/4, 1/4, 1/4, 1/4).


Example 1 (Continued)

We can, then, calculate the entropies

$H(X) = -\tfrac{1}{2} \log_2 \tfrac{1}{2} - \tfrac{1}{4} \log_2 \tfrac{1}{4} - \tfrac{1}{8} \log_2 \tfrac{1}{8} - \tfrac{1}{8} \log_2 \tfrac{1}{8} = \tfrac{7}{4} \text{ bits},$

and

$H(Y) = -\tfrac{1}{4} \log_2 \tfrac{1}{4} - \tfrac{1}{4} \log_2 \tfrac{1}{4} - \tfrac{1}{4} \log_2 \tfrac{1}{4} - \tfrac{1}{4} \log_2 \tfrac{1}{4} = 2 \text{ bits},$

and

$H(X, Y) = \sum_{x \in A_X} \sum_{y \in A_Y} p(x, y) \log_2 \frac{1}{p(x, y)} = \tfrac{27}{8} \text{ bits}.$
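
These three values can be double-checked mechanically. The short Python sketch below (an addition; only the joint table is taken from the example) recomputes them.

    import math
    from fractions import Fraction as F

    # Joint distribution from Example 1: rows are x1..x4, columns are y1..y4.
    joint = [
        [F(1, 8),  F(1, 16), F(1, 16), F(1, 4)],
        [F(1, 16), F(1, 8),  F(1, 16), F(0)],
        [F(1, 32), F(1, 32), F(1, 16), F(0)],
        [F(1, 32), F(1, 32), F(1, 16), F(0)],
    ]

    def H(probs):
        return sum(float(p) * math.log2(1 / float(p)) for p in probs if p > 0)

    p_x = [sum(row) for row in joint]              # (1/2, 1/4, 1/8, 1/8)
    p_y = [sum(col) for col in zip(*joint)]        # (1/4, 1/4, 1/4, 1/4)

    print(H(p_x))                                  # 1.75  = 7/4 bits
    print(H(p_y))                                  # 2.0 bits
    print(H([p for row in joint for p in row]))    # 3.375 = 27/8 bits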


Example 1 (Continued)

To calculate the conditional entropy H(X | Y ) we first have to calculate the conditional distributions p(X = x | Y = y) for each pair of values x, y. The values of p(x | y) are shown in the following table.

P_{X|Y}    y1      y2      y3      y4
x1         1/2     1/4     1/4     1
x2         1/4     1/2     1/4     0
x3         1/8     1/8     1/4     0
x4         1/8     1/8     1/4     0

Note that each column in this table is a probability distribution on X given a fixed value Y = y.


Example 1 (Continued)

Now we can calculate

\begin{align*}
H(X \mid Y) &= \sum_{y \in A_Y} p(y) H(X \mid Y = y) \\
&= p(y_1) H(X \mid Y = y_1) + p(y_2) H(X \mid Y = y_2) + p(y_3) H(X \mid Y = y_3) + p(y_4) H(X \mid Y = y_4) \\
&= \tfrac{1}{4} H\!\left(\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{8}, \tfrac{1}{8}\right) + \tfrac{1}{4} H\!\left(\tfrac{1}{4}, \tfrac{1}{2}, \tfrac{1}{8}, \tfrac{1}{8}\right) + \tfrac{1}{4} H\!\left(\tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}\right) + \tfrac{1}{4} H(1, 0, 0, 0) \\
&= \tfrac{11}{8} \text{ bits}.
\end{align*}


Example 1 (Continued)

Similarly, we can calculate the conditional entropy H(Y | X ) by first determining the conditional distributions p(Y = y | X = x) for each pair of values x, y. The values of p(y | x) are shown in the following table.

P_{Y|X}    y1      y2      y3      y4
x1         1/4     1/8     1/8     1/2
x2         1/4     1/2     1/4     0
x3         1/4     1/4     1/2     0
x4         1/4     1/4     1/2     0

Note that each row in this table is a probability distribution on Y given a fixed value X = x.


Example 1 (Continued)

Now we can calculate

\begin{align*}
H(Y \mid X) &= \sum_{x \in A_X} p(x) H(Y \mid X = x) \\
&= p(x_1) H(Y \mid X = x_1) + p(x_2) H(Y \mid X = x_2) + p(x_3) H(Y \mid X = x_3) + p(x_4) H(Y \mid X = x_4) \\
&= \tfrac{1}{2} H\!\left(\tfrac{1}{4}, \tfrac{1}{8}, \tfrac{1}{8}, \tfrac{1}{2}\right) + \tfrac{1}{4} H\!\left(\tfrac{1}{4}, \tfrac{1}{2}, \tfrac{1}{4}, 0\right) + \tfrac{1}{8} H\!\left(\tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{2}, 0\right) + \tfrac{1}{8} H\!\left(\tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{2}, 0\right) \\
&= \tfrac{13}{8} \text{ bits}.
\end{align*}


Example 1 (Continued)

Note that this example shows the following very important property of conditional entropy. Conditional entropy is not symmetric in general:

H(X | Y) ≠ H(Y | X).


Conditional entropy - Challenges

Example 2 Show that H(Y | X ) = 0 if, and only if, Y is a function of X .

Solution. Homework!

Example 3 Let X be a random variable taking on a finite number of values. What is the most precise inequality relationship (>, ≥, <, ≤, =, or ≠) between H(X) and H(Y) if

a) Y = 2^X?
b) Y = cos X?

Solution. Homework!

Mutual information


The mutual information is a measure of how much information two ensembles share. It can also work as a measure of how much information one ensemble carries about the other.

The mutual information between X and Y is defined as

I (X ; Y ) = H(X ) − H(X | Y ).

Intuitively, the amount of information Y carries about X is by how much the knowledge of Y reduces the uncertainty about X :

I(X; Y) = H(X) − H(X | Y),

where I(X; Y) is the information Y carries about X, H(X) is the prior uncertainty about X, and H(X | Y) is the uncertainty about X once Y is known.
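
A minimal Python sketch of this definition (an addition to the slides; the helper names and the 2×2 joint distribution are assumptions):

    import math

    def H(probs):
        return sum(p * math.log2(1 / p) for p in probs if p > 0)

    def mutual_information(joint):
        """I(X; Y) = H(X) - H(X | Y), with joint[i][j] = p(x_i, y_j)."""
        p_x = [sum(row) for row in joint]
        p_y = [sum(col) for col in zip(*joint)]
        # H(X | Y) = sum_{x,y} p(x, y) log2( p(y) / p(x, y) ).
        h_x_given_y = sum(p * math.log2(p_y[j] / p)
                          for row in joint
                          for j, p in enumerate(row) if p > 0)
        return H(p_x) - h_x_given_y

    joint = [[1/4, 1/4],
             [0.0, 1/2]]
    print(mutual_information(joint))  # about 0.311 bits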

Mutual information - Example

Example 4 (Mutual information.) Consider again the joint ensemble X , Y distributed as follows.

P_{X,Y}    y1      y2      y3      y4
x1         1/8     1/16    1/16    1/4
x2         1/16    1/8     1/16    0
x3         1/32    1/32    1/16    0
x4         1/32    1/32    1/16    0

Let us calculate:

the information Y carries about X , given by I (X ; Y ), and the information X carries about Y , given by I (Y ; X ).


Example 4 (Continued)

We calculated in a previous example that H(X) = 7/4 bits and H(X | Y) = 11/8 bits.

So we can calculate that

I(X; Y) = H(X) − H(X | Y) = 7/4 − 11/8 = 3/8 bits,

and conclude that Y carries 3/8 bits of information about X.


Example 4 (Continued)

We also calculated in a previous example that H(Y) = 2 bits and H(Y | X) = 13/8 bits.

Hence

I(Y; X) = H(Y) − H(Y | X) = 2 − 13/8 = 3/8 bits,

and conclude that X carries 3/8 bits of information about Y.

In this example, we have I(X; Y) = I(Y; X). Is this a coincidence?

Mutual information - Properties

We will soon show the following important properties of mutual information:

1. Symmetry: I(X; Y) = I(Y; X).

2. Non-negativity: I(X; Y) ≥ 0.

3. Upper-bound: I(X; Y) ≤ min(H(X), H(Y)).

To prove such properties, though, we will first derive a relation between mutual information and relative entropy.

Relation between mutual information and relative entropy

Relative entropy

We have seen that the entropy H(X) is a measure of uncertainty of the ensemble X. The entropy H(X) can also be seen as a measure of the amount of information (in bits) necessary to describe an ensemble X.

In a previous lecture we saw that the Kullback-Leibler divergence D_KL(p ‖ q) is a measure of “distance” between two distributions p and q.

More precisely, we saw that D_KL(p ‖ q) is a measure of the inefficiency of representing an ensemble with real probability distribution p by using a guessed probability distribution q.

Here we will revisit the Kullback-Leibler divergence, under one of its other names: relative entropy.


The relative entropy (or Kullback-Leibler divergence) D_KL(p ‖ q) of the probability mass function p with respect to the probability mass function q, both defined over a set A_X, is defined as

$D_{KL}(p \parallel q) = \sum_{x \in A_X} p(x) \log_2 \frac{p(x)}{q(x)},$

with the convention that

$0 \cdot \log_2 \frac{0}{0} = 0, \qquad 0 \cdot \log_2 \frac{0}{q} = 0, \qquad \text{and} \qquad p \cdot \log_2 \frac{p}{0} = \infty.$

Note that if there is any value x s.t. p(x) > 0 but q(x) = 0, the KL-divergence is D_KL(p ‖ q) = ∞.
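
A small Python sketch of this definition (an addition; the function name is my own), including the conventions for zero probabilities:

    import math

    def kl_divergence(p, q):
        """D_KL(p || q), in bits; p and q are lists of probabilities over the same set."""
        total = 0.0
        for pi, qi in zip(p, q):
            if pi == 0:
                continue              # convention: 0 * log2(0 / q) = 0
            if qi == 0:
                return math.inf       # p(x) > 0 with q(x) = 0 gives infinite divergence
            total += pi * math.log2(pi / qi)
        return total

    print(kl_divergence([1/2, 1/2], [3/4, 1/4]))  # about 0.2075 bits
    print(kl_divergence([3/4, 1/4], [1/2, 1/2]))  # about 0.1887 bits: not symmetric
    print(kl_divergence([1/2, 1/2], [1/2, 1/2]))  # 0.0 for equal distributions

The first two calls reproduce the numbers used in the non-symmetry example a few slides ahead.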

Relative entropy - Properties

Theorem (Non-negativity.)

D_KL(p ‖ q) ≥ 0,

with equality only if p(x) = q(x) for all x ∈ X .

Proof. The proof uses Jensen's inequality and the fact that log2(x) is a concave function, and is left as an exercise for the student. (The solution is given in Exercise 2.26 in MacKay's book.)


Theorem (Non-symmetry.) In general, D_KL(p ‖ q) ≠ D_KL(q ‖ p).

Proof. Consider p = (1/2, 1/2) and q = (3/4, 1/4). Then

$D_{KL}(p \parallel q) = \frac{1}{2} \log_2 \frac{1/2}{3/4} + \frac{1}{2} \log_2 \frac{1/2}{1/4} = 1 - \frac{1}{2} \log_2 3 \approx 0.2075 \text{ bits},$

and

$D_{KL}(q \parallel p) = \frac{3}{4} \log_2 \frac{3/4}{1/2} + \frac{1}{4} \log_2 \frac{1/4}{1/2} = \frac{3}{4} \log_2 3 - 1 \approx 0.1887 \text{ bits}.$

Relation between mutual information and relative entropy

The following result shows that the mutual information of two ensembles X and Y is the error you would commit by approximating their joint distribution p(X , Y ) by the product p(X )p(Y ) of their marginal distributions.

Theorem (Relation between mutual information and relative entropy.)

$I(X; Y) = D_{KL}(p(x, y) \parallel p(x) p(y)) = \sum_{x \in A_X} \sum_{y \in A_Y} p(x, y) \log_2 \frac{p(x, y)}{p(x) p(y)}.$

(Note that if X and Y are independent, you commit no error by approximating their joint by the product of their marginals, and their mutual information is 0.)


Proof.
\begin{align*}
D_{KL}(p(x, y) \parallel p(x) p(y)) &= \sum_{x \in A_X} \sum_{y \in A_Y} p(x, y) \log_2 \frac{p(x, y)}{p(x) p(y)} \\
&= \sum_{x \in A_X} \sum_{y \in A_Y} p(x, y) \left[ \log_2 p(x, y) - \log_2 p(x) - \log_2 p(y) \right] \\
&= \sum_{x \in A_X} \sum_{y \in A_Y} p(x, y) \log_2 p(x, y) - \sum_{x \in A_X} \sum_{y \in A_Y} p(x, y) \log_2 p(x) - \sum_{x \in A_X} \sum_{y \in A_Y} p(x, y) \log_2 p(y) \quad (*)
\end{align*}

Note that the first term in the sum in Equation (*) can be rewritten as

$\sum_{x \in A_X} \sum_{y \in A_Y} p(x, y) \log_2 p(x, y) = -H(X, Y). \quad (**)$


Proof. (Continued)

The second term in the sum in Equation (*) can be rewritten as

\begin{align*}
\sum_{x \in A_X} \sum_{y \in A_Y} p(x, y) \log_2 p(x) &= \sum_{x \in A_X} \log_2 p(x) \sum_{y \in A_Y} p(x, y) && \text{(moving constants out of the summation)} \\
&= \sum_{x \in A_X} (\log_2 p(x)) \, p(x) && \text{(since } \sum_{y \in A_Y} p(x, y) = p(x)\text{)} \\
&= -H(X), && (*{*}*)
\end{align*}

and the third term in the sum in Equation (*) can be rewritten as

\begin{align*}
\sum_{x \in A_X} \sum_{y \in A_Y} p(x, y) \log_2 p(y) &= \sum_{y \in A_Y} \log_2 p(y) \sum_{x \in A_X} p(x, y) && \text{(moving constants out of the summation)} \\
&= \sum_{y \in A_Y} (\log_2 p(y)) \, p(y) && \text{(since } \sum_{x \in A_X} p(x, y) = p(y)\text{)} \\
&= -H(Y). && (*{*}{*}*)
\end{align*}


Proof. (Continued)

And by replacing Equations (**), (***), and (****) in Equation (*) we get

$D_{KL}(p(x, y) \parallel p(x) p(y)) = H(X) + H(Y) - H(X, Y). \quad (*{*}{*}{*}*)$

To finish the proof we show that H(X) + H(Y) − H(X, Y) = I(X; Y):

\begin{align*}
H(X) + H(Y) - H(X, Y) &= H(X) + H(Y) - (H(Y) + H(X \mid Y)) && \text{(by the chain rule)} \\
&= H(X) - H(X \mid Y) \\
&= I(X; Y) && \text{(by definition)}.
\end{align*}

And we can conclude that D_KL(p(x, y) ‖ p(x)p(y)) = I(X; Y).
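
The identity can also be verified numerically on the joint ensemble of Examples 1 and 4; the sketch below (an addition) computes the right-hand side directly and recovers the 3/8 bits obtained earlier for I(X; Y).

    import math

    joint = [
        [1/8,  1/16, 1/16, 1/4],
        [1/16, 1/8,  1/16, 0],
        [1/32, 1/32, 1/16, 0],
        [1/32, 1/32, 1/16, 0],
    ]
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]

    # D_KL( p(x, y) || p(x) p(y) ) = sum_{x,y} p(x, y) log2( p(x, y) / (p(x) p(y)) ).
    mi = sum(p * math.log2(p / (p_x[i] * p_y[j]))
             for i, row in enumerate(joint)
             for j, p in enumerate(row) if p > 0)
    print(mi)  # 0.375 = 3/8 bits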

Mutual information and Conditional entropy - Properties

The relation between mutual information and relative entropy can be used to prove properties of mutual information and of conditional entropy.

Theorem (Symmetry of mutual information.)

I (X ; Y ) = I (Y ; X )

Proof.
\begin{align*}
I(X; Y) &= D_{KL}(p(x, y) \parallel p(x) p(y)) \\
&= D_{KL}(p(y, x) \parallel p(y) p(x)) \\
&= I(Y; X)
\end{align*}


Theorem (Non-negativity of mutual information.)

I (X ; Y ) ≥ 0,

with equality if, and only if, X and Y are independent.

Proof.

I(X; Y) = D_KL(p(x, y) ‖ p(x)p(y)) ≥ 0 (by non-negativity of relative entropy).

Moreover, D_KL(p(x, y) ‖ p(x)p(y)) = 0 iff p(x, y) = p(x)p(y), that is, iff X and Y are independent.


Theorem “Information can’t hurt.” For any two ensembles X and Y , we have

H(X | Y ) ≤ H(X ),

with equality if, and only if, X and Y are independent.

Proof. Because I (X ; Y ) ≥ 0 we have that H(X ) − H(X | Y ) ≥ 0, and, hence, H(X ) ≥ H(X | Y ). Moreover, by the previous result we know that I (X ; Y ) = 0 iff X and Y are independent, hence H(X | Y ) = H(X ) iff X and Y are independent.

This property states that conditioning reduces entropy, or, in other words, that if you have some uncertainty about X , knowing Y can never increase your uncertainty about X .


Theorem (Upper-bound on mutual information.)

I (X ; Y ) ≤ min (H(X ), H(Y )).

Proof. Since I (X ; Y ) = H(X ) − H(X | Y ) and H(X ) ≥ H(X | Y ), we must have I (X ; Y ) ≤ H(X ). Similarly, since I (X ; Y ) = H(Y ) − H(Y | X ) and H(Y ) ≥ H(Y | X ), we must have I (X ; Y ) ≤ H(Y ). Putting together the facts that I (X ; Y ) ≤ H(X ) and that I (X ; Y ) ≤ H(Y ), we get that I (X ; Y ) ≤ min (H(X ), H(Y )).

Conditional mutual information

The conditional mutual information represents the amount of information X conveys about Y in a setting where Z is already part of the background knowledge. Formally, the conditional mutual information between X and Y given Z is given by

I(X; Y | Z) = H(X | Z) − H(X | Y, Z),

where I(X; Y | Z) is the information X and Y share given that Z is known, H(X | Z) is the uncertainty about X given that Z is known, and H(X | Y, Z) is the uncertainty about X given that Z and Y are known.

Chain rule for mutual information

The chain rule for mutual information tells us how to compose the information gained by a series of ensembles. Formally, the chain rule for mutual information is

I(X1, X2; Y) = I(X1; Y) + I(X2; Y | X1),

where I(X1, X2; Y) is the information Y shares with X1 and X2, I(X1; Y) is the information Y shares with X1, and I(X2; Y | X1) is the information Y shares with X2 given that X1 is known.

The general form of the chain rule for mutual information is

$I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y \mid X_1, \ldots, X_{i-1}).$

Mutual information and Conditional entropy - Properties

A set of random variables X1, X2, ..., Xn form a Markov Chain, denoted by

X1 → X2 → ... → Xn,

when every Xi depends only on Xi−1. Formally, for all 1 ≤ i ≤ n:

$p(x_i \mid x_1, x_2, \ldots, x_{i-1}) = p(x_i \mid x_{i-1}).$

It can be easily proven that when X1 → X2 → ... → Xn form a Markov chain, then:

$p(x_1, x_2, \ldots, x_n) = p(x_1) \prod_{i=2}^{n} p(x_i \mid x_{i-1}) = p(x_1) \cdot p(x_2 \mid x_1) \cdot \ldots \cdot p(x_i \mid x_{i-1}) \cdot \ldots \cdot p(x_n \mid x_{n-1}).$


Theorem (Data-processing inequality.) If X → Y → Z forms a Markov chain, then I (X ; Y ) ≥ I (X ; Z).

Proof. By the chain rule for mutual information we know that

I(X; Y, Z) = I(X; Y) + I(X; Z | Y) = I(X; Z) + I(X; Y | Z). (*)

But note that X and Z are independent given Y, since for all x, y, z

\begin{align*}
p(x, z \mid y) &= p(x \mid y) \cdot p(z \mid x, y) && \text{(chain rule for cond. prob.)} \\
&= p(x \mid y) \cdot p(z \mid y) && \text{(X → Y → Z is a Markov chain)}
\end{align*}

and, hence, we have that I(X; Z | Y) = 0. We can, then, substitute that in (*) to get I(X; Y) = I(X; Z) + I(X; Y | Z), and since mutual information is always non-negative, I(X; Y) ≥ I(X; Z).
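
The inequality can also be observed numerically. The sketch below (an illustration under assumed channel parameters, not part of the slides) builds a Markov chain X → Y → Z from two noisy binary channels and checks that I(X; Z) does not exceed I(X; Y).

    import math

    def mutual_information(joint):
        """I(A; B) computed as D_KL( p(a, b) || p(a) p(b) ), with joint[i][j] = p(a_i, b_j)."""
        p_a = [sum(row) for row in joint]
        p_b = [sum(col) for col in zip(*joint)]
        return sum(p * math.log2(p / (p_a[i] * p_b[j]))
                   for i, row in enumerate(joint)
                   for j, p in enumerate(row) if p > 0)

    # Markov chain X -> Y -> Z built from two illustrative channels (assumed parameters).
    p_x = [0.5, 0.5]
    p_y_given_x = [[0.9, 0.1],   # first stage: flips the bit with probability 0.1
                   [0.1, 0.9]]
    p_z_given_y = [[0.8, 0.2],   # second stage: flips the bit with probability 0.2
                   [0.2, 0.8]]

    p_xy = [[p_x[i] * p_y_given_x[i][j] for j in range(2)] for i in range(2)]
    p_xz = [[sum(p_xy[i][j] * p_z_given_y[j][k] for j in range(2)) for k in range(2)]
            for i in range(2)]

    print(mutual_information(p_xy))  # I(X; Y), about 0.531 bits
    print(mutual_information(p_xz))  # I(X; Z), about 0.173 bits <= I(X; Y)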


The data-processing inequality can be used to show that no clever manipulation of the data can improve the inferences that can be made from the data. In other words, the data-processing inequality states that:

“Information can be destroyed during computation, but never created!”

Visual representation of relationship between entropies


Some people represent the relationship between entropy, conditional entropy, joint entropy and mutual information as a Venn diagram:

This diagram represents some essential relations among entropies:

1 I (X ; Y ) = H(X ) − H(X | Y ),

2 I (X ; Y ) = H(Y ) − H(Y | X ),

3 I (X ; Y ) = H(X ) + H(Y ) − H(X , Y ),

4 ...


Even if the Venn diagram works well for two ensembles X and Y, it can be misleading if we have three ensembles X, Y and Z.

A preferable visual representation uses bars:

1 I (X ; Y ) = H(X ) − H(X | Y ),

2 I (X ; Y ) = H(Y ) − H(Y | X ),

3 I (X ; Y ) = H(X ) + H(Y ) − H(X , Y ),

4 ...

Challenges

Example 5 Let X , Y , and Z form a joint ensemble. First explain in English what the following inequalities intuitively mean, and then prove them.

a) H(X, Y | Z) ≥ H(X | Z).

b) I(X, Y; Z) ≥ I(X; Z).

c) H(X, Y, Z) − H(X, Y) ≤ H(X, Z) − H(X).

Solution. Homework!

Summary

Entropy

Definition of entropy: The entropy H(X) of a discrete ensemble X is defined by

$H(X) = -\sum_{x \in A_X} p(x) \log_2 p(x).$

Properties of entropy:

1. Bounds for entropy: 0 ≤ H(X) ≤ log2 |A_X|.

2. Changing the base of the entropy: H_b(X) = (log_b a) · H_a(X).

3. Conditioning reduces entropy: For any two random variables X and Y, we have H(X | Y) ≤ H(X), with equality if, and only if, X and Y are independent.

4. Concavity: H(p) is concave in p.

Mutual information

Definition of mutual information: The mutual information between two ensembles X and Y is defined as

I (X ; Y ) = H(X ) − H(X | Y ).

Conditional mutual information: The mutual information between two ensembles X and Y given that the ensemble Z is known is given by

I (X ; Y | Z) = H(X | Z) − H(X | Y , Z).

Properties of mutual information:

1. Symmetry: I(X; Y) = H(X) − H(X | Y) = H(Y) − H(Y | X) = I(Y; X).

2. Non-negativity: I(X; Y) ≥ 0, with equality if, and only if, X and Y are independent.

3. Upper-bound: I(X; Y) ≤ min(H(X), H(Y)).

Relative entropy

Definition of relative entropy (a.k.a. Kullback-Leibler divergence): The relative entropy D_KL(p ‖ q) of the probability mass function p with respect to the probability mass function q is defined by

$D_{KL}(p \parallel q) = \sum_{x} p(x) \log_2 \frac{p(x)}{q(x)}.$

Properties of relative entropy:

1. Non-negativity: D_KL(p ‖ q) ≥ 0, with equality only if p(x) = q(x) for all x.

2. Non-symmetry: in general, D_KL(p ‖ q) ≠ D_KL(q ‖ p).

Relation between mutual information and relative entropy.

$I(X; Y) = D_{KL}(p(x, y) \parallel p(x) p(y)) = \sum_{x \in A_X} \sum_{y \in A_Y} p(x, y) \log_2 \frac{p(x, y)}{p(x) p(y)}.$

Chain rules

Chain rule for entropy:

$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, \ldots, X_{i-1}).$

Chain rule for mutual information:

$I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y \mid X_1, \ldots, X_{i-1}).$

Data-processing inequality

Data-processing inequality: If X → Y → Z forms a Markov chain, then

I (X ; Y ) ≥ I (X ; Z).

Alternative expressions

Alternative expressions in terms of expectations:

1. Entropy: $H(X) = E_p\left[\log_2 \frac{1}{p(X)}\right]$.

2. Joint entropy: $H(X, Y) = E_p\left[\log_2 \frac{1}{p(X, Y)}\right]$.

3. Conditional entropy: $H(X \mid Y) = E_p\left[\log_2 \frac{1}{p(X \mid Y)}\right]$.

4. Mutual information: $I(X; Y) = E_p\left[\log_2 \frac{p(X, Y)}{p(X) p(Y)}\right]$.

5. Relative entropy: $D_{KL}(p \parallel q) = E_p\left[\log_2 \frac{p(X)}{q(X)}\right]$.
