STAT 535 Lecture 2: Independence and conditional independence
© Marina Meilă   [email protected]

1 Conditional probability, total probability, Bayes' rule

Definition of the conditional distribution of A given B:
$$P_{A|B}(a|b) \;=\; \frac{P_{AB}(a,b)}{P_B(b)} \quad \text{whenever } P_B(b) \neq 0$$
(and $P_{A|B}(a|b)$ is undefined otherwise).

Total probability formula. When A, B are events and $\bar{B}$ is the complement of B,
$$P(A) \;=\; P(A,B) + P(A,\bar{B}) \;=\; P(A|B)P(B) + P(A|\bar{B})P(\bar{B}).$$
When A, B are random variables, the above gives the marginal distribution of A:
$$P_A(a) \;=\; \sum_{b \in \Omega(B)} P_{AB}(a,b) \;=\; \sum_{b \in \Omega(B)} P_{A|B}(a|b)\,P_B(b).$$

Bayes' rule.
$$P_{A|B} \;=\; \frac{P_A\,P_{B|A}}{P_B}$$

The chain rule. Any multivariate distribution over n variables $X_1, X_2, \ldots, X_n$ can be decomposed as
$$P_{X_1,X_2,\ldots,X_n} \;=\; P_{X_1}\,P_{X_2|X_1}\,P_{X_3|X_1X_2}\,P_{X_4|X_1X_2X_3}\cdots P_{X_n|X_1\ldots X_{n-1}}.$$

2 Probabilistic independence

$$A \perp B \;\Longleftrightarrow\; P_{AB} = P_A \cdot P_B$$
We read $A \perp B$ as "A independent of B". An equivalent definition of independence is
$$A \perp B \;\Longleftrightarrow\; P_{A|B} = P_A.$$
The above notations are shorthand for: for all $a \in \Omega(A)$, $b \in \Omega(B)$,
$$P_{A|B}(a|b) = P_A(a) \quad \text{whenever } P_B(b) \neq 0$$
$$\frac{P_{AB}(a,b)}{P_B(b)} = P_A(a) \quad \text{whenever } P_B(b) \neq 0$$
$$P_{AB}(a,b) = P_A(a)\,P_B(b).$$

Intuitively, probabilistic independence means that knowing B does not bring any additional information about A (i.e., it doesn't change what we already believe about A). Indeed, the mutual information (an information-theoretic quantity that measures how much information random variable A has about random variable B) of two independent variables is zero.

Exercise 1 [The exercises in this section are optional and will not influence your grade. Do them only if you have never done them before.] Prove that the two definitions of independence are equivalent.

Exercise 2 A, B are two real-valued variables. Assume that A, B are jointly Gaussian. Give a necessary and sufficient condition for A ⊥ B in this case.

Conditional independence

A richer and more useful concept is the conditional independence between sets of random variables.
$$A \perp B \mid C \;\Longleftrightarrow\; P_{AB|C} = P_{A|C} \cdot P_{B|C} \;\Longleftrightarrow\; P_{A|BC} = P_{A|C}$$
Once C is known, knowing B brings no additional information about A. Here are a few more equivalent ways to express conditional independence. The expressions below must hold for every a, b, c for which the respective conditional probabilities are defined.
$$P_{AB|C}(a,b|c) \;=\; P_{A|C}(a|c)\,P_{B|C}(b|c)$$
$$\frac{P_{ABC}(a,b,c)}{P_C(c)} \;=\; \frac{P_{AC}(a,c)\,P_{BC}(b,c)}{P_C(c)^2}$$
$$P_{ABC}(a,b,c) \;=\; \frac{P_{AC}(a,c)\,P_{BC}(b,c)}{P_C(c)}$$
$$\frac{P_{ABC}(a,b,c)}{P_{BC}(b,c)} \;=\; \frac{P_{AC}(a,c)}{P_C(c)}$$
$$P_{A|BC}(a|b,c) \;=\; P_{A|C}(a|c)$$
$$P_{ABC}(a,b,c) \;=\; P_C(c)\,P_{A|C}(a|c)\,P_{B|C}(b|c)$$

Since independence between two variables implies that the joint probability distribution factors, it implies that far fewer parameters are necessary to represent the joint distribution ($r_A + r_B$ rather than $r_A r_B$, where $r_A$, $r_B$ are the numbers of values A and B can take), as well as other important simplifications such as $A \perp B \Rightarrow E[AB] = E[A]\,E[B]$. [Exercise 3 Prove this relationship.]

These definitions extend to sets of variables:
$$AB \perp CD \;\equiv\; P_{ABCD} = P_{AB}\,P_{CD}$$
and
$$AB \perp CD \mid EF \;\equiv\; P_{ABCD|EF} = P_{AB|EF}\,P_{CD|EF}.$$
Furthermore, independence of larger sets of variables implies independence of subsets (for fixed conditions):
$$AB \perp CD \mid EF \;\Rightarrow\; A \perp CD \mid EF,\; A \perp C \mid EF,\; \ldots$$
[Exercise 4 Prove this.]

Notice that A ⊥ B does not imply A ⊥ B | C, nor the other way around. [Exercise 5 Find examples of each case.]
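The defining identities above are straightforward to verify numerically on a finite joint probability table. Below is a minimal sketch (not part of the original notes) assuming a numpy array P[a, b, c] holding $P_{ABC}$; it tests the equivalent product criterion $P_{ABC}(a,b,c)\,P_C(c) = P_{AC}(a,c)\,P_{BC}(b,c)$, and the example table and the helper name `is_cond_indep` are illustrative choices only.

```python
import numpy as np

def is_cond_indep(P, tol=1e-12):
    """Check A ⊥ B | C for a joint table P[a, b, c] whose entries sum to 1.

    Uses the equivalent form P_{ABC}(a,b,c) * P_C(c) = P_{AC}(a,c) * P_{BC}(b,c),
    which also covers values c with P_C(c) = 0.
    """
    P_C = P.sum(axis=(0, 1))                      # P_C(c)
    P_AC = P.sum(axis=1)                          # P_{AC}(a, c)
    P_BC = P.sum(axis=0)                          # P_{BC}(b, c)
    lhs = P * P_C[None, None, :]                  # P_{ABC}(a,b,c) * P_C(c)
    rhs = P_AC[:, None, :] * P_BC[None, :, :]     # P_{AC}(a,c) * P_{BC}(b,c)
    return np.allclose(lhs, rhs, atol=tol)

rng = np.random.default_rng(0)

# A joint built as P_C(c) * P_{A|C}(a|c) * P_{B|C}(b|c), so A ⊥ B | C by construction.
P_C = rng.dirichlet(np.ones(2))                     # distribution of C (2 values)
P_A_given_C = rng.dirichlet(np.ones(3), size=2).T   # P_{A|C}[a, c], columns sum to 1
P_B_given_C = rng.dirichlet(np.ones(2), size=2).T   # P_{B|C}[b, c], columns sum to 1
P = P_A_given_C[:, None, :] * P_B_given_C[None, :, :] * P_C[None, None, :]

print(is_cond_indep(P))   # True: conditionally independent by construction

# A generic joint over the same outcome space is (almost surely) not cond. independent.
Q = rng.dirichlet(np.ones(12)).reshape(3, 2, 2)
print(is_cond_indep(Q))   # False
```

Working with products rather than ratios sidesteps the "whenever $P_B(b) \neq 0$" caveats in the definitions above.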
Lemma 1 (Factorization Lemma) A ⊥ C | B iff there exist functions $\phi_{AB}$, $\phi_{BC}$ such that $P_{ABC} = \phi_{AB}\,\phi_{BC}$.

Proof. The ⇒ direction is obvious, so we now prove the converse implication: if $P_{ABC}(a,b,c) = \phi_{AB}(a,b)\,\phi_{BC}(b,c)$ for all $(a,b,c) \in \Omega_{ABC}$, then A ⊥ C | B.
$$P_{AB}(a,b) \;=\; \sum_c \phi_{AB}(a,b)\,\phi_{BC}(b,c) \;=\; \phi_{AB}(a,b)\,\psi_B(b) \qquad (1)$$
$$P_{BC}(b,c) \;=\; \sum_a \phi_{AB}(a,b)\,\phi_{BC}(b,c) \;=\; \phi_{BC}(b,c)\,\psi'_B(b) \qquad (2)$$
$$P_B(b) \;=\; \psi_B(b)\,\psi'_B(b) \qquad (3)$$
$$\frac{P_{AB}\,P_{BC}}{P_B} \;=\; \frac{\phi_{AB}(a,b)\,\psi_B(b)\,\phi_{BC}(b,c)\,\psi'_B(b)}{\psi_B(b)\,\psi'_B(b)} \;=\; P_{ABC} \qquad (4)$$
Here $\psi_B(b) = \sum_c \phi_{BC}(b,c)$ and $\psi'_B(b) = \sum_a \phi_{AB}(a,b)$; equation (4), which holds whenever $P_B(b) > 0$, is one of the equivalent forms of A ⊥ C | B. □

Two different factorizations of the joint probability P(a, b, c) when A ⊥ B | C are a good basis for understanding the two primary types of graphical models we will study: undirected graphs, also known as Markov random fields (MRFs), and directed graphs, also known as Bayes' nets (BNs) and belief networks.

One factorization,
$$P_{ABC}(a,b,c) \;=\; \frac{P_{AC}(a,c)\,P_{BC}(b,c)}{P_C(c)} \;=\; \phi_{AC}(a,c)\,\phi_{BC}(b,c),$$
is into a product of potential functions defined over subsets of variables, with a term in the denominator that compensates for the "double-counting" of variables in the intersection. From this factorization an undirected graphical representation can be constructed by adding an edge between any two variables that co-occur in a potential.

The other factorization,
$$P_{ABC}(a,b,c) \;=\; P_C(c)\,P_{A|C}(a|c)\,P_{B|C}(b|c),$$
is into a product of conditional probability distributions that imposes a partial order on the variables (e.g., C comes before A, B). From this factorization a directed graphical model can be constructed by adding directed edges that match the "causality" implied by the conditional distributions. In particular, if the factorization involves $P_{X|YZ}$, then directed edges are added from Y and Z to X.

3 The (semi-)graphoid axioms

[Exercise: how many possible conditional independence relations are there between n variables?]

Some independence relations imply others. [Exercise: A ⊥ B, B ⊥ C ⇒ A ⊥ C? Prove or give a counterexample.]

For any distribution $P_V$ over a set of variables V, the following properties of independence hold. Let X, Y, Z, W be disjoint subsets of discrete variables from V. (A numerical sanity check of [D] and [WU] is sketched after this list.)

[S] X ⊥ Y | Z ⇒ Y ⊥ X | Z (Symmetry)
"If Y is irrelevant to X, then X is irrelevant to Y."

[D] X ⊥ YW | Z ⇒ X ⊥ Y | Z (Decomposition)
"If some information is irrelevant to X, then any part of it is also irrelevant to X."

[WU] X ⊥ YW | Z ⇒ X ⊥ Y | WZ (Weak union)
"If Y and W together are irrelevant to X, then learning one of them does not make the other one relevant."

[C] X ⊥ Y | Z and X ⊥ W | YZ ⇒ X ⊥ YW | Z (Contraction)
"If Y is irrelevant to X, and if learning Y does not make W relevant to X, then W was irrelevant to start with."

[I] If P is strictly positive for all instantiations of the variables,
X ⊥ Y | WZ and X ⊥ W | YZ ⇒ X ⊥ YW | Z (Intersection)

Properties S, D, WU and C are called the semi-graphoid axioms. The semi-graphoid axioms together with property [I] are called the graphoid axioms. Note that C is the converse of (WU and D).
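As a sanity check of the decomposition [D] and weak union [WU] properties, here is a small numerical sketch (not part of the original notes; the variable ordering, the table sizes, and the helper name `cond_indep` are illustrative assumptions). It builds a joint over (X, Y, W, Z) of the form $P_Z\,P_{X|Z}\,P_{YW|Z}$, so X ⊥ YW | Z holds by construction, and then verifies the relations implied by [D] and [WU].

```python
import numpy as np

def cond_indep(P, left, right, given, tol=1e-12):
    """Check left ⊥ right | given for a joint table P.

    left, right, given are disjoint tuples of axis indices of P. The test
    marginalizes out all other axes and uses the product criterion
    P_{LRG} * P_G = P_{LG} * P_{RG} elementwise.
    """
    keep = tuple(sorted(left + right + given))
    other = tuple(i for i in range(P.ndim) if i not in keep)
    Q = P.sum(axis=other) if other else P          # joint over left ∪ right ∪ given
    pos = {ax: i for i, ax in enumerate(keep)}     # axis positions inside Q
    L = tuple(pos[a] for a in left)
    R = tuple(pos[a] for a in right)
    G_marg = Q.sum(axis=L + R, keepdims=True)      # P_G
    LG_marg = Q.sum(axis=R, keepdims=True)         # P_{LG}
    RG_marg = Q.sum(axis=L, keepdims=True)         # P_{RG}
    return np.allclose(Q * G_marg, LG_marg * RG_marg, atol=tol)

rng = np.random.default_rng(1)

# Joint over (X, Y, W, Z) built as P_Z * P_{X|Z} * P_{YW|Z}, hence X ⊥ YW | Z.
P_Z = rng.dirichlet(np.ones(2))
P_X_given_Z = rng.dirichlet(np.ones(2), size=2).T                     # [x, z]
P_YW_given_Z = rng.dirichlet(np.ones(6), size=2).T.reshape(3, 2, 2)   # [y, w, z]
P = P_X_given_Z[:, None, None, :] * P_YW_given_Z[None, :, :, :] * P_Z

print(cond_indep(P, left=(0,), right=(1, 2), given=(3,)))   # X ⊥ YW | Z : True
print(cond_indep(P, left=(0,), right=(1,), given=(3,)))     # [D]  X ⊥ Y | Z : True
print(cond_indep(P, left=(0,), right=(1,), given=(2, 3)))   # [WU] X ⊥ Y | WZ : True
```

Instances of contraction [C] and, for strictly positive P, intersection [I] can be checked the same way by combining calls to `cond_indep`.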