Conditioning and Markov Properties
Total Page:16
File Type:pdf, Size:1020Kb
Conditioning and Markov properties Anders Rønn-Nielsen Ernst Hansen Department of Mathematical Sciences University of Copenhagen Department of Mathematical Sciences University of Copenhagen Universitetsparken 5 DK-2100 Copenhagen Copyright 2014 Anders Rønn-Nielsen & Ernst Hansen ISBN 978-87-7078-980-6 Contents Preface v 1 Conditional distributions 1 1.1 Markov kernels . 1 1.2 Integration of Markov kernels . 3 1.3 Properties for the integration measure . 6 1.4 Conditional distributions . 10 1.5 Existence of conditional distributions . 16 1.6 Exercises . 23 2 Conditional distributions: Transformations and moments 27 2.1 Transformations of conditional distributions . 27 2.2 Conditional moments . 35 2.3 Exercises . 41 3 Conditional independence 51 3.1 Conditional probabilities given a σ{algebra . 52 3.2 Conditionally independent events . 53 3.3 Conditionally independent σ-algebras . 55 3.4 Shifting information around . 59 3.5 Conditionally independent random variables . 61 3.6 Exercises . 68 4 Markov chains 71 4.1 The fundamental Markov property . 71 4.2 The strong Markov property . 84 4.3 Homogeneity . 90 4.4 An integration formula for a homogeneous Markov chain . 99 4.5 The Chapmann-Kolmogorov equations . 100 iv CONTENTS 4.6 Stationary distributions . 103 4.7 Exercises . 104 5 Ergodic theory for Markov chains on general state spaces 111 5.1 Convergence of transition probabilities . 113 5.2 Transition probabilities with densities . 115 5.3 Asymptotic stability . 117 5.4 Minorisation . 122 5.5 The drift criterion . 127 5.6 Exercises . 131 6 An introduction to Bayesian networks 141 6.1 Introduction . 141 6.2 Directed graphs . 143 6.3 Moral graphs and separation . 145 6.4 Bayesian networks . 146 6.5 Global Markov property . 151 A Supplementary material 155 A.1 Measurable spaces . 155 A.2 Random variables and conditional expectations . 157 B Hints for exercises 161 B.1 Hints for chapter 1 . 161 B.2 Hints for chapter 2 . 162 B.3 Hints for chapter 3 . 163 B.4 Hints for chapter 4 . 164 B.5 Hints for chapter 5 . 167 Preface The present lecture notes are intended for the course "Beting". The purpose is to give a detailed probabilistic introduction to conditional distributions and how this concept is used to define and describe Markov chains and Bayesian networks. The chapters 1{4 are mainly based on different sets of lecture notes written by Ernst Hansen but are adapted to suit students with the knowledge obtained in the courses MI, VidSand1, and VidSand2. Parts of these chapters also contain material inspired by lecture notes written by Martin Jacobsen and Søren Tolver Jensen. Chapter 5 is an adapted version of a set of lecture notes written by Søren Tolver Jensen and Martin Jacobsen. These notes themselves were inspired by earlier notes by Søren Feodor Nielsen. Chapter 6 contains some standard results on Bayesian networks - though both formulations and proofs are formulated in the rather general framework from chapter 3 and 4. There are exercises in the end of each chapter and hints to selected exercises in the end of the notes. Some exercises are adapted versions of exercises and exam exercises from previous lecture notes. Others are inspired by examples and results collected from a large number of monographs on Markov chains, stochastic simulation, and probability theory in general. I am grateful to both students and the teaching assistants from the last two years, Ketil Bier- ing Tvermosegaard and Daniele Cappelletti, who have contributed to the notes by identifying mistakes and suggesting improvements. Anders Rønn-Nielsen København, 2014 vi CONTENTS Chapter 1 Conditional distributions Let (X ; E) and (Y; K) be two measurable spaces. In this chapter we shall discuss the relation between measures on the product space (X × Y; E ⊗ K) and measures on the two marginal spaces (X ; E) and (Y; K). More precisely we will see that measures on the product space can be constructed from measures on the two marginal spaces. A particularly simple example is a product measure µ ⊗ ν where µ and ν are measures on (X ; E) and (Y; K) respectively. The two factors X and Y will not enter the setup symmetrically in the more general con- struction. 1.1 Markov kernels Definition 1.1.1. A (X ; E){Markov Kernel on (Y; K) is a family of probability measures (Px)x2X on (Y; K) indexed by points in X such that the map x 7! Px(B) is E − B{measurable for every fixed B 2 K. Theorem 1.1.2. Let (X ; E) and (Y; K) be measurable spaces, let ν be a σ–finite measure on (Y; K), and let f 2 M+(X × Y; E ⊗ K) have the property that Z f(x; y) dν(y) = 1 for all x 2 X : 2 Conditional distributions Then (Px)x2X given by Z Px(B) = f(x; y) dν(y) for all B 2 K; x 2 X B is a (X ; E){Markov Kernel on (Y; K). Proof. For each fixed set B 2 K we need to argue that Z x 7! 1X ×B(x; y)f(x; y) dν(y) is an E{measurable function. That is a direct result of EH Theorem 8.7 applied to the function 1X ×Bf. Lemma 1.1.3. If (Y; K) has an intersection{stable generating system D, then (Px)x2X is a (X ; E){Markov Kernel on (Y; K) if only x 7! Px(D) is E − B-measurable for all fixed D 2 D. Proof. Define H = fF 2 K : x 7! Px(F ) is E − B{measurableg It is easily checked that H is a Dynkin Class. Since D ⊆ H, we have H = K. Lemma 1.1.4. Let (Px)x2X be a (X ; E){Markov kernel on (Y; K). For each G 2 E ⊗ K the map x x 7! Px(G ) is E − B{measurable. Proof. Let x H = fG 2 E ⊗ K : x 7! Px(G ) is E − B{measurableg and consider a product set A × B 2 E ⊗ K. Then ( ; if x2 = A (A × B)x = B if x 2 A such that ( x 0 if x2 = A Px((A × B) ) = = 1A(x) · Px(B) Px(B) if x 2 A 1.2 Integration of Markov kernels 3 This is a product of two E−B{measurable functions. Hence it is E−B{measurable. Thereby we conclude that H contains all product sets, and since this is a intersection stable generating system for E ⊗ K, we have H = E ⊗ K, if we can show that H is a Dynkin class: We already have that X × Y 2 H { it is a product set! Assume that G1 ⊆ G2 are two sets x x in H. Then obviously also G1 ⊆ G2 for all x 2 X , and x x x (G2 n G1) = G2 n G1 : Then x x x Px((G2 n G1) ) = Px(G2 ) − Px(G1 ) which is a difference between two measurable functions. Hence G2 n G1 2 H. Finally, assume that G1 ⊆ G2 ⊆ · · · is an increasing sequence of H{sets. Similarly to above x x we have G1 ⊆ G2 ⊆ · · · and 1 !x 1 [ [ x Gn = Gn : n=1 n=1 Then 1 !x! 1 ! [ [ x x Px Gn = Px Gn = lim Px(Gn) n!1 n=1 n=1 x This limit is E − B{measurable, since each of the functions x 7! Px(Gn) are measurable. S1 Then n=1 Gn 2 H. 1.2 Integration of Markov kernels Theorem 1.2.1. Let µ be a probability measure on (X ; E) and let (Px)x2X be a (X ; E){ Markov kernel on (Y; K). There exists a uniquely determined probability measure λ on (X × Y; E ⊗ K) satisfying Z λ(A × B) = Px(B) dµ(x) A for all A 2 E and B 2 K. The probability measure λ constructed in Theorem 1.2.1 is called the integration of (Px)x2X with respect to µ. The interpretation is that λ describes an experiment on X × Y that is performed in two steps: The first step is drawing x 2 X . The second step is drawing y 2 Y according to a probability measure that is determined by x. 4 Conditional distributions Proof. The uniqueness follows, since λ is determined on all product sets and these form an intersection stable generating system for E ⊗ K. In order to prove the existence, we define Z x λ(G) = Px(G ) dµ(x) For each G 2 E ⊗ K the integrand is measurable according to Lemma 1.1.4. It is furthermore non{negative, such that λ(G) is well–defined with values in [0; 1]. Now let G1;G2;::: be a sequence of disjoint sets in E ⊗ K. Then for each x 2 X the sets x x G1 ;G2 ;::: are disjoint as well. Hence 1 Z 1 x Z 1 1 [ [ X x X λ Gn = Px Gn dµ(x) = Px(Gn) dµ(x) = λ(Gn) n=1 n=1 n=1 n=1 In the second equality we have used that each Px is a measure, and in the third equality we have used monotone convergence to interchange integration and summation. From this we have that λ is a measure. And since Z Z Z x λ(X × Y) = Px((X × Y) ) dµ(x) = Px(Y) dµ(x) = 1 dµ(x) = 1 we obtain, than λ is actually a probability measure. Finally, it follows that Z Z Z x λ(A × B) = Px((A × B) ) dµ(x) = 1A(x)Px(B) dµ(x) = Px(B) dµ(x) A for all A 2 E and B 2 K.