Conditioning and Markov properties
Anders Rønn-Nielsen
Ernst Hansen
Department of Mathematical Sciences
University of Copenhagen
Department of Mathematical Sciences
University of Copenhagen
Universitetsparken 5
DK-2100 Copenhagen

Copyright 2014 Anders Rønn-Nielsen & Ernst Hansen
ISBN 978-87-7078-980-6
Contents

Preface  v

1  Conditional distributions  1
   1.1  Markov kernels  1
   1.2  Integration of Markov kernels  3
   1.3  Properties for the integration measure  6
   1.4  Conditional distributions  10
   1.5  Existence of conditional distributions  16
   1.6  Exercises  23

2  Conditional distributions: Transformations and moments  27
   2.1  Transformations of conditional distributions  27
   2.2  Conditional moments  35
   2.3  Exercises  41

3  Conditional independence  51
   3.1  Conditional probabilities given a σ–algebra  52
   3.2  Conditionally independent events  53
   3.3  Conditionally independent σ–algebras  55
   3.4  Shifting information around  59
   3.5  Conditionally independent random variables  61
   3.6  Exercises  68

4  Markov chains  71
   4.1  The fundamental Markov property  71
   4.2  The strong Markov property  84
   4.3  Homogeneity  90
   4.4  An integration formula for a homogeneous Markov chain  99
   4.5  The Chapman-Kolmogorov equations  100
   4.6  Stationary distributions  103
   4.7  Exercises  104

5  Ergodic theory for Markov chains on general state spaces  111
   5.1  Convergence of transition probabilities  113
   5.2  Transition probabilities with densities  115
   5.3  Asymptotic stability  117
   5.4  Minorisation  122
   5.5  The drift criterion  127
   5.6  Exercises  131

6  An introduction to Bayesian networks  141
   6.1  Introduction  141
   6.2  Directed graphs  143
   6.3  Moral graphs and separation  145
   6.4  Bayesian networks  146
   6.5  Global Markov property  151

A  Supplementary material  155
   A.1  Measurable spaces  155
   A.2  Random variables and conditional expectations  157

B  Hints for exercises  161
   B.1  Hints for chapter 1  161
   B.2  Hints for chapter 2  162
   B.3  Hints for chapter 3  163
   B.4  Hints for chapter 4  164
   B.5  Hints for chapter 5  167
Preface
The present lecture notes are intended for the course "Beting". The purpose is to give a detailed probabilistic introduction to conditional distributions and to how this concept is used to define and describe Markov chains and Bayesian networks.
Chapters 1–4 are mainly based on various sets of lecture notes written by Ernst Hansen, adapted to suit students with the background obtained in the courses MI, VidSand1, and VidSand2. Parts of these chapters also contain material inspired by lecture notes written by Martin Jacobsen and Søren Tolver Jensen. Chapter 5 is an adapted version of a set of lecture notes by Søren Tolver Jensen and Martin Jacobsen, which were in turn inspired by earlier notes by Søren Feodor Nielsen. Chapter 6 contains some standard results on Bayesian networks, though both the statements and the proofs are given in the rather general framework of Chapters 3 and 4.
There are exercises at the end of each chapter, and hints for selected exercises at the end of the notes. Some exercises are adapted versions of exercises and exam problems from previous lecture notes. Others are inspired by examples and results collected from a large number of monographs on Markov chains, stochastic simulation, and probability theory in general.
I am grateful to the students and to the teaching assistants of the last two years, Ketil Biering Tvermosegaard and Daniele Cappelletti, who have contributed to the notes by identifying mistakes and suggesting improvements.
Anders Rønn-Nielsen
København, 2014
Chapter 1
Conditional distributions
Let (X, E) and (Y, K) be two measurable spaces. In this chapter we shall discuss the relation between measures on the product space (X × Y, E ⊗ K) and measures on the two marginal spaces (X, E) and (Y, K). More precisely, we will see how measures on the product space can be constructed from measures on the two marginal spaces. A particularly simple example is a product measure µ ⊗ ν, where µ and ν are measures on (X, E) and (Y, K) respectively.
The two factors X and Y will not enter the setup symmetrically in the more general construction.
1.1 Markov kernels
Definition 1.1.1. An (X, E)–Markov kernel on (Y, K) is a family of probability measures (P_x)_{x∈X} on (Y, K), indexed by the points in X, such that the map

    x → P_x(B)

is E–B–measurable for every fixed B ∈ K.

Theorem 1.1.2. Let (X, E) and (Y, K) be measurable spaces, let ν be a σ–finite measure on (Y, K), and let f ∈ M⁺(X × Y, E ⊗ K) have the property that

    ∫ f(x, y) dν(y) = 1    for all x ∈ X.
Then (P_x)_{x∈X} given by

    P_x(B) = ∫_B f(x, y) dν(y)    for all B ∈ K, x ∈ X

is an (X, E)–Markov kernel on (Y, K).
Proof. For each fixed set B ∈ K we need to argue that

    x → ∫ 1_{X×B}(x, y) f(x, y) dν(y)

is an E–measurable function. This follows directly from EH Theorem 8.7 applied to the function 1_{X×B} f.
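The following is a minimal numerical sketch of Theorem 1.1.2, not part of the notes: the density f(x, ·) of the normal distribution N(x, 1) with respect to Lebesgue measure ν on R defines a kernel, and P_x(B) is computed by numerical integration. The Gaussian choice of f, the interval B, and the use of SciPy are all illustrative assumptions.

```python
import numpy as np
from scipy.integrate import quad

def f(x, y):
    # Density of N(x, 1) in y: integrates to 1 in y for every fixed x,
    # as required in Theorem 1.1.2.
    return np.exp(-0.5 * (y - x) ** 2) / np.sqrt(2.0 * np.pi)

def P(x, B):
    # P_x(B) = int_B f(x, y) dnu(y), here for an interval B = (a, b).
    a, b = B
    value, _ = quad(lambda y: f(x, y), a, b)
    return value

# For each fixed B, x -> P_x(B) is a function of x; its measurability
# is exactly what the theorem guarantees.
print(P(0.0, (-1.0, 1.0)))   # approx. 0.6827
print(P(2.0, (-1.0, 1.0)))   # approx. 0.1573
```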
Lemma 1.1.3. If (Y, K) has an intersection–stable generating system D, then (P_x)_{x∈X} is an (X, E)–Markov kernel on (Y, K) provided only that the map

    x → P_x(D)

is E–B–measurable for every fixed D ∈ D.

Proof. Define

    H = {F ∈ K : x → P_x(F) is E–B–measurable}.

It is easily checked that H is a Dynkin class. Since D ⊆ H and D is an intersection–stable generating system for K, Dynkin's lemma gives H = K.
Lemma 1.1.4. Let (P_x)_{x∈X} be an (X, E)–Markov kernel on (Y, K). For each G ∈ E ⊗ K the map

    x → P_x(G^x)

is E–B–measurable.

Proof. Let

    H = {G ∈ E ⊗ K : x → P_x(G^x) is E–B–measurable}

and consider a product set A × B ∈ E ⊗ K. Then

    (A × B)^x = B if x ∈ A,    (A × B)^x = ∅ if x ∉ A,

such that

    P_x((A × B)^x) = 1_A(x) · P_x(B).
This is a product of two E–B–measurable functions, hence itself E–B–measurable. Thereby we conclude that H contains all product sets, and since these form an intersection–stable generating system for E ⊗ K, we obtain H = E ⊗ K, provided we can show that H is a Dynkin class:

We already have that X × Y ∈ H, since it is a product set. Assume that G_1 ⊆ G_2 are two sets in H. Then obviously also G_1^x ⊆ G_2^x for all x ∈ X, and

    (G_2 \ G_1)^x = G_2^x \ G_1^x.

Then

    P_x((G_2 \ G_1)^x) = P_x(G_2^x) − P_x(G_1^x),

which is a difference between two measurable functions. Hence G_2 \ G_1 ∈ H.

Finally, assume that G_1 ⊆ G_2 ⊆ · · · is an increasing sequence of H–sets. Similarly to above we have G_1^x ⊆ G_2^x ⊆ · · · and

    (⋃_{n=1}^∞ G_n)^x = ⋃_{n=1}^∞ G_n^x.

Then

    P_x((⋃_{n=1}^∞ G_n)^x) = P_x(⋃_{n=1}^∞ G_n^x) = lim_{n→∞} P_x(G_n^x).

This limit is E–B–measurable, since each of the functions x → P_x(G_n^x) is measurable. Then ⋃_{n=1}^∞ G_n ∈ H.
1.2 Integration of Markov kernels
Theorem 1.2.1. Let µ be a probability measure on (X, E) and let (P_x)_{x∈X} be an (X, E)–Markov kernel on (Y, K). There exists a uniquely determined probability measure λ on (X × Y, E ⊗ K) satisfying

    λ(A × B) = ∫_A P_x(B) dµ(x)

for all A ∈ E and B ∈ K.
The probability measure λ constructed in Theorem 1.2.1 is called the integration of (P_x)_{x∈X} with respect to µ. The interpretation is that λ describes an experiment on X × Y performed in two steps: the first step is drawing x ∈ X according to µ; the second step is drawing y ∈ Y according to the probability measure P_x determined by x.
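A hedged simulation sketch of this two-step interpretation, not from the notes: with the illustrative choices µ = N(0, 1), P_x = N(x, 1) and A = B = [0, ∞), the empirical frequency of {x ∈ A, y ∈ B} should match ∫_A P_x(B) dµ(x), estimated here by Monte Carlo.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 10 ** 6

x = rng.normal(0.0, 1.0, size=n)   # step 1: draw x ~ mu = N(0, 1)
y = rng.normal(x, 1.0)             # step 2: draw y ~ P_x = N(x, 1)

# Empirical lambda(A x B) for A = B = [0, inf):
empirical = np.mean((x >= 0) & (y >= 0))

# Monte Carlo estimate of int_A P_x(B) dmu(x); for the N(x, 1) kernel
# and B = [0, inf) we have P_x(B) = Phi(x).
integral = np.mean((x >= 0) * norm.cdf(x))

print(empirical, integral)         # both approx. 0.375
```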
Proof. The uniqueness follows, since λ is determined on all product sets, and these form an intersection–stable generating system for E ⊗ K.

In order to prove the existence, we define

    λ(G) = ∫ P_x(G^x) dµ(x)    for G ∈ E ⊗ K.

For each G ∈ E ⊗ K the integrand is measurable according to Lemma 1.1.4. It is furthermore non–negative, such that λ(G) is well–defined with values in [0, ∞]. Now let G_1, G_2, . . . be a sequence of disjoint sets in E ⊗ K. Then for each x ∈ X the sets G_1^x, G_2^x, . . . are disjoint as well. Hence

    λ(⋃_{n=1}^∞ G_n) = ∫ P_x((⋃_{n=1}^∞ G_n)^x) dµ(x) = ∫ ∑_{n=1}^∞ P_x(G_n^x) dµ(x) = ∑_{n=1}^∞ λ(G_n).

In the second equality we have used that each P_x is a measure, and in the third equality we have used monotone convergence to interchange integration and summation. From this we have that λ is a measure. And since

    λ(X × Y) = ∫ P_x((X × Y)^x) dµ(x) = ∫ P_x(Y) dµ(x) = ∫ 1 dµ(x) = 1,

we obtain that λ is actually a probability measure. Finally, it follows that

    λ(A × B) = ∫ P_x((A × B)^x) dµ(x) = ∫ 1_A(x) P_x(B) dµ(x) = ∫_A P_x(B) dµ(x)

for all A ∈ E and B ∈ K.
Corollary 1.2.2. Let µ be a probability measure on (X, E) and let (P_x)_{x∈X} be an (X, E)–Markov kernel on (Y, K). Let λ be the integration of (P_x)_{x∈X} with respect to µ. Then λ satisfies

    λ(A × Y) = µ(A)    for all A ∈ E,

    λ(X × B) = ∫ P_x(B) dµ(x)    for all B ∈ K.

Proof. The second statement is obvious. For the first result, just note that P_x(Y) = 1 for all x ∈ X.
The probability measure on (Y, K) defined by B → λ(X × B) is called the mixture of the Markov kernel with respect to µ.

Example 1.2.3. Let µ be a probability measure on (X, E) and let ν be a probability measure
on (Y, K). Define P_x = ν for all x ∈ X. Then, trivially, (P_x)_{x∈X} is an X–Markov kernel on Y. Let λ be the integration of this kernel with respect to µ. Then for all A ∈ E and B ∈ K

    λ(A × B) = ∫_A ν(B) dµ(x) = µ(A) · ν(B).

The only measure satisfying this property is the product measure µ ⊗ ν, so λ = µ ⊗ ν. Hence a product measure is a particularly simple example of a measure constructed by integrating a Markov kernel.
◦
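A small simulation sketch of Example 1.2.3, under the illustrative assumptions µ = Uniform(0, 1) and ν = Exponential(1): with the constant kernel P_x = ν, the second draw does not depend on the first, and the empirical probability of a product set factors as µ(A) · ν(B).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10 ** 6

x = rng.uniform(0.0, 1.0, size=n)    # x ~ mu = Uniform(0, 1)
y = rng.exponential(1.0, size=n)     # y ~ P_x = nu, independent of x

in_A = x < 0.5                       # A = [0, 1/2)
in_B = y < 1.0                       # B = [0, 1)

# lambda(A x B) versus mu(A) * nu(B); both approx. 0.316
print(np.mean(in_A & in_B), np.mean(in_A) * np.mean(in_B))
```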
Example 1.2.4. Let µ be the Poisson distribution with parameter λ. For each x ∈ N₀ we define P_x to be the binomial distribution with parameters (x, p). Then it is seen that (P_x)_{x∈N₀} is an N₀–Markov kernel on N₀.

Let ξ be the mixture of (P_x)_{x∈N₀} with respect to µ. This must be a probability measure on N₀ and is thereby given by its probability function q. For n ∈ N₀ we obtain

    q(n) = ∑_{k=n}^∞ \binom{k}{n} p^n (1 − p)^{k−n} e^{−λ} λ^k / k!

         = ((λp)^n / n!) e^{−λ} ∑_{k=n}^∞ ((1 − p)λ)^{k−n} / (k − n)!

         = ((λp)^n / n!) e^{−λ} e^{(1−p)λ}

         = ((λp)^n / n!) e^{−λp}.
Hence the mixture ξ is seen to be the Poisson distribution with parameter λp.
◦
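Example 1.2.4 is easy to check by simulation; the following is a sketch under illustrative parameter choices (λ = 5, p = 0.3), drawing x from the Poisson distribution and then y from the binomial kernel, and comparing the empirical distribution of the second draw with the Poisson distribution with parameter λp.

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(2)
lam, p, n_draws = 5.0, 0.3, 10 ** 6

x = rng.poisson(lam, size=n_draws)   # first step: x ~ mu = Poisson(lam)
y = rng.binomial(x, p)               # second step: y ~ P_x = Bin(x, p)

# The mixture should be Poisson(lam * p):
for k in range(5):
    print(k, np.mean(y == k), poisson.pmf(k, lam * p))
```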
Theorem 1.2.5 (Uniqueness of integration). Suppose that (Y, K) has a countable generating system that is intersection stable. Let µ and µ̃ be two probability measures on (X, E), and assume that (P_x)_{x∈X} and (P̃_x)_{x∈X} are two (X, E)–Markov kernels on (Y, K). Let λ be the integration of (P_x)_{x∈X} with respect to µ, and let λ̃ be the integration of (P̃_x)_{x∈X} with respect to µ̃. Define

    E₀ = {x ∈ X : P_x = P̃_x}.

Then λ = λ̃ if and only if µ = µ̃ and µ(E₀) = 1.
Proof. Let (B_n)_{n∈N} be a countable, intersection–stable generating system for (Y, K). Since two probability measures on (Y, K) agree if they agree on an intersection–stable generating system, we have

    E₀ = ⋂_{n=1}^∞ {x ∈ X : P_x(B_n) = P̃_x(B_n)},

from which we can conclude that E₀ ∈ E.
Assume that µ = µ̃ and µ(E₀) = 1. Then for all A ∈ E and B ∈ K we have