Conditioning and Markov properties

Anders Rønn-Nielsen Ernst Hansen

Department of Mathematical Sciences, University of Copenhagen
Universitetsparken 5, DK-2100 Copenhagen

Copyright 2014 Anders Rønn-Nielsen & Ernst Hansen

ISBN 978-87-7078-980-6

Contents

Preface

1 Conditional distributions
   1.1 Markov kernels
   1.2 Integration of Markov kernels
   1.3 Properties for the integration measure
   1.4 Conditional distributions
   1.5 Existence of conditional distributions
   1.6 Exercises

2 Conditional distributions: Transformations and moments
   2.1 Transformations of conditional distributions
   2.2 Conditional moments
   2.3 Exercises

3 Conditional independence
   3.1 Conditional probabilities given a σ–algebra
   3.2 Conditionally independent events
   3.3 Conditionally independent σ-algebras
   3.4 Shifting information around
   3.5 Conditionally independent random variables
   3.6 Exercises

4 Markov chains
   4.1 The fundamental Markov property
   4.2 The strong Markov property
   4.3 Homogeneity
   4.4 An integration formula for a homogeneous Markov chain
   4.5 The Chapman-Kolmogorov equations

   4.6 Stationary distributions
   4.7 Exercises

5 Ergodic theory for Markov chains on general state spaces
   5.1 Convergence of transition probabilities
   5.2 Transition probabilities with densities
   5.3 Asymptotic stability
   5.4 Minorisation
   5.5 The drift criterion
   5.6 Exercises

6 An introduction to Bayesian networks
   6.1 Introduction
   6.2 Directed graphs
   6.3 Moral graphs and separation
   6.4 Bayesian networks
   6.5 Global Markov property

A Supplementary material
   A.1 Measurable spaces
   A.2 Random variables and conditional expectations

B Hints for exercises
   B.1 Hints for chapter 1
   B.2 Hints for chapter 2
   B.3 Hints for chapter 3
   B.4 Hints for chapter 4
   B.5 Hints for chapter 5

Preface

The present lecture notes are intended for the course "Beting". The purpose is to give a detailed probabilistic introduction to conditional distributions and to how this concept is used to define and describe Markov chains and Bayesian networks.

Chapters 1–4 are mainly based on different sets of lecture notes written by Ernst Hansen, but are adapted to suit students with the knowledge obtained in the courses MI, VidSand1, and VidSand2. Parts of these chapters also contain material inspired by lecture notes written by Martin Jacobsen and Søren Tolver Jensen. Chapter 5 is an adapted version of a set of lecture notes written by Søren Tolver Jensen and Martin Jacobsen. These notes were themselves inspired by earlier notes by Søren Feodor Nielsen. Chapter 6 contains some standard results on Bayesian networks, though both formulations and proofs are given in the rather general framework from Chapters 3 and 4.

There are exercises at the end of each chapter and hints to selected exercises at the end of the notes. Some exercises are adapted versions of exercises and exam problems from previous lecture notes. Others are inspired by examples and results collected from a large number of monographs on Markov chains, stochastic simulation, and probability in general.

I am grateful to both the students and the teaching assistants from the last two years, Ketil Biering Tvermosegaard and Daniele Cappelletti, who have contributed to the notes by identifying mistakes and suggesting improvements.

Anders Rønn-Nielsen
København, 2014

Chapter 1

Conditional distributions

Let (X , E) and (Y, K) be two measurable spaces. In this chapter we shall discuss the relation between measures on the product space (X × Y, E ⊗ K) and measures on the two marginal spaces (X , E) and (Y, K). More precisely we will see that measures on the product space can be constructed from measures on the two marginal spaces. A particularly simple example is a product measure µ ⊗ ν where µ and ν are measures on (X , E) and (Y, K) respectively.

The two factors X and Y will not enter the setup symmetrically in the more general construction.

1.1 Markov kernels

Definition 1.1.1. An (X, E)–Markov kernel on (Y, K) is a family of probability measures (Px)x∈X on (Y, K), indexed by points in X, such that the map

x ↦ Px(B)

is E − B–measurable for every fixed B ∈ K.
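Although the definition is measure–theoretic, a Markov kernel is computationally just a map from points to probability measures. Below is a minimal sketch of that view (the binomial kernel anticipates Example 1.2.4; the parameter p = 0.3 is an arbitrary choice, and scipy is assumed to be available):

```python
# A Markov kernel as a map x -> P_x; here P_x = Binomial(x, p) on {0, ..., x}.
from scipy.stats import binom

def kernel(x, p=0.3):
    """Return the probability measure P_x as a frozen scipy distribution."""
    return binom(n=x, p=p)

# For a fixed set B, x -> P_x(B) is an ordinary numerical function of x;
# on a discrete space like N_0 its measurability is automatic.
B = {0, 1, 2}
print([sum(kernel(x).pmf(b) for b in B) for x in range(5)])
```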

Theorem 1.1.2. Let (X, E) and (Y, K) be measurable spaces, let ν be a σ–finite measure on (Y, K), and let f ∈ M⁺(X × Y, E ⊗ K) have the property that

\[ \int f(x, y)\, d\nu(y) = 1 \quad \text{for all } x \in \mathcal{X}. \]

Then (Px)x∈X given by

\[ P_x(B) = \int_B f(x, y)\, d\nu(y) \quad \text{for all } B \in \mathcal{K},\ x \in \mathcal{X} \]

is an (X, E)–Markov kernel on (Y, K).

Proof. For each fixed set B ∈ K we need to argue that

\[ x \mapsto \int 1_{\mathcal{X} \times B}(x, y)\, f(x, y)\, d\nu(y) \]

is an E–measurable function. That is a direct consequence of EH Theorem 8.7 applied to the function 1_{X×B} · f.

Lemma 1.1.3. If (Y, K) has an intersection–stable generating system D, then a family of probability measures (Px)x∈X on (Y, K) is an (X, E)–Markov kernel if and only if

x ↦ Px(D)

is E − B–measurable for every fixed D ∈ D.

Proof. Define

H = {F ∈ K : x ↦ Px(F) is E − B–measurable}.

It is easily checked that H is a Dynkin class. Since D ⊆ H and D is an intersection–stable generating system for K, Dynkin's lemma gives H = K.

Lemma 1.1.4. Let (Px)x∈X be an (X, E)–Markov kernel on (Y, K). For each G ∈ E ⊗ K the map

x ↦ Px(G^x)

is E − B–measurable, where G^x = {y ∈ Y : (x, y) ∈ G} denotes the section of G at x.

Proof. Let

H = {G ∈ E ⊗ K : x ↦ Px(G^x) is E − B–measurable}

and consider a product set A × B ∈ E ⊗ K. Then

\[ (A \times B)^x = \begin{cases} \emptyset & \text{if } x \notin A \\ B & \text{if } x \in A \end{cases} \]

such that

\[ P_x((A \times B)^x) = \begin{cases} 0 & \text{if } x \notin A \\ P_x(B) & \text{if } x \in A \end{cases} \;=\; 1_A(x) \cdot P_x(B). \]

This is a product of two E − B–measurable functions, hence it is E − B–measurable. Thereby we conclude that H contains all product sets, and since these form an intersection–stable generating system for E ⊗ K, we have H = E ⊗ K if we can show that H is a Dynkin class:

We already have that X × Y ∈ H – it is a product set! Assume that G1 ⊆ G2 are two sets in H. Then obviously also G1^x ⊆ G2^x for all x ∈ X, and

\[ (G_2 \setminus G_1)^x = G_2^x \setminus G_1^x. \]

Then

\[ P_x((G_2 \setminus G_1)^x) = P_x(G_2^x) - P_x(G_1^x), \]

which is a difference between two measurable functions. Hence G2 \ G1 ∈ H.

Finally, assume that G1 ⊆ G2 ⊆ ··· is an increasing sequence of H–sets. Similarly to above we have G1^x ⊆ G2^x ⊆ ··· and

\[ \Big( \bigcup_{n=1}^{\infty} G_n \Big)^{\!x} = \bigcup_{n=1}^{\infty} G_n^x. \]

Then

\[ P_x\Big( \Big( \bigcup_{n=1}^{\infty} G_n \Big)^{\!x} \Big) = P_x\Big( \bigcup_{n=1}^{\infty} G_n^x \Big) = \lim_{n\to\infty} P_x(G_n^x). \]

This limit is E − B–measurable, since each of the functions x ↦ Px(Gn^x) is measurable. Then ⋃_{n=1}^∞ Gn ∈ H.

1.2 Integration of Markov kernels

Theorem 1.2.1. Let µ be a probability measure on (X, E) and let (Px)x∈X be an (X, E)–Markov kernel on (Y, K). There exists a uniquely determined probability measure λ on (X × Y, E ⊗ K) satisfying

\[ \lambda(A \times B) = \int_A P_x(B)\, d\mu(x) \]

for all A ∈ E and B ∈ K.

The probability measure λ constructed in Theorem 1.2.1 is called the integration of (Px)x∈X with respect to µ. The interpretation is that λ describes an experiment on X × Y that is performed in two steps: the first step is drawing x ∈ X according to µ; the second step is drawing y ∈ Y according to the probability measure Px determined by x.
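The two–step description translates directly into simulation. A minimal sketch (the concrete choices µ = Poisson(λ) and Px = Binomial(x, p) anticipate Example 1.2.4; numpy is assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
lam, p, n = 2.0, 0.3, 100_000

# Step 1: draw x from mu. Step 2: draw y from P_x.
# The pair (x, y) is then a draw from the integrated measure lambda.
x = rng.poisson(lam, size=n)
y = rng.binomial(x, p)

# The first marginal of lambda is mu (Corollary 1.2.2 below) ...
print(x.mean())          # close to lam = 2.0
# ... and the second marginal is the mixture of the kernel w.r.t. mu.
print(y.mean())          # close to lam * p = 0.6
```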

Proof. The uniqueness follows, since λ is determined on all product sets and these form an intersection–stable generating system for E ⊗ K. In order to prove the existence, we define

\[ \lambda(G) = \int P_x(G^x)\, d\mu(x). \]

For each G ∈ E ⊗ K the integrand is measurable according to Lemma 1.1.4. It is furthermore non–negative, such that λ(G) is well–defined with values in [0, ∞].

Now let G1, G2, ... be a sequence of disjoint sets in E ⊗ K. Then for each x ∈ X the sets G1^x, G2^x, ... are disjoint as well. Hence

\[ \lambda\Big( \bigcup_{n=1}^{\infty} G_n \Big) = \int P_x\Big( \Big( \bigcup_{n=1}^{\infty} G_n \Big)^{\!x} \Big)\, d\mu(x) = \int \sum_{n=1}^{\infty} P_x(G_n^x)\, d\mu(x) = \sum_{n=1}^{\infty} \lambda(G_n). \]

In the second equality we have used that each Px is a measure, and in the third equality we have used monotone convergence to interchange integration and summation. From this we have that λ is a measure. And since

\[ \lambda(\mathcal{X} \times \mathcal{Y}) = \int P_x((\mathcal{X} \times \mathcal{Y})^x)\, d\mu(x) = \int P_x(\mathcal{Y})\, d\mu(x) = \int 1\, d\mu(x) = 1, \]

we obtain that λ is actually a probability measure. Finally, it follows that

\[ \lambda(A \times B) = \int P_x((A \times B)^x)\, d\mu(x) = \int 1_A(x)\, P_x(B)\, d\mu(x) = \int_A P_x(B)\, d\mu(x) \]

for all A ∈ E and B ∈ K.

Corollary 1.2.2. Let µ be a probability measure on (X, E) and let (Px)x∈X be an (X, E)–Markov kernel on (Y, K). Let λ be the integration of (Px)x∈X with respect to µ. Then λ satisfies

\[ \lambda(A \times \mathcal{Y}) = \mu(A) \quad \text{for all } A \in \mathcal{E}, \]
\[ \lambda(\mathcal{X} \times B) = \int P_x(B)\, d\mu(x) \quad \text{for all } B \in \mathcal{K}. \]

Proof. The second statement is obvious. For the first result just note that Px(Y) = 1 for all x ∈ X .

The probability measure on (Y, K) defined by B ↦ λ(X × B) is called the mixture of the Markov kernel with respect to µ.

Example 1.2.3. Let µ be a probability measure on (X, E) and let ν be a probability measure

on (Y, K). Define Px = ν for all x ∈ X. Then, trivially, (Px)x∈X is an (X, E)–Markov kernel on (Y, K). Let λ be the integration of this kernel with respect to µ. Then for all A ∈ E and B ∈ K

\[ \lambda(A \times B) = \int_A \nu(B)\, d\mu(x) = \mu(A) \cdot \nu(B). \]

The only measure satisfying this property is the product measure µ ⊗ ν, so λ = µ ⊗ ν. Hence a product measure is a particularly simple example of a measure constructed by integrating a Markov kernel. ◦

Example 1.2.4. Let µ be the Poisson distribution with parameter λ. For each x ∈ N0 we define Px to be the binomial distribution with parameters (x, p). Then it is seen that

(Px)x∈N0 is an N0–Markov kernel on N0.

Let ξ be the mixture of (Px)x∈N0 with respect to µ. This must be a probability measure on N0 and is thereby given by its probability function q. For n ∈ N0 we obtain

\[ \begin{aligned} q(n) &= \sum_{k=n}^{\infty} \binom{k}{n} p^n (1-p)^{k-n}\, e^{-\lambda} \frac{\lambda^k}{k!} \\ &= e^{-\lambda} \frac{(\lambda p)^n}{n!} \sum_{k=n}^{\infty} \frac{\big((1-p)\lambda\big)^{k-n}}{(k-n)!} \\ &= e^{-\lambda} e^{(1-p)\lambda} \frac{(\lambda p)^n}{n!} = e^{-\lambda p} \frac{(\lambda p)^n}{n!}. \end{aligned} \]

Hence the mixture ξ is seen to be the Poisson distribution with parameter λp. ◦
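The identity can also be checked numerically by truncating the sum defining the mixture. A small sketch (λ = 2 and p = 0.3 are arbitrary choices; numpy and scipy assumed):

```python
import numpy as np
from scipy.stats import binom, poisson

lam, p = 2.0, 0.3
ks = np.arange(200)          # truncation of N_0; the Poisson tail is negligible
mu = poisson.pmf(ks, lam)    # the mixing measure mu = Poisson(lam)

# Mixture: q(n) = sum_k mu({k}) * P_k({n}) with P_k = Binomial(k, p).
ns = np.arange(10)
q = np.array([np.sum(mu * binom.pmf(n, ks, p)) for n in ns])
print(np.allclose(q, poisson.pmf(ns, lam * p)))   # True: the mixture is Poisson(lam*p)
```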

Theorem 1.2.5 (Uniqueness of integration). Suppose that (Y, K) has a countable generating system that is intersection stable. Let µ and µ̃ be two probability measures on (X, E), and assume that (Px)x∈X and (P̃x)x∈X are two (X, E)–Markov kernels on (Y, K). Let λ be the integration of (Px)x∈X with respect to µ, and let λ̃ be the integration of (P̃x)x∈X with respect to µ̃. Define

E0 = {x ∈ X : Px = P̃x}.

Then λ = λ̃ if and only if µ = µ̃ and µ(E0) = 1.

Proof. Let (Bn)n∈N be a countable, intersection–stable generating system for (Y, K). By uniqueness of measures

\[ E_0 = \bigcap_{n=1}^{\infty} \{x \in \mathcal{X} : P_x(B_n) = \tilde{P}_x(B_n)\}, \]

from which we can conclude that E0 ∈ E.

Assume that µ = µ̃ and µ(E0) = 1. Then for all A ∈ E and B ∈ K we have

\[ \lambda(A \times B) = \int_{A \cap E_0} P_x(B)\, d\mu(x) = \int_{A \cap E_0} \tilde{P}_x(B)\, d\tilde{\mu}(x) = \tilde{\lambda}(A \times B), \]

and thereby λ = λ̃.

Assume conversely that λ = λ̃. According to Corollary 1.2.2 we have for all A ∈ E

\[ \mu(A) = \lambda(A \times \mathcal{Y}) = \tilde{\lambda}(A \times \mathcal{Y}) = \tilde{\mu}(A), \]

such that µ = µ̃. The proof will be complete, if we can show that

\[ \mu\big(\{x \in \mathcal{X} : P_x(B_n) \ne \tilde{P}_x(B_n)\}\big) = 0 \]

for all n ∈ N. Consider for this the set

\[ E_n^+ = \{x \in \mathcal{X} : P_x(B_n) > \tilde{P}_x(B_n)\}. \]

Using this definition gives

\[ \int_{E_n^+} \big(P_x(B_n) - \tilde{P}_x(B_n)\big)\, d\mu(x) = \lambda(E_n^+ \times B_n) - \tilde{\lambda}(E_n^+ \times B_n) = 0, \]

and since the integrand is strictly positive on E_n^+, we can conclude that µ(E_n^+) = 0. It is shown similarly that µ(E_n^−) = 0, where

\[ E_n^- = \{x \in \mathcal{X} : P_x(B_n) < \tilde{P}_x(B_n)\}. \]

1.3 Properties for the integration measure

In this section we will consider integration with respect to λ, where λ is the integrated measure of a Markov kernel (Px)x∈X with respect to some probability measure µ. We shall see that such λ–integrals can be calculated by successive integration, similar to what is known for product measures.

Lemma 1.3.1. Let (Px)x∈X be an (X, E)–Markov kernel on (Y, K) and assume that f : X × Y → [0, ∞] is E ⊗ K–measurable. Then the map

\[ x \mapsto \int f(x, y)\, dP_x(y) \tag{1.1} \]

is E–measurable.

Proof. Firstly note that for fixed x we have f(x, y) = f ∘ i_x(y), which is a K–measurable function of y. Hence the integral in (1.1) is well–defined. Now assume that f is a simple function

\[ f = \sum_{k=1}^{n} c_k 1_{G_k} \tag{1.2} \]

where c1, ..., cn ∈ (0, ∞) and G1, ..., Gn are disjoint sets in E ⊗ K. Since

\[ 1_{G_k}(x, y) = 1_{G_k^x}(y) \]

for all x and y, we obtain

\[ \int f(x, y)\, dP_x(y) = \sum_{k=1}^{n} \int c_k 1_{G_k}(x, y)\, dP_x(y) = \sum_{k=1}^{n} c_k \int 1_{G_k^x}(y)\, dP_x(y) = \sum_{k=1}^{n} c_k P_x(G_k^x). \]

According to Lemma 1.1.4 this is a linear combination of E–measurable functions. Hence it is E–measurable. Now assume that f is a general function in M+(X × Y, E ⊗ K). Then there exists a sequence

(fn)n∈N of non–negative simple functions with fn(x, y) ↑ f(x, y) for all x ∈ X and y ∈ Y. For fixed x we have from monotone convergence that

\[ \int f_n(x, y)\, dP_x(y) \uparrow \int f(x, y)\, dP_x(y). \]

Hence the right hand side is the point–wise limit of E–measurable functions. Thereby it is E–measurable.

Theorem 1.3.2 (Extended Tonelli). Let µ be a probability measure on (X, E), and assume that (Px)x∈X is an (X, E)–Markov kernel on (Y, K). Let λ be the integration of (Px)x∈X with respect to µ. For every E ⊗ K–measurable function f : X × Y → [0, ∞] it holds that

\[ \int f(x, y)\, d\lambda(x, y) = \iint f(x, y)\, dP_x(y)\, d\mu(x). \]

Proof. The inner integral on the right hand side is, according to Lemma 1.3.1, E–measurable with values in [0, ∞]. Hence both the left hand side and the right hand side are well–defined.

Now assume that f is a simple function of the form (1.2). Then

\[ \begin{aligned} \int f\, d\lambda &= \sum_{k=1}^{n} c_k \lambda(G_k) = \sum_{k=1}^{n} c_k \int P_x(G_k^x)\, d\mu(x) \\ &= \sum_{k=1}^{n} c_k \iint 1_{G_k^x}(y)\, dP_x(y)\, d\mu(x) = \iint \sum_{k=1}^{n} c_k 1_{G_k}(x, y)\, dP_x(y)\, d\mu(x) \\ &= \iint f(x, y)\, dP_x(y)\, d\mu(x), \end{aligned} \]

which shows the result, when f is a simple function.

Now let f be a general function in M⁺(X × Y, E ⊗ K). Then there exists a sequence (fn)n∈N of non–negative simple functions with fn ↑ f. Then from monotone convergence

\[ \int f\, d\lambda = \lim_{n\to\infty} \int f_n\, d\lambda = \lim_{n\to\infty} \iint f_n(x, y)\, dP_x(y)\, d\mu(x). \]

But monotone convergence also yields

\[ \int f_n(x, y)\, dP_x(y) \uparrow \int f(x, y)\, dP_x(y), \]

and applying monotone convergence once more then gives

\[ \iint f_n(x, y)\, dP_x(y)\, d\mu(x) \uparrow \iint f(x, y)\, dP_x(y)\, d\mu(x), \]

and this shows the theorem.
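On a finite state space the extended Tonelli identity is a finite sum and can be verified directly. A small numerical sketch (the measure, kernel and function below are arbitrary illustrative choices; numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
nx, ny = 4, 5
mu = rng.dirichlet(np.ones(nx))           # probability measure on X = {0,...,3}
P = rng.dirichlet(np.ones(ny), size=nx)   # P[x] is the measure P_x on Y = {0,...,4}
f = rng.random((nx, ny))                  # a non-negative function on X x Y

lam = mu[:, None] * P                     # lambda({(x, y)}) = mu({x}) P_x({y})
lhs = np.sum(f * lam)                     # integral of f with respect to lambda
rhs = np.sum(mu * np.sum(f * P, axis=1))  # iterated: inner dP_x, outer dmu
print(np.isclose(lhs, rhs))               # True
```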

Theorem 1.3.3 (Extended Fubini). Let µ be a probability measure on (X, E) and assume that (Px)x∈X is an (X, E)–Markov kernel on (Y, K). Let λ be the integration of (Px)x∈X with respect to µ. For every E ⊗ K–measurable and λ–integrable function f : X × Y → R the set

\[ A_0 = \Big\{x \in \mathcal{X} : \int |f(x, y)|\, dP_x(y) < \infty\Big\} \]

is E–measurable with µ(A0) = 1. Furthermore, the function

\[ g(x) = \begin{cases} \int f(x, y)\, dP_x(y) & x \in A_0 \\ 0 & x \notin A_0 \end{cases} \]

is E–measurable and µ–integrable, and

\[ \int f(x, y)\, d\lambda(x, y) = \int_{A_0} \int f(x, y)\, dP_x(y)\, d\mu(x). \]

Note: The extended Tonelli theorem can be applied to determine whether f is λ–integrable – that is, whether ∫ |f| dλ < ∞.

Proof. It follows from Lemma 1.3.1 that A0 ∈ E. The extended Tonelli theorem gives

\[ \iint |f(x, y)|\, dP_x(y)\, d\mu(x) = \int |f|\, d\lambda < \infty. \]

Hence the integral ∫ |f(x, y)| dPx(y) must be finite for µ almost all x ∈ X, such that µ(A0) = 1. For each x ∈ A0 we have

\[ \int f(x, y)\, dP_x(y) = \int f^+(x, y)\, dP_x(y) - \int f^-(x, y)\, dP_x(y). \]

From this we see that the function g defined in the theorem is measurable according to Lemma 1.3.1. Furthermore we obtain from the extended Tonelli theorem that

\[ \int |g(x)|\, d\mu(x) = \int_{A_0} \Big| \int f(x, y)\, dP_x(y) \Big|\, d\mu(x) \le \iint 1_{A_0 \times \mathcal{Y}}(x, y)\, |f(x, y)|\, dP_x(y)\, d\mu(x) < \infty, \]

showing that g is µ–integrable. Finally, we have from the extended Tonelli theorem that

\[ \begin{aligned} \int_{A_0} \int f(x, y)\, dP_x(y)\, d\mu(x) &= \int_{A_0} \int f^+(x, y)\, dP_x(y)\, d\mu(x) - \int_{A_0} \int f^-(x, y)\, dP_x(y)\, d\mu(x) \\ &= \int 1_{A_0}(x) f^+(x, y)\, d\lambda(x, y) - \int 1_{A_0}(x) f^-(x, y)\, d\lambda(x, y) \\ &= \int 1_{A_0}(x) f(x, y)\, d\lambda(x, y) = \int_{A_0 \times \mathcal{Y}} f(x, y)\, d\lambda(x, y). \end{aligned} \]

But from Corollary 1.2.2 we have λ(A0 × Y) = µ(A0) = 1, so

\[ \int_{A_0 \times \mathcal{Y}} f(x, y)\, d\lambda(x, y) = \int f(x, y)\, d\lambda(x, y). \]

1.4 Conditional distributions

In an experiment where two random variables X and Y are observed, it is often convenient to consider the probabilistic model in two steps: X is observed first, afterwards Y is observed. Here it is natural to believe that the mechanism that decides the value of Y depends on the drawn value of X. This two–step model can be constructed by considering the simultaneous distribution of X and Y as the integration of the conditional distribution of Y given X with respect to the distribution of X.

Definition 1.4.1. Let X and Y be random variables defined on the probability space (Ω, F, P) with values in (X, E) and (Y, K) respectively. Let (Px)x∈X be an (X, E)–Markov kernel on (Y, K). We say that (Px)x∈X is the conditional distribution of Y given X, if the simultaneous distribution (X, Y)(P) (a probability measure on (X × Y, E ⊗ K)) is the integration of (Px)x∈X with respect to X(P). That is, if

\[ P(X \in A, Y \in B) = (X, Y)(P)(A \times B) = \int_A P_x(B)\, dX(P)(x) \]

for all A ∈ E and B ∈ K.

Note that we say the conditional distribution although, according to Theorem 1.2.5, the Markov kernel can be changed on nullsets (with respect to X(P)). The strictly correct term would be a conditional distribution. When stating the conditional distribution, it is not necessary to give the entire Markov kernel (Px)x∈X: since the Markov kernel is integrated with respect to X(P), it is enough to give (Px)x∈A0, where A0 ∈ E is a set with P(X ∈ A0) = 1.

Conversely, a conditional distribution given by (Px)x∈A0, where P(X ∈ A0) = 1, can be extended to a "true" Markov kernel (P̃x)x∈X by the definition

\[ \tilde{P}_x = \begin{cases} P_x & x \in A_0 \\ P_0 & x \notin A_0, \end{cases} \]

where P0 is some probability measure on (Y, K). Note that x ↦ P̃x(B) is measurable, since A0 is a measurable set.

The interpretation of the conditional distribution of Y given X is that Px describes the distribution of Y if we know that X = x. This interpretation is very useful, although it should not be taken too seriously, since it may be difficult to give it a strict mathematical meaning when the event (X = x) is a nullset. Nevertheless, we will call Px the conditional distribution of Y given X = x.

This interpretation leads to the following alternative notation for a Markov kernel (Px)x∈X that is a conditional distribution of Y given X:

PY (B | X = x) = Px(B) for B ∈ K .

A more relaxed but useful notation will be simply talking about 'the distribution of Y | X = x' instead of the longer 'the distribution Px, when (Px)x∈X is the conditional distribution of Y given X'. We will also from time to time write expressions like Y | X = x ∼ ν.

Later in this chapter we will show the following very important result:

Theorem 1.4.2. Let X and Y be random variables defined on the probability space (Ω, F, P) with values in (X, E) and (Y, K) respectively, where (Y, K) is a Borel space. Then there exists a conditional distribution of Y given X.

This result is particularly important, since (as we shall show) R, Rⁿ and R^∞ are Borel spaces. The proof is not constructive, in the sense that it is not in general clear what the Markov kernels should look like. The construction of the Markov kernels is however possible in a number of more concrete situations.

Theorem 1.4.3. Assume that X and Y are random variables on (X, E) and (Y, K) such that (Px)x∈X is the conditional distribution of Y given X. Then X and Y are independent if and only if Px does not depend on x. That the Markov kernel is independent of x means that it can be chosen such that

Px = P0 for all x ∈ X

for some probability measure P0 on (Y, K). In the case of independence, Px = P0 = Y(P) for all x ∈ X.

Proof. Suppose that X and Y are independent. Then for A ∈ E and B ∈ K

\[ P(X \in A, Y \in B) = X(P)(A) \cdot Y(P)(B) = \int_A Y(P)(B)\, dX(P)(x), \]

which shows, that the constant Markov kernel (Y (P ))x∈X is the conditional distribution of Y given X.

Conversely, assume that Px = P0 for all x ∈ X, where P0 is some probability measure on (Y, K). Then for B ∈ K we have

\[ P(Y \in B) = P(X \in \mathcal{X}, Y \in B) = \int P_x(B)\, dX(P)(x) = \int P_0(B)\, dX(P)(x) = P_0(B), \]

which shows that Y(P) = P0. Furthermore for A ∈ E and B ∈ K we obtain

\[ P(X \in A, Y \in B) = \int_A P_x(B)\, dX(P)(x) = \int_A P_0(B)\, dX(P)(x) = X(P)(A)\, P_0(B) = P(X \in A)\, P(Y \in B), \]

leading to the conclusion that X and Y are independent.

Hence independence between two variables X and Y is the same as the conditional distri- bution of Y given X being constant. If conversely the conditional distribution consists of very different probability measures, then it seems reasonable to believe that there is a strong dependence between X and Y .

In the following theorem it is seen that if X is a discrete , then the conditional distribution is just given by elementary conditional probabilities.

Theorem 1.4.4. Let X and Y be random variables defined on (Ω, F,P ) with values in (X , E) and (Y, K). Assume that X is finite or countable and that E is the paving that consists of all subsets of X . Then the conditional distribution of Y given X is determined by

\[ P_x(B) = \frac{P(X = x, Y \in B)}{P(X = x)} \quad \text{for } B \in \mathcal{K}, \tag{1.3} \]

for all x ∈ X with P(X = x) > 0.

Note that Px(B) is simply the elementary conditional probability of the event (Y ∈ B) given the event (X = x):

\[ P_x(B) = \frac{P(X = x, Y \in B)}{P(X = x)} = P(Y \in B \mid X = x). \]

Proof. Let A0 = {x ∈ X : P(X = x) > 0} and note that X(P)(A0) = 1, such that (1.3) defines an (X, E)–Markov kernel on (Y, K) – the measurability is not a problem, since all functions on X are E–measurable. For A ⊆ X and B ∈ K we have

\[ \begin{aligned} \int_A P_x(B)\, dX(P)(x) &= \int_{A \cap A_0} P_x(B)\, dX(P)(x) = \sum_{x \in A \cap A_0} \frac{P(X = x, Y \in B)}{P(X = x)}\, P(X = x) \\ &= \sum_{x \in A \cap A_0} P(X = x, Y \in B) = P(X \in A \cap A_0, Y \in B) = P(X \in A, Y \in B), \end{aligned} \]

such that (Px)x∈X actually is the conditional distribution of Y given X.

Example 1.4.5. Let X1 and X2 be independent random variables that are Poisson distributed with parameters λ1 and λ2. Then the distribution of X = X1 + X2 is a Poisson distribution with parameter λ = λ1 + λ2. We will find the conditional distribution of X1 given X by computing Px({n}) for all x, n ∈ N0. This must be sufficient, since all Px are concentrated on N0. Using Theorem 1.4.4 yields for x ∈ N0 and n = 0, 1, ..., x that

\[ \begin{aligned} P_x(\{n\}) &= \frac{P(X_1 = n, X_2 = x - n)}{P(X = x)} = \frac{P(X_1 = n)\, P(X_2 = x - n)}{P(X = x)} \\ &= \frac{\frac{\lambda_1^n}{n!} e^{-\lambda_1} \frac{\lambda_2^{x-n}}{(x-n)!} e^{-\lambda_2}}{\frac{\lambda^x}{x!} e^{-\lambda}} = \binom{x}{n} \Big( \frac{\lambda_1}{\lambda_1 + \lambda_2} \Big)^{\!n} \Big( \frac{\lambda_2}{\lambda_1 + \lambda_2} \Big)^{\!x-n}. \end{aligned} \]

Hence the conditional distribution of X1 given X = x is a binomial distribution with parameters (x, λ1/(λ1 + λ2)). ◦
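The conditional binomial distribution in the example can be verified by simulation. A small sketch (the parameter values are arbitrary; numpy and scipy assumed):

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(2)
lam1, lam2, x = 1.5, 2.5, 4

x1 = rng.poisson(lam1, 200_000)
x2 = rng.poisson(lam2, 200_000)
keep = (x1 + x2 == x)                    # condition on the event (X = x)

# Empirical conditional pmf of X1 given X = x ...
emp = np.bincount(x1[keep], minlength=x + 1) / keep.sum()
# ... against the claimed Binomial(x, lam1 / (lam1 + lam2)) pmf.
print(np.round(emp, 3))
print(np.round(binom.pmf(np.arange(x + 1), x, lam1 / (lam1 + lam2)), 3))
```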

Theorem 1.4.6. Assume that X and Y are random variables defined on (Ω, F, P) with values in (X, E) and (Y, K) respectively, and that (Px)x∈X is the conditional distribution of Y given X. Furthermore let µ and ν be σ–finite measures on (X, E) and (Y, K) respectively and assume that X(P) = f · µ. Finally assume that (Px)x∈X is of the type constructed in Theorem 1.1.2: Px = gx · ν, where the function (x, y) ↦ gx(y) is E ⊗ K–measurable. Then the simultaneous distribution of X and Y is given by (X, Y)(P) = h · (µ ⊗ ν), where

\[ h(x, y) = f(x)\, g_x(y) \quad \text{for all } x \in \mathcal{X},\ y \in \mathcal{Y}. \]

Proof. Let A ∈ E and B ∈ K. Then

\[ \begin{aligned} (X, Y)(P)(A \times B) &= P(X \in A, Y \in B) = \int 1_A(x)\, P_x(B)\, dX(P)(x) \\ &= \int 1_A(x) \Big( \int 1_B(y)\, g_x(y)\, d\nu(y) \Big) f(x)\, d\mu(x) \\ &= \iint 1_{A \times B}(x, y)\, f(x) g_x(y)\, d\nu(y)\, d\mu(x) \\ &= \int 1_{A \times B}(x, y)\, h(x, y)\, d(\mu \otimes \nu)(x, y), \end{aligned} \]

where the last equality is due to Tonelli. We see that (X, Y)(P) and h · (µ ⊗ ν) coincide on all product sets, and thereby they must be equal.

The theorem states that the simultaneous density is the product of the marginal density and the conditional densities. The next theorem gives the converse result: the density of the conditional distribution is the ratio between the simultaneous density and the marginal density.

Theorem 1.4.7. Assume that X and Y are random variables defined on (Ω, F, P) with values in (X, E) and (Y, K) respectively. Furthermore let µ and ν be σ–finite measures on (X, E) and (Y, K) respectively and assume that (X, Y)(P) = h · (µ ⊗ ν). Then the conditional distribution of Y given X exists. The marginal distribution of X has density with respect to µ given by

\[ f(x) = \int h(x, y)\, d\nu(y). \]

Let A0 = {x ∈ X : 0 < f(x) < ∞}. Then X(P)(A0) = 1 and the conditional distribution (Px)x∈X of Y given X has density with respect to ν given by

\[ g_x(y) = \frac{h(x, y)}{f(x)} \]

for all x ∈ A0.

Proof. Finding the marginal density for X(P) is a well–known calculation. For A ∈ E we have

\[ \begin{aligned} X(P)(A) = (X, Y)(P)(A \times \mathcal{Y}) &= \int_{A \times \mathcal{Y}} h(x, y)\, d(\mu \otimes \nu)(x, y) \\ &= \int_A \int h(x, y)\, d\nu(y)\, d\mu(x) = \int_A f(x)\, d\mu(x) \end{aligned} \]

according to Tonelli. Thus X(P) has density f with respect to µ. Now define the sets

A1 = {x ∈ X : f(x) = 0} and A2 = {x ∈ X : f(x) = ∞} .

Since X(P)(X) = 1 we have

\[ 1 \ge X(P)(A_2) = \int_{A_2} f(x)\, d\mu(x) = \infty \cdot \mu(A_2), \]

so µ(A2) = 0. Clearly we have that X(P)(A1) = 0, such that X(P)(A0) = 1. From Tonelli we have that x ↦ ∫ h(x, y) dν(y) = f(x) is E − B–measurable. Then also

\[ (x, y) \mapsto 1_{A_0}(x)\, \frac{h(x, y)}{f(x)} = 1_{A_0}(x)\, g_x(y) \]

is E ⊗ K − B–measurable, and we have from Theorem 1.1.2 that (Px)x∈A0 is a Markov kernel, when Px = gx · ν. Finally we have for A ∈ E and B ∈ K that

\[ \begin{aligned} \int_A P_x(B)\, dX(P)(x) &= \int_{A \cap A_0} \Big( \int_B g_x(y)\, d\nu(y) \Big) f(x)\, d\mu(x) \\ &= \int_{A \cap A_0} \Big( \int_B \frac{h(x, y)}{f(x)}\, d\nu(y) \Big) f(x)\, d\mu(x) \\ &= \int_A \Big( \int_B h(x, y)\, d\nu(y) \Big)\, d\mu(x) \\ &= \int_{A \times B} h(x, y)\, d(\mu \otimes \nu)(x, y) \\ &= (X, Y)(P)(A \times B) = P(X \in A, Y \in B), \end{aligned} \]

which shows that (Px)x∈A0 is the conditional distribution of Y given X.
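The density formulas in Theorems 1.4.6 and 1.4.7 become elementwise array operations on a grid. A small sketch with the joint density h(x, y) = x + y on (0, 1)² (an arbitrary valid density; numpy assumed):

```python
import numpy as np

xs = np.linspace(0.005, 0.995, 100)          # midpoint grid on (0, 1)
dx = xs[1] - xs[0]
X, Y = np.meshgrid(xs, xs, indexing="ij")
h = X + Y                                    # joint density w.r.t. Lebesgue measure

f = h.sum(axis=1) * dx                       # marginal: f(x) = int h(x, y) dy
g = h / f[:, None]                           # conditional densities g_x(y) = h(x, y) / f(x)

print(np.allclose(g.sum(axis=1) * dx, 1.0))  # each P_x integrates to 1
print(np.allclose(f[:, None] * g, h))        # Theorem 1.4.6: f(x) g_x(y) = h(x, y)
```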

1.5 Existence of conditional distributions

Assume that X and Z are random variables defined on (Ω, F,P ) with values in (X , E) and (R, B) respectively. Assume that E|Z| < ∞. Recall that the conditional expectation of Z given X exists and it satisfies 1)–3) in Theorem A.2.6. The fact that E(Z|X) is σ(X)– measurable gives the existence of a measurable map φ :(X , E) → (R, B) such that

E(Z|X) = φ(X) P almost surely .

If Z is an indicator function 1F for some F ∈ F, then Z is obviously integrable such that E(1F |X) is well–defined. We call this a conditional probability (given X) and use the notation

P (F |X) = E(1F |X)

Note that 0 ≤ P(F|X) ≤ 1 P–a.s., and we can choose a version of P(F|X) such that it has values in [0, 1] for all ω ∈ Ω. Once again we can find an E − B–measurable map φF such that

P (F |X) = φF (X)

We shall furthermore use the notation

P (F |X = x) = φF (x) .

The conditional probability P(F|X) is obviously determined by the map φF, so we shall also call this map the conditional probability of F given X. Hence φF with values in [0, 1] is a conditional probability of F given X if it is E − B–measurable and satisfies φF(X) = P(F|X). Going back to the properties 1)–3) of Theorem A.2.6 that characterise conditional expectations, we obtain

Theorem 1.5.1. An E − B–measurable map φF with values in [0, 1] is a conditional probability of F given X, if and only if it holds that

\[ \int_A \varphi_F(x)\, dX(P)(x) = P((X \in A) \cap F) \tag{1.4} \]

for all A ∈ E.

Proof. Assume that φF is E − B–measurable with values in [0, 1]. Then according to the change-of-variable theorem (Theorem A.1.8) it holds that

\[ \int_{(X \in A)} \varphi_F(X)\, dP = \int_A \varphi_F(x)\, dX(P)(x). \]

Hence equation (1.4) is fulfilled if and only if

\[ \int_{(X \in A)} \varphi_F(X)\, dP = P((X \in A) \cap F) = \int_{(X \in A)} 1_F\, dP \]

is fulfilled for all A ∈ E. And this is by definition fulfilled if and only if φF(X) is the conditional expectation of 1F given X.

Corollary 1.5.2. Let X and Y be random variables defined on the probability space (Ω, F, P) with values in (X, E) and (R, B) respectively. Assume that (Px)x∈X is a conditional distribution of Y given X. Then for all B ∈ B the function φ defined by φ(x) = Px(B) is a conditional probability of (Y ∈ B) given X.

Proof. Recall that (Px)x∈X satisfies

\[ \int_A P_x(B)\, dX(P)(x) = P(X \in A, Y \in B) \]

for all A ∈ E and B ∈ B. Now use Theorem 1.5.1.

The following immediate consequence of the properties of conditional expectations will be useful.

Lemma 1.5.3. The following results hold:

(a) Assume that A ⊆ B. Then

P (A | X = x) ≤ P (B | X = x)

for X(P ) almost all x ∈ X .

(b) If (An)n∈N is an increasing sequence of sets, then

\[ \lim_{n\to\infty} P(A_n \mid X = x) = P\Big( \bigcup_{n=1}^{\infty} A_n \;\Big|\; X = x \Big) \]

for X(P) almost all x ∈ X.

(c) If (An)n∈N is a decreasing sequence of sets, then

\[ \lim_{n\to\infty} P(A_n \mid X = x) = P\Big( \bigcap_{n=1}^{\infty} A_n \;\Big|\; X = x \Big) \]

for X(P) almost all x ∈ X.

Proof. (a) We have 1A ≤ 1B so

P (A | X) = E(1A| X) ≤ E(1B| X) = P (B | X) P –a.s. .

If we write φA(X) = P (A | X) and φB(X) = P (B | X), then we have

\[ \begin{aligned} X(P)\big(\{x \in \mathcal{X} : P(A \mid X = x) \le P(B \mid X = x)\}\big) &= X(P)\big(\{x \in \mathcal{X} : \varphi_A(x) \le \varphi_B(x)\}\big) \\ &= P\big(\{\omega \in \Omega : \varphi_A(X(\omega)) \le \varphi_B(X(\omega))\}\big) \\ &= P\big(P(A \mid X) \le P(B \mid X)\big) = 1. \end{aligned} \]

(b) Let A0 = ⋃_{n=1}^∞ An and note that 1_{An} ↑ 1_{A0}. Then it follows from (7) in Theorem A.2.5 that

P(An | X) → P(A0 | X) P–a.s.,

and then the result can be concluded as in (a).

(c) Use that 1A1 − 1An is increasing and repeat the argument from (b).

The idea is to use such conditional probabilities of the events (Y ∈ B) given X to construct the conditional distribution of Y given X. Since we already know that the conditional probabilities P(Y ∈ B|X) exist, it seems simple and obvious just to define Px(B) = P(Y ∈ B | X = x) for all B ∈ K. The problem is that P(Y ∈ B | X) is only determined almost surely and that the nullsets vary with B. We have, however, for B1 and

B2 disjoint sets that

P (Y ∈ (B1 ∪ B2) | X) = P (Y ∈ B1 | X) + P (Y ∈ B2 | X) P –a.s. such that also

P (Y ∈ (B1 ∪ B2) | X = x) = P (Y ∈ B1 | X = x) + P (Y ∈ B2 | X = x) for X(P ) almost all x ∈ X . This is of course necessary, if Px should be a probability measure.

The problem is that for other sets B̃1 and B̃2 the nullsets where the above equalities fail may be different. If there are uncountably many sets in K, then it may be impossible to choose Px such that e.g. the above equalities are true for all B1 and B2. We are now ready to prove

Theorem 1.5.4. Let X and Y be random variables defined on the probability space (Ω, F, P) with values in (X, E) and (R, B) respectively. Then there exists a conditional distribution of Y given X.

Proof. We shall exploit the fact that probability measures on (R, B) are characterised by their distribution functions, and that these in turn are completely determined by their values on the rational numbers Q.

Let, for q ∈ Q, P(Y ≤ q | X) be a conditional probability of (Y ≤ q) given X. If q < r, q, r ∈ Q, then we have 1_{(Y ≤ q)} ≤ 1_{(Y ≤ r)}, so

\[ P(Y \le q \mid X) = E(1_{(Y \le q)} \mid X) \le E(1_{(Y \le r)} \mid X) = P(Y \le r \mid X) \quad P\text{–a.s.} \]

Define

\[ A_{qr} = \{x \in \mathcal{X} : P(Y \le q \mid X = x) \le P(Y \le r \mid X = x)\} \in \mathcal{E}. \]

Then X(P)(Aqr) = 1 by Lemma 1.5.3 (a), such that also X(P)(A0) = 1, where A0 = ⋂_{q<r} Aqr. For fixed x ∈ A0 the function q ↦ P(Y ≤ q | X = x) is increasing on Q, so the limits

\[ L_-(x) = \lim_{q \to -\infty,\, q \in \mathbb{Q}} P(Y \le q \mid X = x), \qquad L_+(x) = \lim_{q \to \infty,\, q \in \mathbb{Q}} P(Y \le q \mid X = x) \]

exist. Since (Y ≤ n) ↑ (Y ∈ R), we must have from Lemma 1.5.3 (b) that X(P)(G+) = 1, where

\[ G_+ = \{x \in \mathcal{X} : \lim_{n\to\infty} P(Y \le n \mid X = x) = 1\} \in \mathcal{E}. \]

Hence it must hold that L+(x) = 1 for x ∈ A0 ∩ G+. It can be seen similarly that X(P)(G−) = 1, where

\[ G_- = \{x \in \mathcal{X} : \lim_{n\to-\infty} P(Y \le n \mid X = x) = 0\} \in \mathcal{E}. \]

Altogether we have that X(P)(M) = 1, where M = A0 ∩ G+ ∩ G−, and for x ∈ M the function q ↦ P(Y ≤ q | X = x) is increasing on Q with limit 1 at +∞ and limit 0 at −∞. Hence for each x ∈ M the function q ↦ P(Y ≤ q | X = x) looks like a distribution function – and we are looking for distributions indexed by x – but it needs to be defined on R instead of Q. So define for x ∈ M

\[ F(y \mid X = x) = \inf\{P(Y \le q \mid X = x) : q \in \mathbb{Q},\, q > y\} \]

and note that (since q ↦ P(Y ≤ q | X = x) is increasing) we have

\[ F(y \mid X = x) = \lim_{n\to\infty} P(Y \le q_n \mid X = x) \]

whenever q1 > q2 > ··· > y is a decreasing rational sequence with y as the limit. From this we see that y ↦ F(y | X = x) satisfies:

• it is increasing: let y1 < y2 and choose rational qn ↓ y1 and rn ↓ y2 such that qn < rn for all n ∈ N; then F(y1 | X = x) = lim P(Y ≤ qn | X = x) ≤ lim P(Y ≤ rn | X = x) = F(y2 | X = x).

• it has limit 1 in +∞: for ε > 0 choose q0 such that P(Y ≤ q | X = x) ≥ 1 − ε for q ≥ q0. Realise that F(y | X = x) ≥ 1 − ε for y ≥ q0.

• it has limit 0 in −∞: for ε > 0 choose q0 such that P(Y ≤ q | X = x) ≤ ε for q ≤ q0. Then it is seen that F(y | X = x) ≤ ε for y ≤ q0 − 1, by choosing qn ↓ y such that qn ≤ q0 for all n ∈ N.

• it is right continuous: let y ∈ R and ε > 0. Choose q0 > 0 such that P(Y ≤ q | X = x) ≤ F(y | X = x) + ε for q ≤ y + q0. Realise that F(y′ | X = x) ≤ F(y | X = x) + ε for y′ ≤ y + q0/2 (just choose qn ↓ y′ with all qn ≤ y + q0).

Hence it is seen that y ↦ F(y | X = x) is a distribution function for some probability measure Px on (R, B) for each x ∈ M. For x ∉ M define

Px = P0,

where P0 is some probability measure on (R, B).

Now our claim is, that the constructed family of measures (Px)x∈X is the conditional distri- bution of Y given X. That Px is a probability measure for each x ∈ X is fulfilled by the construction, so it is only needed to check that for each B ∈ B

(i') x ↦ Px(B) is E − B–measurable

(ii') ∫_A Px(B) dX(P)(x) = P(X ∈ A, Y ∈ B) for all A ∈ E.

Let

H = {B ∈ B : (i') and (ii') are fulfilled}.

Then it is shown rather easily that H is a Dynkin class, and we will have shown that H = B if we can show that D ⊆ H, where D is the intersection–stable generating system for B given by

D = {(−∞, y] : y ∈ R}.

So let y ∈ R and choose a decreasing rational sequence qn ↓ y. Then

\[ P_x((-\infty, y]) = \begin{cases} F(y \mid X = x) & x \in M \\ P_0((-\infty, y]) & x \notin M \end{cases} = \begin{cases} \lim_{n\to\infty} P(Y \le q_n \mid X = x) & x \in M \\ P_0((-\infty, y]) & x \notin M. \end{cases} \]

This must be measurable as a function of x, since M ∈ E and each function x ↦ P(Y ≤ qn | X = x) is E − B–measurable. Furthermore we obtain (using X(P)(M) = 1)

\[ \begin{aligned} \int_A P_x((-\infty, y])\, dX(P)(x) &= \int_{A \cap M} F(y \mid X = x)\, dX(P)(x) \\ &= \lim_{n\to\infty} \int_{A \cap M} P(Y \le q_n \mid X = x)\, dX(P)(x) \\ &= \lim_{n\to\infty} P(X \in A \cap M, Y \le q_n) \\ &= P(X \in A \cap M, Y \le y) = P(X \in A, Y \le y). \end{aligned} \]

In the third equality we have used Theorem 1.5.1, since the maps x ↦ P(Y ≤ qn | X = x) are conditional probabilities of (Y ≤ qn) given X. Hence (i') and (ii') are shown for B = (−∞, y].

The result can be generalised to other random variables than the R–valued ones.

Definition 1.5.5. A measurable space (Y, K) is a Borel space, if there exist B0 ∈ B and a bijective, bi–measurable map ϕ : (Y, K) → (B0, B0 ∩ B).

It is in particular required that both ϕ and ϕ⁻¹ are measurable. Note that we consider B0 as a subspace of (R, B) equipped with the σ–algebra B0 ∩ B = {B0 ∩ B : B ∈ B}. The bi–measurability then amounts to

ϕ⁻¹(B) ∈ K for B ∈ B0 ∩ B,
ϕ(A) ∈ B0 ∩ B for A ∈ K.

Theorem 1.5.6. If Y is a random variable taking values in a Borel space (Y, K), then there exists a conditional distribution of Y given X.

Proof. Consider Z = ϕ(Y), where ϕ is given as in Definition 1.5.5. According to Theorem 1.5.4 there exists a conditional distribution (P̃x)x∈X of Z given X.

For all x ∈ X the probability measure P̃x is concentrated on B0 and can therefore be viewed as a probability measure on (B0, B0 ∩ B). The probability measure obtained from this by the transformation ϕ⁻¹ is a probability measure Px on (Y, K). It is easily seen that (Px)x∈X is a conditional distribution of Y given X.

Theorem 1.5.6 is useful because of the following fact:

Theorem 1.5.7. For all n ∈ N, (Rn, Bn) is a Borel space. Furthermore, (R∞, B∞) is a Borel space.

Sketch of proof. We shall almost show that ([0, 1]², [0, 1]² ∩ B²) is a Borel space. Consider the binary expansion of an arbitrary x ∈ [0, 1],

\[ x = \sum_{n=1}^{\infty} x_n \frac{1}{2^n} \ \text{ for } x < 1, \qquad x = 1 \cdot 2^0 + \sum_{n=1}^{\infty} 0 \cdot \frac{1}{2^n} \ \text{ for } x = 1, \]

where all xn ∈ {0, 1} and the expansion is uniquely determined by the requirement that the sequence (xn) must contain infinitely many 0's.

Now define a map φ : [0, 1] → [0, 1]² by φ(x) = (y1, y2), where

\[ y_1 = \sum_{n=1}^{\infty} x_{2n-1} \frac{1}{2^n}, \qquad y_2 = \sum_{n=1}^{\infty} x_{2n} \frac{1}{2^n}. \]

φ is surjective, but not injective (for instance, φ(1) = φ(0.101010101···)), but by removing countably many points from [0, 1] we may turn φ into a bi–measurable bijection onto ([0, 1]², [0, 1]² ∩ B²), the inverse of which, ϕ, satisfies the requirements given in Definition 1.5.5.

A more refined application of the same idea can be used to show that ([0, 1]^∞, [0, 1]^∞ ∩ B^∞) is a Borel space: consider φ(x) = (y1, y2, ...), where yn has the binary expansion given by the n'th row of the array

y1 : x0 x1 x3 x6 ···
y2 : x2 x4 x7 x11 ···
y3 : x5 x8 x12 x17 ···
y4 : x9 x13 x18 x24 ···
⋮

in which the digits x0, x1, x2, ... of x are laid out along the diagonals.
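The digit–interleaving idea can be made concrete at finite precision. A purely illustrative sketch of the map φ and its inverse from the proof (finite truncation, so the reconstruction is only approximate; this is not part of the formal argument):

```python
def digits(x, n=20):
    """First n binary digits of x in [0, 1)."""
    out = []
    for _ in range(n):
        x *= 2
        d = int(x)
        out.append(d)
        x -= d
    return out

def undigits(ds):
    return sum(d / 2 ** (i + 1) for i, d in enumerate(ds))

x = 0.638871
ds = digits(x)
y1, y2 = undigits(ds[0::2]), undigits(ds[1::2])   # phi(x) = (y1, y2)
print(y1, y2)

# Re-interleaving the digits of (y1, y2) recovers x up to the truncation error;
# this inverse corresponds to the map called phi in Definition 1.5.5.
back = [d for pair in zip(digits(y1, 10), digits(y2, 10)) for d in pair]
print(x, undigits(back))
```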

1.6 Exercises

Exercise 1.1. Assume that X1 and X2 are independent random variables that are both binomially distributed with parameters (n, p). Define the random variable X = X1 + X2.

Find the conditional distribution of X1 given X (Hint). ◦

Exercise 1.2. Let X and Y be random variables defined on (Ω, F,P ). Assume that

• X has the binomial distribution with parameters (n, p1)

• The conditional distribution of Y given X = x is binomial with parameters (n, p2)

Find the marginal distribution of Y and try to give an intuitive explanation of the result. ◦

Exercise 1.3.

(1) Assume that X and Y are two random variables defined on (Ω, F, P) with values in (X, E) and (Y, K) respectively, and let (Px)x∈X be the conditional distribution of Y given X. Let f : X × Y → [0, ∞) be an E ⊗ K–measurable function. Show that

\[ E[f(X, Y)] = \iint f(x, y)\, dP_x(y)\, dX(P)(x) \]

Hint: Use the extended Tonelli to calculate an integral with respect to (X,Y )(P ) as a double integral.

(2) Assume that X is uniformly distributed on (0, 1). Assume that the conditional distribution (Px)x∈(0,1) of Y given X fulfils that Px is the exponential distribution with mean value x. Find EY (Hint).

Exercise 1.4. Let Y be a finite or countable set, and let K consist of all subsets of Y. Assume that Y is a random variable defined on (Ω, F, P) with values in (Y, K). Let p(y) denote the probability function for Y. Let X be another finite or countable set, and assume that t : Y → X is some map. Define X = t(Y).

(1) Show that X has probability function

\[ r(x) = \sum_{y \in t^{-1}(x)} p(y) \]

(2) Show that (Px)x∈X is the conditional distribution of Y given X, where each Px has probability function

\[ q_x(y) = \frac{p(y)\, 1_{\{x\}}(t(y))}{r(x)} \]

Exercise 1.5. Let Y1,...,Yn be independent and identically distributed random variables defined on (Ω, F,P ) with values in {0, 1}. Assume that

P (Y1 = 0) = 1 − p , P (Y1 = 1) = p for some 0 < p < 1. Define t : {0, 1}n → {0, 1, . . . , n} by

t(y1, . . . , yn) = y1 + ··· + yn

Define X = t(Y1,...,Yn).

(1) Realise that X has the binomial distribution with parameters (n, p) and argue that P (X = x) > 0 for all x = 0, 1, . . . , n.

(2) Show that (Px)x=0,...,n is the conditional distribution of Y = (Y1,...,Yn) given X, n where Px is the uniform distribution on {(y1, . . . , yn) ∈ R : y1 + ··· + yn = x}.

Exercise 1.6. Let X and Y be random variables defined on (Ω, F,P ). Assume that

• X has the binomial distribution with parameters (n, p1).

• The conditional distribution of Y given X = x is binomial with parameters (x, p2). 1.6 Exercises 25

Find the marginal distribution of Y and try to give an intuitive explanation of the result. ◦

Exercise 1.7. Let X and Y be random variables with values in (X, E) and (Y, K) respectively, such that (Px)x∈X is the conditional distribution of Y given X. Assume that µ and ν are σ–finite measures on (X, E) and (Y, K). Assume furthermore that X(P) has density f with respect to µ, and that for each x ∈ X the probability measure Px has density gx with respect to ν, such that (x, y) ↦ gx(y) is E ⊗ K − B–measurable.

(1) Show that

\[ \ell(y) = \int g_x(y) f(x)\, \mu(dx) \]

is the density for the marginal distribution of Y with respect to ν.

(2) Show that Y (P )(B0) = 1, where B0 = {y ∈ Y : 0 < `(y) < ∞}.

(3) Show that the conditional distribution of X given Y exists and is given by (Qy)y∈Y, where Qy has density with respect to µ given by

\[ k_y(x) = \frac{g_x(y) f(x)}{\ell(y)} \]

for y ∈ B0.

Exercise 1.8. Assume that X is Gamma–distributed with parameters (λ, β) and that the conditional distribution of Y given X is given by (Px)x∈R, where Px is the Poisson distribution with parameter x.

(1) Show that the marginal distribution of Y is a negative binomial distribution and find the parameters.

(2) Show that the conditional distributions of X given Y are Γ–distributions.

Exercise 1.9. Let X and Y be real valued random variables defined on (Ω, F,P ). Let C ∈ B be a fixed subset of R. Consider the following game: We are told the value of X, and are based on this information supposed to guess whether Y ∈ C or not. 26 Conditional distributions

It seems natural to expect that we in two different games, where the same value of X is observed, give the same guess of whether Y ∈ C or not – we know the same in the two situations. Hence giving a rule for guessing must be the same as indicating a set A: If we observe X ∈ A then we guess that Y ∈ C, and if we observe X/∈ A, then we guess that Y/∈ C. Obviously, different choices of A may lead to more or less successful guessing rules (we define a guessing rule to be successful, if it often leads to the right guess...). Let (Px)x∈R be the conditional distribution of Y given X.

(1) Show that for a given guessing rule

\[ P(\text{right guess}) = \int_A P_x(C)\, dX(P)(x) + \int_{A^c} P_x(C^c)\, dX(P)(x) \]

(2) Show that the optimal guessing rule corresponds to the set

\[ A_0 = \Big\{x \in \mathbb{R} : P_x(C) \ge \frac{1}{2}\Big\}. \]

(3) How is the optimal guessing rule, if X and Y are independent?

(4) How is the optimal guessing rule, if X = Y ?

Exercise 1.10. Let X be a random variable with values in (X , E) that is defined on a probability space (Ω, F,P ). Let furthermore F ∈ F and consider the random variable 1F .

(1) Find the Markov kernel (Pz)z∈{0,1} that is the conditional distribution of X given 1F .

(2) Find the Markov kernel (Qx)x∈X that is the conditional distribution of 1F given X (Hint).

◦

Chapter 2

Conditional distributions: Transformations and moments

2.1 Transformations of conditional distributions

In this section we shall present a series of transformation results for conditional distributions. They have a somewhat similar content: in a framework with three or more random variables, where some of the conditional distributions are known, other conditional distributions can be expressed in terms of them. It is complicated to understand how conditional distributions are specified in situations with three or more random variables. The reader is encouraged to spend time understanding the content of the results, rather than the proofs. The stated results are not very surprising once the content is understood. And the proofs are rather mechanical: first it is argued that some expression is a Markov kernel, and then it is shown that this Markov kernel is the right conditional distribution.

Assume in this section, that X, Y , X1, X2, Y1 and Y2 are random variables defined on (Ω, F,P ) with values in (X , E), (Y, K), (X1, E1), (X2, E2), (Y1, K1) and (Y2, K2) respectively.

Theorem 2.1.1 (Substitution Theorem). Assume that (Px)x∈X is the conditional distribution of Y given X. Let (Z, H) be a measurable space, and let φ : X × Y → Z be a measurable map. Define Z = φ(X, Y). Then the conditional distribution of Z given X exists and is

determined by (P˜x)x∈X , where

P˜x = (φ ◦ ix)(Px)

Note that this is not at all surprising: If we know that X = x, then we have Z = φ(x, Y ) =

(φ ◦ ix)(Y ), and apparently we are allowed to plug the conditional distribution into this formula.

Proof. For a fixed C ∈ H we have

\[ \tilde{P}_x(C) = P_x\big((\varphi \circ i_x)^{-1}(C)\big) = P_x\big((\varphi^{-1}(C))^x\big), \]

which is a measurable function of x, since (Px)x∈X is a Markov kernel. Hence (P̃x)x∈X is a Markov kernel (each P̃x is a probability measure, since it is the image measure of Px under the map φ ∘ ix).

Now let A ∈ E and C ∈ H. Then

\[ P(X \in A, Z \in C) = (X, Y)(P)\big((A \times \mathcal{Y}) \cap \varphi^{-1}(C)\big). \]

It is seen that if x ∉ A then ((A × Y) ∩ φ⁻¹(C))^x = ∅, and if x ∈ A we have

\[ \big((A \times \mathcal{Y}) \cap \varphi^{-1}(C)\big)^x = (\varphi^{-1}(C))^x = (\varphi \circ i_x)^{-1}(C). \]

Hence

\[ P(X \in A, Z \in C) = \int_A P_x\big((\varphi \circ i_x)^{-1}(C)\big)\, dX(P)(x) = \int_A \tilde{P}_x(C)\, dX(P)(x), \]

which is what we wanted to prove.

Example 2.1.2. In this example we consider the p–dimensional normal distribution. Recall that this distribution is uniquely determined by the mean vector ξ ∈ Rᵖ and the covariance matrix Σ, which is a positive semidefinite p × p matrix. If X is a random vector in Rᵖ with this distribution, we write X ∼ Np(ξ, Σ). If A is a p × p matrix and b ∈ Rᵖ we have

\[ AX + b \sim N_p(b + A\xi,\; A \Sigma A^T) \tag{2.1} \]

where Aᵀ denotes the transpose of A. Let p = r + s with 1 ≤ r, 1 ≤ s, and let X1 be the first r coordinates of X and X2 the remaining s coordinates. In the following we write

\[ X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}, \qquad \xi = \begin{pmatrix} \xi_1 \\ \xi_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}, \]

where Σ12ᵀ = Σ21, since Σ is positive semidefinite and thereby symmetric. It is furthermore a well–known property of the normal distribution that X1 and X2 are independent if and only if Σ12 = 0. In the following we shall assume that Σ22 is positive definite such that Σ22⁻¹ exists.

The aim will be to find the conditional distribution of X1 given X2. For this define Z = X1 − Σ12Σ22⁻¹X2. Then we have (with Ir denoting the r–dimensional identity matrix)

\[ \begin{pmatrix} Z \\ X_2 \end{pmatrix} = \begin{pmatrix} I_r & -\Sigma_{12}\Sigma_{22}^{-1} \\ 0 & I_s \end{pmatrix} \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}. \]

Since

\[ \begin{aligned} \begin{pmatrix} I_r & -\Sigma_{12}\Sigma_{22}^{-1} \\ 0 & I_s \end{pmatrix} \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \begin{pmatrix} I_r & 0 \\ -\Sigma_{22}^{-1}\Sigma_{21} & I_s \end{pmatrix} &= \begin{pmatrix} I_r & -\Sigma_{12}\Sigma_{22}^{-1} \\ 0 & I_s \end{pmatrix} \begin{pmatrix} \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & \Sigma_{12} \\ 0 & \Sigma_{22} \end{pmatrix} \\ &= \begin{pmatrix} \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & 0 \\ 0 & \Sigma_{22} \end{pmatrix}, \end{aligned} \]

it follows from (2.1) that

\[ \begin{pmatrix} Z \\ X_2 \end{pmatrix} \sim N_{r+s}\!\left( \begin{pmatrix} \xi_1 - \Sigma_{12}\Sigma_{22}^{-1}\xi_2 \\ \xi_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & 0 \\ 0 & \Sigma_{22} \end{pmatrix} \right). \]

From this we see that Z and X2 are independent and that

\[ Z \sim N_r\big(\xi_1 - \Sigma_{12}\Sigma_{22}^{-1}\xi_2,\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\big). \]

Hence this normal distribution is also the conditional distribution of Z given X2 (Theorem 1.4.3). Then using the substitution X1 = Z + Σ12Σ22⁻¹X2 gives, according to Theorem 2.1.1 and (2.1), that

\[ X_1 \mid X_2 = x \;\sim\; N_r\big(\xi_1 + \Sigma_{12}\Sigma_{22}^{-1}(x - \xi_2),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\big). \qquad \circ \]
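The conditioning formula is immediately usable numerically. A small numpy sketch (the mean vector, covariance matrix and observed value are arbitrary illustrative numbers):

```python
import numpy as np

xi = np.array([1.0, 0.0, -1.0])           # mean of X = (X1, X2) with r = 1, s = 2
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
r = 1
S11, S12 = Sigma[:r, :r], Sigma[:r, r:]
S21, S22 = Sigma[r:, :r], Sigma[r:, r:]

x2 = np.array([0.5, -0.5])                # observed value of X2
K = S12 @ np.linalg.inv(S22)              # Sigma12 Sigma22^{-1}
cond_mean = xi[:r] + K @ (x2 - xi[r:])    # xi1 + Sigma12 Sigma22^{-1} (x - xi2)
cond_cov = S11 - K @ S21                  # Sigma11 - Sigma12 Sigma22^{-1} Sigma21
print(cond_mean, cond_cov)                # parameters of X1 | X2 = x2
```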

Example 2.1.3. Assume that X and Y are real valued variables such that the simultaneous distribution of (X, Y) is a Dirichlet distribution with parameters (λ1, λ2, λ). Then the distribution of (X, Y) has density

\[ f(x, y) = \frac{\Gamma(\lambda + \lambda_1 + \lambda_2)}{\Gamma(\lambda)\Gamma(\lambda_1)\Gamma(\lambda_2)}\, x^{\lambda_1 - 1} y^{\lambda_2 - 1} (1 - x - y)^{\lambda - 1} \]

on the set {(x, y) ∈ R² : 0 < x, 0 < y, x + y < 1}. It can be shown that the marginal distribution of X is a B–distribution with parameters (λ1, λ2 + λ). Hence it has density

\[ g(x) = \frac{\Gamma(\lambda + \lambda_1 + \lambda_2)}{\Gamma(\lambda_1)\Gamma(\lambda_2 + \lambda)}\, x^{\lambda_1 - 1} (1 - x)^{\lambda_2 + \lambda - 1} \]

for x ∈ (0, 1). For x ∈ (0, 1) the conditional distribution Px of Y given X = x must be concentrated on the interval (0, 1 − x) and have density

\[ f_x(y) = \frac{f(x, y)}{g(x)} = \frac{\Gamma(\lambda_2 + \lambda)}{\Gamma(\lambda)\Gamma(\lambda_2)} \Big( \frac{y}{1 - x} \Big)^{\!\lambda_2 - 1} \Big( 1 - \frac{y}{1 - x} \Big)^{\!\lambda - 1} \frac{1}{1 - x}. \]

If Px is transformed by the map y ↦ y/(1 − x), then a B–distribution with parameters (λ2, λ) is obtained. According to Theorem 2.1.1 the constant family consisting of B–distributions with parameters (λ2, λ), indexed by x ∈ (0, 1), must be the conditional distribution of Y/(1 − X) given X. It follows from Theorem 1.4.3 that Y/(1 − X) and X are independent and that Y/(1 − X) is B–distributed with parameters (λ2, λ). ◦

Theorem 2.1.4. Assume that (Px)x∈X is the conditional distribution of Y given X. Let (Z, H) be a measurable space and let t : X → Z be an E − H–measurable map. Define Z = t(X). Then the conditional distribution (Qx,z)(x,z)∈X×Z of Y given (X, Z) is given by

Qx,z = Px for all x ∈ X, z ∈ Z. (2.2)

Note: This is a situation where it is quite clear that conditional distributions are not uniquely determined. The variable (X, Z) does not take values in the entire X × Z but only on the graph of t, meaning the set of points

{(x, z) ∈ X × Z : z = t(x)}.

Then Qx,z could be defined as any probability measure outside the graph, if only some measurability conditions are fulfilled. Hence the Markov kernel defined in (2.2) is not the only possible conditional distribution of Y given (X, Z) – it is simply a convenient choice.

Proof. It is easily argued that (2.2) defines an (X × Z, E ⊗ H)–Markov kernel (Qx,z)(x,z)∈X×Z on (Y, K). For A ∈ E, B ∈ K and C ∈ H we have

\[ \begin{aligned} \int_{A \times C} Q_{x,z}(B)\, d(X, Z)(P)(x, z) &= \int 1_{A \times C}(x, z)\, Q_{x,z}(B)\, d\big((\mathrm{id}, t) \circ X\big)(P)(x, z) \\ &= \int 1_{A \times C}\big((\mathrm{id}, t)(x)\big)\, Q_{(\mathrm{id},t)(x)}(B)\, dX(P)(x) \\ &= \int 1_{A \cap t^{-1}(C)}(x)\, P_x(B)\, dX(P)(x). \end{aligned} \]

Since (Px)x∈X is the conditional distribution of Y given X, the last integral can be identified as

\[ P(X \in A \cap t^{-1}(C), Y \in B) = P(X \in A, Z \in C, Y \in B) = P((X, Z) \in A \times C, Y \in B). \]

By fixing B and letting A × C vary it is obtained (by uniqueness of measures) that

\[ \int_G Q_{x,z}(B)\, d(X, Z)(P)(x, z) = P((X, Z) \in G, Y \in B) \]

for all G ∈ E ⊗ H and all B ∈ K. Hence it is concluded that (Qx,z)(x,z)∈X×Z is the conditional distribution of Y given (X, Z).

Theorem 2.1.5. Let (Px)x∈X be the conditional distribution of Y given X. Let (Z, H) be a measurable space and let t : X → Z be an E − H–measurable map. Define Z = t(X). If a (Z, H)–Markov kernel (Qz)z∈Z on (Y, K) exists such that

Px = Qt(x) for all x ∈ X,

then (Qz)z∈Z is the conditional distribution of Y given Z.

A more relaxed formulation of this is that if the conditional distribution of Y given X only depends on X through t(X), then it is also the conditional distribution of Y given t(X).

Proof. Let C ∈ H and B ∈ K. According to the change-of-variable theorem we have

\[ \begin{aligned} P(Z \in C, Y \in B) &= P(X \in t^{-1}(C), Y \in B) = \int 1_{t^{-1}(C)}(x)\, P_x(B)\, dX(P)(x) \\ &= \int 1_C(t(x))\, Q_{t(x)}(B)\, dX(P)(x) = \int 1_C(z)\, Q_z(B)\, d(t \circ X)(P)(z) \\ &= \int_C Q_z(B)\, dZ(P)(z). \end{aligned} \]

Hence (Qz)z∈Z is the conditional distribution of Y given Z.

Theorem 2.1.6. Let Z be a random variable with values in (Z, H). Assume that (Qx,y)(x,y)∈X×Y is the conditional distribution of Z given (X, Y), and assume that (Px)x∈X is the conditional distribution of Y given X. Then (Rx)x∈X is the conditional distribution of Z given X, where

\[ R_x(C) = \int Q_{x,y}(C)\, dP_x(y) \quad \text{for } C \in \mathcal{H}. \]

Proof. For fixed x ∈ X we note that the reduced family (Qx,y)y∈Y is a (Y, K)–Markov kernel on (Z, H), and Rx is the mixture (the second marginal of the integrated measure) of this Markov kernel with respect to Px. In particular we see that Rx is a probability measure on (Z, H).

Choose C ∈ H. Since (Qx,y)(x,y)∈X×Y is a Markov kernel on (Z, H), we have that

(x, y) ↦ Qx,y(C)

is an E ⊗ K–measurable, non–negative map. Hence

\[ x \mapsto R_x(C) = \int Q_{x,y}(C)\, dP_x(y) \]

is E–measurable by Lemma 1.3.1, which means that (Rx)x∈X is an (X, E)–Markov kernel on (Z, H). Finally let A ∈ E and C ∈ H. Then the extended Tonelli theorem yields that

\[ \begin{aligned} P(X \in A, Z \in C) &= P(X \in A, Y \in \mathcal{Y}, Z \in C) = \int_{A \times \mathcal{Y}} Q_{x,y}(C)\, d(X, Y)(P)(x, y) \\ &= \iint 1_{A \times \mathcal{Y}}(x, y)\, Q_{x,y}(C)\, dP_x(y)\, dX(P)(x) = \int 1_A(x)\, R_x(C)\, dX(P)(x). \end{aligned} \]

Thereby it follows that (Rx)x∈X is the conditional distribution of Z given X.
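On finite state spaces Theorem 2.1.6 is plain array arithmetic: Rx averages the measures Qx,y over y according to Px. A small sketch (arbitrary illustrative kernels; numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(3)
nx, ny, nz = 3, 4, 5

P = rng.dirichlet(np.ones(ny), size=nx)        # P[x] = P_x, a kernel from X to Y
Q = rng.dirichlet(np.ones(nz), size=(nx, ny))  # Q[x, y] = Q_{x,y}, kernel from X x Y to Z

# R_x(C) = int Q_{x,y}(C) dP_x(y): for each x, average the rows Q[x, y] under P[x].
R = np.einsum('xy,xyz->xz', P, Q)
print(R.sum(axis=1))    # each row of R is again a probability measure (all ones)
```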

The next two theorems can be considered as one bi–implication saying that if X1 and X2 are independent, then independence between (X1, Y1) and (X2, Y2) can be expressed as a property of the conditional distribution of (Y1, Y2) given (X1, X2).

Theorem 2.1.7. Assume that the variables (X1, Y1) and (X2, Y2) are independent, and let (P^i_{xi})xi∈Xi be the conditional distribution of Yi given Xi, i = 1, 2. Then (Qx1,x2)(x1,x2)∈X1×X2 is the conditional distribution of (Y1, Y2) given (X1, X2), where

\[ Q_{x_1, x_2} = P^1_{x_1} \otimes P^2_{x_2}. \]

Proof. For each (x1, x2) ∈ X1 × X2 we obviously have that Qx1,x2 is a probability measure on (Y1 × Y2, K1 ⊗ K2). For a measurable product set B1 × B2, where B1 ∈ K1 and B2 ∈ K2, we have

\[ Q_{x_1,x_2}(B_1 \times B_2) = P^1_{x_1}(B_1)\, P^2_{x_2}(B_2). \]

And since both (x1, x2) ↦ P¹x1(B1) and (x1, x2) ↦ P²x2(B2) are E1 ⊗ E2–measurable, we must have that (x1, x2) ↦ Qx1,x2(B1 × B2) is measurable as well. According to Lemma 1.1.3 we then have that (x1, x2) ↦ Qx1,x2(G) is E1 ⊗ E2–measurable for all G ∈ K1 ⊗ K2. Hence (Qx1,x2)(x1,x2)∈X1×X2 is a Markov kernel.

Now let Ai ∈ Ei and Bi ∈ Ki for i = 1, 2. The assumption that (X1, Y1) and (X2, Y2) are independent will in particular make X1 and X2 independent, such that

\[ \int_{A_1 \times A_2} Q_{x_1,x_2}(B_1 \times B_2)\, d(X_1, X_2)(P)(x_1, x_2) = \int 1_{A_1}(x_1) 1_{A_2}(x_2)\, P^1_{x_1}(B_1)\, P^2_{x_2}(B_2)\, d\big(X_1(P) \otimes X_2(P)\big)(x_1, x_2). \]

Using Tonelli gives

\[ \begin{aligned} \int_{A_1 \times A_2} Q_{x_1,x_2}(B_1 \times B_2)\, d(X_1, X_2)(P)(x_1, x_2) &= \Big( \int 1_{A_1}(x_1)\, P^1_{x_1}(B_1)\, dX_1(P)(x_1) \Big) \Big( \int 1_{A_2}(x_2)\, P^2_{x_2}(B_2)\, dX_2(P)(x_2) \Big) \\ &= P(X_1 \in A_1, Y_1 \in B_1)\, P(X_2 \in A_2, Y_2 \in B_2) \\ &= P((X_1, Y_1) \in A_1 \times B_1, (X_2, Y_2) \in A_2 \times B_2) \\ &= P((X_1, X_2) \in A_1 \times A_2, (Y_1, Y_2) \in B_1 \times B_2). \end{aligned} \]

From having A1 and A2 fixed while varying B1 and B2 it is seen (from uniqueness of measures) that

\[ \int_{A_1 \times A_2} Q_{x_1,x_2}(G_2)\, d(X_1, X_2)(P)(x_1, x_2) = P((X_1, X_2) \in A_1 \times A_2, (Y_1, Y_2) \in G_2) \]

for all G2 ∈ K1 ⊗ K2. If we conversely fix G2 in this expression and let A1 and A2 vary, then we obtain

\[ \int_{G_1} Q_{x_1,x_2}(G_2)\, d(X_1, X_2)(P)(x_1, x_2) = P((X_1, X_2) \in G_1, (Y_1, Y_2) \in G_2) \]

for all G1 ∈ E1 ⊗ E2 and G2 ∈ K1 ⊗ K2. Hence (Qx1,x2)(x1,x2)∈X1×X2 is the conditional distribution of (Y1, Y2) given (X1, X2).

Theorem 2.1.8. Let the variables X1 and X2 be independent and let (Qx1,x2)(x1,x2)∈X1×X2 be the conditional distribution of (Y1, Y2) given (X1, X2). Assume that each Qx1,x2 factorises on the form

\[ Q_{x_1,x_2} = P^1_{x_1} \otimes P^2_{x_2} \]

for two families (P¹x1)x1∈X1 and (P²x2)x2∈X2 of probability measures on (Y1, K1) and (Y2, K2) respectively.

Then (P¹x1)x1∈X1 is the conditional distribution of Y1 given X1 and (P²x2)x2∈X2 is the conditional distribution of Y2 given X2. Furthermore (X1, Y1) and (X2, Y2) are independent.

Proof. Firstly, we will argue that (P¹x1)x1∈X1 is an (X1, E1)–Markov kernel on (Y1, K1). Let B1 ∈ K1 and choose a fixed x2 ∈ X2. Then

\[ P^1_{x_1}(B_1) = P^1_{x_1}(B_1)\, P^2_{x_2}(\mathcal{Y}_2) = Q_{x_1,x_2}(B_1 \times \mathcal{Y}_2). \]

Since (x1, x2) ↦ Qx1,x2(B1 × Y2) is E1 ⊗ E2–measurable and the inclusion map x1 ↦ (x1, x2) is E1 − E1 ⊗ E2–measurable, we can conclude that x1 ↦ P¹x1(B1) is E1–measurable. Similarly (P²x2)x2∈X2 is an (X2, E2)–Markov kernel on (Y2, K2).

For A1 ∈ E1 and B1 ∈ K1 we have

\[ \begin{aligned} P(X_1 \in A_1, Y_1 \in B_1) &= P((X_1, X_2) \in A_1 \times \mathcal{X}_2, (Y_1, Y_2) \in B_1 \times \mathcal{Y}_2) \\ &= \int_{A_1 \times \mathcal{X}_2} Q_{x_1,x_2}(B_1 \times \mathcal{Y}_2)\, d(X_1, X_2)(P)(x_1, x_2) \\ &= \int 1_{A_1}(x_1)\, P^1_{x_1}(B_1)\, d\big(X_1(P) \otimes X_2(P)\big)(x_1, x_2) \\ &= \int_{A_1} P^1_{x_1}(B_1)\, dX_1(P)(x_1), \end{aligned} \]

where we have used Tonelli's theorem. Hence (P¹x1)x1∈X1 is the conditional distribution of Y1 given X1. Similarly, it is seen that (P²x2)x2∈X2 is the conditional distribution of Y2 given X2.

For Ai ∈ Ei and Bi ∈ Ki, i = 1, 2, we have

\[ \begin{aligned} P((X_1, Y_1) \in A_1 \times B_1, (X_2, Y_2) \in A_2 \times B_2) &= P((X_1, X_2) \in A_1 \times A_2, (Y_1, Y_2) \in B_1 \times B_2) \\ &= \int_{A_1 \times A_2} Q_{x_1,x_2}(B_1 \times B_2)\, d(X_1, X_2)(P)(x_1, x_2) \\ &= \int 1_{A_1}(x_1) 1_{A_2}(x_2)\, P^1_{x_1}(B_1)\, P^2_{x_2}(B_2)\, d\big(X_1(P) \otimes X_2(P)\big)(x_1, x_2) \\ &= \Big( \int_{A_1} P^1_{x_1}(B_1)\, dX_1(P)(x_1) \Big) \Big( \int_{A_2} P^2_{x_2}(B_2)\, dX_2(P)(x_2) \Big) \\ &= P(X_1 \in A_1, Y_1 \in B_1)\, P(X_2 \in A_2, Y_2 \in B_2) \\ &= P((X_1, Y_1) \in A_1 \times B_1)\, P((X_2, Y_2) \in A_2 \times B_2). \end{aligned} \]

From fixing A1 and B1 and letting A2 and B2 vary we obtain (by using the uniqueness of measures)

\[ P((X_1, Y_1) \in A_1 \times B_1, (X_2, Y_2) \in G_2) = P((X_1, Y_1) \in A_1 \times B_1)\, P((X_2, Y_2) \in G_2) \]

for all A1 ∈ E1, B1 ∈ K1 and G2 ∈ E2 ⊗ K2. And by fixing G2 and letting A1 and B1 vary, we obtain that

\[ P((X_1, Y_1) \in G_1, (X_2, Y_2) \in G_2) = P((X_1, Y_1) \in G_1)\, P((X_2, Y_2) \in G_2) \]

for all G1 ∈ E1 ⊗ K1 and G2 ∈ E2 ⊗ K2. Thereby it is seen that (X1, Y1) and (X2, Y2) are independent.

2.2 Conditional moments

Recall (see appendix) that if X and Y are random variables defined on (Ω, F, P) with values in (X, E) and (R, B) respectively, and furthermore E|Y| < ∞, then there exists a random variable E(Y|X) satisfying 1)–3) in Theorem A.2.6:

1) E(Y | X) is σ(X)-measurable

2) E|E(Y | X)| < ∞

3) For all A ∈ E it holds that

\[ \int_{(X \in A)} E(Y \mid X)\, dP = \int_{(X \in A)} Y\, dP, \]

and the fact that E(Y|X) is σ(X)–measurable is equivalent to the existence of a measurable map φ : (X, E) → (R, B) such that

E(Y |X) = φ(X) P almost surely .

We call φ(x) the conditional expectation of Y given X = x and write

φ(x) = E(Y |X = x) .

We have the following important (and notationally convenient) result saying that the conditional expectation E(Y|X = x) can be calculated as the expectation in the conditional distribution.

Theorem 2.2.1. Assume that X and Y are random variables defined on (Ω, F,P ) and with values in (X , E) and (R, B) respectively. Let (Px)x∈X be the conditional distribution of Y given X.

If E|Y | < ∞, then X(P )(A0) = 1, where

A0 = {x ∈ X : Px has finite first order moment} .

Define for x ∈ A0

\[ E(Y \mid X = x) = \int y\, dP_x(y). \]

Then the function x ↦ E(Y | X = x) is a conditional expectation of Y given X.

Note: The last result above could be understood as follows: define the function φ : X → R by

\[ \varphi(x) = 1_{A_0}(x) \int y\, dP_x(y). \]

Then the random variable φ(X) satisfies 1)–3) in Theorem A.2.6, such that it is a conditional expectation of Y given X.

Proof. Consider the function f : X × R → R given by f(x, y) = y. Since

\[ \int |f(x, y)|\, d(X, Y)(P)(x, y) = \int |f(X, Y)|\, dP = E|Y| < \infty, \]

it follows from the extended Fubini theorem that

\[ \int |y|\, dP_x(y) = \int |f(x, y)|\, dP_x(y) < \infty \]

for X(P) almost all x ∈ X, such that X(P)(A0) = 1.

Define $\varphi:\mathcal X\to\mathbb R$ by
$$\varphi(x)=\begin{cases}\int y\,dP_x(y)\,, & x\in A_0\\ 0\,, & x\notin A_0\end{cases}$$

Then according to the extended Fubini, we have that $\varphi$ is $\mathbb E-\mathbb B$–measurable. We will argue that $\varphi(X)$ is a conditional expectation of $Y$ given $X$ by verifying the conditions 1)–3). Since $\varphi$ is measurable we have that $\varphi(X)$ is $\sigma(X)$–measurable. Furthermore (using the change-of-variable theorem, Theorem A.1.8)

$$\begin{aligned}
E|\varphi(X)|&=\int|\varphi(X)|\,dP=\int|\varphi(x)|\,dX(P)(x)=\int_{A_0}\Big|\int y\,dP_x(y)\Big|\,dX(P)(x)\\
&\le\int\Big(\int|y|\,dP_x(y)\Big)\,dX(P)(x)=\int|y|\,d(X,Y)(P)(x,y)<\infty
\end{aligned}$$

In the last equality we have used the extended Tonelli. This shows that 2) is satisfied for $\varphi(X)$. Finally we have for $A\in\mathbb E$

$$\begin{aligned}
\int_{(X\in A)}\varphi(X)\,dP&=\int_A\varphi(x)\,dX(P)(x)\\
&=\int_{A\cap A_0}\int y\,dP_x(y)\,dX(P)(x)\\
&=\iint 1_{A\cap A_0}(x)\,y\,dP_x(y)\,dX(P)(x)\\
&=\int 1_{A\cap A_0}(x)\,y\,d(X,Y)(P)(x,y)\\
&=\int 1_{A\cap A_0}(X)\,Y\,dP\\
&=\int_{(X\in A\cap A_0)}Y\,dP=\int_{(X\in A)}Y\,dP
\end{aligned}$$

In the fourth equality we have used the extended Fubini's theorem. This shows that also condition 3) is fulfilled, such that $\varphi(X)$ is a conditional expectation of $Y$ given $X$, so we in particular have

$$E(Y\mid X=x)=\varphi(x)=\int y\,dP_x(y)$$

for $x\in A_0$.

In the framework of theorem 2.2.1, where $E|Y|<\infty$, we have that
$$\int y\,dP_x(y)=E(Y\mid X=x)$$

Notation: Since $\varphi(x)=1_{A_0}(x)\int y\,dP_x(y)$ is a measurable function of $x$ according to the extended Fubini, it makes sense to consider the random variable $\varphi(X)$ and write

$$E(Y\mid X)=\varphi(X)$$

As noted in the comment after Theorem 2.2.1 this is a version of the "ordinary" conditional expectation of $Y$ given $X$. The following result, which is already known for conditional expectations, can be shown using the conditional distribution results from the proof of Theorem 2.2.1:

Theorem 2.2.2. Assume that $X$ and $Y$ are random variables defined on $(\Omega,\mathcal F,P)$ with values in $(\mathcal X,\mathbb E)$ and $(\mathbb R,\mathbb B)$ respectively. If $E|Y|<\infty$, then $E|E(Y\mid X)|<\infty$ and
$$E(E(Y\mid X))=EY$$

Figure 2.1: 1000 points simulated from the distribution of $(X,Y)$. $EY$ describes the center of all points projected to the $y$–axis (here this is $\approx 0$). $E(Y\mid X=x)$ describes the mean of the points that have first coordinate $x$ (here this mean will be $\approx -1.5$).

Proof. In the proof of 2.2.1 we saw that $\varphi(X)=E(Y\mid X)$ is integrable, and that
$$\int_{(X\in A)}\varphi(X)\,dP=\int_{(X\in A)}Y\,dP$$
for all $A\in\mathbb E$. Simply let $A=\mathcal X$. Then
$$E(E(Y\mid X))=\int\varphi(X)\,dP=\int Y\,dP=EY$$

Theorem 2.2.3. Assume that $X$ and $Y$ are random variables defined on $(\Omega,\mathcal F,P)$ with values in $(\mathcal X,\mathbb E)$ and $(\mathcal Y,\mathbb K)$ respectively. Suppose that the conditional distribution $(P_x)_{x\in\mathcal X}$ of $Y$ given $X$ exists. Let $\varphi:\mathcal X\times\mathcal Y\to\mathbb R$ be a measurable function and define $Z=\varphi(X,Y)$. Assume that $E|Z|<\infty$. Then
$$E(Z\mid X=x)=\int\varphi(x,y)\,dP_x(y)$$
for $X(P)$ almost all $x\in\mathcal X$.

Proof. According to Theorem 2.1.1 the conditional distribution of $Z$ given $X$ is given by the Markov kernel $(\tilde P_x)_{x\in\mathcal X}$, where

$$\tilde P_x=(\varphi\circ i_x)(P_x)\,.$$

Then according to Theorem 2.2.1 we have for $X(P)$ almost all $x\in\mathcal X$

$$E(Z\mid X=x)=\int z\,d\tilde P_x(z)=\int(\varphi\circ i_x)(y)\,dP_x(y)=\int\varphi(x,y)\,dP_x(y)$$

Corollary 2.2.4. Assume that $X$ and $Y$ are independent random variables defined on $(\Omega,\mathcal F,P)$ with values in $(\mathcal X,\mathbb E)$ and $(\mathcal Y,\mathbb K)$ respectively. Let $\varphi:\mathcal X\times\mathcal Y\to\mathbb R$ be a measurable function and define $Z=\varphi(X,Y)$. Assume that $E|\varphi(X,Y)|<\infty$. Then
$$E(\varphi(X,Y)\mid X=x)=\int\varphi(x,y)\,dY(P)(y)$$
for $X(P)$ almost all $x\in\mathcal X$.

Proof. By independence, the constant family $(Y(P))_{x\in\mathcal X}$ is the conditional distribution of $Y$ given $X$, so the claim follows from Theorem 2.2.3.

We can also use Theorem 2.2.3 to show the following result, which is well known from the framework of conditional expectations:

Corollary 2.2.5. Assume that $Y$ and $Z$ are real valued random variables with $E|Y|<\infty$ and $E|Z|<\infty$. Let $X$ be a random variable with values in $(\mathcal X,\mathbb E)$. Then

$$E(Y+Z\mid X=x)=E(Y\mid X=x)+E(Z\mid X=x)$$

for $X(P)$ almost all $x\in\mathcal X$.

Proof. Let $(P_x)_{x\in\mathcal X}$ be the conditional distribution of $(Y,Z)$ given $X$. Then, by Theorem 2.2.3,
$$E(Y+Z\mid X=x)=\int(y+z)\,dP_x(y,z)$$
for $X(P)$ almost all $x\in\mathcal X$. And
$$E(Y\mid X=x)=\int y\,dP_x(y,z)\,,\qquad E(Z\mid X=x)=\int z\,dP_x(y,z)$$
for $X(P)$ almost all $x\in\mathcal X$.

If $Y$ is real valued with $EY^2<\infty$, and $(P_x)_{x\in\mathcal X}$ is the conditional distribution of $Y$ given $X$, then we can define the conditional variance of $Y$ given $X=x$ by

$$V(Y\mid X=x)=\int y^2\,dP_x(y)-\Big(\int y\,dP_x(y)\Big)^2$$

which will be well–defined for $X(P)$ almost all $x\in\mathcal X$. Letting $V(Y\mid X)$ be the composition of $X$ and $x\mapsto V(Y\mid X=x)$ gives

$$V(Y\mid X)=E(Y^2\mid X)-E(Y\mid X)^2$$

Theorem 2.2.6. Let $X$ and $Y$ be random variables defined on $(\Omega,\mathcal F,P)$ with values in $(\mathcal X,\mathbb E)$ and $(\mathbb R,\mathbb B)$ respectively. If $EY^2<\infty$, then

$$VY=E\,V(Y\mid X)+V\,E(Y\mid X)\,.$$

Proof.

$$\begin{aligned}
E\,V(Y\mid X)+V\,E(Y\mid X)&=E\big(E(Y^2\mid X)-E(Y\mid X)^2\big)+E\big(E(Y\mid X)^2\big)-\big(E\,E(Y\mid X)\big)^2\\
&=E\,E(Y^2\mid X)-\big(E\,E(Y\mid X)\big)^2=EY^2-(EY)^2=VY\,,
\end{aligned}$$

where the last equality uses Theorem 2.2.2 twice.

Example 2.2.7. Consider Figure 2.2. The variance $VY$ measures how much the projection of all points onto the $y$–axis varies around their center. $V(Y\mid X=x)$ measures how much the points that have first coordinate $x$ vary around their center. The two expressions are normally not particularly close. In this example $VY$ is rather big, while $V(Y\mid X=x)$ is small for all $x$. ◦
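The decomposition in Theorem 2.2.6 is easy to check by simulation. Below is a small Python sketch under an assumed toy model (not from the notes): $X\sim\mathcal N(0,1)$ and $P_x=\mathcal N(x,1)$, so that $E(Y\mid X)=X$ and $V(Y\mid X)=1$.

```python
# Sketch: numerical check of VY = E V(Y|X) + V E(Y|X).
# Assumed model: X ~ N(0,1), Y | X = x ~ N(x,1), so the right hand side is 1 + V(X).
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
X = rng.normal(0, 1, n)
Y = rng.normal(loc=X, scale=1.0)

print(Y.var())            # empirical VY, close to 2
print(1.0 + X.var())      # E V(Y|X) + V E(Y|X), also close to 2
```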

Example 2.2.8. In example 2.1.2 we studied the situation where

$$\begin{pmatrix}X_1\\X_2\end{pmatrix}\sim\mathcal N_{r+s}\left(\begin{pmatrix}\xi_1\\\xi_2\end{pmatrix},\begin{pmatrix}\Sigma_{11}&\Sigma_{12}\\\Sigma_{21}&\Sigma_{22}\end{pmatrix}\right),$$

and we found that

$$X_1\mid X_2=x\;\sim\;\mathcal N_r\big(\xi_1+\Sigma_{12}\Sigma_{22}^{-1}(x-\xi_2)\,,\;\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\big)$$

If we assume that $X_1$ is one–dimensional, we have defined $E(X_1\mid X_2=x)$ and $V(X_1\mid X_2=x)$. Since conditional expectations and conditional variances are calculated as expectations and variances in the conditional distributions, we have

$$E(X_1\mid X_2=x)=\xi_1+\Sigma_{12}\Sigma_{22}^{-1}(x-\xi_2)\,,\qquad V(X_1\mid X_2=x)=\Sigma_{11}-\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$$

Note that the conditional variance does not depend on $x$ but is different from $V(X_1)$. ◦
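For concreteness, the formulas above can be evaluated directly. The following Python sketch does this for an assumed two-dimensional example; all numbers are made up for illustration.

```python
# Sketch: conditional moments of Example 2.2.8 with X1 and X2 both one-dimensional.
import numpy as np

xi = np.array([1.0, 2.0])                       # assumed (xi_1, xi_2)
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])                  # assumed blocks S11, S12; S21, S22

S11, S12, S21, S22 = Sigma[0, 0], Sigma[0, 1], Sigma[1, 0], Sigma[1, 1]

def cond_mean(x):
    # E(X1 | X2 = x) = xi_1 + S12 S22^{-1} (x - xi_2)
    return xi[0] + S12 / S22 * (x - xi[1])

cond_var = S11 - S12 / S22 * S21                # V(X1 | X2 = x), free of x

print(cond_mean(3.0), cond_var)
```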

Confusing conditional variances and ordinary variances is a quite common mistake, and one that may lead to substantial problems.

2.3 Exercises

Exercise 2.1. Assume that $X$ is uniformly distributed on $(0,1)$ and that the conditional distribution of $Y$ given $X=x$ is a binomial distribution with parameters $(n,x)$. We could say that $Y$ has a binomial distribution with fixed length $n$ and random probability parameter.

(1) What are the possible values of Y ? Argue that E|Y | < ∞.

(2) Find E(Y | X = x) and E(Y | X).

(3) Find EY .

(4) Find P (Y = k) for all k being a possible value of Y . What is the marginal distribution of Y ?

◦

Exercise 2.2. Let $X$ and $Y$ be random variables with values in $(\mathcal X,\mathbb E)$ and $(\mathcal Y,\mathbb K)$ respectively. Assume that $(P_x)$ is the conditional distribution of $Y$ given $X$. Let
$$A_0=\Big\{x\in\mathcal X \;\Big|\; \int|y|\,P_x(dy)<\infty\Big\}$$
and assume that $X(P)(A_0)=1$. Define

$$\varphi(x)=1_{A_0}(x)\int|y|\,P_x(dy)\,.$$

Show that Eφ(X) = E|Y | and conclude that

Eφ(X) < ∞ if and only if E|Y | < ∞

(Hint). ◦

Exercise 2.3. Assume that X has the exponential distribution with mean 1, and assume that the conditional distribution of Y given X = x is a Poisson distribution with parameter x. We could say that Y is Poisson distributed with random parameter.

(1) Use Exercise 2.2 to argue that E|Y | < ∞.

(2) Find E(Y | X = x) and E(Y | X).

(3) Find EY .

(4) Find P (Y = k) for all k being a possible value of Y . What is the marginal distribution of Y ?

Exercise 2.4. Let X and Y be independent random variables that both have the uniform distribution on (0, 1). Define Z = XY .

(1) Find the conditional distribution of Z given X.

(2) What are the possible values of Z? Argue that E|Z| < ∞.

(3) Find E(Z | X) and use this to find EZ.

(4) Find $EZ$ without using conditional distributions.

Exercise 2.5. Assume that $X_1,X_2,\ldots$ is a sequence of independent and identically distributed random variables such that $E|X_1|<\infty$. Assume that $N$ is a random variable with values in $\mathbb N$ such that $EN<\infty$. Assume that $N$ and $(X_1,X_2,\ldots)$ are independent (we consider $(X_1,X_2,\ldots)$ as a random variable with values in $(\mathbb R^\infty,\mathbb B^\infty)$). Define the random variable $Y$ by
$$Y=\sum_{k=1}^N X_k$$

(1) Show that the conditional distribution $(P_n)_{n\in\mathbb N}$ of $Y$ given $N$ is determined such that $P_n$ is the distribution of $\sum_{k=1}^n X_k$. Argue similarly that the conditional distribution $(Q_n)_{n\in\mathbb N}$ of $\sum_{k=1}^N|X_k|$ given $N$ is determined such that $Q_n$ is the distribution of $\sum_{k=1}^n|X_k|$.

(2) Show that
$$\int|y|\,Q_n(dy)=nE|X_1|$$
for all $n\in\mathbb N$.

(3) Use (2) and Exercise 2.2 to obtain that
$$E\Big(\sum_{k=1}^N|X_k|\Big)=EN\,E|X_1|<\infty$$

(4) Show that $E(Y\mid N=n)=nEX_1$ and that $EY=EN\,EX_1$.
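A small simulation sketch of the identity in (4). The concrete distributions are assumptions made for illustration (exponential $X_k$ and a shifted Poisson $N$); they do not come from the exercise.

```python
# Sketch: EY = EN * E(X_1) for a random sum Y = X_1 + ... + X_N.
import numpy as np

rng = np.random.default_rng(9)
reps = 50_000
N = rng.poisson(3.0, reps) + 1                # values in {1, 2, ...}, EN = 4
Y = np.array([rng.exponential(1.0, n).sum() for n in N])

print(Y.mean(), N.mean() * 1.0)               # both close to EN * E(X_1) = 4
```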

Exercise 2.6. Let f and g be densities for distributions on [0, ∞). Assume that there exists a constant c > 0 such that

f(x) ≤ cg(x) for all x ∈ [0, ∞)

Think of a situation where we want to simulate random variables with a distribution that has density $f$, but where $f$ is so complicated that this is not straightforward to do directly. Suppose on the other hand that $g$ is a simple well–known density that we actually can simulate from. An algorithm to produce a random variable $X$ with density $f$ is the acceptance–rejection algorithm:

(i) Generate Y with density g and U uniform on (0, 1) such that Y ⊥⊥ U

(ii) If U ≤ f(Y )/(cg(Y )), let X = Y . Otherwise return to (i)

The idea of this exercise is to show that X generated in the algorithm above actually has density f.
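Before turning to the proof, here is a hedged implementation sketch of the algorithm. The concrete densities are assumptions made for illustration: $f$ is the Beta(2,2) density (extended by 0 outside $(0,1)$), $g$ is the uniform density on $(0,1)$, and $c=1.5$, so that $f\le cg$ holds.

```python
# Sketch: acceptance-rejection sampling as in steps (i)-(ii) of Exercise 2.6.
# Assumed target: f = Beta(2,2) density, proposal g = Uniform(0,1), c = 1.5.
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    return 6.0 * x * (1.0 - x)        # Beta(2,2) density; sup f = 1.5

def accept_reject(n, c=1.5):
    out = []
    while len(out) < n:
        y = rng.uniform(0, 1)         # step (i): Y with density g (here g = 1)
        u = rng.uniform(0, 1)         # step (i): U uniform, independent of Y
        if u <= f(y) / (c * 1.0):     # step (ii): accept when U <= f(Y)/(c g(Y))
            out.append(y)
    return np.array(out)

sample = accept_reject(10_000)
print(sample.mean())                  # Beta(2,2) has mean 1/2
```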

So let $Y$ have density $g$ and let $U$ be uniform on $(0,1)$. Assume that $Y$ and $U$ are independent. Define the random variable
$$Z=\begin{cases}1\,, & U\le f(Y)/(cg(Y))\\ 0\,, & U> f(Y)/(cg(Y))\end{cases}$$

(1) Show that $P(Z=1)=\frac1c$ (Hint).

(2) Show that $P(Y\in B\mid Z=1)=\int_B f(x)\,dx$ for all $B\in\mathbb B$.

(3) Conclude that the algorithm produces a variable $X$ with density $f$, and discuss which value of $c$ we should choose.

Exercise 2.7. Think of a situation where we want to estimate the value $z$ that is given by $z=EZ$ for some real valued random variable $Z$. Let $Z_1,Z_2,\ldots,Z_n$ be independent replications of $Z$. Then
$$\hat z_n^1=\frac1n\sum_{k=1}^n Z_k$$
is an estimator for $z$.

(1) Show that $\hat z_n^1$ is unbiased,
$$E\hat z_n^1=z\,,$$
and find the variance $V\hat z_n^1$.

A method to improve the estimator could be finding some random variable $X$ and considering the new variable $E(Z\mid X)$. Now let $(Z_1,X_1),\ldots,(Z_n,X_n)$ be independent replications of $(Z,X)$, and define the estimator
$$\hat z_n^2=\frac1n\sum_{k=1}^n E(Z_k\mid X_k)$$

(2) Show that $\hat z_n^2$ is unbiased and that

$$V\hat z_n^2\le V\hat z_n^1$$

Apparently this method will improve the estimator no matter which variable X we choose. But of course some choices may be more clever than others.

(3) What happens if we use $X=1$ (or some other constant), and why is this a bad idea anyway?

We will consider two specific examples of variables $Z$. In both examples we shall just let $n=1$, since increasing values of $n$ simply make both variances smaller by a factor $1/n$, and thereby do not change anything in the comparison. In the first example we shall find estimators for the very well–known value $\pi$ (although we already know $\pi$ much more accurately than we will ever be able to estimate it, the example serves as a very good illustration of what is going on). Let

$$Z=4\cdot 1_{(U_1^2+U_2^2\le 1)}\,,$$
where $U_1$ and $U_2$ are independent and both uniform on $(0,1)$. Define the first estimator $\hat z_1=Z$.

(4) Show that $E\hat z_1=\pi$ (Hint).

Define the estimator $\hat z_2$ by

$$\hat z_2=E(Z\mid U_1)$$

(5) Show that $\hat z_2=4\sqrt{1-U_1^2}$.

(6) Try to simulate 10000 replications of both $\hat z_1$ and $\hat z_2$. Compare the variances, and also compare with the theoretical variance of $\hat z_1$.
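A possible simulation along the lines of (6) could look as follows (Python, with an assumed seed and sample size):

```python
# Sketch: comparing the raw estimator zhat1 = Z with zhat2 = E(Z | U1).
import numpy as np

rng = np.random.default_rng(3)
n = 10_000
U1 = rng.uniform(0, 1, n)
U2 = rng.uniform(0, 1, n)

z1 = 4.0 * (U1**2 + U2**2 <= 1.0)    # zhat1 replications
z2 = 4.0 * np.sqrt(1.0 - U1**2)      # zhat2 replications

print(z1.mean(), z1.var())  # mean ~ pi; theoretical variance 16*(pi/4)*(1-pi/4) ~ 2.70
print(z2.mean(), z2.var())  # mean ~ pi; noticeably smaller variance (~ 0.80)
```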

In the next example, the estimation has some real practical use. Assume that $X_1$ and $X_2$ are independent and have distribution $\nu$. Assume that $X_1,X_2\ge 0$ and that $\nu$ has density $f$ with respect to the Lebesgue measure. Furthermore think of a situation where the distribution of $S=X_1+X_2$ is complicated to calculate. We are interested in estimating

$$z(x)=P(S>x)$$

especially for large values of $x$. The simple estimator will in this framework be

$$\hat z_1(x)=1_{(X_1+X_2>x)}$$

The problem is that if $x$ is very large, then it is very rare that this estimator is non–zero, even if we make many replications. Instead we shall try to construct an estimator using conditional expectations. Firstly, we try something similar to above. Define

$$\hat z_2(x)=P(S>x\mid X_1)$$

(7) Show that
$$\hat z_2(x)=\bar F(x-X_1)\,,$$
where $\bar F$ is the survival function for $\nu$:

$$\bar F(x)=\nu((x,\infty))\,.$$

Let

$$X_{(1)}=\min\{X_1,X_2\}\quad\text{and}\quad X_{(2)}=\max\{X_1,X_2\}$$

(8) Show that the conditional distribution of $X_{(2)}$ given $X_{(1)}$ is determined by the Markov kernel $(P_y)_{y\ge 0}$, where
$$P_y(B)=\frac{\nu(B\cap(y,\infty))}{\nu((y,\infty))}$$
(Hint).

We now define a conditional estimator by

$$\hat z_3(x)=P(S>x\mid X_{(1)})$$

(9) Show that
$$\hat z_3(x)=\frac{\bar F\big(\max\{x-X_{(1)},X_{(1)}\}\big)}{\bar F(X_{(1)})}\,.$$

Now assume that $\nu$ is the Weibull distribution with shape parameter $0.5$. Then the density $f$ is given by
$$f(x)=0.5\,x^{-0.5}e^{-x^{0.5}}\quad\text{for } x>0\,,$$
and $\bar F$ is
$$\bar F(x)=e^{-x^{0.5}}$$

(10) Simulate 10000 replications of the three estimators (with e.g. x = 20 and x = 50) and compare the variances.
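A possible implementation of (10) could look as follows. The inverse transform used to simulate the Weibull variables is an assumption of this sketch, not part of the exercise.

```python
# Sketch: the three estimators of z(x) = P(X1 + X2 > x), Weibull(shape 0.5).
import numpy as np

rng = np.random.default_rng(4)
n, x = 10_000, 20.0

def Fbar(t):
    # survival function exp(-sqrt(t)); equal to 1 for t <= 0 since nu lives on [0, inf)
    return np.exp(-np.sqrt(np.maximum(t, 0.0)))

# Inverse transform: if U ~ Unif(0,1) then (-log U)^2 has survival exp(-sqrt(x)).
X1 = (-np.log(rng.uniform(size=n)))**2
X2 = (-np.log(rng.uniform(size=n)))**2

z1 = (X1 + X2 > x).astype(float)                      # naive indicator
z2 = Fbar(x - X1)                                     # conditioning on X1
Xmin = np.minimum(X1, X2)
z3 = Fbar(np.maximum(x - Xmin, Xmin)) / Fbar(Xmin)    # conditioning on X_(1)

for z in (z1, z2, z3):
    print(z.mean(), z.var())                          # same mean, shrinking variance
```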

Exercise 2.8. Let X be a real valued random variable with E|X| < ∞.

(1) Show that the conditional distribution of $X$ given $X$ is given by the Markov kernel $(\delta_x)_{x\in\mathbb R}$, where $\delta_x$ is the Dirac measure in $x$:
$$\delta_x(B)=\begin{cases}1\,, & x\in B\\ 0\,, & x\notin B\end{cases}$$

(2) Show that E(X | X = x) = x and E(X | X) = X.

(3) Assume that Y is another real valued random variable with E|Y | < ∞ and E|XY | < ∞. Show that E(XY | X = x) = xE(Y | X = x) and E(XY | X) = XE(Y | X).

Exercise 2.9. Let $W$ be the set $(0,1)^2$. Assume that we generate $N$ points in $W$ in the following way:

• Let N be Poisson distributed with parameter λ.

• Let $(U_1^1,U_1^2),(U_2^1,U_2^2),\ldots,(U_N^1,U_N^2)$ be independent and identically distributed such that $U_k^1$ and $U_k^2$ are independent and uniformly distributed on $(0,1)$. This makes each $(U_k^1,U_k^2)$ uniformly distributed on $W$.

In this exercise we will show that the collection of points $(U_1^1,U_1^2),\ldots,(U_N^1,U_N^2)$ in $W$ is a Poisson process on $W$: Define for a subset $A\subseteq W$ the random variable $N(A)$ to be the number of points in $A$:
$$N(A)=\sum_{k=1}^N 1_{((U_k^1,U_k^2)\in A)}$$

• $N(A)$ is Poisson distributed with parameter $\lambda m_2(A)$, where $m_2(A)$ is the area (2–dimensional Lebesgue measure) of $A$.

• For disjoint sets A1,...,Am the variables N(A1),...,N(Am) are independent.

The result will follow by finding conditional distributions given $N$.

(1) Show that for $U^1$ and $U^2$ independent and uniformly distributed on $(0,1)$ and $A$ some subset of $W$, then

$$P((U^1,U^2)\in A)=m_2(A)$$

(2) Let $A_1,\ldots,A_m$ be disjoint subsets of $W$ such that $\bigcup_{j=1}^m A_j=W$. Argue that the conditional distribution of $(N(A_1),\ldots,N(A_m))$ given $N=n$ is a multinomial distribution with length $n$ and probability parameters $(m_2(A_1),\ldots,m_2(A_m))$ (Hint).

(3) Show that $N(A_1),\ldots,N(A_m)$ are independent and that each $N(A_j)$ is Poisson distributed with parameter $\lambda m_2(A_j)$ (Hint).
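A small simulation sketch of the construction, with an assumed intensity $\lambda$ and the partition of $W$ into its left and right halves:

```python
# Sketch: N ~ Poisson(lambda) uniform points on W = (0,1)^2, counted over a partition.
import numpy as np

rng = np.random.default_rng(5)
lam = 50.0

def poisson_points():
    N = rng.poisson(lam)
    return rng.uniform(0, 1, size=(N, 2))     # rows are the points (U_k^1, U_k^2)

# Count points in the left and right halves of W over many replications.
counts = np.array([[(p[:, 0] < 0.5).sum(), (p[:, 0] >= 0.5).sum()]
                   for p in (poisson_points() for _ in range(20_000))])

print(counts.mean(axis=0))                    # each ~ lam * 1/2 = 25
print(np.corrcoef(counts.T)[0, 1])            # ~ 0: the two counts are independent
```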

Now assume that $k:W\to[0,1]$ is a measurable function that is bounded by 1. Define for each subset $A$ of $W$ the number
$$K(A)=\int_A k(x,y)\,m_2(dx,dy)$$

(4) Give a suggestion for how to obtain a collection of points $(V_1^1,V_1^2),\ldots,(V_M^1,V_M^2)$ in $W$, such that for each subset $A$ of $W$ we have that the number of points in $A$,
$$M(A)=\sum_{j=1}^M 1_{((V_j^1,V_j^2)\in A)}\,,$$
is Poisson distributed with parameter $\lambda K(A)$ (Hint).

◦

Exercise 2.10. Assume that $(X,Y)$ is a real valued random vector such that $E|Y|<\infty$. Assume that the random vector $(X,\tilde Y)$ has the same distribution as $(X,Y)$, where $\tilde Y$ is another real valued random variable.

(1) Show that E(Y | X) = E(Y˜ | X) a.s.

Now assume that X1,...,Xn are independent and identically distributed with E|X1| < ∞.

Define Sn = X1 + ··· + Xn.

(2) Argue that (X1,Sn) has the same distribution as (Xk,Sn) for all k = 1, . . . , n.

(3) Show that E(X1 | Sn) = Sn/n (Hint).

◦

Chapter 3

Conditional independence

In this chapter we will work on a general probability space (Ω, F,P ). All events occurring will silently be assumed to be F-measurable, all σ-algebras occurring will silently be assumed to be subalgebras of F, and all stochastic variables X : (Ω, F) → (X , E) will silently be assumed to be F − E measurable.

The general convention is that random variables with names like X or Xi or variations thereof have values in a generic space (X , E), unless it is explicitly stated that they are real valued (or integer valued or whatever). Similarly, variables with names like Y or Z will have values in (Y, K) and (Z, G) respectively.

Recall that $(\mathcal X,\mathbb E)$ is a Borel space if it is in bijective, bimeasurable correspondence with $(\mathbb R,\mathbb B)$ or a subspace of this. Such a correspondence enables us to replace $\mathcal X$ with $\mathbb R$ whenever there is an advantage in that. It turns out that every sensible space has this property, unless it is very, very huge (non-separable metric spaces, with the $\sigma$-algebra generated by the open sets, say).

The above generic $\mathcal X$, $\mathcal Y$ and $\mathcal Z$-spaces are always assumed to be Borel spaces.

3.1 Conditional probabilities given a σ–algebra

So far we have considered conditional expectations $E(Y\mid X)$, where we condition on a random variable $X$. This is an integrable and $\sigma(X)$–measurable random variable satisfying
$$\int_{(X\in A)}E(Y\mid X)\,dP=\int_{(X\in A)}Y\,dP$$
for all $A\in\mathbb E$. We have furthermore seen that this variable can be calculated as the mean in the conditional distribution of $Y$ given $X$.

A natural generalisation of this concept is conditional expectations given a $\sigma$–algebra: For a real valued random variable $Y$ satisfying $E|Y|<\infty$ and a $\sigma$-algebra $\mathcal H$, the conditional expectation $E(Y\mid\mathcal H)$ of $Y$ given $\mathcal H$ is any $\mathcal H$-measurable and integrable random variable satisfying the integral condition
$$\int_H E(Y\mid\mathcal H)\,dP=\int_H Y\,dP\quad\text{for all } H\in\mathcal H\,,\qquad(3.1)$$
as defined in Definition A.2.2.

We shall be concerned with the conditional probability of an event $A$ given $\mathcal H$. This is defined similarly to $P(A\mid X)$: we simply use the conditional expectation of the indicator $1_A$, that is
$$P(A\mid\mathcal H)=E(1_A\mid\mathcal H)\,.$$
The integrability condition (3.1) will in this case take the form
$$\int_H P(A\mid\mathcal H)\,dP=P(A\cap H)\quad\text{for all } H\in\mathcal H\,.\qquad(3.2)$$

We will make frequent use of the monotonicity property of conditional expectations, which ensures that
$$0\le P(A\mid\mathcal H)\le 1\quad\text{a.s.}$$
and even that
$$A\subseteq B\;\Rightarrow\;P(A\mid\mathcal H)\le P(B\mid\mathcal H)\quad\text{a.s.}$$
Furthermore, the double conditioning theorem (Theorem A.2.5) says in this context that
$$E\big(P(A\mid\mathcal H)\mid\mathcal G\big)=P(A\mid\mathcal G)\quad\text{a.s.}$$
whenever the two $\sigma$-algebras $\mathcal G$ and $\mathcal H$ satisfy $\mathcal G\subseteq\mathcal H$.

3.2 Conditionally independent events

Definition 3.2.1. Two events $A$ and $B$ are conditionally independent given a $\sigma$-algebra $\mathcal H$, if
$$P(A\cap B\mid\mathcal H)=P(A\mid\mathcal H)\,P(B\mid\mathcal H)\quad\text{a.s.}\qquad(3.3)$$
Symbolically, we will write $A\perp\!\!\!\perp B\mid\mathcal H$ if (3.3) is satisfied.

Speaking colloquially, we will frequently say that A and B are independent given H if (3.3) is satisfied - repeated use of the word conditionally makes the sentences sound tedious.

Please note that conditional independence represents an intricate relation between the two events and the $\sigma$-algebra. The $\sigma$-algebra $\mathcal H$ is really an integral part of the definition. Whether $A$ and $B$ are conditionally independent or not depends crucially on the $\sigma$-algebra on which we condition.

If H ⊆ G are two σ-algebras, it is completely possible that two events A and B are indepen- dent given H, while they are not independent given the finer σ-algebra G. But it is equally possible that A and B are independent given G, while they are not independent given the coarser σ-algebra H. Changing the σ-algebra on which we are conditioning is usually a very challenging task - and indeed a task which is at the core of Markov Chain Theory.

Example 3.2.2. Recall that a $\sigma$-algebra $\mathcal H$ is trivial if every event in $\mathcal H$ has probability 0 or 1. The most obvious trivial $\sigma$-algebra is
$$\mathcal H=\{\emptyset,\Omega\}\,,$$
but there are plenty of other trivial algebras arising all over probability theory: tail algebras, symmetric algebras, invariant $\sigma$-algebras in ergodic theory and what not. If $\mathcal H$ is trivial, we observe that
$$P(A\mid\mathcal H)=P(A)\quad\text{a.s.}$$
for any event $A$, since the relation
$$\int_H P(A)\,dP=P(A\cap H)$$
is satisfied for all $\mathcal H$-sets $H$, both those of probability 0 (where there is nothing to prove) and those of probability 1 (where there is also nothing to prove). Hence (3.3) translates to

$$P(A\cap B)=P(A)\,P(B)\,.\qquad(3.4)$$

A priori the formula has an a.s.-qualifier, but as it is a relation between deterministic numbers, it is either true or false, with no probability involved.

Hence we see that conditional independence of two events given a trivial σ-algebra is simply classical independence of the events. ◦

Example 3.2.3. If $C$ is yet another event, and if $\mathcal H$ is the $\sigma$-algebra generated by that event,
$$\mathcal H=\{\emptyset,C,C^c,\Omega\}\,,$$
then it is readily checked that
$$P(A\mid\mathcal H)=\begin{cases}\dfrac{P(A\cap C)}{P(C)} & \text{on } C\\[2mm] \dfrac{P(A\cap C^c)}{P(C^c)} & \text{on } C^c\end{cases}\quad\text{a.s.}$$
for any event $A$. If we suppose that $\mathcal H$ is non-trivial, meaning that $P(C)\in(0,1)$, we see that (3.3) translates to the two conditions
$$\frac{P(A\cap B\cap C)}{P(C)}=\frac{P(A\cap C)}{P(C)}\,\frac{P(B\cap C)}{P(C)}\,,$$

$$\frac{P(A\cap B\cap C^c)}{P(C^c)}=\frac{P(A\cap C^c)}{P(C^c)}\,\frac{P(B\cap C^c)}{P(C^c)}\,.$$
These two conditions cannot be deduced from each other, and they are not related to (3.4). For instance, the probability table

             C                         C^c
         B       B^c               B       B^c
  A     2/18    1/18        A     2/18    4/18
  A^c   4/18    2/18        A^c   1/18    2/18

corresponds to a situation where $A\perp\!\!\!\perp B\mid\mathcal H$ but where $A$ and $B$ are dependent, as can readily be checked.

On the other hand, the probability table

             C                         C^c
         B       B^c               B       B^c
  A     1/12    2/12        A     2/12    1/12
  A^c   2/12    1/12        A^c   1/12    2/12

corresponds to a situation where $A$ and $B$ are independent, but where they are not independent given $\mathcal H$. ◦

Example 3.2.4. If we have a finite partition $\mathcal D$ of $\Omega$,
$$\mathcal D=\{D_1,\ldots,D_n\}\,,$$
where the atoms of $\mathcal D$ (the $D_i$-sets) are pairwise disjoint and unite to the whole of $\Omega$, the $\sigma$-algebra generated by $\mathcal D$ is the family of all unions,
$$\mathcal H=\Big\{\bigcup_{i\in I}D_i \;\Big|\; I\subseteq\{1,\ldots,n\}\Big\}\,.$$
If we let
$$\mathcal D^*=\{D\in\mathcal D\mid P(D)>0\}\,,$$
it is easily checked that
$$P(A\mid\mathcal H)=\sum_{D\in\mathcal D^*}\frac{P(A\cap D)}{P(D)}\,1_D\quad\text{a.s.}$$
for any event $A$. In this setting, condition (3.3) translates into
$$\frac{P(A\cap B\cap D)}{P(D)}=\frac{P(A\cap D)}{P(D)}\,\frac{P(B\cap D)}{P(D)}\quad\text{for all } D\in\mathcal D^*\,.$$
Again, whether this holds or not is very sensitive to the specific atoms. If an atom is divided into two, there is no telling if $A$ and $B$ are independent on each of the two subatoms, just because we know if they are independent on the original atom. And similarly, if two atoms are coalesced, we may lose or create conditional independence, as the case may be. ◦

3.3 Conditionally independent σ-algebras

Definition 3.3.1. Two classes of events, $\mathcal A$ and $\mathcal B$, are conditionally independent given a $\sigma$-algebra $\mathcal H$ if
$$A\perp\!\!\!\perp B\mid\mathcal H\quad\text{for all } A\in\mathcal A,\,B\in\mathcal B\,.\qquad(3.5)$$
Symbolically, we will write $\mathcal A\perp\!\!\!\perp\mathcal B\mid\mathcal H$ if (3.5) is satisfied.

We will almost exclusively use this concept in situations where the two classes of events are $\sigma$-algebras, but it is nice to be allowed to formulate things in a slightly broader fashion. We may for instance see that it typically is enough to check (3.5) on two generators of the $\sigma$-algebras under consideration:

Lemma 3.3.2. Let $\mathcal A$ and $\mathcal B$ be two classes of events, both stable under formation of intersections. Then
$$\mathcal A\perp\!\!\!\perp\mathcal B\mid\mathcal H\;\Rightarrow\;\sigma(\mathcal A)\perp\!\!\!\perp\sigma(\mathcal B)\mid\mathcal H\,.$$

Proof. A prototypical application of Dynkin's lemma. For each set $F\in\mathcal F$ we consider the class
$$\mathcal C_F=\{E\in\mathcal F\mid F\perp\!\!\!\perp E\mid\mathcal H\}\,,$$
and we observe that this is a Dynkin class. If we take $A\in\mathcal A$, we know that $\mathcal B\subseteq\mathcal C_A$. Using Dynkin's lemma, we see that $\sigma(\mathcal B)\subseteq\mathcal C_A$. On the other hand, conditional independence of two events is a property that is symmetric in the two events, so we can reformulate this fact as $\mathcal A\subseteq\mathcal C_B$ for any set $B\in\sigma(\mathcal B)$. Using Dynkin's lemma again establishes that $\sigma(\mathcal A)\subseteq\mathcal C_B$ for any set $B\in\sigma(\mathcal B)$. And though this may look awkward, it is in fact the property we are after.

Conditional independence of classes of events is of course just as sensitive to the exact choice of the $\sigma$-algebra on which we are conditioning as conditional independence of events was. In fact, if
$$\mathcal A=\{\emptyset,A,A^c,\Omega\}\,,\qquad\mathcal B=\{\emptyset,B,B^c,\Omega\}\,,$$
then $\mathcal A\perp\!\!\!\perp\mathcal B\mid\mathcal H$ if and only if $A\perp\!\!\!\perp B\mid\mathcal H$, as is readily seen from lemma 3.3.2. So the counterexamples to any kind of simple behaviour under change of the conditioning algebra given in section 3.2 also apply in this setting.

Example 3.3.3. Assume that $\mathcal H$ is a trivial $\sigma$–algebra. Then we saw in Example 3.2.2 that two sets $A$ and $B$ are conditionally independent given $\mathcal H$ if and only if they are independent in the classical sense. This translates directly into conditional independence of classes of events: If $\mathcal H$ is trivial, then two classes $\mathcal A$ and $\mathcal B$ satisfy

$$\mathcal A\perp\!\!\!\perp\mathcal B\mid\mathcal H\;\Leftrightarrow\;\mathcal A\perp\!\!\!\perp\mathcal B$$

Assume conversely that $\mathcal H=\mathcal F$. Then for all $F\in\mathcal F$ we have
$$P(F\mid\mathcal F)=1_F\quad\text{a.s.}\,,$$
since $1_F$ is $\mathcal F$–measurable. Hence it is seen that for any choice of $\mathcal A$ and $\mathcal B$ we have with $A\in\mathcal A$ and $B\in\mathcal B$ that
$$P(A\mid\mathcal F)\,P(B\mid\mathcal F)=1_A\cdot 1_B=1_{A\cap B}=P(A\cap B\mid\mathcal F)\quad\text{a.s.}\,,$$
so we conclude that $\mathcal A$ and $\mathcal B$ are always conditionally independent given $\mathcal F$. ◦

Example 3.3.4. Assume that $\mathcal A$, $\mathcal B$ and $\mathcal H$ are independent. Then with $A\in\mathcal A$ we observe
$$P(A\mid\mathcal H)=P(A)\quad\text{a.s.}$$
since for $H\in\mathcal H$ the relation
$$\int_H P(A)\,dP=P(A)P(H)=P(A\cap H)$$
is satisfied. Then, using the independence between $\mathcal A$ and $\mathcal B$, we obtain
$$P(A\mid\mathcal H)\,P(B\mid\mathcal H)=P(A)\cdot P(B)=P(A\cap B)=P(A\cap B\mid\mathcal H)\quad\text{a.s.}$$
In the last equality we used that $A\cap B\perp\!\!\!\perp\mathcal H$ since both $\mathcal A$ and $\mathcal B$ are independent of $\mathcal H$. We conclude that $\mathcal A$ and $\mathcal B$ are independent given $\mathcal H$ as well. ◦

Lemma 3.3.5 (Reduction). Let $\mathcal A$ and $\mathcal B$ be two classes of events, and let $\mathcal A'\subseteq\mathcal A$ be a subclass. Then
$$\mathcal A\perp\!\!\!\perp\mathcal B\mid\mathcal H\;\Rightarrow\;\mathcal A'\perp\!\!\!\perp\mathcal B\mid\mathcal H\,.$$

Proof. This is a quite trivial observation, which hardly deserves to be called a lemma. The statement on the right hand side involves fewer events than the statement on the left hand side, so the implication is obvious.

Theorem 3.3.6. Let $\mathcal A$, $\mathcal B$ and $\mathcal H$ be three $\sigma$-algebras. Suppose that $\mathcal A\perp\!\!\!\perp\mathcal B\mid\mathcal H$. If $X$ is an $\mathcal A$-measurable real valued random variable, and if $Y$ is a $\mathcal B$-measurable real valued random variable, such that $E|X|<\infty$, $E|Y|<\infty$ and $E|XY|<\infty$, then it holds that

$$E(XY\mid\mathcal H)=E(X\mid\mathcal H)\,E(Y\mid\mathcal H)\quad\text{a.s.}$$

Proof. A prototypical extension result. We know the theorem to be true for indicator variables. Hence it is true for simple variables. The monotone convergence theorem for conditional expectations will show it is true for non-negative variables, and a final handwaving will dismiss the problems of positive and negative parts.

Conditional independence is by its very definition symmetric in the two events, or more generally, in the two classes of events. Rather surprisingly, it turns out that the most fruitful way of working with the concept is through an asymmetric formulation:

Theorem 3.3.7. Let $\mathcal A$, $\mathcal B$ and $\mathcal H$ be $\sigma$-algebras. It holds that $\mathcal A\perp\!\!\!\perp\mathcal B\mid\mathcal H$ if and only if

$$P(A\mid\mathcal B\vee\mathcal H)=P(A\mid\mathcal H)\quad\text{a.s.}\qquad(3.6)$$

for every event $A\in\mathcal A$.

In the theorem $\mathcal B\vee\mathcal H$ denotes the smallest $\sigma$–algebra that contains both $\mathcal B$ and $\mathcal H$. This $\sigma$–algebra must be generated by the $\cap$–stable generating system given by

$$\{B\cap H : B\in\mathcal B,\,H\in\mathcal H\}$$

Proof. Notice that for any three events $A\in\mathcal A$, $B\in\mathcal B$ and $H\in\mathcal H$ we have that
$$\int_{B\cap H}P(A\mid\mathcal H)\,dP=\int_H 1_B\,P(A\mid\mathcal H)\,dP=\int_H E\big(1_B\,P(A\mid\mathcal H)\mid\mathcal H\big)\,dP=\int_H P(A\mid\mathcal H)\,P(B\mid\mathcal H)\,dP\,.\qquad(3.7)$$
In the second equality we have used the integration property from the definition of conditional expectations. In the third equality we have used the following calculation rule for conditional expectations: If $X$ is $\mathcal H$–measurable, then $E(XY\mid\mathcal H)=XE(Y\mid\mathcal H)$ (we also need to assume that $E|X|<\infty$, $E|Y|<\infty$ and $E|XY|<\infty$ such that the conditional expectations are well defined). Suppose that $\mathcal A$ and $\mathcal B$ are conditionally independent given $\mathcal H$. Then we can work the above line of equations one step further to see that
$$\int_{B\cap H}P(A\mid\mathcal H)\,dP=\int_H P(A\cap B\mid\mathcal H)\,dP=P(A\cap B\cap H)\,.$$
Since the events of the form $B\cap H$ form a generator for the $\sigma$-algebra $\mathcal B\vee\mathcal H$ that is stable under formation of intersections, and as $P(A\mid\mathcal H)$ is $\mathcal H$-measurable, and thereby in particular $\mathcal B\vee\mathcal H$–measurable, we conclude that $P(A\mid\mathcal H)$ indeed does satisfy all conditions for being the conditional probability of $A$ given $\mathcal B\vee\mathcal H$. And hence (3.6) holds.

For the opposite implication, we may utilise (3.6) on the starting end of (3.7), and obtain that
$$\int_H P(A\mid\mathcal H)\,P(B\mid\mathcal H)\,dP=\int_{H\cap B}P(A\mid\mathcal B\vee\mathcal H)\,dP=P(A\cap B\cap H)\,.$$
As $P(A\mid\mathcal H)\,P(B\mid\mathcal H)$ is indeed $\mathcal H$-measurable, we see that it satisfies all conditions for being the conditional probability of $A\cap B$ given $\mathcal H$. And hence $\mathcal A$ and $\mathcal B$ are conditionally independent given $\mathcal H$.

The asymmetric condition (3.6) is usually paraphrased by saying that there is no extra information in B for making predictions on the occurrence of an A-set, when we already have access to the information in H. All the information in B, useful for that prediction, is already contained in H. The symmetry between A and B is not clearly visible here, but somehow it is still there.

Corollary 3.3.8. Let $\mathcal A$, $\mathcal B$ and $\mathcal H$ be $\sigma$-algebras. If $\mathcal A\perp\!\!\!\perp\mathcal B\mid\mathcal H$ then it holds for any $\mathcal A$-measurable real random variable $X$ with $E|X|<\infty$ that

$$E(X\mid\mathcal B\vee\mathcal H)=E(X\mid\mathcal H)\quad\text{a.s.}\qquad(3.8)$$

Proof. Follows from theorem 3.3.7 by the same extension technique that was used to prove theorem 3.3.6.

Example 3.3.9. Assume that $\mathcal A$ and $\mathcal H$ are $\sigma$–algebras. We clearly have $\mathcal H\vee\mathcal H=\mathcal H$, such that
$$P(A\mid\mathcal H\vee\mathcal H)=P(A\mid\mathcal H)$$
Then it follows that $\mathcal A\perp\!\!\!\perp\mathcal H\mid\mathcal H$. The same argument applies to deduce that $\mathcal A\perp\!\!\!\perp\mathcal B\mid\mathcal H$ whenever $\mathcal B\subseteq\mathcal H$. ◦

3.4 Shifting information around

The asymmetric approach to conditional independence from Theorem 3.3.7 can be explored further: in fact, when we already condition on $\mathcal H$ it makes no difference adding sets from $\mathcal H$ to the sets from $\mathcal B$.

Theorem 3.4.1. Let $\mathcal A$, $\mathcal B$ and $\mathcal H$ be $\sigma$-algebras. Then
$$\mathcal A\perp\!\!\!\perp\mathcal B\mid\mathcal H\;\Rightarrow\;\mathcal A\perp\!\!\!\perp(\mathcal B\vee\mathcal H)\mid\mathcal H\,.$$

Proof. Take $A\in\mathcal A$. We have that
$$P\big(A\mid(\mathcal B\vee\mathcal H)\vee\mathcal H\big)=P(A\mid\mathcal B\vee\mathcal H)=P(A\mid\mathcal H)\,,$$
where the first equality is true for trivial reasons (we are conditioning on the same $\sigma$-algebra), and the second equality is true due to the conditional independence of $\mathcal A$ and $\mathcal B$ given $\mathcal H$. But now conditional independence of $\mathcal A$ and $\mathcal B\vee\mathcal H$ given $\mathcal H$ follows from theorem 3.3.7.

Theorem 3.4.2. Let $\mathcal A$, $\mathcal B$ and $\mathcal H$ be $\sigma$-algebras. Suppose that $\mathcal G$ is yet another $\sigma$-algebra, satisfying $\mathcal H\subseteq\mathcal G\subseteq\mathcal H\vee\mathcal B$. Then it holds that
$$\mathcal A\perp\!\!\!\perp\mathcal B\mid\mathcal H\;\Rightarrow\;\mathcal A\perp\!\!\!\perp\mathcal B\mid\mathcal G\,.$$

Proof. Take $A\in\mathcal A$. By repeated conditioning we have that
$$P(A\mid\mathcal B\vee\mathcal G)=E\big(P(A\mid\mathcal B\vee\mathcal G\vee\mathcal H)\mid\mathcal B\vee\mathcal G\big)=E\big(P(A\mid\mathcal B\vee\mathcal H)\mid\mathcal B\vee\mathcal G\big)=E\big(P(A\mid\mathcal H)\mid\mathcal B\vee\mathcal G\big)=P(A\mid\mathcal H)$$
as $P(A\mid\mathcal H)$ is itself $\mathcal H$-measurable, and thus $\mathcal G$-measurable, and in particular $\mathcal B\vee\mathcal G$-measurable. But by the exact same argument we have that
$$P(A\mid\mathcal G)=E\big(P(A\mid\mathcal H\vee\mathcal B)\mid\mathcal G\big)=E\big(P(A\mid\mathcal H)\mid\mathcal G\big)=P(A\mid\mathcal H)\,.$$

And thus in particular $P(A\mid\mathcal B\vee\mathcal G)=P(A\mid\mathcal G)$, which establishes conditional independence of $\mathcal A$ and $\mathcal B$ given $\mathcal G$.

Theorem 3.4.3. Let $\mathcal A$, $\mathcal B$, $\mathcal G$ and $\mathcal H$ be $\sigma$-algebras. It holds that
$$\mathcal A\perp\!\!\!\perp\mathcal B\mid\mathcal H\;\text{ and }\;\mathcal A\perp\!\!\!\perp\mathcal G\mid\mathcal B\vee\mathcal H\;\Rightarrow\;\mathcal A\perp\!\!\!\perp(\mathcal B\vee\mathcal G)\mid\mathcal H\,.$$

Proof. Take $A\in\mathcal A$. It holds that
$$P(A\mid(\mathcal B\vee\mathcal G)\vee\mathcal H)=P(A\mid\mathcal B\vee\mathcal H)=P(A\mid\mathcal H)\,.$$

The first equality is due to conditional independence of $\mathcal A$ and $\mathcal G$ given $\mathcal B\vee\mathcal H$, the second is due to conditional independence of $\mathcal A$ and $\mathcal B$ given $\mathcal H$. The combination of course gives that $\mathcal A$ and $\mathcal B\vee\mathcal G$ are independent given $\mathcal H$.

Example 3.4.4. None of the theorems so far will tell us how to throw information away in the conditioning algebra, while retaining conditional independence. But the theorems can in certain situations be combined to that effect.

Suppose that $\mathcal A\perp\!\!\!\perp\mathcal B\mid\mathcal G\vee\mathcal H$. Theorem 3.4.3 tells us that if we furthermore know that $\mathcal A\perp\!\!\!\perp\mathcal G\mid\mathcal H$, then $\mathcal A\perp\!\!\!\perp(\mathcal B\vee\mathcal G)\mid\mathcal H$. But we can throw events away in the classes that are conditionally independent for free, so it actually follows that $\mathcal A\perp\!\!\!\perp\mathcal B\mid\mathcal H$. By symmetry, we can also get rid of $\mathcal G$ if we know it is conditionally independent of $\mathcal B$ given $\mathcal H$. ◦

3.5 Conditionally independent random variables

In many cases we have $\sigma$-algebras generated by random variables. We will make no distinction between the random variable $X$ and the $\sigma$-algebra $\sigma(X)$ generated by $X$, and we will write things like
$$X\perp\!\!\!\perp Y\mid Z\quad\text{instead of}\quad\sigma(X)\perp\!\!\!\perp\sigma(Y)\mid\sigma(Z)$$
without notification.

Example 3.5.1. Assume that the random variables $X$, $Y$ and $Z$ are independent. Then of course the corresponding $\sigma$–algebras $\sigma(X)$, $\sigma(Y)$ and $\sigma(Z)$ are independent, such that Example 3.3.4 gives
$$\sigma(X)\perp\!\!\!\perp\sigma(Y)\mid\sigma(Z)\,,$$
which corresponds to saying $X\perp\!\!\!\perp Y\mid Z$. ◦

Example 3.5.2. Consider a normal distribution in three dimensions, where the one–dimensional marginals are standard normals,
$$\begin{pmatrix}X\\Y\\Z\end{pmatrix}\sim\mathcal N\left(\begin{pmatrix}0\\0\\0\end{pmatrix},\begin{pmatrix}1&\rho&\beta\\\rho&1&\beta\\\beta&\beta&1\end{pmatrix}\right).\qquad(3.9)$$

Here we have taken the two correlations involving $Z$ to be identical, to keep the problem simple.

Independence of $X$ and $Y$ is controlled by the parameter $\rho$. If $\rho=0$ they are independent, if $\rho>0$ they are positively correlated and if $\rho<0$ they are negatively correlated.

The conditional distribution of $(X,Y)$ given $Z=z$ is
$$\mathcal N\left(\begin{pmatrix}0\\0\end{pmatrix}+\begin{pmatrix}\beta\\\beta\end{pmatrix}(z-0)\,,\;\begin{pmatrix}1&\rho\\\rho&1\end{pmatrix}-\begin{pmatrix}\beta\\\beta\end{pmatrix}\begin{pmatrix}\beta&\beta\end{pmatrix}\right)=\mathcal N\left(\begin{pmatrix}\beta z\\\beta z\end{pmatrix},\begin{pmatrix}1-\beta^2&\rho-\beta^2\\\rho-\beta^2&1-\beta^2\end{pmatrix}\right).$$

Note that the variance does not depend on the specific value of $z$. Hence we can conclude that $X$ and $Y$ are conditionally independent given $Z$ if
$$\rho-\beta^2=0\,.$$
More precisely, the sign of $\rho-\beta^2$ controls the direction of the conditional correlation between $X$ and $Y$ given $Z$.

In figure 3.1 we have illustrated this phenomenon. In the $(\rho,\beta)$-plane we have found the domain which corresponds to legal covariance matrices (all three eigenvalues being non-negative). It is seen that this domain is divided into three: a part which corresponds to negative marginal correlation and negative conditional correlation between $X$ and $Y$. A part which corresponds to positive marginal correlation but negative conditional correlation. And a third part which corresponds to positive marginal and conditional correlation. If we did not employ the restriction that the two $Z$-correlations should be equal, we could of course have a fourth domain, corresponding to negative marginal but positive conditional correlation.

In this context, the message is that marginal correlations and conditional correlations are two very different things, and in particular that marginal independence and conditional independence are unrelated phenomena. ◦
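The sign claim can be checked directly with a few lines of code. In the sketch below the values of $\rho$ and $\beta$ are assumptions chosen inside the legal region; the conditional covariance is computed by the Schur complement formula from Example 3.5.2.

```python
# Sketch: conditional covariance of (X, Y) given Z equals rho - beta^2.
import numpy as np

rho, beta = 0.3, 0.6                          # assumed values; rho - beta^2 = -0.06
Sigma = np.array([[1.0,  rho,  beta],
                  [rho,  1.0,  beta],
                  [beta, beta, 1.0]])
assert np.all(np.linalg.eigvalsh(Sigma) >= 0) # legal covariance matrix

S11, S12, S22 = Sigma[:2, :2], Sigma[:2, 2:], Sigma[2:, 2:]
cond_cov = S11 - S12 @ np.linalg.inv(S22) @ S12.T
print(cond_cov)   # off-diagonal entry = rho - beta^2: positive marginal,
                  # negative conditional correlation for these assumed values
```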

Theorem 3.5.3. Let the conditional distributions of $Y$ and $Z$ given $X$ be respectively $(P_x)_{x\in\mathcal X}$ and $(Q_x)_{x\in\mathcal X}$. Define
$$R_x=P_x\otimes Q_x\,,\qquad L_{x,z}=P_x\,.$$

If $Y\perp\!\!\!\perp Z\mid X$, then $(R_x)_{x\in\mathcal X}$ is the conditional distribution of $(Y,Z)$ given $X$, and $(L_{x,z})_{(x,z)\in\mathcal X\times\mathcal Z}$ is the conditional distribution of $Y$ given $(X,Z)$.

Figure 3.1: Marginal independence and conditional independence in normal distributions of type (3.9). The shaded area contains the $(\rho,\beta)$-values for which the normal distribution exists. The vertical line corresponds to marginal independence of $X$ and $Y$ (positive correlation is on the right hand side). The parabolic curve corresponds to conditional independence of $X$ and $Y$ given $Z$ (positive conditional correlation is in the interior of the parabola). Note the domain where there is positive marginal correlation but negative conditional correlation.

Proof. It is easily checked that $(R_x)_{x\in\mathcal X}$ is an $\mathcal X$-kernel on $\mathcal Y\times\mathcal Z$. To check the integral condition, we write
$$\begin{aligned}
\int_A R_x(B\times C)\,X(P)(dx)&=\int_A P_x(B)\,Q_x(C)\,X(P)(dx)\\
&=\int_{(X\in A)}P(Y\in B\mid X)\,P(Z\in C\mid X)\,dP\\
&=\int_{(X\in A)}P(Y\in B,Z\in C\mid X)\,dP\\
&=P(X\in A,\,Y\in B,\,Z\in C)
\end{aligned}$$

In the second equality above we use that the function $x\mapsto P_x(B)$ is a version of the conditional probability $x\mapsto P(Y\in B\mid X=x)$, and similarly $Q_x(C)$ can be replaced by $P(Z\in C\mid X=x)$. The third equality is an application of conditional independence, and the fourth is simply the definition of conditional probabilities. Standard arguments extend these computations from product sets $B\times C$ to general measurable subsets of $\mathcal Y\times\mathcal Z$. Hence $(R_x)_{x\in\mathcal X}$ is the conditional distribution of $(Y,Z)$ given $X$.

The proof of the second half of the theorem proceeds in exactly the same way, utilising the asymmetric formulation of conditional independence instead of the definition:
$$\begin{aligned}
\int_{A\times C}L_{x,z}(B)\,(X,Z)(P)(dx,dz)&=\int_{A\times C}P_x(B)\,(X,Z)(P)(dx,dz)\\
&=\int_{(X\in A,Z\in C)}P(Y\in B\mid X)\,dP\\
&=\int_{(X\in A,Z\in C)}P(Y\in B\mid X,Z)\,dP\\
&=P(X\in A,\,Y\in B,\,Z\in C)
\end{aligned}$$

There are converses of both halves of this theorem. We choose to formulate them separately. They may come in handy under various circumstances, but in general we will go to quite some length in order to circumvent any use of them.

Theorem 3.5.4. Suppose that the conditional distribution $(R_x)_{x\in\mathcal X}$ of $(Y,Z)$ given $X$ has product structure of the form

$$R_x=P_x\otimes Q_x\quad\text{for all } x\in\mathcal X$$

for two families $(P_x)_{x\in\mathcal X}$ and $(Q_x)_{x\in\mathcal X}$ of probability measures on $\mathcal Y$ and $\mathcal Z$ respectively. Then both these families are Markov kernels, they are the conditional distributions of $Y$ given $X$ and of $Z$ given $X$ respectively, and it holds that $Y\perp\!\!\!\perp Z\mid X$.

Proof. The first two statements are trivially checked. The statement about conditional independence follows from
$$P(Y\in B,Z\in C\mid X=x)=R_x(B\times C)=P_x(B)\,Q_x(C)=P(Y\in B\mid X=x)\,P(Z\in C\mid X=x)$$
for all $x\in\mathcal X$, such that also
$$P(Y\in B,Z\in C\mid X)=P(Y\in B\mid X)\,P(Z\in C\mid X)$$

Note: this is not just a.s., but for all $\omega\in\Omega$!

Theorem 3.5.5. Suppose that the conditional distribution $(L_{x,z})_{(x,z)\in\mathcal X\times\mathcal Z}$ of $Y$ given $(X,Z)$ has the structure

$$L_{x,z}=P_x\quad\text{for all } x\in\mathcal X,\,z\in\mathcal Z$$

for some family $(P_x)_{x\in\mathcal X}$ of probability measures on $\mathcal Y$. Then this family is a Markov kernel, it is the conditional distribution of $Y$ given $X$, and it holds that $Y\perp\!\!\!\perp Z\mid X$.

Proof. The first statement is trivially checked. The second statement follows from Theorem 2.1.4. The statement about conditional independence follows from the asymmetric characterization, since
$$P(Y\in B\mid X=x)=P_x(B)=L_{x,z}(B)=P(Y\in B\mid X=x,Z=z)$$
for all $x$ and $z$, such that
$$P(Y\in B\mid X)=P(Y\in B\mid X,Z)\,.$$

Note: Apparently the fact that $Y\perp\!\!\!\perp Z\mid X$ is a statement about the joint distribution of $(X,Y,Z)$. Hence any three other variables $X'$, $Y'$ and $Z'$ that have the same joint distribution will satisfy the same conditional independence relation as the original triple.

Finally comes a rather deep result that indeed will be useful when trying to understand Markov chains.

Theorem 3.5.6. Let $X$ and $Y$ be random variables with values in $(\mathcal X,\mathbb E)$ and $(\mathcal Y,\mathbb K)$. There exists a map $\varphi:\mathcal X\times(0,1)\to\mathcal Y$, which is $\mathbb E\otimes\mathbb B_{(0,1)}-\mathbb K$ measurable, with the following property: if $X'$ is a random variable with the same distribution as $X$, $U$ is a real valued random variable, independent of $X'$ and uniformly distributed on $(0,1)$, and if we let

$$Y'=\varphi(X',U)\,,$$

then $(X',Y')$ has the same distribution as $(X,Y)$.

Proof. Due to the underlying assumption that the spaces involved are Borel spaces, we may assume that $(\mathcal Y,\mathbb K)=(\mathbb R,\mathbb B)$. Let $(P_x)_{x\in\mathcal X}$ be the conditional distribution of $Y$ given $X$.

We know that the conditional distribution of $U$ given $X'$ is very degenerate:
$$Q_x=\nu\quad\text{for all } x\in\mathcal X\,,$$
where $\nu$ is the uniform distribution on $(0,1)$. By the substitution theorem, the conditional distribution of $Y'$ given $X'$ is

$$R_x=\varphi\circ i_x(Q_x)=\varphi\circ i_x(\nu)\,.$$

The proof is complete, once we show how to choose φ such that Rx = Px for every x, as the joint distribution is uniquely determined from one marginal distribution and the conditional distribution of the remaining marginal given the first.

The deep claim is not so much that it is possible to choose $\varphi$ in such a way that

$$\varphi\circ i_x(\nu)=P_x\quad\text{for all } x\in\mathcal X\,.\qquad(3.10)$$

For if we let $F_x$ be the distribution function corresponding to $P_x$, and if we let $q_x$ be a quantile function for $F_x$, it is well known that $q_x(\nu)=P_x$. So we may let
$$\varphi(x,u)=q_x(u)\,,$$
and (3.10) will be satisfied bona fide.

What is a deep claim is that the construction can be carried out in a way that guarantees φ to be measurable. There is a choice involved, in the sense that quantile functions are not unique, and even though the individual quantile functions are increasing, and thus necessarily measurable, the various choices may destroy joint measurability.

The key is to get rid of the choices, and find an operationally defined quantile function. A nice one is
$$q_x(p)=\inf\{y\in\mathbb R\mid F_x(y)>p\}\quad\text{for all } x\in\mathcal X,\,p\in(0,1)\,.$$
The idea is to single out the largest possible $p$-quantile whenever there is a choice. Let us prove that this is in fact a quantile function:

For fixed $x$ and $p$, we have that
$$\{y\in\mathbb R\mid F_x(y)>p\}=(y_0,\infty)\quad\text{or}\quad[y_0,\infty)\,,$$

for some $y_0\in\mathbb R$. Whether we have the open or the halfclosed interval depends on the specifics of the situation, but in both cases we see that $q_x(p)=y_0$. For each $n$ we have that $y_0+\frac1n>y_0$, and thus
$$F_x\Big(y_0+\frac1n\Big)>p\,.$$

Using right continuity of $F_x$, we can conclude that

$$F_x(y_0)\ge p\,.$$

Similarly, $y_0-\frac1n<y_0$, and so
$$F_x\Big(y_0-\frac1n\Big)\le p\,.$$

Using monotonicity of $F_x$, we can conclude that

$$F_x(y_0-)\le p\,.$$

Together these inequalities show that $y_0$ is a $p$-quantile for $F_x$.

As for measurability, an elementary argument shows that
$$\{(x,p)\mid q_x(p)<z\}=\bigcup_{w\in\mathbb Q,\,w<z}\{(x,p)\mid F_x(w)>p\}\,.\qquad(3.11)$$

Each set $\{(x,p)\mid F_x(w)>p\}=\{(x,p)\mid F_x(w)-p>0\}$ is a measurable set. The fact that the right hand side of (3.11) is a countable union shows that the left hand side is a measurable set.

The point of theorem 3.5.6 is that we may think of any pair of variables as generated in a two-step procedure, where the generation of the second variable can be accomplished by mixing the first variable with random noise. It is the way that the mixing is carried out that determines the joint distribution.

The update function φ is not at all unique. There are literally uncountably many ways to choose it. In certain cases it matters which one we use, in most cases it is irrelevant. However, in typical applications there is a specific update function that almost forces itself upon us. 68 Conditional independence
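As an illustration of the construction, here is a Python sketch of an update function. The model is an assumption made for this sketch: given $X=x$, $Y$ is exponentially distributed with rate $x$, for which the quantile function is explicit, so $\varphi(x,u)=q_x(u)$ can be written down directly.

```python
# Sketch: an update function phi(x, u) = q_x(u) as in Theorem 3.5.6.
# Assumed model: Y | X = x ~ Exponential(rate x), so F_x(y) = 1 - exp(-x*y).
import numpy as np

def phi(x, u):
    # quantile function of the exponential distribution with rate x > 0
    return -np.log(1.0 - u) / x

rng = np.random.default_rng(6)
X = rng.uniform(1.0, 2.0, 100_000)      # any distribution for X' will do
U = rng.uniform(0.0, 1.0, 100_000)      # independent uniform noise
Y = phi(X, U)                           # (X, Y) gets the prescribed joint law

print(Y.mean(), np.mean(1.0 / X))       # E(Y) = E(1/X) by Theorem 2.2.2
```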

3.6 Exercises

Exercise 3.1. Let $Y$ and $Z$ be real valued random variables such that $EY^2<\infty$ and $EZ^2<\infty$. Let $X$ be a random variable with values in the measurable space $(\mathcal X,\mathbb E)$. Define the conditional covariance between $Y$ and $Z$ given $X$ by

$$\mathrm{Cov}(Y,Z\mid X)=E(YZ\mid X)-E(Y\mid X)E(Z\mid X)$$

(1) Show that

$$\mathrm{Cov}(Y,Z)=E(\mathrm{Cov}(Y,Z\mid X))+\mathrm{Cov}(E(Y\mid X),E(Z\mid X))$$

(2) Assume that Y ⊥⊥ Z | X. Show that Cov(Y,Z | X) = 0 a.s. (Hint).

Now assume that $X$ is a real valued random variable with $EX^2<\infty$, and assume that $Y_1$ and $Y_2$ are two other random variables with the same conditional distribution $(P_x)_{x\in\mathbb R}$ given $X$, where

$$P_x=\mathcal N(x,1)$$

Assume that Y1 ⊥⊥ Y2 | X .

(3) Show that $EY_1^2=EY_2^2<\infty$ (Hint).

(4) Show that Cov(Y1,Y2) = V (X) (Hint).

Exercise 3.2. Assume that X1 and X2 are real valued random variables. Let (Px)x∈R be the conditional distribution of X1 given X1 + X2.

Define for each $x\in\mathbb R$ and $B\in\mathbb B$ the set

$$x-B=\{x-y : y\in B\}$$

and define the collection of measures $(Q_x)_{x\in\mathbb R}$ by

$$Q_x(B)=P_x(x-B)$$

(1) Show that (Qx)x∈R is the conditional distribution of X2 given X1 + X2.

Define for each $x\in\mathbb R$ the measure $R_x$ on $(\mathbb R^2,\mathbb B^2)$ by

$$R_x(A\times B)=P_x(A\cap(x-B))$$
for $A,B\in\mathbb B$.

(2) Show that (Rx)x∈R is the conditional distribution of (X1,X2) given X1 + X2 (Hint).

(3) Assume that X1 ⊥⊥ X2 | X1 + X2. Show that for all x ∈ R it holds that Px(A) ∈ {0, 1} for all A ∈ B. Conclude that Px = δφ(x), where φ(x) is some real number dependent on x (Hint).

(4) Show that the function φ from (3) is measurable (Hint).

(5) Show that if $X_1\perp\!\!\!\perp X_2\mid X_1+X_2$, then there exist measurable functions $\varphi_1$ and $\varphi_2$ such that

$$X_1=\varphi_1(X_1+X_2)\;\text{ a.s.}\quad\text{and}\quad X_2=\varphi_2(X_1+X_2)\;\text{ a.s.}$$
(Hint).

(6) Give an example of real random variables X1, X2 and X3, where

X1 ⊥⊥ X2 | X1 + X2 + X3 .

(Hint).

Exercise 3.3. In this exercise we shall find an update function (as in theorem 3.5.6) in the situation, where both X and Y are finite.

(1) Assume that $p_1,\ldots,p_n\in(0,1)$ with $p_1+\cdots+p_n=1$. Define $q_0=0$ and for $k=1,\ldots,n$

$$q_k=p_1+\cdots+p_k\,.$$

Let $U$ have the uniform distribution on $(0,1)$. Define the random variable $Y$ by

$$Y=\sum_{k=1}^n k\cdot 1_{(q_{k-1}<U\le q_k)}$$

Show that P (Y = k) = pk for each k = 1, . . . , n. 70 Conditional independence

(2) Assume that X is a random variable with values in {1, . . . , m} and that Y is a random variable with values in {1, . . . , n}. Find a measurable function

φ : {1, . . . , m} × (0, 1) → {1, . . . , n}

such that if $X'$ has the same distribution as $X$ and if $U$ is uniform on $(0,1)$ and independent of $X'$, then $(X',Y')$ has the same distribution as $(X,Y)$, where $Y'=\varphi(X',U)$ (Hint).

Exercise 3.4. Assume that $X$ is a real random variable and that $(P_x)_{x\in\mathbb R}$ is the conditional distribution of $Y$ given $X$, where $P_x$ is the exponential distribution with mean $x$. Find a measurable function $\varphi:\mathbb R\times(0,1)\to\mathbb R$ such that if $X'$ has the same distribution as $X$ and if $U$ is uniform on $(0,1)$ and independent of $X'$, then $(X',Y')$ has the same distribution as $(X,Y)$, where $Y'=\varphi(X',U)$ (Hint). ◦

Chapter 4

Markov chains

Also in this chapter we will work on a general probability space $(\Omega,\mathcal F,P)$, and all events occurring will be assumed to be $\mathcal F$–measurable and all other $\sigma$–algebras will be assumed to be sub $\sigma$–algebras of $\mathcal F$. Furthermore all random variables (typically sequences $X_0,X_1,X_2,\ldots$) will be defined on this probability space and usually have values in the Borel space $(\mathcal X,\mathbb E)$, unless it is explicitly stated that they are real valued.

4.1 The fundamental Markov property

Definition 4.1.1. A sequence $X_0,X_1,X_2,\ldots$ of random variables with values in a common space $(\mathcal X,\mathbb E)$ is a Markov chain if

$$X_{n+1}\perp\!\!\!\perp(X_0,X_1,\ldots,X_{n-1})\mid X_n\quad\text{for } n=1,2,\ldots\qquad(4.1)$$

We refer to (4.1) as the fundamental Markov property. In colloquial terms, we say that the immediate future - represented by Xn+1 - is independent of the entire past given the present.

For a Markov chain $X_0,X_1,\ldots$ the one-step conditional distributions are of paramount importance, as we shall see from the following immediate result.

Theorem 4.1.2. Assume that $X_0,X_1,\ldots$ is a Markov chain with values in $(\mathcal X,\mathbb E)$. Let for each $n\in\mathbb N_0$ the Markov kernel $(P_{n,x})_{x\in\mathcal X}$ be the conditional distribution of $X_{n+1}$ given $X_n$. Then the Markov kernel given by

$$(x_0,x_1,\ldots,x_n)\mapsto P_{n,x_n}$$
is in fact the conditional distribution of $X_{n+1}$ given $(X_0,X_1,\ldots,X_n)$.

Proof. This follows directly from Theorem 3.5.3 and the conditional independence from the definition of X0,X1,... being a Markov chain.

We will call $(P_{n,x})_{x\in\mathcal X}$ the one–step transition probabilities. That the Markov kernel

$$(P_{n-1,x_{n-1}})_{(x_0,\ldots,x_{n-1})\in\mathcal X^n}$$
is the conditional distribution of $X_n$ given $(X_0,X_1,\ldots,X_{n-1})$ gives

$$P(X_0\in A_0,X_1\in A_1,\ldots,X_n\in A_n)=\int_{A_0\times\ldots\times A_{n-1}}P_{n-1,x_{n-1}}(A_n)\,d(X_0,X_1,\ldots,X_{n-1})(P)(x_0,x_1,\ldots,x_{n-1})\,.$$

But utilising that $(P_{n-2,x})_{x\in\mathcal X}$ by a slight change of the index set can be considered the conditional distribution of $X_{n-1}$ given $(X_0,X_1,\ldots,X_{n-2})$, we can by the extended Tonelli theorem write the above integral as a double integral:

$$P(X_0\in A_0,X_1\in A_1,\ldots,X_n\in A_n)=\int_{A_0\times\ldots\times A_{n-2}}\int_{A_{n-1}}P_{n-1,x_{n-1}}(A_n)\,dP_{n-2,x_{n-2}}(x_{n-1})\,d(X_0,\ldots,X_{n-2})(P)(x_0,\ldots,x_{n-2})\,.$$
And of course this process can be carried on, until we have the probability expressed as an $n$-fold integral:

$$P(X_0\in A_0,X_1\in A_1,\ldots,X_n\in A_n)=\int_{A_0}\int_{A_1}\ldots\int_{A_{n-1}}P_{n-1,x_{n-1}}(A_n)\,dP_{n-2,x_{n-2}}(x_{n-1})\ldots dP_{0,x_0}(x_1)\,dX_0(P)(x_0)\,.$$
In order to be slightly more specific, and avoid the indexing circus and the dots, an example of such a statement is

$$P(X_0\in A,X_1\in B,X_2\in C,X_3\in D)=\int_A\int_B\int_C P_{2,z}(D)\,dP_{1,y}(z)\,dP_{0,x}(y)\,dX_0(P)(x)\,.$$

We can think of a Markov chain $X_0,X_1,\ldots$ as a random variable with values in the sequence space $(\mathcal X^\infty,\mathbb E^\infty)$. So far we have assumed that a Markov chain $(X_0,X_1,\ldots)$ was given and used this to express the finite dimensional distributions of $(X_0,\ldots,X_n)$ using the distribution of $X_0$ and all the transition probabilities $(P_{n,x})_{x\in\mathcal X}$. Suppose conversely that we are given a probability measure $\mu$ on $(\mathcal X,\mathbb E)$ and Markov kernels $(P_{n,x})_{x\in\mathcal X}$ for each $n\in\mathbb N_0$. Then we want to know whether a Markov chain $(X_0,X_1,\ldots)$ exists with $X_0(P)=\mu$ and such that $(P_{n,x})_{x\in\mathcal X}$ is the conditional distribution of $X_{n+1}$ given $X_n$. We will construct the probability measure on $(\mathcal X^\infty,\mathbb E^\infty)$ that is the distribution of this Markov chain. Firstly, we can construct a probability measure $P_\mu^n$ on $(\mathcal X^{n+1},\mathbb E^{n+1})$ by defining

$$P_\mu^n(A_0\times A_1\times\cdots\times A_n)=\int_{A_0}\int_{A_1}\ldots\int_{A_{n-1}}P_{n-1,x_{n-1}}(A_n)\,dP_{n-2,x_{n-2}}(x_{n-1})\ldots dP_{0,x_0}(x_1)\,d\mu(x_0)\,.$$

These probability measures can be "collected" to a probability measure on $(\mathcal X^\infty,\mathbb E^\infty)$:

Theorem 4.1.3. Given a probability measure $\mu$ on $(\mathcal X,\mathbb E)$ and Markov kernels $(P_{n,x})_{x\in\mathcal X}$ for all $n\in\mathbb N_0$ there exists a uniquely determined probability measure $P_\mu$ on $(\mathcal X^\infty,\mathbb E^\infty)$ that satisfies
$$P_\mu(B_{n+1}\times\mathcal X^\infty)=P_\mu^n(B_{n+1})$$
for all $B_{n+1}\in\mathbb E^{n+1}$. Any process $(X_0,X_1,\ldots)$ with this distribution is a Markov chain with $X_0(P)=\mu$ and $(P_{n,x})_{x\in\mathcal X}$ as the conditional distribution of $X_{n+1}$ given $X_n$.

Proof. The existence and uniqueness of the probability measure is a direct application of Kolmogorov's consistency theorem (we shall not go into any details, but according to the consistency theorem it suffices that the measures $P_\mu^n$ satisfy

$$P_\mu^{n+1}(B_{n+1}\times\mathcal X)=P_\mu^n(B_{n+1})$$

for all $n\in\mathbb N_0$ and all $B_{n+1}\in\mathbb E^{n+1}$, which is easily seen to be true).

Suppose that $(X_0,X_1,\ldots)$ has distribution $P_\mu$, such that $(X_0,\ldots,X_n)$ has distribution $P_\mu^n$. Then doing the calculations from above backwards shows that

$$P(X_0\in A_0,X_1\in A_1,\ldots,X_{n+1}\in A_{n+1})=\int_{A_0\times\ldots\times A_n}P_{n,x_n}(A_{n+1})\,d(X_0,X_1,\ldots,X_n)(P)(x_0,x_1,\ldots,x_n)\,,$$
which by definition gives that $(P_{n,x_n})_{(x_0,\ldots,x_n)\in\mathcal X^{n+1}}$ is the conditional distribution of $X_{n+1}$ given $(X_0,\ldots,X_n)$. Since this only depends on $x_n$, we conclude from Theorem 3.5.5 that

$$X_{n+1}\perp\!\!\!\perp(X_0,\ldots,X_{n-1})\mid X_n$$
and that $(P_{n,x})_{x\in\mathcal X}$ is the conditional distribution of $X_{n+1}$ given $X_n$.

Recall that it is always possible to find an underlying probability space (Ω, F,P ) and a random variable defined on Ω, such that this variable has a given probability measure as its distribution!!

Example 4.1.4. Assume that $\mathcal X$ is finite – for convenience assume that $\mathcal X=\{1,\ldots,m\}$ – and let $X_0,X_1,\ldots$ be a Markov chain on $\mathcal X$. Assume furthermore that the transition probabilities $(P_x)_{x\in\mathcal X}$ are independent of $n$. Then $X_0,X_1,\ldots$ is a so-called time homogeneous Markov chain (this concept will be discussed in a later section) on the discrete state space $\mathcal X$. We have that $(P_x)_{x\in\mathcal X}$ is determined by the point probabilities

$$p_{ij}=P(X_{n+1}=j\mid X_n=i)$$
with $i,j\in\{1,\ldots,m\}$. We shall call this collection of probabilities $(p_{ij})$ the transition matrix for the Markov chain, and write it as
$$\hat P=\begin{pmatrix}p_{11}&p_{12}&\cdots&p_{1m}\\p_{21}&p_{22}&\cdots&p_{2m}\\\vdots&\vdots&\ddots&\vdots\\p_{m1}&p_{m2}&\cdots&p_{mm}\end{pmatrix}.$$

In this example it is also possible to express the probability measure $P_\mu^n$, since it is completely determined by the one–point probabilities:

$$P_\mu^n(\{(x_0,x_1,\ldots,x_n)\})=P(X_0=x_0,X_1=x_1,\ldots,X_n=x_n)=\mu(\{x_0\})\prod_{k=1}^n p_{x_{k-1},x_k}$$
◦
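Such a chain can be simulated from its transition matrix with an update scheme of the kind discussed below (cf. also Exercise 3.3). The following Python sketch uses an assumed $3\times 3$ transition matrix made up for illustration.

```python
# Sketch: simulating a time homogeneous chain on {1,...,m} from its transition matrix.
import numpy as np

P_hat = np.array([[0.5, 0.3, 0.2],
                  [0.1, 0.6, 0.3],
                  [0.2, 0.2, 0.6]])           # assumed matrix; rows sum to one
Q = np.cumsum(P_hat, axis=1)                  # q_{i,k} = p_{i,1} + ... + p_{i,k}

def update(i, u):
    # update function phi(i, u): smallest k with u <= q_{i,k}
    return min(int(np.searchsorted(Q[i], u)), Q.shape[1] - 1)

rng = np.random.default_rng(7)
x = 0                                         # X_0 = state 1 (0-indexed)
path = [x]
for u in rng.uniform(size=1000):
    x = update(x, u)
    path.append(x)
print(np.bincount(path, minlength=3) / len(path))   # empirical state frequencies
```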

We have learned how to find the finite-dimensional distributions of a Markov chain through multiple integrals involving the one-step transition kernels. Believe it or not, this horrible characterisation is usually taken as the definition of a Markov chain!

It seems plausible to most people that this property generalises certain facts about Markov chains on a discrete space. But nobody has the slightest clue on how to check if it is satisfied for a concrete Markov chain. The literature abounds with statements that this or that collection of random variables form a Markov chain, but there is never a proof - the Markov property is taken as selfevident, even when it clearly is not. The problem is that no one will even know where to start, if they have to check that the finite-dimensional marginal distributions have an integral representation of the specified form. It is way too complicated to be checkable in any practical sense. And hence the common conspiracy in the literature: if everybody keeps quiet, nobody will notice the problem.

As we shall see, definition 4.1.1 can in fact be checked in a number of non-trivial situations, and so it represents a definite progress - we do not have to rely on divine insight when we claim processes to be Markovian.

Theorem 4.1.5. If $X_0,X_1,X_2,\ldots$ is a Markov chain, it holds that

$$(X_n,X_{n+1},\ldots)\perp\!\!\!\perp(X_0,X_1,\ldots,X_n)\mid X_n\quad\text{for all } n=1,2,\ldots$$

Proof. We show by induction on $k$ that

$$(X_n,X_{n+1},\ldots,X_{n+k})\perp\!\!\!\perp(X_0,X_1,\ldots,X_n)\mid X_n\qquad(4.2)$$

As the algebra
$$\bigcup_{k=1}^\infty\sigma(X_n,\ldots,X_{n+k})$$
is a generator for $\sigma(X_n,X_{n+1},\ldots)$, stable under intersections, the extension of the result from the 'finite horizon future' to the 'infinite horizon future' follows from lemma 3.3.2.

To show (4.2) we observe that the statement for $k=1$ is the very definition of the Markov chain (and for $k=0$ it is a downright triviality).

We know that

$$X_{n+k+1}\perp\!\!\!\perp(X_0,\ldots,X_n,X_{n+1},\ldots,X_{n+k})\mid X_{n+k}\,.$$

By shifting information from the right hand $\sigma$–algebra to the conditioning $\sigma$–algebra, we obtain that

$$X_{n+k+1}\perp\!\!\!\perp(X_0,\ldots,X_n,X_{n+1},\ldots,X_{n+k})\mid(X_n,\ldots,X_{n+k})\,.$$

If we by induction assume that the property (4.2) is true for $k$, the two relations combine via theorem 3.4.3 to give

$$(X_n,X_{n+1},\ldots,X_{n+k},X_{n+k+1})\perp\!\!\!\perp(X_0,X_1,\ldots,X_n)\mid X_n\,.$$

We usually refer to theorem 4.1.5 as the general Markov property - or simply as the Markov property. Colloquially speaking, the $\sigma$-algebra generated by $(X_n,X_{n+1},\ldots)$ represents 'the future', and so the Markov property says that the future is independent of the past, given the present. What we have just proved is that if the immediate future only depends upon the past via the present at all times, then the general future will also depend upon the past via the present. Variations of the theme are clearly possible, for instance that
$$(X_0,X_1,\ldots,X_m)\perp\!\!\!\perp(X_n,X_{n+1},\ldots)\mid(X_m,\ldots,X_n)$$
whenever $m<n$. This follows from shifting information around as we just did, followed by a reduction.

A formulation of the Markov property that is sometimes useful, and in fact by some authors is taken as the definition of a Markov chain, is the following: if $X_0,X_1,\ldots$ is a Markov chain, and if $f:\mathcal X^{\mathbb N}\to\mathbb R$ is a bounded, measurable function, then for any $n$ it holds that
$$E\big(f(X_n,X_{n+1},\ldots)\mid X_0,X_1,\ldots,X_n\big)=E\big(f(X_n,X_{n+1},\ldots)\mid X_n\big)\quad\text{a.s.}$$

This follows from combining theorem 4.1.5 and corollary 3.3.8. It is a nice property to have, and it is very flexible to work with. Used on functions like
$$(x_1,x_2,\ldots)\mapsto 1_B(x_2)$$
it gives the fundamental Markov property as a consequence. But considered as a definition, it has the same basic flaw as the definition via multiple integrals: nobody has a clue on how to check if it is satisfied in concrete examples.

Theorem 4.1.6. Let $Y_1,Y_2,\ldots$ be independent variables, which we for simplicity assume have values in the same space $(\mathcal Y,\mathbb K)$. Furthermore, let $\varphi_n:\mathcal X\times\mathcal Y\to\mathcal X$ be a sequence of measurable maps.

Let $X_0$ be yet another variable, independent of the $Y$'s, and define

$$X_n=\varphi_n(X_{n-1},Y_n)\quad\text{for } n=1,2,\ldots\qquad(4.3)$$

The process $X_0,X_1,\ldots$ is a Markov chain.

Proof. Due to independence, we have that

$$Y_{n+1}\perp\!\!\!\perp(X_0,Y_1,\ldots,Y_n)\,,$$

which we might formulate as

$$Y_{n+1}\perp\!\!\!\perp(X_0,Y_1,\ldots,Y_n)\mid\{\emptyset,\Omega\}\,.$$

As $X_n$ is deterministically given by $(X_0,Y_1,\ldots,Y_n)$, it is of course measurable with respect to the $\sigma$-algebra generated by these variables. And hence we may float it to the conditioning side,
$$Y_{n+1}\perp\!\!\!\perp(X_0,Y_1,\ldots,Y_n)\mid X_n\,.$$
From there it may float back to the leftmost algebra, giving

$$(X_n,Y_{n+1})\perp\!\!\!\perp(X_0,Y_1,\ldots,Y_n)\mid X_n\,.$$

Now, $X_{n+1}$ is $(X_n,Y_{n+1})$-measurable, and $X_0,X_1,\ldots,X_n$ are all $(X_0,Y_1,\ldots,Y_n)$-measurable. So by diminishing, we obtain that

$$X_{n+1}\perp\!\!\!\perp(X_0,X_1,\ldots,X_{n-1})\mid X_n$$
as desired.

We usually refer to (4.3) as an update scheme for the Markov process, and we refer to the $Y$-process as the underlying error variables or noise variables.

Theorem 4.1.7. Let $X_0,X_1,\ldots$ be a Markov chain. There are update functions

$$\varphi_n:\mathcal X\times(0,1)\to\mathcal X\quad\text{for all } n=1,2,\ldots$$
with the following property: if $U_1,U_2,\ldots$ is a sequence of independent standard uniformly distributed stochastic variables, and if $X_0'$ is independent of the $U$'s and has the same distribution as $X_0$, then the update scheme

$$X_n'=\varphi_n(X_{n-1}',U_n)\quad\text{for } n=1,2,\ldots$$

produces a process $X_0',X_1',X_2',\ldots$ with the same distribution as the original process $X_0,X_1,X_2,\ldots$.

Proof. We may represent the original chain by its initial distribution (the distribution of $X_0$) and each of its one-step transition kernels $(P_{n,x})_{x\in\mathcal X}$. From these building blocks, we can build up the finite dimensional distributions of the process, and hence the joint distribution of the entire process.

Each of the one-step transition kernels has an update function $\varphi_n$ according to theorem 3.5.6. Using these in the update scheme above will produce a Markov chain with the same one-step transition kernels and the same initial distribution as the original chain, and hence the same overall distribution.

So from a distributional point of view, we may always assume that a Markov chain is given by an update scheme - if a specific process we happen to study is not in update form, we can replace it by another process which is in update form, and which is indistinguishable from the first from a probabilistic point of view. The caveat is that the update functions are not in any way unique, and it may not be easy to produce update functions that make any sense intuitively.

The representation of Markov chains via update schemes is necessary for simulation purposes: a computer program that simulates a Markov chain must almost inevitably have the form of an update scheme. But the idea also has a number of purely probabilistic applications.
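To make the simulation remark concrete, here is a minimal Python sketch of a simulator driven by an update scheme - our own illustration, not part of the notes; the function name simulate_chain and the use of numpy are arbitrary choices, and the update function φ is whatever the model at hand dictates.

```python
# Minimal sketch: simulate X_0, X_1, ..., X_n via X_n = phi(X_{n-1}, U_n),
# where U_1, U_2, ... are iid standard uniform, as in theorem 4.1.7.
import numpy as np

def simulate_chain(phi, x0, n_steps, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    path = [x0]
    for _ in range(n_steps):
        path.append(phi(path[-1], rng.uniform()))
    return path

# Example usage: a symmetric +/-1 random walk started in 0.
walk = simulate_chain(lambda s, u: s + (1 if u < 0.5 else -1), 0, 100)
```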

Example 4.1.8. We will show in an exercise how a discrete state space Markov chain can be constructed using an update scheme. ◦
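Anticipating that exercise, the following sketch of ours shows one possible update function for a finite state space: φ(x, u) returns the first state whose cumulative transition probability from x exceeds u, which is exactly the kind of update function promised by theorem 4.1.7. The transition matrix below is an arbitrary illustration.

```python
# Sketch: an update scheme for a discrete state space Markov chain.
import numpy as np

P = np.array([[0.0, 0.5, 0.5],       # an arbitrary one-step transition matrix;
              [0.3, 0.0, 0.7],       # row x is the distribution of X_{n+1} given X_n = x
              [0.6, 0.4, 0.0]])
cum = np.cumsum(P, axis=1)

def phi(x, u):
    # the smallest state j with P(x,0) + ... + P(x,j) >= u
    return int(np.searchsorted(cum[x], u))

rng = np.random.default_rng(0)
path = [0]                            # start deterministically in state 0
for _ in range(20):
    path.append(phi(path[-1], rng.uniform()))
```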

Example 4.1.9. The random walk, based on an iid. innovation sequence X1, X2, ..., is by definition the process S0, S1, S2, ... given by

$$S_n = \sum_{i=1}^{n} X_i ,$$

with the convention that S0 = 0. This is a Markov chain with update scheme

Sn = Sn−1 + Xn .

It is typically assumed that the innovations have mean zero, but random walks with positive (or negative) drift (meaning that the innovations have a non–zero mean) are objects of study in their own right. ◦

Example 4.1.10. The reflecting random walk, based on an iid. innovation sequence X1, X2, ..., is by definition the Markov chain with update scheme

$$T_0 = 0 , \qquad T_n = (T_{n-1} + X_n)^+ .$$

A random walk with negative drift is frequently studied through the corresponding reflecting random walk, which exhibits the 'upwards excursions' of the random walk. ◦
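As a small numerical aside - ours, not the notes' - the two walks can be driven by the same innovations, so that only the update function differs; the drift −0.5 and the normal innovation distribution are arbitrary choices.

```python
# Sketch: a random walk S_n and its reflected version T_n = (T_{n-1} + X_n)^+,
# built from one shared sequence of innovations with negative drift.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=-0.5, scale=1.0, size=1000)   # innovations X_1, ..., X_1000

S = np.concatenate(([0.0], np.cumsum(X)))        # S_n = S_{n-1} + X_n, S_0 = 0
T = np.zeros(len(X) + 1)                         # T_0 = 0
for n in range(1, len(T)):
    T[n] = max(T[n - 1] + X[n - 1], 0.0)         # the reflecting update
# S drifts towards -infinity while T keeps returning to 0 after upward excursions.
```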

Example 4.1.11. The classical AR(1)-process on the real axis is given by the update scheme

Xn+1 = ρXn + εn+1

where the ε's are independent and identically distributed. As a first choice, the errors are typically normally distributed with mean zero. But other choices are clearly possible. We also have to specify the distribution of X0 in order to specify the joint distribution of the process.

In a sense, the behaviour of the AR(1)-process is not very dependent on the specific choice of error distribution or initial distribution. The key is the magnitude of ρ. If |ρ| < 1, the process will behave in a stable and quite predictable way. If |ρ| > 1 the process will on the other hand explode. If ρ = 1 we are back in the random walk case. And if ρ = −1, we are essentially also back in the random walk case, even though it becomes slightly more complicated to formulate the results. We will return to this classification time and time again during the course. ◦
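A quick experiment - a sketch of ours with arbitrary parameter values - makes the role of |ρ| visible:

```python
# Sketch: AR(1) paths X_{n+1} = rho * X_n + eps_{n+1} with standard normal errors.
import numpy as np

def ar1_path(rho, n_steps, x0=0.0, seed=0):
    rng = np.random.default_rng(seed)
    x = np.empty(n_steps + 1)
    x[0] = x0
    for n in range(n_steps):
        x[n + 1] = rho * x[n] + rng.normal()
    return x

stable = ar1_path(0.9, 1000)       # |rho| < 1: fluctuates in a band around 0
explosive = ar1_path(1.1, 1000)    # |rho| > 1: |X_n| blows up geometrically
```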

Example 4.1.12. There is a straightforward generalisation of the AR(1) process to R^k via the update scheme

Xn = RXn−1 + εn

Here R is a k × k matrix, and the ε's are an iid. sequence of R^k-valued stochastic variables - a typical choice is to make the errors N(0, Σ)-distributed, where Σ is some legal variance matrix.

It is rather complicated to describe the long time behaviour of the chain, but to a first approximation it will depend on the eigenvalues of R. If all the eigenvalues are smaller than one in modulus, the matrix represents a linear map that contracts everything to 0. And this contraction is so dominating that it even governs the stochastic behaviour. If some of the eigenvalues are outside of the complex unit circle, things become more complicated. The corresponding eigen–directions will be 'directions of explosion', and they will in a sense govern the stochastic behaviour, unless the error distribution is so singular that the process never has a non-zero component in an exploding direction.

Hence there is a very delicate interplay between the deterministic behaviour of the underlying linear map, and the measure–theoretic singularities of the error distribution. At first sight it would seem like a mathematical game to explore this interplay - it does not seem to be relevant from a modelling point of view. But it actually pops up in many places, and the problem must be considered seriously. ◦

Example 4.1.13. The AR(2)-process on the real axis is given by the update scheme

Xn+1 = αXn + βXn−1 + εn+1

where the ε's are independent and identically distributed, typically normal with mean zero. As it stands, this update scheme does not give rise to a Markov process, because it does not just depend on the present observation, but also on a lagged observation. Furthermore, we need both X0 and X1 in order to be able to run the update mechanism.

But a slight rearrangement will in fact give a Markov chain. If we stack the process, and consider the process in R² given by

$$Y_n = \begin{pmatrix} X_n \\ X_{n-1} \end{pmatrix} ,$$

we see that the Y-process fits into the update scheme

$$Y_n = \begin{pmatrix} \alpha X_{n-1} + \beta X_{n-2} + \varepsilon_n \\ X_{n-1} \end{pmatrix} = \begin{pmatrix} \alpha & \beta \\ 1 & 0 \end{pmatrix} \begin{pmatrix} X_{n-1} \\ X_{n-2} \end{pmatrix} + \begin{pmatrix} \varepsilon_n \\ 0 \end{pmatrix} = \begin{pmatrix} \alpha & \beta \\ 1 & 0 \end{pmatrix} Y_{n-1} + \begin{pmatrix} \varepsilon_n \\ 0 \end{pmatrix} .$$

This shows that the AR(2)-process is just an AR(1)-process in disguise, and hence it is 'practically Markovian'. The price we pay for this simplification is, however, that the errors in the AR(1) updating scheme are quite degenerate - they are essentially one-dimensional. This perhaps sheds some light on the remarks as to why it is necessary to study AR(1)-processes in full generality, even with singular error distributions. ◦
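The identity between the two representations is easy to confirm numerically; the following sketch (with arbitrary α, β and starting values of our choosing) runs the scalar AR(2) recursion and the stacked AR(1) recursion side by side on the same errors.

```python
# Sketch: the stacked process Y_n = (X_n, X_{n-1}) follows Y_n = R Y_{n-1} + (eps_n, 0)
# with the companion matrix R, and agrees with the scalar AR(2) recursion.
import numpy as np

alpha, beta = 0.5, 0.3
R = np.array([[alpha, beta],
              [1.0,   0.0]])
rng = np.random.default_rng(1)

x = [0.2, -0.1]                         # X_0, X_1 (arbitrary)
y = np.array([x[1], x[0]])              # Y_1 = (X_1, X_0)
for _ in range(100):
    eps = rng.normal()
    x.append(alpha * x[-1] + beta * x[-2] + eps)   # scalar AR(2) step
    y = R @ y + np.array([eps, 0.0])               # stacked AR(1) step
    assert np.allclose(y, [x[-1], x[-2]])          # the two representations agree
```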

Example 4.1.14. Consider independent, identically distributed non-negative real random variables Y1, Y2, ..., and think of them as representing waiting times between events. The occurrence of the n'th event is thus happening at time

$$S_n = \sum_{i=1}^{n} Y_i .$$

The corresponding renewal process is the continuous time process, which for each time point indicates how many events have occurred,

Nt = sup{n : Sn ≤ t}

Renewal processes are very important in many branches of probability, in particular in Markov chain theory.

We will mainly be interested in the case where all the waiting times are integers, and this we assume from now on. Hence the natural discrete time renewal process is

Nn = sup{k : Sk ≤ n} for n = 0, 1, 2, ...

Note that the renewal process itself is not Markovian. Not in general, at least. If we consider the case where

P(Yi = 2) = P(Yi = 3) = 1/2 for i = 1, 2, ...,

it is clear that exactly one event has taken place at time 3, that is N3 = 1. This makes the σ-algebra generated by N3 trivial, and so the Markov property is ruled out, if we show that N2 and N4 are not independent. But the joint distribution of these two variables is given by the table

           N4 = 1    N4 = 2
N2 = 0      1/2        0
N2 = 1      1/4       1/4

and this table does not give independence. If we think about it for a moment, the lack of Markovianness of renewal processes is rather evident: Nn+1 will either be equal to Nn or equal to Nn + 1, the latter case corresponding to an event occurring at time n + 1. When we try to predict whether an event will occur at time n + 1, the relevant knowledge is not how many events have occurred at time n, but rather the exact time of the last event - an information hidden deeper in the past.

But there are other processes, associated to the renewal process, that do possess the Markov property. One such process is the forward recurrence time chain, V1, V2, ... given by

Vn = inf{Sk − n : k such that Sk > n}

For any time point n, the value of Vn is the waiting time until the next event. If Vn ≥ 2, there is no event taking place at time n + 1, and so Vn+1 = Vn − 1. But if Vn = 1, there is an event taking place at time n + 1, and the value of Vn+1 will be the length of the waiting period until the next event. Hence it is very easy to calculate the one-step probabilities:

$$\tilde P = \begin{pmatrix} \nu_1 & \nu_2 & \nu_3 & \cdots \\ 1 & 0 & 0 & \cdots \\ 0 & 1 & 0 & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix}$$

where ν1, ν2, ... form the point masses of the waiting time distribution ν. But the relevance of the one-step probabilities is not clear, unless we know that the forward recurrence time chain is a Markov chain. And while that is true, a rigorous demonstration is not trivial. In a later example we will establish Markovianness as a consequence of the so-called strong Markov property for the underlying random walk.

A related process is the backward recurrence time chain, B0, B1, ... given by

Bn = inf{n − Sk : Sk ≤ n} for n = 0, 1, 2, ...

For any time point n, the value of Bn is the time that has passed since the last event. If an event is taking place at time n, the value of Bn is 0. Otherwise, we have the simple relation Bn = Bn−1 + 1. Also in this case it is easy to compute the one-step probabilities,

$$\tilde P = \begin{pmatrix} \mu_0 & 1-\mu_0 & 0 & 0 & \cdots \\ \mu_1 & 0 & 1-\mu_1 & 0 & \cdots \\ \mu_2 & 0 & 0 & 1-\mu_2 & \cdots \\ \vdots & \vdots & \vdots & \vdots & \ddots \end{pmatrix}$$

where

μk = P(Y1 = k + 1 | Y1 > k) .

The relevance of this matrix, though, will only become clear once it is established that the backward recurrence time chain is Markovian. ◦
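For a waiting time distribution with finite support the two matrices can be written down explicitly; the sketch below is our own illustration, using the two-or-three waiting time distribution from the counterexample above, and builds finite truncations of both matrices.

```python
# Sketch: one-step matrices of the forward and backward recurrence time chains
# for a waiting-time pmf p, where p[k-1] = P(Y = k).
import numpy as np

p = np.array([0.0, 0.5, 0.5])            # P(Y=1), P(Y=2), P(Y=3)
K = len(p)

# Forward chain on states {1, ..., K}: state 1 draws a fresh waiting time,
# state v >= 2 counts down deterministically.
P_fwd = np.zeros((K, K))
P_fwd[0] = p
for v in range(1, K):
    P_fwd[v, v - 1] = 1.0

# Backward chain on states {0, ..., K-1}: renew with probability mu_k, else age.
P_bwd = np.zeros((K, K))
for k in range(K):
    mu_k = p[k] / p[k:].sum()            # mu_k = P(Y = k+1 | Y > k)
    P_bwd[k, 0] = mu_k                   # an event occurs: age resets to 0
    if k + 1 < K:
        P_bwd[k, k + 1] = 1.0 - mu_k     # no event: age grows by one
```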

Example 4.1.15. If X0,X1,... is a Markov chain, and if f : X → Y is a measurable function, we may consider the process Y0,Y1,... given by

Yn = f(Xn) for n = 0, 1, 2,...

It is an important problem to find out if the Y -process is Markovian as well. While reduction easily gives that

Yn+1 ⊥⊥ (Y0, ..., Yn−1) | Xn ,

there is no telling when we can shrink the conditioning algebra from σ(Xn) to σ(Yn). The prominence of this problem arises, of course, from the fact that the Y-process is usually not Markovian. It is actually rather difficult to find examples where the Markov property is preserved, but non-Markovianness is usually a mess to establish.

To construct an explicit example, we may let the X-process be an asymmetric random walk on three points, say with one-step transition matrix

$$\tilde P = \begin{pmatrix} 0 & p & 1-p \\ 1-p & 0 & p \\ p & 1-p & 0 \end{pmatrix}$$

The initial distribution can be taken as the equidistribution. This process is a random movement on the corners of a triangle. When p ≠ 1/2, the process has a preference for steps with a specific orientation. If p is close to one, the X-process will move 1 → 2 → 3 → 1 → 2 → ..., if p is close to 0 the X-process will move the other way around. As self-transitions are not possible, the variable (X0, X1, X2) has only twelve non-zero point masses,

(x0, x1, x2)   probability       (x0, x1, x2)   probability
1 2 1          p(1−p)/3          2 3 1          p²/3
1 2 3          p²/3              2 3 2          p(1−p)/3
1 3 1          p(1−p)/3          3 1 2          p²/3
1 3 2          (1−p)²/3          3 1 3          p(1−p)/3
2 1 2          p(1−p)/3          3 2 1          (1−p)²/3
2 1 3          (1−p)²/3          3 2 3          p(1−p)/3

The transformation we will consider is f : {1, 2, 3} → {1, 2} given by

f(1) = 1 , f(2) = 2 , f(3) = 2 .

So the Y-process is identical to the X-process, except for the fact that the original states 2 and 3 are collapsed into one superstate, which for simplicity is called 2. The variable (Y0, Y1, Y2) has five point masses (as 2-2 transitions are now perfectly legal, while 1-1 transitions are still forbidden),

(y0, y1, y2)   probability
1 2 1          2p(1−p)/3
1 2 2          p²/3 + (1−p)²/3
2 1 2          p²/3 + 2p(1−p)/3 + (1−p)²/3
2 2 1          p²/3 + (1−p)²/3
2 2 2          2p(1−p)/3

If we stratify this probability table by Y1, we get

Y1 = 1:
          Y2 = 1    Y2 = 2
Y0 = 1      0         0
Y0 = 2      0      p²/3 + 2p(1−p)/3 + (1−p)²/3

and

Y1 = 2:
          Y2 = 1               Y2 = 2
Y0 = 1    2p(1−p)/3            p²/3 + (1−p)²/3
Y0 = 2    p²/3 + (1−p)²/3      2p(1−p)/3

There is actually independence in the Y1 = 1 table, even if it is of a somewhat degenerate form. But there is no independence in the Y1 = 2 table, unless p = 1/2. ◦
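The computation behind these tables is mechanical, so it lends itself to a numerical check; the sketch below (ours, with the arbitrary choice p = 0.8) recomputes the Y1 = 2 stratum and tests the factorisation.

```python
# Sketch: verify that the lumped process fails independence in the Y1 = 2 stratum
# unless p = 1/2.
import numpy as np

p = 0.8
P = np.array([[0.0, p, 1 - p],
              [1 - p, 0.0, p],
              [p, 1 - p, 0.0]])
f = np.array([0, 1, 1])                  # f(1)=1, f(2)=f(3)=2, coded as 0 and 1

joint = np.zeros((2, 2, 2))              # pmf of (Y0, Y1, Y2), uniform start
for i in range(3):
    for j in range(3):
        for k in range(3):
            joint[f[i], f[j], f[k]] += P[i, j] * P[j, k] / 3.0

strat = joint[:, 1, :]                   # the Y1 = 2 stratum
strat = strat / strat.sum()              # conditional table given Y1 = 2
outer = np.outer(strat.sum(axis=1), strat.sum(axis=0))
print(np.allclose(strat, outer))         # prints False, i.e. no independence
```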

4.2 The strong Markov property

The Markov property formulates a relationship between the past, the present and the future, which is to hold for all values of 'the present', if a process is to be called a Markov chain. At least it has to hold for all deterministic values. But it turns out time and time again that we need the Markov property to hold in extended situations, where the value of 'the present' is not known in advance, but has a certain stochastic element to it. As an example, we may consider 'the present' to be the first time the process enters a certain subset of X.

To introduce the relevant formalism, we focus on a fixed process X0, X1, ... with values in some measurable space (X, E). The process may or may not be a Markov chain; presently that is not relevant. The process generates a filtration, a sequence of σ-algebras F0 ⊆ F1 ⊆ ... given by

Fn = σ(X0, X1, ..., Xn) for n = 0, 1, 2, ...

There is also a natural limit algebra F∞, generated by all the variables X0,X1,... or - if we like - generated by the filtration. It may happen that F∞ equals the fundamental σ-algebra F, but typically this is not the case. All the random variables are of course adapted to the filtration, meaning that Xn is Fn-measurable for each n.

A random variable τ with values in the countable set N0* = {0, 1, 2, ..., ∞} is called a random time. A random time is a stopping time with respect to the filtration F0 ⊆ F1 ⊆ ... if it satisfies that

(τ = n) ∈ Fn for n = 0, 1, 2, ...

The stopping time condition means that for each n there is a measurable subset Bn ⊆ X^{n+1} such that

(τ = n) = ((X0, X1, ..., Xn) ∈ Bn) .

The implication is that we are able to read off from the values of X0,X1,...,Xn whether τ = n or not. By observing the X-process for some time, we know if τ has occurred or not.

Example 4.2.1. The most obvious example of a stopping time is a deterministic time. The 'random variable' τ ≡ n satisfies the necessary condition, as is easily checked.

The second obvious example is the first hitting time of a set A, as in

τ = inf{n = 0, 1,... : Xn ∈ A}

Observing that τ ∧ σ and τ ∨ σ (minimum and maximum) of two stopping times τ and σ are themselves stopping times, we can construct a vast number of new stopping times. A typical construction would be τ ∧ n for a fixed n. A similar construction would be the first hitting time for a set A after a given stopping time σ, as in

τ = inf{n ∈ {σ + 1, σ + 2, ...} : Xn ∈ A} .

In Markov chain theory it is customary to discuss both the first hitting time of a set A, typically denoted by σA, and the first return time of A, typically denoted by τA, which is the first hitting time after time 0. Unless the chain starts in A, the first hitting time and the first return time to A agree. But when there is a difference, the first return time is usually the most relevant. ◦
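Along a concrete path, hitting and return times are computed by a simple scan; the sketch below is our own illustration and works for any sequence and any membership-testable set A.

```python
# Sketch: first hitting time and first return time of a set A along a path.
def first_hitting_time(path, A):
    """sigma_A = inf{n >= 0 : X_n in A}; None encodes the formal value infinity."""
    for n, x in enumerate(path):
        if x in A:
            return n
    return None

def first_return_time(path, A):
    """tau_A = inf{n >= 1 : X_n in A}, the first hitting time after time 0."""
    for n, x in enumerate(path):
        if n >= 1 and x in A:
            return n
    return None

print(first_hitting_time([2, 0, 1, 0], {0}))   # 1
print(first_return_time([0, 2, 1, 0], {0}))    # 3 (differs from the hitting time 0)
```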

It is in principle allowed that a stopping time τ can attain the value ∞. In martingale theory this is not only a sensible convention, but in fact a useful idea that vastly simplifies a number of formulations. But in Markov chain theory, infinite stopping times are a menace, and we will usually not allow them. We will focus on three types of stopping times: the bounded stopping times, which never take on values above a certain threshold known to us; the finite stopping times, which never take on the value ∞, but may take on arbitrarily large integer values; and the almost surely finite stopping times, which satisfy that

P(τ < ∞) = 1 .

We would really like all our stopping times to be finite - but that would exclude the first hitting times from consideration. Consider the waiting time until head comes up in a coin tossing experiment. With probability one, head comes up sooner or later. But there is a formal possibility that head never comes up, and we have to deal with this possibility in our formalism. We could cut the corresponding nullset out of the background probability space Ω, to ensure that head always comes up. But if we follow that route, we will have to do this kind of surgery on the background space whenever we introduce a new stopping time, and it becomes technically very unpleasant in the long run. It is much neater to allow that stopping times take on the value ∞ - as long as we make sure this only happens on a nullset.

If X0, X1, ... is a process and τ is a corresponding stopping time, we introduce the symbol Xτ as the value of the process at the random time τ. If τ is finite, the formal definition may be written as

$$X_\tau = \sum_{n=0}^{\infty} 1_{(\tau = n)} X_n .$$

From a strict point of view, this formula only makes sense if X is a vector space. But even in the general case it is a much more distinct way of expressing the definition than the case-by-case formula it covers.

But to a certain extent, the definition of Xτ breaks down if the stopping time can take on the value ∞ - even if this only happens with probability zero. In order to do something in that situation, we adopt the convention that whenever we introduce a new measurable space (X, E) on which stochastic variables may have values, we equip it with a standard variable X* - on R^n we could let the standard variable have the deterministic value 0. We will assume that this standard variable is measurable with respect to F∞, but no other details matter. Having introduced a standard variable, we may then define

$$X_\tau = \sum_{n=0}^{\infty} 1_{(\tau = n)} X_n + 1_{(\tau = \infty)} X^* .$$

If τ assumes the value ∞ with positive probability, the choice of standard variable is of course important for the behaviour of Xτ. But as long as τ is almost surely finite, the invention of the standard variable is a purely formal gimmick. Observe that Xτ becomes measurable with respect to F∞:

$$(X_\tau \in A) = \bigcup_{n=0}^{\infty} \big( (X_\tau \in A) \cap (\tau = n) \big) \cup \big( (X_\tau \in A) \cap (\tau = \infty) \big) = \bigcup_{n=0}^{\infty} \big( (X_n \in A) \cap (\tau = n) \big) \cup \big( (X^* \in A) \cap (\tau = \infty) \big)$$

The only event in this decomposition that does not obviously satisfy the relevant measurability condition is (τ = ∞). But the complement (τ < ∞) is the union of events of the form (τ = n), and this establishes measurability.

Corresponding to a stopping time τ, we have a natural notion of ’the past’, namely the σ-algebra Fτ = {F ∈ F | F ∩ (τ = n) ∈ Fn for all n = 0, 1,...} .

Lemma 4.2.2. Let X0,X1,... be a stochastic process, and let τ be an adapted stopping time. Then the variables τ and Xτ are both Fτ -measurable.

Proof. Trivial manipulations. If we consider the event (τ = k), we have that

$$(\tau = k) \cap (\tau = n) = \begin{cases} (\tau = n) & \text{if } k = n \\ \emptyset & \text{if } k \neq n \end{cases}$$

In both cases we get that (τ = k) ∩ (τ = n) ∈ Fn. And hence (τ = k) ∈ Fτ. This shows the measurability of τ.

Similarly, if we let A be a measurable subset of X , we have that

(Xτ ∈ A) ∩ (τ = n) = (Xn ∈ A) ∩ (τ = n) ∈ Fn ,

so (Xτ ∈ A) ∈ Fτ .

Lemma 4.2.3. Let X0,X1,... be a stochastic proces, and let τ and σ be two adapted stopping times. It holds that

σ ≤ τ ⇒ Fσ ⊆ Fτ .

Proof. This is well known.

Lemma 4.2.4. Let X0,X1,... be a Markov chain, with corresponding filtration F0 ⊆ F1 ⊆ .... If Z and W are two bounded real variables, both Fn-measurable, then it holds that

E(Z | Xn) = E(W | Xn) a.s. ⇒ E(Z | Xn, Xn+1) = E(W | Xn, Xn+1) a.s.

Proof. This is really a trivial consequence of the Markov property. The future variable Xn+1 is independent of the past algebra Fn, in particular of Z and W , given the present variable Xn. Referring to the asymmetric formulation of conditional independence in corollary 3.3.8, we obtain the string of equations

E(Z | Xn,Xn+1) = E(Z | Xn) = E(W | Xn) = E(W | Xn,Xn+1) a.s.

Note the amusing fact that we are somehow using the Markov property backwards in this proof. The argument can be verbalised as saying that when we are attempting to ’predict the past’, there is no information in knowing the future - only the present matters.

Lemma 4.2.5. Let X0,X1,... be a process, with corresponding filtration F0 ⊆ F1 ⊆ ..., and let τ be an adapted stopping time. Let Z be a real valued random variable, measurable with respect to Fτ . For any n < ∞ it holds that 1(τ=n)Z is measurable with respect to Fn.

Proof. We simply observe that for any B ∈ B we have one of two situations, depending on whether B contains 0 or not:

$$\big( 1_{(\tau=n)} Z \in B \big) = \begin{cases} (Z \in B) \cap (\tau = n) & 0 \notin B \\ \big( (Z \in B) \cap (\tau = n) \big) \cup (\tau \neq n) & 0 \in B \end{cases}$$

Since Z is assumed to be Fτ-measurable, (Z ∈ B) will be an Fτ-set, and so (Z ∈ B) ∩ (τ = n) will be an Fn-set. Also the event (τ ≠ n) is Fn-measurable - its complement has the relevant measurability per definition. So in either case (1_{(τ=n)} Z ∈ B) is in Fn.

Lemma 4.2.6. Let X0, X1, ... be a process, with corresponding filtration F0 ⊆ F1 ⊆ ..., and let τ be an adapted stopping time. For any event F and any n < ∞ it holds that

$$E\big( 1_{(\tau=n)} P(F \mid X_\tau, \tau) \mid X_n \big) = P\big( F \cap (\tau = n) \mid X_n \big) \quad a.s. \tag{4.4}$$

Proof. The claim that two conditional expectations with respect to Xn are the same of course means that the two random variables integrate to the same thing, when integrated over σ(Xn)-events. Observe that

$$\int_{(X_n \in A)} 1_{(\tau=n)} P(F \mid X_\tau, \tau) \, dP = \int_{(X_\tau \in A,\, \tau = n)} P(F \mid X_\tau, \tau) \, dP = P\big( F \cap (X_\tau \in A, \tau = n) \big) ,$$

since the middle integral is over a σ(Xτ, τ)-event. Similarly it holds that

$$\int_{(X_n \in A)} 1_{F \cap (\tau=n)} \, dP = P\big( F \cap (X_n \in A) \cap (\tau = n) \big) = P\big( F \cap (X_\tau \in A, \tau = n) \big) .$$

Note that (4.4) may be formulated

$$E\big( P(F \cap (\tau = n) \mid X_\tau, \tau) \mid X_n \big) = P\big( F \cap (\tau = n) \mid X_n \big) \quad a.s.$$

since 1_{(τ=n)} is σ(Xτ, τ)-measurable. Hence we see that the statement is really about a double conditioning situation. The statement remains non-trivial, however, because the σ-algebras in question, Fn and σ(Xτ, τ), are not nested. In fact, the statement is only true due to the specific nature of the event F ∩ (τ = n) and its interplay with the two σ-algebras.

Theorem 4.2.7 (Strong Markov property). Let X0,X1,... be a Markov chain, with corre- sponding filtration F0 ⊆ F1 ⊆ .... Let τ be an adapted stopping time, and assume that τ is almost surely finite. Then

Xτ+1 ⊥⊥ Fτ | (τ, Xτ)

Proof. We prove that for any F ∈ Fτ it holds that

P(F | Xτ, Xτ+1, τ) = P(F | Xτ, τ) a.s. (4.5)

which is another way of formulating the the-future-is-irrelevant-for-predicting-the-past phenomenon we have previously encountered. The right hand side of (4.5) clearly has the measurability properties to be a version of the left hand side, so we only need to check that it has the right integrals over σ(Xτ, Xτ+1, τ)-events. For finite n we see that

$$\int_{(\tau = n,\, X_\tau \in A,\, X_{\tau+1} \in B)} P(F \mid X_\tau, \tau) \, dP = \int_{(X_n \in A,\, X_{n+1} \in B)} 1_{(\tau=n)} P(F \mid X_\tau, \tau) \, dP$$

Combining the lemmas, we see that we can replace the integrand by 1_{(τ=n)∩F} to obtain

$$\int_{(\tau = n,\, X_\tau \in A,\, X_{\tau+1} \in B)} P(F \mid X_\tau, \tau) \, dP = P\big( (X_n \in A, X_{n+1} \in B, \tau = n) \cap F \big) = P\big( (X_\tau \in A, X_{\tau+1} \in B) \cap (\tau = n) \cap F \big)$$

It is trivially true that

$$\int_{(\tau = \infty,\, X_\tau \in A,\, X_{\tau+1} \in B)} P(F \mid X_\tau, \tau) \, dP = P\big( (X_\tau \in A, X_{\tau+1} \in B) \cap (\tau = \infty) \cap F \big)$$

since both sides are zero, due to the assumption that τ is almost surely finite. The events of the form (τ = n, Xτ ∈ A, Xτ+1 ∈ B) (including the events with n = ∞) form a generator for σ(Xτ, Xτ+1, τ), stable under the formation of intersections. And hence it follows that

$$\int_G P(F \mid X_\tau, \tau) \, dP = P(G \cap F) \quad \text{for all } G \in \sigma(X_\tau, X_{\tau+1}, \tau) .$$

That is, we have established (4.5).

This 'strong Markov property' is slightly weaker than we would have liked. The immediate future only becomes independent of the past given the present if 'the present' includes a glance at the clock. One can construct examples which show that in general the information of the random time cannot be dispensed with, see example 4.2.8. But in the next section we will go hunting for a condition where all times look the same, and where there is no essential information in knowing the value of τ. In that framework we will be able to strengthen the conclusion in theorem 4.2.7 to obtain what is generally perceived as the strong Markov property in the literature.

Example 4.2.8. Consider an asymmetric random walk on three points with a direction which oscillates back and forth. The one-step transition matrices are

$$P_{2n-1} = \begin{pmatrix} 0 & p & q \\ q & 0 & p \\ p & q & 0 \end{pmatrix} , \qquad P_{2n} = \begin{pmatrix} 0 & q & p \\ p & 0 & q \\ q & p & 0 \end{pmatrix}$$

where p + q = 1. As starting distribution, we can take the equidistribution on states 2 and 3.

As stopping time we take the first hitting time of state 1. In that case Xτ = 1 almost surely, and so there is no information contained in that variable.

If Xτ+1 ⊥⊥ Fτ | Xτ, this triviality implies that Xτ+1 is unconditionally independent of Fτ. And as τ is Fτ-measurable, it in fact implies that Xτ+1 is independent of τ. This is clearly false, because the conditional distribution of Xτ+1 given τ will depend rather drastically on whether τ is odd or even - unless of course p = 1/2.

The example demonstrates that we cannot in all cases strengthen the general strong Markov property Xτ+1 ⊥⊥ Fτ | (τ, Xτ) to the simpler and perhaps more attractive statement Xτ+1 ⊥⊥ Fτ | Xτ. ◦

4.3 Homogeneity

Virtually every single Markov chain we will consider will have a further simplifying property called time homogeneity.

As was shown in Theorem 4.1.3, the distribution of a Markov chain is given by the sequence of one-step transition probabilities and the initial distribution. The one-step probabilities form a sequence of kernels (Pn,x)x∈X, where the n'th kernel is a version of the conditional distribution of Xn+1 given Xn. There is a certain amount of choice involved in these kernels, and maybe it is possible to adjust these choices, so that a single Markov kernel can be used as the one-step transition kernel in every step. In that case we call the chain time homogeneous, and write

$$P \overset{\mathcal{D}}{=} X_{n+1} \mid X_n \quad \text{for } n = 0, 1, 2, \ldots$$

Spelled out, the condition is that there is one Markov kernel that satisfies

$$P(X_n \in A, X_{n+1} \in B) = \int_A P_x(B) \, dX_n(P)(x) \quad \text{for all } A, B \text{ and } n .$$

Many Markov chains present themselves to us in a form where the time homogeneity is obvious. But if it is not obvious, time homogeneity is almost impossible to establish: there are many ways to pick the various one-step transition kernels, and if these are not picked with the purpose of being equal, then they will surely differ.

The obvious example exhibiting the problems is the random walk with symmetric ±1 increments. It is a time homogeneous Markov chain with transition matrix

$$p_{nm} = \begin{cases} \tfrac{1}{2} & \text{if } m = n+1 \\ \tfrac{1}{2} & \text{if } m = n-1 \\ 0 & \text{otherwise} \end{cases}$$

This is the obvious transition matrix that everybody will write down - before they start thinking. But if we somehow miss that, and just start calculating, there are lots of other choices. Usually we insist that the random walk starts in 0. If that is the case, the transition matrix for the time 2 to time 3 transition is only uniquely given from the states −2, 0 and 2. Similarly, the transition matrix for the time 3 to time 4 transition is only uniquely given from the states −3, −1, 1 and 3. Transitions from all other states are not determined at all. So if we pick the transition matrices one by one, it is quite unlikely that we will pick the same every time, unless we have some principle to guide us.

The reason why so many chains are blatantly time homogeneous, is because they arise via time homogeneous update schemes. That is, an update scheme of the form

Xn+1 = φ(Xn, Yn+1)

where Y1, Y2, ... are independent and identically distributed. The error distribution does not vary with time, and the update function does not vary with time, hence the one-step transition probabilities do not vary with time.

In the opposite direction, it is also clear that if (X0, X1, ...) is a Markov chain where all the one-step transition kernels are the same, then there is an update scheme of the above sort generating the process.

The examples we gave of Markov chains with specified update schemes were all of this time homogeneous form. Actually, time-varying update schemes virtually never appear in applications. With one notable exception: simulated annealing, which is an optimisation algorithm based on Markov chains.

However, for processes constructed on top of other processes, neither the Markov structure nor the time homogeneity may be immediately visible. We have shown that we may examine the Markov property from first principles - but the random walk example above shows that we have to be very careful when we check for time homogeneity. We adopt the following slightly weaker definition:

Definition 4.3.1. A Markov chain X0, X1, ... is weakly time homogeneous if

Xτ+1 ⊥⊥ τ | Xτ

for every adapted, almost surely finite stopping time τ.

This definition undoubtedly looks confusing. There is a nice linguistic catch in that homogeneity with this definition reflects that something is 'independent of time' in a stochastic sense. But apart from that, the definition may seem arbitrary. However, the definition has its merits, as we shall see. And at least for Markov chains on finite spaces, it is possible to prove that weak time homogeneity implies strict homogeneity, as defined in terms of constant one-step transition kernels or constant update schemes.

In order to show an equivalent definition of weak homogeneity, we need the following technical results.

Theorem 4.3.2. Let X and Y be two random variables, with values in the same space (X, E). For any event A and any α > 0 it holds that

$$P\Big( \big| P(A \mid X) - P(A \mid Y) \big| > \alpha \Big) \leq \frac{16 \, P(X \neq Y)}{\alpha} .$$

Proof. The key technical result we will have to prove is that

$$\Big| \int_D P(A \mid X) \, dP - P(A \cap D) \Big| \leq 2 P(X \neq Y) , \tag{4.6}$$

for any event D ∈ σ(X, Y). If we have this inequality at our disposal, we can use it on the event

$$D^+ = \Big( P(A \mid X) - P(A \mid X, Y) > \alpha \Big)$$

which is σ(X, Y)-measurable, to obtain that

$$\alpha P(D^+) \leq \int_{D^+} P(A \mid X) - P(A \mid X, Y) \, dP = \int_{D^+} P(A \mid X) \, dP - P(A \cap D^+) \leq 2 P(X \neq Y) .$$

Using a similar argument in the other tail, we obtain that

$$P\Big( \big| P(A \mid X) - P(A \mid X, Y) \big| > \alpha \Big) \leq \frac{4 \, P(X \neq Y)}{\alpha} .$$

And observing that the event

$$\Big( \big| P(A \mid X) - P(A \mid Y) \big| > \alpha \Big)$$

is a subset of

$$\Big( \big| P(A \mid X) - P(A \mid X, Y) \big| > \frac{\alpha}{2} \Big) \cup \Big( \big| P(A \mid Y) - P(A \mid X, Y) \big| > \frac{\alpha}{2} \Big)$$

the theorem is established.

To show (4.6), take D ∈ σ(X, Y). We can assume that D = ((X, Y) ∈ B) for some set B ∈ E ⊗ E. Let

D* = ((X, X) ∈ B) .

Clearly D* is σ(X)-measurable, and

(D △ D*) ⊆ (X ≠ Y) ,

where we have used the notation A △ B = (A ∪ B) \ (A ∩ B) = (A \ B) ∪ (B \ A). Now we have that

$$\Big| \int_D P(A \mid X) \, dP - P(A \cap D) \Big| \leq \Big| \int_D P(A \mid X) \, dP - \int_{D^*} P(A \mid X) \, dP \Big| + \big| P(A \cap D^*) - P(A \cap D) \big| \leq 2 P(D \setminus D^*) + 2 P(D^* \setminus D) \leq 2 P(X \neq Y)$$

as desired. Here we have used that $\int_{D^*} P(A \mid X) \, dP = P(A \cap D^*)$, since D* is σ(X)-measurable, and that the integrand P(A | X) is bounded by 1.

Corollary 4.3.3. Let X, X1, X2, ... be a sequence of random variables. If

P(Xn = X) → 1 for n → ∞ ,

it holds for any event A that

$$P(A \mid X_n) \overset{P}{\to} P(A \mid X) .$$

Proof. It follows directly from theorem 4.3.2 that for any α > 0,

$$P\Big( \big| P(A \mid X_n) - P(A \mid X) \big| > \alpha \Big) \leq \frac{16 \, P(X_n \neq X)}{\alpha} \to 0 \quad \text{for } n \to \infty ,$$

which establishes convergence in probability.

Now we can prove

Theorem 4.3.4. If a Markov chain X0,X1,... satisfies that

Xτ+1 ⊥⊥ τ | Xτ (4.7)

for every bounded adapted stopping time τ, it is weakly time homogeneous.

Proof. Let τ be an adapted and almost surely finite stopping time. Consider the event (τ = k), and let us first discuss the finite case, where k < ∞. Pick N so large that k < N, and consider the stopping time τ ∧ N. Clearly (τ = k) = (τ ∧ N = k). Using (4.7) on τ ∧ N, we obtain that

P (τ = k | X(τ∧N)+1,Xτ∧N ) = P (τ ∧ N = k | X(τ∧N)+1,Xτ∧N )

= P (τ ∧ N = k | Xτ∧N )

= P (τ = k | Xτ∧N ) a.s.

Letting N tend to infinity, we see that

$$P(X_{\tau \wedge N} = X_\tau) \to 1 , \qquad P\big( (X_{(\tau \wedge N)+1}, X_{\tau \wedge N}) = (X_{\tau+1}, X_\tau) \big) \to 1 ,$$

simply because P(τ ∧ N = τ) = P(τ ≤ N) → 1 for N → ∞. By corollary 4.3.3 it follows that

P (τ = k | Xτ+1,Xτ ) = P (τ = k | Xτ ) a.s. To finish the proof, we have to consider the case k = ∞ as well. But as τ is almost surely finite,

P (τ = ∞ | Xτ+1,Xτ ) = 0 = P (τ = ∞ | Xτ ) a.s. and we are done.

Theorem 4.3.5. A time homogeneous Markov chain X0,X1,... is weakly time homogeneous.

Proof. We may assume that the Markov chain has update form,

Xn+1 = φ(Xn,Un+1) for some fixed map φ, and a sequence of independent, standard uniformly distributed real stochastic variables U1,U2,.... Let τ be a finite stopping time. Then

$$X_{\tau+1} = \sum_{n=0}^{\infty} 1_{(\tau=n)} X_{n+1} = \sum_{n=0}^{\infty} 1_{(\tau=n)} \phi(X_n, U_{n+1}) = \phi(X_\tau, \tilde U) \tag{4.8}$$

where we have introduced the variable

$$\tilde U = \sum_{n=0}^{\infty} 1_{(\tau=n)} U_{n+1} .$$

Take an event F ∈ Fτ . Then it holds that

$$P\big( (\tilde U \in A) \cap F \big) = \sum_{n=0}^{\infty} P\big( (\tilde U \in A) \cap F \cap (\tau = n) \big) = \sum_{n=0}^{\infty} P\big( (U_{n+1} \in A) \cap F \cap (\tau = n) \big) .$$

Using that F ∩(τ = n) is Fn-measurable, and that Un+1 is independent of Fn, as this algebra is contained in σ(X0,U1,...,Un), we get that

$$P\big( (\tilde U \in A) \cap F \big) = \sum_{n=0}^{\infty} P(U_{n+1} \in A) \, P\big( F \cap (\tau = n) \big) = \sum_{n=0}^{\infty} P(U_1 \in A) \, P\big( F \cap (\tau = n) \big) = P(U_1 \in A) \, P(F) .$$

We can draw two consequences: for one thing, Ũ is standard uniformly distributed. But more importantly, we see that

Ũ ⊥⊥ Fτ .

Observing that Xτ is Fτ-measurable, we may float information to the (trivial) conditioning side, and obtain that

Ũ ⊥⊥ Fτ | Xτ .

We may float information back, and obtain

(Ũ, Xτ) ⊥⊥ Fτ | Xτ .

As Xτ+1 according to (4.8) is σ(Xτ, Ũ)-measurable, and as τ is Fτ-measurable, it follows by reduction that

Xτ+1 ⊥⊥ τ | Xτ ,

as desired.

A nice consequence of the proof is that the conditional distribution of Xτ+1 given Xτ is simply the same as the common conditional distribution of Xn+1 given Xn, since it follows the same update rule with an error variable that is standard uniformly distributed.

Theorem 4.3.6 (Strong Markov property). Let X0,X1,... be a Markov chain, with corre- sponding filtration F0 ⊆ F1 ⊆ .... Let τ be an adapted stopping time, and assume that τ is almost surely finite. If the chain is weakly time homogeneous, then

Xτ+1 ⊥⊥ Fτ | Xτ

Proof. We combine weak homogeneity and theorem 4.2.7 via theorem 3.4.3; the result drops out for free.

This version of the strong Markov property is perhaps the key property of time homogeneous Markov chains in any formulation.

Corollary 4.3.7. Let X0,X1,... be a Markov chain, with corresponding filtration F0 ⊆ F1 ⊆ .... Let τ be an adapted stopping time, and assume that τ is almost surely finite. If the chain is weakly time homogeneous, then

(Xτ+1,Xτ+2,...) ⊥⊥ Fτ | Xτ (4.9)

Proof. We show by an induction argument that

(Xτ+1,Xτ+2,...,Xτ+k) ⊥⊥ Fτ | Xτ (4.10) for all values of k. The crux of the matter is that σ = τ + k is a stopping time. Hence

Xτ+k+1 ⊥⊥ Fτ+k | Xτ+k

As the variables Xτ, Xτ+1, ..., Xτ+k are all Fτ+k-measurable, we can shift them to the conditioning algebra, and obtain

Xτ+k+1 ⊥⊥ Fτ+k | (Xτ ,Xτ+1,...,Xτ+k)

As τ + k ≥ τ we see that Fτ ⊆ Fτ+k, so by reduction it follows that

Xτ+k+1 ⊥⊥ Fτ | (Xτ, Xτ+1, ..., Xτ+k) .

Combining with the inductive hypothesis (4.10) we obtain via theorem 3.4.3 that

(Xτ+1, Xτ+2, ..., Xτ+k+1) ⊥⊥ Fτ | Xτ

as desired.

Example 4.3.8. Let X0,X1,... be a weakly time homogeneous Markov chain with values in X , let x ∈ X be a specified state, and let τ be an almost surely finite stopping time with the property that Xτ = x almost surely.

The obvious example where such a phenomenon occurs is the case where X is finite, and where τ is the first (or the second or the k'th) hitting time of the state x. A slight amount of work will usually establish that τ is almost surely finite - though it need not always be the case; the reader is reminded of concepts like 'transience' and 'recurrence', that play important roles in the study of discrete (and as it turns out, also of general) Markov chains.

If we have such a situation where Xτ is constant, σ(Xτ) is a trivial algebra. Hence conditional independence given Xτ is the same as unconditional independence, and the strong Markov property thus implies that

(Xτ+1, Xτ+2, ...) ⊥⊥ Fτ .

This phenomenon is called regeneration - the future is independent of the past in such a time point. The importance of regeneration can not be overestimated. To a large extent, analysis of the asymptotic behaviour of Markov chains is analysis of regeneration.

To understand why regeneration is so important, assume that τ1 < τ2 < ... is an increasing sequence of almost surely finite regeneration time points, for instance the first, the second, the third, ... hitting time of a specified state. Let f : X → R be a bounded measurable function. Then the excursions

$$\sum_{n=\tau_1+1}^{\tau_2} f(X_n) \; , \quad \sum_{n=\tau_2+1}^{\tau_3} f(X_n) \; , \quad \sum_{n=\tau_3+1}^{\tau_4} f(X_n) \; , \; \ldots$$

are independent random variables. Say, the third is independent of the two former since it is given by the τ3-future, while the two former are given by the τ3-past. In most cases these variables are also identically distributed and have finite mean. And this gives the asymptotic result that

$$\frac{1}{k} \sum_{n=0}^{\tau_k} f(X_n) = \frac{1}{k} \sum_{n=0}^{\tau_1} f(X_n) + \frac{1}{k} \sum_{\ell=1}^{k-1} \sum_{n=\tau_\ell+1}^{\tau_{\ell+1}} f(X_n)$$

will converge almost surely to the mean of an individual excursion. A law of large numbers can be established from this observation, combined with the so-called renewal theorem, which controls the behaviour of the regeneration time points.

Also central limit theorems for Markov chains, and indeed laws of the iterated logarithm, can be obtained from the decomposition of the chain into iid. excursions. ◦
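The decomposition into excursions is easy to carry out on a simulated path; in the sketch below (our own; the two-state chain, the function f and the regeneration state are all arbitrary) the excursion sums can then be inspected for the iid behaviour described above.

```python
# Sketch: cut a path at successive returns to state 0 and collect excursion sums.
import numpy as np

rng = np.random.default_rng(2)
P = np.array([[0.2, 0.8],
              [0.5, 0.5]])
f = np.array([1.0, -1.0])

path = [0]
for _ in range(10_000):
    path.append(rng.choice(2, p=P[path[-1]]))
path = np.array(path)

returns = np.flatnonzero(path[1:] == 0) + 1     # regeneration times tau_1 < tau_2 < ...
excursions = [f[path[a + 1 : b + 1]].sum() for a, b in zip(returns, returns[1:])]
# np.mean(excursions) estimates the common mean of a single excursion.
```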

Example 4.3.9. Let Y1, Y2, ... be iid. stochastic variables with values in N. They are interpreted as waiting times between events. The events themselves occur at times

$$S_n = \sum_{i=1}^{n} Y_i \quad \text{for } n = 0, 1, \ldots$$

(though the event at time 0 is probably fake - it is purely conventional). We are interested in the associated forward recurrence time chain

Vn = inf{Sk − n : Sk > n} for n = 0, 1, ...

and we intend to prove that it is Markovian.

The key argument is an application of the strong Markov property of the underlying random walk S0,S1,..., which of course is a time homogeneous Markov chain. For a given time point n we can define a random time

τ = inf{m ∈ N : Sm > n} .

This is clearly a stopping time with respect to the filtration generated by S0, S1, .... As each waiting time Yi is at least 1, we see that τ ≤ n + 1 - so τ is in fact a bounded stopping time. Hence the strong Markov property shows that

(Sτ ,Sτ+1,...) ⊥⊥ Fτ | Sτ .

But note that

Vn = Sτ − n , (4.11)

So the σ-algebra generated by Vn is the same as the σ-algebra generated by Sτ , and we have that

(Sτ, Sτ+1, ...) ⊥⊥ Fτ | Vn .

Hence the Markov property for the forward recurrence time chain will follow by reduction, if we can show that V0, V1, ..., Vn−1 are measurable with respect to Fτ, and if Vn+1 is measurable with respect to σ(Sτ, Sτ+1, ...).

The past is easily dealt with. If we consider some k = 0, 1, . . . , n − 1, there is an associated stopping time

σ = inf{m ∈ N : Sm > k} ,

Similarly to (4.11) we have that Vk = Sσ − k. As σ ≤ τ, it follows that Fσ ⊆ Fτ, and as Sσ is Fσ-measurable, it follows that Vk is Fτ-measurable as desired.

The future requires slightly more care. There are two cases, corresponding to whether an event occurs at time n + 1 or not. That is, corresponding to whether Sτ = n + 1 or Sτ > n + 1. In the latter case we will see the same event, looking forward from time n and from time n + 1. But if there is an event at time n + 1, this is the event we will see looking forward from time n, while we will see the next event when looking forward from time n + 1. That event will be Sτ+1. Combining these observations, we get

$$V_{n+1} = \begin{cases} S_\tau - (n+1) & \text{if } S_\tau > n+1 \\ S_{\tau+1} - (n+1) & \text{if } S_\tau = n+1 \end{cases}$$

It is evident from this formula that Vn+1 is measurable with respect to σ(Sτ, Sτ+1). And so, all the more, it is measurable with respect to σ(Sτ, Sτ+1, ...). ◦

4.4 An integration formula for a homogeneous Markov chain

Assume that X0, X1, X2, ... is a time homogeneous Markov chain. Let (Px)x∈X be the transition probability, and let µ be the distribution of X0. Recall from the considerations before Theorem 4.1.3 that we derived an expression for the distribution of (X0, ..., Xn):

$$P(X_0 \in A_0, \ldots, X_n \in A_n) = \int_{A_0} \int_{A_1} \cdots \int_{A_{n-1}} P_{x_{n-1}}(A_n) \, dP_{x_{n-2}}(x_{n-1}) \cdots dP_{x_0}(x_1) \, dX_0(P)(x_0)$$

Note that this could be written as

$$\int_{A_0} P^n_x(A_1 \times \cdots \times A_n) \, dX_0(P)(x) ,$$

where $P^n_x$ is given by the integrals

$$P^n_x(A_1 \times \cdots \times A_n) = \int_{A_1} \cdots \int_{A_{n-1}} P_{x_{n-1}}(A_n) \, dP_{x_{n-2}}(x_{n-1}) \cdots dP_x(x_1) .$$

It is easily seen that $(P^n_x)_{x \in \mathcal{X}}$ is a Markov kernel (on $(\mathcal{X}^n, E^n)$). So we have

Theorem 4.4.1. Let X0, X1, X2, ... be a time homogeneous Markov chain with one-step probabilities P. Then for all n ∈ N

$$(X_1, \ldots, X_n) \mid X_0 \overset{\mathcal{D}}{=} P^n$$

Observe that if X0 ≡ x (similar to saying that X0(P) = δx) then we have that

$$P\big( (X_1, \ldots, X_n) \in B \big) = \int P^n_y(B) \, dX_0(P)(y) = P^n_x(B) ,$$

so we can interpret $P^n_x$ to be the marginal distribution of (X1, ..., Xn) if the chain is started deterministically in x. Equivalently this can be expressed as $P^n_x = P^n_{\delta_x}$. This interpretation is very useful.

Since (Px)x∈X is the conditional distribution of Xn+1 given Xn for all n ∈ N we almost immediately obtain

Theorem 4.4.2. Let X0, X1, X2, ... be a time homogeneous Markov chain with one-step probabilities P. Then for all n ∈ N

$$(X_{k+1}, \ldots, X_{k+n}) \mid X_k \overset{\mathcal{D}}{=} P^n \quad \text{for all } k \in \mathbb{N}$$

Proof. Simply write

$$P(X_k \in A_k, \ldots, X_{k+n} \in A_{k+n}) = \int_{A_k} \int_{A_{k+1}} \cdots \int_{A_{k+n-1}} P_{x_{k+n-1}}(A_{k+n}) \, dP_{x_{k+n-2}}(x_{k+n-1}) \cdots dP_{x_k}(x_{k+1}) \, dX_k(P)(x_k)$$

$$= \int_{A_k} P^n_x(A_{k+1} \times \cdots \times A_{k+n}) \, dX_k(P)(x)$$

which shows the result.

4.5 The Chapman-Kolmogorov equations

In this section, we will be discussing a number of Markov kernels on a fixed space (X, E). They will be denoted by generic symbols like P, Q and R. To be specific, when we write Q, we are talking about an (X, E)-Markov kernel on (X, E), which in all its glory could be spelled out as (Qx)x∈X.

Definition 4.5.1. We define the composition of two Markov kernels P and Q on X as the new Markov kernel P ∗ Q, given by

$$(P * Q)_x(A) = \int P_y(A) \, dQ_x(y) \quad \text{for all } A \in E, \; x \in \mathcal{X} .$$

Of course it has to be checked that P ∗ Q really is a new Markov kernel, but that presents no difficulty. We could extend the definition to composition of kernels on different spaces: if Q is an X-kernel on Y, and P is a Y-kernel on Z, then P ∗ Q, given by the above formula, is an X-kernel on Z. A certain amount of bookkeeping has to be developed in order to keep track of where the various kernels live, and we will not pursue this matter.

Lemma 4.5.2. If P and Q are two Markov kernels on X, then

$$\int f(z) \, d(P * Q)_x(z) = \iint f(z) \, dP_y(z) \, dQ_x(y) ,$$

at least for all non-negative measurable functions f : X → R, and all bounded, measurable functions.

Proof. An application of the extended Tonelli/Fubini theorem. Note that by definition, the integral formula is true for indicators f(x) = 1A(x).

Lemma 4.5.3. Let X, Y and Z be random variables with values in (X , E). Suppose

1) Q is the conditional distribution of Y given X,

2) P is the conditional distribution of Z given Y ,

3) X ⊥⊥ Z | Y .

Then P ∗ Q is the conditional distribution of Z given X.

Proof. A simple computation. Observe that the conditional independence gives that the conditional distribution of Z given (X, Y) is the Markov kernel (x, y) ↦ Py. Hence

$$P(X \in A, Z \in C) = P(X \in A, Y \in \mathcal{X}, Z \in C) = \int_{A \times \mathcal{X}} P_y(C) \, d(X,Y)(P)(x, y) = \int_A \int P_y(C) \, dQ_x(y) \, dX(P)(x) = \int_A (P * Q)_x(C) \, dX(P)(x)$$

Lemma 4.5.3 makes it evident that composition of Markov kernels is not commutative. In general, P ∗ Q ≠ Q ∗ P. But other simple properties hold:

Lemma 4.5.4. Composition of Markov kernels on (X, E) is associative. That is, if P, Q and R are three Markov kernels, then

$$\big( P * Q \big) * R = P * \big( Q * R \big) .$$

Proof. For any x ∈ X and A ∈ E we have

$$\big( (P * Q) * R \big)_x(A) = \int (P * Q)_y(A) \, dR_x(y) = \iint P_z(A) \, dQ_y(z) \, dR_x(y) = \int P_z(A) \, d(Q * R)_x(z) = \big( P * (Q * R) \big)_x(A) .$$

The associativity lets us interpret long compositions of Markov kernels without ambiguity - it does not really matter in which order the compositions are carried out. In particular we can define powers of a Markov kernel,

$$P^{*n} = \underbrace{P * P * \cdots * P}_{n \text{ factors}}$$

without worrying about whether the compositions should be carried out from left to right or from right to left or from the middle and out. And we immediately have formulas like

$$P^{*(n+m)} = P^{*n} * P^{*m} . \tag{4.12}$$

We may even extend the notion of power to a 'power of zero', if we define P^{*0} as the trivial Markov kernel, consisting of a one-point measure in each point,

$$P^{*0}_x = \delta_x .$$

With this extension (4.12) holds for all n, m ≥ 0.

Suppose now that X0, X1, ... is a Markov chain on (X, E). Suppose it is time homogeneous in the classical sense: there is a single Markov kernel P that can act as one-step transition probability from time n to time n + 1 for all values of n. Symbolically written:

$$P \overset{\mathcal{D}}{=} X_{n+1} \mid X_n \quad \text{for } n = 0, 1, 2, \ldots$$

In this scenario we are able to write the entire transition structure of the chain in terms of composition powers of P. We obtain formulae like

$$P^{*k} \overset{\mathcal{D}}{=} X_{n+k} \mid X_n \quad \text{for } n, k = 0, 1, 2, \ldots \tag{4.13}$$

A formal proof of this statement is based on lemma 4.5.3, and proceeds via induction on k - the details are left to the reader. A combined use of (4.12) and (4.13) is prototypical in Markov chain theory, and is usually referred to as a use of the Chapman-Kolmogorov equations. You could say that the Chapman-Kolmogorov equations are not really equations, but a principle that combines the power formula (4.12) (which is a trivial consequence of the associativity of composition of Markov kernels) with the interpretation of the kernels in the formula as specific conditional distributions. As in: 'we can compute the conditional distribution of Xn+k+m given Xn by finding the conditional distribution of Xn+k given Xn and the conditional distribution of Xn+k+m given Xn+k, and combine them via composition'.
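On a finite state space a Markov kernel is just a row-stochastic matrix and composition is matrix multiplication, so (4.12) collapses to the familiar power rule for matrices. A minimal numerical check of ours, with an arbitrary matrix:

```python
# Sketch: Chapman-Kolmogorov on a finite state space, P^(n+m) = P^n P^m.
import numpy as np

P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.5, 0.5, 0.0]])      # rows sum to 1

n, m = 3, 4
lhs = np.linalg.matrix_power(P, n + m)
rhs = np.linalg.matrix_power(P, n) @ np.linalg.matrix_power(P, m)
assert np.allclose(lhs, rhs)
```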

Example 4.5.5. Let X0, X1, X2, ... be a time homogeneous Markov chain with one-step transition probability P. For a fixed k > 0, the process X0, Xk, X2k, X3k, ... is called the k-skeleton of the original chain. This is itself a Markov chain, as the original chain satisfies the Markov property

(X0, X1, ..., Xnk−1) ⊥⊥ (Xnk+1, Xnk+2, ...) | Xnk

and this can be reduced to

(X0, Xk, ..., X(n−1)k) ⊥⊥ X(n+1)k | Xnk .

Note also that the k-skeleton chain is time homogeneous with one-step transition probability P^{*k}, as follows from the Chapman-Kolmogorov equations. ◦

4.6 Stationary distributions

Recall that a process X0, X1, X2, ... is called stationary, if

$$(X_1, X_2, X_3, \ldots) \overset{\mathcal{D}}{=} (X_0, X_1, X_2, \ldots)$$

and that we have the very useful result

Theorem 4.6.1. A stochastic process X0, X1, X2, ... is stationary if and only if

$$(X_1, X_2, \ldots, X_{n+1}) \overset{\mathcal{D}}{=} (X_0, X_1, \ldots, X_n)$$

for all n ∈ N.

Recall that for a Markov chain X0, X1, ... with transition probability (Px) and initial distribution µ, we find the marginal distribution of X1 as

$$P(X_1 \in A) = P(X_0 \in \mathcal{X}, X_1 \in A) = \int P_x(A) \, dX_0(P)(x) = \int P_x(A) \, d\mu(x)$$

We have the following extremely simple condition for stationarity, saying that all that is needed is that X0 and X1 have the same distribution.

Theorem 4.6.2. Assume that X0, X1, X2, ... is a time homogeneous Markov chain with one-step transition probability P and initial distribution µ = X0(P). Then the Markov chain is stationary, if

$$\mu(A) = \int P_x(A) \, d\mu(x) \tag{4.14}$$

for all A ∈ E.

Proof. We can express the distribution of (X0, ..., Xn) as follows, using that P^n is the conditional distribution of (X1, ..., Xn) given X0:

$$P(X_0 \in A_0, \ldots, X_n \in A_n) = \int_{A_0} P^n_x(A_1 \times \cdots \times A_n) \, dX_0(P)(x)$$

Using that P^n is also the conditional distribution of (X2, ..., Xn+1) given X1 we obtain that the distribution of (X1, ..., Xn+1) is given by

$$P(X_1 \in A_0, \ldots, X_{n+1} \in A_n) = \int_{A_0} P^n_x(A_1 \times \cdots \times A_n) \, dX_1(P)(x)$$

From this we see that the process will be stationary, if X0(P) = X1(P), which is exactly the condition (4.14). That 'only if' holds follows from letting n = 1 in Theorem 4.6.1, which gives that $X_0 \overset{\mathcal{D}}{=} X_1$ is a necessity for stationarity.
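On a finite state space, condition (4.14) reads µP = µ for the transition matrix P, so a stationary distribution is a normalised left eigenvector with eigenvalue 1. A sketch of ours, with an arbitrary matrix:

```python
# Sketch: find a stationary initial distribution of a finite chain from (4.14).
import numpy as np

P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.5, 0.5, 0.0]])

eigvals, eigvecs = np.linalg.eig(P.T)      # left eigenvectors of P
i = np.argmin(np.abs(eigvals - 1.0))       # pick the eigenvalue-1 eigenvector
mu = np.real(eigvecs[:, i])
mu = mu / mu.sum()                         # normalise to a probability vector
assert np.allclose(mu @ P, mu)             # mu(A) = int P_x(A) dmu(x) in matrix form
```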

4.7 Exercises

Exercise 4.1. Assume that X0,X1,X2,... is a Markov chain. Show that also X0,X2,X4,... is a Markov chain (Hint). ◦

Exercise 4.2. Consider the one–dimensional AR(1)–process

Xn+1 = ρXn + εn+1 ,

where ε1, ε2, ... are iid with a N(0, 1) distribution. Assume that X0 is independent of all the ε's.

(1) Find the conditional distribution of Xn+1 given Xn for all n ∈ N0.

(2) Assume that |ρ| < 1, and assume that X0 has a N(0, σ²) distribution. Find σ² such that all Xn have the same distribution as X0. Is this possible, if |ρ| ≥ 1?

Exercise 4.3. In general a renewal process (as defined in Example 4.1.14) does not have the Markov property. In this exercise we shall see an example, where the ”memoryless property” of a geometric distribution actually makes a renewal process Markovian.

Let Z1, Z2, ... be independent and identically distributed Bernoulli variables with success probability p ∈ (0, 1):

P(Zn = 1) = p , P(Zn = 0) = 1 − p .

Define the associated random walk

$$M_n = \sum_{k=1}^{n} Z_k$$

for n = 0, 1, 2, ....

(1) Argue that M0,M1,... is a Markov chain.

Let (Fn)n∈N0 be the corresponding filtration:

Fn = σ(M0, ..., Mn) .

Also define the random times

Tn = inf{m ∈ N0 : Mm ≥ n} for n = 0, 1, 2,....

(2) Show that each Tn is a stopping time with respect to the filtration (Fn)n∈N0 .

(3) Show that each Tn is almost surely finite (Hint).

(4) Argue that T0 < T1 < T2 < ··· (not a deep result:-)).

Now define the ”waiting times”

Yn = Tn − Tn−1 for n = 1, 2, ....

(5) Show that the sequence Y1, Y2, ... are independent and identically distributed with common distribution

P(Yn = k) = (1 − p)^{k−1} p for k = 1, 2, ... (4.15)

(Hint).

(6) Define N0, N1, N2, ... to be the renewal process generated by Y1, Y2, ...: Let

$$S_n = \sum_{k=1}^{n} Y_k$$

and define

Nn = sup{k : Sk ≤ n} .

Show that Nn = Mn for all n = 0, 1, 2, ... (Hint).

(7) Collect the results from (1)-(5) to a proof of the general statement: If Y1,Y2,... are in- dependent and identically distributed with distribution as in (4.15), then the associated renewal process is a Markov chain.

Exercise 4.4. Assume that X0,X1,X2,... is a Markov chain. Assume that τ is an almost surely finite stopping time.

(1) Show that τ + k is a stopping time for each k ∈ N (Hint).

(2) Show that σ(τ) = σ(τ + k) for each k ∈ N.

(3) Show that the sequence (Xτ+k, τ + k)k∈N0 is a Markov chain (Hint).

Exercise 4.5. Assume that X0,X1,X2,... is a time homogeneous Markov chain with values in (X , E). Assume that τ is an almost surely finite stopping time.

(1) Show that the sequence Xτ ,Xτ+1,Xτ+2,... is a Markov chain (Hint).

(2) Show that the Markov chain is time homogeneous with

$$X_{\tau+k+1} \mid X_{\tau+k} \overset{\mathcal{D}}{=} X_{k+1} \mid X_k$$

(Hint).

Now assume that X0 = x for some x ∈ X . Furthermore let

τ = inf{n ≥ 1 : Xn = x} and assume that P (τ < ∞) = 1.

(3) Argue that X0, X1, X2, ... and Xτ, Xτ+1, Xτ+2, ... have the same distribution.

Define

$$N_x = \sum_{n=1}^{\infty} 1_{(X_n = x)}$$

(4) Show that P (Nx = ∞) = 1.

Exercise 4.6. Assume that $(X^1_0, X^1_1, X^1_2, \ldots)$ and $(X^2_0, X^2_1, X^2_2, \ldots)$ are two independent time homogeneous Markov chains on (X, E) with the same transition probabilities (Px)x∈X. Let

$$\tau = \inf\{n \in \mathbb{N}_0 : X^1_n = X^2_n\}$$

and assume that P(τ < ∞) = 1.

(1) Show that τ is a stopping time with respect to (the filtration generated by) the process $(X^1_n, X^2_n)_{n \in \mathbb{N}_0}$.

Define the process X0, X1, X2, ... by

$$X_n = \begin{cases} X^1_n & n \leq \tau \\ X^2_n & n > \tau \end{cases}$$

We can assume that the two Markov chains have the same update scheme:

$$X^1_{n+1} = \phi(X^1_n, U^1_{n+1}) , \qquad X^2_{n+1} = \phi(X^2_n, U^2_{n+1})$$

where all $U^1_1, U^1_2, \ldots, U^2_1, U^2_2, \ldots$ are iid uniformly distributed. Define the new sequence U1, U2, ... by

$$U_n = U^1_n \cdot 1_{(\tau \geq n)} + U^2_n \cdot 1_{(\tau < n)}$$

(2) Show that U1,U2,... are iid following the uniform distribution (Hint).

(3) Show that X0, X1, X2, ... is a time homogeneous Markov chain with transition probabilities (Px)x∈X (Hint).
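For orientation, here is a sketch (ours, with an arbitrary finite chain) of the coupling this exercise describes: two independent copies are run with the shared update function until they meet at time τ, after which the merged chain follows the second copy.

```python
# Sketch of the coupling construction on a finite state space.
import numpy as np

rng = np.random.default_rng(3)
P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.5, 0.5, 0.0]])
cum = np.cumsum(P, axis=1)

def phi(x, u):
    return int(np.searchsorted(cum[x], u))   # the common update function

x1, x2 = 0, 2                                # X^1_0 and X^2_0
tau = None
X = [x1]                                     # the merged chain
for n in range(1, 200):
    x1, x2 = phi(x1, rng.uniform()), phi(x2, rng.uniform())
    if tau is None and x1 == x2:
        tau = n                              # the coupling time
    X.append(x1 if tau is None or n <= tau else x2)
```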

Exercise 4.7. Assume that X0, X1, X2, ... is a time homogeneous Markov chain on (X, E). Let A be some measurable subset of X and define the first hitting time of A by

τA = inf{n ∈ N0 : Xn ∈ A} .

Define σB to be the first time the process hits the set B after time τA

σB = inf{n > τA : Xn ∈ B}

(1) Show that

$$X_{\sigma_B} \perp\!\!\!\perp \mathcal{F}_{\tau_A} \mid X_{\tau_A}$$

(Hint).

Now let the set A be fixed and define the sequence of stopping times τ1 < τ2 < ... by

τ1 = τA

and recursively

τn+1 = inf{m > τn : Xm ∈ A} .

(2) Show that $X_{\tau_1}, X_{\tau_2}, \ldots$ is a Markov chain.

(3) Show that $X_{\tau_1}, X_{\tau_2}, \ldots$ is time homogeneous (Hint).

Exercise 4.8. Assume that X0, X1, X2, ... is a time homogeneous Markov chain with transition probabilities (Px)x∈X. Let τ be a first hitting time

τ = inf{n ∈ N0 : Xn ∈ A}

for some measurable set A ⊆ X. Define the process Y0, Y1, Y2, ... by

Yn = Xτ∧n

for all n ∈ N0.

(1) Show that Y0,Y1,Y2,... is a Markov chain (Hint).

(2) Show that Y0,Y1,Y2,... is time homogeneous and find the transition probabilities (Hint).

◦

Chapter 5

Ergodic theory for Markov chains on general state spaces

Consider a stochastic process X0, X1, X2, ... with values in some measurable space (X, E). Then we are often interested in knowing when empirical means like

$$\frac{1}{n} \sum_{k=0}^{n-1} f(X_k, X_{k+1}, X_{k+2}, \ldots)$$

converge, where f : X^∞ → R is some measurable function defined on the sequence space X^∞.

A well known simple result is the Strong Law of Large Numbers:

Theorem 5.0.1. Assume that X0, X1, X2, ... are independent and identically distributed. Assume that f : X → R is a measurable function such that E|f(X0)| < ∞. Then

$$\frac{1}{n} \sum_{k=0}^{n-1} f(X_k) \to E f(X_0) \quad a.s.$$

as n → ∞.

A much more general result is the Ergodic Theorem. To formulate that result, we need to recall some definitions from the course VidSand1.

Let in the following X0,X1,X2,... be a stochastic process with values in (X , E). Let P be the distribution of the sequence (X0,X1,X2,...) – hence it is a probability measure on (X ∞, E∞). Let S : X ∞ → X ∞ be the shift

S(x0, x1, x2,...) = (x1, x2, x3,...)

Note that X0, X1, X2, ... is stationary, if

$$(X_0, X_1, X_2, \ldots) \overset{\mathcal{D}}{=} S(X_0, X_1, X_2, \ldots)$$

We define the invariant σ–algebra for S by

I = {A ∈ E^∞ : S^{−1}(A) = A}

Definition 5.0.2. A stationary process X0,X1,X2,... with distribution P is called ergodic if P(A) ∈ {0, 1} for all A ∈ I.

For ergodic processes we have the Ergodic Theorem

Theorem 5.0.3 (Ergodic theorem). Let X0,X1,X2,... be a stationary and ergodic process, and let f : X ∞ → R be a measurable map, such that

E|f(X0,X1,X2,...)| < ∞ .

Then

$$\frac{1}{n} \sum_{k=0}^{n-1} f(X_k, X_{k+1}, X_{k+2}, \ldots) \to E f(X_0, X_1, X_2, \ldots) \quad a.s.$$

as n → ∞.

Note: The limit can be written as

$$E f(X_0, X_1, X_2, \ldots) = \int f(x_0, x_1, x_2, \ldots) \, d\mathbb{P}(x_0, x_1, x_2, \ldots)$$

which will be a useful notation later on.
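As a numerical illustration of the theorem - our own sketch with arbitrary numbers - the running averages of a function of a single coordinate of a stationary two-state chain settle down at the stationary mean:

```python
# Sketch: ergodic averages (1/n) sum f(X_k) for a stationary two-state chain.
import numpy as np

rng = np.random.default_rng(4)
P = np.array([[0.2, 0.8],
              [0.5, 0.5]])
mu = np.array([5 / 13, 8 / 13])       # stationary distribution: mu P = mu
f = np.array([1.0, -1.0])

x = rng.choice(2, p=mu)               # start the chain in stationarity
total, n = 0.0, 100_000
for _ in range(n):
    total += f[x]
    x = rng.choice(2, p=P[x])
print(total / n, mu @ f)              # the two numbers should be close
```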

A useful tool to check whether a process is ergodic is the concept of being mixing

Definition 5.0.4. Let X0, X1, X2, ... be a stationary stochastic process with distribution P. We say that S is mixing with respect to P, if for all F and G in E^∞

$$\lim_{n \to \infty} \mathbb{P}(F \cap S^{-n}(G)) = \mathbb{P}(F) \, \mathbb{P}(G)$$

because we have the result

Theorem 5.0.5. If S is mixing with respect to P, then X0,X1,X2,... is ergodic.

The following Corollary gives a condition that ensures S being mixing

Corollary 5.0.6. If for all m, k ∈ N0 and all A ∈ E^{m+1} and B ∈ E^{k+1} it holds that

$$\lim_{N \to \infty} P\big( (X_0, \ldots, X_m) \in A, (X_N, \ldots, X_{N+k}) \in B \big) = P\big( (X_0, \ldots, X_m) \in A \big) \, P\big( (X_0, \ldots, X_k) \in B \big) ,$$

then the shift S is mixing with respect to P.

In this chapter we will find conditions that make Markov chains ergodic, such that the Ergodic Theorem holds. In fact, we will obtain a much stronger result: the averages in the theorem may converge even in situations where the process is not stationary!

5.1 Convergence of transition probabilities

We start by introducing a convergence concept for sequences of probability measures on (X , E).

Definition 5.1.1. Let µ0 and ν0, ν1, ν2, ... be probability measures on the same measurable space (X, E). We say that νn converges to µ0 as n → ∞, and write νn → µ0, provided

$$\lim_{n \to \infty} \int h(y) \, d\nu_n(y) = \int h(y) \, d\mu_0(y) \tag{5.1}$$

for every bounded and measurable function h on (X, E).

This is a very strong form of convergence; e.g. it follows with h = 1B that

$$\lim_{n \to \infty} \nu_n(B) = \mu_0(B)$$

for every B ∈ E.

The standard form of weak convergence requires that (5.1) holds for all continuous and bounded functions. However, that will only make sense if there is a topology on the state space X. But in that case the convergence above is stronger than weak convergence. If X is a finite space, then all bounded functions are continuous, and hence the two forms of convergence coincide.

Theorem 5.1.2. Assume that X0, X1, X2, ... is a Markov chain with transition probabilities (Px)x∈X. If there exists a probability measure µ0 such that $P^{*n}_x \to \mu_0$ for every x ∈ X, then µ0 is a stationary initial distribution of X0, X1, X2, ..., and it is the only stationary distribution.

Proof. That µ0 is a stationary distribution is seen as follows: Let A ∈ E. Then, because of the Chapman-Kolmogorov equations,

$$\mu_0(A) = \lim_{n \to \infty} P^{*(n+1)}_y(A) = \lim_{n \to \infty} \int P_x(A) \, dP^{*n}_y(x) = \int P_x(A) \, d\mu_0(x) ,$$

where the last equality uses (5.1) on the bounded measurable function x ↦ Px(A). Due to Theorem 4.6.2 this shows that the Markov chain is stationary, if X0(P) = µ0.

Now we show the uniqueness: Assume that µ is a stationary initial distribution. If we let X0(P) = µ we obtain that Xn(P) = X0(P), so

$$\mu(A) = P(X_n \in A) = P(X_0 \in \mathcal{X}, X_n \in A) = \int P^{*n}_x(A) \, dX_0(P)(x) = \int P^{*n}_x(A) \, d\mu(x)$$

where we have used that $P^{*n} \overset{\mathcal{D}}{=} X_n \mid X_0$. Now let n → ∞ in the above equality. From dominated convergence (since $0 \leq P^{*n}_x(A) \leq 1$) we obtain that

$$\mu(A) = \int \mu_0(A) \, d\mu(x) = \mu_0(A) ,$$

showing that µ = µ0.

Theorem 5.1.3. Assume that X0, X1, X2, ... is a Markov chain with transition probabilities (Px)x∈X, and assume that there exists a probability measure µ0 such that $P^{*n}_x \to \mu_0$ for every x ∈ X. Assume that X0(P) = µ0, such that the Markov chain is stationary, and let $\mathbb{P}_{\mu_0}$ be the distribution of the Markov chain. Then it holds that the shift S is mixing with respect to $\mathbb{P}_{\mu_0}$, and in particular the Markov chain is ergodic.

m+1 k Proof. Assume that X0(P ) = µ0. Let A ∈ E and B ∈ E and consider

P ((X0,...,Xm) ∈ A, (XN+1,...,XN+k ∈ B) (5.2)

=P ((X0,...,Xm) ∈ A, XN ∈ X , (XN+1,...,XN+k) ∈ B) Z = Pk (B) d(X ,...,X ,X )(P )(x , . . . , x , x ) xN 0 m N 0 m N A×X Z Z = Pk(B) dP ∗(N−m)(y) d(X ,...,X )(P )(x , . . . , x ) y xm 0 m 0 m A 5.2 Transition probabilities with densities 115

k D In the second equality we have used that P = (XN+1,...,XN+k) | XN , and

(XN+1,...,XN+k) ⊥⊥ (X0,...,Xm) | XN .

∗(N−m) D In the third equality we used that P = XN | Xm. For the inner integral we obtain Z Z Pk(B) dP ∗(N−m)(y) → Pk(B) dµ (y) y xm y 0

= P ((X1,...,Xk) ∈ B) = P ((X0,...,Xk−1) ∈ B) as N → ∞. So from Dominated convergence we obtain that the probability in (5.2) has the limit Z P ((X0,...,Xk) ∈ B)d(X0,...,Xm)(P )(x0, . . . , xm) A

= P ((X0,...,Xm) ∈ A)P ((X0,...,Xk−1) ∈ B) which according to Corollary 5.0.6 is precisely what is needed to say that S is mixing.

5.2 Transition probabilities with densities

We will in the rest of this chapter make the assumption that all the transition probabilities $(P_x)_{x \in \mathcal{X}}$ have densities with respect to some $\sigma$–finite measure $\nu$ on $(\mathcal{X}, \mathcal{E})$. We demand that for all $x \in \mathcal{X}$ it holds that
$$P_x(A) = \int_A k_x(y) \, d\nu(y),$$
where $(x, y) \mapsto k_x(y)$ is $(\mathcal{X}^2, \mathcal{E}^2)$–$(\mathbb{R}, \mathbb{B})$ measurable, and of course $y \mapsto k_x(y)$ is the density of $P_x$ with respect to $\nu$.

Then using the Chapman-Kolmogorov equation gives for $n > 1$ that
$$P_x^{*n}(A) = (P * P^{*(n-1)})_x(A) = \int P_y(A) \, dP_x^{*(n-1)}(y) = \int \left( \int_A k_y(z) \, d\nu(z) \right) dP_x^{*(n-1)}(y) = \int_A \left( \int k_y(z) \, dP_x^{*(n-1)}(y) \right) d\nu(z),$$
which shows that $P_x^{*n}$ has density
$$k_x^{(n)}(y) = \int k_z(y) \, dP_x^{*(n-1)}(z)$$
with respect to $\nu$. Using that $P_x^{*(n-1)}$ similarly has density $k_x^{(n-1)}$ gives
$$k_x^{(n)}(y) = \int k_x^{(n-1)}(z) \, k_z(y) \, d\nu(z).$$

Repeating the same argument with $n$ and $k$ arbitrary gives
$$k_x^{(n+k)}(y) = \int k_x^{(n)}(z) \, k_z^{(k)}(y) \, d\nu(z). \qquad (5.3)$$
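To make the recursion (5.3) concrete, here is a minimal numerical sketch (our addition, not part of the original notes). It iterates the one-step Chapman-Kolmogorov step on a grid for the AR(1) kernel $k_x(y)$ treated in Example 5.3.6 below, with $\nu$ the Lebesgue measure; the grid truncation, the value of $\rho$ and the starting point are all arbitrary choices, and the result is compared with the known closed form $P_x^{*n} = \mathcal{N}(\rho^n x, \sigma_n^2)$.

```python
import numpy as np

rho = 0.7
grid = np.linspace(-8.0, 8.0, 801)   # truncation of the real line; an assumption
dz = grid[1] - grid[0]

def k(x, y):
    # one-step transition density k_x(y): the N(rho*x, 1) density evaluated in y
    return np.exp(-0.5 * (y - rho * x) ** 2) / np.sqrt(2 * np.pi)

x0 = 2.0
kn = k(x0, grid)                      # k_{x0}^{(1)} on the grid
for n in range(2, 11):
    # one Chapman-Kolmogorov step: k^{(n)}(y) = int k^{(n-1)}(z) k_z(y) dz
    kn = np.array([(kn * k(grid, y)).sum() * dz for y in grid])

# closed form: k^{(10)} is the N(rho^10 * x0, (1 - rho^20)/(1 - rho^2)) density
s2 = (1 - rho ** 20) / (1 - rho ** 2)
exact = np.exp(-0.5 * (grid - rho ** 10 * x0) ** 2 / s2) / np.sqrt(2 * np.pi * s2)
print("max abs deviation on the grid:", np.abs(kn - exact).max())
```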

Now assume that $X_0, X_1, X_2, \ldots$ is a Markov chain with transition probabilities $(P_x)_{x \in \mathcal{X}}$ with densities as above. Assume furthermore that $X_0$ has distribution $\mu$. Then the distribution of $X_n$ is given by
$$P(X_n \in A) = P(X_0 \in \mathcal{X}, X_n \in A) = \int P_x^{*n}(A) \, d\mu(x) = \int \left( \int_A k_x^{(n)}(y) \, d\nu(y) \right) d\mu(x) = \int_A \left( \int k_x^{(n)}(y) \, d\mu(x) \right) d\nu(y),$$
which shows that $X_n(P)$ has density $P^{*n}(\mu)$ with respect to $\nu$, where
$$P^{*n}(\mu)(y) = \int k_x^{(n)}(y) \, d\mu(x).$$

In the case where $\mu$ has density $f$ with respect to $\nu$, we will use the notation $P^{*n}(f)$ for the density of $X_n(P)$. In that case we have
$$P^{*n}(f)(y) = \int k_x^{(n)}(y) f(x) \, d\nu(x).$$

Using (5.3) gives the following version of the Chapman-Kolmogorov equation:
$$P^{*(n+k)}(\mu)(y) = \int k_x^{(n+k)}(y) \, d\mu(x) = \iint k_x^{(n)}(z) \, k_z^{(k)}(y) \, d\nu(z) \, d\mu(x) = \int k_z^{(k)}(y) \underbrace{\left( \int k_x^{(n)}(z) \, d\mu(x) \right)}_{= P^{*n}(\mu)(z)} d\nu(z) = P^{*k}\big( P^{*n}(\mu) \big)(y). \qquad (5.4)$$

The equation (5.3) used again – now with $(n, k) = (1, n)$ – gives
$$P^{*n}(k_x)(y) = \int k_z^{(n)}(y) \, k_x(z) \, d\nu(z) = k_x^{(n+1)}(y). \qquad (5.5)$$

5.3 Asymptotic stability

Let $L^1(\nu)$ be the vector space of all $\nu$–integrable functions, and define
$$D = \left\{ f \in L^1(\nu) : f \ge 0, \ \int f \, d\nu = 1 \right\}$$
to be the subset of $L^1(\nu)$ consisting of all the probability densities. On $L^1(\nu)$ we have the $L^1$ norm given by
$$\|f\| = \int |f| \, d\nu$$
for $f \in L^1(\nu)$.

Definition 5.3.1. Transition probabilities $(P_x)_{x \in \mathcal{X}}$ that have densities $(k_x)_{x \in \mathcal{X}}$ with respect to $\nu$ are called asymptotically stable if there exists $f_0 \in D$ such that
$$\forall f \in D : \lim_{n \to \infty} \|P^{*n}(f) - f_0\| = 0. \qquad (5.6)$$

Theorem 5.3.2. Let $X_0, X_1, X_2, \ldots$ be a time homogeneous Markov chain with transition probabilities $(P_x)_{x \in \mathcal{X}}$ that are asymptotically stable. Then the probability measure $\mu_0 = f_0 \cdot \nu$ with density $f_0$ with respect to $\nu$ is the only stationary initial distribution. Furthermore the shift $S$ is mixing for the distribution $\mathbf{P}_{\mu_0}$ of the stationary chain. In particular the stationary chain is ergodic.

Proof. Let $x \in \mathcal{X}$. Since $k_x^{(n)}(y) = P^{*(n-1)}(k_x)(y)$ according to (5.5), we have
$$\int \left| k_x^{(n)}(y) - f_0(y) \right| d\nu(y) = \left\| P^{*(n-1)}(k_x) - f_0 \right\| \to 0$$
as $n \to \infty$, where the convergence follows from (5.6). Now let $h : \mathcal{X} \to \mathbb{R}$ be a bounded and measurable function with $|h| \le c$. Then for all $x \in \mathcal{X}$ it holds that
$$\left| \int h(y) \, dP_x^{*n}(y) - \int h(y) \, d\mu_0(y) \right| = \left| \int h(y) k_x^{(n)}(y) \, d\nu(y) - \int h(y) f_0(y) \, d\nu(y) \right| \le \int |h(y)| \left| k_x^{(n)}(y) - f_0(y) \right| d\nu(y) \le c \int \left| k_x^{(n)}(y) - f_0(y) \right| d\nu(y) \to 0$$
as $n \to \infty$. Hence we have shown that $P_x^{*n} \to \mu_0$ for all $x \in \mathcal{X}$. The theorem follows from Theorems 5.1.2 and 5.1.3.

Note: In the proof we only used that
$$\forall x \in \mathcal{X} : \lim_{n \to \infty} \left\| k_x^{(n)} - f_0 \right\| = 0,$$
but as we shall see in the following lemma, this is in fact equivalent to the assumption (5.6) defining asymptotic stability.

Lemma 5.3.3. Consider the framework from Theorem 5.3.2. Let $\mathcal{P}(\mathcal{X}, \mathcal{E})$ be the set of all probability measures on $(\mathcal{X}, \mathcal{E})$. Then the following conditions (a), (b) and (c) are equivalent:

(a) $\forall x \in \mathcal{X} : \lim_{n \to \infty} \| k_x^{(n)} - f_0 \| = 0$

(b) $\forall f \in D : \lim_{n \to \infty} \| P^{*n}(f) - f_0 \| = 0$

(c) $\forall \mu \in \mathcal{P}(\mathcal{X}, \mathcal{E}) : \lim_{n \to \infty} \| P^{*n}(\mu) - f_0 \| = 0$

Proof. It is obvious that (c) ⇒ (b) ⇒ (a), so we only need to show that (a) ⇒ (c). To show this, we first derive the following: Let $\mu$ be a probability measure on $(\mathcal{X}, \mathcal{E})$. Then
$$\left\| P^{*n}(\mu) - f_0 \right\| = \int \left| \int k_x^{(n)}(y) \, d\mu(x) - f_0(y) \right| d\nu(y) = \int \left| \int k_x^{(n)}(y) \, d\mu(x) - \int f_0(y) \, d\mu(x) \right| d\nu(y) \le \int \left( \int \left| k_x^{(n)}(y) - f_0(y) \right| d\mu(x) \right) d\nu(y) = \int \left( \int \left| k_x^{(n)}(y) - f_0(y) \right| d\nu(y) \right) d\mu(x),$$
and since
$$\int \left| k_x^{(n)}(y) - f_0(y) \right| d\nu(y) \le \int k_x^{(n)}(y) \, d\nu(y) + \int f_0(y) \, d\nu(y) = 2,$$
it follows from dominated convergence and (a) that the double integral tends to $0$ as $n \to \infty$.

Note that the conditions (a), (b) and (c) each state $L^1(\nu)$–convergence of the density of $X_n(P)$ to the density $f_0$, in the situation where

(a) $X_0 \equiv x$ for some $x \in \mathcal{X}$,

(b) the distribution of $X_0$ has a density w.r.t. $\nu$,

(c) the distribution of $X_0$ is an arbitrary distribution.

Condition (a) is the weakest of the three. Condition (b) will be the most convenient to work with, and condition (c) is the strongest, implying the following result.

Notation: Recall that $\mathbf{P}_\mu$ denotes the distribution on $(\mathcal{X}^\infty, \mathcal{E}^\infty)$ of $X_0, X_1, X_2, \ldots$ when $X_0$ has distribution $\mu$.

We will also need a notation for the probability measure on $(\Omega, \mathcal{F})$ that gives $X_0$ this distribution: For a given probability measure $\mu$ on $(\mathcal{X}, \mathcal{E})$, we let $P_\mu$ denote the probability measure on $(\Omega, \mathcal{F})$ under which $X_0$ has distribution $\mu$.

Theorem 5.3.4. Assume that $X_0, X_1, X_2, \ldots$ is a time homogeneous Markov chain with transition probabilities $(P_x)_{x \in \mathcal{X}}$ that have densities with respect to $\nu$ and are asymptotically stable with $\mu_0 = f_0 \cdot \nu$ as the stationary distribution. Then no matter which distribution $\mu$ is chosen as the initial distribution of $X_0$, the following holds:

If $f : \mathcal{X}^\infty \to \mathbb{R}$ is a measurable function such that
$$\int |f(x_0, x_1, x_2, \ldots)| \, d\mathbf{P}_{\mu_0}(x_0, x_1, x_2, \ldots) < \infty,$$
then
$$\frac{1}{n} \sum_{k=0}^{n-1} f(X_k, X_{k+1}, X_{k+2}, \ldots) \to \int f(x_0, x_1, x_2, \ldots) \, d\mathbf{P}_{\mu_0}(x_0, x_1, x_2, \ldots) \quad P_\mu\text{–a.s.}$$
as $n \to \infty$.

Proof. Recall that $P^k$ is the conditional distribution of $(X_{n+1}, \ldots, X_{n+k})$ given $X_n$, for all $n \in \mathbb{N}$. Then we have for all $B_k \in \mathcal{E}^k$ that
$$P_\mu((X_{n+1}, \ldots, X_{n+k}) \in B_k) = \int P_x^k(B_k) \, dX_n(P_\mu)(x) = \int P_x^k(B_k) \, P^{*n}(\mu)(x) \, d\nu(x),$$
where we have used that $X_n$ has density $P^{*n}(\mu)$ under $P_\mu$. Also recall that $P_x^k(B_k)$ can be considered as the probability $P_{\delta_x}((X_1, \ldots, X_k) \in B_k)$ in a situation where $X_0 \equiv x$ (hence the notation $P_{\delta_x}$!). Hence we can write
$$P_x^k(B_k) = \mathbf{P}_{\delta_x}(\mathcal{X} \times B_k \times \mathcal{X}^\infty),$$
so we have
$$P_\mu((X_{n+1}, \ldots, X_{n+k}) \in B_k) = \int \mathbf{P}_{\delta_x}(\mathcal{X} \times B_k \times \mathcal{X}^\infty) \, P^{*n}(\mu)(x) \, d\nu(x)$$
for all $B_k \in \mathcal{E}^k$. By standard extension arguments (since $\{ B_k \times \mathcal{X}^\infty : B_k \in \mathcal{E}^k, k \in \mathbb{N} \}$ is an intersection stable generating class for $\mathcal{E}^\infty$) we have
$$P_\mu((X_{n+1}, X_{n+2}, \ldots) \in B) = \int \mathbf{P}_{\delta_x}(\mathcal{X} \times B) \, P^{*n}(\mu)(x) \, d\nu(x)$$
for all $B \in \mathcal{E}^\infty$. Now define the event $H$ by
$$H = \left( \frac{1}{n} \sum_{k=0}^{n-1} f(X_k, X_{k+1}, X_{k+2}, \ldots) \to \int f(x_0, x_1, x_2, \ldots) \, d\mathbf{P}_{\mu_0}(x_0, x_1, x_2, \ldots) \right).$$

Then $H$ is in the tail $\sigma$–algebra for the sequence $X_0, X_1, X_2, \ldots$, so in fact $H$ has the form
$$H = \left( (X_{n+1}, X_{n+2}, \ldots) \in D_n \right)$$
for every $n \in \mathbb{N}$ and some $D_n \in \mathcal{E}^\infty$. Thereby we have that
$$\left| P_\mu(H) - P_{\mu_0}(H) \right| = \left| \int \mathbf{P}_{\delta_x}(\mathcal{X} \times D_n) \, P^{*n}(\mu)(x) \, d\nu(x) - \int \mathbf{P}_{\delta_x}(\mathcal{X} \times D_n) \, P^{*n}(f_0)(x) \, d\nu(x) \right| = \left| \int \mathbf{P}_{\delta_x}(\mathcal{X} \times D_n) \, P^{*n}(\mu)(x) \, d\nu(x) - \int \mathbf{P}_{\delta_x}(\mathcal{X} \times D_n) \, f_0(x) \, d\nu(x) \right| \le \int \mathbf{P}_{\delta_x}(\mathcal{X} \times D_n) \left| P^{*n}(\mu)(x) - f_0(x) \right| d\nu(x) \le \int \left| P^{*n}(\mu)(x) - f_0(x) \right| d\nu(x) = \left\| P^{*n}(\mu) - f_0 \right\| \to 0$$
as $n \to \infty$. Since we already know that $P_{\mu_0}(H) = 1$ according to Theorem 5.3.2, we must also have that $P_\mu(H) = 1$.

The concept used above is $L^1(\nu)$–convergence of densities with respect to the $\sigma$–finite measure $\nu$. If a sequence of densities converges in $L^1(\nu)$–norm, there exists a subsequence converging $\nu$–a.e. (almost everywhere). It is worth noting that, suitably understood, the converse also holds.

Lemma 5.3.5. If $f, f_0, f_1, f_2, \ldots \in D$ and $\lim_{n \to \infty} f_n(x) = f(x)$ $\nu$–a.e., then $\lim_{n \to \infty} \|f_n - f\| = 0$.

In other words, if a sequence of densities converges pointwise a.e., there is also $L^1$–convergence.

Proof. Define $d_n = f - f_n$. Then $d_n^+ \le f$, and from dominated convergence it follows that
$$\lim_{n \to \infty} \int d_n^+ \, d\nu = \int \lim_{n \to \infty} d_n^+ \, d\nu = \int 0 \, d\nu = 0.$$
Since $d_n^- = d_n^+ + f_n - f$ we have
$$\lim_{n \to \infty} \int d_n^- \, d\nu = \lim_{n \to \infty} \int d_n^+ \, d\nu + \lim_{n \to \infty} \int f_n \, d\nu - \int f \, d\nu = 0 + 1 - 1 = 0$$
(where it is used that $f$ and each $f_n$ are densities), and therefore
$$\lim_{n \to \infty} \int |f - f_n| \, d\nu = \lim_{n \to \infty} \int d_n^+ \, d\nu + \lim_{n \to \infty} \int d_n^- \, d\nu = 0 + 0 = 0.$$

Example 5.3.6. Consider the autoregressive process of order 1. Here $(\mathcal{X}, \mathcal{E}) = (\mathbb{R}, \mathbb{B})$, and for all $n \in \mathbb{N}_0$
$$X_{n+1} = \rho X_n + \epsilon_{n+1},$$
where $\epsilon_1, \epsilon_2, \ldots$ are independent and identically distributed and furthermore independent of $X_0$. We assume that each $\epsilon_n$ has a $\mathcal{N}(0, 1)$ distribution. In an exercise it is shown that the conditional distribution $(P_x^{*n})_{x \in \mathbb{R}}$ of $X_n$ given $X_0$ is given by
$$P_x^{*n} = \mathcal{N}(\rho^n x, \sigma_n^2), \quad \text{where} \quad \sigma_n^2 = \frac{1 - \rho^{2n}}{1 - \rho^2}.$$
Hence the $n$–step transition densities are given by
$$k_x^{(n)}(y) = \frac{1}{\sqrt{2\pi\sigma_n^2}} \exp\left( -\frac{(y - \rho^n x)^2}{2\sigma_n^2} \right).$$
If $|\rho| < 1$ we see that
$$\sigma_n^2 \to \sigma_0^2 := \frac{1}{1 - \rho^2}$$
and furthermore $\rho^n x \to 0$. So we have obtained that
$$k_x^{(n)}(y) \to \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp\left( -\frac{y^2}{2\sigma_0^2} \right)$$
for all $x$ and $y$ in $\mathbb{R}$. By a reference to Lemma 5.3.5 we see that the transition probabilities are asymptotically stable with $\mathcal{N}(0, \sigma_0^2)$ as the stationary distribution. ◦
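Since the $n$–step densities are available in closed form here, the $L^1$–convergence asserted by asymptotic stability can be inspected directly. The following small Python sketch (our addition; the grid truncation and the values of $\rho$ and $x$ are arbitrary choices) approximates $\|k_x^{(n)} - f_0\|$ by a Riemann sum and shows it decrease in $n$.

```python
import numpy as np

rho, x = 0.9, 5.0
grid = np.linspace(-15.0, 15.0, 3001)     # truncation of the real line; an assumption
dy = grid[1] - grid[0]

def normal_pdf(y, mean, var):
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

f0 = normal_pdf(grid, 0.0, 1.0 / (1 - rho ** 2))         # the stationary density
for n in (1, 5, 10, 25, 50):
    var_n = (1 - rho ** (2 * n)) / (1 - rho ** 2)
    kn = normal_pdf(grid, rho ** n * x, var_n)           # closed form from the example
    print(n, (np.abs(kn - f0) * dy).sum())               # approximates ||k_x^(n) - f0||
```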

Example 5.3.7. Here we consider the ARCH(1)–process: Assume that the Markov chain $X_0, X_1, X_2, \ldots$ is given on the update form
$$X_{n+1} = \sqrt{\gamma + \alpha X_n^2} \, \epsilon_{n+1},$$
where (again) $\epsilon_1, \epsilon_2, \ldots$ are independent and identically distributed standard normal variables, independent of $X_0$. Here we have the transition probabilities $(P_x)_{x \in \mathbb{R}}$, where
$$P_x = \mathcal{N}(0, \gamma + \alpha x^2),$$
corresponding to the densities
$$k_x(y) = \frac{1}{\sqrt{2\pi(\gamma + \alpha x^2)}} \exp\left( -\frac{y^2}{2(\gamma + \alpha x^2)} \right).$$
It is not possible to find the $n$–step transition densities in closed form, and neither does one know the stationary initial distribution in the cases where it exists. We shall see later that the Markov chain is stable if (and only if) $\alpha < 3.56\ldots$. ◦
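Even though the $n$–step densities are unavailable, the chain itself is trivial to simulate from the update form. A minimal sketch (our addition; the values of $\gamma$ and $\alpha$, the seed and the run length are arbitrary): with $\alpha < 1$, taking expectations in $X_{n+1}^2 = (\gamma + \alpha X_n^2)\epsilon_{n+1}^2$ under stationarity gives $EX^2 = \gamma/(1-\alpha)$, which the long-run average should approach.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha, n = 1.0, 0.5, 100_000
X = np.empty(n + 1)
X[0] = 0.0
eps = rng.standard_normal(n)
for i in range(n):
    # the ARCH(1) update: X_{n+1} = sqrt(gamma + alpha * X_n^2) * eps_{n+1}
    X[i + 1] = np.sqrt(gamma + alpha * X[i] ** 2) * eps[i]

print("empirical second moment: ", np.mean(X[1:] ** 2))
print("stationary second moment:", gamma / (1 - alpha))   # valid for alpha < 1
```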

5.4 Minorisation

We still consider the setup, where the transition probabilities (Px)x∈X have densities (kx)x∈X with respect to a σ–finite measure ν.

Recall that we for a density $f \in D$ found
$$P^{*n}(f)(y) = \int k_x^{(n)}(y) f(x) \, d\nu(x) \qquad (5.7)$$
to be the density of $X_n$ if $X_0$ has density $f$. It makes sense to use this definition for all $f \in L^1(\nu)$ – we shall see in the following lemma that the resulting function is well defined and that $P^{*n}(f) \in L^1(\nu)$ as well.

Lemma 5.4.1. The definition in (5.7) defines a linear map $P^{*n} : L^1(\nu) \to L^1(\nu)$. Furthermore it holds that
$$f \ge 0 \ \Rightarrow \ P^{*n}(f) \ge 0 \ \text{ and } \ \|P^{*n}(f)\| = \|f\|.$$
For $f \in L^1(\nu)$ it holds that
$$\left( P^{*n}(f) \right)^+ \le P^{*n}(f^+) \quad \text{and} \quad \left( P^{*n}(f) \right)^- \le P^{*n}(f^-).$$
Moreover $P^{*n}$ is a contraction:
$$\|P^{*n}(f)\| \le \|f\|.$$

Proof. It is obvious that the map is linear, and that $f \ge 0$ implies $P^{*n}(f) \ge 0$ (since $k_x^{(n)}$ is non–negative). For $f \ge 0$ we have
$$\|P^{*n}(f)\| = \int P^{*n}(f)(y) \, d\nu(y) = \int \left( \int k_x^{(n)}(y) f(x) \, d\nu(x) \right) d\nu(y),$$
which using Tonelli's theorem equals
$$\int f(x) \left( \int k_x^{(n)}(y) \, d\nu(y) \right) d\nu(x) = \int f(x) \, d\nu(x) = \|f\|.$$
From this we see in particular that the definition of $P^{*n}(f)$ makes sense for all $f \in L^1(\nu)$, since the integrals involved are finite when integrating $|f|$. Since the map is linear and positive (maps non–negative functions to non–negative functions), it must be increasing, such that (since $f^+ \ge f$) $P^{*n}(f^+) \ge P^{*n}(f)$. But since also $P^{*n}(f^+) \ge 0$, we derive
$$P^{*n}(f^+) \ge \left( P^{*n}(f) \right)^+.$$
The argument for the negative part is similar. Then
$$|P^{*n}(f)| = \left( P^{*n}(f) \right)^+ + \left( P^{*n}(f) \right)^- \le P^{*n}(f^+) + P^{*n}(f^-) = P^{*n}(f^+ + f^-) = P^{*n}(|f|),$$
so
$$\|P^{*n}(f)\| \le \|P^{*n}(|f|)\| = \|f\|.$$

Definition 5.4.2. A non–negative integrable function $h \in L^1(\nu)$ is said to be a minorant for the transition probabilities if $\|h\| > 0$ and there for all $f \in D$ exists a sequence of non–negative functions $\epsilon_n(f)$ in $L^1(\nu)$ such that
$$P^{*n}(f) \ge h - \epsilon_n(f) \qquad (5.8)$$
and
$$\lim_{n \to \infty} \|\epsilon_n(f)\| = \lim_{n \to \infty} \int \epsilon_n(f) \, d\nu = 0. \qquad (5.9)$$

Note: If it holds that $P^{*n}(f) - h \ge 0$ from some step onward, then $h$ is a minorant, and we can simply let $\epsilon_n(f) = 0$ from this step.

The inequality (5.8) is equivalent to saying that
$$\epsilon_n(f) \ge (h - P^{*n}(f))^+ = (P^{*n}(f) - h)^-,$$
so we could of course define $\epsilon_n = (P^{*n}(f) - h)^-$ and have that (5.9) is satisfied precisely when $\|(P^{*n}(f) - h)^-\| \to 0$. However, the definition using $\epsilon_n$–functions is simpler to work with. The inequality (5.8) may be written as
$$|P^{*n}(f) - h| \le P^{*n}(f) - h + 2\epsilon_n(f),$$
so
$$\|P^{*n}(f) - h\| \le \int P^{*n}(f) - h + 2\epsilon_n(f) \, d\nu = 1 - \|h\| + 2\|\epsilon_n(f)\|.$$
Letting $n \to \infty$ on the right hand side gives
$$\limsup_{n \to \infty} \|P^{*n}(f) - h\| \le 1 - \|h\|. \qquad (5.10)$$

Think of a Markov chain $X_0, X_1, X_2, \ldots$, where $X_0$ has distribution $\mu$ and where the transition probabilities $(P_x)_{x \in \mathcal{X}}$ have a minorant $h$. Writing $P^{*n}(\mu) = P^{*(n-1)}(P^{*1}(\mu))$, where $P^{*1}(\mu)$ is a density, the inequality (5.8) gives
$$\liminf_{n \to \infty} P_\mu(X_n \in B) = \liminf_{n \to \infty} \int_B P^{*(n-1)}(P^{*1}(\mu)) \, d\nu \ge \int_B h \, d\nu.$$
Since $\int h \, d\nu > 0$, this shows that the probability mass cannot "vanish" as $n \to \infty$, if a minorant exists. Hence (part of) the intuition behind the existence of a minorant is that the distribution of the variables cannot keep changing – the probability mass is (partly) fixed by the function $h$, so there are limits to how much it can "move around".

Theorem 5.4.3. The transition probabilities $(P_x)_{x \in \mathcal{X}}$ for a Markov chain are asymptotically stable if and only if there exists a minorant.

Proof. First assume that the transition probabilities are asymptotically stable:
$$\forall f \in D : \lim_{n \to \infty} \|P^{*n}(f) - f_0\| = 0.$$
In particular we have for a given density $f \in D$ that
$$\|(P^{*n}(f) - f_0)^-\| \le \|P^{*n}(f) - f_0\| \to 0,$$
which shows that the stationary density $f_0$ is a minorant.

Assume conversely that a minorant $h$ exists. From (5.10) it follows that $0 < \|h\| \le 1$. It is also seen from (5.10) that it suffices to show that there exists a minorant $h$ with $\|h\| = 1$. The idea in the proof will be to find a maximal minorant. Let
$$c = \sup\{ \|h\| : h \text{ is a minorant} \}.$$

Then $0 < c \le 1$ and we can choose a sequence of minorants $h_1, h_2, \ldots$ with $\|h_m\| \to c$ as $m \to \infty$. We first show that the maximum $h_1 \vee h_2$ is again a minorant if both $h_1$ and $h_2$ are minorants: We have
$$P^{*n}(f) \ge h_1 - \epsilon_{1,n}(f) \quad \text{and} \quad P^{*n}(f) \ge h_2 - \epsilon_{2,n}(f).$$
Then also
$$P^{*n}(f) \ge h_1 \vee h_2 - \epsilon_{1,n}(f) \vee \epsilon_{2,n}(f),$$
where $\|\epsilon_{1,n}(f) \vee \epsilon_{2,n}(f)\| \le \|\epsilon_{1,n}(f) + \epsilon_{2,n}(f)\| \le \|\epsilon_{1,n}(f)\| + \|\epsilon_{2,n}(f)\| \to 0$. And this shows that $h_1 \vee h_2$ is a minorant.

Thus we can assume that the sequence $h_1, h_2, \ldots$ is increasing, $h_1 \le h_2 \le \cdots$, and hence the limit $h_0 = \lim_{m \to \infty} h_m$ is well defined. Furthermore we see from monotone convergence that
$$\|h_0\| = \int h_0 \, d\nu = \lim_{m \to \infty} \int h_m \, d\nu = \lim_{m \to \infty} \|h_m\| = c,$$
and similarly we have that $\lim_{m \to \infty} \|h_0 - h_m\| = 0$. Furthermore $h_0$ must be a minorant as well, since
$$P^{*n}(f) - h_0 = P^{*n}(f) - h_m + h_m - h_0 \ge -\epsilon_{m,n}(f) - |h_m - h_0|,$$
giving $(P^{*n}(f) - h_0)^- \le \epsilon_{m,n}(f) + |h_m - h_0|$, such that
$$\limsup_{n \to \infty} \left\| \left( P^{*n}(f) - h_0 \right)^- \right\| \le \|h_m - h_0\|.$$
And since this is true for all $m \in \mathbb{N}$, and the right hand side has limit $0$ as $m \to \infty$, we must have that
$$\left\| \left( P^{*n}(f) - h_0 \right)^- \right\| \to 0,$$
as requested.

Now let $h$ be another minorant. Then also $h \vee h_0$ is a minorant, and since
$$c \ge \int h \vee h_0 \, d\nu \ge \int h_0 \, d\nu = c,$$
we conclude that $h \le h_0$ $\nu$–a.e.

From the inequality $P^{*n}(f) \ge h_0 - \epsilon_n(f)$ it follows (using that $P^{*1}$ is increasing) that
$$P^{*(n+1)}(f) = P^{*1}(P^{*n}(f)) \ge P^{*1}(h_0) - P^{*1}(\epsilon_n(f)),$$
and because $\|P^{*1}(\epsilon_n(f))\| = \|\epsilon_n(f)\| \to 0$, we see that $P^{*1}(h_0)$ is a minorant. Consequently $P^{*1}(h_0) \le h_0$ $\nu$–a.e., but since
$$\int h_0 \, d\nu = \|h_0\| = \|P^{*1}(h_0)\| = \int P^{*1}(h_0) \, d\nu,$$
we conclude that in fact $h_0 = P^{*1}(h_0)$ $\nu$–a.e. Defining $f_0 = h_0 / c$, such that $f_0$ is a probability density, we see that also $f_0 = P^{*1}(f_0)$ $\nu$–a.e., showing that $f_0$ is the density of a stationary initial distribution.

Now let $h$ be some minorant. We want to show that the Markov chain is asymptotically stable. Let therefore $f \in D$; we want to show that $\|P^{*n}(f) - f_0\| \to 0$. Define $g = f - f_0$ and assume without loss of generality that $d = \|g\|/2 > 0$ (otherwise there would be nothing to show). Since $\int g \, d\nu = 1 - 1 = 0$ we must have that $\|g^+\| = \|g^-\| = d$. Now
$$\|P^{*n}(f) - f_0\| = \|P^{*n}(f) - P^{*n}(f_0)\| = \|P^{*n}(g)\| = \left\| d\left( P^{*n}(g^+/d) - h \right) - d\left( P^{*n}(g^-/d) - h \right) \right\| \le d\|P^{*n}(g^+/d) - h\| + d\|P^{*n}(g^-/d) - h\|,$$
and because $g^+/d, g^-/d \in D$ it follows from (5.10) that
$$\limsup_{n \to \infty} \|P^{*n}(f) - f_0\| \le \limsup_{n \to \infty} d\left( \|P^{*n}(g^+/d) - h\| + \|P^{*n}(g^-/d) - h\| \right) \qquad (5.11)$$
$$\le 2d(1 - \|h\|) \qquad (5.12)$$
$$= \|f - f_0\|(1 - \|h\|). \qquad (5.13)$$

By replacing $f$ by $P^{*m}(f)$ we obtain for each $m = 1, 2, \ldots$ that
$$\limsup_{n \to \infty} \|P^{*n}(f) - f_0\| = \limsup_{n \to \infty} \|P^{*(n+m)}(f) - f_0\| = \limsup_{n \to \infty} \|P^{*n}(P^{*m}(f)) - f_0\| \le \|P^{*m}(f) - f_0\|(1 - \|h\|),$$
and by letting $m \to \infty$ it follows from (5.11) that
$$\limsup_{n \to \infty} \|P^{*n}(f) - f_0\| \le \|f - f_0\|(1 - \|h\|)^2.$$
Repeating this argument gives that
$$\limsup_{n \to \infty} \|P^{*n}(f) - f_0\| \le \|f - f_0\|(1 - \|h\|)^k$$
for each $k = 1, 2, \ldots$. And since $1 - \|h\| < 1$ we obtain
$$\lim_{n \to \infty} \|P^{*n}(f) - f_0\| = 0$$
from letting $k \to \infty$.

For deciding whether a function is a minorant or not, the following result proves useful.

Lemma 5.4.4. Let $D^*$ be a dense subset of $D$. Then $h$ is a minorant if for all $f \in D^*$ there exists a sequence of non–negative functions $\epsilon_n(f)$ in $L^1(\nu)$ such that (5.8) and (5.9) are satisfied.

Proof. For $f \in D$ and $f^* \in D^*$ we have
$$P^{*n}(f) - h = P^{*n}(f^*) - h + P^{*n}(f) - P^{*n}(f^*) \ge -\epsilon_n(f^*) + P^{*n}(f - f^*),$$
and therefore $(P^{*n}(f) - h)^- \le \epsilon_n(f^*) + |P^{*n}(f - f^*)|$, and consequently
$$\limsup_{n \to \infty} \|(P^{*n}(f) - h)^-\| \le \limsup_{n \to \infty} \|P^{*n}(f - f^*)\| \le \|f - f^*\|.$$
Since $\inf_{f^* \in D^*} \|f - f^*\| = 0$, it follows that
$$\lim_{n \to \infty} \|(P^{*n}(f) - h)^-\| = 0.$$

5.5 The drift criterion

In this section we will develop a very useful sufficient condition for the existence of a minorant.

Theorem 5.5.1. Let $V : (\mathcal{X}, \mathcal{E}) \to (\mathbb{R}, \mathbb{B})$ be a non–negative measurable function such that there exist $0 \le \alpha < 1$ and $0 \le \beta < \infty$ with
$$\int V(y) \, dP_x(y) \le \alpha V(x) + \beta \quad \text{for all } x \in \mathcal{X}. \qquad (5.14)$$
Furthermore assume that there exists $m \in \mathbb{N}$ such that for some $r > \beta/(1-\alpha) + 1$ the function
$$y \mapsto \inf_{\{x : V(x) \le r\}} k_x^{(m)}(y)$$
is measurable, and that
$$\int \inf_{\{x : V(x) \le r\}} k_x^{(m)}(y) \, d\nu(y) > 0. \qquad (5.15)$$
Then it holds that the transition probabilities are asymptotically stable.

The function $V$ is called a drift function or a Lyapounov function. Typically the behaviour of $V$ is such that $V(x) \to \infty$ when $x$ approaches the boundary of $\mathcal{X}$, so the set $\{x \in \mathcal{X} : V(x) \le r\}$ is bounded. Usually it is quite easy to verify condition (5.15), so the critical condition is (5.14). Since $\alpha < 1$, the condition – informally phrased – states that the Markov chain has a tendency to drift towards $x$'s with low values of $V(x)$. The condition (5.14) may become clearer if we rewrite it using random variables:
$$E(V(X_{n+1}) \mid X_n = x) \le \alpha V(x) + \beta.$$
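For the AR(1) chain from Example 5.3.6 with $V(x) = x^2$, the left hand side can be computed exactly: $E(V(X_{n+1}) \mid X_n = x) = E((\rho x + \epsilon)^2) = \rho^2 x^2 + 1$, so (5.14) holds with $\alpha = \rho^2$ and $\beta = 1$ whenever $|\rho| < 1$. A small Monte Carlo sanity check of this computation (our addition; the value of $\rho$, the sample size and the test points are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
rho = 0.8

def V(x):
    return x ** 2

for x in (0.0, 1.0, 5.0):
    draws = rho * x + rng.standard_normal(1_000_000)   # draws from P_x = N(rho*x, 1)
    lhs = V(draws).mean()                              # estimates int V(y) dP_x(y)
    print(x, lhs, "<=", rho ** 2 * V(x) + 1.0)         # drift bound: alpha=rho^2, beta=1
```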

Proof. The idea in the proof is to determine a minorant, because then it will follow from Theorem 5.4.3 that the transition probabilities are asymptotically stable. Define $B = \{x \in \mathcal{X} : V(x) \le r\}$ and
$$h(y) = \inf_{x \in B} k_x^{(m)}(y).$$
Then for $f \in D$
$$P^{*(m+n)}(f)(y) = P^{*m}(P^{*n}(f))(y) = \int k_x^{(m)}(y) P^{*n}(f)(x) \, d\nu(x) \ge h(y) \int_B P^{*n}(f)(x) \, d\nu(x).$$

Let $\delta = 1 - \frac{1}{r}\left( \frac{\beta}{1-\alpha} + 1 \right) > 0$. We shall show that
$$\int_B P^{*n}(f)(x) \, d\nu(x) > \delta \qquad (5.16)$$
for $n$ sufficiently large (depending on $f$). Because then $P^{*(n+m)}(f)(y) \ge \delta h(y)$ for $n$ sufficiently large, and it will follow that $\delta h$ is a minorant.

Now
$$\int_B P^{*n}(f)(x) \, d\nu(x) = P_{f \cdot \nu}(X_n \in B) = P_{f \cdot \nu}(V(X_n) \le r) = 1 - P_{f \cdot \nu}(V(X_n) > r),$$
and by Markov's inequality
$$P_{f \cdot \nu}(V(X_n) > r) \le \frac{1}{r} E_{f \cdot \nu} V(X_n),$$
where $E_{f \cdot \nu}$ denotes integration with respect to $P_{f \cdot \nu}$. To establish (5.16) it suffices to show that $E_{f \cdot \nu} V(X_n) \le \frac{\beta}{1-\alpha} + 1$ for $n$ sufficiently large. But
$$E_{f \cdot \nu} V(X_n) = \int V(x) \, dX_n(P_{f \cdot \nu})(x) = \int V(y) \, d(X_{n-1}, X_n)(P_{f \cdot \nu})(x, y) = \iint V(y) \, dP_x(y) \, dX_{n-1}(P_{f \cdot \nu})(x)$$
$$\le \int (\alpha V(x) + \beta) \, dX_{n-1}(P_{f \cdot \nu})(x) = \alpha \int V(x) \, dX_{n-1}(P_{f \cdot \nu})(x) + \beta$$
$$\le \cdots \le \beta(1 + \alpha + \cdots + \alpha^{n-1}) + \alpha^n \int V(x) \, dX_0(P_{f \cdot \nu})(x) = \beta(1 + \alpha + \cdots + \alpha^{n-1}) + \alpha^n \int V(x) f(x) \, d\nu(x) \le \frac{\beta}{1-\alpha} + \alpha^n \int V(x) f(x) \, d\nu(x).$$
If $\int V(x) f(x) \, d\nu(x) < \infty$, the last quantity is $\le \frac{\beta}{1-\alpha} + 1$ for $n$ sufficiently large.

The argument above works for $f \in D$ such that $\int V(x) f(x) \, d\nu(x) < \infty$, and we now complete the proof of the theorem by appealing to Lemma 5.4.4 and verifying that $D^* = \{ f \in D : \int V(x) f(x) \, d\nu(x) < \infty \}$ is dense in $D$.

So let $f \in D$. We want to find a sequence $(f_k)_{k \in \mathbb{N}}$ with each $f_k \in D^*$ and with $\|f - f_k\| \to 0$. Define $B_k = \{x \in \mathcal{X} : V(x) \le k\}$ and $c_k = \int_{B_k} f(x) \, d\nu(x)$ for $k = 1, 2, \ldots$. Then $f_k = 1_{B_k} f / c_k \in D^*$, and
$$\|f - f_k\| = \|f - 1_{B_k} f / c_k\| = (1/c_k - 1) \int_{B_k} f(x) \, d\nu(x) + \int_{B_k^c} f(x) \, d\nu(x) = (1/c_k - 1) c_k + (1 - c_k) = 2(1 - c_k) \to 0$$
as $k \to \infty$.

Note: The assumption in the theorem that $y \mapsto \inf_{\{x : V(x) \le r\}} k_x^{(m)}(y)$ be measurable is not necessary. To complete the proof it is sufficient that there exists a non–negative measurable function $g : \mathcal{X} \to \mathbb{R}$ such that
$$\inf_{\{x : V(x) \le r\}} k_x^{(m)}(y) \ge g(y)$$
for all $y \in \mathcal{X}$, and
$$\int g(y) \, d\nu(y) > 0.$$

From Theorem 5.5.1 we can conclude that a given Markov chain has asymptotically stable transition probabilities, such that a stationary initial distribution exists and the averages from Theorem 5.3.4 converge (towards integrals with respect to the stationary distribution). But the theorem says nothing about how the stationary distribution behaves. However, in order to use the convergence of such averages it is necessary to know whether integrals of the form
$$E_{f_0 \cdot \nu} |f(X_0, X_1, X_2, \ldots)| = \int |f(x_0, x_1, x_2, \ldots)| \, d\mathbf{P}_{f_0 \cdot \nu}(x_0, x_1, x_2, \ldots)$$
are finite. For this the following corollary can be helpful.

Corollary 5.5.2. Assume that the conditions from Theorem 5.5.1 are satisfied and let $f_0$ denote the density of the stationary initial distribution. Then
$$E_{f_0 \cdot \nu}(V(X_0)) = \int V(x) f_0(x) \, d\nu(x) < \infty.$$

Proof. Let $f \in D^*$ and $M > 0$. From the proof above we have
$$\int (V(x) \wedge M) \, P^{*n}(f)(x) \, d\nu(x) \le \int V(x) \, P^{*n}(f)(x) \, d\nu(x) = \int V(x) \, dX_n(P_{f \cdot \nu})(x) \le \frac{\beta}{1-\alpha} + \alpha^n \int V(x) f(x) \, d\nu(x),$$
and since the transition probabilities are asymptotically stable,
$$\left| \int (V(x) \wedge M) \, P^{*n}(f)(x) \, d\nu(x) - \int (V(x) \wedge M) \, f_0(x) \, d\nu(x) \right| \le \int (V(x) \wedge M) \left| P^{*n}(f)(x) - f_0(x) \right| d\nu(x) \le M \|P^{*n}(f) - f_0\| \to 0$$
as $n \to \infty$. It follows that $\int (V(x) \wedge M) f_0(x) \, d\nu(x) \le \frac{\beta}{1-\alpha}$, and letting $M \to \infty$ then yields that $\int V(x) f_0(x) \, d\nu(x) \le \frac{\beta}{1-\alpha}$.

Example 5.5.3. Continuation of Example 5.3.7. For the ARCH(1)–process we have
$$X_{n+1}^2 = (\gamma + \alpha X_n^2) \epsilon_{n+1}^2.$$

We shall look for a drift function of the form
$$V(x) = |x|^{2\delta}$$
with $\delta > 0$. Since
$$X_{n+1}^{2\delta} = (\gamma + \alpha X_n^2)^\delta |\epsilon_{n+1}|^{2\delta},$$
the condition (5.14) means (due to the Substitution Theorem 2.1.1, since under $P_x$ the next state has the same distribution as $\sqrt{\gamma + \alpha x^2}\,\epsilon$) that
$$(\gamma + \alpha x^2)^\delta E(|\epsilon|^{2\delta}) \le \alpha_0 |x|^{2\delta} + \beta$$
for some $\beta$, some $\alpha_0 < 1$, and where $\epsilon$ has a standard normal distribution. For small values of $x$, $\beta$ takes care of this. The condition therefore becomes
$$\alpha^\delta E(|\epsilon|^{2\delta}) < 1. \qquad (5.17)$$

As a special case take δ = 1. The condition is then satisfied if α < 1 and from the corollary we see that the stationary initial distribution has finite second order moment.

Because $E((\alpha \epsilon^2)^0) = 1$, the inequality (5.17) may be written as
$$E\left( \frac{\exp(\delta \log(\alpha \epsilon^2)) - \exp(0 \cdot \log(\alpha \epsilon^2))}{\delta} \right) = \frac{E(\exp(\delta \log(\alpha \epsilon^2))) - E(\exp(0 \cdot \log(\alpha \epsilon^2)))}{\delta} < 0.$$
If we let $\delta \to 0$ we obtain the condition $E(\log(\alpha \epsilon^2)) < 0$, which is equivalent to
$$\alpha < \exp(-E(\log(\epsilon^2))) = 3.56\ldots$$

Hence this condition ensures that the transition probabilities are asymptotically stable. ◦
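The threshold can be checked by Monte Carlo. The sketch below (our addition) estimates $E(\log \epsilon^2)$ for standard normal $\epsilon$ and prints $\exp(-E(\log \epsilon^2))$; by a known identity, $E(\log \epsilon^2) = -(\gamma_E + \log 2)$ with $\gamma_E$ the Euler-Mascheroni constant, so the threshold equals $2e^{\gamma_E} \approx 3.5622$.

```python
import numpy as np

rng = np.random.default_rng(2)
eps = rng.standard_normal(10_000_000)
eps = eps[eps != 0.0]                      # guard against log(0); a.s. irrelevant
m = np.mean(np.log(eps ** 2))              # Monte Carlo estimate of E log(eps^2)
print("estimated threshold:", np.exp(-m))  # close to 2*exp(Euler gamma) = 3.5621...
```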

5.6 Exercises

Exercise 5.1. Assume that $X_0, X_1, X_2, \ldots$ is a time homogeneous Markov chain on $(\mathcal{X}, \mathcal{E})$ with transition probabilities $(P_x)_{x \in \mathcal{X}}$. Let $x \in \mathcal{X}$ and define the stopping time
$$\tau = \inf\{ n \in \mathbb{N} : X_n = x \}.$$

Let $X_0 \equiv x$ and assume that $P(\tau < \infty) = 1$. Assume furthermore that $E\tau < \infty$. Define for each set $A \in \mathcal{E}$
$$\mu(A) = \frac{1}{E\tau} E\left( \sum_{n=0}^{\tau - 1} 1_{(X_n \in A)} \right).$$

(1) Show that µ is a probability measure on (X , E)

(2) Show that for all non–negative measurable functions $f : \mathcal{X} \to \mathbb{R}$ it holds that
$$\int f \, d\mu = \frac{1}{E\tau} E\left( \sum_{n=0}^{\tau - 1} f(X_n) \right)$$
(Hint).

(3) Show that for $A \in \mathcal{E}$
$$\int P_x(A) \, d\mu(x) = \frac{1}{E\tau} \sum_{n=0}^{\infty} E\left( P_{X_n}(A) 1_{(\tau > n)} \right) = \frac{1}{E\tau} \sum_{n=0}^{\infty} \int_{(\tau > n)} P(X_{n+1} \in A \mid \mathcal{F}_n) \, dP,$$
where $\mathcal{F}_n = \sigma(X_0, \ldots, X_n)$ (Hint).

(4) Obtain that
$$\int P_x(A) \, d\mu(x) = \frac{1}{E\tau} E\left( \sum_{n=0}^{\tau - 1} 1_{(X_{n+1} \in A)} \right)$$
(Hint).

(5) Show that
$$E(1_{(X_\tau \in A)}) = E(1_{(X_0 \in A)})$$
and use this to obtain
$$\int P_x(A) \, d\mu(x) = \frac{1}{E\tau} E\left( \sum_{n=0}^{\tau - 1} 1_{(X_n \in A)} \right).$$
Conclude that $\mu$ is a stationary initial distribution for the Markov chain.

Now define τ1 = τ and recursively

τN = inf{n > τN−1 : Xn = x}

Let furthermore $f : \mathcal{X} \to \mathbb{R}$ be bounded and measurable.

(6) Argue that the real valued variables
$$\sum_{n=0}^{\tau_1 - 1} f(X_n), \quad \sum_{n=\tau_1}^{\tau_2 - 1} f(X_n), \quad \sum_{n=\tau_2}^{\tau_3 - 1} f(X_n), \ \ldots$$
are independent and identically distributed (you do not have to give detailed arguments).

Also argue that $\tau_1, \tau_2 - \tau_1, \tau_3 - \tau_2, \ldots$ are independent and identically distributed (Hint).

(7) Show that
$$\frac{1}{\tau_N} \sum_{n=0}^{\tau_N - 1} f(X_n) \to \int f \, d\mu \quad \text{a.s.}$$
as $N \to \infty$ (Hint).

Exercise 5.2. Reconsider the one–dimensional AR(1)–process on $(\mathbb{R}, \mathbb{B})$ from Exercise 4.2:
$$X_{n+1} = \rho X_n + \epsilon_{n+1},$$
where $\epsilon_1, \epsilon_2, \ldots$ are iid with a $\mathcal{N}(0, 1)$ distribution. Assume that $X_0$ is independent of all the $\epsilon$'s.

(1) Find the conditional distribution $(P_x^{*n})_{x \in \mathbb{R}}$ of $X_n$ given $X_0$ for all $n \in \mathbb{N}_0$ (Hint).

(2) Assume that $|\rho| < 1$. Show that $P_x^{*n}$ converges to $\mu_0$ for all $x \in \mathbb{R}$, where
$$\mu_0 = \mathcal{N}\left( 0, \frac{1}{1 - \rho^2} \right)$$
(Hint).

Exercise 5.3. Consider the AR(1) model from Exercise 5.2 and Example 5.3.6, but now assume that $|\rho| > 1$. From Exercise 5.2 and the example it is known that the transition densities with respect to the Lebesgue measure $\lambda$ are given by
$$k_x^{(n)}(y) = \frac{1}{\sqrt{2\pi\sigma_n^2}} \exp\left( -\frac{(y - \rho^n x)^2}{2\sigma_n^2} \right).$$

(1) Show that for all $x, y \in \mathbb{R}$
$$k_x^{(n)}(y) \to 0$$

as n → ∞.

(2) Show that there does not exist a stationary initial distribution for the Markov chain (Hint).

Exercise 5.4. Let X0,X1,X2,... be a time homogeneous Markov chain on (X , E) with transition probabilities (Px)x∈X and initial distribution µ. Assume that each Px has density kx with respect to the σ–finite measure ν. Furthermore assume that µ has density f with respect to ν.

Recall that we in section 4.4 defined $(P_x^n)_{x \in \mathcal{X}}$ to be the conditional distribution of $(X_1, \ldots, X_n)$ given $X_0$.

(1) Show that $P_x^n$ has density $k_{n,x}(x_1, \ldots, x_n)$ with respect to $\nu^{\otimes n}$, where
$$k_{n,x}(x_1, \ldots, x_n) = \prod_{i=1}^n k_{x_{i-1}}(x_i)$$
with $x_0 = x$ (Hint).

(2) Show that the distribution of $(X_0, \ldots, X_n)$ has density
$$h_n(x_0, \ldots, x_n) = f(x_0) \, k_{n,x_0}(x_1, \ldots, x_n)$$
with respect to $\nu^{\otimes (n+1)}$.

If the transition densities $(k_x)_{x \in \mathcal{X}} = (k_x^\theta)_{x \in \mathcal{X}}$ depend on some unknown parameter $\theta \in \Theta$, we can use the density $h_n$ from (2) to write the likelihood function when observing $(X_0, \ldots, X_n)$. However, in Markov chain models it is quite often only the transition densities that are specified and not the initial density $f$. In such cases, one will typically use the conditional likelihood given $X_0 = x$ as in question (1). Especially in situations with many observations it can be argued that not very much information is thrown away by considering the first observation as known.

Now consider the AR(1) model from Exercise 5.2, where $|\rho| < 1$.

(3) Find the conditional likelihood function for $(X_1, \ldots, X_n)$ given $X_0 = x$.

(4) Show that the conditional maximum likelihood estimate for $\rho$ given $X_0 = x$ is
$$\hat{\rho}_n = \frac{\sum_{i=0}^{n-1} X_i X_{i+1}}{\sum_{i=0}^{n-1} X_i^2}$$
(Hint).

(5) Show that no matter what the distribution of $X_0$ is, the estimate from (4) is strongly consistent:
$$\hat{\rho}_n \to \rho \quad \text{a.s.}$$
as $n \to \infty$ (Hint).
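A minimal simulation sketch of questions (4) and (5) (our addition; the values of $\rho$, the starting point and the run length are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
rho, n = 0.6, 50_000
X = np.empty(n + 1)
X[0] = 10.0                     # deliberately non-stationary start
for i in range(n):
    X[i + 1] = rho * X[i] + rng.standard_normal()

# the conditional MLE from question (4)
rho_hat = np.sum(X[:-1] * X[1:]) / np.sum(X[:-1] ** 2)
print(rho_hat)                  # close to rho = 0.6, illustrating the consistency in (5)
```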

Exercise 5.5. Let $X_0, X_1, X_2, \ldots$ be a real valued time homogeneous Markov chain given by the update scheme
$$X_{n+1} = \varphi(X_n, U_{n+1}),$$
where $U_1, U_2, \ldots$ are independent and identically distributed with common density $f$ with respect to the Lebesgue measure $\lambda$ on $(\mathbb{R}, \mathbb{B})$. Assume furthermore that $(U_1, U_2, \ldots)$ is independent of $X_0$, and that $\varphi : \mathbb{R}^2 \to \mathbb{R}$ is measurable, such that

• $y \mapsto \varphi(x, y)$ is bijective and continuously differentiable for all $x \in \mathbb{R}$.

• $\partial_y \varphi(x, y) \ne 0$ for all $(x, y) \in \mathbb{R}^2$, where $\partial_y \varphi(x, y)$ denotes the derivative with respect to $y$.

Show that the transition probabilities $(P_x)_{x \in \mathbb{R}}$ for the Markov chain have densities $(k_x)_{x \in \mathbb{R}}$ with respect to the Lebesgue measure, where
$$k_x(y) = \frac{1}{\left| \partial_y \varphi\big(x, \varphi(x, \cdot)^{-1}(y)\big) \right|} \, f\big( \varphi(x, \cdot)^{-1}(y) \big)$$
(Hint). ◦

Exercise 5.6. Assume that $\xi : \mathbb{R} \to \mathbb{R}$ and $\sigma : \mathbb{R} \to (0, \infty)$ are continuous functions, and that $(U_n)_{n \in \mathbb{N}}$ is a sequence of independent and identically distributed random variables with finite second order moment, such that $EU_1 = 0$ and $VU_1 = 1$. Define recursively
$$X_{n+1} = \xi(X_n) + \sigma(X_n) U_{n+1},$$
where $X_0$ is a real random variable that is independent of $(U_n)_{n \in \mathbb{N}}$ and has distribution $\mu$.

(1) Argue that X0,X1,X2,... is a Markov chain.

(2) Assume that for some $n \in \mathbb{N}$ it holds that
$$E\xi(X_n)^2 < \infty \quad \text{and} \quad E\sigma(X_n)^2 < \infty.$$
Show that $EX_{n+1}^2 < \infty$ and find $E(X_{n+1} \mid X_n = x)$ and $V(X_{n+1} \mid X_n = x)$.

Now assume that the distribution of U1 (and all Un) has density f with respect to the Lebesgue measure.

(3) Find the density for the transition probabilities.

Assume that the density f : R → R is continuous and strictly positive.

(4) Assume that
$$\limsup_{|x| \to \infty} \frac{\xi(x)^2 + \sigma(x)^2}{x^2} < 1. \qquad (5.18)$$
Show that the transition probabilities are asymptotically stable using $V(x) = x^2$ as a drift function (Hint).

(5) Assume that $EX_0^2 < \infty$. Show that $EX_n^2 < \infty$ for all $n \in \mathbb{N}$ (Hint).

Exercise 5.7. This exercise gives an example of the Markov chains from Exercise 5.6. So assume that $U_1, U_2, \ldots$ are independent and identically distributed with $EU_1 = 0$ and $VU_1 = 1$. Assume that $U_1$ has a continuous and strictly positive density $f$ with respect to $\lambda$. Define $\sigma(x) \equiv 1$ and
$$\xi(x) = (\alpha_1 x + \beta_1) 1_{(-\infty, \gamma)}(x) + (\alpha_2 x + \beta_2) 1_{[\gamma, \infty)}(x),$$
and assume that $\alpha_1 \gamma + \beta_1 = \alpha_2 \gamma + \beta_2$.

(1) Draw ξ(x) as a function of x, and argue that it is continuous.

(2) Let $X_0, X_1, X_2, \ldots$ be a Markov chain given by
$$X_{n+1} = \xi(X_n) + \sigma(X_n) U_{n+1},$$
where $X_0$ is independent of $(U_n)_{n \in \mathbb{N}}$. Find conditions in terms of $\alpha_1, \beta_1, \alpha_2, \beta_2$ such that the transition probabilities are asymptotically stable.

(3) Try simulating $X_0, \ldots, X_{10000}$ in a situation where $X_0 = 0$, $U_1 \sim \mathcal{N}(0, 1)$, $-\alpha_1 = \alpha_2 = \frac{1}{2}$, and $\beta_1 = \beta_2 = \gamma = 0$. Plot $(n, X_n)_{n \in \mathbb{N}_0}$ and $(X_n, X_{n+1})_{n \in \mathbb{N}_0}$ and comment on the two plots. A possible implementation is sketched below.
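The following sketch is our addition, not part of the exercise. With the given parameters $\xi(x) = |x|/2$, and the plotting details are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
n = 10_000
X = np.zeros(n + 1)                        # X_0 = 0
for i in range(n):
    # with -alpha1 = alpha2 = 1/2 and beta1 = beta2 = gamma = 0: xi(x) = |x|/2
    X[i + 1] = 0.5 * abs(X[i]) + rng.standard_normal()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(X, linewidth=0.3)
ax1.set_xlabel("n"); ax1.set_ylabel("X_n")
ax2.scatter(X[:-1], X[1:], s=1)
ax2.set_xlabel("X_n"); ax2.set_ylabel("X_{n+1}")
plt.show()
```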

Exercise 5.8. Assume that $f : [0, \infty) \to (0, \infty)$ is a continuous and strictly positive density on the interval $[0, \infty)$. Let $U_1, U_2, \ldots$ be a sequence of independent and identically distributed real non–negative variables with common density $f$. Assume that there exists $\beta > 0$ such that
$$\int_0^\infty e^{\beta x} f(x) \, dx < \infty.$$

Assume that $X_0$ is independent of $(U_n)_{n \in \mathbb{N}}$ and define the Markov chain $X_0, X_1, X_2, \ldots$ by
$$X_{n+1} = |X_n - U_{n+1}|$$
for $n \in \mathbb{N}_0$.

(1) Show that the transition probabilities $(P_x)_{x \in \mathbb{R}}$ have densities $(k_x)_{x \in \mathbb{R}}$ given by
$$k_x(y) = f(x + y) + f(x - y) 1_{(y < x)}$$
(Hint).

(2) Show that
$$\int_0^\infty e^{\beta y} k_x(y) \, dy = e^{-\beta x} \int_x^\infty e^{\beta y} f(y) \, dy + e^{\beta x} \int_0^x e^{-\beta y} f(y) \, dy$$

(3) Show that the transition probabilities are asymptotically stable and that there exists a uniquely determined stationary density $f_0$ (Hint).

(4) Show that if $X_0$ has the stationary distribution, then $EX_0^n < \infty$ for all $n \in \mathbb{N}$ (Hint).

(5) Let $X_0$ have the stationary distribution. Show that
$$EX_0 = \frac{EU_1^2}{2EU_1}$$
(Hint).

(6) Assume that $f(x) = e^{-x}$. Show that in this case also $f_0(x) = e^{-x}$ (Hint).

Exercise 5.9. Let $X_0, X_1, X_2, \ldots$ be a Markov chain on $(\mathcal{X}, \mathcal{E})$ with transition probabilities $(P_x)_{x \in \mathcal{X}}$ such that each $P_x$ has density $k_x$ with respect to a $\sigma$–finite measure $\nu$. Assume that for some $m$ it holds that $y \mapsto \inf_{x \in \mathcal{X}} k_x^{(m)}(y)$ is measurable with
$$\int \inf_{x \in \mathcal{X}} k_x^{(m)}(y) \, d\nu(y) > 0.$$

Show that the transition probabilities are asymptotically stable (Hint). ◦

Exercise 5.10. The Gibbs sampler. Let $(X, Y)$ be a vector of random variables with values in $(\mathcal{X} \times \mathcal{Y}, \mathcal{E} \otimes \mathcal{K})$. Assume that $(X, Y)$ has density $f_0$ with respect to $\nu_1 \otimes \nu_2$.

Think of a situation where $f_0$ is rather complicated and difficult to simulate from (or maybe we do not even know it). Instead we know the densities for the conditional distributions of $X \mid Y$ and $Y \mid X$. These densities will be denoted $(f_y^1)_{y \in \mathcal{Y}}$ and $(f_x^2)_{x \in \mathcal{X}}$ respectively.

The idea is to generate a Markov chain $(Z_n)_{n \in \mathbb{N}_0}$ consisting of vectors $Z_n = (X_n, Y_n)$ that has asymptotically stable transition densities with $f_0$ as the stationary density.

We define (Zn)n∈N0 as follows:

(a) Let $Z_0 = (X_0, Y_0) = (x, y)$ for some $(x, y) \in \mathcal{X} \times \mathcal{Y}$.

(b) Given $Z_n = (X_n, Y_n) = (x_n, y_n)$, draw $X_{n+1}$ from the distribution with density $f_{y_n}^1$.

(c) Given $X_{n+1} = x_{n+1}$, draw $Y_{n+1}$ from the distribution with density $f_{x_{n+1}}^2$.

(d) Define $Z_{n+1} = (X_{n+1}, Y_{n+1})$.

(e) Add 1 to $n$ and go back to (b).

The above successive definition of $Z_0, Z_1, Z_2, \ldots$ could of course be written more formally as: For each $n \in \mathbb{N}_0$ we have

• $X_{n+1} \perp\!\!\!\perp (Z_0, \ldots, Z_n) \mid Y_n$

• With $(Q_y^1)_{y \in \mathcal{Y}}$ the conditional distribution of $X_{n+1}$ given $Y_n$, each $Q_y^1$ has density $f_y^1$.

• $Y_{n+1} \perp\!\!\!\perp (Z_0, \ldots, Z_n) \mid X_{n+1}$

• With $(Q_x^2)_{x \in \mathcal{X}}$ the conditional distribution of $Y_{n+1}$ given $X_{n+1}$, each $Q_x^2$ has density $f_x^2$.

(1) Argue that Z0,Z1,Z2,... is a Markov chain.

Let (Px,y)(x,y)∈X ×Y be the transition probabilities of this Markov chain.

(2) Show that $P_{x_1, y_1}$ has density $k_{(x_1, y_1)}$ with respect to $\nu_1 \otimes \nu_2$, where
$$k_{(x_1, y_1)}(x_2, y_2) = f_{y_1}^1(x_2) \, f_{x_2}^2(y_2)$$
(Hint).

(3) Show that f0 is a stationary density for the transitions (Hint).

Now assume that $\mathcal{X} = [0, 1]$, $\mathcal{Y} = \{0, 1\}$, $\nu_1 = \lambda$ (restricted to $[0, 1]$), and that $\nu_2$ is the counting measure $\tau$ on $\{0, 1\}$. Furthermore assume that the conditional distribution of $X_{n+1}$ given $Y_n = y$ is given by
$$Q_y^1 = B(y + 1, 2 - y),$$
meaning that
$$f_{y_1}^1(x_2) = \frac{1}{B(y_1 + 1, 2 - y_1)} x_2^{y_1} (1 - x_2)^{1 - y_1},$$
and that the conditional distribution of $Y_{n+1}$ given $X_{n+1} = x_{n+1}$ is a binomial distribution with parameters
$$\left( 1, \ \frac{x_{n+1} a}{x_{n+1} a + (1 - x_{n+1}) b} \right)$$
for some given non–negative constants $a$ and $b$. A simulation sketch of this chain is given below.
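This sketch is our addition; the values of $a$ and $b$, the seed, the starting point and the run length are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(5)
a, b = 2.0, 1.0                  # the constants a and b; values chosen arbitrarily
n = 100_000
xs = np.empty(n)
x, y = 0.5, 0                    # Z_0 = (x, y); arbitrary starting point

for i in range(n):
    x = rng.beta(y + 1, 2 - y)                   # step (b): X_{n+1} ~ Beta(y+1, 2-y)
    p = x * a / (x * a + (1.0 - x) * b)
    y = rng.binomial(1, p)                       # step (c): Y_{n+1} ~ Bernoulli(p)
    xs[i] = x

print("empirical mean of the X-coordinate:", xs.mean())
```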

(4) Show that the Markov chain Z0,Z1,Z2,... is asymptotically stable (Hint).

◦

Chapter 6

An introduction to Bayesian networks

6.1 Introduction

Consider an $n$–dimensional vector of random variables $X = (X_1, \ldots, X_n)$, where each $X_i$ takes values in some Borel space $(\mathcal{X}, \mathcal{E})$.

A very simple model for the distribution of $X$ could be to assume that $X_1, \ldots, X_n$ are independent. Then we have
$$P(X_1 \in A_1, \ldots, X_n \in A_n) = P(X_1 \in A_1) \cdots P(X_n \in A_n),$$
so in order to determine the distribution of $X$, we only need to determine the $n$ marginal distributions. The problem is that this model very often is far too simple; it is necessary to allow for some dependence structure in the joint distribution.

If we take this to the other extreme, we could simply use the model where $X$ is allowed to have any distribution on $(\mathcal{X}^n, \mathcal{E}^n)$. Then we will have to take care of all dependencies between all combinations of variables. If $n$ is very large, this will be far too complex to handle.

Example 6.1.1. Consider the variables $X_1, \ldots, X_n$ in the very simple case where $\mathcal{X} = \{0, 1\}$. If the variables are independent, only the $n$ parameters $p_1, \ldots, p_n$ given by
$$P(X_k = 1) = p_k$$
are necessary to determine the distribution of $(X_1, \ldots, X_n)$. If on the other hand there is dependence, then all the $2^n$ probabilities
$$P(X_1 = a_1, \ldots, X_n = a_n)$$
will have to be decided, with $a_k \in \{0, 1\}$ for $k = 1, \ldots, n$. ◦

Example 6.1.2. In a microarray experiment the expression levels of a large number of genes are measured simultaneously, and it is expected that there will be correlations between the levels of the different genes. These correlation structures are of great interest and can lead to a deeper biological understanding.

Consider a gene expression data set produced from an experiment where the expression levels of $p$ genes are recorded for $n$ independent samples. This leads to a data set $\{x^1, \ldots, x^n\}$, where each vector $x^k = (x_1^k, \ldots, x_p^k)$ is an independent observation of the vector $(X_1, \ldots, X_p)$. The difficulty in the analysis of data like this is that typically $p$ is much larger than $n$. ◦

For large values of n a solution could be to assume that there is still some dependence in the model, but in such a way that there are not too many dependencies to take care of.

As we shall see in the following example it often makes sense to assume some structure in a set of observations.

Example 6.1.3. Imagine the situation when you go to the university in the morning. It may happen that you do not arrive in time for the first lecture... and this can have various reasons. It could be that your alarm did not work this morning, such that you overslept, and it is also possible that the bus was late. All of this will probably have influence on whether you were at the university in time.

In Figure 6.1 it is indicated which dependence structures seem reasonable. For example it does not make sense to believe that there is dependence between the bus arrival and your alarm. Furthermore it is reasonable that the alarm only has influence on the arrival time via the knowledge of whether you overslept: If we know that you did not oversleep, then it will not be fair to use the alarm as an excuse for being late. ◦

[Figure: boxes "Alarm on?" → "Overslept?" → "In time?" and "Bus late?" → "In time?"]

Figure 6.1: A scheme of your morning situation.

The purpose of this chapter is to give a theoretical description of dependence structures like the one in Example 6.1.3.

6.2 Directed graphs

Definition 6.2.1. A directed graph $G = (V, E)$ is a set of nodes $V = (v_1, \ldots, v_n)$ and a set of edges $E$. Each edge is a directed connection between two elements $v_i, v_j \in V$, where $i \ne j$. Such a directed connection will be denoted $v_i \to v_j$.

Definition 6.2.2. We say that $v_1, \ldots, v_k$ form a path in the directed graph $G$ if $v_i \to v_{i+1}$ for every $i = 1, \ldots, k-1$. We write $v_i \rightsquigarrow v_j$ if there exists a path from $v_i$ to $v_j$.

Definition 6.2.3. A cycle in a directed graph $G$ is a path $v_1, \ldots, v_k$, where $v_1 = v_k$. A graph is acyclic if it contains no cycles.

The concept of a directed acyclic graph (DAG) will be the basis of this chapter; it is the graphical representation that underlies Bayesian networks.

Definition 6.2.4. Let $G = (V, E)$ be a directed acyclic graph, and let $B$ be a subset of $V$. The subgraph $G_B = (B, E_B)$ is the graph that consists only of the nodes in $B$ and only of the edges from $E$ that are connections between elements in $B$.

Another useful notion is that of an ordering of the nodes in a directed graph that is consistent with the directions of the edges.

Definition 6.2.5. Let $G = (V, E)$ be a directed acyclic graph. An ordering of the nodes $v_1, \ldots, v_n$ is an ordering relative to $G$: if $v_i \to v_j$, then $i < j$.

It is always possible to find an ordering in a directed acyclic graph! A small sketch of one way to compute such an ordering is given after Figure 6.2.

[Figure: a DAG with nodes a, b, c, d, e, f and edges a → d, b → d, c → d, d → f, e → f]

Figure 6.2: An example of a directed acyclic graph with 6 nodes and 5 edges. We e.g. see that $a \rightsquigarrow f$, and that $\{a, b, c\}$ is the set of parents of $d$.
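An ordering as in Definition 6.2.5 can be computed with Kahn's algorithm. The sketch below is our addition; the edge set is read off Figure 6.2 and the surrounding examples, and the function raises an error if the input graph contains a cycle.

```python
from collections import deque

def topological_order(nodes, edges):
    # Kahn's algorithm: returns an ordering v_1,...,v_n with i < j whenever v_i -> v_j
    indeg = {v: 0 for v in nodes}
    children = {v: [] for v in nodes}
    for u, v in edges:
        children[u].append(v)
        indeg[v] += 1
    queue = deque(v for v in nodes if indeg[v] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in children[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle")
    return order

# the graph of Figure 6.2: parents of d are {a, b, c}, parents of f are {d, e}
edges = [("a", "d"), ("b", "d"), ("c", "d"), ("d", "f"), ("e", "f")]
print(topological_order("abcdef", edges))
```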

We shall also for a node v ∈ V define the parents, descendants, and non–descendants of v.

Definition 6.2.6. Let $G = (V, E)$ be a directed acyclic graph, and let $v_i \in V$. Then we say that

(i) $v_j$ is a parent of $v_i$, if $v_j \to v_i$. We let
$$P_{v_i} = \{ v_j \in V : v_j \to v_i \}$$
denote the set of all the parents of $v_i$.

(ii) $v_j$ is a child of $v_i$, if $v_i \to v_j$. We let
$$C_{v_i} = \{ v_j \in V : v_i \to v_j \}$$
denote the set of all children of $v_i$.

(iii) $v_k$ is an ancestor of $v_i$, if there exists a path such that $v_k \rightsquigarrow v_i$. We let
$$A_{v_i} = \{ v_k \in V : v_k \rightsquigarrow v_i \} \cup \{ v_i \}$$
denote the set of all ancestors of $v_i$.

Let $B$ be some subset of $V$. Then we let $A_B$ denote the subset of $V$ that contains $B$ and all ancestors of elements in $B$:
$$A_B = \cup_{v \in B} A_v$$

(iv) $v_k$ is a descendant of $v_i$, if there exists a path such that $v_i \rightsquigarrow v_k$. We let
$$D_{v_i} = \{ v_k \in V : v_i \rightsquigarrow v_k \}$$
denote the set of all descendants of $v_i$. Furthermore we let $ND_{v_i}$ denote the set of all non-descendants of $v_i$, defined by
$$ND_{v_i} = V \setminus \{ D_{v_i} \cup \{ v_i \} \}$$

6.3 Moral graphs and separation

Definition 6.3.1. An undirected graph $G = (V, E)$ is a set of nodes $V = (v_1, \ldots, v_n)$ and a set of edges $E$. Each edge is a connection between two elements $v_i, v_j \in V$, where $i \ne j$. Such an undirected connection will be denoted $v_i - v_j$.

Definition 6.3.2. Let $G = (V, E)$ be a directed acyclic graph. The moral graph $G^m$ of $G$ is an undirected graph with the same nodes as $G$, but where $v_i - v_j$ in $G^m$ if either they are connected in $G$ or they share a child. A small sketch of the construction is given below.
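The following sketch of the moralisation procedure is our addition; it is applied to the DAG of Figure 6.3 below, with the edge set read off the figure as best we can.

```python
from itertools import combinations

def moral_graph(nodes, edges):
    """Return the undirected edge set of the moral graph G^m of a DAG (V, E):
    keep every directed edge as an undirected one, and 'marry' the parents
    of each node by connecting them pairwise."""
    undirected = {frozenset(e) for e in edges}
    parents = {v: [u for (u, w) in edges if w == v] for v in nodes}
    for v in nodes:
        for p, q in combinations(parents[v], 2):
            undirected.add(frozenset((p, q)))
    return undirected

# the DAG of Figure 6.3: 1 -> 3, 1 -> 4, 2 -> 4, 3 -> 5, 4 -> 5
edges = [(1, 3), (1, 4), (2, 4), (3, 5), (4, 5)]
print(sorted(tuple(sorted(e)) for e in moral_graph([1, 2, 3, 4, 5], edges)))
# moralisation adds 1-2 (parents of 4) and 3-4 (parents of 5)
```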

In the anachronistically named moral graph, all parents are "married"!

Definition 6.3.3. Let $G = (V, E)$ be an undirected graph, and let $S$ be some subset of $V$. Let $v_i, v_j \in V \setminus S$. We say that $S$ separates $v_i$ and $v_j$ if all paths from $v_i$ to $v_j$ intersect $S$.

Assume that $A$, $B$ and $S$ are disjoint subsets of $V$. We say that $S$ separates $A$ and $B$ if it separates all $v_i \in A$ and $v_j \in B$.

Example 6.3.4. Consider the graph $G$ depicted in Figure 6.3. The graph is acyclic, and the parent and child sets of each node can be found. For example $P_4 = \{1, 2\}$ and $C_4 = \{5\}$. Similarly $A_4 = \{1, 2, 4\}$, $D_4 = \{5\}$ and $ND_4 = \{1, 2, 3\}$. The corresponding moral graph is also shown in the figure. ◦

[Figure: left, a DAG with nodes 1, ..., 5 and edges 1 → 3, 1 → 4, 2 → 4, 3 → 5, 4 → 5; right, its moral graph, with the additional connections 1 − 2 and 3 − 4]

Figure 6.3: The graphs discussed in Example 6.3.4. On the left is an example of a directed acyclic graph, and on the right the moral graph.

6.4 Bayesian networks

Consider a directed acyclic graph $G = (V, E)$ with $V = \{v_1, \ldots, v_n\}$ and consider a random vector $X$ of length $n$ that is indexed by the nodes in the graph:
$$X = (X_{v_1}, \ldots, X_{v_n}).$$
In the following we shall use the notation $P_v(X) = \{ X_u : u \in P_v \}$ and similarly for $C_v(X)$, $A_v(X)$, $D_v(X)$ and $ND_v(X)$.

Assume that each of the variables $X_v$ has values in the Borel space $(\mathcal{X}, \mathcal{E})$. We let $\mathbf{P} = X(P)$ denote the distribution of $X$ – hence $\mathbf{P}$ is a probability measure on $(\mathcal{X}^n, \mathcal{E}^n)$.

We have the following definition of a Bayesian network.

Definition 6.4.1. Let $G = (V, E)$ be a DAG containing $n$ nodes, and let $X = (X_{v_1}, \ldots, X_{v_n})$ be a vector of random variables with values in $(\mathcal{X}, \mathcal{E})$ indexed by $V$. Let $\mathbf{P}$ be the distribution of $X$.

The triplet $(G, X, \mathbf{P})$ is called a Bayesian network, if for each $v \in V$
$$X_v \perp\!\!\!\perp ND_v(X) \mid P_v(X).$$

Example 6.4.2. Assume that $X$ is indexed by the graph in Figure 6.2, and assume that $(G, X, \mathbf{P})$ is a Bayesian network. Then we e.g. have the conditional independence
$$X_f \perp\!\!\!\perp (X_a, X_b, X_c) \mid (X_d, X_e)$$
and the true independence
$$X_a \perp\!\!\!\perp X_b \perp\!\!\!\perp X_c. \qquad ◦$$

Example 6.4.3. Assume that $X_0, X_1, X_2, \ldots$ is a Markov chain, and consider $X = (X_0, \ldots, X_n)$ for some $n$. Consider the very simple graph $G$ given by
$$0 \to 1 \to 2 \to \cdots \to n.$$
Then for each $k \in \{0, \ldots, n\}$ we have that $P_k = \{k - 1\}$ and $ND_k = \{0, \ldots, k-1\}$, and we therefore have
$$X_k \perp\!\!\!\perp ND_k(X) \mid P_k(X).$$
Hence it is seen that $(G, X, X(P))$ is a Bayesian network. ◦

In the rest of this chapter we shall (for our own convenience and without loss of generality) assume that $V = \{1, \ldots, n\}$ and that this is an ordering of the elements in $V$.

We shall need a little more notation. Let $n(i)$ be the number of elements in $P_i$ for each $i \in V$, and let
$$(P_{i,x})_{x \in \mathcal{X}^{n(i)}}$$
denote the conditional distribution of $X_i$ given $P_i(X)$. If $P_i$ is empty, we simply use the unconditional distribution of $X_i$. Similarly to the definition of $P_i(X)$ we shall consider $P_i(x)$ for the vector $x = (x_1, \ldots, x_n)$. So we have
$$P_i(x) = \{ x_j : j \in P_i \}.$$

Now we can formulate and prove

Theorem 6.4.4. Let $G = (V, E)$ be a graph with $V = \{1, \ldots, n\}$ (for convenience). Let $X = (X_1, \ldots, X_n)$ be a random vector indexed by $V$. Assume that $1, \ldots, n$ is an ordering of the nodes in $V$.

Then $(G, X, \mathbf{P})$ is a Bayesian network if and only if the distribution $\mathbf{P}$ of $X$ is given by the conditional distributions
$$\left\{ (P_{i,x})_{x \in \mathcal{X}^{n(i)}} : i \in V \right\}$$
in the following way:
$$\mathbf{P}(A_1 \times \cdots \times A_n) = \int_{A_1} \cdots \int_{A_{n-1}} P_{n, P_n(x)}(A_n) \, P_{n-1, P_{n-1}(x)}(dx_{n-1}) \cdots P_{2, P_2(x)}(dx_2) \, P_1(dx_1).$$

Proof. For each $i \in \{1, \ldots, n\}$ let $Q_{i,(x_1,\ldots,x_{i-1})}$ be the conditional distribution of $X_i$ given $(X_1, \ldots, X_{i-1})$, and let $Q_1$ be the marginal distribution of $X_1$. With this notation we can always write
$$\mathbf{P}(A_1 \times \cdots \times A_n) = \int_{A_1} \cdots \int_{A_{n-1}} Q_{n,(x_1,\ldots,x_{n-1})}(A_n) \, Q_{n-1,(x_1,\ldots,x_{n-2})}(dx_{n-1}) \cdots Q_{2,x_1}(dx_2) \, Q_1(dx_1).$$

Now the integral representation of $\mathbf{P}(A_1 \times \cdots \times A_n)$ in the theorem follows by noticing that for each $i$ we have the conditional independence
$$X_i \perp\!\!\!\perp (X_1, \ldots, X_{i-1}) \mid P_i(X),$$
such that
$$Q_{i,(x_1,\ldots,x_{i-1})} = P_{i, P_i(x)}.$$

Assume conversely that the distribution $\mathbf{P}$ has the integral form. Then it is seen that for each $i \in \{1, \ldots, n\}$ the conditional distribution of $X_i$ given $(X_1, \ldots, X_{i-1})$ is given by the Markov kernel (with a slight change of the index set)
$$(P_{i,x})_{x \in \mathcal{X}^{n(i)}}. \qquad (6.1)$$
From this we see for each $i$ that
$$X_i \perp\!\!\!\perp (X_1, \ldots, X_{i-1}) \mid P_i(X).$$

Now let $i$ be fixed, and let $v_1, \ldots, v_m$ be the elements among the non–descendants of $i$ that are not among $\{1, \ldots, i-1\}$:
$$\{v_1, \ldots, v_m\} = ND_i \setminus \{1, \ldots, i-1\}.$$
We furthermore assume that $v_1, \ldots, v_m$ are ordered according to the ordering of $V$. We want to show that
$$X_i \perp\!\!\!\perp (X_1, \ldots, X_{i-1}, X_{v_1}, \ldots, X_{v_m}) \mid P_i(X), \qquad (6.2)$$
and this will be shown using induction over $k = 0, \ldots, m$. As the induction start for $k = 0$ we have the conditional independence in (6.1). So now assume that for some $k \in \{0, \ldots, m\}$ we have
$$X_i \perp\!\!\!\perp (X_1, \ldots, X_{i-1}, X_{v_1}, \ldots, X_{v_k}) \mid P_i(X). \qquad (6.3)$$

For $X_{v_{k+1}}$ we have
$$X_{v_{k+1}} \perp\!\!\!\perp (X_1, \ldots, X_{v_{k+1}-1}) \mid P_{v_{k+1}}(X).$$

Note that because of the ordering, we must have $\{1, \ldots, i-1, v_1, \ldots, v_k\} \subseteq \{1, \ldots, v_{k+1}-1\}$. So we can move this information to the conditioning side:
$$X_{v_{k+1}} \perp\!\!\!\perp (X_1, \ldots, X_{v_{k+1}-1}) \mid P_{v_{k+1}}(X), X_1, \ldots, X_{i-1}, X_{v_1}, \ldots, X_{v_k}.$$

Also note that since $v_{k+1}$ is among the non–descendants of $i$, the elements in $P_{v_{k+1}}$ must also be among the non–descendants. Hence $P_{v_{k+1}} \subseteq \{1, \ldots, i-1, v_1, \ldots, v_k\}$, so we actually have
$$X_{v_{k+1}} \perp\!\!\!\perp (X_1, \ldots, X_{v_{k+1}-1}) \mid X_1, \ldots, X_{i-1}, X_{v_1}, \ldots, X_{v_k},$$
which by reduction gives
$$X_{v_{k+1}} \perp\!\!\!\perp X_i \mid X_1, \ldots, X_{i-1}, X_{v_1}, \ldots, X_{v_k}.$$

Hence (since obviously $P_i \subseteq \{1, \ldots, i-1\}$) we have from Theorem 3.4.3 that
$$X_i \perp\!\!\!\perp (X_1, \ldots, X_{i-1}, X_{v_1}, \ldots, X_{v_{k+1}}) \mid P_i(X).$$

So the desired result (6.2) follows by induction.

The integral form from Theorem 6.4.4 becomes much clearer in the situation where the distribution $\mathbf{P}$ has a density.

Theorem 6.4.5. Assume that the distribution of $(X_1, \ldots, X_n)$ has density $f$ with respect to $\nu^{\otimes n}$, where $\nu$ is a $\sigma$–finite measure on $(\mathcal{X}, \mathcal{E})$. Then $(G, X, \mathbf{P})$ is a Bayesian network if and only if the density factorises as
$$f(x_1, \ldots, x_n) = \prod_{i=1}^n f(x_i \mid P_i(x)),$$
where each $f(x_i \mid P_i(x))$ is the density of the conditional distribution of $X_i$ given $P_i(X)$.

Proof. This result follows immediately from Theorem 6.4.4, since we know from Chapter 1 that all the conditional densities exist and that a factorisation as above determines the joint distribution uniquely.

Of course we can use update schemes to describe more concretely how $X_i$ depends on its parents $P_i(X)$. We have results very similar to Theorem 4.1.6 and Theorem 4.1.7.

Theorem 6.4.6. Assume that $X = (X_1, \ldots, X_n)$ is defined recursively such that
$$X_1 = \varphi_1(U_1), \quad X_2 = \varphi_2(X_1, U_2), \quad X_3 = \varphi_3(X_1, X_2, U_3), \quad \ldots, \quad X_n = \varphi_n(X_1, \ldots, X_{n-1}, U_n),$$
where $U_1, \ldots, U_n$ are independent. Define a graph $G = (V, E)$, where $j \notin P_i$ if $j > i$, and if $j < i$ then $j \in P_i$ if and only if $\varphi_i(x_1, \ldots, x_{i-1}, u)$ depends on $x_j$. Then $(G, X, \mathbf{P})$ is a Bayesian network, and $1, \ldots, n$ is an ordering of the nodes in $V$.

Proof. Similar to the proof of Theorem 4.1.6 - but with heavier notation.

The simplest form of the update scheme is a linear model. This is explained in the following example.

Example 6.4.7. Assume that $X = (X_1, \ldots, X_n)$ is defined recursively such that
$$X_1 = \epsilon_1, \quad X_2 = \gamma_{2,1} X_1 + \epsilon_2, \quad X_3 = \gamma_{3,1} X_1 + \gamma_{3,2} X_2 + \epsilon_3, \quad \ldots, \quad X_n = \gamma_{n,1} X_1 + \cdots + \gamma_{n,n-1} X_{n-1} + \epsilon_n,$$
where $\epsilon_1, \ldots, \epsilon_n$ are iid. A Bayesian network over $X$ then consists of a graph $G = (V, E)$ where $\gamma_{i,j} \ne 0$ if and only if $j \in P_i$.

A simple and useful model is obtained by assuming that $\epsilon_i \sim \mathcal{N}(0, 1)$ for $i = 1, \ldots, n$. In that case Theorem 6.4.5 can be applied to derive the joint density of $X$. ◦
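A minimal sketch of this linear Gaussian model (our addition): a strictly lower triangular coefficient matrix encodes the $\gamma_{i,j}$'s (and hence the parent sets), the recursion generates samples, and the factorisation of Theorem 6.4.5 gives the joint log density as a sum of standard normal log densities of the residuals. All numerical values are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(6)

# strictly lower triangular: Gamma[i, j] = gamma_{i,j} for j < i (values arbitrary)
Gamma = np.array([[0.0,  0.0, 0.0],
                  [0.5,  0.0, 0.0],
                  [0.3, -0.7, 0.0]])

def sample(n_draws):
    # draw from the recursion X_i = sum_{j<i} Gamma[i, j] X_j + eps_i, eps_i iid N(0,1)
    X = np.empty((n_draws, 3))
    for i in range(3):
        X[:, i] = X[:, :i] @ Gamma[i, :i] + rng.standard_normal(n_draws)
    return X

def log_density(x):
    # joint log density via the factorisation of Theorem 6.4.5:
    # each factor is the N(0,1) log density of the residual x_i - sum_j gamma_{i,j} x_j
    resid = x - Gamma @ x
    return float(-0.5 * resid @ resid - 0.5 * resid.size * np.log(2 * np.pi))

print(sample(5))
print(log_density(np.array([0.1, 0.2, 0.3])))
```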

Theorem 6.4.8. Assume $(G, X, \mathbf{P})$ is a Bayesian network, and assume that $1, \ldots, n$ is an ordering of the nodes in $V$. Then there exist update functions $\varphi_1, \ldots, \varphi_n$ with $\varphi_i : \mathcal{X}^{n(i)} \times (0, 1) \to \mathcal{X}$, such that if $U_1, \ldots, U_n$ are independent standard uniformly distributed, and $X_1', \ldots, X_n'$ are defined by
$$X_1' = \varphi_1(U_1), \quad X_2' = \varphi_2(X_1', U_2), \quad X_3' = \varphi_3(X_1', X_2', U_3), \quad \ldots, \quad X_n' = \varphi_n(X_1', \ldots, X_{n-1}', U_n),$$
then $(X_1', \ldots, X_n')$ has the same distribution as $(X_1, \ldots, X_n)$.

Proof. Similar to the proof of Theorem 4.1.7.

6.5 Global Markov property

We have the following result, which can be seen as a generalisation of the conditional independence that appears in the definition of a Bayesian network.

Theorem 6.5.1. Assume that $(G, X, \mathbf{P})$ is a Bayesian network, and let $A$, $B$ and $S$ be disjoint sets in $V$, such that $S$ separates $A$ and $B$ in the moral graph $G^m_{A_{A \cup B \cup S}}$ of the sub-graph containing all ancestors of $A$, $B$ and $S$. Then
$$X_A \perp\!\!\!\perp X_B \mid X_S,$$
where e.g. $X_A$ denotes $\{ X_i : i \in A \}$.

Example 6.5.2. In Figure 6.4 a representation of a Bayesian network can be seen. It can be seen that $S$ separates $A$ and $B$ in the moral graph $G^m_{A_{A \cup B \cup S}}$, so we have that
$$X_A \perp\!\!\!\perp X_B \mid X_S.$$

Most of the work in the proof lies in obtaining the following result, which is formulated separately in order to ease notation.

Lemma 6.5.3. Assume that $(G, X, \mathbf{P})$ is a Bayesian network, and let $A$, $B$ and $S$ be disjoint sets in $V$ such that $V = A \cup B \cup S$. Assume furthermore that $S$ separates $A$ and $B$ in the moral graph $G^m$. Then
$$X_A \perp\!\!\!\perp X_B \mid X_S.$$

[Figure: a DAG whose nodes are partitioned into three groups A, S and B, with S separating A and B]

Figure 6.4: A representation of a Bayesian network, where A and B are conditionally independent given S.

Proof. We still assume that $V = \{1, \ldots, n\}$ is ordered. Define for each $k \in \{1, \ldots, n\}$
$$A_k = A \cap \{1, \ldots, k\}, \quad B_k = B \cap \{1, \ldots, k\}, \quad S_k = S \cap \{1, \ldots, k\}.$$
Let $k_0$ be the smallest number among $\{1, \ldots, n\}$ such that both $A_{k_0}$ and $B_{k_0}$ are non–empty.

Assume that $k_0 \in A$. Then both $B_{k_0}$ and $S_{k_0}$ are part of the non–descendants of $k_0$, so
$$X_{k_0} \perp\!\!\!\perp X_{B_{k_0}}, X_{S_{k_0}} \mid P_{k_0}(X).$$

We can move $X_{S_{k_0}}$ to the conditioning side and use reduction:
$$X_{k_0} \perp\!\!\!\perp X_{B_{k_0}} \mid P_{k_0}(X), X_{S_{k_0}},$$
and since the parents of $k_0$ necessarily must be in $S_{k_0}$, we have
$$X_{k_0} \perp\!\!\!\perp X_{B_{k_0}} \mid X_{S_{k_0}}.$$

We now proceed by induction over $k = k_0, \ldots, n$. So assume that for some $k$
$$X_{A_k} \perp\!\!\!\perp X_{B_k} \mid X_{S_k} \qquad (6.4)$$
and consider $k + 1$. There are three different scenarios:

(a) $k + 1 \in A$

(b) $k + 1 \in B$

(c) $k + 1 \in S$

and we would like to argue that in any case it holds that
$$X_{A_{k+1}} \perp\!\!\!\perp X_{B_{k+1}} \mid X_{S_{k+1}}. \qquad (6.5)$$

The arguments for (a) and (b) are identical (because of symmetry), so we shall only consider the scenarios (a) and (c).

First assume that $k + 1 \in A$. Since all of the subsets $A_k$, $B_k$ and $S_k$ are among the non–descendants of $k + 1$, we have
$$X_{k+1} \perp\!\!\!\perp X_{A_k}, X_{B_k}, X_{S_k} \mid P_{k+1}(X),$$
and we move $X_{A_k}$ and $X_{S_k}$ to the conditioning side and use reduction:
$$X_{k+1} \perp\!\!\!\perp X_{B_k} \mid P_{k+1}(X), X_{A_k}, X_{S_k}.$$
Furthermore we see that the parents of $k + 1$ are in $S_k \cup A_k$, so the result is simply
$$X_{k+1} \perp\!\!\!\perp X_{B_k} \mid X_{A_k}, X_{S_k}. \qquad (6.6)$$
Combining (6.4) and (6.6), it follows from Theorem 3.18 that
$$X_{A_k}, X_{k+1} \perp\!\!\!\perp X_{B_k} \mid X_{S_k},$$
which is the same as (6.5).

Now assume instead that $k + 1 \in S$. Then it is not possible that $k + 1$ has parents in both $A$ and $B$. So assume that $P_{k+1} \subseteq A \cup S$ (the proof in the $B$–case is similar). As before we must have
$$X_{k+1} \perp\!\!\!\perp X_{A_k}, X_{B_k}, X_{S_k} \mid P_{k+1}(X),$$
and thereby also
$$X_{k+1} \perp\!\!\!\perp X_{B_k} \mid P_{k+1}(X), X_{A_k}, X_{S_k},$$
such that (since $P_{k+1} \subseteq A_k \cup S_k$)
$$X_{k+1} \perp\!\!\!\perp X_{B_k} \mid X_{A_k}, X_{S_k}.$$
Together with (6.4) this gives, according to Theorem 3.18, that
$$X_{A_k}, X_{k+1} \perp\!\!\!\perp X_{B_k} \mid X_{S_k}.$$

Now we move $X_{k+1}$ to the conditioning side and use reduction. Then
$$X_{A_k} \perp\!\!\!\perp X_{B_k} \mid X_{S_k}, X_{k+1},$$
which is the same as (6.5) in the situation where $k + 1 \in S$.

Proof of Theorem 6.5.1. Let $V'$ be the subset of $V$ that contains all the nodes in $A_{A \cup B \cup S}$, and let $E'$ be all the edges in $E$ that are connections between elements in $V'$. Let $G' = (V', E')$ be the corresponding DAG. Then we must have that
$$(G', X_{V'}, X_{V'}(P))$$
is a Bayesian network. We now define $A'$ to be all the nodes in $V'$ that are not separated from $A$ by $S$. Let $B'$ be all the nodes in $V'$ that are not in $A'$ or $S$. Then obviously $A'$, $B'$ and $S$ are disjoint with $A \subseteq A'$, $B \subseteq B'$, and such that $A'$ and $B'$ are separated by $S$.

Then it follows from Lemma 6.5.3 that
$$X_{A'} \perp\!\!\!\perp X_{B'} \mid X_S,$$
such that it by reduction follows that
$$X_A \perp\!\!\!\perp X_B \mid X_S.$$

Appendix A

Supplementary material

A.1 Measurable spaces

In this section we recall some of the main definitions and results from measure theory that are used throughout the book. Consider a set X and let E be a collection of subsets of X .

Definition A.1.1. We say that $\mathcal{E}$ is a $\sigma$–algebra on $\mathcal{X}$, if it holds that

• $\mathcal{X} \in \mathcal{E}$

• If $A \in \mathcal{E}$, then $A^c \in \mathcal{E}$

• If $A_1, A_2, \ldots \in \mathcal{E}$, then $\cup_{n=1}^\infty A_n \in \mathcal{E}$

If X is some set, and E is a σ–algebra on X , then we say that the pair (X , E) is a measurable space. If D is a collection of subsets of X , then we define σ(D) to be the smallest σ–algebra on X that contains D. For a σ–algebra E on X and a collection H of subsets of X , we say that H is a generating system for E, if E = σ(H).

If it for some collection H of subsets of X holds for all A, B ∈ H that A ∩ B ∈ H, then we say that H is stable under finite intersections.

Definition A.1.2. We say that $\mathcal{H}$ is a Dynkin class on $\mathcal{X}$, if it holds that

1) $\mathcal{X} \in \mathcal{H}$

2) If $A, B \in \mathcal{H}$ with $A \subseteq B$, then $B \setminus A \in \mathcal{H}$

3) If $A_1, A_2, \ldots \in \mathcal{H}$ with $A_1 \subseteq A_2 \subseteq \ldots$, then $\cup_{n=1}^\infty A_n \in \mathcal{H}$

We have

Theorem A.1.3 (Dynkin's lemma). Let $\mathcal{D} \subseteq \mathcal{H} \subseteq \mathcal{E}$ be collections of subsets of $\mathcal{X}$. Assume that $\mathcal{E} = \sigma(\mathcal{D})$ and that $\mathcal{D}$ is stable under finite intersections. If furthermore $\mathcal{H}$ is a Dynkin class, then $\mathcal{H} = \mathcal{E}$.

Definition A.1.4. Let $(\mathcal{X}, \mathcal{E})$ be a measurable space. We say that a function $\mu : \mathcal{E} \to [0, \infty]$ is a measure on $(\mathcal{X}, \mathcal{E})$, if

1) $\mu(\emptyset) = 0$

2) If $A_1, A_2, \ldots \in \mathcal{E}$ are pairwise disjoint sets, then $\mu(\cup_{n=1}^\infty A_n) = \sum_{n=1}^\infty \mu(A_n)$

We say that a measure µ on (X , E) is a probability measure, if µ(X ) = 1. In the affirmative we call (X , E, µ) a probability space.

Theorem A.1.5 (Uniqueness theorem for probability measures). Let µ and ν be two prob- ability measures on (X , E). Let H be a generating system for E which is stable under finite intersection. If µ(A) = ν(A) for all A ∈ H, then µ(A) = ν(A) for all A ∈ E.

Let $(\mathcal{X}, \mathcal{E})$ and $(\mathcal{Y}, \mathcal{K})$ be two measurable spaces. Then we can consider the product space $(\mathcal{X} \times \mathcal{Y}, \mathcal{E} \otimes \mathcal{K})$. Here the product $\sigma$–algebra $\mathcal{E} \otimes \mathcal{K}$ is generated by the system of all product sets
$$\mathcal{D} = \{ A \times B : A \in \mathcal{E}, B \in \mathcal{K} \}.$$

Note that $\mathcal{D}$ is stable under intersections. If $\lambda$ and $\tilde{\lambda}$ are two measures on $(\mathcal{X} \times \mathcal{Y}, \mathcal{E} \otimes \mathcal{K})$ that are equal on product sets, $\lambda(A \times B) = \tilde{\lambda}(A \times B)$ for all $A \in \mathcal{E}$ and $B \in \mathcal{K}$, then according to Theorem A.1.5 we have $\lambda = \tilde{\lambda}$. Let $\mu$ be a measure on $(\mathcal{X}, \mathcal{E})$ and $\nu$ be a measure on $(\mathcal{Y}, \mathcal{K})$. Then $\mu \otimes \nu$ denotes the uniquely determined measure defined by $(\mu \otimes \nu)(A \times B) = \mu(A)\nu(B)$.

Theorem A.1.6 (Tonelli's theorem). Let $\mu$ be a probability measure on $(\mathcal{X}, \mathcal{E})$ and $\nu$ be a probability measure on $(\mathcal{Y}, \mathcal{K})$, and assume that $f$ is nonnegative and $\mathcal{E} \otimes \mathcal{K}$ measurable. Then
$$\int f(x, y) \, d(\mu \otimes \nu)(x, y) = \iint f(x, y) \, d\nu(y) \, d\mu(x).$$

Theorem A.1.7 (Fubini's theorem). Let $\mu$ be a probability measure on $(\mathcal{X}, \mathcal{E})$ and $\nu$ be a probability measure on $(\mathcal{Y}, \mathcal{K})$, and assume that $f$ is $\mathcal{E} \otimes \mathcal{K}$ measurable and $\mu \otimes \nu$ integrable. Then $y \mapsto f(x, y)$ is integrable with respect to $\nu$ for $\mu$-almost all $x$, the set where this is the case is measurable, and it holds that
$$\int f(x, y) \, d(\mu \otimes \nu)(x, y) = \iint f(x, y) \, d\nu(y) \, d\mu(x).$$

We will also need the following abstract change-of-variable theorem

Theorem A.1.8. Let mu be a measure on (X , E) and let (Y, K) be some other measurable space. Let t : XX → YY be measurable, and let f : YY → R be Borel measurable. Then f is t(µ)-integrable if and only if f ◦ t is µ-integrable, and in the affirmative, it holds that R f dt(µ) = R f ◦ t dµ.

Let again (X , E) and (Y, K) be two measurable spaces. Define the inclusion map ix : Y → X × Y by

ix(y) = (x, y) for y ∈ X .

Then ix is K − E ⊗ K–measurable for each fixed y ∈ Y. For G ∈ E ⊗ K define

x −1 G = {y ∈ Y :(x, y) ∈ G} = ix (G)

Note that G^x is K–measurable due to the measurability of i_x.

A.2 Random variables and conditional expectations

Assume that (Ω, F, P) is a probability space and (X , E) is some measurable space. We say that X : Ω → X is a random variable on (Ω, F) with values in (X , E), if it is F – E–measurable. That is,

X^{-1}(A) = (X ∈ A) ∈ F for all A ∈ E.

For a random variable X on (Ω, F) with values in (X , E) we define σ(X) to be the smallest σ–algebra on Ω that makes X measurable. Then σ(X) is the sub-σ–algebra of F given by

σ(X) = {(X ∈ A) : A ∈ E}

We have the following extremely useful result.

Theorem A.2.1. Assume that X is a random variable with values in (X , E) and that Z is a real–valued random variable. Then Z is σ(X)–measurable if and only if there exists a measurable function φ : (X , E) → (R, B) such that

Z = φ ◦ X

Let D be a sub σ–algebra of F.

Definition A.2.2. Let X be a real random variable defined on (Ω, F,P ) with E|X| < ∞. A conditional expectation of X given D is a real random variable denoted E(X | D) that satisfies

1) E(X | D) is D–measurable

2) E|E(X | D)| < ∞

3) For all D ∈ D it holds that

∫_D E(X | D) dP = ∫_D X dP
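As a sanity check of the definition (added here for illustration): for the trivial σ–algebra D = {∅, Ω}, the constant EX satisfies all three conditions, so E(X | {∅, Ω}) = EX a.s.

```latex
% Both defining integral identities hold trivially:
\[
  \int_{\emptyset} EX \, dP = 0 = \int_{\emptyset} X \, dP,
  \qquad
  \int_{\Omega} EX \, dP = EX = \int_{\Omega} X \, dP .
\]
```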

We have uniqueness of conditional expectations

Theorem A.2.3. (1) If U and Ũ are both conditional expectations of X given D, then U = Ũ a.s.

(2) If U is a conditional expectation of X given D and Ũ is D-measurable with U = Ũ a.s., then Ũ is also a conditional expectation of X given D.

and existence

Theorem A.2.4. If X is a real random variable with E|X| < ∞, then there exists a conditional expectation of X given D.

Furthermore we have a series of nice properties. Let X,Xn and Y be real random variables, all of which are integrable.

Theorem A.2.5. (1) If X = c a.s., where c ∈ R is a constant, then E(X|D) = c a.s.

(2) For α, β ∈ R it holds that

E(αX + βY |D) = αE(X|D) + βE(Y |D) a.s.

(3) If X ≥ 0 a.s. then E(X|D) ≥ 0 a.s. If Y ≥ X a.s. then E(Y |D) ≥ E(X|D) a.s.

(4) If D ⊆ E are sub σ-algebras of F, then

E(X|D) = E[E(X|E)|D] = E[E(X|D)|E] a.s.

(5) If σ(X) and D are independent then

E(X|D) = EX a.s.

(6) If X is D-measurable then E(X|D) = X a.s.

(7) If it holds for all n ∈ N that Xn ≥ 0 a.s. and Xn+1 ≥ Xn a.s. with lim Xn = X a.s., then

lim_{n→∞} E(X_n | D) = E(X | D) a.s.

(8) If X is D-measurable and E|XY | < ∞, then

E(XY |D) = XE(Y |D) a.s.

(9) If f : R → R is a measurable function that is convex on an interval I, such that P(X ∈ I) = 1 and E|f(X)| < ∞, then it holds that

f(E(X | D)) ≤ E(f(X) | D) a.s.
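A direct consequence of (9), recorded here for illustration: for square-integrable X, taking f(x) = x² (convex on all of R) gives

```latex
\[
  \bigl( E(X \mid \mathcal{D}) \bigr)^2 \;\le\; E(X^2 \mid \mathcal{D})
  \quad \text{a.s.},
\]
% so the conditional variance
%   V(X | D) = E(X^2 | D) - (E(X | D))^2
% is nonnegative almost surely.
```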

Now assume that X is a random variable with values in (X , E) and that Y is a real random variable with E|Y | < ∞. If we are looking for the conditional expectation of Y given D = σ(X), then we write E(Y | X) rather than E(Y | σ(X)). The resulting random variable is referred to as the conditional expectation of Y given X. We have

Theorem A.2.6. Let Y be a real random variable with E|Y | < ∞, and assume that X is a random variable with values in (X , E). Then the conditional expectation E(Y | X) of Y given X is characterised by

1) E(Y | X) is σ(X)-measurable

2) E|E(Y | X)| < ∞

3) For all A ∈ E it holds that

∫_{(X ∈ A)} E(Y | X) dP = ∫_{(X ∈ A)} Y dP

According to Theorem A.2.1 there exists a measurable map φ : X → R such that

E(Y | X) = φ(X)

Appendix B

Hints for exercises

B.1 Hints for chapter 1

Hints for exercise 1.1. Use Theorem 1.3. Show that the conditional distribution is a hypergeometric distribution and find the parameters. ◦

Hints for exercise 1.3.

(1) Use the extended Tonelli to calculate an integral with respect to (X,Y )(P ) as a double integral.

(2) P_x has density

f_x(y) = (1/x) e^{−y/x} for y > 0

with respect to the Lebesgue measure. Use question (1) to find EY.

Hints for exercise 1.10.

(2) Recall that according to Theorem 1.5.1 there exists an E–B–measurable function φ_F with values in [0, 1] that satisfies (1.4) in the notes. Use this function in the construction.

B.2 Hints for chapter 2

Hints for exercise 2.2. Get inspiration from the proof of Theorem 2.2.1! ◦

Hints for exercise 2.6.

(1) Write

P(Z = 1) = P(U ≤ f(Y)/(c g(Y))) = E( P(U ≤ f(Y)/(c g(Y)) | Y) ) = ...

(a code sketch of the acceptance–rejection step follows below)
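For orientation only, not part of the exercise: a minimal sketch of the acceptance–rejection algorithm this hint is about, assuming a target density f, a proposal density g one can sample from, and a constant c with f ≤ c·g. The half-normal/exponential pair in the demo is my choice, not the exercise’s.

```python
import math
import random

def accept_reject(sample_g, f, g, c, n):
    """Draw n samples from the density f by acceptance-rejection,
    assuming f(y) <= c * g(y) for all y."""
    out = []
    while len(out) < n:
        y = sample_g()               # propose Y ~ g
        u = random.random()          # U ~ uniform(0,1), independent of Y
        if u <= f(y) / (c * g(y)):   # accept with probability f(Y)/(c g(Y))
            out.append(y)
    return out

# Demo: sample a half-normal target via an Exp(1) proposal.
f = lambda y: math.sqrt(2 / math.pi) * math.exp(-y * y / 2)  # target on [0, inf)
g = lambda y: math.exp(-y)                                   # Exp(1) density
c = math.sqrt(2 * math.e / math.pi)                          # then f <= c * g
samples = accept_reject(lambda: random.expovariate(1.0), f, g, c, 1000)
```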

Hints for exercise 2.7.

(4) Use that (U_1, U_2) is uniform on (0, 1)², such that for A ⊆ (0, 1)² the probability P((U_1, U_2) ∈ A) is simply the area of A.

(8) You should show that

∫_A P_y(B) X_{(1)}(P)(dy) = P(X_{(1)} ∈ A, X_{(2)} ∈ B)

Use the change–of–variable theorem to obtain

∫ P_{x_1}(B) X_{(1)}(P)(dx_1) = ∫ 1(X_1 …) ν(B ∩ (X_1, ∞)) …

Use the change–of–variable theorem again (integrate with respect to (X_1, X_2)(P)) and use Tonelli (since X_1 ⊥⊥ X_2).

Hints for exercise 2.9.

(2) Simply use that the points (U_1^1, U_1^2), …, (U_N^1, U_N^2) are independent and e.g. have probability m_2(A_j) of being in A_j.

(3) Show the (unconditional) probability

P(N(A_1) = n_1, …, N(A_m) = n_m) = ∏_{j=1}^m e^{−λ m_2(A_j)} (λ m_2(A_j))^{n_j} / n_j!

(4) Use the points (U_1^1, U_1^2), …, (U_N^1, U_N^2) and then remove some of them with a probability that depends on the value of k(U_k^1, U_k^2). Get inspiration from Exercise 2.6...

Hints for exercise 2.10.

(3) Use (1) and (2) to calculate E(Sn | Sn).

B.3 Hints for chapter 3

Hints for exercise 3.1.

(2) Use Theorem 3.3.6.

(3) Use e.g. the extended Tonelli.

(4) Use (1) and (2).

Hints for exercise 3.2.

(2) Write

P (X1 ∈ A, X2 ∈ B,X1 + X2 ∈ C) = P (X1 ∈ A, (X1 + X2) − X1 ∈ B,X1 + X2 ∈ C)

and write it as an integral with respect to (X_1, X_1 + X_2). Use the extended Tonelli.

(3) For the first result, write B = x − A and use Theorem 3.5.3 to find an alternative expression for R_x. For the second result, you can e.g. consider the distribution function for P_x and realise that it only has the values 0 and 1, such that it must have exactly one jump (and this must be of size 1).

(4) Calculate E(X1 | X1 + X2).

(5) You should use φ_1 = φ. Calculate the probability P(X_1 = φ(X_1 + X_2)) as an integral with respect to (X_1, X_1 + X_2) and use the extended Tonelli. You now have an expression for P_x...

(6) Choose X3 such that X1 + X2 + X3 is particularly simple.

Hints for exercise 3.3.

(2) For i ∈ {1, . . . , m} and j ∈ {1, . . . , n}, let

pij = P (Y = j | X = i)

These conditional probabilities are simply the point probabilities in the conditional distribution of Y given X. Use (1) for each value of i.

Hints for exercise 3.4. Recall that if U is uniform on (0, 1), then −β log(1 − U) has an exponential distribution. ◦
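A minimal code sketch of the inverse-transform fact used in this hint, for orientation only; the parameter β is taken here to be the mean of the exponential distribution, which is an assumption on my part.

```python
import math
import random

def exp_sample(beta: float) -> float:
    """If U ~ uniform(0,1), then -beta * log(1 - U) is exponential
    with mean beta (inverse-transform sampling)."""
    u = random.random()
    return -beta * math.log(1.0 - u)

# Quick sanity check: the empirical mean should be close to beta.
beta = 2.0
xs = [exp_sample(beta) for _ in range(100_000)]
print(sum(xs) / len(xs))  # approximately 2.0
```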

B.4 Hints for chapter 4

Hints for exercise 4.1. Use Theorem 4.1.5 and reduction. ◦

Hints for exercise 4.3.

(3) The Strong Law of Large Numbers may be useful: If X_1, X_2, ... are iid with E|X_1| < ∞, then

(1/n) Σ_{k=1}^n X_k → EX_1   P–a.s.

(5) Rewrite P (Y1 = m1,...,Yn = mn) to an event concerning the Z–variables.

(6) Show that both of the processes N_0, N_1, ... and M_0, M_1, ... only have jumps at the times T_1, T_2, ....

Hints for exercise 4.4.

(1) Rewrite the set (τ + k = n) = (τ = n − k).

(3) You should show that for all n ∈ N_0

(X_{τ+n+1}, τ + n + 1) ⊥⊥ ((X_τ, τ), (X_{τ+1}, τ + 1), ..., (X_{τ+n}, τ + n)) | (X_{τ+n}, τ + n)

Use Theorem 4.2.7 with the stopping time τ + n, and move some information around...

Hints for exercise 4.5.

(1) Use Theorem 4.3.6 for the stopping times τ + k.

(2) See the remark after the proof of Theorem 4.3.5.

Hints for exercise 4.6.

(2) Show that P(U_1 ∈ A_1, ..., U_n ∈ A_n) = P(U_1 ∈ A_1) ··· P(U_n ∈ A_n) for all n ∈ N. Divide according to the value of τ, and use that e.g. U_{k+1} is independent of (U_1 ∈ A_1, ..., U_k ∈ A_k) ∩ (τ = k).

(3) Realise that X_{n+1} = φ(X_n, U_{n+1}).

Hints for exercise 4.7.

(1) Write

X_{σ_B} = Σ_{k=0}^∞ X_{τ_A+k} 1(σ_B = τ_A + k)

and write (σ_B = τ_A + k) as a set involving X_{τ_A+1}, ..., X_{τ_A+k}.

(3) Argue that

(X_1, X_2, ...) | X_0 =^D (X_{τ_k+1}, X_{τ_k+2}, ...) | X_{τ_k}

and that for some function ψ, we have

X_{τ_1} = ψ(X_1, X_2, ...) and X_{τ_{k+1}} = ψ(X_{τ_k+1}, X_{τ_k+2}, ...)

Hints for exercise 4.8.

(1) You should show that

X_{τ∧(n+1)} ⊥⊥ (X_{τ∧0}, ..., X_{τ∧(n−1)}) | X_{τ∧n}

Use the strong Markov property to obtain

(X_{τ∧n}, X_{(τ∧n)+1}) ⊥⊥ F_{τ∧n} | X_{τ∧n}

Firstly, use that (Xτ∧0,...,Xτ∧(n−1)) is Fτ∧n–measurable. Then write

X_{τ∧(n+1)} = X_{(τ∧n)+1} 1(τ>n) + X_{τ∧n} 1(τ≤n)   (B.1)

Obtain e.g. 1(τ≤n) as a function of Xτ∧n.

(2) Use (B.1) and the substitution theorem. Show that the conditional distribution of Y_{n+1} given Y_n is (Q_x)_{x∈X}, where

Q_x = δ_x if x ∈ A, and Q_x = P_x if x ∉ A

◦

B.5 Hints for chapter 5

Hints for exercise 5.1.

(2) Standard approximation argument using indicator functions.

(3) For the last equality recall/use that P_{X_n}(A) is a version of P(X_{n+1} ∈ A | X_n). Use the Markov property.

(4) Use that (τ > n) is F_n–measurable.

(6) See Example 4.3.8.

(7) Use the Strong Law of Large Numbers (applied to both (1/N) Σ_{k=0}^{τ_N − 1} f(X_k) and τ_N/N).

Hints for exercise 5.2.

(1) Show and use that

X_n = ρ^n X_0 + ρ^{n−1} ε_1 + ρ^{n−2} ε_2 + ··· + ρ ε_{n−1} + ε_n

(2) If f is a bounded measurable function and ν = N(µ, σ²), then

∫ f dν = ∫ f(x) (2πσ²)^{−1/2} exp(−(x − µ)²/(2σ²)) dλ(x),

where λ is the Lebesgue measure. Use dominated convergence and that

Σ_{n=0}^∞ ρ^{2n} = 1/(1 − ρ²).
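For orientation only, a simulation sketch of the AR(1) recursion behind Exercise 5.2, assuming the model X_{n+1} = ρ X_n + ε_{n+1} with iid N(0, 1) innovations; it checks the stationary variance 1/(1 − ρ²) empirically.

```python
import random

def simulate_ar1(rho: float, n: int, x0: float = 0.0) -> list:
    """Simulate X_{k+1} = rho * X_k + eps_{k+1} with iid N(0,1) innovations."""
    xs = [x0]
    for _ in range(n):
        xs.append(rho * xs[-1] + random.gauss(0.0, 1.0))
    return xs

# The stationary variance should be 1 / (1 - rho^2); compare empirically.
rho = 0.8
xs = simulate_ar1(rho, 200_000)
tail = xs[1000:]                        # discard burn-in
var = sum(x * x for x in tail) / len(tail)
print(var, 1 / (1 - rho ** 2))          # both approximately 2.78
```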

Hints for exercise 5.3.

(2) Assume for contradiction that a stationary initial distribution µ exists. Find N ∈ N such that P(X_0 ∈ [−N, N]) > 0. Then for all n ∈ N

0 < P(X_0 ∈ [−N, N]) = P(X_n ∈ [−N, N]) = ··· = ∫_R ∫_{[−N,N]} k_x^{(n)}(y) dλ(y) dµ(x).

Let n → ∞ and get 0.

Hints for exercise 5.4.

(1) Simply use the integral representation of P_x^n(A_1 × ··· × A_n).

(4) Consider the conditional likelihood function as a function of ρ and find its maximum.

(5) In Example 5.3.6 (and almost also in Exercise 5.2) it was shown that the transition densities are asymptotically stable and that N(0, 1/(1 − ρ²)) is the stationary initial distribution. Let f_0 be the density for this distribution. Use Theorem 5.3.4 to obtain that e.g.

(1/n) Σ_{i=0}^{n−1} X_i X_{i+1} → E_{f_0·λ}(X_0 X_1) a.s.,

where E_{f_0·λ} means expectation in a model where X_0 ∼ N(0, 1/(1 − ρ²)).

Hints for exercise 5.5. A standard argument using transformation of densities. ◦

Hints for exercise 5.6.

(4) Use that there must exist α < 1 and K < ∞ such that

(ξ(x)² + σ(x)²)/x² < α for |x| > K.

Also use that for all compact sets B we must (?) have

inf_{(x,y) ∈ [−r,r]×B} k_x(y) > 0,

since (x, y) ↦ k_x(y) is continuous.

(5) Use induction and (2).

Hints for exercise 5.8.

(1) You could show that

∫_0^y k_x(u) du = P(|x − U_1| ≤ y).

Use substitution...

(3) Use (2) and the drift function V(x) = e^{βx}. Let α = ∫_0^∞ e^{−βy} f(y) dy.

(4) Use Corollary 5.5.2 and that for all n ∈ N there exists K_n such that x^n ≤ e^{βx} + K_n.

(5) Use that EX_0² = EX_1².

(6) Let X_0 have density e^{−x} and find the density of X_1.

Hints for exercise 5.9. Consider a constant drift function. ◦

Hints for exercise 5.10.

(2) You should simply show that

P(X_1 ∈ A_1, Y_1 ∈ B_1, X_2 ∈ A_2, Y_2 ∈ B_2) = ∫_{A_1×B_1} ( ∫_{A_2×B_2} k_{(x_1,y_1)}(x_2, y_2) d(ν_1 ⊗ ν_2)(x_2, y_2) ) d(X_1, Y_1)(P)(x_1, y_1)

(3) You should show that

f_0(x_2, y_2) = P^{∗1}(f_0) = ∫∫ k_{(x_1,y_1)}(x_2, y_2) f_0(x_1, y_1) dν_1(x_1) dν_2(y_1)

For this, first show that

∫ f_0(x_1, y_1) dν_1(x_1) · f^1_{y_1}(x_2) = f_0(y_1, x_2)

(Recall that f_0 is the density for (X, Y). It can be helpful to invent the notation f_{0,1} for the marginal density of X and f_{0,2} for the marginal density of Y.)

(4) Find the transition density k_{(x_1,y_1)}(x_2, y_2) (this will not depend on x_1). Use 5.9, since the infimum over (x_1, y_1) ∈ [0, 1] × {0, 1} is extremely simple.

◦

Index

σ–algebra, 155
acceptance–rejection algorithm, 43
acyclic graph, 143
adapted, 84
ancestor, 144
AR(1)-process, 78, 121
AR(2)-process, 79
ARCH(1)-process, 122, 131
asymptotic stability, 117
autoregressive process, 78, 79, 121

backward recurrence time, 82
Bayesian network, 146
    densities, 149
    global Markov property, 151
    Markov chain, 147
    update scheme, 149
Borel space, 21

Change-of-variable formula, 157
Chapman–Kolmogorov equation, 103, 116
child, 144
composition, 100
    conditional distribution, 101
conditional covariance, 68
conditional distribution
    and conditional expectation, 36
    and conditional probability, 17
    and densities, 14
    and independence, 11
    composition, 101
    definition, 10
    existence, 18
    given X = x, 11
    given discrete variable, 12
    time homogeneous, 90, 100
    transformation, 28
conditional expectation, 158
    and conditional distribution, 36
    and conditional independence, 57
    transformation, 39
conditional independence
    and conditional expectation, 57
    asymmetric formulation, 58
    of σ-algebras, 55
    of events, 53
    random variables, 61
    reduction, 57
    shifting information, 59
conditional probability
    and conditional distribution, 17
    given σ–algebra, 52
    given X, 16
    given X = x, 16
conditional variance, 40
contraction, 122
convergence of probability measures, 113
cycle, 143

DAG, 143
descendant, 145
directed acyclic graph, 143
directed graph, 143
discrete Markov chain, 74
drift criterion, the, 127
drift function, 128
Dynkin class, 155
Dynkin’s lemma, 156

ergodic process, 112
ergodic theorem, 112
excursion, 97

filtration, 84
first hitting time, 84
first return time, 85
forward recurrence time, 81
Fubini’s theorem, 157
Fubini, extended, 9
function of Markov chain, 82

gene expression data, 142
generating system for σ-algebra, 155
Gibbs sampler, the, 138
global Markov property, 151
graph
    acyclic, 143
    ancestor, 144
    child, 144
    DAG, 143
    descendant, 145
    directed, 143
    moral, 145
    ordering of, 144
    parent, 144
    path, 143
    separation, 145
    subgraph of, 143
    undirected, 145

homogeneous Markov chain, 90

integration
    of Markov kernel, 3
    uniqueness, 5
integration, the, 3
invariant σ–algebra, 112

Kolmogorov’s consistency theorem, 73

Lyapounov function, 128

Markov chain, 71
    Bayesian network, 147
    discrete, 74
    existence, 73
    function of, 82
    given by update scheme, 76, 77
    stationary, 104
    time homogeneous, 90
    weakly time homogeneous, 92
Markov kernel, 1
Markov property, 71
    general, 75, 76
    global, 151
    infinite horizon, 75
    strong, 88, 96
minorant, 123
mixing process, 112
mixture, 4
moral graph, 145

one–step transition probabilities, 72
ordering of graph, 144

parent, 144
path, 143

random time, 84
random walk, 78
    reflecting, 78
recurrence time
    backward, 82
    forward, 81
reduction, 57
reflecting random walk, 78
regeneration, 97
renewal process, 80

separation of graphs, 145
shift, 112
skeleton, 103
SLLN, 111
stability under finite intersections, 155
stationary Markov chain, 104
stationary process, 103
stopping time, 84
    first hitting time, 84
    first return time, 85
strong law of large numbers, 111
strong Markov property, 88, 96
subgraph, 143
substitution theorem, 28

time homogeneous, 90, 100
    weakly, 92
Tonelli’s theorem, 156
Tonelli, extended, 7
transition density, 115
transition matrix, 74
transition probabilities
    density, 115
    one step, 72
trivial σ-algebra, 53

undirected graph, 145
update function
    existence, 65, 67
    for discrete variables, 69
update scheme
    Bayesian network, 149

waiting time, 80