The Hammersley-Clifford Theorem and its Impact on Modern Statistics

Helge Langseth

Department of Mathematical Sciences, Norwegian University of Science and Technology

Outline

→ Historical review

→ Hammersley-Clifford’s theorem

→ Usage in

• Spatial models on a lattice

• Point processes

• Graphical models

• Markov Chain Monte Carlo

→ Conclusion

Markov chains in higher dimensions

[Figure: a one-dimensional Markov chain Xi−1 — Xi — Xi+1 and its two-dimensional lattice analogue centred on Xi,j; Paul Lévy (1948)]

→ Define the neighbouring set in the 2D-model:

N (xi,j) = {xi−1,j, xi+1,j, xi,j−1, xi,j+1}

→ Sought independence relations:

p(xi,j | x \ {xi,j}) = p(xi,j | N (xi,j))

Example: The Ising model (Ising, 1925):

→ Model for ferromagnetism

→ Xi,j ∈ {−1, 1},  Ei,j(x) = (−1/kT) · xi,j · Σ_{xℓ,m ∈ N (xi,j)} xℓ,m

→ p(x) = (1/Z) · exp(−Σ_{i,j} Ei,j(x))
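The sought independence relation can be checked directly from the Ising specification above. Below is a minimal numerical sketch (not from the original slides; it assumes numpy, a small lattice, and an arbitrary temperature) that computes p(xi,j | everything else) from the global energy; only the terms involving the four neighbours actually change, which is exactly the local Markov property the slide asks for.

```python
import numpy as np

rng = np.random.default_rng(0)
kT = 2.0                               # illustrative temperature parameter
x = rng.choice([-1, 1], size=(5, 5))   # a small spin configuration

def neighbours(i, j, shape):
    """The nearest-neighbour set N(x_{i,j}), staying inside the lattice."""
    cand = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
    return [(k, l) for (k, l) in cand if 0 <= k < shape[0] and 0 <= l < shape[1]]

def total_energy(x):
    """Sum of E_{i,j}(x) = -(1/kT) * x_{i,j} * (sum of neighbouring spins) over all sites."""
    E = 0.0
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            s = sum(x[k, l] for (k, l) in neighbours(i, j, x.shape))
            E += -(1.0 / kT) * x[i, j] * s
    return E

def conditional(x, i, j):
    """p(x_{i,j} = +1 | all other spins), computed from the *global* energy.
    Every term not involving x_{i,j} cancels in the normalisation, so the
    result depends on the remaining spins only through the four neighbours."""
    w = {}
    for s in (+1, -1):
        y = x.copy()
        y[i, j] = s
        w[s] = np.exp(-total_energy(y))
    return w[+1] / (w[+1] + w[-1])

print(conditional(x, 2, 2))
```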

Defining the Markov models in two dimensions

[Figure: the lattice neighbourhood of Xi,j, shown once for each model formulation]

Joint model (Whittle, 1963): p(x) = Π_{i,j} Ψi,j(xi,j, N (xi,j))

Conditional model (Bartlett, 1966): p(xi,j | x \ {xi,j}) = p(xi,j | N (xi,j))

→ For nearest-neighbour systems: the class of joint models contains the class of conditional models (Brook, 1964)

→ Not known (at the time) how to define the full joint distribution from the conditional distributions

→ Severe constraints in Bartlett’s model
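One ingredient that later resolved this is Brook's (1964) factorization lemma, which lets the full conditionals determine the joint up to a constant. A sketch of the identity (not shown on the original slide), for any two configurations x and y with positive probability:

```latex
\frac{p(x)}{p(y)}
  = \prod_{i=1}^{n}
    \frac{p\!\left(x_i \mid x_1,\dots,x_{i-1},\, y_{i+1},\dots,y_n\right)}
         {p\!\left(y_i \mid x_1,\dots,x_{i-1},\, y_{i+1},\dots,y_n\right)}
```

Fixing a reference configuration y (for instance y = 0) recovers p(x) up to the normalising constant from the conditionals alone, which is also where the positivity condition enters.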

Besag (1972) on nearest neighbour systems

What is the most general form of the conditional probability functions that define a coherent joint function? And what will the joint look like?

→ Assume p(x) > 0, and define

Q(xi,j | xi−1,j, xi+1,j, xi,j−1, xi,j+1) = log [ p(xi,j | N (xi,j)) / p(0 | N (xi,j)) ]

→ Q(x | t, u, v, w) ≡ x · {ψ0(x) + t·ψ1(x, t) + u·ψ1(u, x) + v·ψ2(x, v) + w·ψ2(w, x)}

→ Let xB be the boundary, and xI = x \ xB. Then

p(xI | xB = 0) = (1/Z) · exp( Σ_{i,j} [ xi,j ψ0(xi,j) + xi−1,j ψ1(xi,j, xi−1,j) + xi,j−1 ψ2(xi,j, xi,j−1) ] )
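For binary data this general form specialises to the auto-logistic model that gives Besag's 1972 paper its title. As an illustration not on the original slide, with xi,j ∈ {0, 1} and constant parameters α and β, the conditionals take the logistic form:

```latex
p\!\left(x_{i,j}=1 \mid N(x_{i,j})\right)
  = \frac{\exp\!\big(\alpha + \beta \sum_{x_{\ell,m}\in N(x_{i,j})} x_{\ell,m}\big)}
         {1 + \exp\!\big(\alpha + \beta \sum_{x_{\ell,m}\in N(x_{i,j})} x_{\ell,m}\big)}
```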

Hammersley-Clifford's theorem - Notation

[Figure: a node Xi and its neighbourhood N (Xi) in the graph]

→ Define a graph G = (V, E), s.t. V = {X1, . . . , Xn} and

{Xi, Xj} ∈ E iff

p(xi | {x1, . . . , xn} \ {xi}) ≠ p(xi | {x1, . . . , xn} \ {xi, xj})

→ Define N (Xi) s.t. Xj ∈ N (Xi) iff {Xi, Xj} ∈ E

→ C ⊆ V is a clique iff C ⊆ {X} ∪ N (X) for all X ∈ C.
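A small sketch of this notation in code (not from the slides; it assumes the networkx package and a hand-picked edge set). It builds an undirected dependence graph, reads off a neighbourhood, and lists the maximal cliques that the factorization on the next slide runs over:

```python
import networkx as nx

# Hypothetical dependence structure: which full conditionals depend on which variables.
G = nx.Graph()
G.add_nodes_from(["X1", "X2", "X3", "X4"])
G.add_edges_from([("X1", "X2"), ("X2", "X3"), ("X2", "X4"), ("X3", "X4")])

# Neighbourhood N(Xi): all Xj with {Xi, Xj} in E.
print("N(X2) =", sorted(G.neighbors("X2")))          # ['X1', 'X3', 'X4']

# Maximal cliques; every subset of a maximal clique is also a clique,
# so these are enough to index the factors phi_C in the theorem.
print("maximal cliques:", list(nx.find_cliques(G)))
```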

Hammersley-Clifford's theorem - Result

Assume that p(x1, . . . , xn) > 0 (positivity condition). Then

p(x) = (1/Z) · Π_{C ∈ cl(G)} φC(xC)

Thus, the following are equivalent (given the positivity condition):

Local Markov property: p(xi | x \ {xi}) = p(xi | N (xi))

Factorization property: The probability factorizes according to the cliques of the graph

Global Markov property: p(xA | xB, xS) = p(xA | xS)

whenever xA and xB are separated by xS in G
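A minimal worked example of the equivalence (not on the original slides; it assumes the chain graph X1 − X2 − X3, whose cliques are {X1, X2} and {X2, X3}):

```latex
p(x_1,x_2,x_3) = \tfrac{1}{Z}\,\phi_{12}(x_1,x_2)\,\phi_{23}(x_2,x_3)
\quad\Longrightarrow\quad
p(x_1 \mid x_2, x_3)
  = \frac{\phi_{12}(x_1,x_2)}{\sum_{x_1'}\phi_{12}(x_1',x_2)}
  = p(x_1 \mid x_2)
```

So X1 ⊥ X3 | X2, which is both the local and the global Markov property for this graph.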

Hammersley-Clifford's theorem - Proof

Line of proof due to Besag (1974), who clarified the original “circuitous” proof by Hammersley & Clifford

→ Assume the positivity condition holds

→ Let Q(x) = log[p(x)/p(0)]

→ There exists a unique expansion of Q(x),

Q(x) = Σ_{1≤i≤n} xi Gi(xi) + Σ_{1≤i<j≤n} xi xj Gi,j(xi, xj) + · · · + x1 x2 · · · xn G1,2,...,n(x1, x2, . . . , xn)

→ Gi,j,...,s(xi, xj, . . . , xs) ≠ 0 only if {i, j, . . . , s} ∈ cl(G)
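The key step, sketched here as a hedged addition: writing x(i) for x with xi replaced by 0, positivity makes Q(x) − Q(x(i)) well defined, and

```latex
Q(x) - Q\big(x^{(i)}\big)
  = \log \frac{p(x_i \mid x_1,\dots,x_{i-1},x_{i+1},\dots,x_n)}
              {p(0 \mid x_1,\dots,x_{i-1},x_{i+1},\dots,x_n)}
```

which by the local Markov property depends only on xi and N (xi). Evaluating the expansion at configurations where only xi and xj are non-zero then forces Gi,j ≡ 0 whenever Xj ∉ N (Xi), and similarly for the higher-order terms, so only clique functions survive.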

Positivity condition: Historical implications

→ Hammersley & Clifford (1971) base their proof on the positivity condition:

p(x1, . . . , xn) > 0

→ They find the positivity condition unnatural, and postpone publication in the hope of relaxing it

→ They are thereby preceded by Besag (1974) in publishing the theorem

→ Moussouris (1974) shows by a counter-example involving only four variables that the positivity condition is required

Markov properties on DAGs

[Figure: an example DAG on X1, . . . , X6]

→ Define a DAG G→ = (V, E→) for a well-ordering X1 ≺ X2 ≺ · · · ≺ Xn s.t. V = {X1, . . . , Xn} (as before)

→ Assume Xj ≺ Xi. Then (Xj, Xi) ∈ E→ (i.e., Xj → Xi in G→) iff

p(xi | x1, . . . , xi−1) ≠ p(xi | x1, . . . , xj−1, xj+1, . . . , xi−1)

→ Define the parents of Xi as pa(Xi) = {Xj : (Xj, Xi) ∈ E→}

Directed factorization property: p(x) factorizes according to G→ iff p(x) = Π_i p(xi | pa(xi))
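A minimal sketch of the directed factorization (a toy example, not from the slides): a three-variable DAG X1 → X2 → X3 with hypothetical conditional tables, where the joint is simply the product of the per-node conditionals.

```python
from itertools import product

# Hypothetical CPTs for binary variables on the DAG X1 -> X2 -> X3.
p_x1 = {0: 0.6, 1: 0.4}
p_x2_given_x1 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}    # p(x2 | x1)
p_x3_given_x2 = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.25, 1: 0.75}}  # p(x3 | x2)

def joint(x1, x2, x3):
    """Directed factorization: p(x) = prod_i p(x_i | pa(x_i))."""
    return p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x2[x2][x3]

# Sanity check: the factorized joint sums to one.
print(sum(joint(*cfg) for cfg in product((0, 1), repeat=3)))  # 1.0
```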

Markov properties on DAGs (cont'd)

[Figure: the moral graph of the example DAG on X1, . . . , X6]

→ Define the moral graph Gm = (V, Em) from G→ by connecting parents and dropping edge directions

→ Note that {Xi} ∪ pa(Xi) ∈ cl(Gm), i.e., the directed factorization relates to cl(Gm)

→ Local and Global Markov properties defined “as usual”

The following are equivalent even without the positivity condition (Lauritzen et al., 1990):

→ Factorization property

→ Local Markov property

→ Global Markov property
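A sketch of the moralization step (not from the slides; it assumes networkx and an arbitrary example DAG): marry all pairs of parents of each node, then drop the directions.

```python
import networkx as nx
from itertools import combinations

def moral_graph(dag: nx.DiGraph) -> nx.Graph:
    """Build G^m from a DAG: connect the parents of every node, drop directions."""
    moral = dag.to_undirected()
    for node in dag.nodes:
        for u, v in combinations(dag.predecessors(node), 2):
            moral.add_edge(u, v)       # "marry" the parents
    return moral

# Example DAG with a collider at X3, so its parents X1 and X2 get married.
dag = nx.DiGraph([("X1", "X3"), ("X2", "X3"), ("X3", "X4")])
print(sorted(moral_graph(dag).edges()))
# [('X1', 'X2'), ('X1', 'X3'), ('X2', 'X3'), ('X3', 'X4')]
```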

Spatial statistics

The theorem has had major implications in many areas of spatial statistics. Application areas include:

→ Quantitative geography (e.g., Besag, 1975)

→ Geographical analysis of the spread of diseases (e.g., Clayton & Kaldor, 1987)

→ Image analysis (e.g., Geman & Geman, 1984)

Markov Point Processes

[Figure: a point ξ and the ball of radius r that defines its neighbour set N (ξ|r)]

→ Consider a point process on, e.g., R^n

→ Let x = {x1, x2, . . . , xm} be the observed points

→ Define the neighbour set as N (ξ|r) = {xi : ||ξ − xi|| ≤ r}

→ A density function f is Markov if f(ξ | x) depends only on ξ and N (ξ) ∩ x

→ Ripley & Kelly (1977): f(x) is a Markov function iff there exist functions φC s.t. f(x) = (1/Z) · Π_{C ∈ cl(G)} φC(xC)
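A classic instance of such a Markov density is the Strauss process. The sketch below (an illustration not on the slides, assuming numpy and arbitrary parameter values) evaluates its unnormalised density and shows that the conditional for a new point ξ only looks at the points of x within distance r of ξ.

```python
import numpy as np

beta, gamma, r = 2.0, 0.5, 0.3   # hypothetical Strauss parameters

def n_close_pairs(points, r):
    """Number of neighbour pairs, i.e. pairs of points closer than r."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    return int(np.sum(np.triu(d <= r, k=1)))

def strauss_unnormalised(points):
    """Unnormalised Strauss density: beta^n(x) * gamma^s(x)."""
    return beta ** len(points) * gamma ** n_close_pairs(points, r)

def conditional_intensity(xi, points):
    """f(xi | x) up to a constant: depends only on N(xi|r) ∩ x."""
    n_neigh = int(np.sum(np.linalg.norm(points - xi, axis=1) <= r))
    return beta * gamma ** n_neigh

x = np.random.default_rng(1).uniform(size=(6, 2))   # six points in the unit square
print(strauss_unnormalised(x), conditional_intensity(np.array([0.5, 0.5]), x))
```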


Log-linear models

→ The analysis of contingency tables was set into the framework of log-linear models in the 1970s

→ log p(x) = u∅ + Σ_i ui(xi) + · · · + u1...n(x1, . . . , xn)

→ Connection with Hammersley & Clifford's theorem made by Darroch et al. (1980):

• G is defined s.t. Xi and Xj are only connected if uij ≠ 0 (with consistency assumptions)

• A Hammersley & Clifford theorem can be proven for this structure

• Representational benefits follow for the class of graphical models
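A small illustration of that connection (not on the slides): for three variables with all terms involving the pair {1, 3} set to zero, the model

```latex
\log p(x) = u_{\emptyset} + u_1(x_1) + u_2(x_2) + u_3(x_3)
          + u_{12}(x_1,x_2) + u_{23}(x_2,x_3)
```

corresponds to the graph X1 − X2 − X3, and the Hammersley-Clifford factorization gives X1 ⊥ X3 | X2 directly from the missing interaction terms.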

MCMC and the Gibbs sampler

→ Metropolis-Hastings algorithm: Define a Markov chain which has a desired distribution π(·) as its unique stationary distribution

Algorithm:

1. Initialization: x(0) ← fixed value

2. For i = 1, 2, . . .:

i) Sample y from q(y | x(i−1))

ii) Define αy ← [π(y) · q(x(i−1) | y)] / [π(x(i−1)) · q(y | x(i−1))]

iii) x(i) ← y with p = min{1, αy}, and x(i) ← x(i−1) with p = max{0, 1 − αy}
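A compact sketch of the algorithm (an illustration not from the slides, assuming numpy, a standard-normal target π and a Gaussian random-walk proposal q):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_pi(x):
    """Log of the (unnormalised) target density: standard normal."""
    return -0.5 * x ** 2

def metropolis_hastings(n_steps, x0=0.0, step=1.0):
    x = x0
    samples = []
    for _ in range(n_steps):
        y = x + step * rng.normal()                 # symmetric proposal, so q cancels
        alpha = np.exp(log_pi(y) - log_pi(x))       # acceptance ratio
        if rng.uniform() < min(1.0, alpha):
            x = y                                   # accept
        samples.append(x)                           # otherwise keep x^(i-1)
    return np.array(samples)

draws = metropolis_hastings(10_000)
print(draws.mean(), draws.std())   # roughly 0 and 1
```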

MCMC and the Gibbs sampler (cont'd)

→ Geman & Geman (1984): Metropolis-Hastings for high-dimensional x

→ Problem: How to sample y and calculate αy efficiently?

→ Solution: Flip only one element xj(i) at a time:

x(i+1) = (x1(i), . . . , xj−1(i), xj(i+1), xj+1(i), . . . , xn(i))

→ q(y | x(i)) is defined by the conditional probability p(xj | x(i)):

p(xj(i+1) | x(i)) = (1/Zj) · Π_{C: Xj ∈ C} φC(xC(i))

→ αy = 1 for the Gibbs sampler

→ An algorithm of constant time complexity can be designed!
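Tying the pieces together, here is a minimal single-site Gibbs sweep for the Ising model from the start of the talk (a sketch not from the slides, assuming numpy and the same energy definition as earlier). Because of the clique factorization, the full conditional for one spin only needs the energy terms that involve that spin, i.e. its four neighbours, so each update costs constant time.

```python
import numpy as np

rng = np.random.default_rng(0)
kT = 2.0
x = rng.choice([-1, 1], size=(16, 16))

def neighbours(i, j, shape):
    cand = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
    return [(k, l) for (k, l) in cand if 0 <= k < shape[0] and 0 <= l < shape[1]]

def local_energy(x, i, j):
    """Sum of the energy terms E_{k,l}(x) that involve x_{i,j}: the site itself
    and its neighbours. All other terms cancel in the full conditional."""
    sites = [(i, j)] + neighbours(i, j, x.shape)
    E = 0.0
    for (k, l) in sites:
        s = sum(x[a, b] for (a, b) in neighbours(k, l, x.shape))
        E += -(1.0 / kT) * x[k, l] * s
    return E

def gibbs_sweep(x):
    """One pass of single-site Gibbs updates; alpha_y = 1, nothing is rejected."""
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            w = {}
            for s in (+1, -1):
                x[i, j] = s
                w[s] = np.exp(-local_energy(x, i, j))
            p_plus = w[+1] / (w[+1] + w[-1])
            x[i, j] = +1 if rng.uniform() < p_plus else -1
    return x

for _ in range(20):
    x = gibbs_sweep(x)
print("mean magnetisation:", x.mean())
```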

Too much of a good thing?

→ Global properties from local models:

• Model error dominates (e.g. Rue and Tjelmeland, 2002)

• The critical temperature of the Ising model

“Beware — Gibbs can be dangerous!” Spiegelhalter et al. (1995): The BUGS v0.5 manual, p. 1

→ Alternative representations:

• Bayesian networks (e.g. Pearl, 1988)

• Vines (e.g. Bedford and Cooke, 2001)

Clifford's (MCMC) conclusion

“. . . from now on we can compare our data with the model we actually want to use rather than with a model which has some mathematically convenient form. This is surely a revolution.”

Dr. Peter Clifford (1993), The Royal Statistical Society meeting on the Gibbs sampler and other statistical Markov Chain Monte Carlo methods

Journal of the Royal Statistical Society, Series B, 55(1), p. 53

References

I have benefited from the opinions of Peter Clifford, A. Philip Dawid, Steffen L. Lauritzen, David J. Spiegelhalter, and Håvard Rue on these issues.

→ Adrian Baddeley and Jesper Møller (1989): Nearest-Neighbour Markov Point Processes and Random Sets. International Statistical Review, 57, pp. 89–121.

→ Tim J. Bedford and Roger M. Cooke (2001): Probability density decomposition for conditionally dependent random variables modelled by vines. Annals of Mathematics and AI, 32, pp. 245–268.

→ Julian Besag (1972): Nearest-neighbour Systems and the Auto-logistic Model for Binary Data. Journal of the Royal Statistical Society, Series B, 34, pp. 75–83.

→ Julian Besag (1974): Spatial Interaction and the Statistical Analysis of Lattice Systems. Journal of the Royal Statistical Society, Series B, 36, pp. 192–236.

→ Julian Besag (1975): Statistical Analysis of Non-lattice Data. The Statistician, 24, pp. 179–195.

→ Julian Besag (1991): Spatial Statistics in the Analysis of Agricultural Field Experiments. In: Spatial Statistics and Digital Image Analysis. Washington, D.C.: National Academy Press.

References (cont'd)

→ Peter Clifford (1990): Markov Random Fields in Statistics. In: Geoffrey Grimmett and Dominic Welsh (Eds.), Disorder in Physical Systems: A Volume in Honour of John M. Hammersley, pp. 19–32. Oxford University Press.

→ Peter Clifford (1993): Discussion on the meeting on the Gibbs sampler and other statistical Markov Chain Monte Carlo methods. Journal of the Royal Statistical Society, Series B, 55, pp. 53–102.

→ John N. Darroch, Steffen L. Lauritzen, and Terry P. Speed (1980): Markov fields and log-linear interaction models for contingency tables. Annals of Statistics, 8, pp. 522–539.

→ Stuart Geman and Donald Geman (1984): Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, pp. 721–741.

→ John M. Hammersley and Peter Clifford (1971): Markov fields on finite graphs and lattices. Unpublished.

→ S. L. Lauritzen, A. P. Dawid, B. N. Larsen, and H.-G. Leimer (1990): Independence Properties of Directed Markov Fields. Networks, 20, pp. 491–505.

→ John Moussouris (1974): Gibbs and Markov Random Systems with Constraints. Journal of Statistical Physics, 10, pp. 11–33.

→ Brian D. Ripley and Frank P. Kelly (1977): Markov point processes. Journal of the London Mathematical Society, 15, pp. 188–192.