Nonparametric Bayesian priors for hidden Markov random fields
Florence Forbes, Hongliang Lu, Julyan Arbel

To cite this version:

Florence Forbes, Hongliang Lu, Julyan Arbel. Nonparametric Bayesian priors for hidden Markov random fields. JSM 2018 - Joint Statistical Meeting, Jul 2018, Vancouver, Canada. pp.1-38. hal-01941679

HAL Id: hal-01941679 https://hal.archives-ouvertes.fr/hal-01941679 Submitted on 12 Dec 2018

Bayesian Nonparametric Priors for Hidden Markov Random Fields: Application to Image Segmentation

Hongliang Lü, Julyan Arbel, Florence Forbes

Inria Grenoble Rhône-Alpes and University Grenoble Alpes, Laboratoire Jean Kuntzmann, Mistis team
fl[email protected]

July 2018

Unsupervised image segmentation

Challenges for mixture models (clustering)

inhomogeneities, noise

How many segments? [Figure: T1 gado brain image segmented into 2 classes and 4 classes]

Extensions of mixture model with spatial regularization

Outline of the talk

1. Dirichlet process (DP)
2. Spatially-constrained mixture model: DP-Potts mixture model (finite mixture model, Bayesian finite mixture model, DP mixture model, DP-Potts mixture model)
3. Inference using variational approximation
4. Some image segmentation results
5. Conclusion and future work

Dirichlet process (DP)

The DP is a central Bayesian nonparametric (BNP) prior¹.

Definition (Dirichlet process). A Dirichlet process on the space Y is a random probability measure G for which there exist α (concentration parameter) and G0 (base distribution) such that, for any finite partition {A1, ..., Ap} of Y, the random vector (G(A1), ..., G(Ap)) is Dirichlet distributed:

(G(A1), ..., G(Ap)) ∼ Dir(αG0(A1), ..., αG0(Ap))

Notation: G ∼ DP(α, G0)

The DP is the infinite-dimensional generalization of the Dirichlet distribution.

¹ Ferguson, T. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2):209-230.

Dirichlet process (DP) construction

A DP prior G can be constructed using three methods:
- the Blackwell-MacQueen urn scheme
- the Chinese restaurant process
- the stick-breaking construction

The DP has almost surely discrete realizations²:

G = Σ_{k=1}^∞ π_k δ_{θ*_k}

where θ*_k ∼ G0 i.i.d. and π_k = π̃_k ∏_{l<k} (1 − π̃_l), with π̃_l ∼ B(1, α) i.i.d.

² Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4:639-650.
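For concreteness, a minimal Python sketch of a truncated stick-breaking draw from DP(α, G0); the truncation level K and the N(0, 1) base measure in the usage line are illustrative assumptions, not part of the slides:

```python
import numpy as np

def dp_stick_breaking(alpha, g0_sampler, K=100, rng=None):
    """Truncated stick-breaking draw of G ~ DP(alpha, G0): returns weights and atoms."""
    rng = np.random.default_rng() if rng is None else rng
    tau = rng.beta(1.0, alpha, size=K)       # stick proportions ~ B(1, alpha)
    weights = tau * np.cumprod(np.concatenate(([1.0], 1.0 - tau[:-1])))
    atoms = g0_sampler(K, rng)               # atoms theta*_k ~ G0, i.i.d.
    return weights, atoms

# Example with G0 = N(0, 1), as in the simulations at the end of the deck
w, a = dp_stick_breaking(1.0, lambda K, rng: rng.normal(0.0, 1.0, K))
```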

Spatially-constrained mixture model: DP-Potts mixture model

Finite mixture model

Clustering/segmentation: finite mixture models assume the data are generated by a finite sum of distributions. For y = (y1, ..., yN) i.i.d. with yi = (yi1, ..., yiD) ∈ R^D:

p(yi | θ*, π) = Σ_{k=1}^K π_k F(yi | θ*_k)

where θ* = (θ*_1, ..., θ*_K) are the class parameters and π = (π1, ..., πK) the mixture weights, with Σ_{k=1}^K π_k = 1. θ* and π can be estimated with the EM algorithm.

Equivalently, with G = Σ_{k=1}^K π_k δ_{θ*_k} non-random:

θi ∼ G and yi | θi ∼ F(yi | θi).

Limitation: the number of components K must be specified beforehand.
Solution: assume an infinite number of components using BNP priors.

Bayesian finite mixture model

In a Bayesian setting, a prior distribution is placed over θ∗ and π.

Thus, the posterior distribution of parameters given the observations is

p(θ∗, π|y) ∝ p(y|θ∗, π)p(θ∗, π)

To generate a data point within a Bayesian finite mixture model:

θ*_k ∼ G0, k = 1, ..., K
(π1, ..., πK) ∼ Dir(α/K, ..., α/K)
G = Σ_{k=1}^K π_k δ_{θ*_k} is a random measure
θi | G ∼ G, which means θi = θ*_k with probability π_k
yi | θi ∼ F(yi | θi)


DP mixture model

From a Bayesian finite mixture model to a DP mixture model

To establish a DP mixture model, let G be a DP prior (K → ∞), namely

G ∼ DP(α, G0)

and complement it with a likelihood associated with each θi.

To generate a data point within a DP mixture model:

G ∼ DP(α, G0)

θi|G ∼ G

yi|θi ∼ F (yi|θi)
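A hedged sketch of this generative process for a toy 1-D Gaussian likelihood F, reusing the truncated stick-breaking draw above (the base measure, observation noise and truncation level are illustrative assumptions):

```python
import numpy as np

def sample_dpmm(N, alpha, K=100, obs_sd=0.5, rng=None):
    """Generate N points from a (truncated) DP mixture of 1-D Gaussians."""
    rng = np.random.default_rng() if rng is None else rng
    tau = rng.beta(1.0, alpha, size=K)
    pi = tau * np.cumprod(np.concatenate(([1.0], 1.0 - tau[:-1])))
    pi /= pi.sum()                        # renormalize so truncated weights sum to one
    atoms = rng.normal(0.0, 5.0, size=K)  # theta*_k ~ G0 = N(0, 5^2), an assumed base
    z = rng.choice(K, size=N, p=pi)       # theta_i | G ~ G, recorded as component labels
    y = rng.normal(atoms[z], obs_sd)      # y_i | theta_i ~ F(y_i | theta_i)
    return y, z
```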

2D point clustering (unsupervised learning) based on the DP mixture model:

Let the data speak for themselves!

Application to image segmentation:

[Figure: original image (left) and segmentation by the DP mixture model (right)]

Drawback: spatial constraints and dependencies are not considered.
Solution: combine the DP prior with a hidden Markov random field (HMRF).

DP-Potts mixture model

To solve the issue, we introduce a spatial component:

M(θ) ∝ exp( β Σ_{i∼j} δ_{z(θi)=z(θj)} )

with θ = (θ1, ..., θN ) and β the interaction parameter.

The DP mixture model is thus extended:

G ∼ DP(α, G0)
θ | M, G ∼ M(θ) × ∏_i G(θi)
yi | θi ∼ F(yi | θi)
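As a small illustration, the unnormalized log of the Potts term for a label image, assuming the 4-neighbour grid that is natural for images (the slides do not fix a particular neighbourhood i ∼ j):

```python
import numpy as np

def potts_log_potential(labels, beta):
    """log M(theta) up to a constant: beta times the number of equal-label neighbour pairs."""
    z = np.asarray(labels)
    same = (z[:, :-1] == z[:, 1:]).sum() + (z[:-1, :] == z[1:, :]).sum()
    return beta * same
```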


Other spatially-constrained BNP mixture models + inference algorithms:
- DP- or PYP-Potts partition model + MCMC³
- hemodynamic brain parcellation (DP-Potts) + partial VB⁴
- DP- or PYP-Potts + iterated conditional modes (ICM)⁵

Markov chain Monte Carlo (MCMC). Advantage: asymptotically exact. Drawback: computationally expensive.
Variational Bayes (VB). Advantage: much faster. Drawback: less accurate, no theoretical guarantee.

We propose a DP-Potts mixture model based on a general stick-breaking construction that allows a natural full-VB algorithm, enabling scalable inference on large datasets and straightforward generalization to other priors.

³ Orbanz & Buhmann (2008); Xu, Caron & Doucet (2016); Sodjo, Giremus, Dobigeon & Giovannelli (2017)
⁴ Albughdadi, Chaari, Tourneret, Forbes & Ciuciu (2017)
⁵ Chatzis & Tsechpenakis (2010); Chatzis (2013)

DP-Potts: stick-breaking construction

Stick-breaking construction of the DP, G ∼ DP(α, G0):

θ*_k | G0 ∼ G0, k = 1, 2, ...
τ_k | α ∼ B(1, α), k = 1, 2, ...
π_k(τ) = τ_k ∏_{l<k} (1 − τ_l), k = 1, 2, ...
G = Σ_{k=1}^∞ π_k(τ) δ_{θ*_k}

plus

θi | G ∼ G
yi | θi ∼ F(yi | θi)

= Dirichlet process mixture model (DPMM)


Equivalently, since G = Σ_{k=1}^∞ π_k(τ) δ_{θ*_k}, drawing θi | G ∼ G means θi = θ*_k with probability π_k(τ), and then yi | θi ∼ F(yi | θi).


Using assignment variables zi

DPMM view: θi = θ*_k with probability π_k(τ), and yi | θi ∼ F(yi | θi).
Mixture/clustering view: set zi = z(θi) = k when θi = θ*_k; then p(zi = k | τ) = π_k(τ) and yi | zi, θ* ∼ F(yi | θ*_{zi}).
(In both views, θ*_k | G0 ∼ G0, τ_k | α ∼ B(1, α) and π_k(τ) = τ_k ∏_{l<k} (1 − τ_l).)



Stick-breaking of the DPMM:
p(zi = k | τ) = π_k(τ)
yi | zi, θ* ∼ F(yi | θ*_{zi})

Stick-breaking of the DP-Potts model (same θ*_k, τ_k and π_k(τ) as before), with z = {z1, ..., zN}:
p(z | τ, β) ∝ ∏_i π_{zi}(τ) exp( β Σ_{i∼j} δ_{zi=zj} )
yi | zi, θ* ∼ F(yi | θ*_{zi})

NB: well defined for every stick-breaking construction (Σ_{k=1}^∞ π_k(τ) = 1), e.g. Pitman-Yor: τ_k | α, σ ∼ B(1 − σ, α + kσ).
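As an example of such a generalization, a hedged sketch of Pitman-Yor stick-breaking weights, swapping the Beta draws in the DP sampler shown earlier (σ ∈ [0, 1) and α > −σ are assumed):

```python
import numpy as np

def pitman_yor_weights(alpha, sigma, K=100, rng=None):
    """Truncated stick-breaking weights with tau_k ~ B(1 - sigma, alpha + k*sigma)."""
    rng = np.random.default_rng() if rng is None else rng
    k = np.arange(1, K + 1)
    tau = rng.beta(1.0 - sigma, alpha + k * sigma)   # per-stick Beta parameters
    return tau * np.cumprod(np.concatenate(([1.0], 1.0 - tau[:-1])))
```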

Inference using variational approximation

Clustering/segmentation task: estimate Z while the parameters Θ are unknown, e.g. Θ = {τ, α, θ*}.

Bayesian setting

The posterior p(Z, Θ | y, φ) is intractable; approximate it as q(z, Θ) = q_z(z) q_θ(Θ).

Variational Expectation-Maximization

Alternate maximization in q_z and q_θ (φ denotes the hyperparameters) of the free energy:

F(q_z, q_θ, φ) = E_{q_z q_θ}[ log( p(y, Z, Θ | φ) / (q_z(z) q_θ(Θ)) ) ]
= log p(y | φ) − KL( q_z q_θ ‖ p(Z, Θ | y, φ) )

DP-Potts variational EM procedure

Joint DP-Potts (Gaussian) Mixture distribution

p(y, z, τ, α, θ* | φ) = ∏_{j=1}^N p(yj | zj, θ*) p(z | τ, β) ∏_{k=1}^∞ p(τ_k | α) ∏_{k=1}^∞ p(θ*_k | ρ_k) p(α | s1, s2)

where:
p(yj | zj, θ*) = N(yj | μ_{zj}, Σ_{zj}) is Gaussian
p(z | τ, β) is the DP-Potts model
p(τ_k | α) is Beta B(1, α)
p(θ*_k | ρ_k) = NIW(μ_k, Σ_k | m_k, λ_k, Ψ_k, ν_k) is normal-inverse-Wishart
p(α | s1, s2) = G(α | s1, s2) is Gamma

Usual truncated variational posterior, q_{τ_k} = δ_1 for k ≥ K (e.g. K = 40):

q(z, Θ) = ∏_{j=1}^N q_{z_j}(zj) q_α(α) ∏_{k=1}^{K−1} q_{τ_k}(τ_k) ∏_{k=1}^K q_{θ*_k}(μ_k, Σ_k)

E-steps: VE-Z, VE-α, VE-τ and VE-θ*. M-step: updating φ is straightforward except for β.
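To make the alternation concrete, a self-contained toy version: truncated VB for a 1-D Gaussian DP mixture with β = 0 (no Potts term), fixed α, known variance and a conjugate normal prior on the means, i.e. a deliberate simplification of the NIW/Gamma model above; all defaults are illustrative assumptions:

```python
import numpy as np
from scipy.special import digamma

def vb_dpmm_1d(y, K=40, alpha=1.0, sigma2=1.0, m0=0.0, lam0=1e-2, n_iter=100, seed=0):
    """Truncated variational Bayes for a 1-D Gaussian DP mixture (no spatial term)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    r = rng.dirichlet(np.ones(K), size=len(y))      # responsibilities q_{z_j}(k)
    for _ in range(n_iter):
        nk = r.sum(axis=0)                          # expected component counts
        # VE-tau: q(tau_k) = Beta(g1_k, g2_k)
        tail = np.concatenate((np.cumsum(nk[::-1])[::-1][1:], [0.0]))  # sum_{l>k} n_l
        g1, g2 = 1.0 + nk, alpha + tail
        Elog_tau = digamma(g1) - digamma(g1 + g2)
        Elog_1mtau = digamma(g2) - digamma(g1 + g2)
        Elog_pi = Elog_tau + np.concatenate(([0.0], np.cumsum(Elog_1mtau[:-1])))
        # VE-theta*: q(mu_k) = N(mk, sk2) under the prior mu_k ~ N(m0, 1/lam0)
        lam = lam0 + nk / sigma2
        mk = (lam0 * m0 + r.T @ y / sigma2) / lam
        sk2 = 1.0 / lam
        # VE-Z: categorical responsibilities from the expected log joint
        logr = Elog_pi - ((y[:, None] - mk) ** 2 + sk2) / (2.0 * sigma2)
        logr -= logr.max(axis=1, keepdims=True)     # stabilize before exponentiating
        r = np.exp(logr)
        r /= r.sum(axis=1, keepdims=True)
    return r, mk
```

In the full model, VE-θ* would be a normal-inverse-Wishart update, VE-α a Gamma update, and VE-Z would carry the extra β neighbour term detailed in the appendix.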

Some image segmentation results

Model validation and verification:

[Figure: original image (left) and segmentation by DP-Potts (right)]

Segmented image using DP-Potts model with β = 2.5.


Convergence of the VB algorithm initialized by the k-means++ clustering:

Segmentation results for the Berkeley Segmentation Dataset:

[Figure: original image and DP-Potts segmentations with K = 40 and β = 0, 2, 10]

Segmentation results obtained with the DP-Potts model for β = 0, 2 and 10.

Segmentation with estimated β = 1.66

[Figure: original image (left) and segmentation by DP-Potts (right)]

Quantitative evaluation of the segmentations

Probabilistic Rand Index (PRI) on 154 color (RGB) images from the Berkeley dataset, each with several manual ground-truth segmentations (images reduced to 1000 superpixels). Caveat: manual ground-truth segmentations are subjective!

PRI results with the DP-Potts model:

K      Mean    Median   St.D.
10     71.48   72.54    0.1040
20     73.64   73.42    0.0935
40     75.33   76.47    0.0853
50     75.81   76.31    0.0873
60     76.55   77.12    0.0848
80     77.06   78.30    0.0835

(For comparison: PRI results from Chatzis 2013.)

Computation time: a Berkeley 321×481 image reduced to 1000 superpixels takes 10-30 s on a PC with an Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz and 8GB RAM.
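For reference, a small sketch of the (unadjusted) Rand index between two label maps; the PRI reported above averages this score over the several manual ground truths available per image (flattened arrays and non-negative integer labels are assumptions of the sketch):

```python
import numpy as np

def rand_index(a, b):
    """Fraction of point pairs on which two labelings agree (same/different cluster)."""
    a = np.asarray(a).ravel().astype(int)
    b = np.asarray(b).ravel().astype(int)
    n = a.size
    cont = np.zeros((a.max() + 1, b.max() + 1))   # contingency table of label co-occurrences
    np.add.at(cont, (a, b), 1.0)
    pairs = n * (n - 1) / 2.0
    sum_sq = (cont ** 2).sum()
    sa = (cont.sum(axis=1) ** 2).sum()            # marginal pair mass of labeling a
    sb = (cont.sum(axis=0) ** 2).sum()            # marginal pair mass of labeling b
    return (pairs + sum_sq - 0.5 * (sa + sb)) / pairs
```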

H. Lü et al. JSM 2018 July 2018 23 / 35 How does β influence the number of components? Extend the model with other priors (Pitman-Yor process, Gibbs-type priors, etc.). Try other variational approximations (truncation-free) Investigate theoretical properties of BNP priors under structural con- straints (time, spatial) ...... for other applications, such as discovery , etc.

Conclusion and future work

- A general DP-Potts model and the associated VB algorithm were built.
- The DP-Potts model was applied to image segmentation and tested on different types of datasets.
- The impact of the interaction parameter β on the final results is significant.
- An estimation procedure was proposed for β.


- How does β influence the number of components?
- Extend the model with other priors (Pitman-Yor process, Gibbs-type priors, etc.).
- Try other variational approximations (truncation-free).
- Investigate theoretical properties of BNP priors under structural constraints (time, spatial)...
- ... for other applications, such as discovery probabilities, etc.

References

1. M. Albughdadi, L. Chaari, J.-Y. Tourneret, F. Forbes, and P. Ciuciu. A Bayesian non-parametric hidden Markov random model for hemodynamic brain parcellation. Signal Processing, 135:132-146, 2017.
2. S. P. Chatzis. A Markov random field-regulated Pitman-Yor process prior for spatially constrained data clustering. Pattern Recognition, 46(6):1595-1603, 2013.
3. S. P. Chatzis and G. Tsechpenakis. The infinite hidden Markov random field model. IEEE Trans. Neural Networks, 21(6):1004-1014, 2010.
4. F. Forbes and N. Peyrard. Hidden Markov random field model selection criteria based on mean field-like approximations. IEEE PAMI, 25(9):1089-1101, 2003.
5. P. Orbanz and J. M. Buhmann. Nonparametric Bayesian image segmentation. International Journal of Computer Vision, 77(1-3):25-45, 2008.
6. J. Sodjo, A. Giremus, N. Dobigeon, and J.-F. Giovannelli. A generalized Swendsen-Wang algorithm for Bayesian nonparametric joint segmentation of multiple images. ICASSP, 2017.
7. R. Xu, F. Caron, and A. Doucet. Bayesian nonparametric image segmentation using a generalized Swendsen-Wang algorithm. ArXiv e-prints, February 2016.

Thank you for your attention!

contact: fl[email protected]

Job opportunity

Université Grenoble Alpes invites applications for a 2-year junior research chair (post-doc) in Data Science for Life Sciences and Health

Starting in October 2018.
Data science methodology and applications to Life Sciences and Health.
Application deadline: August 31, 2018.
Website: https://data-institute.univ-grenoble-alpes.fr/
Contact: fl[email protected]

Stick-breaking construction


DP simulations with G0 a standard normal distribution N(0, 1) and α = 1, 10, using the stick-breaking representation.

Variational EM

General formulation, at iteration (r)

E-Z: q_z^(r)(z) ∝ exp( E_{q_θ^(r−1)}[ log p(y, z, Θ | φ^(r−1)) ] )
E-Θ: q_θ^(r)(Θ) ∝ exp( E_{q_z^(r)}[ log p(y, Z, Θ | φ^(r−1)) ] )
M-φ: φ^(r) = arg max_φ E_{q_z^(r) q_θ^(r)}[ log p(y, Z, Θ | φ) ]

VE-Z, VE-α, VE-τ , and VE-θ∗

e.g. the VE-Z step divides into N VE-Z_j steps (q_{z_j}(zj) = 0 for zj > K):

q_{z_j}(zj) ∝ exp( E_{q_θ*}[ log p(yj | θ*_{zj}) ] + E_{q_τ}[ log π_{zj}(τ) ] + β Σ_{i∼j} q_{z_i}(zj) )
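A sketch of this update for an (H, W, K) responsibility array on a 4-neighbour grid; Elog_lik[j, k] stands for E_{q_θ*}[log p(yj | θ*_k)] and Elog_pi[k] for E_{q_τ}[log π_k(τ)], both assumed computed by the other VE steps, and a single parallel sweep is shown where the slides update site by site:

```python
import numpy as np

def ve_z_sweep(qz, Elog_lik, Elog_pi, beta):
    """One parallel sweep of the VE-Z updates; qz and Elog_lik have shape (H, W, K)."""
    nb = np.zeros_like(qz)                   # sum of neighbour responsibilities q_{z_i}(k)
    nb[1:] += qz[:-1]
    nb[:-1] += qz[1:]
    nb[:, 1:] += qz[:, :-1]
    nb[:, :-1] += qz[:, 1:]
    logq = Elog_lik + Elog_pi[None, None, :] + beta * nb
    logq -= logq.max(axis=2, keepdims=True)  # stabilize before exponentiating
    q = np.exp(logq)
    return q / q.sum(axis=2, keepdims=True)
```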

Estimation of β

M-β step: involves p(z | τ, β) = K(β, τ)^{−1} exp( V(z; τ, β) )
with V(z; τ, β) = Σ_i log π_{zi}(τ) + β Σ_{i∼j} δ(zi = zj)

β̂ = arg max_β E_{q_z q_τ}[ log p(z | τ; β) ]
  = arg max_β E_{q_z q_τ}[ V(z; τ, β) ] − E_{q_τ}[ log K(β, τ) ]

Two difficulties:
(1) p(z | τ, β) is intractable (normalizing constant K(β, τ), typical of MRFs)
(2) it depends on τ (typical of the DP)

Two approximations:
(1) a "standard" mean-field-like approximation (Forbes & Peyrard 2003)
(2) replace the random τ by a fixed τ̃ = E_{q_τ}[τ]

Approximation of p(z | τ, β)

p(z | τ, β) ≈ q̃_z(z | β) = ∏_{j=1}^N q̃_{z_j}(zj | β)

q̃_{z_j}(zj = k | β) = exp( log π_k(τ̃) + β Σ_{i∈N(j)} q_{z_i}(k) ) / Σ_{l=1}^∞ exp( log π_l(τ̃) + β Σ_{i∈N(j)} q_{z_i}(l) ), with τ̃ = E_{q_τ}[τ]

β is estimated at each iteration by setting the approximate gradient to 0

E_{q_z q_τ}[ ∇_β V(z; τ, β) ] = Σ_{k=1}^K Σ_{i∼j} q_{z_j}(k) q_{z_i}(k)

∇_β E_{q_τ}[ log K(β, τ) ] = E_{p(z|τ,β) q_τ}[ ∇_β V(z; τ, β) ] ≈ Σ_{k=1}^K Σ_{i∼j} q̃_{z_j}(k | β) q̃_{z_i}(k | β)
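Putting the two displays together, β̂ solves the equation "first sum minus second sum = 0"; a sketch using the one-shot mean-field q̃(· | β) from the previous slide (truncated to K components) and a bracketed root search; the bracket [0, beta_max] is an assumption and brentq needs a sign change on it:

```python
import numpy as np
from scipy.optimize import brentq

def neighbour_sum(q):
    """Per-site sum of neighbour responsibilities on a 4-connected grid, q: (H, W, K)."""
    nb = np.zeros_like(q)
    nb[1:] += q[:-1]; nb[:-1] += q[1:]
    nb[:, 1:] += q[:, :-1]; nb[:, :-1] += q[:, 1:]
    return nb

def agreement(q):
    """sum_k sum_{i~j} q_j(k) q_i(k) over unordered neighbour pairs."""
    return (q[:-1] * q[1:]).sum() + (q[:, :-1] * q[:, 1:]).sum()

def update_beta(qz, log_pi_tilde, beta_max=10.0):
    """Set the approximate free-energy gradient in beta to zero."""
    nb = neighbour_sum(qz)
    target = agreement(qz)                               # E_{qz qtau}[grad_beta V]
    def grad(beta):
        logq = log_pi_tilde[None, None, :] + beta * nb   # one-shot mean-field q~(.|beta)
        logq -= logq.max(axis=2, keepdims=True)
        qt = np.exp(logq); qt /= qt.sum(axis=2, keepdims=True)
        return target - agreement(qt)                    # approximate gradient in beta
    return brentq(grad, 0.0, beta_max)
```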

Segmentation results for medical images (all hyperparameters fixed):

[Figure: original medical image and DP-Potts segmentations with K = 40 and β = 0, 2, 10]

Segmentation results obtained with the DP-Potts model for β = 0, 2 and 10.

Segmentation with estimated hyperparameters (β = 0.75)

[Figure: original image (left) and segmentation by DP-Potts (right)]

Segmentation with estimated β = 0.96 (pixels with partial volume)

[Figure: original image (left) and segmentation by DP-Potts (right)]

Segmentation results for SAR images:

[Figure: two original SAR images and DP-Potts segmentations with K = 40 and β = 0, 2, 10]

Segmentation results obtained with the DP-Potts model for β = 0, 2 and 10.


Segmentation results with estimated β

[Figure: two original images and their DP-Potts segmentations, with estimated β = 1.11 and β = 1.02]
