Non parametric Bayesian priors for hidden Markov random fields Florence Forbes, Hongliang Lu, Julyan Arbel
To cite this version:
Florence Forbes, Hongliang Lu, Julyan Arbel. Non parametric Bayesian priors for hidden Markov random fields. JSM 2018 - Joint Statistical Meeting, Jul 2018, Vancouver, Canada. pp. 1-38. hal-01941679

HAL Id: hal-01941679
https://hal.archives-ouvertes.fr/hal-01941679
Submitted on 12 Dec 2018
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Bayesian Nonparametric Priors for Hidden Markov Random Fields: Application to Image Segmentation
Hongliang Lü, Julyan Arbel, Florence Forbes
Inria Grenoble Rhône-Alpes and University Grenoble Alpes
Laboratoire Jean Kuntzmann, Mistis team
fl[email protected]
July 2018
H. Lü et al. JSM 2018 July 2018 1 / 35

Unsupervised image segmentation
Challenges for mixture models (clustering):
- inhomogeneities, noise
- how many segments? [Figure: a T1-gado brain image segmented into 2 classes and 4 classes]

Extensions of the Dirichlet process mixture model with spatial regularization.
Outline of the talk
1. Dirichlet process (DP)
2. Spatially-constrained mixture model: DP-Potts mixture model
   - Finite mixture model
   - Bayesian finite mixture model
   - DP mixture model
   - DP-Potts mixture model
3. Inference using variational approximation
4. Some image segmentation results
5. Conclusion and future work
Dirichlet process (DP)
The DP is a central Bayesian nonparametric (BNP) prior^1.
Definition (Dirichlet process). A Dirichlet process on a space Y is a random probability measure G such that there exist a concentration parameter α > 0 and a base distribution G0 for which, for any finite partition {A1, ..., Ap} of Y, the random vector (G(A1), ..., G(Ap)) is Dirichlet distributed:

(G(A1), ..., G(Ap)) ∼ Dir(αG0(A1), ..., αG0(Ap))
Notation: G ∼ DP(α, G0)
The DP is the infinite-dimensional generalization of the Dirichlet distribution.
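A draw from a DP can be simulated numerically; the sketch below previews the stick-breaking construction detailed on the following slides (a minimal illustration assuming NumPy; the function name and the demo values α = 2 and a truncation at 1000 sticks are our own choices, not from the talk):

```python
import numpy as np

def stick_breaking_weights(alpha, num_sticks, rng):
    """Truncated stick-breaking: tau_k ~ Beta(1, alpha) and
    pi_k = tau_k * prod_{l<k} (1 - tau_l)."""
    tau = rng.beta(1.0, alpha, size=num_sticks)
    # length of stick remaining before break k (empty product = 1 for k = 1)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - tau[:-1])))
    return tau * remaining

rng = np.random.default_rng(0)
pi = stick_breaking_weights(alpha=2.0, num_sticks=1000, rng=rng)
# a draw of G is then sum_k pi[k] * (point mass at theta_k ~ G0);
# with 1000 sticks the truncated weights sum to 1 up to negligible mass
```

Larger α spreads mass over more sticks; small α concentrates it on the first few atoms, which is what makes realizations of G almost surely discrete.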
^1 Ferguson, T. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2):209-230.

Dirichlet process (DP) construction

A DP prior G can be constructed using three methods:
- the Blackwell-MacQueen urn scheme,
- the Chinese Restaurant Process,
- the stick-breaking construction.

The DP has almost surely discrete realizations^2:

G = ∑_{k=1}^∞ π_k δ_{θ*_k}

where θ*_k iid∼ G0 and π_k = π̃_k ∏_{l<k} (1 − π̃_l), with π̃_k iid∼ B(1, α).

^2 Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4:639-650.

Spatially-constrained mixture model: DP-Potts mixture model

Finite mixture model

Clustering/segmentation: finite mixture models assume the data y = (y_1, ..., y_N), with y_i = (y_{i1}, ..., y_{iD}) ∈ R^D, are generated i.i.d. from a finite sum of probability distributions:

p(y_i | θ*, π) = ∑_{k=1}^K π_k F(y_i | θ*_k)

where θ* = (θ*_1, ..., θ*_K) are the class parameters and π = (π_1, ..., π_K) the mixture weights, with ∑_{k=1}^K π_k = 1. Both θ* and π can be estimated with the EM algorithm. Equivalently,

G = ∑_{k=1}^K π_k δ_{θ*_k} (non random), θ_i ∼ G and y_i | θ_i ∼ F(y_i | θ_i).

Limitation: the number of components K must be specified beforehand.
Solution: assume an infinite number of components using BNP priors.

Bayesian finite mixture model

In a Bayesian setting, a prior distribution is placed over θ* and π.
Thus, the posterior distribution of the parameters given the observations is

p(θ*, π | y) ∝ p(y | θ*, π) p(θ*, π).

To generate a data point within a Bayesian finite mixture model:
- θ*_k ∼ G0
- (π_1, ..., π_K) ∼ Dir(α/K, ..., α/K)
- G = ∑_{k=1}^K π_k δ_{θ*_k} is a random measure
- θ_i | G ∼ G, which means θ_i = θ*_k with probability π_k
- y_i | θ_i ∼ F(y_i | θ_i)

Limitation: the number of components K must be specified beforehand.
Solution: assume an infinite number of components using BNP priors.

DP mixture model

From a Bayesian finite mixture model to a DP mixture model: to obtain a DP mixture model, let G be a DP prior (K → ∞), namely G ∼ DP(α, G0), and complement it with a likelihood associated to each θ_i. To generate a data point within a DP mixture model:
- G ∼ DP(α, G0)
- θ_i | G ∼ G
- y_i | θ_i ∼ F(y_i | θ_i)

2D point clustering (unsupervised learning) based on the DP mixture model: let the data speak for themselves!
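The three generative steps of the DP mixture model can be made concrete by sampling via the Chinese Restaurant Process, which marginalizes out G (a minimal 1-D Gaussian sketch; the base measure N(0, 3²), the noise scale and the function name are illustrative choices, not values from the talk):

```python
import numpy as np

def sample_dp_mixture(n, alpha, rng, obs_sd=0.5):
    """Generate n points from a DP mixture of 1-D Gaussians via the
    Chinese Restaurant Process: point i joins an existing cluster k
    with probability proportional to its size, or opens a new cluster
    (with a fresh atom from G0) with probability proportional to alpha."""
    labels, atoms, counts = [], [], []
    for _ in range(n):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(atoms):              # open a new cluster
            atoms.append(rng.normal(0.0, 3.0))   # theta* ~ G0 = N(0, 3^2)
            counts.append(0)
        counts[k] += 1
        labels.append(k)
    means = np.array(atoms)
    y = means[np.array(labels)] + rng.normal(0.0, obs_sd, size=n)
    return y, np.array(labels)
```

The number of occupied clusters is random and grows roughly like α log n, which is exactly the "let the data speak for themselves" behaviour illustrated on the slide.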
Application to image segmentation: [Figure: original image and its segmentation by the DP mixture model]

Drawback: spatial constraints and dependencies are not considered.
Solution: combine the DP prior with a hidden Markov random field (HMRF).

DP-Potts mixture model

To solve this issue, we introduce a spatial Potts model component:

M(θ) ∝ exp(β ∑_{i∼j} δ_{z(θ_i) = z(θ_j)})

with θ = (θ_1, ..., θ_N) and β the interaction parameter. The DP mixture model is thus extended:
- G ∼ DP(α, G0)
- θ | M, G ∼ M(θ) × ∏_i G(θ_i)
- y_i | θ_i ∼ F(y_i | θ_i)

Other spatially-constrained BNP mixture models and inference algorithms:
- DP- or PYP-Potts partition model + MCMC^3
- hemodynamic brain parcellation (DP-Potts) + partial VB^4
- DP- or PYP-Potts + Iterated Conditional Modes (ICM)^5

Markov chain Monte Carlo (MCMC): asymptotically exact, but computationally expensive.
Variational Bayes (VB): much faster, but less accurate and without theoretical guarantees.

We propose a DP-Potts mixture model based on a general stick-breaking construction that allows a natural full VB algorithm, enabling scalable inference for large datasets and straightforward generalization to other priors.

^3 Orbanz & Buhmann (2008); Xu, Caron & Doucet (2016); Sodjo, Giremus, Dobigeon & Giovannelli (2017)
^4 Albughdadi, Chaari, Tourneret, Forbes, Ciuciu (2017)
^5 Chatzis & Tsechpenakis (2010); Chatzis (2013)
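The Potts interaction term introduced above rewards neighbouring sites that share a label; on a 4-neighbour pixel grid its log-potential can be computed as follows (a minimal sketch assuming NumPy; the helper name `potts_energy` is ours):

```python
import numpy as np

def potts_energy(z, beta):
    """Unnormalised log-potential beta * sum_{i~j} 1[z_i = z_j]
    for a label image z on a 4-neighbour grid; right and down
    pairs are enumerated so each edge is counted exactly once."""
    same_h = (z[:, 1:] == z[:, :-1]).sum()   # horizontal neighbour pairs
    same_v = (z[1:, :] == z[:-1, :]).sum()   # vertical neighbour pairs
    return beta * float(same_h + same_v)

z = np.array([[0, 0],
              [0, 1]])
# edges: (0,0)-(0,1) same, (1,0)-(1,1) differ, (0,0)-(1,0) same, (0,1)-(1,1) differ
print(potts_energy(z, beta=2.0))  # -> 4.0
```

A larger β therefore gives exponentially more weight to smooth label maps, which is the spatial regularization the DP mixture alone lacks.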
DP-Potts: stick-breaking construction

Stick-breaking construction of the DP, G ∼ DP(α, G0):
- θ*_k | G0 ∼ G0
- τ_k | α ∼ B(1, α), k = 1, 2, ...
- π_k(τ) = τ_k ∏_{l=1}^{k−1} (1 − τ_l), k = 1, 2, ...
- G = ∑_{k=1}^∞ π_k(τ) δ_{θ*_k}

Adding θ_i | G ∼ G and y_i | θ_i ∼ F(y_i | θ_i) yields the Dirichlet Process Mixture Model (DPMM). Since G is discrete, θ_i | G ∼ G means θ_i = θ*_k with probability π_k(τ).

Using assignment variables z_i, with z_i = z(θ_i) = k when θ_i = θ*_k, the DPMM can be written in a mixture/clustering view:
- θ*_k | G0 ∼ G0
- τ_k | α ∼ B(1, α)
- π_k(τ) = τ_k ∏_{l=1}^{k−1} (1 − τ_l)
- p(z_i = k | τ) = π_k(τ)
- y_i | z_i, θ* ∼ F(y_i | θ*_{z_i})

Stick-breaking of DP-Potts: replace the independent assignment prior by

p(z | τ, β) ∝ ∏_i π_{z_i}(τ) exp(β ∑_{i∼j} δ_{z_i = z_j})

with z = {z_1, ..., z_N}, keeping y_i | z_i, θ* ∼ F(y_i | θ*_{z_i}).

NB: this is well defined for every stick-breaking construction (∑_{k=1}^∞ π_k = 1), e.g. Pitman-Yor: τ_k | α, σ ∼ B(1 − σ, α + kσ).

Inference using variational approximation

Clustering/segmentation task: estimate Z while the parameters Θ are unknown, e.g. Θ = {τ, α, θ*}. In the Bayesian setting, the intractable posterior p(Z, Θ | y, φ) is approximated as q(z, Θ) = q_z(z) q_θ(Θ).

Variational Expectation-Maximization: alternate maximization in q_z and q_θ (φ are hyperparameters) of the free energy

F(q_z, q_θ, φ) = E_{q_z q_θ}[log (p(y, Z, Θ | φ) / (q_z(z) q_θ(Θ)))] = log p(y | φ) − KL(q_z q_θ, p(Z, Θ | y, φ)).

DP-Potts variational EM procedure

Joint DP-Potts (Gaussian) mixture distribution:

p(y, z, τ, α, θ* | φ) = ∏_{j=1}^N p(y_j | z_j, θ*) · p(z | τ, β) · ∏_{k=1}^∞ p(τ_k | α) · ∏_{k=1}^∞ p(θ*_k | ρ_k) · p(α | s1, s2)

where
- p(y_j | z_j, θ*) = N(y_j | μ_{z_j}, Σ_{z_j}) is Gaussian,
- p(z | τ, β) is the DP-Potts model,
- p(τ_k | α) is Beta B(1, α),
- p(θ*_k | ρ_k) = NIW(μ_k, Σ_k | m_k, λ_k, Ψ_k, ν_k) is Normal-inverse-Wishart,
- p(α | s1, s2) = G(α | s1, s2) is Gamma.

Usual truncated variational posterior, with q_{τ_k} = δ_1 for k ≥ K (e.g. K = 40):

q(z, Θ) = ∏_{j=1}^N q_{z_j}(z_j) · q_α(α) · ∏_{k=1}^{K−1} q_{τ_k}(τ_k) · ∏_{k=1}^K q_{θ*_k}(μ_k, Σ_k)

E-steps: VE-Z, VE-α, VE-τ and VE-θ*. M-step: updating φ is straightforward except for β.

Some image segmentation results

Model validation and verification: [Figure: original image and its segmentation by the DP-Potts model with β = 2.5]

Convergence of the VB algorithm initialized by k-means++ clustering:
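As noted earlier, the construction stays well defined for any stick-breaking prior; changing the Beta parameters of the sticks, for instance, turns DP weights into Pitman-Yor weights (a minimal sketch assuming NumPy; the function name is ours, and σ = 0 recovers the DP):

```python
import numpy as np

def general_stick_weights(alpha, sigma, num_sticks, rng):
    """tau_k ~ Beta(1 - sigma, alpha + k*sigma): Pitman-Yor sticks,
    reducing to the DP sticks Beta(1, alpha) when sigma = 0."""
    k = np.arange(1, num_sticks + 1)
    tau = rng.beta(1.0 - sigma, alpha + k * sigma)  # broadcast over k
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - tau[:-1])))
    return tau * remaining
```

With σ > 0 the weights decay polynomially rather than geometrically, a heavier-tailed prior on cluster sizes that motivates the Pitman-Yor extension listed in the future work.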
Segmentation results for the Berkeley Segmentation Dataset: [Figure: original image and segmentations by the DP-Potts model with K = 40 and β = 0, 2 and 10]

Segmentation with estimated β = 1.66: [Figure: original image and its segmentation by DP-Potts]

Quantitative evaluation of the segmentations

Probabilistic Rand Index (PRI) on 154 color (RGB) images from the Berkeley dataset, each with several manual ground-truth segmentations and reduced to 1000 superpixels. But manual ground-truth segmentations are subjective!

PRI results with the DP-Potts model:

       Mean    Median  St.D.
K=10   71.48   72.54   0.1040
K=20   73.64   73.42   0.0935
K=40   75.33   76.47   0.0853
K=50   75.81   76.31   0.0873
K=60   76.55   77.12   0.0848
K=80   77.06   78.30   0.0835

PRI results from Chatzis (2013) are shown for comparison.

Computation time: a Berkeley 321x481 image reduced to 1000 superpixels takes 10-30 s on a PC with an Intel(R) Core(TM) i7-5500U CPU at 2.40 GHz and 8 GB RAM.
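The evaluation metric can be illustrated with the plain (non-probabilistic) Rand index between two label vectors; the Probabilistic Rand Index then averages this pairwise agreement over the available manual ground truths (a minimal O(N²) sketch, adequate for superpixel-sized inputs; the function name is ours):

```python
import numpy as np
from itertools import combinations

def rand_index(a, b):
    """Fraction of point pairs on which two segmentations agree:
    a pair counts if both labelings put it in the same cluster,
    or both put it in different clusters."""
    a = np.asarray(a).ravel()
    b = np.asarray(b).ravel()
    pairs = list(combinations(range(a.size), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

# identical partitions up to relabeling score 1.0
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # -> 1.0
```

Because the index only looks at co-assignment of pairs, it is invariant to label permutations, which is why it suits unsupervised segmentations whose cluster labels carry no meaning.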
Conclusion and future work

- A general DP-Potts model and the associated VB algorithm were built.
- The DP-Potts model was applied to image segmentation and tested on different types of datasets.
- The impact of the interaction parameter β on the final results is significant.
- An estimation procedure was proposed for β.

Future work:
- How does β influence the number of components?
- Extend the model with other priors (Pitman-Yor process, Gibbs-type priors, etc.).
- Try other variational approximations (truncation-free).
- Investigate theoretical properties of BNP priors under structural constraints (time, spatial), for other applications such as discovery probabilities.

References

1. M. Albughdadi, L. Chaari, J.-Y. Tourneret, F. Forbes, and P. Ciuciu. A Bayesian non-parametric hidden Markov random model for hemodynamic brain parcellation. Signal Processing, 135:132-146, 2017.
2. S. P. Chatzis. A Markov random field-regulated Pitman-Yor process prior for spatially constrained data clustering. Pattern Recognition, 46(6):1595-1603, 2013.
3. S. P. Chatzis and G. Tsechpenakis. The infinite hidden Markov random field model. IEEE Trans. Neural Networks, 21(6):1004-1014, 2010.
4. F. Forbes and N. Peyrard. Hidden Markov random field model selection criteria based on mean field-like approximations. IEEE PAMI, 25(9):1089-1101, 2003.
5. P. Orbanz and J. M. Buhmann. Nonparametric Bayesian image segmentation. International Journal of Computer Vision, 77(1-3):25-45, 2008.
6. J. Sodjo, A. Giremus, N. Dobigeon, and J.-F. Giovannelli. A generalized Swendsen-Wang algorithm for Bayesian nonparametric joint segmentation of multiple images. ICASSP, 2017.
7. R. Xu, F. Caron, and A. Doucet. Bayesian nonparametric image segmentation using a generalized Swendsen-Wang algorithm. ArXiv e-prints, February 2016.

Thank you for your attention!
contact: fl[email protected]
Job opportunity

Université Grenoble Alpes invites applications for a 2-year junior research chair (post-doc) in Data Science for Life Sciences and Health, starting in October 2018: data science methodology and machine learning applied to life sciences and health.
Application deadline: August 31, 2018
Website: https://data-institute.univ-grenoble-alpes.fr/
Contact: fl[email protected]