Hyperprior on symmetric Dirichlet distribution

Jun Lu
Computer Science, EPFL, Lausanne
[email protected]
November 5, 2018

Abstract

In this article we introduce how to put a vague hyperprior on the symmetric Dirichlet distribution prior and how to update its concentration parameter by adaptive rejection sampling (ARS). Finally, we analyze this hyperprior in an over-fitted mixture model with some synthetic experiments.

1 Introduction

It has become popular to use over-fitted mixture models in which the number of clusters K is chosen as a conservative upper bound on the number of components, under the expectation that only relatively few of the components K0 will be occupied by data points in the samples X. This kind of over-fitted mixture model has been used successfully due to its ease of computation. Rousseau & Mengersen (2011) proved that, quite generally, the posterior behaviour of over-fitted mixtures depends on the chosen prior on the weights and on the number of free parameters in the emission distributions (here D, i.e. the dimension of the data). Specifically, they proved that (a) if α = min(αk, k ≤ K) > D/2 and the number of components is larger than it should be, asymptotically two or more components in an over-fitted mixture model will tend to merge with non-negligible weights; (b) in contrast, if α = max(αk, k ≤ K) < D/2, the extra components are emptied at a rate of N^{−1/2}. Hence, if none of the components are small, it implies that K is probably not larger than K0. In the intermediate case, if min(αk, k ≤ K) ≤ D/2 ≤ max(αk, k ≤ K), the situation varies depending on the αk's and on the difference between K and K0. In particular, in the case where all αk's are equal to D/2, although the authors do not prove a definite result, they conjecture that the posterior distribution does not have a stable limit.

2 Hyperprior on symmetric Dirichlet distribution

A hyperprior on the symmetric Dirichlet distribution prior was introduced by Rasmussen (1999) and further discussed by Görür & Rasmussen (2010). We here put a vague prior of Gamma shape on the concentration parameter α and use the standard symmetric setting α_k = α = α_+/K for k = 1, ..., K.

\[
\alpha \mid a, b \sim \mathrm{Gamma}(a, b) \;\Longrightarrow\; p(\alpha \mid a, b) \propto \alpha^{a-1} e^{-b\alpha}.
\]
To obtain the conditional posterior distribution of α we would, in principle, need to condition on all the other parameters. But for a graphical model, this conditional distribution is a function only of the nodes in the Markov blanket. In our case, the Bayesian finite Gaussian mixture model is a directed acyclic graphical (DAG) model, and the Markov blanket includes the parents, the children, and the co-parents, as shown in Figure 1. From this graphical representation, we can read off the Markov blanket of each parameter in the model and then derive its conditional posterior distribution:
\[
\begin{aligned}
p(\alpha \mid \pi, a, b) &\propto p(\alpha \mid a, b)\, p(\pi \mid \alpha) \\
&\propto \alpha^{a-1} e^{-b\alpha} \,\frac{\Gamma(K\alpha)}{\prod_{k=1}^{K}\Gamma(\alpha)} \prod_{k=1}^{K} \pi_k^{\alpha-1} \\
&= \alpha^{a-1} e^{-b\alpha} (\pi_1 \cdots \pi_K)^{\alpha-1} \,\frac{\Gamma(K\alpha)}{[\Gamma(\alpha)]^{K}}.
\end{aligned}
\]
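As a concreteness check, the density above is straightforward to evaluate numerically. The following is a minimal Python sketch (not from the paper) that computes the unnormalized log conditional posterior log p(α | π, a, b); function and variable names are illustrative.

```python
# Minimal sketch: unnormalized log p(alpha | pi, a, b) from the derivation above.
import numpy as np
from scipy.special import gammaln  # log Gamma, numerically stable

def log_post_alpha(alpha, pi, a=1.0, b=1.0):
    """Unnormalized log p(alpha | pi, a, b) for a symmetric Dirichlet prior."""
    K = len(pi)
    log_gamma_prior = (a - 1.0) * np.log(alpha) - b * alpha
    log_dirichlet = (gammaln(K * alpha) - K * gammaln(alpha)
                     + (alpha - 1.0) * np.sum(np.log(pi)))
    return log_gamma_prior + log_dirichlet

# Example: evaluate on a small grid for K = 3 weights
pi = np.array([0.5, 0.3, 0.2])
print([round(log_post_alpha(x, pi), 3) for x in (0.1, 0.5, 1.0, 2.0, 5.0)])
```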

Figure 1: A Bayesian finite GMM with hyperprior on concentration parameter (nodes: α with hyperparameters a, b; weights π_k; assignments z_i; component parameters μ_k, Σ_k with hyperparameter β; observations x_i).

The following two theorems prove that the conditional posterior distribution of α is log-concave.

Theorem 2.1. Define
\[
G(x) = \frac{\Gamma(Kx)}{[\Gamma(x)]^{K}}.
\]
For x > 0 and an arbitrary positive integer K, the function G is strictly log-concave.

Proof. From Abramowitz et al. (1966) we get the multiplication formula
\[
\Gamma(Kx) = (2\pi)^{\frac{1-K}{2}} K^{Kx-\frac{1}{2}} \prod_{i=0}^{K-1} \Gamma\!\left(x + \tfrac{i}{K}\right).
\]
Then

\[
\log G(x) = \frac{1-K}{2}\log(2\pi) + \left(Kx - \frac{1}{2}\right)\log K + \sum_{i=0}^{K-1} \log\Gamma\!\left(x + \tfrac{i}{K}\right) - K\log\Gamma(x)
\]
and
\[
[\log G(x)]' = K\log K + \sum_{i=0}^{K-1} \Psi\!\left(x + \tfrac{i}{K}\right) - K\Psi(x),
\]
where Ψ(x) is the digamma function, and

\[
\Psi'(x) = \sum_{h=0}^{\infty} \frac{1}{(x+h)^2}. \tag{1}
\]
Thus
\[
[\log G(x)]'' = \sum_{i=0}^{K-1} \Psi'\!\left(x + \tfrac{i}{K}\right) - K\Psi'(x) < 0, \qquad (x > 0).
\]
The last inequality follows from (1) and concludes the theorem.
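As a quick sanity check (not part of the original paper), the strict negativity of [log G(x)]'' can be verified numerically; polygamma(1, ·) below is the trigamma function Ψ'.

```python
# Illustrative numerical check of Theorem 2.1:
# [log G(x)]'' = sum_{i=0}^{K-1} Psi'(x + i/K) - K * Psi'(x) < 0 for x > 0.
import numpy as np
from scipy.special import polygamma

def d2_log_G(x, K):
    i = np.arange(K)
    return polygamma(1, x + i / K).sum() - K * polygamma(1, x)

for K in (2, 3, 10):
    assert all(d2_log_G(x, K) < 0 for x in np.linspace(0.05, 20.0, 400))
print("[log G(x)]'' < 0 on the tested grid")
```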

This theorem is a generalization of Theorem 1 in Merkle (1997).

Theorem 2.2. When a ≥ 1, p(α | π, a, b) is log-concave.

Proof. It is easy to verify that α^{a−1} e^{−bα} (π_1 ⋯ π_K)^{α−1} is log-concave when a ≥ 1. Since Γ(Kα)/[Γ(α)]^K is log-concave by Theorem 2.1, and the product of two log-concave functions is log-concave, it follows that p(α | π, a, b) is log-concave. This concludes the proof.

From the two theorems above, the conditional posterior of α depends only on the weights of the clusters. Since p(α | π, a, b) is log-concave, we can efficiently generate independent samples from this distribution using the adaptive rejection sampling (ARS) technique (Gilks & Wild, 1992).

Although the proposed hyperprior on the Dirichlet distribution prior for mixture models is generic, we focus on its application in Gaussian mixture models for concreteness. We develop a collapsed Gibbs sampling algorithm based on Neal (2000) for posterior computation. Let X be the observations, assumed to follow a mixture of multivariate Gaussian distributions. We use a conjugate Normal-Inverse-Wishart (NIW) prior p(μ, Σ | β) for the mean vector μ and covariance matrix Σ of each multivariate Gaussian component, where β consists of all the hyperparameters of the NIW. A key quantity in a collapsed Gibbs sampler is the probability of customer i sitting at table k, p(z_i = k | z_{−i}, X, α, β), where z_{−i} denotes the seating assignments of all the other customers and α is the concentration parameter of the Dirichlet distribution. This probability is calculated as follows:

\[
\begin{aligned}
p(z_i = k \mid z_{-i}, \mathcal{X}, \alpha, \beta) &\propto p(z_i = k \mid z_{-i}, \alpha)\, p(\mathcal{X} \mid z_i = k, z_{-i}, \beta) \\
&= p(z_i = k \mid z_{-i}, \alpha)\, p(x_i \mid \mathcal{X}_{-i}, z_i = k, z_{-i}, \beta)\, p(\mathcal{X}_{-i} \mid z_i = k, z_{-i}, \beta) \\
&\propto p(z_i = k \mid z_{-i}, \alpha)\, p(x_i \mid \mathcal{X}_{-i}, z_i = k, z_{-i}, \beta) \\
&\propto p(z_i = k \mid z_{-i}, \alpha)\, p(x_i \mid \mathcal{X}_{k,-i}, \beta),
\end{aligned}
\]

where X_{k,−i} are the observations at table k excluding the i-th observation. Algorithm 1 gives the pseudo-code of the collapsed Gibbs sampler that implements the hyperprior on the Dirichlet distribution prior in Gaussian mixture models. Note that ARS may require 10-20 times the computational effort per iteration compared with sampling once from a gamma density, and mixing can be worse if we do not marginalize out π when updating α. This may have a large impact on the effective sample size (ESS) of the Markov chain. Hence, marginalizing out π and using an approximation to the conditional distribution (perhaps with a correction through an accept/reject step via the usual Metropolis-Hastings, or even just importance weighting without the accept/reject step), or simply a Metropolis-Hastings normal random walk on log(α), may be much more efficient than ARS in practice. We here only introduce the update by ARS.
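For completeness, a minimal sketch of the random-walk alternative mentioned above (not the paper's ARS update) could look like the following. It targets the same conditional p(α | π, a, b) used for ARS, i.e. without marginalizing out π; the step size and names are illustrative.

```python
# Hedged sketch: Metropolis-Hastings normal random walk on log(alpha).
import numpy as np
from scipy.special import gammaln

def log_post_alpha(alpha, pi, a=1.0, b=1.0):
    K = len(pi)
    return ((a - 1.0) * np.log(alpha) - b * alpha
            + gammaln(K * alpha) - K * gammaln(alpha)
            + (alpha - 1.0) * np.sum(np.log(pi)))

def update_alpha_mh(alpha, pi, step=0.3, rng=None):
    """One random-walk MH step on log(alpha); returns the (possibly unchanged) alpha."""
    rng = rng or np.random.default_rng()
    log_a = np.log(alpha)
    log_a_prop = log_a + step * rng.standard_normal()
    # Log target in log-alpha space includes the Jacobian term log(alpha).
    cur = log_post_alpha(alpha, pi) + log_a
    prop = log_post_alpha(np.exp(log_a_prop), pi) + log_a_prop
    return np.exp(log_a_prop) if np.log(rng.uniform()) < prop - cur else alpha
```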

input: Choose an initial z, α and β;
for T iterations do
    for i ← 1 to N do
        Remove x_i's statistics from component z_i;
        for k ← 1 to K do
            Calculate p(z_i = k | z_{−i}, α);
            Calculate p(x_i | X_{k,−i}, β);
            Calculate p(z_i = k | z_{−i}, X, α, β) ∝ p(z_i = k | z_{−i}, α) p(x_i | X_{k,−i}, β);
        end
        Sample k_new from p(z_i | z_{−i}, X, α, β) after normalizing;
        Add x_i's statistics to the component z_i = k_new;
    end
    Draw the current weight variable π = {π_1, π_2, ..., π_K};
    Update α using ARS;
end
Algorithm 1: Collapsed Gibbs sampler for a finite Gaussian mixture model with hyperprior on Dirichlet distribution
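To make the structure of the assignment sweep concrete, here is a small illustrative Python sketch of the inner loop of Algorithm 1. To keep it self-contained it uses 1-D Gaussian components with known variance and a conjugate Normal prior on each mean, which is a deliberate simplification of the paper's NIW prior; only the shape of the collapsed update (prior term times collapsed predictive) is meant to match, and all names are illustrative.

```python
# Simplified sketch of one assignment sweep (1-D, known variance, Normal prior on means).
import numpy as np

def gibbs_sweep(x, z, K, alpha, sigma2=1.0, m0=0.0, s02=10.0, rng=None):
    rng = rng or np.random.default_rng()
    N = len(x)
    counts = np.bincount(z, minlength=K).astype(float)
    sums = np.array([x[z == k].sum() for k in range(K)])
    for i in range(N):
        counts[z[i]] -= 1                     # remove x_i's statistics
        sums[z[i]] -= x[i]
        # prior term: p(z_i = k | z_-i, alpha) = (N_{k,-i} + alpha) / (N - 1 + K*alpha)
        prior = (counts + alpha) / (N - 1 + K * alpha)
        # collapsed predictive p(x_i | X_{k,-i}) under the Normal-Normal model
        post_var = 1.0 / (1.0 / s02 + counts / sigma2)
        post_mean = post_var * (m0 / s02 + sums / sigma2)
        pred = (np.exp(-0.5 * (x[i] - post_mean) ** 2 / (post_var + sigma2))
                / np.sqrt(2 * np.pi * (post_var + sigma2)))
        p = prior * pred
        z[i] = rng.choice(K, p=p / p.sum())   # sample new assignment
        counts[z[i]] += 1                     # add x_i's statistics back
        sums[z[i]] += x[i]
    return z
```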

3 Experiments

In the following experiments we evaluate the effect of a hyperprior on the symmetric Dirichlet prior in a finite Bayesian mixture model.

3.1 Synthetic simulation

The parameters of the simulation are as follows, where K0 is the true number of clusters and K denotes the number of clusters used in the test:

Sim 1: K0 = 3, with N = 300, π = {0.5, 0.3, 0.2}, μ = {−5, 0, 5} and σ = {1, 1, 1}.

In the test we put α ∼ Gamma(1, 1) as the hyperprior. Figure 2 shows the result on Sim 1 with different settings of K, and Figure 3 shows the posterior density of α in each setting of K. We find that the larger K − K0 is, the smaller the posterior mean of α. This is what we expect: the larger the overfitting, the smaller α must be to shrink the weight vector toward the edge of the probability simplex.
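For reference, the Sim 1 data described above can be generated as follows (a sketch; the seed and variable names are arbitrary): N = 300 points from a 3-component 1-D Gaussian mixture with weights (0.5, 0.3, 0.2), means (−5, 0, 5) and unit standard deviations.

```python
# Generate the Sim 1 synthetic data set.
import numpy as np

rng = np.random.default_rng(0)
N, pi0, mu0, sigma0 = 300, [0.5, 0.3, 0.2], [-5.0, 0.0, 5.0], [1.0, 1.0, 1.0]
labels = rng.choice(3, size=N, p=pi0)                      # true cluster labels
x = rng.normal(np.take(mu0, labels), np.take(sigma0, labels))
```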

[Figure 2, left panel: traceplots of the variances, weights and means omitted; see caption below.]

            NMI (SE)          VI (SE)           K̄ (SE)            ᾱ (SE)            π̄−
K = 3   0.931 (8.5e-5)    0.203 (2.5e-4)    3.0 (0.0)         1.84 (5.4e-3)     0.0
K = 4   0.869 (2.0e-4)    0.437 (7.7e-4)    3.842 (2.6e-3)    1.43 (5.2e-3)     0.08
K = 5   0.843 (2.7e-4)    0.560 (11.1e-4)   4.508 (3.2e-3)    1.01 (4.5e-3)     0.12
K = 6   0.846 (3.3e-4)    0.564 (14.9e-4)   4.703 (5.2e-3)    0.618 (3.9e-3)    0.11

Figure 2 & Table 1: Left: an example traceplot of Gibbs sampling using the hyperprior on the Dirichlet distribution for the variances, weights and means when K = 3 in Sim 1 (top: variance; middle: weights; bottom: mean). Right: summary of the posterior distribution in Sim 1. NMI is the normalized mutual information between the true clustering and the resulting clustering; VI is the variation of information between the true clustering and the resulting clustering; SE is the standard error of the mean; K̄ is the average number of occupied clusters; ᾱ is the average α during sampling; π̄− is the average extra weight.

4 Conclusion

We have proposed a new hyperprior on the symmetric Dirichlet distribution prior in the finite Bayesian mixture model. This hyperprior lets the concentration parameter of the Dirichlet prior adapt to the degree of over-fitting of the mixture model: the larger the overfitting (i.e. the larger K − K0), the smaller the concentration parameter. Although Rousseau & Mengersen (2011) proved that if α = max(αk, k ≤ K) < D/2 the extra components are emptied at a rate of N^{−1/2}, it is still risky to use such a small α in practice, since we do not know in advance how much we over-fit (i.e. how large K − K0 is). If K − K0 is small, we will get very poor mixing from MCMC. Further efforts in this direction have been made by van Havre et al. (2015). A simple hyperprior on the Dirichlet distribution, however, somewhat relieves this burden.

Figure 3: Posterior distribution of α in different overfitting settings in Sim 1 (panels: (a) K = 3, (b) K = 4, (c) K = 5, (d) K = 6).

References

Milton Abramowitz, Irene A Stegun, et al. Handbook of mathematical functions. Applied mathematics series, 55(62):39, 1966.

Walter R Gilks and Pascal Wild. Adaptive rejection sampling for Gibbs sampling. Applied Statistics, pp. 337–348, 1992.

Dilan Görür and Carl Edward Rasmussen. Dirichlet process Gaussian mixture models: Choice of the base distribution. Journal of Computer Science and Technology, 25(4):653–664, 2010.

Milan Merkle. On log-convexity of a ratio of gamma functions. Publikacije Elektrotehničkog fakulteta. Serija Matematika, pp. 114–119, 1997.

Radford M Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249–265, 2000.

Carl Edward Rasmussen. The infinite Gaussian mixture model. In Advances in Neural Information Processing Systems, volume 12, pp. 554–560, 1999.

Judith Rousseau and Kerrie Mengersen. Asymptotic behaviour of the posterior distribution in overfitted mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(5): 689–710, 2011.

Zoé van Havre, Nicole White, Judith Rousseau, and Kerrie Mengersen. Overfitting Bayesian mixture models with an unknown number of components. PloS one, 10(7):e0131739, 2015.
