Rebuilding Factorized Information Criterion: Asymptotically Accurate Marginal Likelihood

Kohei Hayashi [email protected] Global Research Center for Big Data Mathematics, National Institute of Informatics Kawarabayashi Large Graph Project, ERATO, JST

Shin-ichi Maeda [email protected] Graduate School of Informatics, Kyoto University

Ryohei Fujimaki [email protected] NEC Knowledge Discovery Research Laboratories

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

Abstract

Factorized information criterion (FIC) is a recently developed approximation technique for the marginal log-likelihood, which provides an automatic model selection framework for a few latent variable models (LVMs) with tractable inference algorithms. This paper reconsiders FIC and fills theoretical gaps of previous FIC studies. First, we reveal the core idea of FIC that allows generalization for a broader class of LVMs. Second, we investigate the model selection mechanism of the generalized FIC, providing a formal justification of FIC as a model selection criterion for LVMs and a systematic procedure for pruning redundant latent variables. Third, we uncover a few previously-unknown relationships between FIC and the variational free energy. A demonstrative study on Bayesian principal component analysis is provided, and numerical experiments support our theoretical results.

1. Introduction

The marginal log-likelihood is a key concept of Bayesian model identification of latent variable models (LVMs), such as mixture models (MMs), probabilistic principal component analysis, and hidden Markov models (HMMs). Determination of the dimensionality of latent variables is an essential task to uncover hidden structures behind the observed data as well as to mitigate overfitting. In general, LVMs are singular (i.e., the mapping between parameters and probabilistic models is not one-to-one), and classical information criteria based on the regularity assumption, such as the Bayesian information criterion (BIC) (Schwarz, 1978), are no longer justified. Since exact evaluation of the marginal log-likelihood is often not available, approximation techniques have been developed using sampling (i.e., Markov chain Monte Carlo methods (MCMCs) (Hastings, 1970)), a variational lower bound (i.e., the variational Bayes methods (VB) (Attias, 1999; Jordan et al., 1999)), or algebraic geometry (i.e., the widely applicable BIC (WBIC) (Watanabe, 2013)). However, model selection using these methods requires heavy computational cost (e.g., a large number of MCMC samples in a high-dimensional space, or an outer loop for WBIC).

In the last few years, a new approximation technique and an inference method, the factorized information criterion (FIC) and factorized asymptotic Bayesian inference (FAB), have been developed for some binary LVMs (Fujimaki & Morinaga, 2012; Fujimaki & Hayashi, 2012; Hayashi & Fujimaki, 2013; Eto et al., 2014; Liu et al., 2015). Unlike existing methods that evaluate approximated marginal log-likelihoods calculated for each latent variable dimensionality (and therefore need an outer loop for model selection), FAB finds an effective dimensionality via an EM-style alternating optimization procedure.

For example, let us consider a K-component MM for N observations X = (x_1, ..., x_N)^⊤ with one-of-K coding latent variables Z = (z_1, ..., z_N)^⊤, mixing coefficients β = (β_1, ..., β_K), and D_Ξ-dimensional component-wise parameters Ξ = {ξ_1, ..., ξ_K}. By applying Laplace's method to the marginalization of the log-likelihood, FIC of MMs (Fujimaki & Morinaga, 2012) is derived as

    FIC_MM(K) ≡ max_q E_q[ ln p(X, Z | β̂, Ξ̂, K) − Σ_k (D_{ξ_k}/2) ln(Σ_n z_nk / N) ] + H(q) − (D_Π/2) ln N,    (1)

where q is the distribution of Z, β̂ and Ξ̂ are the maximum joint-likelihood estimators (MJLEs)¹, D_Π = D_Ξ + K − 1 is the total dimension of Ξ and β, and H(q) is the entropy of q. A key characteristic of FIC can be observed in the second term of Eq. (1), which gives the penalty in terms of model complexity. As we can see, the penalty term decreases when Σ_n z_nk — the number of effective samples of the k-th component — is small, i.e., Z is degenerated. Therefore, through the optimization of q, the degenerated dimension is automatically pruned until a non-degenerated Z is found. This mechanism makes FAB a one-pass model selection algorithm and computationally more attractive than the other methods. The validity of the penalty term has been confirmed for other binary LVMs, e.g., HMMs (Fujimaki & Hayashi, 2012), latent feature models (Hayashi & Fujimaki, 2013), and mixture of experts (Eto et al., 2014).

¹Note that the MJLE is not equivalent to the maximum a posteriori estimator (MAP). The MJLE is given by argmax_Π p(X, Z | Π) while the MAP is given by argmax_Π p(X, Z | Π)p(Π).

Despite FAB's practical success compared with BIC and VB, it is unclear what conditions are actually necessary to guarantee that FAB yields the true latent variable dimensionality. In addition, the generalization of FIC for non-binary LVMs remains an important issue. In case Z takes negative and/or continuous values, Σ_n z_nk is no longer interpretable as the number of effective samples, and we lose the clue for finding the redundant dimensions of Z.

This paper proposes the generalized FIC (gFIC), given by

    gFIC(K) ≡ E_{q*}[L(Z, Π̂, K)] + H(q*),    (2)
    L(Z, Π, K) = ln p(X, Z | Π, K) − (1/2) ln |F_Ξ| − (D_Π/2) ln N.

Here, q*(Z) ≡ p(Z | X, K) is the marginal posterior and F_Ξ̂ is the Hessian matrix of −ln p(X, Z | Π, K)/N with respect to Ξ. In gFIC, the penalty term is given by the volume of the (empirical) Fisher information matrix, which naturally penalizes model complexity even for continuous latent variables. Accordingly, gFIC is applicable to a broader class of LVMs, such as Bayesian principal component analysis (BPCA) (Bishop, 1998).

Furthermore, we prove that FAB automatically prunes redundant dimensionality along with optimizing q, and that gFIC for the optimized q asymptotically converges to the marginal log-likelihood with a constant-order error under some reasonable assumptions. This justifies gFIC as a model selection criterion for LVMs, and further a natural one-pass model "pruning" procedure is derived, which was performed heuristically in previous FIC studies. We also provide an interpretation of gFIC as a variational free energy and uncover a few previously-unknown relationships between them. Finally, we demonstrate the validity of gFIC by applying it to BPCA. The experimental results agree with the theoretical properties of gFIC.

2. LVMs and Degeneration

We first define the class of LVMs we deal with in this paper. Here, we consider LVMs that have K-dimensional latent variables z_n (including the MMs in the previous section), but now z_n can take not only binary but also real values. Given X and a model family (e.g., MMs), our goal is to determine K, and we refer to this as a model. Note that we sometimes omit the notation K for the sake of brevity if it is obvious from the context.

The LVMs have D_Ξ-dimensional component-dependent parameters Ξ = {ξ_1, ..., ξ_K} and D_Θ-dimensional component-independent parameters Θ, which include hyperparameters of the prior of Z. We abbreviate them as Π = {Θ, Ξ} and assume the dimension D_Π = D_Θ + D_Ξ is finite. Then, we define the joint probability p(X, Z, Π) = p(X | Z, Π)p(Z | Π)p(Π), where ln p(X, Z | Π) is twice differentiable at Π ∈ P. Let F_Π denote the scaled negative Hessian of the joint log-likelihood,

    F_Π ≡ ( F_Θ, F_{Θ,Ξ} ; F_{Θ,Ξ}^⊤, F_Ξ ) = −(1/N) ∂²/∂Π∂Π^⊤ ln p(X, Z | Π),

with blocks corresponding to Θ and Ξ. Note that the MJLE Π̂ ≡ argmax_Π ln p(X, Z | Π) depends on Z (and X). In addition, ln p(X, Z | Π) can have multiple maximizers, and Π̂ could be a set of solutions.

Model redundancy is a notable property of LVMs. Because the latent variable Z is unobservable, the pair (Z, Π) is not necessarily determined uniquely for a given X. In other words, there could be pairs (Z, Π) and (Z̃, Π̃) whose likelihoods have the same value, i.e., p(X, Z | Π, K) = p(X, Z̃ | Π̃, K). In such a case, the model possibly has multiple maxima, and the conventional asymptotic analysis is no longer valid (Watanabe, 2009).

Previous FIC studies address this redundancy by introducing a variational representation that enables treating Z as fixed, as we explain in the next section. However, even if Z is fixed, the redundancy still remains, namely, in the case in which Z is singular or "degenerated," and there exists an equivalent likelihood with a smaller model K′ < K:

    p(X, Z | Π, K) = p(X, Z̃_{K′} | Π̃_{K′}, K′).    (3)

In this case, K is overcomplete for Z, and Z lies on a subspace of the model K. As a simple example, let us consider a three-component MM for which Z = (z, 1 − z, 0). In this case, ξ_3 is unidentifiable, because the third component is completely unused, and the K′ = 2-component MM with Z̃_2 ≡ (z, 1 − z) and Π̃_2 ≡ (Θ, (ξ_1, ξ_2)) satisfies equivalence relation (3). The notion of degeneration is defined formally as follows.

Definition 1. Given X and K, Z is degenerated if there are multiple MJLEs and any F_Π̂ of the MJLEs is not positive definite. Similarly, p(Z) is degenerated in distribution if E_p[F_Π̂] is not positive definite. Let κ(Z) ≡ rank(F_Π̂) and κ(p) ≡ rank(E_p[F_Π̂]).

The idea of degeneration is conceptually understandable as an analogue of linear algebra. Namely, each component of a model is a "basis", Z are "coordinates", and κ(Z) is the number of necessary components to represent X, i.e., the "rank" of X in terms of the model family. The degeneration of Z is then the same idea as "degeneration" in linear algebra, i.e., the number of components is too many and Π is not uniquely determined even if Z is fixed.

As discussed later, given a degenerated Z where κ(Z) = K′, finding the equivalent parameters Z̃_{K′} and Π̃_{K′} that satisfy Eq. (3) is an important task. In order to analyze this, we assume A1): for any degenerated Z under a model K ≥ 2 and K′ < K, there exists a continuous onto mapping (Z, Π) → (Z̃_{K′}, Π̃_{K′}) that satisfies Eq. (3), and Z̃_{K′} is not degenerated. Note that, if P is a subspace of R^{D_Π}, a linear projection V : R^{D_Π} → R^{D_{Π_{K′}}} satisfies A1, where V is the top-D_{Π_{K′}} eigenvectors of F_Π. This is verified easily by the fact that, by using the chain rule, F_{Π̃_{K′}} = V F_Π̂ V^⊤, which is a diagonal matrix whose elements are positive eigenvalues. Therefore, F_{Π̃_{K′}} is positive definite and Z̃_{K′} is not degenerated.

Let us further introduce a few assumptions required to show the asymptotic properties of gFIC. Suppose A2) the joint distribution is mutually independent sample-wise,

    p(X, Z | Π, K) = ∏_n p(x_n, z_n | Π, K),    (4)

and A3) ln p(Π | K) is constant, i.e., lim_{N→∞} ln p(Π | K)/N = 0. In addition, A4) p(Π | K) is continuous, not improper, and its support P is compact and the whole space. Note that for almost all Z, we expect that Π̂ ∈ P is uniquely determined and F_Π̂ is positive definite, i.e., A5) if Z is not degenerated, then ln p(X, Z | Π, K) is concave and det |F_Π̂| < ∞.

2.1. Examples of the LVM Class

The above definition covers a broad class of LVMs. Here, we show that, as examples, MMs and BPCA are included in that class. Note that A2 does not allow correlation among samples, and analysis of cases with sample correlation (e.g., time series models) remains an open problem.

MMs. In the same notation used in Section 1, the joint likelihood is given by p(X, Z | Π) = ∏_n ∏_k {β_k p_k(x_n | ξ_k)}^{z_nk}, where p_k is the density of component k. If ξ_1, ..., ξ_K have no overlap, F_Ξ is the block-diagonal matrix whose blocks are given by F_{ξ_k} = −Σ_n ∇∇ ln p_k(x_n | ξ_k) z_nk / N. This shows that the MM is degenerated when more than one column of Z is filled by zeros. For that case, removing such columns and the corresponding ξ_k suffices as Z̃_{K′} and Π̃_{K′} in A1. Note that if p_k is an exponential-family distribution exp(x_n^⊤ ξ_k − ψ(ξ_k)), then −∇∇ ln p_k(x_n | ξ_k) = ∇∇ψ(ξ_k) = C does not depend on n, and gFIC recovers the original formulation of FIC_MM, i.e., (1/2) ln |F_ξ̂_k| = (1/2) ln |C (Σ_n z_nk / N)| = (D_{ξ_k}/2) ln(Σ_n z_nk / N) + const.

BPCA. Suppose X ∈ R^{N×D} is centerized, i.e., Σ_n x_n = 0. Then, the joint likelihood of X and Z ∈ R^{N×K} is given by p(X, Z | Π) = ∏_n N(x_n | W z_n, λ^{−1} I) N(z_n | 0, I), where Ξ = W = (w_{·1}, ..., w_{·K}) is a linear basis and Θ = λ is the reciprocal of the noise variance. Note that the original study of BPCA (Bishop, 1998) introduces the additional priors p(W) = ∏_d N(w_d | 0, diag(α^{−1})) and p(λ) = Gamma(λ | a_λ, b_λ) and the hyperprior p(α) = ∏_k Gamma(α_k | a_α, b_α). In this paper, however, we do not specify explicit forms of those priors but just treat them as an O(1) term.

Since there is no second-order interaction between w_i and w_{j≠i}, the Hessian F_Ξ is block-diagonal and each block is given by (λ/N) Z^⊤Z. The penalty term is then given as

    −(1/2) ln |F_Ξ| = −(D/2)( K ln λ + ln |(1/N) Z^⊤Z| ),    (5)

and Z is degenerated if rank(Z) < K. Suppose that Z is degenerated, let K′ = rank(Z) < K, and let the SVD be Z = U diag(σ) V^⊤ = (U_{K′}, 0) diag(σ_{K′}, 0) (V_{K′}, 0)^⊤, where U_{K′} and V_{K′} are the K′ non-zero singular vectors and σ_{K′} is the K′ non-zero singular values. From the definition of F_Ξ, the projection V removes the degeneration of Z, i.e., by letting Z̃ = ZV and W̃ = WV,

    ln p(X, Z | Π, K) = −(λ/2) ‖X − ZW^⊤‖²_F − (1/2) ‖Z‖²_F + const.
                      = −(λ/2) ‖X − Z̃W̃^⊤‖²_F − (1/2) ‖Z̃‖²_F + const.
                      = ln p(X, Z̃_{K′} | λ, W̃_{K′}, K′),

where ‖A‖²_F = Σ_ij a_ij² denotes the Frobenius norm, Z̃_{K′} = U_{K′} diag(σ_{K′}), and W̃_{K′} = W V_{K′}. V transforms the K − K′ redundant components into 0-column vectors, and we can find the smaller model K′ by removing the 0-column vectors from W̃ and Z̃, which satisfies A1.
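To make the SVD-based projection above concrete, here is a minimal NumPy sketch (ours, not part of the paper; the function name prune_degenerated and the tolerance tol are illustrative choices). It applies the projection V to a rank-deficient Z and checks numerically that the quadratic terms of the joint log-likelihood in the display above are unchanged.

```python
import numpy as np

def prune_degenerated(Z, W, tol=1e-10):
    """SVD-based projection of Section 2.1: map (Z, W) to (Z V, W V) restricted to
    the non-zero singular values of Z, removing the degenerated components."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    keep = s > tol * s.max()
    return U[:, keep] * s[keep], W @ Vt[keep].T        # Z_tilde, W_tilde

# toy check: a rank-deficient Z (K' = 3 < K = 4) leaves the likelihood terms unchanged
rng = np.random.default_rng(0)
N, D, K, lam = 200, 6, 4, 2.0
Z = rng.normal(size=(N, 3)) @ rng.normal(size=(3, K))  # rank 3, i.e. degenerated
W = rng.normal(size=(D, K))
X = rng.normal(size=(N, D))
Zt, Wt = prune_degenerated(Z, W)

def quad_terms(Z_, W_):
    # (lam/2)||X - Z W^T||_F^2 + (1/2)||Z||_F^2, the Z-dependent part of -ln p(X, Z | Pi, K)
    return 0.5 * lam * np.sum((X - Z_ @ W_.T) ** 2) + 0.5 * np.sum(Z_ ** 2)

print(Zt.shape[1])                                       # 3 effective components
print(np.isclose(quad_terms(Z, W), quad_terms(Zt, Wt)))  # True: Eq. (3) holds here
```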

3. Derivation of gFIC

To obtain p(X | K), we need to marginalize out two variables: Z and Π. Let us consider the variational form for Z, written as

    ln p(X | K) = E_q[ln p(X, Z | K)] + H(q) + KL(q ‖ q*)    (6)
                = E_{q*}[ln p(X, Z | K)] + H(q*),            (7)

where KL(q ‖ p) = ∫ q(x) ln q(x)/p(x) dx is the Kullback–Leibler (KL) divergence.

Variational representation (7) allows us to consider the cases of whether Z is degenerated or not separately. In particular, when Z ∼ q*(Z) is not degenerated, then A5 guarantees that p(X, Z | K) is regular, and standard asymptotic results such as Laplace's method are applicable. In contrast, if q*(Z) is degenerated, p(X, Z | K) becomes singular and its asymptotic behavior is unclear.

In this section, we analyze the asymptotic behavior of the variational representation (7) in both cases and show that gFIC is accurate even if q*(Z) is degenerated. Our main contribution is the following theorem.²

Theorem 2. Let K′ = κ(p(Z | X, K)). Then,

    ln p(X | K) = gFIC(K′) + O(1).    (8)

²A formal proof is given in the supplemental material.

We emphasize that the above theorem holds even if the model family does not include the true distribution of X. To prove Theorem 2, we first investigate the asymptotic behavior of ln p(X | K) for the non-degenerated case.

3.1. Non-degenerated Cases

Consider the marginalization p(X, Z) = ∫ p(X, Z | Π)p(Π)dΠ. If p(Z | X) is not degenerated, then Z ∼ p(Z | X) is not degenerated with probability one. This suffices to guarantee the regularity condition (A5) and hence to justify the application of Laplace's method, which approximates p(X, Z) in an asymptotic manner (Tierney & Kadane, 1986).

Lemma 3. If Z is not degenerated,

    p(X, Z) = p(X, Z, Π̂) |F_Π̂|^{−1/2} (2π/N)^{D_Π/2} (1 + O(N^{−1})).    (9)

This result immediately yields the following relation:

    ln p(X, Z) = L(Z, Π̂, K) + O(1).    (10)

Substitution of Eq. (10) into Eq. (7) yields Eq. (8). Note that we drop the O(1) terms — ln p(Π̂) (see A3), (D_Π/2) ln 2π, and a term related to F_Θ̂ — to obtain Eq. (10). We emphasize here that the magnitude of F_Π (and of F_Θ and F_Ξ) is constant by definition. Therefore, ignoring all of the information of F_Π̂ in Eq. (9) just gives another O(1) error, and the equivalence of gFIC (8) still holds. However, F_Ξ contains important information about which components are effectively used to capture X. Therefore, we use the relation ln |F_Π| = ln |F_Ξ| + ln |F_Θ − F_{Θ,Ξ} F_Ξ^{−1} F_{Θ,Ξ}^⊤| and retain the first term in gFIC. In Section 4.3, we interpret the effect of F_Ξ in more detail.

3.2. Degenerated Cases

If p(Z | X, K) is degenerated, then the regularity condition does not hold, and we cannot use Laplace's method (Lemma 3) directly. In that case, however, A1 guarantees the existence of a variable transformation (Z, Π) → (Z̃_{K′}, Π̃_{K′}) that replaces the joint likelihood by the equivalent yet smaller "regular" model:

    p(X, Z | K) = ∫ p(X, Z | Π, K) p(Π | K) dΠ
                = ∫ p(X, Z̃_{K′} | Π̃_{K′}, K′) p̃(Π̃_{K′} | K′) dΠ̃_{K′}.    (11)

Since Z̃_{K′} is not degenerated in the model K′, we can apply Laplace's method and obtain asymptotic approximation (10) by replacing K with K′. Note that the transformed prior p̃(Π_{K′} | K′) would differ from the original prior p(Π_{K′} | K′). However, since the prior does not depend on N (A3), the difference is at most O(1), which is asymptotically ignorable.

Eq. (11) also gives us an asymptotic form of the marginal posterior.

Proposition 4. Suppose K is fixed. Then,

    p(Z | X, K) = p_K(Z)(1 + O(N^{−1})),    (12)

    p_K(Z) ≡ { p(Z, X | Π̂, K) |F_Π̂|^{−1/2} / C    if K = κ(Z),
               p_{κ(Z)}(T_{κ(Z)}(Z))               if K > κ(Z),    (13)

where T_{K′} : Z → Z̃_{K′} as in Eq. (3) and C is the normalizing constant.

The above proposition indicates that, if κ(p(Z | X, K)) = K′, p(Z | X, K) is represented by the non-degenerated distribution p(Z | X, K′). Now, we see that the joint likelihood (11) and the marginal posterior (12) depend on K′ rather than K. Therefore, putting these results into variational bound (7) leads to (8), i.e., ln p(X | K) is represented by gFIC of the "true" model K′.

Theorem 2 indicates that, if the model K is overcomplete for the true model K′, ln p(X | K) takes the same value as ln p(X | K′).

Corollary 5. For every K > K′ = κ(p(Z | X)),

    ln p(X | K) = ln p(X | K′) + O(1).    (14)

This implication is fairly intuitive in the sense that if X concentrates on a subspace of the model, then marginalization with respect to the parameters outside of that subspace contributes nothing to ln p(X | K). Corollary 5 also justifies model selection of LVMs on the basis of the marginal likelihood. According to Corollary 5, at N → ∞ redundant models always take the same value of the marginal likelihood as that of the true model, and we can safely exclude them from the model candidates.
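Lemma 3 can be checked numerically in a one-parameter example where the marginal is available in closed form. The sketch below is ours and uses a Bernoulli likelihood with a flat prior — not an LVM, but it isolates the Laplace step of Eq. (9) — and compares the exact log-marginal with the approximation.

```python
import numpy as np
from math import lgamma

def exact_log_marginal(s, N):
    # integral of theta^s (1-theta)^(N-s) under a uniform prior equals B(s+1, N-s+1)
    return lgamma(s + 1) + lgamma(N - s + 1) - lgamma(N + 2)

def laplace_log_marginal(s, N):
    # Eq. (9) with Pi = theta (D_Pi = 1) and a flat prior:
    # ln p(X | theta_hat) + (1/2) ln(2*pi/N) - (1/2) ln|F|
    theta = s / N                                   # the MJLE
    loglik = s * np.log(theta) + (N - s) * np.log(1 - theta)
    F = 1.0 / (theta * (1 - theta))                 # -(1/N) d^2/dtheta^2 log-likelihood at theta_hat
    return loglik + 0.5 * np.log(2 * np.pi / N) - 0.5 * np.log(F)

for N in (10, 100, 1000, 10000):
    s = int(0.3 * N)
    print(N, round(exact_log_marginal(s, N) - laplace_log_marginal(s, N), 5))
# the gap shrinks toward zero, consistent with the (1 + O(N^{-1})) factor in Eq. (9)
```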

4. The gFAB Inference

To evaluate gFIC (2), we need to solve several estimation problems. First, we need to estimate p(Z | X, K) to minimize the KL divergence in Eq. (6). In addition, since ln p(X | K) depends on the true model K′ (Theorem 2), we need to check whether the current model is degenerated or not, and if it is degenerated, we need to estimate K′. This is paradoxical, because we would like to determine K′ through model selection. However, by using the properties of gFIC, we can obtain K′ efficiently by optimization.

4.1. Computation of gFIC

By applying Laplace's method to Eq. (11) and substituting it into the variational form (6), we obtain

    ln p(X | K) = E_q[L(Z, Π̂, κ(q))] + H(q) + KL(q ‖ q*) + O(1).    (15)

Since the KL divergence is non-negative, substituting this into Eq. (8) and ignoring the KL divergence gives a lower bound of gFIC(K′), i.e.,

    gFIC(K′) ≥ E_q[L(Z, Π̂, κ(q))] + H(q).    (16)

This formulation allows us to estimate gFIC(K′) via maximizing the lower bound. Moreover, we no longer need to know K′ — if the initial dimension of q is greater than K′, the maximum of lower bound (16) attains gFIC(K′) and thus ln p(X | K′). Similarly to other variational inference problems, this optimization is solved by iterative maximization over q and Π.

4.1.1. Update of q

As suggested in Eq. (15), the maximizer of lower bound (16) is p(Z | X), whose asymptotic form is shown in Proposition 4. Unfortunately, we cannot use this as q, because the normalizing constant is intractable. One helpful tool is the mean-field approximation of q, i.e., q(Z) = ∏_n q_n(z_n). Although the asymptotic marginal posterior (12) depends on n due to F_Π, this dependency eventually vanishes for N → ∞, and the mean-field approximation still maintains the asymptotic consistency of gFIC.

Proposition 6. Suppose p(Z | X, K) is not degenerated in distribution. Then, p(Z | X, K) converges to p(Z | X, Π̂, K), and p(Z | X, K) is asymptotically mutually independent over z_1, ..., z_N.

In some models, such as MMs (Fujimaki & Morinaga, 2012), the mean-field approximation suffices to solve the variational problem. If it is still intractable, other approximations are necessary. For BPCA, for example, we restrict q to the Gaussian density q(Z) = ∏_n N(z_n | μ_n, Σ_n), which we use in the experiments (Section 7).

4.1.2. Update of Π

After obtaining q, we need to estimate Π̂ for each sample Z ∼ q(Z), which is also intractable. Alternatively, we estimate the expected MJLE Π̄ = argmax_Π E_q[ln p(X, Z | Π)]. Since the max operator is convex, Jensen's inequality shows that replacing Π̂ by Π̄ introduces the following lower bound:

    E_q[ln p(X, Z | Π̂)] = E_q[max_Π ln p(X, Z | Π)] ≥ E_q[ln p(X, Z | Π̄)] = max_Π E_q[ln p(X, Z | Π)].

Since Π̄ depends only on q, we now need to compute the parameter only once. Remarkably, Π̂ is consistent with Π̄, and the above inequality holds with equality asymptotically.

Proposition 7. If q(Z) is not degenerated in distribution, then Π̂ →_p Π̄.

Since E_q[ln p(X, Z | Π)] is the average of concave functions (A5), E_q[ln p(X, Z | Π)] itself is also concave and the estimation of Π̄ is relatively easy. If the expectations E_q[ln p(X, Z | Π)] and E_q[F_Π] are analytically written, then gradient-based methods suffice to obtain Π̄.

4.1.3. Model Pruning

During the optimization, q can become degenerated or nearly degenerated. In such a case, by definition of objective (16), we need to change the form of L(Z, Π̂, K) to L(Z, Π̂, K′). This can be accomplished by using the transformation (Z, Π) → (Z̃_{K′}, Π̃_{K′}) and decreasing the current model from K to K′, i.e., removing the degenerated components. We refer to this operation as "model pruning". We practically verify the degeneration by the rank of F_Ξ, i.e., we perform model pruning if some eigenvalues are less than a threshold.
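For the BPCA example of Section 2.1, the two inner steps above admit closed forms. The NumPy sketch below is ours, not the paper's implementation: it computes the expected MJLE Π̄ of Section 4.1.2 (which coincides with the PPCA M-step) and a pruning projection in the spirit of Section 4.1.3, using the eigenvalues of one plug-in estimate of the F_Ξ block, (λ/N) E_q[Z^⊤Z]; the threshold handling is an illustrative choice.

```python
import numpy as np

def pi_bar_update(X, M, Sigma):
    """Expected MJLE for BPCA: argmax_Pi E_q[ln p(X, Z | Pi)] with
    q(Z) = prod_n N(z_n | mu_n, Sigma), the means mu_n stacked as rows of M."""
    N, D = X.shape
    EZtZ = M.T @ M + N * Sigma                       # E_q[Z^T Z]
    W = X.T @ M @ np.linalg.inv(EZtZ)                # optimal basis (PPCA M-step)
    resid = (np.sum(X ** 2) - 2 * np.sum((X @ W) * M)
             + np.trace(W.T @ W @ EZtZ))             # E_q[||X - Z W^T||_F^2]
    lam = N * D / resid                              # optimal noise precision
    return W, lam

def pruning_projection(M, Sigma, lam, delta):
    """Model-pruning test: eigen-decompose the plug-in F_Xi block (lam/N) E_q[Z^T Z]
    and return the projection V onto directions whose eigenvalue exceeds delta.
    Apply it as M @ V, W @ V, and V.T @ Sigma @ V (cf. assumption A1)."""
    N = M.shape[0]
    EZtZ = M.T @ M + N * Sigma
    evals, evecs = np.linalg.eigh(lam / N * EZtZ)
    return evecs[:, evals > delta]
```

Alternating these steps with the q-update of Section 4.1.1 yields the loop summarized as Algorithm 1 in the next subsection.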

4.2. The gFAB Algorithm

Algorithm 1 summarizes the above procedures, solving the following optimization problem:

    max_{q∈Q} E_q[L(Z, Π̄(q), κ(q))] + H(q),    (17)

where Q = {q(Z) | q(Z) = ∏_n q_n(z_n)}. As shown in Propositions 6 and 7, the above objective is a lower bound of Eq. (16) and thus of gFIC(K′), and the equality holds asymptotically.

Corollary 8.
    gFIC(K′) ≥ Eq. (17) for finite N > 0, and gFIC(K′) = Eq. (17) for N → ∞.    (18)

Algorithm 1  The gFAB algorithm
  Input: data X, initial model K, threshold δ
  repeat
    q ← argmax_{q∈Q} E_q[L(Z, Π̄, κ(q))] + H(q)
    if σ_K(F_Ξ) ≤ ··· ≤ σ_{K′}(F_Ξ) ≤ δ then
      K ← K′ and (Z, Π̄) ← (Z̃_{K′}, Π̃_{K′})
    end if
    Π̄ ← argmax_Π E_q[ln p(X, Z | Π, K)]
  until convergence

The gFAB algorithm is a block coordinate ascent. Therefore, if the pruning threshold δ is sufficiently small, each step monotonically increases objective (17), and the algorithm stops at a critical point.

A unique property of the gFAB algorithm is that it estimates the true model K′ along with the updates of q and Π̄. If N is sufficiently large and the initial model K_max is larger than K′, the algorithm learns p_{K′}(Z) as q, according to Proposition 4. At the same time, model pruning removes the K − K′ degenerated components. Therefore, if the solutions converge to the global optimum, the gFAB algorithm returns K′.

4.3. How F_Π Works

Proposition 4 shows that if the model is not degenerated, objective (17) is maximized at q(Z) = p_K(Z) ∝ p(Z | X, Π̂) |F_Π̂|^{−1/2}, which is the product of the unmarginalized posterior p(Z | X, Π̂) and the gFIC penalty term |F_Π̂|^{−1/2}. Since |F_Π̂|^{−1/2} has a peak where Z is degenerating, it changes the shape of p(Z | X, Π̂) and increases the probability that Z is degenerated. Figure 1 illustrates how the penalty term affects the posterior.

[Figure 1. The gFIC penalty |F_Ξ̂|^{−1/2} changes the shape of the posterior p(Z | X, Π̂), increasing the probability of degenerated Z (indicated by diagonal stripes).]

Note that, if the model family contains the true distribution of X, then F_Π̂ converges to the Fisher information matrix. From another viewpoint, F_Π̂ is interpreted as the covariance matrix of the asymptotic posterior of Π. As a result of applying the Bernstein–von Mises theorem, asymptotic normality holds for the posterior p(Π | X, Z), in which the covariance is given by (N F_Π̂)^{−1}.

Proposition 9. Let Ω = √N (Π − Π̂). Then, if Z is not degenerated, | p(Ω | X, Z) − N(0, [F_Π̂]^{−1}) | →_p 0.

This interpretation has the following implication. In maximizing the variational lower bound (7), we maximize −(1/2) ln |F_Ξ|. In the gFAB algorithm, this is equivalent to maximizing the posterior covariance and pruning the components whose covariance diverges to infinity. Divergence of the posterior covariance means that there is insufficient information to determine those parameters, which are not necessary for the model and thus can reasonably be removed.

5. Relationship with VB

Similarly to FAB, VB alternatingly optimizes with respect to Z and Π, whereas VB treats both of them as distributions. Suppose K ≤ K′, i.e., the case in which the posterior p(Z | X, K) is not degenerated in distribution. Then, the marginal log-likelihood is written by the variational lower bound. Namely, let J(Z, Π) ≡ ln p(X, Z, Π | K), and

    ln p(X | K) ≥ E_{q(Z,Π)}[J(Z, Π)] + H(q(Z, Π))
                ≥ E_{q(Z)q(Π)}[J(Z, Π)] + H(q(Z)) + H(q(Π)),    (19)

where we use the mean-field approximation q(Z, Π) = q(Π)q(Z). The maximizers of Eq. (19) are given as

    q̃(Π) ∝ exp(E_{q(Z)}[J]),   q̃(Z) ∝ exp(E_{q(Π)}[J]).    (20)

Here, we look inside the optimal distributions to see the relationship with the gFAB algorithm. Let us restrict the density q(Π) to be Gaussian. Since E_{q(Z)}[J] increases proportionally to N while ln p(Π) does not, q̃(Π) attains its maximum around Π̄. Then, the second-order expansion of ln q̃(Π) at Π̄ yields the solution q̃(Π) = N(Π̄, (N F_Π̄)^{−1}). We remark that this solution can be seen as an empirical version of the asymptotic normal posterior given by Proposition 9. Then, if we further approximate J(Z, Π) by the second-order expansion at Π = Π̄, the other expectation E_{q(Π)}[J(Z, Π)] appearing in Eq. (20) is evaluated by J(Z, Π̄) − (1/2) ln |F_Π̄|. Under these approximations, alternating updates of {Π̄, F_Π̄} and q̃(Z) coincide exactly with the gFAB algorithm³, which justifies the VB lower bound as an asymptotic expansion of p(X | K).

³Note that model pruning is not necessary when K ≤ K′.

Proposition 10. Let L_VB(K) be the VB lower bound (19) with q(Π) restricted to be Gaussian and the expectation in ln q̃(Z) approximated by the second-order expansion. Then, for K ≤ K′, ln p(X | K) = L_VB(K) + O(1).

Proposition 10 states that the VB approximation is asymptotically accurate as well as gFIC when the model is not degenerated. For the degenerated case, the asymptotic behavior of L_VB(K) for general LVMs is unclear except for a few specific models such as Gaussian MMs (Watanabe & Watanabe, 2006) and reduced rank regressions (Watanabe, 2009).

Proposition 10 also suggests that the mean-field approximation does not lose the consistency with p(X | K). As shown in Proposition 9, for K ≤ K′, the posterior covariance is given by (N F_Π̂)^{−1}, which goes to 0 for N → ∞, i.e., the posterior converges to a point. Therefore, mutual dependence among Z and Π eventually vanishes in the posterior, and the mean-field assumption holds asymptotically.

6. Related Work

The EM Algorithm. Algorithm 1 looks quite similar to the EM algorithm, which solves

    max_{q,Π} E_q[ln p(X, Z | Π, K)] + H(q).    (21)

We see that both the gFAB and EM algorithms iteratively update the posterior-like distribution of Z and estimate Π. The essential difference between them is that the EM algorithm infers the posterior p(Z | X, Π̂) in the E-step, whereas the gFAB algorithm infers the marginal posterior p(Z | X) ≃ p(Z | X, Π̂) |F_Π̂|^{−1/2}. As discussed in Section 4.3, the penalty term |F_Π̂|^{−1/2} increases the probability mass of the posterior where Z is degenerating, enabling automatic model determination through model pruning. In contrast, the EM algorithm lacks such a pruning mechanism and always overfits to X as long as N is finite, while p(Z | X) eventually converges to p(Z | X, Π̂) for N → ∞ (see Proposition 6).

Note that Eq. (21) has an O(ln N) error against ln p(X). Analogously to gFIC, this error is easily reduced to O(1) by adding −(D_Π/2) ln N. This modification provides another information criterion, which we refer to as BIC_EM.

VB Methods. As discussed in the previous section, gFIC/gFAB has better theoretical properties than the VB methods. Another key advantage is that gFIC/gFAB does not depend on the prior. The VB methods introduce a prior distribution in an explicit form, often with hyperparameters, and we need to determine the hyperparameters by, e.g., the empirical Bayes method. Conversely, gFIC ignores the direct effect of the prior as O(1) (see A3) but takes into account the effect caused by the marginalization via Laplace's method. This makes gFIC prior-free and makes tuning unnecessary.

Collapsed VB (CVB) (Teh et al., 2006) is a variation of VB. Similarly to gFIC, CVB takes the variational bound after marginalizing out Π from the joint likelihood. In contrast to gFIC, CVB approximates q in a non-asymptotic manner, such as by the first-order expansion (Asuncion et al., 2009). Although such approximations have been found to be accurate in practice, their asymptotic properties, such as consistency, have not been explored. Note that, as one of those approximations, the mean-field assumption q(Z) = ∏_n q(z_n) is used in the original CVB paper (Teh et al., 2006), motivated by the intuition that the dependence among {z_n} is weak after marginalization. Proposition 6 formally justifies this asymptotic independence assumption on the marginal distribution employed in CVB.

Several authors have studied the asymptotic behavior of VB methods for LVMs. Wang & Titterington (2004) investigated the VB approximation for linear dynamical systems and showed the inconsistency of VB estimation under large observation noise. Watanabe & Watanabe (2006) derived an asymptotic variational lower bound for Gaussian MMs and demonstrated its usefulness for model selection. Recently, Nakajima et al. (2014) analyzed VB learning of latent Dirichlet allocation (LDA) (Blei et al., 2003), revealed conditions for its consistency, and clarified the transitional behavior of the parameter sparsity. Compared with these existing works, our contribution is that our asymptotic analysis is valid for general LVMs rather than for individual models.

BIC and Extensions. Let Y = (X, Z) be a pair of non-degenerated X and Z. By ignoring all the constant terms of Laplace's approximation (9), we obtain BIC (Schwarz, 1978) with Y regarded as an observation, which is given by the right-hand side of the following equation:

    ln p(Y | K) = ln p(Y | Π̂, K) − (D_Π/2) ln N + O(1).

Unfortunately, the above relation does not hold for p(X | K). Since p(X | K) = ∫ p(Y | K) dZ mixes up degenerated and non-degenerated cases, p(X | K) always becomes singular. This violates the necessary condition A5 for Laplace's approximation.

There are several studies that extend BIC to deal with singular models. Watanabe (2009) evaluates p(X | K) with an O(1) error for any singular model by using algebraic geometry. However, it requires the evaluation of an intractable rational number called the real log canonical threshold. A recent study (Watanabe, 2013) relaxes this intractable evaluation to the evaluation of a criterion called WBIC at the expense of an O_p(√(ln N)) error. Yet, the evaluation of WBIC needs an expectation with respect to a practically intractable distribution, which usually incurs heavy computation.

Table 1. A comparison of approximated Bayesian methods (EM, BIC_EM†, VB, CVB, FAB, gFAB†). The symbol † highlights our contributions. "MF" stands for the mean-field approximation. Note that the asymptotic relations with ln p(X | K) hold only for LVMs.

Obj. function — EM: Eq. (21); BIC_EM†: Eq. (21) − (D_Π/2) ln N; VB: Eq. (19); CVB: Eq. (7); FAB, gFAB†: Eq. (16).
Π — EM, BIC_EM†: point estimate; VB: posterior w/ MF; CVB: marginalized out; FAB, gFAB†: Laplace approximation.
q(Z) — EM, BIC_EM†: = p(Z | X, Π̂); VB, CVB: ≃ p(Z | X); FAB, gFAB†: ∝ p(Z | X)(1 + O(1))†.
ln p(X | K ≤ K′) — EM: O(ln N); BIC_EM†: O(1)†; VB, CVB: O(1)†; FAB, gFAB†: O(1)†.
ln p(X | K > K′) — EM, BIC_EM†: NA; VB, CVB: generally NA; FAB, gFAB†: O(1)†.
Applicability — EM: many models; BIC_EM†: many models; VB: many models; CVB: binary LVMs; FAB: binary LVMs; gFAB†: LVMs.

7. Numerical Experiments

We compare the performance of model selection for BPCA, explained in Section 2.1, with the EM algorithm, BIC_EM introduced in Section 6, simple VB (VB1), full VB (VB2), and the gFAB algorithm. VB2 had the priors for W, λ, and α described in Section 2.1, in which the hyperparameters were fixed as a_λ = b_λ = a_α = b_α = 0.01 following (Bishop, 1999). VB1 is a simple variant of VB2, which fixed α = 1. In the experiments, we used the synthetic data X = ZW^⊤ + E, where W ∼ uniform([0, 1])⁴, Z ∼ N(0, I), and E_nd ∼ N(0, σ²). Under the data dimensionality D = 30 and the true model K′ = 10, we generated data with N = 100, 500, 1000, and 2000. We stopped the algorithms if the relative error was less than 10⁻⁵ or the number of iterations was greater than 10⁴.

⁴This setting could be unfair because VB1 and VB2 assume the Gaussian prior for W. However, we confirmed that data generated by W ∼ N(0, 1) gave almost the same results.
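For reference, a small data-generation sketch in the spirit of this setup (ours; the noise level sigma and the random seed are illustrative choices, as the paper does not report them):

```python
import numpy as np

def make_synthetic(N, D=30, K_true=10, sigma=0.1, seed=0):
    """Synthetic BPCA data as in Section 7: X = Z W^T + E with W ~ uniform([0, 1]),
    Z ~ N(0, I), and E_nd ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(0.0, 1.0, size=(D, K_true))
    Z = rng.normal(size=(N, K_true))
    E = sigma * rng.normal(size=(N, D))
    X = Z @ W.T + E
    return X - X.mean(axis=0), Z, W                 # centerized, as assumed in Section 2.1

X, Z, W = make_synthetic(N=500)
print(X.shape)                                      # (500, 30)
```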

Figure 2 depicts the objective functions after convergence for K = 2, ..., 30. Note that we performed gFAB with K = 30 and it finally converged at K ≃ 10 owing to model pruning, which allowed us to skip the computation for K ≃ 10, ..., 30; the objective values for those K are not drawn in the figure. We see that gFAB successfully chose the true model, except for the small sample size (N = 100), where gFAB underestimated the model. In contrast, the objective of EM slightly but monotonically increased with K, which means that EM always chose the largest K as the best model. This is because EM maximizes Eq. (21), which does not contain the model complexity term of Π. As our analysis in Section 6 suggested, BIC_EM and VB1 become close to gFAB as N increases and have a peak around K′ = 10, meaning that BIC_EM and VB1 are adequate for model selection. However, in contrast to gFAB, both of them need to compute the objective for all K. Interestingly, VB2 was unstable for N < 2000 and gave inconsistent model selection results. We observed that VB2 had a very strong dependence on the initial values. This behavior is understandable because VB2 has the additional prior and hyperparameters to be estimated, which might produce additional local minima that make optimization difficult.

[Figure 2. The objective function versus the model K, for the methods EM, BIC_EM, VB1, VB2, and gFAB and for N = 100, 500, 1000, and 2000. The error bar shows the standard deviation over 10 different random seeds, which affect both the data and the initial values of the algorithms.]

8. Conclusion

This paper provided an asymptotic analysis for the marginal log-likelihood of LVMs. As the main contribution, we proposed gFIC for model selection and showed its consistency with the marginal log-likelihood. Part of our analysis also provided insight into the EM and VB methods. Numerical experiments confirmed the validity of our analysis.

We remark that gFIC is potentially applicable to many other LVMs, including factor analysis, LDA, canonical correlation analysis, and partial membership models. Investigating the behavior of gFIC on these models is an important future research direction.

Acknowledgments

KH was supported by MEXT Kakenhi 25880028. SM was supported by MEXT Kakenhi 15K15210.

References

Asuncion, Arthur, Welling, Max, Smyth, Padhraic, and Teh, Yee Whye. On smoothing and inference for topic models. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI), 2009.

Attias, Hagai. Inferring parameters and structure of latent variable models by variational Bayes. In Uncertainty in Artificial Intelligence (UAI), 1999.

Bishop, C. M. Bayesian PCA. In Advances in Neural Information Processing Systems (NIPS), 1998.

Bishop, C. M. Variational principal components. In International Conference on Artificial Neural Networks (ICANN), 1999.

Blei, David M., Ng, Andrew Y., and Jordan, Michael I. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

Eto, Riki, Fujimaki, Ryohei, Morinaga, Satoshi, and Tamano, Hiroshi. Fully-automatic Bayesian piecewise sparse linear models. In AISTATS, 2014.

Fujimaki, Ryohei and Hayashi, Kohei. Factorized asymptotic Bayesian hidden Markov model. In International Conference on Machine Learning (ICML), 2012.

Fujimaki, Ryohei and Morinaga, Satoshi. Factorized asymptotic Bayesian inference for mixture modeling. In AISTATS, 2012.

Hastings, W. K. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.

Hayashi, Kohei and Fujimaki, Ryohei. Factorized asymptotic Bayesian inference for latent feature models. In 27th Annual Conference on Neural Information Processing Systems (NIPS), 2013.

Jordan, Michael I., Ghahramani, Zoubin, Jaakkola, Tommi S., and Saul, Lawrence K. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.

Liu, Chunchen, Feng, Lu, Fujimaki, Ryohei, and Muraoka, Yusuke. Scalable model selection for large-scale factorial relational models. In International Conference on Machine Learning (ICML), 2015.

Nakajima, Shinichi, Sato, Issei, Sugiyama, Masashi, Watanabe, Kazuho, and Kobayashi, Hiroko. Analysis of variational Bayesian latent Dirichlet allocation: Weaker sparsity than MAP. In Advances in Neural Information Processing Systems (NIPS), 2014.

Schwarz, Gideon. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.

Teh, Yee W., Newman, David, and Welling, Max. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In 19th Annual Conference on Neural Information Processing Systems (NIPS), 2006.

Tierney, Luke and Kadane, Joseph B. Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81(393):82–86, 1986.

Wang, Bo and Titterington, D. M. Lack of consistency of mean field and variational Bayes approximations for state space models. Neural Processing Letters, 20(3):151–170, 2004.

Watanabe, Kazuho and Watanabe, Sumio. Stochastic complexities of Gaussian mixtures in variational Bayesian approximation. Journal of Machine Learning Research, 7:625–644, 2006.

Watanabe, Sumio. Algebraic Geometry and Statistical Learning Theory. Cambridge University Press, 2009.

Watanabe, Sumio. A widely applicable Bayesian information criterion. Journal of Machine Learning Research, 14(1):867–897, 2013.