Rebuilding Factorized Information Criterion: Asymptotically Accurate Marginal Likelihood

Kohei Hayashi [email protected] Global Research Center for Big Data Mathematics, National Institute of Informatics Kawarabayashi Large Graph Project, ERATO, JST

Shin-ichi Maeda [email protected] Graduate School of Informatics, Kyoto University

Ryohei Fujimaki [email protected] NEC Knowledge Discovery Research Laboratories

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

Abstract

Factorized information criterion (FIC) is a recently developed approximation technique for the marginal log-likelihood, which provides an automatic model selection framework for a few latent variable models (LVMs) with tractable inference algorithms. This paper reconsiders FIC and fills theoretical gaps of previous FIC studies. First, we reveal the core idea of FIC that allows generalization for a broader class of LVMs. Second, we investigate the model selection mechanism of the generalized FIC, providing a formal justification of FIC as a model selection criterion for LVMs and a systematic procedure for pruning redundant latent variables. Third, we uncover a few previously-unknown relationships between FIC and the variational free energy. A demonstrative study on Bayesian principal component analysis is provided, and numerical experiments support our theoretical results.

1. Introduction

The marginal log-likelihood is a key concept of Bayesian model identification of latent variable models (LVMs), such as mixture models (MMs), probabilistic principal component analysis, and hidden Markov models (HMMs). Determination of the dimensionality of latent variables is an essential task to uncover hidden structures behind the observed data as well as to mitigate overfitting. In general, LVMs are singular (i.e., the mapping between parameters and probabilistic models is not one-to-one), and classical information criteria based on the regularity assumption, such as the Bayesian information criterion (BIC) (Schwarz, 1978), are no longer justified. Since exact evaluation of the marginal log-likelihood is often not available, approximation techniques have been developed using sampling (i.e., Markov chain Monte Carlo methods (MCMCs) (Hastings, 1970)), a variational lower bound (i.e., the variational Bayes methods (VB) (Attias, 1999; Jordan et al., 1999)), or algebraic geometry (i.e., the widely applicable BIC (WBIC) (Watanabe, 2013)). However, model selection using these methods requires heavy computational cost (e.g., a large number of MCMC samples in a high-dimensional space, or an outer loop for WBIC).

In the last few years, a new approximation technique and an inference method, the factorized information criterion (FIC) and factorized asymptotic Bayesian inference (FAB), have been developed for some binary LVMs (Fujimaki & Morinaga, 2012; Fujimaki & Hayashi, 2012; Hayashi & Fujimaki, 2013; Eto et al., 2014; Liu et al., 2015). Unlike existing methods that evaluate approximated marginal log-likelihoods calculated for each latent variable dimensionality (and therefore need an outer loop for model selection), FAB finds an effective dimensionality via an EM-style alternating optimization procedure.

For example, let us consider a K-component MM for N observations X = (x_1, ..., x_N)^⊤ with one-of-K coding latent variables Z = (z_1, ..., z_N)^⊤, mixing coefficients β = (β_1, ..., β_K), and D_Ξ-dimensional component-wise parameters Ξ = {ξ_1, ..., ξ_K}. By applying Laplace's method to the marginalization of the log-likelihood, FIC of MMs (Fujimaki & Morinaga, 2012) is derived as

    FIC_MM(K) ≡ max_q E_q[ ln p(X, Z | β̂, Ξ̂, K) − Σ_k (D_{ξ_k}/2) ln(Σ_n z_nk / N) ] + H(q) − (D_Π/2) ln N,    (1)

where q is the distribution of Z, β̂ and Ξ̂ are the maximum joint-likelihood estimators (MJLEs)¹, D_Π = D_Ξ + K − 1 is the total dimension of Ξ and β, and H(q) is the entropy of q. A key characteristic of FIC can be observed in the second term of Eq. (1), which gives the penalty in terms of model complexity. As we can see, the penalty term decreases when Σ_n z_nk — the number of effective samples of the k-th component — is small, i.e., Z is degenerated. Therefore, through the optimization of q, the degenerated dimension is automatically pruned until a non-degenerated Z is found. This mechanism makes FAB a one-pass model selection algorithm and computationally more attractive than the other methods. The validity of the penalty term has been confirmed for other binary LVMs, e.g., HMMs (Fujimaki & Hayashi, 2012), latent feature models (Hayashi & Fujimaki, 2013), and mixture of experts (Eto et al., 2014).

¹Note that the MJLE is not equivalent to the maximum a posteriori estimator (MAP). The MJLE is given by argmax_Π p(X, Z | Π) while the MAP is given by argmax_Π p(X, Z | Π)p(Π).

Despite FAB's practical success compared with BIC and VB, it is unclear what conditions are actually necessary to guarantee that FAB yields the true latent variable dimensionality. In addition, the generalization of FIC for non-binary LVMs remains an important issue. In case Z takes negative and/or continuous values, Σ_n z_nk is no longer interpretable as the number of effective samples, and we lose the clue for finding the redundant dimensions of Z.

This paper proposes the generalized FIC (gFIC), given by

    gFIC(K) ≡ E_{q*}[L(Z, Π̂, K)] + H(q*),    (2)
    L(Z, Π, K) = ln p(X, Z | Π, K) − (1/2) ln |F_Ξ| − (D_Π/2) ln N.

Here, q*(Z) ≡ p(Z | X, K) is the marginal posterior and F_Ξ̂ is the Hessian matrix of −ln p(X, Z | Π, K)/N with respect to Ξ. In gFIC, the penalty term is given by the volume of the (empirical) Fisher information matrix, which naturally penalizes model complexity even for continuous latent variables. Accordingly, gFIC is applicable to a broader class of LVMs, such as Bayesian principal component analysis (BPCA) (Bishop, 1998).

Furthermore, we prove that FAB automatically prunes redundant dimensionality along with optimizing q, and that gFIC for the optimized q asymptotically converges to the marginal log-likelihood with a constant-order error under some reasonable assumptions. This justifies gFIC as a model selection criterion for LVMs, and further a natural one-pass model "pruning" procedure is derived, which was performed heuristically in previous FIC studies. We also provide an interpretation of gFIC as a variational free energy and uncover a few previously-unknown relationships between them. Finally, we demonstrate the validity of gFIC by applying it to BPCA. The experimental results agree with the theoretical properties of gFIC.

2. LVMs and Degeneration

We first define the class of LVMs we deal with in this paper. Here, we consider LVMs that have K-dimensional latent variables z_n (including the MMs in the previous section), but now z_n can take not only binary but also real values. Given X and a model family (e.g., MMs), our goal is to determine K, and we refer to this as a model. Note that we sometimes omit the notation K for the sake of brevity if it is obvious from the context.

The LVMs have D_Ξ-dimensional component-dependent parameters Ξ = {ξ_1, ..., ξ_K} and D_Θ-dimensional component-independent parameters Θ, which include hyperparameters of the prior of Z. We abbreviate them as Π = {Θ, Ξ} and assume the dimension D_Π = D_Θ + D_Ξ is finite. Then, we define the joint probability p(X, Z, Π) = p(X | Z, Π)p(Z | Π)p(Π), where ln p(X, Z | Π) is twice differentiable at Π ∈ P. Let F_Π denote the scaled negative Hessian of the joint log-likelihood,

    F_Π ≡ ( F_Θ, F_{Θ,Ξ} ; F_{Θ,Ξ}^⊤, F_Ξ ) = −(1/N) ∂²/∂Π∂Π^⊤ ln p(X, Z | Π),

with blocks corresponding to Θ and Ξ. Note that the MJLE Π̂ ≡ argmax_Π ln p(X, Z | Π) depends on Z (and X). In addition, ln p(X, Z | Π) can have multiple maximizers, and Π̂ could be a set of solutions.

Model redundancy is a notable property of LVMs. Because the latent variable Z is unobservable, the pair (Z, Π) is not necessarily determined uniquely for a given X. In other words, there could be pairs (Z, Π) and (Z̃, Π̃) whose likelihoods have the same value, i.e., p(X, Z | Π, K) = p(X, Z̃ | Π̃, K). In such a case, the model possibly has multiple maxima, and the conventional asymptotic analysis is no longer valid (Watanabe, 2009).

Previous FIC studies address this redundancy by introducing a variational representation that enables treating Z as fixed, as we explain in the next section. However, even if Z is fixed, the redundancy still remains, namely, in the case in which Z is singular or "degenerated," and there exists an equivalent likelihood with a smaller model K′ < K:

    p(X, Z | Π, K) = p(X, Z̃_{K′} | Π̃_{K′}, K′).    (3)

In this case, K is overcomplete for Z, and Z lies on a subspace of the model K. As a simple example, let us consider a three-component MM for which Z = (z, 1 − z, 0). In this case, ξ_3 is unidentifiable, because the third component is completely unused, and the K′ = 2-component MM with Z̃_2 ≡ (z, 1 − z) and Π̃_2 ≡ (Θ, (ξ_1, ξ_2)) satisfies equivalence relation (3). The notion of degeneration is defined formally as follows.

Definition 1. Given X and K, Z is degenerated if there are multiple MJLEs and any F_Π̂ of the MJLEs is not positive definite. Similarly, p(Z) is degenerated in distribution if E_p[F_Π̂] is not positive definite. Let κ(Z) ≡ rank(F_Π̂) and κ(p) ≡ rank(E_p[F_Π̂]).

The idea of degeneration is conceptually understandable as an analogue of linear algebra. Namely, each component of a model is a "basis", Z are "coordinates", and κ(Z) is the number of necessary components to represent X, i.e., the "rank" of X in terms of the model family. The degeneration of Z is then the same idea as "degeneration" in linear algebra, i.e., the number of components is too many and Π is not uniquely determined even if Z is fixed.

As discussed later, given a degenerated Z where κ(Z) = K′, finding the equivalent parameters Z̃_{K′} and Π̃_{K′} that satisfy Eq. (3) is an important task. In order to analyze this, we assume A1): for any degenerated Z under a model K ≥ 2 and K′ < K, there exists a continuous onto mapping (Z, Π) → (Z̃_{K′}, Π̃_{K′}) that satisfies Eq. (3), and Z̃_{K′} is not degenerated. Note that, if P is a subspace of R^{D_Π}, a linear projection V : R^{D_Π} → R^{D_{Π_{K′}}} satisfies A1, where V is the top-D_{Π_{K′}} eigenvectors of F_Π. This is verified easily by the fact that, by using the chain rule, F_{Π̃_{K′}} = V F_Π̂ V^⊤, which is a diagonal matrix whose elements are positive eigenvalues. Therefore, F_{Π̃_{K′}} is positive definite and Z̃_{K′} is not degenerated.

Let us further introduce a few assumptions required to show the asymptotic properties of gFIC. Suppose A2) the joint distribution is mutually independent sample-wise,

    p(X, Z | Π, K) = ∏_n p(x_n, z_n | Π, K),    (4)

and A3) ln p(Π | K) is constant, i.e., lim_{N→∞} ln p(Π | K)/N = 0. In addition, A4) p(Π | K) is continuous, not improper, and its support P is compact and the whole space. Note that for almost all Z, we expect that Π̂ ∈ P is uniquely determined and F_Π̂ is positive definite, i.e., A5) if Z is not degenerated, then ln p(X, Z | Π, K) is concave and det |F_Π̂| < ∞.

2.1. Examples of the LVM Class

The above definition covers a broad class of LVMs. Here, we show that, as examples, MMs and BPCA are included in that class. Note that A2 does not allow correlation among samples, and analysis of cases with sample correlation (e.g., time series models) remains an open problem.

MMs. In the same notation used in Section 1, the joint likelihood is given by p(X, Z | Π) = ∏_n ∏_k {β_k p_k(x_n | ξ_k)}^{z_nk}, where p_k is the density of component k. If ξ_1, ..., ξ_K have no overlap, F_Ξ is the block-diagonal matrix whose blocks are given by F_{ξ_k} = −Σ_n ∇∇ ln p_k(x_n | ξ_k) z_nk / N. This shows that the MM is degenerated when more than one column of Z is filled by zeros. For that case, removing such columns and the corresponding ξ_k suffices as Z̃_{K′} and Π̃_{K′} in A1. Note that if p_k is an exponential-family distribution exp(x_n^⊤ ξ_k − ψ(ξ_k)), then −∇∇ ln p_k(x_n | ξ_k) = ∇∇ψ(ξ_k) = C does not depend on n, and gFIC recovers the original formulation of FIC_MM, i.e., (1/2) ln |F_ξ̂_k| = (1/2) ln |C (Σ_n z_nk / N)| = (D_{ξ_k}/2) ln(Σ_n z_nk / N) + const.

BPCA. Suppose X ∈ R^{N×D} is centerized, i.e., Σ_n x_n = 0. Then, the joint likelihood of X and Z ∈ R^{N×K} is given by p(X, Z | Π) = ∏_n N(x_n | W z_n, λ^{−1} I) N(z_n | 0, I), where Ξ = W = (w_{·1}, ..., w_{·K}) is a linear basis and Θ = λ is the reciprocal of the noise variance. Note that the original study of BPCA (Bishop, 1998) introduces the additional priors p(W) = ∏_d N(w_d | 0, diag(α^{−1})) and p(λ) = Gamma(λ | a_λ, b_λ) and the hyperprior p(α) = ∏_k Gamma(α_k | a_α, b_α). In this paper, however, we do not specify explicit forms of those priors but just treat them as an O(1) term.

Since there is no second-order interaction between w_i and w_{j≠i}, the Hessian F_Ξ is block-diagonal and each block is given by (λ/N) Z^⊤Z. The penalty term is then given as

    −(1/2) ln |F_Ξ| = −(D/2)( K ln λ + ln |(1/N) Z^⊤Z| ),    (5)

and Z is degenerated if rank(Z) < K. Suppose that Z is degenerated, let K′ = rank(Z) < K, and let the SVD be Z = U diag(σ) V^⊤ = (U_{K′}, 0) diag(σ_{K′}, 0) (V_{K′}, 0)^⊤, where U_{K′} and V_{K′} are the K′ non-zero singular vectors and σ_{K′} is the K′ non-zero singular values. From the definition of F_Ξ, the projection V removes the degeneration of Z, i.e., by letting Z̃ = ZV and W̃ = WV,

    ln p(X, Z | Π, K) = −(λ/2) ‖X − ZW^⊤‖²_F − (1/2) ‖Z‖²_F + const.
                      = −(λ/2) ‖X − Z̃W̃^⊤‖²_F − (1/2) ‖Z̃‖²_F + const.
                      = ln p(X, Z̃_{K′} | λ, W̃_{K′}, K′),

where ‖A‖²_F = Σ_ij a_ij² denotes the Frobenius norm, Z̃_{K′} = U_{K′} diag(σ_{K′}), and W̃_{K′} = W V_{K′}. V transforms the K − K′ redundant components into 0-column vectors, and we can find the smaller model K′ by removing the 0-column vectors from W̃ and Z̃, which satisfies A1.
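To make the SVD-based projection above concrete, here is a minimal NumPy sketch (ours, not part of the paper; the function name prune_degenerated and the tolerance tol are illustrative choices). It applies the projection V to a rank-deficient Z and checks numerically that the quadratic terms of the joint log-likelihood in the display above are unchanged.

```python
import numpy as np

def prune_degenerated(Z, W, tol=1e-10):
    """SVD-based projection of Section 2.1: map (Z, W) to (Z V, W V) restricted to
    the non-zero singular values of Z, removing the degenerated components."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    keep = s > tol * s.max()
    return U[:, keep] * s[keep], W @ Vt[keep].T        # Z_tilde, W_tilde

# toy check: a rank-deficient Z (K' = 3 < K = 4) leaves the likelihood terms unchanged
rng = np.random.default_rng(0)
N, D, K, lam = 200, 6, 4, 2.0
Z = rng.normal(size=(N, 3)) @ rng.normal(size=(3, K))  # rank 3, i.e. degenerated
W = rng.normal(size=(D, K))
X = rng.normal(size=(N, D))
Zt, Wt = prune_degenerated(Z, W)

def quad_terms(Z_, W_):
    # (lam/2)||X - Z W^T||_F^2 + (1/2)||Z||_F^2, the Z-dependent part of -ln p(X, Z | Pi, K)
    return 0.5 * lam * np.sum((X - Z_ @ W_.T) ** 2) + 0.5 * np.sum(Z_ ** 2)

print(Zt.shape[1])                                       # 3 effective components
print(np.isclose(quad_terms(Z, W), quad_terms(Zt, Wt)))  # True: Eq. (3) holds here
```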

3. Derivation of gFIC

To obtain p(X | K), we need to marginalize out two variables: Z and Π. Let us consider the variational form for Z, written as

    ln p(X | K) = E_q[ln p(X, Z | K)] + H(q) + KL(q ‖ q*)    (6)
                = E_{q*}[ln p(X, Z | K)] + H(q*),            (7)

where KL(q ‖ p) = ∫ q(x) ln q(x)/p(x) dx is the Kullback–Leibler (KL) divergence.

Variational representation (7) allows us to consider the cases of whether Z is degenerated or not separately. In particular, when Z ∼ q*(Z) is not degenerated, then A5 guarantees that p(X, Z | K) is regular, and standard asymptotic results such as Laplace's method are applicable. In contrast, if q*(Z) is degenerated, p(X, Z | K) becomes singular and its asymptotic behavior is unclear.

In this section, we analyze the asymptotic behavior of the variational representation (7) in both cases and show that gFIC is accurate even if q*(Z) is degenerated. Our main contribution is the following theorem.²

Theorem 2. Let K′ = κ(p(Z | X, K)). Then,

    ln p(X | K) = gFIC(K′) + O(1).    (8)

²A formal proof is given in the supplemental material.

We emphasize that the above theorem holds even if the model family does not include the true distribution of X. To prove Theorem 2, we first investigate the asymptotic behavior of ln p(X | K) for the non-degenerated case.

3.1. Non-degenerated Cases

Consider the marginalization p(X, Z) = ∫ p(X, Z | Π)p(Π)dΠ. If p(Z | X) is not degenerated, then Z ∼ p(Z | X) is not degenerated with probability one. This suffices to guarantee the regularity condition (A5) and hence to justify the application of Laplace's method, which approximates p(X, Z) in an asymptotic manner (Tierney & Kadane, 1986).

Lemma 3. If Z is not degenerated,

    p(X, Z) = p(X, Z, Π̂) |F_Π̂|^{−1/2} (2π/N)^{D_Π/2} (1 + O(N^{−1})).    (9)

This result immediately yields the following relation:

    ln p(X, Z) = L(Z, Π̂, K) + O(1).    (10)

Substitution of Eq. (10) into Eq. (7) yields Eq. (8). Note that we drop the O(1) terms — ln p(Π̂) (see A3), (D_Π/2) ln 2π, and a term related to F_Θ̂ — to obtain Eq. (10). We emphasize here that the magnitude of F_Π (and of F_Θ and F_Ξ) is constant by definition. Therefore, ignoring all of the information of F_Π̂ in Eq. (9) just gives another O(1) error, and the equivalence of gFIC (8) still holds. However, F_Ξ contains important information about which components are effectively used to capture X. Therefore, we use the relation ln |F_Π| = ln |F_Ξ| + ln |F_Θ − F_{Θ,Ξ} F_Ξ^{−1} F_{Θ,Ξ}^⊤| and retain the first term in gFIC. In Section 4.3, we interpret the effect of F_Ξ in more detail.

3.2. Degenerated Cases

If p(Z | X, K) is degenerated, then the regularity condition does not hold, and we cannot use Laplace's method (Lemma 3) directly. In that case, however, A1 guarantees the existence of a variable transformation (Z, Π) → (Z̃_{K′}, Π̃_{K′}) that replaces the joint likelihood by the equivalent yet smaller "regular" model:

    p(X, Z | K) = ∫ p(X, Z | Π, K) p(Π | K) dΠ
                = ∫ p(X, Z̃_{K′} | Π̃_{K′}, K′) p̃(Π̃_{K′} | K′) dΠ̃_{K′}.    (11)

Since Z̃_{K′} is not degenerated in the model K′, we can apply Laplace's method and obtain asymptotic approximation (10) by replacing K with K′. Note that the transformed prior p̃(Π_{K′} | K′) would differ from the original prior p(Π_{K′} | K′). However, since the prior does not depend on N (A3), the difference is at most O(1), which is asymptotically ignorable.

Eq. (11) also gives us an asymptotic form of the marginal posterior.

Proposition 4. Suppose K is fixed. Then,

    p(Z | X, K) = p_K(Z)(1 + O(N^{−1})),    (12)

    p_K(Z) ≡ { p(Z, X | Π̂, K) |F_Π̂|^{−1/2} / C    if K = κ(Z),
               p_{κ(Z)}(T_{κ(Z)}(Z))               if K > κ(Z),    (13)

where T_{K′} : Z → Z̃_{K′} as in Eq. (3) and C is the normalizing constant.

The above proposition indicates that, if κ(p(Z | X, K)) = K′, p(Z | X, K) is represented by the non-degenerated distribution p(Z | X, K′). Now, we see that the joint likelihood (11) and the marginal posterior (12) depend on K′ rather than K. Therefore, putting these results into variational bound (7) leads to (8), i.e., ln p(X | K) is represented by gFIC of the "true" model K′.

Theorem 2 indicates that, if the model K is overcomplete for the true model K′, ln p(X | K) takes the same value as ln p(X | K′).

Corollary 5. For every K > K′ = κ(p(Z | X)),

    ln p(X | K) = ln p(X | K′) + O(1).    (14)

This implication is fairly intuitive in the sense that if X concentrates on a subspace of the model, then marginalization with respect to the parameters outside of that subspace contributes nothing to ln p(X | K). Corollary 5 also justifies model selection of LVMs on the basis of the marginal likelihood. According to Corollary 5, at N → ∞ redundant models always take the same value of the marginal likelihood as that of the true model, and we can safely exclude them from the model candidates.
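Lemma 3 can be checked numerically in a one-parameter example where the marginal is available in closed form. The sketch below is ours and uses a Bernoulli likelihood with a flat prior — not an LVM, but it isolates the Laplace step of Eq. (9) — and compares the exact log-marginal with the approximation.

```python
import numpy as np
from math import lgamma

def exact_log_marginal(s, N):
    # integral of theta^s (1-theta)^(N-s) under a uniform prior equals B(s+1, N-s+1)
    return lgamma(s + 1) + lgamma(N - s + 1) - lgamma(N + 2)

def laplace_log_marginal(s, N):
    # Eq. (9) with Pi = theta (D_Pi = 1) and a flat prior:
    # ln p(X | theta_hat) + (1/2) ln(2*pi/N) - (1/2) ln|F|
    theta = s / N                                   # the MJLE
    loglik = s * np.log(theta) + (N - s) * np.log(1 - theta)
    F = 1.0 / (theta * (1 - theta))                 # -(1/N) d^2/dtheta^2 log-likelihood at theta_hat
    return loglik + 0.5 * np.log(2 * np.pi / N) - 0.5 * np.log(F)

for N in (10, 100, 1000, 10000):
    s = int(0.3 * N)
    print(N, round(exact_log_marginal(s, N) - laplace_log_marginal(s, N), 5))
# the gap shrinks toward zero, consistent with the (1 + O(N^{-1})) factor in Eq. (9)
```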

4. The gFAB Inference

To evaluate gFIC (2), we need to solve several estimation problems. First, we need to estimate p(Z | X, K) to minimize the KL divergence in Eq. (6). In addition, since ln p(X | K) depends on the true model K′ (Theorem 2), we need to check whether the current model is degenerated or not, and if it is degenerated, we need to estimate K′. This is paradoxical, because we would like to determine K′ through model selection. However, by using the properties of gFIC, we can obtain K′ efficiently by optimization.

4.1. Computation of gFIC

By applying Laplace's method to Eq. (11) and substituting it into the variational form (6), we obtain

    ln p(X | K) = E_q[L(Z, Π̂, κ(q))] + H(q) + KL(q ‖ q*) + O(1).    (15)

Since the KL divergence is non-negative, substituting this into Eq. (8) and ignoring the KL divergence gives a lower bound of gFIC(K′), i.e.,

    gFIC(K′) ≥ E_q[L(Z, Π̂, κ(q))] + H(q).    (16)

This formulation allows us to estimate gFIC(K′) via maximizing the lower bound. Moreover, we no longer need to know K′ — if the initial dimension of q is greater than K′, the maximum of lower bound (16) attains gFIC(K′) and thus ln p(X | K′). Similarly to other variational inference problems, this optimization is solved by iterative maximization over q and Π.

4.1.1. Update of q

As suggested in Eq. (15), the maximizer of lower bound (16) is p(Z | X), whose asymptotic form is shown in Proposition 4. Unfortunately, we cannot use this as q, because the normalizing constant is intractable. One helpful tool is the mean-field approximation of q, i.e., q(Z) = ∏_n q_n(z_n). Although the asymptotic marginal posterior (12) depends on n due to F_Π, this dependency eventually vanishes for N → ∞, and the mean-field approximation still maintains the asymptotic consistency of gFIC.

Proposition 6. Suppose p(Z | X, K) is not degenerated in distribution. Then, p(Z | X, K) converges to p(Z | X, Π̂, K), and p(Z | X, K) is asymptotically mutually independent over z_1, ..., z_N.

In some models, such as MMs (Fujimaki & Morinaga, 2012), the mean-field approximation suffices to solve the variational problem. If it is still intractable, other approximations are necessary. For BPCA, for example, we restrict q to the Gaussian density q(Z) = ∏_n N(z_n | μ_n, Σ_n), which we use in the experiments (Section 7).

4.1.2. Update of Π

After obtaining q, we need to estimate Π̂ for each sample Z ∼ q(Z), which is also intractable. Alternatively, we estimate the expected MJLE Π̄ = argmax_Π E_q[ln p(X, Z | Π)]. Since the max operator is convex, Jensen's inequality shows that replacing Π̂ by Π̄ introduces the following lower bound:

    E_q[ln p(X, Z | Π̂)] = E_q[max_Π ln p(X, Z | Π)] ≥ E_q[ln p(X, Z | Π̄)] = max_Π E_q[ln p(X, Z | Π)].

Since Π̄ depends only on q, we now need to compute the parameter only once. Remarkably, Π̂ is consistent with Π̄, and the above inequality holds with equality asymptotically.

Proposition 7. If q(Z) is not degenerated in distribution, then Π̂ →_p Π̄.

Since E_q[ln p(X, Z | Π)] is the average of concave functions (A5), E_q[ln p(X, Z | Π)] itself is also concave and the estimation of Π̄ is relatively easy. If the expectations E_q[ln p(X, Z | Π)] and E_q[F_Π] are analytically written, then gradient-based methods suffice to obtain Π̄.

4.1.3. Model Pruning

During the optimization, q can become degenerated or nearly degenerated. In such a case, by definition of objective (16), we need to change the form of L(Z, Π̂, K) to L(Z, Π̂, K′). This can be accomplished by using the transformation (Z, Π) → (Z̃_{K′}, Π̃_{K′}) and decreasing the current model from K to K′, i.e., removing the degenerated components. We refer to this operation as "model pruning". We practically verify the degeneration by the rank of F_Ξ, i.e., we perform model pruning if some eigenvalues are less than a threshold.
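For the BPCA example of Section 2.1, the two inner steps above admit closed forms. The NumPy sketch below is ours, not the paper's implementation: it computes the expected MJLE Π̄ of Section 4.1.2 (which coincides with the PPCA M-step) and a pruning projection in the spirit of Section 4.1.3, using the eigenvalues of one plug-in estimate of the F_Ξ block, (λ/N) E_q[Z^⊤Z]; the threshold handling is an illustrative choice.

```python
import numpy as np

def pi_bar_update(X, M, Sigma):
    """Expected MJLE for BPCA: argmax_Pi E_q[ln p(X, Z | Pi)] with
    q(Z) = prod_n N(z_n | mu_n, Sigma), the means mu_n stacked as rows of M."""
    N, D = X.shape
    EZtZ = M.T @ M + N * Sigma                       # E_q[Z^T Z]
    W = X.T @ M @ np.linalg.inv(EZtZ)                # optimal basis (PPCA M-step)
    resid = (np.sum(X ** 2) - 2 * np.sum((X @ W) * M)
             + np.trace(W.T @ W @ EZtZ))             # E_q[||X - Z W^T||_F^2]
    lam = N * D / resid                              # optimal noise precision
    return W, lam

def pruning_projection(M, Sigma, lam, delta):
    """Model-pruning test: eigen-decompose the plug-in F_Xi block (lam/N) E_q[Z^T Z]
    and return the projection V onto directions whose eigenvalue exceeds delta.
    Apply it as M @ V, W @ V, and V.T @ Sigma @ V (cf. assumption A1)."""
    N = M.shape[0]
    EZtZ = M.T @ M + N * Sigma
    evals, evecs = np.linalg.eigh(lam / N * EZtZ)
    return evecs[:, evals > delta]
```

Alternating these steps with the q-update of Section 4.1.1 yields the loop summarized as Algorithm 1 in the next subsection.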

4.2. The gFAB Algorithm

Algorithm 1 summarizes the above procedures, solving the following optimization problem:

    max_{q∈Q} E_q[L(Z, Π̄(q), κ(q))] + H(q),    (17)

where Q = {q(Z) | q(Z) = ∏_n q_n(z_n)}. As shown in Propositions 6 and 7, the above objective is a lower bound of Eq. (16) and thus of gFIC(K′), and the equality holds asymptotically.

Corollary 8.
    gFIC(K′) ≥ Eq. (17) for finite N > 0, and gFIC(K′) = Eq. (17) for N → ∞.    (18)

Algorithm 1  The gFAB algorithm
  Input: data X, initial model K, threshold δ
  repeat
    q ← argmax_{q∈Q} E_q[L(Z, Π̄, κ(q))] + H(q)
    if σ_K(F_Ξ) ≤ ··· ≤ σ_{K′}(F_Ξ) ≤ δ then
      K ← K′ and (Z, Π̄) ← (Z̃_{K′}, Π̃_{K′})
    end if
    Π̄ ← argmax_Π E_q[ln p(X, Z | Π, K)]
  until convergence

The gFAB algorithm is a block coordinate ascent. Therefore, if the pruning threshold δ is sufficiently small, each step monotonically increases objective (17), and the algorithm stops at a critical point.

A unique property of the gFAB algorithm is that it estimates the true model K′ along with the updates of q and Π̄. If N is sufficiently large and the initial model K_max is larger than K′, the algorithm learns p_{K′}(Z) as q, according to Proposition 4. At the same time, model pruning removes the K − K′ degenerated components. Therefore, if the solutions converge to the global optimum, the gFAB algorithm returns K′.

4.3. How F_Π Works

Proposition 4 shows that if the model is not degenerated, objective (17) is maximized at q(Z) = p_K(Z) ∝ p(Z | X, Π̂) |F_Π̂|^{−1/2}, which is the product of the unmarginalized posterior p(Z | X, Π̂) and the gFIC penalty term |F_Π̂|^{−1/2}. Since |F_Π̂|^{−1/2} has a peak where Z is degenerating, it changes the shape of p(Z | X, Π̂) and increases the probability that Z is degenerated. Figure 1 illustrates how the penalty term affects the posterior.

[Figure 1. The gFIC penalty |F_Ξ̂|^{−1/2} changes the shape of the posterior p(Z | X, Π̂), increasing the probability of degenerated Z (indicated by diagonal stripes).]

Note that, if the model family contains the true distribution of X, then F_Π̂ converges to the Fisher information matrix. From another viewpoint, F_Π̂ is interpreted as the covariance matrix of the asymptotic posterior of Π. As a result of applying the Bernstein–von Mises theorem, asymptotic normality holds for the posterior p(Π | X, Z), in which the covariance is given by (N F_Π̂)^{−1}.

Proposition 9. Let Ω = √N (Π − Π̂). Then, if Z is not degenerated, | p(Ω | X, Z) − N(0, [F_Π̂]^{−1}) | →_p 0.

This interpretation has the following implication. In maximizing the variational lower bound (7), we maximize −(1/2) ln |F_Ξ|. In the gFAB algorithm, this is equivalent to maximizing the posterior covariance and pruning the components whose covariance diverges to infinity. Divergence of the posterior covariance means that there is insufficient information to determine those parameters, which are not necessary for the model and thus can reasonably be removed.

5. Relationship with VB

Similarly to FAB, VB alternatingly optimizes with respect to Z and Π, whereas VB treats both of them as distributions. Suppose K ≤ K′, i.e., the case in which the posterior p(Z | X, K) is not degenerated in distribution. Then, the marginal log-likelihood is written by the variational lower bound. Namely, let J(Z, Π) ≡ ln p(X, Z, Π | K), and

    ln p(X | K) ≥ E_{q(Z,Π)}[J(Z, Π)] + H(q(Z, Π))
                ≥ E_{q(Z)q(Π)}[J(Z, Π)] + H(q(Z)) + H(q(Π)),    (19)

where we use the mean-field approximation q(Z, Π) = q(Π)q(Z). The maximizers of Eq. (19) are given as

    q̃(Π) ∝ exp(E_{q(Z)}[J]),   q̃(Z) ∝ exp(E_{q(Π)}[J]).    (20)

Here, we look inside the optimal distributions to see the relationship with the gFAB algorithm. Let us restrict the density q(Π) to be Gaussian. Since E_{q(Z)}[J] increases proportionally to N while ln p(Π) does not, q̃(Π) attains its maximum around Π̄. Then, the second-order expansion of ln q̃(Π) at Π̄ yields the solution q̃(Π) = N(Π̄, (N F_Π̄)^{−1}). We remark that this solution can be seen as an empirical version of the asymptotic normal posterior given by Proposition 9. Then, if we further approximate J(Z, Π) by the second-order expansion at Π = Π̄, the other expectation E_{q(Π)}[J(Z, Π)] appearing in Eq. (20) is evaluated by J(Z, Π̄) − (1/2) ln |F_Π̄|. Under these approximations, alternating updates of {Π̄, F_Π̄} and q̃(Z) coincide exactly with the gFAB algorithm³, which justifies the VB lower bound as an asymptotic expansion of p(X | K).

³Note that model pruning is not necessary when K ≤ K′.

Proposition 10. Let L_VB(K) be the VB lower bound (19) with q(Π) restricted to be Gaussian and the expectation in ln q̃(Z) approximated by the second-order expansion. Then, for K ≤ K′, ln p(X | K) = L_VB(K) + O(1).

Proposition 10 states that the VB approximation is asymptotically accurate as well as gFIC when the model is not degenerated. For the degenerated case, the asymptotic behavior of L_VB(K) for general LVMs is unclear except for a few specific models such as Gaussian MMs (Watanabe & Watanabe, 2006) and reduced rank regressions (Watanabe, 2009).

Proposition 10 also suggests that the mean-field approximation does not lose the consistency with p(X | K). As shown in Proposition 9, for K ≤ K′, the posterior covariance is given by (N F_Π̂)^{−1}, which goes to 0 for N → ∞, i.e., the posterior converges to a point. Therefore, mutual dependence among Z and Π eventually vanishes in the posterior, and the mean-field assumption holds asymptotically.

6. Related Work

The EM Algorithm. Algorithm 1 looks quite similar to the EM algorithm, which solves

    max_{q,Π} E_q[ln p(X, Z | Π, K)] + H(q).    (21)

We see that both the gFAB and EM algorithms iteratively update the posterior-like distribution of Z and estimate Π. The essential difference between them is that the EM algorithm infers the posterior p(Z | X, Π̂) in the E-step, whereas the gFAB algorithm infers the marginal posterior p(Z | X) ≃ p(Z | X, Π̂) |F_Π̂|^{−1/2}. As discussed in Section 4.3, the penalty term |F_Π̂|^{−1/2} increases the probability mass of the posterior where Z is degenerating, enabling automatic model determination through model pruning. In contrast, the EM algorithm lacks such a pruning mechanism and always overfits to X as long as N is finite, while p(Z | X) eventually converges to p(Z | X, Π̂) for N → ∞ (see Proposition 6).

Note that Eq. (21) has an O(ln N) error against ln p(X). Analogously to gFIC, this error is easily reduced to O(1) by adding −(D_Π/2) ln N. This modification provides another information criterion, which we refer to as BIC_EM.

VB Methods. As discussed in the previous section, gFIC/gFAB has better theoretical properties than the VB methods. Another key advantage is that gFIC/gFAB does not depend on the prior. The VB methods introduce a prior distribution in an explicit form, often with hyperparameters, and we need to determine the hyperparameters by, e.g., the empirical Bayes method. Conversely, gFIC ignores the direct effect of the prior as O(1) (see A3) but takes into account the effect caused by the marginalization via Laplace's method. This makes gFIC prior-free and makes tuning unnecessary.

Collapsed VB (CVB) (Teh et al., 2006) is a variation of VB. Similarly to gFIC, CVB takes the variational bound after marginalizing out Π from the joint likelihood. In contrast to gFIC, CVB approximates q in a non-asymptotic manner, such as by the first-order expansion (Asuncion et al., 2009). Although such approximations have been found to be accurate in practice, their asymptotic properties, such as consistency, have not been explored. Note that, as one of those approximations, the mean-field assumption q(Z) = ∏_n q(z_n) is used in the original CVB paper (Teh et al., 2006), motivated by the intuition that the dependence among {z_n} is weak after marginalization. Proposition 6 formally justifies this asymptotic independence assumption on the marginal distribution employed in CVB.

Several authors have studied the asymptotic behavior of VB methods for LVMs. Wang & Titterington (2004) investigated the VB approximation for linear dynamical systems and showed the inconsistency of VB estimation under large observation noise. Watanabe & Watanabe (2006) derived an asymptotic variational lower bound for Gaussian MMs and demonstrated its usefulness for model selection. Recently, Nakajima et al. (2014) analyzed VB learning of latent Dirichlet allocation (LDA) (Blei et al., 2003), revealed conditions for its consistency, and clarified the transitional behavior of the parameter sparsity. Compared with these existing works, our contribution is that our asymptotic analysis is valid for general LVMs rather than for individual models.

BIC and Extensions. Let Y = (X, Z) be a pair of non-degenerated X and Z. By ignoring all the constant terms of Laplace's approximation (9), we obtain BIC (Schwarz, 1978) with Y regarded as an observation, which is given by the right-hand side of the following equation:

    ln p(Y | K) = ln p(Y | Π̂, K) − (D_Π/2) ln N + O(1).

Unfortunately, the above relation does not hold for p(X | K). Since p(X | K) = ∫ p(Y | K) dZ mixes up degenerated and non-degenerated cases, p(X | K) always becomes singular. This violates the necessary condition A5 for Laplace's approximation.

There are several studies that extend BIC to deal with singular models. Watanabe (2009) evaluates p(X | K) with an O(1) error for any singular model by using algebraic geometry. However, it requires the evaluation of an intractable rational number called the real log canonical threshold. A recent study (Watanabe, 2013) relaxes this intractable evaluation to the evaluation of a criterion called WBIC at the expense of an O_p(√(ln N)) error. Yet, the evaluation of WBIC needs an expectation with respect to a practically intractable distribution, which usually incurs heavy computation.

Table 1. A comparison of approximated Bayesian methods (EM, BIC_EM†, VB, CVB, FAB, gFAB†). The symbol † highlights our contributions. "MF" stands for the mean-field approximation. Note that the asymptotic relations with ln p(X | K) hold only for LVMs.

Obj. function — EM: Eq. (21); BIC_EM†: Eq. (21) − (D_Π/2) ln N; VB: Eq. (19); CVB: Eq. (7); FAB, gFAB†: Eq. (16).
Π — EM, BIC_EM†: point estimate; VB: posterior w/ MF; CVB: marginalized out; FAB, gFAB†: Laplace approximation.
q(Z) — EM, BIC_EM†: = p(Z | X, Π̂); VB, CVB: ≃ p(Z | X); FAB, gFAB†: ∝ p(Z | X)(1 + O(1))†.
ln p(X | K ≤ K′) — EM: O(ln N); BIC_EM†: O(1)†; VB, CVB: O(1)†; FAB, gFAB†: O(1)†.
ln p(X | K > K′) — EM, BIC_EM†: NA; VB, CVB: generally NA; FAB, gFAB†: O(1)†.
Applicability — EM: many models; BIC_EM†: many models; VB: many models; CVB: binary LVMs; FAB: binary LVMs; gFAB†: LVMs.

7. Numerical Experiments

We compare the performance of model selection for BPCA, explained in Section 2.1, with the EM algorithm, BIC_EM introduced in Section 6, simple VB (VB1), full VB (VB2), and the gFAB algorithm. VB2 had the priors for W, λ, and α described in Section 2.1, in which the hyperparameters were fixed as a_λ = b_λ = a_α = b_α = 0.01 following (Bishop, 1999). VB1 is a simple variant of VB2, which fixed α = 1. In the experiments, we used the synthetic data X = ZW^⊤ + E, where W ∼ uniform([0, 1])⁴, Z ∼ N(0, I), and E_nd ∼ N(0, σ²). Under the data dimensionality D = 30 and the true model K′ = 10, we generated data with N = 100, 500, 1000, and 2000. We stopped the algorithms if the relative error was less than 10⁻⁵ or the number of iterations was greater than 10⁴.

⁴This setting could be unfair because VB1 and VB2 assume the Gaussian prior for W. However, we confirmed that data generated by W ∼ N(0, 1) gave almost the same results.
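For reference, a small data-generation sketch in the spirit of this setup (ours; the noise level sigma and the random seed are illustrative choices, as the paper does not report them):

```python
import numpy as np

def make_synthetic(N, D=30, K_true=10, sigma=0.1, seed=0):
    """Synthetic BPCA data as in Section 7: X = Z W^T + E with W ~ uniform([0, 1]),
    Z ~ N(0, I), and E_nd ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(0.0, 1.0, size=(D, K_true))
    Z = rng.normal(size=(N, K_true))
    E = sigma * rng.normal(size=(N, D))
    X = Z @ W.T + E
    return X - X.mean(axis=0), Z, W                 # centerized, as assumed in Section 2.1

X, Z, W = make_synthetic(N=500)
print(X.shape)                                      # (500, 30)
```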

Figure 2 depicts the objective functions after convergence for K = 2, ..., 30. Note that we performed gFAB with K = 30 and it finally converged at K ≃ 10 owing to model pruning, which allowed us to skip the computation for K ≃ 10, ..., 30; the objective values for those K are not drawn in the figure. We see that gFAB successfully chose the true model, except for the small sample size (N = 100), where gFAB underestimated the model. In contrast, the objective of EM slightly but monotonically increased with K, which means that EM always chose the largest K as the best model. This is because EM maximizes Eq. (21), which does not contain the model complexity term of Π. As our analysis in Section 6 suggested, BIC_EM and VB1 become close to gFAB as N increases and have a peak around K′ = 10, meaning that BIC_EM and VB1 are adequate for model selection. However, in contrast to gFAB, both of them need to compute the objective for all K. Interestingly, VB2 was unstable for N < 2000 and gave inconsistent model selection results. We observed that VB2 had a very strong dependence on the initial values. This behavior is understandable because VB2 has the additional prior and hyperparameters to be estimated, which might produce additional local minima that make optimization difficult.

[Figure 2. The objective function versus the model K, for the methods EM, BIC_EM, VB1, VB2, and gFAB and for N = 100, 500, 1000, and 2000. The error bar shows the standard deviation over 10 different random seeds, which affect both the data and the initial values of the algorithms.]

8. Conclusion

This paper provided an asymptotic analysis for the marginal log-likelihood of LVMs. As the main contribution, we proposed gFIC for model selection and showed its consistency with the marginal log-likelihood. Part of our analysis also provided insight into the EM and VB methods. Numerical experiments confirmed the validity of our analysis.

We remark that gFIC is potentially applicable to many other LVMs, including factor analysis, LDA, canonical correlation analysis, and partial membership models. Investigating the behavior of gFIC on these models is an important future research direction.

Acknowledgments

KH was supported by MEXT Kakenhi 25880028. SM was supported by MEXT Kakenhi 15K15210.

References

Asuncion, Arthur, Welling, Max, Smyth, Padhraic, and Teh, Yee Whye. On smoothing and inference for topic models. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI), 2009.

Attias, Hagai. Inferring parameters and structure of latent variable models by variational Bayes. In Uncertainty in Artificial Intelligence (UAI), 1999.

Bishop, C. M. Bayesian PCA. In Advances in Neural Information Processing Systems (NIPS), 1998.

Bishop, C. M. Variational principal components. In International Conference on Artificial Neural Networks (ICANN), 1999.

Blei, David M., Ng, Andrew Y., and Jordan, Michael I. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

Eto, Riki, Fujimaki, Ryohei, Morinaga, Satoshi, and Tamano, Hiroshi. Fully-automatic Bayesian piecewise sparse linear models. In AISTATS, 2014.

Fujimaki, Ryohei and Hayashi, Kohei. Factorized asymptotic Bayesian hidden Markov model. In International Conference on Machine Learning (ICML), 2012.

Fujimaki, Ryohei and Morinaga, Satoshi. Factorized asymptotic Bayesian inference for mixture modeling. In AISTATS, 2012.

Hastings, W. K. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.

Hayashi, Kohei and Fujimaki, Ryohei. Factorized asymptotic Bayesian inference for latent feature models. In 27th Annual Conference on Neural Information Processing Systems (NIPS), 2013.

Jordan, Michael I., Ghahramani, Zoubin, Jaakkola, Tommi S., and Saul, Lawrence K. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.

Liu, Chunchen, Feng, Lu, Fujimaki, Ryohei, and Muraoka, Yusuke. Scalable model selection for large-scale factorial relational models. In International Conference on Machine Learning (ICML), 2015.

Nakajima, Shinichi, Sato, Issei, Sugiyama, Masashi, Watanabe, Kazuho, and Kobayashi, Hiroko. Analysis of variational Bayesian latent Dirichlet allocation: Weaker sparsity than MAP. In Advances in Neural Information Processing Systems (NIPS), 2014.

Schwarz, Gideon. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.

Teh, Yee W., Newman, David, and Welling, Max. A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In 19th Annual Conference on Neural Information Processing Systems (NIPS), 2006.

Tierney, Luke and Kadane, Joseph B. Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81(393):82–86, 1986.

Wang, Bo and Titterington, D. M. Lack of consistency of mean field and variational Bayes approximations for state space models. Neural Processing Letters, 20(3):151–170, 2004.

Watanabe, Kazuho and Watanabe, Sumio. Stochastic complexities of Gaussian mixtures in variational Bayesian approximation. Journal of Machine Learning Research, 7:625–644, 2006.

Watanabe, Sumio. Algebraic Geometry and Statistical Learning Theory. Cambridge University Press, 2009.

Watanabe, Sumio. A widely applicable Bayesian information criterion. Journal of Machine Learning Research, 14(1):867–897, 2013.