
A Generalized Linear Model for Principal Component Analysis of Binary Data

Andrew I. Schein    Lawrence K. Saul    Lyle H. Ungar
Department of Computer and Information Science
University of Pennsylvania
Moore School Building, 200 South 33rd Street
Philadelphia, PA 19104-6389
{ais,lsaul,ungar}@cis.upenn.edu

Abstract

We investigate a generalized linear model for dimensionality reduction of binary data. The model is related to principal component analysis (PCA) in the same way that logistic regression is related to linear regression. Thus we refer to the model as logistic PCA. In this paper, we derive an alternating least squares method to estimate the basis vectors and generalized linear coefficients of the logistic PCA model. The resulting updates have a simple closed form and are guaranteed at each iteration to improve the model's likelihood. We evaluate the performance of logistic PCA, as measured by reconstruction error rates, on data sets drawn from four real world applications. In general, we find that logistic PCA is much better suited to modeling binary data than conventional PCA.

1 Introduction

Principal component analysis (PCA) is a canonical and widely used method for dimensionality reduction of multivariate data. Applications include the exploratory analysis[9] and visualization of large data sets, as well as the denoising and decorrelation of inputs for algorithms in statistical learning[2, 6]. PCA discovers the linear projections of the data with maximum variance, or equivalently, the lower dimensional subspace that yields the minimum squared reconstruction error. In practice, model fitting occurs either by computing the top eigenvectors of the sample covariance matrix or by performing a singular value decomposition on the matrix of mean-centered data.

While the centering operations and least squares criteria of PCA are naturally suited to real-valued data, they are not generally appropriate for other data types. Recently, Collins et al.[5] derived generalized criteria for dimensionality reduction by appealing to properties of distributions in the exponential family. In their framework, the conventional PCA of real-valued data emerges naturally from assuming a Gaussian distribution over a set of observations, while generalized versions of PCA for binary and nonnegative data emerge respectively by substituting the Bernoulli and Poisson distributions for the Gaussian. For binary data, the generalized model's relationship to PCA is analogous to the relationship between logistic and linear regression[12]. In particular, the model exploits the log-odds as the natural parameter of the Bernoulli distribution and the logistic function as its canonical link. In this paper we will refer to the PCA model for binary data as logistic PCA, and to its counterpart for real-valued data as linear (or conventional) PCA.

Collins et al.[5] proposed an iterative algorithm for all of these generalizations of PCA, but the optimizations required at each iteration of their algorithm do not have a simple closed form for the logistic PCA case. In this paper, we derive an alternating least squares method to estimate the basis vectors and generalized linear coefficients of logistic PCA. Our method adapts an algorithm, originally proposed by Tipping[15], for fitting the parameters of a closely related generative model. Like Tipping's algorithm, it relies on a particular convexity inequality introduced by Jaakkola and Jordan[10] for the logistic function. Our algorithm is easy to implement and guaranteed at each iteration to improve the logistic PCA log-likelihood. Thus, it provides a simple but powerful tool for dimensionality reduction of binary data.

We evaluate the performance of our algorithm on data sets drawn from four real world applications. Performance is measured by the ability to compress and reconstruct large binary matrices with minimum error. Our experimental results add significantly to those of Collins et al.[5], who looked only at simulated data, and Tipping[15], who mainly considered the application to visualization. In the majority of cases, we find that logistic PCA is much better suited to binary data than conventional PCA.

Table 1: Summary of the logistic PCA model notation.

    N      number of observations
    D      dimensionality of binary data
    L      dimensionality of latent space
    Xnd    binary data (N×D matrix)
    Θnd    log-odds (N×D matrix), with P[Xnd = 1 | Θnd] = σ(Θnd)
    Unℓ    coefficients (N×L matrix)
    Vℓd    basis vectors (L×D matrix)
    ∆d     bias vector (1×D vector)

2 Model

Logistic PCA is based on a multivariate generalization of the Bernoulli distribution. The Bernoulli distribution for a univariate binary x ∈ {0, 1} with mean p is given by:

    P(x|p) = p^x (1 − p)^(1−x).                                  (1)

We can equivalently write this distribution in terms of the log-odds parameter θ = log[p/(1−p)] and the logistic function σ(θ) = [1 + e^(−θ)]^(−1). In these terms, the Bernoulli distribution is given by:

    P(x|θ) = σ(θ)^x σ(−θ)^(1−x).                                 (2)

The log-odds and logistic function are, respectively, the natural parameter and canonical link function of the Bernoulli distribution expressed as a member of the exponential family.

A simple multivariate generalization of eq. (2) yields the logistic PCA model. Let Xnd denote the elements of an N×D binary matrix, each of whose N rows stores the observation of a D-dimensional binary vector. A distribution over matrices of this form is given by:

    P(X|Θ) = ∏_nd σ(Θnd)^Xnd σ(−Θnd)^(1−Xnd),                    (3)

where Θnd denotes the log-odds of the binary random variable Xnd. The log-likelihood of binary data under this model is given by:

    L = Σ_nd [Xnd log σ(Θnd) + (1 − Xnd) log σ(−Θnd)].           (4)

Low dimensional structure in the data can be discovered by assuming a compact representation for the log-odds matrix Θ and attempting to maximize this log-likelihood. A compact representation analogous to PCA is obtained by constraining the rows of Θ to lie in a latent subspace of dimensionality L ≪ D. To this end, we parameterize the log-odds matrix Θ in terms of two smaller matrices U and V and (optionally) a bias vector ∆. In terms of these parameters, the N×D matrix Θ is represented as:

    Θnd = Σ_ℓ Unℓ Vℓd + ∆d = (UV)nd + ∆d,                        (5)

where U is an equally tall but narrower N×L matrix, V is a shorter but equally wide L×D matrix, ∆ is a D-dimensional vector, and the sum over the subscript ℓ in eq. (5) makes explicit the matrix multiplication of U and V. Note that the parameters U, V and ∆ in this model play roles analogous to the linear coefficients, basis vectors, and empirical mean computed by PCA of real-valued data. The model is summarized in Table 1. Though the bias vector ∆ in this model could be absorbed by a redefinition of U and V, its presence permits a more straightforward comparison to linear PCA of mean-centered data.

Logistic PCA can be applied to binary data in largely the same way that conventional (or linear) PCA is applied to real-valued data. In particular, given binary data X, we compute the parameters U, V and ∆ that maximize (at least locally) the log-likelihood in eq. (4). An iterative least squares method for maximizing (4) is described in the next section. While the log-likelihood in eq. (4) is not convex in the parameters U, V and ∆, it is convex in any one of these parameters if the others are held fixed. Thus, having estimated these parameters from training data X, we can compute a low dimensional representation U′ of previously unseen (or test) data X′ by locating the global maximum of the corresponding test log-likelihood L′ (with fixed V and ∆).
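For concreteness, the log-likelihood in eq. (4) can be evaluated directly from the parameterization in eq. (5), as in the following minimal NumPy sketch. The array names mirror the notation of Table 1; the function itself is an illustration, not part of any reference implementation.

    import numpy as np

    def logistic_pca_log_likelihood(X, U, V, Delta):
        """Log-likelihood of eq. (4) under Theta = U V + Delta, eq. (5).
        X: (N, D) array of 0/1 data; U: (N, L); V: (L, D); Delta: (D,)."""
        Theta = U @ V + Delta                      # log-odds, eq. (5)
        # log sigma(t) = -log(1 + exp(-t)), computed stably
        log_sig_pos = -np.logaddexp(0.0, -Theta)   # log sigma(Theta)
        log_sig_neg = -np.logaddexp(0.0, Theta)    # log sigma(-Theta)
        return np.sum(X * log_sig_pos + (1.0 - X) * log_sig_neg)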
Logistic and linear PCA can both be viewed as special cases of the generalized framework described by Collins et al.[5]. This is done by writing the log-likelihood in eq. (4) in the more general form:

    L = Σ_nd [Θnd Xnd − G(Θnd) + log P0(Xnd)],                   (6)

where it applies to any member distribution of the exponential family. In this more general formulation, the function G(θ) in eq. (6) is given by the integral of the distribution's canonical link, while the term P0(x) provides a normalization but otherwise has no dependence on the distribution's natural parameter. As before, the rows of Θ are constrained to lie in a low dimensional subspace. Linear PCA for real-valued data emerges naturally in this framework from the form of the Gaussian distribution[17, 18], while logistic PCA emerges from the form of the Bernoulli distribution. Other examples and a fuller treatment are given by Collins et al.[5].

Note that logistic PCA has many of the same shortcomings as linear PCA. In particular, it does not define a proper probabilistic model that can be used to handle missing data or infer a conditional distribution P[U|X] over the coefficients U given data X. Factor analysis[1, 15] and related probabilistic models[14, 16] exist to address these shortcomings of linear PCA.

Generalized models of factor analysis have also been proposed for binary data[1, 15]. These models typically assume that the coefficients U obey a multivariate Gaussian distribution. In such models, however, exact probabilistic inference is intractable, and approximations are required to compute the expected values for maximum likelihood estimation. Logistic PCA can be viewed as a simpler non-probabilistic alternative to these models that does not make an explicit assumption about the distribution obeyed by the coefficients U.

3 Algorithm

Logistic PCA models the structure of binary data by maximizing the log-likelihood in eq. (4), subject to the constraint that the rows of the log-odds matrix Θ lie in the linear subspace defined by the parameters U, V and ∆ in eq. (5). Our method for fitting these parameters is an iterative scheme that alternates between least squares updates for U, V and ∆. Essentially, one set of parameters is updated while the others are held fixed, and this procedure is repeated, cycling through U-updates, V-updates and ∆-updates, until the log-likelihood converges to a desired degree of precision. In the appendix, an auxiliary function is used to derive the alternating least squares (ALS) updates and to prove that they lead to monotonic increases in the log-likelihood. In this section, we present the ALS updates with a minimum of extra formalism, noting similarities and differences with other approaches.

U-Update

Holding the parameters V and ∆ fixed, we obtain a simple update rule for the matrix U that stores the reconstruction coefficients of the log-odds matrix Θ. We begin by computing intermediate quantities:

    Tnd = tanh(Θnd/2) / Θnd,                                     (7)
    Anℓℓ′ = Σ_d Tnd Vℓd Vℓ′d,                                    (8)
    Bnℓ = Σ_d [2Xnd − 1 − Tnd ∆d] Vℓd.                           (9)

Note that Tnd depends on Θnd and (hence) the current estimate of U through eq. (5). In terms of the matrices in eqs. (7–9), the updates for different rows of the matrix U are conveniently decoupled. In particular, we update the nth row of the matrix U by solving the L×L set of linear equations:

    Σ_ℓ′ Anℓℓ′ Unℓ′ = Bnℓ.                                       (10)

V-Update

Holding the parameters U and ∆ fixed, we obtain a similar update rule for the matrix V that stores the basis vectors of the log-odds matrix Θ. Again, we compute the matrix Tnd as in eq. (7), as well as the intermediate quantities:

    Adℓℓ′ = Σ_n Tnd Unℓ Unℓ′,                                    (11)
    Bdℓ = Σ_n [2Xnd − 1 − Tnd ∆d] Unℓ.                           (12)

In terms of these quantities, the updates for different columns of the matrix V are conveniently decoupled. In particular, we update the dth column of the matrix V by solving the L×L set of linear equations:

    Σ_ℓ′ Adℓℓ′ Vℓ′d = Bdℓ.                                       (13)

∆-Update

Finally, holding the parameters U and V fixed, we obtain a simple update rule for the bias vector ∆. With Tnd computed as in eq. (7), we update the elements of the bias vector by:

    ∆d = Σ_n [2Xnd − 1 − Tnd (UV)nd] / Σ_n′ Tn′d.                (14)

To compute the coefficients U′ for test data X′ from a previously trained model of logistic PCA, one iterates just the U-updates while holding the parameters V and ∆ fixed. This is done until the log-likelihood for X′, based on the coefficients U′, converges to a desired degree of accuracy. The optimization of U′ for fixed V and ∆ is convex, so that for test data X′ the U-updates converge (except in degenerate cases) to the same result independent of the starting point for U′.
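The updates in eqs. (7)–(14) translate directly into a few lines of linear algebra. The NumPy sketch below performs one full cycle of U-, V- and ∆-updates; the function name, the handling of the Θ = 0 limit of eq. (7), and the array layout are illustrative choices rather than a prescribed implementation.

    import numpy as np

    def als_iteration(X, U, V, Delta):
        """One cycle of the ALS updates of eqs. (7)-(14).
        X: (N, D) array of 0/1 data; U: (N, L); V: (L, D); Delta: (D,)."""
        U, V = U.copy(), V.copy()
        Delta = np.asarray(Delta, dtype=float).copy()
        N, D = X.shape

        def T_of(Theta):
            # Tnd = tanh(Theta/2) / Theta, eq. (7); its limit at Theta = 0 is 1/2
            T = np.full_like(Theta, 0.5)
            nz = np.abs(Theta) > 1e-8
            T[nz] = np.tanh(Theta[nz] / 2.0) / Theta[nz]
            return T

        # U-update, eqs. (8)-(10): V and Delta held fixed
        T = T_of(U @ V + Delta)
        for n in range(N):
            A = (V * T[n]) @ V.T                                   # eq. (8)
            b = V @ (2.0 * X[n] - 1.0 - T[n] * Delta)              # eq. (9)
            U[n] = np.linalg.solve(A, b)                           # eq. (10)

        # V-update, eqs. (11)-(13): U and Delta held fixed
        T = T_of(U @ V + Delta)
        for d in range(D):
            A = (U * T[:, d:d + 1]).T @ U                          # eq. (11)
            b = U.T @ (2.0 * X[:, d] - 1.0 - T[:, d] * Delta[d])   # eq. (12)
            V[:, d] = np.linalg.solve(A, b)                        # eq. (13)

        # Delta-update, eq. (14): U and V held fixed
        UV = U @ V
        T = T_of(UV + Delta)
        Delta = np.sum(2.0 * X - 1.0 - T * UV, axis=0) / np.sum(T, axis=0)

        return U, V, Delta

In practice these cycles are repeated, monitoring the log-likelihood of eq. (4), until convergence; the experiments below used 300 iterations.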
The ALS updates have a simple closed form that makes them easier to implement than completely general algorithms for convex optimization. Thus, for dimensionality reduction of binary data, they provide a simpler approach than the general procedure outlined in Collins et al.[5] for models of the form in eq. (6). The ALS updates are closely related to Tipping's procedure[15] for fitting a factor analysis model of binary data[1], though in the factor analysis model they are used to approximate a posterior distribution over the matrix U. The model of dimensionality reduction in this paper is simpler by comparison, as it requires only point estimates of U.

4 Experiments

We compared the performance of logistic and linear PCA on data sets of varying size, dimensionality and sparsity. To ensure a high degree of convergence, we employed 300 iterations of ALS to fit the logistic PCA model. A singular value decomposition of mean-centered data served to fit the linear PCA model. For latent spaces of equal dimensionality, these two models involve the same number of fitted parameters, making a direct comparison fairly straightforward. We employed latent spaces of dimensionality L = 1, 2, 4 and 8 in our experiments.

We evaluated logistic and linear PCA by measuring how well their low dimensional representations could reconstruct large matrices of binary data from four real world application domains. Note that the low dimensional representations of both models yield continuous real-valued predictions for these reconstructions; these were converted to binary values {0, 1} by simple thresholding. For each data set and for each hypothesized dimensionality L of the latent space, we report results from logistic and linear PCA in terms of two error rates: a minimum overall error rate and a balanced error rate. These error rates were obtained by choosing the models' binary decision thresholds in two different ways. In the first case, the thresholds were chosen to minimize the overall error rates, counting equally those errors due to false positives and false negatives. In the second case, the thresholds were chosen to equalize the false positive and false negative error rates; this has the effect, in sparse data sets, of giving more weight to errors from false positives. Both types of error rates are revealing, as they capture different notions of error cost. The results are summarized in Table 2. Note that logistic PCA outperformed linear PCA on the task of binary data reconstruction in nearly every one of our experiments, with exceptional performance gains under the balanced error rate criterion.
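Both error rates can be computed by sweeping a decision threshold over the real-valued reconstructions. The following sketch illustrates one way to do so; the function name and the tie-breaking choices (for example, reporting the average of the false positive and false negative rates at the most nearly equalizing threshold) are illustrative rather than the exact procedure used to produce Table 2.

    import numpy as np

    def reconstruction_error_rates(X, X_hat):
        """Minimum overall and balanced error rates for reconstructing
        the binary matrix X from real-valued predictions X_hat."""
        x = X.ravel().astype(bool)
        s = X_hat.ravel()
        n_pos = max(int(x.sum()), 1)
        n_neg = max(int((~x).sum()), 1)

        best_overall = 1.0
        best_gap, balanced = np.inf, 1.0
        # sweep all candidate thresholds, including one that predicts all zeros
        for t in np.concatenate((np.unique(s), [np.inf])):
            pred = s >= t
            fp = np.count_nonzero(pred & ~x) / n_neg      # false positive rate
            fn = np.count_nonzero(~pred & x) / n_pos      # false negative rate
            best_overall = min(best_overall, float(np.mean(pred != x)))
            gap = abs(fp - fn)
            if gap < best_gap:            # threshold most nearly equalizing FP and FN
                best_gap, balanced = gap, 0.5 * (fp + fn)
        return best_overall, balanced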
The individual data sets are described in greater detail below. The density of each data set was measured by the mean value of elements in the N×D binary data matrix X and is indicated in Table 2 by the symbol ρ.

4.1 Microsoft Web Data

The anonymous Microsoft Web Database is available from the UCI machine learning repository (http://www.ics.uci.edu/~mlearn/MLRepository.html). It has been used previously to evaluate collaborative filtering algorithms[3]. The data was generated from logs at www.microsoft.com that recorded (anonymously) the behavior of N = 32711 users. The extracted portion of the logs consists of binary variables indicating whether a user visited a URL over a one week period. There are data for D = 285 "vroots" or URL prefixes. In our experiments, each user was represented by a row in the binary data matrix X. The data is very sparse, with density ρ = 0.011.

4.2 MovieLens Data

The MovieLens data set contains movie ratings by a large number of users[7]. It was collected in order to evaluate the performance of automated recommender systems. In our experiments we ignore the actual ratings except to obtain a binary matrix indicating which of D = 1682 movies were rated by N = 943 users. Each user was represented by a row in the binary data matrix X. The data is fairly sparse, with density ρ = 0.063.

4.3 Gene Expression Data

The gene expression data of Causton et al.[4] was generated by measuring gene expression levels on an Affymetrix gene chip (Affymetrix, Inc., Santa Clara, CA). The experimental design called for genome-wide expression analysis of Saccharomyces cerevisiae to detect transcriptional change under various environmental conditions. The measurements were converted to binary values by Affymetrix GeneChip software and provided in this form to the authors. The binary values are noisy indicators of the presence or absence of mRNA in a Saccharomyces cerevisiae cell. There are measurements for N = 6015 genes in D = 46 environmental conditions. Each gene was represented by a row in the binary data matrix X. The data is fairly balanced between positive and negative indicators, with density ρ = 0.738.

4.4 Advertisement Data

The advertisement data is available from the UCI machine learning repository. The data was collected to predict whether or not images are advertisements, based on a large number of their surrounding features. To generate binary data for our experiments, we removed the three continuous features in this data set, as well as the one feature with missing values. We also removed the class labels distinguishing advertisements from non-advertisements. The data for our experiments consisted of D = 1555 binary features for N = 3279 images. Each image was represented by a row in the binary data matrix X. The data is very sparse, with density ρ = 0.072.

Besides the error rates in Table 2, for this data we also plotted the first two principal components for the first fifty points of each class (advertisement and non-advertisement). The results in Fig. 1 show that both linear and logistic PCA separate the classes with only a moderate number of outliers. Curiously, though, the data is much more uniformly distributed in the latent space for logistic PCA. These results suggest, as in previous studies[15], that logistic PCA may be useful for the visualization of multivariate binary data.

Table 2: Minimum and balanced error rates on four different data sets for the task of binary data reconstruction. The data sets were N×D binary matrices with ρND nonzero elements.

Microsoft Web Log (N = 32711, D = 285, ρ = 0.011)

         Minimum Error Rates (%)              Balanced Error Rates (%)
    L   Linear PCA   Logistic PCA        L   Linear PCA   Logistic PCA
    1     0.0886        0.0959           1     1.52          1.28
    2     0.0831        0.0701           2     1.41          1.15
    4     0.0661        0.0502           4     1.36          0.760
    8     0.0475        0.0237           8     1.11          0.355

MovieLens (N = 943, D = 1682, ρ = 0.063)

         Minimum Error Rates (%)              Balanced Error Rates (%)
    L   Linear PCA   Logistic PCA        L   Linear PCA   Logistic PCA
    1     0.557         0.555            1     1.73          1.64
    2     0.528         0.518            2     1.57          1.42
    4     0.498         0.479            4     1.34          1.18
    8     0.457         0.428            8     1.15          0.993

Gene Expression (N = 6015, D = 46, ρ = 0.738)

         Minimum Error Rates (%)              Balanced Error Rates (%)
    L   Linear PCA   Logistic PCA        L   Linear PCA   Logistic PCA
    1     1.24          1.21             1     1.46          1.43
    2     1.04          1.02             2     1.28          1.20
    4     0.884         0.758            4     1.03          0.804
    8     0.667         0.393            8     0.766         0.499

Advertising (N = 3279, D = 1555, ρ = 0.072)

         Minimum Error Rates (%)              Balanced Error Rates (%)
    L   Linear PCA   Logistic PCA        L   Linear PCA   Logistic PCA
    1     0.0709        0.0700           1     2.68          1.97
    2     0.0682        0.0591           2     2.39          1.20
    4     0.0616        0.0388           4     2.17          0.626
    8     0.0576        0.0124           8     1.76          0.268

[Figure 1 appears here: two scatter plots of the advertising data, (A) Linear PCA and (B) Logistic PCA Biplot, each showing advertisements (Ad) and non-advertisements (Non-Ad) against the First and Second Principal Components.]

Figure 1: Separation of advertisements and non-advertisements by the first two components of (A) linear PCA and (B) logistic PCA applied to their surrounding features.

5 Discussion

Our experimental results show conclusively that logistic PCA is better suited to reconstruction of binary data than linear PCA. They also establish the practical utility of the ALS updates for optimizing the log-likelihood in eq. (4). The results are noteworthy in the absence of any theoretical guarantee that the ALS updates converge to a global maximum of the log-likelihood. It is not currently known whether this log-likelihood has local maxima that could confound optimal model fitting; our experimental results leave room for optimism that this is not the case. At present we know only that the maximum likelihood solution of the logistic PCA model does not yield unique estimates of U and V, since we can always permute these matrices.

There are many promising areas for future work. In many applications involving binary data, dimensionality reduction has been performed by singular value decomposition (SVD). For these applications, logistic PCA provides a natural and compelling alternative. It will also be useful to develop algorithms for dimensionality reduction of hybrid data (cf. [1]); this could be done by combining ALS methods for binary features and SVD methods for real-valued ones. Empirical comparisons should also be made between logistic PCA and the models of nonnegative matrix factorization[11] and probabilistic latent semantic analysis[8]. These models, unlike conventional PCA, are tailored to the dimensionality reduction of nonnegative and stochastic matrices. Difficulties have been reported when some of these models were used to fit extremely sparse data from word document counts[13]. We have not encountered such difficulties in our investigations of logistic PCA. Finally, we need to develop a better understanding of the relationship between logistic PCA and factor analysis models of binary data[1, 15]. This relationship is not as well understood as the relationship between non-probabilistic and probabilistic models of PCA[14, 16] for real-valued data.

Acknowledgments

Andrew Schein was supported by NIH Training Grant in Computational Genomics, T-32-HG00046.

References

[1] D. J. Bartholomew and M. Knott. Latent Variable Models and Factor Analysis, volume 7 of Kendall's Library of Statistics. Oxford University Press, New York, 1999.

[2] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.

[3] J. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In Fourteenth Conference on Uncertainty in Artificial Intelligence, November, 1998.

[4] H. C. Causton, B. Ren, S. S. Koh, C. T. Harbison, E. Kanin, E. G. Jennings, T. I. Lee, H. L. True, E. S. Lander, and R. A. Young. Remodeling of yeast genome expression in response to environmental changes. Molecular Biology of the Cell, 12:323–337, 2001.

[5] M. Collins, S. Dasgupta, and R. E. Schapire. A generalization of principal component analysis to the exponential family. In Proceedings of NIPS, 2001.

[6] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer-Verlag, 2001.

[7] J. Herlocker, J. Konstan, A. Borchers, and J. Riedl. An algorithmic framework for performing collaborative filtering. In Proceedings of the 1999 Conference on Research and Development in Information Retrieval, August, 1999.

[8] T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1):177–196, 2001.

[9] H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24:417–441, 1933.

[10] T. Jaakkola and M. Jordan. A variational approach to Bayesian logistic regression models and their extensions. In Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics, 1997.

[11] D. D. Lee and H. S. Seung. Learning the parts of objects with nonnegative matrix factorization. Nature, 401:788–791, 1999.

[12] P. McCullagh and J. A. Nelder. Generalised Linear Models. Chapman and Hall, London, 1987.

[13] A. Popescul, L. H. Ungar, D. M. Pennock, and S. Lawrence. Probabilistic models for unified collaborative and content-based recommendation in sparse-data environments. In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence (UAI 2001), Seattle, WA, August, 2001.

[14] S. Roweis. EM algorithms for PCA and SPCA. In M. I. Jordan, M. S. Kearns, and S. A. Solla, editors, Advances in Neural Information Processing Systems, volume 10, pages 626–632, Cambridge, MA, 1998. MIT Press.

[15] M. E. Tipping. Probabilistic visualisation of high-dimensional binary data. In Advances in Neural Information Processing Systems 11, pages 592–598. MIT Press, 1999.

[16] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, 61:611–622, 1999.

[17] P. Whittle. On principal components and least squares method of factor analysis. Skandinavisk Aktuarietidskrift, 36:223–239, 1952.

[18] G. Young. Maximum likelihood estimation and factor analysis. Psychometrika, 6:49–53, 1940.

A Derivation

The ALS updates are derived from a lower bound on the log-likelihood L(Θ) in eq. (4). To establish the bound, we begin by noting that the sigmoid function in eq. (4) obeys:

    log σ(θ) = −log 2 + (θ/2) − log cosh(θ/2).                   (15)

Consider the rightmost term in this equation. A bound on this term is obtained by noting that log cosh(√z) is a concave function of z. It follows that the value of this function at ẑ is upper bounded by its linear extrapolation from z:

    log cosh(√ẑ) ≤ log cosh(√z) + (ẑ − z) tanh(√z)/(2√z).        (16)

This bound was introduced by Jaakkola and Jordan[10] for Bayesian logistic regression and subsequently applied to factor analysis of binary data by Tipping[15]. Substituting √z = θ/2 and √ẑ = θ̂/2 into eq. (16) gives:

    log cosh(θ̂/2) ≤ log cosh(θ/2) + (θ̂² − θ²) tanh(θ/2)/(4θ).   (17)

Finally, combining eqs. (4) and (15–17), we obtain a lower bound on the log-likelihood:

    L(Θ̂) ≥ Σ_nd [ −log 2 − log cosh(Θnd/2) + Tnd Θnd²/4
                   + (2Xnd − 1) Θ̂nd/2 − Tnd Θ̂nd²/4 ],           (18)

with Tnd defined as in eq. (7). Note that this bound on L(Θ̂) is quadratic in the matrix elements of Θ̂ and holds for all matrices Θ̂ and Θ.

Let the auxiliary function Q(Θ̂, Θ) be defined by the right hand side of eq. (18). The auxiliary function satisfies the key property that L(Θ̂) ≥ Q(Θ̂, Θ) for all matrices Θ̂ and Θ, with equality holding only if Θ̂ = Θ. If we choose Θ̂ to maximize the auxiliary function Q(Θ̂, Θ), then it follows that:

    L(Θ̂) ≥ Q(Θ̂, Θ) ≥ Q(Θ, Θ) = L(Θ).                            (19)

Thus, updates derived in this way are guaranteed to lead to monotonic increases in the log-likelihood. Assuming furthermore that the functions in eq. (19) are smooth and well-behaved, it follows that the updates converge only to stationary points of the log-likelihood.

This procedure leads to the ALS updates for logistic PCA. The U-update is obtained by setting Θ̂nd = (ÛV)nd + ∆d and Θnd = (UV)nd + ∆d and maximizing the auxiliary function Q(Θ̂, Θ) with respect to Û. Since the auxiliary function is quadratic in the elements of the matrix Θ̂, it is also quadratic in the elements of the matrix Û. The required maximization therefore reduces to a least squares problem that can be solved in closed form. In particular, setting

    0 = ∂Q/∂Ûnℓ = Σ_d (∂Q/∂Θ̂nd)(∂Θ̂nd/∂Ûnℓ) = Σ_d (∂Q/∂Θ̂nd) Vℓd   (20)

and solving for Û leads to the U-update in eq. (10). The V-update and ∆-update are obtained in an analogous way.