Bayesian Inference of Log Determinants
Jack Fitzsimons¹, Kurt Cutajar², Michael Osborne¹, Stephen Roberts¹, Maurizio Filippone²
¹ Information Engineering, University of Oxford, UK
² Department of Data Science, EURECOM, France

Abstract

The log-determinant of a kernel matrix appears in a variety of machine learning problems, ranging from determinantal point processes and generalized Markov random fields, through to the training of Gaussian processes. Exact calculation of this term is often intractable when the size of the kernel matrix exceeds a few thousand. In the spirit of probabilistic numerics, we reinterpret the problem of computing the log-determinant as a Bayesian inference problem. In particular, we combine prior knowledge in the form of bounds from matrix theory with evidence derived from stochastic trace estimation to obtain probabilistic estimates of the log-determinant and its associated uncertainty within a given computational budget. Beyond its novelty and theoretical appeal, the performance of our proposal is competitive with state-of-the-art approaches to approximating the log-determinant, while also quantifying the uncertainty due to budget-constrained evidence.

1 INTRODUCTION

Developing scalable learning models without compromising performance is at the forefront of machine learning research. The scalability of several learning models is predominantly hindered by linear algebraic operations having large computational complexity, among which is the computation of the log-determinant of a matrix (Golub & Van Loan, 1996). The latter term features heavily in the machine learning literature, with applications including spatial models (Aune et al., 2014; Rue & Held, 2005), kernel-based models (Davis et al., 2007; Rasmussen & Williams, 2006), and Bayesian learning (Mackay, 2003).

The standard approach for evaluating the log-determinant of a positive definite matrix involves the use of the Cholesky decomposition (Golub & Van Loan, 1996), which is employed in various applications of statistical models such as kernel machines. However, the use of the Cholesky decomposition for general dense matrices requires $O(n^3)$ operations, whilst also entailing memory requirements of $O(n^2)$. In view of this computational bottleneck, various models requiring the log-determinant for inference bypass the need to compute it altogether (Anitescu et al., 2012; Stein et al., 2013; Cutajar et al., 2016; Filippone & Engler, 2015).

Alternatively, several methods exploit sparsity and structure within the matrix itself to accelerate computations. For example, sparsity in Gaussian Markov random fields (GMRFs) arises from encoding conditional independence assumptions that are readily available when considering low-dimensional problems. For such matrices, the Cholesky decomposition can be computed in fewer than $O(n^3)$ operations (Rue & Held, 2005; Rue et al., 2009). Similarly, Kronecker-based linear algebra techniques may be employed for kernel matrices computed on regularly spaced inputs (Saatçi, 2011). While these ideas have proven successful for a variety of specific applications, they cannot be extended to the case of general dense matrices without assuming special forms or structures for the available data.

To this end, general approximations to the log-determinant frequently build upon stochastic trace estimation techniques using iterative methods (Avron & Toledo, 2011). Two of the most widely-used polynomial approximations for large-scale matrices are the Taylor and Chebyshev expansions (Aune et al., 2014; Han et al., 2015). A more recent approach draws from the possibility of estimating the trace of functions using stochastic Lanczos quadrature (Ubaru et al., 2016), which has been shown to outperform polynomial approximations from both a theoretical and an empirical perspective.

Inspired by recent developments in the field of probabilistic numerics (Hennig et al., 2015), in this work we propose an alternative approach for calculating the log-determinant of a matrix by expressing this computation as a Bayesian quadrature problem. In doing so, we reformulate the problem of computing an intractable quantity into an estimation problem, where the goal is to infer the correct result using tractable computations that can be carried out within a given time budget. In particular, we model the eigenvalues of a matrix $A$ from noisy observations of $\operatorname{Tr}(A^k)$ obtained through stochastic trace estimation using the Taylor approximation method (Zhang & Leithead, 2007). Such a model can then be used to make predictions on the infinite series of the Taylor expansion, yielding the estimated value of the log-determinant. Aside from permitting a probabilistic approach for predicting the log-determinant, this approach inherently yields uncertainty estimates for the predicted value, which in turn serve as an indicator of the quality of our approximation.
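To make the role of stochastic trace estimation concrete, the following is a minimal sketch of the kind of estimator involved, in the style of Hutchinson (1990), cited later in this paper: for a symmetric matrix $M$ and probe vectors $z$ with i.i.d. Rademacher entries, $\mathbb{E}[z^\top M z] = \operatorname{Tr}(M)$. The function name, defaults, and example are ours, for illustration only; they are not the authors' implementation.

```python
import numpy as np

def hutchinson_trace(matvec, n, n_probes=64, rng=None):
    """Estimate Tr(M) for an implicit symmetric n x n matrix M, accessed
    only through the matrix-vector product `matvec`, via Hutchinson's
    identity E[z^T M z] = Tr(M) for Rademacher probe vectors z."""
    rng = np.random.default_rng(rng)
    total = 0.0
    for _ in range(n_probes):
        z = rng.choice([-1.0, 1.0], size=n)  # Rademacher probe vector
        total += z @ matvec(z)               # one noisy sample of Tr(M)
    return total / n_probes

# Example: noisy observations of Tr(A^k) for k = 1, 2 cost k matvecs each.
rng = np.random.default_rng(0)
A = np.cov(rng.standard_normal((5, 200)))             # a small PSD test matrix
tr_A1 = hutchinson_trace(lambda v: A @ v, n=5)        # compare: np.trace(A)
tr_A2 = hutchinson_trace(lambda v: A @ (A @ v), n=5)  # compare: np.trace(A @ A)
```

Repeating this for $k = 1, \dots, m$ yields exactly the kind of noisy observations of $\operatorname{Tr}(A^k)$ that the proposed model treats as evidence.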
Our contributions are as follows.

1. We propose a probabilistic approach for computing the log-determinant of a matrix which blends different elements from the literature on estimating log-determinants under a Bayesian framework.

2. We demonstrate how bounds on the expected value of the log-determinant improve our estimates by constraining the probability distribution to lie between designated lower and upper bounds.

3. Through rigorous numerical experiments on synthetic and real data, we demonstrate how our method can yield superior approximations to competing approaches, while also having the additional benefit of uncertainty quantification.

4. Finally, in order to demonstrate how this technique may be useful within a practical scenario, we employ our method to carry out parameter selection for a large-scale determinantal point process.

To the best of our knowledge, this is the first time that the approximation of log-determinants is viewed as a Bayesian inference problem, with the resulting quantification of uncertainty being hitherto unexplored.

1.1 RELATED WORK

The most widely-used approaches for estimating log-determinants involve extensions of iterative algorithms, such as the conjugate-gradient and Lanczos methods, to obtain estimates of functions of matrices (Chen et al., 2011; Han et al., 2015) or their trace (Ubaru et al., 2016). The idea is to rewrite log-determinants as the trace of the logarithm of the matrix, and to employ trace estimation techniques (Hutchinson, 1990) to obtain unbiased estimates of these. Chen et al. (2011) propose an iterative algorithm to efficiently compute the product of the logarithm of a matrix with a vector, which is achieved by computing a spline approximation to the logarithm function. A similar idea using Chebyshev polynomials has been developed by Han et al. (2015). Most recently, the Lanczos method has been extended to handle stochastic estimates of the trace and obtain probabilistic error bounds for the approximation (Ubaru et al., 2016). Blocking techniques, such as those in Ipsen & Lee (2011) and Ambikasaran et al. (2016), have also been proposed.

In our work, we similarly strive to use a small number of matrix-vector products for approximating log-determinants. However, we show that by taking a Bayesian approach we can combine priors with the evidence gathered from the intermediate results of the matrix-vector products involved in the aforementioned methods to obtain more accurate results. Most importantly, our proposal has the considerable advantage that it provides a full distribution on the approximated value.

Our proposal allows for the inclusion of explicit bounds on log-determinants to constrain the posterior distribution over the estimated log-determinant (Bai & Golub, 1997). Nyström approximations can also be used to bound the log-determinant, as shown by Bardenet & Titsias (2015). Similarly, Gaussian processes (Rasmussen & Williams, 2006) have been formulated directly using the eigendecomposition of the kernel matrix, where the eigenvectors are approximated using the Nyström method (Peng & Qi, 2015). There has also been work on estimating the distribution of kernel eigenvalues by analyzing the spectrum of linear operators (Braun, 2006; Wathen & Zhu, 2015), as well as on bounds on the spectra of matrices, with particular emphasis on deriving the largest eigenvalue (Wolkowicz & Styan, 1980; Braun, 2006). In this work, we directly consider bounds on the log-determinants of matrices (Bai & Golub, 1997).
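As a concrete, deliberately loose illustration of how such bounds arise, the sketch below uses two elementary facts rather than the sharper quadrature-based bounds of Bai & Golub (1997) cited above: by the arithmetic-geometric mean inequality, $\log\operatorname{Det}(A) \le n \log(\operatorname{Tr}(A)/n)$, and any known lower bound $a > 0$ on the smallest eigenvalue gives $\log\operatorname{Det}(A) \ge n \log a$. The function name and the jitter-based example are our own assumptions.

```python
import numpy as np

def logdet_interval(A, eig_lower_bound):
    """Loose interval containing log Det(A) for symmetric positive definite A.
    Upper bound: AM-GM over the eigenvalues, log Det(A) <= n*log(Tr(A)/n).
    Lower bound: log Det(A) >= n*log(a) when all eigenvalues are >= a > 0."""
    n = A.shape[0]
    upper = n * np.log(np.trace(A) / n)
    lower = n * np.log(eig_lower_bound)
    return lower, upper

# Example: an RBF kernel matrix with diagonal jitter 1e-3, so every
# eigenvalue is at least 1e-3 and the jitter is a valid lower bound.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq_dists) + 1e-3 * np.eye(100)
lo, hi = logdet_interval(K, eig_lower_bound=1e-3)
exact = np.linalg.slogdet(K)[1]   # falls inside [lo, hi]
```

In the setting of this paper, intervals of this kind act as hard constraints on the posterior distribution over the log-determinant.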
2 BACKGROUND

As highlighted in the introduction, several approaches for approximating the log-determinant of a matrix rely on stochastic trace estimation for accelerating computations. This comes about as a result of the relationship between the log-determinant of a matrix and the trace of the corresponding log-matrix, whereby

$$\log \operatorname{Det}(A) = \operatorname{Tr}\bigl(\log(A)\bigr). \tag{1}$$

Provided the matrix $\log(A)$ can be efficiently sampled, this simple identity enables the use of stochastic trace estimation techniques (Avron & Toledo, 2011; Fitzsimons et al., 2016). We elaborate further on this concept below.

2.1 STOCHASTIC TRACE ESTIMATION

Since the matrices under consideration are positive semi-definite, we know that the smallest eigenvalue is bounded below by zero, $\lambda_n \geq 0$. Motivated by the identity presented in (1), the Taylor series expansion (Barry & Pace, 1999; Zhang & Leithead, 2007) may be employed for evaluating the log-determinant of matrices having eigenvalues bounded between zero and one. In particular, this approach relies on the following logarithm identity,

$$\log(I - A) = -\sum_{k=1}^{\infty} \frac{A^k}{k}. \tag{3}$$

While the infinite summation is not explicitly computable in finite time, it may be approximated by computing a truncated series instead. Furthermore, given that the trace of matrices is additive, we find

$$\operatorname{Tr}\bigl(\log(I - A)\bigr) \approx -\sum_{k=1}^{m} \frac{\operatorname{Tr}(A^k)}{k}. \tag{4}$$

[Figure 1: Expected absolute error of the truncated Taylor series for stationary ν-continuous kernel matrices, plotted as absolute error against order of truncation (log-log axes) for ν = 1, 10, 20, 30, 40, 50. The dashed grey lines indicate $O(n^{-1})$.]

The $\operatorname{Tr}(A^k)$ term can be computed efficiently and recursively by propagating $O(n^2)$ vector-matrix multiplications in a stochastic trace estimation scheme, as sketched below.
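The following is a minimal sketch of that recursion, combining the truncated series (4) with Rademacher probing. The names and defaults are ours, and this plain truncation is only the evidence-gathering step; the method proposed in this paper additionally places a Bayesian model over the unobserved tail of the series.

```python
import numpy as np

def taylor_logdet(B, m=30, n_probes=30, rng=None):
    """Estimate log Det(B) for symmetric positive definite B with
    eigenvalues in (0, 1]. Writing B = I - A, Eqs. (1) and (4) give
    log Det(B) = Tr(log(I - A)) ~ -sum_{k=1}^m Tr(A^k)/k, where each
    Tr(A^k) is estimated by propagating probe vectors as v <- A v."""
    rng = np.random.default_rng(rng)
    n = B.shape[0]
    A = np.eye(n) - B                        # eigenvalues of A lie in [0, 1)
    total = 0.0
    for _ in range(n_probes):
        z = rng.choice([-1.0, 1.0], size=n)  # Rademacher probe vector
        v = z.copy()
        for k in range(1, m + 1):
            v = A @ v                        # after k steps, v = A^k z
            total -= (z @ v) / k             # z^T A^k z estimates Tr(A^k)
    return total / n_probes

# Example: rescale a PD matrix so its spectrum lies in (0, 1], then undo
# the rescaling via log Det(K) = log Det(K/c) + n*log(c).
rng = np.random.default_rng(2)
K = np.cov(rng.standard_normal((50, 500))) + 0.1 * np.eye(50)
c = np.linalg.norm(K, 2)                     # spectral norm equals lambda_max
est = taylor_logdet(K / c, m=200, n_probes=50) + 50 * np.log(c)
exact = np.linalg.slogdet(K)[1]              # est approaches this as m grows
```

Each probe costs one matrix-vector product per Taylor term, so an $m$-term truncation with $p$ probes costs $O(mpn^2)$ for a dense matrix; this is the budget-constrained evidence referred to in the abstract.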