Bayesian Inference of Log Determinants
Jack Fitzsimons¹, Kurt Cutajar², Michael Osborne¹, Stephen Roberts¹, Maurizio Filippone²
¹ Information Engineering, University of Oxford, UK
² Department of Data Science, EURECOM, France

Abstract

The log determinant of a kernel matrix appears in a variety of machine learning problems, ranging from determinantal point processes and generalized Markov random fields, through to the training of Gaussian processes. Exact calculation of this term is often intractable when the size of the kernel matrix exceeds a few thousands. In the spirit of probabilistic numerics, we reinterpret the problem of computing the log determinant as a Bayesian inference problem. In particular, we combine prior knowledge in the form of bounds from matrix theory and evidence derived from stochastic trace estimation to obtain probabilistic estimates for the log determinant and its associated uncertainty within a given computational budget. Beyond its novelty and theoretic appeal, the performance of our proposal is competitive with state-of-the-art approaches to approximating the log determinant, while also quantifying the uncertainty due to budget-constrained evidence.

1 INTRODUCTION

Developing scalable learning models without compromising performance is at the forefront of machine learning research. The scalability of several learning models is predominantly hindered by linear algebraic operations having large computational complexity, among which is the computation of the log determinant of a matrix (Golub & Van Loan, 1996). The latter term features heavily in the machine learning literature, with applications including spatial models (Aune et al., 2014; Rue & Held, 2005), kernel-based models (Davis et al., 2007; Rasmussen & Williams, 2006), and Bayesian learning (Mackay, 2003).

The standard approach for evaluating the log determinant of a positive definite matrix involves the use of the Cholesky decomposition (Golub & Van Loan, 1996), which is employed in various applications of statistical models such as kernel machines. However, the use of the Cholesky decomposition for general dense matrices requires O(n^3) operations, whilst also entailing memory requirements of O(n^2). In view of this computational bottleneck, various models requiring the log determinant for inference bypass the need to compute it altogether (Anitescu et al., 2012; Stein et al., 2013; Cutajar et al., 2016; Filippone & Engler, 2015).
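For concreteness, the standard approach can be stated in a few lines of code. The following NumPy sketch (our illustration, not code from the paper; the function name logdet_cholesky is ours) exploits the fact that if A = L Lᵀ is the Cholesky factorization of a positive definite matrix, then log Det(A) = 2 Σᵢ log(Lᵢᵢ):

    import numpy as np

    def logdet_cholesky(A):
        # Exact log determinant of a symmetric positive definite matrix:
        # with A = L L^T, Det(A) = prod(diag(L))^2, so
        # log Det(A) = 2 * sum(log(diag(L))).
        # Costs O(n^3) time and O(n^2) memory, as discussed above.
        L = np.linalg.cholesky(A)
        return 2.0 * np.sum(np.log(np.diag(L)))

    # Quick check on a random positive definite matrix.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 500))
    A = X @ X.T / 500 + np.eye(500)
    print(logdet_cholesky(A), np.linalg.slogdet(A)[1])  # the two should agree

It is precisely this cubic cost that the methods surveyed next seek to avoid.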
Alternatively, several methods exploit sparsity and structure within the matrix itself to accelerate computations. For example, sparsity in Gaussian Markov random fields (GMRFs) arises from encoding conditional independence assumptions that are readily available when considering low-dimensional problems. For such matrices, the Cholesky decomposition can be computed in fewer than O(n^3) operations (Rue & Held, 2005; Rue et al., 2009). Similarly, Kronecker-based linear algebra techniques may be employed for kernel matrices computed on regularly spaced inputs (Saatçi, 2011). While these ideas have proven successful for a variety of specific applications, they cannot be extended to the case of general dense matrices without assuming special forms or structures for the available data.

To this end, general approximations to the log determinant frequently build upon stochastic trace estimation techniques using iterative methods (Avron & Toledo, 2011). Two of the most widely-used polynomial approximations for large-scale matrices are the Taylor and Chebyshev expansions (Aune et al., 2014; Han et al., 2015). A more recent approach draws from the possibility of estimating the trace of functions using stochastic Lanczos quadrature (Ubaru et al., 2016), which has been shown to outperform polynomial approximations from both a theoretic and empirical perspective.

Inspired by recent developments in the field of probabilistic numerics (Hennig et al., 2015), in this work we propose an alternative approach for calculating the log determinant of a matrix by expressing this computation as a Bayesian quadrature problem. In doing so, we reformulate the problem of computing an intractable quantity as an estimation problem, where the goal is to infer the correct result using tractable computations that can be carried out within a given time budget. In particular, we model the eigenvalues of a matrix A from noisy observations of Tr(A^k) obtained through stochastic trace estimation using the Taylor approximation method (Zhang & Leithead, 2007). Such a model can then be used to make predictions on the infinite series of the Taylor expansion, yielding the estimated value of the log determinant. Aside from permitting a probabilistic approach for predicting the log determinant, this approach inherently yields uncertainty estimates for the predicted value, which in turn serves as an indicator of the quality of our approximation.
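To illustrate the kind of evidence involved, the following sketch (ours; the function name moment_estimates and the default parameter values are illustrative assumptions, not taken from the paper) gathers noisy observations of Tr(A^k) for k = 1, ..., m using Rademacher probe vectors (Hutchinson, 1990). All m moments for a given probe reuse the same chain of matrix-vector products:

    import numpy as np

    def moment_estimates(A, m, n_probes=30, rng=None):
        # Hutchinson-style estimates of Tr(A^k) for k = 1, ..., m.
        # Each Rademacher probe z gives z^T A^k z, an unbiased estimate
        # of Tr(A^k); successive powers reuse the same matrix-vector
        # products, so all m moments cost m products per probe.
        rng = rng or np.random.default_rng()
        n = A.shape[0]
        estimates = np.zeros(m)
        for _ in range(n_probes):
            z = rng.choice([-1.0, 1.0], size=n)
            v = z.copy()
            for k in range(m):
                v = A @ v                  # v = A^(k+1) z
                estimates[k] += z @ v      # accumulate z^T A^(k+1) z
        return estimates / n_probes

Averaging over probes reduces the variance of each moment; in the framework sketched above, these noisy moments constitute the budget-constrained evidence on which a posterior over the log determinant can be conditioned.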
Our contributions are as follows.

1. We propose a probabilistic approach for computing the log determinant of a matrix which blends different elements from the literature on estimating log determinants under a Bayesian framework.

2. We demonstrate how bounds on the expected value of the log determinant improve our estimates by constraining the probability distribution to lie between designated lower and upper bounds.

3. Through rigorous numerical experiments on synthetic and real data, we demonstrate how our method can yield superior approximations to competing approaches, while also having the additional benefit of uncertainty quantification.

4. Finally, in order to demonstrate how this technique may be useful within a practical scenario, we employ our method to carry out parameter selection for a large-scale determinantal point process.

To the best of our knowledge, this is the first time that the approximation of log determinants is viewed as a Bayesian inference problem, with the resulting quantification of uncertainty being hitherto unexplored.

1.1 RELATED WORK

The most widely-used approaches for estimating log determinants involve extensions of iterative algorithms, such as the conjugate gradient and Lanczos methods, to obtain estimates of functions of matrices (Chen et al., 2011; Han et al., 2015) or their trace (Ubaru et al., 2016). The idea is to rewrite log determinants as the trace of the logarithm of the matrix, and employ trace estimation techniques (Hutchinson, 1990) to obtain unbiased estimates of these. Chen et al. (2011) propose an iterative algorithm to efficiently compute the product of the logarithm of a matrix with a vector, which is achieved by computing a spline approximation to the logarithm function. A similar idea using Chebyshev polynomials has been developed by Han et al. (2015). Most recently, the Lanczos method has been extended to handle stochastic estimates of the trace and obtain probabilistic error bounds for the approximation (Ubaru et al., 2016). Blocking techniques, such as in Ipsen & Lee (2011) and Ambikasaran et al. (2016), have also been proposed.

In our work, we similarly strive to use a small number of matrix-vector products for approximating log determinants. However, we show that by taking a Bayesian approach we can combine priors with the evidence gathered from the intermediate results of the matrix-vector products involved in the aforementioned methods to obtain more accurate results. Most importantly, our proposal has the considerable advantage that it provides a full distribution on the approximated value.

Our proposal allows for the inclusion of explicit bounds on log determinants to constrain the posterior distribution over the estimated log determinant (Bai & Golub, 1997). Nyström approximations can also be used to bound the log determinant, as shown in Bardenet & Titsias (2015). Similarly, Gaussian processes (Rasmussen & Williams, 2006) have been formulated directly using the eigendecomposition of their spectrum, where eigenvectors are approximated using the Nyström method (Peng & Qi, 2015). There has also been work on estimating the distribution of kernel eigenvalues by analyzing the spectrum of linear operators (Braun, 2006; Wathen & Zhu, 2015), as well as bounds on the spectra of matrices, with particular emphasis on deriving the largest eigenvalue (Wolkowicz & Styan, 1980; Braun, 2006). In this work, we directly consider bounds on the log determinants of matrices (Bai & Golub, 1997).

2 BACKGROUND

As highlighted in the introduction, several approaches for approximating the log determinant of a matrix rely on stochastic trace estimation for accelerating computations. This comes about as a result of the relationship between the log determinant of a matrix and the corresponding trace of the log-matrix, whereby

    log Det(A) = Tr(log(A)).    (1)

Provided the matrix log(A) can be efficiently sampled, this simple identity enables the use of stochastic trace estimation. The Taylor approximation considered here applies to the log determinant of matrices having eigenvalues bounded between zero and one. In particular, this approach relies on the following logarithm identity,

    log(I − A) = − Σ_{k=1}^{∞} A^k / k.    (3)

While the infinite summation is not explicitly computable in finite time, this may be approximated by computing a truncated series instead. Furthermore, given that the trace of matrices is additive, we find

    Tr(log(I − A)) ≈ − Σ_{k=1}^{m} Tr(A^k) / k,

where m is the order of the truncation.

[Figure 1: Expected absolute error of truncated Taylor series for stationary ν-continuous ... Axes: absolute error against order of truncation; curves for ν = 1, 10, 20, 30, 40, 50.]
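Putting identity (1) and the truncated form of equation (3) together with stochastic trace estimation yields a simple end-to-end estimator. The sketch below is our own illustration, not the authors' implementation; in particular, the rescaling of a general positive definite B by an upper bound alpha on its largest eigenvalue (so that A = I − B/alpha has eigenvalues in [0, 1), as the series requires) and the defaults for m and n_probes are our assumptions:

    import numpy as np

    def logdet_taylor(B, m=50, n_probes=50, rng=None):
        # Truncated-Taylor estimate of log Det(B) for a symmetric positive
        # definite matrix B, combining identity (1), the truncated series
        # of equation (3), and Hutchinson probes. Since B = alpha (I - A),
        # log Det(B) = n log(alpha) + Tr(log(I - A)).
        rng = rng or np.random.default_rng()
        n = B.shape[0]
        alpha = np.linalg.norm(B, 1)       # cheap upper bound on lambda_max
        tail = 0.0
        for _ in range(n_probes):
            z = rng.choice([-1.0, 1.0], size=n)
            v = z.copy()
            for k in range(1, m + 1):
                v = v - (B @ v) / alpha    # v = A^k z, with A = I - B/alpha
                tail -= (z @ v) / k        # -Tr(A^k)/k term of the series
        return n * np.log(alpha) + tail / n_probes

    # Sanity check against the exact value.
    rng = np.random.default_rng(1)
    X = rng.standard_normal((300, 300))
    B = X @ X.T / 300 + np.eye(300)
    print(logdet_taylor(B), np.linalg.slogdet(B)[1])  # the two should be close

The error incurred by discarding the terms beyond order m is exactly the quantity that the Bayesian treatment proposed in this paper models probabilistically, rather than simply ignoring it.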