Bayesian Inference of Log Determinants

Jack Fitzsimons¹, Kurt Cutajar², Michael Osborne¹, Stephen Roberts¹, Maurizio Filippone²

¹ Information Engineering, University of Oxford, UK    ² Department of Data Science, EURECOM, France

Abstract

The log determinant of a kernel matrix appears in a variety of machine learning problems, ranging from determinantal point processes and generalized Markov random fields, through to the training of Gaussian processes. Exact calculation of this term is often intractable when the size of the kernel matrix exceeds a few thousand. In the spirit of probabilistic numerics, we reinterpret the problem of computing the log determinant as a Bayesian inference problem. In particular, we combine prior knowledge in the form of bounds from matrix theory and evidence derived from stochastic trace estimation to obtain probabilistic estimates for the log determinant and its associated uncertainty within a given computational budget. Beyond its novelty and theoretic appeal, the performance of our proposal is competitive with state-of-the-art approaches to approximating the log determinant, while also quantifying the uncertainty due to budget-constrained evidence.

1 INTRODUCTION

Developing scalable learning models without compromising performance is at the forefront of machine learning research. The scalability of several learning models is predominantly hindered by linear algebraic operations having large computational complexity, among which is the computation of the log determinant of a matrix (Golub & Van Loan, 1996). The latter term features heavily in the machine learning literature, with applications including spatial models (Aune et al., 2014; Rue & Held, 2005), kernel-based models (Davis et al., 2007; Rasmussen & Williams, 2006), and Bayesian learning (Mackay, 2003).

The standard approach for evaluating the log determinant of a positive definite matrix involves the use of the Cholesky decomposition (Golub & Van Loan, 1996), which is employed in various applications of statistical models such as kernel machines. However, the use of the Cholesky decomposition for general dense matrices requires O(n^3) operations, whilst also entailing memory requirements of O(n^2). In view of this computational bottleneck, various models requiring the log determinant for inference bypass the need to compute it altogether (Anitescu et al., 2012; Stein et al., 2013; Cutajar et al., 2016; Filippone & Engler, 2015).

Alternatively, several methods exploit sparsity and structure within the matrix itself to accelerate computations. For example, sparsity in Gaussian Markov random fields (GMRFs) arises from encoding conditional independence assumptions that are readily available when considering low-dimensional problems. For such matrices, the Cholesky decomposition can be computed in fewer than O(n^3) operations (Rue & Held, 2005; Rue et al., 2009). Similarly, Kronecker-based linear algebra techniques may be employed for kernel matrices computed on regularly spaced inputs (Saatçi, 2011). While these ideas have proven successful for a variety of specific applications, they cannot be extended to the case of general dense matrices without assuming special forms or structures for the available data.

To this end, general approximations to the log determinant frequently build upon stochastic trace estimation techniques using iterative methods (Avron & Toledo, 2011). Two of the most widely-used polynomial approximations for large-scale matrices are the Taylor and Chebyshev expansions (Aune et al., 2014; Han et al., 2015). A more recent approach draws from the possibility of estimating the trace of functions using stochastic Lanczos quadrature (Ubaru et al., 2016), which has been shown to outperform polynomial approximations from both a theoretic and empirical perspective.

Inspired by recent developments in the field of probabilistic numerics (Hennig et al., 2015), in this work we propose an alternative approach for calculating the log determinant of a matrix by expressing this computation as a Bayesian quadrature problem. In doing so, we reformulate the problem of computing an intractable quantity as an estimation problem, where the goal is to infer the correct result using tractable computations that can be carried out within a given time budget. In particular, we model the eigenvalues of a matrix A from noisy observations of Tr(A^k) obtained through stochastic trace estimation using the Taylor approximation method (Zhang & Leithead, 2007). Such a model can then be used to make predictions on the infinite series of the Taylor expansion, yielding the estimated value of the log determinant. Aside from permitting a probabilistic approach for predicting the log determinant, this approach inherently yields uncertainty estimates for the predicted value, which in turn serve as an indicator of the quality of our approximation.

Our contributions are as follows.

1. We propose a probabilistic approach for computing the log determinant of a matrix which blends different elements from the literature on estimating log determinants under a Bayesian framework.

2. We demonstrate how bounds on the expected value of the log determinant improve our estimates by constraining the probability distribution to lie between designated lower and upper bounds.

3. Through rigorous numerical experiments on synthetic and real data, we demonstrate how our method can yield superior approximations to competing approaches, while also having the additional benefit of uncertainty quantification.

4. Finally, in order to demonstrate how this technique may be useful within a practical scenario, we employ our method to carry out parameter selection for a large-scale determinantal point process.

To the best of our knowledge, this is the first time that the approximation of log determinants is viewed as a Bayesian inference problem, with the resulting quantification of uncertainty being hitherto unexplored.

1.1 RELATED WORK

The most widely-used approaches for estimating log determinants involve extensions of iterative algorithms, such as the conjugate gradient and Lanczos methods, to obtain estimates of functions of matrices (Chen et al., 2011; Han et al., 2015) or their trace (Ubaru et al., 2016). The idea is to rewrite log determinants as the trace of the logarithm of the matrix, and employ trace estimation techniques (Hutchinson, 1990) to obtain unbiased estimates of these. Chen et al. (2011) propose an iterative algorithm to efficiently compute the product of the logarithm of a matrix with a vector, which is achieved by computing a spline approximation to the logarithm function. A similar idea using Chebyshev polynomials has been developed by Han et al. (2015). Most recently, the Lanczos method has been extended to handle stochastic estimates of the trace and obtain probabilistic error bounds for the approximation (Ubaru et al., 2016). Blocking techniques, such as in Ipsen & Lee (2011) and Ambikasaran et al. (2016), have also been proposed.

In our work, we similarly strive to use a small number of matrix-vector products for approximating log determinants. However, we show that by taking a Bayesian approach we can combine priors with the evidence gathered from the intermediate results of the matrix-vector products involved in the aforementioned methods to obtain more accurate results. Most importantly, our proposal has the considerable advantage that it provides a full distribution on the approximated value.

Our proposal allows for the inclusion of explicit bounds on log determinants to constrain the posterior distribution over the estimated log determinant (Bai & Golub, 1997). Nyström approximations can also be used to bound the log determinant, as shown in Bardenet & Titsias (2015).
Similarly, Gaussian processes (Rasmussen & Williams, 2006) have been formulated directly using the eigendecomposition of their spectrum, where eigenvectors are approximated using the Nyström method (Peng & Qi, 2015). There has also been work on estimating the distribution of kernel eigenvalues by analyzing the spectrum of linear operators (Braun, 2006; Wathen & Zhu, 2015), as well as bounds on the spectra of matrices, with particular emphasis on deriving the largest eigenvalue (Wolkowicz & Styan, 1980; Braun, 2006). In this work, we directly consider bounds on the log determinants of matrices (Bai & Golub, 1997).

2 BACKGROUND

As highlighted in the introduction, several approaches for approximating the log determinant of a matrix rely on stochastic trace estimation for accelerating computations. This comes about as a result of the relationship between the log determinant of a matrix and the corresponding trace of the log-matrix, whereby

$$\log \mathrm{Det}(A) = \mathrm{Tr}(\log(A)). \qquad (1)$$

Provided the matrix log(A) can be efficiently sampled, this simple identity enables the use of stochastic trace estimation techniques (Avron & Toledo, 2011; Fitzsimons et al., 2016). We elaborate further on this concept below.
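The identity in (1) is easy to verify numerically on a small instance. The snippet below is our own illustration (not part of the paper); it compares the log determinant of a random symmetric positive definite matrix with the trace of its matrix logarithm.

```python
import numpy as np
from scipy.linalg import logm

# Build a small symmetric positive definite matrix.
rng = np.random.default_rng(0)
B = rng.standard_normal((50, 50))
A = B @ B.T + 50 * np.eye(50)

logdet = np.linalg.slogdet(A)[1]   # log(Det(A)) via a standard factorization
tr_log = np.trace(logm(A)).real    # Tr(log(A)) via the dense matrix logarithm

print(logdet, tr_log)              # the two values agree to numerical precision
```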

2.1 STOCHASTIC TRACE ESTIMATION

It is possible to obtain a stochastic estimate of the trace term such that the expectation of the estimate matches the term being approximated (Avron & Toledo, 2011). In this work, we shall consider the Gaussian estimator, whereby we introduce N_r vectors r^(i) sampled from an independently and identically distributed zero-mean and unit-variance Gaussian distribution. This yields the unbiased estimate

$$\mathrm{Tr}(A) \approx \frac{1}{N_r} \sum_{i=1}^{N_r} r^{(i)\top} A\, r^{(i)}. \qquad (2)$$

Note that more sophisticated trace estimators (see Fitzsimons et al., 2016) may also be considered; without loss of generality, we opt for a more straightforward approach in order to preserve clarity.

2.2 TAYLOR APPROXIMATION

Against the backdrop of machine learning applications, in this work we predominantly consider covariance matrices taking the form of a Gram matrix K = {κ(x_i, x_j)}_{i,j=1,...,n}, where the kernel function κ implicitly induces a feature space representation of data points x. Assume K has been normalized such that the maximum eigenvalue is less than or equal to one, λ_0 ≤ 1, where the largest eigenvalue can be efficiently found using Gershgorin intervals (Gershgorin, 1931). Given that covariance matrices are positive semidefinite, we also know that the smallest eigenvalue is bounded by zero, λ_n ≥ 0. Motivated by the identity presented in (1), the Taylor series expansion (Barry & Pace, 1999; Zhang & Leithead, 2007) may be employed for evaluating the log determinant of matrices having eigenvalues bounded between zero and one. In particular, this approach relies on the following logarithm identity,

$$\log(I - A) = -\sum_{k=1}^{\infty} \frac{A^k}{k}. \qquad (3)$$

While the infinite summation is not explicitly computable in finite time, it may be approximated by computing a truncated series instead. Furthermore, given that the trace of matrices is additive, we find

$$\mathrm{Tr}(\log(I - A)) \approx -\sum_{k=1}^{m} \frac{\mathrm{Tr}(A^k)}{k}. \qquad (4)$$

The Tr(A^k) term can be computed efficiently and recursively by propagating O(n^2) vector-matrix multiplications in a stochastic trace estimation scheme. To compute Tr(log(K)) we simply set A = I − K.

There are two sources of error associated with this approach: the first due to stochastic trace estimation, and the second due to truncation of the Taylor series. In the case of covariance matrices, the smallest eigenvalue tends to be very small, which can be verified by Weyl (1912) and Silverstein (1986)'s observations on the eigenspectra of covariance matrices. This leads to A^k decaying slowly as k → ∞.

In light of the above, standard Taylor approximations to the log determinant of covariance matrices are typically unreliable, even when the exact traces of matrix powers are available. This can be verified analytically based on results from kernel theory, which state that the approximate rate of decay for the eigenvalues of positive definite kernels which are ν-continuous is O(n^{−ν−0.5}) (Weyl, 1912; Wathen & Zhu, 2015). Combining this result with the absolute error, E(λ), of the truncated Taylor approximation, we find

$$\mathbb{E}[E(\lambda)] = O\!\left(\int_0^1 \lambda^{\nu+0.5}\Big(\log(\lambda) - \sum_{j=1}^{m}\frac{\lambda^j}{j}\Big)\, d\lambda\right) = O\!\left(\int_0^1 \lambda^{\nu+0.5}\sum_{j=m}^{\infty}\frac{\lambda^j}{j}\, d\lambda\right) = O\!\left(\frac{\Psi^{(0)}(m+\nu+1.5) - \Psi^{(0)}(m)}{\nu + 1.5}\right),$$

where Ψ^(0)(·) is the Digamma function. In Figure 1, we plot the relationship between the order of the Taylor approximation and the expected absolute error. It can be observed that, irrespective of the continuity of the kernel, the error converges at a rate of O(n^{−1}).

Figure 1: Expected absolute error of the truncated Taylor series for stationary ν-continuous kernel matrices (shown for ν = 1, 10, 20, 30, 40 and 50). The dashed grey lines indicate O(n^{−1}).
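To make the interplay between (2) and (4) concrete, the following sketch (our own illustrative code, not the authors' implementation) estimates Tr(log(K)) for a matrix whose eigenvalues already lie in (0, 1], propagating A^k r with A = I − K so that A is never formed explicitly.

```python
import numpy as np

def taylor_logdet(K, order=30, n_probes=30, seed=0):
    """Truncated-Taylor estimate of log(Det(K)) = Tr(log(K)), combining (2) and (4).

    Assumes K is symmetric PSD with eigenvalues in (0, 1]; internally A = I - K,
    and A^k r is built recursively via repeated matrix-vector products.
    """
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    total = 0.0
    for _ in range(n_probes):
        r = rng.standard_normal(n)       # Gaussian probe vector of eq. (2)
        v = r.copy()                     # v holds A^k r
        for k in range(1, order + 1):
            v = v - K @ v                # A v = (I - K) v
            total -= (r @ v) / k         # accumulates -r^T A^k r / k, as in eq. (4)
    return total / n_probes

# Toy example: a matrix with known eigenvalues well inside (0, 1).
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((200, 200)))
eigs = np.linspace(0.05, 0.95, 200)
K = (Q * eigs) @ Q.T
print(taylor_logdet(K, order=60), np.sum(np.log(eigs)))
```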
3 THE PROBABILISTIC NUMERICS APPROACH

We now propose a probabilistic numerics (Hennig et al., 2015) approach: we reframe a numerical computation (in this case, trace estimation) as probabilistic inference. Probabilistic numerics usually requires distinguishing: an appropriate latent function; the data; and the ultimate object of interest. Given the data, a posterior distribution is calculated for the object of interest. For instance, in numerical integration, the latent function is the integrand, f, the data are evaluations of the integrand, f(x), and the object of interest is the value of the integral, ∫ f(x) p(x) dx (see § 3.1.1 for more details). In this work, our latent function is the distribution of eigenvalues of A, the data are noisy observations of Tr(A^k), and the object of interest is log(Det(K)). For this object of interest, we are able to provide both an expected value and a variance. That is, although the Taylor approximation to the log determinant may be considered unsatisfactory, the intermediate trace terms obtained when raising the matrix to higher powers may prove to be informative if considered as observations within a probabilistic model.

3.1 RAW MOMENT OBSERVATIONS

We wish to model the eigenvalues of A from noisy observations of Tr(A^k) obtained through stochastic trace estimation, with the ultimate goal of making predictions on the infinite series of the Taylor expansion. Let us assume that the eigenvalues are i.i.d. random variables drawn from P(λ_i = x), a probability distribution over x ∈ [0, 1]. In this setting Tr(A) = n E_x[P(λ_i = x)], and more generally Tr(A^k) = n R_x^(k)[P(λ_i = x)], where R_x^(k) is the k-th raw moment over the x domain. The raw moments can thus be computed as

$$R_x^{(k)}[P(\lambda_i = x)] = \int_0^1 x^k\, P(\lambda_i = x)\, dx. \qquad (5)$$

Such a formulation is appealing because, if P(λ_i = x) is modeled as a Gaussian process, the required integrals may be solved analytically using Bayesian Quadrature.

3.1.1 Bayesian Quadrature

Gaussian processes (GPs; Rasmussen & Williams, 2006) are a powerful Bayesian inference method defined over functions X → R, such that the distribution of functions over any finite subset of the input points X = {x_1, ..., x_n} is a multivariate Gaussian distribution. Under this framework, the moments of the conditional Gaussian distribution for a set of predictive points, given a set of labels y = (y_1, ..., y_n)^T, may be computed as

$$\mu = \mu_0 + K_*^\top K^{-1}(y - \mu_0), \qquad (6)$$

$$\Sigma = K_{*,*} - K_*^\top K^{-1} K_*, \qquad (7)$$

with μ and Σ denoting the posterior mean and variance, and K being the n × n covariance matrix for the observed variables {x_i, y_i; i ∈ (1, 2, ..., n)}. The latter is computed as κ(x, x') for any pair of points x, x' ∈ X. Meanwhile, K_* and K_{*,*} respectively denote the covariance between the observable and the predictive points, and the prior over the predicted points. Note that μ_0, the prior mean, may be set to zero without loss of generality.

Bayesian Quadrature (BQ; O'Hagan, 1991) is primarily concerned with performing integration of potentially intractable functions. In this work, we limit our discussion to the setting where the integrand is modeled as a GP,

$$\int p(x)\, f(x)\, dx, \qquad f \sim \mathcal{GP}(\mu, \Sigma),$$

where p(x) is some measure with respect to which we are integrating. A full discussion of BQ may be found in O'Hagan (1991) and Rasmussen & Ghahramani (2002); for the sake of conciseness, we only state the result that the integrals may be computed by integrating the covariance function with respect to p(x), for both K_*,

$$\kappa\!\left(\int \cdot\, dx,\ x'\right) = \int p(x)\, \kappa(x, x')\, dx,$$

and K_{*,*},

$$\kappa\!\left(\int \cdot\, dx,\ \int \cdot\, dx'\right) = \iint p(x)\, \kappa(x, x')\, p(x')\, dx\, dx'.$$
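For reference, (6) and (7) are the standard GP prediction equations. A minimal implementation is sketched below; it is our own illustration, with a Cholesky solve and a jitter term added purely for numerical stability rather than being part of the authors' specification.

```python
import numpy as np

def gp_posterior(K, K_star, K_star_star, y, mu0=0.0, jitter=1e-10):
    """Posterior mean and covariance of a GP, following eqs. (6) and (7).

    K           : (n, n) covariance of the observed variables
    K_star      : (n, m) cross-covariance between observed and predictive points
    K_star_star : (m, m) prior covariance of the predictive points
    """
    n = K.shape[0]
    L = np.linalg.cholesky(K + jitter * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - mu0))  # K^{-1} (y - mu0)
    mean = mu0 + K_star.T @ alpha                              # eq. (6)
    V = np.linalg.solve(L, K_star)                             # L^{-1} K_*
    cov = K_star_star - V.T @ V                                # eq. (7)
    return mean, cov
```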

3.2 KERNELS FOR RAW MOMENTS AND INFERENCE ON THE LOG DETERMINANT

Recalling (5), if P(λ_i = x) is modeled using a GP, then in order to include observations of R_x^(k)[P(λ_i = x)], denoted as R_x^(k), we must be able to integrate the kernel with respect to the polynomial in x,

$$\kappa\!\left(R_x^{(k)},\ x'\right) = \int_0^1 x^k\, \kappa(x, x')\, dx, \qquad (8)$$

$$\kappa\!\left(R_x^{(k)},\ R_{x'}^{(k')}\right) = \int_0^1\!\!\int_0^1 x^k\, \kappa(x, x')\, x'^{k'}\, dx\, dx'. \qquad (9)$$

Although the integrals described above are typically analytically intractable, certain kernels have an elegant analytic form which allows for efficient computation. In this section, we derive the raw moment observations for a histogram kernel, and demonstrate how estimates of the log determinant can be obtained. An alternate polynomial kernel is described in Appendix A.

3.2.1 Histogram Kernel

The entries of the histogram kernel, also known as the piecewise constant kernel, are given by κ(x, x') = Σ_{j=0}^{m−1} H(j/m, (j+1)/m, x, x'), where

$$H\!\left(\frac{j}{m}, \frac{j+1}{m}, x, x'\right) = \begin{cases} 1 & x, x' \in \left[\frac{j}{m}, \frac{j+1}{m}\right], \\ 0 & \text{otherwise.} \end{cases}$$
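A direct transcription of the piecewise-constant kernel above might read as follows; this is an illustrative sketch of ours (the bin count m is a hyperparameter, and the half-open binning convention at bin boundaries is a negligible implementation choice).

```python
import numpy as np

def histogram_kernel(x, x_prime, m=10):
    """Piecewise-constant (histogram) kernel on [0, 1] with m equal bins.

    Returns 1 when both arguments fall in the same bin [j/m, (j+1)/m), else 0.
    """
    j = np.minimum(np.floor(np.asarray(x, dtype=float) * m), m - 1)
    j_prime = np.minimum(np.floor(np.asarray(x_prime, dtype=float) * m), m - 1)
    return (j == j_prime).astype(float)

print(histogram_kernel(0.12, 0.19), histogram_kernel(0.12, 0.31))  # 1.0 0.0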

Covariances between raw moments may be computed as follows:

$$\kappa\!\left(R_x^{(k)},\ x'\right) = \int_0^1 x^k\, \kappa(x, x')\, dx = \frac{1}{k+1}\left(\left(\frac{j+1}{m}\right)^{k+1} - \left(\frac{j}{m}\right)^{k+1}\right), \qquad (10)$$

where in the above x' lies in the interval [j/m, (j+1)/m]. Extending this to the covariance function between raw moments, we have

$$\kappa\!\left(R_x^{(k)},\ R_{x'}^{(k')}\right) = \int_0^1\!\!\int_0^1 x^k\, x'^{k'}\, \kappa(x, x')\, dx\, dx' = \sum_{j=0}^{m-1}\ \prod_{\bar{k} \in \{k, k'\}} \frac{1}{\bar{k}+1}\left(\left(\frac{j+1}{m}\right)^{\bar{k}+1} - \left(\frac{j}{m}\right)^{\bar{k}+1}\right). \qquad (11)$$

This simple kernel formulation between observations of the raw moments compactly allows us to perform inference over P(λ_i = x). However, the ultimate goal is to predict log(Det(K)), and hence Σ_{k=1}^∞ Tr(A^k)/k. This requires a seemingly more complex set of kernel expressions; nevertheless, by propagating the implied infinite summations into the kernel function, we can also obtain closed-form solutions for these terms,

$$\kappa\!\left(\sum_{k=1}^{\infty} \frac{R_x^{(k)}}{k},\ R_{x'}^{(k')}\right) = \sum_{j=0}^{m-1} \frac{1}{k'+1}\left(\left(\frac{j+1}{m}\right)^{k'+1} - \left(\frac{j}{m}\right)^{k'+1}\right)\left(S\!\left(\frac{j+1}{m}\right) - S\!\left(\frac{j}{m}\right)\right), \qquad (12)$$

$$\kappa\!\left(\sum_{k=1}^{\infty} \frac{R_x^{(k)}}{k},\ \sum_{k'=1}^{\infty} \frac{R_{x'}^{(k')}}{k'}\right) = \sum_{j=0}^{m-1} \left(S\!\left(\frac{j+1}{m}\right) - S\!\left(\frac{j}{m}\right)\right)^2, \qquad (13)$$

where S(α) = Σ_{k=1}^∞ α^{k+1} / (k(k+1)), which has the convenient identity for 0 < α < 1,

$$S(\alpha) = \alpha + (1 - \alpha)\log(1 - \alpha).$$

Following the derivations presented above, we can finally go about computing the prediction for the log determinant, and its corresponding variance, using the GP posterior equations given in (6) and (7). This can be achieved by replacing the terms K_* and K_{*,*} with the constructions presented in (12) and (13), respectively. The entries of K are filled in using (11), whereas y denotes the noisy observations of Tr(A^k).
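The closed forms (10)–(13) are simple to tabulate. The sketch below is our own illustration (all function and variable names are ours); it builds the per-bin quantities and assembles the blocks that would fill K, K_* and k_{*,*} for a chosen number of moment observations.

```python
import numpy as np

def S(alpha):
    """S(alpha) = sum_{k>=1} alpha^(k+1) / (k(k+1)) = alpha + (1 - alpha) log(1 - alpha),
    extended to S(1) = 1 by continuity."""
    alpha = np.asarray(alpha, dtype=float)
    out = np.ones_like(alpha)
    inside = alpha < 1.0
    out[inside] = alpha[inside] + (1.0 - alpha[inside]) * np.log1p(-alpha[inside])
    return out

def moment_kernel_blocks(M, m=10):
    """Histogram-kernel Gram blocks for raw-moment observations, eqs. (11)-(13).

    M : number of observed Taylor orders (k = 1, ..., M)
    m : number of histogram bins on [0, 1]
    Returns K of shape (M, M), K_star of shape (M,), and the scalar k_star_star.
    """
    edges = np.arange(m + 1) / m                       # bin edges j/m
    def seg(k):                                        # per-bin factor of eq. (10)
        return (edges[1:] ** (k + 1) - edges[:-1] ** (k + 1)) / (k + 1)
    dS = S(edges[1:]) - S(edges[:-1])                  # S((j+1)/m) - S(j/m), per bin

    K = np.array([[np.sum(seg(k) * seg(kp)) for kp in range(1, M + 1)]
                  for k in range(1, M + 1)])           # eq. (11)
    K_star = np.array([np.sum(seg(kp) * dS) for kp in range(1, M + 1)])  # eq. (12)
    k_star_star = np.sum(dS ** 2)                      # eq. (13)
    return K, K_star, k_star_star

K, K_star, k_ss = moment_kernel_blocks(M=10, m=20)
print(K.shape, K_star.shape, float(k_ss))
```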

3.2.2 Prior Mean Function

While GPs, and in this case BQ, can be applied with a zero-mean prior without loss of generality, it is often beneficial to have a mean function as an initial starting point. If P(λ_i = x) is composed of a constant mean function g(λ_i = x), and a GP is used to model the residual, we have that

$$P(\lambda_i = x) = g(\lambda_i = x) + f(\lambda_i = x).$$

The previously derived moment observations may then be decomposed into

$$\int x^k\, P(\lambda_i = x)\, dx = \int x^k\, g(\lambda_i = x)\, dx + \int x^k\, f(\lambda_i = x)\, dx. \qquad (14)$$

Due to the domain of P(λ_i = x) lying between zero and one, we set a beta distribution as the prior mean, which has some convenient properties. First, it is fully specified by the mean and variance of the distribution, which can be computed using the trace and Frobenius norm of the matrix. Secondly, the r-th raw moment of a Beta distribution parameterized by α and β is

$$R_x^{(r)}[g(\lambda_i = x)] = \prod_{j=0}^{r-1} \frac{\alpha + j}{\alpha + \beta + j},$$

which is straightforward to compute.

In consequence, the expectation of the logarithm of random variables and, hence, the 'prior' log determinant yielded by g(λ_i = x) can be computed as

$$\mathbb{E}[\log(X);\ X \sim g(\lambda_i = x)] = \Psi(\alpha) - \Psi(\alpha + \beta). \qquad (15)$$

This can then simply be added to the previously derived GP expectation of the log determinant.

3.2.3 Using Bounds on the Log Determinant

As with most GP specifications, there are hyperparameters associated with the prior and the kernel. The optimal settings for these parameters may be obtained via optimization of the standard GP log marginal likelihood, defined as

$$\mathrm{LML}_{\mathrm{GP}} = -\frac{1}{2} y^\top K^{-1} y - \frac{1}{2}\log(\mathrm{Det}(K)) + \mathrm{const}.$$

Borrowing from the literature on bounds for the log determinant of a matrix, as described in Appendix B, we can also exploit such upper and lower bounds to truncate the resulting GP distribution to the relevant domain, which is expected to greatly improve the predicted log determinant. These additional constraints can then be propagated to the hyperparameter optimization procedure by incorporating them into the likelihood function via the product rule, as follows:

$$\mathrm{LML} = \mathrm{LML}_{\mathrm{GP}} + \log\!\left(\Phi\!\left(\frac{a - \hat{\mu}}{\hat{\sigma}}\right) - \Phi\!\left(\frac{b - \hat{\mu}}{\hat{\sigma}}\right)\right),$$

with a and b representing the upper and lower log determinant bounds respectively, μ̂ and σ̂ representing the posterior mean and standard deviation, and Φ(·) representing the Gaussian cumulative density function. Priors on the hyperparameters may be accounted for in a similar way.
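Both the truncation of the Gaussian posterior to known bounds and the corresponding correction to the marginal likelihood can be computed with standard truncated-normal formulas. The sketch below is ours and uses scipy; it names the bounds explicitly as lower and upper rather than the paper's b and a.

```python
import numpy as np
from scipy.stats import norm, truncnorm

def truncate_posterior(mu, sigma, lower, upper):
    """Mean and standard deviation of N(mu, sigma^2) truncated to [lower, upper]."""
    a, b = (lower - mu) / sigma, (upper - mu) / sigma   # standardized bounds
    dist = truncnorm(a, b, loc=mu, scale=sigma)
    return dist.mean(), dist.std()

def bound_penalty(mu, sigma, lower, upper):
    """log(Phi((upper - mu)/sigma) - Phi((lower - mu)/sigma)), the term added to LML_GP."""
    return np.log(norm.cdf((upper - mu) / sigma) - norm.cdf((lower - mu) / sigma))

mean_t, std_t = truncate_posterior(mu=-120.0, sigma=15.0, lower=-150.0, upper=-100.0)
print(mean_t, std_t, bound_penalty(-120.0, 15.0, -150.0, -100.0))
```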

3.2.4 Algorithm Complexity and Recap

Due to its cubic complexity, GP inference is typically considered detrimental to the scalability of a model. However, in our formulation, the GP is only being applied to the noisy observations of Tr(A^k), which rarely exceed the order of tens of points. As a result, given that we assume this to be orders of magnitude smaller than the dimensionality n of the matrix K, the computational complexity is dominated by the matrix-vector operations involved in stochastic trace estimation, i.e. O(n^2) for dense matrices and O(ns) for s-sparse matrices.

The steps involved in the procedure described within this section are summarized as pseudo-code in Algorithm 1. The input matrix A is first normalized by using Gershgorin intervals to find the largest eigenvalue (line 1), and the expected bounds on the log determinant (line 2) are calculated using matrix theory (Appendix B). The noisy Taylor observations up to an expansion order M (lines 3-4), denoted here as y, are then obtained through stochastic trace estimation, as described in § 2.2. These can be modeled using a GP, where the entries of the kernel matrix K (lines 5-7) are computed using (11). The kernel parameters are then tuned as per § 3.2.3 (line 8). Recall that we seek to make a prediction for the infinite Taylor expansion, and hence the exact log determinant. To this end, we must compute K_* (lines 9-10) and k_{*,*} (line 11) using (12) and (13), respectively. The posterior mean and variance (line 12) may then be evaluated by filling in (6) and (7). As outlined in the previous section, the resulting posterior distribution can be truncated using the derived bounds to obtain the final estimates for the log determinant and its uncertainty (line 13).

Algorithm 1 Computing the log determinant and its uncertainty using probabilistic numerics

Input: PSD matrix A ∈ R^{n×n}, raw-moments kernel κ, expansion order M, and random vectors Z
Output: Posterior mean MTRN and uncertainty VTRN

 1: A ← NORMALIZE(A)
 2: BOUNDS ← GETBOUNDS(A)
 3: for i ← 1 to M do
 4:     y_i ← STOCHASTICTAYLOROBS(A, i, Z)
 5: for i ← 1 to M do
 6:     for j ← 1 to M do
 7:         K_ij ← κ(i, j)
 8: κ, K ← TUNEKERNEL(K, y, BOUNDS)
 9: for i ← 1 to M do
10:     K_{*,i} ← κ(*, i)
11: k_{*,*} ← κ(*, *)
12: MEXP, VEXP ← GPPRED(y, K, K_*, k_{*,*})
13: MTRN, VTRN ← TRUNC(MEXP, VEXP, BOUNDS)
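Line 1 of Algorithm 1 (NORMALIZE) only requires an upper bound on the largest eigenvalue, for which a Gershgorin-style bound is one cheap option. The sketch below is our own illustration of this step; rescaling changes the log determinant by a known additive constant, which is simply added back at the end.

```python
import numpy as np

def gershgorin_upper_bound(A):
    """Upper bound on the largest eigenvalue from Gershgorin discs:
    max_i (A_ii + sum_{j != i} |A_ij|)."""
    radii = np.sum(np.abs(A), axis=1) - np.abs(np.diag(A))
    return np.max(np.diag(A) + radii)

def normalize(A):
    """Rescale A so that its spectrum lies in (0, 1].

    Since log Det(A / c) = log Det(A) - n log c, the constant n log c is
    added back to any estimate computed on the normalized matrix.
    """
    c = gershgorin_upper_bound(A)
    return A / c, A.shape[0] * np.log(c)

rng = np.random.default_rng(0)
B = rng.standard_normal((100, 100))
A = B @ B.T + np.eye(100)
A_hat, offset = normalize(A)
print(np.linalg.slogdet(A_hat)[1] + offset, np.linalg.slogdet(A)[1])  # identical values
```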

4 EXPERIMENTS

In this section, we show how the appeal of this formulation extends beyond its intrinsic novelty, whereby we also consistently obtain performance improvements over competing techniques. We set up a variety of experiments for assessing the model performance, including both synthetically constructed and real matrices. Given the model's probabilistic formulation, we also assess the quality of the uncertainty estimates yielded by the model. We conclude by demonstrating how this approach may be fitted within a practical learning scenario.

We compare our approach against several other estimations of the log determinant, namely approximations based on Taylor expansions, Chebyshev expansions and stochastic Lanczos quadrature. The Taylor approximation has already been introduced in § 2.2, and we briefly describe the others below.

Figure 2: Empirical performance over 6 covariance matrices described in § 4.1. The right figure displays the log eigenspectrum of the matrices and their respective indices. The left figure displays the relative performance of the algorithms for the stochastic trace estimation order set to 5, 25 and 50 (from left to right respectively).

Chebyshev Expansions: This approach utilizes the m-degree Chebyshev polynomial approximation to the function log(I − A) (Han et al., 2015; Boutsidis et al., 2015; Peng & Wang, 2015),

$$\mathrm{Tr}(\log(I - A)) \approx \sum_{k=0}^{m} c_k\, \mathrm{Tr}(T_k(A)), \qquad (16)$$

where T_k(A) = 2A T_{k−1}(A) − T_{k−2}(A), starting with T_0(A) = I and T_1(A) = A, and c_k is defined as

$$c_k = \frac{2}{n+1}\sum_{i=0}^{n} \log(1 - x_i)\, T_k(x_i), \qquad x_i = \cos\!\left(\frac{\pi\left(i + \tfrac{1}{2}\right)}{n+1}\right). \qquad (17)$$

The Chebyshev approximation is appealing as it gives the best m-degree polynomial approximation of log(1 − x) under the L∞-norm. The error induced by general Chebyshev polynomial approximations has also been thoroughly investigated (Han et al., 2015).
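The coefficients in (17) are those of Chebyshev interpolation at the Chebyshev nodes. The small sketch below (ours, purely illustrative) computes them for the scalar function f(x) = log(1 − x) and checks the resulting expansion pointwise; note that the k = 0 coefficient conventionally carries an extra factor of 1/2, which the compressed statement of (17) elides.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def chebyshev_coeffs(m, n):
    """Coefficients c_k of eq. (17) for f(x) = log(1 - x), k = 0, ..., m.
    The k = 0 coefficient carries the conventional extra factor of 1/2."""
    i = np.arange(n + 1)
    x = np.cos(np.pi * (i + 0.5) / (n + 1))   # Chebyshev nodes; all strictly less than 1
    f = np.log(1.0 - x)
    T = C.chebvander(x, m)                    # T[i, k] = T_k(x_i)
    c = 2.0 / (n + 1) * (T.T @ f)
    c[0] /= 2.0
    return c

c = chebyshev_coeffs(m=30, n=200)
grid = np.linspace(-1.0, 0.9, 5)
print(C.chebval(grid, c) - np.log(1.0 - grid))   # pointwise approximation error
```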

Stochastic Lanczos Quadrature: This approach (Ubaru et al., 2016) relies on stochastic trace estimation to approximate the trace using the identity presented in (1). If we consider the eigendecomposition of matrix A into QΛQ^T, the quadratic form in the equation becomes

$$r^{(i)\top}\log(A)\, r^{(i)} = r^{(i)\top} Q \log(\Lambda)\, Q^\top r^{(i)} = \sum_{k=1}^{n} \log(\lambda_k)\, \mu_k^2,$$

where μ_k denotes the individual components of Q^T r^(i). By transforming this term into a Riemann–Stieltjes integral ∫_a^b log(t) dμ(t), where μ(t) is a piecewise constant function (Ubaru et al., 2016), we can approximate it as

$$\int_a^b \log(t)\, d\mu(t) \approx \sum_{j=0}^{m} \omega_j \log(\theta_j),$$

where m is the degree of the approximation, while the sets of ω and θ are the parameters to be inferred using Gauss quadrature. It turns out that these parameters may be computed analytically using the eigendecomposition of the low-rank tridiagonal transformation of A obtained using the Lanczos algorithm (Paige, 1972). Denoting the resulting eigenvalues and eigenvectors by θ and y respectively, the quadratic form may finally be evaluated as

$$r^{(i)\top}\log(A)\, r^{(i)} \approx \sum_{j=0}^{m} \tau_j^2 \log(\theta_j), \qquad (18)$$

with τ_j = [e_1^T y_j].
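A compact and deliberately simplified sketch of stochastic Lanczos quadrature is given below: for each probe vector we run a few Lanczos iterations, eigendecompose the small tridiagonal matrix, and apply (18). This is our own illustration (Gaussian probes, no reorthogonalization), not the reference implementation of Ubaru et al. (2016).

```python
import numpy as np

def lanczos(A, v, m):
    """m steps of Lanczos on symmetric A starting from v.
    Returns the (m, m) tridiagonal matrix T. No reorthogonalization (sketch only)."""
    n = len(v)
    alpha, beta = np.zeros(m), np.zeros(m - 1)
    q_prev, q = np.zeros(n), v / np.linalg.norm(v)
    b = 0.0
    for j in range(m):
        w = A @ q - b * q_prev
        alpha[j] = q @ w
        w -= alpha[j] * q
        if j < m - 1:
            b = np.linalg.norm(w)
            beta[j] = b
            q_prev, q = q, w / b
    return np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)

def slq_logdet(A, m=30, n_probes=30, seed=0):
    """Stochastic Lanczos quadrature estimate of Tr(log(A)), via eq. (18)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    total = 0.0
    for _ in range(n_probes):
        r = rng.standard_normal(n)
        T = lanczos(A, r, m)
        theta, Y = np.linalg.eigh(T)        # Ritz values and vectors
        tau2 = Y[0, :] ** 2                 # (e_1^T y_j)^2
        total += (r @ r) * np.sum(tau2 * np.log(theta))
    return total / n_probes

rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((300, 300)))
eigs = np.linspace(0.01, 2.0, 300)
A = (Q * eigs) @ Q.T
print(slq_logdet(A), np.sum(np.log(eigs)))
```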

4.1 SYNTHETICALLY CONSTRUCTED MATRICES

Previous work on estimating log determinants has implied that the performance of any given method is closely tied to the shape of the eigenspectrum of the matrix under review. As such, we set up an experiment for assessing the performance of each technique when applied to synthetically constructed matrices whose eigenvalues decay at different rates. Given that the computational complexity of each method is dominated by the number of matrix-vector products (MVPs) incurred, we also illustrate the progression of each technique for an increasing allowance of MVPs. All matrices are constructed using a Gaussian kernel evaluated over 1000 input points.

As illustrated in Figure 2, the estimates returned by our approach are consistently on par with (and frequently superior to) those obtained using other methods. For matrices having slowly-decaying eigenvalues, standard Chebyshev and Taylor approximations fare quite poorly, whereas SLQ and our approach both yield comparable results. The results become more homogeneous across methods for faster-decaying eigenspectra, but our method is frequently among the top two performers. For our approach, it is also worth noting that truncating the GP using known bounds on the log determinant indeed results in superior posterior estimates. This is particularly evident when the eigenvalues decay very rapidly. Somewhat surprisingly, the performance does not seem to be greatly affected by the number of budgeted MVPs.

Figure 3: Methods compared on a variety of UFL Sparse Datasets: thermomech_TC (d=102,158), bonesS01 (d=127,224), ecology2 (d=999,999) and thermal2 (d=1,228,045). Each dataset was run with the matrix approximately raised to the power of 5, 10, 15, 20, 25 and 30 (left to right) using stochastic trace estimation.

Figure 4: Quality of uncertainty estimates on the UFL datasets, measured as the ratio of the absolute error to the predicted standard deviation. As before, results are shown for increasing computational budgets (MVPs). The true value lay outside 2 standard deviations in only one of 24 trials.

4.2 UFL SPARSE DATASETS

Although we have so far limited our discussion to covariance matrices, our proposed method is amenable to any positive semi-definite matrix. To this end, we extend the previous experimental set-up to a selection of real, sparse matrices obtained from the SuiteSparse Matrix Collection (Davis & Hu, 2011). Following Ubaru et al. (2016), we list the true values of the log determinant reported in Boutsidis et al. (2015), and compare all other approaches to this baseline.

The results for this experiment are shown in Figure 3. Once again, the estimates obtained using our probabilistic approach achieve comparable accuracy to the competing techniques, and several improvements are noted for larger allowances of MVPs. As expected, the SLQ approach generally performs better than Taylor and Chebyshev approximations, especially for smaller computational budgets. Even so, our proposed technique consistently appears to have an edge across all datasets.

4.3 UNCERTAINTY QUANTIFICATION

One of the notable features of our proposal is the ability to quantify the uncertainty of the predicted log determinant, which can be interpreted as an indicator of the quality of the approximation. Given that none of the other techniques offer such insights to compare against, we assess the quality of the model's uncertainty estimates by measuring the ratio of the absolute error to the predicted standard deviation (uncertainty). For the latter to be meaningful, the error should ideally lie within only a few multiples of the standard deviation.

In Figure 4, we report this metric for our approach when using the histogram kernel. We carry out this evaluation over the matrices introduced in the previous experiment, once again showing how the performance varies for different MVP allowances. In all cases, the absolute error of the predicted log determinant is consistently bounded by at most twice the predicted standard deviation, which is very sensible for such a probabilistic model.

4.4 MOTIVATING EXAMPLE

Determinantal point processes (DPPs; Macchi, 1975) are stochastic point processes defined over subsets of data such that an established degree of repulsion is maintained. A DPP, P, over a discrete space y ∈ {1, ..., n} is a probability measure over all subsets of y such that

$$P(A \subseteq y) = \mathrm{Det}(K_A),$$

where K_A is a positive definite matrix having all eigenvalues less than or equal to 1. A popular method for modeling data via K is the L-ensemble approach (Borodin, 2009), which transforms kernel matrices, L, into an appropriate K,

$$K = (L + I)^{-1} L.$$

The goal of inference is to correctly parameterize L given observed subsets of y, such that the probability of unseen subsets can be accurately inferred in the future.

Given that the log-likelihood term of a DPP requires the log determinant of L, naïve computations of this term are intractable for large sample sizes. In this experiment, we demonstrate how our proposed approach can be employed for the purpose of parameter optimization for large-scale DPPs.
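At small scale the L-ensemble likelihood is exactly computable: the log-likelihood of an observed subset A is log Det(L_A) − log Det(L + I), and the normalization term is the large-matrix log determinant that a budget-constrained estimator such as ours would approximate in practice. The sketch below is our own toy illustration (a 1-D lattice stands in for the paper's [−1, 1]^5 grid, and all names are ours).

```python
import numpy as np

def gaussian_kernel(X, lengthscale):
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def l_ensemble_loglik(L, subset):
    """log P(Y = subset) = log Det(L_subset) - log Det(L + I) for an L-ensemble DPP."""
    L_A = L[np.ix_(subset, subset)]
    return np.linalg.slogdet(L_A)[1] - np.linalg.slogdet(L + np.eye(L.shape[0]))[1]

# Toy lattice: 200 evenly spaced points in [-1, 1]; the second log determinant
# is the term that becomes intractable at the scale of the paper's experiment.
X = np.linspace(-1, 1, 200)[:, None]
L = gaussian_kernel(X, lengthscale=0.1) + 1e-6 * np.eye(200)
subset = np.arange(0, 200, 20)                   # an observed subset of indices
print(l_ensemble_loglik(L, subset))
```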

In particular, we sample points from a DPP defined on a lattice over [−1, 1]^5, with one million points at uniform intervals. A Gaussian kernel with lengthscale parameter l is placed over these points, creating the true L. Subsets of the lattice points can be drawn by taking advantage of the tensor structure of L, and we draw five sets of 12,500 samples each. For a given selection of lengthscale options, the goal of this experiment is to confirm that the DPP likelihood of the obtained samples is indeed maximized when L is parameterized by the true lengthscale, l. As shown in Figure 5, the computed uncertainty allows us to derive a distribution over the true lengthscale which, despite using few matrix-vector multiplications, is very close to the optimal.

Figure 5: The rescaled negative log-likelihood (NLL) of the DPP with varying lengthscale (blue) and the probability of maximum likelihood (red). Cubic interpolation was used between inferred likelihood observations. Ten samples, z, were taken to polynomial order 30.

5 CONCLUSION

In a departure from conventional approaches for estimating the log determinant of a matrix, we propose a novel probabilistic framework which provides a Bayesian perspective on the literature of matrix theory and stochastic trace estimation. In particular, our approach enables the log determinant to be inferred from noisy observations of Tr(A^k) obtained from stochastic trace estimation. By modeling these observations using a GP, a posterior estimate for the log determinant may then be computed using Bayesian Quadrature. Our experiments confirm that the results obtained using this model are highly comparable to competing methods, with the additional benefit of measuring uncertainty.

We forecast that the foundations laid out in this work can be extended in various directions, such as exploring more kernels on the raw moments which permit tractable Bayesian Quadrature. The uncertainty quantified in this work is also a step closer towards fully characterizing the uncertainty associated with approximating large-scale kernel-based models.
References

Ambikasaran, S., Foreman-Mackey, D., Greengard, L., Hogg, D. W., and O'Neil, M. Fast Direct Methods for Gaussian Processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(2):252–265, 2016.

Anitescu, M., Chen, J., and Wang, L. A Matrix-free Approach for Solving the Parametric Gaussian Process Maximum Likelihood Problem. SIAM Journal on Scientific Computing, 34(1), 2012.

Aune, E., Simpson, D. P., and Eidsvik, J. Parameter Estimation in High Dimensional Gaussian Distributions. Statistics and Computing, 24(2):247–263, 2014.

Avron, H. and Toledo, S. Randomized Algorithms for Estimating the Trace of an Implicit Symmetric Positive Semi-definite Matrix. Journal of the ACM, 58(2):8:1–8:34, 2011.

Bai, Z. and Golub, G. H. Bounds for the Trace of the Inverse and the Determinant of Symmetric Positive Definite Matrices. Annals of Numerical Mathematics, 4:29–38, 1997.

Bardenet, R. and Titsias, M. K. Inference for Determinantal Point Processes Without Spectral Knowledge. In Proceedings of the 28th International Conference on Neural Information Processing Systems, pp. 3393–3401, 2015.

Barry, R. P. and Pace, R. K. Monte Carlo Estimates of the Log-Determinant of Large Sparse Matrices. Linear Algebra and its Applications, 289(1):41–54, 1999.

Borodin, A. Determinantal Point Processes. arXiv preprint arXiv:0911.1153, 2009.

Boutsidis, C., Drineas, P., Kambadur, P., and Zouzias, A. A Randomized Algorithm for Approximating the Log Determinant of a Symmetric Positive Definite Matrix. CoRR, abs/1503.00374, 2015.

Braun, M. L. Accurate Error Bounds for the Eigenvalues of the Kernel Matrix. Journal of Machine Learning Research, 7:2303–2328, December 2006.

Chen, J., Anitescu, M., and Saad, Y. Computing f(A)b via Least Squares Polynomial Approximations. SIAM Journal on Scientific Computing, 33(1):195–222, 2011.

Cutajar, K., Osborne, M., Cunningham, J., and Filippone, M. Preconditioning Kernel Matrices. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016.

Davis, J. V., Kulis, B., Jain, P., Sra, S., and Dhillon, I. S. Information-theoretic Metric Learning. In Proceedings of the Twenty-Fourth International Conference on Machine Learning (ICML 2007), Corvallis, Oregon, USA, June 20-24, 2007, pp. 209–216, 2007.

Davis, T. A. and Hu, Y. The University of Florida Sparse Matrix Collection. ACM Transactions on Mathematical Software (TOMS), 38(1):1, 2011.

Filippone, M. and Engler, R. Enabling Scalable Stochastic Gradient-based Inference for Gaussian Processes by Employing the Unbiased LInear System SolvEr (ULISSE). In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, July 6-11, 2015.

Fitzsimons, J. K., Osborne, M. A., Roberts, S. J., and Fitzsimons, J. F. Improved Stochastic Trace Estimation using Mutually Unbiased Bases. CoRR, abs/1608.00117, 2016.

Gershgorin, S. Über die Abgrenzung der Eigenwerte einer Matrix. Izvestija Akademii Nauk SSSR, Serija Matematika, 7(3):749–754, 1931.

Golub, G. H. and Van Loan, C. F. Matrix Computations. The Johns Hopkins University Press, 3rd edition, October 1996.

Han, I., Malioutov, D., and Shin, J. Large-scale Log-Determinant Computation through Stochastic Chebyshev Expansions. In Bach, F. R. and Blei, D. M. (eds.), Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, July 6-11, 2015.

Hennig, P., Osborne, M. A., and Girolami, M. Probabilistic Numerics and Uncertainty in Computations. Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 471(2179), 2015.

Hutchinson, M. A Stochastic Estimator of the Trace of the Influence Matrix for Laplacian Smoothing Splines. Communications in Statistics - Simulation and Computation, 19(2):433–450, 1990.

Ipsen, I. C. F. and Lee, D. J. Determinant Approximations, May 2011.

Macchi, O. The Coincidence Approach to Stochastic Point Processes. Advances in Applied Probability, 7:83–122, 1975.

Mackay, D. J. C. Information Theory, Inference and Learning Algorithms. Cambridge University Press, first edition, June 2003. ISBN 0521642981.

O'Hagan, A. Bayes-Hermite Quadrature. Journal of Statistical Planning and Inference, 29:245–260, 1991.

Paige, C. C. Computational Variants of the Lanczos Method for the Eigenproblem. IMA Journal of Applied Mathematics, 10(3):373–381, 1972.

Peng, H. and Qi, Y. EigenGP: Gaussian Process Models with Adaptive Eigenfunctions. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI'15, pp. 3763–3769. AAAI Press, 2015.

Peng, W. and Wang, H. Large-scale Log-Determinant Computation via Weighted L2 Polynomial Approximation with Prior Distribution of Eigenvalues. In International Conference on High Performance Computing and Applications, pp. 120–125. Springer, 2015.

Rasmussen, C. E. and Ghahramani, Z. Bayesian Monte Carlo. In Advances in Neural Information Processing Systems 15, NIPS 2002, December 9-14, 2002, Vancouver, British Columbia, Canada, pp. 489–496, 2002.

Rasmussen, C. E. and Williams, C. Gaussian Processes for Machine Learning. MIT Press, 2006.

Rue, H. and Held, L. Gaussian Markov Random Fields: Theory and Applications, volume 104 of Monographs on Statistics and Applied Probability. Chapman & Hall, London, 2005.

Rue, H., Martino, S., and Chopin, N. Approximate Bayesian Inference for Latent Gaussian Models by using Integrated Nested Laplace Approximations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2):319–392, 2009.

Saatçi, Y. Scalable Inference for Structured Gaussian Process Models. PhD thesis, University of Cambridge, 2011.

Silverstein, J. W. Eigenvalues and Eigenvectors of Large Dimensional Sample Covariance Matrices. Contemporary Mathematics, 50:153–159, 1986.

Stein, M. L., Chen, J., and Anitescu, M. Stochastic Approximation of Score Functions for Gaussian Processes. The Annals of Applied Statistics, 7(2):1162–1191, 2013. doi: 10.1214/13-AOAS627.

Ubaru, S., Chen, J., and Saad, Y. Fast Estimation of tr(f(A)) via Stochastic Lanczos Quadrature. 2016.

Wathen, A. J. and Zhu, S. On Spectral Distribution of Kernel Matrices related to Radial Basis Functions. Numerical Algorithms, 70(4):709–726, 2015.

Weyl, H. Das asymptotische Verteilungsgesetz der Eigenwerte linearer partieller Differentialgleichungen (mit einer Anwendung auf die Theorie der Hohlraumstrahlung). Mathematische Annalen, 71(4):441–479, 1912.

Wolkowicz, H. and Styan, G. P. Bounds for Eigenvalues using Traces. Linear Algebra and its Applications, 29:471–506, 1980.

Zhang, Y. and Leithead, W. E. Approximate Implementation of the Logarithm of the Matrix Determinant in Gaussian Process Regression. Journal of Statistical Computation and Simulation, 77(4):329–348, 2007.