
Bayesian Inference of Log Determinants

Jack Fitzsimons¹, Kurt Cutajar², Michael Osborne¹, Stephen Roberts¹, Maurizio Filippone²

¹ Information Engineering, University of Oxford, UK    ² Department of Data Science, EURECOM, France

Abstract

The log-determinant of a matrix appears in a variety of machine learning problems, ranging from determinantal point processes and generalized Markov random fields, through to the training of Gaussian processes. Exact calculation of this term is often intractable when the size of the kernel matrix exceeds a few thousands. In the spirit of probabilistic numerics, we reinterpret the problem of computing the log-determinant as a Bayesian inference problem. In particular, we combine prior knowledge in the form of bounds from matrix theory and evidence derived from stochastic trace estimation to obtain probabilistic estimates for the log-determinant and its associated uncertainty within a given computational budget. Beyond its novelty and theoretic appeal, the performance of our proposal is competitive with state-of-the-art approaches to approximating the log-determinant, while also quantifying the uncertainty due to budget-constrained evidence.

1 INTRODUCTION

Developing scalable learning models without compromising performance is at the forefront of machine learning research. The scalability of several learning models is predominantly hindered by linear algebraic operations having large computational complexity, among which is the computation of the log-determinant of a matrix (Golub & Van Loan, 1996). The latter term features heavily in the machine learning literature, with applications including spatial models (Aune et al., 2014; Rue & Held, 2005), kernel-based models (Davis et al., 2007; Rasmussen & Williams, 2006), and Bayesian learning (Mackay, 2003).

The standard approach for evaluating the log-determinant of a positive definite matrix involves the use of the Cholesky decomposition (Golub & Van Loan, 1996), which is employed in various applications of statistical models such as kernel machines. However, the use of the Cholesky decomposition for general dense matrices requires O(n^3) operations, whilst also entailing memory requirements of O(n^2). In view of this computational bottleneck, various models requiring the log-determinant for inference bypass the need to compute it altogether (Anitescu et al., 2012; Stein et al., 2013; Cutajar et al., 2016; Filippone & Engler, 2015).

Alternatively, several methods exploit sparsity and structure within the matrix itself to accelerate computations. For example, sparsity in Gaussian Markov Random Fields (GMRFs) arises from encoding conditional independence assumptions that are readily available when considering low-dimensional problems. For such matrices, the Cholesky decomposition can be computed in fewer than O(n^3) operations (Rue & Held, 2005; Rue et al., 2009). Similarly, Kronecker-based techniques may be employed for kernel matrices computed on regularly spaced inputs (Saatçi, 2011). While these ideas have proven successful for a variety of specific applications, they cannot be extended to the case of general dense matrices without assuming special forms or structures for the available data.

To this end, general approximations to the log-determinant frequently build upon stochastic trace estimation techniques using iterative methods (Avron & Toledo, 2011). Two of the most widely-used polynomial approximations for large-scale matrices are the Taylor and Chebyshev expansions (Aune et al., 2014; Han et al., 2015). A more recent approach draws from the possibility of estimating the trace of functions using stochastic Lanczos quadrature (Ubaru et al., 2016), which has been shown to outperform polynomial approximations from both a theoretic and empirical perspective.

Inspired by recent developments in the field of probabilistic numerics (Hennig et al., 2015), in this work we propose an alternative approach for calculating the log-determinant of a matrix by expressing this computation as a Bayesian quadrature problem. In doing so, we reformulate the problem of computing an intractable quantity into an estimation problem, where the goal is to infer the correct result using tractable computations that can be carried out within a given time budget. In particular, we model the eigenvalues of a matrix A from noisy observations of Tr(A^k) obtained through stochastic trace estimation using the Taylor approximation method (Zhang & Leithead, 2007). Such a model can then be used to make predictions on the infinite series of the Taylor expansion, yielding the estimated value of the log-determinant. Aside from permitting a probabilistic approach for predicting the log-determinant, this approach inherently yields uncertainty estimates for the predicted value, which in turn serves as an indicator of the quality of our approximation.

Our contributions are as follows.

1. We propose a probabilistic approach for computing the log-determinant of a matrix which blends different elements from the literature on estimating log-determinants under a Bayesian framework.

2. We demonstrate how bounds on the expected value of the log-determinant improve our estimates by constraining the probability distribution to lie between designated lower and upper bounds.

3. Through rigorous numerical experiments on synthetic and real data, we demonstrate how our method can yield superior approximations to competing approaches, while also having the additional benefit of uncertainty quantification.

4. Finally, in order to demonstrate how this technique may be useful within a practical scenario, we employ our method to carry out parameter selection for a large-scale determinantal point process.

To the best of our knowledge, this is the first time that the approximation of log-determinants is viewed as a Bayesian inference problem, with the resulting quantification of uncertainty being hitherto unexplored.

1.1 RELATED WORK

The most widely-used approaches for estimating log-determinants involve extensions of iterative algorithms, such as the Conjugate-Gradient and Lanczos methods, to obtain estimates of functions of matrices (Chen et al., 2011; Han et al., 2015) or their trace (Ubaru et al., 2016). The idea is to rewrite log-determinants as the trace of the logarithm of the matrix, and employ trace estimation techniques (Hutchinson, 1990) to obtain unbiased estimates of these. Chen et al. (2011) propose an iterative algorithm to efficiently compute the product of the logarithm of a matrix with a vector, which is achieved by computing a spline approximation to the logarithm function. A similar idea using Chebyshev polynomials has been developed by Han et al. (2015). Most recently, the Lanczos method has been extended to handle stochastic estimates of the trace and obtain probabilistic error bounds for the approximation (Ubaru et al., 2016). Blocking techniques, such as in Ipsen & Lee (2011) and Ambikasaran et al. (2016), have also been proposed.

In our work, we similarly strive to use a small number of matrix-vector products for approximating log-determinants. However, we show that by taking a Bayesian approach we can combine priors with the evidence gathered from the intermediate results of the matrix-vector products involved in the afore-mentioned methods to obtain more accurate results. Most importantly, our proposal has the considerable advantage that it provides a full distribution on the approximated value.

Our proposal allows for the inclusion of explicit bounds on log-determinants to constrain the posterior distribution over the estimated log-determinant (Bai & Golub, 1997). Nyström approximations can also be used to bound the log-determinant, as shown by Bardenet & Titsias (2015). Similarly, Gaussian processes (Rasmussen & Williams, 2006) have been formulated directly using the eigendecomposition of the spectrum, where eigenvectors are approximated using the Nyström method (Peng & Qi, 2015). There has also been work on estimating the distribution of kernel eigenvalues by analyzing the spectrum of linear operators (Braun, 2006; Wathen & Zhu, 2015), as well as bounds on the spectra of matrices with particular emphasis on deriving the largest eigenvalue (Wolkowicz & Styan, 1980; Braun, 2006). In this work, we directly consider bounds on the log-determinants of matrices (Bai & Golub, 1997).

2 BACKGROUND

As highlighted in the introduction, several approaches for approximating the log-determinant of a matrix rely on stochastic trace estimation for accelerating computations. This comes about as a result of the relationship between the log-determinant of a matrix and the corresponding trace of the log-matrix, whereby

log Det(A) = Tr(log(A)).    (1)

Provided the matrix log(A) can be efficiently sampled, this simple identity enables the use of stochastic trace estimation techniques (Avron & Toledo, 2011; Fitzsimons et al., 2016). We elaborate further on this concept below.

2.1 STOCHASTIC TRACE ESTIMATION

The standard approach for computing the trace term of a matrix A ∈ R^{n×n} involves summing the eigenvalues of the matrix. Obtaining the eigenvalues typically involves computational complexity of O(n^3), which is infeasible for large matrices. However, it is possible to obtain a stochastic estimate of the trace term such that the expectation of the estimate matches the term being approximated (Avron & Toledo, 2011). In this work, we shall consider the Gaussian estimator, whereby we introduce N_r vectors r^(i) sampled from an independently and identically distributed zero-mean and unit-variance Gaussian distribution. This yields the unbiased estimate

Tr(A) ≈ (1/N_r) Σ_{i=1}^{N_r} r^(i)ᵀ A r^(i).    (2)

Note that more sophisticated trace estimators (see Fitzsimons et al., 2016) may also be considered; without loss of generality, we opt for a more straightforward approach in order to preserve clarity.
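To make the estimator in (2) concrete, the following sketch (not taken from the authors' code; it assumes NumPy, an illustrative probe count, and a user-supplied matrix-vector product) averages the quadratic forms r^(i)ᵀ A r^(i) over Gaussian probe vectors:

```python
import numpy as np

def gaussian_trace_estimate(matvec, n, n_probes=30, seed=0):
    """Gaussian (Hutchinson-type) stochastic estimate of Tr(A), as in (2).

    matvec: callable returning A @ v, so A never needs to be formed explicitly.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_probes):
        r = rng.standard_normal(n)   # zero-mean, unit-variance probe vector
        total += r @ matvec(r)       # r^T A r
    return total / n_probes
```

Only matrix-vector products with A are required, which is what makes the estimator attractive for large or implicitly defined matrices.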

2.2 TAYLOR APPROXIMATION

Against the backdrop of machine learning applications, in this work we predominantly consider covariance matrices taking the form of a Gram matrix K = {κ(x_i, x_j)}_{i,j=1,...,n}, where the kernel function κ implicitly induces a feature space representation of data points x_i. Assume K has been normalized such that the maximum eigenvalue is less than or equal to one, λ_0 ≤ 1, where the largest eigenvalue can be efficiently found using Gershgorin intervals (Gershgorin, 1931). Given that covariance matrices are positive semidefinite, we also know that the smallest eigenvalue is bounded by zero, λ_n ≥ 0. Motivated by the identity presented in (1), the Taylor expansion (Barry & Pace, 1999; Zhang & Leithead, 2007) may be employed for evaluating the log-determinant of matrices having eigenvalues bounded between zero and one. In particular, this approach relies on the following logarithm identity,

log(I − A) = − Σ_{k=1}^∞ A^k / k.    (3)

While the infinite summation is not explicitly computable in finite time, this may be approximated by computing a truncated series instead. Furthermore, given that the trace of matrices is additive, we find

Tr(log(I − A)) ≈ − Σ_{k=1}^m Tr(A^k) / k.    (4)

The Tr(A^k) terms can be computed efficiently and recursively by propagating O(n^2) vector-matrix multiplications in a stochastic trace estimation scheme. To compute Tr(log(K)) we simply set A = I − K.
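Combining the truncated series (4) with the stochastic estimator (2) gives a simple baseline estimator of log Det(K). The sketch below is a minimal illustration under the stated normalization assumption (eigenvalues of K in (0, 1]); the truncation order and probe count are arbitrary choices, and the returned trace observations are the quantities reused as data later in the paper:

```python
import numpy as np

def taylor_logdet(K, order=30, n_probes=30, seed=0):
    """Truncated-Taylor estimate of log Det(K) via (3)-(4), with A = I - K."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    A = np.eye(n) - K
    trace_obs = np.zeros(order)          # stochastic estimates of Tr(A^k), k = 1..order
    for _ in range(n_probes):
        r = rng.standard_normal(n)
        v = r.copy()
        for k in range(order):
            v = A @ v                    # v = A^{k+1} r, built recursively
            trace_obs[k] += (r @ v) / n_probes
    logdet = -np.sum(trace_obs / np.arange(1, order + 1))   # Tr(log K) = -sum_k Tr(A^k)/k
    return logdet, trace_obs
```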

There are two sources of error associated with this approach; the first due to stochastic trace estimation, and the second due to truncation of the Taylor series. In the case of covariance matrices, the smallest eigenvalue tends to be very small, which can be verified by Weyl (1912) and Silverstein (1986)'s observations on the eigenspectra of covariance matrices. This leads to A^k decaying slowly as k → ∞.

In light of the above, standard Taylor approximations to the log-determinant of covariance matrices are typically unreliable, even when the exact traces of matrix powers are available. This can be verified analytically based on results from kernel theory, which state that the approximate rate of decay for the eigenvalues of positive definite kernels which are ν-continuous is O(n^{−ν−0.5}) (Weyl, 1912; Wathen & Zhu, 2015). Combining this result with the absolute error, E(λ), of the truncated Taylor approximation we find

E_λ[E(λ)] = O( ∫_0^1 λ^{ν+0.5} ( log(λ) − Σ_{j=1}^m λ^j / j ) dλ )
          = O( ∫_0^1 λ^{ν+0.5} Σ_{j=m}^∞ λ^j / j dλ )
          = O( ( Ψ^(0)(m + ν + 1.5) − Ψ^(0)(m) ) / (ν + 1.5) ),

where Ψ^(0)(·) is the digamma function. In Figure 1, we plot the relationship between the order of the Taylor approximation and the expected absolute error. It can be observed that irrespective of the continuity of the kernel, the error converges at a rate of O(n^{−1}).

Figure 1: Expected absolute error of the truncated Taylor series for stationary ν-continuous kernel matrices (ν = 1, 10, 20, 30, 40, 50), plotted against the order of truncation. The dashed grey lines indicate O(n^{−1}).

3 THE PROBABILISTIC NUMERICS APPROACH

We now propose a probabilistic numerics (Hennig et al., 2015) approach: we re-frame a numerical computation (in this case, trace estimation) as probabilistic inference. Probabilistic numerics usually requires distinguishing between an appropriate latent function, the data, and the ultimate object of interest. Given the data, a posterior distribution is calculated for the object of interest. For instance, in numerical integration, the latent function is the integrand, f, the data are evaluations of the integrand, f(x), and the object of interest is the value of the integral, ∫ f(x) p(x) dx (see §3.1.1 for more details). In this work, our latent function is the distribution of eigenvalues of A, the data are noisy observations of Tr(A^k), and the object of interest is log(Det(K)). For this object of interest, we are able to provide both an expected value and a variance. That is, although the Taylor approximation to the log-determinant may be considered unsatisfactory, the intermediate trace terms obtained when raising the matrix to higher powers may prove to be informative if considered as observations within a probabilistic model.

3.1 RAW MOMENT OBSERVATIONS

We wish to model the eigenvalues of A from noisy observations of Tr(A^k) obtained through stochastic trace estimation, with the ultimate goal of making predictions on the infinite series of the Taylor expansion. Let us assume that the eigenvalues are i.i.d. random variables drawn from P(λ_i = x), a probability distribution over x ∈ [0, 1]. In this setting Tr(A) = n E_x[P(λ_i = x)], and more generally Tr(A^k) = n R_x^(k)[P(λ_i = x)], where R_x^(k) is the k-th raw moment over the x domain. The raw moments can thus be computed as,

R_x^(k)[P(λ_i = x)] = ∫_0^1 x^k P(λ_i = x) dx.    (5)

Such a formulation is appealing because if P(λ_i = x) is modeled as a Gaussian process, the required integrals may be solved analytically using Bayesian Quadrature.

3.1.1 Bayesian Quadrature

Gaussian processes (GPs; Rasmussen & Williams, 2006) are a powerful Bayesian inference method defined over functions X → R, such that the distribution of functions over any finite subset of the input points X = {x_1, ..., x_n} is a multivariate Gaussian distribution. Under this framework, the moments of the conditional Gaussian distribution for a set of predictive points, given a set of labels y = (y_1, ..., y_n)ᵀ, may be computed as

μ = μ_0 + K_*ᵀ K^{−1} (y − μ_0),    (6)

Σ = K_{*,*} − K_*ᵀ K^{−1} K_*,    (7)

with μ and Σ denoting the posterior mean and variance, and K being the n × n covariance matrix for the observed variables {x_i, y_i; i ∈ (1, 2, ..., n)}. The latter is computed as κ(x, x') for any pair of points x, x' ∈ X. Meanwhile, K_* and K_{*,*} respectively denote the covariance between the observable and the predictive points, and the prior over the predicted points. Note that μ_0, the prior mean, may be set to zero without loss of generality.

Bayesian Quadrature (BQ; O'Hagan, 1991) is primarily concerned with performing integration of potentially intractable functions. In this work, we limit our discussion to the setting where the integrand is modeled as a GP,

∫ p(x) f(x) dx,    f ∼ GP(μ, Σ),

where p(x) is some measure with respect to which we are integrating. A full discussion of BQ may be found in O'Hagan (1991) and Rasmussen & Ghahramani (2002); for the sake of conciseness, we only state the result that the integrals may be computed by integrating the covariance function with respect to p(x), for both K_*,

κ(∫ x dx, x') = ∫ p(x) κ(x, x') dx,

and K_{*,*},

κ(∫ x dx, ∫ x' dx') = ∫∫ p(x) κ(x, x') p(x') dx dx'.
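For completeness, (6) and (7) can be evaluated with a standard Cholesky-based GP routine; the sketch below is generic GP code (the jitter term and the zero prior mean default are our assumptions), not code specific to this paper:

```python
import numpy as np

def gp_posterior(K, K_star, K_star_star, y, mu0=0.0, jitter=1e-10):
    """Posterior mean and covariance of a GP, following (6) and (7)."""
    n = K.shape[0]
    L = np.linalg.cholesky(K + jitter * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y - mu0))  # K^{-1} (y - mu0)
    mu = mu0 + K_star.T @ alpha
    v = np.linalg.solve(L, K_star)
    cov = K_star_star - v.T @ v                                # K_{*,*} - K_*^T K^{-1} K_*
    return mu, cov
```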

3.2 KERNELS FOR RAW MOMENTS AND INFERENCE ON THE LOG-DETERMINANT

Recalling (5), if P(λ_i = x) is modeled using a GP, in order to include observations of R_x^(k)[P(λ_i = x)], denoted as R_x^(k), we must be able to integrate the kernel with respect to the polynomial in x,

κ(R_x^(k), x') = ∫_0^1 x^k κ(x, x') dx,    (8)

κ(R_x^(k), R_{x'}^(k')) = ∫_0^1 ∫_0^1 x^k κ(x, x') x'^{k'} dx dx'.    (9)

Although the integrals described above are typically analytically intractable, certain kernels have an elegant analytic form which allows for efficient computation. In this section, we derive the raw moment observations for a histogram kernel, and demonstrate how estimates of the log-determinant can be obtained. An alternative polynomial kernel is described in Appendix A.

3.2.1 Histogram Kernel

The entries of the histogram kernel, also known as the piecewise constant kernel, are given by κ(x, x') = Σ_{j=0}^{m−1} H(j/m, (j+1)/m, x, x'), where

H(j/m, (j+1)/m, x, x') = 1 if x, x' ∈ [j/m, (j+1)/m), and 0 otherwise.

Covariances between raw moments may be computed as follows:

κ(R_x^(k), x') = ∫_0^1 x^k κ(x, x') dx
              = (1/(k+1)) ( ((j+1)/m)^{k+1} − (j/m)^{k+1} ),    (10)

where in the above x' lies in the interval [j/m, (j+1)/m). Extending this to the covariance function between raw moments we have,

κ(R_x^(k), R_{x'}^(k')) = ∫_0^1 ∫_0^1 x^k x'^{k'} κ(x, x') dx dx'
                        = Σ_{j=0}^{m−1} Π_{k̄ ∈ {k, k'}} (1/(k̄+1)) ( ((j+1)/m)^{k̄+1} − (j/m)^{k̄+1} ).    (11)

This simple kernel formulation between observations of the raw moments compactly allows us to perform inference over P(λ_i = x). However, the ultimate goal is to predict log(Det(K)), and hence Σ_{k=1}^∞ Tr(A^k)/k. This requires a seemingly more complex set of kernel expressions; nevertheless, by propagating the implied infinite summations into the kernel function, we can also obtain closed form solutions for these terms,

κ( Σ_{k=1}^∞ R_x^(k)/k, R_{x'}^(k') ) = Σ_{j=0}^{m−1} (1/(k'+1)) ( ((j+1)/m)^{k'+1} − (j/m)^{k'+1} ) ( S((j+1)/m) − S(j/m) ),    (12)

κ( Σ_{k=1}^∞ R_x^(k)/k, Σ_{k'=1}^∞ R_{x'}^(k')/k' ) = Σ_{j=0}^{m−1} ( S((j+1)/m) − S(j/m) )²,    (13)

where S(α) = Σ_{k=1}^∞ α^{k+1}/(k(k+1)), which has the convenient identity for 0 < α < 1,

S(α) = α + (1 − α) log(1 − α).

Following the derivations presented above, we can finally go about computing the prediction for the log-determinant, and its corresponding variance, using the GP posterior equations given in (6) and (7). This can be achieved by replacing the terms K_* and K_{*,*} with the constructions presented in (12) and (13), respectively. The entries of K are filled in using (11), whereas y denotes the noisy observations of Tr(A^k).
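Equations (11)-(13) reduce to short sums over the m histogram bins, so the required kernel entries are cheap to evaluate. A minimal sketch under those equations follows (assuming NumPy; taking S(1) = 1 as the limiting value is our choice, and the function names are ours):

```python
import numpy as np

def S(alpha):
    """S(alpha) = alpha + (1 - alpha) log(1 - alpha), with S(1) = 1 taken as the limit."""
    alpha = np.asarray(alpha, dtype=float)
    safe = np.clip(1.0 - alpha, 1e-300, None)
    return np.where(alpha < 1.0, alpha + (1.0 - alpha) * np.log(safe), 1.0)

def moment_cov(k, kp, m):
    """kappa(R^(k), R^(k')) for the histogram kernel, eq. (11)."""
    lo, hi = np.arange(m) / m, (np.arange(m) + 1) / m
    term_k  = (hi ** (k + 1)  - lo ** (k + 1))  / (k + 1)
    term_kp = (hi ** (kp + 1) - lo ** (kp + 1)) / (kp + 1)
    return np.sum(term_k * term_kp)

def cross_cov(kp, m):
    """Covariance between the infinite Taylor sum and R^(k'), eq. (12)."""
    lo, hi = np.arange(m) / m, (np.arange(m) + 1) / m
    return np.sum((hi ** (kp + 1) - lo ** (kp + 1)) / (kp + 1) * (S(hi) - S(lo)))

def prior_var(m):
    """Prior variance of the infinite Taylor sum, eq. (13)."""
    lo, hi = np.arange(m) / m, (np.arange(m) + 1) / m
    return np.sum((S(hi) - S(lo)) ** 2)
```

With these, K is filled with moment_cov, K_* with cross_cov, and k_{*,*} with prior_var, after which the GP posterior equations (6) and (7) apply directly.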

Borrowing from the literature on bounds for the log- Algorithm 1 Computing log-determinant and uncer- determinant of a matrix, as described in Appendix B, we tainty using probabilistic numerics

can also exploit such upper and lower bounds to trun- n×n cate the resulting GP distribution to the relevant domain, Input: PSD matrix A ∈ R , raw moments kernel κ, which is expected to greatly improve the predicted log- expansion order M, and random vectors Z determinant. These additional constraints can then be Output: Posterior mean MTRN, and uncertainty VTRN propagated to the hyperparameter optimization proce- 1: A ← NORMALIZE(A) dure by incorporating them into the likelihood function 2: BOUNDS ← GETBOUNDS(A) via the product rule, as follows: 3: for i ← 1 to M do 4: y ← STOCHASTICTAYLOROBS(A, i, Z)  a − µˆ  b − µˆ  i LML = LML + log Φ − Φ , 5: for i ← 1 to M do GP σˆ σˆ 6: for j ← 1 to M do 7: Kij ← κ(i, j) with a and b representing the upper and lower log- 8: κ, K ← TUNEKERNEL(K, y, BOUNDS) determinant bounds respectively, µˆ and σˆ representing 9: for i ← 1 to M do the posterior mean and standard deviation, and Φ(·) rep- 10: K ← κ(∗, i) resenting the Gaussian cumulative density function. Pri- ∗,i ors on the hyperparameters may be accounted for in a 11: k∗,∗ ← κ(∗, ∗) similar way. 12: MEXP, VEXP ← GPPRED(y,K,K∗, k∗,∗) 13: MTRN, VTRN ← TRUNC(MEXP, VEXP, BOUNDS) 3.2.4 Algorithm Complexity and Recap
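The TRUNC step amounts to computing the moments of the Gaussian posterior restricted to the interval defined by the bounds. A minimal sketch using SciPy's truncated normal (the helper name is ours) might look as follows:

```python
from scipy import stats

def truncate_posterior(mean, var, lower, upper):
    """Mean and standard deviation of N(mean, var) truncated to [lower, upper]."""
    sd = var ** 0.5
    a, b = (lower - mean) / sd, (upper - mean) / sd   # standardized truncation limits
    tn = stats.truncnorm(a, b, loc=mean, scale=sd)
    return tn.mean(), tn.std()
```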

4 EXPERIMENTS

In this section, we show how the appeal of this formulation extends beyond its intrinsic novelty, whereby we also consistently obtain performance improvements over competing techniques. We set up a variety of experiments for assessing the model performance, including both synthetically constructed and real matrices. Given the model's probabilistic formulation, we also assess the quality of the uncertainty estimates yielded by the model. We conclude by demonstrating how this approach may be fitted within a practical learning scenario.

We compare our approach against several other estimators of the log-determinant, namely approximations based on Taylor expansions, Chebyshev expansions and stochastic Lanczos quadrature. The Taylor approximation has already been introduced in §2.2, and we briefly describe the others below.

[Figure 2, two panels: left, the absolute relative error of the Taylor, Chebyshev, SLQ, PN Mean and PN Trunc. Mean estimates; right, the log eigenspectra, log(λ_i) against eigenvalue index, of the matrices Spect-1 to Spect-6.]

Figure 2: Empirical performance on the six covariance matrices described in §4.1. The right panel displays the log eigenspectra of the matrices and their respective indices. The left panel displays the relative performance of the algorithms for stochastic trace estimation orders of 5, 25 and 50 (from left to right).

Chebyshev Expansions: This approach utilizes the m-degree Chebyshev polynomial approximation to the function log(I − A) (Han et al., 2015; Boutsidis et al., 2015; Peng & Wang, 2015),

Tr(log(I − A)) ≈ Σ_{k=0}^m c_k Tr(T_k(A)),    (16)

where the Chebyshev polynomials follow the recurrence T_k(A) = 2A T_{k−1}(A) − T_{k−2}(A), starting with T_0(A) = 1 and T_1(A) = 2A − 1, and c_k is defined as

c_k = (2/(n+1)) Σ_{i=0}^n log(1 − x_i) T_k(x_i),    x_i = cos( π (i + 1/2) / (n + 1) ).    (17)

The Chebyshev approximation is appealing as it gives the best m-degree polynomial approximation of log(1 − x) under the L∞-norm. The error induced by general Chebyshev polynomial approximations has also been thoroughly investigated (Han et al., 2015).
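For reference, a stochastic Chebyshev estimator of Tr(log(K)) can be sketched as follows. This is not the authors' implementation: the coefficients are obtained by interpolating log(t) at Chebyshev nodes rather than through (17) directly, and the lower spectral cut-off lam_min, degree and probe count are assumed inputs:

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def chebyshev_logdet(K, degree=30, n_probes=30, lam_min=1e-6, seed=0):
    """Stochastic Chebyshev estimate of Tr(log(K)), assuming eig(K) in [lam_min, 1]."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    # Chebyshev series of log(t) on [lam_min, 1], fitted at Chebyshev nodes.
    xs = C.chebpts1(degree + 1)                        # nodes on [-1, 1]
    ts = 0.5 * (xs + 1.0) * (1.0 - lam_min) + lam_min  # mapped to [lam_min, 1]
    coef = C.chebfit(xs, np.log(ts), degree)           # c_k of sum_k c_k T_k(x)
    # B has the spectrum of K mapped from [lam_min, 1] onto [-1, 1].
    B = (2.0 * K - (1.0 + lam_min) * np.eye(n)) / (1.0 - lam_min)
    est = 0.0
    for _ in range(n_probes):
        r = rng.standard_normal(n)
        u_prev, u = r, B @ r                           # T_0(B) r and T_1(B) r
        acc = coef[0] * (r @ u_prev) + coef[1] * (r @ u)
        for k in range(2, degree + 1):
            u_prev, u = u, 2.0 * (B @ u) - u_prev      # three-term Chebyshev recurrence
            acc += coef[k] * (r @ u)
        est += acc / n_probes
    return est
```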

Stochastic Lanczos Quadrature: This approach (Ubaru et al., 2016) relies on stochastic trace estimation to approximate the trace using the identity presented in (1). If we consider the eigendecomposition of matrix A into Q Λ Qᵀ, the quadratic form in the equation becomes

r^(i)ᵀ log(A) r^(i) = r^(i)ᵀ Q log(Λ) Qᵀ r^(i) = Σ_{k=1}^n log(λ_k) μ_k²,

where μ_k denotes the individual components of Qᵀ r^(i). By transforming this term into a Riemann-Stieltjes integral ∫_a^b log(t) dμ(t), where μ(t) is a piecewise constant function (Ubaru et al., 2016), we can approximate it as

∫_a^b log(t) dμ(t) ≈ Σ_{j=0}^m ω_j log(θ_j),

where m is the degree of the approximation, while the sets of ω and θ are the parameters to be inferred using Gauss quadrature. It turns out that these parameters may be computed analytically using the eigendecomposition of the low-rank tridiagonal transformation of A obtained using the Lanczos algorithm (Paige, 1972). Denoting the resulting eigenvalues and eigenvectors by θ and y respectively, the quadratic form may finally be evaluated as,

r^(i)ᵀ log(A) r^(i) ≈ Σ_{j=0}^m τ_j² log(θ_j),    (18)

with τ_j = e_1ᵀ y_j.

4.1 SYNTHETICALLY CONSTRUCTED MATRICES

Previous work on estimating log-determinants has implied that the performance of any given method is closely tied to the shape of the eigenspectrum of the matrix under review. As such, we set up an experiment for assessing the performance of each technique when applied to synthetically constructed matrices whose eigenvalues decay at different rates. Given that the computational complexity of each method is dominated by the number of matrix-vector products (MVPs) incurred, we also illustrate the progression of each technique for an increasing allowance of MVPs. All matrices are constructed using a Gaussian kernel evaluated over 1000 input points.

[Figures 3 and 4 report results on the UFL datasets thermomech_TC (d = 102,158), bonesS01 (d = 127,224), ecology2 (d = 999,999) and thermal2 (d = 1,228,045). Figure 3 plots the absolute relative error of the Taylor, Chebyshev, SLQ, PN Mean and PN Trunc. Mean estimates, while Figure 4 plots the absolute error divided by the predicted standard deviation.]

Figure 3: Methods compared on a variety of UFL Sparse Datasets. For each dataset, the matrix was approximately raised to the powers 5, 10, 15, 20, 25 and 30 (left to right) using stochastic trace estimation.

Figure 4: Quality of uncertainty estimates on UFL datasets, measured as the ratio of the absolute error to the predicted standard deviation. As before, results are shown for increasing computational budgets (MVPs). The true value lay outside 2 standard deviations in only one of 24 trials.

As illustrated in Figure 2, the estimates returned by our approach are consistently on par with (and frequently superior to) those obtained using other methods. For matrices having slowly-decaying eigenvalues, standard Chebyshev and Taylor approximations fare quite poorly, whereas SLQ and our approach both yield comparable results. The results become more homogeneous across methods for faster-decaying eigenspectra, but our method is frequently among the top two performers. For our approach, it is also worth noting that truncating the GP using known bounds on the log-determinant indeed results in superior posterior estimates. This is particularly evident when the eigenvalues decay very rapidly. Somewhat surprisingly, the performance does not seem to be greatly affected by the number of budgeted MVPs.

4.2 UFL SPARSE DATASETS

Although we have so far limited our discussion to covariance matrices, our proposed method is amenable to any positive semi-definite matrix. To this end, we extend the previous experimental set-up to a selection of real, sparse matrices obtained from the SuiteSparse Matrix Collection (Davis & Hu, 2011). Following Ubaru et al. (2016), we list the true values of the log-determinant reported in Boutsidis et al. (2015), and compare all other approaches to this baseline.

The results for this experiment are shown in Figure 3. Once again, the estimates obtained using our probabilistic approach achieve comparable accuracy to the competing techniques, and several improvements are noted for larger allowances of MVPs. As expected, the SLQ approach generally performs better than Taylor and Chebyshev approximations, especially for smaller computational budgets. Even so, our proposed technique consistently appears to have an edge across all datasets.

4.3 UNCERTAINTY QUANTIFICATION

One of the notable features of our proposal is the ability to quantify the uncertainty of the predicted log-determinant, which can be interpreted as an indicator of the quality of the approximation. Given that none of the other techniques offer such insights to compare against, we assess the quality of the model's uncertainty estimates by measuring the ratio of the absolute error to the predicted standard deviation (uncertainty). For the latter to be meaningful, the error should ideally lie within only a few multiples of the standard deviation.

In Figure 4, we report this metric for our approach when using the histogram kernel. We carry out this evaluation over the matrices introduced in the previous experiment, once again showing how the performance varies for different MVP allowances. In all cases, the absolute error of the predicted log-determinant is consistently bounded by at most twice the predicted standard deviation, which is very sensible for such a probabilistic model.

4.4 MOTIVATING EXAMPLE

Determinantal point processes (DPPs; Macchi, 1975) are stochastic point processes defined over subsets of data such that an established degree of repulsion is maintained. A DPP, P, over a discrete space y ∈ {1, ..., n} is a probability measure over all subsets of y such that

P(A ⊆ y) = Det(K_A),

where K is a positive definite matrix having all eigenvalues less than or equal to 1. A popular method for modeling data via K is the L-ensemble approach (Borodin, 2009), which transforms kernel matrices, L, into an appropriate K,

K = (L + I)^{−1} L.

x ing more kernels on the raw moments which permit a

m 0.6 l tractable Bayesian Quadrature. The uncertainty quanti- = l NLL / 0.4 ( P 0.2 fied in this work is also a step closer towards fully char- 0.0 acterizing the uncertainty associated with approximating 10-3 10-2 10-1 100 Lengthscale large-scale kernel-based models.

Figure 5: The rescaled Negative log likelihood (NLL) of Acknowledgements DPP with varying length scale (blue) and probability of maximum likelihood (red). Cubic interpolation was used Part of this work was supported by the Royal Academy between inferred likelihood observations. Ten samples, of Engineering and the Oxford-Man Institute. MF grate- z, were taken to polynomial order 30. fully acknowledges support from the AXA Research Fund. The authors would like to thank Jonathan Down- ing for his supportive and insightful conversation on this The goal of inference is to correctly parameterize L given work. observed subsets of y, such that the probability of unseen subsets can be accurately inferred in the future. References Given that the log-likelihood term of a DPP requires the log-determinant of L, na¨ıve computations of this term Ambikasaran, S., Foreman-Mackey, D., Greengard, L., are intractable for large sample sizes. In this experi- Hogg, D. W., and O’Neil, M. Fast Direct Methods for ment, we demonstrate how our proposed approach can be Gaussian Processes. IEEE Transactions on Pattern Anal- employed to the purpose of parameter optimization for ysis and Machine Intelligence, 38(2):252–265, 2016. large-scale DPPs. In particular, we sample points from Anitescu, M., Chen, J., and Wang, L. A Matrix-free 5 a DPP defined on a lattice over [−1, 1] , with one mil- Approach for Solving the Parametric Gaussian Process lion points at uniform intervals. A Gaussian kernel with Maximum Likelihood Problem. SIAM J. Scientific Com- lengthscale parameter l is placed over these points, creat- puting, 34(1), 2012. ing the true L. Subsets of the lattice points can be drawn by taking advantage of the structure of L, and we Aune, E., Simpson, D. P., and Eidsvik, J. Parameter draw five sets of 12,500 samples each. For a given selec- Estimation in High Dimensional Gaussian Distributions. tion of lengthscale options, the goal of this experiment is Statistics and Computing, 24(2):247–263, 2014. to confirm that the DPP likelihood of the obtained sam- Avron, H. and Toledo, S. Randomized Algorithms for ples is indeed maximized when L is parameterized by the Estimating the Trace of an Implicit Symmetric Positive true lengthscale, l. As shown in Figure 5, the computed Semi-definite Matrix. J. ACM, 58(2):8:1–8:34, 2011. uncertainty allows us to derive a distribution over the true lengthscale which, despite using few matrix-vector mul- Bai, Z. and Golub, G. H. Bounds for the Trace of the tiplications, is very close to the optimal. Inverse and the Determinant of Symmetric Positive Def- inite Matrices. Annals of Numerical , 4:29– 38, 1997. 5 CONCLUSION Bardenet, R. and Titsias, M. K. Inference for Deter- In a departure from conventional approaches for estimat- minantal Point Processes Without Spectral Knowledge. ing the log-determinant of a matrix, we propose a novel In Proceedings of the 28th International Conference on probabilistic framework which provides a Bayesian per- Neural Information Processing Systems, pp. 3393–3401, spective on the literature of matrix theory and stochastic 2015. trace estimation. In particular, our approach enables the Barry, R. P. and Pace, R. K. Monte Carlo Estimates of log-determinant to be inferred from noisy observations k the Log-Determinant of Large Sparse Matrices. Linear of Tr A obtained from stochastic trace estimation. By Algebra and its applications, 289(1):41–54, 1999. 
modeling these observations using a GP, a posterior esti- mate for the log-determinant may then be computed us- Borodin, A. Determinantal point processes. arXiv ing Bayesian Quadrature. Our experiments confirm that preprint arXiv:0911.1153, 2009. the results obtained using this model are highly compa- Boutsidis, C., Drineas, P., Kambadur, P., and Zouzias, rable to competing methods, with the additional benefit A. A Randomized Algorithm for Approximating the Log of measuring uncertainty. Determinant of a Symmetric Positive Definite Matrix. We forecast that the foundations laid out in this work CoRR, abs/1503.00374, 2015. Braun, M. L. Accurate Error Bounds for the Eigenval- Ipsen, I. C. F. and Lee, D. J. Determinant Approxima- ues of the Kernel Matrix. Journal of Machine Learning tions, May 2011. Research, 7:2303–2328, December 2006. Macchi, O. The Coincidence Approach to Stochastic Chen, J., Anitescu, M., and Saad, Y. Computing f(A)b point processes. Advances in Applied Probability, 7:83– via Least Squares Polynomial Approximations. SIAM 122, 1975. Journal on Scientific Computing, 33(1):195–222, 2011. Mackay, D. J. C. Information Theory, Inference and Cutajar, K., Osborne, M., Cunningham, J., and Filip- Learning Algorithms. Cambridge University Press, first pone, M. Preconditioning Kernel Matrices. In Proceed- edition edition, June 2003. ISBN 0521642981. ings of the 33nd International Conference on Machine O’Hagan, A. Bayes-Hermite Quadrature. Journal of Sta- Learning, ICML 2016, New York City, NY, USA, June tistical Planning and Inference, 29:245–260, 1991. 19-24, 2016. Paige, C. C. Computational Variants of the Lanczos Davis, J. V., Kulis, B., Jain, P., Sra, S., and Dhillon, I. S. method for the Eigenproblem. IMA Journal of Applied Information-theoretic Metric Learning. In Proceedings Mathematics, 10(3):373–381, 1972. of the Twenty-Fourth International Conference (ICML 2007), Corvallis, Oregon, USA, June 20-24, 2007, pp. Peng, H. and Qi, Y. EigenGP: Gaussian Process Mod- 209–216, 2007. els with Adaptive Eigenfunctions. In Proceedings of the 24th International Conference on Artificial Intelligence, Davis, T. A. and Hu, Y. The University of Florida Sparse IJCAI’15, pp. 3763–3769. AAAI Press, 2015. Matrix Collection. ACM Transactions on Mathematical Software (TOMS), 38(1):1, 2011. Peng, W. and Wang, H. Large-scale Log-Determinant Computation via Weighted L 2 Polynomial Approxima- Filippone, M. and Engler, R. Enabling Scalable tion with Prior Distribution of Eigenvalues. In Interna- Stochastic Gradient-based inference for Gaussian pro- tional Conference on High Performance Computing and cesses by employing the Unbiased LInear System SolvEr Applications, pp. 120–125. Springer, 2015. (ULISSE). In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, Rasmussen, C. E. and Williams, C. Gaussian Processes France, July 6-11, 2015. for Machine Learning. MIT Press, 2006. Fitzsimons, J. K., Osborne, M. A., Roberts, S. J., and Rasmussen, C. E. and Ghahramani, Z. Bayesian Monte Fitzsimons, J. F. Improved Stochastic Trace Estimation Carlo. In Advances in Neural Information Processing using Mutually Unbiased Bases. CoRR, abs/1608.00117, Systems 15, NIPS 2002, December 9-14, 2002, Vancou- 2016. ver, British Columbia, Canada, pp. 489–496, 2002. Gershgorin, S. Uber die Abgrenzung der Eigenwerte Rue, H. and Held, L. Gaussian Markov Random Fields: einer Matrix. Izvestija Akademii Nauk SSSR, Serija Theory and Applications, volume 104 of Monographs Matematika, 7(3):749–754, 1931. 
on Statistics and Applied Probability. Chapman & Hall, London, 2005. Golub, G. H. and Van Loan, C. F. Matrix computations. The Johns Hopkins University Press, 3rd edition, Octo- Rue, H., Martino, S., and Chopin, N. Approximate ber 1996. ISBN 080185413. Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Han, I., Malioutov, D., and Shin, J. Large-scale Log- Royal Statistical Society: Series B (Statistical Methodol- Determinant computation through Stochastic Chebyshev ogy), 71(2):319–392, 2009. Expansions. In Bach, F. R. and Blei, D. M. (eds.), Pro- ceedings of the 32nd International Conference on Ma- Saatc¸i, Y. Scalable Inference for Structured Gaussian chine Learning, ICML 2015, Lille, France, 6-11 July Process Models. PhD thesis, University of Cambridge, 2015, 2015. 2011. Hennig, P., Osborne, M. A., and Girolami, M. Proba- Silverstein, J. W. Eigenvalues and Eigenvectors of Large bilistic Numerics and Uncertainty in Computations. Pro- Dimensional Sample Covariance Matrices. Contempo- ceedings of the Royal Society of London A: Mathe- rary Mathematics, 50:153–159, 1986. matical, Physical and Engineering Sciences, 471(2179), Stein, M. L., Chen, J., and Anitescu, M. Stochastic Ap- 2015. proximation of Score functions for Gaussian processes. Hutchinson, M. A Stochastic Estimator of the Trace of The Annals of Applied Statistics, 7(2):1162–1191, 2013. the Influence Matrix for Laplacian Smoothing Splines. doi: 10.1214/13-AOAS627. Communications in Statistics - Simulation and Compu- Ubaru, S., Chen, J., and Saad, Y. Fast Estimation of tr (f tation, 19(2):433–450, 1990. (a)) via Stochastic Lanczos Quadrature. 2016. Wathen, A. J. and Zhu, S. On Spectral Distribution of Kernel Matrices related to Radial functions. Nu- merical Algorithms, 70(4):709–726, 2015. Weyl, H. Das asymptotische Verteilungsgesetz der Eigenwerte linearer partieller Differentialgleichungen (mit einer Anwendung auf die Theorie der Hohlraum- strahlung). Mathematische Annalen, 71(4):441–479, 1912. Wolkowicz, H. and Styan, G. P. Bounds for Eigenvalues using Traces. Linear algebra and its applications, 29: 471–506, 1980. Zhang, Y. and Leithead, W. E. Approximate Imple- mentation of the logarithm of the Matrix Determinant in Gaussian process Regression. Journal of Statistical Computation and Simulation, 77(4):329–348, 2007. A POLYNOMIAL KERNEL

Similar to the derivation of the histogram kernel, we can also derive the polynomial kernel for moment observations. The entries of the polynomial kernel, given by κ(x, x') = (x x' + c)^d, can be integrated over as,

κ(R_x^(k), x') = ∫_0^1 Σ_{i=1}^d (d choose i) x^{k+i} x'^i c^{d−i} dx
              = Σ_{i=1}^d (d choose i) x'^i c^{d−i} / (k + i + 1).    (19)

κ(R_x^(k), R_{x'}^(k')) = ∫_0^1 ∫_0^1 Σ_{i=1}^d (d choose i) x^{k+i} x'^{k'+i} c^{d−i} dx dx'
                        = Σ_{i=1}^d (d choose i) c^{d−i} / ((k + i + 1)(k' + i + 1)).    (20)

As with the histogram kernel, the infinite sum of the Taylor expansion can also be combined into the Gaussian process,

κ( Σ_{k=1}^∞ R_x^(k)/k, R_{x'}^(k') ) = Σ_{k=1}^∞ (1/k) Σ_{i=1}^d (d choose i) c^{d−i} / ((k + i + 1)(k' + i + 1))
                                      = Σ_{i=1}^d (d choose i) c^{d−i} ( Ψ^(0)(i + 2) + γ ) / ((i + 1)(k' + i + 1)),    (21)

κ( Σ_{k=1}^∞ R_x^(k)/k, Σ_{k'=1}^∞ R_{x'}^(k')/k' ) = Σ_{k=1}^∞ Σ_{k'=1}^∞ (1/(k k')) Σ_{i=1}^d (d choose i) c^{d−i} / ((k + i + 1)(k' + i + 1))
                                                    = Σ_{i=1}^d (d choose i) c^{d−i} ( Ψ^(0)(i + 2) + γ )² / (i + 1)².    (22)

In the above, Ψ^(0)(·) is the digamma function and γ is the Euler-Mascheroni constant. We strongly believe that the polynomial and histogram kernels are not the only kernels which can be analytically derived to include moment observations, but they act as a reasonable initial choice for practitioners.
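Since (20)-(22) are finite sums over the polynomial degree d, they are straightforward to evaluate numerically; the sketch below simply transcribes them (SciPy is assumed for the binomial coefficient and digamma function, and the function names are ours):

```python
import numpy as np
from scipy.special import binom, digamma

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def poly_moment_cov(k, kp, d, c):
    """kappa(R^(k), R^(k')) for the polynomial kernel, eq. (20)."""
    i = np.arange(1, d + 1)
    return np.sum(binom(d, i) * c ** (d - i) / ((k + i + 1) * (kp + i + 1)))

def poly_cross_cov(kp, d, c):
    """Covariance between the infinite Taylor sum and R^(k'), eq. (21)."""
    i = np.arange(1, d + 1)
    return np.sum(binom(d, i) * c ** (d - i) * (digamma(i + 2) + EULER_GAMMA)
                  / ((i + 1) * (kp + i + 1)))

def poly_prior_var(d, c):
    """Prior variance of the infinite Taylor sum, eq. (22)."""
    i = np.arange(1, d + 1)
    return np.sum(binom(d, i) * c ** (d - i) * (digamma(i + 2) + EULER_GAMMA) ** 2
                  / (i + 1) ** 2)
```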

B BOUNDS ON LOG DETERMINANTS

For the sake of completeness, we restate the bounds on the log determinants used throughout this paper (Bai & Golub, 1997).

Theorem 1. Let A be an n-by-n symmetric positive definite matrix, μ_1 = Tr(A), μ_2 = ‖A‖_F², and λ_i(A) ∈ [α, β] with α > 0. Then

[ log α   log t ] [ α  α² ; t  t² ]^{−1} [ μ_1 ; μ_2 ]  ≤  Tr(log(A))  ≤  [ log β   log t̄ ] [ β  β² ; t̄  t̄² ]^{−1} [ μ_1 ; μ_2 ],

where

t = (α μ_1 − μ_2) / (α n − μ_2),    t̄ = (β μ_1 − μ_2) / (β n − μ_2).

This bound can be easily computed during the loading of the matrix, as both the trace and the Frobenius norm can be readily calculated using summary statistics. However, bounds on the maximum and minimum eigenvalues must also be derived. We chose to use Gershgorin intervals to bound the eigenvalues (Gershgorin, 1931).
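Once Tr(A), ‖A‖_F² and an eigenvalue interval [α, β] are available, Theorem 1 amounts to a pair of 2 × 2 linear solves; a direct transcription (function name is ours) is sketched below:

```python
import numpy as np

def logdet_bounds(trace, frob_sq, alpha, beta, n):
    """Bai & Golub (1997) lower and upper bounds on Tr(log(A)), per Theorem 1."""
    mu = np.array([trace, frob_sq])
    t_low = (alpha * trace - frob_sq) / (alpha * n - frob_sq)
    t_up  = (beta  * trace - frob_sq) / (beta  * n - frob_sq)
    lower = np.array([np.log(alpha), np.log(t_low)]) @ np.linalg.solve(
        np.array([[alpha, alpha ** 2], [t_low, t_low ** 2]]), mu)
    upper = np.array([np.log(beta), np.log(t_up)]) @ np.linalg.solve(
        np.array([[beta, beta ** 2], [t_up, t_up ** 2]]), mu)
    return lower, upper
```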