Arxiv:1905.02086V1 [Stat.CO] 6 May 2019 K Arise When Looking for the Optimal Regularization Parameter in Q X Sˆg(Q) = Rt(Qi + L)−1R
Total Page:16
File Type:pdf, Size:1020Kb
ESTIMATING THE INVERSE TRACE USING RANDOM FORESTS ON GRAPHS S. Barthelme´ (1), N. Tremblay (1), A. Gaudilliere` (2), L. Avena (3) and P-O Amblard (1) (1) Univ Grenoble-Alpes, CNRS, Grenoble-INP, GIPSA-lab, Grenoble, France (2) Aix-Marseille Univ, CNRS, I2M, Marseille, France (3) Leiden University, Netherlands ABSTRACT In most cases the optimal value of q is unknown and must be Some data analysis problems require the computation of estimated, for instance using AIC (Akaike’s Information Cri- (regularised) inverse traces, i.e. quantities of the form terion) or Generalised-Cross Validation (GCV). AIC requires Tr(qI + L)−1. For large matrices, direct methods are un- computing the number of degrees of freedom of the estimator, feasible and one must resort to approximations, for example which here can be taken to equal s(q) (see [10], ch. 5, [9, 8]). using a conjugate gradient solver combined with Girard’s The simplest solution to compute eq. (1) is of course to trace estimator (also known as Hutchinson’s trace estimator). compute the eigenvalues of L, which comes at O(n3) cost if Here we describe an unbiased estimator of the regularized L is n × n. Moreover, there is no particular gain to expect inverse trace, based on Wilson’s algorithm, an algorithm from the sparsity of L. In fact, iterative methods for eigenval- that was initially designed to draw uniform spanning trees in ues, that look to estimate the smallest or largest eigenvalues graphs. Our method is fast, easy to implement, and scales to of L, cannot be used directly here, as s(q) involves the whole very large matrices. Its main drawback is that it is limited to spectral density. An alternative is to consider Monte Carlo diagonally dominant matrices L. methods. A famous estimator for the trace of a matrix was first suggested by Girard in [9]: let r denote a length-n vector Monte Carlo methods are increasingly popular in large- of independent, standard Gaussian entries. Let M denote a scale linear algebra problems [14]. Among the many different n × n matrix. Then: quantities one may need to compute on large matrices, spec- t t t Pn E(r Mr) = E Tr(Mrr ) = Tr ME(rr ) = Tr M (4) tral summaries of the form i=1 f(λi(L)), where the λi’s f are the eigenvalues of L and is some function, are often This leads immediately to estimating Tr M using the empiri- required. Here we focus on the following quantity: 1 Pk t cal mean Tr M ≈ k l=1 rl Mrl. Note that eq. (4) is valid n −1 X q for any random vector with diagonal covariance, so we may s(q) = q Tr((L + qI) ) = (1) use other random vectors [12]. Various options have been λi + q i=1 studied in the literature, see [5]. In this work we use Gaussian which we seek to evaluate for real q > 0. We call the quantity vectors for simplicity (as we will see, it is not the main factor s(q) because it is equivalent (up to scaling) to the Stieltjes here). transform of the eigenvalue density evaluated on the negative In our case, M = q(qI + L)−1, and Girard’s estimator of real axis [1]. s(q) reads In practice, the problem of estimating efficiently s(q) may arXiv:1905.02086v1 [stat.CO] 6 May 2019 k arise when looking for the optimal regularization parameter in q X s^G(q) = rt(qI + L)−1r : (5) a regularized optimization problem. Say we measure a signal k k l l t l=1 x = [x1; : : : ; xn] under white Gaussian noise . The mea- surements read yi = xi + i for i = 1 to n. Many estimation In the Gaussian case, the variance of the estimation for k = methods (smoothing splines, semi-supervised learning, Gaus- 1 (see, e.g., lemma 9 of [5]) is: sian process regression) define an estimator of x as: n 2 q 2 1 t G X 2q x^ = argmin jjy − zjj + z Lz (2) Var(^s1(q)) = 2 : (6) n 2 2 (q + λ ) z2R i=1 i where L is a semi-definite positive matrix defining the penalty We still need to figure out how to compute the quadratic forms (regularisation) term, and q parametrizes the regularisation’s rt(qI + L)−1r in eq. (5). This involves solving a large linear strength. The solution to this optimisation problem equals: system, a task for which algorithms abound. If L is sparse, x^ = q(qI + L)−1y: (3) computing a sparse Cholesky factor will give good results for many systems, up to a certain size 1. Alternatively, for very Algorithm 1 A variant of Wilson’s algorithm large systems, iterative solvers such as Conjugate Gradients Input: A graph G = (V; E) of size n and q > 0 may be used [6]. Another approach is to use an order p poly- R ;, W ; nomial approximation2 of the function f(x) = q=(q + x) ' Add a node, called ∆, to G and connect it to all n nodes in Pp j t −1 0 j=0 αjx . Estimating r (qI + L) r then boils down to V with edges of weight q. Call this augmented graph G . t Pp j computing r j=0 αjL r, that is: p matrix vector multipli- while W 6= V do: cations and one scalar product. Iterative solvers and poly- · Do a random walk on G0 starting from any node i 2 nomial methods only provide approximate solutions, but we V n W until it reaches either ∆, or a node in W. expect the error induced by these approximations to be small · Erase all the loops of the trajectory, in the order of ap- relative to the Girard variance of Eq. (6). A combination of pearance. Girard’s trace estimator and iterative solvers has been used, · Add all the nodes of this loop-erased trajectory to the e.g., in [18]. set of visited nodes W. Below, we describe an alternative method that is very nat- if the last node of the trajectory is ∆ do: ural and intrinsic when L is actually a graph Laplacian, a · Denote by l the last visited node before ∆ particular class of matrices associated with graphs. At the · R R [ flg end of section 2 we extend the technique to diagonally dom- Output: R, the root set of the sampled forest spanning G. inant matrices, i.e. the set of matrices that verify 8i Lii ≥ P j6=i jLijj. branch of the spanning tree. Next, pick a node that is not yet 2 Uniform spanning trees, random forests, in the tree, run a random walk until it hits the tree, erase the and inverse traces possible loops, add this new branch to the tree, etc. Wilson’s algorithm runs in time proportional to O(τ) where τ is the In this section we recall some facts on graphs and span- average “commute time”: the time it takes a random walk to ning trees that should help understand our method. Mathe- reach node j starting from node i for two nodes picked uni- matical details can be found in [2] and [3]. formly on the graph. Consider a weighted graph G = (V; E) of n = jVj Wilson in [19] noted that his algorithm could be used to nodes and jEj edges. We restrict ourselves to undirected generate random spanning forests, and not just USTs. A for- graphs in this paper, even though the results may be ex- est is a set of trees, and a spanning forest is a set of disjoint tended to strongly connected3 directed graphs. We de- trees that, taken together, span the whole graph. The algo- note by A 2 n×n the graph’s adjacency matrix, where R rithm4 is given as alg. 1: it uses loop-erased random walks A = A ≥ 0 is the weight of the connection between nodes ij ji (LERW), but these LERWs may be interrupted early. At each i and j. The graph Laplacian of G equals L = D−A 2 n×n, R node, the random walk is interrupted with probability q . where D = diag(d ; : : : ; d ) 2 n×n is the diagonal degree q+di 1 n R Of course, the larger q, the shorter the walks, the larger the matrix with d = P A the degree of node i. The graph i j ij number of roots, the faster the algorithm. In the implementa- Laplacian is a fascinating object with many applications in tion given in alg. 1, the average runtime is5 O(jEj=q). machine learning and graph signal processing, see eg. [7]. The resulting process has many fascinating aspects, some A tree is a cycle-free graph, and a spanning tree T of G is of which have been investigated in [4]. For our purposes, we a cycle-free connected subgraph of G that spans all n nodes focus on the fact that the number of roots is in fact an unbiased of G. A typical graph has more than one spanning tree. For estimator of s(q): instance, the complete graph of size n contains nn−2 different n spanning trees. A tree sampled uniformly from the set of all X q spanning trees of G is called a uniform spanning tree (UST). E(jRj) = = s(q): (7) q + λi A fast algorithm for sampling USTs, now known as ”Wil- i=1 son’s algorithm” was developed in [19] . In a nutshell, the This suggests to define Wilson estimator of s as: algorithm runs as follows: pick a node at random, and call it k the root of the tree.