Sampling approach to sparse approximation problem: determining degrees of freedom by simulated annealing

Tomoyuki Obuchi and Yoshiyuki Kabashima
Department of Mathematical and Computing Science, Tokyo Institute of Technology, Yokohama 226-8502, Japan

Abstract—The approximation of a high-dimensional vector by a small combination of column vectors selected from a fixed matrix has been actively debated in several different disciplines. In this paper, a sampling approach based on the Monte Carlo method is presented as an efficient solver for such problems. In particular, we focus on and test the use of simulated annealing (SA), a metaheuristic optimization algorithm, for determining the degrees of freedom (the number of used columns) by cross validation (CV). A test on a synthetic model indicates that our SA-based approach can find a nearly optimal solution for the approximation problem and, when combined with the CV framework, can optimize the generalization ability. Its utility is also confirmed by application to a real-world supernova data set.

I. INTRODUCTION

In a formulation of compressed sensing, a sparse vector x = (x_i) ∈ R^N is recovered from a given measurement vector y = (y_µ) ∈ R^M (M < N) by minimizing the residual sum of squares (RSS) under a sparsity constraint:

\hat{x}(K) = \arg\min_{x} \frac{1}{2} \| y - Ax \|_2^2 \quad \text{subj. to} \quad \| x \|_0 \le K,   (1)

where A = (A_µi) ∈ R^{M×N} and ||x||_0 denote the measurement matrix and the number of non-zero components in x (the ℓ0-norm), respectively [1]. The need to solve similar problems also arises in many contexts of information science, such as variable selection in linear regression, data compression, denoising, and machine learning. We hereafter refer to the problem of (1) as the sparse approximation problem (SAP).
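As a concrete illustration of (1), the following Python sketch, which is ours and not part of the original paper, builds a tiny synthetic instance and solves it exactly by enumerating all supports of size K; the toy sizes and the helper names (rss, exhaustive_sap) are illustrative assumptions only.

import itertools
import numpy as np

def rss(y, A, support):
    # RSS (1/2)||y - A x||^2 minimized over x restricted to the given support
    A_s = A[:, support]
    x_s, *_ = np.linalg.lstsq(A_s, y, rcond=None)
    r = y - A_s @ x_s
    return 0.5 * (r @ r), x_s

def exhaustive_sap(y, A, K):
    # exact solver of (1) by brute force over all supports of size K (tiny N only)
    N = A.shape[1]
    best_e, best_support = np.inf, None
    for support in itertools.combinations(range(N), K):
        e, _ = rss(y, A, list(support))
        if e < best_e:
            best_e, best_support = e, support
    return best_e, best_support

rng = np.random.default_rng(0)
M, N, K = 8, 12, 3                      # toy sizes; real applications are far larger
A = rng.standard_normal((M, N)) / np.sqrt(M)
x_true = np.zeros(N)
x_true[rng.choice(N, K, replace=False)] = rng.standard_normal(K)
y = A @ x_true + 0.01 * rng.standard_normal(M)
print(exhaustive_sap(y, A, K))          # (minimal RSS, best support)

Already at N = 12 and K = 3 this loop visits 220 candidate supports; the binomial growth of this count is the combinatorial wall that the approximate solvers discussed next are designed to avoid.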
Despite the simplicity of its expression, SAP is highly nontrivial to solve. Finding the exact solution of (1) has proved to be NP-hard [2], and various approximate methods have been proposed. Orthogonal matching pursuit (OMP) [3], in which the set of used columns is incremented in a greedy manner so as to minimize the RSS, is a representative example of such approximate solvers. Another way of finding an approximate solution of (1) is to convert the problem into the Lagrangian form \frac{1}{2}\|y - Ax\|_2^2 + \lambda \|x\|_p^p while relaxing ||x||_0 to \|x\|_p^p = \sum_{i=1}^{N} |x_i|^p. In particular, setting p = 1 makes the converted problem convex and allows us to efficiently find its unique minimum solution. This approach is often termed LASSO [4]. When prior knowledge about the generation of x and y is available in the form of probability distributions, one may resort to the Bayesian framework for inferring x from y. This can be efficiently carried out by approximate message passing (AMP) [5].

Solvers of these kinds have their own advantages and disadvantages, and the choice of which to employ depends on the imposed constraints and the available resources. This means that developing novel possibilities is important for offering more choices. Supposing a situation where one can make use of relatively high computational resources, we explore the abilities and limitations of another approximate approach, i.e., Monte Carlo (MC)-based sampling.

In an earlier study, we showed that a version of simulated annealing (SA) [6], which is a versatile MC-based metaheuristic for functional optimization, has the ability to efficiently find a nearly optimal solution of SAP in a wide range of system parameters [7]. In this paper, we particularly focus on the problem of determining the degrees of freedom, K, by cross validation (CV) utilizing SA. We will show that the necessary computational cost of our algorithm is bounded by O(M^2 N |S_K| K_max), where S_K is the set of tested values of K, |S_K| is its cardinality, and K_max is the largest among the tested values of K. Admittedly, this computational cost is not cheap. However, our algorithm is easy to parallelize, and we expect that large-scale parallelization will significantly diminish this disadvantage and will make the CV analysis using SA practical.

II. SAMPLING FORMULATION AND SIMULATED ANNEALING

Let us introduce a binary vector c = (c_i) ∈ {0,1}^N, which indicates the column vectors used to approximate y: if c_i = 1, the ith column of A, a_i, is used; if c_i = 0, it is not used. We call this binary variable the sparse weight. Given c, the optimal coefficients, x(c), are expressed as

x(c) = \arg\min_{x} \| y - A(c \circ x) \|_2^2,   (2)

where (c ◦ x)_i = c_i x_i represents the Hadamard product. The components of x(c) corresponding to the zero components of c are actually indefinite, and we set them to zero. The corresponding RSS is thus defined by

E(c|y, A) = M \epsilon(c|y, A) = \frac{1}{2} \| y - A x(c) \|_2^2.   (3)

To perform sampling, we employ the statistical mechanical formulation of Ref. [7] as follows. By regarding E as an "energy" and introducing an "inverse temperature" β, we can define a Boltzmann distribution as

P(c|\beta; y, A) = \frac{1}{G} \, \delta\!\left( \sum_i c_i - K \right) e^{-\beta E(c|y, A)},   (4)

where G is the "partition function"

G = G(\beta|y, A) = \sum_{c} \delta\!\left( \sum_i c_i - K \right) e^{-\beta E(c|y, A)}.   (5)

In the limit of β → ∞, (4) is guaranteed to concentrate on the solution of (1). Therefore, sampling c at β ≫ 1 offers an approximate solution of (1).

Unfortunately, directly sampling c from (4) is computationally difficult. To resolve this difficulty, we employ a Markov chain dynamics whose equilibrium distribution accords with (4). Among the many choices of such dynamics, we employ the standard Metropolis-Hastings rule [8]. A noteworthy remark is that a trial move of the sparse weights, c → c′, is generated by "pair flipping" two sparse weights, one equal to 0 and the other equal to 1. Namely, choosing an index i of the sparse weight from ONES ≡ {k | c_k = 1} and another index j from ZEROS ≡ {k | c_k = 0}, we set c′ = c, except for the counterpart of (c_i, c_j) = (1, 0), which is given as (c′_i, c′_j) = (0, 1). This flipping keeps the sparsity constant during the update. A pseudo-code of the MC algorithm is given in Algorithm 1 as a reference.

Algorithm 1 MC update with pair flipping
 1: procedure MCPF(c, β, y, A)            ⊲ MC routine
 2:   ONES ← {k | c_k = 1}, ZEROS ← {k | c_k = 0}
 3:   randomly choose i from ONES and j from ZEROS
 4:   c′ ← c
 5:   (c′_i, c′_j) ← (0, 1)
 6:   (E, E′) ← (E(c|y, A), E(c′|y, A))
 7:   p_accept ← min(1, e^{−β(E′−E)})
 8:   generate a random number r ∈ [0, 1]
 9:   if r < p_accept then
10:     c ← c′
11:   end if
12:   return c
13: end procedure
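As a reference implementation, the following Python function is a direct, non-optimized rendering of Algorithm 1; the function and variable names are ours. It recomputes x(c′) from scratch by least squares instead of using the fast update discussed below, so it is meant only to make the accept/reject logic concrete, and it assumes NumPy and a sparse-weight vector c containing exactly K ones.

import numpy as np

def energy(c, y, A):
    # E(c|y,A) of (3): (1/2)||y - A x(c)||^2 with x(c) from (2), zeros off-support
    support = np.flatnonzero(c)
    x_s, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
    r = y - A[:, support] @ x_s
    return 0.5 * (r @ r)

def mc_pair_flip(c, beta, y, A, rng):
    # one Metropolis-Hastings update with pair flipping; keeps sum(c) = K fixed
    ones = np.flatnonzero(c == 1)
    zeros = np.flatnonzero(c == 0)
    i, j = rng.choice(ones), rng.choice(zeros)
    c_new = c.copy()
    c_new[i], c_new[j] = 0, 1
    dE = energy(c_new, y, A) - energy(c, y, A)
    if dE <= 0 or rng.random() < np.exp(-beta * dE):
        return c_new                    # accept the pair flip
    return c                            # reject and keep the current c

Caching E(c) between successive calls, rather than recomputing both energies, is an obvious first optimization; the low-rank update mentioned next goes further.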
The most time-consuming part of the algorithm is the evaluation of E′ in the sixth line; the naive operation requires O(MK^2 + K^3) because the inversion of the Gram matrix A(c)^T A(c) is involved, where T and A(c) stand for the matrix transpose and the submatrix of A composed of the column vectors of A whose column indices belong to ONES, respectively. However, since the flip c → c′ changes A(c) by only two columns, one can reduce this computational cost to O(MK + K^2) using a matrix inversion formula [7]. This implies that, when the average number of flips per variable (MC steps) is kept constant, the computational cost of the algorithm scales as O(MNK) per MC step in the dominant order, since M > K.

In general, a longer time is required for equilibrating the MC dynamics as β grows larger. Furthermore, the dynamics has the risk of being trapped in trivial local minima of (3) if β is fixed to a very large value from the beginning. A practically useful technique for avoiding these inconveniences is to start with a sufficiently small β and gradually increase it, which is termed simulated annealing (SA) [6]. Convergence to the optimal solution is guaranteed if β is increased sufficiently slowly, no faster than logarithmically in the number of MC steps up to a constant [9]. Of course, this schedule is practically meaningless, and a much faster schedule is generally employed. In Ref. [7], we examined the performance of SA with a very rapid annealing schedule for a synthetic model whose properties at fixed β can be analytically evaluated. Comparison between the analytical and experimental results indicates that the rapid SA performs quite well unless a phase transition of a certain type occurs at relatively low β. According to the analysis of the synthetic model, the range of system parameters in which the phase transition occurs is rather limited. We therefore expect that SA serves as a promising approximate solver for (1).

III. EMPLOYMENT OF SA FOR CROSS VALIDATION

CV is a framework designed to evaluate the generalization ability of statistical models/learning systems on the basis of a given set of data. In particular, we examine leave-one-out (LOO) CV, but its generalization to k-fold CV is straightforward. In accordance with the cost function of (1), we define the generalization error

\epsilon_g = \frac{1}{2} \left\langle \left( y_{M+1} - \sum_{i=1}^{N} A_{(M+1),i} \, x_i \right)^{2} \right\rangle   (6)

as a natural measure for evaluating the generalization ability of x, where ⟨···⟩ denotes the expectation with respect to the "unobserved" (M+1)th data {A_{(M+1),i}}, y_{M+1}. LOO CV assesses the LOO CV error (LOOE)

\epsilon_{\mathrm{LOO}}(K|y, A) = \frac{1}{2M} \sum_{\mu=1}^{M} \left( y_\mu - \sum_{i=1}^{N} A_{\mu i} \, x_i^{\backslash\mu}(c^{\backslash\mu}) \right)^{2}   (7)

as an estimator of (6) for the solution of (1), where the superscript \µ indicates that the corresponding quantity is obtained from the data set from which the µth measurement has been left out.
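To tie the pieces together, the sketch below (ours, reusing the mc_pair_flip function from the previous listing) wraps the update in a simple geometric annealing schedule and uses the resulting approximate solver inside a leave-one-out loop to estimate (7). The schedule parameters beta0, beta_max, ratio and the number of sweeps are arbitrary illustrative choices, not the settings of Ref. [7].

import numpy as np

def solve_by_sa(y, A, K, rng, beta0=0.1, beta_max=1e3, ratio=1.1, sweeps=1):
    # approximate solver of (1): pair flipping while beta is increased geometrically
    N = A.shape[1]
    c = np.zeros(N, dtype=int)
    c[rng.choice(N, K, replace=False)] = 1      # random initial support of size K
    beta = beta0
    while beta < beta_max:
        for _ in range(sweeps * N):             # a few flips per variable at each beta
            c = mc_pair_flip(c, beta, y, A, rng)
        beta *= ratio
    support = np.flatnonzero(c)
    x_s, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
    x = np.zeros(N)
    x[support] = x_s
    return x

def loo_error(y, A, K, rng):
    # LOOE of (7): refit by SA on each reduced system, test on the held-out row
    M = len(y)
    errs = []
    for mu in range(M):
        keep = np.arange(M) != mu
        x_mu = solve_by_sa(y[keep], A[keep], K, rng)
        errs.append((y[mu] - A[mu] @ x_mu) ** 2)
    return 0.5 * np.mean(errs)

Scanning K over the tested set S_K and selecting the minimizer of loo_error reproduces the model-selection procedure whose cost was bounded by O(M^2 N |S_K| K_max) in the Introduction.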