arXiv:1603.01399v2 [cs.IT] 4 Oct 2016

Sampling approach to sparse approximation problem: determining degrees of freedom by simulated annealing

Tomoyuki Obuchi and Yoshiyuki Kabashima
Department of Mathematical and Computing Science, Tokyo Institute of Technology, Yokohama 226-8502, Japan

Abstract—The approximation of a high-dimensional vector by a small combination of column vectors selected from a fixed matrix has been actively debated in several different disciplines. In this paper, a sampling approach based on the Monte Carlo method is presented as an efficient solver for such problems. Especially, the use of simulated annealing (SA), a metaheuristic optimization algorithm, is focused on, and an efficient algorithm for determining degrees of freedom (the number of used columns) by cross validation (CV) is presented. A test on a synthetic model indicates that our SA-based approach can find a nearly optimal solution for the approximation problem and, when combined with the CV framework, it can optimize the generalization ability. Its utility is also confirmed by application to a real-world supernova data set.

I. INTRODUCTION

In a formulation of compressed sensing, a sparse vector x ∈ R^N is recovered from a given measurement vector y ∈ R^M (M < N) by minimizing the residual sum of squares (RSS) under a sparsity constraint as

  x̂(K) = argmin_x { (1/2) ||y − Ax||_2^2 }  subj. to  ||x||_0 ≤ K,   (1)

where A = (A_{μi}) ∈ R^{M×N} and ||x||_0 = Σ_{i=1}^N |x_i|^0 denote the measurement matrix and the number of non-zero components in x (the ℓ_0 norm), respectively [1]. The need to solve similar problems also arises in many contexts of information science such as variable selection in linear regression, data compression, denoising, and machine learning. We hereafter refer to the problem of (1) as the sparse approximation problem (SAP).

Despite the simplicity of its expression, SAP is highly nontrivial to solve. Finding the exact solution of (1) is proved to be NP-hard [2], and various approximate methods have been proposed. Orthogonal matching pursuit (OMP) [3], in which the set of used columns is incremented in a greedy manner so as to minimize the RSS, is a representative example of such approximate solvers. Another way of finding an approximate solution of (1) is to convert the problem into the Lagrangian form (1/2) ||y − Ax||_2^2 + λ ||x||_p^p in conjunction with relaxing ||x||_0 to ||x||_p^p = Σ_{i=1}^N |x_i|^p. In particular, setting p = 1 makes the converted problem convex and allows us to efficiently find its unique minimum solution; this approach is often termed LASSO [4]. When prior knowledge about the generation of x and y is available in the form of probability distributions, one may resort to the Bayesian framework for inferring x from y, which can be carried out efficiently by approximate message passing (AMP) [5].
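As a concrete illustration of the greedy strategy behind OMP mentioned above, the following minimal NumPy sketch grows the support one column at a time and refits the coefficients by least squares at each step. The function name and interface are our own illustrative choices, not code from [3].

import numpy as np

def omp(y, A, K):
    # Greedy OMP-style selection: at each step, add the column most
    # correlated with the current residual, then refit on the support.
    M, N = A.shape
    support = []
    x_s = np.zeros(0)
    r = y.copy()                       # current residual
    for _ in range(K):
        scores = np.abs(A.T @ r)
        scores[support] = -np.inf      # do not re-select a column
        support.append(int(np.argmax(scores)))
        x_s, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        r = y - A[:, support] @ x_s    # residual after refitting
    x = np.zeros(N)
    x[support] = x_s
    return x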
Solvers of these kinds have their own advantages and disadvantages, and the choice of which to employ depends on the imposed constraints and available resources. This means that developing novel possibilities is important for offering more choices. Supposing a situation where one can make use of relatively high computational abilities, we explore another approximate approach, Monte Carlo (MC)-based sampling. In an earlier study, we showed that a version of simulated annealing (SA) [6], which is a versatile MC-based metaheuristic for functional optimization, has the ability to efficiently find a nearly optimal solution to SAP in a wide range of system parameters [7]. In this paper, we particularly focus on the problem of determining the degrees of freedom K by cross validation (CV) utilizing SA. We will show that the necessary computational cost of our algorithm is bounded by a factor of O(M × |S_K| × K_max) relative to a single run of SA, where S_K is the set of tested values of K, |S_K| is its cardinality, and K_max = max(S_K) is the largest one among the tested values. Admittedly, this computational cost is not cheap. However, our algorithm is easy to parallelize, and we expect that large-scale parallelization will significantly diminish this disadvantage and make the CV analysis using SA practical.

II. SAMPLING FORMULATION AND SIMULATED ANNEALING

Let us introduce a binary vector c = (c_i) ∈ {0, 1}^N, which indicates the column vectors used to approximate y: if c_i = 1, the i-th column of A is used; if c_i = 0, it is not used. We call this binary variable the sparse weight. Given c, the optimal coefficients x(c) are expressed as

  x(c) = argmin_x { (1/2) ||y − A (c ◦ x)||_2^2 },   (2)

where c ◦ x represents the Hadamard product. For the zero components of c, the corresponding coefficients of x are actually indefinite, and we set them to be zero. The corresponding RSS is thus defined by

  E(c|y, A) = (1/2) ||y − A (c ◦ x(c))||_2^2 ≡ Mǫ(c),   (3)

where ǫ(c) denotes the RSS per component.
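For concreteness, (2) and (3) amount to an ordinary least-squares fit restricted to the selected columns. A minimal NumPy sketch, with function names of our own choosing, is:

import numpy as np

def x_of_c(y, A, c):
    # (2): least squares on the columns with c_i = 1; the indefinite
    # coefficients on the zero components of c are set to zero.
    x = np.zeros(A.shape[1])
    idx = np.flatnonzero(c)
    if idx.size:
        x[idx], *_ = np.linalg.lstsq(A[:, idx], y, rcond=None)
    return x

def energy(y, A, c):
    # (3): E(c|y, A) = ||y - A(c ∘ x(c))||_2^2 / 2
    r = y - A @ x_of_c(y, A, c)
    return 0.5 * float(r @ r)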

To perform sampling, we employ the statistical mechanical formulation in Ref. [7] as follows. By regarding E as an "energy" and introducing an "inverse temperature" β, we can define a Boltzmann distribution as

  P(c|β; y, A) = (1/G) δ(Σ_i c_i − K) e^{−βE(c|y,A)},   (4)

where G is the "partition function"

  G = G(β|y, A) = Σ_c δ(Σ_i c_i − K) e^{−βE(c|y,A)}.   (5)

In the limit of β → ∞, (4) is guaranteed to concentrate on the solution of (1). Therefore, sampling c at β ≫ 1 offers an approximate solution of (1).

Unfortunately, directly sampling c from (4) is computationally difficult. To resolve this difficulty, we employ a Markov chain dynamics whose equilibrium distribution accords with (4). Among many choices of such dynamics, we employ the standard Metropolis-Hastings rule [8]. A noteworthy remark is that a trial move of the sparse weights, c → c′, is generated by "pair flipping" two sparse weights, one equal to 0 and the other equal to 1, so that the constraint Σ_i c_i = K is kept satisfied. The update is summarized in Algorithm 1.

Algorithm 1 MC update with pair flipping
 1: procedure MCPF(c, β, y, A)            ⊲ MC routine
 2:   ONES ← {k | c_k = 1}, ZEROS ← {k | c_k = 0}
 3:   randomly choose i from ONES and j from ZEROS
 4:   c′ ← c
 5:   (c′_i, c′_j) ← (0, 1)
 6:   (E, E′) ← (E(c|y, A), E(c′|y, A))
 7:   p_accept ← min(1, e^{−β(E′−E)})
 8:   generate a random number r ∈ [0, 1]
 9:   if r < p_accept then
10:     c ← c′
11:   end if
12: end procedure

In general, a longer time is required for equilibrating the MC dynamics as β grows larger. Furthermore, the dynamics has the risk of being trapped by trivial local minima of (3) if β is fixed to a very large value from the beginning. A practically useful technique for avoiding these inconveniences is to start with a sufficiently small β and gradually increase it, which is termed simulated annealing (SA) [6]. As β → ∞, c is no longer updated, and the final configuration c_fin is expected to lead to a solution that is very close (or identical) to the optimal solution of (1), i.e., x̂(K) ≈ x(c_fin).

SA is mathematically guaranteed to find the globally optimal solution of (1) if β is increased to infinity slowly enough, in such a way that β(t) < C log(t + 2) is satisfied, where t is the counter of the MC dynamics and C is a time-independent constant [9]. Of course, this schedule is practically meaningless, and a much faster schedule is employed generally. In Ref. [7], we examined the performance of SA with a very rapid annealing schedule for a synthetic model whose properties at fixed β can be analytically evaluated. Comparison between the analytical and the experimental results indicates that the rapid SA performs quite well unless a phase transition of a certain type occurs at relatively low β. According to the analysis of the synthetic model, the range of system parameters in which the phase transition occurs is rather limited. We therefore expect that SA serves as a promising approximate solver for (1).
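A runnable Python transcription of Algorithm 1, together with a bare-bones annealing loop, might look as follows. The helper names and the random initialization are our illustrative choices (assuming 1 ≤ K < N), and a practical implementation would update the RSS incrementally instead of refitting the least squares at every step.

import numpy as np

rng = np.random.default_rng(0)

def energy(y, A, c):
    # E(c|y, A) of (3): least squares restricted to the support of c
    idx = np.flatnonzero(c)
    x_s, *_ = np.linalg.lstsq(A[:, idx], y, rcond=None)
    r = y - A[:, idx] @ x_s
    return 0.5 * float(r @ r)

def mc_pair_flip(c, beta, y, A):
    # One Metropolis-Hastings step with pair flipping (Algorithm 1):
    # exchanging a 1 and a 0 keeps sum(c) = K satisfied.
    ones, zeros = np.flatnonzero(c == 1), np.flatnonzero(c == 0)
    i, j = rng.choice(ones), rng.choice(zeros)
    c_new = c.copy()
    c_new[i], c_new[j] = 0, 1
    dE = energy(y, A, c_new) - energy(y, A, c)
    if dE <= 0 or rng.random() < np.exp(-beta * dE):
        return c_new                  # accept the trial move
    return c                          # reject and keep c

def anneal(y, A, K, betas, sweeps_per_beta=5):
    # Start from a random support of size K and raise beta gradually (SA);
    # the returned c plays the role of c_fin, so that x̂(K) ≈ x(c_fin).
    N = A.shape[1]
    c = np.zeros(N, dtype=int)
    c[rng.choice(N, size=K, replace=False)] = 1
    for beta in betas:
        for _ in range(sweeps_per_beta * N):
            c = mc_pair_flip(c, beta, y, A)
    return c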
III. CROSS VALIDATION

In practice, the degrees of freedom K is rarely given a priori and has to be determined from the data. A standard framework for this purpose is leave-one-out (LOO) CV, which is based on the generalization error

  ǫ_g(K) = (1/2) E[ ( y_{M+1} − Σ_{i=1}^N A_{(M+1),i} x̂_i(K) )^2 ],   (6)

namely the expected squared prediction error of x̂(K) for an "unobserved" (M + 1)th data ({A_{(M+1),i}}, y_{M+1}). LOO CV assesses the LOO CV error (LOOE)

  ǫ_LOO(K|y, A) = (1/2M) Σ_{μ=1}^M ( y_μ − Σ_{i=1}^N A_{μi} x_i^{\μ}(c^{\μ}) )^2   (7)

as an estimator of (6) for the solution of (1), where x^{\μ}(c^{\μ}) = (x_i^{\μ}(c^{\μ})) is the solution of (1) for the "μth LOO system," which is defined by removing the μth data ({A_{μi}}, y_μ) from the original system. One can evaluate (7) by independently applying SA to each of the M LOO systems.

The LOOE of (7) depends on K through c^{\μ}, and hence we can determine its "optimal" value from the minimum of ǫ_LOO(K|y, A) by sweeping K. Compared to the case of a single run of SA at a given K, the computational cost for LOO CV is increased by a certain factor. This factor is roughly evaluated as O(M × |S_K| × K_max) when varying K in the set S_K whose maximum value is K_max. However, this part of the computation can be easily parallelized and, therefore, may not be so problematic when sufficient quantities of CPUs and memories are available. In the next section, the rationality of the proposed approach is examined by application to a synthetic model and a real-world data set.

Before closing this section, we want to remark on two important issues. For this, we assume that y is generated by a true sparse vector x_0 as

  y = Ax_0 + ξ,   (8)

where ξ ∈ R^M is a noise vector whose entries are uncorrelated with one another.

The first issue is about the accuracy in inferring x_0. When A is provided as a column-wisely normalized zero-mean random matrix whose entries are uncorrelated with one another, (6) is linearly related to the squared distance between the true and inferred vectors, x_0 and x̂, as

  ǫ_g = (d_1/N) ||x̂ − x_0||_2^2 + d_0   (9)

[10]–[12], where d_1 and d_0 are positive constants. Hence, minimizing the estimator of (6), i.e. (7), leads to minimizing the squared distance from the true vector x_0. The same conclusion has also been obtained for LASSO in the limit of M → ∞ while keeping N and K finite [13], where it is not needed to assume the absence of correlations in A.

The second issue is about the difficulty in identifying the sparse weight vector c_0 of x_0 using CV, which is known to fail even when M/N → ∞ [14]. Actually, we have tried a naive approach to identify c_0 by directly minimizing a CV error with the use of SA and confirmed that it does not work. A key quantity for this is another type of LOOE:

  ǫ̃_LOO(c|y, A) = (1/2M) Σ_{μ=1}^M ( y_μ − Σ_{i=1}^N A_{μi} x_i^{\μ}(c) )^2.   (10)

This looks like (7), but is different in that c is common among all the terms. It may be natural to expect that the sparse weight minimizing (10), c̃ = argmin_c {ǫ̃_LOO(c|y, A)}, is the "best" c that converges to c_0 in the limit of M/N → ∞. Unfortunately, this is not true; in fact, ǫ̃_LOO(c|y, A) in general tends to decrease as K increases, irrespective of the value of ||x_0||_0 [14]. We have confirmed this by conducting SA, handling ǫ̃_LOO(c|y, A) as an energy function of c. Therefore, minimizing (10) can neither identify c_0 nor offer any clue for determining K.

We emphasize that minimization of (7) can be utilized to determine K so as to optimize the generalization ability, although it does not have the ability to identify c_0 either. The difference between these two issues is critical and confusing, as several earlier studies have provided some controversial implications about the usage of CV in sparse inference on linear models [13], [15]–[18].
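As a sketch, the LOOE of (7) can be assembled around any SAP solver that returns a sparse weight c, such as the SA routine sketched earlier. The driver below (our naming) removes one row at a time, solves on the remainder, and scores the held-out row; since the M LOO systems are independent, the loop parallelizes trivially, as noted above.

import numpy as np

def loo_cv_error(y, A, K, solver):
    # LOOE of (7); solver(y, A, K) must return a 0/1 sparse weight c,
    # e.g. the SA routine `anneal` sketched in Sec. II.
    M = A.shape[0]
    total = 0.0
    for mu in range(M):
        keep = np.arange(M) != mu              # the µth LOO system
        c = solver(y[keep], A[keep], K)
        idx = np.flatnonzero(c)
        x_s, *_ = np.linalg.lstsq(A[keep][:, idx], y[keep], rcond=None)
        total += (y[mu] - A[mu, idx] @ x_s) ** 2
    return total / (2.0 * M)

# Sweep K over the tested set S_K and pick the minimizer of the LOOE:
# K_opt = min(S_K, key=lambda K: loo_cv_error(y, A, K, solver=anneal))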
IV. RESULT

A. Test on a synthetic model

We first examine the utility of our methodology by applying it to a synthetic model in which vector y is generated in the manner of (8). For analytical tractability, we assume that A is a simple random matrix whose entries are independently sampled from N(0, N^{-1}) and that each component of x_0 and ξ is also independently generated from (1 − ρ_0)δ(x) + ρ_0 N(0, σ_x^2) and N(0, σ_ξ^2), respectively. Under these assumptions, an analytical technique based on the replica method of statistical mechanics makes it possible to theoretically assess the typical values of various macroscopic quantities when c is generated from (4) as N → ∞ keeping α = M/N finite [19]. We performed a theoretical assessment under the so-called replica symmetric (RS) assumption.

In the experiment, system parameters were fixed as ρ_0 = 0.1, α = 0.5, σ_x^2 = 10, and σ_ξ^2 = 0.1. The annealing schedule was set as

  β_a = β_0 + r^{a−1} − 1,  τ_a = τ,  (a = 1, ···, 100),   (11)

where τ_a denotes the typical number of MC flips per component for a given value of inverse temperature β_a. We set τ = 5, β_0 = 10^{-8}, and r = 1.1 as default parameter values. Thus, the maximum value of β was β_100 ≈ 1.3 × 10^4. The examined system sizes were N = 100, 200, and 400. We took the average over N_samp = 100 different samples. The error bar is given by the standard deviation among those samples divided by √(N_samp − 1).
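Under the stated assumptions, the synthetic instances and the schedule of (11) can be generated as follows; this is a sketch with our own function names, whose parameter defaults mirror the values above.

import numpy as np

rng = np.random.default_rng(0)

def synthetic_instance(N=200, alpha=0.5, rho0=0.1, sigma_x2=10.0, sigma_xi2=0.1):
    # y = A x_0 + ξ of (8), with the distributions assumed in Sec. IV-A
    M = int(alpha * N)
    A = rng.normal(0.0, np.sqrt(1.0 / N), size=(M, N))
    x0 = rng.normal(0.0, np.sqrt(sigma_x2), size=N) * (rng.random(N) < rho0)
    y = A @ x0 + rng.normal(0.0, np.sqrt(sigma_xi2), size=M)
    return y, A, x0

def schedule(beta0=1e-8, r=1.1, a_max=100):
    # beta_a = beta0 + r**(a-1) - 1 of (11); beta_100 comes out near 1.3e4
    a = np.arange(1, a_max + 1)
    return beta0 + r ** (a - 1.0) - 1.0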


Fig. 1 shows how the RSS per component ǫ in (1) depends on T = β^{-1} during the annealing process for K/N ≡ ρ = 0.05, 0.1, and 0.15. The SA data (symbols) for N = 100, 200, and 400 all exhibit considerably good accordance with the theoretical prediction (curves) for (4), despite the very rapid annealing schedule of (11), in which c_i (i = 1, 2, ..., N) is flipped only five times on average at each value of β = β_a. The replica analysis indicates that replica symmetry breaking (RSB) due to the de Almeida-Thouless (AT) instability [20] occurs below the broken lines. However, the RS predictions for β → ∞ still serve as the lower bounds of ǫ_min even in such cases [21], [22]. Fig. 2 shows that the values achieved by SA are fairly close to the lower bounds, implying that SA can find nearly optimal solutions of (1).

[Fig. 1. RSS per component ǫ versus T = β^{-1} observed in the annealing process for N = 100, 200, and 400, for ρ = 0.05, 0.1, and 0.15. Curves represent the RS predictions for (4). The replica symmetry is broken owing to the AT instability below the broken lines.]

[Fig. 2. RSS per component ǫ finally achieved by SA (symbols), its RS assessments for β → ∞ (full curve) and at the onset of the AT instability (red broken curve), plotted against ρ = K/N, for α = 0.5, ρ_0 = 0.1, σ_x^2 = 10, σ_ξ^2 = 0.1.]

Let us denote (6) of typical samples generated from (4) at inverse temperature β as ǫ_g(β). The generalization error of the optimal solution of (1) is assessed as ǫ_g(∞).
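In the synthetic model, such generalization errors can also be measured directly by averaging over freshly drawn "unobserved" rows, in the spirit of (6). The sketch below (our naming) assumes the same measurement statistics as in Sec. IV-A.

import numpy as np

rng = np.random.default_rng(1)

def generalization_error(x_hat, x_0, sigma_xi2=0.1, n_test=100000):
    # Monte Carlo estimate of (6): average squared prediction error on
    # fresh rows with the same statistics as the training ones.
    N = x_0.size
    A_new = rng.normal(0.0, np.sqrt(1.0 / N), size=(n_test, N))
    y_new = A_new @ x_0 + rng.normal(0.0, np.sqrt(sigma_xi2), size=n_test)
    return 0.5 * float(np.mean((y_new - A_new @ x_hat) ** 2))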
Fig. 3 plots the ǫ_LOO evaluated by the solutions of SA (symbols) and the RS evaluations of ǫ_g(∞) (full curve) and ǫ_g(β_AT) (red broken curve), against ρ = K/N. Here, β_AT is the critical inverse temperature at which RSB occurs owing to the AT instability. The three plots accord fairly well with one another to the left of their minimum point ρ* ∼ 0.063, whereas there are considerable discrepancies between ǫ_g(∞) and the other two plots for ρ > ρ*.

The discrepancies are considered to be caused by RSB. For β > β_AT, the MC dynamics tends to be trapped by a metastable state. This makes it difficult for SA to find the global minimum of ǫ(c), which explains why the SA results are close to ǫ_g(β_AT). Fortunately, this trapping works beneficially for the present purpose of raising the generalization ability by lowering ǫ_g, as seen in Fig. 3. As far as we have examined, for fixed ρ, ǫ_g(β_AT) never exceeds the RS evaluation of ǫ_g(∞) and is always close to the ǫ_LOO of the SA results. These imply that the generalization ability achieved by tuning K = Nρ using the SA-based CV is no worse than that obtained when CV is performed by exactly solving (1) for the LOO systems. This is presumably because, for a large ρ, the optimal solution of (1) overfits the observed (training) data and its generalization ability becomes worse than that of x(c) typically sampled at appropriate values of β (< ∞).

[Fig. 3. ǫ_LOO evaluated by solutions of SA (symbols) and RS assessments of typical ǫ_g for β → ∞ (full curve) and at the onset of the AT instability (red broken curve), plotted against ρ = K/N, for α = 0.5, ρ_0 = 0.1, σ_x^2 = 10, σ_ξ^2 = 0.1.]

B. Application to a real-world data set

We also applied our SA-based analysis to a data set from the SuperNova DataBase provided by the Berkeley Supernova Ia program [23], [24]. Screening based on certain criteria yields a reduced data set of M = 78 and N = 276 [25]. The purpose of the data analysis is to select a set of explanatory variables relevant for predicting the absolute magnitude at the maximum of type Ia supernovae by linear regression.

Following a conventional treatment of linear regression, we preprocessed both the absolute magnitude at the maximum (dependent variable) and the 276 candidate explanatory variables to have zero means. We performed the SA for the M = 78 LOO systems of the preprocessed data set. The result of a single experiment with varying K is given in Tables I and II. Table I provides the values of LOOE, which shows that ǫ_LOO is minimized at K = 2.

TABLE I
LOO CV ERROR OBTAINED FOR K = 1–5 FOR THE TYPE IA SUPERNOVA DATA SET.

  K       1        2        3        4        5
  ǫ_LOO   0.0328   0.0239   0.0281   0.0331   0.0334

Possible statistical correlations between explanatory variables, which were not taken into account in the synthetic model in sec. IV-A, could affect the results of linear regression [26]. The CV analysis also offers a useful clue for checking this risk. Examining the SA results of the M (= 78) LOO systems, we could count how many times each explanatory variable was selected, which could be used for evaluating the reliability of the variable [27]. Table II summarizes the results for the top five variables for K = 1–5. This indicates that no variables other than "2," which stands for color, were chosen stably, whereas variable "1," representing light curve width, was selected with high frequencies for K ≤ 3. Table II also shows that the frequency of "1" being selected is significantly reduced for K ≥ 4. These are presumably due to the strong statistical correlations between "1" and the newly added variables, suggesting the low reliability of the CV results for K ≥ 4.

TABLE II
THE TOP FIVE VARIABLES SELECTED BY THE M = 78 LOO CV FOR K = 1–5.

  K = 1:  variable        2    ∗    ∗    ∗    ∗
          times selected  78   0    0    0    0
  K = 2:  variable        2    1    275  ∗    ∗
          times selected  78   77   1    0    0
  K = 3:  variable        2    1    233  14   69
          times selected  78   76   69   3    2
  K = 4:  variable        2    1    233  94   225
          times selected  78   59   56   49   13
  K = 5:  variable        2    36   223  225  6
          times selected  78   37   33   31   27
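The counting procedure behind Table II can be sketched as follows, assuming a solver (e.g., the SA routine of Sec. II) that returns the 0/1 sparse weight of each LOO system; the function name is ours.

import numpy as np

def selection_counts(y, A, K, solver):
    # Count how many of the M LOO systems select each variable; stably
    # selected variables should approach a count of M (= 78 here), cf. [27].
    M, N = A.shape
    counts = np.zeros(N, dtype=int)
    for mu in range(M):
        keep = np.arange(M) != mu
        counts += solver(y[keep], A[keep], K)   # c is a 0/1 integer vector
    top = np.argsort(counts)[::-1][:5]          # top five, as in Table II
    return top, counts[top]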
In addition, for K ≥ 4, we observed that the results varied depending on the samples generated by the MC dynamics in SA, which implies that there exist many local minima in (3) of the LOO systems for K ≥ 4. These observations mean that we could select, with a certain confidence, at most only color and light curve width as the explanatory variables relevant for the absolute magnitude prediction. This conclusion is consistent with that of [25], in which the relevant variables were selected by LASSO, combined with hyperparameter determination following the ad hoc "one-standard-error rule," and with the comparison between several resulting models.

V. SUMMARY

We examined the abilities and limitations of simulated annealing (SA) for the sparse approximation problem, in particular when it is employed for determining the degrees of freedom by cross validation (CV). Application to a synthetic model indicates that SA can find nearly optimal solutions of (1) and, when combined with the CV framework, can optimize the generalization ability. Its utility was also tested by application to a real-world supernova data set.

Although we focused on the use of SA, samples at finite temperatures contain useful information for SAP. How to utilize such information is currently under investigation.

ACKNOWLEDGMENTS

This work was supported by JSPS KAKENHI under grant numbers 26870185 (TO) and 25120013 (YK).
The UC Berkeley SNDB is acknowledged for permission to use the data set of sec. IV-B. Useful discussions with Makoto Uemura and Shiro Ikeda on the analysis of the data set are also appreciated.

REFERENCES

[1] M. Elad, Sparse and Redundant Representations, Springer, 2010.
[2] B.K. Natarajan, “Sparse approximate solutions to linear systems,” SIAM Journal on Computing, vol. 24, pp. 227–234, 1995.
[3] Y.C. Pati, R. Rezaiifar and P.S. Krishnaprasad, “Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition,” in 1993 Conference Record of The Twenty-Seventh Asilomar Conference on Signals, Systems and Computers, 1993, pp. 40–44.
[4] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 58, pp. 267–288, 1996.
[5] F. Krzakala, M. Mézard, F. Sausset, Y. Sun and L. Zdeborová, “Probabilistic reconstruction in compressed sensing: algorithms, phase diagrams, and threshold achieving matrices,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2012, pp. P08009 (1–57), 2012.
[6] S. Kirkpatrick, C.D. Gelatt Jr and M.P. Vecchi, “Optimization by simulated annealing,” Science, vol. 220, pp. 671–680, 1983.
[7] T. Obuchi and Y. Kabashima, “Sparse approximation problem: how rapid simulated annealing succeeds and fails,” arXiv:1601.01074.
[8] W.K. Hastings, “Monte Carlo sampling methods using Markov chains and their applications,” Biometrika, vol. 57, pp. 97–109, 1970.
[9] S. Geman and D. Geman, “Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 6, pp. 721–741, 1984.
[10] H.S. Seung, H. Sompolinsky and N. Tishby, “Statistical mechanics of learning from examples,” Physical Review A, vol. 45, pp. 6056–6091, 1992.
[11] M. Opper and O. Winther, “A mean field algorithm for Bayes learning in large feed-forward neural networks,” in Advances in Neural Information Processing Systems 9, 1996, pp. 225–231.
[12] T. Obuchi and Y. Kabashima, “Cross validation in LASSO and its acceleration,” 2016, arXiv:1601.00881.
[13] D. Homrighausen and D.J. McDonald, “Leave-one-out cross-validation is risk consistent for lasso,” Machine Learning, vol. 97, pp. 65–78, 2014.
[14] J. Shao, “Linear model selection by cross-validation,” Journal of the American Statistical Association, vol. 88, pp. 486–494, 1993.
[15] O. Bousquet and A. Elisseeff, “Stability and generalization,” Journal of Machine Learning Research, vol. 2, pp. 499–526, 2002.
[16] C. Leng, Y. Lin and G. Wahba, “A note on the lasso and related procedures in model selection,” Statistica Sinica, vol. 16, pp. 1273–1284, 2006.
[17] S. Shalev-Shwartz, O. Shamir, N. Srebro and K. Sridharan, “Learnability and stability in the general learning setting,” in COLT 2009, 2009.
[18] H. Xu and S. Mannor, “Sparse algorithms are not stable: a no-free-lunch theorem,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, pp. 187–193, 2012.
[19] Y. Nakanishi, T. Obuchi, M. Okada and Y. Kabashima, “Sparse approximation based on a random overcomplete basis,” arXiv:1510.02189.
[20] J.R.L. de Almeida and D.J. Thouless, “Stability of the Sherrington-Kirkpatrick solution of a spin glass model,” Journal of Physics A: Mathematical and General, vol. 11, pp. 983–990, 1978.
[21] T. Obuchi, K. Takahashi and K. Takeda, “Replica symmetry breaking, complexity, and spin representation in the generalized random energy model,” Journal of Physics A: Mathematical and Theoretical, vol. 43, pp. 485004 (1–28), 2010.
[22] T. Obuchi, “Role of the finite replica analysis in the mean-field theory of spin glasses,” Ph.D. thesis, Tokyo Institute of Technology, 2010, arXiv:1510.02189.
[23] “The UC Berkeley Filippenko Group’s Supernova Database,” http://heracles.astro.berkeley.edu/sndb/.
[24] J.M. Silverman, M. Ganeshalingam, W. Li and A.V. Filippenko, “Berkeley Supernova Ia Program – III. Spectra near maximum brightness improve the accuracy of derived distances to Type Ia supernovae,” Mon. Not. R. Astron. Soc., vol. 425, pp. 1889–1916, 2012.
[25] M. Uemura, K.S. Kawabata, S. Ikeda and K. Maeda, “Variable selection for modeling the absolute magnitude at maximum of Type Ia supernovae,” Publ. Astron. Soc. Japan, vol. 67, pp. 55 (1–9), 2015.
[26] D. Belsley, Conditioning Diagnostics: Collinearity and Weak Data in Regression, Wiley, 1991.
[27] N. Meinshausen and P. Bühlmann, “Stability selection,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 72, pp. 417–473, 2010.