Analysis of Regularized LS Reconstruction and Random Matrix Ensembles in Compressed Sensing

Mikko Vehkaperä, Member, IEEE, Yoshiyuki Kabashima, and Saikat Chatterjee, Member, IEEE

Abstract—Performance of regularized least-squares estimation in noisy compressed sensing is analyzed in the limit when the dimensions of the measurement matrix grow large. The sensing matrix is considered to be from a class of random ensembles that encloses as special cases the standard Gaussian, row-orthogonal, geometric and so-called T-orthogonal constructions. Source vectors that have non-uniform sparsity are included in the system model. Regularization based on the ℓ1-norm, leading to LASSO estimation or basis pursuit denoising, is given the main emphasis in the analysis. Extensions to ℓ2-norm and "zero-norm" regularization are also briefly discussed. The analysis is carried out using the replica method in conjunction with some novel matrix integration results. Numerical experiments for LASSO are provided to verify the accuracy of the analytical results.

The numerical experiments show that for noisy compressed sensing, the standard Gaussian ensemble is a suboptimal choice for the measurement matrix. Orthogonal constructions provide superior performance in all considered scenarios and are easier to implement in practical applications. It is also discovered that for non-uniform sparsity patterns the T-orthogonal matrices can further improve the mean square error behavior of the reconstruction when the noise level is not too high. However, as the additive noise becomes more prominent in the system, the simple row-orthogonal measurement matrix appears to be the best choice out of the considered ensembles.

Index Terms—Compressed sensing, eigenvalues of random matrices, compressed sensing matrices, noisy linear measurements, ℓ1 minimization

I. INTRODUCTION

Consider the standard compressed sensing (CS) [1]–[3] setup where the sparse vector x⁰ ∈ R^N of interest is observed via noisy linear measurements

y = Ax⁰ + σw, (1)

where A ∈ R^{M×N} represents the compressive (M ≤ N) sampling system. Measurement errors are captured by the vector w ∈ R^M and the parameter σ controls the magnitude of the distortions. The task is then to infer x⁰ from y, given the measurement matrix A. Depending on the chosen performance metric, the level of knowledge about the statistics of the source and error vectors, or computational complexity constraints, multiple choices are available for achieving this task. One possible solution that does not require detailed information about σ or the statistics of {x⁰, w} is regularized least-squares (LS) based reconstruction

x̂ = arg min_{x∈R^N} { (1/2λ) ‖y − Ax‖² + c(x) }, (2)

where ‖·‖ is the standard Euclidean norm, λ a non-negative design parameter and c : R^N → R a fixed non-negative valued (cost) function. If we interpret (2) as a maximum a posteriori (MAP) estimator, the implicit assumption would be that: 1) the additive noise can be modeled by a zero-mean Gaussian random vector with covariance λI_M, and 2) the distribution of the source is proportional to e^{−c(x)}. Neither is in general true for the model (1) and, therefore, reconstruction based on (2) is clearly suboptimal.

In the sparse estimation framework, the purpose of the cost function c is to penalize the trial x so that some desired property of the source is carried over to the solution x̂. In the special case when the measurements are noise-free, that is, σ = 0, the choice λ → 0 reduces (2) to solving the constrained optimization problem

min_{x̂∈R^N} c(x̂)  s.t.  y = Ax̂. (3)

It is well-known that in the noise-free case the ℓ1-cost c(x) = ‖x‖₁ = Σ_j |x_j| leads to sparse solutions that can be found using linear programming. For the noisy case the resulting scheme is called LASSO [4] or basis pursuit denoising [5]

x̂_{ℓ1} = arg min_{x∈R^N} { (1/2λ) ‖y − Ax‖² + ‖x‖₁ }. (4)

Just like its noise-free counterpart, it is of particular importance in CS since (4) can be solved by using standard convex optimization tools such as cvx [6]. Due to the prevalence of reconstruction methods based on ℓ1-norm regularization in CS, we shall keep the special case of the ℓ1-cost c(x) = ‖x‖₁ as the main example of the paper, although it is known to be a suboptimal choice in general.

Manuscript received December 1, 2013; revised November 6, 2015; accepted January 25, 2016. The editor coordinating the review of this manuscript and approving it for publication was Prof. Venkatesh Saligrama. The research was funded in part by the Swedish Research Council under VR Grant 621-2011-1024 (MV) and MEXT KAKENHI Grant No. 25120013 (YK). M. Vehkaperä's visit to Tokyo Institute of Technology was funded by the MEXT KAKENHI Grant No. 24106008. This paper was presented in part at the 2014 IEEE International Symposium on Information Theory.

M. Vehkaperä was with the KTH Royal Institute of Technology, Sweden and Aalto University, Finland. He is now with the Department of Electronic and Electrical Engineering, University of Sheffield, Sheffield S1 3JD, UK. (e-mail: m.vehkapera@sheffield.ac.uk)

Y. Kabashima is with the Department of Computational Intelligence and Systems Science, Tokyo Institute of Technology, Yokohama 226-8502, Japan.

S. Chatterjee is with the School of Electrical Engineering and the ACCESS Linnaeus Center, KTH Royal Institute of Technology, SE-100 44 Stockholm, Sweden.
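As an aside, the LASSO problem (4) is easy to reproduce numerically. The following is a minimal sketch in Python using the cvxpy package (an open-source analogue of the cvx toolbox cited above); the problem dimensions and parameter values are arbitrary choices for illustration, not values used in the paper.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
M, N, rho, sigma, lam = 100, 300, 0.15, 0.3, 0.1   # illustrative values

# Sparse source: Bernoulli-Gaussian mixture as in (6) with pi(x) = g_x(0; 1)
x0 = rng.standard_normal(N) * (rng.random(N) < rho)
A = rng.standard_normal((M, N)) / np.sqrt(M)        # standard Gaussian setup
y = A @ x0 + sigma * rng.standard_normal(M)

# LASSO / basis pursuit denoising, cf. (4)
x = cp.Variable(N)
cp.Problem(cp.Minimize(cp.sum_squares(y - A @ x) / (2 * lam)
                       + cp.norm1(x))).solve()
print("per-component MSE:", np.mean((x.value - x0) ** 2))
```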


A. Brief Literature Review

In the literature, compressed sensing has a strong connotation of sparse representations. We shall next provide a brief review of the CS literature while keeping this in mind. The theoretical works in CS can be roughly divided into two principal directions: 1) worst case analysis, and 2) average / typical case analysis. In the former approach, analytical tools that examine the algebraic properties of the sensing matrix A, such as mutual coherence, spark or the restricted isometry property (RIP), are used. The goal is then to find sufficient conditions for the chosen property of A that guarantee perfect reconstruction — at least with high probability. The latter case usually strives for sharp conditions on when the reconstruction is possible when A is sampled from some random distribution. The analytical tools vary from combinatorial geometry to statistical physics methods. Both worst case and average case analysis have their merits and flaws, as we shall discuss below.

For mutual coherence, several works have considered the case of noise-free observations (σ = 0) and ℓ1-norm minimization based reconstruction. The main objective is usually to find the conditions that need to be satisfied between the allowed sparsity level of x and the mutual coherence property of A so that exact reconstruction is possible. In particular, the authors of [7] established such conditions for the special case when A is constructed by concatenating a pair of orthonormal bases. These conditions were further refined in [8] and the extension to general matrices was reported in [9] using the concept of spark.

Another direction in the worst case analysis was taken in [10], where the basic setup (1) with sparse additive noise was considered. The threshold for exact reconstruction under these conditions was derived using RIP. By establishing a connection between the Johnson–Lindenstrauss lemma and RIP, the authors of [11] later proved that RIP holds with high probability when M grows large for a certain class of random matrices. Special cases of this ensemble are, for example, matrices whose components are independent identically distributed (IID) Gaussian or Bernoulli random variables. This translates roughly to a statement that such matrices are "good" for CS problems with an ℓ1-norm based penalty if the system size is sufficiently large.

In addition to the basic problem stated above, mutual coherence and RIP based worst case analyses are prominent also in the study of greedy CS algorithms and fusion strategies. Some examples are the analysis of orthogonal matching pursuit [12]–[14], subspace pursuit [15], CoSaMP [16], group LASSO [17] and the fusion strategy [18]. The general weakness of these approaches is, however, that if one is interested in typical or average case performance, the results provided by the worst case analysis are often very pessimistic and loose. This consideration is tackled by the second class of analytical results we mentioned at the beginning of the review.

In a series of papers, the authors of [19]–[21] used tools from combinatorial geometry to show that in the limit of increasing system size, the ℓ1-reconstruction has a sharp phase transition when the measurements are noise-free. A completely different approach based on the approximate message passing (AMP) algorithm [22], [23] was introduced in [24] and shown to match the combinatorial results perfectly. Both of the above methods are mathematically rigorous, and the AMP approach has the additional benefit that it also provides a low-complexity computational algorithm that matches the threshold behavior. The downside is that extending these analyses to more general ensembles, both for the measurement matrix and the source vector, seems to be quite difficult. An alternative route is to use statistical mechanics inspired tools like the replica method [25]–[27].

By now the replica method has been accepted in the information theory society as a mathematical tool that can tackle problems that are very difficult, or impossible, to solve using other (rigorous) approaches. Although the outcomes of the replica analysis have enjoyed considerable success (see, e.g., [28]–[34] for some results related to the present paper), one should keep in mind that mathematical rigor is still lacking in parts of the method [35]. However, recent results in mathematical physics have provided at least circumstantial evidence that the main problem of the replica method is most likely in the assumed structure of the solution [35]–[39] and not in parts such as replica continuity that lack mathematical proof. In particular, the mistake in the original solution of the Sherrington–Kirkpatrick spin glass has now been traced to the assumption of the replica symmetric (RS) ansatz in the saddle-point evaluation of the free energy. Indeed, the end result of Parisi's full replica symmetry breaking (RSB) solution (see, e.g., [25]) has been proved to be correct [38], [39] in this case. Similar rigorous methods have also been applied in wireless communications [40] and error correction coding [41], [42], to name just a few examples¹.

¹To avoid the misconception that these methods have made non-rigorous approaches obsolete, some comments are in place. Firstly, the scope of the rigorous methods tends to be much more limited than that of the non-rigorous ones. Secondly, the analysis typically gives bounds for the quantities of interest rather than sharp predictions. Thirdly, it is often helpful to know the end result obtained through some non-rigorous way, like the replica method, before applying the mathematically exact tools on the problem.

B. Related Prior Work

The authors of [28] analyzed the asymptotic performance of LASSO and "zero-norm" regularized LS by extending the minimum mean square error (MMSE) estimation problem in code division multiple access (CDMA) to MAP detection in linear vector models. More specifically, the MMSE formulas obtained with the replica method [43], [44] were first assumed to be valid and then transformed to the case of MAP decoding through "hardening". Unfortunately, this approach was limited to the cases where the appropriate MMSE formulas already existed, and the end result of the analysis still required quite a lot of numerical computations. The scope of the analysis was extended to a more general class of random matrices by employing the Harish-Chandra–Itzykson–Zuber (HCIZ) integral formula [45], [46] in [30]. Although the emphasis there was on support recovery, also the MSE could be inferred from the given results. A slightly different scenario where the additive noise is sparse was analyzed in [47], [48]. For such a measurement model, if one replaces the squared ℓ2-norm

distance in (2) by the ℓ1-norm and uses also ℓ1-regularization, perfect reconstruction becomes sometimes feasible [47], [48]. It is also possible to characterize the MSE of the reconstruction outside of this region using the replica method [48].

The references above left open the question of how the choice of measurement matrix affects the fidelity of the reconstruction in the noisy setup. In [49] a partial answer was obtained through information theoretic analysis. The authors showed that standard Gaussian sensing matrices incur no loss in the noise sensitivity threshold if optimal encoding and decoding are used. A similar result was obtained earlier using the replica method in [29], and extended to more general matrix ensembles in the aforementioned paper [30]. On the other hand, a generalization of the Lindeberg principle was used by the authors of [50] to show that the average cost in LASSO is universal for a class of matrices of standard type.

Based on the above results and the knowledge that for the noise-free case the perfect reconstruction threshold is quite universal [20], [21], [34], one might be tempted to conclude that using sensing matrices sampled from the standard Gaussian ensemble is the optimal choice also in the noisy case when practical algorithms such as LASSO are used. However, there is also some counter-evidence in other settings, such as the noise-free case with non-uniform sparsity [31], [32] and spreading sequence design in CDMA [51], [52], that leaves the problem still interesting to investigate in more detail².

²After the initial submission of the present paper, parallel studies using completely different mathematical methods and arguing for the superiority of the orthogonal constructions have been presented in [53] and [54]. Since then, an extension to the present paper has been proposed in [55] and iterative algorithms approximating Bayesian optimal estimation for structured matrices have been devised, see for example [56]–[59].

Albeit from a slightly different motivational point of view, a similar endeavor was undertaken earlier in [60]–[63], where it was discovered that measurement matrices with a specific structure are beneficial for message passing decoding in noise-free settings. These spatially coupled, or seeded, measurement matrices helped the iterative algorithm to get past local extrema and hence improved the perfect reconstruction threshold of ℓ1-recovery significantly. Such constructions, however, turned out to be detrimental for convex relaxation based methods when compared to the standard Gaussian ensemble.

Finally, we remark that the uniform sparsity model studied in [34] was extended to a non-uniform noise-free setting in [33]. The goal there was to optimize the recovery performance using weighted ℓ1-minimization when the sparsity pattern is known. We deviate from those goals by considering a noisy setup with a more general matrix ensemble for the measurements. On the other hand, we do not try to optimize the reconstruction with block-wise adaptive weights and leave such extensions as future research topics.

C. Contribution and Summary of Results

The main goal of the present paper is to extend the scope of [28] and [30] to a wider range of matrix ensembles and to non-uniform sparsities of the vector of interest. We deviate from the approach of [28], [30] by evaluating the performance directly using the replica method as in [31]–[34]. The derivations are also akin to some earlier works on linear models [64], [65]. After obtaining the results for ℓ1-regularization, we sketch how they can be generalized to other cases like ℓ2-norm and "zero-norm" based regularization.

The analysis shows that under the assumption of the RS ansatz (for details, see Section IV), the average MSE of reconstruction is obtained via a system of coupled fixed point equations that can be solved numerically. For the T-orthogonal case, we find that the solution depends on the sparsity pattern (how the non-zero components are located block-wise in the vector) of the source — even when such knowledge is not used in the reconstruction. In the case of the rotationally invariant ensemble, the results are obtained as a function of the Stieltjes transform of the eigenvalue spectrum that describes the measurement matrix. For this case only the total sparsity of the source vector has influence on the reconstruction performance. The end results for the rotationally invariant case are also shown to be equivalent to those in [30], bridging the gap between two different approaches to replica analysis.

Finally, solving the MSE of the replica analysis for some practical settings reveals that the standard Gaussian ensemble is suboptimal as a sensing matrix when the system is corrupted by additive noise. For example, a random row-orthogonal measurement matrix provides uniformly better reconstructions compared to the Gaussian one. This is in contrast to the noise-free case, where it is well known that the perfect reconstruction threshold is the same for the whole rotationally invariant ensemble (see, e.g., [34]). On the other hand, although T-orthogonal measurement matrices are able to offer lower MSE than any other ensemble we tested when the sparsity of the source is not uniform, the effect diminishes as the noise level in the system increases. This may be intuitively explained by the fact that the additive noise in the system makes it more difficult to differentiate between blocks of different sparsities when we have no prior information about them.

D. Notation and Paper Outline

Boldface symbols denote (column) vectors and matrices. The identity matrix of size M × M is written I_M and the transpose of a matrix A as A^T. Given a variable x_k with a countable index set K, we abbreviate {x_k} = {x_k : k ∈ K}. We write i = √−1 and, for some (complex) function f(z), denote f(z₀) = extr_z f(z), where z₀ is an extremum of the function f, that is, satisfies df/dz |_{z₀} = 0. An analogous definition holds for functions of multiple variables. The indicator function satisfies 1(A) = 1 if A is true and is zero otherwise. Dirac's delta function is written δ(x) and the Kronecker symbol δ_ij.

Throughout the paper we assume for simplicity that given any continuous (discrete) random variable, the respective probability density (probability mass) function exists. The same notation is used for both cases and, given a general continuous / discrete random variable (RV), we often refer to its probability density function (PDF) for brevity. The true and postulated PDFs of a random variable are denoted p and q, respectively. If x is a real-valued Gaussian RV with mean µ and covariance Σ, we write the density of x as p(x) = g_x(µ; Σ).

The rest of the paper is organized as follows. The problem formulation and a brief introduction to the replica trick are given in Section II.
Section III provides the end results of the replica analysis for LASSO estimation. This case is also used in the detailed replica analysis provided in Appendices A and B. A sketch of the main steps involved in the replica analysis and a comparison to existing results are given in Section IV for the rotationally invariant setup. Conclusions are provided in Section V, and two matrix integral results used as a part of the replica analysis are proved in Appendix C. Finally, Appendix D provides the details of the geometric ensemble.

II. PROBLEM FORMULATION AND METHODS

Consider the CS setup (1) and assume that the elements of w are IID standard Gaussian random variables, so that

p(y | A, x⁰) = g_y(Ax⁰; σ²I_M) (5)

is the conditional PDF of the observations. Recall that the notation x⁰ means here that the observation (1) was generated as y = Ax⁰ + σw, that is, x⁰ is the true vector generated by the source. Note that in this setting the additive noise is dense and, therefore, perfect reconstruction is in general not possible [47], [48]. Let the sparse vector of interest x⁰ be partitioned into T equal length parts {x⁰_t}_{t=1}^T that are statistically independent. The components in each of the blocks t = 1, ..., T are drawn IID according to the mixture distribution

p_t(x) = (1 − ρ_t)δ(x) + ρ_t π(x),  t = 1, ..., T, (6)

where ρ_t ∈ [0, 1] is the expected fraction of non-zero elements in x⁰_t, which are drawn independently according to π(x). The expected density, or sparsity, of the whole signal is thus ρ = T^{−1} Σ_t ρ_t. We denote the true prior according to which the data is generated by p(x⁰; {ρ_t}) and call {ρ_t} the sparsity pattern of the source. For future reference, we define the following nomenclature.

Definition 1. When the system size grows without bound, namely, M, N → ∞ with fixed and finite compression rate α = M/N and sparsity levels {ρ_t}, we say the CS setup approaches the large system limit (LSL).

Definition 2. Let A ∈ R^{M×N} be a sensing matrix with compression rate α = M/N ≤ 1. We say that the recovery problem (2) is:

1) T-orthogonal setup, if N = TM and the sensing matrix is constructed as

A = [O_1 ··· O_T], (7)

where {O_t} are independent and distributed uniformly on the group of orthogonal M × M matrices according to the Haar measure³;

2) Standard Gaussian setup, if the elements of A are drawn IID according to g_a(0; 1/M);

3) Row-orthogonal setup, if O is an N × N Haar matrix and the sensing matrix is constructed as A = α^{−1/2}PO, where P = [I_M 0_{M×(N−M)}] picks the first M rows of O. Clearly AA^T = α^{−1}I_M and A has orthogonal rows;

4) Geometric setup, if A = UΣV^T where U, V are independent Haar matrices and Σ ∈ R^{M×N} is a diagonal matrix whose (m, m)th entry is given by σ_m ∝ τ^{m−1} for m = 1, ..., M. The parameter τ ∈ (0, 1] is chosen so that a given value of the peak-to-average eigenvalue ratio

κ = σ₁² / ( M^{−1} Σ_{m=1}^M σ_m² ) (8)

is met, and the singular values are scaled to satisfy the power constraint N^{−1} Σ_{m=1}^M σ_m² = 1. For details, see Appendix D;

5) General rotationally invariant setup, if the decomposition R = A^TA = O^TDO exists, so that O is an N × N Haar matrix and D is a diagonal matrix containing the eigenvalues of R. We also assume that the empirical distribution of the eigenvalues

F_R^M(x) = M^{−1} Σ_{i=1}^M 1(λ_i(R) ≤ x), (9)

where 1(·) is the indicator function and λ_i(R) denotes the ith eigenvalue of R, converges to some non-random limit in the LSL and satisfies E tr(AA^T)/N = E tr(D)/N = 1. The setups 2) – 4) are all special cases of this ensemble.

To make the comparison fair between different setups, all cases above are defined so that E tr(AA^T)/N = 1. In addition, both of the orthogonal setups satisfy the condition αAA^T = I_M. A numerical sketch for sampling these ensembles is given after Remark 1 below.

Remark 1. The T-orthogonal sensing matrix was considered in [31], [32] under the assumption of noise-free measurements. There it was shown to improve the perfect recovery threshold when the source had non-uniform sparsity. On the other hand, the row-orthogonal setup is the same matrix ensemble that was studied in the context of CDMA in [51], [52]. There it was called the Welch bound equality (WBE) spreading sequence ensemble and shown to provide maximum spectral efficiency both for Gaussian [51] and non-Gaussian [52] inputs given optimal MMSE decoding. The geometric setup is inspired by [66], where a similar sensing matrix was used to examine the robustness of the AMP algorithm and its variants via Monte Carlo simulations. It reduces to the row-orthogonal ensemble when κ → 1.

³In the following, a matrix O that has this distribution is said to be simply a Haar matrix.
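The following sketch in Python is one possible way to sample the ensembles of Definition 2 (our own illustration; the helper names are hypothetical). The Haar matrices are generated by QR decomposition of a Gaussian matrix with the standard sign correction, and the κ-to-τ calibration of the geometric setup (Appendix D) is omitted, τ being passed directly instead.

```python
import numpy as np

def haar(n, rng):
    """Sample an n x n orthogonal matrix from the Haar measure."""
    z = rng.standard_normal((n, n))
    q, r = np.linalg.qr(z)
    return q * np.sign(np.diag(r))     # sign fix makes the law exactly Haar

def t_orthogonal(M, T, rng):
    """Setup 1): A = [O_1 ... O_T], N = T*M."""
    return np.hstack([haar(M, rng) for _ in range(T)])

def row_orthogonal(M, N, rng):
    """Setup 3): first M rows of an N x N Haar matrix, scaled by alpha^(-1/2)."""
    return haar(N, rng)[:M, :] / np.sqrt(M / N)

def geometric(M, N, tau, rng):
    """Setup 4): A = U S V^T, geometric singular values s_m ~ tau^(m-1),
    scaled to meet the power constraint sum(s_m^2)/N = 1."""
    s = tau ** np.arange(M)
    s *= np.sqrt(N / np.sum(s ** 2))
    S = np.zeros((M, N)); S[:M, :M] = np.diag(s)
    return haar(M, rng) @ S @ haar(N, rng).T

rng = np.random.default_rng(1)
A = row_orthogonal(100, 200, rng)
print(np.allclose(A @ A.T, 2 * np.eye(100)))   # AA^T = alpha^{-1} I_M, alpha = 1/2
```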

A. Bayesian Framework

To enable the use of statistical mechanics tools, we reformulate the original optimization problem (2) in a probabilistic framework. For simplicity⁴, we also make the additional restriction that the cost function separates as

c(x) = Σ_{j=1}^N c(x_j), (10)

where c is a function whose actual form depends on the type of the argument (scalar or vector). Then, the postulated prior (recall our notational convention from Section I-D) of the source is defined as

q_β(x) = (1/z_β) e^{−βc(x)}, (11)

where z_β = ∫ e^{−βc(x)} dx < ∞ is a normalization constant. Hence, c(x) needs to be such that the above integral is convergent for given finite M, N and β > 0. The purpose of the non-negative parameter β (inverse temperature) is to enable MAP detection, as will become clear later. Note that (11) encodes no information about the sparsity pattern of the source and is mismatched from the true prior p(x⁰; {ρ_t}). From an algorithmic point of view, this means that the system operator has no specific knowledge about the underlying sparsity structure, or does not want to utilize it due to increased computational complexity. We also define a postulated PDF for the measurement process

q_β(y | A, x) = g_y(Ax; (λ/β) I_M), (12)

so that unless λ/β = σ², the observations are generated according to a different model than what the reconstruction algorithm assumes. Note that λ is the same parameter as in the original problem (2).

⁴This assumption is in fact not necessary for the replica analysis. However, if the source vector has independent elements and the regularization function decouples element-wise, the numerical evaluation of the saddle-point equations is a particularly simple task.

Due to Bayes' theorem, the (mismatched) posterior density of x based on the postulated distributions reads

q_β(x | y, A) = (1/Z_β(y, A)) exp{ −β[ (1/2λ) ‖y − Ax‖² + c(x) ] }, (13)

where Z_β(y, A) is the normalization factor, or partition function, of the above PDF. We could now estimate x based on (13), for example by computing the posterior mean ⟨x⟩_β, where we used the notation

⟨h(x)⟩_β = ∫ h(x) q_β(x | y, A) dx (14)

for some given β > 0 and trial function h of x. The specific case that maximizes the a posteriori probability for given λ (and σ²) is the zero-temperature configuration, obtained by letting β → ∞. In this limit (13) reduces to a uniform distribution over the x that provide the global minimum of ‖y − Ax‖²/(2λ) + c(x). If the problem has a unique solution, we have ⟨x⟩_{β→∞} = x̂, where x̂ is the solution of (2). Thus, the behavior of regularized LS reconstruction can be obtained by studying the density (13). This is a standard problem in statistical mechanics if we interpret q_β(x | y, A) as the Boltzmann distribution of a spin glass, as described next.

B. Free Energy, the Replica Trick and the Mean Square Error

The key to finding the statistical properties of the reconstruction (2) is the normalization factor, or partition function, Z_β(y, A). Based on the statistical mechanics approach, our goal is to assess the (normalized) free energy

f_β(y, A) = −(1/βN) ln Z_β(y, A) (15)

in the LSL where M, N → ∞ with α = M/N fixed, and to obtain the desired statistical properties from it. However, the formulation above is problematic since f_β depends on the observations y and the measurement process A. One way to circumvent this difficulty is to notice that the law of large numbers guarantees that for all ε > 0, the probability that |f_β(y, A) − E{f_β(y, A)}| > ε tends to vanish in the LSL for any finite and positive λ, σ². This leads to the computation of the average free energy f̄_β = E{f_β(y, A)} instead of (15) and is called self-averaging in statistical mechanics.

Concentrating on the average free energy f̄_β avoids the explicit dependence on {y, A}. Unfortunately, assessing the necessary expectations is still difficult, and we need some further manipulations to turn the problem into a tractable one. The first step is to rewrite the average free energy in the zero-temperature limit as

f = − lim_{β,N→∞} (1/βN) lim_{n→0⁺} (∂/∂n) ln E{[Z_β(y, A)]ⁿ}. (16)

So far the development has been rigorous if n ∈ R and the limits are unique and exist⁵. The next step is to employ the replica trick to overcome the apparent road block of evaluating the necessary expectations as a function of the real-valued parameter n.

⁵In principle, the existence of a unique thermodynamic limit can be checked using the techniques introduced in [37]. However, since the replica method itself is already non-rigorous, we have opted to verify the results in the end using numerical simulations.

Replica Trick. Consider the free energy in (16) and let y = Ax⁰ + σw be a fixed observation vector. Assume that the limits commute, which in conjunction with the expression

[Z_β(y, A; λ)]ⁿ = ∫ Π_{a=1}^n exp( −(β/2λ) ‖y − Ax^a‖² − βc(x^a) ) dx^a (17)

for n = 1, 2, ... allows the evaluation of the expectation in (16) as a function of n ∈ R. The obtained functional expression is then utilized in taking the limit n → 0⁺.

It is important to note that, as written above, the validity of the analytical continuation remains an open question and the replica trick is for this part still lacking mathematical validation. However, as remarked in the Introduction, the most serious problem in practice seems to arise from the simplifying assumptions one makes about how the correlations between the variables {x^a} behave in the LSL. The simplest case is the RS ansatz, which is described by the overlap matrix Q ∈ R^{(n+1)×(n+1)} of the form

Q = [Q^{[a,b]}]_{a,b=0}^n =
⎡ r  m  ···  m ⎤
⎢ m  Q   q      ⎥
⎢ ⋮   q   ⋱     ⎥
⎣ m           Q ⎦ (18)

with slightly non-standard indexing that is common in the literature related to replica analysis. The elements of Q are defined as overlaps, or empirical correlations, Q^{[a,b]} = N^{−1} x^a · x^b. The implication of the RS ansatz is that the replica indexes a = 1, 2, ..., n can be arbitrarily permuted without changing the end result when M = αN → ∞. This seems a priori reasonable since the replicas introduced in (17) are identical and in no specific order. But as also mentioned in the Introduction, the RS assumption is not always correct. Sometimes the discrepancy is easy to fix, for example as in the case of the random energy model [67], while much more intricate methods like Parisi's full RSB solution are needed in other cases [25]. For the purposes of the present paper, we restrict ourselves to the RS case and check the accuracy of the end result against simulations. Although this might seem a mathematically somewhat unsatisfying approach, we believe that the RS results are useful for practical purposes due to their simple form and can serve as a stepping stone for possible extensions to the RSB cases.

Finally, let us consider the problem of finding the MSE of the reconstruction (2). Using the notation introduced earlier, we may write

mse = N^{−1} E{ ‖x⁰ − ⟨x⟩_{β→∞}‖² }
    = ρ − 2 E{ N^{−1} ⟨x⟩_{β→∞}^T x⁰ } + E{ N^{−1} ⟨‖x‖²⟩_{β→∞} }
    = ρ − 2m + Q, (19)

where ⟨···⟩_β was defined in (14) and E{···} denotes the expectation w.r.t. the variables in (1). Thus, if we can compute m and Q using the replica method, the MSE of reconstruction follows immediately from (19). As shown above, this amounts to computing the overlap matrix (18).

III. RESULTS FOR LASSO RECONSTRUCTION

In this section we provide the results of the replica analysis for LASSO reconstruction (4) for the ensembles introduced in Definition 2. For simplicity, we let the non-zero elements of the source be standard Gaussian, that is, π(x) = g_x(0; 1) in (6). Recall also that LASSO is the special case of regularization c(x) = ‖x‖₁ in the general problem (2). The replica symmetric ansatz is assumed in the derivations given in Appendices A and B. The casual reader can find a sketch of the replica analysis, along with some generalizations for different choices of the cost function c(x), in Section IV. Further interpretation of the results and connections to the earlier work in [28] are also discussed there. After the analytical results, we provide some numerical examples in the following subsection.

A. Analytical Results

The first result shows that when the measurement matrix is of the T-orthogonal form, the MSE over the whole vector may depend not just on the average sparsity ρ but also on the block-wise sparsities {ρ_t}.

Proposition 1. Consider the T-orthogonal setup described in Definition 2 and let mse_t denote the MSE of the LASSO reconstruction in block t = 1, ..., T. The average MSE over the entire vector of interest reads

mse = (1/T) Σ_{t=1}^T mse_t = (1/T) Σ_{t=1}^T (ρ_t − 2m_t + Q_t). (20)

Then, under the RS ansatz,

m_t = 2ρ_t Q( 1/√(χ̂_t + m̂_t²) ), (21)

Q_t = −(2(1 − ρ_t)/m̂_t²) r(χ̂_t) − (2ρ_t/m̂_t²) r(χ̂_t + m̂_t²), (22)

where Q(x) = ∫_x^∞ dz e^{−z²/2}/√(2π) is the standard Q-function and we denoted

r(h) ≜ √(h/2π) e^{−1/(2h)} − (1 + h) Q(1/√h), (23)

m̂_t ≜ 1 / ( λ + Σ_{k≠t} Λ_k^{−1} ), (24)

for notational convenience. The parameters {Λ_t} and {χ̂_t} are the solutions to the set of coupled equations

Λ_t = (1/R_t − 1) m̂_t, (25)

χ̂_t = (ρ_t − 2m_t + Q_t) Λ_t² / (1 − R_t)² + Σ_{s=1}^T Δ_{s,t} (ρ_s − 2m_s + Q_s − σ²R_s²), (26)

where we also used the auxiliary variables R_t = χ_t m̂_t with

χ_t ≜ (2(1 − ρ_t)/m̂_t) Q(1/√χ̂_t) + (2ρ_t/m̂_t) Q( 1/√(χ̂_t + m̂_t²) ), (27)

Δ_{s,t} ≜ ( R_s R_t Λ_s Λ_t / ((1 − 2R_s)(1 − 2R_t)) ) [ 1 + Σ_{k=1}^T R_k²/(1 − 2R_k) ]^{−1} − ( Λ_t²/(1 − 2R_t) ) δ_{st}, (28)

and the Kronecker delta symbol δ_ij to simplify the notation.

Proof: See Appendix A.

The connection of the MSE provided in Proposition 1 to the formulation given in (19) is as follows. Here m = T^{−1} Σ_t m_t and Q = T^{−1} Σ_t Q_t due to the assumption of block-wise sparsity, as described in Section II. The parameters {Λ_t, χ̂_t} have to be solved for all t = 1, ..., T, i.e., we have a set of 2T non-linear equations in 2T variables. Note that for the purpose of solving these equations, the parameters {m̂_t, R_t, Δ_{s,t}, m_t, Q_t} are just notation and do not act as additional variables in the problem. Except for m_t and Q_t, the rest of the variables can in fact be considered to be "auxiliary". They arise in the replica analysis when we assess the expectations w.r.t. the randomness of the measurement process, the additive noise and the vector of interest. Hence, one may think that the replica trick transformed the task of computing difficult expectations into the problem of finding solutions to a set of coupled fixed point equations defined by some auxiliary variables. In terms of computational complexity, this is a very fair trade indeed.

The implication of Proposition 1 is that the performance of the T-orthogonal ensemble is in general dependent on the details of the sparsity pattern {ρ_t} — in a rather complicated way.
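Note that the fixed point equations above involve only Gaussian tail integrals, so they are cheap to evaluate. The following sketch (Python with scipy; our own illustration, not code from the paper) implements the scalar functions Q(x) and r(h) of (23), together with a numerical self-check of our reading of (22): for the soft-thresholding estimator behind the proposition, −2r(h) should equal E[max(|y| − 1, 0)²] for y ~ N(0, h).

```python
import numpy as np
from scipy.special import erfc
from scipy.integrate import quad

def qfunc(x):
    """Standard Gaussian tail Q(x) = int_x^inf exp(-z^2/2)/sqrt(2*pi) dz."""
    return 0.5 * erfc(x / np.sqrt(2.0))

def r(h):
    """r(h) of (23)."""
    return (np.sqrt(h / (2 * np.pi)) * np.exp(-1 / (2 * h))
            - (1 + h) * qfunc(1 / np.sqrt(h)))

# Consistency check (our reading of (22)): -2*r(h) should equal
# E[ max(|y| - 1, 0)^2 ] for y ~ N(0, h).
h = 1.7
pdf = lambda y: np.exp(-y**2 / (2 * h)) / np.sqrt(2 * np.pi * h)
tail, _ = quad(lambda y: (y - 1)**2 * pdf(y), 1, np.inf)
print(np.isclose(-2 * r(h), 2 * tail))   # True up to quadrature error
```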

A similar result was reported for the noise-free case in [32], where the T-orthogonal ensemble was shown to provide a superior reconstruction threshold compared to rotationally invariant cases when the source vector had non-uniform sparsity.

For future convenience, we next present the special case of uniform sparsity as an example. Since now ρ = ρ_t for all t = 1, ..., T, we have only two coupled fixed point equations to solve. The auxiliary parameters also simplify significantly, and we have the following result.

Example 1. Consider the T-orthogonal setup and assume uniform sparsity ρ_t = ρ for all t = 1, ..., T. The per-component MSE of reconstruction can be obtained by solving the set of equations

Λ = (1/λ) [ 2(1 − ρ)Q(1/√χ̂) + 2ρQ( 1/√(χ̂ + m̂²) ) ]^{−1} − T/λ, (29)

χ̂ = Λ² [ (ρ − 2m + Q)/(1 − R)² − (ρ − 2m + Q − σ²R²)/(1 − 2R + TR²) ], (30)

where we introduced the definitions

R ≜ 1/(T + λΛ), (31)

m̂ ≜ Λ/(T + λΛ − 1), (32)

for notational simplicity.

For completeness, we next provide a result similar to Proposition 1 for the rotationally invariant case. It shows that the performance of this matrix ensemble does not depend on the specific values of {ρ_t} but only on the expected sparsity level ρ = T^{−1} Σ_t ρ_t of the vector of interest. The MSE is given as a function of the Stieltjes transform, and its first order derivative, of the asymptotic eigenvalue distribution F_{AA^T} of the measurement matrix. As shown in Section IV-B, the end result is essentially the same as the HCIZ integral formula based approach in [30], but the derivation and the form of the proposition are chosen here to match the previous analysis. It should be remarked, however, that Proposition 1 cannot be obtained from [30], and the T-orthogonal ensemble requires a special treatment.

Proposition 2. Recall the rotationally invariant setup given in Definition 2. The average MSE of reconstruction for LASSO in this case is given by mse = ρ − 2m + Q, where the parameters m and Q are as in (21) and (22) with the block index t omitted. To obtain the MSE, the following set of equations

χ = (2(1 − ρ)/m̂) Q(1/√χ̂) + (2ρ/m̂) Q( 1/√(χ̂ + m̂²) ), (33)

χ̂ = (ρ − 2m + Q)(1/χ² + Λ′) − ασ² [ G_{AA^T}(−λΛ) − (λΛ) · G′_{AA^T}(−λΛ) ] Λ′, (34)

Λ = (1/χ) [ 1 − α( 1 − (λΛ) · G_{AA^T}(−Λλ) ) ], (35)

need to be solved given the definitions

Λ′ ≜ ∂Λ/∂χ = −[ (1 − α)/Λ² + αλ² G′_{AA^T}(−Λλ) ]^{−1}, (36)

m̂ ≜ 1/χ − Λ = (α/χ)( 1 − (λΛ) · G_{AA^T}(−Λλ) ). (37)

The function G_{AA^T}(s) is the Stieltjes transform of the asymptotic distribution F_{AA^T}(x) and G′_{AA^T}(s) is the derivative of G_{AA^T}(s) w.r.t. the argument.

Proof: See Appendix B.

Remark 2. Due to (27) and (33), the rotationally invariant and T-orthogonal setups have exactly the same form for the variables {m, Q, χ}. Hence, the choice of random matrix ensemble does not affect these variables, except for adding the indexes t = 1, ..., T to them in the case of the T-orthogonal setup.

Notice that in contrast to Definition 2, the above proposition uses the eigenvalue distribution of AA^T instead of A^TA. This is more convenient in the present setup since M ≤ N, so we do not have to deal with the zero eigenvalues. Compared to the T-orthogonal case, here we have a fixed set of three equations and unknowns to solve, regardless of {ρ_t} and the partition of the source vector. Thus, if one knows the Stieltjes transform G_{AA^T}(s) and it is (once) differentiable with respect to the argument, the required parameters can be solved numerically from the coupled equations given in Proposition 2. For some G_{AA^T}(s), however, the equations can be reduced analytically to simpler forms that allow for efficient numerical evaluation of the MSE, as seen in the following two examples.

Example 2. Recall the row-orthogonal setup. For this case D = α^{−1}I_M and, thus, the Stieltjes transform of F_{AA^T} reads

G_{AA^T}(s) = 1/(α^{−1} − s), (38)

so that G′_{AA^T}(s) = (α^{−1} − s)^{−2}. Plugging the above in Proposition 2 provides

Λ = [ λ − α^{−1}χ + √( −4λχ + (λ + α^{−1}χ)² ) ] / (2λχ), (39)

m̂ = [ λ + α^{−1}χ − √( −4λχ + (λ + α^{−1}χ)² ) ] / (2λχ), (40)

Λ′ = −[ (1 − α)/Λ² + αλ²/(α^{−1} + λΛ)² ]^{−1}, (41)

χ̂ = (ρ − 2m + Q)(1/χ² + Λ′) − ( σ²/(α^{−1} + λΛ)² ) Λ′, (42)

so that the MSE can be obtained by solving the set {χ, χ̂} of equations given by (33) and (34). As expected, for the special case of uniform sparsity ρ_t = ρ and α = 1/T, the row-orthogonal and T-orthogonal setups always give the same MSE (see Remark 5 in Appendix B).

In the above example, we first used (38) in (35) and (37) to solve Λ and m̂ as functions of χ. Plugging then G′ into (36) provides immediately Λ′, and χ̂ follows similarly from (34). However, if the form of G is more cumbersome, it may be more convenient to compute the parameters in a slightly different order, as demonstrated in the next example that considers a generalized version of the standard Gaussian CS setup.
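To make Example 2 concrete, a damped fixed-point iteration of (39)–(42) together with (21), (22) and (33) can be coded in a few lines. The sketch below is our own implementation under the RS equations as reconstructed above; the starting point, damping factor and parameter values are arbitrary, and convergence is not guaranteed for every (α, ρ, σ², λ).

```python
import numpy as np
from scipy.special import erfc

def qfunc(x):
    return 0.5 * erfc(x / np.sqrt(2.0))

def r(h):
    return (np.sqrt(h/(2*np.pi))*np.exp(-1/(2*h))
            - (1 + h)*qfunc(1/np.sqrt(h)))

def lasso_mse_row_orthogonal(alpha, rho, sigma2, lam, iters=500, damp=0.5):
    """Damped fixed-point iteration of Example 2 (our sketch).
    Assumes the iterates (chi, chih) stay positive; tune damp if not."""
    chi, chih = 1.0, 1.0
    for _ in range(iters):
        disc = np.sqrt((lam + chi/alpha)**2 - 4*lam*chi)
        Lam  = (lam - chi/alpha + disc) / (2*lam*chi)          # (39)
        mh   = (lam + chi/alpha - disc) / (2*lam*chi)          # (40)
        m    = 2*rho*qfunc(1/np.sqrt(chih + mh**2))            # (21)
        Q    = (-2*(1-rho)/mh**2 * r(chih)
                - 2*rho/mh**2 * r(chih + mh**2))               # (22)
        Lamp = -1.0/((1-alpha)/Lam**2
                     + alpha*lam**2/(1/alpha + lam*Lam)**2)    # (41)
        chi_new  = (2*(1-rho)/mh*qfunc(1/np.sqrt(chih))
                    + 2*rho/mh*qfunc(1/np.sqrt(chih+mh**2)))   # (33)
        chih_new = ((rho - 2*m + Q)*(1/chi**2 + Lamp)
                    - sigma2*Lamp/(1/alpha + lam*Lam)**2)      # (42)
        chi  = damp*chi  + (1-damp)*chi_new
        chih = damp*chih + (1-damp)*chih_new
    return rho - 2*m + Q                                       # (19)

print(lasso_mse_row_orthogonal(alpha=1/3, rho=0.15, sigma2=0.1, lam=0.4))
```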

Example 3. Consider the rotationally invariant setup where the elements of A are zero-mean IID with variance 1/M. The eigenvalue spectrum of AA^T is then given by the Marchenko–Pastur law

G_{AA^T}(s) = [ −1 + α^{−1} − s − √( −4s + (−1 + α^{−1} − s)² ) ] / (2s), (43)

which together with Proposition 2 provides

Λ = 1/χ − 1/(λ + α^{−1}χ), (44)

m̂ = 1/(λ + α^{−1}χ), (45)

Λ′ = −1/χ² + α^{−1}/(λ + α^{−1}χ)², (46)

G′_{AA^T}(−λΛ) = −(1/(αλ²)) [ (1 − α)/Λ² + 1/Λ′ ]. (47)

Plugging the above into Proposition 2 and solving {χ, χ̂} yields the MSE of reconstruction.

In this case, G′_{AA^T}(s) is of a more complex form than in Example 2, and the approach used there does not provide as simple a solution as before. However, since now Λ has a particularly convenient form, we may use the definition Λ′ = ∂Λ/∂χ to write Λ′ as a function of χ. This can then be used in (36) to obtain G′ indirectly.

Remark 3. As shown in Section IV-B, Examples 2 and 3 provide the same average reconstruction error as reported in [30], which also implies that the IID case matches [28]. The benefit of Example 3 compared to [28] is that there are no integrals or expectations left to solve. Example 2, on the other hand, proved directly that for the special case of uniform sparsity ρ_t = ρ and α = 1/T, the row-orthogonal and T-orthogonal setups always give the same MSE.

The final example demonstrates the capabilities of the analytical framework for a "non-standard" ensemble that has singular values defined by a geometric progression.

Example 4. Recall the geometric setup, where the singular values σ_m ∝ τ^{m−1}, m = 1, ..., M satisfy the peak-to-average condition (8). The equivalent limiting eigenvalue spectrum of AA^T in the large system limit is described by the Stieltjes transform

G_{AA^T}(s) = (1/sγ) ln( (A − s)/(Ae^{−γ} − s) ) − 1/s, (48)

G′_{AA^T}(s) = −(1/s) G_{AA^T}(s) − A(e^{−γ} − 1) / ( sγ(Ae^{−γ} − s)(A − s) ), (49)

where A = κ/α and γ satisfies

κ = γ/(1 − e^{−γ}), (50)

for some given κ. For details on how to generate the geometric ensemble for simulations and how the Stieltjes transform arises for this setup, see Appendix D.

B. Numerical Examples

Having obtained the theoretical performance of various matrix ensembles, we now examine the behavior of the MSE in some chosen setups numerically. First we consider the case of uniform density and the MSE of reconstruction as a function of the tuning parameter λ, as shown in Fig. 1. The solid lines depict the performances of the row-orthogonal and standard Gaussian setups, as given by Examples 2 and 3. The markers, on the other hand, correspond to the result obtained by Rangan et al. [28, Section V-B] and to Example 1 given in this paper. As expected, the solid lines and markers match perfectly, although the analytical results are represented in completely different forms. It is important to notice, however, that we plot here the MSE (in decibels) while the normalized mean square error (in decibels) mse/ρ is used in [28]. Also, the definition of signal-to-noise ratio SNR₀ there would correspond to the value ρ/σ² in this paper.

Fig. 1. Average MSE in decibels mse → 10 log₁₀(mse) dB as a function of the tuning parameter λ. Solid lines given by Examples 2 and 3. Markers for the standard Gaussian setup are obtained from the analytical results provided in [28, Section V] and by Example 1 for the T-orthogonal case. All results should thus be considered to be in the LSL. As predicted by the analysis, the lines and markers match perfectly. Parameter values: α = 1/3, ρ = 0.15, σ² = 0.1.

Comparing the two curves in Fig. 1 makes it clear that the orthogonal constructions provide superior MSE performance compared to the standard Gaussian setup. It is also worth pointing out that the optimal value of λ depends on the choice of the matrix ensemble.

In the next experiment we consider the case of "localized sparsity" where all non-zero elements are concentrated in one subvector x_t, namely, ρ_t = ρT for some t ∈ {1, ..., T} and ρ_s = 0 for all s ≠ t. For simplicity, we take the simplest case of T = 2 and choose the overall sparsity to be ρ = 0.2. The average mean square error vs. inverse noise variance 1/σ² for this case is depicted in Fig. 2. For clarity of presentation, the variables related to both axes are given in decibel form, that is, x → 10 log₁₀(x) dB, where x ∈ {mse, σ^{−2}}. The tuning parameter λ is chosen for each point so that the lowest possible MSE is obtained for all considered methods. Due to the simple form of the analytical results, this is easy to do numerically. Examining Fig. 2 reveals a surprising phenomenon, namely, for small noise variance the T-orthogonal setup gives the lowest average MSE, while for more noisy setups it is the row-orthogonal ensemble that achieves the best reconstruction performance.
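As a quick numerical sanity check of the transforms used in Examples 2 and 3 (our own illustration, not part of the paper), the analytic Stieltjes transform (43) can be compared against the empirical estimate M^{−1} Σ_i (λ_i(AA^T) − s)^{−1} computed from a single sampled matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
M, N = 300, 900
alpha, s = M / N, -0.5          # evaluate G at a point off the spectrum

# Empirical Stieltjes transform of AA^T for the standard Gaussian setup
A = rng.standard_normal((M, N)) / np.sqrt(M)
eig = np.linalg.eigvalsh(A @ A.T)
G_emp = np.mean(1.0 / (eig - s))

# Marchenko-Pastur form (43)
b = -1 + 1/alpha - s
G_mp = (b - np.sqrt(b**2 - 4*s)) / (2*s)

print(G_emp, G_mp)              # agree up to finite-size fluctuations
```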

Fig. 2. Average MSE vs. inverse noise variance 1/σ², where both quantities are presented in decibels, that is, x → 10 log₁₀(x) dB. For each point, the value of λ that minimizes the MSE is chosen numerically. The parameter values are α = 1/2, ρ = 0.2 and localized sparsity is considered, that is, either ρ₁ = 0.4 and ρ₂ = 0 or vice versa. The dashed line at σ^{−2} = 12 dB represents the point where the MSE performance of the T-orthogonal and row-orthogonal ensembles approximately cross each other. Markers at σ^{−2} = 5, 12, 19 dB are obtained by using cvx for reconstruction and averaging over 100 000 realizations of the problem.

The universality of this behavior

for other parameter values is left as an open question for future research. The point where these two ensembles give approximately the same MSE for the given setup is located at σ^{−2} = 12 dB. Hence, for optimal performance, if one is given the choice of these three matrix ensembles, the row-orthogonal setup should be chosen when on the left hand side of the dashed line and the T-orthogonal setup otherwise. At any point in the figure, however, the standard Gaussian setup gives the worst reconstruction in the MSE sense and should never be chosen if such freedom is presented.

To illustrate how the experimental points in Fig. 2 were obtained, we have plotted in Fig. 3 the average MSE for the point σ^{−2} = 12 dB, for which the asymptotic prediction given by the replica method is mse = −12.685 dB. To obtain more accurate results, the experimental data is now averaged over 10⁶ realizations and the estimates are obtained by using SpaSM [68], which provides an efficient MATLAB implementation of the least angle regression (LARS) algorithm [69]. The simulated data is fitted with a quadratic function of N^{−1} and the estimates for the asymptotic MSEs are obtained by extrapolating M = αN → ∞. The end result for both the row-orthogonal and T-orthogonal ensembles is mse = −12.68 dB, showing that the replica analysis provides a good approximation of the MSE for large systems. One can also observe that the simulations approach the asymptotic result relatively fast, and that for realistic system sizes the match is already very good. Although the convergence behavior depends somewhat on the noise variance σ², the present figure is a typical example of what was observed in our simulations.

Fig. 3. Experimental assessment of the MSE for the point σ^{−2} = 12 dB in Fig. 2. Markers correspond to simulated values using SpaSM with path-following optimization over λ. Problem sizes N = 80, 160, ..., 400, each averaged over 10⁶ realizations. The experimental data were fitted with quadratic functions of 1/N (best fits mse = 199.8N^{−2} − 0.7099N^{−1} − 12.68 for the row-orthogonal and mse = 137.8N^{−2} − 14.30N^{−1} − 12.68 for the T-orthogonal ensemble) and plotted with solid lines. Extrapolation for M = αN → ∞ provides the estimates for the asymptotic MSE that agree well with the replica prediction for the RS ansatz, mse = −12.685 dB.

Finally, we use the geometric setup to examine the robustness of LASSO against a varying peak-to-average ratio κ introduced in (8). Substituting the formulas given in Example 4 into Proposition 2 provides the MSE of reconstruction for LASSO as given in Fig. 4. The system parameters are α = 1/2, ρ = 0.2, and σ² is chosen to match the markers in Fig. 2. As mentioned earlier, the geometric setup reduces to the row-orthogonal one when κ → 1, which can be verified by comparing the κ = 1 values to the markers of the row-orthogonal curve in Fig. 2. We have also included additional simulation points at κ = 5, obtained by using SpaSM as in Fig. 3. One can observe that the performance degradation with increasing κ for the LASSO problem is relatively graceful compared to the abrupt transition of GAMP observed in [66]. Hence, the algorithmic considerations are indeed very important, as studied therein.
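The extrapolation procedure used for Fig. 3 is easy to restate in code. The sketch below (our illustration; the data points are synthetic placeholders, not the paper's simulation results) fits mse(N) = aN^{−2} + bN^{−1} + c and reads off the asymptotic intercept c:

```python
import numpy as np

# Placeholder per-size averages (synthetic, for illustration only)
N = np.array([80, 160, 240, 320, 400], dtype=float)
mse_db = np.array([-12.40, -12.55, -12.60, -12.63, -12.64])

# Quadratic fit in 1/N: mse(N) ~ a/N^2 + b/N + c
a, b, c = np.polyfit(1.0 / N, mse_db, deg=2)
print(f"extrapolated N -> inf MSE: {c:.3f} dB")
```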

Fig. 4. Average MSE vs. the peak-to-average ratio κ for the geometric setup. Markers are obtained by extrapolating simulated values as in Fig. 3, and the optimal λ is used for each point. Parameter values are α = 1/2, ρ = 0.2, and the points at κ = 1 match the markers of the row-orthogonal setup in Fig. 2.

We also observe that, as expected, the MSE is a monotonically increasing function of κ for all σ², highlighting the fact that sensing matrices with "flat" eigenvalue distributions are in general good for the reconstruction of noisy sparse signals.

IV. EXTENSIONS AND A SKETCH OF A DERIVATION WITH THE REPLICA METHOD

Although the LASSO reconstruction examined in the previous section is one of the most important special cases of the regularized LS problem (2), one might be interested in expanding the scope of the analysis to some other cases. In fact, up to a certain point, the evaluation of the free energy (16) is independent of the choice of regularization c(x) as well as the marginal distribution of the non-zero components of the source vector. To this end, let us assume in the following that the source vector has elements drawn independently according to (6), where π(x) is a suitable PDF with zero mean, unit variance and finite moments. Note, however, that in order to obtain the final saddle-point conditions, one needs to make a choice about both c and π in the end.

A. Sketch of a Replica Analysis

To provide a brief sketch of how the results in the previous section were obtained, and to elucidate where the choice of regularization (10) affects the analysis, let us consider for simplicity the rotationally invariant ensemble with a source vector that has uniform sparsity, i.e., we set T = 1 and ρ = ρ₁. By Appendices A and B we know that (16) can be written under the RS ansatz as

f = − lim_{n→0⁺} (∂/∂n) lim_{β,N→∞} (1/βN) ln Ξ_{β,N}(n), (51)

where n is the number of replicas in the system. To characterize Ξ_{β,N}(n), we consider the simplest case of the RS overlap matrix (18) and remind the reader that albeit this choice may seem intuitively reasonable, it is known to be incorrect in some cases (see the Introduction for further discussion).

For a given replica symmetric Q, we may compute its probability weight⁶

p_{β,N}(Q; n) ∝ ∫ [V_β(Q̂; n)]^N e^{Nβn(QQ̂ − χχ̂ − 2mm̂)/2} dQ̂, (52)

where χ = β(Q − q). With some abuse of notation, we used above Q as a shorthand for the set {χ, Q, m} and similarly Q̂ for the auxiliary parameters {χ̂, Q̂, m̂}. Given that the prior of x⁰ factorizes and the regularization separates as (10), the auxiliary term V_β(Q̂; n) is given in (53) below, where Dz = dz e^{−z²/2}/√(2π) and p(x⁰) is assumed to be of the form (6) with ρ = ρ₁ and T = 1. It is important to realize that the replica analysis could, in principle, be done with general forms of c(x) and p(x), but then (53) would have vectors in place of scalars. This creates computational problems, as explained shortly. Note also that we have taken here a slightly different route (on purpose) compared to the analysis carried out in the Appendices. The approach there is more straightforward mathematically, but the methods presented here give some additional insight into the solution and provide a connection to the results given in [28] and [30].

Next, define a function

H_{β,λ}(σ², ν) = −( 1 + ln(βν) − α ln λ )/2 + (1/2) extr_Λ { Λ(βν) − (1 − α) ln Λ − α E ln(Λβσ² + Λλ + x) }, (54)

where α = M/N, ν > 0 and the expectation of the ln-term is w.r.t. the asymptotic eigenvalue distribution F_{AA^T}(x). Then

Ξ_{β,N}(n) = ∫ dQ p_{β,N}(Q; n) e^{N H_{β,λ}(nσ², n(r−2m+q)+Q−q) + N(n−1) H_{β,λ}(0, Q−q)}, (55)

where the exponential terms involving the function H arise from the expectation of (17) w.r.t. the noise w and the sensing matrix A given {x^a} and Q. The relevant matrix integral is proved in Appendix C-B and more details can be found in the derivations following Lemma 1 in Appendix A. In the limit of vanishing temperature and increasing system size β, N → ∞, the saddle point method⁷ may be employed to assess the integrals over Q and Q̂. Taking also the partial derivative w.r.t. n and then letting n → 0 provides

f = extr_{χ,Q,m,χ̂,Q̂,m̂} { mm̂ − QQ̂/2 + χχ̂/2 + ∫∫ p(x⁰) φ( z√χ̂ + m̂x⁰; Q̂ ) dx⁰ Dz
    − lim_{β→∞} (1/β) lim_{n→0} (∂/∂n) H_{β,λ}(nσ², n(ρ − 2m + q) + Q − q) }, (56)

⁶We have omitted vanishing multiplicative terms and a term −n²βχ̂q/2 in the exponent since they do not affect the free energy, see (134) in Appendix A.

⁷For more information, see for example [70, Ch. 6] and [71, Ch. 12.7], or for a gentle introduction [72]. Note that in our case when β, N → ∞, the correction terms in front of the exponentials after the saddle point approximation vanish due to the limits and the division by βN in (51). With some abuse of notation, we have simply omitted them in the paper to avoid unnecessary distractions.

V_β(Q̂; n) = ∫ p(x⁰) ( Π_{a=1}^n dx^a ) exp( −β Σ_{a=1}^n c(x^a) ) exp( −(βQ̂/2) Σ_{a=1}^n (x^a)² + βm̂x⁰ Σ_{a=1}^n x^a + (1/2)( β√χ̂ Σ_{a=1}^n x^a )² ) dx⁰

          = ∫∫ p(x⁰) { ∫ exp( β[ −(Q̂/2)x² + (z√χ̂ + m̂x⁰)x − c(x) ] ) dx }ⁿ dx⁰ Dz, (53)

where we defined a scalar function

φ(y; Q̂) = min_x { (Q̂/2)x² − yx + c(x) }. (57)

Note that due to the limit n → 0 in (56), the function H has the arguments σ² = 0 and ν = Q − q when it comes to solving the extremization over Λ in (54). The value of Λ which provides a solution for this is denoted Λ* in the following. We also remark that (57) is the only part of the free energy that directly depends on the choice of the cost function c. If we had chosen a p(x) that is not a product distribution and a c that did not separate as in (10), we would need to consider here a multivariate optimization over x. In fact, the dimensions should grow without bound by the assumptions of the analysis and, hence, the problem would be infeasible in general. In practice one could consider some large but finite setting and use it to approximate the infinite dimensional limit, or concentrate on forms of c(x) and p(x) that have only a few local dependencies. For simplicity we have chosen to restrict ourselves to the case where the problem fully decouples into a single dimensional one.

We can get an interpretation of the remaining parameters as follows. First, let

⟨h(x); y⟩_β = (1/Z_β(y)) ∫ h(x) e^{−β[Q̂x²/2 − yx + c(x)]} dx (58)

be a posterior mean of some function h(x). Note that the structure of this scalar estimator is essentially the same as that of the vector valued counterpart given in (13) and (14). If we denote the x that minimizes (57) by x̂(y; Q̂), then clearly

⟨h(x); y⟩_{β→∞} = h( x̂(y; Q̂) ), (59)

and also

x̂(y; Q̂) = −(∂/∂y) φ(y; Q̂). (60)

We thus obtain from (53), with a little bit of calculus, that as N, β → ∞ and n → 0, the terms that yield the mean square error mse = ρ − 2m + Q are given by

m = ∫∫ x⁰ x̂( z√χ̂ + m̂x⁰; Q̂ ) p(x⁰) dx⁰ Dz, (61)

Q = ∫∫ x̂( z√χ̂ + m̂x⁰; Q̂ )² p(x⁰) dx⁰ Dz, (62)

where we now have y = z√χ̂ + m̂x⁰ and z is a standard Gaussian RV. Similarly, the extremum condition for the variable χ reads

χ = (1/√χ̂) ∫∫ z x̂( z√χ̂ + m̂x⁰; Q̂ ) p(x⁰) dx⁰ Dz
  = ∫∫ (∂/∂(z√χ̂)) x̂( z√χ̂ + m̂x⁰; Q̂ ) p(x⁰) dx⁰ Dz, (63)

where the latter equality is obtained using the integration by parts formula. For many cases of interest, these equations can be evaluated analytically, or at least numerically, and they provide the single body representation of the variables in (19). If one substitutes (6) with π(x) = g_x(0; 1) into the above formulas along with c(x) = |x|, some (tedious) calculus shows that the results of the previous section are recovered. An alternative way of obtaining the parameters is provided in Appendices A and B.

Given {m̂, Q̂, χ̂}, one may now obtain the MSE of the original reconstruction described in Section II by considering an equivalent scalar problem, namely,

mse = ρ − 2m + Q = ∫∫ ( x⁰ − x̂( z√χ̂ + m̂x⁰; Q̂ ) )² p(x⁰) dx⁰ Dz. (64)

We may thus conclude that the minimizing x in (57) is the x̂ above, which can be interpreted as the output of a regularized LS estimator that postulates y = Q̂x̂ + Q̂^{1/2}z, or equivalently,

y = x + Q̂^{−1/2}z, (65)

while the true model is y = m̂x⁰ + z√χ̂, or equivalently,

y = x⁰ + z√χ̂/m̂. (66)

In the notation of [30] (resp. [28]), we thus have the relations ξ ↔ Q̂ (↔ λ_p) and η ↔ m̂²/χ̂ (↔ µ^{−1}) between the parameters. The above also implies m̂ = Q̂, as is indeed verified later in (70). We shall expand on the connection between this paper and [28], [30] in Section IV-B.

Let us denote by Λ* the solution of the extremization in (54) under the condition σ² = 0, namely,

Λ* − 1/χ = −(α/χ)( 1 − (λΛ*) · G_{AA^T}(−λΛ*) ), (67)

which is the same condition as (35) in Proposition 2. Then we can plug into (56) the identity

lim_{β→∞} (1/β) lim_{n→0} (∂/∂n) H_{β,λ}(nσ², ν(n)) = −(ασ²Λ*/2) G_{AA^T}(−λΛ*) + ( (ρ − 2m + Q)/2 )( Λ* − 1/χ ), (68)

where we used the fact that in the LSL, r → ρ in (18) by the weak law of large numbers.
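For the ℓ1-cost c(x) = |x|, elementary calculus gives the minimizer of (57) in closed form as the soft-thresholding rule x̂(y; Q̂) = sign(y) max(|y| − 1, 0)/Q̂, and the Gaussian integrals (61)–(64) can then be evaluated by Gauss–Hermite quadrature. The following is our sketch of the equivalent scalar channel MSE (64) under these formulas; the quadrature order and the test values are arbitrary choices.

```python
import numpy as np

def xhat_l1(y, Qh):
    """Minimizer of (57) for c(x) = |x|: soft thresholding."""
    return np.sign(y) * np.maximum(np.abs(y) - 1.0, 0.0) / Qh

def scalar_mse(mh, Qh, chih, rho):
    """MSE (64) of the scalar channel y = mh*x0 + sqrt(chih)*z."""
    # Gauss-Hermite nodes/weights for E over a standard Gaussian
    z, w = np.polynomial.hermite_e.hermegauss(61)
    w = w / np.sqrt(2 * np.pi)
    # Component x0 = 0 (weight 1 - rho)
    mse0 = np.sum(w * xhat_l1(np.sqrt(chih) * z, Qh) ** 2)
    # Component x0 ~ N(0, 1) (weight rho): second quadrature over x0
    x0, v = np.polynomial.hermite_e.hermegauss(61)
    v = v / np.sqrt(2 * np.pi)
    y = mh * x0[:, None] + np.sqrt(chih) * z[None, :]
    mse1 = np.sum(v[:, None] * w[None, :]
                  * (x0[:, None] - xhat_l1(y, Qh)) ** 2)
    return (1 - rho) * mse0 + rho * mse1

print(scalar_mse(mh=0.8, Qh=0.8, chih=0.3, rho=0.15))  # illustrative values
```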
The RS free energy thus becomes where we now have y = z χˆ+mx ˆ 0 and z is a standard Gaus-  QQˆ χχˆ sian RV. Similarly, the extremum condition for the variable χ f = extr mmˆ − + reads χ,Q,m,χ,ˆ Q,ˆ mˆ 2 2 1 Z Z ZZ p χ = √ z xˆzpχˆ +mx ˆ 0; Qˆp(x0)dx0Dz + p(x0)φz χˆ +mx ˆ 0; Qˆdx0Dz (69) χˆ ZZ 2 ∗   ∂ p 0  0 0 ασ Λ ∗ ρ − 2m + Q ∗ 1 = √ xˆ z χˆ +mx ˆ ; Qˆ p(x )dx Dz, (63) + G T (−λΛ ) − Λ − , ∂(z χˆ) 2 AA 2 χ 12 IEEE TRANSACTIONS ON INFORMATION THEORY where the parameters {m, Q, χ} satisfy (61)–(63) Note that from (195) by setting σ2 = 0 and χ = β(Q − q), namely so-far we have not made any assumptions about the details 2 of the function c(x), which means that (69) is valid for any Hβ,λ(σ = 0, ν = Q − q) type of regularization that separates as given in (10). In fact, 1  Z  ' extr Λχ − ln(x + λΛ)dFATA(x) , (73) we can go even further and solve the partial derivatives w.r.t. 2 Λ variables m and Q, which reveals that where we have omitted the terms that do not depend on Λ. 1 Qˆ =m ˆ = − Λ∗. (70) The solution to the extremization then provides the condition χ (we write Λ = Λ∗ here for simplicity) With some additional effort, one also finds that Z 1 χ = λ dFATA(x) = λGATA(−λΛ) (74)  1 ∂Λ∗  x + λΛ χˆ = (ρ − 2m + Q) + 1 χ χ2 ∂χ −1 ⇐⇒ Λ = − GATA , (75) ∗ λ λ 2 ∗ ∗ 0 ∗ ∂Λ −ασ G T (−λΛ ) − (λΛ ) · G T (−λΛ ) (71) AA AA ∂χ −1  where GATA GATA(s) = s is the functional inverse of the holds for the RS free energy with Stieltjes transform. Note that this also implies    ∗  −1 −1 χ χ ∂Λ 1 − α T T 2 0 ∗ GA A(−λΛ) = GA A GATA = . (76) = − + (αλ ) · G T (−λΛ ) . (72) λ λ ∂χ (Λ∗)2 AA R (z) = G−1(−z) − z−1 It is now easy to see that for the rotationally invariant setup, Using the definition X X of the R- our initial assumption of uniform sparsity, i.e., T = 1 and ρ = transform in (75) yields ρ1 is not necessary and the same set of equations is obtained 1  χ 1 −1 P R T − = − Λ. (77) for arbitrary sparsity pattern {ρt} that satisfies ρ = T t ρt. λ A A λ χ Similarly, we may obtain the equivalent representation for the T -orthogonal setup considered in Proposition 1. On the other hand, we know from (70) that a solution to (69) satisfies Qˆ =m ˆ = χ−1 − Λ, so that Remark 4. Comparing (67) together with (70)–(72) to the saddle-point conditions (34)–(37) given in Proposition 2 shows 1  χ Qˆ =m ˆ = R T − (78) that the choice of c or the marginal PDF of the non-zero λ A A λ elements in the source vector in (6) has no direct impact on the expressions that provide the variables {m,ˆ Q,ˆ χˆ}. Hence, the is the saddle-point condition for Qˆ and mˆ in terms of the R- form of these conditions is the same for all setups where the transform. Note that compared to the Stieltjes-transform that is sensing matrix is from the same ensemble. The parameters related to AAT at the saddle-point solution, the R-transform {m, Q, χ} on the other hand are affected by the choice of describes the eigenvalue spectrum of ATA. The condition (78) regularization and source distribution. As stated in Remark 2, matches [30, (131)] apart from a slightly different placements the effect of sensing matrix ensemble is the reverse, namely, of regularization parameters so that mˆ ↔ ξ as already for fixed c and p(x0), the form of {m, Q, χ} is always the remarked earlier. The above also suggests that apart from same while {m,ˆ Q,ˆ χˆ} can have different form depending on scalings by the regularization parameter χ ↔ E[σ2(Y ; ξ)], the choice of the measurement matrix. which can also be inferred from [28, Lemma 9]. 
Finally, we know from the above developments and [30, Appendix B] that χˆ ↔ f ? should hold if the results are equal. B. Alternative Representation of Rotationally Invariant Case To this end, let us examine the last line of (69), namely, and Comparison to Existing Results ασ2Λ ρ − 2m + Q 1  The saddle-point condition for the rotationally invariant case G T (−λΛ) − Λ − , (79) 2 AA 2 χ is described in Proposition 2 in terms of the Stieltjes transform of FAAT . In [30] the HCIZ-formula is used, which makes and substitute (75)–(77) there. Considering the end result as a it natural to express the results in terms of the R-transform function of χ, we obtain of FATA. Furthermore, different sets of auxiliary variables 2    are used in these two papers. In this section we sketch an ασ 1 1 χ χ ϕ(χ) = − R T − alternative representation of Proposition 2 that is equivalent 2 χ λ A A λ λ to [30] up to some minor scaling factors. This also implies ρ − 2m + Q  χ + R T − (80) that apart from minor differences in scalings, our results for 2λ A A λ the IID case are also equivalent to [28] as explained in [30, 1ρ − 2m + Q ασ2χ  χ Sec. IV-C] and shown in Fig. 1. ' − R T − , (81) 2 λ λ2 A A λ Let us first consider the conditions enforced by the matrix integration formula (54) through Λ. By the remark following where (81) is obtained by omitting the terms that do not (57), we know that the relevant terms can also be obtained depend on χ. We are interested in the point where the partial VEHKAPERA¨ et al.: ANALYSIS OF REGULARIZED LS RECONSTRUCTION AND RANDOM MATRIX ENSEMBLES IN COMPRESSED SENSING 13 derivative in (69) w.r.t. χ vanishes, that is, affect the MSE. In this special case though it is the same for ∂ the T -orthogonal and row-orthogonal setups — also for non- χˆ = −2 ϕ(χ) ∂χ uniform sparsities.     ρ − 2m + Q 0 χ The benefit of `1 and `2-norm regularizations is that both = R T − λ2 A A λ are of polynomial complexity. Implementation of (84) is trivial ασ2  ∂   χ and for solving (4) one may use standard convex optimization + χR T − cvx λ2 ∂χ A A λ tools like [6]. However, one may wonder if there are better choices for regularization when the goal is to reconstruct ασ2   χ = R T − a sparse vector. If we take the noise-free case as the guide, λ2 A A λ instead of say `1-norm, we should have a direct penalty on the   2  1 0 χ χασ number of non-zero elements in the source vector. We may + R T − ρ − 2m + Q − . (82) λ2 A A λ λ achieve this by so-called “zero-norm” based regularization. 8 Thus we have a formula for χˆ as a function of χ and mse, One way to write the corresponding cost function is expressed in terms of the R-transform (and its derivative R0). N X  Comparing to [30, (195)] we see that the expressions are c(x) = 1 xj ∈ R \{0} (88) the same apart from minor scalings. The differences can be j=1 explained by noticing that in [30]: 1) the noise variance is = number of non-zero elements in x. (89) σ2 = 1, 2) λ = γ−1 by definition, and 3) the matrices If the postulated and true scalar outputs are given by are square so that α = 1. Thus, we conclude that χˆ ↔ f ? (65) and (66), respectively, we obtain and Proposition 2 is indeed identical to [30], just expressed p differently. xˆ(y; Qˆ) = y · 1|y| > 2Qˆ, (90) which is just hard thresholding estimator of scalar input (66), C. Regularization with `2-norm and “zero-norm” given mismatched model (65). 
To proceed further, we need to As a first example of regularization other than `1-norm, fix the marginal distribution π(x) of the non-zero components consider the case when in (6). For the special case of Gaussian distribution, some N 2 1 X xj algebra provides the following result. c(x) = kxk2 = . (83) 2 2 j=1 Example 6. Let π(x) = gx(0; 1), that is, consider the case of Gaussian marginals (6) with the rotationally invariant setup This regularization implies that in the MAP-framework, the and general problem (2) with the “zero-norm” regularization desired signal is postulated to have a standard Gaussian given by (88). Define a function distribution. It is thus not surprising that such an assumption r √ reduces the estimate (14) to the standard linear form −h h r0(h) = e + Q( 2h). (91) T T −1 π xˆ = A (AA + λIM ) y, (84) Then, the average MSE for the rotationally invariant setup which is independent of the parameter β. For the replica mse = ρ − 2m + Q is obtained from analysis one obtains from (65) and (66)  mˆ  0 √ Qˆ Qˆ=m ˆ mxˆ + z χˆ m = 2ρr0 2 , (92) xˆ(y; Qˆ) = y = , (85) mˆ +χ ˆ 1 + Qˆ 1 +m ˆ  χˆ  mˆ  Q = 2(1 − ρ) r which can be interpreted as mismatched linear MMSE estima- mˆ 2 0 χˆ 0 tion of x from observation (66). From (61)–(63) we obtain mˆ 2 +χ ˆ  mˆ  + 2ρ r , (93) the following simple result. mˆ 2 0 mˆ 2 +χ ˆ Example 5. Let the distribution of the non-zero elements using the condition π(x) have zero-mean, unit variance and finite moments. For 2(1 − ρ) mˆ  2ρ  mˆ  rotationally invariant setup and general problem (2) with the χ = r + r (94) mˆ 0 χˆ mˆ 0 mˆ 2 +χ ˆ `2-regularization we have ρ +χ ˆ and equations (34)–(37) in Proposition 2. Furthermore, by mse = , (86) (1 +m ˆ )2 Remark 4, the T -orthogonal case can also be obtained easily 1 χ = . (87) 8As remarked also in [28, Section V-C], “zero-norm” regularization does 1 +m ˆ not satisfy the requirement that the normalization constant in (11) is well- defined for any finite M,N and β > 0. Hence, appropriate limits should Comparing to [30, (135)–(138)], we see that the results indeed be considered for mathematically rigorous treatment. However, using similar match as discussed in Section IV-B. The value of the parameter analysis as given in [32, Section 3.5], it is possible to show that the RS λ that minimizes the MSE is λ∗ = σ2/ρ. The marginal density solution for the “zero-norm” regularization is in fact always unstable due to the discontinuous nature of (90). For this reason, we skip the formalities of π(x) of the non-zero elements (6) has no impact on the MSE. taking appropriate limits and report the results as they arise by directly using The choice of the sensing matrix, on the other hand, does the form given in (88). 14 IEEE TRANSACTIONS ON INFORMATION THEORY

0 reconstruction was derived in the limit when the system size ℓ1 T−orthogonal grows very large. By introducing some novel results for taking −2 ℓ0 an expectation of a matrix in an exponential form, the previous ℓ2 results concerning standard Gaussian measurement matrix was −4 extended to more general ensembles. As specific examples, row-orthogonal, geometric and T -orthogonal random matrices −6 were considered in addition to the Gaussian one. The assump- tion about uniform sparsity of the source was also relaxed and blockwise sparsity levels were allowed. −8 The analytical results show that while the MSE of recon- struction depends only on the average sparsity level of the −10 source for rotationally invariant cases, the performance of T -

ℓ1 orthogonal setup depends on the individual sparsities of the normalized MSE [dB] −12 sub-blocks. In case of uniform sparsity, row-orthogonal and T -orthogonal setups have provably the same performance. It −14 was also found that while the row-orthogonal, geometric and Gaussian setups each fall under the category of rotationally

−16 ℓ0 invariant ensemble, that is known to have a unique perfect re- covery threshold in a noise-free setting, with additive noise the MSE performance of these ensembles can be very different. −18 1 1.5 2 2.5 3 The numerical experiments revealed the fact that under 1/α all considered settings, the standard Gaussian ensemble per- formed always worse than the orthogonal constructions. The Fig. 5. Normalized mean square error mse/ρ in decibels vs. inverse compression rate 1/α as given by Examples 5 and 6. Rotationally invariant MSE for the geometric ensemble was found to be an increasing ensemble with arbitrary sparsity pattern and T -orthogonal case with localized function of the peak-to-average ratio of the eigenvalues of the sparsity, namely, ρt = ρT for some t ∈ {1,...,T } and ρs = 0 ∀s 6= t. sensing matrix, suggesting that spectrally uniform sensing ma- Thin red lines for IID sensing matrix, thick blue lines for row-orthogonal case. Noise variance σ2 = 0.01 and the parameter λ is chosen so that the MSE is trices are beneficial for recovery. When the sparsity was non- minimized. The red lines match the corresponding curves in [28, Fig. 3]. uniform, the ranking of the orthogonal constructions depended on the noise level. For highly noisy measurements, the row- orthogonal measurement matrix was found to provide the best by first adding the block index t to all variables and then overall MSE, while relatively clean measurements benefited replacing (21) and (22) by the formulas given here. The last from the T -orthogonal sensing matrices. These findings show modification is to write (27) simply as Rt = χtmˆ t with the that the choice of random measurement matrix does have an χt given in (94). impact in the MSE of the reconstruction when noise is present. Furthermore, if the source does not have a uniform sparsity, To illustrate the above analytical results, we have plotted the effect becomes even more varied and complex. the normalized mean square error 10 log (mse/ρ) dB as 10 A natural extension of the current work is to consider a function of inverse compression rate 1/α in Fig. 5. The Bayesian optimal recovery and its message passing approx- axes are chosen so that the curves can be directly compared imation for the matrix ensembles that differ from the standard to [28, Fig. 3]. Note, however, that we plot only the region ones. Indeed, since the initial submission of the present paper, 1/3 ≤ α ≤ 1 (in contrast to 1/3 ≤ α ≤ 2 there) since this such algorithms have been developed and shown to benefit of corresponds to the assumption of CS setup and one cannot matrices with structure, see for example, [56]–[59]. construct a row-orthogonal matrix for α > 1. It is clear that for all estimators, using row-orthogonal ensemble for measurements is beneficial compared to the standard Gaussian APPENDIX A setup in this region. Furthermore, if the source has non- REPLICA ANALYSIS OF T -ORTHOGONAL SETUP uniform sparsity and `0 or `1 regularization is used, the T - A. Free Energy orthogonal setup (the black markers in the figure) provides Recall the T -orthogonal setup from Definition 2 and let the an additional gain in average MSE for the compression ratios partition of A be one that matches that of x, i.e., α = 1/2 and α = 1/3. T X V. CONCLUSIONS Ax = Otxt. 
(95) t=1 The main emphasis of the present paper was in the analy- We then recall (16), invoke the replica trick introduced in Sec- sis of `1-regularized least-squares estimation, also known as LASSO or basis pursuit denoising, in the noisy compressed tion II-B, and assume that the limits commute. The normalized free energy of the system reads thus sensing setting. Extensions to `2-norm and “zero-norm” reg- ularization were briefly discussed. Using the replica method 1 ∂ 1 f = − lim lim ln Ξβ,M (n), (96) from statistical physics, the mean square error behavior of T n→0+ ∂n β,M→∞ βM VEHKAPERA¨ et al.: ANALYSIS OF REGULARIZED LS RECONSTRUCTION AND RANDOM MATRIX ENSEMBLES IN COMPRESSED SENSING 15

a a √ where we denoted of radius ||Ot∆xt || = ||∆xt || = MQt that are centered at Z  n  the origin (see Appendix C). As Ot are sampled independently 0 X a Ξβ,M (n) = Ew,{Ot} p(x ; {ρt}) exp − β c(x ) among t = 1, 2,...,T , the cross-correlations for all t1 6= t2 a=1 reduce to  n T 2 n β X X a Y a E (O ∆xa )T(O ∆xb ) × exp − σw − Ot∆x dx (97) {Ot} t1 t1 t2 t2 2λ t a=1 t=1 a=0 = (∆xa )TE {OT O }∆xb t1 {Ot} t1 t2 t2 for notational convenience. The form of (97) implies that a T b = (∆x ) 0M ∆x = 0, (109) a 0 t1 t2 {x } are independently drawn according to (11), xt has the same distribution as xt, i.e., elements drawn according to where 0M is the M ×M zero matrix. On the other hand, given a 0 a t = t = t we obtain (6) for each block t = 1,...,T , and ∆xt = xt − xt for 1 2 a = 1, . . . , n. The outer expectation is w.r.t. the additive noise −1  a T b M E{Ot} (Ot∆xt ) (Ot∆xt ) and measurement matrices. −1 a T T b a n = M (∆x ) E{O }{O Ot}∆x For each set of random vectors {∆xt }a=1, let us now t t t t n×n −1 a T b [a,b] construct a matrix St ∈ R for all t = 1,...,T whose = M (∆xt ) IM ∆xt = St . (110) (a, b)th element represents the empirical covariance [a,b] Since St are in general non-zero for any pairs of replica [a,b] 1 a b St = ∆xt · ∆xt indexes a, b = 1, 2, . . . , n, the expectation in (97) is nontrivial M to compute. However, these correlations can be decoupled by kx0k2 x0 · xb xa · x0 xa · xb = t − t t − t t + t t (98) linearly transforming the variables using a matrix M M M M   n×n [0,0] [0,b] [a,0] [a,b] E = e1 e2 ··· en ∈ , (111) = Qt − Qt − Qt + Qt . (99) R T T √ T that satisfies E E = EE = In, and e1 = 1n/ n by We also construct a similar set of matrices {Qt}t=1 whose [a,b] definition. More precisely, if we let ∆x˜1 ··· ∆x˜n = elements {Qt } have the obvious definitions. The rotational t t ∆x1 ··· ∆xn E be the transformed vectors, symmetry of distributions wt and {Qt} indicates that t t  n T 2 1 n  1 n T β X X a E{Ot} Ot∆x˜t ··· Ot∆x˜t Ew,{O } exp − σw − Ot∆x (100) M t 2λ t o a=1 t=1  1 n × Ot∆x˜t ··· Ot∆x˜t [a,b] a n becomes a function of {Q } for any fixed set of {∆x } . 1 n T t t a=1 = E O ∆x1 ··· ∆xn E In addition, inserting a set of trivial identities M {Ot} t t t Z  1 n o n(n+1)/2 Y [a,b] a b [a,b] × Ot ∆xt ··· ∆xt E 1 = M δ(MQt − xt · xt )dQt 1 T 0≤a≤b≤n = ∆x1 ··· ∆xn E (101) M t t for t = 1, 2,...,T into (97) and performing the integration  T  1 n ×E{Ot} Ot Ot ∆xt ··· ∆xt E over {xa} yields an expression that allows for saddle point [a,b] 1  1 n T  1 n evaluation of Ξ (n) with respect to {Q }. = ∆x ··· ∆x E ∆x ··· ∆x E β,M t M t t t t To proceed with the analysis, we make the RS assumption S˜ . (112) which states that at the dominant saddle point, the overlap , t ˜ n×n matrices {Qt} are invariant under the rotation of the replica But St ∈ R is just indexes a = 1, 2, . . . , n (see also (18)) T S˜ t = E StE [0,0] Q = rt, (102) [1,2] [1,1] [1,2] t = diag(nSt + St − St , [0,b] [a,0] Q = Q = mt ∀a, b ≥ 1, (103) [1,1] [1,2] [1,1] [1,2] t t St − St ,...,St − St ), (113) [a,a] | {z } Qt = Qt ∀a ≥ 1, (104) n−1 times [a,b] Qt = qt ∀a 6= b ≥ 1. (105) T T since ea 1n = ea e1 = 0 for all a = 2, . . . , n and, thus, Under the RS assumption, the matrices St introduced above 1 E (O ∆x˜a)T(O ∆x˜b) have also a simple form M {Ot} t t t t  [1,2] T [1,1] [1,2] 0 if a 6= b, St = St 1n1n + (St − St )In, (106)  T n = n(rt − 2mt + qt) + Qt − qt if a = b = 1, where 1n = [1 ··· 1] ∈ R is an all-ones vector and  Qt − qt if a = b = 2, . . . 
, n, [1,1] St = rt − 2mt + Qt (107) (114) [1,2] St = rt − 2mt + qt. (108) a n holds. The above shows that for given {∆xt }a=1, the set a When Ot are independent Haar matrices, the vectors of vectors {Ot∆x˜t } are independent among t = 1,...,T a Ot∆xt are distributed on the M-dimensional hyper spheres and also uncorrelated in the space of replicas, as indicated 16 IEEE TRANSACTIONS ON INFORMATION THEORY

a by (114). This is in contrast to the original set {Ot∆xt } 2) Average the obtained result w.r.t. pβ,M (Qt; n) as given whose replica space correlation structure (110) is much more in (120) when β → ∞. cumbersome to deal with. The first step may be achieved by separately averaging over To proceed with the analysis, we first notice that since the replicas a = 1, . . . , n using the following result. Note this T EE = In, the quadratic term in (97) can be expressed as is always possible since we consider the setting of large M n T 2 n T 2 and hence n  M. X X a X X a Ot∆xt = Ot∆x˜t (115) T Lemma 1. Let {ut}t=1 be a set of length-M vectors that a=1 t=1 a=1 t=1 2 satisfy kutk = Mνt for some given non-negative reals {νt}. a using the uncorrelated random vectors {Ot∆x˜t }. For nota- Let {Ot} a set of independent Haar matrices and define tional convenience, we define next an auxiliary matrix E0 = MG (σ2,{ν }) − β kσw−PT O u k2  0 0  T e β,λ t = E e 2λ t=1 t t , (121) e1 ··· en = E so that w,{Ot} a  1 n 0 where w is a standard Gaussian random vector. Then, for ∆xt = ∆x˜t ··· ∆x˜t ea, a = 1, . . . , n, (116) large M 0 where {ea} again forms an orthonormal set that is independent T of t. Then, after the transformation (116), the linear term in 1 X  G (σ2, {ν }) = − T − ln λ + ln(βν ) (97) becomes β,λ t 2 t t=1 T n T n  T  T  X X T a X X T  1 n 0 1 X   2 X 1 w Ot∆xt = w Ot ∆x˜t ··· ∆x˜t ea + extr Λt(βνt) − ln Λt −ln λ + βσ + , 2 {Λ } Λ t=1 a=1 t=1 a=1 t t=1 t=1 t T (122) √ T X 1 = nw Ot∆x˜t , (117) t=1 where we have omitted terms of the order O(1/M). where the we used the fact that Proof: Proof is given in Appendix C-A. n a n Since {Ot∆x˜t }a=1 are uncorrelated, we can apply the X 0 T √ T ea = E 1n = n 0 ··· 0 . (118) above result to (119) separately for all replica indexes a = a=1 1, . . . , n in order to evaluate the expectations w.r.t. w and Combining the above findings and re-arranging implies that {Ot}. Thus, for n  M, we get (123) at the top of the next (97) can be equivalently expressed as page, where n is now just a parameter in Gβ,λ and is not enforced to be an integer by the function itself. The next step Ξ (n) β,M is to compute the integral over {Qt}. With some abuse of Z  √ T 2 notation, we start by using the Dirac’s delta identity (181) to β 2 X 1 = Ew,{Ot} exp − nσ w − Ot∆x˜t write 2λ t=1 Z  ˜[a,b]   n T 2 1 Y dQt β X X a pβ,M (Qt; n) = n × exp − Ot∆x˜ z 2πi 2λ t β,M 0≤a≤b≤n a=2 t=1    n  n X ˜[a,b] [a,b] ˜ X Y × exp M Q Q Vβ,M (Q ; n), (124) ×p(x0; {ρ }) exp − β c(xa) dxa. (119) t t t t 0≤a≤b≤n a=1 a=0 ˜ T Next, recall the definition of the matrix Q whose elements where {Qt}t=1 is a set of transform domain matrices whose t ˜[a,b] are as given in (99). From the identity of (101) we obtain the elements are {Qt } and probability weight for Qt, t = 1,...,T as Z n Y  a  ˜ 0 0 −βc(xt ) a Z n Vβ,M (Qt; n) = p(xt )dxt e dxt 1 0 0 Y  −βc(xa) a p (Q ; n) = p(x )dx e t dx a=1 β,M t zn t t t   β,M X [a,b] a=1 × exp − Q˜ xa · xb . (125) Y a b [a,b] t t t × δ(xt · xt − MQt ), (120) 0≤a≤b≤n 0≤a≤b≤n Note that n has to be an integer in (125) and the goal is a where c(xt ) is interpreted as in (10) to be a sum of scalar thus to write it in a form where n can be regarded as a real regularization functions and the normalization constant is valued non-negative variable. One can then verify that since given by z = z M −(n+1)/2, where z is as in (11). ˜[0,0] ˜[0,0] β,M β β Qt is connected only to the zeroth replica, Qt → 0 Then we proceed as follows: when n → 0. 
Therefore, it plays no role in the evaluation of T ˜[a,a] 1) Fix the matrices {Qt}t=1 so that the lengths St , a = the asymptotic free energy and consequently, the MSE. This a 1, . . . , n in (114) are constant and, thus, {∆x˜t } have is indeed a common feature of replica symmetric solution and fixed (squared) lengths. Then, assuming M grows with- similar conclusion can be found, e.g., in [30], [43] and [44]. ˜[0,0] out bound, average over the joint distribution of w, {Ot} To simplify notation, we therefore omit Qt from further a T and {∆x˜t }, given {Qt}t=1. consideration. VEHKAPERA¨ et al.: ANALYSIS OF REGULARIZED LS RECONSTRUCTION AND RANDOM MATRIX ENSEMBLES IN COMPRESSED SENSING 17

1 ln Ξ (n) M β,M Z T √  n−1 1 n β 2 PT 1 2 o n β PT 2 2 o Y   − k nσ w− Ot∆x˜ k − k Ot∆x˜ k = ln p (Q ; n)dQ E e 2λ t=1 t E e 2λ t=1 t M β,M t t w,{Ot} w,{Ot} t=1 Z T 1 Y 2 = ln p (Q ; n)dQ eMGβ,λ(nσ ,{n(rt−2mt+qt)+Qt−qt})eM(n−1)Gβ,λ(0,{Qt−qt}) (123) M β,M t t t=1

With some foresight we now impose the RS assumption on where the second equality follows by the fact that the x that ˜ ˆ 9 {Qt} via auxiliary parameters {Qt, mˆ t, χˆt} as minimizes the cost in (131) is given by y − 1 ˜[a,0] ˜[0,b] Q = Q = −βmˆ t, ∀a, b ≥ 1, (126)  , if y > 1; t t  Qˆ βQˆ − β2χˆ ˆ  Q˜[a,a] = t t , ∀a ≥ 1, (127) xˆ(y; Q) = 0, if |y| ≤ 1; (132) t 2 y + 1  ˜[a,b] 2  , if y < −1. Qt = −β χˆt, ∀a 6= b ≥ 1. (128)  Qˆ The saddle-point method then provides the following expres- Recalling that the elements of x0 are IID according to (6) with t sion π(x) = gx(0; 1) simplifies the function Vβ,M under the RS M Z h i assumption to Vβ,M (Q˜ ; n) = [Vβ(Qˆ ; n)] where ˆ p ˆ  t t Vβ(Qt; n) = (1 − ρt) exp − βnφ zt χˆt; Qt Dzt Z  n   n  Z h q i ˆ Y a X a +ρ exp − βnφz χˆ +m ˆ 2; Qˆ  Dz . Vβ(Qt; n) = dxt exp − β c(xt ) t t t t t t a=1 a=1 (133) (  ˆ n  n 2 βQt X 1 X × (1 − ρ ) exp − (xa)2 + βpχˆ xa t 2 t 2 t t Note that the structure of the equations does not force n a=1 a=1 to be an integer anymore, so we assume that analytical  ˆ n  q n 2) βQt X a 2 1 2 X a continuation can be used to take the limit n → 0. This provides +ρt exp − (xt ) + β χˆt +m ˆ t xt , ˆ 2 2 Vβ(Qt; n) → 1 for the data dependent part of the probability a=1 a=1 weight (124), which is consistent with (125). (129) Returning to (124) and denoting χt = β(Qt − qt), we have ˆ ˆ under RS ansatz and Qt should be read as a shorthand for the set {χˆt, Qt, mˆ t}. To assess the integrals in (129) w.r.t. the replicated variables pβ,M (Qt; n) we first decouple the quadratic terms that have summations 1 Z   Qˆ Q − χˆ χ χˆ q inside by using (182). By the fact that all integrals for a = = exp βM n t t t t − nmˆ m − n2β t t zn 2 t t 2 1, 2, . . . , n are identical we obtain β,M  1 ˆ ˆ ˆ + log Vβ(Qt; n) dQt, (134) Vβ(Qt; n) β n Z  Z 2 √  −β[Qˆtx /2−zt χˆtxt+c(xt)] ˆ ˆ = (1 − ρt) e t dxt Dzt where dQ is a short-hand for dˆχtdQtdm ˆ t. It is important to recognize that we have now managed to write the components Z  Z √ n −β[Qˆ x2/2−z χˆ +m ˆ 2x +c(x )] of the free energy in a functional form of n where the limit +ρ e t t t t t t t dx Dz , t t t n → 0 can be taken, at least in principle. Applying the saddle- (130) point method to integrate w.r.t. Qˆ and Q as β, M → ∞ and √ changing the order of extremization and partial derivation, we −z2/2 where Dzt = dzte t / 2π is the standard Gaussian get (135) at the top of the next page. Here we used the fact −1 measure. For large β we may then employ the saddle-point that β Gβ,λ(0, {Qt − qt}) → 0 as β → ∞ for χ, Λ ∈ R and integration w.r.t. xt. If we now specialize to LASSO recon- 1 ∂ ˆ struction (4) so that the per-element regularization function is − lim log Vβ(Qt; n) β n→0 ∂n c(x) = |x|, we may define Z p  = (1 − ρt) Dztφ zt χˆt; Qˆt Qˆ  φ(y; Qˆ) = min x2 − yx + |x| Z q x∈ 2 2 ˆ  R +ρt Dztφ zt χˆt +m ˆ t ; Qt . (136)  (|y| − 1)2 − , |y| > 1, 9 ˆ = ˆ Note that xˆ(y; Q) can be interpreted as soft thresholding of observation 2Q (131) y. Compare the above also to (60). For further discussion on the relevance 0 otherwise, and interpretation of this function, see Section IV. 18 IEEE TRANSACTIONS ON INFORMATION THEORY

 T  ˆ Z Z q  1 X QtQt χˆtχt p ˆ  2 ˆ  f = extr mˆ tmt − + + (1 − ρt) Dztφ zt χˆt; Qt + ρt Dztφ zt χˆt +m ˆ t ; Qt T ˆ 2 2 Q,Q t=1  ∂ 1 2 − lim lim Gβ,λ(nσ , {n(ρt − 2mt + qt) + Qt − qt}) (135) β→∞ n→0 ∂n β

−1 We also used above the fact that rt → ρt for large M and that all values of β > 0, which means χt (ρt − 2mt + qt) → 1 −1 the term n in (120) is irrelevant for the analysis because χt (ρt − 2mt + Qt) as β → ∞. zβ,M

1 ∂ n M→∞ lim ln zβ,M −−−−→ 0. (137) M n→0 ∂n B. Saddle-Point Conditions By the chain rule Using the short-hand notation

1 ∂ 2 T ∗ lim Gβ,λ(nσ , {νt(n)}t=1) Rt = Rt(λ, {Λt }), (145) β n→0 ∂n σ2 ∂G (nσ2, {ν (n)}T ) the partial derivatives w.r.t. {m ,Q } in (142) provide the = lim β,λ t t=1 t t β n→0 ∂(nσ2) saddle-point conditions T   2 T R 1 1 X ∂νt(n) ∂Gβ,λ(nσ , {νt(n)}t=1) ˆ t ∗ + lim , (138) Qt =m ˆ t = = − Λt . (146) β n→0 ∂n ∂ν (n) χt χt t=1 t By the fact that so that by plugging νt(n) = n(ρt − 2mt + qt) + Qt − qt to (138) we have ∂ ∂h  1  r(h) = − Q √ , (147) 1 ∂ 2 T ∂x ∂x h lim Gβ,λ(nσ , {νt(n)}t=1) β n→0 ∂n we may assess the partial derivatives w.r.t. the variables 2  T −1 T   σ X 1 X ρt − 2mt + qt 1 {Qˆ , mˆ , χˆ } as well to obtain = − λ + + Λ∗ − , t t t 2 Λ∗ 2 t χ t=1 t t=1 t  1  mt = 2ρtQ , (148) (139) p 2 χˆt +m ˆ t where {Λ∗} denotes the solution to the extremization problem 2(1 − ρ ) 2ρ t Q = − t r(ˆχ ) − t r(ˆχ +m ˆ 2), (149) in (122), given σ2 = 0. t 2 t 2 t t mˆ t mˆ t To solve the last integrals in (135), let us denote     2(1 − ρt) 1 2ρt 1 χt = Q √ + Q , (150) r   mˆ mˆ p 2 h − 1 1 t χˆt t χˆt +m ˆ t r(h) = e 2h − (1 + h)Q √ . (140) 2π h where we used the identity Qˆt =m ˆ t to simplify the results. With some calculus one may verify that for h > 0 The MSE of the reconstruction for xt thus becomes Z √ ˆ  r(h) φ zt h; Qt Dzt = , (141) mse = ρ − 2m + Q Qˆ t t t t t  1  which implies that combining all of the above, the free energy = ρt − 4ρtQ pχˆ +m ˆ 2 has the form (142) given at the top of the next page. To obtain t t this result, we used the fact that denoting 2(1 − ρt) 2ρt 2 − 2 r(ˆχt) − 2 r(ˆχt +m ˆ t ). (151) mˆ t mˆ t 1  T 1 −1 X ∗ Rt(λ, {Λt}) = λ + , (143) Finally, recalling that Λt is a function of {χt}, we obtain from Λt Λs s=1 the partial derivative of χt the extremization in (122) implies the condition T ∗ mset X 2 2 ∗ 1 Rt(λ, {Λt }) χˆt = 2 + (mses − σ Rs)∆s,t, (152) Λt − = − , (144) χt χt χt s=1

∂Λs between the variables χt and {Λt}. Furthermore, in order to where we denoted ∆s,t = for the partial derivative of Λs ∂χt have a meaningful solution to (142) and (144) for σ > 0, we w.r.t. χt. 10 also need to have χt = β(Qt − qt) positive and finite for To solve the equation for χˆt, we need an expression for ∆s,t. We do this via the inverse function theorem that relates 10The case of χ → 0 is in fact relevant for the noise-free scenario σ = 0 t the Jacobian matrices as and corresponds to the perfect recovery condition ρt = mt = Qt =⇒ mset = ρt − 2mt + Qt = 0, which automatically satisfies Qt = qt as well.  −1 ˆ ∂(Λ ,..., Λ ) ∂(χ , . . . , χ ) Furthermore, for this scenario Q =m ˆ → ∞ as β → ∞, while in the noisy ∆ = 1 T = 1 T . (153) case they are always positive and finite parameters. ∂(χ1, . . . , χT ) ∂(Λ1,..., ΛT ) VEHKAPERA¨ et al.: ANALYSIS OF REGULARIZED LS RECONSTRUCTION AND RANDOM MATRIX ENSEMBLES IN COMPRESSED SENSING 19

 T  ˆ  1 X QtQt χˆtχt 1 − ρt ρt 2 f = extr mˆ tmt − + + r(ˆχt) + r(ˆχt +m ˆ t ) T ˆ 2 2 ˆ ˆ {mt,Qt,χt,mˆ t,Qt,χˆt} t=1 Qt Qt 2  T −1 T   σ X 1 X ρt − 2mt + Qt 1 + λ + + − Λ∗ (142) 2 Λ∗ 2 χ t t=1 t t=1 t

Here the (i, j)th element of the Jacobian ∂(χ1,...,χT ) is given where ∆xa = x0 − xa ∈ N for a = 1, . . . , n, so that the ∂(Λ1,...,ΛT ) R by counterpart of (96) reads

∂χi ∂ (1 − Ri) ∂ 1 = f = − lim lim ln Ξβ,N (n). (158) n→0+ β,N→∞ ∂Λj ∂Λj Λi ∂n βN (1 − Ri) 1 ∂Ri The goal is then to assess the normalized free energy (158) = − 2 δij − Λ Λi ∂Λj by following the same steps as given in Appendix A. i n×n Let us construct matrices St ∈ and Q for all (1 − 2Ri) RiRj R t = − 2 δij − . (154) t = 1,...,T with elements as given in (98) and (99). Λi ΛiΛj −1 P Also define the “empirical mean” matrices S = T t St, T −1 P [a,b] In other words, denoting b = [R1/Λ1 ··· RT /ΛT ] and Q = T t Qt, that have the respective elements S defining C to be diagonal matrix whose (t, t)th entry is given and Q[a,b] and invoke the RS assumption (102)–(105). We −2 a a by (1−2Rt)Λt , we obtain by (153) and the matrix inversion then make the transformation {∆xt } → {∆x˜t } as with lemma the desired Jacobian as the T -orthogonal setup so that the empirical correlations of a −1 −1 T {∆x˜t } satisfy (114). Note that this means that given {Qt}, T −1 −1 (C b)(C b) a  a T a TT ∆ = −(C + bb ) = −C + T −1 , (155) the transformed vectors ∆x˜ = (∆x˜1) ··· (∆x˜T ) 1 + b C b satisfy which means that T a 2 ˆ X ˜[a,b] ˜[a,b] k∆x˜ k = M St = NS , (159)  T 2 −1 RsRtΛsΛt X Rk t=1 ∆s,t = 1 + (1 − 2Rs)(1 − 2Rt) 1 − 2Rk ˜[a,b] −1 P ˜[a,b] k=1 where S = T t St . Combining the above provides 2 Λt the counterpart of (119) as − δst. (156) 1 − 2Rt Ξβ,N (n) Combining all the results completes the derivation. Z  n   n  Y a 0 X a = Ew,{Ot} dx p(x ; {ρt}) exp − β c(x ) a=0 a=1 APPENDIX B n  β √ β X  REPLICA ANALYSIS OF ROTATIONALLY INVARIANT SETUP × exp − k nσ2w − A∆x˜1k2 − kA∆x˜ak2 . 2λ 2λ The derivation in this Appendix provides an end result that a=2 is essentially the same as the HCIZ-formula based [45], [46] (160) approach used in Section IV and Appendix B in [30]. In our We then need the following small result to proceed. case the difference is that the source does not need to have M×N IID elements, but can have a block structure. Furthermore, Lemma 2. Consider the case where A ∈ R is sampled our analytical approach is slightly different to the one in [30] from the rotationally invariant setup given in Definition 2. Let T ˆ 2 since we do not seek to find first a decoupling result for finite {ut}t=1 be a fixed set of length-M vectors satisfying kutk = ˆ ˆ β and then use hardening arguments as in [28] to obtain the Mνt for some given non-negative values {νt} and N = T M. N final result when β → ∞. Both end results are equivalent as Denote u ∈ R for the vector obtained by stacking {ut} and shown in Section IV-B. define 2 β 2 NHβ,λ(σ ,{νt}) − kσw−Auk Recall the rotationally invariant setup as given in Defini- e = Ew,Ae 2λ , (161) 0 tion 2. Let p(x ; {ρt}) be the distribution of the source vector 0 N 0 ˆ where w is a standard Gaussian random vector. Then, for x ∈ R and assume that each of the sub-vectors xt has M elements drawn independently according to (6). Clearly we large N ˆ ˆ 2 2 have to have N = MT but it is not necessary to have M = M Hβ,λ(σ , ν) = Hβ,λ(σ , {νt}) as in the case of T -orthogonal setup. Define 1  = extr Λ(βν) − (1 − α) ln Λ Z  n  2 Λ 0 X a Ξβ,N (n) = Ew,A p(x ; {ρt}) exp − β c(x ) Z  2 a=1 −α ln(Λβσ + Λλ + x)dFAAT (x) n n  β X  Y × exp − kσw − A∆xak2 dxa, (157) 1 + ln(βν) − α ln λ 2λ − , (162) a=1 a=0 2 20 IEEE TRANSACTIONS ON INFORMATION THEORY

−1 PT where ν = T t=1 νt and we omitted terms of the order where we used the Stieltjes transform of FAAT (x), O(1/N). Z 1 G T (s) = dF T (x), (168) Proof: Proof is given in Appendix C-B. AA x − s AA Notice that the H-function in (162) depends on the pa- along with the chain rule rameters {νt} only through the “empirical mean” ν = −1 PT 1 ∂ 2 T t=1 νt. This will translate later to the fact that the lim lim Hβ,λ(nσ , ν(n)) performance of rotationally invariant setup depends on the β→∞ β n→0 ∂n 2 2 −1 P σ ∂Hβ,λ(nσ , ν(n)) sparsities {ρt} only through ρ = T t ρt. With the above in = lim lim n→0 2 mind, we may obtain the probability weight pβ,N (Q; n) of Q β→∞ β ∂(nσ ) by using (101) with suitable variable substitutions. Applying 1 ∂ν(n)∂H (nσ2, ν(n)) + lim lim β,λ then Lemma 2 to (160) provides β→∞ β n→0 ∂n ∂ν(n) 2 Z ∗   1 ασ Λ ρ − 2m + Q ∗ 1 ln Ξ (n) = − dF T (x) + Λ − N β,N 2 λΛ∗ + x AA 2 χ 1 Z 2 2 ∗   = ln p (Q; n)eNHβ,λ(nσ ,n(r−2m+q)+Q−q) ασ Λ ∗ ρ − 2m + Q ∗ 1 β,N = − G T (−λΛ ) + Λ − , N 2 AA 2 χ ×eN(n−1)Hβ,λ(0,Q−q)dQ, (163) (169) −1 P −1 P −1 P ∗ where r = T t rt, m = T t mt,Q = T t Qt, where ν(n) = n(r − 2m + q) + Q − q. Here Λ is the solution −1 P 2 and q = T t qt are the “averaged” versions of the RS to the extremization in (162), given σ = 0, and satisfies the variables {rt, mt,Qt, qt}. The probability weight of Q reads condition ˆ ∗ Z  Y dQ˜[a,b]  ∗ 1 α ∗ ∗  R(Λ ) p (Q; n) = N n(n+1)/2 Λ − = − 1−(λΛ )·GAAT (−λΛ ) = − , (170) β,N 2πi χ χ χ 0≤a≤b≤n   where X ˜[a,b] [a,b] ˜ × exp N Q Q Vβ,N (Q; n), (164) ∗  ∗ ∗  Rˆ(Λ ) = α 1 − (λΛ ) · G T (−λΛ ) . (171) 0≤a≤b≤n AA Finally, we need to resolve the saddle point conditions in where Q˜ is a (n+1)×(n+1) transform domain matrix whose (167). The partial derivatives w.r.t. {m, Q} provide elements are {Q˜[a,b]} and ˆ ∗ n ˆ 1 ∗ R(Λ ) Z  a  Q =m ˆ = − Λ = , (172) 0 0 Y −βkx k1 a Vβ,N (Q˜ ; n) = p(x )dx e dx χ χ a=1 while the partial derivatives w.r.t. {Q,ˆ m,ˆ χˆ} are of the same  X  × exp − Q˜[a,b]xa · xb . (165) format as in (148)–(150) but without indexes t. Finally, ∗ 0≤a≤b≤n recalling that Λ depends on χ   We then get directly using the arguments from Appendix A ∂ ∗ ∂Λ 0 ∗ G T (−λΛ ) = −λ G T (−λΛ ), (173) that (165) becomes in the limit N → ∞ ∂χ AA ∂χ AA Z 0 h p i where G T denotes the derivative of G T w.r.t. the argu- V (Qˆ ; n) = (1 − ρ) exp − βnφz χˆ; Qˆ Dz AA AA β ment, gives Z h i p 2 ˆ  1 ∂Λ∗  +ρ exp − βnφ z χˆ +m ˆ ; Q Dz, (166) χˆ = mse + χ2 ∂χ −1 P ∗ where ρ = T t ρt is the expected sparsity of the entire 2 ∗ ∗ 0 ∗ ∂Λ −ασ G T (−λΛ ) − (λΛ ) · G T (−λΛ ) , (174) source vector x. Therefore, the details of how the non- AA AA ∂χ zero elements are distributed on different sub-blocks {xt} is in which irrelevant for the rotationally invariant case. ∗  −1 Combining everything above and denoting ∂Λ 1 − α 2 0 ∗ = − + (αλ ) · G T (−λΛ ) . (175) −1 P ∂χ (Λ∗)2 AA χ = T t β(Qt − qt) implies that the free energy for the rotationally invariant case reads To obtain the last formula we used the fact that  ˆ ˆ QQ χχˆ ∂χ 1 ˆ 1 ∂R f = extr mmˆ − + ∗ = − ∗ 2 (1 − R) − ∗ ∗ {m,Q,χ,m,ˆ Q,ˆ χˆ} 2 2 ∂Λ (Λ ) Λ ∂Λ 1 − α  1  2  2 0 ∗ + (1 − ρ)r(ˆχ) − ρr(ˆχ +m ˆ ) = − + (αλ ) · GAAT (−λΛ ) . (176) Qˆ (Λ∗)2 2 ∗   ασ Λ ∗ ρ − 2m + Q 1 ∗ Remark 5. Consider the row-orthogonal setup where + GAAT (−λΛ ) + − Λ , 2 2 χ 1 G T (s) = . (177) (167) AA α−1 − s VEHKAPERA¨ et al.: ANALYSIS OF REGULARIZED LS RECONSTRUCTION AND RANDOM MATRIX ENSEMBLES IN COMPRESSED SENSING 21

For this case, the extremization in (162) can also be written where the second equality is obtained by solving the extrem- in the form ization problem. Substituting (184) back to (180) provides an (M) 1 1  1  expression for p ({ut}; {νt}). Λ − = − βν βν α−1 + Λ(λ + βσ2) Recall the T -orthogonal setup given in Definition 2. Fix the 1  1  parameters M, β, λ and define −−−→−σ=0 . (178) βν α−1 + Λλ (M) 2 Gβ,λ (σ , {νt}) We may then plug (177) and (178) to (167) and compare 1 − β kσw−PT O ∆x k2 = ln E e 2λ t=1 t t the end result with (142)–(144). It is clear that the two free M w,{Ot} energies are exactly the same if we set α = 1/T and ρ = ρt so 1 Z = ln E p(M)({u }; {ν }) that ν = νt and Λ = Λt for all t = 1,...,T . Therefore, also M w t t the saddle point solutions of row-orthogonal and T -orthogonal √ √ T 1 2 PT 2 − k βσ w− β utk Y setups match for this special case and the MSE is the same. ×e 2λ t=1 dut, (185) t=1 APPENDIX C USEFUL MATRIX INTEGRALS where {Ot}, {∆xt}, {ut} and {νt} are as before. Applying A. T -Orthogonal Setup the Gaussian integration formula (182) from right-to-left along T with the expressions (180) and (184) provides Let {Ot}t=1 be a set of independent M ×M Haar matrices T (M) 2 and {∆xt}t=1 a set of (fixed) length-M vectors that satisfy G (σ , {νt}) 2 β,λ k∆xtk = Mνt for some given non-negative values {νt}. T 1 Z  M  Z Given {∆xt} and {νt}, the vector ut = Ot∆xt is uniformly Y Λtνt a(k,w) = ln Ew dΛte 2 dke distributed on a surface of a sphere that has a fixed radius M √ t=1 Mν for each t = 1,...,T . Thus, the joint PDF of {u } t t T √ Z  1 2 T  Y − Λtkutk −i βk ut reads × e 2 dut (M) t=1 p ({ut}; {νt}) T 1 λ 1 + ln − ln Z({νt}), (186) 1 Y 2 = δ(kutk − Mνt) (179) 2 2π M Z({νt}) t=1 where k ∈ RM , the normalization factor is given in (184) and −T T (4πi) Z  Λt 2  we denoted Y − (kutk −Mνt) = e 2 dΛt , (180) Z({νt}) λ p t=1 a(k, w) = − kkk2 + i βσ2kTw. (187) 2 where Z({νt}) is the normalization factor, {Λt} is a set of complex numbers and we used the identity Using next Gaussian integration repeatedly to assess the expec- Z c+i∞ tations w.r.t. {ut}, k and w yields (188) at the top of the next 1 − Λ (t−a) δ(t − a) = e 2 dΛ, (181) page. We then change the integration variables as Λt → βΛt, 4πi c−i∞ take the limit M → ∞ and employ saddle-point integration. where a, c, t ∈ R, Λ ∈ C. Using the Gaussian integration Omitting all terms that vanish in the large-M limit provides formula the final expression Z 1 − 1 zTMz+bTz 1 1 bTM −1b e 2 dz = e 2 ,  T  N/2 p 1 X (2π) det(M) G (σ2, {ν }) = − T − ln λ + ln(βν ) β,λ t 2 t (182) t=1 N where b, z ∈ R and M is symmetric positive definite, the  T  T  1 X   2 X 1 normalization factor becomes + extr Λt(βνt)−ln Λt −ln λ + βσ + . 2 {Λ } Λ T t t=1 t=1 t 1 Z  Λt 1 2  Y Mνt − Λtkutk Z({νt}) = e 2 e 2 dutdΛt (189) (4πi)T t=1 2  M/2 T Z T Finally, we remark that the extremization in Gβ,λ(σ , {νt}) (2π) Y  M (Λ ν −ln Λ )  = e 2 t t t dΛ . (183) as given above enforces the condition 4πi t t=1 −1 ! 2 1 Λt Since the argument of the exponent in (183) is a complex βνt(σ , β, λ) = 1 − , (190) Λ 2 PT −1 analytic function of {Λt} and we are interested in the large- t λ + βσ + t=1 Λt M asymptotic, the saddle-point method further simplifies the implying Λ ∈ \{0} for all {β, λ, σ2} and t = 1,...,T . normalization factor to the form t R Thus, the expression (189) together with the condition (190) T 1 1 X  −1 provides the solution to the integration problem defined in ln Z({νt}) = extr Λtνt − ln Λt + O(M ) M 2 Λt (185). 
Furthermore, for the special case of σ = 0 we have t=1 T −1 ! X 1 + ln νt 1 Λ = + O(M −1), (184) βν (σ2 = 0, β, λ) = 1 − t , (191) 2 t Λ PT −1 t=1 t λ + k=1 Λk 22 IEEE TRANSACTIONS ON INFORMATION THEORY

Z T Z   T   (M) 2 1 Y  M Λ ν − M ln Λ  1 X β 2 p T G (σ , {ν }) = ln E dΛ e 2 t t 2 t exp − λ + kkk + i βσ2w k dk β,λ t M w t 2 Λ t=1 t=1 t 1 λ 1 + ln − ln Z({ν }) 2 2π M t T T T T 1 Z  Y  M  X X  X β  = ln dΛ exp Λ ν − ln Λ − ln λ + M t 2 t t t Λ t=1 t=1 t=1 t=1 t T −1 1 Z  1  X β    1 1 × exp − 1 + βσ2 λ + kwk2 dw + ln λ − ln Z({ν }) (2π)M/2 2 Λ 2 M t t=1 t T T T T 1 Z M  X X  X β  Y = ln exp Λ ν − ln Λ − ln λ + βσ2 + dΛ M 2 t t t Λ t t=1 t=1 t=1 t t=1 1 1 + ln λ − ln Z({ν }) (188) 2 M t

2 −1 2 so that νt(σ = 0, β → ∞, λ) → 0 and β Gβ,λ(σ = fixed ratio α = M/N, this expectation is by assumption w.r.t. β→∞ a probability measure that has a single non-zero point corre- 0, {νt}) −−−−→ 0. This is fully compatible with the earlier result obtained in [32], as expected. sponding to the limiting deterministic eigenvalue distribution FATA. Finally, using saddle point method to integrate over Λ, we obtain B. Rotationally Invariant Setup M×N 2 Let us consider the case where A ∈ R is sam- Hβ,λ(σ , ν) pled from an ensemble that allows the decomposition R = 1  Z  1   T T = extr Λ(βν) − ln Λ + x dFATA(x) A A = O DO where O is an N × N Haar matrix and 2 Λ λ + βσ2 D = diag(d1, . . . , dN ) contains the eigenvalues of R. This is α  βσ2  1 + ln(βν) the case of rotationally invariant setup given in Definition 2. − ln 1 + − (195) T ˆ 2 λ 2 Furthermore, let {∆xt}t=1 be a set of (fixed) length-M vec-  2 ˆ 1 tors satisfying k∆xtk = Mνt for some given non-negative = extr Λ(βν) − (1 − α) ln Λ 2 Λ values {νt} and N = T Mˆ . For notational convenience, we N Z  write ∆x ∈ for the vector obtained by stacking {∆xt}. 2 R −α ln(Λβσ + Λλ + x)dF T (x) The counterpart of (185) reads then AA 1 + ln(βν) − α ln λ H(N)(σ2, {ν }) − , (196) β,λ t 2 1 − β kσw−A∆xk2 = ln Ew,Ae 2λ , where the second equality is obtained by changing the integral N 2 α  βσ2  measure and simplifying. For the case σ = 0, the extremiza- = − ln 1 + tion then provides the condition 2 λ      Z 2  1 1 β T 1 α Λβσ + Λλ + ln ER exp − ∆x R∆x , (192) Λ − = − 1 − dF T (x) N 2 λ + βσ2 βν βν Λβσ2 + Λλ + x AA where the second equality follows by using Gaussian integra- σ2=0 1 α   −−−→ Λ − = − 1 − (Λλ)GAAT (−Λλ) , (197) tion formula (182) to average over the additive noise term w. βν βν T Recall next the fact that R = O DO and denote u = O∆x. where we used again the Stieltjes transformation (168) of Since O are Haar matrices and FAAT (x). T X νt kO∆xk2 = T Mˆ = Nν, (193) T t=1 APPENDIX D GEOMETRIC ENSEMBLE where ν is the “empirical average” over {νt}, we get by the same arguments as in Appendix C-A an expression for Recall that the geometric singular value ensemble is gen- (N) 2 A = UΣV T U V Hβ,λ (σ , ν) as given in (194) at the top of the next page. erated as where and are independent Considering next the limit of large M and N, we replace Haar matrices. The diagonal elements of Σ are the singu- p m−1 the summation in (194) by an integral over the empirical lar values σm = a(κ)τ , m = 1,...,M of A with −1 PM distribution of the eigenvalues (9), so that the outer expectation τ ∈ (0, 1] and a(κ) > 0 such that N m=1 λm = 1 2 T w.r.t. D becomes an expectation over all empirical eigenvalue where λm = σm are the eigenvalues of AA . Alternatively, −γM (i/M) distributions of R. But when M,N → ∞ with a finite and we may write λi+1 = a(κ)e , i = 0, 1,...,M − 1 VEHKAPERA¨ et al.: ANALYSIS OF REGULARIZED LS RECONSTRUCTION AND RANDOM MATRIX ENSEMBLES IN COMPRESSED SENSING 23

Z Z     (N) 2 1 Λ Nν 1 T β H (σ , ν) = ln E dΛe 2 exp − u ΛI + D u du β,λ N D 2 N λ + βσ2 α  βσ2  1 + ln ν − ln 1 + − + O(N −1) 2 λ 2 N 1 Z N  1 X  1  = ln E exp Λ(βν) − ln Λ + d dΛ N D 2 N λ + βσ2 n n=1 α  βσ2  1 + ln(βν) − ln 1 + − + O(N −1) (194) 2 λ 2

where γM = −2M ln τ ≥ 0. Letting M → ∞ provides the 3) Replace S by Σ to create a sensing matrix X = continuous limit function for the eigenvalues UΣV T. Note that permutations of the diagonal ele- ments in Σ has no impact on the reconstruction per- λ(t) = A(κ)e−γt, t ∈ [0, 1), (198) formance. where γ > 0 satisfies λ(0) γ REFERENCES κ = = , (199) R 1 1 − e−γ [1] D. L. Donoho, “Compressed sensing,” IEEE Trans. Inform. Theory, 0 λ(t)dt vol. 52, no. 4, pp. 1289–1306, Apr. 2006. for the given peak-to-average ratio κ. The normalization [2] E. J. Candes, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency informa- −1 PM condition N m=1 λm = 1 becomes now tion,” IEEE Trans. Inf. Theory, vol. 52, no. 2, pp. 489–509, Feb. 2006. 1 [3] E. J. Candes and T. Tao, “Near-optimal signal recovery from random Z κ projections: Universal encoding strategies?” IEEE Trans. Inf. Theory, α A(κ)e−γtdt = 1 ⇐⇒ A(κ) = , (200) α vol. 52, no. 12, pp. 5406–5425, Dec. 2006. 0 [4] R. Tibshirani, “Regression shrinkage and selection via the lasso,” J. κ −γ κ Royal. Statist. Soc., Ser. B, vol. 58, no. 1, pp. 267–288, 1996. which means that λ(t) ∈ [ α e , α ]. T [5] S. S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition The function (198) describes the eigenvalues of AA in by basis pursuit,” SIAM J. Sci Comp., vol. 20, no. 1, pp. 33–61, 1998. the large system limit. Since the order of the eigenvalues and [6] CVX Research, Inc., “CVX: Matlab software for disciplined convex associated eigenvectors does not affect the performance of the programming,” http://cvxr.com/cvx. [7] D. Donoho and X. Huo, “Uncertainty principles and ideal atomic reconstruction, we may also consider sampling randomly and decomposition,” IEEE Trans. Inform. Theory, vol. 47, no. 7, pp. 2845– uniformly t ∈ [0, 1) and assigning the corresponding eigenval- 2862, Nov. 2001. ues according to (198). Then, by construction the limit of (9) [8] M. Elad and A. Bruckstein, “A generalized uncertainty principle and −γt sparse representation in pairs of bases,” IEEE Trans. Inform. Theory, for this ensemble is given by FAAT (Ae ) = 1−t, t ∈ [0, 1) vol. 48, no. 9, pp. 2558–2567, Sep. 2002. or more conveniently [9] D. Donoho and M. Elad, “Optimally sparse representation in general ( (non-orthogonal) dictionaries via l1 minimization,” Proc. Nat. Acad. 1 + γ−1 ln x − γ−1 ln A, if x ∈ (Ae−γ ,A], Sci., vol. 100, no. 5, pp. 2197–2202, Nov. 2003. FAAT (x) = [10] E. Candes and T. Tao, “Decoding by linear programming,” IEEE Trans. 0, otherwise, Inform. Theory, vol. 51, no. 12, pp. 4203 – 4215, dec. 2005. (201) [11] R. Baraniuk, M. Davenport, R. Devore, and M. Wakin, “A simple proof where we wrote for simplicity A = A(κ). This is also called of the restricted isometry property for random matrices,” Constr Approx, vol. 28, no. 3, pp. 253–263, 2008. the reciprocal distribution whose density reads [12] J. Tropp and A. Gilbert, “Signal recovery from random measurements via orthogonal matching pursuit,” IEEE Trans. Inf. Theory, vol. 53, ( 1 −γ γx , if x ∈ (Ae ,A], no. 12, pp. 4655–4666, Dec. 2007. f T (x) = (202) AA 0, otherwise. [13] M. A. Davenport and W. B. Wakin, “Analysis of orthogonal matching pursuit using the restricted isometry property,” IEEE Trans. Inf. Theory, vol. 56, no. 9, pp. 4395–4401, Sep. 2010. For the analysis, one can obtain the Stieltjes transform of (202) [14] T. T. Cai and L. Wang, “Orthogonal matching pursuit for sparse signal directly from the definition (168), as given in Example 4. 
recovery with noise,” IEEE Trans. Inf. Theory, vol. 57, no. 7, pp. 4680– The sensing matrices for the geometric setup in finite size 4688, Jul. 2011. [15] W. Dai and O. Milenkovic, “Subspace pursuit for compressive sensing simulations, on the other hand, can be constructed as follows: signal reconstruction,” IEEE Trans. Inf. Theory, vol. 55, no. 5, pp. 2230– 1) Generate M × N matrix X with IID standard normal 2249, May 2009. elements and calculate the singular value decomposition [16] D. Needell and J. A. Tropp, “CoSaMP: Iterative signal recovery from T incomplete and inaccurate samples,” Applied and Computational Har- X = USV . For the Gaussian ensemble, U and V monic Analysis, vol. 26, no. 3, pp. 301–321, 2009. are independent Haar matrices. [17] X. Lv, G. Bi, and C. Wan, “The group lasso for stable recovery of block- 2) Find numerically the value of τ that meets the peak-to- sparse signal representations,” IEEE Trans. Signal Process., vol. 59, no. 4, pp. 1371–1382, Apr. 2011. average constraint (8) and set [18] S. K. Ambat, S. Chatterjee, and K. V. S. Hari, “Fusion of algorithms 1 for compressed sensing,” IEEE Trans. Signal Process., vol. 61, no. 14, a(κ) = , (203) pp. 3699–3704, Jul. 2013. −1 PM 2(m−1) [19] D. Donoho and J. Tanner, “Counting faces of randomly projected N m=1 τ polytopes when the projection radically lowers dimension,” Journal of so that the average power constraint is satisfied. the American Mathematical Society, vol. 22, no. 1, pp. 1–53, 2009. 24 IEEE TRANSACTIONS ON INFORMATION THEORY

[20] D. L. Donoho and J. Tanner, “Counting the faces of randomly-projected [49] Y. Wu and S. Verdu,´ “Optimal phase transitions in compressed sensing,” hypercubes and orthants, with applications,” Discrete & Computational IEEE Trans. Inf. Theory, vol. 58, no. 10, pp. 6241–6263, Oct. 2012. Geometry, vol. 43, no. 3, pp. 522–541, 2010. [50] S. B. Korada and A. Montanari, “Applications of the Lindeberg principle [21] ——, “Precise undersampling theorems,” Proc. IEEE, vol. 98, no. 6, pp. in communications and statistical learning,” IEEE Trans. Inf. Theory, 913–924, Jun. 2010. vol. 57, no. 4, pp. 2440–2450, Apr. 2011. [22] D. L. Donoho, A. Maleki, and A. Montanari, “Message passing algo- [51] P. Viswanath, V. Anantharam, and D. N. C. Tse, “Optimal sequences, rithms for compressed sensing,” Proc. Nat. Acad. Sci., vol. 106, pp. power control, and user capacity of synchronous CDMA systems with 18 914–18 919, 2009. linear MMSE multiuser receivers,” IEEE Trans. Inf. Theory, vol. 45, [23] A. Montanari, “Graphical models concepts in compressed sensing,” in no. 6, pp. 1968–1983, Sep. 1999. Compressed sensing: Theory and Applications, Y. Eldar and G. Ku- [52] K. Kitagawa and T. Tanaka, “Optimization of sequences in CDMA tyniok, Eds. Cambridge University Press, 2012, pp. 394–438. systems: A statistical-mechanics approach,” Computer Networks, vol. 54, [24] M. Bayati and A. Montanari, “The dynamics of message passing on no. 6, pp. 917–924, 2010. dense graphs, with applications to compressed sensing,” IEEE Trans. [53] S. Oymak and B. Hassibi, “A case for orthogonal measurements in linear Inf. Theory, vol. 57, no. 2, pp. 764–785, Feb. 2011. inverse problems,” in Proc. IEEE Int. Symp. Inform. Theory, 2014, pp. [25] M. Mezard,´ G. Parisi, and M. A. Virasoro, Spin Glass Theory and 3175–3179. Beyond. Singapore: World Scientific, 1987. [54] C. Thrampoulidis and B. Hassibi, “Isotropically random orthogonal [26] V. Dotsenko, Introduction to the Replica Theory of Disordered Statistical matrices: Performance of LASSO and minimum conic singular values,” Systems. New York: Cambridge University Press, 2001. in Proc. IEEE Int. Symp. Inform. Theory, 2015, pp. 556–560. [27] H. Nishimori, Statistical Physics of Spin Glasses and Information [55] C.-K. Wen, J. Zhang, K.-K. Wong, J.-C. Chen, and C. Yuen, “On Processing. New York: Oxford University Press, 2001. sparse vector recovery performance in structurally orthogonal matrices [28] S. Rangan, A. K. Fletcher, and V. K. Goyal, “Asymptotic analysis of via LASSO,” arXiv:1410.7295 [cs.IT], Oct. 2014. MAP estimation via the replica method and applications to compressed [56] Y. Kabashima and M. Vehkapera,¨ “Signal recovery using expectation sensing,” IEEE Trans. Inform. Theory, vol. 58, no. 3, pp. 1902–1923, consistent approximation for linear observations,” in Proc. IEEE Int. Mar. 2012. Symp. Inform. Theory, 2014, pp. 226–230. [29] D. Guo, D. Baron, and S. Shamai, “A single-letter characterization [57] B. Cakmak, O. Winther, and B. H. Fleury, “S-AMP: Approximate of optimal noisy compressed sensing,” in Proc. Annual Allerton Conf. message passing for general matrix ensembles,” in Proc. IEEE Inform. Commun., Contr., Computing, Sep. 30 - Oct. 2 2009, pp. 52–59. Theory Workshop, 2014, pp. 192–196. [30] A. M. Tulino, G. Caire, S. Verdu,´ and S. Shamai, “Support recovery [58] J. Ma, X. Yuan, and L. Ping, “On the performance of turbo signal re- with sparsely sampled free random matrices,” IEEE Trans. Inf. Theory, covery with partial DFT sensing matrices,” IEEE Trans. Signal Process., vol. 
59, no. 7, pp. 4243–4271, Jul. 2013. vol. 22, no. 10, pp. 1580–1584, Oct. 2015. [31] M. Vehkapera,¨ Y. Kabashima, S. Chatterjee, E. Aurell, M. Skoglund, and [59] M. Opper, B. Cakmak, and O. Winther, “A theory of solving TAP L. Rasmussen, “Analysis of sparse representations using bi-orthogonal equations for Ising models with general invariant random matrices,” dictionaries,” in Proc. IEEE Inform. Theory Workshop, Sep. 3–7 2012. arXiv:1509.01229 [cond-mat.dis-nn], Sep. 2015. [32] Y. Kabashima, M. Vehkapera,¨ and S. Chatterjee, “Typical l1-recovery [60] S. Kudekar and H. D. Pfister, “The effect of spatial coupling on limit of sparse vectors represented by concatenations of random orthog- compressive sensing,” in Proc. Annual Allerton Conf. Commun., Contr., onal matrices,” J. Stat. Mech., vol. 2012, no. 12, p. P12003, 2012. Computing, Sep. 29 - Oct. 1 2010, pp. 347–353. [33] T. Tanaka and J. Raymond, “Optimal incorporation of sparsity infor- [61] D. L. Donoho, A. Javanmard, and A. Montanari, “Information- mation by weighted l1-optimization,” in Proc. IEEE Int. Symp. Inform. theoretically optimal compressed sensing via spatial coupling and ap- Theory, Jun. 2010, pp. 1598–1602. proximate message passing,” in Proc. IEEE Int. Symp. Inform. Theory, [34] Y. Kabashima, T. Wadayama, and T. Tanaka, “A typical reconstruction Jul. 1 - 6 2012, pp. 1231–1235. limit for compressed sensing based on lp-norm minimization,” J. Stat. [62] F. Krzakala, M. Mezard,´ F. Sausset, Y. Sun, and L. Zdeborova,´ Mech., vol. 2009, no. 9, p. L09003, 2009. “Probabilistic reconstruction in compressed sensing: algorithms, phase [35] M. Talagrand, Spin Glasses: A Challenge for Mathematicians, Cavity diagrams, and threshold achieving matrices,” J. Stat. Mech., vol. 2012, and Mean Field Models. Berlin Heidelberg: Springer-Verlag, 2003. no. 08, p. P08009, 2012. [36] F. Guerra and F. L. Toninelli, “Quadratic replica coupling in the [63] F. Krzakala, M. Mezard,´ F. Sausset, Y. F. Sun, and L. Zdeborova,´ Sherrington-Kirkpatrick mean field spin glass model,” J. Math. Phys., “Statistical-physics-based reconstruction in compressed sensing,” Phys. vol. 43, no. 7, pp. 3704–3716, 2002. Rev. X, vol. 2, p. 021005, May 2012. [37] ——, “The thermodynamic limit in mean field spin glass models,” [64] K. Takeda, S. Uda, and Y. Kabashima, “Analysis of CDMA systems Commun. Math. Phys., vol. 230, no. 1, pp. 71–79, 2002. that are characterized by eigenvalue spectrum,” Europhys. Lett., vol. 76, [38] F. Guerra, “Broken replica symmetry bounds in the mean field spin glass no. 6, p. 1193, 2006. model,” Commun. Math. Phys., vol. 233, no. 1, pp. 1–12, 2003. [65] Y. Kabashima, “Inference from correlated patterns: A unified theory [39] M. Talagrand, “The Parisi formula,” Annals of Math, vol. 163, no. 1, for perceptron learning and linear vector channels,” J. Phys. Conf. Ser., pp. 221–263, 2006. vol. 95, no. 1, p. 012001, 2008. [40] S. B. Korada and N. Macris, “Tight bounds on the capacity of binary [66] S. Rangan, A. K. Fletcher, P. Schniter, and U. Kamilov, “Inference input random CDMA systems,” IEEE Trans. Inf. Theory, vol. 56, no. 11, for generalized linear models via alternating directions and Bethe free pp. 5590–5613, Nov. 2010. energy minimization,” arXiv:1501.01797 [cs.IT], Jan. 2015. [41] A. Montanari, “Tight bounds for LDPC and LDGM codes under MAP [67] M. Mezard´ and A. Montanari, Information, Physics, and Computation. decoding,” IEEE Trans. Inf. Theory, vol. 51, no. 9, pp. 3221–3246, Sep. New York: Oxford University Press, 2009. 2005. [68] K. 
Sjostrand¨ and B. Ersbøll, “SpaSM: A Matlab toolbox for sparse [42] S. Kudekar and N. Macris, “Sharp bounds for optimal decoding of low- statistical modeling,” http://www2.imm.dtu.dk/projects/spasm/. density parity-check codes,” IEEE Trans. Inf. Theory, vol. 55, no. 10, [69] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, “Least angle pp. 4635–4650, Oct. 2009. regression,” Ann. Statist., vol. 32, no. 2, pp. 407–499, 04 2004. [43] T. Tanaka, “A statistical-mechanics approach to large-system analysis [70] S. A. Orszag and C. M. Bender, Advanced Mathematical Methods for of CDMA multiuser detectors,” IEEE Trans. Inform. Theory, vol. 48, Scientists and Engineers. McGraw-Hill, 1978. no. 11, pp. 2888–2910, Nov. 2002. [71] G. B. Arfken, H. J. Weber, and F. E. Harris, Mathematical Methods for [44] D. Guo and S. Verdu,´ “Randomly spread CDMA: Asymptotics via Physicists, 7th ed. Elsevier, 2013. statistical physics,” IEEE Trans. Inf. Theory, vol. 51, no. 6, pp. 1983– [72] C. Goutis and G. Casella, “Explaining the saddlepoint approximation,” 2010, Jun. 2005. The American Statistician, vol. 53, no. 3, pp. 216–224, 1999. [45] Harish-Chandra, “Differential operators on a semisimple lie algebra,” Amer. J. Math., vol. 79, no. 1, pp. 87–120, 1957. [46] C. Itzykson and J. B. Zuber, “Planar approximation 2,” J. Math. Phys., vol. 21, no. 3, pp. 411–421, 1980. [47] J. Wright and Y. Ma, “Dense error correction via `1-minimization,” IEEE Trans. Inf. Theory, vol. 56, no. 7, pp. 3540–3560, Jul. 2010. [48] M. Vehkapera,¨ Y. Kabashima, and S. Chatterjee, “Statistical mechanics approach to sparse noise denoising,” in Proc. European Sign. Proc. Conf., Sep. 9–13 2013. VEHKAPERA¨ et al.: ANALYSIS OF REGULARIZED LS RECONSTRUCTION AND RANDOM MATRIX ENSEMBLES IN COMPRESSED SENSING 25

Mikko Vehkapera¨ received the Ph.D. degree from Norwegian University of Science and Technology (NTNU), Trondheim, Norway, in 2010. Between 2010–2013 he was a post-doctoral researcher at School of Electrical Engineer- ing, and the ACCESS Linnaeus Center, KTH Royal Institute of Technology, Sweden, and 2013–2015 an Academy of Finland Postdoctoral Researcher at Aalto University School of Electrical Engineering, Finland. He is now a lec- turer (assistant professor) at University of Sheffield, Department of Electronic and Electrical Engineering, United Kingdom. He held visiting appointments at Massachusetts Institute of Technology (MIT), US, Kyoto University and Tokyo Institute of Technology, Japan, and University of Erlangen-Nuremberg, Germany. His research interests are in the field of wireless communications, information theory and signal processing. Dr. Vehkapera¨ was a co-recipient for the Best Student Paper Award at IEEE International Conference on Networks (ICON2011) and IEEE Sweden Joint VT-COM-IT Chapter Best Student Conference Paper Award 2015.

Yoshiyuki Kabashima received the B.Sci., M.Sci., and Ph.D. degrees in physics from Kyoto University, Japan, in 1989, 1991, and 1994, respectively. From 1993 until 1996, he was with the Department of Physics, Nara Women’s University, Japan. In 1996, he moved to the Department of Computational Intelligence and Systems Science, Tokyo Institute of Technology, Japan, where he is currently a professor. His research interests include statistical mechanics, information theory, and machine learning. Dr. Kabashima received the 14th Japan IBM Science Prize in 2000, the Young Scientist Award from the Ministry of Education, Culture, Sports, Science, and Technology, Japan, in 2006, and the 11th Ryogo Kubo Memorial Prize in 2007.

Saikat Chatterjee is an assistant professor and docent in the Dept of Communication Theory, KTH-Royal Institute of Technology, Sweden. He is also part of Dept of Signal Processing, KTH. Before moving to Sweden, he received Ph.D. degree in 2009 from Indian Institute of Science, India. He has published more than 80 papers in international journals and conferences. He was a co-author of the paper that won the best student paper award at ICASSP 2010. His current research interests are signal processing, machine learning, coding, speech and audio processing, and computational biology.