Asymptotic Analysis of Complex LASSO via Complex Approximate Message Passing (CAMP)

Arian Maleki, Laura Anitori, Zai Yang, and Richard Baraniuk, Fellow, IEEE

Abstract—Recovering a sparse signal from an undersampled set of random linear measurements is the main problem of interest in compressed sensing. In this paper, we consider the case where both the signal and the measurements are complex-valued. We study the popular recovery method of ℓ1-regularized least squares or LASSO. While several studies have shown that LASSO provides desirable solutions under certain conditions, the precise asymptotic performance of this algorithm in the complex setting is not yet known. In this paper, we extend the approximate message passing (AMP) algorithm to solve the complex-valued LASSO problem and obtain the complex approximate message passing algorithm (CAMP). We then generalize the state evolution framework recently introduced for the analysis of AMP to the complex setting. Using the state evolution, we derive accurate formulas for the phase transition and noise sensitivity of both LASSO and CAMP. Our theoretical results are concerned with the case of i.i.d. Gaussian sensing matrices. Simulations confirm that our results hold for a larger class of random matrices.

Index Terms—compressed sensing, complex-valued LASSO, approximate message passing, minimax analysis.

Arian Maleki is with the Department of Statistics, Columbia University, New York, NY (e-mail: [email protected]). Laura Anitori is with TNO, The Hague, The Netherlands (e-mail: [email protected]). Zai Yang is with EXQUISITUS, Center for E-City, School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore (e-mail: [email protected]). Richard Baraniuk is with the Department of Electrical and Computer Engineering, Rice University, Houston, TX (e-mail: [email protected]). Manuscript received August 1, 2011; revised July 1, 2012.

I. INTRODUCTION

Recovering a sparse signal from an undersampled set of random linear measurements is the main problem of interest in compressed sensing (CS). In the past few years many algorithms have been proposed for signal recovery, and their performance has been analyzed both analytically and empirically [1]–[6]. However, whereas most of the theoretical work has focused on the case of real-valued signals and measurements, in many applications, such as magnetic resonance imaging and radar, the signals are more easily representable in the complex domain [7]–[10]. In such applications, the real and imaginary components of a complex signal are often either zero or non-zero simultaneously. Therefore, recovery algorithms may benefit from this prior knowledge. Indeed, the results presented in this paper confirm this intuition.

Motivated by this observation, we investigate the performance of the complex-valued LASSO in the case of noise-free and noisy measurements. The derivations are based on the state evolution (SE) framework, presented previously in [3]. Also, a new algorithm, complex approximate message passing (CAMP), is presented to solve the complex LASSO problem. This algorithm is an extension of the AMP algorithm [3], [11]. However, the extension of AMP and its analysis from the real to the complex setting is not trivial; although CAMP shares some interesting features with AMP, it is substantially more challenging to establish the characteristics of CAMP. Furthermore, some important features of CAMP are specific to complex-valued signals and the relevant optimization problem. Note that the extension of the Bayesian-AMP algorithm to complex-valued signals has been considered elsewhere [12], [13] and is not the main focus of this work.

In the next section, we briefly review some of the existing algorithms for sparse signal recovery in the real-valued setting and then focus on recovery algorithms for the complex case, with particular attention to the AMP and CAMP algorithms. We then introduce two criteria which we use as measures of performance for various algorithms in noiseless and noisy settings. Based on these criteria, we establish the novelty of our results compared to the existing work. An overview of the organization of the rest of the paper is provided in Section I-G.

A. Real-valued sparse recovery algorithms

Consider the problem of recovering a sparse vector s_o ∈ R^N from a noisy undersampled set of linear measurements y ∈ R^n, where y = A s_o + w and w is the noise. Let k denote the number of nonzero elements of s_o. The measurement matrix A has i.i.d. elements from a given distribution on R. Given y and A, we seek an approximation to s_o.

Many recovery algorithms have been proposed, ranging from convex relaxation techniques to greedy approaches to iterative thresholding schemes. See [1] and the references therein for an exhaustive list of algorithms. [6] has compared several different recovery algorithms and concluded that, among the algorithms compared in that paper, ℓ1-regularized least squares, a.k.a. LASSO or BPDN [2], [14], which seeks the minimizer of min_x (1/2)||y − Ax||²_2 + λ||x||_1, provides the best performance in the sense of the sparsity/measurement tradeoff. Recently, several iterative thresholding algorithms have been proposed for solving LASSO using few computations per iteration; this enables the use of the LASSO in high-dimensional problems. See [15] and the references therein for an exhaustive list of these algorithms.

In this paper, we are particularly interested in AMP [3]. Starting from x^0 = 0 and z^0 = y, AMP uses the following iterations:

x^{t+1} = η_◦(x^t + A^T z^t; τ_t),
z^t = y − A x^t + (|I^t|/n) z^{t−1},

where η_◦(x; τ) = (|x| − τ)_+ sign(x) is the soft thresholding function, τ_t is the threshold parameter, and I^t is the active set of x^t, i.e., I^t = {i | x^t_i ≠ 0}. The notation |I^t| denotes the cardinality of I^t. As we will describe later, the strong connection between AMP and LASSO and the ease of predicting the performance of AMP have led to an accurate performance analysis of LASSO [11], [16].

B. Complex-valued sparse recovery algorithms

Consider the complex setting, where the signal s_o, the measurements y, and the matrix A are complex-valued. The success of LASSO has motivated researchers to use similar techniques in this setting as well. We consider the following two schemes that have been used in the signal processing literature:

• r-LASSO: The simplest extension of the LASSO to the complex setting is to consider the complex signal and measurements as a 2N-dimensional real-valued signal and 2n-dimensional real-valued measurements, respectively. Let the superscripts R and I denote the real and imaginary parts of a complex number. Define ỹ ≜ [(y^R)^T, (y^I)^T]^T and s̃_o ≜ [(s^R_o)^T, (s^I_o)^T]^T, where the superscript T denotes the transpose operator. We have

ỹ = [ A^R  −A^I ; A^I  A^R ] s̃_o ≜ Ã s̃_o.

We then search for an approximation of s̃_o by solving arg min_x̃ (1/2)||ỹ − Ã x̃||²_2 + λ||x̃||_1 [17], [18]. We call this algorithm r-LASSO. The limit of the solution as λ → 0 is

arg min_x̃ ||x̃||_1   s.t.   ỹ = Ã x̃,

which is called the basis pursuit problem, or r-BP in this paper. It is straightforward to extend the analyses of LASSO and BP for real-valued signals to r-LASSO and r-BP.¹ r-LASSO ignores the information about any potential grouping of the real and imaginary parts. But in many applications the real and imaginary components tend to be either zero or non-zero simultaneously. Considering this extra information in the recovery stage may improve the overall performance of a CS system.

• c-LASSO: Another natural extension of the LASSO to the complex setting is the following optimization problem, which we term c-LASSO:

min_x (1/2)||y − Ax||²_2 + λ||x||_1,

where the complex ℓ1-norm is defined as ||x||_1 ≜ Σ_i |x_i| = Σ_i √((x^R_i)² + (x^I_i)²) [4], [5], [21]–[23]. The limit of the solution as λ → 0 is

arg min_x ||x||_1   s.t.   y = Ax,

which we refer to as c-BP.

An important question we address in this paper is: can we measure how much the grouping of the real and the imaginary parts improves the performance of c-LASSO compared to r-LASSO? Several papers have considered similar problems [24]–[41] and have provided guarantees on the performance of c-LASSO. However, the results are usually inconclusive because of the loose constants involved in the analyses. This paper addresses the above questions with an analysis that does not involve any loose constants and therefore provides accurate comparisons.

Motivated by the recent results in the asymptotic analysis of the LASSO [3], [11], we first derive the complex approximate message passing algorithm (CAMP) as a fast and efficient algorithm for solving the c-LASSO problem. We then extend the state evolution (SE) framework introduced in [3] to predict the performance of the CAMP algorithm in the asymptotic setting. Since the CAMP algorithm solves c-LASSO, such predictions are accurate for c-LASSO as well for N → ∞. The analysis carried out in this paper provides new information and insight on the performance of the c-LASSO that was not known before, such as the least favorable distribution and the noise sensitivity of c-LASSO and CAMP. A more detailed description of the contributions of this paper is summarized in Section I-E.

C. Notation

Let |α|, ∠α, α*, R(α), and I(α) denote the amplitude, phase, conjugate, real part, and imaginary part of α ∈ C, respectively. Furthermore, for the matrix A ∈ C^{n×N}, A*, A_ℓ, and A_{ℓj} denote the conjugate transpose, ℓth column, and ℓjth element of matrix A. We are interested in approximating a sparse signal s_o ∈ C^N from an undersampled set of noisy linear measurements y = A s_o + w. A ∈ C^{n×N} has i.i.d. random elements (with independent real and imaginary parts) from a given distribution that satisfies E A_{ℓj} = 0 and E|A_{ℓj}|² = 1/n, and w ∈ C^n is the measurement noise. Throughout the paper, we assume that the noise is i.i.d. CN(0, σ²), where CN stands for the complex normal distribution.

We are interested in the asymptotic setting where δ = n/N and ρ = k/n are fixed, while N → ∞. We further assume that the elements of s_o are i.i.d. s_{o,i} ∼ (1 − ρδ)δ_0(|s_{o,i}|) + ρδ G(s_{o,i}), where G is an unknown probability distribution with no point mass at 0, and δ_0 is a Dirac delta function.² Clearly, the expected number of non-zero elements in the vector s_o is ρδN. We call this value the sparsity level of the signal. In this model, we are assuming that all the non-zero real and imaginary coefficients are paired. This quantifies the maximum amount of improvement the c-LASSO gains by grouping the real and imaginary parts.

¹The asymptotic theoretical results on LASSO and BP consider i.i.d. Gaussian measurement matrices [19]. However, it has been conjectured that the results are universal and hold for a "larger" class of random matrices [11], [20].

²This assumption is not necessary, and as long as the marginal distribution of s_o converges to a given distribution the statements of this paper hold. For further information on this, see [11] and [16].
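The signal and measurement model above is easy to instantiate numerically. The following sketch (Python/numpy; the choice of G as a unit-amplitude, uniform-phase distribution and all parameter values are illustrative assumptions, not taken from the paper) draws one problem instance (y, A, s_o):

```python
import numpy as np

def sample_instance(N=1000, delta=0.25, rho=0.2, sigma=0.0, seed=0):
    """Draw one instance (y, A, s_o) of the model above: A has i.i.d. CN(0, 1/n)
    entries, each s_o,i is zero w.p. 1 - rho*delta and otherwise drawn from G
    (here: unit amplitude with uniform phase, an illustrative choice)."""
    rng = np.random.default_rng(seed)
    n = int(delta * N)
    A = (rng.standard_normal((n, N)) + 1j * rng.standard_normal((n, N))) / np.sqrt(2 * n)
    s = np.zeros(N, dtype=complex)
    nz = rng.random(N) < rho * delta                   # expected rho*delta*N non-zeros
    s[nz] = np.exp(2j * np.pi * rng.random(nz.sum()))  # illustrative G
    w = sigma * (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2)  # CN(0, sigma^2)
    return A @ s + w, A, s
```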

We use the notations E, E_X, and E_{X∼F} for expected value, conditional expected value given the random variable X, and expected value with respect to a random variable X drawn from the distribution F, respectively. Define F_{ε,γ} as the family of distributions F with E_{X∼F}(I(|X| = 0)) ≥ 1 − ε and E_{X∼F}(|X|²) ≤ εγ², where I denotes the indicator function. An important distribution in this class is q_o(X) ≜ q(|X|) ≜ (1 − ε)δ_0(|X|) + εδ_γ(|X|), where δ_γ(|X|) ≜ δ_0(|X| − γ). Note that this distribution is independent of the phase and, in addition to a point mass at zero, has another point mass at γ. Finally, define F_ε ≜ {F | E_{X∼F}(I(|X| ≠ 0)) ≤ ε}.

Fig. 1. Comparison of the phase transition curves of r-BP and c-BP. When all the non-zero real and imaginary parts of the signal are grouped, the phase transition of c-BP outperforms that of r-BP.
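For reference, the least favorable three-point distribution q_o defined above can be sampled as follows (a minimal numpy sketch; the uniform phase is an arbitrary choice, since the quantities studied below do not depend on the phase):

```python
import numpy as np

def sample_qo(size, eps, gamma, seed=0):
    """Sample from q_o: |X| = 0 w.p. 1 - eps and |X| = gamma w.p. eps.
    The phase is drawn uniformly only for concreteness."""
    rng = np.random.default_rng(seed)
    amp = gamma * (rng.random(size) < eps)
    return amp * np.exp(2j * np.pi * rng.random(size))
```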

D. Performance criteria

We compare c-LASSO with r-LASSO in both the noise-free and noisy measurements cases. For each scenario, we define a specific measure to compare the performance of the two algorithms.

1) Noise-free measurements: Consider the problem of recovering s_o drawn i.i.d. from s_{o,i} ∼ (1 − ρδ)δ_0(|s_{o,i}|) + ρδ G(s_{o,i}), from a set of noise-free measurements y = A s_o. Let A_α be a sparse recovery algorithm with free parameter α. For instance, A may be the c-LASSO algorithm, and the free parameter of the algorithm is the regularization argument λ. Given (y, A), A_α returns an estimate x̂^{A_α} of s_o. Suppose that in the noise-free case, as N → ∞, the performance of A_α exhibits a sharp phase transition, i.e., for every value of δ, there exists ρ^{A_α}(δ), below which lim_{N→∞} ||x̂^{A_α} − s_o||²/N → 0 almost surely, while for ρ > ρ^{A_α}(δ), A_α fails and lim_{N→∞} ||x̂^{A_α} − s_o||²/N ↛ 0. The phase transition has been studied both empirically and theoretically for many sparse recovery algorithms [6], [19], [20], [42]–[45]. The phase transition curve ρ^{A_α}(δ) specifies the fundamental exact recovery limit of algorithm A_α.

The free parameter α can strongly affect the performance of the sparse recovery algorithm [6]. Therefore, optimal tuning of this parameter is essential in practical applications. One approach is to tune the parameter for the highest phase transition [6],³ i.e.,

ρ^A(δ) ≜ sup_α ρ^{A_α}(δ).

In other words, ρ^A is the best performance A_α provides in the exact sparse signal recovery problem, if we know how to tune the algorithm properly. Based on this framework, we say algorithm A outperforms B at a given δ if and only if ρ^A(δ) > ρ^B(δ).

2) Noisy measurements: Consider the problem of recovering s_o distributed i.i.d. according to s_{o,i} ∼ (1 − ρδ)δ_0(|s_{o,i}|) + ρδ G(s_{o,i}), from a set of noisy linear observations y = A s_o + w, where w_i ∼ i.i.d. CN(0, σ²). In the presence of measurement noise exact recovery is not possible. Therefore, tuning the parameter for the highest phase transition curve does not necessarily provide the optimal performance. In this section, we explain the optimal noise sensitivity tuning introduced in [11]. Consider the ℓ2-norm as a measure for the reconstruction error and assume that ||x̂^{A_α} − s_o||²_2 / N → MSE(ρ, δ, α, σ, G) almost surely. Define the noise sensitivity of the algorithm A_α as

NS(ρ, δ, α) ≜ sup_{σ>0} sup_G MSE(ρ, δ, α, σ, G) / σ²,   (1)

where α denotes the tuning parameter of the algorithm A_α. If the noise sensitivity is large, then the measurement noise may severely degrade the final reconstruction. In (1) we search for the distribution that induces the maximum reconstruction error for the algorithm. This ensures that for other signal distributions the reconstruction error is smaller. By tuning α, we may obtain a better estimate of s_o. Therefore, we tune the parameter α to obtain the lowest noise sensitivity, i.e.,

NS(ρ, δ) ≜ inf_α NS(ρ, δ, α).

Based on this framework, we say that algorithm A outperforms B at a given δ and ρ if and only if NS^A(δ, ρ) < NS^B(δ, ρ).

³In this paper, we consider algorithms whose phase transitions do not depend on the distribution G of non-zero coefficients. Otherwise, one could use the maximin framework introduced in [6].

E. Contributions

In this paper, we first develop the complex approximate message passing (CAMP) algorithm, a simple and fast converging iterative method for solving c-LASSO. We extend the state evolution (SE), introduced recently as a framework for accurate asymptotic predictions of the AMP performance, to CAMP.⁴ We will then use the connection between CAMP and c-LASSO to provide an accurate asymptotic analysis of the c-LASSO problem. We aim to characterize the phase transition curve (noise-free measurements) and noise sensitivity (noisy measurements) of c-LASSO and CAMP when the real and imaginary parts are paired, i.e., they are both zero or non-zero simultaneously. Both criteria have been extensively studied for real signals (and hence for the r-LASSO) [3], [11]. The results of our predictions are summarized in Figures 1, 2, and 3. Figure 1 compares the phase transition curve of c-BP and CAMP with the phase transition curve of r-BP. As we expected, c-BP outperforms r-BP since it exploits the connection between the real and imaginary parts. If ρ_SE(δ) denotes the phase transition curve, then we also prove that ρ_SE(δ) ∼ 1/log(1/(2δ)) as δ → 0. Comparing this with ρ^R_SE(δ) ∼ 1/(2 log(1/δ)) for the r-LASSO [19], we conclude that

lim_{δ→0} ρ_SE(δ) / ρ^R_SE(δ) = 2.

This means that, in the very high undersampling regime, the c-LASSO can recover signals that are two times more dense than the signals that are recovered by r-LASSO. Figure 2 exhibits the noise sensitivity of c-LASSO and CAMP. We prove in Section III-C that, as the sparsity approaches the phase transition curve, the noise sensitivity grows up to infinity. Finally, Figure 3 compares the contour plots of the noise sensitivity of c-LASSO with those of the r-LASSO. For a fixed value of noise sensitivity, the level set of the c-LASSO is higher than that of r-LASSO. It is worth noting that the same comparisons hold between CAMP and AMP, as we will clarify in Section III-D.

⁴Note that SE has been proved to be accurate only for the case of Gaussian measurement matrices [16], [46]. But extensive simulations have confirmed its accuracy for a large class of random measurement matrices [3], [11]. The results of our paper are also provably correct for complex Gaussian measurement matrices. But our simulations confirm that they hold for a broader set of matrices.

Fig. 2. Contour lines of noise sensitivity in the (δ, ρ) plane. The black curve is the phase transition curve at which the noise sensitivity is infinite. The colored lines display the level sets of NS(ρ, δ) = 0.125, 0.25, 0.5, 1, 2, 4, 8.

Fig. 3. Comparison of the noise sensitivity of r-LASSO with the noise sensitivity of c-LASSO. The colored solid lines present the level sets of NS(ρ, δ) = 0.125, 0.5, 2 for the c-LASSO, and the colored dotted lines display the same level sets for the r-LASSO.

F. Related work

The state evolution framework used in this paper was first introduced in [3]. Deriving the phase transition and noise sensitivity of the LASSO for real-valued signals and real-valued measurements from SE is due to [11]; see [47] for a more comprehensive discussion. Finally, the derivation of AMP from the full sum-product message passing is due to [48]. Our main contribution in this paper is to extend these results to the complex setting. Not only is the analysis of the state evolution more challenging in this setting, but it also provides new insights on the performance of c-LASSO that have not been available. For instance, the noise sensitivity of c-LASSO has not previously been determined.

The recovery of sparse complex signals is a special case of group-sparsity or block-sparsity, where all the groups are non-overlapping and have size 2. According to the group sparsity assumption, the non-zero elements of the signal tend to occur in groups or clusters. One of the algorithms used in this context is the group-LASSO [35], [37]. Consider a signal s_o ∈ R^N. Partition the indices of s_o into m groups g_1, . . . , g_m. The group-LASSO algorithm minimizes the following cost function:

min_x (1/2)||y − Ax||²_2 + Σ_{i=1}^m λ_i ||x_{g_i}||_2,   (2)

where the λ_i's are regularization parameters.

The group-LASSO algorithm has been extensively studied in the literature [24]–[41]. We briefly review several papers and emphasize the differences from our work. [38] analyzes the consistency of the group LASSO estimator in the presence of noise. Fixing the signal s_o, it provides conditions under which the group LASSO is consistent as n → ∞. [39], [49] consider a weak notion of consistency, i.e., exact support recovery. However, [49] proves that in the setting we are interested in, i.e., k/n = ρ and n/N = δ, even exact support recovery is not possible. When noise is present, our goal is neither exact recovery nor exact support recovery. Instead, we characterize the mean square error (MSE) of the reconstruction. This criterion has been considered in [24], [40]. Although the results of [24], [40] show qualitatively the benefit of group sparsity, they do not characterize the difference quantitatively. In fact, loose constants in both the error bound and the number of samples do not permit accurate performance comparison. In our analysis, no loose constant is involved, and we provide a very accurate characterization of the mean square error.

Group-sparsity and group-LASSO are also of interest in the sparse recovery community. For example, the analyses carried out in [26], [29], [30] are based on "coherence". These results provide sufficient conditions with, again, loose constants as discussed above. The work of [31]–[33] addresses this issue by an accurate analysis of the algorithm in the noiseless setting σ = 0. They provide a very accurate estimate of the phase transition curve for the group-LASSO. However, SE provides a more flexible framework to analyze c-LASSO than the analysis of [33], and it provides more information than just the phase transition curve. For instance, it points to the least favorable distribution of the input and the noise sensitivity of c-LASSO.

The Bayesian approach that assumes a hidden Markov model for the signal has also been explored for the recovery of group sparse signals [50], [51]. It has been shown that AMP combined with an expectation maximization algorithm (for estimating the parameters of the distribution) leads to promising results in practice [12]. Kamilov et al. [52] have taken the first step towards a theoretical understanding of such algorithms. However, a complete understanding of the expectation maximization employed in such methods is not available yet. Furthermore, the success of such algorithms seems to be dependent on the match between the assumed and actual prior distributions. Such dependencies have not been theoretically analyzed yet. In this paper we assume that the distribution of non-zero coefficients is not known beforehand and characterize the performance of c-LASSO for the least favorable distribution.

While writing this paper we were made aware that, in an independent work, Donoho, Johnstone, and Montanari are extending the SE framework to the general setting of group sparsity [53]. Their work considers the state evolution framework for the group-LASSO problem and will include the generalization of the analysis provided in this paper to the case where the variables tend to cluster in groups of size B.

Both complex signals and group-sparse signals are special cases of model-based CS [54]. By introducing more structured models for the signal, [54] proves that the number of measurements needed is proportional to the "complexity" of the model rather than the sparsity level [55]. The results in model-based CS also suffer from loose constants in both the number of measurements and the mean square error bounds.

Finally, from an algorithmic point of view, several papers have considered solving the c-LASSO problem using first-order algorithms [4], [21].⁵ The deterministic framework, which measures the convergence of an algorithm on the problem instance that yields the slowest convergence rate, is not an appropriate measure of the convergence rate for compressed sensing problems [15]. Therefore, [15] considers the average convergence rate for iterative algorithms. In that setting, AMP is the only first-order algorithm that provably achieves linear convergence to date. Similarly, the CAMP algorithm, introduced in this paper, provides the first first-order c-LASSO solver with a linear average convergence rate.

⁵First-order methods are iterative algorithms that use either the gradient or the subgradient of the function at the previous iterations to update their estimates.

G. Organization of the paper

We introduce the CAMP algorithm in Section II. We then explain the state evolution equations that characterize the evolution of the mean square error through the iterations of the CAMP algorithm in Section III, and we analyze the important properties of the SE equations. We then discuss the connection between our calculations and the solution of LASSO in Section III-D. We confirm our results via Monte Carlo simulations in Section IV.

II. COMPLEX APPROXIMATE MESSAGE PASSING

The high computational complexity of interior point methods for solving large-scale convex optimization problems has spurred the development of first-order methods for solving the LASSO problem. See [15] and the references therein for a description of some of these algorithms. One of the most successful algorithms for CS problems is the AMP algorithm introduced in [3]. In this section, we use the approach introduced in [48] to derive the approximate message passing algorithm for the c-LASSO problem, which we term Complex Approximate Message Passing (CAMP).

Let s_1, s_2, . . . , s_N be N random variables with the following distribution:

p(s_1, s_2, . . . , s_N) = (1/Z(β)) exp( −βλ||s||_1 − (β/2)||y − As||²_2 ),   (3)

where β is a constant and Z(β) ≜ ∫_s exp( −βλ||s||_1 − (β/2)||y − As||²_2 ) ds. As β → ∞, the mass of this distribution concentrates around the solution of the LASSO. Therefore, one way to find the solution of LASSO is to marginalize this distribution. However, calculating the marginal distribution is an NP-complete problem. The sum-product message passing algorithm provides a successful heuristic for approximating the marginal distribution. As N → ∞ and β → ∞, the iterations of the sum-product message passing algorithm are simplified to ([48] or Chapter 5 of [47])

x^{t+1}_{ℓ→a} = η( Σ_{b≠a} A*_{bℓ} z^t_{b→ℓ} ; τ_t ),
z^t_{a→ℓ} = y_a − Σ_{j≠ℓ} A_{aj} x^t_{j→a},   (4)

where η(u + iv; λ) ≜ ( u + iv − λ(u + iv)/√(u² + v²) ) I{u² + v² > λ²} is the proximity operator of the complex ℓ1-norm and is called complex soft thresholding. See Appendix V-A for further information regarding this function. τ_t is the threshold parameter at time t. The choice of this parameter will be discussed in Section III-A. The per-iteration computational complexity of this algorithm is high, since 2nN messages x^t_{ℓ→a} and z^t_{a→ℓ} are updated. Therefore, following [48] we assume that there exist ∆x^t_{ℓ→a}, ∆z^t_{a→ℓ} = O(1/√N) such that

x^t_{ℓ→a} = x^t_ℓ + ∆x^t_{ℓ→a} + O(1/N),
z^t_{a→ℓ} = z^t_a + ∆z^t_{a→ℓ} + O(1/N).   (5)

Here, the O(·) errors are uniform in the choice of the edges ℓ → a and a → ℓ. In other words, we assume that x^t_{ℓ→a} is independent of a and z^t_{a→ℓ} is independent of ℓ, except for an error of order 1/√N. For further discussion of this assumption and its validation, see [48] or Chapter 5 of [47].

Let η^I and η^R be the imaginary and real parts of the complex soft thresholding function. Furthermore, define ∂η^R/∂x and ∂η^R/∂y as the partial derivatives of η^R with respect to the real and imaginary parts of the input, respectively. ∂η^I/∂x and ∂η^I/∂y are defined similarly. The following theorem shows how one can simplify the message passing as N → ∞.

Proposition II.1. Suppose that (5) holds for every iteration of the message passing algorithm specified in (4). Then x^t_ℓ and z^t_a satisfy the following equations:

x^{t+1}_ℓ = η( x^t_ℓ + Σ_b A*_{bℓ} z^t_b ; τ_t ),

z^{t+1}_a = y_a − Σ_j A_{aj} x^{t+1}_j
  − Σ_j A_{aj} (∂η^R/∂x)( x^t_j + Σ_b A*_{bj} z^t_b ) R(A*_{aj} z^t_a)
  − Σ_j A_{aj} (∂η^R/∂y)( x^t_j + Σ_b A*_{bj} z^t_b ) I(A*_{aj} z^t_a)
  − i Σ_j A_{aj} (∂η^I/∂x)( x^t_j + Σ_b A*_{bj} z^t_b ) R(A*_{aj} z^t_a)
  − i Σ_j A_{aj} (∂η^I/∂y)( x^t_j + Σ_b A*_{bj} z^t_b ) I(A*_{aj} z^t_a).   (6)

See Appendix V-B for the proof. According to Proposition II.1 and (5), for large values of N, the messages x^t_{ℓ→a} and z^t_{a→ℓ} are close to x^t_ℓ and z^t_a in (6). Therefore, we define the CAMP algorithm as the iterative method that starts from x^0 = 0 and z^0 = y and uses the iterations specified in (6).

It is important to note that Proposition II.1 does not provide any information on either the performance of the CAMP algorithm or the connection between CAMP and c-LASSO, since message passing is a heuristic algorithm and does not necessarily converge to the correct marginal distribution of (3).

III. FORMAL ANALYSIS OF CAMP AND C-LASSO

In this section, we explain the state evolution (SE) framework that predicts the performance of CAMP and c-LASSO in the asymptotic setting. We then use this framework to analyze the phase transition and noise sensitivity of CAMP and c-LASSO. The formal connection between state evolution and CAMP/c-LASSO is discussed in Section III-D.

A. State evolution

We now conduct an asymptotic analysis of the CAMP algorithm. As we confirm in Section III-D, the asymptotic performance of the algorithm is tracked through a few variables, called the state variables. The state of the algorithm is the 5-tuple s = (m; δ, ρ, σ, G), where G corresponds to the distribution of the non-zero elements of the sparse vector s_o, σ is the standard deviation of the measurement noise, and m is the asymptotic normalized mean square error. The threshold parameter (threshold policy) of CAMP in its most general form could be a function of the state of the algorithm, τ(s). Define npi(m; σ, δ) ≜ σ² + m/δ. The mean square error (MSE) map is defined as

Ψ(s, τ(s)) ≜ E | η( X + √npi(m, σ, δ) Z_1 + i √npi(m, σ, δ) Z_2 ; τ(s) ) − X |²,

where Z_1, Z_2 ∼ N(0, 1/2) and X ∼ (1 − ρδ)δ_0(|x|) + ρδ G(x) are independent random variables. Note that G is a probability distribution on C. In the rest of this paper, we consider the thresholding policy τ(s) = τ √npi(m, σ, δ), where the constant τ is yet to be tuned according to the schemes introduced in Sections I-D1 and I-D2. When we use this thresholding policy we may equivalently write Ψ(s, τ(s)) as Ψ(s, τ). This thresholding policy is the same as the thresholding policy introduced in [3], [11]. When the parameters δ, ρ, σ, τ, and G are clear from the context, we denote the MSE map by Ψ(m). SE is the evolution of m (starting from t = 0 and m_0 = E(|X|²)) by the rule

m_{t+1} = Ψ(m_t) ≜ E | η( X + √npi_t Z_1 + i √npi_t Z_2 ; τ √npi_t ) − X |²,   (7)

where npi_t ≜ npi(m_t, σ, δ). As will be described in Section III-D, this equation tracks the normalized MSE of the CAMP algorithm in the asymptotic setting n, N → ∞ and n/N → δ. In other words, if m_t is the MSE of the CAMP algorithm at iteration t, then m_{t+1}, calculated by (7), is the MSE of CAMP at iteration t + 1.

Definition III.1. Let Ψ be almost everywhere differentiable. m* is called a fixed point of Ψ if and only if Ψ(m*) = m*. Furthermore, a fixed point is called stable if dΨ(m)/dm |_{m=m*} < 1, and unstable if dΨ(m)/dm |_{m=m*} > 1.
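The complex soft thresholding operator η defined above is straightforward to implement. A minimal numpy sketch (vectorized over a complex array; the function name is ours):

```python
import numpy as np

def ceta(x, tau):
    """Complex soft thresholding eta(x; tau): shrink each amplitude by tau,
    keep the phase, and zero out entries with |x| <= tau."""
    amp = np.abs(x)
    return x * np.maximum(1.0 - tau / np.maximum(amp, np.finfo(float).tiny), 0.0)
```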

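The MSE map Ψ and the SE recursion (7) can be evaluated by plain Monte Carlo integration. The sketch below (numpy; the sample size and the unit-amplitude prior for the non-zero entries are illustrative assumptions) iterates (7) under the thresholding policy τ(s) = τ·√npi:

```python
import numpy as np

def ceta(x, tau):
    amp = np.abs(x)
    return x * np.maximum(1.0 - tau / np.maximum(amp, np.finfo(float).tiny), 0.0)

def se_recursion(delta, rho, sigma, tau, iters=50, nsamp=200_000, seed=0):
    """Iterate m_{t+1} = Psi(m_t) of (7) by Monte Carlo, starting from m_0 = E|X|^2."""
    rng = np.random.default_rng(seed)
    # Illustrative prior: X = 0 w.p. 1 - rho*delta, otherwise unit amplitude with uniform phase.
    X = np.where(rng.random(nsamp) < rho * delta,
                 np.exp(2j * np.pi * rng.random(nsamp)), 0.0)
    Z = (rng.standard_normal(nsamp) + 1j * rng.standard_normal(nsamp)) / np.sqrt(2)  # Z1 + iZ2, Zk ~ N(0,1/2)
    m = np.mean(np.abs(X) ** 2)
    history = [m]
    for _ in range(iters):
        npi = sigma ** 2 + m / delta
        m = np.mean(np.abs(ceta(X + np.sqrt(npi) * Z, tau * np.sqrt(npi)) - X) ** 2)
        history.append(m)
    return history
```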
It is clear that if m* is the unique stable fixed point of the Ψ function, then m_t → m* as t → ∞. Also, if all the fixed points of Ψ are unstable, then m_t → ∞ as t → ∞. Define µ ≜ |X| and θ ≜ ∠X. Let G(µ, θ) denote the probability density function of X and define G(µ) ≜ ∫ G(µ, θ) dθ as the marginal distribution of µ. The next lemma shows that in order to analyze the state evolution function we only need to consider the amplitude distribution. This substantially simplifies our analysis of SE in the next sections.

Lemma III.2. The MSE map does not depend on the phase distribution of the input signal, i.e.,

Ψ(m, δ, ρ, σ, G(µ, θ), τ) = Ψ(m, δ, ρ, σ, G(µ), τ).

See Appendix V-C for the proof.

B. Noise-free signal recovery

Consider the noise-free setting with σ = 0. Suppose that SE predicts the MSE of CAMP in the asymptotic setting (we will make this rigorous in Section III-D). As mentioned in Section I-D1, in order to characterize the performance of CAMP in the noiseless setting, we first derive its phase transition curve and then optimize over τ to obtain the highest phase transition CAMP can achieve. Fix all the state variables except for m and ρ. The evolution of m discriminates the following two regions for ρ:

Region I: The values of ρ for which Ψ(m) < m for every m > 0;
Region II: The complement of Region I.

Since 0 is necessarily a fixed point of the Ψ function, in Region I m_t → 0 as t → ∞. The following lemma shows that in Region II m = 0 is an unstable fixed point and therefore, starting from m_0 ≠ 0, m_t ↛ 0.

Lemma III.3. Let σ = 0. If ρ is in Region II, then Ψ has an unstable fixed point at zero.

Proof: We prove in Lemma V.2 that Ψ(m) is a concave function of m. Therefore, ρ is in Region II if and only if dΨ(m)/dm |_{m=0} > 1. This in turn indicates that 0 is an unstable fixed point.

It is also easy to confirm that Region I is of the form [0, ρ_SE(δ, G, τ)). As we will see in Section III-D, ρ_SE(δ, G, τ) determines the phase transition curve of the CAMP algorithm. According to Lemma III.2, the MSE map does not depend on the phase distribution of the non-zero elements. The following proposition shows that in fact ρ_SE is independent of G even though the Ψ function depends on G(µ).

Proposition III.4. ρ_SE(δ, G, τ) is independent of the distribution G.

Proof: According to Lemma V.2 in Appendix V-D, Ψ is concave. Therefore, it has a stable fixed point at zero if and only if its derivative at zero is less than 1. It is also straightforward (from Appendix V-D) to show that

dΨ/dm |_{m=0} = ρδ(1 + τ²)/δ + ((1 − ρδ)/δ) E|η(Z_1 + iZ_2; τ)|².

Setting this derivative to 1, it is clear that the phase transition value of ρ is independent of G.

According to Proposition III.4, the only parameters that affect ρ_SE are δ and the free parameter τ. Fixing δ, we tune τ such that the algorithm achieves its highest phase transition for a certain number of measurements, i.e.,

ρ_SE(δ) ≜ sup_τ ρ_SE(δ; τ).

Using SE we can calculate the optimal value of τ and ρ_SE(δ).

Theorem III.5. ρ_SE(δ) and δ satisfy the following implicit relations:

ρ_SE(δ) = χ_1(τ) / ( (1 + τ²)χ_1(τ) − τχ_2(τ) ),
δ = ( 4(1 + τ²)χ_1(τ) − 4τχ_2(τ) ) / ( −2τ + 4χ_2(τ) ),

for τ ∈ [0, ∞). Here, χ_1(τ) ≜ ∫_{ω≥τ} ω(τ − ω) e^{−ω²} dω and χ_2(τ) ≜ ∫_{ω>τ} ω(ω − τ)² e^{−ω²} dω.

See Appendix V-D for the proof. Figure 1 displays this phase transition curve derived from the SE framework and compares it with the phase transition of the r-BP algorithm. As will be described later, ρ_SE(δ) corresponds to the phase transition of c-LASSO. Hence the difference between ρ_SE(δ) and the phase transition curve of r-LASSO is the benefit of grouping the real and imaginary parts.

It is also interesting to compare ρ_SE(δ) (which, as we see later, predicts the performance of c-LASSO) with the phase transition of r-LASSO in the high undersampling regime δ → 0. The implicit formulation above enables us to calculate the asymptotic behavior of the phase transition as δ → 0.

Theorem III.6. ρ_SE(δ) follows the asymptotic behavior

ρ_SE(δ) ∼ 1 / log(1/(2δ)),   as δ → 0.

See Appendix V-E for the proof. As mentioned above, this theorem shows that as δ → 0 the phase transition of c-BP and CAMP is two times that of the r-LASSO, which is given by ρ^R_SE ∼ 1/(2 log(1/δ)) [19]. This improvement is due to the grouping of the real and imaginary parts of the signal.

C. Noise sensitivity

In this section we characterize the noise sensitivity of SE. To achieve this goal, we first discuss the risk of the complex soft thresholding function. The properties of this risk play an important role in the discussion of the noise sensitivity of SE in Section III-C2.

1) Risk of soft thresholding: Define the risk of the soft thresholding function as

r(µ, τ) ≜ E |η(µe^{iθ} + Z_1 + iZ_2; τ) − X|²,

where µ ∈ [0, ∞), θ ∈ [0, 2π), and the expected value is with respect to the two independent random variables Z_1, Z_2 ∼ N(0, 1/2). It is important to note that, according to Lemma III.2, the risk function is independent of θ. The following lemma characterizes two important properties of this risk function:

Lemma III.7. r(µ, τ) is an increasing function of µ and a concave function in terms of µ².

See Appendix V-F for the proof of this lemma. We define the minimax risk of the soft thresholding function as

M♭(ε) ≜ inf_{τ>0} sup_{q∈F_ε} E |η(X + Z_1 + iZ_2; τ) − X|²,

where q is the probability density function of X, and the expected value is with respect to X, Z_1, and Z_2.

Note that q ∈ F_ε implies that q has a point mass of 1 − ε at zero; see Section I-C for more information. In the next section we show a connection between this minimax risk and the noise sensitivity of the SE. Therefore, it is important to characterize M♭(ε).

Proposition III.8. The minimax risk of the soft thresholding function satisfies

M♭(ε) = inf_τ [ 2(1 − ε) ∫_{w=τ}^{∞} w(w − τ)² e^{−w²} dw + ε(1 + τ²) ].   (8)

See Appendix V-G for the proof. It is important to note that the quantities in (8) can be easily calculated in terms of the density and distribution function of a normal random variable. Therefore, a simple computer program may accurately calculate the value of M♭(ε) for any ε.

The proof provided for Proposition III.8 also proves the following proposition. We will discuss the importance of this result for compressed sensing problems in the next section.

Proposition III.9. The maximum of the risk function, max_{q∈F_{ε,γ}} E|η(X + Z_1 + iZ_2; τ) − X|², is achieved on q(X) = (1 − ε)δ_0(|X|) + εδ_γ(|X|).

First, note that the maximizing distribution (or least favorable distribution) is independent of the threshold parameter. Second, note that the maximizing distribution is not unique, since we have already proved that the phase distribution does not affect the risk function.

2) Noise sensitivity of state evolution: As mentioned in Section III-A, in the presence of measurement noise, SE is given by

m_{t+1} = Ψ(m_t) = E |η( X + √npi Z_1 + i √npi Z_2 ; τ √npi ) − X|²,

where npi = σ² + m_t/δ. As mentioned above, m_t characterizes the asymptotic MSE of CAMP at iteration t. Therefore, the final solution of the CAMP algorithm converges to one of the stable fixed points of the Ψ function. The next result suggests that the stable fixed point is unique, and therefore no matter where the algorithm starts from it will always converge to the same MSE.

Lemma III.10. Ψ(m) has a unique stable fixed point to which the sequence {m_t} converges.

We call the fixed point in Lemma III.10 fMSE(σ², δ, ρ, G, τ). According to Section I-D2, we define the minimax noise sensitivity as

NS_SE(δ, ρ) ≜ min_τ sup_{σ>0} sup_{q∈F} fMSE(σ², δ, ρ, G, τ)/σ².

The noise sensitivity of SE can be easily evaluated from M♭(ε). The following theorem characterizes this relation.

Theorem III.11. Let ρ_MSE(δ) be the value of ρ satisfying M♭(ρδ) = δ. Then, for ρ < ρ_MSE we have

NS_SE(δ, ρ) = M♭(δρ) / (1 − M♭(δρ)/δ),

and for ρ > ρ_MSE(δ), NS_SE(δ, ρ) = ∞.

The proof of this theorem follows along the same lines as the proof of Proposition 3.1 in [11], and therefore we skip it for the sake of brevity. The contour lines of this noise sensitivity function are displayed in Figure 2.

Similar arguments as those presented in Proposition 3.1 in [11], combined with Proposition III.9, prove the following.

Proposition III.12. The maximum of the formal MSE, max_{q∈F_{ε,γ}} fMSE(σ², δ, ρ, G, τ), is achieved by q = (1 − ε)δ_0(|X|) + εδ_γ(|X|), independent of σ and τ.

Again we emphasize that the maximizing or least favorable distribution is not unique. Note that the least favorable distribution provides a simple approach for designing and setting the parameters of CS systems [8]: we design the system such that it performs well on the least favorable distribution, and it is then guaranteed that the system will perform as well (or in many cases better) on all other input distributions.

As a final remark, we note that ρ_MSE(δ) equals ρ_SE(δ), as proved next.

Proposition III.13. For every δ ∈ [0, 1] we have ρ_MSE(δ) = ρ_SE(δ).

Proof: The proof is a simple comparison of the formulas. We first know that ρ_MSE is derived from the following equation:

min_τ [ 2(1 − ρδ) ∫_{ω>τ} ω(ω − τ)² e^{−ω²} dω + ρδ(1 + τ²) ] = δ.

On the other hand, since Ψ(m) is a concave function of m, ρ_SE(δ, τ) is derived from dΨ(m)/dm |_{m=0} = 1. This derivative is equal to

dΨ(m)/dm |_{m=0} = (2(1 − ρδ)/δ) ∫_{ω>τ} ω(ω − τ)² e^{−ω²} dω + (ρδ/δ)(1 + τ²).

Also, ρ_SE(δ) = sup_τ ρ_SE(τ, δ). However, in order to obtain the highest ρ we should minimize the above expression over τ. Therefore, both ρ_SE(δ) and ρ_MSE(δ) satisfy the same equations and thus are exactly equal.
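Proposition III.8 and Theorem III.11 together give a directly computable recipe for the noise sensitivity of SE. A sketch using scipy quadrature and a bounded scalar minimization (the search interval for τ is an illustrative assumption):

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

def risk_at_tau(tau, eps):
    """The objective in (8): 2(1-eps) * int_tau^inf w (w-tau)^2 exp(-w^2) dw + eps (1 + tau^2)."""
    integral, _ = quad(lambda w: w * (w - tau) ** 2 * np.exp(-w ** 2), tau, np.inf)
    return 2.0 * (1.0 - eps) * integral + eps * (1.0 + tau ** 2)

def minimax_risk(eps):
    """Minimax risk of complex soft thresholding, Proposition III.8 (search interval assumed)."""
    return minimize_scalar(risk_at_tau, bounds=(0.0, 10.0), args=(eps,), method="bounded").fun

def noise_sensitivity(delta, rho):
    """Theorem III.11: finite below the phase transition, infinite above it."""
    M = minimax_risk(delta * rho)
    return np.inf if M >= delta else M / (1.0 - M / delta)
```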

D. Connection between the state evolution, CAMP, and c-LASSO

There is a strong connection between the SE framework, the CAMP algorithm, and c-LASSO. Recently, [16] proved that SE predicts the asymptotic performance of the AMP algorithm when the measurement matrix is i.i.d. Gaussian. The result also holds for complex Gaussian matrices and complex input vectors. As in [3], we conjecture that the SE predictions are correct for a "large" class of random matrices. We show evidence of this claim in Section IV. Here, for the sake of completeness, we quote the result of [16] in the complex setting. Let γ : C² → R be a pseudo-Lipschitz function.⁶ To make the presentation clear we consider a simplified version of Definition 1 in [46].

Definition III.14. A sequence of instances {s_o(N), A(N), w(N)}, indexed by the ambient dimension N, is called a converging sequence if the following conditions hold:
- The elements of s_o(N) ∈ C^N are i.i.d. drawn from (1 − ρδ)δ_0(|s_{o,i}|) + ρδ G(s_{o,i}).
- The elements of w(N) ∈ C^n (n = δN) are i.i.d. drawn from N(0, σ²_w).
- The elements of A(N) ∈ C^{n×N} are i.i.d. drawn from a complex Gaussian distribution.

Theorem III.15. Consider a converging sequence {s_o(N), A(N), w(N)}. Let x^t(N) be the estimate of the CAMP algorithm at iteration t. For any pseudo-Lipschitz function γ : C² → R we have

lim_{N→∞} (1/N) Σ_{i=1}^N γ(x^t_i, s_{o,i}) = E γ( η( X + √npi_t Z_1 + i √npi_t Z_2 ; τ √npi_t ), X )

almost surely, where Z_1 + iZ_2 ∼ CN(0, 1) and X ∼ (1 − ρδ)δ_0(|s_{o,i}|) + ρδ G(s_{o,i}) are independent complex random variables. Also, npi_t ≜ σ² + m_t/δ, where m_t satisfies (7).

The proof of this theorem is similar to the proof of Theorem 1 in [16] and hence is skipped here.

It is also straightforward to extend the result of [11] and [46] on the connection of message passing algorithms and LASSO to the complex setting. For a given value of τ, suppose that the fixed point of the state evolution is denoted by m*. Define λ(τ) as

λ(τ) ≜ τ √m* ( 1 − (1/(2δ)) E[ ∂η^R/∂x + ∂η^I/∂y ] ),   (9)

where

∂η^R/∂x ≜ (∂η^R/∂x)( X + √m* Z_1 + i √m* Z_2 ; τ √m* ),
∂η^I/∂y ≜ (∂η^I/∂y)( X + √m* Z_1 + i √m* Z_2 ; τ √m* ),

and E is with respect to independent random variables Z_1 + iZ_2 ∼ CN(0, 1) and X ∼ (1 − ρδ)δ_0(|s_{o,i}|) + ρδ G(s_{o,i}). The following theorem establishes the connection between the solution of LASSO and the state evolution equation.

Theorem III.16. Consider a converging sequence {s_o(N), A(N), w(N)}. Let x̂^{λ(τ)}(N) be the solution of LASSO. Then, for any pseudo-Lipschitz function γ : C² → R we have

lim_{N→∞} (1/N) Σ_{i=1}^N γ(x̂^{λ(τ)}_i(N), s_{o,i}) = E γ( η( X + √npi* Z_1 + i √npi* Z_2 ; τ √npi* ), X )

almost surely, where Z_1 + iZ_2 ∼ CN(0, 1) and X ∼ (1 − ρδ)δ_0(|s_{o,i}|) + ρδ G(s_{o,i}) are independent complex random variables. npi* ≜ σ² + m*/δ, where m* is the fixed point of (7).

The proof of the theorem is similar to the proof of Theorem 1.4 in [46] and hence is skipped here.

Note that according to Theorems III.15 and III.16, SE predicts the dynamics of the CAMP algorithm and the solution of LASSO accurately in the asymptotic setting.

E. Discussion

1) Convergence rate of CAMP: In this section we briefly discuss the convergence rate of the CAMP algorithm. In this respect our results are a straightforward extension of the analysis in [15]. But, for the sake of completeness, we mention a few highlights. Let {m_t}_{t=1}^∞ be a sequence of MSEs generated according to state evolution (7) for X ∼ (1 − ε)δ_0(|X|) + εG(X), τ, and σ² = 0. The following result provides an upper bound on m_t as a function of the iteration t.

Theorem III.17. Let {m_t}_{t=1}^∞ be a sequence of MSEs generated according to SE. Then

m_t ≤ ( dΨ(m)/dm |_{m=0} )^t m_0.

Proof: Since according to Lemma V.2 Ψ(m) is concave, we have Ψ(m) ≤ (dΨ(m)/dm |_{m=0}) m. Hence at every iteration, m is attenuated by dΨ(m)/dm |_{m=0}. After t iterations we have m_t ≤ (dΨ(m)/dm |_{m=0})^t m_0.

According to Theorem III.17 the convergence rate of CAMP is linear (in the asymptotic setting).⁷ In fact, due to the concavity of the Ψ function, CAMP converges faster for large values of the MSE m. As m approaches zero the convergence rate decreases towards the rate predicted by this theorem. Theorem III.17 provides an upper bound on the number of iterations the algorithm requires to reach a certain accuracy.

⁶γ : C² → R is pseudo-Lipschitz if and only if |γ(x) − γ(y)| ≤ L(1 + ||x||_2 + ||y||_2) ||x − y||_2.

⁷If the measurement matrix is not i.i.d. random, CAMP does not necessarily converge at this rate. This is due to the fact that state evolution does not necessarily hold for arbitrary matrices.


Figure 4 exhibits the value of dΨ(m)/dm |_{m=0} as a function of ρ and δ. This figure is based on the calculations we have presented in Appendix V-D. Here, τ is chosen such that the CAMP algorithm achieves the same phase transition as the c-BP algorithm. Note that, according to Proposition III.17, if dΨ(m)/dm |_{m=0} < 0.9, then m_200 < 7.1 × 10^{−10} m_0.

Fig. 4. Contour lines of dΨ(m)/dm |_{m=0} as a function of ρ and δ. The parameter τ in the SE is set according to Theorem III.5. The black curve is the phase transition of CAMP. The colored lines display the level sets of dΨ(m)/dm |_{m=0} = 0.3, 0.5, 0.65, 0.8, 0.9, 0.98. Note that according to Proposition III.17, if dΨ(m)/dm |_{m=0} < 0.9, then m_200 < 7.1 × 10^{−10} m_0.

Fig. 5. Comparison of ρ_SE(δ) with the empirical phase transition of c-LASSO [57] (top) and CAMP (bottom). There is a close match between the theoretical prediction and the empirical results from Monte Carlo simulations.

Theorem III.17 only considers the noise-free problem. But, again due to the concavity of the Ψ function, the convergence of CAMP to its fixed point is even faster for noisy measurements. To see this, note that once the measurements are noisy, the fixed point of CAMP occurs at a larger value of m. Since Ψ is concave, the derivative at this point is lower than the derivative at zero. Hence, convergence will be faster.

2) Extensions: The results presented in this paper are concerned with the two most popular problems in compressed sensing, i.e., exact recovery of sparse signals and approximate recovery of sparse signals in the presence of noise. However, our framework is far more powerful and can address other compressed sensing problems as well. For instance, a similar framework has been used to address the problem of recovering approximately sparse signals in the presence of noise [56]. For the sake of brevity we have not provided such an analysis in the current paper. However, the properties we proved in Lemmas III.2, III.7, and Proposition III.8 enable a straightforward extension of our analysis to such cases as well.

Furthermore, the framework we developed here provides a way for recovering sparse complex-valued signals when the distribution of non-zero elements is known. This area has been studied in [12], [51].

IV. SIMULATIONS

As explained in Section III-D, our theoretical results show that, if the elements of the matrix are i.i.d. Gaussian, then SE predicts the performance of the CAMP and c-LASSO algorithms accurately. However, in this section we will show evidence that suggests the theoretical framework is applicable to a wider class of measurement matrices. We then investigate the dependence of the empirical phase transition on the input distribution for medium problem sizes.

A. Measurement matrix simulations

We investigate the effect of the measurement matrix distribution on the performance of CAMP and c-LASSO in two different cases. First, we consider the case where the measurements are noise-free. We postpone a discussion of measurement noise to Section IV-A2.

TABLE I
ENSEMBLES CONSIDERED FOR THE MEASUREMENT MATRIX A IN THE MATRIX UNIVERSALITY ENSEMBLE EXPERIMENTS.

Name | Specification
Gaussian | i.i.d. elements with CN(0, 1/n)
Rademacher | i.i.d. elements with real and imaginary parts distributed according to (1/2)δ_{−√(1/(2n))}(x) + (1/2)δ_{√(1/(2n))}(x)
Ternary | i.i.d. elements with real and imaginary parts distributed according to (1/3)δ_{−√(3/(2n))}(x) + (1/3)δ_0(x) + (1/3)δ_{√(3/(2n))}(x)
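The three measurement matrix ensembles of Table I can be generated as follows (numpy sketch; the Rademacher and Ternary levels are copied from the table as printed):

```python
import numpy as np

def measurement_matrix(n, N, ensemble="Gaussian", seed=0):
    """Draw an n x N matrix from one of the ensembles of Table I."""
    rng = np.random.default_rng(seed)
    if ensemble == "Gaussian":      # i.i.d. CN(0, 1/n)
        return (rng.standard_normal((n, N)) + 1j * rng.standard_normal((n, N))) / np.sqrt(2 * n)
    if ensemble == "Rademacher":    # real/imaginary parts = +-sqrt(1/(2n)) with equal probability
        lev = np.sqrt(1.0 / (2 * n)) * np.array([-1.0, 1.0])
        return rng.choice(lev, (n, N)) + 1j * rng.choice(lev, (n, N))
    if ensemble == "Ternary":       # real/imaginary parts in {-a, 0, a}, a = sqrt(3/(2n)), equal probabilities
        lev = np.sqrt(3.0 / (2 * n)) * np.array([-1.0, 0.0, 1.0])
        return rng.choice(lev, (n, N)) + 1j * rng.choice(lev, (n, N))
    raise ValueError("unknown ensemble: " + ensemble)
```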

1) Noise-free measurements: Suppose that the measurements are noise-free. Our goal is to empirically measure the phase transition curves of c-LASSO and CAMP on the measurement matrices provided in Table I. To characterize the phase transition of an algorithm, we do the following:

- We consider 33 equispaced values of δ between 0 and 1.
- For each value of δ, we calculate ρ_SE(δ) from the theoretical framework and then consider 41 equispaced values of ρ in [ρ_SE(δ) − 0.2, ρ_SE(δ) + 0.2].
- We fix N = 1000, and for any value of ρ and δ, we calculate n = ⌊δN⌋ and k = ⌊ρδN⌋.
- We draw M = 20 independent random matrices from one of the distributions described in Table I and for each matrix we construct a random input vector s_o with one of the distributions described in Table II. We then form y = A s_o and recover s_o from y, A by either c-BP or CAMP to obtain x̂. The matrix distributions and coefficient distributions we consider in our simulations are specified in Tables I and II, respectively.
- For each δ, ρ, and Monte Carlo sample j, we define a success variable S_{δ,ρ,j} = I(||x̂ − x||_2/||x||_2 < tol) and we calculate the success probability p̂^S_{δ,ρ} = (1/M) Σ_j S_{δ,ρ,j}. This provides an empirical estimate of the probability of correct recovery. The value of tol in our case is set to 10^{−4}.
- For a fixed value of δ, we fit a logistic regression function to p̂^S(δ, ρ) to obtain p^S_δ(ρ). Then we find the value of ρ̂_δ for which p^S_δ(ρ) = 0.5.

See [6] for a more detailed discussion of this approach. For the c-LASSO algorithm, we are reproducing the experiments of [57], [58] and, therefore, we are using the one-L1 algorithm [57]. Although Figure 4 confirms that in most cases even 200 iterations of CAMP are enough to reach convergence, since our goal is to measure the phase transition, we consider 3000 iterations. See Section III-E1 for the discussion on the convergence rate.

Figure 5 compares the phase transition of c-LASSO and CAMP on the ensembles specified in Table I with the theoretical prediction of this paper. In this simulation the coefficient ensemble is UP (see Table II). Clearly, the empirical and theoretical phase transitions of the algorithms coincide. More importantly, we can conjecture that the choice of the measurement matrix ensemble does not affect the phase transition of these two algorithms. We will next discuss the impact of the measurement matrix when there is noise on the measurements.

TABLE II
COEFFICIENT ENSEMBLES CONSIDERED IN COEFFICIENT ENSEMBLE EXPERIMENTS.

Name | Specification
UP | i.i.d. elements with amplitude 1 and uniform phase
ZP | i.i.d. elements with amplitude 1 and phase zero
GA | i.i.d. elements with standard normal real and imaginary parts
UF | i.i.d. elements with U[0, 1] real and imaginary parts

2) Noisy measurements: In this section we aim to show that, even in the presence of noise, the matrix ensembles defined in Table I perform similarly. Here is the setup for our experiment:

- We set δ = 0.25, ρ = 0.1, and N = 1000.
- We choose 50 different values of σ in the range [0.001, 0.1].
- We choose an n × N measurement matrix A from one of the ensembles specified in Table I.
- We draw k i.i.d. elements from the UP ensemble for the k = ⌊ρn⌋ non-zero elements of the input s_o.
- We form the measurement vector y = A s_o + σw, where w is the noise vector with i.i.d. elements from CN(0, 1).
- For CAMP, we set τ = 2. For c-LASSO, we use (9) to derive the corresponding values of λ for τ = 2 in CAMP.
- We calculate the MSE ||x̂ − s_o||²_2/N for each matrix ensemble and compare the results.

Figures 6 and 7 summarize our results. The concentration of the points along the y = x line indicates that the matrix ensembles specified in Table I perform similarly. The coincidence of the phase transition curves for different matrix ensembles is known as the universality hypothesis (conjecture). In order to provide stronger evidence, we run the above experiment with N = 4000. The results of this experiment are exhibited in Figures 8 and 9. It is clear from these figures that the MSE is now more concentrated around the y = x line. Additional experiments with other parameter values exhibited the same behavior. Note that as N grows, the variance of the MSE estimate becomes smaller, and the behavior of the algorithm is closer to the average performance that is predicted by the SE equation.

B. Coefficient ensemble simulations

According to Proposition III.4, ρ_SE(δ, τ) is independent of the distribution G of non-zero coefficients of s_o. We test the accuracy of this result on medium problem sizes. We fix δ to 0.1 and we calculate p̂^S_{δ,ρ} for 60 equispaced values of ρ between 0.1 and 0.5. For each algorithm and each value of ρ we run 100 Monte Carlo trials and calculate the success rate for the Gaussian matrix and the coefficient ensembles specified in Table II. Figure 10 summarizes our results. Simulations at other values of δ result in very similar behavior. These results are consistent with Proposition III.4. The small differences between the empirical phase transitions are due to two issues that are not reflected in Proposition III.4: (i) N is finite, while Proposition III.4 considers the asymptotic setting. (ii) The number of algorithm iterations is finite, while Proposition III.4 assumes that we run CAMP for an infinite number of iterations.

V. PROOFS OF THE MAIN RESULTS

A. Proximity operator

For a given convex function f : C^n → R, the proximity operator at point x is defined as

Prox_f(x) ≜ arg min_{y∈C^n} (1/2)||y − x||²_2 + f(y).   (10)

The proximity operator plays an important role in optimization theory. For further information refer to [59] or Chapter 7 of [47].
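Definition (10) can be checked numerically for the complex ℓ1-norm against the closed form derived in Lemma V.1 below: the sketch compares a direct minimization of the scalar objective with complex soft thresholding (scipy; test points and tolerance are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def prox_numeric(x, tau):
    """Minimize 0.5*|y - x|^2 + tau*|y| over y in C by direct numerical search."""
    obj = lambda u: 0.5 * ((u[0] - x.real) ** 2 + (u[1] - x.imag) ** 2) + tau * np.hypot(u[0], u[1])
    u = minimize(obj, x0=np.array([x.real, x.imag]), method="Nelder-Mead").x
    return u[0] + 1j * u[1]

def prox_closed_form(x, tau):
    """Complex soft thresholding, the closed form of Lemma V.1."""
    amp = abs(x)
    return 0.0 if amp <= tau else x * (1.0 - tau / amp)

for x in (0.3 + 0.1j, -1.2 + 0.8j, 0.05 - 0.02j):
    assert abs(prox_numeric(x, 0.5) - prox_closed_form(x, 0.5)) < 1e-3
```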

0.5 1 Mean Square Error (Gaussian) 0.5 0 0 0.5 1 1.5 Mean Square Error (Ternary) −3 x 10 Mean Square Error (Gaussian) 0 0 0.5 1 1.5 2 −3 Fig. 6. Comparison of the means square error of c-LASSO for Gaussian Mean Square Error (Ternary) x 10 and Rademacher matrix ensembles (top), and Gaussian and Ternary ensemble (bottom). The concentration of points around the y = x confirms the universality hypothesis. The norms of residuals are equal to 5.9 × 10−4 Fig. 7. Comparison of the MSE of CAMP for Gaussian and Rademacher and 6 × 10−4 for the top and bottom figures, respectively. Comparison of matrix ensembles (top), and Gaussian and Ternary ensemble (bottom). The this figure with Figure 8 confirms that as N grows the points become more concentration of points around the y = x line confirms the universality concentrated around y = x line. hypothesis. The norms of residuals are equal to 9.1 × 10−4 and 9.4 × 10−4 for the top and bottom figures, respectively. Comparison of this figure with Figure 9 confirms that as N grows the points become more concentrated around y = x line. of [47]. The following lemma characterizes the proximity operator for the complex `1-norm. This proximity operator has been used in several other papers [4], [21]–[23], [57]. solution satisfies τyR Lemma V.1. Let f denote the complex `1-norm function, i.e., xR − yR = ∗ , P p R 2 I 2 ∗ p R 2 I 2 f(x) = i (xi ) + (xi ) . Then the proximity operator is (y∗ ) + (y∗) given by τyI xI − yI = ∗ . (11) ∗ p(yR)2 + (yI )2 Proxτf (x) = η(x; τ), ∗ ∗ R I R I Combining the two equations in 11 we obtain y∗ x = x y∗. R  τ(u+iv)  R R τ|x | √ 2 2 2 √ where η(u + iv; τ) = u + iv − 2 2 1{u +v >τ } is Replacing this in (11) we have y∗ = x − and u +v + (xR)2+(xI )2 I applied component-wise to the vector x. yI = xI − √ τ|x | . It is clear that if p(xR)2 + (xI )2 < ∗ (xR)2+(xI )2 R R Proof: Since (10) can be decoupled into the elements of τ, then the signs of y∗ and x will be opposite, which is in the x, y, we can obtain the optimal value of y, by optimizing contradiction with (11). Therefore, if p(xR)2 + (xI )2 < τ, R I over its individual components. In other words, we solve the both y∗ and y∗ are zero. It is straightforward to check that optimization in (10) for x, y ∈ C. In this case the optimization (0, 0) satisfies the subgradient optimality condition. reduces to

1 2 B. Proof of Proposition II.1 Proxτf (x) = arg min |y − x| + τ|y|. y∈C 2 Let

R 2 I 2 ηR(x + iy) R(η(x + iy; λ)), Suppose that the optimal y∗ satisfies (y∗ ) + (y∗) > 0. Then , p I the function (yR)2 + (yI )2 is differentiable and the optimal η (x + iy) , I(η(x + iy; λ)) (12) IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. .., NO. .., .. 13
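To sanity-check Lemma V.1 numerically, the sketch below compares the closed-form η(x; τ) with a brute-force minimization of (1/2)|y − x|² + τ|y| over a fine grid in C. This is our own verification snippet (grid size and test points are arbitrary), not part of the paper.

    import numpy as np

    def eta(x, tau):
        # closed-form proximity operator from Lemma V.1 (component-wise)
        amp = np.abs(x)
        return np.where(amp > tau, (1.0 - tau / np.maximum(amp, 1e-12)) * x, 0.0)

    def prox_by_grid(x, tau, radius=6.0, n=1201):
        # brute-force minimization of 0.5*|y - x|^2 + tau*|y| over a grid in C
        g = np.linspace(-radius, radius, n)
        yr, yi = np.meshgrid(g, g)
        y = yr + 1j * yi
        obj = 0.5 * np.abs(y - x) ** 2 + tau * np.abs(y)
        return y.flat[np.argmin(obj)]

    rng = np.random.default_rng(0)
    for _ in range(5):
        x = rng.normal(size=2) @ np.array([1, 1j])     # one random complex scalar
        tau = rng.uniform(0.2, 2.0)
        print(abs(eta(np.array([x]), tau)[0] - prox_by_grid(x, tau)))
    # the differences are on the order of the grid spacing (about 0.01)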

Fig. 8. Comparison of the MSE of c-LASSO for the Gaussian and Rademacher matrix ensembles (top), and the Gaussian and Ternary ensembles (bottom). The concentration of the points around the y = x line confirms the universality hypothesis. The norms of the residuals are 2.8 × 10^{-4} and 2.3 × 10^{-4} for the top and bottom figures, respectively. Comparison of this figure with Figure 6 confirms that as N grows the data points concentrate more around the y = x line.

Fig. 9. Comparison of the MSE of CAMP for the Gaussian and Rademacher matrix ensembles (top), and the Gaussian and Ternary ensembles (bottom). The norms of the residuals are 2 × 10^{-4} and 1.8 × 10^{-4}, respectively. Comparison with Figure 7 confirms that as N grows the data points concentrate more around the y = x line.

B. Proof of Proposition II.1

Let
\[
\eta^R(x+iy) \triangleq \Re\big(\eta(x+iy;\lambda)\big), \qquad
\eta^I(x+iy) \triangleq \Im\big(\eta(x+iy;\lambda)\big) \qquad (12)
\]
denote the real and imaginary parts of the complex soft thresholding function. Define
\[
\partial_1\eta^R \triangleq \frac{\partial \eta^R(x+iy)}{\partial x}, \quad
\partial_2\eta^R \triangleq \frac{\partial \eta^R(x+iy)}{\partial y}, \quad
\partial_1\eta^I \triangleq \frac{\partial \eta^I(x+iy)}{\partial x}, \quad
\partial_2\eta^I \triangleq \frac{\partial \eta^I(x+iy)}{\partial y}. \qquad (13)
\]
We first simplify the expression for z^t_{a→ℓ}:
\[
z^t_{a\to\ell}
= \underbrace{y_a - \sum_{j\in[N]} A_{aj}x^t_j - \sum_{j\in[N]} A_{aj}\Delta x^t_{j\to a}}_{\triangleq\, z^t_a}
+ \underbrace{A_{a\ell}x^t_\ell}_{\triangleq\, \Delta z^t_{a\to\ell}}
+ O\!\Big(\frac{1}{N}\Big). \qquad (14)
\]
We also use the first-order expansion of the soft thresholding function to obtain
\[
x^{t+1}_{\ell\to a}
= \eta\Big(\sum_{b\in[n]} A^*_{b\ell}z^t_b + \sum_{b\in[n]} A^*_{b\ell}\Delta z^t_{b\to\ell} - A^*_{a\ell}z^t_a;\ \tau_t\Big) + O\!\Big(\frac{1}{N}\Big)
\]
\[
= \underbrace{\eta\Big(\sum_{b\in[n]} A^*_{b\ell}z^t_b + \sum_{b\in[n]} A^*_{b\ell}\Delta z^t_{b\to\ell};\ \tau_t\Big)}_{\triangleq\, x^t_\ell}
- \Re(A^*_{a\ell}z^t_a)\,\partial_1\eta^R
- \Im(A^*_{a\ell}z^t_a)\,\partial_2\eta^R
- i\,\Re(A^*_{a\ell}z^t_a)\,\partial_1\eta^I
- i\,\Im(A^*_{a\ell}z^t_a)\,\partial_2\eta^I
+ O\!\Big(\frac{1}{N}\Big), \qquad (15)
\]
where all four partial derivatives are evaluated at Σ_{b∈[n]} A*_{bℓ} z^t_b + Σ_{b∈[n]} A*_{bℓ} Δz^t_{b→ℓ}.

Fig. 10. Comparison of the phase transition of c-LASSO (top) and CAMP (bottom) for different coefficient ensembles specified in Table II. δ = 0.1 in this figure. These figures are in agreement with Proposition III.4, which claims that the phase transitions of CAMP and c-LASSO are independent of the distribution of the non-zero coefficients. Simulations at other values of δ result in similar behavior.

According to (14), Δz^t_{b→ℓ} = A_{bℓ} x^t_ℓ. Furthermore, we assume that the columns of the matrix are normalized. Therefore, Σ_b A*_{bℓ} Δz^t_{b→ℓ} = x^t_ℓ. It is also clear that
\[
\Delta x^t_{\ell\to a} \triangleq
- \Re(A^*_{a\ell}z^t_a)\,\partial_1\eta^R
- \Im(A^*_{a\ell}z^t_a)\,\partial_2\eta^R
- i\,\Re(A^*_{a\ell}z^t_a)\,\partial_1\eta^I
- i\,\Im(A^*_{a\ell}z^t_a)\,\partial_2\eta^I, \qquad (16)
\]
where the partial derivatives are again evaluated at Σ_{b∈[n]} A*_{bℓ} z^t_b + Σ_{b∈[n]} A*_{bℓ} Δz^t_{b→ℓ}. Also, according to (14),
\[
z^t_a = y_a - \sum_j A_{aj}x^t_j - \sum_j A_{aj}\Delta x^t_{j\to a}. \qquad (17)
\]
By plugging (16) into (17), we obtain
\[
-\sum_j A_{aj}\Delta x^t_{j\to a}
= \sum_j A_{aj}\Re(A^*_{aj}z^t_a)\,\partial_1\eta^R\Big(\sum_b A^*_{bj}z^t_b + x^t_j\Big)
+ \sum_j A_{aj}\Im(A^*_{aj}z^t_a)\,\partial_2\eta^R\Big(\sum_b A^*_{bj}z^t_b + x^t_j\Big)
\]
\[
+ i\sum_j A_{aj}\Re(A^*_{aj}z^t_a)\,\partial_1\eta^I\Big(\sum_b A^*_{bj}z^t_b + x^t_j\Big)
+ i\sum_j A_{aj}\Im(A^*_{aj}z^t_a)\,\partial_2\eta^I\Big(\sum_b A^*_{bj}z^t_b + x^t_j\Big),
\]
which completes the proof. ∎
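The derivation above is what collapses the per-edge messages into a simple matrix-vector recursion. As a rough illustration only (our own sketch, not the paper's CAMP pseudocode), the iteration below applies the complex soft threshold to x^t + A*z^t and adds an Onsager-style correction built from the average of the partial derivatives in (13). The constant in the correction term and the thresholding schedule are assumptions here, not taken from the paper.

    import numpy as np

    def eta(u, tau):
        amp = np.abs(u)
        return np.where(amp > tau, (1.0 - tau / np.maximum(amp, 1e-12)) * u, 0.0)

    def eta_divergence(u, tau):
        # (1/2) * (d eta^R/dx + d eta^I/dy), i.e. the average of the partials in (13),
        # computed from the closed form of the complex soft threshold
        amp = np.abs(u)
        return np.where(amp > tau, 1.0 - tau / (2.0 * np.maximum(amp, 1e-12)), 0.0)

    def camp_sketch(A, y, iters=30, tau_factor=2.0):
        # A: n x N complex matrix with (approximately) unit-norm columns, y: length-n vector
        n, N = A.shape
        delta = n / N
        x = np.zeros(N, dtype=complex)
        z = y.copy()
        for _ in range(iters):
            pseudo_data = x + A.conj().T @ z
            tau = tau_factor * np.sqrt(np.mean(np.abs(z) ** 2))   # heuristic threshold schedule
            x_new = eta(pseudo_data, tau)
            onsager = z * np.mean(eta_divergence(pseudo_data, tau)) / delta
            z = y - A @ x_new + onsager
            x = x_new
        return x

On a random instance with an i.i.d. complex Gaussian A (columns scaled to unit norm) and a sparse signal, this sketch typically drives the residual toward the noise floor; the actual CAMP update and its threshold policy should be taken from the algorithm defined earlier in the paper, not from this illustration.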

C. Proof of Lemma III.2

Let µ and θ denote the amplitude and phase of the random variable X. Define npi ≜ σ² + m/δ, ν ≜ √npi, and ζ ≜ µ/ν. Then
\[
\Psi(m) = \mathbb{E}\big|\eta(X + \sqrt{\mathrm{npi}}\,Z_1 + i\sqrt{\mathrm{npi}}\,Z_2;\ \tau\sqrt{\mathrm{npi}}) - X\big|^2
= \nu^2\,\mathbb{E}\Big|\eta\Big(\frac{X}{\nu} + Z_1 + iZ_2;\ \tau\Big) - \frac{X}{\nu}\Big|^2
\]
\[
= (1-\epsilon)\nu^2\,\mathbb{E}\big|\eta(Z_1+iZ_2;\tau)\big|^2
+ \epsilon\nu^2\,\mathbb{E}\Big[\mathbb{E}_{\zeta,\theta}\big|\eta(\zeta e^{i\theta} + Z_1 + iZ_2;\tau) - \zeta e^{i\theta}\big|^2\Big], \qquad (18)
\]
where E_{ζ,θ} denotes the conditional expectation given the variables ζ, θ. Note that the marginal distribution of ζ depends only on the marginal distribution of µ. The first term in (18) is independent of the phase θ, and therefore we should prove that the second term is also independent of θ. Define
\[
\Phi(\zeta,\theta) \triangleq \mathbb{E}_{\zeta,\theta}\big(|\eta(\zeta e^{i\theta} + Z_1 + iZ_2;\tau) - \zeta e^{i\theta}|^2\big). \qquad (19)
\]
We prove that Φ is independent of θ. For two real-valued variables z_r and z_c, define z ≜ (z_r, z_c), dz ≜ dz_r dz_c, and
\[
\alpha_z \triangleq \sqrt{(\zeta\cos\theta + z_r)^2 + (\zeta\sin\theta + z_c)^2}, \qquad
\chi_z \triangleq \arctan\Big(\frac{\zeta\sin\theta + z_c}{\zeta\cos\theta + z_r}\Big), \qquad
c_r \triangleq \frac{\zeta\cos\theta + z_r}{\alpha_z}, \qquad
c_i \triangleq \frac{\zeta\sin\theta + z_c}{\alpha_z}.
\]
Define the two sets S_τ ≜ {(z_r, z_c) | α_z < τ} and S_τ^c ≜ R² \ S_τ, where "\" is the set subtraction operator. We have
\[
\Phi(\zeta,\theta)
= \zeta^2\int_{z\in S_\tau}\frac{1}{\pi}e^{-(z_r^2+z_c^2)}\,dz
+ \int_{z\in S_\tau^c}\big|(\alpha_z-\tau)e^{i\chi_z} - \zeta\cos\theta - i\zeta\sin\theta\big|^2\,\frac{1}{\pi}e^{-(z_r^2+z_c^2)}\,dz
\]
\[
= \zeta^2\int_{z\in S_\tau}\frac{1}{\pi}e^{-(z_r^2+z_c^2)}\,dz
+ \int_{z\in S_\tau^c}\big|z_r + iz_c - \tau c_r - i\tau c_i\big|^2\,\frac{1}{\pi}e^{-(z_r^2+z_c^2)}\,dz. \qquad (20)
\]
The first integral in (20) corresponds to the case |ζe^{iθ} + z_r + iz_c| < τ. The second integral is over the values of z_r and z_c for which |ζe^{iθ} + z_r + iz_c| ≥ τ. Define β ≜ ζcosθ + z_r and γ ≜ ζsinθ + z_c. We then obtain
\[
\int_{z\in S_\tau^c} |z_r + iz_c - \tau c_r - i\tau c_i|^2\,\frac{1}{\pi}e^{-(z_r^2+z_c^2)}\,dz_r\,dz_c
= \int\!\!\int_{\sqrt{\beta^2+\gamma^2}>\tau}
\Big|\beta - \zeta\cos\theta + i(\gamma - \zeta\sin\theta) - \frac{\tau\beta}{\sqrt{\beta^2+\gamma^2}} - i\frac{\tau\gamma}{\sqrt{\beta^2+\gamma^2}}\Big|^2\,
\frac{1}{\pi}e^{-(\beta-\zeta\cos\theta)^2-(\gamma-\zeta\sin\theta)^2}\,d\beta\,d\gamma
\]
\[
\overset{(a)}{=} \int_{\phi=0}^{2\pi}\!\int_{r>\tau}
\big|(r-\tau)\cos\phi - \zeta\cos\theta + i\big((r-\tau)\sin\phi - \zeta\sin\theta\big)\big|^2\,
\frac{1}{\pi}e^{-(r\cos\phi-\zeta\cos\theta)^2-(r\sin\phi-\zeta\sin\theta)^2}\,r\,dr\,d\phi
\]
\[
= \int_{\phi=0}^{2\pi}\!\int_{r>\tau}
\big[(r-\tau)^2 + \zeta^2 - 2\zeta(r-\tau)\cos(\theta-\phi)\big]\,
\frac{1}{\pi}e^{-r^2-\zeta^2+2r\zeta\cos(\theta-\phi)}\,r\,dr\,d\phi.
\]
Equality (a) is the result of the change of integration variables from γ and β to r ≜ √(β² + γ²) and φ ≜ arctan(γ/β). The periodicity of the cosine function proves that the last integration is independent of the phase θ. We can similarly prove that ζ² ∫_{z∈S_τ} (1/π) e^{-(z_r²+z_c²)} dz is independent of θ. This completes the proof. ∎

D. Proof of Theorem III.5

We first prove the following lemma that simplifies the proof of Theorem III.5.

Lemma V.2. The function Ψ(m) is concave with respect to m.

Proof: For notational simplicity define ν ≜ √(σ² + m/δ), X_ν ≜ X/ν, and A_ν ≜ |X_ν + Z₁ + iZ₂|. We note that
\[
\frac{d^2\Psi}{dm^2}
= \frac{d}{dm}\Big(\frac{d\Psi}{d(\nu^2)}\,\frac{d\nu^2}{dm}\Big)
= \frac{1}{\delta}\,\frac{d}{dm}\Big(\frac{d\Psi}{d\nu^2}\Big)
= \frac{1}{\delta^2}\,\frac{d^2\Psi}{d(\nu^2)^2}.
\]
Therefore, Ψ is concave with respect to m if and only if it is concave with respect to ν². According to Lemma III.2 the phase distribution of X does not affect the Ψ function. Therefore, we set the phase of X to zero and assume that it is a positive-valued random variable (representing the amplitude). This assumption substantially simplifies the calculations. We have
\[
\Psi(\nu^2) = \nu^2\,\mathbb{E}\big|\eta(X_\nu + Z_1 + iZ_2;\tau) - X_\nu\big|^2
= \nu^2\,\mathbb{E}\Big[\mathbb{E}_X\big|\eta(X_\nu + Z_1 + iZ_2;\tau) - X_\nu\big|^2\Big],
\]
where E_X denotes the expected value conditioned on the random variable X. We first prove that Ψ_X(ν²) ≜ ν² E_X|η(X_ν + Z₁ + iZ₂; τ) − X_ν|² is concave with respect to ν² by proving d²Ψ_X/d(ν²)² ≤ 0. Then, since Ψ(ν²) is a convex combination of Ψ_X(ν²), we conclude that Ψ(ν²) is a concave function of ν² as well. The rest of the proof details the algebra required for calculating and simplifying d²Ψ_X(ν²)/d(ν²)².

Using the real and imaginary parts of the soft thresholding function and its partial derivatives introduced in (12) and (13) we have
\[
\frac{d\Psi_X(\nu^2)}{d\nu^2}
= \mathbb{E}_X\big|\eta(X_\nu + Z_1 + iZ_2;\tau) - X_\nu\big|^2
+ \nu^2\,\frac{d}{d(\nu^2)}\,\mathbb{E}_X\big|\eta(X_\nu + Z_1 + iZ_2;\tau) - X_\nu\big|^2
\]
\[
= \mathbb{E}_X\big|\eta(X_\nu + Z_1 + iZ_2;\tau) - X_\nu\big|^2
- X_\nu\,\mathbb{E}_X\Big[\big(\partial_1\eta^R(X_\nu + Z_1 + iZ_2;\tau) - 1\big)\big(\eta^R(X_\nu + Z_1 + iZ_2;\tau) - X_\nu\big)\Big]
- X_\nu\,\mathbb{E}_X\Big[\partial_1\eta^I(X_\nu + Z_1 + iZ_2;\tau)\,\eta^I(X_\nu + Z_1 + iZ_2;\tau)\Big],
\]
where ∂₁η^R, ∂₂η^R, ∂₁η^I, and ∂₂η^I are defined in (13). Note that in the above calculations ∂₂η^R and ∂₂η^I did not appear, since we assumed that X is a real-valued random variable. Define A_ν ≜ √((X_ν + Z₁)² + Z₂²). It is straightforward to show that
\[
\partial_1\eta^R(X_\nu + Z_1 + iZ_2;\tau) = \Big(1 - \frac{\tau Z_2^2}{A_\nu^3}\Big)\,\mathbb{1}(A_\nu > \tau), \qquad
\eta^R(X_\nu + Z_1 + iZ_2;\tau) = (X_\nu + Z_1)\Big(1 - \frac{\tau}{A_\nu}\Big)\,\mathbb{1}(A_\nu \ge \tau),
\]
\[
\partial_1\eta^I(X_\nu + Z_1 + iZ_2;\tau) = \frac{\tau (X_\nu + Z_1) Z_2}{A_\nu^3}\,\mathbb{1}(A_\nu \ge \tau), \qquad
\eta^I(X_\nu + Z_1 + iZ_2;\tau) = \Big(Z_2 - \frac{\tau Z_2}{A_\nu}\Big)\,\mathbb{1}(A_\nu \ge \tau). \qquad (21)
\]
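As a quick numerical check of the closed-form expressions in (21) (our own verification snippet, not part of the paper), one can compare them against finite differences of the real and imaginary parts of η:

    import numpy as np

    def eta(u, tau):
        amp = np.abs(u)
        return np.where(amp > tau, (1.0 - tau / amp) * u, 0.0)

    def formulas_21(x, y, tau):
        # x plays the role of X_nu + Z_1, y plays the role of Z_2
        A = np.hypot(x, y)
        on = A > tau
        d1_etaR = (1.0 - tau * y**2 / A**3) * on
        etaR = x * (1.0 - tau / A) * on
        d1_etaI = tau * x * y / A**3 * on
        etaI = y * (1.0 - tau / A) * on
        return d1_etaR, etaR, d1_etaI, etaI

    rng = np.random.default_rng(1)
    tau, h = 0.7, 1e-6
    x, y = rng.normal(size=100), rng.normal(size=100)
    d1R, etaR, d1I, etaI = formulas_21(x, y, tau)
    fd_R = (eta(x + h + 1j * y, tau).real - eta(x - h + 1j * y, tau).real) / (2 * h)
    fd_I = (eta(x + h + 1j * y, tau).imag - eta(x - h + 1j * y, tau).imag) / (2 * h)
    mask = np.abs(np.hypot(x, y) - tau) > 1e-3    # stay away from the non-smooth circle |u| = tau
    print(np.max(np.abs(fd_R - d1R)[mask]), np.max(np.abs(fd_I - d1I)[mask]))
    print(np.max(np.abs(eta(x + 1j * y, tau).real - etaR)),
          np.max(np.abs(eta(x + 1j * y, tau).imag - etaI)))
    # the function values agree to machine precision and the derivatives to several decimal places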

For f : C → R we define ∂₁²f(x+iy) ≜ ∂²f(x+iy)/∂x². It is straightforward to show that
\[
\frac{d^2\Psi_X(\nu^2)}{d(\nu^2)^2}
= -\frac{X}{\nu^3}\,\mathbb{E}_X\Big[\big(\partial_1\eta^R - 1\big)\big(\eta^R - X_\nu\big)\Big]
- \frac{X}{\nu^3}\,\mathbb{E}_X\Big[\partial_1\eta^I\,\eta^I\Big]
+ \frac{X}{2\nu^3}\,\mathbb{E}_X\Big[\big(\partial_1\eta^R - 1\big)\big(\eta^R - X_\nu\big)\Big]
+ \frac{X}{2\nu^3}\,\mathbb{E}_X\Big[\partial_1\eta^I\,\eta^I\Big]
\]
\[
+ \frac{X^2}{2\nu^4}\,\mathbb{E}_X\big(\partial_1\eta^R - 1\big)^2
+ \frac{X^2}{2\nu^4}\,\mathbb{E}_X\big(\partial_1\eta^I\big)^2
+ \frac{X^2}{2\nu^4}\,\mathbb{E}_X\Big[\partial_1^2\eta^R\,\big(\eta^R - X_\nu\big)\Big]
+ \frac{X^2}{2\nu^4}\,\mathbb{E}_X\Big[\partial_1^2\eta^I\,\eta^I\Big], \qquad (22)
\]
where every η and its derivatives are evaluated at X_ν + Z₁ + iZ₂ with threshold τ.

Our next objective is to simplify the terms in (22). We start with
\[
\mathbb{E}_X\Big[\big(\partial_1\eta^R - 1\big)\big(\eta^R - X_\nu\big)\Big]
+ \mathbb{E}_X\Big[\partial_1\eta^I\,\eta^I\Big]
= X_\nu\,\mathbb{E}_X\Big[\mathbb{1}(A_\nu\le\tau) + \frac{\tau Z_2^2}{A_\nu^3}\,\mathbb{1}(A_\nu\ge\tau)\Big]. \qquad (23)
\]
Similarly,
\[
\mathbb{E}_X\big(\partial_1\eta^R - 1\big)^2 + \mathbb{E}_X\big(\partial_1\eta^I\big)^2
= \mathbb{E}_X\Big[\mathbb{1}(A_\nu\le\tau) + \frac{\tau^2 Z_2^4}{A_\nu^6}\,\mathbb{1}(A_\nu\ge\tau)\Big]
+ \mathbb{E}_X\Big[\frac{\tau^2(X_\nu+Z_1)^2 Z_2^2}{A_\nu^6}\,\mathbb{1}(A_\nu\ge\tau)\Big]
= \mathbb{E}_X\Big[\mathbb{1}(A_\nu\le\tau) + \frac{\tau^2 Z_2^2}{A_\nu^4}\,\mathbb{1}(A_\nu\ge\tau)\Big]. \qquad (24)
\]
We also have
\[
\partial_1^2\eta^R(X_\nu+Z_1+iZ_2;\tau)
= \frac{3\tau Z_2^2(X_\nu+Z_1)}{A_\nu^5}\,\mathbb{1}(A_\nu\ge\tau)
+ \Big(1-\frac{\tau Z_2^2}{A_\nu^3}\Big)\frac{X_\nu+Z_1}{A_\nu}\,\delta(A_\nu-\tau)
\]
and
\[
\partial_1^2\eta^I(X_\nu+Z_1+iZ_2;\tau)
= \frac{\tau Z_2}{A_\nu^3}\,\mathbb{1}(A_\nu\ge\tau)
- \frac{3\tau(X_\nu+Z_1)^2 Z_2}{A_\nu^5}\,\mathbb{1}(A_\nu\ge\tau)
+ \frac{\tau(X_\nu+Z_1)^2 Z_2}{A_\nu^4}\,\delta(A_\nu-\tau).
\]
Define
\[
S \triangleq \mathbb{E}_X\Big[\partial_1^2\eta^R\,\big(\eta^R - X_\nu\big)\Big]
+ \mathbb{E}_X\Big[\partial_1^2\eta^I\,\eta^I\Big].
\]
Substituting (21) and the second derivatives above, and replacing (1 − τZ₂²/A_ν³)δ(A_ν − τ) with (1 − Z₂²/A_ν²)δ(A_ν − τ) (the two agree on the support of the delta function, where A_ν = τ), it is straightforward to simplify this expression to obtain
\[
S = \mathbb{E}_X\Big[\Big(\frac{\tau Z_2^2}{A_\nu^3} - \frac{\tau^2 Z_2^2}{A_\nu^4}\Big)\mathbb{1}(A_\nu\ge\tau)\Big]
- \mathbb{E}_X\Big[\frac{3\tau(X_\nu+Z_1)Z_2^2\,X_\nu}{A_\nu^5}\,\mathbb{1}(A_\nu\ge\tau)\Big]
- \mathbb{E}_X\Big[\frac{X_\nu(X_\nu+Z_1)}{A_\nu}\Big(1-\frac{Z_2^2}{A_\nu^2}\Big)\delta(A_\nu-\tau)\Big]. \qquad (25)
\]
By plugging (23), (24), and (25) into (22), we obtain
\[
\frac{d^2\Psi_X(\nu^2)}{d(\nu^2)^2}
= -\frac{3\tau X^3}{2\nu^5}\,\mathbb{E}_X\Big[\frac{(X_\nu+Z_1)Z_2^2}{A_\nu^5}\,\mathbb{1}(A_\nu\ge\tau)\Big]
- \frac{X^2}{2\nu^4}\,\mathbb{E}_X\Big[\frac{X_\nu(X_\nu+Z_1)}{A_\nu}\Big(1-\frac{Z_2^2}{A_\nu^2}\Big)\delta(A_\nu-\tau)\Big]. \qquad (26)
\]
We claim that both terms on the right-hand side of (26) are negative. To prove this claim, we first focus on the first term:
\[
\mathbb{E}_X\Big[\frac{(X_\nu+Z_1)Z_2^2}{A_\nu^5}\,\mathbb{1}(A_\nu\ge\tau)\Big] \ge 0.
\]
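Both claims in this part of the argument are easy to probe numerically. The sketch below (our own check; the choice of X, τ, the grid, and the sample size are arbitrary) estimates the first expectation in (26) by Monte Carlo and evaluates Ψ_X(ν²) on a grid to see that its second differences are nonpositive up to Monte Carlo error.

    import numpy as np

    rng = np.random.default_rng(2)
    tau, X = 2.0, 1.5
    Z1 = rng.normal(0, np.sqrt(0.5), 200_000)
    Z2 = rng.normal(0, np.sqrt(0.5), 200_000)

    def psi_X(nu2):
        nu = np.sqrt(nu2)
        Xnu = X / nu
        u = (Xnu + Z1) + 1j * Z2
        amp = np.abs(u)
        eta_u = np.where(amp > tau, (1 - tau / amp) * u, 0.0)
        return nu2 * np.mean(np.abs(eta_u - Xnu) ** 2)

    # first expectation in (26) at nu = 1: E[(X_nu + Z1) Z2^2 / A^5 ; A >= tau]
    A = np.hypot(X + Z1, Z2)
    first_term = np.mean((X + Z1) * Z2**2 / A**5 * (A >= tau))
    print("first expectation in (26):", first_term)        # comes out nonnegative

    nu2_grid = np.linspace(0.3, 5.0, 25)
    vals = np.array([psi_X(v) for v in nu2_grid])
    second_diff = vals[:-2] - 2 * vals[1:-1] + vals[2:]
    print("max second difference:", second_diff.max())     # nonpositive up to Monte Carlo error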

Define S_τ ≜ {(Z₁, Z₂) | A_ν ≥ τ}. We have
\[
\mathbb{E}_X\Big[\frac{(X_\nu+Z_1)Z_2^2}{A_\nu^5}\,\mathbb{1}(A_\nu\ge\tau)\Big]
= \int\!\!\int_{(z_1,z_2)\in S_\tau} \frac{(X_\nu+z_1)z_2^2}{A_\nu^5}\,\frac{1}{\pi}\,e^{-z_1^2-z_2^2}\,dz_1\,dz_2
\overset{(a)}{=} \int_\tau^\infty\!\!\int_0^{2\pi} \frac{r\cos\phi\; r^2\sin^2\phi\; e^{-r^2-X_\nu^2+2rX_\nu\cos\phi}}{\pi r^5}\,r\,d\phi\,dr
\]
\[
= \int_\tau^\infty\!\!\int_0^{2\pi} \frac{\sin^2\phi\; e^{-r^2-X_\nu^2+2rX_\nu\cos\phi}}{\pi r}\,d\sin(\phi)\,dr \;\ge\; 0. \qquad (27)
\]
Equality (a) is the result of the change of integration variables from z₁, z₂ to r ≜ A_ν and φ ≜ arctan(z₂/(z₁ + X_ν)). With exactly the same approach we can prove that the second term of (26) is also negative. So far we have proved that Ψ_X(m) is concave with respect to m. But this implies that Ψ(m) is also concave, since it is a convex combination of concave functions. ∎

Proof of Theorem III.5: As proved in Lemma V.2, Ψ(m) is a concave function. Furthermore, Ψ(0) = 0. Therefore a given value of ρ is below the phase transition, i.e., ρ < ρ_SE(δ), if and only if dΨ/dm|_{m=0} < 1. It is straightforward to calculate the derivative at zero and confirm that
\[
\frac{d\Psi}{dm}\Big|_{m=0} = \frac{\rho\delta(1+\tau^2)}{\delta} + \frac{1-\rho\delta}{\delta}\,\mathbb{E}\big|\eta(Z_1+iZ_2;\tau)\big|^2. \qquad (28)
\]
Since Z₁, Z₂ ∼ N(0, 1/2) and are independent, the phase of Z₁ + iZ₂ has a uniform distribution, while its amplitude has a Rayleigh distribution. Therefore, we have
\[
\mathbb{E}\big|\eta(Z_1+iZ_2;\tau)\big|^2 = 2\int_\tau^\infty \omega(\omega-\tau)^2 e^{-\omega^2}\,d\omega. \qquad (29)
\]
We plug (29) into (28) and set the derivative dΨ/dm|_{m=0} = 1 to obtain the value of ρ at which the phase transition occurs. This value is given by
\[
\rho = \frac{\delta - 2\int_\tau^\infty \omega(\omega-\tau)^2 e^{-\omega^2}\,d\omega}
{\delta\big(1+\tau^2 - 2\int_\tau^\infty \omega(\omega-\tau)^2 e^{-\omega^2}\,d\omega\big)}.
\]
Clearly the phase transition depends on τ. Hence, according to the framework we introduced in Section I-D1, we search for the value of τ that maximizes the phase transition ρ. Define χ₁(τ) ≜ ∫_τ^∞ ω(τ − ω)e^{-ω²} dω and χ₂(τ) ≜ ∫_τ^∞ ω(ω − τ)² e^{-ω²} dω. The optimal τ_* satisfies
\[
4\chi_1(\tau_*)\big(1+\tau_*^2 - 2\chi_2(\tau_*)\big) = \big(4\chi_1(\tau_*) - 2\tau_*\big)\big(\delta - 2\chi_2(\tau_*)\big),
\]
which in turn results in
\[
\delta = \frac{4(1+\tau_*^2)\chi_1(\tau_*) - 4\tau_*\chi_2(\tau_*)}{-2\tau_* + 4\chi_1(\tau_*)}.
\]
Plugging δ into the formula for ρ, we obtain the formula in Theorem III.5. ∎

E. Proof of Theorem III.6

We first show that the value of δ in Theorem III.5 goes to zero as τ → ∞. By changing the variable of integration from ω to γ = ω − τ, we obtain
\[
\int_{\omega\ge\tau}\omega(\omega-\tau)e^{-\omega^2}\,d\omega
= \int_{\gamma\ge 0}(\gamma+\tau)\gamma e^{-(\gamma+\tau)^2}\,d\gamma
\le e^{-\tau^2}\int_{\gamma\ge 0}(\gamma+\tau)\gamma e^{-\gamma^2}\,d\gamma
= e^{-\tau^2}\Big(\frac{\sqrt{\pi}}{4}+\frac{\tau}{2}\Big). \qquad (30)
\]
Again by changing integration variables we have
\[
\int_{\omega\ge\tau}\omega(\omega-\tau)^2 e^{-\omega^2}\,d\omega
= \int_{\gamma\ge 0}(\gamma+\tau)\gamma^2 e^{-(\gamma+\tau)^2}\,d\gamma
\le e^{-\tau^2}\int_{\gamma> 0}(\gamma+\tau)\gamma^2 e^{-\gamma^2}\,d\gamma
= e^{-\tau^2}\Big(\frac{\tau\sqrt{\pi}}{4}+\frac{1}{2}\Big). \qquad (31)
\]
Using (30) and (31) in the formula for δ in Theorem III.5 establishes that δ → 0 as τ → ∞. Therefore, in order to find the asymptotic behavior of the phase transition as δ → 0, we can calculate the asymptotic behavior of δ and ρ as τ → ∞. This is a standard application of Laplace's method. Using this method we calculate the leading terms:
\[
\int_\tau^\infty \omega(\tau-\omega)e^{-\omega^2}\,d\omega \sim -\frac{e^{-\tau^2}}{4\tau}, \qquad \tau\to\infty, \qquad (32)
\]
\[
\int_\tau^\infty \omega(\omega-\tau)^2 e^{-\omega^2}\,d\omega \sim \frac{e^{-\tau^2}}{4\tau^2}, \qquad \tau\to\infty. \qquad (33)
\]
Plugging (32) and (33) into the formulas we have for ρ and δ in Theorem III.5, we obtain
\[
\delta \sim \frac{e^{-\tau^2}}{2}, \qquad \rho \sim \frac{1}{\tau^2}, \qquad \tau\to\infty,
\]
which completes the proof. ∎
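The closed-form expressions above are straightforward to evaluate numerically. The sketch below (our own illustration; the quadrature tolerances and the τ grid are arbitrary choices) computes χ₂(τ), maximizes the resulting expression for ρ over τ for each δ, and compares the small-δ regime with the asymptotics δ ∼ e^{-τ²}/2, ρ ∼ 1/τ².

    import numpy as np
    from scipy.integrate import quad

    def chi2(tau):
        # chi_2(tau) = int_tau^inf w (w - tau)^2 exp(-w^2) dw
        val, _ = quad(lambda w: w * (w - tau) ** 2 * np.exp(-w ** 2), tau, np.inf)
        return val

    def rho_of(delta, tau):
        c2 = chi2(tau)
        return (delta - 2 * c2) / (delta * (1 + tau ** 2 - 2 * c2))

    def rho_se(delta, taus=np.linspace(0.05, 6.0, 600)):
        # phase transition: maximize rho over the threshold parameter tau
        return max(rho_of(delta, t) for t in taus)

    for delta in [0.5, 0.25, 0.1, 0.05]:
        print(delta, rho_se(delta))

    # small-delta check against the asymptotics of Theorem III.6
    tau = 3.0
    delta_asym = np.exp(-tau ** 2) / 2
    print(rho_se(delta_asym), 1 / tau ** 2)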

F. Proof of Lemma III.7

According to Lemma III.2, the phase θ does not affect the risk function, and therefore we set it to zero. We have
\[
r(\mu,\tau) = \mathbb{E}\big|\eta(\mu+Z_1+iZ_2;\tau)-\mu\big|^2
= \mathbb{E}\big(\eta^R(\mu+Z_1+iZ_2;\tau)-\mu\big)^2
+ \mathbb{E}\big(\eta^I(\mu+Z_1+iZ_2;\tau)\big)^2,
\]
where η^R(µ + Z₁ + iZ₂; τ) = (µ + Z₁ − τ(µ + Z₁)/A) 1(A ≥ τ), η^I(µ + Z₁ + iZ₂; τ) = (Z₂ − τZ₂/A) 1(A ≥ τ), and A ≜ √((µ + Z₁)² + Z₂²). If we calculate the derivative of the risk function with respect to µ, then we have
\[
\frac{d r(\mu,\tau)}{d\mu}
= 2\,\mathbb{E}\Big[\big(\eta^R(\mu+Z_1+iZ_2;\tau)-\mu\big)\Big(\frac{d\eta^R}{d\mu}-1\Big)\Big]
+ 2\,\mathbb{E}\Big[\eta^I(\mu+Z_1+iZ_2;\tau)\,\frac{d\eta^I}{d\mu}\Big],
\]
with dη^R/dµ = (1 − τZ₂²/A³) 1(A ≥ τ) and dη^I/dµ = (τ(µ + Z₁)Z₂/A³) 1(A ≥ τ). Expanding the two regions {A ≤ τ} and {A ≥ τ}, the terms involving Z₁ cancel and we obtain
\[
\frac{d r(\mu,\tau)}{d\mu}
= 2\mu\,\mathbb{E}\big[\mathbb{1}(A\le\tau)\big] + 2\tau\mu\,\mathbb{E}\Big[\frac{Z_2^2}{A^3}\,\mathbb{1}(A\ge\tau)\Big] \ge 0.
\]
Therefore, the risk of the complex soft thresholding is an increasing function of µ. Furthermore,
\[
\frac{d r(\mu,\tau)}{d(\mu^2)} = \frac{1}{2\mu}\,\frac{d r(\mu,\tau)}{d\mu}
= \mathbb{E}\big[\mathbb{1}(A\le\tau)\big] + \tau\,\mathbb{E}\Big[\frac{Z_2^2}{A^3}\,\mathbb{1}(A\ge\tau)\Big].
\]
It is clear that the next derivative with respect to µ² is negative, and therefore the function is concave. ∎

G. Proof of Proposition III.8

As is clear from the statement of the proposition, the main challenge here is to characterize
\[
\sup_{q\in\mathcal{F}} \mathbb{E}\big|\eta(X+Z_1+iZ_2;\tau)-X\big|^2.
\]
Let X = µe^{iθ}, where µ and θ are the amplitude and phase of X, respectively. According to Lemma III.2, the risk function is independent of θ. Furthermore, since q ∈ F, we can write it as q(µ) = (1 − ε)δ₀(µ) + εG(µ), where G is absolutely continuous with respect to the Lebesgue measure. We then have
\[
\mathbb{E}\big|\eta(\mu+Z_1+iZ_2;\tau)-\mu\big|^2
= (1-\epsilon)\,\mathbb{E}\big|\eta(Z_1+iZ_2;\tau)\big|^2
+ \epsilon\,\mathbb{E}_{\mu\sim G}\mathbb{E}_\mu\big|\eta(\mu+Z_1+iZ_2;\tau)-\mu\big|^2
\]
\[
= 2(1-\epsilon)\int_\tau^\infty \omega(\omega-\tau)^2 e^{-\omega^2}\,d\omega
+ \epsilon\,\mathbb{E}_{\mu\sim G}\mathbb{E}_\mu\big|\eta(\mu+Z_1+iZ_2;\tau)-\mu\big|^2. \qquad (34)
\]
Here E_{µ∼G} denotes the expectation with respect to µ, whose distribution is G, and E_µ represents the conditional expectation given the random variable µ. Define δ_m(µ) ≜ δ(µ − m). Using Lemma III.7 and the Jensen inequality we prove that {G_m(µ)}_{m=1}^∞, G_m(µ) = δ_m(µ), is the least favorable sequence of distributions, i.e., for any distribution G
\[
\mathbb{E}_{\mu\sim G}\mathbb{E}_\mu\big|\eta(\mu+Z_1+iZ_2;\tau)-\mu\big|^2
\le \lim_{m\to\infty}\mathbb{E}_{\mu\sim G_m}\mathbb{E}_\mu\big|\eta(\mu+Z_1+iZ_2;\tau)-\mu\big|^2.
\]
Toward this end we define G̃(µ) as δ_{µ₀}(µ) such that µ₀² = E_G(µ²). In other words, G̃ and G have the same second moments. From the Jensen inequality we have
\[
\mathbb{E}_{\mu\sim G}\mathbb{E}_\mu\big|\eta(\mu+Z_1+iZ_2;\tau)-\mu\big|^2
\le \mathbb{E}_{\mu\sim\tilde G}\mathbb{E}_\mu\big|\eta(\mu+Z_1+iZ_2;\tau)-\mu\big|^2.
\]
Furthermore, from the monotonicity of the risk function proved in Lemma III.7, we have
\[
\mathbb{E}_{\mu\sim\tilde G}\mathbb{E}_\mu\big|\eta(\mu+Z_1+iZ_2;\tau)-\mu\big|^2
\le \mathbb{E}_{\mu\sim G_m}\mathbb{E}_\mu\big|\eta(\mu+Z_1+iZ_2;\tau)-\mu\big|^2 \qquad \forall\, m>\mu_0.
\]
Again we can use the monotonicity of the risk function to prove that
\[
\mathbb{E}_{\mu\sim G_m}\mathbb{E}_\mu\big|\eta(\mu+Z_1+iZ_2;\tau)-\mu\big|^2
\le \lim_{m\to\infty}\mathbb{E}_{\mu\sim G_m}\mathbb{E}_\mu\big|\eta(\mu+Z_1+iZ_2;\tau)-\mu\big|^2
= 1+\tau^2. \qquad (35)
\]
The last equality is the result of the monotone convergence theorem. Combining (34) and (35) completes the proof. ∎
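A quick Monte Carlo check of the boundary value in (35) (our own snippet; the sample size and the grid of amplitudes are arbitrary): the conditional risk E_µ|η(µ + Z₁ + iZ₂; τ) − µ|² increases with µ, in line with Lemma III.7, and approaches 1 + τ² as µ grows.

    import numpy as np

    rng = np.random.default_rng(3)
    tau = 1.5
    Z1 = rng.normal(0, np.sqrt(0.5), 500_000)
    Z2 = rng.normal(0, np.sqrt(0.5), 500_000)

    def risk(mu):
        u = (mu + Z1) + 1j * Z2
        amp = np.abs(u)
        eta_u = np.where(amp > tau, (1 - tau / amp) * u, 0.0)
        return np.mean(np.abs(eta_u - mu) ** 2)

    for mu in [0.0, 0.5, 1.0, 2.0, 5.0, 20.0]:
        print(mu, risk(mu))
    print("1 + tau^2 =", 1 + tau ** 2)
    # the printed risks increase with mu and saturate near 1 + tau^2 for large mu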
VI. CONCLUSIONS

We have considered the problem of recovering a complex-valued sparse signal from an undersampled set of complex-valued measurements. We have accurately analyzed the asymptotic performance of the c-LASSO and CAMP algorithms. Using the state evolution framework, we have derived simple expressions for the noise sensitivity and phase transition of these two algorithms. The results presented here show that substantial improvements can be achieved when the real and imaginary parts are considered jointly by the recovery algorithm. For instance, Theorem III.6 shows that in the high undersampling regime the phase transition of CAMP and c-BP is two times higher than the phase transition of r-LASSO.

ACKNOWLEDGEMENTS

Thanks to David Donoho and Andrea Montanari for their encouragement and their valuable suggestions on an early draft of this paper. We would also like to thank the reviewers and the associate editor for the thoughtful comments that helped us improve the quality of the manuscript, and to thank Ali Mousavi for careful reading of our paper and suggesting improvements. This work was supported by the grants NSF CCF-0431150, CCF-0926127, and CCF-1117939; DARPA/ONR N66001-11-C-4092 and N66001-11-1-4090; ONR N00014-08-1-1112, N00014-10-1-0989, and N00014-11-1-0714; AFOSR FA9550-09-1-0432; ARO MURI W911NF-07-1-0185 and W911NF-09-1-0383; and the TI Leadership University Program.

REFERENCES

[1] J. A. Tropp and S. J. Wright. Computational methods for sparse solution of linear inverse problems. Proc. IEEE, 98:948–958, 2010.
[2] S. S. Chen, D. L. Donoho, and M. A. Saunders. Atomic decomposition by basis pursuit. SIAM J. on Sci. Computing, 20:33–61, 1998.
[3] D. L. Donoho, A. Maleki, and A. Montanari. Message passing algorithms for compressed sensing. Proc. Natl. Acad. Sci., 106(45):18914–18919, 2009.
[4] E. van den Berg and M. P. Friedlander. Probing the Pareto frontier for basis pursuit solutions. SIAM J. on Sci. Computing, 31(2):890–912, 2008.
[5] Z. Yang, C. Zhang, J. Deng, and W. Lu. Orthonormal expansion ℓ1-minimization algorithms for compressed sensing. Preprint, 2010.
[6] A. Maleki and D. L. Donoho. Optimally tuned iterative thresholding algorithms for compressed sensing. IEEE J. Select. Top. Signal Processing, Apr. 2010.
[7] M. Lustig, D. L. Donoho, and J. Pauly. Sparse MRI: The application of compressed sensing for rapid MR imaging. Mag. Resonance Med., 58(6):1182–1195, Dec. 2007.
[8] L. Anitori, A. Maleki, M. Otten, R. G. Baraniuk, and W. van Rossum. Compressive CFAR radar detection. In Proc. IEEE Radar Conference (RADAR), pages 320–325, May 2012.
[9] R. G. Baraniuk and P. Steeghs. Compressive radar imaging. In Proc. IEEE Radar Conference (RADAR), pages 128–133, Apr. 2007.
[10] M. A. Herman and T. Strohmer. High-resolution radar via compressed sensing. IEEE Trans. Signal Processing, 57(6):2275–2284, Jun. 2009.
[11] D. L. Donoho, A. Maleki, and A. Montanari. Noise sensitivity phase transition. Submitted to IEEE Trans. Inform. Theory, 2010.
[12] J. P. Vila and P. Schniter. Expectation-maximization Gaussian-mixture approximate message passing. Preprint arXiv:1207.3107v1, 2012.
[13] S. Som, L. C. Potter, and P. Schniter. Compressive imaging using approximate message passing and a Markov-tree prior. In Proc. Asilomar Conf. Signals, Systems, and Computers, pages 243–247, Nov. 2010.
[14] R. Tibshirani. Regression shrinkage and selection via the Lasso. J. Roy. Stat. Soc. Series B, 58(1):267–288, 1996.
[15] A. Maleki and R. G. Baraniuk. Least favorable compressed sensing problems for the first order methods. In Proc. IEEE Int. Symp. Inform. Theory (ISIT), 2011.
[16] M. Bayati and A. Montanari. The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Trans. Inform. Theory, 57:764–785, 2011.
[17] G. Taubock and F. Hlawatsch. A compressed sensing technique for OFDM channel estimation in mobile environments: Exploiting channel sparsity for reducing pilots. In Proc. IEEE Int. Conf. Acoust., Speech, and Signal Processing (ICASSP), 2008.
[18] J. S. Picard and A. J. Weiss. Direction finding of multiple emitters by spatial sparsity and linear programming. In Proc. IEEE Int. Symp. Inform. Theory (ISIT), pages 1258–1262, Sept. 2009.
[19] D. L. Donoho and J. Tanner. Precise undersampling theorems. Proc. of the IEEE, 98(6):913–924, Jun. 2010.
[20] D. L. Donoho and J. Tanner. Observed universality of phase transitions in high-dimensional geometry, with applications in modern signal processing and data analysis. Philos. Trans. Roy. Soc. A, 367(1906):4273–4293, 2009.
[21] M. Figueiredo, R. Nowak, and S. J. Wright. Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems. IEEE J. Select. Top. Signal Processing, 1(4):586–598, 2007.
[22] S. J. Wright, R. Nowak, and M. Figueiredo. Sparse reconstruction by separable approximation. In Proc. IEEE Int. Conf. Acoust., Speech, and Signal Processing (ICASSP), 2009.
[23] S. J. Kim, K. Koh, M. Lustig, S. Boyd, and D. Gorinevsky. A method for large-scale ℓ1-regularized least squares. IEEE J. Select. Top. Signal Processing, 1(4):606–617, Dec. 2007.
[24] J. Huang and T. Zhang. The benefit of group sparsity. Preprint arXiv:0901.2962, 2009.
[25] J. Peng, J. Zhu, A. Bergamaschi, W. Han, D. Y. Noh, J. R. Pollack, and P. Wang. Regularized multivariate regression for identifying master predictors with application to integrative genomics study of breast cancer. Ann. Appl. Stat., 4(1):53–77, 2010.
[26] M. F. Duarte, W. U. Bajwa, and R. Calderbank. The performance of group Lasso for linear regression of grouped variables. Technical Report TR-2010-10, Dept. Computer Science, Duke University, Durham, NC, 2011.
[27] E. van den Berg and M. P. Friedlander. Theoretical and empirical results for recovery from multiple measurements. IEEE Trans. Inform. Theory, 56(5):2516–2527, 2010.
[28] J. Chen and X. Huo. Theoretical results on sparse representations of multiple-measurement vectors. IEEE Trans. Signal Processing, 54(12):4634–4643, 2006.
[29] Y. C. Eldar, P. Kuppinger, and H. Bolcskei. Block-sparse signals: Uncertainty relations and efficient recovery. IEEE Trans. Signal Processing, 58(6):3042–3054, 2010.
[30] X. Lv, G. Bi, and C. Wan. The group Lasso for stable recovery of block-sparse signal representations. IEEE Trans. Signal Processing, 59(4):1371–1382, 2011.
[31] M. Stojnic, F. Parvaresh, and B. Hassibi. On the reconstruction of block-sparse signals with an optimal number of measurements. IEEE Trans. Signal Processing, 57(8):3075–3085, 2009.
[32] M. Stojnic. ℓ2/ℓ1-optimization in block-sparse compressed sensing and its strong thresholds. IEEE J. Select. Top. Signal Processing, 4(2):350–357, 2010.
[33] M. Stojnic. Block-length dependent thresholds in block-sparse compressed sensing. Preprint arXiv:0907.3679, 2009.
[34] S. Ji, D. Dunson, and L. Carin. Multi-task compressive sensing. IEEE Trans. Signal Processing, 57(1):92–106, 2009.
[35] S. Bakin. Adaptive regression and model selection in data mining problems. Ph.D. Thesis, Australian National University, 1999.
[36] L. Meier, S. Van De Geer, and P. Buhlmann. The group Lasso for logistic regression. J. Roy. Statist. Soc. Ser. B, 70(Part 1):53–71, 2008.
[37] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. J. Roy. Statist. Soc. Ser. B, 68(1):49–67, 2006.
[38] F. R. Bach. Consistency of the group Lasso and multiple kernel learning. J. Machine Learning Research, 9:1179–1225, Jun. 2008.
[39] Y. Nardi and A. Rinaldo. On the asymptotic properties of the group Lasso estimator for linear models. Electron. J. Statist., 2:605–633, 2008.
[40] K. Lounici, M. Pontil, A. B. Tsybakov, and S. van de Geer. Taking advantage of sparsity in multi-task learning. Preprint arXiv:0903.1468v1, 2009.
[41] D. Malioutov, M. Cetin, and A. S. Willsky. A sparse signal reconstruction perspective for source localization with sensor arrays. IEEE Trans. Signal Processing, 53(8):3010–3022, Aug. 2005.
[42] D. Needell and J. A. Tropp. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Appl. Comput. Harmon. Anal., 26(3):301–321, 2008.
[43] T. Blumensath and M. E. Davies. How to use the iterative hard thresholding algorithm. In Proc. Work. Struc. Parc. Rep. Adap. Signaux (SPARS), 2009.
[44] T. Blumensath and M. E. Davies. Iterative hard thresholding for compressed sensing. Appl. Comput. Harmon. Anal., 27(3):265–274, 2009.
[45] D. L. Donoho, I. Drori, Y. Tsaig, and J. L. Starck. Sparse solution of underdetermined linear equations by stagewise orthogonal matching pursuit. Stanford Statistics Department Technical Report, 2006.
[46] M. Bayati and A. Montanari. The Lasso risk for Gaussian matrices. Preprint arXiv:1008.2581v1, 2011.
[47] A. Maleki. Approximate message passing algorithm for compressed sensing. Ph.D. Thesis, Stanford University, 2011.
[48] D. L. Donoho, A. Maleki, and A. Montanari. Construction of message passing algorithms for compressed sensing. Preprint, 2011.
[49] G. Obozinski, M. J. Wainwright, and M. I. Jordan. Support union recovery in high-dimensional multivariate regression. Ann. Stat., 39:1–47, 2011.
[50] P. Schniter. A message-passing receiver for BICM-OFDM over unknown clustered-sparse channels. IEEE J. Select. Top. Signal Processing, 5(8):1462–1474, Dec. 2011.
[51] P. Schniter. Turbo reconstruction of structured sparse signals. In Proc. IEEE Conf. Inform. Science and Systems (CISS), pages 1–6, Mar. 2010.
[52] U. Kamilov, S. Rangan, A. Fletcher, and M. Unser. Approximate message passing with consistent parameter estimation and applications to sparse learning. Submitted to IEEE Trans. Inform. Theory, 2012.
[53] D. L. Donoho, I. M. Johnstone, and A. Montanari. Accurate prediction of phase transition in compressed sensing via a connection to minimax denoising. Nov. 2011.
[54] R. G. Baraniuk, V. Cevher, M. F. Duarte, and C. Hegde. Model-based compressive sensing. IEEE Trans. Inform. Theory, 56(4):1982–2001, Apr. 2010.
[55] S. Jalali and A. Maleki. Minimum complexity pursuit. In Proc. Allerton Conf. Communication, Control, and Computing, pages 1764–1770, Sep. 2011.
[56] D. L. Donoho, I. M. Johnstone, A. Maleki, and A. Montanari. Compressed sensing over ℓp-balls: Minimax mean square error. Preprint arXiv:1103.1943v2, 2011.
[57] Z. Yang and C. Zhang. Sparsity-undersampling tradeoff of compressed sensing in the complex domain. In Proc. IEEE Int. Conf. Acoust., Speech, and Signal Processing (ICASSP), 2011.
[58] Z. Yang, C. Zhang, and L. Xie. On phase transition of compressed sensing in the complex domain. IEEE Sig. Processing Letters, 19(1):47–50, 2012.
[59] P. L. Combettes and V. R. Wajs. Signal recovery by proximal forward-backward splitting. SIAM J. Multiscale Model. Simul., 4(4):1168–1200, 2005.

Arian Maleki received his Ph.D. in electrical engineering from Stanford University in 2010. After spending 2011-2012 in the DSP group at Rice Uni- versity, he joined Columbia University as an Assistant Professor of Statistics. His research interests include compressed sensing, statistics, machine learning, signal processing and optimization. He received his M.Sc. in statistics from Stanford University, and B.Sc. and M.Sc. both in electrical engineering from Sharif University of Technology.

Laura Anitori (S'09) received the M.Sc. degree (summa cum laude) in telecommunication engineering from the University of Pisa, Pisa, Italy, in 2005. From 2005 to 2007, she worked as a research assistant in the Telecommunication Department, University of Twente, Enschede, The Netherlands, on importance sampling and its application to STAP detectors. Since January 2007, she has been with the Radar Technology Department at TNO, The Hague, The Netherlands. Her research interests include radar signal processing, FMCW and ground penetrating radar, and airborne surveillance systems. Currently, she is also working towards her Ph.D. degree, in cooperation with the Department of Geoscience and Remote Sensing at Delft University of Technology, Delft, The Netherlands, on the topic of compressive sensing and its applications to radar detection. In 2012, she received the second prize in the best student paper competition at the IEEE Radar Conference.

Zai Yang was born in Anhui Province, China, in 1986. He received the B.S. degree in mathematics and the M.Sc. degree in applied mathematics in 2007 and 2009, respectively, from Sun Yat-sen (Zhongshan) University, China. He is currently pursuing the Ph.D. degree in electrical and electronic engineering at Nanyang Technological University, Singapore. His current research interests include compressed sensing and its applications to source localization and magnetic resonance imaging.

Richard G. Baraniuk received the B.Sc. degree in 1987 from the University of Manitoba (Canada), the M.Sc. degree in 1988 from the University of Wisconsin-Madison, and the Ph.D. degree in 1992 from the University of Illinois at Urbana-Champaign, all in Electrical Engineering. After spending 1992–1993 with the Signal Processing Laboratory of École Normale Supérieure in Lyon, France, he joined Rice University, where he is currently the Victor E. Cameron Professor of Electrical and Computer Engineering. His research interests lie in the area of signal processing and machine learning. Dr. Baraniuk received a NATO postdoctoral fellowship from NSERC in 1992, the National Young Investigator award from the National Science Foundation in 1994, a Young Investigator Award from the Office of Naval Research in 1995, the Rosenbaum Fellowship from the Isaac Newton Institute of Cambridge University in 1998, the C. Holmes MacDonald National Outstanding Teaching Award in 1999, the University of Illinois ECE Young Alumni Achievement Award in 2000, the Tech Museum Laureate Award from the Tech Museum of Innovation in 2006, the Wavelet Pioneer Award from SPIE in 2008, the Internet Pioneer Award from the Berkman Center for Internet and Society at Harvard Law School in 2008, the World Technology Network Education Award and IEEE Signal Processing Society Magazine Column Award in 2009, the IEEE-SPS Education Award in 2010, the WISE Education Award in 2011, and the SPIE Compressive Sampling Pioneer Award in 2012. In 2007, he was selected as one of Edutopia Magazine's Daring Dozen educators, and the Rice single-pixel compressive camera was selected by MIT Technology Review Magazine as a TR10 Top 10 Emerging Technology. He was elected a Fellow of the IEEE in 2001 and of AAAS in 2009.