Structured Signal Recovery from Non-Linear and Heavy-Tailed Measurements
Larry Goldstein, Stanislav Minsker and Xiaohan Wei
Abstract—We study high-dimensional signal recovery from non-linear measurements with design vectors having an elliptically symmetric distribution. Special attention is devoted to the situation when the unknown signal belongs to a set of low statistical complexity, while both the measurements and the design vectors are heavy-tailed. We propose and analyze a new estimator that adapts to the structure of the problem, while being robust both to possible model misspecification, characterized by arbitrary non-linearity of the measurements, and to data corruption, modeled by heavy-tailed distributions. Moreover, this estimator has low computational complexity. Our results are expressed in the form of exponential concentration inequalities for the error of the proposed estimator. On the technical side, our proofs rely on generic chaining methods and illustrate the power of this approach for statistical applications. The theory is supported by numerical experiments demonstrating that our estimator outperforms existing alternatives when the data is heavy-tailed.

Index Terms—Signal reconstruction, non-linear measurements, heavy-tailed noise, elliptically symmetric distribution.

L. Goldstein and S. Minsker are with the Department of Mathematics, University of Southern California, Los Angeles, CA (email: [email protected]; [email protected]). X. Wei is with the Department of Electrical Engineering, University of Southern California, Los Angeles, CA (email: [email protected]). S. Minsker and X. Wei were supported in part by National Science Foundation grant DMS-1712956.

I. INTRODUCTION

Let (x, y) ∈ ℝᵈ × ℝ be a random couple with distribution P governed by the semi-parametric single index model

  y = f(⟨x, θ∗⟩, δ),    (1)

where x is a measurement vector with marginal distribution Π, δ is a noise variable that is assumed to be independent of x, θ∗ ∈ ℝᵈ is a fixed but otherwise unknown signal (the "index vector"), and f : ℝ² → ℝ is an unknown link function.¹ Here and in what follows, ⟨·, ·⟩ denotes the Euclidean dot product. We impose mild conditions on f; in particular, it is not assumed that f is convex, or even continuous. Our goal is to estimate the signal θ∗ from the training data (x₁, y₁), …, (x_m, y_m), a sequence of i.i.d. copies of (x, y) defined on a probability space (Ω, 𝓑, P). Since f(a⁻¹⟨x, aθ∗⟩, δ) = f(⟨x, θ∗⟩, δ) for any a > 0, the best one can hope for is to recover θ∗ up to a scaling factor. Hence, without loss of generality, we will assume that θ∗ satisfies ‖Σ^{1/2}θ∗‖₂² := ⟨Σ^{1/2}θ∗, Σ^{1/2}θ∗⟩ = 1, where Σ = E(x − Ex)(x − Ex)ᵀ is the covariance matrix of x.

¹ The precise assumption on f is specified after (16).

In many applications, θ∗ possesses special structure, such as sparsity or low rank (when θ∗ ∈ ℝ^{d₁×d₂}, d₁d₂ = d, is a matrix). To incorporate such structural assumptions into the problem, we will assume that θ∗ is an element of a closed set Θ ⊆ ℝᵈ of small "statistical complexity," characterized by its Gaussian mean width [1].

The past decade has witnessed significant progress related to estimation in high-dimensional spaces, both in theory and in applications. Notable examples include sparse linear regression ([2], [3], [4]), low-rank matrix recovery ([5], [6], [7]), and mixed structure recovery [8]. However, the majority of the aforementioned works assume that the link function f is linear, and their results apply only to this particular case.

Generally, the task of estimating the index vector requires approximating the link function f or its derivative, assuming that it exists (the so-called Average Derivative Method); see [9], [10], [11]. However, when the measurement vector x ∼ N(0, I_{d×d}), a somewhat surprising result states that one can estimate θ∗ directly, avoiding the preliminary link-function estimation step completely. More specifically, it has been proven in [12] that ηθ∗ = argmin_{θ∈ℝᵈ} E(y − ⟨θ, x⟩)², where η = E y⟨x, θ∗⟩. Here is the short proof of this fact:

  argmin_{θ∈ℝᵈ} E(y − ⟨θ, x⟩)²
    = argmin_{θ∈ℝᵈ} [ ‖θ‖₂² − 2 E y⟨x, θ⟩ ]
    = argmin_{θ∈ℝᵈ} [ ‖θ‖₂² − 2 E y⟨x, θ∗⟩⟨θ, θ∗⟩ − 2 E y⟨x, θ∗^⊥⟩⟨θ, θ∗^⊥⟩ ]
    = argmin_{θ∈ℝᵈ} [ ‖θ‖₂² − 2 E y⟨x, θ∗⟩⟨θ, θ∗⟩ ]
    = argmin_{θ∈ℝᵈ} ‖θ − ηθ∗‖₂²,

where θ∗^⊥ denotes the unit vector in the (θ∗, θ) plane orthogonal to θ∗, and the third equality follows from the fact that ⟨x, θ∗^⊥⟩ is a centered Gaussian random variable independent of ⟨x, θ∗⟩, and hence also independent of y.

Later, in [13], this result was extended to the more general class of elliptically symmetric distributions, which includes the Gaussian distribution as a special case; see Lemma 5. In general, it is not always possible to recover θ∗: see [14] for an example in the case when f(x, δ) = sign(x) + δ (so-called "1-bit compressed sensing" [15]).

Y. Plan, R. Vershynin and E. Yudovina recently presented a non-asymptotic study for the case of Gaussian measurements in the context of high-dimensional structured estimation, see [16], [17], [18]; we refer the reader to [19], [14], [20], [21] for further details. On a high level, these works show that when the xⱼ's are Gaussian, the nonlinearity can be treated as an additional noise term. To give an example, [17] and [16] demonstrate that, under the same model (1), when xⱼ ∼ N(0, I_{d×d}) and θ∗ ∈ Θ, solving the constrained problem

  θ̂ = argmin_{θ∈Θ} ‖y − Xθ‖₂²,    (2)

with y = [y₁ ⋯ y_m]ᵀ and X = (1/√m)[x₁ ⋯ x_m]ᵀ, results in the following bound: with probability ≥ 0.99,

  ‖θ̂ − ηθ∗‖₂ ≤ C (σ₁ ω(D(Θ, ηθ∗) ∩ S^{d−1}) + σ₂) / √m,    (3)

where (with formal definitions to follow in Section II) C is an absolute constant,

  g ∼ N(0, 1),  η = E(f(g, δ)g),  σ₁² = E(f(g, δ) − ηg)²,  σ₂² = E(f(g, δ) − ηg)²g²,

S^{d−1} is the unit sphere in ℝᵈ, D(Θ, θ) is the descent cone of Θ at the point θ, and ω(T) is the Gaussian mean width of a subset T ⊂ ℝᵈ. A different approach to estimation of the index vector in model (1), with similar recovery guarantees, has been developed in [21]. However, the key assumption adopted in all these works, namely that the vectors {xⱼ}ⱼ₌₁ᵐ follow Gaussian distributions, precludes situations where the measurements are heavy-tailed, and hence might be overly restrictive for some practical applications; for example, the noise and outliers observed in high-dimensional image recovery often exhibit heavy-tailed behavior, see [22].

As we mentioned above, the authors of [13] have shown that direct consistent estimation of θ∗ is possible when Π belongs to a family of elliptically symmetric distributions. Our main contribution is a non-asymptotic analysis of this scenario, with a particular focus on the case when d > n and θ∗ possesses special structure. Importantly, only weak tail assumptions are imposed on the response variable y: for example, when the link function satisfies f(⟨x, θ∗⟩, δ) = f̃(⟨x, θ∗⟩) + δ, it is only assumed that δ possesses 2 + ε moments, for some ε > 0. In [17], Y. Plan and R. Vershynin present an analysis of the Gaussian case and ask, "Can the same kind of accuracy be expected for random non-Gaussian matrices?" In this paper, we give a positive answer to their question. To achieve our goal, we propose a Lasso-type estimator that admits tight probabilistic guarantees in the spirit of (3) despite weak tail assumptions (see Theorem 1 below for details).

Proofs of related non-asymptotic results in the literature rely on special properties of Gaussian measures (see, for example, [23]). To handle the wider class of elliptically symmetric distributions, we rely on recent developments in generic chaining methods, see [24], [25]. These general tools could prove useful in developing further extensions to a wider class of design distributions.

II. DEFINITIONS AND BACKGROUND MATERIAL

This section introduces the main notation and the key facts related to elliptically symmetric distributions, convex geometry, and empirical processes. The results of this section will be used repeatedly throughout the paper. For a unified treatment of vectors and matrices, it will be convenient to regard a vector v ∈ ℝ^{d×1} as a d × 1 matrix. Let d₁, d₂ ∈ ℕ be such that d₁d₂ = d. Given v₁, v₂ ∈ ℝ^{d₁×d₂}, the Euclidean dot product is then defined as ⟨v₁, v₂⟩ = tr(v₁ᵀv₂), where tr(·) stands for the trace of a matrix and vᵀ denotes the transpose of v.

The ℓ₁-norm of v ∈ ℝᵈ is defined as ‖v‖₁ = Σⱼ₌₁ᵈ |vⱼ|. The nuclear norm of a matrix v ∈ ℝ^{d₁×d₂} is ‖v‖∗ = Σⱼ₌₁^{min(d₁,d₂)} σⱼ(v), where σⱼ(v), j = 1, …, min(d₁, d₂), stand for the singular values of v, and the operator norm is defined as ‖v‖ = max_{j=1,…,min(d₁,d₂)} σⱼ(v).

A. Elliptically symmetric distributions

A centered random vector x ∈ ℝᵈ has an elliptically symmetric (alternatively, elliptically contoured, or simply elliptical) distribution with parameters Σ and F_μ, denoted x ∼ E(0, Σ, F_μ), if

  x =_d μBU,    (4)

where =_d denotes equality in distribution, μ is a scalar random variable with cumulative distribution function F_μ, B is a fixed d × d matrix such that Σ = BBᵀ, and U is uniformly distributed over the unit sphere S^{d−1} and independent of μ.
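The stochastic representation (4) gives a direct recipe for sampling from an elliptical distribution: draw U uniformly on the sphere by normalizing a standard Gaussian vector, draw the radial variable μ, and map through B. Since E[UUᵀ] = I/d, the covariance of x equals (E[μ²]/d)·BBᵀ = (E[μ²]/d)·Σ. The NumPy sketch below checks this identity empirically for a heavy-tailed radial law; the matrix B and the choice of μ as the absolute value of a t-distributed variable are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 3, 100_000

# Illustrative fixed d x d matrix B (an assumption for this demo).
B = np.array([[2.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.3, 1.5]])
Sigma = B @ B.T

# U uniform on the unit sphere S^{d-1}: normalize standard Gaussian vectors.
G = rng.standard_normal((n, d))
U = G / np.linalg.norm(G, axis=1, keepdims=True)

# Heavy-tailed scalar radius mu; |t_5| is an illustrative choice with E[mu^2] finite.
mu = np.abs(rng.standard_t(df=5, size=n))

# x = mu * B U, sample by sample, as in (4).
X = mu[:, None] * (U @ B.T)

# Check E[x x^T] = (E[mu^2] / d) * Sigma, which follows from E[U U^T] = I/d.
emp_cov = X.T @ X / n
scale = np.mean(mu ** 2) / d
rel_err = np.linalg.norm(emp_cov - scale * Sigma) / np.linalg.norm(scale * Sigma)
print(f"relative covariance error: {rel_err:.3f}")
```

Note that a heavier radial law (e.g. t with 3 degrees of freedom) would still have a finite second moment, but the empirical covariance would converge much more slowly, which is precisely the heavy-tailed regime the paper targets.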
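The Gaussian mean width ω(T) = E sup_{t∈T} ⟨g, t⟩, with g ∼ N(0, I_{d×d}), that governs the bound (3) can be estimated by straightforward Monte Carlo. The sketch below contrasts the full sphere S^{d−1}, whose width E‖g‖₂ is close to √d, with a finite set of k unit vectors, whose width is at most √(2 log k); the dimensions and set sizes are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n_mc, k = 100, 2000, 1000

g = rng.standard_normal((n_mc, d))  # Monte Carlo draws of g ~ N(0, I)

# For T = S^{d-1}, sup_{t in T} <g, t> = ||g||_2, so omega(T) = E||g||_2 ~ sqrt(d).
omega_sphere = np.linalg.norm(g, axis=1).mean()

# For a finite set of k unit vectors, omega(T) <= sqrt(2 log k): much smaller.
T = rng.standard_normal((k, d))
T /= np.linalg.norm(T, axis=1, keepdims=True)
omega_finite = (g @ T.T).max(axis=1).mean()

print(f"omega(sphere) = {omega_sphere:.2f}  (sqrt(d)       = {d ** 0.5:.2f})")
print(f"omega(finite) = {omega_finite:.2f}  (sqrt(2 log k) = {(2 * np.log(k)) ** 0.5:.2f})")
```

The gap between the two estimates illustrates why restricting the error to the descent cone D(Θ, ηθ∗) ∩ S^{d−1} in (3), rather than the whole sphere, is what makes high-dimensional recovery possible with m ≪ d measurements.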
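The identity ηθ∗ = argmin_θ E(y − ⟨θ, x⟩)² from [12], proved in the introduction, can be checked numerically: under Gaussian design, plain least squares recovers the direction of θ∗ despite the non-linearity of the link. In the minimal simulation below, the link f(t, δ) = tanh(t) + δ and all problem sizes are illustrative assumptions, not choices made in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 20, 50_000

theta_star = np.zeros(d)
theta_star[0] = 1.0  # unit-norm index vector

# Illustrative non-linear link (an assumption): f(t, delta) = tanh(t) + delta.
X = rng.standard_normal((m, d))          # Gaussian design, x_j ~ N(0, I)
y = np.tanh(X @ theta_star) + rng.standard_normal(m)

# Plain least squares: argmin_theta sum_i (y_i - <theta, x_i>)^2.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# theta_hat estimates eta * theta_star, so the *direction* of theta_star
# is recovered even though the link function was never estimated.
alignment = abs(theta_hat @ theta_star) / np.linalg.norm(theta_hat)
print(f"cosine alignment with theta_star: {alignment:.3f}")
```

Here η = E[f(g, δ)g] ≈ 0.61 for the tanh link, so θ̂ is shrunk relative to θ∗, but its direction is essentially unbiased, which is all one can hope for under the scaling ambiguity discussed after model (1).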
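In the regime d > m the unconstrained least-squares problem is ill-posed, and an ℓ₁-penalized (Lasso-type) surrogate of the constrained program (2) is natural when θ∗ is sparse. The sketch below runs ISTA (proximal gradient descent) on measurements from a 1-bit-type link under Gaussian design. This is a generic Lasso illustration, not the robust estimator proposed in the paper, and the penalty level `lam` is hand-tuned for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, s = 400, 200, 5  # high-dimensional regime d > m, s-sparse signal

theta_star = np.zeros(d)
theta_star[:s] = 1.0 / np.sqrt(s)  # unit-norm, s-sparse index vector

A = rng.standard_normal((m, d))                              # Gaussian design
y = np.sign(A @ theta_star) + 0.1 * rng.standard_normal(m)   # 1-bit-type link

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (coordinatewise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

lam = 0.1                             # hand-tuned penalty level (assumption)
L = np.linalg.norm(A, 2) ** 2 / m     # Lipschitz constant of the gradient
theta = np.zeros(d)
for _ in range(500):                  # ISTA iterations
    grad = A.T @ (A @ theta - y) / m
    theta = soft_threshold(theta - grad / L, lam / L)

cos_sim = abs(theta @ theta_star) / np.linalg.norm(theta)
print(f"cosine alignment with theta_star: {cos_sim:.3f}")
```

Even with m = 200 one-bit-type measurements in dimension d = 400, the direction of the sparse signal is recovered well; replacing the Gaussian design with a heavy-tailed elliptical one is where plain Lasso degrades and the paper's estimator is designed to take over.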