High Dimensional Statistics

Lecture Notes

(This version: November 5, 2019)

Philippe Rigollet and Jan-Christian H¨utter

Spring 2017

Preface

These lecture notes were written for the course 18.657, High Dimensional Statistics, at MIT. They build on a set of notes that was prepared at Princeton University in 2013-14 and that was modified (and hopefully improved) over the years.

Over the past decade, statistics has undergone drastic changes with the development of high-dimensional statistical inference. Indeed, on each individual, more and more features are measured to a point that their number usually far exceeds the number of observations. This is the case in biology and specifically genetics, where millions of (combinations of) genes are measured for a single individual. High resolution imaging, finance, online advertising, climate studies . . . the list of intensive data producing fields is too long to be established exhaustively. Clearly not all measured features are relevant for a given task and most of them are simply noise. But which ones? What can be done with so little data and so much noise? Surprisingly, the situation is not that bad and on some simple models we can assess to which extent meaningful statistical methods can be applied.

Regression is one such simple model. Regression analysis can be traced back to 1632 when Galileo Galilei used a procedure to infer a linear relationship from noisy data. It was not until the early 19th century that Gauss and Legendre developed a systematic procedure: the least-squares method. Since then, regression has been studied in so many forms that much insight has been gained and recent advances on high-dimensional statistics would not have been possible without standing on the shoulders of giants. In these notes, we will explore one, obviously subjective, giant on whose shoulders high-dimensional statistics stand: nonparametric statistics.

The works of Ibragimov and Has'minskii in the seventies, followed by many researchers from the Russian school, have contributed to developing a large toolkit to understand regression with an infinite number of parameters. Much insight from this work can be gained to understand high-dimensional or sparse regression and it comes as no surprise that Donoho and Johnstone have made the first contributions on this topic in the early nineties. Therefore, while not obviously connected to high dimensional statistics, we will talk about nonparametric estimation.

I borrowed this disclaimer (and the template) from my colleague Ramon van Handel. It does apply here. I have no illusions about the state of these notes—they were written


rather quickly, sometimes at the rate of a chapter a week. I have no doubt that many errors remain in the text; at the very least many of the proofs are extremely compact, and should be made a little clearer as is befitting of a pedagogical (?) treatment. If I have another opportunity to teach such a course, I will go over the notes again in detail and attempt the necessary modifications. For the time being, however, the notes are available as-is. As any good set of notes, they should be perpetually improved and updated, but a two or three year horizon is more realistic. Therefore, if you have any comments, questions, suggestions, omissions, and of course mistakes, please let me know. I can be contacted by e-mail at [email protected].

Acknowledgements. These notes were improved thanks to the careful reading and comments of Mark Cerenzia, Youssef El Moujahid, Georgina Hall, Gautam Kamath, Hengrui Luo, Kevin Lin, Ali Makhdoumi, Yaroslav Mukhin, Mehtaab Sawhney, Ludwig Schmidt, Bastian Schroeter, Vira Semenova, Mathias Vetter, Yuyan Wang, Jonathan Weed, Chiyuan Zhang and Jianhao Zhang. These notes were written under the partial support of the National Science Foundation, CAREER award DMS-1053987.

Required background. I assume that the reader has had basic courses in probability and mathematical statistics. Some elementary background in analysis and measure theory is helpful but not required. Some basic notions of linear algebra, especially the spectral decomposition of matrices, are required for the later chapters.

Since the first version of these notes was posted, a couple of manuscripts on high-dimensional probability by Ramon van Handel [vH17] and Roman Vershynin [Ver18] were published. Both are of outstanding quality—much higher than the present notes—and very related to this material. I strongly recommend the reader to learn about this fascinating topic in parallel with high-dimensional statistics.

Notation

Functions, sets, vectors

[n]            Set of integers [n] = {1, . . . , n}
S^{d−1}        Unit sphere in dimension d
1I(·)          Indicator function
|x|_q          ℓ_q norm of x defined by |x|_q = (Σ_i |x_i|^q)^{1/q} for q > 0
|x|_0          ℓ_0 norm of x defined to be the number of nonzero coordinates of x
f^{(k)}        k-th derivative of f
e_j            j-th vector of the canonical basis
A^c            Complement of set A
conv(S)        Convex hull of set S
a_n ≲ b_n      a_n ≤ C b_n for a numerical constant C > 0
S_n            Symmetric group on n elements

Matrices

I_p            Identity matrix of IR^p
Tr(A)          Trace of a square matrix A
S_d            Symmetric matrices in IR^{d×d}
S_d^+          Symmetric positive semi-definite matrices in IR^{d×d}
S_d^{++}       Symmetric positive definite matrices in IR^{d×d}
A ⪯ B          Order relation given by B − A ∈ S^+
A ≺ B          Order relation given by B − A ∈ S^{++}
M^†            Moore-Penrose pseudoinverse of M
∇_x f(x)       Gradient of f at x
∇_x f(x)|_{x=x_0}   Gradient of f at x_0

Distributions

N(µ, σ²)       Univariate Gaussian distribution with mean µ ∈ IR and variance σ² > 0
N_d(µ, Σ)      d-variate Gaussian distribution with mean µ ∈ IR^d and covariance matrix Σ ∈ IR^{d×d}
subG(σ²)       Univariate sub-Gaussian distributions with variance proxy σ² > 0
subG_d(σ²)     d-variate sub-Gaussian distributions with variance proxy σ² > 0
subE(σ²)       Sub-exponential distributions with variance proxy σ² > 0
Ber(p)         Bernoulli distribution with parameter p ∈ [0, 1]
Bin(n, p)      Binomial distribution with parameters n ≥ 1, p ∈ [0, 1]
Lap(λ)         Double exponential (or Laplace) distribution with parameter λ > 0
P_X            Marginal distribution of X

Function spaces

W(β, L)        Sobolev class of functions
Θ(β, Q)        Sobolev ellipsoid of ℓ_2(IN)

Contents

Preface i

Notation iii

Contents v

Introduction 1

1 Sub-Gaussian Random Variables 14
  1.1 Gaussian tails and MGF 14
  1.2 Sub-Gaussian random variables and Chernoff bounds 16
  1.3 Sub-exponential random variables 22
  1.4 Maximal inequalities 25
  1.5 Sums of independent random matrices 30
  1.6 Problem set 36

2 Linear Regression Model 40
  2.1 Fixed design linear regression 40
  2.2 Least squares estimators 42
  2.3 The Gaussian Sequence Model 49
  2.4 High-dimensional linear regression 54
  2.5 Problem set 70

3 Misspecified Linear Models 72
  3.1 Oracle inequalities 73
  3.2 Nonparametric regression 81
  3.3 Problem Set 91

4 Minimax Lower Bounds 92
  4.1 Optimality in a minimax sense 92
  4.2 Reduction to finite hypothesis testing 94
  4.3 Lower bounds based on two hypotheses 95
  4.4 Lower bounds based on many hypotheses 101
  4.5 Application to the Gaussian sequence model 104
  4.6 Lower bounds for sparse estimation via χ² divergence 109


  4.7 Lower bounds for estimating the ℓ_1 norm via moment matching 112
  4.8 Problem Set 120

5 Matrix estimation 121
  5.1 Basic facts about matrices 121
  5.2 Multivariate regression 124
  5.3 Covariance matrix estimation 132
  5.4 Principal component analysis 134
  5.5 Graphical models 141
  5.6 Problem set 154

Bibliography 157

Introduction

This course is mainly about learning a regression function from a collection of observations. In this chapter, after defining this task formally, we give an overview of the course and the questions around regression. We adopt the statistical learning point of view, where the task of prediction prevails. Nevertheless, many interesting questions will remain unanswered when the last page comes: testing, model selection, implementation, . . .

REGRESSION ANALYSIS AND PREDICTION RISK

Model and definitions

Let (X, Y) ∈ X × Y where X is called the feature and lives in a topological space X and Y ∈ Y ⊂ IR is called the response, or sometimes label when Y is a discrete set, e.g., Y = {0, 1}. Often X ⊂ IR^d, in which case X is called a vector of covariates or simply a covariate. Our goal will be to predict Y given X and, for our problem to be meaningful, we need Y to depend non-trivially on X. Our task would be done if we had access to the conditional distribution of Y given X. This is the world of the probabilist. The statistician does not have access to this valuable information but rather has to estimate it, at least partially. The regression function gives a simple summary of this conditional distribution, namely, the conditional expectation. Formally, the regression function of Y onto X is defined by:

f(x) = IE[Y |X = x] , x ∈ X .

As we will see, it arises naturally in the context of prediction.

Best prediction and prediction risk

Suppose for a moment that you know the conditional distribution of Y given X. Given the realization of X = x, your goal is to predict the realization of Y. Intuitively, we would like to find a measurable¹ function g : X → Y such that g(X) is close to Y, in other words, such that |Y − g(X)| is small. But |Y − g(X)| is a random variable, so it is not clear what “small” means in this context. A somewhat arbitrary answer can be given by declaring a random

¹ All topological spaces are equipped with their Borel σ-algebra.

variable Z small if IE[Z²] = (IE[Z])² + var[Z] is small. Indeed, in this case, the expectation of Z is small and the fluctuations of Z around this value are also small. The function R(g) = IE[Y − g(X)]² is called the L2 risk of g, defined for IE[Y²] < ∞. For any measurable function g : X → IR, the L2 risk of g can be decomposed as

    IE[Y − g(X)]² = IE[Y − f(X) + f(X) − g(X)]²
                  = IE[Y − f(X)]² + IE[f(X) − g(X)]² + 2 IE[Y − f(X)][f(X) − g(X)]

The cross-product term satisfies

    IE[Y − f(X)][f(X) − g(X)] = IE[ IE( [Y − f(X)][f(X) − g(X)] | X ) ]
                              = IE[ (IE(Y|X) − f(X)) (f(X) − g(X)) ]
                              = IE[ (f(X) − f(X)) (f(X) − g(X)) ]
                              = 0 .

The above two equations yield

    IE[Y − g(X)]² = IE[Y − f(X)]² + IE[f(X) − g(X)]² ≥ IE[Y − f(X)]² ,

with equality iff f(X) = g(X) almost surely. We have proved that the regression function f(x) = IE[Y|X = x], x ∈ X, enjoys the best prediction property, that is

    IE[Y − f(X)]² = inf_g IE[Y − g(X)]² ,

where the infimum is taken over all measurable functions g : X → IR.
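As a quick numerical illustration (not part of the original notes; the model Y = sin(2πX) + ε and the competitor predictors below are arbitrary choices), one can check by Monte Carlo that the regression function attains the smallest L2 risk:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
X = rng.uniform(0.0, 1.0, size=n)
Y = np.sin(2 * np.pi * X) + rng.normal(scale=0.5, size=n)  # f(x) = sin(2*pi*x)

def risk(g):
    """Monte Carlo estimate of the L2 risk IE[Y - g(X)]^2."""
    return np.mean((Y - g(X)) ** 2)

f = lambda x: np.sin(2 * np.pi * x)        # the regression function
g_lin = lambda x: 1.0 - 2.0 * x            # a linear competitor
g_zero = lambda x: np.zeros_like(x)        # predict 0 everywhere

print(risk(f))       # about 0.25 = noise variance, the smallest possible risk
print(risk(g_lin))   # 0.25 + ||f - g_lin||_2^2, larger
print(risk(g_zero))  # 0.25 + ||f||_2^2, larger still
```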

Prediction and estimation

As we said before, in a statistical problem, we do not have access to the conditional distribution of Y given X or even to the regression function f of Y onto X. Instead, we observe a sample D_n = {(X_1, Y_1), . . . , (X_n, Y_n)} that consists of independent copies of (X, Y). The goal of regression function estimation is to use this data to construct an estimator f̂_n : X → Y that has small L2 risk R(f̂_n).

Let P_X denote the marginal distribution of X and for any h : X → IR, define

    ‖h‖_2² = ∫_X h² dP_X .

Note that ‖h‖_2 is the Hilbert norm associated to the inner product

    ⟨h, h′⟩_2 = ∫_X h h′ dP_X .

When the reference measure is clear from the context, we will simply write ‖h‖_2 = ‖h‖_{L2(P_X)} and ⟨h, h′⟩_2 := ⟨h, h′⟩_{L2(P_X)}.

It follows from the proof of the best prediction property above that

    R(f̂_n) = IE[Y − f(X)]² + ‖f̂_n − f‖_2²
            = inf_g IE[Y − g(X)]² + ‖f̂_n − f‖_2²

In particular, the prediction risk will always be at least equal to the positive constant IE[Y − f(X)]². Since we tend to prefer a measure of accuracy to be able to go to zero (as the sample size increases), it is equivalent to study the estimation error ‖f̂_n − f‖_2². Note that if f̂_n is random, then ‖f̂_n − f‖_2² and R(f̂_n) are random quantities and we need deterministic summaries to quantify their size. It is customary to use one of the two following options. Let {φ_n}_n be a sequence of positive numbers that tends to zero as n goes to infinity.

1. Bounds in expectation. They are of the form:

    IE‖f̂_n − f‖_2² ≤ φ_n ,

where the expectation is taken with respect to the sample D_n. They indicate the average behavior of the estimator over multiple realizations of the sample. Such bounds have been established in nonparametric statistics, where typically φ_n = O(n^{−α}) for some α ∈ (1/2, 1), for example. Note that such bounds do not characterize the size of the deviation of the random variable ‖f̂_n − f‖_2² around its expectation. It may therefore be appropriate to accompany such a bound with the second option below.

2. Bounds with high probability. They are of the form:

    IP[ ‖f̂_n − f‖_2² > φ_n(δ) ] ≤ δ ,   ∀ δ ∈ (0, 1/3) .

Here 1/3 is arbitrary and can be replaced by another positive constant. Such bounds control the tail of the distribution of ‖f̂_n − f‖_2². They show how large the quantiles of the random variable ‖f − f̂_n‖_2² can be. Such bounds are favored in learning theory, and are sometimes called PAC-bounds (for Probably Approximately Correct).

Often, bounds with high probability follow from a bound in expectation and a concentration inequality that bounds the following probability

    IP( ‖f̂_n − f‖_2² − IE‖f̂_n − f‖_2² > t )

by a quantity that decays to zero exponentially fast. Concentration of measure is a fascinating but wide topic and we will only briefly touch on it. We recommend the reading of [BLM13] to the interested reader. This book presents many aspects of concentration that are particularly well suited to the applications covered in these notes.

Other measures of error

We have chosen the L2 risk somewhat arbitrarily. Why not the Lp risk defined by g ↦ IE|Y − g(X)|^p for some p ≥ 1? The main reason for choosing the L2 risk is that it greatly simplifies the mathematics of our problem: it is a Hilbert space! In particular, for any estimator f̂_n, we have the remarkable identity:

    R(f̂_n) = IE[Y − f(X)]² + ‖f̂_n − f‖_2² .

This equality allowed us to consider only the part ‖f̂_n − f‖_2² as a measure of error. While this decomposition may not hold for other risk measures, it may be desirable to explore other distances (or pseudo-distances). This leads to two distinct ways to measure error. Either by bounding a pseudo-distance d(f̂_n, f) (estimation error) or by bounding the risk R(f̂_n) for choices other than the L2 risk. These two measures coincide up to the additive constant IE[Y − f(X)]² in the case described above. However, we show below that these two quantities may live independent lives. Bounding the estimation error is more customary in statistics whereas risk bounds are preferred in learning theory.

Here is a list of choices for the pseudo-distance employed in the estimation error.

• Pointwise error. Given a point x_0, the pointwise error measures only the error at this point. It uses the pseudo-distance:

    d_0(f̂_n, f) = |f̂_n(x_0) − f(x_0)| .

• Sup-norm error. Also known as the L∞-error and defined by

    d_∞(f̂_n, f) = sup_{x∈X} |f̂_n(x) − f(x)| .

It controls the worst possible pointwise error.

• Lp-error. It generalizes both the L2 distance and the sup-norm error by taking, for any p ≥ 1, the pseudo-distance

    d_p(f̂_n, f) = ∫_X |f̂_n − f|^p dP_X .

The choice of p is somewhat arbitrary and mostly employed as a mathematical exercise.

Note that these three examples can be split into two families: global (Sup-norm and Lp) and local (pointwise). For specific problems, other considerations come into play. For example, if Y ∈ {0, 1} is a label, one may be interested in the classification risk of a classifier h : X → {0, 1}. It is defined by

    R(h) = IP(Y ≠ h(X)) .

We will not cover this problem in this course.

Finally, we will devote a large part of these notes to the study of linear models. For such models, X = IR^d and f is linear (or affine), i.e., f(x) = x^⊤θ for some unknown θ ∈ IR^d. In this case, it is traditional to measure error directly on the coefficient θ. For example, if f̂_n(x) = x^⊤θ̂_n is a candidate linear estimator, it is customary to measure the distance of f̂_n to f using a (pseudo-)distance between θ̂_n and θ, as long as θ is identifiable.

MODELS AND METHODS

Empirical risk minimization

In our considerations on measuring the performance of an estimator f̂_n, we have carefully avoided the question of how to construct f̂_n. This is of course one of the most important tasks of statistics. As we will see, it can be carried out in a fairly mechanical way by following one simple principle: Empirical Risk Minimization (ERM²). Indeed, an overwhelming proportion of statistical methods consist in replacing an (unknown) expected value (IE) by a (known) empirical mean (1/n Σ_{i=1}^n). For example, it is well known that a good candidate to estimate the expected value IE[X] of a random variable X from a sequence of i.i.d. copies X_1, . . . , X_n of X is their empirical average

    X̄ = (1/n) Σ_{i=1}^n X_i .

In many instances, it corresponds to the maximum likelihood estimator of IE[X]. Another example is the sample variance, where IE(X − IE(X))² is estimated by

    (1/n) Σ_{i=1}^n (X_i − X̄)² .

It turns out that this principle can be extended even if an optimization follows the substitution. Recall that the L2 risk is defined by R(g) = IE[Y − g(X)]². See the expectation? Well, it can be replaced by an average to form the empirical risk of g defined by

    R_n(g) = (1/n) Σ_{i=1}^n ( Y_i − g(X_i) )² .

We can now proceed to minimizing this risk. However, we have to be careful. Indeed, R_n(g) ≥ 0 for all g. Therefore any function g such that Y_i = g(X_i) for all i = 1, . . . , n is a minimizer of the empirical risk. Yet, it may not be the best choice (cf. Figure 1).

² ERM may also mean Empirical Risk Minimizer.


Figure 1. It may not be the best idea to have f̂_n(X_i) = Y_i for all i = 1, . . . , n.

To overcome this limitation, we need to leverage some prior knowledge on f: either it may belong to a certain class G of functions (e.g., linear functions) or it is smooth (e.g., the L2-norm of its second derivative is small). In both cases, this extra knowledge can be incorporated into ERM using either a constraint:

    min_{g∈G} R_n(g) ,

or a penalty:

    min_g { R_n(g) + pen(g) } ,

or both:

    min_{g∈G} { R_n(g) + pen(g) } .

These schemes belong to the general idea of regularization. We will see many variants of regularization throughout the course.

Unlike traditional (low dimensional) statistics, computation plays a key role in high-dimensional statistics. Indeed, what is the point of describing an estimator with good prediction properties if it takes years to compute it on large datasets? As a result of this observation, many modern estimators, such as the Lasso estimator for sparse linear regression, can be computed efficiently using simple tools from convex optimization. We will not describe such algorithms for this problem but will comment on the computability of estimators when relevant.

In particular, computational considerations have driven the field of compressed sensing that is closely connected to the problem of sparse linear regression studied in these notes. We will only briefly mention some of the results and refer the interested reader to the book [FR13] for a comprehensive treatment.
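To make the danger of unconstrained empirical risk minimization concrete, here is a small simulation sketch (the data-generating model and sample sizes are hypothetical choices, not taken from the notes): a high-degree polynomial that interpolates the data drives the empirical risk to (numerically) zero but performs poorly on fresh data, while ERM restricted to affine functions does much better.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10
x = np.sort(rng.uniform(0, 1, n))
f = lambda t: 2 * t - 1                      # true regression function
y = f(x) + rng.normal(scale=0.3, size=n)

# Unconstrained ERM: a degree n-1 polynomial interpolates the data points,
# so its empirical risk is essentially zero.
interp = np.poly1d(np.polyfit(x, y, deg=n - 1))

# ERM over the class G of affine functions (constrained ERM).
affine = np.poly1d(np.polyfit(x, y, deg=1))

x_test = rng.uniform(0, 1, 100_000)
y_test = f(x_test) + rng.normal(scale=0.3, size=100_000)
for name, g in [("interpolant", interp), ("affine ERM ", affine)]:
    emp = np.mean((y - g(x)) ** 2)
    test = np.mean((y_test - g(x_test)) ** 2)
    print(f"{name}: empirical risk {emp:.4f}, risk on fresh data {test:.4f}")
```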

Linear models

When X = IR^d, an all-time favorite constraint G is the class of linear functions that are of the form g(x) = x^⊤θ, that is, parametrized by θ ∈ IR^d. Under this constraint, the estimator obtained by ERM is usually called the least squares estimator and is defined by f̂_n(x) = x^⊤θ̂, where

    θ̂ ∈ argmin_{θ∈IR^d} (1/n) Σ_{i=1}^n ( Y_i − X_i^⊤θ )² .

Note that θ̂ may not be unique. In the case of a linear model, where we assume that the regression function is of the form f(x) = x^⊤θ* for some unknown θ* ∈ IR^d, we will need assumptions to ensure identifiability if we want to prove bounds on d(θ̂, θ*) for some specific pseudo-distance d(·, ·). Nevertheless, in other instances such as regression with fixed design, we can prove bounds on the prediction error that are valid for any θ̂ in the argmin. In the latter case, we will not even require that f satisfies the linear model but our bound will be meaningful only if f can be well approximated by a linear function. In this case, we talk about a misspecified model, i.e., we try to fit a linear model to data that may not come from a linear model. Since linear models can have good approximation properties, especially when the dimension d is large, our hope is that the linear model is never too far from the truth.

In the case of a misspecified model, there is no hope to drive the estimation error d(f̂_n, f) down to zero even with a sample size that tends to infinity. Rather, we will pay a systematic approximation error. When G is a linear subspace as above, and the pseudo-distance is given by the squared L2 norm d(f̂_n, f) = ‖f̂_n − f‖_2², it follows from the Pythagorean theorem that

    ‖f̂_n − f‖_2² = ‖f̂_n − f̄‖_2² + ‖f̄ − f‖_2² ,

where f̄ is the projection of f onto the linear subspace G. The systematic approximation error is entirely contained in the deterministic term ‖f̄ − f‖_2² and one can proceed to bound ‖f̂_n − f̄‖_2² by a quantity that goes to zero as n goes to infinity. In this case, bounds (e.g., in expectation) on the estimation error take the form

    IE‖f̂_n − f‖_2² ≤ ‖f̄ − f‖_2² + φ_n .

The above inequality is called an oracle inequality. Indeed, it says that if φ_n is small enough, then the estimator f̂_n mimics the oracle f̄. It is called “oracle” because it cannot be constructed without the knowledge of the unknown f. It is clearly the best we can do when we restrict our attention to estimators in the class G. Going back to the gap in knowledge between a probabilist who knows the whole joint distribution of (X, Y) and a statistician who only sees the data, the oracle sits somewhere in-between: it can only see the whole distribution through the lens provided by the statistician. In the case above, the lens is that of linear regression functions. Different oracles are more or less powerful and there is a tradeoff to be achieved. On the one hand, if the oracle is weak, then it is easy for the statistician to mimic it but it may be very far from the true regression function; on the other hand, if the oracle is strong, then it is harder to mimic but it is much closer to the truth.

Oracle inequalities were originally developed as analytic tools to prove adaptation of some nonparametric estimators. With the development of aggregation [Nem00, Tsy03, Rig06] and high dimensional statistics [CT07, BRT09, RT11], they have become important finite sample results that characterize the interplay between the important parameters of the problem.

In some favorable instances, that is when the X_i's enjoy specific properties, it is even possible to estimate the vector θ accurately, as is done in parametric statistics. The techniques employed for this goal will essentially be the same as the ones employed to minimize the prediction risk. The extra assumptions on the X_i's will then translate into interesting properties of θ̂ itself, including uniqueness, on top of the prediction properties of the function f̂_n(x) = x^⊤θ̂.
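The least squares estimator above is a one-liner in practice. The following sketch (with simulated data and arbitrary dimensions; using the minimum-norm solution is an implementation choice, not prescribed by the notes) computes θ̂ and reports both the estimation error on θ and the in-sample prediction error.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 100, 5
X = rng.normal(size=(n, d))
theta_star = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ theta_star + rng.normal(size=n)

# One element of argmin_theta (1/n) sum_i (Y_i - X_i^T theta)^2.
# lstsq returns the minimum-norm minimizer, a valid choice even when
# the minimizer is not unique (e.g., when d > n).
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print("|theta_hat - theta_star|_2        :", np.linalg.norm(theta_hat - theta_star))
print("(1/n)|X(theta_hat - theta_star)|^2:", np.mean((X @ (theta_hat - theta_star)) ** 2))
```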

High dimension and sparsity

These lecture notes are about high dimensional statistics and it is time they enter the picture. By high dimension, we informally mean that the model has more “parameters” than there are observations. The word “parameter” is used here loosely and a more accurate description is perhaps degrees of freedom. For example, the linear model f(x) = x^⊤θ* has one parameter θ* but effectively d degrees of freedom when θ* ∈ IR^d. The notion of degrees of freedom is actually well defined in the statistical literature but the formal definition does not help our informal discussion here.

As we will see in Chapter 2, if the regression function is linear, f(x) = x^⊤θ*, θ* ∈ IR^d, and under some assumptions on the marginal distribution of X, then the least squares estimator f̂_n(x) = x^⊤θ̂_n satisfies

    IE‖f̂_n − f‖_2² ≤ C d/n ,    (1)

where C > 0 is a constant, and in Chapter 4 we will show that this cannot be improved, apart perhaps from a smaller multiplicative constant. Clearly such a bound is uninformative if d ≫ n and actually, in view of its optimality, we can even conclude that the problem is too difficult statistically. However, the situation is not hopeless if we assume that the problem has actually fewer degrees of freedom than it seems. In particular, it is now standard to resort to the sparsity assumption to overcome this limitation. A vector θ ∈ IR^d is said to be k-sparse for some k ∈ {0, . . . , d} if it has at most k non-zero coordinates. We denote by |θ|_0 the number of nonzero coordinates of θ, which is also known as the sparsity or “ℓ_0-norm” of θ, though it is clearly not a norm (see footnote 3). Formally, it is defined as

    |θ|_0 = Σ_{j=1}^d 1I(θ_j ≠ 0) .

Sparsity is just one of many ways to limit the size of the set of potential θ vectors to consider. One could consider vectors θ that have the following structure, for example (see Figure 2):

• Monotonic: θ_1 ≥ θ_2 ≥ · · · ≥ θ_d

• Smooth: |θ_i − θ_j| ≤ C|i − j|^α for some α > 0

• Piecewise constant: Σ_{j=1}^{d−1} 1I(θ_{j+1} ≠ θ_j) ≤ k

• Structured in another basis: θ = Ψµ, for some orthogonal matrix Ψ and µ in one of the structured classes described above.

Sparsity plays a significant role in statistics because, often, structure translates into sparsity in a certain basis. For example, a smooth function is sparse in the trigonometric basis and a piecewise constant function has sparse increments. Moreover, real images are approximately sparse in certain bases such as wavelet or Fourier bases. This is precisely the feature exploited in compression schemes such as JPEG or JPEG-2000: only a few coefficients in these images are necessary to retain the main features of the image.

We say that θ is approximately sparse if |θ|_0 may be as large as d but many coefficients |θ_j| are small rather than exactly equal to zero. There are several mathematical ways to capture this phenomenon, including ℓ_q-“balls” for q ≤ 1. For q > 0, the unit ℓ_q-ball of IR^d is defined as

    B_q = { θ ∈ IR^d : |θ|_q^q = Σ_{j=1}^d |θ_j|^q ≤ 1 } ,

where |θ|_q is often called the ℓ_q-norm³. As we will see, the smaller q is, the better vectors in the unit ℓ_q ball can be approximated by sparse vectors.

Note that the set of k-sparse vectors of IR^d is a union of Σ_{j=0}^k (d choose j) linear subspaces with dimension at most k and that are spanned by at most k vectors in the canonical basis of IR^d. If we knew that θ* belongs to one of these subspaces, we could simply drop irrelevant coordinates and obtain an oracle inequality such as (1), with d replaced by k. Since we do not know in which subspace θ* lives exactly, we will have to pay an extra term to find the subspace in which θ* lives. It turns out that this term is exactly of the order of

    log( Σ_{j=0}^k (d choose j) ) / n  ≃  C k log(ed/k) / n .

Therefore, the price to pay for not knowing which subspace to look at is only a logarithmic factor.

³ Strictly speaking, |θ|_q is a norm and the ℓ_q ball is a ball only for q ≥ 1.
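The logarithmic price discussed above can be checked numerically. The sketch below (with arbitrary values of d and k, purely for illustration) compares log Σ_{j≤k} (d choose j) with k log(ed/k):

```python
import math

def log_num_subspaces(d, k):
    """log of the number of candidate supports: log sum_{j=0}^k binom(d, j)."""
    return math.log(sum(math.comb(d, j) for j in range(k + 1)))

for d, k in [(1000, 5), (1000, 50), (10_000, 100)]:
    lhs = log_num_subspaces(d, k)
    rhs = k * math.log(math.e * d / k)
    print(f"d={d:6d}, k={k:4d}:  log #subspaces = {lhs:8.1f},  k log(ed/k) = {rhs:8.1f}")
```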

Figure 2. Examples of structured vectors θ ∈ IR^50 (panels: Monotone, Smooth, Constant, Basis).

Nonparametric regression

Nonparametric does not mean that there is no parameter to estimate (the regression function is a parameter) but rather that the parameter to estimate is infinite dimensional (this is the case of a function). In some instances, this parameter can be identified with an infinite sequence of real numbers, so that we are still in the realm of countable infinity. Indeed, observe that since L2(P_X) equipped with the inner product ⟨·, ·⟩_2 is a separable Hilbert space, it admits an orthonormal basis {φ_k}_{k∈Z} and any function f ∈ L2(P_X) can be decomposed as

    f = Σ_{k∈Z} α_k φ_k ,

where α_k = ⟨f, φ_k⟩_2.

Therefore estimating a regression function f amounts to estimating the infinite sequence {α_k}_{k∈Z} ∈ ℓ_2. You may argue (correctly) that the basis {φ_k}_{k∈Z} is also unknown as it depends on the unknown P_X. This is absolutely

correct but we will make the convenient assumption that P_X is (essentially) known whenever this is needed.

Even if infinity is countable, we still have to estimate an infinite number of coefficients using a finite number of observations. It does not require much statistical intuition to realize that this task is impossible in general. What if we know something about the sequence {α_k}_k? For example, if we know that α_k = 0 for |k| > k_0, then there are only 2k_0 + 1 parameters to estimate (in general, one would also have to “estimate” k_0). In practice, we will not exactly see α_k = 0 for |k| > k_0, but rather that the sequence {α_k}_k decays to 0 at a certain polynomial rate. For example, |α_k| ≤ C|k|^{−γ} for some γ > 1/2 (we need this sequence to be in ℓ_2). It corresponds to a smoothness assumption on the function f. In this case, the sequence {α_k}_k can be well approximated by a sequence with only a finite number of non-zero terms.

We can view this problem as a misspecified model. Indeed, for any cut-off k_0, define the oracle

    f̄_{k_0} = Σ_{|k|≤k_0} α_k φ_k .

Note that it depends on the unknown α_k. Define the estimator

    f̂_n = Σ_{|k|≤k_0} α̂_k φ_k ,

where the α̂_k are some data-driven coefficients (obtained by least squares, for example). Then by the Pythagorean theorem and Parseval's identity, we have

    ‖f̂_n − f‖_2² = ‖f̄ − f‖_2² + ‖f̂_n − f̄‖_2²
                  = Σ_{|k|>k_0} α_k² + Σ_{|k|≤k_0} (α̂_k − α_k)² .

We can even work further on this oracle inequality using the fact that |α_k| ≤ C|k|^{−γ}. Indeed, we have⁴

    Σ_{|k|>k_0} α_k² ≤ C² Σ_{|k|>k_0} |k|^{−2γ} ≤ C k_0^{1−2γ} .

The so-called stochastic term IE Σ_{|k|≤k_0} (α̂_k − α_k)² clearly increases with k_0 (more parameters to estimate) whereas the approximation term C k_0^{1−2γ} decreases with k_0 (fewer terms discarded). We will see that we can strike a compromise called the bias-variance tradeoff.

The main difference between this and oracle inequalities is that we make assumptions on the regression function (here in terms of smoothness) in order to control the approximation error. Therefore oracle inequalities are more general but can be seen on the one hand as less quantitative. On the other hand, if one is willing to accept the fact that approximation error is inevitable then there is no reason to focus on it. This is not the final answer to this rather philosophical question. Indeed, choosing the right k_0 can only be done with a control of the approximation error, and the best k_0 will depend on γ. We will see that even if the smoothness index γ is unknown, we can select k_0 in a data-driven way that achieves almost the same performance as if γ were known. This phenomenon is called adaptation (to γ).

It is important to notice the main difference between the approach taken in nonparametric regression and the one in sparse linear regression. It is not so much about linear vs. non-linear models as we can always first take nonlinear transformations of the x_j's in linear regression. Instead, sparsity or approximate sparsity is a much weaker notion than the decay of coefficients {α_k}_k presented above. In a way, sparsity only imposes that, after ordering, the coefficients present a certain decay, whereas in nonparametric statistics, the order is set ahead of time: we assume that we have found a basis that is ordered in such a way that coefficients decay at a certain rate.

⁴ Here we illustrate a convenient notational convention that we will be using throughout these notes: a constant C may be different from line to line. This will not affect the interpretation of our results since we are interested in the order of magnitude of the error bounds. Nevertheless we will, as much as possible, try to make such constants explicit. As an exercise, try to find an expression of the second C as a function of the first one and of γ.
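The bias-variance tradeoff in the cut-off k_0 can be visualized with a short simulation. The sketch below (assumed setup: trigonometric basis on [0, 1], uniform design, coefficients fitted by least squares; none of these specifics come from the notes) shows the estimation error first decreasing and then increasing as k_0 grows.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x = rng.uniform(0, 1, n)
f = lambda t: np.abs(t - 0.5)                 # target regression function
y = f(x) + rng.normal(scale=0.2, size=n)

def design(t, k0):
    """Trigonometric basis: 1, cos(2*pi*k*t), sin(2*pi*k*t) for k = 1..k0."""
    cols = [np.ones_like(t)]
    for k in range(1, k0 + 1):
        cols += [np.cos(2 * np.pi * k * t), np.sin(2 * np.pi * k * t)]
    return np.column_stack(cols)

t_grid = np.linspace(0, 1, 2000)
for k0 in [1, 3, 10, 50, 150]:
    Phi = design(x, k0)
    alpha_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # data-driven coefficients
    err = np.mean((design(t_grid, k0) @ alpha_hat - f(t_grid)) ** 2)
    print(f"k0 = {k0:4d}   estimated ||f_hat - f||_2^2 ≈ {err:.4f}")
```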

Matrix models

In the previous examples, the response variable is always assumed to be a scalar. What if it is a higher dimensional signal? In Chapter 5, we consider various problems of this form: matrix completion, a.k.a. the Netflix problem, structured graph estimation and covariance matrix estimation. All these problems can be described as follows.

Let M, S and N be three matrices, respectively called observation, signal and noise, and that satisfy M = S + N. Here N is a random matrix such that IE[N] = 0, the all-zero matrix. The goal is to estimate the signal matrix S from the observation of M.

The structure of S can also be chosen in various ways. We will consider the case where S is sparse in the sense that it has many zero coefficients. In a way, this assumption does not leverage much of the matrix structure and essentially treats matrices as vectors arranged in the form of an array. This is not the case of low rank structures where one assumes that the matrix S has either low rank or can be well approximated by a low rank matrix. This assumption makes sense in the case where S represents user preferences as in the Netflix example. In this example, the (i, j)-th coefficient S_{ij} of S corresponds to the rating (on a scale from 1 to 5) that user i gave to movie j. The low rank assumption simply materializes the idea that there are a few canonical profiles of users and that each user can be represented as a linear combination of these users.

At first glance, this problem seems much more difficult than sparse linear regression. Indeed, one needs to learn not only the sparse coefficients in a given basis, but also the basis of eigenvectors. Fortunately, it turns out that the latter task is much easier and is dominated by the former in terms of statistical price.

Another important example of matrix estimation is high-dimensional covariance estimation, where the goal is to estimate the covariance matrix of a random vector X ∈ IR^d, or its leading eigenvectors, based on n observations. Such a problem has many applications including principal component analysis, linear discriminant analysis and portfolio optimization. The main difficulty is that n may be much smaller than the number of degrees of freedom in the covariance matrix, which can be of order d². To overcome this limitation, assumptions on the rank or the sparsity of the matrix can be leveraged.

Optimality and minimax lower bounds

So far, we have only talked about upper bounds. For a linear model, where f(x) = x^⊤θ*, we will prove in Chapter 2 the following bound for a modified least squares estimator f̂_n(x) = x^⊤θ̂:

    IE‖f̂_n − f‖_2² ≤ C d/n .

Is this the right dependence in d and n? Would it be possible to obtain as an upper bound something like C(log d)/n, C/n or √d/n², by either improving our proof technique or using another estimator altogether? It turns out that the answer to this question is negative. More precisely, we can prove that for any estimator f̃_n, there exists a function f of the form f(x) = x^⊤θ* such that

    IE‖f̃_n − f‖_2² > c d/n

for some positive constant c. Here we used a different notation for the constant to emphasize the fact that lower bounds guarantee optimality only up to a constant factor. Such a lower bound on the risk is called a minimax lower bound for reasons that will become clearer in Chapter 4.

How is this possible? How can we make a statement for all estimators? We will see that these statements borrow from the theory of tests where we know that it is impossible to drive both the type I and the type II error to zero simultaneously (with a fixed sample size). Intuitively this phenomenon is related to the following observation: given n observations X_1, . . . , X_n, it is hard to tell if they are distributed according to N(θ, 1) or to N(θ′, 1) when the Euclidean distance |θ − θ′|_2 is small enough. We will see that it is the case for example if |θ − θ′|_2 ≤ C√(d/n), which will yield our lower bound.

Chapter 1

Sub-Gaussian Random Variables

1.1 GAUSSIAN TAILS AND MGF

Recall that a random variable X ∈ IR has Gaussian distribution iff it has a density p with respect to the Lebesgue measure on IR given by

    p(x) = (1/√(2πσ²)) exp( −(x − µ)²/(2σ²) ) ,   x ∈ IR ,

where µ = IE(X) ∈ IR and σ² = var(X) > 0 are the mean and variance of X. We write X ∼ N(µ, σ²). Note that X = σZ + µ for Z ∼ N(0, 1) (called standard Gaussian). Clearly, this distribution has unbounded support but it is well known that it has almost bounded support in the following sense: IP(|X − µ| ≤ 3σ) ≃ 0.997. This is due to the fast decay of the tails of its density p as |x| → ∞ (see Figure 1.1). This decay can be quantified using the following proposition (Mill's inequality).

Proposition 1.1. Let X be a Gaussian random variable with mean µ ∈ IR and variance σ². Then for any t > 0, it holds

    IP(X − µ > t) ≤ (σ/√(2π)) · e^{−t²/(2σ²)} / t .

By symmetry we also have

    IP(X − µ < −t) ≤ (σ/√(2π)) · e^{−t²/(2σ²)} / t

Figure 1.1. Probabilities of falling within 1, 2, and 3 standard deviations close to the mean in a Gaussian distribution: approximately 68% within µ ± σ, 95% within µ ± 2σ, and 99.7% within µ ± 3σ. Source: http://www.openintro.org/

and

    IP(|X − µ| > t) ≤ σ √(2/π) · e^{−t²/(2σ²)} / t .

Proof. Note that it is sufficient to prove the theorem for µ = 0 and σ² = 1 by simple translation and rescaling. We get for Z ∼ N(0, 1),

    IP(Z > t) = (1/√(2π)) ∫_t^∞ exp(−x²/2) dx
              ≤ (1/√(2π)) ∫_t^∞ (x/t) exp(−x²/2) dx
              = (1/(t√(2π))) ∫_t^∞ ( −∂/∂x exp(−x²/2) ) dx
              = (1/(t√(2π))) exp(−t²/2) .

The second inequality follows from symmetry and the last one using the union bound:

IP(|Z| > t) = IP({Z > t} ∪ {Z < −t}) ≤ IP(Z > t) + IP(Z < −t) = 2IP(Z > t) .
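Mill's inequality can be compared numerically with the exact Gaussian tail; the short check below uses scipy (an added dependency, for illustration only).

```python
import numpy as np
from scipy.stats import norm

sigma = 1.0
for t in [0.5, 1.0, 2.0, 3.0, 4.0]:
    exact = norm.sf(t, scale=sigma)   # exact IP(X - mu > t) for X ~ N(mu, sigma^2)
    mills = sigma / np.sqrt(2 * np.pi) * np.exp(-t**2 / (2 * sigma**2)) / t
    print(f"t = {t}:  exact tail = {exact:.3e},  Mill's bound = {mills:.3e}")
```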

The fact that a Gaussian random variable Z has tails that decay to zero exponentially fast can also be seen in the moment generating function (MGF)

    M : s ↦ M(s) = IE[exp(sZ)] .

Indeed, in the case of a standard Gaussian random variable, we have

    M(s) = IE[exp(sZ)] = (1/√(2π)) ∫ e^{sz} e^{−z²/2} dz
         = (1/√(2π)) ∫ e^{−(z−s)²/2 + s²/2} dz
         = e^{s²/2} .

It follows that if X ∼ N(µ, σ²), then IE[exp(sX)] = exp( sµ + σ²s²/2 ).

1.2 SUB-GAUSSIAN RANDOM VARIABLES AND CHERNOFF BOUNDS

Definition and first properties

Gaussian tails are practical when controlling the tail of an average of independent random variables. Indeed, recall that if X_1, . . . , X_n are iid N(µ, σ²), then X̄ = (1/n) Σ_{i=1}^n X_i ∼ N(µ, σ²/n). Using Lemma 1.3 below for example, we get

    IP(|X̄ − µ| > t) ≤ 2 exp( −nt²/(2σ²) ) .

Equating the right-hand side with some confidence level δ > 0, we find that with probability at least¹ 1 − δ,

    µ ∈ [ X̄ − σ√(2 log(2/δ)/n) , X̄ + σ√(2 log(2/δ)/n) ] .    (1.1)

This is almost the confidence interval that you used in introductory statistics. The only difference is that we used an approximation for the Gaussian tail whereas statistical tables or software use a much more accurate computation. Figure 1.2 shows the ratio of the width of the confidence interval to that of the confidence interval computed by the software R. It turns out that intervals of the same form can also be derived for non-Gaussian random variables as long as they have sub-Gaussian tails.

Definition 1.2. A random variable X ∈ IR is said to be sub-Gaussian with variance proxy σ² if IE[X] = 0 and its moment generating function satisfies

    IE[exp(sX)] ≤ exp( σ²s²/2 ) ,   ∀ s ∈ IR .    (1.2)

In this case we write X ∼ subG(σ²). Note that subG(σ²) denotes a class of distributions rather than a distribution. Therefore, we abuse notation when writing X ∼ subG(σ²).

More generally, we can talk about sub-Gaussian random vectors and matrices. A random vector X ∈ IR^d is said to be sub-Gaussian with variance proxy

¹ We will often omit the statement “at least” for brevity.


Figure 1.2. Width of confidence intervals from exact computation in R (red dashed) and from the approximation (1.1) (solid black). Here x = δ and y is the width of the confidence intervals.

σ² if IE[X] = 0 and u^⊤X is sub-Gaussian with variance proxy σ² for any vector u ∈ S^{d−1}. In this case we write X ∼ subG_d(σ²). Note that if X ∼ subG_d(σ²), then for any v such that |v|_2 ≤ 1, we have v^⊤X ∼ subG(σ²). Indeed, denoting u = v/|v|_2 ∈ S^{d−1}, we have

    IE[e^{s v^⊤X}] = IE[e^{s|v|_2 u^⊤X}] ≤ e^{σ²s²|v|_2²/2} ≤ e^{σ²s²/2} .

A random matrix X ∈ IR^{d×T} is said to be sub-Gaussian with variance proxy σ² if IE[X] = 0 and u^⊤Xv is sub-Gaussian with variance proxy σ² for any unit vectors u ∈ S^{d−1}, v ∈ S^{T−1}. In this case we write X ∼ subG_{d×T}(σ²).

This property can equivalently be expressed in terms of bounds on the tail of the random variable X.

Lemma 1.3. Let X ∼ subG(σ²). Then for any t > 0, it holds

    IP[X > t] ≤ exp( −t²/(2σ²) ) ,   and   IP[X < −t] ≤ exp( −t²/(2σ²) ) .    (1.3)

Proof. Assume first that X ∼ subG(σ²). We will employ a very useful technique called Chernoff bound that allows to translate a bound on the moment generating function into a tail bound. Using Markov's inequality, we have for any s > 0,

    IP(X > t) ≤ IP( e^{sX} > e^{st} ) ≤ IE[e^{sX}] / e^{st} .

Next we use the fact that X is sub-Gaussian to get

    IP(X > t) ≤ e^{σ²s²/2 − st} .

The above inequality holds for any s > 0 so to make it the tightest possible, we minimize with respect to s > 0. Solving φ′(s) = 0, where φ(s) = σ²s²/2 − st, we find that inf_{s>0} φ(s) = −t²/(2σ²). This proves the first part of (1.3). The second inequality in this equation follows in the same manner (recall that (1.2) holds for any s ∈ IR).

Moments

Recall that the absolute moments of Z ∼ N(0, σ²) are given by

    IE[|Z|^k] = (1/√π) (2σ²)^{k/2} Γ((k + 1)/2) ,

where Γ(·) denotes the Gamma function defined by

    Γ(t) = ∫_0^∞ x^{t−1} e^{−x} dx ,   t > 0 .

The next lemma shows that the tail bounds of Lemma 1.3 are sufficient to show that the absolute moments of X ∼ subG(σ²) can be bounded by those of Z ∼ N(0, σ²) up to multiplicative constants.

Lemma 1.4. Let X be a random variable such that

    IP[|X| > t] ≤ 2 exp( −t²/(2σ²) ) ,

then for any positive integer k ≥ 1,

    IE[|X|^k] ≤ (2σ²)^{k/2} k Γ(k/2) .

In particular,

    ( IE[|X|^k] )^{1/k} ≤ σ e^{1/e} √k ,   k ≥ 2 ,

and IE[|X|] ≤ σ√(2π).

Proof.

    IE[|X|^k] = ∫_0^∞ IP(|X|^k > t) dt
              = ∫_0^∞ IP(|X| > t^{1/k}) dt
              ≤ ∫_0^∞ 2 e^{−t^{2/k}/(2σ²)} dt
              = (2σ²)^{k/2} k ∫_0^∞ e^{−u} u^{k/2−1} du ,   u = t^{2/k}/(2σ²)
              = (2σ²)^{k/2} k Γ(k/2) .

The second statement follows from Γ(k/2) ≤ (k/2)^{k/2} and k^{1/k} ≤ e^{1/e} for any k ≥ 2. It yields

    ( (2σ²)^{k/2} k Γ(k/2) )^{1/k} ≤ k^{1/k} √(2σ²k/2) ≤ e^{1/e} σ √k .

Moreover, for k = 1, we have √2 Γ(1/2) = √(2π).

Using moments, we can prove the following reciprocal to Lemma 1.3.

Lemma 1.5. If (1.3) holds and IE[X] = 0, then for any s > 0, it holds

    IE[exp(sX)] ≤ e^{4σ²s²} .

As a result, we will sometimes write X ∼ subG(σ2) when it satisfies (1.3) and IE[X] = 0.

Proof. We use the Taylor expansion of the exponential function as follows. Observe that by the dominated convergence theorem

    IE[e^{sX}] ≤ 1 + Σ_{k=2}^∞ s^k IE[|X|^k] / k!
              ≤ 1 + Σ_{k=2}^∞ (2σ²s²)^{k/2} k Γ(k/2) / k!
              = 1 + Σ_{k=1}^∞ (2σ²s²)^k 2k Γ(k) / (2k)! + Σ_{k=1}^∞ (2σ²s²)^{k+1/2} (2k+1) Γ(k+1/2) / (2k+1)!
              ≤ 1 + ( 2 + √(2σ²s²) ) Σ_{k=1}^∞ (2σ²s²)^k k! / (2k)!
              ≤ 1 + ( 1 + √(σ²s²/2) ) Σ_{k=1}^∞ (2σ²s²)^k / k!        ( using 2(k!)² ≤ (2k)! )
              = e^{2σ²s²} + √(σ²s²/2) ( e^{2σ²s²} − 1 )
              ≤ e^{4σ²s²} .

From the above Lemma, we see that sub-Gaussian random variables can be equivalently defined from their tail bounds and their moment generating functions, up to constants. 1.2. Sub-Gaussian random variables and Chernoff bounds 20

Sums of independent sub-Gaussian random variables

Recall that if X_1, . . . , X_n are iid N(0, σ²), then for any a ∈ IR^n,

    Σ_{i=1}^n a_i X_i ∼ N(0, |a|_2² σ²) .

If we only care about the tails, this property is preserved for sub-Gaussian random variables.

Theorem 1.6. Let X = (X1,...,Xn) be a vector of independent sub-Gaussian random variables that have variance proxy σ2. Then, the random vector X is sub-Gaussian with variance proxy σ2.

Proof. Let u ∈ S^{n−1} be a unit vector, then

    IE[e^{s u^⊤X}] = Π_{i=1}^n IE[e^{s u_i X_i}] ≤ Π_{i=1}^n e^{σ²s²u_i²/2} = e^{σ²s²|u|_2²/2} = e^{σ²s²/2} .

Using a Chernoff bound, we immediately get the following corollary

Corollary 1.7. Let X_1, . . . , X_n be n independent random variables such that X_i ∼ subG(σ²). Then for any a ∈ IR^n, we have

    IP[ Σ_{i=1}^n a_i X_i > t ] ≤ exp( −t²/(2σ²|a|_2²) ) ,

and

    IP[ Σ_{i=1}^n a_i X_i < −t ] ≤ exp( −t²/(2σ²|a|_2²) ) .

Of special interest is the case where a_i = 1/n for all i. Then, we get that the average X̄ = (1/n) Σ_{i=1}^n X_i satisfies

    IP(X̄ > t) ≤ e^{−nt²/(2σ²)}   and   IP(X̄ < −t) ≤ e^{−nt²/(2σ²)} ,

just like for the Gaussian average.
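A Monte Carlo sanity check of this bound (illustrative only; it uses Rademacher variables, which are subG(1) as shown via Hoeffding's lemma below):

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps, sigma2 = 100, 200_000, 1.0
X = rng.choice([-1.0, 1.0], size=(reps, n))     # Rademacher entries, subG(1)
Xbar = X.mean(axis=1)

for t in [0.1, 0.2, 0.3]:
    empirical = np.mean(Xbar > t)
    bound = np.exp(-n * t**2 / (2 * sigma2))
    print(f"t = {t}:  IP(Xbar > t) ≈ {empirical:.4f},  bound = {bound:.4f}")
```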

Hoeffding's inequality

The class of sub-Gaussian random variables is actually quite large. Indeed, Hoeffding's lemma below implies that all random variables that are bounded uniformly are actually sub-Gaussian with a variance proxy that depends on the size of their support.

Lemma 1.8 (Hoeffding's lemma (1963)). Let X be a random variable such that IE(X) = 0 and X ∈ [a, b] almost surely. Then, for any s ∈ IR, it holds

    IE[e^{sX}] ≤ e^{s²(b−a)²/8} .

In particular, X ∼ subG( (b−a)²/4 ).

Proof. Define ψ(s) = log IE[e^{sX}], and observe that we can readily compute

    ψ′(s) = IE[X e^{sX}] / IE[e^{sX}] ,     ψ″(s) = IE[X² e^{sX}] / IE[e^{sX}] − ( IE[X e^{sX}] / IE[e^{sX}] )² .

Thus ψ″(s) can be interpreted as the variance of the random variable X under the probability measure dQ = ( e^{sX} / IE[e^{sX}] ) dIP. But since X ∈ [a, b] almost surely, we have, under any probability,

    var(X) = var( X − (a+b)/2 ) ≤ IE[ ( X − (a+b)/2 )² ] ≤ (b−a)²/4 .

The fundamental theorem of calculus yields

    ψ(s) = ∫_0^s ∫_0^µ ψ″(ρ) dρ dµ ≤ s²(b−a)²/8 ,

using ψ(0) = log 1 = 0 and ψ′(0) = IE[X] = 0.

Using a Chernoff bound, we get the following (extremely useful) result.

Theorem 1.9 (Hoeffding's inequality). Let X_1, . . . , X_n be n independent random variables such that almost surely,

    X_i ∈ [a_i, b_i] ,   ∀ i .

Let X̄ = (1/n) Σ_{i=1}^n X_i, then for any t > 0,

    IP(X̄ − IE(X̄) > t) ≤ exp( −2n²t² / Σ_{i=1}^n (b_i − a_i)² ) ,

and

    IP(X̄ − IE(X̄) < −t) ≤ exp( −2n²t² / Σ_{i=1}^n (b_i − a_i)² ) .

Note that Hoeffding's lemma holds for any bounded random variable. For example, if one knows that X is a Rademacher random variable,

    IE(e^{sX}) = (e^s + e^{−s})/2 = cosh(s) ≤ e^{s²/2} .

Note that 2 is the best possible constant in the above approximation. For such variables, a = −1, b = 1, and IE(X) = 0, so Hoeffding's lemma yields

    IE(e^{sX}) ≤ e^{s²/2} .

Hoeffding's inequality is very general but there is a price to pay for this generality. Indeed, if the random variables have small variance, we would like to see it reflected in the exponential tail bound (as for the Gaussian case) but the variance does not appear in Hoeffding's inequality. We need a more refined inequality.

1.3 SUB-EXPONENTIAL RANDOM VARIABLES

What can we say when a centered random variable is not sub-Gaussian? A typical example is the double exponential (or Laplace) distribution with parameter 1, denoted by Lap(1). Let X ∼ Lap(1) and observe that

    IP(|X| > t) = e^{−t} ,   t ≥ 0 .

In particular, the tails of this distribution do not decay as fast as the Gaussian ones (which decay as e^{−t²/2}). Such tails are said to be heavier than Gaussian. This tail behavior is also captured by the moment generating function of X. Indeed, we have

    IE[e^{sX}] = 1/(1 − s²)   if |s| < 1 ,

and it is not defined for |s| ≥ 1. It turns out that a rather weak condition on the moment generating function is enough to partially reproduce some of the bounds that we have proved for sub-Gaussian random variables. Observe that for X ∼ Lap(1),

    IE[e^{sX}] ≤ e^{2s²}   if |s| < 1/2 .

In particular, the moment generating function of the Laplace distribution is bounded by that of a Gaussian in a neighborhood of 0 but does not even exist away from zero. It turns out that all distributions that have tails at most as heavy as that of a Laplace distribution satisfy such a property.

Lemma 1.10. Let X be a centered random variable such that IP(|X| > t) ≤ 2e^{−2t/λ} for some λ > 0. Then, for any positive integer k ≥ 1,

    IE[|X|^k] ≤ λ^k k! .

Moreover,

    ( IE[|X|^k] )^{1/k} ≤ 2λk ,

and the moment generating function of X satisfies

    IE[e^{sX}] ≤ e^{2s²λ²} ,   ∀ |s| ≤ 1/(2λ) .

Proof.

    IE[|X|^k] = ∫_0^∞ IP(|X|^k > t) dt
              = ∫_0^∞ IP(|X| > t^{1/k}) dt
              ≤ ∫_0^∞ 2 e^{−2t^{1/k}/λ} dt
              = 2(λ/2)^k k ∫_0^∞ e^{−u} u^{k−1} du ,   u = 2t^{1/k}/λ
              ≤ λ^k k Γ(k) = λ^k k! .

The second statement follows from Γ(k) ≤ k^k and k^{1/k} ≤ e^{1/e} ≤ 2 for any k ≥ 1. It yields

    ( λ^k k Γ(k) )^{1/k} ≤ 2λk .

To control the MGF of X, we use the Taylor expansion of the exponential function as follows. Observe that by the dominated convergence theorem, for any s such that |s| ≤ 1/(2λ),

    IE[e^{sX}] ≤ 1 + Σ_{k=2}^∞ |s|^k IE[|X|^k] / k!
              ≤ 1 + Σ_{k=2}^∞ (|s|λ)^k
              = 1 + s²λ² Σ_{k=0}^∞ (|s|λ)^k
              ≤ 1 + 2s²λ²          ( |s| ≤ 1/(2λ) )
              ≤ e^{2s²λ²} .

This leads to the following definition.

Definition 1.11. A random variable X is said to be sub-exponential with parameter λ (denoted X ∼ subE(λ)) if IE[X] = 0 and its moment generating function satisfies

    IE[e^{sX}] ≤ e^{s²λ²/2} ,   ∀ |s| ≤ 1/λ .

A simple and useful example of a sub-exponential random variable is given in the next lemma.

Lemma 1.12. Let X ∼ subG(σ²). Then the random variable Z = X² − IE[X²] is sub-exponential: Z ∼ subE(16σ²).

Proof. We have, by the dominated convergence theorem,

    IE[e^{sZ}] = 1 + Σ_{k=2}^∞ s^k IE[ (X² − IE[X²])^k ] / k!
              ≤ 1 + Σ_{k=2}^∞ s^k 2^{k−1} ( IE[X^{2k}] + (IE[X²])^k ) / k!        (Jensen)
              ≤ 1 + Σ_{k=2}^∞ s^k 4^k IE[X^{2k}] / (2(k!))                        (Jensen again)
              ≤ 1 + Σ_{k=2}^∞ s^k 4^k · 2(2σ²)^k k! / (2(k!))                     (Lemma 1.4)
              = 1 + (8sσ²)² Σ_{k=0}^∞ (8sσ²)^k
              ≤ 1 + 128 s²σ⁴          for |s| ≤ 1/(16σ²)
              ≤ e^{128 s²σ⁴} .
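A quick numerical check of Lemma 1.12 (illustrative; it takes X to be Gaussian, which is subG(σ²), and compares a Monte Carlo estimate of the MGF of Z with the subE(16σ²) bound):

```python
import numpy as np

rng = np.random.default_rng(5)
sigma = 1.0
X = rng.normal(scale=sigma, size=2_000_000)
Z = X**2 - sigma**2                       # Z = X^2 - IE[X^2], centered

lam = 16 * sigma**2                       # variance proxy from Lemma 1.12
for s in [0.01, 0.03, 1.0 / lam]:
    mgf = np.mean(np.exp(s * Z))          # Monte Carlo estimate of IE[e^{sZ}]
    bound = np.exp(s**2 * lam**2 / 2)     # subE(lam) bound, valid for |s| <= 1/lam
    print(f"s = {s:.4f}:  IE[e^(sZ)] ≈ {mgf:.4f},  bound = {bound:.4f}")
```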

Sub-exponential random variables also give rise to exponential deviation inequalities such as Corollary 1.7 (Chernoff bound) or Theorem 1.9 (Hoeffding's inequality) for weighted sums of independent sub-exponential random variables. The significant difference here is that the larger deviations are controlled by a weaker bound.

Bernstein’s inequality

Theorem 1.13 (Bernstein's inequality). Let X_1, . . . , X_n be independent random variables such that IE(X_i) = 0 and X_i ∼ subE(λ). Define

    X̄ = (1/n) Σ_{i=1}^n X_i .

Then for any t > 0 we have

    IP(X̄ > t) ∨ IP(X̄ < −t) ≤ exp[ −(n/2) ( t²/λ² ∧ t/λ ) ] .

Proof. Without loss of generality, assume that λ = 1 (we can always replace X_i by X_i/λ and t by t/λ). Next, using a Chernoff bound, we get for any s > 0,

    IP(X̄ > t) ≤ Π_{i=1}^n IE[e^{sX_i}] e^{−snt} .

Next, if |s| ≤ 1, then IE[e^{sX_i}] ≤ e^{s²/2} by definition of sub-exponential distributions. It yields

    IP(X̄ > t) ≤ e^{ns²/2 − snt} .

Choosing s = 1 ∧ t yields

    IP(X̄ > t) ≤ e^{−(n/2)(t² ∧ t)} .

We obtain the same bound for IP(X¯ < −t) which concludes the proof.

Note that usually, Bernstein's inequality refers to a slightly more precise result that is qualitatively the same as the one above: it exhibits a Gaussian tail e^{−nt²/(2λ²)} and an exponential tail e^{−nt/(2λ)}. See for example Theorem 2.10 in [BLM13].
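A simulation sketch of Bernstein's inequality (illustrative; it uses X_i = G_i² − 1 with G_i standard Gaussian, which are sub-exponential, and takes λ = 16 from Lemma 1.12 as the proxy, a convenient but not optimized choice):

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps, lam = 50, 200_000, 16.0
G = rng.normal(size=(reps, n))
Xbar = (G**2 - 1.0).mean(axis=1)          # averages of centered subE(lam) variables

for t in [0.3, 0.6, 1.0]:
    empirical = np.mean(Xbar > t)
    bound = np.exp(-0.5 * n * min(t**2 / lam**2, t / lam))
    print(f"t = {t}:  IP(Xbar > t) ≈ {empirical:.2e},  Bernstein bound = {bound:.2e}")
```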

1.4 MAXIMAL INEQUALITIES

The exponential inequalities of the previous section are valid for linear combinations of independent random variables, and, in particular, for the average X̄. In many instances, we will be interested in controlling the maximum over the parameters of such linear combinations (this is because of empirical risk minimization). The purpose of this section is to present such results.

Maximum over a finite set

We begin with the simplest case possible: the maximum over a finite set.

Theorem 1.14. Let X_1, . . . , X_N be N random variables such that X_i ∼ subG(σ²). Then

    IE[ max_{1≤i≤N} X_i ] ≤ σ√(2 log N) ,   and   IE[ max_{1≤i≤N} |X_i| ] ≤ σ√(2 log(2N)) .

Moreover, for any t > 0,

    IP( max_{1≤i≤N} X_i > t ) ≤ N e^{−t²/(2σ²)} ,   and   IP( max_{1≤i≤N} |X_i| > t ) ≤ 2N e^{−t²/(2σ²)} .

Note that the random variables in this theorem need not be independent. 1.4. Maximal inequalities 26

Proof. For any s > 0,

    IE[ max_{1≤i≤N} X_i ] = (1/s) IE[ log e^{s max_{1≤i≤N} X_i} ]
                          ≤ (1/s) log IE[ e^{s max_{1≤i≤N} X_i} ]      (by Jensen)
                          = (1/s) log IE[ max_{1≤i≤N} e^{sX_i} ]
                          ≤ (1/s) log Σ_{1≤i≤N} IE[ e^{sX_i} ]
                          ≤ (1/s) log Σ_{1≤i≤N} e^{σ²s²/2}
                          = (log N)/s + σ²s/2 .

Taking s = √(2(log N)/σ²) yields the first inequality in expectation.

The first inequality in probability is obtained by a simple union bound:

    IP( max_{1≤i≤N} X_i > t ) = IP( ∪_{1≤i≤N} {X_i > t} )
                              ≤ Σ_{1≤i≤N} IP(X_i > t)
                              ≤ N e^{−t²/(2σ²)} ,

where we used Lemma 1.3 in the last inequality.

The remaining two inequalities follow trivially by noting that

    max_{1≤i≤N} |X_i| = max_{1≤i≤2N} X_i ,

where X_{N+i} = −X_i for i = 1, . . . , N.

Extending these results to a maximum over an infinite set may be impossible. For example, if one is given an infinite sequence of iid N(0, σ²) random variables X_1, X_2, . . ., then for any N ≥ 1, we have for any t > 0,

    IP( max_{1≤i≤N} X_i < t ) = [IP(X_1 < t)]^N → 0 ,   N → ∞ .

On the opposite side of the picture, if all the X_i's are equal to the same random variable X, we have for any t > 0,

    IP( max_{1≤i≤N} X_i < t ) = IP(X_1 < t) > 0 ,   ∀ N ≥ 1 .

In the Gaussian case, lower bounds are also available. They illustrate the effect of the correlation between the X_i's.
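A quick simulation (illustrative) of the expectation bound of Theorem 1.14 for independent standard Gaussians:

```python
import numpy as np

rng = np.random.default_rng(7)
reps = 5_000
for N in [10, 100, 1000]:
    X = rng.normal(size=(reps, N))             # N(0,1) variables are subG(1)
    emp = X.max(axis=1).mean()                 # Monte Carlo IE[max_i X_i]
    bound = np.sqrt(2 * np.log(N))
    print(f"N = {N:5d}:  IE[max] ≈ {emp:.3f},  bound sqrt(2 log N) = {bound:.3f}")
```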

Examples from statistics have structure and we encounter many examples where a maximum of random variables over an infinite set is in fact finite. This is due to the fact that the random variables that we are considering are not independent from each other. In the rest of this section, we review some of these examples.

Maximum over a convex polytope

We use the definition of a polytope from [Gru03]: a convex polytope P is a compact set with a finite number of vertices V(P) called extreme points. It satisfies P = conv(V(P)), where conv(V(P)) denotes the convex hull of the vertices of P.

Let X ∈ IR^d be a random vector and consider the (infinite) family of random variables

    F = { θ^⊤X : θ ∈ P } ,

where P ⊂ IR^d is a polytope with N vertices. While the family F is infinite, the maximum over F can be reduced to a finite maximum using the following useful lemma.

Lemma 1.15. Consider a linear form x ↦ c^⊤x, x, c ∈ IR^d. Then for any convex polytope P ⊂ IR^d,

    max_{x∈P} c^⊤x = max_{x∈V(P)} c^⊤x ,

where V(P) denotes the set of vertices of P.

Proof. Assume that V(P) = {v_1, . . . , v_N}. For any x ∈ P = conv(V(P)), there exist nonnegative numbers λ_1, . . . , λ_N that sum up to 1 and such that x = λ_1 v_1 + · · · + λ_N v_N. Thus

    c^⊤x = c^⊤( Σ_{i=1}^N λ_i v_i ) = Σ_{i=1}^N λ_i c^⊤v_i ≤ Σ_{i=1}^N λ_i max_{x∈V(P)} c^⊤x = max_{x∈V(P)} c^⊤x .

It yields

    max_{x∈P} c^⊤x ≤ max_{x∈V(P)} c^⊤x ≤ max_{x∈P} c^⊤x ,

so the two quantities are equal.

It immediately yields the following theorem.

Theorem 1.16. Let P be a polytope with N vertices v^{(1)}, . . . , v^{(N)} ∈ IR^d and let X ∈ IR^d be a random vector such that [v^{(i)}]^⊤X, i = 1, . . . , N, are sub-Gaussian random variables with variance proxy σ². Then

    IE[ max_{θ∈P} θ^⊤X ] ≤ σ√(2 log N) ,   and   IE[ max_{θ∈P} |θ^⊤X| ] ≤ σ√(2 log(2N)) .

Moreover, for any t > 0,

    IP( max_{θ∈P} θ^⊤X > t ) ≤ N e^{−t²/(2σ²)} ,   and   IP( max_{θ∈P} |θ^⊤X| > t ) ≤ 2N e^{−t²/(2σ²)} .

Of particular interest are polytopes that have a small number of vertices. A primary example is the ℓ_1 ball of IR^d defined by

    B_1 = { x ∈ IR^d : Σ_{i=1}^d |x_i| ≤ 1 } .

Indeed, it has exactly 2d vertices.
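By Lemma 1.15, the maximum of θ^⊤X over B_1 is attained at one of the 2d vertices ±e_1, . . . , ±e_d, i.e., it equals max_j |X_j|. The sketch below (illustrative, with Gaussian X) compares its expectation with the bound σ√(2 log(2d)) from Theorem 1.16.

```python
import numpy as np

rng = np.random.default_rng(8)
reps, d, sigma = 20_000, 500, 1.0
X = rng.normal(scale=sigma, size=(reps, d))

# max over the l1 ball = max over its 2d vertices {+-e_1, ..., +-e_d} = max_j |X_j|
emp = np.abs(X).max(axis=1).mean()
print("IE[max over B_1 of theta^T X] ≈", round(emp, 3))
print("bound sigma*sqrt(2 log(2d))   =", round(sigma * np.sqrt(2 * np.log(2 * d)), 3))
```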

Maximum over the ℓ_2 ball

Recall that the unit ℓ_2 ball of IR^d is defined by the set of vectors u that have Euclidean norm |u|_2 at most 1. Formally, it is defined by

    B_2 = { x ∈ IR^d : Σ_{i=1}^d x_i² ≤ 1 } .

Clearly, this ball is not a polytope and yet, we can control the maximum of random variables indexed by B_2. This is due to the fact that there exists a finite subset of B_2 such that the maximum over this finite set is of the same order as the maximum over the entire ball.

Definition 1.17. Fix K ⊂ IRd and ε > 0. A set N is called an ε-net of K with respect to a distance d(·, ·) on IRd, if N ⊂ K and for any z ∈ K, there exists x ∈ N such that d(x, z) ≤ ε.

Therefore, if N is an ε-net of K with respect to a norm k · k, then every point of K is at distance at most ε from a point in N . Clearly, every compact set admits a finite ε-net. The following lemma gives an upper bound on the size of the smallest ε-net of B2.

Lemma 1.18. Fix ε ∈ (0, 1). Then the unit Euclidean ball B_2 has an ε-net N with respect to the Euclidean distance of cardinality |N| ≤ (3/ε)^d.

Proof. Consider the following iterative construction of the ε-net. Choose x_1 = 0. For any i ≥ 2, take x_i to be any x ∈ B_2 such that |x − x_j|_2 > ε for all j < i. If no such x exists, stop the procedure. Clearly, this will create an ε-net. We now control its size. Observe that since |x − y|_2 > ε for all x, y ∈ N, the Euclidean balls centered at x ∈ N and with radius ε/2 are disjoint. Moreover,

∪_{z∈N} { z + (ε/2)B_2 } ⊂ (1 + ε/2) B_2 ,

where { z + (ε/2)B_2 } = { z + (ε/2)x : x ∈ B_2 }. Thus, measuring volumes, we get

vol( (1 + ε/2)B_2 ) ≥ vol( ∪_{z∈N} { z + (ε/2)B_2 } ) = ∑_{z∈N} vol( { z + (ε/2)B_2 } ) .

This is equivalent to

(1 + ε/2)^d ≥ |N| (ε/2)^d .

Therefore, we get the following bound:

|N| ≤ (1 + 2/ε)^d ≤ (3/ε)^d .
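The greedy construction in the proof of Lemma 1.18 is easy to run in small dimension. Below is an illustrative Python sketch (our own, with arbitrary parameters) that extracts such a net from a finite sample of B_2 and compares its size with the (3/ε)^d bound.

```python
# Greedy eps-net construction from the proof of Lemma 1.18, run on a finite
# random sample of the unit Euclidean ball rather than the whole ball.
import numpy as np

def greedy_eps_net(points, eps):
    """Greedily keep a point whenever it is more than eps away from all kept points."""
    net = []
    for x in points:
        if all(np.linalg.norm(x - z) > eps for z in net):
            net.append(x)
    return np.array(net)

rng = np.random.default_rng(1)
d, eps = 3, 0.5
# Points approximately uniform on the unit ball B_2.
u = rng.normal(size=(5000, d))
u /= np.linalg.norm(u, axis=1, keepdims=True)
radii = rng.uniform(size=(5000, 1)) ** (1.0 / d)
ball_points = radii * u

net = greedy_eps_net(ball_points, eps)
print(len(net), (3 / eps) ** d)   # the net size stays well below the (3/eps)^d bound
```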

Theorem 1.19. Let X ∈ IR^d be a sub-Gaussian random vector with variance proxy σ². Then

IE[ max_{θ∈B_2} θ^⊤X ] = IE[ max_{θ∈B_2} |θ^⊤X| ] ≤ 4σ√d .

Moreover, for any δ > 0, with probability 1 − δ, it holds

max_{θ∈B_2} θ^⊤X = max_{θ∈B_2} |θ^⊤X| ≤ 4σ√d + 2σ√(2 log(1/δ)) .

Proof. Let N be a 1/2-net of B2 with respect to the Euclidean norm that d satisfies |N | ≤ 6 . Next, observe that for every θ ∈ B2, there exists z ∈ N and x such that |x|2 ≤ 1/2 and θ = z + x. Therefore,

max θ>X ≤ max z>X + max x>X θ∈B z∈N 1 2 x∈ 2 B2 But 1 max x>X = max x>X 1 x∈B2 x∈ 2 B2 2 Therefore, using Theorem 1.14, we get √ IE[max θ>X] ≤ 2IE[max z>X] ≤ 2σp2 log(|N |) ≤ 2σp2(log 6)d ≤ 4σ d . θ∈B2 z∈N

The bound with high probability then follows because

2 2 >  >  − t d − t IP max θ X > t ≤ IP 2 max z X > t ≤ |N |e 8σ2 ≤ 6 e 8σ2 . θ∈B2 z∈N To conclude the proof, we find t such that

2 − t +d log(6) 2 2 2 e 8σ2 ≤ δ ⇔ t ≥ 8 log(6)σ d + 8σ log(1/δ) . √ Therefore, it is sufficient to take t = p8 log(6)σ d + 2σp2 log(1/δ). 1.5. Sums of independent random matrices 30

1.5 SUMS OF INDEPENDENT RANDOM MATRICES

In this section, we are going to explore how concentration statements can be extended to sums of matrices. As an example, we are going to show a version of Bernstein's inequality (Theorem 1.13) for sums of independent matrices, closely following the presentation in [Tro12], which builds on previous work in [AW02]. Results of this type have been crucial in providing guarantees for low-rank recovery, see for example [Gro11]. In particular, we want to control the maximum eigenvalue of a sum of independent random symmetric matrices, that is IP(λ_max(∑_i X_i) > t). The tools involved will closely mimic those used for scalar random variables, but with the caveat that care must be taken when handling exponentials of matrices because e^{A+B} ≠ e^A e^B if A, B ∈ IR^{d×d} and AB ≠ BA.

Preliminaries

We denote the set of symmetric, positive semi-definite, and positive definite d × d matrices in IR^{d×d} by S_d, S_d^+, and S_d^{++}, respectively, and will omit the subscript when the dimensionality is clear. Here, positive semi-definite means that a matrix X ∈ S^+ satisfies v^⊤Xv ≥ 0 for all v ∈ IR^d, |v|_2 = 1, and positive definite means that the inequality holds strictly. This is equivalent to all eigenvalues of X being nonnegative (strictly positive, respectively), λ_j(X) ≥ 0 for all j = 1, . . . , d.
The cone of positive semi-definite matrices induces an order on matrices by setting A ⪯ B if B − A ∈ S^+.
Since we want to extend the notion of exponentiating random variables that was essential to our derivation of the Chernoff bound to matrices, we will make use of the matrix exponential and the matrix logarithm. For a symmetric matrix X, we can define a functional calculus by applying a function f : IR → IR to the diagonal elements of its spectral decomposition, i.e., if X = QΛQ^⊤, then

f(X) := Q f(Λ) Q^⊤ , (1.4)

where

[f(Λ)]_{i,j} = f(Λ_{i,j}), i, j ∈ [d], (1.5)

is only non-zero on the diagonal, and f(X) is well-defined because the spectral decomposition is unique up to the ordering of the eigenvalues and the basis of the eigenspaces, with respect to both of which (1.4) is invariant as well. From the definition, it is clear that for the spectrum of the resulting matrix,

σ(f(X)) = f(σ(X)). (1.6)
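As a small computational aside (not part of the original development), the functional calculus (1.4) amounts to applying f to the eigenvalues in an eigendecomposition; the sketch below also checks the spectral identity (1.6) for f = exp. Function and variable names are our own.

```python
# Functional calculus (1.4): apply a scalar function to the eigenvalues of a
# symmetric matrix, and check (1.6) on the matrix exponential.
import numpy as np

def matrix_function(X, f):
    """Return f(X) = Q f(Lambda) Q^T for a symmetric matrix X."""
    eigvals, Q = np.linalg.eigh(X)          # spectral decomposition X = Q diag(eigvals) Q^T
    return Q @ np.diag(f(eigvals)) @ Q.T

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4))
X = (A + A.T) / 2                            # a symmetric test matrix

expX = matrix_function(X, np.exp)
# sigma(f(X)) = f(sigma(X)):
print(np.allclose(np.sort(np.linalg.eigvalsh(expX)),
                  np.sort(np.exp(np.linalg.eigvalsh(X)))))
```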

While inequalities between functions do not generally carry over to matrices, we have the following transfer rule:

f(a) ≤ g(a) for a ∈ σ(X)  =⇒  f(X) ⪯ g(X). (1.7)

We can now define the matrix exponential by (1.4), which is equivalent to defining it via a power series,

exp(X) = ∑_{k=0}^∞ X^k / k! .

Similarly, we can define the matrix logarithm as the inverse function of exp on S, log(e^A) = A, which defines it on S^{++}. Some nice properties of these functions on matrices are their monotonicity: for A ⪯ B,

Tr exp A ≤ Tr exp B, (1.8)

and for 0 ≺ A ⪯ B,

log A ⪯ log B. (1.9)

Note that the analogue of (1.9) is in general not true for the matrix exponential.

The Laplace transform method

For the remainder of this section, let X_1, . . . , X_n ∈ IR^{d×d} be independent random symmetric matrices. As a first step that can lead to different types of matrix concentration inequalities, we give a generalization of the Chernoff bound for the maximum eigenvalue of a matrix.

Proposition 1.20. Let Y be a random symmetric matrix. Then, for all t ∈ IR,

IP( λ_max(Y) ≥ t ) ≤ inf_{θ>0} { e^{−θt} IE[ Tr e^{θY} ] } .

Proof. We multiply both sides of the inequality λmax(Y) ≥ t by θ, take ex- ponentials, apply the spectral theorem (1.6) and then estimate the maximum eigenvalue by the sum over all eigenvalues, the trace.

IP(λmax(Y) ≥ t) = IP(λmax(θY) ≥ θt) = IP(eλmax(θY) ≥ eθt) ≤ e−θtIE[eλmax(θY)] −θt θY = e IE[λmax(e )] ≤ e−θtIE[Tr(eθY)].

Recall that the crucial step in proving Bernstein’s and Hoeffding’s inequal- ity was to exploit the independence of the summands by the fact that the exponential function turns products into sums.

IE[ e^{θ ∑_i X_i} ] = IE[ ∏_i e^{θX_i} ] = ∏_i IE[ e^{θX_i} ] .

This property, e^{A+B} = e^A e^B, no longer holds true for matrices, unless they commute. We could try to replace it with a similar property, the Golden-Thompson inequality,

Tr[ e^{θ(X_1+X_2)} ] ≤ Tr[ e^{θX_1} e^{θX_2} ] .

Unfortunately, this does not generalize to more than two matrices, and when trying to peel off factors, we would have to pull a maximum eigenvalue out of the trace,

Tr[ e^{θX_1} e^{θX_2} ] ≤ λ_max( e^{θX_2} ) Tr[ e^{θX_1} ] .

This is the approach followed by Ahlswede-Winter [AW02], which leads to worse constants for concentration than the ones we obtain below. Instead, we are going to use the following deep theorem due to Lieb [Lie73]. A sketch of the proof can be found in the appendix of [Rus02].

Theorem 1.21 (Lieb). Let H ∈ IRd×d be symmetric. Then,

S_d^{++} → IR,  A ↦ Tr e^{H + log A}

is a concave function.

Corollary 1.22. Let X, H ∈ Sd be two symmetric matrices such that X is random and H is fixed. Then,

IE[ Tr e^{H+X} ] ≤ Tr e^{H + log IE[e^X]} .

Proof. Write Y = eX and use Jensen’s inequality:

IE[ Tr e^{H+X} ] = IE[ Tr e^{H + log Y} ] ≤ Tr e^{H + log(IE[Y])} = Tr e^{H + log IE[e^X]} .

With this, we can establish a better bound on the moment generating func- tion of sums of independent matrices.

Lemma 1.23. Let X_1, . . . , X_n be n independent random symmetric matrices. Then,

IE[ Tr exp( ∑_i θX_i ) ] ≤ Tr exp( ∑_i log IE[e^{θX_i}] ) .

Proof. Without loss of generality, we can assume θ = 1. Write IEk for the expectation conditioned on X1,..., Xk, IEk[ . ] = IE[ . |X1,..., Xk]. By the 1.5. Sums of independent random matrices 33 tower property,

n n X X IE[Tr exp( Xi)] = IE0 ... IEn−1Tr exp( Xi) i=1 i=1 n−1 X Xn  ≤ IE0 ... IEn−2Tr exp Xi + log IEn−1e i=1 | {z } =IEeXn . . n X ≤ Tr exp( log IE[eXi ]), i=1 where we applied Lieb’s theorem, Theorem 1.21 on the conditional expectation and then used the independence of X1,..., Xn.

Tail bounds for sums of independent matrices Theorem 1.24 (Master tail bound). For all t ∈ IR,

IP( λ_max(∑_i X_i) ≥ t ) ≤ inf_{θ>0} { e^{−θt} Tr exp( ∑_i log IE[e^{θX_i}] ) } .

Proof. Combine Lemma 1.23 with Proposition 1.20.

Corollary 1.25. Assume that there is a function g : (0, ∞) → [0, ∞] and fixed symmetric matrices A_i such that IE[e^{θX_i}] ⪯ e^{g(θ)A_i} for all i. Set

ρ = λ_max( ∑_i A_i ) .

Then, for all t ∈ IR,

IP( λ_max(∑_i X_i) ≥ t ) ≤ d inf_{θ>0} { e^{−θt + g(θ)ρ} } .

Proof. Using the operator monotonicity of log, (1.9), and the monotonicity of Tr exp, (1.8), we can plug the estimates for the matrix mgfs into the master inequality, Theorem 1.24, to obtain

X −θt X IP(λmax( Xi) ≥ t) ≤ e Tr exp(g(θ) Ai) i i −θt X ≤ de λmax(exp(g(θ) Ai)) i −θt X = de exp(g(θ)λmax( Ai)), i Pd where we estimated Tr(X) = j=1 λj ≤ dλmax and used the spectral theorem. 1.5. Sums of independent random matrices 34

Now we are ready to prove Bernstein's inequality. We just need to come up with a dominating function for Corollary 1.25.

Lemma 1.26. If IE[X] = 0 and λmax(X) ≤ 1, then IE[eθX]  exp((eθ − θ − 1)IE[X2]).

Proof. Define eθx − θx − 1  , x 6= 0,  2 f(x) = x θ2  , x = 0,  2 and verify that this is a smooth and increasing function on IR. Hence, f(x) ≤ f(1) for x ≤ 1. By the transfer rule, (1.7), f(X)  f(I) = f(1)I. Therefore,

eθX = I + θX + Xf(X)X  I + θX + f(1)X2, and by taking expectations,

IE[eθX]  I + f(1)IE[X2]  exp(f(1)IE[X2]) = exp((eθ − θ − 1)IE[X2]).

Theorem 1.27 (Matrix Bernstein). Assume IE[X_i] = 0 and λ_max(X_i) ≤ R almost surely for all i and set

σ² = ‖ ∑_i IE[X_i²] ‖_op .

Then, for all t ≥ 0,

IP( λ_max(∑_i X_i) ≥ t ) ≤ d exp( −(σ²/R²) h(Rt/σ²) )
  ≤ d exp( −(t²/2)/(σ² + Rt/3) )
  ≤ d exp( −3t²/(8σ²) ) if t ≤ σ²/R , and ≤ d exp( −3t/(8R) ) if t ≥ σ²/R ,

where h(u) = (1 + u) log(1 + u) − u, u > 0.

Proof. Without loss of generality, take R = 1. By Lemma 1.26,

IE[ e^{θX_i} ] ⪯ exp( g(θ) IE[X_i²] ), with g(θ) = e^θ − θ − 1, θ > 0. By Corollary 1.25,

IP( λ_max(∑_i X_i) ≥ t ) ≤ d exp( −θt + g(θ)σ² ) .

We can verify that the minimal value is attained at θ = log(1 + t/σ²). The second inequality then follows from the fact that h(u) ≥ h_2(u) = (u²/2)/(1 + u/3) for u ≥ 0, which can be verified by comparing derivatives. The third inequality follows from exploiting the properties of h_2.
With similar techniques, one can also prove a version of Hoeffding's inequality for matrices, see [Tro12, Theorem 1.3].
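The bound of Theorem 1.27 can be compared with a quick simulation. The sketch below is our own illustration; the sampling scheme (Rademacher signs times fixed symmetric matrices of unit operator norm) is an arbitrary choice, and the script simply prints the empirical tail frequency of λ_max(∑_i X_i) next to the second bound of the theorem.

```python
# Monte Carlo check of matrix Bernstein for X_i = eps_i * A_i, with eps_i
# Rademacher and A_i fixed symmetric matrices with operator norm 1 (so R = 1,
# E[X_i] = 0 and lambda_max(X_i) <= R).
import numpy as np

rng = np.random.default_rng(3)
d, n, n_mc = 5, 200, 2000
A = rng.normal(size=(n, d, d))
A = (A + np.transpose(A, (0, 2, 1))) / 2
A /= np.linalg.norm(A, ord=2, axis=(1, 2), keepdims=True)   # operator norm 1
R = 1.0
sigma2 = np.linalg.norm(sum(A[i] @ A[i] for i in range(n)), ord=2)  # ||sum_i E[X_i^2]||_op

t = 3 * np.sqrt(sigma2)
lam_max = np.empty(n_mc)
for m in range(n_mc):
    eps = rng.choice([-1.0, 1.0], size=n)
    S = np.tensordot(eps, A, axes=1)         # sum_i eps_i A_i
    lam_max[m] = np.linalg.eigvalsh(S)[-1]

empirical = np.mean(lam_max >= t)
bound = d * np.exp(-(t ** 2 / 2) / (sigma2 + R * t / 3))
print(empirical, bound)                      # empirical tail frequency vs. the Bernstein bound
```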

1.6 PROBLEM SET

Problem 1.1. Let X1,...,Xn be independent random variables such that > n IE(Xi) = 0 and Xi ∼ subE(λ). For any vector a = (a1, . . . , an) ∈ IR , define the weighted sum n X S(a) = aiXi , i=1 Show that for any t > 0 we have

 2  t t  IP(|S(a)| > t) ≤ 2 exp −C 2 2 ∧ . λ |a|2 λ|a|∞ for some positive constant C.

2 Problem 1.2. A random variable X has χn (chi-squared with n degrees of 2 2 freedom) if it has the same distribution as Z1 + ... + Zn, where Z1,...,Zn are i.i.d. N (0, 1). (a) Let Z ∼ N (0, 1). Show that the moment generating function of Y = Z2 − 1 satisfies  e−s  sY   √ if s < 1/2 φ(s) := E e = 1 − 2s  ∞ otherwise

(b) Show that for all 0 < s < 1/2,

 s2  φ(s) ≤ exp . 1 − 2s

(c) Conclude that √ IP(Y > 2t + 2 t) ≤ e−t √ [Hint: you can use the convexity inequality 1 + u ≤ 1+u/2].

2 (d) Show that if X ∼ χn, then, with probability at least 1 − δ, it holds

X ≤ n + 2pn log(1/δ) + 2 log(1/δ) .

Problem 1.3. Let X1,X2 ... be an infinite sequence of sub-Gaussian random 2 −1 variables with variance proxy σi = C(log i) . Show that for C large enough, we get   IE max Xi < ∞ . i≥2 1.6. Problem set 37

Problem 1.4. Let A = {A_{i,j}}_{1≤i≤n, 1≤j≤m} be a random matrix such that its entries are i.i.d. sub-Gaussian random variables with variance proxy σ².
(a) Show that the matrix A is sub-Gaussian. What is its variance proxy?
(b) Let ‖A‖ denote the operator norm of A defined by

max_{x∈IR^m, x≠0} |Ax|_2 / |x|_2 .

Show that there exists a constant C > 0 such that

IE‖A‖ ≤ C( √m + √n ) .

n Problem 1.5. Recall that for any q ≥ 1, the `q norm of a vector x ∈ IR is defined by n 1  X q q |x|q = |xi| . i=1

Let X = (X_1, . . . , X_n) be a vector with independent entries such that X_i is sub-Gaussian with variance proxy σ² and IE(X_i) = 0.
(a) Show that for any q ≥ 2 and any x ∈ IR^n,

|x|_2 ≤ |x|_q n^{1/2 − 1/q} ,

and prove that the above inequality cannot be improved.
(b) Show that for any q > 1,

IE|X|_q ≤ 4σ n^{1/q} √q .

(c) Recover from this bound that p IE max |Xi| ≤ 4eσ log n . 1≤i≤n

Problem 1.6. Let K be a compact subset of the unit sphere of IRp that p admits an ε-net Nε with respect to the Euclidean distance of IR that satisfies d |Nε| ≤ (C/ε) for all ε ∈ (0, 1). Here C ≥ 1 and d ≤ p are positive constants. 2 Let X ∼ subGp(σ ) be a centered random vector. Show that there exists positive constants c1 and c2 to be made explicit such that for any δ ∈ (0, 1), it holds

> p p max θ X ≤ c1σ d log(2p/d) + c2σ log(1/δ) θ∈K with probability at least 1−δ. Comment on the result in light of Theorem 1.19. 1.6. Problem set 38

Problem 1.7. For any K ⊂ IRd, distance d on IRd and ε > 0, the ε-covering number C(ε) of K is the cardinality of the smallest ε-net of K. The ε-packing number P (ε) of K is the cardinality of the largest set P ⊂ K such that d(z, z0) > ε for all z, z0 ∈ P, z 6= z0. Show that

C(2ε) ≤ P (2ε) ≤ C(ε) .

Problem 1.8. Let X_1, . . . , X_n be n independent random variables such that IE[X_i] = µ and var(X_i) ≤ σ². Fix δ ∈ (0, 1) and assume without loss of generality that n can be factored into n = k · G where G = 8 log(1/δ) is a positive integer. For g = 1, . . . , G, let X̄_g denote the average over the gth group of k variables. Formally,

X̄_g = (1/k) ∑_{i=(g−1)k+1}^{gk} X_i .

1. Show that for any g = 1, . . . , G,

 2σ  1 IP X¯g − µ > √ ≤ . k 4

2. Letµ ˆ be defined as the median of {X¯1,..., X¯G}. Show that 2σ G IPµˆ − µ > √  ≤ IPB ≥  , k 2 where B ∼ Bin(G, 1/4). 3. Conclude that r 2 log(1/δ) IPµˆ − µ > 4σ  ≤ δ n 4. Compare this result with Corollary 1.7 and Lemma 1.3. Can you conclude thatµ ˆ − µ ∼ subG(¯σ2/n) for someσ ¯2? Conclude. Problem 1.9. The goal of this problem is to prove the following theorem:

Theorem 1.28 (Johnson-Lindenstrauss Lemma). Given n points X = {x_1, . . . , x_n} in IR^d, let Q ∈ IR^{k×d} be a random projection operator and set P := √(d/k) Q. There is a constant C > 0 such that if

k ≥ (C/ε²) log n ,

then P is an ε-isometry for X, i.e.,

(1 − ε)‖x_i − x_j‖_2² ≤ ‖Px_i − Px_j‖_2² ≤ (1 + ε)‖x_i − x_j‖_2² , for all i, j,

with probability at least 1 − 2 exp(−cε²k).

1. Convince yourself that if d > n, there is a projection P ∈ IRn×d to an n dimensional subspace such that kP xi −P xjk2 = kxi −xjk2, i.e., pairwise distances are exactly preserved.

Let k ≤ d be two integers, Y = (y1, . . . , yd) ∼ N (0,Id×d) independent and identically distributed Gaussians and Q ∈ IRd×k the projection onto the first k 1 1 coordinates, i.e., Qy = (y1, . . . , yk). Define Z = kY k QY = kY k (y1, . . . , yk) and L = kZk2.

k 2. Show that IE[L] = d . 3. Show that for all t > 0 such that 1 − 2t(kβ − d) > 0 and 1 − 2tβk > 0,

k d ! X k X IP y2 ≤ β y2 ≤ (1 − 2t(kβ − d))−k/2(1 − 2tβk)−(d−k)/2 i d i i=1 i=1

2 (Hint: Show that IE[eλX ] = √ 1 for λ < 1 if X ∼ N (0, 1).) 1−2λ 2 4. Conclude that for β < 1,

 k  k  IP L ≤ β ≤ exp (1 − β + log β) . d 2

5. Similarly, show that for β > 1,

 k  k  IP L ≥ β ≤ exp (1 − β + log β) . d 2

6. Show that for a random projection operator Q ∈ IRk×d and a fixed vector x ∈ IRd,

2 k 2 a) IE[kQxk ] = d kxk . b) For ε ∈ (0, 1), there is a constant c > 0 such that with probability at least 1 − 2 exp(−ckε2),

(1 − ε)(k/d)‖x‖_2² ≤ ‖Qx‖_2² ≤ (1 + ε)(k/d)‖x‖_2² .

(Hint: Think about how to apply the previous results in this case and use the inequalities log(1 − ε) ≤ −ε − ε²/2 and log(1 + ε) ≤ ε − ε²/2 + ε³/3.)
7. Prove Theorem 1.28.

Chapter 2

Linear Regression Model

In this chapter, we consider the following regression model:

Y_i = f(X_i) + ε_i , i = 1, . . . , n , (2.1)

for some unknown function f, where the design points X_1, . . . , X_n are either fixed or random and the noise variables are such that IE[ε_i] = 0 and ε = (ε_1, . . . , ε_n)^⊤ is sub-Gaussian with variance proxy σ². Our goal is to estimate the function f. Note that when f(x) = x^⊤θ* for some unknown θ* ∈ IR^d, we estimate f under a linear assumption. Depending on the nature of the design points, we will favor a different measure of risk.

2.1 FIXED DESIGN LINEAR REGRESSION

Random design

The case of random design corresponds to the statistical learning setup. Let (X_1, Y_1), . . . , (X_{n+1}, Y_{n+1}) be n + 1 i.i.d. random couples. Given the pairs (X_1, Y_1), . . . , (X_n, Y_n), the goal is to construct a function f̂_n such that f̂_n(X_{n+1}) is a good predictor of Y_{n+1}. Note that when f̂_n is constructed, X_{n+1} is still unknown and we have to account for what value it is likely to take.
Consider the following example from [HTF01, Section 3.2]. The response Y is the log-volume of a cancerous tumor, and the goal is to predict it based on a collection of variables that are easier to measure (age of patient, log-weight of prostate, . . . ). Here the goal is clearly to construct f̂_n for prediction purposes. Indeed, we want to find an automatic mechanism that outputs a good prediction of the log-weight of the tumor given certain inputs for a new (unseen) patient.
A natural measure of performance here is the L2-risk employed in the introduction:

R(f̂_n) = IE[Y_{n+1} − f̂_n(X_{n+1})]² = IE[Y_{n+1} − f(X_{n+1})]² + ‖f̂_n − f‖²_{L²(P_X)} ,

where P_X denotes the marginal distribution of X_{n+1}. It measures how good the prediction of Y_{n+1} is on average over realizations of X_{n+1}. In particular, it does not put much emphasis on values of X_{n+1} that are not very likely to occur.
Note that if the ε_i are random variables with variance σ², then one simply has R(f̂_n) = σ² + ‖f̂_n − f‖²_{L²(P_X)}. Therefore, for random design, we will focus on the squared L² norm ‖f̂_n − f‖²_{L²(P_X)} as a measure of accuracy. It measures how close f̂_n is to the unknown f on average over realizations of X_{n+1}.

Fixed design

In fixed design, the points (or vectors) X_1, . . . , X_n are deterministic. To emphasize this fact, we use lowercase letters x_1, . . . , x_n to denote fixed design. Of course, we can always think of them as realizations of a random variable but the distinction between fixed and random design is deeper and significantly affects our measure of performance. Indeed, recall that for random design, we look at the performance on average over realizations of X_{n+1}. Here, there is no such thing as a marginal distribution of X_{n+1}. Rather, since the design points x_1, . . . , x_n are considered deterministic, our goal is to estimate f only at these points. This problem is sometimes called denoising since our goal is to recover f(x_1), . . . , f(x_n) given noisy observations of these values.
In many instances, fixed designs exhibit particular structures. A typical example is the regular design on [0, 1], given by x_i = i/n, i = 1, . . . , n. Interpolation between these points is possible under smoothness assumptions.
Note that in fixed design, we observe µ* + ε, where µ* = (f(x_1), . . . , f(x_n))^⊤ ∈ IR^n and ε = (ε_1, . . . , ε_n)^⊤ ∈ IR^n is sub-Gaussian with variance proxy σ². Instead of a functional estimation problem, it is often simpler to view this problem as a vector problem in IR^n. This point of view will allow us to leverage the Euclidean geometry of IR^n.
In the case of fixed design, we will focus on the Mean Squared Error (MSE) as a measure of performance. It is defined by

MSE(f̂_n) = (1/n) ∑_{i=1}^n ( f̂_n(x_i) − f(x_i) )² .

Equivalently, if we view our problem as a vector problem, it is defined by

MSE(µ̂) = (1/n) ∑_{i=1}^n ( µ̂_i − µ*_i )² = (1/n) |µ̂ − µ*|_2² .

Often, the design vectors x_1, . . . , x_n ∈ IR^d are stored in an n × d design matrix X, whose jth row is given by x_j^⊤. With this notation, the linear regression model can be written as

Y = Xθ* + ε , (2.2)

where Y = (Y_1, . . . , Y_n)^⊤ and ε = (ε_1, . . . , ε_n)^⊤. Moreover,

MSE(Xθ̂) = (1/n) |X(θ̂ − θ*)|_2² = (θ̂ − θ*)^⊤ (X^⊤X/n) (θ̂ − θ*) . (2.3)

A natural example of fixed design regression is image denoising. Assume that µ*_i, i ∈ {1, . . . , n}, is the grayscale value of pixel i of an image. We do not get to observe the image µ* but rather a noisy version of it, Y = µ* + ε. Given a library of d images {x_1, . . . , x_d}, x_j ∈ IR^n, our goal is to recover the original image µ* using linear combinations of the images x_1, . . . , x_d. This can be done fairly accurately (see Figure 2.1).

Figure 2.1. Reconstruction of the digit “6”: Original (left), Noisy (middle) and Recon- struction (right). Here n = 16 × 16 = 256 pixels. Source [RT11].

As we will see in Remark 2.3, properly choosing the design also ensures that ˆ ˆ > ˆ ˆ ∗ 2 if MSE(f) is small for some linear estimator f(x) = x θ, then |θ − θ |2 is also small.

In this chapter we only consider the fixed design case.

2.2 LEAST SQUARES ESTIMATORS

Throughout this section, we consider the regression model (2.2) with fixed design.

Unconstrained least squares estimator

Define the (unconstrained) least squares estimator θ̂^{ls} to be any vector such that

θ̂^{ls} ∈ argmin_{θ∈IR^d} |Y − Xθ|_2² .

Note that we are interested in estimating Xθ∗ and not θ∗ itself, so by exten- sion, we also callµ ˆls = Xθˆls least squares estimator. Observe thatµ ˆls is the projection of Y onto the column span of X. It is not hard to see that least squares estimators of θ∗ and µ∗ = Xθ∗ are 2 maximum likelihood estimators when ε ∼ N (0, σ In).

Proposition 2.1. The least squares estimator µ̂^{ls} = Xθ̂^{ls} ∈ IR^n satisfies

X^⊤µ̂^{ls} = X^⊤Y .

Moreover, θ̂^{ls} can be chosen to be

θ̂^{ls} = (X^⊤X)^† X^⊤Y ,

where (X^⊤X)^† denotes the Moore-Penrose pseudoinverse of X^⊤X.

Proof. The function θ ↦ |Y − Xθ|_2² is convex so any of its minima satisfies

∇_θ |Y − Xθ|_2² = 0 ,

where ∇_θ is the gradient operator. Using matrix calculus, we find

∇_θ |Y − Xθ|_2² = ∇_θ( |Y|_2² − 2Y^⊤Xθ + θ^⊤X^⊤Xθ ) = −2( Y^⊤X − θ^⊤X^⊤X )^⊤ .

Therefore, solving ∇_θ |Y − Xθ|_2² = 0 yields

X^⊤Xθ = X^⊤Y .

It concludes the proof of the first statement. The second statement follows from the definition of the Moore-Penrose pseudoinverse.
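As a quick numerical illustration (our own sketch, with arbitrary dimensions and noise level), the pseudoinverse formula of Proposition 2.1 and the MSE (2.3) can be computed directly and compared with the σ²r/n rate of the next theorem.

```python
# Simulate the fixed design model Y = X theta* + eps and compute the least
# squares estimator via the Moore-Penrose pseudoinverse.
import numpy as np

rng = np.random.default_rng(4)
n, d, sigma = 100, 10, 1.0
X = rng.normal(size=(n, d))                   # fixed design matrix
theta_star = rng.normal(size=d)
Y = X @ theta_star + sigma * rng.normal(size=n)

theta_ls = np.linalg.pinv(X.T @ X) @ X.T @ Y  # theta_hat^ls = (X^T X)^+ X^T Y
mse = np.mean((X @ (theta_ls - theta_star)) ** 2)   # MSE(X theta_hat) as in (2.3)

print(mse, sigma ** 2 * d / n)                # compare with the sigma^2 r / n rate (here r = d)
```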

We are now going to prove our first result on the finite sample performance of the least squares estimator for fixed design.

Theorem 2.2. Assume that the linear model (2.2) holds where ε ∼ subG_n(σ²). Then the least squares estimator θ̂^{ls} satisfies

IE[ MSE(Xθ̂^{ls}) ] = (1/n) IE|Xθ̂^{ls} − Xθ*|_2² ≲ σ² r/n ,

where r = rank(X^⊤X). Moreover, for any δ > 0, with probability at least 1 − δ, it holds

MSE(Xθ̂^{ls}) ≲ σ² (r + log(1/δ))/n .

Proof. Note that by definition

|Y − Xθ̂^{ls}|_2² ≤ |Y − Xθ*|_2² = |ε|_2² . (2.4)

Moreover,

|Y − Xθ̂^{ls}|_2² = |Xθ* + ε − Xθ̂^{ls}|_2² = |Xθ̂^{ls} − Xθ*|_2² − 2ε^⊤X(θ̂^{ls} − θ*) + |ε|_2² .

Therefore, we get

> ˆls ∗ ˆls ∗ 2 > ˆls ∗ ˆls ∗ ε X(θ − θ ) | θ − θ |2 ≤ 2ε (θ − θ ) = 2| θ − θ |2 (2.5) X X X X X ˆls ∗ |X(θ − θ )|2 Note that it is difficult to control ε>X(θˆls − θ∗) ˆls ∗ |X(θ − θ )|2 as θˆls depends on ε and the dependence structure of this term may be compli- cated. To remove this dependency, a traditional technique is to “sup-out” θˆls. This is typically where maximal inequalities are needed. Here we have to be a bit careful. n×r Let Φ = [φ1, . . . , φr] ∈ IR be an orthonormal basis of the column span of X. In particular, there exists ν ∈ IRr such that X(θˆls − θ∗) = Φν. It yields ε> (θˆls − θ∗) ε>Φν ε>Φν ν X = = =ε ˜> ≤ sup ε˜>u , ˆls ∗ |Φν| |ν| |ν| |X(θ − θ )|2 2 2 2 u∈B2 r > where B2 is the unit ball of IR andε ˜ = Φ ε. Thus ˆls ∗ 2 > 2 |Xθ − Xθ |2 ≤ 4 sup (˜ε u) , u∈B2 r−1 2 > > > Next, let u ∈ S and note that |Φu|2 = u Φ Φu = u u = 1, so that for any s ∈ IR, we have 2 2 sε˜>u sε>Φu s σ IE[e ] = IE[e ] ≤ e 2 . 2 Therefore,ε ˜ ∼ subGr(σ ). To conclude the bound in expectation, observe that Lemma 1.4 yields r  > 2 X 2 2 4IE sup (˜ε u) = 4 IE[˜εi ] ≤ 16σ r . u∈B2 i=1 Moreover, with probability 1 − δ, it follows from the last step in the proof1 of Theorem 1.19 that sup (˜ε>u)2 ≤ 8 log(6)σ2r + 8σ2 log(1/δ) . u∈B2

> X X Remark 2.3. If d ≤ n and B := n has rank d, then we have ˆls ˆls ∗ 2 MSE(Xθ ) |θ − θ |2 ≤ , λmin(B) ˆls ∗ 2 and we can use Theorem 2.2 to bound |θ − θ |2 directly. By contrast, in the high-dimensional case, we will need more structural assumptions to come to a similar conclusion. 1we could use Theorem 1.19 directly here but at the cost of a factor 2 in the constant. 2.2. Least squares estimators 45

Constrained least squares estimator

Let K ⊂ IR^d be a symmetric convex set. If we know a priori that θ* ∈ K, we may prefer a constrained least squares estimator θ̂^{ls}_K defined by

θ̂^{ls}_K ∈ argmin_{θ∈K} |Y − Xθ|_2² .

The fundamental inequality (2.4) would still hold and the bounds on the MSE may be smaller. Indeed, (2.5) can be replaced by

|Xθ̂^{ls}_K − Xθ*|_2² ≤ 2ε^⊤X(θ̂^{ls}_K − θ*) ≤ 2 sup_{θ∈K−K} (ε^⊤Xθ) ,

where K − K = {x − y : x, y ∈ K}. It is easy to see that if K is symmetric (x ∈ K ⇔ −x ∈ K) and convex, then K − K = 2K so that

2 sup_{θ∈K−K} (ε^⊤Xθ) = 4 sup_{v∈XK} (ε^⊤v) ,

where XK = {Xθ : θ ∈ K} ⊂ IR^n. This is a measure of the size (width) of XK. If ε ∼ N(0, I_n), the expected value of the above supremum is actually called the Gaussian width of XK. While ε is not Gaussian but sub-Gaussian here, similar properties will still hold.

ℓ_1 constrained least squares

Assume here that K = B_1 is the unit ℓ_1 ball of IR^d. Recall that it is defined by

B_1 = { x ∈ IR^d : ∑_{i=1}^d |x_i| ≤ 1 } ,

and it has exactly 2d vertices V = {e_1, −e_1, . . . , e_d, −e_d}, where e_j is the j-th vector of the canonical basis of IR^d and is defined by

e_j = (0, . . . , 0, 1, 0, . . . , 0)^⊤ , with the 1 in the j-th position.

It implies that the set XK = {Xθ : θ ∈ K} ⊂ IR^n is also a polytope with at most 2d vertices that are in the set XV = {X_1, −X_1, . . . , X_d, −X_d}, where X_j is the j-th column of X. Indeed, XK is obtained by rescaling and embedding (resp. projecting) the polytope K when d ≤ n (resp., d ≥ n). Note that some columns of X might not be vertices of XK so that XV might be a strict superset of the set of vertices of XK.

Theorem 2.4. Let K = B_1 be the unit ℓ_1 ball of IR^d, d ≥ 2, and assume that θ* ∈ B_1. Moreover, assume the conditions of Theorem 2.2 and that the columns of X are normalized in such a way that max_j |X_j|_2 ≤ √n. Then the constrained least squares estimator θ̂^{ls}_{B_1} satisfies

IE[ MSE(Xθ̂^{ls}_{B_1}) ] = (1/n) IE|Xθ̂^{ls}_{B_1} − Xθ*|_2² ≲ σ √(log d / n) ,

Moreover, for any δ > 0, with probability 1 − δ, it holds r log(d/δ) MSE( θˆls ) σ . X B1 . n Proof. From the considerations preceding the theorem, we get that

| θˆls − θ∗|2 ≤ 4 sup (ε>v). X B1 X 2 v∈XK

2 Observe now that since ε ∼ subGn(σ ), for any column Xj such that |Xj|2 ≤ √ > 2 n, the random variable ε Xj ∼ subG(nσ ). Therefore, applying Theo-  ˆls  rem 1.16, we get the bound on IE MSE(XθK ) and for any t ≥ 0,

nt2  ˆls   >  − 2 IP MSE(XθK ) > t ≤ IP sup (ε v) > nt/4 ≤ 2de 32σ v∈XK To conclude the proof, we find t such that

2 − nt 2 2 log(2d) 2 log(1/δ) 2de 32σ2 ≤ δ ⇔ t ≥ 32σ + 32σ . n n

Note that the proof of Theorem 2.2 also applies to θ̂^{ls}_{B_1} (exercise!) so that θ̂^{ls}_{B_1} benefits from the best of both rates,

IE[ MSE(Xθ̂^{ls}_{B_1}) ] ≲ min( σ² r/n , σ √(log d / n) ) .

This is called an elbow effect. The elbow takes place around r ≃ √n (up to logarithmic terms).

`0 constrained least squares By an abuse of notation, we call the number of non-zero coefficients of a vector d θ ∈ IR its `0 norm (it is not actually a norm). It is denoted by

d X |θ|0 = 1I(θj 6= 0) . j=1

We call a vector θ with “small” `0 norm a sparse vector. More precisely, if |θ|0 ≤ k, we say that θ is a k-sparse vector. We also call support of θ the set  supp(θ) = j ∈ {1, . . . , d} : θj 6= 0 so that |θ|0 = card(supp(θ)) =: | supp(θ)| . 2.2. Least squares estimators 47

Remark 2.5. The `0 terminology and notation comes from the fact that

d X q lim |θj| = |θ|0 q→0+ j=1

q 0 Therefore it is really limq→0+ |θ|q but the notation |θ|0 suggests too much that it is always equal to 1.

d By extension, denote by B0(k) the `0 ball of IR , i.e., the set of k-sparse vectors, defined by d B0(k) = {θ ∈ IR : |θ|0 ≤ k} . ˆls In this section, our goal is to control the MSE of θK when K = B0(k). Note that computing θˆls essentially requires computing d least squares estimators, B0(k) k which is an exponential number in k. In practice this will be hard (or even impossible) but it is interesting to understand the statistical properties of this estimator and to use them as a benchmark.

Theorem 2.6. Fix a positive integer k ≤ d/2. Let K = B_0(k) be the set of k-sparse vectors of IR^d and assume that θ* ∈ B_0(k). Moreover, assume the conditions of Theorem 2.2. Then, for any δ > 0, with probability 1 − δ, it holds

MSE(Xθ̂^{ls}_{B_0(k)}) ≲ (σ²/n) log\binom{d}{2k} + σ²k/n + (σ²/n) log(1/δ) .

Proof. We begin as in the proof of Theorem 2.2 to get (2.5):

> ˆls ∗ ˆls ∗ 2 > ˆls ∗ ˆls ∗ ε X(θK − θ ) | θ − θ | ≤ 2ε (θ − θ ) = 2| θ − θ |2 . X K X 2 X K X K X ˆls ∗ |X(θK − θ )|2

ˆls ∗ ˆls ∗ We know that both θK and θ are in B0(k) so that θK − θ ∈ B0(2k). For any S ⊂ {1, . . . , d}, let XS denote the n × |S| submatrix of X that is obtained from the columns of Xj, j ∈ S of X. Denote by rS ≤ |S| the rank of XS and n×rS let ΦS = [φ1, . . . , φrS ] ∈ IR be an orthonormal basis of the column span d |S| of XS. Moreover, for any θ ∈ IR , define θ(S) ∈ IR to be the vector with ˆ ˆls ∗ ˆ coordinates θj, j ∈ S. If we denote by S = supp(θK − θ ), we have |S| ≤ 2k r and there exists ν ∈ IR Sˆ such that

ˆls ∗ ˆls ˆ ∗ ˆ X(θK − θ ) = XSˆ(θK (S) − θ (S)) = ΦSˆν . Therefore, > ˆls ∗ > ε X(θK − θ ) ε ΦSˆν > = ≤ max sup [ε ΦS]u ls r | (θˆ − θ∗)| |ν|2 |S|=2k S X K 2 u∈B2

rS rS where B2 is the unit ball of IR . It yields

ˆls ∗ 2 > 2 |XθK − Xθ |2 ≤ 4 max sup (˜εS u) , |S|=2k rS u∈B2 2.2. Least squares estimators 48

> 2 withε ˜S = ΦS ε ∼ subGrS (σ ). Using a union bound, we get for any t > 0, X IP max sup (˜ε>u)2 > t ≤ IP sup (˜ε>u)2 > t |S|=2k rS rS u∈B2 |S|=2k u∈B2

It follows from the proof of Theorem 1.19 that for any |S| ≤ 2k,

> 2  |S| − t 2k − t IP sup (˜ε u) > t ≤ 6 e 8σ2 ≤ 6 e 8σ2 . rS u∈B2 Together, the above three displays yield   ls ∗ 2 d 2k − t IP(| θˆ − θ | > 4t) ≤ 6 e 8σ2 . (2.6) X K X 2 2k

To ensure that the right-hand side of the above inequality is bounded by δ, we need n  d  o t ≥ Cσ2 log + k log(6) + log(1/δ) . 2k

How large is log\binom{d}{2k}? It turns out that it is not much larger than k.

Lemma 2.7. For any integers 1 ≤ k ≤ n, it holds

\binom{n}{k} ≤ (en/k)^k .

Proof. Observe first that if k = 1, since n ≥ 1, it holds,

n en1 = n ≤ en = 1 1

Next, we proceed by induction and assume that it holds for some k ≤ n − 1 that n enk ≤ . k k Observe that  n  nn − k enk n − k eknk+1  1 k = ≤ ≤ 1 + , k + 1 k k + 1 k k + 1 (k + 1)k+1 k where we used the induction hypothesis in the first inequality. To conclude, it suffices to observe that  1 k 1 + ≤ e. k It immediately leads to the following corollary: 2.3. The Gaussian Sequence Model 49

Corollary 2.8. Under the assumptions of Theorem 2.6, for any δ > 0, with probability at least 1 − δ, it holds 2   2 2 ˆls σ k ed σ k σ MSE(XθB (k)) log + log(6) + log(1/δ) . 0 . n 2k n n

Note that for any fixed δ, there exits a constant Cδ > 0 such that for any n ≥ 2k, with high probability, 2   ˆls σ k ed MSE(XθB (k)) ≤ Cδ log . 0 n 2k Comparing this result with Theorem 2.2 with r = k, we see that the price to pay for not knowing the support of θ∗ but only its size, is a logarithmic factor in the dimension d. This result immediately leads the following bound in expectation. Corollary 2.9. Under the assumptions of Theorem 2.6, 2    ˆls  σ k ed IE MSE(XθB (k)) log . 0 . n k Proof. It follows from (2.6) that for any H ≥ 0, Z ∞ IEMSE( θˆls ) = IP(| θˆls − θ∗|2 > nu)du X B0(k) X K X 2 0 Z ∞ ˆls ∗ 2 ≤ H + IP(|XθK − Xθ |2 > n(u + H))du 0 2k   Z ∞ X d 2k − n(u+H) ≤ H + 6 e 32σ2 du j j=1 0 2k   2 X d 2k − nH 32σ = H + 6 e 32σ2 . j n j=1 Next, take H to be such that

2k   X d 2k − nH 6 e 32σ2 = 1 . j j=1 This yields σ2k ed H log , . n k which completes the proof.

2.3 THE GAUSSIAN SEQUENCE MODEL

The Gaussian Sequence Model is a toy model that has received a lot of attention, mostly in the eighties. The main reason for its popularity is that 2.3. The Gaussian Sequence Model 50 it carries already most of the insight of nonparametric estimation. While the model looks very simple it allows to carry deep ideas that extend beyond its framework and in particular to the linear regression model that we are inter- ested in. Unfortunately we will only cover a small part of these ideas and the interested reader should definitely look at the excellent books by A. Tsy- bakov [Tsy09, Chapter 3] and I. Johnstone [Joh11]. The model is as follows: ∗ Yi = θi + εi , i = 1, . . . , d , (2.7) 2 where ε1, . . . , εd are i.i.d. N (0, σ ) random variables. Note that often, d is taken equal to ∞ in this sequence model and we will also discuss this case. Its links to nonparametric estimation will become clearer in Chapter3. The goal here is to estimate the unknown vector θ∗.

The sub-Gaussian Sequence Model Note first that the model (2.7) is a special case of the linear model with fixed > ∗ design (2.1) with n = d and f(xi) = xi θ , where x1, . . . , xn form the canonical n basis e1, . . . , en of IR and ε has a Gaussian distribution. Therefore, n = d is both the dimension of the parameter θ and the number of observations and it looks like we have chosen to index this problem by d rather than n somewhat arbitrarily. We can bring n back into the picture, by observing that this model encompasses slightly more general choices for the design matrix X as long as it satisfies the following assumption. Assumption ORT The design matrix satisfies > X X = I , n d d where Id denotes the identity matrix of IR . Assumption ORT allows for cases where d ≤ n but not d > n (high dimensional case) because of obvious rank constraints. In particular,√ it means that the d columns of X are orthogonal in IRn and all have norm n. Under this assumption, it follows from the linear regression model (2.2) that 1 > 1 y := >Y = X Xθ∗ + >ε nX n nX = θ∗ + ξ ,

2 where ξ = (ξ1, . . . , ξd) ∼ subGd(σ /n) (If ε is Gaussian, we even have ξ ∼ 2 Nd(0, σ /n)). As a result, under the assumption ORT, when ξ is Gaussian, the linear regression model (2.2) is equivalent to the Gaussian Sequence Model (2.7) up to a transformation of the data Y and a rescaling of the variance. Moreover, for any estimator θˆ ∈ IRd, under ORT, it follows from (2.3) that > MSE( θˆ) = (θˆ − θ∗)> X X(θˆ − θ∗) = |θˆ − θ∗|2 . X n 2 2.3. The Gaussian Sequence Model 51

Furthermore, for any θ ∈ IRd, the assumption ORT yields, 1 |y − θ|2 = | >Y − θ|2 2 nX 2 2 1 = |θ|2 − θ> >Y + Y > >Y 2 n X n2 XX 1 2 1 = | θ|2 − ( θ)>Y + |Y |2 + Q n X 2 n X n 2 1 = |Y − θ|2 + Q, (2.8) n X 2 where Q is a constant that does not depend on θ and is defined by 1 1 Q = Y > >Y − |Y |2 n2 XX n 2 This implies in particular that the least squares estimator θˆls is equal to y.

We introduce a slightly more general model called the sub-Gaussian sequence model:

y = θ∗ + ξ ∈ IRd , (2.9) 2 where ξ ∼ subGd(σ /n). In this section, we can actually completely “forget” about our original model (2.2). In particular we can define this model independently of Assump- tion ORT and thus for any values of n and d. The sub-Gaussian sequence model and the Gaussian sequence model are called direct (observation) problems as opposed to inverse problems where the goal is to estimate the parameter θ∗ only from noisy observations of its image through an operator. The linear regression model is one such inverse problem where the matrix X plays the role of a linear operator. However, in these notes, we never try to invert the operator. See [Cav11] for an interesting survey on the statistical theory of inverse problems.

Sparsity adaptive thresholding estimators

If we knew a priori that θ was k-sparse, we could directly employ Corollary 2.8 to obtain that with probability at least 1 − δ, we have

MSE(Xθ̂^{ls}_{B_0(k)}) ≤ C_δ (σ²k/n) log(ed/(2k)) .

As we will see, the assumption ORT gives us the luxury to not know k and yet adapt to its value. Adaptation means that we can construct an estimator that does not require the knowledge of k (the smallest k such that |θ*|_0 ≤ k) and yet performs as well as θ̂^{ls}_{B_0(k)}, up to a multiplicative constant.
Let us begin with some heuristic considerations to gain some intuition. Assume the sub-Gaussian sequence model (2.9). If nothing is known about θ*, it is natural to estimate it using the least squares estimator θ̂^{ls} = y. In this case,

MSE(Xθ̂^{ls}) = |y − θ*|_2² = |ξ|_2² ≤ C_δ σ²d/n ,

where the last inequality holds with probability at least 1 − δ. This is actually what we are looking for if k = Cd for some positive constant C ≤ 1. The problem with this approach is that it does not use the fact that k may be much smaller than d, which happens when θ* has many zero coordinates.
If θ*_j = 0, then y_j = ξ_j, which is a sub-Gaussian random variable with variance proxy σ²/n. In particular, we know from Lemma 1.3 that with probability at least 1 − δ,

|ξ_j| ≤ σ √(2 log(2/δ)/n) = τ . (2.10)

The consequences of this inequality are interesting. On the one hand, if we observe |y_j| ≫ τ, then it must correspond to θ*_j ≠ 0. On the other hand, if |y_j| ≤ τ, then θ*_j cannot be very large. In particular, by the triangle inequality, |θ*_j| ≤ |y_j| + |ξ_j| ≤ 2τ. Therefore, we lose at most 2τ by choosing θ̂_j = 0. It leads us to consider the following estimator.

Definition 2.10. The hard thresholding estimator with threshold 2τ > 0 is denoted by θ̂^{hrd} and has coordinates

θ̂^{hrd}_j = y_j if |y_j| > 2τ , and θ̂^{hrd}_j = 0 if |y_j| ≤ 2τ ,

for j = 1, . . . , d. In short, we can write θ̂^{hrd}_j = y_j 1I(|y_j| > 2τ).
From our above considerations, we are tempted to choose τ as in (2.10). Yet, this threshold is not large enough. Indeed, we need to choose τ such that |ξ_j| ≤ τ simultaneously for all j. This can be done using a maximal inequality. Namely, Theorem 1.14 ensures that with probability at least 1 − δ,

max_{1≤j≤d} |ξ_j| ≤ σ √(2 log(2d/δ)/n) .

It yields the following theorem.

Theorem 2.11. Consider the linear regression model (2.2) under the assumption ORT or, equivalently, the sub-Gaussian sequence model (2.9). Then the hard thresholding estimator θ̂^{hrd} with threshold

2τ = 2σ √(2 log(2d/δ)/n) (2.11)

enjoys the following two properties on the same event A such that IP(A) ≥ 1 − δ:

(i) If |θ*|_0 = k, then

MSE(Xθ̂^{hrd}) = |θ̂^{hrd} − θ*|_2² ≲ σ² k log(2d/δ)/n .

(ii) If min_{j∈supp(θ*)} |θ*_j| > 3τ, then

supp(θ̂^{hrd}) = supp(θ*) .

Proof. Define the event n A = max |ξj| ≤ τ , j and recall that Theorem 1.14 yields IP(A) ≥ 1 − δ. On the event A, the following holds for any j = 1, . . . , d. First, observe that

∗ |yj| > 2τ ⇒ |θj | ≥ |yj| − |ξj| > τ (2.12) and ∗ |yj| ≤ 2τ ⇒ |θj | ≤ |yj| + |ξj| ≤ 3τ. (2.13) It yields ˆhrd ∗ ∗ ∗ |θj − θj | = |yj − θj |1I(|yj| > 2τ) + |θj |1I(|yj| ≤ 2τ) ∗ ≤ τ1I(|yj| > 2τ) + |θj |1I(|yj| ≤ 2τ) ∗ ∗ ∗ ≤ τ1I(|θj | > τ) + |θj |1I(|θj | ≤ 3τ) by (2.12) and (2.13) ∗ ≤ 4 min(|θj |, τ) It yields

d d ˆhrd ∗ 2 X ˆhrd ∗ 2 X ∗ 2 2 ∗ 2 |θ − θ |2 = |θj − θj | ≤ 16 min(|θj | , τ ) ≤ 16|θ |0τ . j=1 j=1 This completes the proof of (i). ∗ ∗ To prove (ii), note that if θj 6= 0, then |θj | > 3τ so that ∗ |yj| = |θj + ξj| > 3τ − τ = 2τ .

ˆhrd ∗ ˆhrd Therefore, θj 6= 0 so that supp(θ ) ⊂ supp(θ ). ˆhrd ˆhrd Next, if θj 6= 0, then |θj | = |yj| > 2τ. It yields ∗ |θj | ≥ |yj| − τ > τ.

∗ ˆhrd ∗ Therefore, |θj |= 6 0 and supp(θ ) ⊂ supp(θ ).

Similar results can be obtained for the soft thresholding estimator θ̂^{sft} defined by

θ̂^{sft}_j = y_j − 2τ if y_j > 2τ , y_j + 2τ if y_j < −2τ , and 0 if |y_j| ≤ 2τ .

In short, we can write

θ̂^{sft}_j = ( 1 − 2τ/|y_j| )_+ y_j .


Figure 2.2. Transformation applied to yj with 2τ = 1 to obtain the hard (left) and soft (right) thresholding estimators
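For completeness, here is a short sketch (ours, not from the notes) of the two thresholding maps applied to data from the sub-Gaussian sequence model (2.9), with the threshold (2.11); the sparsity level, dimensions and confidence level are arbitrary choices.

```python
# Hard and soft thresholding maps from Definition 2.10 and the display above.
import numpy as np

def hard_threshold(y, thresh):
    """theta_j = y_j * 1{|y_j| > 2*tau}, with thresh = 2*tau."""
    return y * (np.abs(y) > thresh)

def soft_threshold(y, thresh):
    """theta_j = sign(y_j) * max(|y_j| - 2*tau, 0), with thresh = 2*tau."""
    return np.sign(y) * np.maximum(np.abs(y) - thresh, 0.0)

# Example on a sparse signal observed in noise, as in model (2.9).
rng = np.random.default_rng(5)
d, n, sigma = 50, 50, 1.0
theta_star = np.zeros(d)
theta_star[:5] = 5.0
y = theta_star + sigma / np.sqrt(n) * rng.normal(size=d)

two_tau = 2 * sigma * np.sqrt(2 * np.log(2 * d / 0.1) / n)   # threshold (2.11) with delta = 0.1
print(np.nonzero(hard_threshold(y, two_tau))[0])             # typically recovers the support {0,...,4}
print(np.max(np.abs(soft_threshold(y, two_tau) - theta_star)))
```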

2.4 HIGH-DIMENSIONAL LINEAR REGRESSION

The BIC and Lasso estimators

It can be shown (see Problem 2.6) that the hard and soft thresholding estimators are solutions of the following penalized empirical risk minimization problems:

θ̂^{hrd} = argmin_{θ∈IR^d} { |y − θ|_2² + 4τ² |θ|_0 } ,
θ̂^{sft} = argmin_{θ∈IR^d} { |y − θ|_2² + 4τ |θ|_1 } .

In view of (2.8), under the assumption ORT, the above variational definitions can be written as

θ̂^{hrd} = argmin_{θ∈IR^d} { (1/n)|Y − Xθ|_2² + 4τ² |θ|_0 } ,
θ̂^{sft} = argmin_{θ∈IR^d} { (1/n)|Y − Xθ|_2² + 4τ |θ|_1 } .

When the assumption ORT is not satisfied, they no longer correspond to thresholding estimators but can still be defined as above. We change the constant in the threshold parameters for future convenience.

Definition 2.12. Fix τ > 0 and assume the linear regression model (2.2). The 2.4. High-dimensional linear regression 55

BIC estimator of θ* is defined by any θ̂^{bic} such that

θ̂^{bic} ∈ argmin_{θ∈IR^d} { (1/n)|Y − Xθ|_2² + τ² |θ|_0 } .

Moreover, the Lasso estimator of θ* is defined by any θ̂^L such that

θ̂^L ∈ argmin_{θ∈IR^d} { (1/n)|Y − Xθ|_2² + 2τ |θ|_1 } .

Remark 2.13 (Numerical considerations). Computing the BIC estimator can be proved to be NP-hard in the worst case. In particular, no computational method is known to be significantly faster than the brute force search among all 2^d sparsity patterns. Indeed, we can rewrite:

min_{θ∈IR^d} { (1/n)|Y − Xθ|_2² + τ²|θ|_0 } = min_{0≤k≤d} min_{θ : |θ|_0=k} { (1/n)|Y − Xθ|_2² + τ²k } .

To compute min_{θ : |θ|_0=k} (1/n)|Y − Xθ|_2², we need to compute \binom{d}{k} least squares estimators, each on a space of size k. Each costs O(k³) (matrix inversion). Therefore the total cost of the brute force search is of order

C ∑_{k=0}^d \binom{d}{k} k³ ≤ C d³ 2^d .

By contrast, computing the Lasso estimator is a convex problem and there exist many efficient algorithms to solve it. We will not describe this optimization problem in detail but only highlight a few of the best known algorithms:
1. Probably the most popular method among statisticians relies on coordinate descent. It is implemented in the glmnet package in R [FHT10].
2. An interesting method called LARS [EHJT04] computes the entire regularization path, i.e., the solution of the convex problem for all values of τ. It relies on the fact that, as a function of τ, the solution θ̂^L is a piecewise linear function (with values in IR^d). Yet this method proved to be too slow for very large problems and has been replaced by glmnet, which computes solutions for values of τ on a grid much faster.
3. The optimization community has made interesting contributions to this field by using proximal methods to solve this problem. These methods exploit the structure of the objective function, which is of the form smooth (sum of squares) + simple (ℓ_1 norm). A good entry point to this literature is perhaps the FISTA algorithm [BT09]; a minimal sketch in this spirit is given below.

2Note that it minimizes the Bayes Information Criterion (BIC) employed in the tradi- tional literature of asymptotic statistics if τ = plog(d)/n. We will use the same value below, up to multiplicative constants (it’s the price to pay to get non asymptotic results). 2.4. High-dimensional linear regression 56

4. Recently there has been a lot of interest around this objective for very 2 large d and very large n. In this case, even computing |Y − Xθ|2 may be computationally expensive and solutions based on stochastic gradient descent are flourishing.

Note that by Lagrange duality, computing θ̂^L is equivalent to solving an ℓ_1-constrained least squares problem. Nevertheless, the radius of the ℓ_1 constraint is unknown. In general it is hard to relate Lagrange multipliers to the size of the constraints. The name "Lasso" was originally given to the constrained version of this estimator in the original paper of Robert Tibshirani [Tib96].
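To make the proximal-method remark above concrete, here is a minimal proximal gradient (ISTA) sketch for the Lasso objective of Definition 2.12. It is our own illustration with a fixed step size, not a tuned solver; FISTA [BT09] would add an acceleration step on top of it.

```python
# Proximal gradient (ISTA) for (1/n)|Y - X theta|_2^2 + 2*tau*|theta|_1.
import numpy as np

def soft_threshold(v, thresh):
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def lasso_ista(X, Y, tau, n_iter=2000):
    n, d = X.shape
    step = n / (2 * np.linalg.norm(X, ord=2) ** 2)   # 1/L for the smooth part
    theta = np.zeros(d)
    for _ in range(n_iter):
        grad = (2.0 / n) * X.T @ (X @ theta - Y)     # gradient of the quadratic term
        theta = soft_threshold(theta - step * grad, 2 * tau * step)  # prox of 2*tau*|.|_1
    return theta

# Small synthetic check with arbitrary sizes.
rng = np.random.default_rng(6)
n, d, sigma = 100, 20, 0.5
X = rng.normal(size=(n, d))
theta_star = np.zeros(d); theta_star[:3] = 2.0
Y = X @ theta_star + sigma * rng.normal(size=n)
tau = 2 * sigma * np.sqrt(np.log(2 * d) / n)
print(np.round(lasso_ista(X, Y, tau), 2))
```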

Analysis of the BIC estimator While computationally hard to implement, the BIC estimator gives us a good benchmark for sparse estimation. Its performance is similar to that of θˆhrd but without requiring the assumption ORT.

Theorem 2.14. Assume that the linear model (2.2) holds, where ε ∼ subG_n(σ²), and that |θ*|_0 ≥ 1. Then, the BIC estimator θ̂^{bic} with regularization parameter

τ² = 16 log(6) σ²/n + 32 σ² log(ed)/n (2.14)

satisfies

MSE(Xθ̂^{bic}) = (1/n)|Xθ̂^{bic} − Xθ*|_2² ≲ |θ*|_0 σ² log(ed/δ)/n (2.15)

with probability at least 1 − δ.

Proof. We begin as usual by noting that 1 1 |Y − θˆbic|2 + τ 2|θˆbic| ≤ |Y − θ∗|2 + τ 2|θ∗| . n X 2 0 n X 2 0 It implies

ˆbic ∗ 2 2 ∗ > ˆbic ∗ 2 ˆbic |Xθ − Xθ |2 ≤ nτ |θ |0 + 2ε X(θ − θ ) − nτ |θ |0 . First, note that

 ˆbic ∗  > ˆbic ∗ > Xθ − Xθ ˆbic ∗ 2ε (θ − θ ) = 2ε | θ − θ |2 X ˆbic ∗ X X |Xθ − Xθ |2 h  ˆbic ∗ i2 > Xθ − Xθ 1 ˆbic ∗ 2 ≤ 2 ε + | θ − θ |2 , ˆbic ∗ X X |Xθ − Xθ |2 2

2 1 2 where we use the inequality 2ab ≤ 2a + 2 b . Together with the previous display, it yields

ˆbic ∗ 2 2 ∗  > ˆbic ∗ 2 2 ˆbic |Xθ − Xθ |2 ≤ 2nτ |θ |0 + 4 ε U(θ − θ ) − 2nτ |θ |0 (2.16) 2.4. High-dimensional linear regression 57 where z U(z) = |z|2 Next, we need to “sup out” θˆbic. To that end, we decompose the sup into a max over cardinalities as follows:

sup = max max sup . θ∈IRd 1≤k≤d |S|=k supp(θ)=S

Applied to the above inequality, it yields

 > ˆbic ∗ 2 2 ˆbic 4 ε U(θ − θ ) − 2nτ |θ |0 2 ≤ max  max sup 4ε>U(θ − θ∗) − 2nτ 2k 1≤k≤d |S|=k supp(θ)=S   > 2 2 ≤ max max sup 4 ε ΦS,∗u − 2nτ k , 1≤k≤d |S|=k rS,∗ u∈B2 where ΦS,∗ = [φ1, . . . , φrS,∗ ] is an orthonormal basis of the set {Xj, j ∈ S ∪ ∗ ∗ supp(θ )} of columns of X and rS,∗ ≤ |S| + |θ |0 is the dimension of this column span. Using union bounds, we get for any t > 0,

   > 2 2  IP max max sup 4 ε ΦS,∗u − 2nτ k ≥ t 1≤k≤d |S|=k rS,∗ u∈B2 d X X   > 2 t 1 2  ≤ IP sup ε ΦS,∗u ≥ + nτ k rS,∗ 4 2 k=1 |S|=k u∈B2

Moreover, using the ε-net argument from Theorem 1.19, we get for |S| = k,

t 1 2  2 t 1   + nτ k   >  2 rS,∗ 4 2 IP sup ε ΦS,∗u ≥ + nτ k ≤ 2 · 6 exp − 2 rS,∗ 4 2 8σ u∈B2  t nτ 2k  ≤ 2 exp − − + (k + |θ∗| ) log(6) 32σ2 16σ2 0  t  ≤ exp − − 2k log(ed) + |θ∗| log(12) 32σ2 0 where, in the last inequality, we used the definition (2.14) of τ. 2.4. High-dimensional linear regression 58

Putting everything together, we get   ˆbic ∗ 2 2 ∗ IP |Xθ − Xθ |2 ≥ 2nτ |θ |0 + t ≤ d X X  t  exp − − 2k log(ed) + |θ∗| log(12) 32σ2 0 k=1 |S|=k d X d  t  = exp − − 2k log(ed) + |θ∗| log(12) k 32σ2 0 k=1 d X  t  ≤ exp − − k log(ed) + |θ∗| log(12) by Lemma 2.7 32σ2 0 k=1 d X  t  = (ed)−k exp − + |θ∗| log(12) 32σ2 0 k=1  t  ≤ exp − + |θ∗| log(12) . 32σ2 0

2 ∗ 2 To conclude the proof, choose t = 32σ |θ |0 log(12)+32σ log(1/δ) and observe that combined with (2.16), it yields with probability 1 − δ,

ˆbic ∗ 2 2 ∗ |Xθ − Xθ |2 ≤ 2nτ |θ |0 + t 2 ∗ 2 ∗ 2 = 64σ log(ed)|θ |0 + 64 log(12)σ |θ |0 + 32σ log(1/δ) ∗ 2 2 ≤ 224|θ |0σ log(ed) + 32σ log(1/δ) .

It follows from Theorem 2.14 that θˆbic adapts to the unknown sparsity of θ∗, just like θˆhrd. Moreover, this holds under no assumption on the design matrix X.

Analysis of the Lasso estimator Slow rate for the Lasso estimator The properties of the BIC estimator are quite impressive. It shows that under no assumption on X, one can mimic two oracles: (i) the oracle that knows the support of θ∗ (and computes least squares on this support), up to a log(ed) ∗ ∗ term and (ii) the oracle that knows the sparsity |θ |0 of θ , up to a smaller loga- ∗ rithmic term log(ed/|θ |0), which is subsumed by log(ed). Actually, the log(ed) ∗ can even be improved to log(ed/|θ |0) by using a modified BIC estimator (see Problem 2.7). The Lasso estimator is a bit more difficult to analyze because, by construc- ∗ tion, it should more naturally adapt to the unknown `1-norm of θ . This can be easily shown as in the next theorem, analogous to Theorem 2.4. 2.4. High-dimensional linear regression 59

Theorem 2.15. Assume that the linear model (2.2) holds where ε ∼ subG_n(σ²). Moreover, assume that the columns of X are normalized in such a way that max_j |X_j|_2 ≤ √n. Then, the Lasso estimator θ̂^L with regularization parameter

2τ = 2σ √(2 log(2d)/n) + 2σ √(2 log(1/δ)/n) (2.17)

satisfies

MSE(Xθ̂^L) = (1/n)|Xθ̂^L − Xθ*|_2² ≤ 4|θ*|_1 σ √(2 log(2d)/n) + 4|θ*|_1 σ √(2 log(1/δ)/n)

with probability at least 1 − δ.

Proof. From the definition of θ̂^L, it holds

(1/n)|Y − Xθ̂^L|_2² + 2τ|θ̂^L|_1 ≤ (1/n)|Y − Xθ*|_2² + 2τ|θ*|_1 .

Using Hölder's inequality, it implies

|Xθ̂^L − Xθ*|_2² ≤ 2ε^⊤X(θ̂^L − θ*) + 2nτ( |θ*|_1 − |θ̂^L|_1 )
 ≤ 2|X^⊤ε|_∞ |θ̂^L|_1 − 2nτ|θ̂^L|_1 + 2|X^⊤ε|_∞ |θ*|_1 + 2nτ|θ*|_1
 = 2( |X^⊤ε|_∞ − nτ )|θ̂^L|_1 + 2( |X^⊤ε|_∞ + nτ )|θ*|_1 .

Observe now that for any t > 0,

IP( |X^⊤ε|_∞ ≥ t ) = IP( max_{1≤j≤d} |X_j^⊤ε| > t ) ≤ 2d e^{−t²/(2nσ²)} .

Therefore, taking t = σ√(2n log(2d)) + σ√(2n log(1/δ)) = nτ, we get that with probability at least 1 − δ,

|Xθ̂^L − Xθ*|_2² ≤ 4nτ|θ*|_1 .

Notice that the regularization parameter (2.17) depends on the confidence level δ. This is not the case for the BIC estimator (see (2.14)).
The rate in Theorem 2.15 is of order √((log d)/n) (slow rate), which is much slower than the rate of order (log d)/n (fast rate) for the BIC estimator. Hereafter, we show that fast rates can also be achieved by the computationally efficient Lasso estimator, but at the cost of a much stronger condition on the design matrix X.

Incoherence

Assumption INC(k). We say that the design matrix X has incoherence k for some integer k > 0 if

| X^⊤X/n − I_d |_∞ ≤ 1/(32k) ,

where |A|_∞ denotes the largest element of A in absolute value. Equivalently,

1. For all j = 1, . . . , d,

| |X_j|_2²/n − 1 | ≤ 1/(32k) .

2. For all 1 ≤ i, j ≤ d, i ≠ j, we have

|X_i^⊤X_j| / n ≤ 1/(32k) .

Note that Assumption ORT arises as the limiting case of INC(k) as k → ∞. However, while Assumption ORT requires d ≤ n, here we may have d  n as illustrated in Proposition 2.16 below. To that end, we simply have to show that there exists a matrix that satisfies INC(k) even for d > n. We resort to the probabilistic method [AS08]. The idea of this method is that if we can find a probability measure that assigns positive probability to objects that satisfy a certain property, then there must exist objects that satisfy said property. In our case, we consider the following probability distribution on random matrices with entries in {±1}. Let the design matrix X have entries that are i.i.d. Rademacher (±1) random variables. We are going to show that most realizations of this random matrix satisfy Assumption INC(k) for large enough n.

Proposition 2.16. Let X ∈ IR^{n×d} be a random matrix with entries X_{ij}, i = 1, . . . , n, j = 1, . . . , d, that are i.i.d. Rademacher (±1) random variables. Then, X has incoherence k with probability 1 − δ as soon as

n ≥ 2^{11} k² log(1/δ) + 2^{13} k² log(d) .

It implies that there exist matrices that satisfy Assumption INC(k) for

n ≥ C k² log(d) ,

for some numerical constant C.

Proof. Let εij ∈ {−1, 1} denote the Rademacher random variable that is on the ith row and jth column of X. Note first that the jth diagonal entries of X>X/n are given by n 1 X ε2 = 1. n i,j i=1

Moreover, for j 6= k, the (j, k)th entry of the d × d matrix X>X/n is given by n n 1 X 1 X (j,k) ε ε = ξ , n i,j i,k n i i=1 i=1

(j,k) (j,k) (j,k) where for each pair (j, k), ξi = εi,jεi,k, so that ξ1 , . . . , ξn are i.i.d. Rademacher random variables. 2.4. High-dimensional linear regression 61

Therefore, we get that for any t > 0,

 >   1 n  X X X (j,k) IP − Id > t = IP max ξi > t n j6=k n ∞ i=1 n X  1 X (j,k)  ≤ IP ξ > t (Union bound) n i j6=k i=1 2 X − nt ≤ 2e 2 (Hoeffding: Theorem 1.9) j6=k 2 2 − nt ≤ 2d e 2 .

Taking now t = 1/(32k) yields

> X X 1  2 − n IP − Id > ≤ 2d e 211k2 ≤ δ n ∞ 32k for n ≥ 211k2 log(1/δ) + 213k2 log(d) .
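Proposition 2.16 is easy to check empirically: the sketch below (our own, with arbitrary choices of k, d and δ) draws a Rademacher design with n as in the proposition and measures the largest entry of |X^⊤X/n − I_d|, which should be at most 1/(32k) for Assumption INC(k) to hold.

```python
# Empirical check of Proposition 2.16 for a Rademacher design matrix.
import numpy as np

rng = np.random.default_rng(7)
k, d, delta = 2, 50, 0.1
n = int(2 ** 11 * k ** 2 * np.log(1 / delta) + 2 ** 13 * k ** 2 * np.log(d))
X = rng.choice([-1.0, 1.0], size=(n, d))

G = X.T @ X / n
incoherence = np.max(np.abs(G - np.eye(d)))
print(n, incoherence, 1 / (32 * k))    # INC(k) holds when incoherence <= 1/(32k)
```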

d For any θ ∈ IR , S ⊂ {1, . . . , d}, define θS to be the vector with coordinates

 θ if j ∈ S, θ = j S,j 0 otherwise .

In particular |θ|1 = |θS|1 + |θSc |1. The following lemma holds

Lemma 2.17. Fix a positive integer k ≤ d and assume that X satisfies Assumption INC(k). Then, for any S ⊂ {1, . . . , d} such that |S| ≤ k and any θ ∈ IR^d that satisfies the cone condition

|θ_{S^c}|_1 ≤ 3|θ_S|_1 , (2.18)

it holds

|θ|_2² ≤ 2 |Xθ|_2² / n .

Proof. We have

2 2 2 > |Xθ|2 |XθS|2 |XθSc |2 > X X = + + 2θ θ c . n n n S n S We bound each of the three terms separately.

(i) First, if follows from the incoherence condition that

| θ |2 >  >  |θ |2 X S 2 = θ> X Xθ = |θ |2 + θ> X X − I θ ≥ |θ |2 − S 1 , n S n S S 2 S n d S S 2 32k 2.4. High-dimensional linear regression 62

(ii) Similarly, 2 2 2 |XθSc |2 2 |θSc |1 2 9|θS|1 ≥ |θ c | − ≥ |θ c | − , n S 2 32k S 2 32k where, in the last inequality, we used the cone condition (2.18). (iii) Finally, > > X X 2 6 2 2 θ θ c ≤ |θ | |θ c | ≤ |θ | . S n S 32k S 1 S 1 32k S 1 where, in the last inequality, we used the cone condition (2.18). Observe now that it follows from the Cauchy-Schwarz inequality that

2 2 |θS|1 ≤ |S||θS|2. Thus for |S| ≤ k,

2 |Xθ|2 2 2 16|S| 2 1 2 ≥ |θ | + |θ c | − |θ | ≥ |θ| . n S 2 S 2 32k S 2 2 2

Fast rate for the Lasso Theorem 2.18. Fix n ≥ 2. Assume that the linear model (2.2) holds where ε ∼ 2 ∗ subGn(σ ). Moreover, assume that |θ |0 ≤ k and that X satisfies assumption INC(k). Then the Lasso estimator θˆL with regularization parameter defined by r r log(2d) log(1/δ) 2τ = 8σ + 8σ n n satisfies 1 log(2d/δ) MSE( θˆL) = | θˆL − θ∗|2 kσ2 (2.19) X n X X 2 . n and log(2d/δ) |θˆL − θ∗|2 kσ2 . (2.20) 2 . n with probability at least 1 − δ. Proof. From the definition of θˆL, it holds 1 1 |Y − θˆL|2 ≤ |Y − θ∗|2 + 2τ|θ∗| − 2τ|θˆL| . n X 2 n X 2 1 1 ˆL ∗ Adding τ|θ − θ |1 on each side and multiplying by n, we get ˆL ∗ 2 ˆL ∗ > ˆL ∗ ˆL ∗ ∗ ˆL |Xθ −Xθ |2+nτ|θ −θ |1 ≤ 2ε X(θ −θ )+nτ|θ −θ |1+2nτ|θ |1−2nτ|θ |1 . Applying H¨older’sinequality and using the same steps as in the proof of The- orem 2.15, we get that with probability 1 − δ, we get > ˆL ∗ > ˆL ∗ ε X(θ − θ ) ≤ |ε X|∞|θ − θ |1 nτ ≤ |θˆL − θ∗| , 2 1 2.4. High-dimensional linear regression 63

2 where we used the fact that |Xj|2 ≤ n + 1/(32k) ≤ 2n. Therefore, taking S = supp(θ∗) to be the support of θ∗, we get

ˆL ∗ 2 ˆL ∗ ˆL ∗ ∗ ˆL |Xθ − Xθ |2 + nτ|θ − θ |1 ≤ 2nτ|θ − θ |1 + 2nτ|θ |1 − 2nτ|θ |1 ˆL ∗ ∗ ˆL = 2nτ|θS − θ |1 + 2nτ|θ |1 − 2nτ|θS |1 ˆL ∗ ≤ 4nτ|θS − θ |1. (2.21) In particular, it implies that

ˆL ∗ ˆL ∗ |θSc − θSc |1 ≤ 3|θS − θS|1 . (2.22) so that θ = θˆL − θ∗ satisfies the cone condition (2.18). Using now the Cauchy- Schwarz inequality and Lemma 2.17 respectively, we get, since |S| ≤ k, r 2k |θˆL − θ∗| ≤ p|S||θˆL − θ∗| ≤ p|S||θˆL − θ∗| ≤ | θˆL − θ∗| . S 1 S 2 2 n X X 2 Combining this result with (2.21), we find

ˆL ∗ 2 2 |Xθ − Xθ |2 ≤ 32nkτ . This concludes the proof of the bound on the MSE. To prove (2.20), we use Lemma 2.17 once again to get

ˆL ∗ 2 ˆL 2 |θ − θ |2 ≤ 2MSE(Xθ ) ≤ 64kτ . Note that all we required for the proof was not really incoherence but the conclusion of Lemma 2.17:

inf_{|S|≤k} inf_{θ∈C_S} |Xθ|_2² / (n|θ|_2²) ≥ κ , (2.23)

where κ = 1/2 and C_S is the cone defined by

C_S = { θ : |θ_{S^c}|_1 ≤ 3|θ_S|_1 } .

Condition (2.23) is sometimes called the restricted eigenvalue (RE) condition. Its name comes from the following observation. Note that all k-sparse vectors θ are in a cone C_S with |S| ≤ k, so that the RE condition implies that the smallest eigenvalue of X_S^⊤X_S satisfies λ_min(X_S^⊤X_S) ≥ nκ for all S such that |S| ≤ k. Clearly, the RE condition is weaker than incoherence and it can actually be shown that a design matrix X of i.i.d. Rademacher random variables satisfies the RE condition as soon as n ≥ Ck log(d) with positive probability.

The Slope estimator We noticed that the BIC estimator can actually obtain a rate of k log(ed/k), ∗ where k = |θ |0, while the LASSO only achieves k log(ed). This begs the 2.4. High-dimensional linear regression 64 question whether the same rate can be achieved by an efficiently computable estimator. The answer is yes, and the estimator achieving this is a slight modification of the LASSO, where the entries in the `1-norm are weighted differently according to their size in order to get rid of a slight inefficiency in our bounds.

Definition 2.19 (Slope estimator). Let λ = (λ_1, . . . , λ_d) be a non-increasing sequence of positive real numbers, λ_1 ≥ λ_2 ≥ · · · ≥ λ_d > 0. For θ = (θ_1, . . . , θ_d) ∈ IR^d, let (θ*_1, . . . , θ*_d) be a non-increasing rearrangement of the moduli of the entries, |θ_1|, . . . , |θ_d|. We define the sorted ℓ_1 norm of θ as

|θ|_* = ∑_{j=1}^d λ_j θ*_j , (2.24)

or equivalently as

|θ|_* = max_{φ∈S_d} ∑_{j=1}^d λ_j |θ_{φ(j)}| . (2.25)

The Slope estimator is then given by

θ̂^S ∈ argmin_{θ∈IR^d} { (1/n)|Y − Xθ|_2² + 2τ|θ|_* } (2.26)

for a choice of tuning parameters λ and τ > 0.
Slope stands for Sorted L-One Penalized Estimation and was introduced in [BvS+15], motivated by the quest for a penalized estimation procedure that could offer control of the false discovery rate for the hypotheses H_{0,j} : θ*_j = 0. We can check that | · |_* is indeed a norm and that θ̂^S can be computed efficiently, for example by proximal gradient algorithms, see [BvS+15]. In the following, we will consider

λ_j = √(log(2d/j)), j = 1, . . . , d. (2.27)
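The sorted ℓ_1 norm (2.24) with the weights (2.27) is simple to compute: sort the moduli in non-increasing order and take the weighted sum. The sketch below is our own illustration; a full Slope solver would plug this norm into the penalized problem (2.26), e.g. via the proximal gradient algorithms mentioned above.

```python
# Sorted l1 norm (2.24) with the weights lambda_j = sqrt(log(2d/j)) from (2.27).
import numpy as np

def sorted_l1_norm(theta, lambdas):
    """|theta|_* = sum_j lambda_j * |theta|_(j), with |theta|_(1) >= |theta|_(2) >= ..."""
    return np.sum(lambdas * np.sort(np.abs(theta))[::-1])

d = 10
lambdas = np.sqrt(np.log(2 * d / np.arange(1, d + 1)))   # non-increasing weights
theta = np.array([0.0, 3.0, 0.0, -1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0])
print(sorted_l1_norm(theta, lambdas))
# With equal weights the sorted l1 norm reduces to the usual l1 norm:
print(sorted_l1_norm(theta, np.ones(d)), np.sum(np.abs(theta)))
```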

With this choice, we will exhibit a scaling in τ that leads to the desired high probability bounds, following the proofs in [BLT16]. We begin with refined bounds on the suprema of Gaussians.

Lemma 2.20. Let g1, . . . , gd be zero-mean Gaussian variables with variance at most σ2. Then, for any k ≤ d,

 k  1−3t/8 1 X 2d 2d IP (g∗)2 > t log ≤ . (2.28) kσ2 j k  k j=1

Proof. We consider the moment generating function to apply a Chernoff bound. 2.4. High-dimensional linear regression 65

Use Jensen’s inequality on the sum to get   k k ∗ 2 ! 3 X 1 X 3(gj ) IEexp (g∗)2 ≤ IEexp 8kσ2 j  k 8σ2 j=1 j=1

d 2 ! 1 X 3gj ≤ IEexp k 8σ2 j=1 2d ≤ , k where we used the bound on the moment generating function of the χ2 distri- bution from Problem 1.2 to get that IE[exp(3g2/8)] = 2 for g ∼ N (0, 1). The rest follows from a Chernoff bound.

Lemma 2.21. Define [d] := {1, . . . , d}. Under the same assumptions as in Lemma 2.20,

sup_{k ∈ [d]} (g^*_k)/(σ λ_k) ≤ 4√(log(1/δ)) ,   (2.29)

with probability at least 1 − δ, for δ < 1/2.

Proof. We can estimate (g^*_k)^2 by the average of all larger variables,

(g^*_k)^2 ≤ (1/k) Σ_{j=1}^k (g^*_j)^2 .   (2.30)

Applying Lemma 2.20 yields

IP( (g^*_k)^2/(σ^2 λ_k^2) > t ) ≤ (2d/k)^{1 − 3t/8} .   (2.31)

For t > 8,

IP( sup_{k ∈ [d]} (g^*_k)^2/(σ^2 λ_k^2) > t ) ≤ Σ_{j=1}^d (2d/j)^{1 − 3t/8}   (2.32)
  ≤ (2d)^{1 − 3t/8} Σ_{j=1}^d 1/j^2   (2.33)
  ≤ 4 · 2^{−3t/8} .   (2.34)

In turn, this means that

sup_{k ∈ [d]} (g^*_k)/(σ λ_k) ≤ 4√(log(1/δ)) ,   (2.35)

with probability at least 1 − δ, for δ ≤ 1/2.

Theorem 2.22. Fix n ≥ 2. Assume that the linear model (2.2) holds where ε ∼ N_n(0, σ^2 I_n). Moreover, assume that |θ^*|_0 ≤ k and that X satisfies assumption INC(k_0) with k_0 ≥ 4k log(2de/k). Then the Slope estimator θ̂^S with regularization parameter defined by

τ = 8√2 σ √(log(1/δ)/n)   (2.36)

satisfies

MSE(Xθ̂^S) = (1/n)|Xθ̂^S − Xθ^*|_2^2 ≲ σ^2 k log(2d/k) log(1/δ)/n   (2.37)

and

|θ̂^S − θ^*|_2^2 ≲ σ^2 k log(2d/k) log(1/δ)/n   (2.38)

with probability at least 1 − δ.

Remark 2.23. The multiplicative dependence on log(1/δ) in (2.37) is an artifact of the simplified proof we give here and can be improved to an additive one, similar to the Lasso case (2.19), by appealing to stronger and more general concentration inequalities. For more details, see [BLT16].

Proof. By minimality of θ̂^S in (2.26), adding nτ|θ̂^S − θ^*|_* on both sides,

|Xθ̂^S − Xθ^*|_2^2 + nτ|θ̂^S − θ^*|_* ≤ 2ε^⊤X(θ̂^S − θ^*) + nτ|θ̂^S − θ^*|_* + 2τn|θ^*|_* − 2τn|θ̂^S|_* .   (2.39)

Set

u := θ̂^S − θ^* ,   g_j := (X^⊤ε)_j .   (2.40)

By Lemma 2.21, we can estimate

ε^⊤Xu = Σ_{j=1}^d (X^⊤ε)_j u_j ≤ Σ_{j=1}^d g^*_j u^*_j   (2.41)
  = Σ_{j=1}^d (g^*_j/λ_j)(λ_j u^*_j)   (2.42)
  ≤ ( sup_j g^*_j/λ_j ) |u|_*   (2.43)
  ≤ 4√2 σ √(n log(1/δ)) |u|_* = (nτ/2)|u|_* ,   (2.44)

where we used that |X_j|_2^2 ≤ 2n.

Pick a permutation φ such that |θ^*|_* = Σ_{j=1}^k λ_j|θ^*_{φ(j)}| and |u_{φ(k+1)}| ≥ · · · ≥ |u_{φ(d)}|. Then, noting that λ_j is monotonically decreasing,

|θ^*|_* − |θ̂^S|_* = Σ_{j=1}^k λ_j|θ^*_{φ(j)}| − Σ_{j=1}^d λ_j (θ̂^S)^*_j   (2.45)
  ≤ Σ_{j=1}^k λ_j( |θ^*_{φ(j)}| − |θ̂^S_{φ(j)}| ) − Σ_{j=k+1}^d λ_j |θ̂^S_{φ(j)}|   (2.46)
  ≤ Σ_{j=1}^k λ_j |u_{φ(j)}| − Σ_{j=k+1}^d λ_j u^*_j   (2.47)
  ≤ Σ_{j=1}^k λ_j u^*_j − Σ_{j=k+1}^d λ_j u^*_j .   (2.48)

Combined with τ = 8√2 σ √(log(1/δ)/n) and the basic inequality (2.39), we have that

|Xθ̂^S − Xθ^*|_2^2 + nτ|u|_* ≤ 2nτ|u|_* + 2τn|θ^*|_* − 2τn|θ̂^S|_*   (2.49)
  ≤ 4nτ Σ_{j=1}^k λ_j u^*_j ,   (2.50)

whence

Σ_{j=k+1}^d λ_j u^*_j ≤ 3 Σ_{j=1}^k λ_j u^*_j .   (2.51)

We can now repeat the incoherence arguments from Lemma 2.17, with S being the set of the k largest entries of |u|, to get the same conclusion under the assumption INC(k_0). First, by exactly the same argument as in Lemma 2.17, we have

|Xu_S|_2^2/n ≥ |u_S|_2^2 − |u_S|_1^2/(32k_0) ≥ |u_S|_2^2 − (k/(32k_0))|u_S|_2^2 .   (2.52)

Next, for the cross terms, we have

| u_S^⊤ (X^⊤X/n) u_{S^c} | ≤ (2/(32k_0)) Σ_{i=1}^k Σ_{j=k+1}^d u^*_i u^*_j   (2.53)
  = (2/(32k_0)) Σ_{i=1}^k u^*_i Σ_{j=k+1}^d (λ_j/λ_j) u^*_j   (2.54)
  ≤ (2/(32k_0 λ_d)) Σ_{i=1}^k u^*_i Σ_{j=k+1}^d λ_j u^*_j   (2.55)
  ≤ (6/(32k_0 λ_d)) Σ_{i=1}^k u^*_i Σ_{j=1}^k λ_j u^*_j   (2.56)
  ≤ (6√k/(32k_0 λ_d)) ( Σ_{j=1}^k λ_j^2 )^{1/2} Σ_{i=1}^k (u^*_i)^2   (2.57)
  ≤ (6k/(32k_0 λ_d)) √(log(2de/k)) |u_S|_2^2 ,   (2.58)

where we estimated the sum by

Σ_{j=1}^k log(2d/j) = k log(2d) − log(k!) ≤ k log(2d) − k log(k/e)   (2.59)
  = k log(2de/k) ,   (2.60)

using Stirling's inequality, k! ≥ (k/e)^k. Finally, from a similar calculation, using the cone condition twice,

|Xu_{S^c}|_2^2/n ≥ |u_{S^c}|_2^2 − (9k/(32k_0 λ_d^2)) log(2de/k) |u_S|_2^2 .   (2.61)

Concluding, if INC(k_0) holds with k_0 ≥ 4k log(2de/k), taking into account that λ_d ≥ 1/2, we have

|Xu|_2^2/n ≥ |u_S|_2^2 + |u_{S^c}|_2^2 − ((36 + 12 + 1)/128)|u_S|_2^2 ≥ (1/2)|u|_2^2 .   (2.62)

Hence, from (2.50),

|Xu|_2^2 + nτ|u|_* ≤ 4nτ ( Σ_{j=1}^k λ_j^2 )^{1/2} ( Σ_{j=1}^k (u^*_j)^2 )^{1/2}   (2.63)
  ≤ 4√2 τ √(nk log(2de/k)) |Xu|_2   (2.64)
  = 2^6 σ √(log(1/δ)) √(k log(2de/k)) |Xu|_2 ,   (2.65)

which combined yields

(1/n)|Xu|_2^2 ≤ 2^{12} σ^2 (k/n) log(2de/k) log(1/δ) ,   (2.66)

and

|u|_2^2 ≤ 2^{13} σ^2 (k/n) log(2de/k) log(1/δ) .   (2.67)

2.5 PROBLEM SET

Problem 2.1. Consider the linear regression model with fixed design with d ≤ n. The ridge regression estimator is employed when rank(X^⊤X) < d but we are interested in estimating θ^*. It is defined for a given parameter τ > 0 by

θ̂^ridge_τ = argmin_{θ ∈ IR^d} { (1/n)|Y − Xθ|_2^2 + τ|θ|_2^2 } .

(a) Show that for any τ, θ̂^ridge_τ is uniquely defined and give its closed form expression.

(b) Compute the bias of θ̂^ridge_τ and show that it is bounded in absolute value by |θ^*|_2.

Problem 2.2. Let X = (1, Z, . . . , Z^{d−1})^⊤ ∈ IR^d be a random vector where Z is a random variable. Show that the matrix IE(XX^⊤) is positive definite if Z admits a probability density with respect to the Lebesgue measure on IR.

Problem 2.3. Let θ̂^hrd be the hard thresholding estimator defined in Definition 2.10.

1. Show that
|θ̂^hrd|_0 ≤ |θ^*|_0 + |θ̂^hrd − θ^*|_2^2 / (4τ^2) .

2. Conclude that if τ is chosen as in Theorem 2.11, then
|θ̂^hrd|_0 ≤ C|θ^*|_0
with probability 1 − δ for some constant C to be made explicit.

Problem 2.4. In the proof of Theorem 2.11, show that 4 min(|θ^*_j|, τ) can be replaced by 3 min(|θ^*_j|, τ), i.e., that on the event A, it holds

|θ̂^hrd_j − θ^*_j| ≤ 3 min(|θ^*_j|, τ) .

Problem 2.5. For any q > 0, a vector θ ∈ IR^d is said to be in a weak ℓ_q ball of radius R if the decreasing rearrangement |θ_[1]| ≥ |θ_[2]| ≥ . . . satisfies

|θ_[j]| ≤ R j^{−1/q} .

Moreover, we define the weak ℓ_q norm of θ by

|θ|_{wℓ_q} = max_{1 ≤ j ≤ d} j^{1/q} |θ_[j]| .

(a) Give examples of θ, θ' ∈ IR^d such that

|θ + θ'|_{wℓ_1} > |θ|_{wℓ_1} + |θ'|_{wℓ_1} .

What do you conclude?

(b) Show that |θ|_{wℓ_q} ≤ |θ|_q.

(c) Given a sequence θ_1, θ_2, . . . , show that if lim_{d→∞} |θ_{{1,...,d}}|_{wℓ_q} < ∞, then lim_{d→∞} |θ_{{1,...,d}}|_{q'} < ∞ for all q' > q.

(d) Show that, for any q ∈ (0, 2), if lim_{d→∞} |θ_{{1,...,d}}|_{wℓ_q} = C, there exists a constant C_q > 0 that depends on q but not on d such that, under the assumptions of Theorem 2.11, it holds

|θ̂^hrd − θ^*|_2^2 ≤ C_q ( σ^2 log(2d)/n )^{1 − q/2}

with probability .99.

Problem 2.6. Show that

θ̂^hrd = argmin_{θ ∈ IR^d} { |y − θ|_2^2 + 4τ^2 |θ|_0 } ,
θ̂^sft = argmin_{θ ∈ IR^d} { |y − θ|_2^2 + 4τ |θ|_1 } .

Problem 2.7. Assume the linear model (2.2) with ε ∼ subG_n(σ^2) and θ^* ≠ 0. Show that the modified BIC estimator θ̂ defined by

θ̂ ∈ argmin_{θ ∈ IR^d} { (1/n)|Y − Xθ|_2^2 + λ|θ|_0 log(ed/|θ|_0) }

satisfies

MSE(Xθ̂) ≲ |θ^*|_0 σ^2 log(ed/|θ^*|_0)/n

with probability .99, for appropriately chosen λ. What do you conclude?

Problem 2.8. Assume that the linear model (2.2) holds where ε ∼ subG_n(σ^2). Moreover, assume the conditions of Theorem 2.2 and that the columns of X are normalized in such a way that max_j |X_j|_2 ≤ √n. Then the Lasso estimator θ̂^L with regularization parameter

2τ = 8σ √(2 log(2d)/n)

satisfies

|θ̂^L|_1 ≤ C|θ^*|_1

with probability 1 − (2d)^{−1} for some constant C to be specified.

Chapter 3

Misspecified Linear Models

Arguably, the strongest assumption that we made in Chapter 2 is that the regression function is of the form f(x) = x^⊤θ^*. What if this assumption is violated? In reality, we do not really believe in the linear model and we hope that good statistical methods should be robust to deviations from this model. This is the problem of model misspecification, and in particular of misspecified linear models.

Throughout this chapter, we assume the following model:

Y_i = f(X_i) + ε_i ,   i = 1, . . . , n,   (3.1)

where ε = (ε_1, . . . , ε_n)^⊤ is sub-Gaussian with variance proxy σ^2 and X_1, . . . , X_n ∈ IR^d. When dealing with fixed design, it will be convenient to consider the vector g ∈ IR^n defined for any function g : IR^d → IR by g = (g(X_1), . . . , g(X_n))^⊤. In this case, we can write for any estimator f̂ ∈ IR^n of f,

MSE(f̂) = (1/n)|f̂ − f|_2^2 .

Even though the regression function f may not be linear, we are interested in studying the properties of the various linear estimators introduced in the previous chapters: θ̂^ls, θ̂^ls_K, θ̂^bic, θ̂^L. Clearly, even with an infinite number of observations, we have no chance of finding the correct model if we don't know it. Nevertheless, as we will see in this chapter, something can still be said about these estimators using oracle inequalities.

3.1 ORACLE INEQUALITIES

As mentioned in the introduction, an oracle is a quantity that cannot be constructed without the knowledge of the quantity of interest, here: the regression function. Unlike the regression function itself, an oracle is constrained to take a specific form. For all intents and purposes, an oracle can be viewed as an estimator (in a given family) that can be constructed with an infinite amount of data. This is exactly what we should aim for in misspecified models.

When employing the least squares estimator θ̂^ls, we constrain ourselves to estimating functions that are of the form x ↦ x^⊤θ, even though f itself may not be of this form. Therefore, the oracle is the linear function that is the closest to f.

Rather than trying to approximate f by a linear function f(x) ≈ θ^⊤x, we make the model a bit more general and consider a dictionary H = {ϕ_1, . . . , ϕ_M} of functions where ϕ_j : IR^d → IR. In this case, we can actually remove the assumption that X ∈ IR^d. Indeed, the goal is now to estimate f using a linear combination of the functions in the dictionary:

f ≈ ϕ_θ := Σ_{j=1}^M θ_j ϕ_j .

Remark 3.1. If M = d and ϕ_j(X) = X^{(j)} returns the jth coordinate of X ∈ IR^d, then the goal is to approximate f(x) by θ^⊤x. Nevertheless, the use of a dictionary allows for a much more general framework.

Note that the use of a dictionary does not affect the methods that we have been using so far, namely penalized/constrained least squares. We use the same notation as before and define

1. The least squares estimator:

θ̂^ls ∈ argmin_{θ ∈ IR^M} (1/n) Σ_{i=1}^n ( Y_i − ϕ_θ(X_i) )^2   (3.2)

2. The least squares estimator constrained to K ⊂ IRM :

θ̂^ls_K ∈ argmin_{θ ∈ K} (1/n) Σ_{i=1}^n ( Y_i − ϕ_θ(X_i) )^2

3. The BIC estimator:

θ̂^bic ∈ argmin_{θ ∈ IR^M} { (1/n) Σ_{i=1}^n ( Y_i − ϕ_θ(X_i) )^2 + τ^2 |θ|_0 }   (3.3)

4. The Lasso estimator:

θ̂^L ∈ argmin_{θ ∈ IR^M} { (1/n) Σ_{i=1}^n ( Y_i − ϕ_θ(X_i) )^2 + 2τ|θ|_1 }   (3.4)
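To make the dictionary setup concrete, here is a small sketch (our own illustration; the dictionary, sample size, noise level and the true regression function are all arbitrary choices) that forms the n × M matrix with entries ϕ_j(X_i) and computes the least squares estimator (3.2) over the dictionary. The BIC and Lasso estimators (3.3)-(3.4) would be obtained by swapping in the corresponding penalized objectives.

```python
import numpy as np

# Hypothetical dictionary: a few simple functions of a scalar covariate.
dictionary = [lambda x: np.ones_like(x),
              lambda x: x,
              lambda x: x ** 2,
              lambda x: np.sin(2 * np.pi * x)]

rng = np.random.default_rng(1)
n = 100
X = rng.uniform(size=n)
f = lambda x: np.exp(x)                     # true regression function, not in the span of the dictionary
Y = f(X) + 0.1 * rng.standard_normal(n)

Phi = np.column_stack([phi(X) for phi in dictionary])   # n x M design matrix, Phi[i, j] = phi_j(X_i)
theta_ls, *_ = np.linalg.lstsq(Phi, Y, rcond=None)      # least squares over the dictionary, as in (3.2)

mse = np.mean((Phi @ theta_ls - f(X)) ** 2)             # empirical MSE against the (misspecified) truth
print(theta_ls, mse)
```

Even though exp(x) is not a linear combination of the dictionary functions, the fitted combination comes close to the best approximation in the span, which is precisely the oracle notion introduced next.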

Definition 3.2. Let R(·) be a risk function and let H = {ϕ_1, . . . , ϕ_M} be a dictionary of functions from IR^d to IR. Let K be a subset of IR^M. The oracle on K with respect to R is defined by ϕ_θ̄, where θ̄ ∈ K is such that

R(ϕ_θ̄) ≤ R(ϕ_θ) , ∀ θ ∈ K .

Moreover, R_K = R(ϕ_θ̄) is called the oracle risk on K. An estimator f̂ is said to satisfy an oracle inequality (over K) with remainder term φ in expectation (resp. with high probability) if there exists a constant C ≥ 1 such that

IE R(f̂) ≤ C inf_{θ ∈ K} R(ϕ_θ) + φ_{n,M}(K) ,

or

IP( R(f̂) ≤ C inf_{θ ∈ K} R(ϕ_θ) + φ_{n,M,δ}(K) ) ≥ 1 − δ , ∀ δ > 0 ,

respectively. If C = 1, the oracle inequality is sometimes called exact.

Our goal will be to mimic oracles. The finite sample performance of an estimator at this task is captured by an oracle inequality.

Oracle inequality for the least squares estimator

While our ultimate goal is to prove sparse oracle inequalities for the BIC and Lasso estimators in the case of a misspecified model, the difficulty of the extension from linear models is essentially already captured by the analysis of the least squares estimator. In this simple case, we can even obtain an exact oracle inequality.

Theorem 3.3. Assume the general regression model (3.1) with ε ∼ subG_n(σ^2). Then, the least squares estimator θ̂^ls satisfies for some numerical constant C > 0,

MSE(ϕ_θ̂ls) ≤ inf_{θ ∈ IR^M} MSE(ϕ_θ) + C (σ^2 M/n) log(1/δ)

with probability at least 1 − δ.

Proof. Note that by definition

|Y − ϕ_θ̂ls|_2^2 ≤ |Y − ϕ_θ̄|_2^2 ,

where ϕ_θ̄ denotes the orthogonal projection of f onto the linear span of ϕ_1, . . . , ϕ_M. Since Y = f + ε, we get

|f − ϕ_θ̂ls|_2^2 ≤ |f − ϕ_θ̄|_2^2 + 2ε^⊤(ϕ_θ̂ls − ϕ_θ̄) .

Moreover, by Pythagoras's theorem, we have

|f − ϕ_θ̂ls|_2^2 − |f − ϕ_θ̄|_2^2 = |ϕ_θ̂ls − ϕ_θ̄|_2^2 .

It yields

|ϕ_θ̂ls − ϕ_θ̄|_2^2 ≤ 2ε^⊤(ϕ_θ̂ls − ϕ_θ̄) .

Using the same steps as the ones following equation (2.5) for the well specified case, we get

(1/n)|ϕ_θ̂ls − ϕ_θ̄|_2^2 ≲ (σ^2 M/n) log(1/δ)

with probability 1 − δ. The result of the theorem follows.

Sparse oracle inequality for the BIC estimator

The analysis of more complicated estimators, such as the BIC estimator of Chapter 2, allows us to derive oracle inequalities for these estimators as well.

Theorem 3.4. Assume the general regression model (3.1) with ε ∼ subG_n(σ^2). Then, the BIC estimator θ̂^bic with regularization parameter

τ^2 = 16 log(6) σ^2/n + 32 σ^2 log(eM)/n   (3.5)

satisfies for some numerical constant C > 0,

MSE(ϕ_θ̂bic) ≤ inf_{θ ∈ IR^M} { 3 MSE(ϕ_θ) + (Cσ^2/n)|θ|_0 log(eM/δ) } + (Cσ^2/n) log(1/δ)

with probability at least 1 − δ.

Proof. Recall that the proof of Theorem 2.14 for the BIC estimator begins as follows:

(1/n)|Y − ϕ_θ̂bic|_2^2 + τ^2|θ̂^bic|_0 ≤ (1/n)|Y − ϕ_θ|_2^2 + τ^2|θ|_0 .

This is true for any θ ∈ IR^M. It implies

|f − ϕ_θ̂bic|_2^2 + nτ^2|θ̂^bic|_0 ≤ |f − ϕ_θ|_2^2 + 2ε^⊤(ϕ_θ̂bic − ϕ_θ) + nτ^2|θ|_0 .

Note that if θ̂^bic = θ, the result is trivial. Otherwise,

2ε^⊤(ϕ_θ̂bic − ϕ_θ) = 2 [ ε^⊤ (ϕ_θ̂bic − ϕ_θ)/|ϕ_θ̂bic − ϕ_θ|_2 ] |ϕ_θ̂bic − ϕ_θ|_2
  ≤ (2/α)[ ε^⊤ (ϕ_θ̂bic − ϕ_θ)/|ϕ_θ̂bic − ϕ_θ|_2 ]^2 + (α/2)|ϕ_θ̂bic − ϕ_θ|_2^2 ,

where we used Young's inequality 2ab ≤ (2/α)a^2 + (α/2)b^2, which is valid for a, b ∈ IR and α > 0. Next, since

(α/2)|ϕ_θ̂bic − ϕ_θ|_2^2 ≤ α|ϕ_θ̂bic − f|_2^2 + α|ϕ_θ − f|_2^2 ,

we get for α = 1/2,

(1/2)|ϕ_θ̂bic − f|_2^2 ≤ (3/2)|ϕ_θ − f|_2^2 + nτ^2|θ|_0 + 4[ ε^⊤ U(ϕ_θ̂bic − ϕ_θ) ]^2 − nτ^2|θ̂^bic|_0
  ≤ (3/2)|ϕ_θ − f|_2^2 + 2nτ^2|θ|_0 + 4[ ε^⊤ U(ϕ_{θ̂bic − θ}) ]^2 − nτ^2|θ̂^bic − θ|_0 .

We conclude as in the proof of Theorem 2.14.

The interpretation of this theorem is enlightening. It implies that the BIC estimator will mimic the best tradeoff between the approximation error MSE(ϕθ) and the complexity of θ as measured by its sparsity. In particular, this result, which is sometimes called sparse oracle inequality, implies the following oracle inequality. Define the oracle θ¯ to be such that

MSE(ϕ_θ̄) = min_{θ ∈ IR^M} MSE(ϕ_θ) .

Then, with probability at least 1 − δ,

MSE(ϕ_θ̂bic) ≤ 3 MSE(ϕ_θ̄) + (Cσ^2/n)[ |θ̄|_0 log(eM) + log(1/δ) ] .

If the linear model happens to be correct, then we simply have MSE(ϕθ¯) = 0.

Sparse oracle inequality for the Lasso

To prove an oracle inequality for the Lasso, we need additional assumptions on the design matrix, such as incoherence. Here, the design matrix is given by the n × M matrix Φ with elements Φ_{i,j} = ϕ_j(X_i).

Theorem 3.5. Assume the general regression model (3.1) with ε ∼ subG_n(σ^2). Moreover, assume that there exists an integer k such that the matrix Φ satisfies assumption INC(k). Then, the Lasso estimator θ̂^L with regularization parameter given by

2τ = 8σ √(2 log(2M)/n) + 8σ √(2 log(1/δ)/n)   (3.6)

satisfies for some numerical constant C,

MSE(ϕ_θ̂L) ≤ inf_{θ ∈ IR^M, |θ|_0 ≤ k} { MSE(ϕ_θ) + (Cσ^2/n)|θ|_0 log(eM/δ) }

with probability at least 1 − δ.

Proof. From the definition of θ̂^L, for any θ ∈ IR^M,

(1/n)|Y − ϕ_θ̂L|_2^2 ≤ (1/n)|Y − ϕ_θ|_2^2 + 2τ|θ|_1 − 2τ|θ̂^L|_1 .

Expanding the squares, adding τ|θ̂^L − θ|_1 on each side and multiplying by n, we get

|ϕ_θ̂L − f|_2^2 − |ϕ_θ − f|_2^2 + nτ|θ̂^L − θ|_1 ≤ 2ε^⊤(ϕ_θ̂L − ϕ_θ) + nτ|θ̂^L − θ|_1 + 2nτ|θ|_1 − 2nτ|θ̂^L|_1 .   (3.7)

Next, note that INC(k) for any k ≥ 1 implies that |ϕ_j|_2 ≤ 2√n for all j = 1, . . . , M. Applying Hölder's inequality, we get that with probability 1 − δ, it holds that

2ε^⊤(ϕ_θ̂L − ϕ_θ) ≤ (nτ/2)|θ̂^L − θ|_1 .

Therefore, taking S = supp(θ) to be the support of θ, we get that the right-hand side of (3.7) is bounded by

|ϕ_θ̂L − f|_2^2 − |ϕ_θ − f|_2^2 + nτ|θ̂^L − θ|_1 ≤ 2nτ|θ̂^L − θ|_1 + 2nτ|θ|_1 − 2nτ|θ̂^L|_1
  = 2nτ|θ̂^L_S − θ|_1 + 2nτ|θ|_1 − 2nτ|θ̂^L_S|_1
  ≤ 4nτ|θ̂^L_S − θ|_1   (3.8)

with probability 1 − δ.

It implies that either MSE(ϕ_θ̂L) ≤ MSE(ϕ_θ) or that

|θ̂^L_{S^c} − θ_{S^c}|_1 ≤ 3|θ̂^L_S − θ_S|_1 ,

so that θ̂^L − θ satisfies the cone condition (2.18). Using now the Cauchy-Schwarz inequality and Lemma 2.17, respectively, assuming that |θ|_0 ≤ k, we get

4nτ|θ̂^L_S − θ|_1 ≤ 4nτ √|S| |θ̂^L_S − θ|_2 ≤ 4τ √(2n|θ|_0) |ϕ_θ̂L − ϕ_θ|_2 .

Using now the inequality 2ab ≤ (2/α)a^2 + (α/2)b^2, we get

4nτ|θ̂^L_S − θ|_1 ≤ 16τ^2 n|θ|_0/α + (α/2)|ϕ_θ̂L − ϕ_θ|_2^2
  ≤ 16τ^2 n|θ|_0/α + α|ϕ_θ̂L − f|_2^2 + α|ϕ_θ − f|_2^2 .

Combining this result with (3.7) and (3.8), we find

(1 − α) MSE(ϕ_θ̂L) ≤ (1 + α) MSE(ϕ_θ) + 16τ^2|θ|_0/α .

To conclude the proof, it only remains to divide by 1 − α on both sides of the above inequality and take α = 1/2.

Maurey’s argument

If there is no sparse θ such that MSE(ϕ_θ) is small, Theorem 3.4 is useless, whereas the Lasso may still enjoy slow rates. In reality, no one really believes in the existence of sparse vectors but rather of approximately sparse vectors. Zipf's law would instead favor the existence of vectors θ with absolute coefficients that decay polynomially when ordered from largest to smallest in absolute value. This is the case for example if θ has a small ℓ_1 norm but is not sparse. For such θ, the Lasso estimator still enjoys slow rates as in Theorem 2.15, which can be easily extended to the misspecified case (see Problem 3.2). As a result, it seems that the Lasso estimator is strictly better than the BIC estimator as long as incoherence holds, since it enjoys both fast and slow rates, whereas the BIC estimator seems to be tailored to the fast rate. Fortunately, such vectors can be well approximated by sparse vectors in the following sense: for any vector θ ∈ IR^M such that |θ|_1 ≤ 1, there exists a vector θ' that is sparse and for which MSE(ϕ_θ') is not much larger than MSE(ϕ_θ). The following theorem quantifies exactly the tradeoff between sparsity and MSE. It is often attributed to B. Maurey and was published by Pisier [Pis81]. This is why it is referred to as Maurey's argument.

Theorem 3.6. Let {ϕ_1, . . . , ϕ_M} be a dictionary normalized in such a way that

max_{1 ≤ j ≤ M} |ϕ_j|_2 ≤ D√n .

Then for any integer k such that 1 ≤ k ≤ M and any positive R, we have

min_{θ ∈ IR^M, |θ|_0 ≤ k} MSE(ϕ_θ) ≤ min_{θ ∈ IR^M, |θ|_1 ≤ R} MSE(ϕ_θ) + D^2 R^2/k .

Proof. Define

θ̄ ∈ argmin_{θ ∈ IR^M, |θ|_1 ≤ R} |ϕ_θ − f|_2^2

and assume without loss of generality that |θ̄_1| ≥ |θ̄_2| ≥ . . . ≥ |θ̄_M|. Let now U ∈ IR^n be a random vector with values in {0, ±Rϕ_1, . . . , ±Rϕ_M} defined by

IP( U = R sign(θ̄_j) ϕ_j ) = |θ̄_j|/R ,   j = 1, . . . , M,
IP( U = 0 ) = 1 − |θ̄|_1/R .

Note that IE[U] = ϕ_θ̄ and |U|_2 ≤ RD√n. Let now U_1, . . . , U_k be k independent copies of U and define their average

Ū = (1/k) Σ_{i=1}^k U_i .

Note that Ū = ϕ_θ̃ for some θ̃ ∈ IR^M such that |θ̃|_0 ≤ k. Moreover, using the Pythagorean theorem,

IE|f − Ū|_2^2 = IE|f − ϕ_θ̄ + ϕ_θ̄ − Ū|_2^2
  = |f − ϕ_θ̄|_2^2 + IE|ϕ_θ̄ − Ū|_2^2
  = |f − ϕ_θ̄|_2^2 + IE|U − IE[U]|_2^2 / k
  ≤ |f − ϕ_θ̄|_2^2 + (RD√n)^2 / k .

To conclude the proof, note that

IE|f − Ū|_2^2 = IE|f − ϕ_θ̃|_2^2 ≥ min_{θ ∈ IR^M, |θ|_0 ≤ k} |f − ϕ_θ|_2^2

and divide by n.
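The proof above is constructive and easy to simulate. The following sketch is our own illustration (all parameters are arbitrary, and the target is taken to be ϕ_θ̄ itself for simplicity): it draws the random atoms U_1, . . . , U_k exactly as in the proof and averages them, so that the squared error is of order (RD√n)^2/k on average.

```python
import numpy as np

def maurey_sparsify(Phi, theta_bar, R, k, rng):
    """Average of k i.i.d. atoms U_i with IE[U_i] = Phi @ theta_bar, as in the proof of Theorem 3.6.

    Requires |theta_bar|_1 <= R; each atom equals R*sign(theta_bar_j)*Phi[:, j] with
    probability |theta_bar_j|/R, and 0 with the remaining probability.
    """
    probs = np.abs(theta_bar) / R
    probs = np.append(probs, max(0.0, 1.0 - probs.sum()))
    probs /= probs.sum()                                    # guard against rounding error
    atoms = np.column_stack([R * np.sign(theta_bar) * Phi, np.zeros(Phi.shape[0])])
    draws = rng.choice(atoms.shape[1], size=k, p=probs)
    return atoms[:, draws].mean(axis=1)                     # = phi_{tilde theta} with |tilde theta|_0 <= k

rng = np.random.default_rng(0)
n, M, R = 200, 50, 1.0
Phi = rng.standard_normal((n, M))
theta_bar = rng.standard_normal(M)
theta_bar *= 0.8 * R / np.abs(theta_bar).sum()              # |theta_bar|_1 = 0.8 <= R

for k in (5, 20, 80):
    errs = [np.sum((maurey_sparsify(Phi, theta_bar, R, k, rng) - Phi @ theta_bar) ** 2)
            for _ in range(200)]
    print(k, np.mean(errs))                                 # decreases roughly like 1/k
```

The averaged squared error shrinks at the 1/k rate promised by the theorem, while each draw is by construction a combination of at most k dictionary elements.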

Maurey’s argument implies the following corollary.

Corollary 3.7. Assume that the assumptions of Theorem 3.4 hold and that the dictionary {ϕ_1, . . . , ϕ_M} is normalized in such a way that

max_{1 ≤ j ≤ M} |ϕ_j|_2 ≤ √n .

Then there exists a constant C > 0 such that the BIC estimator satisfies

MSE(ϕ_θ̂bic) ≤ inf_{θ ∈ IR^M} { 2 MSE(ϕ_θ) + C [ (σ^2|θ|_0 log(eM)/n) ∧ ( σ|θ|_1 √(log(eM)/n) ) ] } + C σ^2 log(1/δ)/n

with probability at least 1 − δ.

Proof. Choosing α = 1/3 in Theorem 3.4 yields

2 2 n σ |θ|0 log(eM)o σ log(1/δ) MSE(ϕθˆbic ) ≤ 2 inf MSE(ϕθ) + C + C θ∈IRM n n

Let θ0 ∈ IRM . It follows from Maurey’s argument that for any k ∈ [M], there 0 M exists θ = θ(θ , k) ∈ IR such that |θ|0 = k and

0 2 2|θ |1 MSE(ϕ ) ≤ MSE(ϕ 0 ) + θ θ k It implies that

2 0 2 2 σ |θ|0 log(eM) 2|θ |1 σ k log(eM) MSE(ϕ ) + C ≤ MSE(ϕ 0 ) + + C θ n θ k n 3.1. Oracle inequalities 80 and furthermore that

2 0 2 2 σ |θ|0 log(eM) 2|θ |1 σ k log(eM) inf MSE(ϕθ) + C ≤ MSE(ϕθ0 ) + + C θ∈IRM n k n

Since the above bound holds for any θ0 ∈ RM and k ∈ [M], we can take an infimum with respect to both θ0 and k on the right-hand side to get

2 n σ |θ|0 log(eM)o inf MSE(ϕθ) + C θ∈IRM n 0 2 2 n |θ |1 σ k log(eM)o ≤ inf MSE(ϕθ0 ) + C min + C . θ0∈IRM k k n To control the minimum over k, we need to consider three cases for the quantity

|θ0| r n k¯ = 1 . σ log(eM)

1. If 1 ≤ k¯ ≤ M, then we get

0 2 2 r |θ |1 σ k log(eM) 0 log(eM) min + C ≤ Cσ|θ |1 k k n n

2. If k¯ ≤ 1, then σ2 log(eM) |θ0|2 ≤ C , 1 n which yields

|θ0|2 σ2k log(eM) σ2 log(eM) min 1 + C ≤ C k k n n

3. If k¯ ≥ M, then σ2M log(eM) |θ0|2 ≤ C 1 . n M Therefore, on the one hand, if M ≥ √ |θ|1 , we get σ log(eM)/n

0 2 2 0 2 r |θ |1 σ k log(eM) |θ |1 0 log(eM) min + C ≤ C ≤ Cσ|θ |1 . k k n M n

On the other hand, if M ≤ √ |θ|1 , then for any θ ∈ IRM , we have σ log(eM)/n r σ2|θ| log(eM) σ2M log(eM) log(eM) 0 ≤ ≤ Cσ|θ0| . n n 1 n 3.2. Nonparametric regression 81

Combined,

0 2 2 n |θ |1 σ k log(eM)o inf MSE(ϕθ0 ) + C min + C θ0∈IRM k k n r 2 n 0 log(eM σ log(eM)o ≤ inf MSE(ϕθ0 ) + Cσ|θ |1 + C , θ0∈IRM n n which together with Theorem 3.4 yields the claim.

Note that this last result holds for any estimator that satisfies an oracle inequality with respect to the `0 norm as in Theorem 3.4. In particular, this estimator need not be the BIC estimator. An example is the Exponential Screening estimator of [RT11]. Maurey’s argument allows us to enjoy the best of both the `0 and the `1 world. The rate adapts to the sparsity of the problem and can be even generalized to `q-sparsity (see Problem 3.3). However, it is clear from the proof that this argument is limited to squared `2 norms such as the one appearing in MSE and extension to other risk measures is non trivial. Some work has been done for non-Hilbert spaces [Pis81, DDGS97] using more sophisticated arguments.

3.2 NONPARAMETRIC REGRESSION

So far, the oracle inequalities that we have derived do not deal with the approximation error MSE(ϕθ). We kept it arbitrary and simply hoped that it was small. Note also that in the case of linear models, we simply assumed that the approximation error was zero. As we will see in this section, this error can be quantified under natural smoothness conditions if the dictionary of functions H = {ϕ1, . . . , ϕM } is chosen appropriately. In what follows, we assume for simplicity that d = 1 so that f : IR → IR and ϕj : IR → IR.

Fourier decomposition Historically, nonparametric estimation was developed before high-dimensional statistics and most results hold for the case where the dictionary H = {ϕ1, . . . , ϕM } forms an orthonormal system of L2([0, 1]):

∫_0^1 ϕ_j^2(x) dx = 1 ,   ∫_0^1 ϕ_j(x) ϕ_k(x) dx = 0 , ∀ j ≠ k .

We will also deal with the case where M = ∞. When H is an orthonormal system, the coefficients θ^*_j ∈ IR defined by

θ^*_j = ∫_0^1 f(x) ϕ_j(x) dx

are called Fourier coefficients of f.

Assume now that the regression function f admits the following decomposition:

f = Σ_{j=1}^∞ θ^*_j ϕ_j .

There exist many choices for the orthonormal system and we give only two as examples.

Example 3.8. Trigonometric basis. This is an orthonormal basis of L_2([0, 1]). It is defined by

ϕ_1 ≡ 1 ,
ϕ_{2k}(x) = √2 cos(2πkx) ,
ϕ_{2k+1}(x) = √2 sin(2πkx) ,

for k = 1, 2, . . . and x ∈ [0, 1]. The fact that it is indeed an orthonormal system can be easily checked using trigonometric identities.

The next example has received a lot of attention in the signal (sound, image, . . . ) processing community.

Example 3.9. Wavelets. Let ψ : IR → IR be a sufficiently smooth and compactly supported function, called the "mother wavelet". Define the system of functions

ψ_{jk}(x) = 2^{j/2} ψ(2^j x − k) ,   j, k ∈ Z .

It can be shown that for a suitable ψ, the dictionary {ψj,k, j, k ∈ Z} forms an orthonormal system of L2([0, 1]) and sometimes a basis. In the latter case, for any function g ∈ L2([0, 1]), it holds

g = Σ_{j=−∞}^∞ Σ_{k=−∞}^∞ θ_{jk} ψ_{jk} ,   θ_{jk} = ∫_0^1 g(x) ψ_{jk}(x) dx .

The coefficients θ_{jk} are called wavelet coefficients of g. The simplest example is given by the Haar system, obtained by taking ψ to be the following piecewise constant function (see Figure 3.1). We will not give more details about wavelets here but simply point the interested reader to [Mal09].

ψ(x) = 1 if 0 ≤ x < 1/2 ,   ψ(x) = −1 if 1/2 ≤ x ≤ 1 ,   ψ(x) = 0 otherwise.
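A minimal sketch of the Haar system (our own illustration): it evaluates the mother wavelet above and the rescaled functions ψ_{jk}, and checks orthonormality numerically by a Riemann sum on a fine grid.

```python
import numpy as np

def haar_mother(x):
    """Haar mother wavelet: +1 on [0, 1/2), -1 on [1/2, 1], 0 elsewhere."""
    x = np.asarray(x, dtype=float)
    return (np.where((x >= 0) & (x < 0.5), 1.0, 0.0)
            - np.where((x >= 0.5) & (x <= 1), 1.0, 0.0))

def haar(j, k, x):
    """psi_{jk}(x) = 2^{j/2} psi(2^j x - k)."""
    return 2 ** (j / 2) * haar_mother(2 ** j * np.asarray(x, dtype=float) - k)

# quick orthonormality check via a Riemann sum over [0, 1]
t = np.linspace(0, 1, 100001)
print(np.trapz(haar(1, 0, t) * haar(1, 0, t), t))   # ~ 1 (unit norm)
print(np.trapz(haar(1, 0, t) * haar(1, 1, t), t))   # ~ 0 (orthogonality)
```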

Figure 3.1. The Haar mother wavelet

Sobolev classes and ellipsoids

We begin by describing a class of smooth functions where smoothness is understood in terms of the number of derivatives. Recall that f^{(k)} denotes the k-th derivative of f.

Definition 3.10. Fix parameters β ∈ {1, 2, . . .} and L > 0. The Sobolev class of functions W(β, L) is defined by

W(β, L) = { f : [0, 1] → IR : f ∈ L_2([0, 1]) , f^{(β−1)} is absolutely continuous and
  ∫_0^1 [f^{(β)}]^2 ≤ L^2 ,  f^{(j)}(0) = f^{(j)}(1), j = 0, . . . , β − 1 } .

Any function f ∈ W(β, L) can be represented¹ as its Fourier expansion along the trigonometric basis:

f(x) = θ^*_1 ϕ_1(x) + Σ_{k=1}^∞ ( θ^*_{2k} ϕ_{2k}(x) + θ^*_{2k+1} ϕ_{2k+1}(x) ) ,   ∀ x ∈ [0, 1] ,

where θ^* = {θ^*_j}_{j≥1} is in the space of squared summable sequences ℓ_2(IN) defined by

ℓ_2(IN) = { θ : Σ_{j=1}^∞ θ_j^2 < ∞ } .

For any β > 0, define the coefficients

a_j = j^β for j even ,   a_j = (j − 1)^β for j odd .   (3.9)

With these coefficients, we can define the Sobolev class of functions in terms of Fourier coefficients.

¹ In the sense that lim_{k→∞} ∫_0^1 | f(t) − Σ_{j=1}^k θ_j ϕ_j(t) |^2 dt = 0.

Theorem 3.11. Fix β ≥ 1 and L > 0 and let {ϕ_j}_{j≥1} denote the trigonometric basis of L_2([0, 1]). Moreover, let {a_j}_{j≥1} be defined as in (3.9). A function f ∈ W(β, L) can be represented as

f = Σ_{j=1}^∞ θ^*_j ϕ_j ,

where the sequence {θ^*_j}_{j≥1} belongs to the Sobolev ellipsoid of ℓ_2(IN) defined by

Θ(β, Q) = { θ ∈ ℓ_2(IN) : Σ_{j=1}^∞ a_j^2 θ_j^2 ≤ Q }

for Q = L^2/π^{2β}.

Proof. Let us first define the Fourier coefficients {sk(j)}k≥1 of the jth deriva- tive f (j) of f for j = 1, . . . , β: Z 1 (j) (j−1) (j−1) s1(j) = f (t)dt = f (1) − f (0) = 0 , 0 √ Z 1 (j) s2k(j) = 2 f (t) cos(2πkt)dt , 0 √ Z 1 (j) s2k+1(j) = 2 f (t) sin(2πkt)dt , 0

The Fourier coefficients of f are given by θk = sk(0). Using integration by parts, we find that

√ 1 √ Z 1 (β−1) (β−1) s2k(β) = 2f (t) cos(2πkt) + (2πk) 2 f (t) sin(2πkt)dt 0 0 √ √ Z 1 = 2[f (β−1)(1) − f (β−1)(0)] + (2πk) 2 f (β−1)(t) sin(2πkt)dt 0 = (2πk)s2k+1(β − 1) . Moreover,

√ 1 √ Z 1 (β−1) (β−1) s2k+1(β) = 2f (t) sin(2πkt) − (2πk) 2 f (t) cos(2πkt)dt 0 0 = −(2πk)s2k(β − 1) . In particular, it yields

2 2 2 2 2 s2k(β) + s2k+1(β) = (2πk) s2k(β − 1) + s2k+1(β − 1) By induction, we find that for any k ≥ 1,

2 2 2β 2 2  s2k(β) + s2k+1(β) = (2πk) θ2k + θ2k+1 3.2. Nonparametric regression 85

Next, it follows for the definition (3.9) of aj that

∞ ∞ ∞ X 2β 2 2  2β X 2 2 2β X 2 2 (2πk) θ2k + θ2k+1 = π a2kθ2k + π a2k+1θ2k+1 k=1 k=1 k=1 ∞ 2β X 2 2 = π aj θj . j=1

Together with the Parseval identity, it yields

Z 1 ∞ ∞ (β) 2 X 2 2 2β X 2 2 f (t) dt = s2k(β) + s2k+1(β) = π aj θj . 0 k=1 j=1

To conclude, observe that since f ∈ W (β, L), we have

1 Z 2 f (β)(t) dt ≤ L2 , 0 so that θ ∈ Θ(β, L2/π2β).

It can actually be shown that the reciprocal is true, that is, any function with Fourier coefficients in Θ(β, Q) belongs to W(β, L), but we will not be needing this. In what follows, we will define smooth functions as functions with Fourier coefficients (with respect to the trigonometric basis) in a Sobolev ellipsoid. By extension, we write f ∈ Θ(β, Q) in this case and consider any real value for β.

Proposition 3.12. The Sobolev ellipsoids enjoy the following properties

(i) For any Q > 0,

0 < β0 < β ⇒ Θ(β, Q) ⊂ Θ(β0,Q).

(ii) For any Q > 0,

β > 1/2 ⇒ f is continuous.

The proof is left as an exercise (Problem 3.6).

It turns out that the first n functions in the trigonometric basis are not only orthonormal with respect to the inner product of L_2, but also with respect to the inner product associated with the fixed design, ⟨f, g⟩ := (1/n) Σ_{i=1}^n f(X_i) g(X_i), when the design is chosen to be regular, i.e., X_i = (i − 1)/n, i = 1, . . . , n.

Lemma 3.13. Assume that {X1,...,Xn} is the regular design, i.e., Xi = (i − 1)/n. Then, for any M ≤ n − 1, the design matrix Φ = {ϕj(Xi)} 1≤i≤n 1≤j≤M satisfies the ORT condition. 3.2. Nonparametric regression 86

0 0 > Proof. Fix j, j ∈ {1, . . . , n − 1}, j 6= j , and consider the inner product ϕj ϕj0 . 0 0 Write kj = bj/2c for the integer part of j/2 and define the vectors a, b, a , b ∈ i2πk s i2πk 0 s n j j 0 0 IR with coordinates such that e n = as+1 +ibs+1 and e n = as+1 +ibs+1 for s ∈ {0, . . . , n − 1}. It holds that

1 > > 0 > 0 > 0 > 0 ϕ ϕ 0 ∈ {a a , b b , b a , a b } , 2 j j depending on the parity of j and j0. On the one hand, observe that if kj 6= kj0 , we have for any σ ∈ {−1, +1},

n−1 n−1 i2πk s i2πk 0 s i2π(kj +σk 0 )s X j σ j X j e n e n = e n = 0 . s=0 s=0 On the other hand

n−1 i2πk s i2πk 0 s X j σ j > 0 0 > 0 > 0  > 0 > 0 e n e n = (a + ib) (a + σib ) = a a − σb b + i b a + σa b s=0

> 0 > 0 > 0 > 0 so that a a = ±b b = 0 and b a = ±a b = 0 whenever, kj 6= kj0 . It yields > ϕj ϕj0 = 0. Next, consider the case where kj = kj0 so that

n−1 i2πk s i2πk 0 s  X j σ j 0 if σ = 1 e n e n = . n if σ = −1 s=0

0 > > 0 > 0 On the one hand if j 6= j , it can only be the case that ϕj ϕj0 ∈ {b a , a b } but the same argument as above yields b>a0 = ±a>b0 = 0 since the imaginary > part of the inner product is still 0. Hence, in that case, ϕj ϕj0 = 0. On the 0 0 0 > 0 2 other hand, if j = j , then a = a and b = b so that it yields a a = |a|2 = n > 0 2 > 2 and b b = |b|2 = n which is equivalent to ϕj ϕj = |ϕj|2 = n. Therefore, the design matrix Φ is such that

Φ^⊤Φ = n I_M .
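Lemma 3.13 is easy to verify numerically. The sketch below is our own check (n and M are arbitrary, subject to M ≤ n − 1): it builds the design matrix of the first M trigonometric basis functions at the regular design points and confirms that Φ^⊤Φ = n I_M up to rounding error.

```python
import numpy as np

def trig_basis(x, M):
    """First M functions of the trigonometric basis of Example 3.8, evaluated at x."""
    cols = [np.ones_like(x)]
    k = 1
    while len(cols) < M:
        cols.append(np.sqrt(2) * np.cos(2 * np.pi * k * x))
        if len(cols) < M:
            cols.append(np.sqrt(2) * np.sin(2 * np.pi * k * x))
        k += 1
    return np.column_stack(cols[:M])

n, M = 64, 20                      # M <= n - 1, as required by Lemma 3.13
x = np.arange(n) / n               # regular design X_i = (i-1)/n
Phi = trig_basis(x, M)
print(np.max(np.abs(Phi.T @ Phi - n * np.eye(M))))   # ~ 0, i.e. Phi^T Phi = n I_M
```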

Integrated squared error

As mentioned in the introduction of this chapter, the smoothness assumption allows us to control the approximation error. Before going into the details, let us gain some insight. Note first that if θ ∈ Θ(β, Q), then a_j^2 θ_j^2 → 0 as j → ∞, so that |θ_j| = o(j^{−β}). Therefore, the θ_j's decay polynomially to zero and it makes sense to approximate f by its truncated Fourier series

Σ_{j=1}^M θ^*_j ϕ_j =: ϕ_{θ^*}^M

for any fixed M. This truncation leads to a systematic error that vanishes as M → ∞. We are interested in understanding the rate at which this happens. The Sobolev assumption allows us to control precisely this error as a function of the tunable parameter M and the smoothness β.

Lemma 3.14. For any integer M ≥ 1 and f ∈ Θ(β, Q), β > 1/2, it holds

‖ϕ_{θ^*}^M − f‖_{L_2}^2 = Σ_{j>M} |θ^*_j|^2 ≤ Q M^{−2β} ,   (3.10)

and for M = n − 1, we have

|ϕ_{θ^*}^{n−1} − f|_2^2 ≤ 2n ( Σ_{j≥n} |θ^*_j| )^2 ≲ Q n^{2−2β} .   (3.11)

Proof. Note that for any θ ∈ Θ(β, Q), if β > 1/2, then

∞ ∞ X X 1 |θ | = a |θ | j j j a j=2 j=2 j

v ∞ ∞ uX X 1 ≤ u a2θ2 by Cauchy-Schwarz t j j a2 j=2 j=2 j

v ∞ u X 1 ≤ uQ < ∞ t j2β j=1

Since {ϕj}j forms an orthonormal system in L2([0, 1]), we have

2 M 2 X ∗ 2 min kϕθ − fkL = kϕθ∗ − fkL = |θj | . θ∈IRM 2 2 j>M

When θ∗ ∈ Θ(β, Q), we have

X X 1 1 Q |θ∗|2 = a2|θ∗|2 ≤ Q ≤ . j j j a2 a2 M 2β j>M j>M j M+1

To prove the second part of the lemma, observe that √ n−1 X ∗ X ∗ |ϕθ∗ − f|2 = θj ϕj 2 ≤ 2 2n |θj | , j≥n j≥n where in√ the last inequality, we used the fact that for the trigonometric basis |ϕj|2 ≤ 2n, j ≥ 1 regardless of the choice of the design X1,...,Xn. When θ∗ ∈ Θ(β, Q), we have s s X X 1 X X 1 1 ∗ ∗ 2 ∗ 2 p 2 −β |θj | = aj|θj | ≤ aj |θj | 2 . Qn . aj a j≥n j≥n j≥n j≥n j 3.2. Nonparametric regression 88

M Note that the truncated Fourier series ϕθ∗ is an oracle: this is what we see when we view f through the lens of functions with only low frequency harmonics. M To estimate ϕθ∗ = ϕθ∗ , consider the estimator ϕθˆls where

n ˆls X 2 θ ∈ argmin Yi − ϕθ(Xi) , M θ∈IR i=1 which should be such that ϕθˆls is close to ϕθ∗ . For this estimator, we have proved (Theorem 3.3) an oracle inequality for the MSE that is of the form

M 2 M 2 2 |ϕˆls − f|2 ≤ inf |ϕθ − f|2 + Cσ M log(1/δ) , C > 0 . θ θ∈IRM

It yields

M M 2 M M > M 2 |ϕ − ϕ ∗ | ≤ 2(ϕ − ϕ ∗ ) (f − ϕ ∗ ) + Cσ M log(1/δ) θˆls θ 2 θˆls θ θ M M > X ∗ 2 = 2(ϕ − ϕ ∗ ) ( θ ϕ ) + Cσ M log(1/δ) θˆls θ j j j>M M M > X ∗ 2 = 2(ϕ − ϕ ∗ ) ( θ ϕ ) + Cσ M log(1/δ) , θˆls θ j j j≥n where we used Lemma 3.13 in the last equality. Together with (3.11) and Young’s inequality 2ab ≤ αa2 + b2/α, a, b ≥ 0 for any α > 0, we get

M M > X ∗ M M 2 C 2−2β 2(ϕ − ϕ ∗ ) ( θ ϕ ) ≤ α|ϕ − ϕ ∗ | + Qn , θˆls θ j j θˆls θ 2 α j≥n for some positive constant C when θ∗ ∈ Θ(β, Q). As a result,

2 M M 2 1 2−2β σ M |ϕ − ϕ ∗ | Qn + log(1/δ) (3.12) θˆls θ 2 . α(1 − α) 1 − α

M M 2 M for any α ∈ (0, 1). Since, Lemma 3.13 implies, |ϕ − ϕ ∗ | = nkϕ − θˆls θ 2 θˆls M 2 ϕ ∗ k , we have proved the following theorem. θ L2([0,1]) √ Theorem 3.15. Fix β ≥ (1 + 5)/4 ' 0.81, Q > 0, δ > 0 and assume the 2 2 general regression model (3.1) with f ∈ Θ(β, Q) and ε ∼ subGn(σ ), σ ≤ 1. 1 Moreover, let M = dn 2β+1 e and n be large enough so that M ≤ n−1. Then the ˆls M least squares estimator θ defined in (3.2) with {ϕj}j=1 being the trigonometric basis, satisfies with probability 1 − δ, for n large enough,

M 2 − 2β 2 kϕ − fk n 2β+1 (1 + σ log(1/δ)) . θˆls L2([0,1]) . where the constant factors may depend on β, Q and σ. 3.2. Nonparametric regression 89

Proof. Choosing α = 1/2, for example, and absorbing Q in the constants, we get from (3.12) and Lemma 3.13 that for M ≤ n − 1,

‖ϕ_{θ̂ls}^M − ϕ_{θ^*}^M‖_{L_2([0,1])}^2 ≲ n^{1−2β} + σ^2 M log(1/δ)/n .

Using now Lemma 3.14 and σ^2 ≤ 1, we get

‖ϕ_{θ̂ls}^M − f‖_{L_2([0,1])}^2 ≲ M^{−2β} + n^{1−2β} + σ^2 M log(1/δ)/n .

Taking M = ⌈n^{1/(2β+1)}⌉ ≤ n − 1 for n large enough yields

‖ϕ_{θ̂ls}^M − f‖_{L_2([0,1])}^2 ≲ n^{−2β/(2β+1)} + n^{1−2β} σ^2 log(1/δ) .

To conclude the proof, simply note that for the prescribed β, we have n^{1−2β} ≤ n^{−2β/(2β+1)}.
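A minimal simulation of the projection estimator analyzed in Theorem 3.15 (our own sketch; the regression function, noise level, and the value of β used to pick M are arbitrary choices, and β is treated as known, which is precisely the assumption that adaptive estimation removes below). On the regular design, the ORT property of Lemma 3.13 makes the least squares coefficients a simple averaging of the data against the basis functions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta, sigma = 512, 2, 0.5
x = np.arange(n) / n                          # regular design X_i = (i-1)/n
f = lambda t: np.abs(t - 0.5)                 # an illustrative (unknown) regression function
Y = f(x) + sigma * rng.standard_normal(n)

M = int(np.ceil(n ** (1 / (2 * beta + 1))))   # truncation level M = ceil(n^{1/(2*beta+1)})

# first M trigonometric basis functions evaluated at the design points
cols = [np.ones(n)]
k = 1
while len(cols) < M:
    cols.append(np.sqrt(2) * np.cos(2 * np.pi * k * x))
    if len(cols) < M:
        cols.append(np.sqrt(2) * np.sin(2 * np.pi * k * x))
    k += 1
Phi = np.column_stack(cols)

theta_hat = Phi.T @ Y / n                     # least squares = Fourier averaging, since Phi^T Phi = n I_M
f_hat = Phi @ theta_hat
print("empirical MSE:", np.mean((f_hat - f(x)) ** 2))
```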

Adaptive estimation

The rate attained by the projection estimator ϕ_{θ̂ls} with M = ⌈n^{1/(2β+1)}⌉ is actually optimal so, in this sense, it is a good estimator. Unfortunately, its implementation requires the knowledge of the smoothness parameter β, which is typically unknown, to determine the truncation level M. The purpose of adaptive estimation is precisely to adapt to the unknown β, that is, to build an estimator that does not depend on β and yet attains a rate of the order of Cn^{−2β/(2β+1)} (up to a logarithmic slowdown).

To that end, we will use the oracle inequalities for the BIC and Lasso estimators defined in (3.3) and (3.4) respectively. In view of Lemma 3.13, the design matrix Φ actually satisfies the assumption ORT when we work with the trigonometric basis. This has two useful implications:

1. Both estimators are actually thresholding estimators and can therefore be implemented efficiently.

2. The condition INC(k) is automatically satisfied for any k ≥ 1.

These observations lead to the following corollary.

Corollary 3.16. Fix β > (1 + √5)/4 ≈ 0.81, Q > 0, δ > 0 and n large enough to ensure n − 1 ≥ ⌈n^{1/(2β+1)}⌉, and assume the general regression model (3.1) with f ∈ Θ(β, Q) and ε ∼ subG_n(σ^2), σ^2 ≤ 1. Let {ϕ_j}_{j=1}^{n−1} be the trigonometric basis. Denote by ϕ_{θ̂bic}^{n−1} (resp. ϕ_{θ̂L}^{n−1}) the BIC (resp. Lasso) estimator defined in (3.3) (resp. (3.4)) over IR^{n−1} with regularization parameter given by (3.5) (resp. (3.6)). Then ϕ_θ̂^{n−1}, where θ̂ ∈ {θ̂^bic, θ̂^L}, satisfies with probability 1 − δ,

‖ϕ_θ̂^{n−1} − f‖_{L_2([0,1])}^2 ≲ (n/log n)^{−2β/(2β+1)} (1 + σ^2 log(1/δ)) ,

where the constants may depend on β and Q.

Proof. For θˆ ∈ {θˆbic, θˆL}, adapting the proofs of Theorem 3.4 for the BIC estimator and Theorem 3.5 for the Lasso estimator, for any θ ∈ IRn−1, with probability 1 − δ 1 + α |ϕn−1 − f|2 ≤ |ϕn−1 − f|2 + R(|θ| ) . θˆ 2 1 − α θ 2 0 where Cσ2 R(|θ| ) := |θ | log(en/δ) 0 α(1 − α) 0 It yields 2α |ϕn−1 − ϕn−1|2 ≤ |ϕn−1 − f|2 + 2(ϕn−1 − ϕn−1)>(f − ϕn−1) + R(|θ| ) θˆ θ 2 1 − α θ 2 θˆ θ θ 0  2α 1  ≤ + |ϕn−1 − f|2 + α|ϕn−1 − ϕn−1|2 + R(|θ| ) , 1 − α α θ 2 θˆ θ 2 0 where we used Young’s inequality once again. Let M ∈ N be determined later ∗ ∗ ∗ and choose now α = 1/2 and θ = θM , where θM is equal to θ on its first M n−1 M coordinates and 0 otherwise so that ϕ ∗ = ϕ ∗ . It yields θM θ

n−1 n−1 2 n−1 2 n−1 n−1 2 n−1 2 |ϕ − ϕ ∗ |2 |ϕ ∗ − f|2 + R(M) |ϕ ∗ − ϕ ∗ |2 + |ϕ ∗ − f|2 + R(M) θˆ θM . θM . θM θ θ

n−1 2 2−2β Next, it follows from (3.11) that |ϕθ∗ − f|2 . Qn . Together with Lemma 3.13, it yields

n−1 n−1 2 n−1 n−1 2 1−2β R(M) kϕˆ − ϕθ∗ kL ([0,1]) . kϕθ∗ − ϕθ∗ kL ([0,1]) + Qn + . θ M 2 M 2 n Moreover, using (3.10), we find that

M kϕn−1 − fk2 M −2β + Qn1−2β + σ2 log(en/δ) . θˆ L2([0,1]) . n

1 To conclude the proof, choose M = d(n/ log n) 2β+1 e and observe that the choice 1−2β −2β of β ensures that n . M . While there is sometimes a (logarithmic) price to pay for adaptation, it turns out that the extra logarithmic factor can be removed by a clever use of blocks (see [Tsy09, Chapter 3]). The reason why we get this extra logarithmic factor here is because we use a hammer that’s too big. Indeed, BIC and Lasso allow for “holes” in the Fourier decomposition, so we only get part of their potential benefits. 3.3. Problem Set 91

3.3 PROBLEM SET

Problem 3.1. Show that the least-squares estimator θˆls defined in (3.2) sat- isfies the following exact oracle inequality:

2 M IEMSE(ϕθˆls ) ≤ inf MSE(ϕθ) + Cσ θ∈IRM n for some constant C to be specified.

2 Problem 3.2. Assume that ε ∼ subG√ n(σ ) and the vectors ϕj are normalized in such a way that maxj |ϕj|2 ≤ n. Show that there exists a choice of τ such that the Lasso estimator θˆL with regularization parameter 2τ satisfies the following exact oracle inequality: r n log M o MSE(ϕθˆL ) ≤ inf MSE(ϕθ) + Cσ|θ|1 θ∈IRM n with probability at least 1 − M −c for some positive constants C, c.

Problem 3.3. Let √{ϕ1, . . . , ϕM } be a dictionary normalized in such a way that maxj |ϕj|2 ≤ D n. Show that for any integer k such that 1 ≤ k ≤ M, we have 1 1 2 q¯ q¯  2 M − k min MSE(ϕθ) ≤ min MSE(ϕθ) + CqD , θ∈IRM θ∈IRM k |θ|0≤2k |θ|w`q ≤1 1 1 where |θ|w`q denotes the weak `q norm andq ¯ is such that q + q¯ = 1, for q > 1. Problem 3.4. Show that the trigonometric basis and the Haar system indeed form an orthonormal system of L2([0, 1]).

Problem 3.5. Consider the n × d random matrix Φ = {ϕj(Xi)}1≤i≤n where 1≤j≤d X1,...,Xn are i.i.d uniform random variables on the interval [0, 1] and φj is the trigonometric basis as defined in Example 3.8. Show that Φ satisfies INC(k) with probability at least .9 as long as n ≥ Ck2 log(d) for some large enough constant C > 0. Problem 3.6. If f ∈ Θ(β, Q) for β > 1/2 and Q > 0, then f is continuous. o nice 1. a question to for answer [CT06] negative a (see imply theory will information in arduous. lies more about answer much statement The is a introduction). answer is done? that studied it negative proof be have 2., A a this we question with in estimator together 2.). example, estimator, the (question For better for better a proof performs finding simply it better questions or these a to 1.) answer finding positive (question a in Indeed, consists answer. a negative simply of a or optimality positive a about for is one second the of whereas estimator, form bound. an some of about optimality ask questions Both the answer to is goal of our goal Specifically, the optimality. and their questions: bounds assess following upper to several is proved chapter have we this chapters, previous the In ( Y 1 Cnayetmtripoeuo hs bounds? these that upon estimators improve the estimator do any Can words: 2. other In improved? be analysis our Can 1. Y , . . . , osdrteGusa euneMdl(S)weew observe we where (GSM) Model Sequence Gaussian the Consider It 2. question to answer negative a give to how see will we chapter, this In looking are we whether on depending varies questions these of difficulty The ehv tde culystsybte bounds? better satisfy actually studied have we d ) > endby defined , . PIAIYI IIA SENSE MINIMAX A IN OPTIMALITY 4.1 Y i = θ i ∗ + iia oe Bounds Lower Minimax ε i i , 92 optimality 1 = , d , . . . , h rtoei about is one first The . l estimators all

o can How . C h a p t e r 4 Y (4.1) = 4.1. Optimality in a minimax sense 93

Θ          φ(Θ)                    Estimator            Result
IR^d       σ^2 d/n                 θ̂^ls                 Theorem 2.2
B_1        σ √(log d / n)          θ̂^ls_{B_1}           Theorem 2.4
B_0(k)     σ^2 k log(ed/k)/n       θ̂^ls_{B_0(k)}        Corollaries 2.8-2.9

Table 4.1. Rate φ(Θ) obtained for different choices of Θ.

where ε = (ε_1, . . . , ε_d)^⊤ ∼ N_d(0, (σ^2/n) I_d), θ^* = (θ^*_1, . . . , θ^*_d)^⊤ ∈ Θ is the parameter of interest and Θ ⊂ IR^d is a given set of parameters. We will need a more precise notation for probabilities and expectations throughout this chapter. Denote by IP_{θ^*} and IE_{θ^*} the probability measure and corresponding expectation that are associated to the distribution of Y from the GSM (4.1).

Recall that the GSM is a special case of the linear regression model when the design matrix satisfies the ORT condition. In this case, we have proved several performance guarantees (upper bounds) for various choices of Θ that can be expressed either in the form

IE|θ̂_n − θ^*|_2^2 ≤ Cφ(Θ)   (4.2)

or in the form

|θ̂_n − θ^*|_2^2 ≤ Cφ(Θ) , with prob. 1 − d^{−2},   (4.3)

for some constant C. The rates φ(Θ) for different choices of Θ that we have obtained are gathered in Table 4.1 together with the estimator (and the corresponding result from Chapter 2) that was employed to obtain this rate. Can any of these results be improved? In other words, does there exist another estimator θ̃ such that sup_{θ^* ∈ Θ} IE|θ̃ − θ^*|_2^2 ≪ φ(Θ)?

A first step in this direction is the Cramér-Rao lower bound [Sha03] that allows us to prove lower bounds in terms of the Fisher information. Nevertheless, this notion of optimality is too stringent and often leads to nonexistence of optimal estimators. Rather, we prefer here the notion of minimax optimality that characterizes how fast θ^* can be estimated uniformly over Θ.

Definition 4.1. We say that an estimator θ̂_n is minimax optimal over Θ if it satisfies (4.2) and there exists C' > 0 such that

inf_θ̂ sup_{θ ∈ Θ} IE_θ|θ̂ − θ|_2^2 ≥ C'φ(Θ)   (4.4)

where the infimum is taken over all estimators (i.e., measurable functions of Y). Moreover, φ(Θ) is called the minimax rate of estimation over Θ.

Note that minimax rates of convergence φ are defined up to multiplicative constants. We may then choose this constant such that the minimax rate has a simple form such as σ^2 d/n as opposed to 7σ^2 d/n for example.

This definition can be adapted to rates that hold with high probability. As we saw in Chapter 2 (cf. Table 4.1), the upper bounds in expectation and those with high probability are of the same order of magnitude. It is also the case for lower bounds. Indeed, observe that it follows from the Markov inequality that for any A > 0,

IE_θ[ φ^{−1}(Θ)|θ̂ − θ|_2^2 ] ≥ A IP_θ( φ^{−1}(Θ)|θ̂ − θ|_2^2 > A ) .   (4.5)

Therefore, (4.4) follows if we prove that

inf_θ̂ sup_{θ ∈ Θ} IP_θ( |θ̂ − θ|_2^2 > Aφ(Θ) ) ≥ C

for some positive constants A and C. The above inequality also implies a lower bound with high probability. We can therefore employ the following alternate definition for minimax optimality.

Definition 4.2. We say that an estimator θ̂ is minimax optimal over Θ if it satisfies either (4.2) or (4.3) and there exists C' > 0 such that

inf_θ̂ sup_{θ ∈ Θ} IP_θ( |θ̂ − θ|_2^2 > φ(Θ) ) ≥ C'   (4.6)

where the infimum is taken over all estimators (i.e., measurable functions of Y). Moreover, φ(Θ) is called the minimax rate of estimation over Θ.

4.2 REDUCTION TO FINITE HYPOTHESIS TESTING

Minimax lower bounds rely on information theory and follow from a simple principle: if the number of observations is too small, it may be hard to distin- guish between two probability distributions that are close to each other. For example, given n i.i.d. observations, it is impossible to reliably decide whether 1 they are drawn from N (0, 1) or N ( n , 1). This simple argument can be made precise using the formalism of statistical hypothesis testing. To do so, we reduce our estimation problem to a testing problem. The reduction consists of two steps. 1. Reduction to a finite number of parameters. In this step the goal is to find the largest possible number of parameters θ1, . . . , θM ∈ Θ under the constraint that 2 |θj − θk|2 ≥ 4φ(Θ) . (4.7) This problem boils down to a packing of the set Θ. Then we can use the following trivial observations:

 ˆ 2   ˆ 2  inf sup IPθ |θ − θ|2 > φ(Θ) ≥ inf max IPθj |θ − θj|2 > φ(Θ) . θˆ θ∈Θ θˆ 1≤j≤M 4.3. Lower bounds based on two hypotheses 95

2. Reduction to a hypothesis testing problem. In this second step, the necessity of the constraint (4.7) becomes apparent. For any estimator θˆ, define the minimum distance test ψ(θˆ) that is asso- ciated to it by ˆ ˆ ψ(θ) = argmin |θ − θj|2 , 1≤j≤M with ties broken arbitrarily. Next observe that if, for some j = 1,...,M, ψ(θˆ) 6= j, then there exists ˆ ˆ k 6= j such that |θ − θk|2 ≤ |θ − θj|2. Together with the reverse triangle inequality it yields ˆ ˆ ˆ |θ − θj|2 ≥ |θj − θk|2 − |θ − θk|2 ≥ |θj − θk|2 − |θ − θj|2 so that 1 |θˆ − θ | ≥ |θ − θ | j 2 2 j k 2 Together with constraint (4.7), it yields ˆ 2 |θ − θj|2 ≥ φ(Θ) As a result,  ˆ 2   ˆ  inf max IPθj |θ − θj|2 > φ(Θ) ≥ inf max IPθj ψ(θ) 6= j θˆ 1≤j≤M θˆ 1≤j≤M   ≥ inf max IPθ ψ 6= j ψ 1≤j≤M j where the infimum is taken over all tests ψ based on Y and that take values in {1,...,M}.

Conclusion: it is sufficient for proving lower bounds to find θ1, . . . , θM ∈ Θ 2 such that |θj − θk|2 ≥ 4φ(Θ) and   0 inf max IPθ ψ 6= j ≥ C . ψ 1≤j≤M j The above quantity is called minimax probability of error. In the next sections, we show how it can be bounded from below using arguments from information theory. For the purpose of illustration, we begin with the simple case where M = 2 in the next section.

4.3 LOWER BOUNDS BASED ON TWO HYPOTHESES

The Neyman-Pearson Lemma and the total variation distance

Consider two probability measures IP0 and IP1 and observations X drawn from either IP0 or IP1. We want to know which distribution X comes from. It corresponds to the following statistical hypothesis problem:

H0 : Z ∼ IP0

H1 : Z ∼ IP1 4.3. Lower bounds based on two hypotheses 96

A test ψ = ψ(Z) ∈ {0, 1} indicates which hypothesis should be true. Any test ψ can make two types of errors. It can commit either an error of type I (ψ = 1 whereas Z ∼ IP0) or an error of type II (ψ = 0 whereas Z ∼ IP1). Of course, the test may also be correct. The following fundamental result, called the Neyman Pearson Lemma indicates that any test ψ is bound to commit one of these two types of error with positive probability unless IP0 and IP1 have essentially disjoint support. Let ν be a sigma finite measure satisfying IP0  ν and IP1  ν. For example we can take ν = IP0 + IP1. It follows from the Radon-Nikodym theorem [Bil95] that both IP0 and IP1 admit probability densities with respect to ν. We denote them by p0 and p1 respectively. For any function f, we write for simplicity Z Z f = f(x)ν(dx)

Lemma 4.3 (Neyman-Pearson). Let IP0 and IP1 be two probability measures. Then for any test ψ, it holds Z IP0(ψ = 1) + IP1(ψ = 0) ≥ min(p0, p1)

? Moreover, equality holds for the Likelihood Ratio test ψ = 1I(p1 ≥ p0). Proof. Observe first that Z Z ? ? IP0(ψ = 1) + IP1(ψ = 0) = p0 + p1 ψ∗=1 ψ∗=0 Z Z = p0 + p1 p1≥p0 p1

? Next for any test ψ, define its rejection region R = {ψ = 1}. Let R = {p1 ≥ ? p0} denote the rejection region of the likelihood ratio test ψ . It holds

IP0(ψ = 1) + IP1(ψ = 0) = 1 + IP0(R) − IP1(R) Z = 1 + p0 − p1 R Z Z = 1 + p0 − p1 + p0 − p1 R∩R? R∩(R?)c Z Z = 1 − |p0 − p1| + |p0 − p1| R∩R? R∩(R?)c Z  ? c ?  = 1 + |p0 − p1| 1I(R ∩ (R ) ) − 1I(R ∩ R )

The above quantity is clearly minimized for R = R?. 4.3. Lower bounds based on two hypotheses 97

The lower bound in the Neyman-Pearson lemma is related to a well known quantity: the total variation distance.

Definition-Proposition 4.4. The total variation distance between two probability measures IP_0 and IP_1 on a measurable space (X, A) is defined by

TV(IP_0, IP_1) = sup_{R ∈ A} |IP_0(R) − IP_1(R)|   (i)
  = sup_{R ∈ A} ∫_R (p_0 − p_1)   (ii)
  = (1/2) ∫ |p_0 − p_1|   (iii)
  = 1 − ∫ min(p_0, p_1)   (iv)
  = 1 − inf_ψ [ IP_0(ψ = 1) + IP_1(ψ = 0) ]   (v)

where the infimum above is taken over all tests.

Proof. Clearly (i) = (ii) and the Neyman-Pearson Lemma gives (iv) = (v). Moreover, by identifying a test ψ to its rejection region, it is not hard to see that (i) = (v). Therefore it remains only to show that (iii) is equal to any of the other expressions. Hereafter, we show that (iii) = (iv). To that end, observe that Z Z Z |p0 − p1| = p1 − p0 + p0 − p1 p1≥p0 p1
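The identities (i)-(v) are easy to check numerically for distributions on a small finite space, where the supremum over events can be computed by brute force. A quick sketch (ours; the two distributions are random and the support size is an arbitrary choice):

```python
import numpy as np
from itertools import combinations, chain

rng = np.random.default_rng(0)
p0 = rng.dirichlet(np.ones(6))
p1 = rng.dirichlet(np.ones(6))

tv_l1 = 0.5 * np.abs(p0 - p1).sum()                 # expression (iii)
tv_min = 1.0 - np.minimum(p0, p1).sum()             # expression (iv)

# expression (i): brute-force supremum over all events R of |P0(R) - P1(R)|
idx = range(len(p0))
events = chain.from_iterable(combinations(idx, r) for r in range(len(p0) + 1))
tv_sup = max(abs(p0[list(R)].sum() - p1[list(R)].sum()) for R in events)

print(tv_l1, tv_min, tv_sup)                        # all three agree up to rounding
```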

In view of the Neyman-Pearson lemma, it is clear that if we want to prove large lower bounds, we need to find probability distributions that are close in total variation. Yet, this conflicts with constraint (4.7) and a tradeoff needs to be achieved. To that end, in the Gaussian sequence model, we need to be able σ2 σ2 to compute the total variation distance between N (θ0, n Id) and N (θ1, n Id). None of the expression in Definition-Proposition 4.4 gives an easy way to do so. The Kullback-Leibler divergence is much more convenient. 4.3. Lower bounds based on two hypotheses 98

The Kullback-Leibler divergence

Definition 4.5. The Kullback-Leibler divergence between probability measures IP_1 and IP_0 is given by

KL(IP_1, IP_0) = ∫ log( dIP_1/dIP_0 ) dIP_1 ,   if IP_1 ≪ IP_0 ,
KL(IP_1, IP_0) = ∞ ,   otherwise.

It can be shown [Tsy09] that the integral is always well defined when IP_1 ≪ IP_0 (though it can be equal to ∞ even in this case). Unlike the total variation distance, the Kullback-Leibler divergence is not a distance. Actually, it is not even symmetric. Nevertheless, it enjoys properties that are very useful for our purposes.

Proposition 4.6. Let IP and Q be two probability measures. Then 1. KL(IP, Q) ≥ 0. 2. The function (IP, Q) 7→ KL(IP, Q) is convex. 3. If IP and Q are product measures, i.e., n n O O IP = IPi and Q = Qi i=1 i=1 then n X KL(IP, Q) = KL(IPi, Qi) . i=1 Proof. If IP is not absolutely continuous then the result is trivial. Next, assume that IP  Q and let X ∼ IP. 1. Observe that by Jensen’s inequality,  d   d  KL(IP, ) = −IElog Q (X) ≥ − log IE Q (X) = − log(1) = 0 . Q dIP dIP 2. Consider the function f :(x, y) 7→ x log(x/y) and compute its Hessian:

∂2f 1 ∂2f x ∂2f 1 = , = , = − , ∂x2 x ∂y2 y2 ∂x∂y y  1 1  x − y We check that the matrix H = 1 x is positive semidefinite for all − y y2 x, y > 0. To that end, compute its determinant and trace: 1 1 1 x det(H) = − = 0 , Tr(H) = > 0 . y2 y2 x y2 Therefore, H has one null eigenvalue and one positive one and must therefore be positive semidefinite so that the function f is convex on (0, ∞)2. Since the 4.3. Lower bounds based on two hypotheses 99

KL divergence is a sum of such functions, it is also convex on the space of positive measures. 3. Note that if X = (X1,...,Xn), dIP  KL(IP, Q) = IElog (X) dQ n Z X dIPi  = log (X ) dIP (X ) ··· dIP (X ) d i 1 1 n n i=1 Qi n Z X dIPi  = log (X ) dIP (X ) d i i i i=1 Qi n X = KL(IPi, Qi) i=1

Point 2. in Proposition 4.6 is particularly useful in statistics where obser- vations typically consist of n independent random variables.

Example 4.7. For any θ ∈ IR^d, let P_θ denote the distribution of Y ∼ N(θ, σ^2 I_d). Then

KL(P_θ, P_θ') = Σ_{i=1}^d (θ_i − θ'_i)^2/(2σ^2) = |θ − θ'|_2^2/(2σ^2) .

The proof is left as an exercise (see Problem 4.1).
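For the record, the formula in Example 4.7 is straightforward to confirm by Monte Carlo, since log(dP_θ/dP_{θ'})(y) = (|y − θ'|_2^2 − |y − θ|_2^2)/(2σ^2). A quick sketch (ours; dimensions and parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 5, 1.3
theta = rng.standard_normal(d)
theta_prime = rng.standard_normal(d)

# Monte Carlo estimate of KL(P_theta, P_theta') with Y ~ N(theta, sigma^2 I_d)
Y = theta + sigma * rng.standard_normal((200_000, d))
log_ratio = (np.sum((Y - theta_prime) ** 2, axis=1)
             - np.sum((Y - theta) ** 2, axis=1)) / (2 * sigma ** 2)
print(log_ratio.mean())

# closed form from Example 4.7
print(np.sum((theta - theta_prime) ** 2) / (2 * sigma ** 2))
```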

The Kullback-Leibler divergence is easier to manipulate than the total vari- ation distance but only the latter is related to the minimax probability of error. Fortunately, these two quantities can be compared using Pinsker’s inequality. We prove here a slightly weaker version of Pinsker’s inequality that will be sufficient for our purpose. For a stronger statement, see [Tsy09], Lemma 2.5.

Lemma 4.8 (Pinsker’s inequality.). Let IP and Q be two probability measures such that IP  Q. Then p TV(IP, Q) ≤ KL(IP, Q) . 4.3. Lower bounds based on two hypotheses 100

Proof. Note that Z p KL(IP, Q) = p log pq>0 q Z r q  = −2 p log pq>0 p Z hr q i  = −2 p log − 1 + 1 pq>0 p Z hr q i ≥ −2 p − 1 (by Jensen) pq>0 p Z √ = 2 − 2 pq

Next, note that

 Z √ 2  Z 2 pq = pmax(p, q) min(p, q) Z Z ≤ max(p, q) min(p, q) (by Cauchy-Schwarz) Z Z = 2 − min(p, q) min(p, q)   = 1 + TV(IP, Q) 1 − TV(IP, Q) 2 = 1 − TV(IP, Q) The two displays yield

p 2 KL(IP, Q) ≥ 2 − 2 1 − TV(IP, Q)2 ≥ TV(IP, Q) , √ where we used the fact that 0 ≤ TV(IP, Q) ≤ 1 and 1 − x ≤ 1 − x/2 for x ∈ [0, 1].

Pinsker’s inequality yields the following theorem for the GSM.

Theorem 4.9. Assume that Θ contains two hypotheses θ_0 and θ_1 such that |θ_0 − θ_1|_2^2 = 8α^2 σ^2/n for some α ∈ (0, 1/2). Then

inf_θ̂ sup_{θ ∈ Θ} IP_θ( |θ̂ − θ|_2^2 ≥ 2α^2 σ^2/n ) ≥ 1/2 − α .

Proof. Write for simplicity IPj = IPθj , j = 0, 1. Recall that it follows from the 4.4. Lower bounds based on many hypotheses 101 reduction to hypothesis testing that

2 ˆ 2 2ασ inf sup IPθ(|θ − θ|2 ≥ ) ≥ inf max IPj(ψ 6= j) θˆ θ∈Θ n ψ j=0,1 1   ≥ inf IP0(ψ = 1) + IP1(ψ = 0) 2 ψ 1h i = 1 − TV(IP , IP ) (Prop.-def. 4.4) 2 0 1 1h i ≥ 1 − pKL(IP , IP ) (Lemma 4.8) 2 1 0 r 1h n|θ − θ |2 i = 1 − 1 0 2 (Example 4.7) 2 2σ2 1h i = 1 − 2α 2 Clearly the result of Theorem 4.9 matches the upper bound for Θ = IRd only for d = 1. How about larger d? A quick inspection of our proof shows that our technique, in its present state, cannot yield better results. Indeed, there are only two known candidates for the choice of θ∗. With this knowledge, one can obtain upper bounds that do not depend on d by simply projecting Y onto the linear span of θ0, θ1 and then solving the GSM in two dimensions. To obtain larger lower bounds, we need to use more than two hypotheses. In particular, in view of the above discussion, we need a set of hypotheses that spans a linear space of dimension proportional to d. In principle, we should need at least order d hypotheses but we will actually need much more.

4.4 LOWER BOUNDS BASED ON MANY HYPOTHESES

The reduction to hypothesis testing from Section 4.2 allows us to use more than two hypotheses. Specifically, we should find θ1, . . . , θM such that

  0 inf max IPθ ψ 6= j ≥ C , ψ 1≤j≤M j for some positive constant C0. Unfortunately, the Neyman-Pearson Lemma no longer exists for more than two hypotheses. Nevertheless, it is possible to relate the minimax probability of error directly to the Kullback-Leibler divergence, without involving the total variation distance. This is possible using a well known result from information theory called Fano’s inequality.

Theorem 4.10 (Fano's inequality). Let P_1, . . . , P_M, M ≥ 2, be probability distributions such that P_j ≪ P_k for all j, k. Then

inf_ψ max_{1 ≤ j ≤ M} P_j( ψ(X) ≠ j ) ≥ 1 − [ (1/M^2) Σ_{j,k=1}^M KL(P_j, P_k) + log 2 ] / log M ,

where the infimum is taken over all tests with values in {1, . . . , M}.

Proof. Define

M 1 X p = P (ψ = j) and q = P (ψ = j) j j j M k k=1 so that M M 1 X 1 X p¯ = p ∈ (0, 1) , q¯ = q . M j M j j=1 j=1 Moreover, define the KL divergence between two Bernoulli distributions: p 1 − p kl(p, q) = p log  + (1 − p) log  . q 1 − q Note thatp ¯logp ¯ + (1 − p¯) log(1 − p¯) ≥ − log 2, which yields kl(¯p, q¯) + log 2 p¯ ≤ . log M Next, observe that by convexity (Proposition 4.6, 2.), it holds

M M 1 X 1 X kl(¯p, q¯) ≤ kl(p , q ) ≤ kl(P (ψ = j),P (ψ = j)) . M j j M 2 j k j=1 j.k=1 It remains to show that

kl(Pj(ψ = j),Pk(ψ = j)) ≤ KL(Pj,Pk) . This result can be seen to follow directly from a well known inequality often referred to as data processing inequality but we are going to prove it directly. =j 6=j Denote by dPk (resp. dPk ) the conditional density of Pk given ψ(X) = j (resp. ψ(X) 6= j) and recall that Z  dPj  KL(Pj,Pk) = log dPj dPk Z Z  dPj   dPj  = log dPj + log dPj ψ=j dPk ψ6=j dPk =j Z dP P (ψ = j)  = P (ψ = j) log j j dP =j j =j P (ψ = j) j dPk k Z dP 6=j P (ψ 6= j)  + P (ψ 6= j) log j dP 6=j j 6=j P (ψ 6= j) j dPk k   Pj(ψ = j)  =j =j  = Pj(ψ = j) log + KL(Pj ,Pj ) Pk(ψ = j)

  Pj(ψ 6= j)  6=j 6=j  + Pj(ψ 6= j) log + KL(Pj ,Pj ) Pk(ψ 6= j)

≥ kl(Pj(ψ = j),Pk(ψ = j)) . 4.4. Lower bounds based on many hypotheses 103 where we used Proposition 4.6, 1. We have proved that

M 1 PM 1 X 2 j,k=1 KL(Pj,Pk) + log 2 P (ψ = j) ≤ M . M j log M j=1

Passing to the complemetary sets completes the proof of Fano’s inequality.

Fano’s inequality leads to the following useful theorem.

Theorem 4.11. Assume that Θ contains M ≥ 5 hypotheses θ_1, . . . , θ_M such that for some constant 0 < α < 1/4, it holds

(i) |θ_j − θ_k|_2^2 ≥ 4φ ,
(ii) |θ_j − θ_k|_2^2 ≤ (2ασ^2/n) log(M) .

Then

inf_θ̂ sup_{θ ∈ Θ} IP_θ( |θ̂ − θ|_2^2 ≥ φ ) ≥ 1/2 − 2α .

Proof. In view of (i), it follows from the reduction to hypothesis testing that it is sufficient to prove that

  1 inf max IPθ ψ 6= j ≥ − 2α. ψ 1≤j≤M j 2

If follows from (ii) and Example 4.7 that

n|θ − θ |2 KL(IP , IP ) = j k 2 ≤ α log(M) . j k 2σ2 Moreover, since M ≥ 5,

1 PM 2 KL(IPj, IPk) + log 2 α log(M) + log 2 1 M j,k=1 ≤ ≤ 2α + . log(M − 1) log(M − 1) 2

The proof then follows from Fano’s inequality.

Theorem 4.11 indicates that we must take φ ≤ (ασ^2/(2n)) log(M). Therefore, the larger the M, the larger the lower bound can be. However, M cannot be arbitrarily large because of the constraint (i). We are therefore facing a packing problem where the goal is to "pack" as many Euclidean balls of radius proportional to σ√(log(M)/n) in Θ under the constraint that their centers remain close together (constraint (ii)). If Θ = IR^d, the goal is to pack the Euclidean ball of radius R = σ√(2α log(M)/n) with Euclidean balls of radius R√(2α/γ). This can be done using a volume argument (see Problem 4.3). However, we will use the more versatile lemma below. It gives a lower bound on the size of a packing of the discrete hypercube {0, 1}^d with respect to the Hamming distance defined by

ρ(ω, ω') = Σ_{i=1}^d 1I(ω_i ≠ ω'_i) ,   ∀ ω, ω' ∈ {0, 1}^d .

Lemma 4.12 (Varshamov-Gilbert). For any γ ∈ (0, 1/2), there exist binary d vectors ω1, . . . ωM ∈ {0, 1} such that 1 (i) ρ(ω , ω ) ≥ − γd for all j 6= k , j k 2

2 γ2d γ d (ii) M = be c ≥ e 2 .

Proof. Let ωj,i, 1 ≤ i ≤ d, 1 ≤ j ≤ M be i.i.d. Bernoulli random variables with parameter 1/2 and observe that

$$d - \rho(\omega_j, \omega_k) = X \sim \mathrm{Bin}(d, 1/2).$$

Therefore it follows from a union bound that
$$\mathrm{IP}\Big(\exists\, j \neq k:\ \rho(\omega_j, \omega_k) < \big(\tfrac12 - \gamma\big)d\Big) \le \frac{M(M-1)}{2}\,\mathrm{IP}\Big(X - \frac d2 > \gamma d\Big).$$
Hoeffding's inequality then yields
$$\frac{M(M-1)}{2}\,\mathrm{IP}\Big(X - \frac d2 > \gamma d\Big) \le \exp\Big(-2\gamma^2 d + \log\Big(\frac{M(M-1)}{2}\Big)\Big) < 1$$
as soon as
$$M(M-1) < 2\exp\big(2\gamma^2 d\big).$$

A sufficient condition for the above inequality to hold is to take $M = \lfloor e^{\gamma^2 d}\rfloor \ge e^{\gamma^2 d/2}$. For this value of $M$, we have
$$\mathrm{IP}\Big(\forall\, j \neq k:\ \rho(\omega_j, \omega_k) \ge \big(\tfrac12 - \gamma\big)d\Big) > 0,$$

and by virtue of the probabilistic method, there exist $\omega_1, \dots, \omega_M \in \{0,1\}^d$ that satisfy (i) and (ii).
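The probabilistic construction above is easy to reproduce numerically. The following minimal Python/NumPy sketch (the values of $d$ and $\gamma$ and the random seed are arbitrary illustrative choices) draws $M = \lfloor e^{\gamma^2 d}\rfloor$ random sign patterns and reports the smallest pairwise Hamming distance, which typically exceeds $(1/2 - \gamma)d$, in line with the lemma.

    import numpy as np
    from itertools import combinations

    # Sketch: empirical check of the Varshamov-Gilbert construction.
    rng = np.random.default_rng(0)
    d, gamma = 60, 0.25
    M = int(np.floor(np.exp(gamma ** 2 * d)))      # M = floor(e^{gamma^2 d})

    omega = rng.integers(0, 2, size=(M, d))        # i.i.d. Bernoulli(1/2) entries
    min_dist = min(np.sum(omega[i] != omega[j])    # Hamming distance rho
                   for i, j in combinations(range(M), 2))

    print(f"M = {M}, min pairwise distance = {min_dist}, "
          f"target (1/2 - gamma) d = {(0.5 - gamma) * d:.1f}")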

4.5 APPLICATION TO THE GAUSSIAN SEQUENCE MODEL

We are now in a position to apply Theorem 4.11 by choosing $\theta_1, \dots, \theta_M$ based on $\omega_1, \dots, \omega_M$ from the Varshamov-Gilbert Lemma.

Lower bounds for estimation

Take $\gamma = 1/4$ and apply the Varshamov-Gilbert Lemma to obtain $\omega_1, \dots, \omega_M$ with $M = \lfloor e^{d/16}\rfloor \ge e^{d/32}$ and such that $\rho(\omega_j, \omega_k) \ge d/4$ for all $j \neq k$. Let $\theta_1, \dots, \theta_M$ be such that
$$\theta_j = \omega_j\,\frac{\beta\sigma}{\sqrt n},$$
for some $\beta > 0$ to be chosen later. We can check the conditions of Theorem 4.11:

(i) $|\theta_j - \theta_k|_2^2 = \frac{\beta^2\sigma^2}{n}\,\rho(\omega_j, \omega_k) \ge 4\,\frac{\beta^2\sigma^2 d}{16 n}$;

(ii) $|\theta_j - \theta_k|_2^2 = \frac{\beta^2\sigma^2}{n}\,\rho(\omega_j, \omega_k) \le \frac{\beta^2\sigma^2 d}{n} \le \frac{32\beta^2\sigma^2}{n}\log(M) = \frac{2\alpha\sigma^2}{n}\log(M)$,

for $\beta = \frac{\sqrt\alpha}{4}$. Applying now Theorem 4.11 yields
$$\inf_{\hat\theta}\sup_{\theta\in\mathrm{IR}^d} \mathrm{IP}_\theta\Big(|\hat\theta - \theta|_2^2 \ge \frac{\alpha\sigma^2 d}{256\, n}\Big) \ge \frac12 - 2\alpha.$$
It implies the following corollary.

Corollary 4.13. The minimax rate of estimation over $\mathrm{IR}^d$ in the Gaussian sequence model is $\phi(\mathrm{IR}^d) = \sigma^2 d/n$. Moreover, it is attained by the least squares estimator $\hat\theta^{ls} = Y$.

Note that this rate is also minimax over sets $\Theta$ that are strictly smaller than $\mathrm{IR}^d$ (see Problem 4.4). Indeed, it is minimax over any subset of $\mathrm{IR}^d$ that contains $\theta_1, \dots, \theta_M$.
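As a quick sanity check on the matching upper bound in Corollary 4.13, the following sketch (Python/NumPy; the model is taken in the form $Y = \theta^* + \frac{\sigma}{\sqrt n}\xi$, $\xi \sim \mathcal N(0, I_d)$, consistent with the KL computations above, and all parameter values are illustrative) compares the Monte Carlo risk of $\hat\theta^{ls} = Y$ with $\sigma^2 d/n$.

    import numpy as np

    # Sketch: Monte Carlo risk of the least squares estimator in the Gaussian
    # sequence model Y = theta* + (sigma/sqrt(n)) xi, xi ~ N(0, I_d).
    rng = np.random.default_rng(1)
    d, n, sigma = 50, 200, 2.0
    theta_star = rng.normal(size=d)                 # arbitrary truth
    n_rep = 5000

    errs = np.empty(n_rep)
    for r in range(n_rep):
        Y = theta_star + sigma / np.sqrt(n) * rng.normal(size=d)
        errs[r] = np.sum((Y - theta_star) ** 2)     # |theta_hat - theta*|_2^2

    print(f"Monte Carlo risk: {errs.mean():.4f},  sigma^2 d / n = {sigma**2 * d / n:.4f}")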

Lower bounds for sparse estimation

It appears from Table 4.1 that when estimating sparse vectors, we have to pay for an extra logarithmic term $\log(ed/k)$ for not knowing the sparsity pattern of the unknown $\theta^*$. In this section, we show that this term is unavoidable as it appears in the minimax optimal rate of estimation of sparse vectors.

Note that the vectors $\theta_1, \dots, \theta_M$ employed in the previous subsection are not guaranteed to be sparse because the vectors $\omega_1, \dots, \omega_M$ obtained from the Varshamov-Gilbert Lemma may themselves not be sparse. To overcome this limitation, we need a sparse version of the Varshamov-Gilbert lemma.

Lemma 4.14 (Sparse Varshamov-Gilbert). There exist positive constants $C_1$ and $C_2$ such that the following holds for any two integers $k$ and $d$ such that $1 \le k \le d/8$. There exist binary vectors $\omega_1, \dots, \omega_M \in \{0,1\}^d$ such that

(i) $\rho(\omega_i, \omega_j) \ge \dfrac k2$ for all $i \neq j$;

(ii) $\log(M) \ge \dfrac k8\log\Big(1 + \dfrac{d}{2k}\Big)$;

(iii) |ωj|0 = k for all j .

Proof. Take ω1, . . . , ωM independently and uniformly at random from the set d C0(k) = {ω ∈ {0, 1} : |ω|0 = k} d of k-sparse binary random vectors. Note that C0(k) has cardinality k . To choose ωj uniformly from C0(k), we proceed as follows. Let U1,...,Uk ∈ {1, . . . , d} be k random variables such that U1 is drawn uniformly at random from {1, . . . , d} and for any i = 2, . . . , k, conditionally on U1,...,Ui−1, the ran- dom variable Ui is drawn uniformly at random from {1, . . . , d}\{U1,...,Ui−1}. Then define

$$\omega_i = \begin{cases} 1 & \text{if } i \in \{U_1, \dots, U_k\},\\ 0 & \text{otherwise}.\end{cases}$$

Clearly, all outcomes in C0(k) are equally likely under this distribution and therefore, ω is uniformly distributed on C0(k). Observe that 1 X k IP∃ ω 6= ω : ρ(ω , ω ) < k = IP∃ ω 6= x : ρ(ω , x) <  j k j k d j j 2 k x∈{0,1}d |x|0=k M 1 X X k ≤ IPω 6= x : ρ(ω , x) <  d j j 2 k x∈{0,1}d j=1 |x|0=k k = MIPω 6= x : ρ(ω, x ) < , 0 0 2 where ω has the same distribution as ω1 and x0 is any k-sparse vector in d {0, 1} . The last equality holds by symmetry since (i) all the ωjs have the same distribution and (ii) all the outcomes of ωj are equally likely. Note that k X ρ(ω, x0) ≥ k − Zi , i=1 where Zi = 1I(Ui ∈ supp(x0)). Indeed the left hand side is the number of coordinates on which the vectors ω, x0 disagree and the right hand side is the number of coordinates in supp(x0) on which the two vectors disagree. In particular, we have that Z1 ∼ Ber(k/d) and for any i = 2, . . . , d, conditionally on Z1,...,Zi−i, we have Zi ∼ Ber(Qi), where k − Pi−1 Z k 2k Q = l=1 l ≤ ≤ , i p − (i − 1) d − k d since k ≤ d/2. Next we apply a Chernoff bound to get that for any s > 0,

k k k  X k  h X i − sk IP ω 6= x : ρ(ω, x ) < ≤ IP Z > = IE exp s Z e 2 0 0 2 i 2 i i=1 i=1 4.5. Application to the Gaussian sequence model 107

The above MGF can be controlled by induction on k as follows:

k k−1 h X i h X  i IE exp s Zi = IE exp s Zi IEexp sZk Z1,...,Zk=1 i=1 i=1 k−1 h X  s i = IE exp s Zi (Qk(e − 1) + 1) i=1 k−1 h X i 2k ≤ IE exp s Z  (es − 1) + 1 i d i=1 . .

2k k ≤ (es − 1) + 1 d = 2k

d For s = log(1 + 2k ). Putting everything together, we get  sk  IP∃ ω 6= ω : ρ(ω , ω ) < k ≤ exp log M + k log 2 − j k j k 2  k d  = exp log M + k log 2 − log(1 + ) 2 2k  k d  ≤ exp log M + k log 2 − log(1 + ) 2 2k  k d  ≤ exp log M − log(1 + ) (for d ≥ 8k) 4 2k < 1 , if we take M such that k d log M < log(1 + ). 4 2k

Apply the sparse Varshamov-Gilbert lemma to obtain $\omega_1, \dots, \omega_M$ with $\log(M) \ge \frac k8\log\big(1 + \frac{d}{2k}\big)$ and such that $\rho(\omega_j, \omega_k) \ge k/2$ for all $j \neq k$. Let $\theta_1, \dots, \theta_M$ be such that
$$\theta_j = \omega_j\,\frac{\beta\sigma}{\sqrt n}\sqrt{\log\Big(1 + \frac{d}{2k}\Big)},$$
for some $\beta > 0$ to be chosen later. We can check the conditions of Theorem 4.11:

(i) $|\theta_j - \theta_k|_2^2 = \frac{\beta^2\sigma^2}{n}\,\rho(\omega_j, \omega_k)\log\big(1 + \frac{d}{2k}\big) \ge 4\,\frac{\beta^2\sigma^2 k}{8n}\log\big(1 + \frac{d}{2k}\big)$;

(ii) $|\theta_j - \theta_k|_2^2 = \frac{\beta^2\sigma^2}{n}\,\rho(\omega_j, \omega_k)\log\big(1 + \frac{d}{2k}\big) \le \frac{2k\beta^2\sigma^2}{n}\log\big(1 + \frac{d}{2k}\big) \le \frac{2\alpha\sigma^2}{n}\log(M)$,

for $\beta = \sqrt{\alpha/8}$. Applying now Theorem 4.11 yields
$$\inf_{\hat\theta}\sup_{\substack{\theta\in\mathrm{IR}^d\\ |\theta|_0\le k}} \mathrm{IP}_\theta\Big(|\hat\theta - \theta|_2^2 \ge \frac{\alpha\sigma^2 k}{64\, n}\log\Big(1 + \frac{d}{2k}\Big)\Big) \ge \frac12 - 2\alpha.$$
It implies the following corollary.

Corollary 4.15. Recall that $B_0(k) \subset \mathrm{IR}^d$ denotes the set of all $k$-sparse vectors of $\mathrm{IR}^d$. The minimax rate of estimation over $B_0(k)$ in the Gaussian sequence model is $\phi(B_0(k)) = \frac{\sigma^2 k}{n}\log(ed/k)$. Moreover, it is attained by the constrained least squares estimator $\hat\theta^{ls}_{B_0(k)}$.

Note that the modified BIC estimator of Problem 2.7 and the Slope (2.26) are also minimax optimal over $B_0(k)$. However, unlike $\hat\theta^{ls}_{B_0(k)}$ and the modified BIC, the Slope is also adaptive to $k$. For any $\varepsilon > 0$, the Lasso estimator and the BIC estimator are minimax optimal for sets of parameters such that $k \le d^{1-\varepsilon}$.
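In the Gaussian sequence model, the constrained least squares estimator over $B_0(k)$ simply keeps the $k$ largest coordinates of $Y$ in absolute value and sets the others to zero. The sketch below (Python/NumPy; the helper name ls_B0k and all parameter values are ours and purely illustrative) compares its Monte Carlo risk with the rate $\sigma^2 k\log(ed/k)/n$ of Corollary 4.15.

    import numpy as np

    # Sketch: constrained least squares over B_0(k) in the Gaussian sequence model,
    # i.e. keep the k largest coordinates of Y in absolute value.
    rng = np.random.default_rng(2)
    d, k, n, sigma = 500, 10, 100, 1.0
    theta_star = np.zeros(d)
    theta_star[:k] = 5 * sigma / np.sqrt(n)         # a k-sparse truth

    def ls_B0k(Y, k):
        theta = np.zeros_like(Y)
        idx = np.argsort(np.abs(Y))[-k:]            # indices of the k largest |Y_i|
        theta[idx] = Y[idx]
        return theta

    errs = []
    for _ in range(2000):
        Y = theta_star + sigma / np.sqrt(n) * rng.normal(size=d)
        errs.append(np.sum((ls_B0k(Y, k) - theta_star) ** 2))

    rate = sigma ** 2 * k / n * np.log(np.e * d / k)
    print(f"Monte Carlo risk: {np.mean(errs):.4f}")
    print(f"rate sigma^2 k log(ed/k) / n = {rate:.4f}")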

Lower bounds for estimating vectors in $\ell_1$ balls

Recall that in Maurey's argument, we approximated a vector $\theta$ such that $|\theta|_1 = R$ by a vector $\theta'$ such that $|\theta'|_0 = \frac R\sigma\sqrt{\frac{n}{\log d}}$. We can essentially do the same for the lower bound. Assume that $d \ge \sqrt n$ and let $\beta \in (0,1)$ be a parameter to be chosen later, and define $k$ to be the smallest integer such that
$$k \ge \frac{R}{\beta\sigma}\sqrt{\frac{n}{\log(ed/\sqrt n)}}.$$

Let $\omega_1, \dots, \omega_M$ be obtained from the sparse Varshamov-Gilbert Lemma 4.14 with this choice of $k$ and define
$$\theta_j = \frac Rk\,\omega_j.$$

Observe that |θj|1 = R for j = 1,...,M. We can check the conditions of Theorem 4.11: √ R2 R2 R log(ed/ n) (i) |θ − θ |2 = ρ(ω , ω ) ≥ ≥ 4R min , β2σ  . j k 2 k2 j k 2k 8 8n r √ 2R2 log(ed/ n) 2ασ2 (ii) |θ − θ |2 ≤ ≤ 4Rβσ ≤ log(M), j k 2 k n n for β small enough if d ≥ Ck for some constant C > 0 chosen large enough. Applying now Theorem 4.11 yields √ ˆ 2 R 2 2 log(ed/ n) 1 inf sup IPθ |θ − θ|2 ≥ R min , β σ ≥ − 2α . θˆ θ∈IRd 8 8n 2 It implies the following corollary. 4.6. Lower bounds for sparse estimation via χ2 divergence 109

d d Corollary 4.16. Recall that B1(R) ⊂ IR denotes the set vectors θ ∈ IR such 1/2+ε that |θ|1 ≤ R. Then there exist a constant C > 0 such that if d ≥ n , ε > 0, the minimax rate of estimation over B1(R) in the Gaussian sequence model is log d φ(B (k)) = min(R2, Rσ ) . 0 n Moreover, it is attained by the constrained least squares estimator θˆls if B1(R) log d ˆ R ≥ σ n and by the trivial estimator θ = 0 otherwise. Proof. To complete the proof of the statement, we need to study risk of the ∗ trivial estimator equal to zero for small R. Note that if |θ |1 ≤ R, we have ∗ 2 ∗ 2 ∗ 2 2 |0 − θ |2 = |θ |2 ≤ |θ |1 = R . ∗ 2 ∗ 2 Remark 4.17. Note that the inequality |θ |2 ≤ |θ |1 appears to be quite loose. Nevertheless, it is tight up to a multiplicative constant for the vectors of the R log d form θj = ωj k that are employed in the lower bound. Indeed, if R ≤ σ n , we have k ≤ 2/β R2 β |θ |2 = ≥ |θ |2 . j 2 k 2 j 1 4.6 LOWER BOUNDS FOR SPARSE ESTIMATION VIA χ2 DIVERGENCE In this section, we will show how to derive lower bounds by directly con- trolling the distance between a simple and a composite hypothesis, which is useful when investigating lower rates for decision problems instead of estima- tion problems. Define the χ2 divergence between two probability distributions IP, Q as Z  2  dIP 2  − 1 dQ if IP  Q, χ (IP, Q) = dQ  ∞ otherwise. When we compare the expression in the case where IP  Q, then we see that   both the KL divergence and the χ2 divergence can be written as R f dIP d dQ Q with f(x) = x log x and f(x) = (x − 1)2, respectively. Also note that both of these functions are convex in x and fulfill f(1) = 0, which by Jensen’s inequality shows us that they are non-negative and equal to 0 if and only if IP = Q. Firstly, let us derive another useful expression for the χ2 divergence. By expanding the square, we have Z 2 Z Z Z  2 2 (dIP) dIP χ (IP, Q) = − 2 dIP+ dQ = dQ − 1 dQ dQ Secondly, we note that we can bound the KL divergence from above by the χ2 divergence, via Jensen’s inequality, as Z   Z 2 dIP (dIP) 2 2 KL(IP, Q) = log dIP ≤ log = log(1 + χ (IP, Q)) ≤ χ (IP, Q), dQ dQ (4.8) 4.6. Lower bounds for sparse estimation via χ2 divergence 110 where we used the concavity inequality log(1 + x) ≤ x for x > −1. Note that this estimate is a priori very rough and that we are transitioning from something at log scale to something on a regular scale. However, since we are mostly interested in showing that these distances are bounded by a small constant after adjusting the parameters of our models appropriately, it will suffice for our purposes. In particular, we can combine (4.8) with the Neyman-Pearson Lemma 4.3 and Pinsker’s inequality, Lemma 4.8, to get 1 √ inf IP0(ψ = 1) + IP1(ψ = 0) ≥ (1 − ε) ψ 2 2 for distinguishing between two hypothesis with a decision rule ψ if χ (IP0, IP1) ≤ ε. In the following, we want to apply this observation to derive lower rates for distinguishing between

H0 : Y ∼ IP0 H1 : Y ∼ IPu, for some u ∈ B0(k), where IPu denotes the probability distribution of a Gaussian sequence model σ Y = u + n ξ, with ξ ∼ N (0,Id). Theorem 4.18. Consider the detection problem ∗ ∗ H0 : θ = 0,Hv : θ = µv, for some v ∈ B0(k), kvk2 = 1,

∗ √σ with data given by Y = θ + n ξ, ξ ∼ N (0,Id). There exist ε > 0 and cε > 0 such that if r s σ k  dε µ ≤ log 1 + 2 n k2 then, 1 √ inf{IP0(ψ = 1) ∨ max IPv(ψ = 0)} ≥ (1 − ε). ψ v∈B0(k) 2 kvk2=1 Proof. We introduce the mixture hypothesis 1 X IP¯ = IP , d S k S⊆[d] |S|=k √ √σ with IPS being the distribution of Y = µ1IS/ k + n ξ and give lower bounds on distinguishing H0 : Y ∼ IP0,H1 : Y ∼ IP¯ , by computing their χ2 distance, Z  2 2 dp¯ χ (IP¯ , IP0) = dIP0 − 1 dIP0 Z   1 X dIPS dIPT = dIP − 1. d 0 dIP0 dIP0 k S,T 4.6. Lower bounds for sparse estimation via χ2 divergence 111

The first step is to compute dIPS/(dIP0). Writing out the corresponding Gaus- sian densities yields √ 1 n 2 √ d exp(− 2 kX − µ1IS/ kk ) dIPS 2π 2σ 2 (X) = 1 n 2 dIP0 √ exp(− 2 kX − 0k ) 2πd 2σ 2    n 2µ 2 = exp √ hX, 1ISi − µ 2σ2 k

For convenience, we introduce the notation X = √σ Z for a standard normal √ √ n Z ∼ N (0,Id), ν = µ n/(σ k), and write ZS for the restriction to the coor- dinates in the set S. Multiplying two of these densities and integrating with respect to IP0 in turn then reduces to computing the mgf of a Gaussian and gives     n σ 2 IEX∼IP exp 2µ√ (ZS + ZT ) − 2µ 0 2σ2 kn 2  = IEZ∼N (0,Id) exp ν(ZS + ZT ) − ν k . P P Decomposing ZS + ZT = 2 i∈S∩T Zi + i∈S∆T Zi and noting that |S∆T | ≤ 2k, we see that

   2  dIPS dIPT 4 2 ν 2 IEIP0 = exp ν |S ∩ T + |S∆T | − ν k dIP0 dIP0 2 2 ≤ exp 2ν2|S ∩ T | .

Now, we need to take the expectation over two uniform draws of support sets S and T , which reduces via conditioning and exploiting the independence of the two draws to

2 2 χ (IP¯ , IP0) ≤ IES,T [exp(2ν |S ∩ T |] − 1 2 = IET IES[[exp(2ν |S ∩ T | | T ]] − 1 2 = IES[exp(2ν |S ∩ [k]|)] − 1.

Similar to the proof of the sparse Varshamov-Gilbert bound, Lemma 4.14, the distribution of |S ∩ [k]| is stochastically dominated by a binomial distribution Bin(k, k/d), so that k χ2(IP¯ , IP ) ≤ IE[exp(2ν2Bin(k, )] − 1 0 d k = IE[exp(2ν2Ber(k/d))] − 1   k 2 k k = e2ν + 1 − − 1 d d  k k 2 = 1 + (e2ν − 1) − 1 d 4.7. Lower bounds for estimating the `1 norm via moment matching 112

k 2 1 d Since (1 + x) − 1 ≈ kx for x = o(1/k), we can take ν = 2 log(1 + k2 ε) to get a bound of the order ε. Plugging this back into the definition of ν yields r s σ k  dε µ = log 1 + . 2 n k2 The rate for detection for arbitrary v now follows from

IP0(ψ = 1) ∨ max IPv(ψ = 0) ≥ IP0(ψ = 1) ∨ IP(¯ ψ = 0) v∈B0(k) kvk2=1 1 √ ≥ (1 − ε). 2 Remark 4.19. If instead of the χ2 divergence, we try to compare the KL di- vergence between IP0 and IP,¯ we are tempted to exploit its convexity properties to get 1 X KL(IP¯ , IP ) ≤ KL(IP , IP ) = KL(IP , IP ), 0 d S 0 [k] 0 k S which is just the distance between two Gaussian distributions. In turn, we do not see any increase in complexity brought about by increasing the number of competing hypothesis as we did in Section 4.5. By using the convexity estimate, we have lost the granularity of controlling how much the different probability distributions overlap. This is where the χ2 divergence helped. Remark 4.20. This rate for detection implies lower rates for estimation of ∗ ∗ both θ and kθ k2: If given an estimator Tb(X) for T (θ) = kθk2 that achieves an error less than µ/2 for some X, then we would get a correct decision rule for those X by setting  µ 0, if Tb(X) < ,  2 ψ(X) = µ 1, if Tb(X) ≥ . 2 Conversely, the lower bound above means that such an error rate can not be achieved with vanishing error probability for µ smaller than the critical value. Moreover, by the triangle inequality, we can the same lower bound also for estimaton of θ∗ itself. We note that the rate that we obtained is slightly worse in the scaling of the log factor (d/k2 instead of d/k) than the one obtained in Corollary 4.15. This is because it applies to estimating the `2 norm as well, which is a strictly easier problem and for which estimators can be constructed that achieve the k 2 faster rate of n log(d/k ).
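For Gaussian location families, both divergences in (4.8) have closed forms, which makes the chain $\mathrm{KL} \le \log(1+\chi^2) \le \chi^2$ easy to check numerically. The sketch below (Python/NumPy, illustrative parameters) also verifies by Monte Carlo the identity $\chi^2(\mathrm{IP}, \mathrm Q) = \mathrm{IE}_{\mathrm Q}\big[(d\mathrm{IP}/d\mathrm Q)^2\big] - 1$ derived above.

    import numpy as np

    # Sketch: KL vs chi^2 for P = N(mu1, s^2 I_d), Q = N(mu0, s^2 I_d).
    # Closed forms: KL(P,Q) = |mu1-mu0|^2/(2 s^2), chi^2(P,Q) = exp(|mu1-mu0|^2/s^2) - 1.
    rng = np.random.default_rng(3)
    d, s = 5, 1.0
    mu0, mu1 = np.zeros(d), 0.3 * np.ones(d)
    delta2 = np.sum((mu1 - mu0) ** 2)

    kl = delta2 / (2 * s ** 2)
    chi2 = np.exp(delta2 / s ** 2) - 1

    # Monte Carlo check of chi^2 = E_Q[(dP/dQ)^2] - 1.
    X = mu0 + s * rng.normal(size=(200_000, d))
    log_ratio = (X - mu0) @ (mu1 - mu0) / s ** 2 - delta2 / (2 * s ** 2)
    chi2_mc = np.mean(np.exp(2 * log_ratio)) - 1

    print(f"KL = {kl:.4f} <= log(1 + chi2) = {np.log1p(chi2):.4f} <= chi2 = {chi2:.4f}")
    print(f"Monte Carlo chi^2 = {chi2_mc:.4f}")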

4.7 LOWER BOUNDS FOR ESTIMATING THE `1 NORM VIA MOMENT MATCHING

So far, we have seen many examples of rates that scale with $n^{-1}$ for the squared error. In this section, we are going to see an example of a functional that can only be estimated at a much slower rate of $1/\log n$, following [CL11]. Consider the normalized $\ell_1$ norm,

$$T(\theta) = \frac 1n\sum_{i=1}^n |\theta_i|, \tag{4.9}$$
as the target functional to be estimated, and measurements

Y ∼ N(θ, In). (4.10)

We are going to show lower bounds for the estimation when $\theta$ is restricted to an $\ell_\infty$ ball, $\Theta_n(M) = \{\theta\in\mathrm{IR}^n : |\theta_i| \le M\}$, and for general $\theta \in \mathrm{IR}^n$.

A constrained risk inequality

We start with a general principle that is in the same spirit as the Neyman-Pearson lemma (Lemma 4.3), but deals with expectations rather than probabilities and uses the $\chi^2$ distance to estimate the closeness of distributions. Moreover, contrary to what we have seen so far, it allows us to compare two mixture distributions over candidate hypotheses, also known as priors.

In the following, we write $X$ for a random observation coming from a distribution indexed by $\theta \in \Theta$, $\hat T = \hat T(X)$ for an estimator of a function $T(\theta)$, and write $B(\theta) = \mathrm{IE}_\theta \hat T - T(\theta)$ for the bias of $\hat T$. Let $\mu_0$ and $\mu_1$ be two prior distributions over $\Theta$ and denote by $m_i$, $v_i^2$ the mean and variance of $T(\theta)$ under the prior $\mu_i$, $i \in \{0,1\}$:
$$m_i = \int T(\theta)\,d\mu_i(\theta), \qquad v_i^2 = \int \big(T(\theta) - m_i\big)^2 d\mu_i(\theta).$$

Moreover, write $F_i$ for the marginal distribution of $X$ under the prior $\mu_i$, and denote by $f_i$ its density with respect to the average of the two marginal distributions. With this, we write the $\chi^2$ distance between $f_0$ and $f_1$ as
$$I^2 = \chi^2(\mathrm{IP}_{f_0}, \mathrm{IP}_{f_1}) = \mathrm{IE}_{f_0}\Big(\frac{f_1(X)}{f_0(X)} - 1\Big)^2 = \mathrm{IE}_{f_0}\Big(\frac{f_1(X)}{f_0(X)}\Big)^2 - 1.$$

Theorem 4.21. (i) If $\int \mathrm{IE}_\theta\big(\hat T(X) - T(\theta)\big)^2 d\mu_0(\theta) \le \varepsilon^2$, then
$$\Big|\int B(\theta)\,d\mu_1(\theta) - \int B(\theta)\,d\mu_0(\theta)\Big| \ge |m_1 - m_0| - (\varepsilon + v_0)\,I. \tag{4.11}$$

(ii) If |m1 − m0| > v0I and 0 ≤ λ ≤ 1,

$$\int \mathrm{IE}_\theta\big(\hat T(X) - T(\theta)\big)^2\, d\big(\lambda\mu_0 + (1-\lambda)\mu_1\big)(\theta) \ge \frac{\lambda(1-\lambda)\big(|m_1 - m_0| - v_0 I\big)^2}{\lambda + (1-\lambda)(I+1)^2};\tag{4.12}$$
in particular,
$$\max_{i\in\{0,1\}} \int \mathrm{IE}_\theta\big(\hat T(X) - T(\theta)\big)^2\, d\mu_i(\theta) \ge \frac{\big(|m_1 - m_0| - v_0 I\big)^2}{(I+2)^2}, \tag{4.13}$$
and
$$\sup_{\theta\in\Theta} \mathrm{IE}_\theta\big(\hat T(X) - T(\theta)\big)^2 \ge \frac{\big(|m_1 - m_0| - v_0 I\big)^2}{(I+2)^2}. \tag{4.14}$$

Proof. Without loss of generality, assume m1 ≥ m0. We start by considering the term    f1(X) − f0(X) IEf0 (Tb(X) − m0) f0(X)    f1(X) − f0(X) = IE Tb(X) f0(X) h i h i = IEf1 m1 + Tb(X) − m1 − IEf0 m0 + Tb(X) − m0 Z  Z  = m1 + B(θ)dµ1(θ) − m0 + B(θ)dµ0(θ) .

Moreover, Z 2 2 IEf0 (Tb(X) − m0) = IEθ(Tb(X) − m0) dµ0(θ) Z 2 = IEθ(Tb(X) − T (θ) + T (θ) − m0) dµ0(θ) Z 2 = IEθ(Tb(X) − T (θ)) dµ0(θ) Z + 2 B(θ)(T (θ) − m0)dµ0(θ) Z 2 + (T (θ) − m0) dµ0(θ)

2 2 2 ≤ ε + 2εv0 + v0 = (ε + v0) . Therefore, by Cauchy-Schwarz,    f1(X) − f0(X) IEf0 (Tb(X) − m0) ≤ (ε + v0)I, f0(X) and hence Z  Z  m1 + B(θ)dµ1(θ) − m0 + B(θ)dµ0(θ) ≤ (ε + v0)I, whence Z Z B(θ)dµ0(θ) − B(θ)dµ1(θ) ≥ m1 − m0 − (ε + v0)I, 4.7. Lower bounds for estimating the `1 norm via moment matching 115 which gives us (4.11). To estimate the risk under a mixture of µ0 and µ1, we note that from (4.11), we have an estimate for the risk under µ1 given an upper bound on the risk under µ0, which means we can reduce this problem to estimating a quadratic from below. Consider

J(x) = λx2 + (1 − λ)(a − bx)2,

ab(1−λ) with 0 < λ < 1, a > 0 and b > 0. J is minimized by x = xmin = λ+b2(1−λ) a2λ(1−λ) with a − bxmin > 0 and J(xmin) = λ+b2(1−λ) . Hence, if we add a cut-off at zero in the second term, we do not change the minimal value and obtain that

2 2 λx + (1 − λ)((a − bx)+) has the same minimum. 2 R 2 From (4.11) and Jensen’s inequality, setting ε = B(θ) dµ0(θ), we get

Z Z 2 2 B(θ) dµ1(θ) ≥ B(θ)dµ1(θ)

2 ≥ ((m1 − m0 − (ε + v0)I − ε)+) .

Combining this with the estimate for the quadratic above yields Z 2 IEθ(Tb(X) − T (θ)) d(λµ0 + (1 − λ)µ1)(θ)

2 2 ≥ λε + (1 − λ)((m1 − m0 − (ε + v0)I − ε)+) λ(1 − λ)(|m − m | − v I)2 ≥ 1 0 0 , λ + (1 − λ)(I + 1)2 which is (4.12). Finally, since the minimax risk is bigger than any mixture risk, we can bound it from below by the maximum value of this bound, which is obtained at λ = (I + 1)/(I + 2) to get (4.13), and by the same argument (4.14).

Unfavorable priors via polynomial approximation

In order to construct unfavorable priors for the $\ell_1$ norm (4.9), we resort to polynomial approximation theory. For a continuous function $f$ on $[-1,1]$, denote the maximal approximation error by polynomials of degree $k$ (whose set is written $\mathcal P_k$) by
$$\delta_k(f) = \inf_{G\in\mathcal P_k}\max_{x\in[-1,1]} |f(x) - G(x)|,$$
and denote by $G_k^*$ the polynomial attaining this error. For $f(t) = |t|$, it is known [Riv90] that
$$\beta_* = \lim_{k\to\infty} 2k\,\delta_{2k}(f) \in (0,\infty),$$
which is a very slow rate of convergence of the approximation, compared to smooth functions for which we would expect geometric convergence. We can now construct our priors using some abstract measure-theoretic results.

Lemma 4.22. For an integer k > 0, there are two probability measures ν0 and ν1 on [−1, 1] that fulfill the following:

1. ν0 and ν1 are symmetric about 0;

R l R l 2. t dν1(t) = t dν0(t) for all l ∈ {0, 1, . . . , k}; R R 3. |t|dν1(t) − |t|dν0(t) = 2δk. Proof. The idea is to construct the measures via the Hahn-Banach theorem and the Riesz representation theorem. First, consider f(t) = |t| as an element of the space C([−1, 1]) equipped with the supremum norm and define Pk as the space of polynomials of degree up to k, and Fk := span(Pk ∪ {f}). On Fk, define the functional T (cf + pk) = cδk, which is well-defined because f is not a polynomial. Claim: kT k = sup{T (g): g ∈ Fk, kgk∞ ≤ 1} = 1. ∗ kT k ≥ 1: Let Gk be the best-approximating polynomial to f in Pk. Then, ∗ ∗ ∗ kf − Gkk∞ = δk, k(f − Gk)/δkk∞ = 1, and T ((f − Gk)/δk) = 1. kT k ≤ 1: Suppose g = cf + pk with pk ∈ Pk, kgk∞ = 1 and T (g) > 1. Then c > 1/δk and 1 kf − (−p /c)k = < δ , k ∞ c k contradicting the definition of δk. Now, by the Hahn-Banach theorem, there is a norm-preserving extension of T to C([−1, 1]) which we again denote by T . By the Riesz representation theorem, there is a Borel signed measure τ with variation equal to 1 such that

$$T(g) = \int_{-1}^1 g(t)\,d\tau(t) \qquad \text{for all } g \in C([-1,1]).$$

The Hahn-Jordan decomposition gives two positive measures τ+ and τ− such that τ = τ+ − τ− and

$$\int_{-1}^1 |t|\,d(\tau_+ - \tau_-)(t) = \delta_k \qquad \text{and} \qquad \int_{-1}^1 t^l\,d\tau_+(t) = \int_{-1}^1 t^l\,d\tau_-(t), \quad l \in \{0, \dots, k\}.$$
Since $f$ is a symmetric function, we can symmetrize these measures and hence can assume that they are symmetric about 0. Finally, to get probability measures with the desired properties, set $\nu_1 = 2\tau_+$ and $\nu_0 = 2\tau_-$.

Minimax lower bounds

Theorem 4.23. We have the following bounds on the minimax rate for $T(\theta) = \frac 1n\sum_{i=1}^n |\theta_i|$. For $\theta \in \Theta_n(M) = \{\theta\in\mathrm{IR}^n : |\theta_i| \le M\}$,
$$\inf_{\hat T}\sup_{\theta\in\Theta_n(M)} \mathrm{IE}_\theta\big(\hat T(X) - T(\theta)\big)^2 \ge \beta_*^2 M^2 \Big(\frac{\log\log n}{\log n}\Big)^2 \big(1 + o(1)\big). \tag{4.15}$$

For θ ∈ IRn,

$$\inf_{\hat T}\sup_{\theta\in\mathrm{IR}^n} \mathrm{IE}_\theta\big(\hat T(X) - T(\theta)\big)^2 \ge \frac{\beta_*^2}{16 e^2 \log n}\,\big(1 + o(1)\big). \tag{4.16}$$
The last remaining ingredient for the proof is the family of Hermite polynomials, orthogonal polynomials with respect to the Gaussian density

$$\varphi(y) = \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}.$$
For our purposes, it is enough to define them by the derivatives of the density,
$$\frac{d^k}{dy^k}\varphi(y) = (-1)^k H_k(y)\varphi(y),$$
and to observe that they are orthogonal:
$$\int H_k^2(y)\varphi(y)\,dy = k!, \qquad \int H_k(y)H_j(y)\varphi(y)\,dy = 0 \quad \text{for } k \neq j.$$
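NumPy's hermite_e module implements exactly these probabilists' Hermite polynomials, so the orthogonality relations can be checked numerically; the sketch below (quadrature degree and range of $k$ are arbitrary) uses Gauss quadrature for the weight $e^{-y^2/2}$ and rescales by $\sqrt{2\pi}$ to integrate against $\varphi$.

    import math
    import numpy as np
    from numpy.polynomial import hermite_e as He

    # Sketch: check int H_k^2 phi = k! and int H_k H_j phi = 0 for k != j.
    y, w = He.hermegauss(60)                 # nodes/weights for weight exp(-y^2/2)
    w = w / np.sqrt(2 * np.pi)               # now integrates against phi

    def H(k, y):
        c = np.zeros(k + 1); c[k] = 1.0
        return He.hermeval(y, c)             # evaluates H_k at y

    for k in range(5):
        norm2 = np.sum(w * H(k, y) ** 2)
        cross = np.sum(w * H(k, y) * H(k + 1, y))
        print(f"k={k}: int H_k^2 phi = {norm2:.6f} (k! = {math.factorial(k)}), "
              f"cross term = {cross:.2e}")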

Proof. We want to use Theorem 4.21. To construct the prior measures, we scale the measures from Lemma 4.22 appropriately: Let kn be an even integer that is to be determined, and ν0, ν1 the two measures given by Lemma 4.22. Define −1 g(x) = Mx and define the measures µi by dilating νi, µi(A) = νi(g (A)) for every Borel set A ⊆ [−M,M] and i ∈ {0, 1}. Hence,

1. µ0 and µ1 are symmetric about 0;

R l R l 2. t dµ1(t) = t dµ0(t) for all l ∈ {0, 1, . . . , kn}; R R 3. |t|dµ1(t) − |t|dµ0(t) = 2Mδkn . n Qn To get priors for n i.i.d. samples, consider the product priors µi = j=1 µi. With this, we have

IE n T (θ) − IE n T (θ) = IE |θ | − IE |θ | = 2Mδ , µ1 µ0 µ1 1 µ0 1 kn and 2 2 2 M IEµn (T (θ) − IEµn T (θ)) = IEµn (T (θ) − IEµn T (θ)) ≤ , 0 0 0 0 n since each θi ∈ [−M,M]. 4.7. Lower bounds for estimating the `1 norm via moment matching 118

It remains to control the χ2 distance between the two marginal distribu- tions. To this end, we will expand the Gaussian density in terms of Hermite R polynomials. First, set fi(y) = ϕ(y − t)dµi(t). Since g(x) = exp(−x) is convex and µ0 is symmetric, 1  Z (y − t)2  f0(y) ≥ √ exp − dµ0(t) 2π 2  1 Z  = ϕ(y) exp − M 2 t2dν (t) 2 0  1  ≥ ϕ(y) exp − M 2 . 2 Expand ϕ as ∞ X tk ϕ(y − t) = H (y)ϕ(y) . k k! k=0 Then, Z 2 ∞ (f1(y) − f0(y)) 2 X 1 dy ≤ eM /2 M 2k. f0(y) k! k=kn+1 The χ2 distance for n i.i.d. observations can now be easily bounded by Z Qn Qn 2 2 ( i=1 f1(yi) − i=1 f0(yi)) In = Qn dy1··· dyn i=1 f0(yi) Z Qn 2 ( i=1 f1(yi)) = Qn dy1··· dyn − 1 i=1 f0(yi) n Z 2 ! Y (f1(yi)) = dy − 1 f (y ) i i=1 0 i ∞ !n 2 X 1 ≤ 1 + eM /2 M 2k − 1 k! k=kn+1  n 2 1 ≤ 1 + e3M /2 M 2kn − 1 kn! n  2 kn ! 2 eM ≤ 1 + e3M /2 − 1, (4.17) kn where the last step used a Stirling type estimate, k! > (k/e)k. Note that if x = o(1/n), then (1 + x)n − 1 = o(nx) by Taylor’s formula, so we can choose kn ≥ log n/(log log n) to guarantee that for n large enough, In → 0. With Theorem 4.21, we have √ 2 2 (2Mδkn − (M/ n)In) inf sup IE(Tb − T (θ)) ≥ 2 Tb θ∈Θn(M) (In + 2) log log n2 = β2M 2 (1 + o(1)). ∗ log n 4.7. Lower bounds for estimating the `1 norm via moment matching 119

To prove the lower bound over $\mathrm{IR}^n$, take $M = \sqrt{\log n}$ and $k_n$ to be the smallest integer such that $k_n \ge 2e\log n$, and plug this into (4.17), yielding
$$I_n^2 \le \Big(1 + n^{3/2}\Big(\frac{e\log n}{2e\log n}\Big)^{k_n}\Big)^n - 1,$$
which goes to zero. Hence, we can conclude just as before.

Note that [CL11] also complement these lower bounds with upper bounds that are based on polynomial approximation, completing the picture of estimating $T(\theta)$.

4.8 PROBLEM SET

Problem 4.1. (a) Prove the statement of Example 4.7.

(b) Let $P_\theta$ denote the distribution of $X \sim \mathrm{Ber}(\theta)$. Show that
$$\mathrm{KL}(P_\theta, P_{\theta'}) \ge C(\theta - \theta')^2.$$

Problem 4.2. Let $\mathrm{IP}_0$ and $\mathrm{IP}_1$ be two probability measures. Prove that for any test $\psi$, it holds
$$\max_{j=0,1} \mathrm{IP}_j(\psi \neq j) \ge \frac14\, e^{-\mathrm{KL}(\mathrm{IP}_0, \mathrm{IP}_1)}.$$

Problem 4.3. For any $R > 0$, $\theta \in \mathrm{IR}^d$, denote by $B_2(\theta, R)$ the (Euclidean) ball of radius $R$ centered at $\theta$. For any $\varepsilon > 0$ let $N = N(\varepsilon)$ be the largest integer such that there exist $\theta_1, \dots, \theta_N \in B_2(0,1)$ for which
$$|\theta_i - \theta_j|_2 \ge 2\varepsilon$$
for all $i \neq j$. We call the set $\{\theta_1, \dots, \theta_N\}$ an $\varepsilon$-packing of $B_2(0,1)$.

(a) Show that there exists a constant $C > 0$ such that $N \le C^d/\varepsilon^d$.

(b) Show that for any $x \in B_2(0,1)$, there exists $i \in \{1, \dots, N\}$ such that $|x - \theta_i|_2 \le 2\varepsilon$.

(c) Use (b) to conclude that there exists a constant $C' > 0$ such that $N \ge (C')^d/\varepsilon^d$.

Problem 4.4. Show that the rate $\phi = \sigma^2 d/n$ is the minimax rate of estimation over:

(a) The Euclidean ball of $\mathrm{IR}^d$ with radius $\sqrt{\sigma^2 d/n}$.

(b) The unit $\ell_\infty$ ball of $\mathrm{IR}^d$ defined by
$$B_\infty(1) = \{\theta \in \mathrm{IR}^d : \max_j |\theta_j| \le 1\},$$
as long as $\sigma^2 \le n$.

(c) The set of nonnegative vectors of $\mathrm{IR}^d$.

(d) The discrete hypercube $\frac{\sigma}{16\sqrt n}\{0,1\}^d$.

Problem 4.5. Fix $\beta \ge 5/3$, $Q > 0$ and prove that the minimax rate of estimation over $\Theta(\beta, Q)$ with the $\|\cdot\|_{L_2([0,1])}$-norm is given by $n^{-\frac{2\beta}{2\beta+1}}$.
[Hint: Consider functions of the form
$$f_j = \frac{C}{\sqrt n}\sum_{i=1}^N \omega_{ji}\varphi_i,$$
where $C$ is a constant, $\omega_j \in \{0,1\}^N$ for some appropriately chosen $N$ and $\{\varphi_j\}_{j\ge 1}$ is the trigonometric basis.]

Chapter 5

Matrix estimation

Over the past decade or so, matrices have entered the picture of high-dimensional statistics for several reasons. Perhaps the simplest explanation is that they are the most natural extension of vectors. While this is true, and we will see examples where the extension from vectors is straightforward, matrices have a much richer structure than vectors, allowing "interaction" between their rows and columns. In particular, while we have been describing simple vectors in terms of their sparsity, here we can measure the complexity of a matrix by its rank. This feature was successfully employed in a variety of applications ranging from multi-task learning to collaborative filtering. This last application was made popular by the Netflix prize in particular.

In this chapter, we study several statistical problems where the parameter of interest $\theta$ is a matrix rather than a vector. These problems include: multivariate regression, covariance matrix estimation and principal component analysis. Before getting to these topics, we begin by a quick reminder on matrices and linear algebra.

Matrices are much more complicated objects than vectors. In particular, while vectors can be identified with linear operators from $\mathrm{IR}^d$ to $\mathrm{IR}$, matrices can be identified with linear operators from $\mathrm{IR}^d$ to $\mathrm{IR}^n$ for $n \ge 1$. This seemingly simple fact gives rise to a profusion of notions and properties as illustrated by Bernstein's book [Ber09], which contains facts about matrices over more than a thousand pages. Fortunately, we will be needing only a small number of such properties, which can be found in the excellent book [GVL96], which has become a standard reference on matrices and numerical linear algebra.

5.1 BASIC FACTS ABOUT MATRICES

Singular value decomposition

Let $A = (a_{ij},\ 1\le i\le m,\ 1\le j\le n)$ be an $m\times n$ real matrix of rank $r \le \min(m,n)$. The Singular Value Decomposition (SVD) of $A$ is given by
$$A = UDV^\top = \sum_{j=1}^r \lambda_j u_j v_j^\top,$$
where $D$ is an $r\times r$ diagonal matrix with positive diagonal entries $\{\lambda_1, \dots, \lambda_r\}$, $U$ is a matrix with orthonormal columns $u_1, \dots, u_r \in \mathrm{IR}^m$, and $V$ is a matrix with orthonormal columns $v_1, \dots, v_r \in \mathrm{IR}^n$. Moreover, it holds that
$$AA^\top u_j = \lambda_j^2 u_j \qquad \text{and} \qquad A^\top A v_j = \lambda_j^2 v_j$$
for $j = 1, \dots, r$. The values $\lambda_j > 0$ are called singular values of $A$ and are uniquely defined. If $r < \min(n,m)$, then the singular values of $A$ are given by $\lambda = (\lambda_1, \dots, \lambda_r, 0, \dots, 0)^\top \in \mathrm{IR}^{\min(n,m)}$, where there are $\min(n,m) - r$ zeros. This way, the vector $\lambda$ of singular values of an $n\times m$ matrix is a vector in $\mathrm{IR}^{\min(n,m)}$.

In particular, if $A$ is an $n\times n$ symmetric positive semidefinite (PSD) matrix, i.e., $A^\top = A$ and $u^\top A u \ge 0$ for all $u\in\mathrm{IR}^n$, then the singular values of $A$ are equal to its eigenvalues.

The largest singular value of $A$, denoted by $\lambda_{\max}(A)$, also satisfies the following variational formulation:
$$\lambda_{\max}(A) = \max_{x\in\mathrm{IR}^n} \frac{|Ax|_2}{|x|_2} = \max_{\substack{x\in\mathrm{IR}^n\\ y\in\mathrm{IR}^m}} \frac{y^\top A x}{|y|_2\,|x|_2} = \max_{\substack{x\in\mathcal S^{n-1}\\ y\in\mathcal S^{m-1}}} y^\top A x.$$
In the case of an $n\times n$ PSD matrix $A$, we have
$$\lambda_{\max}(A) = \max_{x\in\mathcal S^{n-1}} x^\top A x.$$
In these notes, we always assume that eigenvectors and singular vectors have unit norm.
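These facts are easy to verify numerically. The following sketch (Python/NumPy; dimensions are arbitrary) builds a random rank-$r$ matrix and checks the eigenvalue relations for $AA^\top$ and $A^\top A$ as well as the variational characterization of $\lambda_{\max}(A)$.

    import numpy as np

    # Sketch: numerical check of the SVD facts above on a random rank-r matrix.
    rng = np.random.default_rng(4)
    m, n, r = 7, 5, 3
    A = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))   # rank r

    U, lam, Vt = np.linalg.svd(A)
    U, V, lam = U[:, :r], Vt[:r].T, lam[:r]                 # compact SVD

    # A A^T u_j = lam_j^2 u_j and A^T A v_j = lam_j^2 v_j
    print(np.allclose(A @ A.T @ U, U * lam ** 2))
    print(np.allclose(A.T @ A @ V, V * lam ** 2))

    # variational characterization of the largest singular value
    x = rng.normal(size=n)
    print(np.linalg.norm(A @ x) / np.linalg.norm(x) <= lam[0] + 1e-10)
    print(np.isclose(np.linalg.norm(A @ V[:, 0]), lam[0]))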

Norms and inner product

Let A = {aij} and B = {bij} be two real matrices. Their size will be implicit in the following notation.

Vector norms The simplest way to treat a matrix is to deal with it as if it were a vector. In particular, we can extend `q norms to matrices:

$$|A|_q = \Big(\sum_{ij} |a_{ij}|^q\Big)^{1/q}, \qquad q > 0.$$

The cases $q \in \{0, \infty\}$ can also be extended to matrices:
$$|A|_0 = \sum_{ij} 1\mathrm{I}(a_{ij} \neq 0), \qquad |A|_\infty = \max_{ij} |a_{ij}|.$$

The case $q = 2$ plays a particular role for matrices: $|A|_2$ is called the Frobenius norm of $A$ and is often denoted by $\|A\|_F$. It is also the Hilbert-Schmidt norm associated to the inner product $\langle A, B\rangle = \mathrm{Tr}(A^\top B) = \mathrm{Tr}(B^\top A)$.

Spectral norms

Let λ = (λ1, . . . , λr, 0,..., 0) be the singular values of a matrix A. We can define spectral norms on A as vector norms on the vector λ. In particular, for any q ∈ [1, ∞], kAkq = |λ|q , is called Schatten q-norm of A. Here again, special cases have special names:

• q = 2: kAk2 = kAkF is the Frobenius norm defined above.

• q = 1: kAk1 = kAk∗ is called the Nuclear norm (or trace norm) of A.

• q = ∞: kAk∞ = λmax (A) = kAkop is called the operator norm (or spectral norm) of A. We are going to employ these norms to assess the proximity to our matrix of interest. While the interpretation of vector norms is clear by extension from the vector case, the meaning of “kA−Bkop is small” is not as transparent. The following subsection provides some inequalities (without proofs) that allow a better reading.

Useful matrix inequalities

Let A and B be two m × n matrices with singular values λ1(A) ≥ λ2(A) ... ≥ λmin(m,n)(A) and λ1(B) ≥ ... ≥ λmin(m,n)(B) respectively. Then the following inequalities hold:

$$\max_k \big|\lambda_k(A) - \lambda_k(B)\big| \le \|A - B\|_{\mathrm{op}}, \qquad \text{Weyl (1912)}$$
$$\sum_k \big(\lambda_k(A) - \lambda_k(B)\big)^2 \le \|A - B\|_F^2, \qquad \text{Hoffman-Wielandt (1953)}$$
$$\langle A, B\rangle \le \|A\|_p\,\|B\|_q, \quad \frac 1p + \frac 1q = 1,\ p, q \in [1,\infty], \qquad \text{H\"older}$$

The singular value decomposition is associated to the following useful lemma, known as the Eckart-Young (or Eckart-Young-Mirsky) Theorem. It states that for any given $k$, the closest matrix of rank $k$ to a given matrix $A$ in Frobenius norm is given by its truncated SVD.

Lemma 5.1. Let $A$ be a rank-$r$ matrix with singular value decomposition
$$A = \sum_{i=1}^r \lambda_i u_i v_i^\top,$$
where $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_r > 0$ are the ordered singular values of $A$. For any $k < r$, let $A_k$ be the truncated singular value decomposition of $A$ given by
$$A_k = \sum_{i=1}^k \lambda_i u_i v_i^\top.$$
Then for any matrix $B$ such that $\mathrm{rank}(B) \le k$, it holds
$$\|A - A_k\|_F \le \|A - B\|_F.$$
Moreover,
$$\|A - A_k\|_F^2 = \sum_{j=k+1}^r \lambda_j^2.$$

Proof. Note that the last equality of the lemma is obvious since
$$A - A_k = \sum_{j=k+1}^r \lambda_j u_j v_j^\top$$
and the matrices in the sum above are orthogonal to each other. Thus, it is sufficient to prove that for any matrix $B$ such that $\mathrm{rank}(B) \le k$, it holds
$$\|A - B\|_F^2 \ge \sum_{j=k+1}^r \lambda_j^2.$$

To that end, denote by $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_k \ge 0$ the ordered singular values of $B$ (some may be equal to zero if $\mathrm{rank}(B) < k$) and observe that it follows from the Hoffman-Wielandt inequality that
$$\|A - B\|_F^2 \ge \sum_{j=1}^r \big(\lambda_j - \sigma_j\big)^2 = \sum_{j=1}^k \big(\lambda_j - \sigma_j\big)^2 + \sum_{j=k+1}^r \lambda_j^2 \ge \sum_{j=k+1}^r \lambda_j^2.$$
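The content of Lemma 5.1 can be illustrated with a few lines of NumPy: the truncated SVD is compared against randomly drawn rank-$k$ competitors (dimensions and the number of competitors below are arbitrary).

    import numpy as np

    # Sketch: the truncated SVD is the best rank-k approximation in Frobenius norm.
    rng = np.random.default_rng(5)
    m, n, k = 8, 6, 2
    A = rng.normal(size=(m, n))

    U, lam, Vt = np.linalg.svd(A)
    A_k = U[:, :k] @ np.diag(lam[:k]) @ Vt[:k]              # truncated SVD

    err_trunc = np.linalg.norm(A - A_k, "fro") ** 2
    print(np.isclose(err_trunc, np.sum(lam[k:] ** 2)))      # = sum_{j>k} lambda_j^2

    for _ in range(1000):                                   # random rank-k competitors B
        B = rng.normal(size=(m, k)) @ rng.normal(size=(k, n))
        assert np.linalg.norm(A - B, "fro") ** 2 >= err_trunc - 1e-10
    print("no rank-k competitor beat the truncated SVD")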

5.2 MULTIVARIATE REGRESSION

In the traditional regression setup, the response variable Y is a scalar. In several applications, the goal is not to predict a variable but rather a vector Y ∈ IRT , still from a covariate X ∈ IRd. A standard example arises in genomics data where Y contains T physical measurements of a patient and X contains the expression levels for d genes. As a result the regression function in this case f(x) = IE[Y |X = x] is a function from IRd to IRT . Clearly, f can be estimated independently for each coordinate, using the tools that we have developed in the previous chapter. However, we will see that in several interesting scenar- ios, some structure is shared across coordinates and this information can be leveraged to yield better prediction bounds. 5.2. Multivariate regression 125

The model Throughout this section, we consider the following multivariate linear regres- sion model: ∗ Y = XΘ + E, (5.1) where Y ∈ IRn×T is the matrix of observed responses, X is the n × d observed design matrix (as before), Θ ∈ IRd×T is the matrix of unknown parameters and 2 E ∼ subGn×T (σ ) is the noise matrix. In this chapter, we will focus on the prediction task, which consists in estimating XΘ∗. As mentioned in the foreword of this chapter, we can view this problem as T (univariate) linear regression problems Y (j) = Xθ∗,(j) +ε(j), j = 1,...,T , where Y (j), θ∗,(j) and ε(j) are the jth column of Y, Θ∗ and E respectively. In particu- lar, an estimator for XΘ∗ can be obtained by concatenating the estimators for each of the T problems. This approach is the subject of Problem 5.1. The columns of Θ∗ correspond to T different regression tasks. Consider the following example as a motivation. Assume that the Subway headquarters want to evaluate the effect of d variables (promotions, day of the week, TV ads,. . . ) on their sales. To that end, they ask each of their T = 40, 000 restaurants to report their sales numbers for the past n = 200 days. As a result, franchise j returns to headquarters a vector Y(j) ∈ IRn. The d variables for each of the n days are already known to headquarters and are stored in a matrix X ∈ IRn×d. In this case, it may be reasonable to assume that the same subset of variables has an impact of the sales for each of the franchise, though the magnitude of this impact may differ from franchise to franchise. As a result, one may assume that the matrix Θ∗ has each of its T columns that is row sparse and that they share the same sparsity pattern, i.e., Θ∗ is of the form:  0 0 0 0   ••••     ••••    ∗  0 0 0 0  Θ =   ,  . . . .   . . . .     0 0 0 0  •••• where • indicates a potentially nonzero entry. It follows from the result of Problem 5.1 that if each task is performed individually, one may find an estimator Θˆ such that 1 kT log(ed) IEk Θˆ − Θ∗k2 σ2 , n X X F . n where k is the number of nonzero coordinates in each column of Θ∗. We remember that the term log(ed) corresponds to the additional price to pay for not knowing where the nonzero components are. However, in this case, when the number of tasks grows, this should become easier. This fact was proved in [LPTVDG11]. We will see that we can recover a similar phenomenon 5.2. Multivariate regression 126 when the number of tasks becomes large, though larger than in [LPTVDG11]. Indeed, rather than exploiting sparsity, observe that such a matrix Θ∗ has rank k. This is the kind of structure that we will be predominantly using in this chapter. Rather than assuming that the columns of Θ∗ share the same sparsity pattern, it may be more appropriate to assume that the matrix Θ∗ is low rank or approximately so. As a result, while the matrix may not be sparse at all, the fact that it is low rank still materializes the idea that some structure is shared across different tasks. In this more general setup, it is assumed that the columns of Θ∗ live in a lower dimensional space. Going back to the Subway example this amounts to assuming that while there are 40,000 franchises, there are only a few canonical profiles for these franchises and that all franchises are linear combinations of these profiles.

Sub-Gaussian matrix model

> Recall that under the assumption ORT for the design matrix, i.e., X X = nId, then the univariate regression model can be reduced to the sub-Gaussian se- quence model. Here we investigate the effect of this assumption on the multi- variate regression model (5.1). Observe that under assumption ORT, 1 1 > = Θ∗ + >E. nX Y nX Which can be written as an equation in IRd×T called the sub-Gaussian matrix model (sGMM): y = Θ∗ + F, (5.2) 1 > 1 > 2 where y = n X Y and F = n X E ∼ subGd×T (σ /n). Indeed, for any u ∈ Sd−1, v ∈ ST −1, it holds 1 1 u>F v = ( u)>Ev = √ w>Ev ∼ subG(σ2/n) , n X n

√ > 2 > X X 2 where w = Xu/ n has unit norm: |w|2 = u n u = |u|2 = 1. Akin to the sub-Gaussian sequence model, we have a direct observation model where we observe the parameter of interest with additive noise. This ∗ ∗ enables us to use thresholding methods for estimating Θ when |Θ |0 is small. However, this also follows from Problem 5.1. The reduction to the vector case in the sGMM is just as straightforward. The interesting analysis begins when Θ∗ is low-rank, which is equivalent to sparsity in its unknown eigenbasis. Consider the SVD of Θ∗: ∗ X > Θ = λjujvj . j ∗ and recall that kΘ k0 = |λ|0. Therefore, if we knew uj and vj, we could simply estimate the λjs by hard thresholding. It turns out that estimating these eigenvectors by the eigenvectors of y is sufficient. 5.2. Multivariate regression 127

Consider the SVD of the observed matrix $y$:
$$y = \sum_j \hat\lambda_j\, \hat u_j \hat v_j^\top.$$

Definition 5.2. The singular value thresholding estimator with threshold $2\tau \ge 0$ is defined by
$$\hat\Theta^{\mathrm{svt}} = \sum_j \hat\lambda_j\, 1\mathrm{I}\big(|\hat\lambda_j| > 2\tau\big)\, \hat u_j \hat v_j^\top.$$

Recall that the threshold for the hard thresholding estimator was chosen to be the level of the noise with high probability. The singular value thresholding estimator obeys the same rule, except that the norm in which the magnitude of the noise is measured is adapted to the matrix case. Specifically, the following lemma will allow us to control the operator norm of the matrix $F$.

Lemma 5.3. Let $A$ be a $d\times T$ random matrix such that $A \sim \mathrm{subG}_{d\times T}(\sigma^2)$. Then
$$\|A\|_{\mathrm{op}} \le 4\sigma\sqrt{\log(12)(d\vee T)} + 2\sigma\sqrt{2\log(1/\delta)}$$
with probability $1-\delta$.

Proof. This proof follows the same steps as Problem 1.4. Let N1 be a 1/4- d−1 T −1 net for S and N2 be a 1/4-net for S . It follows from Lemma 1.18 d T that we can always choose |N1| ≤ 12 and |N2| ≤ 12 . Moreover, for any u ∈ Sd−1, v ∈ ST −1, it holds 1 u>Av ≤ max x>Av + max u>Av x∈N1 4 u∈Sd−1 1 1 ≤ max max x>Ay + max max x>Av + max u>Av x∈N1 y∈N2 4 x∈N1 v∈ST −1 4 u∈Sd−1 1 ≤ max max x>Ay + max max u>Av x∈N1 y∈N2 2 u∈Sd−1 v∈ST −1 It yields > kAkop ≤ 2 max max x Ay x∈N1 y∈N2 So that for any t ≥ 0, by a union bound,

 X >  IP kAkop > t ≤ IP x Ay > t/2

x∈N1 y∈N2 2 > 2 Next, since A ∼ subGd×T (σ ), it holds that x Ay ∼ subG(σ ) for any x ∈ N1, y ∈ N2.Together with the above display, it yields t2 IPkAk > t ≤ 12d+T exp −  ≤ δ op 8σ2 for t ≥ 4σplog(12)(d ∨ T ) + 2σp2 log(1/δ) . 5.2. Multivariate regression 128

The following theorem holds.

Theorem 5.4. Consider the multivariate linear regression model (5.1) under the assumption ORT or, equivalently, the sub-Gaussian matrix model (5.2). Then, the singular value thresholding estimator $\hat\Theta^{\mathrm{svt}}$ with threshold
$$2\tau = 8\sigma\sqrt{\frac{\log(12)(d\vee T)}{n}} + 4\sigma\sqrt{\frac{2\log(1/\delta)}{n}} \tag{5.3}$$
satisfies
$$\frac 1n\|\mathbb X\hat\Theta^{\mathrm{svt}} - \mathbb X\Theta^*\|_F^2 = \|\hat\Theta^{\mathrm{svt}} - \Theta^*\|_F^2 \le 144\,\mathrm{rank}(\Theta^*)\,\tau^2 \lesssim \frac{\sigma^2\,\mathrm{rank}(\Theta^*)}{n}\big(d\vee T + \log(1/\delta)\big)$$
with probability $1-\delta$.

Proof. Assume without loss of generality that the singular values of Θ∗ and y ˆ ˆ are arranged in a non increasing order: λ1 ≥ λ2 ≥ ... and λ1 ≥ λ2 ≥ ... . ˆ Define the set S = {j : |λj| > 2τ}. Observe first that it follows from Lemma 5.3 that kF kop ≤ τ for τ chosen as in (5.3) on an event A such that IP(A) ≥ 1 − δ. The rest of the proof assumes that the event A occurred. ˆ Note that it follows from Weyl’s inequality that |λj − λj| ≤ kF kop ≤ τ. It c implies that S ⊂ {j : |λj| > τ} and S ⊂ {j : |λj| ≤ 3τ}. ¯ P > Next define the oracle Θ = j∈S λjujvj and note that

ˆ svt ∗ 2 ˆ svt ¯ 2 ¯ ∗ 2 kΘ − Θ kF ≤ 2kΘ − ΘkF + 2kΘ − Θ kF (5.4)

Using the H¨olderinequality, we control the first term as follows

ˆ svt ¯ 2 ˆ svt ¯ ˆ svt ¯ 2 ˆ svt ¯ 2 kΘ − ΘkF ≤ rank(Θ − Θ)kΘ − Θkop ≤ 2|S|kΘ − Θkop

Moreover,

svt svt ∗ ∗ kΘˆ − Θ¯ kop ≤ kΘˆ − ykop + ky − Θ kop + kΘ − Θ¯ kop ˆ ≤ max |λj| + τ + max |λj| ≤ 6τ . j∈Sc j∈Sc

Therefore, ˆ svt ¯ 2 2 X 2 kΘ − ΘkF ≤ 72|S|τ = 72 τ . j∈S The second term in (5.4) can be written as

¯ ∗ 2 X 2 kΘ − Θ kF = |λj| . j∈Sc 5.2. Multivariate regression 129

Plugging the above two displays in (5.4), we get

ˆ svt ∗ 2 X 2 X 2 kΘ − Θ kF ≤ 144 τ + |λj| j∈S j∈Sc

2 2 2 c 2 2 2 Since on S, τ = min(τ , |λj| ) and on S , |λj| ≤ 9 min(τ , |λj| ), it yields,

ˆ svt ∗ 2 X 2 2 kΘ − Θ kF ≤ 144 min(τ , |λj| ) j rank(Θ∗) X ≤ 144 τ 2 j=1 = 144 rank(Θ∗)τ 2 .
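A minimal NumPy sketch of the singular value thresholding estimator in the sub-Gaussian matrix model (5.2) is given below; the threshold is of the order prescribed by (5.3), but the constants, dimensions and helper name svt are illustrative choices of ours, not those of the analysis above.

    import numpy as np

    # Sketch: singular value thresholding in the model y = Theta* + F (5.2).
    rng = np.random.default_rng(6)
    d, T, r, n, sigma = 60, 50, 3, 200, 1.0

    U0 = np.linalg.qr(rng.normal(size=(d, r)))[0]
    V0 = np.linalg.qr(rng.normal(size=(T, r)))[0]
    Theta_star = 5 * U0 @ V0.T                              # rank-r parameter
    y = Theta_star + sigma / np.sqrt(n) * rng.normal(size=(d, T))

    def svt(y, thresh):
        U, lam, Vt = np.linalg.svd(y, full_matrices=False)
        keep = lam > thresh
        return (U[:, keep] * lam[keep]) @ Vt[keep]

    thresh = 4 * sigma * np.sqrt(max(d, T) / n)             # level ~ 2*tau, illustrative constant
    Theta_hat = svt(y, thresh)
    print("rank of estimate:", np.linalg.matrix_rank(Theta_hat))
    print("squared Frobenius error:", np.linalg.norm(Theta_hat - Theta_star, "fro") ** 2)
    print("benchmark r (d + T) sigma^2 / n:", r * (d + T) * sigma ** 2 / n)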

In the next subsection, we extend our analysis to the case where X does not necessarily satisfy the assumption ORT.

Penalization by rank The estimator from this section is the counterpart of the BIC estimator in the spectral domain. However, we will see that unlike BIC, it can be computed efficiently. Let Θˆ rk be any solution to the following minimization problem:

$$\min_{\Theta\in\mathrm{IR}^{d\times T}}\Big\{\frac 1n\|\mathbb Y - \mathbb X\Theta\|_F^2 + 2\tau^2\,\mathrm{rank}(\Theta)\Big\}.$$
This estimator is called estimator by rank penalization with regularization parameter $\tau^2$. It enjoys the following property.

Theorem 5.5. Consider the multivariate linear regression model (5.1). Then, the estimator by rank penalization $\hat\Theta^{\mathrm{rk}}$ with regularization parameter $\tau^2$, where $\tau$ is defined in (5.3), satisfies
$$\frac 1n\|\mathbb X\hat\Theta^{\mathrm{rk}} - \mathbb X\Theta^*\|_F^2 \le 8\,\mathrm{rank}(\Theta^*)\,\tau^2 \lesssim \frac{\sigma^2\,\mathrm{rank}(\Theta^*)}{n}\big(d\vee T + \log(1/\delta)\big)$$
with probability $1-\delta$.

Proof. We begin as usual by noting that

ˆ rk 2 2 ˆ rk ∗ 2 2 ∗ kY − XΘ kF + 2nτ rank(Θ ) ≤ kY − XΘ kF + 2nτ rank(Θ ) , which is equivalent to

ˆ rk ∗ 2 ˆ rk ∗ 2 ˆ rk 2 ∗ kXΘ − XΘ kF ≤ 2hE, XΘ − XΘ i − 2nτ rank(Θ ) + 2nτ rank(Θ ) . Next, by Young’s inequality, we have 1 2hE, Θˆ rk − Θ∗i = 2hE,Ui2 + k Θˆ rk − Θ∗k2 , X X 2 X X F 5.2. Multivariate regression 130 where Θˆ rk − Θ∗ U = X X . ˆ rk ∗ kXΘ − XΘ kF Write rk ∗ XΘˆ − XΘ = ΦN, where Φ is a n × r, r ≤ d matrix whose columns form orthonormal basis of the column span of X. The matrix Φ can come from the SVD of X for example: X = ΦΛΨ>. It yields ΦN U = kNkF and

ˆ rk ∗ 2 > 2 2 ˆ rk 2 ∗ kXΘ − XΘ kF ≤ 4hΦ E,N/kNkF i − 4nτ rank(Θ ) + 4nτ rank(Θ ) . (5.5) Note that rank(N) ≤ rank(Θˆ rk) + rank(Θ∗). Therefore, by H¨older’sin- equality, we get

2 > 2 hE,Ui = hΦ E,N/kNkF i 2 > 2 kNk1 ≤ kΦ Ekop 2 kNkF > 2 ≤ rank(N)kΦ Ekop > 2  ˆ rk ∗  ≤ kΦ Ekop rank(Θ ) + rank(Θ ) .

> 2 2 Next, note that Lemma 5.3 yields kΦ Ekop ≤ nτ so that

hE,Ui2 ≤ nτ 2 rank(Θˆ rk) + rank(Θ∗) .

Together with (5.5), this completes the proof.

It follows from Theorem 5.5 that the estimator by rank penalization enjoys the same properties as the singular value thresholding estimator even when X does not satisfy the ORT condition. This is reminiscent of the BIC estimator which enjoys the same properties as the hard thresholding estimator. However this analogy does not extend to computational questions. Indeed, while the rank penalty, just like the sparsity penalty, is not convex, it turns out that XΘˆ rk can be computed efficiently. Note first that

1 2 2 n 1 2 2 o min kY − XΘkF + 2τ rank(Θ) = min min kY − XΘkF + 2τ k . Θ∈IRd×T n k n Θ∈IRd×T rank(Θ)≤k

Therefore, it remains to show that

2 min kY − XΘkF Θ∈IRd×T rank(Θ)≤k 5.2. Multivariate regression 131 can be solved efficiently. To that end, let Y¯ = X(X>X)†X>Y denote the orthog- onal projection of Y onto the image space of X: this is a linear operator from IRd×T into IRn×T . By the Pythagorean theorem, we get for any Θ ∈ IRd×T , 2 ¯ 2 ¯ 2 kY − XΘkF = kY − YkF + kY − XΘkF . Next consider the SVD of Y¯: ¯ X > Y = λjujvj j ˜ where λ1 ≥ λ2 ≥ ... and define Y by k ˜ X > Y = λjujvj . j=1 By Lemma 5.1, it holds ¯ ˜ 2 ¯ 2 kY − YkF = min kY − ZkF . Z:rank(Z)≤k

2 Therefore, any minimizer of XΘ 7→ kY − XΘkF over matrices of rank at most k can be obtained by truncating the SVD of Y¯ at order k. Once XΘˆ rk has been found, one may obtain a corresponding Θˆ rk by least squares but this is not necessary for our results. Remark 5.6. While the rank penalized estimator can be computed efficiently, it is worth pointing out that a convex relaxation for the rank penalty can also be used. The estimator by nuclear norm penalization Θˆ is defined to be any solution to the minimization problem
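The computation described above is short to implement: project $\mathbb Y$ onto the column span of $\mathbb X$, truncate the SVD of the projection at each candidate rank $k$, and keep the rank minimizing the penalized criterion. The sketch below (Python/NumPy) does exactly this; the value of $\tau^2$ and all dimensions are illustrative choices, not the constants of Theorem 5.5.

    import numpy as np

    # Sketch: rank-penalized estimator via truncated SVD of the projection
    # Ybar = X (X^T X)^+ X^T Y, searching over the candidate rank k.
    rng = np.random.default_rng(7)
    n, d, T, r, sigma = 100, 20, 30, 2, 1.0
    X = rng.normal(size=(n, d))
    Theta_star = rng.normal(size=(d, r)) @ rng.normal(size=(r, T))
    Y = X @ Theta_star + sigma * rng.normal(size=(n, T))

    Ybar = X @ np.linalg.pinv(X.T @ X) @ X.T @ Y            # projection onto col span of X
    U, lam, Vt = np.linalg.svd(Ybar, full_matrices=False)

    tau2 = 4 * sigma ** 2 * (d + T) / n                     # illustrative tau^2
    best_k, best_val = 0, np.inf
    for k in range(min(d, T) + 1):
        XTheta_k = (U[:, :k] * lam[:k]) @ Vt[:k]            # truncated SVD of Ybar
        val = np.linalg.norm(Y - XTheta_k, "fro") ** 2 / n + 2 * tau2 * k
        if val < best_val:
            best_k, best_val = k, val

    print("selected rank:", best_k, "| true rank:", np.linalg.matrix_rank(Theta_star))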

n 1 2 o min kY − XΘkF + τkΘk1 Θ∈IRd×T n Clearly this criterion is convex and it can actually be implemented efficiently using semi-definite programming. It has been popularized by matrix comple- tion problems. Let X have the following SVD: r X > X = λjujvj , j=1 with λ1 ≥ λ2 ≥ ... ≥ λr > 0. It can be shown that for some appropriate choice of τ, it holds 2 ∗ 1 ˆ ∗ 2 λ1 σ rank(Θ ) kXΘ − XΘ kF . d ∨ T n λr n with probability .99. However, the proof of this result is far more involved than a simple adaption of the proof for the Lasso estimator to the matrix case (the readers are invited to see that for themselves). For one thing, there is no assumption on the design matrix (such as INC for example). This result can be found in [KLT11]. 5.3. Covariance matrix estimation 132

5.3 COVARIANCE MATRIX ESTIMATION

Empirical covariance matrix

d Let X1,...,Xn be n i.i.d. copies of a random vector X ∈ IR such that IEXX> = Σ for some unknown matrix Σ 0 called covariance matrix. This matrix contains information about the moments of order 2 of the random vector X. A natural candidate to estimate Σ is the empirical covariance matrix Σˆ defined by n 1 X Σˆ = X X> . n i i i=1 Using the tools of Chapter1, we can prove the following result. Theorem 5.7. Let Y ∈ IRd be a random vector such that IE[Y ] = 0, IE[YY >] = Id and Y ∼ subGd(1). Let X1,...,Xn be n independent copies of sub-Gaussian 1/2 > random vector X = Σ Y . Then IE[X] = 0, IE[XX ] = Σ and X ∼ subGd(kΣkop). Moreover, r  d + log(1/δ) d + log(1/δ) kΣˆ − Σk kΣk ∨ , op . op n n with probability 1 − δ. Proof. It follows from elementary computations that IE[X] = 0, IE[XX>] = Σ and X ∼ subGd(kΣkop). To prove the deviation inequality, observe first that without loss of generality, we can assume that Σ = Id. Indeed, kΣˆ − Σk k 1 Pn X X> − Σk op = n i=1 i i op kΣkop kΣkop kΣ1/2k k 1 Pn Y Y > − I k kΣ1/2k ≤ op n i=1 i i d op op kΣkop n 1 X = k Y Y > − I k . n i i d op i=1

Thus, in the rest of the proof, we assume that Σ = Id. Let N be a 1/4-net for Sd−1 such that |N | ≤ 12d. It follows from the proof of Lemma 5.3 that

> kΣˆ − Idkop ≤ 2 max x (Σˆ − Id)y. x,y∈N Hence, for any t ≥ 0, by a union bound,

 X >  IP kΣˆ − Idkop > t ≤ IP x (Σˆ − Id)y > t/2 . (5.6) x,y∈N It holds, n 1 X x>(Σˆ − I )y = (X>x)(X>y) − IE(X>x)(X>y) . d n i i i i i=1 5.3. Covariance matrix estimation 133

Using polarization, we also have

Z2 − Z2 (X>x)(X>y) = + − , i i 4

> > where Z+ = Xi (x + y) and Z− = Xi (x − y). It yields

 > >  > > i IE exp s (Xi x)(Xi y) − IE (Xi x)(Xi y) s s i = IE exp Z2 − IE[Z2 ] − Z2 − IE[Z2 ]) 4 + + 4 − −  s s 1/2 ≤ IE exp Z2 − IE[Z2 ]IE exp − Z2 − IE[Z2 ] , 2 + + 2 − − where in the last inequality, we used Cauchy-Schwarz. Next, since X ∼ subGd(1), we have Z+,Z− ∼ subG(2), and it follows from Lemma 1.12 that

2 2 2 2 Z+ − IE[Z+] ∼ subE(32) , and Z− − IE[Z−] ∼ subE(32).

Therefore, for any s ≤ 1/16, we have for any Z ∈ {Z+,Z−} that

s 2 IE exp Z2 − IE[Z2] ≤ e128s . 2 It yields that

> >  > >  (Xi x)(Xi y) − IE (Xi x)(Xi y) ∼ subE(16) .

Applying now Bernstein’s inequality (Theorem 1.13), we get

h n  t 2 t i IPx>(Σˆ − I )y > t/2 ≤ exp − ( ∧ ) . (5.7) d 2 32 32 Together with (5.6), this yields

h n  t 2 t i IPkΣˆ − I k > t ≤ 144d exp − ( ∧ ) . (5.8) d op 2 32 32 In particular, the right hand side of the above inequality is at most δ ∈ (0, 1) if

$$\frac{t}{32} \ge \Big(\frac{2d}{n}\log(144) + \frac 2n\log(1/\delta)\Big) \vee \Big(\frac{2d}{n}\log(144) + \frac 2n\log(1/\delta)\Big)^{1/2}.$$
This concludes our proof.

Theorem 5.7 indicates that for fixed d, the empirical covariance matrix is a consistent estimator of Σ (in any norm as they are all equivalent in finite dimen- sion). However, the bound that we got is not satisfactory in high-dimensions when d  n. To overcome this limitation, we can introduce sparsity as we have done in the case of regression. The most obvious way to do so is to assume that few of the entries of Σ are non zero and it turns out that in this case 5.4. Principal component analysis 134 thresholding is optimal. There is a long line of work on this subject (see for example [CZZ10] and [CZ12]). Once we have a good estimator of Σ, what can we do with it? The key insight is that Σ contains information about the projection of the vector X onto any direction u ∈ Sd−1. Indeed, we have that var(X>u) = u>Σu, which can be readily estimated by Var(d X>u) = u>Σˆu. Observe that it follows from Theorem 5.7 that

$$\big|\widehat{\mathrm{Var}}(X^\top u) - \mathrm{Var}(X^\top u)\big| = \big|u^\top(\hat\Sigma - \Sigma)u\big| \le \|\hat\Sigma - \Sigma\|_{\mathrm{op}} \lesssim \|\Sigma\|_{\mathrm{op}}\Big(\sqrt{\frac{d+\log(1/\delta)}{n}} \vee \frac{d+\log(1/\delta)}{n}\Big)$$
with probability $1-\delta$.

The above fact is useful in the Markowitz theory of portfolio selection, for example [Mar52], where a portfolio of assets is a vector $u\in\mathrm{IR}^d$ such that $|u|_1 = 1$ and the risk of a portfolio is given by the variance $\mathrm{Var}(X^\top u)$. The goal is then to maximize reward subject to risk constraints. In most instances, the empirical covariance matrix is plugged into the formula in place of $\Sigma$ (see Problem 5.4).
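The following short sketch (Python/NumPy, illustrative dimensions) computes the empirical covariance matrix and the plug-in estimate of the variance of a projection $X^\top u$, as discussed above.

    import numpy as np

    # Sketch: empirical covariance and plug-in variance of a projection X^T u.
    rng = np.random.default_rng(8)
    d, n = 10, 2000
    A = rng.normal(size=(d, d))
    Sigma = A @ A.T / d + np.eye(d)                         # a PSD covariance matrix

    X = rng.multivariate_normal(np.zeros(d), Sigma, size=n) # rows are X_1, ..., X_n
    Sigma_hat = X.T @ X / n                                 # empirical covariance

    u = rng.normal(size=d); u /= np.linalg.norm(u)
    print("operator norm error:", np.linalg.norm(Sigma_hat - Sigma, 2))
    print("u^T Sigma u     =", u @ Sigma @ u)
    print("u^T Sigma_hat u =", u @ Sigma_hat @ u)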

5.4 PRINCIPAL COMPONENT ANALYSIS

Spiked covariance model Estimating the variance in all directions is also useful for dimension reduction. In Principal Component Analysis (PCA), the goal is to find one (or more) directions onto which the data X1,...,Xn can be projected without loosing much of its properties. There are several goals for doing this but perhaps the most prominent ones are data visualization (in few dimensions, one can plot and visualize the cloud of n points) and clustering (clustering is a hard computational problem and it is therefore preferable to carry it out in lower dimensions). An example of the output of a principal component analysis is given in Figure 5.1. In this figure, the data has been projected onto two orthogonal directions PC1 and PC2, that were estimated to have the most variance (among all such orthogonal pairs). The idea is that when projected onto such directions, points will remain far apart and a clustering pattern will still emerge. This is the case in Figure 5.1 where the original data is given by d = 500, 000 gene expression levels measured on n ' 1, 387 people. Depicted are the projections of these 1, 387 points in two dimension. This image has become quite popular as it shows that gene expression levels can recover the structure induced by geographic clustering. How is it possible to “compress” half a million dimensions into only two? The answer is that the data is intrinsically low dimensional. In this case, a plausible assumption is that all the 1, 387 points live close to a two-dimensional linear subspace. To see how this assumption (in one dimension instead of two for simplicity) translates 5.4. Principal component analysis 135

Figure 5.1. Projection onto two dimensions of 1, 387 points from gene expression data. Source: Gene expression blog.

into the structure of the covariance matrix Σ, assume that X1,...,Xn are Gaussian random variables generated as follows. Fix a direction v ∈ Sd−1 > and let Y1,...,Yn ∼ Nd(0,Id) so that v Yi are i.i.d. N (0, 1). In particular, > > the vectors (v Y1)v, . . . , (v Yn)v live in the one-dimensional space spanned by v. If one would observe such data the problem would be easy as only two d observations would suffice to recover v. Instead, we observe X1,...,Xn ∈ IR > 2 where Xi = (v Yi)v + Zi, and Zi ∼ Nd(0, σ Id) are i.i.d. and independent of the Yis, that is we add isotropic noise to every point. If the σ is small enough, we can hope to recover the direction v (See Figure 5.2). The covariance matrix of Xi generated as such is given by  >  > > > > 2 Σ = IE XX = IE ((v Y )v + Z)((v Y )v + Z) = vv + σ Id . This model is often called the spiked covariance model. By a simple rescaling, it is equivalent to the following definition. Definition 5.8. A covariance matrix Σ ∈ IRd×d is said to satisfy the spiked covariance model if it is of the form

> Σ = θvv + Id , 5.4. Principal component analysis 136


Figure 5.2. Points are close to a one dimensional space spanned by v. where θ > 0 and v ∈ Sd−1. The vector v is called the spike. This model can be extended to more than one spike but this extension is beyond the scope of these notes. Clearly, under the spiked covariance model, v is the eigenvector of the matrix Σ that is associated to its largest eigenvalue 1 + θ. We will refer to this vector simply as largest eigenvector. To estimate it, a natural candidate is the largest eigenvectorv ˆ of Σ,˜ where Σ˜ is any estimator of Σ. There is a caveat: by symmetry, if u is an eigenvector, of a symmetric matrix, then −u is also an eigenvector associated to the same eigenvalue. Therefore, we may only estimate v up to a sign flip. To overcome this limitation, it is often useful to describe proximity between two vectors u and v in terms of the principal angle between their linear span. Let us recall that for two unit vectors the principal angle between their linear spans is denoted by ∠(u, v) and defined as > ∠(u, v) = arccos(|u v|) . The following result from perturbation theory is known as the Davis-Kahan sin(θ) theorem as it bounds the sin of the principal angle between eigenspaces. This theorem exists in much more general versions that extend beyond one- dimensional eigenspaces. Theorem 5.9 (Davis-Kahan sin(θ) theorem). Let A, B be two d × d PSD matrices. Let (λ1, u1),..., (λd, ud) (resp. (µ1, v1),..., (µd, vd)) denote the pairs of eigenvalues–eigenvectors of A (resp. B) ordered such that λ1 ≥ λ2 ≥ ... (resp. µ1 ≥ µ2 ≥ ... ). Then

 2 sin ∠(u1, v1) ≤ kA − Bkop . max(λ1 − λ2, µ1 − µ2) Moreover, 2 2  min |εu1 − v1|2 ≤ 2 sin ∠(u1, v1) . ε∈{±1}

> d−1 Proof. Note that u1 Au1 = λ1 and for any x ∈ S , it holds

d > X > 2 > 2 > 2 x Ax = λj(uj x) ≤ λ1(u1 x) + λ2(1 − (u1 x) ) j=1 2  2  = λ1 cos ∠(u1, x) + λ2 sin ∠(u1, x) . 5.4. Principal component analysis 137

Therefore, taking x = v1, we get that on the one hand,

> > 2  2  u1 Au1 − v1 Av1 ≥ λ1 − λ1 cos ∠(u1, x) − λ2 sin ∠(u1, x) 2  = (λ1 − λ2) sin ∠(u1, x) . On the other hand,

> > > > > u1 Au1 − v1 Av1 = u1 Bu1 − v1 Av1 + u1 (A − B)u1 > > > ≤ v1 Bv1 − v1 Av1 + u1 (A − B)u1 (v1 is leading eigenvector of B) > > = hA − B, u1u1 − v1v1 i > > ≤ kA − Bkopku1u1 − v1v1 k1 (H¨older) √ > > ≤ kA − Bkop 2ku1u1 − v1v1 k2 (Cauchy-Schwarz)

> > where in the last inequality, we used the fact that rank(u1u1 − v1v1 ) = 2. It is straightforward to check that

> > 2 > > 2 > 2 2  ku1u1 − v1v1 k2 = ku1u1 − v1v1 kF = 2 − 2(u1 v1) = 2 sin ∠(u1, v1) . We have proved that

2   (λ1 − λ2) sin ∠(u1, x) ≤ 2kA − Bkop sin ∠(u1, x) , which concludes the first part of the lemma. Note that we can replace λ1 − λ2 with µ1 − µ2 since the result is completely symmetric in A and B. It remains to show the second part of the lemma. To that end, observe that

2 > > 2 2  min |εu1 − v1|2 = 2 − 2|u˜1 v1| ≤ 2 − 2(u1 v1) = 2 sin ∠(u1, x) . ε∈{±1}

Combined with Theorem 5.7, we immediately get the following corollary. Corollary 5.10. Let Y ∈ IRd be a random vector such that IE[Y ] = 0, IE[YY >] = Id and Y ∼ subGd(1). Let X1,...,Xn be n independent copies of sub-Gaussian 1/2 > random vector X = Σ Y so that IE[X] = 0, IE[XX ] = Σ and X ∼ subGd(kΣkop). > Assume further that Σ = θvv +Id satisfies the spiked covariance model. Then, the largest eigenvector vˆ of the empirical covariance matrix Σˆ satisfies, r 1 + θ  d + log(1/δ) d + log(1/δ) min |εvˆ − v|2 . ∨ ε∈{±1} θ n n with probability 1 − δ. This result justifies the use of the empirical covariance matrix Σˆ as a re- placement for the true covariance matrix Σ when performing PCA in low di- mensions, that is when d  n. In the high-dimensional case, where d  n, the above result is uninformative. As before, we resort to sparsity to overcome this limitation. 5.4. Principal component analysis 138
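The estimation scheme analyzed above amounts to a few lines of code: form the empirical covariance matrix and take its leading eigenvector. The sketch below (Python/NumPy, illustrative parameters; the benchmark printed at the end only tracks the order $(1+\theta)\sqrt{d/n}/\theta$, constants being ignored) simulates the spiked covariance model and measures the recovery error up to a sign flip.

    import numpy as np

    # Sketch: PCA in the spiked covariance model Sigma = theta v v^T + I_d.
    rng = np.random.default_rng(9)
    d, n, theta = 50, 1000, 2.0
    v = rng.normal(size=d); v /= np.linalg.norm(v)
    Sigma = theta * np.outer(v, v) + np.eye(d)

    X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    Sigma_hat = X.T @ X / n

    eigval, eigvec = np.linalg.eigh(Sigma_hat)              # eigenvalues in ascending order
    v_hat = eigvec[:, -1]                                   # leading eigenvector

    err = min(np.linalg.norm(v_hat - v), np.linalg.norm(v_hat + v))   # sign ambiguity
    print("min_eps |eps v_hat - v|_2 =", err)
    print("crude benchmark (1+theta)/theta * sqrt(d/n) =", (1 + theta) / theta * np.sqrt(d / n))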

Sparse PCA

In the example of Figure 5.1, it may be desirable to interpret the meaning of the two directions denoted by PC1 and PC2. We know that they are linear combinations of the original 500,000 gene expression levels. A natural question to ask is whether only a subset of these genes could suffice to obtain similar results. Such a discovery could have potentially interesting scientific applications as it would point to a few genes responsible for disparities between European populations. In the case of the spiked covariance model this amounts to having a sparse v. Beyond interpretability, as we just discussed, sparsity should also lead to statistical stability, as in the case of sparse linear regression for example. To enforce sparsity, we will assume that v in the spiked covariance model is k-sparse: |v|_0 = k. Therefore, a natural candidate to estimate v is given by v̂ defined by
\[
\hat v^\top\hat\Sigma\hat v = \max_{\substack{u\in\mathcal S^{d-1}\\ |u|_0=k}} u^\top\hat\Sigma u\,.
\]
It is easy to check that λ^k_max(Σ̂) = v̂^⊤Σ̂v̂ is the largest of the leading eigenvalues of all k × k sub-matrices of Σ̂, so that the maximum is indeed attained, though there may be several maximizers. We call λ^k_max(Σ̂) the k-sparse leading eigenvalue of Σ̂ and v̂ a k-sparse leading eigenvector.

Theorem 5.11. Let Y ∈ IR^d be a random vector such that IE[Y] = 0, IE[YY^⊤] = I_d and Y ∼ subG_d(1). Let X_1, ..., X_n be n independent copies of the sub-Gaussian random vector X = Σ^{1/2}Y, so that IE[X] = 0, IE[XX^⊤] = Σ and X ∼ subG_d(‖Σ‖_op). Assume further that Σ = θvv^⊤ + I_d satisfies the spiked covariance model for v such that |v|_0 = k ≤ d/2. Then, the k-sparse leading eigenvector v̂ of the empirical covariance matrix satisfies
\[
\min_{\varepsilon\in\{\pm1\}}|\varepsilon\hat v - v|_2 \lesssim \frac{1+\theta}{\theta}\Big(\sqrt{\frac{k\log(ed/k)+\log(1/\delta)}{n}}\ \vee\ \frac{k\log(ed/k)+\log(1/\delta)}{n}\Big)
\]
with probability 1 − δ.

Proof. We begin with an intermediate result from the proof of the Davis-Kahan sin(θ) theorem. Using the same steps as in the proof of Theorem 5.9, we get

\[
v^\top\Sigma v - \hat v^\top\Sigma\hat v \le \langle\hat\Sigma-\Sigma,\ \hat v\hat v^\top - vv^\top\rangle\,.
\]

Since both v̂ and v are k-sparse, there exists a (random) set S ⊂ {1, ..., d} such that |S| ≤ 2k and {v̂v̂^⊤ − vv^⊤}_{ij} = 0 if (i, j) ∉ S². It yields

\[
\langle\hat\Sigma-\Sigma,\ \hat v\hat v^\top - vv^\top\rangle = \langle\hat\Sigma(S)-\Sigma(S),\ \hat v(S)\hat v(S)^\top - v(S)v(S)^\top\rangle\,,
\]

where for any d × d matrix M, we defined the matrix M(S) to be the |S| × |S| sub-matrix of M with rows and columns indexed by S, and for any vector x ∈ IR^d, x(S) ∈ IR^{|S|} denotes the sub-vector of x with coordinates indexed by S. It yields, by Hölder's inequality, that

\[
v^\top\Sigma v - \hat v^\top\Sigma\hat v \le \|\hat\Sigma(S)-\Sigma(S)\|_{\mathrm{op}}\,\|\hat v(S)\hat v(S)^\top - v(S)v(S)^\top\|_1\,.
\]
Following the same steps as in the proof of Theorem 5.9, we get now that

\[
\sin\big(\angle(\hat v, v)\big) \le \frac2\theta\,\sup_{S\,:\,|S|=2k}\|\hat\Sigma(S)-\Sigma(S)\|_{\mathrm{op}}\,.
\]
To conclude the proof, it remains to control sup_{S : |S|=2k} ‖Σ̂(S) − Σ(S)‖_op. To that end, observe that
\[
\begin{aligned}
\mathbb{P}\Big[\sup_{S:|S|=2k}\|\hat\Sigma(S)-\Sigma(S)\|_{\mathrm{op}} > t\|\Sigma\|_{\mathrm{op}}\Big]
&\le \sum_{S:|S|=2k}\mathbb{P}\Big[\|\hat\Sigma(S)-\Sigma(S)\|_{\mathrm{op}} > t\|\Sigma(S)\|_{\mathrm{op}}\Big]\\
&\le \binom{d}{2k}\,144^{2k}\exp\Big[-\frac n2\Big(\big(\tfrac t{32}\big)^2\wedge\tfrac t{32}\Big)\Big]\,,
\end{aligned}
\]
where we used (5.8) in the second inequality. Using Lemma 2.7, we get that the right-hand side above is further bounded by
\[
\exp\Big[-\frac n2\Big(\big(\tfrac t{32}\big)^2\wedge\tfrac t{32}\Big) + 2k\log(144) + 2k\log\Big(\frac{ed}{2k}\Big)\Big]\,.
\]
Choosing now t such that
\[
t \ge C\Big(\sqrt{\frac{k\log(ed/k)+\log(1/\delta)}{n}}\ \vee\ \frac{k\log(ed/k)+\log(1/\delta)}{n}\Big)\,,
\]
for C large enough ensures that the desired bound holds with probability at least 1 − δ.
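On the computational side, the k-sparse leading eigenvector is NP-hard to compute in general; for small d and k it can be obtained by enumerating all supports of size k, which is what the sketch below does. It is only meant to make the definition of v̂ concrete, not to be an efficient algorithm.

```python
# Brute-force computation of the k-sparse leading eigenvector of Sigma_hat,
# i.e. the maximizer of u^T Sigma_hat u over unit vectors with |u|_0 <= k.
# Exponential in d; intended only for small examples.
import numpy as np
from itertools import combinations

def sparse_pca_bruteforce(Sigma_hat, k):
    d = Sigma_hat.shape[0]
    best_val, best_vec = -np.inf, None
    for S in combinations(range(d), k):
        idx = np.array(S)
        w, V = np.linalg.eigh(Sigma_hat[np.ix_(idx, idx)])  # k x k principal submatrix
        if w[-1] > best_val:                                 # its leading eigenvalue
            best_val = w[-1]
            best_vec = np.zeros(d)
            best_vec[idx] = V[:, -1]
    # best_val is the k-sparse leading eigenvalue, best_vec a k-sparse leading eigenvector
    return best_val, best_vec
```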

Minimax lower bound

The last section has established that the spike v in the spiked covariance model Σ = θvv^⊤ + I_d may be estimated at the rate θ^{-1}√(k log(d/k)/n), assuming that θ ≤ 1 and k log(d/k) ≤ n, which corresponds to the interesting regime. Using the tools from Chapter 4, we can show that this rate is in fact optimal already in the Gaussian case.

Theorem 5.12. Let X_1, ..., X_n be n independent copies of the Gaussian random vector X ∼ N_d(0, Σ) where Σ = θvv^⊤ + I_d satisfies the spiked covariance model for v such that |v|_0 = k. Write IP_v for the distribution of X under this assumption and IE_v for the associated expectation. Then, there exist constants C, c > 0 such that for any n ≥ 2, d ≥ 9, 2 ≤ k ≤ (d + 7)/8, it holds
\[
\inf_{\hat v}\ \sup_{\substack{v\in\mathcal S^{d-1}\\ |v|_0=k}}\ \mathbb{P}_v\Big[\min_{\varepsilon\in\{\pm1\}}|\varepsilon\hat v - v|_2 \ge \frac C\theta\sqrt{\frac kn\log(ed/k)}\Big] \ge c\,.
\]

Proof. We use the general technique developed in Chapter 4. Using this technique, we need to provide a set v_1, ..., v_M ∈ S^{d−1} of k-sparse vectors such that

(i) min_{ε∈{−1,1}} |εv_j − v_k|_2 > 2ψ for all 1 ≤ j < k ≤ M,

(ii) KL(IP_{v_j}, IP_{v_k}) ≤ c log(M)/n,

where ψ = Cθ^{-1}√(k log(d/k)/n). Akin to the Gaussian sequence model, our goal is to employ the sparse Varshamov-Gilbert lemma to construct the v_j's. However, the v_j's have the extra constraint that they need to be of unit norm so we cannot simply rescale the sparse binary vectors output by the sparse Varshamov-Gilbert lemma. Instead, we use the last coordinate to make sure that they have unit norm. Recall that it follows from the sparse Varshamov-Gilbert Lemma 4.14 that there exist ω_1, ..., ω_M ∈ {0,1}^{d−1} such that |ω_j|_0 = k − 1, ρ(ω_i, ω_j) ≥ (k−1)/2 for all i ≠ j and log(M) ≥ (k−1)/8 · log(1 + (d−1)/(2k−2)). Define v_j = (γω_j, ℓ), where γ < 1/√(k−1) is to be defined later and ℓ = √(1 − γ²(k−1)). It is straightforward to check that |v_j|_0 = k, |v_j|_2 = 1 and v_j has only nonnegative coordinates. Moreover, for any i ≠ j,

\[
\min_{\varepsilon\in\{\pm1\}}|\varepsilon v_i - v_j|_2^2 = |v_i - v_j|_2^2 = \gamma^2\rho(\omega_i,\omega_j) \ge \gamma^2\,\frac{k-1}{2}\,. \tag{5.9}
\]
We choose
\[
\gamma^2 = \frac{C_\gamma}{\theta^2 n}\log\Big(\frac dk\Big)\,,
\]
so that γ²(k−1) ≍ ψ² and (i) is verified. It remains to check (ii). To that end, note that
\[
\begin{aligned}
KL(\mathbb{P}_u,\mathbb{P}_v) &= \frac12\,\mathbb{E}_u\Big[X^\top\big((I_d+\theta vv^\top)^{-1} - (I_d+\theta uu^\top)^{-1}\big)X\Big]\\
&= \frac12\,\mathbb{E}_u\,\mathrm{Tr}\Big[\big((I_d+\theta vv^\top)^{-1} - (I_d+\theta uu^\top)^{-1}\big)XX^\top\Big]\\
&= \frac12\,\mathrm{Tr}\Big[\big((I_d+\theta vv^\top)^{-1} - (I_d+\theta uu^\top)^{-1}\big)(I_d+\theta uu^\top)\Big]\,.
\end{aligned}
\]
Next, we use the Sherman-Morrison formula¹:
\[
(I_d+\theta uu^\top)^{-1} = I_d - \frac{\theta}{1+\theta}uu^\top\,,\qquad\forall\,u\in\mathcal S^{d-1}\,.
\]

¹To see this, check that (I_d + θuu^⊤)(I_d − (θ/(1+θ))uu^⊤) = I_d and that the two matrices on the left-hand side commute.

It yields
\[
\begin{aligned}
KL(\mathbb{P}_u,\mathbb{P}_v) &= \frac{\theta}{2(1+\theta)}\,\mathrm{Tr}\big[(uu^\top - vv^\top)(I_d+\theta uu^\top)\big]\\
&= \frac{\theta^2}{2(1+\theta)}\big(1-(u^\top v)^2\big)\\
&= \frac{\theta^2}{2(1+\theta)}\sin^2\big(\angle(u,v)\big)\,.
\end{aligned}
\]
In particular,
\[
KL(\mathbb{P}_{v_i},\mathbb{P}_{v_j}) = \frac{\theta^2}{2(1+\theta)}\sin^2\big(\angle(v_i,v_j)\big)\,.
\]
Next, note that
\[
\sin^2\big(\angle(v_i,v_j)\big) = 1 - \gamma^2\omega_i^\top\omega_j - \ell^2 \le \gamma^2(k-1) \le \frac{C_\gamma k}{\theta^2 n}\log\Big(\frac dk\Big) \le \frac{\log M}{\theta^2 n}\,,
\]
for C_γ small enough. We conclude that (ii) holds, which completes our proof.

Together with the upper bound of Theorem 5.11, we get the following corollary.

Corollary 5.13. Assume that θ ≤ 1 and k log(d/k) ≤ n. Then θ^{-1}√(k log(d/k)/n) is the minimax rate of estimation over B_0(k) in the spiked covariance model.

5.5 GRAPHICAL MODELS

Gaussian graphical models

In Section 5.3, we noted that thresholding is an appropriate way to gain advantage from sparsity in the covariance matrix when trying to estimate it. However, in some cases, it is more natural to assume sparsity in the inverse of the covariance matrix, Θ = Σ^{-1}, which is sometimes called the concentration or precision matrix. In this section, we will develop an approach that allows us to get guarantees similar to the lasso for the error between the estimate and the ground truth matrix in Frobenius norm. We can understand why sparsity in Θ can be appropriate in the context of undirected graphical models [Lau96]. These are models that encode the dependence relations of a high-dimensional distribution by means of a graph.

Definition 5.14. Let G = (V, E) be a graph on d nodes and to each node v, associate a random variable X_v. An undirected graphical model or Markov random field is a collection of probability distributions over X_v that factorize according to
\[
p(x_1,\dots,x_d) = \frac1Z\prod_{C\in\mathcal C}\psi_C(x_C)\,, \tag{5.10}
\]
where 𝒞 is the collection of all cliques (completely connected subsets) in G, x_C is the restriction of (x_1, ..., x_d) to the subset C, and the ψ_C are non-negative potential functions.

This definition implies certain conditional independence relations, such as the following.

Proposition 5.15. A graphical model as in (5.10) fulfills the global Markov property. That is, for any triple (A, B, S) of disjoint subsets such that S separates A and B, A ⊥⊥ B | S.

Proof. Let (A, B, S) be a triple as in the proposition statement and define Ã to be the union of the connected components of the induced subgraph G[V \ S] that intersect A, and set B̃ = V \ (Ã ∪ S). By assumption, A and B have to be in different connected components of G[V \ S], hence any clique in G has to be a subset of Ã ∪ S or B̃ ∪ S. Denoting the former cliques by 𝒞_A, we get
\[
f(x) = \prod_{C\in\mathcal C}\psi_C(x_C) = \prod_{C\in\mathcal C_A}\psi_C(x_C)\prod_{C\in\mathcal C\setminus\mathcal C_A}\psi_C(x_C) = h(x_{\tilde A\cup S})\,k(x_{\tilde B\cup S})\,,
\]

which implies the desired conditional independence.

A weaker form of the Markov property is the so-called pairwise Markov property, which says that the above holds only for singleton sets of the form A = {a}, B = {b} and S = V \ {a, b} if (a, b) ∉ E. In the following, we will focus on Gaussian graphical models for n i.i.d. samples X_i ∼ N(0, (Θ*)^{-1}). By contrast, there is a large body of work on discrete graphical models such as the Ising model (see for example [Bre15]), which we are not going to discuss here, although links to the following can be established as well [LWo12]. For a Gaussian model with mean zero, by considering the density in terms of the concentration matrix Θ,

\[
f_\Theta(x) \propto \exp\big(-x^\top\Theta x/2\big)\,,
\]
it is immediate that the pairs (a, b) for which the pairwise Markov property holds are exactly those for which Θ_{a,b} = 0. Conversely, it can be shown that for a family of distributions with positive density, this actually implies the factorization (5.10) according to the graph given by the edges {(i, j) : Θ_{i,j} ≠ 0}. This is the content of the Hammersley-Clifford theorem.

Theorem 5.16 (Hammersley-Clifford). Let P be a probability distribution with positive density f with respect to a product measure over V. Then P factorizes over G as in (5.10) if and only if it fulfills the pairwise Markov property.

We show the if direction. The requirement that f be positive allows us to take logarithms. Pick a fixed element x*. The idea of the proof is to rewrite the density in a way that allows us to make use of the pairwise Markov property. The following combinatorial lemma will be used for this purpose.

Lemma 5.17 (Moebius Inversion). Let Ψ, Φ: 𝒫(V) → IR be functions defined for all subsets of a set V. Then, the following are equivalent:

1. For all A ⊆ V: Ψ(A) = Σ_{B⊆A} Φ(B);

2. For all A ⊆ V: Φ(A) = Σ_{B⊆A} (−1)^{|A\B|} Ψ(B).

Proof. 2. ⟹ 1.:
\[
\sum_{B\subseteq A}\Phi(B) = \sum_{B\subseteq A}\sum_{C\subseteq B}(-1)^{|B\setminus C|}\Psi(C)
= \sum_{C\subseteq A}\Psi(C)\Big(\sum_{C\subseteq B\subseteq A}(-1)^{|B\setminus C|}\Big)
= \sum_{C\subseteq A}\Psi(C)\Big(\sum_{H\subseteq A\setminus C}(-1)^{|H|}\Big)\,.
\]

Now, Σ_{H⊆A\C} (−1)^{|H|} = 0 unless A \ C = ∅, because any non-empty set has the same number of subsets with odd and even cardinality, which can be seen by induction. The other direction is proved analogously.
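Lemma 5.17 can also be checked numerically on a small ground set; the sketch below applies formula 2 to a randomly chosen Ψ and verifies that formula 1 recovers it (the size of the ground set is an arbitrary choice).

```python
# Numerical check of Moebius inversion (Lemma 5.17) over the subsets of a small set V.
import itertools
import random

V = range(4)
subsets = [frozenset(s) for r in range(len(V) + 1)
           for s in itertools.combinations(V, r)]
Psi = {A: random.random() for A in subsets}

# Formula 2: Phi(A) = sum over B subset of A of (-1)^{|A \ B|} Psi(B)
Phi = {A: sum((-1) ** len(A - B) * Psi[B] for B in subsets if B <= A)
       for A in subsets}

# Formula 1 recovered: Psi(A) = sum over B subset of A of Phi(B)
for A in subsets:
    recovered = sum(Phi[B] for B in subsets if B <= A)
    assert abs(recovered - Psi[A]) < 1e-9
```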

Proof (of Theorem 5.16, if direction). Define
\[
H_A(x) = \log f(x_A, x^*_{A^c})\qquad\text{and}\qquad \Phi_A(x) = \sum_{B\subseteq A}(-1)^{|A\setminus B|}H_B(x)\,.
\]

Note that both H_A and Φ_A depend on x only through x_A. By Lemma 5.17, we get
\[
\log f(x) = H_V(x) = \sum_{A\subseteq V}\Phi_A(x)\,.
\]

It remains to show that Φ_A = 0 if A is not a clique. Assume A ⊆ V with v, w ∈ A, (v, w) ∉ E, and set C = A \ {v, w}. By adding all possible subsets of {v, w} to every subset of C, we get every subset of A, and hence

\[
\Phi_A(x) = \sum_{B\subseteq C}(-1)^{|C\setminus B|}\big(H_B - H_{B\cup\{v\}} - H_{B\cup\{w\}} + H_{B\cup\{v,w\}}\big)\,.
\]

Writing D = V \ {v, w}, by the pairwise Markov property, we have
\[
\begin{aligned}
H_{B\cup\{v,w\}}(x) - H_{B\cup\{v\}}(x)
&= \log\frac{f(x_v, x_w, x_B, x^*_{D\setminus B})}{f(x_v, x^*_w, x_B, x^*_{D\setminus B})}\\
&= \log\frac{f(x_v\mid x_B, x^*_{D\setminus B})\,f(x_w, x_B, x^*_{D\setminus B})}{f(x_v\mid x_B, x^*_{D\setminus B})\,f(x^*_w, x_B, x^*_{D\setminus B})}\\
&= \log\frac{f(x^*_v\mid x_B, x^*_{D\setminus B})\,f(x_w, x_B, x^*_{D\setminus B})}{f(x^*_v\mid x_B, x^*_{D\setminus B})\,f(x^*_w, x_B, x^*_{D\setminus B})}\\
&= \log\frac{f(x^*_v, x_w, x_B, x^*_{D\setminus B})}{f(x^*_v, x^*_w, x_B, x^*_{D\setminus B})}\\
&= H_{B\cup\{w\}}(x) - H_B(x)\,.
\end{aligned}
\]

Thus Φ_A vanishes if A is not a clique, which establishes the factorization (5.10) and completes the proof.

In the following, we will show how to exploit this kind of sparsity when trying to estimate Θ. The key insight is that we can set up an optimization problem that has a similar error sensitivity as the lasso, but with respect to Θ. In fact, it is the penalized version of the maximum likelihood estimator for the Gaussian distribution. We set

\[
\hat\Theta_\lambda = \mathop{\mathrm{argmin}}_{\Theta\succ 0}\ \mathrm{Tr}(\hat\Sigma\Theta) - \log\det(\Theta) + \lambda\|\Theta_{D^c}\|_1\,, \tag{5.11}
\]
where we wrote Θ_{D^c} for the off-diagonal restriction of Θ, ‖A‖_1 = Σ_{i,j} |A_{i,j}| for the element-wise ℓ_1 norm, and Σ̂ = (1/n) Σ_i X_iX_i^⊤ for the empirical covariance matrix of n i.i.d. observations. We will also make use of the notation

\[
\|A\|_\infty = \max_{i,j\in[d]}|A_{i,j}|
\]
for the element-wise ∞-norm of a matrix A ∈ IR^{d×d} and
\[
\|A\|_{\infty,\infty} = \max_i\sum_j|A_{i,j}|
\]
for the ℓ_∞ operator norm. We note that the function Θ ↦ −log det(Θ) has first derivative −Θ^{-1} and second derivative Θ^{-1} ⊗ Θ^{-1}, which means it is convex. Derivations for this can be found in [BV04, Appendix A.4]. This means that in the population case and for λ = 0, the minimizer in (5.11) would coincide with Θ*.

First, let us derive a bound on the error in Σ̂ that is similar to Theorem 1.14, with the difference that we are dealing with sub-Exponential distributions.

Lemma 5.18. If X_i ∼ N(0, Σ) are i.i.d. and Σ̂ = (1/n) Σ_{i=1}^n X_iX_i^⊤ denotes the empirical covariance matrix, then
\[
\|\hat\Sigma - \Sigma\|_\infty \lesssim \|\Sigma\|_{\mathrm{op}}\sqrt{\frac{\log(d/\delta)}{n}}\,,
\]
with probability at least 1 − δ, if n ≳ log(d/δ).

Proof. Start as in the proof of Theorem 5.7 by writing Y = Σ^{-1/2}X ∼ N(0, I_d) and notice that

\[
|e_i^\top(\hat\Sigma-\Sigma)e_j| = \Big|(\Sigma^{1/2}e_i)^\top\Big(\frac1n\sum_{i=1}^n Y_iY_i^\top - I_d\Big)(\Sigma^{1/2}e_j)\Big|
\le \|\Sigma^{1/2}\|_{\mathrm{op}}^2\,\Big|v^\top\Big(\frac1n\sum_{i=1}^n Y_iY_i^\top - I_d\Big)w\Big|
\]

with v, w ∈ S^{d−1}. Hence, without loss of generality, we can assume Σ = I_d. The remaining term is sub-Exponential and can be controlled as in (5.7):

\[
\mathbb{P}\Big[e_i^\top(\hat\Sigma - I_d)e_j > t\Big] \le \exp\Big[-\frac n2\Big(\big(\tfrac t{16}\big)^2\wedge\tfrac t{16}\Big)\Big]\,.
\]
By a union bound, we get

\[
\mathbb{P}\Big[\|\hat\Sigma - I_d\|_\infty > t\Big] \le d^2\exp\Big[-\frac n2\Big(\big(\tfrac t{16}\big)^2\wedge\tfrac t{16}\Big)\Big]\,.
\]

Hence, if n ≳ log(d/δ), then the right-hand side above is at most δ ∈ (0, 1) if

\[
\frac t{16} \ge \Big(\frac{2\log(d/\delta)}{n}\Big)^{1/2}\,.
\]

Theorem 5.19. Let X_i ∼ N(0, Σ) be i.i.d. and Σ̂ = (1/n) Σ_{i=1}^n X_iX_i^⊤ denote the empirical covariance matrix. Assume that |S| = k, where S = {(i, j) : i ≠ j, Θ*_{i,j} ≠ 0}. There exists a constant c = c(‖Σ‖_op) such that if
\[
\lambda = c\sqrt{\frac{\log(d/\delta)}{n}}\,,
\]
and n ≳ log(d/δ), then the minimizer in (5.11) fulfills
\[
\|\hat\Theta - \Theta^*\|_F^2 \lesssim \frac{(d+k)\log(d/\delta)}{n}
\]
with probability at least 1 − δ.

Proof. We start by considering the likelihood

l(Θ, Σ) = Tr(ΣΘ) − log det(Θ).

A Taylor expansion combined with the mean value theorem yields

\[
l(\Theta,\Sigma_n) - l(\Theta^*,\Sigma_n) = \mathrm{Tr}\big(\Sigma_n(\Theta-\Theta^*)\big) - \mathrm{Tr}\big(\Sigma^*(\Theta-\Theta^*)\big) + \frac12\mathrm{Tr}\big(\tilde\Theta^{-1}(\Theta-\Theta^*)\tilde\Theta^{-1}(\Theta-\Theta^*)\big)\,,
\]
for some Θ̃ = Θ* + t(Θ − Θ*), t ∈ [0, 1]. Note that, essentially by the convexity of log det, we have
\[
\mathrm{Tr}\big(\tilde\Theta^{-1}(\Theta-\Theta^*)\tilde\Theta^{-1}(\Theta-\Theta^*)\big) = \|\tilde\Theta^{-1}(\Theta-\Theta^*)\|_F^2 \ge \lambda_{\min}(\tilde\Theta^{-1})^2\,\|\Theta-\Theta^*\|_F^2
\]
and
\[
\lambda_{\min}(\tilde\Theta^{-1}) = \big(\lambda_{\max}(\tilde\Theta)\big)^{-1}\,.
\]
If we write Δ = Θ − Θ*, then for ‖Δ‖_F ≤ 1,
\[
\lambda_{\max}(\tilde\Theta) = \|\tilde\Theta\|_{\mathrm{op}} = \|\Theta^* + t\Delta\|_{\mathrm{op}} \le \|\Theta^*\|_{\mathrm{op}} + \|\Delta\|_{\mathrm{op}} \le \|\Theta^*\|_{\mathrm{op}} + \|\Delta\|_F \le \|\Theta^*\|_{\mathrm{op}} + 1\,,
\]
and therefore
\[
l(\Theta,\Sigma_n) - l(\Theta^*,\Sigma_n) \ge \mathrm{Tr}\big((\Sigma_n-\Sigma^*)(\Theta-\Theta^*)\big) + c\,\|\Theta-\Theta^*\|_F^2\,,
\]
for c = (‖Θ*‖_op + 1)^{-2}/2, if ‖Δ‖_F ≤ 1.

This takes care of the case where Δ is small. To handle the case where it is large, define g(t) = l(Θ* + tΔ, Σ_n) − l(Θ*, Σ_n). By the convexity of l in Θ,
\[
\frac{g(1)-g(0)}{1} \ge \frac{g(t)-g(0)}{t}\,,
\]
so that plugging in t = ‖Δ‖_F^{-1} gives
\[
l(\Theta,\Sigma_n) - l(\Theta^*,\Sigma_n) \ge \|\Delta\|_F\Big(l\big(\Theta^* + \tfrac1{\|\Delta\|_F}\Delta,\Sigma_n\big) - l(\Theta^*,\Sigma_n)\Big)
\ge \|\Delta\|_F\Big(\mathrm{Tr}\big((\Sigma_n-\Sigma^*)\tfrac\Delta{\|\Delta\|_F}\big) + c\Big)
= \mathrm{Tr}\big((\Sigma_n-\Sigma^*)\Delta\big) + c\,\|\Delta\|_F\,,
\]
for ‖Δ‖_F ≥ 1.

If we now write Δ = Θ̂ − Θ* and assume ‖Δ‖_F ≥ 1, then by optimality of Θ̂,
\[
\|\Delta\|_F \le C\big[\mathrm{Tr}\big((\Sigma^*-\Sigma_n)\Delta\big) + \lambda\big(\|\Theta^*_{D^c}\|_1 - \|\hat\Theta_{D^c}\|_1\big)\big]
\le C\big[\mathrm{Tr}\big((\Sigma^*-\Sigma_n)\Delta\big) + \lambda\big(\|\Delta_S\|_1 - \|\Delta_{S^c}\|_1\big)\big]\,,
\]
where S = {(i, j) ∈ D^c : Θ*_{i,j} ≠ 0}, by the triangle inequality. Now, split the error contributions for the diagonal and off-diagonal elements,
\[
\mathrm{Tr}\big((\Sigma^*-\Sigma_n)\Delta\big) + \lambda\big(\|\Delta_S\|_1 - \|\Delta_{S^c}\|_1\big)
\le \|(\Sigma^*-\Sigma_n)_D\|_F\|\Delta_D\|_F + \|(\Sigma^*-\Sigma_n)_{D^c}\|_\infty\|\Delta_{D^c}\|_1 + \lambda\big(\|\Delta_S\|_1 - \|\Delta_{S^c}\|_1\big)\,.
\]
By Hölder's inequality, ‖(Σ* − Σ_n)_D‖_F ≤ √d ‖Σ* − Σ_n‖_∞, and by Lemma 5.18, ‖Σ* − Σ_n‖_∞ ≲ √(log(d/δ)/n) for n ≳ log(d/δ). Combining these two estimates,
\[
\mathrm{Tr}\big((\Sigma^*-\Sigma_n)\Delta\big) + \lambda\big(\|\Delta_S\|_1 - \|\Delta_{S^c}\|_1\big)
\lesssim \sqrt{\frac{d\log(d/\delta)}{n}}\|\Delta_D\|_F + \sqrt{\frac{\log(d/\delta)}{n}}\|\Delta_{D^c}\|_1 + \lambda\big(\|\Delta_S\|_1 - \|\Delta_{S^c}\|_1\big)\,.
\]

Setting λ = C√(log(d/δ)/n) and splitting ‖Δ_{D^c}‖_1 = ‖Δ_S‖_1 + ‖Δ_{S^c}‖_1 yields
\[
\begin{aligned}
&\sqrt{\frac{d\log(d/\delta)}{n}}\|\Delta_D\|_F + \sqrt{\frac{\log(d/\delta)}{n}}\|\Delta_{D^c}\|_1 + \lambda\big(\|\Delta_S\|_1 - \|\Delta_{S^c}\|_1\big)\\
&\qquad\le \sqrt{\frac{d\log(d/\delta)}{n}}\|\Delta_D\|_F + \sqrt{\frac{\log(d/\delta)}{n}}\|\Delta_S\|_1\\
&\qquad\le \sqrt{\frac{d\log(d/\delta)}{n}}\|\Delta_D\|_F + \sqrt{\frac{k\log(d/\delta)}{n}}\|\Delta_S\|_F\\
&\qquad\le \sqrt{\frac{(d+k)\log(d/\delta)}{n}}\|\Delta\|_F\,.
\end{aligned}
\]

Combining this with a left-hand side of ‖Δ‖_F (the case ‖Δ‖_F ≥ 1) yields ‖Δ‖_F ≲ √((d+k)log(d/δ)/n) < 1 for n ≳ (d + k) log(d/δ), a contradiction with ‖Δ‖_F ≥ 1, so this case cannot occur. Combining it with a left-hand side of ‖Δ‖_F² (the case ‖Δ‖_F ≤ 1) gives us
\[
\|\Delta\|_F^2 \lesssim \frac{(d+k)\log(d/\delta)}{n}\,,
\]
as desired.
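In practice, the penalized estimator (5.11) is known as the graphical lasso and can be computed with off-the-shelf software. The sketch below uses scikit-learn's GraphicalLasso as a solver; note that its regularization parameter α and penalty convention need not match λ and the off-diagonal-only penalty in (5.11) exactly, and the sparse precision matrix and the value of α are arbitrary choices of this illustration.

```python
# Sketch: computing a sparse precision-matrix estimate in the spirit of (5.11)
# with scikit-learn's graphical lasso.
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(2)
d, n = 20, 2000
Theta_star = np.eye(d)
for i in range(d - 1):                        # a sparse (tridiagonal) precision matrix
    Theta_star[i, i + 1] = Theta_star[i + 1, i] = 0.3
Sigma_star = np.linalg.inv(Theta_star)

X = rng.multivariate_normal(np.zeros(d), Sigma_star, size=n)
model = GraphicalLasso(alpha=0.05).fit(X)
Theta_hat = model.precision_                  # estimated inverse covariance matrix

print(np.linalg.norm(Theta_hat - Theta_star, "fro") ** 2)          # Frobenius error
print((np.abs(Theta_hat) > 1e-3).sum(), (Theta_star != 0).sum())   # sparsity levels
```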

Write Σ* = W*Γ*W*, where (W*)² = Σ*_D is the diagonal part of Σ*, so that Γ* is the associated correlation matrix, and similarly define Ŵ by Ŵ² = Σ̂_D and Γ̂ = Ŵ^{-1}Σ̂Ŵ^{-1}. Define a slightly modified estimator by replacing Σ̂ in (5.11) by Γ̂ to get an estimator K̂ for K = (Γ*)^{-1}.

\[
\begin{aligned}
\|\hat\Gamma - \Gamma\|_\infty &= \|\hat W^{-1}\hat\Sigma\hat W^{-1} - (W^*)^{-1}\Sigma^*(W^*)^{-1}\|_\infty &&(5.12)\\
&\le \|\hat W^{-1}-(W^*)^{-1}\|_{\infty,\infty}\,\|\hat\Sigma-\Sigma^*\|_\infty\,\|\hat W^{-1}-(W^*)^{-1}\|_{\infty,\infty}\\
&\quad + \|\hat W^{-1}-(W^*)^{-1}\|_{\infty,\infty}\big(\|\hat\Sigma\|_\infty\|(W^*)^{-1}\|_{\infty,\infty} + \|\Sigma^*\|_\infty\|\hat W^{-1}\|_{\infty,\infty}\big)\\
&\quad + \|\hat W^{-1}\|_{\infty,\infty}\,\|\hat\Sigma-\Sigma^*\|_\infty\,\|(W^*)^{-1}\|_{\infty,\infty}\,. &&(5.13)
\end{aligned}
\]

Note that ‖A‖_∞ ≤ ‖A‖_op, so that if n ≳ log(d/δ), then ‖Γ̂ − Γ‖_∞ ≲ √(log(d/δ)/n). From the above arguments, we conclude that ‖K̂ − K‖_F ≲ √(k log(d/δ)/n) and, with a calculation similar to (5.13), that
\[
\|\hat\Theta' - \Theta^*\|_{\mathrm{op}} \lesssim \sqrt{\frac{k\log(d/\delta)}{n}}\,.
\]

This will not work in Frobenius norm since there, we only have ‖Ŵ² − W²‖_F² ≲ d √(log(d/δ)/n).

Lower bounds

Here, we will show that the bounds in Theorem 5.19 are optimal up to log factors. The argument is again based on applying Fano's inequality, Theorem 4.10.

In order to calculate the KL divergence between two Gaussian distributions with densities fΣ1 and fΣ2 , we first observe that

\[
\log\frac{f_{\Sigma_1}(x)}{f_{\Sigma_2}(x)} = -\frac12 x^\top\Theta_1 x + \frac12\log\det(\Theta_1) + \frac12 x^\top\Theta_2 x - \frac12\log\det(\Theta_2)\,,
\]
so that
\[
\begin{aligned}
KL(\mathbb{P}_{\Sigma_1},\mathbb{P}_{\Sigma_2}) &= \mathbb{E}_{\Sigma_1}\Big[\log\frac{f_{\Sigma_1}}{f_{\Sigma_2}}(X)\Big]\\
&= \frac12\big[\log\det(\Theta_1)-\log\det(\Theta_2)\big] + \frac12\,\mathbb{E}\big[\mathrm{Tr}\big((\Theta_2-\Theta_1)XX^\top\big)\big]\\
&= \frac12\big[\log\det(\Theta_1)-\log\det(\Theta_2) + \mathrm{Tr}\big((\Theta_2-\Theta_1)\Sigma_1\big)\big]\\
&= \frac12\big[\log\det(\Theta_1)-\log\det(\Theta_2) + \mathrm{Tr}(\Theta_2\Sigma_1) - d\big]\,.
\end{aligned}
\]
Writing Δ = Θ_2 − Θ_1 and Θ̃ = Θ_1 + tΔ, t ∈ [0, 1], a Taylor expansion then yields
\[
KL(\mathbb{P}_{\Sigma_1},\mathbb{P}_{\Sigma_2}) = \frac12\Big[-\mathrm{Tr}(\Delta\Sigma_1) + \frac12\mathrm{Tr}\big(\tilde\Theta^{-1}\Delta\tilde\Theta^{-1}\Delta\big) + \mathrm{Tr}(\Delta\Sigma_1)\Big]
\le \frac14\lambda_{\max}(\tilde\Theta^{-1})\,\|\Delta\|_F^2
= \frac14\lambda_{\min}^{-1}(\tilde\Theta)\,\|\Delta\|_F^2\,.
\]
This means we are in good shape to apply our standard tricks if we can make sure that λ_min(Θ̃) stays large enough among our hypotheses. To this end, assume k ≤ d²/16 and define the collection of matrices (B_j)_{j∈[M_1]} with entries in {0, 1} such that they are symmetric, have zero diagonal, only have non-zero entries in a band of size √k around the diagonal, and whose entries are given by a flattening of the vectors obtained by the sparse Varshamov-Gilbert

Lemma 4.14. Moreover, let (C_j)_{j∈[M_2]} be diagonal matrices with entries in {0, 1} given by the regular Varshamov-Gilbert Lemma 4.12. Then, define
\[
\Theta_j = \Theta_{j_1,j_2} = I_d + \frac{\beta}{\sqrt n}\Big(\sqrt{\log(1+d/(2k))}\,B_{j_1} + C_{j_2}\Big)\,,\qquad j_1\in[M_1],\ j_2\in[M_2]\,.
\]

We can ensure that λ_min(Θ̃) stays bounded away from zero by the Gershgorin circle theorem. For each row of Θ̃ − I_d,
\[
\sum_l\big|(\tilde\Theta - I_d)_{il}\big| \le \frac{2\beta(\sqrt k+1)}{\sqrt n}\sqrt{\log(1+d/(2k))} < \frac12
\]
if n ≳ k log(1 + d/(2k)). Hence,
\[
\|\Theta_j - \Theta_l\|_F^2 = \frac{\beta^2}{n}\Big(\rho(B_{j_1},B_{l_1})\log(1+d/(2k)) + \rho(C_{j_2},C_{l_2})\Big) \gtrsim \frac{\beta^2}{n}\big(k\log(1+d/(2k)) + d\big)\,,
\]
and
\[
\|\Theta_j - \Theta_l\|_F^2 \lesssim \frac{\beta^2}{n}\big(k\log(1+d/(2k)) + d\big) \le \frac{\beta^2}{n}\log(M_1M_2)\,,
\]
which yields the following theorem.

Theorem 5.20. Denote by B_k the set of positive definite matrices with at most k non-zero off-diagonal entries and assume n ≳ k log(d). Then, if IP_Θ denotes a Gaussian distribution with inverse covariance matrix Θ and mean 0, there exists a constant c > 0 such that

\[
\inf_{\hat\Theta}\ \sup_{\Theta^*\in\mathcal B_k}\ \mathbb{P}^{\otimes n}_{\Theta^*}\Big[\|\hat\Theta-\Theta^*\|_F^2 \ge c\,\frac{\alpha^2}{n}\big(k\log(1+d/(2k)) + d\big)\Big] \ge \frac12 - 2\alpha\,.
\]

Ising model

Like the Gaussian graphical model, the Ising model is also a model of pairwise interactions, but for random variables that take values in {−1, 1}^d. Before developing our general approach, we describe the main ideas in the simplest Markov random field: an Ising model without external field² [VMLC16]. Such models specify the distribution of a random vector Z = (Z^{(1)}, ..., Z^{(d)}) ∈ {−1, 1}^d as follows:

\[
\mathbb{P}(Z=z) = \exp\big(z^\top Wz - \Phi(W)\big)\,, \tag{5.14}
\]
where W ∈ IR^{d×d} is an unknown matrix of interest with null diagonal elements and
\[
\Phi(W) = \log\sum_{z\in\{-1,1\}^d}\exp\big(z^\top Wz\big)
\]
is a normalization term known as the log-partition function. Fix β, λ > 0. Our goal in this section is to estimate the parameter W subject to the constraint that the jth row e_j^⊤W of W satisfies |e_j^⊤W|_1 ≤ λ. To that end, we observe n independent copies Z_1, ..., Z_n of Z ∼ IP and estimate the matrix W row by row using constrained likelihood maximization. Recall from [RWL10, Eq. (4)] that for any j ∈ [d] and any z^{(j)} ∈ {−1, 1},

\[
\mathbb{P}\big(Z^{(j)} = z^{(j)}\,\big|\,Z^{(\neg j)} = z^{(\neg j)}\big) = \frac{\exp\big(2z^{(j)}e_j^\top Wz\big)}{1+\exp\big(2z^{(j)}e_j^\top Wz\big)}\,, \tag{5.15}
\]
where we used the fact that the diagonal elements of W are equal to zero.

²The presence of an external field does not change our method. It merely introduces an intercept in the logistic regression problem and comes at the cost of more cumbersome notation. All arguments below follow, potentially after minor modifications, and explicit computations are left to the reader.

Therefore, the jth row e_j^⊤W of W may be estimated by performing a logistic regression of Z^{(j)} onto Z^{(¬j)} subject to the constraints |e_j^⊤W|_1 ≤ λ and ‖e_j^⊤W‖_∞ ≤ β. To simplify notation, fix j and define Y = Z^{(j)}, X = Z^{(¬j)} and w* = (e_j^⊤W)^{(¬j)}. Let (X_1, Y_1), ..., (X_n, Y_n) be n independent copies of (X, Y). Then the (rescaled) log-likelihood of a candidate parameter w ∈ IR^{d−1} is denoted by ℓ̄_n(w) and defined by

\[
\bar\ell_n(w) = \frac1n\sum_{i=1}^n\log\Big(\frac{\exp(2Y_iX_i^\top w)}{1+\exp(2Y_iX_i^\top w)}\Big)
= \frac2n\sum_{i=1}^n Y_iX_i^\top w - \frac1n\sum_{i=1}^n\log\big(1+\exp(2Y_iX_i^\top w)\big)\,.
\]
With this representation, it is easy to see that w ↦ ℓ̄_n(w) is a concave function. The (constrained) maximum likelihood estimator (MLE) ŵ ∈ IR^{d−1} is defined to be any solution of the following convex optimization problem:
\[
\hat w \in \mathop{\mathrm{argmax}}_{w\in\mathcal B_1(\lambda)} \bar\ell_n(w)\,. \tag{5.16}
\]
This problem can be solved very efficiently using a variety of methods similar to the Lasso; a small computational sketch is given below. The guarantee in Lemma 5.21 follows from [Rig12].
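To make the row-by-row procedure concrete, here is a computational sketch. It is only an illustration and not the estimator analyzed below: instead of the ℓ1-constrained maximizer in (5.16), it uses ℓ1-penalized logistic regression from scikit-learn (a Lagrangian relaxation), and the regularization strength C and the final symmetrization are choices of this sketch.

```python
# Sketch of neighborhood-based estimation of the Ising interaction matrix W,
# using ell_1-penalized logistic regression as a surrogate for the constrained MLE (5.16).
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_ising_rows(Z, C=0.5):
    """Z: (n, d) array with entries in {-1, +1}. Returns an estimate of W."""
    n, d = Z.shape
    W_hat = np.zeros((d, d))
    for j in range(d):
        y = Z[:, j]                          # response Z^{(j)}
        X = np.delete(Z, j, axis=1)          # covariates Z^{(-j)}
        clf = LogisticRegression(penalty="l1", C=C, solver="liblinear",
                                 fit_intercept=False)
        clf.fit(X, y)
        # The conditional law (5.15) is logistic with linear term 2 e_j^T W z,
        # so the fitted coefficients estimate 2 * (row j of W, jth entry removed).
        W_hat[j, np.arange(d) != j] = clf.coef_.ravel() / 2.0
    return (W_hat + W_hat.T) / 2.0           # only the symmetric part of W matters in (5.14)
```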

Lemma 5.21. Fix δ ∈ (0, 1). Conditionally on (X1,...,Xn), the constrained MLE wˆ defined in (5.16) satisfies with probability at least 1 − δ that

\[
\frac1n\sum_{i=1}^n\big(X_i^\top(\hat w - w^*)\big)^2 \le 2\lambda e^{\lambda}\sqrt{\frac{\log(2(d-1)/\delta)}{2n}}\,.
\]

Proof. We begin with some standard manipulations of the log-likelihood. Note that
\[
\begin{aligned}
\log\Big(\frac{\exp(2Y_iX_i^\top w)}{1+\exp(2Y_iX_i^\top w)}\Big)
&= \log\Big(\frac{\exp(Y_iX_i^\top w)}{\exp(-Y_iX_i^\top w)+\exp(Y_iX_i^\top w)}\Big)\\
&= \log\Big(\frac{\exp(Y_iX_i^\top w)}{\exp(-X_i^\top w)+\exp(X_i^\top w)}\Big)\\
&= (Y_i+1)X_i^\top w - \log\big(1+\exp(2X_i^\top w)\big)\,.
\end{aligned}
\]

Therefore, writing Ỹ_i = (Y_i + 1)/2 ∈ {0, 1}, we get that ŵ is the solution to the following minimization problem:

\[
\hat w = \mathop{\mathrm{argmin}}_{w\in\mathcal B_1(\lambda)}\bar\kappa_n(w)\,,\qquad\text{where}\qquad
\bar\kappa_n(w) = \frac1n\sum_{i=1}^n\Big[-2\tilde Y_iX_i^\top w + \log\big(1+\exp(2X_i^\top w)\big)\Big]\,.
\]

For any w ∈ IR^{d−1}, write κ(w) = IE[κ̄_n(w)], where here and throughout this proof, all expectations are implicitly taken conditionally on X_1, ..., X_n. Observe that
\[
\kappa(w) = \frac1n\sum_{i=1}^n\Big[-2\,\mathbb{E}[\tilde Y_i]X_i^\top w + \log\big(1+\exp(2X_i^\top w)\big)\Big]\,.
\]

Next, we get from the basic inequality κ̄_n(ŵ) ≤ κ̄_n(w), valid for any w ∈ B_1(λ), that

\[
\begin{aligned}
\kappa(\hat w) - \kappa(w) &\le \frac1n\sum_{i=1}^n\big(\tilde Y_i - \mathbb{E}[\tilde Y_i]\big)X_i^\top(\hat w - w)\\
&\le \frac{2\lambda}{n}\,\max_{1\le j\le d-1}\Big|\sum_{i=1}^n\big(\tilde Y_i-\mathbb{E}[\tilde Y_i]\big)X_i^\top e_j\Big|\,,
\end{aligned}
\]
where in the second inequality, we used Hölder's inequality and the fact that |X_i|_∞ ≤ 1 for all i ∈ [n]. Together with Hoeffding's inequality and a union bound, it yields
\[
\kappa(\hat w) - \kappa(w) \le 2\lambda\sqrt{\frac{\log(4(d-1)/\delta)}{2n}}\,,\qquad\text{with probability } 1-\frac\delta2\,.
\]
Note that the Hessian of κ is given by

\[
\nabla^2\kappa(w) = \frac1n\sum_{i=1}^n\frac{4\exp(2X_i^\top w)}{\big(1+\exp(2X_i^\top w)\big)^2}\,X_iX_i^\top\,.
\]

Moreover, for any w ∈ B1(λ), since kXik∞ ≤ 1, it holds

\[
\frac{4\exp(2X_i^\top w)}{\big(1+\exp(2X_i^\top w)\big)^2} \ge \frac{4e^{2\lambda}}{(1+e^{2\lambda})^2} \ge e^{-2\lambda}\,.
\]
Next, since w* is a minimizer of w ↦ κ(w), we get from a second-order Taylor expansion that

\[
\kappa(\hat w) - \kappa(w^*) \ge e^{-2\lambda}\,\frac1n\sum_{i=1}^n\big(X_i^\top(\hat w - w^*)\big)^2\,.
\]
This completes our proof.

Next, we bound from below the quantity

\[
\frac1n\sum_{i=1}^n\big(X_i^\top(\hat w - w^*)\big)^2\,.
\]
To that end, we must exploit the covariance structure of the Ising model. The following lemma is similar to the combination of Lemmas 6 and 7 in [VMLC16].

Lemma 5.22. Fix R > 0, δ ∈ (0, 1) and let Z ∈ IRd be distributed according to (5.14). The following holds with probability 1−δ/2, uniformly in u ∈ B1(2λ):

\[
\frac1n\sum_{i=1}^n(Z_i^\top u)^2 \ge \frac12|u|_\infty^2\exp(-2\lambda) - 4\lambda\sqrt{\frac{\log(2d(d-1)/\delta)}{2n}}\,.
\]

Proof. Note that
\[
\frac1n\sum_{i=1}^n(Z_i^\top u)^2 = u^\top S_n u\,,
\]
where S_n ∈ IR^{d×d} denotes the sample covariance matrix of Z_1, ..., Z_n, defined by
\[
S_n = \frac1n\sum_{i=1}^n Z_iZ_i^\top\,.
\]
Observe that
\[
u^\top S_n u \ge u^\top S u - 2\lambda\,|S_n - S|_\infty\,, \tag{5.17}
\]
where S = IE[S_n]. Using Hoeffding's inequality together with a union bound, we get that
\[
|S_n - S|_\infty \le 2\sqrt{\frac{\log(2d(d-1)/\delta)}{2n}}
\]
with probability 1 − δ/2. Moreover, u^⊤Su = var(Z^⊤u). Assume without loss of generality that |u^{(1)}| = |u|_∞ and observe that

\[
\mathrm{var}(Z^\top u) = \mathbb{E}\big[(Z^\top u)^2\big] = (u^{(1)})^2 + \mathbb{E}\Big[\Big(\sum_{j=2}^d u^{(j)}Z^{(j)}\Big)^2\Big] + 2\,\mathbb{E}\Big[u^{(1)}Z^{(1)}\sum_{j=2}^d u^{(j)}Z^{(j)}\Big]\,.
\]

To control the cross term, let us condition on the neighborhood Z(¬1) to get

\[
\Big|2\,\mathbb{E}\Big[u^{(1)}Z^{(1)}\sum_{j=2}^d u^{(j)}Z^{(j)}\Big]\Big|
= \Big|2\,\mathbb{E}\Big[\mathbb{E}\big[u^{(1)}Z^{(1)}\mid Z^{(\neg1)}\big]\sum_{j=2}^d u^{(j)}Z^{(j)}\Big]\Big|
\le \mathbb{E}\Big[\big(\mathbb{E}\big[u^{(1)}Z^{(1)}\mid Z^{(\neg1)}\big]\big)^2\Big] + \mathbb{E}\Big[\Big(\sum_{j=2}^d u^{(j)}Z^{(j)}\Big)^2\Big]\,,
\]
where we used the fact that 2|ab| ≤ a² + b² for all a, b ∈ IR. The above two displays together yield

\[
\mathrm{var}(Z^\top u) \ge |u|_\infty^2\Big(1 - \sup_{z\in\{-1,1\}^{d-1}}\big(\mathbb{E}\big[Z^{(1)}\mid Z^{(\neg1)}=z\big]\big)^2\Big)\,.
\]

To control the above supremum, recall that from (5.15), we have

\[
\sup_{z\in\{-1,1\}^{d-1}}\mathbb{E}\big[Z^{(1)}\mid Z^{(\neg1)}=z\big] = \sup_{z\in\{-1,1\}^{d-1}}\frac{\exp(2e_1^\top Wz)-1}{\exp(2e_1^\top Wz)+1} \le \frac{\exp(2\lambda)-1}{\exp(2\lambda)+1}\,.
\]

Therefore,
\[
\mathrm{var}(Z^\top u) \ge \frac{|u|_\infty^2}{1+\exp(2\lambda)} \ge \frac12|u|_\infty^2\exp(-2\lambda)\,.
\]
Together with (5.17), it yields the desired result.

Combining the above two lemmas immediately yields the following theorem.

Theorem 5.23. Let ŵ be the constrained maximum likelihood estimator (5.16) and let w* = (e_j^⊤W)^{(¬j)} be the jth row of W with the jth entry removed. Then with probability 1 − δ we have
\[
|\hat w - w^*|_\infty^2 \le 9\lambda\exp(3\lambda)\sqrt{\frac{\log(2d^2/\delta)}{n}}\,.
\]
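To experiment with these estimators one needs samples from (5.14). A standard way to generate approximate samples is Gibbs sampling, which repeatedly redraws one coordinate from the conditional distribution (5.15). The sketch below is a minimal implementation; the number of sweeps between retained samples is an arbitrary choice and controls how close the chain is to its stationary distribution.

```python
# Minimal Gibbs sampler producing approximate samples from the Ising model (5.14),
# using the conditional distribution (5.15). W is assumed to have zero diagonal.
import numpy as np

def ising_gibbs(W, n_samples, n_sweeps=50, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    d = W.shape[0]
    samples = np.empty((n_samples, d))
    z = rng.choice([-1.0, 1.0], size=d)
    for s in range(n_samples):
        for _ in range(n_sweeps):              # full sweeps between retained samples
            for j in range(d):
                field = 2.0 * (W[j] @ z)       # 2 e_j^T W z (z_j does not enter: W_jj = 0)
                p_plus = 1.0 / (1.0 + np.exp(-field))   # (5.15) with z^{(j)} = +1
                z[j] = 1.0 if rng.random() < p_plus else -1.0
        samples[s] = z
    return samples
```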

5.6 PROBLEM SET

Problem 5.1. Using the results of Chapter 2, show that the following holds for the multivariate regression model (5.1).

1. There exists an estimator Θ̂ ∈ IR^{d×T} such that
\[
\frac1n\|X\hat\Theta - X\Theta^*\|_F^2 \lesssim \sigma^2\frac{rT}{n}
\]
with probability .99, where r denotes the rank of X.

2. There exists an estimator Θ̂ ∈ IR^{d×T} such that
\[
\frac1n\|X\hat\Theta - X\Theta^*\|_F^2 \lesssim \sigma^2\frac{|\Theta^*|_0\log(ed)}{n}
\]
with probability .99.

Problem 5.2. Consider the multivariate regression model (5.1) where Y has SVD
\[
Y = \sum_j\hat\lambda_j\hat u_j\hat v_j^\top\,.
\]
Let M̂ be defined by

\[
\hat M = \sum_j\hat\lambda_j\,1\mathrm{I}\big(|\hat\lambda_j| > 2\tau\big)\,\hat u_j\hat v_j^\top\,,\qquad\tau>0\,.
\]

1. Show that there exists a choice of τ such that
\[
\frac1n\|\hat M - X\Theta^*\|_F^2 \lesssim \sigma^2\frac{\mathrm{rank}(\Theta^*)}{n}(d\vee T)
\]
with probability .99.

2. Show that there exists an n × n matrix P such that PM̂ = XΘ̂ for some estimator Θ̂ and
\[
\frac1n\|X\hat\Theta - X\Theta^*\|_F^2 \lesssim \sigma^2\frac{\mathrm{rank}(\Theta^*)}{n}(d\vee T)
\]
with probability .99.

3. Comment on the above results in light of the results obtained in Section 5.2.

Problem 5.3. Consider the multivariate regression model (5.1) and define Θ̂ to be any solution to the minimization problem

\[
\min_{\Theta\in\mathrm{IR}^{d\times T}}\Big\{\frac1n\|Y - X\Theta\|_F^2 + \tau\|X\Theta\|_1\Big\}\,.
\]

1. Show that there exists a choice of τ such that
\[
\frac1n\|X\hat\Theta - X\Theta^*\|_F^2 \lesssim \sigma^2\frac{\mathrm{rank}(\Theta^*)}{n}(d\vee T)
\]
with probability .99. [Hint: Consider the matrix
\[
\sum_j\frac{\hat\lambda_j + \lambda^*_j}{2}\,\hat u_j\hat v_j^\top\,,
\]

where λ*_1 ≥ λ*_2 ≥ ... and λ̂_1 ≥ λ̂_2 ≥ ... are the singular values of XΘ* and Y respectively, and the SVD of Y is given by
\[
Y = \sum_j\hat\lambda_j\hat u_j\hat v_j^\top\,.]
\]

2. Find a closed form for XΘ̂.

Problem 5.4. In the Markowitz theory of portfolio selection [Mar52], a portfolio may be identified with a vector u ∈ IR^d such that u_j ≥ 0 and Σ_{j=1}^d u_j = 1. In this case, u_j represents the proportion of the portfolio invested in asset j. The vector of (random) returns of the d assets is denoted by X ∈ IR^d and we assume that X ∼ subG_d(1) and that IE[XX^⊤] = Σ is unknown. In this theory, the two key characteristics of a portfolio u are its reward µ(u) = IE[u^⊤X] and its risk R(u) = var(X^⊤u). According to this theory, one should fix a minimum reward λ > 0 and choose the optimal portfolio
\[
u^* = \mathop{\mathrm{argmin}}_{u\,:\,\mu(u)\ge\lambda} R(u)\,,
\]
when a solution exists. It is the portfolio that has minimum risk among all portfolios with reward at least λ, provided such portfolios exist. In practice, the distribution of X is unknown. Assume that we observe n independent copies X_1, ..., X_n of X and use them to compute the following estimators of µ(u) and R(u) respectively:

\[
\hat\mu(u) = \bar X^\top u = \frac1n\sum_{i=1}^n X_i^\top u\,,\qquad
\hat R(u) = u^\top\hat\Sigma u\,,\quad \hat\Sigma = \frac1{n-1}\sum_{i=1}^n(X_i-\bar X)(X_i-\bar X)^\top\,.
\]
We use the following estimated portfolio:
\[
\hat u = \mathop{\mathrm{argmin}}_{u\,:\,\hat\mu(u)\ge\lambda}\hat R(u)\,.
\]
We assume throughout that log d ≪ n ≪ d.

1. Show that for any portfolio u, it holds

\[
\hat\mu(u) - \mu(u) \lesssim \frac1{\sqrt n}\,,
\]

and
\[
\hat R(u) - R(u) \lesssim \frac1{\sqrt n}\,.
\]

2. Show that
\[
\hat R(\hat u) - R(\hat u) \lesssim \sqrt{\frac{\log d}{n}}\,,
\]
with probability .99.

3. Define the estimator ũ by:

\[
\tilde u = \mathop{\mathrm{argmin}}_{u\,:\,\hat\mu(u)\ge\lambda-\varepsilon}\hat R(u)
\]

and find the smallest ε > 0 (up to a multiplicative constant) such that we have R(ũ) ≤ R(u*) with probability .99.

Bibliography

[AS08] Noga Alon and Joel H. Spencer. The probabilistic method. Wiley- Interscience Series in Discrete Mathematics and Optimization. John Wiley & Sons, Inc., Hoboken, NJ, third edition, 2008. With an appendix on the life and work of Paul Erd˝os. [AW02] Rudolf Ahlswede and Andreas Winter. Strong converse for iden- tification via quantum channels. IEEE Transactions on Infor- mation Theory, 48(3):569–579, 2002. [Ber09] Dennis S. Bernstein. Matrix mathematics. Princeton University Press, Princeton, NJ, second edition, 2009. Theory, facts, and formulas. [Bil95] Patrick Billingsley. Probability and measure. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons Inc., New York, third edition, 1995. A Wiley-Interscience Pub- lication. [BLM13] St´ephaneBoucheron, G´abor Lugosi, and Pascal Massart. Con- centration inequalities. Oxford University Press, Oxford, 2013. A nonasymptotic theory of independence, With a foreword by Michel Ledoux. [BLT16] Pierre C. Bellec, Guillaume Lecu´e,and Alexandre B. Tsybakov. Slope meets lasso: Improved oracle bounds and optimality. arXiv preprint arXiv:1605.08651, 2016. [Bre15] Guy Bresler. Efficiently learning Ising models on arbitrary graphs. STOC’15—Proceedings of the 2015 ACM Symposium on Theory of Computing, pages 771–782, 2015. [BRT09] Peter J. Bickel, Ya’acov Ritov, and Alexandre B. Tsybakov. Si- multaneous analysis of Lasso and Dantzig selector. Ann. Statist., 37(4):1705–1732, 2009. [BT09] Amir Beck and Marc Teboulle. A fast iterative - thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1):183–202, 2009.


[BV04] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, Cambridge, 2004. [BvS+15] MalgorzataBogdan, Ewout van den Berg, Chiara Sabatti, Wei- jie Su, and Emmanuel J. Cand`es. SLOPE—adaptive variable selection via convex optimization. The annals of applied statis- tics, 9(3):1103, 2015. [Cav11] Laurent Cavalier. Inverse problems in statistics. In Inverse prob- lems and high-dimensional estimation, volume 203 of Lect. Notes Stat. Proc., pages 3–96. Springer, Heidelberg, 2011. [CL11] T. Tony Cai and Mark G. Low. Testing composite hypothe- ses, hermite polynomials and optimal estimation of a nonsmooth functional. Ann. Statist., 39(2):1012–1041, 04 2011. [CT06] Thomas M. Cover and Joy A. Thomas. Elements of information theory. Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, second edition, 2006. [CT07] Emmanuel Candes and Terence Tao. The Dantzig selector: sta- tistical estimation when p is much larger than n. Ann. Statist., 35(6):2313–2351, 2007. [CZ12] T. Tony Cai and Harrison H. Zhou. Minimax estimation of large covariance matrices under `1-norm. Statist. Sinica, 22(4):1319– 1349, 2012. [CZZ10] T. Tony Cai, Cun-Hui Zhang, and Harrison H. Zhou. Opti- mal rates of convergence for covariance matrix estimation. Ann. Statist., 38(4):2118–2144, 2010. [DDGS97] M.J. Donahue, C. Darken, L. Gurvits, and E. Sontag. Rates of convex approximation in non-hilbert spaces. Constructive Approximation, 13(2):187–220, 1997. [EHJT04] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tib- shirani. Least angle regression. Ann. Statist., 32(2):407–499, 2004. With discussion, and a rejoinder by the authors. [FHT10] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Reg- ularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 2010. [FR13] Simon Foucart and Holger Rauhut. A mathematical introduc- tion to compressive sensing. Applied and Numerical Harmonic Analysis. Birkh¨auser/Springer,New York, 2013. [Gro11] David Gross. Recovering low-rank matrices from few coeffi- cients in any basis. IEEE Transactions on Information Theory, 57(3):1548–1566, 2011. Bibliography 159

[Gru03] Branko Grunbaum. Convex polytopes, volume 221 of Graduate Texts in Mathematics. Springer-Verlag, New York, second edi- tion, 2003. Prepared and with a preface by Volker Kaibel, Victor Klee and G¨unter M. Ziegler. [GVL96] Gene H. Golub and Charles F. Van Loan. Matrix computa- tions. Johns Hopkins Studies in the Mathematical Sciences. Johns Hopkins University Press, Baltimore, MD, third edition, 1996. [HTF01] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The elements of statistical learning. Springer Series in Statistics. Springer-Verlag, New York, 2001. Data mining, inference, and prediction. [Joh11] Iain M. Johnstone. Gaussian estimation: Sequence and wavelet models. Unpublished Manuscript., December 2011. [KLT11] Vladimir Koltchinskii, Karim Lounici, and Alexandre B. Tsy- bakov. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Ann. Statist., 39(5):2302–2329, 2011. [Lau96] Steffen L. Lauritzen. Graphical models, volume 17 of Oxford Sta- tistical Science Series. The Clarendon Press, Oxford University Press, New York, 1996. Oxford Science Publications. [Lie73] Elliott H. Lieb. Convex trace functions and the Wigner-Yanase- Dyson conjecture. Advances in Mathematics, 11(3):267–288, 1973. [LPTVDG11] K. Lounici, M. Pontil, A.B. Tsybakov, and S. Van De Geer. Oracle inequalities and optimal inference under group sparsity. Ann. Statist., 39(4):2164–2204, 2011. [LWo12] Po-Ling Loh, Martin J. Wainwright, and others. Structure es- timation for discrete graphical models: Generalized covariance matrices and their inverses. In NIPS, pages 2096–2104, 2012. [Mal09] St´ephaneMallat. A wavelet tour of signal processing. Else- vier/Academic Press, Amsterdam, third edition, 2009. The sparse way, With contributions from Gabriel Peyr´e. [Mar52] Harry Markowitz. Portfolio selection. The journal of finance, 7(1):77–91, 1952. [Nem00] Arkadi Nemirovski. Topics in non-parametric statistics. In Lec- tures on probability theory and statistics (Saint-Flour, 1998), volume 1738 of Lecture Notes in Math., pages 85–277. Springer, Berlin, 2000. Bibliography 160

[Pis81] G. Pisier. Remarques sur un r´esultatnon publi´ede B. Maurey. In Seminar on Functional Analysis, 1980–1981, pages Exp. No. V, 13. Ecole´ Polytech., Palaiseau, 1981. [Rig06] Philippe Rigollet. Adaptive density estimation using the block- wise Stein method. Bernoulli, 12(2):351–370, 2006. [Rig12] Philippe Rigollet. Kullback-Leibler aggregation and misspecified generalized linear models. Ann. Statist., 40(2):639–665, 2012. [Riv90] Theodore J. Rivlin. Chebyshev Polynomials: From Approxima- tion Theory to Algebra and Number Theory. Wiley, July 1990. [RT11] Philippe Rigollet and Alexandre Tsybakov. Exponential screen- ing and optimal rates of sparse estimation. Ann. Statist., 39(2):731–771, 2011. [Rus02] Mary Beth Ruskai. Inequalities for quantum entropy: A review with conditions for equality. Journal of Mathematical Physics, 43(9):4358–4375, 2002. [RWL10] Pradeep Ravikumar, Martin J. Wainwright, and John D. Lafferty. High-dimensional Ising model selection using `1- regularized logistic regression. Ann. Statist., 38(3):1287–1319, 2010. [Sha03] Jun Shao. Mathematical statistics. Springer Texts in Statistics. Springer-Verlag, New York, second edition, 2003. [Tib96] Robert Tibshirani. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58(1):267–288, 1996. [Tro12] Joel A. Tropp. User-friendly tail bounds for sums of random ma- trices. Foundations of computational mathematics, 12(4):389– 434, 2012. [Tsy03] Alexandre B. Tsybakov. Optimal rates of aggregation. In Bern- hard Sch¨olkopf and Manfred K. Warmuth, editors, COLT, vol- ume 2777 of Lecture Notes in Computer Science, pages 303–313. Springer, 2003. [Tsy09] Alexandre B. Tsybakov. Introduction to nonparametric estima- tion. Springer Series in Statistics. Springer, New York, 2009. Revised and extended from the 2004 French original, Translated by Vladimir Zaiats. [Ver18] Roman Vershynin. High-Dimensional Probability. Cambridge University Press (to appear), 2018. [vH17] Ramon van Handel. Probability in high dimension. Lecture Notes (Princeton University), 2017. Bibliography 161

[VMLC16] Marc Vuffray, Sidhant Misra, Andrey Lokhov, and Michael Chertkov. Interaction screening: Efficient and sample-optimal learning of Ising models. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2595–2603. Curran Associates, Inc., 2016.