Applied Statistics and Machine Learning
Theory: Stability, CLT, and Sparse Modeling Bin Yu, IMA, June 26, 2013
Today's plan
1. Reproducibility and statistical stability
2. In the classical world: CLT as a stability result; asymptotic normality and the score function for parametric inference; validity of the bootstrap
3. In the modern world: L2 estimation rate of the Lasso and extensions; model selection consistency of the Lasso; sample-to-sample variability meets heavy-tailed noise
2013: Information technology and data
“Although scientists have always comforted themselves with the thought that science is self-correcting, the immediacy and rapidity with which knowledge disseminates today means that incorrect information can have a profound impact before any corrective process can take place”.
“Reforming Science: Methodological and Cultural Reforms,” Infection and Immunity Editorial (2012) by Arturo Casadevall (Editor in Chief, mBio) and Ferric C. Fang (Editor in Chief, Infection and Immunity)
2013: IT and data
“A recent study* analyzed the cause of retraction for 788 retracted papers and found that error and fraud were responsible for 545 (69%) and 197 (25%) cases, respectively, while the cause was unknown in 46 (5.8%) cases (31).” -- Casadevall and Fang (2012)
* R Grant Steen (2011), J. Med. Ethics
Casadevall and Fang called for “Enhanced training in probability and statistics”
Scientific Reproducibility
Recent papers on reproducibility: Ioannidis, 2005; Kraft et al, 2009; Nosek et al, 2012; Naik, 2011; Booth, 2012; Donoho et al, 2009; Fanio et al, 2012; …
Statistical stability is a minimum requirement for scientific reproducibility.
Statistical stability: statistical conclusions should be stable to appropriate perturbations to data and/or models.
Stable statistical results are more reliable.
Scientific reproducibility is our responsibility
- Stability is a minimum requirement for reproducibility
- Statistical stability: statistical conclusions should be robust to appropriate perturbations to data
Data perturbation has a long history

Data perturbation schemes are now routinely employed to estimate the bias, variance, and sampling distribution of an estimator.
- Jackknife: Quenouille (1949, 1956), Tukey (1958), Mosteller and Tukey (1977), Efron and Stein (1981), Wu (1986), Carlstein (1986), Wu (1990), …
- Sub-sampling: Mahalanobis (1949), Hartigan (1969, 1970), Politis and Romano (1992, 1994), Bickel, Götze, and van Zwet (1997), …
- Cross-validation: Allen (1974), Stone (1974, 1977), Hall (1983), Li (1985), …
- Bootstrap: Efron (1979), Bickel and Freedman (1981), Beran (1984), …

Stability argument is central to limiting law results

One proof of the CLT (bedrock of classical statistical theory):
1. Universality of a limiting law for a normalized sum of iid random variables, through a stability argument or Lindeberg's swapping trick: perturbing a (normalized) sum by a random variable with matching first and second moments does not change the distribution of the (normalized) sum in the limit.
2. Finding the limit law via an ODE.
cf. Lecture notes of Terence Tao at http://terrytao.wordpress.com/2010/01/05/254a-notes-2-the-central-limit-theorem/
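The swapping heuristic in step 1 can be illustrated numerically. The sketch below (my own, not from the lecture) compares E f of a normalized Rademacher sum with E f of a sum of Gaussians matching the first two moments:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 100, 40_000

def f(x):
    return np.cos(x)  # a bounded smooth test function

# iid mean-0, variance-1 summands: Rademacher (+/-1) vs. matching Gaussians
X = rng.choice([-1.0, 1.0], size=(reps, n))
Z = rng.standard_normal((reps, n))

S_X = X.sum(axis=1) / np.sqrt(n)
S_Z = Z.sum(axis=1) / np.sqrt(n)

gap = abs(f(S_X).mean() - f(S_Z).mean())
print(f"|E f(S_n) - E f(Gaussian sum)| = {gap:.4f}")
```

The gap is tiny already for moderate n, which is the content of the universality step.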
CLT: a proof via Lindeberg's swapping trick

Let X_1, …, X_n be iid with mean 0 and variance 1, let Z_1, …, Z_n be iid N(0, 1) independent of the X_i, and form the hybrid sums

    S_{n,i} = (X_1 + … + X_i + Z_{i+1} + … + Z_n) / √n

For a smooth bounded test function f, a Taylor expansion shows that swapping X_i for Z_i changes E f by O(n^{−3/2}), since the first and second moments match and the third moments are bounded.

Then we get, summing over the n swaps,

    |E f((X_1 + … + X_n)/√n) − E f((Z_1 + … + Z_n)/√n)| → 0,

so the normalized sum of the X_i has the same limiting distribution as the standard Gaussian.
For a complete proof, see http://terrytao.wordpress.com/2010/01/05/254a-notes-2-the-central-limit-theorem/

Asymptotic normality of MLE

Via Taylor expansion, for a nice parametric family f(x, θ), it can be seen that the MLE θ̂_n from n iid samples X_1, …, X_n has the following approximation:
    √n(θ̂_n − θ) ≈ I(θ)^{−1} (Y_1 + … + Y_n)/√n

(here I(θ) denotes the Fisher information), where Y_i = U_θ(X_i) and U_θ is the score function, with first moment (expected value) zero:

    U_θ(x) = d log f(x, θ) / dθ
Asymptotic normality of MLE comes from stability
Lindeberg’s swapping trick proof of CLT means that
MLE is asymptotically normal because we can swap an independent normal random variable (with mean zero and variance equal to the Fisher information) into the sum of score functions without changing the sum much; there is stability.
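A quick simulation (a sketch of mine, not from the slides) checks this for the Exponential(θ) family f(x, θ) = θ e^{−θx}, where the score is U_θ(x) = 1/θ − x, the Fisher information is I(θ) = 1/θ², and the MLE is 1/x̄:

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n, reps = 2.0, 1000, 4000

# Exponential(theta): score U(x) = 1/theta - x, Fisher information I(theta) = 1/theta^2,
# and the MLE from a sample is 1/xbar.
x = rng.exponential(scale=1.0 / theta, size=(reps, n))
mle = 1.0 / x.mean(axis=1)
centered = np.sqrt(n) * (mle - theta)

print("empirical variance of sqrt(n)(theta_hat - theta):", centered.var())
print("inverse Fisher information 1/I(theta) = theta^2  :", theta**2)
```

The empirical variance of √n(θ̂_n − θ) matches 1/I(θ) = θ², as asymptotic normality predicts.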
Why does the bootstrap work?
A sketch of the proof: a similar argument can be carried out using Lindeberg's swapping trick, with the minor modification that the second moments are matched only to order 1/√n. That is, let
    S_{n,i} = [(X_1 − µ) + … + (X_i − µ) + (X*_{i+1} − µ̂_n) + … + (X*_n − µ̂_n)] / √n

    S̃_{n,i} = [(X_1 − µ) + … + (X_{i−1} − µ) + (X*_i − µ̂_n) + (X*_{i+1} − µ̂_n) + … + (X*_n − µ̂_n)] / √n

Then the summands in the two sums above have the same first moment 0, their second moments differ by O(1/√n), and their third moments are finite. Going through Tao's proof, the second-moment mismatch contributes an extra term of order O(n^{−3/2}), which is fine.
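As a minimal practical illustration of the bootstrap (my own sketch; the setup is mine, not from the slides), here is a percentile bootstrap confidence interval for a mean:

```python
import numpy as np

rng = np.random.default_rng(2)
n, B = 200, 2000

x = rng.exponential(scale=1.0, size=n)   # one observed sample; true mean = 1

# Resample with replacement B times and record the bootstrap means
boot_means = np.array([rng.choice(x, size=n, replace=True).mean() for _ in range(B)])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean = {x.mean():.3f}, 95% percentile bootstrap CI = [{lo:.3f}, {hi:.3f}]")
```

The bootstrap distribution is centered at the sample mean, and its spread estimates the sampling variability without any normality assumption.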
Stability argument is central

Recent generalizations obtain other universal limiting distributions, e.g., the Wigner law under non-Gaussian assumptions and last passage percolation (Chatterjee, 2006; Suidan, 2006; …)
Concentration results also assume stability-type conditions
In learning theory, stability is closely related to good generalization performance (Devroye and Wagner, 1979; Kearns and Ron, 1999; Bousquet and Elisseeff, 2002; Kutin and Niyogi, 2002; Mukherjee et al, 2006; …)
Lasso and/or Compressed Sensing: much theory lately

Compressed sensing emphasizes the choice of design matrix (often iid Gaussian entries).

Theoretical studies: much recent work on the Lasso in terms of
- L2 error of the parameter
- model selection consistency
- L2 prediction error (not covered in this lecture)
Regularized M-estimation, including Lasso

Estimation: minimize a loss function plus a regularization term:

    θ̂_{λ_n} ∈ argmin_{θ ∈ R^p} { L_n(θ; X_1^n) + λ_n r(θ) }

where L_n is the loss function and r is the regularizer.

Goal: for some error metric d (e.g., L2 error), bound the difference d(θ̂_{λ_n} − θ*) in a high-dimensional scaling where (n, p) tends to infinity.
Example 1: Lasso (sparse linear model)
[Diagram: y = Xθ* + w, with y of dimension n×1, X of dimension n×p, and θ* supported on S and zero on S^c.]
Set-up: noisy observations y = Xθ* + w with sparse θ*. Estimator: the Lasso program
    θ̂_{λ_n} ∈ argmin_θ { (1/n) Σ_{i=1}^n (y_i − x_i^T θ)² + λ_n Σ_{j=1}^p |θ_j| }

Some past work: Tropp, 2004; Fuchs, 2000; Meinshausen/Buhlmann, 2006; Candes/Tao, 2005; Donoho, 2005; Zhao & Yu, 2006; Zou, 2006; Wainwright, 2009; Koltchinskii, 2007; Tsybakov et al., 2007; van de Geer, 2007; Zhang and Huang, 2008; Meinshausen and Yu, 2009; Bickel et al., 2008; Negahban et al., 2012.
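A minimal numerical sketch of the Lasso program above (mine, not from the slides), solved by plain proximal gradient descent (ISTA) rather than a production solver:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, k, sigma = 200, 500, 5, 0.5

X = rng.standard_normal((n, p))
theta_star = np.zeros(p)
theta_star[:k] = 1.0
y = X @ theta_star + sigma * rng.standard_normal(n)

lam = 2 * sigma * np.sqrt(np.log(p) / n)   # regularization at the theory-suggested level

def soft(z, t):
    """Soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# ISTA (proximal gradient) on (1/n)||y - X theta||_2^2 + lam * ||theta||_1
L = 2.0 / n * np.linalg.norm(X, 2) ** 2    # Lipschitz constant of the smooth part
theta = np.zeros(p)
for _ in range(500):
    grad = 2.0 / n * X.T @ (X @ theta - y)
    theta = soft(theta - grad / L, lam / L)

err = np.linalg.norm(theta - theta_star)
print(f"||theta_hat - theta*||_2 = {err:.3f}")
```

Even with p = 500 > n = 200, the L2 error stays small because θ* is k-sparse, which is the phenomenon quantified later in the lecture.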
Example 2: Low-rank matrix approximation
[Diagram: Θ* ≈ U D V^T, with Θ* of dimension k×m, U of dimension k×r, D of dimension r×r, and V^T of dimension r×m.]
Set-up: matrix Θ* ∈ R^{k×m} with rank r ≪ min{k, m}.

Estimator:

    Θ̂ ∈ argmin_Θ { (1/n) Σ_{i=1}^n (y_i − ⟨⟨X_i, Θ⟩⟩)² + λ_n Σ_{j=1}^{min{k,m}} σ_j(Θ) }

Some past work: Frieze et al, 1998; Achlioptas & McSherry, 2001; Fazel & Boyd, 2001; Srebro et al, 2004; Drineas et al, 2005; Rudelson & Vershynin, 2006; Recht et al, 2007; Bach, 2008; Candes and Tao, 2009; Halko et al, 2009; Keshavan et al, 2009; Negahban & Wainwright, 2009; Tsybakov (?)
Example 3: Structured (inverse) Cov. Estimation
[Diagram: a five-node graph and the corresponding zero pattern of the inverse covariance matrix.]
Set-up: samples from a random vector with sparse covariance Σ or sparse inverse covariance Θ* ∈ R^{p×p}.

Estimator:

    Θ̂ ∈ argmin_Θ { ⟨⟨ (1/n) Σ_{i=1}^n x_i x_i^T, Θ ⟩⟩ − log det(Θ) + λ_n Σ_{i≠j} |Θ_{ij}| }

Some past work: Yuan and Lin, 2006; d'Aspremont et al, 2007; Bickel & Levina, 2007; El Karoui, 2007; Rothman et al, 2007; Zhou et al, 2007; Friedman et al, 2008; Ravikumar et al, 2008; Lam and Fan, 2009; Cai and Zhou, 2009.

Unified Analysis

S. Negahban, P. Ravikumar, M. J. Wainwright and B. Yu (2012). A unified framework for the analysis of regularized M-estimators. Statistical Science, 27(4): 538-557.

Many high-dimensional models and associated results have been handled case by case.
Is there a core set of ideas that underlie these analyses?
We discovered that two key conditions are needed for a unified error bound:
- decomposability of the regularizer r (e.g., the L1 norm)
- restricted strong convexity of the loss function (e.g., the least-squares loss)
Why can we estimate parameters?
Strong convexity of the loss (curvature, captured by the Fisher information when p is fixed):

[Figure: (a) high curvature: easy to estimate; (b) low curvature: harder.]

Asymptotic analysis of the maximum likelihood estimator or OLS:
1. Fisher information (Hessian matrix) non-degenerate, i.e., strong convexity in all directions.
2. Sampling noise disappears as the sample size gets large.
In high-dim, when r corresponds to structure (e.g., sparsity)
1. Restricted strong convexity (RSC) (forced on us by high dimensions):
- loss functions are often flat in many directions in high dimensions
- "curvature" is needed only for directions ∆ ∈ C
- the loss function L_n(θ) := L_n(θ; X_1^n) satisfies, for all ∆ ∈ C,

    L_n(θ* + ∆) − L_n(θ*) − ⟨∇L_n(θ*), ∆⟩ ≥ γ(L) d²(∆)

(the left-hand side is the excess loss minus the score term; d²(∆) is the squared error).
2. Decomposability of the regularizer makes C small in high dimensions:
- for subspace pairs (A, B^⊥), where A represents the model constraints,

    r(u + v) = r(u) + r(v) for all u ∈ A and v ∈ B^⊥

- decomposability forces the error ∆̂ = θ̂_{λ_n} − θ* into C.

In high-dim, when r corresponds to structure (e.g., sparsity)
When p > n, as in the fMRI problem, least squares is flat in many directions; it is impossible to have strong convexity in all directions. Regularization is needed.
When λ_n is large enough to overcome the sampling noise, we have a deterministic situation:
- The decomposable regularization norm forces the estimation difference θ̂_{λ_n} − θ* into a constraint or "cone" set C.
- This constraint set is small relative to R^p when p is large.
- Strong convexity is needed only over this constraint set.
When the predictors are random and Gaussian (dependence is ok), strong convexity does hold for least squares over the l1-norm-induced constraint set.

Main Result (Negahban et al, 2012)
Theorem (Negahban, Ravikumar, Wainwright & Yu, 2012). Suppose the regularizer decomposes across the pair (A, B^⊥) with A ⊆ B, and restricted strong convexity holds for (A, B^⊥) and over C. With the regularization constant chosen such that λ_n ≥ 2 r*(∇L(θ*; X_1^n)), any solution θ̂_{λ_n} satisfies

    d(θ̂_{λ_n} − θ*) ≤ Ψ(B) λ_n / γ(L) + √( 2 λ_n r(π_{A^⊥}(θ*)) / γ(L) )

where the first term is the estimation error and the second the approximation error.

Quantities that control the rates:
- restricted strong convexity parameter γ(L)
- dual norm of the regularizer: r*(v) := sup_{r(u)=1} ⟨v, u⟩
- optimal subspace constant: Ψ(B) = sup_{θ ∈ B\{0}} r(θ)/d(θ)
More work is required for each case: verify restricted strong convexity and find λ_n to overcome the sampling noise (concentration inequalities).

Recovering an existing result of Bickel et al (2008)
Example: Linear regression (exact sparsity)
Lasso program: min_{θ ∈ R^p} ||y − Xθ||²_2 + λ_n ||θ||_1

- RSC reduces to a lower bound on the restricted singular values of X ∈ R^{n×p}
- for a k-sparse vector, ||θ||_1 ≤ √k ||θ||_2
Corollary. Suppose the true parameter θ* is exactly k-sparse. Under RSC, and with λ_n ≥ 2||X^T ε/n||_∞, any Lasso solution satisfies ||θ̂_{λ_n} − θ*||_2 ≤ √k λ_n / γ(L).

Some stochastic instances recover known results:
- Compressed sensing: X_ij ~ N(0, 1) and bounded noise ||ε||_2 ≤ σ√n
- Deterministic design: X with bounded columns and ε_i ~ N(0, σ²)

    ||X^T ε/n||_∞ ≤ 2σ√(log p / n) w.h.p.  ⟹  ||θ̂_{λ_n} − θ*||_2 ≤ (8σ/γ(L)) √(k log p / n)
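The high-probability bound on ||X^T ε/n||_∞ that drives this choice of λ_n can be checked by simulation. In the sketch below (my own; I normalize columns to norm √n, an assumption not stated on the slide), the bound holds in nearly every trial:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma, reps = 200, 1000, 1.0, 200

bound = 2 * sigma * np.sqrt(np.log(p) / n)
hits = 0
for _ in range(reps):
    X = rng.standard_normal((n, p))
    X /= np.linalg.norm(X, axis=0) / np.sqrt(n)   # columns normalized to norm sqrt(n)
    eps = sigma * rng.standard_normal(n)
    if np.max(np.abs(X.T @ eps) / n) <= bound:
        hits += 1

print(f"fraction of trials with ||X'eps/n||_inf <= 2*sigma*sqrt(log p / n): {hits / reps:.2f}")
```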
(e.g., Candes & Tao, 2007; Meinshausen/Yu, 2008; Zhang and Huang, 2008; Bickel et al., 2009)

Obtaining a new result
Example: Linear regression (weak sparsity)
For some q ∈ [0, 1], say θ* belongs to the ℓ_q-"ball"

    B_q(R_q) := { θ ∈ R^p : Σ_{j=1}^p |θ_j|^q ≤ R_q }.
Corollary. For θ* ∈ B_q(R_q), any Lasso solution satisfies (w.h.p.)

    ||θ̂_{λ_n} − θ*||²_2 ≤ O( R_q (σ² log p / n)^{1 − q/2} ).

This rate is known to be minimax optimal (Raskutti et al., 2009).
Effective sample size n/log(p)
We lose at least a factor of log(p) for having to search over a set of p predictors for a small number of relevant predictors.
For the static image data, we lose a factor of log(10921)=9.28 so the effective sample size is not n=1750, but 1750/9.28=188
For the movie data, we lose at least a factor of log(26000)=10.16 so the effective sample size is not n=7200, but 7200/10.16=708
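These effective-sample-size figures can be recomputed directly (using natural logs; small rounding differences from the slide numbers are possible):

```python
import numpy as np

# Effective sample size n / log(p) for the two data sets above
for name, n, p in [("static image", 1750, 10921), ("movie", 7200, 26000)]:
    print(f"{name}: n = {n}, p = {p}, log(p) = {np.log(p):.2f}, n/log(p) = {n / np.log(p):.0f}")
```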
The log p factor is very much a consequence of the light-tailed Gaussian (or sub-Gaussian) noise assumption.
We will lose more if the noise term has a heavier tail.
Summary of unified analysis

Recovered existing results:
- sparse linear regression with the Lasso
- multivariate group Lasso
- sparse inverse covariance estimation

Derived new results:
- weakly sparse linear model with the Lasso
- low-rank matrix estimation
- generalized linear models
An MSE result for L2Boosting
Peter Buhlmann and Bin Yu (2003). Boosting with the L2 Loss: Regression and Classification. J. Amer. Statist. Assoc. 98, 324-340.
Note that the variance term is a measure of complexity; it increases by an exponentially diminishing amount as the iteration number m increases, and it is always bounded.
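A minimal componentwise linear L2Boosting sketch (my own simplification with shrinkage ν, not the paper's exact algorithm) makes the slow, always-decreasing fitting process concrete:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, nu, M = 100, 10, 0.1, 200     # nu: shrinkage factor, M: boosting iterations

X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[0] = 2.0
y = X @ beta + rng.standard_normal(n)

coef = np.zeros(p)
resid = y.copy()
col_ss = np.sum(X**2, axis=0)
for m in range(M):
    # componentwise least squares: fit each predictor alone to the current residual
    ls_coef = X.T @ resid / col_ss
    j = np.argmax(np.abs(X.T @ resid))   # predictor most correlated with the residual
    coef[j] += nu * ls_coef[j]           # take a shrunken step on that coordinate
    resid = y - X @ coef

print("training MSE after boosting:", np.mean(resid**2))
```

Each step provably decreases the residual sum of squares, while the added complexity per step shrinks, in line with the variance remark above.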
Model Selection Consistency of Lasso

Zhao and Yu (2006). On model selection consistency of Lasso. JMLR, 7, 2541-2563.

Set-up:
Linear regression model
n observations and p predictors
Assume (A):
Knight and Fu (2000) showed L2 estimation consistency under (A) for fixed p.
Model Selection Consistency of Lasso
- p small, n large (Zhao and Yu, 2006): assume (A) and the irrepresentable condition

    |sign((β_1, …, β_s)) (X_S^T X_S)^{−1} X_S^T X_{S^c}| < 1   (elementwise; a 1-by-(p − s) vector)

Then, roughly*, model selection consistency holds.

* There is some ambiguity when equality holds.
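The irrepresentable quantity is easy to compute for a given covariance. This sketch (function name mine) evaluates a population version for equicorrelation designs with s = 2 relevant predictors out of p = 3, matching the r values on the geometry slide; for same-sign β_S the value works out to 2r/(1 + r):

```python
import numpy as np

def irrepresentable_value(Sigma, S):
    """Max-abs entry of sign(beta_S)' (Sigma_SS)^{-1} Sigma_{S,Sc}, assuming beta_S > 0."""
    p = Sigma.shape[0]
    Sc = [j for j in range(p) if j not in S]
    sign = np.ones(len(S))                     # sign(beta_S); all-positive case
    v = sign @ np.linalg.solve(Sigma[np.ix_(S, S)], Sigma[np.ix_(S, Sc)])
    return np.max(np.abs(v))

# Equicorrelation designs with s = 2 relevant predictors out of p = 3
for r in (0.4, 0.6):
    Sigma = (1 - r) * np.eye(3) + r * np.ones((3, 3))
    val = irrepresentable_value(Sigma, S=[0, 1])
    print(f"r = {r}: value = {val:.3f} -> condition {'holds' if val < 1 else 'fails'}")
```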
- Related work: Tropp (2006), Meinshausen and Buhlmann (2006), Zou (2006), Wainwright (2009)
Population version
Irrepresentable condition (s = 2, p = 3): geometry

[Figures: geometry of the condition for r = 0.4 and for r = 0.6.]
Model Selection Consistency of Lasso
- Consistency also holds for s and p growing with n, assuming:
  - the irrepresentable condition
  - bounds on the max and min eigenvalues of the design matrix
  - the smallest nonzero coefficient bounded away from zero
- Gaussian noise (Wainwright, 2009):
- Finite 2k-th moment noise (Zhao & Yu, 2006):
Consistency of Lasso for Model Selection

Interpretation of the condition: regress the irrelevant predictors on the relevant predictors, and consider the L1 norm of the regression coefficients (*).
- If larger than 1, the Lasso cannot distinguish the irrelevant predictor from the relevant predictors for some parameter values.
- If smaller than 1, the Lasso can distinguish the irrelevant predictor from the relevant predictors.
Sufficient conditions (verifiable):
- constant correlation
- power-decay correlation
- bounded correlation*
Bootstrap and Lasso+mLS or Lasso+Ridge

Liu and Yu (2013), http://arxiv.org/abs/1306.5505
Because the Lasso is biased, it is common to use it to select variables and then correct the bias with a modified OLS (mLS) or a ridge regression with a very small smoothing parameter.
Then a residual bootstrap can be carried out to get confidence intervals for the parameters.
Under the irrepresentable condition (IC), Lasso+mLS and Lasso+Ridge are shown to have the parametric MSE rate of k/n (no log p term), and the residual bootstrap is shown to work as well. Even when the IC does not hold, simulations show that this bootstrap seems to work better than the comparison methods in the paper.
See Liu and Yu (2013) for more details.
L1-penalized log Gaussian likelihood

Given n iid observations of X, Banerjee, El Ghaoui and d'Aspremont (2008) solve the L1-penalized Gaussian log-likelihood problem by a block descent algorithm.
Model selection consistency
Ravikumar, Wainwright, Raskutti and Yu (2011) give sufficient conditions for model selection consistency.
Hessian:
Define “model complexity”:
Model selection consistency (Ravikumar et al, 2011)
Assume the corresponding irrepresentable condition holds and either
1. X is sub-Gaussian with parameter σ and the effective sample size n/log p is large enough, or
2. X has finite 4m-th moments.

Then, with high probability as n tends to infinity, the correct model is chosen.
Success probability's dependence on n and p (Gaussian)
- Edge covariances set as Θ*_ij = 0.1. Each point is an average over 100 trials.
- The curves stack up in the second plot, so n/log p controls model selection.
Success probability's dependence on "model complexity" K and n

- Chain graph with p = 120 nodes.
- Curves from left to right have increasing values of K.
- Models with larger K thus require more samples n for the same probability of success.
For model selection consistency results for gradient and/or backward-forward algorithms, see works by T. Zhang:
- Tong Zhang. Sparse Recovery with Orthogonal Matching Pursuit under RIP. IEEE Trans. Info. Theory, 57: 6215-6221, 2011.
- Tong Zhang. Adaptive Forward-Backward Greedy Algorithm for Learning Sparse Representations. IEEE Trans. Info. Theory, 57: 4689-4708, 2011.
Back to stability: robust statistics also aims at stability
Mean functions are fitted with L2 loss. What if the "errors" have heavier tails?

L1 loss is commonly used in robust statistics to deal with heavy-tailed errors in regression. Will L1 loss add more stability?
Model perturbation is used in Robust Statistics
Tukey (1958), Huber (1964, …), Hampel (1968, …), Bickel (1976, …), Rousseeuw (1979, … ), Portnoy (1979, ….), ….
"Overall, and in analogy with, for example, the stability aspects of differential equations or of numerical computations, robustness theories can be viewed as stability theories of statistical inference.” - p. 8, Hampel, Rousseeuw, Ronchetti and Stahel (1986)
Seeking insights through analytical work
For high-dimensional data such as ours, removing some data units can change the outcomes of our model because of feature dependence.
This phenomenon is also seen in simulated data from Gaussian linear models in high-dim.
How does sample-to-sample variability interact with heavy-tailed errors?
Sample stability meets robust statistics in high-dim (El Karoui, Bean, Bickel, Lim and Yu, 2012)

Set-up: linear regression model

    Y_{n×1} = X_{n×p} β_{p×1} + ε_{n×1}

For i = 1, …, n: X_i ~ N(0, Σ_p), ε_i iid with E ε_i = 0.

We consider the random-matrix regime p/n → κ ∈ (0, 1) and the estimator

    β̂ = argmin_{β ∈ R^p} Σ_i ρ(Y_i − X_i β)

Due to invariance, WLOG assume Σ_p = I_p and β = 0.
Sample stability meets robust statistics in high-dim (El Karoui, Bean, Bickel, Lim and Yu, 2012)

RESULT (in an important special case):

1. Let r_ρ(p, n) = ||β̂||; then β̂ is distributed as r_ρ(p, n) U, where U ~ uniform(S^{p−1}).

2. r_ρ(p, n) → r_ρ(κ). Let ẑ_ε := ε + r_ρ(κ) Z with Z independent of ε, and define the proximal mapping

    prox_c(ρ)(x) = argmin_{y ∈ R} [ρ(y) + (x − y)²/(2c)]

Then r_ρ(κ) satisfies the system

    E{ [prox_c(ρ)]′(ẑ_ε) } = 1 − κ
    E{ [ẑ_ε − prox_c(ρ)(ẑ_ε)]² } = κ r_ρ²(κ)
Sample stability meets robust statistics in high-dim (continued)

Limiting results: the normalization constant stabilizes.
Sketch of proof:
- invariance
- a "leave-one-out" trick both ways (reminiscent of the "swapping trick" for the CLT)
- analytical derivations with prox functions (reminiscent of proving normality in the limit for the CLT)
Sample stability meets robust statistics in high-dim (continued)

Corollary of the result: when κ = p/n > 0.3, L2 loss fitting (OLS) is better than L1 loss fitting (LAD) even when the error is double-exponential.
[Figure: ratio of Var(LAD) to Var(OLS); blue: simulated, magenta: analytic.]
Remarks:
- The MLE doesn't work here for a different reason than in cases where penalized MLE works better than MLE: we have unbiased estimators, a non-sparse situation, and the question is about variance.
- The optimal loss function can be calculated (Bean et al, 2012).
- A simulated model with the design matrix from the fMRI data and double-exponential errors shows the same phenomenon: for p/n > 0.3, OLS is better than LAD. This gives some insurance for using the L2 loss function in the fMRI project.
Results hold for more general settings.
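The corollary can be reproduced in a small simulation (my own sketch; LAD is approximated by iteratively reweighted least squares rather than an exact LP solver, and β = 0 WLOG as in the set-up above):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, reps = 200, 100, 30            # kappa = p/n = 0.5 > 0.3

def lad_irls(X, y, iters=30, floor=1e-4):
    """Approximate LAD (L1) regression via iteratively reweighted least squares;
    a simple stand-in for an exact linear-programming solver."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(iters):
        w = 1.0 / np.maximum(np.abs(y - X @ beta), floor)
        Xw = X * w[:, None]
        beta = np.linalg.solve(X.T @ Xw, Xw.T @ y)
    return beta

err_ols = err_lad = 0.0
for _ in range(reps):
    X = rng.standard_normal((n, p))
    y = rng.laplace(scale=1.0, size=n)   # double-exponential errors; beta = 0
    err_ols += np.sum(np.linalg.lstsq(X, y, rcond=None)[0] ** 2)
    err_lad += np.sum(lad_irls(X, y) ** 2)

print(f"avg ||beta_hat||^2: OLS = {err_ols / reps:.3f}, LAD = {err_lad / reps:.3f}")
```

With κ = 0.5, the OLS error comes out smaller than the LAD error even though the noise is double-exponential, matching the corollary.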
John W. Tukey (June 16, 1915 - July 26, 2000)

"What of the future? The future of data analysis can involve great progress, the overcoming of real difficulties, and the provision of a great service to all fields of science and technology. Will it? That remains to us, to our willingness to take up the rocky road of real problems in preference to the smooth road of unreal assumptions, arbitrary criteria, and abstract results without real attachments. Who is for the challenge?"

--- John W. Tukey (1962), "The Future of Data Analysis"
Thank you for your sustained enthusiasm and questions!