
© 2018 by Yunbo Ouyang. All rights reserved.

SCALABLE SPARSITY STRUCTURE LEARNING USING BAYESIAN METHODS

BY

YUNBO OUYANG

DISSERTATION

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Statistics in the Graduate College of the University of Illinois at Urbana-Champaign, 2018

Urbana, Illinois

Doctoral Committee:

Associate Professor Feng Liang, Chair
Professor Annie Qu
Assistant Professor Naveen Narisetty
Assistant Professor Ruoqing Zhu

Abstract

Learning sparsity patterns in high dimensions is a great challenge in both implementation and theory. In this thesis we develop scalable Bayesian algorithms, based on the EM algorithm and variational inference, to learn sparsity structure in various models. Estimation consistency and selection consistency of our methods are established.

First, a nonparametric Bayes estimator is proposed for the problem of estimating a sparse sequence based on Gaussian random variables. We adopt the popular two-group prior with one component being a point mass at zero, and the other component being a mixture of Gaussian distributions. Although the Gaussian family has been shown to be suboptimal for this problem, we find that Gaussian mixtures, with a proper choice on the means and mixing weights, have the desired asymptotic behavior, e.g., the corresponding posterior concentrates on balls with the desired minimax rate.

Second, the above estimator can be directly applied to high dimensional linear classification. In theory, we not only build a bridge connecting the estimation error of the mean difference and the classification error in different scenarios, but also provide sufficient conditions for sub-optimal and optimal classifiers.

Third, we study adaptive ridge regression for linear models. Adaptive ridge regression is closely related to the Bayesian variable selection problem with a Gaussian mixture spike-and-slab prior because it resembles the EM algorithm developed in Wang et al. (2016) for that problem. The output of adaptive ridge regression can be used to construct a distribution estimator that approximates the posterior. We show that the approximate posterior has the desired concentration property and that the adaptive ridge regression estimator has the desired prediction error.

Last, we propose a Bayesian approach to sparse principal components analysis (PCA). We show that our algorithm, which is based on variational approximation, achieves Bayesian selection consistency when both p and n go to infinity.

To my family.

Acknowledgments

First and foremost, I would like to express my sincere gratitude to my advisor Professor Feng Liang. Without her patient guidance and continuous support, I would not have made such progress in my Ph.D. pursuit. I am always inspired by Professor Liang's innovative ideas and immense knowledge. Her continuous encouragement and support have helped me become confident. Her enthusiasm for Statistics always inspires me to work on challenging problems. I cannot imagine my Ph.D. life without Professor Liang's guidance.

Besides my advisor, I am also very grateful to the rest of my thesis committee members: Professor Annie Qu, Professor Naveen Narisetty and Professor Ruoqing Zhu. Their comments and suggestions are constructive and valuable for my research projects. I give my special thanks to Professor Narisetty for helpful discussions during the development of Chapter 4. Meanwhile, I would like to give my great thanks to Dr. Wang and Dr. Jianjun Hu for the enjoyable and wonderful collaboration. Moreover, I am grateful to all the talented members of Professor Liang's research group: Jin Wang, Jianjun Hu, Xichen Huang, Lingrui, Yinyin Chen and Wenjing Yin. Their deep insight, great knowledge and critical comments have substantially improved my understanding of Bayesian Statistics.

Furthermore, I would like to give my thanks to all the faculty, staff and students in the Department of Statistics, UIUC. I sincerely appreciate the help and support from all my friends in the department. My great thanks go to Jin Wang, Jianjun Hu, Xuan, Xichen Huang, Xiwei Tang, Fan and Xiaolu Zhu. Thank you for your company and for countless memorable experiences.

Finally, I want to thank my family: my parents and my girlfriend. I owe my appreciation to my parents for raising me and giving me so much inspiration. Last but not least, I give my great thanks to my girlfriend Yinyin Chen: you have enlightened my life.

Table of Contents

List of Tables ...... viii

List of Figures ...... x

Chapter 1 Introduction ...... 1

1.1 Gaussian Sequence Model ...... 2

1.2 High Dimensional Classification ...... 3

1.3 Variable Selection ...... 3

1.4 Sparse PCA ...... 4

Chapter 2 Empirical Bayes Approach for Sparse Sequence Estimation ...... 5

2.1 Introduction ...... 5

2.2 Main Results ...... 7

2.3 Prior Specification via Hard Thresholding ...... 11

2.4 Implementation ...... 12

2.4.1 Variational Algorithm to Specify Prior ...... 13

2.4.2 Posterior Computation ...... 15

2.4.3 Simulation Studies ...... 16

2.5 Conclusions and Discussions ...... 23

Chapter 3 An Empirical Bayes Approach for High Dimensional Classification ...... 24

3.1 Introduction ...... 24

3.2 Relationship between the Estimation Error and the Classification Error ...... 26

3.2.1 Known Covariance Matrix Σ = Ip ...... 27

3.2.2 Unknown Covariance Matrix ...... 29

3.3 Theory for DP Classifier ...... 32

3.4 Implementation for Empirical Bayes Classifier ...... 34

3.5 Empirical Studies ...... 35

3.5.1 Simulation Studies ...... 36

3.5.2 Classification with Missing Features ...... 41

3.5.3 Real Data Examples ...... 42

3.6 Discussion ...... 44

Chapter 4 Two-Stage Ridge Regression and Posterior Approximation of Bayesian Variable Selection ...... 45

4.1 Introduction ...... 45

4.2 Proposed Algorithm ...... 47

4.2.1 Two-Stage Ridge Regression ...... 47

4.2.2 Distribution Estimator Construction ...... 48

4.3 Sufficient Conditions of Concentration Property ...... 50

4.4 Concentration and Prediction in the Orthogonal Design Case ...... 52

4.4.1 Concentration Property ...... 52

4.4.2 Prediction Accuracy ...... 54

4.5 Concentration and Prediction in Low Dimensional Case ...... 55

4.6 Concentration and Prediction in High Dimensional Case ...... 57

4.7 Conclusion ...... 60

Chapter 5 A Bayesian Approach for Sparse Principal Components Analysis ...... 61

5.1 Introduction ...... 61

5.2 Hybrid of EM and Variational Inference Algorithm ...... 62

5.2.1 Two-stage Method ...... 65

5.3 Theoretical Properties ...... 66

5.3.1 The general case ...... 67

5.3.2 p/n → 0 case ...... 69

5.3.3 p/n → c > 0 case ...... 70

5.3.4 log p/nη → 0 case ...... 71

5.4 Conclusion and Discussion ...... 72

Appendix A Supplemental Material for Chapter 2 ...... 73

A.1 Proof for Lemma 2.1 ...... 73

A.2 Proof for Theorem 2.1 ...... 75

A.3 Proof for Theorem 2.2 ...... 76

A.4 Variational Inference Algorithm Derivation ...... 77

Appendix B Supplemental Material for Chapter 3 ...... 81

B.1 Proof of Theorem 3.2 ...... 81

B.2 Proof of Theorem 3.3 ...... 84

Appendix C Supplemental Material for Chapter 4 ...... 86

C.1 Proof of Theorem 4.1 ...... 86

C.2 Proof of Theorem 4.2 ...... 88

C.3 Proof of Theorem 4.3 ...... 91

C.4 Proof of Theorem 4.4 ...... 93

Appendix D Supplemental Material for Chapter 5 ...... 97

D.1 Proof of Lemma 5.2 ...... 97

D.2 Proof of Theorem 5.1 ...... 97

D.3 Proof of Theorem 5.2 ...... 100

D.4 Proof of Theorem 5.3 ...... 101

References ...... 103

List of Tables

2.1 MSE of Simulation Study 1 in Chapter 2. Errors of the best method are marked in bold . . . 17

2.2 MAE of Simulation Study 1 in Chapter 2. Errors of the best method are marked in bold . . . 18

2.3 MSE of Simulation Study 2 in Chapter 2. Errors of the best method are marked in bold . . . 19

2.4 MAE of Simulation Study 2 in Chapter 2. Errors of the best method are marked in bold . . . 19

2.5 MSE of Simulation Study 3 in Chapter 2. Errors of the best method are marked in bold . . . 20

2.6 MAE of Simulation Study 3 in Chapter 2. Errors of the best method are marked in bold . . . 21

2.7 MSE of Simulation Study 4 in Chapter 2. Errors of the best method are marked in bold . . . 22

2.8 MAE of Simulation Study 4 in Chapter 2. Errors of the best method are marked in bold . . . 22

3.1 Condition Summary of Asymptotic Optimality for different classifiers in Chapter 3 when Cp → c > 0 ...... 33

3.2 Conditions to Guarantee Asymptotic Sub-optimality for different classifiers in Chapter 3 if Cp → ∞ ...... 34

3.3 Conditions to Guarantee Asymptotic Optimality for different classifiers in Chapter 3 if Cp → ∞ ...... 34

3.4 Classification error for simulation study 1, p = 10^4, p − l entries are 0 ...... 36

3.5 Classification error for simulation study 1, p = 10^4, p − l entries are generated from N(0, 0.1^2) ...... 37

3.6 Classification error for simulation study 2, p = 10^4, 2000 entries are 1 for µ1. Other entries are generated from N(0, 0.1^2) ...... 38

3.7 Classification error for simulation study 2, p = 10^4, 1000 entries are 1 for µ1. 100 entries are 2.5. Other entries are generated from N(0, 0.1^2) ...... 39

3.8 Classification error for simulation study 2, p = 10^4, 1000 entries are 1 for µ1. 50 entries are 3.5. Other entries are generated from N(0, 0.1^2) ...... 39

3.9 Average Misclassification Rate for Simulation Study 3 ...... 40

3.10 Classification error with only 50% features available in the test data in Simulation Study 1, p = 10^4, p − l entries are 0 ...... 41

3.11 Classification error with only 50% features available in the test data in Simulation Study 2, p = 10^4, 2000 entries are 1 for µ1. Other entries are generated from N(0, 0.1^2) ...... 42

3.12 Classification error with only 50% features available in the test data in Simulation Study 2, p = 10^4, 1000 entries are 1 for µ1. 100 entries are 2.5. Other entries are generated from N(0, 0.1^2) ...... 42

3.13 Classification accuracy of movie review sentiment analysis data ...... 43

3.14 Training error and test error of leukemia data set ...... 44

4.1 Compare Thresholded Ridge Regression (Shao et al., 2011) and Two-stage Ridge Regression. Bias and variance of m(0) and m when p ≤ n ...... 56

4.2 Compare Thresholded Ridge Regression (Shao et al., 2011) and Two-stage Ridge Regression. Bias and variance of m(0) and m when p ≥ n ...... 59

List of Figures

2.1 Compare Condition 2.4 with Condition 2.5 in Martin and Walker (2014). Black curve is the lower bound of σ^2 for κ ∈ (0, 1) in Condition 2.4 while three gray curves are lower bounds of σ^2 in Condition 2.5 with β = 25, 50, 100 respectively ...... 10

3.1 Estimation error and classification error of DP methods when p − l entries are 0. Red boxplot indicates estimation error while blue boxplot indicates classification error ...... 38

3.2 Boxplot and scatter plot of classification errors of 4 methods ...... 40

3.3 Word cloud of top 100 unigrams and bigrams in magnitude for coefficients for classifiers based on glmnet and DP. Red indicates negative sentiment and green indicates positive sentiment ...... 43

Chapter 1

Introduction

With modern techniques such as microarrays, high dimensional data are now much easier to obtain, but they pose a new challenge for statisticians: it is much more difficult, both computationally and theoretically, to extract useful information from oceans of data. High dimensional data usually involve an exponentially growing number of variables with only a limited sample size. Statisticians are therefore urged to develop scalable algorithms that find sparse signals among abundant noise in many settings, such as sequence estimation, classification, variable selection and principal component analysis. There is a large body of frequentist work in these areas, while Bayesian approaches are relatively limited because sampling-based methods such as MCMC are relatively slow.

In the Bayesian literature, the spike-and-slab prior (Mitchell and Beauchamp, 1988) is widely used to capture sparsity structure. A spike-and-slab prior for an unknown sparse parameter is a mixture of two distributions: one (the slab) has a large variance, while the other (the spike) has a small variance. Latent binary indicators are often introduced for group assignment.
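As a small illustration of this two-group structure (a hypothetical R sketch, not taken from the thesis; the inclusion probability and the two standard deviations below are arbitrary choices), one can draw a sparse coefficient vector as follows.

```r
# Draw p coefficients from a spike-and-slab prior with latent binary indicators.
set.seed(1)
p <- 20
gamma <- rbinom(p, 1, 0.1)                     # latent indicators: 1 = slab, 0 = spike
theta <- ifelse(gamma == 1,
                rnorm(p, mean = 0, sd = 3),    # slab component: large variance
                rnorm(p, mean = 0, sd = 0.01)) # spike component: small variance
```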

We aim to develop scalable algorithms for learning sparsity structure using spike-and-slab priors. The EM algorithm (Dempster et al., 1977) is a well-known algorithm for computing the posterior mode. One advantage of the EM algorithm is that it converges faster than sampling-based methods; however, it only computes the posterior mode and cannot approximate the full posterior distribution.

Another class of algorithms, variational inference, originates from physics. Some of the earliest work, including Peterson and Hartman (1989), applied variational inference to Bayesian models with complicated hierarchical structure such as neural networks. Neal and Hinton (1998) drew an interesting connection between the EM algorithm and variational inference: the EM algorithm can be viewed as a special case of variational inference. Nowadays researchers use variational inference to approximate the posterior distribution of hierarchical Bayesian models. For an overview of variational inference, see Blei et al. (2017).

In this thesis, we use the EM algorithm, variational inference and a hybrid of the two to achieve scalable computation.

In theory, we are interested in both estimation consistency and selection consistency of sparse parameters.

In the Bayesian literature, estimation consistency is often illustrated via posterior concentration, which refers to the property that the posterior measure puts enough mass on a ball centered at the truth with the desired radius. Another type of consistency is selection consistency, which means that, with probability going to 1, our proposed algorithms learn the correct sparsity pattern in various models.

Specifically, in this thesis we aim to develop scalable Bayesian algorithms for Gaussian sequence estimation, naive Bayes classification, variable selection and sparse PCA. We develop theory to guarantee estimation consistency and selection consistency of our methods.

1.1 Gaussian Sequence Model

The Gaussian sequence model has an extremely simple form:

Xi = θi + ei, i = 1, 2, . . . , n, (1.1)

where θ = (θ1, . . . , θn) is an unknown mean parameter. θ is often assumed to have a sparsity structure, meaning that most of the θi's are very small or exactly zero. This model arises in many applications, such as astronomy, signal processing, and bioinformatics. It is also a canonical problem for many modern statistical methods, such as nonparametric function estimation, large-scale variable selection and hypothesis testing.
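For concreteness, the following R snippet (illustrative only; the choices n = 200, sn = 10 and signal value 5 are arbitrary) simulates one draw from model (1.1) with a sparse mean vector.

```r
# Simulate a sparse Gaussian sequence: most theta_i are exactly zero.
set.seed(1)
n  <- 200
sn <- 10                                 # number of nonzero means
theta <- c(rep(5, sn), rep(0, n - sn))   # sparse mean vector
X <- theta + rnorm(n)                    # X_i = theta_i + e_i,  e_i ~ N(0, 1)
```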

The Gaussian sequence model has a long history, which can be traced back to the James-Stein estimator. Besides frequentist methods, empirical Bayes methods with a carefully chosen prior can also be applied to the Gaussian sequence model.

In Chapter 2 we propose an empirical Bayes approach to estimating a sparse sequence. We adopt the popular spike-and-slab prior with one component being a point mass at zero, and the other component being a mixture of Gaussian distributions. Although the Gaussian family has been shown to be suboptimal for this problem, we show that Gaussian mixtures, with means and mixing weights chosen via a proper clustering algorithm such as Dirichlet process mixture clustering, have the desired asymptotic behavior, e.g., the corresponding posterior distribution concentrates on balls with the desired minimax rate. To achieve computational efficiency, we obtain the posterior distribution using a deterministic variational inference algorithm. Empirical studies on several benchmark data sets demonstrate the superior performance of the proposed algorithm compared to other alternatives.

1.2 High Dimensional Classification

The work on the Gaussian sequence model can be naturally applied to high dimensional classification. Classical Linear Discriminant Analysis (LDA) is not preferable in the high dimensional setting because of the large bias incurred in estimating both the covariance matrix and the mean difference (Bickel and Levina, 2004; Fan and Fan, 2008; Shao et al., 2011). The Independence Rule (Bickel and Levina, 2004) can be applied since simply ignoring the covariance structure does not lose too much. Meanwhile, since most features are irrelevant, the underlying true mean difference is sparse, and after normalization the sample mean difference is asymptotically normally distributed. Therefore the work on the Gaussian sequence model can be applied.

In Chapter 3 we apply the above empirical Bayes approach to estimating the sparse normalized mean difference, which can be directly applied to high dimensional linear classification. In theory, we build a bridge connecting the estimation error of the mean difference and the classification error, and provide sufficient conditions for sub-optimal and optimal classifiers.

1.3 Variable Selection

The Gaussian sequence model can be viewed as a linear regression model with an identity design matrix. Learning the sparsity structure of a Gaussian sequence can therefore be generalized to selecting variables with nonzero regression coefficients. Penalized likelihood approaches have been widely used by frequentists for variable selection during the last twenty years since they provide sparse solutions.

Although the L2 penalty yields a closed-form solution for the regression coefficients, the ridge regression estimator is not sparse. Shao and Deng (2012) add a thresholding step to the ridge regression estimator to introduce sparsity. However, this one-step thresholded ridge regression estimator still has relatively large bias. In Chapter 4 we propose an adaptive ridge regression estimator to further reduce the bias.
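To illustrate the thresholding idea (a minimal sketch under assumed settings, not the adaptive estimator developed in Chapter 4; the penalty lambda and threshold a_n below are arbitrary), one can fit a ridge regression and then zero out the small coefficients:

```r
# Ridge regression followed by hard thresholding (illustrative sketch only).
thresholded_ridge <- function(X, y, lambda = 1, a_n = 0.5) {
  p <- ncol(X)
  beta_ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
  ifelse(abs(beta_ridge) > a_n, beta_ridge, 0)   # keep only the large coefficients
}
```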

The spike-and-slab prior (Mitchell and Beauchamp, 1988) is a useful tool for Bayesian variable selection since it can eliminate noisy features while avoiding shrinking large regression coefficients too much. Despite its great success, the computational cost of spike-and-slab methods is high compared with their frequentist counterparts. Ročková and George (2014) and Wang et al. (2016) proposed two versions of the EM algorithm to accelerate computation. Nevertheless, the EM algorithm only computes the posterior mode and cannot approximate the whole posterior distribution.

In Chapter 4 we also show that the procedure for constructing the adaptive ridge regression estimator is equivalent to running the EM algorithm of Wang et al. (2016) for two iterations. In addition, we construct a distribution that approximates the posterior from the output of the EM algorithm. Theoretically, we show that this distribution estimator has the desired concentration property and that the adaptive ridge regression estimator has the desired prediction accuracy.

1.4 Sparse PCA

Dimension reduction is an important technique in big data analysis for extracting the signal and removing the noise. Principal Component Analysis (PCA) is a classical method to find a small number of directions that capture most of the information. In classical PCA all the weights of the principal components (also known as loadings) are nonzero. This not only hampers interpretation but also yields an inconsistent estimator of the principal components.

Sparse PCA, instead, aims to find principal components with a small number of nonzero loadings, which not only improves interpretation but also, under appropriate assumptions, improves the estimation accuracy of the principal components. Sparse PCA now has broad applications in bioinformatics, finance and signal processing.

In Chapter 5 we propose a Bayesian approach to obtain sparse principal components. We show that our algorithm, which is based on variational approximation, achieves Bayesian selection consistency when both dimensionality and sample size go to infinity.

Chapter 2

An Empirical Bayes Approach for Sparse Sequence Estimation

2.1 Introduction

Consider a Gaussian sequence model

Xi = θi + ei, i = 1, 2, . . . , n, (2.1)

where θ = (θ1, . . . , θn) is the unknown mean parameter, often assumed to be sparse, and the ei's are independent errors following a standard normal distribution. The primary interest is to reconstruct the sparse vector θ based on the data X^n = (X1, X2, ..., Xn). Such a simple model arises in many applications, such as astronomy, signal processing and bioinformatics. It is also a canonical model for many modern statistical methods, such as nonparametric function estimation, large-scale variable selection and hypothesis testing.

There are different ways to define the sparsity of a vector. In this chapter, we focus on the common definition of sparsity, i.e., many elements of θi’s are zero. In particular, we assume θ is from the class of nearly black vectors,

Θ0(sn) = {θ ∈ R^n : #{1 ≤ i ≤ n : θi ≠ 0} = sn}.

The parameter sn measures the sparsity of θ, which is usually assumed to be o(n), but unknown. So a desired feature of any reconstruction procedure is to adapt to the unknown sparsity level.

Since θ is known to be sparse, a natural approach is to threshold X^n. The thresholding rule can be chosen by the principle of minimizing the empirical fitting error with penalization (Golubev, 2002) or by minimizing the False Discovery Rate (Abramovich et al., 2006). In addition, a plethora of research on variable selection and prediction in the context of high dimensional linear regression models can also be applied to this problem (Fan and Lv, 2010; Bühlmann and van de Geer, 2011; and Huang, 2008).

Some thresholding procedures are motivated from a Bayesian perspective, such as George and Foster (2000), Johnstone and Silverman (2004), and Castillo et al. (2012). The idea is to model the θi's with a two-group prior, where one component is a point mass at zero due to the sparsity assumption and the other component is a continuous distribution with a smooth symmetric density function g, namely,

π(θi) = wδ0(θi) + (1 − w)g(θi). (2.2)

Theoretical studies have indicated that g(·), the prior density function for the non-zero component, should have a heavy tail for the posterior distribution to have the desired asymptotic behavior. Therefore, the double exponential distribution and its scaled variants are recommended for the choice of g, but not Gaussian distributions. The problem with lighter-tailed distributions such as the Gaussian is that they tend to over-shrink the Xi's corresponding to large θi's and therefore attain a slower posterior contraction rate (Johnstone and Silverman, 2004; Castillo et al., 2012).

In this chapter, we propose an empirical Bayes approach to this problem. Despite this caution against Gaussian distributions, we still use Gaussians in our prior specification, which takes the following form:

π(θi) = wδ0(θi) + (1 − w) Σ_{t=1}^T wt N(mt, σ^2),    (2.3)

and then use the data to learn the weights w, w1, ..., wT and centers m1, ..., mT. Our prior choice can be viewed as a special case of the two-group prior (2.2) with g being a mixture of Gaussian density functions. The Gaussian prior is appealing due to its conjugacy, which simplifies our analysis and also enables tractable computation.
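To make the conjugacy concrete, here is a minimal R sketch (not the thesis implementation) that computes the posterior mean of a single θi under prior (2.3), assuming Xi | θi ~ N(θi, 1) and treating w, the wt's, the mt's and σ^2 as given; the numbers in the example call are arbitrary.

```r
# Posterior mean of theta_i given X_i = x under prior (2.3) with unit noise variance.
posterior_mean <- function(x, w, wt, m, sigma2) {
  dens0 <- w * dnorm(x, mean = 0, sd = 1)                           # marginal under the point mass
  denst <- (1 - w) * wt * dnorm(x, mean = m, sd = sqrt(1 + sigma2)) # marginal under N(m_t, sigma2)
  post  <- c(dens0, denst) / (dens0 + sum(denst))                   # posterior component weights
  means <- c(0, (sigma2 * x + m) / (sigma2 + 1))                    # component-wise posterior means
  sum(post * means)
}
posterior_mean(x = 3.2, w = 0.9, wt = c(0.5, 0.5), m = c(3, 6), sigma2 = 4)
```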

As revealed by our asymptotic analysis, the suboptimal behavior of Gaussian distributions noted in Johnstone and Silverman (2004) and Castillo et al. (2012) pertains to Gaussian distributions with mean zero, and it can be avoided by a proper choice of the weights w1, ..., wT and centers m1, ..., mT. The remainder of this chapter is organized as follows. We present our main results in Section 2.2. We show that under some mild conditions on the mt's and wt's, the corresponding posterior has the desired asymptotic behavior: it concentrates around the true parameter θ* at the minimax rate, its effective dimension adapts to the unknown sparsity sn, and the corresponding posterior mean is asymptotically minimax. Section 2.3 introduces a general framework for finding mt's and wt's that satisfy the technical conditions. A variational implementation of our approach, as well as empirical studies, is presented in Section 2.4. All the proofs are given in Appendix A.

We begin the remainder of this chapter by briefly discussing some related work.

• This chapter is motivated by a recent work of Martin and Walker (2014), where g(θ) = gi(θ) in (2.2) is set to be a Gaussian distribution centered exactly at the data point Xi. In Martin and Walker (2014), the resulting estimate of θi is still a shrinkage estimate toward zero, while in our approach the estimate of θi is adaptively shrunk toward the mean of nearby data points. This explains why the empirical performance of our approach is better than that of Martin and Walker (2014).

• Estimating θ for the Gaussian sequence model (2.1) can also be viewed as a compound decision problem (Robbins, 1951), where the θi's are assumed to be i.i.d. random variables from a common but unknown distribution G. The aforementioned two-group prior (2.2) can be viewed as a special form of G, but in general G can take any form. A number of nonparametric approaches have been proposed to estimate G and in return provide an estimate for θ, such as the maximum likelihood approach of Jiang and Zhang (2009), the nonparametric empirical Bayes approach of Brown and Greenshtein (2009a), and a convex optimization based approach of Koenker (2014). In particular, Gaussian mixtures are used to estimate G by Jiang and Zhang (2009) and Brown and Greenshtein (2009a). However, their asymptotic results do not cover the nearly black class Θ0(sn). In their simulation study, Jiang and Zhang (2009) do consider the sparse situation and suggest adding a Gaussian component with mean zero. In our asymptotic study, we have found that to achieve the desired asymptotic properties, a point mass at zero, instead of a Gaussian with mean zero, seems necessary for the nearly black class.

2.2 Main Results

First we introduce some key notation. The data vector is denoted by X^n = (X1, X2, ..., Xn); Π_{X^n}(·) is used to denote the prior, the subscript indicating its possible dependence on the data; θ* = (θ1*, ..., θn*) is the unknown true mean of X^n; p_θ(X^n) is the likelihood function of θ given X^n.

Consider the following hierarchical prior Π_{X^n}(· | w1:T, m1:T, σ^2, α) on θ:

w ∼ Beta(αn, 1),    (2.4)

θi | w ∼ wδ0 + (1 − w) Σ_{t=1}^T wt N(mt, σ^2),  i = 1, ..., n,    (2.5)

where σ^2, α and T are fixed parameters specified by the user. σ^2 and α do not depend on the dimension n, while T = Tn, w1:T and m1:T may depend on n. For simplicity of notation, we suppress the explicit dependence on n of these parameters. We consider two scenarios: (a) w1:T and m1:T are fixed parameters that do not depend on the data X^n; (b) w1:T and m1:T depend on X^n. Scenario (b) corresponds to empirical Bayes approaches. The theoretical results are similar for the two scenarios except that in Scenario (b) we need one additional condition on σ^2. For simplicity we first consider Scenario (a) and then move on to Scenario (b).

The posterior distribution is proportional to p_θ(X^n) Π_{X^n}(θ). Strong assumptions are usually needed to prove Bayesian consistency. However, following Walker and Hjort (2001) and Martin and Walker (2014), we can define our posterior distribution with a fractional likelihood, which weakens the conditions required for Bayesian consistency. Given a fractional parameter κ (which does not depend on n), the corresponding posterior distribution is then proportional to p_θ(X^n)^κ Π_{X^n}(θ). In implementation, κ is set to be a large number close to 1 so as to capture most of the information in the data; in all of the simulation studies in Section 2.4, we set κ = 0.99. In particular, the posterior measure of a Borel set A ⊂ R^n involving κ, denoted by Qn(A), can be expressed as

Qn(A) = ∫_A {p_θ(X^n)/p_{θ*}(X^n)}^κ Π_{X^n}(dθ) / ∫_{R^n} {p_θ(X^n)/p_{θ*}(X^n)}^κ Π_{X^n}(dθ).    (2.6)

θ* is assumed to be sparse. Denote by S* the support of θ* and by sn = |S*| its cardinality. The minimax rate established in Donoho et al. (1992) for estimating a vector in Θ0(sn) is εn = sn log(n/sn). The aim is to prove that the posterior mean estimator based on the above prior attains the asymptotically minimax L2 error rate over Θ0(sn). To accomplish this, we first prove that our posterior measure Qn has the desired concentration rate and then prove that the L2 error is bounded.

The key conditions are imposed on sn, w1, ..., wT and m1, ..., mT. Condition 2.1 states that sn is upper bounded by the dimension to the power of γ. This is a mild condition since we have already assumed sn = o(n).

Condition 2.1. sn = O(n^γ) for a constant 0 ≤ γ < 1.

To find suitable m1, ..., mT, we apply a suitable clustering algorithm to the nonzero mean set θ_{S*} = {θi* : i ∈ S*} and take the cluster centers as m1, ..., mT. Condition 2.2 essentially puts an upper bound on the within-cluster sum of squares of θ_{S*} under the chosen clustering algorithm. Denoting by m_{ti} the cluster center corresponding to θi* ∈ θ_{S*}, we assume

Condition 2.2. Let m_{ti} = arg min_{1≤t≤T} (Xi − mt)^2. Then Σ_{i∈S*} (Xi − m_{ti})^2 ≤ K^* εn with Pθ*-probability 1 as n → ∞, where K^* is a constant independent of n.

Since the summation on the left-hand side has sn terms, the condition is satisfied if max_{i∈S*} (Xi − m_{ti})^2 = O(log(n/sn)), that is, if the squared distance between Xi and its corresponding cluster center grows no faster than log(n/sn). Therefore Condition 2.2 is not strict, since sn = o(n). In implementation, to specify m1, ..., mT we propose two methods: (1) a simple hard thresholding method; (2) a novel clustering algorithm to estimate the T nonzero cluster centers. The details are provided in Sections 2.3 and 2.4.

Condition 2.3 is on the mixing weights w1, w2, ··· , wT .

Condition 2.3. There exists a constant η > 0 such that min_{1≤t≤T} wt ≥ Cn^{−η} > 0 with Pθ*-probability 1 as n → ∞, where C is a constant independent of n.

η can be any suitably large constant. Since T is independent of n, this condition is easily satisfied if we simply set wt = 1/T for each t. In general, if each cluster size is bounded above, we can also plug in the empirical cluster weights.

Given the above three conditions, we first introduce Lemma 2.1, an analogue of Lemma 1 in Martin and Walker (2014). Note that Lemma 2.1 holds in both Scenario (a) and Scenario (b).

Lemma 2.1. Let Dn be the denominator of the posterior measure Qn. If Conditions 2.1, 2.2 and 2.3 hold, then Dn > [α/(1 + α)] exp{−K0 εn − op(εn)} with Pθ*-probability 1, where K0 is a positive constant independent of n.

After establishing a lower bound for the denominator of Qn, in order to prove posterior concentration we establish an upper bound for the numerator of Qn evaluated on the complement of a ball centered at θ*. Specifically, we show that the posterior probability measure Qn concentrates asymptotically on a ball centered at the truth θ* with squared radius proportional to εn; that is, Qn assigns vanishing mass to

A_{Mεn} = {θ ∈ R^n : ‖θ − θ*‖^2 > M εn}.

First we provide a theorem for the case when m1:T and w1:T do not depend on X^n.

Theorem 2.1. Suppose m1:T and w1:T do not depend on X^n. If Conditions 2.1, 2.2 and 2.3 hold, then there exists M > 0 such that Qn(A_{Mεn}) → 0 with Pθ*-probability 1 as n → ∞.

In Scenario (b), m1:T and w1:T are random vectors that depend on X^n. In order to obtain the same theoretical results, we impose Condition 2.4 on σ^2.

Condition 2.4. σ^2 > 1/(κ(1 − κ)).

Condition 2.4 is not a strict condition: σ^2 needs to be bounded below to obtain a heavier-tailed prior. A similar condition was used in Martin and Walker (2014), who imposed Condition 2.5 to bound σ^2 from below.

Condition 2.5. (Martin and Walker, 2014) There exists a constant β > 1 such that

1/(σ^2 (1 + β/σ^2)^{1/β}) − 1/(σ^2 + β) < κ[(1 − κ)β − 1]/(β − 1).

Figure 2.1 compares the lower bound of σ^2 in Condition 2.4 with that in Condition 2.5; the two bounds are roughly the same over a large range of κ ∈ (0, 1), so the two conditions are similar.

Given Conditions 2.1, 2.2, 2.3 and 2.4, posterior concentration can be established in Scenario (b) as well; see Theorem 2.2. Theorems 2.1 and 2.2 have exactly the same conclusion under slightly different conditions, because of the different hyper-parameter specification schemes for w1:T and m1:T; the proof techniques also differ slightly.


Figure 2.1: Compare Condition 2.4 with Condition 2.5 in Martin and Walker (2014). Black curve is the lower bound of σ2 for κ ∈ (0, 1) in Condition 2.4 while three gray curves are lower bounds of σ2 in Condition 2.5 with β = 25, 50, 100 respectively.

Theorem 2.2. Suppose m1:T and w1:T depend on X^n. If Conditions 2.1, 2.2, 2.3 and 2.4 hold, then there exists M > 0 such that Qn(A_{Mεn}) → 0 with Pθ*-probability 1 as n → ∞.

Remark 2.1. The proof of Lemma 2.1 and Theorem 2.1 (2.2) intrinsically shows that using a proper clustering algorithm to learn m1:T can improve the posterior concentration rate, and thus the posterior mean estimator has a lower error rate. To see this, write Qn(A_{Mεn}) = Nn/Dn. Lemma 2.1 and Theorem 2.1 (2.2) show that, as n → ∞, with Pθ*-probability 1,

Dn ≥ (α/(1 + α)) exp{ Σ_{i∈S*} log(w_{ti}) − (κ/(κσ^2 + 1)) Σ_{i∈S*} (Xi − m_{ti})^2 − 2εn − op(εn) },

Nn ≤ exp(−(κ(1 − κ)M/2) εn).

Conditions 2.1 and 2.3 guarantee Σ_{i∈S*} log(w_{ti}) ≥ −K_* εn, where K_* is a positive constant independent of n. Condition 2.2 guarantees Σ_{i∈S*} (Xi − m_{ti})^2 ≤ K^* εn.

The denominator Dn involves Σ_{i∈S*} (Xi − m_{ti})^2, which is the within-cluster sum of squares (WCSS) of X^n on S*. A smaller WCSS results in a larger lower bound on the denominator, which leads to a smaller estimation error. In terms of implementation, a proper clustering algorithm can therefore be applied to X^n to learn m1:T that minimizes the WCSS.

rate, so it ought to produce an estimator of θ with good properties. Next we show that the posterior mean

θˆ is a minimax estimator. Note that the proof is the same regardless of Scenario (a) or (b).

Theorem 2.3. If the conditions to prove Theorem 2.1 or 2.2 hold, there exists a universal constant M 0 > 0,

ˆ ∗ 2 0 such that Eθ∗ kθ − θ k ≤ M εn for all large n.

The proof is the same as the one by Martin and Walker (2014), provided that we have proved Lemma

2.1 and Theorem 2.1 (2.2).

As posterior concentrates around θ∗ at the minimax rate, we could conclude the majority of the posterior

n mass concentrates on sn-dimensional subspaces of R . The sparsity level of posterior is measured by the posterior distribution of w. Theorem 2.4 shows posterior distribution of w put large probability mass above

−1 1 − snn .

−1 Theorem 2.4. If the conditions to prove Theorem 2.1 or 2.2 hold, let δn = Kεnn , and K > 0 is a

n suitably large constant. Then Eθ∗ {P (1 − w > δn|X )} → 0 as n → ∞.

Notice that δn = Ksn/n log(n/sn) ≈ Ksn/n, therefore the posterior distribution of w concentrates

−1 around 1 − snn . That is, the effective dimension of our posterior distribution adapts to the unknown sparsity level sn. Our posterior mean estimator could detect true sparsity pattern with large probability. The proof is the same as the one by Martin and Walker (2014) as long as we have proved Lemma 2.1 and

Theorem 2.1 (2.2).

2.3 Prior Specification via Hard Thresholding

In this section we will demonstrate if we specify w1:T and m1:T via hard thresholding, Condition 2.2 and 2.3 are satisfied. We could construct the desired prior via simple hard thresholding. √ First, set the universal thresholding value an = 2 log n. Mn0 = {0} ∪ {Xi : |Xi| ≥ an}; Tˆn0 = |Mn0|. Set M = {mˆ , mˆ , ··· , mˆ } andw ˆ = 1/Tˆ for any t. Therefore we have n0 1 2 Tˆn0 t n0

X 2 X 2 X 2 2 (Xi − mˆ ti ) ≤ (Xi − Xi1|Xi|≥an ) = (Xi1|Xi|≤an ) ≤ snan = 2sn log(n) = O(εn). i∈S∗ i∈S∗ i∈S∗

Condition 2.2 is satisfied.

Since the weight isw ˆt = 1/Tˆn0, Choosing η ≥ 1 Condition 2.3 is satisfied trivially. In conclusion, using √ threshold an = 2 log n, Condition 2.2 and 2.3 hold.

11 After specifying prior parameters w1:T and m1:T , fully posterior computation can be implemented via prior (2.4). However, fast approximation could be done via EM algorithm. To approximate (2.4), a discrete

T n P prior ΠX (· | w1:T , m1:T ) = t=1 wtδmt is utilized with w1:T and m1:T estimated from data. Mn0 = √ {0} ∪ {Xi : |Xi| ≥ an} consists of m1:T , where an = 2 log n. w1:T is determined by a single parameter w: w1 = w and wt = (1 − w)/(T − 1) for t 6= 1. To estimate w, EM algorithm could be used with updating formula as follows (initial values are w = 1/T ):

n (k−1) 1 X w φ(Xi) w(k) = , n (k−1) PT (k−1) i=1 w φ(Xi) + t=2(1 − w )/T · φ(Xi − mt) where φ is standard Normal probability density function. After convergence, posterior mean estimator is calculated: PT m φ(X − m )w θˆ = t=1 t i t t . i PT t=1 φ(Xi − mt)wt Simulation study in the following sections show the above method has the good performance in terms of

mean absolute error (MAE). However when sn is relatively large, the above prior may not have good perfor-

mance. When sn is relatively large, we have more information about the nonzero elements, therefore using clustering methods to learn the prior can improve the performance. We propose a new prior construction

method based on the clustering algorithm to learn the clustering structure of θis in the following section.

2.4 Implementation

2 In Section 2.2 we have shown using good prior with parameters (w1:T , m1:T , σ , α) could result in posterior concentration and minimax rate of Bayes posterior mean estimator. In implementation, we first specify σ2

n P and α, then estimate (w1:T , m1:T ) using X via a clustering algorithm. Condition 2.2 implies i∈S∗ (Xi − 2 ∗ mti ) needs to be bounded, however, in practice S is unknown so that we cannot directly apply clustering algorithms. One remedy is to apply the clustering algorithm on the elements of Xn which are beyond certain

threshold. One cluster center is fixed to be 0 and we aim to learn all the nonzero cluster centers.

n we apply some proper clustering algorithm on X to assign clustering centers to m1:T and cluster weights

to w1:T . We need to pre-specify 0 as one clustering center so that the major task is to estimate the nonzero cluster centers. In Bayesian framework, Dirichlet process mixture model is used to cluster data. We build a

n Dirichlet process (DP) mixture model on X and estimate mt by plugging in corresponding cluster centers.

12 Dirichlet process mixture model is summarized as follows:

θi ∼ G, G ∼ DP(α0,G0);

Xi ∼ N(θi, 1), i = 1, 2, ··· , n;

where G0 is the base measure and α0 is the concentration measure. Given G0 and α0, we have stick breaking P∞ representation of G as t=1 πtδηt (·), where ηt is drawn independently and identically distributed (i.i.d.) from Qt−1 base measure G0, while πt = Vt l=1 (1 − Vl). Vl is drawn i.i.d. from Beta(1, α). From this formulation we could see that random distribution function G is almost surely discrete. Since θ∗ is a sparse vector, in

order to get a random distribution G with a positive point mass at 0, G0 ought to have a positive mass at

0. Therefore we model G0 as a normal component with a point mass at 0:

2 G0 = w0δ0 + (1 − w0)N(0, σ0);

2 where w0 and σ0 are 2 pre-specified parameters. Then G could be written as

∞ ∞ ∞ X X X G = π (ξ δ (·) + (1 − ξ )δ ∗ (·)) = π ξ δ (·) + π (1 − ξ )δ ∗ (·) t t 0 t ηt t t 0 t t ηt t=1 t=1 t=1 ∞ X ≡ wδ (·) + (1 − w) w δ ∗ (·), 0 t ηt t=1

P∞ πt(1−ξt) where w = t=1 πtξt and wt = 1−w . In this formulation ξt is drawn i.i.d. from Ber(w). If ξt = 0, ∗ 2 ∗ then ηt is drawn from N(0, σ0), otherwise ηt = 0. Since w > 0, G always has a positive probability mass at 0 which could induce certain level of sparsity. In the following subsection, we will develop a variational inference algorithm to estimate G.

2.4.1 Variational Algorithm to Specify Prior

Once we have specified Dirichlet process as the prior, we need to compute the posterior distribution of G in order to calculate Maximum A Posteriori (MAP) estimator of G. The major challenge here is that we have the infinite sum in the stick breaking form of G, which makes it impossible to sample from G. One remedy here is to use a constant T as the upper bound of the number of clusters (Blei et al., 2006). Then we have

2 the following truncated version of stick breaking process using G0 = w0δ0 + (1 − w0)N(0, σ0) as the base

13 measure.

Vt|α0 ∼ Beta(1, α0), t = 1, 2, ··· ,T − 1; (2.7)

ξt ∼ Ber(w0), t = 1, 2, ··· ,T ; (2.8)  δ0 ξt = 1 ∗  ηt |ξt ∼ ; t = 1, 2, ··· ,T ; (2.9)  2 N(0, σ0) ξt = 0 t−1 T −1 Y Y πt = Vt (1 − Vj), t = 1, 2, ··· ,T − 1, πT = (1 − Vj); (2.10) j=1 j=1

Zk|{V1,V2, ··· ,VT −1} ∼ Multinomial(π); (2.11)

X |Z ∼ N(η∗ , 1), i = 1, 2, ··· , n. (2.12) i i Zi

n ∗ ∗ ∗ ∗ The observed data are X and the parameters are Z1×n, V1×(T −1), η1×T , ξ1×T . η = (η1 , ··· , ηT ) contains all unique values of η = (η∗ )n . Zi i=1 Classical Markov Chain Monte Carlo (MCMC) (Escobar and West, 1995) has been developed to compute the posterior distribution of these parameters. However, due to large n, the number of parameters is huge, making MCMC converging very slowly. In this chapter, we use a variational algorithm to get posterior distribution. Variational methods are deterministic. The essence of variational inference is to view the computation of posterior distribution as an optimization problem. Solving this optimization problem gives an approximation to the posterior distribution. Blei et al. (2006) proposed variational algorithms for Dirichlet process mixture models for exponential family only. Although normal distribution with a positive mass at

0 does not belong to exponential family, we could follow the same philosophy to derive the corresponding algorithm.

We consider mean field variational inference and assume the following fully factorized form of variational distribution:

q(Z, V, η, ξ) = qp,m,τ (η, ξ)qγ1,γ2 (V)qΦ(Z);

Through the calculation shown in the Appendix, we could prove the optimal q must be further factorized as follows:

∗ QT ∗ • qp,m,τ (η , ξ) = t=1 qpt,mt,τt (ηt , ξt), where p = (p1, p2, ··· , pT ), m = (m1, m2, ··· , mT ), τ = ∗ ∗ ∗ (τ1, τ2, ··· , τT ), and qpt,mt,τt (ηt , ξt) = pt1ξt=1δ0 + (1 − pt)1ξt=0qmt,τt (ηt ), where qmt,τt (ηt ) is normal 2 probability density with mean mt and variance τt .

QT −1 • qγ1,γ2 (V) = t=1 qγ1t,γ2t (Vt), where γ1 = (γ11, γ12, ··· , γ1(T −1)), γ2 = (γ21, γ22, ··· ,

14 γ2(T −1)), qγ1t,γ2t (Vt) is Beta distribution with parameters (γ1t, γ2t).

Qn • qΦ(Z) = i=1 qφi (Zi); where Z = (Z1,Z2, ··· ,Zn), Φ = (φ1, φ2, ··· , φn),φi = (φi,1, φi,2, ··· , φi,T ),

φi,t = q(Zi = t), qφi (Zi) is Multinomial distribution with parameters φi.

Variational algorithm is summarized in Algorithm 1. Via iterating these steps we could update the

variational parameters. After convergence of Φ, p, m, τ , γ1 and γ2, we get an approximation of the posterior by plugging in these estimated parameters. The parameters we are interested in are Φ,p and m.

−1 (In the algorithm k · k∞,∞ means the element-wise maximum absolute value; logit(x) = (1 + exp(−x)) .)

Algorithm 1 Variational Bayes Algorithm for Dirichlet process mixture model with G0 input y, α0, σ0, w0,T initialize Φ(1) and Φ(0); (1) (0) while kΦ − Φ k∞,∞ >  do 2 Pn (0) σ0 · i=1 φi,t Xi mt ← 2 Pn (0) , t = 1, 2, ··· ,T ; σ0 · i=1 φi,t +1 2 2 σ0 τt ← 2 Pn (0) , t = 1, 2, ··· ,T ; σ0 · i=1 φi,t +1 σ2·(Pn φ(0)X )2 −1 2 Pn (0) 0 i=1 i,t i pt ← logit (log(w0) − log(1 − w0) + log(σ0 · i=1 φi,t + 1)/2 − 2 Pn (0) ), t = 1, 2, ··· ,T ; 2(σ0 · i=1 φi,t +1) Pn (0) γt,1 ← 1 + i=1 φi,t , t = 1, 2, ··· ,T − 1; Pn PT (0) γt,2 ← α0 + i=1 j=t+1 φi,t , t = 1, 2, ··· ,T − 1; E Pt−1 E 1 2 2 ,t ← q log Vt + i=1 q log(1−Vt)+(1−pt)mtXt − 2 (1−pt)(mt +τt ), t = 1, 2, ··· , T, i = 1, 2, ··· , n; (1) φi,t ∝ exp(Si,t), t = 1, 2, ··· , T, i = 1, 2, ··· , n; end while output p, m, Φ

2.4.2 Posterior Computation

Given estimators pˆ, mˆ , Φˆ , we construct a MAP estimator of G. Recall thatm ˆ t is the nonzero cluster ˆ center;p ˆt is the probability mass of zero of the component indexed by t; each entry φit of Φˆ is the posterior probability of Z belonging to the cluster t. The approximate posterior distribution of η∗ is i Zi

T T X ˆ X ˆ ( φitpˆt)δ0(·) + φit(1 − pˆt)δmt (·). t=1 t=1

The most probable posterior assignment of η∗ based on above posterior is denoted asη ˆ∗ . MAP estimate Zi Zi of cluster weights including zero clusters isw ˜ = #{k :η ˆ∗ =m ˆ }/n andw ˜ = #{k :η ˆ∗ = 0}/n. Then the t Zi t 0 Zi

15 estimated prior is T ˆ X GMAP =w ˜0 · δ0(·) + w˜tδmˆ t (·). t=1

κ Plugging in GˆMAP, hierarchical Bayes model could be written as (N denotes normal probability density to the power of κ):

T ˆ ˆ X θi ∼ GMAP, GMAP =w ˜0 · δ0(·) + w˜tδmˆ t (·) t=1

κ Xi ∼ N (θi, 1), i = 1, 2, ··· , n.

This is different from the original data generating process (2.4) but could be regarded as an approximation of (2.4) by setting σ2 = 0. The main reason why we use this approximation is that we could compute a

n closed-form posterior distribution of θ given X using Bayes Formula: since Xi ∼ N(θi, 1) and θi ∼ GˆMAP,

n the posterior distribution of θi given X has the form

2 2 κXi PT κ(Xi−mˆ t) T w˜0 exp(− 2 )δ0 + t=1 w˜t exp(− 2 )δmˆ t X 2 2 ≡ wˆ0 · δ0(·) + wˆtδmˆ t (·); κXi PT κ(Xi−mˆ t) w˜0 exp(− 2 ) + t=1 w˜t exp(− 2 ) t=1

PT wherew ˆt is the posterior weight for centerm ˆ t and t=1 wˆt = 1. The final estimator of θi is the posterior mean estimator T ˆ X θi = wˆtmˆ t. t=1

2.4.3 Simulation Studies

We conduct the following four simulation studies; R package for our empirical Bayes estimator is available

in https://github.com/yunboouyang/VBDP; all the source code is summarized in https://github.com/

yunboouyang/EBestimator. For our empirical Bayes estimator, we set the maximal number of clusters to be T = 10. We set fractional likelihood parameter κ = 0.99 throughout all simulation studies. We set

2 concentration parameter α0 = 1 in all of simulation studies. For G0 = w0δ0 + (1 − w0)N(0, σ0), we set

w0 = 0.01. For simulation study 1, we set σ0 = 4. For other simulation studies, we set σ0 = 6. The influence of these hyper-parameters will diminish when dimension n is large.

Besides prior specification via hard thresholding (denoted as “Hard Thresh Prior”) and prior specification

via Dirichlet Process mixture (denoted as “DP prior”), we also consider estimators proposed by Martin and

Walker (2014) (denoted as EBMW), by Koenker and Mizera (2014) (denoted as EBKM), and by Johnstone

and Silverman (2004)(denoted as EBMed, we directly use functions from the EbayesThresh package by

16 Johnstone and Silverman (2005)), as well as the Hard thresholding estimator, soft thresholding estimator,

SURE estimator (using waveThresh package) and FDR estimator with parameters q = 0.01, 0.1, 4.

n The oracle prior in the compound decision setup is Π = P δ ∗ /n. The corresponding posterior mean n i=1 θi estimator has the minimal risk among all separable rules. In all simulation studies, the optimal prior is

served as the benchmark.

Experiment 1

In the first simulation study, we take sample Xn of dimension n = 200 from the normal mean model

Xi ∼ N(θi, 1). In this case, we consider the number of nonzero elements to be sn = 10, 20, 40, 80 and the nonzero elements are fixed at values µ0 = 1, 3, 5, 7. We compute mean squared error (MSE) and mean absolute error (MAE) based on 200 replications as two measure of performance. We summarize the results in Table 2.1 and Table 2.2.

sn 10 20 40 80

µ0 1357135713571357 EBMW 10 54 21 13 20 96 35 25 40 152 61 49 79 234 108 96 EBKM 12 35 15 6 21 53 19 7 32 74 25 7 44 93 30 8 EBmed 10 43 22 14 20 69 36 28 37 103 66 54 64 157 127 107 SURE 14 42 45 43 23 68 69 69 42 104 105 105 74 152 153 153 Soft Thresholding 10 76 116 116 20 153 228 233 40 304 457 464 80 609 916 929 Hard Thresholding 13 61 21 12 24 122 39 23 45 237 75 42 87 473 149 82 FDR q = 0.01 10 75 25 11 20 143 40 22 41 253 67 44 81 434 112 85 FDR q = 0.1 11 55 24 20 23 93 39 37 44 141 67 64 88 208 113 111 FDR q = 0.4 22 63 53 51 36 94 82 80 64 134 119 117 119 175 161 161

Hard Thresh Prior 12 39 14 7 22 62 19 10 44 96 28 15 84 143 42 26 DP Prior 11 37 11 3 19 50 17 4 33 71 22 4 46 92 26 6

Oracle Prior 9 29 9 1 16 46 13 1 27 67 18 2 39 87 23 2

Table 2.1: MSE of Simulation Study 1 in Chapter 2. Errors of the best method are marked in bold.

17 sn 10 20 40 80

µ0 1357135713571357 EBMW 10 23 14 13 20 43 26 24 40 74 47 45 80 125 86 83 EBKM 27 35 22 19 40 48 25 20 60 62 29 21 82 76 32 22 EBmed 12 21 12 10 23 38 23 20 44 72 45 39 80 133 96 80 SURE 15 30 32 31 24 49 51 51 44 78 79 79 78 117 119 119 Soft Thresholding 10 27 33 33 20 54 65 65 40 109 130 130 80 217 260 261 Hard Thresholding 11 22 10 9 21 45 19 17 41 88 37 32 82 175 74 64 FDR q = 0.01 10 27 10 8 20 51 19 17 40 93 36 33 80 163 69 66 FDR q = 0.1 10 21 12 11 21 36 22 22 41 59 41 41 82 98 77 76 FDR q = 0.4 14 26 25 25 25 44 43 42 48 72 70 70 94 110 108 108

Hard Thresh Prior 11 20 9 6 21 33 14 10 41 56 24 18 80 93 42 35 DP Prior 23 31 18 14 36 42 20 16 61 57 25 17 87 72 27 19

Oracle Prior 9 29 9 1 16 46 13 1 27 67 18 2 39 87 23 2

Table 2.2: MAE of Simulation Study 1 in Chapter 2. Errors of the best method are marked in bold.

When µ0 is larger than one, our DP prior based method is the best one in terms of MSE since it is

easier to figure out the cluster centers of θi’s and estimate the nonzero θi’s. Besides, when sn is large,

our DP prior based method has the best performance among all since the more nonzero θi’s we have, the

easier we could estimate the cluster centers. When both µ0 and θi are small, which is a tough case since high dimensional noise might make it much more difficult to detect weak signals, our method still has the comparable performance among all methods. Hard Thresh prior based prior has better performance when measured by MAE than that in MSE. Other methods such as EBKM and EBmed also have comparable good performance. Hard Thresh prior based prior is the best method in terms of MAE. In terms of MSE,

DP prior based method has similar performance as the oracle.

Experiment 2

In the second simulation study we increase the data dimension to 500. We are interested in the case when sn = 25, 50, 100 and all nonzero elements are fixed at µ0 = 3, 4, 5. This is a more challenging case since the signal is not very strong. MSE and MAE are shown in Table 2.3 and Table 2.4 respectively.

18 sn 25 50 100

µ0 3 4 5 3 4 5 3 4 5

EBMW 138 99 52 233 156 90 385 248 153

EBKM 81 57 28 120 80 41 174 114 52

EBMed 107 78 50 165 124 91 254 208 163

SURE 102 106 105 163 168 170 257 260 259

ST 201 289 327 403 577 658 806 1155 1308

HT 172 143 64 341 282 129 680 563 251

FDR q = 0.01 192 151 62 356 241 101 639 385 164

FDR q = 0.1 139 87 55 224 135 99 355 213 167

FDR q = 0.4 150 133 126 229 206 201 332 302 294

Hard Thresh Prior 97 59 30 161 88 45 262 133 66

DP Prior 80 55 25 119 79 35 171 109 49

Oracle Prior 73 49 24 115 74 34 167 107 44

Table 2.3: MSE of Simulation Study 2 in Chapter 2. Errors of the best method are marked in bold.

sn 25 50 100

µ0 3 4 5 3 4 5 3 4 5

EBMW 58 47 36 105 82 66 186 143 118

EBKM 75 54 38 102 67 45 139 84 50

EBMed 49 37 29 88 69 58 176 135 113

SURE 73 76 75 120 124 124 194 196 196

ST 70 83 87 141 166 175 281 332 349

HT 62 44 26 123 86 52 245 172 102

FDR q = 0.01 67 46 26 127 77 48 233 134 90

FDR q = 0.1 52 34 29 87 61 56 148 111 103

FDR q = 0.4 63 61 60 107 106 106 179 175 174

Hard Thresh Prior 48 31 20 84 51 33 147 84 56

DP Prior 60 42 29 93 58 34 128 74 43

Oracle Prior 49 25 9 76 37 13 111 53 18

Table 2.4: MAE of Simulation Study 2 in Chapter 2. Errors of the best method are marked in bold.

In terms of MSE, our DP prior based method and Hard Thresh prior based method are consistently

19 the best methods among all the different configurations. The performance of EBKM is comparable. One drawback of EBKM is that the computational cost is high since solving a high dimensional optimization problem is needed. However for DP prior and Hard Thresh prior, we use a computationally efficient varia- tional inference algorithm, which dramatically saves the computational time. In terms of MSE, DP method has comparable performance as the oracle.

Experiment 3

In the third simulation study we maintain all the features of the second experiment except that nonzero elements of the true mean vector are now generated from standard Gaussian distribution centered at the original values in simulation study 2. All different θi’s form a data cloud which is centered at the distinct nonzero values in simulation study 2.

sn 25 50 100

µ0 3 4 5 3 4 5 3 4 5

EBMW 110 91 62 186 150 103 308 245 176

EBKM 77 68 49 121 105 77 185 163 124

EBMed 93 79 59 151 130 100 236 212 176

SURE 96 104 106 155 165 168 243 257 260

ST 196 269 313 389 534 626 780 1075 1251

HT 135 123 79 263 237 155 527 479 311

FDR q = 0.01 151 129 79 268 216 133 479 368 221

FDR q = 0.1 112 90 66 187 148 110 300 235 187

FDR q = 0.4 138 137 129 216 213 201 320 307 301

Hard Thresh Prior 81 66 46 138 107 74 237 176 120

DP Prior 76 68 53 120 110 84 189 164 123

Oracle Prior 67 58 41 111 94 69 177 152 112

Table 2.5: MSE of Simulation Study 3 in Chapter 2. Errors of the best method are marked in bold.

20 sn 25 50 100

µ0 3 4 5 3 4 5 3 4 5

EBMW 51 46 38 92 82 69 164 145 126

EBKM 75 63 52 108 91 74 160 135 111

EBMed 44 38 31 79 70 59 147 132 115

SURE 69 74 76 114 122 124 187 195 197

ST 67 79 86 133 158 171 266 317 342

HT 51 43 31 100 84 61 200 169 121

FDR q = 0.01 55 44 31 101 80 57 188 144 104

FDR q = 0.1 45 37 31 81 67 58 143 120 108

FDR q = 0.4 60 62 60 105 107 105 176 175 176

Hard Thresh Prior 42 35 27 75 61 48 135 107 85

DP Prior 59 54 44 97 84 69 156 127 102

Oracle Prior 51 39 27 87 66 48 143 112 85

Table 2.6: MAE of Simulation Study 3 in Chapter 2. Errors of the best method are marked in bold.

From Table 2.5 and Table 2.6, we conclude Hard Thresh prior based method has the best performance.

Even though our DP prior based method is not as good as EBKM and EBMed in some configurations, our DP method has quite similar performance even it is more difficult to estimate the clustering centers in simulation study 3 than simulation study 2. Hard Thresh based prior has better performance than DP prior based method.

Experiment 4

In the fourth example, we consider 1000-dimensional vector estimation, with the first 10 entries of θ∗ equal

10, the next 90 entries equal A, and the remaining 900 entries equal 0. We consider a range of A, from

A = 2 to A = 7. Average MSE and MAE are recorded in Table 7 and Table 8.

21 A 2 3 4 5 6 7

EBMW 320 421 291 175 134 125

EBKM 206 223 150 79 46 35

EBMed 337 350 240 173 148 138

SURE 278 327 333 337 336 334

ST 504 894 1259 1436 1475 1483

HT 375 671 610 301 134 102

FDR q = 0.01 375 633 441 199 121 111

FDR q = 0.1 373 421 259 194 185 181

FDR q = 0.4 440 451 404 403 399 396

Hard Thresh Prior 312 326 179 90 62 52

DP Prior 204 220 161 205 151 85

Oracle Prior 195 213 142 67 34 28

Table 2.7: MSE of Simulation Study 4 in Chapter 2. Errors of the best method are marked in bold.

A 2 3 4 5 6 7

EBMW 179 199 159 130 122 120

EBKM 225 182 115 79 61 61

EBMed 176 161 127 110 101 97

SURE 209 241 245 248 246 247

ST 216 294 347 367 371 372

HT 189 242 184 110 85 80

FDR q = 0.01 189 232 147 96 85 83

FDR q = 0.1 188 168 120 110 109 108

FDR q = 0.4 217 214 208 212 210 209

Hard Thresh Prior 170 164 102 64 56 54

DP Prior 215 161 91 89 75 57

Oracle Prior 195 142 70 27 14 17

Table 2.8: MAE of Simulation Study 4 in Chapter 2. Errors of the best method are marked in bold.

In some configurations, Hard Thresh prior based method and DP prior based method have the best

22 performance among all the state-of-the-art methods. Although in some cases EBKM and EBMed performs

better, MSE and MAE of our DP prior and Hard Thresh prior based methodsare still comparable. DP prior

based method has the best performance when there exists clusters with small values while Hard Thresh prior

based method has the best performance when there is no cluster which is centered at small values. The

reason why DP prior based method does not have good performance when A is large is that DP prior based

method may merge two clusters into only one cluster, which results in larger error.

2.5 Conclusions and Discussions

The empirical Bayes method introduced in Martin and Walker (2014) does not perform well in our simulation studies, since a prior centered at Xi may introduce noise into the posterior estimation, whereas our nonparametric Bayes clustering method reduces that noise while capturing the general pattern: the estimate of θi is adaptively shrunk toward the mean of nearby data points. Therefore our DP method outperforms the empirical Bayes estimator based on Martin and Walker (2014).

Theoretical results show that, with a proper choice of the parameters in the prior, our posterior mean estimator achieves the asymptotic minimax rate. For the implementation we propose a fast variational inference method to approximate the posterior distribution, which is more efficient in computation time than the empirical Bayes method proposed by Koenker (2014). To our knowledge, this is the first work connecting high dimensional sparse vector estimation with clustering. Our nonparametric Bayesian estimator could be applied to high dimensional classification, feature selection, hypothesis testing and nonparametric function estimation. A possible extension in implementation is to use other well-studied clustering methods to estimate the prior and compare their performance.

Chapter 3

An Empirical Bayes Approach for High Dimensional Classification

3.1 Introduction

Nowadays high dimensional classification is ubiquitous in many application areas, such as micro-array data

analysis in bioinformatics, document classification in information retrieval, and portfolio analysis in finance.

In this chapter, we consider the problem of constructing a linear classifier with high-dimensional features.

Suppose data from class k are generated from a p-dimensional multivariate Normal distribution Np(µk, Σ)

where k = 1, 2 and the prior proportions for two classes are πk, k = 1, 2 respectively. It is well-known that the optimal classification rule, i.e., the Bayes rule, classifies a new observation X to class 1 if and only if

δOPT(X) = (X − µ)^t Σ^{-1} (µ1 − µ2) > log(π2/π1),    (3.1)

where µ = (µ1 + µ2)/2. For simplicity, we assume both the prior proportions πk and the sample proportions of the two classes are equal, but our theory could easily be extended to the case where the two classes have unequal sample sizes with a ratio bounded between 0 and 1. Therefore (3.1) can be simplified as: we classify X to class 1 if and only if

δOPT(X) = (X − µ)^t Σ^{-1} (µ1 − µ2) > 0.    (3.2)

Since the parameters θ = (µ1, µ2, Σ) are unknown and we are given a set of random samples {Xki : i = 1, . . . , n; k = 1, 2}, we can estimate those unknown parameters and classify X to class 1 if

δˆ(X) = (X − µˆ)^t Σˆ^{-1} (µˆ1 − µˆ2) > 0,

where µˆ = \bar{X}_{··} is the overall mean of the data, µˆ1 = \bar{X}_{1·} and µˆ2 = \bar{X}_{2·} are the sample means of the two classes, and Σˆ = \frac{1}{2n−2}\sum_k\sum_i (X_{ki} − \bar{X}_{k·})(X_{ki} − \bar{X}_{k·})^t is the pooled estimator of the covariance matrix. This is also known as linear discriminant analysis (LDA).

LDA, however, does not perform well when p is much larger than n. Bickel and Levina (2004) have

shown that when the number of features p grows faster than the sample size n, LDA is asymptotically

as bad as random guessing due to the large bias of Σˆ in terms of the spectral norm. RDA by Friedman

(1989), thresholded covariance matrix estimator by Bickel and Levina (2008) and Sparse LDA by Shao et al.

(2011) use regularization to improve the estimation of Σ by assuming sparsity on off-diagonal elements of

the covariance matrix. Cai and Liu (2012) assume Σ^{-1}(µ1 − µ2) is sparse and proposed LPD based on a sparse estimator of Σ^{-1}(µ1 − µ2). A seemingly extreme alternative is to set all the off-diagonal elements of Σˆ to zero, i.e., ignore the correlation among the p features, and use the following Independence Rule:

δˆI(X) = (X − µˆ)^t Dˆ^{-1/2} dˆ,    (3.3)

where Dˆ = diag(Σˆ) and dˆ = Dˆ^{-1/2}(µˆ1 − µˆ2). Here dˆ is the normalized sample mean difference and the corresponding population-level parameter is d = D^{-1/2}(µ1 − µ2). Theoretical studies such as Domingos and Pazzani (1997) and Bickel and Levina (2004) have shown that the worst-case misclassification error of the Independence Rule is

well controlled and ignoring the correlation structure of Σ doesn’t lose much if the correlation matrix is well

conditioned.

To achieve good classification performance in a high-dimensional setting, it is not enough to regularize

just the covariance matrix. As pointed out by Fan and Fan (2008) and Shao et al. (2011), even if we use the Independence Rule, the classification performance of δˆI could still be as bad as random guessing due to the error accumulation through all p dimensions of d. In Theorem 1 of Fan and Fan (2008), essentially using dˆ

to estimate d results in a strong condition that signal strength needs to grow linearly with the dimension p.

Estimating a sparse d is a key step in constructing a classifier with good performance. Instead of using the sample

mean difference dˆ, other shrinkage and thresholding estimators could be naturally applied. In general we

denote the shrinkage estimator of d as ηˆ. Independence Rule (3.3) using ηˆ is

δˆηˆ(X) = (X − µˆ)^t Dˆ^{-1/2} ηˆ.    (3.4)

As in (3.3), µˆ is the sample mean across the two groups and Dˆ is the pooled estimator of the variances. We call δˆηˆ(X) the Independence Rule induced by ηˆ. In this chapter we focus on how the estimation error of d incurred by ηˆ relates to the classification error.
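To make the two-step construction concrete, the following R sketch (our own illustration, not the code in the accompanying package) builds an Independence Rule induced by a generic shrinkage estimator of the normalized mean difference; the names independence_rule and hard_threshold are illustrative.

# X1, X2: n1 x p and n2 x p training matrices for class 1 and class 2.
# shrink: any regularized estimator applied to the normalized mean difference.
independence_rule <- function(X1, X2, shrink = identity) {
  n1 <- nrow(X1); n2 <- nrow(X2)
  mu1 <- colMeans(X1); mu2 <- colMeans(X2)
  mu  <- (mu1 + mu2) / 2                       # overall center (equal class sizes assumed)
  S1 <- colSums(sweep(X1, 2, mu1)^2)
  S2 <- colSums(sweep(X2, 2, mu2)^2)
  Dhat <- (S1 + S2) / (n1 + n2 - 2)            # pooled diagonal variance estimate
  dhat <- (mu1 - mu2) / sqrt(Dhat)             # normalized sample mean difference
  eta  <- shrink(dhat)                         # plug in any shrinkage/thresholding estimator
  # classify x to class 1 iff (x - mu)^t D^{-1/2} eta > 0
  function(x) as.numeric(crossprod((x - mu) / sqrt(Dhat), eta) > 0)
}

# Example shrinkage map: simple hard thresholding with threshold b
hard_threshold <- function(b) function(d) d * (abs(d) > b)
# clf <- independence_rule(X1, X2, shrink = hard_threshold(b = 0.5))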

Papers including Fan and Fan (2008), Shao et al. (2011), Greenshtein and Park (2009) and Dicker and Zhao (2016) proposed various classification rules. These rules differ because the underlying regularized estimators ηˆ differ.

Both the signal strength and the estimation error are important to determine the performance of the

classification rule. In theory, we have 3 major conclusions:

• When the signal strength converges to a constant, the classification error will converge to the error of

the optimal rule as long as the estimation error converges to 0.

• When the signal strength goes to infinity, the classification error will converge to 0 as long as the

growth rate of the signal strength is faster than the estimation error.

• When the signal strength goes to infinity, in order to have the same convergence rate as the classification

error of the optimal rule, the product of the signal strength and the estimation error should converge

to 0.

Since estimating a sparse d is equivalent to estimating a sparse high dimensional sequence, empirical Bayes methods can be used to construct a regularized estimator. In the previous literature, Greenshtein and Park (2009) proposed an empirical Bayes classifier inspired by the empirical Bayes estimator in Brown and Greenshtein (2009b) (denoted as EB); Dicker and Zhao (2016) proposed a classifier based on Koenker and Mizera (2014)'s nonparametric MLE. In Chapter 2 we introduced an empirical Bayes estimator based on the Dirichlet process mixture (denoted as DP). Based on our previous work, in this chapter we propose two empirical Bayes classifiers based on our DP estimator and its sparse variant. Compared with Greenshtein and Park (2009), we establish explicitly the theoretical connection between the classification error of (3.3) and the L2 estimation error of ηˆ. In particular, we provide sufficient conditions for an estimator ηˆ to achieve optimal classification accuracy asymptotically, i.e., for the resulting Independence Rule to be asymptotically as good as the Bayes rule (3.2).

The rest of this chapter is organized as follows: in Section 3.2 we establish the relationship between the estimation error and the classification error. In Section 3.3, we study theoretical properties of DP classifier.

In Section 3.4 we introduce a sparse variant of DP classifier. In Section 3.5 we present the empirical results.

Conclusions and future work are in Section 3.6.

3.2 Relationship between the Estimation Error and the

Classification Error

We regard the construction of a linear classifier as a two-step procedure. First we calculate the sample mean difference dˆ and propose an estimator ηˆ based on dˆ. Second we compute the classifier δˆηˆ: we classify X to class 1 if and only if (X − µˆ)^t Dˆ^{-1/2} ηˆ > 0. We call δˆηˆ the Independence Rule induced by ηˆ.

We use the 0-1 loss function to evaluate a linear classifier. Without loss of generality, we assume the new observation X comes from class 1, due to the symmetry of our rule. Let the training data used to construct δˆηˆ be denoted by X. The classification error of δˆηˆ (given the data) is

W(δˆηˆ, θ) = P(δˆηˆ(X) ≤ 0 | X) = Φ(−Ψ),    (3.5)

where

Ψ = \frac{(µ1 − µˆ)^t Dˆ^{-1/2} ηˆ}{\sqrt{ηˆ^t Dˆ^{-1/2} Σ Dˆ^{-1/2} ηˆ}}.

Φ(·) is the standard normal cumulative distribution function. To measure the signal strength of the classification problem, the Mahalanobis distance, denoted Cp, is used:

Cp = (µ1 − µ2)^t D^{-1} (µ1 − µ2).

Cp measures the signal strength, which reflects the difficulty of the classification task: if Cp is small, it is difficult to differentiate the two classes, and vice versa. We use essentially the same definition of signal strength as Fan and Fan (2008).
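As a small concrete check of these definitions, the following R sketch (our own illustration; the function name is hypothetical) evaluates the conditional classification error Φ(−Ψ) of (3.5) and the signal strength Cp from the true parameters and a fitted rule.

# True parameters: mu1, mu2, Sigma. Fitted quantities: mu_hat, D_hat (diagonal), eta_hat.
theoretical_error <- function(mu1, mu2, Sigma, mu_hat, D_hat, eta_hat) {
  a   <- eta_hat / sqrt(D_hat)                  # D_hat^{-1/2} eta_hat
  num <- sum((mu1 - mu_hat) * a)                # (mu1 - mu_hat)^t D_hat^{-1/2} eta_hat
  den <- sqrt(drop(t(a) %*% Sigma %*% a))       # sqrt(eta^t D^{-1/2} Sigma D^{-1/2} eta)
  pnorm(-num / den)                             # Phi(-Psi)
}

# Signal strength with D = diag(Sigma):
# Cp <- sum((mu1 - mu2)^2 / diag(Sigma))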

We aim to find a linear classifier such that the performance is as good as the optimal rule asymptotically.

The classification error of an admissible classifier should be approximately equal to the classification error of the optimal Bayes rule. We define the asymptotic optimality and sub-optimality of a classifier in terms of the classification error, similarly to the definitions in Shao et al. (2011).

Definition 3.1. δˆ is asymptotically optimal if W(δˆ, θ)/W(δOPT, θ) →p 1.

Definition 3.2. δˆ is asymptotically sub-optimal if W(δˆ, θ) − W(δOPT, θ) →p 0.

Note that asymptotic optimality is equivalent to asymptotic sub-optimality when W (δOPT ) → c > 0.

Besides, asymptotic optimality implies asymptotic sub-optimality when W(δOPT) → 0. W(δˆηˆ, θ) is intrinsically related to the estimation error of ηˆ. In the following subsections, we first focus on the case where the covariance matrix Σ is known and equal to the identity, where the L2 error result in Chapter 2 can be applied directly. General results for unknown Σ are provided in the later subsection.

3.2.1 Known Covariance Matrix Σ = Ip

In this subsection, in order to draw a close connection between estimation and classification without involving the estimation of an unknown covariance matrix, we assume Σ = Ip is known. The same conclusion holds when Σ is a known general covariance matrix, since we could transform the data to make the new

covariance matrix is an identity and re-parameterize all other parameters. The same sparsity assumption,

which will be introduced later, is needed for the transformed mean difference.

When Σ = Ip, the optimal rule classifies X to class 1 if δOPT(X) = (X − µ)^t(µ1 − µ2) > 0. The signal strength is Cp = ||µ1 − µ2||^2, with d = µ1 − µ2 and dˆ = µˆ1 − µˆ2. Denote the estimation error as εp = E||ηˆ − d||^2. The Independence Rule induced by ηˆ is δˆηˆ(X) = (X − µˆ)^t ηˆ. The classification error of δˆηˆ(X) is

W(δˆηˆ, θ) = Φ(−Ψ) = Φ(−(µ1 − µˆ)^t ηˆ / ||ηˆ||),

whereas the classification error of the optimal rule is

W(δOPT, θ) = Φ(−||µ1 − µ2||/2).

To analyze Ψ, first consider the denominator. By the triangle inequality, −||ηˆ − d|| + ||d|| ≤ ||ηˆ|| ≤ ||ηˆ − d|| + ||d||. Since ||ηˆ − d|| = Op(\sqrt{εp}), we have

||µ1 − µ2|| − Op(\sqrt{εp}) ≤ ||ηˆ|| ≤ ||µ1 − µ2|| + Op(\sqrt{εp}).

For the numerator, we have

(µ1 − µˆ)^t ηˆ = (1/2)[(µ1 − µ2)^t d + (µ1 − µ2)^t(ηˆ − d) + (µ1 − µˆ1)^t d + (µ2 − µˆ2)^t d + (µ1 + µ2 − µˆ1 − µˆ2)^t(ηˆ − d)]
               =: (1/2)[(µ1 − µ2)^t d + I1 + I2 + I3 + I4].

The leading term is (µ1 − µ2)^t d = ||µ1 − µ2||^2. For the remaining terms, we bound each one separately. |I1| ≤ Op(\sqrt{||µ1 − µ2||^2 εp}). I2 ∼ N(0, ||µ1 − µ2||^2/n) and I3 ∼ N(0, ||µ1 − µ2||^2/n). Therefore I2 = Op(\sqrt{Cp/n}) and I3 = Op(\sqrt{Cp/n}).

Next we bound I4. Denote e1 = µˆ 1 − µ1 ∼ Np(0, Ip/n) and e2 = µˆ 2 − µ2 ∼ Np(0, Ip/n). Therefore

µ1 + µ2 − µˆ 1 − µˆ 2 = −(e1 + e2) and ηˆ − d = ηˆ(µ1 − µ2 + (e1 − e2)) − (µ1 − µ2). Notice that e1 + e2 and

e1 −e2 are independent since e1 and e2 are independent and follow Np(0, Ip/n). Therefore µ1 +µ2 −µˆ 1 −µˆ 2

and ηˆ − d are independent. Hence EI4 = 0 and

Var I4 = E((µ1 + µ2 − µˆ1 − µˆ2)^t(ηˆ − d))^2 = \frac{2}{n} Op(εp).

Therefore I4 = Op(\sqrt{εp/n}). Asymptotically, we have

Ψ = \frac{Cp/2 − Op(\sqrt{Cp εp}) − Op(\sqrt{Cp/n}) − Op(\sqrt{εp/n})}{\sqrt{Cp} + Op(\sqrt{εp})}.

Therefore δˆηˆ is asymptotically sub-optimal if

εp = o(Cp).

Using Lemma 1 in Shao et al. (2011), let ξn = Cp/4 and

τn = \frac{Op(\sqrt{Cp εp}) + Op(\sqrt{Cp/n}) + Op(\sqrt{εp/n})}{Op(\sqrt{Cp εp}) + Cp}.

If

Cpεp → 0 and Cp = o(n);

we can easily verify that ξn → ∞, τn → 0 and τnξn → 0; therefore δˆηˆ is asymptotically optimal. To sum up, we have Theorem 3.1.

Theorem 3.1. Suppose εp = E||ηˆ − d||^2 and n, p → ∞. The classification error is

Φ\left(−\frac{Cp/2 − Op(\sqrt{Cp εp}) − Op(\sqrt{Cp/n}) − Op(\sqrt{εp/n})}{\sqrt{Cp} + Op(\sqrt{εp})}\right).

If εp = o(Cp), the classification rule is sub-optimal. If Cpεp → 0 and Cp = o(n), the classification rule is optimal.

3.2.2 Unknown Covariance Matrix

When the covariance matrix is unknown, we only estimate the diagonal elements of the covariance matrix. The features cannot be strongly correlated if the Independence Rule is to perform well. Let R = D^{-1/2} Σ D^{-1/2} be the correlation matrix and λ1 be a positive constant not growing with n and p. We assume λ1^{-1} ≤ λmin(R) ≤ λmax(R) ≤ λ1 and min_{1≤i≤p} σii > 0.

When the covariance matrix is unknown, we first define the signal part and the noise part. Without loss of generality, S = {1, 2, ..., sp} is the non-zero index set while S^c = {sp + 1, sp + 2, ..., p} is the zero index set of d. Write d = (d1^t, 0^t)^t and ηˆ = (ηˆ1^t, ηˆ2^t)^t, where ηˆ1 and d1 are sp-dimensional and ηˆ2 is (p − sp)-dimensional. εp is defined as the error of estimating the nonzero elements of d: εp = E||ηˆ1 − d1||^2. We assume the following two conditions on dˆ:

Condition 3.1. If |dˆi| ≤ bn, then ηˆi = 0.

Condition 3.2. sup_{i∈S^c} E(ηˆi^4) < ∞.

Recall that E||ηˆ − η||^2 = εp + E||ηˆ2||^2. Condition 3.1 indicates that ηˆi is a thresholded estimator, while Condition 3.2 implies that the tail of ηˆi cannot be too heavy for the zero elements. Conditions 3.1 and 3.2 are used to control E||ηˆ2||^2. The signal strength is still defined as Cp = (µ1 − µ2)^t D^{-1}(µ1 − µ2). Theorem 3.2 shows that W(δˆηˆ, θ) is asymptotically close to W(δOPT, θ) as both n and p diverge, under growth rate constraints among Cp, sp and εp.

Theorem 3.2. Suppose ηˆ satisfies Conditions 3.1 and 3.2 and δˆηˆ is the classification rule induced by ηˆ. We assume n → ∞, p → ∞ and log(p − sp)/n ≺ bn^2 ≺ 1. We have

W(δˆηˆ, θ) ≤ Φ\left(−\frac{Cp/2 − Op(\sqrt{sp εp/n}) − Op(\sqrt{εp Cp}) − Op(\sqrt{Cp/n})}{\sqrt{λ1}\,(\sqrt{Cp} + Op(\sqrt{εp}))}(1 + op(1))\right).

Furthermore, if εp = o(min(Cp, nCp^2/sp)) and nCp → ∞, then

W(δˆηˆ, θ) ≤ Φ\left(−\frac{\sqrt{Cp}}{2\sqrt{λ1}}(1 + op(1))\right).

If Cp → c < ∞, δˆηˆ is asymptotically optimal; if Cp → ∞, δˆηˆ is asymptotically sub-optimal.

It is not necessary to assume sp ≺ n in Theorem 3.2: the number of relevant features could be larger than the number of observations. Theorem 3.2 reveals the relationship between the estimation error εp and the classification error W(δˆηˆ, θ) explicitly. The bad performance of the Independence Rule in Theorem 1 of Fan and Fan (2008) and of LDA in Theorem 2 of Shao et al. (2011) is due to simply using the sample mean difference to estimate η. In those theorems Cp needs to dominate \sqrt{p/n} to overcome the estimation loss. However, if we put sparsity assumptions on η and use a thresholded estimator satisfying Conditions 3.1 and 3.2, the condition on Cp can be relaxed: Cp should grow faster than max(εp, \sqrt{sp εp/n}), instead of growing faster than \sqrt{p/n}, to achieve asymptotic sub-optimality.

Meanwhile, we have two remarks based on Theorem 3.2.

Remark 3.1. log(p − sp)/n ≺ bn^2 ≺ 1 guarantees that the noisy features can be eliminated.

If i ∈ S^c, \sqrt{n/2}\,dˆi follows t_{2n−2}, the centered t distribution with 2n − 2 degrees of freedom, and max_{i∈S^c} |dˆi| = Op(\sqrt{log(p − sp)/n}); therefore log(p − sp)/n ≺ bn^2 guarantees that all noisy features are eliminated with probability tending to 1.

Remark 3.2. The simple hard thresholding estimator with threshold bn satisfies Conditions 3.1 and 3.2; in this special case εp has the same order as 2sp/n + Cp/(4n), and sp/n = o(Cp) guarantees εp = o(min(Cp, nCp^2/sp)).

The hard thresholding estimator is written as ηˆi = dˆi 1_{|dˆi| > bn}. Condition 3.1 is satisfied trivially. When the sample size n ≥ 4, since \sqrt{n/2}\,dˆi follows the t distribution with 2n − 2 degrees of freedom for i ∈ S^c, Condition 3.2 is satisfied. When i ∈ S, recall that \sqrt{n/2}\,dˆi follows the non-central t distribution with 2n − 2 degrees of freedom and noncentrality parameter \sqrt{n/2}\,di. By a gamma function approximation we have

εp = \sum_{i∈S} E(dˆi − di)^2 = \sum_{i∈S} [E(dˆi − Edˆi)^2 + (Edˆi − di)^2]
   = \frac{2(n−1)}{n(n−2)} sp + \Big(\frac{n−1}{n−2} − 2\sqrt{n−1}\,\frac{Γ(n−3/2)}{Γ(n−1)} + 1\Big) \sum_{i∈S} di^2
   = \frac{2(n−1)}{n(n−2)} sp + \frac{n}{2(n−2)(2n−3)} Cp + O(Cp/n) ∼ 2sp/n + Cp/(4n).

Thus sp/n = o(Cp) guarantees εp = o(min(Cp, nCp^2/sp)). Therefore, if the simple hard thresholding estimator is chosen, the conditions we need are log(p − sp)/n ≺ bn^2 ≺ 1, sp/n = o(Cp) and nCp → ∞. Recall that in previous works Cp needs to dominate \sqrt{p/n} to obtain asymptotic optimality. In many applications sp ≺ p, therefore the condition on Cp is weakened if we use the simple hard thresholding estimator. Naturally, the estimation accuracy of η could be improved further by using other sparse estimators.

One interesting question is, when Cp → ∞, what conditions are needed to guarantee optimality. Theorem 3.3 provides an answer.

Theorem 3.3. Suppose ηˆ satisfies Conditions 3.1 and 3.2 and δˆηˆ is the classification rule induced by ηˆ. If εp = o(min(1/Cp, n/sp)), log p = o(n), log(p − sp)/n = o(bn^2) and bn → 0, then δˆηˆ is asymptotically optimal as n → ∞, p → ∞ and Cp → ∞.

We need a slower growth rate of Cp in Theorem 3.3. If Cp diverges to infinity too fast, the classification task is relatively easy, but W(δOPT, θ) converges to 0 faster than W(δˆ, θ), and therefore our classification rule is not optimal. However, if Cp diverges to infinity slowly, the convergence rates of W(δOPT, θ) and W(δˆ, θ) are comparable, and we can prove that the ratio of the two converges to 1. For the simple thresholded estimator, we have the following remark:

Remark 3.3. For the simple hard thresholding estimator, sp = o(\sqrt{n}) and Cp = o(\sqrt{n}) guarantee asymptotic optimality.

If ηˆi = dˆi 1_{|dˆi| > bn}, we have εp ∼ 2sp/n + Cp/(4n). Therefore if sp = o(\sqrt{n}) and Cp = o(\sqrt{n}), we have εp = o(min(1/Cp, n/sp)). Theorem 3 in Shao et al. (2011) puts stronger conditions on sp and Cp than ours when the L0 sparsity measure is used.

Any good estimator of the sparse mean difference should have a small estimation error, i.e., a slow growth rate of εp. Our previous work has shown that DP estimators have estimation error E||ηˆ − d||^2 ∼ sp log(p/sp)/n, which is asymptotically optimal in the minimax sense. We apply the DP estimator in the first step and construct the corresponding classification rule. The gain in estimation accuracy in the first step results in better classification performance. This classifier is denoted as the DP classifier in the following sections.

3.3 Theory for DP Classifier

Denote each element of dˆ = Dˆ^{-1/2}(µˆ1 − µˆ2) as dˆi. For each dˆi, we have

dˆi =_d \frac{di + Z}{\sqrt{V_{2n−2}/(2n − 2)}},

where Z ∼ N(0, 2/n), V_{2n−2} ∼ χ^2_{2n−2}, and Z and V_{2n−2} are independent, so dˆi follows a (scaled) non-central t distribution. Therefore, as the sample size goes to infinity, it is reasonable to assume

dˆj ∼ N(dj, 2/n),    ηj ∼ G,    j = 1, 2, ..., p,

where G is an unknown prior. Let ηˆ = ηˆ(dˆ) be an empirical Bayes estimator of d.
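A quick simulation sketch (our own illustration, not part of the thesis code) of the representation above suggests why this normal working model is reasonable for moderate n:

set.seed(1)
n <- 50; d_j <- 1
Z <- rnorm(1e4, 0, sqrt(2 / n))                   # Z ~ N(0, 2/n)
V <- rchisq(1e4, df = 2 * n - 2)                  # V_{2n-2} ~ chi^2_{2n-2}
d_hat <- (d_j + Z) / sqrt(V / (2 * n - 2))        # the scaled noncentral-t statistic
c(mean(d_hat), var(d_hat), 2 / n)                 # sample mean/variance vs. N(d_j, 2/n)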

In this subsection we assume d is an sp-sparse vector. In Chapter 2, the DP prior based method was proposed

with the estimation error εp ∼ sp log(p/sp)/n. This method could be directly applied in classification problems. To get the desired estimation error, we need to assume all entries of the mean difference vector

are independent. Therefore in the comparison of different classifiers in theory, we will focus on the case where

Σ = Ip. However in the implementation part we will show when the entries are correlated, our method still has good performance. First we rewrite Theorem 3.2 and 3.3 in terms of the above empirical Bayes classifier:

Theorem 3.4. Suppose Σ = Ip and the DP classifier is used. If sp log(p/sp) = o(nCp), the classification rule is sub-optimal. If Cp sp log(p/sp) = o(n), the classification rule is optimal.

The conclusions provided in the above subsections have close connection with Shao et al. (2011) and Fan and Fan (2008). First we focus on the scenario where Cp → c > 0; then we move to the case where Cp → ∞.

When Cp → c > 0, asymptotic optimality is equivalent to asymptotic sub-optimality. Since Cp → c makes the classification task more difficult, the conditions on estimation accuracy needed to guarantee asymptotic optimality are stricter. From the proof we can easily see that E||ηˆ − d||^2/n → 0 is the sufficient and necessary condition for asymptotic optimality. For the DP classifier, sp log p/n → 0 is needed to guarantee that the classification error of our method converges to the same limit as that of the optimal rule; that is, asymptotic optimality requires sp log p/n → 0. This is better than the Independence Rule summarized in Fan and Fan (2008), whose asymptotic optimality requires p/n → 0. FAIR, introduced in Fan and Fan (2008), uses feature selection before applying the Independence Rule; when Cp → c, sp log p/n → 0 intrinsically implies its asymptotic optimality. Sparse LDA proposed in Shao et al. (2011) still requires sp log p/n → 0 in the estimation of the mean difference vector. Table 3.1 summarizes the conditions.

Method Condition on Estimation Accuracy

DP Classifier sp log p/n → 0

Independence Rule p/n → 0

FAIR sp log p/n → 0

Sparse LDA sp log p/n → 0

Table 3.1: Condition Summary of Asymptotic Optimality for different classifiers in Chapter 3 when Cp → c > 0

When Cp goes to infinity, asymptotic optimality is not equivalent to asymptotic sub-optimality. The condition on the estimation error is relaxed since the classification task is easier. First we focus on asymptotic sub-optimality. For the DP classifier ηˆ, Cp ≻ sp log p/n guarantees that the classification error goes to 0. For the Independence Rule and LDA with known covariance matrix, Cp ≻ \sqrt{p/n} guarantees asymptotic sub-optimality. For FAIR, Cp must be greater than sp log p/n to guarantee asymptotic sub-optimality. A similar conclusion holds for Sparse LDA. A summary of the conditions is shown in Table 3.2.

Method Condition on Estimation Accuracy

Independence Rule Cp ≻ \sqrt{p/n}

LDA with known covariance matrix Cp ≻ \sqrt{p/n}

FAIR with Σ = Ip known Cp ≻ sp log p/n

Sparse LDA Cp ≻ sp log p/n

IR induced by ηˆ with Σ = Ip known Cp ≻ sp log p/n

Table 3.2: Conditions to Guarantee Asymptotic Sub-optimality for different classifiers in Chapter 3 if Cp → ∞

Shao et al. (2011) introduced asymptotic optimality for the first time. Our DP classifier ηˆ is asymptotically optimal if Cp grows slower than n/(sp log p). For Sparse LDA, similarly Cp ≺ n/(sp log p) is needed for asymptotic optimality. The summary of the different conclusions is in Table 3.3.

Method Condition

IR induced by ηˆ with Σ = Ip known Cp ≺ n/(sp log p)

Sparse LDA Cp ≺ n/(sp log p)

Table 3.3: Conditions to Guarantee Asymptotic Optimality for different classifiers in Chapter 3 if Cp → ∞

3.4 Implementation for Empirical Bayes Classifier

Given dˆ, in this section we build an empirical Bayes model with a Dirichlet process prior to estimate η. We assume dˆj ∼ N(ηj, 1) and ηj ∼ G, where the prior G is unknown. Since most of the ηj's are zero, the dˆj's will concentrate around zero, forming a large cluster at 0 and several other clusters far away from 0. The Dirichlet process mixture model is one of the Bayesian tools for capturing such clustering behavior (see Lo (1984)). We build a hierarchical Bayes model and assume G ∼ DP(α, G0), where α is the concentration parameter and G0 is the base measure. To guarantee that G has a positive mass at 0, we model G0 as a normal distribution with a point mass at 0, that is, G0 = w0 δ0 + (1 − w0) N(0, σ0^2), where w0 and σ0^2 are two pre-specified parameters and δ0 is the Dirac measure at 0. We apply the same data generating process as in Section 2.7 and the same variational algorithm as in Section 2.4. The resulting posterior mean estimator for the kth normalized mean difference ηk is ηˆk^{DP} = \sum_{t=1}^{T} wˆ_{kt} mˆ_t. The linear classification rule induced by ηˆDP = (ηˆ1^{DP}, ..., ηˆp^{DP})^t is: we classify X to class 1 if and only if

δˆDP(X) = (X − µˆ)^t Dˆ^{-1/2} ηˆDP > 0.

We refer to ηˆDP as the DP estimator and the corresponding classifier as the Dirichlet process classifier (DP classifier).

Additional sparsity can be introduced to the DP estimator. Since we use the posterior mean as the estimator, the resulting ηˆ is a shrinkage estimator of the true mean difference but not necessarily sparse. To have better performance in the high dimensional, extremely sparse case, we revise the original DP estimator by thresholding on the posterior probability of being 0: if the posterior probability P(ηk = 0 | X) > κ, then ηˆk^{SDP} = 0; otherwise ηˆk^{SDP} = ηˆk^{DP}, where κ is a tuning parameter that could be determined by cross validation. In all the simulation studies we fix κ = 0.5, since choosing the threshold 0.5 is equivalent to taking a MAP estimator of the index set of zeros. We refer to ηˆSDP = (ηˆ1^{SDP}, ..., ηˆp^{SDP})^t as the Sparse DP estimator and the resulting linear classifier δˆSDP(X) = (X − µˆ)^t Dˆ^{-1/2} ηˆSDP as the Sparse DP classifier. The Sparse DP estimator is a thresholded estimator whereas the DP estimator is not; therefore the Sparse DP estimator satisfies Condition 3.1.

Sparse DP estimator could eliminate noise of irrelevant features to enhance classification performance.
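The following R sketch (hypothetical variable names; the point mass of the base measure is assumed to be the first mixture component) shows how the DP and Sparse DP estimators are assembled from a fitted variational approximation.

# m_hat[t]   : cluster centers, with m_hat[1] = 0 assumed to be the point mass at 0
# w_hat[k, t]: posterior mixing weights for coordinate k
dp_estimators <- function(w_hat, m_hat, kappa = 0.5) {
  eta_dp  <- as.vector(w_hat %*% m_hat)          # posterior mean: sum_t w_kt * m_t
  p_zero  <- w_hat[, 1]                          # posterior probability P(eta_k = 0 | data)
  eta_sdp <- ifelse(p_zero > kappa, 0, eta_dp)   # Sparse DP: set to 0 when P(eta_k = 0) > kappa
  list(DP = eta_dp, SparseDP = eta_sdp)
}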

One practical issue with both the DP and Sparse DP estimators is that, when η is extremely sparse, we might occasionally end up with the MAP estimator Gˆ = δ0. This is due to the "rich get richer" property of the Dirichlet process prior. A remedy in this extreme case is to randomly and equally divide all p sample mean differences into

I folds. For each fold of data we use the Dirichlet process mixture model to estimate a discrete prior Gˆi. Then we average all the discrete priors to get an overall estimate Gˆ = \sum_{i=1}^{I} Gˆi / I. We use this refinement to estimate η for both the DP estimator and the Sparse DP estimator. The rationale behind this "batch" processing idea is that, when we divide the elements of a high-dimensional vector into several batches, not only do the relatively large elements pop out (because the maximum of the noise decreases with a smaller sample size), but also the probability that all the Gˆi's equal δ0 is extremely small. The chance of detecting signals is increased. This refinement naturally leads to a parallelized variational Bayes algorithm: we can run our algorithm

for every batch and then average the estimated prior.
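A minimal sketch of the batch refinement, assuming a routine fit_dp_prior() (a stand-in for the variational algorithm of Chapter 2) that returns a discrete prior as a data frame with columns atom and weight:

batch_prior <- function(d_hat, I = 5, fit_dp_prior) {
  # randomly and equally split the p coordinates into I folds
  folds  <- split(sample(seq_along(d_hat)), rep_len(seq_len(I), length(d_hat)))
  priors <- lapply(folds, function(idx) fit_dp_prior(d_hat[idx]))
  G <- do.call(rbind, priors)
  G$weight <- G$weight / I                     # average of the I estimated discrete priors
  G
}
# Each fold can be fit in parallel, e.g. with parallel::mclapply in place of lapply.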

3.5 Empirical Studies

In this section, we conduct three simulation studies and apply our method to real data examples. The corresponding R package VBDP is available at https://github.com/yunboouyang/VBDP; it includes code to estimate a sparse Gaussian sequence and code to construct the DP and Sparse DP classifiers. A real data example is also included in this package. The source code and simulation results are available at https://github.com/yunboouyang/EBclassifier. The parameter specification is also summarized in the source code.

We also include a column "Hard Thresh DP" for comparison: the Hard Threshold DP classifier uses the same threshold as the Sparse DP classifier, but instead of using the posterior mean, it uses the sample mean difference to estimate d whenever the posterior probability at 0 is below the threshold. εp is large for the Hard Threshold DP classifier but small for the DP and Sparse DP classifiers, because only the last two methods apply shrinkage. The purpose of including the Hard Threshold DP classifier is to demonstrate the influence of εp on the classification error W(δˆ, θ): if εp is large, W(δˆ, θ) should be large. We do not recommend using the Hard Threshold DP classifier in practice.

3.5.1 Simulation Studies

We conducted three simulation studies. The first two are the same as in Greenshtein and Park (2009). In the third simulation study we compare our methods with Fan and Fan (2008) in the same setting.

Simulation Study 1. We assume Σ is diagonal. Without loss of generality, we set µ2 = 0 and µ1 ≠ 0. We use (∆, l, Cp) to denote different configurations of µ1: the first l coordinates of µ1 all equal ∆ while the remaining entries are either all 0 or sampled from N(0, 0.1^2). In the first simulation study, Σ = s^2 Ip, where s^2 = 25/2 and p = 10^4. The sample size of each class is n = 25. We compute the theoretical classification error using the true mean and the true covariance matrix. We repeat the procedure 100 times and the average theoretical classification errors are reported in Table 3.4 and Table 3.5, corresponding to the two configurations of µ1. Bold face in all tables indicates the lowest classification error in each row.

(∆, l, Cp) Hard Thresh DP Sparse DP DP EB IR FAIR glmnet

(1, 2000, 160) 0.0046 0.0003 0.0002 0.0004 0.0049 0.1211 0.4280

(1, 1000, 80) 0.0874 0.0454 0.0283 0.0428 0.0885 0.2393 0.4500

(1, 500, 40) 0.2423 0.2036 0.1858 0.2015 0.2435 0.3423 0.4750

(1.5, 300, 54) 0.1756 0.1303 0.1059 0.1160 0.1767 0.2222 0.4146

(2, 200, 64) 0.1362 0.0540 0.0412 0.0518 0.1372 0.1039 0.3046

(2.5, 100, 50) 0.1937 0.0449 0.0422 0.0585 0.1947 0.0852 0.2126

(3, 50, 36) 0.2652 0.0470 0.0677 0.0772 0.2665 0.0982 0.1498

(3.5, 50, 49) 0.1957 0.0066 0.0175 0.0152 0.1965 0.0229 0.0655

(4, 40, 51.2) 0.1883 0.0023 0.0059 0.0072 0.1901 0.0101 0.0332

Table 3.4: Classification error for simulation study 1, p = 10^4, p − l entries are 0

(∆, l, Cp) Hard Thresh DP Sparse DP DP EB IR FAIR glmnet

(1, 2000, 166.4) 0.0035 0.0002 0.0001 0.0003 0.0038 0.1128 0.4216

(1, 1000, 87.2) 0.0699 0.0395 0.0241 0.0352 0.0710 0.2311 0.4551

(1, 500, 47.6) 0.2046 0.1948 0.1686 0.1751 0.2063 0.3280 0.4783

(1.5, 300, 61.8) 0.1450 0.1173 0.0976 0.0996 0.1465 0.2075 0.4190

(2, 200, 71.8) 0.1102 0.0470 0.0372 0.0431 0.1113 0.1011 0.3158

(2.5, 100, 57.9) 0.1583 0.0392 0.0415 0.0488 0.1595 0.0815 0.1945

(3, 50, 44) 0.2248 0.0444 0.0674 0.0687 0.2265 0.0969 0.1692

(3.5, 50, 57) 0.1637 0.0065 0.0119 0.0146 0.1655 0.0226 0.0640

(4, 40, 59.2) 0.1539 0.0019 0.0056 0.0057 0.1551 0.0088 0.0324

Table 3.5: Classification error for simulation study 1, p = 10^4, p − l entries are generated from N(0, 0.1^2)

Table 3.4 and Table 3.5 compare DP and Sparse DP classifier with several existing methods: Empirical

Bayes classifier (EB) by Greenshtein and Park (2009), Independence Rule (IR) by Bickel and Levina (2004),

Feature Annealed Independence Rule (FAIR) by Fan and Fan (2008) and logistic regression with lasso using

R package glmnet (denoted as glmnet).

DP and Sparse DP methods dominate other methods in the diagonal covariance matrix case whether the

mean difference is sparse or not. If the mean difference vector is extremely sparse while the signal is strong,

Sparse DP classifier outperforms DP classifier. In the relatively dense signal case, DP classifier outperforms

sparse DP classifier.

To illustrate how estimation accuracy influences classification accuracy, we compute εp, the mean squared error (MSE) of estimating µ1 − µ2, for the DP estimators. The estimation error and classification error of the DP methods are shown in Figure 3.1.

From Figure 3.1 we conclude that Cp plays the most important role in determining the magnitude of the classification error. In the first three scenarios, as sp decreases, both Cp and the estimation error εp decrease; the resulting estimation error decreases but the classification error increases. Therefore, when Cp is large, even if the estimation error is large, the classification error is still small. From the first scenario to the last, sp decreases and the estimation error also decreases, which is consistent with the asymptotic order of the estimation error sp log(p/sp). The trend of the classification error depends on both the magnitude of Cp and sp.

Overall, the DP and Sparse DP estimators improve the estimation accuracy of the nonzero true mean difference while ruling out irrelevant features. The Hard Thresh DP classifier has similar performance to IR,

37 0.3 3000

error type classification estimation

0.2 2000 estimation error classification error

0.1 1000

0.0 0

(1, 2000, 160) (1, 1000, 80) (1, 500, 40) (1.5, 300, 54) (2, 200, 64) (2.5, 100, 50) (3, 50, 36) (3.5, 50, 49) (4, 40, 51.2) scenario

Figure 3.1: Estimation error and classification error of DP methods when p − l entries are 0. Red boxplot indicates estimation error while blue boxplot indicates classification error.

indicating that if the estimation error is not well controlled, classification accuracy cannot be guaranteed.

Simulation Study 2. We consider an AR(1) covariance structure Σ = s^2 R, where s^2 = 25/2. That is, the correlation satisfies R_{ij} = Corr(X_{kmi}, X_{kmj}) = ρ^{|i−j|}, k = 1, 2, 1 ≤ m ≤ n, 1 ≤ i, j ≤ p, and p = 10^4. The sample size of each class is n = 25. We consider three different configurations of µ1 in this simulation study. The simulation results, based on 100 repetitions, are shown in Tables 3.6, 3.7 and 3.8, which compare the theoretical

classification error.

ρ Hard Thresh DP Sparse DP DP EB IR FAIR glmnet

0.3 0.0089 0.0031 0.0021 0.0022 0.0092 0.1276 0.4325

0.5 0.0235 0.0135 0.0105 0.0096 0.0237 0.1393 0.4340

0.7 0.0714 0.0539 0.0468 0.0430 0.0712 0.1800 0.4437

0.9 0.2079 0.1929 0.1867 0.1758 0.2073 0.2702 0.4586

Table 3.6: Classification error for simulation study 2, p = 10^4, 2000 entries of µ1 are 1. Other entries are generated from N(0, 0.1^2)

ρ Hard Thresh DP Sparse DP DP EB IR FAIR glmnet

0.3 0.0237 0.0081 0.0056 0.0068 0.0243 0.0546 0.2315

0.5 0.0481 0.0233 0.0183 0.0203 0.0483 0.0699 0.2792

0.7 0.1036 0.0686 0.0612 0.0619 0.1033 0.1095 0.2978

0.9 0.2472 0.2111 0.2054 0.2024 0.2466 0.2308 0.3603

Table 3.7: Classification error for simulation study 2, p = 10^4, 1000 entries of µ1 are 1 and 100 entries are 2.5. Other entries are generated from N(0, 0.1^2)

ρ Hard Thresh DP Sparse DP DP EB IR FAIR glmnet

0.3 0.0233 0.0037 0.0032 0.0038 0.0238 0.0226 0.0913

0.5 0.0478 0.0139 0.0129 0.0138 0.0475 0.0374 0.1290

0.7 0.1069 0.0508 0.0493 0.0502 0.1069 0.0801 0.1747

0.9 0.2445 0.1827 0.1871 0.1834 0.2441 0.2067 0.2971

Table 3.8: Classification error for simulation study 2, p = 10^4, 1000 entries of µ1 are 1 and 50 entries are 3.5. Other entries are generated from N(0, 0.1^2)

The DP family (except Hard Thresh DP) and EB are among the best methods under this AR(1) correlation structure. If the correlation is severe and there are no very large mean differences, EB performs better. If the correlation is not extremely severe or there are some large mean differences, the DP classifier performs better. As ρ gets larger, the classification error keeps increasing for every method; the Sparse DP classifier and the DP classifier can still be considered two relatively good classifiers given that we only have very few data points.

Simulation Study 3. We consider the same setting used in Fan and Fan (2008). The error vector is no longer normal and the covariance matrix has a group structure. All features are divided into 3 groups. Within each group, features share one unobservable common factor with different factor loadings. In addition, there is an unobservable common factor among all the features across 3 groups. p = 4500 and n = 30. To

construct the error vector, let Z_{ij} be a sequence of independent standard normal random variables and χ_{ki} be a sequence of independent random variables with the same distribution as (χ^2_6 − 6)/\sqrt{12}. Let aj and bj be factor loading coefficients. Then the error vector for each class is defined as

ε_{ij} = \frac{Z_{ij} + a_{1j}χ_{1i} + a_{2j}χ_{2i} + a_{3j}χ_{3i} + b_j χ_{4i}}{\sqrt{1 + a_{1j}^2 + a_{2j}^2 + a_{3j}^2 + b_j^2}},    i = 1, 2, ..., 30,  j = 1, 2, ..., 4500,

where a_{ij} = 0 except that a_{1j} = aj for j = 1, ..., 1500, a_{2j} = aj for j = 1501, ..., 3000, and a_{3j} = aj for j = 3001, ..., 4500. Therefore E(ε_{ij}) = 0 and Var(ε_{ij}) = 1, and in general the within-group correlation is greater than the between-group correlation. The factor loadings aj and bj are independently generated from uniform distributions U(0, 0.4) and U(0, 0.2), respectively. The mean vector µ1 is taken from a realization of the mixture of a point mass at 0 and a double exponential distribution, (1 − c)δ0 + (c/2)exp(−2|x|), where c = 0.02, and µ2 = 0.

Figure 3.2: Boxplot and scatter plot of classification errors of 4 methods
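For concreteness, a short R sketch (our own illustration under the design just described) of how one class's training errors might be generated:

set.seed(1)
n <- 30; p <- 4500
a <- runif(p, 0, 0.4); b <- runif(p, 0, 0.2)               # factor loadings
group <- rep(1:3, each = p / 3)                            # three feature groups
chi   <- matrix((rchisq(4 * n, df = 6) - 6) / sqrt(12), n, 4)  # common factors per sample
Z     <- matrix(rnorm(n * p), n, p)
E <- sapply(1:p, function(j) {                             # n x p error matrix
  (Z[, j] + a[j] * chi[, group[j]] + b[j] * chi[, 4]) /
    sqrt(1 + a[j]^2 + b[j]^2)
})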

Hard Thresh DP, Sparse DP, FAIR and glmnet to 400 test samples generated from the same process and calculate the average classification error. We also compare these methods to the oracle procedure, which we know the location of each nonzero element in µ2 vector and use these nonzero elements to construct Independence Rule. We have 100 repetitions. The boxplot and scatter plot of the classification error of above four methods are summarized in Figure 3.2 and the average classification error is summarized in

Table 3.9.

Oracle Hard Thresh DP Sparse DP FAIR glmnet

0.0021 0.0150 0.0126 0.0168 0.0252

Table 3.9: Average Misclassification Rate for Simulation Study 3

Both the Hard Thresh DP and Sparse DP classifiers are better than FAIR and outperform logistic regression with Lasso. Even though on average the oracle procedure's classification error is smaller than that of the DP family classifiers, we can conclude from the plot that, for the majority of the 100 trials, the classification error of the Sparse DP classifier is comparable to that of the oracle procedure. The Sparse DP classifier still has very good performance except in some extreme cases.

3.5.2 Classification with Missing Features

In real applications such as sentiment analysis, we are often given a long paragraph to train the model but only a short sentence to test it. Generally speaking, in many classification problems it is often the case that a large number of features are used to construct the classification rule on the training dataset, but only a fraction of these features are available in the test data. We suspect that the Independence Rule has an advantage in this scenario, since it is difficult to exploit the correlation structure learned from the training data if half of the features are missing in the test data.

All classification rules built with the training dataset need to be adjusted before being applied to the test dataset. For all the induced Independence Rules, the missing feature values in the test dataset are replaced by the average of the two class means of that feature in the training dataset. For logistic regression based methods, the classification rule is built using only the regression coefficients of the non-missing features; to make it comparable with the Independence Rule, the intercept is determined by the average inner product of the regression coefficients and the available features.
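A small sketch (our own illustration) of this adjustment for the induced Independence Rules; since the rule scores (x − µˆ), imputing a missing coordinate by the average of the two class means makes its contribution to the score exactly zero.

impute_missing <- function(x_test, mu1_hat, mu2_hat) {
  miss <- is.na(x_test)
  x_test[miss] <- ((mu1_hat + mu2_hat) / 2)[miss]  # replace by average of class means
  x_test
}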

Simulation study 1 is rerun with only 50% of the features available in the test dataset. The classification errors are reported in Table 3.10, which shows that the classification error increases for every method; the DP and Sparse DP classifiers are still the best among all methods. A similar conclusion holds for Simulation study 2 (see Tables 3.11 and 3.12).

(∆, l, Cp) Hard Thresh DP Sparse DP DP EB IR FAIR glmnet

(1, 2000, 160) 0.096 0.041 0.0354 0.0446 0.0974 0.272 0.4632

(1, 1000, 80) 0.2466 0.1948 0.1683 0.1924 0.2482 0.3563 0.4803

(1, 500, 40) 0.3625 0.336 0.3192 0.3365 0.3633 0.4183 0.4916

(1.5, 300, 54) 0.3198 0.2813 0.2633 0.2716 0.3206 0.3463 0.4557

(2, 200, 64) 0.2896 0.2046 0.1905 0.2038 0.2906 0.2611 0.3871

(2.5, 100, 50) 0.3304 0.1925 0.1962 0.2092 0.3316 0.2431 0.3418

(3, 50, 36) 0.3771 0.1991 0.2444 0.2361 0.3776 0.2602 0.3028

(3.5, 50, 49) 0.3362 0.1131 0.1222 0.1423 0.3371 0.1588 0.2227

(4, 40, 51.2) 0.3292 0.0841 0.0867 0.1139 0.33 0.1158 0.1954

Table 3.10: Classification error with only 50% features available in the test data in Simulation Study 1, p = 10^4, p − l entries are 0

ρ Hard Thresh DP Sparse DP DP EB IR FAIR glmnet

0.3 0.0384 0.0157 0.0123 0.0142 0.0392 0.2175 0.4399

0.5 0.0552 0.0316 0.0269 0.027 0.0556 0.2108 0.4488

0.7 0.0963 0.0715 0.0641 0.0612 0.0958 0.2199 0.447

0.9 0.2176 0.1986 0.1919 0.1835 0.217 0.2801 0.4615

Table 3.11: Classification error with only 50% features available in the test data in Simulation Study 2, p = 10^4, 2000 entries of µ1 are 1. Other entries are generated from N(0, 0.1^2)

ρ Hard Thresh DP Sparse DP DP EB IR FAIR glmnet

0.3 0.1653 0.1329 0.1102 0.1208 0.1665 0.2996 0.472

0.5 0.1861 0.1634 0.1394 0.1432 0.1865 0.3077 0.4667

0.7 0.2324 0.2216 0.201 0.1938 0.2324 0.3201 0.4759

0.9 0.3283 0.3195 0.3124 0.3055 0.3278 0.3606 0.4838

Table 3.12: Classification error with only 50% features available in the test data in Simulation Study 2, p = 10^4, 1000 entries of µ1 are 1 and 100 entries are 2.5. Other entries are generated from N(0, 0.1^2)

3.5.3 Real Data Examples

Besides the simulation datasets, two real data examples are also included. First we focus on sentiment analysis for movie reviews. All the datasets are available at https://www.kaggle.com/c/word2vec-nlp-tutorial/data. Both the training and the test dataset have 25,000 reviews. The training reviews are labeled with 0 or 1, indicating negative or positive reviews respectively. After converting the reviews to vector features, we have 35,949 features. The methods we applied include DP, Sparse DP, Independence Rule, logistic regression with Lasso and EB. We consider two cases: (1) all features in the test data are available; (2) only half of the features in the test data are available. The prediction accuracy is shown in Table 3.13. When all features in the test dataset are available, logistic regression with Lasso has the best performance since the correlation among the features is taken into account. However, when only half of the features are available in the test dataset, Sparse DP and EB have the best performance. Figure 3.3 illustrates the magnitude of the coefficients for the glmnet and DP classifiers: the larger the magnitude, the larger the word. glmnet measures the conditional contribution of each feature given the others, so the review score is the most important feature. The DP classifier, in contrast, measures the marginal contribution of each feature, so only strongly emotional words have large magnitude. When only 50% of the features are available, measuring marginal contribution is better than measuring conditional contribution in terms of prediction accuracy.

DP Sparse DP IR glmnet EB

All Features Available 86.62% 86.79% 86.51% 90.12% 86.61%

Half Features Available 85.27% 85.36% 85.10% 84.62% 85.36%

Table 3.13: Classification accuracy of movie review sentiment analysis data

(a) Lasso    (b) DP

Figure 3.3: Word cloud of top 100 unigrams and bigrams in magnitude for coefficients for classifiers based on glmnet and DP. Red indicates negative sentiment and green indicates positive sentiment.

We also consider a leukemia data set which was first analyzed by Golub et al. (1999) and is widely used in the statistics literature. The data set can be downloaded from http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi. There are 7129 genes and 72 samples generated from two classes, ALL (acute lymphocytic leukemia) and AML (acute myelogenous leukemia). Among the 72 samples, the training data set has 38 (27 data points in ALL and 11 in AML) and the test data set has 34 (20 in ALL and 14 in AML). We compare the DP and Sparse DP classifiers with IR, EB and FAIR; the results are summarized in Table 3.14. For the DP and Sparse DP classifiers, we set α = 1, σ = 4, w = 0.9 and split the 7129 entries into 7 batches.

Method Training error Test error

FAIR 1/38 1/34

EB 0/38 3/34

IR 1/38 6/34

DP 1/38 2/34

Sparse DP 1/38 2/34

Hard Thresh DP 1/38 2/34

Table 3.14: Training error and test error of leukemia data set

From Table 3.14 we conclude that the DP family classifiers outperform the EB classifier. The EB classifier performs similarly to IR: its improvement over IR is marginal, whereas using the DP and Sparse DP classifiers results in some further improvement. Both the DP and EB classifiers shrink the mean difference but do not eliminate any irrelevant feature. The Sparse DP classifier selects 2092 features but has the same test error as the DP classifier. This might be due to the fact that this dataset is relatively well separated, so thresholding does not improve much.

3.6 Discussion

The contribution of this chapter is three-fold: first, we established the relationship between the estimation error and the classification error theoretically; second, we proposed two empirical Bayes estimators for the normalized mean difference and the induced linear classifiers; third, for estimating d, we developed a variational Bayes algorithm to approximate the posterior distribution of a Dirichlet process mixture model with a special base measure, and our algorithm can be parallelized using the "batch" idea.

Yet there are still many open problems and possible extensions related to this work. For example, instead of using the Independence Rule, we could develop a Bayesian procedure to threshold both the mean difference and the sample covariance matrix, in a spirit similar to Shao et al. (2011), Cai and Liu (2012) and Bickel and Levina (2008). Another extension is to relax the normality assumption: LDA is suitable for any elliptical distribution, therefore our work could also be extended to a larger family of distributions under sub-Gaussian constraints.

Chapter 4

Two-Stage Ridge Regression and Posterior Approximation of Bayesian Variable Selection

4.1 Introduction

In high dimensional regression we deal with a large number of covariates. Variable selection is important in this setting and a parsimonious model is preferable. The Ordinary Least Squares (OLS) estimator is not a good choice because of its poor prediction performance and difficult interpretation. In real applications such as bioinformatics, finance and image processing, the number of predictors p is often much larger than the sample size n. Without carefully choosing the important features, we could easily include noisy features in the final model, which increases the variance of our predictions. Therefore variable selection is necessary and important.

Variable selection has been thoroughly studied from the frequentist point of view. Most of the methods threshold and shrink the estimated regression coefficients via a penalized likelihood. Consider the high dimensional linear model

y_{n×1} = X_{n×p} β_{p×1} + ε_{n×1},    ε ∼ Nn(0, σ^2 In),

where both β and σ^2 are unknown. X is a fixed, normalized design matrix: without loss of generality, all diagonal elements of X^T X are equal to n.

By adding a penalty function, we search for a parsimonious model in a restricted space. The general form of the penalized likelihood estimator is

βˆ = \arg\max_{β}\; n^{-1} l_n(β) − \sum_{i=1}^{p} p_λ(|β_i|),

where l_n(·) is the log-likelihood function and p_λ is the penalty function with tuning parameter λ. The most popular penalties include, to name a few, the L1 penalty (Tibshirani, 1996), the SCAD penalty (Fan and Li, 2001) and the Minimax Concave Penalty (Zhang, 2010). Selection consistency has been established for the above methods. For an overall review of frequentist variable selection methods, see Fan and Lv (2010).

Prediction accuracy results have also been developed for the L1 penalty: prediction of a new response is considered and the predictive error ||Xβˆ − Xβ||^2 is of the order q log p under the Sparse Riesz Condition (Zhang and Huang, 2008) and the Restricted Eigenvalue Condition (Bickel et al., 2009), where q is the number of nonzero elements of β.
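For reference, an L1-penalized fit of this kind can be obtained with the glmnet package (already used for the lasso-logistic classifier in Chapter 3); the toy data below are only for illustration.

library(glmnet)
set.seed(1)
n <- 100; p <- 500
X <- matrix(rnorm(n * p), n, p)
beta <- c(rep(2, 5), rep(0, p - 5))              # sparse true coefficient vector
y <- X %*% beta + rnorm(n)
cv_fit   <- cv.glmnet(X, y, alpha = 1)           # alpha = 1: the L1 (lasso) penalty
beta_hat <- coef(cv_fit, s = "lambda.min")       # sparse estimate of beta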

Unlike the L1 penalty, the L2 penalty, corresponding to ridge regression (Hoerl and Kennard, 1970), cannot provide a sparse estimator of β. As a refinement, Shao and Deng (2012) added a thresholding step to the original ridge regression estimator.

One advantage of Bayesian variable selection methods is the ability to provide a posterior distribution, which is a powerful tool for calculating the posterior probability of the selected model and for constructing a credible set of the regression coefficients. One frequentist analogue that provides a distribution for inference is the confidence distribution (Xie and Singh, 2013). However, to our knowledge, there is no relevant work using confidence distributions in variable selection.

Asymptotics for the posterior distribution in Bayesian variable selection have been established in various settings. Ghosal (1999) established asymptotic normality of the posterior distribution of the regression coefficients in linear models when p^4 log p/n → 0. In high dimensional settings, Jiang (2007) considered consistency of the conditional density of y given X and established convergence rates over a Hellinger neighborhood. Posterior concentration for the "point mass spike and Laplacian slab" prior was established in Castillo et al. (2015); they give rates of contraction of the posterior distribution both with respect to the predictive error ||X(βˆ − β)||^2 and with respect to the parameter β under the L1, L2 and L∞ errors. The growth rate of the predictive error coincides with that of L1-type penalties. A two-component Laplacian mixture "spike and slab" prior was considered in Ročková and George (2015) and posterior concentration was also obtained. Martin et al. (2017) considered a data-dependent prior for β and also achieved posterior concentration; their prior for βS given the support set S is a normal distribution with the OLS estimator as the mean and a variance proportional to the Gram matrix.

An explicit form of the posterior distribution is often difficult to obtain in a hierarchical prior setup, but an approximate posterior distribution can be obtained via MCMC. The Stochastic Search Variable Selection method (George and McCulloch, 1993) was proposed to compute the posterior using a Gibbs sampler. To reduce the high computational cost of George and McCulloch (1993), an EM algorithm (Ročková and George, 2014) was proposed by treating the model index as a latent variable. Another version of the EM algorithm (Wang et al., 2016) was proposed by treating β as a latent variable, and selection consistency when p ≺ n was established. However, an EM algorithm can only compute the posterior mode instead of approximating the whole posterior distribution.

Motivated by thresholded ridge regression (Shao and Deng, 2012), we propose a fast algorithm to estimate the regression coefficients and approximate the posterior, which can be computed easily by applying ridge regression twice. We apply ridge regression twice to reduce the bias introduced in the original ridge regression of Shao and Deng (2012); the shrinkage factors are chosen adaptively so that only irrelevant variables are shrunk heavily. Our algorithm selects the best model and estimates the regression coefficients at the same time, with all the information summarized in a "distribution estimator".

Let γˆ denote the estimated model index, where each entry γˆj takes the binary value 0 or 1, corresponding to the j-th variable being excluded or included. Define Sˆ = {j : γˆj = 1}. The distribution estimator has the following form (we will specify wˆj, µSˆ, ΣSˆ, µj and σjj in the next section; φ(·) is the normal density):

Πn(β) = φ_{|Sˆ|}(βSˆ; µSˆ, ΣSˆ) \prod_{j: γˆj = 0} ((1 − wˆj)δ0 + wˆj φ(βj; µj, σjj)).

Our distribution estimator has a natural Bayesian interpretation. Based on it, we can select the most probable model, estimate the regression coefficients and compute the posterior probability of the selected model. In theory, we will establish posterior concentration for our analogue of the Bayesian posterior, that is, the probability mass of a ball centered at the true regression coefficients, with radius going to 0 at a certain rate, goes to 1. Meanwhile, we construct our estimated regression coefficients βˆ from the above distribution. We will show that, in terms of prediction accuracy, the L2 prediction error ||X(βˆ − β)||^2 is of a similar order to the prediction error of the Lasso estimator.
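As an illustration of how the distribution estimator can be used for inference, the following hedged R sketch (illustrative inputs; it assumes the MASS package is available) draws samples of β from Πn(β): a multivariate normal block on the selected set and independent spike-and-normal marginals elsewhere.

# S_hat: indices with gamma_hat = 1; m_S, Sigma_S: their normal block parameters;
# w, m, s2: weights, means and variances for the coordinates outside S_hat.
r_pi_n <- function(n_draw, S_hat, m_S, Sigma_S, w, m, s2, p) {
  draws <- matrix(0, n_draw, p)
  draws[, S_hat] <- MASS::mvrnorm(n_draw, mu = m_S, Sigma = Sigma_S)
  out <- setdiff(seq_len(p), S_hat)
  for (j in seq_along(out)) {
    keep <- rbinom(n_draw, 1, w[j])            # probability 1 - w_j of the point mass at 0
    draws[, out[j]] <- keep * rnorm(n_draw, m[j], sqrt(s2[j]))
  }
  draws
}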

In Section 4.2 we propose two-stage ridge regression and construct this distribution estimator. In Section 4.3 we describe sufficient conditions for the contraction property. In Section 4.4 we illustrate the concentration property in the simple orthogonal design case. In Section 4.5 we discuss the concentration property and prediction accuracy in the low dimensional case. In Section 4.6 we move to the high dimensional case. All the proofs are summarized in the appendix.

4.2 Proposed Algorithm

4.2.1 Two-Stage Ridge Regression

The one-step ridge regression estimator in Shao and Deng (2012) has the following form:

m^{(0)} = (X^T X + v_{0n}^{-1} Ip)^{-1} X^T y,    v_{0n} → 0.

Since v_{0n}^{-1} → ∞, the large penalty introduces bias even though thresholding m^{(0)} gives the correct model index. Our two-stage ridge regression reduces the bias by first thresholding m^{(0)} at the pre-specified threshold r = \frac{1}{1/v_{0n} − 1/v_{1n}} \log\frac{v_{1n}}{v_{0n}}; then adaptively choosing the penalty for each feature dimension (either the large penalty v_{0n}^{-1} or the small penalty v_{1n}^{-1}); and finally redoing the ridge regression with the adaptive penalty. The procedure is summarized as follows:

m^{(0)} = (X^T X + v_{0n}^{-1} Ip)^{-1} X^T y,    (4.1)

γˆj = \begin{cases} 1, & \text{if } (m_j^{(0)})^2 > r \\ 0, & \text{if } (m_j^{(0)})^2 < r \end{cases},    j = 1, 2, ..., p,    γˆ = (γˆj)_{j=1}^p,    (4.2)

αj = (1 − γˆj) v_{0n} + γˆj v_{1n},    ∆ = diag(α1, ..., αp),    (4.3)

m = (X^T X + ∆^{-1})^{-1} X^T y.    (4.4)

The initial ridge regression estimator m^{(0)} shrinks the nonzero regression coefficients heavily but provides guidance for selecting the relevant variables. Therefore, in the second round of ridge regression, with large probability we only shrink heavily the coefficients corresponding to irrelevant features and shrink little the coefficients corresponding to relevant features. Based on this algorithm, our proposed estimator is

β˜ = γˆ ◦ m,

where ◦ denotes elementwise product of two vectors. We will prove γˆ is consistent and establish the order

of kX(β˜ − β)k2.

There are two tuning parameters involved in constructing the distribution estimator: v_{0n} and v_{1n}; v_{0n} tends to be small and v_{1n} tends to be large. From now on we may suppress the dependence on n to simplify notation. The threshold r = \frac{1}{1/v_0 − 1/v_1} \log\frac{v_1}{v_0} depends on both v_0 and v_1.
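A direct R sketch of the two-stage procedure (4.1)-(4.4) above; this is a minimal illustration under the stated setup rather than the thesis implementation.

two_stage_ridge <- function(X, y, v0, v1) {
  p  <- ncol(X)
  r  <- log(v1 / v0) / (1 / v0 - 1 / v1)                          # threshold r
  m0 <- solve(crossprod(X) + diag(1 / v0, p), crossprod(X, y))    # (4.1) initial ridge fit
  gamma <- as.numeric(m0^2 > r)                                   # (4.2) thresholding
  alpha <- (1 - gamma) * v0 + gamma * v1                          # (4.3) adaptive variances
  m  <- solve(crossprod(X) + diag(1 / alpha), crossprod(X, y))    # (4.4) second ridge fit
  beta_tilde <- gamma * m                                          # gamma o m
  list(m0 = drop(m0), gamma = gamma, m = drop(m), beta = drop(beta_tilde))
}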

4.2.2 Distribution Estimator Construction

Next we construct a distribution estimator based on m. We decompose m and β according to Sˆ: m = (m_{Sˆ}^T, m_{Sˆc}^T)^T and β = (β_{Sˆ}^T, β_{Sˆc}^T)^T. In the high dimensional situation where β is sparse, the computational cost of capturing the correlation among the large regression coefficients is small since only a few large regression coefficients are present. However, capturing the correlation among the small regression coefficients is computationally expensive and not very useful. Therefore, inspired by the Skinny Gibbs sampler proposed by Narisetty and He (2014), we assume that the distribution on Sˆc factorizes into the marginal distributions of the βj.

When j ∈ Sˆc, with large probability βj is zero. It is reasonable to assume the marginal distribution of βj is a point mass at 0 plus a continuous normal density centered at mj. The weight wˆj is constructed through the logit link and is related to the difference between the threshold r and mj^2: wˆj is small when mj^2 is below r, and vice versa. We plug an estimator of Var(mj) into the variance component. The algorithm for constructing the distribution estimator is summarized in Algorithm 2.

Algorithm 2 Distribution Estimator Construction 1: Initialize ν and λ;

2: Setσ ˆ2 = 1; T −1 −1 (0) T (0) (0) p p 3: Compute V ← (X X + v0 Ip) , m ← VX y, where m = (mj )j=1 and diag(V) = (Vjj)j=1; T (0) 2 Pp (0) 2 2 tr(XVX )+ky−Xm k + j=1(Vjj +(mj ) )/v0+νλ 4: Updateσ ˆ ← n+p+ν+2 ; 2 (0) 2 σˆ v1 5: Update γ: γj ← 1 if (m ) > r = log ; j 1/v0−1/v1 v0 6: Update m ← (XT X + ∆−1)−1XT y; m2+V wˆj 1 v1 1 j jj 1 1 7: Computew ˆj via log( ) = − log( ) + 2 ( − ) whenγ ˆj = 0; 1−wˆj 2 v0 2 σˆ v0 v1 8: Plug in to get distribution estimator

2 Y 2 Πn(β) = φ|Sˆ|(βSˆ; mSˆ, σˆ VSˆ) ((1 − wˆj)δ0 +w ˆjφ(βj; mj, σˆ Vjj)). j:ˆγj =0

The above algorithm has close relation with Bayesian variable selection using spike-and-slab prior. In

Bayesian literature variable selection is achieved via shrinkage prior. One widely used prior is spike-and-slab

prior (Mitchell and Beauchamp, 1988). This prior could eliminate noisy features and avoid shrinking large

regression coefficients too much simultaneously. Given the model index γ, β follows “Gaussian spike and

2 2 Gaussian slab” prior whose variance is v0σ for the spike component and v1σ for the slab component. The hierarchical Bayesian model is written as:

π(σ2) = IG(ν/2, νλ/2),

π(γ | θ) = Bern(θ), θ = 1/2,   2 N(0, σ v0) if γj = 0 π(βj | σ, γj) = , j = 1, 2, ··· , p,  2 N(0, σ v1) if γj = 1

2 π(y|β) = Nn(Xβ, σ In), (4.5)

where ν and λ are two hyper-parameters. θ is fixed to be 1/2 in this paper, which gives each model equal

prior probability.

The motivation behind this distribution construction is EM algorithm proposed in Wang et al. (2016).

49 EM algorithm was derived in Wang et al. (2016) to compute posterior mode via treating β as latent variables

and maximizing over the model index space. In EM algorithm we start with the null model, corresponding

to γ = 0, in favor of parsimonious model. Step 2-5 resembles initialization and the first iteration of EM

algorithm and Step 6 resembles updating model index γ in the beginning of the second iteration. Step 7 is

different from EM algorithm because EM algorithm could only provide posterior mode, but not the whole

posterior distribution. Meanwhile, Step 7 could also be viewed as a updating step in the variational inference

algorithm in the same model setup, which updates the probability of j-th variable included in each iteration.

One major advantage of providing a distribution estimator is the convenience of inference. Besides

choosing the single best model and MAP estimator of β, we could compare different candidate models,

retrieve correlation among relevant features, output point estimate of the candidate model, and so on, which

will be helpful for statistical inference. In the following section, we will focus on theoretical properties of this

distribution estimator. Specifically, we will establish concentration property of this distribution estimator.

First we use Orthogonal Design case as a starting point.

4.3 Sufficient Conditions of Concentration Property

an We first introduce the following notations. For two sequences, an ∼ bn if as n → ∞, → c, where bn

an 0 < c < ∞. For two sequences, an ≺ ( )bn if as n → ∞, → 0(∞). λmax(·) and λmin(·) denotes maximal bn # and minimal eigenvalues of a matrix. λmin(·) denote the minimal nonzero eigenvalue of a matrix. σmax(·)

and σmin(·) denote the maximal and minimal singular value of a matrix. tr(·) stands for matrix trace. k · k denotes traditional L2 vector norm. supp(·) denotes the support set of a vector. R(X) refers to the row space of matrix X while C(X) refers to the column space of matrix X.

Without loss of generality, β∗ stands for the vector of unknown true regression coefficients, whose first q

∗ ∗ T T T elements are nonzero and the remaining p−q elements are 0, that is, β = ((β1) , 0p−q) . The corresponding ∗ ∗ ∗ T T T support set is S = supp(β ) = {1, 2, ··· , q} and the model index set is γ = (1q , 0p−q) . we write m in 4.5 T T T ∗ in correspondence as m = (m1 , m2 ) . Besides, let V1 denote the principal sub-matrix of V indexed by S . (n×q) (n×(p−q)) T Finally we decompose X accordingly: X = (X1 , X2 ) . The bias of m is bias(m) = E(m) − β. Concentration studies the probability mass of the distribution estimator over a ball centered at the true

∗ ∗ ∗ value. The ball centerred at the true value β with radius Mn is denoted as B(β , Mn) = {β | kβ −β k ≤

Mn}, where M is a constant and n depends on n. We have the following definition:

∗ Definition 4.1 (Concentration Property). Πn(β) concentrates on β with rate n if there is a constant M

50 such that

∗ p Πn(B(β , Mn)) → 1 as n → ∞. (4.6)

∗ To find sufficient conditions to guarantee concentration, we use the following fact. Since B(β , Mn) ⊇

∗ ∗ {β | kβ − β k ≤ Mn, supp(β) = supp(β )}. The probability in (4.6) has the following lower bound:

∗ ∗ ∗ ∗ Πn(B(β , Mn)) ≥ Πn(B(β , Mn) | γˆ = γ )P (γˆ = γ ) Z ∗ ∗ = dΠn(β | γˆ = γ )P (γˆ = γ ) ∗ B(β ,Mn) Z ∗ ∗ ≥ dΠn(β | γˆ = γ )P (γˆ = γ ) ∗ ∗ B(β ,Mn)∩{β|supp(β)=supp(β )} Z ∗ Y 2 ≥ P (γˆ = γ ) (1 − wˆj) φq(β1; m1, σˆ V1)dβ1 B(β∗,M ) j∈ /S∗ 1 n

∗ Y ∗ = P (γˆ = γ ) × (1 − wˆj) × Pb1 (kb1 − β1k ≤ Mn), j∈ /S∗

2 where b1 is an q-dimensional vector which follows normal distribution with mean m1 and varianceσ ˆ V1. Q The second part in the product is j∈ /S∗ (1 − wˆj). In order to guarantee the lower bound is tight, we Q p p require j∈ /S∗ (1 − wˆj) → 1. To guarantee Πj∈ /S∗ (1 − wˆj) → 1, we have the following inequality:

Y X (1 − wˆj) ≥ 1 − wˆj. j∈ /S∗ j∈ /S∗

P p Therefore j∈ /S∗ wˆj → 0 is one sufficient condition. The following proposition is to summarize all the sufficient conditions we need:

∗ P p ∗ Proposition 4.1. If P (γˆ = γ ) → 1, j∈ /S∗ wˆj → 0 and Pb1 (kb1 − β1k ≤ Mn) → 1 where b1 ∼ 2 ∗ p Nq(m1, σˆ V1), then Πn(B(β , Mn)) → 1.

∗ ∗ To prove Pb1 (kb1 − β1k ≤ Mn) → 1, the key step is to bound maxj∈S Vjj, λmax(Var(m1)) and 2 2 kbias(m1)k , which are further bounded by maxj Vjj, λmax(Var(m)) and kbias(m)k respectively. Bounding

∗ P p these three terms will also be a key step to prove P (γˆ = γ ) → 1 and j∈ /S∗ wˆi → 0. In the following sections we will establish bound for the above three terms in various model setups. For illustration purpose, we first look at the orthogonal design matrix case.

51 4.4 Concentration and Prediction in the Orthogonal Design Case

T To illustrate theoretical results, first we consider orthogonal case: we assume X X = nIp (This implies p ≤ n). This condition is too strong to be satisfied. However, to have a better illustration and understanding

of what kind of theoretical conditions we need, orthogonal case is a good start.

4.4.1 Concentration Property

δ −α ∗ 2 η2 From now on we assume v1 = exp(n ), v0 = n and kβ k ∼ n . Assume α < 1. In step 3 V =

α −1 (0) α −1 T T ∗ 2 (0) n ∗ n 2 (n + n ) Ip and m = (n + n ) X y. Since X y ∼ Np(nβ , nσ Ip), m ∼ Np( n+nα β , (n+nα)2 σ Ip). We have

(0) 2 2α+η2−2 −1 (0) −1 Proposition 4.2. kbias(m )k ∼ n , λmax(V) ∼ n , λmax(Var(m )) ∼ n .

In Step 4, we first look at the denominator forσ ˆ2, tr(XVXT ) = np/(n + nα). The second term is

2 2 T 1 −1 T 2 ky − Xmk = ky − PXyk + kPXy − X(X X + Ip) X yk v0 2 2 2α−3 T 2 2 2 2α−3 2 ∗ 2 2 = σ χn−p(0) + n kX yk ≤ σ χn−p(0) + n Op(n kβ k + npσ )

2α+η2−1 = Op(n − p) + Op(n ),

where PX is the projection matrix of columns of X.

2 If 2α + η2 ≤ 2, we have ky − Xmk = Op(n).

Pp (0) 2 Pp (0) 2 α−1 For the second term j=1(Vjj +(mj ) )/αj, we have j=q+1(Vjj +(mj ) )/αj ∼ (p−q)n . If j ≤ q, Pq (0) 2 α −1 2 αj = v0, j=1(Vjj + (mj ) )/v0 ≤ n (2qn + Op(kβk )). The leading term in the numerator should be

Op(n).

2 Proposition 4.3. If α + η2 ≤ 1, σˆ = Op(1).

Step 5 determines the model index. Denote X = (x1, x2, ··· , xp). If j > q, xj represents an irrelevant

(0) n 2 δ−α variable, mj ∼ N(0, (n+nα)2 σ ). The thresholding r ∼ n . Using normal distribution tail probability

bound, there exists c0, such that:

(0) 2 2 1−α+δ Proposition 4.4. P (maxj>q(mj ) > r) ≤ 2(p − q) exp(−c0n ).

(0) n ∗ n 2 (0) ∗ (0) ∗ If j ≤ q, mj ∼ N( n+nα βj , (n+nα)2 σ ). We have minj≤q |mj | ≥ minj≤q |βj | − maxj≤q |mj − βj |. ∗ δ−α Denote the minimal signal an = minj≤q |βj |. We need to have a minimal signal condition: an n . (0) ∗ (0) n ∗ (0) Besides, for the second term maxj≤q |mj − βj |, since Emj = n+nα βj , there is a bias between mj and

52 ∗ βj . To control this bias, we need to assume η2 < 1. For that positive constant c0, for large enough n, we

have √ √ nα|β∗| − (n + nα) r P (|m(0) − β∗| > r) ≤ 2Φ( j √ ). j j nσ

α ∗ √ If n kβ k ≺ n r, that is, 3α + η2 − δ < 2, we have for large enough n,

(0) ∗ √ 2 δ−α+1 P (|mj − βj | > r) ≤ 2 exp(−c0n ).

Then we have

(0) ∗ 2 2 1−α+δ ∗ δ−α Proposition 4.5. P (maxj≤q(mj − βj ) > r) ≤ 2q exp(−c0n ) if an = minj≤q |βj | n , η2 < 1 and 3α + η2 − δ < 2.

∗ n (0) 2 o n (0) 2 o The event {γˆ = γ } is equivalent to min ∗ [m ] > r ∩ max ∗ [m ] < r . If α < 1, j:γj =1 j j:γj =0 j (δ−α)/2 3α + η2 − 2 < δ and an n , we have

∗ 2 1−α+δ Proposition 4.6. P (γˆ = γ ) ≥ 1 − 2p exp(−c0n ).

∗ δ 2 T In Step 6, since γˆ = γ with large probability, we have Var(m) = diag(n/(n + exp(−n )) 1p , n/(n + δ 2 T T δ δ ∗ T T n ) 1p−q) and bias(m) = (− exp(−n )/(exp(−n ) + n)β1, 0p−q) . Therefore

2 −2+η2 δ −1 Proposition 4.7. kbias(m)k ∼ n exp(−2n ), λmax(Var(m)) ∼ n .

Compared with kbias(m(0))k, by introducing the second step in ridge regression, kbias(m)k ≺ kbias(m(0))k.

Bias could be significantly reduced via adding the adaptive penalty. P p ∗ Concentration property holds when i/∈S∗ wˆi → 0 and Pb1 (kb1 − β1k ≤ Mn) → 1. We first deal with P ∗ i/∈S∗ wˆi. Since when i∈ / S we have

2 wˆi 1 δ 1 maxi>q(mi + Vii) δ α log( ) ≤ − (n + α log(n)) + 2 (− exp(−n ) + n ), 1 − wˆi 2 2 σˆ

P δ therefore i/∈S∗ wˆi  (p − q) exp(−n /2). Hence

P p Proposition 4.8. i/∈S∗ wˆi → 0.

∗ The second part is Pb1 (kb1 − β1k ≤ Mn):

∗ 2 ∗ 2 2 kb1 − β1k ≤ 2kEb1 (b1) − β1k + 2kb1 − Eb1 (b1)k

∗ 2 2 = 2km1 − β1k + 2kb1 − Eb1 (b1)k

2 2 2 ≤ 4km1 − Em1 (m1)k + 4kbias(m1)k + 2kb1 − Eb1 (b1)k .

53 2 The expectation of the first term is bounded by Ekm1 −Em1 (m1)k = tr(Var(m1)) ≤ qλmax(Var(m1)). The 2 2 ∗ expectation of the third term is bounded by Eb1 kb1 − E(b1)k ≤ qσˆ maxj∈S Vjj. Therefore we assume 2 2 2 2 2 qσˆ maxj∈S∗ Vjj = op(n), kbias(m1)k = o(n) and qλmax(Var(m1)) = o(n). From above analysis these

2 −2+η2 δ 2 conditions could be satisfied if q/n = o(n) and n exp(−n ) = o(n):

2 −2+η2 δ 2 ∗ Proposition 4.9. If q/n = o(n) and n exp(−n ) = o(n), Pb1 (kb1 − β1k ≤ Mn) → 1.

To sum up, we have the final proposition as follows:

∗ (δ−α)/2 2 −2+η2 δ Proposition 4.10. If 3α+η2−2 < δ, minj≤q |βj | n , α+η2 ≤ 1, q/n = o(n) and n exp(−2n ) = 2 ∗ p o(n), we have Πn(B(β , Mn)) → 1.

4.4.2 Prediction Accuracy

Since our distribution estimator concentrates around β∗, it is natural to conjecture the estimator induced by this distribution will be close to the true value. In this chapter we study the estimator of regression coefficients given by β˜ = γˆ ◦ m. Moreover we consider prediction accuracy and aim to compare prediction accuracy of our estimator with OLS estimator given only the relevant features. To illustrate this, let y∗ be the independent copy of y, for any estimator βˆ of β, we have

−1 ˆ 2 2 −1 ˆ 2 n Eky∗ − Xβk = σ + n EkXβ − Xβk .

Therefore in the following sections we only look at expected prediction error EkXβˆ−Xβk2. A well-known fact ˆ 2 2 is EkXβOLS − Xβk = pσ . An oracle is that we know q relevant predictors in advance and use only these q ˆ OLS 2 2 variables to construct OLS estimator. Then we have the oracle prediction error is EkX1β1 −X1β1k = qσ . In high dimensional situation qσ2 is optimal order for prediction error.

˜ ˜ T ˜ T T ˜ 2 ˜ 2 Denote β = (β1 , β2 ) . Back to the orthogonal case. We have EkXβ − Xβk = nEkβ − βk . Let ∗ An = {γˆ = γ }, on the event An, we have

n kXβ˜ − Xβ∗k21 = n km − β∗k21 ≤ q · ( )2 + n−1+η2 exp(−2nδ). E An E 1 1 An n + exp(−nδ)

˜ ∗ 2 We have EkXβ − Xβ k 1An  q. c ∗ 2 2 ∗ 2 On the event A , kXβ˜ − Xβ k 1 c ≤ 2kXβ˜ − Xmk 1 c + 2kXm − Xβ k 1 c . For the first term, n An An An kXβ˜ − Xmk2 = nkβ˜ − mk2 ≤ nδ−αp,

2 δ−α 2 1−α+δ kXβ˜ − Xmk 1 c  n p · 2p exp(−c n ) → 0. E An 0

54 For the second term, we use Cauchy-Schwartz inequality,

∗ 2 ∗ 2 kXm − Xβ k 1 c = n km − β k 1 c E An E An α−1 j ∗ −1 −1 T 2  = k − β + ( + ∆ ) X ε k 1 c E −1 p γ n An n + αj α−1  j 2 ∗ 2 n T  ≤ 2 (( ) kβ k + ε P ε )1 c E −1 −1 2 n X n An n + αj (n + αj )

−2+η2 2 1−α+δ  (n + O(1))2p exp(−c0n ) → 0.

2 Therefore kXβ˜ − Xβk 1 c → 0. For orthogonal design, using the same conditions as before, we have E An EkXβ˜ − Xβk2  q. Our estimator has asymptotically the same performance as OLS estimator with only relevant variables only. Our estimator has the optimal order of the predictive error.

4.5 Concentration and Prediction in Low Dimensional Case

If X is not orthogonal and p ≤ n, the proof follows if we put some constraints on eigenvalues of XT X.

T Correlation among covariates could not be too strong. Denote λn1 to be the smallest eigenvalue of X X. The regularity conditions are very similar to the orthogonal case. Posterior concentration is illustrated in

Theorem 4.1:

∗ Theorem 4.1. Suppose p ≤ n and the rank of X is p. Denote minimal signal a = min ∗ |β |. If n j:γj =1 j

η1 −α δ ∗ 2 η2 (δ−α)/2 λn1 ∼ n , v0n ∼ n , v1n ∼ exp(n ), kβ k ∼ n . If an n , δ > 3α − 2η1 + η2, α < η1 and p ∗ −η1 2 α + η2 ≤ 1, then Πn(B(β , Mn)) → 1 with radius satisfying qn = o(n).

To prove Theorem 4.1, we need to carefully calculate the order of bias and variance of our estimator.

We use the similar method in Shao and Deng (2012) to prove Theorem 4.1. We need to carefully check the correlation between relevant variables and irrelevant variables. Since we put the minimal eigenvalue condition on Gram Matrix XT X, the collinearity between relevant variables and irrelevant variables can not be too strong.

The adtantage of our two-stage ridge regression is that, we could dramatically reduce the bias introduced by ridge regression but the variance will not be increased too much. We summarize the magnitude of bias and variance of m(0) and m in Table 4.1 (they are buried in the proof in the appendix). Similar to the orthogonal case, the bias could be reduced in two-stage ridge regression.

55 (0) (0) bias(m ) λmax(Var(m )) bias(m) λmax(Var(m))

Order O(nα−η1+η2/2) O(n−η1 ) O(exp(−nδ)n−η1+η2/2) O(n−η1 )

Table 4.1: Compare Thresholded Ridge Regression (Shao et al., 2011) and Two-stage Ridge Regression. Bias and variance of m(0) and m when p ≤ n

Since we have proved when sample size goes to infinity, with large probability our algorithm could produce a distribution estimator which concentrates around the true value, it is natural to ask whether our estimator is comparable to OLS estimator in terms of prediction accuracy. EkXβ˜ − Xβk2 ∼ q holds provided the condition below:

Condition 4.1. λmax(PX2 PX1 ) < 1 − c1.

To explain the meanings of PX1 and PX2 , we decompose X = (X1, X2). X1 represents design matrix

of all relevant variables and X2 represents design matrix of all irrelevant variables. PX1 and PX2 are

T −1 T projection matrices of column space of X1 and X2 (c(X1) and c(X2)) : PX1 = X1(X1 X1) X1 and PX2 = T −1 T X2(X2 X2) X2 . c1 is a constant which does not depend on n. Condition 4.1 implies that asymptotically any column of X1 and X2 are not linearly dependent.

Condition 4.1 implies c(X1) ∩ c(X2) = ∅. If c(X1) ∩ c(X2) 6= ∅, then the largest eigenvalue of PX2 PX1 is 1, which implies if relevant variables could be represented as the linear combination of irrelevant variables, the condition cannot be satisfied. c(X1) ∩ c(X2) = ∅ could easily be satisfied in random design case when p < n. However when p n and each entry of X is generated from standard normal distribution, with probability 1 Condition 4.1 cannot be satisfied. We need to put other conditions on X and β in the high dimensional case, which will be illustrated in the next section.

∗ Theorem 4.2. Suppose p ≤ n and the rank of X is p. Denote minimal signal a = min ∗ |β |. n j:γj =1 j

η1 −α δ ∗ 2 η2 (δ−α)/2 If λn1 ∼ n , v0n ∼ n , v1n ∼ exp(n ), kβ k ∼ n . Suppose an n , δ > 3α − 2η1 + η2,

α < η1, η2 < η1. If additionally λmax(PX2 PX1 ) < 1 − c1 for all n and some positive constant c1, we have kXβ˜ − Xβ∗k2 ≤ q σ2 + o(1), where q denotes the size of the true model. E c1

Here we have a inflation factor c1 in the upper bound of L2 error. However, the growth rate should be of the order of q. c1 measures the correlation between X1 and X2. If the correlation is small, c1 is large, which could approach the best performance given by OLS estimator.

56 4.6 Concentration and Prediction in High Dimensional Case

In many real high-dimensional applications such as microarray data analysis, it is often the case that p n.

The number of variables are much larger than sample size. Typical conditions to prove selection consistency is to restrict the correlation among the variables, which could be satisfied by restricting the minimal nonzero

T T # T eigenvalue of X X since X X is not invertible when p > n. Let λn1 = λmin(X X) denote the smallest nonzero eigenvalue of XT X. One nice property about the minimal nonzero eigenvalue is that the minimal

nonzero eigenvalue of any positive semi-definite matrix is the lower bound for the minimal eigenvalue of any

full-rank principle sub-matrix. We will use this lemma in the proof since we need to control the minimal

eigenvalue of principal sub-matrices. The proof is easy so it is omitted here.

Lemma 4.1.A is not full rank and is positive semi-definite. A1 is any principle sub-matrix of A. If both # # A1 and A is not equal to O, then λmin(A1) ≥ λmin(A).

Besides this crucial eigenvalue property of matrices, we will use the following properties about matrix

in the proof of the following theorems. These results are used to bound the largest eigenvalues of various

matrices involved in the proof.

• AB and BA have the same nonzero eigenvalues.

• For symmetric matrices A and B, if A  B, λmax(A) ≥ λmax(B) and λmin(A) ≥ λmin(B).

• (Weyl’s Inequality) For symmetric matrices A and B, λmax(A+B) ≤ λmax(A)+λmax(B) and λmin(A+

B) ≥ λmin(A) + λmin(B).

• (Weyl’s Inequality for singular values) For complex matrices A and B, σmax(A + B) ≤ σmax(A) +

σmax(B).

• For positive semi-definite matrices A and B, λmax(AB) ≤ λmax(A)λmax(B) and λmin(AB) ≥ λmin(A)λmin(B).

Another major challenge in high dimensional regression is identifiability. Since we use deterministic

design matrix X and p n, β is not estimable. Motivated by Shao and Deng (2012), we project β onto

T T R(X), the row space of X. Denote X = PDQ , where P is an n×n orthogonal matrix satisfying P P = In,

T Q is a p × n matrix satisfying Q Q = In. Without loss of generality, through this section we assume the rank of X is equal to n. D is a n × n diagonal matrix with full rank. The projection of β∗ is

θ∗ = XT (XXT )−1Xβ∗ = QQT β∗.

57 We have Proposition 4.11 summarizing the relationship between θ∗ and β∗.

∗ ∗ T ∗ ∗ 2 ∗ ∗ Proposition 4.11. (1) θ ∈ R(X) and θ = QQ θ . (2)θ = arg minXβ=Xβ∗ kβk and kθ k ≤ kβ k. (3)

∗ ∗ Denote S ∗ = supp(β ). Decompose X corresponding to supp(β ): X = (X ∗ , X c ). If X ∗ has full β Sβ Sβ∗ Sβ ∗ ∗ column rank and C(X ∗ ) ∩ C(X c ) = ∅, θ = β . Sβ Sβ∗

Proof. (1) Because of the property of projection, θ∗ ∈ R(X) and θ∗ = QQT θ∗.

(2) Denote Lagrangian Multiplier L(β) = kβk2 + λT X(β − β∗). We have the optimal β should satisfy

T ∗ 2 2β + X λ = 0. Therefore the optimal β is in R(X). We have θ = arg minXβ=Xβ∗ kβk .

∗ ∗ T T T (3) Without loss of generality suppose Sβ∗ = {1, 2, ··· , qβ∗ }. β = ((β1) , 0 ) (The first qβ∗ elements of ∗ ∗ ∗ T ∗ T T ∗ ∗ β are nonzero, others are zero) and θ = ((θ1) , (θ2) ) (θ is decomposed the same way as β ). We have ∗ ∗ ∗ ∗ ∗ ∗ 2 X ∗ (β − θ ) = X c θ . Since X ∗ has full column rank, θ = β . θ = 0 because of L minimization. Sβ 1 1 Sβ∗ 2 Sβ 1 1 2 ∗ ∗ Therefore with condition C(X ∗ ) ∩ C(X c ) = ∅, we have θ = β . Sβ Sβ∗

θ∗ might be different from β∗. However, from (2) we could see θ∗ is more sparse than β∗ in L2 sense.

Instead of estimating β∗, we aim to estimate θ∗ because of identifiability. One sufficient condition to

∗ ∗ guarantee θ = β is C(X ∗ ) ∩ C(X c ) = ∅. However, in high dimensional random design case, for Sβ Sβ∗ example, each entry of X is generated from standard normal distribution, with probability 1 C(XSβ∗ ) ∩ ∗ ∗ C(XSβ∗ ) = ∅ cannot hold if p n. Therefore θ and β will be different. θ∗ will be a dense vector with large probability in random design case. However, we could divide the predictor set into 2 subsets according to the magnitude of entries of θ∗ instead of whether they are zero

∗ c ∗ or not: the strong signal set Sθ∗ = {j : |θj | ≥ an} and the weak signal set Sθ∗ = {j : |θj | ≤ bn}.

There is a gap between an and bn which will be specified later. Without loss of generality, we assume

∗ c ∗ Sθ = {1, 2, ··· , qSθ∗ } and Sθ∗ = {qSθ∗ + 1, ··· , p}. We will suppress subscript Sθ for notation simplicity ∗ ∗ ∗ ∗ ∗ when there is no ambiguity. We decompose θ = (θ1, θ2) according to S: θ1 and θ2 represent the strong signal coefficients and weak signal coefficients. X is also factorized as the matrix of relevant variables and irrelevant variables corresponding to S: X1 is n × q matrix and X2 is n × (p − q) matrix. We assume q ≤ n and rank of X1 is equal to q, rank of X2 is equal to r, r ≤ n. Let the corresponding true model index

∗ T T T γ = (1q , 0p−q) . Theorem 4.3 demonstrates concentration property.

δ # T η1 −α δ Theorem 4.3. Suppose p ≥ n, log p = o(n ), λn1 = λmin(X X) ∼ n , v0n ∼ n , v1n ∼ exp(n ),

∗ 2 η2 (δ−α)/2 −α/2 ∗ (δ−α)/2 kθ k ∼ n . If an n , bn ≺ n , kθ2k ≺ n , 1 ≥ η2 + α > δ > 3α − 2η1 + η2 and α ≤ η1, ∗ p α 2 ∗ 2 2 then Πn(B(θ , Mn)) → 1 with radius satisfying q/n = o(n) and kθ2k = o(n).

(δ−α)/2 −α/2 ∗ 2 η2 Remark 4.1. We assume an n , bn ≺ n and kθ k ∼ n . These imply α + η2 ≥ δ since

2 ∗ 2 ∗ 2 2 η2−δ+α 1−δ an  kθ k . Meanwhile, qSθ∗  kθ k /an ∼ n ≺ n . qSθ∗ is small compared with sample size

58 α 2 n. qSθ∗ /n = o(n) is easy to be satisfied. We implicitly control the sparsity level qSθ∗ but does not put

restriction on qSβ∗ .

Compared with the low dimensional case, first we explicitly show that the growth rate of p is related

with δ, the growth rate of v1. Meanwhile, there are also more restrictions on n since it is more difficult to

control V in the high dimensional case than the low dimensional case. The bound on an and bn shows there should be a gap between noisy part and signal part.

Two-stage ridge regression could also dramatically reduce the bias. The magnitude of bias and variance is shown in Table 4.2:

(0) (0) bias(m ) λmax(Var(m )) bias(m) λmax(Var(m))

α−η1+η2/2 −η1 δ −η1+η2/2 ∗ −η1 Order O(n ) O(n ) O(exp(−n )n ) + O(kθ2k) O(n )

Table 4.2: Compare Thresholded Ridge Regression (Shao et al., 2011) and Two-stage Ridge Regression. Bias and variance of m(0) and m when p ≥ n

Second we move on to prediction. We construct θ˜ based on the distribution estimator in the same way

in the lower dimensional case:

θ˜ = γˆ ◦ m.

In the dimensional case Condition 4.1 λmax(PX2 PX1 ) < 1 − c1 is too strong to be satisfied: consider the

random design case where each entry of X comes from an independent normal distribution. X2 has rank of

n since p n. Therefore PX2 = In and λmax(PX2 PX1 ) = 1. Some other mild conditions should be imposed instead.

∗ We could still compute the growth rate of EkXθ˜ −Xθ k2 with a little additional price. The price we need

1−η1 T η1 to pay is the additional multiplier of n . Since λmin(X X) ∼ n . If all the features are less correlated, the multiplier is smaller. Meanwhile, we need to bound the largest eigenvalue of XT X and the strength of

∗ noise part kX2θ2k. However the bound is not too stringent.

T # T η1 2 Theorem 4.4. Suppose p ≥ n, denote λnp = λmax(X X), λn1 = λmin(X X) ∼ n , suppose λnpp =

δ+η1−α −α δ ∗ 2 η2 (δ−α)/2 −α/2 o(exp(n )), v0n ∼ n , v1n ∼ exp(n ), kθ k ∼ n . If an n , bn ≺ n , 1 > α + η2 > δ >

3α − 2η1 + η2, α < η1. We have

˜ ∗ 2 1−η1 2 η2−2α δ 2 ∗ 2 EkXθ − Xθ k  qn σ + n exp(−2n )λnp + kX2θ2k .

1−δ ˜ ∗ 2 Recall that qSθ∗ ≺ n , the upper bound of EkXθ − Xθ k is tight. λnp ≥ p intrinsically since the diagonal elements of XT X are normalized to be n. p could grow as large as exp(nδ). Both Frequentists’

59 work (such as Zhang and Huang (2008),Bickel et al. (2009)) and Bayesian’s work (such as Castillo et al.

(2015), Martin et al. (2017)) indicate with large probability kXθ˜ − Xθ∗k2 is controlled by a quantity of the order q log p, where q is the cardinality of signal part. The price log p is paid for detecting the true model.

1−η1 Our multiplier n is smaller than log p if 1 − η1 < δ when p attains the fastest growth rate. Therefore our multiplier is comparable with multipliers of other methods.

∗ The leading term of EkXθ˜ − Xθ k2 should be the variance term, which is bounded by qn1−η1 σ2, while

˜ ∗ 2 η2−2α δ 2 ∗ 2 ∗ 2 the bias term of EkXθ − Xθ k is bounded by n exp(−2n )λnp and kX2θ2k . kX2θ2k is further bounded by p p ∗ 2 X ∗ 2 X ∗ 2 kX2θ2k ≤ ( |θj | · kxjk) = n( |θj |) . j=q+1 j=q+1

We have the following corollary to guarantee variance is the leading term:

Pp ∗ 2 −η1 2 1−η1−η2+2α δ Corollary 4.1. Given the above condition, if additionally ( j=q+1 |θj |) ≺ qn and λnp ≺ qn exp(2n ), we have

∗ 2 1−η 2 EkXθ˜ − Xθ k  qn 1 σ .

4.7 Conclusion

In this chapter we proposed a two-stage ridge regression which could reduce the bias compared with ridge regression estimator proposed in Shao and Deng (2012). Our method has intrinsical connection with EM algorithm when treating β as the latent variable in Wang et al. (2016). Two-stage ridge regression estimator could be extended to constructing a distribution estimator. We established concentration property of the distribution estimator in both low dimensional case and high dimensional case. Meanwhile, we compared the prediction error of our two-stage regression estimator with respect to the oracle procedure. Our esti- mator has comparable performance with oracle procedure up to a polynomial factor in sample size. In the future adaptive penalty could be extended to various branches, such as Gaussian sequence model and sparse principal component analysis.

60 Chapter 5

A Bayesian Approach for Sparse Principal Components Analysis

5.1 Introduction

Dimension Reduction is an important technique for big data analysis to obtain the signal and get rid of the noise. Big data often involves ultra-high dimension. For instance, in gene expression data analysis, we only have hundreds of observations but each observation contains thousands of genes. We need to find a small number of features to represent the variation of the whole dataset without much information loss.

Suppose we have a data matrix which contains n samples. Each observation contains p variables. Princi- pal Component Analysis (PCA) is done by eigenvalue decomposition of sample covariance matrix or sample correlation matrix. Each eigenvector serves as the weights of one principal component. The success of

PCA is based on the assumption that sample covariance matrix is a good approximation of the population covariance matrix. However, in large-p-small-n setup, sample covariance matrix could no longer be a con- sistent estimator of true covariance matrix. This is illustrated by studying the distribution of eigenvalues of sample covariance matrix via convergence of Empirical Spectral Distribution (Marcenko and Pastur, 1967) and studying the limiting distribution of the largest eigenvalue (Johnstone, 2001; Paul, 2007). Based on the inaccurate sample covariance matrix, we could not get a consistent principal component (Johnstone and ,

2001). Sparsity assumptions of true covariance matrix must be imposed.

In classical PCA all the loadings are nonzero. In high dimensional settings when p is large, this causes the problem of interpretation. Sparse PCA aims to find the principal components with a small number of nonzero loadings, which could not only improve interpretation but also improve estimation accuracy of eigenvectors of covariance matrix based on appropriate assumptions.

One natural approach to find sparse principal components is thresholding. Johnstone and Lu (2001) selected a subset of variables via thresholding sample variance, then applied standard PCA. Shen et al.

(2013) considered thresholding sample leading eigenvector. Ma (2013) obtained principal subspace via an additional thresholding step in orthogonal iteration.

Another branch of methods involves semi-definite programming, such as d’Aspremont et al. (2005),

61 Amini and Wainwright (2009), Vu et al. (2013) and et al. (2015). All of works modified the classical

variational representation of the largest eigenvalue of sample covariance matrix and formulate a semi-definite

programming (SDP) problem. Compared with thresholding diagonal elements, SDP has weaker conditions

on the growth rate of sample size to achieve consistency but the computational complexity is higher.

A third family of methods is via regression. As pointed by Zou et al. (2006), Shen and Huang (2008),

sparse PCA could be written as a regression-type optimization problem with L1 and L2 penalty via low rank approximation. To impose sparsity, we over-parametrize the optimization problem. It turns out an alternating least squares methods can be used to obtain principal components.

In terms of asymptotics, Johnstone and Lu (2001) focused on both estimation consistency in terms of inner product distance metric and selection consistency. Ma (2013) addressed estimation consistency for principal subspaces. Shen et al. (2013) considered high dimendion, low sample size contexts when p → ∞ while n is fixed. Regarding to SDP methods, selection consistency is shown using Primal-Dual Witness methods (Amini and Wainwright, 2009; Lei et al., 2015).

In Bayesian literature Tipping and Bishop (1999) proposed probabilistic PCA from Gaussian latent variable model, which re-established the link between factor analysis and PCA. In recent years, various sparsity induced continuous priors, such as double exponential and mixture of normals, are proposed to introduce sparsity in probabilistic PCA, such as Guan and Dy (2009), Nakajima et al. (2015). Inference is based on EM algorithm or variational inference algorithms. However, the estimated loadings from the aforementioned Bayesian methods are not exactly sparse and do not cover theory about consistency. Recently

Gao and Zhou (2015) established posterior concentration for estimating a sparse principal subspace under the optimal rate. However, selection consistency for Bayesian sparse PCA methods is still an open question.

In this chapter we propose a Bayesian approach to sparse PCA, in which sparsity is induced by spike- and-slab priors. An efficient Variational Bayesian algorithm is proposed to compute the sparse eigenvector.

The algorithm can be extended to retrieve multiple eigenvectors. A theoretical analysis is also provided for the spiked covariance model: with a proper choice of the hyper-parameters and initial values, our algorithm can select the correct support set of the top eigenvector with probability going to one. In section 5.2 we introduce the algorithm while in section 5.3 we establish selection consistency.

5.2 Hybrid of EM and Variational Inference Algorithm

The algorithm was originally developed in Hu (2017). This chapter mainly focuses on theoretical properties of the algorithm. We re-state the algorithm in this section mainly for readers’ convenience.

62 Sparsity Pattern of principal components could also be found using Bayesian methods. We assume only sample covariance matrix A is given. We start with the following Gaussian approximation of sample covariance matrix:

A ≈ uvT + E, (5.1)

p 2 where u, v ∈ R and E = (eij)1≤i,j≤p, eij ∼ N(0, σ ). For identifiability, we require v to be a unit vector, T 2 i.e., kvk2 = 1. The rationale behind (5.1) is that the MLE of u and v, which minimize kA − uv kF , are the 2 P 2 unnormalized and normalized eigenvector of A, where kAkF = i,j Aij denotes Frobenius norm of matrix A.

The major inference is on two vectors v and u. We restrict v to be a unit vector: kvk2 = 1. We specify

p−1 p a uniform prior on v on p − 1 dimensional unit sphere S = {x ∈ R : kxk2 = 1}. Then we incorporate sparsity pattern through the spike-and-slab prior over elements of u. The hierarchical prior specification is summarized as follows:

w ∼ Beta(a0, b0),

2 σ ∼ Inv-Gamma(ν0, λ0),

i.i.d. Zi ∼ Bern(w), i = 1, 2, ··· , p   2 N(0, a1nσ ),Zi = 1 ui|Zi ∼ , i = 1, 2, ··· , p  δ0,Zi = 0

v ∼ Unif(Sp−1),

2 Aij ∼ N(uivj, σ ), i, j = 1, 2, ··· , p

where ν0 and λ0 are fixed to be 1 if no prior information is given. a1n depends on sample size n. Denote p the sparsity index Z = (Zi)i=1. We use a refined variational algorithm to approximate the posterior π(u, Z, v, σ2, ω|A): we are interested in the posterior distribution of (u, Z), while the hyper-parameters are Θ = (v, σ2, ω). The objective function is

Q(u, Z) Ω(q , ..., q , σ2, ω, v) = q1,...,qp log , (5.2) 1 p E π(u, Z, v, σ2, ω|A)

63 where we assume Q(u, Z) has the factorization form

p p Y Y  Zi  1−Zi Q(u, Z) = (ui,Zi) = αifi(ui) (1 − αi)δ(ui) , i=1 i=1 where fi(ui) is a probability density and δ(·) is the dirac function at 0. We propose a hybrid of EM and variational inference algorithm since it is difficult to obtain the optimal approximate distribution of Θ: for hyper-parameter Θ, we plug in the estimates; meanwhile, we approximate the posterior of (u, Z) by Q(u, Z).

Our algorithm iteratively solves Q and Θ.

• Update qi(ui,Zi).

2 It can be shown that the optimal choice of fi(ui) in qi(ui,Zi) given Θ is normal distribution N(µi, σi ), where Pp 2 Aijvˆj σˆ µ = j=1 , σ2 = i 1 + 1 i 1 + 1 a1n a1n

and the posterior probability αi = P(Zi = 1|A) is updated via:

α ωˆ µ2 1 a σ2 log i = log + ui − log 1n . (5.3) 1 − α 1 − ωˆ 2σ2 2 σ2 i ui ui

t 2 2 2 t In the matrix form, let µu = (µ1, ..., µp) and σu = (σ1, ..., σp) , then

Avˆ σˆ2 µ = , σ2 = 1 . u 1 + 1 u 1 + 1 p×1 a1n a1

We also employ the following partial updating scheme for αis: at the t-th iteration where t > 1, if (t−1) (t−1) (t−1) (t−1) αi takes extreme values, e.g., if αi < cn or αi > 1 − cn (equivalently, logit(αi ) < −Cn (t−1) −1 (t−1) or logit(αi ) > Cn, where cn = (1 + exp(Cn)) ), we do not update αi any more in the future (t−1) (t−1) iterations. That is, at the (t − 1)-th iteration, we define set A(t) = {i : min(αi , 1 − αi ) > cn} = (t−1) (t−1) {|logit(αi )| ≥ Cn}, and then only update αi s via (5.3) for i from the active set A(t).

• Update Θ. Estimator of v is given by A Q(u) v = E , kAEQ(u)k

Q T where E (u) = (α1µ1, ..., αpµp) denotes the expectation of u with respect to the Q distribution.

The updating formulas for other parameters are given by

p P αi + a0 ωˆ = i=1 , p + a0 + b0

64 p Q P 2 1 P 2 2 (Aij − uivˆj) + αi(µ + σ ) + 2 E a1n ui ui 2 i,j i=1 σˆ = p . 2 P p + αi + 4 i=1

• Stopping rule. After we iteratively update all the parameters by the above formulas for several iter-

ations, we need to decide when to stop. We use the total entropy of the vector α = (α1, ··· , αp) to check whether the feature selection result converges,

p X   H(α) = − αi log αi − (1 − αi) log(1 − αi) . i=1

If the changing rate of entropy H(α) between two iterations is less then a pre-specified value the

algorithm will stop. Usually, the updating converges in a few iterations.

The algorithm is summarized in Algorithm 3.

Algorithm 3 Variational Inference Algorithm for Sparse PCA (0) 2 (0) (0) Require:v = (ˆv1, ..., vˆp), (σ ) , ω

1: while True do

2: for i in 1:p do Pp (t) 2 (t) (t+1) Aij v (σ ) 3: j=1 j 2 (t+1) µi ← 1 +1 , (σi ) ← 1 +1 a1n a1n (t) (t+1) −1 (t) (µ )2 2  4: α ← Logit log ω + i − 1 log a1nσ . i 1−ω(t) (t) 2 2 (t) 2 2(σi ) (σi ) 5: end for t Q (t+1) A (E u) 6: v ← t Q b kA (E u)k p P (t) αi +a0 7: ω(t+1) = i=1 , p+a0+b0 Q P (t+1) (t+1) 2 1 P (t+1) (t+1) 2 (t+1) 2 E (Aij −ui vj ) + a αi ((µi ) +(σi ) )+2 i,j 1n i=1:p 8: (σ(t+1))2 ← 2 P (t+1) p + αi +4 i=1:p 9: Until H(α(t+1)) converges

10: end while

11: Calculate ub

5.2.1 Two-stage Method

When p is very large (p n), directly applying Algorithm 3 is time consuming. Johnstone and Lu (2001) pro- posed an algorithm including a thresholding step which can reduce the number of variables before embarking on sparse PCA. Here we introduce a two-stage sparse PCA method as follows:

2 2 2 2 • First step: the diagonal elements of p by p sample covariance matrix A areσ ˆ1, σˆ2, σˆ3, ··· , σˆp. Order

65 2 2 2 2 them from the largest to the smallest:σ ˆ(1) > σˆ(2) > σˆ(3) > ··· > σˆ(p). Pick up k coordinates 2 2 2 2 corresponding to the largest k sample variance:σ ˆ(1) > σˆ(2) > σˆ(3) > ··· > σˆ(k). Denote the chosen

index set as SL = {i1, i2, ··· , ik}. The order of k will be specified in the next section, k ≤ n.

∗ • Second step: Let A = (,j : i, j ∈ SL) be the sample covariance matrix of the selected variables

∗ from first step, then we apply Algorithm 3 on the matrix A to compute ubi, i ∈ SL. The final estimate u is b   ubi, if i ∈ SL ub = .  0, if i∈ / SL

In the next section we will study the asymptotic behavior of our algorithm. Specifically, we will study

whether our algorithm will give us a consistent estimator of sparse principal component in terms of sparsity

pattern in various setups.

5.3 Theoretical Properties

The major goal for this section is to prove selection consistency. We assume the principal component is a

sparse vector and aim to prove via Algorithm 3 we could capture the sparsity pattern of this vector when

the principal component is high-dimensional. We need to find the single principal component ρ for p by p

large covariance matrix. We assume p → ∞. We consider the following three cases:

1. p/n → 0;

2. p/n → c where 0 < c < 1;

3. log p/nη → 0, where η is some positive constant less than 1.

Throughout this section, we consider Single Spike Covariance Model introduced by Johnstone and Lu

(2001): we assume each p-dimensional observation xi has the following structure:

xi = ciρ + ξwi, i = 1, 2, ..., n, (5.4)

T p i.i.d. i.i.d. where ρ = (ρ1, ..., ρp) ∈ R is the single component to be estimated, ci ∼ N(0, 1) and wi ∼ Np(0, Ip) are the independent p-dimensional noise vector.

n × p data matrix X could be written as

T T Xn×p = cn×1ρ + ξW ,

66 T where c = (c1, c2, ..., cn) and Wp×n = (w1, w2, ..., wn). We assume the mean of xi is 0 and known.

1 T T Consider the sample covariance matrix A = n X X, whose expectation is E(A) = A0 = ρρ + ξIp. The task is to estimate the single eigenvector ρ corresponding to the largest eigenvalue kρk2 + ξ2.

We assume p, n → ∞ and kρk → % where % is a positive constant. Denote ρ∗ = ρ/kρk to be the

c normalized leading eigenvector. Let S denote the support set of ρ, i.e., S = {i : ρi 6= 0} and and S = {i :

ρi = 0}. Without loss of generality, assume S = {1, . . . , s}. Assume s/n → 0. We prove by running VB algorithm only once we could achieve selection consistency of the principal

component, which dramatically saves computation time compared with other algorithms. Initial values of

Θ = (v, σ2, w) matter for fast convergence. First, prior sparsity level w(0) is set to be any constant between

0 and 1. Second, v(0) = vˆ. The initial value of v(0) is the normalized sample eigenvector corresponding to

the largest eigenvalues of A because the sample eigenvector is consistent when p/n → 0.

Third, the order of (σ(0))2 also matters. (5.1) specifies the working likelihood, while data is generated

from Single Spike Covariance Model (5.4). By Bernstein Inequality we have E = A − E(A) has the bound p (0) 2 in maximal norm: kEkmax = maxij |eij| = op( log p/n). Therefore we assume (σ ) ∼ log p/n, which is an upper bound for the true variation of the error matrix.

5.3.1 The general case

Recall that at each iteration, given the current values of (ω, σ2, µ, v), we update the inclusion probability

αi for the i-th dimension via

2 2 αi ω µi 1 a1nσ log = log + 2 − log 2 1 − αi 1 − ω 2σi 2 σi 2 ω (Av)i 1 = log + 2 − log(a1n + 1), (5.5) 1 − ω 2(1/a1n + 1)σ 2

where

T Av µ = (µ1, ··· , µp) = . (5.6) 1/a1n + 1

We first show that the if the current values of (ω, σ2, v) satisfy some conditions, then with a proper choice of

a1n, after updating αi’s via (5.5), there is a gap between mini∈S log αi/(1 − αi) and maxi∈Sc log αi/(1 − αi), which is big enough for us to separate the relevant and irrelevant dimensions.

2 2 T 2 Proposition 5.1. Let λ = kρk + ξ be the largest eigenvalue of ρρ + ξ Ip. Suppose at the t-iteration,

67 w(t) is a constant in (0, 1), σ(t) ∼ p(log p)/n, and h = Avˆ(t) − λρ∗ satisfy the following two conditions:

hi (C1) : max ∗ ≤ bn < 1, (5.7) i∈S λρi r ! log p (C2) : max |hi| = Op . (5.8) i∈Sc n

Then if a1n satisfies

2 1 log p log a1n a1n → ∞, min ρi 2 , (5.9) i∈S (1 − bn) n

after updating αi via (5.5), with probability going to 1, we have

αi αi max log > Cn and min log < −Cn, i∈S 1 − αi i∈Sc 1 − αi

where Cn = κ log a1n and 0 < κ < 1/2.

Conditions (C1) and (C2) ensure that vˆ, the estimate of ρ∗, should not be too far away from the truth, otherwise there is no hope for us to pick the relevant dimensions in the next iteration. If bn is bounded

2 by a constant D less than 1, then we only need to require the minimal signal mini∈S ρi to be larger than

(log p log a1n)/n (which is true when p/n → 0 and p/n → c). To guarantee bn < D < 1, we need to add

2 a new condition mini∈S ρi (s log p)/n, which is illustrated in the proof of main theorems. Otherwise the minimal signal needs to be of a larger order if bn converges to 1. Proposition (5.1) is easy to prove. Since

2 αi ω (Avˆ)i 1 log = log + 2 − log(a1n + 1), 1 − αi 1 − ω 2(1/a1n + 1)σ 2

the major quantity is |(Avˆ)i|. For i ∈ S,

∗ ∗ min |(Avˆ)i| = min |λρi + hi| ≥ min ||λρi | − |hi|| i∈S i∈S i∈S

∗  |hi|  ∗ ≥ min |λρi | 1 − max ∗ ≥ min |λρi |(1 − bn). i∈S i∈S |λρi | i∈S

Since σ(t) ∼ p(log p)/n, we need to put the above minimal signal condition to guarantee

2 ∗ 2 2 λ mini∈S(ρi ) (1 − bn) 1 log(a1n + 1). 2(1/a1n + 1) log p/n 2

68 c p For i ∈ S , since maxi∈Sc |hi| = Op( log p/n),

αi 1 max log ∼ − log(a1n + 1). i∈Sc 1 − αi 2

We could get the above gap between the logits.

For p/n → 0 case, we will show all the conditions in Proposition 5.1 are satisfied when we start with the sample eigenvector as the initial value for vˆ(0).

5.3.2 p/n → 0 case p/n → 0 corresponds to low dimensional case. However we still need to establish selection consistency because this consistency result will be used to deal with high dimensional case. First we need to control the spectral norm of the error matrix if p/n → 0. The key is to find the order of spectral norm of E, which is

summarized in Lemma 5.1.

Lemma 5.1. Assume p, n → ∞, p/n → c and kρk → %, where c and % are 2 constants, then almost surely

√ 2 √ lim sup kEk2 ≤ ξ c% + ξ (c + 2 c).

This lemma is originally stated and proved in Johnstone and Lu (2001). Therefore the proof is omitted p here. A direct conclusion of this lemma is kEk2 = Op( p/n). There is a natural intuition: when sample size n is large and p/n → 0, we expect the error term E to be small.

The sample eigenvector and true eigenvector should be close since the error matrix, which is the difference between true covariance matrix and sample covariance matrix, is small. This fact is summarized in Lemma

5.2. Remember that ρ∗ = ρ/kρk; the sample eigenvector of A is denoted as vˆ.

∗ p Lemma 5.2. Assume p, n → ∞, p/n → 0 and kρk → %, % is a positive constant, then kvˆ−ρ k = Op( p/n).

The proof is provided in the appendix. We compute µ via

Avˆ λˆvˆ µ = = ; 1/a1n + 1 1/a1n + 1

Remember that

α ω µ2 1 a σ2 ω λˆ2vˆ2 1 log i = log + ui − log 1n = log + i − log(a + 1), 1 − α 1 − ω 2σ2 2 σ2 1 − ω 2(1/a + 1)σ2 2 1n i ui ui 1n

ˆ the magnitude of λvˆi determines the magnitude of logit.

69 ˆ (t) After our algorithm converges after t-th iteration, the estimated support set is S = {i : logit(αi ) ≥ Cn}. p If q(Sˆ = S) → 1, we say our algorithm has Frequentist’s selection consistency. In Bayesian inference, remember we use a binary vector Z = (Z1,Z2, ··· ,Zp) to indicate whether ρi is equal to 0 and through variational inference we could approximate the true posterior distribution of Z by q(Z). Suppose Z∗ is the

∗ T T T true model index vector: Z = (1s , 0p−s) . Bayesian selection consistency requires the approximate posterior measure of Z at Z∗ converge to 1 in probability, that is,

∗ Y (t) Y (t) p q(Z = Z ) = αi (1 − αi ) → 1. i∈S i∈Sc

The aforementioned Frequentist’s consistency corresponds to that q(Z = Z∗) has the largest probability mass compared with other configurations of Z, while Bayesian consistency requires q(Z = Z∗) converges to

1 in probability, which is stronger than the frequentist consistency.

Theorem 5.1 shows using proper initial values and assuming minimal signal conditions, we could establish

Bayesian consistency and the algorithm will stop after one iteration with high probability. We suppress the iteration number t in the theorem for notation simplicity.

Theorem 5.1. Assume p, n → ∞, p/n → 0 and kρk → %, % is a positive constant. We choose initial

(0) (0) 2 (0) ∗ 2 values as follows: w is a constant in (0, 1); (σ ) ∼ log p/n; v = vˆ; a1n → ∞, suppose mini∈S(ρi ) ≥

(log p max(s, log a1n))/n → 0, Cn = κ log a1n where 0 < κ < 1/2, then after one iteration, we have q(Sˆ =

p κ ∗ p S) → 1 and the algorithm will stop with probability going to 1. In addition, if a1n p, we have q(Z = Z ) → 1, the algorithm attains Bayesian consistency.

Theorem 5.1 shows that if the signal is not too small, we could obtain correct sparsity pattern of the single principle component running the algorithm once. Theorem 5.1 is a direct conclusion from Lemma

5.2, and the proof is given in the appendix. By choosing proper a1n, we have Bayesian consistency, which is stronger than Frequentists’ consistency. Therefore in the remaining parts, we focus on Bayesian consistency.

5.3.3 p/n → c > 0 case p/n → 0 is a strong condition in high dimension. How’s the performance of our algorithm when the dimensionality gets larger? Motivated by Paul (2007), where the author mentioned a phase transition phenomenon for the leading eigenvector for Spike Covariance Model, we applied their theoretical results to show our VB algorithm still works in p/n → c > 0 case. Theorem 5.2 provides theoretical details.

Theorem 5.2. Assume p, n → ∞, p/n → c (0 < c < 1), kρk → % and % is a positive constant. If w(0)

(0) 2 (0) 2 2 √ ∗ 2 is a constant in (0, 1); (σ ) ∼ log p/n; v = vˆ; a1n → ∞; % /ξ > 1 + c. Suppose mini∈S(ρi ) ≥

70 κ (log p max(s, log a1n))/n → 0, Cn = κ log a1n where 0 < κ < 1/2 and a1n p, then after one iteration, we p have q(Z = Z∗) → 1 and the algorithm will stop with probability going to 1.

This theorem implies that by choosing proper a1n and initializing our algorithm via the sample eigen- vector, we could select nonzero entries of the leading eigenvector via Algorithm 3 consistently.

5.3.4 log p/nη → 0 case

It’s much more interesting to look at the performance of variational algorithm in high dimensional case. For

Theorem 5.1 or Theorem 5.2 to hold, in general we require p/n → c: the number of variables should grow

slower than sample size or grow at the same rate. This restricts the use of our algorithm. One direction

to enlarge the usage of our algorithm is to add a screening step. Assume the single principal component

is p−dimensional, among which there only exists s nonzero entries. s ≤ n. The first screening step is to pick k coordinates out of p, which with large probability will contain all of the s nonzero coordinates in the

first principal component. If k ≺ n, we could apply the consistency result in Theorem 5.1 to show selection consistency of our algorithm.

Denote the chosen index set after screening as SL = {i1, i2, ··· , ik}. Hopefully, SL will contain S =

{1, 2, ··· , s}. The cardinality of SL is k, which dramatically reduces the dimensionality. In the second stage,

∗ ∗ denote the principal sub-matrix of A induced by SL as A . Since A is k by k matrix and k/n → 0, we could apply our algorithm and use Theorem 5.1 to prove selection consistency. The theoretical justification is summarized in Theorem 5.3.

2 Theorem 5.3. Assume p, n → ∞ and kρk → %, % is a positive constant. Suppose mini∈S ρi = bn → 0. If 2 there exists a k ≥ s satisfying log(p − k) + log s = o(nbn), then after diagonal screening,

 3nb2   nb2  P (S S) ≤ s(p − k) exp − n + s exp − n → 0. L + 64ξ4 16ξ4

Then we combine Theorem 5.1 and Theorem 5.3 to obtain Theorem 5.4.

∗ 2 Theorem 5.4. Assume p, n → ∞ and kρk → %, % is a positive constant. Suppose mini∈S(ρi ) κ ∗ 2 p log k max(s, log a1n)/n, a1n k. If additionally mini∈S(ρi ) (log(p − k) + log s)/n, then after choosing k nonzero coordinates in the screening step and applying Algorithm 3 on k × k principal sub-matrix A∗, p q(Z = Z∗) → 1 and the algorithm will stop after one iteration with probability going to 1.

∗ 2 Theorem 5.4 is a direct corollary of Theorem 5.1 and Theorem 5.3. Since the conditions mini∈S(ρi ) ∗ 2 p log k max(s, log a1n)/n and mini∈S(ρi ) (log(p − k) + log s)/n may not seem intuitive, we list a special sparse situation here.

71 γ δ If we have a very sparse principal component, s ≺ log a1n, suppose log a1n ∼ n and s ∼ n , where

η ∗ 2 1 > γ > 1/2 and δ < γ, define η = 2γ − 1. then if log p = o(n ), the condition mini∈S(ρi ) p ∗ 2 (log(p − k) + log s)/n is satisfied. The minimal signal condition turns out to be mini∈S(ρi ) log k 1−γ 1−γ log a1n/n ∼ log k/n . Intrinsically we need (s log k)/n = O(1), therefore δ < 1 − γ. To sum up, if

γ δ η log a1n ∼ n and s ∼ n , where 1 > γ > 1/2 and δ < 1 − γ, define η = 2γ − 1. If log p = o(n ) and

∗ 2 1−γ mini∈S(ρi ) log k/n , Algorithm 3 could achieve selection consistency after the screening step. In terms of the requirements of the sample size, our algorithm has the similar conditions as Johnstone and

Lu (2001): s2 log(p−k)/n → 0 is required. For semi-definite programming (SDP) methods, s log(p−s)/n → 0

is required. SDP has weaker condition for sample size n. However, applying SDP methods is time-consuming

but our method is more efficient.

5.4 Conclusion and Discussion

In this chapter we proposed a variational algorithm to compute the first sparse principal component corre-

sponding to Single Spike Covariance Model. Theoretically we proved our algorithm achieves Frequentist’s

model selection consistency and Bayesian selection consistency in various combinations of p and n.

Even though Single Spike Covariance Model is a good start, it is too simple to be true in real data

analysis. If we are asked to find multiple principal components, we could iteratively find eigenvectors via

Algorithm 3, then project A to the orthogonal space of the previous eigenvectors and re-apply the algorithm.

A more efficient way is to generalize Gaussian approximation methods to find principal subspace. This is

our future direction.

72 Appendix A

Supplemental Material for Chapter 2

A.1 Proof for Lemma 2.1

Denote w = (w1, w2, ··· , wT ) and m = (m1, m2, ··· , mT ). Write Dn in terms of the conditional prior

(θ1, θ2, ··· , θn)|w ∼ Πw,w,m and the marginal prior w ∼ Π. Therefore

Z 1 n Z Y pθi (Xi) κ Dn = { } Πw,w,m(dθi)Π(dw); pθ∗ (Xi) 0 i=1 R i

(X −θ )2 where p (X ) = √1 exp(− i i ) is normal probability density function. For given w, the inner expec- θi i 2π 2 tation involves an average over all configurations of the indicators (1θ1=0, 1θ2=0, ··· , 1θn=0). This average is clearly larger than just the case where the indicators exactly match up with the support S∗ of θ∗, times the probability of that configuration, that is,

1 T Z Z κ ∗ 2 2 w 1 2 n−sn sn Y {(Xi−θ ) −(Xi−θi) } X t − (mt−θi) Dn > w (1 − w) Π(dw) e 2 i √ e 2σ2 dθi. 2 0 i∈S∗ R t=1 2πσ

For each i ∈ S∗ and each 1 ≤ t ≤ T ,

Z κ ∗ 2 2 w 1 2 {(Xi−θ ) −(Xi−θi) } t − (mt−θi) e 2 i √ e 2σ2 dθi 2 R 2πσ κ ∗ 2 Z κ 2 w 1 2 (Xi−θ ) − (Xi−θi) t − (mt−θi) = e 2 i e 2 √ e 2σ2 dθi 2 R 2πσ 2 κ (X −θ∗)2 wt κ(Xi − mt) = e 2 i i √ exp(− ). κσ2 + 1 κσ2 + 1

73 Therefore we have

T Z κ ∗ 2 2 w 1 2 Y {(Xi−θ ) −(Xi−θi) } X t − (mt−θi) e 2 i √ e 2σ2 dθi 2 i∈S∗ R t=1 2πσ T 2 Y κ (X −θ∗)2 X wt κ(Xi − mt) = {e 2 i i √ exp(− )} 2 κσ2 + 1 i∈S∗ t=1 κσ + 1 T 2 2 − sn Y κ (X −θ∗)2 Y X κ(Xi − mt) = (κσ + 1) 2 {e 2 i i } { w exp(− )}. t κσ2 + 1 i∈S∗ i∈S∗ t=1

Since

T 2 2 Y X κ(Xi − mt) Y κ(Xi − mt ) { w exp(− )} ≥ {w exp(− i )} t κσ2 + 1 ti κσ2 + 1 i∈S∗ t=1 i∈S∗ P 2 Y ∗ (Xi − mti ) = ( w ) × exp(−κ i∈S ). ti κσ2 + 1 i∈S∗

γ −η ∗ Recall that sn = O(n ) where γ < 1, we have εn = sn log(n/sn) ∼ sn log n. Besides, mini∈S wti ≥ Cn ,

Q −η sn we have i∈S∗ wti ≥ (Cn ) = exp(log C · sn − K∗εn), where K∗ is a constant independent with n. P 2 ∗ ∗ By Condition 2.2, we have i∈S∗ (Xi − mti ) ≤ K εn with probability going to 1 where K is a constant independent with n. Hence

T 2 Y X κ(Xi − mt) κ { w exp(− )} ≥ exp(log C · s − K ε − K∗ε ). t κσ2 + 1 n ∗ n κσ2 + 1 n i∈S∗ t=1

By Law of Large Numbers,

Y κ {(X −θ∗)2} κ P (X −θ∗)2 κsn +o (s ) e 2 i i = e 2 i∈S∗ i i = e 2 p n . i∈S∗

Therefore

T Z κ ∗ 2 2 w 1 2 Y {(Xi−θ ) −(Xi−θi) } X t − (mt−θi) e 2 i √ e 2σ2 dθi 2 i∈S∗ R t=1 2πσ

2 − sn κ ∗ ≥ (1 + κσ ) 2 exp(−(K + K )ε − o (ε )). ∗ κσ2 + 1 n p n

1 R n−sn sn Using the same trick in the proof of Lemma 1 in Martin and Walker (2014), 0 w (1 − w) Π(dw) is α lower bounded by 1+α exp[−2εn − αsn + o(sn)] as n is sufficiently large. Putting these pieces together, we

74 obtain

α κ α D ≥ exp{−(2 + K + K∗)ε − o (ε )} = exp{−K ε − o (ε )}, n 1 + α ∗ κσ2 + 1 n p n 1 + α 0 n p n

κ ∗ where K0 = 2 + K∗ + κσ2+1 K .

A.2 Proof for Theorem 2.1

∗ The main aim of the proof is to show the numerator, for sets An away from θ , is not too large. Let Nn be

the numerator for Qn(AMεn ), i.e.,

Z 1 Z n Y pθi (Xi) κ  Nn = { } Πw,w,m(dθi) Π(dw). pθ∗ (Xi) 0 AMεn i=1 i

Taking expectation of Nn, with respect to Pθ∗ , we get

Z 1 Z n Z Y pθi (xi) κ  ∗ (N ) = { } p ∗ (x )dx Π (dθ ) Π(dw). Eθ n θi i i w,w,m i pθ∗ (xi) 0 AMεn i=1 R i

n Here we can exchange the integral with respect to xi and θi since Πw,w,m does not depend on X .

R pθi (xi) κ κ(1−κ) ∗ 2 Since { } pθ∗ (xi)dxi = exp{− (θi − θ ) }, we have pθ∗ (xi) i 2 i R i

Z 1 Z n Y κ(1 − κ) ∗ 2 ∗ (N ) = exp{− (θ − θ ) } Π (dθ ) Π(dw). Eθ n 2 i i w,w,m i 0 AMεn i=1

∗ 2 Remind that AMεn = {θ| kθ − θ k > Mεn}, we have

κ(1 − κ)M ∗ (N ) ≤ exp{− ε }. Eθ n 2 n

κ(1−κ) Let 2 = c. Next, take M such that cM > K0, and then take K1 ∈ (K0, cM), then using Markov

−K1εn −(cM−K1)εn Inequality we get Pθ∗ (Nn > e ) ≤ e . This upper bound has a finite sum over n ≥ 1, so the

−K1εn Borel-Catelli lemma gives that Nn ≤ e with Pθ∗ -probability 1 for all large n. Therefore we get

Nn 1 + α ≤ e−(K1−K0)εn−op(εn). Dn α

∗ Hence Qn(AMεn ) → 0 as n → ∞ with Pθ -probability 1.

75 A.3 Proof for Theorem 2.2

Similarly, let Nn be the numerator for Qn(AMεn ), i.e.,

Z 1 Z n Z Y pθi (Xi) κ Nn = { } Πw,w,m(dθi)Π(dw). pθ∗ (Xi) 0 AMεn i=1 R i

Taking expectation of Nn, with respect to Pθ∗ , we get

Z 1 Z n Z Y pθi (Xi) κ ∗ (N ) = { } Π (dθ )p ∗ (x )dx Π(dw). Eθ n w,w,m i θi i i pθ∗ (Xi) 0 AMεn i=1 R i

Write Jw(dθi) for the measure defined in the i-th product term. Split this into discrete and continuous pieces:

Z pθi (Xi) κ Jw(dθi) = { } Πw,m,w(dθi)pθ∗ (xi)dxi p ∗ (X ) i R θi i Z pθi (Xi) κ = w{ { } pθ∗ (xi)dxi}δ0(dθi) p ∗ (X ) i θi i Z T 2 pθi (Xi) κ X wt (θi − mt) + (1 − w){ { } ( √ exp(− ))p ∗ (x )dx }dθ . 2 θi i i i pθ∗ (Xi) 2 2σ i t=1 2πσ

κ(1−κ) ∗ 2 For the discrete part, it simplifies to w exp{− 2 (θi − θi ) }δ0(dθi). For the continuous part, the upper bound is

Z T 2 pθi (Xi) κ X wt (θi − mt) { } ( √ exp(− ))p ∗ (x )dx 2 θi i i pθ∗ (Xi) 2 2σ i t=1 2πσ 1 Z κ(X − θ )2 (1 − κ)(X − θ∗)2 ≤ exp{− i i − i i }dx 2πσ 2 2 i

1 κ(1 − κ) ∗ 2 = √ exp{− (θi − θ ) } 2πσ 2 i 1 1 = exp{− (κ(1 − κ) − )(θ − θ∗)2}N(θ |θ∗, σ2). 2 σ2 i i i i

2 1 ∗ 2 Since σ > (1−κ)κ , the coefficient on (θi − θi ) is negative. Therefore we could find a constant c > 0, depending on (κ, σ2), such that

∗ 2 −c(θi−θi ) ∗ 2 Jw(dθi) ≤ e {wδ0(dθi) + (1 − w)N(θi|θi , σ )dθi}.

n Qn ∗ 2 Therefore Jw(dθ) ≡ i=1 Jw(dθi) is upper bounded by exp{−ckθ − θ k } times a valid probability

76 measure. Therefore by definition of AMεn , we have

Z 1 Z n −cMεn Eθ∗ (Nn) = Jw(dθ)Π(dw) ≤ e 0 AMεn

Next, take M such that cM > 2, and then take K1 ∈ (2, cM), then using Markov Inequality we get

−K1εn −(cM−K1)εn Pθ∗ (Nn > e ) ≤ e . This upper bound has a finite sum over n ≥ 1, so the Borel-Catelli

−K1εn lemma gives that Nn ≤ e with probability 1 for all large n. Therefore we get

Nn 1 + α ≤ e−(K1−K0)εn−op(εn). Dn α

∗ Qn(AMεn ) → 0 as n → ∞ with Pθ -probability 1.

A.4 Variational Inference Algorithm Derivation

2 We will derive the variational inference algorithm for Dirichlet process mixture model. α0, T , w0, σ0 and the data vector Xn is given in advance. The data generating process is summarized in 2.7. We treat Z as latent variables and V, η∗, ξ as parameters. Posterior distribution of all the parameters and latent variables is proportional to

∗ n n ∗ ∗ n ∗ P (Z, V, η , ξ|X ) ∝ P (X , Z, V, η , ξ) = P (ξ|w)P (V|α0)P (η |ξ)P (Z|V)P (X |Z, η )

T −1 PT ξt PT Y Y t=1 T − t=1 ξt α0−1 ∗ ∝ w0 (1 − w0) (1 − Vt) δ0(ηt )· t=1 t:ξt=1 T ∗ 2 Pn Y 1 (ηt ) Y i=1 1Zi=t √ exp(− 2 ) · πt · 2πσ0 2σ0 t:ξt=0 t=1 ! Pn PT (X − η∗)21 exp − i=1 t=1 i t Zi=t . 2

Recall that under the fully factorized variational assumption, we have

∗ ∗ q(Z, V, η , ξ) = qp,m,τ (η , ξ)qγ1,γ2 (V)qΦ(Z).

77 ∗ Define P (Zi = t) = φi,t. First we find the optimal form of q(η , ξ), which satisfies

PT ξt PT Y ∗ E t=1 T − t=1 ξt ∗ log q(η , ξ) = V,Z[log(w0 (1 − w0) δ0(ηt )· t:ξt=1 ∗ 2 Pn PT ∗ 2 Y 1 (ηt ) i=1 t=1(Xi − ηt ) 1Zi=t √ exp(− 2 ) exp(− ))] + const 2πσ0 2σ0 2 t:ξt=0 T X q (η∗)2 = [1 (log w + log δ (η∗)) + 1 (log(1 − w ) − log 2πσ2 − t ) ξt=1 0 0 t ξt=0 0 0 2σ2 t=1 0 Pn φ (X − η∗)2 − i=1 i,t i t ] + const 2 T Pn 2 X φi,tX = [1 (log w + log δ (η∗) − i=1 i ) ξt=1 0 0 t 2 t=1 q ∗ 2 Pn ∗ 2 2 (ηt ) i=1 φi,t(Xi − ηt ) + 1ξt=0(log(1 − w0) − log 2πσ0 − 2 − ) + const] 2σ0 2 T X ∗ ≡ log q(ξt, ηt ); t=1

Pn 2 ∗ 2 ∗ ∗ i=1 φi,tXi p 2 (ηt ) where log q(ξt, ηt ) = 1ξt=1(log w0 + log δ0(ηt ) − ) + 1ξt=0(log(1 − w0) − log 2πσ0 − 2 − 2 2σ0 Pn ∗ 2 i=1 φi,t(Xi−ηt ) ∗ 2 ) + const. Therefore the optimal form of qp,m,τ (η , ξ) is fully factorized across different clusters: T ∗ Y ∗ qp,m,τ (η , ξ) = qpt,mt,τt (ηt , ξt). t=1

In order to determine the updating formula for pt, mt, τt, we use Method of Undetermined Coefficients.

∗ Suppose q(ξt, ηt ) is a mixture of a point mass of zero and normal distribution,

∗ ∗ 2 −1/2 ∗ 2 2 q(ξt, ηt ) = pt1ξt=1δ0(ηt ) + (1 − pt)1ξt=0(2πτt ) exp(−(ηt − mt) /(2τt )), therefore

q ∗ 2 ∗ ∗ 2 (ηt − mt) log q(ξt, ηt ) = 1ξt=1(log pt + log(δ0(ηt ))) + 1ξt=0(log(1 − pt) − log( 2πτt ) − 2 ) + const. 2τt

Even though there’s a normalizing constant, but the difference between multipliers of 1ξt=1 and 1ξt=0 is invariant with respect to the constant. Therefore we have the following equation:

q ∗ 2 2 (ηt − mt) log pt − log(1 − pt) + log 2πτt + 2 = 2τt Pn 2 ∗ 2 Pn ∗ 2 i=1 φi,tXi (ηt ) i=1 φi,t(Xi − ηt ) log w − log(1 − w) − + 2 + ; 2 2σ0 2

78 ∗ R which holds for any ηt ∈ . The solutions are given as follows:

2 Pn σ0 · i=1 φi,tXi mt = 2 Pn , t = 1, 2, ··· ,T σ0 · i=1 φi,t + 1 2 2 σ0 τt = 2 Pn , t = 1, 2, ··· ,T σ0 · i=1 φi,t + 1  2 Pn 2  p 2 Pn σ0 ·( i=1 φi,tXi) exp log(w0) − log(1 − w0) + log( σ0 · i=1 φi,t + 1) − 2(σ2·Pn φ +1) p = 0 i=1 i,t , t  2 Pn 2  p 2 Pn σ0 ·( i=1 φi,tXi) exp log(w0) − log(1 − w0) + log( σ0 · i=1 φi,t + 1) − 2 Pn + 1 2(σ0 · i=1 φi,t+1) t = 1, 2, ··· ,T.

Next we deal with the optimal form for q(V), which satisfies

T −1 Pn n Y α −1 i=1 1Z =1 P 1 E 0 i i=1 Zi=2 log q(V) = Z[log( (1 − Vt) · V1 · (V2(1 − V1)) ··· t=1 T −2 T −1 Pn Pn Y 1Z =T −1 Y 1Z =T (VT −1 (1 − Vt)) i=1 i ( (1 − Vt)) i=1 i )] + const t=1 t=1 n T n n X X X X = φi,1 · log V1 + (α0 − 1 + φi,t) log(1 − V1) + φi,2 log V2 + (α0 − 1+ i=1 t=2 i=1 i=1 T n n X X X φi,t) · log(1 − V2) + ··· + φi,T −1 · log VT −1 t=3 i=1 i=1 n X + (α0 − 1 + φi,t) log(1 − VT −1) + const i=1 T −1 X ≡ log q(Vt); t=1

Pn PT Pn Pn where log q(V1) = i=1 φi,1·log V1+(α0−1+ t=2 i=1 φi,t) log(1−V1)+const, log q(V2) = i=1 φi,2 log V2+ PT Pn Pn (α0 − 1 + t=3 i=1 φi,t) · log(1 − V2) + const, ··· , log q(VT −1) = i=1 φi,T −1 · log VT −1 + (α0 − 1 + Pn i=1 φn,T ) log(1 − VT −1) + const. Thus we proved

T −1 Y qγ1,γ2 (V) = qγ1t,γ2t (Vt). t=1

Pn PT Pn Furthermore, V1 follows Beta distribution with parameters (γ11, γ21) = ( i=1 φk,1 +1, α0 + t=2 i=1 φi,t), Pn PT Pn V2 follows Beta distribution with parameters (γ12, γ22) = ( i=1 φk,2 + 1, α0 + t=3 i=1 φi,t),··· , VT −1 Pn Pn follows Beta distribution with parameters (γT −1,2, γT −1,2) = ( i=1 φk,T −1 + 1, α0 + i=1 φi,t).

79 Finally, we deal with q(Z). We have the following optimal form

T n T Pn P P ∗ 2 Y i=1 1Z =t i=1 t=1(Xi − ηt ) 1Zi=t log q(Z) = E ∗ [log( π i ) · exp(− )] + const ξ,η ,V t 2 t=1 n T ∗ 2 X X (Xi − ηt ) = E ∗ [ (log π − )1 ] + const ξ,η ,V t 2 Zi=t i=1 t=1 n T E ∗ ∗ 2 X X ξt,η (Xi − ηt ) = (E (log π ) − t )1 + const V t 2 Zi=t i=1 t=1 n X = log q(Zi); i=1

n ∗ 2 2 2 2 therefore we proved q (Z) = Q q (Z ). Since E ∗ (X −η ) = X −2(1−p )m X +(1−p )(m +τ ), Φ i=1 φi i ξt,ηt i t i t t i t t t we have

T X 1 log q(Z ) = (E (log π ) + (1 − p )m X − (1 − p )(m2 + τ 2))1 + const i V t t t i 2 t t t Zi=t t=1 T t−1 X E X E = [ γ1,t,γ2,t (log Vt) + γ1,i,γ2,i (log(1 − Vi)) t=1 i=1 1 + (1 − p )m X − (1 − p )(m2 + τ 2)]1 + const. t t i 2 t t t Zi=t

Therefore q(Zi) is the probability mass function of Multinomial distribution. Once we fix i, φi,t ∝ exp(St), E Pt−1 E 1 2 2 where St = exp[ γ1,t,γ2,t (log Vt) + i=1 γ1,i,γ2,i (log(1 − Vi)) + (1 − pt)mtXi − 2 (1 − pt)(mt + τt )].

80 Appendix B

Supplemental Material for Chapter 3

B.1 Proof of Theorem 3.2

Ψ could be written as t ˆ − 1 (µ1 − µˆ) D 2 ηˆ q , t − 1 − 1 ηˆ Dˆ 2 ΣDˆ 2 ηˆ

which is lower bounded by t ˆ − 1 ˜ (µ1 − µˆ) D 2 ηˆ Ψ = q . t − 1 − 1 λ1ηˆ Dˆ 2 DDˆ 2 ηˆ

By Lemma A.2. of Fan and Fan (2008) we have maxi≤p |σˆii − σii| →p 0. Therefore Dˆ = D(1 + op(1)). − 1 ˆ 2 (µ1−µˆ)D ηˆ Therefore we have Ψ˜ = √ (1 + op(1)). λ1kηˆk We first consider the denominator,

v u p s uX 2 X 2 X 2 p 2 2 kηˆk = t ηˆi = ηˆi + ηˆi = kηˆ1k + kηˆ2k . i=1 i∈S i∈Sc

p ˆ If di=0, n/2di ∼ t2n−2, according to t distribution tail probability inequality, we have

2n−1 ! 2 2n − 2 Γ( ) 1 nb 2n−1 ˆ 2 n − 2 P (|di| ≥ bn) ≤ · p · p (1 + ) . 2n − 3 π(2n − 2)Γ(n − 1) n/2bn 4n − 4

81 For any  > 0, using Markov Inequality and Cauchy-Schwartz Inequality,

q q 2 1 X 2 1 X 4 ˆ P (kηˆ k > ) ≤ E[ˆη 1 ˆ ] ≤ E(ˆη ) P (|di| ≥ bn) 2  i {|di|≥bn}  i i∈Sc i∈Sc p 4 q sup c E(ˆη ) X ≤ i∈S i P (|dˆ | ≥ b )  i n i∈Sc s pE 4 2n−1 supi∈Sc (ˆηi ) 2n − 2 2Γ( 2 ) ≤ (p − sp) ·  2n − 3 pπ(2n − 2)Γ(n − 1)

s 2 1 nbn − 2n−1 · p (1 + ) 2 n/2bn 4n − 4 r pE 4 s 2 4 2 supi∈Sc (ˆηi ) 1 nbn − 2n−1 ∼ (p − sp) · p (1 + ) 2 π  n/2bn 4n − 4 r p 4 2 2 sup c E(ˆη ) 1 2n − 1 nb ∼ 4 i∈S i exp(log(p − s ) − log(pn/2b ) − · n ) → 0; π  p 2 n 2n − 2 4

2 2 The last equivalence holds since bn → 0. The probability goes to 0 since log(p − sp) = o(nbn). 2 Recall that Cp = kd1k . According to triangular inequality,

√ √ −Op( εp) + kd1k ≤ −kηˆ1 − d1k + kd1k ≤ kηˆ1k ≤ kηˆ1 − d1k + kd1k = Op( εp) + kd1k.

√ We put these terms together to approximate the order of denominator as Op( εp) + kd1k.

t t t For numerator, denote µ1 = (µ11, µ12, ··· , µ1p) , µ2 = (µ21, µ22, ··· , µ2p) and µˆ = (ˆµ1, µˆ2, ··· , µˆp) , whereµ ˆi = (ˆµ1i +µ ˆ2i)/2. we have the following decomposition:

p 1 X − 1 X − 1 X − 1 t ˆ − 2 2 2 2 (µ1 − µˆ) D ηˆ = (µ1i − µˆi)ˆσii ηˆi = (µ1i − µˆi)ˆσii ηˆi + (µ1i − µˆi)ˆσii ηˆi ≡ I1 + I2. i=1 i∈Sc i∈S

For I1, we have the following decomposition since µ1i = µ2i:

1 X − 1 1 X − 1 1 1 I = − (ˆµ − µ )ˆσ 2 ηˆ − (ˆµ − µ )ˆσ 2 ηˆ ≡ I + I . 1 2 1i 1i ii i 2 2i 2i ii i 2 1,1 2 1,2 i∈Sc i∈Sc

82 Using Markov Inequality,

1 1 X − 2 P (|I1,1| > ) ≤ E|(ˆµ1i − µ1i)ˆσ ηˆi1 ˆ |  ii {|di|≥bn} i∈Sc

1 X p 2 q 2 ≤ E((ˆµ1i − µ1i) /σˆii) E(ˆη 1 ˆ )  i {|di|≥bn} i∈Sc q 1 X n − 1 q 4 ≤ √ 4 E(ˆη4) P (|dˆ | ≥ b )  (n − 2) n i i n i∈Sc q 8 p4 2 E 4 p 2 π supi∈Sc (ˆηi ) log(p − s ) log( n/2b ) 2n − 1 nb ∼ √ exp( p − n − · n ) → 0. n 2 4 2n − 2 4

p×p sp×sp Therefore I1,1 = op(1). Similarly I1,2 = op(1). Hence I1 = op(1). Suppose D = diag(diag(D1 ), (p−sp)×(p−sp) ˆ p×p ˆ sp×sp ˆ (p−sp)×(p−sp) ˆ diag(D2 )), D = diag(diag(D1 ), diag(D2 )), where D1 and D1 denote the

corresponding sub-matrix of relevant features, D2 and Dˆ 2 denote the corresponding submatrix of irrelevant features. Similarly for R denote the corresponding submatrix of relevant features as R1. Denote the sub-

∗ ∗ vectors of µ1 and µ2 indexed at S as µ1 and µ2. For I2 we have the following decomposition:

1 1 − 1 I = (µ∗ − µ∗)tDˆ −1(µ∗ − µ∗) − (µˆ ∗ − µ∗)tDˆ 2 (ηˆ − d ) 2 2 1 2 1 1 2 2 1 1 1 1 1 1 − 1 1 − 1 1 − 1 − (µˆ ∗ − µ∗)tDˆ 2 (ηˆ − d ) − (µˆ ∗ − µ∗)tDˆ 2 d − (µˆ ∗ − µ∗)tDˆ 2 d 2 2 2 1 1 1 2 1 1 1 1 2 2 2 1 1 ∗ ∗ µ − µ − 1 + ( 1 2 )tDˆ 2 (ηˆ − d ) 2 1 1 1 1 1 1 1 1 1 ≡ kd k2 − I − I − I − I + I . 2 1 2 2,1 2 2,2 2 2,3 2 2,4 2 2,5

∗ ∗ 1 1/2 1/2 Since µˆ 1 − µ1 ∼ Nsp (0, n D1 R1D1 ), we use Cauchy-Schwartz Inequality to get an upper bound. We have

1 − 1 1 I2 = ((µˆ ∗ − µ∗)tDˆ 2 (ηˆ − d ))2 ≤ (µˆ ∗ − µ∗)tDˆ −1(µˆ ∗ − µ∗) · kηˆ − d k2 2,1 4 1 1 1 1 1 4 1 1 1 1 1 1 1 1 = (µˆ ∗ − µ∗)tD−1(µˆ ∗ − µ∗) · kηˆ − d k2 · (1 + o (1)). 4 1 1 1 1 1 1 1 p

∗ ∗ t −1 ∗ ∗ sp sp 2 (µˆ 1 − µ1) D1 (µˆ 1 − µ1) is Op( n λmax(R1)) = Op( n ), meanwhile, kηˆ1 − d1k is Op(εp). Therefore

q spεp q spεp I2,1 = Op( n ). Similarly I2,2 = Op( n ). √ Using the same techniques in Fan and Fan (2008), we could prove I2,3 = Op(kd1k/ n) and I2,4 = √ Op(kd1k/ n).

For I2,5, according to Cauchy-Schwartz Inequality, we have

83 µ∗ − µ∗ µ∗ − µ∗ I2 ≤ ( 1 2 )tDˆ −1( 1 2 ) · kηˆ − d k2 2,5 2 1 2 1 1 µ∗ − µ∗ µ∗ − µ∗ = ( 1 2 )tD−1( 1 2 ) · kηˆ − d k2 · (1 + o (1)). 2 1 2 1 1 p

Therefore 1 √ |I | ≤ pkd k2 · kηˆ − d k2 = O ( ε kd k). 2,5 2 1 1 1 p p 1

Asymptotically, we have

q q √ 1 2 spεp √ kd1k − Op( ) − Op( εpkd1k) − Op(kd1k/ n) ˜ 2n n Ψ = √ √ (1 + op(1)). λ1(Op( εp) + kd1k)

Therefore   1 2 q spεp √ √ kd1k − Op( ) − Op( εpkd1k) − Op(kd1k/ n) ˆ  2 n  W (δθˆ , θ) ≤ 1 − Φ √ √ (1 + op(1)) . (B.1)  λ1(Op( εp) + kd1k) 

2 Since kd1k = Cp, we have

q  1 spεp p p  Cp − Op( ) − Op( εpCp) − Op( Cp/n) ˆ 2 n W (δθˆ) ≤ 1 − Φ  √ p √ (1 + op(1)) . λ1( Cp + Op( εp))

√ √ spεp p If εp = o(Cp), n = o(Cp) and nCp → ∞, then Cp/2 and λ1 Cp are the leading terms of denomi- nator and numerator respectively. We have

p ! ˆ Cp W (δ) ≤ 1 − Φ √ (1 + op(1)) . 2 λ1

B.2 Proof of Theorem 3.3

Conditions in Theorem 3.3 implies the conditions in Theorem 3.2. Therefore

q  1 spεp p p  Cp − Op( ) − Op( εpCp) − Op( Cp/n) ˆ 2 n W (δθˆ) ≤ Φ − √ p √ (1 + op(1)) . λ1( Cp + Op( εp))

84 Cp Using Lemma 1 in Shao et al. (2011), we let ξn = and 4λ1

q spεp p p Op( n ) + Op( εpCp) + Op( Cp/n) τn = p √ . Op( εpCp) + λ1Cp

√ Using the conditions εpCp = o(1) and spεp = o(n), we could easily verify that ξn → ∞, τn → 0 and

τnξn → 0, therefore ˆ p W (δθˆ)/W (δOPT ) → 0.

85 Appendix C

Supplemental Material for Chapter 4

C.1 Proof of Theorem 4.1

First we compute bias of m(0) for estimating β∗. We apply Singular Value Decomposition of X, X = PDQT ,

T where P is an n × p matrix satisfying P P = Ip, D is an p × p diagonal matrix of full rank, Q is a p × p orthogonal matrix. Then we have XT X = QD2QT . The minimal diagonal element of D2 is of order nη1 . we have

(0) 2 −1 T ∗ bias(m ) = −Q(v0nD + Ip) Q β ;

(0) 2 2 −1 −1 2 2 −1 −1 T Var(m ) = σ Q(D + v0n Ip) D (D + v0n Ip) Q ;

2 2 −1 −1 T V = σ Q(D + v0n Ip) Q .

(0) ∗ η2 (0) 1 α−η1+ 2 2 λn1 Therefore for any 1 ≤ j ≤ p, bias(mj ) ≤ v λ kβ k ∼ n . Moreover, Var(mj ) ≤ σ −1 2 ∼ 0n n1 (λn1+v0n ) 2 −η1 σ −η1 δ−α n → 0 and Vjj ≤ 1 ∼ n if α < η1. Threshold r ∼ n . We have the similar probability bound λn1+ v0n

  0 (δ−α)/2  (0) ∗ (δ−α)/2 |bias(mj )| − n P |mj − βj | > n ≤ 2Φ   q (0) Var(mj )   α−η + η2 (δ−α)/2 n 1 2 − n 2 η1+δ−α ≤ 2Φ   ≤ exp(−c0n ); q (0) Var(mj )

if δ > 3α − 2η1 + η2.

2 η +δ−α (0) (0) Therefore with probability larger than 1−q exp(−c n 1 ) we have min ∗ |m | ≥ − max ∗ |m − 0 j:γj =1 j j:γj =1 j ∗ ∗ (δ−α)/2 β | + min ∗ |β | ≥ −n + a . j j:γj =1 j n 2 η +δ−α (0) (δ−α)/2 Similarly, with probability larger than 1−2(p−q) exp(−c n 1 ), we have max ∗ |m | ≤ n . 0 j:γj =0 j

2 η1+δ−α ∗ In Step 5, with large enough n, with probability larger than 1 − 2p exp(−c0n ), P(γˆ = γ ) ≥ 1 −

2 η1+δ−α 2p exp(−c0n ).

2 T Pp λnj T In Step 4, forσ ˆ , tr(XVX ) ≤ 1  p where λnj is jth eigenvalue of X X. Denote PX = j=1 λnj + v0

86 T −1 T X(X X) X . Since 2α − η1 + η2 ≤ 1, we have

2 T 1 −1 T 2 ky − Xmk = ky − X(X X + Ip) X yk v0 2 T 1 −1 T 2 = ky − PXyk + kPXy − X(X X + Ip) X yk v0 1 p 2 2 v0  T 2 2 2 2α−η1+η2 2 2 = σ χn−p(0) + kP 1 j=1P yk  σ χn−p(0) + n + σ χp(0) = Op(n). + λnj v0

(0) 2 Pp (0) 2 if j > q, αj = v0n, We have ((mj ) + Vjj)/v0n = op(1). j=q+1((mj ) + Vjj)/v0n = op(p − q). If

Pq (0) 2 1 −η1 2 −η1 α+η2 j ≤ q, αj = v0n, ((m ) + Vjj)/αj ≺ (qn + Op(kβk + qn )) ≺ Op(n ). To combine the j=1 j v0n 2 results, the numerator should be Op(n), thereforeσ ˆ = Op(1).

2 η1+δ−α T T In Step 6, with probability larger than 1−p exp(−c0n ), we have ∆ = diag(v11q , v01p−q), therefore, 2 T −1 −1 T T −1 −1 T −1 −1 −1 ∗ Var(m) = σ (X X + ∆ ) X X(X X + ∆ ) and bias(m) = −(X X + ∆ ) ∆ β . Since v1 is a

T 1 −1 T −1 −1 positive constant and v0n goes to 0, therefore (X X + Ip) ≥ (X X + ∆ ) . Therefore v1

kbias(m)k2 = (β∗)T ∆−1(XT X + ∆−1)−2∆−1β∗

−2 ∗ T T −1 −2 ∗ −2 ∗ T T 1 −2 ∗ = v1 (β ) (X X + ∆ ) β ≤ v1 (β ) (X X + Ip) β v1 1 ≤ ( )2kβ∗k2. v1λ1n + 1

η2 1 ∗ δ −η1+ Hence for any 1 ≤ j ≤ p, bias(mj) ≤ kβ k ∼ exp(−n )n 2 . According to properties of positive v1λn1 semi-definite matrix, we have

−2 T −1 −1 T T −1 −1 σ λmax(Var(m)) = λmax((X X + ∆ ) X X(X X + ∆ ) )

T −1 T −1 T −1 −1 = [λmin((X X + ∆ )(X X) (X X + ∆ ))]

T −1 −1 T −1 −1 −1 = [λmin(X X + 2∆ + ∆ (X X) ∆ )]

T −1 −1 T −1 −1 −1 ≤ (Weyl’s Inequality)[λmin(X X) + 2λmin(∆ ) + λmin(∆ (X X) ∆ )] 1 T −1 −1 −η1 ≤ [λmin(X X) + 2λmin(∆ )] = ∼ n . η1 −1 n + 2v1

Therefore for any 1 ≤ j ≤ p, Var(m ) ≤ 1 ∼ n−η1 . j η1 −1 n +2v1 P p ∗ According to Proposition 4.1, concentration property holds when i/∈S∗ wˆi → 0 and Pb1 (kb1 − β1k ≤

2 −η1 −η1 δ Mn) → 1. Since maxi>q mi = Op(n ), maxi Vii = Op(n ), α < η1, we have maxj>q wˆj  exp(−n /2). P δ Therefore i/∈S∗ wˆi  (p − q) exp(−n /2).

∗ −η1 2 δ −2η1+η2 2 Pb1 (kb1 − β1k ≤ Mn) → 1 holds when qn = op(n) and exp(−2n )n = o(n). Therefore

87 ∗ p ΠX(B(β , Mn)) → 1.

C.2 Proof of Theorem 4.2

∗ ˜ ˜ T ˜ T T ˜ ∗ 2 ˜ Let An = {γˆ = γ }, denote β = (β1 , β2 ) , on the event An, we have EkXβ − Xβ k 1An ≤ EkX1β1 − ∗ 2 ˜ 0 ˜ 2 X1β1k = tr(X1Var(β1)X1) + kX1bias(β1)k . We deal with the variance component first.

Var(m) = σ2(XT X + ∆−1)−1XT X(XT X + ∆−1)−1.

2 T −1 ˜ ˜ It is easy to show Var(m) ≤ σ (X X) . Since Var(β1) is the q ×q principle submatrix of Var(m) , Var(β1) should be upper bounded by corresponding q × q principle submatrix of σ2(XT X)−1, which is

2 T T T −1 T −1 σ (X1 X1 − X1 X2(X2 X2) X2 X1) .

Therefore we have

˜ T 2 ˜ T tr(X1Var(β1)X1 ) = σ tr(Var(β1)X1 X1)

2 T T T −1 T −1 T ≤ σ tr((X1 X1 − X1 X2(X2 X2) X2 X1) X1 X1).

n×q n×q q×q q×q T n×(p−q) We apply Singular Value Decomposition to X1 and X2: X1 = P1 D1 (Q1 ) , X2 = n×(p−q) (p−q)×(p−q) (p−q)×(p−q) T T T T T T P2 D2 (Q2 ) , where Q1 Q1 = Q1Q1 = Iq, P1 P1 = Iq, Q2 Q2 = Q2Q2 = Ip−q, T T T P2 P2 = Ip−q. Then PX1 = P1P1 and PX2 = P2P2 . Therefore

T T T −1 T −1 T T T −1 T tr((X1 X1 − X1 X2(X2 X2) X2 X1) X1 X1) = tr(X1(X1 X1 − X1 PX2 X1) X1 )

T T −1 T T T −1 = tr(P1(Iq − P1 P2P2 P1) P1 ) = tr((Iq − P1 P2P2 P1) ).

T T T T T T Since λmax(PX2 PX1 ) = λmax(P2P2 P1P1 ) = λmax(P1 P2P2 P1) < 1−c1, we have λmin(−P1 P2P2 P1) > T T T T −1 + c1, since P1 P2P2 P1 is a semi-positive definite matrix, we have λmin(Iq − P1 P2P2 P1) > c1, therefore T T 1 T T −1 q λmin(Iq − P P2P P1) > , tr((Iq − P P2P P1) ) < . 1 2 c1 1 2 c1 For the bias part, since

bias(m) = −(XT X + ∆−1)−1∆−1β∗;

88 we have

˜ −1 T −1 T T −1 −1 T −1 ∗ X1bias(β1) = −v1 X1(X1 X1 + v1 Iq − X1 X2(X2 X2 + v0 Ip−q) X2 X1) β1;

T −1 T T −1 −1 T −1 X1(X1 X1 + v1 Iq − X1 X2(X2 X2 + v0 Ip−q) X2 X1)

T 1 −1 T 1 −1 T = X1(X1 X1 + Iq) + X1(X1 X1 + Iq) X1 X2 v1 v1 T 1 T T 1 −1 T −1 T T 1 −1 (X2 X2 + Ip−q − X2 X1(X1 X1 + Iq) X1 X2) X2 X1(X1 X1 + Iq) ; v0 v1 v1 the largest singular value is

T 1 −1 σmax(X1(X1 X1 + Iq) ) v1 r 1 1 η1 T −1 T T −1 − 2 = λmax((X1 X1 + Iq) X1 X1(X1 X1 + Iq) )  n . v1 v1

T 1 −1 T 2 1 −1 T X1(X1 X1 + Iq) X1 = P1D1(D1 + ) D1P1 ; v1 v1

T 1 −1 T Therefore the largest eigenvalue of X1(X X1 + Iq) X is bounded by 1. What’s more, 1 v1 1

T 1 T T 1 −1 T −1 T λmax(X2(X2 X2 + Ip−q − X2 X1(X1 X1 + Iq) X1 X2) X2 ) v0 v1 T T 1 T 2 1 T T −1 T = λmax(P2D2Q2 (Q2D2Q2 + Ip−q − Q2D2P2 P1D1(D1 + Iq)D1P1 P2D2Q2 ) Q2D2P2 ) v0 v1 T T 1 T 2 1 T T −1 = λmax(D2Q2 (Q2D2Q2 + Ip−q − Q2D2P2 P1D1(D1 + Iq)D1P1 P2D2Q2 ) Q2D2) v0 v1 1 −2 T 2 1 −1 T −1 = λmax((Ip−q + D2 − P2 P1D1(D1 + Iq) D1P1 P2) ) v0 v1 1 T 2 −1 T −η1+α −1 ≤ (1 − λmax(P2 P1D1(D1 + Iq) D1P1 P2) + n ) v1 1 2 −1 T T −η1+α −1 ≤ (1 − λmax(D1(D1 + Iq) D1)λmax(P1 P2P2 P1) + n ) v1 nη1 −η1+α −1 ≤ (1 − η (1 − c1) + n ) ≤ constant. n 1 + 1/v1

89 Based on the constant bound of largest eigenvalue, we have

T −1 T T −1 −1 T −1 σmax(X1(X1 X1 + v1 Iq − X1 X2(X2 X2 + v0n Ip−q) X2 X1) )

T 1 −1 T 1 −1 T ≤ σmax(X1(X1 X1 + Iq) ) + λmax(X1(X1 X1 + Iq) X1 ) v1 v1 T 1 T T 1 −1 T −1 T λmax(X2(X2 X2 + Ip−q − X2 X1(X1 X1 + Iq) X1 X2) X2 ) v0n v1 1 T −1 −η1/2 σmax(X1(X1 X1 + Iq) ) ∼ n ; v1

˜ 2 −η1+η2 Therefore kX1bias(β1)k  n → 0 on the event An. To combine these 2 parts together, we have

˜ ∗ 2 q 2 EkXβ − Xβ k 1An ≤ σ + o(1). c1

c 2 ∗ 2 2 ∗ 2 On the event A , since n ≥ p, λ ≤ n . kXβ˜ − Xβ k 1 c ≤ 2kXβ˜ − Xmk 1 c + 2kXm − Xβ k 1 c . n np An An An ˜ 2 ˜ 2 δ−α For the first term, kXβ − Xmk = λnpkβ − mk ≤ n pλnp,

2 δ−α 2 η1+δ−α kXβ˜ − Xmk 1 c  n pλ (2p exp(−c n )) → 0. E An np 0

Using the fact that (XT X + ∆−1)−1XT X(XT X + ∆−1)−1 ≤ (XT X + ∆−1)−1, for the second term, since

∗ 2 T −1 −1 −1 ∗ T −1 −1 T 2 kXm − Xβ k = k − X(X X + ∆ ) ∆ β + X(X X + ∆ ) X εnk

≤ 2(β∗)T ∆−1(XT X + ∆−1)−1XT X(XT X + ∆−1)−1∆−1β∗

T T −1 −1 T T −1 −1 T + 2εn X(X X + ∆ ) X X(X X + ∆ ) X εn

∗ T −1 T −1 −1 −1 ∗ T T −1 −1 T ≤ 2(β ) ∆ (X X + ∆ ) ∆ β + 2εn X(X X + ∆ ) X εn

T −1 −1 −η1 T −1 −1 T 2 −1 −1 T Since λmax((X X + ∆ ) )  n and X(X X + ∆ ) X ≤ PD(D + v1 Ip) DP , we have

∗ 2 2α+η2−η1 kXm − Xβ k ≤ O(n ) + Op(p). Therefore

2 2α+η2−η1 c kXm − Xβk 1 c ≤ (O(n ) + O(p))P (A ) → 0. E An n

To sum up, kXβ˜ − Xβk2 ≤ q σ2 + o(1). E c1

90 C.3 Proof of Theorem 4.3

First we compute bias of m(0) for estimating θ∗. First we apply Singular Value Decomposition to X,

T T X = PDQ , where P is an n × n orthogonal matrix satisfying P P = In, D is an n × n diagonal matrix of

T T 2 T full rank, Q is a p × n matrix satisfying Q Q = In. Then we have X X = QD Q . The minimal diagonal

2 η1 T T element of D is of order n . Let Q⊥ be p × (p − n) matrix which satisfies Q⊥Q⊥ = Ip−n and Q⊥Q = O. ∗ T ∗ Define Γ = (Q, Q⊥), using the same trick of Shao and Deng (2012) and the fact that θ = QQ θ , we have

(0) T −1 −1 T ∗ ∗ bias(m ) = (X X + v0 Ip) X Xθ − θ

2 T −1 ∗ T T 2 T −1 T T ∗ = −(v0QD Q + Ip) (θ ) = −Γ(v0Γ QD Q Γ + Ip) Γ QQ θ

2 −1 T ∗ = −Q(v0D + In) Q θ ;

Therefore

(0) 2 ∗ T 2 −2 T ∗ kbias(m )k = (θ ) Q(v0D + In) Q θ .

T 2 −2 1 1 Since minimal nonzero eigenvalue of X X is λ1n, we have (v0D +Ip) ≤ 2 Ip ≤ 2 2 Ip. Therefore (1+v0λn1) v0 λn1 (0) 2 2α ∗ 2 2 (0) 1 ∗ −η + η2 +α kbias(m )k ≺ n kθ k /(λ ). for any 1 ≤ j ≤ p, bias(m ) ≤ kθ k ∼ n 1 2 . n1 j v0λn1 For variance of m(0), we have

(0) 2 T −1 −1 T T −1 −1 Var(m ) = σ (X X + v0 Ip) X X(X X + v0 Ip)

2 2 T −1 −1 2 T 2 T −1 −1 = σ (QD Q + v0 Ip) QD Q (QD Q + v0 Ip) .

(0) 2 λn1 (0) 2 λn1 The maximal eigenvalue of Var(m ) is σ −1 2 . Therefore for any 1 ≤ j ≤ p, Var(mj ) ≤ σ −1 2 ∼ (λn1+v0 ) (λn1+v0 ) σ2n−η1 . For V, we have

2 2 T −1 −1 V = σ (QD Q + v0 Ip) ≤ v0Ip.

−α Therefore for any 1 ≤ j ≤ p, Vjj ≤ v0 ∼ n .

2 In Step 4 we showσ ˆ = Op(1). Since

T (0) 2 Pp (0) 2 tr(XVX ) + ky − Xm k + (Vjj + (m ) )/v0 + νλ σˆ2 = j=1 j , n + p + ν + 2

T 2 −1 −1 (0) 2 2 −1 −1 T 2 we have tr(XVX ) = tr(D(D + v0 In) D) ≤ n and ky − Xm k = ky − PD(D + v0 In) DP yk .

91 Using the same method in proof of Theorem 4.1 we have

2 −1 −1 T 2 Eky − PD(D + v0 In) DP yk

2 −1 −1 2 T ∗ 2 2 −1 −1 T 2 = kPD(In − (D + v0 In) D )Q θ k + EkP(In − D(D + v0 In) D)P εk

 n2α−η1+η2 + n1+α−η1 ,

if 2α − η1 + η2 ≤ 1 and η1 ≥ α.

(0) 2 Pp (0) 2 if j > q, αj = v0n, we have ((mj ) + Vjj)/v0n = Op(1). j=q+1((mj ) + Vjj)/v0n = Op(p − q). If

Pq (0) 2 1 −α ∗ 2 −η1 α+η2 j ≤ q, αj = v0n, ((m ) + Vjj)/αj ≺ (qn + Op(kθ k + qn )) ≺ Op(n ). To combine the j=1 j v0n 2 results, the numerator should be Op(n), thereforeσ ˆ = Op(1). (0) In step 5 we need to show when n is sufficiently large, with probability going to 1, min ∗ |m | > j:γj =1 j √ r(0). We have

  0 (δ−α)/2  (0) ∗ (δ−α)/2 |bias(mj )| − n P |mj − θj | > n ≤ 2Φ   q (0) Var(mj )   α−η + η2 (δ−α)/2 n 1 2 − n 2 η1+δ−α ≤ 2Φ   ≤ exp(−c0n ), q (0) Var(mj )

2 η +δ−α (0) since δ > 3α−2η +η . Therefore with probability larger than 1−q exp(−c n 1 ) we have min ∗ |m | ≥ 1 2 0 j:γj =1 j (0) ∗ ∗ (δ−α)/2 − max ∗ |m − θ | + min ∗ |θ | ≥ −n + a . j:γj =1 j j j:γj =1 j n

2 η1+δ−α Similarly, with probability larger than 1 − (p − q) exp(−c0n ), we have

√ max |m(0)| ≤ max |m(0) − θ∗| + max |θ∗| ≤ nδ−α/2 + b ≤ r. ∗ j ∗ j j ∗ j n j:γj =0 j:γj =0 j:γj =0

Hence

∗ 2 δ+η1−α P(γˆ = γ ) ≥ 1 − 2p exp(−c0n ).

T In Step 6 we measure the order of bias and variance. Consider SVD for X1 and X2: X1 = P1D1Q1 , T T X2 = P2D2Q2 , where P1 is n × q matrix satisfying P1 P1 = Iq, D1 is q × q diagonal matrix, Q1 is q × q T orthogonal matrix, P2 is n × r matrix satisfying P2 P2 = Ir, D2 is r × r diagonal matrix, Q2 is r × (p − q) orthogonal matrix. Using this notation, we have

T −1 −1 T 2 −1 −1 T T X2(X2 X2 + v0 Ip−q) X2 = P2D2(D2 + v0 Ir) D2P2 ≤ P2P2 ;

92 T −1 −1 T 2 −1 −1 T T X1(X1 X1 + v1 Iq) X1 = P1D1(D1 + v1 Iq) D1P1 ≤ P1P1 .

T T 2 T −1 −1 T T −1 −1 Define ∆ = diag(v11q , v01p−q), therefore, Var(m) = σ (X X + ∆ ) X X(X X + ∆ ) and bias(m) = −(XT X + ∆−1)−1∆−1θ∗. We have

−1 T −1 −1 ∗ T −1 −1 T −1 −1 ∗ T bias(m) = −v1 (X X + ∆ ) θ − (X X + ∆ ) (0 , (v0 − v1 )θ2) .

Therefore

2 −2 ∗ T T −1 −2 ∗ 2 ∗ 2 kbias(m)k ≤ 2v1 (θ ) (X X + ∆ ) θ + 2(1 − v0/v1) kθ2k

−2 ∗ T 2 −1 −2 T ∗ 2 ∗ 2 = 2v1 (θ ) Q(D + v1 ) Q θ + 2(1 − v0/v1) kθ2k

η2−2η1 δ ∗ 2 ∼ 2n exp(−2n ) + 2kθ2k .

Besides,

2 T −1 −1 T T −1 −1 λmax(Var(m)) = σ λmax((X X + ∆ ) X X(X X + ∆ ) )

T −1 −2 T T −1 −2 T = λmax(X(X X + ∆ ) X ) ≤ λmax(X(X X + v1 Ip) X )

T 2 T −1 −2 T = λmax(PDQ (QD Q + v1 Ip) QDP )

2 −1 −2 −η1 = λmax(D(D + v1 In) D) ∼ n .

−η1 P δ Therefore Var(mj) ≺ n . Using the same procedure as Theorem 1, we have i/∈S∗ wˆi  (p−q) exp(−n /2) ∗ (δ−α)/2 δ α 2 ∗ 2 2 if kθ2k ≺ n . Concentration property holds if log p = o(n ), q/n = o(n) and kθ2k = o(n).

C.4 Proof of Theorem 4.4

∗ The difference between Xθ and X1θ1 is

∗ ∗ 2 ∗ 2 kXθ − X1θ1k = kX2θ2k ,

˜ ∗ 2 ˜ ∗ 2 ∗ 2 therefore EkXθ − Xθ k  EkXθ − X1θ1k + kX2θ2k . ∗ ˜ ∗ 2 ˜ ∗ 2 ˜ T Let An = {γˆ = γ }, on the event An, we have EkXθ−X1θ1k 1An = EkX1θ1−X1θ1k ≤ tr(X1Var(θ1)X1 )+ ˜ 2 kX1bias(θ1)k .

93 For variance part, following proof of Theorem 4.3, we have

2 T −1 −1 T T −1 −1 λmax(Var(m)) = σ λmax((X X + ∆ ) X X(X X + ∆ ) )

2 T −1 −1 T T −1 −1 ≤ σ λmax((X X + v1 Ip) X X(X X + v1 Ip) )

2 2 −1 −2 2 −η1 ≤ σ λmax(D(D + v1 In) D)  σ n .

˜ ˜ 2 −η1 ˜ T Since Var(θ1) is a principal submatrix of Var(m), we have Var(θ1) ≤ σ n Iq. Therefore tr(X1Var(θ1)X1 ) ≤

2 −η1 T 1−η1 2 σ n tr(X1X1 ) = qn σ . T −1 −1 −1 ∗ ˜ −1 ∗ For the bias part, since bias(m) = −(X X + ∆ ) ∆ θ ; we have X1bias(θ1) = −v1 X1Aθ1 − −1 ∗ v0 X1Bθ2. X1A is

T −1 T T −1 −1 T −1 X1A = X1(X1 X1 + v1 Iq − X1 X2(X2 X2 + v0 Ip−q) X2 X1) .

Looking at squared singular values of X1A, we have

2 T 2 −1 T 2 −1 −1 T −2 T λmax(X1A X1 ) = λmax(P1D1[D1 + v1 Iq − D1P1 P2D2(D2 + v0 Ir) D2P2 P1D1] D1P1 )

−1 −2 T 2 −1 −1 T −2 = λmax([Iq + v1 D1 − P1 P2D2(D2 + v0 Ir) D2P2 P1] )

T 2 −1 −1 T −2 ≤ [1 − λmax(P1 P2D2(D2 + v0 Ir) D2P2 P1)]

2 −1 −1 T T −2 ≤ [1 − λmax(D2(D2 + v0 Ir) D2)λmax(P2 P1P1 P2)]

2 −1 −1 −2 T T T T ≤ [1 − λmax(D2(D2 + v0 Ir) D2)] ( since λmax(P2 P1P1 P2) = λmax(P2P2 P1P1 ) ≤ 1)

T T −1 −2 −2α 2 T ≤ [1 − λmax(X2 X2)/(λmax(X2 X2) + v0 )] ∼ n λmax(X2 X2).

−1 ∗ 2 η2−2α δ 2 2 η2−2α δ 2 Therefore kv1 X1Aθ1k  n exp(−2n )λmax(X2X2) ≤ n exp(−2n )λnp. For the second part of the bias we have

−1 ∗ −1 T 1 T T 1 −1 T −1 T − v0 X1Bθ2 = −v0 X1[X1 X1 + Iq − X1 X2(X2 X2 + Ip−q) X2 X1] X1 v1 v0 T 1 −1 ∗ X2(X2 X2 + Ip−q) θ2 v0 −1 T 1 T T 1 −1 T −1 T = −v0 X1[X1 X1 + Iq − X1 X2(X2 X2 + Ip−q) X2 X1] X1 v1 v0 T 1 −1 ∗ (X2X2 + In) X2θ2. v0

T 1 −1 T 1 −1 In the second equation we use the fact that X2(X X2 + Ip−q) = (X2X + In) X2, which is proved 2 v0 2 v0

94 by Woodbury formula. Meanwhile, using Woodbury formula, we have

T 1 −1 T −1 −1 T (X2X2 + In) = v0(In − X2(X2 X2 + v0 Ip−q) X2 ). (C.1) v0

Using (C.1) we have

−1 T 1 T T 1 −1 T −1 T T 1 −1 λmax(v0 X1[X1 X1 + Iq − X1 X2(X2 X2 + Ip−q) X2 X1] X1 (X2X2 + In) ) v1 v0 v0 −1 T 1 T T 1 −1 T −1 T T 1 −1 = λmax(v0 [X1 X1 + Iq − X1 X2(X2 X2 + Ip−q) X2 X1] X1 (X2X2 + In) X1) v1 v0 v0 −1 T T T 1 −1 T −1 T T 1 −1 ≤ λmax(v0 [X1 X1 − X1 X2(X2 X2 + Ip−q) X2 X1] X1 (X2X2 + In) X1) v0 v0 ≤ 1.

We have

−1 ∗ 2 ∗ 2 k − v0 X1Bθ2k  kX2θ2k .

To combine these 2 parts together, we have

˜ ∗ 2 1−η1 2 η2−2α δ 2 ∗ 2 EkXθ − Xθ k 1An ≤ qn σ + n exp(−2n )λnp + kX2θ2k .

2 δ+η −α ∗ 2 2 ∗ 2 Since p λ = o(exp(n 1 )), kXθ˜ − Xθ k 1 c ≤ 2kXθ˜ − Xmk 1 c + 2kXm − Xθ k 1 c . kXθ˜ − np An An An 2 ˜ 2 δ−α Xmk ≤ λnpkθ − mk ≤ n pλnp,

2 δ−α 2 δ+η1−α kXθ˜ − Xmk 1 c ≺ n p · λ (2p exp(−c n )) → 0. E An np 0

Besides, using the fact that (XT X + ∆−1)−1XT X(XT X + ∆−1)−1 ≤ (XT X + ∆−1)−1, for the second term,

we have

∗ 2 T −1 −1 −1 ∗ T −1 −1 T 2 kXm − Xθ k = k − X(X X + ∆ ) ∆ θ + X(X X + ∆ ) X εnk

∗ T −1 T −1 −1 −1 ∗ T T −1 −1 T ≤ 2(θ ) ∆ (X X + ∆ ) ∆ θ + 2εn X(X X + ∆ ) X εn.

−1 T −1 −1 −1 α T −1 −1 T 2 −1 −1 T Since ∆ λmax((X X + ∆ ) )∆  n and X(X X + ∆ ) X ≤ PD(D + v1 Ip) DP , we have

∗ 2 α+η2 kXm − Xθ k ≤ O(n ) + Op(n). Therefore

2 2α+η2−η1 c kXm − Xβk 1 c ≤ (O(n ) + O(n))P (A ) → 0. E An n

95 ∗ 2 Therefore kXθ˜ − Xθ k 1 c → 0. To sum up, E An

˜ ∗ 2 1−η1 2 η2−2α δ 2 ∗ 2 EkXθ − Xθ k  qn σ + n exp(−2n )λnp + kX2θ2k .

96 Appendix D

Supplemental Material for Chapter 5

D.1 Proof of Lemma 5.2

First we consider p/n → 0. According to the results of perturbation bounds (Golub and Van Loan (2012)),

2 p the gap between the largest eigenvalue and the second largest eigenvalue of A is kρk . Since kEk2 → 0,

∗ 2 p ∗ therefore sin ∠(vˆ, ρ ) ≤ (4/kρk )kEk2 = Op( p/n). Since both vˆ and ρ have the unit length, we have

1    kvˆ − ρ∗k = 2 sin (ρ∗, vˆ) ∼ sin (ρ∗, vˆ) ≤ O (pp/n). 2∠ ∠ p

D.2 Proof of Theorem 5.1

∗ T ∗ ∗ ∗ ∗ T First we consider a special case ρ = (1, 0, ··· , 0) before moving on to the general case ρ˜ = (ρ1, ρ2, ··· , ρs, 0, ··· , 0) by applying a proper orthogonal transformation (tilde here implies the generalized version).

First we check the order of Av(0) = Avˆ = λˆvˆ. Using Weyl’s inequality, we have

ˆ 2 2 p |λ − (kρk + ξ )| ≤ kEk2 = Op( p/n).

kρk → % and % is a positive constant, therefore λˆ is bounded from 0 and infinity.

To show selection consistency, it’s equivalent to show

αi αi P (min log > Cn and max log < −Cn) → 1. i∈S 1 − αi i∈Sc 1 − αi

The formula to update αi is

ˆ 2 αi ω (λvˆi) 1 log = log + − log(a1n + 1). 1 − αi 1 − ω 2(1/a1n + 1) log p/n 2

∗ The key thing is to figure out the magnitude ofv ˆi, 2 ≤ i ≤ p. We measure the magnitude of ei =v ˆi − ρi ,

97 ∗ ∗ T T T ∗ where ρ1 = 1 and ρi = 0 for 2 ≤ i ≤ p. Let vˆ = (ˆv1, vˆ2 ) where vˆ2 = (ˆv2, vˆ3, ··· , vˆp) , Since kvˆ − ρ k = p a.s. 2 p Op( p/n),v ˆ1 → 1 and k1 − vˆ1k = Op( p/n). a.s. ˆ For i = 1 (signal part), sincev ˆ1 → 1 and λ is bounded from 0 and infinity, we have

ˆ 2 α1 w (λvˆ1) 1 log ≥ log + − log(a1n + 1) Cn, 1 − α1 1 − w log p/n 2

if log a1n ≺ n/ log p, which is intrinsically included in minimal signal condition .

Theorem 6 in Paul (2007) shows that the vector vˆ2/kvˆ2k is distributed uniformly on the unit sphere

p−2 p 2 p S and is independent of kvˆ2k = 1 − vˆ1 = Op( p/n) (This theorem holds despite any asymptotic relationship between p and n).

We have

d max1≤j≤p−1 Yj i.i.d max |ei| = q , where Y1,Y2, ··· ,Yp−1 ∼ N(0, 1). i≥2 Pp−1 2 k=1 Yk √ q Since E(max Y ) ∼ log p and Pp−1 Y 2/(p−1) a.s.→ 1, max√1≤j≤p−1 Yj = O (plog p/p). Therefore 1≤i≤p−1 i k=1 k Pp−1 2 p k=1 Yk p p p maxi≥2 |ei| = Op( log p/p) · Op( p/n) = Op( log p/n).

αi 2 2 Second we consider the noise part maxi≥2 log . Since maxi≥2 v = maxi≥2 e = Op(log p/n), we have 1−αi i i

αi w Op(log p/n) 1 max log ∼ log + − log(a1n + 1) if i ≥ 2. i≥2 1 − αi 1 − w log p/n 2

αi Therefore maxi∈Sc log ≺ −κ log(a1n + 1) where κ is any constant between 0 and 1/2. Based on the 1−αi above 2 statements, we have shown

α1 αi P (log > Cn and max log < −Cn) → 1; 1 − α1 i≥2 1 − αi

therefore P (Sˆ = S) → 1. Our procedure has selection consistency property and the algorithm will stop after

one iteration with probability going to 1.

Then we move from the special case to the general case. For the special case ρ∗ = (1, 0, ··· , 0)T , true

covariance matrix is Σ = diag(kρk2 + ξ2, ξ2, ··· , ξ2). Contruct

  Q O  1  Q =   OIp−s

2 ∗ ∗ T 2 as the eigenvector matrix for Σ˜ = kρk ρ (ρ ) + ξ Ip. Q1 is a s × s orthogonal matrix which the first ∗ ∗ ∗ T ˜ T ∗ ∗ column is (ρ1, ρ2, ··· , ρs) . Then Σ = QΣQ , Qρ = ρ˜ , Qvˆ = v˜. Then e˜ = Qe is the transformed error

98 T T T T T T T vector where e = (1 − vˆ1, v2 ) = (1 − vˆ1, vˆ21, vˆ22) . vˆ21 = (ˆv2, ··· , vˆs) and vˆ22 = (ˆvs+1, ··· , vˆp) . Then the transformed error vector is

T  T T  T T T ∗ e˜ = Qe = (1 − vˆ1, vˆ21)Q1 , vˆ22 = (v˜1 , v˜2 ) , v˜ = e˜ + ρ˜ .

2 2 2 2 Therefore maxi∈Sc v˜i = maxi∈Sc e˜i = maxi≥s+1 ei ≤ maxi≥2 ei = Op(log p/n). Then using the same argument we could show

αi w Op(log p/n) 1 max log ≤ log + − log(a1n + 1) ≺ −Cn. i∈Sc 1 − αi 1 − w log p/n 2

T For signal part we need to find the order of (˜vi)1≤i≤s = (1 − vˆ1, vˆ21)Q1 , Decompose Q1 as

  ∗ T ρ1 q1      ρ∗ qT   2 2    ,  ······      ∗ T ρs qs

∗ T for any elementv ˜i (1 ≤ i ≤ s) in v˜, we havev ˜i =v ˆ1ρi + vˆ21qi. According to Cauchy-Schwartz Inequality,

v u s √ T u ∗ 2 X 2 p max kvˆ21qjk ≤ t(1 − (ρj ) ) v˜i ≤ s − 1 max |vˆi| ≤ Op( (s log p)/n). j∈S 2≤i≤s i=2

a.s. Rememberv ˆi → 1, when n is sufficiently large,

2 ∗ T 2 ∗ T 2 min vˆi ≥ min(|vˆ1||ρi | − |vˆ21qj|) ≥ (|vˆ1| min |ρi | − max |vˆ21qj|) i∈S i∈S i∈S i∈S

∗ p 2 ≥ (|vˆ1| min |ρi | − Op( (s log p)/n)) , i∈S

therefore for signal part

∗ p 2 αi w K(|vˆ1| mini∈S |ρi | − Op( (s log p)/n)) 1 log ≥ log + − log(a1n + 1) if i ∈ S, 1 − αi 1 − w log p/n 2

∗ 2 where K are the constant which doesn’t depend on n. If mini∈S(ρi ) ≥ (log p max(s, log a1n))/n, we have

∗ 2 αi w |vˆ1| mini∈S |ρi | √  min log ∼ log + K p − Op( s) − log(a1n + 1)/2 Cn. i∈S 1 − αi 1 − w log p/n

99 Based on the above 2 statements, we’ve shown

αi αi P (min log > Cn and max log < −Cn) → 1; i∈S 1 − αi i∈Sc 1 − αi

therefore P (Sˆ = S) → 1. Our procedure has frequentist selection consistency and the algorithm will stop after one iteration with probability going to 1.

In addition, to prove bayesian selection consistency, notice that

αi −1 min log > Cn ⇐⇒ max(1 − αi) < (1 + exp(Cn)) ; i∈S 1 − αi i∈S

αi −1 max log < −Cn ⇐⇒ max αi < (1 + exp(Cn)) ; i∈Sc 1 − αi i∈Sc

−1 −1 therefore P (maxi∈S(1 − αi) < (1 + exp(Cn)) and maxi∈Sc αi < (1 + exp(Cn)) ) → 1. Using inequality Q P i(1 − pi) ≥ 1 − i pi for any 0 ≤ pi ≤ 1, we have with large probability

∗ Y Y X X 1 − q(Z = Z ) = 1 − αi (1 − αi) ≤ αi + (1 − αi) i∈S i∈Sc i∈S i∈Sc

−1 κ ≤ p(1 + exp(Cn)) ) = p/(1 + a1n) → 0

p with probability going to 1. Therefore q(Z = Z∗) → 1. Our algorithm has Bayesian selection consistency.

D.3 Proof of Theorem 5.2

Theorem 5.2 has the same proof as Theorem 5.1 except the following minor modifications:

T T T Let vˆ = (ˆv1, vˆ2 ) where vˆ2 = (ˆv2, vˆ3, ··· , vˆp) , according to Theorem 4 of Paul (2007),

r c . c  √ vˆ a.s.→ M = 1 − 1 + , if %2/ξ2 − 1 > c. 1 (%2/ξ2 − 1)2 %2/ξ2 − 1

a.s. 2 √ where M is a positive constant between 0 and 1 (instead ofv ˆ1 → 1 in Theorem 1). Since (%/ξ) > 1 + c, p M is strictly larger than 0, we havev ˆ1 = Op(1) andv ˆ1 9 0. √ Besides, by Theorem 2 of Paul (2007), if %2/ξ2 − 1 > c,

 c  λˆ a.s.→ (%2 + ξ2) 1 + . %2/ξ2

So λˆ is bounded by 0 and infinity almost surely.

100 Other parts are the same as the proof of Theorem 5.1.

D.4 Proof of Theorem 5.3

The proof uses the similar method as Johnstone and Lu (2001). We use 2 chi-square large deviation inequal- ities:

2  2 P χn ≤ n(1 − ) ≤ exp(−n /4), 0 ≤  < 1,

2  2 P χn ≥ n(1 + ) ≤ exp(−3n /16), 0 ≤  < 1/2.

2 Define n = bn/ξ . For any index l from 1, 2, ··· , s, we have

 2 2  X 2 2  2 2  P σˆl < σˆ(k) ≤ P σˆi > ξ (1 + n/2) + P σˆl < ξ (1 + n/2) i≥k

2 2 2  2 2 2 2  ≤ (p − k)P ξ χn/n > ξ (1 + n/2) + P (ξ + ρl )χn/n < ξ (1 + n/2)

2  2 2 2 2  ≤ (p − k)P χn/n > 1 + n/2 + P (ξ + ρl )χn/n < ξ (1 + n/2)   2  2 1 + n/2 ≤ (p − k) exp −3nn/64 + P χn/n < 2 2 1 + ρl /ξ  2    3nbn 2 1 + n/2 ≤ (p − k) exp − 4 + P χn/n < 2 2 . 64ξ 1 + ρl /ξ

For the second term, we have

      2 1 + n/2 2 1 + n/2 2 1 + n/2 P χn/n < 2 2 ≤ P χn/n < 2 2 ≤ P χn/n < 1 + ρl /ξ 1 + mini∈S ρi /ξ 1 + n  2   2  nn nbn ≤ exp − 2 = exp − 2 4 . 16(1 + n) 16(1 + n) ξ

Therefore,  2   2   2 2  3nbn nbn P σˆl < σˆ(k) ≤ (p − k) exp − 4 + exp − 2 4 . 64ξ 16(1 + n) ξ

Hence

X  2 2  P (SL ) S) ≤ P σˆl < σˆ(k) l∈S  2   2  3nbn nbn ≤ s(p − k) exp − 4 + s exp − 2 4 64ξ 16(1 + n) ξ  3nb2   nb2  ∼ s(p − k) exp − n + s exp − n . 64ξ4 16ξ4

101 2 Since log(p − k) + log s = o(nbn), the 2 terms go to 0. Therefore P (SL ) S) → 0, P (SL ⊇ S) → 1.

102 References

Abramovich, F., Benjamini, Y., Donoho, D. L., and Johnstone, I. M. (2006). Special invited lecture: Adapting to unknown sparsity by controlling the false discovery rate. The Annals of Statistics, (34):584– 653. Amini, A. A. and Wainwright, M. J. (2009). High-dimensional analysis of semidefinite relaxations for sparse principal components. The Annals of Statistics, pages 2877–2921.

Bickel, P. J. and Levina, E. (2004). Some theory for Fisher’s linear discriminant function,’naive Bayes’, and some alternatives when there are many more variables than observations. Bernoulli, 10(6):989–1010. Bickel, P. J. and Levina, E. (2008). Regularized estimation of large covariance matrices. The Annals of Statistics, 36(1):199–227.

Bickel, P. J., Ritov, Y., and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and dantzig selector. The Annals of Statistics, pages 1705–1732. Blei, D. M., Jordan, M. I., et al. (2006). Variational inference for dirichlet process mixtures. Bayesian analysis, 1(1):121–143.

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, (just-accepted). Brown, L. D. and Greenshtein, E. (2009a). Nonparametric empirical bayes and compound decision ap- proaches to estimation of a high-dimensional vector of normal means. The Annals of Statistics, pages 1685–1704.

Brown, L. D. and Greenshtein, E. (2009b). Nonparametric empirical bayes and compound decision ap- proaches to estimation of a high-dimensional vector of normal means. The Annals of Statistics, pages 1685–1704. B¨uhlmann,P. and Van De Geer, S. (2011). Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media.

Cai, T. and Liu, W. (2012). A direct estimation approach to sparse linear discriminant analysis. Journal of the American Statistical Association, 106(496):1566–1577. Castillo, I., Schmidt-Hieber, J., and Van der Vaart, A. (2015). Bayesian linear regression with sparse priors. The Annals of Statistics, 43(5):1986–2018.

Castillo, I., van der Vaart, A., et al. (2012). Needles and straw in a haystack: Posterior concentration for possibly sparse sequences. The Annals of Statistics, 40(4):2069–2101. d’Aspremont, A., Ghaoui, L. E., Jordan, M. I., and Lanckriet, G. R. (2005). A direct formulation for sparse pca using semidefinite programming. In Advances in neural information processing systems, pages 41–48.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the em algorithm. Journal of the royal statistical society. Series B (methodological), pages 1–38.

103 Dicker, L. H. and Zhao, S. D. (2016). High-dimensional classification via nonparametric empirical bayes and maximum likelihood inference. Biometrika, page asv067. Domingos, P. and Pazzani, M. (1997). On the optimality of the simple bayesian classifier under zero-one loss. Machine learning, 29(2-3):103–130.

Donoho, D. L., Johnstone, I. M., Hoch, J. C., and Stern, A. S. (1992). Maximum entropy and the nearly black object. Journal of the Royal Statistical Society. Series B (Methodological), pages 41–81. Escobar, M. D. and West, M. (1995). Bayesian density estimation and inference using mixtures. Journal of the american statistical association, 90(430):577–588. Fan, J. and Fan, Y. (2008). High dimensional classification using features annealed independence rules. The Annals of Statistics, 36(6):2605–2637. Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association, 96(456):1348–1360. Fan, J. and Lv, J. (2010). A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20(1):101. Friedman, J. H. (1989). Regularized discriminant analysis. Journal of the American statistical association, 84(405):165–175. Gao, C. and Zhou, H. H. (2015). Rate-optimal posterior contraction for sparse pca. The Annals of Statistics, 43(2):785–818.

George, E. and Foster, D. P. (2000). Calibration and empirical bayes variable selection. Biometrika, 87(4):731–747. George, E. I. and McCulloch, R. E. (1993). Variable selection via gibbs sampling. Journal of the American Statistical Association, 88(423):881–889.

Ghosal, S. (1999). Asymptotic normality of posterior distributions in high-dimensional linear models. Bernoulli, 5(2):315–331. Golub, G. H. and Van Loan, C. F. (2012). Matrix computations, volume 3. JHU Press. Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531–537. Golubev, G. K. (2002). Reconstruction of sparse vectors in white gaussian noise. Problems of Information Transmission, 38(1):65–79. Greenshtein, E. and Park, J. (2009). Application of non parametric empirical bayes estimation to high dimensional classification. The Journal of Machine Learning Research, 10:1687–1704. Guan, Y. and Dy, J. (2009). Sparse probabilistic principal component analysis. In Artificial Intelligence and Statistics, pages 185–192. Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67. Hu, J. (2017). Statistical methods for learning sparse features. Ph.D. Thesis in University of Illinois at Urbana-Champaign. Jiang, W. (2007). Bayesian variable selection for high dimensional generalized linear models: convergence rates of the fitted densities. The Annals of Statistics, 35(4):1487–1511.

104 Jiang, W. and Zhang, C.-H. (2009). General maximum likelihood empirical bayes estimation of normal means. The Annals of Statistics, 37(4):1647–1684. Johnstone, I. M. (2001). On the distribution of the largest eigenvalue in principal components analysis. Annals of statistics, pages 295–327.

Johnstone, I. M. and Lu, A. Y. (2001). On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association. Johnstone, I. M. and Silverman, B. W. (2004). Needles and straw in haystacks: Empirical bayes estimates of possibly sparse sequences. Annals of Statistics, (32):1594–1649. Johnstone, I. M. and Silverman, B. W. (2005). Ebayesthresh: R and s-plus programs for empirical bayes thresholding. J. Statist. Soft, 12:1–38. Koenker, R. (2014). A gaussian compound decision bakeoff. Stat, 3(1):12–16. Koenker, R. and Mizera, I. (2014). Convex optimization, shape constraints, compound decisions, and empirical bayes rules. Journal of the American Statistical Association, 109(506):674–685.

Lei, J., Vu, V. Q., et al. (2015). Sparsistency and agnostic inference in sparse pca. The Annals of Statistics, 43(1):299–322. Lo, A. Y. (1984). On a class of bayesian nonparametric estimates: I. Density estimates. The annals of statistics, 12(1):351–357.

Ma, Z. (2013). Sparse principal component analysis and iterative thresholding. The Annals of Statistics, 41(2):772–801. Marcenko, V. A. and Pastur, L. A. (1967). Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik, 1(4):457. Martin, R., Mess, R., Walker, S. G., et al. (2017). Empirical bayes posterior concentration in sparse high- dimensional linear models. Bernoulli, 23(3):1822–1847. Martin, R. and Walker, S. G. (2014). Asymptotically minimax empirical bayes estimation of a sparse normal mean vector. Electronic Journal of Statistics, 8(2):2188–2206. Mitchell, T. J. and Beauchamp, J. J. (1988). Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404):1023–1032. Nakajima, S., Tomioka, R., Sugiyama, M., and Babacan, S. D. (2015). Condition for perfect dimensionality recovery by variational bayesian pca. Journal of Machine Learning Research, 16:3757–3811. Narisetty, N. N. and He, X. (2014). Bayesian variable selection with shrinking and diffusing priors. The Annals of Statistics, 42(2):789–817.

Neal, R. M. and Hinton, G. E. (1998). A view of the em algorithm that justifies incremental, sparse, and other variants. In Learning in graphical models, pages 355–368. Springer. Paul, D. (2007). Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica, pages 1617–1642.

Peterson, C. and Hartman, E. (1989). Explorations of the mean field theory learning algorithm. Neural Networks, 2(6):475–494. Robbins, H. (1951). Asymptotically subminimax solutions of compound statistical decision problems. In Proc. Second Berkeley Symp. Math. Statist. Probab., volume 1, pages 131–148. Univ. California Press, Berkeley.

105 Rovckova, V. and George, E. I. (2014). EMVS: The EM approach to bayesian variable selection. Journal of the American Statistical Association, 109(506):828–846. Rovckova, V. and George, E. I. (2015). Bayesian estimation of sparse signals with a continuous spike-and-slab prior. Submitted manuscript, pages 1–34.

Shao, J. and Deng, X. (2012). Estimation in high-dimensional linear models with deterministic design matrices. The Annals of Statistics, 40(2):812–831. Shao, J., Wang, Y., Deng, X., and Wang, S. (2011). Sparse linear discriminant analysis by thresholding for high dimensional data. The Annals of statistics, 39(2):1241–1265. Shen, D., Shen, H., and Marron, J. S. (2013). Consistency of sparse PCA in high dimension, low sample size contexts. Journal of Multivariate Analysis, 115:317–333. Shen, H. and Huang, J. Z. (2008). Sparse principal component analysis via regularized low rank matrix approximation. Journal of multivariate analysis, 99(6):1015–1034. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288. Tipping, M. E. and Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622. Vu, V. Q., Lei, J., et al. (2013). Minimax sparse principal subspace estimation in high dimensions. The Annals of Statistics, 41(6):2905–2947.

Walker, S. and Hjort, N. L. (2001). On bayesian consistency. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 63(4):811–821. Wang, J., Liang, F., and , Y. (2016). An ensemble em algorithm for bayesian variable selection. arXiv preprint arXiv:1603.04360.

Xie, M.-g. and Singh, K. (2013). Confidence distribution, the frequentist distribution estimator of a param- eter: a review. International Statistical Review, 81(1):3–39. Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of statistics, pages 894–942.

Zhang, C.-H. and Huang, J. (2008). The sparsity and bias of the lasso selection in high-dimensional linear regression. The Annals of Statistics, (36):1567–1594. Zou, H., Hastie, T., and Tibshirani, R. (2006). Sparse principal component analysis. Journal of computational and graphical statistics, 15(2):265–286.

106