<<

The Pennsylvania State University The Graduate School

THRESHOLDED PARTIAL CORRELATION APPROACH FOR

VARIABLE SELECTION IN LINEAR MODELS AND PARTIALLY

LINEAR MODELS

A Dissertation in Statistics by Lejia Lou

© 2013 Lejia Lou

Submitted in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

December 2013

The dissertation of Lejia Lou was reviewed and approved∗ by the following:

Runze Li Distinguished Professor of Statistics Dissertation Advisor and Chair of Committee

David Hunter Professor of Statistics Head of Department of Statistics

Bing Li Professor of Statistics

Rongling Wu Professor of Public Health Science

Aleksandra Slavkovic Chair of Graduate Program in Statistics

∗Signatures are on file in the Graduate School.

Abstract

This thesis is concerned with variable selection in linear models and partially linear models for high-dimensional analysis. With the development of technology, it is crucial to identify a small subset of covariates that exhibits the strongest relationship with the response. Researchers have devoted much effort to developing variable selection methodologies, such as regularized techniques, including the least absolute shrinkage and selection operator (LASSO, Tibshirani, 1996) and the penalized estimate with the smoothly clipped absolute deviation penalty (SCAD, Fan and Li, 2001). Different from those regularization methods for variable selection in linear models, Buhlmann et al. (2010) proposed the PC-simple algorithm to select significant variables. As they showed, under some conditions and a proper choice of the significance level, the PC-simple algorithm can consistently identify the true active set with probability approaching 1 when the response and covariates are jointly normally distributed.

In Chapter 3, we study the performance of the PC-simple algorithm under non-normal distributions. The PC-simple algorithm builds its variable selection on the fact that, asymptotically, Fisher's z-transform of the sample marginal and partial correlations follows the standard normal distribution. This fact is invalid when the samples are from non-normal distributions, which is the drawback of the PC-simple algorithm. Thus, we derive the asymptotic distribution of Fisher's z-transform of the sample marginal correlations and partial correlations under elliptical distributions, and we find that the asymptotic distributions depend on the kurtosis of the underlying distribution. According to the threshold we develop, the PC-simple algorithm would result in over-fitting a model with positive kurtosis, and under-fitting a model with negative kurtosis. The results from the extensive simulation studies with elliptical distributions,

including normal distributions and mixture normal distributions, are consistent with our understanding. With normal samples, the PC-simple algorithm and our proposal have similar performance, as both of them utilize similar asymptotic distributions of the sample marginal and partial correlations. However, with mixture normal distributions, the PC-simple algorithm overfits the data, as the kurtosis of the mixture normal distributions is greater than 0. Our proposal outperforms the PC-simple algorithm in terms of the correct fitting percentage. This implies that adjusting the threshold is necessary. Moreover, the application of our proposal to the cardiomyopathy microarray data suggests that our proposal is comparable to the regularization approach with the SCAD penalty, and outperforms the approach with the LASSO penalty. Furthermore, by imposing some conditions on the partial correlations, we show that the proposed approach can consistently identify the true active set.

In Chapter 4, we study how to apply the thresholded partial correlation approach to select significant variables in partially linear models. First, we transform the partially linear model approximately into a linear model by the partial residuals technique. Then we apply the thresholded partial correlation approach to the resulting linear model to obtain the estimated active set. After that, we apply the least squares approach to get the estimates of the coefficients in the linear part. The estimate of the nonparametric function is obtained by substituting the estimate of the linear part into the original model. We call this approach the thresholded partial correlation on partial residuals (TPC-PR) approach. Similarly, we can utilize the PC-simple algorithm on the partial residuals, pretending the samples are from normal distributions, and we call the resulting algorithm the PC-simple algorithm on partial residuals (PC-PR). We establish the asymptotic consistency of variable selection in the linear part, and the asymptotic normality of the nonparametric function. Simulation studies show that our proposal performs as well as the penalized approach on partial residuals with the SCAD penalty (Fan and Li, 2004), and outperforms the LASSO penalty. The real data analysis also demonstrates that our proposal can yield a parsimonious model.

Table of Contents

List of Figures

List of Tables

Acknowledgments

Chapter 1  Introduction
  1.1 Background
  1.2 Contribution
  1.3 Organization

Chapter 2  Literature Review
  2.1 Introduction
  2.2 Variable Selection in Linear Models
    2.2.1 Classical Variable Selection Criteria
    2.2.2 Penalized Least Squares
    2.2.3 PC-simple Algorithm
  2.3 Partially Linear Models
    2.3.1 Model
    2.3.2 Partially Linear Models for Longitudinal Data
    2.3.3 Asymptotic Results for Profile Least Squares Estimators
  2.4 Variable Selection in Partially Linear Models
    2.4.1 Penalized Profile Least Squares
    2.4.2 Asymptotic Results of Penalized Profile Approach
    2.4.3 Iterated Ridge Regression

Chapter 3  Thresholded Partial Correlation Approach for Variable Selection in Linear Models
  3.1 Introduction
  3.2 Preliminaries
    3.2.1 Model and Notation
    3.2.2 Partial Faithfulness
    3.2.3 Elliptical Distributions
  3.3 Thresholded Partial Correlation Approach
    3.3.1 Thresholded Partial Correlation Approach: General Samples
    3.3.2 Limiting Distributions of Correlations and Partial Correlations
    3.3.3 Thresholded Partial Correlation Approach: Elliptical Samples
  3.4 Asymptotic Theory
    3.4.1 Asymptotic Theory of Â_n(α)
    3.4.2 Asymptotic Theory of Â_n^[1](α)
    3.4.3 Discussion of the Conditions
  3.5 Numerical Studies
    3.5.1 Simulation Studies
    3.5.2 Real Data: Cardiomyopathy Microarray Data
  3.6 Conclusion
  3.7 Lemmas and Technical Proofs
    3.7.1 Lemmas
    3.7.2 Proof of Theorem 3.8
    3.7.3 Proof of Equation (3.3)
    3.7.4 Proof of Theorem 3.13
    3.7.5 Proof of Theorem 3.14

Chapter 4  Thresholded Partial Correlation on Partial Residuals for Variable Selection in Partially Linear Models
  4.1 Introduction
  4.2 Preliminaries
    4.2.1 Model and Objective
    4.2.2 Partial Faithfulness
  4.3 Thresholded Partial Correlation on Partial Residuals Approach
    4.3.1 Population Version
    4.3.2 Sample Version
  4.4 Asymptotic Properties
    4.4.1 Asymptotic Properties for Variable Selection
    4.4.2 Bias, Variance and Asymptotic Normality of the Nonparametric Function
    4.4.3 Discussion of the Conditions
  4.5 Numerical Studies
    4.5.1 Simulation Studies
    4.5.2 Real Data: Istanbul Stock Exchange Data
  4.6 Conclusion
  4.7 Lemmas and Technical Proofs
    4.7.1 Lemmas
    4.7.2 Proof of Theorem 4.4 (First Half)
    4.7.3 Proof of Theorem 4.5
    4.7.4 Proof of Theorem 4.6
    4.7.5 Proof of Theorem 4.7
      4.7.5.1 Derivation of the Bias of Nonparametric Part
      4.7.5.2 Derivation of the Variance of Nonparametric Part
    4.7.6 Proof of Theorem 4.8

Chapter 5  Conclusion and Future Research
  5.1 Conclusion
  5.2 Future Research
    5.2.1 Further Theoretical Development
    5.2.2 Semi-parametric Varying-coefficient Models

Bibliography

List of Figures

2.1 The soft thresholding rule
2.2 The SCAD thresholding rule
2.3 Comparison between local quadratic approximation and local linear approximation
4.1 Istanbul Stock Exchange Index against the predictors
4.2 Estimated curves
4.3 Density plots of the variables
4.4 Distribution of residuals

List of Tables

3.1 Simulation result for Example 1
3.2 Simulation result for Example 1
3.3 Simulation result for Example 1
3.4 Simulation result for Example 1
3.5 Simulation result for Example 2
3.6 Simulation result for Example 2
3.7 Simulation result for Example 2
3.8 Simulation result for Example 3
3.9 Simulation result for Example 3
3.10 Selected predictors
3.11 Part of the correlation matrix
3.12 Comparison of the performances
4.1 Normal samples with AR correlation matrix (p = 20)
4.2 Normal samples with compound symmetric correlation matrix (p = 20)
4.3 Mixture normal samples with AR correlation matrix (p = 20)
4.4 Mixture normal samples with compound symmetric correlation matrix (p = 20)
4.5 Normal samples with AR correlation structure (p = 200)
4.6 Mixture normal samples with AR correlation matrix (p = 200)
4.7 Normal samples with AR correlation structure (p = 500)
4.8 Mixture normal samples with AR correlation structure (p = 500)
4.9 Estimated coefficients
4.10 Comparison of the performances
4.11 Correlation matrix of the covariates

Acknowledgments

First of all, I would like to gratefully and sincerely thank my Ph.D. advisor, Dr. Runze Li, for his guidance, encouragement and comments on my Ph.D. projects and career. His guidance and comments helped me to think about our projects and my life more deeply. His suggestions were valuable in helping me get a well-rounded experience in statistics. My whole life will benefit from my experience under his supervision.

Secondly, I wish to express my profound appreciation to my committee members, Dr. David Hunter, Dr. Bing Li and Dr. Rongling Wu, for their precious time, input, discussion and comments to improve my dissertation.

Further, I would like to say special thanks to the Department of Statistics, Penn State University. More than four years ago, I was honored to become a member of this family. I started my learning in statistics with so many outstanding faculty, smart classmates and friendly staff. The classes I took during my Ph.D. study, in spite of the annoying proofs, were helpful for understanding statistics. The seminars I attended provided lots of information about the hot topics in statistics. I enjoyed all the activities in this department, including the two qualifying exams, conferences and seminars.

Last, but not least, I am highly grateful to every member of my family: my parents, my brother and my sister-in-law. Without their love, support and understanding, I could not have completed my study at Penn State University.

This dissertation research was supported by National Institute on Drug Abuse (NIDA) grant P50-DA10075 and National Cancer Institute (NCI) grant R01 CA168676. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NSF, NIH, NIDA and NCI.

Chapter 1

Introduction

1.1 Background

High-dimensional data analysis has become popular due to the rapid development of modern technologies for data collection in diverse scientific fields, such as microarrays and finance. Usually, to attenuate the model biases, a large number of predictors are included in the model. However, this might result in loss of prediction accuracy and make interpretation more difficult. For high-dimensional data, the number of predictors, p, is usually larger than the sample size, n. To be precise, p = O(n^a) with a > 1. Researchers are interested in identifying a smaller subset of covariates that can provide a parsimonious description of the relationship between the response and predictors. Two classic approaches for improving the estimates are subset selection (Miller, 2002) and ridge regression, but they have some drawbacks.

Subset selection can result in interpretable models, but as Breiman (1996) analyzed, it is a discrete procedure and thus unstable: a small perturbation of the data would cause a significant change in the selected models, and hence subset selection would yield poor prediction. Ridge regression is a continuous approach and it is stable; however, it does not result in a sparse model although it shrinks some coefficients.

To avoid the above weaknesses, Tibshirani (1996) proposed the least absolute shrinkage and selection operator (LASSO). By minimizing the residual sum of squares under the constraint that the sum of the absolute values of the coefficients is less than a tuning parameter, LASSO effectively sets some coefficients to 0 while shrinking other coefficients. However, LASSO leads to inherent estimation bias for large true coefficients, as shown by Fan and Li (2001). Three desired properties of the estimates, continuity, sparsity and unbiasedness, are advocated by Fan and Li (2001). Through careful study of these three attributes, Fan and Li (2001) proposed the smoothly clipped absolute deviation (SCAD) penalty. As they demonstrated, penalized least squares with the SCAD penalty function enjoys the oracle property. That is, the SCAD approach can perform as if the true model were known.

Fan and Li (2001) further proposed the local quadratic approximation (LQA) algorithm for optimizing the non-concave penalized least squares or, more generally, the non-concave penalized likelihood. One weakness of LQA is that if a covariate is deleted at any step in the LQA procedure, it will be excluded from the final selected model (Hunter and Li, 2005). Zou and Li (2008) proposed the local linear approximation (LLA) algorithm to avoid this drawback. As they showed, one-step LLA estimates enjoy the oracle property with good initial estimators if the regularization parameters are suitably chosen. There are some other regularized approaches, such as the elastic net (Zou and Hastie, 2005), the adaptive LASSO (Zou, 2006) and the Dantzig selector (Candes and Tao, 2007).

Different from those regularization methods for variable selection in linear models, Buhlmann et al. (2010) proposed the PC-simple algorithm, which is a simplification of the PC algorithm (Spirtes et al., 2001). The letters PC stand for the first names of Peter Spirtes and Clarke Glymour, the inventors of the PC algorithm. The PC-simple algorithm ingeniously exploits the marginal correlation and partial correlation under the concept of partial faithfulness to select significant variables. As Buhlmann et al.
(2010) showed, under some conditions and a proper choice of the significance level, the PC-simple algorithm can consistently identify the true active set with probability approaching 1 when the response and covariates are jointly normally distributed. This algorithm can be viewed as a generalization of sure independence screening (SIS) proposed by Fan and Lv (2008); however, it is a more careful study of the correlations between the response and predictors than the SIS procedure. Extensive simulations in Buhlmann et al. (2010) suggest that the PC-simple algorithm is competitively comparable to penalty-based approaches, such as the LASSO and adaptive LASSO. Therefore, there are two comparable ways for variable selection in high-dimensional linear models, and that improves the confidence of variable selection.

Partially linear models are popular semiparametric modeling techniques; they assume the mean of the response to be linearly dependent on some covariates, whereas its relation to other variables is characterized by nonparametric functions. The partially linear model is a special case of generalized additive models (Hastie and Tibshirani, 1993). As usual, the partially linear model is defined as follows:

y = α(u) + x^T β + ε,   (1.1)

where y is the response variable, α(u) is an unspecified baseline function of u, x is the p × 1 covariate vector, β is a vector of unknown regression coefficients, and ε is the random error. Much work has been done on the estimation of the nonparametric function α(u) and

β. Zeger and Diggle (1994) developed an iterative algorithm to estimate α(u) and β based on the back-fitting technique; by extending the idea of partial residuals, Moyeed and Diggle (1994) improved the back-fitting approach. Fan and Li (2004) proposed difference-based estimates and profile least squares estimates to model longitudinal data using partially linear models.

As in linear models, variable selection is an important topic in partially linear models. With a large number of predictors in the model, it is difficult to interpret the model although the model bias is small. To have a more effective description of the relationship between the response and covariates, we need to select the significant variables. Some classic variable selection procedures can be extended to partially linear models, but there are many challenges for implementation. The computation is demanding, as we need to estimate the nonparametric part for each sub-model. Also, classic variable selection procedures have some other drawbacks, such as the instability of best subset selection (Breiman, 1996).

Fan and Li (2004) applied the penalized least squares techniques to partially linear models. They suggested the penalized profile least squares approach for variable selection in partially linear models. Their approach dramatically reduces the computational cost by eliminating the nuisance nonparametric function via the profile technique. Moreover, they established the asymptotic theory for both the nonparametric function and the regression coefficients. As they showed, with a suitable choice of penalty functions and regularization parameters, the resulting estimate can perform as well as an oracle estimate.

Li and Liang (2008) developed variable selection procedures for semi-parametric regression models using penalized likelihood. They proposed estimating the nonparametric function and regression coefficients by maximizing the local likelihood. To improve the efficiency of the regression coefficients, they suggested updating the estimates of the parametric part by maximizing the global likelihood function. They demonstrated the asymptotic normality of the resulting estimates with proper choices of penalty functions and regularization parameters. Moreover, they also showed that the proposed procedure performs as if the truth were known under some constraints.

1.2 Contribution

The contributions of this dissertation are summarized as follows. As mentioned before, Buhlmann et al. (2010) proposed the PC-simple algorithm for variable selection in linear models, but they only developed the asymptotic theory when the response and covariates are jointly normally distributed. The PC-simple algorithm based variable selection procedure is built on partial correlations. Under the normality assumption, the partial correlation has a succinct form. The normality assumption may be viewed as a common assumption in high-dimensional data analysis for convenience, but many proposed procedures for high-dimensional data are still valid without it. It is therefore of great interest to investigate how the PC-simple algorithm based variable selection procedure relies on the normality assumption. Chapter 3 is devoted to studying the behavior of the PC-simple algorithm based variable selection procedure without the normality assumption. In this chapter, we make the following theoretical contributions.

To get insights into the PC-simple based variable selection procedure beyond the normality assumption, we study the asymptotic behavior of the sample partial correlation under elliptical distributions. The asymptotic normality of the sample partial correlation under elliptical distributions clearly shows that the PC-simple algorithm fails when the kurtosis of the marginal distribution, defined by the ratio of the fourth central moment to the square of the variance, divided by 3, then minus

1, is far away from 0. In particular, when the kurtosis is significantly above 0, the PC-simple algorithm will result in an over-fitted model, while when the kurtosis is significantly below 0, the PC-simple algorithm will result in an under-fitted model. To avoid this drawback of the PC-simple algorithm, we suggest using a new threshold for the partial correlation. The threshold takes the marginal kurtosis into account. Our numerical results show that the new threshold performs much better than the one based on the normality assumption. To further improve the performance of the partial correlation based variable selection procedure, we propose using the extended BIC criterion to fine-tune the threshold. The resulting procedure can significantly outperform both the original PC-simple algorithm based variable selection procedure and the newly proposed partial correlation based variable selection procedure in some situations.

Buhlmann et al. (2010) established the sure screening property of the PC-simple algorithm based feature screening under the normality assumption, when the dimension of the predictors grows at a polynomial rate of the sample size. In our asymptotic analysis, we relax the normality assumption and allow the dimension of the predictors to grow at an exponential rate of the sample size. We further establish the sure screening property of the newly proposed variable selection procedure. By imposing conditions on the joint distribution of the response and covariates along with the pairwise partial correlations, we show that the thresholded partial correlation approach can consistently select the significant predictors.

We conduct extensive simulation studies to compare the performance of the newly proposed procedures with existing ones, including penalized least squares with the SCAD penalty and the LASSO penalty. Our simulation results suggest that the thresholded partial correlation approach can perform as well as penalty-based approaches. We apply the thresholded partial correlation procedure to the cardiomyopathy microarray data set.

Variable selection for partially linear models is challenging, as it involves the estimation of the nonparametric function, the selection of the smoothing parameter and variable selection for the linear part. Fan and Li (2004) proposed penalized profile least squares for partially linear models. Chapter 4 is devoted to developing an alternative, new variable selection procedure for partially linear models using the idea of thresholded partial correlation. This chapter makes the following contributions.

Based on the partial residual techniques developed for the estimation of partially linear models, we propose applying the thresholded partial correlation procedure to synthetic models for the partial residuals. Although the idea is natural, it is challenging to establish the theoretical properties of the proposed procedure. We establish the asymptotic theory for both the nonparametric and parametric parts. Specifically, we establish consistency of the proposed procedure. We further derive the asymptotic bias and variance of the local linear regression estimate of the nonparametric baseline function. We also conduct extensive simulation studies to examine the finite sample performance of the proposed procedure, and to compare it with some existing variable selection methods for partially linear models. Our numerical results show that our proposal may outperform the penalized least squares approach on the partial residuals.

1.3 Organization

The dissertation is organized as follows. Chapter 2 provides a literature review. We introduce some variable selection approaches, including classic variable selection techniques, modern regularization methods, and the PC-simple algorithm.

Partially linear models are introduced along with two ways of estimating the regression coefficients and the nonparametric function. Moreover, the penalized profile least squares approach for variable selection in partially linear models is described. In Chapter 3, we mainly develop the thresholded partial correlation approach. We also introduce elliptical distributions and partial faithfulness, and develop the asymptotic distributions of the correlations and partial correlations. Moreover, we establish the asymptotic properties of the thresholded partial correlation approach. Following that, simulation studies and the application of the thresholded partial correlation algorithm to the cardiomyopathy microarray data are presented; technical proofs are allocated to the last part of Chapter 3. In Chapter 4, we develop the thresholded partial correlation approach to select significant variables in partially linear models. First, the partially linear model is transformed to a linear model based on the partial residual approach; after that, for the resulting linear model, the thresholded partial correlation approach is applied to obtain the estimated active set. Then we can get the estimates of the coefficients and the nonparametric part. We call it the thresholded partial correlation on partial residuals (TPC-PR) approach. Extensive simulation studies show that the TPC-PR approach performs better than the penalized approach on the partial residuals with the SCAD penalty and the LASSO penalty. The application of TPC-PR to the Istanbul Stock Exchange index is presented in Chapter 4.

Chapter 2

Literature Review

2.1 Introduction

With the development of modern technology for data collection, high-dimensional data analysis has become increasingly popular. In practice, a large number of predictors are introduced into the model to reduce the model bias. One approach is the ordinary least squares (OLS) estimate. However, OLS estimates often have low bias but large variance, resulting in poor prediction. Another drawback of OLS is interpretation: with a large number of predictors in the model, it is difficult to interpret the model. Thus, researchers have developed methodologies for identifying a smaller subset that exhibits the strongest effects, such as subset selection (Miller, 2002), the least absolute shrinkage and selection operator (LASSO, Tibshirani, 1996), and penalized least squares with the SCAD penalty (SCAD, Fan and Li, 2001).

Different from those penalty-based approaches, Buhlmann et al. (2010) proposed the PC-simple algorithm by exploiting the partial correlations between the response and covariates conditioning on some subset of the predictors. As their simulation studies showed, the PC-simple algorithm is competitively comparable to penalty-based approaches. Thus it improves our confidence in variable selection in linear models.

Modeling longitudinal data is an important research topic in a variety of fields. Researchers have worked to develop various parametric and non-parametric models for longitudinal data. Parametric models would introduce model bias, though they provide interpretable models; non-parametric models are too flexible to make precise conclusions compared to parametric models. Semi-parametric models, or partially linear models, are good compromises. Fan and Li (2004) proposed two approaches for estimating the regression coefficients: the difference-based estimate and the profile least squares estimate. Also, for variable selection in partially linear models, they proposed the penalized profile least squares estimate to select the significant variables. They demonstrated that with proper choice of the regularization parameters and penalty functions, the penalized profile least squares estimate possesses the oracle property. Extensive simulation studies suggested this approach performs very well.

This chapter is organized as follows. In Section 2.2, we mainly introduce the variable selection techniques for linear models, including some classic variable selection approaches, penalized least squares approaches, and the PC-simple algorithm. In Section 2.3, we introduce partially linear models. Two ways of estimating the regression coefficients and the smoothing function are introduced, along with the asymptotic normality of the resulting regression coefficients. In Section 2.4, the penalized profile least squares estimate is introduced for variable selection in partially linear models. The asymptotic properties show that the penalized profile least squares estimate can perform as well as an oracle estimator.

2.2 Variable Selection in Linear Models

In this section, we introduce the variable selection techniques for linear models: classic approaches, penalized least squares approaches and the PC-simple algorithm. Consider the linear model,

y = x^T β + ε = ∑_{j=1}^p x_j β_j + ε,   (2.1)

where y is the response variable, x = (x_1, ··· , x_p)^T is the p × 1 covariate vector, β is a vector of unknown regression coefficients, and ε is the random error with E(ε|x) = 0 and Var(ε|x) = σ².

Specifically, suppose that (x_1, y_1), ··· , (x_n, y_n) is a random sample from model (2.1). Let Y = (y_1, ··· , y_n)^T be the n-vector of responses, X = (x_1, ··· , x_n)^T be the n × p covariate matrix, and ε = (ε_1, ··· , ε_n)^T be the vector of random errors. It follows that

Y = Xβ + ε.   (2.2)

When the dimension p is large, it is natural to assume the model is sparse. That is, only a small subset of predictors contribute to the response Y.

Our main goal is to identify the active set A = {j : β_j ≠ 0} ⊂ {1, ··· , p} and obtain their estimates based on the n observations.

2.2.1 Classical Variable Selection Criteria

In the past years, there has been a large amount of literature on classical variable selection. For instance, Akaike (1974) proposed Akaike's information criterion (AIC), and Schwarz (1978) developed the Bayesian information criterion (BIC).

Miller (2002) provided a thorough review of subset regression. Many variable selection criteria are built on the residual sum of squares (RSS), defined as

RSS = ||Y − Xβ̂||²,   (2.3)

where β̂ is the OLS estimate, denoted by β̂_OLS.

1. R² and Adjusted R²

R² is a commonly used statistical criterion for linear regression, and is defined as

R_d² = 1 − RSS_d / RSS_0,   (2.4)

where RSS_0 is the residual sum of squares from the model with the intercept only, while RSS_d is the residual sum of squares from the model with an intercept and d covariates (1 ≤ d ≤ p). R² measures the proportion of variance in Y explained by the model with d variables. It is obvious that RSS_d becomes smaller with more variables; thus R² always increases when a variable is added to a model. Therefore, R² is not a good variable selection criterion. The adjusted R² can avoid this drawback of R² by taking the number of variables in the model into account. It is defined by

A_d² = 1 − (1 − R_d²)(n − 1)/(n − d) = 1 − [RSS_d/(n − d)] / [RSS_0/(n − 1)].   (2.5)

It is also known as Fisher's A-statistic. From the definition of A_d², maximizing A_d² with respect to d is equivalent to choosing the model with the lowest residual mean square, RSS_d/(n − d). Thus, adding more predictors to the model does not necessarily result in a larger A_d².

2. AIC

Akaike (1974) proposed Akaike's information criterion (AIC), which is defined as

AIC_d = RSS_d + 2dσ²,   (2.6)

where RSS_d is the residual sum of squares with d covariates in the model. In practice, σ² is substituted with the unbiased estimate

σ̂² = RSS_p / (n − p),   (2.7)

where p is the total number of covariates and RSS_p is the residual sum of squares under the full model. With more variables in the model, we can have a smaller RSS_d, but the term 2dσ² serves as a penalty for adding more variables to the model.

3. BIC Schwarz (1978) suggested a different penalty based on Bayesian arguments. His measure, known as the Bayesian information criterion (BIC), has the general form

BIC_d = RSS_d + log(n) d σ²,   (2.8)

where RSS_d is the residual sum of squares with d covariates and σ² is estimated under the full model, as in (2.7). For both AIC and BIC, smaller values are preferable. BIC is a consistent variable selection criterion while AIC tends to over-fit models. That is, assuming a true model with a finite number of parameters exists, as the sample size approaches infinity, BIC can identify the true model, while AIC would provide an over-fitted model. Computation is a big issue when p is large, and this is one drawback of subset selection, motivating researchers to work on other variable selection techniques.
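To make the formulas above concrete, the following is a minimal sketch (in Python with NumPy; it is not part of the dissertation and the function name and toy data are hypothetical) that evaluates RSS, the adjusted R² of (2.5), AIC (2.6) and BIC (2.8) for one candidate subset of covariates. The degrees of freedom are taken exactly as written in (2.5) and (2.7).

```python
import numpy as np

def subset_criteria(X, y, subset):
    """Illustrative sketch: RSS, adjusted R^2, AIC and BIC for one candidate
    subset of columns of X, following (2.4)-(2.8) as stated in the text."""
    n, p = X.shape
    d = len(subset)

    def rss(cols):
        # OLS residual sum of squares for an intercept plus the given columns
        Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
        beta_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta_hat
        return float(resid @ resid)

    rss0 = rss([])                    # intercept-only model: RSS_0
    rss_d = rss(list(subset))         # RSS_d for the candidate subset
    rss_p = rss(list(range(p)))       # RSS_p under the full model
    sigma2_hat = rss_p / (n - p)      # (2.7), degrees of freedom as written

    adj_r2 = 1.0 - (rss_d / (n - d)) / (rss0 / (n - 1))   # (2.5)
    aic = rss_d + 2 * d * sigma2_hat                      # (2.6)
    bic = rss_d + np.log(n) * d * sigma2_hat              # (2.8)
    return {"RSS": rss_d, "adjR2": adj_r2, "AIC": aic, "BIC": bic}

# Toy usage on simulated data
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = 2 * X[:, 0] - X[:, 2] + rng.standard_normal(100)
print(subset_criteria(X, y, [0, 2]))
```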

2.2.2 Penalized Least Squares

The classical variable selection criteria become infeasible for large p due to the heavy computational requirement. To overcome this drawback, researchers proposed the penalized least squares (PLS) estimate, obtained by minimizing the objective function Q(β), defined as

Q(β) = ½||Y − Xβ||² + n ∑_{j=1}^p p_{λ_j}(|β_j|),   (2.9)

where p_{λ_j}(·) is the penalty function and λ_j is the regularization parameter controlling the model complexity.

1. LASSO: L1 Penalty

Tibshirani (1996) proposed the least absolute shrinkage and selection operator (LASSO). It shrinks some coefficients and sets others to 0. Considering the linear model (2.1), the LASSO estimate β̂ is obtained by solving the following constrained least squares problem:

min_β ||Y − Xβ||², subject to ∑_{j=1}^p |β_j| ≤ s,   (2.10)

where s ≥ 0 is the tuning parameter which controls the amount of shrinkage applied to the estimates. It is equivalent to minimizing the penalized least squares function

Q(β) = ||Y − Xβ||² + λ ∑_{j=1}^p |β_j|,   (2.11)

where λ is the tuning parameter, which can be chosen by a data-driven method such as cross-validation (CV) or generalized cross-validation (GCV).

Under the orthogonal design case, that is, X^T X = I_p, where I_p is the p × p identity matrix, the solutions to (2.11) are given by

β̂_j = sgn(β̂_j^0)(|β̂_j^0| − λ)_+,   (2.12)

where β̂_j^0 is the jth component of the ordinary least squares estimate β̂_OLS, sgn represents the sign function, and (·)_+ is defined as follows:

(x)_+ = x if x ≥ 0;  (x)_+ = 0 if x < 0.


Figure 2.1: The soft thresholding rule.

According to the thresholding rule given in Figure 2.1, one drawback of LASSO is that it yields biased estimates for large true coefficients.
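As a small illustration of the soft-thresholding rule (2.12) under an orthogonal design, the following NumPy sketch (not from the thesis; the function name is hypothetical) applies the rule componentwise to an OLS estimate. Note how a large component is shrunk by λ, which is exactly the bias discussed above.

```python
import numpy as np

def soft_threshold(beta_ols, lam):
    """Soft-thresholding rule (2.12) applied componentwise to the OLS
    estimate under an orthogonal design; a minimal illustrative sketch."""
    beta_ols = np.asarray(beta_ols, dtype=float)
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)

# Small components are set to zero; large components are shrunk by lam
print(soft_threshold([-0.5, 1.2, 6.0], lam=1.0))   # -> [-0.   0.2  5. ]
```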

2. Good Penalty Functions

Fan and Li (2001) suggested that a good penalized least squares estimate should have the following three properties:

(a) Continuity: The penalized estimate should be continuous in the data to be stable in model prediction.

(b) Sparsity: The penalized estimate should set small coefficients to 0 to make the model interpretable.

(c) Unbiasedness: The penalized estimate should be nearly unbiased when the true coefficients are large, to obtain accurate models.

Fan and Li (2001) further provided some sufficient conditions on the penalty functions to ensure that the resulting penalized least squares estimate possesses the above three properties:

(a) Continuity: if and only if arg min_θ {|θ| + p'_λ(|θ|)} = 0;

(b) Sparsity: if min_θ {|θ| + p'_λ(|θ|)} > 0;

(c) Unbiasedness: if and only if p'_λ(|θ|) = 0 for large |θ|,

where p_λ(·) is a nondecreasing and continuously differentiable function on (0, +∞) and p'_λ(0) means p'_λ(0+). Obviously, the LASSO penalty does not satisfy the unbiasedness condition, due to the fact that p'_λ(|θ|) ≡ λ > 0 for all θ. That coincides with the drawback of LASSO discussed in the previous subsection.

3. SCAD

In light of the three good attributes of penalty functions listed above, Fan and Li (2001) suggested the smoothly clipped absolute deviation (SCAD) penalty, which has the continuous first derivative

p'_λ(θ) = λ { I(θ ≤ λ) + [(aλ − θ)_+ / ((a − 1)λ)] I(θ > λ) },   (2.13)

where a > 2 and θ > 0. Both a and λ may be considered as tuning parameters controlling the size of the penalty. Under the orthogonal design matrix, the estimates with the SCAD penalty can be shown to be (Fan, 1997)

β̂_j = sgn(β̂_j^0)(|β̂_j^0| − λ)_+,  when |β̂_j^0| ≤ 2λ;
β̂_j = {(a − 1)β̂_j^0 − sgn(β̂_j^0)aλ}/(a − 2),  when 2λ < |β̂_j^0| ≤ aλ;   (2.14)
β̂_j = β̂_j^0,  when |β̂_j^0| > aλ,

where β̂_j^0 is the jth component of the ordinary least squares estimate. The SCAD thresholding rule is shown in Figure 2.2.


Figure 2.2: The SCAD thresholding rule.

It is easy to check that the SCAD penalty satisfies all the three conditions, thus the estimates from the SCAD penalty enjoy continuity, sparsity and unbiasedness. This coincides with the interpretation from Figure 2.2. Fan and Li (2001) discussed two ways of choosing the two tuning parameters (λ, a): fivefold cross-validation and generalized cross-validation. Fan and Li (2001) suggested a = 3.7 from the Bayesian point of view.
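For comparison with the soft-thresholding sketch above, the following minimal NumPy sketch (hypothetical names, not from the thesis) codes the SCAD thresholding rule (2.14) with the a − 2 denominator and the default a = 3.7. Unlike the LASSO rule, large OLS components are left untouched, which reflects the unbiasedness property.

```python
import numpy as np

def scad_threshold(beta_ols, lam, a=3.7):
    """SCAD thresholding rule (2.14) under an orthogonal design; a sketch
    using the default a = 3.7 suggested by Fan and Li (2001)."""
    z = np.asarray(beta_ols, dtype=float)
    out = np.empty_like(z)
    small = np.abs(z) <= 2 * lam
    mid = (np.abs(z) > 2 * lam) & (np.abs(z) <= a * lam)
    large = np.abs(z) > a * lam
    out[small] = np.sign(z[small]) * np.maximum(np.abs(z[small]) - lam, 0.0)
    out[mid] = ((a - 1) * z[mid] - np.sign(z[mid]) * a * lam) / (a - 2)
    out[large] = z[large]
    return out

# Small components are zeroed, moderate ones shrunk, large ones kept as is
print(scad_threshold([-0.5, 1.2, 3.0, 6.0], lam=1.0))
```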

4. Asymptotic Results for Penalized Likelihood Estimates

Fan and Li (2001) studied the asymptotic theory for the estimates from non-convex penalty functions. In addition to discussing the penalized least squares, they studied the properties of the estimates from maximizing the penalized likelihood, which is a more general case than the penalized least squares estimates.

Assume that the data {(x_1^T, y_1), ··· , (x_n^T, y_n)} are collected independently. Conditioning on x_i, y_i has a density f_i(g(x_i^T β), y_i), where g is a known link function. Let l_i = log f_i denote the conditional log-likelihood of y_i. Our objective is to maximize the penalized likelihood, defined as

∑_{i=1}^n l_i(g(x_i^T β), y_i) − n ∑_{j=1}^p p_λ(|β_j|),   (2.15)

with respect to β. Equivalently, we want to minimize

L(β) = − ∑_{i=1}^n l_i(g(x_i^T β), y_i) + n ∑_{j=1}^p p_λ(|β_j|),   (2.16)

with respect to β. To be precise, in the linear model setting, we want to minimize Q(β), which is defined in (2.9). Fan and Li (2001) studied the convergence rate of the penalized likelihood estimate and the penalized least squares estimate. Let β_0 = (β_{10}^T, β_{20}^T)^T be the true parameter and

a_n = max{p'_{λ_n}(|β_{j0}|) : β_{j0} ≠ 0}.   (2.17)

Theorem 2.1. Let {(x_1^T, y_1), ··· , (x_n^T, y_n)} be independently and identically distributed with a density f(x^T, y, β) satisfying conditions (A)-(C) in the Appendix of Fan and Li (2001). If

max{p''_{λ_n}(|β_{j0}|) : β_{j0} ≠ 0} → 0,   (2.18)

then there exists a local minimizer β̂ of L(β) in (2.16) or Q(β) in (2.9), such that

||β̂ − β_0|| = O_P(n^{−1/2} + a_n).

For the SCAD penalty, the resulting penalized likelihood estimate is root-n consistent if λ_n → 0. Furthermore, Fan and Li (2001) demonstrated that the root-n consistent estimator enjoys the oracle property under some additional conditions. Denote

Σ = diag{p''_{λ_n}(|β_{10}|), ··· , p''_{λ_n}(|β_{s0}|)},   (2.19)

and

b = (p'_{λ_n}(|β_{10}|)sgn(β_{10}), ··· , p'_{λ_n}(|β_{s0}|)sgn(β_{s0}))^T,   (2.20)

where s is the number of components of β_{10}.

Theorem 2.2. Let (x_1^T, y_1), ··· , (x_n^T, y_n) be independent and identically distributed with a density f(x^T, y, β) satisfying conditions (A)-(C) in the Appendix of Fan and Li (2001). Assume that

lim inf_{n→∞} lim inf_{θ→0+} p'_{λ_n}(θ)/λ_n > 0.   (2.21)

If λ_n → 0 and √n λ_n → ∞ as n → ∞, then with probability tending to 1, the root-n consistent local minimizer β̂ = (β̂_1^T, β̂_2^T)^T of L(β) must satisfy

(a) Sparsity: β̂_2 = 0.

(b) Asymptotic normality:

√n (I_1(β_{10}) + Σ){β̂_1 − β_{10} + (I_1(β_{10}) + Σ)^{−1} b} →_D N(0, I_1(β_{10})),   (2.22)

where I_1(β_{10}) is the Fisher information matrix knowing β_{20} = 0.

Fan and Li (2001) suggested that with a proper choice of regularization parameters, the proposed SCAD penalty satisfies the conditions in Theorem 2.2; thus estimators from the SCAD penalty perform as well as the oracle estimators. That is, the SCAD-penalized procedure estimates the coefficients as accurately as if the true model were known. However, the LASSO procedure does not have the oracle property.

5. Computation Algorithm

With a convex penalty, such as the L1 penalty, the objective function (2.9) is convex, and the estimates can be obtained via convex optimization. However, the objective function (2.9) is no longer convex with some non-convex penalty functions, such as the SCAD penalty. With the SCAD penalty, the objective function is a high-dimensional non-concave function with some singularities. Minimizing such a target function is challenging. Fan and Li (2001) proposed the local quadratic approximation (LQA) algorithm to deal with the penalty function, and this algorithm is widely applicable to a variety of parametric models.

Local Quadratic Approximation algorithm:

For the non-convex penalty functions, Fan and Li (2001) proposed approximating the penalty function locally by a quadratic function. When β_{j0} is very close to 0, set β̂_j = 0. Otherwise, apply the Taylor expansion to obtain

[p_λ(|β_j|)]' = p'_λ(|β_j|) β_j/|β_j| ≈ p'_λ(|β_{j0}|) β_j/|β_{j0}|,  for β_j ≈ β_{j0}.   (2.23)

Integrating with respect to β_j from β_{j0} to β_j, we get

p_λ(|β_j|) ≈ p_λ(|β_{j0}|) + ½ {p'_λ(|β_{j0}|)/|β_{j0}|}(β_j² − β_{j0}²).

For a general penalty function,

Q(β) = ½||Y − Xβ||² + n ∑_{j=1}^p p_λ(|β_j|)
     ≈ ½||Y − Xβ||² + n ∑_{j=1}^p {p_λ(|β_{j0}|) + ½ p'_λ(|β_{j0}|)/|β_{j0}| (β_j² − β_{j0}²)}
     = ½||Y − Xβ||² + (n/2) ∑_{j=1}^p p'_λ(|β_{j0}|)/|β_{j0}| β_j² + constant,

and the solution can be obtained by iteratively computing the ridge regression

β_1 = {X^T X + nΣ_λ(β_0)}^{−1} X^T Y,   (2.24)

where

Σ_λ(β_0) = diag{p'_λ(|β_{10}|)/|β_{10}|, ··· , p'_λ(|β_{d0}|)/|β_{d0}|},   (2.25)

and d is the number of nonzero coefficients in β_0. As described by Fan and Li (2001), with a good initial value β_0, the one-step procedure can be as efficient as the fully iterative procedure. The major drawback of LQA is that if a covariate is removed at any step of LQA, it will never enter the model again.
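The following is a minimal NumPy sketch of how the LQA iteration (2.24)-(2.25) could be coded for the SCAD penalty. It is an illustrative implementation under simplifying assumptions (OLS initial value, fixed number of iterations), not Fan and Li's code; the names `scad_deriv` and `lqa_scad` are hypothetical.

```python
import numpy as np

def scad_deriv(theta, lam, a=3.7):
    """SCAD first derivative (2.13), evaluated at theta >= 0."""
    theta = np.asarray(theta, dtype=float)
    return lam * np.where(theta <= lam, 1.0,
                          np.maximum(a * lam - theta, 0.0) / ((a - 1) * lam))

def lqa_scad(X, y, lam, n_iter=20, tol=1e-8):
    """Sketch of the LQA iteration (2.24)-(2.25) for SCAD-penalized least
    squares; coefficients that shrink below `tol` are frozen at zero, which
    is exactly the drawback of LQA noted in the text."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]     # OLS initial value
    active = np.ones(p, dtype=bool)
    for _ in range(n_iter):
        active &= np.abs(beta) > tol                # once removed, never returns
        beta[~active] = 0.0
        Xa = X[:, active]
        w = scad_deriv(np.abs(beta[active]), lam) / np.abs(beta[active])
        Sigma = np.diag(n * w)                      # n * Sigma_lambda(beta_0)
        beta[active] = np.linalg.solve(Xa.T @ Xa + Sigma, Xa.T @ y)
    return beta

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 8))
y = X @ np.array([3.0, 0, 0, 1.5, 0, 0, 2.0, 0]) + rng.standard_normal(100)
print(np.round(lqa_scad(X, y, lam=0.3), 2))
```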

Local Linear Approximation algorithm:

To overcome the weakness of LQA, Zou and Li (2008) developed the local linear approximation (LLA) algorithm to solve the minimization problem with the non-convex penalty functions. Based on the local linear approximation to the penalty function

p_λ(|β_j|) ≈ p_λ(|β_j^0|) + p'_λ(|β_j^0|)(|β_j| − |β_j^0|),  for β_j ≈ β_j^0,   (2.26)

the objective function in equation (2.9) becomes

Q(β) = ½||Y − Xβ||² + n ∑_{j=1}^p p_λ(|β_j|)
     ≈ ½||Y − Xβ||² + n ∑_{j=1}^p {p_λ(|β_j^0|) + p'_λ(|β_j^0|)(|β_j| − |β_j^0|)}
     = ½||Y − Xβ||² + n ∑_{j=1}^p p'_λ(|β_j^0|)|β_j| + constant.

LLA results in a reweighted LASSO objective function and thus it shares the good features of LASSO in terms of computational efficiency. Zou and Li (2008) also demonstrated that estimates obtained from LLA enjoy the oracle property with good initial estimators if the regularization parameter is properly chosen.
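The reweighted-LASSO form above suggests the following one-step LLA sketch, solved here with a plain proximal-gradient (ISTA) loop. This is an assumption-laden illustration (hypothetical names, OLS initial value, a generic solver), not Zou and Li's implementation.

```python
import numpy as np

def scad_deriv(theta, lam, a=3.7):
    # SCAD first derivative (2.13) at theta >= 0
    theta = np.asarray(theta, dtype=float)
    return lam * np.where(theta <= lam, 1.0,
                          np.maximum(a * lam - theta, 0.0) / ((a - 1) * lam))

def one_step_lla(X, y, lam, n_iter=500):
    """One-step LLA sketch: starting from OLS, minimize the reweighted LASSO
    objective 0.5||y - Xb||^2 + n * sum_j w_j |b_j| with w_j = p'_lam(|b_j^0|),
    using plain proximal gradient (ISTA). Illustrative only."""
    n, p = X.shape
    beta0 = np.linalg.lstsq(X, y, rcond=None)[0]
    w = n * scad_deriv(np.abs(beta0), lam)          # per-coefficient L1 weights
    step = 1.0 / np.linalg.norm(X, 2) ** 2          # 1 / Lipschitz constant
    beta = beta0.copy()
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)
        z = beta - step * grad
        beta = np.sign(z) * np.maximum(np.abs(z) - step * w, 0.0)  # soft threshold
    return beta

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 8))
y = X @ np.array([3.0, 0, 0, 1.5, 0, 0, 2.0, 0]) + rng.standard_normal(100)
print(np.round(one_step_lla(X, y, lam=0.3), 2))
```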

The LQA and LLA approximations for the L_{0.5} and SCAD penalties are illustrated in Figure 2.3.

Figure 2.3: Comparison between local quadratic approximation and local linear approximation. The dotted lines are from the local quadratic approximation, the long dashed lines are from the local linear approximation, and the solid lines are the penalty functions. Panels: (a) L_{0.5} penalty with λ = 2 and β = 4; (b) L_{0.5} penalty with λ = 2 and β = 1; (c) SCAD penalty with λ = 2 and β = 4; (d) SCAD penalty with λ = 2 and β = 1.

2.2.3 PC-simple Algorithm

Different from the penalty-based approaches, the PC-simple algorithm, proposed by Buhlmann et al. (2010), ingeniously selects the significant predictors by exploiting the partial correlations between the response and each predictor conditioning on some other covariates. Based on the positive definiteness of the covariance matrix of the predictors, partial faithfulness, multivariate normality and some other constraints on the partial correlations, Buhlmann et al. (2010) demonstrated that the sample version of the PC-simple algorithm can identify the significant variables consistently. Also, simulation studies showed that the PC-simple algorithm is competitively comparable to penalty-based approaches. This new approach raises the confidence for selection of the important variables.

Consider the linear model (2.1)

y = x^T β + ε = ∑_{j=1}^p x_j β_j + ε,   (2.27)

where y is the response variable, x = (x_1, ··· , x_p)^T is the p × 1 covariate vector, β is a vector of unknown regression coefficients, and ε is the random error with E(ε|x) = 0 and Var(ε|x) = σ².

Denote E(x) = μ_x and Cov(x) = Σ_x. Assume x is uncorrelated with ε, and that the response and covariates have finite second moments; that is, E(y²) < ∞ and E(x_j²) < ∞ for j = 1, ··· , p.

The main goal is to identify the active set A = {j : β_j ≠ 0} ⊂ {1, ··· , p} based on a random sample (x_1, y_1), ··· , (x_n, y_n). Denote the number of nonzero β_j's by peff = |A|.

1. Preliminary Theories

Buhlmann et al. (2010) introduced the new concept of partial faithfulness, and claimed that partial faithfulness arises naturally in the context of linear models under the assumptions stated in Theorem 2.4.

Definition 2.3. (Partial Faithfulness) Let x = (x_1, ··· , x_p)^T ∈ R^p be a random vector and y ∈ R be a random variable. The distribution of (x, y) is said to be partially faithful when the following condition holds for every j ∈ {1, ··· , p}:

if ρ(y, x_j | x_S) = 0 for some S ⊆ {j}^c, then ρ(y, x_j | x_{{j}^c}) = 0,   (2.28)

where x_S = {x_k : k ∈ S}.

We define the following additional conditions:

(C1) Σ_x is strictly positive definite.

(C2) {β_j : j ∈ A} ∼ f(b)db, where f(·) denotes the density, on some subset of R^{peff}, of an absolutely continuous distribution with respect to Lebesgue measure.

Theorem 2.4. Consider the linear model (2.1) satisfying conditions (C1) and (C2), then partial faithfulness holds almost surely with respect to the distribution generating the non-zero regression coefficients.

Thus, under the assumptions of a positive definite Σ_x and partial faithfulness, all partial correlations between the response and a predictor will be nonzero if and only if the true coefficient for that predictor is nonzero; that is, for every j ∈ {1, ··· , p},

ρ(y, x_j | x_S) ≠ 0 for all S ⊆ {j}^c if and only if β_j ≠ 0;   (2.29)

equivalently,

ρ(y, x_j | x_S) = 0 for some S ⊆ {j}^c if and only if β_j = 0.   (2.30)

Equations (2.29) and (2.30) are the crucial foundation for the PC-simple algorithm.

2. PC-simple Algorithm

In this section, we introduce the PC-simple algorithm in greater detail. Based on equations (2.29) and (2.30), Buhlmann et al. (2010) first developed the population version of the PC-simple algorithm in the following steps, according to the size of the conditioning set.

Step 1: Let S = ∅ in (2.30); then ρ(y, x_j) = 0 would imply β_j = 0. Equivalently, screen all the marginal correlations between the pairs (y, x_j), j = 1, ··· , p, and build up the first set of candidate active variables:

A^{[1]} = {j = 1, ··· , p : ρ(y, x_j) ≠ 0} ⊇ A.

Step 2: For each j ∈ A^{[1]}, according to (2.30) again, let β_j = 0 if there exists some k ∈ A^{[1]}\{j} such that ρ(y, x_j | x_k) = 0; that is, screen all partial correlations of order one and establish the second set of candidate active variables:

A^{[2]} = {j ∈ A^{[1]} : ρ(y, x_j | x_k) ≠ 0 for all k ∈ A^{[1]}\{j}} ⊆ A^{[1]}.

Step 3: For each j ∈ A^{[m−1]}, calculate all partial correlations of order m − 1 by the recursion

ρ(y, x_j | x_S) = [ρ(y, x_j | x_{S\{k}}) − ρ(y, x_k | x_{S\{k}}) ρ(x_j, x_k | x_{S\{k}})] / [{1 − ρ²(y, x_k | x_{S\{k}})}{1 − ρ²(x_j, x_k | x_{S\{k}})}]^{1/2},

for some k ∈ S ⊆ A^{[m−1]}\{j} with |S| = m − 1, and apply (2.30) to get A^{[m]}, with A ⊆ A^{[m]} ⊆ · · · ⊆ A^{[1]}.

Step 4: Repeat Step 3 until m = m_reach, where m_reach = min{m : |A^{[m]}| ≤ m}. The above steps lead to the pseudo-code given in Algorithm 1.

Algorithm 1 (m, A^{[m]}) = Population Version of the PC-simple(x, y)

1. Set m = 1; do correlation screening and build the step-1 active set A^{[1]} = {j = 1, ··· , p : cor(y, x_j) ≠ 0}.

2. Set m = m + 1, and construct the step-m active set A^{[m]} = {j ∈ A^{[m−1]} : cor(y, x_j | x_S) ≠ 0, ∀ S ⊆ A^{[m−1]}\{j}, |S| = m − 1}.

3. Repeat Step 2 until |A^{[m]}| ≤ m or A^{[m]} stops changing.
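The recursion in Step 3 can be coded directly. The sketch below (NumPy only, with the hypothetical name `partial_corr`) computes ρ(y, x_j | x_S) from a correlation matrix by recursively peeling off one conditioning variable at a time; it works equally for population or sample correlation matrices.

```python
import math
import numpy as np

def partial_corr(R, i, j, S):
    """Recursively compute rho(i, j | S) from a correlation matrix R using
    the Step 3 recursion; S is a list of conditioning indices. Illustrative
    sketch, not optimized for repeated calls."""
    if len(S) == 0:
        return R[i, j]
    k, rest = S[0], S[1:]
    r_ij = partial_corr(R, i, j, rest)
    r_ik = partial_corr(R, i, k, rest)
    r_jk = partial_corr(R, j, k, rest)
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))

# Toy usage: column 0 plays the role of y, columns 1.. are the covariates
rng = np.random.default_rng(3)
Z = rng.standard_normal((500, 4))
Z[:, 0] = Z[:, 1] - Z[:, 2] + 0.5 * rng.standard_normal(500)
R = np.corrcoef(Z, rowvar=False)
print(partial_corr(R, 0, 1, [2, 3]))   # rho(y, x1 | x2, x3)
```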

Buhlmann et al. (2010) showed the correctness of the population version of the PC-simple algorithm, which is stated in Theorem 2.5.

Theorem 2.5. For the linear model (2.1) satisfying conditions (C1) and partial faithfulness, the population version identifies the true underlying active set; that is,

A^{[m_reach]} = A = {j = 1, ··· , p : β_j ≠ 0}.   (2.31)

However, in reality, the population correlations are never known. Therefore, Buhlmann et al. (2010) suggested replacing the population correlations with sample correlations. For testing H_0 : ρ(y, x_j | x_S) = 0 against the two-sided alternative H_A : ρ(y, x_j | x_S) ≠ 0, they applied Fisher's Z-transform,

Z(y, x_j | x_S) = ½ log{(1 + ρ̂(y, x_j | x_S)) / (1 − ρ̂(y, x_j | x_S))}.   (2.32)

The null hypothesis H0 is rejected if

(n − |S| − 3)^{1/2} |Z(y, x_j | x_S)| > Φ^{−1}(1 − α/2),   (2.33)

where α is the significance level and Φ is the cumulative distribution function of the standard normal distribution. The sample version of the PC-simple algorithm can be established in a similar manner as the population version. The pseudo-code of the sample version is given in Algorithm 2.

Algorithm 2 (m, Â^{[m]}) = Sample Version of the PC-simple({(x_1, y_1), ··· , (x_n, y_n)})

1. Set m = 1; do correlation screening and build the step-1 active set Â^{[1]} = {j = 1, ··· , p : |ρ̂(y, x_j)| > threshold(α) determined by (2.32) and (2.33)}.

2. Set m = m + 1, and construct the step-m active set Â^{[m]} = {j ∈ Â^{[m−1]} : |ρ̂(y, x_j | x_S)| > threshold(α) determined by (2.32) and (2.33) for all S ⊆ Â^{[m−1]}\{j} with |S| = m − 1}.

3. Repeat Step 2 until |Â^{[m]}| ≤ m or Â^{[m]} stops changing.
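To illustrate Algorithm 2, the following sketch combines the Fisher Z-test of (2.32)-(2.33) with the iterative screening. It is a simplified illustration, not Buhlmann et al.'s implementation: all names are hypothetical, and the sample partial correlations are computed from the inverse correlation matrix of the relevant columns rather than by the recursion above.

```python
import numpy as np
from itertools import combinations
from scipy.stats import norm

def sample_partial_corr(data, j, S):
    """Sample partial correlation rho_hat(y, x_j | x_S); column 0 of `data`
    is y and columns 1..p hold the covariates."""
    cols = [0, j] + list(S)
    P = np.linalg.inv(np.corrcoef(data[:, cols], rowvar=False))
    return -P[0, 1] / np.sqrt(P[0, 0] * P[1, 1])

def pc_simple(data, alpha=0.05):
    """Sketch of Algorithm 2: threshold Fisher's Z-transform under the
    normality-based null distribution. Illustrative only."""
    n, d = data.shape
    zcrit = norm.ppf(1 - alpha / 2)
    active = list(range(1, d))                      # candidate covariates
    m = 0
    while len(active) > m:
        m += 1
        keep = []
        for j in active:
            drop = False
            for S in combinations([k for k in active if k != j], m - 1):
                r = sample_partial_corr(data, j, S)
                z = 0.5 * np.log((1 + r) / (1 - r))
                if np.sqrt(n - len(S) - 3) * abs(z) <= zcrit:   # cannot reject rho = 0
                    drop = True
                    break
            if not drop:
                keep.append(j)
        if keep == active:                          # active set stops changing
            break
        active = keep
    return active

rng = np.random.default_rng(4)
X = rng.standard_normal((300, 10))
y = 2 * X[:, 0] - 1.5 * X[:, 4] + rng.standard_normal(300)
data = np.column_stack([y, X])
print(pc_simple(data))    # should typically recover columns 1 and 5
```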

3. Asymptotic Results

Buhlmann et al. (2010) demonstrated that, under some constraints on the covariance matrix and the population partial correlations, the sample version of the PC-simple algorithm can consistently select the significant variables in high-dimensional linear models. The following five assumptions are needed to establish the asymptotic theory for the sample version of the PC-simple algorithm.

(D1) The distribution of (x, y) is multivariate normal and satisfies (C1) and the partial faithfulness condition for all n.

(D2) p_n = O(n^a) for some 0 ≤ a < ∞.

(D3) peff_n = |A_n| = O(n^{1−b}) for some 0 < b ≤ 1.

(D4) inf{|ρ_n(y, x_j | x_S)| : j = 1, ··· , p_n, S ⊆ {j}^c, |S| ≤ peff_n, ρ_n(y, x_j | x_S) ≠ 0} ≥ c_n, where c_n^{−1} = O(n^d) for some 0 ≤ d < b/2 and b is as in (D3).

(D5)

sup{|ρ_n(y, x_j | x_S)| : S ⊆ {j}^c, |S| ≤ peff_n} ≤ M < 1,   (2.34)

sup{|ρ_n(x_i, x_j | x_S)| : i ≠ j, S ⊆ {i, j}^c, |S| ≤ peff_n} ≤ M < 1.   (2.35)

Under the above assumptions, Buhlmann et al. (2010) established the asymptotic theory for the sample version of the PC-simple algorithm, stated in Theorem 2.6.

Theorem 2.6. Consider the linear model (2.1) and assume (D1)-(D5). Then there exists a sequence αn → 0 as n → ∞, and a constant C > 0 such that the PC-simple algorithm satisfies

P(Â_n(α_n) = A_n) = 1 − O{exp(−C n^{1−2d})} → 1,   (2.36)

as n → ∞, where d is as in (D4).

The PC-simple algorithm (Buhlmann et al., 2010) is another variable selection technique for high-dimensional linear models. It screens both the marginal and partial correlations. In that sense, it is a more sophisticated study of the relationship between the response and covariates than the SIS procedure proposed by Fan and Lv (2008), which screens the variables only by their marginal correlations. The PC-simple algorithm is computationally efficient, as many variables are deleted in each step; therefore it is computationally feasible even when dealing with thousands of covariates. Also, its asymptotic theory is well established under certain constraints. Simulation studies and a real data example showed that the PC-simple algorithm can identify the significant predictors in high-dimensional linear models. Thus, besides the penalty-based approaches, the PC-simple algorithm is another reliable variable selection technique.

2.3 Partially Linear Models

2.3.1 Model

Consider the following partially linear model,

y = α(u) + x^T β + ε,   (2.37)

where y is the response variable, α(u) is an unspecified baseline function of u, x is the p × 1 covariate vector, β is a vector of unknown regression coefficients and ε is the random error. In this model, the mean response is linearly dependent on x, while its relation with u is left unspecified.

Specifically, suppose that (u_1, x_1^T, y_1), ··· , (u_n, x_n^T, y_n) is an independent and identically distributed sample from model (2.37). Let Y = (y_1, ··· , y_n)^T be the n-vector of responses, U = (u_1, ··· , u_n), g(U) = (α(u_1), ··· , α(u_n))^T, X = (x_1, ··· , x_n)^T be the n × p covariate matrix, and ε = (ε_1, ··· , ε_n)^T be the vector of random errors. It follows that

Y = g(U) + Xβ + ε.   (2.38)

Much work has been done on estimating β and α(·). Wahba (1985) and Heckman (1986) studied the partial spline estimator, Speckman (1988) suggested the partial residual estimator, and Zeger and Diggle (1994) proposed an iterative algorithm to estimate β and g by the backfitting method.

2.3.2 Partially Linear Models for Longitudinal Data

Partially linear models are also useful for modeling longitudinal data. Longitudinal data come from studies involving repeated observations of the same variables over long periods of time. The observations are no longer independent, and that imposes new challenges in building the models. In this section, we focus on modeling longitudinal data using partially linear models. We mainly review two estimation approaches for the regression coefficients proposed by Fan and Li (2004): the difference-based estimator and the profile least squares estimator. In addition, the asymptotic normality of the resulting estimator from the profile least squares approach is established. To be precise, based on the observations

{(u_ij, x^T(u_ij), y(u_ij)), j = 1, ··· , J_i, i = 1, ··· , n},   (2.39)

the model can be expressed as

y_i(u_ij) = α(u_ij) + x_i^T(u_ij)β + ε_ij,  j = 1, ··· , J_i, i = 1, ··· , n,   (2.40)

where y_i(u_ij) and x_i(u_ij) are the response variable and the covariate vector for the ith individual collected at time u_ij, and J_i is the total number of observations on the ith subject. The random errors are correlated. Let

y_i = (y_i(u_i1), ··· , y_i(u_iJ_i))^T,

X_i = (x_i(u_i1), ··· , x_i(u_iJ_i))^T, and

α_i = (α(u_i1), ··· , α(u_iJ_i))^T.

A weighted least squares estimate is obtained by minimizing the weighted least squares function, defined as

Q(α(·), β) = ½ ∑_{i=1}^n (y_i − α_i − X_i β)^T W_i^{−1} (y_i − α_i − X_i β),   (2.41)

where W_i is a J_i × J_i working covariance matrix.

1. Difference Based Estimator

Fan and Li (2004) proposed the difference-based estimator (DBE). Dropping the subscript j of the observations in (2.39), the observed data can be rewritten as

{(u_i, x_i^T, y_i), i = 1, ··· , n*} with n* = ∑_{i=1}^n J_i,

ordered according to {uij}. Thus, model (2.40) can be written as

y_i = α(u_i) + x_i^T β + ε_i, with E(ε_i | x_i) = 0, i = 1, ··· , n*.   (2.42)

Note that

y_{i+1} − y_i = α(u_{i+1}) − α(u_i) + (x_{i+1} − x_i)^T β + e_i,  i = 1, ··· , n* − 1,   (2.43)

where e_i = ε_{i+1} − ε_i.

Under some mild conditions, the distance between u_i and u_{i+1} is of order O(1/n); then the term α(u_{i+1}) − α(u_i) can be ignored, and the regression coefficients β can be estimated via the least squares approach. If the spacing is large, then the term α_0 + α_1(u_{i+1} − u_i) is included in the model to attenuate the model bias.

Applying the least squares approach to the following linear model

y_{i+1} − y_i = α(u_{i+1}) − α(u_i) + (x_{i+1} − x_i)^T β + e_i
            ≈ α_0 + α_1(u_{i+1} − u_i) + (x_{i+1} − x_i)^T β + e_i,  i = 1, ··· , n* − 1,

the estimate of β can be improved. Both ways yield a quick and reliable initial value of β. One usage of this quick estimate of β is for the bandwidth selection in the profile least squares approach, which will be discussed later. After obtaining β̂, we can apply nonparametric smoothing techniques on

{(u_i, y_i − x_i^T β̂), i = 1, ··· , n*}   (2.44)

to get an estimate of α(·). The advantage of DBE is that the estimation of β does not depend on the smoothing techniques, while the disadvantage is that the within-subject correlation is ignored, causing inefficiency.
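A minimal sketch of the difference-based estimator follows: sort by u, difference adjacent observations as in (2.43), regress, and then smooth the partial residuals (2.44). A Nadaraya-Watson smoother is used here purely as a simple stand-in for the local fit discussed later, and all names and the toy data are hypothetical.

```python
import numpy as np

def dbe(u, X, y):
    """DBE sketch for y_i = alpha(u_i) + x_i' beta + e_i: sort by u,
    difference adjacent observations, and regress the differenced responses
    on (1, u_{i+1}-u_i, x_{i+1}-x_i) by least squares."""
    order = np.argsort(u)
    u, X, y = u[order], X[order], y[order]
    dU, dX, dY = np.diff(u), np.diff(X, axis=0), np.diff(y)
    Z = np.column_stack([np.ones(len(dU)), dU, dX])   # alpha0 + alpha1*du + dx'beta
    coef, *_ = np.linalg.lstsq(Z, dY, rcond=None)
    return coef[2:]                                   # beta_hat

def kernel_smooth(u, r, grid, h):
    """Nadaraya-Watson smoother of the partial residuals r = y - x' beta_hat,
    a simple stand-in for the nonparametric smoothing step."""
    K = np.exp(-0.5 * ((grid[:, None] - u[None, :]) / h) ** 2)
    return (K @ r) / K.sum(axis=1)

rng = np.random.default_rng(5)
n = 400
u = rng.uniform(0, 1, n)
X = rng.standard_normal((n, 3))
y = np.sin(2 * np.pi * u) + X @ np.array([1.0, -2.0, 0.0]) + 0.3 * rng.standard_normal(n)
beta_hat = dbe(u, X, y)
alpha_hat = kernel_smooth(u, y - X @ beta_hat, np.linspace(0, 1, 5), h=0.1)
print(np.round(beta_hat, 2), np.round(alpha_hat, 2))
```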

2. Profile Least Squares Approach

Motivated by the principle of profile likelihood, Fan and Li (2004) proposed the profile least squares approach. Although the form of the resulting estimate of β is similar to that of Speckman (1988) for partially linear models with independent observations, the motivations behind the two approaches are different. For a given β, let y*(u) = y(u) − x^T(u)β; then model (2.37) can be written as

y*(u) = α(u) + ε(u).   (2.45)

Using the local linear regression technique, for u in a neighborhood of u_0, it follows by the Taylor expansion that

α(u) ≈ α(u_0) + α'(u_0)(u − u_0) ≡ a_0 + a_1(u − u_0).   (2.46)

Let K(·) be a kernel function, and h be a bandwidth, then Fan and Li (2004) suggested finding (ˆa0, aˆ1) which minimizes

Σ_{i=1}^{n} Σ_{j=1}^{J_i} {y_i*(u_{ij}) − a_0 − a_1(u_{ij} − u_0)}^2 w(u_{ij}) K_h(u_{ij} − u_0),   (2.47)

where K_h(·) = h^{-1} K(·/h) and the w(u_{ij}) are weight functions.

Let y = (y_1^T, ···, y_n^T)^T, X = (X_1^T, ···, X_n^T)^T and α = (α_1^T, ···, α_n^T)^T; then model (2.42) can be written as

y = α + Xβ + ε,   (2.48)

where ε is the vector of random errors.

According to Fan (1992), the local linear fit is linear in y_i*(u_{ij}); equivalently,

α̂_β = S(y − Xβ),   (2.49)

where the matrix S is called the smoothing matrix; it depends only on the observation times {u_{ij}, i = 1, ···, n, j = 1, ···, J_i} and the degree of smoothing.

Substituting α̂_β into model (2.48), we can obtain

(I − S)y = (I − S)Xβ + ε.   (2.50)

Applying weighted least squares, we get

β̂ = {X^T(I − S)^T W(I − S)X}^{-1} X^T(I − S)^T W(I − S)y,   (2.51)

where W is a diagonal matrix with diagonal elements w(u_{ij}). The resulting estimator of β is called the profile least squares estimator. Plugging β̂ into (2.49), the profile least squares estimator for the nonparametric part is simply α̂_β̂. Following that, we can easily get an estimate of the covariance matrix of β̂,

cov{β̂ | u_{ij}, x_i(u_{ij})} = D^{-1} V D^{-1},   (2.52)

where D = X^T(I − S)^T W(I − S)X and V = cov{X^T(I − S)^T W ε}. V can be estimated by

V̂ = X^T(I − S)^T W C W^T(I − S)X,   (2.53)

where C = diag{ε̂_1 ε̂_1^T, ···, ε̂_n ε̂_n^T} and ε̂_i is the residual vector for the ith subject.
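As a rough illustration of (2.49)-(2.51), the sketch below builds a local linear smoother matrix S with a Gaussian kernel and then computes the profile least squares estimate, treating the pooled data with a user-supplied weight matrix W (working independence by default). The function names local_linear_S and profile_ls are ours and only meant to illustrate the algebra.

# Local linear smoother matrix S for observation times u and bandwidth h
# (Gaussian kernel); row k of S gives the weights of the fit at u[k].
local_linear_S <- function(u, h, w = rep(1, length(u))) {
  n <- length(u)
  S <- matrix(0, n, n)
  for (k in 1:n) {
    z  <- cbind(1, u - u[k])                    # local linear design at u[k]
    kw <- w * dnorm((u - u[k]) / h) / h         # w(u) K_h(u - u[k])
    S[k, ] <- (solve(crossprod(z, kw * z)) %*% t(kw * z))[1, ]
  }
  S
}

# Profile least squares estimate (2.51) and the fitted nonparametric part (2.49).
profile_ls <- function(u, X, y, h, W = diag(length(y))) {
  S  <- local_linear_S(u, h)
  IS <- diag(length(y)) - S
  D  <- t(X) %*% t(IS) %*% W %*% IS %*% X
  beta_hat  <- solve(D, t(X) %*% t(IS) %*% W %*% IS %*% y)
  alpha_hat <- S %*% (y - X %*% beta_hat)
  # A sandwich covariance estimate as in (2.52)-(2.53) could be formed from
  # IS, W and the subject-wise residuals.
  list(beta = drop(beta_hat), alpha = drop(alpha_hat), S = S, IS = IS)
}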

3. Bandwidth Selection
The efficiency of estimating β can be affected by the choice of bandwidth. Fan and Li (2004) suggested the following procedure for choosing the bandwidth. Step 1: Use the DBE to obtain an estimate β̂_DBE. Step 2: Plug β̂_DBE into model (2.40); this leaves a univariate nonparametric regression problem. Step 3: For this univariate problem, choose an appropriate bandwidth ĥ_n through data-driven procedures, such as cross-validation. ĥ_n can then be used for the profile least squares estimate. Usually, according to nonparametric theory, the optimal choice of bandwidth is of order n^{-1/5}.

2.3.3 Asymptotic Results for Profile Least Squares Estimators

Fan and Li (2004) established the asymptotic normality of the profile least squares estimator, which is stated in Theorem 2.7.

Let α0(·) and β0 be the true parameters. Let

Σ̂_n = n^{-1} Σ_{i=1}^{n} ∫_0^∞ {x_i(u) − Ex_i(u)}^{⊗2} w(u) dN_i(u),   (2.54)

ξ̂_n = n^{-1} Σ_{i=1}^{n} ∫_0^∞ {x_i(u) − Ex_i(u)} ε_i(u) w(u) dN_i(u),   (2.55)

A = E ∫_0^∞ {x(u) − Ex(u)}^{⊗2} w(u) dN(u), and   (2.56)

B = E [∫_0^∞ {x(u) − Ex(u)} ε(u) w(u) dN(u)]^{⊗2},   (2.57)

where a^{⊗2} = a a^T. The asymptotic theory is stated as follows.

Theorem 2.7. Suppose that w(·) is continuous, the J_i's are bounded, and the matrices A and B exist. If A is finite and positive definite and h_n = b n^{-a} for 1/8 < a < 1/2, then as n → ∞,

√n (β̂ − β_0) = √n Σ̂_n^{-1} ξ̂_n + o_p(1) →_D N(0, A^{-1} B A^{-1}),   (2.58)

where n is the number of subjects.

2.4 Variable Selection in Partially Linear Models

Often, the number of potential explanatory variables, p, is large, but only a subset of them has a strong effect on the response. Thus variable selection is needed to improve prediction accuracy and model interpretation. There is a vast amount of work on variable selection for linear models, such as stepwise selection, best subset selection, and shrinkage methods like the nonnegative garrote (Breiman, 1996), the least absolute shrinkage and selection operator (LASSO, Tibshirani, 1996), the smoothly clipped absolute deviation penalty (SCAD, Fan and Li, 2001), least angle regression (LARS, Efron et al., 2004), and the adaptive LASSO (Zou, 2006). Some classic variable selection procedures can be extended to partially linear models, but they pose challenges for implementation. The computation is demanding, as we need to estimate the nonparametric part for each sub-model. These classic variable selection procedures also have other drawbacks; for example, best subset selection suffers from a lack of stability, as noted by Breiman (1996). Fan and Li (2004) were among the first to extend the shrinkage selection idea to partially linear models. By extending the profile technique, Fan and Li (2004) proposed the penalized profile least squares approach to select significant variables in the parametric part. They showed that with a suitable choice of the penalty function and regularization parameters, the penalized profile least squares estimate performs as well as an oracle estimator.

2.4.1 Penalized Profile Least Squares

In this section, we introduce the penalized profile least squares approach for variable selection in partially linear models advocated by Fan and Li (2004). Let Q̃(β) be the weighted least squares function that we want to minimize; then a penalized least squares takes the form

Q*(β) = Q̃(β) + n Σ_{j=1}^{p} p_{λ_j}(|β_j|),   (2.59)

where p is the number of covariates, the p_{λ_j}(·)'s are penalty functions, and the λ_j's are tuning parameters. The λ_j's can be chosen by data-driven methods, such as cross-validation or generalized cross-validation. To be precise, for the partially linear model (2.40), the objective is to minimize the penalized quadratic loss

Q(α(·), β) = (1/2) Σ_{i=1}^{n} (y_i − α_i − X_i β)^T W_i (y_i − α_i − X_i β) + n Σ_{j=1}^{p} p_{λ_j}(|β_j|).   (2.60)

After eliminating the nuisance function α(·), that is, plugging in α̂_β = S(y − Xβ), the penalized quadratic loss becomes

Q(β) = (1/2) (y − Xβ)^T (I − S)^T W (I − S)(y − Xβ) + n Σ_{j=1}^{d} p_{λ_j}(|β_j|).   (2.61)

Usually, the penalty functions p_{λ_j}(·) and regularization parameters λ_j do not need to be the same for all j. The resulting estimator is called the penalized weighted least squares estimate or the penalized profile least squares estimate. Next, the penalty functions (such as the LASSO and SCAD penalties) and the algorithms (such as LQA and LLA) mentioned in Section 2 can be applied to solve the above minimization problem.

2.4.2 Asymptotic Results of Penalized Profile Approach

Fan and Li (2004) established the convergence rate and the oracle property for the penalized profile least squares estimator. Let

a_n = max_j {p'_{λ_{jn}}(|β_{j0}|) : β_{j0} ≠ 0}, and   (2.62)

b_n = max_j {p''_{λ_{jn}}(|β_{j0}|) : β_{j0} ≠ 0},   (2.63)

where β_0 is the true value.

Theorem 2.8. Under the conditions of Theorem 2.7, if both a_n and b_n tend to 0 as n → ∞, then with probability tending to 1, there exists a local minimizer β̂ of Q*(β) in (2.59) or Q(β) in (2.61), such that ‖β̂ − β_0‖ = O_P(n^{-1/2} + a_n).

To achieve root-n consistency, we need to take λ_j small enough so that a_n = O_P(n^{-1/2}). Fan and Li (2004) showed that with a proper choice of the regularization parameters and penalty functions, the proposed penalized profile least squares variable selection procedure performs as well as an oracle estimator, as stated in Theorem 2.9.

Suppose that the first s components of β_0 are not equal to 0 and all other components are 0. Denote

Σ = diag{p''_{λ_{1n}}(|β_{10}|), ···, p''_{λ_{sn}}(|β_{s0}|)}, and   (2.64)

b = (p'_{λ_{1n}}(|β_{10}|) sgn(β_{10}), ···, p'_{λ_{sn}}(|β_{s0}|) sgn(β_{s0})).   (2.65)

Let β̂_1 consist of the first s components of β̂ and β̂_2 consist of the last p − s components of β̂.

Theorem 2.9. Assume that for j = 1, ···, d, λ_{jn} → 0, √n λ_{jn} → ∞, and the penalty function p_{λ_{jn}}(|β_j|) satisfies

lim inf_{n→∞} lim inf_{β_j→0+} p'_{λ_{jn}}(β_j)/λ_{jn} > 0.   (2.66)

If a_n = O_P(n^{-1/2}), then under the conditions of Theorem 2.8, with probability tending to 1, the root-n consistent local minimizer β̂ = (β̂_1^T, β̂_2^T)^T in Theorem 2.8 must satisfy the following:

(a) Sparsity: β̂_2 = 0.

(b) Asymptotic normality:

√n (A_{11} + Σ)[β̂_1 − β_{10} + (A_{11} + Σ)^{-1} b] →_D N_s(0, B_{11}),   (2.67)

where A_{11} and B_{11} consist of the first s rows and columns of A and B in Theorem 2.7.

If all the penalty functions are SCAD penalties, let λ_{jn} → 0 and √n λ_{jn} → ∞ for j = 1, ···, p; then a_n = 0 when n is sufficiently large. Under condition (2.66), the penalized profile least squares estimate with the SCAD penalty thus possesses the oracle property. For the LASSO penalty, however, a_n = max_j λ_{jn}, so the requirement a_n = O_P(n^{-1/2}) and the condition √n λ_{jn} → ∞ cannot hold simultaneously. Thus, the estimate from the LASSO procedure does not have the oracle property.

2.4.3 Iterated Ridge Regression

We locally approximate the nonconvex penalty function in (2.60) by quadratic functions (the LQA algorithm, Fan and Li, 2001). Given an initial value β^(0), when |β_j^(0)| < η, β̂_j is set to 0 and the jth variable is excluded from the model. Otherwise, the penalty function can be approximated by the quadratic function

p_λ(|β_j|) ≈ p_λ(|β_j^(0)|) + (1/2) {p'_λ(|β_j^(0)|)/|β_j^(0)|} (β_j^2 − (β_j^(0))^2).

After that, the Newton-Raphson algorithm can be implemented directly to update the solution to the penalized least squares by

β̂^(1) = {X^T(I − S)^T W(I − S)X + n Σ_λ(β^(0))}^{-1} X^T(I − S)^T W(I − S)y,   (2.68)

where

Σ_λ(β^(0)) = diag{p'_{λ_1}(|β_1^(0)|)/|β_1^(0)|, ···, p'_{λ_d}(|β_d^(0)|)/|β_d^(0)|}.   (2.69)
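A minimal R sketch of one iterated ridge update (2.68)-(2.69) is given below for the SCAD penalty, reusing the (I − S) and W matrices from the profile step; the helper names and the thresholding constant eta are illustrative assumptions on our part, not part of Fan and Li (2004).

# SCAD first derivative p'_lambda(theta), with a = 3.7 (Fan and Li, 2001).
scad_deriv <- function(theta, lambda, a = 3.7) {
  theta <- abs(theta)
  lambda * (theta <= lambda) +
    pmax(a * lambda - theta, 0) / (a - 1) * (theta > lambda)
}

# One LQA (iterated ridge) update of beta; beta0 could be, e.g., the
# unpenalized profile least squares estimate.
lqa_update <- function(X, y, IS, W, beta0, lambda, eta = 1e-4) {
  keep <- abs(beta0) >= eta                     # coefficients below eta are set to 0
  Xk   <- IS %*% X[, keep, drop = FALSE]        # (I - S) X restricted to kept columns
  yk   <- IS %*% y
  Sigma_lambda <- diag(scad_deriv(beta0[keep], lambda) / abs(beta0[keep]),
                       nrow = sum(keep))        # equation (2.69)
  beta_new <- numeric(length(beta0))
  beta_new[keep] <- solve(t(Xk) %*% W %*% Xk + length(y) * Sigma_lambda,
                          t(Xk) %*% W %*% yk)   # equation (2.68)
  beta_new
}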

In practice, the unpenalized least squares estimate can be used as the initial value.

Chapter 3

Thresholded Partial Correlation Approach for Variable Selection in Linear Models

3.1 Introduction

Attention to variable selection in high-dimensional linear models is increasing. A large number of predictors may be included in the model to reduce the model bias. However, including more predictors in the model might lead to worse prediction and more difficult interpretation. Thus, researchers are motivated to develop techniques to determine a smaller subset of predictors that exhibits the strongest effect. Standard procedures are subset selection and stepwise deletion, but both have drawbacks. Statisticians have proposed other variable selection methodologies. Common approaches are regularized methods, including bridge regression (Frank and Friedman, 1993), the least absolute shrinkage and selection operator (LASSO, Tibshirani, 1996), penalized least squares with the smoothly clipped absolute deviation penalty (SCAD, Fan and Li, 2001), the group LASSO (Yuan and Lin, 2006), the adaptive LASSO

(Zou, 2006), and the Dantzig selector (Candes and Tao, 2007). Buhlmann et al. (2010) proposed utilizing the PC-simple algorithm (see Section 2.2.3), based on the partial correlations between the response and predictors, to identify important variables. Under some regularity conditions on the design matrix and the partial correlations, Buhlmann et al. (2010) established the asymptotic consistency of the estimated active set. The conditions imposed in Buhlmann et al. (2010) are different from the assumptions for penalized approaches to ensure a consistent variable selection procedure. Through simulations and a real data example, they demonstrated that the PC-simple algorithm is competitively comparable to penalty-based variable selection approaches. Therefore, we have two comparable approaches for variable selection in high-dimensional linear models, and that improves our confidence in variable selection. However, Buhlmann et al. (2010) established the asymptotic theories of the PC-simple algorithm only when the response and covariates are jointly normally distributed. We want to assess its effectiveness with non-normal samples. Thus, we developed the partial correlation based variable selection approach under a more general statistical setting, and we call it the thresholded partial correlation approach (TPC). We showed that under certain conditions, TPC can identify the true active set asymptotically. Extensive simulation studies on elliptical distributions, such as normal distributions and mixture normal distributions, suggest that the thresholded partial correlation approach performs as well as some popular regularization methods. This chapter is organized as follows. In Section 2, we briefly introduce some preliminaries, including the model, the goal, the concept of partial faithfulness, and elliptical distributions. In Section 3, we provide a general description of the thresholded partial correlation approach and derive the limiting distributions of correlations and partial correlations. In particular, under the assumption that the response and covariates are elliptically distributed, the limiting distributions of the scaled Fisher z-transforms of the correlations and partial correlations follow N(0, 1 + κ), where κ is the kurtosis of the distribution. Based on these limiting distributions, we develop the thresholded partial correlation approach with elliptical samples. Moreover, we establish the asymptotic properties of the estimated active set from our proposal in Section 4. Furthermore, simulation studies and an application to the cardiomyopathy microarray data are provided in Section 5. A brief conclusion is given in Section 6, and technical proofs are given in Section 7.

3.2 Preliminaries

3.2.1 Model and Notation

Consider the linear model,

y = x^T β + ε = Σ_{j=1}^{p} x_j β_j + ε,   (3.1)

where y is the response variable, x = (x_1, ···, x_p)^T is the p × 1 covariate vector, β is the p × 1 vector of unknown regression coefficients, and ε is the random error with

E(ε) = 0 and var(ε) = σ^2. It is assumed that E(x_j) = 0, j = 1, ···, p, cov(x) = Σ_x, and the fourth moments of the response and covariates exist; that is, E(y^4) < ∞

and E(x_j^4) < ∞ for j = 1, ···, p. Our main goal is to select a small set of the covariates that exhibits the strongest correlation with the response. Equivalently, we aim to identify the active set

A = {j : β_j ≠ 0} ⊂ {1, ···, p} based on the random samples (x_1^T, y_1), ···, (x_n^T, y_n). We adopt the following notation. For a set S ⊆ {1, 2, ···, p}, we use |S| to denote its cardinality, S^c = {j = 1, ···, p : j ∉ S} to denote its complement, and x_S = {x_j : j ∈ S} to denote a set of covariates. Let p_eff = |A| be the size of the true active set, which equals the number of nonzero β_j's. Let A^[0] = {1, ···, p} be the set consisting of the indices of all predictors, and Â^[m] be the Step m estimated active set.

Definition 3.1. (Partial Correlation) The partial correlation between x_j and y given a set of controlling variables x_S = {x_k : k ∈ S}, written ρ(y, x_j | x_S), is the correlation between the residuals r_{x_j, x_S} and r_{y, x_S} resulting from the linear regressions of x_j on x_S and of y on x_S, respectively. Similarly, the sample partial correlation between x_j and x_k given the controlling variables x_S is denoted by ρ̂(x_j, x_k | x_S).

We have the following recursive formulas for the population and sample partial correlations:

ρ(y, x_j | x_S) = {ρ(y, x_j | x_{S\{k}}) − ρ(y, x_k | x_{S\{k}}) ρ(x_j, x_k | x_{S\{k}})} / [{1 − ρ^2(y, x_k | x_{S\{k}})}{1 − ρ^2(x_j, x_k | x_{S\{k}})}]^{1/2},   (3.2)

ρ̂(y, x_j | x_S) = {ρ̂(y, x_j | x_{S\{k}}) − ρ̂(y, x_k | x_{S\{k}}) ρ̂(x_j, x_k | x_{S\{k}})} / [{1 − ρ̂^2(y, x_k | x_{S\{k}})}{1 − ρ̂^2(x_j, x_k | x_{S\{k}})}]^{1/2},   (3.3)

for some k ∈ S ⊆ {j}^c. Both ρ(y, x_j | x_S) and ρ̂(y, x_j | x_S) do not depend on the choice of k. A proof is provided in Section 7.
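The recursion (3.3) translates directly into code. The small R function below is an illustrative (and deliberately naive) implementation that computes ρ̂(y, x_j | x_S) from the matrix of marginal sample correlations; it assumes y is stored as the last variable, and the function name pcor_recursive is our own.

# Sample partial correlation rho_hat(y, x_j | x_S) via the recursion (3.3).
# R: (p + 1) x (p + 1) sample correlation matrix of (x_1, ..., x_p, y);
# j: index of x_j; S: integer vector of conditioning indices; y: index of y.
pcor_recursive <- function(R, j, S, y = nrow(R)) {
  if (length(S) == 0) return(R[y, j])           # order-0 case: marginal correlation
  k <- S[1]
  r_yj <- pcor_recursive(R, j, S[-1], y)
  r_yk <- pcor_recursive(R, k, S[-1], y)
  r_jk <- pcor_recursive(R, k, S[-1], j)        # rho_hat(x_j, x_k | x_{S \ {k}})
  (r_yj - r_yk * r_jk) / sqrt((1 - r_yk^2) * (1 - r_jk^2))
}

# e.g., with dat an n x (p + 1) data matrix whose last column is y:
# R <- cor(dat); pcor_recursive(R, j = 1, S = c(2, 5))

The recursion requires on the order of 2^|S| evaluations, so for larger conditioning sets the same quantity would in practice be computed from the inverse of the correlation submatrix of {y, x_j} and x_S.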

3.2.2 Partial Faithfulness

The definition of partial faithfulness was introduced by Buhlmann et al. (2010). This concept plays an important role in identifying significant variables based on the partial correlations.

Definition 3.2. (Buhlmann et al., 2010, Partial Faithfulness) Let y ∈ R be a random variable and x = (x_1, ···, x_p)^T ∈ R^p be a random vector. The distribution of (x^T, y) is said to be partially faithful when the following holds for every j ∈ {1, ···, p}:

if ρ(y, x_j | x_S) = 0 for some S ⊆ {j}^c, then ρ(y, x_j | x_{{j}^c}) = 0.   (3.4)

Partial faithfulness is a mild condition and a large number of distributions possess this property. Buhlmann et al. (2010) provided a sufficient condition for partial faithfulness. Define the following two conditions:

(C1) Σ_x is strictly positive definite;

(C2) {β_j : j ∈ A} ∼ f(b)db, where f(·) denotes the density, on a subset of R^{p_eff}, of an absolutely continuous distribution with respect to Lebesgue measure.

Buhlmann et al. (2010) showed that the linear models enjoy partial faithfulness if conditions (C1) and (C2) are satisfied. Furthermore, they proposed the following lemma to illustrate the rationale of using population partial correlations to select the important predictors.

Lemma 3.3. Suppose that linear model (3.1) satisfies conditions (C1) and (C2); then for every j = 1, ···, p,

ρ(y, x_j | x_S) ≠ 0 for all S ⊆ {j}^c if and only if β_j ≠ 0.   (3.5)

Equivalently, for every j = 1, ···, p,

ρ(y, x_j | x_S) = 0 for some S ⊆ {j}^c if and only if β_j = 0.   (3.6)

Property (3.5) or (3.6) is crucial for establishing our proposal for variable selection in linear models.

3.2.3 Elliptical Distributions

We introduce the definition of elliptical distributions in this section. We provide a necessary and sufficient condition for elliptical distributions after the definition.

Definition 3.4. (Fang et al., 1990, Spherically Symmetric Distribution) A p × 1 random vector x is said to have a spherically symmetric distribution (or simply spherical distribution) if for every Γ ∈ O(p),

Γx =^d x,   (3.7)

where O(p) denotes the set of p × p orthogonal matrices.

Lemma 3.5. (Fang et al., 1990) A p × 1 random vector x has a spherical distribution if and only if its characteristic function ψ(t) satisfies one of the following equivalent conditions:

1) ψ(Γ^T t) = ψ(t) for any Γ ∈ O(p);

2) there exists a function φ(·) of a scalar variable such that ψ(t) = φ(t^T t).

Usually, φ(·) is called the characteristic generator of the spherical distribution, and we write x ∼ S_p(φ).

Remark: Let x = (x_1, ···, x_p)^T follow the standard multivariate normal distribution; that is, x is distributed as N_p(0, I). Then the characteristic function of x is

exp{−(1/2)(t_1^2 + ··· + t_p^2)}.   (3.8)

According to Lemma 3.5, x has a spherical distribution S_p(φ) with characteristic generator φ(u) = exp(−u/2). Thus the standard multivariate normal distribution belongs to the family of spherical distributions.

Elliptical distributions are defined as linear transformations of the spherical distributions.

Definition 3.6. (Fang et al., 1990, Elliptically Contoured Distribution) A p × 1 random vector x is said to have an elliptically contoured distribution (or simply an elliptical distribution) with parameters µ (p × 1) and Σ (p × p) if

x =^d µ + A^T y,  y ∼ S_k(φ),   (3.9)

where A is a k × p matrix with A^T A = Σ and rank(Σ) = k. We denote x ∼ EC_p(µ, Σ, φ).

Remarks:
1. The multivariate normal distribution is a linear transformation of the standard multivariate normal distribution; hence it belongs to the family of elliptical distributions.
2. Many distributions belong to the elliptical family, including the multivariate normal distribution, the multivariate t-distribution, the multiuniform distribution, the multivariate Pearson Type VII distribution and the multivariate Pearson Type II distribution.
3. Elliptical distributions can be constructed through the following expression:

x = µ + r A^T u^{(k)} ∼ EC_p(µ, Σ, φ),   (3.10)

where A is a k × p matrix with A^T A = Σ, u^{(k)} denotes a random vector distributed uniformly on the unit sphere surface in R^k, r ≥ 0 is independent of u^{(k)}, and φ is determined by both A and r.
4. The matrix A is not unique, but the random vector x depends on A only through A^T A = Σ.

Lemma 3.7. (Fang et al., 1990, Theorem 2.16) Let x ∼ EC_n(µ, Σ, φ) with rank(Σ) = k, let B be an n × m matrix, and let ν be an m × 1 vector; then

ν + B^T x ∼ EC_m(ν + B^T µ, B^T Σ B, φ).   (3.11)

Remarks: 1. This lemma indicates that the linear combinations of the elliptical distributions belong to the elliptical family, and the characteristic generator will remain the same.

2. The kurtosis parameter κ of an elliptical distribution ECn(µ, Σ, φ) is defined as

κ = φ''(0)/{φ'(0)}^2 − 1,   (3.12)

where φ'(0) and φ''(0) are the first and second derivatives of φ evaluated at 0. Hence the kurtosis depends only on the characteristic generator φ.
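For example, for the normal characteristic generator φ(u) = exp(−u/2) in (3.8), φ'(0) = −1/2 and φ''(0) = 1/4, so κ = (1/4)/(−1/2)^2 − 1 = 0; that is, the kurtosis parameter of any multivariate normal distribution is 0.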

3.3 Thresholded Partial Correlation Approach

In this section, we first describe the thresholded partial correlation approach under a general setting. In reality, the population correlations and partial correlations are unknown, so it is natural to use the sample versions of the correlations and partial correlations to identify the significant predictors. Therefore, we study the limiting distributions of the sample correlations and partial correlations. The variances are difficult to derive in general; however, for elliptical samples, we can obtain a neat expression for the variances. Based on these limiting distributions, we develop the thresholded partial correlation approach in detail. Pseudo code is provided at the end of this section.

3.3.1 Thresholded Partial Correlation Approach: General Samples

In reality, we do not know the population correlations and partial correlations. Thus we employ the sample correlations and partial correlations to identify the significant predictors. First, we screen the sample marginal correlations: we set β̂_j = 0 if

|ρ̂(y, x_j)| < γ · {var̂(ρ̂(y, x_j))}^{1/2},   (3.13)

where γ is a pre-specified threshold. We only keep those predictors that have a larger absolute correlation with the response. The Step 1 estimated active set can be written as

Â^[1] = {j = 1, ···, p : |ρ̂(y, x_j)| > γ · {var̂(ρ̂(y, x_j))}^{1/2}} ⊆ A^[0].

Secondly, we study the partial correlations of order 1. That is, for j ∈ Â^[1], we set β̂_j = 0 if we can find k ≠ j with k ∈ Â^[1] such that

|ρ̂(y, x_j | x_k)| < γ · {var̂(ρ̂(y, x_j | x_k))}^{1/2}.   (3.14)

We continue to identify the significant variables based on the partial correlations of higher orders. The Step m estimated active set can be obtained through the following expression:

Â^[m] = {j ∈ Â^[m−1] : |ρ̂(y, x_j | x_S)| > γ · {var̂(ρ̂(y, x_j | x_S))}^{1/2} for any S ⊆ Â^[m−1]\{j} with |S| = m − 1}.   (3.15)

The predictors are selected by comparing the standardized marginal correlations and partial correlations to some pre-specified threshold γ, which can be treated as a tuning parameter. Thus we call the above algorithm the thresholded partial correlation approach. In reality, it is difficult to obtain the variances of the partial correlations. However, for elliptically distributed samples, we have explicit expressions for the variances of the correlations and partial correlations. Therefore, with elliptical samples, the thresholded partial correlation approach can be implemented.

3.3.2 Limiting Distributions of Correlations and Partial Correlations

In this section, we study the limiting distributions of correlations and partial correlations. We mainly develop the asymptotic distributions for two scenarios: normal samples and elliptical samples. By the central limit theorem and the delta method, we obtain the asymptotic distribution of the marginal correlations, which is provided in Theorem 3.8.

Theorem 3.8. Given that (x_1^T, y_1), ···, (x_n^T, y_n) are independent and identically distributed (i.i.d.) samples, then for any j = 1, ···, p,

√n {ρ̂(y, x_j) − ρ(y, x_j)} →_D N(0, Σ*),   (3.16)

where

Σ* = {ρ^2(y, x_j)/4} {cov(x_j^2, x_j^2)/σ_{x_j}^4 + cov(y^2, y^2)/σ_y^4 + 2 cov(x_j^2, y^2)/(σ_{x_j}^2 σ_y^2)}
     − ρ(y, x_j) {cov(x_j^2, x_j y)/(σ_{x_j}^3 σ_y) + cov(x_j y, y^2)/(σ_{x_j} σ_y^3)}
     + cov(x_j y, x_j y)/(σ_{x_j}^2 σ_y^2).

A proof is provided in Section 7.

First, we focus on normal samples. Suppose (x_1^T, y_1), ···, (x_n^T, y_n) are i.i.d. samples from N(µ, Σ), where

µ = (µ_1, ···, µ_p, µ_y)^T, and Σ = [ σ_{11} ··· σ_{1p} σ_{1y} ; ··· ; σ_{p1} ··· σ_{pp} σ_{py} ; σ_{y1} ··· σ_{yp} σ_{yy} ];   (3.17)

then Σ* simplifies to {1 − ρ^2(y, x_j)}^2. By applying the delta method, the asymptotic distribution of Fisher's Z-transform,

Ẑ(y, x_j) = (1/2) log{(1 + ρ̂(y, x_j))/(1 − ρ̂(y, x_j))},   (3.18)

can be obtained. These results are stated in Theorem 3.9.

Theorem 3.9. Given that (x_1^T, y_1), ···, (x_n^T, y_n) are i.i.d. samples from N(µ, Σ), then for any j = 1, ···, p,

√(n − 1) {ρ̂(y, x_j) − ρ(y, x_j)} →_D N(0, {1 − ρ^2(y, x_j)}^2), and
√(n − 1) {Ẑ(y, x_j) − Z(y, x_j)} →_D N(0, 1).   (3.19)

For the sample partial correlations of higher orders, the asymptotic result can be derived via the fact that the sample covariance matrix follows a scaled Wishart distribution.

Theorem 3.10. (Muirhead, 2009, Theorem 5.3.1) Suppose (x_1^T, y_1), ···, (x_n^T, y_n) are i.i.d. samples from N(µ, Σ); then for any j = 1, ···, p and S ⊆ {j}^c,

√(n − 1 − |S|) {ρ̂(y, x_j | x_S) − ρ(y, x_j | x_S)} →_D N(0, {1 − ρ^2(y, x_j | x_S)}^2), and
√(n − 1 − |S|) {Ẑ(y, x_j | x_S) − Z(y, x_j | x_S)} →_D N(0, 1).   (3.20)

Next, we consider elliptically contoured distributed samples. Given that (x_1^T, y_1), ···, (x_n^T, y_n) are i.i.d. samples from the elliptical distribution EC_{p+1}(µ, Σ, φ), then Σ* simplifies to (1 + κ){1 − ρ^2(y, x_j)}^2, where κ is the kurtosis. By applying the delta method, the scaled Fisher Z-transform follows N(0, 1 + κ) asymptotically. These results are given in Theorem 3.11.

Theorem 3.11. Given that (x_1^T, y_1), ···, (x_n^T, y_n) are i.i.d. samples drawn from EC_{p+1}(µ, Σ, φ), then for any j = 1, ···, p,

√(n − 1) {ρ̂(y, x_j) − ρ(y, x_j)} →_D N(0, (1 + κ){1 − ρ^2(y, x_j)}^2), and
√(n − 1) {Ẑ(y, x_j) − Z(y, x_j)} →_D N(0, 1 + κ),   (3.21)

where κ = φ''(0)/{φ'(0)}^2 − 1.

We next derive the asymptotic distribution of the sample partial correlation coefficient when the sample is randomly drawn from an elliptical distribution. To this end, let u_1, ···, u_n be an independent and identically distributed random sample from EC_p(µ, Σ, φ). To study the asymptotic behavior of the partial correlations of an elliptical distribution, we partition u_i, µ and Σ as follows:

u_i = (u_{1i}^T, u_{2i}^T)^T,  µ = (µ_1^T, µ_2^T)^T,  Σ = ( Σ_{11}  Σ_{12} ; Σ_{21}  Σ_{22} ),

where u_{1i} and µ_1 are p_1-dimensional, u_{2i} and µ_2 are p_2-dimensional, Σ_{11} is a p_1 × p_1 matrix, and Σ_{22} is a p_2 × p_2 matrix. Here p = p_1 + p_2. Let

C = ( I  −Σ_{12} ; 0  I ),

where I stands for the identity matrix, and v_i = C(u_i − µ). Using Lemma 3.7, it follows that

v_i ∼ EC_p(0, ( Σ_{11.2}  0 ; 0  Σ_{22} ), φ),

where Σ_{11.2} = Σ_{11} − Σ_{12} Σ_{22}^{-1} Σ_{21}. Let U = (u_1, ···, u_n)^T and V = (v_1, ···, v_n)^T. By the definition of v_i, V = UC^T.

Let 1_n be an n × 1 vector with all elements equal to 1. Define

A = (1/n) Σ_{i=1}^{n} (u_i − ū)(u_i − ū)^T = (1/n) U^T (I_n − (1/n) 1_n 1_n^T) U, and

B = (1/n) Σ_{i=1}^{n} (v_i − v̄)(v_i − v̄)^T = (1/n) V^T (I_n − (1/n) 1_n 1_n^T) V = C A C^T.

By direct calculation, we have B_{11} = A_{11} − Σ_{12} A_{21} − A_{12} Σ_{21} + Σ_{12} A_{22} Σ_{21}, B_{12} = A_{12} − Σ_{12} A_{22}, B_{21} = A_{21} − A_{22} Σ_{21}, and B_{22} = A_{22}, where A and B are partitioned in the same way as Σ. By direct calculation, it follows that B_{11.2} = A_{11.2}. Define W_{11} = √n(B_{11} − Σ_{11.2}), W_{12} = √n B_{12}, W_{21} = √n B_{21}, and W_{22} = √n(B_{22} − Σ_{22}). Assume that all fourth moments of u_i are finite; this implies that all fourth moments of v_i are finite. Thus, it follows by the central limit theorem that W_{kl}, for k = 1, 2 and l = 1, 2, has an asymptotic normal distribution with mean zero and a finite covariance matrix. Then

B_{11.2} = (1/√n) W_{11} + Σ_{11.2} − (1/n) W_{12} (Σ_{22} + (1/√n) W_{22})^{-1} W_{21}.

Therefore it follows that

√n (B_{11.2} − Σ_{11.2}) = W_{11} + O_P(n^{-1/2}).

That is, √n (A_{11.2} − Σ_{11.2}) = W_{11} + O_P(n^{-1/2}) = √n (B_{11} − Σ_{11.2}) + O_P(n^{-1/2}).

This implies that A11.2 and B11 have the same asymptotic normal distribution.

Let a_{kl.2} be the (k, l)-element of A_{11.2}. Then for elliptical distributions, the sample partial correlation of u_{ik} and u_{il} given u_{2i} indeed equals a_{kl.2}/√(a_{kk.2} a_{ll.2}). Note that a_{kl.2}/√(a_{kk.2} a_{ll.2}) and b_{kl}/√(b_{kk} b_{ll}) have the same asymptotic distribution, where

b_{kl} is the (k, l)-element of B. The asymptotic normal distribution of the sample correlation coefficient b_{kl}/√(b_{kk} b_{ll}) for elliptical distributions can be found in Theorem 5.2.6. This provides a mathematical justification of the following theorem.

Theorem 3.12. Given that (x_1^T, y_1), ···, (x_n^T, y_n) are i.i.d. samples drawn from EC_{p+1}(µ, Σ, φ), then for any j = 1, ···, p and S ⊆ {j}^c,

√(n − 1 − |S|) {ρ̂(y, x_j | x_S) − ρ(y, x_j | x_S)} →_D N(0, (1 + κ){1 − ρ^2(y, x_j | x_S)}^2), and
√(n − 1 − |S|) {Ẑ(y, x_j | x_S) − Z(y, x_j | x_S)} →_D N(0, 1 + κ).   (3.22)

Remarks:
1) The normal distributions belong to the family of elliptical distributions, and the kurtosis of normal distributions is 0. This indicates that normal samples are a special case of elliptically distributed samples. We develop our proposal based on elliptically distributed samples because this setting is more comprehensive.
2) For a small-sample correction, we use √(n − 3 − |S|) instead of √(n − 1 − |S|).
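These limiting distributions can be checked numerically. The R snippet below is a small Monte Carlo sanity check under one illustrative elliptical distribution, the normal scale mixture 0.9N(0, I) + 0.1N(0, 9I); under the scale-mixture representation its kurtosis parameter is κ = E(v^2)/{E(v)}^2 − 1 = 9/1.8^2 − 1 ≈ 1.78, where v equals 1 or 9 with probabilities 0.9 and 0.1. The sample size, dimension and number of replications are arbitrary choices.

# Monte Carlo check of the N(0, 1 + kappa) limit of the scaled Fisher
# z-transform in Theorem 3.11, under the mixture 0.9 N(0, I) + 0.1 N(0, 9I).
# Here rho(y, x_1) = 0 and 1 + kappa = 9 / 1.8^2, approximately 2.78.
set.seed(1)
n <- 400; p <- 4; nrep <- 2000
z_stat <- replicate(nrep, {
  v <- ifelse(runif(n) < 0.9, 1, 9)            # common mixing scale per observation
  u <- sqrt(v) * matrix(rnorm(n * (p + 1)), n, p + 1)
  r <- cor(u[, 1], u[, p + 1])                 # treat the last column as y
  sqrt(n - 1) * 0.5 * log((1 + r) / (1 - r))   # scaled Fisher z-transform
})
var(z_stat)                                    # should be close to 1 + kappa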

3.3.3 Thresholded Partial Correlation Approach: Elliptical Samples

Now we are ready to develop our proposal for variable selection with elliptical samples. In practice, the population correlations and partial correlations are unknown. Therefore, we utilize their sample counterparts to assess the importance of the predictors.

Suppose the samples (x_1^T, y_1), ···, (x_n^T, y_n) are independent and identically distributed samples from the elliptical distribution EC_{p+1}(µ, Σ, φ), and we aim to identify the significant features based on (x_1^T, y_1), ···, (x_n^T, y_n). The procedure of our proposal is developed according to the cardinality of S in (3.5) and (3.6). First, we screen the sample marginal correlations. Let S be the empty set in (3.6). For testing H_0: ρ(y, x_j) = 0 against the two-sided alternative H_A: ρ(y, x_j) ≠ 0, the null hypothesis H_0 is rejected if

(n − 1)^{1/2} |Ẑ(y, x_j)| / √(1 + κ̂) > Φ^{-1}(1 − α/2),   (3.23)

where α is the significance level, Φ is the cumulative distribution function of the standard normal distribution, and κ̂ is the sample estimate of the kurtosis:

κ̂ = (1/p) Σ_{j=1}^{p} [ {n^{-1} Σ_{i=1}^{n} (x_{ij} − x̄_j)^4} / {3 (n^{-1} Σ_{i=1}^{n} (x_{ij} − x̄_j)^2)^2} − 1 ].

Otherwise, we set β̂_j = 0. Define

T(α, n, κ, |S|) = [exp{2√(1 + κ) Φ^{-1}(1 − α/2)/√(n − |S| − 1)} − 1] / [exp{2√(1 + κ) Φ^{-1}(1 − α/2)/√(n − |S| − 1)} + 1].   (3.24)

T(α, n, κ, |S|) depends on the significance level α, the sample size n, the kurtosis of the distribution, and the cardinality of S. Due to the monotonicity of the Z-transform, the threshold rule in (3.23) can be restated as follows: if

|ρ̂(y, x_j)| > T(α, n, κ̂, 0),   (3.25)

we keep the jth predictor. Equivalently, if the absolute value of the sample marginal

correlation is less than T(α, n, κ̂, 0), we fail to reject H_0: ρ(y, x_j) = 0 and set β̂_j = 0. By examining the p sample marginal correlations, the Step 1 estimated active set, denoted by Â^[1], can be obtained and expressed as

Â^[1] = {j = 1, ···, p : |ρ̂(y, x_j)| > T(α, n, κ̂, 0)} ⊆ A^[0].
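The threshold (3.24) and the kurtosis estimate above are straightforward to code. The following R helpers are an illustrative implementation (the function names are ours); the small-sample correction mentioned earlier would simply replace n − |S| − 1 by n − |S| − 3.

# Averaged marginal moment estimate of the kurtosis, as defined above.
kurtosis_hat <- function(X) {
  xc <- sweep(X, 2, colMeans(X))               # center each column
  m2 <- colMeans(xc^2); m4 <- colMeans(xc^4)
  mean(m4 / (3 * m2^2) - 1)
}

# Threshold T(alpha, n, kappa, |S|) from (3.24); note it equals tanh of half
# the argument of the exponential.
tpc_threshold <- function(alpha, n, kappa, S_size) {
  z <- 2 * sqrt(1 + kappa) * qnorm(1 - alpha / 2) / sqrt(n - S_size - 1)
  (exp(z) - 1) / (exp(z) + 1)
}

# Step 1 screening, e.g. with x an n x p design matrix and y the response:
# keep <- which(abs(cor(x, y)) > tpc_threshold(0.05, nrow(x), kurtosis_hat(x), 0))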

Secondly, we establish the Step 2 estimated active set by studying the sample partial correlations of order 1. For every j ∈ Â^[1], if there exists k ∈ Â^[1]\{j} such that

|ρ̂(y, x_j | x_k)| < T(α, n, κ̂, 1),   (3.26)

where

ρ̂(y, x_j | x_k) = {ρ̂(y, x_j) − ρ̂(y, x_k) ρ̂(x_j, x_k)} / [{1 − ρ̂^2(y, x_k)}{1 − ρ̂^2(x_j, x_k)}]^{1/2},   (3.27)

then we fail to reject H_0: ρ(y, x_j | x_k) = 0 and set β̂_j = 0. Hence, the Step 2 estimated active set can be written as

Â^[2] = {j ∈ Â^[1] : |ρ̂(y, x_j | x_k)| > T(α, n, κ̂, 1) for any k ∈ Â^[1]\{j}} ⊆ Â^[1].   (3.28)

We can continue to select significant variables based on the sample partial correlations of higher orders. The Step m estimated active set can be constructed by utilizing the sample partial correlations of order m − 1. For any S ⊆ Â^[m−1]\{j} with |S| = m − 1, the sample partial correlations of higher order can be calculated from lower order sample partial correlations via the following formula,

ρ̂(y, x_j | x_S) = {ρ̂(y, x_j | x_{S\{k}}) − ρ̂(y, x_k | x_{S\{k}}) ρ̂(x_j, x_k | x_{S\{k}})} / [{1 − ρ̂^2(y, x_k | x_{S\{k}})}{1 − ρ̂^2(x_j, x_k | x_{S\{k}})}]^{1/2},

where k ∈ S and k ≠ j. Then the Step m estimated active set can be expressed as

Â^[m] = {j ∈ Â^[m−1] : |ρ̂(y, x_j | x_S)| > T(α, n, κ̂, m − 1) for any S ⊆ Â^[m−1]\{j} with |S| = m − 1}.   (3.29)

The above procedure yields a nested sequence of estimated active sets

A^[0] ⊇ Â^[1] ⊇ Â^[2] ⊇ ··· ⊇ Â^[m] ⊇ ···.   (3.30)

This algorithm stops once the size of the estimated active set does not exceed m; that is,

|Â^[m]| ≤ m.   (3.31)

A sequence of thresholds, T(α, n, κ̂, 0), ···, T(α, n, κ̂, m − 1), is set up, and these are compared to the sample partial correlations of orders 0, 1, ···, m − 1. After obtaining the final estimated active set Â^[reach], we apply the ordinary least squares approach to estimate the coefficients of the predictors in the estimated active set. Meanwhile, we set the coefficients of the insignificant predictors to zero. The above algorithm is called the thresholded partial correlation approach, and its pseudo code is provided in Algorithm 3.

Algorithm 3 (m, Â^[m]) = TPC({(x_1^T, y_1), ···, (x_n^T, y_n)})

1. Set m = 1, do marginal correlation screening, and build the Step 1 estimated active set

   Â^[1] = {j = 1, ···, p : |ρ̂(y, x_j)| > T(α, n, κ̂, 0)};

2. Set m = m + 1 and construct the Step m estimated active set

   Â^[m] = {j ∈ Â^[m−1] : |ρ̂(y, x_j | x_S)| > T(α, n, κ̂, m − 1), ∀ S ⊆ Â^[m−1]\{j} with |S| = m − 1};

3. Repeat Step 2 until |Â^[m]| ≤ m.

4. Apply least squares to get the estimates of the coefficients.
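For completeness, the sketch below is one possible R implementation of Algorithm 3, reusing kurtosis_hat() and tpc_threshold() from the earlier sketch; the partial correlations are computed from the inverse of the relevant correlation submatrix, which is equivalent to the recursion (3.3). It is meant to illustrate the logic rather than to be an efficient implementation.

# Partial correlation rho_hat(a, b | S) from the correlation matrix R,
# via the inverse of the submatrix indexed by {a, b} union S.
pcor_block <- function(R, a, b, S) {
  idx <- c(a, b, S)
  Om  <- solve(R[idx, idx, drop = FALSE])
  -Om[1, 2] / sqrt(Om[1, 1] * Om[2, 2])
}

tpc <- function(x, y, alpha = 0.05) {
  n <- nrow(x); p <- ncol(x)
  R <- cor(cbind(x, y))                        # correlations of (x_1, ..., x_p, y)
  kap <- kurtosis_hat(x)
  # Step 1: marginal correlation screening.
  active <- which(abs(R[p + 1, 1:p]) > tpc_threshold(alpha, n, kap, 0))
  m <- 1
  # Steps 2-3: partial correlations of increasing order until |A_hat| <= m.
  while (length(active) > m) {
    keep <- logical(length(active))
    for (i in seq_along(active)) {
      j <- active[i]
      subsets <- combn(setdiff(active, j), m, simplify = FALSE)
      keep[i] <- all(vapply(subsets, function(S)
        abs(pcor_block(R, p + 1, j, S)) > tpc_threshold(alpha, n, kap, m),
        logical(1)))
    }
    active <- active[keep]
    m <- m + 1
  }
  # Step 4: least squares on the selected predictors (covariates assumed
  # centered; add an intercept otherwise).
  beta <- numeric(p)
  if (length(active) > 0)
    beta[active] <- coef(lm(y ~ x[, active, drop = FALSE] - 1))
  list(active = active, beta = beta)
}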

3.4 Asymptotic Theory

In this section, we study the asymptotic theory of the proposed approach. By imposing conditions on the population partial correlations between the response and covariates, we establish the consistency property of our proposal. The most crucial condition is that the distribution of (x^T, y) should satisfy a sub-exponential tail bound. After that, we discuss the assumptions needed to guarantee the consistency and provide two ways of choosing the significance level α. To study the high-dimensional behavior, we let the dimension increase as a function of the sample size n, say p = p_n. Let the joint distribution of (x^T, y) be P_n, the regression coefficients be β_j = β_{j,n}, and the active set be A = A_n with p_eff = p_{eff,n} = |A_n|. Denote by Â_n(α) the estimated active set from the thresholded partial correlation approach with α as the significance level.

3.4.1 Asymptotic Theory of Â_n(α)

Our assumptions are as follows:

(D1) The joint distribution P_n of (x^T, y) follows an elliptical distribution and satisfies partial faithfulness.

(D2) Σ_x > 0, and the joint distribution P_n of (x^T, y) = (x_1, ···, x_p, y) satisfies the sub-exponential tail condition: there exists s_0 > 0 such that for 0 < s < s_0,

max_{1≤i≤p} E{exp(s x_i^2)} < ∞,  E{exp(s y^2)} < ∞,  max_{1≤i≤p} E{exp(s x_i y)} < ∞.   (3.32)

(D3) The partial correlations ρ_n(y, x_j | x_S) satisfy

inf{|ρ_n(y, x_j | x_S)| : j = 1, ···, p_n, S ⊆ {j}^c, |S| ≤ p_{eff,n}, ρ_n(y, x_j | x_S) ≠ 0} ≥ c_n, where c_n = O(n^{-d}), 0 < d < 1/2.

(D4) The partial correlations ρ_n(y, x_j | x_S) and ρ_n(x_i, x_j | x_S) satisfy:

i) sup{|ρ_n(y, x_j | x_S)| : j = 1, ···, p_n, S ⊆ {j}^c, |S| ≤ p_{eff,n}} ≤ τ < 1;

ii) sup{|ρ_n(x_i, x_j | x_S)| : i ≠ j, i, j = 1, ···, p_n, S ⊆ {i, j}^c, |S| ≤ p_{eff,n}} ≤ τ < 1.

Based on the above regularity conditions, we obtain the following consistency property. First we consider the model selection consistency of the final estimated active set from the TPC. Since the TPC depends on the significance level α = α_n, we rewrite the estimated active set as Â_n(α_n).

Theorem 3.13. Consider linear model (3.1) with assumptions (D1)-(D4), for p_n = O(exp(n^a)) and p_{eff,n} = O(n^b), where a, b ≥ 0 and a + b < (1 − 2d)/5. There exists a sequence α_n → 0 such that

log{P(Â_n(α_n) ≠ A_n)} ≤ C_5 n^{a+b} + n log(1 − C_3 n^{-2d-4q}) + log(2 C_4) → −∞,   (3.33)

as n → ∞, where C_3, ···, C_5 are constants, (1 − 2d)/5 ≤ q < (1 − a − b − 2d)/4, and c_n is given in (D3).

A proof is given in Section 7. One possible choice is α_n = 2{1 − Φ(√(n/(1 + κ)) c_n/2)}.

3.4.2 Asymptotic Theory of Â_n^[1](α)

In addition, notice that the estimated active set from the first step of the TPC, denoted by Â_n^[1](α_n), can be viewed as a feature screening procedure and is similar to the sure independence screening procedure (Fan and Lv, 2008). Thus, we also study the sure screening property (Fan and Lv, 2008) of this first step of the TPC, but from a different point of view. We impose the following conditions on the population marginal correlations:

(E3) The marginal correlations ρ_n(y, x_j) satisfy

inf{|ρ_n(y, x_j)| : j = 1, ···, p_n, ρ_n(y, x_j) ≠ 0} ≥ c_n,   (3.34)

where c_n = O(n^{-d}) and 0 < d < 1/2.

(E4) The marginal correlations ρ_n(y, x_j) also satisfy

sup{|ρ_n(y, x_j)| : j = 1, ···, p_n} ≤ τ < 1.   (3.35)

Theorem 3.14. Consider linear model (3.1) and suppose that (D1)-(D2) and (E3)-(E4) are valid. For p_n = O(exp(n^a)) and p_{eff,n} = O(n^b), where a, b ≥ 0 and a < (1 − 2d)/5, there exists a sequence α_n → 0 such that the Step 1 estimated active set from the marginal correlation learning (the first step of the TPC) satisfies

P{Â_n^[1] ⊇ A_n} ≥ 1 − [C_4 exp(−C_3 n^{1−4q−2d} + n^a) + C_2 exp{−C_1 n^q + n^a + log(n)}] → 1,

as n → ∞, for some C_1, ···, C_4 > 0, q ∈ (a, (1 − 2d − a)/4), and d given in (E3).

A proof is given in Section 7. One possible choice of α_n is α_n = 2{1 − Φ(√(n/(1 + κ)) c_n/2)}. The first step is known as marginal correlation screening. It is similar to sure independence screening (SIS, Fan and Lv, 2008), but the two procedures differ in two ways. First, the size of the estimated active set from the SIS approach can be chosen by researchers, while for our proposal the size can only be controlled via the significance level α. Second, theoretically, the two approaches rely on different assumptions.

3.4.3 Discussion of the Conditions

Here we discuss in detail the conditions for the asymptotic theories from the previous section. It is virtually impossible to check assumptions (D2)-(D4) in practice; however, they are common assumptions for high-dimensional variable selection. (D2) requires that the joint distribution of (x^T, y) satisfies the sub-exponential tail condition; under that constraint, the difference between the population and sample partial correlations can be controlled, and hence the Type I and Type II errors from the correlation tests can be controlled.

More critical assumptions are the positive definiteness of the covariance matrix in (D2), the partial faithfulness requirement in (D1), and the conditions on the partial correlations in (D3) and (D4). (D2) requires the covariance matrix to be positive definite, and (D4)-ii) imposes a fixed upper bound on the population partial correlations between the covariates. These two assumptions exclude perfect collinearity between the covariates and lead to the identifiability of the regression coefficients. The upper bound on the partial correlations in (D4)-i) is used to control the Type I errors, and the lower bound in (D3) is used to control the Type II errors from the tests.

3.5 Numerical Studies

In this section, we assess the performance of the thresholded partial correlation approach and compare the proposed procedure with existing ones. Our simulation studies show that the thresholded partial correlation approach for variable selection in linear models can outperform SCAD, LASSO and the PC-simple algorithm. All simulation studies and the real data analysis are conducted in R.

3.5.1 Simulation Studies

We compare the results from the thresholded partial correlation approach to the penalized approaches with the LASSO and SCAD penalties and to the PC-simple algorithm. For LASSO, we directly use the package "glmnet" in R to implement the LASSO procedure, although there are many algorithms that may be used to find the LASSO estimate. For SCAD, we apply the one-step LLA algorithm proposed by Zou and Li (2008). The initial values for the coefficients are obtained via LASSO, and the estimates are updated via the fully iterative local linear approximation approach (Zou and Li, 2008). The parameter a is set to 3.7, as recommended by Fan and Li (2001). The tuning parameter λ in SCAD is selected by five-fold cross-validation, as used in Zou and Li (2008). For the thresholded partial correlation method, referred to as TPC throughout this dissertation, T(α, n, κ̂, m) is used. We estimate κ by averaging the marginal moment estimators. For the elliptical distribution EC_p(µ, Σ, φ), the p marginal distributions share the same characteristic generator. The kurtosis of an elliptical distribution is defined as κ = φ''(0)/{φ'(0)}^2 − 1, and it depends only on the characteristic generator. Thus the kurtoses of the p marginal distributions are the same, and they equal the kurtosis of EC_p(µ, Σ, φ). For a univariate variable x, its kurtosis is defined as

E(x − µ)^4 / [3{E(x − µ)^2}^2] − 1;

thus we use the sample version to estimate the kurtosis of each of the p univariate variables, and their average is treated as the estimate of the kurtosis of the elliptical distribution. Equivalently,

κ̂ = (1/p) Σ_{j=1}^{p} [ {n^{-1} Σ_{i=1}^{n} (x_{ij} − x̄_j)^4} / {3 (n^{-1} Σ_{i=1}^{n} (x_{ij} − x̄_j)^2)^2} − 1 ].

To further improve the thresholded partial correlation method, we consider a fine tuning of the threshold of the TPC. Specifically, we replace the threshold T(α, n, κ̂, m) with cT(α, n, κ̂, m), where c is a constant in (0.5, 1.5). The constant c is chosen by the extended Bayesian information criterion proposed by Wang et al. (2013). For each threshold cT(α, n, κ̂, m), we can obtain the estimated active set via the thresholded partial correlation approach. By applying least squares, the estimates of the coefficients can be obtained. Then we use the following criterion to assess the performance of the threshold cT(α, n, κ̂, m):

log(σ̂^2) + df · log(p) · log(n)/n,   (3.36)

where σ̂^2 is the estimated error variance, df is the size of the estimated active set, p is the number of covariates, and n is the sample size. We choose the threshold cT(α, n, κ̂, m) that minimizes the above criterion. This refined method will be referred to as TPC-EBIC throughout this dissertation. Since the estimate of the kurtosis may not be very close to the true kurtosis, which may lead to poor performance of the thresholded partial correlation approach, we rely on the extended Bayesian information criterion to choose a better cutoff for the partial correlations. As demonstrated in our simulation studies, TPC-EBIC may significantly improve the TPC in some settings. The performance is assessed by the following performance measures.

1. Model error of prediction: defined as E_x[{x^T(β̂ − β)}^2] = (β̂ − β)^T cov(x)(β̂ − β).

2. True positive number (C): defined as Σ_{j=1}^{p} I(β̂_j ≠ 0, β_j ≠ 0); this is the number of nonzero coefficients correctly estimated to be nonzero.

3. False positive number (IC): defined as Σ_{j=1}^{p} I(β̂_j ≠ 0, β_j = 0); this is the number of zero coefficients erroneously estimated to be nonzero.

4. Under-fit percentage: the percentage of simulations that fail to identify at least one of the nonzero coefficients.

5. Correct-fit percentage: the percentage of simulations that identify exactly the nonzero coefficients and no others.

6. Over-fit percentage: the percentage of simulations that identify all nonzero coefficients but also include at least one zero coefficient.

In our simulations, we focus on two scenarios for the joint distribution of x and ε: one is the normal distribution N(0, Σ), for which we expect many existing methods to perform well, and the other is a mixture of normals, 0.9N(0, Σ) + 0.1N(0, 9Σ), which implies that the joint distribution of x and y is an elliptical distribution whose κ value deviates significantly from 0. We consider two different covariance structures for Σ: one is AR and the other is compound symmetric.
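As a hedged illustration of how these simulated designs could be generated, the sketch below draws (x, ε) jointly from either the normal or the mixture-normal scenario with an AR or compound symmetric Σ; the function name gen_data and its arguments are ours. Under the scale-mixture representation, the mixture 0.9N(0, Σ) + 0.1N(0, 9Σ) corresponds to a common scale v equal to 1 or 9 with probabilities 0.9 and 0.1, so its kurtosis parameter is κ = E(v^2)/{E(v)}^2 − 1 = 9/1.8^2 − 1 ≈ 1.78, consistent with the estimated kurtosis exceeding 1 reported below.

# Generate (x, y) under the designs of Examples 1 and 2: rows of (x, eps) are
# N(0, blockdiag(Sigma, 1)) or the scale mixture 0.9 N + 0.1 N(0, 9 x scale).
gen_data <- function(n, p, rho, beta, structure = c("AR", "CS"), mixture = FALSE) {
  structure <- match.arg(structure)
  Sigma <- if (structure == "AR") rho^abs(outer(1:p, 1:p, "-"))
           else matrix(rho, p, p) + diag(1 - rho, p)
  z <- cbind(matrix(rnorm(n * p), n, p) %*% chol(Sigma), rnorm(n))
  if (mixture) {
    v <- ifelse(runif(n) < 0.9, 1, 9)          # common scale for x and eps
    z <- sqrt(v) * z
  }
  x <- z[, 1:p, drop = FALSE]
  list(x = x, y = drop(x %*% beta) + z[, p + 1])
}

# Example 1 setting: n = 200, p = 20, beta = (3, 1.5, 0, 0, 2, 0, ..., 0)^T.
# dat <- gen_data(200, 20, 0.5, c(3, 1.5, 0, 0, 2, rep(0, 15)), "AR", mixture = TRUE)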

Example 1 (Low dimension): In this example, we simulated 200 datasets consisting of 200 observations from the model

y = x^T β + ε,   (3.37)

where β = (3, 1.5, 0, 0, 2, 0, ···, 0)^T and ε ∼ N(0, 1). In this example, the number of nonzero coefficients is p_eff = 3. Furthermore, we set p = 20, a low-dimensional case. We consider two covariance structures for Σ. Let σ_{ij} be the (i, j)-element

of Σ. The AR covariance structure corresponds to σ_{ij} = ρ^{|i−j|}, and the compound symmetric covariance structure corresponds to σ_{ij} = ρ for i ≠ j and 1 for i = j. For each scenario, we let ρ vary among 0, 0.3, 0.5, 0.6, 0.7, 0.8 and 0.9. For TPC and the PC-simple algorithm, the significance level α is set to 0.05. After obtaining the estimated active sets from TPC and the PC-simple algorithm, the ordinary least squares estimate is computed for the selected models. The median of the model errors (MedME) over 200 simulations, along with the median absolute deviation (Devi), is summarized in the first column of each table. The average number of correctly estimated nonzero coefficients is also reported in

the column labeled “C” in the tables, while the column labeled “IC” reports the average number of zero coefficients erroneously set to nonzero. The next three columns present the percentages of under-fit, correct-fit and over-fit over the 200 simulations. “SCAD” and “LASSO” refer to the penalized approaches with the SCAD and LASSO penalties, “PC-simple” refers to the PC-simple algorithm proposed by Buhlmann et al. (2010), “TPC” represents the proposed approach, and “TPC-EBIC” is the thresholded partial correlation approach with a fine tuning using the EBIC criterion. Tables 3.1 and 3.2 provide the results for normal and mixture normal samples with the AR correlation matrix, and Tables 3.3 and 3.4 provide the results for normal and mixture normal samples with the compound symmetric correlation matrix. In all four tables, LASSO has the largest model errors compared to SCAD, the PC-simple algorithm, TPC and TPC-EBIC. This is due to the fact that LASSO erroneously sets some zero coefficients to be nonzero; the over-fitting of LASSO can be seen in the “IC” and “Over-fit” columns of the four tables. In the PC-simple algorithm, the threshold rule is derived from the fact that the limiting distribution of the z-transformed partial correlation is N(0, 1). However, for mixture normal samples, the true asymptotic distribution should be N(0, 1 + κ). Our proposal TPC develops the threshold criterion by utilizing N(0, 1 + κ̂) as the limiting distribution, where κ̂ is a consistent estimate of the kurtosis. When the samples are from a normal distribution, TPC and the PC-simple algorithm work almost identically: the kurtosis of the normal distribution is 0, and the estimate of the kurtosis is close to 0. Therefore, the PC-simple algorithm and our proposal TPC utilize similar distributions to form the threshold rules, and this is the reason why we have almost identical results for “PC-simple” and “TPC” in Tables 3.1 and 3.3. However, the estimated kurtosis of the mixture normals we choose is larger than 1. Thus we have T(α, n, 0, |S|) as the threshold rule for the PC-simple algorithm

Table 3.1: Normal samples with AR covariance (p = 20)

Method MedME(Devi) C IC Underfit Correct-fit Over-fit ρ = 0 SCAD 0.0162(0.0083) 3.000 0.400 0.000 0.830 0.170 LASSO 0.0531(0.0196) 3.000 5.320 0.000 0.055 0.945 PC-simple 0.0141(0.0070) 3.000 0.010 0.000 0.990 0.010 TPC 0.0141(0.0070) 3.000 0.010 0.000 0.990 0.010 TPC-EBIC 0.0137(0.0069) 3.000 0.000 0.000 1.000 0.000 ρ = 0.3 SCAD 0.0144(0.0078) 3.000 0.475 0.000 0.820 0.180 LASSO 0.0486(0.0167) 3.00 4.630 0.000 0.080 0.920 PC-simple 0.0133(0.0070) 3.000 0.005 0.000 0.995 0.005 TPC 0.0133(0.0070) 3.000 0.005 0.000 0.995 0.005 TPC-EBIC 0.0130(0.0069) 3.000 0.000 0.000 1.000 0.000 ρ = 0.5 SCAD 0.0136(0.0066) 3.000 0.320 0.000 0.845 0.155 LASSO 0.0437(0.0178) 3.000 4.190 0.000 0.080 0.920 PC-simple 0.0127(0.0061) 2.995 0.005 0.005 0.995 0.000 TPC 0.0127(0.0061) 2.995 0.005 0.005 0.995 0.000 TPC-EBIC 0.0127(0.0061) 3.000 0.010 0.000 0.995 0.005 ρ = 0.6 SCAD 0.0128(0.0060) 3.000 0.380 0.000 0.810 0.190 LASSO 0.0409(0.0177) 3.000 4.300 0.000 0.075 0.925 PC-simple 0.0119(0.0058) 2.990 0.010 0.010 0.990 0.000 TPC 0.0119(0.0058) 2.990 0.010 0.010 0.990 0.000 TPC-EBIC 0.0119(0.0058) 2.995 0.010 0.005 0.990 0.005 ρ = 0.7 SCAD 0.0109(0.0058) 3.000 0.365 0.000 0.815 0.185 LASSO 0.0390(0.0174) 3.000 4.150 0.000 0.075 0.925 PC-simple 0.0106(0.0055) 2.960 0.040 0.040 0.960 0.000 TPC 0.0106(0.0055) 2.960 0.040 0.040 0.960 0.000 TPC-EBIC 0.0105(0.0054) 2.990 0.015 0.010 0.990 0.000 ρ = 0.8 SCAD 0.0108(0.0061) 3.000 0.360 0.000 0.865 0.135 LASSO 0.0387(0.0178) 3.000 4.120 0.000 0.060 0.940 PC-simple 0.0116(0.0072) 2.875 0.145 0.125 0.875 0.000 TPC 0.0115(0.0071) 2.875 0.140 0.125 0.875 0.000 TPC-EBIC 0.0104(0.0059) 2.960 0.060 0.040 0.955 0.005 ρ = 0.9 SCAD 0.0156(0.0110) 3.000 0.335 0.000 0.845 0.155 LASSO 0.0507(0.0261) 3.000 4.615 0.000 0.035 0.965 PC-simple 0.0295(0.0260) 2.630 0.390 0.360 0.640 0.000 TPC 0.0296(0.0263) 2.625 0.400 0.365 0.635 0.000 TPC-EBIC 0.0144(0.0099) 2.895 0.125 0.100 0.895 0.005 while the true threshold should be T (α, n, κ, |S|). As T (α, n, κ, |S|) is an increasing function of κ, thus we have smaller thresholds with PC-simple algorithm. That will result in the over fitting phenomenon of PC-simple algorithm. With compound symmetric correlation matrix, the over fitting percentage of PC-simple algorithm 69

Table 3.2: Mixture normal samples with AR covariance (p = 20)

Method MedME(Devi) C IC Underfit Correct-fit Over-fit ρ = 0 SCAD 0.0387(0.0211) 3.000 1.385 0.000 0.640 0.360 LASSO 0.1092(0.0470) 3.000 6.950 0.000 0.020 0.980 PC-simple 0.0383(0.0194) 2.965 0.145 0.035 0.840 0.125 TPC 0.0352(0.0198) 2.920 0.000 0.080 0.920 0.000 TPC-EBIC 0.0333(0.0178) 2.975 0.015 0.025 0.965 0.010 ρ = 0.3 SCAD 0.0397(0.0217) 3.000 1.220 0.000 0.665 0.335 LASSO 0.1043(0.0423) 3.000 6.530 0.000 0.020 0.980 PC-simple 0.0372(0.0203) 2.995 0.115 0.005 0.905 0.090 TPC 0.0340(0.0189) 2.985 0.005 0.015 0.985 0.000 TPC-EBIC 0.0340(0.0191) 3.000 0.010 0.000 0.990 0.010 ρ = 0.5 SCAD 0.0399(0.0227) 3.000 1.140 0.000 0.670 0.330 LASSO 0.1016(0.0404) 3.000 6.140 0.000 0.020 0.980 PC-simple 0.0360(0.0201) 2.990 0.075 0.010 0.930 0.060 TPC 0.0348(0.0194) 2.955 0.050 0.045 0.950 0.005 TPC-EBIC 0.0345(0.0191) 2.995 0.020 0.005 0.980 0.015 ρ = 0.6 SCAD 0.0399(0.0232) 3.000 1.070 0.000 0.680 0.320 LASSO 0.0989(0.0407) 3.000 5.805 0.000 0.025 0.975 PC-simple 0.0345(0.0184) 2.985 0.075 0.015 0.930 0.055 TPC 0.0363(0.0201) 2.915 0.085 0.085 0.910 0.005 TPC-EBIC 0.0340(0.0183) 2.995 0.025 0.005 0.975 0.020 ρ = 0.7 SCAD 0.0388(0.0218) 3.000 0.935 0.000 0.720 0.280 LASSO 0.0942(0.0368) 3.000 5.530 0.000 0.025 0.975 PC-simple 0.0362(0.0203) 2.960 0.105 0.040 0.900 0.060 TPC 0.0392(0.0231) 2.865 0.140 0.125 0.870 0.005 TPC-EBIC 0.0328(0.0173) 2.995 0.025 0.005 0.975 0.020 ρ = 0.8 SCAD 0.0411(0.0242) 2.990 0.825 0.010 0.715 0.275 LASSO 0.0934(0.0354) 3.000 5.260 0.000 0.030 0.970 PC-simple 0.0395(0.0240) 2.915 0.190 0.085 0.835 0.080 TPC 0.0444(0.0296) 2.770 0.215 0.215 0.785 0.000 TPC-EBIC 0.0329(0.0186) 2.950 0.050 0.045 0.950 0.005 ρ = 0.9 SCAD 0.0388(0.0249) 2.990 0.560 0.010 0.745 0.245 LASSO 0.0922(0.0362) 2.995 5.110 0.005 0.030 0.965 PC-simple 0.0427(0.0293) 2.820 0.260 0.160 0.765 0.075 TPC 0.0621(0.0505) 2.545 0.370 0.380 0.620 0.000 TPC-EBIC 0.0368(0.0225) 2.905 0.105 0.095 0.880 0.025 is very high. For example, when ρ = 0.5, the proportion that PC-simple algorithm identifies the true active set is only 46%. More likely PC-simple algorithm include some insignificant predictors. However, the correct fitting percentage of TPC is 86%, and TPC-EBIC yields higher correct fitting percentage. 70

Table 3.3: Normal samples with compound symmetric covariance (p = 20)

Method MedME(Devi) C IC Underfit Correct-fit Over-fit ρ = 0.3 SCAD 0.0121(0.0076) 3.000 0.245 0.000 0.900 0.100 LASSO 0.0300(0.0152) 3.000 3.310 0.000 0.205 0.795 PC-simple 0.0117(0.0071) 3.000 0.060 0.000 0.940 0.060 TPC 0.0117(0.0071) 3.000 0.060 0.000 0.940 0.060 TPC-EBIC 0.0109(0.0064) 3.000 0.000 0.000 1.000 0.000 ρ = 0.5 SCAD 0.0127(0.0084) 3.000 0.185 0.000 0.925 0.075 LASSO 0.0384(0.0210) 3.000 3.100 0.000 0.260 0.740 PC-simple 0.0119(0.0077) 3.000 0.045 0.000 0.955 0.045 TPC 0.0116(0.0074) 3.000 0.040 0.000 0.960 0.040 TPC-EBIC 0.0109(0.0066) 3.000 0.000 0.000 1.000 0.000 ρ = 0.6 SCAD 0.0143(0.0095) 3.000 0.110 0.000 0.930 0.070 LASSO 0.0506(0.0296) 3.000 3.175 0.000 0.240 0.760 PC-simple 0.0124(0.0080) 3.000 0.035 0.000 0.965 0.035 TPC 0.0124(0.0080) 3.000 0.035 0.000 0.965 0.035 TPC-EBIC 0.0123(0.0078) 3.000 0.000 0.000 1.000 0.000 ρ = 0.7 SCAD 0.0164(0.0107) 3.000 0.095 0.000 0.935 0.065 LASSO 0.0687(0.0421) 3.000 3.335 0.000 0.220 0.780 PC-simple 0.0157(0.0102) 3.000 0.040 0.000 0.960 0.040 TPC 0.0157(0.0102) 3.000 0.040 0.000 0.960 0.040 TPC-EBIC 0.0152(0.0097) 3.000 0.000 0.000 1.000 0.000 ρ = 0.8 SCAD 0.0238(0.0159) 3.000 0.090 0.000 0.945 0.055 LASSO 0.1060(0.0725) 3.000 3.385 0.000 0.255 0.745 PC-simple 0.0235(0.0164) 3.000 0.070 0.000 0.930 0.070 TPC 0.0234(0.0161) 3.000 0.065 0.000 0.935 0.065 TPC-EBIC 0.0226(0.0147) 3.000 0.000 0.000 1.000 0.000 ρ = 0.9 SCAD 0.0449(0.0344) 3.000 0.120 0.000 0.940 0.060 LASSO 8.8929(0.0795) 1.000 0.000 1.000 0.000 0.000 PC-simple 0.0455(0.0343) 3.000 0.100 0.000 0.905 0.095 TPC 0.0459(0.0347) 3.000 0.105 0.000 0.900 0.100 TPC-EBIC 0.0424(0.0304) 3.000 0.000 0.000 1.000 0.000

Example 2 (Medium dimension): We simulated 200 datasets consisting of 300 observations from the model

y = x^T β + ε,   (3.38)

where x = (x_1, ···, x_p)^T, p = 200, and β = (3, 1.5, 0, 0, 2, 0, ···, 0)^T. Thus the size of the true active set is 3. There are two scenarios for the samples: normal samples and mixture normal samples. Similar to Example 1, two covariance structures

Table 3.4: Mixture normal samples with compound symmetric covariance (p = 20)

Method MedME(Devi) C IC Underfit Correct-fit Over-fit ρ = 0.3 SCAD 0.0357(0.0201) 3.000 1.070 0.000 0.685 0.315 LASSO 0.1056(0.0443) 3.000 6.235 0.000 0.015 0.985 PC-simple 0.0558(0.0280) 3.000 0.475 0.000 0.575 0.425 TPC 0.0337(0.0182) 2.990 0.065 0.010 0.935 0.055 TPC-EBIC 0.0338(0.0185) 3.000 0.065 0.000 0.935 0.065 ρ = 0.5 SCAD 0.0375(0.0206) 3.000 0.890 0.000 0.690 0.310 LASSO 0.1033(0.0414) 3.000 6.075 0.000 0.005 0.995 PC-simple 0.0615(0.0295) 3.000 0.615 0.000 0.460 0.540 TPC 0.0384(0.0201) 3.000 0.145 0.000 0.860 0.140 TPC-EBIC 0.0337(0.0192) 3.000 0.090 0.000 0.915 0.085 ρ = 0.6 SCAD 0.0367(0.0218) 3.000 0.785 0.000 0.705 0.295 LASSO 0.1028(0.0399) 3.000 6.020 0.000 0.000 1.000 PC-simple 0.0619(0.0303) 3.000 0.655 0.000 0.445 0.555 TPC 0.0373(0.0205) 2.995 0.160 0.005 0.840 0.155 TPC-EBIC 0.0338(0.0201) 3.000 0.085 0.000 0.915 0.085 ρ = 0.7 SCAD 0.0369(0.0230) 3.000 0.640 0.000 0.775 0.225 LASSO 0.1013(0.0394) 3.000 6.115 0.000 0.000 1.000 PC-simple 0.0638(0.0298) 3.000 0.720 0.000 0.395 0.605 TPC 0.0423(0.0224) 2.995 0.195 0.005 0.805 0.190 TPC-EBIC 0.0344(0.0199) 3.000 0.095 0.000 0.905 0.095 ρ = 0.8 SCAD 0.0415(0.0259) 2.995 0.520 0.005 0.785 0.210 LASSO 0.0999(0.0387) 3.000 6.090 0.000 0.000 1.000 PC-simple 0.0652(0.0282) 3.000 0.775 0.000 0.360 0.640 TPC 0.0405(0.0230) 2.995 0.210 0.005 0.790 0.205 TPC-EBIC 0.0349(0.0193) 3.000 0.100 0.000 0.900 0.100 ρ = 0.9 SCAD 0.0437(0.0285) 2.990 0.940 0.010 0.715 0.275 LASSO 0.0968(0.0370) 2.995 6.000 0.005 0.005 0.990 PC-simple 0.0677(0.0287) 2.995 0.825 0.005 0.335 0.660 TPC 0.0434(0.0254) 2.970 0.260 0.030 0.735 0.235 TPC-EBIC 0.0357(0.0196) 2.985 0.110 0.015 0.885 0.100 are considered: AR and compound symmetric covariance matrix. We let ρ to vary among 0, 0.3, 0.5, 0.6, 0.7, 0.8 and 0.9. The significance level of PC-simple algorithm, TPC and TPC-EBIC are set to 0.05. Other settings are the same as Example 1. According to Tables 3.5, 3.6, and 3.7, SCAD, PC-simple algorithm and TPC have better performance than LASSO. LASSO has much larger model errors than SCAD, PC-simple algorithm and TPC. This can be expected because LASSO tends 72

Table 3.5: Normal samples with AR correlation matrix (p = 200)

Method MedME(Devi) C IC Underfit Correct-fit Over-fit ρ = 0 SCAD 0.0093(0.0048) 3.000 0.555 0.000 0.710 0.290 LASSO 0.0610(0.0158) 3.000 10.855 0.000 0.015 0.985 PC-simple 0.0089(0.0049) 3.000 0.025 0.000 0.975 0.025 TPC 0.0088(0.0049) 3.000 0.020 0.000 0.980 0.020 TPC-EBIC 0.0086(0.0047) 3.000 0.000 0.000 1.000 0.000 ρ = 0.3 SCAD 0.0080(0.0044) 3.000 0.545 0.000 0.760 0.240 LASSO 0.0548(0.0156) 3.000 9.095 0.000 0.020 0.980 PC-simple 0.0075(0.0044) 3.000 0.035 0.000 0.965 0.035 TPC 0.0075(0.0044) 3.000 0.035 0.000 0.965 0.035 TPC-EBIC 0.0073(0.0043) 3.000 0.000 0.000 1.000 0.000 ρ = 0.5 SCAD 0.0071(0.0037) 3.000 0.480 0.000 0.765 0.235 LASSO 0.0458(0.0160) 3.000 8.395 0.000 0.055 0.945 PC-simple 0.0067(0.0036) 3.000 0.005 0.000 0.995 0.005 TPC 0.0067(0.0036) 3.000 0.005 0.000 0.995 0.005 TPC-EBIC 0.0066(0.0036) 3.000 0.000 0.000 1.000 0.000 ρ = 0.6 SCAD 0.0070(0.0038) 3.000 0.515 0.000 0.785 0.215 LASSO 0.0432(0.0139) 3.000 7.930 0.000 0.045 0.955 PC-simple 0.0064(0.0036) 3.000 0.005 0.000 0.995 0.005 TPC 0.0064(0.0036) 3.000 0.005 0.000 0.995 0.005 TPC-EBIC 0.0063(0.0035) 3.000 0.000 0.000 1.000 0.000 ρ = 0.7 SCAD 0.0067(0.0039) 3.000 0.350 0.000 0.795 0.205 LASSO 0.0407(0.0140) 3.000 7.620 0.000 0.050 0.950 PC-simple 0.0064(0.0035) 3.000 0.000 0.000 1.000 0.000 TPC 0.0064(0.0035) 3.000 0.000 0.000 1.000 0.000 TPC-EBIC 0.0064(0.0035) 3.000 0.000 0.000 1.000 0.000 ρ = 0.8 SCAD 0.0066(0.0039) 3.000 0.330 0.000 0.795 0.205 LASSO 0.0426(0.0160) 3.000 7.440 0.000 0.030 0.970 PC-simple 0.0064(0.0039) 2.985 0.030 0.015 0.985 0.000 TPC 0.0064(0.0039) 2.985 0.030 0.015 0.985 0.000 TPC-EBIC 0.0062(0.0039) 3.000 0.000 0.000 1.000 0.000 ρ = 0.9 SCAD 0.0077(0.0051) 3.000 0.305 0.000 0.830 0.170 LASSO 0.0727(0.0259) 3.000 8.860 0.000 0.005 0.995 PC-simple 0.0109(0.0083) 2.810 0.230 0.190 0.810 0.000 TPC 0.0109(0.0084) 2.805 0.235 0.195 0.805 0.000 TPC-EBIC 0.0080(0.0055) 2.965 0.060 0.035 0.955 0.010 to include more variables, which is implied by Columns “IC” and “Overfit” in all four tables. However, LASSO is the most efficient. This indicates that LASSO does a very rough variable selection, and it will not miss any important predictors. For normal samples and mixture normal samples, the performance of PC-simple 73

Table 3.6: Mixture normal with AR correlation matrix (p = 200)

Method       MedME(Devi)      C      IC      Underfit  Correct-fit  Over-fit
ρ = 0
  SCAD       0.0287(0.0134)   3.000  3.655   0.000     0.565        0.435
  LASSO      0.1463(0.0406)   3.000  29.790  0.000     0.000        1.000
  PC-simple  0.0509(0.0245)   3.000  0.825   0.000     0.355        0.645
  TPC        0.0235(0.0128)   2.980  0.050   0.020     0.945        0.035
  TPC-EBIC   0.0221(0.0120)   3.000  0.030   0.000     0.975        0.025
ρ = 0.3
  SCAD       0.0290(0.0163)   3.000  3.540   0.000     0.570        0.430
  LASSO      0.1347(0.0410)   3.000  27.110  0.000     0.000        1.000
  PC-simple  0.0428(0.0242)   3.000  0.570   0.000     0.515        0.485
  TPC        0.0230(0.0132)   2.995  0.035   0.005     0.965        0.030
  TPC-EBIC   0.0228(0.0126)   3.000  0.015   0.000     0.985        0.015
ρ = 0.5
  SCAD       0.0294(0.0168)   3.000  2.810   0.000     0.610        0.390
  LASSO      0.1259(0.0433)   3.000  23.090  0.000     0.005        0.995
  PC-simple  0.0390(0.0203)   3.000  0.440   0.000     0.615        0.385
  TPC        0.0239(0.0133)   2.990  0.030   0.010     0.970        0.020
  TPC-EBIC   0.0237(0.0130)   3.000  0.010   0.000     0.990        0.010
ρ = 0.6
  SCAD       0.0295(0.0162)   3.000  2.420   0.000     0.635        0.365
  LASSO      0.1196(0.0418)   3.000  20.790  0.000     0.010        0.990
  PC-simple  0.0331(0.0177)   3.000  0.350   0.000     0.680        0.320
  TPC        0.0241(0.0138)   2.985  0.025   0.015     0.980        0.005
  TPC-EBIC   0.0243(0.0139)   3.000  0.025   0.000     0.975        0.025
ρ = 0.7
  SCAD       0.0305(0.0164)   3.000  2.620   0.000     0.625        0.375
  LASSO      0.1095(0.0413)   3.000  17.540  0.000     0.005        0.995
  PC-simple  0.0313(0.0169)   2.990  0.300   0.010     0.740        0.250
  TPC        0.0250(0.0139)   2.970  0.035   0.030     0.970        0.000
  TPC-EBIC   0.0246(0.0136)   2.990  0.030   0.010     0.980        0.010
ρ = 0.8
  SCAD       0.0298(0.0169)   3.000  2.135   0.000     0.645        0.355
  LASSO      0.1024(0.0372)   3.000  15.100  0.000     0.010        0.990
  PC-simple  0.0298(0.0174)   2.975  0.295   0.025     0.745        0.230
  TPC        0.0262(0.0154)   2.905  0.110   0.090     0.900        0.010
  TPC-EBIC   0.0253(0.0137)   2.990  0.035   0.010     0.975        0.015
ρ = 0.9
  SCAD       0.0304(0.0156)   2.995  1.760   0.005     0.620        0.375
  LASSO      0.0941(0.0379)   3.000  13.065  0.000     0.005        0.995
  PC-simple  0.0309(0.0192)   2.925  0.275   0.070     0.770        0.160
  TPC        0.0321(0.0201)   2.760  0.255   0.215     0.770        0.015
  TPC-EBIC   0.0250(0.0134)   2.950  0.080   0.045     0.935        0.020

algorithm, TPC and TPC-EBIC is slightly better than that of SCAD in terms of the median of the model error and the correct-fitting percentage. For normal samples, the PC-simple algorithm, TPC and TPC-EBIC yield similar results. This is because these approaches utilize similar limiting distributions when they form the rejection region

Table 3.7: Mixture normal with compound correlation matrix (p = 200)

Method       MedME(Devi)      C      IC      Underfit  Correct-fit  Over-fit
ρ = 0.3
  SCAD       0.0289(0.0150)   3.000  2.470   0.000     0.605        0.395
  LASSO      0.1340(0.0361)   3.000  25.500  0.000     0.000        1.000
  PC-simple  0.0836(0.0350)   3.000  1.780   0.000     0.080        0.920
  TPC        0.0406(0.0241)   2.990  0.425   0.010     0.615        0.375
  TPC-EBIC   0.0248(0.0140)   3.000  0.080   0.000     0.920        0.080
ρ = 0.5
  SCAD       0.0267(0.0156)   3.000  1.520   0.000     0.640        0.360
  LASSO      0.1400(0.0438)   3.000  24.685  0.000     0.000        1.000
  PC-simple  0.0913(0.0322)   3.000  1.955   0.000     0.020        0.980
  TPC        0.0484(0.0303)   2.995  0.555   0.005     0.520        0.475
  TPC-EBIC   0.0242(0.0160)   3.000  0.135   0.000     0.870        0.130
ρ = 0.6
  SCAD       0.0263(0.0159)   3.000  1.230   0.000     0.665        0.335
  LASSO      0.1402(0.0441)   3.000  24.965  0.000     0.000        1.000
  PC-simple  0.0935(0.0300)   3.000  1.960   0.000     0.040        0.960
  TPC        0.0555(0.0308)   2.995  0.625   0.005     0.470        0.525
  TPC-EBIC   0.0241(0.0168)   3.000  0.155   0.000     0.855        0.145
ρ = 0.7
  SCAD       0.0244(0.0156)   3.000  0.700   0.000     0.705        0.295
  LASSO      0.1384(0.0450)   3.000  24.820  0.000     0.000        1.000
  PC-simple  0.0930(0.0302)   3.000  1.930   0.000     0.045        0.955
  TPC        0.0643(0.0307)   2.995  0.690   0.005     0.420        0.575
  TPC-EBIC   0.0270(0.0192)   3.000  0.220   0.000     0.795        0.205
ρ = 0.8
  SCAD       0.0246(0.0168)   3.000  0.385   0.000     0.850        0.150
  LASSO      0.1383(0.0438)   3.000  24.485  0.000     0.000        1.000
  PC-simple  0.0939(0.0294)   3.000  1.950   0.000     0.040        0.960
  TPC        0.0658(0.0298)   2.990  0.765   0.010     0.385        0.605
  TPC-EBIC   0.0296(0.0208)   2.995  0.255   0.005     0.765        0.230
ρ = 0.9
  SCAD       0.0292(0.0216)   2.985  0.370   0.015     0.780        0.205
  LASSO      0.1398(0.0443)   3.000  24.690  0.000     0.000        1.000
  PC-simple  0.0945(0.0295)   2.995  1.955   0.005     0.040        0.955
  TPC        0.0700(0.0311)   2.975  0.860   0.025     0.330        0.645
  TPC-EBIC   0.0357(0.0266)   2.980  0.360   0.020     0.675        0.305

with normal samples. Focusing on Tables 3.6 and 3.7, the PC-simple algorithm has a very large over-fitting percentage compared to TPC, and thus it has larger model errors than TPC and TPC-EBIC. The correct-fitting percentages of TPC and TPC-EBIC are much higher than those of the PC-simple algorithm for the mixture samples with a compound symmetric covariance matrix, because TPC utilizes a more accurate limiting distribution of the partial correlations. Tables 3.6 and 3.7 show the performance with mixture normal samples under

AR and compound symmetric covariance matrices. Let us focus on ρ = 0.5. For the AR covariance matrix, the correct-fitting percentage of the PC-simple algorithm is 61.5%, while that of TPC is 97%. For the compound symmetric covariance matrix, the performance of the PC-simple algorithm is even worse: it identifies the true active set in only 2% of the 200 replications, and its over-fitting percentage is extremely large, 98%. These results show that adjusting the variance of the partial correlations, as TPC does, is necessary and improves the performance significantly.

Example 3 (High dimension): We simulated 200 datasets consisting of 300 observations from the model

y = x^T β + σε,   (3.39)

where x = (x_1, · · · , x_p)^T, p = 500, the number of nonzero coefficients is 3, and β = (3, 1.5, 0, 0, 2, 0, · · · , 0)^T. We demonstrate that TPC performs better than the PC-simple algorithm for variable selection in the high-dimensional setting. Different from the previous examples, we only use the AR covariance matrix.

From Tables 3.8 and 3.9, LASSO has the worst performance because it tends to include more variables, as implied by the columns "IC" and "Over-fit". However, LASSO needs the least time to obtain its results. In spite of missing some important variables, the thresholded partial correlation approach does not result in a large model error, due to the high correlation between the covariates.

For the normal samples, the performances of the PC-simple algorithm and TPC are similar according to Table 3.8. This is because, with normal samples, the PC-simple algorithm and TPC utilize similar limiting distributions, which leads to similar thresholds. TPC-EBIC performs better than the PC-simple algorithm and TPC. As ρ increases, the PC-simple algorithm and TPC miss a small proportion of the true covariates. SCAD has a smaller correct-fitting percentage than they do.

For the mixture normal samples, the PC-simple algorithm performs differently from TPC and TPC-EBIC. The correct-fitting percentage of the PC-simple algorithm drops to 37% when ρ = 0.5, and its over-fitting percentage increases to 63%. However, TPC and TPC-EBIC can still identify the true active set, with correct-fitting percentages of 95% and 98.5%, respectively. Thus we can claim that TPC and TPC-EBIC perform better than the PC-simple algorithm under the AR covariance structure.
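The following R sketch shows how one replication of this example can be generated. The AR(1) correlation, the dimensions and the coefficient vector follow the description above; the error scale sigma and the particular normal-mixture weights and scales are assumptions made only for illustration, since the exact mixture specification is given elsewhere in the chapter.

# Sketch of one simulated dataset for Example 3 (assumed mixture and error scale)
set.seed(1)
n <- 300; p <- 500
beta  <- c(3, 1.5, 0, 0, 2, rep(0, p - 5))
rho   <- 0.5
Sigma <- rho^abs(outer(1:p, 1:p, "-"))        # AR(1) correlation matrix
R <- chol(Sigma)                              # t(R) %*% R = Sigma

# Normal design: rows of x_norm are N(0, Sigma)
x_norm <- matrix(rnorm(n * p), n, p) %*% R

# Mixture-normal design with positive kurtosis: each row is N(0, Sigma) with
# probability 0.9 and N(0, 9 * Sigma) with probability 0.1 (assumed mixture)
scale_mix <- ifelse(runif(n) < 0.9, 1, 3)
x_mix <- (matrix(rnorm(n * p), n, p) %*% R) * scale_mix

sigma <- 1                                    # assumed error scale
y <- x_mix %*% beta + sigma * rnorm(n)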

3.5.2 Real Data: Cardiomyopathy Microarray Data

In this section, we apply the thresholded partial correlation approach to the cardiomyopathy microarray data, from a transgenic mouse model of dilated cardiomyopathy. This data set was first analyzed by Segal et al. (2003) to assess regression-based approaches to microarray analysis. Their aim was to identify the influential genes that affect the overexpression of a G protein-coupled receptor, called Ro1, in mice. The Ro1 expression level was measured for 30 specimens, and the predictors, the genetic expression levels, were obtained for p = 6,319 genes. The data set was also analyzed by Hall and Miller (2009) and Li et al. (2012). Genes Msa.2134.0 and Msa.1166.0 are reported as influential by the generalized correlation proposed by Hall and Miller (2009), and DC-SIS, proposed by Li et al. (2012), ranks Genes Msa.2134.0 and Msa.2877.0 at the top.

We apply SCAD, LASSO, the PC-simple algorithm, TPC and TPC-EBIC to this data set. Table 3.10 lists the influential genes selected by these approaches; we only list the predictors selected by SCAD, the PC-simple algorithm, TPC and TPC-EBIC. We find that LASSO overfits the data, as it includes 28 predictors, which makes interpretation difficult. SCAD selects 8 out of the 6,319 genes, the PC-simple algorithm identifies 3 genes, and our proposal

Table 3.8: Normal samples with AR correlation matrix (p = 500)

Method       MedME(Devi)      C      IC      Underfit  Correct-fit  Over-fit
ρ = 0
  SCAD       0.0087(0.0043)   3.000  0.785   0.000     0.720        0.280
  LASSO      0.0769(0.0246)   3.000  16.740  0.000     0.005        0.995
  PC-simple  0.0089(0.0049)   3.000  0.095   0.000     0.905        0.095
  TPC        0.0089(0.0051)   3.000  0.100   0.000     0.900        0.100
  TPC-EBIC   0.0079(0.0043)   3.000  0.000   0.000     1.000        0.000
ρ = 0.3
  SCAD       0.0079(0.0043)   3.000  0.785   0.000     0.765        0.235
  LASSO      0.0647(0.0226)   3.000  13.640  0.000     0.020        0.980
  PC-simple  0.0077(0.0045)   3.000  0.035   0.000     0.965        0.035
  TPC        0.0077(0.0045)   3.000  0.035   0.000     0.965        0.035
  TPC-EBIC   0.0075(0.0044)   3.000  0.005   0.000     0.995        0.000
ρ = 0.5
  SCAD       0.0080(0.0044)   3.000  0.930   0.000     0.745        0.255
  LASSO      0.0527(0.0172)   3.000  11.355  0.000     0.025        0.975
  PC-simple  0.0076(0.0044)   3.000  0.025   0.000     0.975        0.025
  TPC        0.0076(0.0044)   3.000  0.025   0.000     0.975        0.025
  TPC-EBIC   0.0071(0.0042)   3.000  0.005   0.000     0.995        0.005
ρ = 0.6
  SCAD       0.0076(0.0045)   3.000  0.935   0.000     0.765        0.235
  LASSO      0.0475(0.0154)   3.000  10.240  0.000     0.040        0.960
  PC-simple  0.0071(0.0041)   3.000  0.010   0.000     0.990        0.010
  TPC        0.0071(0.0041)   3.000  0.010   0.000     0.985        0.015
  TPC-EBIC   0.0070(0.0040)   3.000  0.010   0.000     1.000        0.000
ρ = 0.7
  SCAD       0.0072(0.0044)   3.000  0.765   0.000     0.800        0.200
  LASSO      0.0415(0.0143)   3.000  9.270   0.000     0.035        0.965
  PC-simple  0.0063(0.0039)   3.000  0.005   0.000     0.995        0.005
  TPC        0.0063(0.0039)   3.000  0.005   0.000     0.995        0.005
  TPC-EBIC   0.0063(0.0039)   3.000  0.005   0.000     0.995        0.005
ρ = 0.8
  SCAD       0.0071(0.0044)   3.000  0.665   0.000     0.790        0.210
  LASSO      0.0453(0.0175)   3.000  7.230   0.000     0.030        0.970
  PC-simple  0.0062(0.0042)   2.975  0.035   0.025     0.975        0.000
  TPC        0.0062(0.0042)   2.975  0.035   0.025     0.975        0.000
  TPC-EBIC   0.0060(0.0040)   2.995  0.015   0.005     0.990        0.005
ρ = 0.9
  SCAD       0.0078(0.0053)   3.000  0.415   0.000     0.780        0.220
  LASSO      0.2798(0.0791)   3.000  1.270   0.000     0.245        0.755
  PC-simple  0.0116(0.0091)   2.780  0.270   0.220     0.780        0.000
  TPC        0.0113(0.0089)   2.785  0.260   0.215     0.785        0.000
  TPC-EBIC   0.0075(0.0055)   2.960  0.065   0.040     0.950        0.010

suggests 6 genes. Gene Msa.2877.0 is selected by all the approaches, as well as by the generalized correlation ranking and the DC-SIS approach. TPC also identifies Msa.2134.0, but SCAD and the PC-simple algorithm fail to include this predictor in their final models.

Table 3.9: Mixture normal samples with AR correlation matrix (p = 500)

Method       MedME(Devi)      C      IC      Underfit  Correct-fit  Over-fit
ρ = 0.3
  SCAD       0.0295(0.0178)   3.000  5.715   0.000     0.545        0.455
  LASSO      0.1497(0.0360)   3.000  38.330  0.000     0.000        1.000
  PC-simple  0.0600(0.0293)   3.000  1.095   0.000     0.200        0.800
  TPC        0.0199(0.0119)   3.000  0.060   0.000     0.940        0.060
  TPC-EBIC   0.0195(0.0113)   3.000  0.010   0.000     0.990        0.010
ρ = 0.5
  SCAD       0.0291(0.0182)   3.000  5.235   0.000     0.545        0.455
  LASSO      0.1364(0.0339)   3.000  31.165  0.000     0.000        1.000
  PC-simple  0.0501(0.0274)   3.000  0.825   0.000     0.370        0.630
  TPC        0.0198(0.0115)   2.995  0.050   0.005     0.950        0.045
  TPC-EBIC   0.0196(0.0114)   3.000  0.015   0.000     0.985        0.015
ρ = 0.6
  SCAD       0.0282(0.0170)   3.000  5.010   0.000     0.540        0.460
  LASSO      0.1302(0.0336)   3.000  26.605  0.000     0.000        1.000
  PC-simple  0.0480(0.0293)   3.000  0.670   0.000     0.455        0.545
  TPC        0.0197(0.0117)   2.990  0.055   0.010     0.945        0.045
  TPC-EBIC   0.0196(0.0115)   3.000  0.020   0.000     0.980        0.020
ρ = 0.7
  SCAD       0.0259(0.0159)   3.000  4.710   0.000     0.510        0.490
  LASSO      0.1184(0.0339)   3.000  21.735  0.000     0.005        0.995
  PC-simple  0.0381(0.0245)   3.000  0.475   0.000     0.590        0.410
  TPC        0.0195(0.0116)   2.980  0.055   0.020     0.945        0.035
  TPC-EBIC   0.0193(0.0114)   3.000  0.020   0.000     0.980        0.020
ρ = 0.8
  SCAD       0.0278(0.0170)   3.000  3.960   0.000     0.530        0.470
  LASSO      0.1103(0.0323)   3.000  18.140  0.000     0.005        0.995
  PC-simple  0.0272(0.0194)   2.985  0.380   0.015     0.675        0.310
  TPC        0.0209(0.0135)   2.930  0.105   0.070     0.900        0.030
  TPC-EBIC   0.0199(0.0125)   2.995  0.035   0.005     0.965        0.030
ρ = 0.9
  SCAD       0.0274(0.0155)   3.000  2.770   0.000     0.540        0.460
  LASSO      0.0994(0.0315)   3.000  14.425  0.000     0.015        0.985
  PC-simple  0.0283(0.0196)   2.925  0.400   0.075     0.660        0.265
  TPC        0.0234(0.0158)   2.810  0.205   0.180     0.805        0.015
  TPC-EBIC   0.0210(0.0125)   2.955  0.055   0.045     0.925        0.030

Now let us have a look at the correlation matrix of some predictors, shown in Table 3.11. According to the last row of the table, the response is highly correlated with the genes listed there. The first six genes are selected by TPC, while Msa.1166.0 is claimed as an important predictor by Hall and Miller (2009) and Msa.741.0 is identified as important by the PC-simple algorithm. From the correlation matrix, the response is highly correlated with Msa.1166.0 and Msa.741.0. However, Msa.1166.0 is highly correlated with Msa.2134.0 and Msa.2877.0. If TPC already

Table 3.10: Predictors Selected by SCAD, LASSO, PC-simple Algorithm, TPC and TPC-EBIC

Gene          SCAD  LASSO  PC-simple  TPC  TPC-EBIC
Msa.2877.0    Yes   Yes    Yes        Yes  Yes
Msa.963.0     Yes   Yes    Yes        Yes  Yes
Msa.778.0-i   Yes   Yes    /          Yes  Yes
Msa.2134.0    /     Yes    /          Yes  /
Msa.32461.0   /     Yes    /          Yes  /
Msa.788.0     /     Yes    /          Yes  /
Msa.741.0     /     /      Yes        /    /
Msa.11304.0   Yes   /      /          /    /
Msa.1379.0    Yes   /      /          /    /
Msa.146.0     Yes   /      /          /    /
Msa.1832.0    Yes   /      /          /    /
Msa.2969.0    Yes   /      /          /    /

identifies Msa.2134.0 and Msa.2877.0, the partial correlation between the response and Msa.1166.0 is small, which explains why TPC does not include Msa.1166.0 in its final model. The PC-simple algorithm uses the same mechanism as TPC, except that the two methods use different asymptotic variances of the partial correlations. TPC is more data-driven than the PC-simple algorithm, as the kurtosis is estimated from the samples.

Table 3.11: Part of the correlation matrix

              Msa.2134.0  Msa.2877.0  Msa.32461.0  Msa.778.0-i  Msa.788.0  Msa.963.0  Msa.1166.0  Msa.741.0
Msa.2134.0     1.000       0.740      -0.327        0.442       -0.460      0.486     -0.738       0.600
Msa.2877.0     0.740       1.000      -0.526        0.540       -0.558      0.647     -0.755       0.646
Msa.32461.0   -0.327      -0.526       1.000       -0.385        0.445     -0.534      0.469      -0.441
Msa.778.0-i    0.442       0.540      -0.385        1.000       -0.371      0.582     -0.586       0.773
Msa.788.0     -0.460      -0.558       0.445       -0.371        1.000     -0.500      0.491      -0.304
Msa.963.0      0.486       0.647      -0.534        0.582       -0.500      1.000     -0.602       0.579
Msa.1166.0    -0.738      -0.755       0.469       -0.586        0.491     -0.602      1.000      -0.660
Msa.741.0      0.600       0.646      -0.441        0.773       -0.304      0.579     -0.660       1.000
y              0.773       0.868      -0.640        0.739       -0.646      0.779     -0.750       0.765

As for the overall performance of the approaches, LASSO includes many predictors in the model. Although LASSO has the smallest model error and the largest adjusted R², it yields a very large model, which makes interpretation difficult.

SCAD and TPC have similar performance on this data set, and they identify three common important predictors: Msa.2877.0, Msa.2134.0 and Msa.778.0-i. The PC-simple algorithm only picks 3 predictors out of the 6,319 genes. Although the PC-simple algorithm yields the most parsimonious model, its model error is large compared with those of SCAD and TPC. This is due to the drawback that the PC-simple algorithm underestimates the variances of the marginal correlations and partial correlations.

Table 3.12: Comparison of the performances

Approach    Size  Adjusted R²
SCAD        8     89.22%
LASSO       28    97.36%
PC-simple   3     85.88%
TPC         6     93.80%
TPC-EBIC    3     86.09%

3.6 Conclusion

In this chapter, we mainly study the PC-simple algorithm proposed by Buhlmann et al. (2010) under a more general statistical setting, and we call the resulting method the thresholded partial correlation (TPC) approach. First, we provide a general description of the thresholded partial correlation approach. In general, it is hard to derive the variances of the sample correlations and partial correlations, but for elliptically distributed samples these variances have explicit expressions; we therefore develop the thresholded partial correlation approach for elliptical samples. Second, the asymptotic consistency of variable selection is established. Third, we conduct extensive simulation studies of the thresholded partial correlation approach with elliptical distributions. The simulation results indicate that TPC performs as well as the PC-simple algorithm when the true kurtosis is close to 0, and it outperforms the PC-simple algorithm when the kurtosis is far from 0. Meanwhile, it outperforms LASSO and is comparable to the penalized approach with the SCAD penalty. Finally, we apply TPC to the cardiomyopathy microarray data set. Based on this application to real-world data, we conclude that TPC performs as well as SCAD, and outperforms LASSO and the PC-simple algorithm.

3.7 Lemmas and Technical Proofs

For all the proofs in this section, we denote by C a generic constant depending on the context, which can vary from line to line. We also introduce the following lemmas, which are used repeatedly in the proofs of Theorems 3.13 and 3.14.

3.7.1 Lemmas

Lemma 3.15. (Hoeffding’s Inequality) Assume the independent random sample

{X_i : i = 1, · · · , n} satisfies P(X_i ∈ [a_i, b_i]) = 1 for some a_i and b_i, for all i = 1, · · · , n. Then, for any ε > 0, we have

P(|\bar{X} - E(\bar{X})| > \epsilon) \le 2\exp\left\{-\frac{2n^2\epsilon^2}{\sum_{i=1}^{n}(b_i - a_i)^2}\right\}.   (3.40)

Lemma 3.16. Let \hat{\theta}_n be an estimate of the p-dimensional parameter θ, and let f : R^p → R be a function of θ that is continuous in θ and has bounded first and second derivatives. That is, there exist M_1, M_2 > 0 such that

\sup_{\theta\in H}\left\|\frac{\partial f(\theta)}{\partial\theta}\right\|_2 \le M_1, \qquad \sup_{\theta\in H}\left\|\frac{\partial^2 f(\theta)}{\partial\theta\partial\theta^T}\right\|_1 \le M_2,   (3.41)

where \|(v_1, · · · , v_p)^T\|_2 = \sqrt{v_1^2 + · · · + v_p^2}, \|A\|_1 = \max\{\lambda_1, · · · , \lambda_p\}, and \lambda_1, · · · , \lambda_p are the eigenvalues of the matrix A. Assume \hat{\theta}_n \xrightarrow{P} \theta; that is, for all ε > 0, P(\|\hat{\theta}_n - \theta\|_2 > \epsilon) \to 0. Then

P(|f(\hat{\theta}_n) - f(\theta)| > \epsilon) \le P(\|\hat{\theta}_n - \theta\|_2 > C\epsilon),   (3.42)

for some C > 0.

Proof. Since \hat{\theta}_n \xrightarrow{P} \theta and f has first and second derivatives, by Taylor's expansion,

f(\hat{\theta}_n) = f(\theta) + \frac{\partial f(\theta)}{\partial\theta}(\hat{\theta}_n - \theta) + (\hat{\theta}_n - \theta)^T\frac{\partial^2 f(\theta)}{\partial\theta\partial\theta^T}(\hat{\theta}_n - \theta) + o(\hat{\theta}_n - \theta).   (3.43)

Following that, by the Cauchy–Schwarz inequality,

|f(\hat{\theta}_n) - f(\theta)| \le \left\|\frac{\partial f(\theta)}{\partial\theta}\right\|_2\|\hat{\theta}_n - \theta\|_2 + \left\|\frac{\partial^2 f(\theta)}{\partial\theta\partial\theta^T}\right\|_1\|\hat{\theta}_n - \theta\|_2^2 + o(\|\hat{\theta}_n - \theta\|_2^2)
  \le M_1\|\hat{\theta}_n - \theta\|_2 + M_2\|\hat{\theta}_n - \theta\|_2^2 + o(\|\hat{\theta}_n - \theta\|_2^2)
  \le (M_1 + M_2 + \tfrac{1}{2})\|\hat{\theta}_n - \theta\|_2
  \triangleq \frac{1}{C}\|\hat{\theta}_n - \theta\|_2,

where C = 1/(M_1 + M_2 + \tfrac{1}{2}). Thus, for all ε > 0,

P(|f(\hat{\theta}_n) - f(\theta)| > \epsilon) \le P\left(\frac{1}{C}\|\hat{\theta}_n - \theta\|_2 > \epsilon\right) = P(\|\hat{\theta}_n - \theta\|_2 > C\epsilon),   (3.44)

which completes the proof.

Lemma 3.17. Suppose that X is a random variable with E(ea|X|) < ∞ for some a > 0. Then for any M > 0, there exist positive constants b and c such that

P(|X| \ge M) \le b e^{-cM}.   (3.45)

Proof. The proof can be found in Liu et al. (2013).

Lemma 3.18. Assume γ_1 and γ_2 are two positive parameters, and \hat{\gamma}_1 and \hat{\gamma}_2 are estimates of γ_1 and γ_2 based on n samples. Suppose there exist positive constants C_1, · · · , C_4 and M = n^q, q > 0, such that for j = 1, 2,

P\{|\hat{\gamma}_j - \gamma_j| > \epsilon\} \le C_4\exp(-C_3 n^{1-4q}\epsilon^2) + C_2 n\exp(-C_1 n^q).

Then we can obtain some other useful inequalities:

P\{|\hat{\gamma}_1\hat{\gamma}_2 - \gamma_1\gamma_2| > \epsilon\} \le C_4\exp(-C_3 n^{1-4q}\epsilon^2) + C_2 n\exp(-C_1 n^q),
P\{|\hat{\gamma}_1 - \hat{\gamma}_2 - (\gamma_1 - \gamma_2)| > \epsilon\} \le C_4\exp(-C_3 n^{1-4q}\epsilon^2) + C_2 n\exp(-C_1 n^q),
P\{|\hat{\gamma}_1/\hat{\gamma}_2 - \gamma_1/\gamma_2| > \epsilon\} \le C_4\exp(-C_3 n^{1-4q}\epsilon^2) + C_2 n\exp(-C_1 n^q),
P\{|\sqrt{\hat{\gamma}_1} - \sqrt{\gamma_1}| > \epsilon\} \le C_4\exp(-C_3 n^{1-4q}\epsilon^2) + C_2 n\exp(-C_1 n^q).

Proof. These statements can be proved in a similar way, so we only show the detailed proof for the ratio \hat{\gamma}_1/\hat{\gamma}_2. We first prove that \hat{\gamma}_2 is bounded away from 0 with probability tending to 1. We can assume γ_1, γ_2 > 2; then

P (|γˆ2| < γ2/2) = P (|γ2 − (γ2 − γˆ2)| < γ2/2)

≤ P (γ2 − |γˆ2 − γ2| < γ2/2)

≤ P (|γˆ2 − γ2| > γ2/2) ≤ P (|γˆ2 − γ2| > )

1−4q 2 q ≤ C4 exp(−C3n  ) + C2n exp(−C1n ).

Then

  γˆ1 γ1 P − >  γˆ2 γ2   γˆ1 γ1 γ1 γ1 = P − + − >  γˆ2 γˆ2 γˆ2 γ2 84

    γˆ1 γ1 γ1 γ1 ≤ P − > /2 + P − > /2 γˆ2 γˆ2 γˆ2 γ2     |γˆ2| γ1 ≤ P |γˆ1 − γ1| > + P | ||γˆ2 − γ2| > /2 2 γ2γˆ2     |γˆ2| γ2|γˆ2| ≤ P |γˆ1 − γ1| > + P |γˆ2 − γ2| > . (3.46) 2 2γ1

Note that

 |γˆ | P |γˆ − γ | > 2 1 1 2  |γˆ |   |γˆ |  = P |γˆ − γ | > 2 , |γˆ | ≥ γ /2 + P |γˆ − γ | > 2 , |γˆ | < γ /2 1 1 2 2 2 1 1 2 2 2  |γˆ |   |γˆ |  ≤ P |γˆ − γ | > 2 , |γˆ | ≥ γ /2 + P |γˆ − γ | > 2 , |γˆ | < γ /2 1 1 2 2 2 1 1 2 2 2 n γ o ≤ P |γˆ − γ | > 2 + P {|γˆ | < γ /2} 1 1 4 2 2 1−4q 2 q ≤ C4 exp(−C3n  ) + C2n exp(−C1n ).

n |γˆ2| o 1−4q 2 q Similarly, P |γˆ2 − γ2| > 2 ≤ C4 exp(−C3n  )+C2n exp(−C1n ). Plugging the above two inequalities into (3.46), we can obtain

      γˆ1 γ1 |γˆ2| γ2|γˆ2| P − >  ≤ P |γˆ1 − γ1| > + P |γˆ2 − γ2| > γˆ2 γ2 2 2γ1 1−4q 2 q ≤ C4 exp(−C3n  ) + C2n exp(−C1n ).

3.7.2 Proof of Theorem 3.8

Proof. Without loss of generality, we assume Exj = Ey = 0.

1 Pn x y − ( 1 Pn x )( 1 Pn y ) ρˆ(y, x ) = n i=1 ij i n i=1 ij n i=1 i j  1/2  1 Pn 2 1 Pn 2  1 Pn 2 1 Pn 2 n i=1 xij − ( n i=1 xij) n i=1 yi − ( n i=1 yi) 85

mxj y − mxj my 4 Sxj y = 1/2 = ,   Sx Sy  2 2 j mxj xj − (mxj ) {myy − (my) }

1 Pn 1 Pn 1 Pn 1 Pn 2 where mxj = n i=1 xij, my = n i=1 yi, mxj y = n i=1 xijyi, mxj xj = n i=1 xij, m = 1 Pn y2, S2 = m − m2 , S2 = m − m2, and S = m − m m . yy n i=1 i xj xj xj xj y yy y xj y xj y xj y By Central Limit Theory, we have

    mx 0  j         my   0  "     # √ D    2  n  m  −  σ  −→ N5(0, Σ1),  xj xj   xj       m   σ2   yy   y      mxy σxj y where   2 2 cov(xj, xj) cov(xj, y) cov(xj, xj ) cov(xj, y ) cov(xj, xjy)    2 2   cov(y, xj) cov(y, y) cov(y, x ) cov(y, y ) cov(y, xjy)   j   2 2 2 2 2 2 2  Σ1 =  cov(x , x ) cov(x , y) cov(x , x ) cov(x , y ) cov(x , x y)  .  j j j j j j j j     cov(y2, x ) cov(y2, y) cov(y2, x2) cov(y2, y2) cov(y2, x y)   j j j    2 2 cov(xjy, xj) cov(xjy, y) cov(xjy, xj ) cov(xjy, y ) cov(xjy, xjy)

 2  u3 − u1   Let g(u , u , u , u , u ) =  2  , andg ˙ = ∂g , then by 1 2 3 4 5  u4 − u2  0 ∂u (0,0,σ2 ,σ2,σ )   xj y xj y u5 − u1u2 delta method, we can have the following result:

    S2 σ2  xj xj  √     D n  2  −  2  −→ N (0, Σ =g ˙ 0 Σ g˙ ), (3.47)  Sy   σy  3 2 0 1 0    

Sxj y σxj y 86

 2 2 2 2 2  cov(xj , xj ) cov(xj , y ) cov(xj , xjy)   where Σ =g ˙ T Σ g˙ =  2 2 2 2 2  . 2 0 1 0  cov(y , xj ) cov(y , y ) cov(y , xjy)    2 2 cov(xjy, xj ) cov(xjy, y ) cov(xjy, xjy)

ρ ρ √v3 ˙ ∂h 1 T Let h(v1, v2, v3) = , then h0 = = (− 2 , − 2 , − ) . By v1v2 ∂ν 2σx 2σy σxj σy (σ2 ,σ2,σ ) j xj y xj y delta method again, we have

√ D ∗ ˙ T ˙ n {ρˆ(y, xj) − ρ(y, xj)} −→ N(0, Σ = h0 Σ2h0), (3.48) where

( ) ρ2 cov(x2, x2) cov(y2, y2) 2cov(x2, y2) Σ∗ = j j + + j 4 σ4 σ4 σ2 σ2 xj y xj y ( 2 2 ) cov(x , xjy) cov(x y, y ) cov(x y, x y) −ρ j + j + j j σ3 σ σ σ3 σ2 σ2 xj y xj y xj y

3.7.3 Proof of Equation (3.3)

Theorem 3.19. \hat{\rho}(y, x_j|x_S) does not depend on the choice of k, where

\hat{\rho}(y, x_j|x_S) = \frac{\hat{\rho}(y, x_j|x_{S\setminus\{k\}}) - \hat{\rho}(y, x_k|x_{S\setminus\{k\}})\,\hat{\rho}(x_j, x_k|x_{S\setminus\{k\}})}{\sqrt{\{1 - \hat{\rho}^2(y, x_k|x_{S\setminus\{k\}})\}\{1 - \hat{\rho}^2(x_j, x_k|x_{S\setminus\{k\}})\}}},   (3.49)

with k ∈ S and k ≠ j.

Proof. We prove this fact by mathematical induction. For ease of presentation, we use the following notational convention: let ρ_{yj} = ρ(y, x_j), ρ_{ij} = ρ(x_i, x_j), ρ_{yj|S} = ρ(y, x_j|x_S), and ρ_{ij|S} = ρ(x_i, x_j|x_S).

For partial correlation of order 1, we have the following result:

\rho_{yj|k} = \frac{\rho_{yj} - \rho_{yk}\rho_{jk}}{\sqrt{(1 - \rho_{yk}^2)(1 - \rho_{jk}^2)}}.   (3.50)

1. m = 2, we want to prove the following two expressions are the same.

ρyj|i − ρyk|iρjk|i ρyj|k − ρyi|kρji|k ρyj|ik = or . (3.51) q 2 2 q 2 2 (1 − ρyk|i)(1 − ρjk|i) (1 − ρyi|k)(1 − ρji|k)

We want to show the above two expressions are the same. By substituting the partial correlation of order 1 in the first expression of (3.51) with (3.50), we obtain

√ ρyj −ρyiρji √ ρyk−ρyiρki √ ρjk−ρjiρki 2 2 − 2 2 2 2 ρyj|i − ρyk|iρjk|i (1−ρyi)(1−ρji) (1−ρyi)(1−ρki) (1−ρji)(1−ρki) q = r r (1 − ρ2 )(1 − ρ2 ) √ ρyk−ρyiρki 2 √ ρjk−ρjiρki 2 yk|i jk|i 1 − [ 2 2 ] 1 − [ 2 2 ] (1−ρyi)(1−ρki) (1−ρji)(1−ρki)

2 (ρyj − ρyiρji)(1 − ρki) − (ρyk − ρyiρki)(ρjk − ρjiρki) 4 LN1 = q q = . 2 2 2 2 2 2 LD1 (1 − ρji)(1 − ρki) − (ρjk − ρjiρki) (1 − ρyi)(1 − ρki) − (ρyk − ρyiρki)

Then

2 LN1 = (ρyj − ρyiρji)(1 − ρki) − (ρyk − ρyiρki)(ρjk − ρjiρki)

2 = ρyj(1 − ρki) − ρyiρji − ρykρjk + ρki(ρykρji + ρyiρjk), and

2 2 2 2 2 2 2 LD1 = [(1 − ρji)(1 − ρki) − (ρjk − ρjiρki) ][(1 − ρyi)(1 − ρki) − (ρyk − ρyiρki) ]

2 2 2 2 2 2 = [1 − ρji − ρki − ρjk + 2ρjkρjiρki][1 − ρyi − ρki − ρyk + 2ρykρyiρki].

Let

ρyj|i − ρyk|iρjk|i 4 LN2 q = . 2 2 LD2 (1 − ρyk|i)(1 − ρjk|i) 88

In the similar way, we can obtain

2 LN2 = ρyj(1 − ρki) − ρykρjk − ρyiρji + ρki(ρyiρjk + ρykρji) = LN1,

2 2 2 2 2 2 2 2 LD2 = [1 − ρjk − ρki − ρji + 2ρjiρjkρki][1 − ρyk − ρki − ρyi + 2ρyiρykρki] = LD2.

Therefore, the two expressions in (3.51) are the same. 2. Now assume the statement holds for m = 3, 4, 5, ··· , n − 1. 3. m = n, and we want to prove for partial correlation of order n, the statement holds. Let us assume |S| = n. We need to show

ρ(y, xj|xS\{k}) − ρ(y, xk|xS\{k})ρ(xj, xk|xS\{k}) ρ(y, xj|xS ) = (3.52) p 2 2 {1 − ρ (y, xk|xS\{k})}{1 − ρ (xj, xk|xS\{k})} does not depend on the choice of k.

Now, treating x_{S\setminus\{k\}} as a whole and repeating the argument used for the case m = 2, we obtain the analogous result. This implies that the partial correlation of order n does not depend on the choice of k.
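The recursion (3.49)–(3.50) is straightforward to implement. The R sketch below computes a sample partial correlation of any order from the full sample correlation matrix by peeling off one conditioning variable at a time; by Theorem 3.19 the result does not depend on which variable is peeled off first. The function name is illustrative and not part of any existing package.

# Minimal sketch: sample partial correlation via the recursion (3.49)
pcor_rec <- function(a, b, S, R) {
  # a, b: variable indices; S: integer vector of conditioning indices;
  # R: sample correlation matrix of all variables (response and covariates)
  if (length(S) == 0) return(R[a, b])
  k  <- S[1]
  S1 <- S[-1]
  r_ab <- pcor_rec(a, b, S1, R)
  r_ak <- pcor_rec(a, k, S1, R)
  r_bk <- pcor_rec(b, k, S1, R)
  (r_ab - r_ak * r_bk) / sqrt((1 - r_ak^2) * (1 - r_bk^2))
}

# Example usage: partial correlation of variables 1 and 2 given variables 3 and 4
# dat <- cbind(y, X); R <- cor(dat); pcor_rec(1, 2, c(3, 4), R)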

3.7.4 Proof of Theorem 3.13

Proof. To make the proof readable, we divide it into six steps. For any ε > 0, we assume ε < 1.

Step 1: Study P{|\hat{\rho}(y, x_j) - \rho(y, x_j)| > \epsilon}.

Without loss of generality, we assume Ex_j = Ey = 0. Then

1 Pn x y − ( 1 Pn x )( 1 Pn y ) ρˆ(y, x ) = n i=1 ij i n i=1 ij n i=1 i . j  1/2 1 Pn 2 1 Pn 2 1 Pn 2 1 Pn 2 { n i=1 xij − ( n i=1 xij) }{ n i=1 yi − ( n i=1 yi) } 89

1 Pn 1 Pn 1 Pn 1) Study P { n i=1 xijyi − ( n i=1 xij)( n i=1 yi) − cov(xj, y) > }. Note that

n n n 1 X 1 X 1 X x y − ( x )( y ) − cov(x , y) n ij i n ij n i j i=1 i=1 i=1 n n n 1 X 1 X 1 X = x y − ( x )( y ) − (Ex y − Ex Ey) n ij i n ij n i j j i=1 i=1 i=1 ( n ) ( n n ) 1 X 1 X 1 X 4 = x y − Ex y − ( x )( y ) − Ex Ey =I + I . n ij i j n ij n i j 1 2 i=1 i=1 i=1

For I1,

n 1 X P (|I | > ) = P (| x y − E(x y)| > ) 1 n ij i j i=1 n 1 X = P (| x y − E(x y)| > , |x | ≤ M, |y | ≤ M, for all i) n ij i j ij i i=1 n 1 X + P (| x y − E(x y)| > , |x | > M, or |y | > M, for some i) n ij i j ij i i=1 4 = I11 + I12, where M will be decided later. By Hoeffding’s inequality,

2n22 n2 |I | ≤ 2 exp{− } = 2 exp(− ). 11 n(2M 2)2 2M 4

From (D2) and Lemma 3.17, there exist two constants C1 and C2, such that for

∀i, j, P (|xij| ≤ M) ≤ C2 exp(−C1M), and P (|yi| ≤ M) ≤ C2 exp(−C1M), then

|I12| ≤ nP (|xij| ≤ M) + nP (|yi| ≤ M) ≤ 2nC2 exp(−C1M). 90

n2 Hence, P (|I1| > ) ≤ 2 exp(− 2M 4 ) + 2nC2 exp(−C1M).

For I2,

n n n 1 X 1 X 1 X P (|I2| > ) = P ( ( xij)( yi) − Exj( yi) > ) n n n i=1 i=1 i=1 n n 1 X 1 X = P ( xij − Exj yi > ). n n i=1 i=1

n 1 X P ( xij − Exj > ) n i=1 n 1 X = P ( xij − Exj > , xij ≤ M, for all i) n i=1 n 1 X + P ( xij − Exj > , xij > M, for some i). n i=1

By Hoeffding’s inequality, we have

n 1 X 2n22 P ( xij − Exj > , xij ≤ M, for all i) ≤ 2 exp(− ). n n(2M)2 i=1

Meanwhile,

n 1 X P ( xij − Exj > , xij > M, for some i) ≤ nP ( xij > M) ≤ nC2 exp(−C1M). n i=1

Thus

n 1 X n2 P ( xij − Exj > ) ≤ 2 exp(− ) + nC2 exp(−C1M). n 2M 2 i=1 91

Similarly, as Ey = 0, we have

n n 1 X 1 X n2 P ( yi − Ey > ) = P ( yi > ) ≤ 2 exp(− ) + nC2 exp(−C1M). n n 2M 2 i=1 i=1

Thus,

n n 1 X 1 X P (|I2| > ) = P ( xij − Exj yi > ) n n i=1 i=1 n n n 1 X 1 X 1 X = P ( xij − Exj yi > , xij − Exj > ) n n n i=1 i=1 i=1 n n n 1 X 1 X 1 X + P ( xij − Exj yi > , xij − Exj ≤ ) n n n i=1 i=1 i=1 n n 1 X 1 X ≤ P ( xij − Exj > ) + P ( yi > 1) n n i=1 i=1 n n 1 X 1 X ≤ P ( xij − Exj > ) + P ( yi > ) n n i=1 i=1 n2 ≤ 4 exp(− ) + 2nC exp(−C M). 2M 2 2 1

Therefore,

( n n n ) 1 X 1 X 1 X P xijyi − ( xij)( yi) − cov(xj, y) >  n n n i=1 i=1 i=1

= P ( I1 + I2 > ) ≤ P ( I1 > /2) + P ( I2 > /2) C n2 ≤ C exp(− 3 ) + C n exp(−C M). 4 M 4 2 1

Let M = nq, where q > 0 and will be decided later. Then

( n n n ) 1 X 1 X 1 X P xijyi − ( xij)( yi) − cov(xj, y) >  n n n i=1 i=1 i=1 1−4q 2 q ≤ C4 exp(−C3n  ) + C2n exp(−C1n ). 92

2) Similarly,

( n n ) 1 X 2 1 X 2 1−4q 2 q P x − ( xij) − var(xj) >  ≤ C4 exp(−C3n  ) + C2n exp(−C1n ), n ij n i=1 i=1 ( n n ) 1 X 2 1 X 2 1−4q 2 q P y − ( yi) − var(y) >  ≤ C4 exp(−C3n  ) + C2n exp(−C1n ). n i n i=1 i=1

3) By Lemma 3.18, given var(y) > 0, var(xj) > 0, for all j, we have

1−4q 2 q P (|ρˆ(y, xj) − ρ(y, xj)| > ) ≤ C4 exp(−C3n  ) + C2n exp(−C1n ).

Step 2: Study P (|ρˆ(y, xj|xS ) − ρ(y, xj|S )| > ). Note that for ∀j 6= k,

ρˆ(y, xj) − ρˆ(y, xk)ˆρ(xj, xk) ρˆ(y, xj|xk) = (3.53) 2 2 1/2 [{1 − ρˆ (y, xk)}{1 − ρˆ (xj, xk)}]

√ u1−u√2u3 is a function ofρ ˆ(y, xj),ρ ˆ(y, xk), ansρ ˆ(xj, xk). Let g1(u1, u2, u3) = 2 2 , 1−u2 1−u3 and u1, u2, u3 ∈ (−1, 1), then all the first and second derivatives are bounded from

1 given that u2 and u3 are bounded from 1. By (D4), we know that the population version of partial correlations of order smaller than or equal to the size of the true active set will be bounded from 1, then by Lemma 3.16,

 P ρˆ(y, xj|xk) − ρ(y, xj|xk) >   = P g1(ˆρ(y, xj), ρˆ(y, xk), ρˆ(xj, xk)) − g1(ρ(y, xj), ρ(y, xk), ρ(xj, xk)) >         ρˆ(y, xj) ρ(y, xj)        ≤ P   −   > C  ρˆ(y, xk)   ρ(y, xk)           ρˆ(xj, xk) ρ(xj, xk)  √ √ ≤ P { ρˆ(y, xj) − ρ(y, xj) > C/ 3} + P { ρˆ(y, xk) − ρ(y, xk) > C/ 3} 93

√ + P { ρˆ(xj, xk) − ρ(xj, xk) > C/ 3}

 1−4q 2 q ≤ 3 C4 exp(−C3n  ) + C2n exp(−C1n ) .

Following the same argument, we can have for any S ⊆ {j}c, we have

 |S|  1−4q 2 q P ρˆ(y, xj|xS ) − ρ(y, xj|xS ) >  ≤ 3 C4 exp(−C3n  ) + C2n exp(−C1n ) .

The right-hand side converges to 0 for a proper choice of q; one such choice is q = 1/5.

Step 3: Under the elliptical assumption in (D1), we can use the following sample version of the marginal kurtosis,

p  1 Pn 4  1 X (xij − x¯j) κˆ = n i=1 − 1 , n 3{ 1 Pn (x − x¯ )2}2 j=1 n i=1 ij j to estimate the true kurtosis. Similar to the proof in Step 1, we can obtain the following inequality:

1−4q 2 q P {|κˆ − κ| > } ≤ C4 exp(−C3n  ) + C2n exp(−C1n ).

ˆ Zn(y,xj |xS ) Zn(y,xj |xS ) Step 4: Study: P ( √ − √ > ). 1+ˆκ 1+κ √1 1+u  Define g2(u, v) = 2 1+v log 1−u , u ∈ (−1, 1), v ∈ (−1, +∞), then

ˆ Zn(y, xj|xS ) Zn(y, xj|xS ) √ = g2(ˆρn(y, xj|xS ), κˆ), and √ = g2(ρn(y, xj|xS ), κ). 1 +κ ˆ 1 + κ

All the first and second derivatives are continuous and bounded for u ∈ (−τ, τ), v ∈ (−δ, +∞). By Lemma 3.16 and (D4),

ˆ Zn(y, xj|xS ) Zn(y, xj|xS ) P ( √ − √ > ) 1 +κ ˆ 1 + κ 94

       ρˆn(y, xj|xS ) ρn(y, xj|xS ) 

≤ P   −   > C  κˆ κ  √ √ ≤ P (|ρˆn(y, xj|xS ) − ρn(y, xj|xS )| > C/ 2) + P (|κˆ − κ| > C/ 2)

|S|  1−4q 2 q ≤ (3 + 1) C4 exp(−C3n  ) + C2n exp(−C1n )

|S|  1−4q 2 q ≤ 3 C4 exp(−C3n  ) + C2n exp(−C1n ) .

Step 5: Compute P (Ej|S ). When testing the jth predictor given S ⊆ {j}c, denote the event

Ej|S = {an error occurs when testing ρn(y, xj|xS ) = 0}

I II = Ej|S ∪ Ej|S ,

I II where Ej|S denotes the type I error while Ej|S represents the type II error. I 1) For Ej|S .

( ˆ ) I 1/2 Zn(y, xj|xS ) −1 E = (n − |S| − 1) √ > Φ (1 − αn/2) when Zn(y, xj|xS ) = 0 . j|S 1 +κ ˆ ( ˆ ) I 1/2 Zn(y, xj|xS ) −1 Then P (E ) = P (n − S| − 1) √ > Φ (1 − αn/2) when Zn(y, xj|xS ) = 0 j|S 1 +κ ˆ ( ˆ ) 1/2 Zn(y, xj|xS ) Zn(y, xj|xS ) −1 ≤ P (n − |S| − 1) √ − √ > Φ (1 − αn/2) 1 +κ ˆ 1 + κ ( ˆ r ) Zn(y, xj|xS ) Zn(y, xj|xS ) n cn = P √ − √ > √ 1 +κ ˆ 1 + κ n − |S| − 1 2 1 + κ ( ˆ ) Zn(y, xj|xS ) Zn(y, xj|xS ) cn ≤ P √ − √ > √ 1 +κ ˆ 1 + κ 2 1 + κ |S|  1−4q 2 q ≤ 3 C4 exp(−C3n  ) + C2n exp(−C1n ) ,

p n cn by choosing αn = 2{1 − Φ( 1+κ 2 )}. 95

II 2) For Ej|S .

( ˆ ) II 1/2 Zn(y, xj|xS ) −1 E = (n − |S| − 1) √ ≤ Φ (1 − αn/2) when Zn(y, xj|xS ) 6= 0 . j|S 1 +κ ˆ

p n cn By choosing αn = 2{1 − Φ( 1+κ 2 )}, we can get the following inequality:

( ˆ ) II 1/2 Zn(y, xj|xS ) −1 P (E ) = P (n − |S| − 1) √ ≤ Φ (1 − αn/2) when Zn(y, xj|xS ) 6= 0 j|S 1 +κ ˆ ( ˆ r ) Zn(y, xj|xS ) n cn = P √ ≤ √ when Zn(y, xj|xS ) 6= 0 1 +κ ˆ n − |S| − 1 2 1 + κ ( ˆ r ) Zn(y, xj|xS ) Zn(y, xj|xS ) Zn(y, xj|xS ) n cn ≤ P √ − √ − √ ≤ √ 1 + κ 1 +κ ˆ 1 + κ n − |S| − 1 2 1 + κ ( ˆ r ) Zn(y, xj|xS ) Zn(y, xj|xS ) Zn(y, xj|xS ) n cn = P √ − √ ≥ √ − √ . 1 +κ ˆ 1 + κ 1 + κ n − |S| − 1 2 1 + κ

1 1 Let g3(u) = 2 log{(1 + u)/(1 − u)}, then |g3(u)| = | 2 log{(1 + u)/(1 − u)}| ≥ |u|,

for all u ∈ (−1, 1), and according to (D4), |ρn(y, xj|xS )| ≥ cn, then

( ˆ r ) II Zn(y, xj|xS ) Zn(y, xj|xS ) cn n cn P (E ) ≤ P √ − √ ≥ √ − √ j|S 1 +κ ˆ 1 + κ 1 + κ n − |S| − 1 2 1 + κ ( ˆ  r ) Zn(y, xj|xS ) Zn(y, xj|xS ) cn n 1 ≤ P √ − √ ≥ √ 1 − 1 +κ ˆ 1 + κ 1 + κ n − |S| − 1 2 ( ˆ ) Zn(y, xj|xS ) Zn(y, xj|xS ) 3cn ≤ P √ − √ ≥ √ 1 +κ ˆ 1 + κ 8 1 + κ |S|  1−4q 2 q ≤ 3 C4 exp(−C3n  ) + C2n exp(−C1n ) ,

q n 5 as for large n, n−|S|−3 ≤ 4 . Combining the results from (1) and (2), we get the following results:

I II P (Ej|S ) = P (Ej|S ) + P (Ej|S ) 96

|S|  1−4q 2 q ≤ 3 C4 exp(−C3n  ) + C2n exp(−C1n ) .

ˆ Step 6: Prove P {An = An} → 1.

c Now consider all j = 1, ··· , pn and all S ⊆ {j} subject to |S| ≤ mn, define

mn c Kj = {S ⊆ {j} , |S| ≤ mn}, j = 1, ··· , pn.

ˆ P {An 6= An} = P {an error occurs for some j and some S}    [  = P Ej|S  mn  j=1,··· ,pn;S∈Kj X ≤ P (Ej|S ) mn j=1,··· ,pn;S∈Kj mn ≤ pn(pn) sup P (Ej|S ) mn j=1,··· ,pn;S∈Kj

mn |S|  1−4q 2 q ≤ pn(pn) 3 C4 exp(−C3n  ) + C2n exp(−C1n )

mn mn  1−4q 2 q ≤ pn(pn) 3 C4 exp(−C3n  ) + C2n exp(−C1n )

mn+1  1−4q 2 q ≤ n(3pn) C4 exp(−C3n  ) + C2n exp(−C1n ) ,

where C1, ··· ,C4 are constants. The second inequality holds since the number of

possible choices of j is p_n and there are at most p_n^{m_n} possible choices for S. We take the logarithm to analyze the above expression.

ˆ log{P (An 6= An)}

 1−4q 2 q ≤ log(n) + (mn + 1) log(3pn) + log C4 exp(−C3n  ) + C2n exp(−C1n )

 1−4q 2 q ≤ 2(mn + 1) log(pn) + log C4 exp(−C3n  ) + C2n exp(−C1n ) . (3.54)

x −4q 2 Note that exp(−x) ≤ 1 − 2 , for x ∈ [0, log 2]. Let 0 < q < 1, then n cn → 0, q q−1 −4q 2 C1n n → 0 as n → +∞. Then for large n, 0 < C3n cn < log 2, and 0 < n < 97 log 2, thus

1−4q 2  −4q 2 n  −4q 2 n exp(−C3n cn) = exp(−C3n cn) ≤ 1 − C3n cn , q n  C n  n exp(−C nq) = exp(− 1 ) ≤ 1 − C n−(1−q) . 1 n 1

By plugging them into (3.54), we get

n no ˆ  −4q 2 n  −(1−q) log{P (An 6= An)} ≤ 2(mn + 1) log(pn) + log C4 1 − C3n cn + C2 1 − C1n .

(3.55)

Similar to Lemma 3 in Buhlmann et al. (2010), it can be shown that P (m ˆ reach,n = mreach,n) → 1, as n → +∞. And mreach,n ≤ peffn, thusm ˆ reach,n ≤ peffn. a a b b −d For pn = O(exp(n )) = C5 exp(n ), peffn = O(n ) = C6n , and cn = O(n ), replace mn withm ˆ reach,n in (3.55), we can obtain the following result:

ˆ log{P (An 6= An)}

b 5  −2d−4q n −(1−q) n ≤ 2(C6n + 1) log{C5 exp(n )} + log C4(1 − C3n ) + C2(1 − C1n )

a+b  −2d−4q n −(1−q) n ≤ C5n + log C4(1 − C3n ) + C2(1 − C1n ) .

1−2d 1−a−b−2d 1−a−b−2d Since 2d + 5(a + b) < 1, then 5 < 4 . Let (1 − 2d)/5 ≤ q < 4 , −2d−4q −(1−q) −2d−4q −(1−q) then n < n , thus 1 − C3n ≥ 1 − C1n . Then

ˆ a+b  −2d−4q n log{P (An 6= An)} ≤ C5n + log 2C4(1 − C3n )

a+b −2d−4q ≤ C5n + n log(1 − C3n ) + log(2C4).

Note that for large n, na+b na+b n4q −2d−4q ≈ −2d−4q ≈ − 1−a−b−2d → 0, and n log(1 − C3n ) nC3n (−1 + o(1)) C3n 98

−2d−4q −2d−4q 1−2d−4q n log(1 − C3n ) = nC3n (−1 + o(1)) = C3n (−1 + o(1)) → −∞, then

ˆ a+b −2d−4q log{P (An 6= An)} ≤ C5n + n log(1 − C3n ) + log(2C4)  a+b  −2d−4q C5n = n log(1 − C3n ) −2d−4q + 1 + log(2C4) n log(1 − C3n ) → −∞.

ˆ That is P (An 6= An) → 0, as n → +∞.

3.7.5 Proof of Theorem 3.14

Proof. Only consider the first step of the thresholded partial correlation approach; that is, S = ∅, then use the arguments in Step 4 of the proof of Theorem 3.13; that is,

ˆ Zn(y, xj) Zn(y, xj) 1−4q 2 q P ( √ − √ > ) ≤ C4 exp(−C3n  ) + C2n exp(−C1n ). 1 +κ ˆ 1 + κ

Define

II Ej = {fail to include xj when xj is a true predictor} ( ˆ ) 1/2 Zn(y, xj) −1 = (n − 1) √ ≤ Φ (1 − αn/2) when βj 6= 0 . 1 +κ ˆ

Similar to the proof of Theorem 3.13,

( ˆ ) II Zn(y, xj) Zn(y, xj) 3cn P (E ) ≤ P √ − √ ≥ √ j 1 +κ ˆ 1 + κ 8 1 + κ 1−4q 2 q ≤ C4 exp(−C3n cn) + C2n exp(−C1n ). 99

Furthermore,

P\{\hat{A}_n^{[1]} \nsupseteq A_n\} = P\left\{\bigcup_{j=1}^{p_n} E_j^{II}\right\}
  \le \sum_{j=1}^{p_n} P(E_j^{II})
  \le p_n\left\{C_4\exp(-C_3 n^{1-4q}c_n^2) + C_2 n\exp(-C_1 n^q)\right\}.

For p_n = O(\exp(n^a)), write p_n = C_5\exp(n^a); then the above inequality becomes

P\{\hat{A}_n^{[1]} \nsupseteq A_n\} \le C_4\exp(-C_3 n^{1-4q}c_n^2 + n^a) + C_2 n\exp(-C_1 n^q + n^a)
  \le C_4\exp(-C_3 n^{1-4q-2d} + n^a) + C_2\exp(-C_1 n^q + n^a + \log n).

Since 0 < d < (1 - 5a)/2, we have a < (1 - 2d - a)/4. Let q ∈ (a, (1 - 2d - a)/4); then 1 - 4q - 2d > a and n^q > n^a + \log(n). Hence,

P\{\hat{A}_n^{[1]} \nsupseteq A_n\} \le C_4\exp(-C_3 n^{1-4q-2d} + n^a) + C_2\exp(-C_1 n^q + n^a + \log(n)) \to 0,

as n → +∞. Therefore,

P\{\hat{A}_n^{[1]} \supseteq A_n\} \ge 1 - \left\{C_4\exp(-C_3 n^{1-4q-2d} + n^a) + C_2\exp(-C_1 n^q + n^a + \log(n))\right\}

→ 1, as n → +∞. That completes the proof.

Chapter 4

Thresholded Partial Correlation on Partial Residuals for Variable Selection in Partially Linear Models

4.1 Introduction

The partially linear model is one of the most popular semiparametric regression models. Although parametric models provide a parsimonious description of the relationship between the response and the covariates, they introduce modeling biases. To avoid the restrictions of a parametric form, nonparametric models, including varying-coefficient models and functional linear models, have been proposed in the literature. However, nonparametric models may be too flexible to provide accurate conclusions. Partially linear models, or semiparametric models, are good alternatives, since they retain desirable features of both the parametric and nonparametric models.

Though there are many works on variable selection for linear models, limited work has been done on variable selection for partially linear models. Variable selection for partially linear models is challenging, as it involves several interrelated estimation and selection problems: nonparametric estimation, smoothing parameter selection, and variable selection and estimation for the linear covariates. Fan and Li (2004) proposed a penalized profile approach to select significant variables in partially linear models; by using the profile technique, it avoids repeating the estimation of the nonparametric part for each submodel. This reduces the computational cost dramatically, and hence it may be implemented to select significant variables in large-dimensional partially linear models.

In this chapter, we study how to apply the thresholded partial correlation approach described in Chapter 3 to partially linear models. We call it the thresholded partial correlation on partial residuals (TPC-PR) approach. First we transform the original partially linear model into a linear model by applying the partial residual technique. Then we utilize the thresholded partial correlation approach to conduct variable selection in the resulting linear model and obtain the estimated active set. Next, we estimate the coefficients by regressing the transformed response on the smoothed covariates in the estimated active set. Last, we update the estimate of the nonparametric part by substituting the estimates of the regression coefficients into the original model. Our simulation studies and real data analysis suggest that the thresholded partial correlation on partial residuals approach performs as well as the penalized approach on partial residuals with the SCAD penalty and outperforms the one with the LASSO penalty.

The rest of the chapter is organized as follows. In Section 2, we introduce some preliminaries for our proposal, including the model and objectives. Then we propose the population and sample versions of the thresholded partial correlation on partial residuals approach in Section 3. After that, we show the consistency of variable selection of our proposal. Following that, we derive the bias, variance and asymptotic normality of the estimate of the nonparametric part. In Section 5, we conduct simulation studies in low and high dimensions with covariates following normal distributions and mixture normal distributions. Finally, we illustrate our approach by analyzing the Istanbul stock exchange data set. The conclusion is provided in Section 6. All the proofs are given in Section 7.

4.2 Preliminaries

In this section, we first introduce the model, the objective, and some notation.

4.2.1 Model and Objective

Consider the partially linear model,

y = g(u) + x^T\beta + \epsilon,   (4.1)

where y is the response variable, g(u) is an unspecified baseline function of u, x is the p × 1 covariate vector, and β is a vector of unknown regression coefficients. The random error ε satisfies E(ε|u) = E(ε|x_j) = 0 for j = 1, · · · , p, and E(ε²) < +∞. In this model, the regression function depends on x linearly, while its dependence on u is left unspecified through the unknown function g(·).

To be precise, we assume that (u_1, x_1^T, y_1), · · · , (u_n, x_n^T, y_n) are independently and identically distributed samples from model (4.1). Let y = (y_1, · · · , y_n)^T be the n-vector of responses, u = (u_1, · · · , u_n)^T, g(u) = (g(u_1), · · · , g(u_n))^T, X = (x_1, · · · , x_n)^T = (X_1, · · · , X_p) be the n × p covariate matrix, and ε = (ε_1, · · · , ε_n)^T be the vector of random errors. It follows that

y = g(u) + X\beta + \epsilon.   (4.2)
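As an illustration, the following R sketch generates data from a partially linear model of the form (4.2). The baseline function g(u) = sin(2πu), the error scale, and the covariate design are assumptions chosen only to make the model concrete; they are not the settings used in the numerical studies.

# Sketch: data from a partially linear model (assumed g, design, and error scale)
set.seed(2)
n <- 200; p <- 10
beta <- c(2, -1.5, rep(0, p - 2))
u <- runif(n)                                  # index variable on [0, 1]
X <- matrix(rnorm(n * p), n, p)
g <- function(u) sin(2 * pi * u)               # assumed nonparametric part
y <- g(u) + X %*% beta + rnorm(n, sd = 0.5)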

Let A = {j : β_j ≠ 0} ⊂ {1, · · · , p} be the true active set, and let peff = |A|. Our objective is to identify the significant features in the parametric part and to obtain estimates of both the parametric and nonparametric components based on the n observations. Note that E(y|u) = g(u) + E(x^T|u)\beta + E(\epsilon|u). Thus

y - E(y|u) = \{x - E(x|u)\}^T\beta + \epsilon - E(\epsilon|u).   (4.3)

Therefore,

\begin{pmatrix} \mathrm{cov}\{y - E(y|u), x_1 - E(x_1|u)\} \\ \vdots \\ \mathrm{cov}\{y - E(y|u), x_p - E(x_p|u)\} \end{pmatrix}
= \Sigma_{x,u}\beta + \begin{pmatrix} \mathrm{cov}\{x_1 - E(x_1|u), \epsilon - E(\epsilon|u)\} \\ \vdots \\ \mathrm{cov}\{x_p - E(x_p|u), \epsilon - E(\epsilon|u)\} \end{pmatrix},

where

\Sigma_{x,u} = \begin{pmatrix} \mathrm{cov}\{x_1 - E(x_1|u), x_1 - E(x_1|u)\} & \cdots & \mathrm{cov}\{x_1 - E(x_1|u), x_p - E(x_p|u)\} \\ \vdots & \ddots & \vdots \\ \mathrm{cov}\{x_p - E(x_p|u), x_1 - E(x_1|u)\} & \cdots & \mathrm{cov}\{x_p - E(x_p|u), x_p - E(x_p|u)\} \end{pmatrix}.

Note that for j = 1, · · · , p, E(ε|u) = E(ε|x_j) = 0, so

\mathrm{cov}\{x_j - E(x_j|u), \epsilon - E(\epsilon|u)\}
  = E[\{x_j - E(x_j|u)\}\{\epsilon - E(\epsilon|u)\}] - E\{x_j - E(x_j|u)\}E\{\epsilon - E(\epsilon|u)\}
  = E(x_j\epsilon) - E\{E(x_j|u)E(\epsilon|u)\} = E\{E(x_j\epsilon|x_j)\}
  = E\{x_j E(\epsilon|x_j)\} = 0,

and thus

\begin{pmatrix} \mathrm{cov}\{x_1 - E(x_1|u), \epsilon - E(\epsilon|u)\} \\ \vdots \\ \mathrm{cov}\{x_p - E(x_p|u), \epsilon - E(\epsilon|u)\} \end{pmatrix} = 0.

We define the following two assumptions:

(C1) Σx,u is positive definite.

(C2) {β_j : j ∈ A} ∼ f(b)db, where f(·) denotes the density, on a subset of R^{peff}, of an absolutely continuous distribution with respect to Lebesgue measure.

Under assumption (C1), we have

\begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} = \Sigma_{x,u}^{-1}\begin{pmatrix} \mathrm{cov}\{y - E(y|u), x_1 - E(x_1|u)\} \\ \vdots \\ \mathrm{cov}\{y - E(y|u), x_p - E(x_p|u)\} \end{pmatrix}.   (4.4)

To simplify the above expression, we adopt the following shorthand notation:

m_y(u) ≜ E(y|u),  m_j(u) ≜ E(x_j|u),
x_j^* ≜ x_j - E(x_j|u),  y^* ≜ y - E(y|u),  x_S^* ≜ \{x_k - E(x_k|u) : k ∈ S\},
\rho(y^*, x_j^*|x_S^*) ≜ \rho\{y - m_y(u), x_j - m_j(u) \mid x_k - m_k(u), k ∈ S\}.

Thus, for the partially linear model (4.1) with assumptions (C1) and (C2),

\beta_j = 0 \iff \rho\{y - m_y(u), x_j - m_j(u) \mid x_k - m_k(u), k ∈ \{j\}^c\} = 0
        \iff \rho\{y^*, x_j^* \mid x_k^*, k ∈ \{j\}^c\} = 0.   (4.5)

4.2.2 Partial Faithfulness

Now, we study the conditions to guarantee partial faithfulness in the partially linear models, which is stated in the following theorem.

Theorem 4.1. Consider partially linear model (4.1) satisfying assumptions (C1) and (C2), then (x∗T , y∗) satisfies partial faithfulness almost surely with respect to the distribution generating the non-zero regression coefficients.

A direct consequence of Theorem 4.1 and (4.5) is as follows:

Corollary 4.2. Considering partially linear model (4.1), assume that Σx,u > 0 and (x∗T , y∗) satisfies partial faithfulness. Then for j = 1, ··· , p,

\rho(y^*, x_j^*|x_S^*) ≠ 0 for all S ⊆ \{j\}^c if and only if \beta_j ≠ 0.   (4.6)

Equivalently,

\rho(y^*, x_j^*|x_S^*) = 0 for some S ⊆ \{j\}^c if and only if \beta_j = 0.   (4.7)

4.3 Thresholded Partial Correlation on Partial Residuals Approach

Now we study how the thresholded partial correlation approach can be employed for variable selection in partially linear models. In this section, we first focus on the population version. After that, we develop the sample version.

4.3.1 Population Version

We want to identify the significant covariates from model (4.3):

y - E(y|u) = \{x - E(x|u)\}^T\beta + \epsilon - E(\epsilon|u).

First, consider |S| = 0 in expression (4.7). Let β_j = 0 if ρ(y^*, x_j^*) = 0. This shows that we remove from the active set the smoothed covariates that are not marginally correlated with the smoothed response. Hence we are performing marginal correlation screening, and the first-step active set can be expressed as

A^{[1]} = \{j = 1, · · · , p : \rho(y^*, x_j^*) ≠ 0\}.   (4.8)

Because (x^{*T}, y^*) satisfies partial faithfulness, the first-step active set contains the true active set; that is, A ⊆ A^{[1]}.

Second, let |S| = 1 in (4.7). For every j ∈ A^{[1]}, we let β_j = 0 if we can find k ∈ A^{[1]} such that ρ(y^*, x_j^*|x_k^*) = 0. In this step, we only need to consider the partial correlations of the covariates that are in the first-step active set A^{[1]}. The second-step active set A^{[2]} is obtained by screening the partial correlations of order one, and it can be written as

A^{[2]} = \{j ∈ A^{[1]} : \rho(y^*, x_j^*|x_k^*) ≠ 0 for all k ∈ A^{[1]}\setminus\{j\}\}.   (4.9)

After that, we continue to screen the partial correlations of higher orders to get the active set. A^{[m]} is the mth-step active set, obtained by screening the partial correlations of order m − 1. Thus A^{[m]} can be expressed as

A^{[m]} = \{j ∈ A^{[m-1]} : \rho(y^*, x_j^*|x_S^*) ≠ 0 for all S ⊆ A^{[m-1]}\setminus\{j\} with |S| = m - 1\}.   (4.10)

We stop this algorithm when |A^{[m]}| ≤ m or A^{[m]} stops changing. Define

m_{reach} = \min\{m : |A^{[m]}| ≤ m\}.

The population version of the thresholded partial correlation on partial residuals approach can correctly select important covariates. This is stated in the following theorem.

Theorem 4.3. For the partially linear model (4.1) satisfying conditional partial faithfulness and Σ_{x,u} > 0, the population version identifies the true underlying active set; that is,

A^{[m_{reach}]} = A = \{j = 1, · · · , p : \beta_j ≠ 0\}.   (4.11)

The proof is similar to the proof of Theorem 3 of Buhlmann et al. (2010).

4.3.2 Sample Version

In this section, we develop our proposal for variable selection based on the population version. In practice, we do not know the population correlations between y − E(y|u) and x_j − E(x_j|u), j = 1, · · · , p, so we use the corresponding sample correlations. Based on nonparametric regression techniques, we can obtain estimates of E(y|u) and E(x_j|u).

For u in a neighborhood of u0, it follows by the Taylor expansion that

m_y(u) ≈ m_y(u_0) + m_y'(u_0)(u - u_0) ≜ b_0 + b_1(u - u_0).   (4.12)

Let K(·) be a kernel function and h a bandwidth. We want to find (\hat{b}_0, \hat{b}_1) that minimizes

\sum_{i=1}^{n}\{y_i - b_0 - b_1(u_i - u_0)\}^2 K_{h_y}(u_i - u_0),   (4.13)

where K_h(\cdot) = h^{-1}K(\cdot/h). Then

(\hat{b}_0, \hat{b}_1)^T = \arg\min_{b_0, b_1}\sum_{i=1}^{n}\{y_i - b_0 - b_1(u_i - u_0)\}^2 K_{h_y}(u_i - u_0)
  = \{Z(u)^T W(u, h_y)Z(u)\}^{-1}Z(u)^T W(u, h_y)y,

where

Z(u) = \begin{pmatrix} 1 & u_1 - u \\ \vdots & \vdots \\ 1 & u_n - u \end{pmatrix},  and  W(u, h) = \mathrm{diag}\{K_h(u_1 - u), · · · , K_h(u_n - u)\}.

Let

S(h) = \begin{pmatrix} (1, 0)\{Z^T(u_1)W(u_1, h)Z(u_1)\}^{-1}Z^T(u_1)W(u_1, h) \\ \vdots \\ (1, 0)\{Z^T(u_n)W(u_n, h)Z(u_n)\}^{-1}Z^T(u_n)W(u_n, h) \end{pmatrix};

then

(\hat{E}(y|u_1), · · · , \hat{E}(y|u_n))^T = S(h_y)y.
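The construction of S(h) can be sketched in a few lines of R. The helper below builds the smoothing matrix row by row exactly as defined above; the Epanechnikov kernel is an assumed choice, the function name is illustrative, and no attempt is made to select the bandwidth (which must be large enough that each local fit is well defined).

# Sketch: local linear smoothing matrix S(h) with an assumed Epanechnikov kernel
local_linear_S <- function(u, h, K = function(t) 0.75 * pmax(1 - t^2, 0)) {
  n <- length(u)
  S <- matrix(0, n, n)
  for (i in 1:n) {
    Z <- cbind(1, u - u[i])                   # Z(u_i)
    w <- K((u - u[i]) / h) / h                # K_h(u_k - u_i)
    A <- solve(t(Z) %*% (w * Z), t(Z) * rep(w, each = 2))
    S[i, ] <- A[1, ]                          # (1, 0) {Z'WZ}^{-1} Z'W
  }
  S
}

# Smoothed conditional mean of y given u: local_linear_S(u, h_y) %*% y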

The matrix S(h_y) is called the smoothing matrix; it depends on the observations {u_i, i = 1, · · · , n} and on y. Similarly, we can apply the above local linear approximation to estimate

E(xj|u), j = 1, ··· , p. For j = 1, ··· , p,

(\hat{E}(x_j|u_1), · · · , \hat{E}(x_j|u_n))^T = S(h_j)X_j,

where matrix S(hj) depends on the observations {ui, i = 1, ··· , n} and Xj. Denote

\tilde{y} ≜ (\tilde{y}_1, · · · , \tilde{y}_n)^T = (I - S(h_y))y,   (4.14)
\tilde{X} ≜ (\tilde{x}_1, · · · , \tilde{x}_n)^T = \{(I - S(h_1))X_1, · · · , (I - S(h_p))X_p\},   (4.15)

and treat them as the new response and covariates. The new model can be expressed as

\tilde{y}_i ≈ \tilde{x}_i^T\beta + \tilde{\epsilon}_i,  i = 1, · · · , n,  or  \tilde{y} ≈ \tilde{X}\beta + \tilde{\epsilon}.   (4.16)

Notice that models (4.1) and (4.16) share the same β. Hence, our objective is to identify significant features in model (4.16). Based on {(x˜i, y˜i), i = 1, ··· , n}, the sample correlations can be expressed as

\hat{\rho}(y^*, x_j^*) = \frac{n^{-1}\sum_{i=1}^{n}\tilde{x}_{ij}\tilde{y}_i}{\{(n^{-1}\sum_{i=1}^{n}\tilde{y}_i^2)(n^{-1}\sum_{i=1}^{n}\tilde{x}_{ij}^2)\}^{1/2}} = \frac{\langle (I - S(h_y))y, (I - S(h_j))X_j\rangle}{\|(I - S(h_y))y\|\,\|(I - S(h_j))X_j\|},   (4.17)

where ⟨·, ·⟩ denotes the inner product between two vectors and ‖·‖ the Euclidean norm. These are the sample versions of the correlations between y − E(y|u) and x_j − E(x_j|u), j = 1, · · · , p. The following theorem is essential to establish our proposal.
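A sketch of (4.14)–(4.17) in R is given below. It relies on the hypothetical helper local_linear_S from the previous sketch to form the partial residuals and then computes the sample correlations exactly as in (4.17); the bandwidths are taken as given, with one bandwidth per covariate.

# Sketch: partial residuals and the sample correlations in (4.17)
partial_residual_cor <- function(u, X, y, h_y, h_x) {
  # h_x: scalar bandwidth or a vector of bandwidths, one per covariate
  n <- length(y); p <- ncol(X)
  y_tilde <- (diag(n) - local_linear_S(u, h_y)) %*% y
  X_tilde <- sapply(1:p, function(j) {
    hj <- if (length(h_x) == 1L) h_x else h_x[j]
    (diag(n) - local_linear_S(u, hj)) %*% X[, j]
  })
  rho_hat <- as.vector(crossprod(y_tilde, X_tilde)) /
    (sqrt(sum(y_tilde^2)) * sqrt(colSums(X_tilde^2)))
  list(y_tilde = y_tilde, X_tilde = X_tilde, rho_hat = rho_hat)
}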

Theorem 4.4. Suppose that (v_1, x_1^T, y_1), · · · , (v_n, x_n^T, y_n) are i.i.d. samples drawn from EC_{p+2}(\mu, \Sigma, \phi), and let u_i = \psi(v_i) for some \psi(\cdot) such that the σ-field generated by u_i is the same as that generated by v_i. Then for any j = 1, · · · , p and S ⊆ \{j\}^c, under regularity conditions (D3)–(D7), if h_y → 0, h_j → 0, nh_y^3 → +∞, and nh_j^3 → +∞, then \hat{\rho}(y^*, x_j^*|x_S^*) is a consistent estimate of \rho(y^*, x_j^*|x_S^*). That is,

\hat{\rho}(y^*, x_j^*|x_S^*) \xrightarrow{P} \rho(y^*, x_j^*|x_S^*),   (4.18)

where S ⊆ \{j\}^c. Moreover,

\sqrt{n - |S| - 3}\,\{\hat{\rho}(y^*, x_j^*|x_S^*) - \rho(y^*, x_j^*|x_S^*)\} \xrightarrow{D} N\big(0, (1 + \kappa)\{1 - \rho^2(y^*, x_j^*|x_S^*)\}^2\big).   (4.19)

The first half of the proof of Theorem 4.4 is given in Section 7. The second half will be discussed in Chapter 5. The procedure of our proposal can be developed according to the cardinality of S in (4.6) and (4.7). To select the important predictors, we need to test

H_0 : \rho(y^*, x_j^*|x_S^*) = 0  against  H_A : \rho(y^*, x_j^*|x_S^*) ≠ 0.   (4.20)

Applying Fisher’s Z-transform, we get

\hat{Z}(y^*, x_j^*|x_S^*) = \frac{1}{2}\log\frac{1 + \hat{\rho}(y^*, x_j^*|x_S^*)}{1 - \hat{\rho}(y^*, x_j^*|x_S^*)}.   (4.21)

The null hypothesis H0 is rejected if

(n - |S| - 1)^{1/2}\,\frac{|\hat{Z}(y^*, x_j^*|x_S^*)|}{\sqrt{1 + \hat{\kappa}}} > \Phi^{-1}(1 - \alpha/2),   (4.22)

where |S| is the cardinality of S, α is the significance level, Φ is the cumulative distribution function of the standard normal distribution, and \hat{\kappa} is the sample estimate of the kurtosis:

\hat{\kappa} = \frac{1}{p}\sum_{j=1}^{p}\left\{\frac{n^{-1}\sum_{i=1}^{n}(\tilde{x}_{ij} - \bar{\tilde{x}}_j)^4}{3\{n^{-1}\sum_{i=1}^{n}(\tilde{x}_{ij} - \bar{\tilde{x}}_j)^2\}^2} - 1\right\}.

Define

T(\alpha, n, \kappa, |S|) = \frac{\exp\left\{\dfrac{2\sqrt{1+\kappa}\,\Phi^{-1}(1-\alpha/2)}{\sqrt{n-|S|-1}}\right\} - 1}{\exp\left\{\dfrac{2\sqrt{1+\kappa}\,\Phi^{-1}(1-\alpha/2)}{\sqrt{n-|S|-1}}\right\} + 1}.   (4.23)

T(α, n, κ, |S|) depends on the significance level α, the sample size n, the kurtosis of the distribution, and the cardinality of S. Due to the monotonicity of the Z-transform, the rejection region is

|\hat{\rho}(y^*, x_j^*|x_S^*)| > T(\alpha, n, \hat{\kappa}, |S|).   (4.24)
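The following R sketch computes the pooled kurtosis estimate, the threshold T(α, n, κ, |S|) in (4.23), and the resulting decision rule (4.24). The function names are illustrative; they are not part of an existing package.

# Sketch: kurtosis estimate, threshold (4.23), and decision rule (4.24)
kurtosis_hat <- function(X_tilde) {
  centered <- scale(X_tilde, center = TRUE, scale = FALSE)
  m2 <- colMeans(centered^2)
  m4 <- colMeans(centered^4)
  mean(m4 / (3 * m2^2) - 1)
}

threshold_T <- function(alpha, n, kappa, s) {
  z <- 2 * sqrt(1 + kappa) * qnorm(1 - alpha / 2) / sqrt(n - s - 1)
  (exp(z) - 1) / (exp(z) + 1)
}

reject_H0 <- function(rho_hat, alpha, n, kappa, s) {
  abs(rho_hat) > threshold_T(alpha, n, kappa, s)
}

# e.g. threshold_T(0.05, n = 100, kappa = 0.4, s = 0)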

Now, we are ready to select the variables based on the smoothed covariates and responses.

First, we screen the sample marginal correlations between y^* and x_j^*. Let S = ∅ in (4.7). We will remove the jth covariate from the estimated active set if ρ(y^*, x_j^*) = 0. Since we do not know the true correlation between y^* and x_j^*, we rely on Fisher's Z-test to test whether the population correlation is zero or not. That is, we test H_0 : ρ(y^*, x_j^*) = 0. From (4.24), if

|\hat{\rho}(y^*, x_j^*)| \le T(\alpha, n, \hat{\kappa}, 0),   (4.25)

then we fail to reject H_0 : ρ(y^*, x_j^*) = 0, and we set \hat{\beta}_j = 0. Based on screening all the marginal sample correlations between y^* and x_j^*, we can build the Step 1 estimated active set:

\hat{A}^{[1]} = \{j = 1, · · · , p : |\hat{\rho}(y^*, x_j^*)| > T(\alpha, n, \hat{\kappa}, 0)\} ⊇ A.

Second, we test the first-order partial correlations between the smoothed response and the smoothed covariates in the Step 1 estimated active set \hat{A}^{[1]}. For each j ∈ \hat{A}^{[1]}, according to (4.7), we fail to reject H_0 : ρ(y^*, x_j^*|x_k^*) = 0 if there exists some k ∈ \hat{A}^{[1]}\setminus\{j\} such that

|\hat{\rho}(y^*, x_j^*|x_k^*)| \le T(\alpha, n, \hat{\kappa}, 1),   (4.26)

where \hat{\rho}(y^*, x_j^*|x_k^*) can be obtained via

\hat{\rho}(y^*, x_j^*|x_k^*) = \frac{\hat{\rho}(y^*, x_j^*) - \hat{\rho}(y^*, x_k^*)\hat{\rho}(x_j^*, x_k^*)}{[\{1 - \hat{\rho}^2(y^*, x_k^*)\}\{1 - \hat{\rho}^2(x_j^*, x_k^*)\}]^{1/2}};

in that case we let \hat{\beta}_j = 0. That is, we screen all the sample partial correlations of order one and establish the Step 2 estimated active set:

\hat{A}^{[2]} = \{j ∈ \hat{A}^{[1]} : |\hat{\rho}(y^*, x_j^*|x_k^*)| > T(\alpha, n, \hat{\kappa}, 1), \forall k ∈ \hat{A}^{[1]}\setminus\{j\}\} ⊆ \hat{A}^{[1]}.

For each j ∈ \hat{A}^{[m-1]}, we get all sample partial correlations of order m − 1 by

\hat{\rho}(y^*, x_j^*|x_S^*) = \frac{\hat{\rho}(y^*, x_j^*|x_{S\setminus\{k\}}^*) - \hat{\rho}(y^*, x_k^*|x_{S\setminus\{k\}}^*)\hat{\rho}(x_j^*, x_k^*|x_{S\setminus\{k\}}^*)}{[\{1 - \hat{\rho}^2(y^*, x_k^*|x_{S\setminus\{k\}}^*)\}\{1 - \hat{\rho}^2(x_j^*, x_k^*|x_{S\setminus\{k\}}^*)\}]^{1/2}}

for some k ∈ S ⊆ \hat{A}^{[m-1]}\setminus\{j\} with |S| = m − 1. For testing H_0 : ρ(y^*, x_j^*|x_S^*) = 0, we fail to reject H_0 if we can find S ⊆ \hat{A}^{[m-1]}\setminus\{j\} with |S| = m − 1 such that

|\hat{\rho}(y^*, x_j^*|x_S^*)| \le T(\alpha, n, \hat{\kappa}, m - 1),   (4.27)

and then we set \hat{\beta}_j = 0. After screening all the sample partial correlations of order m − 1 between y^* and \{x_j^* : j ∈ \hat{A}^{[m-1]}\}, we can establish the Step m estimated active set \hat{A}^{[m]} as follows:

\hat{A}^{[m]} = \{j ∈ \hat{A}^{[m-1]} : |\hat{\rho}(y^*, x_j^*|x_S^*)| > T(\alpha, n, \hat{\kappa}, m - 1), \forall S ⊆ \hat{A}^{[m-1]}\setminus\{j\}, |S| = m - 1\}.   (4.28)

Repeat the above procedure until \hat{A}^{[m_{reach}]} is obtained, where m_{reach} = \min\{m : |\hat{A}^{[m]}| \le m\}.

After obtaining the estimated active set \hat{A}^{[m_{reach}]}, we can estimate the regression coefficients. For j ∉ \hat{A}^{[m_{reach}]}, we set \hat{\beta}_j = 0. For j ∈ \hat{A}^{[m_{reach}]}, the \hat{\beta}_j's are obtained via least squares with \{(\tilde{y}_i, \tilde{x}_{ij} : j ∈ \hat{A}^{[m_{reach}]}), i = 1, 2, · · · , n\}. By plugging \hat{\beta} into \hat{g}_{\hat{\beta}}(u) = S(y - X\hat{\beta}), we get the estimate of the nonparametric function g(u). The pseudo code is provided as follows.

Algorithm 4: (\hat{g}(\cdot), m, \hat{A}^{[m]}) = TPC-PR(u, X, y)

1. get the partial residuals of X_j and y, i.e., \tilde{X}_j = (I - S(h_j))X_j and \tilde{y} = (I - S(h_y))y;

2. set m = 1, do marginal correlation screening and build the Step 1 estimated active set

   \hat{A}^{[1]} = \{j = 1, · · · , p : |\hat{\rho}(y^*, x_j^*)| > T(\alpha, n, \hat{\kappa}, 0)\} ⊇ A;

3. set m = m + 1, and establish the Step m estimated active set

   \hat{A}^{[m]} = \{j ∈ \hat{A}^{[m-1]} : |\hat{\rho}(y^*, x_j^*|x_S^*)| > T(\alpha, n, \hat{\kappa}, m - 1), \forall S ⊆ \hat{A}^{[m-1]}\setminus\{j\}, |S| = m - 1\};

4. repeat Step 3 until |\hat{A}^{[m]}| \le m;

5. set \hat{\beta}_j = 0 if j ∉ \hat{A}^{[m]}, and for j ∈ \hat{A}^{[m]} obtain the least squares estimates;

6. obtain \hat{g}_{\hat{\beta}}(u) = S(y - X\hat{\beta}) by plugging in \hat{\beta}.

Algorithm 4 shows that we first apply the nonparametric technique to get the partial residuals of the covariates and the response, which turns the original partially linear model into an (approximate) linear model. By doing so, we transform variable selection in partially linear models into variable selection in linear models, so we can utilize the thresholded partial correlation approach described in Chapter 3. After obtaining the estimated active set, we employ the least squares approach to get the estimates of the coefficients of the predictors in the estimated active set, and we set the estimated coefficients of the variables that are not in the estimated active set to zero. By substituting the estimates of the linear coefficients back into the original partially linear model, we get the estimate of the nonparametric part. That is the main idea of our approach; we name it the thresholded partial correlation on partial residuals approach, TPC-PR for short.
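A compact R sketch of the whole procedure is given below. It strings together the hypothetical helpers sketched earlier in this chapter and in Chapter 3 (partial_residual_cor, local_linear_S, pcor_rec, kurtosis_hat, threshold_T) and is meant only to illustrate the flow of Algorithm 4, not to reproduce the implementation used for the numerical studies; for simplicity the partial correlations are computed from the centered correlation matrix of the partial residuals.

# Sketch of Algorithm 4 (TPC-PR) using the helpers sketched earlier
tpc_pr <- function(u, X, y, h_y, h_x, alpha = 0.05) {
  pr    <- partial_residual_cor(u, X, y, h_y, h_x)
  p     <- ncol(X); n <- length(y)
  kappa <- kurtosis_hat(pr$X_tilde)
  R     <- cor(cbind(pr$y_tilde, pr$X_tilde))  # column 1 is the smoothed response

  # Step 1: marginal correlation screening
  active <- which(abs(pr$rho_hat) > threshold_T(alpha, n, kappa, 0))
  m <- 1
  # Steps 2, 3, ...: screen partial correlations of increasing order
  while (length(active) > m) {
    keep <- logical(length(active))
    for (idx in seq_along(active)) {
      j      <- active[idx]
      others <- setdiff(active, j)
      sets   <- if (length(others) == 1) list(others)
                else combn(others, m, simplify = FALSE)
      pc <- sapply(sets, function(S) pcor_rec(1, j + 1, S + 1, R))
      keep[idx] <- all(abs(pc) > threshold_T(alpha, n, kappa, m))
    }
    active <- active[keep]
    m <- m + 1
  }

  # Least squares on the selected partial residuals, then recover g(u)
  beta_hat <- numeric(p)
  if (length(active) > 0) {
    y_t <- drop(pr$y_tilde)
    beta_hat[active] <- coef(lm(y_t ~ pr$X_tilde[, active, drop = FALSE] - 1))
  }
  g_hat <- local_linear_S(u, h_y) %*% (y - X %*% beta_hat)
  list(active = active, beta = beta_hat, g = drop(g_hat))
}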

4.4 Asymptotic Properties

In this section, we develop the asymptotic properties of the variable selection in the linear part as well as of the estimation of the nonparametric function, especially when p is much larger than n. For the linear part, we show that TPC-PR identifies the true active set consistently; that is, with a large sample size, our proposed method identifies the important features. Following that, we derive the bias and variance of the estimate of the nonparametric function as well as its asymptotic normality.

4.4.1 Asymptotic Properties for Variable Selection

We now demonstrate that TPC-PR is consistent for variable selection in partially linear models, especially when p is much larger than n. We consider the partially linear models as described in (4.1). Let n be the sample size, pn be the number of covariates, An be the true active set, and peffn be the cardinality of the true active set. For simplicity of the proof, we let U = [a, b]. Our assumptions are as follows:

(D1) (x_1, · · · , x_{p_n}, y) follows an elliptical distribution, and satisfies partial faithfulness for all n.

(D2) Σ_{x,u} > 0, and the random variables x_j and y satisfy the sub-exponential tail probability uniformly in u. That is, there exists s_0 > 0 such that for 0 < s < s_0,

\sup_{u\in U}\max_{1\le j\le p_n} E\{\exp(sx_j^2)|u\} < +\infty, \quad \sup_{u\in U}\max_{1\le j\le p_n} E\{\exp(sx_jy)|u\} < +\infty, \quad \sup_{u\in U} E\{\exp(sy^2)|u\} < +\infty.

(D3) For j = 1, ··· , p, the conditional expectations E(y|u), E(xj|u), E(yxj|u),

E(y^2|u), and E(x_j^2|u) are all uniformly bounded in U. Furthermore, we

assume there exists δ1, such that

i) E{var(y|u)} = var{y − E(y|u)} ≥ δ1 > 0,

ii) E{var(xj|u)} = var{xj − E(xj|u)} ≥ δ1 > 0.

(D4) The partial correlations \rho_n(y^*, x_j^*|x_S^*) satisfy
\inf\{|\rho_n(y^*, x_j^*|x_S^*)| : j = 1, · · · , p_n, S ⊆ \{j\}^c, |S| \le peff_n, \rho_n(y^*, x_j^*|x_S^*) ≠ 0\} \ge c_n,
where c_n = O(n^{-d}) and 0 < d < 1/2.

(D5) The partial correlations \rho_n(y^*, x_j^*|x_S^*) and \rho_n(x_i^*, x_j^*|x_S^*) satisfy:
 i) \sup\{|\rho_n(y^*, x_j^*|x_S^*)| : j = 1, · · · , p_n, S ⊆ \{j\}^c, |S| \le peff_n\} \le \tau < 1,
 ii) \sup\{|\rho_n(x_i^*, x_j^*|x_S^*)| : i ≠ j, i, j = 1, · · · , p_n, S ⊆ \{i, j\}^c, |S| \le peff_n\} \le \tau < 1.

(D6) Let {f(u), u ∈ U} be the density function of u. Assume f(u) is bounded from 0; and f(u), its first derivative f 0(u) and second derivative f 00(u) are

bounded uniformly in u. That is, there exist δ3 and M1 such that

0 00 inf |f(u)| ≥ δ3 > 0, sup |f(u)| ≤ M1, sup |f (u)| ≤ M1, sup |f (u)| ≤ M1. u∈ U u∈U u∈U u∈U

(D7) The kernel function K(·) is a symmetric density function, and it is bounded 117

and has finite second moment. That is

Z sup |K(u)| < +∞, and u2K(u)du < +∞. u∈U u∈U

Let $\hat{A}_n(\alpha_n)$ be the estimated active set obtained via TPC-PR with significance level $\alpha_n$ based on $n$ observations. The following theorem shows that TPC-PR is consistent for variable selection in partially linear models.

Theorem 4.5. Consider the partially linear model (4.1) under assumptions (D1)-(D7), with $p_n = O(\exp(n^a))$ and $\mathrm{peff}_n = O(n^b)$ for constants $a, b > 0$ satisfying $a + b < \frac{1-2d}{5}$, and suppose $h_y \to 0$, $h_j \to 0$, $nh_y^3 \to \infty$, and $nh_j^3 \to \infty$. Then there exists a sequence $\alpha_n \to 0$ such that the TPC-PR approach identifies the true active set; that is,
$$\log\{P(\hat{A}_n \neq A_n)\} \leq C_1 n^{a+b} + \log\left\{ \left(1 - \frac{C_2}{n^{2d+4q}}\right)^n + \exp(1 - C_3 n^q) \right\} \longrightarrow -\infty,$$
as $n \to \infty$, where $c_n$ is given in (D4) and $\frac{1-2d}{5} < q < \frac{1-a-b-2d}{4}$.

The proof is given in Section 4.7. After we obtain the estimated active set, we apply the least squares approach. For a fixed true active set, we can establish the root-n consistency of the estimates of the parametric coefficients.

Theorem 4.6. Consider the partially linear model (4.1) under assumptions (D1)-(D7), with $p_n = O(\exp(n^a))$, where $0 < a < \frac{1-2d}{5}$, and with $\mathrm{peff}_n$ finite and fixed. If $h_y \to 0$, $h_j \to 0$, $nh_y^3 \to \infty$, and $nh_j^3 \to \infty$, then the estimates of the coefficients from TPC-PR enjoy root-n consistency. That is, as $n \to \infty$, with probability tending to 1,
$$\|\hat{\beta} - \beta\| = O_p(n^{-1/2}). \qquad (4.29)$$

The proof is given in Section 4.7.

4.4.2 Bias, Variance and Asymptotic Normality of the Nonparametric Function

In this section, the size of the true active set is fixed. Under that assumption, we provide the bias and variance for the estimate of the nonparametric curve. In addition, we develop the asymptotic normality of the nonparametric function.

Theorem 4.7. Under the same conditions as Theorem 4.6, the bias and variance of $\hat{g}_{\hat{\beta}}(u)$ are given as follows:
$$E\{\hat{g}_{\hat{\beta}}(u)\} = g(u) + \frac{g''(u)}{2}\mu_2 h^2 + o(h^2) + O_p(n^{-1/2}), \qquad (4.30)$$
$$\mathrm{var}\{\hat{g}_{\hat{\beta}}(u)\} = \frac{\sigma^2 \nu_0}{n h f(u)}\{1 + o(1)\}, \qquad (4.31)$$
where $\mu_j = \int t^j K(t)\,dt$ and $\nu_j = \int t^j K^2(t)\,dt$.

The derivations of the bias and variance are given in Section 4.7.

Theorem 4.8. Under the same conditions as Theorem 4.6, it can be shown that
$$\sqrt{n h_n}\left\{ \hat{g}_{\hat{\beta}}(u) - g(u) - \frac{1}{2} g''(u)\mu_2 h_n^2 \right\} \stackrel{D}{\longrightarrow} N\left(0, \frac{\sigma^2 \nu_0}{f(u)}\right), \qquad (4.32)$$
as $n \to \infty$, where $\mu_j = \int u^j K(u)\,du$, $\nu_j = \int u^j K^2(u)\,du$, and $h_n$ is the bandwidth.

The proof of Theorem 4.8 is provided in Section 4.7.
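The kernel constants $\mu_j$ and $\nu_j$ appearing in Theorems 4.7 and 4.8 are simple moments of the kernel. As a quick check, for the Gaussian kernel used in the simulations of Section 4.5 they can be evaluated numerically in R as below, giving $\mu_2 = 1$ and $\nu_0 = 1/(2\sqrt{\pi}) \approx 0.282$.

```r
# Kernel moment constants for the Gaussian kernel K(u) = dnorm(u):
mu_2 <- integrate(function(t) t^2 * dnorm(t), -Inf, Inf)$value  # equals 1
nu_0 <- integrate(function(t) dnorm(t)^2, -Inf, Inf)$value      # equals 1/(2*sqrt(pi)), about 0.282
```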

4.4.3 Discussion of the Conditions

In this section, we discuss Conditions (D1)-(D7) in detail. Note that $\Sigma_{x,u} > 0$ in (D2) is utilized to identify the linear coefficients. The conditional partial faithfulness in (D1) is crucial, and it is the foundation of (4.7) according to Corollary 4.2. (D4) places restrictions on the nonzero partial correlations, and it is used to control the type II error in the proof of Theorem 4.5. (D5-i) imposes constraints on the conditional partial correlations between the response and the covariates. (D5-ii) excludes perfect collinearity among the covariates, which makes the coefficients in the linear part identifiable.

4.5 Numerical Studies

In this section, we conduct simulation studies to assess the performance of the proposed approach compared to the penalized approaches with the SCAD and LASSO penalties. We also illustrate the proposed methodology on one real-world data set. All simulation studies and real data analyses are conducted in R.

4.5.1 Simulation Studies

We evaluate the performance of TPC-PR by comparing it to the penalized approach on partial residuals. We mainly focus on the SCAD penalty, proposed by Fan and Li (2001), and the LASSO penalty, proposed by Tibshirani (1996). For the penalized approaches with the SCAD and LASSO penalties, we first transform the partially linear model to a linear model via partial residuals. After that, we utilize the penalized approach to obtain the estimates of the coefficients, and then obtain the estimate of the nonparametric function by substituting the estimated coefficients back into the partially linear model. For the SCAD penalty, we follow the recommendation of Fan and Li (2004) and set a = 3.7. The tuning parameter λ is chosen by cross-validation.

For the thresholded partial correlation approach with partial residuals, referred to as TPC-PR, we use the adjusted thresholds T(α, n, κ̂, m). The estimated kurtosis κ̂ is the sample kurtosis based on the partial residuals. Similarly, we can apply the PC-simple algorithm to the partial residuals to obtain the estimated active set and then compute the estimates via the least squares approach; the estimate of the nonparametric function is obtained by substituting the estimates of the linear coefficients into the original partially linear model. We call this approach PC-PR in our simulation studies. The difference between TPC-PR and PC-PR is that TPC-PR takes the kurtosis into consideration while PC-PR pretends all the samples are from a normal distribution. As in Chapter 3, we expect the over-fitting or under-fitting phenomenon with PC-PR when the samples are from a non-normal distribution.

To further improve the performance of the thresholded partial correlation approach with partial residuals, we consider TPC-PR with fine tuning. In this approach, we use cT(α, n, κ̂, m) as the thresholds, where c is chosen from (0.5, 1.5) by an extended Bayesian information criterion:

$$\log(\hat{\sigma}^2) + df \cdot \log(p) \cdot \log(n)/n, \qquad (4.33)$$
where $\hat{\sigma}^2$ is the estimate of the error variance, $df$ is the number of nonzero estimated coefficients, $p$ is the number of predictors, and $n$ is the sample size. The threshold $cT(\alpha, n, \hat{\kappa}, m)$ is chosen by minimizing the criterion in (4.33).

The bandwidth for the nonparametric part is chosen by the plug-in bandwidth selector developed by Ruppert et al. (1995); we use the "dpill" command in the R package "KernSmooth". We report the results with the Gaussian kernel
$$K(u) = \frac{1}{\sqrt{2\pi}}\exp(-u^2/2). \qquad (4.34)$$
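As an illustration of the bandwidth choice and the fine-tuning step just described, the sketch below uses `dpill()` from the KernSmooth package for the plug-in bandwidth and scans the multiplier c over (0.5, 1.5) by the criterion (4.33); `tpc_pr_fit()` is a hypothetical wrapper for a fit at a given multiplier, not a function from any package.

```r
library(KernSmooth)

# Plug-in bandwidth (Ruppert, Sheather and Wand, 1995) for smoothing y on u.
h_y <- dpill(u, y)

# EBIC-type criterion (4.33) for a candidate fit of the linear part.
ebic <- function(resid, df, p, n) log(mean(resid^2)) + df * log(p) * log(n) / n

# Fine tuning: scan the multiplier c over (0.5, 1.5) applied to the threshold
# T(alpha, n, kappa, m); tpc_pr_fit() is a hypothetical helper returning the
# residuals and model size of the fit obtained with threshold c * T.
c_grid <- seq(0.5, 1.5, by = 0.05)
scores <- sapply(c_grid, function(cc) {
  fit <- tpc_pr_fit(y, X, u, c_mult = cc)   # hypothetical helper, not in any package
  ebic(fit$residuals, df = fit$size, p = ncol(X), n = length(y))
})
c_best <- c_grid[which.min(scores)]
```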

We focus on the following criteria to assess the performance of all approaches.

1. The performance of the estimate of $g(\cdot)$ is assessed by the square root of average squared errors (RASE), defined by
$$\mathrm{RASE} = \left\{ \frac{1}{N_g} \sum_{k=1}^{N_g} \{\hat{g}(v_k) - g(v_k)\}^2 \right\}^{1/2}, \qquad (4.35)$$
where $N_g$ is the number of grid points and $\{v_k, k = 1, \ldots, N_g\}$ are the grid points at which the function $\hat{g}(\cdot)$ is evaluated.

2. The performance of variable selection in the parametric part is evaluated by the following numbers; a short R sketch computing these quantities is given after this list.

(a) Model error of prediction (ME): defined as $E_x[\{x^T(\hat{\beta} - \beta)\}^2] = (\hat{\beta} - \beta)^T \mathrm{cov}(x) (\hat{\beta} - \beta)$.

(b) True positive number (C): defined as $\sum_{j=1}^{p} I(\hat{\beta}_j \ne 0, \beta_j \ne 0)$. This is the number of nonzero coefficients correctly estimated to be nonzero.

(c) False positive number (IC): defined as $\sum_{j=1}^{p} I(\hat{\beta}_j \ne 0, \beta_j = 0)$. This is the number of zero coefficients erroneously estimated to be nonzero.

(d) Under-fit percentage (Under-fit): the proportion of replications that miss at least one of the significant predictors in the linear part.

(e) Correctly-fit percentage (Cor-fit): the proportion of replications that identify exactly the significant predictors.

(f) Over-fit percentage (Over-fit): the proportion of replications that identify all the significant predictors but erroneously include at least one insignificant predictor.
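The following lines sketch how these criteria can be computed for a single replication; `beta_hat`, `beta_true`, `Sigma_x`, `g_hat` and `g_true` are assumed to hold the estimated and true coefficients, the covariance matrix of x, and the estimated and true nonparametric function values on the evaluation grid.

```r
# Criteria for one replication; beta_hat, beta_true, Sigma_x, g_hat and
# g_true (both evaluated on the same grid of u values) are assumed given.
ME   <- as.numeric(t(beta_hat - beta_true) %*% Sigma_x %*% (beta_hat - beta_true))
RASE <- sqrt(mean((g_hat - g_true)^2))

C_num  <- sum(beta_hat != 0 & beta_true != 0)   # true positive number (C)
IC_num <- sum(beta_hat != 0 & beta_true == 0)   # false positive number (IC)

under_fit <- any(beta_hat == 0 & beta_true != 0)       # misses a true signal
cor_fit   <- all((beta_hat != 0) == (beta_true != 0))  # recovers the exact set
over_fit  <- !under_fit && !cor_fit                    # all signals plus extras
```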

We conduct simulations under three settings: low dimension (p = 20), medium dimension (p = 200), and high dimension (p = 500). For each setting, we consider different distributions of the covariates: normal distributions and mixture normal distributions. Within each example, there are two choices for the nonparametric function and two choices for the error variance, giving four combinations in total. We report the median of RASE and ME, the median of their absolute deviations, the true positive number (C), the false positive number (IC), and the under-fit, correctly-fit and over-fit percentages.

Example 1 (Low dimension): In this example we simulated 200 datasets, each consisting of 200 observations, from the model

$$y = g(u) + x^T\beta + \varepsilon, \qquad (4.36)$$
where the random error $\varepsilon$ has variance $\sigma^2$.

We set p = 20, and this example is treated as a low dimensional example. For the normal samples with AR covariance structure, we first draw the data $(x_1, \cdots, x_{p+1})^T$ from a jointly normal distribution in which the correlation between $x_i$ and $x_j$ is $\rho^{|i-j|}$, with $\rho = 0.5$ or $0.8$. We let $u = \Phi(x_{p+1})$, where $\Phi$ is the cumulative distribution function of the standard normal distribution. Similarly, we draw samples from a normal distribution with compound symmetric covariance structure, a mixture normal distribution with AR covariance structure, and a mixture normal distribution with compound symmetric covariance structure. We set $\beta = (3, 1.5, 0, 0, 2, 0, \cdots, 0)^T$, so peff = 3. The variance of the error is set to 0.25 or 1. We choose $g(u) = u^2$ or $\sin(2\pi u)$. We set the significance level α to 0.05. To evaluate the performance of the estimate of the nonparametric part, 100 evenly spaced grid points over the range of u are used.

In the summarizing tables, the medians of the RASE and model errors over 200 repetitions are reported in Columns "MedRASE" and "MedME", along with the medians of their absolute deviations. Column "C" shows the average number of nonzero coefficients correctly estimated to be nonzero. The average number of zero coefficients erroneously estimated to be nonzero is reported in the column labeled "IC". The remaining three columns present the proportions of under-fit, correctly-fit and over-fit models over the 200 simulations. "SCAD" and "LASSO" refer to the penalized approach on the partial residuals with the SCAD and LASSO penalties, respectively. "TPC-PR" represents our proposed approach.

Table 4.1: Normal samples with AR correlation matrix (p = 20)

σ2 g(u) Method MedME(Devi) C IC Under-fit Cor-fit Over-fit MedRASE(Devi)

ρ = 0.5 0.25 u2 SCAD 0.0051(0.0030) 3.000 0.225 0.000 0.905 0.095 0.0851(0.0230)

LASSO 0.0145(0.0059) 3.000 4.530 0.000 0.045 0.955 0.0899(0.0240)

PC-PR 0.0047(0.0026) 3.000 0.010 0.000 0.990 0.010 0.0842(0.0224)

TPC-PR 0.0047(0.0026) 3.000 0.010 0.000 0.990 0.010 0.0842(0.0224)

TPC-PR(EBIC) 0.0047(0.0026) 3.000 0.000 0.000 1.000 0.000 0.0843(0.0224)

sin(2πu) SCAD 0.0053(0.0032) 3.000 0.150 0.000 0.900 0.100 0.1081(0.0194)

LASSO 0.0149(0.0061) 3.000 4.430 0.000 0.050 0.950 0.1182(0.0232)

PC-PR 0.0048(0.0027) 3.000 0.010 0.000 0.990 0.010 0.1069(0.0200)

TPC-PR 0.0048(0.0027) 3.000 0.010 0.000 0.990 0.010 0.1069(0.0200)

TPC-PR(EBIC) 0.0048(0.0027) 3.000 0.000 0.000 1.000 0.000 0.1075(0.0198)

1 u2 SCAD 0.0171(0.0100) 3.000 0.380 0.000 0.855 0.145 0.1680(0.0442)

LASSO 0.0532(0.0195) 3.000 4.590 0.000 0.045 0.955 0.1757(0.0460)
PC-PR 0.0153(0.0085) 3.000 0.010 0.000 0.990 0.010 0.1664(0.0417)

TPC-PR 0.0153(0.0085) 3.000 0.010 0.000 0.990 0.010 0.1664(0.0417)

TPC-PR(EBIC) 0.0153(0.0085) 3.000 0.000 0.000 1.000 0.000 0.1670(0.0413)

sin(2πu) SCAD 0.0185(0.0109) 3.000 0.395 0.000 0.865 0.135 0.1887(0.0390)
LASSO 0.0544(0.0206) 3.000 4.605 0.000 0.050 0.950 0.2085(0.0421)
PC-PR 0.0157(0.0090) 3.000 0.010 0.000 0.990 0.010 0.1889(0.0393)

TPC-PR 0.0157(0.0090) 3.000 0.010 0.000 0.990 0.010 0.1889(0.0393)

TPC-PR(EBIC) 0.0157(0.0090) 3.000 0.000 0.000 1.000 0.000 0.1897(0.0395)

ρ = 0.8 0.25 u2 SCAD 0.0041(0.0028) 3.000 0.160 0.000 0.890 0.110 0.0673(0.0196)
LASSO 0.0117(0.0051) 3.000 4.840 0.000 0.030 0.970 0.0853(0.0227)
PC-PR 0.0039(0.0026) 3.000 0.005 0.000 0.995 0.005 0.0673(0.0196)

TPC-PR 0.0039(0.0026) 3.000 0.005 0.000 0.995 0.005 0.0673(0.0196)

TPC-PR(EBIC) 0.0039(0.0026) 3.000 0.000 0.000 1.000 0.000 0.0673(0.0196)

sin(2πu) SCAD 0.0062(0.0038) 3.000 0.330 0.000 0.875 0.125 0.1085(0.0203)

LASSO 0.0184(0.0085) 3.000 4.715 0.000 0.020 0.980 0.1312(0.0268)
PC-PR 0.0056(0.0035) 3.000 0.025 0.000 0.975 0.025 0.1085(0.0203)

TPC-PR 0.0056(0.0035) 3.000 0.025 0.000 0.975 0.025 0.1085(0.0203)

TPC-PR(EBIC) 0.0056(0.0034) 3.000 0.005 0.000 0.995 0.005 0.1085(0.0205)

1 u2 SCAD 0.0252(0.0168) 3.000 0.470 0.000 0.835 0.165 0.1807(0.0501)

LASSO 0.0636(0.0289) 3.000 4.765 0.000 0.015 0.985 0.2040(0.0544)

PC-PR 0.0216(0.0144) 3.000 0.035 0.000 0.965 0.035 0.1790(0.0514)

TPC-PR 0.0216(0.0146) 3.000 0.040 0.000 0.960 0.040 0.1790(0.0514)

TPC-PR(EBIC) 0.0205(0.0135) 3.00 0.000 0.000 1.000 0.000 0.1790(0.0514)

sin(2πu) SCAD 0.0246(0.0161) 3.000 0.490 0.000 0.835 0.165 0.1950(0.0424)
LASSO 0.0659(0.0296) 3.000 4.750 0.000 0.015 0.985 0.2392(0.0532)
PC-PR 0.0215(0.0142) 3.000 0.035 0.000 0.965 0.035 0.1933(0.0414)

TPC-PR 0.0215(0.0142) 3.000 0.035 0.000 0.965 0.035 0.1933(0.0414)

TPC-PR(EBIC) 0.0207(0.0132) 3.000 0.000 0.000 1.000 0.000 0.1933(0.0414)

Table 4.2: Normal samples with compound symmetric correlation matrix (p = 20)

σ2 g(u) Method MedME(Devi) C IC Under-fit Cor-fit Over-fit MedRASE(Devi)

ρ = 0.5 0.25 u2 SCAD 0.0047(0.0028) 3.000 0.210 0.000 0.925 0.075 0.0867(0.0232)

LASSO 0.0140(0.0062) 3.000 5.035 0.000 0.020 0.980 0.0900(0.0238)

PC-PR 0.0057(0.0036) 3.000 0.180 0.000 0.820 0.180 0.0875(0.0222)

TPC-PR 0.0057(0.0036) 3.000 0.175 0.000 0.825 0.175 0.0875(0.0222)
TPC-PR(EBIC) 0.0046(0.0028) 3.000 0.005 0.000 0.995 0.000 0.0848(0.0223)

sin(2πu) SCAD 0.0047(0.0027) 3.000 0.210 0.000 0.930 0.070 0.1047(0.0204)

LASSO 0.0145(0.0068) 3.000 4.865 0.000 0.010 0.990 0.1135(0.0205)

PC-PR 0.0058(0.0036) 3.000 0.170 0.000 0.830 0.170 0.1036(0.0197)

TPC-PR 0.0058(0.0036) 3.000 0.165 0.000 0.835 0.165 0.1036(0.0197)

TPC-PR(EBIC) 0.0046(0.0027) 3.000 0.005 0.000 0.995 0.005 0.1040(0.0206)

1 u2 SCAD 0.0185(0.0104) 3.000 0.270 0.000 0.920 0.080 0.1663(0.0456)

LASSO 0.0548(0.0218) 3.000 4.995 0.000 0.015 0.985 0.1753(0.0453)

PC-PR 0.0218(0.0130) 3.000 0.240 0.000 0.765 0.235 0.1715(0.0449)

TPC-PR 0.0218(0.0130) 3.000 0.245 0.000 0.760 0.240 0.1720(0.0454)

TPC-PR(EBIC) 0.0166(0.0093) 3.000 0.015 0.000 0.985 0.015 0.1645(0.0450)

sin(2πu) SCAD 0.0178(0.0101) 3.000 0.275 0.000 0.915 0.085 0.1950(0.0390)

LASSO 0.0556(0.0234) 3.000 4.975 0.000 0.015 0.985 0.2029(0.0406)

PC-PR 0.0228(0.0137) 3.000 0.245 0.000 0.760 0.240 0.1959(0.0394)

TPC-PR 0.0228(0.0137) 3.000 0.245 0.000 0.760 0.240 0.1959(0.0393)

TPC-PR(EBIC) 0.0171(0.0098) 3.000 0.015 0.000 0.985 0.015 0.1950(0.0398)

ρ = 0.8 0.25 u2 SCAD 0.0083(0.0055) 3.000 0.150 0.000 0.935 0.065 0.1022(0.0285)

LASSO 0.0180(0.0098) 3.000 4.950 0.000 0.015 0.985 0.1133(0.0350)

PC-PR 0.0098(0.0063) 3.000 0.235 0.000 0.780 0.220 0.1044(0.0293)

TPC-PR 0.0096(0.0063) 3.000 0.240 0.000 0.775 0.225 0.1046(0.0292)
TPC-PR(EBIC) 0.0074(0.0051) 3.000 0.010 0.000 0.990 0.010 0.1009(0.0281)

sin(2πu) SCAD 0.0083(0.0056) 3.000 0.070 0.000 0.935 0.065 0.1178(0.0246)

LASSO 0.0186(0.0104) 3.000 4.980 0.000 0.010 0.990 0.1349(0.0310)

PC-PR 0.0099(0.0067) 3.000 0.240 0.000 0.780 0.220 0.1172(0.0244)

TPC-PR 0.0100(0.0067) 3.000 0.250 0.000 0.770 0.230 0.1172(0.0244)

TPC-PR(EBIC) 0.0074(0.0050) 3.000 0.005 0.000 0.995 0.005 0.1176(0.0245)

1 u2 SCAD 0.0327(0.0218) 3.000 0.295 0.000 0.865 0.135 0.1988(0.0543)

LASSO 0.0685(0.0365) 3.000 4.970 0.000 0.015 0.985 0.2164(0.0607)

PC-PR 0.0367(0.0227) 3.000 0.280 0.000 0.730 0.270 0.2002(0.0558)

TPC-PR 0.0365(0.0227) 3.000 0.280 0.000 0.735 0.265 0.2002(0.0552)

TPC-PR(EBIC) 0.0283(0.0188) 3.000 0.030 0.000 0.970 0.030 0.2002(0.0570)

sin(2πu) SCAD 0.0325(0.0210) 3.000 0.285 0.000 0.860 0.140 0.2090(0.0473)

LASSO 0.0701(0.0380) 3.000 5.100 0.000 0.010 0.990 0.2365(0.0589)

PC-PR 0.0374(0.0225) 3.000 0.280 0.000 0.730 0.270 0.2133(0.0484)

TPC-PR 0.0373 (0.0223) 3.000 0.280 0.000 0.735 0.265 0.2130(0.0483)

TPC-PR(EBIC) 0.0270(0.0180) 3.000 0.025 0.000 0.975 0.025 0.2092(0.0458)

First, let us focus on the results for normal samples, which are shown in Tables 4.1 and 4.2. LASSO has larger model errors than the other four methods, because LASSO selects some insignificant predictors and overfits the data; its over-fitting percentage approaches 1, as given in Column "Over-fit". For the nonparametric part, LASSO has a slightly larger error than the other four methods. PC-PR and TPC-PR perform similarly, and both identify the true active set the majority of the time. TPC-PR(EBIC) outperforms PC-PR, TPC-PR and SCAD, and it identifies the true active set more than 95% of the time.

As the variance of the error increases, the correct fitting percentages of all five methods drop; this is expected, as a larger error variance makes it harder to detect the truth. As ρ increases from 0.5 to 0.8, it becomes more difficult to identify the true active set, because when the predictors are highly correlated with each other it is harder for all the methods to find the truly significant variables. Moreover, with the compound symmetric covariance structure the correct fitting percentages of PC-PR and TPC-PR drop to around 75%, while with the AR covariance structure they remain above 95%.

Table 4.3: Mixture normal samples with AR correlation matrix (p = 20)

σ2 g(u) Method MedME(Devi) C IC Under-fit Cor-fit Over-fit MedRASE(Devi)

ρ = 0.5 0.25 u2 SCAD 0.0125(0.0077) 3.000 1.145 0.000 0.665 0.335 0.1119(0.0272)

LASSO 0.0346(0.0153) 3.000 6.445 0.000 0.015 0.985 0.1129(0.0315)

PC-PR 0.0099(0.0064) 3.000 0.020 0.000 0.980 0.020 0.1115(0.0284)

TPC-PR 0.0099(0.0063) 2.990 0.010 0.010 0.990 0.000 0.1118(0.0288)

TPC-PR(EBIC) 0.0099(0.0063) 3.000 0.005 0.000 0.995 0.005 0.1117(0.0280)

sin(2πu) SCAD 0.0134(0.0080) 3.000 1.260 0.000 0.635 0.365 0.1274(0.0259)

LASSO 0.0352(0.0144) 3.000 6.360 0.000 0.025 0.975 0.1383(0.0297)

PC-PR 0.0101(0.0060) 3.000 0.030 0.000 0.970 0.030 0.1272(0.0258)

TPC-PR 0.0101(0.0060) 2.990 0.010 0.010 0.990 0.000 0.1274(0.0264)

TPC-PR(EBIC) 0.0101(0.0060) 3.000 0.005 0.000 0.995 0.005 0.1274(0.0259)

1 u2 SCAD 0.0509(0.0305) 3.000 1.580 0.000 0.595 0.405 0.2211(0.0539)

LASSO 0.1293(0.0535) 3.000 6.725 0.000 0.020 0.980 0.2229(0.0566)

PC-PR 0.0397(0.0236) 3.000 0.075 0.000 0.930 0.070 0.2199(0.0549)

TPC-PR 0.0355(0.0229) 2.980 0.025 0.020 0.975 0.005 0.2263(0.0584)

TPC-PR(EBIC) 0.0360(0.0232) 3.000 0.020 0.000 0.980 0.020 0.2214(0.0539)

sin(2πu) SCAD 0.0498(0.0290) 3.000 1.500 0.000 0.595 0.405 0.2305(0.0475)

LASSO 0.1280(0.0538) 3.000 6.735 0.000 0.020 0.980 0.2549(0.0541)

PC-PR 0.0394(0.0231) 3.000 0.070 0.000 0.935 0.065 0.2334(0.0521)

TPC-PR 0.0355(0.0229) 2.980 0.025 0.020 0.975 0.005 0.2346(0.0540)

TPC-PR(EBIC) 0.0364(0.0228) 3.000 0.020 0.000 0.980 0.020 0.2334(0.0511)

ρ = 0.8 0.25 u2 SCAD 0.0178(0.0116) 3.000 0.840 0.000 0.635 0.365 0.1273(0.0403)

LASSO 0.0442(0.0182) 3.000 6.235 0.000 0.020 0.980 0.1475(0.0497)

PC-PR 0.0160(0.0098) 2.990 0.105 0.010 0.900 0.090 0.1319(0.0422)

TPC-PR 0.0161(0.0115) 2.950 0.055 0.050 0.945 0.005 0.1353(0.0433)

TPC-PR(EBIC) 0.0158(0.0098) 3.000 0.020 0.000 0.980 0.020 0.1308(0.0409)

sin(2πu) SCAD 0.0175(0.0119) 3.000 0.790 0.000 0.655 0.345 0.1441(0.0334)

LASSO 0.0450(0.0187) 3.000 6.425 0.000 0.015 0.985 0.1686(0.0523)

PC-PR 0.0155(0.0104) 2.990 0.100 0.010 0.905 0.085 0.1488(0.0360)

TPC-PR 0.0156(0.0110) 2.950 0.055 0.050 0.945 0.005 0.1517(0.0370)

TPC-PR(EBIC) 0.0153(0.0102) 3.000 0.015 0.000 0.985 0.015 0.1478(0.0343)

1 u2 SCAD 0.0710(0.0451) 3.000 1.400 0.000 0.575 0.425 0.2526(0.0798)

LASSO 0.1710(0.0631) 3.000 6.605 0.000 0.010 0.990 0.2856(0.1114)

PC-PR 0.0615(0.0403) 2.965 0.200 0.035 0.825 0.140 0.2544(0.0773)

TPC-PR 0.0606(0.0410) 2.890 0.130 0.110 0.865 0.025 0.2580(0.0825)

TPC-PR(EBIC) 0.0582(0.0345) 2.990 0.060 0.010 0.945 0.045 0.2559(0.0785)

sin(2πu) SCAD 0.0719(0.0475) 3.000 1.360 0.000 0.595 0.405 0.2671(0.0687)

LASSO 0.1705(0.0628) 3.000 6.595 0.000 0.010 0.990 0.2989(0.1033)

PC-PR 0.0618(0.0411) 2.965 0.200 0.035 0.830 0.135 0.2772(0.0702)

TPC-PR 0.0606(0.0406) 2.895 0.125 0.105 0.870 0.025 0.2772(0.0735)

TPC-PR(EBIC) 0.0581(0.0354) 2.990 0.060 0.010 0.945 0.045 0.2712(0.0703)

Table 4.4: Mixture normal samples with compound symmetric correlation matrix (p = 20)

σ2 g(u) Method MedME(Devi) C IC Under-fit Cor-fit Over-fit MedRASE(Devi)

ρ = 0.5 0.25 u2 SCAD 0.0152(0.0097) 3.000 0.790 0.000 0.695 0.305 0.1172(0.0328)

LASSO 0.0373(0.0173) 3.000 6.555 0.000 0.010 0.990 0.1243(0.0356)

PC-PR 0.0142(0.0085) 2.995 0.545 0.005 0.535 0.460 0.1182(0.0369)

TPC-PR 0.0142(0.0085) 2.985 0.075 0.015 0.935 0.050 0.1182(0.0383)

TPC-PR(EBIC) 0.0143(0.0081) 2.995 0.090 0.005 0.925 0.070 0.1164(0.0358)

sin(2πu) SCAD 0.0152(0.0098) 3.000 0.870 0.000 0.675 0.325 0.1351(0.0318)

LASSO 0.0381(0.0172) 3.000 6.395 0.000 0.010 0.990 0.1394(0.0329)

PC-PR 0.0190(0.0093) 2.995 0.550 0.005 0.525 0.470 0.1347(0.0321)

TPC-PR 0.0142(0.0082) 2.985 0.075 0.015 0.935 0.050 0.1396(0.0350)

TPC-PR(EBIC) 0.0142(0.0080) 2.995 0.090 0.005 0.925 0.070 0.1369(0.0329)

1 u2 SCAD 0.0578(0.0351) 3.000 1.225 0.000 0.630 0.370 0.2257(0.0666)

LASSO 0.1452(0.0633) 3.000 6.595 0.000 0.015 0.985 0.2402(0.0630)

PC-PR 0.0845(0.0393) 2.995 0.710 0.005 0.430 0.565 0.2311(0.0697)

TPC-PR 0.0541(0.0307) 2.985 0.160 0.015 0.850 0.135 0.2253(0.0691)

TPC-PR(EBIC) 0.0531(0.0281) 2.995 0.115 0.005 0.890 0.105 0.2248(0.0690)

sin(2πu) SCAD 0.0566(0.0344) 3.000 1.190 0.000 0.640 0.360 0.2421(0.0561)

LASSO 0.1429(0.0599) 3.000 6.685 0.000 0.010 0.990 0.2561(0.0606)

PC-PR 0.0844(0.0386) 2.995 0.715 0.005 0.430 0.565 0.2434(0.0636)

TPC-PR 0.0538(0.0310) 2.985 0.165 0.015 0.845 0.140 0.2474(0.0569)

TPC-PR(EBIC) 0.0538(0.0294) 2.995 0.120 0.005 0.885 0.110 0.2446(0.0558)

ρ = 0.8 0.25 u2 SCAD 0.0229(0.0156) 3.000 0.605 0.000 0.795 0.205 0.1384(0.0433)

LASSO 0.0464(0.0239) 3.000 6.435 0.000 0.010 0.990 0.1630(0.0651)

PC-PR 0.0277(0.0131) 2.990 0.750 0.010 0.385 0.605 0.1453(0.0423)

TPC-PR 0.0211(0.0126) 2.980 0.155 0.020 0.860 0.120 0.1392(0.0427)

TPC-PR(EBIC) 0.0193(0.0121) 2.995 0.110 0.005 0.900 0.095 0.1384(0.0430)

sin(2πu) SCAD 0.0218(0.0156) 3.000 0.605 0.000 0.810 0.190 0.1619(0.0445)

LASSO 0.0460(0.0238) 3.000 6.390 0.000 0.010 0.990 0.1856(0.0639)

PC-PR 0.0266(0.0129) 2.990 0.745 0.010 0.390 0.600 0.1577(0.0383)

TPC-PR 0.0215(0.0129) 2.980 0.155 0.020 0.860 0.120 0.1641(0.0458)

TPC-PR(EBIC) 0.0188(0.0119) 2.995 0.100 0.005 0.910 0.085 0.1594(0.0443)

1 u2 SCAD 0.0927(0.0649) 3.000 1.170 0.000 0.705 0.295 0.2788(0.0783)

LASSO 0.1776(0.0816) 3.000 6.565 0.000 0.010 0.990 0.3039(0.1084)

PC-PR 0.1072(0.0566) 2.985 0.865 0.015 0.335 0.650 0.2763(0.0752)

TPC-PR 0.0798(0.0471) 2.970 0.265 0.030 0.755 0.215 0.2754(0.0782)

TPC-PR(EBIC) 0.0736(0.0438) 2.990 0.155 0.010 0.855 0.135 0.2703(0.0725)

sin(2πu) SCAD 0.0870(0.0610) 3.000 1.175 0.000 0.715 0.285 0.2986(0.0830)

LASSO 0.1796(0.0800) 3.000 6.620 0.000 0.010 0.990 0.3438(0.1177)

PC-PR 0.1070(0.0566) 2.990 0.855 0.010 0.335 0.655 0.2927(0.0720)

TPC-PR 0.0773(0.0453) 2.970 0.275 0.030 0.750 0.220 0.2997(0.0805)

TPC-PR(EBIC) 0.0727(0.0448) 2.990 0.155 0.010 0.855 0.135 0.2912(0.0735)

Tables 4.3 and 4.4 report the results for the mixture normal samples with AR and compound symmetric covariance structures. LASSO has the worst performance, as it has a larger model error and square root of average squared error than the other four methods; Columns "IC" and "Over-fit" show that LASSO includes some insignificant predictors in its final model.

For mixture normal samples, PC-PR and TPC-PR perform very differently. With the AR covariance structure the difference is slight; however, with the compound symmetric covariance structure, TPC-PR performs much better than PC-PR. TPC-PR identifies the true active set more than 80% of the time, while PC-PR selects exactly the true significant predictors less than 40% of the time. Consider the third scenario for ρ = 0.5 and 0.8: the over-fitting percentage of PC-PR increases from 56.5% to 65%. For ρ = 0.8, the correct fitting percentage of PC-PR is 33.5%, compared to 75.5% for TPC-PR and 85.5% for TPC-PR(EBIC). This implies that adjusting the variance of the partial correlations in the PC-simple algorithm improves the performance significantly. Moreover, TPC-PR and TPC-PR(EBIC) yield higher correct fitting percentages than the penalized approach with the SCAD penalty.

Example 2 (Medium dimension): In this example, we simulated 200 datasets, each consisting of 300 observations, from the model

$$y = g(u) + x^T\beta + \varepsilon, \qquad (4.37)$$
where the random error $\varepsilon$ has variance $\sigma^2$.

The number of covariates in this example is set to 200, and it is treated as a medium dimensional example. The number of nonzero coefficients is kept at 3; that is, peff = 3. The indices of the nonzero coefficients are 1, 2 and 5, and $\beta = (3, 1.5, 0, 0, 2, 0, \cdots, 0)^T$. The samples are drawn in a similar way as in Example 1. In this example, we study the samples with AR correlation structure. The nonparametric function is set to $u^2$ or $\sin(2\pi u)$, and the variance of the error is varied between 0.25 and 1. Other settings are the same as in Example 1.

Table 4.5: Normal samples with AR correlation structure (p=200)

σ2 g(u) Method MedME(Devi) C IC Under-fit Cor-fit Over-fit MedRASE(Devi)

ρ = 0.5 0.25 u2 SCAD 0.0028(0.0016) 3.000 0.810 0.000 0.865 0.135 0.0633(0.0169)

LASSO 0.0172(0.0057) 3.000 9.420 0.000 0.025 0.975 0.0800(0.0194)

PC-PR 0.0025(0.0014) 3.000 0.010 0.000 0.990 0.010 0.0637(0.0173)

TPC-PR 0.0025(0.0014) 3.000 0.010 0.000 0.990 0.010 0.0637(0.0173)

TPC-PR(EBIC) 0.0025(0.0014) 3.000 0.000 0.000 1.000 0.000 0.0637(0.0172)

sin(2πu) SCAD 0.0027(0.0015) 3.000 0.645 0.000 0.880 0.120 0.0874(0.0149)

LASSO 0.0184(0.0061) 3.000 8.860 0.000 0.025 0.975 0.1066(0.0168)

PC-PR 0.0025(0.0014) 3.000 0.010 0.000 0.990 0.010 0.0869(0.0151)

TPC-PR 0.0025(0.0014) 3.000 0.010 0.000 0.990 0.010 0.0869(0.0151)

TPC-PR(EBIC) 0.0025(0.0014) 3.000 0.000 0.000 1.000 0.000 0.0869(0.0151)

1 u2 SCAD 0.0102(0.0059) 3.000 0.880 0.000 0.750 0.250 0.1247(0.0340)

LASSO 0.0664(0.0203) 3.000 10.145 0.000 0.010 0.990 0.1581(0.0392)

PC-PR 0.0102(0.0060) 3.000 0.045 0.000 0.955 0.045 0.1253(0.0329)

TPC-PR 0.0102(0.0060) 3.000 0.050 0.000 0.950 0.050 0.1253(0.0329)

TPC-PR(EBIC) 0.0097(0.0054) 3.000 0.000 0.000 1.000 0.000 0.1249(0.0334)

sin(2πu) SCAD 0.0101(0.0058) 3.000 0.770 0.000 0.755 0.245 0.1564(0.0328)

LASSO 0.0674(0.0203) 3.000 10.165 0.000 0.010 0.990 0.1908(0.0349)

PC-PR 0.0100(0.0057) 3.000 0.045 0.000 0.955 0.045 0.1576(0.0324)

TPC-PR 0.0100(0.0057) 3.000 0.050 0.000 0.950 0.050 0.1576(0.0324)

TPC-PR(EBIC) 0.0097(0.0054) 3.000 0.000 0.000 1.000 0.000 0.1564(0.0330)

ρ = 0.8 0.25 u2 SCAD 0.0034(0.0019) 3.000 0.480 0.000 0.870 0.130 0.0703(0.0180)

LASSO 0.0295(0.0102) 3.000 11.895 0.000 0.000 1.000 0.1273(0.0351)

PC-PR 0.0032(0.0016) 3.000 0.005 0.000 0.995 0.005 0.0698(0.0178)

TPC-PR 0.0032(0.0016) 3.000 0.005 0.000 0.995 0.005 0.0698(0.0178)

TPC-PR(EBIC) 0.0031(0.0016) 3.000 0.000 0.000 1.000 0.000 0.0697(0.0178)

sin(2πu) SCAD 0.0035(0.0019) 3.000 0.470 0.000 0.880 0.120 0.0894(0.0168)

LASSO 0.0298(0.0104) 3.000 11.500 0.000 0.000 1.000 0.1563(0.0306)

PC-PR 0.0032(0.0016) 3.000 0.010 0.000 0.990 0.010 0.0890(0.0162)

TPC-PR 0.0032(0.0016) 3.000 0.010 0.000 0.990 0.010 0.0890(0.0162)

TPC-PR(EBIC) 0.0032(0.0016) 3.000 0.000 0.000 1.000 0.000 0.0890(0.0162)

1 u2 SCAD 0.0128(0.0071) 3.000 0.575 0.000 0.770 0.230 0.1361(0.0336)

LASSO 0.1127(0.0384) 3.000 11.910 0.000 0.000 1.000 0.2609(0.0667)

PC-PR 0.0117(0.0061) 3.000 0.025 0.000 0.975 0.025 0.1344(0.0335)

TPC-PR 0.0117(0.0061) 3.000 0.025 0.000 0.975 0.025 0.1344(0.0335)

TPC-PR(EBIC) 0.0116(0.0060) 3.000 0.000 0.000 1.000 0.000 0.1344(0.0335)

sin(2πu) SCAD 0.0129(0.0068) 3.000 0.595 0.000 0.765 0.235 0.1607(0.0306)

LASSO 0.1124(0.0392) 3.000 12.075 0.000 0.000 1.000 0.2903(0.0616)

PC-PR 0.0118(0.0063) 3.000 0.030 0.000 0.970 0.030 0.1605(0.0298)

TPC-PR 0.0118(0.0063) 3.000 0.030 0.000 0.970 0.030 0.1605(0.0298)

TPC-PR(EBIC) 0.0115(0.0060) 3.000 0.000 0.000 1.000 0.000 0.1605(0.0298)

Table 4.5 lists the results for normal samples with AR correlation structure with ρ = 0.5 and ρ = 0.8. LASSO has the largest model errors among the five approaches; this is expected, as LASSO results in an over-fitted model. LASSO selects more than 9 insignificant variables in its final model, and the number in Column "Over-fit" is very close to 1. For the nonparametric part, PC-PR, TPC-PR and TPC-PR(EBIC) have similar performance, and LASSO yields a slightly larger error than the other four methods.

For normal samples, the performance of PC-PR, TPC-PR and TPC-PR(EBIC) is similar. This is due to the fact that the kurtosis of the normal distribution is 0, so all of them utilize similar limiting distributions for the sample marginal and partial correlations. The correct fitting percentages of PC-PR, TPC-PR and TPC-PR(EBIC) are higher than 95%, which is much higher than that of the penalized approach with the SCAD penalty; consequently, these three methods have smaller model errors than SCAD. PC-PR, TPC-PR and TPC-PR(EBIC) identify the true active set more than 95% of the time, and they outperform the penalized approaches with the SCAD and LASSO penalties. The correct fitting percentages of all five methods drop as we increase the variance of the error. As we increase ρ from 0.5 to 0.8, it becomes more difficult to identify the true predictors, since the predictors are highly correlated with each other; thus the performance of all methods becomes slightly worse.

Table 4.6: Mixture normal samples with AR correlation matrix (p = 200)

σ2 g(u) Method MedME(Devi) C IC Under-fit Cor-fit Over-fit MedRASE(Devi)

ρ = 0.5 0.25 u2 SCAD 0.0088(0.0053) 3.000 2.905 0.000 0.655 0.345 0.0971(0.0204)

LASSO 0.0382(0.0134) 3.000 21.180 0.000 0.010 0.990 0.1105(0.0291)

PC-PR 0.0075(0.0042) 3.000 0.135 0.000 0.865 0.135 0.0967(0.0209)

TPC-PR 0.0065(0.0037) 3.000 0.000 0.000 1.000 0.000 0.0970(0.0206)

TPC-PR(EBIC) 0.0065(0.0037) 3.000 0.000 0.000 1.000 0.000 0.0970(0.0206)

sin(2πu) SCAD 0.0085(0.0050) 3.000 2.910 0.000 0.645 0.355 0.0919(0.0234)

LASSO 0.0529(0.0160) 3.000 23.235 0.000 0.000 1.000 0.1504(0.0341)

PC-PR 0.0118(0.0067) 2.985 0.105 0.015 0.900 0.085 0.1396(0.0310)

TPC-PR 0.0113(0.0061) 2.980 0.030 0.020 0.975 0.005 0.1378(0.0315)

TPC-PR(EBIC) 0.0109(0.0057) 3.000 0.010 0.000 0.995 0.005 0.1375(0.0301)

1 u2 SCAD 0.0334(0.0191) 3.000 3.685 0.000 0.555 0.445 0.1807(0.0453)

LASSO 0.1498(0.0501) 3.000 24.675 0.000 0.005 0.995 0.2158(0.0547)

PC-PR 0.0439(0.0238) 3.000 0.575 0.000 0.510 0.490 0.1926(0.0468)

TPC-PR 0.0245(0.0134) 2.995 0.030 0.005 0.970 0.025 0.1921(0.0449)

TPC-PR(EBIC) 0.0240(0.0128) 3.000 0.000 0.000 1.000 0.000 0.1921(0.0441)

sin(2πu) SCAD 0.0334(0.0182) 3.000 3.680 0.000 0.550 0.450 0.2050(0.0416)

LASSO 0.1495(0.0493) 3.000 24.765 0.000 0.005 0.995 0.2355(0.0558)

PC-PR 0.0435(0.0239) 3.000 0.570 0.000 0.515 0.485 0.2128(0.0435)

TPC-PR 0.0246(0.0133) 2.995 0.030 0.005 0.970 0.025 0.2121(0.0421)

TPC-PR(EBIC) 0.0238(0.0126) 3.000 0.000 0.000 1.000 0.000 0.2118(0.0412)

ρ = 0.8 0.25 u2 SCAD 0.0100(0.0067) 3.000 1.715 0.000 0.675 0.325 0.1120(0.0309)

LASSO 0.0544(0.0212) 3.000 19.985 0.000 0.000 1.000 0.1739(0.0566)

PC-PR 0.0118(0.0072) 3.000 0.160 0.000 0.845 0.155 0.1128(0.0296)

TPC-PR 0.0104(0.0060) 2.990 0.020 0.010 0.980 0.010 0.1124(0.0283)

TPC-PR(EBIC) 0.0104(0.0059) 3.000 0.010 0.000 0.990 0.010 0.1115(0.0282)

sin(2πu) SCAD 0.0098(0.0068) 3.000 1.825 0.000 0.665 0.335 0.1240(0.0286)

LASSO 0.0561(0.0206) 3.000 20.725 0.000 0.000 1.000 0.2041(0.0596)

PC-PR 0.0117(0.0069) 3.000 0.160 0.000 0.840 0.160 0.1267(0.0255)

TPC-PR 0.0107(0.0059) 2.990 0.020 0.010 0.980 0.010 0.1285(0.0273)

TPC-PR(EBIC) 0.0107(0.0058) 3.000 0.010 0.000 0.990 0.010 0.1277(0.0264)

1 u2 SCAD 0.0387(0.0246) 3.000 2.200 0.000 0.590 0.410 0.2114(0.0649)

LASSO 0.2177(0.0799) 3.000 23.035 0.000 0.000 1.000 0.3344(0.1040)

PC-PR 0.0496(0.0283) 2.990 0.395 0.010 0.650 0.340 0.2174(0.0485)

TPC-PR 0.0390(0.0238) 2.980 0.055 0.020 0.945 0.035 0.2159(0.0482)

TPC-PR(EBIC) 0.0376(0.0215) 3.000 0.025 0.000 0.980 0.020 0.2153(0.0484)

sin(2πu) SCAD 0.0357(0.0238) 3.000 2.255 0.000 0.605 0.395 0.2264(0.0543)

LASSO 0.2228(0.0839) 3.000 23.220 0.000 0.000 1.000 0.3623(0.1055)

PC-PR 0.0494(0.0289) 2.990 0.395 0.010 0.650 0.340 0.2381(0.0553)

TPC-PR 0.0394(0.0240) 2.980 0.060 0.020 0.940 0.040 0.2414(0.0560)

TPC-PR(EBIC) 0.0371(0.0213) 3.000 0.025 0.000 0.980 0.020 0.2407(0.0571)

Table 4.6 provides the results for mixture normal samples with AR correlation structure. Compared to the other four methods, LASSO has the largest model error and square root of average squared error, as LASSO includes some insignificant variables in the final model.

For mixture normal samples, PC-PR and TPC-PR perform very differently. The correct fitting percentages of TPC-PR and TPC-PR(EBIC) are much higher than that of PC-PR. This is because TPC-PR uses the correct variance for the sample marginal and partial correlations, whereas PC-PR pretends the samples are from a normal distribution. Because the kurtosis of the mixture normal distribution we choose is around 1.5, the threshold from PC-PR is much smaller than that from TPC-PR, so PC-PR includes more predictors in its final model than TPC-PR; equivalently, PC-PR results in an over-fitted model. This is consistent with what we found in Table 4.6: the over-fitting percentage of PC-PR is more than 45% when σ² = 1 and ρ = 0.5.

Example 3 (High dimension): In this example, we simulated 200 datasets, each consisting of 300 observations, from the model

$$y = g(u) + x^T\beta + \varepsilon, \qquad (4.38)$$
where the random error $\varepsilon$ has variance $\sigma^2$.

The number of covariates in this example is set to 500, and it is treated as a high dimensional example. The number of nonzero coefficients is 3; that is, peff = 3. The indices of the nonzero coefficients are 1, 2 and 5, and $\beta = (3, 1.5, 0, 0, 2, 0, \cdots, 0)^T$. Other settings are the same as in Example 1.

Table 4.7: Normal samples with AR correlation structure (p=500)

σ2 g(u) Method MedME(Devi) C IC Under-fit Cor-fit Over-fit MedRASE(Devi)

ρ = 0.5 0.25 u2 SCAD 0.0028(0.0017) 3.000 1.055 0.000 0.865 0.135 0.0619(0.0153)

LASSO 0.0197(0.0065) 3.000 11.220 0.000 0.025 0.975 0.0813(0.0159)

PC-PR 0.0022(0.0013) 3.000 0.000 0.000 1.000 0.000 0.0624(0.0148)

TPC-PR 0.0022(0.0013) 3.000 0.000 0.000 1.000 0.000 0.0624(0.0148)

TPC-PR(EBIC) 0.0022(0.0013) 3.000 0.000 0.000 1.000 0.000 0.0624(0.0148)

sin(2πu) SCAD 0.0027(0.0017) 3.000 1.010 0.000 0.855 0.145 0.0845(0.0149)

LASSO 0.0204(0.0067) 3.000 10.490 0.000 0.035 0.965 0.1076(0.0164)

PC-PR 0.0021(0.0014) 3.000 0.000 0.000 1.000 0.000 0.0845(0.0148)

TPC-PR 0.0021(0.0014) 3.000 0.000 0.000 1.000 0.000 0.0845(0.0148)

TPC-PR(EBIC) 0.0021(0.0014) 3.000 0.000 0.000 1.000 0.000 0.0845(0.0148)

1 u2 SCAD 0.0105(0.0057) 3.000 1.080 0.000 0.675 0.325 0.1267(0.0303)

LASSO 0.0760(0.0275) 3.000 12.415 0.000 0.020 0.980 0.1594(0.0332)

PC-PR 0.0103(0.0059) 3.000 0.040 0.000 0.960 0.040 0.1257(0.0298)

TPC-PR 0.0103(0.0059) 3.000 0.040 0.000 0.960 0.040 0.1257(0.0298)

TPC-PR(EBIC) 0.0087(0.0052) 3.000 0.000 0.000 1.000 0.000 0.1257(0.0298)

sin(2πu) SCAD 0.0102(0.0056) 3.000 1.225 0.000 0.680 0.320 0.1498(0.0273)

LASSO 0.0773(0.0262) 3.000 11.925 0.000 0.020 0.980 0.1893(0.0310)

PC-PR 0.0098(0.0059) 3.000 0.035 0.000 0.965 0.035 0.1500(0.0274)

TPC-PR 0.0098(0.0060) 3.000 0.040 0.000 0.960 0.040 0.1500(0.0274)

TPC-PR(EBIC) 0.0084(0.0051) 3.000 0.000 0.000 1.000 0.000 0.1500(0.0274)

ρ = 0.8 0.25 u2 SCAD 0.0037(0.0022) 3.000 0.565 0.000 0.915 0.085 0.0682(0.0159)

LASSO 0.0362(0.0132) 3.000 15.680 0.000 0.000 1.000 0.1445(0.0360)

PC-PR 0.0035(0.0020) 3.000 0.005 0.000 0.995 0.005 0.0679(0.0160)

TPC-PR 0.0035(0.0020) 3.000 0.005 0.000 0.995 0.005 0.0679(0.0160)

TPC-PR(EBIC) 0.0035(0.0020) 3.000 0.000 0.000 1.000 0.000 0.0679(0.0160)

sin(2πu) SCAD 0.0035(0.0021) 3.000 0.560 0.000 0.915 0.085 0.0875(0.0179)

LASSO 0.0392(0.0130) 3.000 15.100 0.000 0.000 1.000 0.1723(0.0340)

PC-PR 0.0033(0.0019) 3.000 0.005 0.000 0.995 0.005 0.0875(0.0176)

TPC-PR 0.0033(0.0019) 3.000 0.005 0.000 0.995 0.005 0.0875(0.0176)

TPC-PR(EBIC) 0.0033(0.0019) 3.000 0.000 0.000 1.000 0.000 0.0875(0.0176)

1 u2 SCAD 0.0129(0.0072) 3.000 0.815 0.000 0.715 0.285 0.1338(0.0338)

LASSO 0.1470(0.0512) 3.000 16.055 0.000 0.000 1.000 0.2903(0.0741)

PC-PR 0.0133(0.0076) 3.000 0.045 0.000 0.960 0.040 0.1337(0.0354)

TPC-PR 0.0133(0.0076) 3.000 0.045 0.000 0.960 0.040 0.1337(0.0354)

TPC-PR(EBIC) 0.0123(0.0072) 3.000 0.000 0.000 1.000 0.000 0.1337(0.0335)

sin(2πu) SCAD 0.0133(0.0070) 3.000 0.880 0.000 0.690 0.310 0.1537(0.0336)

LASSO 0.1452(0.0509) 3.000 16.310 0.000 0.000 1.000 0.3242(0.0694)

PC-PR 0.0129(0.0073) 3.000 0.045 0.000 0.960 0.040 0.1542(0.0344)

TPC-PR 0.0129(0.0073) 3.000 0.045 0.000 0.960 0.040 0.1542(0.0344)

TPC-PR(EBIC) 0.0122(0.0069) 3.000 0.000 0.000 1.000 0.000 0.1537(0.0339)

Table 4.8: Mixture normal samples with AR correlation structure (p=500)

σ2 g(u) Method MedME(Devi) C IC Under-fit Cor-fit Over-fit MedRASE(Devi)

ρ = 0.5 0.25 u2 SCAD 0.0074(0.0042) 3.000 1.245 0.000 0.640 0.360 0.0917(0.0247)

LASSO 0.0446(0.0120) 3.000 32.170 0.000 0.000 1.000 0.1027(0.0267)

PC-PR 0.0096(0.0051) 3.000 0.260 0.000 0.760 0.240 0.0893(0.0255)

TPC-PR 0.0066(0.0040) 3.000 0.005 0.000 0.995 0.005 0.0895(0.0250)

TPC-PR(EBIC) 0.0066(0.0040) 3.000 0.000 0.000 1.000 0.000 0.0895(0.0251)

sin(2πu) SCAD 0.0072(0.0040) 3.000 1.215 0.000 0.650 0.350 0.1117(0.0204)

LASSO 0.0649(0.0213) 3.000 28.750 0.000 0.005 0.995 0.1541(0.0322)

PC-PR 0.0142(0.0100) 3.000 0.220 0.000 0.785 0.215 0.1351(0.0279)

TPC-PR 0.0117(0.0076) 2.990 0.015 0.010 0.990 0.000 0.1340(0.0281)

TPC-PR(EBIC) 0.0114(0.0072) 3.000 0.000 0.000 1.000 0.000 0.1340(0.0278)

1 u2 SCAD 0.0342(0.0182) 3.000 4.965 0.000 0.535 0.465 0.1803(0.0525)

LASSO 0.1736(0.0430) 3.000 42.945 0.000 0.000 1.000 0.1913(0.0488)

PC-PR 0.0605(0.0274) 3.000 0.935 0.000 0.290 0.710 0.1763(0.0513)

TPC-PR 0.0248(0.0134) 3.000 0.040 0.000 0.960 0.040 0.1737(0.0513)

TPC-PR(EBIC) 0.0241(0.0134) 3.000 0.005 0.000 0.995 0.005 0.1737(0.0508)

sin(2πu) SCAD 0.0329(0.0170) 3.000 4.680 0.000 0.555 0.445 0.2010(0.0395)

LASSO 0.1748(0.0445) 3.000 41.800 0.000 0.000 1.000 0.2172(0.0408)

PC-PR 0.0614(0.0277) 3.000 0.920 0.000 0.295 0.705 0.1991(0.0402)

TPC-PR 0.0257(0.0140) 3.000 0.040 0.000 0.960 0.040 0.1968(0.0402)

TPC-PR(EBIC) 0.0250(0.0136) 3.000 0.005 0.000 0.995 0.005 0.1947(0.0419)

ρ = 0.8 0.25 u2 SCAD 0.0100(0.0057) 3.000 1.185 0.000 0.640 0.360 0.1121(0.0316)

LASSO 0.0658(0.0204) 3.000 29.150 0.000 0.000 1.000 0.1915(0.0571)

PC-PR 0.0121(0.0067) 3.000 0.195 0.000 0.815 0.185 0.1089(0.0345)

TPC-PR 0.0107(0.0064) 2.995 0.015 0.005 0.985 0.010 0.1089(0.0332)

TPC-PR(EBIC) 0.0101(0.0060) 3.000 0.000 0.000 1.000 0.000 0.1084(0.0328)

sin(2πu) SCAD 0.0102(0.0060) 3.000 1.120 0.000 0.660 0.340 0.1272(0.0293)

LASSO 0.0902(0.0300) 3.000 32.095 0.000 0.000 1.000 0.2326(0.0555)

PC-PR 0.0150(0.0099) 2.975 0.215 0.025 0.800 0.175 0.1527(0.0313)

TPC-PR 0.0122(0.0084) 2.950 0.065 0.050 0.935 0.015 0.1529(0.0318)

TPC-PR(EBIC) 0.0117(0.0078) 2.995 0.020 0.005 0.980 0.015 0.1512(0.0304)

1 u2 SCAD 0.0459(0.0232) 3.000 4.160 0.000 0.515 0.485 0.2121(0.0557)

LASSO 0.1736(0.0430) 3.000 42.945 0.000 0.000 1.000 0.1913(0.0488)

PC-PR 0.0605(0.0274) 3.000 0.935 0.000 0.290 0.710 0.1763(0.0513)

TPC-PR 0.0248(0.0134) 3.000 0.040 0.000 0.960 0.040 0.1737(0.0513)

TPC-PR(EBIC) 0.0241(0.0134) 3.000 0.005 0.000 0.995 0.005 0.1737(0.0508)

sin(2πu) SCAD 0.0456(0.0235) 3.000 4.120 0.000 0.540 0.460 0.2277(0.0504)

LASSO 0.2464(0.0676) 3.000 37.635 0.000 0.000 1.000 0.3521(0.0853)

PC-PR 0.0595(0.0307) 3.000 0.730 0.000 0.430 0.570 0.2317(0.0504)

TPC-PR 0.0343(0.0193) 2.970 0.055 0.030 0.945 0.020 0.2296(0.0550)

TPC-PR(EBIC) 0.0327(0.0176) 3.000 0.000 0.000 1.000 0.000 0.2290(0.0544)

Tables 4.7 and 4.8 correspond to normal samples and mixture normal samples, respectively. As expected, LASSO has the largest model errors and square roots of average squared errors, as LASSO yields an over-fitted model. For normal samples, shown in Table 4.7, PC-PR and TPC-PR have the same performance, and they outperform SCAD. This is because, for normal samples, PC-PR and TPC-PR employ similar limiting distributions for the sample marginal and partial correlations.

For mixture normal samples, however, TPC-PR and TPC-PR(EBIC) outperform PC-PR, because TPC-PR adjusts the variance of the sample marginal and partial correlations while PC-PR ignores the non-normality of the samples. PC-PR makes the wrong assumption about the distribution of the samples, and that is why it results in over-fitted models. Focusing on the last scenario in Table 4.8 with ρ = 0.5, the correct fitting percentage of PC-PR is 29.5%, while that of TPC-PR is 96%. This indicates that TPC-PR performs much better than PC-PR.

4.5.2 Real Data: Istanbul Stock Exchange Data

In this section, we apply the thresholded partial correlation on partial residuals approach to the Istanbul Stock Exchange data from June 5, 2009 to February 22, 2011. The purpose of this study is to examine the relation between the returns of the Istanbul Stock Exchange and a number of other international indices, such as the Standard and Poor's 500 index and the stock market return index of Germany. The data were collected on working days from June 5, 2009 to February 22, 2011, and include the returns of the Istanbul Stock Exchange together with seven international indices: the Standard and Poor's 500 index (SP), the stock market return index of Germany (DAX), the stock market return index of the UK (FTSE), the stock market return index of Japan (NIKKEI), the stock market return index of Brazil (BOVESPA), the MSCI European index (MSCI-EU), and the MSCI emerging markets index (MSCI-EM).

In Figure 4.1, the first seven plots indicate that the Istanbul Stock Exchange index is linearly correlated with these seven international indices, while the last plot shows a nonlinear relation between the Istanbul Stock Exchange index and the date. Hence, we fit a partially linear model for the Istanbul Stock Exchange index, with a linear relation to SP, DAX, FTSE, NIKKEI, BOVESPA, MSCI-EU and MSCI-EM, and a nonlinear relation to time.


Figure 4.1: Istanbul Stock Exchange index against the predictors. The first seven plots show the Istanbul Stock Exchange index against the seven other international indices: SP, DAX, FTSE, NIKKEI, BOVESPA, MSCI-EU and MSCI-EM. The last plot shows the Istanbul Stock Exchange index against the date.

We divide the data into two parts: 80% of the data is used for training to obtain the estimates, while the remaining 20% is used to assess the performance of the different approaches. We first utilize the partial residuals approach to obtain a linear model; for the resulting linear model, we obtain the estimates of the coefficients based on ordinary least squares (OLS), the SCAD penalty (SCAD), the LASSO penalty (LASSO), the PC-simple algorithm on partial residuals (PC-PR), the thresholded partial correlation on partial residuals (TPC-PR), and the thresholded partial correlation on partial residuals with EBIC (TPC-PR(EBIC)). For the SCAD penalty, we apply five-fold cross-validation to choose the regularization parameter; for PC-PR, TPC-PR and TPC-PR(EBIC), we set the significance level α to 0.05. The kurtosis of this data set is estimated to be 0.6888, and according to Table 4.9, PC-PR, TPC-PR and TPC-PR(EBIC) yield the same estimated active set and estimated coefficients.
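The training/test evaluation just described can be sketched as follows; `ise` is assumed to be a data frame containing the ISE returns, the seven index returns and a date index, and `fit_tpc_pr()` is a hypothetical wrapper for whichever fitting routine is being evaluated, so the column names and the helper are placeholders rather than part of the actual analysis code.

```r
set.seed(1)
n <- nrow(ise)
train_id <- sample(n, floor(0.8 * n))          # 80% training, 20% test
train <- ise[train_id, ]
test  <- ise[-train_id, ]

idx_cols <- c("SP", "DAX", "FTSE", "NIKKEI", "BOVESPA", "MSCI_EU", "MSCI_EM")

# fit_tpc_pr() is a hypothetical wrapper: it forms partial residuals with
# respect to the date, screens the index returns at alpha = 0.05 and returns
# the coefficient vector beta and the fitted nonparametric curve g_hat().
fit <- fit_tpc_pr(y = train$TL_ISE, X = as.matrix(train[, idx_cols]),
                  u = train$date, alpha = 0.05)

# Prediction error: sum of squared differences between fitted and observed
# values on the held-out 20% (as defined for Table 4.10).
pred <- fit$g_hat(test$date) + as.matrix(test[, idx_cols]) %*% fit$beta
prediction_error <- sum((test$TL_ISE - pred)^2)
```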

Table 4.9: Estimated coefficients

Predictor OLS SCAD LASSO PC-PR TPC-PR TPC-PR(EBIC)
SP 0.0310 0.0000 0.0000 0.0000 0.0000 0.0000
DAX -0.0967 0.0000 0.0000 0.0000 0.0000 0.0000
FTSE -0.0819 0.0000 0.0000 0.0000 0.0000 0.0000
NIKKEI -0.0378 0.0000 -0.0018 0.0000 0.0000 0.0000
BOVESPA -0.1380 -0.1074 -0.0741 0.0000 0.0000 0.0000
MSCI-EU 0.6192 0.4683 0.4074 0.4350 0.4350 0.4350
MSCI-EM 0.3674 0.3290 0.2527 0.2798 0.2798 0.2798


Figure 4.2: Estimated nonparametric relation between ISE and time. The black line is the estimated nonparametric curve from least squares; the red line is that from the SCAD penalty; the green line is that from LASSO; the blue line is that from PC-PR; the dark turquoise line is that from TPC-PR; and the pink line is that from TPC-PR(EBIC).

Table 4.9 provides the estimated coefficients of all linear predictors. All seven predictors are identified by least squares; this is expected, because, to attenuate the model bias, least squares tends to include all the predictors. SCAD gives nonzero coefficients to BOVESPA, MSCI-EU and MSCI-EM. The LASSO approach does not include SP, DAX and FTSE in its final model. PC-PR, TPC-PR and TPC-PR(EBIC) select the same set of predictors. The absolute values of the coefficients from LASSO and SCAD are smaller than those from OLS; this is consistent with the fact that LASSO and SCAD selectively shrink the coefficients and set some of them to zero. Meanwhile, Figure 4.2 shows that all the approaches yield similar estimates of the nonparametric curve.

The reason that PC-PR, TPC-PR and TPC-PR(EBIC) have the same performance is that all the x-variables and the response appear to be normally distributed, as suggested by Figure 4.3. None of the variables is noticeably skewed, which explains why the estimated kurtosis is very close to 0 and why PC-PR, TPC-PR and TPC-PR(EBIC) perform similarly. Meanwhile, the distribution of the residuals, shown in Figure 4.4, is approximately normal. The density plots indicate that this data set satisfies the normality assumption, and this is the main reason that PC-PR and TPC-PR perform the same.


Figure 4.3: The first seven plots are the density plots of the seven international indices: SP, DAX, FTSE, NIKKEI, BOVESPA, MSCI-EU and MSCI-EM. The last plot is the density plot of the Istanbul Stock Exchange index.

PC-PR, TPC-PR and TPC-PR(EBIC) yield more interpretable models than the other approaches, and parsimonious models do not necessarily lead to larger model errors or prediction errors. In Table 4.10, "Size" is the number of nonzero estimated coefficients, and "Prediction Error" is the sum of squared differences between the fitted values and the actual observed values on the test data set. LASSO has a slightly smaller prediction error than SCAD and the least squares approach, but a larger prediction error than PC-PR, TPC-PR and TPC-PR(EBIC). This implies that PC-PR, TPC-PR and TPC-PR(EBIC) can select the most important features while ignoring less important ones without incurring a large prediction error.

The TPC-PR approach selects only two of the seven linear predictors, and it yields a more parsimonious model than least squares. The correlation matrix of the covariates, provided in Table 4.11, hints at the reason. The correlation between MSCI-EU and DAX is 0.9364, which indicates that these two indices are highly correlated. If TPC-PR already includes MSCI-EU in the estimated active set, then the partial correlation between the response and DAX, given that MSCI-EU is already in the model, will be small. This explains why TPC-PR does not select DAX in the final model. In that sense, our proposal TPC-PR favors parsimonious models.


Figure 4.4: Distribution of residuals from TPC-PR.

Table 4.10: Comparison of the performances. The size is defined as the number of predictors in the final model. The prediction error is defined as the sum of squared differences between the estimated values and the actual observed values on the test data.

Approach Size Prediction Error
OLS 7 50.2199
SCAD 3 50.2736
LASSO 4 50.1308
PC-PR 2 49.4520
TPC-PR 2 49.4520
TPC-PR(EBIC) 2 49.4520

Table 4.11: Correlation matrix of the covariates

SP DAX FTSE NIKKEI BOVESPA MSCI-EU MSCI-EM
SP 1.0000 0.6858 0.6577 0.1313 0.7221 0.6876 0.5282
DAX 0.6858 1.0000 0.8674 0.2586 0.5858 0.9364 0.6652
FTSE 0.6577 0.8674 1.0000 0.2553 0.5963 0.9490 0.6875
NIKKEI 0.1313 0.2586 0.2553 1.0000 0.1728 0.2838 0.5473
BOVESPA 0.7221 0.5858 0.5963 0.1728 1.0000 0.6217 0.6881
MSCI-EU 0.6876 0.9364 0.9490 0.2838 0.6217 1.0000 0.7165
MSCI-EM 0.5282 0.6652 0.6875 0.5473 0.6881 0.7165 1.0000
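The explanation of why TPC-PR drops DAX once MSCI-EU is selected can be made concrete with the first-order partial correlation formula. In the sketch below, the two marginal correlations of ISE with DAX and MSCI-EU are hypothetical round numbers chosen for illustration, while the DAX/MSCI-EU correlation 0.9364 comes from Table 4.11.

```r
# First-order partial correlation of the response with DAX given MSCI-EU:
# rho(y, DAX | EU) = {r(y,DAX) - r(y,EU) * r(DAX,EU)} /
#                    sqrt({1 - r(y,EU)^2} * {1 - r(DAX,EU)^2})
r_y_dax  <- 0.55    # hypothetical marginal correlation of ISE with DAX
r_y_eu   <- 0.60    # hypothetical marginal correlation of ISE with MSCI-EU
r_dax_eu <- 0.9364  # correlation of DAX with MSCI-EU, from Table 4.11

pcor_dax <- (r_y_dax - r_y_eu * r_dax_eu) /
  sqrt((1 - r_y_eu^2) * (1 - r_dax_eu^2))
pcor_dax  # about -0.04: DAX carries little extra information once MSCI-EU is included
```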

4.6 Conclusion

In this chapter, we have studied how to apply the thresholded partial correlation approach to select significant variables in partially linear models. It is not easy to identify the important predictors in partially linear models, as the problem involves selection of the bandwidth, estimation of the nonparametric function and selection of the important covariates. First, we employ the partial residuals approach to obtain the smoothed response and covariates; this step transforms the original partially linear model, approximately, into a linear model. Following that, we apply the thresholded partial correlation approach described in Chapter 3 to identify the significant variables. We develop this approach under elliptical distributions, because we want to investigate its performance when the samples are not normally distributed. Meanwhile, we establish the asymptotic consistency of variable selection in the linear part and the asymptotic normality of the estimator of the nonparametric function. Furthermore, the simulation studies and real data analysis show that the thresholded partial correlation on partial residuals approach performs comparably to the penalized least squares approach on partial residuals with the SCAD penalty.

4.7 Lemmas and Technical Proofs

4.7.1 Lemmas

Lemma 4.9. We adopt the following notation for simplicity:

n n 1 X 1 X ui − u Z (u, h) = K (u − u),Z (u, h) = K (u − u), 1 n h i 2 n h h i i=1 i=1 n  2 n 1 X ui − u 1 X Z (u, h) = K (u − u),Z (u, h) = x K (u − u), 3 n h h i 4 n ij h i i=1 i=1 n n 1 X ui − u 1 X Z (u, h) = x K (u − u),Z (u, h) = y K (u − u), 5 n ij h h i 6 n i h i i=1 i=1 n 1 X ui − u Z (u, h) = y K (u − u). 7 n i h h i i=1

Then under conditions (D3)-(D7), h → 0, and hh3 → +∞, for some small s > 0, we have the following results:

s n R1. supu∈[a,b] P {|Z1(u, h) − f(u)| > } ≤ 4(1 − 4 ) ,

n R +∞ o s n R2. supu∈[a,b] P |Z2(u, h) − f(u) −∞ tK(t)dt| >  ≤ 4(1 − 4 ) ,

n R +∞ 2 o s n R3. supu∈[a,b] P |Z3(u, h) − f(u) −∞ t K(t)dt| >  ≤ 4(1 − 4 ) ,

s n R4. supu∈[a,b] P {|Z4(u, h) − f(u)E(xj|u)| > } ≤ 4(1 − 4 ) ,

n R +∞ o s n R5. supu∈[a,b] P |Z5(u, h) − f(u)E(xj|u) −∞ tK(t)dt| >  ≤ 4(1 − 4 ) , 145

s n R6. supu∈[a,b] P {|Z6(u, h) − f(u)E(y|u)| > } ≤ 4(1 − 4 ) ,

n R +∞ o s n R7. supu∈[a,b] P |Z7(u, h) − f(u)E(y|u) −∞ tK(t)dt| >  ≤ 4(1 − 4 ) .

Proof. Please refer to Liu et al. (2013) for the proof.

Lemma 4.10. Assume A(u) and B(u) are two uniformly bounded functions of u.

That is, there exist M4 and M5, such that

sup |A(u)| ≤ M4, sup |B(u)| ≤ M5. u∈U u∈U

For any given u, Aˆ(u) and Bˆ(u) are estimates of A(u) and B(u) based on n samples. Suppose there exist C, C1, and M, such that

 2  ˆ C2 n q sup P {|A(u) − A(u)| > } ≤ C1n (1 − 4q ) + exp(−C3n ) , u∈U n  2  ˆ C2 n q sup P {|B(u) − B(u)| > } ≤ C1n (1 − 4q ) + exp(−C3n ) . u∈U n

Then

 2  ˆ ˆ C2 n q sup P { A(u)B(u) − A(u)B(u) > } ≤ C1n (1 − 4q ) + exp(−C3n ) . u∈U n

Furthermore, if infu∈U |B(u)| ≥ δ3 > 0, then

 2  ˆ ˆ C2 n q sup P { A(u)/B(u) − A(u)/B(u) > } ≤ C1n (1 − 4q ) + exp(−C3n ) . u∈U n q  2  ˆ p C2 n q sup P { B(u) − B(u) > } ≤ C1n (1 − 4q ) + exp(−C3n ) . u∈U n

Proof. Refer to Liu et al. (2013) for the proof. 146

Lemma 4.11. Define

Z3(u, h)Z4(u, h) − Z2(u, h)Z5(u, h) Wj(u, h) = , Z1(u, h)Z3(u, h) − Z2(u, h)Z2(u, h) Z (u, h)Z (u, h) − Z (u, h)Z (u, h) V (u, h) = 3 6 2 7 . Z1(u, h)Z3(u, h) − Z2(u, h)Z2(u, h)

Then under condition (D3)-(D7), h → 0, and hh3 → +∞, for some small s > 0, we have

s n R8. supu∈[a,b] P {|Wj(u, h) − E(xj|u)| > } ≤ 4(1 − 4 ) .

s n R9. supu∈[a,b] P {|V (u, h) − E(y|u)| > } ≤ 4(1 − 4 ) .

Remark: By Cauchy-Schwarz inequality,

( n )2 ( n )2 X ui − u X ui − u Z (u, h)Z (u, h) = K (u − u) = pK (u − u) ∗ pK (u − u) 2 2 h h i h i h h i i=1 i=1 n n 2  2 X n o X ui − u ≤ pK (u − u) ∗ pK (u − u) h i h h i i=1 i=1 = Z1(u, h)Z3(u, h).

u1−u un−u The equality will hold if and only if h = ··· = h , which is impossible. Therefore, we have

Z2(u, h)Z2(u, h) < Z1(u, h)Z3(u, h).

This implies the denominator of Wj(u, h) and V (u, h) is positive.

4.7.2 Proof of Theorem 4.4 (First Half).

Proof. We divide the proof into 4 steps. Step 1: Weighted least squares. 147

Consider the partially linear model, y = g(u) + xT β + . Conditioning on u, we can obtain E(y|u) = g(u) + E(xT |u)β + E(|u). Noting that E(|u) = 0, then

y − E(y|u) = g(u) + {x − E(x|u)}T β + 

Now we aim to estimate my(u) = E(y|u) and mj(u) = E(xj|u), j = 1, ··· , p.

For ui in a small neighborhood of u, we can approximate my(u) locally by a linear function

0 my(u) ≈ my(u0) + my(u0)(ui − u0) ≡ b0 + b1(ui − u0), i = 1, ··· , n. (4.39)

Then the estimates of b0 and b1 can be obtained by minimizing the following weighted least squares:

n X 2 {yi − b0 − b1(ui − u0)} Khy (ui − u0), (4.40) i=1

−1 where Kh(·) = h K(·/h), and K(·) is the kernel function. Equivalently,

  ˆ n b0 X 2   = arg min {yi − b0 − b1(ui − u0)} Khy (ui − u0) ˆ b0,b1 hyb1 i=1 T −1 T = (D (u, hy)W (u, hy)D(u, hy)) D (u, hy)W (u, hy)y,

  u1−u 1 h    . .  where D(u, h) =  . . , and W (u, h) = diag{Kh(u1 −u), ··· ,Kh(un −u)}.   un−u 1 h  T −1 T  (1, 0)(D (u1, h)W (u1, h)D(u1, h)) D (u1, h)W (u1, h)    .  Let S(h) =  . ;   T −1 T (1, 0)(D (un, h)W (un, h)D(un, h)) D (un, h)W (un, h) 148 then   mˆy(u1)    .   .  = S(hy)y.   mˆy(un)

DT (u, h)W (u, h)D(u, h)     u1−u   Kh(u1 − u) 1 h 1 ··· 1  .   . .  =    ..   . .  u1−u un−u     h ··· h     un−u Kh(un − u) 1 h     Pn Pn ui−u i=1 Kh(ui − u) i=1 h Kh(ui − u) Z1(u, h) Z2(u, h) =   = n   . Pn ui−u Pn ui−u 2 i=1 h Kh(ui − u) i=1( h ) Kh(ui − u) Z2(u, h) Z3(u, h)

Then {DT (u, h)W (u, h)D(u, h)}−1   1 Z3(u, h) −Z2(u, h) =   . n {Z1(u, h)Z3(u, h) − Z2(u, h)Z2(u, h)} −Z2(u, h) Z1(u, h)

      Khy (u1 − u) y1 1 ··· 1     T  ..   .  D (u, hy)W (u, hy)y =    .   .  u1−u ··· un−u     hy hy Khy (un − u) yn

 Pn    i=1 xijKhy (ui − u) Z4(u, hy) =   = n   . Pn x ui−u K (u − u) Z (u, h ) i=1 ij hy hy i 5 y

T −1 T Then (1 0){D (u, hy)W (u, hy)D(u, hy)} D (u, hy)W (u, hy)y

Z3(u, hy)Z4(u, hy) − Z2(u, hy)Z5(u, hy) = = V (u, hy). Z1(u, hy)Z3(u, hy) − Z2(u, hy)Z2(u, hy)

Similarly, for j = 1, ··· , p, 149

  mˆj(u1)    .   .  = S(hj)Xj.   mˆj(un)

T −1 T Similarly, we can get (1 0){D (u, hj)W (u, hj)D(u, hj)} D (u, hj)W (u, hj)Xj

Z3(u, hj)Z6(u, hj) − Z2(u, hj)Z7(u, hj) = = Wj(u, hj). Z1(u.hj)Z3(u.hj) − Z2(u, hj)Z2(u, hj)

    x1j − Wj(u1, hj) y1 − V (u1, hy)         Therefore, {I − S(hj)}Xj =  ···  , {I − S(hy)}y =  ···  .     xnj − Wj(un, hj) yn − V (un, hy)

Step 2: For  > 0, study

  1 P h(I − S(hy))y, (I − S(hj))Xji − cov {y − E(y|u), xj − E(xj|u)} >  . n

n 1 1 X h(I − S(h ))y, (I − S(h ))X i = (y − V (u , h ))(x − W (u , h )) n y j j n i i y ij j i j i=1 n n n n 1 X 1 X 1 X 1 X = y x − y W (u , h ) − x V (u , h ) + V (u , h )W (u , h ), n i ij n i j i j n ij i y n i y j i j i=1 i=1 i=1 i=1

and cov{y − E(y|u), xj − E(xj|u)} = E(yxj) − E{E(y|u)E(xj|u)}. Note that

E(E(y|u)xj) = E{E(y|u)E(xj|u)} and E(yE(xj|u)) = E{E(y|u)E(xj|u)}; then

1 I h(I − S(h ))y, (I − S(h ))X i − cov{y − E(y|u), x − E(x |u)} , n y j j j j ( n ) ( n ) 1 X 1 X = y x − E(yx ) − y W (u , h ) − E(yE(x |u)) n i ij j n i j i j j i=1 i=1 ( n ) ( n ) 1 X 1 X − x V (u , h ) − E(x E(y|u)) + V (u , h )W (u , h ) − E{E(y|u)E(x |u)} n ij i y j n i y j i j j i=1 i=1 , I1 − I2 − I3 + I4. 150

For I1: For some large M,

( n ) 1 X P ( I1 > ) = P yixij − E(yxj) >  n i=1 ( n ) 1 X = P yixij − E(yxj) > , |yi| ≤ M, and |xij| ≤ M, for all i n i=1 ( n ) 1 X +P yixij − E(yxj) > , |yi| > M, or |xij| > M, for some i n i=1 ( n ) 1 X ≤ P yixij − E(yxj) > , |yi| ≤ M, |xij| ≤ M, for all i n i=1 + nP (|yi| > M) + nP (|xij| > M)

, I11 + nI12 + nI13.

n22 n2 By Hoeffding’s inequality, I11 ≤ 2 exp{− Pn 2 2 } = 2 exp{− 4 }. From (D2) i=1(2M ) 4M and Lemma 3.17, there exist m1 and m2, such that

I12 = P (|yi| > M) ≤ m1 exp(−m2M),I13 = P (|xij| > M) ≤ m1 exp(−m2M).

Thus n2 P ( I1 > ) ≤ 2 exp{− } + 2nm1 exp(−m2M). (4.41) 4M 4

For I2: For the same large M,

n 1 X I = y W (u , h ) − E(yE(x |u)) 2 n i j i j j i=1 " n n # " n # 1 X 1 X 1 X = y W (u , h ) − y E(x |u ) + y E(x |u ) − E(yE(x |u)) I + I . n i j i j n i j i n i j i j , 21 22 i=1 i=1 i=1

( n n ) 1 X 1 X P ( I21 > /2) = P yiWj(ui, hj) − yiE(xj|ui) > /2 n n i=1 i=1 151

( n ) X = P yi{Wj(ui, hj) − E(xj|ui)} > n/2

i=1 ( n ) X ≤ P yi Wj(ui, hj) − E(xj|ui) > n/2 i=1  ≤ nP yi Wj(ui, hj) − E(xj|ui) > /2  = nP yi Wj(ui, hj) − E(xj|ui) > /2, Wj(ui, hj) − E(xj|ui) > /2M  +nP yi Wj(ui, hj) − E(xj|ui) > /2, Wj(ui, hj) − E(xj|ui) ≤ /2M   ≤ nP Wj(u,hj) − E(xj|ui) > /2M + nP yi > M .

 s n By (R9), we have P Wj(u,hj) − E(xj|ui) > /2M ≤ 4(1 − 8M ) ; then

s P (|I | > /2) ≤ 4n(1 − )n + nm exp(−m M). 21 8M 1 2

n 1 X P ( I22 > /2) ≤ P ( yiE(xj|ui) − E(yE(xj|u)) > /2) n i=1 n 1 X ≤ P ( yiE(xj|ui) − E(yE(xj|u)) > /2, |yi| ≤ M, for all i) n i=1 n 1 X +P ( yiE(xj|ui) − E(yE(xj|u)) > /2, |yi| > M, for some i) n i=1 n 1 X = P ( yiE(xj|ui) − E(yE(xj|u)) > /2, |yi| ≤ M, ∀i) + nP (|yi| > M) n i=1 , I221 + I222.

According to (D7): supu∈[a,b] |E(xj|u)| < +∞, we can let M large enough such that |E(xj|u)| ≤ M. Then by Hoeffding’s inequality,

 2(/2)2n2   n2  I ≤ 2 exp − = 2 exp − . 221 Pn 2 2 4 i=1(2M ) 8M

Again, from (D2) and Lemma 3.17, we have I222 ≤ nm1 exp(−m2M). 152

Thus  n2  P (|I | > /2) ≤ 2 exp − + nm exp(−m M). 22 8M 4 1 2

Therefore,

P (|I2| > ) ≤ P (|I21| > /2) + P (|I22| > /2) s  n2  ≤ 4n(1 − )n + nm exp(−m M) + 2 exp − + nm exp(−m M) 8M 1 2 8M 4 1 2 s  n2  = 4n(1 − )n + 2 exp − + 2nm exp(−m M). (4.42) 8M 8M 4 1 2

For I3: For the same large M,

n 1 X I = x V (u , h ) − E(x E(y|u)) 3 n ij i y j i=1 " n n # " n # 1 X 1 X 1 X = x V (u , h ) − x E(y|u ) + x E(y|u ) − E(x E(y|u)) . n ij i y n ij i n ij i j i=1 i=1 i=1

Similar to the proof for I2, we have

s  n2  P (|I | > ) ≤ 4n(1 − )n + 2 exp − + 2nm exp(−m M). (4.43) 3 8M 8M 4 1 2

For I4: For the same large M (M > ),

n 1 X I = W (u h )V (u , h ) − E{E(x |u)E(y|u)} 4 n j , j i y j i=1 " n n # 1 X 1 X = W (u h )V (u , h ) − V (u , h )E(x |u ) n j , j i y n i y j i i=1 i=1 " n n # 1 X 1 X + V (u , h )E(x |u ) − E(y|u )E(x |u ) n i y j i n i j i i=1 i=1 " n # 1 X + E(y|u )E(x |u ) − E{E(x |u)E(y|u)} I + I + I . n i j i j , 41 42 43 i=1 153

P (|I41| > /3) " n # 1 X = P {Wj(ui, hj) − E(xj|ui)}V (ui, hy) > /3 n i=1

≤ nP ( Wj(ui, hj) − E(xj|ui) V (ui, hy) > /3)

= nP ( Wj(ui, hj) − E(xj|ui) V (ui, hy) > /3, Wj(ui, hj) − E(xj|ui) > /6M)

+ nP ( Wj(ui, hj) − E(xj|ui) V (ui, hy) > /3, Wj(ui, hj) − E(xj|ui) ≤ /6M)

≤ nP ( Wj(ui, hj) − E(xj|ui) > /6M) + nP ( V (ui, hy) > 2M) , I411 + I412.

s n From (R8), we have I411 = nP ( Wj(ui, hj) − E(xj|ui) > /6M) ≤ 4n(1 − 24M ) .

I412 = nP (|V (ui, hy)| > 2M) = nP (|V (ui, hy) − E(y|ui) + E(y|ui)| > 2M)

≤ nP (|V (ui, hy) − E(y|ui)| ≥ M) because |E(y|ui)| ≤ M s ≤ nP (|V (u , h ) − E(y|u )| > ) ≤ 4n(1 − )n. i y i 4

s s s Therefore, P (|I | > /3) ≤ 4n(1 − )n + 4n(1 − )n ≤ 8n(1 − )n. 41 24M 4 24M ( n n ) 1 X 1 X P (|I42| > /3) = P V (ui, hy)E(xj|ui) − E(y|ui)E(xj|ui) > /3 n n i=1 i=1 ≤ nP {|V (ui, hy) − E(y|ui)| |E(xj|ui)| > /3}

≤ nP {|V (ui, hy) − E(y|ui)| > /3M} ( sup |E(xj|ui)| ≤ M) u∈[a,b] s ≤ 4n(1 − )n ( by (R9)). 12M

As supu∈[a,b] |E(y|u)| ≤ M, and supu∈[a,b] |E(xj|u)| ≤ M, by Hoeffding’s inequality, we have ( n ) 1 X P (|I43| > /3) = P E(y|ui)E(xj|ui) − E{E(xj|u)E(y|u)} > /3 n i=1 n2(/3)2 n2 ≤ 2 exp{− } = 2 exp{− }. Pn 2 2 4 i=1(2M ) 36M 154

Therefore,

P (|I4| > ) ≤ P (|I41| > /3) + P (|I42| > /3) + P (|I43| > /3) s s n2 ≤ 8n(1 − )n + 4n(1 − )n + 2 exp{− } 24M 12M 36M 4 s n2 ≤ 12n(1 − )n + 2 exp{− }. (4.44) 24M 36M 4

Now combining (4.41), (4.42), (4.43), and (4.44), we have

  1 P h(I − S(hy))y, (I − S(hj))Xji − cov{y − E(y|u), xj − E(xj|u)} >  n

≤ P (|I1| > /4) + P (|I2| > /4) + P (|I3| > /4) + P (|I4| > /4) n2 ≤ 2 exp{− } + 2nm exp(−m M) 64M 4 1 2  s  n2   + 2 4n(1 − )n + 2 exp − + 2nm exp(−m M) 32M 128M 4 1 2 s n2 + 12n(1 − )n + 2 exp{− } 96M 576M 4 n2 s ≤ 8 exp{− } + 6nm exp(−m M) + 20n(1 − )n. 576M 4 1 2 96M

Let M = nq, 0 < q < 1/4, then

  1 P h(I − S(hy))y, (I − S(hj))Xji − cov{y − E(y|u), xj − E(xj|u)} >  n n1−4q2 s ≤ 8 exp{− } + 6nm exp(−m nq) + 20n(1 − )n. 576 1 2 96nq

x −x 0 1 −x Step 3: Take f(x) = 1 − 2 − e , x ≥ 0, then f (x) = − 2 + e ≥ 0, for x ∈ [0, log 2]. This implies f(x) is increasing on [0, log 2], and f(x) ≥ f(0) = 0. Thus

x −x 2 1 − 2 ≥ e , for x ∈ [0, log 2]. Since 576n4q ∈ [0, log 2], then

n2  2 n 2 8 exp{− } = 8 exp{− } ≤ 8(1 − )n. (4.45) 576n4q 576n4q 1152n4q 155

Therefore,

2 s P (|I| > ) ≤ 8(1 − )n + 6nm exp(−m nq) + 20n(1 − )n 1152n4q 1 2 96nq C2 ≤ 28n(1 − )n + 6nm exp(−m nq), (4.46) n4q 1 2 where C is a constant that does not depend on n. Therefore,

  1 P h(I − S(hy))y, (I − S(hj))Xji − cov{y − E(y|u), xj − E(xj|u)} >  n  C 2  ≤ C n (1 − 2 )n + exp(−C nq) . (4.47) 1 n4q 3

Step 4: Similarly, we can get the following results:

   2  1 2 C2 n q P k(I − S(hy))yk − var{y − E(y|u)} >  ≤ C1n (1 − ) + exp(−C3n ) , n n4q    2  1 2 C2 n q P k(I − S(hj))Xjk − var{xj − E(xj|u)} >  ≤ C1n (1 − ) + exp(−C3n ) . n n4q

By applying Lemma 4.10, we have

( ) h(I − S(h ))y, (I − S(h ))X i cov{y − E(y|u), x − E(x |u)} P y j j − j j >  1/2 k(I − S(hy))ykk(I − S(hj))Xjk [var{y − E(y|u)}var{xj − E(xj|u)}]  C 2  ≤ C n (1 − 2 )n + exp(−C nq) . (4.48) 1 n4q 3

Equivalently,

 2  ∗ ∗ ∗ ∗ C2 n q P { ρˆ(y , x ) − ρ(y , x ) > } ≤ C1n (1 − ) + exp(−C3n ) . (4.49) j j n4q

Note that 156

C 2 C n(1 − 2 )n = C exp{log(n) + n log(1 − C 2/n4q)} → 0, (4.50) 1 n4q 1 2 q q C1n exp(−C3n ) = C1 exp{log(n) − C3n } → 0, (4.51) as n → +∞, then

 C 2  C n (1 − 2 )n + exp(−C nq) → 0. 1 n4q 3

That is,

∗ ∗ P ∗ ∗ ρˆ(y , xj ) −→ ρ(y , xj ), which completes the proof of Theorem 4.4. Remark: Similarly, we can show for j 6= k, j, k = 1, ··· , p,

 2  ∗ ∗ ∗ ∗ C2 n q P { ρˆ(x , x ) − ρ(x , x ) > } ≤ C1n (1 − ) + exp(−C3n ) ; (4.52) j k j k n4q equivalently,

∗ ∗ P ∗ ∗ ρˆ(xj , xk) −→ ρ(xj , xk),

4.7.3 Proof of Theorem 4.5.

Proof. We divide the proof into 6 steps. Step 1: ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ρˆ(y , xj ) − ρˆ(y , xk)ˆρ(xj , xk) ρˆ(y , xj |xk) = (4.53)  2 ∗ ∗  2 ∗ ∗ 1/2 {1 − ρˆ (y , xk)} 1 − ρˆ (xj , xk)

∗ ∗ ∗ ∗ ∗ ∗ x−yz is a function ofρ ˆ(y , x ),ρ ˆ(y , x ), andρ ˆ(x , x ). Let g(x, y, z) = √ √ , and j k j k 1−y2 1−z2 x, y, z ∈ (−1, 1), then all the first and second derivatives are bounded from 1, given y and z are bounded from 1. 157

By Lemma 3.16,

 ∗ ∗ ∗ ∗ ∗ ∗ P ρˆ(y , xj |xk) − ρ(y , xj |xk) >    ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ = P g{ρˆ(y , x ), ρˆ(y , x ), ρˆ(x , x )} − g{ρ(y , x ), ρ(y , x ), ρ(x , x )} >  j k j k j k j k

  ∗ ∗   ∗ ∗    ρˆ(y , xj ) ρ(y , xj )        ≤ P  ∗ ∗  −  ∗ ∗  > C  ρˆ(y , xk)   ρ(y , xk)       2   ∗ ∗ ∗ ∗   ρˆ(xj , xk) ρ(xj , xk)  √ √ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ≤ P { ρˆ(y , xj ) − ρ(y , xj ) > C/ 3} + P { ρˆ(y , xk) − ρ(y , xk) > C/ 3} √ ∗ ∗ ∗ ∗ + P { ρˆ(xj , xk) − ρ(xj , xk) > C/ 3}  C 2  ≤ 3C n (1 − 2 )n + exp(−C nq) . 1 n4q 3

c Similarly, we can have for any S ⊆ {j} , and |S| ≤ peffn, we can have

 2   ∗ ∗ ∗ ∗ ∗ ∗ |S| C2 n q P ρˆ(y , x |x ) − ρ(y , x |x ) >  ≤ 3 C1n (1 − ) + exp(−C3n ) . j S j S n4q

Step 2: Under the assumption of elliptical distribution, we can use the sample version of the marginal kurtosis,

p  1 Pn ¯ 4  1 X (˜xij − x˜j) κˆ = n i=1 − 1 , n 3{ 1 Pn (˜x − x˜¯ )2}2 j=1 n i=1 ij j to estimate the kurtosis. Similar to the proof in Step 1, we can obtain the following inequality:

1−4q 2 q P {|κˆ − κ| > } ≤ C4 exp(−C3n  ) + C2n exp(−C1n ).

ˆ ∗ ∗ ∗ ∗ ∗ ∗ Zn(y ,xj |xS ) Zn(y ,xj |xS ) Step 3: Study: P ( √ − √ > ). 1+ˆκ 1+κ 158

√1 1+u  Define g2(u, v) = 2 1+v log 1−u , u ∈ (−1, 1), v ∈ (−1, +∞), then

ˆ ∗ ∗ ∗ ∗ ∗ ∗ Zn(y , xj |xS ) ∗ ∗ ∗ Zn(y , xj |xS ) ∗ ∗ ∗ √ = g2(ˆρn(y , x |x ), κˆ), and √ = g2(ρn(y , x |x ), κ). 1 +κ ˆ j S 1 + κ j S

All the first and second derivatives are continuous and bounded for u ∈ (−τ, τ), v ∈ (−δ, +∞). By Lemma 3.16, (D4) and (D6),

ˆ ∗ ∗ ∗ ∗ ∗ ∗ Zn(y , xj |xS ) Zn(y , xj |xS ) P ( √ − √ > ) 1 +κ ˆ 1 + κ       ∗ ∗ ∗ ∗ ∗ ∗  ρˆn(y , x |x ) ρn(y , x |x )  j S j S ≤ P   −   > C  κˆ κ  √ √ ∗ ∗ ∗ ∗ ∗ ∗ ≤ P (|ρˆn(y , xj |xS ) − ρn(y , xj |xS )| > C/ 2) + P (|κˆ − κ| > C/ 2)  C 2  ≤ (3|S| + 1)C n (1 − 2 )n + exp(−C nq) 1 n4q 3  C 2  ≤ 3|S|C n (1 − 2 )n + exp(−C nq) . 1 n4q 3

Step 4: Compute P (Ej|S ). When testing the jth predictor given S ⊆ {j}c, denote the event

∗ ∗ ∗ Ej|S = {an error occurs when testing ρn(y , xj |xS ) = 0}

I II = Ej|S ∪ Ej|S ,

I II where Ej|S denotes the type I error while Ej|S represents the type II error. I 1) For Ej|S .

( ˆ ∗ ∗ ∗ ) I 1/2 Zn(y , xj |xS ) −1 ∗ ∗ E = (n − |S| − 1) √ > Φ (1 − αn/2) when Zn(y , x |xS ) = 0 . j|S 1 +κ ˆ j 159

Then

( ˆ ∗ ∗ ∗ ) I 1/2 Zn(y , xj |xS ) −1 ∗ ∗ ∗ P (E ) = P (n − S| − 1) √ > Φ (1 − αn/2) when Zn(y , x |x ) = 0 j|S 1 +κ ˆ j S ( ˆ ∗ ∗ ∗ ∗ ∗ ∗ ) 1/2 Zn(y , xj |xS ) Zn(y , xj |xS ) −1 ≤ P (n − |S| − 1) √ − √ > Φ (1 − αn/2) 1 +κ ˆ 1 + κ ( ˆ ∗ ∗ ∗ ∗ ∗ ∗ r ) Zn(y , xj |xS ) Zn(y , xj |xS ) n cn = P √ − √ > √ 1 +κ ˆ 1 + κ n − |S| − 1 2 1 + κ ( ˆ ∗ ∗ ∗ ∗ ∗ ∗ ) Zn(y , xj |xS ) Zn(y , xj |xS ) cn ≤ P √ − √ > √ 1 +κ ˆ 1 + κ 2 1 + κ  C c2  ≤ 3|S|C n (1 − 2 n )n + exp(−C nq) , 1 n4q 3

p n cn by choosing αn = 2{1 − Φ( 1+κ 2 )}. II 2) For Ej|S .

( ˆ ∗ ∗ ∗ ) II 1/2 Zn(y , xj |xS ) −1 ∗ ∗ ∗ E = (n − |S| − 1) √ ≤ Φ (1 − αn/2) when Zn(y , x |x ) 6= 0 . j|S 1 +κ ˆ j S

p n cn By choosing αn = 2{1 − Φ( 1+κ 2 )}, we can get the following inequality:

( ˆ ∗ ∗ ∗ ) II 1/2 Zn(y , xj |xS ) −1 ∗ ∗ ∗ P (E ) = P (n − |S| − 1) √ ≤ Φ (1 − αn/2) when Zn(y , x |x ) 6= 0 j|S 1 +κ ˆ j S ( ˆ ∗ ∗ ∗ r ) Zn(y , xj |xS ) n cn ∗ ∗ ∗ = P √ ≤ √ when Zn(y , x |x ) 6= 0 1 +κ ˆ n − |S| − 1 2 1 + κ j S ( ∗ ∗ ∗ ˆ ∗ ∗ ∗ ∗ ∗ ∗ r ) Zn(y , xj |xS ) Zn(y , xj |xS ) Zn(y , xj |xS ) n cn ≤ P √ − √ − √ ≤ √ 1 + κ 1 +κ ˆ 1 + κ n − |S| − 1 2 1 + κ ( ˆ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ ∗ r ) Zn(y , xj |xS ) Zn(y , xj |xS ) Zn(y , xj |xS ) n cn = P √ − √ ≥ √ − √ . 1 +κ ˆ 1 + κ 1 + κ n − |S| − 1 2 1 + κ

1 1 Let g3(u) = 2 log{(1 + u)/(1 − u)}, then |g3(u)| = | 2 log{(1 + u)/(1 − u)}| ≥ |u|, 160

∗ ∗ ∗ for all u ∈ (−1, 1), and according to (D5), |ρn(y , xj |xS )| ≥ cn, then

( ˆ ∗ ∗ ∗ ∗ ∗ ∗ r ) II Zn(y , xj |xS ) Zn(y , xj |xS ) cn n cn P (E ) ≤ P √ − √ ≥ √ − √ j|S 1 +κ ˆ 1 + κ 1 + κ n − |S| − 1 2 1 + κ ( ˆ ∗ ∗ ∗ ∗ ∗ ∗  r ) Zn(y , xj |xS ) Zn(y , xj |xS ) cn n 1 ≤ P √ − √ ≥ √ 1 − 1 +κ ˆ 1 + κ 1 + κ n − |S| − 1 2 ( ˆ ∗ ∗ ∗ ∗ ∗ ∗ ) Zn(y , xj |xS ) Zn(y , xj |xS ) 3cn ≤ P √ − √ ≥ √ 1 +κ ˆ 1 + κ 8 1 + κ  C c2  ≤ 3|S|C n (1 − 2 n )n + exp(−C nq) , 1 n4q 3

q n 5 as for large n, n−|S|−3 ≤ 4 . Combining the results from (1) and (2), we get the following results:

I II P (Ej|S ) = P (Ej|S ) + P (Ej|S )  C c2  ≤ 3|S|C n (1 − 2 n )n + exp(−C nq) . 1 n4q 3

ˆ Step 5: Prove P {An = An} → 1.

c Now consider all j = 1, ··· , pn and all S ⊆ {j} subject to |S| ≤ mn, mn = peffn, for any b > 0, under (D1) and (D2), for j = 1, ··· , pn, define

mn c Kj = {S ⊆ {j} , |S| ≤ mn}.

ˆ P {An 6= An} = P {an error occurs for some j and some S}    [  = P Ej|S  mn  j=1,··· ,pn;S∈Kj X ≤ P (Ej|S ) mn j=1,··· ,pn;S∈Kj mn ≤ pn(pn) sup P (Ej|S ) mn j=1,··· ,pn;S∈Kj 161

 C c2  ≤ 3mn (p )mn+1C n (1 − 2 n )n + exp(−C nq) . n 1 n4q 3

The last second inequality holds since the number of possible choices of j is pn and

mn there are pn possible choices for S.

Similar to Lemma 3 in Buhlmann et al. (2010), it can be shown that P (m ˆ reach,n = mreach,n) → 1, as n → +∞. And mreach,n ≤ peffn, thusm ˆ reach,n ≤ peffn. There- fore, in the previous arguments, mn should be replaced withm ˆ reach,n, and we can still use (D2) as mn =m ˆ reach,n ≤ peffn.

a a b Step 5: Under (D2), for pn = O(exp(n )) = C4 exp(n ), and peffn = O(n ) =

b C5n ,

 C c2  P (Aˆ 6= A ) ≤ 3mn C n(p )mn+1C n (1 − 2 n )n + exp(−C nq) . n n 1 n 1 n4q 3 ˆ log{P (An 6= An)}  C c2  ≤ m log 3 + (m + 1) log(p ) + log(C n) + log (1 − 2 n )n + exp(−C nq) . n n n 1 n4q 3  C c2  ≤ C (m + 1) log(p ) + log{ (1 − 2 n )n + exp(−C nq) } 1 n n n4q 3  C c2  ≤ C (C nb + 1)(log C + na) + log{ (1 − 2 n )n + exp(−C nq) }. 1 5 4 n4q 3

−x x 1−q Note that e ≤ 1 − 2 , for x ∈ [0, log 2]. Since 0 < q < 1, then C3/n ∈ [0, log 2].

q q n 1−q n 1−q n exp(−C3n ) = {exp(−C3n /n)} ≤ (1 − C3/2n ) = (1 − C3/n ) .

Therefore,

n o  C C  log P (Aˆ 6= A ) ≤ C (C nb + 1)(log C + na) + log (1 − 2 )n + (1 − 3 )n . n n 1 5 4 n2d+4q n1−q 162

1−a−b−2d 1 1 C2 C3 Let (1 − 2d)/5 < q < 4 , then n2d+4q < n1−q . Thus 1 − n2d+4q ≥ 1 − n1−q . Then

 C   C  log{P (Aˆ 6= A )} ≤ C na+b + log 2(1 − 2 )n ≤ C na+b + n log (1 − 2 ) . n n 1 n2d+4q 1 n2d+4q

1−2d 1−a−b−2d 1−a−b−2d 4q 4 Since 2d + 5(a + b) < 1, then 5 < 4 , then n < n , then na+b na+b C n4q ≈ ≈ − 2 → 0,  C2 C2 n1−a−b−2d n log (1 − n2d+4q ) n n2d+4q (−1 + o(1))  C  C C n log (1 − 2 ) ≈ C n 2 (−1 + o(1)) → −∞. 1 n2d+4q 1 n2d+4q

Thus,  C  log{P (Aˆ 6= A )} ≤ C na+b + n log (1 − 2 ) n n 1 n2d+4q   ( a+b ) C2 n ≤ C1n log (1 − ) + 1 → −∞. n2d+4q  C2 n log (1 − n2d+4q )

ˆ That is P (An 6= An) → 0, as n → +∞.

4.7.4 Proof of Theorem 4.6.

Proof. Since we are applying the least squares on the estimated active set, then as n → +∞,

n o ˆ 1/2 ˆ P β − β = Op(−n ) An(αn) = An → 1.

Therefore, as n → ∞,

n o ˆ 1/2 P β − β = Op(−n ) n o n o ˆ 1/2 ˆ ˆ 1/2 ˆ = P β − β = Op(−n ), An(αn) 6= An + P β − β = Op(−n ), An(αn) = An 163

n o ˆ ˆ 1/2 ˆ ˆ ≤ P {An(αn) 6= An)} + P β − β = Op(−n ) An(αn) = An P {An(αn) = An)}

→ 1.

4.7.5 Proof of Theorem 4.7.

4.7.5.1 Derivation of the Bias of Nonparametric Part.

After obtaining the estimated active set, for any fixed u,

  n ˆb (u) 2 ˆ 0 X n T ˆ o b(u) =   = arg min yi − xi β − b0 − b1(ui − u0) Kh(ui − u0) ˆ b0,b1 b1(u) i=1 T −1 T ˆ = (Zu WuZu) Zu Wu(y − Xβ),

  1 u1 − u    . .  where Zu =  . . , and Wu = diag{Kh(u1 − u), ··· ,Kh(un − u)}.   1 un − u We have the following facts: 1) E(y − Xβ|X, u) = g(u).

T −1 T ˆ −1/2 2) (Zu WuZu) Zu WuX = Op(1), and kβ − βk = Op(n ). 2 3) g(ui) = b0 + b1(ui − u) + b2(ui − u) + ··· , and g(u) = b0.

ˆ T −1 T ˆ E(b|X, u) = (Zu WuZu) Zu WuE(y − Xβ|X, u) T −1 T ˆ = (Zu WuZu) Zu WuE(y − Xβ + Xβ − Xβ|X, u) T −1 T n ˆ o = (Zu WuZu) Zu Wu g(u) + XE(β − β|X, u) 164

  b0 + b1(u1 − u) + g(u) − b0 − b1(u1 − u)   T −1 T  .  −1/2 = (Zu WuZu) Zu Wu  .  + Op(n )   b0 + b1(un − u) + g(u) − b0 − b1(un − u)

 2    b2(u1 − u) + ··· b0   T −1 T  .  −1/2 =   + (Zu WuZu) Zu Wu  .  + Op(n ) b1   2 b2(un − u) + ···   b 0 −1/2 =   + J1 + Op(n ). b1

Let

    Pn K (u − u) Pn (u − u)K (u − u) 1 0 T i=1 h i i=1 i h i Sn , Zu WuZu =   ,H ,   . Pn Pn 2 i=1(ui − u)Kh(ui − u) i=1(ui − u) Kh(ui − u) 0 h

Then

  1 Pn K (u − u) 1 Pn ui−u K (u − u) 1 −1 −1 n i=1 h i n i=1 h h i H SnH =   n 1 Pn ui−u 1 Pn ui−u 2 n i=1 h Kh(ui − u) n i=1( h ) Kh(ui − u)   1 0 r 1 = f(u) + O(h + ).   nh 0 µ2

 l  (u1 − u)   l 1 Pn ui−u l   nh ( ) Kh(ui − u) T  .  n i=1 h Zu Wu  .  =   . l+1 1 Pn ui−u l+1   nh ( ) Kh(ui − u) l n i=1 h (un − u) 165

Then J1 can be expressed as follows:

 2  b2(u1 − u) + ···   T −1 T  .  J1 = (Zu WuZu) Zu Wu  .    2 b2(un − u) + ···   nh2 1 Pn ( ui−u )2K (u − u) −1 n i=1 h h i 2 = b2Sn   + o(h ) 3 1 Pn ui−u 3 nh n i=1( h ) Kh(ui − u)   h2 1 Pn ( ui−u )2K (u − u) −1 1 −1 −1 −1 −1 n i=1 h h i 2 = b2H ( H SnH ) H   + o(h ) n 3 1 Pn ui−u 3 h n i=1( h ) Kh(ui − u)   1 Pn ( ui−u )2K (u − u) 2 −1 1 −1 −1 −1 n i=1 h h i 2 = b2h H ( H SnH )   + o(h ) n 1 Pn ui−u 3 n i=1( h ) Kh(ui − u)         1 0 r µ r 2 −1  −1 1   2 1  2 = b2h H f (u)   + Op(h + ) f(u)   + Op(h + ) + o(h ) 1 nh nh  0   µ3  µ2   r   µ2 1  = b h2H−1 + O (h + ) + o(h2). 2   p nh  µ3/µ2 

ˆ Thus, E(ˆgβˆ(u)|X, u) = (1 0)E(b|X, u) ( )  r 1  = b + b h2 µ + O h + + o(h2) + O (n−1/2) 0 2 2 p nh p  r 1  = b + b µ h2 + h3O 1 + + o(h2) + O (n−1/2) 0 2 2 p nh3 p 2 2 −1/2 = b0 + b2µ2h + o(h ) + Op(n ),

and the last equality holds because nh3 → +∞, as n → 0. Notice that the right side does not depend on X and u, therefore,

n o 2 2 −1/2 E gˆβˆ(u) = b0 + b2µ2h + o(h ) + Op(n ) 166

g00(u) = g(u) + µ h2 + o(h2) + O (n−1/2). (4.54) 2 2 p

4.7.5.2 Derivation of the Variance of Nonparametric Part.

ˆ T −1 T ˆ T −1 Note that var(b|X, u) = (Zu WuZu) Zu Wuvar(y − Xβ|X, u)WuZu(Zu WuZu) , and var(y − Xβˆ|X, u) = var(y − Xβ + Xβ − Xβˆ|X, u)

= var(y − Xβ|X, u) + var(Xβ − Xβˆ|X, u) + 2cov(y − Xβ, Xβ − Xβˆ|X, u)

2 −1/2 = {σ + Op(n )}In.

Then

ˆ T −1 T  2 −1/2 T −1 var(b|X, u) = (Zu WuZu) Zu Wu σ In + Op(n ) WuZu(Zu WuZu)

2 −1/2 −1 T 2  −1 = {σ + Op(n )}Sn Zu Wu Zu Sn

2 −1/2 −1 −1 −1 −1 −1 T 2  −1 −1 −1 −1 = {σ + Op(n )}H (H SnH ) H Zu Wu Zu H (H SnH )

2 −1/2 , {σ + Op(n )}J2.

−1 T 2  −1 H Zu Wu Zu H       Pn 2 Pn 2 1 0 i=1 Kh(ui − u) i=1(ui − u)Kh(ui − u) 1 0 =       Pn 2 Pn 2 2 0 1/h i=1(ui − u)Kh(ui − u) i=1(ui − u) Kh(ui − u) 0 1/h   Pn 2 Pn ui−u 2 i=1 Kh(ui − u) i=1 h Kh(ui − u) =   Pn ui−u 2 Pn ui−u 2 2 i=1 h Kh(ui − u) i=1( h ) Kh(ui − u)   r nf(u) ν0 ν1  1  = + nO 1 + . h   p nh3 ν1 ν2

1  1 −1 1  1 −1 J = H−1 H−1S H−1 H−1ZT W 2Z H−1 H−1S H−1 H−1 2 n n n n u u u n n 167

        1 0  r  ν ν  r  1 −1  −1 1  f(u) 0 1 1  = H f (u)   + Op h +   + Op 1 + n 1 nh h nh3  0   ν1 ν2  µ2      1 0  r 1  f −1(u) + O h + H−1   p nh  0 1/µ2      ν1 r  ν0   1 −1 1 µ2 1 −1 = H   + Op h + H . nh f(u) ν1 ν2 nh  µ2 µ2   1 n o   ˆ Therefore, var gˆβˆ(u)|X, u = 1 0 var(b|X, u)   0 ( ) σ2 + O (n−1/2) 1  r 1  = p ν + O h + nh f(u) 0 p nh   2 −1/2 q 1 σ ν0 + Op(n ) + Op h + nh = . nhf(u)

Notice that the right side does not depend on X and u, therefore,

 q  σ2ν + O (n−1/2) + O h + 1 n o 0 p p nh var gˆ (u) = . (4.55) βˆ nhf(u)

4.7.6 Proof of Theorem 4.8.

Proof. We have the following facts:

ˆ T −1 T ˆ 1)g ˆβˆ(u) = (1 0)b = (1 0)(Zu WuZu) Zu Wu(y − Xβ). ˆ 1 00 2 −1/2 2 2) E(ˆgβˆ(u)|X, u) = (1 0)E(b|X, u) = g(u) + 2 g (u)µ2h + +Op(n ) + o(h ). T −1 T ˆ 3) E(ˆgβˆ(u)|X, u) = (1 0)(Zu WuZu) Zu WuE(y − Xβ|X, u). T 4) y − Xβ − E(y − Xβ|X, u) = (1, ··· , n) .

1 00 2 Step 1): Studyg ˆβˆ(u) − g(u) − 2 g (u)µ2h . 168

1 gˆ (u) − g(u) − g00(u)µ h2 βˆ 2 2 −1/2 2 =g ˆβˆ(u) − E(ˆgβˆ(u)|X, u) + Op(n ) + o(h ) T −1 T n ˆ ˆ o −1/2 2 = (1 0)(Zu WuZu) Zu Wu y − Xβ − E(y − Xβ|X, u) + Op(n ) + o(h ) −1 T n ˆ ˆ o −1/2 2 = (1 0)Sn Zu Wu y − Xβ + X(β − β) − E(y − Xβ + X(β − β)|X, u) + Op(n ) + o(h ) −1 T n ˆ ˆ o = (1 0)Sn Zu Wu y − Xβ − E(y − Xβ|X, u) + X(β − β) − E(X(β − β)|X, u)

−1/2 2 + Op(n ) + o(h ).

−1 T ˆ −1/2 Because Sn Zu WuX = Op(1), β − β = Op(n ), then

1 gˆ (u) − g(u) − g00(u)µ h2 βˆ 2 2 n −1 T −1/2 2 = [1 0]Sn Zu Wu {y − Xβ − E(y − Xβ|X, u)} + Op(n ) + o(h )

−1 T T −1/2 2 = [1 0]Sn Zu Wu(1, ··· , n) + Op(n ) + o(h )  1  = [1 0]H−1(nHS−1H)H−1 ZT W ( , ··· ,  )T + O (n−1/2) + o(h2). n n u u 1 n p

    1 1 0 1 1 0 H−1S H−1 −→P f(u) , then nHS−1H −→P .(4.56) n n   n f(u)   0 µ2 0 1/µ2

      1 0 1 Pn K (u − u) −1 1 T T n i=1 h i i H Zu Wu(1, ··· , n) =     n 1 Pn 0 1/h n i=1(ui − u)Kh(ui − u)i   1 Pn n i=1 Kh(ui − u)i =   . 1 Pn ui−u n i=1 h Kh(ui − u)i

1 Pn Step 2): Now study n i=1 Kh(ui − u)i:

( n ) X 1 1 σ2 ξ2 = var K (u − u) = E K2(u − u)2 = E K2(u − u) . n n h i i n h i i n h i i=1 169

Note that

Z 1 u − u 1 Z E K2(u − u) = { K( i )}2f(u )du = K2(t)f(u + th)dt h i h h i i h 1 Z = K2(t){f(u) + thf 0(u) + o(h)}dt h 1 = {f(u)ν + hf 0(u)ν + o(h)}. h 0 1

2 σ2 0 σ2 Thus ξn = nh {f(u)ν0 + hf (u)ν1 + o(h)} = nh {C1 + o(h)}. Similarly, we have

n X 1 3 1 3 3 3 1 E Kh(ui − u)i = n( ) E K (ui − u) E  = {C2 + o(h)}. n n h i n2h2 i=1

Therefore, as n → +∞, nh3 → +∞, we have nh → +∞, then

Pn 1 3 1 E Kh(ui − u)i 2 2 {C2 + o(h)} 1 i=1 n = n h = {C + o(h)} → 0, ξ2 σ2 nh 3 n nh {C1 + o(h)}

By the Lyapunov Central Limit Theorem, we can obtain that

1 Pn n i=1 Kh(ui − u)i D q −→ N(0, 1). σ2 0 nh {f(u)ν0 + hf (u)ν1 + o(h)}

That is

r n   h X D K (u − u) −→ N 0, σ2f(u)ν . n h i i 0 i=1

Similarly,

r n   h X ui − u D K (u − u) −→ N 0, σ2f(u)ν . n h h i i 2 i=1 170

Step 3): Apply the Slutsky’s Theorem,

√  1  nh gˆ (u) − g(u) − g00(u)µ h2 βˆ 2 2 n √  1  √ √ = nh[1 0]H−1(nHS−1H)H−1 ZT W ( , ··· ,  )T + O ( h) + o( nh5) n n u u 1 n p  q  h Pn Kh(ui − u)i √ √ −1 −1 n i=1 5 = [1 0]H (nHSn H)  q  + Op( h) + o( nh ) h Pn ui−u n i=1 h Kh(ui − u)i 1    σ2ν  −→D N 0, σ2f(u)ν = N 0, 0 , f(u) 0 f(u) which completes the proof of Theorem 4.8. Chapter 5

Conclusion and Future Research

5.1 Conclusion

In this dissertation, we systematically reviewed the existing variable selection meth- ods for the high dimensional regressions. One well-known approach is the regular- ized approaches. Different from these approaches, Buhlmann et al. (2010) proposed the PC-simple algorithm for variable selection. However, they developed this ap- proach under the assumption of normality. Therefore, we want to investigate its performance for non-normal samples. We studied the asymptotic distributions of sample marginal and partial correlations with elliptical distributions. Those re- sults clearly show that the distributions of the partial correlations depend on the kurtosis of the distribution. That implies PC-simple algorithm would result in an over fitting model, while the kurtosis is much larger than 0. This is consistent with the simulation results we obtained. To overcome the drawback, we proposed a new threshold for the marginal and partial correlations based on their asymptotic distributions. The simulation studies manifest that adjusting the variance of the partial corre- lations improves the performance dramatically compared to PC-simple algorithm. 172

We also compared the performance of the newly proposed approach to the exist- ing penalized approaches with LASSO and SCAD penalty. Moreover, we suggest tuning the threshold based on the extended-BIC criterion, and call it TPC-EBIC. This resulting procedure outperforms both original PC-simple algorithm and the thresholded partial correlation approach. Furthermore, we established the asymp- totic consistency of the threshold partial correlation approach. In Chapter 4, we applied the newly proposal to do variable selection in the partially linear models. The variable selection in partially linear models is chal- lenging, because it involves several selection and estimation procedures: selection of the bandwidth, selection of the variables in the linear part, estimation of the nonparametric function and estimation of the linear coefficients. We first apply the partial residuals approach to transform the partially linear models to the linear models. After that, we apply the PC-simple algorithm, and the threshold partial correlations to the resulting linear models. We call these two algorithm PC-PR and TPC-PR for short. As expected, the simulation studies show that PC-PR would over fit or under fit the data unless the samples are normal. The TPC-PR performs similar as PC- PR with normal samples, and outperforms PC-PR for non-normal samples. Even better, the TPC-PR outperforms the penalized approach on the partial correlations with SCAD and LASSO penalties. Moreover, with the constraint on the partial correlations and the nonparametric part, we showed that TPC-PR can identify the true active set consistently. Furthermore, we derived the asymptotic bias and variance of the estimate of the nonparametric baseline function. 173

5.2 Future Research

5.2.1 Further theoretical development

In Chapter 4, we show the distribution of the sample partial correlations between the smoothed covariates and the response is N(0, 1 + κ). As we pointed in the conditions in Section 4.4.1, we require (y, v, x1, ··· , xp), where u = Φ(v), to be

∗ ∗ ∗ elliptically distributed, then (y , x1, ··· , xp) will follow an elliptical distribution with the same characteristic generator. As a challenging research topic, it is in- teresting to show that the sample marginal and partial correlations follow the normal distribution with 0 as its mean, 1+κ as its variance. It is more challenging than the proof in Chapter 3, because the performance of the estimation of the nonparametric function matters.

5.2.2 Semi-parametric Varying-coefficient Models

The thresholded partial correlation can be applied to do variable selection in semi- parametric varying-coefficient models. Fan and Huang (2005) proposed the profile least-squares approach for estimating the parametric part in the varying-coefficient models. Meanwhile, they studied the asymptotic normality of the profile least- squares estimator. The varying-coefficient partially linear model can be written as y = αT (u)x + βT z + , (5.1) where y is the response, (u, xT , z) is the covariates,  is independent of (u, xT , z),

2 T and E() = 0, var() = σ , β = (β1, ··· , βp) is a q-dimensional vector of q un- known coefficients, α(u) = (α1, ··· , αq) is a p-dimensional vector of p unknown coefficients functions. First, let y∗ = y − βT z, the the original would be trans- formed into a nonparametric model. By utilizing the local linear regression, the 174 estimation of the nonparametric part can be written as a linear combination of y∗. Equivalently, Mˆ = S(Y − Zβ), (5.2)

T T T where X = (X1, ··· , Xn) , Xi = (Xi1, ··· ,Xip) , Y = (y1, ··· , yn) , Z =

T T (Z1, ··· , Zn) , Zi = (Zi1, ··· ,Ziq) , Wu = diag(Kh(u1 − u), ··· ,Kh(un − u)),

   T T −1 T  α(u1)X1 (X1 , 0){Du1 Wu1 Du1 } Du1 Wu1     ˆ  .   .  M =  .  , S =  .  ,     T T −1 T α(un)Xn (Xn , 0){Dun Wun Dun } Dun Wun

  T u1−u T X1 h X1    . .  and Du =  . .  .   T un−u T Xn h Xn Using the profile technique, model (5.1) can be transformed into a linear model:

(I − S)Y = (I − S)Zβ + . (5.3)

Intuitively, one might apply the threshold partial correlation approach proposed in Chapter 3 to select the important variables. It is of interest to establish the theoretical property of the proposed procedure. Specifically, it is of interest to establish model selection consistency of the proposed procedure, and derive the asymptotic bias and variance of local linear regression estimate of the nonparametric baseline functions. This may be a good future research topic. Bibliography

Akaike, H. (1974). A new look at the identification. Automatic Control, IEEE Transactions on, 19(6):716 – 723. Breiman, L. (1996). Bagging predictors. Machine Learning, 24:123–140. 10.1007/BF00058655. Buhlmann, P., Kalisch, M., and Maathuis, M. (2010). Variable selection in high- dimensional linear models: partially faithful distributions and the pc-simple algorithm. Biometrika, 97(2):261–278. Candes, E. and Tao, T. (2007). The dantzig selector: Statistical estimation when p is much larger than n. (with discussions and rejoinder). The Annals of Statistics, 35(6):2313–2404. Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regres- sion. The Annals of Statistics, 32(2):407–499. With discussion, and a rejoinder by the authors. Fan, J. (1992). Design-adaptive nonparametric regression. Journal of the American Statistical Association, 87(420):998–1004. Fan, J. (1997). Comments on in statistics: A review by a. antoniadis. Statistical Methods & Applications, 6:131–138. 10.1007/BF03178906. Fan, J. and Huang, T. (2005). Profile likelihood inferences on semiparametric varying-coefficient partially linear models. Bernoulli, 11(6):1031–1057. Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likeli- hood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360. Fan, J. and Li, R. (2004). New estimation and model selection procedures for semiparametric modeling in longitudinal data analysis. Journal of the American Statistical Association, 99(467):710–723. 176

Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70:849–911.

Fang, K., Kotz, S., and Ng, K. (1990). Symmetric multivariate and related distribu- tions. Number 36 in Monographs on statistics and applied probability. Chapman and Hall, London.

Frank, I. E. and Friedman, J. H. (1993). A statistical view of some regression tools. Technometrics, 35(2):109–135.

Hall, P. and Miller, H. (2009). Using generalized correlation to effect variable selection in very high dimensional problems. Journal of Computational and Graphical Statistics, 18(3):533–550.

Hastie, T. and Tibshirani, R. (1993). Varying-coefficient models. Journal of the Royal Statistical Society. Series B (Methodological), 55(4):757–796.

Heckman, N. E. (1986). Spline smoothing in a partly linear model. Journal of the Royal Statistical Society. Series B (Methodological), 48(2):244–248.

Hunter, D. R. and Li, R. (2005). Variable selection using MM algorithms. Ann. Stat., 33(4):1617–1642.

Li, R. and Liang, H. (2008). Variable selection in semiparametric regression mod- eling. Ann. Stat., 36(1):261–286.

Li, R., Zhong, W., and Zhu, L. (2012). Feature screening via distance correlation learning. Journal of the American Statistical Association, 107(499):1129–1139.

Liu, J., Li, R., and Wu, R. (2013). Feature selection for varying coefficient models with ultrahigh dimensional covariates. Journal of American Statistical Associa- tion. Accepted.

Miller, A. (2002). Subset Selection in Regression. Monographs on Statistics and Applied Probability. Chapman & Hall/CRC.

Moyeed, R. and Diggle, P. (1994). Rates of convergence in semi-parametric of longitudinal data. Australian Journal of Statistics, 36(1):75–93.

Muirhead, R. J. (2009). Aspects of multivariate , volume 197. Wiley.

Ruppert, D., Sheather, S. J., and Wand, M. P. (1995). An effective bandwidth selector for local least squares regression. Journal of the American Statistical Association, 90(432):1257–1270. 177

Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464.

Segal, M. R., Dahlquist, K. D., and Conklin, B. R. (2003). Regression approaches for microarray data analysis. Journal of Computational Biology, 10:961–980.

Speckman, P. (1988). Kernel smoothing in partial linear models. Journal of the Royal Statistical Society. Series B (Methodological), 50(3):413–436.

Spirtes, P., Glymour, C., and Scheines, R. (2001). Causation, Prediction, and Search, 2nd Edition. Adaptive Computation and Machine Learning. MIT Press.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267–288.

Wahba, G. (1985). A comparison of gcv and gml for choosing the smoothing parameter in the generalized spline smoothing problem. The Annals of Statistics, 13(4):1378–1402.

Wang, L., Kim, Y., and Li, R. (2013). Calibrating non-convex penalized regression in ultra-high dimension. The Annals of Statistics. In press.

Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67.

Zeger, S. L. and Diggle, P. J. (1994). Semiparametric models for longitudinal data with application to cd4 cell numbers in hiv seroconverters. Biometrics, 50(3):689–699.

Zou, H. (2006). The adaptive lasso and its orcle properties. Journal of the American Statistical Association, 101(476):1418–1429.

Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320.

Zou, H. and Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics, 36(4):1509–1533. Vita Lejia Lou Research Interests Variable Selection, Partially Linear Models, Data Mining and Machine Learning.

Education Dec. 2013 Ph.D. in Statistics, Penn State University. Supervisor: Dr. Runze Li. Major GPA: 4.0/4.0. Overall GPA: 4.0/4.0.

Jul. 2007 B.S. in Mathematics, East China Normal University. Major GPA: 3.73/4.0. Overall GPA: 3.86/4.0. Rank: 2/134. Research Experience Aug. 2012-Dec. 2013 Supervisor: Dr. Runze Li. Thresholded Partial Correlation Approach on Partial Residuals for Variable Selection in High-dim Partially Linear Models. Sept. 2011-Aug. 2012 Supervisor: Dr. Runze Li. Thresholded Partial Correlation Approach for Variable Selection in High-dim Linear Models. Aug. 2011-May 2012 Graduate Statistical Consultant, Penn State University. May 2010-May 2011 Supervisor: Dr. Murali Haran. Kernel Mixing and Gaussian Predictive Processes Models. Intern Experience Jun. 2013-Aug. 2013 Supervisor: Dr. Julie Hsu. Intern at CASD, Business Insurance, Travelers. May 2011-Aug. 2011 Supervisor: Dr. George Chi. Intern at PRD, Johnson & Johnson. Teaching Experience Aug. 2012-Dec. 2012 Instructor of Experimental Methods, Penn State University. Software Skills R, Matlab, SAS, SQL and Minitab.