The Pennsylvania State University
The Graduate School

THRESHOLDED PARTIAL CORRELATION APPROACH FOR VARIABLE SELECTION IN LINEAR MODELS AND PARTIALLY LINEAR MODELS

A Dissertation in Statistics
by
Lejia Lou

© 2013 Lejia Lou

Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

December 2013

The dissertation of Lejia Lou was reviewed and approved* by the following:

Runze Li, Distinguished Professor of Statistics, Dissertation Advisor and Chair of Committee
David Hunter, Professor of Statistics, Head of the Department of Statistics
Bing Li, Professor of Statistics
Rongling Wu, Professor of Public Health Science
Aleksandra Slavkovic, Chair of the Graduate Program in Statistics

*Signatures are on file in the Graduate School.

Abstract

This thesis is concerned with variable selection in linear models and partially linear models for high-dimensional data analysis. As data-collection technology advances, it becomes crucial to identify a small subset of covariates that exhibits the strongest relationship with the response. Much research effort has gone into variable selection methodologies, in particular regularized techniques such as the least absolute shrinkage and selection operator (LASSO; Tibshirani, 1996) and penalized least squares with the smoothly clipped absolute deviation penalty (SCAD; Fan and Li, 2001).

Departing from these regularization methods for variable selection in linear models, Buhlmann et al. (2010) proposed the PC-simple algorithm to select significant variables. They showed that, under some conditions and with a proper choice of the significance level, the PC-simple algorithm identifies the true active set with probability approaching 1 when the response and covariates are jointly normally distributed. In Chapter 3, we study the performance of the PC-simple algorithm under non-normal distributions.
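To make the two penalties named above concrete: for a scalar least-squares problem, each penalty induces a thresholding rule applied to the unpenalized estimate. The sketch below (illustrative only, not code from the thesis; the function names are ours) implements the soft-thresholding rule of the LASSO and the SCAD thresholding rule of Fan and Li (2001) with the conventional choice a = 3.7.

```python
import math

def soft_threshold(z, lam):
    """LASSO (soft) thresholding: argmin_b 0.5*(z - b)^2 + lam*|b|.

    Shrinks z toward zero by lam, setting small values exactly to 0."""
    return math.copysign(max(abs(z) - lam, 0.0), z)

def scad_threshold(z, lam, a=3.7):
    """SCAD thresholding rule of Fan and Li (2001) for a scalar z.

    Near zero it behaves like the LASSO; for large |z| it applies no
    shrinkage, reducing the bias of the LASSO on strong signals."""
    az = abs(z)
    if az <= 2 * lam:
        return soft_threshold(z, lam)            # LASSO-like region
    if az <= a * lam:
        # transition region: shrinkage tapers off linearly
        return ((a - 1) * z - math.copysign(a * lam, z)) / (a - 2)
    return z                                      # large signals left unpenalized
```

Note that the SCAD rule agrees with soft thresholding at |z| = 2*lam and with the identity at |z| = a*lam, so the estimate is continuous in z; this continuity is one of the properties Fan and Li (2001) require of a good penalty.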
The PC-simple algorithm builds its variable selection on the fact that Fisher's z-transforms of the sample marginal and partial correlations are asymptotically standard normal. This fact fails when the samples come from non-normal distributions, which is a drawback of the PC-simple algorithm. We therefore derive the asymptotic distributions of Fisher's z-transforms of the sample marginal and partial correlations under elliptical distributions, and we find that these distributions depend on the kurtosis. According to the threshold we develop, the PC-simple algorithm tends to over-fit when the kurtosis is positive and to under-fit when the kurtosis is negative.

The results of extensive simulation studies with elliptical distributions, including normal and mixture normal distributions, are consistent with this analysis. With normal samples, the PC-simple algorithm and our proposal perform similarly, as both use essentially the same asymptotic distributions of the sample marginal and partial correlations. With mixture normal distributions, however, the PC-simple algorithm overfits the data because the kurtosis of a mixture normal distribution is greater than 0. Our proposal outperforms the PC-simple algorithm in terms of the correct-fitting percentage, which shows that adjusting the variance is essential. Moreover, an application of our proposal to the cardiomyopathy microarray data suggests that our proposal is comparable to the regularization approach with the SCAD penalty and outperforms the approach with the LASSO penalty. Furthermore, by imposing some conditions on the partial correlations, we show that the proposed approach consistently identifies the true active set.

In Chapter 4, we study how to apply the thresholded partial correlation approach to select significant variables in partially linear models.
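The kurtosis adjustment just described can be sketched as a test on a single (partial) correlation. The sketch below is illustrative only: it assumes the classical elliptical-family result that the asymptotic variance of Fisher's z is inflated by a factor of 1 + kappa for a kurtosis parameter kappa; the exact constants and the estimator of kappa are as derived in Chapter 3, and the function names here are ours.

```python
import math
from statistics import NormalDist

def fisher_z(r):
    """Fisher's z-transform of a (partial) correlation r in (-1, 1)."""
    return 0.5 * math.log((1 + r) / (1 - r))

def keep_variable(r, n, k, alpha, kappa=0.0):
    """Thresholded test of H0: rho = 0 for a partial correlation r of
    order k computed from n samples.

    With kappa = 0 this is the PC-simple rule: compare
    sqrt(n - k - 3) * |z(r)| with the standard normal quantile.
    A positive kappa (heavier tails, e.g. mixture normals) inflates
    the critical value, which counteracts the over-fitting of the
    unadjusted rule (illustrative scaling only)."""
    stat = math.sqrt(n - k - 3) * abs(fisher_z(r))
    crit = math.sqrt(1 + kappa) * NormalDist().inv_cdf(1 - alpha / 2)
    return stat > crit  # True -> r is deemed nonzero, keep the covariate
```

A borderline correlation that the normal-theory rule retains can thus be dropped once the estimated kurtosis is positive, which is exactly the over-fitting correction described above.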
First, we approximately transform the partially linear model into a linear model by the partial-residuals technique. Then we apply the thresholded partial correlation approach to the resulting linear model to obtain the estimated active set. After that, we apply least squares to estimate the coefficients in the linear part. The estimate of the nonparametric function is obtained by substituting the estimate of the linear part into the original model. We call this approach thresholded partial correlation on partial residuals (TPC-PR). Similarly, we can apply the PC-simple algorithm to the partial residuals, pretending that the samples come from normal distributions; we call the resulting algorithm the PC-simple algorithm on partial residuals (PC-PR). We establish the consistency of variable selection in the linear part and the asymptotic normality of the estimator of the nonparametric function. Simulation studies show that our proposal performs as well as the penalized approach on partial residuals with the SCAD penalty (Fan and Li, 2004) and outperforms the approach with the LASSO penalty. The real data analysis also demonstrates that our proposal can yield a parsimonious model.

Table of Contents

List of Figures
List of Tables
Acknowledgments

Chapter 1  Introduction
  1.1 Background
  1.2 Contribution
  1.3 Organization

Chapter 2  Literature Review
  2.1 Introduction
  2.2 Variable Selection in Linear Models
    2.2.1 Classical Variable Selection Criteria
    2.2.2 Penalized Least Squares
    2.2.3 PC-simple Algorithm
  2.3 Partially Linear Models
    2.3.1 Model
    2.3.2 Partially Linear Models for Longitudinal Data
    2.3.3 Asymptotic Results for Profile Least Squares Estimators
  2.4 Variable Selection in Partially Linear Models
    2.4.1 Penalized Profile Least Squares
    2.4.2 Asymptotic Results of the Penalized Profile Approach
    2.4.3 Iterated Ridge Regression
Chapter 3  Thresholded Partial Correlation Approach for Variable Selection in Linear Models
  3.1 Introduction
  3.2 Preliminaries
    3.2.1 Model and Notation
    3.2.2 Partial Faithfulness
    3.2.3 Elliptical Distributions
  3.3 Thresholded Partial Correlation Approach
    3.3.1 Thresholded Partial Correlation Approach: General Samples
    3.3.2 Limiting Distributions of Correlations and Partial Correlations
    3.3.3 Thresholded Partial Correlation Approach: Elliptical Samples
  3.4 Asymptotic Theory
    3.4.1 Asymptotic Theory of Â_n(α)
    3.4.2 Asymptotic Theory of Â_n^[1](α)
    3.4.3 Discussion of the Conditions
  3.5 Numerical Studies
    3.5.1 Simulation Studies
    3.5.2 Real Data: Cardiomyopathy Microarray Data
  3.6 Conclusion
  3.7 Lemmas and Technical Proofs
    3.7.1 Lemmas
    3.7.2 Proof of Theorem 3.8
    3.7.3 Proof of Equation (3.3)
    3.7.4 Proof of Theorem 3.13
    3.7.5 Proof of Theorem 3.14

Chapter 4  Thresholded Partial Correlation on Partial Residuals for Variable Selection in Partially Linear Models
  4.1 Introduction
  4.2 Preliminaries
    4.2.1 Model and Objective
    4.2.2 Partial Faithfulness
  4.3 Thresholded Partial Correlation on Partial Residuals Approach
    4.3.1 Population Version
    4.3.2 Sample Version
  4.4 Asymptotic Properties
    4.4.1 Asymptotic Properties for Variable Selection
    4.4.2 Bias, Variance and Asymptotic Normality of the Nonparametric Function
    4.4.3 Discussion of the Conditions
  4.5 Numerical Studies
    4.5.1 Simulation Studies
    4.5.2 Real Data: Istanbul Stock Exchange Data
  4.6 Conclusion
  4.7 Lemmas and Technical Proofs
    4.7.1 Lemmas
    4.7.2 Proof of Theorem 4.4 (First Half)
    4.7.3 Proof of Theorem 4.5
    4.7.4 Proof of Theorem 4.6
    4.7.5 Proof of Theorem 4.7
      4.7.5.1 Derivation of the Bias of the Nonparametric Part
      4.7.5.2 Derivation of the Variance of the Nonparametric Part
    4.7.6 Proof of Theorem 4.8

Chapter 5  Conclusion and Future Research
  5.1 Conclusion
  5.2 Future Research
    5.2.1 Further Theoretical Development
    5.2.2 Semi-parametric Varying-coefficient Models

Bibliography

List of Figures

2.1 The soft thresholding rule
2.2 The SCAD thresholding rule
2.3 Comparison between local quadratic approximation and local linear approximation
4.1 Istanbul Stock Exchange Index against the predictors
4.2 Estimated curves
4.3 Density plots of the variables
4.4 Distribution of residuals

List of Tables

3.1 Simulation result for Example 1
3.2 Simulation result for Example 1
3.3 Simulation result for Example 1
3.4 Simulation result for Example 1
3.5 Simulation result for Example 2
3.6 Simulation result for Example 2
3.7 Simulation result for Example 2
3.8 Simulation result for Example 3
3.9 Simulation result for Example 3
3.10 Selected predictors
3.11 Part of the correlation matrix
3.12 Comparison of the performances
4.1 Normal samples with AR correlation matrix (p = 20)
4.2 Normal samples with compound symmetric correlation matrix (p = 20)
4.3 Mixture normal samples with AR correlation matrix (p = 20)
4.4 Mixture normal samples with compound symmetric correlation matrix (p = 20)
4.5 Normal samples with AR correlation structure (p = 200)
4.6 Mixture normal samples with AR correlation matrix (p = 200)
4.7 Normal samples with AR correlation structure (p = 500)
4.8 Mixture normal samples with AR correlation structure (p = 500)
4.9 Estimated coefficients
4.10 Comparison of the performances
4.11 Correlation matrix of the covariates

Acknowledgments

First of all, I would like to gratefully and sincerely thank my Ph.D. advisor, Dr. Runze Li, for his guidance, encouragement and comments on my Ph.D. projects and career.