
© 2018 by Yunbo Ouyang. All rights reserved.

SCALABLE SPARSITY STRUCTURE LEARNING USING BAYESIAN METHODS

BY

YUNBO OUYANG

DISSERTATION

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Statistics in the Graduate College of the University of Illinois at Urbana-Champaign, 2018

Urbana, Illinois

Doctoral Committee:

Associate Professor Feng Liang, Chair
Professor Annie Qu
Assistant Professor Naveen Narisetty
Assistant Professor Ruoqing Zhu

Abstract

Learning sparsity patterns in high dimensions is a great challenge in both implementation and theory. In this thesis we develop scalable Bayesian algorithms, based on the EM algorithm and variational inference, to learn sparsity structure in various models. Estimation consistency and selection consistency of our methods are established.

First, a nonparametric Bayes estimator is proposed for the problem of estimating a sparse sequence based on Gaussian random variables. We adopt the popular two-group prior with one component being a point mass at zero, and the other component being a mixture of Gaussian distributions. Although the Gaussian family has been shown to be suboptimal for this problem, we find that Gaussian mixtures, with a proper choice on the means and mixing weights, have the desired asymptotic behavior, e.g., the corresponding posterior concentrates on balls with the desired minimax rate.

Second, the above estimator can be directly applied to high dimensional linear classification. In theory, we not only build a bridge connecting the estimation error of the mean difference and the classification error in different scenarios, but also provide sufficient conditions for sub-optimal and optimal classifiers.

Third, we study adaptive ridge regression for linear models. Adaptive ridge regression is closely related to the Bayesian variable selection problem with a Gaussian mixture spike-and-slab prior because it resembles the EM algorithm developed in Wang et al. (2016) for that problem. The output of adaptive ridge regression can be used to construct a distribution estimator that approximates the posterior. We show that the approximate posterior has the desired concentration property and that the adaptive ridge regression estimator has the desired prediction error.

Last, we propose a Bayesian approach to sparse principal components analysis (PCA). We show that our algorithm, which is based on variational approximation, achieves Bayesian selection consistency when both p and n go to infinity.

To my family.

Acknowledgments

First and foremost, I would like to express my sincere gratitude to my advisor Professor Feng Liang. Without her patient guidance and continuous support, I would not have made such progress in my Ph.D. pursuit. I am always inspired by Professor Liang's innovative ideas and immense knowledge. Her continuous encouragement and support have helped me become confident. Her enthusiasm for Statistics always inspires me to work on challenging problems. I cannot imagine my Ph.D. life without Professor Liang's guidance.

Besides my advisor, I am also very grateful to the rest of my thesis committee members: Professor Annie Qu, Professor Naveen Narisetty and Professor Ruoqing Zhu. Their comments and suggestions are constructive and valuable for my research projects. I give my special thanks to Professor Narisetty for helpful discussions during the development of Chapter 4. Meanwhile, I would like to give my great thanks to Dr. Wang and Dr. Jianjun Hu for the enjoyable and wonderful collaboration. Moreover, I am grateful to all the talented members of Professor Liang's research group: Jin Wang, Jianjun Hu, Xichen Huang, Lingrui, Yinyin Chen and Wenjing Yin. Their deep insight, great knowledge and critical comments have substantially improved my understanding of Bayesian Statistics.

Furthermore, I would like to give my thanks to all the faculty, staff and students in the Department of Statistics, UIUC. I sincerely appreciate the help and support from all my friends in the department. My great thanks go to Jin Wang, Jianjun Hu, Xuan, Xichen Huang, Xiwei Tang, Fan and Xiaolu Zhu. Thank you for your company and for countless memorable experiences.

Finally, I want to thank my family: my parents and my girlfriend. I owe my appreciation to my parents for raising me and giving me so much inspiration. Last but not least, I give my great thanks to my girlfriend Yinyin Chen: you have enlightened my life.

Table of Contents

List of Tables ...... viii

List of Figures ...... x

Chapter 1 Introduction ...... 1

1.1 Gaussian Sequence Model ...... 2

1.2 High Dimensional Classification ...... 3

1.3 Variable Selection ...... 3

1.4 Sparse PCA ...... 4

Chapter 2 Empirical Bayes Approach for Sparse Sequence Estimation ...... 5

2.1 Introduction ...... 5

2.2 Main Results ...... 7

2.3 Prior Specification via Hard Thresholding ...... 11

2.4 Implementation ...... 12

2.4.1 Variational Algorithm to Specify Prior ...... 13

2.4.2 Posterior Computation ...... 15

2.4.3 Simulation Studies ...... 16

2.5 Conclusions and Discussions ...... 23

Chapter 3 An Empirical Bayes Approach for High Dimensional Classification ...... 24

3.1 Introduction ...... 24

3.2 Relationship between the Estimation Error and the Classification Error ...... 26

3.2.1 Known Covariance Matrix Σ = Ip ...... 27

3.2.2 Unknown Covariance Matrix ...... 29

3.3 Theory for DP Classifier ...... 32

3.4 Implementation for Empirical Bayes Classifier ...... 34

3.5 Empirical Studies ...... 35

3.5.1 Simulation Studies ...... 36

3.5.2 Classification with Missing Features ...... 41

3.5.3 Real Data Examples ...... 42

3.6 Discussion ...... 44

Chapter 4 Two-Stage Ridge Regression and Posterior Approximation of Bayesian Variable Selection ...... 45

4.1 Introduction ...... 45

4.2 Proposed Algorithm ...... 47

4.2.1 Two-Stage Ridge Regression ...... 47

4.2.2 Distribution Estimator Construction ...... 48

4.3 Sufficient Conditions of Concentration Property ...... 50

4.4 Concentration and Prediction in the Orthogonal Design Case ...... 52

4.4.1 Concentration Property ...... 52

4.4.2 Prediction Accuracy ...... 54

4.5 Concentration and Prediction in Low Dimensional Case ...... 55

4.6 Concentration and Prediction in High Dimensional Case ...... 57

4.7 Conclusion ...... 60

Chapter 5 A Bayesian Approach for Sparse Principal Components Analysis ...... 61

5.1 Introduction ...... 61

5.2 Hybrid of EM and Variational Inference Algorithm ...... 62

5.2.1 Two-stage Method ...... 65

5.3 Theoretical Properties ...... 66

5.3.1 The general case ...... 67

5.3.2 p/n → 0 case ...... 69

5.3.3 p/n → c > 0 case ...... 70

5.3.4 log p/nη → 0 case ...... 71

5.4 Conclusion and Discussion ...... 72

Appendix A Supplemental Material for Chapter 2 ...... 73

A.1 Proof for Lemma 2.1 ...... 73

A.2 Proof for Theorem 2.1 ...... 75

A.3 Proof for Theorem 2.2 ...... 76

A.4 Variational Inference Algorithm Derivation ...... 77

Appendix B Supplemental Material for Chapter 3 ...... 81

B.1 Proof of Theorem 3.2 ...... 81

B.2 Proof of Theorem 3.3 ...... 84

Appendix C Supplemental Material for Chapter 4 ...... 86

C.1 Proof of Theorem 4.1 ...... 86

C.2 Proof of Theorem 4.2 ...... 88

C.3 Proof of Theorem 4.3 ...... 91

C.4 Proof of Theorem 4.4 ...... 93

Appendix D Supplemental Material for Chapter 5 ...... 97

D.1 Proof of Lemma 5.2 ...... 97

D.2 Proof of Theorem 5.1 ...... 97

D.3 Proof of Theorem 5.2 ...... 100

D.4 Proof of Theorem 5.3 ...... 101

References ...... 103

List of Tables

2.1 MSE of Simulation Study 1 in Chapter 2. Errors of the best method are marked in bold . . . 17

2.2 MAE of Simulation Study 1 in Chapter 2. Errors of the best method are marked in bold . . . 18

2.3 MSE of Simulation Study 2 in Chapter 2. Errors of the best method are marked in bold . . . 19

2.4 MAE of Simulation Study 2 in Chapter 2. Errors of the best method are marked in bold . . . 19

2.5 MSE of Simulation Study 3 in Chapter 2. Errors of the best method are marked in bold . . . 20

2.6 MAE of Simulation Study 3 in Chapter 2. Errors of the best method are marked in bold . . . 21

2.7 MSE of Simulation Study 4 in Chapter 2. Errors of the best method are marked in bold . . . 22

2.8 MAE of Simulation Study 4 in Chapter 2. Errors of the best method are marked in bold . . . 22

3.1 Condition Summary of Asymptotic Optimality for different classifiers in Chapter 3 when Cp → c > 0 ...... 33

3.2 Conditions to Guarantee Asymptotic Sub-optimality for different classifiers in Chapter 3 if Cp → ∞ ...... 34

3.3 Conditions to Guarantee Asymptotic Optimality for different classifiers in Chapter 3 if Cp → ∞ ...... 34

3.4 Classification error for simulation study 1, p = 10^4, p − l entries are 0 ...... 36

3.5 Classification error for simulation study 1, p = 10^4, p − l entries are generated from N(0, 0.1^2) ...... 37

3.6 Classification error for simulation study 2, p = 10^4, 2000 entries are 1 for µ1. Other entries are generated from N(0, 0.1^2) ...... 38

3.7 Classification error for simulation study 2, p = 10^4, 1000 entries are 1 for µ1. 100 entries are 2.5. Other entries are generated from N(0, 0.1^2) ...... 39

3.8 Classification error for simulation study 2, p = 10^4, 1000 entries are 1 for µ1. 50 entries are 3.5. Other entries are generated from N(0, 0.1^2) ...... 39

3.9 Average Misclassification Rate for Simulation Study 3 ...... 40

3.10 Classification error with only 50% features available in the test data in Simulation Study 1, p = 10^4, p − l entries are 0 ...... 41

3.11 Classification error with only 50% features available in the test data in Simulation Study 2, p = 10^4, 2000 entries are 1 for µ1. Other entries are generated from N(0, 0.1^2) ...... 42

3.12 Classification error with only 50% features available in the test data in Simulation Study 2, p = 10^4, 1000 entries are 1 for µ1. 100 entries are 2.5. Other entries are generated from N(0, 0.1^2) ...... 42

3.13 Classification accuracy of movie review sentiment analysis data ...... 43

3.14 Training error and test error of leukemia data set ...... 44

4.1 Compare Thresholded Ridge Regression (Shao et al., 2011) and Two-stage Ridge Regression. Bias and variance of m(0) and m when p ≤ n ...... 56

4.2 Compare Thresholded Ridge Regression (Shao et al., 2011) and Two-stage Ridge Regression. Bias and variance of m(0) and m when p ≥ n ...... 59

List of Figures

2.1 Compare Condition 2.4 with Condition 2.5 in Martin and Walker (2014). Black curve is the lower bound of σ^2 for κ ∈ (0, 1) in Condition 2.4 while three gray curves are lower bounds of σ^2 in Condition 2.5 with β = 25, 50, 100 respectively ...... 10

3.1 Estimation error and classification error of DP methods when p − l entries are 0. Red boxplot indicates estimation error while blue boxplot indicates classification error ...... 38

3.2 Boxplot and scatter plot of classification errors of 4 methods ...... 40

3.3 Word cloud of top 100 unigrams and bigrams in magnitude for coefficients for classifiers based on glmnet and DP. Red indicates negative sentiment and green indicates positive sentiment ...... 43

Chapter 1

Introduction

With modern techniques such as microarrays, high dimensional data are now much easier to obtain, but they pose a new challenge for statisticians: it is much more difficult, both computationally and theoretically, to extract useful information from oceans of data. High dimensional data usually involve an exponentially growing number of variables with only a limited sample size. Statisticians are therefore urged to develop scalable algorithms that find sparse signals among abundant noise in many settings, such as sequence estimation, classification, variable selection and principal component analysis. There is a large body of frequentist work in these areas, while Bayesian approaches are relatively limited because sampling-based methods such as MCMC are relatively slow.

In the Bayesian literature, the spike-and-slab prior (Mitchell and Beauchamp, 1988) is widely used to capture sparsity structure. A spike-and-slab prior for an unknown sparse parameter is a mixture of two distributions: one (the slab) has a large variance, while the other (the spike) has a small variance. Latent binary indicators are often introduced for group assignment.
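As a small illustration of this two-group structure (a hypothetical R sketch, not taken from the thesis; the inclusion probability and the two standard deviations below are arbitrary choices), one can draw a sparse coefficient vector as follows.

```r
# Draw p coefficients from a spike-and-slab prior with latent binary indicators.
set.seed(1)
p <- 20
gamma <- rbinom(p, 1, 0.1)                     # latent indicators: 1 = slab, 0 = spike
theta <- ifelse(gamma == 1,
                rnorm(p, mean = 0, sd = 3),    # slab component: large variance
                rnorm(p, mean = 0, sd = 0.01)) # spike component: small variance
```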

We aim to develop scalable algorithms for learning sparsity structure using spike-and-slab priors. The EM algorithm (Dempster et al., 1977) is a well-known algorithm for computing the posterior mode. One advantage of the EM algorithm is that it converges faster than sampling-based methods; however, it only computes the posterior mode and cannot approximate the full posterior distribution.

Another class of algorithms, variational inference, originates from physics. Some of the earliest work, including Peterson and Hartman (1989), applied variational inference to Bayesian models with complicated hierarchical structure such as neural networks. Neal and Hinton (1998) drew an interesting connection between the EM algorithm and variational inference: the EM algorithm can be viewed as a special case of variational inference. Nowadays researchers use variational inference to approximate the posterior distribution of hierarchical Bayesian models. For an overview of variational inference, see Blei et al. (2017).

In this thesis, we use the EM algorithm, variational inference and a hybrid of the two to achieve scalable computation.

In theory, we are interested in both estimation consistency and selection consistency of sparse parameters.

In the Bayesian literature, estimation consistency is often illustrated via posterior concentration, which refers to the property that the posterior measure puts enough mass on a ball centered at the truth with the desired radius. Another type of consistency is selection consistency, which means that, with probability going to 1, our proposed algorithms learn the correct sparsity pattern in various models.

Specifically, in this thesis we aim to develop scalable Bayesian algorithms for Gaussian sequence estimation, naive Bayes classification, variable selection and sparse PCA. We develop theory to guarantee estimation consistency and selection consistency of our methods.

1.1 Gaussian Sequence Model

The Gaussian sequence model has an extremely simple form:

Xi = θi + ei, i = 1, 2, . . . , n, (1.1)

where θ = (θ1, . . . , θn) is an unknown mean parameter. θ is often assumed to have a sparsity structure, meaning that most of the θi's are very small or exactly zero. This model arises in many applications, such as astronomy, signal processing, and bioinformatics. It is also a canonical problem for many modern statistical methods, such as nonparametric function estimation, large-scale variable selection and hypothesis testing.
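For concreteness, the following R snippet (illustrative only; the choices n = 200, sn = 10 and signal value 5 are arbitrary) simulates one draw from model (1.1) with a sparse mean vector.

```r
# Simulate a sparse Gaussian sequence: most theta_i are exactly zero.
set.seed(1)
n  <- 200
sn <- 10                                 # number of nonzero means
theta <- c(rep(5, sn), rep(0, n - sn))   # sparse mean vector
X <- theta + rnorm(n)                    # X_i = theta_i + e_i,  e_i ~ N(0, 1)
```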

The Gaussian sequence model has a long history, which can be traced back to the James-Stein estimator. Besides frequentist methods, empirical Bayes methods with a carefully chosen prior can also be applied to the Gaussian sequence model.

In Chapter 2 we propose an empirical Bayes approach to estimating a sparse sequence. We adopt the popular spike-and-slab prior with one component being a point mass at zero, and the other component being a mixture of Gaussian distributions. Although the Gaussian family has been shown to be suboptimal for this problem, we show that Gaussian mixtures, with means and mixing weights chosen via a proper clustering algorithm such as Dirichlet process mixture clustering, have the desired asymptotic behavior, e.g., the corresponding posterior distribution concentrates on balls with the desired minimax rate. To achieve computational efficiency, we obtain the posterior distribution using a deterministic variational inference algorithm. Empirical studies on several benchmark data sets demonstrate the superior performance of the proposed algorithm compared to other alternatives.

1.2 High Dimensional Classification

The work on the Gaussian sequence model can be naturally applied to high dimensional classification. Classical Linear Discriminant Analysis (LDA) is not preferable in the high dimensional setting because of the large bias incurred in estimating both the covariance matrix and the mean difference (Bickel and Levina, 2004; Fan and Fan, 2008; Shao et al., 2011). The Independence Rule (Bickel and Levina, 2004) can be applied since simply ignoring the covariance structure does not lose too much. Meanwhile, since most features are irrelevant, the underlying true mean difference is sparse, and after normalization the sample mean difference is asymptotically normally distributed. Therefore the work on the Gaussian sequence model can be applied.

In Chapter 3 we apply the above empirical Bayes approach to estimating the sparse normalized mean difference, which can be directly applied to high dimensional linear classification. In theory, we build a bridge connecting the estimation error of the mean difference and the classification error, and provide sufficient conditions for sub-optimal and optimal classifiers.

1.3 Variable Selection

The Gaussian sequence model can be viewed as a linear regression model with an identity design matrix. Learning the sparsity structure of a Gaussian sequence can therefore be generalized to selecting variables with nonzero regression coefficients. Penalized likelihood approaches have been widely used by frequentists for variable selection during the last twenty years since they provide sparse solutions.

Although the L2 penalty yields a closed-form solution for the regression coefficients, the ridge regression estimator is not sparse. Shao and Deng (2012) add a thresholding step to the ridge regression estimator to introduce sparsity. However, this one-step thresholded ridge regression estimator still has relatively large bias. In Chapter 4 we propose an adaptive ridge regression estimator to further reduce the bias.
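To illustrate the thresholding idea (a minimal sketch under assumed settings, not the adaptive estimator developed in Chapter 4; the penalty lambda and threshold a_n below are arbitrary), one can fit a ridge regression and then zero out the small coefficients:

```r
# Ridge regression followed by hard thresholding (illustrative sketch only).
thresholded_ridge <- function(X, y, lambda = 1, a_n = 0.5) {
  p <- ncol(X)
  beta_ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
  ifelse(abs(beta_ridge) > a_n, beta_ridge, 0)   # keep only the large coefficients
}
```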

The spike-and-slab prior (Mitchell and Beauchamp, 1988) is a useful tool for Bayesian variable selection since it can eliminate noisy features while avoiding shrinking large regression coefficients too much. Despite its great success, the computational cost of spike-and-slab methods is high compared with their frequentist counterparts. Ročková and George (2014) and Wang et al. (2016) proposed two versions of the EM algorithm to accelerate computation. Nevertheless, the EM algorithm only computes the posterior mode and cannot approximate the whole posterior distribution.

In Chapter 4 we also show that the procedure for constructing the adaptive ridge regression estimator is equivalent to running the EM algorithm of Wang et al. (2016) for two iterations. In addition, we construct a distribution that approximates the posterior from the output of the EM algorithm. Theoretically, we show that this distribution estimator has the desired concentration property and that the adaptive ridge regression estimator has the desired prediction accuracy.

1.4 Sparse PCA

Dimension reduction is an important technique in big data analysis for extracting the signal and removing the noise. Principal Component Analysis (PCA) is a classical method to find a small number of directions that capture most of the information. In classical PCA all the weights of the principal components (also known as loadings) are nonzero. This not only hampers interpretation but also yields an inconsistent estimator of the principal components.

Sparse PCA, instead, aims to find principal components with a small number of nonzero loadings, which not only improves interpretation but also, under appropriate assumptions, improves the estimation accuracy of the principal components. Sparse PCA now has broad applications in bioinformatics, finance and signal processing.

In Chapter 5 we propose a Bayesian approach to obtain sparse principal components. We show that our algorithm, which is based on variational approximation, achieves Bayesian selection consistency when both dimensionality and sample size go to infinity.

Chapter 2

An Empirical Bayes Approach for Sparse Sequence Estimation

2.1 Introduction

Consider a Gaussian sequence model

Xi = θi + ei, i = 1, 2, . . . , n, (2.1)

where θ = (θ1, . . . , θn) is the unknown mean parameter, often assumed to be sparse, and the ei's are independent errors following a standard normal distribution. The primary interest is to reconstruct the sparse vector θ based on the data X^n = (X1, X2, ..., Xn). Such a simple model arises in many applications, such as astronomy, signal processing and bioinformatics. It is also a canonical model for many modern statistical methods, such as nonparametric function estimation, large-scale variable selection and hypothesis testing.

There are different ways to define the sparsity of a vector. In this chapter, we focus on the common definition of sparsity, i.e., many elements of θi’s are zero. In particular, we assume θ is from the class of nearly black vectors,

Θ0(sn) = {θ ∈ R^n : #{1 ≤ i ≤ n : θi ≠ 0} = sn}.

The parameter sn measures the sparsity of θ, which is usually assumed to be o(n), but unknown. So a desired feature of any reconstruction procedure is to adapt to the unknown sparsity level.

Since θ is known to be sparse, a natural approach is to threshold X^n. The thresholding rule can be chosen by the principle of minimizing the empirical fitting error with penalization (Golubev, 2002) or by minimizing the False Discovery Rate (Abramovich et al., 2006). In addition, a plethora of research on variable selection and prediction in the context of high dimensional linear regression models can also be applied to this problem (Fan and Lv, 2010; Bühlmann and van de Geer, 2011; and Huang, 2008).

Some thresholding procedures are motivated from a Bayesian perspective, such as George and Foster (2000), Johnstone and Silverman (2004), and Castillo et al. (2012). The idea is to model the θi's with a two-group prior, where one component is a point mass at zero due to the sparsity assumption and the other component is a continuous distribution with a smooth symmetric density function g, namely,

π(θi) = wδ0(θi) + (1 − w)g(θi). (2.2)

Theoretical studies have indicated that g(·), the prior density function for the non-zero component, should have a heavy tail for the posterior distribution to have the desired asymptotic behavior. Therefore, the double exponential distribution and its scaled variants are recommended for the choice of g, but not Gaussian distributions. The problem with lighter-tailed distributions such as the Gaussian is that they tend to over-shrink the Xi's corresponding to large θi's and therefore attain a slower posterior contraction rate (Johnstone and Silverman, 2004; Castillo et al., 2012).

In this chapter, we propose an empirical Bayes approach to this problem. Despite this caution against Gaussian distributions, we still use Gaussians in our prior specification, which takes the following form:

π(θi) = wδ0(θi) + (1 − w) Σ_{t=1}^T wt N(mt, σ^2),    (2.3)

and then use the data to learn the weights w, w1, ..., wT and centers m1, ..., mT. Our prior choice can be viewed as a special case of the two-group prior (2.2) with g being a mixture of Gaussian density functions. The Gaussian prior is appealing due to its conjugacy, which simplifies our analysis and also enables tractable computation.
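To make the conjugacy concrete, here is a minimal R sketch (not the thesis implementation) that computes the posterior mean of a single θi under prior (2.3), assuming Xi | θi ~ N(θi, 1) and treating w, the wt's, the mt's and σ^2 as given; the numbers in the example call are arbitrary.

```r
# Posterior mean of theta_i given X_i = x under prior (2.3) with unit noise variance.
posterior_mean <- function(x, w, wt, m, sigma2) {
  dens0 <- w * dnorm(x, mean = 0, sd = 1)                           # marginal under the point mass
  denst <- (1 - w) * wt * dnorm(x, mean = m, sd = sqrt(1 + sigma2)) # marginal under N(m_t, sigma2)
  post  <- c(dens0, denst) / (dens0 + sum(denst))                   # posterior component weights
  means <- c(0, (sigma2 * x + m) / (sigma2 + 1))                    # component-wise posterior means
  sum(post * means)
}
posterior_mean(x = 3.2, w = 0.9, wt = c(0.5, 0.5), m = c(3, 6), sigma2 = 4)
```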

As revealed by our asymptotic analysis, the suboptimal behavior of Gaussian distributions noted in Johnstone and Silverman (2004) and Castillo et al. (2012) pertains to Gaussian distributions with mean zero, and it can be avoided by a proper choice of the weights w1, ..., wT and centers m1, ..., mT. The remainder of this chapter is organized as follows. We present our main results in Section 2.2. We show that under some mild conditions on the mt's and wt's, the corresponding posterior has the desired asymptotic behavior: it concentrates around the true parameter θ* at the minimax rate, its effective dimension adapts to the unknown sparsity sn, and the corresponding posterior mean is asymptotically minimax. Section 2.3 introduces a general framework for finding mt's and wt's that satisfy the technical conditions. A variational implementation of our approach, as well as empirical studies, is presented in Section 2.4. All the proofs are given in Appendix A.

We begin the remainder of this chapter by briefly discussing some related work.

• This chapter is motivated by a recent work of Martin and Walker (2014), where g(θ) = gi(θ) in (2.2) is set to be a Gaussian distribution centered exactly at the data point Xi. In Martin and Walker (2014), the resulting estimate of θi is still a shrinkage estimate toward zero, while in our approach the estimate of θi is adaptively shrunk toward the mean of nearby data points. This explains why the empirical performance of our approach is better than that of Martin and Walker (2014).

• Estimating θ for the Gaussian sequence model (2.1) can also be viewed as a compound decision problem (Robbins, 1951), where the θi's are assumed to be i.i.d. random variables from a common but unknown distribution G. The aforementioned two-group prior (2.2) can be viewed as a special form of G, but in general G can take any form. A number of nonparametric approaches have been proposed to estimate G and in return provide an estimate for θ, such as the maximum likelihood approach of Jiang and Zhang (2009), the nonparametric empirical Bayes approach of Brown and Greenshtein (2009a), and a convex optimization based approach of Koenker (2014). In particular, Gaussian mixtures are used to estimate G by Jiang and Zhang (2009) and Brown and Greenshtein (2009a). However, their asymptotic results do not cover the nearly black class Θ0(sn). In their simulation study, Jiang and Zhang (2009) do consider the sparse situation and suggest adding a Gaussian component with mean zero. In our asymptotic study, we have found that to achieve the desired asymptotic properties, a point mass at zero, instead of a Gaussian with mean zero, seems necessary for the nearly black class.

2.2 Main Results

First we introduce some key notation. The data vector is denoted by X^n = (X1, X2, ..., Xn); Π_{X^n}(·) is used to denote the prior, the subscript indicating its possible dependence on the data; θ* = (θ1*, ..., θn*) is the unknown true mean of X^n; p_θ(X^n) is the likelihood function of θ given X^n.

Consider the following hierarchical prior Π_{X^n}(· | w1:T, m1:T, σ^2, α) on θ:

w ∼ Beta(αn, 1),    (2.4)

θi | w ∼ wδ0 + (1 − w) Σ_{t=1}^T wt N(mt, σ^2),  i = 1, ..., n,    (2.5)

where σ^2, α and T are fixed parameters specified by the user. σ^2 and α do not depend on the dimension n, while T = Tn, w1:T and m1:T may depend on n. For simplicity of notation, we suppress the explicit dependence on n of these parameters. We consider two scenarios: (a) w1:T and m1:T are fixed parameters that do not depend on the data X^n; (b) w1:T and m1:T depend on X^n. Scenario (b) corresponds to empirical Bayes approaches. The theoretical results are similar for the two scenarios except that in Scenario (b) we need one additional condition on σ^2. For simplicity we first consider Scenario (a) and then move on to Scenario (b).

The posterior distribution is proportional to p_θ(X^n) Π_{X^n}(θ). Strong assumptions are usually needed to prove Bayesian consistency. However, following Walker and Hjort (2001) and Martin and Walker (2014), we can define our posterior distribution with a fractional likelihood, which weakens the conditions required for Bayesian consistency. Given a fractional parameter κ (which does not depend on n), the corresponding posterior distribution is then proportional to p_θ(X^n)^κ Π_{X^n}(θ). In implementation, κ is set to be a large number close to 1 so as to capture most of the information in the data; in all of the simulation studies in Section 2.4, we set κ = 0.99. In particular, the posterior measure of a Borel set A ⊂ R^n involving κ, denoted by Qn(A), can be expressed as

Qn(A) = ∫_A {p_θ(X^n)/p_{θ*}(X^n)}^κ Π_{X^n}(dθ) / ∫_{R^n} {p_θ(X^n)/p_{θ*}(X^n)}^κ Π_{X^n}(dθ).    (2.6)

θ* is assumed to be sparse. Denote by S* the support of θ* and by sn = |S*| its cardinality. The minimax rate established in Donoho et al. (1992) for estimating a vector in Θ0(sn) is εn = sn log(n/sn). The aim is to prove that the posterior mean estimator based on the above prior attains the asymptotically minimax L2 error rate over Θ0(sn). To accomplish this, we first prove that our posterior measure Qn has the desired concentration rate and then prove that the L2 error is bounded.

The key conditions are imposed on sn, w1, ..., wT and m1, ..., mT. Condition 2.1 states that sn is upper bounded by the dimension to the power of γ. This is a mild condition since we have already assumed sn = o(n).

Condition 2.1. sn = O(n^γ) for a constant 0 ≤ γ < 1.

To find suitable m1, ..., mT, we apply a suitable clustering algorithm to the nonzero mean set θ_{S*} = {θi* : i ∈ S*} and take the cluster centers as m1, ..., mT. Condition 2.2 essentially puts an upper bound on the within-cluster sum of squares of θ_{S*} under the chosen clustering algorithm. Denoting by m_{ti} the cluster center corresponding to θi* ∈ θ_{S*}, we assume

Condition 2.2. Let m_{ti} = arg min_{1≤t≤T} (Xi − mt)^2. Then Σ_{i∈S*} (Xi − m_{ti})^2 ≤ K^* εn with Pθ*-probability 1 as n → ∞, where K^* is a constant independent of n.

Since the summation on the left-hand side has sn terms, the condition is satisfied if max_{i∈S*} (Xi − m_{ti})^2 = O(log(n/sn)), that is, if the squared distance between Xi and its corresponding cluster center grows no faster than log(n/sn). Therefore Condition 2.2 is not strict, since sn = o(n). In implementation, to specify m1, ..., mT we propose two methods: (1) a simple hard thresholding method; (2) a novel clustering algorithm to estimate the T nonzero cluster centers. The details are provided in Sections 2.3 and 2.4.

Condition 2.3 is on the mixing weights w1, w2, ··· , wT .

Condition 2.3. There exists a constant η > 0 such that min_{1≤t≤T} wt ≥ Cn^{−η} > 0 with Pθ*-probability 1 as n → ∞, where C is a constant independent of n.

η can be any suitably large constant. Since T is independent of n, this condition is easily satisfied if we simply set wt = 1/T for each t. In general, if each cluster size is bounded above, we can also plug in the empirical cluster weights.

Given the above three conditions, we first introduce Lemma 2.1, an analogue of Lemma 1 in Martin and Walker (2014). Note that Lemma 2.1 holds in both Scenario (a) and Scenario (b).

Lemma 2.1. Let Dn be the denominator of the posterior measure Qn. If Conditions 2.1, 2.2 and 2.3 hold, then Dn > [α/(1 + α)] exp{−K0 εn − op(εn)} with Pθ*-probability 1, where K0 is a positive constant independent of n.

After establishing a lower bound for the denominator of Qn, in order to prove posterior concentration we establish an upper bound for the numerator of Qn evaluated on the complement of a ball centered at θ*. Specifically, we show that the posterior probability measure Qn concentrates asymptotically on a ball centered at the truth θ* with squared radius proportional to εn; that is, Qn assigns vanishing mass to

A_{Mεn} = {θ ∈ R^n : ‖θ − θ*‖^2 > M εn}.

First we provide a theorem for the case when m1:T and w1:T do not depend on X^n.

Theorem 2.1. Suppose m1:T and w1:T do not depend on X^n. If Conditions 2.1, 2.2 and 2.3 hold, then there exists M > 0 such that Qn(A_{Mεn}) → 0 with Pθ*-probability 1 as n → ∞.

In Scenario (b), m1:T and w1:T are random vectors that depend on X^n. In order to obtain the same theoretical results, we impose Condition 2.4 on σ^2.

Condition 2.4. σ^2 > 1/(κ(1 − κ)).

Condition 2.4 is not a strict condition: σ^2 needs to be bounded below to obtain a heavier-tailed prior. A similar condition was used in Martin and Walker (2014), who imposed Condition 2.5 to bound σ^2 from below.

Condition 2.5. (Martin and Walker, 2014) There exists a constant β > 1 such that

1/(σ^2 (1 + β/σ^2)^{1/β}) − 1/(σ^2 + β) < κ[(1 − κ)β − 1]/(β − 1).

Figure 2.1 compares the lower bound of σ^2 in Condition 2.4 with that in Condition 2.5; the two bounds are roughly the same over a large range of κ ∈ (0, 1), so the two conditions are similar.

Given Conditions 2.1, 2.2, 2.3 and 2.4, posterior concentration can be established in Scenario (b) as well; see Theorem 2.2. Theorems 2.1 and 2.2 have exactly the same conclusion under slightly different conditions, because of the different hyper-parameter specification schemes for w1:T and m1:T; the proof techniques also differ slightly.


Figure 2.1: Compare Condition 2.4 with Condition 2.5 in Martin and Walker (2014). Black curve is the lower bound of σ2 for κ ∈ (0, 1) in Condition 2.4 while three gray curves are lower bounds of σ2 in Condition 2.5 with β = 25, 50, 100 respectively.

Theorem 2.2. Suppose m1:T and w1:T depend on X^n. If Conditions 2.1, 2.2, 2.3 and 2.4 hold, then there exists M > 0 such that Qn(A_{Mεn}) → 0 with Pθ*-probability 1 as n → ∞.

Remark 2.1. The proof of Lemma 2.1 and Theorem 2.1 (2.2) intrinsically shows that using a proper clustering algorithm to learn m1:T can improve the posterior concentration rate, and thus the posterior mean estimator has a lower error rate. To see this, write Qn(A_{Mεn}) = Nn/Dn. Lemma 2.1 and Theorem 2.1 (2.2) show that, as n → ∞, with Pθ*-probability 1,

Dn ≥ (α/(1 + α)) exp{ Σ_{i∈S*} log(w_{ti}) − (κ/(κσ^2 + 1)) Σ_{i∈S*} (Xi − m_{ti})^2 − 2εn − op(εn) },

Nn ≤ exp(−(κ(1 − κ)M/2) εn).

Conditions 2.1 and 2.3 guarantee Σ_{i∈S*} log(w_{ti}) ≥ −K_* εn, where K_* is a positive constant independent of n. Condition 2.2 guarantees Σ_{i∈S*} (Xi − m_{ti})^2 ≤ K^* εn.

The denominator Dn involves Σ_{i∈S*} (Xi − m_{ti})^2, which is the within-cluster sum of squares (WCSS) of X^n on S*. A smaller WCSS results in a larger lower bound on the denominator, which leads to a smaller estimation error. In terms of implementation, a proper clustering algorithm can therefore be applied to X^n to learn m1:T that minimizes the WCSS.

rate, so it ought to produce an estimator of θ with good properties. Next we show that the posterior mean

θˆ is a minimax estimator. Note that the proof is the same regardless of Scenario (a) or (b).

Theorem 2.3. If the conditions to prove Theorem 2.1 or 2.2 hold, there exists a universal constant M 0 > 0,

ˆ ∗ 2 0 such that Eθ∗ kθ − θ k ≤ M εn for all large n.

The proof is the same as the one by Martin and Walker (2014), provided that we have proved Lemma

2.1 and Theorem 2.1 (2.2).

As posterior concentrates around θ∗ at the minimax rate, we could conclude the majority of the posterior

n mass concentrates on sn-dimensional subspaces of R . The sparsity level of posterior is measured by the posterior distribution of w. Theorem 2.4 shows posterior distribution of w put large probability mass above

−1 1 − snn .

−1 Theorem 2.4. If the conditions to prove Theorem 2.1 or 2.2 hold, let δn = Kεnn , and K > 0 is a

n suitably large constant. Then Eθ∗ {P (1 − w > δn|X )} → 0 as n → ∞.

Notice that δn = Ksn/n log(n/sn) ≈ Ksn/n, therefore the posterior distribution of w concentrates

−1 around 1 − snn . That is, the effective dimension of our posterior distribution adapts to the unknown sparsity level sn. Our posterior mean estimator could detect true sparsity pattern with large probability. The proof is the same as the one by Martin and Walker (2014) as long as we have proved Lemma 2.1 and

Theorem 2.1 (2.2).

2.3 Prior Specification via Hard Thresholding

In this section we will demonstrate if we specify w1:T and m1:T via hard thresholding, Condition 2.2 and 2.3 are satisfied. We could construct the desired prior via simple hard thresholding. √ First, set the universal thresholding value an = 2 log n. Mn0 = {0} ∪ {Xi : |Xi| ≥ an}; Tˆn0 = |Mn0|. Set M = {mˆ , mˆ , ··· , mˆ } andw ˆ = 1/Tˆ for any t. Therefore we have n0 1 2 Tˆn0 t n0

X 2 X 2 X 2 2 (Xi − mˆ ti ) ≤ (Xi − Xi1|Xi|≥an ) = (Xi1|Xi|≤an ) ≤ snan = 2sn log(n) = O(εn). i∈S∗ i∈S∗ i∈S∗

Condition 2.2 is satisfied.

Since the weight isw ˆt = 1/Tˆn0, Choosing η ≥ 1 Condition 2.3 is satisfied trivially. In conclusion, using √ threshold an = 2 log n, Condition 2.2 and 2.3 hold.

11 After specifying prior parameters w1:T and m1:T , fully posterior computation can be implemented via prior (2.4). However, fast approximation could be done via EM algorithm. To approximate (2.4), a discrete

T n P prior ΠX (· | w1:T , m1:T ) = t=1 wtδmt is utilized with w1:T and m1:T estimated from data. Mn0 = √ {0} ∪ {Xi : |Xi| ≥ an} consists of m1:T , where an = 2 log n. w1:T is determined by a single parameter w: w1 = w and wt = (1 − w)/(T − 1) for t 6= 1. To estimate w, EM algorithm could be used with updating formula as follows (initial values are w = 1/T ):

n (k−1) 1 X w φ(Xi) w(k) = , n (k−1) PT (k−1) i=1 w φ(Xi) + t=2(1 − w )/T · φ(Xi − mt) where φ is standard Normal probability density function. After convergence, posterior mean estimator is calculated: PT m φ(X − m )w θˆ = t=1 t i t t . i PT t=1 φ(Xi − mt)wt Simulation study in the following sections show the above method has the good performance in terms of

mean absolute error (MAE). However when sn is relatively large, the above prior may not have good perfor-

mance. When sn is relatively large, we have more information about the nonzero elements, therefore using clustering methods to learn the prior can improve the performance. We propose a new prior construction

method based on the clustering algorithm to learn the clustering structure of θis in the following section.

2.4 Implementation

2 In Section 2.2 we have shown using good prior with parameters (w1:T , m1:T , σ , α) could result in posterior concentration and minimax rate of Bayes posterior mean estimator. In implementation, we first specify σ2

n P and α, then estimate (w1:T , m1:T ) using X via a clustering algorithm. Condition 2.2 implies i∈S∗ (Xi − 2 ∗ mti ) needs to be bounded, however, in practice S is unknown so that we cannot directly apply clustering algorithms. One remedy is to apply the clustering algorithm on the elements of Xn which are beyond certain

threshold. One cluster center is fixed to be 0 and we aim to learn all the nonzero cluster centers.

n we apply some proper clustering algorithm on X to assign clustering centers to m1:T and cluster weights

to w1:T . We need to pre-specify 0 as one clustering center so that the major task is to estimate the nonzero cluster centers. In Bayesian framework, Dirichlet process mixture model is used to cluster data. We build a

n Dirichlet process (DP) mixture model on X and estimate mt by plugging in corresponding cluster centers.

12 Dirichlet process mixture model is summarized as follows:

θi ∼ G, G ∼ DP(α0,G0);

Xi ∼ N(θi, 1), i = 1, 2, ··· , n;

where G0 is the base measure and α0 is the concentration measure. Given G0 and α0, we have stick breaking P∞ representation of G as t=1 πtδηt (·), where ηt is drawn independently and identically distributed (i.i.d.) from Qt−1 base measure G0, while πt = Vt l=1 (1 − Vl). Vl is drawn i.i.d. from Beta(1, α). From this formulation we could see that random distribution function G is almost surely discrete. Since θ∗ is a sparse vector, in

order to get a random distribution G with a positive point mass at 0, G0 ought to have a positive mass at

0. Therefore we model G0 as a normal component with a point mass at 0:

2 G0 = w0δ0 + (1 − w0)N(0, σ0);

2 where w0 and σ0 are 2 pre-specified parameters. Then G could be written as

∞ ∞ ∞ X X X G = π (ξ δ (·) + (1 − ξ )δ ∗ (·)) = π ξ δ (·) + π (1 − ξ )δ ∗ (·) t t 0 t ηt t t 0 t t ηt t=1 t=1 t=1 ∞ X ≡ wδ (·) + (1 − w) w δ ∗ (·), 0 t ηt t=1

P∞ πt(1−ξt) where w = t=1 πtξt and wt = 1−w . In this formulation ξt is drawn i.i.d. from Ber(w). If ξt = 0, ∗ 2 ∗ then ηt is drawn from N(0, σ0), otherwise ηt = 0. Since w > 0, G always has a positive probability mass at 0 which could induce certain level of sparsity. In the following subsection, we will develop a variational inference algorithm to estimate G.

2.4.1 Variational Algorithm to Specify Prior

Once we have specified Dirichlet process as the prior, we need to compute the posterior distribution of G in order to calculate Maximum A Posteriori (MAP) estimator of G. The major challenge here is that we have the infinite sum in the stick breaking form of G, which makes it impossible to sample from G. One remedy here is to use a constant T as the upper bound of the number of clusters (Blei et al., 2006). Then we have

2 the following truncated version of stick breaking process using G0 = w0δ0 + (1 − w0)N(0, σ0) as the base

13 measure.

Vt|α0 ∼ Beta(1, α0), t = 1, 2, ··· ,T − 1; (2.7)

ξt ∼ Ber(w0), t = 1, 2, ··· ,T ; (2.8)  δ0 ξt = 1 ∗  ηt |ξt ∼ ; t = 1, 2, ··· ,T ; (2.9)  2 N(0, σ0) ξt = 0 t−1 T −1 Y Y πt = Vt (1 − Vj), t = 1, 2, ··· ,T − 1, πT = (1 − Vj); (2.10) j=1 j=1

Zk|{V1,V2, ··· ,VT −1} ∼ Multinomial(π); (2.11)

X |Z ∼ N(η∗ , 1), i = 1, 2, ··· , n. (2.12) i i Zi

n ∗ ∗ ∗ ∗ The observed data are X and the parameters are Z1×n, V1×(T −1), η1×T , ξ1×T . η = (η1 , ··· , ηT ) contains all unique values of η = (η∗ )n . Zi i=1 Classical Markov Chain Monte Carlo (MCMC) (Escobar and West, 1995) has been developed to compute the posterior distribution of these parameters. However, due to large n, the number of parameters is huge, making MCMC converging very slowly. In this chapter, we use a variational algorithm to get posterior distribution. Variational methods are deterministic. The essence of variational inference is to view the computation of posterior distribution as an optimization problem. Solving this optimization problem gives an approximation to the posterior distribution. Blei et al. (2006) proposed variational algorithms for Dirichlet process mixture models for exponential family only. Although normal distribution with a positive mass at

0 does not belong to exponential family, we could follow the same philosophy to derive the corresponding algorithm.

We consider mean field variational inference and assume the following fully factorized form of variational distribution:

q(Z, V, η, ξ) = qp,m,τ (η, ξ)qγ1,γ2 (V)qΦ(Z);

Through the calculation shown in the Appendix, we could prove the optimal q must be further factorized as follows:

∗ QT ∗ • qp,m,τ (η , ξ) = t=1 qpt,mt,τt (ηt , ξt), where p = (p1, p2, ··· , pT ), m = (m1, m2, ··· , mT ), τ = ∗ ∗ ∗ (τ1, τ2, ··· , τT ), and qpt,mt,τt (ηt , ξt) = pt1ξt=1δ0 + (1 − pt)1ξt=0qmt,τt (ηt ), where qmt,τt (ηt ) is normal 2 probability density with mean mt and variance τt .

QT −1 • qγ1,γ2 (V) = t=1 qγ1t,γ2t (Vt), where γ1 = (γ11, γ12, ··· , γ1(T −1)), γ2 = (γ21, γ22, ··· ,

14 γ2(T −1)), qγ1t,γ2t (Vt) is Beta distribution with parameters (γ1t, γ2t).

Qn • qΦ(Z) = i=1 qφi (Zi); where Z = (Z1,Z2, ··· ,Zn), Φ = (φ1, φ2, ··· , φn),φi = (φi,1, φi,2, ··· , φi,T ),

φi,t = q(Zi = t), qφi (Zi) is Multinomial distribution with parameters φi.

Variational algorithm is summarized in Algorithm 1. Via iterating these steps we could update the

variational parameters. After convergence of Φ, p, m, τ , γ1 and γ2, we get an approximation of the posterior by plugging in these estimated parameters. The parameters we are interested in are Φ,p and m.

−1 (In the algorithm k · k∞,∞ means the element-wise maximum absolute value; logit(x) = (1 + exp(−x)) .)

Algorithm 1 Variational Bayes Algorithm for Dirichlet process mixture model with G0 input y, α0, σ0, w0,T initialize Φ(1) and Φ(0); (1) (0) while kΦ − Φ k∞,∞ >  do 2 Pn (0) σ0 · i=1 φi,t Xi mt ← 2 Pn (0) , t = 1, 2, ··· ,T ; σ0 · i=1 φi,t +1 2 2 σ0 τt ← 2 Pn (0) , t = 1, 2, ··· ,T ; σ0 · i=1 φi,t +1 σ2·(Pn φ(0)X )2 −1 2 Pn (0) 0 i=1 i,t i pt ← logit (log(w0) − log(1 − w0) + log(σ0 · i=1 φi,t + 1)/2 − 2 Pn (0) ), t = 1, 2, ··· ,T ; 2(σ0 · i=1 φi,t +1) Pn (0) γt,1 ← 1 + i=1 φi,t , t = 1, 2, ··· ,T − 1; Pn PT (0) γt,2 ← α0 + i=1 j=t+1 φi,t , t = 1, 2, ··· ,T − 1; E Pt−1 E 1 2 2 ,t ← q log Vt + i=1 q log(1−Vt)+(1−pt)mtXt − 2 (1−pt)(mt +τt ), t = 1, 2, ··· , T, i = 1, 2, ··· , n; (1) φi,t ∝ exp(Si,t), t = 1, 2, ··· , T, i = 1, 2, ··· , n; end while output p, m, Φ

2.4.2 Posterior Computation

Given estimators pˆ, mˆ , Φˆ , we construct a MAP estimator of G. Recall thatm ˆ t is the nonzero cluster ˆ center;p ˆt is the probability mass of zero of the component indexed by t; each entry φit of Φˆ is the posterior probability of Z belonging to the cluster t. The approximate posterior distribution of η∗ is i Zi

T T X ˆ X ˆ ( φitpˆt)δ0(·) + φit(1 − pˆt)δmt (·). t=1 t=1

The most probable posterior assignment of η∗ based on above posterior is denoted asη ˆ∗ . MAP estimate Zi Zi of cluster weights including zero clusters isw ˜ = #{k :η ˆ∗ =m ˆ }/n andw ˜ = #{k :η ˆ∗ = 0}/n. Then the t Zi t 0 Zi

15 estimated prior is T ˆ X GMAP =w ˜0 · δ0(·) + w˜tδmˆ t (·). t=1

κ Plugging in GˆMAP, hierarchical Bayes model could be written as (N denotes normal probability density to the power of κ):

T ˆ ˆ X θi ∼ GMAP, GMAP =w ˜0 · δ0(·) + w˜tδmˆ t (·) t=1

κ Xi ∼ N (θi, 1), i = 1, 2, ··· , n.

This is different from the original data generating process (2.4) but could be regarded as an approximation of (2.4) by setting σ2 = 0. The main reason why we use this approximation is that we could compute a

n closed-form posterior distribution of θ given X using Bayes Formula: since Xi ∼ N(θi, 1) and θi ∼ GˆMAP,

n the posterior distribution of θi given X has the form

2 2 κXi PT κ(Xi−mˆ t) T w˜0 exp(− 2 )δ0 + t=1 w˜t exp(− 2 )δmˆ t X 2 2 ≡ wˆ0 · δ0(·) + wˆtδmˆ t (·); κXi PT κ(Xi−mˆ t) w˜0 exp(− 2 ) + t=1 w˜t exp(− 2 ) t=1

PT wherew ˆt is the posterior weight for centerm ˆ t and t=1 wˆt = 1. The final estimator of θi is the posterior mean estimator T ˆ X θi = wˆtmˆ t. t=1

2.4.3 Simulation Studies

We conduct the following four simulation studies; R package for our empirical Bayes estimator is available

in https://github.com/yunboouyang/VBDP; all the source code is summarized in https://github.com/

yunboouyang/EBestimator. For our empirical Bayes estimator, we set the maximal number of clusters to be T = 10. We set fractional likelihood parameter κ = 0.99 throughout all simulation studies. We set

2 concentration parameter α0 = 1 in all of simulation studies. For G0 = w0δ0 + (1 − w0)N(0, σ0), we set

w0 = 0.01. For simulation study 1, we set σ0 = 4. For other simulation studies, we set σ0 = 6. The influence of these hyper-parameters will diminish when dimension n is large.

Besides prior specification via hard thresholding (denoted as “Hard Thresh Prior”) and prior specification

via Dirichlet Process mixture (denoted as “DP prior”), we also consider estimators proposed by Martin and

Walker (2014) (denoted as EBMW), by Koenker and Mizera (2014) (denoted as EBKM), and by Johnstone

and Silverman (2004)(denoted as EBMed, we directly use functions from the EbayesThresh package by

16 Johnstone and Silverman (2005)), as well as the Hard thresholding estimator, soft thresholding estimator,

SURE estimator (using waveThresh package) and FDR estimator with parameters q = 0.01, 0.1, 4.

n The oracle prior in the compound decision setup is Π = P δ ∗ /n. The corresponding posterior mean n i=1 θi estimator has the minimal risk among all separable rules. In all simulation studies, the optimal prior is

served as the benchmark.

Experiment 1

In the first simulation study, we take sample Xn of dimension n = 200 from the normal mean model

Xi ∼ N(θi, 1). In this case, we consider the number of nonzero elements to be sn = 10, 20, 40, 80 and the nonzero elements are fixed at values µ0 = 1, 3, 5, 7. We compute mean squared error (MSE) and mean absolute error (MAE) based on 200 replications as two measure of performance. We summarize the results in Table 2.1 and Table 2.2.

sn 10 20 40 80

µ0 1357135713571357 EBMW 10 54 21 13 20 96 35 25 40 152 61 49 79 234 108 96 EBKM 12 35 15 6 21 53 19 7 32 74 25 7 44 93 30 8 EBmed 10 43 22 14 20 69 36 28 37 103 66 54 64 157 127 107 SURE 14 42 45 43 23 68 69 69 42 104 105 105 74 152 153 153 Soft Thresholding 10 76 116 116 20 153 228 233 40 304 457 464 80 609 916 929 Hard Thresholding 13 61 21 12 24 122 39 23 45 237 75 42 87 473 149 82 FDR q = 0.01 10 75 25 11 20 143 40 22 41 253 67 44 81 434 112 85 FDR q = 0.1 11 55 24 20 23 93 39 37 44 141 67 64 88 208 113 111 FDR q = 0.4 22 63 53 51 36 94 82 80 64 134 119 117 119 175 161 161

Hard Thresh Prior 12 39 14 7 22 62 19 10 44 96 28 15 84 143 42 26 DP Prior 11 37 11 3 19 50 17 4 33 71 22 4 46 92 26 6

Oracle Prior 9 29 9 1 16 46 13 1 27 67 18 2 39 87 23 2

Table 2.1: MSE of Simulation Study 1 in Chapter 2. Errors of the best method are marked in bold.

17 sn 10 20 40 80

µ0 1357135713571357 EBMW 10 23 14 13 20 43 26 24 40 74 47 45 80 125 86 83 EBKM 27 35 22 19 40 48 25 20 60 62 29 21 82 76 32 22 EBmed 12 21 12 10 23 38 23 20 44 72 45 39 80 133 96 80 SURE 15 30 32 31 24 49 51 51 44 78 79 79 78 117 119 119 Soft Thresholding 10 27 33 33 20 54 65 65 40 109 130 130 80 217 260 261 Hard Thresholding 11 22 10 9 21 45 19 17 41 88 37 32 82 175 74 64 FDR q = 0.01 10 27 10 8 20 51 19 17 40 93 36 33 80 163 69 66 FDR q = 0.1 10 21 12 11 21 36 22 22 41 59 41 41 82 98 77 76 FDR q = 0.4 14 26 25 25 25 44 43 42 48 72 70 70 94 110 108 108

Hard Thresh Prior 11 20 9 6 21 33 14 10 41 56 24 18 80 93 42 35 DP Prior 23 31 18 14 36 42 20 16 61 57 25 17 87 72 27 19

Oracle Prior 9 29 9 1 16 46 13 1 27 67 18 2 39 87 23 2

Table 2.2: MAE of Simulation Study 1 in Chapter 2. Errors of the best method are marked in bold.

When µ0 is larger than one, our DP prior based method is the best one in terms of MSE since it is

easier to figure out the cluster centers of θi’s and estimate the nonzero θi’s. Besides, when sn is large,

our DP prior based method has the best performance among all since the more nonzero θi’s we have, the

easier we could estimate the cluster centers. When both µ0 and θi are small, which is a tough case since high dimensional noise might make it much more difficult to detect weak signals, our method still has the comparable performance among all methods. Hard Thresh prior based prior has better performance when measured by MAE than that in MSE. Other methods such as EBKM and EBmed also have comparable good performance. Hard Thresh prior based prior is the best method in terms of MAE. In terms of MSE,

DP prior based method has similar performance as the oracle.

Experiment 2

In the second simulation study we increase the data dimension to 500. We are interested in the case when sn = 25, 50, 100 and all nonzero elements are fixed at µ0 = 3, 4, 5. This is a more challenging case since the signal is not very strong. MSE and MAE are shown in Table 2.3 and Table 2.4 respectively.

18 sn 25 50 100

µ0 3 4 5 3 4 5 3 4 5

EBMW 138 99 52 233 156 90 385 248 153

EBKM 81 57 28 120 80 41 174 114 52

EBMed 107 78 50 165 124 91 254 208 163

SURE 102 106 105 163 168 170 257 260 259

ST 201 289 327 403 577 658 806 1155 1308

HT 172 143 64 341 282 129 680 563 251

FDR q = 0.01 192 151 62 356 241 101 639 385 164

FDR q = 0.1 139 87 55 224 135 99 355 213 167

FDR q = 0.4 150 133 126 229 206 201 332 302 294

Hard Thresh Prior 97 59 30 161 88 45 262 133 66

DP Prior 80 55 25 119 79 35 171 109 49

Oracle Prior 73 49 24 115 74 34 167 107 44

Table 2.3: MSE of Simulation Study 2 in Chapter 2. Errors of the best method are marked in bold.

sn 25 50 100

µ0 3 4 5 3 4 5 3 4 5

EBMW 58 47 36 105 82 66 186 143 118

EBKM 75 54 38 102 67 45 139 84 50

EBMed 49 37 29 88 69 58 176 135 113

SURE 73 76 75 120 124 124 194 196 196

ST 70 83 87 141 166 175 281 332 349

HT 62 44 26 123 86 52 245 172 102

FDR q = 0.01 67 46 26 127 77 48 233 134 90

FDR q = 0.1 52 34 29 87 61 56 148 111 103

FDR q = 0.4 63 61 60 107 106 106 179 175 174

Hard Thresh Prior 48 31 20 84 51 33 147 84 56

DP Prior 60 42 29 93 58 34 128 74 43

Oracle Prior 49 25 9 76 37 13 111 53 18

Table 2.4: MAE of Simulation Study 2 in Chapter 2. Errors of the best method are marked in bold.

In terms of MSE, our DP prior based method and Hard Thresh prior based method are consistently

19 the best methods among all the different configurations. The performance of EBKM is comparable. One drawback of EBKM is that the computational cost is high since solving a high dimensional optimization problem is needed. However for DP prior and Hard Thresh prior, we use a computationally efficient varia- tional inference algorithm, which dramatically saves the computational time. In terms of MSE, DP method has comparable performance as the oracle.

Experiment 3

In the third simulation study we maintain all the features of the second experiment except that nonzero elements of the true mean vector are now generated from standard Gaussian distribution centered at the original values in simulation study 2. All different θi’s form a data cloud which is centered at the distinct nonzero values in simulation study 2.

sn 25 50 100

µ0 3 4 5 3 4 5 3 4 5

EBMW 110 91 62 186 150 103 308 245 176

EBKM 77 68 49 121 105 77 185 163 124

EBMed 93 79 59 151 130 100 236 212 176

SURE 96 104 106 155 165 168 243 257 260

ST 196 269 313 389 534 626 780 1075 1251

HT 135 123 79 263 237 155 527 479 311

FDR q = 0.01 151 129 79 268 216 133 479 368 221

FDR q = 0.1 112 90 66 187 148 110 300 235 187

FDR q = 0.4 138 137 129 216 213 201 320 307 301

Hard Thresh Prior 81 66 46 138 107 74 237 176 120

DP Prior 76 68 53 120 110 84 189 164 123

Oracle Prior 67 58 41 111 94 69 177 152 112

Table 2.5: MSE of Simulation Study 3 in Chapter 2. Errors of the best method are marked in bold.

20 sn 25 50 100

µ0 3 4 5 3 4 5 3 4 5

EBMW 51 46 38 92 82 69 164 145 126

EBKM 75 63 52 108 91 74 160 135 111

EBMed 44 38 31 79 70 59 147 132 115

SURE 69 74 76 114 122 124 187 195 197

ST 67 79 86 133 158 171 266 317 342

HT 51 43 31 100 84 61 200 169 121

FDR q = 0.01 55 44 31 101 80 57 188 144 104

FDR q = 0.1 45 37 31 81 67 58 143 120 108

FDR q = 0.4 60 62 60 105 107 105 176 175 176

Hard Thresh Prior 42 35 27 75 61 48 135 107 85

DP Prior 59 54 44 97 84 69 156 127 102

Oracle Prior 51 39 27 87 66 48 143 112 85

Table 2.6: MAE of Simulation Study 3 in Chapter 2. Errors of the best method are marked in bold.

From Table 2.5 and Table 2.6, we conclude Hard Thresh prior based method has the best performance.

Even though our DP prior based method is not as good as EBKM and EBMed in some configurations, our DP method has quite similar performance even it is more difficult to estimate the clustering centers in simulation study 3 than simulation study 2. Hard Thresh based prior has better performance than DP prior based method.

Experiment 4

In the fourth example, we consider 1000-dimensional vector estimation, with the first 10 entries of θ∗ equal

10, the next 90 entries equal A, and the remaining 900 entries equal 0. We consider a range of A, from

A = 2 to A = 7. Average MSE and MAE are recorded in Table 7 and Table 8.

21 A 2 3 4 5 6 7

EBMW 320 421 291 175 134 125

EBKM 206 223 150 79 46 35

EBMed 337 350 240 173 148 138

SURE 278 327 333 337 336 334

ST 504 894 1259 1436 1475 1483

HT 375 671 610 301 134 102

FDR q = 0.01 375 633 441 199 121 111

FDR q = 0.1 373 421 259 194 185 181

FDR q = 0.4 440 451 404 403 399 396

Hard Thresh Prior 312 326 179 90 62 52

DP Prior 204 220 161 205 151 85

Oracle Prior 195 213 142 67 34 28

Table 2.7: MSE of Simulation Study 4 in Chapter 2. Errors of the best method are marked in bold.

A 2 3 4 5 6 7

EBMW 179 199 159 130 122 120

EBKM 225 182 115 79 61 61

EBMed 176 161 127 110 101 97

SURE 209 241 245 248 246 247

ST 216 294 347 367 371 372

HT 189 242 184 110 85 80

FDR q = 0.01 189 232 147 96 85 83

FDR q = 0.1 188 168 120 110 109 108

FDR q = 0.4 217 214 208 212 210 209

Hard Thresh Prior 170 164 102 64 56 54

DP Prior 215 161 91 89 75 57

Oracle Prior 195 142 70 27 14 17

Table 2.8: MAE of Simulation Study 4 in Chapter 2. Errors of the best method are marked in bold.

In some configurations, Hard Thresh prior based method and DP prior based method have the best

22 performance among all the state-of-the-art methods. Although in some cases EBKM and EBMed performs

better, MSE and MAE of our DP prior and Hard Thresh prior based methodsare still comparable. DP prior

based method has the best performance when there exists clusters with small values while Hard Thresh prior

based method has the best performance when there is no cluster which is centered at small values. The

reason why DP prior based method does not have good performance when A is large is that DP prior based

method may merge two clusters into only one cluster, which results in larger error.

2.5 Conclusions and Discussions

The empirical Bayes method introduced in Martin and Walker (2014) does not perform well in our simulation studies, since a prior centered at Xi may introduce noise into the posterior estimation, whereas our nonparametric Bayes clustering method reduces that noise while capturing the general pattern: the estimate of θi is adaptively shrunk toward the mean of nearby data points. Therefore our DP method outperforms the empirical Bayes estimator based on Martin and Walker (2014).

Theoretical results show that, with a proper choice of the parameters in the prior, our posterior mean estimator achieves the asymptotic minimax rate. For the implementation we propose a fast variational inference method to approximate the posterior distribution, which is more efficient in computation time than the empirical Bayes method proposed by Koenker (2014). To our knowledge, this is the first work connecting high dimensional sparse vector estimation with clustering. Our nonparametric Bayesian estimator could be applied to high dimensional classification, feature selection, hypothesis testing and nonparametric function estimation. A possible extension in implementation is to use other well-studied clustering methods to estimate the prior and compare their performance.

Chapter 3

An Empirical Bayes Approach for High Dimensional Classification

3.1 Introduction

Nowadays high dimensional classification is ubiquitous in many application areas, such as micro-array data

analysis in bioinformatics, document classification in information retrieval, and portfolio analysis in finance.

In this chapter, we consider the problem of constructing a linear classifier with high-dimensional features.

Suppose data from class k are generated from a p-dimensional multivariate Normal distribution Np(µk, Σ)

where k = 1, 2 and the prior proportions for two classes are πk, k = 1, 2 respectively. It is well-known that the optimal classification rule, i.e., the Bayes rule, classifies a new observation X to class 1 if and only if

δOPT(X) = (X − µ)^t Σ^{-1} (µ1 − µ2) > log(π2/π1),    (3.1)

where µ = (µ1 + µ2)/2. For simplicity, we assume both the prior proportions πk and the sample proportions of the two classes are equal, but our theory could easily be extended to the case where the two classes have unequal sample sizes with a ratio bounded between 0 and 1. Therefore (3.1) can be simplified as: we classify X to class 1 if and only if

δOPT(X) = (X − µ)^t Σ^{-1} (µ1 − µ2) > 0.    (3.2)

Since the parameters θ = (µ1, µ2, Σ) are unknown and we are given a set of random samples {Xki : i = 1, . . . , n; k = 1, 2}, we can estimate those unknown parameters and classify X to class 1 if

δˆ(X) = (X − µˆ)^t Σˆ^{-1} (µˆ1 − µˆ2) > 0,

where µˆ = \bar{X}_{··} is the overall mean of the data, µˆ1 = \bar{X}_{1·} and µˆ2 = \bar{X}_{2·} are the sample means of the two classes, and Σˆ = \frac{1}{2n−2}\sum_k\sum_i (X_{ki} − \bar{X}_{k·})(X_{ki} − \bar{X}_{k·})^t is the pooled estimator of the covariance matrix. This is also known as linear discriminant analysis (LDA).

LDA, however, does not perform well when p is much larger than n. Bickel and Levina (2004) have

shown that when the number of features p grows faster than the sample size n, LDA is asymptotically

as bad as random guessing due to the large bias of Σˆ in terms of the spectral norm. RDA by Friedman

(1989), thresholded covariance matrix estimator by Bickel and Levina (2008) and Sparse LDA by Shao et al.

(2011) use regularization to improve the estimation of Σ by assuming sparsity on off-diagonal elements of

the covariance matrix. Cai and Liu (2012) assume Σ^{-1}(µ1 − µ2) is sparse and proposed LPD based on a sparse estimator of Σ^{-1}(µ1 − µ2). A seemingly extreme alternative is to set all the off-diagonal elements of Σˆ to zero, i.e., ignore the correlation among the p features, and use the following Independence Rule:

δˆI(X) = (X − µˆ)^t Dˆ^{-1/2} dˆ,    (3.3)

where Dˆ = diag(Σˆ) and dˆ = Dˆ^{-1/2}(µˆ1 − µˆ2). Here dˆ is the normalized sample mean difference and the corresponding population-level parameter is d = D^{-1/2}(µ1 − µ2). Theoretical studies such as Domingos and Pazzani (1997) and Bickel and Levina (2004) have shown that the worst-case misclassification error of the Independence Rule is

well controlled and ignoring the correlation structure of Σ doesn’t lose much if the correlation matrix is well

conditioned.

To achieve good classification performance in a high-dimensional setting, it is not enough to regularize

just the covariance matrix. As pointed out by Fan and Fan (2008) and Shao et al. (2011), even if we use the Independence Rule, the classification performance of δˆI could still be as bad as random guessing due to the error accumulation through all p dimensions of d. In Theorem 1 of Fan and Fan (2008), essentially using dˆ

to estimate d results in a strong condition that signal strength needs to grow linearly with the dimension p.

Estimating a sparse d is a key step in constructing a classifier with good performance. Instead of using the sample

mean difference dˆ, other shrinkage and thresholding estimators could be naturally applied. In general we

denote the shrinkage estimator of d as ηˆ. Independence Rule (3.3) using ηˆ is

δˆηˆ(X) = (X − µˆ)^t Dˆ^{-1/2} ηˆ.    (3.4)

As in (3.3), µˆ is the sample mean across the two groups and Dˆ is the pooled estimator of the variances. We call δˆηˆ(X) the Independence Rule induced by ηˆ. In this chapter we focus on how the estimation error of d incurred by ηˆ relates to the classification error.
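To make the two-step construction concrete, the following R sketch (our own illustration, not the code in the accompanying package) builds an Independence Rule induced by a generic shrinkage estimator of the normalized mean difference; the names independence_rule and hard_threshold are illustrative.

# X1, X2: n1 x p and n2 x p training matrices for class 1 and class 2.
# shrink: any regularized estimator applied to the normalized mean difference.
independence_rule <- function(X1, X2, shrink = identity) {
  n1 <- nrow(X1); n2 <- nrow(X2)
  mu1 <- colMeans(X1); mu2 <- colMeans(X2)
  mu  <- (mu1 + mu2) / 2                       # overall center (equal class sizes assumed)
  S1 <- colSums(sweep(X1, 2, mu1)^2)
  S2 <- colSums(sweep(X2, 2, mu2)^2)
  Dhat <- (S1 + S2) / (n1 + n2 - 2)            # pooled diagonal variance estimate
  dhat <- (mu1 - mu2) / sqrt(Dhat)             # normalized sample mean difference
  eta  <- shrink(dhat)                         # plug in any shrinkage/thresholding estimator
  # classify x to class 1 iff (x - mu)^t D^{-1/2} eta > 0
  function(x) as.numeric(crossprod((x - mu) / sqrt(Dhat), eta) > 0)
}

# Example shrinkage map: simple hard thresholding with threshold b
hard_threshold <- function(b) function(d) d * (abs(d) > b)
# clf <- independence_rule(X1, X2, shrink = hard_threshold(b = 0.5))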

Papers including Fan and Fan (2008), Shao et al. (2011), Greenshtein and Park (2009) and Dicker and Zhao (2016) proposed various classification rules. These rules differ because the underlying regularized estimators ηˆ differ.

Both the signal strength and the estimation error are important to determine the performance of the

classification rule. In theory, we have 3 major conclusions:

• When the signal strength converges to a constant, the classification error will converge to the error of

the optimal rule as long as the estimation error converges to 0.

• When the signal strength goes to infinity, the classification error will converge to 0 as long as the

growth rate of the signal strength is faster than the estimation error.

• When the signal strength goes to infinity, in order to have the same convergence rate as the classification

error of the optimal rule, the product of the signal strength and the estimation error should converge

to 0.

Since estimating a sparse d is equivalent to estimating a sparse high dimensional sequence, empirical Bayes methods can be used to construct a regularized estimator. In the previous literature, Greenshtein and Park (2009) proposed an empirical Bayes classifier inspired by the empirical Bayes estimator in Brown and Greenshtein (2009b) (denoted as EB); Dicker and Zhao (2016) proposed a classifier based on Koenker and Mizera (2014)'s nonparametric MLE. In Chapter 2 we introduced an empirical Bayes estimator based on the Dirichlet process mixture (denoted as DP). Based on our previous work, in this chapter we propose two empirical Bayes classifiers based on our DP estimator and its sparse variant. Compared with Greenshtein and Park (2009), we establish explicitly the theoretical connection between the classification error of (3.3) and the L2 estimation error of ηˆ. In particular, we provide sufficient conditions for an estimator ηˆ to achieve optimal classification accuracy asymptotically, i.e., for the resulting Independence Rule to be asymptotically as good as the Bayes rule (3.2).

The rest of this chapter is organized as follows: in Section 3.2 we establish the relationship between the estimation error and the classification error. In Section 3.3, we study theoretical properties of DP classifier.

In Section 3.4 we introduce a sparse variant of DP classifier. In Section 3.5 we present the empirical results.

Conclusions and future work are in Section 3.6.

3.2 Relationship between the Estimation Error and the

Classification Error

We regard the construction of a linear classifier as a two-step procedure. First we calculate the sample mean difference dˆ and propose an estimator ηˆ based on dˆ. Second we compute the classifier δˆηˆ: we classify X to class 1 if and only if (X − µˆ)^t Dˆ^{-1/2} ηˆ > 0. We call δˆηˆ the Independence Rule induced by ηˆ.

We use the 0-1 loss function to evaluate a linear classifier. Without loss of generality, we assume the new observation X comes from class 1, due to the symmetry of our rule. Let the training data used to construct δˆηˆ be denoted by X. The classification error of δˆηˆ (given the data) is

W(δˆηˆ, θ) = P(δˆηˆ(X) ≤ 0 | X) = Φ(−Ψ),    (3.5)

where

Ψ = \frac{(µ1 − µˆ)^t Dˆ^{-1/2} ηˆ}{\sqrt{ηˆ^t Dˆ^{-1/2} Σ Dˆ^{-1/2} ηˆ}}.

Φ(·) is the standard normal cumulative distribution function. To measure the signal strength of the classification problem, the Mahalanobis distance, denoted Cp, is used:

Cp = (µ1 − µ2)^t D^{-1} (µ1 − µ2).

Cp measures the signal strength, which reflects the difficulty of the classification task: if Cp is small, it is difficult to differentiate the two classes, and vice versa. We use essentially the same definition of signal strength as Fan and Fan (2008).
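As a small concrete check of these definitions, the following R sketch (our own illustration; the function name is hypothetical) evaluates the conditional classification error Φ(−Ψ) of (3.5) and the signal strength Cp from the true parameters and a fitted rule.

# True parameters: mu1, mu2, Sigma. Fitted quantities: mu_hat, D_hat (diagonal), eta_hat.
theoretical_error <- function(mu1, mu2, Sigma, mu_hat, D_hat, eta_hat) {
  a   <- eta_hat / sqrt(D_hat)                  # D_hat^{-1/2} eta_hat
  num <- sum((mu1 - mu_hat) * a)                # (mu1 - mu_hat)^t D_hat^{-1/2} eta_hat
  den <- sqrt(drop(t(a) %*% Sigma %*% a))       # sqrt(eta^t D^{-1/2} Sigma D^{-1/2} eta)
  pnorm(-num / den)                             # Phi(-Psi)
}

# Signal strength with D = diag(Sigma):
# Cp <- sum((mu1 - mu2)^2 / diag(Sigma))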

We aim to find a linear classifier such that the performance is as good as the optimal rule asymptotically.

The classification error of an admissible classifier should be approximately equal to the classification error of the optimal Bayes rule. We define the asymptotic optimality and sub-optimality of a classifier in terms of the classification error, similarly to the definitions in Shao et al. (2011).

Definition 3.1. δˆ is asymptotically optimal if W(δˆ, θ)/W(δOPT, θ) →p 1.

Definition 3.2. δˆ is asymptotically sub-optimal if W(δˆ, θ) − W(δOPT, θ) →p 0.

Note that asymptotic optimality is equivalent to asymptotic sub-optimality when W (δOPT ) → c > 0.

Besides, asymptotic optimality implies asymptotic sub-optimality when W(δOPT) → 0. W(δˆηˆ, θ) is intrinsically related to the estimation error of ηˆ. In the following subsections, we first focus on the case where the covariance matrix Σ is known and equal to the identity, where the L2 error result in Chapter 2 can be applied directly. General results for unknown Σ are provided in the later subsection.

3.2.1 Known Covariance Matrix Σ = Ip

In this subsection, in order to draw a close connection between estimation and classification without involving the estimation of an unknown covariance matrix, we assume Σ = Ip is known. The same conclusion holds when Σ is a known general covariance matrix, since we could transform the data to make the new

covariance matrix is an identity and re-parameterize all other parameters. The same sparsity assumption,

which will be introduced later, is needed for the transformed mean difference.

When Σ = Ip, the optimal rule classifies X to class 1 if δOPT(X) = (X − µ)^t(µ1 − µ2) > 0. The signal strength is Cp = ||µ1 − µ2||^2, with d = µ1 − µ2 and dˆ = µˆ1 − µˆ2. Denote the estimation error as εp = E||ηˆ − d||^2. The Independence Rule induced by ηˆ is δˆηˆ(X) = (X − µˆ)^t ηˆ. The classification error of δˆηˆ(X) is

W(δˆηˆ, θ) = Φ(−Ψ) = Φ(−(µ1 − µˆ)^t ηˆ / ||ηˆ||),

whereas the classification error of the optimal rule is

W(δOPT, θ) = Φ(−||µ1 − µ2||/2).

To analyze Ψ, first consider the denominator. By the triangle inequality, −||ηˆ − d|| + ||d|| ≤ ||ηˆ|| ≤ ||ηˆ − d|| + ||d||. Since ||ηˆ − d|| = Op(\sqrt{εp}), we have

||µ1 − µ2|| − Op(\sqrt{εp}) ≤ ||ηˆ|| ≤ ||µ1 − µ2|| + Op(\sqrt{εp}).

For the numerator, we have

(µ1 − µˆ)^t ηˆ = (1/2)[(µ1 − µ2)^t d + (µ1 − µ2)^t(ηˆ − d) + (µ1 − µˆ1)^t d + (µ2 − µˆ2)^t d + (µ1 + µ2 − µˆ1 − µˆ2)^t(ηˆ − d)]
               =: (1/2)[(µ1 − µ2)^t d + I1 + I2 + I3 + I4].

The leading term is (µ1 − µ2)^t d = ||µ1 − µ2||^2. For the remaining terms, we bound each one separately. |I1| ≤ Op(\sqrt{||µ1 − µ2||^2 εp}). I2 ∼ N(0, ||µ1 − µ2||^2/n) and I3 ∼ N(0, ||µ1 − µ2||^2/n). Therefore I2 = Op(\sqrt{Cp/n}) and I3 = Op(\sqrt{Cp/n}).

Next we bound I4. Denote e1 = µˆ 1 − µ1 ∼ Np(0, Ip/n) and e2 = µˆ 2 − µ2 ∼ Np(0, Ip/n). Therefore

µ1 + µ2 − µˆ 1 − µˆ 2 = −(e1 + e2) and ηˆ − d = ηˆ(µ1 − µ2 + (e1 − e2)) − (µ1 − µ2). Notice that e1 + e2 and

e1 −e2 are independent since e1 and e2 are independent and follow Np(0, Ip/n). Therefore µ1 +µ2 −µˆ 1 −µˆ 2

and ηˆ − d are independent. Hence EI4 = 0 and

Var I4 = E((µ1 + µ2 − µˆ1 − µˆ2)^t(ηˆ − d))^2 = \frac{2}{n} Op(εp).

Therefore I4 = Op(\sqrt{εp/n}). Asymptotically, we have

Ψ = \frac{Cp/2 − Op(\sqrt{Cp εp}) − Op(\sqrt{Cp/n}) − Op(\sqrt{εp/n})}{\sqrt{Cp} + Op(\sqrt{εp})}.

Therefore δˆηˆ is asymptotically sub-optimal if

εp = o(Cp).

Using Lemma 1 in Shao et al. (2011), let ξn = Cp/4 and

τn = \frac{Op(\sqrt{Cp εp}) + Op(\sqrt{Cp/n}) + Op(\sqrt{εp/n})}{Op(\sqrt{Cp εp}) + Cp}.

If

Cpεp → 0 and Cp = o(n);

we can easily verify that ξn → ∞, τn → 0 and τnξn → 0; therefore δˆηˆ is asymptotically optimal. To sum up, we have Theorem 3.1.

Theorem 3.1. Suppose εp = E||ηˆ − d||^2 and n, p → ∞. The classification error is

Φ\left(−\frac{Cp/2 − Op(\sqrt{Cp εp}) − Op(\sqrt{Cp/n}) − Op(\sqrt{εp/n})}{\sqrt{Cp} + Op(\sqrt{εp})}\right).

If εp = o(Cp), the classification rule is sub-optimal. If Cpεp → 0 and Cp = o(n), the classification rule is optimal.

3.2.2 Unknown Covariance Matrix

When the covariance matrix is unknown, we only estimate the diagonal elements of the covariance matrix. The features cannot be strongly correlated if the Independence Rule is to perform well. Let R = D^{-1/2} Σ D^{-1/2} be the correlation matrix and λ1 be a positive constant not growing with n and p. We assume λ1^{-1} ≤ λmin(R) ≤ λmax(R) ≤ λ1 and min_{1≤i≤p} σii > 0.

When the covariance matrix is unknown, we first define the signal part and the noise part. Without loss of generality, S = {1, 2, ..., sp} is the non-zero index set while S^c = {sp + 1, sp + 2, ..., p} is the zero index set of d. Write d = (d1^t, 0^t)^t and ηˆ = (ηˆ1^t, ηˆ2^t)^t, where ηˆ1 and d1 are sp-dimensional and ηˆ2 is (p − sp)-dimensional. εp is defined as the error of estimating the nonzero elements of d: εp = E||ηˆ1 − d1||^2. We assume the following two conditions on dˆ:

Condition 3.1. If |dˆi| ≤ bn, then ηˆi = 0.

Condition 3.2. sup_{i∈S^c} E(ηˆi^4) < ∞.

Recall that E||ηˆ − η||^2 = εp + E||ηˆ2||^2. Condition 3.1 indicates that ηˆi is a thresholded estimator, while Condition 3.2 implies that the tail of ηˆi cannot be too heavy for the zero elements. Conditions 3.1 and 3.2 are used to control E||ηˆ2||^2. The signal strength is still defined as Cp = (µ1 − µ2)^t D^{-1}(µ1 − µ2). Theorem 3.2 shows that W(δˆηˆ, θ) is asymptotically close to W(δOPT, θ) as both n and p diverge, under growth rate constraints among Cp, sp and εp.

Theorem 3.2. Suppose ηˆ satisfies Conditions 3.1 and 3.2 and δˆηˆ is the classification rule induced by ηˆ. We assume n → ∞, p → ∞ and log(p − sp)/n ≺ bn^2 ≺ 1. We have

W(δˆηˆ, θ) ≤ Φ\left(−\frac{Cp/2 − Op(\sqrt{sp εp/n}) − Op(\sqrt{εp Cp}) − Op(\sqrt{Cp/n})}{\sqrt{λ1}\,(\sqrt{Cp} + Op(\sqrt{εp}))}(1 + op(1))\right).

Furthermore, if εp = o(min(Cp, nCp^2/sp)) and nCp → ∞, then

W(δˆηˆ, θ) ≤ Φ\left(−\frac{\sqrt{Cp}}{2\sqrt{λ1}}(1 + op(1))\right).

If Cp → c < ∞, δˆηˆ is asymptotically optimal; if Cp → ∞, δˆηˆ is asymptotically sub-optimal.

It is not necessary to assume sp ≺ n in Theorem 3.2: the number of relevant features could be larger than the number of observations. Theorem 3.2 reveals the relationship between the estimation error εp and the classification error W(δˆηˆ, θ) explicitly. The bad performance of the Independence Rule in Theorem 1 of Fan and Fan (2008) and of LDA in Theorem 2 of Shao et al. (2011) is due to simply using the sample mean difference to estimate η. In those theorems Cp needs to dominate \sqrt{p/n} to overcome the estimation loss. However, if we put sparsity assumptions on η and use a thresholded estimator satisfying Conditions 3.1 and 3.2, the condition on Cp can be relaxed: Cp should grow faster than max(εp, \sqrt{sp εp/n}), instead of growing faster than \sqrt{p/n}, to achieve asymptotic sub-optimality.

Meanwhile, we have two remarks based on Theorem 3.2.

Remark 3.1. log(p − sp)/n ≺ bn^2 ≺ 1 guarantees that the noisy features can be eliminated.

If i ∈ S^c, \sqrt{n/2}\,dˆi follows t_{2n−2}, the centered t distribution with 2n − 2 degrees of freedom, and max_{i∈S^c} |dˆi| = Op(\sqrt{log(p − sp)/n}); therefore log(p − sp)/n ≺ bn^2 guarantees that all noisy features are eliminated with probability tending to 1.

Remark 3.2. The simple hard thresholding estimator with threshold bn satisfies Conditions 3.1 and 3.2; in this special case εp has the same order as 2sp/n + Cp/(4n), and sp/n = o(Cp) guarantees εp = o(min(Cp, nCp^2/sp)).

The hard thresholding estimator is written as ηˆi = dˆi 1_{|dˆi| > bn}. Condition 3.1 is satisfied trivially. When the sample size n ≥ 4, since \sqrt{n/2}\,dˆi follows the t distribution with 2n − 2 degrees of freedom for i ∈ S^c, Condition 3.2 is satisfied. When i ∈ S, recall that \sqrt{n/2}\,dˆi follows the non-central t distribution with 2n − 2 degrees of freedom and noncentrality parameter \sqrt{n/2}\,di. By a gamma function approximation we have

εp = \sum_{i∈S} E(dˆi − di)^2 = \sum_{i∈S} [E(dˆi − Edˆi)^2 + (Edˆi − di)^2]
   = \frac{2(n−1)}{n(n−2)} sp + \Big(\frac{n−1}{n−2} − 2\sqrt{n−1}\,\frac{Γ(n−3/2)}{Γ(n−1)} + 1\Big) \sum_{i∈S} di^2
   = \frac{2(n−1)}{n(n−2)} sp + \frac{n}{2(n−2)(2n−3)} Cp + O(Cp/n) ∼ 2sp/n + Cp/(4n).

Thus sp/n = o(Cp) guarantees εp = o(min(Cp, nCp^2/sp)). Therefore, if the simple hard thresholding estimator is chosen, the conditions we need are log(p − sp)/n ≺ bn^2 ≺ 1, sp/n = o(Cp) and nCp → ∞. Recall that in previous works Cp needs to dominate \sqrt{p/n} to obtain asymptotic optimality. In many applications sp ≺ p, therefore the condition on Cp is weakened if we use the simple hard thresholding estimator. Naturally, the estimation accuracy of η could be improved further by using other sparse estimators.

One interesting question is, when Cp → ∞, what conditions are needed to guarantee optimality. Theorem 3.3 provides an answer.

Theorem 3.3. Suppose ηˆ satisfies Conditions 3.1 and 3.2 and δˆηˆ is the classification rule induced by ηˆ. If εp = o(min(1/Cp, n/sp)), log p = o(n), log(p − sp)/n = o(bn^2) and bn → 0, then δˆηˆ is asymptotically optimal as n → ∞, p → ∞ and Cp → ∞.

We need a slower growth rate of Cp in Theorem 3.3. If Cp diverges to infinity too fast, the classification task is relatively easy, but W(δOPT, θ) converges to 0 faster than W(δˆ, θ), and therefore our classification rule is not optimal. However, if Cp diverges to infinity slowly, the convergence rates of W(δOPT, θ) and W(δˆ, θ) are comparable, and we can prove that the ratio of the two converges to 1. For the simple thresholded estimator, we have the following remark:

Remark 3.3. For the simple hard thresholding estimator, sp = o(\sqrt{n}) and Cp = o(\sqrt{n}) guarantee asymptotic optimality.

If ηˆi = dˆi 1_{|dˆi| > bn}, we have εp ∼ 2sp/n + Cp/(4n). Therefore if sp = o(\sqrt{n}) and Cp = o(\sqrt{n}), we have εp = o(min(1/Cp, n/sp)). Theorem 3 in Shao et al. (2011) puts stronger conditions on sp and Cp than ours when the L0 sparsity measure is used.

Any good estimator of the sparse mean difference should have a small estimation error, i.e., a slow growth rate of εp. Our previous work has shown that DP estimators have estimation error E||ηˆ − d||^2 ∼ sp log(p/sp)/n, which is asymptotically optimal in the minimax sense. We apply the DP estimator in the first step and construct the corresponding classification rule. The gain in estimation accuracy in the first step results in better classification performance. This classifier is denoted as the DP classifier in the following sections.

3.3 Theory for DP Classifier

Denote each element of dˆ = Dˆ^{-1/2}(µˆ1 − µˆ2) as dˆi. For each dˆi, we have

dˆi =_d \frac{di + Z}{\sqrt{V_{2n−2}/(2n − 2)}},

where Z ∼ N(0, 2/n), V_{2n−2} ∼ χ^2_{2n−2}, and Z and V_{2n−2} are independent, so dˆi follows a (scaled) non-central t distribution. Therefore, as the sample size goes to infinity, it is reasonable to assume

dˆj ∼ N(dj, 2/n),    ηj ∼ G,    j = 1, 2, ..., p,

where G is an unknown prior. Let ηˆ = ηˆ(dˆ) be an empirical Bayes estimator of d.
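A quick simulation sketch (our own illustration, not part of the thesis code) of the representation above suggests why this normal working model is reasonable for moderate n:

set.seed(1)
n <- 50; d_j <- 1
Z <- rnorm(1e4, 0, sqrt(2 / n))                   # Z ~ N(0, 2/n)
V <- rchisq(1e4, df = 2 * n - 2)                  # V_{2n-2} ~ chi^2_{2n-2}
d_hat <- (d_j + Z) / sqrt(V / (2 * n - 2))        # the scaled noncentral-t statistic
c(mean(d_hat), var(d_hat), 2 / n)                 # sample mean/variance vs. N(d_j, 2/n)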

In this subsection we assume d is an sp-sparse vector. In Chapter 2, the DP prior based method was proposed

with the estimation error εp ∼ sp log(p/sp)/n. This method could be directly applied in classification problems. To get the desired estimation error, we need to assume all entries of the mean difference vector

are independent. Therefore in the comparison of different classifiers in theory, we will focus on the case where

Σ = Ip. However in the implementation part we will show when the entries are correlated, our method still has good performance. First we rewrite Theorem 3.2 and 3.3 in terms of the above empirical Bayes classifier:

Theorem 3.4. Suppose Σ = Ip and the DP classifier is used. If sp log(p/sp) = o(nCp), the classification rule is sub-optimal. If Cp sp log(p/sp) = o(n), the classification rule is optimal.

The conclusions provided in the above subsections have close connection with Shao et al. (2011) and Fan and Fan (2008). First we focus on the scenario where Cp → c > 0; then we move to the case where Cp → ∞.

When Cp → c > 0, asymptotic optimality is equivalent to asymptotic sub-optimality. Since Cp → c makes the classification task more difficult, the conditions on estimation accuracy needed to guarantee asymptotic optimality are stricter. From the proof we can easily see that E||ηˆ − d||^2/n → 0 is the sufficient and necessary condition for asymptotic optimality. For the DP classifier, sp log p/n → 0 is needed to guarantee that the classification error of our method converges to the same limit as that of the optimal rule; that is, asymptotic optimality requires sp log p/n → 0. This is better than the Independence Rule summarized in Fan and Fan (2008), whose asymptotic optimality requires p/n → 0. FAIR, introduced in Fan and Fan (2008), uses feature selection before applying the Independence Rule; when Cp → c, sp log p/n → 0 intrinsically implies its asymptotic optimality. Sparse LDA proposed in Shao et al. (2011) still requires sp log p/n → 0 in the estimation of the mean difference vector. Table 3.1 summarizes the conditions.

Method Condition on Estimation Accuracy

DP Classifier sp log p/n → 0

Independence Rule p/n → 0

FAIR sp log p/n → 0

Sparse LDA sp log p/n → 0

Table 3.1: Condition Summary of Asymptotic Optimality for different classifiers in Chapter 3 when Cp → c > 0

When Cp goes to infinity, asymptotic optimality is not equivalent to asymptotic sub-optimality. The condition on the estimation error is relaxed since the classification task is easier. First we focus on asymptotic sub-optimality. For the DP classifier ηˆ, Cp ≻ sp log p/n guarantees that the classification error goes to 0. For the Independence Rule and LDA with known covariance matrix, Cp ≻ \sqrt{p/n} guarantees asymptotic sub-optimality. For FAIR, Cp must be greater than sp log p/n to guarantee asymptotic sub-optimality. A similar conclusion holds for Sparse LDA. A summary of the conditions is shown in Table 3.2.

Method Condition on Estimation Accuracy

Independence Rule Cp ≻ \sqrt{p/n}

LDA with known covariance matrix Cp ≻ \sqrt{p/n}

FAIR with Σ = Ip known Cp ≻ sp log p/n

Sparse LDA Cp ≻ sp log p/n

IR induced by ηˆ with Σ = Ip known Cp ≻ sp log p/n

Table 3.2: Conditions to Guarantee Asymptotic Sub-optimality for different classifiers in Chapter 3 if Cp → ∞

Shao et al. (2011) introduced asymptotic optimality for the first time. Our DP classifier ηˆ is asymptotically optimal if Cp grows slower than n/(sp log p). For Sparse LDA, similarly Cp ≺ n/(sp log p) is needed for asymptotic optimality. The summary of the different conclusions is in Table 3.3.

Method Condition

IR induced by ηˆ with Σ = Ip known Cp ≺ n/(sp log p)

Sparse LDA Cp ≺ n/(sp log p)

Table 3.3: Conditions to Guarantee Asymptotic Optimality for different classifiers in Chapter 3 if Cp → ∞

3.4 Implementation for Empirical Bayes Classifier

Given dˆ, in this section we build an empirical Bayes model with a Dirichlet process prior to estimate η. We assume dˆj ∼ N(ηj, 1) and ηj ∼ G, where the prior G is unknown. Since most of the ηj's are zero, the dˆj's will concentrate around zero, forming a large cluster at 0 and several other clusters far away from 0. The Dirichlet process mixture model is one of the Bayesian tools for capturing such clustering behavior (see Lo (1984)). We build a hierarchical Bayes model and assume G ∼ DP(α, G0), where α is the concentration parameter and G0 is the base measure. To guarantee that G has a positive mass at 0, we model G0 as a normal distribution with a point mass at 0, that is, G0 = w0 δ0 + (1 − w0) N(0, σ0^2), where w0 and σ0^2 are two pre-specified parameters and δ0 is the Dirac measure at 0. We apply the same data generating process as in Section 2.7 and the same variational algorithm as in Section 2.4. The resulting posterior mean estimator for the kth normalized mean difference ηk is ηˆk^{DP} = \sum_{t=1}^{T} wˆ_{kt} mˆ_t. The linear classification rule induced by ηˆDP = (ηˆ1^{DP}, ..., ηˆp^{DP})^t is: we classify X to class 1 if and only if

δˆDP(X) = (X − µˆ)^t Dˆ^{-1/2} ηˆDP > 0.

We refer to ηˆDP as the DP estimator and the corresponding classifier as the Dirichlet process classifier (DP classifier).

Additional sparsity can be introduced to the DP estimator. Since we use the posterior mean as the estimator, the resulting ηˆ is a shrinkage estimator of the true mean difference but not necessarily sparse. To have better performance in the high dimensional, extremely sparse case, we revise the original DP estimator by thresholding on the posterior probability of being 0: if the posterior probability P(ηk = 0 | X) > κ, then ηˆk^{SDP} = 0; otherwise ηˆk^{SDP} = ηˆk^{DP}, where κ is a tuning parameter that could be determined by cross validation. In all the simulation studies we fix κ = 0.5, since choosing the threshold 0.5 is equivalent to taking a MAP estimator of the index set of zeros. We refer to ηˆSDP = (ηˆ1^{SDP}, ..., ηˆp^{SDP})^t as the Sparse DP estimator and the resulting linear classifier δˆSDP(X) = (X − µˆ)^t Dˆ^{-1/2} ηˆSDP as the Sparse DP classifier. The Sparse DP estimator is a thresholded estimator whereas the DP estimator is not; therefore the Sparse DP estimator satisfies Condition 3.1.

Sparse DP estimator could eliminate noise of irrelevant features to enhance classification performance.
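The following R sketch (hypothetical variable names; the point mass of the base measure is assumed to be the first mixture component) shows how the DP and Sparse DP estimators are assembled from a fitted variational approximation.

# m_hat[t]   : cluster centers, with m_hat[1] = 0 assumed to be the point mass at 0
# w_hat[k, t]: posterior mixing weights for coordinate k
dp_estimators <- function(w_hat, m_hat, kappa = 0.5) {
  eta_dp  <- as.vector(w_hat %*% m_hat)          # posterior mean: sum_t w_kt * m_t
  p_zero  <- w_hat[, 1]                          # posterior probability P(eta_k = 0 | data)
  eta_sdp <- ifelse(p_zero > kappa, 0, eta_dp)   # Sparse DP: set to 0 when P(eta_k = 0) > kappa
  list(DP = eta_dp, SparseDP = eta_sdp)
}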

One practical issue with both the DP and Sparse DP estimators is that, when η is extremely sparse, we might occasionally end up with the MAP estimator Gˆ = δ0. This is due to the "rich get richer" property of the Dirichlet process prior. A remedy in this extreme case is to randomly and equally divide all p sample mean differences into

I folds. For each fold of data we use the Dirichlet process mixture model to estimate a discrete prior Gˆi. Then we average all the discrete priors to get an overall estimate Gˆ = \sum_{i=1}^{I} Gˆi / I. We use this refinement to estimate η for both the DP estimator and the Sparse DP estimator. The rationale behind this "batch" processing idea is that, when we divide the elements of a high-dimensional vector into several batches, not only do the relatively large elements pop out (because the maximum of the noise decreases with a smaller sample size), but also the probability that all the Gˆi's equal δ0 is extremely small. The chance of detecting signals is increased. This refinement naturally leads to a parallelized variational Bayes algorithm: we can run our algorithm

for every batch and then average the estimated prior.
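A minimal sketch of the batch refinement, assuming a routine fit_dp_prior() (a stand-in for the variational algorithm of Chapter 2) that returns a discrete prior as a data frame with columns atom and weight:

batch_prior <- function(d_hat, I = 5, fit_dp_prior) {
  # randomly and equally split the p coordinates into I folds
  folds  <- split(sample(seq_along(d_hat)), rep_len(seq_len(I), length(d_hat)))
  priors <- lapply(folds, function(idx) fit_dp_prior(d_hat[idx]))
  G <- do.call(rbind, priors)
  G$weight <- G$weight / I                     # average of the I estimated discrete priors
  G
}
# Each fold can be fit in parallel, e.g. with parallel::mclapply in place of lapply.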

3.5 Empirical Studies

In this section, we conduct three simulation studies and apply our method to real data examples. The corresponding R package VBDP is available at https://github.com/yunboouyang/VBDP; it includes code to estimate a sparse Gaussian sequence and code to construct the DP and Sparse DP classifiers. A real data example is also included in this package. The source code and simulation results are available at https://github.com/yunboouyang/EBclassifier. The parameter specification is also summarized in the source code.

We also include a column "Hard Thresh DP" for comparison: the Hard Threshold DP classifier uses the same threshold as the Sparse DP classifier, but instead of using the posterior mean, it uses the sample mean difference to estimate d whenever the posterior probability at 0 is below the threshold. εp is large for the Hard Threshold DP classifier but small for the DP and Sparse DP classifiers, because only the last two methods apply shrinkage. The purpose of including the Hard Threshold DP classifier is to demonstrate the influence of εp on the classification error W(δˆ, θ): if εp is large, W(δˆ, θ) should be large. We do not recommend using the Hard Threshold DP classifier in practice.

3.5.1 Simulation Studies

We conducted three simulation studies. The first two are the same as in Greenshtein and Park (2009). In the third simulation study we compare our methods with Fan and Fan (2008) in the same setting.

Simulation Study 1. We assume Σ is diagonal. Without loss of generality, we set µ2 = 0 and µ1 ≠ 0. We use (∆, l, Cp) to denote different configurations of µ1: the first l coordinates of µ1 all equal ∆ while the remaining entries are either all 0 or sampled from N(0, 0.1^2). In the first simulation study, Σ = s^2 Ip, where s^2 = 25/2 and p = 10^4. The sample size of each class is n = 25. We compute the theoretical classification error using the true mean and the true covariance matrix. We repeat the procedure 100 times and the average theoretical classification errors are reported in Table 3.4 and Table 3.5, corresponding to the two configurations of µ1. Bold face in all tables indicates the lowest classification error in each row.

(∆, l, Cp) Hard Thresh DP Sparse DP DP EB IR FAIR glmnet

(1, 2000, 160) 0.0046 0.0003 0.0002 0.0004 0.0049 0.1211 0.4280

(1, 1000, 80) 0.0874 0.0454 0.0283 0.0428 0.0885 0.2393 0.4500

(1, 500, 40) 0.2423 0.2036 0.1858 0.2015 0.2435 0.3423 0.4750

(1.5, 300, 54) 0.1756 0.1303 0.1059 0.1160 0.1767 0.2222 0.4146

(2, 200, 64) 0.1362 0.0540 0.0412 0.0518 0.1372 0.1039 0.3046

(2.5, 100, 50) 0.1937 0.0449 0.0422 0.0585 0.1947 0.0852 0.2126

(3, 50, 36) 0.2652 0.0470 0.0677 0.0772 0.2665 0.0982 0.1498

(3.5, 50, 49) 0.1957 0.0066 0.0175 0.0152 0.1965 0.0229 0.0655

(4, 40, 51.2) 0.1883 0.0023 0.0059 0.0072 0.1901 0.0101 0.0332

Table 3.4: Classification error for simulation study 1, p = 10^4, p − l entries are 0

(∆, l, Cp) Hard Thresh DP Sparse DP DP EB IR FAIR glmnet

(1, 2000, 166.4) 0.0035 0.0002 0.0001 0.0003 0.0038 0.1128 0.4216

(1, 1000, 87.2) 0.0699 0.0395 0.0241 0.0352 0.0710 0.2311 0.4551

(1, 500, 47.6) 0.2046 0.1948 0.1686 0.1751 0.2063 0.3280 0.4783

(1.5, 300, 61.8) 0.1450 0.1173 0.0976 0.0996 0.1465 0.2075 0.4190

(2, 200, 71.8) 0.1102 0.0470 0.0372 0.0431 0.1113 0.1011 0.3158

(2.5, 100, 57.9) 0.1583 0.0392 0.0415 0.0488 0.1595 0.0815 0.1945

(3, 50, 44) 0.2248 0.0444 0.0674 0.0687 0.2265 0.0969 0.1692

(3.5, 50, 57) 0.1637 0.0065 0.0119 0.0146 0.1655 0.0226 0.0640

(4, 40, 59.2) 0.1539 0.0019 0.0056 0.0057 0.1551 0.0088 0.0324

Table 3.5: Classification error for simulation study 1, p = 10^4, p − l entries are generated from N(0, 0.1^2)

Table 3.4 and Table 3.5 compare DP and Sparse DP classifier with several existing methods: Empirical

Bayes classifier (EB) by Greenshtein and Park (2009), Independence Rule (IR) by Bickel and Levina (2004),

Feature Annealed Independence Rule (FAIR) by Fan and Fan (2008) and logistic regression with lasso using

R package glmnet (denoted as glmnet).

DP and Sparse DP methods dominate other methods in the diagonal covariance matrix case whether the

mean difference is sparse or not. If the mean difference vector is extremely sparse while the signal is strong,

Sparse DP classifier outperforms DP classifier. In the relatively dense signal case, DP classifier outperforms

sparse DP classifier.

To illustrate how estimation accuracy influences classification accuracy, we compute εp, the mean squared error (MSE) of estimating µ1 − µ2, for the DP estimators. The estimation error and classification error of the DP methods are shown in Figure 3.1.

From Figure 3.1 we conclude that Cp plays the most important role in determining the magnitude of the classification error. In the first three scenarios, as sp decreases, both Cp and the estimation error εp decrease; the resulting estimation error decreases but the classification error increases. Therefore, when Cp is large, even if the estimation error is large, the classification error is still small. From the first scenario to the last, sp decreases and the estimation error also decreases, which is consistent with the asymptotic order of the estimation error sp log(p/sp). The trend of the classification error depends on both the magnitude of Cp and sp.

Overall, the DP and Sparse DP estimators improve the estimation accuracy of the nonzero true mean difference while ruling out irrelevant features. The Hard Thresh DP classifier has similar performance to IR,

37 0.3 3000

error type classification estimation

0.2 2000 estimation error classification error

0.1 1000

0.0 0

(1, 2000, 160) (1, 1000, 80) (1, 500, 40) (1.5, 300, 54) (2, 200, 64) (2.5, 100, 50) (3, 50, 36) (3.5, 50, 49) (4, 40, 51.2) scenario

Figure 3.1: Estimation error and classification error of DP methods when p − l entries are 0. Red boxplot indicates estimation error while blue boxplot indicates classification error.

indicating that if the estimation error is not well controlled, classification accuracy cannot be guaranteed.

Simulation Study 2. We consider an AR(1) covariance structure Σ = s^2 R, where s^2 = 25/2. That is, the correlation satisfies R_{ij} = Corr(X_{kmi}, X_{kmj}) = ρ^{|i−j|}, k = 1, 2, 1 ≤ m ≤ n, 1 ≤ i, j ≤ p, and p = 10^4. The sample size of each class is n = 25. We consider three different configurations of µ1 in this simulation study. The simulation results, based on 100 repetitions, are shown in Tables 3.6, 3.7 and 3.8, which compare the theoretical

classification error.

ρ Hard Thresh DP Sparse DP DP EB IR FAIR glmnet

0.3 0.0089 0.0031 0.0021 0.0022 0.0092 0.1276 0.4325

0.5 0.0235 0.0135 0.0105 0.0096 0.0237 0.1393 0.4340

0.7 0.0714 0.0539 0.0468 0.0430 0.0712 0.1800 0.4437

0.9 0.2079 0.1929 0.1867 0.1758 0.2073 0.2702 0.4586

Table 3.6: Classification error for simulation study 2, p = 10^4, 2000 entries of µ1 are 1. Other entries are generated from N(0, 0.1^2)

ρ Hard Thresh DP Sparse DP DP EB IR FAIR glmnet

0.3 0.0237 0.0081 0.0056 0.0068 0.0243 0.0546 0.2315

0.5 0.0481 0.0233 0.0183 0.0203 0.0483 0.0699 0.2792

0.7 0.1036 0.0686 0.0612 0.0619 0.1033 0.1095 0.2978

0.9 0.2472 0.2111 0.2054 0.2024 0.2466 0.2308 0.3603

Table 3.7: Classification error for simulation study 2, p = 10^4, 1000 entries of µ1 are 1 and 100 entries are 2.5. Other entries are generated from N(0, 0.1^2)

ρ Hard Thresh DP Sparse DP DP EB IR FAIR glmnet

0.3 0.0233 0.0037 0.0032 0.0038 0.0238 0.0226 0.0913

0.5 0.0478 0.0139 0.0129 0.0138 0.0475 0.0374 0.1290

0.7 0.1069 0.0508 0.0493 0.0502 0.1069 0.0801 0.1747

0.9 0.2445 0.1827 0.1871 0.1834 0.2441 0.2067 0.2971

Table 3.8: Classification error for simulation study 2, p = 10^4, 1000 entries of µ1 are 1 and 50 entries are 3.5. Other entries are generated from N(0, 0.1^2)

The DP family (except Hard Thresh DP) and EB are among the best methods under this AR(1) correlation structure. If the correlation is severe and there are no very large mean differences, EB performs better. If the correlation is not extremely severe or there are some large mean differences, the DP classifier performs better. As ρ gets larger, the classification error keeps increasing for every method; the Sparse DP classifier and the DP classifier can still be considered two relatively good classifiers given that we only have very few data points.

Simulation Study 3. We consider the same setting used in Fan and Fan (2008). The error vector is no longer normal and the covariance matrix has a group structure. All features are divided into 3 groups. Within each group, features share one unobservable common factor with different factor loadings. In addition, there is an unobservable common factor among all the features across 3 groups. p = 4500 and n = 30. To

construct the error vector, let Z_{ij} be a sequence of independent standard normal random variables and χ_{ki} be a sequence of independent random variables with the same distribution as (χ^2_6 − 6)/\sqrt{12}. Let aj and bj be factor loading coefficients. Then the error vector for each class is defined as

ε_{ij} = \frac{Z_{ij} + a_{1j}χ_{1i} + a_{2j}χ_{2i} + a_{3j}χ_{3i} + b_j χ_{4i}}{\sqrt{1 + a_{1j}^2 + a_{2j}^2 + a_{3j}^2 + b_j^2}},    i = 1, 2, ..., 30,  j = 1, 2, ..., 4500,

where a_{ij} = 0 except that a_{1j} = aj for j = 1, ..., 1500, a_{2j} = aj for j = 1501, ..., 3000, and a_{3j} = aj for j = 3001, ..., 4500. Therefore E(ε_{ij}) = 0 and Var(ε_{ij}) = 1, and in general the within-group correlation is greater than the between-group correlation. The factor loadings aj and bj are independently generated from uniform distributions U(0, 0.4) and U(0, 0.2), respectively. The mean vector µ1 is taken from a realization of the mixture of a point mass at 0 and a double exponential distribution, (1 − c)δ0 + (c/2)exp(−2|x|), where c = 0.02, and µ2 = 0.

Figure 3.2: Boxplot and scatter plot of classification errors of 4 methods
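For concreteness, a short R sketch (our own illustration under the design just described) of how one class's training errors might be generated:

set.seed(1)
n <- 30; p <- 4500
a <- runif(p, 0, 0.4); b <- runif(p, 0, 0.2)               # factor loadings
group <- rep(1:3, each = p / 3)                            # three feature groups
chi   <- matrix((rchisq(4 * n, df = 6) - 6) / sqrt(12), n, 4)  # common factors per sample
Z     <- matrix(rnorm(n * p), n, p)
E <- sapply(1:p, function(j) {                             # n x p error matrix
  (Z[, j] + a[j] * chi[, group[j]] + b[j] * chi[, 4]) /
    sqrt(1 + a[j]^2 + b[j]^2)
})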

Hard Thresh DP, Sparse DP, FAIR and glmnet to 400 test samples generated from the same process and calculate the average classification error. We also compare these methods to the oracle procedure, which we know the location of each nonzero element in µ2 vector and use these nonzero elements to construct Independence Rule. We have 100 repetitions. The boxplot and scatter plot of the classification error of above four methods are summarized in Figure 3.2 and the average classification error is summarized in

Table 3.9.

Oracle Hard Thresh DP Sparse DP FAIR glmnet

0.0021 0.0150 0.0126 0.0168 0.0252

Table 3.9: Average Misclassification Rate for Simulation Study 3

Both the Hard Thresh DP and Sparse DP classifiers are better than FAIR and outperform logistic regression with Lasso. Even though on average the oracle procedure's classification error is smaller than that of the DP family classifiers, we can conclude from the plot that, for the majority of the 100 trials, the classification error of the Sparse DP classifier is comparable to that of the oracle procedure. The Sparse DP classifier still has very good performance except in some extreme cases.

3.5.2 Classification with Missing Features

In real applications such as sentiment analysis, we are often given a long paragraph to train the model but only a short sentence to test it. Generally speaking, in many classification problems it is often the case that a large number of features are used to construct the classification rule on the training dataset, but only a fraction of these features are available in the test data. We suspect that the Independence Rule has an advantage in this scenario, since it is difficult to exploit the correlation structure learned from the training data if half of the features are missing in the test data.

All classification rules built with the training dataset need to be adjusted before being applied to the test dataset. For all the induced Independence Rules, the missing feature values in the test dataset are replaced by the average of the two class means of that feature in the training dataset. For logistic regression based methods, the classification rule is built using only the regression coefficients of the non-missing features; to make it comparable with the Independence Rule, the intercept is determined by the average inner product of the regression coefficients and the available features.
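A small sketch (our own illustration) of this adjustment for the induced Independence Rules; since the rule scores (x − µˆ), imputing a missing coordinate by the average of the two class means makes its contribution to the score exactly zero.

impute_missing <- function(x_test, mu1_hat, mu2_hat) {
  miss <- is.na(x_test)
  x_test[miss] <- ((mu1_hat + mu2_hat) / 2)[miss]  # replace by average of class means
  x_test
}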

Simulation study 1 is rerun with only 50% of the features available in the test dataset. The classification errors are reported in Table 3.10, which shows that the classification error increases for every method; the DP and Sparse DP classifiers are still the best among all methods. A similar conclusion holds for Simulation study 2 (see Tables 3.11 and 3.12).

(∆, l, Cp) Hard Thresh DP Sparse DP DP EB IR FAIR glmnet

(1, 2000, 160) 0.096 0.041 0.0354 0.0446 0.0974 0.272 0.4632

(1, 1000, 80) 0.2466 0.1948 0.1683 0.1924 0.2482 0.3563 0.4803

(1, 500, 40) 0.3625 0.336 0.3192 0.3365 0.3633 0.4183 0.4916

(1.5, 300, 54) 0.3198 0.2813 0.2633 0.2716 0.3206 0.3463 0.4557

(2, 200, 64) 0.2896 0.2046 0.1905 0.2038 0.2906 0.2611 0.3871

(2.5, 100, 50) 0.3304 0.1925 0.1962 0.2092 0.3316 0.2431 0.3418

(3, 50, 36) 0.3771 0.1991 0.2444 0.2361 0.3776 0.2602 0.3028

(3.5, 50, 49) 0.3362 0.1131 0.1222 0.1423 0.3371 0.1588 0.2227

(4, 40, 51.2) 0.3292 0.0841 0.0867 0.1139 0.33 0.1158 0.1954

Table 3.10: Classification error with only 50% features available in the test data in Simulation Study 1, p = 10^4, p − l entries are 0

ρ Hard Thresh DP Sparse DP DP EB IR FAIR glmnet

0.3 0.0384 0.0157 0.0123 0.0142 0.0392 0.2175 0.4399

0.5 0.0552 0.0316 0.0269 0.027 0.0556 0.2108 0.4488

0.7 0.0963 0.0715 0.0641 0.0612 0.0958 0.2199 0.447

0.9 0.2176 0.1986 0.1919 0.1835 0.217 0.2801 0.4615

Table 3.11: Classification error with only 50% features available in the test data in Simulation Study 2, p = 10^4, 2000 entries of µ1 are 1. Other entries are generated from N(0, 0.1^2)

ρ Hard Thresh DP Sparse DP DP EB IR FAIR glmnet

0.3 0.1653 0.1329 0.1102 0.1208 0.1665 0.2996 0.472

0.5 0.1861 0.1634 0.1394 0.1432 0.1865 0.3077 0.4667

0.7 0.2324 0.2216 0.201 0.1938 0.2324 0.3201 0.4759

0.9 0.3283 0.3195 0.3124 0.3055 0.3278 0.3606 0.4838

Table 3.12: Classification error with only 50% features available in the test data in Simulation Study 2, p = 10^4, 1000 entries of µ1 are 1 and 100 entries are 2.5. Other entries are generated from N(0, 0.1^2)

3.5.3 Real Data Examples

Besides the simulation datasets, two real data examples are also included. First we focus on sentiment analysis for movie reviews. All the datasets are available at https://www.kaggle.com/c/word2vec-nlp-tutorial/data. Both the training and the test dataset have 25,000 reviews. The training reviews are labeled with 0 or 1, indicating negative or positive reviews respectively. After converting the reviews to vector features, we have 35,949 features. The methods we applied include DP, Sparse DP, Independence Rule, logistic regression with Lasso and EB. We consider two cases: (1) all features in the test data are available; (2) only half of the features in the test data are available. The prediction accuracy is shown in Table 3.13. When all features in the test dataset are available, logistic regression with Lasso has the best performance since the correlation among the features is taken into account. However, when only half of the features are available in the test dataset, Sparse DP and EB have the best performance. Figure 3.3 illustrates the magnitude of the coefficients for the glmnet and DP classifiers: the larger the magnitude, the larger the word. glmnet measures the conditional contribution of each feature given the others, so the review score is the most important feature. The DP classifier, in contrast, measures the marginal contribution of each feature, so only strongly emotional words have large magnitude. When only 50% of the features are available, measuring marginal contribution is better than measuring conditional contribution in terms of prediction accuracy.

DP Sparse DP IR glmnet EB

All Features Available 86.62% 86.79% 86.51% 90.12% 86.61%

Half Features Available 85.27% 85.36% 85.10% 84.62% 85.36%

Table 3.13: Classification accuracy of movie review sentiment analysis data

(a) Lasso    (b) DP

Figure 3.3: Word cloud of top 100 unigrams and bigrams in magnitude for coefficients for classifiers based on glmnet and DP. Red indicates negative sentiment and green indicates positive sentiment.

We also consider a leukemia data set which was first analyzed by Golub et al. (1999) and is widely used in the statistics literature. The data set can be downloaded from http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi. There are 7129 genes and 72 samples generated from two classes, ALL (acute lymphocytic leukemia) and AML (acute myelogenous leukemia). Among the 72 samples, the training data set has 38 (27 data points in ALL and 11 in AML) and the test data set has 34 (20 in ALL and 14 in AML). We compare the DP and Sparse DP classifiers with IR, EB and FAIR; the results are summarized in Table 3.14. For the DP and Sparse DP classifiers, we set α = 1, σ = 4, w = 0.9 and split the 7129 entries into 7 batches.

Method Training error Test error

FAIR 1/38 1/34

EB 0/38 3/34

IR 1/38 6/34

DP 1/38 2/34

Sparse DP 1/38 2/34

Hard Thresh DP 1/38 2/34

Table 3.14: Training error and test error of leukemia data set

From Table 3.14 we conclude that the DP family classifiers outperform the EB classifier. The EB classifier performs similarly to IR: its improvement over IR is marginal, whereas using the DP and Sparse DP classifiers results in some further improvement. Both the DP and EB classifiers shrink the mean difference but do not eliminate any irrelevant feature. The Sparse DP classifier selects 2092 features but has the same test error as the DP classifier. This might be due to the fact that this dataset is relatively well separated, so thresholding does not improve much.

3.6 Discussion

The contribution of this chapter is three-fold: first, we established the relationship between the estimation error and the classification error theoretically; second, we proposed two empirical Bayes estimators for the normalized mean difference and the induced linear classifiers; third, for estimating d, we developed a variational Bayes algorithm to approximate the posterior distribution of a Dirichlet process mixture model with a special base measure, and our algorithm can be parallelized using the "batch" idea.

Yet there are still many open problems and possible extensions related to this work. For example, instead of using the Independence Rule, we could develop a Bayesian procedure to threshold both the mean difference and the sample covariance matrix, in a spirit similar to Shao et al. (2011), Cai and Liu (2012) and Bickel and Levina (2008). Another extension is to relax the normality assumption: LDA is suitable for any elliptical distribution, therefore our work could also be extended to a larger family of distributions under sub-Gaussian constraints.

Chapter 4

Two-Stage Ridge Regression and Posterior Approximation of Bayesian Variable Selection

4.1 Introduction

In high dimensional regression we deal with a large number of covariates. Variable selection is important in this setting and a parsimonious model is preferable. The Ordinary Least Squares (OLS) estimator is not a good choice because of its poor prediction performance and difficult interpretation. In real applications such as bioinformatics, finance and image processing, the number of predictors p is often much larger than the sample size n. Without carefully choosing the important features, we could easily include noisy features in the final model, which increases the variance of our predictions. Therefore variable selection is necessary and important.

Variable selection has been thoroughly studied from the frequentist point of view. Most of the methods threshold and shrink the estimated regression coefficients via a penalized likelihood. Consider the high dimensional linear model

y_{n×1} = X_{n×p} β_{p×1} + ε_{n×1},    ε ∼ Nn(0, σ^2 In),

where both β and σ^2 are unknown. X is a fixed, normalized design matrix: without loss of generality, all diagonal elements of X^T X are equal to n.

By adding a penalty function, we search for a parsimonious model in a restricted space. The general form of the penalized likelihood estimator is

βˆ = \arg\max_{β}\; n^{-1} l_n(β) − \sum_{i=1}^{p} p_λ(|β_i|),

where l_n(·) is the log-likelihood function and p_λ is the penalty function with tuning parameter λ. The most popular penalties include, to name a few, the L1 penalty (Tibshirani, 1996), the SCAD penalty (Fan and Li, 2001) and the Minimax Concave Penalty (Zhang, 2010). Selection consistency has been established for the above methods. For an overall review of frequentist variable selection methods, see Fan and Lv (2010).

Prediction accuracy results have also been developed for the L1 penalty: prediction of a new response is considered and the predictive error ||Xβˆ − Xβ||^2 is of the order q log p under the Sparse Riesz Condition (Zhang and Huang, 2008) and the Restricted Eigenvalue Condition (Bickel et al., 2009), where q is the number of nonzero elements of β.
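For reference, an L1-penalized fit of this kind can be obtained with the glmnet package (already used for the lasso-logistic classifier in Chapter 3); the toy data below are only for illustration.

library(glmnet)
set.seed(1)
n <- 100; p <- 500
X <- matrix(rnorm(n * p), n, p)
beta <- c(rep(2, 5), rep(0, p - 5))              # sparse true coefficient vector
y <- X %*% beta + rnorm(n)
cv_fit   <- cv.glmnet(X, y, alpha = 1)           # alpha = 1: the L1 (lasso) penalty
beta_hat <- coef(cv_fit, s = "lambda.min")       # sparse estimate of beta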

Unlike the L1 penalty, the L2 penalty, corresponding to ridge regression (Hoerl and Kennard, 1970), cannot provide a sparse estimator of β. As a refinement, Shao and Deng (2012) added a thresholding step to the original ridge regression estimator.

One advantage of Bayesian variable selection methods is the ability to provide a posterior distribution, which is a powerful tool for calculating the posterior probability of the selected model and for constructing a credible set of the regression coefficients. One frequentist analogue that provides a distribution for inference is the confidence distribution (Xie and Singh, 2013). However, to our knowledge, there is no relevant work using confidence distributions in variable selection.

Asymptotics for the posterior distribution in Bayesian variable selection have been established in various settings. Ghosal (1999) established asymptotic normality of the posterior distribution of the regression coefficients in linear models when p^4 log p/n → 0. In high dimensional settings, Jiang (2007) considered consistency of the conditional density of y given X and established convergence rates over a Hellinger neighborhood. Posterior concentration for the "point mass spike and Laplacian slab" prior was established in Castillo et al. (2015); they give rates of contraction of the posterior distribution both with respect to the predictive error ||X(βˆ − β)||^2 and with respect to the parameter β under the L1, L2 and L∞ errors. The growth rate of the predictive error coincides with that of L1-type penalties. A two-component Laplacian mixture "spike and slab" prior was considered in Ročková and George (2015) and posterior concentration was also obtained. Martin et al. (2017) considered a data-dependent prior for β and also achieved posterior concentration; their prior for βS given the support set S is a normal distribution with the OLS estimator as the mean and a variance proportional to the Gram matrix.

An explicit form of the posterior distribution is often difficult to obtain in a hierarchical prior setup, but an approximate posterior distribution can be obtained via MCMC. The Stochastic Search Variable Selection method (George and McCulloch, 1993) was proposed to compute the posterior using a Gibbs sampler. To reduce the high computational cost of George and McCulloch (1993), an EM algorithm (Ročková and George, 2014) was proposed by treating the model index as a latent variable. Another version of the EM algorithm (Wang et al., 2016) was proposed by treating β as a latent variable, and selection consistency when p ≺ n was established. However, an EM algorithm can only compute the posterior mode instead of approximating the whole posterior distribution.

Motivated by thresholded ridge regression (Shao and Deng, 2012), we propose a fast algorithm to estimate the regression coefficients and approximate the posterior, which can be computed easily by applying ridge regression twice. We apply ridge regression twice to reduce the bias introduced in the original ridge regression of Shao and Deng (2012); the shrinkage factors are chosen adaptively so that only irrelevant variables are shrunk heavily. Our algorithm selects the best model and estimates the regression coefficients at the same time, with all the information summarized in a "distribution estimator".

Let γˆ denote the estimated model index, where each entry γˆj takes the binary value 0 or 1, corresponding to the j-th variable being excluded or included. Define Sˆ = {j : γˆj = 1}. The distribution estimator has the following form (we will specify wˆj, µSˆ, ΣSˆ, µj and σjj in the next section; φ(·) is the normal density):

Πn(β) = φ_{|Sˆ|}(βSˆ; µSˆ, ΣSˆ) \prod_{j: γˆj = 0} ((1 − wˆj)δ0 + wˆj φ(βj; µj, σjj)).

Our distribution estimator has a natural Bayesian interpretation. Based on it, we can select the most probable model, estimate the regression coefficients and compute the posterior probability of the selected model. In theory, we will establish posterior concentration for our analogue of the Bayesian posterior, that is, the probability mass of a ball centered at the true regression coefficients, with radius going to 0 at a certain rate, goes to 1. Meanwhile, we construct our estimated regression coefficients βˆ from the above distribution. We will show that, in terms of prediction accuracy, the L2 prediction error ||X(βˆ − β)||^2 is of a similar order to the prediction error of the Lasso estimator.
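As an illustration of how the distribution estimator can be used for inference, the following hedged R sketch (illustrative inputs; it assumes the MASS package is available) draws samples of β from Πn(β): a multivariate normal block on the selected set and independent spike-and-normal marginals elsewhere.

# S_hat: indices with gamma_hat = 1; m_S, Sigma_S: their normal block parameters;
# w, m, s2: weights, means and variances for the coordinates outside S_hat.
r_pi_n <- function(n_draw, S_hat, m_S, Sigma_S, w, m, s2, p) {
  draws <- matrix(0, n_draw, p)
  draws[, S_hat] <- MASS::mvrnorm(n_draw, mu = m_S, Sigma = Sigma_S)
  out <- setdiff(seq_len(p), S_hat)
  for (j in seq_along(out)) {
    keep <- rbinom(n_draw, 1, w[j])            # probability 1 - w_j of the point mass at 0
    draws[, out[j]] <- keep * rnorm(n_draw, m[j], sqrt(s2[j]))
  }
  draws
}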

In Section 4.2 we propose two-stage ridge regression and construct this distribution estimator. In Section 4.3 we describe sufficient conditions for the contraction property. In Section 4.4 we illustrate the concentration property in the simple orthogonal design case. In Section 4.5 we discuss the concentration property and prediction accuracy in the low dimensional case. In Section 4.6 we move to the high dimensional case. All the proofs are summarized in the appendix.

4.2 Proposed Algorithm

4.2.1 Two-Stage Ridge Regression

The one-step ridge regression estimator in Shao and Deng (2012) has the following form:

m^{(0)} = (X^T X + v_{0n}^{-1} Ip)^{-1} X^T y,    v_{0n} → 0.

Since v_{0n}^{-1} → ∞, the large penalty introduces bias even though thresholding m^{(0)} gives the correct model index. Our two-stage ridge regression reduces the bias by first thresholding m^{(0)} at the pre-specified threshold r = \frac{1}{1/v_{0n} − 1/v_{1n}} \log\frac{v_{1n}}{v_{0n}}; then adaptively choosing the penalty for each feature dimension (either the large penalty v_{0n}^{-1} or the small penalty v_{1n}^{-1}); and finally redoing the ridge regression with the adaptive penalty. The procedure is summarized as follows:

m^{(0)} = (X^T X + v_{0n}^{-1} Ip)^{-1} X^T y,    (4.1)

γˆj = \begin{cases} 1, & \text{if } (m_j^{(0)})^2 > r \\ 0, & \text{if } (m_j^{(0)})^2 < r \end{cases},    j = 1, 2, ..., p,    γˆ = (γˆj)_{j=1}^p,    (4.2)

αj = (1 − γˆj) v_{0n} + γˆj v_{1n},    ∆ = diag(α1, ..., αp),    (4.3)

m = (X^T X + ∆^{-1})^{-1} X^T y.    (4.4)

The initial ridge regression estimator m^{(0)} shrinks the nonzero regression coefficients heavily but provides guidance for selecting the relevant variables. Therefore, in the second round of ridge regression, with large probability we only shrink heavily the coefficients corresponding to irrelevant features and shrink little the coefficients corresponding to relevant features. Based on this algorithm, our proposed estimator is

β˜ = γˆ ◦ m,

where ◦ denotes elementwise product of two vectors. We will prove γˆ is consistent and establish the order

of kX(β˜ − β)k2.

There are two tuning parameters involved in constructing the distribution estimator: v_{0n} and v_{1n}; v_{0n} tends to be small and v_{1n} tends to be large. From now on we may suppress the dependence on n to simplify notation. The threshold r = \frac{1}{1/v_0 − 1/v_1} \log\frac{v_1}{v_0} depends on both v_0 and v_1.
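A direct R sketch of the two-stage procedure (4.1)-(4.4) above; this is a minimal illustration under the stated setup rather than the thesis implementation.

two_stage_ridge <- function(X, y, v0, v1) {
  p  <- ncol(X)
  r  <- log(v1 / v0) / (1 / v0 - 1 / v1)                          # threshold r
  m0 <- solve(crossprod(X) + diag(1 / v0, p), crossprod(X, y))    # (4.1) initial ridge fit
  gamma <- as.numeric(m0^2 > r)                                   # (4.2) thresholding
  alpha <- (1 - gamma) * v0 + gamma * v1                          # (4.3) adaptive variances
  m  <- solve(crossprod(X) + diag(1 / alpha), crossprod(X, y))    # (4.4) second ridge fit
  beta_tilde <- gamma * m                                          # gamma o m
  list(m0 = drop(m0), gamma = gamma, m = drop(m), beta = drop(beta_tilde))
}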

4.2.2 Distribution Estimator Construction

Next we construct a distribution estimator based on m. We decompose m and β according to Sˆ: m = (m_{Sˆ}^T, m_{Sˆc}^T)^T and β = (β_{Sˆ}^T, β_{Sˆc}^T)^T. In the high dimensional situation where β is sparse, the computational cost of capturing the correlation among the large regression coefficients is small since only a few large regression coefficients are present. However, capturing the correlation among the small regression coefficients is computationally expensive and not very useful. Therefore, inspired by the Skinny Gibbs sampler proposed by Narisetty and He (2014), we assume that the distribution on Sˆc factorizes into the marginal distributions of the βj.

When j ∈ Sˆc, with large probability βj is zero. It is reasonable to assume the marginal distribution of βj is a point mass at 0 plus a continuous normal density centered at mj. The weight wˆj is constructed through the logit link and is related to the difference between the threshold r and mj^2: wˆj is small when mj^2 is below r, and vice versa. We plug an estimator of Var(mj) into the variance component. The algorithm for constructing the distribution estimator is summarized in Algorithm 2.

Algorithm 2 Distribution Estimator Construction 1: Initialize ν and λ;

2: Setσ ˆ2 = 1; T −1 −1 (0) T (0) (0) p p 3: Compute V ← (X X + v0 Ip) , m ← VX y, where m = (mj )j=1 and diag(V) = (Vjj)j=1; T (0) 2 Pp (0) 2 2 tr(XVX )+ky−Xm k + j=1(Vjj +(mj ) )/v0+νλ 4: Updateσ ˆ ← n+p+ν+2 ; 2 (0) 2 σˆ v1 5: Update γ: γj ← 1 if (m ) > r = log ; j 1/v0−1/v1 v0 6: Update m ← (XT X + ∆−1)−1XT y; m2+V wˆj 1 v1 1 j jj 1 1 7: Computew ˆj via log( ) = − log( ) + 2 ( − ) whenγ ˆj = 0; 1−wˆj 2 v0 2 σˆ v0 v1 8: Plug in to get distribution estimator

2 Y 2 Πn(β) = φ|Sˆ|(βSˆ; mSˆ, σˆ VSˆ) ((1 − wˆj)δ0 +w ˆjφ(βj; mj, σˆ Vjj)). j:ˆγj =0

The above algorithm has close relation with Bayesian variable selection using spike-and-slab prior. In

Bayesian literature variable selection is achieved via shrinkage prior. One widely used prior is spike-and-slab

prior (Mitchell and Beauchamp, 1988). This prior could eliminate noisy features and avoid shrinking large

regression coefficients too much simultaneously. Given the model index γ, β follows “Gaussian spike and

2 2 Gaussian slab” prior whose variance is v0σ for the spike component and v1σ for the slab component. The hierarchical Bayesian model is written as:

π(σ2) = IG(ν/2, νλ/2),

π(γ | θ) = Bern(θ), θ = 1/2,   2 N(0, σ v0) if γj = 0 π(βj | σ, γj) = , j = 1, 2, ··· , p,  2 N(0, σ v1) if γj = 1

2 π(y|β) = Nn(Xβ, σ In), (4.5)

where ν and λ are two hyper-parameters. θ is fixed to be 1/2 in this paper, which gives each model equal

prior probability.

The motivation behind this distribution construction is EM algorithm proposed in Wang et al. (2016).

49 EM algorithm was derived in Wang et al. (2016) to compute posterior mode via treating β as latent variables

and maximizing over the model index space. In EM algorithm we start with the null model, corresponding

to γ = 0, in favor of parsimonious model. Step 2-5 resembles initialization and the first iteration of EM

algorithm and Step 6 resembles updating model index γ in the beginning of the second iteration. Step 7 is

different from EM algorithm because EM algorithm could only provide posterior mode, but not the whole

posterior distribution. Meanwhile, Step 7 could also be viewed as a updating step in the variational inference

algorithm in the same model setup, which updates the probability of j-th variable included in each iteration.

One major advantage of providing a distribution estimator is the convenience of inference. Besides

choosing the single best model and MAP estimator of β, we could compare different candidate models,

retrieve correlation among relevant features, output point estimate of the candidate model, and so on, which

will be helpful for statistical inference. In the following section, we will focus on theoretical properties of this

distribution estimator. Specifically, we will establish concentration property of this distribution estimator.

First we use Orthogonal Design case as a starting point.

4.3 Sufficient Conditions of Concentration Property

an We first introduce the following notations. For two sequences, an ∼ bn if as n → ∞, → c, where bn

an 0 < c < ∞. For two sequences, an ≺ ( )bn if as n → ∞, → 0(∞). λmax(·) and λmin(·) denotes maximal bn # and minimal eigenvalues of a matrix. λmin(·) denote the minimal nonzero eigenvalue of a matrix. σmax(·)

and σmin(·) denote the maximal and minimal singular value of a matrix. tr(·) stands for matrix trace. k · k denotes traditional L2 vector norm. supp(·) denotes the support set of a vector. R(X) refers to the row space of matrix X while C(X) refers to the column space of matrix X.

Without loss of generality, β∗ stands for the vector of unknown true regression coefficients, whose first q

∗ ∗ T T T elements are nonzero and the remaining p−q elements are 0, that is, β = ((β1) , 0p−q) . The corresponding ∗ ∗ ∗ T T T support set is S = supp(β ) = {1, 2, ··· , q} and the model index set is γ = (1q , 0p−q) . we write m in 4.5 T T T ∗ in correspondence as m = (m1 , m2 ) . Besides, let V1 denote the principal sub-matrix of V indexed by S . (n×q) (n×(p−q)) T Finally we decompose X accordingly: X = (X1 , X2 ) . The bias of m is bias(m) = E(m) − β. Concentration studies the probability mass of the distribution estimator over a ball centered at the true

∗ ∗ ∗ value. The ball centerred at the true value β with radius Mn is denoted as B(β , Mn) = {β | kβ −β k ≤

Mn}, where M is a constant and n depends on n. We have the following definition:

∗ Definition 4.1 (Concentration Property). Πn(β) concentrates on β with rate n if there is a constant M

50 such that

∗ p Πn(B(β , Mn)) → 1 as n → ∞. (4.6)

∗ To find sufficient conditions to guarantee concentration, we use the following fact. Since B(β , Mn) ⊇

∗ ∗ {β | kβ − β k ≤ Mn, supp(β) = supp(β )}. The probability in (4.6) has the following lower bound:

∗ ∗ ∗ ∗ Πn(B(β , Mn)) ≥ Πn(B(β , Mn) | γˆ = γ )P (γˆ = γ ) Z ∗ ∗ = dΠn(β | γˆ = γ )P (γˆ = γ ) ∗ B(β ,Mn) Z ∗ ∗ ≥ dΠn(β | γˆ = γ )P (γˆ = γ ) ∗ ∗ B(β ,Mn)∩{β|supp(β)=supp(β )} Z ∗ Y 2 ≥ P (γˆ = γ ) (1 − wˆj) φq(β1; m1, σˆ V1)dβ1 B(β∗,M ) j∈ /S∗ 1 n

∗ Y ∗ = P (γˆ = γ ) × (1 − wˆj) × Pb1 (kb1 − β1k ≤ Mn), j∈ /S∗

2 where b1 is an q-dimensional vector which follows normal distribution with mean m1 and varianceσ ˆ V1. Q The second part in the product is j∈ /S∗ (1 − wˆj). In order to guarantee the lower bound is tight, we Q p p require j∈ /S∗ (1 − wˆj) → 1. To guarantee Πj∈ /S∗ (1 − wˆj) → 1, we have the following inequality:

Y X (1 − wˆj) ≥ 1 − wˆj. j∈ /S∗ j∈ /S∗

P p Therefore j∈ /S∗ wˆj → 0 is one sufficient condition. The following proposition is to summarize all the sufficient conditions we need:

∗ P p ∗ Proposition 4.1. If P (γˆ = γ ) → 1, j∈ /S∗ wˆj → 0 and Pb1 (kb1 − β1k ≤ Mn) → 1 where b1 ∼ 2 ∗ p Nq(m1, σˆ V1), then Πn(B(β , Mn)) → 1.

∗ ∗ To prove Pb1 (kb1 − β1k ≤ Mn) → 1, the key step is to bound maxj∈S Vjj, λmax(Var(m1)) and 2 2 kbias(m1)k , which are further bounded by maxj Vjj, λmax(Var(m)) and kbias(m)k respectively. Bounding

∗ P p these three terms will also be a key step to prove P (γˆ = γ ) → 1 and j∈ /S∗ wˆi → 0. In the following sections we will establish bound for the above three terms in various model setups. For illustration purpose, we first look at the orthogonal design matrix case.

51 4.4 Concentration and Prediction in the Orthogonal Design Case

T To illustrate theoretical results, first we consider orthogonal case: we assume X X = nIp (This implies p ≤ n). This condition is too strong to be satisfied. However, to have a better illustration and understanding

of what kind of theoretical conditions we need, orthogonal case is a good start.

4.4.1 Concentration Property

δ −α ∗ 2 η2 From now on we assume v1 = exp(n ), v0 = n and kβ k ∼ n . Assume α < 1. In step 3 V =

α −1 (0) α −1 T T ∗ 2 (0) n ∗ n 2 (n + n ) Ip and m = (n + n ) X y. Since X y ∼ Np(nβ , nσ Ip), m ∼ Np( n+nα β , (n+nα)2 σ Ip). We have

(0) 2 2α+η2−2 −1 (0) −1 Proposition 4.2. kbias(m )k ∼ n , λmax(V) ∼ n , λmax(Var(m )) ∼ n .

In Step 4, we first look at the denominator forσ ˆ2, tr(XVXT ) = np/(n + nα). The second term is

2 2 T 1 −1 T 2 ky − Xmk = ky − PXyk + kPXy − X(X X + Ip) X yk v0 2 2 2α−3 T 2 2 2 2α−3 2 ∗ 2 2 = σ χn−p(0) + n kX yk ≤ σ χn−p(0) + n Op(n kβ k + npσ )

2α+η2−1 = Op(n − p) + Op(n ),

where PX is the projection matrix of columns of X.

2 If 2α + η2 ≤ 2, we have ky − Xmk = Op(n).

Pp (0) 2 Pp (0) 2 α−1 For the second term j=1(Vjj +(mj ) )/αj, we have j=q+1(Vjj +(mj ) )/αj ∼ (p−q)n . If j ≤ q, Pq (0) 2 α −1 2 αj = v0, j=1(Vjj + (mj ) )/v0 ≤ n (2qn + Op(kβk )). The leading term in the numerator should be

Op(n).

2 Proposition 4.3. If α + η2 ≤ 1, σˆ = Op(1).

Step 5 determines the model index. Denote X = (x1, x2, ··· , xp). If j > q, xj represents an irrelevant

(0) n 2 δ−α variable, mj ∼ N(0, (n+nα)2 σ ). The thresholding r ∼ n . Using normal distribution tail probability

bound, there exists c0, such that:

(0) 2 2 1−α+δ Proposition 4.4. P (maxj>q(mj ) > r) ≤ 2(p − q) exp(−c0n ).

(0) n ∗ n 2 (0) ∗ (0) ∗ If j ≤ q, mj ∼ N( n+nα βj , (n+nα)2 σ ). We have minj≤q |mj | ≥ minj≤q |βj | − maxj≤q |mj − βj |. ∗ δ−α Denote the minimal signal an = minj≤q |βj |. We need to have a minimal signal condition: an n . (0) ∗ (0) n ∗ (0) Besides, for the second term maxj≤q |mj − βj |, since Emj = n+nα βj , there is a bias between mj and

52 ∗ βj . To control this bias, we need to assume η2 < 1. For that positive constant c0, for large enough n, we

have √ √ nα|β∗| − (n + nα) r P (|m(0) − β∗| > r) ≤ 2Φ( j √ ). j j nσ

α ∗ √ If n kβ k ≺ n r, that is, 3α + η2 − δ < 2, we have for large enough n,

(0) ∗ √ 2 δ−α+1 P (|mj − βj | > r) ≤ 2 exp(−c0n ).

Then we have

(0) ∗ 2 2 1−α+δ ∗ δ−α Proposition 4.5. P (maxj≤q(mj − βj ) > r) ≤ 2q exp(−c0n ) if an = minj≤q |βj | n , η2 < 1 and 3α + η2 − δ < 2.

∗ n (0) 2 o n (0) 2 o The event {γˆ = γ } is equivalent to min ∗ [m ] > r ∩ max ∗ [m ] < r . If α < 1, j:γj =1 j j:γj =0 j (δ−α)/2 3α + η2 − 2 < δ and an n , we have

∗ 2 1−α+δ Proposition 4.6. P (γˆ = γ ) ≥ 1 − 2p exp(−c0n ).

∗ δ 2 T In Step 6, since γˆ = γ with large probability, we have Var(m) = diag(n/(n + exp(−n )) 1p , n/(n + δ 2 T T δ δ ∗ T T n ) 1p−q) and bias(m) = (− exp(−n )/(exp(−n ) + n)β1, 0p−q) . Therefore

2 −2+η2 δ −1 Proposition 4.7. kbias(m)k ∼ n exp(−2n ), λmax(Var(m)) ∼ n .

Compared with kbias(m(0))k, by introducing the second step in ridge regression, kbias(m)k ≺ kbias(m(0))k.

Bias could be significantly reduced via adding the adaptive penalty. P p ∗ Concentration property holds when i/∈S∗ wˆi → 0 and Pb1 (kb1 − β1k ≤ Mn) → 1. We first deal with P ∗ i/∈S∗ wˆi. Since when i∈ / S we have

2 wˆi 1 δ 1 maxi>q(mi + Vii) δ α log( ) ≤ − (n + α log(n)) + 2 (− exp(−n ) + n ), 1 − wˆi 2 2 σˆ

P δ therefore i/∈S∗ wˆi  (p − q) exp(−n /2). Hence

P p Proposition 4.8. i/∈S∗ wˆi → 0.

∗ The second part is Pb1 (kb1 − β1k ≤ Mn):

∗ 2 ∗ 2 2 kb1 − β1k ≤ 2kEb1 (b1) − β1k + 2kb1 − Eb1 (b1)k

∗ 2 2 = 2km1 − β1k + 2kb1 − Eb1 (b1)k

2 2 2 ≤ 4km1 − Em1 (m1)k + 4kbias(m1)k + 2kb1 − Eb1 (b1)k .

53 2 The expectation of the first term is bounded by Ekm1 −Em1 (m1)k = tr(Var(m1)) ≤ qλmax(Var(m1)). The 2 2 ∗ expectation of the third term is bounded by Eb1 kb1 − E(b1)k ≤ qσˆ maxj∈S Vjj. Therefore we assume 2 2 2 2 2 qσˆ maxj∈S∗ Vjj = op(n), kbias(m1)k = o(n) and qλmax(Var(m1)) = o(n). From above analysis these

2 −2+η2 δ 2 conditions could be satisfied if q/n = o(n) and n exp(−n ) = o(n):

2 −2+η2 δ 2 ∗ Proposition 4.9. If q/n = o(n) and n exp(−n ) = o(n), Pb1 (kb1 − β1k ≤ Mn) → 1.

To sum up, we have the final proposition as follows:

∗ (δ−α)/2 2 −2+η2 δ Proposition 4.10. If 3α+η2−2 < δ, minj≤q |βj | n , α+η2 ≤ 1, q/n = o(n) and n exp(−2n ) = 2 ∗ p o(n), we have Πn(B(β , Mn)) → 1.

4.4.2 Prediction Accuracy

Since our distribution estimator concentrates around β∗, it is natural to conjecture the estimator induced by this distribution will be close to the true value. In this chapter we study the estimator of regression coefficients given by β˜ = γˆ ◦ m. Moreover we consider prediction accuracy and aim to compare prediction accuracy of our estimator with OLS estimator given only the relevant features. To illustrate this, let y∗ be the independent copy of y, for any estimator βˆ of β, we have

−1 ˆ 2 2 −1 ˆ 2 n Eky∗ − Xβk = σ + n EkXβ − Xβk .

Therefore in the following sections we only look at expected prediction error EkXβˆ−Xβk2. A well-known fact ˆ 2 2 is EkXβOLS − Xβk = pσ . An oracle is that we know q relevant predictors in advance and use only these q ˆ OLS 2 2 variables to construct OLS estimator. Then we have the oracle prediction error is EkX1β1 −X1β1k = qσ . In high dimensional situation qσ2 is optimal order for prediction error.

˜ ˜ T ˜ T T ˜ 2 ˜ 2 Denote β = (β1 , β2 ) . Back to the orthogonal case. We have EkXβ − Xβk = nEkβ − βk . Let ∗ An = {γˆ = γ }, on the event An, we have

n kXβ˜ − Xβ∗k21 = n km − β∗k21 ≤ q · ( )2 + n−1+η2 exp(−2nδ). E An E 1 1 An n + exp(−nδ)

˜ ∗ 2 We have EkXβ − Xβ k 1An  q. c ∗ 2 2 ∗ 2 On the event A , kXβ˜ − Xβ k 1 c ≤ 2kXβ˜ − Xmk 1 c + 2kXm − Xβ k 1 c . For the first term, n An An An kXβ˜ − Xmk2 = nkβ˜ − mk2 ≤ nδ−αp,

2 δ−α 2 1−α+δ kXβ˜ − Xmk 1 c  n p · 2p exp(−c n ) → 0. E An 0

54 For the second term, we use Cauchy-Schwartz inequality,

∗ 2 ∗ 2 kXm − Xβ k 1 c = n km − β k 1 c E An E An α−1 j ∗ −1 −1 T 2  = k − β + ( + ∆ ) X ε k 1 c E −1 p γ n An n + αj α−1  j 2 ∗ 2 n T  ≤ 2 (( ) kβ k + ε P ε )1 c E −1 −1 2 n X n An n + αj (n + αj )

−2+η2 2 1−α+δ  (n + O(1))2p exp(−c0n ) → 0.

2 Therefore kXβ˜ − Xβk 1 c → 0. For orthogonal design, using the same conditions as before, we have E An EkXβ˜ − Xβk2  q. Our estimator has asymptotically the same performance as OLS estimator with only relevant variables only. Our estimator has the optimal order of the predictive error.

4.5 Concentration and Prediction in Low Dimensional Case

If X is not orthogonal and p ≤ n, the proof follows if we put some constraints on eigenvalues of XT X.

T Correlation among covariates could not be too strong. Denote λn1 to be the smallest eigenvalue of X X. The regularity conditions are very similar to the orthogonal case. Posterior concentration is illustrated in

Theorem 4.1:

∗ Theorem 4.1. Suppose p ≤ n and the rank of X is p. Denote minimal signal a = min ∗ |β |. If n j:γj =1 j

η1 −α δ ∗ 2 η2 (δ−α)/2 λn1 ∼ n , v0n ∼ n , v1n ∼ exp(n ), kβ k ∼ n . If an n , δ > 3α − 2η1 + η2, α < η1 and p ∗ −η1 2 α + η2 ≤ 1, then Πn(B(β , Mn)) → 1 with radius satisfying qn = o(n).

To prove Theorem 4.1, we need to carefully calculate the order of bias and variance of our estimator.

We use the similar method in Shao and Deng (2012) to prove Theorem 4.1. We need to carefully check the correlation between relevant variables and irrelevant variables. Since we put the minimal eigenvalue condition on Gram Matrix XT X, the collinearity between relevant variables and irrelevant variables can not be too strong.

The adtantage of our two-stage ridge regression is that, we could dramatically reduce the bias introduced by ridge regression but the variance will not be increased too much. We summarize the magnitude of bias and variance of m(0) and m in Table 4.1 (they are buried in the proof in the appendix). Similar to the orthogonal case, the bias could be reduced in two-stage ridge regression.

55 (0) (0) bias(m ) λmax(Var(m )) bias(m) λmax(Var(m))

Order O(nα−η1+η2/2) O(n−η1 ) O(exp(−nδ)n−η1+η2/2) O(n−η1 )

Table 4.1: Compare Thresholded Ridge Regression (Shao et al., 2011) and Two-stage Ridge Regression. Bias and variance of m(0) and m when p ≤ n

Since we have proved when sample size goes to infinity, with large probability our algorithm could produce a distribution estimator which concentrates around the true value, it is natural to ask whether our estimator is comparable to OLS estimator in terms of prediction accuracy. EkXβ˜ − Xβk2 ∼ q holds provided the condition below:

Condition 4.1. λmax(PX2 PX1 ) < 1 − c1.

To explain the meanings of PX1 and PX2 , we decompose X = (X1, X2). X1 represents design matrix

of all relevant variables and X2 represents design matrix of all irrelevant variables. PX1 and PX2 are

T −1 T projection matrices of column space of X1 and X2 (c(X1) and c(X2)) : PX1 = X1(X1 X1) X1 and PX2 = T −1 T X2(X2 X2) X2 . c1 is a constant which does not depend on n. Condition 4.1 implies that asymptotically any column of X1 and X2 are not linearly dependent.

Condition 4.1 implies c(X1) ∩ c(X2) = ∅. If c(X1) ∩ c(X2) 6= ∅, then the largest eigenvalue of PX2 PX1 is 1, which implies if relevant variables could be represented as the linear combination of irrelevant variables, the condition cannot be satisfied. c(X1) ∩ c(X2) = ∅ could easily be satisfied in random design case when p < n. However when p n and each entry of X is generated from standard normal distribution, with probability 1 Condition 4.1 cannot be satisfied. We need to put other conditions on X and β in the high dimensional case, which will be illustrated in the next section.

∗ Theorem 4.2. Suppose p ≤ n and the rank of X is p. Denote minimal signal a = min ∗ |β |. n j:γj =1 j

η1 −α δ ∗ 2 η2 (δ−α)/2 If λn1 ∼ n , v0n ∼ n , v1n ∼ exp(n ), kβ k ∼ n . Suppose an n , δ > 3α − 2η1 + η2,

α < η1, η2 < η1. If additionally λmax(PX2 PX1 ) < 1 − c1 for all n and some positive constant c1, we have kXβ˜ − Xβ∗k2 ≤ q σ2 + o(1), where q denotes the size of the true model. E c1

Here we have a inflation factor c1 in the upper bound of L2 error. However, the growth rate should be of the order of q. c1 measures the correlation between X1 and X2. If the correlation is small, c1 is large, which could approach the best performance given by OLS estimator.

56 4.6 Concentration and Prediction in High Dimensional Case

In many real high-dimensional applications such as microarray data analysis, it is often the case that p n.

The number of variables are much larger than sample size. Typical conditions to prove selection consistency is to restrict the correlation among the variables, which could be satisfied by restricting the minimal nonzero

T T # T eigenvalue of X X since X X is not invertible when p > n. Let λn1 = λmin(X X) denote the smallest nonzero eigenvalue of XT X. One nice property about the minimal nonzero eigenvalue is that the minimal

nonzero eigenvalue of any positive semi-definite matrix is the lower bound for the minimal eigenvalue of any

full-rank principle sub-matrix. We will use this lemma in the proof since we need to control the minimal

eigenvalue of principal sub-matrices. The proof is easy so it is omitted here.

Lemma 4.1.A is not full rank and is positive semi-definite. A1 is any principle sub-matrix of A. If both # # A1 and A is not equal to O, then λmin(A1) ≥ λmin(A).

Besides this crucial eigenvalue property of matrices, we will use the following properties about matrix

in the proof of the following theorems. These results are used to bound the largest eigenvalues of various

matrices involved in the proof.

• AB and BA have the same nonzero eigenvalues.

• For symmetric matrices A and B, if A  B, λmax(A) ≥ λmax(B) and λmin(A) ≥ λmin(B).

• (Weyl’s Inequality) For symmetric matrices A and B, λmax(A+B) ≤ λmax(A)+λmax(B) and λmin(A+

B) ≥ λmin(A) + λmin(B).

• (Weyl’s Inequality for singular values) For complex matrices A and B, σmax(A + B) ≤ σmax(A) +

σmax(B).

• For positive semi-definite matrices A and B, λmax(AB) ≤ λmax(A)λmax(B) and λmin(AB) ≥ λmin(A)λmin(B).

Another major challenge in high dimensional regression is identifiability. Since we use deterministic

design matrix X and p n, β is not estimable. Motivated by Shao and Deng (2012), we project β onto

T T R(X), the row space of X. Denote X = PDQ , where P is an n×n orthogonal matrix satisfying P P = In,

T Q is a p × n matrix satisfying Q Q = In. Without loss of generality, through this section we assume the rank of X is equal to n. D is a n × n diagonal matrix with full rank. The projection of β∗ is

θ∗ = XT (XXT )−1Xβ∗ = QQT β∗.

57 We have Proposition 4.11 summarizing the relationship between θ∗ and β∗.

∗ ∗ T ∗ ∗ 2 ∗ ∗ Proposition 4.11. (1) θ ∈ R(X) and θ = QQ θ . (2)θ = arg minXβ=Xβ∗ kβk and kθ k ≤ kβ k. (3)

∗ ∗ Denote S ∗ = supp(β ). Decompose X corresponding to supp(β ): X = (X ∗ , X c ). If X ∗ has full β Sβ Sβ∗ Sβ ∗ ∗ column rank and C(X ∗ ) ∩ C(X c ) = ∅, θ = β . Sβ Sβ∗

Proof. (1) Because of the property of projection, θ∗ ∈ R(X) and θ∗ = QQT θ∗.

(2) Denote Lagrangian Multiplier L(β) = kβk2 + λT X(β − β∗). We have the optimal β should satisfy

T ∗ 2 2β + X λ = 0. Therefore the optimal β is in R(X). We have θ = arg minXβ=Xβ∗ kβk .

∗ ∗ T T T (3) Without loss of generality suppose Sβ∗ = {1, 2, ··· , qβ∗ }. β = ((β1) , 0 ) (The first qβ∗ elements of ∗ ∗ ∗ T ∗ T T ∗ ∗ β are nonzero, others are zero) and θ = ((θ1) , (θ2) ) (θ is decomposed the same way as β ). We have ∗ ∗ ∗ ∗ ∗ ∗ 2 X ∗ (β − θ ) = X c θ . Since X ∗ has full column rank, θ = β . θ = 0 because of L minimization. Sβ 1 1 Sβ∗ 2 Sβ 1 1 2 ∗ ∗ Therefore with condition C(X ∗ ) ∩ C(X c ) = ∅, we have θ = β . Sβ Sβ∗

θ∗ might be different from β∗. However, from (2) we could see θ∗ is more sparse than β∗ in L2 sense.

Instead of estimating β∗, we aim to estimate θ∗ because of identifiability. One sufficient condition to

∗ ∗ guarantee θ = β is C(X ∗ ) ∩ C(X c ) = ∅. However, in high dimensional random design case, for Sβ Sβ∗ example, each entry of X is generated from standard normal distribution, with probability 1 C(XSβ∗ ) ∩ ∗ ∗ C(XSβ∗ ) = ∅ cannot hold if p n. Therefore θ and β will be different. θ∗ will be a dense vector with large probability in random design case. However, we could divide the predictor set into 2 subsets according to the magnitude of entries of θ∗ instead of whether they are zero

∗ c ∗ or not: the strong signal set Sθ∗ = {j : |θj | ≥ an} and the weak signal set Sθ∗ = {j : |θj | ≤ bn}.

There is a gap between an and bn which will be specified later. Without loss of generality, we assume

∗ c ∗ Sθ = {1, 2, ··· , qSθ∗ } and Sθ∗ = {qSθ∗ + 1, ··· , p}. We will suppress subscript Sθ for notation simplicity ∗ ∗ ∗ ∗ ∗ when there is no ambiguity. We decompose θ = (θ1, θ2) according to S: θ1 and θ2 represent the strong signal coefficients and weak signal coefficients. X is also factorized as the matrix of relevant variables and irrelevant variables corresponding to S: X1 is n × q matrix and X2 is n × (p − q) matrix. We assume q ≤ n and rank of X1 is equal to q, rank of X2 is equal to r, r ≤ n. Let the corresponding true model index

∗ T T T γ = (1q , 0p−q) . Theorem 4.3 demonstrates concentration property.

δ # T η1 −α δ Theorem 4.3. Suppose p ≥ n, log p = o(n ), λn1 = λmin(X X) ∼ n , v0n ∼ n , v1n ∼ exp(n ),

∗ 2 η2 (δ−α)/2 −α/2 ∗ (δ−α)/2 kθ k ∼ n . If an n , bn ≺ n , kθ2k ≺ n , 1 ≥ η2 + α > δ > 3α − 2η1 + η2 and α ≤ η1, ∗ p α 2 ∗ 2 2 then Πn(B(θ , Mn)) → 1 with radius satisfying q/n = o(n) and kθ2k = o(n).

(δ−α)/2 −α/2 ∗ 2 η2 Remark 4.1. We assume an n , bn ≺ n and kθ k ∼ n . These imply α + η2 ≥ δ since

2 ∗ 2 ∗ 2 2 η2−δ+α 1−δ an  kθ k . Meanwhile, qSθ∗  kθ k /an ∼ n ≺ n . qSθ∗ is small compared with sample size

58 α 2 n. qSθ∗ /n = o(n) is easy to be satisfied. We implicitly control the sparsity level qSθ∗ but does not put

restriction on qSβ∗ .

Compared with the low dimensional case, first we explicitly show that the growth rate of p is related

with δ, the growth rate of v1. Meanwhile, there are also more restrictions on n since it is more difficult to

control V in the high dimensional case than the low dimensional case. The bound on an and bn shows there should be a gap between noisy part and signal part.

Two-stage ridge regression could also dramatically reduce the bias. The magnitude of bias and variance is shown in Table 4.2:

(0) (0) bias(m ) λmax(Var(m )) bias(m) λmax(Var(m))

α−η1+η2/2 −η1 δ −η1+η2/2 ∗ −η1 Order O(n ) O(n ) O(exp(−n )n ) + O(kθ2k) O(n )

Table 4.2: Compare Thresholded Ridge Regression (Shao et al., 2011) and Two-stage Ridge Regression. Bias and variance of m(0) and m when p ≥ n

Second we move on to prediction. We construct θ˜ based on the distribution estimator in the same way

in the lower dimensional case:

θ˜ = γˆ ◦ m.

In the dimensional case Condition 4.1 λmax(PX2 PX1 ) < 1 − c1 is too strong to be satisfied: consider the

random design case where each entry of X comes from an independent normal distribution. X2 has rank of

n since p n. Therefore PX2 = In and λmax(PX2 PX1 ) = 1. Some other mild conditions should be imposed instead.

∗ We could still compute the growth rate of EkXθ˜ −Xθ k2 with a little additional price. The price we need

1−η1 T η1 to pay is the additional multiplier of n . Since λmin(X X) ∼ n . If all the features are less correlated, the multiplier is smaller. Meanwhile, we need to bound the largest eigenvalue of XT X and the strength of

∗ noise part kX2θ2k. However the bound is not too stringent.

T # T η1 2 Theorem 4.4. Suppose p ≥ n, denote λnp = λmax(X X), λn1 = λmin(X X) ∼ n , suppose λnpp =

δ+η1−α −α δ ∗ 2 η2 (δ−α)/2 −α/2 o(exp(n )), v0n ∼ n , v1n ∼ exp(n ), kθ k ∼ n . If an n , bn ≺ n , 1 > α + η2 > δ >

3α − 2η1 + η2, α < η1. We have

˜ ∗ 2 1−η1 2 η2−2α δ 2 ∗ 2 EkXθ − Xθ k  qn σ + n exp(−2n )λnp + kX2θ2k .

1−δ ˜ ∗ 2 Recall that qSθ∗ ≺ n , the upper bound of EkXθ − Xθ k is tight. λnp ≥ p intrinsically since the diagonal elements of XT X are normalized to be n. p could grow as large as exp(nδ). Both Frequentists’

59 work (such as Zhang and Huang (2008),Bickel et al. (2009)) and Bayesian’s work (such as Castillo et al.

(2015), Martin et al. (2017)) indicate with large probability kXθ˜ − Xθ∗k2 is controlled by a quantity of the order q log p, where q is the cardinality of signal part. The price log p is paid for detecting the true model.

1−η1 Our multiplier n is smaller than log p if 1 − η1 < δ when p attains the fastest growth rate. Therefore our multiplier is comparable with multipliers of other methods.

∗ The leading term of EkXθ˜ − Xθ k2 should be the variance term, which is bounded by qn1−η1 σ2, while

˜ ∗ 2 η2−2α δ 2 ∗ 2 ∗ 2 the bias term of EkXθ − Xθ k is bounded by n exp(−2n )λnp and kX2θ2k . kX2θ2k is further bounded by p p ∗ 2 X ∗ 2 X ∗ 2 kX2θ2k ≤ ( |θj | · kxjk) = n( |θj |) . j=q+1 j=q+1

We have the following corollary to guarantee variance is the leading term:

Pp ∗ 2 −η1 2 1−η1−η2+2α δ Corollary 4.1. Given the above condition, if additionally ( j=q+1 |θj |) ≺ qn and λnp ≺ qn exp(2n ), we have

∗ 2 1−η 2 EkXθ˜ − Xθ k  qn 1 σ .

4.7 Conclusion

In this chapter we proposed a two-stage ridge regression which could reduce the bias compared with ridge regression estimator proposed in Shao and Deng (2012). Our method has intrinsical connection with EM algorithm when treating β as the latent variable in Wang et al. (2016). Two-stage ridge regression estimator could be extended to constructing a distribution estimator. We established concentration property of the distribution estimator in both low dimensional case and high dimensional case. Meanwhile, we compared the prediction error of our two-stage regression estimator with respect to the oracle procedure. Our esti- mator has comparable performance with oracle procedure up to a polynomial factor in sample size. In the future adaptive penalty could be extended to various branches, such as Gaussian sequence model and sparse principal component analysis.

60 Chapter 5

A Bayesian Approach for Sparse Principal Components Analysis

5.1 Introduction

Dimension Reduction is an important technique for big data analysis to obtain the signal and get rid of the noise. Big data often involves ultra-high dimension. For instance, in gene expression data analysis, we only have hundreds of observations but each observation contains thousands of genes. We need to find a small number of features to represent the variation of the whole dataset without much information loss.

Suppose we have a data matrix which contains n samples. Each observation contains p variables. Princi- pal Component Analysis (PCA) is done by eigenvalue decomposition of sample covariance matrix or sample correlation matrix. Each eigenvector serves as the weights of one principal component. The success of

PCA is based on the assumption that sample covariance matrix is a good approximation of the population covariance matrix. However, in large-p-small-n setup, sample covariance matrix could no longer be a con- sistent estimator of true covariance matrix. This is illustrated by studying the distribution of eigenvalues of sample covariance matrix via convergence of Empirical Spectral Distribution (Marcenko and Pastur, 1967) and studying the limiting distribution of the largest eigenvalue (Johnstone, 2001; Paul, 2007). Based on the inaccurate sample covariance matrix, we could not get a consistent principal component (Johnstone and ,

2001). Sparsity assumptions of true covariance matrix must be imposed.

In classical PCA all the loadings are nonzero. In high dimensional settings when p is large, this causes the problem of interpretation. Sparse PCA aims to find the principal components with a small number of nonzero loadings, which could not only improve interpretation but also improve estimation accuracy of eigenvectors of covariance matrix based on appropriate assumptions.

One natural approach to find sparse principal components is thresholding. Johnstone and Lu (2001) selected a subset of variables via thresholding sample variance, then applied standard PCA. Shen et al.

(2013) considered thresholding sample leading eigenvector. Ma (2013) obtained principal subspace via an additional thresholding step in orthogonal iteration.

Another branch of methods involves semi-definite programming, such as d’Aspremont et al. (2005),

61 Amini and Wainwright (2009), Vu et al. (2013) and et al. (2015). All of works modified the classical

variational representation of the largest eigenvalue of sample covariance matrix and formulate a semi-definite

programming (SDP) problem. Compared with thresholding diagonal elements, SDP has weaker conditions

on the growth rate of sample size to achieve consistency but the computational complexity is higher.

A third family of methods is via regression. As pointed by Zou et al. (2006), Shen and Huang (2008),

sparse PCA could be written as a regression-type optimization problem with L1 and L2 penalty via low rank approximation. To impose sparsity, we over-parametrize the optimization problem. It turns out an alternating least squares methods can be used to obtain principal components.

In terms of asymptotics, Johnstone and Lu (2001) focused on both estimation consistency in terms of inner product distance metric and selection consistency. Ma (2013) addressed estimation consistency for principal subspaces. Shen et al. (2013) considered high dimendion, low sample size contexts when p → ∞ while n is fixed. Regarding to SDP methods, selection consistency is shown using Primal-Dual Witness methods (Amini and Wainwright, 2009; Lei et al., 2015).

In Bayesian literature Tipping and Bishop (1999) proposed probabilistic PCA from Gaussian latent variable model, which re-established the link between factor analysis and PCA. In recent years, various sparsity induced continuous priors, such as double exponential and mixture of normals, are proposed to introduce sparsity in probabilistic PCA, such as Guan and Dy (2009), Nakajima et al. (2015). Inference is based on EM algorithm or variational inference algorithms. However, the estimated loadings from the aforementioned Bayesian methods are not exactly sparse and do not cover theory about consistency. Recently

Gao and Zhou (2015) established posterior concentration for estimating a sparse principal subspace under the optimal rate. However, selection consistency for Bayesian sparse PCA methods is still an open question.

In this chapter we propose a Bayesian approach to sparse PCA, in which sparsity is induced by spike- and-slab priors. An efficient Variational Bayesian algorithm is proposed to compute the sparse eigenvector.

The algorithm can be extended to retrieve multiple eigenvectors. A theoretical analysis is also provided for the spiked covariance model: with a proper choice of the hyper-parameters and initial values, our algorithm can select the correct support set of the top eigenvector with probability going to one. In section 5.2 we introduce the algorithm while in section 5.3 we establish selection consistency.

5.2 Hybrid of EM and Variational Inference Algorithm

The algorithm was originally developed in Hu (2017). This chapter mainly focuses on theoretical properties of the algorithm. We re-state the algorithm in this section mainly for readers’ convenience.

62 Sparsity Pattern of principal components could also be found using Bayesian methods. We assume only sample covariance matrix A is given. We start with the following Gaussian approximation of sample covariance matrix:

A ≈ uvT + E, (5.1)

p 2 where u, v ∈ R and E = (eij)1≤i,j≤p, eij ∼ N(0, σ ). For identifiability, we require v to be a unit vector, T 2 i.e., kvk2 = 1. The rationale behind (5.1) is that the MLE of u and v, which minimize kA − uv kF , are the 2 P 2 unnormalized and normalized eigenvector of A, where kAkF = i,j Aij denotes Frobenius norm of matrix A.

The major inference is on two vectors v and u. We restrict v to be a unit vector: kvk2 = 1. We specify

p−1 p a uniform prior on v on p − 1 dimensional unit sphere S = {x ∈ R : kxk2 = 1}. Then we incorporate sparsity pattern through the spike-and-slab prior over elements of u. The hierarchical prior specification is summarized as follows:

w ∼ Beta(a0, b0),

2 σ ∼ Inv-Gamma(ν0, λ0),

i.i.d. Zi ∼ Bern(w), i = 1, 2, ··· , p   2 N(0, a1nσ ),Zi = 1 ui|Zi ∼ , i = 1, 2, ··· , p  δ0,Zi = 0

v ∼ Unif(Sp−1),

2 Aij ∼ N(uivj, σ ), i, j = 1, 2, ··· , p

where ν0 and λ0 are fixed to be 1 if no prior information is given. a1n depends on sample size n. Denote p the sparsity index Z = (Zi)i=1. We use a refined variational algorithm to approximate the posterior π(u, Z, v, σ2, ω|A): we are interested in the posterior distribution of (u, Z), while the hyper-parameters are Θ = (v, σ2, ω). The objective function is

Q(u, Z) Ω(q , ..., q , σ2, ω, v) = q1,...,qp log , (5.2) 1 p E π(u, Z, v, σ2, ω|A)

63 where we assume Q(u, Z) has the factorization form

p p Y Y  Zi  1−Zi Q(u, Z) = (ui,Zi) = αifi(ui) (1 − αi)δ(ui) , i=1 i=1 where fi(ui) is a probability density and δ(·) is the dirac function at 0. We propose a hybrid of EM and variational inference algorithm since it is difficult to obtain the optimal approximate distribution of Θ: for hyper-parameter Θ, we plug in the estimates; meanwhile, we approximate the posterior of (u, Z) by Q(u, Z).

Our algorithm iteratively solves Q and Θ.

• Update qi(ui,Zi).

2 It can be shown that the optimal choice of fi(ui) in qi(ui,Zi) given Θ is normal distribution N(µi, σi ), where Pp 2 Aijvˆj σˆ µ = j=1 , σ2 = i 1 + 1 i 1 + 1 a1n a1n

and the posterior probability αi = P(Zi = 1|A) is updated via:

α ωˆ µ2 1 a σ2 log i = log + ui − log 1n . (5.3) 1 − α 1 − ωˆ 2σ2 2 σ2 i ui ui

t 2 2 2 t In the matrix form, let µu = (µ1, ..., µp) and σu = (σ1, ..., σp) , then

Avˆ σˆ2 µ = , σ2 = 1 . u 1 + 1 u 1 + 1 p×1 a1n a1

We also employ the following partial updating scheme for αis: at the t-th iteration where t > 1, if (t−1) (t−1) (t−1) (t−1) αi takes extreme values, e.g., if αi < cn or αi > 1 − cn (equivalently, logit(αi ) < −Cn (t−1) −1 (t−1) or logit(αi ) > Cn, where cn = (1 + exp(Cn)) ), we do not update αi any more in the future (t−1) (t−1) iterations. That is, at the (t − 1)-th iteration, we define set A(t) = {i : min(αi , 1 − αi ) > cn} = (t−1) (t−1) {|logit(αi )| ≥ Cn}, and then only update αi s via (5.3) for i from the active set A(t).

• Update Θ. Estimator of v is given by A Q(u) v = E , kAEQ(u)k

Q T where E (u) = (α1µ1, ..., αpµp) denotes the expectation of u with respect to the Q distribution.

The updating formulas for other parameters are given by

p P αi + a0 ωˆ = i=1 , p + a0 + b0

64 p Q P 2 1 P 2 2 (Aij − uivˆj) + αi(µ + σ ) + 2 E a1n ui ui 2 i,j i=1 σˆ = p . 2 P p + αi + 4 i=1

• Stopping rule. After we iteratively update all the parameters by the above formulas for several iter-

ations, we need to decide when to stop. We use the total entropy of the vector α = (α1, ··· , αp) to check whether the feature selection result converges,

p X   H(α) = − αi log αi − (1 − αi) log(1 − αi) . i=1

If the changing rate of entropy H(α) between two iterations is less then a pre-specified value the

algorithm will stop. Usually, the updating converges in a few iterations.

The algorithm is summarized in Algorithm 3.

Algorithm 3 Variational Inference Algorithm for Sparse PCA (0) 2 (0) (0) Require:v = (ˆv1, ..., vˆp), (σ ) , ω

1: while True do

2: for i in 1:p do Pp (t) 2 (t) (t+1) Aij v (σ ) 3: j=1 j 2 (t+1) µi ← 1 +1 , (σi ) ← 1 +1 a1n a1n (t) (t+1) −1 (t) (µ )2 2  4: α ← Logit log ω + i − 1 log a1nσ . i 1−ω(t) (t) 2 2 (t) 2 2(σi ) (σi ) 5: end for t Q (t+1) A (E u) 6: v ← t Q b kA (E u)k p P (t) αi +a0 7: ω(t+1) = i=1 , p+a0+b0 Q P (t+1) (t+1) 2 1 P (t+1) (t+1) 2 (t+1) 2 E (Aij −ui vj ) + a αi ((µi ) +(σi ) )+2 i,j 1n i=1:p 8: (σ(t+1))2 ← 2 P (t+1) p + αi +4 i=1:p 9: Until H(α(t+1)) converges

10: end while

11: Calculate ub

5.2.1 Two-stage Method

When p is very large (p n), directly applying Algorithm 3 is time consuming. Johnstone and Lu (2001) pro- posed an algorithm including a thresholding step which can reduce the number of variables before embarking on sparse PCA. Here we introduce a two-stage sparse PCA method as follows:

2 2 2 2 • First step: the diagonal elements of p by p sample covariance matrix A areσ ˆ1, σˆ2, σˆ3, ··· , σˆp. Order

65 2 2 2 2 them from the largest to the smallest:σ ˆ(1) > σˆ(2) > σˆ(3) > ··· > σˆ(p). Pick up k coordinates 2 2 2 2 corresponding to the largest k sample variance:σ ˆ(1) > σˆ(2) > σˆ(3) > ··· > σˆ(k). Denote the chosen

index set as SL = {i1, i2, ··· , ik}. The order of k will be specified in the next section, k ≤ n.

∗ • Second step: Let A = (,j : i, j ∈ SL) be the sample covariance matrix of the selected variables

∗ from first step, then we apply Algorithm 3 on the matrix A to compute ubi, i ∈ SL. The final estimate u is b   ubi, if i ∈ SL ub = .  0, if i∈ / SL

In the next section we will study the asymptotic behavior of our algorithm. Specifically, we will study

whether our algorithm will give us a consistent estimator of sparse principal component in terms of sparsity

pattern in various setups.

5.3 Theoretical Properties

The major goal for this section is to prove selection consistency. We assume the principal component is a

sparse vector and aim to prove via Algorithm 3 we could capture the sparsity pattern of this vector when

the principal component is high-dimensional. We need to find the single principal component ρ for p by p

large covariance matrix. We assume p → ∞. We consider the following three cases:

1. p/n → 0;

2. p/n → c where 0 < c < 1;

3. log p/nη → 0, where η is some positive constant less than 1.

Throughout this section, we consider Single Spike Covariance Model introduced by Johnstone and Lu

(2001): we assume each p-dimensional observation xi has the following structure:

xi = ciρ + ξwi, i = 1, 2, ..., n, (5.4)

T p i.i.d. i.i.d. where ρ = (ρ1, ..., ρp) ∈ R is the single component to be estimated, ci ∼ N(0, 1) and wi ∼ Np(0, Ip) are the independent p-dimensional noise vector.

n × p data matrix X could be written as

T T Xn×p = cn×1ρ + ξW ,

66 T where c = (c1, c2, ..., cn) and Wp×n = (w1, w2, ..., wn). We assume the mean of xi is 0 and known.

1 T T Consider the sample covariance matrix A = n X X, whose expectation is E(A) = A0 = ρρ + ξIp. The task is to estimate the single eigenvector ρ corresponding to the largest eigenvalue kρk2 + ξ2.

We assume p, n → ∞ and kρk → % where % is a positive constant. Denote ρ∗ = ρ/kρk to be the

c normalized leading eigenvector. Let S denote the support set of ρ, i.e., S = {i : ρi 6= 0} and and S = {i :

ρi = 0}. Without loss of generality, assume S = {1, . . . , s}. Assume s/n → 0. We prove by running VB algorithm only once we could achieve selection consistency of the principal

component, which dramatically saves computation time compared with other algorithms. Initial values of

Θ = (v, σ2, w) matter for fast convergence. First, prior sparsity level w(0) is set to be any constant between

0 and 1. Second, v(0) = vˆ. The initial value of v(0) is the normalized sample eigenvector corresponding to

the largest eigenvalues of A because the sample eigenvector is consistent when p/n → 0.

Third, the order of (σ(0))2 also matters. (5.1) specifies the working likelihood, while data is generated

from Single Spike Covariance Model (5.4). By Bernstein Inequality we have E = A − E(A) has the bound p (0) 2 in maximal norm: kEkmax = maxij |eij| = op( log p/n). Therefore we assume (σ ) ∼ log p/n, which is an upper bound for the true variation of the error matrix.

5.3.1 The general case

Recall that at each iteration, given the current values of (ω, σ2, µ, v), we update the inclusion probability

αi for the i-th dimension via

2 2 αi ω µi 1 a1nσ log = log + 2 − log 2 1 − αi 1 − ω 2σi 2 σi 2 ω (Av)i 1 = log + 2 − log(a1n + 1), (5.5) 1 − ω 2(1/a1n + 1)σ 2

where

T Av µ = (µ1, ··· , µp) = . (5.6) 1/a1n + 1

We first show that the if the current values of (ω, σ2, v) satisfy some conditions, then with a proper choice of

a1n, after updating αi’s via (5.5), there is a gap between mini∈S log αi/(1 − αi) and maxi∈Sc log αi/(1 − αi), which is big enough for us to separate the relevant and irrelevant dimensions.

2 2 T 2 Proposition 5.1. Let λ = kρk + ξ be the largest eigenvalue of ρρ + ξ Ip. Suppose at the t-iteration,

67 w(t) is a constant in (0, 1), σ(t) ∼ p(log p)/n, and h = Avˆ(t) − λρ∗ satisfy the following two conditions:

hi (C1) : max ∗ ≤ bn < 1, (5.7) i∈S λρi r ! log p (C2) : max |hi| = Op . (5.8) i∈Sc n

Then if a1n satisfies

2 1 log p log a1n a1n → ∞, min ρi 2 , (5.9) i∈S (1 − bn) n

after updating αi via (5.5), with probability going to 1, we have

αi αi max log > Cn and min log < −Cn, i∈S 1 − αi i∈Sc 1 − αi

where Cn = κ log a1n and 0 < κ < 1/2.

Conditions (C1) and (C2) ensure that vˆ, the estimate of ρ∗, should not be too far away from the truth, otherwise there is no hope for us to pick the relevant dimensions in the next iteration. If bn is bounded

2 by a constant D less than 1, then we only need to require the minimal signal mini∈S ρi to be larger than

(log p log a1n)/n (which is true when p/n → 0 and p/n → c). To guarantee bn < D < 1, we need to add

2 a new condition mini∈S ρi (s log p)/n, which is illustrated in the proof of main theorems. Otherwise the minimal signal needs to be of a larger order if bn converges to 1. Proposition (5.1) is easy to prove. Since

2 αi ω (Avˆ)i 1 log = log + 2 − log(a1n + 1), 1 − αi 1 − ω 2(1/a1n + 1)σ 2

the major quantity is |(Avˆ)i|. For i ∈ S,

∗ ∗ min |(Avˆ)i| = min |λρi + hi| ≥ min ||λρi | − |hi|| i∈S i∈S i∈S

∗  |hi|  ∗ ≥ min |λρi | 1 − max ∗ ≥ min |λρi |(1 − bn). i∈S i∈S |λρi | i∈S

Since σ(t) ∼ p(log p)/n, we need to put the above minimal signal condition to guarantee

2 ∗ 2 2 λ mini∈S(ρi ) (1 − bn) 1 log(a1n + 1). 2(1/a1n + 1) log p/n 2

68 c p For i ∈ S , since maxi∈Sc |hi| = Op( log p/n),

αi 1 max log ∼ − log(a1n + 1). i∈Sc 1 − αi 2

We could get the above gap between the logits.

For p/n → 0 case, we will show all the conditions in Proposition 5.1 are satisfied when we start with the sample eigenvector as the initial value for vˆ(0).

5.3.2 p/n → 0 case p/n → 0 corresponds to low dimensional case. However we still need to establish selection consistency because this consistency result will be used to deal with high dimensional case. First we need to control the spectral norm of the error matrix if p/n → 0. The key is to find the order of spectral norm of E, which is

summarized in Lemma 5.1.

Lemma 5.1. Assume p, n → ∞, p/n → c and kρk → %, where c and % are 2 constants, then almost surely

√ 2 √ lim sup kEk2 ≤ ξ c% + ξ (c + 2 c).

This lemma is originally stated and proved in Johnstone and Lu (2001). Therefore the proof is omitted p here. A direct conclusion of this lemma is kEk2 = Op( p/n). There is a natural intuition: when sample size n is large and p/n → 0, we expect the error term E to be small.

The sample eigenvector and true eigenvector should be close since the error matrix, which is the difference between true covariance matrix and sample covariance matrix, is small. This fact is summarized in Lemma

5.2. Remember that ρ∗ = ρ/kρk; the sample eigenvector of A is denoted as vˆ.

∗ p Lemma 5.2. Assume p, n → ∞, p/n → 0 and kρk → %, % is a positive constant, then kvˆ−ρ k = Op( p/n).

The proof is provided in the appendix. We compute µ via

Avˆ λˆvˆ µ = = ; 1/a1n + 1 1/a1n + 1

Remember that

α ω µ2 1 a σ2 ω λˆ2vˆ2 1 log i = log + ui − log 1n = log + i − log(a + 1), 1 − α 1 − ω 2σ2 2 σ2 1 − ω 2(1/a + 1)σ2 2 1n i ui ui 1n

ˆ the magnitude of λvˆi determines the magnitude of logit.

69 ˆ (t) After our algorithm converges after t-th iteration, the estimated support set is S = {i : logit(αi ) ≥ Cn}. p If q(Sˆ = S) → 1, we say our algorithm has Frequentist’s selection consistency. In Bayesian inference, remember we use a binary vector Z = (Z1,Z2, ··· ,Zp) to indicate whether ρi is equal to 0 and through variational inference we could approximate the true posterior distribution of Z by q(Z). Suppose Z∗ is the

∗ T T T true model index vector: Z = (1s , 0p−s) . Bayesian selection consistency requires the approximate posterior measure of Z at Z∗ converge to 1 in probability, that is,

∗ Y (t) Y (t) p q(Z = Z ) = αi (1 − αi ) → 1. i∈S i∈Sc

The aforementioned Frequentist’s consistency corresponds to that q(Z = Z∗) has the largest probability mass compared with other configurations of Z, while Bayesian consistency requires q(Z = Z∗) converges to

1 in probability, which is stronger than the frequentist consistency.

Theorem 5.1 shows using proper initial values and assuming minimal signal conditions, we could establish

Bayesian consistency and the algorithm will stop after one iteration with high probability. We suppress the iteration number t in the theorem for notation simplicity.

Theorem 5.1. Assume p, n → ∞, p/n → 0 and kρk → %, % is a positive constant. We choose initial

(0) (0) 2 (0) ∗ 2 values as follows: w is a constant in (0, 1); (σ ) ∼ log p/n; v = vˆ; a1n → ∞, suppose mini∈S(ρi ) ≥

(log p max(s, log a1n))/n → 0, Cn = κ log a1n where 0 < κ < 1/2, then after one iteration, we have q(Sˆ =

p κ ∗ p S) → 1 and the algorithm will stop with probability going to 1. In addition, if a1n p, we have q(Z = Z ) → 1, the algorithm attains Bayesian consistency.

Theorem 5.1 shows that if the signal is not too small, we could obtain correct sparsity pattern of the single principle component running the algorithm once. Theorem 5.1 is a direct conclusion from Lemma

5.2, and the proof is given in the appendix. By choosing proper a1n, we have Bayesian consistency, which is stronger than Frequentists’ consistency. Therefore in the remaining parts, we focus on Bayesian consistency.

5.3.3 p/n → c > 0 case p/n → 0 is a strong condition in high dimension. How’s the performance of our algorithm when the dimensionality gets larger? Motivated by Paul (2007), where the author mentioned a phase transition phenomenon for the leading eigenvector for Spike Covariance Model, we applied their theoretical results to show our VB algorithm still works in p/n → c > 0 case. Theorem 5.2 provides theoretical details.

Theorem 5.2. Assume p, n → ∞, p/n → c (0 < c < 1), kρk → % and % is a positive constant. If w(0)

(0) 2 (0) 2 2 √ ∗ 2 is a constant in (0, 1); (σ ) ∼ log p/n; v = vˆ; a1n → ∞; % /ξ > 1 + c. Suppose mini∈S(ρi ) ≥

70 κ (log p max(s, log a1n))/n → 0, Cn = κ log a1n where 0 < κ < 1/2 and a1n p, then after one iteration, we p have q(Z = Z∗) → 1 and the algorithm will stop with probability going to 1.

This theorem implies that by choosing proper a1n and initializing our algorithm via the sample eigen- vector, we could select nonzero entries of the leading eigenvector via Algorithm 3 consistently.

5.3.4 log p/nη → 0 case

It’s much more interesting to look at the performance of variational algorithm in high dimensional case. For

Theorem 5.1 or Theorem 5.2 to hold, in general we require p/n → c: the number of variables should grow

slower than sample size or grow at the same rate. This restricts the use of our algorithm. One direction

to enlarge the usage of our algorithm is to add a screening step. Assume the single principal component

is p−dimensional, among which there only exists s nonzero entries. s ≤ n. The first screening step is to pick k coordinates out of p, which with large probability will contain all of the s nonzero coordinates in the

first principal component. If k ≺ n, we could apply the consistency result in Theorem 5.1 to show selection consistency of our algorithm.

Denote the chosen index set after screening as SL = {i1, i2, ··· , ik}. Hopefully, SL will contain S =

{1, 2, ··· , s}. The cardinality of SL is k, which dramatically reduces the dimensionality. In the second stage,

∗ ∗ denote the principal sub-matrix of A induced by SL as A . Since A is k by k matrix and k/n → 0, we could apply our algorithm and use Theorem 5.1 to prove selection consistency. The theoretical justification is summarized in Theorem 5.3.

2 Theorem 5.3. Assume p, n → ∞ and kρk → %, % is a positive constant. Suppose mini∈S ρi = bn → 0. If 2 there exists a k ≥ s satisfying log(p − k) + log s = o(nbn), then after diagonal screening,

 3nb2   nb2  P (S S) ≤ s(p − k) exp − n + s exp − n → 0. L + 64ξ4 16ξ4

Then we combine Theorem 5.1 and Theorem 5.3 to obtain Theorem 5.4.

∗ 2 Theorem 5.4. Assume p, n → ∞ and kρk → %, % is a positive constant. Suppose mini∈S(ρi ) κ ∗ 2 p log k max(s, log a1n)/n, a1n k. If additionally mini∈S(ρi ) (log(p − k) + log s)/n, then after choosing k nonzero coordinates in the screening step and applying Algorithm 3 on k × k principal sub-matrix A∗, p q(Z = Z∗) → 1 and the algorithm will stop after one iteration with probability going to 1.

∗ 2 Theorem 5.4 is a direct corollary of Theorem 5.1 and Theorem 5.3. Since the conditions mini∈S(ρi ) ∗ 2 p log k max(s, log a1n)/n and mini∈S(ρi ) (log(p − k) + log s)/n may not seem intuitive, we list a special sparse situation here.

71 γ δ If we have a very sparse principal component, s ≺ log a1n, suppose log a1n ∼ n and s ∼ n , where

η ∗ 2 1 > γ > 1/2 and δ < γ, define η = 2γ − 1. then if log p = o(n ), the condition mini∈S(ρi ) p ∗ 2 (log(p − k) + log s)/n is satisfied. The minimal signal condition turns out to be mini∈S(ρi ) log k 1−γ 1−γ log a1n/n ∼ log k/n . Intrinsically we need (s log k)/n = O(1), therefore δ < 1 − γ. To sum up, if

γ δ η log a1n ∼ n and s ∼ n , where 1 > γ > 1/2 and δ < 1 − γ, define η = 2γ − 1. If log p = o(n ) and

∗ 2 1−γ mini∈S(ρi ) log k/n , Algorithm 3 could achieve selection consistency after the screening step. In terms of the requirements of the sample size, our algorithm has the similar conditions as Johnstone and

Lu (2001): s2 log(p−k)/n → 0 is required. For semi-definite programming (SDP) methods, s log(p−s)/n → 0

is required. SDP has weaker condition for sample size n. However, applying SDP methods is time-consuming

but our method is more efficient.

5.4 Conclusion and Discussion

In this chapter we proposed a variational algorithm to compute the first sparse principal component corre-

sponding to Single Spike Covariance Model. Theoretically we proved our algorithm achieves Frequentist’s

model selection consistency and Bayesian selection consistency in various combinations of p and n.

Even though Single Spike Covariance Model is a good start, it is too simple to be true in real data

analysis. If we are asked to find multiple principal components, we could iteratively find eigenvectors via

Algorithm 3, then project A to the orthogonal space of the previous eigenvectors and re-apply the algorithm.

A more efficient way is to generalize Gaussian approximation methods to find principal subspace. This is

our future direction.

72 Appendix A

Supplemental Material for Chapter 2

A.1 Proof for Lemma 2.1

Denote w = (w1, w2, ··· , wT ) and m = (m1, m2, ··· , mT ). Write Dn in terms of the conditional prior

(θ1, θ2, ··· , θn)|w ∼ Πw,w,m and the marginal prior w ∼ Π. Therefore

Z 1 n Z Y pθi (Xi) κ Dn = { } Πw,w,m(dθi)Π(dw); pθ∗ (Xi) 0 i=1 R i

(X −θ )2 where p (X ) = √1 exp(− i i ) is normal probability density function. For given w, the inner expec- θi i 2π 2 tation involves an average over all configurations of the indicators (1θ1=0, 1θ2=0, ··· , 1θn=0). This average is clearly larger than just the case where the indicators exactly match up with the support S∗ of θ∗, times the probability of that configuration, that is,

1 T Z Z κ ∗ 2 2 w 1 2 n−sn sn Y {(Xi−θ ) −(Xi−θi) } X t − (mt−θi) Dn > w (1 − w) Π(dw) e 2 i √ e 2σ2 dθi. 2 0 i∈S∗ R t=1 2πσ

For each i ∈ S∗ and each 1 ≤ t ≤ T ,

Z κ ∗ 2 2 w 1 2 {(Xi−θ ) −(Xi−θi) } t − (mt−θi) e 2 i √ e 2σ2 dθi 2 R 2πσ κ ∗ 2 Z κ 2 w 1 2 (Xi−θ ) − (Xi−θi) t − (mt−θi) = e 2 i e 2 √ e 2σ2 dθi 2 R 2πσ 2 κ (X −θ∗)2 wt κ(Xi − mt) = e 2 i i √ exp(− ). κσ2 + 1 κσ2 + 1

73 Therefore we have

T Z κ ∗ 2 2 w 1 2 Y {(Xi−θ ) −(Xi−θi) } X t − (mt−θi) e 2 i √ e 2σ2 dθi 2 i∈S∗ R t=1 2πσ T 2 Y κ (X −θ∗)2 X wt κ(Xi − mt) = {e 2 i i √ exp(− )} 2 κσ2 + 1 i∈S∗ t=1 κσ + 1 T 2 2 − sn Y κ (X −θ∗)2 Y X κ(Xi − mt) = (κσ + 1) 2 {e 2 i i } { w exp(− )}. t κσ2 + 1 i∈S∗ i∈S∗ t=1

Since

T 2 2 Y X κ(Xi − mt) Y κ(Xi − mt ) { w exp(− )} ≥ {w exp(− i )} t κσ2 + 1 ti κσ2 + 1 i∈S∗ t=1 i∈S∗ P 2 Y ∗ (Xi − mti ) = ( w ) × exp(−κ i∈S ). ti κσ2 + 1 i∈S∗

γ −η ∗ Recall that sn = O(n ) where γ < 1, we have εn = sn log(n/sn) ∼ sn log n. Besides, mini∈S wti ≥ Cn ,

Q −η sn we have i∈S∗ wti ≥ (Cn ) = exp(log C · sn − K∗εn), where K∗ is a constant independent with n. P 2 ∗ ∗ By Condition 2.2, we have i∈S∗ (Xi − mti ) ≤ K εn with probability going to 1 where K is a constant independent with n. Hence

T 2 Y X κ(Xi − mt) κ { w exp(− )} ≥ exp(log C · s − K ε − K∗ε ). t κσ2 + 1 n ∗ n κσ2 + 1 n i∈S∗ t=1

By Law of Large Numbers,

Y κ {(X −θ∗)2} κ P (X −θ∗)2 κsn +o (s ) e 2 i i = e 2 i∈S∗ i i = e 2 p n . i∈S∗

Therefore

T Z κ ∗ 2 2 w 1 2 Y {(Xi−θ ) −(Xi−θi) } X t − (mt−θi) e 2 i √ e 2σ2 dθi 2 i∈S∗ R t=1 2πσ

2 − sn κ ∗ ≥ (1 + κσ ) 2 exp(−(K + K )ε − o (ε )). ∗ κσ2 + 1 n p n

1 R n−sn sn Using the same trick in the proof of Lemma 1 in Martin and Walker (2014), 0 w (1 − w) Π(dw) is α lower bounded by 1+α exp[−2εn − αsn + o(sn)] as n is sufficiently large. Putting these pieces together, we

74 obtain

α κ α D ≥ exp{−(2 + K + K∗)ε − o (ε )} = exp{−K ε − o (ε )}, n 1 + α ∗ κσ2 + 1 n p n 1 + α 0 n p n

κ ∗ where K0 = 2 + K∗ + κσ2+1 K .

A.2 Proof for Theorem 2.1

∗ The main aim of the proof is to show the numerator, for sets An away from θ , is not too large. Let Nn be

the numerator for Qn(AMεn ), i.e.,

Z 1 Z n Y pθi (Xi) κ  Nn = { } Πw,w,m(dθi) Π(dw). pθ∗ (Xi) 0 AMεn i=1 i

Taking expectation of Nn, with respect to Pθ∗ , we get

Z 1 Z n Z Y pθi (xi) κ  ∗ (N ) = { } p ∗ (x )dx Π (dθ ) Π(dw). Eθ n θi i i w,w,m i pθ∗ (xi) 0 AMεn i=1 R i

n Here we can exchange the integral with respect to xi and θi since Πw,w,m does not depend on X .

R pθi (xi) κ κ(1−κ) ∗ 2 Since { } pθ∗ (xi)dxi = exp{− (θi − θ ) }, we have pθ∗ (xi) i 2 i R i

Z 1 Z n Y κ(1 − κ) ∗ 2 ∗ (N ) = exp{− (θ − θ ) } Π (dθ ) Π(dw). Eθ n 2 i i w,w,m i 0 AMεn i=1

∗ 2 Remind that AMεn = {θ| kθ − θ k > Mεn}, we have

κ(1 − κ)M ∗ (N ) ≤ exp{− ε }. Eθ n 2 n

κ(1−κ) Let 2 = c. Next, take M such that cM > K0, and then take K1 ∈ (K0, cM), then using Markov

−K1εn −(cM−K1)εn Inequality we get Pθ∗ (Nn > e ) ≤ e . This upper bound has a finite sum over n ≥ 1, so the

−K1εn Borel-Catelli lemma gives that Nn ≤ e with Pθ∗ -probability 1 for all large n. Therefore we get

Nn 1 + α ≤ e−(K1−K0)εn−op(εn). Dn α

∗ Hence Qn(AMεn ) → 0 as n → ∞ with Pθ -probability 1.

75 A.3 Proof for Theorem 2.2

Similarly, let Nn be the numerator for Qn(AMεn ), i.e.,

Z 1 Z n Z Y pθi (Xi) κ Nn = { } Πw,w,m(dθi)Π(dw). pθ∗ (Xi) 0 AMεn i=1 R i

Taking expectation of Nn, with respect to Pθ∗ , we get

Z 1 Z n Z Y pθi (Xi) κ ∗ (N ) = { } Π (dθ )p ∗ (x )dx Π(dw). Eθ n w,w,m i θi i i pθ∗ (Xi) 0 AMεn i=1 R i

Write Jw(dθi) for the measure defined in the i-th product term. Split this into discrete and continuous pieces:

Z pθi (Xi) κ Jw(dθi) = { } Πw,m,w(dθi)pθ∗ (xi)dxi p ∗ (X ) i R θi i Z pθi (Xi) κ = w{ { } pθ∗ (xi)dxi}δ0(dθi) p ∗ (X ) i θi i Z T 2 pθi (Xi) κ X wt (θi − mt) + (1 − w){ { } ( √ exp(− ))p ∗ (x )dx }dθ . 2 θi i i i pθ∗ (Xi) 2 2σ i t=1 2πσ

κ(1−κ) ∗ 2 For the discrete part, it simplifies to w exp{− 2 (θi − θi ) }δ0(dθi). For the continuous part, the upper bound is

Z T 2 pθi (Xi) κ X wt (θi − mt) { } ( √ exp(− ))p ∗ (x )dx 2 θi i i pθ∗ (Xi) 2 2σ i t=1 2πσ 1 Z κ(X − θ )2 (1 − κ)(X − θ∗)2 ≤ exp{− i i − i i }dx 2πσ 2 2 i

1 κ(1 − κ) ∗ 2 = √ exp{− (θi − θ ) } 2πσ 2 i 1 1 = exp{− (κ(1 − κ) − )(θ − θ∗)2}N(θ |θ∗, σ2). 2 σ2 i i i i

2 1 ∗ 2 Since σ > (1−κ)κ , the coefficient on (θi − θi ) is negative. Therefore we could find a constant c > 0, depending on (κ, σ2), such that

∗ 2 −c(θi−θi ) ∗ 2 Jw(dθi) ≤ e {wδ0(dθi) + (1 − w)N(θi|θi , σ )dθi}.

n Qn ∗ 2 Therefore Jw(dθ) ≡ i=1 Jw(dθi) is upper bounded by exp{−ckθ − θ k } times a valid probability

76 measure. Therefore by definition of AMεn , we have

Z 1 Z n −cMεn Eθ∗ (Nn) = Jw(dθ)Π(dw) ≤ e 0 AMεn

Next, take M such that cM > 2, and then take K1 ∈ (2, cM), then using Markov Inequality we get

−K1εn −(cM−K1)εn Pθ∗ (Nn > e ) ≤ e . This upper bound has a finite sum over n ≥ 1, so the Borel-Catelli

−K1εn lemma gives that Nn ≤ e with probability 1 for all large n. Therefore we get

Nn 1 + α ≤ e−(K1−K0)εn−op(εn). Dn α

∗ Qn(AMεn ) → 0 as n → ∞ with Pθ -probability 1.

A.4 Variational Inference Algorithm Derivation

2 We will derive the variational inference algorithm for Dirichlet process mixture model. α0, T , w0, σ0 and the data vector Xn is given in advance. The data generating process is summarized in 2.7. We treat Z as latent variables and V, η∗, ξ as parameters. Posterior distribution of all the parameters and latent variables is proportional to

∗ n n ∗ ∗ n ∗ P (Z, V, η , ξ|X ) ∝ P (X , Z, V, η , ξ) = P (ξ|w)P (V|α0)P (η |ξ)P (Z|V)P (X |Z, η )

T −1 PT ξt PT Y Y t=1 T − t=1 ξt α0−1 ∗ ∝ w0 (1 − w0) (1 − Vt) δ0(ηt )· t=1 t:ξt=1 T ∗ 2 Pn Y 1 (ηt ) Y i=1 1Zi=t √ exp(− 2 ) · πt · 2πσ0 2σ0 t:ξt=0 t=1 ! Pn PT (X − η∗)21 exp − i=1 t=1 i t Zi=t . 2

Recall that under the fully factorized variational assumption, we have

∗ ∗ q(Z, V, η , ξ) = qp,m,τ (η , ξ)qγ1,γ2 (V)qΦ(Z).

77 ∗ Define P (Zi = t) = φi,t. First we find the optimal form of q(η , ξ), which satisfies

PT ξt PT Y ∗ E t=1 T − t=1 ξt ∗ log q(η , ξ) = V,Z[log(w0 (1 − w0) δ0(ηt )· t:ξt=1 ∗ 2 Pn PT ∗ 2 Y 1 (ηt ) i=1 t=1(Xi − ηt ) 1Zi=t √ exp(− 2 ) exp(− ))] + const 2πσ0 2σ0 2 t:ξt=0 T X q (η∗)2 = [1 (log w + log δ (η∗)) + 1 (log(1 − w ) − log 2πσ2 − t ) ξt=1 0 0 t ξt=0 0 0 2σ2 t=1 0 Pn φ (X − η∗)2 − i=1 i,t i t ] + const 2 T Pn 2 X φi,tX = [1 (log w + log δ (η∗) − i=1 i ) ξt=1 0 0 t 2 t=1 q ∗ 2 Pn ∗ 2 2 (ηt ) i=1 φi,t(Xi − ηt ) + 1ξt=0(log(1 − w0) − log 2πσ0 − 2 − ) + const] 2σ0 2 T X ∗ ≡ log q(ξt, ηt ); t=1

Pn 2 ∗ 2 ∗ ∗ i=1 φi,tXi p 2 (ηt ) where log q(ξt, ηt ) = 1ξt=1(log w0 + log δ0(ηt ) − ) + 1ξt=0(log(1 − w0) − log 2πσ0 − 2 − 2 2σ0 Pn ∗ 2 i=1 φi,t(Xi−ηt ) ∗ 2 ) + const. Therefore the optimal form of qp,m,τ (η , ξ) is fully factorized across different clusters: T ∗ Y ∗ qp,m,τ (η , ξ) = qpt,mt,τt (ηt , ξt). t=1

In order to determine the updating formula for pt, mt, τt, we use Method of Undetermined Coefficients.

∗ Suppose q(ξt, ηt ) is a mixture of a point mass of zero and normal distribution,

∗ ∗ 2 −1/2 ∗ 2 2 q(ξt, ηt ) = pt1ξt=1δ0(ηt ) + (1 − pt)1ξt=0(2πτt ) exp(−(ηt − mt) /(2τt )), therefore

q ∗ 2 ∗ ∗ 2 (ηt − mt) log q(ξt, ηt ) = 1ξt=1(log pt + log(δ0(ηt ))) + 1ξt=0(log(1 − pt) − log( 2πτt ) − 2 ) + const. 2τt

Even though there’s a normalizing constant, but the difference between multipliers of 1ξt=1 and 1ξt=0 is invariant with respect to the constant. Therefore we have the following equation:

q ∗ 2 2 (ηt − mt) log pt − log(1 − pt) + log 2πτt + 2 = 2τt Pn 2 ∗ 2 Pn ∗ 2 i=1 φi,tXi (ηt ) i=1 φi,t(Xi − ηt ) log w − log(1 − w) − + 2 + ; 2 2σ0 2

78 ∗ R which holds for any ηt ∈ . The solutions are given as follows:

2 Pn σ0 · i=1 φi,tXi mt = 2 Pn , t = 1, 2, ··· ,T σ0 · i=1 φi,t + 1 2 2 σ0 τt = 2 Pn , t = 1, 2, ··· ,T σ0 · i=1 φi,t + 1  2 Pn 2  p 2 Pn σ0 ·( i=1 φi,tXi) exp log(w0) − log(1 − w0) + log( σ0 · i=1 φi,t + 1) − 2(σ2·Pn φ +1) p = 0 i=1 i,t , t  2 Pn 2  p 2 Pn σ0 ·( i=1 φi,tXi) exp log(w0) − log(1 − w0) + log( σ0 · i=1 φi,t + 1) − 2 Pn + 1 2(σ0 · i=1 φi,t+1) t = 1, 2, ··· ,T.

Next we deal with the optimal form for q(V), which satisfies

T −1 Pn n Y α −1 i=1 1Z =1 P 1 E 0 i i=1 Zi=2 log q(V) = Z[log( (1 − Vt) · V1 · (V2(1 − V1)) ··· t=1 T −2 T −1 Pn Pn Y 1Z =T −1 Y 1Z =T (VT −1 (1 − Vt)) i=1 i ( (1 − Vt)) i=1 i )] + const t=1 t=1 n T n n X X X X = φi,1 · log V1 + (α0 − 1 + φi,t) log(1 − V1) + φi,2 log V2 + (α0 − 1+ i=1 t=2 i=1 i=1 T n n X X X φi,t) · log(1 − V2) + ··· + φi,T −1 · log VT −1 t=3 i=1 i=1 n X + (α0 − 1 + φi,t) log(1 − VT −1) + const i=1 T −1 X ≡ log q(Vt); t=1

Pn PT Pn Pn where log q(V1) = i=1 φi,1·log V1+(α0−1+ t=2 i=1 φi,t) log(1−V1)+const, log q(V2) = i=1 φi,2 log V2+ PT Pn Pn (α0 − 1 + t=3 i=1 φi,t) · log(1 − V2) + const, ··· , log q(VT −1) = i=1 φi,T −1 · log VT −1 + (α0 − 1 + Pn i=1 φn,T ) log(1 − VT −1) + const. Thus we proved

T −1 Y qγ1,γ2 (V) = qγ1t,γ2t (Vt). t=1

Pn PT Pn Furthermore, V1 follows Beta distribution with parameters (γ11, γ21) = ( i=1 φk,1 +1, α0 + t=2 i=1 φi,t), Pn PT Pn V2 follows Beta distribution with parameters (γ12, γ22) = ( i=1 φk,2 + 1, α0 + t=3 i=1 φi,t),··· , VT −1 Pn Pn follows Beta distribution with parameters (γT −1,2, γT −1,2) = ( i=1 φk,T −1 + 1, α0 + i=1 φi,t).

79 Finally, we deal with q(Z). We have the following optimal form

T n T Pn P P ∗ 2 Y i=1 1Z =t i=1 t=1(Xi − ηt ) 1Zi=t log q(Z) = E ∗ [log( π i ) · exp(− )] + const ξ,η ,V t 2 t=1 n T ∗ 2 X X (Xi − ηt ) = E ∗ [ (log π − )1 ] + const ξ,η ,V t 2 Zi=t i=1 t=1 n T E ∗ ∗ 2 X X ξt,η (Xi − ηt ) = (E (log π ) − t )1 + const V t 2 Zi=t i=1 t=1 n X = log q(Zi); i=1

n ∗ 2 2 2 2 therefore we proved q (Z) = Q q (Z ). Since E ∗ (X −η ) = X −2(1−p )m X +(1−p )(m +τ ), Φ i=1 φi i ξt,ηt i t i t t i t t t we have

T X 1 log q(Z ) = (E (log π ) + (1 − p )m X − (1 − p )(m2 + τ 2))1 + const i V t t t i 2 t t t Zi=t t=1 T t−1 X E X E = [ γ1,t,γ2,t (log Vt) + γ1,i,γ2,i (log(1 − Vi)) t=1 i=1 1 + (1 − p )m X − (1 − p )(m2 + τ 2)]1 + const. t t i 2 t t t Zi=t

Therefore q(Zi) is the probability mass function of Multinomial distribution. Once we fix i, φi,t ∝ exp(St), E Pt−1 E 1 2 2 where St = exp[ γ1,t,γ2,t (log Vt) + i=1 γ1,i,γ2,i (log(1 − Vi)) + (1 − pt)mtXi − 2 (1 − pt)(mt + τt )].

80 Appendix B

Supplemental Material for Chapter 3

B.1 Proof of Theorem 3.2

Ψ could be written as t ˆ − 1 (µ1 − µˆ) D 2 ηˆ q , t − 1 − 1 ηˆ Dˆ 2 ΣDˆ 2 ηˆ

which is lower bounded by t ˆ − 1 ˜ (µ1 − µˆ) D 2 ηˆ Ψ = q . t − 1 − 1 λ1ηˆ Dˆ 2 DDˆ 2 ηˆ

By Lemma A.2. of Fan and Fan (2008) we have maxi≤p |σˆii − σii| →p 0. Therefore Dˆ = D(1 + op(1)). − 1 ˆ 2 (µ1−µˆ)D ηˆ Therefore we have Ψ˜ = √ (1 + op(1)). λ1kηˆk We first consider the denominator,

v u p s uX 2 X 2 X 2 p 2 2 kηˆk = t ηˆi = ηˆi + ηˆi = kηˆ1k + kηˆ2k . i=1 i∈S i∈Sc

p ˆ If di=0, n/2di ∼ t2n−2, according to t distribution tail probability inequality, we have

2n−1 ! 2 2n − 2 Γ( ) 1 nb 2n−1 ˆ 2 n − 2 P (|di| ≥ bn) ≤ · p · p (1 + ) . 2n − 3 π(2n − 2)Γ(n − 1) n/2bn 4n − 4

81 For any  > 0, using Markov Inequality and Cauchy-Schwartz Inequality,

q q 2 1 X 2 1 X 4 ˆ P (kηˆ k > ) ≤ E[ˆη 1 ˆ ] ≤ E(ˆη ) P (|di| ≥ bn) 2  i {|di|≥bn}  i i∈Sc i∈Sc p 4 q sup c E(ˆη ) X ≤ i∈S i P (|dˆ | ≥ b )  i n i∈Sc s pE 4 2n−1 supi∈Sc (ˆηi ) 2n − 2 2Γ( 2 ) ≤ (p − sp) ·  2n − 3 pπ(2n − 2)Γ(n − 1)

s 2 1 nbn − 2n−1 · p (1 + ) 2 n/2bn 4n − 4 r pE 4 s 2 4 2 supi∈Sc (ˆηi ) 1 nbn − 2n−1 ∼ (p − sp) · p (1 + ) 2 π  n/2bn 4n − 4 r p 4 2 2 sup c E(ˆη ) 1 2n − 1 nb ∼ 4 i∈S i exp(log(p − s ) − log(pn/2b ) − · n ) → 0; π  p 2 n 2n − 2 4

2 2 The last equivalence holds since bn → 0. The probability goes to 0 since log(p − sp) = o(nbn). 2 Recall that Cp = kd1k . According to triangular inequality,

√ √ −Op( εp) + kd1k ≤ −kηˆ1 − d1k + kd1k ≤ kηˆ1k ≤ kηˆ1 − d1k + kd1k = Op( εp) + kd1k.

√ We put these terms together to approximate the order of denominator as Op( εp) + kd1k.

t t t For numerator, denote µ1 = (µ11, µ12, ··· , µ1p) , µ2 = (µ21, µ22, ··· , µ2p) and µˆ = (ˆµ1, µˆ2, ··· , µˆp) , whereµ ˆi = (ˆµ1i +µ ˆ2i)/2. we have the following decomposition:

p 1 X − 1 X − 1 X − 1 t ˆ − 2 2 2 2 (µ1 − µˆ) D ηˆ = (µ1i − µˆi)ˆσii ηˆi = (µ1i − µˆi)ˆσii ηˆi + (µ1i − µˆi)ˆσii ηˆi ≡ I1 + I2. i=1 i∈Sc i∈S

For I1, we have the following decomposition since µ1i = µ2i:

1 X − 1 1 X − 1 1 1 I = − (ˆµ − µ )ˆσ 2 ηˆ − (ˆµ − µ )ˆσ 2 ηˆ ≡ I + I . 1 2 1i 1i ii i 2 2i 2i ii i 2 1,1 2 1,2 i∈Sc i∈Sc

82 Using Markov Inequality,

1 1 X − 2 P (|I1,1| > ) ≤ E|(ˆµ1i − µ1i)ˆσ ηˆi1 ˆ |  ii {|di|≥bn} i∈Sc

1 X p 2 q 2 ≤ E((ˆµ1i − µ1i) /σˆii) E(ˆη 1 ˆ )  i {|di|≥bn} i∈Sc q 1 X n − 1 q 4 ≤ √ 4 E(ˆη4) P (|dˆ | ≥ b )  (n − 2) n i i n i∈Sc q 8 p4 2 E 4 p 2 π supi∈Sc (ˆηi ) log(p − s ) log( n/2b ) 2n − 1 nb ∼ √ exp( p − n − · n ) → 0. n 2 4 2n − 2 4

p×p sp×sp Therefore I1,1 = op(1). Similarly I1,2 = op(1). Hence I1 = op(1). Suppose D = diag(diag(D1 ), (p−sp)×(p−sp) ˆ p×p ˆ sp×sp ˆ (p−sp)×(p−sp) ˆ diag(D2 )), D = diag(diag(D1 ), diag(D2 )), where D1 and D1 denote the

corresponding sub-matrix of relevant features, D2 and Dˆ 2 denote the corresponding submatrix of irrelevant features. Similarly for R denote the corresponding submatrix of relevant features as R1. Denote the sub-

∗ ∗ vectors of µ1 and µ2 indexed at S as µ1 and µ2. For I2 we have the following decomposition:

1 1 − 1 I = (µ∗ − µ∗)tDˆ −1(µ∗ − µ∗) − (µˆ ∗ − µ∗)tDˆ 2 (ηˆ − d ) 2 2 1 2 1 1 2 2 1 1 1 1 1 1 − 1 1 − 1 1 − 1 − (µˆ ∗ − µ∗)tDˆ 2 (ηˆ − d ) − (µˆ ∗ − µ∗)tDˆ 2 d − (µˆ ∗ − µ∗)tDˆ 2 d 2 2 2 1 1 1 2 1 1 1 1 2 2 2 1 1 ∗ ∗ µ − µ − 1 + ( 1 2 )tDˆ 2 (ηˆ − d ) 2 1 1 1 1 1 1 1 1 1 ≡ kd k2 − I − I − I − I + I . 2 1 2 2,1 2 2,2 2 2,3 2 2,4 2 2,5

∗ ∗ 1 1/2 1/2 Since µˆ 1 − µ1 ∼ Nsp (0, n D1 R1D1 ), we use Cauchy-Schwartz Inequality to get an upper bound. We have

1 − 1 1 I2 = ((µˆ ∗ − µ∗)tDˆ 2 (ηˆ − d ))2 ≤ (µˆ ∗ − µ∗)tDˆ −1(µˆ ∗ − µ∗) · kηˆ − d k2 2,1 4 1 1 1 1 1 4 1 1 1 1 1 1 1 1 = (µˆ ∗ − µ∗)tD−1(µˆ ∗ − µ∗) · kηˆ − d k2 · (1 + o (1)). 4 1 1 1 1 1 1 1 p

∗ ∗ t −1 ∗ ∗ sp sp 2 (µˆ 1 − µ1) D1 (µˆ 1 − µ1) is Op( n λmax(R1)) = Op( n ), meanwhile, kηˆ1 − d1k is Op(εp). Therefore

q spεp q spεp I2,1 = Op( n ). Similarly I2,2 = Op( n ). √ Using the same techniques in Fan and Fan (2008), we could prove I2,3 = Op(kd1k/ n) and I2,4 = √ Op(kd1k/ n).

For I2,5, according to Cauchy-Schwartz Inequality, we have

83 µ∗ − µ∗ µ∗ − µ∗ I2 ≤ ( 1 2 )tDˆ −1( 1 2 ) · kηˆ − d k2 2,5 2 1 2 1 1 µ∗ − µ∗ µ∗ − µ∗ = ( 1 2 )tD−1( 1 2 ) · kηˆ − d k2 · (1 + o (1)). 2 1 2 1 1 p

Therefore 1 √ |I | ≤ pkd k2 · kηˆ − d k2 = O ( ε kd k). 2,5 2 1 1 1 p p 1

Asymptotically, we have

q q √ 1 2 spεp √ kd1k − Op( ) − Op( εpkd1k) − Op(kd1k/ n) ˜ 2n n Ψ = √ √ (1 + op(1)). λ1(Op( εp) + kd1k)

Therefore   1 2 q spεp √ √ kd1k − Op( ) − Op( εpkd1k) − Op(kd1k/ n) ˆ  2 n  W (δθˆ , θ) ≤ 1 − Φ √ √ (1 + op(1)) . (B.1)  λ1(Op( εp) + kd1k) 

2 Since kd1k = Cp, we have

q  1 spεp p p  Cp − Op( ) − Op( εpCp) − Op( Cp/n) ˆ 2 n W (δθˆ) ≤ 1 − Φ  √ p √ (1 + op(1)) . λ1( Cp + Op( εp))

√ √ spεp p If εp = o(Cp), n = o(Cp) and nCp → ∞, then Cp/2 and λ1 Cp are the leading terms of denomi- nator and numerator respectively. We have

p ! ˆ Cp W (δ) ≤ 1 − Φ √ (1 + op(1)) . 2 λ1

B.2 Proof of Theorem 3.3

Conditions in Theorem 3.3 implies the conditions in Theorem 3.2. Therefore

q  1 spεp p p  Cp − Op( ) − Op( εpCp) − Op( Cp/n) ˆ 2 n W (δθˆ) ≤ Φ − √ p √ (1 + op(1)) . λ1( Cp + Op( εp))

84 Cp Using Lemma 1 in Shao et al. (2011), we let ξn = and 4λ1

q spεp p p Op( n ) + Op( εpCp) + Op( Cp/n) τn = p √ . Op( εpCp) + λ1Cp

√ Using the conditions εpCp = o(1) and spεp = o(n), we could easily verify that ξn → ∞, τn → 0 and

τnξn → 0, therefore ˆ p W (δθˆ)/W (δOPT ) → 0.

85 Appendix C

Supplemental Material for Chapter 4

C.1 Proof of Theorem 4.1

First we compute bias of m(0) for estimating β∗. We apply Singular Value Decomposition of X, X = PDQT ,

T where P is an n × p matrix satisfying P P = Ip, D is an p × p diagonal matrix of full rank, Q is a p × p orthogonal matrix. Then we have XT X = QD2QT . The minimal diagonal element of D2 is of order nη1 . we have

(0) 2 −1 T ∗ bias(m ) = −Q(v0nD + Ip) Q β ;

(0) 2 2 −1 −1 2 2 −1 −1 T Var(m ) = σ Q(D + v0n Ip) D (D + v0n Ip) Q ;

2 2 −1 −1 T V = σ Q(D + v0n Ip) Q .

(0) ∗ η2 (0) 1 α−η1+ 2 2 λn1 Therefore for any 1 ≤ j ≤ p, bias(mj ) ≤ v λ kβ k ∼ n . Moreover, Var(mj ) ≤ σ −1 2 ∼ 0n n1 (λn1+v0n ) 2 −η1 σ −η1 δ−α n → 0 and Vjj ≤ 1 ∼ n if α < η1. Threshold r ∼ n . We have the similar probability bound λn1+ v0n

  0 (δ−α)/2  (0) ∗ (δ−α)/2 |bias(mj )| − n P |mj − βj | > n ≤ 2Φ   q (0) Var(mj )   α−η + η2 (δ−α)/2 n 1 2 − n 2 η1+δ−α ≤ 2Φ   ≤ exp(−c0n ); q (0) Var(mj )

if δ > 3α − 2η1 + η2.

2 η +δ−α (0) (0) Therefore with probability larger than 1−q exp(−c n 1 ) we have min ∗ |m | ≥ − max ∗ |m − 0 j:γj =1 j j:γj =1 j ∗ ∗ (δ−α)/2 β | + min ∗ |β | ≥ −n + a . j j:γj =1 j n 2 η +δ−α (0) (δ−α)/2 Similarly, with probability larger than 1−2(p−q) exp(−c n 1 ), we have max ∗ |m | ≤ n . 0 j:γj =0 j

2 η1+δ−α ∗ In Step 5, with large enough n, with probability larger than 1 − 2p exp(−c0n ), P(γˆ = γ ) ≥ 1 −

2 η1+δ−α 2p exp(−c0n ).

2 T Pp λnj T In Step 4, forσ ˆ , tr(XVX ) ≤ 1  p where λnj is jth eigenvalue of X X. Denote PX = j=1 λnj + v0

86 T −1 T X(X X) X . Since 2α − η1 + η2 ≤ 1, we have

2 T 1 −1 T 2 ky − Xmk = ky − X(X X + Ip) X yk v0 2 T 1 −1 T 2 = ky − PXyk + kPXy − X(X X + Ip) X yk v0 1 p 2 2 v0  T 2 2 2 2α−η1+η2 2 2 = σ χn−p(0) + kP 1 j=1P yk  σ χn−p(0) + n + σ χp(0) = Op(n). + λnj v0

(0) 2 Pp (0) 2 if j > q, αj = v0n, We have ((mj ) + Vjj)/v0n = op(1). j=q+1((mj ) + Vjj)/v0n = op(p − q). If

Pq (0) 2 1 −η1 2 −η1 α+η2 j ≤ q, αj = v0n, ((m ) + Vjj)/αj ≺ (qn + Op(kβk + qn )) ≺ Op(n ). To combine the j=1 j v0n 2 results, the numerator should be Op(n), thereforeσ ˆ = Op(1).

2 η1+δ−α T T In Step 6, with probability larger than 1−p exp(−c0n ), we have ∆ = diag(v11q , v01p−q), therefore, 2 T −1 −1 T T −1 −1 T −1 −1 −1 ∗ Var(m) = σ (X X + ∆ ) X X(X X + ∆ ) and bias(m) = −(X X + ∆ ) ∆ β . Since v1 is a

T 1 −1 T −1 −1 positive constant and v0n goes to 0, therefore (X X + Ip) ≥ (X X + ∆ ) . Therefore v1

kbias(m)k2 = (β∗)T ∆−1(XT X + ∆−1)−2∆−1β∗

−2 ∗ T T −1 −2 ∗ −2 ∗ T T 1 −2 ∗ = v1 (β ) (X X + ∆ ) β ≤ v1 (β ) (X X + Ip) β v1 1 ≤ ( )2kβ∗k2. v1λ1n + 1

η2 1 ∗ δ −η1+ Hence for any 1 ≤ j ≤ p, bias(mj) ≤ kβ k ∼ exp(−n )n 2 . According to properties of positive v1λn1 semi-definite matrix, we have

−2 T −1 −1 T T −1 −1 σ λmax(Var(m)) = λmax((X X + ∆ ) X X(X X + ∆ ) )

T −1 T −1 T −1 −1 = [λmin((X X + ∆ )(X X) (X X + ∆ ))]

T −1 −1 T −1 −1 −1 = [λmin(X X + 2∆ + ∆ (X X) ∆ )]

T −1 −1 T −1 −1 −1 ≤ (Weyl’s Inequality)[λmin(X X) + 2λmin(∆ ) + λmin(∆ (X X) ∆ )] 1 T −1 −1 −η1 ≤ [λmin(X X) + 2λmin(∆ )] = ∼ n . η1 −1 n + 2v1

Therefore for any 1 ≤ j ≤ p, Var(m ) ≤ 1 ∼ n−η1 . j η1 −1 n +2v1 P p ∗ According to Proposition 4.1, concentration property holds when i/∈S∗ wˆi → 0 and Pb1 (kb1 − β1k ≤

2 −η1 −η1 δ Mn) → 1. Since maxi>q mi = Op(n ), maxi Vii = Op(n ), α < η1, we have maxj>q wˆj  exp(−n /2). P δ Therefore i/∈S∗ wˆi  (p − q) exp(−n /2).

∗ −η1 2 δ −2η1+η2 2 Pb1 (kb1 − β1k ≤ Mn) → 1 holds when qn = op(n) and exp(−2n )n = o(n). Therefore

87 ∗ p ΠX(B(β , Mn)) → 1.

C.2 Proof of Theorem 4.2

∗ ˜ ˜ T ˜ T T ˜ ∗ 2 ˜ Let An = {γˆ = γ }, denote β = (β1 , β2 ) , on the event An, we have EkXβ − Xβ k 1An ≤ EkX1β1 − ∗ 2 ˜ 0 ˜ 2 X1β1k = tr(X1Var(β1)X1) + kX1bias(β1)k . We deal with the variance component first.

Var(m) = σ2(XT X + ∆−1)−1XT X(XT X + ∆−1)−1.

2 T −1 ˜ ˜ It is easy to show Var(m) ≤ σ (X X) . Since Var(β1) is the q ×q principle submatrix of Var(m) , Var(β1) should be upper bounded by corresponding q × q principle submatrix of σ2(XT X)−1, which is

2 T T T −1 T −1 σ (X1 X1 − X1 X2(X2 X2) X2 X1) .

Therefore we have

˜ T 2 ˜ T tr(X1Var(β1)X1 ) = σ tr(Var(β1)X1 X1)

2 T T T −1 T −1 T ≤ σ tr((X1 X1 − X1 X2(X2 X2) X2 X1) X1 X1).

n×q n×q q×q q×q T n×(p−q) We apply Singular Value Decomposition to X1 and X2: X1 = P1 D1 (Q1 ) , X2 = n×(p−q) (p−q)×(p−q) (p−q)×(p−q) T T T T T T P2 D2 (Q2 ) , where Q1 Q1 = Q1Q1 = Iq, P1 P1 = Iq, Q2 Q2 = Q2Q2 = Ip−q, T T T P2 P2 = Ip−q. Then PX1 = P1P1 and PX2 = P2P2 . Therefore

T T T −1 T −1 T T T −1 T tr((X1 X1 − X1 X2(X2 X2) X2 X1) X1 X1) = tr(X1(X1 X1 − X1 PX2 X1) X1 )

T T −1 T T T −1 = tr(P1(Iq − P1 P2P2 P1) P1 ) = tr((Iq − P1 P2P2 P1) ).

T T T T T T Since λmax(PX2 PX1 ) = λmax(P2P2 P1P1 ) = λmax(P1 P2P2 P1) < 1−c1, we have λmin(−P1 P2P2 P1) > T T T T −1 + c1, since P1 P2P2 P1 is a semi-positive definite matrix, we have λmin(Iq − P1 P2P2 P1) > c1, therefore T T 1 T T −1 q λmin(Iq − P P2P P1) > , tr((Iq − P P2P P1) ) < . 1 2 c1 1 2 c1 For the bias part, since

bias(m) = −(XT X + ∆−1)−1∆−1β∗;

88 we have

˜ −1 T −1 T T −1 −1 T −1 ∗ X1bias(β1) = −v1 X1(X1 X1 + v1 Iq − X1 X2(X2 X2 + v0 Ip−q) X2 X1) β1;

T −1 T T −1 −1 T −1 X1(X1 X1 + v1 Iq − X1 X2(X2 X2 + v0 Ip−q) X2 X1)

T 1 −1 T 1 −1 T = X1(X1 X1 + Iq) + X1(X1 X1 + Iq) X1 X2 v1 v1 T 1 T T 1 −1 T −1 T T 1 −1 (X2 X2 + Ip−q − X2 X1(X1 X1 + Iq) X1 X2) X2 X1(X1 X1 + Iq) ; v0 v1 v1 the largest singular value is

T 1 −1 σmax(X1(X1 X1 + Iq) ) v1 r 1 1 η1 T −1 T T −1 − 2 = λmax((X1 X1 + Iq) X1 X1(X1 X1 + Iq) )  n . v1 v1

T 1 −1 T 2 1 −1 T X1(X1 X1 + Iq) X1 = P1D1(D1 + ) D1P1 ; v1 v1

T 1 −1 T Therefore the largest eigenvalue of X1(X X1 + Iq) X is bounded by 1. What’s more, 1 v1 1

T 1 T T 1 −1 T −1 T λmax(X2(X2 X2 + Ip−q − X2 X1(X1 X1 + Iq) X1 X2) X2 ) v0 v1 T T 1 T 2 1 T T −1 T = λmax(P2D2Q2 (Q2D2Q2 + Ip−q − Q2D2P2 P1D1(D1 + Iq)D1P1 P2D2Q2 ) Q2D2P2 ) v0 v1 T T 1 T 2 1 T T −1 = λmax(D2Q2 (Q2D2Q2 + Ip−q − Q2D2P2 P1D1(D1 + Iq)D1P1 P2D2Q2 ) Q2D2) v0 v1 1 −2 T 2 1 −1 T −1 = λmax((Ip−q + D2 − P2 P1D1(D1 + Iq) D1P1 P2) ) v0 v1 1 T 2 −1 T −η1+α −1 ≤ (1 − λmax(P2 P1D1(D1 + Iq) D1P1 P2) + n ) v1 1 2 −1 T T −η1+α −1 ≤ (1 − λmax(D1(D1 + Iq) D1)λmax(P1 P2P2 P1) + n ) v1 nη1 −η1+α −1 ≤ (1 − η (1 − c1) + n ) ≤ constant. n 1 + 1/v1

89 Based on the constant bound of largest eigenvalue, we have

T −1 T T −1 −1 T −1 σmax(X1(X1 X1 + v1 Iq − X1 X2(X2 X2 + v0n Ip−q) X2 X1) )

T 1 −1 T 1 −1 T ≤ σmax(X1(X1 X1 + Iq) ) + λmax(X1(X1 X1 + Iq) X1 ) v1 v1 T 1 T T 1 −1 T −1 T λmax(X2(X2 X2 + Ip−q − X2 X1(X1 X1 + Iq) X1 X2) X2 ) v0n v1 1 T −1 −η1/2 σmax(X1(X1 X1 + Iq) ) ∼ n ; v1

˜ 2 −η1+η2 Therefore kX1bias(β1)k  n → 0 on the event An. To combine these 2 parts together, we have

˜ ∗ 2 q 2 EkXβ − Xβ k 1An ≤ σ + o(1). c1

c 2 ∗ 2 2 ∗ 2 On the event A , since n ≥ p, λ ≤ n . kXβ˜ − Xβ k 1 c ≤ 2kXβ˜ − Xmk 1 c + 2kXm − Xβ k 1 c . n np An An An ˜ 2 ˜ 2 δ−α For the first term, kXβ − Xmk = λnpkβ − mk ≤ n pλnp,

2 δ−α 2 η1+δ−α kXβ˜ − Xmk 1 c  n pλ (2p exp(−c n )) → 0. E An np 0

Using the fact that (XT X + ∆−1)−1XT X(XT X + ∆−1)−1 ≤ (XT X + ∆−1)−1, for the second term, since

∗ 2 T −1 −1 −1 ∗ T −1 −1 T 2 kXm − Xβ k = k − X(X X + ∆ ) ∆ β + X(X X + ∆ ) X εnk

≤ 2(β∗)T ∆−1(XT X + ∆−1)−1XT X(XT X + ∆−1)−1∆−1β∗

T T −1 −1 T T −1 −1 T + 2εn X(X X + ∆ ) X X(X X + ∆ ) X εn

∗ T −1 T −1 −1 −1 ∗ T T −1 −1 T ≤ 2(β ) ∆ (X X + ∆ ) ∆ β + 2εn X(X X + ∆ ) X εn

T −1 −1 −η1 T −1 −1 T 2 −1 −1 T Since λmax((X X + ∆ ) )  n and X(X X + ∆ ) X ≤ PD(D + v1 Ip) DP , we have

∗ 2 2α+η2−η1 kXm − Xβ k ≤ O(n ) + Op(p). Therefore

2 2α+η2−η1 c kXm − Xβk 1 c ≤ (O(n ) + O(p))P (A ) → 0. E An n

To sum up, kXβ˜ − Xβk2 ≤ q σ2 + o(1). E c1

90 C.3 Proof of Theorem 4.3

First we compute bias of m(0) for estimating θ∗. First we apply Singular Value Decomposition to X,

T T X = PDQ , where P is an n × n orthogonal matrix satisfying P P = In, D is an n × n diagonal matrix of

T T 2 T full rank, Q is a p × n matrix satisfying Q Q = In. Then we have X X = QD Q . The minimal diagonal

2 η1 T T element of D is of order n . Let Q⊥ be p × (p − n) matrix which satisfies Q⊥Q⊥ = Ip−n and Q⊥Q = O. ∗ T ∗ Define Γ = (Q, Q⊥), using the same trick of Shao and Deng (2012) and the fact that θ = QQ θ , we have

(0) T −1 −1 T ∗ ∗ bias(m ) = (X X + v0 Ip) X Xθ − θ

2 T −1 ∗ T T 2 T −1 T T ∗ = −(v0QD Q + Ip) (θ ) = −Γ(v0Γ QD Q Γ + Ip) Γ QQ θ

2 −1 T ∗ = −Q(v0D + In) Q θ ;

Therefore

(0) 2 ∗ T 2 −2 T ∗ kbias(m )k = (θ ) Q(v0D + In) Q θ .

T 2 −2 1 1 Since minimal nonzero eigenvalue of X X is λ1n, we have (v0D +Ip) ≤ 2 Ip ≤ 2 2 Ip. Therefore (1+v0λn1) v0 λn1 (0) 2 2α ∗ 2 2 (0) 1 ∗ −η + η2 +α kbias(m )k ≺ n kθ k /(λ ). for any 1 ≤ j ≤ p, bias(m ) ≤ kθ k ∼ n 1 2 . n1 j v0λn1 For variance of m(0), we have

(0) 2 T −1 −1 T T −1 −1 Var(m ) = σ (X X + v0 Ip) X X(X X + v0 Ip)

2 2 T −1 −1 2 T 2 T −1 −1 = σ (QD Q + v0 Ip) QD Q (QD Q + v0 Ip) .

(0) 2 λn1 (0) 2 λn1 The maximal eigenvalue of Var(m ) is σ −1 2 . Therefore for any 1 ≤ j ≤ p, Var(mj ) ≤ σ −1 2 ∼ (λn1+v0 ) (λn1+v0 ) σ2n−η1 . For V, we have

2 2 T −1 −1 V = σ (QD Q + v0 Ip) ≤ v0Ip.

−α Therefore for any 1 ≤ j ≤ p, Vjj ≤ v0 ∼ n .

2 In Step 4 we showσ ˆ = Op(1). Since

T (0) 2 Pp (0) 2 tr(XVX ) + ky − Xm k + (Vjj + (m ) )/v0 + νλ σˆ2 = j=1 j , n + p + ν + 2

T 2 −1 −1 (0) 2 2 −1 −1 T 2 we have tr(XVX ) = tr(D(D + v0 In) D) ≤ n and ky − Xm k = ky − PD(D + v0 In) DP yk .

91 Using the same method in proof of Theorem 4.1 we have

2 −1 −1 T 2 Eky − PD(D + v0 In) DP yk

2 −1 −1 2 T ∗ 2 2 −1 −1 T 2 = kPD(In − (D + v0 In) D )Q θ k + EkP(In − D(D + v0 In) D)P εk

 n2α−η1+η2 + n1+α−η1 ,

if 2α − η1 + η2 ≤ 1 and η1 ≥ α.

(0) 2 Pp (0) 2 if j > q, αj = v0n, we have ((mj ) + Vjj)/v0n = Op(1). j=q+1((mj ) + Vjj)/v0n = Op(p − q). If

Pq (0) 2 1 −α ∗ 2 −η1 α+η2 j ≤ q, αj = v0n, ((m ) + Vjj)/αj ≺ (qn + Op(kθ k + qn )) ≺ Op(n ). To combine the j=1 j v0n 2 results, the numerator should be Op(n), thereforeσ ˆ = Op(1). (0) In step 5 we need to show when n is sufficiently large, with probability going to 1, min ∗ |m | > j:γj =1 j √ r(0). We have

  0 (δ−α)/2  (0) ∗ (δ−α)/2 |bias(mj )| − n P |mj − θj | > n ≤ 2Φ   q (0) Var(mj )   α−η + η2 (δ−α)/2 n 1 2 − n 2 η1+δ−α ≤ 2Φ   ≤ exp(−c0n ), q (0) Var(mj )

2 η +δ−α (0) since δ > 3α−2η +η . Therefore with probability larger than 1−q exp(−c n 1 ) we have min ∗ |m | ≥ 1 2 0 j:γj =1 j (0) ∗ ∗ (δ−α)/2 − max ∗ |m − θ | + min ∗ |θ | ≥ −n + a . j:γj =1 j j j:γj =1 j n

2 η1+δ−α Similarly, with probability larger than 1 − (p − q) exp(−c0n ), we have

√ max |m(0)| ≤ max |m(0) − θ∗| + max |θ∗| ≤ nδ−α/2 + b ≤ r. ∗ j ∗ j j ∗ j n j:γj =0 j:γj =0 j:γj =0

Hence

∗ 2 δ+η1−α P(γˆ = γ ) ≥ 1 − 2p exp(−c0n ).

T In Step 6 we measure the order of bias and variance. Consider SVD for X1 and X2: X1 = P1D1Q1 , T T X2 = P2D2Q2 , where P1 is n × q matrix satisfying P1 P1 = Iq, D1 is q × q diagonal matrix, Q1 is q × q T orthogonal matrix, P2 is n × r matrix satisfying P2 P2 = Ir, D2 is r × r diagonal matrix, Q2 is r × (p − q) orthogonal matrix. Using this notation, we have

T −1 −1 T 2 −1 −1 T T X2(X2 X2 + v0 Ip−q) X2 = P2D2(D2 + v0 Ir) D2P2 ≤ P2P2 ;

92 T −1 −1 T 2 −1 −1 T T X1(X1 X1 + v1 Iq) X1 = P1D1(D1 + v1 Iq) D1P1 ≤ P1P1 .

T T 2 T −1 −1 T T −1 −1 Define ∆ = diag(v11q , v01p−q), therefore, Var(m) = σ (X X + ∆ ) X X(X X + ∆ ) and bias(m) = −(XT X + ∆−1)−1∆−1θ∗. We have

−1 T −1 −1 ∗ T −1 −1 T −1 −1 ∗ T bias(m) = −v1 (X X + ∆ ) θ − (X X + ∆ ) (0 , (v0 − v1 )θ2) .

Therefore

2 −2 ∗ T T −1 −2 ∗ 2 ∗ 2 kbias(m)k ≤ 2v1 (θ ) (X X + ∆ ) θ + 2(1 − v0/v1) kθ2k

−2 ∗ T 2 −1 −2 T ∗ 2 ∗ 2 = 2v1 (θ ) Q(D + v1 ) Q θ + 2(1 − v0/v1) kθ2k

η2−2η1 δ ∗ 2 ∼ 2n exp(−2n ) + 2kθ2k .

Besides,

2 T −1 −1 T T −1 −1 λmax(Var(m)) = σ λmax((X X + ∆ ) X X(X X + ∆ ) )

T −1 −2 T T −1 −2 T = λmax(X(X X + ∆ ) X ) ≤ λmax(X(X X + v1 Ip) X )

T 2 T −1 −2 T = λmax(PDQ (QD Q + v1 Ip) QDP )

2 −1 −2 −η1 = λmax(D(D + v1 In) D) ∼ n .

−η1 P δ Therefore Var(mj) ≺ n . Using the same procedure as Theorem 1, we have i/∈S∗ wˆi  (p−q) exp(−n /2) ∗ (δ−α)/2 δ α 2 ∗ 2 2 if kθ2k ≺ n . Concentration property holds if log p = o(n ), q/n = o(n) and kθ2k = o(n).

C.4 Proof of Theorem 4.4

∗ The difference between Xθ and X1θ1 is

∗ ∗ 2 ∗ 2 kXθ − X1θ1k = kX2θ2k ,

˜ ∗ 2 ˜ ∗ 2 ∗ 2 therefore EkXθ − Xθ k  EkXθ − X1θ1k + kX2θ2k . ∗ ˜ ∗ 2 ˜ ∗ 2 ˜ T Let An = {γˆ = γ }, on the event An, we have EkXθ−X1θ1k 1An = EkX1θ1−X1θ1k ≤ tr(X1Var(θ1)X1 )+ ˜ 2 kX1bias(θ1)k .

93 For variance part, following proof of Theorem 4.3, we have

2 T −1 −1 T T −1 −1 λmax(Var(m)) = σ λmax((X X + ∆ ) X X(X X + ∆ ) )

2 T −1 −1 T T −1 −1 ≤ σ λmax((X X + v1 Ip) X X(X X + v1 Ip) )

2 2 −1 −2 2 −η1 ≤ σ λmax(D(D + v1 In) D)  σ n .

˜ ˜ 2 −η1 ˜ T Since Var(θ1) is a principal submatrix of Var(m), we have Var(θ1) ≤ σ n Iq. Therefore tr(X1Var(θ1)X1 ) ≤

2 −η1 T 1−η1 2 σ n tr(X1X1 ) = qn σ . T −1 −1 −1 ∗ ˜ −1 ∗ For the bias part, since bias(m) = −(X X + ∆ ) ∆ θ ; we have X1bias(θ1) = −v1 X1Aθ1 − −1 ∗ v0 X1Bθ2. X1A is

T −1 T T −1 −1 T −1 X1A = X1(X1 X1 + v1 Iq − X1 X2(X2 X2 + v0 Ip−q) X2 X1) .

Looking at squared singular values of X1A, we have

2 T 2 −1 T 2 −1 −1 T −2 T λmax(X1A X1 ) = λmax(P1D1[D1 + v1 Iq − D1P1 P2D2(D2 + v0 Ir) D2P2 P1D1] D1P1 )

−1 −2 T 2 −1 −1 T −2 = λmax([Iq + v1 D1 − P1 P2D2(D2 + v0 Ir) D2P2 P1] )

T 2 −1 −1 T −2 ≤ [1 − λmax(P1 P2D2(D2 + v0 Ir) D2P2 P1)]

2 −1 −1 T T −2 ≤ [1 − λmax(D2(D2 + v0 Ir) D2)λmax(P2 P1P1 P2)]

2 −1 −1 −2 T T T T ≤ [1 − λmax(D2(D2 + v0 Ir) D2)] ( since λmax(P2 P1P1 P2) = λmax(P2P2 P1P1 ) ≤ 1)

T T −1 −2 −2α 2 T ≤ [1 − λmax(X2 X2)/(λmax(X2 X2) + v0 )] ∼ n λmax(X2 X2).

−1 ∗ 2 η2−2α δ 2 2 η2−2α δ 2 Therefore kv1 X1Aθ1k  n exp(−2n )λmax(X2X2) ≤ n exp(−2n )λnp. For the second part of the bias we have

−1 ∗ −1 T 1 T T 1 −1 T −1 T − v0 X1Bθ2 = −v0 X1[X1 X1 + Iq − X1 X2(X2 X2 + Ip−q) X2 X1] X1 v1 v0 T 1 −1 ∗ X2(X2 X2 + Ip−q) θ2 v0 −1 T 1 T T 1 −1 T −1 T = −v0 X1[X1 X1 + Iq − X1 X2(X2 X2 + Ip−q) X2 X1] X1 v1 v0 T 1 −1 ∗ (X2X2 + In) X2θ2. v0

T 1 −1 T 1 −1 In the second equation we use the fact that X2(X X2 + Ip−q) = (X2X + In) X2, which is proved 2 v0 2 v0

94 by Woodbury formula. Meanwhile, using Woodbury formula, we have

T 1 −1 T −1 −1 T (X2X2 + In) = v0(In − X2(X2 X2 + v0 Ip−q) X2 ). (C.1) v0

Using (C.1) we have

−1 T 1 T T 1 −1 T −1 T T 1 −1 λmax(v0 X1[X1 X1 + Iq − X1 X2(X2 X2 + Ip−q) X2 X1] X1 (X2X2 + In) ) v1 v0 v0 −1 T 1 T T 1 −1 T −1 T T 1 −1 = λmax(v0 [X1 X1 + Iq − X1 X2(X2 X2 + Ip−q) X2 X1] X1 (X2X2 + In) X1) v1 v0 v0 −1 T T T 1 −1 T −1 T T 1 −1 ≤ λmax(v0 [X1 X1 − X1 X2(X2 X2 + Ip−q) X2 X1] X1 (X2X2 + In) X1) v0 v0 ≤ 1.

We have

−1 ∗ 2 ∗ 2 k − v0 X1Bθ2k  kX2θ2k .

To combine these 2 parts together, we have

˜ ∗ 2 1−η1 2 η2−2α δ 2 ∗ 2 EkXθ − Xθ k 1An ≤ qn σ + n exp(−2n )λnp + kX2θ2k .

2 δ+η −α ∗ 2 2 ∗ 2 Since p λ = o(exp(n 1 )), kXθ˜ − Xθ k 1 c ≤ 2kXθ˜ − Xmk 1 c + 2kXm − Xθ k 1 c . kXθ˜ − np An An An 2 ˜ 2 δ−α Xmk ≤ λnpkθ − mk ≤ n pλnp,

2 δ−α 2 δ+η1−α kXθ˜ − Xmk 1 c ≺ n p · λ (2p exp(−c n )) → 0. E An np 0

Besides, using the fact that (XT X + ∆−1)−1XT X(XT X + ∆−1)−1 ≤ (XT X + ∆−1)−1, for the second term,

we have

∗ 2 T −1 −1 −1 ∗ T −1 −1 T 2 kXm − Xθ k = k − X(X X + ∆ ) ∆ θ + X(X X + ∆ ) X εnk

∗ T −1 T −1 −1 −1 ∗ T T −1 −1 T ≤ 2(θ ) ∆ (X X + ∆ ) ∆ θ + 2εn X(X X + ∆ ) X εn.

−1 T −1 −1 −1 α T −1 −1 T 2 −1 −1 T Since ∆ λmax((X X + ∆ ) )∆  n and X(X X + ∆ ) X ≤ PD(D + v1 Ip) DP , we have

∗ 2 α+η2 kXm − Xθ k ≤ O(n ) + Op(n). Therefore

2 2α+η2−η1 c kXm − Xβk 1 c ≤ (O(n ) + O(n))P (A ) → 0. E An n

95 ∗ 2 Therefore kXθ˜ − Xθ k 1 c → 0. To sum up, E An

˜ ∗ 2 1−η1 2 η2−2α δ 2 ∗ 2 EkXθ − Xθ k  qn σ + n exp(−2n )λnp + kX2θ2k .

96 Appendix D

Supplemental Material for Chapter 5

D.1 Proof of Lemma 5.2

First we consider p/n → 0. According to the results of perturbation bounds (Golub and Van Loan (2012)),

2 p the gap between the largest eigenvalue and the second largest eigenvalue of A is kρk . Since kEk2 → 0,

∗ 2 p ∗ therefore sin ∠(vˆ, ρ ) ≤ (4/kρk )kEk2 = Op( p/n). Since both vˆ and ρ have the unit length, we have

1    kvˆ − ρ∗k = 2 sin (ρ∗, vˆ) ∼ sin (ρ∗, vˆ) ≤ O (pp/n). 2∠ ∠ p

D.2 Proof of Theorem 5.1

∗ T ∗ ∗ ∗ ∗ T First we consider a special case ρ = (1, 0, ··· , 0) before moving on to the general case ρ˜ = (ρ1, ρ2, ··· , ρs, 0, ··· , 0) by applying a proper orthogonal transformation (tilde here implies the generalized version).

First we check the order of Av(0) = Avˆ = λˆvˆ. Using Weyl’s inequality, we have

ˆ 2 2 p |λ − (kρk + ξ )| ≤ kEk2 = Op( p/n).

kρk → % and % is a positive constant, therefore λˆ is bounded from 0 and infinity.

To show selection consistency, it’s equivalent to show

αi αi P (min log > Cn and max log < −Cn) → 1. i∈S 1 − αi i∈Sc 1 − αi

The formula to update αi is

ˆ 2 αi ω (λvˆi) 1 log = log + − log(a1n + 1). 1 − αi 1 − ω 2(1/a1n + 1) log p/n 2

∗ The key thing is to figure out the magnitude ofv ˆi, 2 ≤ i ≤ p. We measure the magnitude of ei =v ˆi − ρi ,

97 ∗ ∗ T T T ∗ where ρ1 = 1 and ρi = 0 for 2 ≤ i ≤ p. Let vˆ = (ˆv1, vˆ2 ) where vˆ2 = (ˆv2, vˆ3, ··· , vˆp) , Since kvˆ − ρ k = p a.s. 2 p Op( p/n),v ˆ1 → 1 and k1 − vˆ1k = Op( p/n). a.s. ˆ For i = 1 (signal part), sincev ˆ1 → 1 and λ is bounded from 0 and infinity, we have

ˆ 2 α1 w (λvˆ1) 1 log ≥ log + − log(a1n + 1) Cn, 1 − α1 1 − w log p/n 2

if log a1n ≺ n/ log p, which is intrinsically included in minimal signal condition .

Theorem 6 in Paul (2007) shows that the vector vˆ2/kvˆ2k is distributed uniformly on the unit sphere

p−2 p 2 p S and is independent of kvˆ2k = 1 − vˆ1 = Op( p/n) (This theorem holds despite any asymptotic relationship between p and n).

We have

d max1≤j≤p−1 Yj i.i.d max |ei| = q , where Y1,Y2, ··· ,Yp−1 ∼ N(0, 1). i≥2 Pp−1 2 k=1 Yk √ q Since E(max Y ) ∼ log p and Pp−1 Y 2/(p−1) a.s.→ 1, max√1≤j≤p−1 Yj = O (plog p/p). Therefore 1≤i≤p−1 i k=1 k Pp−1 2 p k=1 Yk p p p maxi≥2 |ei| = Op( log p/p) · Op( p/n) = Op( log p/n).

αi 2 2 Second we consider the noise part maxi≥2 log . Since maxi≥2 v = maxi≥2 e = Op(log p/n), we have 1−αi i i

αi w Op(log p/n) 1 max log ∼ log + − log(a1n + 1) if i ≥ 2. i≥2 1 − αi 1 − w log p/n 2

αi Therefore maxi∈Sc log ≺ −κ log(a1n + 1) where κ is any constant between 0 and 1/2. Based on the 1−αi above 2 statements, we have shown

α1 αi P (log > Cn and max log < −Cn) → 1; 1 − α1 i≥2 1 − αi

therefore P (Sˆ = S) → 1. Our procedure has selection consistency property and the algorithm will stop after

one iteration with probability going to 1.

Then we move from the special case to the general case. For the special case ρ∗ = (1, 0, ··· , 0)T , true

covariance matrix is Σ = diag(kρk2 + ξ2, ξ2, ··· , ξ2). Contruct

  Q O  1  Q =   OIp−s

2 ∗ ∗ T 2 as the eigenvector matrix for Σ˜ = kρk ρ (ρ ) + ξ Ip. Q1 is a s × s orthogonal matrix which the first ∗ ∗ ∗ T ˜ T ∗ ∗ column is (ρ1, ρ2, ··· , ρs) . Then Σ = QΣQ , Qρ = ρ˜ , Qvˆ = v˜. Then e˜ = Qe is the transformed error

98 T T T T T T T vector where e = (1 − vˆ1, v2 ) = (1 − vˆ1, vˆ21, vˆ22) . vˆ21 = (ˆv2, ··· , vˆs) and vˆ22 = (ˆvs+1, ··· , vˆp) . Then the transformed error vector is

T  T T  T T T ∗ e˜ = Qe = (1 − vˆ1, vˆ21)Q1 , vˆ22 = (v˜1 , v˜2 ) , v˜ = e˜ + ρ˜ .

2 2 2 2 Therefore maxi∈Sc v˜i = maxi∈Sc e˜i = maxi≥s+1 ei ≤ maxi≥2 ei = Op(log p/n). Then using the same argument we could show

αi w Op(log p/n) 1 max log ≤ log + − log(a1n + 1) ≺ −Cn. i∈Sc 1 − αi 1 − w log p/n 2

T For signal part we need to find the order of (˜vi)1≤i≤s = (1 − vˆ1, vˆ21)Q1 , Decompose Q1 as

  ∗ T ρ1 q1      ρ∗ qT   2 2    ,  ······      ∗ T ρs qs

∗ T for any elementv ˜i (1 ≤ i ≤ s) in v˜, we havev ˜i =v ˆ1ρi + vˆ21qi. According to Cauchy-Schwartz Inequality,

v u s √ T u ∗ 2 X 2 p max kvˆ21qjk ≤ t(1 − (ρj ) ) v˜i ≤ s − 1 max |vˆi| ≤ Op( (s log p)/n). j∈S 2≤i≤s i=2

a.s. Rememberv ˆi → 1, when n is sufficiently large,

2 ∗ T 2 ∗ T 2 min vˆi ≥ min(|vˆ1||ρi | − |vˆ21qj|) ≥ (|vˆ1| min |ρi | − max |vˆ21qj|) i∈S i∈S i∈S i∈S

∗ p 2 ≥ (|vˆ1| min |ρi | − Op( (s log p)/n)) , i∈S

therefore for signal part

∗ p 2 αi w K(|vˆ1| mini∈S |ρi | − Op( (s log p)/n)) 1 log ≥ log + − log(a1n + 1) if i ∈ S, 1 − αi 1 − w log p/n 2

∗ 2 where K are the constant which doesn’t depend on n. If mini∈S(ρi ) ≥ (log p max(s, log a1n))/n, we have

∗ 2 αi w |vˆ1| mini∈S |ρi | √  min log ∼ log + K p − Op( s) − log(a1n + 1)/2 Cn. i∈S 1 − αi 1 − w log p/n

99 Based on the above 2 statements, we’ve shown

αi αi P (min log > Cn and max log < −Cn) → 1; i∈S 1 − αi i∈Sc 1 − αi

therefore P (Sˆ = S) → 1. Our procedure has frequentist selection consistency and the algorithm will stop after one iteration with probability going to 1.

In addition, to prove bayesian selection consistency, notice that

αi −1 min log > Cn ⇐⇒ max(1 − αi) < (1 + exp(Cn)) ; i∈S 1 − αi i∈S

αi −1 max log < −Cn ⇐⇒ max αi < (1 + exp(Cn)) ; i∈Sc 1 − αi i∈Sc

−1 −1 therefore P (maxi∈S(1 − αi) < (1 + exp(Cn)) and maxi∈Sc αi < (1 + exp(Cn)) ) → 1. Using inequality Q P i(1 − pi) ≥ 1 − i pi for any 0 ≤ pi ≤ 1, we have with large probability

∗ Y Y X X 1 − q(Z = Z ) = 1 − αi (1 − αi) ≤ αi + (1 − αi) i∈S i∈Sc i∈S i∈Sc

−1 κ ≤ p(1 + exp(Cn)) ) = p/(1 + a1n) → 0

p with probability going to 1. Therefore q(Z = Z∗) → 1. Our algorithm has Bayesian selection consistency.

D.3 Proof of Theorem 5.2

Theorem 5.2 has the same proof as Theorem 5.1 except the following minor modifications:

T T T Let vˆ = (ˆv1, vˆ2 ) where vˆ2 = (ˆv2, vˆ3, ··· , vˆp) , according to Theorem 4 of Paul (2007),

r c . c  √ vˆ a.s.→ M = 1 − 1 + , if %2/ξ2 − 1 > c. 1 (%2/ξ2 − 1)2 %2/ξ2 − 1

a.s. 2 √ where M is a positive constant between 0 and 1 (instead ofv ˆ1 → 1 in Theorem 1). Since (%/ξ) > 1 + c, p M is strictly larger than 0, we havev ˆ1 = Op(1) andv ˆ1 9 0. √ Besides, by Theorem 2 of Paul (2007), if %2/ξ2 − 1 > c,

 c  λˆ a.s.→ (%2 + ξ2) 1 + . %2/ξ2

So λˆ is bounded by 0 and infinity almost surely.

100 Other parts are the same as the proof of Theorem 5.1.

D.4 Proof of Theorem 5.3

The proof uses the similar method as Johnstone and Lu (2001). We use 2 chi-square large deviation inequal- ities:

2  2 P χn ≤ n(1 − ) ≤ exp(−n /4), 0 ≤  < 1,

2  2 P χn ≥ n(1 + ) ≤ exp(−3n /16), 0 ≤  < 1/2.

2 Define n = bn/ξ . For any index l from 1, 2, ··· , s, we have

 2 2  X 2 2  2 2  P σˆl < σˆ(k) ≤ P σˆi > ξ (1 + n/2) + P σˆl < ξ (1 + n/2) i≥k

2 2 2  2 2 2 2  ≤ (p − k)P ξ χn/n > ξ (1 + n/2) + P (ξ + ρl )χn/n < ξ (1 + n/2)

2  2 2 2 2  ≤ (p − k)P χn/n > 1 + n/2 + P (ξ + ρl )χn/n < ξ (1 + n/2)   2  2 1 + n/2 ≤ (p − k) exp −3nn/64 + P χn/n < 2 2 1 + ρl /ξ  2    3nbn 2 1 + n/2 ≤ (p − k) exp − 4 + P χn/n < 2 2 . 64ξ 1 + ρl /ξ

For the second term, we have

      2 1 + n/2 2 1 + n/2 2 1 + n/2 P χn/n < 2 2 ≤ P χn/n < 2 2 ≤ P χn/n < 1 + ρl /ξ 1 + mini∈S ρi /ξ 1 + n  2   2  nn nbn ≤ exp − 2 = exp − 2 4 . 16(1 + n) 16(1 + n) ξ

Therefore,  2   2   2 2  3nbn nbn P σˆl < σˆ(k) ≤ (p − k) exp − 4 + exp − 2 4 . 64ξ 16(1 + n) ξ

Hence

X  2 2  P (SL ) S) ≤ P σˆl < σˆ(k) l∈S  2   2  3nbn nbn ≤ s(p − k) exp − 4 + s exp − 2 4 64ξ 16(1 + n) ξ  3nb2   nb2  ∼ s(p − k) exp − n + s exp − n . 64ξ4 16ξ4

101 2 Since log(p − k) + log s = o(nbn), the 2 terms go to 0. Therefore P (SL ) S) → 0, P (SL ⊇ S) → 1.

102 References

Abramovich, F., Benjamini, Y., Donoho, D. L., and Johnstone, I. M. (2006). Special invited lecture: Adapting to unknown sparsity by controlling the false discovery rate. The Annals of Statistics, (34):584– 653. Amini, A. A. and Wainwright, M. J. (2009). High-dimensional analysis of semidefinite relaxations for sparse principal components. The Annals of Statistics, pages 2877–2921.

Bickel, P. J. and Levina, E. (2004). Some theory for Fisher’s linear discriminant function,’naive Bayes’, and some alternatives when there are many more variables than observations. Bernoulli, 10(6):989–1010. Bickel, P. J. and Levina, E. (2008). Regularized estimation of large covariance matrices. The Annals of Statistics, 36(1):199–227.

Bickel, P. J., Ritov, Y., and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and dantzig selector. The Annals of Statistics, pages 1705–1732. Blei, D. M., Jordan, M. I., et al. (2006). Variational inference for dirichlet process mixtures. Bayesian analysis, 1(1):121–143.

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, (just-accepted). Brown, L. D. and Greenshtein, E. (2009a). Nonparametric empirical bayes and compound decision ap- proaches to estimation of a high-dimensional vector of normal means. The Annals of Statistics, pages 1685–1704.

Brown, L. D. and Greenshtein, E. (2009b). Nonparametric empirical bayes and compound decision ap- proaches to estimation of a high-dimensional vector of normal means. The Annals of Statistics, pages 1685–1704. B¨uhlmann,P. and Van De Geer, S. (2011). Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media.

Cai, T. and Liu, W. (2012). A direct estimation approach to sparse linear discriminant analysis. Journal of the American Statistical Association, 106(496):1566–1577. Castillo, I., Schmidt-Hieber, J., and Van der Vaart, A. (2015). Bayesian linear regression with sparse priors. The Annals of Statistics, 43(5):1986–2018.

Castillo, I., van der Vaart, A., et al. (2012). Needles and straw in a haystack: Posterior concentration for possibly sparse sequences. The Annals of Statistics, 40(4):2069–2101. d’Aspremont, A., Ghaoui, L. E., Jordan, M. I., and Lanckriet, G. R. (2005). A direct formulation for sparse pca using semidefinite programming. In Advances in neural information processing systems, pages 41–48.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the em algorithm. Journal of the royal statistical society. Series B (methodological), pages 1–38.

103 Dicker, L. H. and Zhao, S. D. (2016). High-dimensional classification via nonparametric empirical bayes and maximum likelihood inference. Biometrika, page asv067. Domingos, P. and Pazzani, M. (1997). On the optimality of the simple bayesian classifier under zero-one loss. Machine learning, 29(2-3):103–130.

Donoho, D. L., Johnstone, I. M., Hoch, J. C., and Stern, A. S. (1992). Maximum entropy and the nearly black object. Journal of the Royal Statistical Society. Series B (Methodological), pages 41–81. Escobar, M. D. and West, M. (1995). Bayesian density estimation and inference using mixtures. Journal of the american statistical association, 90(430):577–588. Fan, J. and Fan, Y. (2008). High dimensional classification using features annealed independence rules. The Annals of Statistics, 36(6):2605–2637. Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association, 96(456):1348–1360. Fan, J. and Lv, J. (2010). A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20(1):101. Friedman, J. H. (1989). Regularized discriminant analysis. Journal of the American statistical association, 84(405):165–175. Gao, C. and Zhou, H. H. (2015). Rate-optimal posterior contraction for sparse pca. The Annals of Statistics, 43(2):785–818.

George, E. and Foster, D. P. (2000). Calibration and empirical bayes variable selection. Biometrika, 87(4):731–747. George, E. I. and McCulloch, R. E. (1993). Variable selection via gibbs sampling. Journal of the American Statistical Association, 88(423):881–889.

Ghosal, S. (1999). Asymptotic normality of posterior distributions in high-dimensional linear models. Bernoulli, 5(2):315–331. Golub, G. H. and Van Loan, C. F. (2012). Matrix computations, volume 3. JHU Press. Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531–537. Golubev, G. K. (2002). Reconstruction of sparse vectors in white gaussian noise. Problems of Information Transmission, 38(1):65–79. Greenshtein, E. and Park, J. (2009). Application of non parametric empirical bayes estimation to high dimensional classification. The Journal of Machine Learning Research, 10:1687–1704. Guan, Y. and Dy, J. (2009). Sparse probabilistic principal component analysis. In Artificial Intelligence and Statistics, pages 185–192. Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67. Hu, J. (2017). Statistical methods for learning sparse features. Ph.D. Thesis in University of Illinois at Urbana-Champaign. Jiang, W. (2007). Bayesian variable selection for high dimensional generalized linear models: convergence rates of the fitted densities. The Annals of Statistics, 35(4):1487–1511.

104 Jiang, W. and Zhang, C.-H. (2009). General maximum likelihood empirical bayes estimation of normal means. The Annals of Statistics, 37(4):1647–1684. Johnstone, I. M. (2001). On the distribution of the largest eigenvalue in principal components analysis. Annals of statistics, pages 295–327.

Johnstone, I. M. and Lu, A. Y. (2001). On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association. Johnstone, I. M. and Silverman, B. W. (2004). Needles and straw in haystacks: Empirical bayes estimates of possibly sparse sequences. Annals of Statistics, (32):1594–1649. Johnstone, I. M. and Silverman, B. W. (2005). Ebayesthresh: R and s-plus programs for empirical bayes thresholding. J. Statist. Soft, 12:1–38. Koenker, R. (2014). A gaussian compound decision bakeoff. Stat, 3(1):12–16. Koenker, R. and Mizera, I. (2014). Convex optimization, shape constraints, compound decisions, and empirical bayes rules. Journal of the American Statistical Association, 109(506):674–685.

Lei, J., Vu, V. Q., et al. (2015). Sparsistency and agnostic inference in sparse pca. The Annals of Statistics, 43(1):299–322. Lo, A. Y. (1984). On a class of bayesian nonparametric estimates: I. Density estimates. The annals of statistics, 12(1):351–357.

Ma, Z. (2013). Sparse principal component analysis and iterative thresholding. The Annals of Statistics, 41(2):772–801. Marcenko, V. A. and Pastur, L. A. (1967). Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik, 1(4):457. Martin, R., Mess, R., Walker, S. G., et al. (2017). Empirical bayes posterior concentration in sparse high- dimensional linear models. Bernoulli, 23(3):1822–1847. Martin, R. and Walker, S. G. (2014). Asymptotically minimax empirical bayes estimation of a sparse normal mean vector. Electronic Journal of Statistics, 8(2):2188–2206. Mitchell, T. J. and Beauchamp, J. J. (1988). Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404):1023–1032. Nakajima, S., Tomioka, R., Sugiyama, M., and Babacan, S. D. (2015). Condition for perfect dimensionality recovery by variational bayesian pca. Journal of Machine Learning Research, 16:3757–3811. Narisetty, N. N. and He, X. (2014). Bayesian variable selection with shrinking and diffusing priors. The Annals of Statistics, 42(2):789–817.

Neal, R. M. and Hinton, G. E. (1998). A view of the em algorithm that justifies incremental, sparse, and other variants. In Learning in graphical models, pages 355–368. Springer. Paul, D. (2007). Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica, pages 1617–1642.

Peterson, C. and Hartman, E. (1989). Explorations of the mean field theory learning algorithm. Neural Networks, 2(6):475–494. Robbins, H. (1951). Asymptotically subminimax solutions of compound statistical decision problems. In Proc. Second Berkeley Symp. Math. Statist. Probab., volume 1, pages 131–148. Univ. California Press, Berkeley.

105 Rovckova, V. and George, E. I. (2014). EMVS: The EM approach to bayesian variable selection. Journal of the American Statistical Association, 109(506):828–846. Rovckova, V. and George, E. I. (2015). Bayesian estimation of sparse signals with a continuous spike-and-slab prior. Submitted manuscript, pages 1–34.

Shao, J. and Deng, X. (2012). Estimation in high-dimensional linear models with deterministic design matrices. The Annals of Statistics, 40(2):812–831. Shao, J., Wang, Y., Deng, X., and Wang, S. (2011). Sparse linear discriminant analysis by thresholding for high dimensional data. The Annals of statistics, 39(2):1241–1265. Shen, D., Shen, H., and Marron, J. S. (2013). Consistency of sparse PCA in high dimension, low sample size contexts. Journal of Multivariate Analysis, 115:317–333. Shen, H. and Huang, J. Z. (2008). Sparse principal component analysis via regularized low rank matrix approximation. Journal of multivariate analysis, 99(6):1015–1034. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288. Tipping, M. E. and Bishop, C. M. (1999). Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622. Vu, V. Q., Lei, J., et al. (2013). Minimax sparse principal subspace estimation in high dimensions. The Annals of Statistics, 41(6):2905–2947.

Walker, S. and Hjort, N. L. (2001). On bayesian consistency. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 63(4):811–821. Wang, J., Liang, F., and , Y. (2016). An ensemble em algorithm for bayesian variable selection. arXiv preprint arXiv:1603.04360.

Xie, M.-g. and Singh, K. (2013). Confidence distribution, the frequentist distribution estimator of a param- eter: a review. International Statistical Review, 81(1):3–39. Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of statistics, pages 894–942.

Zhang, C.-H. and Huang, J. (2008). The sparsity and bias of the lasso selection in high-dimensional linear regression. The Annals of Statistics, (36):1567–1594. Zou, H., Hastie, T., and Tibshirani, R. (2006). Sparse principal component analysis. Journal of computational and graphical statistics, 15(2):265–286.

106