Estimation and Inference for High-Dimensional Gaussian Graphical Models with Structural Constraints

by

Jing Ma

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in The University of Michigan, 2015

Doctoral Committee:
Professor George Michailidis, Co-Chair
Professor Kerby Shedden, Co-Chair
Professor Bin Nan
Professor Ji Zhu

© Jing Ma 2015
All Rights Reserved

For all the people

ACKNOWLEDGEMENTS

First and foremost, I would like to thank my mentor Professor George Michailidis for guiding me through the years. George is a serious thinker. His attitudes towards statistical methodologies and scientific applications have had a great impact on me and ultimately directed me towards an academic career. I am very grateful to him for his kindness and patience in helping me with job applications, correcting my English writing, and introducing me to a field that I feel excited about. I would like to thank Professor Ji Zhu for his insightful comments about my research and presentations, as well as his encouragement through the years. I would also like to thank Professor Kerby Shedden for his extremely helpful instructions on applied statistics and for teaching me effective communication of statistics to scientists. I am very grateful to my collaborator Professor Ali Shojaie for his guidance on statistical computation, and to Professor Bin Nan for serving on my thesis committee. Finally, I would also like to express my gratitude to my parents for their love and support; in particular, I would like to thank my best friend David Jones for always being on my side. He gives me confidence and encouragement in both research and life.

TABLE OF CONTENTS

DEDICATION

ACKNOWLEDGEMENTS

LIST OF FIGURES

LIST OF TABLES

LIST OF ABBREVIATIONS

ABSTRACT

CHAPTER

I. Introduction

1.1 Gaussian Graphical Models
    1.1.1 Nodewise Regression
    1.1.2 Penalized Maximum Likelihood Estimation
    1.1.3 Covariance Estimation based on Undirected Graph
    1.1.4 Sparse Partial Correlation Estimation
    1.1.5 Applications of GGM
1.2 Outline

II. Network-Based Pathway Enrichment Analysis with Incomplete Network Information

2.1 Background
2.2 Network Estimation Under External Information Constraints
2.3 NetGSA with Estimated Network Information
    2.3.1 Efficient Estimation of Model Parameters
    2.3.2 Joint Pathway Enrichment and Differential Network Analysis Using NetGSA
2.4 Simulation Results
2.5 Applications to Genomics and Metabolomics
2.6 Discussion
2.7 Software
2.8 Proof of Theorem II.4
2.9 Proof of Theorem II.8
2.10 Derivation for Newton's Method
2.11 Additional Simulation Results

III. Joint Structural Estimation of Multiple Graphical Models

3.1 Background
3.2 The Joint Structural Estimation Method
    3.2.1 An Illustrative Example
    3.2.2 The General Case
    3.2.3 Choice of Tuning Parameters
3.3 Theoretical Results
    3.3.1 Estimation Consistency
    3.3.2 Graph Selection Consistency
3.4 Performance Evaluation
    3.4.1 Simulation Study 1
    3.4.2 Simulation Study 2
    3.4.3 Simulation Study 3
3.5 Application to Climate Modeling
3.6 Discussion
3.7 Proof of Theorem III.1
    3.7.1 Regression
    3.7.2 Selecting Edge Set
    3.7.3 Refitting
3.8 Proof of Theorem III.2

BIBLIOGRAPHY

LIST OF FIGURES

Figure

2.1 A graph showing the varying structure of pathways 5–8 from null (left) to alternative (right) in Experiment 2. Dashed lines represent edges that are present in only one condition.

3.1 Image plots of the adjacency matrices for all four graphical models. The black color represents presence of an edge. The structured sparsity pattern is encoded in G = {(1, 2), (3, 4), (1, 3), (2, 4)}, i.e. each pair of graphical models in G shares a subset of edges.

3.2 Simulation study 1: left panel shows the image plot of the adjacency matrix corresponding to the shared structure across all graphs. Each black cell indicates presence of an edge. The right panel shows the receiver operating characteristic (ROC) curves for sample size n_k = 50: Graphical Lasso (Glasso) (dotted), Joint Estimation Method - Guo et al. (2011) (JEM-G) (dotdash), Group Graphical Lasso (GGL) (solid), Fused Graphical Lasso (FGL) (dashed), Joint Structural Estimation Method (JSEM) (longdash).

3.3 Simulation study 2: image plots of the adjacency matrices from all graphical models. Graphs in the same row share the same connectivity pattern at the bottom right block, whereas graphs in the same column share the same pattern at remaining locations.

3.4 Simulation study 2: ROC curves for sample size n_k = 100: Glasso (dotted), JEM-G (dotdash), GGL (solid), FGL (dashed), JSEM (longdash). The misspecification ratio ρ varies from (left to right): 0, 0.2, 0.4 (top row) and 0.6, 0.8, 1 (bottom row).

3.5 The selected 27 locations based on climate classification. The solid line separates the south and north of North America and corresponds to latitude 39 N.

3.6 Estimated climate networks at the six distinct climate zones using JSEM, with edges shared across all locations solid and differential edges dashed.

3.7 Estimated climate networks at the six distinct climate zones using JEM-G, with edges shared across all locations solid and differential edges dashed.

3.8 Estimated climate networks at the six distinct climate zones using GGL, with edges shared across all locations solid and differential edges dashed.

LIST OF TABLES

Table

2.1 Deviance measures for network estimation in experiments 1 and 2. FPR(%), false positive rate in percentage; FNR(%), false negative rate in percentage; MCC, Matthews correlation coefficient; Fnorm, Frobenius norm loss. The best cases are highlighted in bold.

2.2 Powers based on false discovery rate with q∗ = 0.05 in experiment 1. 0.2/0.8 refer to NetGSA with 20%/80% external information; E refers to NetGSA with the exact networks; T refers to the true power; GSA-s/GSA-c refer to Gene Set Analysis with self-contained/competitive null hypothesis in 1000 permutations, respectively. True powers are highlighted in bold.

2.3 Powers based on false discovery rate with q∗ = 0.05 in experiment 2. 0.2/0.8 refer to NetGSA with 20%/80% external information; E refers to NetGSA with the exact networks; T refers to the true power; GSA-s/GSA-c refer to Gene Set Analysis with self-contained/competitive null hypothesis in 1000 permutations, respectively. True powers are highlighted in bold.

2.4 Timings (in seconds) for NetGSA. Density (%) refers to density of studied networks in percentage; Newton's method refers to NetGSA implemented with Newton's method; L-BFGS-B refers to NetGSA implemented with the method of Byrd et al. (1995).

2.5 p-values for the pathways in the metabolomics data, with false discovery rate correction at q∗ = 0.01. NetGSA refers to Network-based Gene Set Analysis; GSA-s/GSA-c refer to Gene Set Analysis with self-contained/competitive null hypothesis in 3000 permutations, respectively.

2.6 p-values for the pathways in the microarray data, with false discovery rate correction at q∗ = 0.001. NetGSA refers to Network-based Gene Set Analysis; GSA-s/GSA-c refer to Gene Set Analysis with self-contained/competitive null hypothesis in 3000 permutations, respectively.

2.7 Deviance measures for network estimation in experiments 3 and 4. FPR(%), false positive rate in percentage; FNR(%), false negative rate in percentage; MCC, Matthews correlation coefficient; Fnorm, Frobenius norm loss. The best cases are highlighted in bold.

2.8 Powers based on false discovery rate with q∗ = 0.05 in experiment 3. 0.2/0.8 refer to NetGSA with 20%/80% external information; E refers to NetGSA with the exact networks; T refers to the true power; GSA-s/GSA-c refer to Gene Set Analysis with self-contained/competitive null hypothesis in 1000 permutations, respectively. True powers are highlighted in bold.

2.9 Powers based on false discovery rate with q∗ = 0.05 in experiment 4. 0.2/0.8 refer to NetGSA with 20%/80% external information; E refers to NetGSA with the exact networks; T refers to the true power; GSA-s/GSA-c refer to Gene Set Analysis with self-contained/competitive null hypothesis in 1000 permutations, respectively. True powers are highlighted in bold.

3.1 Performance of different regularization methods for estimating graphical models in Simulation Study 1: average FP, FN, SHD, F1 and FL (SE) for sample size n_k = 50. The best cases are highlighted in bold.

3.2 Performance of different regularization methods for estimating graphical models in Simulation Study 2: average FP, FN, SHD, F1 and FL (SE) for sample size n_k = 100. The best cases are highlighted in bold.

3.3 Performance of JSEM and thresholded JSEM with misspecified groups (ρ = 0.3): average FP, FN, SHD, F1 and FL (SE) for sample size n_k = 200. The better cases are highlighted in bold.

LIST OF ABBREVIATIONS

AER average aerosol optical depth

AUC area under the curve

CLD cloud cover

CH4 methane

CO carbon monoxide

CO2 carbon dioxide

DTR diurnal temperature range

FGL Fused Graphical Lasso

FRS frost days

GGL Group Graphical Lasso

GGM Gaussian graphical models

Glasso Graphical Lasso

H2 hydrogen

JEM-G Joint Estimation Method - Guo et al. (2011)

JGL Joint Graphical Lasso

JSEM Joint Structural Estimation Method

NetGSA Network-based Gene Set Analysis

PET potential evapotranspiration

PRE precipitation

ROC receiver operating characteristic

SOL solar radiation

TMN minimum temperature

TMP temperature

TMX maximum temperature

VAP vapor pressure

WET rainday counts

ABSTRACT

Estimation and Inference for High-Dimensional Gaussian Graphical Models with Structural Constraints

by Jing Ma

Co-Chairs: Professor George Michailidis and Professor Kerby Shedden

This work discusses several aspects of estimation and inference for high-dimensional Gaussian graphical models and consists of two main parts. The first part considers network-based pathway enrichment analysis based on incomplete network information. Pathway enrichment analysis has become a key tool for biomedical researchers to gain insight into the underlying biology of differentially expressed genes, proteins and metabolites. We propose a constrained network estimation framework that combines network estimation based on cell- and condition-specific high-dimensional Omics data with information from existing databases. The resulting pathway topology information is subsequently used to provide a framework for simultaneous testing of differences in expression levels of pathway members, as well as their interactions. We study the asymptotic properties of the proposed network estimator and the test for pathway enrichment, investigate its small sample performance in simulated data, and illustrate it on two cancer data sets.

The second part of the thesis is devoted to reconstructing multiple graphical models simultaneously from high-dimensional data. We develop methodology that jointly estimates multiple Gaussian graphical models, assuming that there exists prior information on how they are structurally related. The proposed method consists of two steps: in the first one, we employ neighborhood selection to obtain estimated edge sets of the graphs using a group lasso penalty. In the second step, we estimate the nonzero entries in the inverse covariance matrices by maximizing the corresponding Gaussian likelihood. We establish the consistency of the proposed method for sparse high-dimensional Gaussian graphical models and illustrate its performance using simulation experiments. An application to a climate data set is also discussed.

CHAPTER I

Introduction

1.1 Gaussian Graphical Models

Graphical models are probabilistic models that capture conditional dependence relationships between a set of random variables. Specifically, the random variables are represented by the nodes of a graph, while its edges reflect the relationships amongst them. An important class of such models is the Gaussian one, where the random variables are assumed to be jointly normally distributed. For this model, relationships between variables are captured through the zero entries of the inverse covariance (or precision) matrix. Specifically, let X be a p-dimensional multivariate normal random vector where

X = (X_1, \ldots, X_p) \sim \mathcal{N}(\mu, \Sigma).

For 1 ≤ i ≠ j ≤ p, X_i and X_j are said to be conditionally independent given all the remaining variables if the corresponding entry in the precision matrix Ω = Σ^{-1} is zero. Denote by G = (V, E) the underlying graph. An edge between the nodes X_i and X_j in the graph implies that they are conditionally dependent, and corresponds to a non-zero entry in the precision matrix. To identify the graph, one only needs to recover the support of the corresponding inverse covariance matrix.

Earlier work on the problem includes Dempster (1972a), where thresholding to zero of the elements of the precision matrix is employed in a low-dimensional setting, thus achieving a balance between model fit and model complexity. When the number of variables p is relatively small, Drton and Perlman (2004) suggest pairwise hypothesis testing of the partial correlations to select a model with a conservative overall confidence level. Their approach requires the sample covariance matrix to be positive definite and is not appropriate in high-dimensional settings where the number of variables p is much larger than the number of observations n. More recently, there has been a large amount of work on estimating Gaussian graphical models (GGM) from high-dimensional data subject to sparsity constraints, an attractive feature that reduces the number of parameters to be estimated and produces more interpretable results.

1.1.1 Nodewise Regression

Meinshausen and Bühlmann (2006) introduced a penalized regression approach to estimate the skeleton (edge set) of the underlying graph. Specifically, for each node i = 1, . . . , p in the graphical model, consider the optimal prediction of X_i as a linear combination of the remaining variables:

\theta_i = \operatorname*{arg\,min}_{\theta_i \in \mathbb{R}^p:\, \theta_{ii} = 0} \mathbb{E}\Big( X_i - \sum_{j \neq i} \theta_{ij} X_j \Big)^2,

where θ_{ij} (j ≠ i) are the regression coefficients. The matrix (θ_{ij}) is determined by the inverse covariance matrix Ω = (ω_{ij}). Specifically, it holds that θ_{ij} = −ω_{ij}/ω_{ii} for all j ≠ i. The set of nonzero coefficients of θ_i is thus the same as the set of nonzero entries in the row vector of ω_{ij} (j ≠ i), which defines the set of neighbors of node i. Using an ℓ1-penalized regression, the authors estimated the neighborhood for each node and combined the estimates to obtain the underlying graph. They further established that the nodewise regression approach yields consistent estimation of the skeleton (edge set) of sparse high-dimensional graphs, under the regime p = O(n^α) and the neighborhood stability condition.
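As a concrete illustration, nodewise regression can be sketched in R with the glmnet package; the fixed value of λ and the OR/AND rule for symmetrizing the estimated neighborhoods are simplifications of the tuning and combination schemes discussed above.

    library(glmnet)

    # X: n x p data matrix. Returns a 0/1 adjacency matrix estimate.
    nodewise_skeleton <- function(X, lambda = 0.1, rule = c("or", "and")) {
      rule <- match.arg(rule)
      p <- ncol(X)
      B <- matrix(0, p, p)  # B[i, j]: lasso coefficient of X_j when regressing X_i
      for (i in seq_len(p)) {
        fit <- glmnet(X[, -i], X[, i], lambda = lambda)
        B[i, -i] <- as.numeric(coef(fit))[-1]  # drop the intercept
      }
      nz <- B != 0
      # Combine the p estimated neighborhoods into one graph estimate.
      if (rule == "or") (nz | t(nz)) * 1 else (nz & t(nz)) * 1
    }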

1.1.2 Penalized Maximum Likelihood Estimation

Work on penalized log-likelihood approaches includes Yuan and Lin (2007) and Banerjee et al. (2008), who employed the following objective function:

\min_{\Omega \succ 0} \Big\{ \operatorname{tr}(\hat{\Sigma}\Omega) - \log\det(\Omega) + \lambda \sum_{i \neq j} |\omega_{ij}| \Big\},    (1.1)

where Σ̂ is the empirical covariance matrix and λ the regularization parameter. The ℓ1 penalty leads to the desired sparsity, provided that an appropriate penalty parameter is chosen. Friedman et al. (2008) developed a simple and fast algorithm, the graphical lasso, which uses a block coordinate descent approach to solve (1.1).

Parallel to the algorithmic work, there has been a large body of theoretical work establishing norm consistency and model selection consistency properties of the proposed estimator. For the solution Ω̂ to the problem (1.1), Rothman et al. (2008) established that its convergence rate in the Frobenius norm is

O\Big( \sqrt{\|\Omega^-\|_0 \log p / n} \Big)

for appropriately chosen λ, where ‖Ω^-‖_0 represents the number of non-zero off-diagonal entries in Ω. Raskutti et al. (2009) studied sufficient conditions for model selection consistency, i.e. the ℓ1-regularized Gaussian maximum likelihood estimator of (1.1) recovers the edge set of the underlying graph with high probability, under the incoherence condition on the Fisher information of the model.
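In R, a problem of the form (1.1) can be solved with the glasso package; a minimal sketch (the penalty value is arbitrary here and would be tuned in practice, and the data are placeholders):

    library(glasso)

    set.seed(1)
    X <- matrix(rnorm(200 * 10), 200, 10)  # placeholder data
    S <- cov(X)
    fit <- glasso(S, rho = 0.1)        # rho plays the role of lambda in (1.1)
    Omega_hat <- fit$wi                # estimated inverse covariance matrix
    adj <- (abs(Omega_hat) > 1e-8) * 1
    diag(adj) <- 0                     # estimated edge set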

1.1.3 Covariance Estimation based on Undirected Graph

The work by Zhou et al. (2011) combines the nodewise regression approach with the idea of thresholding and maximum likelihood refitting to estimate the covariance matrix and its inverse. The proposed method consists of the following two steps:

• Infer the edge set Ê through the regression coefficients θ̂_{ij}, where the θ̂_{ij} are estimated using the threshold lasso algorithm (Zhou, 2010).

• Refit the model via maximum likelihood

\min_{\Omega \succ 0} \big\{ \operatorname{tr}(\hat{\Sigma}\Omega) - \log\det\Omega \big\},

subject to the constraints in Ê.

It is argued that the first step requires a much weaker restricted eigenvalue condition (Bickel et al., 2009) for consistent recovery of the edge set, i.e. the conditional dependency relationships among variables, compared to the neighborhood stability condition (Meinshausen and Bühlmann, 2006). The proposed method is further shown to yield fast convergence rates with respect to the operator and Frobenius norms for the covariance matrix and its inverse.
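A sketch of the refitting step in R: the glasso function accepts a zero argument listing entries of the inverse covariance constrained to zero, so setting rho = 0 yields the maximum likelihood refit restricted to an estimated edge set. Here adj is assumed to be the 0/1 adjacency matrix from the first step and S the empirical covariance (with n > p so the unpenalized problem is well defined).

    library(glasso)

    # Entries constrained to zero: all non-edges in the estimated graph.
    nonedges <- which(adj == 0 & upper.tri(adj), arr.ind = TRUE)
    refit <- glasso(S, rho = 0, zero = nonedges)
    Omega_refit <- refit$wi  # MLE of the precision matrix given the edge set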

1.1.4 Sparse Partial Correlation Estimation

In the case of Gaussian graphical models, the partial correlation between nodes i and j is ρ_{ij} = −ω_{ij}/√(ω_{ii} ω_{jj}). Thus, ρ_{ij} is nonzero if and only if ω_{ij} is nonzero, or equivalently, nodes i and j are conditionally dependent given all the remaining ones. Moreover, the partial correlation coefficient quantifies the correlation/interaction between two variables while conditioning on others. Peng et al. (2009) introduced SPACE, which directly estimates the partial correlations by taking into account the symmetric nature of the problem. Their approach aims at solving the following optimization problem

\min_{\Omega} \left\{ \frac{1}{2} \sum_{i=1}^{p} \Big\| X_i - \sum_{j \neq i} \rho_{ij} \sqrt{\frac{\omega_{jj}}{\omega_{ii}}}\, X_j \Big\|_2^2 + \lambda \sum_{1 \le i < j \le p} |\rho_{ij}| \right\},

which, after proper rearrangement of variables, becomes an ℓ1-regularized lasso problem.

1.1.5 Applications of GGM

Gaussian graphical models have found applications in diverse fields including analysis of Omics data (Perroud et al., 2006; Pujana et al., 2007; Putluri et al., 2011), as well as reconstruction of gene regulatory networks (Wille et al., 2004; Dehmer and Emmert-Streib, 2008, chapter 6).

An interesting and important application that requires the knowledge of the underlying network is Network-based Gene Set Analysis (NetGSA) (Shojaie and Michailidis, 2009, 2010). In biomedical research, a pathway is defined as a set of functionally related genes, proteins or metabolites. Pathway enrichment analysis has become a key tool for biomedical researchers to gain insight into the underlying biology of differentially expressed genes, proteins and metabolites. It reduces complexity and provides a system-level view of changes in cellular activity in response to treatments and/or progression of disease states. Methods that use pathway network information have been shown to outperform simpler methods that only take into account pathway membership. However, despite significant progress in understanding the association amongst members of biological pathways, and expansion of databases containing information about interactions of biomolecules, the existing network information may be incomplete or inaccurate, and is not cell-type or disease condition-specific. The work in Chapter II allows researchers to perform pathway enrichment analysis based on the incomplete biomolecular interactions in databases.

1.2 Outline

Chapter II discusses network-based pathway enrichment analysis with incomplete network information. We propose a method that explicitly incorporates external structural information, available in carefully curated biological databases, in the widely used Gaussian graphical model. The resulting estimates are then incorporated into Network-based Gene Set Analysis (NetGSA), which provides a rigorous statistical framework for simultaneous testing of differences in expression levels of pathway members as well as their interactions, sometimes referred to as differential network biology (Ideker and Krogan, 2012).

In the second part, we consider the computational aspect of NetGSA. The main bottleneck in applying the NetGSA methodology arises from the estimation of mixed effects linear model parameters – specifically the variance components – for thousands of variables. We develop efficient computational methods for estimation of these parameters based on a profile likelihood approach and thus allow researchers to tackle much larger scale problems, involving thousands of genes as opposed to the few hundred that was the case with the previously available algorithm.

Chapter III studies joint structural estimation of multiple graphical models, motivated by an important application in biomedical research. For example, gene networks for different subtypes of a certain disease share common patterns; i.e. there are shared common links, as well as shared absence of links, between the models (the subtypes' networks). While separate estimation of individual models without taking the known pattern into consideration ignores the common structure, estimating one single model would mask the differences that could prove critical in understanding the subtypes. The available approaches usually assume that all graphical models are globally related. However, in many settings different relationships between subsets of the node sets exist between different graphical models; such an application is discussed in Section 3.5. We introduce a method that allows one to specify complex substructures from external knowledge. Using the framework of Gaussian graphical models, we formulate the problem as jointly estimating the dependence relationships between the nodes, encoded in the inverse covariance matrices, subject to the substructure constraints. Theoretical analysis indicates the proposed approach consistently recovers the shared and individual structures with a faster convergence rate compared to existing methods, under some technical conditions. Moreover, a thresholded variation of the proposed estimator outperforms existing methods even when the prior substructures are slightly misspecified.

CHAPTER II

Network-Based Pathway Enrichment Analysis

with Incomplete Network Information

2.1 Background

Recent advances in high throughput technologies have transformed biomedical research by enabling comprehensive monitoring of complex biological systems. By profiling the activity of different molecular compartments (genomic, proteomic, metabolomic), one can delineate complex mechanisms that play a key role in biological processes or the development of distinct phenotypes. These technological advances have been accompanied by methodological ones, the most notable being adopting a systems perspective in analyzing such systems. Pathway analysis represents a key component in the analysis process, and has been used successfully in generating new biological hypotheses, as well as in determining whether specific pathways are associated with particular phenotypes. Examples include analysis of pathways involved in initiation and progression of cancer and other complex diseases (Cui et al., 2006; Wilson et al., 2010), discovering novel transcriptional effects and co-regulated genes (Palomero et al., 2006; Huarte et al., 2010; Green et al., 2011), and understanding the basic biological processes in model organisms (Gottwein et al., 2007; Baur et al., 2006; Houstis et al., 2006). See Huang et al. (2008) for additional examples of applications.

Pathway analysis methods have evolved since the seminal work by Subramanian et al. (2005) that vastly popularized the approach. As pointed out in the review paper by Khatri et al. (2012), earlier techniques such as over-representation analysis (Al-Shahrour et al., 2005; Beißbarth and Speed, 2004) and gene set analysis (Subramanian et al., 2005; Efron and Tibshirani, 2007) treat each pathway as a set of biomolecules. These methods assess whether members of a given pathway have higher than expected levels of activity, either by counting the number of differentially active members, or by also accounting for the relative rankings of pathway members and/or the magnitude of their associations with the phenotype. On the other hand, more recent and statistically powerful methods take into consideration the interactions between the biomolecules. These interactions are increasingly available from carefully curated biological databases, including the Kyoto Encyclopedia of Genes and Genomes (Kanehisa and Goto, 2000), Reactome (Joshi-Tope et al., 2003), RegulonDB (Huerta et al., 1998) and BioCarta (Nishimura, 2001).

A network topology based method that exhibits superior statistical power in identifying differential activity of pathways was proposed in Shojaie and Michailidis (2009, 2010). The Network-based Gene Set Analysis (NetGSA) method also allows testing for potential changes in the network structure under different experimental or disease conditions. However, it requires a priori knowledge of interactions of the members of pathways, which despite rapid progress remains highly incomplete and occasionally unreliable (see e.g. Zaki et al. (2013) and references therein). Moreover, existing network information often determines molecular interactions in the normal state of the cell, and does not provide any insight into condition/disease-specific alterations in interactions amongst components of biological systems.

On the other hand, increased availability of large sample collections of high-dimensional Omics data (e.g. from The Cancer Genome Atlas, http://cancergenome.nih.gov/), coupled with the development of network estimation techniques based on graphical models (Lauritzen, 1996), offers the possibility to validate and complement existing network information, and to obtain estimates of condition-specific molecular interactions in the cell. Such an approach for leveraging existing knowledge to enhance the analysis of low signal-to-noise biological datasets was advocated in Ideker et al. (2011).

The first contribution of this paper is the development of an efficient algorithm for constrained network estimation, together with establishing the consistency of the obtained estimates, as a function of existing network information. Estimation of high dimensional networks subject to hard (or soft) constraints on conditional dependence relationships among random variables represents a canonical problem in the context of graphical models, and the proposed method for addressing this problem is of independent interest. By incorporating the condition specific network estimates from the proposed method into the NetGSA framework, we also provide a rigorous statistical framework for assessing alterations in biological pathways, sometimes referred to as differential network biology (Ideker and Krogan, 2012).

A second objective of this study is to scale up the NetGSA estimation algorithm to very large size networks. The main bottleneck in applying the NetGSA methodology arises from the estimation of mixed effects linear model parameters – specifically the variance components – for thousands of variables. We develop efficient computational methods for estimation of these parameters based on a profile likelihood approach. In particular, we employ a Cholesky factorization of the covariance matrices to speed up matrix inversions, and use it to develop an efficient algorithm based on Newton's method with backtracking line search (Boyd and Vandenberghe, 2004, page 487) for step size selection. To supply reliable starting points for this algorithm, we further develop an approximate method-of-moments-type estimator.

This study is strongly motivated by our work on metabolic profiling of cancer and the identification of enriched pathways. Unlike gene expression data, identification and measurement of metabolites by mass spectrometry techniques is challenging, resulting in reliable measurements for a few hundred metabolites, and hence incomplete coverage of the underlying biochemical pathways. The small number of metabolites in each pathway, and the incomplete coverage of the metabolites, particularly hinder the application of over-representation and gene set analysis methods in this setting. In our experience, only topology-based pathway enrichment techniques, such as NetGSA, are capable of reliably delineating pathway activity, as illustrated in Section 2.5.

The remainder of the paper is organized as follows. Section 2.2 presents network estimation based on a Gaussian graphical model under external information constraints and establishes the consistency of the method, while Section 2.3 discusses scaling up the estimation algorithm for the NetGSA mixed effects model to large scale networks. The performance of the developed methodology is evaluated in Section 2.4 and is illustrated on two real data sets in Section 2.5. Section 2.6 concludes the chapter with some discussion. Technical details and additional simulation results are provided in Sections 2.8, 2.9, 2.10 and 2.11, respectively.

2.2 Network Estimation Under External Information Constraints

Gaussian graphical models (Lauritzen, 1996, Chapter 5) are widely used in biological applications to model the interactions among components of biological systems (Dehmer and Emmert-Streib, 2008, chapter 6). Specifically, the partial correlation structure corresponding to a molecular network can be represented by an undirected graph G = (V, E), with V and E being the set of nodes (biomolecules) and edges (interactions), respectively. The edge set E corresponds to the p × p precision, or inverse covariance, matrix Ω, whose nonzero elements ω_{ii′} refer to edges between nodes i and i′, and indicate that i and i′ are conditionally dependent given all other nodes in the network. Further, the magnitude of the partial correlation A_{ii′} = −ω_{ii′}/√(ω_{ii} ω_{i′i′}) determines the strength (positive or negative) of the conditional association between the respective nodes.

As discussed in Section 2.1, the availability of large collections of samples for different disease states and biological processes, together with carefully curated information of biomolecular interactions, enables the estimation of network structures within the setting of Gaussian graphical models. However, the presence of this externally given network information provides a novel and unexplored modification of the corresponding network estimation problem. Denote by E^c the set of node pairs not connected in the network, i.e. ω_{ii′} = 0. Then, the external information can be represented by the following two subsets

E_1 = \{(i, i') \in E : i \neq i',\ \omega_{ii'} \neq 0\}, \qquad E_0 = \{(i, i') \in E^c : i \neq i',\ \omega_{ii'} = 0\}.

In words, E_1 contains known edges, while E_0 contains node pairs where it is known that no interaction exists between them. The external information available in E_1 does not imply exact knowledge of the magnitude of ω_{ii′} nor A_{ii′}.

Suppose we observe an m × p data matrix Z = (Z_1, . . . , Z_p), where each row represents one sample from a p-variate Gaussian distribution N(0, Ω^{-1}). Our goal is then to estimate the network structure, or equivalently the precision matrix Ω, subject to the external information encoded in E_1 and E_0. It follows immediately that the partial correlation matrix A = I_p − D^{-1/2} Ω D^{-1/2}, where D = diag(Ω) and I_p is the p-dimensional identity matrix. When E_1 = E and E_0 = E^c, the problem becomes that of covariance selection (Dempster, 1972b), which has been studied extensively in the literature. However, to the best of our knowledge, the problem of estimating Ω (and the partial correlation matrix A) when E_1 and E_0 only contain partial information has not been investigated before.

In this section, we assume that the m observations used for estimating condition-specific networks are separate from those used for pathway enrichment analysis (highlighted by the use of the Z_i's and m to denote the random variables and sample size, respectively). However, our theoretical analysis in the next section indicates that when sample sizes are large enough, network estimation can be performed using the same set of samples used for pathway enrichment. The framework proposed in this section reduces the potential bias in small sample settings, and takes advantage of the additional publicly available samples, in lieu of reliable network information. On the other hand, while this problem is seemingly similar to matrix completion (Candes and Recht, 2009; Cai et al., 2010), the two problems are fundamentally different in nature. In particular, the goal of matrix completion is to complete the remaining entries from the partially observed m × p matrix Z, under some structural assumptions on Z, such as low-rankness (Candes and Recht, 2009). On the other hand, in the setting of graphical models, the entries of the adjacency matrix are estimated based on observations on the nodes of the graph.

of associations between nodes. In the absence of any external information, the `1- penalized negative log-likelihood estimate of Ω is obtained by solving

n ˆ o arg min tr(ΩΣ) − log det Ω + λkΩk1 , (2.1) Ω 0

ˆ P wherein Σ is the empirical covariance matrix of the data, kΩk1 = i6=i0 |ωii0 | denotes

the `1 norm of the parameters, and λ is the regularization parameter. In the pres- ence of external information, the problem can be cast as the following constrained

13 optimization one

n o min tr(ΩΣ)ˆ − log det Ω , (2.2) Ω 0 subject to

X 0 0 |ωii0 | ≤ t, ωii0 = 0, (i, i ) ∈ E0, ωii0 6= 0, (i, i ) ∈ E1. 0 0 i6=i , (i,i )∈/E0∪E1

In the following, we present a two-step procedure to solve the constrained optimization problem (2.2). The proposed approach combines neighborhood selection (Meinshausen and Bühlmann, 2006) with constrained maximum likelihood estimation. It exploits the fact that the estimated neighbors of each node using neighborhood selection coincide with the nonzero entries of the inverse covariance matrix (Friedman et al., 2008). Specifically, in neighborhood selection (Meinshausen and Bühlmann, 2006), the structure of the network is estimated by finding the optimal set of predictors when regressing the random variable Z_i corresponding to node i ∈ V on all other variables, using an ℓ1-penalized linear regression. The coefficients for this optimal prediction, θ_i, are closely related to the entries of the inverse covariance matrix: for all i′ ≠ i, θ_{ii′} = −ω_{ii′}/ω_{ii}. The set of nonzero coefficients of θ_i is thus the same as the set of nonzero entries in the row vector of ω_{ii′} (i′ ≠ i), which defines the set of neighbors of node i.

Let J_1^i and J_0^i denote the sets of (potential) neighbors of node i for which external information is available: J_1^i is the set of nodes which are known to be in the neighborhood of i, and J_0^i is the set of nodes which are known to be not connected to i. Let Z_{-i} denote the submatrix obtained by removing the ith column of Z. Assume all columns of Z are centered and scaled to have norm 1. Denote by S_+^p the set of all p × p positive definite matrices and S_E^p = {Ω ∈ R^{p×p} : ω_{ii′} = 0 for all (i, i′) ∉ E, where i ≠ i′}. The proposed algorithm proceeds in two steps.

(i) Estimate the network structure Ê. For every node i, find

\hat{\theta}_i = \operatorname*{arg\,min}_{\theta_i \in \mathbb{R}^p:\, \theta_{ii} = 0} \frac{1}{m} \|Z_i - Z_{-i}\theta_i\|_2^2 + 2\lambda \sum_{i' \neq i} t_{i'} |\theta_{ii'}|,    (2.3)

where the penalty weights are t_{i′} = 0 for i′ ∈ J_1^i, t_{i′} = ∞ for i′ ∈ J_0^i, and t_{i′} = 1 elsewhere. An edge (i, i′) is estimated if θ̂_{ii′} ≠ 0 or θ̂_{i′i} ≠ 0.

(ii) Given the structure Ê, estimate the inverse covariance matrix Ω̂ by

\hat{\Omega} = \operatorname*{arg\,min}_{\Omega \in \mathcal{S}_+^p \cap \mathcal{S}_{\hat{E}}^p} \big\{ \operatorname{tr}(\hat{\Sigma}\Omega) - \log\det\Omega \big\}.    (2.4)
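As an illustration, step (i) can be sketched in R with glmnet, whose penalty.factor argument accepts both 0 (no shrinkage, for the known neighbors in J_1^i) and Inf (exclusion, for the known non-neighbors in J_0^i); step (ii) is then the zero-constrained refit sketched in Section 1.1.3. This is a minimal sketch under these assumptions, not the packaged implementation.

    library(glmnet)

    # Z: m x p matrix with centered columns scaled to norm 1.
    # J1, J0: lists of length p; J1[[i]] / J0[[i]] hold the indices of known
    # neighbors / known non-neighbors of node i (the external information).
    constrained_neighborhoods <- function(Z, J1, J0, lambda) {
      p <- ncol(Z)
      Theta <- matrix(0, p, p)
      for (i in seq_len(p)) {
        idx <- (1:p)[-i]                 # columns of Z[, -i] in original labels
        w <- rep(1, p - 1)               # default penalty weight t = 1
        w[idx %in% J1[[i]]] <- 0         # known edges: unpenalized
        w[idx %in% J0[[i]]] <- Inf       # known non-edges: excluded
        fit <- glmnet(Z[, -i], Z[, i], lambda = lambda,
                      penalty.factor = w, standardize = FALSE)
        Theta[i, -i] <- as.numeric(coef(fit))[-1]
      }
      # An edge (i, i') is kept if either regression selects it.
      ((Theta != 0) | (t(Theta) != 0)) * 1
    }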

Remark II.1. In this algorithm, the first step estimates the coefficients θ_i for optimal prediction, such that the penalization respects the external information constraints. In practice, one can adjust the weights t_{i′} (i′ ≠ i) to allow for uncertainty in the amount of information available regarding the network of interest. The second step focuses on estimation of the magnitude of the nonzero entries in the precision matrix Ω, conditional on the estimated network topology. The optimization problems in both steps are convex and can be solved efficiently using existing software.

The proposed estimator enjoys nice theoretical properties under certain regularity conditions. Before presenting the main result, we introduce some additional notation.

Let Σ_0 be the covariance matrix in the true model and Ω_0 = Σ_0^{-1}. For i = 1, . . . , p, let s^i = ‖θ_i‖_0 − |J_1^i|, where ‖θ_i‖_0 = #{i′ : θ_{ii′} ≠ 0} is the ℓ0 norm. Hence, s^i represents the number of nonzero coordinates after excluding the known ones in each regression. Write s = max_{i=1,...,p} s^i and S = Σ_{i=1}^p ‖θ_i‖_0. For a subset J ⊂ {1, . . . , p}, let Z_J be the submatrix obtained by removing the columns whose indices are not in J. We make the following assumptions.

Assumption II.2. There exist φ_1, φ_2 > 0 such that

0 < \phi_2 \le \phi_{\min}(\Sigma_0) \le \phi_{\max}(\Sigma_0) \le 1/\phi_1 < \infty.

Moreover, there exists ς > 0 such that for all i, var(Z_i | Z_{-i}) = 1/ω_{0,ii} ≥ ς².

Assumption II.3. Let J and J̃ be disjoint subsets of {1, . . . , p}. Denote by P_J the projection matrix onto the column space of Z_J. There exists κ(s) > 0 such that

\min_{|\tilde{J}| \le s}\ \min_{\substack{\delta \in \mathbb{R}^p \\ \|\delta_{\tilde{J}^c}\|_1 \le 3\|\delta_{\tilde{J}}\|_1}} \frac{\|(I_p - P_J) Z \delta\|_2}{\sqrt{m}\, \|\delta_{\tilde{J}}\|_2} \ \ge\ \kappa(s) > 0.    (2.5)

Assumption II.2 is a regularity condition that explicitly excludes singular or near-singular covariance matrices. Assumption II.3 is adapted from the restricted eigenvalue assumptions in Bickel et al. (2009) to allow for the presence of external information on relevant indices in the subset J. For example, if J = J_1^i and J̃ = {1, . . . , p} \ ({i} ∪ J), then (2.5) says that the eigenvalues of the projected matrix (I_p − P_J)Z on the restricted set {δ ∈ R^p : |J̃| ≤ s, ‖δ_{J̃^c}‖_1 ≤ c_0 ‖δ_{J̃}‖_1} are bounded away from 0.

Let 0 ≤ r < 1 represent the percentage of available external information, which is defined as (|E_0| + |E_1|)/{p(p − 1)/2}. Next, we state our main result.

Theorem II.4. Suppose Assumption II.2 and Assumption II.3 with κ(2s) are satisfied. For constants c_1 > 4 and 0 < k_1 < 1, assume also that

16\, c_1 \sqrt{\frac{(1 - r)\, S \log(p - rp)}{m}} \le k_1 \phi_1 \kappa^2(2s),    (2.6)

where S is the total number of nonzero parameters excluding the diagonal. Consider Ω̂ defined in (2.4). Then, with probability at least 1 − p^{2 − c_1^2/8}, under appropriately chosen λ, we have

\|\hat{\Omega} - \Omega_0\|_2 \le \|\hat{\Omega} - \Omega_0\|_F = O\left( \sqrt{\frac{S \log(p - rp)}{m}} \right).    (2.7)

Remark II.5. The convergence rate in (2.7) indicates an improvement of the order of {S log(1 − r)^{-1}/m}^{1/2} in the presence of external information. The assumption in (2.6) is a regularity condition that ensures the positive definiteness of Ω_0 when restricted to the estimated edge set under the chosen λ. The proof utilizes techniques from Bickel et al. (2009) and Zhou et al. (2011) and is given in Section 2.8.

Let A_0 be the partial correlation matrix in the true model, i.e. A_0 = I_p − D_0^{-1/2} Ω_0 D_0^{-1/2}, where D_0 = diag(Ω_0). The following corollary is an immediate result of Theorem II.4.

Corollary II.6. Let the assumptions in Theorem II.4 be satisfied. Assume further that S = o(m/log(p − rp)). For Ω̂ defined in (2.4), let Â be the corresponding partial correlation matrix. Then, with probability at least 1 − p^{2 − c_1^2/8}, under appropriately chosen λ, we have

\|\hat{A} - A_0\|_2 = o(1).

Remark II.7. The result in Corollary II.6 implies that under certain regularity conditions, the error in the condition-specific network estimate Â is negligible. This proves essential for establishing power properties of NetGSA with estimated network information, as shown in the next section. The proof of Corollary II.6 is available in Section 2.8.

The tuning parameter λ in the first step of the proposed algorithm is important in selecting the correct structure of the network, which will further influence the magnitude of the network interactions in the second step. Accurate estimation of these magnitudes is crucial for topology-based pathway enrichment methods. We propose to select λ via cross validation to minimize the squared prediction error from all p regressions. Specifically, the cross validation score for the ith regression (2.3) is defined as

\mathrm{CV}_i(\lambda) = \sum_{j=1}^{m} \{ Z_{ji} - Z_{j,-i}\hat{\theta}_i(j) \}^2,

where θ̂_i(j) is the estimated regression coefficient vector after removing the jth sample of (Z_i, Z_{-i}). We minimize CV(λ) = Σ_{i=1}^p CV_i(λ) to select the optimal λ.
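A direct leave-one-out implementation of this score reads as follows (a slow but faithful sketch; fit_theta stands for any routine, such as the constrained fit sketched above, that returns θ̂_i from the data with one sample removed):

    # Leave-one-out CV score for the i-th regression at a given lambda.
    cv_score_i <- function(Z, i, lambda, fit_theta) {
      m <- nrow(Z)
      err <- 0
      for (j in seq_len(m)) {
        theta_j <- fit_theta(Z[-j, , drop = FALSE], i, lambda)  # coefficients without sample j
        err <- err + (Z[j, i] - drop(Z[j, -i] %*% theta_j))^2
      }
      err
    }
    # CV(lambda) = sum of cv_score_i over i = 1..p; choose the minimizing lambda.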

2.3 NetGSA with Estimated Network Information

In this section, we discuss how (condition-specific) estimates of biomolecular interactions from Section 2.2 can be incorporated into the NetGSA framework to obtain a rigorous inference procedure for both pathway enrichment and differential network analysis. To this end, we formally define the NetGSA methodology based on undirected Gaussian graphical models, address estimation of variance parameters in the corresponding mixed linear model framework in Section 2.3.1, and present an updated algorithm that significantly improves computational speed and stability of the method. In Section 2.3.2, we discuss how the constrained network estimation procedure of Section 2.2 can be combined with the updated estimation procedure of Section 2.3.1 to rigorously infer differential activities of biological pathways, as well as changes in their network structures.

2.3.1 Efficient Estimation of Model Parameters

Consider p genes (proteins/metabolites) whose activity levels across n samples are organized in a p × n matrix D. In the framework of NetGSA, the effect of genes (proteins/metabolites) in the network is captured using a latent variable model (Shojaie and Michailidis, 2010). Denote by Y an arbitrary column of the data matrix, and decompose the observed data into signal, X, plus noise, ε, i.e. Y = X + ε.

The latent variable model assumes that the signal X follows a multivariate normal distribution with partial correlation matrix A. Decompose the signal as X = Λγ such that γ ∼ N_p(µ, σ_γ^2 I_p) and Λ is the lower triangular matrix that satisfies ΛΛ^T = (I_p − A)^{-1}. Assume that γ and ε are independent and ε is also normally distributed; specifically, ε ∼ N_p(0, σ_ε^2 I_p). The NetGSA model can be summarized in vector notation as

Y = \Lambda\gamma + \varepsilon.    (2.8)
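To illustrate the model, a small R sketch that builds Λ from a given partial correlation matrix A (assumed such that I_p − A is invertible with positive definite inverse) and simulates one sample from (2.8); all names are illustrative.

    # A: p x p partial correlation matrix with zero diagonal.
    simulate_netgsa_sample <- function(A, mu, sigma_gamma, sigma_eps) {
      p <- nrow(A)
      M <- solve(diag(p) - A)          # (I_p - A)^{-1}
      Lambda <- t(chol(M))             # lower triangular, Lambda %*% t(Lambda) = M
      gamma <- rnorm(p, mean = mu, sd = sigma_gamma)  # gamma ~ N_p(mu, sigma_gamma^2 I)
      eps   <- rnorm(p, mean = 0,  sd = sigma_eps)    # eps ~ N_p(0, sigma_eps^2 I)
      drop(Lambda %*% gamma) + eps     # Y = Lambda gamma + eps, model (2.8)
    }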

The NetGSA methodology allows for more complex models, including time course observations. For expositional clarity, we present the methodology in the setting of two experimental conditions and consider the general case where A^{(k)} ≠ A^{(k′)}. Let Y_j^{(k)} (j = 1, . . . , n; k = 1, 2) be the jth sample in the expression data under condition k (the jth column of the data matrix D), with the first n_1 columns of D corresponding to condition 1 (control) and the remaining n_2 = n − n_1 columns to condition 2 (treatment). Denote by Λ^{(k)} the influence matrix and µ^{(k)} the mean vector under condition k. The NetGSA framework considers a latent variable model of the form

Y_j^{(1)} = \Lambda^{(1)}\mu^{(1)} + \Lambda^{(1)}\gamma_j + \varepsilon_j, \quad (j = 1, \ldots, n_1),
Y_j^{(2)} = \Lambda^{(2)}\mu^{(2)} + \Lambda^{(2)}\gamma_j + \varepsilon_j, \quad (j = n_1 + 1, \ldots, n).

Here, γ_j is the vector of (unknown) random effects, and ε_j is the vector of random errors. They are independent and normally distributed with mean 0 and covariance matrices σ_γ^2 I_p and σ_ε^2 I_p, respectively.

Inference in NetGSA requires estimation of the mean parameters µ^{(1)} and µ^{(2)}, which depend on estimates of the variance components σ_γ^2 and σ_ε^2. In practice, the variance components can be estimated via maximum likelihood or restricted maximum likelihood, which can be computationally demanding for large networks. To ensure stability, the earlier version of NetGSA considered profiling out one of the variance components and implemented an algorithm from Byrd et al. (1995), which uses a limited-memory modification of the Broyden–Fletcher–Goldfarb–Shanno quasi-Newton method to optimize the profile log-likelihood. However, the above implementation has a few issues. The first issue is its high computational cost due to the inefficient evaluation of matrix inverses and determinants. Moreover, the algorithm from Byrd et al. (1995) requires finite values of the objective function within the supplied box constraints, which is often not satisfied, even after the constraints are adjusted to be within a small range of the optimal estimate. This is particularly the case when the underlying networks are large. To extend the applicability of NetGSA, we consider using Newton's method for estimating the variance parameters based on the profile log-likelihood (see Section 2.10 for more details) to improve both the computational efficiency and stability. In particular, we make the following two key improvements for the implementation of Newton's method.

First, it is clear that Var(Y_j^{(k)}) = σ_ε^2 {I_p + τΛ^{(k)}(Λ^{(k)})^T} = σ_ε^2 Σ^{(k)}, where τ = σ_γ^2/σ_ε^2. Since the profile log-likelihood, as well as its gradient and Hessian matrix with respect to τ, all depend on Σ^{(k)} (k = 1, 2) and their inverses, we choose to invert from their Cholesky decompositions Σ^{(k)} = U^T U, where U is an upper triangular matrix. The inversion of the triangular matrices results in significant speedup, and the inverses of the original matrices can then be computed as (Σ^{(k)})^{-1} = (U^{-1})(U^{-1})^T. In the meantime, we also simplify the calculation of the determinant of Σ^{(k)} since det(Σ^{(k)}) = det(U)^2, which is necessary for evaluating the profile log-likelihood.

Second, the quality of the starting point as well as the step sizes will both affect the convergence of Newton's method. To select a good starting point, we use a method-of-moments-type estimate of the variance components. Specifically, denote the residuals R_j^{(k)} = Y_j^{(k)} − Λ^{(k)}µ̂^{(k)} for j = 1, . . . , n, where µ̂^{(k)} is the estimate of µ^{(k)}. Assume that there is a single variance σ_ε^2 that applies to all ε_j (j = 1, . . . , n) and that the variances of the γ_j are different. The variance of R_j can be decomposed as (σ_γ^2)_j + σ_ε^2. We then take the minimum of Var(R_j) as the estimate of σ_ε^2 and the average of the remaining variances as the estimate of σ_γ^2. Their ratio is used as the initial value for τ. The approximation runs very fast and does not add much computational cost to the method. To find the appropriate step sizes, we use backtracking line search as described in Boyd and Vandenberghe (2004, page 464).

2 timize the profile log-likelihood and returns an estimate of τ. Estimates ofσ ˆγ and

2 σˆε follow immediately (see Section 2.10). Once estimates of the variance components are available, one can derive estimates of the mean parameters µˆ(1) and µˆ(2) similarly as in Shojaie and Michailidis (2009, 2010).

2.3.2 Joint Pathway Enrichment and Differential Network Analysis Using NetGSA

To test for pathway enrichment with NetGSA, let b be a binary row vector determining the membership of genes in a pre-specified pathway P. Shojaie and Michailidis (2009) show that the contrast vector (Searle, 1971) ℓ = (−bΛ^{(1)} · b, bΛ^{(2)} · b) – with · denoting the Hadamard product – satisfies the constraint 1^T ℓ = 0 and tests the enrichment of pathway P. The advantage of this contrast vector is that it isolates influences from nodes outside the pathways of interest. Let β be the concatenated vector of µ^{(1)} and µ^{(2)}. The null hypothesis of no pathway activity vs the alternative of pathway activation then becomes

H_0: \ell\beta = 0, \qquad H_1: \ell\beta \neq 0.    (2.9)

This general framework allows for testing of pathway enrichment in arbitrary subnetworks, while automatically adjusting for overlap among pathways. In addition, the above choice of contrast vector ℓ accommodates changes in the network structure. Such changes have been found to play a significant role in the development and initiation of complex diseases (Chuang et al., 2012), and NetGSA is currently the only method that systematically combines the changes in expression levels and network structures when testing for pathway enrichment. However, the applicability of the existing NetGSA framework (Shojaie and Michailidis, 2009, 2010) is limited by the assumption of a known network structure. Here we show that NetGSA with estimated network information provides a valid inference framework for pathway enrichment and differential network analysis. The significance of individual contrast vectors in (2.9) can be tested using the test statistic

TS = \frac{\ell\hat{\beta}}{\mathrm{SE}(\ell\hat{\beta})},    (2.10)

where SE(ℓβ̂) represents the standard error of ℓβ̂ and β̂ is the estimate of β. Both ℓ and SE(ℓβ̂) depend on the underlying networks, which are estimated using data from the two experimental conditions. Under the null hypothesis, TS follows approximately a t-distribution whose degrees of freedom can be estimated using the Satterthwaite approximation method (Shojaie and Michailidis, 2010).
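For concreteness, a short R sketch of forming the contrast vector ℓ and the statistic in (2.10); Lambda1, Lambda2, beta_hat and cov_beta are placeholders for quantities produced by the fitted mixed linear model.

    # b: binary membership vector of length p for pathway P.
    make_contrast <- function(b, Lambda1, Lambda2) {
      # ell = (-b Lambda^(1) . b, b Lambda^(2) . b), "." = Hadamard product
      c(-as.vector(b %*% Lambda1) * b, as.vector(b %*% Lambda2) * b)
    }

    # Test statistic (2.10): TS = ell beta_hat / SE(ell beta_hat).
    ts_stat <- function(ell, beta_hat, cov_beta) {
      est <- sum(ell * beta_hat)
      est / sqrt(drop(t(ell) %*% cov_beta %*% ell))
    }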

For k = 1, 2, let Z^{(k)} of dimension m_k × p be the data matrix under condition k. Denote by S_k the number of nonzero off-diagonal entries in the partial correlation matrix A_0^{(k)} from the true model, and by r_k the percentage of external information. We obtain the following result.

Theorem II.8. Let the assumptions in Theorem II.4 be satisfied and S_k = o(m_k/log(p − r_k p)) under each condition k (k = 1, 2). Consider the inverse covariance matrices Ω̂^{(k)} estimated from (2.3) and (2.4) of Section 2.2. Then the test statistic in (2.10) based on the corresponding networks Â^{(k)} is an asymptotically most powerful unbiased test for (2.9).

Remark II.9. Theorem 2.1 of Shojaie and Michailidis (2010) says that NetGSA is robust to uncertainty in network information. Specifically, Shojaie and Michailidis (2010) show that if the error in network information Δ_{A_0^{(k)}} = Â^{(k)} − A_0^{(k)} satisfies ‖Δ_{A_0^{(k)}}‖_2 = o_P(1), then NetGSA is an asymptotically most powerful unbiased test for (2.9). The result in Theorem II.8 establishes this property for (partially) estimated networks using the consistency of our proposed network estimation procedure in Theorem II.4 and Corollary II.6. A detailed proof can be found in Section 2.9.

2.4 Simulation Results

We present two experiments to demonstrate the performance of the proposed network estimation procedure, as well as its impact on NetGSA. We refer readers to Section 2.11 for additional simulation scenarios – in particular settings with a large number of variables p – and discussion.

Our first experiment is based on an undirected network of size p = 64. There are 8 subnetworks, each corresponding to a subgraph/pathway of 8 members. Under the null, all subnetworks have the same topology, which was generated from a scale-free random graph, and all nodes have mean expression values 1. To allow for interactions between subnetworks, there is a 20% probability for subnetworks to connect to each other. Under the alternative, the proportion of nodes that have mean changes of magnitude 1 is 0%, 40%, 40% and 50% for subnetworks 1–4. The same applies to subnetworks 5–8.

Our second experiment considers a network of size p = 160 with a similar design, except that there are 20 members in each subnetwork. Mean expression values for all nodes are the same under the null. Under the alternative, we allow 0%, 40%, 60% and 80% of the nodes to have mean changes of magnitude 0.3 for subnetworks 1–4. Subnetworks 5–8 follow the same pattern. Here an important comparison is to see whether NetGSA is able to detect small but coordinated changes in mean expression levels.

In both experiments, we also allowed the structures in subnetworks 5–8 under the alternative to differ from their null equivalents by 10%, to simultaneously test pathway enrichment and differential network structure. Fig. 2.1 shows the slight modification in the topology for subnetworks 5–8, from the null to the alternative, in the second experiment.


Figure 2.1: A graph showing the varying structure of pathways 5–8 from null (left) to alternative (right) in Experiment 2. Dashed lines represent edges that are present in only one condition.

To illustrate how external information about the network structure facilitates the estimation, we let the percentage of information r vary from 0 to 1. When r is less than 1, we estimated the adjacency matrices using the proposed two-step procedure and filled in the nonzero edges with the estimated weights. When full knowledge of the network topology is given (r = 1), one only needs to apply the second step to estimate the edge weights. Table 2.1 compares the estimated networks with the true model under several deviance measures based on 200 replications, with a sample size m = 40 for experiment 1 and m = 100 for experiment 2. The Matthews correlation coefficient exhibits a clear increasing trend, while the Frobenius norm loss shows a clear decreasing trend, both indicating the improvement in estimation when the percentage of external information r increases.

Table 2.1: Deviance measures for network estimation in experiments 1 and 2. FPR(%), false positive rate in percentage; FNR(%), false negative rate in percentage; MCC, Matthews correlation coefficient; Fnorm, Frobenius norm loss. The best cases are highlighted in bold.

                                p = 64                            p = 160
                 r    FPR(%)  FNR(%)   MCC   Fnorm    FPR(%)  FNR(%)   MCC   Fnorm
  Null          0.0    10.80    9.83   0.42   0.76      5.74    1.49   0.41   0.41
                0.2     9.69   10.54   0.44   0.73      5.08    1.50   0.53   0.38
                0.8     3.52    4.72   0.67   0.46      1.77    0.98   0.64   0.22
  Alternative   0.0    10.04    8.62   0.44   0.68      4.72    1.01   0.45   0.37
                0.2     9.02    8.37   0.46   0.65      4.15    1.28   0.47   0.35
                0.8     3.13    2.78   0.70   0.40      1.46    0.62   0.68   0.20

Next, we evaluated the performance of NetGSA in detecting pathway enrichment by comparing it with Gene Set Analysis (Efron and Tibshirani, 2007), which tests either a competitive or a self-contained null hypothesis. While a self-contained null hypothesis permutes the samples and compares the gene set in the pathway with itself, a competitive null hypothesis permutes the genes and compares the set of genes in the pathway with a set of genes not in the pathway. Gene Set Analysis recommends using the competitive null approach to take into consideration the distribution of individual gene set scores, which are used to determine the test statistics. Tables 2.2 and 2.3 present, respectively, the estimated powers for each pathway in the two experiments from 200 replicates, given the differences in mean expression levels and/or subnetwork structures described above. Here we used 16 samples for each condition in experiment 1 and 40 in experiment 2, which are different from the datasets used for network estimation. The powers were calculated as the proportion of replicates that show differential changes, based on the false discovery rate controlling procedure in Benjamini and Hochberg (1995) with a q-value of 0.05. For NetGSA, we looked at scenarios when there is 20% and 80% external structural information, and used the estimated networks to detect enrichment for each pathway. We also included the scenario when the exact networks with correct edge weights are provided, in which case only the variance components and mean expression values are estimated from the mixed linear model. True powers for each pathway were calculated when all unknown parameters were substituted with their corresponding known values.

Table 2.2: Powers based on false discovery rate with q∗ = 0.05 in experiment 1. 0.2/0.8 refer to NetGSA with 20%/80% external information; E refers to NetGSA with the exact networks; T refers to the true power; GSA-s/GSA-c refer to Gene Set Analysis with self-contained/competitive null hypothesis in 1000 permutations, respectively. True powers are highlighted in bold.

                              p = 64
  Pathway    0.2     0.8      E       T     GSA-s   GSA-c
     1      0.03    0.03    0.01    0.05    0.07    0.00
     2      0.20    0.14    0.04    0.14    0.10    0.00
     3      0.50    0.56    0.60    0.78    0.59    0.01
     4      0.83    0.75    0.93    0.98    0.70    0.22
     5      0.35    0.37    0.08    0.13    0.09    0.00
     6      0.49    0.37    0.34    0.47    0.28    0.00
     7      0.71    0.71    0.87    0.92    0.67    0.07
     8      0.73    0.68    0.85    0.91    0.60    0.07

As shown in Tables 2.2 and 2.3, results from NetGSA with the exact networks agree with the true powers in both experiments, reflecting low powers for pathways 1 and 2, slightly higher powers for 5 and 6 due to the change of pathway topology, high powers for 3 and 4 due to the change of mean expression levels, and the highest powers for pathways 7 and 8 due to changes in both mean and structure. When the exact networks are unknown, we can still see improvement in the estimated powers for pathways 2 and 3 in experiment 1, and for pathways 2, 3, 4 and 5 in experiment 2, as the percentage of external information increases from 20% to 80%. Comparing the estimated powers for the same pathway in the two experiments also confirms our hypothesis that NetGSA is able to identify large changes in only a few genes of the pathway, as well as weak but coordinated changes in the pathway. In contrast, Gene Set Analysis with the competitive null approach fails to identify most of the differentially expressed pathways. Gene Set Analysis with the self-contained null hypothesis mostly correctly identifies the pathways that are significantly differentially expressed, although with lower powers for pathways 3, 4, 7 and 8 in experiment 1 and pathway 6 in experiment 2.

26 Table 2.3: Powers based on false discovery rate with q∗ = 0.05 in experiment 2. 0.2/0.8 refer to NetGSA with 20%/80% external information; E refers to NetGSA with the exact networks; T refers to the true power; GSA- s/GSA-c refer to Gene Set Analysis with self-contained/competitive null hypothesis in 1000 permutations, respectively. True powers are highlighted in bold. p = 160 Pathway 0.2 0.8 E T GSA-s GSA-c 1 0.01 0.02 0.02 0.05 0.06 0.07 2 0.45 0.35 0.20 0.22 0.34 0.03 3 0.58 0.60 0.72 0.73 0.81 0.00 4 0.64 0.67 0.88 0.94 0.94 0.05 5 0.68 0.55 0.26 0.32 0.12 0.03 6 0.73 0.71 0.62 0.66 0.34 0.01 7 0.79 0.77 0.98 0.99 0.88 0.01 8 0.83 0.81 0.99 1.00 0.95 0.10

that are significantly differentially expressed, although with lower powers for pathways 3, 4, 7 and 8 in experiment 1 and 6 in experiment 2. Finally, to evaluate the computational efficiency of NetGSA with the updated algorithm based on Newton’s method, we compared it with the earlier version of NetGSA implemented with an algorithm from Byrd et al. (1995). Four different scenarios were considered, including the two experiments described above and another two from Section 2.11.The comparison was based on the average elapsed time of NetGSA in 100 replicates. All timings were carried out under R version 3.0.2 on a Intel Xeon 2.00 GHz processor. Table 2.4 presents the results. In general, we see NetGSA with the updated algorithm runs significantly faster (two times or more) than the previous implementation. The updated implementation is also more stable in terms of evaluating the profile log-likelihood and its gradient, which is especially important when the underlying network is large. In contrast, the earlier version with the method from Byrd et al. (1995) failed to run successfully for large p because the gradient of the profile log-likelihood was evaluated to be infinite within the supplied box constraints.

27 Table 2.4: Timings (in seconds) for NetGSA. Density (%) refers to density of studied networks in percentage; Newton’s method refers to NetGSA implemented with Newton’s method; L-BFGS-B refers to NetGSA implemented with the method of Byrd et al. (1995). p Density (%) (1) Newton’s method (2) L-BFGS-B Ratio of (2) to (1) 64 3.42 0.23 0.40 1.74 160 1.29 1.87 7.08 3.79 160 5.16 1.53 6.34 4.14 400 1.13 22.61 NA NA

2.5 Applications to Genomics and Metabolomics

In this section, we discuss applications of the proposed NetGSA to genomic and metabolomic data to demonstrate its potential in revealing biological insights. The metabolomics data set (Putluri et al., 2011) examines changes in the metabolic pro- file between 58 cancer and adjacent benign tissue specimens through an untargeted mass spectrometry data acquisition strategy. There are two groups of tissue speci- mens, with 31 samples from the cancer class and 28 from the benign class. The total number of metabolites detected is 63. Here we focused on estimating the network of metabolic interactions, enhanced by information gleaned from the Kyoto Encyclo- pedia of Genes and Genomes (Kanehisa and Goto, 2000). To select the estimated networks for both conditions, we performed 5-fold cross validation. We also tested for differential activity of biochemical pathways extracted from the Kyoto Encyclo- pedia of Genes and Genomes using the same set of data. Shown in Table 2.5 are estimated p-values after false discovery rate correction with a q-value of 0.01 for the significant pathways selected from NetGSA. These identified pathways include those that describe altered utilization of amino acids and their aromatic counterparts, as well as metabolism of fatty acids and intermediates of tricarboxylic acid cycle (TCA) which were followed up for biological insights in the original study Putluri et al. (2011). Among all the selected pathways, fatty acid biosynthesis and phenylalanine,

28 tyrosine and tryptophan biosynthesis were not identified by Gene Set Analysis with the self-contained null hypothesis. On the other hand, Gene Set Analysis with the competitive null (the recommended setting) failed to report any pathway as being sig- nificantly enriched. This again confirms our hypothesis that incorporating pathway topology information allows sophisticated enrichment methods in detecting important regulatory pathways.

Table 2.5: p-values for the pathways in the metabolomics data, with false discov- ery rate correction at q∗ = 0.01. NetGSA refers to Network-based Gene Set Analysis; GSA-s/GSA-c refer to Gene Set Analysis with self- contained/competitive null hypothesis in 3000 permutations, respectively. Pathway NetGSA GSA-s GSA-c Fatty acid biosynthesis < 0.001 1.000 1.000 Purine metabolism 0.009 0.009 0.612 Pyrimidine metabolism 0.001 < 0.001 0.395 Glycine, serine and threonine metabolism < 0.001 0.001 0.672 Tryptophan metabolism < 0.001 < 0.001 0.338 Phenylalanine, tyrosine and tryptophan biosynthesis 0.002 1.000 1.000 beta-Alanine metabolism < 0.001 < 0.001 0.338 Aminoacyl-tRNA biosynthesis 0.004 < 0.001 0.458 ABC transporters 0.004 < 0.001 0.624

For the second application, we consider data from Subramanian et al. (2005), which consists of gene expression profiles of 5217 genes for 62 normal and 24 lung cancer patients. We excluded genes that are not present in the 186 pathways from the Kyoto Encyclopedia of Genes and Genomes data base as well as those which do not have recorded network information, which leaves us with 1416 genes. We then performed 5-fold cross validation to estimate the underlying interaction networks for both normal and lung cancer conditions based on the external topology information from the BioGRID Database. To test for pathway enrichment, we considered a subset of pathways from the Kyoto Encyclopedia data base that describe signaling and biochemical mechanisms and restricted their membership to be at least 5, so that Gene Set Analysis could be

29 applicable. This reduces the number of pathways tested to 61. Table 2.6 presents the p-values for the significant pathways identified from all three methods based on false discovery rate correction at 0.001, sorted with respect to results from using NetGSA. It turns out that Gene Set Analysis does not consider any of the pathways as differ- entially active, whichever null hypothesis is used. In comparison, the small p-values from NetGSA suggest these 15 pathways could be of interest for further investiga- tion. Of particular biological interest is the identification of the TGF-beta signaling pathway, that has been linked to biological mechanisms for onset and progression of lung cancer (see e.g. Ischenko et al. (2014) and references therein).

Table 2.6: p-values for the pathways in the microarray data, with false discov- ery rate correction at q∗ = 0.001. NetGSA refers to Network-based Gene Set Analysis; GSA-s/GSA-c refer to Gene Set Analysis with self- contained/competitive null hypothesis in 3000 permutations, respectively. Pathway NetGSA GSA-s GSA-c Glycolysis / Gluconeogenesis < 0.001 0.360 0.520 Citrate cycle (TCA cycle) < 0.001 0.457 0.357 Fructose and mannose metabolism < 0.001 0.308 0.408 Galactose metabolism < 0.001 0.404 0.466 Alanine, aspartate and glutamate metabolism < 0.001 0.404 0.565 Tyrosine metabolism < 0.001 0.483 0.441 beta-Alanine metabolism < 0.001 0.404 1.000 Glutathione metabolism < 0.001 0.283 0.357 Ether lipid metabolism < 0.001 0.338 0.379 ErbB signaling pathway < 0.001 0.035 0.249 TGF-beta signaling pathway < 0.001 0.404 0.518 VEGF signaling pathway < 0.001 0.360 0.441 NOD-like receptor signaling pathway < 0.001 0.308 0.427 RIG-I-like receptor signaling pathway 0.001 0.308 0.357 B cell receptor signaling pathway < 0.001 0.338 0.441

2.6 Discussion

This chapter introduces a constrained network estimation method for incorpo- rating externally available interaction information based on high-dimensional Omics

30 data. Under mild assumptions on sample sizes, the proposed approach yields reliable condition-specific estimates for the underlying networks, and can also conveniently accommodate uncertainty in external information. In our simulations, we notice that the proposed two-step procedure is more robust than the one-step constrained

maximum likelihood estimation (a functionality offered in the R-package glasso) in recovering the partial correlations, because the latter requires sophisticated specifica- tion of tuning parameters to satisfy the positive definiteness property of the estimate while taking into consideration the structural constraints.

Another alternative for recovering the underlying network is to use space (Peng et al., 2009) that utilizes the symmetric nature of the partial correlation matrix. By incorporating the external structural information, the original lasso problem becomes a generalized lasso and can be solved by existing software. In the framework of NetGSA, it is recommended that the expression data D for testing pathway activity and the data Z for estimating the partial correlation networks are two separate data sets in order to reduce potential bias. It is important to have sufficient samples in Z for reliable estimation of the underlying networks. The expression data D can be of a much smaller size compared to Z. The choice of Λ as the Cholesky factor of the covariance matrix is mainly for interpretation purpose. One can also permute the nodes in the network and obtain similar results on testing for gene set enrichment. As detailed in Shojaie and Michailidis (2010), the NetGSA methodology is a gen- eral framework that can be extended to situations where more than two experimental conditions are considered. The underlying networks can be the partial correlations among variables of interest, as discussed in the current context, and can also be the physical interactions among different components of the system. The updated NetGSA algorithm enables pathway enrichment analysis at a much larger scale, sig- nificantly enhancing the applicability of the method.

31 2.7 Software

The proposed method has been implemented in the R-package netgsa available on CRAN.

2.8 Proof of Theorem II.4

˜ To prove our main results, we need some additional notations. Define Ω0 = ˆ diag(Ω0) + Ω0,E∩Eˆ, where E and E are the true and the estimated edge set, respec- ˜ 0 tively. By definition, Ω0 and A0 will be different at position (i, i ) only when the edge (i, i0) is falsely rejected. We first derive an upper bound for the size of Eˆ and ˜ kΩ0 − Ω0kF . To do this, we show that the regression problem (2.3) is essentially a lasso problem, and then invoke the oracle inequalities from Theorem 7.2 of Bickel

et al. (2009). To simplify the notation, we drop the superscript i for sets J0,J1 in the

i i ith regression, but they should be understood as J0,J1, respectively. ˜ Let J = V \{{i} ∪ J0 ∪ J1} represent the set of indices for which there is no

T −1 T information available. Denote by PJ1 = ZJ1 (ZJ1 ZJ1 ) ZJ1 the projection onto the

column space of ZJ1 . The following lemma is needed in the proof of Theorem II.11 below.

i P Lemma II.10. For i = 1, . . . , p, denote ξ = Zi − i06=i θii0 Zi0 , where θi is the optimal prediction coefficient vector in the ith regression. Consider the event

( s ) T i c1 log(p − rp) Fi := Z : kZJ˜(Ip − PJ1 )ξ /mk∞ ≤ 2 mA0,ii

with a constant c1 > 4, where ω0,ii is the ith diagonal element of the true inverse

p 2 T 2−c1/8 covariance matrix Ω0. Define the event F = i=1 Fi. Then P(F) > 1 − p .

The proof of Lemma II.10 will be provided shortly. Denote by Λmax the maximal eigenvalue of ZT Z/m. Conditional on event F, we have the following results on

32 ˆ ˜ controlling the size of E and the Frobenius norm of the deviance, kΩ0 − Ω0kF .

Theorem II.11. Suppose all conditions in Theorem II.4 are satisfied. Then on event F, for appropriately chosen λ, we have

64Λ |Eˆ| ≤ max (1 − r)S + rS, (2.11) κ2(s)

and r S log(p − rp) kΩ˜ − Ω k ≤ c ≤ k φ , (2.12) 0 0 F 3 m 1 1 √ 2 where c3 = 16c1 1 − r/κ (2s).

Remark II.12. The result indicates that the cardinality of the estimated edge set is upper bounded by a function of r, the percentage of the external information. The bound for |Eˆ| also depends on the restricted eigenvalue κ(s), which is necessarily positive by the assumption that κ(2s) > 0. Two extreme cases occur when (i) r = 0, i.e. we do not observe any information, thus reducing problem (2.3) to the original neighborhood selection in Meinshausen and B¨uhlmann (2006); (ii) r = 1, i.e. the exact network topology is known and hence Eˆ = E. On the other hand, the upper ˜ bound for kΩ0 − Ω0kF decreases as r increases, i.e. when more external information becomes available. However, since the coefficients also need to be estimated, this deviance always stays positive, even when r = 1.

Proof of Theorem II.11. Recall that PJ1 is the projection matrix onto the column ˜ space of ZJ1 . Let Y = (Ip − PJ1 )Zi be the projection of Zi onto the orthogonal space ˜ of ZJ1 and Z = (Ip − PJ1 )ZJ˜. With some algebra, the problem (2.3) is equivalent to solving 1 ˜ ˜ 2 min kY − ZθJ˜k2 + 2λkθJ˜k1, (2.13) θJ˜ m which is a lasso problem. It suffices to focus mainly on the set J˜, as false positive and negative errors will only occur on this set.

33 To apply Theorem 7.2 of Bickel et al. (2009), we also need to bound the maximum eigenvalue of the matrix Z˜ T Z˜/m. Consider the eigendecomposition of the projection

T Ip − PJ1 = UDU , where D is the diagonal matrix composed of eigenvalues and U

is orthogonal. As Ip − PJ1 is also a projection matrix, the diagonals of D are either 0 or 1. It then follows that

˜ T ˜ T T T T φmax(Z Z/m) = φmax(ZJ˜UDU ZJ˜/m) ≤ φmax(ZJ˜UU ZJ˜/m)

T ≤ φmax(ZJ˜ZJ˜/m) ≤ Λmax.

Recall si is the number of nonzero coordinates after excluding the known ones in

i each regression and s = maxi s . Under the assumption that κ(2s) > 0, we also have that κ(si) ≥ κ(s) > 0 for si ≤ s. Let θˆi be the lasso estimator in (2.13) with J˜

log(p − rp)1/2 λ = c1 (2.14) mω0,ii

for c1 > 4. Conditioned on event F, we can invoke Theorem 7.2 of Bickel et al. (2009) and obtain simultaneously for all i,

64Λ kθˆ k ≤ max si, (2.15) i,J˜ 0 κ2(s)

and r i ˆ 16c1 s log(p − rp) kθi,J˜ − θi,J˜k2 ≤ 2 . (2.16) ω0,iiκ (2s) m

i i Combining (2.15) with the number of known edges s1 as given in J1, we get

p p p X 64Λmax X X |Eˆ| ≤ {kθˆ k + |J i|} ≤ si + si . i,J˜ 0 1 κ2(s) 1 i=1 i=1 i=1

The upper bound in (2.11) follows immediately, since by definition the number of

Pp i Pp i known and unknown edges are i=1 s1 = rS and i=1 s = (1 − r)S, respectively.

34 ˜ To control kΩ0 − Ω0kF , recall that for every i, ω0,ii0 = −θii0 ω0,ii. Using the bound in (2.16), we have

p p ˜ 2 X X 2 X 2 X ˆ 2 kΩ0 − Ω0kF = (θii0 ω0,ii) = ω0,ii |θii0 − θii0 | i=1 0 c i=1 0 c i ∈J(θi)∩J(θˆi) i ∈J(θi)∩J(θˆi) p  2 X 16c1 (1 − r)S log(p − rp) ≤ ω2 kθ − θˆ k2 ≤ . 0,ii i,J˜ iJ˜ 2 κ2(2s) m i=1

The last inequality in (2.12) follows from condition (2.6) in Theorem II.4.

Proof of Lemma II.10. For every i, it is easy to verify that ξi is normally distributed

1/2 T i with mean 0 and variance 1/ω0,iiIm. Define random variables Υii0 = (ω0,ii/m) Zi0 ξ

0 T for i 6= i . Then, Zi0 Zi0 /m = 1 implies that Υii0 ∼ N (0, 1). Let λ be defined as in

T i T i (2.14). Using the fact that |Zi0 (Ip −PJ1 )ξ /m| is stochastically smaller than |Zi0 ξ /m| for all i0 ∈ J˜ and an elementary bound on the tails of Gaussian distributions

p c X X T i  P(F ) ≤ P {|Zi0 (Ip − PJ1 )ξ /m| > λ/2} i=1 i0∈J˜ p p X X  1/2  X X  2 ≤ P |Υii0 | > (mω0,ii) λ/2 ≤ exp −mω0,iiλ /8 i=1 i0∈J˜ i=1 i0∈J˜

2  2 2−c1/8 ≤ p(p − rp) exp −c1 log(p − rp)/8 ≤ p .

2−c2/8 Therefore, P(F) > 1 − p 1 .

With Lemma II.10 and Theorem II.11, we are ready to prove our main results in Theorem II.4. The following proof is adapted from Zhou et al. (2011).

Proof of Theorem II.4. Consider Aˆ defined in (2.4). It suffices to show that on the event F ˆ ˜ 1/2 kΩ − Ω0kF = O {S log(p − rp)/m} ,

35 since by triangle inequality and Theorem II.11, we can conclude

ˆ ˆ ˜ ˜ 1/2 kΩ − Ω0kF ≤ kΩ − Ω0kF + kΩ0 − Ω0kF ≤ O {S log(p − rp)/m} .

˜ ˜ −1 Denote Σ0 = Ω0 , which is positive definite since by Theorem II.11,

˜ ˜ ˜ φmin(Ω0) ≥ φmin(Ω0) − kΩ0 − Ω0k2 ≥ φmin(Ω0) − kΩ0 − Ω0kF ≥ φ1 − k1φ1 > 0. (2.17)

Given Ω˜ ∈ Sp ∩ Sp , define a new convex set: 0 + Eˆ

U (Ω˜ ) = {B − Ω˜ | B ∈ Sp ∩ Sp } ⊂ Sp . m 0 0 + Eˆ Eˆ

Let

ˆ ˜ ˆ ˜ Q(Ω) = tr(ΩΣ) − tr(Ω0Σ) − log det Ω + log det Ω0.

ˆ ˆ ˆ ˜ ˜ Since the estimate Ω minimizes Q(Ω), ∆ = Ω − Ω0 minimizes G(∆) = Q(∆ + Ω0). The main idea of this proof is as follows. For a sufficiently large M > 0, consider sets

˜ ˜ T1 = {∆ ∈ Um(Ω0), k∆kF = Mrm}, T2 = {∆ ∈ Um(Ω0), k∆kF ≤ Mrm}, where

1/2 rm = {S log(p − rp)/m} .

˜ ˜ Note that T1 is non-empty. Indeed, consider B = Ω0 for  = Mrm/kΩ0kF . Then ˜ ˜ ˜ B = (1 + )Ω0 − Ω0 ∈ Um(Ω0), hence B ∈ T1. Denote by 0¯ the matrix of all zero ˆ ˜ entries. It is clear that G(∆) is convex, and G(∆) ≤ G(0)¯ = Q(Ω0) = 0. Thus if we ˆ can show that G(∆) > 0 for all ∆ ∈ T1, the minimizer ∆ must be inside T2 and hence

36 ˆ k∆kF ≤ Mrm. To see this, note that the convexity of Q(Ω) implies that

˜ ˜ inf Q(Ω0 + ∆) > Q(Ω0) = 0. k∆kF =Mrm

˜ There exists therefore a local minimizer in the ball {Ω0 + ∆ : k∆kF ≤ Mrm}, or ˆ ˆ equivalently, for ∆ ∈ T2, i.e. k∆kF ≤ Mrm. In the remainder of the proof, we focus on

˜ ˆ ˜ ˜ G(∆) = Q(∆ + Ω0) = tr(∆Σ) − log det(∆ + Ω0) + log det Ω0. (2.18)

˜ Applying a Taylor expansion to log det(Ω0 + ∆) in (2.18) gives

˜ ˜ log det(Ω0 + ∆) − log det Ω0 1 Z 2 d ˜ d ˜ = log det(Ω0 + t∆) ∆ + (1 − t) log det(Ω0 + t∆)dt dt t=0 dt2 0  1  Z  ˜ T ˜ −1 ˜ −1 =tr(∆Σ0) − vec(∆) (1 − t)(Ω0 + t∆) ⊗ (Ω0 + t∆) dt vec(∆), (2.19)   0

where vec(∆) denotes the vectorized ∆, and ⊗ is the Kronecker product. For ∆ ∈ T1, let K1 be the integral term in (2.19), and define

n ˆ o n ˜ o K2 = tr ∆(Σ − Σ0) ,K3 = tr ∆(Σ0 − Σ0) .

We can then write

ˆ ˜ G(∆) = K1 + tr(∆Σ) − tr(∆Σ0) = K1 + K2 − K3.

Next, we bound each of the terms K1,K2 and K3 to find a lower bound for G(∆). ˆ First consider K2. Since the diagonal elements of Σ and Σ0 are the same after

37 scaling,

X ˆ |K2| ≤ | (Σii0 − Σ0,ii0 )∆ii0 |. i6=i0

By Lemma A.3 of Bickel and Levina (2008), there exists a positive constant c2 de- pending on φmax(Σ0) such that

ˆ 1/2 max |Σii0 − Σ0,ii0 | ≤ c2{log(p − rp)/m} , i6=i0 with probability tending to 1. Let ∆+ = diag(∆) be the matrix of diagonal elements

− + of ∆, and write ∆ = ∆ − ∆ . Then, K2 is bounded by

1/2 − |K2| ≤ c2{log(p − rp)/m} k∆ k1. (2.20)

˜ For K3, we can use the upper bound for kΩ0 − Ω0kF in (2.12), and the lower bound ˜ for φmin(Ω0) in (2.17), to write,

kΩ˜ − Ω k |K | ≤ k∆k kΣ˜ − Σ k ≤ k∆k 0 0 F (2.21) 3 F 0 0 F F ˜ φmin(Ω0)φmin(Ω0) 1/2 c3{S log(p − rp)/m} ≤ k∆kF 2 . (2.22) (1 − k1)φ1

The second inequality in (2.21) comes from the rotation invariant property of Frobe- nius norm, i.e.

˜ ˜ ˜ ˜ ˜ kΣ0 − Σ0kF = kΣ0(Ω0 − Ω0)Σ0kF ≤ φmax(Σ0)kΩ0 − Ω0kF φmax(Σ0).

˜ Using (2.12), we can also obtain an upper bound for the maximum eigenvalue of Ω0:

˜ ˜ ˜ 1 φmax(Ω0) ≤ φmax(Ω0) + kΩ0 − Ω0k2 ≤ φmax(Ω0) + kΩ0 − Ω0kF ≤ + k1φ1. φ2

38 Since rm → 0, there exists a sufficiently large k2 > 0 such that for ∆ ∈ T1,

1 k∆k2 ≤ k∆kF = Mrm < k2. φ2

Following Rothman et al. (2008, Page 502, proof of Theorem 1), a lower bound for

K1 can be found as

2 ˜ 2 K1 ≥ k∆kF /{2(φmax(Ω0) + k∆k2) } 2 2 2 φ2 2 ≥ k∆kF /{2 (1/φ2 + k1φ1 + k2/φ2) } = 2 k∆kF . (2.23) 2(1 + k1φ1φ2 + k2)

Combining (2.20), (2.22) and (2.23),

2 φ2 2 1/2 − G(∆) ≥ 2 k∆kF − c2{log(p − rp)/m} k∆ k1 2(1 + k1φ1φ2 + k2) 1/2 c3{S log(p − rp)/m} − 2 k∆kF . (1 − k1)φ1

For ∆ ∈ T1, applying Cauchy-Schwarz inequality yields

− ˆ 1/2 − k∆ k1 ≤ (|E|) k∆ kF .

We thus have

2 φ2 2 ˆ 1/2 − G(∆) ≥ 2 k∆kF − c2{|E| log(p − rp)/m} k∆ kF 2(1 + k1φ1φ2 + k2)

c3 1/2 − 2 {S log(p − rp)/m} k∆kF (1 − k1)φ1  2  2 φ2 c2 ˆ 1/2 c3 ≥ k∆kF 2 − {|E|/S} − 2 > 0, 2(1 + k1φ1φ2 + k2) M M(1 − k1)φ1 for M sufficiently large.

39 Proof of Corollary II.6. Under the assumptions in Theorem II.4, we have

ˆ 1/2 k∆Ω0 k = kΩ − Ω0k2 = OP {S log(p − rp)/m} = oP(1).

Therefore the partial correlation matrix corresponding to Ωˆ can be written as

ˆ ˆ −1/2 ˆ ˆ −1/2 −1/2 −1/2 ˆ −1/2 ˆ ˆ −1/2 A = Ip − D ΩD = A0 + D0 Ω0D0 − (D) ΩD = A0 + ∆A0 , where

−1/2 −1/2 ˆ −1/2 ˆ ˆ −1/2 ∆A0 = D0 Ω0D0 − (D) ΩD

−1/2 ˆ −1/2 −1/2 ˆ −1/2 ˆ −1/2 −1/2 ˆ −1/2ˆ ˆ −1/2 = D0 (Ω0 − Ω)D0 + D0 Ω D0 − D + D0 − D ΩD . (2.24)

Next we show that each of the summands on the right hand side of (2.24) has `2 norm oP(1) and conclude thus k∆A0 k2 = oP(1).

By Assumption 2, the diagonal entries of Ω0 satisfy ω0,ii ≥ φmin(Ω0) ≥ φ1 for all

−1/2 −1/2 −1/2 i = 1, . . . , p. Thus, kD0 k2 = maxi ω0,ii ≤ φ1 . It follows that

−1/2 ˆ −1/2 −1/2 2 ˆ kD0 (Ω0 − Ω)D0 k2 ≤ kD0 k2kΩ0 − Ωk2 = oP(1).

ˆ ˆ ˆ For the remaining two terms, first notice that kD0−Dk2 ≤ kD0−DkF ≤ kΩ0−ΩkF = oP(1). Therefore,

1/2 1/2 ω − ωˆ −1/2 ˆ −1/2 −1/2 −1/2 0,ii ii kD0 − D k2 = max |ω0,ii − ωˆii | = max i=1,...,p i=1,...,p 1/2 1/2 ω0,ii ωˆii

ω0,ii − ωˆii −1 −1/2 ˆ = max ≤ φ1 (φ1 − o (1)) kD0 − Dk2, i=1,...,p 1/2 1/2 1/2 1/2 P ω0,ii ωˆii (ω0,ii +ω ˆii )

40 where the last inequality comes from that fact that

min |ωˆii| = min |ωˆii − ω0,ii + ω0,ii| ≥ min |ω0,ii| − max |ωˆii − ω0,ii| ≥ φ1 − o (1). i i i i P

−1/2 ˆ −1/2 Hence, kD0 − D k2 = oP(1). Note further,

ˆ ˆ ˆ kΩk2 = kΩ − Ω0 + Ω0k2 ≤ kΩ0k2 + kΩ − Ω0k2 = kΩ0k2 + oP(1) is bounded above. It follows thus,

−1/2 ˆ −1/2 ˆ −1/2 −1/2 ˆ −1/2 ˆ −1/2 kD0 Ω D0 − D k2 ≤ kD0 k2kΩk2kD0 − D k2 = oP(1), −1/2 ˆ −1/2ˆ ˆ −1/2 −1/2 ˆ −1/2 ˆ ˆ −1/2 k D0 − D ΩD k2 ≤ kD0 − D k2kΩk2kD k2 = oP(1).

This completes the proof.

2.9 Proof of Theorem II.8

The following proof of Theorem II.8 adapts from that of Theorem 2.1 in Shojaie and Michailidis (2010).

Proof of Theorem II.8. Consider the special case where the row vector b = 1T , i.e. the whole network is tested as one pathway. The general case when b 6= 1T follows from a similar argument.

(k) For the partial correlation A0 (k = 1, 2) defined in Section 2.3.2, it holds that

(k) (k) T (k) −1 P∞ (k) t Λ (Λ ) = (Ip − A0 ) = t=0(A0 ) . Hence

∞ ∞ ∞ t   ˆ (k) ˆ (k) T X ˆ (k) t X (k) t X X t (k) t−u u Λ (Λ ) = (A ) = (A0 ) + (A0 ) (∆ (k) ) u A0 t=0 t=0 t=1 u=1

(k) (k) T = Λ (Λ ) + ∆Λ(k) .

41 (k) For Aˆ defined under the assumptions in Theorem II.4 and II.8, we have k∆ (k) k = A0 2

oP(1) by Corollary II.6. Thus, k∆Λ(k) k2 = oP(1). Using results from Shojaie and Michailidis (2010), the test statistic in (3.12) can be written as

b(Y¯ (2) − Y¯ (1)) TS = , r h n o i   σˆ2 b 1 Λˆ (1)(Λˆ (1))T + 1 Λˆ (2)(Λˆ (2))T bT +σ ˆ2 1 + 1 bbT γ n1 n2 ε n1 n2

where Y¯ (k) is the mean expression of genes in the experimental condition k. Shojaie and Michailidis (2010) show that TS is an asymptotically most powerful unbiased test for (3.11) when the correct network information is provided. Therefore, to es- tablish the result in Theorem II.8, it suffices to show that the denominator of TS is a consistent estimator. ˆ In the following, we first consider the log-likelihood lF (ϑ; Λ) based on the esti- ˆ ˆ (1) ˆ (2) 2 2 mated networks Λ = (Λ , Λ ) and correct variance components ϑ = (σγ , σε). We ˆ ˆ (k) ˆ (k) T then establish that the maximum likelihood estimator ϑΛˆ →P ϑ as Λ (Λ ) →P Λ(k)(Λ(k))T for both k. Hence the denominator of TS is consistent and TS is an asymptotically most powerful unbiased test for (3.11).

ˆ (k) 2 ˆ (k) ˆ (k) T 2 Let W = σγ Λ (Λ ) + σεIp for k = 1, 2. Up to a constant, the negative log-likelihood n n l (ϑ; Λ)ˆ = 1 l(ϑ; Λˆ (1)) + 2 l(ϑ; Λˆ (2)) F 2n 2n with

n 1 X1 l(ϑ; Λˆ (1)) = log det(Wˆ (1)) + RT (Wˆ (1))−1R , n j j 1 j=1 n ˆ (2) ˆ (2) 1 X T ˆ (2) −1 l(ϑ; Λ ) = log det(W ) + Rj (W ) Rj, n2 j=1+n1

(1) ¯ (1) (2) ¯ (2) where Rj = Yj − Y (j = 1, . . . , n1) and Rj = Yj − Y (j = 1 + n1, . . . , n). We

42 treat l(ϑ; Λˆ (1)) first. In particular, we can approximate l(ϑ; Λˆ (1)) using its one-term Taylor expansion around W(1)

ˆ (1) (1)  (1) T  2 l(ϑ; Λ ) = l(ϑ;Λ ) + tr ∇W(1) l(ϑ;Λ ) ∆W(1) + o(k∆W(1) k2),

(1) (1) (1) where ∇W(1) l(ϑ;Λ ) is the gradient of l(ϑ;Λ ) with respect to W and

n1 (1) (1) −1 −1 X (1) −1 T (1) −1 ∇W(1) l(ϑ;Λ ) = (W ) − n1 (W ) RjRj (W ) . j=1

Let Γ = ∆W(1) /k∆W(1) k2 and denote

n1  (1) T   (1) −1  −1 X T (1) −1 (1) −1 g(ϑ) = tr ∇W(1) l(ϑ;Λ ) Γ = tr (W ) Γ − n1 Rj (W ) Γ(W ) Rj. j=1

then

ˆ (1) (1) 2 l(ϑ; Λ ) = l(ϑ;Λ ) + g(ϑ)k∆W(1) k2 + o(k∆W(1) k2).

Using von Neumann’s trace inequality (Mirsky, 1975), we can bound the first term in g(ϑ) by

p  (1) −1  X (1) −1 tr (W ) Γ ≤ ς[i]((W ) )ς[i](Γ) i=1

2 (1) (1) T 2 −1 ≤ pς[1] (σγ Λ (Λ ) + σεIp) ς[1](Γ) 1 = p 2 (1) (1) T 2 ς[1](Γ), φmin(σγ Λ (Λ ) + σεIp)

where ς[i](A) denotes the ith largest singular value of A. By construction, ς[1](Γ) = 1

2 (1) (1) T 2 2 (1) −1 2 and φmin(σγ Λ (Λ ) + σεIp) ≥ σε. Hence | tr[(W ) Γ]| ≤ p/σε. On the other

43 hand, with probability tending to 1,

n1 n1 −1 X T (1) −1 (1) −1 (1) −1 (1) −1 −1 X T n1 Rj (W ) Γ(W ) Rj ≤ k(W ) Γ(W ) k2n1 Rj Rj j=1 j=1

n1 (1) −1 2 −1 X T −4 2 ≤ k(W ) k2kΓk2n1 Rj Rj = σε E(kRjk2), j=1 where the last step follows from the strong law of large numbers. This implies that

2 ˆ (1) (1) 2 ˆ (1) ˆ (1) T g(ϑ) is bounded for nontrivial σε. Note also ∆W(1) = W −W = σγ {Λ (Λ ) −

(1) (1) T 2 2 Λ (Λ ) } = σγ ∆Λ(k) . Hence g(ϑ)k∆W(1) k2 = g(ϑ)σγ k∆Λ(k) k2 = oP(1). Therefore ˆ (1) (1) ˆ (2) (2) l(ϑ; Λ ) = l(ϑ;Λ ) + oP(1), and similarly one can show that l(ϑ; Λ ) = l(ϑ;Λ ) + oP(1). They together imply that

ˆ lF (ϑ; Λ) = lF (ϑ; Λ) + oP(1).

ˆ Now conditioning on the event {lF (ϑ; Λ) = lF (ϑ; Λ)}, the estimate of the variance ˆ components is ϑ = arg minϑ lF (ϑ; Λ). Since lF (ϑ; Λ) is convex with respect to ϑ, M- ˆ ˆ estimation results in Haberman (1989) imply that P(ϑ = ϑ) = 1 and hence ϑ →P ϑ as ˆ (k) ˆ (k) T (k) (k) T Λ (Λ ) →P Λ (Λ ) for both k. It follows immediately that the denominator ˆ (k) ˆ (k) T (k) (k) T of the test statistic TS is a consistent estimator as Λ (Λ ) →P Λ (Λ ) for both k. This concludes the proof.

2.10 Derivation for Newton’s Method

The implementation of Newton’s method requires the gradient and the Hessian of the objective function, i.e. the profile log-likelihood. Here we provide details about how to calculate the gradient and Hessian based on the profile log-likelihood when

profiling out σε. The derivation follows similarly when profiling out σγ . Let N = np be the total number of observations for all genes. Recall that for

(k) (k) (k) T 2 2 (k) (k) (k) k = 1, 2, Σ = Ip + τΛ (Λ ) with τ = σγ /σε. The residuals Rj = Yj − Λ µˆ

44 (k) (k) for j = 1, . . . , n, where µˆ is the estimate of µ . Given the observations Y1,..., Yn

(with the first n1 samples from condition 1 and the remaining n2 = n − n1 samples

from condition 2), the nonconstant part of the “full” log-likelihood lF is

1 l (σ , τ | Y ,..., Y ) = − n log det(σ2Σ(1)) + n log det(σ2Σ(2)) F ε 1 n 2 1 ε 2 ε ( n n ) 1 X1 X − σ−2 RT (Σ(1))−1R + RT (Σ(2))−1R , 2 ε j j j j j=1 j=n1+1

Similarly, the nonconstant part of the log-likelihood using restricted maximum likeli- hood is

lR(σε, τ | Y1,..., Yn) = lF (σε, τ | Y1,..., Yn) 1 1 − log det n σ−2(Λ(1))T (Σ(1))−1Λ(1) − log det n σ−2(Λ(2))T (Σ(2))−1Λ(2) . 2 1 ε 2 2 ε

2 To simplify the computation, we solve for σε as a function of τ. The maximum

2 likelihood estimate of σε is

( n n ) 1 X1 X σˆ2 = RT (Σ(1))−1R + RT (Σ(2))−1R , (2.25) ε N j j j j j=1 j=n1+1

whereas its restricted maximum likelihood estimate is given by

( n n ) 1 X1 X σˆ2 = RT (Σ(1))−1R + RT (Σ(2))−1R . (2.26) ε N − 2p j j j j j=1 j=n1+1

2 Substituting σε with the corresponding estimate, we obtain the profile log-likelihood

1 p (τ | Y ,..., Y ) = − (n log det Σ(1) + n log det Σ(2)) F 1 n 2 1 2 ( n n ) 1 X1 X − N log RT (Σ(1))−1R + RT (Σ(2))−1R , (2.27) 2 j j j j j=1 j=n1+1

45 for maximum likelihood, and

1 p (τ | Y ,..., Y ) = − (n log det Σ(1) + n log det Σ(2)) R 1 n 2 1 2 ( n n ) 1 X1 X − (N − 2p) log RT (Σ(1))−1R + RT (Σ(2))−1R 2 j j j j j=1 j=n1+1 1 1 − log det n (Λ(1))T (Σ(1))−1Λ(1) − log det n (Λ(2))T (Σ(2))−1Λ(2) , (2.28) 2 1 2 2 for restricted maximum likelihood. As Σ(k) (k = 1, 2) are the only terms that depend on τ, we first look at the derivatives of

(k) T (k) −1 (k) T (k) −1 (k) log det Σ , Rj (Σ ) Rj, log det{(Λ ) (Σ ) Λ }, with respect to τ. Denote

dΣ(k) B(k) = (Σ(k))−1 (Σ(k))−1, H(k) = (Λ(k))T (Σ(k))−1Λ(k). dτ

Then d log det(Σ(k))  dΣ(k)  = tr (Σ(k))−1 , dτ dτ

d2 log det(Σ(k))  dΣ(k) d2Σ(k)  = tr −(B(k))T + (Σ(k))−1 , dτ 2 dτ dτ 2

T (k) −1 2 T (k) −1 (k) d R (Σ ) Rj d R (Σ ) Rj dB j = −RT B(k)R , j = −RT R , dτ j j dτ 2 j dτ j d log det H(k) = −tr (H(k))−1(Λ(k))T B(k)Λ(k) , dτ

d2 log det H(k) = − tr (H(k))−1(Λ(k))T B(k)Λ(k)(H(k))−1(Λ(k))T B(k)Λ(k) dτ 2  dB(k)  − tr (H(k))−1(Λ(k))T Λ(k) , dτ

46 where

dB(k)  dΣ(k) dΣ(k) d2Σ(k)  = −(Σ(k))−1 2 (Σ(k))−1 − (Σ(k))−1. dτ dτ dτ dτ 2

Given the covariance Σ(k) (k = 1, 2) defined in Section 3, we can further simplify the above derivatives and obtain

d log det Σ(k) d2 log det Σ(k) = tr H(k) , = −tr H(k)H(k) , dτ dτ 2

T (k) −1 d R (Σ ) Rj j = −RT (Σ(k))−1Λ(k)(Λ(k))T (Σ(k))−1R dτ j j

2 T (k) −1 d R (Σ ) Rj j = 2RT (Σ(k))−1Λ(k)H(k)(Λ(k))T (Σ(k))−1R , dτ 2 j j d log det H(k) d2 log det H(k) = −tr H(k) , = tr H(k)H(k) . dτ dτ 2

With the above quantities, one can then calculate the gradient and Hessian of the profile log-likelihood pR for restricted maximum likelihood and use Newton’s method

2 2 2 to obtain an estimate of τ. Estimate ofσ ˆε is calculated from (2.26), andσ ˆγ =τ ˆσˆε. Estimation with maximum likelihood follows similarly by applying Newton’s method to pF and utilizing (2.25).

2.11 Additional Simulation Results

To benchmark the performance of the proposed network estimation procedure as well as NetGSA, we carried out another two experiments, which we describe as the third and fourth experiment following the earlier two in Section 2.4. These are also the two experiments we mentioned when comparing the running time of NetGSA with different variance estimation algorithms. Our third experiment considers a undirected network with p = 160. The simula-

47 tion design is similar to that in experiment 2, except that each of the 8 subnetworks has a denser structure. Specifically, there are 80 edges connecting the 20 genes in each subnetwork under the null. The probability of an interaction between subnetworks is 0.3. Under the alternative, there is an increase of 0.6 in mean expression values for varying proportions of genes (0%, 30%, 50% and 90%) for pathways 1–4 and 5–8. Moreover, half of the interactions in the latter four subnetworks disappear. The fourth experiment is about networks of size p = 400, which also illustrates that the proposed method scales well with the size of the networks based on im- plementation of the updated optimization algorithm. Again the topology is similar to previous scenarios so that it consists of 20 subnetworks, each corresponding to a pathway with 20 genes. The probability of an interaction between pathways is also 0.3. All subnetworks have the same topology and were generated as scale-free random graphs such that there are 40 edges linking the 20 genes. We then divided the 20 subnetworks into two groups, with the first 10 in the first group, and the last 10 in the second group. Under the null, mean expression values for all subnetworks were set to be 1. Under the alternative, the first 6 subnetworks in each group remained to have the same mean expression values, but 20%, 30%, 30% and 40% of genes in the last four subnetworks had 0.5 unit higher expression values, respectively. In addition, subnetwork structure for the second group under the alternative differed from their null equivalent by 22.5%. This experiment is also of interest because we created a setting where there are enough pathways in order for the permutation based Gene Set Analysis to calibrate the number of permutations required. Table 2.7 presents the deviance measures for estimating the networks with 200 replicates and sample sizes of 300 for both p = 160 and p = 400, when varying lev- els of external information are available. In both experiments, we see performance improvement in Matthews correlation coefficient and Frobenius norm loss as the struc- tural information of the networks r increases. Under the alternative of p = 160, the

48 slightly better performance in terms of Frobenius norm loss is due to the sparser network structure relative to the null.

Table 2.7: Deviance measures for network estimation in experiment 3 and 4. FPR(%), false positive rate in percentage; FNR(%), false negative rate in percentage; MCC, Matthews correlation coefficient; Fnorm, Frobenius norm loss. The best cases are highlighted in bold. p = 160 p = 400 r FPR(%) FNR(%) MCC Fnorm FPR(%) FNR(%) MCC Fnorm 0.0 7.78 4.75 0.59 0.65 2.87 5.85 0.51 0.38 Null 0.2 6.81 5.03 0.61 0.63 2.44 8.15 0.53 0.37 0.8 2.60 4.41 0.78 0.49 0.81 5.63 0.74 0.25

0.0 5.60 2.95 0.61 0.46 2.58 5.74 0.53 0.36 Alternative 0.2 4.72 3.67 0.64 0.45 2.18 7.95 0.56 0.35 0.8 1.47 3.68 0.83 0.34 0.70 5.91 0.76 0.24

Table 2.8 shows the estimated powers after correcting for false discovery rate in the third experiment with p = 160. While Gene Set Analysis with the competitive null hypothesis tends to suggest that none of the pathways is significantly differentially ex- pressed under the alternative, its equivalent with the self-contained null overestimates powers for most pathways. In comparison, NetGSA with exact network information slightly underestimates, but mostly correctly the significance of each subnetwork. Moreover, the differences in powers between each pair of pathways (1 and 5, 2 and 6, 3 and 7, as well as 4 and 8) indicate that the topologies for each pair are different, since both had the same amount of changes in mean expression values. When the exact networks are unknown, we see improvement in detected powers for pathway 3, 4 and 8 as the structural information increases from 20% to 80%, which suggests that a small amount of external knowledge is beneficial for making reliable inference using the network-based method. The estimated powers after correcting for false discovery rate in experiment 4 are shown separately in Table 2.9, as there are 20 pathways with varying parameters. When the exact networks with the correct partial correlation coefficients are known,

49 Table 2.8: Powers based on false discovery rate with q∗ = 0.05 in experiment 3. 0.2/0.8 refer to NetGSA with 20%/80% external information; E refers to NetGSA with the exact networks; T refers to the true power; GSA- s/GSA-c refer to Gene Set Analysis with self-contained/competitive null hypothesis in 1000 permutations, respectively. True powers are highlighted in bold. p = 160 Pathway 0.2 0.8 E T GSA-s GSA-c 1 0.04 0.04 0.04 0.05 0.12 0.00 2 0.52 0.50 0.44 0.54 0.69 0.00 3 0.64 0.70 0.80 0.87 0.95 0.00 4 0.72 0.76 0.97 0.99 0.99 0.00 5 0.55 0.52 0.12 0.17 0.48 0.00 6 0.48 0.50 0.20 0.25 0.43 0.00 7 0.60 0.54 0.52 0.64 0.78 0.00 8 0.70 0.72 0.80 0.88 0.91 0.00

NetGSA returns estimated powers that match the true powers very well, with high powers for pathways 8, 9 and 10 which have significant changes in mean expression values, moderately high powers for pathways 11–16 that have significant changes in structures and high powers for pathways 17–20 with both changes. When there is 20% external information on the underlying pathway topology, NetGSA is able to identify mostly correctly the powers for pathway 7, 9, 10, and 17–20, with slight overestima- tion for other pathways. However, the overall trend suggests that pathways 11–16 have higher powers than 1–6, which is consistent with the true power. There is also improvement when 80% structural information is known, although the improvement is minor compared to the amount of structural information required. In comparison, Gene Set Analysis with the self-contained null hypothesis also performs well in rec- ognizing correctly the differentially expressed pathways. On the other hand, Gene Set Analysis with the competitive null is still not able to identify any differential ex- pression among all 20 pathways. The conflicting results from Gene Set Analysis with different null hypotheses also raise concerns as to which version to choose in practice.

50 Table 2.9: Powers based on false discovery rate with q∗ = 0.05 in experiment 4. 0.2/0.8 refer to NetGSA with 20%/80% external information; E refers to NetGSA with the exact networks; T refers to the true power; GSA- s/GSA-c refer to Gene Set Analysis with self-contained/competitive null hypothesis in 1000 permutations, respectively. True powers are highlighted in bold. p = 400 Pathway 0.2 0.8 E T GSA-s GSA-c 1 0.05 0.06 0.05 0.05 0.08 0.00 2 0.38 0.27 0.00 0.05 0.08 0.00 3 0.56 0.41 0.01 0.05 0.04 0.00 4 0.53 0.46 0.02 0.05 0.06 0.00 5 0.54 0.40 0.02 0.05 0.02 0.00 6 0.62 0.56 0.03 0.05 0.10 0.01 7 0.82 0.74 0.99 0.99 0.97 0.00 8 0.62 0.70 0.42 0.48 0.51 0.00 9 0.77 0.85 1.00 1.00 1.00 0.00 10 0.94 0.94 1.00 1.00 1.00 0.00 11 0.81 0.70 0.29 0.47 0.67 0.00 12 0.71 0.75 0.36 0.47 0.68 0.00 13 0.71 0.78 0.31 0.47 0.65 0.00 14 0.73 0.80 0.32 0.48 0.62 0.00 15 0.79 0.67 0.30 0.47 0.62 0.00 16 0.79 0.64 0.33 0.48 0.68 0.00 17 0.86 0.90 1.00 1.00 1.00 0.00 18 0.84 0.73 0.93 0.97 1.00 0.00 19 0.88 0.86 1.00 1.00 1.00 0.00 20 0.93 0.94 1.00 1.00 1.00 0.06

51 CHAPTER III

Joint Structural Estimation of Multiple Graphical

Models

3.1 Background

As discussed in Chapter I, there has been a lot of work on estimating a single graphical model. More recently, the focus has shifted to joint estimation of multiple graphs due to the availability of heterogeneous data (see discussion in Guo et al. (2011)). Guo et al. (2011) introduced a joint estimation method by adding a hierar- chical penalty to the log-likelihood and is thus able to recover both the common and the individual zeros in the precision matrices. Danaher et al. (2014) proposed a joint graphical lasso to estimate multiple related graphical models by maximizing the log- likelihood with generalized fused lasso or group lasso penalties, which can be solved efficiently by a standard alternating directions method of multipliers algorithm (Boyd et al., 2011). Both joint estimation methods rely on the assumption that there exists only a single common structure across all graphs. Peterson et al. (2014) introduced a Bayesian approach that links the estimation of the graphs via a Markov random field prior for common structures. Further, a spike-and-slab prior is placed on the param- eters that measures the similarity between graphs, thus relaxing the assumption on sharing of structures across all graphs. Recent work by Zhu et al. (2014) investigates

52 the joint estimation problem by pursuing the element-wise clustering of the network

structure over multiple graphs using a truncated `1 penalty on the pairwise differences between the precision matrices. Despite recent advances in joint estimation algorithms, theoretical properties of the resulting estimators have not been fully investigated. For example, Guo et al. (2011) discussed asymptotic properties of the resulting estimator by establishing re- covery results of the common zeros across multiple precision matrices, which is the focus of their method. Zhu et al. (2014) focused mainly on consistent estimation of the entry-wise clustering structures with a brief mention of consistency of precision matrices in a special temporal setting; however, no theoretical guarantees are pro- vided for more general settings. Finally, many papers only present algorithms for joint estimation of the Gaussian graphical models under consideration, but no theo- retical properties of the estimates (e.g. Chiquet et al. (2011); Danaher et al. (2014); Mohan et al. (2014)). In this chapter, we investigate estimation of multiple graphical models under com- plex structural relationships, assuming that there exists prior information on their specification. In many applications, such information is available and may come from prior knowledge in the literature of relationships among different node subsets of the graphical models under consideration, or from clustering of all graphs. The approach allows sharing common sub-graph components between different models and does not require sharing of values for the same element across multiple inverse covariance ma- trices. The proposed method, called JSEM (Joint Structural Estimation Method), leverages structured sparsity patterns as illustrated in Section 3.2 and is a two-step procedure. In the first step, we infer the sparse graphical models by incorporating the available structure through a group lasso penalty. In the second step, we maximize the Gaussian log-likelihood subject to the edge set constraints obtained from the pre- vious step. We establish that the proposed estimator is consistent and establish a

53 fast rate of convergence with respect to the Frobenius norm for the estimated inverse covariance matrices. We also establish the graph selection consistency property of JSEM under appropriately specified structured sparsity. When the structured spar- sity pattern is slightly misspecified, we provide a modified estimator that reduces the number of false positive edges identified due to prior information misspecification. The numerical work shows that JSEM exhibits superior performance in controlling both the number of false positive and false negative edges compared to available methods. Moreover, JSEM is computationally appealing as the number of graphs increases. Finally, we illustrate the method on a real data set dealing with climate modeling, where the structural relationships between the various graphical models reflects geographical information. Our results highlight the different roles forcing factors on climate play at different regions of the United States. In summary, we develop a very general method for the problem of joint estima- tion of multiple Gaussian graphical models. The method can incorporate detailed structural information regarding relationships between subsets of the graphical mod- els, while in the absence of such information reduces to the group graphical lasso procedure of Danaher et al. (2014). Further, we rigorously establish the consistent recovery of the edge sets for JSEM, under suitable regularity conditions. Finally, a modified estimator allows consistent recovery even in the presence of misspecification of the structural relationships, thus further enhancing the applicability of JSEM. This chapter is organized as follows. Section 3.2 discusses the structural rela- tionships model used in this work and present the estimation procedure. Section 3.3 presents the theoretical properties of the proposed method, followed by simulation studies in Section 3.4 and a real data analysis on climate modeling in Section 3.5. We conclude with a discussion in Section 3.6. Most details of the theoretical analysis and proofs are relegated to Section 3.7 and 3.8.

54 3.2 The Joint Structural Estimation Method

Suppose we are interested in estimating K Gaussian graphical models from their corresponding K data sets, assuming that the models exhibit complex relationships between their edge sets. The data in the k-th model are organized in a nk × p matrix

k k k k X = (X1, ··· , Xp), where each row represents one observation from N (0, Σ ), k = 1,...,K. Without loss of generality, we assume the observations from each model are centered and standardized. For ease of presentation, it is assumed in the following that the sample size nk = n for all k = 1,...,K, but the modeling framework can easily accommodate unequal sample sizes. Our goal is to estimate jointly Ωk = (Σk)−1 for all k, under the assumption that the K corresponding graphs are related via a structured sparsity pattern G . For example, consider climate models capturing relationships between climate forcing variables defined over a pre-specified spatial domain. Models that belong to the same climate zone may exhibit greater similarity in their graph structures than those from different zones. Thus, one can define G based on their spatial locations. Figure 3.1 gives an illustration of the structured sparsity among four graphical models in terms of their adjacency matrices. This pattern indicates that sharing of structures may occur at different subsets of the edge set, which motivates us to develop a joint estimation method that can incorporate this rich and complex structural information.

1 2 3 4 Figure 3.1: Image plots of the adjacency matrices for all four graphical models. The black color represents presence of an edge. The structured sparsity pat- tern is encoded in G = {(1, 2), (3, 4), (1, 3), (2, 4)}, i.e. each pair of graph- ical models in G share a subset of edges.

55 3.2.1 An Illustrative Example

We first illustrate how to extend the idea of neighborhood selection (Meinshausen and B¨uhlmann, 2006) to multiple graphical models using the example in Figure 3.1.

k k For k = 1,...,K, let (θij)p×p be the matrix of regression coefficients in graph k and θi

k the vector of all θij (j 6= i) for node i = 1, . . . , p. Unless otherwise stated, all vectors are assumed to be column vectors. For node i in a single graph k, neighborhood

k selection suggests estimating the coefficients θi by

1 k k k 2 X k min kXi − X−iθi k + 2λ |θij|, θk n i j6=i

k k where X−i is X with the ith column removed, k·k represents the standard Euclidean norm and λ is the regularization parameter. To achieve joint estimation, consider the following regularized regression problem

K 1 X k k k 2 min kXi − X−iθi k + 2Pλ(Θi), (3.1) Θi n k=1

1 K where K = 4, Θi = (θi ,..., θi ) and Pλ(Θi) is a regularization term to be determined

next. Note that each column of Θi represents the regression coefficients from one

graphical model and each row of Θi corresponds to the four coefficients at the same (i, j) pair.

The penalty Pλ(Θi) is chosen based on information from the structured sparsity pattern G in Figure 3.1. Specifically, depending on j relative to i, we can group the coefficients in the jth row of Θi as

1 2 3 4 1 3 2 4 (θij, θij, θij, θij) or (θij, θij, θij, θij) | {z } | {z } | {z } | {z } [1,2] [3,4] [1,3] [2,4] θij θij θij θij

56 and set Pλ(Θi) to be the group lasso penalty

X X g [g] X X g [g] λijkθij k or λijkθij k. j6=i g=[1,2],[3,4] j6=i g=[1,3],[2,4]

The group lasso penalty forces the two coefficients in each group to be zero or nonzero at the same time, leading to the same structure for graphical models belonging to the same group. ˆ The solution Θi to (3.1) for i = 1, . . . , p can then be used for graph selection.

3.2.2 The General Case

Denote the structured sparsity pattern by G = ∪ G ij, where the union is over 1≤i

1 K ij regression coefficients (θij, . . . , θij ) is determined by G . Under correctly specified G , all coefficients in the same group should be zero or nonzero simultaneously. For

k k k = 1,...,K, let E = {(i, j) : 1 ≤ i < j ≤ p, θij 6= 0} be the set of undirected edges p in graph k. Denote by S+ the set of all positive definite matrices of size p × p and

p p×p SE = {Ω ∈ R : ωij = 0, for all (i, j) ∈/ E where i 6= j}. The Joint Structural Estimation Method (JSEM) proceeds in the following two steps.

57 (1) For k = 1,...,K, we infer the sparse graphs Eˆk through the following group lasso estimator, i.e. for i = 1, . . . , p,

K 1 X k k k 2 X X g [g] min kXi − X−iθi k + 2 λijkθij k. (3.2) Θi n k=1 j6=i g∈G ij

ˆk ˆk ˆk E is estimated to be the set {(i, j) : max(θij, θji) 6= 0}.

(2) We refit the model by

K X min {tr(Σˆ kΩk) − log det(Ωk)}. (3.3) Ωk∈Sp ∩Sp + Eˆk k=1

Note that problems (3.2) and (3.3) are both convex and can thus be solved by available convex optimization algorithms. In this work, we use the R-package grpreg (Breheny and Huang, 2009) for implementation of the group lasso penalized opti- mization (3.2) and glasso (Friedman et al., 2008) for solving (3.3).

3.2.3 Choice of Tuning Parameters

Like any other penalty-based method, JSEM requires selecting the tuning param-

g eters λij for all p regressions in (3.2). For our purpose, it suffices to use the same λ for all 3-tuples (i, j, g), which significantly simplifies the computation. In simulations, we generate a validation dataset and select λ by maximizing the log-likelihood on the

ˆ k validation data using Ωλ (k = 1,...,K) estimated from the training data. In practice, we recommend using the Bayesian information criterion (bic) coupled with stability selection (Meinshausen and B¨uhlmann, 2010; Shah and Samworth, 2012) to select graphical models that are both stable and interpretable. Specifically, for a given λ, we define bic for the proposed method as

K   X ˆ k ˆ k ˆ k log(nk) bic(λ) = tr(Σ Ωλ) − log det(Ωλ) + dfk , nk k=1

58 ˆ k where Ωλ (k = 1,...,K) are the estimated precision matrices from the data and the

k degrees of freedom dfk = #{(i, j): i < j, ωˆλ,ij 6= 0}. The optimal tuning parameter

∗ is thus λ = arg minλ bic(λ).

3.3 Theoretical Results

In this section, we establish the theoretical properties of JSEM; specifically, the norm consistency of the estimated inverse covariance matrices, as well as the con- sistent recovery of the edge sets of the various graphical models under consideration based on the structured sparsity pattern G .

3.3.1 Estimation Consistency

Under the pattern G , the set {(j, g): j 6= i, g ∈ G ij} defines a partition of the

i i index set N(p−1)K in Gi groups, where N(p−1)K = {(j, k): j 6= i, k = 1,...,K} and

ij [g] 1 ≤ Gi ≤ (p−1)K. Let J(Θi) = {(j, g): j 6= i, g ∈ G , θij 6= 0} be the set of nonzero groups in the ith regression. We assume an overall sparsity at the group level, i.e. the size of J(Θi) is si << Gi. Let

p X G0 = max Gi, s0 = max si,S0 = si, i=1,...,p i=1,...,p i=1

and |g| be the size of the group g with |gmax| = maxg∈G |g|.

Let M(p, K) represent the set of all p × K matrices. For ∆ = (δ1,..., δK ) ∈

[g] k M(p, K) and a group g ⊂ {1,...,K}, denote by δj the vector composed of all δj for

which k ∈ g. Write J = {J(Θ1),...,J(Θp)}, the collection of sets of nonzero groups

in all p regressions. For any J ∈ J , denote ∆J the nonzero matrix in M(p, K), which has the same coordinates as ∆ on J and zero elsewhere. Let J c denote the complement of the index set J. Write 0¯ the zero matrix in M(p, K). We make the following assumptions.

59 A1: For 0 < s < G0, there exists κ = κ(s) > 0, such that

PK kXkδkk2/n min min k=1 ≥ κ2(s), (3.4) ∆∈F 2 J∈J ,|J|≤s J k∆J kF

where for i satisfying J(Θi) = J, FJ is defined as

¯ X g [g] X g [g] FJ = {∆ : ∆ ∈ M(p, K)\{0}, λijkδj k ≤ 3 λijkδj k}. (j,g)∈Jc (j,g)∈J

k A2: For every k = 1,...,K and i = 1, . . . , p, Var(Xi ) = 1. Further, there exist

constants c0, d0 such that for every k,

k k 0 < 1/c0 ≤ φmin(Σ0) ≤ φmax(Σ0) ≤ 1/d0 < ∞.

Assumption A1 is a generalization of the Restricted Eigenvalue assumption for the Lasso in Bickel et al. (2009) to the group lasso setting in our problem and requires the super design matrix diag(X1,..., XK ) to be well conditioned over the restricted set of vectors. The equal variance requirement in assumption A2 can be easily achieved by ap- propriate scaling of the data. The second part of the assumption explicitly excludes

k singular or nearly singular covariance matrices and guarantees that Ω0 exists for every model k = 1,...,K. Now we are ready to state our main result.

Theorem III.1. Consider Ωˆ k (k = 1,...,K) defined in (3.3). Let Assumption A1 with s = 2s0 and Assumption A2 be satisfied. For every regression defined in (3.2), choose   g 2 p π p λij = √ |gmax| + √ q log G0 , (3.5) nd0 2

60 1−q with q > 1. Then with probability at least 1 − 2pG0 , we have

K r  ! 1 X S0 π kΩˆ k − Ωkk ≤ O p|g | + √ pq log G , (3.6) K 0 F nK max 0 k=1 2

where G0 is the maximum number of groups in all regressions, S0 is the total number

of relevant groups and |gmax| is the maximum group size.

Note the rate in (3.6) improves over estimating each precision matrix separately

as long as the sparsity pattern G is appropriately specified and nontrivial, i.e. there exists structural similarity among the considered graphical models. For example, if

all K graphs share the same structure, then |gmax| = K and G0 = p − 1. Thus JSEM achieves a convergence rate of the order of

r ( r )! S π q log(p − 1) O 0 1 + √ . n 2 K

In contrast, separate estimation of Ωk is known to be of the order of

  s X k,− O  kΩ k0 log p/(nK) , k

k,− k P where kΩ k0 denotes the number of nonzero off-diagonal entries in Ω and k PK is a short notation for k=1. Thus JSEM has a lower estimation error rate than

k,− separate estimation if S0  kΩ k0, where  means that the expressions on both sides are of the same order. On the other hand, the rate in (3.6) could be worse if the

sparsity pattern G is highly misspecified such that the number of nonzero parameters

P k,− S0 > k kΩ k0.

61 3.3.2 Graph Selection Consistency

To understand how JSEM performs in selecting the edge sets of the graphical models, it suffices to focus on each of the group lasso estimation problems (3.2) as consistent graph selection relies on consistent variable selection in all p regressions. Unlike the sign consistency in the lasso setting (Zhao and Yu, 2006), variable selection properties with a group lasso penalty are much more complicated because the latter selects whole groups rather than individual variables (see Basu et al. (2012) and the discussion therein). The Basu et al. (2012) paper offers a generalization and introduces the notion of direction consistency for the group lasso. Specifically, for a nonzero vector ξ, its direction vector is defined as D(ξ) = ξ/kξk and D(0) = 0. An ˆ estimator Θi of (3.2) is direction consistent at rate αn if for a sequence of positive real numbers αn → 0,

ˆ[g] [g] ˆ[g] P(kD(θij ) − D(θ0,ij)k < αn, ∀ (j, g) ∈ J(Θ0,i); θij = 0, ∀ (j, g) ∈/ J(Θ0,i)) → 1, as n, p → ∞. In general, direction consistency does not guarantee sign consistency, especially when there are multiple members within one group. However, if the group is selected, all the members within the group are selected, which is sufficient for joint neighborhood selection for each node and subsequent selection of graphs. Motivated by the above idea, we establish the graph selection consistency property of JSEM in Theorem III.2, which can be conveniently modified to adjust for the misspecification in the prior information G . Before we present the main result, we need more notations. Consider the group lasso estimation problem (3.2) for node i. For simplicity, we discuss the estimation consistency properties with a common tuning parameter λ for all (j, g). For k = 1,...,K, denote Xk the n × |I | sub-matrix consisting of all Ik k relevant variables from the kth model. In other words, for all j ∈ Ik, there exists a group g 3 k such that (j, g) ∈ J(Θ0,i). Note the dependency of each index set Ik

62 on i is made implicit here for notational convenience. Further, let ξk ∈ R|Ik| be a

vector indexed by Ik. The following assumption adapts the Uniform Irrepresentability Condition (IC) in Basu et al. (2012) to our setting:

A3: There exists a positive constant η such that for all ξ = ((ξ1)T ,..., (ξK )T )T ∈

P [g] k |Ik| R with max kξj k ≤ 1 and all (j, g) ∈/ J(Θ0,i) (j,g)∈J(Θ0,i)

!1/2 X h −1 i2 (Xk)T Xk (Xk )T Xk ξk ≤ 1 − η. (3.7) j Ik Ik Ik k∈g

Note the group level constraint (3.7) is required to hold for all p regressions and is less stringent than the IC for the selection consistency of lasso.

Theorem III.2. Let Assumption A1 with s = s0, A2 and A3 be satisfied. Assume

further that the sparsity pattern G is correctly specified. For every regression defined in (3.2), choose

1 1 p π p  λ ≥ max √ |g| + √ q log G0 , (3.8) i,(j,g)∈/J(Θ0,i) η nd 2 0 √ 1 1  s 1  π  0 √ p √ p αn ≥ max [g] λ + |g| + q log G0 , (3.9) i,(j,g)∈J(Θ0,i) κ(s ) κ(s ) nd 0 kθ0,ijk 0 0 2

1−q with q > 1. Then with probability at least 1 − 4pG0 , we have simultaneously for all i

ˆ[g] 1. θij = 0, for all (j, g) ∈/ J(Θ0,i),

ˆ[g] [g] [g] ˆ[g] [g] 2. kθij − θ0,ijk < αnkθ0,ijk, and hence kD(θij ) − D(θ0,ij)k < 2αn for all (j, g) ∈

J(Θ0,i).

Further, if αn < 1, then with the same probability,

ˆk ˆk ˆk E = {(i, j) : 1 ≤ i < j ≤ p, max(θij, θji) 6= 0} (3.10)

63 k estimates correctly the true edge set E0 for all k = 1,...,K.

Note the choice of λ in (3.8) is of the same order as the tuning parameter required for estimation consistency in Theorem III.1. With the above choice of λ, αn can be √ p √ √ chosen to be of the order of O( s0( |gmax|+ log G0)/ n). A proof of Theorem III.2 can be found in the Section 3.8.

The results in Theorem III.2 are stated under appropriately specified G . When G is misspecified, it is possible that not all the members within a group have nonzero effects. However, the group lasso penalty may fail to exclude members with actual zero effect within the misspecified group, leading to the recovery of spurious edges. The following result implies that the property of direction consistency helps identify influential members within a group, i.e. those with noticeable nonzero effects.

Corollary III.3. Let Assumption A1 with $s = s_0$, A2 and A3 be satisfied. For every regression defined in (3.2), choose $\lambda$ and $\alpha_n$ as in Theorem III.2. Define

$$\hat\theta_{ij}^{k,\mathrm{thr}} = \hat\theta_{ij}^k\,\mathbf{1}\big\{ |\hat\theta_{ij}^k| / \|\hat\theta_{ij}^{[g]}\| > 2\alpha_n \big\}, \quad \forall\, k \in g,\ \forall\, (j,g) \in J(\Theta_{0,i}).$$

If for all $g \in \mathcal{G}^{ij}$, $\min_{k\in g} |\theta_{0,ij}^k| / \|\theta_{0,ij}^{[g]}\| > 2\alpha_n$, then with probability at least $1 - 4pG_0^{1-q}$,

$$\hat E^{k,\mathrm{thr}} = \{(i,j) : 1 \le i < j \le p,\ \max(\hat\theta_{ij}^{k,\mathrm{thr}}, \hat\theta_{ji}^{k,\mathrm{thr}}) \ne 0\}$$

estimates correctly the true edge set $E_0^k$ for all $k = 1,\ldots,K$.

The result in Corollary III.3 implies immediately that JSEM with an additional thresholding step on the estimated direction vectors $D(\hat\theta_{ij}^{[g]})$ can be applied to reduce false discoveries, and thus improve selection of the edge sets, under a moderate level of misspecification of the structured pattern $\mathcal{G}$.
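A minimal sketch of this within-group thresholding step, assuming the groupwise estimates for a pair $(i, j)$ are collected in a vector; the function name and interface are ours, not the authors':

```python
import numpy as np

def threshold_group(theta_group, alpha_n):
    """theta_group: 1-D array of estimates (theta_hat^k_{ij}) for k in g.
    Zero out members whose share of the group norm is at most 2*alpha_n."""
    norm = np.linalg.norm(theta_group)
    if norm == 0.0:
        return theta_group
    keep = np.abs(theta_group) / norm > 2.0 * alpha_n
    return np.where(keep, theta_group, 0.0)

print(threshold_group(np.array([0.9, 0.85, 0.02]), alpha_n=0.1))
# the weak third member is set to zero, trimming a likely spurious edge
```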

3.4 Performance Evaluation

We present three simulation studies to evaluate the performance of JSEM. The first study considers a single common structure across all graphical models, while the second features a more complex structured sparsity pattern. The methods compared include the separate estimation method Glasso, where the graphical lasso of Friedman et al. (2008) is applied to each graphical model separately; the joint estimation method of Guo et al. (2011), denoted by JEM-G; and the Joint Graphical Lasso (JGL) of Danaher et al. (2014). Our results show that JSEM outperforms the competing methods in both settings, even when the structured pattern is moderately misspecified. The third study compares JSEM with its thresholded version under misspecified $\mathcal{G}$, using the experimental settings of the first two studies.

3.4.1 Simulation Study 1

In our first simulation, we consider $K = 5$ graphical models, each comprising $p = 100$ variables. The covariance matrices $\Sigma^k$ ($k = 1,\ldots,K$) are constructed as follows. We first generate a scale-free network with edge set $E_0$ as the common structure shared across all graphs, shown in the left panel of Figure 3.2. To generate the edge set $E^k$, we randomly pick a pair $(i,j)$, $i < j$, such that $(i,j) \notin E_0$ and add it to $E^k$. This procedure is repeated $\rho|E_0|$ times for each $k$, where $\rho$ is a positive number corresponding to the ratio of individual edges to common ones. In this example, we take $\rho = 0.1$ to allow a high level of structural similarity across graphs. Thus, all graphical models have the same degree of sparsity, with 108 edges, or 2.2% of all possible edges, present. Note that due to the sparse structure of each graph, the proportion of shared non-edges (i.e. common zeros in the adjacency matrices) among all models is 98%. Given the edge set $E^k$, we then construct the inverse covariance matrix $\Omega^k$ with the nonzero off-diagonal entries drawn uniformly from $[-1, -0.5] \cup [0.5, 1]$. The positive definiteness of $\Omega^k$ is guaranteed by setting the diagonal elements to $|\phi_{\min}(\Omega^k)| + 0.1$. The covariance matrix $\Sigma^k$ is then determined by

$$\Sigma_{ij}^k = (\Omega^k)^{-1}_{ij} \Big/ \sqrt{(\Omega^k)^{-1}_{ii}\,(\Omega^k)^{-1}_{jj}}.$$

By construction, each $\Sigma^k$ corresponds to the correlation matrix for the $k$th graphical model. The sparsity pattern supplied to JSEM is $\mathcal{G} = \{1,\ldots,K\}$, i.e. all graphical models are assumed to share the same structure. Thus, the parameter $\rho$ indicates the amount of pattern misspecification relative to the true edge set structure.


Figure 3.2: Simulation study 1: the left panel shows the image plot of the adjacency matrix corresponding to the shared structure across all graphs; each black cell indicates the presence of an edge. The right panel shows the ROC curves for sample size $n_k = 50$: Glasso (dotted), JEM-G (dotdash), GGL (solid), FGL (dashed), JSEM (longdash).
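The covariance construction described before Figure 3.2 can be summarized in the following sketch (ours, not the original simulation code); networkx's Barabási-Albert generator stands in for the unspecified scale-free mechanism:

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(1)
p, K, rho = 100, 5, 0.1
E0 = {tuple(sorted(e)) for e in nx.barabasi_albert_graph(p, 1, seed=1).edges()}

def precision_from_edges(edges):
    """Omega^k: uniform weights on [-1,-0.5] U [0.5,1], diagonal |phi_min| + 0.1."""
    Omega = np.zeros((p, p))
    for i, j in edges:
        w = rng.uniform(0.5, 1.0) * rng.choice([-1.0, 1.0])
        Omega[i, j] = Omega[j, i] = w
    np.fill_diagonal(Omega, abs(np.linalg.eigvalsh(Omega).min()) + 0.1)
    return Omega

def to_correlation(Omega):
    """Sigma^k_{ij} = (Omega^k)^{-1}_{ij} / sqrt((Omega^k)^{-1}_{ii} (Omega^k)^{-1}_{jj})."""
    S = np.linalg.inv(Omega)
    d = np.sqrt(np.diag(S))
    return S / np.outer(d, d)

Sigmas = []
for k in range(K):
    edges, extra = set(E0), int(rho * len(E0))
    while extra > 0:                                  # add rho*|E0| individual edges
        i, j = sorted(rng.choice(p, size=2, replace=False))
        if (i, j) not in edges:
            edges.add((i, j)); extra -= 1
    Sigmas.append(to_correlation(precision_from_edges(edges)))
```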

To compare the overall performance of all methods, we generate $n_k = 50$ samples for each $k = 1,\ldots,K$ and compute the average false positive and true positive rates of the estimated inverse covariance matrices over a fine grid of tuning parameters from 20 replications. This gives the ROC curves shown in the right panel of Figure 3.2. JGL provides two options for constraining the similarity among multiple graphical models, GGL and FGL, corresponding to group lasso and fused lasso regularization, respectively. Since each of the two methods in JGL requires two tuning parameters, one controlling the sparsity of the individual graphs and the other controlling the similarity across graphs, we compute the ROC curves over a fine grid of the sparsity regularization parameter while fixing the similarity regularization at four different levels (from low to high similarity), and plot the curve with the largest area under the curve (AUC). In this simulation, GGL turns out to perform best when there is regularization only on the similarity, i.e. a group lasso penalty on the same entry across all $K$ inverse covariance matrices, which we expect to behave close to the proposed JSEM. In the right panel of Figure 3.2, the ROC curve for GGL falls slightly below that of JSEM. In comparison, FGL does not perform as well: the best curve we obtain from FGL shows some advantage over the separate estimation method Glasso, but mostly falls below the curves of the other joint estimation methods. JEM-G performs well and is very competitive with GGL and JSEM at very low false positive and high true positive rates, but starts falling behind once the false positive rate exceeds 5%. In this example, JSEM performs best, with the highest ROC curve throughout the domain.

Next, we computed the estimators from the different methods on a training dataset with $n_k = 50$ samples for each $k = 1,\ldots,K$, using the tuning parameters selected by maximizing the log-likelihood of a separate validation dataset generated from the same distribution and of the same size. Results are summarized in Table 3.1, which compares the estimated inverse covariance matrices with the population version in the true model based on 50 replications, in terms of falsely discovered edges (FP), falsely deleted edges (FN), structural Hamming distance (SHD), F1 score (F1) and Frobenius norm loss (FL). The F1 score summarizes the information in both FP and FN, reaching its best value at 1 and its worst at 0. The results indicate that although GGL and FGL are good at identifying true edges (low FN), they tend to produce a high number of false positives. In comparison, the proposed method JSEM achieves a balance and obtains the highest F1 score as well as the lowest Frobenius norm loss. JEM-G performs slightly worse, but still well above the other three methods.
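The comparison criteria can be computed as in the sketch below (ours); since the thesis does not spell out the normalization of the Frobenius norm loss, a relative loss is assumed here.

```python
import numpy as np

def edge_metrics(Omega_hat, Omega_true, tol=1e-8):
    """FP, FN, SHD = FP + FN, F1 and a (relative) Frobenius norm loss,
    computed on the upper triangles of the estimated and true precision matrices."""
    iu = np.triu_indices_from(Omega_true, k=1)
    est = np.abs(Omega_hat[iu]) > tol
    true = np.abs(Omega_true[iu]) > tol
    fp = int(np.sum(est & ~true))
    fn = int(np.sum(~est & true))
    tp = int(np.sum(est & true))
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    fl = np.linalg.norm(Omega_hat - Omega_true) / np.linalg.norm(Omega_true)
    return {"FP": fp, "FN": fn, "SHD": fp + fn, "F1": f1, "FL": fl}
```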

Table 3.1: Performance of different regularization methods for estimating graphical models in Simulation Study 1: average FP, FN, SHD, F1 and FL (SE) for sample size $n_k = 50$. The best cases are highlighted in bold in the original.

  Method   FP        FN     SHD       F1           FL
  Glasso   411(9)    36(2)  447(9)    0.24(0.01)   0.63(0.01)
  JEM-G    24(3)     29(3)  53(5)     0.75(0.02)   0.32(0.03)
  GGL      1482(24)  6(1)   1488(24)  0.12(0.002)  0.55(0.01)
  FGL      653(16)   20(2)  674(17)   0.21(0.01)   0.60(0.01)
  JSEM     21(5)     24(3)  45(6)     0.79(0.03)   0.27(0.02)

3.4.2 Simulation Study 2

In our second study, we consider a more structured pattern with K = 10 graphs. Each graphical model contains p = 50 variables. Figure 3.3 shows heatmaps of the adjacency matrices of the 10 models.

[Figure 3.3 panel layout: top row graphs 1, 3, 5, 7, 9; bottom row graphs 2, 4, 6, 8, 10.]

Figure 3.3: Simulation study 2: image plots of the adjacency matrices from all graphical models. Graphs in the same row share the same connectivity pattern at the bottom right block, whereas graphs in the same column share the same pattern at the remaining locations.

This pattern is constructed as follows: we first generate the adjacency matrices corresponding to five distinct $p$-dimensional scale-free networks, so that the adjacency matrices in each column of the plot are the same. Next, within each adjacency matrix we replace the connectivity structure of the bottom right diagonal block of size $p/2$ by $p/2$ with that of one of two distinct $p/2$-dimensional scale-free networks, so that graphical models in each row exhibit the same connectivity pattern in that block, while graphs in different rows differ there. Note that by replacing the connectivity structure among the second half of the nodes, the relationships between the first half and the second half of the nodes are also altered. In summary, this structured pattern illustrates how different subsets of the edge set across multiple graphical models can be similar, as well as exhibit differences in their topologies; to the best of our knowledge, such complex relationships have not been studied in the literature. One possible encoding of this pattern is sketched below. In this setting, the proportion of shared non-edges (common zeros in the precision matrices) among all graphical models is about 60%. Once the adjacency matrix, or equivalently the edge set $E^k$, is constructed, we generate the covariance and inverse covariance matrices as in our first simulation study. We also study the effect of misspecification in the input sparsity pattern by varying $\rho = 0, 0.2, 0.4, 0.6, 0.8, 1$.
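For concreteness, the encoding below (our illustration; the thesis does not prescribe a data structure) maps each node pair to the partition of the $K = 10$ models whose groups are forced to share zero/non-zero status, following the layout of Figure 3.3:

```python
p, K = 50, 10
rows = [{1, 3, 5, 7, 9}, {2, 4, 6, 8, 10}]        # same bottom-right block
cols = [{1, 2}, {3, 4}, {5, 6}, {7, 8}, {9, 10}]  # same remaining pattern

def groups_for_edge(i, j):
    """Group structure G^{ij} for the node pair (i, j), 0-indexed nodes."""
    in_block = i >= p // 2 and j >= p // 2        # both in the bottom-right block
    return rows if in_block else cols

print(groups_for_edge(30, 40))  # -> the two row groups
print(groups_for_edge(3, 40))   # -> the five column groups
```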

At each level of pattern misspecification, we generate $n_k = 100$ independent samples for each $k = 1,\ldots,K$ and compare the ROC curves from the different methods based on 20 replications in Figure 3.4. Again, the ROC curves for GGL and FGL are first optimized with respect to the similarity regularization in terms of AUC. When $\rho < 0.6$, the results show a superior performance of JSEM, since it effectively incorporates the available prior information across the various graphical models. JEM-G also yields a reasonably high ROC curve by taking advantage of the shared non-edges among all models. When $\rho \ge 0.6$, JSEM starts suffering from the large amount of pattern misspecification and behaves not much better than even the separate estimation method Glasso, which is the case for the other joint estimation methods as well. In all cases, GGL and FGL behave about the same as or worse than Glasso, due to the complex edge set structures shared only within subsets of all graphical models.


Figure 3.4: Simulation study 2: ROC curves for sample size $n_k = 100$: Glasso (dotted), JEM-G (dotdash), GGL (solid), FGL (dashed), JSEM (long-dash). The misspecification ratio $\rho$ varies from (left to right): 0, 0.2, 0.4 (top row) and 0.6, 0.8, 1 (bottom row).

We then compare the performance of the different methods in identifying the true graphs and estimating the inverse covariance matrices at the optimal choice of tuning parameters. Table 3.2 shows the deviance measures between the estimated and the true inverse covariance matrices based on 50 replications for varying levels of pattern misspecification. In all cases, GGL and FGL have low FN but very high FP, resulting in low F1 scores. For $\rho < 0.6$, JSEM gives much better control over false positives and yields the highest F1 score and lowest Frobenius norm loss. JEM-G is also very competitive in controlling false positive edges and comes next in overall performance. When $\rho \ge 0.6$, the advantage of using a joint estimation method begins to diminish due to the large amount of misspecification, and separate estimation is recommended.

Table 3.2: Performance of different regularization methods for estimating graphical models in Simulation Study 2: average FP, FN, SHD, F1 and FL (SE) for sample size $n_k = 100$. The best cases are highlighted in bold in the original.

  rho   Method  FP       FN     SHD      F1           FL
  0     Glasso  300(4)   26(1)  326(5)   0.41(0.005)  0.54(0.01)
        JEM-G   120(4)   31(1)  152(4)   0.59(0.01)   0.38(0.02)
        GGL     640(10)  8(1)   648(9)   0.29(0.003)  0.56(0.01)
        FGL     519(9)   16(1)  535(9)   0.31(0.004)  0.61(0.01)
        JSEM    55(3)    28(2)  83(4)    0.73(0.01)   0.30(0.01)
  0.2   Glasso  319(5)   34(1)  352(5)   0.43(0.004)  0.52(0.01)
        JEM-G   168(5)   42(2)  210(5)   0.54(0.01)   0.36(0.01)
        GGL     523(6)   18(1)  541(6)   0.35(0.003)  0.54(0.01)
        FGL     503(9)   23(1)  526(10)  0.35(0.01)   0.61(0.01)
        JSEM    107(4)   40(2)  147(5)   0.63(0.01)   0.33(0.01)
  0.4   Glasso  302(4)   44(2)  346(5)   0.46(0.01)   0.49(0.01)
        JEM-G   213(6)   53(2)  266(6)   0.51(0.01)   0.37(0.01)
        GGL     536(5)   21(1)  558(6)   0.38(0.003)  0.50(0.01)
        FGL     490(9)   31(1)  521(9)   0.38(0.005)  0.58(0.01)
        JSEM    165(5)   49(2)  214(5)   0.57(0.01)   0.35(0.01)
  0.6   Glasso  316(4)   49(2)  364(4)   0.49(0.004)  0.47(0.01)
        JEM-G   262(6)   58(2)  319(6)   0.51(0.01)   0.36(0.01)
        GGL     542(6)   25(2)  566(6)   0.41(0.003)  0.48(0.01)
        FGL     476(8)   38(2)  514(8)   0.42(0.004)  0.59(0.004)
        JSEM    206(5)   56(2)  261(6)   0.56(0.01)   0.37(0.01)
  0.8   Glasso  338(4)   49(2)  387(4)   0.51(0.003)  0.46(0.01)
        JEM-G   263(4)   65(2)  327(5)   0.53(0.01)   0.38(0.01)
        GGL     589(5)   22(1)  611(5)   0.43(0.002)  0.49(0.01)
        FGL     466(7)   44(2)  510(7)   0.45(0.004)  0.60(0.004)
        JSEM    240(4)   64(2)  304(5)   0.55(0.01)   0.39(0.01)
  1.0   Glasso  331(5)   61(2)  392(5)   0.52(0.005)  0.46(0.01)
        JEM-G   257(6)   83(2)  340(5)   0.53(0.01)   0.38(0.01)
        GGL     576(6)   29(1)  605(6)   0.45(0.003)  0.49(0.01)
        FGL     454(9)   55(2)  509(9)   0.46(0.01)   0.60(0.005)
        JSEM    259(5)   75(2)  334(6)   0.54(0.01)   0.40(0.01)

While evaluating the performance in estimating multiple graphical models, we noticed that GGL and FGL are computationally very demanding compared to JEM-G and JSEM, especially FGL, due to the fused penalty when the number of models $K$ is large. This might limit their applicability in practice.

3.4.3 Simulation Study 3

Finally, we illustrate how direction consistency helps improve the estimation of graphical models, using the previous two experimental settings. Table 3.3 presents the performance of thresholded JSEM when $\mathcal{G}$ is moderately misspecified with individual-to-common ratio $\rho = 0.3$, based on 50 replications. The tuning parameter $\lambda$ is chosen via maximum likelihood over a separate validation dataset. At the optimal $\lambda$, the within-group thresholding parameter $\alpha_n = n^{-0.25}/2$ is again selected via maximum likelihood. Note that we use a larger sample size $n_k = 200$ in both settings to ensure that the Uniform IC required for direction consistency holds. The advantage of thresholding within groups is obvious in both settings: the thresholded JSEM significantly reduces the number of false positive edges with only a small increase in false negative edges. One may notice the slight increase in Frobenius norm loss for thresholded JSEM, which is likely due to the increased false negatives. Nevertheless, the thresholded version of JSEM obtains higher F1 scores, indicating an overall improvement in the structural estimates of all graphs.

Table 3.3: Performance of JSEM and thresholded JSEM with misspecified groups ($\rho = 0.3$): average FP, FN, SHD, F1 and FL (SE) for sample size $n_k = 200$. The better cases are highlighted in bold in the original.

                K = 5, p = 100, G = {1, 2, 3, 4, 5}   K = 10, p = 40, G as in Figure 3.3
  Method   FP     FN     SHD    F1    FL              FP     FN      SHD    F1    FL
  JSEM     76(6)  15(1)  91(6)  0.71  0.15            51(3)  4(0.4)  55(3)  0.71  0.19
  ThJSEM   32(4)  21(1)  54(4)  0.80  0.16            36(2)  6(0.5)  41(2)  0.76  0.20

3.5 Application to Climate Modeling

To illustrate the performance of our joint estimation method in inferring real-world networks, we apply JSEM to a climate dataset to study climate forcing at multiple locations in North America. Recent assessments from the Intergovernmental Panel on Climate Change (IPCC, Stocker et al., 2013) indicate multiple lines of evidence for climate change in the past century, and these changes have caused significant impacts on natural and human systems. One common approach towards understanding the climate system has been attribution studies of detected changes to internal and external forcing mechanisms (such as solar radiation, greenhouse gases, etc.) using simulated climate models. Lozano et al. (2009) used spatial-temporal modeling to study the attribution of climate forcing mechanisms from observed data. In this work, we provide an alternative approach that learns the complex interactions among climate forcing factors exhibited across different climate zones based on observed data. The data used in this study come from multiple sources and are collected at different resolutions over varying lengths of time. Specifically, the sources we consider include:

(1) CRU: The Climate Research Unit provides monthly climatology data (http://www.cru.uea.ac.uk/cru/data) for 10 surface variables, including mean temperature (TMP), diurnal temperature range (DTR), maximum temperature (TMX), minimum temperature (TMN), precipitation (PRE), vapor pressure (VAP), cloud cover (CLD), rainday counts (WET), potential evapotranspiration (PET) and frost days (FRS), from 1901 to 2013 at the 0.5 degree latitude and longitude resolution. Note these high-resolution gridded datasets are constructed using not only directly observed data, but also values derived and estimated with well-known formulae wherever observed data are not available (see details in Harris et al. (2014)).

(2) NASA: The Goddard Earth Sciences Data and Information Services Center (GES DISC) of the National Aeronautics and Space Administration (NASA) has collected aerosol measurements using the Moderate Resolution Imaging Spectroradiometer (MODIS) on satellites. The dataset obtained from the Terra satellite consists of monthly average aerosol optical depth (AER) at the 1 degree latitude by 1 degree longitude resolution from March 2000 to August 2014.

(3) NCDC: The National Solar Radiation Database (NSRDB) 1991-2010 (a collaborative project between the National Renewable Energy Laboratory (NREL) and the National Climatic Data Center (NCDC)) provides statistical summaries for solar data (ftp://ftp.ncdc.noaa.gov/pub/data/nsrdb-solar/) from 860 different locations across the United States. The locations are recorded using their latitude, longitude and altitude. We used measurements of global horizontal solar radiation (SOL) at 242 class I stations that have high-quality data.

(4) NOAA: The climate data center of the National Oceanic and Atmospheric Administration (NOAA) has archived the trace gases data, including carbon dioxide (CO2), carbon monoxide (CO), methane (CH4) and hydrogen (H2), from 170 worldwide stations (http://www.esrl.noaa.gov/gmd/dv/ftpdata.html). These datasets consist of measurements spanning different time periods, with CO2 ranging from 1968 to 2013 (the longest) and H2 from 1992 to 2005 (the shortest). In addition, they come at relatively low resolution compared to the other variables due to the limited number of stations.

To ensure compatibility and consistency among the multiple data sources, we performed the following pre-processing:

(1) Normalization: We first transformed each dataset into monthly observations in a standard format including longitude, latitude, altitude (when available), date, variable, value, unit, and source. We focus on a 54-month time period from January 2001 to June 2005 where data for all variables are available.

(2) Interpolation and smoothing: We interpolated the monthly data from NCDC and NOAA onto a common 2.5 by 2.5 degree grid for North America using thin plate splines (see the sketch after this list). Since the data from CRU and NASA were provided on a finer-resolution grid, thin plate splines were used to first interpolate the data onto a grid of the same resolution as the source data; we then performed spatial averaging to obtain data on the common 2.5 by 2.5 degree grid.

(3) Seasonality and autocorrelation: We reduced the short-term autocorrelation by aggregating the time series for each variable at each location into bins of 3-month intervals and taking first differences of the quarterly data (also sketched after this list). The resulting data, which consist of 17 measurements, are assumed to be independent samples for the corresponding variable at the specified location.
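The two pre-processing steps referenced above can be sketched as follows. This is our illustration with synthetic station data; scipy's radial basis function interpolator with function="thin_plate" stands in for the thin plate spline step, and the grid and station coordinates are made up.

```python
import numpy as np
from scipy.interpolate import Rbf

rng = np.random.default_rng(2)

# (2) Thin plate spline interpolation of scattered station values onto the
#     common 2.5-degree grid (synthetic stations and values).
lon_s = rng.uniform(-125, -65, 200)
lat_s = rng.uniform(25, 50, 200)
val_s = np.sin(lon_s / 20.0) + 0.1 * rng.standard_normal(200)
tps = Rbf(lon_s, lat_s, val_s, function="thin_plate")
lon_g, lat_g = np.meshgrid(np.arange(-125, -65, 2.5), np.arange(25, 50, 2.5))
grid_vals = tps(lon_g, lat_g)                    # values on the common grid

# (3) Quarterly aggregation and first differencing: 54 monthly values become
#     18 quarterly means and then 17 approximately independent differences.
monthly = rng.standard_normal(54)                # Jan 2001 - Jun 2005
samples = np.diff(monthly.reshape(18, 3).mean(axis=1))
assert samples.shape == (17,)
```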

Next, we randomly select $K = 27$ locations spanning all types of climate from the 2.5 by 2.5 degree grid of North America (see Figure 3.5). This gives an $n \times p$ matrix at each of the 27 locations, corresponding to $n = 17$ observations of the $p = 16$ climate forcing variables. At each location, the conditional dependency network is of dimension $p \times p$, with $16 \times 15/2 = 120$ potential edges to be inferred. Our goal is to infer the conditional dependency networks for all locations simultaneously, based on the available spatial information obtained from the classification of climate zones in Kottek et al. (2006). Specifically, it is assumed that AER and SOL have one common connectivity pattern with the other variables in the geographical south of North America and another common pattern in the north; the definition of south and north is given in Figure 3.5. The variables on greenhouse gases (CO2, CO, CH4 and H2) are assumed to interact with the other variables (except AER and SOL) in the same fashion within each of the four climate groups, i.e. midlatitude desert, semiarid steppe, humid subtropical and humid continental. The connectivity patterns among all remaining variables are assumed to be the same within each of the six distinct climate zones in Figure 3.5.

[Map legend for Figure 3.5: Midlatitude Desert; Semiarid Steppe (hot arid); Semiarid Steppe (cold arid); Humid Subtropical; Humid Continental (hot summer); Humid Continental (warm summer).]

Figure 3.5: The selected 27 locations based on climate classification. The solid line separates the south and north of North America and corresponds to latitude 39 N.

Since there is no separate validation data available, we used BIC on the normalized data to select the tuning parameter $\lambda$ for the proposed JSEM. At the optimal $\lambda$, we applied our method coupled with complementary pairs stability selection (Shah and Samworth, 2012) to identify the interaction networks at the 27 locations. To perform stability selection, we ran our method 50 times on randomly drawn complementary pairs of sizes 8 and 9, and kept only edges that were selected over 70% of the time.
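A sketch of the complementary pairs stability selection loop as used here (ours; fit_edges is a hypothetical placeholder for a JSEM fit at the chosen $\lambda$ that returns the selected edge set):

```python
import numpy as np

def stability_select(X, fit_edges, n_splits=50, threshold=0.7, seed=0):
    """Split the n samples into complementary halves (sizes 8 and 9 when
    n = 17), refit on each half, and keep edges selected in at least
    `threshold` of the fits."""
    rng = np.random.default_rng(seed)
    n, counts, total = X.shape[0], {}, 0
    for _ in range(n_splits):
        idx = rng.permutation(n)
        for half in (idx[: n // 2], idx[n // 2 :]):   # complementary pair
            for edge in fit_edges(X[half]):
                counts[edge] = counts.get(edge, 0) + 1
            total += 1
    return {e for e, c in counts.items() if c / total >= threshold}
```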

Figure 3.6 shows the estimated networks at the six distinct climate zones. Although we do not impose the assumption of a single common structure shared across all locations, there are common edges (solid) identified for all climate zones, reflecting key features of climate forcing regardless of location. Such relationships are consistent with how the corresponding climate forcing variables are defined as well as how the data are collected (Harris et al., 2014). The Midlatitude Desert and Semiarid Steppe climate zones have an edge between DTR and CLD, indicating that the two variables are correlated conditional on all other variables; similar relationships have also been found over drier regions in Zhou et al. (2009). In addition, we notice that the inferred networks at neighboring climate zones are more similar, such as Semiarid Steppe (hot arid and cold arid) or Humid Continental (hot summer and warm summer), whereas zones with dramatically different climates show significantly different connectivity patterns. These common and individual interactions can prove critical in understanding the mechanisms of climate forcing, and can facilitate decision making aimed at the best environmental outcomes.

[Figure 3.6: six network panels (Midlatitude Desert; Semiarid Steppe, hot arid; Semiarid Steppe, cold arid; Humid Subtropical; Humid Continental, hot summer; Humid Continental, warm summer) over the nodes PRE, TMN, PET, TMP, FRS, TMX, DTR, VAP, CLD, WET, SOL, CO2, AER, CO, CH4, H2.]

Figure 3.6: Estimated climate networks at the six distinct climate zones using JSEM, with edges shared across all locations solid and differential edges dashed.

As a comparison, we also applied the other joint estimation methods, JEM-G and GGL, to the same data. We do not present results from FGL due to the extremely high computational cost caused by the fused penalty with a large $K$. For each of JEM-G and GGL, we used BIC on the normalized data to select the optimal tuning parameters and coupled each method with complementary pairs stability selection (Shah and Samworth, 2012) to infer the corresponding climate networks. As in the case of JSEM, we ran each method 50 times on randomly drawn complementary pairs of sizes 8 and 9 and kept only edges selected above a certain threshold. The selection probability used for JSEM is 70%. However, as both simulation studies indicate that JEM-G and GGL tend to produce more false positives, especially GGL, we increased the probability thresholds for JEM-G and GGL to 90% and 100%, respectively. The results are shown in Figures 3.7 and 3.8.


Figure 3.7: Estimated climate networks at the six distinct climate zones using JEM-G, with edges shared across all locations solid and differential edges dashed.

One can see clearly that the estimated networks from JEM-G and GGL exhibit quite different connectivity patterns from those inferred by JSEM. In particular, the results from GGL seem to suggest a strong conditional dependence structure among a subset of variables, which distinguishes it from JEM-G and JSEM.


Figure 3.8: Estimated climate networks at the six distinct climate zones using GGL, with edges shared across all locations solid and differential edges dashed.

On the other hand, the results from JEM-G and JSEM are more similar. For example, common edges identified using JEM-G, such as TMN–TMP, TMP–TMX and PRE–WET, also show up under JSEM. The common edge between CLD and CO2 is found at all locations except Midlatitude Desert under JSEM, whereas the edge between PET and SOL identified using JSEM exists everywhere except at Semiarid Steppe (cold arid) under JEM-G. Note that although JEM-G does not require external information on the structural relationships across graphs, the inferred networks roughly respect the spatial pattern of the climate zones; for instance, the Humid Continental zones (hot summer and warm summer) are more similar.

3.6 Discussion

This work introduced a joint structural estimation method that incorporates a priori known structural relationships between multiple graphical models. Under appropriately specified sparsity patterns, the proposed method borrows information across models wherever structures or substructures are shared, leading to improved performance in network estimation. Further, when the structured sparsity pattern is moderately misspecified, we establish that an additional step of hard thresholding on the estimated groups of coefficients obtained from the penalized regressions helps control the type-I error introduced by the misspecification. In practice, if not all entry-wise structural relationships across the multiple graphical models are available, it is recommended to impose restrictions mainly on edge pairs that are likely to share the same structure, rather than supplying a highly misspecified structured sparsity pattern. The proposed method therefore works well in situations where there is a large number of graphical models, but external similarity information is available only for sub-components of the models.

3.7 Proof of Theorem III.1

For convenience, we use $\sum_k$ as shorthand for $\sum_{k=1}^K$ throughout the proof when it is clear from the context. The first lemma is borrowed from Basu et al. (2012, Lemma A.2); we state the result here for completeness and refer to their paper for its proof.

Lemma III.4. Let $Z_{k\times 1} \sim \mathcal{N}(0, \Sigma)$. Then for any $t > 0$, the following inequalities hold:

$$\mathbb{P}\big( |\|Z\| - \mathbb{E}\|Z\|| > t \big) \le 2\exp\Big( -\frac{2t^2}{\pi^2\|\Sigma\|} \Big), \qquad \mathbb{E}\|Z\| \le \sqrt{k\,\|\Sigma\|}.$$
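As a quick Monte Carlo sanity check of the lemma (our illustration, with $\Sigma = I_k$ so that $\|\Sigma\| = 1$; the empirical mean of the norms stands in for $\mathbb{E}\|Z\|$):

```python
import numpy as np

rng = np.random.default_rng(4)
k, t, reps = 5, 2.0, 100_000
Z = rng.standard_normal((reps, k))                  # Z ~ N(0, I_k)
norms = np.linalg.norm(Z, axis=1)
emp = np.mean(np.abs(norms - norms.mean()) > t)     # empirical deviation probability
print(emp, "<=", 2 * np.exp(-2 * t**2 / np.pi**2))  # tail bound from Lemma III.4
print(norms.mean(), "<=", np.sqrt(k))               # E||Z|| <= sqrt(k ||Sigma||)
```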

To prove the rate of convergence in Theorem III.1, we proceed through three key steps: nodewise regression in Subsection 3.7.1, selection of the edge set in 3.7.2, and maximum likelihood refitting in 3.7.3.

3.7.1 Regression

For $j \ne i$, $g \in \mathcal{G}^{ij}$ and $k \in g$, let $\varepsilon_i^k = X_i^k - \sum_{j\ne i} \theta_{0,ij}^k X_j^k$. Denote $\zeta_{ij}^k = \langle \varepsilon_i^k, X_j^k\rangle/n$ and $\zeta_{ij}^{[g]} = (\zeta_{ij}^k)_{k\in g} \in \mathbb{R}^{|g|}$. Consider the random event $\mathcal{A} = \bigcap_{i,\,j\ne i,\,g} \mathcal{A}_{ij}^g$, where $\mathcal{A}_{ij}^g = \{2\|\zeta_{ij}^{[g]}\| \le \lambda_{ij}^g\}$. The next lemma provides a concentration bound for the random event $\mathcal{A}$ used in the proof of Theorem III.1.

Lemma III.5. Consider the random event $\mathcal{A} = \bigcap_{i,\,j\ne i,\,g} \mathcal{A}_{ij}^g$, where $\mathcal{A}_{ij}^g = \{2\|\zeta_{ij}^{[g]}\| \le \lambda_{ij}^g\}$. For each combination of $(i, j\ne i, g)$, choose

$$\lambda_{ij}^g \ge \max_{k\in g} \frac{2}{\sqrt{n\,\omega_{0,ii}^k}}\Big( \sqrt{|g|} + \frac{\pi}{\sqrt 2}\sqrt{q\log G_0} \Big), \tag{3.11}$$

where $q > 1$ and $G_0$ is the maximum number of groups in all regressions. Then $\mathbb{P}(\mathcal{A}) \ge 1 - 2pG_0^{1-q}$.

Proof of Lemma III.5. By the Bonferroni inequality, $\mathbb{P}(\mathcal{A}^c) \le \sum_{i,\,j\ne i,\,g} \mathbb{P}(\{\mathcal{A}_{ij}^g\}^c)$. For any 3-tuple $(i, j\ne i, g)$, it therefore suffices to find an upper bound for $\mathbb{P}(\{\mathcal{A}_{ij}^g\}^c)$. Denote $\Psi_j^k = (X_j^k)^T X_j^k/n$ and $\Phi_j^k = X_j^k (X_j^k)^T/n$, both of rank 1. The eigendecomposition of $\Phi_j^k$ is $\Phi_j^k = Q^k V^k (Q^k)^T$, where $Q^k$ is the orthogonal matrix whose columns are the eigenvectors of $\Phi_j^k$ and $V^k$ is the diagonal matrix whose diagonal elements are the corresponding eigenvalues. It is clear that the only non-zero eigenvalue of $\Phi_j^k$ is $\gamma_j^k = \|X_j^k\|^2/n = 1$. Let $Q_1^k$ be the eigenvector corresponding to $\gamma_j^k$. Therefore

$$\|\zeta_{ij}^{[g]}\|^2 = \sum_{k\in g} (\zeta_{ij}^k)^2 = \sum_{k\in g} \frac{1}{n^2}(\varepsilon_i^k)^T X_j^k (X_j^k)^T \varepsilon_i^k = \frac1n \sum_{k\in g} (\varepsilon_i^k)^T Q^k V^k (Q^k)^T \varepsilon_i^k = \frac1n \sum_{k\in g} (\varepsilon_i^k)^T Q_1^k \gamma_j^k (Q_1^k)^T \varepsilon_i^k = \frac1n \|Z^{[g]}\|^2,$$

where $Z^{[g]} = (Z^k)_{k\in g}$ with $Z^k = (Q_1^k)^T\varepsilon_i^k$. By definition of $\varepsilon_i^k$, $\mathrm{Var}(Z^k) = 1/\omega_{0,ii}^k$, and $\mathrm{Var}(Z^{[g]})$ is a diagonal matrix with diagonal $(1/\omega_{0,ii}^k)_{k\in g}$. Note that the independence of $Z^k$ and $Z^{k'}$ ($k \ne k'$) comes from the fact that $\varepsilon_i^k$ and $\varepsilon_i^{k'}$ are independent. Therefore

$$\mathbb{P}(\{\mathcal{A}_{ij}^g\}^c) = \mathbb{P}\big( \|Z^{[g]}\|/\sqrt n > \lambda_{ij}^g/2 \big) = \mathbb{P}\big( \|Z^{[g]}\| - \mathbb{E}\|Z^{[g]}\| > \sqrt n\,\lambda_{ij}^g/2 - \mathbb{E}\|Z^{[g]}\| \big).$$

Applying Lemma III.4,

$$\mathbb{P}(\{\mathcal{A}_{ij}^g\}^c) \le \mathbb{P}\big( |\|Z^{[g]}\| - \mathbb{E}\|Z^{[g]}\|| > \sqrt n\,\lambda_{ij}^g/2 - \mathbb{E}\|Z^{[g]}\| \big) \le 2\exp\Bigg\{ -\frac{2}{\pi^2\|\mathrm{Var}(Z^{[g]})\|}\Big( \frac{\sqrt n\,\lambda_{ij}^g}{2} - \mathbb{E}\|Z^{[g]}\| \Big)^2 \Bigg\}.$$

Choose $\lambda_{ij}^g$ such that the right-hand side of the above inequality is less than $2G_0^{-q}$ for some positive parameter $q$. Then

$$\lambda_{ij}^g \ge \frac{2}{\sqrt n}\Big( \mathbb{E}\|Z^{[g]}\| + \frac{\pi}{\sqrt 2}\sqrt{q\log G_0\,\|\mathrm{Var}(Z^{[g]})\|} \Big),$$

and this is satisfied if

$$\lambda_{ij}^g \ge \max_{k\in g} \frac{2}{\sqrt{n\,\omega_{0,ii}^k}}\Big( \sqrt{|g|} + \frac{\pi}{\sqrt 2}\sqrt{q\log G_0} \Big).$$

With the above choice of $\lambda_{ij}^g$,

$$\mathbb{P}(\mathcal{A}^c) \le \sum_{i=1}^p\sum_{j\ne i}\sum_{g\in\mathcal{G}^{ij}} \mathbb{P}(\{\mathcal{A}_{ij}^g\}^c) \le 2pG_0^{1-q},$$

or equivalently, $\mathbb{P}(\mathcal{A}) \ge 1 - 2pG_0^{1-q}$. $\square$

By Lemma III.5, if we choose $\lambda_{ij}^g$ as

$$\lambda_{ij}^g \ge \max_{k\in g} \frac{2}{\sqrt{n\,\omega_{0,ii}^k}}\Big( \sqrt{|g|} + \frac{\pi}{\sqrt 2}\sqrt{q\log G_0} \Big) \tag{3.12}$$

with $q > 1$, then $\mathbb{P}(\mathcal{A}) \ge 1 - 2pG_0^{1-q}$. The next theorem establishes oracle bounds for $\hat\Theta_i - \Theta_{0,i}$ under the chosen $\lambda_{ij}^g$.

Theorem III.6. For $i = 1,\ldots,p$, consider the problem (3.2) and choose $\lambda_{ij}^g$ as in (3.12). Let $\hat\Theta_i$ be a solution to problem (3.2). If Assumption A1 holds with $\kappa^2 = \kappa^2(s_0)$, then on the event $\mathcal{A}$,

$$\sum_{j\ne i,\,g\in\mathcal{G}^{ij}} \|\hat\theta_{ij}^{[g]} - \theta_{0,ij}^{[g]}\| \le \frac{16}{\kappa^2\lambda_{\min}} \sum_{(j,g)\in J(\Theta_{0,i})} (\lambda_{ij}^g)^2, \tag{3.13}$$

$$M(\hat\Theta_i) \le \frac{64\phi_{\max}}{\kappa^2\lambda_{\min}^2} \sum_{(j,g)\in J(\Theta_{0,i})} (\lambda_{ij}^g)^2, \tag{3.14}$$

where $\lambda_{\min} = \min_{i,\,j\ne i,\,g\in\mathcal{G}^{ij}} \lambda_{ij}^g$, $M(\hat\Theta_i) = |J(\hat\Theta_i)|$, and $\phi_{\max}$ is the maximal eigenvalue of $(X^k)^T X^k/n$ over all $k = 1,\ldots,K$. If, in addition, Assumption A1 holds with $\kappa^2(2s_0)$, then for any solution $\hat\Theta_i$ of problem (3.2),

$$\|\hat\Theta_i - \Theta_{0,i}\|_F \le \frac{4\sqrt{10}}{\kappa^2(2s_0)}\,\frac{\sum_{(j,g)\in J(\Theta_{0,i})} (\lambda_{ij}^g)^2}{\lambda_{\min}\sqrt{s_i}}. \tag{3.15}$$

Remark III.7. By Assumption A2, $\omega_{0,ii}^k \ge \phi_{\min}(\Omega_0^k) = \phi_{\max}(\Sigma_0^k)^{-1} \ge d_0$ for all $i, k$. Thus, (3.12) implies that we can choose $\lambda_{ij}^g = \lambda_{\max}$ with

$$\lambda_{\max} = \frac{2}{\sqrt{n d_0}}\Big( \sqrt{|g_{\max}|} + \frac{\pi}{\sqrt 2}\sqrt{q\log G_0} \Big), \tag{3.16}$$

with $q > 1$, for all 3-tuples $(i, j, g)$. Then we can rewrite the oracle inequalities in (3.14) and (3.15) as

$$M(\hat\Theta_i) \le \frac{64\phi_{\max}}{\kappa^2}\, s_i, \tag{3.17}$$

$$\|\hat\Theta_i - \Theta_{0,i}\|_F \le \frac{8\sqrt{10}}{\kappa^2(2s_0)\sqrt{d_0}}\Big( \sqrt{|g_{\max}|} + \frac{\pi}{\sqrt 2}\sqrt{q\log G_0} \Big)\sqrt{\frac{s_i}{n}}. \tag{3.18}$$

Proof of Theorem III.6. For all $\Theta_i \in \mathcal{M}(p-1, K)$, by an adaptation of the argument in Lemma 3.1 of Lounici et al. (2011), it is straightforward to verify the following:

$$\sum_{k=1}^K \frac1n\|X_{-i}^k(\hat\theta_i^k - \theta_{0,i}^k)\|^2 + \sum_{j\ne i}\sum_{g\in\mathcal{G}^{ij}} \lambda_{ij}^g \|\hat\theta_{ij}^{[g]} - \theta_{ij}^{[g]}\| \le \sum_{k=1}^K \frac1n\|X_{-i}^k(\theta_i^k - \theta_{0,i}^k)\|^2 + 4\sum_{(j,g)\in J(\Theta_i)} \lambda_{ij}^g \min\big( \|\theta_{ij}^{[g]}\|,\ \|\hat\theta_{ij}^{[g]} - \theta_{ij}^{[g]}\| \big), \tag{3.19}$$

$$\Big\{ \sum_{k\in g} \big\langle n^{-1} X_j^k,\ X_{-i}^k(\hat\theta_i^k - \theta_{0,i}^k) \big\rangle^2 \Big\}^{1/2} \le \frac{3\lambda_{ij}^g}{2}, \tag{3.20}$$

$$M(\hat\Theta_i) \le \frac{4\phi_{\max}}{\lambda_{\min}^2} \sum_{k=1}^K \frac1n\|X_{-i}^k(\hat\theta_i^k - \theta_{0,i}^k)\|^2, \tag{3.21}$$

where $\lambda_{\min}$ and $\phi_{\max}$ are defined in Theorem III.6.

Let $\Delta$ be a matrix in $\mathcal{M}(p, K)$ such that $\delta_j^k = \hat\theta_{ij}^k - \theta_{0,ij}^k$ for $j \ne i$ and $\delta_i^k = 0$ for all $k$. We first find an upper bound for $B^2$, where

$$B^2 := \sum_k \frac1n \|X_{-i}^k(\hat\theta_i^k - \theta_{0,i}^k)\|^2 = \sum_k \frac1n\|X^k \delta^k\|^2.$$

By inequality (3.19) with $\Theta_i = \Theta_{0,i}$, we have, on the event $\mathcal{A}$,

$$\sum_{j\ne i}\sum_{g\in\mathcal{G}^{ij}} \lambda_{ij}^g\|\delta_j^{[g]}\| \le B^2 + \sum_{j\ne i}\sum_{g\in\mathcal{G}^{ij}} \lambda_{ij}^g\|\delta_j^{[g]}\| \le 4\sum_{(j,g)\in J(\Theta_{0,i})} \lambda_{ij}^g\|\delta_j^{[g]}\|. \tag{3.22}$$

Therefore

$$\sum_{(j,g)\notin J(\Theta_{0,i})} \lambda_{ij}^g\|\delta_j^{[g]}\| \le 3 \sum_{(j,g)\in J(\Theta_{0,i})} \lambda_{ij}^g\|\delta_j^{[g]}\|,$$

and $\Delta \in \mathcal{F}$, the restricted set defined in Assumption A1. Under Assumption A1 with $\kappa = \kappa(s_0)$,

$$B^2 \ge \kappa^2 \|\Delta_J\|_F^2 = \kappa^2 \sum_{(j,g)\in J(\Theta_{0,i})} \|\delta_j^{[g]}\|^2. \tag{3.23}$$

Combining (3.22), (3.23) and the Cauchy-Schwarz inequality, we have

$$B^2 \le 4\sum_{(j,g)\in J(\Theta_{0,i})} \lambda_{ij}^g\|\delta_j^{[g]}\| \le 4\Big\{ \sum_{(j,g)\in J(\Theta_{0,i})} (\lambda_{ij}^g)^2 \Big\}^{1/2}\Big\{ \sum_{(j,g)\in J(\Theta_{0,i})} \|\delta_j^{[g]}\|^2 \Big\}^{1/2} \le 4\Big\{ \sum_{(j,g)\in J(\Theta_{0,i})} (\lambda_{ij}^g)^2 \Big\}^{1/2}\,\frac{B}{\kappa}, \tag{3.24}$$

or equivalently

$$B^2 = \sum_k \frac1n\|X_{-i}^k(\hat\theta_i^k - \theta_{0,i}^k)\|^2 \le \frac{16}{\kappa^2} \sum_{(j,g)\in J(\Theta_{0,i})} (\lambda_{ij}^g)^2. \tag{3.25}$$

For inequality (3.13), by (3.22), the Cauchy-Schwarz inequality and (3.25),

$$\sum_{j\ne i}\sum_{g\in\mathcal{G}^{ij}} \|\delta_j^{[g]}\| \le \frac{1}{\lambda_{\min}}\sum_{j\ne i}\sum_{g\in\mathcal{G}^{ij}} \lambda_{ij}^g\|\delta_j^{[g]}\| \le \frac{4}{\lambda_{\min}}\sum_{(j,g)\in J(\Theta_{0,i})} \lambda_{ij}^g\|\delta_j^{[g]}\| \le \frac{4}{\lambda_{\min}}\Big\{ \sum_{(j,g)\in J(\Theta_{0,i})} \|\delta_j^{[g]}\|^2 \Big\}^{1/2}\Big\{ \sum_{(j,g)\in J(\Theta_{0,i})} (\lambda_{ij}^g)^2 \Big\}^{1/2} \le \frac{4}{\lambda_{\min}}\,\frac{B}{\kappa}\Big\{ \sum_{(j,g)\in J(\Theta_{0,i})} (\lambda_{ij}^g)^2 \Big\}^{1/2} \le \frac{16}{\kappa^2\lambda_{\min}}\sum_{(j,g)\in J(\Theta_{0,i})} (\lambda_{ij}^g)^2.$$

Inequality (3.14) follows readily from (3.21) and (3.25):

$$M(\hat\Theta_i) \le \frac{4\phi_{\max}}{\lambda_{\min}^2}\,B^2 \le \frac{64\phi_{\max}}{\kappa^2\lambda_{\min}^2}\sum_{(j,g)\in J(\Theta_{0,i})} (\lambda_{ij}^g)^2.$$

Finally, we prove (3.15). Let $J_0 = J(\Theta_{0,i})$ and let $J_1$ denote the set of indices in $J_0^c$ corresponding to the $s_i$ largest values of $\lambda_{ij}^g\|\delta_j^{[g]}\|$; the dependence of $J_0$ and $J_1$ on $i$ is made implicit here for clarity. Let $J_{01} = J_0 \cup J_1$, so $|J_{01}| \le 2s_i$. Let $(j_\ell, g_\ell)$ be the index of the $\ell$th largest element of the set $\{\lambda_{ij}^g\|\delta_j^{[g]}\| : (j,g) \in J_0^c\}$. Then

$$\lambda_{ij_\ell}^{g_\ell}\|\delta_{j_\ell}^{[g_\ell]}\| \le \frac{\sum_{(j,g)\in J_0^c} \lambda_{ij}^g\|\delta_j^{[g]}\|}{\ell}.$$

Combining this with the fact that $\Delta \in \mathcal{F}$, we have on the event $\mathcal{A}$,

$$\sum_{(j,g)\in J_{01}^c} \big(\lambda_{ij}^g\|\delta_j^{[g]}\|\big)^2 \le \sum_{\ell = s_i+1}^{\infty} \Bigg( \frac{\sum_{(j,g)\in J_0^c} \lambda_{ij}^g \|\delta_j^{[g]}\|}{\ell} \Bigg)^2 \le \frac{1}{s_i}\Big( \sum_{(j,g)\in J_0^c} \lambda_{ij}^g\|\delta_j^{[g]}\| \Big)^2 \le \frac{9}{s_i}\Big( \sum_{(j,g)\in J_0} \lambda_{ij}^g\|\delta_j^{[g]}\| \Big)^2 \le \frac{9}{s_i} \sum_{(j,g)\in J_0} (\lambda_{ij}^g)^2\, \|\Delta_{J_0}\|_F^2 \le \frac{9}{s_i} \sum_{(j,g)\in J_0} (\lambda_{ij}^g)^2\, \|\Delta_{J_{01}}\|_F^2.$$

It follows immediately that

$$\lambda_{\min}^2 \sum_{(j,g)\in J_{01}^c} \|\delta_j^{[g]}\|^2 \le \frac{9}{s_i} \sum_{(j,g)\in J_0} (\lambda_{ij}^g)^2\, \|\Delta_{J_{01}}\|_F^2.$$

Hence

$$\|\hat\Theta_i - \Theta_{0,i}\|_F^2 = \sum_{j\ne i}\sum_{g\in\mathcal{G}^{ij}} \|\delta_j^{[g]}\|^2 = \|\Delta_{J_{01}}\|_F^2 + \|\Delta_{J_{01}^c}\|_F^2 \le \|\Delta_{J_{01}}\|_F^2 + \frac{9}{s_i\lambda_{\min}^2} \sum_{(j,g)\in J_0} (\lambda_{ij}^g)^2\, \|\Delta_{J_{01}}\|_F^2 \le \frac{10}{s_i\lambda_{\min}^2} \sum_{(j,g)\in J_0} (\lambda_{ij}^g)^2\, \|\Delta_{J_{01}}\|_F^2. \tag{3.26}$$

On the other hand, (3.24) implies that

$$B^2 \le 4\Big\{ \sum_{(j,g)\in J_0} (\lambda_{ij}^g)^2 \Big\}^{1/2} \|\Delta_{J_0}\|_F \le 4\Big\{ \sum_{(j,g)\in J_0} (\lambda_{ij}^g)^2 \Big\}^{1/2} \|\Delta_{J_{01}}\|_F.$$

Under Assumption A1 with $s = 2s_0$, we have

$$B^2 \ge \kappa^2(2s_0)\,\|\Delta_{J_{01}}\|_F^2.$$

So

$$\|\Delta_{J_{01}}\|_F^2 \le \frac{B^2}{\kappa^2(2s_0)} \le \frac{4}{\kappa^2(2s_0)}\Big\{ \sum_{(j,g)\in J_0} (\lambda_{ij}^g)^2 \Big\}^{1/2} \|\Delta_{J_{01}}\|_F,$$

which implies

$$\|\Delta_{J_{01}}\|_F \le \frac{4}{\kappa^2(2s_0)}\Big\{ \sum_{(j,g)\in J_0} (\lambda_{ij}^g)^2 \Big\}^{1/2}.$$

Plugging this into (3.26), we obtain

$$\|\hat\Theta_i - \Theta_{0,i}\|_F^2 \le \Bigg\{ \frac{4\sqrt{10}}{\kappa^2(2s_0)}\,\frac{\sum_{(j,g)\in J_0} (\lambda_{ij}^g)^2}{\lambda_{\min}\sqrt{s_i}} \Bigg\}^2,$$

or equivalently

$$\|\hat\Theta_i - \Theta_{0,i}\|_F \le \frac{4\sqrt{10}}{\kappa^2(2s_0)}\,\frac{\sum_{(j,g)\in J(\Theta_{0,i})} (\lambda_{ij}^g)^2}{\lambda_{\min}\sqrt{s_i}}. \quad\square$$

3.7.2 Selecting Edge Set

Given the estimates $\hat\Theta_i$ ($i = 1,\ldots,p$), define $\hat E^k$ as in (3.10) to be the estimated set of edges in graph $k = 1,\ldots,K$. For every $k$, let $\tilde\Omega^k = \mathrm{diag}(\Omega_0^k) + \Omega_{0, E_0^k \cap \hat E^k}^k$ and $\tilde\Sigma^k = (\tilde\Omega^k)^{-1}$. Let

$$C_{\mathrm{bias}} = \frac{8\sqrt{10}\,c_0}{\kappa^2(2s_0)\sqrt{d_0}}.$$

The following corollary is an immediate consequence of (3.17) and (3.18).

Corollary III.8. Consider $\hat E^k$ ($k = 1,\ldots,K$) selected in (3.10). Suppose all conditions in Theorem III.1 are satisfied. Choose $\lambda_{ij}^g = \lambda_{\max}$ as defined in (3.16) with $q > 1$. Then we have on the event $\mathcal{A}$

$$|\hat E^k| \le \frac{64\phi_{\max}}{\kappa^2(s_0)}\,S_0, \qquad k = 1,\ldots,K, \tag{3.27}$$

and

$$\frac{1}{K}\sum_k \|\tilde\Omega^k - \Omega_0^k\|_F \le \frac{1}{\sqrt K}\Big\{ \sum_k \|\tilde\Omega^k - \Omega_0^k\|_F^2 \Big\}^{1/2} \le C_{\mathrm{bias}}\sqrt{\frac{S_0}{nK}}\Big( \sqrt{|g_{\max}|} + \frac{\pi}{\sqrt 2}\sqrt{q\log G_0} \Big), \tag{3.28}$$

where $G_0$ is the maximum number of groups in all $p$ regressions, $S_0$ is the total number of relevant groups, and $|g_{\max}|$ is the maximum group size.

Remark III.9. The bound in (3.27) says that the cardinality of the estimated set of edges is at most of the order of $S_0$, which proves essential in controlling the error rate of the maximum likelihood estimate $\hat\Omega^k$ in the refitting step. Further, the second inequality in (3.28) implies

$$\Big\{ \sum_k \|\tilde\Omega^k - \Omega_0^k\|_F^2 \Big\}^{1/2} \le \tau_1 d_0,$$

provided that, for $0 < \tau_1 < 1$, the sample size $n$ satisfies

$$n \ge S_0 \Big( \sqrt{|g_{\max}|} + \frac{\pi}{\sqrt 2}\sqrt{q\log G_0} \Big)^2 \Big( \frac{C_{\mathrm{bias}}}{\tau_1 d_0} \Big)^2. \tag{3.29}$$

It follows immediately that on the event $\mathcal{A}$, $\tilde\Omega^k$ is positive definite for all $k = 1,\ldots,K$. Indeed, by Assumption A2,

$$\phi_{\min}(\tilde\Omega^k) \ge \phi_{\min}(\Omega_0^k) - \|\tilde\Omega^k - \Omega_0^k\| \ge \phi_{\min}(\Omega_0^k) - \|\tilde\Omega^k - \Omega_0^k\|_F \ge \phi_{\min}(\Omega_0^k) - \Big\{ \sum_k \|\tilde\Omega^k - \Omega_0^k\|_F^2 \Big\}^{1/2} \ge (1 - \tau_1)d_0 > 0. \tag{3.30}$$

In addition, we have an upper bound for the maximum eigenvalue of $\tilde\Omega^k$:

$$\phi_{\max}(\tilde\Omega^k) \le \phi_{\max}(\Omega_0^k) + \|\tilde\Omega^k - \Omega_0^k\| \le \phi_{\max}(\Omega_0^k) + \|\tilde\Omega^k - \Omega_0^k\|_F \le \phi_{\max}(\Omega_0^k) + \Big\{ \sum_k \|\tilde\Omega^k - \Omega_0^k\|_F^2 \Big\}^{1/2} \le c_0 + \tau_1 d_0 < \infty. \tag{3.31}$$

Proof of Corollary III.8. By definition, $\omega_{0,ij}^k = -\theta_{0,ij}^k\,\omega_{0,ii}^k$ for all $j \ne i$ and $k = 1,\ldots,K$. Further, under Assumption A2, $\omega_{0,ii}^k \le \phi_{\max}(\Omega_0^k) = \phi_{\min}(\Sigma_0^k)^{-1} \le c_0$ for all $i, k$. Therefore

$$\sum_k \|\tilde\Omega^k - \Omega_0^k\|_F^2 = \sum_k \sum_{i=1}^p \sum_{j \in J(\theta_{0,i}^k) \cap J(\hat\theta_i^k)^c} (\theta_{0,ij}^k\,\omega_{0,ii}^k)^2 = \sum_{i=1}^p \sum_{j \in J(\Theta_{0,i}) \cap J(\hat\Theta_i)^c} \sum_{g\in\mathcal{G}^{ij}} \sum_{k\in g} (\theta_{0,ij}^k\,\omega_{0,ii}^k)^2 \le c_0^2 \sum_{i=1}^p \sum_{j \in J(\Theta_{0,i}) \cap J(\hat\Theta_i)^c} \sum_{g\in\mathcal{G}^{ij}} \|\theta_{0,ij}^{[g]}\|^2 \le c_0^2 \sum_{i=1}^p \sum_{j\ne i} \sum_{g\in\mathcal{G}^{ij}} \|\theta_{0,ij}^{[g]} - \hat\theta_{ij}^{[g]}\|^2.$$

Under Assumption A1 with $s = 2s_0$, applying Theorem III.6 with $\lambda_{ij}^g = \lambda_{\max}$ in (3.16) gives

$$\sum_{j\ne i}\sum_{g\in\mathcal{G}^{ij}} \|\theta_{0,ij}^{[g]} - \hat\theta_{ij}^{[g]}\|^2 \le \Bigg\{ \frac{4\sqrt{10}}{\kappa^2(2s_0)}\,\lambda_{\max} \Bigg\}^2 s_i.$$

Therefore,

$$\sum_k \|\tilde\Omega^k - \Omega_0^k\|_F^2 \le \Bigg\{ \frac{4\sqrt{10}\,c_0}{\kappa^2(2s_0)}\,\lambda_{\max} \Bigg\}^2 \sum_{i=1}^p s_i = \Bigg\{ \frac{4\sqrt{10}\,c_0}{\kappa^2(2s_0)}\,\lambda_{\max} \Bigg\}^2 S_0.$$

It follows immediately that

$$\frac{1}{K}\sum_k \|\tilde\Omega^k - \Omega_0^k\|_F \le \frac{1}{\sqrt K}\Big\{ \sum_k \|\tilde\Omega^k - \Omega_0^k\|_F^2 \Big\}^{1/2} \le \frac{4\sqrt{10}\,c_0}{\kappa^2(2s_0)}\,\lambda_{\max}\sqrt{\frac{S_0}{K}} \le C_{\mathrm{bias}}\sqrt{\frac{S_0}{nK}}\Big( \sqrt{|g_{\max}|} + \frac{\pi}{\sqrt 2}\sqrt{q\log G_0} \Big).$$

To bound the estimated edge set $\hat E^k$, notice that if there exists $(i, j, k)$ with $\hat\theta_{ij}^k \ne 0$, then $\hat\theta_{ij}^{[g]} \ne 0$ for the group $g \ni k$. Hence $M(\hat\theta_i^k) \le M(\hat\Theta_i)$ for all $k$. By (3.14), the upper bound for $\hat E^k$ is thus

$$|\hat E^k| \le \sum_{i=1}^p M(\hat\theta_i^k) \le \sum_{i=1}^p \frac{64\phi_{\max}}{\kappa^2(s_0)\lambda_{\min}^2}\sum_{(j,g)\in J(\Theta_{0,i})} (\lambda_{ij}^g)^2 = \frac{64\phi_{\max}}{\kappa^2(s_0)}\sum_{i=1}^p s_i \le \frac{64\phi_{\max}}{\kappa^2(s_0)}\,S_0. \quad\square$$

3.7.3 Refitting

Proof of Theorem III.1. Let

$$r_n = C_{\mathrm{bias}}\sqrt{\frac{S_0}{n}}\Big( \sqrt{|g_{\max}|} + \frac{\pi}{\sqrt 2}\sqrt{q\log G_0} \Big).$$

In view of Corollary III.8, it suffices to show that

$$\sum_k \|\hat\Omega^k - \tilde\Omega^k\|_F^2 \le O(r_n^2),$$

since by the Cauchy-Schwarz inequality,

$$\frac{1}{K}\sum_k \|\hat\Omega^k - \tilde\Omega^k\|_F \le \frac{1}{\sqrt K}\Big\{ \sum_k \|\hat\Omega^k - \tilde\Omega^k\|_F^2 \Big\}^{1/2},$$

and by the triangle inequality,

$$\frac{1}{K}\sum_k \|\hat\Omega^k - \Omega_0^k\|_F \le \frac{1}{K}\sum_k \|\hat\Omega^k - \tilde\Omega^k\|_F + \frac{1}{K}\sum_k \|\tilde\Omega^k - \Omega_0^k\|_F.$$

For $k = 1,\ldots,K$, let $\Delta^k = \Omega^k - \tilde\Omega^k \in \mathcal{M}(p, p)$ and $\hat\Delta^k = \hat\Omega^k - \tilde\Omega^k$. Let

$$Q(\Omega) = \sum_k \Big\{ \mathrm{tr}(\hat\Sigma^k\Omega^k) - \log\det(\Omega^k) - \mathrm{tr}(\hat\Sigma^k\tilde\Omega^k) + \log\det(\tilde\Omega^k) \Big\}.$$

Since $(\hat\Omega^k)_{k=1}^K$ minimizes $Q(\Omega)$, $(\hat\Delta^k)_{k=1}^K$ minimizes $G(\Delta) = Q(\tilde\Omega + \Delta)$. For $k = 1,\ldots,K$, define a sequence of convex sets

$$\mathcal{U}_n(\tilde\Omega^k) = \{\Gamma - \tilde\Omega^k \mid \Gamma \in \mathcal{S}_+^p \cap \mathcal{S}_{\hat E^k}^p\}.$$

The main idea of the proof is as follows. For a sufficiently large $M > 0$, consider the set

$$T_n = \Big\{(\Delta^1,\ldots,\Delta^K) : \Delta^k \in \mathcal{U}_n(\tilde\Omega^k),\ \sum_k \|\Delta^k\|_F^2 = M r_n^2\Big\}.$$

It is clear that $G(\Delta)$ is a convex function and $G(\hat\Delta) \le G(0) = 0$. Thus, if we can show that $\inf_{\Delta\in T_n} G(\Delta) > 0$, the minimizer $\hat\Delta$ must lie inside the ball defined by $T_n$, that is, $\sum_k \|\hat\Delta^k\|_F^2 \le M r_n^2$. To see this, note that the convexity of $Q(\Omega)$ implies that $\inf_{\Delta\in T_n} Q(\tilde\Omega + \Delta) > Q(\tilde\Omega) = 0$. There therefore exists a local minimizer in the ball $\{\tilde\Omega^k + \Delta^k : \sum_k \|\Delta^k\|_F^2 \le M r_n^2\}$, or equivalently, $\sum_k \|\hat\Delta^k\|_F^2 \le M r_n^2$. In the remainder of the proof, we focus on

$$G(\Delta) = \sum_k \Big\{ \mathrm{tr}(\hat\Sigma^k\Delta^k) - \log\det(\tilde\Omega^k + \Delta^k) + \log\det(\tilde\Omega^k) \Big\}.$$

Applying a Taylor expansion to the logarithm terms in the above equation, we have

$$\log\det(\tilde\Omega^k + \Delta^k) - \log\det(\tilde\Omega^k) = \mathrm{tr}(\tilde\Sigma^k\Delta^k) - \mathrm{vec}(\Delta^k)^T \Bigg[ \int_0^1 (1-t)\,(\tilde\Omega^k + t\Delta^k)^{-1} \otimes (\tilde\Omega^k + t\Delta^k)^{-1}\,dt \Bigg] \mathrm{vec}(\Delta^k),$$

where $\otimes$ is the Kronecker product and $\mathrm{vec}(\Delta^k)$ is $\Delta^k$ vectorized to match the dimensions of the Kronecker product. Therefore, we can rewrite $G(\Delta) = L_1 - L_2 + L_3$, with

$$L_1 = \sum_k \mathrm{tr}\big( (\hat\Sigma^k - \Sigma_0^k)\Delta^k \big), \qquad L_2 = \sum_k \mathrm{tr}\big( (\tilde\Sigma^k - \Sigma_0^k)\Delta^k \big),$$

$$L_3 = \sum_k \mathrm{vec}(\Delta^k)^T \Bigg[ \int_0^1 (1-t)\,(\tilde\Omega^k + t\Delta^k)^{-1} \otimes (\tilde\Omega^k + t\Delta^k)^{-1}\,dt \Bigg] \mathrm{vec}(\Delta^k).$$

Next we bound each term separately.

Recall that for every $k$, $\Sigma_0^k$ and $\hat\Sigma^k$ represent the correlation and the sample correlation matrix, respectively. Since $\phi_{\max}(\Sigma_0^k) \le 1/d_0$ for all $k$, by Lemma 14 of Zhou et al. (2011) [see details on page 3003],

$$\mathbb{P}\big( |\hat\sigma_{ij}^k - \sigma_{0,ij}^k| \ge t \big) \le \exp\Bigg( -\frac{3nt^2}{10\{1 + (\sigma_{0,ij}^k)^2\}} \Bigg) \le \exp\Big( -\frac{3nt^2}{20} \Big), \tag{3.32}$$

for $0 \le t \le \{1 + (\sigma_{0,ij}^k)^2\}/2$. Thus, if we choose for some $c_1 > 0$

$$t = c_1\sqrt{\frac{1}{K}}\sqrt{\frac{1}{n}}\Big( \sqrt{|g_{\max}|} + \frac{\pi}{\sqrt 2}\sqrt{q\log G_0} \Big),$$

then $\max_{k,\,i\ne j} |\hat\sigma_{ij}^k - \sigma_{0,ij}^k| \le t$ with probability tending to 1, provided that the sample size satisfies

$$n \ge \frac{4c_1^2}{K}\Big( \sqrt{|g_{\max}|} + \frac{\pi}{\sqrt 2}\sqrt{q\log G_0} \Big)^2. \tag{3.33}$$

Write $\Delta^k = \Delta^{k,+} + \Delta^{k,-}$, where $\Delta^{k,+}$ is the diagonal matrix with the same diagonal elements as $\Delta^k$ and $\Delta^{k,-}$ consists of the off-diagonal elements. Then

$$|L_1| \le \sum_k\sum_{i\ne j} |\hat\sigma_{ij}^k - \sigma_{0,ij}^k|\,|\Delta_{ij}^k| \le c_1\sqrt{\frac{1}{nK}}\Big( \sqrt{|g_{\max}|} + \frac{\pi}{\sqrt 2}\sqrt{q\log G_0} \Big)\sum_k \|\Delta^{k,-}\|_1 \le c_1\sqrt{\frac{1}{n}}\Big( \sqrt{|g_{\max}|} + \frac{\pi}{\sqrt 2}\sqrt{q\log G_0} \Big)\max_k |2\hat E^k|^{1/2}\Big( \sum_k \|\Delta^k\|_F^2 \Big)^{1/2} \le \frac{8\sqrt 2\,c_1\sqrt{\phi_{\max}}}{C_{\mathrm{bias}}\,\kappa(s_0)}\,r_n\Big( \sum_k \|\Delta^k\|_F^2 \Big)^{1/2}.$$

To bound the second term, since $\phi_{\min}(\tilde\Omega^k)$ ($k = 1,\ldots,K$) is bounded below by (3.30),

$$|L_2| \le \sum_k |\langle \tilde\Sigma^k - \Sigma_0^k, \Delta^k\rangle| \le \sum_k \|\tilde\Sigma^k - \Sigma_0^k\|_F\,\|\Delta^k\|_F \le \sum_k \frac{\|\tilde\Omega^k - \Omega_0^k\|_F}{\phi_{\min}(\tilde\Omega^k)\,\phi_{\min}(\Omega_0^k)}\,\|\Delta^k\|_F \le \frac{1}{(1-\tau_1)d_0^2}\Big( \sum_k \|\Delta^k\|_F^2 \Big)^{1/2}\Big( \sum_k \|\tilde\Omega^k - \Omega_0^k\|_F^2 \Big)^{1/2} \le \frac{r_n}{(1-\tau_1)d_0^2}\Big( \sum_k \|\Delta^k\|_F^2 \Big)^{1/2}, \tag{3.34}$$

where the last inequality in (3.34) comes from the rotation invariance of the Frobenius norm.

Finally, we bound $L_3$. Suppose that for a small constant $0 < \tau_2 < 1$ with $\tau_1 + \tau_2 < 1$, the sample size $n$ satisfies

$$n \ge M S_0 \Big( \sqrt{|g_{\max}|} + \frac{\pi}{\sqrt 2}\sqrt{q\log G_0} \Big)^2 \Big( \frac{C_{\mathrm{bias}}}{\tau_2 d_0} \Big)^2; \tag{3.35}$$

then $\sqrt M\,r_n \le \tau_2 d_0$. By (3.31), $\phi_{\max}(\tilde\Omega^k)$ is bounded above by $c_0 + \tau_1 d_0$. Therefore, for $\Delta \in T_n$,

$$\phi_{\max}(\tilde\Omega^k + \Delta^k) \le c_0 + \tau_1 d_0 + \|\Delta^k\| \le c_0 + \tau_1 d_0 + \|\Delta^k\|_F \le c_0 + \tau_1 d_0 + \Big( \sum_k \|\Delta^k\|_F^2 \Big)^{1/2} \le c_0 + (\tau_1 + \tau_2)d_0,$$

$$\phi_{\min}(\tilde\Omega^k + \Delta^k) \ge (1 - \tau_1)d_0 - \|\Delta^k\| \ge (1 - \tau_1)d_0 - \|\Delta^k\|_F \ge (1 - \tau_1)d_0 - \Big( \sum_k \|\Delta^k\|_F^2 \Big)^{1/2} \ge (1 - \tau_1 - \tau_2)d_0 > 0.$$

For $\tilde\Omega^k$ and $\Delta^k$ defined above, Zhou et al. (2011) showed that $\tilde\Omega^k + t\Delta^k \succ 0$ for $t \in [0,1]$ and all $k = 1,\ldots,K$ on the event $\mathcal{A}$. Thus, following arguments similar to Rothman et al. (2008, page 502), we have

$$|L_3| \ge \frac12 \sum_k \phi_{\min}^2\big( (\tilde\Omega^k + \Delta^k)^{-1} \big)\,\|\Delta^k\|_F^2 = \frac12 \sum_k \phi_{\max}^{-2}(\tilde\Omega^k + \Delta^k)\,\|\Delta^k\|_F^2 \ge \frac{1}{2(c_0 + \tau_1 d_0 + \tau_2 d_0)^2}\sum_k \|\Delta^k\|_F^2.$$

Combining the above three bounds, we thus have

$$G(\Delta) \ge |L_3| - |L_1| - |L_2| \ge \frac{1}{2(c_0 + \tau_1 d_0 + \tau_2 d_0)^2}\sum_k \|\Delta^k\|_F^2 - \frac{8\sqrt 2\,c_1\sqrt{\phi_{\max}}}{C_{\mathrm{bias}}\,\kappa(s_0)}\,r_n\Big( \sum_k \|\Delta^k\|_F^2 \Big)^{1/2} - \frac{r_n}{(1-\tau_1)d_0^2}\Big( \sum_k \|\Delta^k\|_F^2 \Big)^{1/2} \ge M r_n^2\Bigg[ \frac{1}{2(c_0 + \tau_1 d_0 + \tau_2 d_0)^2} - \frac{8c_1\sqrt{2\phi_{\max}}}{C_{\mathrm{bias}}\,\kappa(s_0)}\frac{1}{\sqrt M} - \frac{1}{(1-\tau_1)d_0^2}\frac{1}{\sqrt M} \Bigg] > 0,$$

for $M$ sufficiently large. $\square$

3.8 Proof of Theorem III.2

Consider the group lasso estimator $\hat\Theta_i$ defined in (3.2). Since problem (3.2) is a special case of the generic group lasso in Basu et al. (2012), we adapt their Theorem 4.1 to our design.

Proof of Theorem III.2. Let $\mathcal{X}_i$ be the block diagonal matrix composed of all variables but $X_i^k$ ($k = 1,\ldots,K$). Without loss of generality, suppose $\mathcal{X}_i = (\mathcal{X}_{i,(1)}, \mathcal{X}_{i,(2)})$ such that $\mathcal{X}_{i,(1)} = \mathrm{diag}(X_{I_1}^1, \ldots, X_{I_K}^K)$ is the sub-matrix consisting of all relevant variables. Denote the Gram matrix

$$C = \frac1n\,\mathcal{X}_i^T \mathcal{X}_i = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}$$

with $C_{11} = \mathcal{X}_{i,(1)}^T \mathcal{X}_{i,(1)}/n$ and $C_{22} = \mathcal{X}_{i,(2)}^T \mathcal{X}_{i,(2)}/n$; $C_{12}$ and $C_{21}$ are defined accordingly.

Now consider interchanging the columns of $\mathcal{X}_i$ such that

$$\tilde{\mathcal{X}}_i = \mathcal{X}_i\,\mathrm{diag}(R_1, R_2) = (\mathcal{X}_{i,(1)}R_1,\ \mathcal{X}_{i,(2)}R_2) = (\tilde{\mathcal{X}}_{i,(1)}, \tilde{\mathcal{X}}_{i,(2)}),$$

where the columns of $\tilde{\mathcal{X}}_{i,(1)}$ and $\tilde{\mathcal{X}}_{i,(2)}$ are ordered in groups of variables. Here $R_l$ is a product of elementary column-switching matrices and satisfies $R_l^{-1} = R_l^T$ ($l = 1, 2$); note $R_1 \in \mathcal{M}(\sum_k |I_k|, \sum_k |I_k|)$. Based on $\tilde{\mathcal{X}}_i$, we can define $\tilde C_{11}$, $\tilde C_{21}$ and $\tilde C_{22}$ analogously. The advantage of using $\tilde{\mathcal{X}}_i$ as the design matrix is that it orders the variables based on the grouping structures, and is in the form of the generic group lasso design in Basu et al. (2012); it is thus more straightforward to adapt their results using $\tilde{\mathcal{X}}_i$.

With the above notation, the Uniform IC in Assumption A3 is equivalent to requiring that, for all $\xi = ((\xi^1)^T,\ldots,(\xi^K)^T)^T \in \mathbb{R}^{\sum_k |I_k|}$ with $\max_{(j,g)\in J(\Theta_{0,i})} \|\xi_j^{[g]}\| \le 1$ and all $(j,g) \notin J(\Theta_{0,i})$,

$$\big\| \big( \tilde C_{21}(\tilde C_{11})^{-1}\tilde\xi \big)_{[j,g]} \big\| \le 1 - \eta, \tag{3.36}$$

where $\tilde\xi = R_1^T \xi$. It remains to select $\lambda$ and $\alpha_n$ to ensure that the direction consistency results hold simultaneously for all $i$ with probability tending to 1. For any $(j,g) \in J(\Theta_{0,i})$, denote by $(\tilde C_{11}^{-1})_{[j,g]}$ the diagonal block in $\tilde C_{11}^{-1}$ corresponding to the group $(j,g)$. By Theorem 4.1 of Basu et al. (2012), it suffices to find upper bounds for $\|\tilde C_{11}^{-1}\|$, $\|(\tilde C_{11}^{-1})_{[j,g]}\|$ and $\|(\tilde C_{22})_{[j,g]}\|$, and to substitute the constant variance $\sigma$ with the appropriate bound on $\mathrm{Var}(X_i^k \mid X_{-i}^k) = 1/\omega_{0,ii}^k$ ($k = 1,\ldots,K$).

By definition and the fact that the $X^k$ are centered and standardized, $(\tilde C_{11})_{[j,g]}$ is the identity matrix of size $|g| \times |g|$. It follows that

$$1 = \phi_{\min}\big( (\tilde C_{11})_{[j,g]} \big)^{-1} \le \phi_{\max}\big( (\tilde C_{11}^{-1})_{[j,g]} \big) = \|(\tilde C_{11}^{-1})_{[j,g]}\| \le \|\tilde C_{11}^{-1}\|, \tag{3.37}$$

where the last step follows from the Courant minimax principle, since $0 \prec (\tilde C_{11}^{-1})_{[j,g]} \preceq \tilde C_{11}^{-1}$. Similarly, for any $(j,g) \notin J(\Theta_{0,i})$, $(\tilde C_{22})_{[j,g]}$ is the identity matrix and

$$\|(\tilde C_{22})_{[j,g]}\| = 1. \tag{3.38}$$

Moreover, the variance for the random design in our problem satisfies

$$\mathrm{Var}(X_i^k \mid X_{-i}^k) = 1/\omega_{0,ii}^k \le 1/d_0, \quad \forall k, \tag{3.39}$$

by Assumption A2. It remains to find an upper bound for $\|\tilde C_{11}^{-1}\|$. Under Assumption A1 with $s = s_0$, if we set $\Delta \in \mathcal{F}$ such that $\delta_j^{[g]} = 0$ for any $(j,g) \notin J(\Theta_{0,i})$, then

$$\frac{\sum_k \|X^k\delta^k\|^2/n}{\|\Delta_{J(\Theta_{0,i})}\|_F^2} = \frac{\xi^T C_{11}\xi}{\xi^T\xi},$$

where $\xi = ((\xi^1)^T,\ldots,(\xi^K)^T)^T \in \mathbb{R}^{\sum_k |I_k|}$ is such that each $\xi^k$ corresponds to the nonzero part of $\delta^k$. If we choose $\Delta$ such that $\xi$ is the eigenvector corresponding to the smallest eigenvalue of $C_{11}$, then

$$\kappa^2(s_0) \le \frac{\sum_k \|X^k\delta^k\|^2/n}{\|\Delta_{J(\Theta_{0,i})}\|_F^2} = \frac{\xi^T C_{11}\xi}{\xi^T\xi} = \phi_{\min}(C_{11}).$$

Since $R_1^{-1} = R_1^T$, $C_{11}$ and $\tilde C_{11}$ are similar (i.e. there exists a non-singular matrix $P$ such that $P^{-1}C_{11}P = \tilde C_{11}$) and thus share the same set of eigenvalues. Therefore $\phi_{\min}(\tilde C_{11}) \ge \kappa^2(s_0)$ and

$$\|\tilde C_{11}^{-1}\| \le \kappa^{-2}(s_0). \tag{3.40}$$

Combining the upper bounds in (3.37), (3.38), (3.39) and (3.40), Theorem 4.1 of Basu et al. (2012) implies that if we select $\lambda$ and $\alpha_n$ as in (3.8) and (3.9), respectively, the direction consistency results follow by applying a union bound across $i = 1,\ldots,p$.

Further, if $\alpha_n < 1$, the direction consistency property of $\hat\Theta_i$ implies exact recovery of all nonzero entries in the inverse covariance matrices, provided that the sparsity pattern $\mathcal{G}$ is correctly specified. In other words, the set in (3.10) estimates correctly the true edge set $E_0^k$ for all $k$. This completes the proof. $\square$

BIBLIOGRAPHY

Al-Shahrour, F., R. Díaz-Uriarte, and J. Dopazo (2005), Discovering molecular functions significantly related to phenotypes by combining gene expression data and biological information, Bioinformatics, 21 (13), 2988–2993.

Banerjee, O., L. E. Ghaoui, and A. d'Aspremont (2008), Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data, The Journal of Machine Learning Research, 9, 485–516.

Basu, S., A. Shojaie, and G. Michailidis (2012), Network Granger causality with inherent grouping structure, arXiv preprint arXiv:1210.3711.

Baur, J. A., et al. (2006), Resveratrol improves health and survival of mice on a high-calorie diet, Nature, 444 (7117), 337–342.

Beißbarth, T., and T. P. Speed (2004), Gostat: find statistically overrepresented gene ontologies within a group of genes, Bioinformatics, 20 (9), 1464–1465.

Benjamini, Y., and Y. Hochberg (1995), Controlling the false discovery rate: A practical and powerful approach to multiple testing, Journal of the Royal Statistical Society. Series B (Methodological), 57 (1), 289–300.

Bickel, P. J., and E. Levina (2008), Regularized estimation of large covariance matri- ces, The Annals of Statistics, 36 (1), 199–227.

Bickel, P. J., Y. Ritov, and A. B. Tsybakov (2009), Simultaneous analysis of lasso and dantzig selector, The Annals of Statistics, 37 (4), 1705–1732.

Boyd, S., and L. Vandenberghe (2004), Convex optimization, Cambridge University Press.

Boyd, S., N. Parikh, E. Chu, B. Peleato, and J. Eckstein (2011), Distributed optimization and statistical learning via the alternating direction method of multipliers, Foundations and Trends in Machine Learning, 3 (1), 1–122.

Breheny, P., and J. Huang (2009), Penalized methods for bi-level variable selection, Stat Interface, 2 (3), 369–380.

Byrd, R. H., P. Lu, J. Nocedal, and C. Zhu (1995), A limited memory algorithm for bound constrained optimization, SIAM Journal on Scientific Computing, 16 (5), 1190–1208.

Cai, J.-F., E. J. Candès, and Z. Shen (2010), A singular value thresholding algorithm for matrix completion, SIAM Journal on Optimization, 20 (4), 1956–1982.

Candes, E. J., and B. Recht (2009), Exact matrix completion via convex optimization, Foundations of Computational Mathematics, 9, 717–772.

Chiquet, J., Y. Grandvalet, and C. Ambroise (2011), Inferring multiple graphical structures, Statistics and Computing, 21 (4), 537–553.

Chuang, H.-Y., L. Rassenti, M. Salcedo, K. Licon, A. Kohlmann, T. Haferlach, R. Foà, T. Ideker, and T. J. Kipps (2012), Subnetwork-based analysis of chronic lymphocytic leukemia identifies pathways that associate with disease progression, Blood, 120 (13), 2639–2649.

Cui, L., H. Jeong, F. Borovecki, C. N. Parkhurst, N. Tanese, and D. Krainc (2006), Transcriptional repression of pgc-1α by mutant huntingtin leads to mitochondrial dysfunction and neurodegeneration, Cell, 127 (1), 59–69.

Danaher, P., P. Wang, and D. M. Witten (2014), The joint graphical lasso for inverse covariance estimation across multiple classes, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76 (2), 373–397.

Dehmer, M., and F. Emmert-Streib (2008), Analysis of microarray data: a network- based approach, John Wiley & Sons.

Dempster, A. P. (1972a), Covariance selection, Biometrics, 28 (1), 157–175.

Dempster, A. P. (1972b), Covariance selection, Biometrics, pp. 157–175.

Drton, M., and M. D. Perlman (2004), Model selection for gaussian concentration graph, Biometrika, 91 (3), 591–602.

Efron, B., and R. Tibshirani (2007), On testing the significance of sets of genes, The Annals of Applied Statistics, 1 (1), 107–129.

Friedman, J., T. Hastie, and R. Tibshirani (2008), Sparse inverse covariance estimation with the graphical lasso, Biostatistics, 9 (3), 432–441.

Gottwein, E., et al. (2007), A viral microrna functions as an orthologue of cellular mir-155, Nature, 450 (7172), 1096–1099.

Green, M. R., et al. (2011), Signatures of murine b-cell development implicate yy1 as a regulator of the germinal center-specific program, Proceedings of the National Academy of Sciences, 108 (7), 2873–2878.

Guo, J., E. Levina, G. Michailidis, and J. Zhu (2011), Joint estimation of multiple graphical models, Biometrika, 98 (1), 1–15.

Haberman, S. J. (1989), Concavity and estimation, The Annals of Statistics, 17 (4), 1631–1661.

Harris, I., P. Jones, T. Osborn, and D. Lister (2014), Updated high-resolution grids of monthly climatic observations – the CRU TS3.10 dataset, International Journal of Climatology, 34 (3), 623–642.

Houstis, N., E. D. Rosen, and E. S. Lander (2006), Reactive oxygen species have a causal role in multiple forms of insulin resistance, Nature, 440 (7086), 944–948.

Huang, D. W., B. T. Sherman, and R. A. Lempicki (2008), Systematic and integrative analysis of large gene lists using david bioinformatics resources, Nature protocols, 4 (1), 44–57.

Huarte, M., et al. (2010), A large intergenic noncoding rna induced by p53 mediates global gene repression in the p53 response, Cell, 142 (3), 409–419.

Huerta, A. M., H. Salgado, D. Thieffry, and J. Collado-Vides (1998), Regulondb: A database on transcriptional regulation in escherichia coli, Nucleic Acids Research, 26 (1), 55–59, doi:10.1093/nar/26.1.55.

Ideker, T., and N. J. Krogan (2012), Differential network biology, Molecular systems biology, 8 (1).

Ideker, T., J. Dutkowski, and L. Hood (2011), Boosting signal-to-noise in complex biology: prior knowledge is power, Cell, 144 (6), 860–863.

Ischenko, I., J. Liu, O. Petrenko, and M. Hayman (2014), Transforming growth factor-beta signaling network regulates plasticity and lineage commitment of lung cancer cells, Cell Death & Differentiation.

Joshi-Tope, G., et al. (2003), The genome knowledgebase: A resource for biologists and bioinformaticists, Cold Spring Harbor Symposia on Quantitative Biology, 68, 237–244, doi:10.1101/sqb.2003.68.237.

Kanehisa, M., and S. Goto (2000), Kegg: Kyoto encyclopedia of genes and genomes, Nucleic Acids Research, 28 (1), 27–30, doi:10.1093/nar/28.1.27.

Khatri, P., M. Sirota, and A. J. Butte (2012), Ten years of pathway analysis: current approaches and outstanding challenges, PLoS Comput Biol, 8 (2), e1002375, doi:10.1371/journal.pcbi.1002375.

Kottek, M., J. Grieser, C. Beck, B. Rudolf, and F. Rubel (2006), World map of the Köppen-Geiger climate classification updated, Meteorologische Zeitschrift, 15 (3), 259–263.

Lauritzen, S. L. (1996), Graphical models, Oxford University Press.

Lounici, K., M. Pontil, S. van de Geer, and A. B. Tsybakov (2011), Oracle inequalities and optimal inference under group sparsity, The Annals of Statistics, 39 (4), 2164– 2204.

Lozano, A. C., H. Li, A. Niculescu-Mizil, Y. Liu, C. Perlich, J. Hosking, and N. Abe (2009), Spatial-temporal causal modeling for climate change attribution, in Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 587–596, ACM.

Meinshausen, N., and P. Bühlmann (2006), High dimensional graphs and variable selection with the lasso, The Annals of Statistics, 34 (3), 1436–1462.

Meinshausen, N., and P. Bühlmann (2010), Stability selection, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72 (4), 417–473.

Mirsky, L. (1975), A trace inequality of John von Neumann, Monatshefte für Mathematik, 79 (4), 303–306.

Mohan, K., P. London, M. Fazel, D. Witten, and S.-I. Lee (2014), Node-based learning of multiple gaussian graphical models, The Journal of Machine Learning Research, 15 (1), 445–488.

Nishimura, D. (2001), Biocarta, Biotech Software & Internet Report: The Computer Software Journal for Scientists, 2 (3), 117–120.

Palomero, T., et al. (2006), Notch1 directly regulates c-myc and activates a feed-forward-loop transcriptional network promoting leukemic cell growth, Proceedings of the National Academy of Sciences, 103 (48), 18,261–18,266.

Peng, J., P. Wang, N. Zhou, and J. Zhu (2009), Partial correlation estimation by joint sparse regression model, Journal of the American Statistical Association, 104 (486), 735–746.

Perroud, B., J. Lee, N. Valkova, A. Dhirapong, P.-Y. Lin, O. Fiehn, D. K¨ultz, and R. H. Weiss (2006), Pathway analysis of kidney cancer using proteomics and metabolic profiling, Molecular Cancer, 5 (1), 64.

Peterson, C., F. Stingo, and M. Vannucci (2014), Bayesian inference of multiple gaussian graphical models, Journal of the American Statistical Association, (just-accepted), 00–00.

Pujana, M. A., et al. (2007), Network modeling links breast cancer susceptibility and centrosome dysfunction, Nature Genetics, 39 (11), 1338 – 1349.

Putluri, N., et al. (2011), Metabolomic profiling reveals potential markers and bioprocesses altered in bladder cancer progression, Cancer Research, 71 (24), 7376–7386.

Raskutti, G., B. Yu, M. J. Wainwright, and P. K. Ravikumar (2009), Model selection in gaussian graphical models: High-dimensional consistency of ℓ1-regularized mle, in Advances in Neural Information Processing Systems, pp. 1329–1336.

Rothman, A. J., P. J. Bickel, E. Levina, and J. Zhu (2008), Sparse permutation invariant covariance estimation, Electronic Journal of Statistics, 2, 494–515.

Searle, S. (1971), Linear models, New York: Wiley, doi:10.1002/9781118491782.

Shah, R. D., and R. J. Samworth (2012), Variable selection with error control: Another look at stability selection, Journal of the Royal Statistical Society: Series B (Statistical Methodology).

Shojaie, A., and G. Michailidis (2009), Analysis of gene sets based on the underlying regulatory network, Journal of Computational Biology, 16 (3), 407–426.

Shojaie, A., and G. Michailidis (2010), Network enrichment analysis in complex experiments, Statistical Applications in Genetics and Molecular Biology, 9 (1).

Stocker, T. F., et al. (2013), Climate change 2013: The physical science basis. Contribution of Working Group I to the Fifth Assessment Report of the Intergovernmental Panel on Climate Change, Tech. rep., IPCC.

Subramanian, A., et al. (2005), Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles, Proceedings of the National Academy of Sciences of the United States of America, 102 (43), 15,545–15,550.

Wille, A., et al. (2004), Sparse graphical gaussian modeling of the isoprenoid gene network in arabidopsis thaliana, Genome Biol, 5 (11), R92.

Wilson, B. G., et al. (2010), Epigenetic antagonism between polycomb and swi/snf complexes during oncogenic transformation, Cancer cell, 18 (4), 316–328.

Yuan, M., and Y. Lin (2007), Model selection and estimation in the gaussian graphical model, Biometrika, 94 (1), 19–35.

Zaki, N., D. Efimov, and J. Berengueres (2013), Protein complex detection using interaction reliability assessment and weighted clustering coefficient, BMC Bioinformatics, 14 (1), 163, doi:10.1186/1471-2105-14-163.

Zhao, P., and B. Yu (2006), On model selection consistency of lasso, Journal of Machine Learning Research, 7, 2541–2563.

Zhou, L., A. Dai, Y. Dai, R. S. Vose, C.-Z. Zou, Y. Tian, and H. Chen (2009), Spatial dependence of diurnal temperature range trends on precipitation from 1950 to 2004, Climate Dynamics, 32 (2-3), 429–440, doi:10.1007/s00382-008-0387-5.

Zhou, S. (2010), Thresholded lasso for high dimensional variable selection and statistical estimation, arXiv preprint arXiv:1002.1583.

Zhou, S., P. Rütimann, M. Xu, and P. Bühlmann (2011), High-dimensional covariance estimation based on gaussian graphical models, The Journal of Machine Learning Research, 12, 2975–3026.

Zhu, Y., X. Shen, and W. Pan (2014), Structural pursuit over multiple undirected graphs, Journal of the American Statistical Association, 109 (508), 1683–1696.
