Statistical Methods for High-Dimensional Networked Data Analysis
by
Yan Zhou

A dissertation submitted in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
(Biostatistics)
in The University of Michigan
2015

Doctoral Committee:
    Professor Peter X. K. Song, Chair
    Professor Matthias Kretzler
    Assistant Professor Xiaoquan William Wen
    Professor Ji Zhu

© Yan Zhou 2015
All Rights Reserved

To my family

ACKNOWLEDGEMENTS

I would like to acknowledge many people for their help throughout my graduate study. I would first like to express my deepest gratitude to my advisor, Dr. Peter X.-K. Song, for sharing his knowledge, encouraging me to work hard and constantly improve my work, and giving me the freedom to explore a diverse set of projects. He is the one who first sparked my interest in networked data analysis. I not only respect his immense knowledge, motivation, and enthusiasm for research, but also appreciate his invaluable guidance and encouragement during many difficult moments in my research. I could not have imagined having a better advisor and mentor for my Ph.D. study.

I would also like to thank my other thesis committee members, Dr. Ji Zhu, Dr. Xiaoquan William Wen and Dr. Matthias Kretzler, for their help and advice, which made me more productive than I could have been on my own. The enlightening discussions I had with them have led to a better dissertation.

In addition, I would like to acknowledge my collaborators, Dr. Pei Wang, Dr. Betsy Lozoff and Dr. Fengji Geng, who offered me excellent opportunities and constructive suggestions for applying my statistical knowledge to real-world problems, which greatly strengthened the scientific background of my dissertation research.
I also want to take this opportunity to thank all the professors, staff, fellow graduate students and friends at the Department of Biostatistics, University of Michigan, for their support, encouragement and friendship.

Finally and most importantly, I would like to dedicate this thesis to my family for their unconditional support and love, which freed me from worry in my research endeavors.

TABLE OF CONTENTS

DEDICATION
ACKNOWLEDGEMENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF APPENDICES
ABSTRACT

CHAPTER

I. Introduction
    1.1 Overarching goals
    1.2 Project I: Construction of association maps
        1.2.1 Background
        1.2.2 Motivating data
        1.2.3 Outline of methodology development
    1.3 Project II: Reconstruction of gene regulatory networks
        1.3.1 Background
        1.3.2 Motivating data
        1.3.3 Outline of methodology development
    1.4 Project III: Regression analysis of networked data
        1.4.1 Background
        1.4.2 Motivating data
        1.4.3 Outline of methodology development
    1.5 Organization of the dissertation

II. Sparse multivariate factor analysis regression models and its application to high-throughput array data analysis
    2.1 Introduction
    2.2 Model
        2.2.1 Multivariate regression model
        2.2.2 Factor analysis model
        2.2.3 Multivariate factor analysis regression model
    2.3 Regularized Estimation
    2.4 EM-GCD Algorithm
        2.4.1 EM algorithm
        2.4.2 Group-wise coordinate descent (GCD) algorithm
        2.4.3 Tuning parameter selection
    2.5 Relationship to the existing methods
    2.6 Simulation Studies
        2.6.1 Simulation Setup
        2.6.2 Findings from Simulation Studies
    2.7 Application
    2.8 Discussion

III. Sparse structural factor equation model and its applications to the reconstruction of genetic regulatory networks
    3.1 Introduction
    3.2 Structural factor equation model
        3.2.1 Background and notation
        3.2.2 Structural factor equation model
        3.2.3 Graphical representation of SFEM
        3.2.4 Parameter identifiability in SFEM
    3.3 Penalized estimation
        3.3.1 Formulation
        3.3.2 EM-Coordinate-Descent Algorithm
        3.3.3 Tuning parameter selection
    3.4 Numerical Results
    3.5 Analysis of cell signaling data
    3.6 Confirmatory analysis of ADMG
        3.6.1 Formulation
        3.6.2 Some numerical results
    3.7 Discussion

IV. Regression analysis of networked data
    4.1 Introduction
    4.2 Framework
        4.2.1 Estimating functions
        4.2.2 Graphic interpretation to basis matrices
        4.2.3 Data-driven network topology
    4.3 New Methodology
        4.3.1 Hybrid quadratic inference function
        4.3.2 Asymptotic properties
        4.3.3 Choice of the shrinkage coefficient
    4.4 Simulation Experiment
        4.4.1 Networked continuous data
        4.4.2 Networked binary data
    4.5 Data Example: infant's memory ERP study
    4.6 Discussion

V. Summary and future work

APPENDICES
BIBLIOGRAPHY

LIST OF FIGURES

Figure

2.1 True association maps of Θ (connectivity vs. heatmap) for Simulation I, II, and III. (LHS: connectivity maps of Θ between genes (white) and biomarkers (black); RHS: corresponding heatmap of Θ.)

2.2 Breast cancer hormonal pathways are displayed by the heatmap of gene-factor loadings $|\hat{B}_{q,k}|$ after varimax rotation from the fitted smFARM model for the 654 selected genes and 5 factors.

3.1 Four examples of directed mixed graphs. The graph in (b) is cyclic, while all others are acyclic. The solid line indicates a directed edge and the dashed line denotes an undirected edge.
3.2 An example of SFEM for exploratory analysis: 3 common latent factors $z_1$, $z_2$ and $z_3$.

3.3 Topology of two simulated DAGs. (a) A small DAG with 50 nodes and 25 edges. (b) A large DAG with 200 nodes and 100 edges.

3.4 Simulation results for two DAG networks, where the x-axis is the total number of detected edges and the y-axis is the number of correctly identified edges. The vertical grey line corresponds to the number of true edges. (a) Simulation I: $P = 50$, $N = 25$, $K = 2$, $M = 25$, and KER = 2. (b) Simulation II: $P = 200$, $N = 100$, $K = 5$, $M = 100$, and KER = 5.

3.5 The consensus signaling network of 11 proteins.

3.6 The scree plot of eigenvalues.

3.7 The plot of the correct discovery, where the x-axis is the total number of detected edges and the y-axis is the number of correctly identified edges.

3.8 Causal interactions among 11 proteins of the signaling pathway: black represents TP, pink represents FN, and grey represents FP. In the right panel, $Z_1, \ldots, Z_4$ represent 4 common latent factors.

3.9 Graphical representation of SFEM for confirmatory analysis.

3.10 The network of an ADMG with 50 nodes and 50 edges. Among the 50 edges, 40 are directed edges shown in (a) and 10 are undirected edges induced by 10 augmented variables $Z_1, \ldots, Z_{10}$ shown in (b).

3.11 The plot of the correct discovery in the ADMG over 50 replicates.

4.1 Layout of the EGI 64-channel sensor net with 6 outlined clusters of nodes related to auditory recognition memory and 1 additional cluster of the remaining nodes.

4.2 Graphic representations of basis matrices $M_{\mathrm{comp}}$ ($M_1$), $M_{\mathrm{chain}}$ ($M_1^*$) and $M_2^*$ for a network of three nodes.

4.3 The histogram of $\hat{\gamma}^*$ over 500 simulations and the plot of the shrinkage coefficient selection norm $\eta_0(\gamma) = \mathrm{tr}\, J^{-1}(\beta_0 \mid \gamma; H^*, R(\alpha))$ versus $\gamma$ for the HQIF($H = H^*$, $\gamma = \hat{\gamma}^*$) method. The network structure of the 5-subregion $N_3$ ($R^a_{CL}$) is used with $H^* = H_{CL}$. The first three columns are the histograms of $\hat{\gamma}^*$ for $m = 50, 100, 150$ and $n = 50, 100, 500$, and the fourth column is the plot of $\eta_0(\gamma)$. In each subplot, the optimal value $\gamma_0^*$ is given by the vertical line.

4.4 The histogram of $\hat{\gamma}^*$ over 500 simulations and the plot of the shrinkage coefficient selection norm $\eta_0(\gamma) = \mathrm{tr}\, J^{-1}(\beta_0 \mid \gamma; H^*, R(\alpha))$ versus $\gamma$ for the HQIF method. The network structure of the 5-subregion $N_3$ ($R^b_{CL}$) is used with $H^* = H_{CL}$. The first three columns are the histograms of $\hat{\gamma}^*$ for $m = 50, 100, 150$ and $n = 50, 100, 500$, and the fourth column is the plot of $\eta_0(\gamma)$. In each subplot, the optimal value $\gamma_0^*$ is given by the vertical line.

4.5 SRE comparison for continuous data with the 5-subregion network $N_3$, where the reference level is the GEE oracle, the sample size is $n = 100, 500$ and the number of nodes varies over $m = 50, 100, 150$. Plots (a) and (b) are obtained from the network dependence structure $R^a_{CL}$, and plots (c) and (d) are obtained from the structure $R^b_{CL}$. The index of each line is defined as 1: GEE oracle; 2: HQIF($H = H_{CL}$, $\gamma = \hat{\gamma}^*$); 3: HQIF($\gamma = 0$); 4: HQIF($H = H_{CL}$, $\gamma = 1$); 5: HQIF($H = M_{\mathrm{chain}}$, $\gamma = 1$); 6: GEE independence; 7: HQIF($H = M_{\mathrm{comp}}$, $\gamma = 1$).

4.6 QQ-plots of the null distribution for HQIF ($H = H_{CL}$; $\gamma = 0, 0.25, 0.5, 0.75, 1$) in the 5-subregion network $N_3$ ($R^a_{CL}$) with $n = 50$. Top panel: $Q_n(\beta \mid \gamma)$ relative to $\chi^2_3$. Bottom panel: $Q_n(a_0, \tilde{\beta}_B \mid \gamma) - Q_n(\hat{\beta}_A, \hat{\beta}_B \mid \gamma)$ relative to $\chi^2_1$.

4.7 QQ-plots of the null distribution for HQIF ($H = H_{CL}$; $\gamma = 0, 0.25, 0.5, 0.75, 1$) in the 5-subregion network $N_3$ ($R^b_{CL}$) with $n = 500$.