Think Global, Act Local When Estimating a Sparse Precision Matrix
Think Global, Act Local When Estimating a Sparse Precision Matrix

by

Peter Alexander Lee
A.B., Harvard University (2007)

Submitted to the Sloan School of Management in partial fulfillment of the requirements for the degree of Master of Science in Operations Research at the Massachusetts Institute of Technology

June 2016

© Peter Alexander Lee, MMXVI. All rights reserved.

The author hereby grants to MIT permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created.

Author: Sloan School of Management, May 12, 2016
Certified by: Cynthia Rudin, Associate Professor, Thesis Supervisor
Accepted by: Patrick Jaillet, Dugald C. Jackson Professor, Department of Electrical Engineering and Computer Science, and Co-Director, Operations Research Center

Think Global, Act Local When Estimating a Sparse Precision Matrix
by Peter Alexander Lee

Submitted to the Sloan School of Management on May 12, 2016, in partial fulfillment of the requirements for the degree of Master of Science in Operations Research

Abstract

Substantial progress has been made in the estimation of sparse high-dimensional precision matrices from scant datasets. This is important because precision matrices underpin common tasks such as regression, discriminant analysis, and portfolio optimization. However, few good algorithms for this task exist outside the space of L1-penalized optimization approaches like GLASSO. This thesis introduces LGM, a new algorithm for the estimation of sparse high-dimensional precision matrices. Using the framework of probabilistic graphical models, the algorithm performs robust covariance estimation to generate potentials for small cliques and fuses the local structures to form a sparse yet globally robust model of the entire distribution. Identification of appropriate local structures is done through stochastic discrete optimization. The algorithm is implemented in Matlab and benchmarked against competitor algorithms for an array of synthetic datasets. Simulation results suggest that LGM may outperform GLASSO when model sparsity is especially important and when variables in the dataset belong to a number of closely related (if unknown) groups.

Thesis Supervisor: Cynthia Rudin
Title: Associate Professor
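As a concrete picture of the fusion step described in the abstract, the display below gives the standard precision-matrix identity for decomposable Gaussian graphical models. It is an illustration of the kind of assembly involved rather than the thesis's own construction, which is developed in chapter 4; the notation (cliques $\mathcal{C}$, separators $\mathcal{S}$, zero-padding operator $[\,\cdot\,]^{0}$, local estimates $\widehat{\Sigma}_C$) is introduced here only for the example. If a junction forest over the $p$ variables has cliques $\mathcal{C}$ and separators $\mathcal{S}$, the fused precision matrix is

\[
\widehat{K} \;=\; \sum_{C \in \mathcal{C}} \big[(\widehat{\Sigma}_C)^{-1}\big]^{0}
\;-\; \sum_{S \in \mathcal{S}} \big[(\widehat{\Sigma}_S)^{-1}\big]^{0},
\]

where $[\,\cdot\,]^{0}$ zero-pads a clique or separator submatrix to the full $p \times p$ dimension. Any entry of $\widehat{K}$ whose two variables never share a clique is exactly zero, which is why a model assembled from small local pieces is globally sparse.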
Dedication

To my parents, Michael and Linda Lee.

Acknowledgments

My supervisor, Cynthia Rudin, provided much-appreciated encouragement and guidance. Her suggestions regarding both structuring and describing a research project were invaluable, and her edits to the text of this thesis were gratefully received. Patrick Jaillet and David Gamarnik taught (and I took) an engaging class called Network Science and Models. My final project for that class became the starting point for this research. Andrew Lo has been a supportive mentor for almost a decade. He has guided me to and through a range of research opportunities, and he encouraged my interest in studying at MIT. Finally, my parents have helped and encouraged me in too many ways to count.

Contents

1 Introduction
2 Refresher on Basic Concepts
  2.1 Information Theory
    2.1.1 Entropy
    2.1.2 Mutual Information
    2.1.3 Kullback-Leibler Divergence
  2.2 Multivariate Gaussian Distribution
    2.2.1 Covariance and Precision Matrices
    2.2.2 Conditional Independence
    2.2.3 Unconditional Independence
  2.3 Probabilistic Graphical Models
3 Prior Work
  3.1 Methods for Estimating Sparse Precision Matrices
  3.2 Trees
  3.3 Quantifying Differences
4 LGM Algorithm
  4.1 Overview
  4.2 Local Models
    4.2.1 Variance Inflation
    4.2.2 Correlation Deflation
    4.2.3 Calibrating Correlation Deflation
    4.2.4 Assembling the Local Model
    4.2.5 Discussion
  4.3 Global Structure
    4.3.1 Constructing a Global Model
  4.4 Choosing a Collection of Local Models
    4.4.1 Key Principles
    4.4.2 Algorithm Overview
    4.4.3 Local Changes to the Junction Forest
    4.4.4 Making Local Changes
  4.5 Pre-screening
  4.6 Data Structure
5 LGM Performance
  5.1 Experimental Design
    5.1.1 Data
    5.1.2 Metrics
    5.1.3 Estimation Algorithms
  5.2 Algorithm Performance
    5.2.1 Model Accuracy
    5.2.2 Sensitivity to Dimension
    5.2.3 Computation Time
    5.2.4 LGM vs. GLASSO: Sparsity Case Studies
6 LGM-R: A Sparse Regression Algorithm
  6.1 The LGM-R Algorithm
  6.2 Evaluation Methodology
  6.3 Conventional Algorithms for Sparse Linear Modeling
  6.4 Results
  6.5 Discussion
7 Portfolio Optimization
  7.1 Experimental Design
  7.2 Simulated Portfolio Performance
8 Conclusion
A Tables
B Figures

List of Figures

4-1 Heatmaps illustrate optimal shrinkage parameters calibrated based on 50 samples of 100-dimensional training data from six factor structures (the structures are described in chapter 5). High values indicate that little shrinkage is required and the local model will very nearly use the maximum likelihood estimates. Low values near zero mean that the data will be almost entirely discounted and variables in the local model will be assumed to be nearly independent. For each factor set, the shrinkage parameter varies as a function of clique size (y-axis) and the entropy (x-axis) of the Gaussian distribution parameterized by the maximum likelihood correlation matrix associated with the variables in the clique.

5-1 Correlation structures of datasets with dimension 100. In the left column are test set correlations. In the center column are training set correlations. On the right is the correlation matrix associated with a precision matrix generated by LGM based on the training data. The test set consists of 100,000 samples. The training set consists of 50 samples. Note that the heatmap color ranges vary from image to image; most notably, the constant correlation matrix has off-diagonal elements near 0.5, while the independent correlation matrix has off-diagonal elements near 0.

5-2 The ratio of approximating models' Kullback-Leibler divergence from the true distribution is shown on the y-axis. As the data dimension (x-axis) increases, this ratio tends to increase as well. The ratio increases at different rates for different algorithms.

5-3 The ratio of approximating models' Kullback-Leibler divergence from the true distribution is shown on the y-axis. As the data dimension (x-axis) increases, this ratio tends to increase as well. The ratio increases at different rates for different algorithms.

5-4 Modeling characteristics, adjusted for dimension. A characteristic that scales linearly with the dimension of the dataset would be shown as a horizontal line (constant proportion). The Kullback-Leibler divergence per dimension is shown in orange. For comparison, the divergence for two runs of the diagonal model (zero complexity) is shown in blue and yellow. In blue is the computation time per dimension.
5-5 Computation time increased as the dimension of the dataset increased. The average computation time was taken over three runs of each algorithm at each dimension, once over the 'markovMulti' factor structure, once over the 'constantCorrelation' factor structure, and once over the 'tenStrong' factor structure. Each dataset consisted of 100 samples. The simple heuristic methods are seen to be much faster than LGM. The cross-validated GLASSO implementation used in this study was faster than LGM for medium dimensions but slower than LGM when the dimension became quite large.

5-6 For each factor structure, 50 samples of dimension 100 were modeled using LGM and using GLASSO with the complexity parameter set to a range of values. The resulting complexities and Kullback-Leibler divergences are plotted. The LGM outcome is plotted in green, while the frontier of GLASSO outcomes is plotted in blue.

6-1 Average goodness of fit of linear models produced by each of five methodologies. Sample size varies along the x-axis; LGM-R has better performance for small sample sizes but typical performance when the sample size reaches the dimensionality of the data (namely, 100).

6-2 The average of the R² values of the hundred models in each scenario with sample size 20 is shown in the top panel. Those of scenarios with sample size 100 are shown in the bottom panel. LGM-R models based on the LGM algorithm are shown in light blue. LGM-R outperforms in most scenarios when the training set consists of just 20 samples. It outperforms in fewer scenarios when the training set consists of 100 samples.

6-3 All methodologies produce models with more consistent goodness of fit as more data is made available (x-axis). LGM-R generates models with the most consistent performance (smallest average standard deviation of R²).

6-4 Average number of non-zero coefficients used in linear models produced by each of five methodologies. Sample size varies along the x-axis; LGM-R is the light blue line. All methodologies tend to employ richer models (more factors) as more data is made available.

7-1 Simulated mean-variance optimized portfolio returns, averaged over six repetitions of each experiment. A high Sharpe ratio corresponds to high risk-adjusted returns. Error bars show the standard error of the estimates.

B-1 For each factor structure, 50 samples of dimension 100 were modeled using LGM and using GLASSO with the complexity parameter set to a range of values. The resulting complexities and Kullback-Leibler divergences are plotted. The LGM outcome is plotted in green, while the frontier of GLASSO outcomes is plotted in blue.
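Several captions above evaluate models by their Kullback-Leibler divergence from the true distribution; the metric used in the experiments is defined in chapter 5. As a point of reference only, here is a minimal Matlab sketch, not taken from the thesis code (the function name gauss_kl and the zero-mean assumption are introduced for this example), of the closed-form divergence D(N(0, Sigma0) || N(0, Sigma1)) between zero-mean Gaussians with covariances Sigma0 (true) and Sigma1 (approximating):

    % Minimal illustrative sketch, assuming zero-mean Gaussians; not the
    % thesis implementation. Computes D( N(0,Sigma0) || N(0,Sigma1) ).
    function d = gauss_kl(Sigma0, Sigma1)
        p = size(Sigma0, 1);
        % Cholesky factors give numerically stable log-determinants.
        L0 = chol(Sigma0, 'lower');
        L1 = chol(Sigma1, 'lower');
        logdet0 = 2 * sum(log(diag(L0)));
        logdet1 = 2 * sum(log(diag(L1)));
        trTerm  = trace(Sigma1 \ Sigma0);   % tr(inv(Sigma1) * Sigma0)
        d = 0.5 * (trTerm + logdet1 - logdet0 - p);
    end

For instance, gauss_kl(SigmaTrue, inv(Khat)) measures how far the Gaussian implied by an estimated precision matrix Khat sits from a known true covariance SigmaTrue.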