Think Global, Act Local When Estimating a Sparse Precision Matrix

by

Peter Alexander Lee

A.B., Harvard University (2007)

Submitted to the Sloan School of Management in partial fulfillment of the requirements for the degree of Master of Science in Operations Research at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2016

© Peter Alexander Lee, MMXVI. All rights reserved.

The author hereby grants to MIT permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created.

Author: Sloan School of Management, May 12, 2016

Certified by: Cynthia Rudin, Associate Professor, Thesis Supervisor

Accepted by: Patrick Jaillet, Dugald C. Jackson Professor, Department of Electrical Engineering and Computer Science, and Co-Director, Operations Research Center

Think Global, Act Local When Estimating a Sparse Precision Matrix

by Peter Alexander Lee

Submitted to the Sloan School of Management on May 12, 2016, in partial fulfillment of the requirements for the degree of Master of Science in Operations Research

Abstract

Substantial progress has been made in the estimation of sparse high dimensional precision matrices from scant datasets. This is important because precision matrices underpin common tasks such as regression, discriminant analysis, and portfolio optimization. However, few good algorithms for this task exist outside the space of $L_1$ penalized optimization approaches like GLASSO. This thesis introduces LGM, a new algorithm for the estimation of sparse high dimensional precision matrices. Using the framework of probabilistic graphical models, the algorithm performs robust covariance estimation to generate potentials for small cliques and fuses the local structures to form a sparse yet globally robust model of the entire distribution. Identification of appropriate local structures is done through stochastic discrete optimization. The algorithm is implemented in Matlab and benchmarked against competitor algorithms for an array of synthetic datasets. Simulation results suggest that LGM may outperform GLASSO when model sparsity is especially important and when variables in the dataset belong to a number of closely related (if unknown) groups.

Thesis Supervisor: Cynthia Rudin Title: Associate Professor

Dedication

To my parents, Michael and Linda Lee.

Acknowledgments

My supervisor, Cynthia Rudin, provided much-appreciated encouragement and guidance. Her suggestions regarding both structuring and describing a research project were invaluable, and her edits to the text of this thesis were gratefully received. Patrick Jaillet and David Gamarnik taught (and I took) an engaging class called Network Science and Models. My final project for that class became the starting point for this research. Andrew Lo has been a supportive mentor for almost a decade. He has guided me to and through a range of research opportunities, and he encouraged my interest in studying at MIT. Finally, my parents have helped and encouraged me in too many ways to count.

Contents

1 Introduction 19

2 Refresher on Basic Concepts 23
2.1 Information Theory ...... 23
2.1.1 Entropy ...... 23
2.1.2 Mutual Information ...... 23
2.1.3 Kullback-Leibler Divergence ...... 24
2.2 Multivariate Gaussian Distribution ...... 24
2.2.1 Covariance and Precision Matrices ...... 24
2.2.2 Conditional Independence ...... 25
2.2.3 Unconditional Independence ...... 25
2.3 Probabilistic Graphical Models ...... 25

3 Prior Work 27
3.1 Methods for Estimating Sparse Precision Matrices ...... 27
3.2 Trees ...... 28
3.3 Quantifying Differences ...... 28

4 LGM Algorithm 31
4.1 Overview ...... 31
4.2 Local Models ...... 32
4.2.1 Variance Inflation ...... 32
4.2.2 Correlation Deflation ...... 33
4.2.3 Calibrating Correlation Deflation ...... 34
4.2.4 Assembling the Local Model ...... 36
4.2.5 Discussion ...... 37
4.3 Global Structure ...... 37
4.3.1 Constructing a Global Model ...... 37
4.4 Choosing a Collection of Local Models ...... 40
4.4.1 Key Principles ...... 41
4.4.2 Algorithm Overview ...... 42
4.4.3 Local Changes to the Junction Forest ...... 43
4.4.4 Making Local Changes ...... 45
4.5 Pre-screening ...... 46
4.6 Data Structure ...... 47

5 LGM Performance 49
5.1 Experimental Design ...... 50
5.1.1 Data ...... 50
5.1.2 Metrics ...... 52
5.1.3 Estimation Algorithms ...... 55
5.2 Algorithm Performance ...... 56
5.2.1 Model Accuracy ...... 56
5.2.2 Sensitivity to Dimension ...... 58
5.2.3 Computation Time ...... 64
5.2.4 LGM vs. GLASSO: Sparsity Case Studies ...... 66

6 LGM-R: A Sparse Regression Algorithm 69
6.1 The LGM-R Algorithm ...... 70
6.2 Evaluation Methodology ...... 71
6.3 Conventional Algorithms for Sparse Linear Modeling ...... 72
6.4 Results ...... 73
6.5 Discussion ...... 78

7 Portfolio Optimization 79
7.1 Experimental Design ...... 80
7.2 Simulated Portfolio Performance ...... 81

8 Conclusion 85

A Tables 87

B Figures 97

List of Figures

4-1 Heatmaps illustrate optimal shrinkage parameters calibrated based on 50 samples of 100 dimensional training data from six factor structures (the structures are described in chapter 5). High values indicate that little shrinkage is required and the local model will very nearly use the maximum likelihood estimates. Low values near zero mean that the data will be almost entirely discounted and variables in the local model will be assumed to be nearly independent. For each factor set, the shrinkage parameter varies as a function of clique size (y-axis) and the entropy (x-axis) of the Gaussian distribution parameterized by the maximum likelihood correlation matrix associated with the variables in the clique...... 35

5-1 Correlation structures of datasets with dimension 100. In the left column are test set correlations. In the center column are training set correlations. On the right is the correlation matrix associated with a precision matrix generated by LGM based on the training data. The test set consists of 100,000 samples. The training set consists of 50 samples. As a note, the heatmap color ranges vary from image to image; most notably, the constant correlation matrix has off-diagonal elements near 0.5, while the independent correlation matrix has off-diagonal elements near 0. ...... 53

5-2 The ratio of approximating models' Kullback-Leibler divergence from the true distribution is shown on the y-axis. As the data dimension (x-axis) increases, this ratio tends to increase as well. The ratio increases at different rates for different algorithms. ...... 60

5-3 The ratio of approximating models' Kullback-Leibler divergence from the true distribution is shown on the y-axis. As the data dimension (x-axis) increases, this ratio tends to increase as well. The ratio increases at different rates for different algorithms. ...... 62

5-4 Modeling characteristics, adjusted for dimension. A characteristic that scales linearly with the dimension of the dataset would be shown as a horizontal line (constant proportion). The Kullback-Leibler divergence per dimension is shown in orange. For comparison, the divergence for two runs of the diagonal model (zero complexity) are shown in blue and yellow. In blue is the computation time per dimension. ...... 63

5-5 Computation time increased as the dimension of the dataset increased. The average computation time was taken over three runs of each algo- rithm at each dimension, once over the ‘markovMulti’ factor structure, once over the ‘constantCorrelation’ factor structure, and once over the ‘tenStrong’ factor structure. Each dataset consisted of 100 samples. The simple heuristic methods are seen to be much faster than LGM. The cross-validated GLASSO implementation used in this study was faster than LGM for medium dimensions but slower than LGM when the dimension became quite large...... 65

5-6 For each factor structure, 50 samples of dimension 100 were modeled using LGM and using GLASSO with the complexity parameter set to a range of values. The resulting complexities and Kullback-Leibler divergences are plotted. The LGM outcome is plotted in green, while the frontier of GLASSO outcomes is plotted in blue. ...... 67

6-1 Average goodness of fit of linear models produced by each of five methodologies. Sample size varies along the x-axis; LGM-R has better performance for small sample size but typical performance when the sample reaches the dimensionality of the data (namely, 100). ...... 74
6-2 The average of the $R^2$ values of the hundred models in each scenario with sample size 20 are shown in the top panel. Those of scenarios with sample size 100 are shown in the bottom panel. LGM-R models based on the LGM algorithm are shown in light blue. LGM-R outperforms in most scenarios when the training set consists of just 20 samples. It outperforms in fewer scenarios when the training set consists of 100 samples. ...... 75
6-3 All methodologies produce models with more consistent goodness of fit as more data is made available (x-axis). LGM-R generates models with the most consistent performance (smallest average standard deviation of $R^2$). ...... 76
6-4 Average number of non-zero coefficients used in linear models produced by each of five methodologies. Sample size varies along the x-axis; LGM-R is the light blue line. All methodologies tend to employ richer models (more factors) as more data is made available. ...... 77

7-1 Simulated mean variance optimized portfolio returns, averaged over six repetitions of each experiment. A high Sharpe ratio corresponds to high risk-adjusted returns. Error bars show the standard error of the estimates...... 82

B-1 For each factor structure, 50 samples of dimension 100 were modeled using LGM and using GLASSO with the complexity parameter set to a range of values. The resulting complexities and Kullback-Leibler divergences are plotted. The LGM outcome is plotted in green, while the frontier of GLASSO outcomes is plotted in blue. ...... 98

List of Tables

5.1 Empirical Kullback-Leibler divergence between data generating processes and models of those processes based on 25 samples, averaged over 5 iterations. Results are shown for problem dimensionalities ranging from 25 to 100 and for six factor structures. For each scenario, six algorithms were given an opportunity to construct an approximating model. ...... 57

5.2 Empirical Kullback-Leibler divergence between data generating processes and models of those processes based on 100 samples. Results are shown for problem dimensionalities ranging from 50 to 500 and for three factor structures. For each scenario, six algorithms were given an opportunity to construct an approximating model. ...... 59

A.1 Simulation results for scenarios of 25 variables with 25 samples. Included are the average log likelihood of the test set under each of the approximating models, the computation time, the entropy of the approximating model, the entropy of the distribution of the test set, the Kullback-Leibler divergence from the test set to the approximating distribution, the number of conditional dependencies (a measure of model complexity) in each approximating model, the (somewhat arbitrary) number of 'top' dependencies chosen based on the test set, and the precision and recall of the approximating models when they are viewed as algorithms that classify each dependency as existing or not. 88

A.2 Simulation results for scenarios of 50 variables with 25 samples. Included are the average log likelihood of the test set under each of the approximating models, the computation time, the entropy of the approximating model, the entropy of the distribution of the test set, the Kullback-Leibler divergence from the test set to the approximating distribution, the number of conditional dependencies (a measure of model complexity) in each approximating model, the (somewhat arbitrary) number of 'top' dependencies chosen based on the test set, and the precision and recall of the approximating models when they are viewed as algorithms that classify each dependency as existing or not. 89

A.3 Simulation results for scenarios of 75 variables with 25 samples. Included are the average log likelihood of the test set under each of the approximating models, the computation time, the entropy of the approximating model, the entropy of the distribution of the test set, the Kullback-Leibler divergence from the test set to the approximating distribution, the number of conditional dependencies (a measure of model complexity) in each approximating model, the (somewhat arbitrary) number of 'top' dependencies chosen based on the test set, and the precision and recall of the approximating models when they are viewed as algorithms that classify each dependency as existing or not. 90

A.4 Simulation results for scenarios of 100 variables with 25 samples. Included are the average log likelihood of the test set under each of the approximating models, the computation time, the entropy of the approximating model, the entropy of the distribution of the test set, the Kullback-Leibler divergence from the test set to the approximating distribution, the number of conditional dependencies (a measure of model complexity) in each approximating model, the (somewhat arbitrary) number of 'top' dependencies chosen based on the test set, and the precision and recall of the approximating models when they are viewed as algorithms that classify each dependency as existing or not. 91

A.5 Simulation runtimes for all algorithms on scenarios of 50, 100, 250, and 500 variables, each with 100 samples, for the tenStrong, constantCorrelation, and markovMulti factor structures. ...... 92

A.6 Simulation results for LGM on scenarios of 25 to 1,500 variables, each with 50 samples. Included are the average log likelihood of the test set under each of the approximating models, the computation time, the entropy of the approximating model, the entropy of the distribution of the test set, the Kullback-Leibler divergence from the test set to the approximating distribution, the number of conditional dependencies (a measure of model complexity) in each approximating model, the (somewhat arbitrary) number of 'top' dependencies chosen based on the test set, and the precision and recall of the approximating models when they are viewed as algorithms that classify each dependency as existing or not. ...... 93

A.7 Averages of out-of-sample linear model $R^2$ values. As a basis for these results, 30 datasets of dimension 100 were generated. The datasets were distinguished by having one of six factor structures and one of five sample sizes. For each dataset, a sparse linear model was generated using five algorithms and the out-of-sample $R^2$ was computed using a test set of 100,000 samples drawn from the same distribution. The average of the 100 $R^2$ values for each dataset and estimation algorithm are reported here. ...... 94

A.8 Standard deviations of out-of-sample linear model $R^2$ values. As a basis for these results, 30 datasets of dimension 100 were generated. The datasets were distinguished by having one of six factor structures and one of five sample sizes. For each dataset, a sparse linear model was generated using five algorithms and the out-of-sample $R^2$ was computed using a test set of 100,000 samples drawn from the same distribution. The standard deviation of the 100 $R^2$ values for each dataset and estimation algorithm are reported here. ...... 95

A.9 Average number of covariates used in linear models. As a basis for these results, 30 datasets of dimension 100 were generated. The datasets were distinguished by having one of six factor structures and one of five sample sizes. For each dataset, a sparse linear model was generated using five algorithms. The average number of factors used in the models generated by each algorithm are reported here. ...... 96

Chapter 1

Introduction

Many important algorithms fail when applied to high dimensional datasets comprised of relatively few samples. One class of such algorithms includes linear regression, mean-variance portfolio optimization, and discriminant analysis. These algorithms share a common weakness: their successful operation is predicated on a precision matrix that must be representative of an underlying jointly Gaussian-distributed population. The conventional way to estimate this precision matrix is to calculate an empirical covariance matrix based on sample data and invert it. But this approach is very sensitive to noise, so for small sample sizes the resulting precision matrix is often unrepresentative of the underlying data. Furthermore, it is well known that when the sample size is smaller than the dimension of the data, the empirical covariance matrix is not invertible. Without a precision matrix that is representative of the underlying data, these algorithms are ineffective or inoperative in most practical applications.

Researchers have proposed an array of robust solutions to these specific problems. For example, in the space of robust regression, Hoerl and Kennard proposed ridge regression in 1970 [7], Tibshirani proposed the LASSO in 1997 [14], and Zou and Hastie proposed the Elastic Net in 2005 [15]. In the space of portfolio optimization, Michaud highlighted shortcomings of mean variance optimization in the context of noisy inputs in 1989 [12]. Later, in 2003, Ledoit and Wolf [10] argued for the use of covariance shrinkage in portfolio optimization. Meanwhile, Cai and Liu (2013) introduced a sparse approach to discriminant analysis. But while the solutions in each domain vary widely, the root problem remains the same: bad precision matrices.

In recent years, researchers have begun to tackle the problem of bad precision matrices head on by designing algorithms that produce precision matrices that are more representative of the underlying data. These precision matrices can then be 'dropped' into the relevant algorithm, whether regression, discriminant analysis, or portfolio optimization, and the algorithm operates in the normal fashion. For example, in 2015 Goto and Xu [6] proposed dropping a purposefully sparse estimate of the inverse covariance matrix into mean variance optimization in an effort to improve the outcome. Meaningful progress in the estimation and use of precision matrices has been made during the recent decade, but there is still room for improvement, particularly for high dimensional datasets.

The primary contribution of this thesis is a new algorithm for estimating high dimensional precision matrices. The algorithm produces results that are generally sparser and thus more interpretable than those resulting from the popular GLASSO algorithm and some other competitors. The new algorithm is benchmarked against an array of competing algorithms; as measured by the resulting models' Kullback-Leibler divergence, for some datasets its performance is better than that of GLASSO.

A secondary contribution of this thesis is to illustrate how improvements to precision matrix estimation can be simply converted into improvements in the class of workhorse algorithms described above. Specifically, we demonstrate how improved precision matrix estimates can be used in regression modeling and in portfolio construction. We show that in some scenarios these approaches outperform more conventional alternatives such as the LASSO.

The new method for estimating precision matrices introduced here integrates local and global information about a sample distribution and is accordingly dubbed LGM, or the 'Local Global Model.' Because LGM relies on certain characteristics of the exponential family of probability distributions and on fundamental principles of probabilistic graphical modeling, it should be generalizable to certain other multivariate distributions and, consequently, may also enable improvements to widely-used algorithms in scenarios involving categorical data.

The remainder of this thesis is organized as follows. Chapter 2 provides a refresher on certain key concepts used in the LGM algorithm. Chapter 3 describes prior work. Chapter 4 describes the LGM algorithm. Chapter 5 evaluates the performance of LGM and benchmarks it against an array of alternative algorithms. Chapter 6 introduces LGM-R, a sparse regression algorithm based on LGM, and shows that in certain scenarios it can outperform the LASSO and Elastic Net algorithms. Chapter 7 compares the efficacy of using LGM, GLASSO, and other algorithms for generating precision matrices for mean variance optimization. Chapter 8 concludes.

Chapter 2

Refresher on Basic Concepts

2.1 Information Theory

A clear explanation of the basic concepts from information theory used here can be found in a textbook by Cover and Thomas [2]. For convenience, three essential concepts are summarized here (loosely following the notation of Cover and Thomas):

2.1.1 Entropy

Entropy can be viewed as a scalar measure of the amount of uncertainty associated with a set of random variables. When very little is known about the variables, the entropy is high, but when their values are known almost precisely, entropy is low. More concretely, entropy is the negative average log likelihood of a random variable or set of random variables, $X$, whose distribution is $p(x)$:

$H(X) = -E_p[\log p(x)]$.
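For the mean-zero multivariate Gaussian distributions used throughout this thesis, entropy has a closed form in terms of the covariance matrix, $H = \frac{1}{2}\log\big((2\pi e)^k |\Sigma|\big)$. The following Matlab sketch computes it; the function name and the use of a Cholesky factor for the log-determinant are illustrative implementation choices, not part of the thesis code.

% Differential entropy (in nats) of a k-dimensional Gaussian N(0, Sigma).
% Assumes Sigma is positive definite; the Cholesky factor gives a stable
% log-determinant.
function H = gaussEntropy(Sigma)
    k      = size(Sigma, 1);
    logdet = 2 * sum(log(diag(chol(Sigma))));
    H      = 0.5 * (k * log(2 * pi * exp(1)) + logdet);
end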

2.1.2 Mutual Information

Mutual information can be viewed as a scalar measure of the strength of the relationships among a set of random variables. When knowing the values of some of the random variables gives you a good sense of the likely values of the other random variables, mutual information is high. Conversely, if the random variables are independent, their mutual information is zero. One way to calculate mutual information is as the difference between the sum of the entropies of the marginal distributions and the entropy of the joint distribution of a set of variables. More concretely, the mutual information between two random variables $X$ and $Y$ is

$I(X;Y) = H(X) + H(Y) - H(X,Y)$.

2.1.3 Kullback-Leibler Divergence

Kullback-Leibler divergence can be viewed as a measure of how different two probability distributions are. Often it is used to quantify how different an approximating distribution $q(x)$ is relative to a true distribution $p(x)$. The true distribution, $p(x)$, can be expected to do a better job anticipating the contents of a dataset generated according to $p$. The amount by which $q(x)$ should be expected to underperform can be quantified as the amount by which the expected average log likelihood of the data under the model $p$ exceeds the expected average log likelihood of the data under model $q$:

$\mathrm{KLD} = D(p\|q) = E_p[\log p(x)] - E_p[\log q(x)]$
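For two mean-zero Gaussians $p = N(0, \Sigma_p)$ and $q = N(0, \Sigma_q)$, this divergence has the closed form $D(p\|q) = \frac{1}{2}\big(\mathrm{tr}(\Sigma_q^{-1}\Sigma_p) - k + \log|\Sigma_q| - \log|\Sigma_p|\big)$. A minimal Matlab sketch follows; the thesis reports empirical divergences computed against large test sets, so this is the population analogue, and the function name is illustrative.

% Kullback-Leibler divergence D(p||q) between mean-zero Gaussians
% p = N(0, SigmaP) and q = N(0, SigmaQ); both inputs are assumed to be
% positive definite covariance matrices.
function d = gaussKL(SigmaP, SigmaQ)
    k       = size(SigmaP, 1);
    logdetP = 2 * sum(log(diag(chol(SigmaP))));
    logdetQ = 2 * sum(log(diag(chol(SigmaQ))));
    d       = 0.5 * (trace(SigmaQ \ SigmaP) - k + logdetQ - logdetP);
end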

2.2 Multivariate Gaussian Distribution

The multivariate Gaussian is among the most widely used probability distributions. Among its interesting properties is the correspondence between its parameters and the dependence or independence of the variables being modeled as jointly Gaussian.

2.2.1 Covariance and Precision Matrices

The second central moment of the multivariate Gaussian distribution is the well-known covariance matrix, $\Sigma$. Given jointly Gaussian random variables $x_i$ and $x_j$, the $(i,j)^{th}$ entry in $\Sigma$ is $E[(x_i - E[x_i])(x_j - E[x_j])]$. Given this straightforward identity, it is popular to estimate the covariance matrix directly from data. But while the covariance matrix is easy to estimate, it is its inverse that is used in the distribution's probability density function:

$p(x) = \sqrt{\frac{|\Lambda|}{(2\pi)^k}} \, e^{-\frac{1}{2}(x-\mu)^T \Lambda (x-\mu)}$, where $\Lambda = \Sigma^{-1}$. A final important note is that the covariance matrix (and therefore also its inverse, the precision matrix) must be positive definite to constitute valid parameters of a Gaussian distribution.

2.2.2 Conditional Independence

Off-diagonal elements of the precision matrix can be exactly zero. When the $(i,j)$ element of the precision matrix is zero, $x_i$ and $x_j$ are conditionally independent of each other. That means that, if one knows the values of all variables except $x_i$ and $x_j$, observing $x_i$ will yield no further information about $x_j$, and observing $x_j$ will yield no further information about $x_i$ (note that the precision matrix is symmetric).
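A small numerical illustration of this point, using an AR(1)-style three-variable covariance (the specific correlation value is arbitrary): the precision matrix is tridiagonal, so $x_1$ and $x_3$ are conditionally independent given $x_2$ even though they are unconditionally correlated.

% Conditional independence read off the precision matrix. For an AR(1)
% covariance (entries rho^|i-j|), the inverse is tridiagonal, so the (1,3)
% entry is (numerically) zero and the partial correlation of x1 and x3
% given x2 vanishes.
rho     = 0.8;
Sigma   = [1 rho rho^2; rho 1 rho; rho^2 rho 1];
Lambda  = inv(Sigma);
pcorr13 = -Lambda(1,3) / sqrt(Lambda(1,1) * Lambda(3,3));   % approximately 0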

2.2.3 Unconditional Independence

Off-diagonal elements of the covariance matrix can also be exactly zero. When the $(i,j)$ element of the covariance matrix is zero, $x_i$ and $x_j$ are unconditionally independent of each other. That means that observing $x_i$ alone will yield no information about $x_j$, and observing $x_j$ alone will yield no information about $x_i$ (note that the covariance matrix is symmetric).

2.3 Probabilistic Graphical Models

Probabilistic graphical models are an intuitive way to represent the relationships among variables in a multivariate probability distribution. Variables can be represented as nodes, and conditional dependencies (relationships) can be represented as edges. Together, the nodes and edges form a graph that illustrates the structure of the distribution. Potential functions (non-negative functions of the variables) can be associated with the cliques of nodes in the graph. The ensemble of all the potential functions in a graph can specify a probability distribution. A good reference for this topic is [9].

Chapter 3

Prior Work

The fundamental importance of the precision and covariance matrices in data analysis has motivated a wealth of research from a number of perspectives. While far from an exhaustive review of research in the area, the sections below highlight some key developments.

3.1 Methods for Estimating Sparse Precision Matrices

In 1968, Chow and Liu proposed using Kruskal's algorithm to create a pairwise factorization of multivariate distributions [1]. While computationally efficient and guaranteed to generate a sparse distribution, the approach was not designed to adapt to the structure of a given problem, which might have a more- or less-dense dependence structure than the one implied by the pairwise factorization. A few years later, in 1972, Dempster [3] proposed using the properties of exponential distributions as an aid to estimating covariance matrices with sparse inverses. More recently, Friedman, Hastie, and Tibshirani proposed the GLASSO algorithm. Like the LASSO algorithm to which it is closely related, GLASSO uses an $L_1$ penalty to create sparsity and has generated strong interest in the research community, in part because it can run fairly quickly on high dimensional datasets. GLASSO has become a popular approach to covariance and precision matrix estimation, and a number of variants have been proposed.

In finance, practitioners sometimes take a more straightforward approach and 'shrink' an empirical matrix toward a simpler, more structured one; in this vein, in 2004 Ledoit and Wolf [10] described a technique for shrinking a matrix toward one that incorporates dependence information from a low-dimensional factor model. Other methods such as covariance or eigenvalue thresholding have also been developed. Covariance thresholding has attractive asymptotic properties but, for finite sample size, care must be taken that the matrix remains positive definite [5].

3.2 Trees

An important class of algorithms for covariance or precision matrix estimation takes the perspective of probabilistic graphical models. Indeed, this is the spirit of Chow and Liu's seminal work. More recent research has built on that sturdy foundation. For example, the question of how one may construct a more general probabilistic graphical model through a stepwise process was addressed, among other places, by Deshpande et al. in 2001. More recently, Szantai and Kovacs [13] explored regularly-structured 'cherry trees,' and Liu and Lafferty in 2014 [11] proposed an algorithm for non-parametric 'blossom trees.'

3.3 Quantifying Differences

It is not necessarily obvious what should constitute a 'better' precision or covariance matrix. In a useful review of recent work [5], Fan et al. note that researchers have sometimes sought to minimize the difference between true and approximate covariance matrices using the Frobenius norm. Another common approach (GLASSO [8] and variants) is to build models that maximize the $L_1$ penalized in-sample likelihood.

Likelihood maximization is closely related to Kullback-Leibler divergence minimization; when we think about the Gaussian distribution associated with a precision or covariance matrix, the Kullback-Leibler divergence is a natural quantification of the distance between the true distribution associated with the true matrix and an approximate distribution associated with the approximate matrix. Thus it is a natural summary of the differences between two precision or covariance matrices (when we assume that the real and approximate distributions have the same mean).

Of course, the calculation of Kullback-Leibler divergence is predicated on knowledge of some 'true' distribution which, as Akaike points out in a thoughtful essay [4], often either is not known or is knowable only in a subjective sense. Nonetheless, he describes a 'predictive point of view [that] generalizes the concept of estimation from that of a parameter to that of the distribution of a future observation... The basic criterion in this generalized theory of estimation is then the measure of the ‘goodness’ of the predictive distribution.' Akaike's observations in this essay help motivate the LGM algorithm; although nominally the LGM algorithm aims to estimate a useful sparse precision matrix, in many ways its more fundamental aim is to identify a 'predictive distribution' that is close to the (often unknowable) true distribution.

Chapter 4

LGM Algorithm

The LGM algorithm accepts training data and outputs a sparse, positive definite precision matrix. In this chapter, we begin by providing a high-level overview of the algorithm. Then we describe LGM’s three main components in more detail. The first component is robust local modeling: the construction of probability distributions that each describe just a handful of variables. The second component of the algorithm is the fusing together of local models into a global model. The final main component of the algorithm is a discrete optimization process that determines the variables’ network structure. Important implementation details are included to complete the description.

4.1 Overview

The LGM algorithm recasts the precision matrix estimation task as an optimization problem. The goal of the optimization is to find a Gaussian distribution that best anticipates out-of-sample data. The algorithm returns the precision matrix that parameterizes the selected Gaussian distribution.

To solve the optimization problem, a low-treewidth probabilistic graphical model is constructed through a discrete space search. The ramifications of each search step affect only a small set of nearby variables, so the problem is computationally tractable even for high dimensional models. Steps are undertaken based on their estimated potential to improve the global model's out-of-sample performance.

As variables become grouped together in cliques, the variables in each clique are modeled as jointly Gaussian. However, the parameters for each clique's distribution are not the maximum likelihood parameters estimated from the training set. Instead, they correspond to a Gaussian distribution that is like the maximum likelihood distribution except that it has been made more robust to overfitting. To determine how much additional robustness cliques require, an array of cliques of each size are bootstrapped from the training data, and cross-validation is employed to determine the optimal amount of robustness. In this way the algorithm exploits the combinatorially large number of cliques of size $k$ that can be formed in a dataset of dimension $n$, $n > k$.

It is worth noting that the algorithm described here is fairly general. Although a specific method for creating robust local models is described here, other approaches could be used. Similarly, while a specific discrete space search algorithm is described here, other discrete space search algorithms could be employed.

4.2 Local Models

The local model algorithm is given all $n$ samples of any $m$-variable subset of the dataset and returns a mean-zero jointly Gaussian distribution that models that subset. In general, this distribution will be similar to the maximum likelihood distribution but with higher entropy. The higher entropy guards against overfitting the available sample of data. Parameters dictating exactly how the entropy of the local distribution is boosted are calibrated using the entire dataset (not merely $m$ of the variables).

4.2.1 Variance Inflation

A conventional approach to modeling univariate Gaussian or approximately-Gaussian data is to use a t-distribution when the sample size is small and a Gaussian distribution when it is large. However, a t-distribution is mathematically inconvenient, so instead we use a Gaussian distribution with inflated variance when the sample size is small. More specifically, for each variable $x_i$, $i \in 1, ..., m$,

$\hat{\sigma}_i^2 = \delta \cdot \sigma_{i,MLE}^2$, (4.1)

where $\delta \in [1, 2]$ is estimated through cross-validation. Although the MLE of each variable's variance is unbiased, the costs of estimation error tend to be asymmetric: in practical applications, it is generally worse to underestimate than to overestimate variance. By using cross-validation, the $\delta$ parameter may adapt both to the amount of available data and to non-Gaussianity in the data. When the variance inflation factor, $\delta$, is greater than 1, it increases the entropy of the local model distribution.

4.2.2 Correlation Deflation

The naive approach to building a jointly Gaussian model of several variables is to use the maximum likelihood estimates of their correlations. As with variances, the practical costs associated with correlation mis-estimation tend to be asymmetric: overestimating correlations tends to be more problematic than underestimating correlations. This is true not only in some practical applications, but also when one seeks simply to build a model that effectively predicts future data (i.e., when one wants the joint likelihood of additional data, $X_2$, to be high given the model trained on the first tranche of data, $M(X_1)$; that is, $L(X_2|M(X_1))$ is large). A simple heuristic correction that allows for this asymmetry is to 'shrink' the maximum likelihood estimates of (off-diagonal) correlations uniformly toward 0 as indicated by some parameter, $\gamma$:

$\hat{\rho}_{i,j}^* = \gamma \, \hat{\rho}_{i,j}^{MLE}$, (4.2)

where $i, j \in 1, ..., m$ and $i \neq j$, with $\gamma \in [0, 1]$. Of course, $\hat{\rho}_{i,i} = 1$ for all $i$. When $\gamma$ is less than 1, the correlations are deflated, which increases the entropy of the local model distribution.

33 4.2.3 Calibrating Correlation Deflation

The optimal correlation parameter $\gamma$ may vary depending on the size, $m$, of the group of variables in question. For example, if 100 variables are related according to an AR(1) process with a parameter of 0.8 and there are only 20 samples ($n = 20$), the true relationships are extremely local, so the deflation parameter for groups of two or three variables might be quite near 1, but the deflation parameter for groups of 15 or 16 variables might be somewhat lower because most of the correlations are hallucinated and the benefit of suppressing them outweighs the cost of scaling down a few true correlations in the bunch.

The optimal correlation parameter may also vary based on the mutual information of the variables, or, equivalently, the entropy of the Gaussian distribution associated with the maximum likelihood estimate of the $m$ variables' correlation matrix. That is, a group of $m$ variables that are tightly correlated with one another may see its correlations deflated by a different amount than a group of variables that are largely uncorrelated.

Continuing with our AR(1) example, suppose $m = 2$ and we are considering deflation parameters for bivariate models. If we were to consider all the possible pairs of variables, we would find that a few hundred of them are closely related; the entropies of their maximum likelihood distributions will usually be quite low. However, the thousands of other pairs of variables are nearly independent and the maximum likelihood estimates of their correlations will generally be smaller in magnitude, leading to correspondingly higher entropies. Cross-validation will show that it is beneficial to deflate the high-entropy groups more than the low-entropy groups because low entropy (in this case) is an indicator of ample signal relative to estimation noise.

To calibrate $\gamma$, cliques of $m$ variables are bootstrapped from the data for $m = 1, ..., m_{MAX}$. $m_{MAX}$ is a value that can be set by the user to cap the treewidth of the probabilistic graphical model associated with the correlation matrix and also to speed up computation. In this work, treewidth was capped at 10, and so $m_{MAX}$ was set to 11. Using cross-validation, optimal deflation parameters are found as a function of clique size and of the entropy associated with the clique.

Figure 4-1: Heatmaps illustrate optimal shrinkage parameters calibrated based on 50 samples of 100 dimensional training data from six factor structures (the structures are described in chapter 5). High values indicate that little shrinkage is required and the local model will very nearly use the maximum likelihood estimates. Low values near zero mean that the data will be almost entirely discounted and variables in the local model will be assumed to be nearly independent. For each factor set, the shrinkage parameter varies as a function of clique size (y-axis) and the entropy (x-axis) of the Gaussian distribution parameterized by the maximum likelihood correlation matrix associated with the variables in the clique.

Thus the value of $\gamma$ for a particular clique is:

$\gamma(k_1, k_2, ..., k_m) = f(m, H(x_{k_1}, x_{k_2}, ..., x_{k_m}))$, (4.3)

where $f$ is calibrated based on the full training set using cross-validation.
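A simplified sketch of this calibration follows, assuming X is an nSamples-by-nVars matrix of standardized training data and that each training fold is large enough for the per-clique MLE correlation to be positive definite. For brevity it conditions the deflation parameter on clique size only (the thesis additionally conditions on clique entropy, as in equation 4.3), and the candidate grid, bootstrap count, and single random train/validation split are illustrative choices rather than the settings used in the thesis.

% Bootstrap cliques of each size and cross-validate the correlation
% deflation parameter gamma by held-out Gaussian log likelihood.
[nSamples, nVars] = size(X);
mMax     = 11;                       % cap on clique size (treewidth + 1)
gammas   = 0 : 0.05 : 1;             % candidate deflation parameters
nBoot    = 200;                      % bootstrapped cliques per clique size
gammaHat = ones(mMax, 1);
for m = 2:mMax
    score = zeros(size(gammas));
    for b = 1:nBoot
        vars = randperm(nVars, m);                   % a random clique of size m
        idx  = randperm(nSamples);
        tr   = idx(1:floor(nSamples/2));
        va   = idx(floor(nSamples/2)+1:end);
        Rmle = corrcoef(X(tr, vars));                % MLE correlation on the training fold
        for g = 1:numel(gammas)
            R = gammas(g) * Rmle;                    % deflate off-diagonal correlations
            R(1:m+1:end) = 1;                        % diagonal stays 1
            L = chol(R);                             % R = L' * L
            Z = X(va, vars) / L;                     % whitened held-out samples
            % average held-out log likelihood under N(0, R), up to a constant
            score(g) = score(g) - sum(log(diag(L))) - 0.5 * mean(sum(Z.^2, 2));
        end
    end
    [~, best]   = max(score);
    gammaHat(m) = gammas(best);
end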

Some examples of calibrated local shrinkage parameters can be found in figure 4-1. Each heatmap illustrates the (estimated) optimal shrinkage parameters for a dataset with a different factor structure. In this example, when the data are truly independent, the shrinkage parameters learned from the data through cross-validation are quite small, never exceeding 0.07. This means that even when spurious correlations are found in the training data, they will be almost entirely suppressed. By contrast, when all variables have a true 0.5 correlation (the 'constantCorrelation' example), the lowest entropy samples are spuriously over-correlated and must be shrunk, while the highest entropy samples are, if anything, under-correlated and are shrunk less. In the 'markov' example, the true structure is an AR(1) process, so it should come as little surprise that small, strongly connected cliques (like the ones that define the AR(1) process' true structure) are kept relatively untouched, while larger and less-strongly connected cliques are shrunk more.

4.2.4 Assembling the Local Model

Combining variance inflation and correlation deflation, we can construct a covariance matrix, $\hat{\Sigma}$, whose diagonal elements are the inflated variances and whose off-diagonal elements are the deflated correlations multiplied by the inflated standard deviations:

$\hat{\Sigma} = [\hat{\rho}_{i,j} \cdot \hat{\sigma}_i \cdot \hat{\sigma}_j]$. (4.4)

Using this covariance matrix, the local distribution is simply $N(\vec{0}, \hat{\Sigma})$.
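As a concrete sketch, the local model for a single clique can be assembled as follows, assuming Xc holds the (mean-zero) training samples for the clique's variables and that delta and gamma have already been calibrated as described above; the variable names are illustrative.

% Assemble the robust local covariance of equation 4.4 from the inflated
% standard deviations (eq. 4.1) and deflated correlations (eq. 4.2).
sigmaMLE = std(Xc, 1);                        % MLE standard deviations (normalize by n)
sigmaHat = sqrt(delta) * sigmaMLE;            % inflated standard deviations
Rmle     = corrcoef(Xc);                      % MLE correlations
Rhat     = gamma * Rmle;                      % deflated off-diagonal correlations
Rhat(1:size(Rhat,1)+1:end) = 1;               % diagonal stays 1
SigmaHat = (sigmaHat' * sigmaHat) .* Rhat;    % equation 4.4
LambdaC  = inv(SigmaHat);                     % the clique's local precision matrix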

36 4.2.5 Discussion

The parameters $\delta$ and $\gamma$ are calibrated based on the entire dataset. However, if it were known that the dataset could be partitioned into several subsets, each with substantially different characteristics, one might condition one or both parameters on that fact as well as, or instead of, clique size and clique entropy.

4.3 Global Structure

We build a global model by allowing closely-related variables to form cliques and an ensemble of such cliques to form one or more clique trees. The relationships among variables within each clique are described using their local model; consequently, variables within cliques tend to be conditionally and unconditionally dependent. Variables in the same tree but different cliques are conditionally independent but (usually) unconditionally dependent. These dependencies can be described using a factorized probability distribution or the syntax of probabilistic graphical models, in which potentials are associated with cliques and separator sets. Variables in different trees are modeled as being both conditionally and unconditionally independent.

4.3.1 Constructing a Global Model

The uncertainty associated with jointly Gaussian random variables is summarized in the covariance matrix, $\Sigma$, or, equivalently, the precision matrix, $\Lambda = \Sigma^{-1}$. We will focus more on $\Lambda$, but the two are equally valid summaries as long as they are invertible. Thus building a global model amounts simply to constructing a matrix $\Lambda$ for all variables of interest.

The local modeling process described above produces smaller precision matrices $\Lambda_1, \Lambda_2, ..., \Lambda_h$ for (sometimes overlapping) cliques $\{1, 2, ..., h\}$. All variables can be found within at least one clique, and each variable is thus represented in at least one $\Lambda_i$.

As a special case, when each variable is represented in exactly one clique, constructing a global model is very simple. Without loss of generality, re-order the variables so that variables within each clique are consecutive. Then create a block-diagonal matrix $\Lambda$ where the block-diagonal elements are the precision matrices associated with the corresponding cliques. That is,

$\Lambda = \begin{bmatrix} \Lambda_1 & 0 & \cdots & 0 \\ 0 & \Lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \Lambda_h \end{bmatrix}$. (4.5)

The variables’ re-ordering can then be reversed. Usually, however, some cliques will describe overlapping subsets of variables.

Accordingly, let us consider the general case for two overlapping cliques, but first fix some notation. Let $S_1$ be the set of variables in clique 1, $S_2$ be the set of variables in clique 2, $S_{1 \setminus 2}$ be the set of variables in clique 1 but not 2, $S_{2 \setminus 1}$ the converse, and $S_{1,2}$ the set of variables found in both cliques. Let $\psi_{i,S}$ be the multivariate normal, mean-zero distribution associated with the variables in set $S$ based on local model $i$, where $S \subseteq S_i$ and we have marginalized out variables that are in $S_i$ but not $S$ to obtain $\Lambda_{i,S}$:

$\psi_{i,S} = \sqrt{\frac{|\Lambda_{i,S}|}{(2\pi)^k}} \, e^{-\frac{1}{2} x_S \Lambda_{i,S} x_S^T}$ (4.6)

Past work in using junction trees to generate sparse representations of high-dimensional distributions has often assumed that the marginal distributions of the set of overlapping variables, $S_{1,2}$, would be the same whether derived from $S_1$ or $S_2$.

That is, $\psi_{1,S_{1,2}} = \psi_{2,S_{1,2}}$. When this is the case, one can straightforwardly represent the global model using a factorized probability:

$P(x_{S_1 \cup S_2}) = P(x_{S_1}) \cdot P(x_{S_2} \mid x_{S_{1,2}}) = \frac{P(x_{S_1}) \cdot P(x_{S_2})}{P(x_{S_{1,2}})}$. (4.7)

Equivalently,

$P(x_{S_1 \cup S_2}) = \frac{\psi_{1,S_1} \cdot \psi_{2,S_2}}{\psi_{1,S_{1,2}}} = \frac{\psi_{1,S_1} \cdot \psi_{2,S_2}}{\psi_{2,S_{1,2}}} = \frac{\psi_{1,S_1} \cdot \psi_{2,S_2}}{\sqrt{\psi_{1,S_{1,2}} \cdot \psi_{2,S_{1,2}}}}$. (4.8)

However, in this work we relax this requirement and allow $\psi_{1,S_{1,2}} \neq \psi_{2,S_{1,2}}$. This breaks the equalities in equation 4.8. Instead we use simply

$P(x_{S_1 \cup S_2}) = \frac{\psi_{1,S_1} \cdot \psi_{2,S_2}}{\sqrt{\psi_{1,S_{1,2}} \cdot \psi_{2,S_{1,2}}}} = \frac{\psi_{1,S_1} \cdot \psi_{2,S_2}}{\varphi_{S_1 \cap S_2}}$, (4.9)

where we have introduced $\varphi_{S_1 \cap S_2} = \sqrt{\psi_{1,S_{1,2}} \cdot \psi_{2,S_{1,2}}}$, the potential associated with the separator set in the junction tree for the variables in both $S_1$ and $S_2$.

An equivalent way to think about this is the following. We have two overlapping cliques, and we want to model the distribution of variables in their intersection. Associated with each clique is a joint probability distribution, so we can marginalize out the other variables in each clique, giving us two estimates of the distribution of the variables in the overlapping set, $\psi_{1,S_1 \cap S_2}$ and $\psi_{2,S_1 \cap S_2}$. We have no reason to think that one estimate is better than the other, so we choose a distribution that is the geometric mean of the two:

$p(x_{S_1 \cap S_2}) = (\psi_{1,S_1 \cap S_2} \cdot \psi_{2,S_1 \cap S_2})^{1/2}$. (4.10)

As long as we are using exponential family distributions, the geometric mean of two such distributions will also be in the same family.

Then suppose we want the global model. This joint probability factors into three parts:

$p(x_{S_1 \cup S_2}) = p(x_{S_1 \cap S_2}) \cdot p(x_{S_1 \setminus (S_1 \cap S_2)} \mid x_{S_1 \cap S_2}) \cdot p(x_{S_2 \setminus (S_1 \cap S_2)} \mid x_{S_1 \cap S_2})$, (4.11)

or, equivalently,

$p(x_{S_1 \cup S_2}) = \varphi_{S_1 \cap S_2} \cdot \frac{\psi_{1,S_1}}{\psi_{1,S_1 \cap S_2}} \cdot \frac{\psi_{2,S_2}}{\psi_{2,S_1 \cap S_2}}$. (4.12)

Now we are ready to describe how to construct the precision matrix. First, re-order it as necessary and partition it into three overlapping blocks: $S_1$, $S_2$, and $S_1 \cap S_2$. Populate the precision matrix initially with zeros. Then, in the $S_1$ block, add $\Lambda_{1,S_1}$. Because the elements are all zeros, this is the same as simply inserting it. In the $S_2$ block, add $\Lambda_{2,S_2}$. In the $S_1 \cap S_2$ block, subtract $\Lambda_{\varphi_{S_1 \cap S_2}}$, where $\Lambda_{\varphi_{S_1 \cap S_2}} = \frac{1}{2}(\Lambda_{1,S_1 \cap S_2} + \Lambda_{2,S_1 \cap S_2})$.

The procedure for building a global model from two overlapping local models generalizes naturally to the procedure for building a global model from any number of possibly-overlapping local models.

1. Make a list of all maximal intersection sets between cliques, $I_1, ..., I_s$.

2. Count the number of times, or multiplicity, each intersection occurs: $m_1, ..., m_s$.

3. Build intersection set potential functions $\varphi_1, ..., \varphi_s$ parameterized by precision matrices $\Lambda_{\varphi_i}$, $i \in 1, ..., s$, where each

$\Lambda_{\varphi_i} = \frac{1}{m_i} \sum_{j = k_1^i, ..., k_{m_i}^i} \Lambda_{j, S_{\varphi_i}}$,

and $k_l^i$ is the $l^{th}$ of the $m_i$ indices of cliques sharing the $i^{th}$ intersection set, while $S_{\varphi_i}$ is the set of variables included in the $i^{th}$ intersection set.

4. Initialize a precision matrix $\Lambda$ for the global model with zeros.

5. For each clique, add its precision matrix $\Lambda_{i,S_i}$ to the corresponding submatrix of $\Lambda$.

6. For each intersection set with multiplicity $m_i$, subtract its precision matrix $\Lambda_{\varphi_i}$ from the corresponding submatrix of $\Lambda$ $(m_i - 1)$ times.

Note that this procedure works only when the set of cliques is consistent with the properties of junction trees. In particular, it is important to enforce acyclicality (a distribution must be representable as a DAG) and therefore to pay heed to, for example, the running intersection property, when creating a collection of cliques.
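The two-clique case can be written compactly as follows. This is a sketch, assuming S1 and S2 are index vectors into the global variable ordering, Lambda1 and Lambda2 are the corresponding local precision matrices, and every one of the d variables belongs to S1 or S2 (variables outside both would be covered by their own cliques); marginalizing a local model onto the separator is done in covariance space, as Gaussian marginals require.

% Fuse two overlapping local models into a d-by-d global precision matrix.
function Lambda = fuseTwoCliques(Lambda1, S1, Lambda2, S2, d)
    S12     = intersect(S1, S2);                 % separator variables
    [~, i1] = ismember(S12, S1);                 % their positions within clique 1
    [~, i2] = ismember(S12, S2);                 % their positions within clique 2
    Sigma1  = inv(Lambda1);
    Sigma2  = inv(Lambda2);
    Lsep1   = inv(Sigma1(i1, i1));               % clique 1 marginalized onto the separator
    Lsep2   = inv(Sigma2(i2, i2));               % clique 2 marginalized onto the separator
    Lambda  = zeros(d);
    Lambda(S1, S1)   = Lambda(S1, S1) + Lambda1;                  % add clique 1
    Lambda(S2, S2)   = Lambda(S2, S2) + Lambda2;                  % add clique 2
    Lambda(S12, S12) = Lambda(S12, S12) - 0.5 * (Lsep1 + Lsep2);  % subtract the separator term
end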

4.4 Choosing a Collection of Local Models

In the preceding sections we described how to build a 'local' model of a small number of variables. We also described how to build a 'global' model of all variables by synthesizing a valid collection of local models. However, there are many possible collections of local models: this is the set of all sets of any subsets of the variables. It is not obvious which such collection to use, and a brute force search over all possible collections of cliques is infeasible for problems of all but the smallest dimension. In this section, we describe a search algorithm that, while not guaranteed to return the very best result, in practice returns a useful collection of local models even in high dimensional problems.

4.4.1 Key Principles

Before describing the algorithm for selecting a collection of cliques in detail, we will highlight some of its key principles.

1. Promote sparsity.

2. Maximize the mutual information among the variables.

3. Promote generalizability by maximizing expected out-of-sample mutual information, not in-sample mutual information.

4. Promote versatility by using principles applicable to all multivariate exponential family distributions, not just Gaussian distributions.

The goal of building a model that contains a maximal amount of information from the data is in tension with the goals of promoting sparsity and generalizability. Focusing strictly on building a model that fits the training data well would lead one to use the MLE estimate of covariance, which tends to be overfit to the data. Requiring a focus on expected out-of-sample mutual information penalizes higher-dimensional local models for which there is not enough data; such models either see their covariances shrunk or are broken up into multiple lower-dimensional local models associated with smaller cliques. Imposing a further penalty on the number of conditional dependencies in the model eliminates some conditional dependencies that add complexity to the model while contributing only marginally in the way of incremental explanatory power.

As a note for future work, structuring the algorithm to reflect properties of exponential family distributions means that the algorithm can be repurposed for use on the multivariate categorical distribution with minimal modifications. While linear models of interrelationships implied by the multivariate Gaussian distribution encompass a large class of important problems, binary classification problems constitute a similarly large category of important problems.

4.4.2 Algorithm Overview

Assembling a good collection of local models is structured as a constrained stochastic optimization problem with a barrier function. Broadly speaking, the optimization problem is

$\max_{\Lambda} \; E[I(x \mid \Lambda)] - \lambda |\Lambda|_0 \quad \text{subject to} \quad \Lambda \succ 0$. (4.13)

Our approach is to build $\Lambda$ from a collection of local models associated with some set of cliques $\{S_i\}$ in one or more junction trees that collectively span all the variables. So we might more specifically write

$\max_{\{S_i\}} \; \sum_{i=1}^{|\{S_i\}|} I(\psi_{i,S_i}) - \sum_{j=1}^{|\{N_j\}|} I(\varphi_j) \cdot (m_j - 1) - \lambda |\Lambda|_0$ (4.14)

subject to $\{S_i\}$ constituting a valid junction tree and $\max\{|S_i|\} \leq \text{MAXTREEWIDTH} + 1$,

where the $\psi_{i,S_i}$ are local models associated with each clique containing variables $S_i$, $\{N_j\}$ are the maximal intersection sets associated with $\{S_i\}$, $\varphi_j$ are local derived models corresponding to the maximal intersection sets, and the $m_j$ are the multiplicities of the intersection sets. $\lambda$ is the $L_0$ penalty applied to the number of non-zero parameters in the precision matrix, $\Lambda$. $I(\psi_{i,S_i})$ is the mutual information under the local model for the variables in $S_i$, while $I(\varphi_j)$ is the mutual information under the local model of the variables in separator set $N_j$. Mutual information quantifies the strength of the relationships among a set of random variables. MAXTREEWIDTH is an optional constraint on the treewidth of the graph and thus on the size of the cliques.

With this in mind, the high-level algorithm involves first building a model that captures the strongest dependencies among variables and then steadily adding less-important details. More specifically:

1. Initialize the junction forest by creating a unitary clique for each variable.

2. Initialize the $L_0$ penalty barrier function by setting $\lambda$ at a high level.

3. Make local changes to the junction forest that increase the objective function.

4. If $\lambda > \lambda_{min}$, reduce $\lambda$ and return to step 3. Otherwise stop.

$\lambda_{min}$ is typically set at or slightly above zero. A larger $\lambda_{min}$ will generally result in sparser models. We note that a natural way to parameterize the various levels of $\lambda$ is to consider an array of correlation magnitudes beginning near 1 and decreasing to near or at 0. The mutual information of two variables with each of the indicated levels of correlation is calculated. Those quantities of mutual information are the progressively smaller values of $\lambda$. Intuitively, this means that for correlations, $\rho$, approaching zero from one, we require that any local change that increases model complexity also increase the mutual information of the global model by at least $\lambda(\rho)$ per additional conditional dependency (or, equivalently, per additional non-zero parameter in $\Lambda$).
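Concretely, each level of $\lambda$ can be taken to be the mutual information of a bivariate Gaussian with a given correlation magnitude, $I = -\frac{1}{2}\log(1-\rho^2)$; the particular grid of correlations in the sketch below is an illustrative choice, not the schedule used in the thesis.

% Lambda schedule: mutual information (in nats) of a bivariate Gaussian
% at a decreasing sequence of correlation magnitudes.
rhoGrid   = [0.9 0.8 0.6 0.4 0.2 0.1 0.05];
lambdaSeq = -0.5 * log(1 - rhoGrid.^2);      % progressively smaller penalty levels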

4.4.3 Local Changes to the Junction Forest

Each change to the junction tree must improve the objective function. Thus candidate changes must, first, respect the structure of the junction tree (or junction forest). All variables must be included in at least one clique, and the running intersection property must be respected. Second, the gain in modeled mutual information must outweigh the penalty associated with any increase in complexity. Third, the change cannot cause a clique to exceed the indicated maximum size, (MAXTREEWIDTH + 1). MAXTREEWIDTH can be set to infinity, but a binding constraint often decreases computational time and promotes sparsity; for all simulations reported here it was set to 10.

We define four types of incremental changes to the junction forest. For each level of the penalty term, $\lambda$, we scan through all viable candidate changes of each type and make changes that improve the penalized objective. The four allowed types of incremental changes are:

1. Merge: Link together two junction trees by adding a new two-variable clique. One variable is in one of the original trees and one variable is in the other of the original trees. A new and larger tree is formed and the two old trees are deleted. Complexity cost: +1.

2. Add: Add to a clique a node already found in one or more neighboring cliques. If any neighboring cliques are subsumed, delete them. Complexity cost when the original clique has $n$ nodes: $+(n - |i|)$, where $|i|$ is the number of variables in the intersection of the clique and the union of all neighboring cliques that contain the node in question.

3. Split: Break a clique containing $n$ variables into two cliques each with $n - 1$ variables and an intersection set including $n - 2$ variables. Complexity cost when the original clique has $n$ nodes: $-1$.

4. Reduce: Omit a variable from a clique. If the variable is not a member of any other clique, create a new unitary clique to contain that variable. Complexity cost when the original clique has $n$ nodes: $-(n - 1 - |i|)$, where $|i|$ is the number of variables in the intersection of the clique and the union of any neighbors that include the variable in question. If the resulting clique is a subset of another clique, delete it.

All of these changes require only local computation; the direct computational costs of evaluating and/or making a change do not increase with the total number of variables. However, for higher dimensional problems, more such changes must be evaluated. Furthermore, evaluating each change is generally more computationally expensive for larger cliques and for cliques with more neighbors.

The 'merge' step creates connections among variables that had previously been modeled as being entirely independent. By contrast, the 'add' step adds conditional dependencies between variables that are already in the same junction tree. Both of these steps add structure. The 'split' and 'reduce' steps, by contrast, prune the dependence structure, simplifying the model by removing dependencies that do not contribute much information.

4.4.4 Making Local Changes

Previous work has explored the idea of using a greedy algorithm to select the best local step. Indeed, the seminal algorithm by Chow and Liu for constructing a tree with a treewidth of one works in exactly this way (specifically, it employs Kruskal's algorithm). However, every time a local step is made, some previously-viable local steps may become impossible while other local steps may become newly viable. Consequently, for higher-treewidth models it may be computationally expensive to guarantee that each step is fully greedy.

We take a pragmatic approach of searching for possible changes and implementing any change that improves the (penalized) objective function. By gradually reducing the penalty, in the first pass we identify the most significant aspects of the dependence structure, and in each subsequent pass (with a reduced penalty) we identify incrementally more subtle aspects of the dependence structure.

To implement all 'add' changes, we study each clique in turn. For each clique, we identify all variables that are in its neighbors but not in the clique itself. These are the only variables that can be added without violating the running intersection property. In a random order, the impact of adding each of these neighbor variables to the clique is evaluated. When one is found that improves the objective function, it is added to the clique. The resulting larger clique is sent to the back of the queue and is later again checked to determine whether any neighbors can be added. The 'add' phase is complete when the last clique has been checked for viable additions and none have been found.

The implementation of the 'split' and 'reduce' changes is similar. Each clique is checked in turn, modifications that improve the objective function are implemented immediately, and once implemented the affected clique is sent to the back of the queue to be checked again. By contrast, the implementation of the 'merge' changes is much like that for a minimal spanning tree: all changes that improve the objective function are made in order from most improvement to least.
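The 'add' phase can be sketched as a simple queue loop. The helper functions below (candidateAdditions, objectiveGain, applyAddition) are hypothetical stand-ins for the bookkeeping described in the text, not functions from the thesis implementation, and the termination handling is an approximation of the rule described above.

% 'Add' phase: repeatedly scan cliques, adding a neighbor variable whenever
% doing so improves the penalized objective, and re-queue the enlarged clique.
queue = 1:numel(cliques);                          % clique indices still to examine
while ~isempty(queue)
    c     = queue(1);  queue(1) = [];
    cands = candidateAdditions(cliques, c);        % variables in neighboring cliques only
    cands = cands(randperm(numel(cands)));         % evaluate in a random order
    for v = cands
        if objectiveGain(cliques, c, v, lambda) > 0
            cliques      = applyAddition(cliques, c, v);
            queue(end+1) = c;                      % re-check the enlarged clique later
            break
        end
    end
end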

4.5 Pre-screening

As a preliminary screen, we identify a handful of each variable’s strongest pairwise relationships. To obtain some measure of robustness, we do this first by simply selecting the top 푘 most correlated variables and then also selecting 푚푖푛(푘, 5) variables via the LASSO algorithm (in general the selected sets of variables will be overlapping). Using the LASSO as a selection tool offers an improved prospect of identifying important but subtle relationships among variables. The impact of the pre-screen is two-fold. First, only cliques that include a chain of strong pairwise relationships are considered in calibrating the shrinkage parameter for local models. And, second, only the same cliques are considered in the ‘merge’ and ‘add’ operations.

In scenarios like an AR process where the number of true conditional dependencies in the data increases linearly with the dimension of the data, 푛, the total number of possible pairwise dependencies nonetheless increases quadratically with the dimension (푛²). Thus the fraction of potential dependencies that are true declines as 푛 increases (푛/푛² = 1/푛). For large 푛, this can present a problem: when we calibrate the shrinkage parameters for the local models, because many of the cliques selected for cross-validation may consist of independent variables, in some cases we may be forced to select a rather drastic shrinkage parameter. However, if we pre-emptively identify the most closely related variables and refuse to consider the remainder, we can better focus attention on a subset of the potential dependencies that has a higher chance of being true.
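As an illustration, the pre-screen can be sketched as follows; the data matrix X, the value of 푘, and all variable names are assumptions made for this sketch rather than the thesis implementation.

    % Candidate-partner selection: top-k correlation plus LASSO selection.
    % X is a centered n-by-m data matrix; k is illustrative.
    k = 10;
    [~, m] = size(X);
    R = corrcoef(X);                              % pairwise sample correlations
    keep = false(m, m);
    for j = 1:m
        r = abs(R(:, j));
        r(j) = -Inf;                              % never select the variable itself
        [~, idx] = sort(r, 'descend');
        keep(idx(1:k), j) = true;                 % top-k most correlated partners
        % LASSO-selected partners (Statistics and Machine Learning Toolbox).
        others = setdiff(1:m, j);
        B = lasso(X(:, others), X(:, j), 'DFmax', min(k, 5));
        [~, col] = max(sum(B ~= 0, 1));           % densest fit allowed by DFmax
        keep(others(B(:, col) ~= 0), j) = true;
    end
    keep = keep | keep';                          % symmetric set of candidate pairs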

4.6 Data Structure

It is frequently noted in the literature that the computational efficiency of algorithms related to junction trees is highly dependent on the algorithm’s implementation. In this case, to facilitate speedy computation, certain quantities are kept up-to-date at all times. These include the following (a sketch of one possible layout appears after the list):

∙ A membership matrix indicating the nodes in each clique

∙ A membership matrix indicating the cliques in each tree

∙ An adjacency matrix indicating the cliques that overlap

∙ The mutual information associated with each clique

∙ The mutual information associated with each separating set

∙ The multiplicity of each separating set
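One possible layout for these quantities is sketched below; the field names and example sizes are ours, not necessarily those of the thesis implementation.

    % Illustrative bookkeeping structure for the quantities listed above.
    m = 100; numCliques = 40; numTrees = 5; numSeps = 35;     % example sizes
    model.cliqueMembers = false(numCliques, m);           % (c, v): variable v is in clique c
    model.treeMembers   = false(numTrees, numCliques);    % (t, c): clique c is in tree t
    model.cliqueOverlap = false(numCliques, numCliques);  % cliques that share variables
    model.cliqueMI      = zeros(numCliques, 1);           % mutual information of each clique
    model.sepMI         = zeros(numSeps, 1);              % mutual information of each separating set
    model.sepMult       = zeros(numSeps, 1);              % multiplicity of each separating set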

Chapter 5

LGM Performance

In this section we evaluate the performance of the LGM (Local Global Model) algorithm and compare its performance to that of other algorithms. We define performance principally based on the model’s ability to anticipate out-of-sample data but also on its ability to correctly identify causal links. We find that the efficacy of particular robust covariance algorithms depends on the nature of the problem at hand; accordingly, we evaluate an array of scenarios that differ based on (1) dimension (푚), (2) sample size (푛), and (3) factor structure. The performance of the LGM algorithm is compared to that of an array of competitors.

Several high-level results stand out. LGM and GLASSO were generally the best of the models considered here based on their ability to anticipate out-of-sample data. The best algorithm for a particular dataset depended on the data’s underlying structure. LGM tended to work best for datasets that were relatively sparse or included groups of highly correlated variables. Turning to the number of non-zero entries in the precision matrix, LGM generally produced sparser models than cross-validated GLASSO. Although GLASSO could be forced to produce a sub-optimal model that matched the sparsity of the LGM model, the equivalently sparse GLASSO model tended to underperform the LGM model.

Another important question is whether LGM can process truly high dimensional datasets. LGM appeared to perform reasonably well in this sense, requiring just 56 minutes to build a sparse precision matrix for a dataset of dimension 1,500. Runtime was observed to increase roughly quadratically with dimension.

The remainder of this chapter is divided into two parts. First, the experimental design is explained, including the synthetic datasets, competitor algorithms, and metrics for gauging success. Second, the experimental results are reported in detail.

5.1 Experimental Design

5.1.1 Data

Training and test datasets for performance evaluation are generated for a variety of scenarios. All data are jointly multivariate Gaussian and pseudorandomly generated using the default settings of Matlab 2016a. Each scenario is specified by a data generating process that scales automatically to produce the desired number of samples, 푛, and dimensionality, 푚. Using this data generating process, a small training set and a large test set are generated. Typically, the training set contains between 10 and 1000 samples, while the test set contains 100,000 samples. The scenarios of interest are as follows; all idiosyncratic factors 휖 and common factors 푦 are jointly Gaussian, uncorrelated, and have unit variance. A minimal generator for one of these scenarios is sketched after the list.

∙ Independent: For 푚 variables, {푥1, ..., 푥푚}, there are 푚 idiosyncratic factors {휖1, ..., 휖푚}. The 푥푖 are defined as

푥푖 = 휖푖; (5.1)

∙ Constant Correlation: For 푚 variables, {푥1, ..., 푥푚}, there are 푚 idiosyncratic factors {휖1, ..., 휖푚}, and one common factor, 푦. The 푥푖 are defined as

푥푖 = 휖푖 + 푦; (5.2)

∙ markov: An AR(1) model. For 푚 variables, {푥1, ..., 푥푚}, there are 푚 idiosyncratic factors {휖1, ..., 휖푚}, and the 푥푖 are defined recursively such that 푥1 = 휖1 and, for 푖 ∈ {2, ..., 푚},

푥푖 = 훿 · 푥푖−1 + 휖푖 (5.3)

where we set 훿 = .8.

∙ markovMulti: An AR(3) model. For 푚 variables, {푥1, ..., 푥푚}, there are 푚 idiosyncratic factors {휖1, ..., 휖푚}, and the 푥푖 are defined recursively such that

푥푖 = 훿 · Σ_{휏=1}^{푚푖푛(푖−1,3)} 푥푖−휏 + 휖푖 (5.4)

where we set 훿 = .3.

∙ ten: For 푚 variables, {푥1, ..., 푥푚}, there are 푚 idiosyncratic factors {휖1, ..., 휖푚} and 푓푙표표푟(푚/10) common factors, {푦1, ..., 푦_{푓푙표표푟(푚/10)}}. Each common factor influences ten variables. That is,

푥푖 = 푦_{푓푙표표푟(푖/10)} + 휖푖 (5.5)

If 푚 is not divisible by 10, the last few variables have no common factors and are simply 푥푖 = 휖푖.

∙ tenStrong: For 푚 variables, {푥1, ..., 푥푚}, there are 푚 idiosyncratic factors {휖1, ..., 휖푚} and 푓푙표표푟(푚/10) common factors, {푦1, ..., 푦_{푓푙표표푟(푚/10)}}. Each common factor influences ten variables. That is,

푥푖 = 5 · 푦_{푓푙표표푟(푖/10)} + 휖푖 (5.6)

If 푚 is not divisible by 10, the last few variables have no common factors and are simply 푥푖 = 휖푖.
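As promised above, a minimal generator for one of these scenarios is sketched here; the function name and interface are illustrative rather than the code used to produce the experiments.

    % Minimal generator for the 'markov' (AR(1)) scenario of equation (5.3).
    function X = generateMarkov(n, m, delta)
        if nargin < 3, delta = 0.8; end
        idio = randn(n, m);               % idiosyncratic factors: unit-variance Gaussians
        X = zeros(n, m);
        X(:, 1) = idio(:, 1);
        for i = 2:m
            X(:, i) = delta * X(:, i-1) + idio(:, i);   % x_i = delta*x_{i-1} + eps_i
        end
    end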

To make the factor structures of these datasets easier to grasp, example data was generated for each scenario with dimension 100. The correlation matrix of 100,000 ‘test set’ samples of the data is displayed in the left column of figure 5-1. In the center column is the correlation matrix of a dataset of just 50 ‘training’ samples. And in the right column is the correlation matrix of a model generated by LGM based on the 50 samples of training data.

In the first row can be seen the ten blocks of ten moderately-correlated variables that characterize the ‘ten’ scenario. With only 50 samples the training set estimate is polluted by noise, and variables that are truly independent appear dependent. In the LGM approximation, many of these independent relationships are restored, although some of the true relationships are underestimated.

In the second row can be seen the ten blocks of ten strongly-correlated variables that characterize the ‘tenStrong’ scenario. As in the ‘ten’ scenario, there is noise in the off-block-diagonal region of the training set, but it is mostly block-by-block. The LGM approximation appears to retain all the true structure while filtering out much of the noise.

In the third row is the independent model. Although the small size of the training set introduces noise into the empirical training correlation matrix, LGM is able to filter it out and restore the true correlation structure.

In the fourth row is the constant correlation model. The empirical correlations are roughly 0.5, but they are far more variable in the training correlation matrix than in the test correlation matrix. It is not clear that LGM is able to add any useful structure.

In the fifth row is the markov model (an AR(1) process). Although the training set is polluted by noise, the true correlation structure is well recovered.

In the sixth row is the markovMulti model (an AR(3) process). With a relatively weak true correlation structure, the training set is heavily polluted by noise. The LGM approximation keeps some structure while suppressing a great deal of noise.

5.1.2 Metrics

A common approach to planning is to identify likely scenarios, assign each scenario a probability, and then optimize one’s strategy based on the strategy’s expected impacts across the scenarios. For this approach to be effective, the actual scenarios that transpire must frequently be ones to which the planner assigned high probability. Assigning appropriate probabilities to scenarios is an important role of statistical models like LGM.

Figure 5-1: Correlation structures of datasets with dimension 100. In the left column are test set correlations. In the center column are training set correlations. On the right is the correlation matrix associated with a precision matrix generated by LGM based on the training data. The test set consists of 100,000 samples. The training set consists of 50 samples. As a note, the heatmap color ranges vary from image to image; most notably, the constant correlation matrix has off-diagonal elements near 0.5, while the independent correlation matrix has off-diagonal elements near 0.

The primary objective of the LGM is to produce a model of the training data that accurately anticipates out-of-sample test data drawn from the same distribution. Accordingly, the joint log likelihood of the 100,000 element test set is the main objective we seek to maximize. An equivalent (for our purposes) quantity that is directly proportional to this is the average log likelihood of the observations conditional on the model. As the model becomes better and better and approaches the true distribution, the negative average log likelihood approaches the entropy of the true distribution. The gap between these quantities is the Kullback-Leibler divergence. Thus seeking a model that maximizes the test-set likelihood is equivalent to minimizing the divergence from the true distribution to the estimated distribution. In the experimental results, Kullback-Leibler divergence is always calculated from the ‘true’ distribution to the ‘approximate’ distribution, where the maximum likelihood estimates of covariance based on the 100,000 element test set parameterize the ‘true’ distribution and estimates based on the smaller training set parameterize the ‘approximate’ distribution.

The secondary objective of LGM is to produce a sparse, interpretable model that accurately highlights the most important relationships among the variables in the underlying population. To obtain a generally-applicable notion of sparsity, we threshold the inverse of the correlation matrix associated with each model at 1/1000 of the largest off-diagonal precision matrix entry. The complexity of the model is determined to be the number of off-diagonal entries greater than this level in the upper (or lower) triangular region. In general, sparser models are more desirable as they are more readily interpretable.

Network structure identification can also be viewed as a classification problem. Thus as a supplementary consideration, we also run the following experiment: for a problem of dimension 푚, suppose that, with the luxury of the test set, one labeled as ‘important edges’ the 5 · 푚 strongest dependencies, as indicated by the magnitude of the off-diagonal entries of the inverse correlation matrix of the 100,000 test samples. We can view the task of identifying the important edges (from the training set) as a classification problem and calculate the precision and recall of the model structure relative to the labeled set of edges, of which 5 · 푚 are ‘important’ and the remainder ‘unimportant.’ Of course, 5 · 푚 is somewhat arbitrary; in some scenarios this may be a subset of the true relationships, while in others it may be a superset.

Of tertiary but still considerable interest are sensitivity analyses that provide insight into how the dimension of the data and the nature of the factor structure affect the log likelihood of the test set, computation time, and sparsity. Given that the objective is to develop an algorithm that is effective when applied to high dimensional datasets, its scalability is of significant importance.
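For reference, the divergence reported in these experiments can be computed in closed form once both distributions are treated as zero-mean Gaussians parameterized by their covariance estimates; the following helper (its name is ours) is one way to do so.

    % Kullback-Leibler divergence KL(true || approx) for zero-mean Gaussians.
    function kld = gaussKLD(SigmaTrue, SigmaApprox)
        m = size(SigmaTrue, 1);
        % Cholesky factors give numerically stable log-determinants.
        logDetTrue   = 2 * sum(log(diag(chol(SigmaTrue))));
        logDetApprox = 2 * sum(log(diag(chol(SigmaApprox))));
        kld = 0.5 * (trace(SigmaApprox \ SigmaTrue) - m + logDetApprox - logDetTrue);
    end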

5.1.3 Estimation Algorithms

LGM is benchmarked against five precision matrix estimation algorithms. They are

1. GLASSO: Cross-validated GLASSO. The GLASSO algorithm was obtained from http://statweb.stanford.edu/~tibs/glasso/ and the cross-validation procedure for identifying the appropriate complexity parameter was implemented in Matlab.

2. mst: Minimum spanning tree. An implementation of the Chow-Liu algorithm. It creates the treewidth 1 approximation of the empirical distribution that maximizes mutual information. It then returns the precision (or covariance) matrix.

3. shrinkage: Cross-validated shrinkage toward the diagonal. If Σ푀퐿퐸 is the maximum likelihood empirical covariance matrix and 퐷푀퐿퐸 is a diagonal matrix with the variances on the diagonal and all other terms zero, 휆Σ푀퐿퐸 + (1 − 휆)퐷푀퐿퐸 is a ‘shrunk’ covariance matrix. Cross-validation is used to determine the optimal shrinkage parameter, 휆. The precision matrix is obtained by inverting the covariance matrix. (A brief sketch of this procedure follows the list.)

4. mleD: mleD outputs a matrix that is close to the maximum likelihood estimate of the covariance. It differs in that it is slightly shrunk (휆 = .9) to make it positive definite. The precision matrix is obtained by inverting the covariance matrix.

5. diag: This method simply creates a diagonal covariance matrix with the variances on the diagonal. The precision matrix is obtained by inverting the covariance matrix.
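For concreteness, the shrinkage benchmark (item 3) can be sketched as follows. This sketch uses a single hold-out validation split rather than full cross-validation, and Xtrain, Xval, and the grid of 휆 values are illustrative assumptions rather than the implementation used here.

    % Shrinkage toward the diagonal with a validation-selected lambda.
    SigmaMLE = cov(Xtrain, 1);                   % maximum likelihood covariance
    D = diag(diag(SigmaMLE));
    SigmaVal = cov(Xval, 1);
    bestLL = -Inf;
    for lambda = 0:0.05:0.95                     % lambda = 1 may leave the MLE singular
        S = lambda * SigmaMLE + (1 - lambda) * D;
        logDetS = 2 * sum(log(diag(chol(S))));
        ll = -0.5 * (trace(S \ SigmaVal) + logDetS);   % Gaussian log likelihood, up to constants
        if ll > bestLL
            bestLL = ll;  bestLambda = lambda;
        end
    end
    Sigma = bestLambda * SigmaMLE + (1 - bestLambda) * D;
    Precision = inv(Sigma);                      % precision matrix by inversion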

5.2 Algorithm Performance

The experimental results fall into several broad themes. First, how accurate a model can each approximation algorithm construct? Second, how quickly does the accuracy of the model degrade when the dimensionality grows? Third, how is the computation time affected by the dimension of the data? And, fourth, how sparse are the models?

5.2.1 Model Accuracy

To assess model accuracy we run two sets of benchmarking experiments: one for scenarios of medium (25-100) dimension, and one for scenarios of high dimension (up to 500). The results of the first experiment are summarized in table 5.1 and reported in full in the appendix. Table 5.1 reports the Kullback-Leibler divergence from the large test set (essentially the true distribution) to the approximating distribution formed from the small training sample. Smaller values indicate that the approximating distribution is of higher quality. Performance is fairly consistent across the four dimensionalities. LGM and GLASSO were the best overall performers, but some scenarios were better suited to one than the other. LGM was the best algorithm for modeling the ‘tenStrong’ scenario that involves tightly interrelated groups of ten variables. LGM also did best at the ‘markov’ scenario, usually outperforming the minimum spanning tree (‘mst’ algorithm) even though the true dependence structure exactly matches the implicit assumptions of the minimal spanning tree algorithm.

Table 5.1: Empirical Kullback-Leibler divergence between data generating processes and models of those processes based on 25 samples, averaged over 5 iterations. Results are shown for problem dimensionalities ranging from 25 to 100 and for six factor structures. For each scenario, six algorithms were given an opportunity to construct an approximating model.

However, GLASSO outperformed LGM for the ‘constantCorrelation’ scenario in which a common factor pervasively influences all variables. Because LGM is not equipped to directly model global factors, it makes sense that it would fall short here. The mleD algorithm - simply the maximum likelihood estimate of covariance with a little shrinkage toward the diagonal to ensure positive definiteness - is the worst overall, but when the appropriate amount of shrinkage is calibrated via cross-validation, as in the ‘shrinkage’ algorithm, performance becomes decent, if rarely more than that.

The results of the second experiment are summarized in table 5.2. For the most part, they confirm some results from the first experiment: LGM has the best performance for the ‘tenStrong’ scenario in which a small number of its local models can span each of the groups of correlated variables, but GLASSO has the best performance for the ‘constantCorrelation’ model. Intriguingly, the ‘markovMulti’ scenario is best modeled by GLASSO in low dimensions but LGM in higher dimensions. On closer inspection, it can be seen that in most cases the divergence increases slightly more quickly than linearly as the dimension increases. However, when dimension doubles from 250 to 500, the divergence of the GLASSO-constantCorrelation model triples, and the divergence of the ‘markovMulti’ model quintuples. In this example, the GLASSO algorithm appears to become less effective for problems of this dimension or higher.

5.2.2 Sensitivity to Dimension

It makes sense to see approximately linear increases in Kullback-Leibler divergence as dimension increases. The divergence is computed from average log likelihoods, and the magnitudes of those log likelihoods should be proportional to the dimensionality of the data (all else equal). However, we are seeing that the divergence increases somewhat more quickly than linearly in some cases. This suggests that the quality of the models deteriorates as dimension increases. While regrettable, this makes sense, as in higher dimensions there should tend to be more spurious correlations.

Table 5.2: Empirical Kullback-Leibler divergence between data generating processes and models of those processes based on 100 samples. Results are shown for problem dimensionalities ranging from 50 to 500 and for three factor structures. For each scenario, six algorithms were given an opportunity to construct an approximating model.

Figure 5-2: The ratio of approximating models’ Kullback-Leibler divergence from the true distribution is shown on the y-axis. As the data dimension (x-axis) increases, this ratio tends to increase as well. The ratio increases at different rates for different algorithms.

A simple measure that will help quantify this deterioration is the following:

퐾퐿퐷퐷 = 퐾퐿퐷 / 푚 (5.7)

where 퐾퐿퐷 is the Kullback-Leibler divergence from the true distribution to the approximating distribution, and 푚 is the dimensionality of the model. Thus 퐾퐿퐷퐷 is divergence per dimension (for example, in a model of individuals, it would be divergence per capita). Clearly it is good if 퐾퐿퐷퐷 is low. But for those interested in high dimensional problems, it is also important that 퐾퐿퐷퐷(푚) not increase too quickly with 푚.

In figure 5-2, the model most akin to simply using the maximum likelihood estimate degrades rapidly as the dimension of the data increases. This serves as a reminder that maximum likelihood estimates of covariance matrices behave poorly in high dimension. By contrast, the diagonal model, while it has large divergence per variable, is essentially flat. Because there is no potential for misestimation, the quality of the diagonal model does not degrade as dimension increases.

The remaining (better) models can be more easily seen in figure 5-3, where the mleD and diagonal models have been omitted. The quality of the minimum spanning tree and shrinkage models degrades materially as the dimension increases from 25 to 100. However, the LGM and GLASSO KLDD values increase only marginally. This indicates that these latter algorithms scale better than the others into higher dimensions.

To further explore LGM’s performance in high dimension, the LGM algorithm was used to generate models for dimensionalities ranging from 25 to 1,500 with a dataset of just 50 samples generated using the markovMulti (AR(3)) factor structure. Some of the key results are illustrated in figure 5-4, and full results can be found in the appendix. The monotonically increasing orange line suggests that the model quality degrades as the dimension increases. However, even at a dimension of 1,500, the LGM model is better than a simple (and hard to mis-estimate) diagonal model; its structure adds value.

Figure 5-3: The ratio of approximating models’ Kullback-Leibler divergence from the true distribution is shown on the y-axis. As the data dimension (x-axis) increases, this ratio tends to increase as well. The ratio increases at different rates for different algorithms.

Figure 5-4: Modeling characteristics, adjusted for dimension. A characteristic that scales linearly with the dimension of the dataset would be shown as a horizontal line (constant proportion). The Kullback-Leibler divergence per dimension is shown in orange. For comparison, the divergence for two runs of the diagonal model (zero complexity) are shown in blue and yellow. In blue is the computation time per dimension.

5.2.3 Computation Time

The LGM algorithm incurs some computational overhead when it calibrates the local models; as a result, as the dimension increases from 25 to 100, in the example in figure 5-4 that overhead time is amortized over more variables, and the computation time per dimension declines. As the dimension increases from 100 to 1,500, however, the computation time per dimension increases roughly linearly. This suggests that LGM’s computation time is approximately quadratic as a function of dimension for problems like this. As an informal confirmation of this observation, figure 5-5 illustrates that the total computation time for LGM increases superlinearly with dimension.

Figure 5-5: Computation time increased as the dimension of the dataset increased. The average computation time was taken over three runs of each algorithm at each dimension, once over the ‘markovMulti’ factor structure, once over the ‘constantCorrelation’ factor structure, and once over the ‘tenStrong’ factor structure. Each dataset consisted of 100 samples. The simple heuristic methods are seen to be much faster than LGM. The cross-validated GLASSO implementation used in this study was faster than LGM for medium dimensions but slower than LGM when the dimension became quite large.

5.2.4 LGM vs. GLASSO: Sparsity Case Studies

Sparsity is a desirable property of explanatory models because it increases interpretability. When the statistical model is used as a guide to hypothesis generation and future experimentation, sparser models may provide the further benefit of a lower ‘false positive’ rate of candidate hypotheses. In the previous section we saw that, in our stylized scenario, LGM models generally had a lower rate of false positives than GLASSO.

LGM tends to generate models with fewer parameters than GLASSO when both algorithms are tuned through cross-validation. However, by increasing the GLASSO 퐿1 penalty above its optimal level, it is possible to induce the GLASSO to produce a model as sparse as that generated by LGM. Furthermore, by computing GLASSO models for a variety of complexity parameters, one can compute a two-dimensional ‘frontier’ illustrating the tradeoff between sparsity and Kullback-Leibler divergence from the true distribution to the model distribution. Figure 5-6 shows that, for a variety of scenarios, the LGM model lies at or beyond the feasible frontier of GLASSO models. While figure 5-6 compares LGM and GLASSO for just one training set for each factor structure, the results are fairly consistent from run to run. To provide a sense of the consistency of the results, figure 5-6 is regenerated using a new training set and a new run of the LGM and GLASSO algorithms; the second version of the figure can be found in the appendix.

For the ‘ten’ factor structure, although the best GLASSO model has less divergence than the LGM model, the GLASSO models with complexity similar to that of the LGM model have worse divergence. By contrast, LGM outperforms all the ‘tenStrong’ GLASSO models. When the underlying factor structure is independent, LGM does as well as the best GLASSO model: both identify that there is not much structure. In the constant correlation example, the main observation is that while GLASSO can produce a better model, it cannot produce a better model at the same complexity.

Figure 5-6: For each factor structure, 50 samples of dimension 100 were modeled using LGM and using GLASSO with the complexity parameter set to a range of values. The resulting complexities and Kullback-Leibler divergences are plotted. The LGM outcome is plotted in green, while the frontier of GLASSO outcomes is plotted in blue.

However, there is an interesting kink in the GLASSO trajectory. This may be because the thresholding approach used to identify complexity disregards precision matrix entries more than three orders of magnitude smaller than the largest off-diagonal entry. Consequently, the GLASSO models likely are even more complex than indicated, and many of those additional dependencies are likely very weak.

In the ‘markov’ example LGM outperforms GLASSO, and especially so when complexity is held constant. In the ‘markovMulti’ example LGM underperforms the best GLASSO model but outperforms it for the same complexity.

It would be of interest to know how LGM performs in high dimensions compared to GLASSO. Unfortunately, the implementation of cross-validated GLASSO used in this research fails to converge in reasonable time for dimensions of 1,000 or greater.

Chapter 6

LGM-R: A Sparse Regression Algorithm

Researchers who need to explain or predict a variable based on a dataset that is high dimensional and includes relatively few samples often turn to sparse regression techniques. The LASSO and Elastic Net are popular algorithms for sparse regression. Both modify the regression objective function by adding penalty terms (LASSO adds an 퐿1 penalty while the Elastic Net adds both an 퐿1 and an 퐿2 penalty). These methods are fairly successful in practice but not perfect; for example, they can bias the regression coefficients toward zero, away from their true values. In this section we present a new sparse regression algorithm (dubbed LGM-R) that utilizes the LGM algorithm.

While the LASSO and Elastic Net algorithms were designed specifically for sparse regression, the LGM-R algorithm is an example of how one can insert a general-purpose, sparse precision matrix into a traditional algorithm to make that algorithm work even when the dimension is high and sample size is small. Traditional regression has the form 훽̂ = 푦ᵀ푋(푋ᵀ푋)⁻¹.

LGM-R uses this traditional formulation but makes some substitutions. First, we use LGM to estimate a sparse precision matrix Λ(푦,푋),(푦,푋), and we calculate its inverse, Σ(푦,푋),(푦,푋). Then we simply replace 푦ᵀ푋 with Σ푦,푋, and we replace 푋ᵀ푋 with Σ푋,푋. As a small but important detail, by exploiting the properties of Markov random fields, we can avoid inverting the potentially-large Σ푋,푋 matrix.

Simulation results suggest that LGM-R may outperform LASSO and the Elastic Net in three scenarios: when the sample size is especially small (less than half the dimension of the dataset), when variables ‘clump’ into tightly interrelated groups, and when there are no hidden variables. In addition, LGM-R may be the best algorithm for applications that require especially sparse linear models, as LGM-R tends to produce sparser models than LASSO or Elastic Net.

6.1 The LGM-R Algorithm

Suppose we have 푛 instances of some random variables {푥1, ..., 푥푚}, and we have an interest in one particular variable, 푥푗. Furthermore, suppose 푛 < 푚. We would like to have a linear model that predicts 푥푗 based on the remaining 푚 − 1 variables. However, with 푛 < 푚, we cannot use OLS. A traditional solution would be to estimate coefficients using LASSO or the Elastic Net.

Alternatively, we might choose to capitalize on the sparse structure and cross-validated local models generated by the LGM algorithm. As a reminder, the LGM algorithm outputs a junction forest that spans {푥1, ..., 푥푚}. Thus to model variable 푥푗, we follow a simple process (a compact sketch of these steps follows the list):

1. Run the LGM algorithm on all the variables to generate a junction forest.

2. Identify the set of cliques that include variable 푥푗.

3. Identify the set of variables in those cliques (excluding 푥푗); call this set 푆. Conditional on 푥푆, the variables in 푆, 푥푗 is independent of all variables not in 푆.

4. Construct the joint distribution of 푥푗 and 푥푆 based on the subtree consisting of the cliques that include variable 푥푗. We can partition the covariance matrix of the LGM model distribution as:

Σ = ⎡ Σ푥푗,푥푗   Σ푥푗,푥푆 ⎤
    ⎣ Σ푥푆,푥푗   Σ푥푆,푥푆 ⎦ .

5. Find the conditional distribution of 푥푗 given 푥푆 using the Schur complement. This is

푝(푥푗 | 푥푆) = 푁(퐸[푥푗] + Σ푥푗,푥푆 Σ푥푆,푥푆⁻¹ 푥푆, Σ푥푗,푥푗 − Σ푥푗,푥푆 Σ푥푆,푥푆⁻¹ Σ푥푗,푥푆ᵀ),

and when the unconditional means of 푥푗 and 푥푆 are 0 (true by construction if we center the variables in advance), the conditional mean of 푥푗 is

퐸[푥푗 | 푥푆] = Σ푥푗,푥푆 Σ푥푆,푥푆⁻¹ 푥푆. (6.1)

6. In traditional regression syntax, 훽̂푆 = Σ푥푗,푆 Σ푆,푆⁻¹, while the other beta coefficients are all 0. That is, our model of 푥푗 will generally indicate non-zero exposure to variables found in the same cliques as 푥푗 and will always indicate zero exposure to all other variables.
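The steps above can be collected into a small helper; the function and variable names here are illustrative, and the covariance Sigma is assumed to come from the fitted LGM model.

    % Sparse regression coefficients for target variable j (steps 2-6 above).
    function beta = lgmrCoefficients(Sigma, cliqueMembers, j)
        % cliqueMembers is a logical numCliques-by-m matrix of clique membership.
        inClique = cliqueMembers(:, j);                   % cliques containing x_j (step 2)
        S = find(any(cliqueMembers(inClique, :), 1));     % variables sharing a clique with x_j
        S = setdiff(S, j);                                % exclude x_j itself (step 3)
        beta = zeros(size(Sigma, 1), 1);
        % Conditional-mean coefficients via the Schur complement (steps 5 and 6).
        beta(S) = Sigma(j, S) / Sigma(S, S);              % Sigma_{x_j,S} * inv(Sigma_{S,S})
        % Every other coefficient remains exactly zero.
    end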

6.2 Evaluation Methodology

Linear models are applied to a wide variety of datasets, and particular algorithms for generating linear models may work better in some situations and worse in others. As an initial exploration of the performance of LGM-R and comparison of its performance against that of LASSO and Elastic Net, 30 datasets of dimension 100 were generated. Each dataset corresponds to one of the six factor structure scenarios and contains either 20, 40, 60, 80, or 100 samples. Each algorithm generates 100 linear models for each dataset: one predicting the value of each of the 100 variables in the dataset based on the values of the 99 other variables.

Once the models have been constructed, an additional 100,000 samples are generated and serve as a test set. The 푅2 of the test set serves as the principal measure of success for these linear models. Note that the definition of 푅2 used here, for some dependent variable 푦 and independent variables 푧, is:

푅² = 1 − Σ_{푗=1}^{100,000} (푦푗 − 훽̂ · 푧푗)² / Σ_{푗=1}^{100,000} 푦푗² (6.2)
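Computed directly on a held-out test set, (6.2) amounts to a few lines; ytest, Ztest, and beta are illustrative names for the centered target column, the matrix of the other variables, and the fitted coefficient vector.

    % Out-of-sample R-squared per equation (6.2).
    resid = ytest - Ztest * beta;                 % beta is a column vector of coefficients
    R2 = 1 - sum(resid .^ 2) / sum(ytest .^ 2);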

Because we have built models to explain each 푥푖 for 푖 ∈ {1, ..., 100}, we have 100 푅² values, 푅푖², 푖 ∈ {1, ..., 100}, that reflect the performance of each estimation method applied to each of the 30 scenarios. To summarize how well an estimation algorithm works in each of the scenarios, we average the 푅² values: 푅̄² = (1/100) · Σ_{푖=1}^{100} 푅푖².

As a secondary consideration, we prefer linear models that consistently work well to ones with more erratic performance; modeling approaches with highly variable performance (as measured by 푅2) cannot be as thoroughly relied upon. Following the example of the mean, we also calculate and report the standard deviation of each hundred 푅푖² values.

Finally, we prefer linear models that are simple, on grounds of interpretability and also the broad principle of Occam’s Razor. Accordingly, we count the number of non-zero coefficients (out of a maximum of 99) used in each of the hundred linear models and average them to obtain the average complexity. Looked at slightly differently, this is the average number of conditional dependencies of the target variable implied by the model.

6.3 Conventional Algorithms for Sparse Linear Modeling

As a benchmark for LGM-R, we also build linear models using Matlab’s implementations of cross-validated LASSO and Elastic Net. For Elastic Net, we use a parameter of .5 ∈ (0, 1] to give weight to both the 퐿1 and 퐿2 penalty terms.

Two popular heuristics for selecting the appropriate complexity are supported by the Matlab implementation of these algorithms. One heuristic selects the complexity parameter that minimizes validation error. The other heuristic chooses the complexity parameter so as to minimize complexity as much as possible while keeping the validation error within one standard deviation of its minimum. For completeness we evaluate both variations of each method, for a total of four benchmark models (a brief sketch of both heuristics follows the list):

∙ LASSO CV MAX

∙ LASSO CV 1STD

∙ ELASTIC NET CV MAX

∙ ELASTIC NET CV 1STD
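Both heuristics are exposed by Matlab’s cross-validated lasso interface; a minimal sketch follows, with Xother and y as illustrative names for the predictors and the target.

    % The two complexity heuristics for the LASSO and Elastic Net benchmarks.
    [B, FitInfo] = lasso(Xother, y, 'CV', 10);                  % LASSO
    betaMax  = B(:, FitInfo.IndexMinMSE);                       % LASSO CV MAX
    beta1Std = B(:, FitInfo.Index1SE);                          % LASSO CV 1STD
    [Be, FitInfoE] = lasso(Xother, y, 'CV', 10, 'Alpha', 0.5);  % Elastic Net, alpha = .5
    betaEMax  = Be(:, FitInfoE.IndexMinMSE);                    % ELASTIC NET CV MAX
    betaE1Std = Be(:, FitInfoE.Index1SE);                       % ELASTIC NET CV 1STD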

6.4 Results

When the size of the dataset was small compared to its dimension, LGM-R models had better out-of-sample performance as measured by 푅2 than competing models for five of the six factor structures. However, as the size of the dataset approached the dataset’s dimension, the results became more equivocal; most algorithms did well, and LGM-R was typically near the middle of the pack. Turning to measures of consistency and simplicity, LGM-R was generally among the best or the best. On average, LGM-R had the most consistent 푅2 values and produced linear models that were at least as sparse as those generated by any other methodology.

Figure 6-1 illustrates each model’s usefulness in predicting out-of-sample data. Several conclusions stand out. First, as one might expect, all of the models work better when trained on a larger data set. Second, when the number of samples is sharply smaller than the dimensionality of the data, LGM-R tends to outperform LASSO and the Elastic Net. When samples are especially scant, LGM’s local approach to learning structure and optimal robustness appears to lend it an edge. Finally, we note that the ‘conservative’ cross-validation heuristic of choosing the simplest complexity parameter within one standard deviation of the best parameter value yields, on average, suboptimal models.

The fit of the linear models depends strongly on the nature of the underlying data.

Figure 6-1: Average goodness of fit of linear models produced by each of five methodologies. Sample size varies along the x-axis; LGM-R has better performance for small sample size but typical performance when the sample reaches the dimensionality of the data (namely, 100).

Figure 6-2: The averages of the 푅2 values of the hundred models in each scenario with sample size 20 are shown in the top panel. Those of scenarios with sample size 100 are shown in the bottom panel. LGM-R models based on the LGM algorithm are shown in light blue. LGM-R outperforms in most scenarios when the training set consists of just 20 samples. It outperforms in fewer scenarios when the training set consists of 100 samples.

Figure 6-3: All methodologies produce models with more consistent goodness of fit as more data is made available (x-axis). LGM-R generates models with the most consistent performance (smallest average standard deviation of 푅2).

The upper panel of figure 6-2 illustrates that when the sample size is small, the average performance of LGM-R is the best across all factor structures except constant correlation. In the constant correlation structure, all variables contain an equal amount of information about each other, so the truly best model would contain all variables. Consequently, it makes sense that the sparser modeling approaches, including LGM-R, underperform in this context.

The lower panel illustrates that even when more samples become available, LGM-R still works best for data with a sparse true factor structure (the ‘markov’ and ‘markovMulti’ scenarios, which are AR(1) and AR(3) processes). However, LASSO and Elastic Net may work as well or better in other scenarios.

In some applications it may be important not only that the average 푅2 of a collection of linear models be high, but also that the goodness of fit vary as little as possible from model to model. In figure 6-3 we see that LGM-R generated models with the most consistent 푅2 values. Coming in second for consistency were the conservatively cross-validated LASSO and Elastic Net models.

Figure 6-4: Average number of non-zero coefficients used in linear models produced by each of five methodologies. Sample size varies along the x-axis; LGM-R is the light blue line. All methodologies tend to employ richer models (more factors) as more data is made available.

Least consistent were the conventionally cross-validated LASSO and Elastic Net models. As a note, the standard deviations of 100 푅2 values were calculated for each model / factor structure / sample-size combination; to obtain the values shown in figure 6-3 the resulting standard deviations were averaged across factor structures. Disaggregated results can be found in the appendix.

In figure 6-4 we show the average number of non-zero coefficients used in each linear model, averaged across variables and factor structures. In the LASSO and Elastic Net models, the number of factors selected is suppressed by the penalty parameter (which is selected via cross-validation). In the LGM-R models, it reflects the average number of neighbors (conditional dependencies) that each variable has when one considers the network structure of the variables’ dependencies.

6.5 Discussion

LGM-R tended to outperform LASSO and Elastic Net (based on 푅2) when the training set was very small or when its sparse factor structure was a good match for the data. In addition, the LGM-R linear models tended to be both sparse and of a consistent quality. Consequently, LGM-R has the potential to be a useful element of researchers’ toolkits.

One other important consideration is run-time. LGM-R requires the LGM algorithm to finish before it can generate any linear models. However, once LGM is done, it generates any number of individual models almost instantaneously. By contrast, every model generated by LASSO or Elastic Net takes a similar increment of time. Consequently, for time-sensitive applications, LGM-R may enjoy a comparative advantage when many models are required but a disadvantage when just a single model is required.

Chapter 7

Portfolio Optimization

Mean variance optimization is a procedure designed to identify the best portfolio for an investor to hold. However, investors and investment management professionals are often reluctant to use it because its results are easily polluted by any errors in the investor’s estimate of the covariances of the universe of candidate investments. In an effort to limit the effect of these errors, investors sometimes use covariance shrinkage methods, but only with limited success. Here we evaluate whether LGM, GLASSO, or other methods of covariance modeling might allow an investor to build a portfolio that outperforms one that relies on traditional covariance shrinkage. Our results suggest that the practitioner who can be troubled to use more sophisticated covariance estimation techniques than cross-validated shrinkage may be rewarded by higher risk-adjusted returns.

In the formulation of the mean variance optimization problem, the investor is assumed to prefer higher returns and lower risk, where risk is quantified by the anticipated variance of the portfolio. One way to formulate the optimization problem is to specify a maximum acceptable level of risk in terms of standard deviation, 휎푀퐴푋, and identify the portfolio, 푤*, that maximizes returns subject to this constraint:

푤* = argmax푤 휇ᵀ · 푤 (7.1)
s.t. 푤ᵀ · Σ · 푤 ≤ 휎²푀퐴푋

where 휇 is a vector of the average returns of the candidate investment assets and Σ is the covariance matrix describing the expected covariation of the candidate investments. As long as 휇 does not consist entirely of zeros, this problem has a convenient analytic solution:

푤* = (휎푀퐴푋 / √(휇Σ⁻¹휇′)) · 휇Σ⁻¹. (7.2)

In the expression on the right hand side, 휇 · Σ⁻¹ determines the optimal ratio (and sign) of investments and thus dictates the risk-adjusted returns of the portfolio. Meanwhile, the fraction 휎푀퐴푋/√(휇Σ⁻¹휇′) scales the portfolio to the desired level of volatility. Errors in estimating 휇Σ⁻¹ will reduce the risk-adjusted return, while errors in estimating 휎푀퐴푋/√(휇Σ⁻¹휇′) will result in the investor inadvertently taking too much or too little risk. Errors in the estimation of Σ can be heavily magnified when the matrix is inverted, so we should expect them to adversely affect both the investor’s risk adjusted returns and the investor’s ability to target the desired amount of risk, 휎²푀퐴푋.

In their recently published work, Goto and Xu [6] find that using a GLASSO-based estimate of the precision matrix can improve the performance of mean-variance optimization. In this spirit, we run a simple simulation experiment to assess whether the sparse precision matrix generated by LGM can similarly be useful. Much like in the LGM-R application, our first step is to use LGM to generate a sparse precision matrix, Λ. We then insert that matrix into the conventional analytic solution to the problem. In this case, we obtain

푤*_퐿퐺푀 = (휎푀퐴푋 / √(휇Λ휇′)) · 휇Λ. (7.3)
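In code, (7.3) is a one-line substitution of the LGM precision matrix into the analytic solution; mu, Lambda, and sigmaMax are illustrative names for the expected-return row vector, the estimated precision matrix, and the risk budget.

    % Mean variance weights from an estimated precision matrix, per (7.3).
    raw = mu * Lambda;                                 % optimal ratios (and signs) of holdings
    w = (sigmaMax / sqrt(mu * Lambda * mu')) * raw;    % scale to the volatility target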

7.1 Experimental Design

As a simple experiment to assess whether LGM may be useful in portfolio optimization, we analyze a dataset of simulated daily returns of 200 hypothetical candidate investments. In particular, we generate a training set of 42 trading days (two months) of multivariate Gaussian returns, use several models to build corresponding precision matrices, insert the resulting precision matrices into the analytic portfolio optimization solution, and evaluate the Sharpe ratio of the result over a simulated dataset of 100,000 out-of-sample trading days.

To highlight the importance of factor structure, the experiment is run for two different factor structures. Neither is intended to represent any particular investment universe, but merely to be representative of some characteristics one might find in an investment universe. In both distributions, the average arithmetically annualized return of each asset is a pseudorandom number generated from a uniform distribution over [0, .02], and the annualized volatility of each asset is 20%. The true expected returns of the assets, 휇, are assumed to be known to the investor. However, the investor must learn the variances and covariances of the returns from the training set.

The first distribution uses a one-factor structure: the variance of each investment is half attributable to a common factor and half attributable to idiosyncratic variance. The second distribution has more structure. For each asset, 1/10 of the variance is due to idiosyncratic variance, 4/10 is due to a common factor shared with four other closely-related investments, 3/10 is due to a more broadly-encountered common factor shared with a tenth of all candidate investments, and 2/10 is due to a ubiquitous risk factor common to all candidate investments.

7.2 Simulated Portfolio Performance

Figure 7-1 summarizes the results of the simulation study and suggests several conclusions. First, when the investment universe includes many small groups of highly correlated investments that have differentiated expected returns, using precision matrix estimation algorithms that can accurately identify those groups of investments may make it possible to generate substantially higher returns. LGM, GLASSO, and the minimal spanning tree were able to outperform the other methods in this scenario. Of these three algorithms, LGM performed best, an echo of its success building a sparse model of data generated from the ‘tenStrong’ factor structure (see chapter 5 for details).

Figure 7-1: Simulated mean variance optimized portfolio returns, averaged over six repetitions of each experiment. A high Sharpe ratio corresponds to high risk-adjusted returns. Error bars show the standard error of the estimates.

However, in the single factor scenario where returns are modeled as 푟푖 = 푓 + 휖푖 (푓 is a common factor, and 휖푖 is i.i.d. across investments), LGM performed worse than GLASSO, shrinkage, or mleD. LGM’s underperformance in the constant correlation scenario is consistent with its underperformance as measured by Kullback-Leibler divergence in chapter 5 and as measured by 푅2 in chapter 6. LGM appears to be relatively unsuitable for the constant correlation data because its sparse structure is inconsistent with the dense, uniform network of true dependencies in that data.

These results support the general finding of Goto and Xu that using a carefully constructed sparse precision matrix may improve a portfolio’s risk adjusted returns. An interesting further implication, however, is that there may be an advantage in understanding the general structure of the underlying distribution of returns of the candidate investments and using that information to select the most appropriate algorithm for generating a precision matrix for use in portfolio optimization. When there are groups of highly correlated candidate investments, the best algorithm may be LGM.

Chapter 8

Conclusion

We presented a new algorithm, LGM, for estimating precision matrices based on datasets with few samples but high dimension. Well-estimated precision matrices are essential components of widely used algorithms such as regression, portfolio optimization, and discriminant analysis.

LGM differs from prior work in two key ways. First, unlike much recent work in this area, LGM’s optimization routine uses an 퐿0, not 퐿1, penalty on model complexity. Nonetheless, it can run on 1000+ dimensional datasets. Second, LGM learns through cross-validation how to construct robust ‘local’ models describing the joint distributions of handfuls of variables, and then it fuses those local models into a global model.

Future work on LGM may take several directions. First, the tree structure used in the algorithm may make it highly parallelizable, so future work may result in substantial reductions in runtime. Second, because LGM’s optimization routine relies on properties of the exponential family of distributions, with only minor adjustments LGM should also operate on discrete inputs modeled using the multivariate categorical distribution. Third, an improved algorithm could be employed for the construction of robust local models.

Experiments on synthetic datasets suggested that LGM may outperform the popular GLASSO algorithm when the data’s true dependence structure is relatively sparse or involves groups of highly correlated variables. Simulation also suggests that LGM generally produces sparser models than GLASSO and may perform relatively better versus GLASSO as the dimension of the data increases.

In addition to the core LGM algorithm, we demonstrated that an LGM-generated precision matrix can add value as part of a sparse regression algorithm (LGM-R) or as part of mean variance portfolio optimization. Through these applications, we illustrated that judicious use of well-estimated sparse precision matrices can make it possible to use algorithms like linear regression and portfolio optimization even when the number of samples in the dataset is much smaller than its dimension.

Appendix A

Tables

Table A.1: Simulation results for scenarios of 25 variables with 25 samples. Included are the average log likelihood of the test set under each of the approximating models, the computation time, the entropy of the approximating model, the entropy of the distribution of the test set, the Kullback-Leibler divergence from the test set to the approximating distribution, the number of conditional dependencies (a measure of model complexity) in each approximating model, the (somewhat arbitrary) number of ‘top’ dependencies chosen based on the test set, and the precision and recall of the approximating models when they are viewed as algorithms that classify each dependency as existing or not.

Table A.2: Simulation results for scenarios of 50 variables with 25 samples. Included are the average log likelihood of the test set under each of the approximating models, the computation time, the entropy of the approximating model, the entropy of the distribution of the test set, the Kullback-Leibler divergence from the test set to the approximating distribution, the number of conditional dependencies (a measure of model complexity) in each approximating model, the (somewhat arbitrary) number of ‘top’ dependencies chosen based on the test set, and the precision and recall of the approximating models when they are viewed as algorithms that classify each dependency as existing or not.

Table A.3: Simulation results for scenarios of 75 variables with 25 samples. Included are the average log likelihood of the test set under each of the approximating models, the computation time, the entropy of the approximating model, the entropy of the distribution of the test set, the Kullback-Leibler divergence from the test set to the approximating distribution, the number of conditional dependencies (a measure of model complexity) in each approximating model, the (somewhat arbitrary) number of ‘top’ dependencies chosen based on the test set, and the precision and recall of the approximating models when they are viewed as algorithms that classify each dependency as existing or not.

Table A.4: Simulation results for scenarios of 100 variables with 25 samples. Included are the average log likelihood of the test set under each of the approximating models, the computation time, the entropy of the approximating model, the entropy of the distribution of the test set, the Kullback-Leibler divergence from the test set to the approximating distribution, the number of conditional dependencies (a measure of model complexity) in each approximating model, the (somewhat arbitrary) number of ‘top’ dependencies chosen based on the test set, and the precision and recall of the approximating models when they are viewed as algorithms that classify each dependency as existing or not.

Table A.5: Simulation runtimes for all algorithms on scenarios of 50, 100, 250, and 500 variables each with 100 samples for the tenStrong, constantCorrelation, and markovMulti factor structures.

Table A.6: Simulation results for LGM on scenarios of 25 to 1,500 variables, each with 50 samples. Included are the average log likelihood of the test set under each of the approximating models, the computation time, the entropy of the approximating model, the entropy of the distribution of the test set, the Kullback-Leibler divergence from the test set to the approximating distribution, the number of conditional dependencies (a measure of model complexity) in each approximating model, the (somewhat arbitrary) number of ‘top’ dependencies chosen based on the test set, and the precision and recall of the approximating models when they are viewed as algorithms that classify each dependency as existing or not.

Table A.7: Averages of out-of-sample linear model 푅2 values. As a basis for these results, 30 datasets of dimension 100 were generated. The datasets were distinguished by having one of six factor structures and one of five sample sizes. For each dataset, a sparse linear model was generated using five algorithms and the out-of-sample 푅2 was computed using a test set of 100,000 samples drawn from the same distribution. The average of the 100 푅2 values for each dataset and estimation algorithm is reported here.

Table A.8: Standard deviations of out-of-sample linear model 푅2 values. As a basis for these results, 30 datasets of dimension 100 were generated. The datasets were distinguished by having one of six factor structures and one of five sample sizes. For each dataset, a sparse linear model was generated using five algorithms and the out-of-sample 푅2 was computed using a test set of 100,000 samples drawn from the same distribution. The standard deviation of the 100 푅2 values for each dataset and estimation algorithm is reported here.

Table A.9: Average number of covariates used in linear models. As a basis for these results, 30 datasets of dimension 100 were generated. The datasets were distinguished by having one of six factor structures and one of five sample sizes. For each dataset, a sparse linear model was generated using five algorithms. The average number of factors used in the models generated by each algorithm is reported here.

Appendix B

Figures

Figure B-1: For each factor structure, 50 samples of dimension 100 were modeled using LGM and using GLASSO with the complexity parameter set to a range of values. The resulting complexities and Kullback-Leibler divergences are plotted. The LGM outcome is plotted in green, while the frontier of GLASSO outcomes is plotted in blue.

Bibliography

[1] C.K. Chow and C.N. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3), 1968.

[2] Thomas Cover and Joy Thomas. Elements of Information Theory. John Wiley & Sons, 2006.

[3] A. P. Dempster. Covariance selection. Biometrics, 28(1), 1972.

[4] Emanuel Parzen et al., editors. Selected Papers of Hirotugu Akaike, chapter 6.1, Prediction and Entropy. Springer, 1998.

[5] Jianqing Fan et al. An overview of the estimation of large covariance and precision matrices. The Econometrics Journal, 2016.

[6] Shingo Goto and Yan Xu. Improving mean variance optimization through sparse hedging restrictions. Journal of Financial and Quantitative Analysis (JFQA), 50, 2015.

[7] Arthur E. Hoerl and Robert W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.

[8] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 2008.

[9] Daphne Koller and Nir Friedman. Probabilistic Graphical Models, chapter A.1. MIT Press, 2009.

[10] Olivier Ledoit and Michael Wolf. Honey, I shrunk the sample covariance matrix. Journal of Portfolio Management, 31(1), 2004.

[11] Zhe Liu and John Lafferty. Blossom tree graphical models. Advances in Neural Information Processing Systems, 27, 2014.

[12] Richard O. Michaud. The Markowitz optimization enigma: Is ‘optimized’ optimal? Financial Analysts Journal, 1989.

[13] Tamás Szántai and Edith Kovács. Hypergraphs as a mean of discovering the dependence structure of a discrete multivariate probability distribution. Annals of Operations Research, 2012.

99 [14] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267–288, 1996.

[15] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society. Series B (Methodological), 67(2):301–320, 2005.
