2nd International Conference on Control and Fault-Tolerant Systems, SysTol'13, October 9-11, 2013, Nice, France

Covariance Estimation in Two-Level Regression

Nicholas Moehle and Dimitry Gorinevsky

Abstract— This paper considers estimation of covariance matrices in multivariate linear regression models for two-level data produced by a population of similar units (individuals). The proposed Bayesian formulation assumes that the covariances for different units are sampled from a common distribution. Assuming that this common distribution is Wishart, the optimal Bayesian estimation problem is shown to be convex. This paper proposes a specialized scalable algorithm for solving this two-level optimal Bayesian estimation problem. The algorithm scales to datasets with thousands of units and trillions of data points per unit by solving the problem recursively, allowing new data to be quickly incorporated into the estimates. An example problem is used to show that the proposed approach improves over existing approaches to estimating covariance matrices in linear models for two-level data.

I. INTRODUCTION

Consider a two-level dataset
\[
\mathcal{D} = \Bigl\{ \{x_i(t),\, y_i(t)\}_{t=1}^{T_i} \Bigr\}_{i=1}^{N}, \tag{1}
\]
where index $i$ is the dataset number and index $t$ is the sample number. Each $x_i(t) \in \mathbf{R}^n$ is a vector of independent variables (regressors), and each $y_i(t) \in \mathbf{R}^m$ is a vector of dependent variables. There are $N$ datasets in total; dataset $i$ includes a total of $T_i$ samples. Each dataset describes a separate individual unit of the overall population.
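To make the indexing in (1) concrete, here is a minimal NumPy sketch of a synthetic two-level dataset; the dimensions, variable names, and generating model below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, m = 5, 3, 2                      # units, regressor dimension, response dimension
B_mean = rng.normal(size=(m, n))       # population-level regression parameters (assumed)

# D = { {x_i(t), y_i(t)}_{t=1}^{T_i} }_{i=1}^{N}: one (X_i, Y_i) pair per unit i.
D = []
for i in range(N):
    T_i = int(rng.integers(50, 100))               # sample counts may differ across units
    B_i = B_mean + 0.1 * rng.normal(size=(m, n))   # unit-specific mean (fixed effects)
    X_i = rng.normal(size=(n, T_i))                # regressors x_i(t) stacked as columns
    V_i = 0.5 * rng.normal(size=(m, T_i))          # residuals v_i(t)
    Y_i = B_i @ X_i + V_i                          # responses y_i(t)
    D.append((X_i, Y_i))
```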
This paper studies linear multivariate regression models of the two-level data (1). For each unit, the model is described by the mean (regression parameters) and covariance of the model residual. The means for different units are assumed to be different; this is known as a model with fixed effects. The covariances for different units are assumed to be different but related; this is a novel formulation.

In the motivating example considered below, the covariance estimates are used in statistical monitoring to find outliers, which are labeled as possibly anomalous data. Accurate covariance estimation is necessary for the monitoring to be accurate. The described two-level formulation can be useful in many other problems requiring accurate estimation of covariance matrices, such as classification, linear discriminant analysis, portfolio management, and others.

The special case $N = 1$ in (1) yields a standard multivariate regression. In that case, accurate estimation of the $m^2$ elements of the covariance matrix requires that $T_1 \gg m$ [7]. For large $m$ this might not hold. For an ill-conditioned covariance matrix, even more data samples are required. Using data from multiple units in a two-level dataset can improve the covariance estimation accuracy.

Covariance matrix estimation for one-level datasets has attracted substantial attention earlier. Several approaches to maximum likelihood estimation (MLE) with shrinkage (regularization) have been proposed; e.g., see [10], [11], where further references can be found. The approaches related to this paper add regularization by using a Bayesian prior in maximum a posteriori probability (MAP) estimation, such as an inverse Wishart prior for the covariance matrix; see [13]. The inverse Wishart is the conjugate prior for the covariance matrix of a multivariate normal distribution. As such, the MAP solution is a generalization of the MLE solution and has the same attractive properties. In particular, the solution can be efficiently computed for large datasets using recursive least squares (RLS) or a related formulation.

Two-level datasets are considered in multivariate analysis of covariance (MANCOVA); e.g., see [17]. In MANCOVA, the null hypothesis is that all data follows the same distribution. The goal is to decide whether the inter-unit and intra-unit covariances are compatible with this hypothesis.

Two-level regression is used in 'soft' applications such as social sciences, biology, economics, and medicine (drug testing) [1], [2], [9], [18]. The most established solution approaches are based on Gibbs sampling [8] and other approximate methods. More scalable methods have been developed to estimate covariance structure for two-level datasets using expectation-maximization (EM) methods as a heuristic to find the maximum-likelihood estimates [3], [12].

There appears to be little prior work on scalable two-level modeling applicable to machine data monitoring. One exception is [6], discussing scalable algorithms for estimating a two-level regression model of aircraft fleet data. In [6], the same covariances are assumed for all units.

The contributions of this paper are as follows. First, we propose a novel two-level linear regression formulation with fixed effects in the regression parameters and in the covariances of the random effects. In this hierarchical Bayesian formulation, the unit covariances are realizations of the generating Wishart distribution. Second, we show that for the proposed model, Bayesian optimal estimation for data (1) yields a (nonlinear) convex optimization problem. Third, we formulate a specialized convex optimization algorithm that solves the estimation problem using recursive updates and is scalable to very large datasets. Finally, numerical examples demonstrate that the proposed formulation improves accuracy of the estimation compared to the known methods.

Nicholas Moehle is with the Department of Mechanical Engineering, Stanford University, [email protected]. Dimitry Gorinevsky is with Mitek Analytics LLC, Palo Alto, CA, [email protected], and the Department of Electrical Engineering, Stanford University, [email protected].

II. BASELINE PROBLEM

This section considers a single-level dataset with $N = 1$ (there is only one unit). In this section we drop the unit index $i = 1$. The dataset (1) becomes
\[
\{x(t),\, y(t)\}_{t=1}^{T}. \tag{2}
\]
As a baseline for introducing the main contribution in the next section, this section briefly recaps the known formulation of multivariate Bayesian linear regression for dataset (2). In what follows, we use $p(\cdot)$ for all probability density functions and $\ell(\cdot)$ for all log likelihood functions. The meaning should be clear from the context.

A. Multivariate Regression

Consider a linear regression model $y(t) = Bx(t) + v(t)$, where $B \in \mathbf{R}^{m \times n}$ is the regression parameter matrix and $v(t) \in \mathbf{R}^m$ is the residual. For data (2), the model can be compactly represented as
\[
Y = BX + V, \tag{3}
\]
where
\[
Y = [\,y(1)\ \cdots\ y(T)\,], \quad X = [\,x(1)\ \cdots\ x(T)\,], \quad V = [\,v(1)\ \cdots\ v(T)\,]. \tag{4}
\]

The Bayesian formulation assumes that the residuals $v(t)$ are independently generated by the normal distribution
\[
v(t) \sim \mathcal{N}(0, \Sigma), \tag{5}
\]
with covariance matrix $\Sigma \in \mathbf{S}^m$, where $\mathbf{S}^m$ is the cone of positive definite $m \times m$ matrices. The log likelihood function for $\Sigma$ and $B$ is
\[
\ell(\Sigma, B \mid \mathcal{D}) = -\tfrac{1}{2}\,T \log|\Sigma| - \tfrac{1}{2}\,\mathrm{tr}\bigl(\Sigma^{-1}(Y - BX)(Y - BX)^T\bigr). \tag{6}
\]

The parameter matrix $B \in \mathbf{R}^{m \times n}$ in (3) is considered a nuisance parameter; no prior for $B$ is specified. As discussed in [11], [19], a prior for the covariance matrix $\Sigma$ is needed if the number of samples $T$ is insufficiently large. A well established approach, which we take as a baseline, is to use an inverse Wishart prior
\[
\Sigma \sim \mathcal{W}^{-1}(\Psi, \nu), \tag{7}
\]
where $\Psi \in \mathbf{S}^m$ is the scale matrix and the positive integer $\nu$ is the number of degrees of freedom. The inverse Wishart is the conjugate prior for the covariance matrix in a multivariate normal distribution.

The log likelihood of the inverse Wishart prior is, up to additive constants,
\[
\ell(\Sigma \mid \Psi, \nu) = -\tfrac{1}{2}(\nu + m + 1)\log|\Sigma| - \tfrac{1}{2}\,\mathrm{tr}\bigl(\Psi\Sigma^{-1}\bigr). \tag{8}
\]
The change of variables $\Omega_1 = \frac{1}{\nu+m+1}\Psi$ and $\alpha_1 = \nu+m+1$ transforms (8) into the log prior
\[
\ell(\Sigma \mid \Omega_1, \alpha_1) = -\tfrac{1}{2}\,\alpha_1 \log|\Sigma| - \tfrac{1}{2}\,\alpha_1\, \mathrm{tr}\bigl(\Omega_1\Sigma^{-1}\bigr). \tag{9}
\]

The resulting log posterior has the form
\[
\ell(B, \Sigma \mid \mathcal{D}) = -\tfrac{1}{2}(T + \alpha_1)\log|\Sigma| - \tfrac{1}{2}\,\alpha_1\, \mathrm{tr}\bigl(\Omega_1\Sigma^{-1}\bigr) - \tfrac{1}{2}\,\mathrm{tr}\bigl((Y - BX)^T\Sigma^{-1}(Y - BX)\bigr). \tag{10}
\]
The maximum a posteriori probability (MAP) estimates of $B$ and $\Sigma$ minimize the negative log posterior (10); this minimization problem is convex in the variables $\Sigma^{-1}$ and $\Sigma^{-1}B$. The first-order optimality conditions can be solved analytically (e.g., see [17]), yielding
\[
B = YX^T(XX^T)^{-1}, \tag{11}
\]
\[
\Sigma = \frac{1}{T + \alpha_1}\Bigl((Y - BX)(Y - BX)^T + \alpha_1\Omega_1\Bigr). \tag{12}
\]
In what follows we assume that the scatter matrix $XX^T$ is invertible. In practice, selecting independent regressors leads to invertible $XX^T$.

The estimates (11), (12) can be computed recursively. These estimates can be written in terms of the scatter matrices $XX^T$, $YX^T$, and $YY^T$. For (11) this is obvious; for (12) this requires expanding the matrix product in the numerator. As additional data become available, the scatter matrices can be updated using rank-one recursion.
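This scatter-matrix form of (11), (12) can be sketched in a few lines of NumPy. The function names and interfaces below are our own illustration, not code from the paper; the sketch uses the identity $(Y - BX)(Y - BX)^T = YY^T - B\,XY^T$, which follows from substituting (11) into the expanded product.

```python
import numpy as np

def init_scatter(n, m):
    # Scatter matrices S_xx = X X^T, S_yx = Y X^T, S_yy = Y Y^T and sample count T.
    return np.zeros((n, n)), np.zeros((m, n)), np.zeros((m, m)), 0

def rank_one_update(S_xx, S_yx, S_yy, T, x, y):
    # Fold one new sample (x, y) into the scatter matrices.
    return (S_xx + np.outer(x, x),
            S_yx + np.outer(y, x),
            S_yy + np.outer(y, y),
            T + 1)

def map_from_scatter(S_xx, S_yx, S_yy, T, Omega_1, alpha_1):
    # (11): B = Y X^T (X X^T)^{-1}, computed without forming the inverse explicitly.
    B = np.linalg.solve(S_xx, S_yx.T).T
    # Expanding the product in (12): (Y - BX)(Y - BX)^T = S_yy - B S_yx^T.
    Sigma = (S_yy - B @ S_yx.T + alpha_1 * Omega_1) / (T + alpha_1)
    return B, Sigma
```

In use, each new sample is absorbed by `rank_one_update`, and `map_from_scatter` can be called whenever fresh estimates are needed, so the raw data never has to be retained.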
III. TWO-LEVEL MULTIPLE REGRESSION PROBLEM

The rest of the paper considers the two-level dataset (1); the model (3), (4) is replaced by $Y_i = B_iX_i + V_i$, $(i = 1,\ldots,N)$, where
\[
Y_i = [\,y_i(1)\ \cdots\ y_i(T_i)\,], \quad X_i = [\,x_i(1)\ \cdots\ x_i(T_i)\,], \quad V_i = [\,v_i(1)\ \cdots\ v_i(T_i)\,]. \tag{13}
\]

A. Baseline Approaches

The formulation of Section II can be applied to two-level covariance estimation in the two ways described below. These are later used to benchmark the proposed approach.

1) Pooled Regression: The multivariate linear regression of Section II can be used for the pooled data
\[
X = [\,X_1\ \cdots\ X_N\,], \quad Y = [\,Y_1\ \cdots\ Y_N\,].
\]
The obtained estimates of $B$ and $\Sigma$ are identical across the population. The obvious deficiency of the pooled formulation is that it ignores differences between the units. If the units are substantially different, the obtained model can be inaccurate. This is illustrated in the example of Section V.

2) Separate Regressions: The second approach is to ignore any similarity between the units. The approach of Section II is then applied separately to each pair of data matrices $X_i$ and $Y_i$. For each unit $i$, separate and unrelated estimates of $B_i$ and $\Sigma_i$ are computed.
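Both baselines reduce to calls of the batch MAP estimator of Section II. The following sketch (our own illustration, with hypothetical helper names; `D` is a list of per-unit `(X_i, Y_i)` pairs as in the earlier dataset sketch) makes the contrast explicit.

```python
import numpy as np

def map_estimates(X, Y, Omega_1, alpha_1):
    # Batch MAP estimates (11), (12) for one pair of data matrices.
    B = np.linalg.solve(X @ X.T, X @ Y.T).T
    V = Y - B @ X
    Sigma = (V @ V.T + alpha_1 * Omega_1) / (X.shape[1] + alpha_1)
    return B, Sigma

def pooled_estimates(D, Omega_1, alpha_1):
    # Pooled regression: stack all units and fit one common (B, Sigma).
    X = np.hstack([X_i for X_i, _ in D])
    Y = np.hstack([Y_i for _, Y_i in D])
    return map_estimates(X, Y, Omega_1, alpha_1)

def separate_estimates(D, Omega_1, alpha_1):
    # Separate regressions: unrelated estimates (B_i, Sigma_i) for each unit.
    return [map_estimates(X_i, Y_i, Omega_1, alpha_1) for X_i, Y_i in D]
```

The pooled estimator discards all inter-unit differences, while the separate estimator discards all similarity between units; the two-level formulation proposed in this paper is designed to exploit both.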