
Covariance Estimation in Two-Level Regression

Nicholas Moehle and Dimitry Gorinevsky

Abstract— This paper considers estimation of covariance matrices in multivariate linear regression models for two-level data produced by a population of similar units (individuals). The proposed Bayesian formulation assumes that the covariances for different units are sampled from a common distribution. Assuming that this common distribution is Wishart, the optimal Bayesian estimation problem is shown to be convex. This paper proposes a specialized scalable algorithm for solving this two-level optimal Bayesian estimation problem. The algorithm scales to datasets with thousands of units and trillions of data points per unit, by solving the problem recursively, allowing new data to be quickly incorporated into the estimates. An example problem is used to show that the proposed approach improves over existing approaches to estimating covariance matrices in linear models for two-level data.

Nicholas Moehle is with the Department of Mechanical Engineering, Stanford University, [email protected]. Dimitry Gorinevsky is with Mitek Analytics LLC, Palo Alto, CA, [email protected], and the Department of Electrical Engineering, Stanford University, [email protected].

I. INTRODUCTION

Consider a two-level dataset

D = { {xi(t), yi(t)}_{t=1}^{Ti} }_{i=1}^{N},    (1)

where index i is the dataset number and index t is the sample number. Each xi(t) ∈ R^n is a vector of independent variables (regressors), and each yi(t) ∈ R^m is a vector of dependent variables. There are N datasets in total; dataset i includes a total of Ti samples. Each dataset describes a separate individual unit of the overall population.

This paper studies linear multivariate regression models of the two-level data (1). For each unit, the model is described by the mean (regression parameters) and covariance of the model residual. The means for different units are assumed to be different; this is known as a model with fixed effects. The covariances for different units are assumed to be different but related; this is a novel formulation.

In the motivating example considered below, the covariance estimates are used in statistical monitoring to find outliers, which are labeled as possibly anomalous data. Accurate covariance estimation is necessary for the monitoring to be accurate. The described two-level formulation can be useful in many other problems requiring accurate estimation of covariance matrices, such as classification, linear discriminant analysis, portfolio management, and others.

The special case N = 1 in (1) yields a standard multivariate regression. In that case, accurate estimation of the m^2 elements of the covariance matrix requires that T1 ≫ m^2 [7]. For large m this might not hold. For an ill-conditioned covariance matrix, even more data samples are required. Using data from multiple units in a two-level dataset can improve the covariance estimation accuracy.

Covariance estimation for one-level datasets has attracted substantial attention earlier. Several approaches to maximum likelihood estimation (MLE) with shrinkage (regularization) have been proposed, e.g., see [10], [11], where further references can be found. The approaches related to this paper add regularization by using a Bayesian prior in maximum a posteriori probability (MAP) estimation, such as an inverse Wishart prior for the covariance matrix, see [13]. The inverse Wishart is the conjugate prior for the covariance matrix of a multivariate normal distribution. As such, the MAP solution is a generalization of the MLE solution and has the same attractive properties. In particular, the solution can be efficiently computed for large datasets using recursive least squares (RLS) or a related formulation.

Two-level datasets are considered in multivariate analysis of covariance (MANCOVA), e.g., see [17]. In MANCOVA, the null hypothesis is that all data follows the same distribution. The goal is to decide whether the inter-unit and intra-unit covariances are compatible with this hypothesis.

Two-level regression is used in 'soft' applications such as social sciences, biology, economics, and medicine (drug testing) [1], [2], [9], [18]. The most established solution approaches are based on Gibbs sampling [8] and other approximate methods. More scalable methods have been developed to estimate covariance structure for two-level datasets using expectation-maximization (EM) methods as a heuristic to find the maximum-likelihood estimates [3], [12].
There appears to be little prior work on scalable two-level modeling applicable to machine data monitoring. One exception is [6], discussing scalable algorithms for estimating a two-level regression model of aircraft fleet data. In [6], the same covariances are assumed for all units.

The contributions of this paper are as follows. First, we propose a novel two-level linear regression formulation with fixed effects in the regression parameters and in the covariances of the random effects. In this hierarchical Bayesian formulation the unit covariances are realizations of the generating distribution. Second, we show that for the proposed model, Bayesian optimal estimation for data (1) yields a (nonlinear) convex optimization problem. Third, we formulate a specialized convex optimization algorithm that solves the estimation problem using recursive updates and is scalable to very large datasets. Finally, numerical examples demonstrate that the proposed formulation improves accuracy of the estimation compared to the known methods.

II. BASELINE PROBLEM

This section considers a single-level dataset with N = 1 (there is only one unit). In this section we drop the unit index i = 1. The dataset (1) becomes

{x(t), y(t)}_{t=1}^{T}.    (2)

As a baseline for introducing the main contribution in the next section, this section briefly recaps the known formulation of multivariate Bayesian linear regression for dataset (2). In what follows, we use p(·) for all probability density functions and ℓ(·) for all log likelihood functions. The meaning should be clear from the context.

A. Multivariate Regression

Consider a linear regression model y(t) = Bx(t) + v(t), where B ∈ R^{m×n} is the regression parameter matrix and v(t) ∈ R^m is the residual. For data (2), the model can be compactly represented as

Y = BX + V,    (3)
where

Y = [ y(1) ··· y(T) ],
X = [ x(1) ··· x(T) ],    (4)
V = [ v(1) ··· v(T) ].

The Bayesian formulation assumes that the residuals v(t) are independently generated by the normal distribution

v(t) ∼ N(0, Σ),    (5)

with covariance matrix Σ ∈ S^m, where S^m is the cone of positive definite m × m matrices.

The log likelihood function for Σ and B is

ℓ(Σ, B | D) = −(T/2) log |Σ| − (1/2) tr( Σ^{−1} (Y − BX)(Y − BX)^T ).    (6)

The parameter matrix B ∈ R^{m×n} in (3) is considered a nuisance parameter; no prior for B is specified. As discussed in [11], [19], a prior for the covariance matrix Σ is needed if the number of samples T is insufficiently large. A well established approach, which we take as a baseline, is to use an inverse Wishart prior

Σ ∼ W^{−1}(Ψ, ν),    (7)

where Ψ ∈ S^m is the scale matrix and the positive integer ν is the number of degrees of freedom. The inverse Wishart is the conjugate prior for the covariance matrix in a multivariate normal distribution.

The log likelihood of the inverse Wishart prior is, up to additive constants,

ℓ(Σ | Ψ, ν) = −(1/2)(ν + m + 1) log |Σ| − (1/2) tr( Ψ Σ^{−1} ).    (8)

The change of variables Ω1 = Ψ/(ν + m + 1) and α1 = ν + m + 1 transforms (8) into the log prior

ℓ(Σ | Ω1, α1) = −(α1/2) log |Σ| − (α1/2) tr( Ω1 Σ^{−1} ).    (9)

The resulting log posterior has the form

ℓ(B, Σ | D) = −(1/2)(T + α1) log |Σ| − (α1/2) tr( Ω1 Σ^{−1} ) − (1/2) tr( (Y − BX)^T Σ^{−1} (Y − BX) ).    (10)

The Maximum A Posteriori Probability (MAP) estimates of B and Σ minimize the negative log posterior (10); this minimization problem is convex in the variables Σ^{−1} and Σ^{−1}B. The first-order optimality conditions can be solved analytically (e.g., see [17]), yielding

B = Y X^T (X X^T)^{−1},    (11)

Σ = (1/(T + α1)) ( (Y − BX)(Y − BX)^T + α1 Ω1 ).    (12)

In what follows we assume that the scatter matrix X X^T is invertible. In practice, selecting independent regressors leads to an invertible X X^T.

The estimates (11), (12) can be computed recursively. These estimates can be written in terms of the scatter matrices X X^T, Y X^T, and Y Y^T. For (11) this is obvious; for (12) this requires expanding the matrix product in the numerator. As additional data become available, the scatter matrices can be updated using rank-one recursion.
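As an illustration, a minimal sketch of (11) and (12) in terms of the scatter matrices is given below (our own code, not from the paper; the helper names are hypothetical). It uses the identity (Y − BX)(Y − BX)^T = Y Y^T − B (Y X^T)^T, which holds when B is the least-squares estimate (11).

```python
import numpy as np

def map_estimates(Sxx, Sxy, Syy, T, alpha1, Omega1):
    """One-level MAP estimates (11)-(12), written in terms of the
    scatter matrices Sxx = X X^T, Sxy = Y X^T, Syy = Y Y^T."""
    B = Sxy @ np.linalg.inv(Sxx)                    # (11)
    R = Syy - B @ Sxy.T                             # = (Y - BX)(Y - BX)^T for B from (11)
    Sigma = (R + alpha1 * Omega1) / (T + alpha1)    # (12)
    return B, Sigma

def rank_one_update(Sxx, Sxy, Syy, x, y):
    """Fold a new sample (x, y) into the scatter matrices."""
    return (Sxx + np.outer(x, x),
            Sxy + np.outer(y, x),
            Syy + np.outer(y, y))
```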

III. TWO-LEVEL MULTIPLE REGRESSION PROBLEM

The rest of the paper considers the two-level dataset (1), and (3), (4) are replaced by

Yi = Bi Xi + Vi,  (i = 1, ..., N),

where

Yi = [ yi(1) ··· yi(Ti) ],
Xi = [ xi(1) ··· xi(Ti) ],    (13)
Vi = [ vi(1) ··· vi(Ti) ].

A. Baseline Approaches

The formulation of Section II can be applied to two-level covariance estimation in two ways described below. These are later used to benchmark the proposed approach.

1) Pooled Regression: The multivariate linear regression of Section II can be used for the pooled data

X = [ X1 ··· XN ],  Y = [ Y1 ··· YN ].

The obtained estimates of B and Σ are identical across the population. The obvious deficiency of the pooled formulation is that it ignores differences between the units. If the units are substantially different, the obtained model can be inaccurate. This is illustrated in the example of Section V.

2) Separate Regressions: The second approach is to ignore any similarity between the units. The approach of Section II is then applied separately to each pair of data matrices Xi and Yi. For each unit i, separate and unrelated estimates of Bi and Σi are computed. The problem with this approach is that there may be insufficient data to accurately estimate the covariance for each unit separately.
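For concreteness, both baselines reduce to calls of the one-level estimator; a minimal sketch, reusing the hypothetical map_estimates helper from the Section II sketch above:

```python
import numpy as np

def fit_pooled(Xs, Ys, alpha1, Omega1):
    """One model for the concatenated data of all units."""
    X, Y = np.hstack(Xs), np.hstack(Ys)
    return map_estimates(X @ X.T, Y @ X.T, Y @ Y.T, X.shape[1], alpha1, Omega1)

def fit_separate(Xs, Ys, alpha1, Omega1):
    """Unrelated estimates (Bi, Sigma_i) for each unit."""
    return [map_estimates(X @ X.T, Y @ X.T, Y @ Y.T, X.shape[1], alpha1, Omega1)
            for X, Y in zip(Xs, Ys)]
```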

B. Estimation of the Two-level Model

We propose a generalization of the one-level Bayesian optimal estimation formulation (3), (4), (11), (12) to the two-level data (1), (13). The main novelty of the proposed approach is in modeling the commonality of the covariance structures of the level-one models.

Consider the regression model

yi(t) ∼ N( Bi xi(t), Σi ).    (14)

By analogy with (6), ignoring additive constants, the log likelihood of Σi and Bi is

ℓ(Σi, Bi | D) = −(Ti/2) log |Σi| − (1/2) tr( Σi^{−1} (Yi − Bi Xi)(Yi − Bi Xi)^T ).    (15)

The one-level baseline approach of Section II uses the inverse Wishart prior for the covariance matrix. The proposed two-level formulation assumes that each individual covariance matrix is sampled from a population. In our Bayesian formulation, all individual covariance matrices are related by having the same Wishart prior

Σi | S, β ∼ W( β^{−1} S, β + m + 1 ).    (16)

This is the sampling distribution for a scatter matrix with zero-mean normally distributed columns. It has log prior

ℓ(Σi | S, β) = −(1/2)(β + m + 1) log |S| + (β/2) log |Σi| − (β/2) tr( S^{−1} Σi ).    (17)

The hyperparameter β sets the strength of the prior, the similarity between the estimated unit covariance matrices. The hyperparameter S is the generating covariance matrix for the population. We propose an inverse Wishart hyperprior for the hyperparameter S,

S ∼ W^{−1}( αΩ, α − m − 1 ).    (18)

Below, it is shown that with this hyperprior the MAP estimation problem is convex. The log prior is similar to (8),

ℓ(S | Ω, α) = −(α/2) log |S| − (α/2) tr( Ω S^{−1} ).    (19)

The hyperhyperparameter α sets the strength of this hyperprior, how similar S is to Ω. The proposed Bayesian model structure with the parameters, hyperparameters, and nuisance parameters is summarized in Figure 1.

[Fig. 1. The structure of the Bayesian priors: population covariance prior Ω → population-wide covariance S → unit covariance matrices Σ1, ..., ΣN; nuisance parameters B1, ..., BN.]

The log posterior is the sum of the log likelihoods (15), the log priors (17) for each unit, and the population log prior (19):

ℓ(Σ1, ..., ΣN, B1, ..., BN, S | D) = ℓ(S | Ω, α) + ∑_{i=1}^{N} ( ℓ(Σi | S, β) + ℓ(Σi, Bi | D) ).    (20)

The MAP estimates maximize (20). Substituting (15), (17), and (19) yields the optimization problem

minimize   α tr( Ω S^{−1} ) + (α + N(β + m + 1)) log |S|
           + ∑_{i=1}^{N} [ (Ti − β) log |Σi| + β tr( S^{−1} Σi )
           + tr( Σi^{−1} (Yi − Bi Xi)(Yi − Bi Xi)^T ) ].    (21)

The decision variables in problem (21) are the matrices Bi and Σi, for i = 1, ..., N, and the matrix S. The hyperparameters α, β, and Ω are assumed to be given. These parameters can be used for tuning the estimator performance. In the absence of better information, a reasonable choice is Ω = ωI, where ω is a scalar and I is a unit matrix.

C. Convexity

Although the optimization problem (21) is not convex, a change of variables yields an equivalent problem that is convex. Define matrices Λi = Σi^{−1}, Mi = Σi^{−1} Bi, and L L^T = S^{−1}. In these new decision variables, (21) becomes

minimize   α tr( L^T Ω L ) − (α + N(β + m + 1)) log |L^T L|
           + ∑_{i=1}^{N} [ −(Ti − β) log |Λi| + β tr( L^T Λi^{−1} L )
           + tr( Yi^T Λi Yi ) − 2 tr( Yi^T Mi Xi ) + tr( (Mi Xi)^T Λi^{−1} (Mi Xi) ) ].    (22)

Convexity of (22) follows from the fact that the negative log-det function

f : S^n_+ → R,  f : X ↦ −log |X|,

and the matrix fractional function

g : S^n_+ × R^n → R,  g : (X, y) ↦ y^T X^{−1} y,

are both convex [4]. In (22):

• −log |L^T L| = −log( |L^T| · |L| ) = −2 log |L| is a convex negative log-det function.

• tr( L^T Λi^{−1} L ) = ∑_{j=1}^{m} (L ej)^T Λi^{−1} (L ej), where the ej are the unit vectors, is convex in both L and Λi as a sum of matrix fractional functions.

• −(Ti − β) log |Λi| is convex for Ti ≥ β as a negative log-det function.

Other terms are trivially convex or are similar to the above.
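As a numerical sanity check of this argument, the following sketch (our own, not from the paper; the problem sizes are hypothetical, and L is restricted to be symmetric positive definite, which loses no generality because only L L^T = S^{-1} enters the objective) states (22) in CVXPY and confirms that it passes the disciplined convex programming (DCP) rules:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
m, n, N, T = 2, 3, 4, 50            # small toy sizes (assumed)
alpha, beta = 1.0, 10.0             # T >= beta is required for convexity
Omega = np.eye(m)
R = np.linalg.cholesky(Omega)       # Omega = R R^T, so tr(L^T Omega L) = ||R^T L||_F^2
Xs = [rng.standard_normal((n, T)) for _ in range(N)]
Ys = [rng.standard_normal((m, T)) for _ in range(N)]

L = cp.Variable((m, m), symmetric=True)                        # L L^T = S^{-1}
Lam = [cp.Variable((m, m), symmetric=True) for _ in range(N)]  # Lambda_i = Sigma_i^{-1}
M = [cp.Variable((m, n)) for _ in range(N)]                    # M_i = Sigma_i^{-1} B_i

obj = alpha * cp.sum_squares(R.T @ L) \
      - 2 * (alpha + N * (beta + m + 1)) * cp.log_det(L)       # -log|L^T L| = -2 log|L|
for X, Y, Lam_i, M_i in zip(Xs, Ys, Lam, M):
    obj += (-(T - beta) * cp.log_det(Lam_i)
            + beta * cp.matrix_frac(L, Lam_i)                  # tr(L^T Lambda_i^{-1} L)
            + cp.trace((Y @ Y.T) @ Lam_i)                      # tr(Y_i^T Lambda_i Y_i)
            - 2 * cp.trace((X @ Y.T) @ M_i)                    # tr(Y_i^T M_i X_i)
            + cp.matrix_frac(M_i @ X, Lam_i))                  # matrix fractional term

prob = cp.Problem(cp.Minimize(obj))
print(prob.is_dcp())    # True: every term satisfies the DCP composition rules
prob.solve(solver=cp.SCS)
```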

IV. SOLUTION OF TWO-LEVEL PROBLEM

A general-purpose convex solver could be used to compute a solution to (22). However, such a solution would be inefficient or impossible for a large dataset. The number of decision variables in the problem is proportional to the number of units and can be very large. A large dataset may not even fit into computer memory. This section outlines a specialized method for computing the global optimum of (22). The method is efficient and scalable to very large problems.

A. First Order Optimality Conditions

The proposed approach is to compute the first order conditions for optimality of (22) and solve them iteratively. We will find these first order conditions using the transformed variables, but will iteratively solve them using the original Bayesian model variables. This amounts to a block coordinate descent method on the transformed, convex problem.

Differentiating (22) with respect to L and then changing back to the original Bayesian model variables yields

S = (1/(α + N(β + m + 1))) ( αΩ + β ∑_{i=1}^{N} Σi ).    (23)

Differentiating (22) with respect to Λi gives, in the original Bayesian model variables,

0 = −(Ti − β) Σi − β Σi S^{−1} Σi + Yi Yi^T − Bi Xi Xi^T Bi^T.    (24)

This is an algebraic Riccati equation. It can be solved for Σi in cubic time using standard solvers. Differentiating (22) with respect to Mi yields

0 = 2 Λi^{−1} Mi Xi Xi^T − 2 Yi Xi^T.

Changing back to the original Bayesian model variables gives

Bi = Yi Xi^T (Xi Xi^T)^{−1}.    (25)

B. Discussion

The optimal estimate of S, given by (23), is a weighted sum of Ω and the individual covariance matrices. For α = 0 and N large, S is the average of the unit covariance matrices. The optimal estimate (25) of the regression parameters Bi is the same as for the one-level regression for each unit.

For β = 0, the solutions to (24) and (25) are the MAP estimates (see (11), (12)) for the one-level regression problems for each unit. To see this, substitute the solution for Bi in (25) into (24) and set β = 0 to obtain the sample covariance matrix

Σi = (1/Ti) ( Yi Yi^T − Yi Xi^T (Xi Xi^T)^{−1} Xi Yi^T ),

which has the same form as (12) for B = Bi and α1 = 0.

C. Algorithm

The solution for the two-level estimation problem of Section III can be computed as follows:

1) Compute the scatter matrices Xi Xi^T, Yi Xi^T, Yi Yi^T, where Xi and Yi are given by (13).
2) Compute Bi (25) for each unit i.
3) Initialize the matrices Σ1, ..., ΣN and S.
4) For each i, solve (24) for Σi with S given.
5) Solve (23) for S, with Σ1, ..., ΣN given.
6) Check for convergence. If the updates have not converged, go to step 4.

[Fig. 2. The steps in the algorithm: for each unit i, a new data point {xi(t), yi(t)} updates the scatter matrices Xi Xi^T, Yi Xi^T, Yi Yi^T; Bi is computed from (25); after initializing S, Σ1, ..., ΣN, the iteration alternates between computing each Σi from (24) and S from (23).]

The proposed algorithm is summarized in Figure 2. The algorithm performs block coordinate descent on the convex optimization problem (22). The coordinate blocks are given by the matrices Σ1, ..., ΣN and S. Such an algorithm is guaranteed to converge [14], [16].

The algorithm tuning parameters are the parameters of the two-level model explained in Section III: α > 0, Ω ∈ S^m_+ (18), and β > 0 (16). The scalar α and matrix Ω in (18) should be chosen in the same way as the parameters α1 and Ω1 for the inverse Wishart prior in Section II-A. For large α, the population-level covariance S will be weighted heavily toward Ω. The scalar β affects the strength of the prior for the covariance matrices Σi. A larger value of β enforces greater similarity amongst the population of the covariances.

In the proposed algorithm, the optimal estimates depend on the data (1) only through the scatter matrices Xi Xi^T, Yi Xi^T, and Yi Yi^T. This enables a recursive solution for a large (or growing) dataset (1). As new data are added, the scatter matrices can be recursively computed using rank-one updates. Furthermore, for the proposed algorithm, the scatter matrices for each unit do not need to be loaded into memory simultaneously. This also allows the computation of the Σi to be parallelized.
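A compact implementation of steps 1–6 is sketched below (our own code, not the authors'; function names are hypothetical). Here (24) is solved with a symmetric eigendecomposition instead of a general-purpose Riccati solver: substituting Σi = S^{1/2} Z S^{1/2} reduces (24) to a scalar quadratic for each eigenvalue of S^{-1/2} (Yi Yi^T − Bi Xi Xi^T Bi^T) S^{-1/2}. A fixed sweep count stands in for the convergence test of step 6.

```python
import numpy as np

def _psd_sqrt(S):
    """Symmetric square root of a positive definite matrix, and its inverse."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(w)) @ V.T, (V / np.sqrt(w)) @ V.T

def solve_riccati(S, Q, Ti, beta):
    """Solve 0 = -(Ti - beta) Sigma - beta Sigma S^{-1} Sigma + Q, i.e. (24),
    for a positive semidefinite Sigma (Q is assumed PSD)."""
    S_half, S_ihalf = _psd_sqrt(S)
    q, U = np.linalg.eigh(S_ihalf @ Q @ S_ihalf)
    # per eigenvalue: beta z^2 + (Ti - beta) z - q = 0; take the nonnegative root
    z = (-(Ti - beta) + np.sqrt((Ti - beta) ** 2 + 4 * beta * np.maximum(q, 0))) / (2 * beta)
    return S_half @ ((U * z) @ U.T) @ S_half

def fit_two_level(Sxx, Sxy, Syy, Ts, alpha, beta, Omega, sweeps=50):
    """Block coordinate descent of Section IV-C on precomputed scatter matrices
    Sxx[i] = Xi Xi^T, Sxy[i] = Yi Xi^T, Syy[i] = Yi Yi^T (step 1)."""
    N, m = len(Ts), Syy[0].shape[0]
    B = [Sxy[i] @ np.linalg.inv(Sxx[i]) for i in range(N)]           # step 2, (25)
    Q = [Syy[i] - B[i] @ Sxy[i].T for i in range(N)]                 # Yi Yi^T - Bi Xi Xi^T Bi^T
    Sig = [Q[i] / Ts[i] + 1e-8 * np.eye(m) for i in range(N)]        # step 3 (ridge keeps S PD)
    S = (alpha * Omega + beta * sum(Sig)) / (alpha + N * (beta + m + 1))
    for _ in range(sweeps):                                          # steps 4-6
        Sig = [solve_riccati(S, Q[i], Ts[i], beta) for i in range(N)]       # (24)
        S = (alpha * Omega + beta * sum(Sig)) / (alpha + N * (beta + m + 1))  # (23)
    return B, Sig, S
```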

D. Algorithmic Complexity

To analyze the computational complexity of the proposed algorithm, it is assumed that O(n) = O(m), that all units have approximately the same number of data points, and that there are many more data points per unit than there are units, i.e., Ti ≈ T ≫ m, n.

1) The cost of computing the scatter matrices Xi Xi^T, Yi Xi^T, and Yi Yi^T for N units is approximately O(n^2 N T).
2) Computing Bi (25) for each unit given the scatter matrices costs O(n^3). This is done once for each unit, so the overall complexity is O(n^3 N).
3) Computing the coefficients of the Riccati equation (24) costs O(n^2) for each unit. Solving it costs O(n^3) [5]. Over all units, the complexity is O(n^3 N).
4) Computing S from (23) costs O(n^2 N).

The global optimum is found by completing steps 1 and 2, then iterating over steps 3 and 4 until convergence. This process costs O(n^2 N T + n^3 N N_iterations), where N_iterations is the number of iterations over steps 3 and 4 required for convergence. The recursive formulation allows T_e new data points to be added for each unit, and the estimates updated, for a cost of O(n^2 N T_e + n^3 N N_iterations). These computations can be fully parallelized over N processors at a cost of O(n^2 T_e + n^3 N_iterations) for each.
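As a rough, constant-free illustration of these counts, take the fleet sizes of Example 2 below (n = 4, m = 2, N = 500 units, T = 5000 samples per unit). The one-time cost of step 1 is of order n^2 N T ≈ 16 · 500 · 5000 = 4 · 10^7 operations, while one sweep of steps 3 and 4 is of order n^3 N ≈ 64 · 500 ≈ 3 · 10^4, so even hundreds of coordinate descent sweeps are dominated by forming the scatter matrices once.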

V. EXAMPLES

The first of the two examples in this section is a simple two-output dataset that explains the use of the proposed two-level estimation approach. The second example describes a gas turbine fleet and illustrates the algorithm performance as compared to the baseline approaches.

A. Example 1: Two-output Dataset

We consider a dataset (1) with m = 2, n = 1, N = 4, and Ti = 100 for all i. We assume that xi(t) = 1 for all i and t. Then the matrices Bi ∈ R^{2×1} are the means of the multivariate normal distributions for yi(t).

In this example, the elements of Bi (a column) are generated using a zero-mean normal distribution with covariance 3 I_{2×2}. The population covariance matrices Σi ∈ S^2 are generated independently for each unit following the Σi ∼ W(S/100, 100 + m + 1) distribution with

S = [ 1      1/20
      1/20   1/10 ].    (26)

Matrices Bi, Σi specify the model for unit i. The output vectors yi(t) were generated independently from the distributions yi(t) ∼ N(Bi, Σi).

The generated dataset (1) was processed using the proposed method. The algorithm of Section IV-C was tuned by using the regularization parameters α = 0, Ω = I_{2×2} in (18), and β = 100 in (16). The resulting covariance estimates for each unit are shown in Figure 3. Each ellipsoid is described by the center Bi and the covariance matrix Σi. The plotted ellipsoids are scaled up by a factor of 3 such that they contain most of the data points. The novel (proposed) algorithm implemented sequentially runs on a PC in 0.09 seconds.

[Fig. 3. Estimation results for the Example 1 two-output dataset; the legend distinguishes the Novel, True, Separate, and Pooled ellipsoids.]

The ellipsoids corresponding to the ground truth covariance matrices Σi and the means Bi are plotted with dashed lines, for all 4 units. The data yi(t) used in the estimation are plotted as dots. The ellipsoids produced by the proposed novel method are plotted with solid lines. They correspond to the estimated covariance matrices Σ̂i (enlarged by a factor of 3) and the estimated means B̂i. As the figure shows, the ground truth and the estimated ellipsoids match closely.
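A sketch of this data generation (our own code; the paper provides no implementation) using SciPy's Wishart sampler:

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(1)
m, N, Ti = 2, 4, 100
S = np.array([[1.0, 1/20], [1/20, 1/10]])                        # (26)
B = [rng.multivariate_normal(np.zeros(m), 3 * np.eye(m)) for _ in range(N)]
Sig = [wishart.rvs(df=100 + m + 1, scale=S / 100, random_state=rng) for _ in range(N)]
# yi(t) ~ N(Bi, Sigma_i); with xi(t) = 1, Bi is the mean of yi(t)
Y = [rng.multivariate_normal(B[i], Sig[i], size=Ti).T for i in range(N)]
X = [np.ones((1, Ti)) for _ in range(N)]                         # n = 1 constant regressor
```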

B. Example 2: Gas Turbine Fleet

The next example uses simulated data from a fleet of gas turbines for power generation. The models of the form (14), estimated from the data, could be used for statistical monitoring of the performance of individual turbines.

The turbine model is based on [15]. The meanings of the components x[j] of the input vector x and the components y[j] of the output vector y are explained in Table I. The model of [15] was linearized around the heat rate of 10,000 BTU/kWh and efficiency of 80% (this is the operating point given in [15]). A central linear model of the form (14) is then described by the parameter matrix

B_ideal = [ −0.2962  −0.2936  1      0.733
             0.314    0.385   0.947  −0.951 ].    (27)

TABLE I
EXAMPLE 2 GAS TURBINE MODEL INPUTS AND OUTPUTS

Variable  Description           Unit
x[1]      inlet temp.           °F
x[2]      inlet pressure        in. H2O
x[3]      part load fraction    nondimensional
x[4]      steam injection flow  lb/hr
y[1]      heat rate             BTU/kWh
y[2]      efficiency            nondimensional

The linear input-output map Bi for each unit in the simulated fleet of turbines was generated from the central model (27) as follows. Each column of Bi was normally distributed with mean given by the corresponding column of B_ideal and covariance matrix Σi. The covariances Σi ∈ S^2 were generated for each unit, independently, according to Σi ∼ W(S/β, β + m + 1), following the assumptions made in Section III-B. The 2×2 population-wide generating covariance matrix S was taken to be the same as in (26).

Given Bi and Σi, the dataset was generated as follows. Input vectors xi(t) ∈ R^4 for turbine i at time t were independently sampled from the distribution xi(t) ∼ N(0, Σ^x), with Σ^x = I. The output vectors yi(t) ∈ R^2 were then generated in accordance with yi(t) ∼ N(Bi xi(t), Σi).

We considered a population of N = 500 individual units with Ti = T = 5000 data points for each unit. The data was processed according to the proposed two-level method, as outlined in Figure 2, and also according to both baseline techniques discussed in Section III-A. The algorithm used the regularization parameters α = 0, Ω = I_{2×2} in (18), and β = 2000 in (16). The time taken to run the algorithm sequentially on a PC was 8.1 seconds.

C. Algorithm Performance

Our motivation has been primarily to estimate the covariance matrices. One natural performance metric is the population-average Frobenius distance between the true covariance Σi and the estimated covariance Σ̂i for each unit,

‖Σi − Σ̂i‖_F.    (28)

A possible motivation for the two-level regression modeling of the turbine fleet data is multivariate monitoring of anomalies. Such monitoring could be implemented using Hotelling-type statistics

T² = ( yi(t) − B̂i xi(t) )^T Σ̂i^{−1} ( yi(t) − B̂i xi(t) ),

where B̂i and Σ̂i are the estimates of the respective matrices obtained by the algorithm. The monitoring performance, then, depends on the estimation accuracy of the inverse covariance matrix. Therefore, the second performance metric of the algorithm is the population-wide average of

‖Σi^{−1} − Σ̂i^{−1}‖_F.    (29)

Monte Carlo simulations were completed to estimate (28) and (29) by repeatedly generating the data as described above.
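Both metrics and the monitoring statistic are simple to compute once the estimates are available; a minimal sketch with hypothetical helper names:

```python
import numpy as np

def hotelling_t2(x, y, B_hat, Sigma_hat):
    """Hotelling-type statistic for one sample of a unit."""
    r = y - B_hat @ x
    return float(r @ np.linalg.solve(Sigma_hat, r))

def cov_errors(Sig_true, Sig_hat):
    """Frobenius errors (28) and (29) for one unit."""
    e28 = np.linalg.norm(Sig_true - Sig_hat, "fro")
    e29 = np.linalg.norm(np.linalg.inv(Sig_true) - np.linalg.inv(Sig_hat), "fro")
    return e28, e29
```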
The average values obtained in the 100 completed simulations are summarized in Tables II and III. Both metrics (28) and (29) improve for the proposed method compared to the two baseline methods. The improvement is larger for the smaller dataset of Table II. The model estimation errors ‖B̂i − Bi‖_F are very small in all cases and are not shown.

TABLE II
EXAMPLE 2 TURBINE FLEET RESULTS: 30 UNITS, 100 SAMPLES EACH

Performance  ‖Σi − Σ̂i‖_F  ‖Σi^{−1} − Σ̂i^{−1}‖_F
Novel        0.0271        0.624
Separate     0.0536        1.65
Pooled       1.17          8.92

TABLE III
EXAMPLE 2 TURBINE FLEET RESULTS: 500 UNITS, 5000 SAMPLES EACH

Performance  ‖Σi − Σ̂i‖_F      ‖Σi^{−1} − Σ̂i^{−1}‖_F
Novel        4.38 · 10^{−4}    15.1
Separate     5.23 · 10^{−4}    17.4
Pooled       0.129             794

VI. CONCLUSION

We have presented a technique for estimating linear models and covariances for two-level datasets. The proposed Bayesian approach assumes that the covariance matrices are sampled from a common population. It uses data from other units to improve the estimation of the individual covariances. The approach is scalable to very large datasets that may not fit into computer memory. In the provided simulation example, the technique shows an improvement over two versions of a standard one-level Bayesian approach to linear model and covariance estimation.

REFERENCES

[1] M. Arellano, "Computing robust standard errors for within-groups estimators," Oxford Bulletin of Economics and Statistics, Vol. 49, No. 4, 1987, pp. 431–434.
[2] T. Asparouhov and B. Muthen, "Computationally efficient estimation of multilevel high-dimensional latent variable models," Proc. Joint Statistical Meetings, pages 2531–2535, July 2007, Salt Lake City, UT.
[3] P.J. Bickel and E. Levina, "Regularized estimation of large covariance matrices," The Annals of Statistics, Vol. 36, No. 1, 2008, pp. 199–227.
[4] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004.
[5] A. Chinchuluun and P.M. Pardalos, Optimization and Optimal Control, volume 39, Springer, 2010.
[6] E. Chu, D. Gorinevsky, and S.P. Boyd, "Scalable statistical monitoring of fleet data," 18th IFAC World Congress, volume 18, pages 13227–13232, Milano, Italy, August 2011.
[7] R. Couillet and M. Debbah, "Signal processing in large systems: a new paradigm," IEEE Signal Processing Magazine, Vol. 30, No. 1, 2013, pp. 24–39.
[8] A.E. Gelfand and A.F.M. Smith, "Sampling-based approaches to calculating marginal densities," Journ. American Statistical Assoc., Vol. 85, No. 410, 1990, pp. 398–409.
[9] H. Goldstein, Multilevel Statistical Models, Wiley Series in Probability and Statistics, Wiley, 2010.
[10] W. James and C. Stein, "Estimation with quadratic loss," Proc. 4th Berkeley Symp. on Math. Statistics and Probability, volume 1, pages 361–379, 1961.
[11] O. Ledoit and M. Wolf, "A well-conditioned estimator for large-dimensional covariance matrices," Journ. Multivariate Analysis, Vol. 88, No. 2, 2004, pp. 365–411.
[12] S.Y. Lee and S.Y. Tsang, "Constrained maximum likelihood estimation of two-level covariance structure model via EM-type algorithms," Psychometrika, Vol. 64, No. 4, 1999, pp. 435–450.
[13] K.V. Mardia, J.T. Kent, and J.M. Bibby, Multivariate Analysis, Academic Press, 1979.
[14] Yu. Nesterov, "Efficiency of coordinate descent methods on huge-scale optimization problems," SIAM Journ. Optimization, Vol. 22, No. 2, 2012, pp. 341–362.
[15] J. Petek and P. Hamilton, "Performance monitoring of gas turbines," Orbit, Vol. 25, No. 1, 2005, pp. 64–74.
[16] P. Richtárik and M. Takáč, "Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function," Mathematical Programming, Ser. A, 2012, pp. 1–38.
[17] B.G. Tabachnick, L.S. Fidell, and S.J. Osterlind, Using Multivariate Statistics, 4th Ed., New York: Allyn and Bacon, 2001.
[18] J. Teachman, G.J. Duncan, W.J. Yeung, and D. Levy, "Covariance structure models for fixed and random effects," Sociological Methods & Research, Vol. 30, No. 2, 2001, pp. 271–288.
[19] J.H. Won and S.J. Kim, "Maximum likelihood covariance estimation with a condition number constraint," Proc. Asilomar Conf. on Signals, Systems and Computers, pages 1445–1449, Monterey, CA, Oct. 2006.
