CONDITION NUMBER REGULARIZED COVARIANCE ESTIMATION

By

Joong-Ho Won Johan Lim Seung-Jean Kim Bala Rajaratnam

Technical Report No. 2012-10 September 2012

Department of Statistics STANFORD UNIVERSITY Stanford, California 94305-4065

CONDITION NUMBER REGULARIZED COVARIANCE ESTIMATION

By

Joong-Ho Won Korea University

Johan Lim Seoul National University

Seung-Jean Kim Citi Capital Advisors

Bala Rajaratnam Stanford University

Technical Report No. 2012-10 September 2012

This research was supported in part by National Science Foundation grants DMS 0906392, DMS 1106642, CMG 1025465, and AGS 1003823.

Department of Statistics STANFORD UNIVERSITY Stanford, California 94305-4065

http://statistics.stanford.edu Condition Number Regularized Covariance Estimation ∗

Joong-Ho Won School of Industrial Management Engineering, Korea University, Seoul, Korea Johan Lim Department of Statistics, Seoul National University, Seoul, Korea Seung-Jean Kim Citi Capital Advisors, New York, NY, USA Bala Rajaratnam Department of Statistics, Stanford University, Stanford, CA, USA

Abstract

Estimation of high-dimensional covariance matrices is known to be a difficult problem, has many applications, and is of current interest to the larger statistics community. In many applications including so-called the “large p small n” setting, the estimate of the covariance is required to be not only invertible, but also well-conditioned. Although many regularization schemes attempt to do this, none of them address the ill-conditioning problem directly. In this paper, we propose a maximum likelihood approach, with the direct goal of obtaining a well-conditioned estimator. No sparsity assumption on either the covariance matrix or its inverse are are imposed, thus making our procedure more widely applicable. We demonstrate that the proposed regulariza- tion scheme is computationally efficient, yields a type of Steinian shrinkage estimator, and has a natural Bayesian interpretation. We investigate the theoretical properties of the regularized covariance estimator comprehensively, including its regularization path, and proceed to develop an approach that adaptively determines the level of regu- larization that is required. Finally, we demonstrate the performance of the regularized estimator in decision-theoretic comparisons and in the financial portfolio optimization setting. The proposed approach has desirable properties, and can serve as a competi- tive procedure, especially when the sample size is small and when a well-conditioned estimator is required. Keywords: covariance estimation, regularization, convex optimization, condition num- ber, eigenvalue, shrinkage, cross-validation, risk comparisons, portfolio optimization

∗A preliminary version of the paper has appeared in an unrefereed conference proceedings previously.

1 1 Introduction

We consider the problem of regularized covariance estimation in the Gaussian setting. It is well known that, given n independent samples x ,...,x Rp from a zero-mean p-variate 1 n ∈ Gaussian distribution, the sample covariance matrix

1 n S = x xT , n i i i=1 X maximizes the log-likelihood as given by

n 1 1 T −1 l(Σ) = log 1 exp xi Σ xi p 2 −2 i=1 (2π) (det Σ) Y   = (np/2) log(2π) (n/2)(tr(Σ−1S) logdetΣ−1), (1) − − − where det A and tr(A) denote the determinant and trace of a square matrix A respectively. In recent years, the availability of high-throughput data from various applications has pushed this problem to an extreme where in many situations, the number of samples, n, is often much smaller than the dimension of the estimand, p. When n

p, the eigenstructure tends to be systematically distorted unless p/n is extremely small, resulting in numerically ill-conditioned estimators for Σ; see Dempster (1972) and Stein (1975). For example, in mean-variance portfolio optimization (Markowitz, 1952), an ill- conditioned covariance matrix may amplify estimation error because the optimal portfolio involves matrix inversion (Ledoit and Wolf, 2003; Michaud, 1989). A common approach to mitigate the problem of is regularization. In this paper, we propose regularizing the sample covariance matrix by explicitly imposing a constraint on the condition number.1 Instead of using the standard estimator S, we propose to solve the following constrained maximum likelihood (ML) estimation problem

maximize l(Σ) (2) subject to cond(Σ) κ , ≤ max 1This procedure was first considered by two of the authors of this paper in a previous conference paper and is further elaborated in this paper (see Won and Kim (2006)).

2 where cond(M) stands for the condition number, a measure of numerical stability, of a matrix M (see Section 1.1 for details). The matrix M is invertible if cond(M) is finite, ill-conditioned if cond(M) is finite but high (say, greater than 103 as a rule of thumb), and well-conditioned if cond(M) is moderate. By bounding the condition number of the estimate by a regularization parameter κmax, we directly address the problem of invertibility or ill-conditioning. This direct control is appealing because the true covariance matrix is in most situations unlikely to be ill-conditioned whereas its sample counterpart is often ill-conditioned. It turns out that the resulting regularized matrix falls into a broad family of Steinian-type shrinkage estimators that shrink the eigenvalues of the sample covariance matrix towards a given structure (James and Stein, 1961; Stein, 1956). Moreover, the regularization parameter κmax is adaptively selected from the data using cross validation. Numerous authors have explored alternative estimators for Σ (or Σ−1) that perform better than the sample covariance matrix S from a decision-theoretic point of view. Many of these estimators give substantial risk reductions compared to S in small sample sizes, and often involve modifying the spectrum of the sample covariance matrix. A simple example is the family of linear shrinkage estimators which take a convex combination of the sample covariance matrix and a suitably chosen target or regularization matrix. Notable in the area is the seminal work of Ledoit and Wolf (2004) who study a linear shrinkage estimator toward a specified target covariance matrix, and choose the optimal shrinkage to minimize the Frobenius risk. Bayesian approaches often directly yield estimators which shrink toward a structure associated with a pre-specified prior. Standard Bayesian covariance estimators yield a posterior mean Σ that is a linear combination of S and the prior mean. It is easy to show that the eigenvalues of such estimators are also linear shrinkage estimators of Σ; see, e.g., Haff (1991). Other nonlinear Steinian-type estimators have also been proposed in the literature. James and Stein (1961) study a constant risk minimax estimator and its modification in a class of orthogonally invariant estimators. Dey and Srinivasan (1985) provide another minimax estimator which dominates the James-Stein estimator. Yang and Berger (1994) and Daniels and Kass (2001) consider a reference prior and hierarchical priors, that respectively yield posterior shrinkage. Likelihood-based approaches using multivariate Gaussian models have provided different perspectives on the regularization problem. Warton (2008) derives a novel family of linear shrinkage estimators from a penalized maximum likelihood framework. This formulation enables cross-validation of the regularization parameter, which we discuss in Section 3 for the proposed method. Related work in the area include Sheena and Gupta (2003), Pourahmadi

3 et al. (2007), and Ledoit and Wolf (2012). An extensive literature review is not undertaken here, but we note that the approaches mentioned above (and the one proposed in this paper) fall in the class of covariance estimation and related problems which do not assume or impose sparsity, on either the covariance matrix, or its inverse (for such approaches either in the frequentist, Bayesian, or testing frameworks, the reader is referred to Banerjee et al. (2008); Friedman et al. (2008); Hero and Rajaratnam (2011, 2012); Khare and Rajaratnam (2011); Letac and Massam (2007); Peng et al. (2009); Rajaratnam et al. (2008)).

1.1 Regularization by shrinking sample eigenvalues

We briefly review Steinian-type eigenvalue shrinkage estimators in this subsection. Dempster (1972) and Stein (1975) noted that the eigenstructure of the sample covariance matrix S tends to be systematically distorted unless p/n is extremely small. They observed that the larger eigenvalues of S are overestimated whereas the smaller ones are underestimated. This observation led to estimators which directly modify the spectrum of the sample covariance matrix and are designed to “shrink” the eigenvalues together. Let li, i = 1,...,p, denote the eigenvalues of the sample covariance matrix (sample eigenvalues) in nonincreasing order (l ... l 0). The spectral decomposition of the sample covariance matrix is given by 1 ≥ ≥ p ≥

T S = Q diag(l1,...,lp) Q , (3)

where diag(l ,...,l ) is the diagonal matrix with diagonal entries l and Q Rp×p is the 1 p i ∈ whose i-th column is the eigenvector that corresponds to the eigenvalue

li. As discussed above, a large number of covariance estimators regularizes S by modifying its eigenvalues with the explicit goal of better estimating the eigenspectrum. In this light Stein (1975) proposed the class of orthogonally invariant estimators of the following form:

T Σ= Qdiag(λ1,..., λp)Q . (4)

Typically, these estimators shrinkb the sampleb eigenvaluesb so that the modified eigenspectrum is less spread than that of the sample covariance matrix. In many estimators, the shrunk eigenvalues are required to maintain the original order as those of S: λ λ . 1 ≥···≥ p One well-known example of Steinian-type shrinkage estimators is the linear shrinkage b b estimator as given by Σ = (1 δ)S + δF, 0 δ 1 (5) LS − ≤ ≤ b 4 where the target matrix F = cI for some c > 0 (Ledoit and Wolf, 2004; Warton, 2008).

For the linear estimator the relationship between the sample eigenvalues li and the modified

eigenvalues λi is affine:

λi = (1 δ)li + δc b − Another example, Stein’s estimatorb (Stein, 1975, 1986), denoted by ΣStein, is given by −1 λi = li/di, i =1,...,p, with di = n p +1+2li j6=i(li lj) /n. The original order in − − b the estimator is preserved by applying isotonic regression (Lin and Perlman, 1985). b P 1.2 Regularization by imposing a condition number constraint

Now we proceed to introduce the condition number-regularized covariance estimator pro- posed in this paper. Recall that the condition number of a positive definite matrix Σ is defined as

cond(Σ) = λmax(Σ)/λmin(Σ)

where λmax(Σ) and λmin(Σ) are the maximum and the minimum eigenvalues of Σ, respec- tively. (Understand that cond(Σ) = if λ (Σ) = 0. ) The condition number-regularized ∞ min covariance estimation problem (2) can therefore be formulated as

maximize l(Σ) (6) subject to λ (Σ)/λ (Σ) κ . max min ≤ max An implicit condition is that Σ be symmetric and positive definite2. The estimation problem (6) can be reformulated as a convex optimization problem in the matrix variable Σ−1 (see Section 2). Standard methods such as interior-point methods can solve the convex problem when the number of variables (i.e., entries in the matrix) is modest, say, under 1000. Since the number of variables is about p(p + 1)/2, the limit is around p = 45. Such a naive approach is not adequate for moderate to high dimensional problems. One of the main contributions of this paper is a significant improvement on the solution method for (6) so that it scales well to much larger sizes. In particular, we show that (6) reduces to an unconstrained univariate optimization problem.Furthermore, the solution to (6), denoted by Σcond, has a Steinian-type shrinkage of the form as in (4) with eigenvalues

2This problem can be considered a generalization of the problem studied by Sheena and Gupta (2003), who consider imposing eitherb a fixed lower bound or a fixed upper bound on the eigenvalues. Their approach is however different from ours in a fundamental sense in that it is not designed to control the condition number. Hence such a solution does not correct for the overestimation of the largest eigenvalues and underestimation of the small eigenvalues simultaneously.

5 given by τ ⋆, l τ ⋆ i ≤ ⋆ ⋆ ⋆ ⋆ λi = min max(τ ,li),κmaxτ =  li, τ

  • 0. Note that the modified eigenvalues are nonlinear functions of the sample eigenvalues, meet the order constraint of Section 1.1, and even when n

    6 sacne ucino Σ = Ω of convex a is odto ubrcntan saraystse n h sol the matrix and covariance satisfied sample already the is t constraint how c number shows condition the and of (6) (7) in given solution as the problem of estimation derivation covariance the gives solution of section characterization This and Derivation 2.1 estimation covariance number-regularized Condition 2 5. Section By n adnege 04 hp ) hrfr h covar the to Therefore equivalent 7). is Chap. 2004, Vandenberghe, and (Boyd oeta tsffie ocnie h case the consider to suffices it that Note tcnb hw httecniinnme osrito seq is Ω on constraint number condition > the u that shown be likelihood can the it with associated family exponential natural modifiedeigen c ti elkonta h o-ieiod()o multivariat a of (1) log-likelihood the that known well is It uhthat such 0 n h odto ubrcntandetmtr(right). estimator number-constrained condition the and iue1: Figure apeegnaus( eigenvalues sample iershrinkage linear uI oprsno ievlesrnaeo h iershrinkage linear the of shrinkage eigenvalue of Comparison c  Ω  κ l max i ) − S ujc to subject minimize uI 1 . oeta stecnnclprmtrfrte(Wishart) the for parameter canonical the is Ω that Note . where , A tr uI κ  (Ω max 7  B S Ω ) eoe that denotes l < −  o e Ω det log κ

    1 modifiedeigen κ max /l max p τ τ uI, ⋆ n() ic odΣ cond(Ω), = cond(Σ) Since (1). in ⋆ cond( = to o()sml eue to reduces simply (6) to ution B odto ubrregularization number condition asincvrac matrix covariance Gaussian e ac siainpolm(6) problem estimation iance niinnumber-regularized ondition τ − iaett h xsec of existence the to uivalent ⋆ compute o apeegnaus( eigenvalues sample A S ,sneohrie the otherwise, since ), spstv semidefinite positive is siao (left) estimator τ ⋆ given l κ i ) max τ κ ⋆ max (8) . where the variables are a symmetric positive definite p p matrix Ω and a scalar u> 0. The × above problem in (8) is a convex optimization problem with p(p + 1)/2 + 1 variables, i.e., O(p2). The following lemma shows that by exploiting structure of the problem it can be reduced to a univariate convex problem, i.e., the dimension of the system is only of O(1) as compared to O(p2).

    Lemma 1. The optimization problem (8) is equivalent to the unconstrained univariate convex optimization problem p (i) minimize Jκmax (u), (9) i=1 X where

    li(κmaxu) log(κmaxu), u< 1/(κmaxli) (i) ⋆ ⋆ − J (u)= liµ (u) log µ (u)= 1+log l , 1/(κ l ) u 1/l κmax i − i  i max i ≤ ≤ i  l u log u, u> 1/l , i − i  and  µ⋆(u) = min max u, 1/l ,κ u , (10) i { i} max in the sense that the solution to (8) is a function of the solution u⋆ to (9), as follows.

    ⋆ ⋆ ⋆ ⋆ ⋆ T Ω = Qdiag(µ1(u ),...,µp(u ))Q , with Q defined as in (3).

    Proof. The proof is given in the Appendix.

    Characterization of the solution to (9) is given by the following theorem.

    Theorem 1. Given κ cond(S), the optimization problem (9) has a unique solution max ≤ given by ⋆ ⋆ ⋆ α + p β +1 u = α⋆ −p , (11) i=1 li + i=β⋆ κmaxli ⋆ ⋆ ⋆ where α 1,...,p is the largest indexP suchP that 1/l ⋆ < u and β 1,...,p is the ∈ { } α ∈ { } ⋆ ⋆ ⋆ smallest index such that 1/lβ⋆ > κmaxu . The quantities α and β are not determined a priori but can be found in O(p) operations on the sample eigenvalues l ... l . If 1 ≥ ≥ p ⋆ κmax > cond(S), the maximizer u is not unique but Σcond = S for all the maximizing values of u. b 8 Proof. The proof is given in the Appendix.

    Comparing (10) to (7), it is immediately seen that

    α⋆ p li/κmax + ⋆ li τ ⋆ =1/(κ u⋆)= i=1 i=β . (12) max α⋆ + p β⋆ +1 P − P Note that the lower cutoff level τ ⋆ is an average of the (scaled and) truncated eigenvalues, in ⋆ which the eigenvalues above the upper cutoff level κmaxτ are shrunk by a factor of 1/κmax. We highlight the fact that a reformulation of the original minimization into a univariate optimization problem makes the proposed methodology very attractive in high dimensional settings. The method is only limited by the complexity of spectral decomposition of the sample covariance matrix (or the decomposition of the data matrix). The approach proposed here is therefore much faster than using interior point methods. We also note that the condition number-regularized covariance estimator is orthogonally invariant: T if Σcond is the estimator of the true covariance matrix Σ, then UΣcondU is the estimator of the true covariance matrix UΣU T , for U orthogonal. b b 2.2 A geometric perspective on the regularization path

    In this section, we shall show that a simple relaxation of the optimization problem (8) provides an intuitive geometric perspective on the condition number-regularized estimator. Consider the function J(u,v)= min (tr(ΩS) logdetΩ) (13) uIΩvI − defined as the minimum of the objective of (8) over a range uI Ω vI, where 0

    J(u,v)= min (liµi log µi). u≤µi≤v − i=1 X Recall that the truncation range of the sample eigenvalues is therefore given by (1/v, 1/u). Now for given u, let α 1,...,p be the largest index such that l > 1/u and β 1,...,p ∈ { } α ∈ { } be the smallest index such that lβ < 1/v, i.e., the set of indexes where truncation at either end of the spectrum first starts to become a binding constraint. With this convention it is

    9 easily shown that J(u,v) can now be expressed in simpler form:

    p J(u,v)= (l µ∗(u,v) log µ∗(u,v)) (14) i i − i i=1 X α β−1 p = (l u log u)+ (1 + log l )+ (l v log v), (15) i − i i − i=1 i=α+1 X X Xi=β where u, 1 i α ≤ ≤ µ∗(u,v) = min max u, 1/l ,v = 1/l , α

    1. J does not increase as u decreases and v increases simultaneously. This follows from noting that simultaneously decreasing u and increasing v expands the domain of the minimization in (13).

    2. J(u,v)= J(1/l ,l/l ) for u 1/l and v 1/l . Hence J(u,v) is constant on this part 1 p ≤ 1 ≥ p of the domain. For these values of u and v, (Ω⋆)−1 = S.

    3. J(u,v) is almost everywhere differentiable in the interior of the domain (u,v):0 < { u v , except for on the lines u = 1/l ,..., 1/l and v = 1/l ,..., 1/l . This follows ≤ } 1 p 1 p from noting the the indexes α and β changes their values only on these lines. Hence the contribution to the three summands in (14) changes at these values.

    We can now see the following obvious relation between the function J(u,v) and the original problem (8): the solution to (8) as given by u⋆ is the minimizer of J(u,v) on the line

    v = κmaxu, i.e., the original univariate optimization problem (9) is equivalent to minimizing ⋆ ⋆ J(u,κmaxu). We denote this minimizer by u (κmax) and investigate how u (κmax) behaves ⋆ as κmax varies. The following proposition proves that u (κmax) is monotone in κmax. This ⋆ result sheds light on the regularization path, i.e., the solution path of u as κmax varies.

    ⋆ ⋆ ⋆ Proposition 1. u (κmax) is nonincreasing in κmax and v (κmax) , κmaxu (κmax) is nonde- creasing, both almost surely.

    10 Proof. The proof is given in the Appendix.

    Remark The above proposition has a very natural and intuitive interpretation: when the constraint on the condition number is relaxed to allow higher value of κmax, the gap ⋆ ⋆ ⋆ ⋆ between u and v widens so that the ratio of v to u remains at κmax. This implies ⋆ that as κmax is increased, the lower truncation value u decreases and the higher truncation value v⋆ increases. Proposition 1 can be equivalently interpreted by noting that the optimal ⋆ ⋆ truncation range τ (κmax),κmaxτ (κmax) of the sample eigenvalues is nested. In light of the solution to the condition number-regularized covariance estimation prob- lems in (7), Proposition 1 also implies that once an eigenvalue li is truncated for κmax = ν0,

    then it remains truncated for all κmax <ν0. Hence the regularized eigenvalue estimates are not only continuous, but they are also monotonic in the sense that they approach either

    end of the truncation spectrum as the regularization parameter κmax is decreased to 1. This monotonicity of the condition number-regularized estimates gives a desirable property that is not always enjoyed by other regularized estimators, such as the lasso for instance (Personal communications: Jerome Friedman, Department of Statistics, Stanford University). With the above theoretical knowledge on the regularization path of the sample eigenval- ues, we now provide an illustration of the regularization path; see Figure 2. More specif- ⋆ ⋆ ically, consider the plot of the path of the optimal point (u (κmax),v (κmax)) on the u-v ⋆ ⋆ plane from (u (1),u (1)) to (1/l1, 1/lp) when varying κmax from 1 to cond(S). The left ⋆ ⋆ panel shows the path of (u (κmax),v (κmax)) on the u-v plane for the case where the sample eigenvalues are (21, 7, 5.25, 3.5, 3). Here a point on the path represents the minimizer of

    J(u,v) on a line v = κmaxu (hollow circle). The path starts from a point on the solid line v = u (κ = 1, square) and ends at (1/l , 1/l ), where the dashed line v = cond(S) u max 1 p · passes (κmax = cond(S), solid circle). Note that the starting point (κmax = 1) corresponds to Σcond = γI for some data-dependent γ > 0 and the end point (κmax = cond(S)) to ⋆ Σcond = S. When κmax > cond(S), multiple values of u are achieved in the shaded region b above the dashed line, yielding nevertheless the same estimator S. The right panel of Figure b 2 shows how the eigenvalues of the estimated covariance vary as a function of κmax. Here we see that as the constraint is made stricter the eigenvalue estimates decrease monotonically. Furthermore, the truncation range of the eigenvalues simultaneously decreases and remains nested.

    11 v=cond(S)u 1/l_5 22 l1 1/l_4 20 v= κ u max 18

    16 cond

    v=u Σ 1/l_3 14 v

    12 1/l_2 10

    8 eigenvalues of

    6 1/l_1 4

    l5 2 0 1/l_1 1/l_2 1/l_3 1/l_4 1/l_5 1 2 3 4 5 6 7 u κmax (a) (b)

    Figure 2: Regularization path of the condition number constrained estimator. (a) ⋆ ⋆ Path of (u (κmax),v (κmax)) on the u-v plane, for sample eigenvalues (21, 7, 5.25, 3.5, 3) (thick curve). (b) Regularization path of the same sample eigenvalues as a function of κmax. Note that the estimates decreases monotonically as the condition number constraint κmax decreases to 1.

    3 Selection of regularization parameter κmax

    ⋆ ⋆ Sections 2.1 and 2.2 have already discussed how the optimal truncation range (τ ,κmaxτ ) is determined for a given regularization parameter κmax, and how it varies with the value of

    κmax. This section proposes a data-driven (or adaptive) criterion for selecting the optimal

    κmax and undertakes an analysis of this approach.

    3.1 Predictive risk selection procedure

    A natural approach is to select κmax that minimizes the predictive risk, or the expected negative predictive likelihood as given by

    −1 T −1 PR(ν)= E E ˜ tr(Σ X˜X˜ ) log det Σ x ,..., x , (17) X ν − ν | 1 n  n o b b where Σν is the estimated condition number-regularized covariance matrix from independent p samples x1, ,xn, with the value of the regularization parameter κmax set to ν, and X˜ R b ··· ∈ is a random vector, independent of the given samples, from the same distribution. We approx-

    12 imate the predictive risk using K-fold cross validation. The K-fold cross validation approach divides the data matrix X =(xT , ,xT ) into K groups so that XT = XT ,...,XT with 1 ··· n 1 K nk observations in the k-th group, k =1,...,K. For the k-th iteration, each observation in the k-th group Xk plays the role of X˜ in (17), and the remaining K 1 groups are used [−k] − together to estimate the covariance matrix, denoted by Σν . The approximation of the predictive risk using the k-th group reduces to the predictive log-likelihood b −1 −1 l Σ[−k],X = (n /2) tr Σ[−k] X XT /n log det Σ[−k] . k ν k − k ν k k k − ν  h    i The estimateb of the predictive risk is thenb defined as b

    1 K PR(ν)= l Σ[−k],X . (18) −n k ν k k=1 X  c b The optimal value for the regularization parameter κmax is selected as ν that minimizes (18), i.e.,

    κmax = inf(arg inf PR(ν)). ν≥1

    [−k] [−k] Note that the outer infimum isb taken since lk Σν c,Xk is constant for ν cond(S ), ≥ [−k] where S is the k-th fold sample covariance matrix based on the remaining K 1 groups. b − 3.2 Properties of the optimal regularization parameter

    We proceed to investigate the behavior of the selected regularization parameter κmax, both

    theoretically and in simulations. We first note that κmax is a consistent estimator for the true condition number κ . This fact is expressed below as one of the several propertieb s of b κmax:

    b(P1) For a fixed p, as n increases, κmax approaches κ in probability, where κ is the condition number of the true covariance matrix Σ. b This statement is stated formally below:

    Theorem 2. For a given p,

    lim pr κmax = κ =1. n→∞   Proof. The proof is given in Supplementalb Section A.

    13 We further investigate the behavior of κmax in simulations. To this end, consider iid samples from a zero-mean p-variate Gaussian distribution with the following covariance matrices:

    (i) Identity matrix in Rp.

    (ii) diag(1,r,r2,...,rp), with condition number 1/rp = 5.

    (iii) diag(1,r,r2,...,rp), with condition number 1/rp = 400.

    (iv) whose (i, j)th element is 0.3|i−j| for i, j =1,...,p.

    We consider different combinations of sample sizes and dimensions of the problem as given by n = 20, 80, 320 and p = 5, 20, 80. For each of these cases 100 replicates are generated and κmax computed with 5-fold cross validation. The behavior of κmax is plotted in Figure

    3 and leads to insightful observations. A summary of the properties of the κmax is given in (P2)–(P4)b below: b b (P2) If the condition number of the true covariance matrix remains finite as p increases,

    then for a fixed n, κmax decreases.

    (P3) If the condition numberb of the true covariance matrix remains finite as p increases, then for a fixed n, κmax converges to 1.

    (P4) The variance of κmaxb decreases as either n or p increases.

    These properties areb analogous to those of the optimal regularization parameter δ for the linear shrinkage estimator (5), found using a similar predictive risk criterion (Warton, 2008). b

    4 Bayesian interpretation

    In the same spirit as the Bayesian posterior mode interpretation of the LASSO (Tibshirani, 1996), we can draw parallels for the condition number regularized covariance estimator. The condition number constraint given by λ (Σ)/λ (Σ) κ is similar to adding a penalty 1 p ≤ max term gmax(λ1(Σ)/λp(Σ)) to the likelihood equation for the eigenvalues:

    −n/2 maximize exp (n/2) p l /λ p λ exp( g λ /λ ) − i=1 i i i=1 i − max 1 p subject to λ λ > 0 1 ≥···≥ pP  Q 

    14 5 4 3 max max κ κ 2 1 0 0.0 0.5 1.0 1.5 2.0 2.5 3.0

    p= 52080 5 20 80 5 20 80 p= 5 20805 20 805 20 80 n=20 n=80 n=320 n=20 n=80 n=320 (a) (b) max max κ κ 0 100 200 300 400 0.0 0.5 1.0 1.5 2.0 2.5 3.0

    p= 5 20805 20 805 20 80 p= 52080 5 20 805 20 80 n=20 n=80 n=320 n=20 n=80 n=320 (c) (d)

    Figure 3: Box plots summarizing the distribution of κmax for dimensions p = 5, 20, 80 and for sample sizes n = 20, 80, 320 for the following covariance matrices (a) identity (b) diagonal exponentially decreasing, condition numberb 5, (c) diagonal exponentially decreasing, condition number 400, (d) Toeplitz matrix whose(i,j)th element is 0.3|i−j| for i,j = 1, 2,...,p.

    15 The above expression allows us to qualitatively interpret the condition number-regularized estimator as the Bayes posterior mode under the following prior

    π(λ ,...,λ ) = exp( g λ /λ ), λ λ > 0 (19) 1 p − max 1 p 1 ≥···≥ p for the eigenvalues, and an independent Haar measure on the Stiefel manifold, as the prior for the eigenvectors. The aforementioned prior on the eigenvalues has useful interesting properties which help to explain the type of eigenvalue truncation described in previous sections. We note that the prior is improper but the posterior is always proper.

    Proposition 2. The prior on the eigenvalues in (19) is improper, whereas the posterior yields a proper distribution. More formally,

    π(λ)dλ = exp( g λ /λ )dλ = , − max 1 p ∞ ZC ZC and

    p p −n/2 π(λ)f(λ,l)dλ exp (n/2) l /λ λ exp( g λ /λ )dλ < , ∝ − i i i − max 1 p ∞ ZC ZC i=1 i=1 X  Y  where λ =(λ ,...,λ ) and C = λ : λ λ > 0 . 1 p { 1 ≥···≥ p } Proof. The proof is given in Supplemental Section A.

    The prior above puts the greatest mass around the region λ Rp : λ = = λ which { ∈ 1 ··· p} consequently encourages “shrinking” or “pulling” the eigenvalues closer together. Note that the support of both the prior and the posterior is the entire space of ordered eigenvalues. Hence the prior simply by itself does not immediately yield a hard constraint on the condition number. Evaluating the posterior mode yields an estimator that satisfies the condition number constraint. A clear picture of the regularization achieved by the prior above and its potential for “eigenvalue shrinkage” emerges when compared to the other types of priors suggested in the literature and the corresponding Bayes estimators. The standard MLE implies of course a completely flat prior on the constrained space C. A commonly used inverse Wishart conjugate prior Σ−1 Wishart(m,cI) yields a posterior mode which is a linear shrinkage ∼ estimator (5) with δ = m/(n + m). Note however that the coefficients of the combination do not depend of the data X, and are a function only of the sample size n and the degrees of

    16 freedom or shape parameter from the prior, m. A useful prior for covariance matrices that yields a data-dependent posterior mode is the reference prior proposed by Yang and Berger (1994). For this prior, the eigenvalues are inversely proportional to the determinant of the p the covariance matrix, as given by i=1 λi, and also encourages shrinkage of the eigenvalues. The posterior mode using this referenceQ prior can be formulated similarly to that of condition number regularization:

    p p p −n/2 −1 argmax exp (n/2) l /λ λ λ λ1≥...≥λp>0 − i i i i i=1 i=1 i=1 Xp  Y  Yp 

    = argminλ1≥...≥λp>0 (n/2) li/λi + ((n + 2)/2) log λi. i=1 i=1 X X An examination of the penalty implied by the reference prior suggests that there is no direct penalty on the condition number. In Supplemental Section B we illustrate the density of the priors discussed above in the two-dimensional case. In particular, the density of the condition number regularization prior places more emphasis on the line λ1 = λ2 thus “squeezing” the eigenvalues together. This is in direct contrast with the inverse Wishart or reference priors where this shrinkage effect is not as pronounced.

    5 Decision-theoretic risk properties

    5.1 Asymptotic properties

    We now show that the condition number-regularized estimator Σcond has asymptotically lower risk than the sample covariance matrix S with respect to entropy loss. Recall that the b entropy loss, also known as Stein’s loss, is defined as follows:

    (Σ, Σ) = tr(Σ−1Σ) log det(Σ−1Σ) p. (20) Lent − −

    Let λ ,...,λ , with λ b λ , denote theb eigenvalues ofb the true covariance matrix Σ 1 p 1 ≥···≥ p −1 −1 −1 and Λ = diag λ1,...,λp . Define λ = λ1,...,λp , λ = λ1 ,...,λp , and κ = λ1 λp. First consider the trivial case when p>n. In this case, the sample covariance matrix S is singular regardless of the singularity of Σ, and (S, Σ) = , whereas the loss and Lent ∞ therefore the risk of Σcond are always finite. Thus, Σcond has smaller entropy risk than S. For p n, if the true covariance matrix has a finite condition number, it can be shown ≤ b b

    17 that for a properly chosen κmax, the condition number-regularized estimator asymptotically dominates the sample covariance matrix. This assertion is formalized below.

    Theorem 3. Consider a class of covariance matrices (κ, ω), whose condition numbers are D bounded above by κ and with minimum eigenvalue bounded below by ω > 0, i.e.,

    (κ, ω)= Σ= RΛRT : R orthogonal, Λ = diag(λ ,...,λ ), ω λ ... λ κω . D 1 p ≤ p ≤ ≤ 1 ≤  Then, the following results hold.

    T (i) Consider the quantity Σ(κmax, ω)= Qdiag(λ˜1,..., λ˜p)Q , where

    e ω, if l ω i ≤ λ˜i = l , if ω l <κ ω  i ≤ i max  κ ω, if l κ ω, max i ≥ max   T and the sample covariance matrix given as S = Qdiag(l1,...,lp)Q , Q orthogonal, l ... l . If the true covariance matrix Σ (κ , ω), then n, Σ(κ , ω) has 1 ≥ ≥ p ∈ D max ∀ max a smaller entropy risk than S. b (ii) Consider a true covariance matrix Σ whose condition number is bounded above by κ, −2 i.e., Σ (κ, ω). If κmax κ 1 √γ , then as p n γ (0, 1), ∈ ω>0 D ≥ − → ∈ S   pr Σ κ ,τ ⋆ eventually =1, ∈D max   n o ⋆ ⋆ where τ = τ (κmax) is given in (12).

    Proof. The proof is given in Supplemental Section A.

    ⋆ Combining the two results above, we conclude that the estimator Σcond = Σ(κmax,τ (κmax)) asymptotically has a lower entropy risk than the sample covariance matrix. b b 5.2 Finite sample performance

    This section undertakes a simulation study in order to compare the finite-sample risks of the

    condition number-regularized estimator Σcond with those of the sample covariance matrix

    (S) and the linear shrinkage estimator (ΣLS) in the “large p, small n” setting. The regular- b ization parameter δ for ΣLS is chosen as prescribed in Warton (2008). The optimal Σcond is b b 18 b calculated using the adaptive parameter selection method outlined in Section 3. Since Σcond and ΣLS both select the regularization parameters similarly, i.e., by minimizing the empiri- b cal predictive risk (18), a meaningful comparison between two estimators can be made. We b consider two loss functions traditionally used in covariance estimation risk comparisons: (a) 2 entropy loss as given in (20) and (b) quadratic loss (Σ, Σ) = ΣΣ−1 I . LQ − F Condition number regularization applies shrinkage to both ends of the sample eigenvalue b b spectrum and does not affect the middle part, whereas linear shrinkage does this to the entire spectrum uniformly. Therefore, it is expected that Σcond works well when a small proportion of eigenvalues are found at the extremes. Such situations rise very naturally b when only a few eigenvalues explain most of the variation in data. To understand the performance of the estimators in this context the following scenarios were investigated. We consider diagonal matrices of dimensions p = 120, 250, 500 as true covariance matrices. The eigenvalues (diagonal elements) are dichotomous, where the “high” values are (1 ρ)+ρp and − the “low” values are 1 ρ. For each p, we vary the composition of the diagonal elements such − that the high values take only one (singleton), 10%, 20%, 30%, and 40% of the total number of p eigenvalues. The sample size n is chosen so that γ = p/n is approximately 1.25, 2, or 4. Note that for a given p, the condition number of the true covariance matrices is held fixed at 1+ pρ/(1 ρ) regardless of the composition of eigenvalues. For each of the simulation − scenarios, we generate 1000 data sets and compute 1000 estimates of the true covariance matrix. The risks are calculated by averaging the losses over these 1000 estimates. Figure 4 presents the results for ρ = 0.1, and represents a large condition number. It is observed that in general Σcond has less risk than ΣLS, which in turn has less risk than the sample covariance matrix (entropy loss is not defined for the sample covariance matrix). This b b phenomenon is most clear when the eigenvalue spectrum has a singleton in the extreme. In this case, Σcond gives a risk reduction between 27 % and 67 % for entropy loss, and between 67 % and 91 % for quadratic loss. The risk reduction tends to be more pronounced in high b dimensional scenarios, i.e., for p large and n small. The performance of Σcond over ΣLS is maintained until the “high” eigenvalues compose up to 30 % of the eigenvalue spectrum. b b Comparing the two loss functions, risk reduction of Σcond is more distinct in quadratic loss. We note that for quadratic loss with large p and large proportion of “high” eigenvalues, there b are cases that the sample covariance matrix can perform well. As an example of a moderate condition number, results for the ρ = 0.5 case is given in Supplemental Section C. General trends are the same as with the ρ =0.1 case. In summary, the risk comparison study provides a numerical evidence that condition

    19 number regularization has merit when the true covariance matrix has a bimodal eigenvalue distribution and/or the true condition number is large.

    6 Application to portfolio selection

    This section illustrates the merits of the condition number regularization in the context of financial portfolio optimization, where a well-conditioned covariance estimator is necessary. A portfolio refers to a collection of risky assets held by an institution or an individual. Over the holding period, the return on the portfolio is the weighted average of the returns on the individual assets that constitutes the portfolio, in which the weight associated with each asset corresponds to its proportion in monetary terms. The objective of portfolio optimization is to determine the weights that maximize the return on the portfolio. Since the asset returns are stochastic, a portfolio always carries a risk of loss. Hence the objective is to maximize the overall return subject to a given level of risk, or equivalently to minimize risk for a given level of return. Mean-variance portfolio (MVP) theory (Markowitz, 1952) uses the standard deviation of portfolio returns to quantify the risk. Estimation of the covariance matrix of asset returns thus becomes critical in the MVP setting. An important and difficult component of MVP theory is to estimate the expected return on the portfolio (Luenberger, 1998; Merton, 1980). Since the focus of this paper lies in estimating covariance matrices and not expected returns, we shall focus on determining the minimum variance portfolio which only requires an estimate of the covariance matrix; see, e.g., Chan et al. (1999). For this we shall use the condition number regularization, linear shrinkage and the sample covariance matrix, in constructing a minimum variance portfolio. We compare their respective performance over a period of more than 14 years.

    6.1 Minimum variance portfolio rebalancing

    We begin with a formal description of the minimum variance portfolio selection problem. The universe of assets consists of p risky assets, denoted 1,...,p. We use ri to denote the return of asset i over one period; that is, its change in price over one time period divided by its price at the beginning of the period. Let Σ denote the covariance matrix of r = (r1,...,rp). We employ wi to denote the weight of asset i in the portfolio held throughout the period. A long position in asset i corresponds to wi > 0, and a short position corresponds to wi < 0. The portfolio is therefore unambiguously represented by the vector of weights w =(w1,...,wp). Without loss of generality, the budget constraint can be written as 1T w = 1, where 1 is the

    20 singleton 10% high 20% high 30% high 40% high

    Figure 4: Average risks (with error bars) over 1000 runs with respect to two loss functions when ρ = 0.1. sample=sample covariance matrix, Warton=linear shrinkage (Warton, 2008), CondReg=condition number regularization. Risks are normalized by the dimension (p).

    21 vector of all ones. The risk of a portfolio is measured by the standard deviation (wT Σw)1/2 of its return. Now the minimum variance portfolio selection problem can be formulated as

    minimize wT Σw (21) subject to 1T w =1.

    This is a simple quadratic program that has an analytic solution w⋆ =(1T Σ−11)−1Σ−11. In practice, the parameter Σ has to be estimated. The standard portfolio selection problem described above assumes that the returns are stationary, which is of course not realistic. As a way of dealing with the nonstationarity of returns, we employ a minimum variance portfolio rebalancing (MVR) strategy as follows. (t) (t) (t) p Let r = (r ,...,rp ) R , t = 1,...,N , denote the realized returns of assets at time 1 ∈ tot t (the time unit under consideration can be a day, a week, or a month). The periodic minimum variance rebalancing strategy is implemented by updating the portfolio weights every L time units, i.e., the entire trading horizon is subdivided into blocks each consisting of L time units. At the start of each block, we determine the minimum variance portfolio weights based on the past Nestim observations of returns. We shall refer to Nestim as the estimation horizon size. The portfolio weights are then held constant for L time units during these “holding” periods, i.e., during each of these blocks, and subsequently updated at the beginning of the following one. For simplicity, we shall assume the entire trading horizon

    consists of Ntot = Nestim + KL time units, for some positive integer K, i.e., there will be K updates. (The last rebalancing is done at the end of the entire period, and so the out- of-sample performance of the rebalanced portfolio for this holding period is not taken into account.) We therefore have a series of portfolios w(j) = (1T (Σ(j))−11)−1(Σ(j))−11 over the (j) holding periods of [Nestim +1+(j 1)L,Nestim +jL], j =1,...,K. Here Σ is the covariance − b b matrix of the asset returns estimated from those over the jth holding period. b 6.2 Empirical out-of-sample performance

    In this empirical study, we use the 30 stocks that constituted the Dow Jones Industrial Average as of July 2008 (Supplemental Section D.1 lists these 30 stocks). We used the closing prices adjusted daily for all applicable splits and dividend distributions downloaded from Yahoo! Finance (http://finance.yahoo.com/). The whole period considered in our numerical study is from the trading date of December 14, 1992 to June 6, 2008 (this period

    22 consists of 4100 trading days). We consider weekly returns: the time unit is 5 consecutive trading days. We take

    Ntot = 820, L = 15, Nestim = 15, 30, 45, 60.

    To estimate the covariance matrices, we use the last Nestim weekly returns of the constituents of the Dow Jones Industrial Average.3 The entire trading horizon corresponds to K = 48 holding periods, which span the dates from February 18, 1994 to June 6, 2008. In what follows, we compare the MVR strategy where the covariance matrices are estimated using the condition number regularization with those that use either the sample covariance matrix or linear shrinkage. We employ two linear shrinkage schemes: that of Warton (2008) of Section 5.2 and that of Ledoit and Wolf (2004). The latter is widely accepted as a well- conditioned estimator in the financial literature.

    Performance metrics

    We use the following quantities in assessing the performance of the MVR strategies. For precise formulae of these metrics, refer to Supplemental Section D.3.

    Realized return. The realized return of the portfolio over the trading period. • Realized risk. The realized risk (return standard deviation) of the portfolio over the • trading period. Realized Sharpe ratio. The realized excess return, with respect to the risk-free rate, • per unit risk of the portfolio. Turnover. Amount of new portfolio assets purchased or sold over the trading period. • Normalized wealth growth. Accumulated wealth yielded by the portfolio over the trad- • ing period when the initial budget is normalized to one, taking the transaction cost into account. Size of the short side. The proportion of the short side (negative) weights to the sum • of the absolute weights of the portfolio.

    We assume that the transaction costs are the same for the 30 stocks and set them to 30 basis points. The risk-free rate is set at 5% per annum.

    3Supplemental Section D.2 shows the periods determined by the choice of the parameters.

    23 Comparison results

    Figure 5 shows the normalized wealth growth over the trading horizon for four different

    values of Nestim. The sample covariance matrix failed in solving (21) for Nestim = 15 because of its singularity and hence is omitted in this figure. The MVR strategy using the condition number-regularized covariance matrix delivers higher growth as compared to using the sam- ple covariance matrix, linear shrinkage or index tracking in this performance metric. The higher growth is realized consistently across the 14 year trading period and is regardless of the estimation horizon. A trading strategy based on the condition number-regularized co- variance matrix consistently performs better than the S&P 500 index and can lead to profits as much as 175% more than its closest competitor. A useful result appears after further analysis. There is no significant difference between the condition number regularization approach and the two linear shrinkage schemes in terms of the realized return, risk, and Sharpe ratio. Supplemental Section D.4 summarizes these metrics for each estimator respectively averaged over the trading period. For all values of

    Nestim, the average differences of the metrics between the two regularization schemes are within two standard errors of those. Hence the condition number regularized estimator de- livers better normalized wealth growth than the other estimators but without compromising on other measures such as volatility. The turnover of the portfolio seems to be one of the major driving factors of the difference in wealth growth. In particular, the MVR strategy using the condition number-regularized covariance matrix gives far lower turnover and thus more stable weights than when using the linear shrinkage estimator or the sample covariance matrix. (See Supplemental Section D.5 for plots.) A lower turnover also implies less transaction costs, thereby also partially con- tributing to the higher wealth growth. Note that there is no explicit limit on turnover. The stability of the MVR portfolio using the condition number regularization appears to be related to its small size of the short side (reported in Supplemental Section D.6). Because stock borrowing is expensive, the condition number regularization based strategy can be advantageous in practice.

    Appendix

    Proof of Lemma 1. Recall the spectral decomposition of the sample covariance matrix S = QLQT , with L = diag(l ,...,l ) and l ... l 0. From the objective function in 1 p 1 ≥ ≥ p ≥ (8), suppose the variable Ω has the spectral decomposition RMRT , with R orthogonal and

    24 7 7 LW sample Warton LW 6 condreg 6 Warton S&P500 condreg S&P500 5 5

    4 4

    3 3

    2 2

    1 1 normalizedwealth growth normalizedwealth growth

    0 0

    −1 −1 1/1996 1/1998 1/2000 1/2002 1/2004 1/2006 1/2008 1/1996 1/1998 1/2000 1/2002 1/2004 1/2006 1/2008 time time

    (a) Nestim = 15 (b) Nestim = 30

    7 7 sample sample LW LW 6 Warton 6 Warton condreg condreg S&P500 S&P500 5 5

    4 4

    3 3

    2 2

    1 1 normalizedwealth growth normalizedwealth growth

    0 0

    −1 −1 1/1996 1/1998 1/2000 1/2002 1/2004 1/2006 1/2008 1/1996 1/1998 1/2000 1/2002 1/2004 1/2006 1/2008 time time

    (c) Nestim = 45 (d) Nestim = 60

    Figure 5: Normalized wealth growth results of the minimum variance rebalancing strategy for various estimation horizon sizes over the trading period from February 18, 1994 through June 6, 2008. sample=sample covariance matrix, LW=linear shrinkage (Ledoit and Wolf, 2004), Warton=linear shrinkage (Warton, 2008), condreg=condition number regularization. For comparison, the S&P 500 index for the same period (i.e., index tracking), with the initial price normalized to 1, is also plotted.

    25 M = diag(µ ,...,µ ), µ ... µ . Then the objective function in (8) can be written as 1 p 1 ≤ ≤ p

    tr(ΩS) logdet(Ω) = tr(RMRT QLQT ) log det(RMRT ) − − = l(R,M) tr(ML) log det M = l(Q, M), ≥ − with equality in the last line when R = Q (Farrell, 1985, Ch. 14). Hence (8) amounts to

    minimize p (l µ log µ ) i=1 i i − i (22) subject to u µ µ κ u, i =1,...,p, P≤ 1 ≤···≤ p ≤ max

    with the optimization variables µ1,...,µp and u. For the moment we shall ignore the order constraints among the eigenvalues. Then

    problem (22) becomes separable in µ1,...,µp. Call this related problem (22*). For a fixed u, the minimizer of each summand of the objective function in (22) without the order constraints is given as

    ⋆ µi (u)= argmin (liµi log µi) = min max u, 1/li ,κmaxu . (23) u≤µi≤κmaxu − { }  Note that (23) however satisfies the order constraints. In other words, µ⋆(u) ... µ⋆(u) 1 ≤ ≤ p for all u. Therefore (22*) is equivalent to (22). Plugging (23) in (22) removes the constraints and the objective function reduces to a univariate one:

    p , (i) Jκmax (u) Jκmax (u), (24) i=1 X where

    li(κmaxu) log(κmaxu), u< 1/(κmaxli) (i) ⋆ ⋆ − J (u)= liµ (u) log µ (u)= 1+log l , 1/(κ l ) u 1/l κmax i − i  i max i ≤ ≤ i  l u log u, u> 1/l . i − i   (i) The function (24) is convex, since each Jκmax is convex in u.

    (i) Proof of Theorem 1. The function Jκmax (u) is convex and is constant on the interval [1/(κmaxli), 1/li]. , p (i) Thus, the function Jκmax (u) i=1 Jκmax (u) has a region on which it is a constant if and P 26 only if 1/(κ l ), 1/l 1/(κ l ), 1/l = , max 1 1 max p p 6 ∅ h i h i or equivalently, κ > cond(S). Therefore,\ provided that κ cond(S), the convex max max ≤ function Jκmax (u) does not have a region on which it is constant. Since Jκmax (u) is strictly decreasing for 0 1/lp, it has a unique ⋆ ⋆ minimizer u . If κmax > cond(S), the maximizer u may not be unique because Jκmax (u) has a plateau. However, since the condition number constraint becomes inactive, Σcond = S for all the maximizers. b Now assume that κ cond(S). For α 1,...,p 1 and β 2,...,p , define the max ≤ ∈ { − } ∈ { } following two quantities.

    α + p β +1 uα,β = α − p and vα,β = κmaxuα,β. i=1 li + i=β κmaxli

    P P⋆ By construction, uα,β coincides with u if and only if

    1/l

    Consider a set of rectangles R in the uv-plane such that { α,β}

    R = (u,v):1/l

    Then condition (25) is equivalent to

    (u ,v ) R (26) α,β α,β ∈ α,β in the uv-plane. Since R partitions (u,v):1/l < u 1/l and 1/l v < 1/l { α,β} { 1 ≤ p 1 ≤ p} and (uα,β,vα,β) is on the line v = κmaxu, an obvious algorithm to find the pair (α, β) that

    satisfies the condition (26) is to keep track of the rectangles Rα,β that intersect this line. To understand that algorithm takes O(p) operations, start from the origin of the uv-plane, increase u and v along the line v = κmaxu. Since κmax > 1, if the line intersects Rα,β, then the next intersection occurs in one of the three rectangles: Rα+1,β, Rα,β+1, and Rα+1,β+1.

    Therefore after finding the first intersection (which is on the line u = 1/l1), the search requires at most 2p tests to satisfy condition (26). Finding the first intersection takes at most p tests.

    27 Proof of Proposition 1. Recall that, for κmax = ν0,

    ⋆ α + p β +1 u (ν0)= α − p i=1 li + ν0 i=β li and P P ⋆ ⋆ α + p β +1 v (ν0)= ν0u (ν0)= 1 α − p , ν0 i=1 li + i=β li where α = α(ν ) 1,...,p is the largest indexP such thatP 1/l < u⋆(ν ) and β = β(ν ) 0 ∈ { } α 0 0 ∈ 1,...,p is the smallest index such that 1/l >ν u⋆(ν ). Then { } β 0 0

    1/l

    ⋆ ⋆ 1. 1/lα

    We can find ν>ν0 such that

    1/l

    and 1/l v⋆(ν) < 1/l . β−1 ≤ β Therefore,

    ⋆ α + p β +1 α + p β +1 ⋆ u (ν)= α − p < α − p = u (ν0) i=1 li + ν i=β li i=1 li + ν0 i=β li

    and P P P P ⋆ α + p β +1 α + p β +1 ⋆ v (ν)= 1 α − p > 1 α − p = v (ν0). ν0 i=1 li + i=β li ν i=1 li + i=β li

    ⋆ P ⋆ P P P 2. u (ν0)=1/lα+1 and 1/lβ−1 u (ν0). Then we can find ν>ν0 such that α(ν)= α(ν0)+1= α +1

    28 and β(ν)= β(ν0)= β. Then,

    ⋆ α +1+ p β +1 u (ν)= α+1 − p . i=1 li + ν i=β li

    Therefore, P P

    α+1 p 1 1 i=1 li + ν i=β li ⋆ ⋆ =1/lα+1 u (ν0) − u (ν) − Pα +1+ p Pβ +1 − α+1 p (α + p β + 1)lα+1 ( li + ν li) = − − i=1 i=β > 0, α +1+ p β +1 −P P or α+1 p α+1 p li + ν li li + ν0 li l > i=1 i=β > i=1 i=β = l , α+1 α + p β +1 α + p β +1 α+1 P − P P − P which is a contradiction. Therefore, u⋆(ν) u⋆(ν ). ≤ 0 Now, we can find ν>ν0 such that α(ν) = α(ν0) = α and β(ν) = β(ν0) = β. This reduces to case 1.

    ⋆ ⋆ 3. 1/lα ν0 such that α(ν) = α(ν0) = α and β(ν)= β(ν ) 1= β 1. Then, 0 − −

    ⋆ α + p β +2 v (ν)= 1 α − p . ν i=1 li + i=β−1 li

    Therefore, P P

    α p 1 1 i=1 li + ν i=β−1 li ⋆ ⋆ =1/lβ−1 v (ν0) − v (ν) − α + p β +2 P −P α p (α + p β + 1)lβ−1 ( li + ν li) = − − i=1 i=β < 0, α + p β +2 − P P or α 1 p α+1 1 p li + li li + li l < i=1 ν i=β−1 < i=1 ν0 i=β = l , β−1 α + p β +1 α + p β +1 β−1 P −P P − P which is a contradiction. Therefore, v⋆(ν) v⋆(ν ). ≥ 0 Now, we can find ν>ν0 such that α(ν) = α(ν0) = α and β(ν) = β(ν0) = β. This reduces to case 1.

    29 ⋆ ⋆ ⋆ ⋆ 4. u (ν0)=1/lα+1 and v (ν0)=1/lβ−1. 1/lα+1 = u (ν0) = v (ν0)/ν0 = 1/(ν0lβ−1). This is a measure zero event and does not affect the conclusion.

    Supplemental materials Accompanying supplemental materials contain additional proofs of Theorems 2 and 3, and Proposition 2; additional figures illustrating Bayesian prior densi- ties; additional figure illustrating risk simulations; details of the empirical minimum variance rebalancing study.

    Acknowledgments We thank the editor and the associate editor for useful comments that improved the presentation of the paper. J. Won was partially supported by the US National Institutes of Health (NIH) (MERIT Award R37EB02784) and by the US National Science Foundation (NSF) grant CCR 0309701. J. Lim’s research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (grant number: 2010-0011448). B. Rajaratnam was supported in part by NSF under grant nos. DMS-09-06392, DMS-CMG 1025465, AGS-1003823, DMS-1106642 and grants NSA H98230-11-1-0194, DARPA-YFA N66001-11-1-4131, and SUWIEVP10-SUFSC10-SMSCVISG0906.

    References

    Banerjee, O., L. El Ghaoui, and A. D’Aspremont (2008). Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data. Journal of Machine Learning Research 9, 485–516.

    Boyd, S. and L. Vandenberghe (2004). Convex Optimization. Cambridge University Press.

    Chan, N., N. Karceski, and J. Lakonishok (1999). On portfolio optimization: Forecasting covariances and choosing the risk model. Review of Financial Studies 12 (5), 937–974.

    Daniels, M. and R. Kass (2001). Shrinkage estimators for covariance matrices. Biometrics 57, 1173–1184.

    Dempster, A. P. (1972). Covariance Selection. Biometrics 28 (1), 157–175.

    Dey, D. K. and C. Srinivasan (1985). Estimation of a covariance matrix under Stein’s loss. The Annals of Statistics 13 (4), 1581–1591.

    Farrell, R. H. (1985). Multivariate calculation. Springer-Verlag New York.

    30 Friedman, J., T. Hastie, and R. Tibshirani (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9 (3), 432–441.

    Haff, L. R. (1991). The variational form of certain Bayes estimators. The Annals of Statis- tics 19 (3), 1163–1190.

    Hero, A. and B. Rajaratnam (2011). Large-scale correlation screening. Journal of the American Statistical Association 106 (496), 1540–1552.

    Hero, A. and B. Rajaratnam (2012). Hub discovery in partial correlation graphs. Information Theory, IEEE Transactions on 58 (9), 6064–6078.

    James, W. and C. Stein (1961). Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Stanford, Califor- nia, United States, pp. 361–379.

    Khare, K. and B. Rajaratnam (2011). Wishart distributions for decomposable covariance graph models. The Annals of Statistics 39 (1), 514–555.

    Ledoit, O. and M. Wolf (2003, December). Improved estimation of the covariance ma- trix of stock returns with an application to portfolio selection. Journal of Empirical Finance 10 (5), 603–621.

    Ledoit, O. and M. Wolf (2004). A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis 88, 365–411.

    Ledoit, O. and M. Wolf (2012, July). Nonlinear shrinkage estimation of large-dimensional covariance matrices. The Annals of Statistics 40 (2), 1024–1060.

    Letac, G. and H. Massam (2007). Wishart distributions for decomposable graphs. The Annals of Statistics 35 (3), 1278–1323.

    Lin, S. and M. Perlman (1985). A Monte-Carlo comparison of four estimators of a covariance matrix. Multivariate Analysis 6, 411–429.

    Luenberger, D. G. (1998). Investment science. Oxford University Press New York.

    Markowitz, H. (1952). Portfolio selection. Journal of Finance 7 (1), 77–91.

Merton, R. (1980). On estimating expected returns on the market: An exploratory investigation. Journal of Financial Economics 8, 323–361.

Michaud, R. O. (1989). The Markowitz Optimization Enigma: Is "Optimized" Optimal? Financial Analysts Journal 45 (1), 31–42.

    Peng, J., P. Wang, N. Zhou, and J. Zhu (2009). Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association 104 (486), 735–746.

Pourahmadi, M., M. J. Daniels, and T. Park (2007, March). Simultaneous modelling of the Cholesky decomposition of several covariance matrices. Journal of Multivariate Analysis 98 (3), 568–587.

    Rajaratnam, B., H. Massam, and C. Carvalho (2008). Flexible covariance estimation in graphical Gaussian models. The Annals of Statistics 36 (6), 2818–2849.

    Sheena, Y. and A. Gupta (2003). Estimation of the multivariate normal covariance matrix under some restrictions. Statistics & Decisions 21, 327–342.

    Stein, C. (1956). Some problems in multivariate analysis Part I. Technical Report 6, Dept. of Statistics, Stanford University.

Stein, C. (1975). Estimation of a covariance matrix. Rietz Lecture, IMS-ASA Annual Meeting.

Stein, C. (1986). Lectures on the theory of estimation of many parameters (English translation). Journal of Mathematical Sciences 34 (1), 1373–1403.

    Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) 58 (1), 267–288.

Warton, D. I. (2008). Penalized Normal Likelihood and Ridge Regularization of Correlation and Covariance Matrices. Journal of the American Statistical Association 103 (481), 340–349.

    Won, J. H. and S.-J. Kim (2006). Maximum Likelihood Covariance Estimation with a Condition Number Constraint. In Proceedings of the Fortieth Asilomar Conference on Signals, Systems and Computers, pp. 1445–1449.

    Yang, R. and J. O. Berger (1994). Estimation of a covariance matrix using the reference prior. The Annals of Statistics 22 (3), 1195–1211.

A Additional proofs

Proof of Theorem 2. Suppose the spectral decomposition of the $k$-th fold covariance matrix estimate $\hat{\Sigma}_\nu^{[-k]}$, with $\kappa_{\max} = \nu$, is given by

$$\hat{\Sigma}_\nu^{[-k]} = Q^{[-k]}\,\mathrm{diag}\big(\hat{\lambda}_1^{[-k]}, \ldots, \hat{\lambda}_p^{[-k]}\big)\,\big(Q^{[-k]}\big)^T,$$

with

$$\hat{\lambda}_i^{[-k]} = \begin{cases} 1/v^{[-k]\star} & \text{if } l_i^{[-k]} < 1/v^{[-k]\star}, \\ l_i^{[-k]} & \text{if } 1/v^{[-k]\star} \le l_i^{[-k]} \le 1/u^{[-k]\star}, \\ 1/u^{[-k]\star} & \text{if } l_i^{[-k]} > 1/u^{[-k]\star}, \end{cases}$$

where $l_1^{[-k]} \ge \cdots \ge l_p^{[-k]}$ are the sample eigenvalues of the $k$-th fold and $u^{[-k]\star} = v^{[-k]\star}/\nu$.

The right-hand side of (27) converges in probability to the condition number $\kappa$ of the true covariance matrix as $n$ increases while $p$ is fixed. Hence,

$$\lim_{n\to\infty} P\big(\hat{\kappa}_{\max} \le \kappa\big) = 1.$$

We now show that

$$\lim_{n\to\infty} P\big(\hat{\kappa}_{\max} \ge \kappa\big) = 1$$

by showing that $\widehat{\mathrm{PR}}(\nu)$ is an asymptotically decreasing function of $\nu$. Recall that

$$\widehat{\mathrm{PR}}(\nu) = -\frac{1}{n}\sum_{k=1}^{K} l_k\big(\hat{\Sigma}_\nu^{[-k]}, X_k\big),$$

where

$$l_k\big(\hat{\Sigma}_\nu^{[-k]}, X_k\big) = -\frac{n_k}{2}\Big[\mathrm{tr}\Big(\big(\hat{\Sigma}_\nu^{[-k]}\big)^{-1} X_k X_k^T / n_k\Big) - \log\det\big(\hat{\Sigma}_\nu^{[-k]}\big)^{-1}\Big],$$

which, by the definition of $\hat{\Sigma}_\nu^{[-k]}$, is everywhere differentiable except at a finite number of points.

To see the asymptotic monotonicity of $\widehat{\mathrm{PR}}(\nu)$, consider the derivative $-\partial l_k\big(\hat{\Sigma}_\nu^{[-k]}, X_k\big)/\partial\nu$:

$$\begin{aligned}
-\frac{\partial l_k\big(\hat{\Sigma}_\nu^{[-k]}, X_k\big)}{\partial\nu}
&= \frac{n_k}{2}\left[\mathrm{tr}\left(\big(\hat{\Sigma}_\nu^{[-k]}\big)^{-1}\frac{\partial \hat{\Sigma}_\nu^{[-k]}}{\partial\nu}\right)
+ \mathrm{tr}\left(\frac{\partial \big(\hat{\Sigma}_\nu^{[-k]}\big)^{-1}}{\partial\nu}\, X_k X_k^T / n_k\right)\right] \\
&= \frac{n_k}{2}\left[\mathrm{tr}\left(\big(\hat{\Sigma}_\nu^{[-k]}\big)^{-1}\frac{\partial \hat{\Sigma}_\nu^{[-k]}}{\partial\nu}\right)
+ \mathrm{tr}\left(\hat{\Sigma}_\nu^{[-k]}\frac{\partial \big(\hat{\Sigma}_\nu^{[-k]}\big)^{-1}}{\partial\nu}\right)
+ \mathrm{tr}\left(\frac{\partial \big(\hat{\Sigma}_\nu^{[-k]}\big)^{-1}}{\partial\nu}\Big(X_k X_k^T / n_k - \hat{\Sigma}_\nu^{[-k]}\Big)\right)\right] \\
&= \frac{n_k}{2}\left[\frac{\partial}{\partial\nu}\,\mathrm{tr}\left(\big(\hat{\Sigma}_\nu^{[-k]}\big)^{-1}\hat{\Sigma}_\nu^{[-k]}\right)
+ \mathrm{tr}\left(\frac{\partial \big(\hat{\Sigma}_\nu^{[-k]}\big)^{-1}}{\partial\nu}\Big(X_k X_k^T / n_k - \hat{\Sigma}_\nu^{[-k]}\Big)\right)\right] \\
&= \frac{n_k}{2}\,\mathrm{tr}\left(\frac{\partial \big(\hat{\Sigma}_\nu^{[-k]}\big)^{-1}}{\partial\nu}\Big(X_k X_k^T / n_k - \hat{\Sigma}_\nu^{[-k]}\Big)\right).
\end{aligned}$$

As $n$ (hence $n_k$) increases, $\hat{\Sigma}_\nu^{[-k]}$ converges almost surely to the inverse of the solution to the following optimization problem:

$$\begin{array}{ll} \text{minimize} & \mathrm{tr}(\Omega\Sigma) - \log\det\Omega \\ \text{subject to} & \mathrm{cond}(\Omega) \le \nu, \end{array}$$

with $\Sigma$ and $\nu$ replacing $S$ and $\kappa_{\max}$ in (8). We denote the limit of $\hat{\Sigma}_\nu^{[-k]}$ by $\tilde{\Sigma}_\nu$. For the spectral decomposition of $\Sigma$,

$$\Sigma = R\,\mathrm{diag}\big(\lambda_1, \ldots, \lambda_p\big)\,R^T, \qquad (28)$$

$\tilde{\Sigma}_\nu$ is given as

$$\tilde{\Sigma}_\nu = R\,\mathrm{diag}\big(\psi_1(\nu), \ldots, \psi_p(\nu)\big)\,R^T, \qquad (29)$$

where, for some $\tau(\nu) > 0$,

$$\psi_i(\nu) = \begin{cases} \tau(\nu) & \text{if } \lambda_i \le \tau(\nu), \\ \lambda_i & \text{if } \tau(\nu) < \lambda_i \le \nu\tau(\nu), \\ \nu\tau(\nu) & \text{if } \nu\tau(\nu) < \lambda_i. \end{cases}$$

Recall from Proposition 1 that $\tau(\nu)$ is decreasing in $\nu$ and $\nu\tau(\nu)$ is increasing. Let $c_k$ be the limit of $n_k/(2n)$ when both $n$ and $n_k$ increase. Then $X_k X_k^T / n_k$ converges almost surely to $\Sigma$. Thus,

$$-\frac{1}{n}\frac{\partial l_k\big(\hat{\Sigma}_\nu^{[-k]}, X_k\big)}{\partial\nu} \to c_k\,\mathrm{tr}\left(\frac{\partial \tilde{\Sigma}_\nu^{-1}}{\partial\nu}\big(\Sigma - \tilde{\Sigma}_\nu\big)\right) \quad \text{almost surely.} \qquad (30)$$

If $\nu \ge \kappa$, then $\tilde{\Sigma}_\nu = \Sigma$, and the right-hand side of (30) degenerates to 0. Now consider the case that $\nu < \kappa$. From (29),

$$\frac{\partial \tilde{\Sigma}_\nu^{-1}}{\partial\nu} = R\,\frac{\partial \Psi^{-1}}{\partial\nu}\,R^T = R\,\mathrm{diag}\left(\frac{\partial \psi_1^{-1}}{\partial\nu}, \ldots, \frac{\partial \psi_p^{-1}}{\partial\nu}\right)R^T,$$

where

$$\frac{\partial \psi_i^{-1}}{\partial\nu} = \begin{cases} -\dfrac{1}{\tau(\nu)^2}\dfrac{\partial\tau(\nu)}{\partial\nu}\ (\ge 0) & \text{if } \lambda_i \le \tau(\nu), \\[4pt] 0 & \text{if } \tau(\nu) < \lambda_i \le \nu\tau(\nu), \\[4pt] -\dfrac{1}{\nu^2\tau(\nu)^2}\dfrac{\partial\big(\nu\tau(\nu)\big)}{\partial\nu}\ (\le 0) & \text{if } \nu\tau(\nu) < \lambda_i. \end{cases}$$

From (28) and (29),

$$\Sigma - \tilde{\Sigma}_\nu = R\,\mathrm{diag}\big(\lambda_1 - \psi_1, \ldots, \lambda_p - \psi_p\big)\,R^T,$$

where

$$\lambda_i - \psi_i = \begin{cases} \lambda_i - \tau(\nu)\ (\le 0) & \text{if } \lambda_i \le \tau(\nu), \\ 0 & \text{if } \tau(\nu) < \lambda_i \le \nu\tau(\nu), \\ \lambda_i - \nu\tau(\nu)\ (\ge 0) & \text{if } \nu\tau(\nu) < \lambda_i. \end{cases}$$

Therefore,

$$\mathrm{tr}\left(\frac{\partial \tilde{\Sigma}_\nu^{-1}}{\partial\nu}\big(\Sigma - \tilde{\Sigma}_\nu\big)\right) = \sum_i \frac{\partial \psi_i^{-1}}{\partial\nu}\cdot\big(\lambda_i - \psi_i\big) \le 0.$$

In other words, the right-hand side of (30) is less than 0 and the almost sure limit of $\widehat{\mathrm{PR}}(\nu)$ is decreasing in $\nu$. By definition, $\widehat{\mathrm{PR}}(\hat{\kappa}_{\max}) \le \widehat{\mathrm{PR}}(\kappa)$. From this and the asymptotic monotonicity of $\widehat{\mathrm{PR}}(\nu)$, we conclude that

$$\lim_{n\to\infty} P\big(\hat{\kappa}_{\max} \ge \kappa\big) = 1.$$
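Both ingredients of this proof, the truncated eigenvalue estimate and the predictive risk $\widehat{\mathrm{PR}}(\nu)$, are easy to compute in practice. The following Python/NumPy sketch is only illustrative: it assumes zero-mean data with $n > p$ (so that all sample eigenvalues are positive), the function names are ours, and a simple one-dimensional grid search over the truncation level $\tau$ stands in for the closed-form solution path of the paper.

    import numpy as np

    def condreg_eigenvalues(l, kappa_max, grid_size=2000):
        # Truncate sample eigenvalues l to [tau, kappa_max * tau]; tau is chosen
        # to minimize tr(Omega S) - log det Omega, which in the eigenbasis of S
        # equals sum_i (l_i / lambda_i + log lambda_i).
        l = np.asarray(l, dtype=float)
        taus = np.linspace(l.min() / kappa_max, l.max(), grid_size)
        objs = [np.sum(l / np.clip(l, t, kappa_max * t)
                       + np.log(np.clip(l, t, kappa_max * t))) for t in taus]
        tau = taus[int(np.argmin(objs))]
        return np.clip(l, tau, kappa_max * tau)

    def condreg(S, kappa_max):
        # Condition-number-regularized covariance estimate built from S.
        evals, evecs = np.linalg.eigh(S)
        lam = condreg_eigenvalues(evals, kappa_max)
        return (evecs * lam) @ evecs.T

    def predictive_risk(X, kappa_max, K=5, seed=0):
        # K-fold predictive risk PR(kappa_max): minus the average held-out
        # log-likelihood (up to additive constants) of the regularized estimate,
        # for zero-mean data X of shape (n, p).
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        folds = np.array_split(rng.permutation(n), K)
        pr = 0.0
        for test in folds:
            train = np.setdiff1d(np.arange(n), test)
            S_train = X[train].T @ X[train] / len(train)
            S_test = X[test].T @ X[test] / len(test)
            Omega = np.linalg.inv(condreg(S_train, kappa_max))
            _, logdet = np.linalg.slogdet(Omega)
            pr += (len(test) / 2.0) * (np.trace(Omega @ S_test) - logdet)
        return pr / n

Selecting the regularization level then amounts to evaluating predictive_risk over a grid of candidate values of $\nu$ and taking the minimizer, which is the $\hat{\kappa}_{\max}$ studied in this proof.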

Proof of Proposition 2. We are given that

$$\pi(\lambda_1, \ldots, \lambda_p) = \exp\left(-g_{\max}\frac{\lambda_1}{\lambda_p}\right), \qquad \lambda_1 \ge \cdots \ge \lambda_p > 0.$$

Now

$$\int_C \pi(\lambda_1, \ldots, \lambda_p)\, d\lambda = \int_C \exp\left(-g_{\max}\frac{\lambda_1}{\lambda_p}\right) d\lambda,$$

where $C = \{\lambda_1 \ge \cdots \ge \lambda_p > 0\}$. Let us now make the following change of variables: $x_i = \lambda_i - \lambda_{i+1}$ for $i = 1, 2, \ldots, p-1$, and $x_p = \lambda_p$. The inverse transformation yields $\lambda_i = \sum_{j=i}^p x_j$ for $i = 1, 2, \ldots, p$. It is straightforward to verify that the Jacobian of this transformation is given by $|J| = 1$.
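To make the change of variables concrete, consider the case $p = 2$:

$$x_1 = \lambda_1 - \lambda_2, \quad x_2 = \lambda_2 \qquad\Longleftrightarrow\qquad \lambda_1 = x_1 + x_2, \quad \lambda_2 = x_2, \qquad J = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}, \quad |\det J| = 1,$$

and the cone $C = \{\lambda_1 \ge \lambda_2 > 0\}$ maps onto the orthant $\{x_1 \ge 0,\ x_2 > 0\}$.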

Now we can therefore rewrite the integral above as

$$\begin{aligned}
\int_C \exp\left(-g_{\max}\frac{\lambda_1}{\lambda_p}\right) d\lambda
&= \int_{\mathbb{R}_+^p} \exp\left(-g_{\max}\frac{x_1 + \cdots + x_p}{x_p}\right) dx_1 \cdots dx_p \\
&= e^{-g_{\max}} \int \left[\prod_{i=1}^{p-1} \int \exp\left(-g_{\max}\frac{x_i}{x_p}\right) dx_i\right] dx_p \\
&= e^{-g_{\max}} \int_0^\infty \left(\frac{x_p}{g_{\max}}\right)^{p-1} dx_p \\
&= \frac{e^{-g_{\max}}}{g_{\max}^{p-1}} \int_0^\infty x_p^{p-1}\, dx_p = \infty.
\end{aligned}$$

To prove that the posterior yields a proper distribution we proceed as follows:

$$\begin{aligned}
\int_C \pi(\lambda) f(\lambda, l)\, d\lambda
&\propto \int_C \exp\left(-\frac{n}{2}\sum_{i=1}^p \frac{l_i}{\lambda_i}\right)\left(\prod_{i=1}^p \lambda_i\right)^{-\frac{n}{2}} \exp\left(-g_{\max}\frac{\lambda_1}{\lambda_p}\right) d\lambda \\
&\le \int_C \exp\left(-\frac{n}{2}\sum_{i=1}^p \frac{l_p}{\lambda_i}\right)\left(\prod_{i=1}^p \lambda_i\right)^{-\frac{n}{2}} \exp\left(-g_{\max}\frac{\lambda_1}{\lambda_p}\right) d\lambda
\qquad \text{as } l_p \le l_i,\ i = 1, \ldots, p, \\
&\le \int_C \exp\left(-\frac{n}{2}\sum_{i=1}^p \frac{l_p}{\lambda_i}\right)\left(\prod_{i=1}^p \lambda_i\right)^{-\frac{n}{2}} e^{-g_{\max}}\, d\lambda
\qquad \text{as } \frac{\lambda_1}{\lambda_p} \ge 1, \\
&\le e^{-g_{\max}} \prod_{i=1}^p \int_0^\infty \lambda_i^{-\frac{n}{2}} \exp\left(-\frac{n\, l_p}{2\lambda_i}\right) d\lambda_i.
\end{aligned}$$

The above integrand is proportional to the density of an inverse gamma distribution, and therefore the corresponding integral has a finite normalizing constant, thus yielding a proper posterior.
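For completeness, each one-dimensional integral in the last bound can be evaluated in closed form (substituting $u = 1/\lambda_i$), assuming $n > 2$:

$$\int_0^\infty \lambda^{-\frac{n}{2}} \exp\left(-\frac{n\, l_p}{2\lambda}\right) d\lambda = \Gamma\!\left(\frac{n}{2} - 1\right)\left(\frac{n\, l_p}{2}\right)^{-\left(\frac{n}{2} - 1\right)} < \infty.$$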

Proof of Theorem 3. (i) The conditional risk of $\tilde{\Sigma}(\kappa_{\max}, \omega)$, given the sample eigenvalues $l = (l_1, \ldots, l_p)$, is

$$E\Big[\mathcal{L}_{\mathrm{ent}}\big(\tilde{\Sigma}(\kappa_{\max}, \omega), \Sigma\big) \,\Big|\, l\Big] = \sum_{i=1}^p \Big\{\tilde{\lambda}_i\, E\big[a_{ii}(Q) \mid l\big] - \log\tilde{\lambda}_i\Big\} + \log\det\Sigma - p,$$

where $a_{ii}(Q) = \sum_{j=1}^p q_{ji}^2 \lambda_j^{-1}$ and $q_{ji}$ is the $(j,i)$-th element of the orthogonal matrix $Q$. This is because

$$\mathcal{L}_{\mathrm{ent}}\big(\tilde{\Sigma}(\kappa_{\max}, \omega), \Sigma\big) = \mathrm{tr}\big(\tilde{\Lambda} A(Q)\big) - \log\det\tilde{\Lambda} + \log\det\Sigma - p
= \sum_{i=1}^p \Big\{\tilde{\lambda}_i\, a_{ii}(Q) - \log\tilde{\lambda}_i\Big\} + \log\det\Sigma - p, \qquad (31)$$

where $A(Q) = Q^T \Sigma^{-1} Q$. In (31), the summand has the form

$$x\, E\big[a_{ii}(Q) \mid l\big] - \log x,$$

whose minimum is achieved at $x = 1/E\big[a_{ii}(Q) \mid l\big]$. Since $\sum_{j=1}^p q_{ji}^2 = 1$ and $\Sigma^{-1} \in \mathcal{D}\big(\kappa_{\max}, 1/(\kappa_{\max}\omega)\big)$ if and only if $\Sigma \in \mathcal{D}(\kappa_{\max}, \omega)$, we have $1/(\kappa_{\max}\omega) \le a_{ii}(Q) \le 1/\omega$. Hence $1/E\big[a_{ii}(Q) \mid l\big]$ lies between $\omega$ and $\kappa_{\max}\omega$ almost surely. Therefore,

1. If $l_i \le \omega$, then $\tilde{\lambda}_i = \omega$ and
$$\tilde{\lambda}_i\, E\big[a_{ii}(Q) \mid l\big] - \log\tilde{\lambda}_i \le l_i\, E\big[a_{ii}(Q) \mid l\big] - \log l_i.$$

2. If $\omega \le l_i < \kappa_{\max}\omega$, then $\tilde{\lambda}_i = l_i$ and
$$\tilde{\lambda}_i\, E\big[a_{ii}(Q) \mid l\big] - \log\tilde{\lambda}_i = l_i\, E\big[a_{ii}(Q) \mid l\big] - \log l_i.$$

3. If $l_i \ge \kappa_{\max}\omega$, then $\tilde{\lambda}_i = \kappa_{\max}\omega$ and
$$\tilde{\lambda}_i\, E\big[a_{ii}(Q) \mid l\big] - \log\tilde{\lambda}_i \le l_i\, E\big[a_{ii}(Q) \mid l\big] - \log l_i.$$

Thus,

$$\sum_{i=1}^p \Big\{\tilde{\lambda}_i\, E\big[a_{ii}(Q) \mid l\big] - \log\tilde{\lambda}_i\Big\} \le \sum_{i=1}^p \Big\{l_i\, E\big[a_{ii}(Q) \mid l\big] - \log l_i\Big\}$$

and the risk with respect to the entropy loss is

$$\begin{aligned}
\mathcal{R}_{\mathrm{ent}}\big(\tilde{\Sigma}(\kappa_{\max}, \omega)\big)
&= E\left[\sum_{i=1}^p \Big\{\tilde{\lambda}_i\, E\big[a_{ii}(Q) \mid l\big] - \log\tilde{\lambda}_i\Big\}\right] \\
&\le E\left[\sum_{i=1}^p \Big\{l_i\, E\big[a_{ii}(Q) \mid l\big] - \log l_i\Big\}\right]
= \mathcal{R}_{\mathrm{ent}}(S).
\end{aligned}$$

In other words, $\tilde{\Sigma}(\kappa_{\max}, \omega)$ has a smaller risk than $S$, provided $\lambda^{-1} \in \mathcal{D}(\kappa_{\max}, \omega)$.

(ii) Suppose the true covariance matrix $\Sigma$ has the spectral decomposition $\Sigma = R\Lambda R^T$ with $R$ orthogonal and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$. Let $A = R\Lambda^{1/2}$; then $S_0 \triangleq A^{-1} S A^{-T}$ has the same distribution as the sample covariance matrix observed from a $p$-variate Gaussian distribution with the identity covariance matrix. From the variational definition of the largest eigenvalue $l_1$ of $S$, we obtain

$$l_1 = \max_{v \ne 0} \frac{v^T S v}{v^T v} = \max_{w \ne 0} \frac{w^T S_0 w}{w^T \Lambda^{-1} w},$$

where $w = A^T v$. Furthermore, since for any $w \ne 0$,

$$\lambda_1^{-1} = \min_{w \ne 0} \frac{w^T \Lambda^{-1} w}{w^T w} \le \frac{w^T \Lambda^{-1} w}{w^T w},$$

we have

$$l_1 \le \lambda_1 \max_{w \ne 0} \frac{w^T S_0 w}{w^T w} = \lambda_1 e_1, \qquad (32)$$

where $e_1$ is the largest eigenvalue of $S_0$. Using essentially the same argument, we can show that

$$l_p \ge \lambda_p e_p, \qquad (33)$$

where $e_p$ is the smallest eigenvalue of $S_0$. Then, from the results by Geman (1980) and Silverstein (1985), we see that

$$\mathrm{pr}\Big\{e_1 \le \big(1 + \sqrt{\gamma}\big)^2,\ e_p \ge \big(1 - \sqrt{\gamma}\big)^2 \ \text{eventually}\Big\} = 1. \qquad (34)$$

The combination of (32)–(34) leads to

$$\mathrm{pr}\Big\{l_1 \le \lambda_1\big(1 + \sqrt{\gamma}\big)^2,\ l_p \ge \lambda_p\big(1 - \sqrt{\gamma}\big)^2 \ \text{eventually}\Big\} = 1.$$

On the other hand, if $\kappa_{\max} \ge \kappa\big(1 - \sqrt{\gamma}\big)^{-2}$, then

$$\Big\{l_1 \le \lambda_1\big(1 + \sqrt{\gamma}\big)^2,\ l_p \ge \lambda_p\big(1 - \sqrt{\gamma}\big)^2\Big\} \subset \Big\{\max\Big(\frac{l_1}{\lambda_p}, \frac{\lambda_1}{l_p}\Big) \le \kappa_{\max}\Big\}.$$

Also, if $\max\big(l_1/\lambda_p,\ \lambda_1/l_p\big) \le \kappa_{\max}$, then

$$\frac{\lambda_1}{\kappa_{\max}} \le l_p \le \frac{l_1}{\kappa_{\max}} \le \lambda_p.$$

From (12), $\tau^\star$ lies between $l_p$ and $l_1/\kappa_{\max}$. Therefore,

$$\tau^\star \le \lambda_p \quad\text{and}\quad \lambda_1 \le \kappa_{\max}\tau^\star.$$

In other words,

$$\Big\{\max\Big(\frac{l_1}{\lambda_p}, \frac{\lambda_1}{l_p}\Big) \le \kappa_{\max}\Big\} \subset \Big\{\Sigma \in \mathcal{D}\big(\kappa_{\max}, u^\star\big)\Big\},$$

which concludes the proof.
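To give a sense of scale for the condition in part (ii) (an illustrative calculation, not from the paper), take the aspect ratio $\gamma = p/n = 0.5$:

$$\big(1 - \sqrt{\gamma}\big)^{-2} = \big(1 - \sqrt{0.5}\big)^{-2} \approx (0.2929)^{-2} \approx 11.7,$$

so $\kappa_{\max} \ge 11.7\,\kappa$ suffices; for $\gamma = 0.1$ the factor drops to $\big(1 - \sqrt{0.1}\big)^{-2} \approx 2.1$.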

B Comparison of Bayesian prior densities


Comparison of various prior densities for eigenvalue shrinkage (p = 2). (a), (c), (e): three-dimensional views. (b), (d), (f): contour views. (a), (b): prior for regularization (19). (c), (d): inverse Wishart prior. (e), (f): reference prior due to Yang and Berger (1994). (Axes: sample eigenvalues $l_1$ and $l_2$.)

C Risk simulation results ($\rho = 0.5$)

Panels: singleton, 10% high, 20% high, 30% high, 40% high. Average risks (with error bars) over 1000 runs with respect to two loss functions ($\rho = 0.5$). sample = sample covariance matrix, Warton = linear shrinkage (Warton, 2008), CondReg = condition number regularization. Risks are normalized by the dimension $p$.

D Empirical minimum variance rebalancing study (Section 6.2)

D.1 List of the Dow Jones constituents

The Dow Jones stocks used in our numerical study and their market performance over the period from February 18, 1994 to June 6, 2008. The return, risk, and Sharpe ratio (SR) are annualized.

index  company                                 ticker  return [%]  risk [%]  SR
1      3M Company                              MMM     12.04       10.74     0.25
2      Alcoa, Inc.                             AA      16.50       15.47     0.30
3      American Express                        AXP     17.52       14.61     0.35
4      American International Group, Inc.      AIG     7.96        12.93     0.07
5      AT&T Inc.                               T       11.57       12.95     0.19
6      Bank of America                         BAC     11.57       13.14     0.19
7      The Boeing Company                      BA      13.03       13.81     0.23
8      Caterpillar Inc.                        CAT     18.53       14.26     0.39
9      Chevron Corporation                     CVX     15.86       10.53     0.42
10     Citigroup Inc.                          C       14.44       15.27     0.25
11     The Coca-Cola Company                   KO      10.74       10.77     0.20
12     E.I. du Pont de Nemours & Company       DD      9.58        12.43     0.13
13     Exxon Mobil Corporation                 XOM     16.58       10.46     0.45
14     General Electric Company                GE      13.47       12.04     0.28
15     General Motors Corporation              GM      -1.24       15.85     -0.20
16     The Hewlett-Packard Company             HPQ     20.22       18.24     0.35
17     The Home Depot                          HD      12.96       15.28     0.20
18     Intel Corporation                       INTC    20.84       19.13     0.35
19     International Business Machines Corp.   IBM     20.99       13.86     0.48
20     Johnson & Johnson                       JNJ     17.13       10.10     0.49
21     JPMorgan Chase & Co.                    JPM     15.84       15.44     0.29
22     McDonald’s Corporation                  MCD     14.05       12.05     0.30
23     Merck & Co., Inc.                       MRK     12.86       12.87     0.24
24     Microsoft Corporation                   MSFT    22.91       15.13     0.50
25     Pfizer Inc.                             PFE     15.34       12.92     0.32
26     The Procter & Gamble Company            PG      15.25       11.06     0.37
27     United Technologies Corporation         UTX     18.93       12.37     0.47
28     Verizon Communications Inc.             VZ      9.93        12.38     0.14
29     Wal-Mart Stores, Inc.                   WMT     14.86       13.16     0.30
30     The Walt Disney Company                 DIS     10.08       14.05     0.13

D.2 Trading periods

index  period                      index  period
1      12/14/1992 – 3/31/1993      2      4/1/1993 – 7/19/1993
3      7/20/1993 – 11/2/1993       4      11/3/1993 – 2/17/1994
5      2/18/1994 – 6/8/1994        6      6/9/1994 – 9/23/1994
7      9/26/1994 – 1/11/1995       8      1/12/1995 – 4/28/1995
9      5/1/1995 – 8/15/1995        10     8/16/1995 – 11/30/1995
11     12/1/1995 – 3/19/1996       12     3/20/1996 – 7/5/1996
13     7/8/1996 – 10/21/1996       14     10/22/1996 – 2/6/1997
15     2/7/1997 – 5/27/1997        16     5/28/1997 – 9/11/1997
17     9/12/1997 – 12/29/1997      18     12/30/1997 – 4/17/1998
19     4/20/1998 – 8/4/1998        20     8/5/1998 – 11/18/1998
21     11/19/1998 – 3/10/1999      22     3/11/1999 – 6/25/1999
23     6/28/1999 – 10/12/1999      24     10/13/1999 – 1/28/2000
25     1/31/2000 – 5/16/2000       26     5/17/2000 – 8/31/2000
27     9/1/2000 – 12/18/2000       28     12/19/2000 – 4/6/2001
29     4/9/2001 – 7/25/2001        30     7/26/2001 – 11/14/2001
31     11/15/2001 – 3/6/2002       32     3/7/2002 – 6/21/2002
33     6/24/2002 – 10/8/2002       34     10/9/2002 – 1/27/2003
35     1/28/2003 – 5/14/2003       36     5/15/2003 – 8/29/2003
37     9/2/2003 – 12/16/2003       38     12/17/2003 – 4/5/2004
39     4/6/2004 – 7/23/2004        40     7/26/2004 – 11/8/2004
41     11/9/2004 – 2/25/2005       42     2/28/2005 – 6/14/2005
43     6/15/2005 – 9/29/2005       44     9/30/2005 – 1/18/2006
45     1/19/2006 – 5/5/2006        46     5/8/2006 – 8/22/2006
47     8/23/2006 – 12/7/2006       48     12/8/2006 – 3/29/2007
49     3/30/2007 – 7/17/2007       50     7/18/2007 – 10/31/2007
51     11/1/2007 – 2/20/2008       52     2/21/2008 – 6/6/2008

D.3 Performance metrics

We use the following quantities in assessing the performance of the MVR strategies.

• Realized return. The realized return of a portfolio rebalancing strategy over the entire trading period is computed as
$$r_p = \frac{1}{K}\frac{1}{L}\sum_{j=1}^{K}\ \sum_{t=N_{\mathrm{estim}}+1+(j-1)L}^{N_{\mathrm{estim}}+jL} r^{(t)T} w^{(j)}.$$

• Realized risk. The realized risk (return standard deviation) of a portfolio rebalancing strategy over the entire trading period is computed as
$$\sigma_p = \sqrt{\frac{1}{K}\frac{1}{L}\sum_{j=1}^{K}\ \sum_{t=N_{\mathrm{estim}}+1+(j-1)L}^{N_{\mathrm{estim}}+jL} \big(r^{(t)T} w^{(j)}\big)^2 \;-\; r_p^2}.$$

• Realized Sharpe ratio (SR). The realized Sharpe ratio, i.e., the ratio of the excess expected return of a portfolio rebalancing strategy relative to the risk-free return $r_f$, is given by
$$\mathrm{SR} = \frac{r_p - r_f}{\sigma_p}.$$

• Turnover. The turnover incurred when the portfolio $w^{(j-1)}$ held over the previous period is rebalanced to the portfolio $w^{(j)}$ held over the $j$th holding period $[N_{\mathrm{estim}}+1+(j-1)L,\ N_{\mathrm{estim}}+jL]$ is computed as
$$\mathrm{TO}(j) = \sum_{i=1}^p \left| w_i^{(j)} - \left(\prod_{t=N_{\mathrm{estim}}+1+(j-1)L}^{N_{\mathrm{estim}}+jL} \big(1 + r_i^{(t)}\big)\right) w_i^{(j-1)} \right|.$$
For the first period, we take $w^{(0)} = 0$, i.e., the initial holdings of the assets are zero.

• Normalized wealth growth. Let $w^{(j)} = (w_1^{(j)}, \ldots, w_p^{(j)})$ be the portfolio constructed by a rebalancing strategy and held over the $j$th holding period $[N_{\mathrm{estim}}+1+(j-1)L,\ N_{\mathrm{estim}}+jL]$. When the initial budget is normalized to one, the normalized wealth grows according to the recursion
$$W(t) = \begin{cases} W(t-1)\Big(1 + \sum_{i=1}^p w_{it} r_i^{(t)}\Big), & t \notin \{N_{\mathrm{estim}} + jL \mid j = 1, \ldots, K\}, \\[4pt] W(t-1)\Big(1 + \sum_{i=1}^p w_{it} r_i^{(t)}\Big) - \mathrm{TC}(j), & t = N_{\mathrm{estim}} + jL, \end{cases}$$
for $t = N_{\mathrm{estim}}+1, \ldots, N_{\mathrm{estim}} + KL$, with the initial wealth $W(N_{\mathrm{estim}}) = 1$. Here
$$w_{it} = \begin{cases} w_i^{(1)}, & t = N_{\mathrm{estim}} + 1, \ldots, N_{\mathrm{estim}} + L, \\ \ \vdots & \\ w_i^{(K)}, & t = N_{\mathrm{estim}} + 1 + (K-1)L, \ldots, N_{\mathrm{estim}} + KL, \end{cases}$$
and
$$\mathrm{TC}(j) = \sum_{i=1}^p \eta_i \left| w_i^{(j)} - \left(\prod_{t=N_{\mathrm{estim}}+1+(j-1)L}^{N_{\mathrm{estim}}+jL} \big(1 + r_i^{(t)}\big)\right) w_i^{(j-1)} \right|$$
is the transaction cost due to the rebalancing if the cost to buy or sell one share of stock $i$ is $\eta_i$.

• Size of the short side. The size of the short side of a portfolio rebalancing strategy over the entire trading period is computed as
$$\frac{1}{K}\sum_{j=1}^{K} \left( \sum_{i=1}^p \big|\min\big(w_i^{(j)}, 0\big)\big| \Big/ \sum_{i=1}^p \big|w_i^{(j)}\big| \right).$$
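The following Python/NumPy sketch illustrates how these quantities can be computed from a matrix of per-period asset returns and the sequence of rebalanced portfolios. The function name, argument layout, and 0-based indexing are our own conventions rather than part of the paper; the drifted-weight turnover follows the definition above.

    import numpy as np

    def rebalancing_metrics(R, W, N_estim, L, r_f=0.0):
        # R: (T x p) per-period asset returns; W: (K x p) portfolios, one row per
        # holding period of length L starting right after the estimation window.
        K, p = W.shape
        port_ret, turnover = [], []
        w_prev = np.zeros(p)                      # w^(0) = 0: initial holdings are zero
        for j in range(K):
            days = slice(N_estim + j * L, N_estim + (j + 1) * L)
            port_ret.append(R[days] @ W[j])       # r^(t)T w^(j) for each day
            drift = np.prod(1.0 + R[days], axis=0)   # per-asset growth over the period
            turnover.append(np.sum(np.abs(W[j] - drift * w_prev)))
            w_prev = W[j]
        port_ret = np.concatenate(port_ret)
        r_p = port_ret.mean()                                    # realized return
        sigma_p = np.sqrt(np.mean(port_ret ** 2) - r_p ** 2)     # realized risk
        short = np.mean(np.abs(np.minimum(W, 0.0)).sum(axis=1)
                        / np.abs(W).sum(axis=1))                 # size of the short side
        return dict(ret=r_p, risk=sigma_p, SR=(r_p - r_f) / sigma_p,
                    avg_turnover=np.mean(turnover), short_side=short)

Annualizing the per-period figures (as in the tables below) is a separate, conventional scaling step and is not shown here.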

D.4 Minimum variance portfolio-theoretic performance metrics

Realized return, realized risk, and the realized Sharpe ratio based on different regularization schemes for covariance matrices are reported. sample = sample covariance matrix, LW = linear shrinkage (Ledoit and Wolf, 2004), Warton = linear shrinkage (Warton, 2008), condreg = condition number regularization. Each entry is the mean (standard deviation) of the corresponding metric over the trading period (48 holding periods) from February 1994 through June 2008. The standard deviations are computed in a heteroskedasticity-and-autocorrelation consistent manner, as discussed by Ledoit and Wolf (2008, Sec. 3.1). For comparison, the performance metrics of the S&P 500 index for the same period are reported at the end of the table. All values are annualized.

covariance regularization scheme   return [%]         risk [%]          SR

Nestim = 15
  sample                           –                  –                 –
  LW                               12.12 (3.69)       14.38 (0.69)      0.50 (0.26)
  Warton                           12.21 (3.71)       14.28 (0.68)      0.50 (0.26)
  condreg                          14.49 (4.01)       16.07 (0.94)      0.59 (0.25)
  condreg − LW                     2.37 (5.45)        1.69 (1.17)       0.10 (0.36)
  condreg − Warton                 2.28 (5.47)        1.79 (1.16)       0.09 (0.37)

Nestim = 30
  sample                           145.91 (114.74)    851.07 (257.28)   0.17 (0.14)
  LW                               11.82 (3.72)       14.27 (0.65)      0.48 (0.26)
  Warton                           12.1 (3.68)        14.06 (0.69)      0.51 (0.27)
  condreg                          14.32 (4.00)       15.92 (0.91)      0.59 (0.26)
  condreg − LW                     2.51 (5.46)        1.65 (1.12)       0.11 (0.37)
  condreg − Warton                 2.16 (5.43)        1.85 (1.14)       0.08 (0.37)

Nestim = 45
  sample                           9.52 (5.62)        21.15 (0.74)      0.21 (0.26)
  LW                               12.21 (3.65)       14.04 (0.72)      0.51 (0.26)
  Warton                           12.62 (3.60)       13.85 (0.76)      0.55 (0.26)
  condreg                          14.53 (3.93)       15.63 (0.87)      0.61 (0.26)
  condreg − LW                     2.33 (5.36)        1.60 (1.13)       0.10 (0.37)
  condreg − Warton                 1.91 (5.33)        1.79 (1.16)       0.06 (0.37)

Nestim = 60
  sample                           9.32 (4.29)        17.52 (0.70)      0.25 (0.25)
  LW                               11.55 (3.55)       13.94 (0.74)      0.47 (0.26)
  Warton                           12.15 (3.55)       13.78 (0.77)      0.52 (0.26)
  condreg                          14.13 (3.91)       15.65 (0.90)      0.58 (0.25)
  condreg − LW                     2.58 (5.28)        1.71 (1.17)       0.11 (0.36)
  condreg − Warton                 1.98 (5.28)        1.87 (1.18)       0.06 (0.37)

S&P 500                            8.71 (3.84)        16.04 (0.94)      0.23 (0.24)

D.5 Turnover

Turnover for various estimation horizon sizes over the trading period from February 18, 1994 through June 6, 2008. sample = sample covariance matrix, LW = linear shrinkage (Ledoit and Wolf, 2004), Warton = linear shrinkage (Warton, 2008), condreg = condition number regularization.

    S-12 2000 LW Warton 1800 condreg 200 1600

    1400

    1200

    1000

    100 800 sample turnover[%] turnover[%] LW Warton 600 condreg

    400

    200

    0 0 5 15 25 35 45 5 15 25 35 45 period period

    (a) Nestim =15 (b) Nestim = 30 1000 700

    900 sample sample LW 600 LW Warton Warton 800 condreg condreg

    700 500

    600 400 500 300 400 turnover[%] turnover[%]

    300 200

    200 100 100

    0 0 5 15 25 35 45 5 15 25 35 45 period period

    (c) Nestim =45 (d) Nestim = 60

D.6 Size of the short side

The size of the short side of the portfolios based on different regularization schemes for covariance matrices is reported. sample = sample covariance matrix, LW = linear shrinkage (Ledoit and Wolf, 2004), Warton = linear shrinkage (Warton, 2008), condreg = condition number regularization. Each entry is the mean (standard deviation) of the short-side size over the trading period (48 holding periods) from February 1994 through June 2008, given in percent.

covariance regularization scheme   Nestim = 15     Nestim = 30     Nestim = 45     Nestim = 60
sample                             –               47.47 (1.86)    36.84 (3.21)    32.83 (4.47)
LW                                 13.05 (7.68)    15.87 (7.63)    17.29 (7.35)    17.79 (7.36)
Warton                             10.68 (7.62)    11.01 (6.91)    11.14 (7.09)    11.53 (0.73)
condreg                            0.08 (0.54)     0.27 (0.94)     0.61 (1.88)     0.96 (2.81)

    References

Geman, S. (1980). A limit theorem for the norm of random matrices. The Annals of Probability 8 (2), 252–261.

Ledoit, O. and M. Wolf (2004). A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis 88, 365–411.

Ledoit, O. and M. Wolf (2008, December). Robust performance hypothesis testing with the Sharpe ratio. Journal of Empirical Finance 15 (5), 850–859.

Silverstein, J. W. (1985). The smallest eigenvalue of a large dimensional Wishart matrix. The Annals of Probability 13 (4), 1364–1368.

    Warton, D. I. (2008). Penalized Normal Likelihood and Ridge Regularization of Correlation and Covariance Matrices. Journal of the American Statistical Association 103 (481), 340–349.

    Yang, R. and J. O. Berger (1994). Estimation of a covariance matrix using the reference prior. The Annals of Statistics 22 (3), 1195–1211.
